AttributeError: module 'torchvision.edgeailite.xnn.model_surgery' has no attribute 'get_replacements_dict' #7
Comments
I commented out line 172 in tools/train.py.
It's working fine now! Training has started successfully. I encountered some bugs in model_surgery.py. Can you please help me out?
You need to pull the repository edgeai-torchvision, as it has been updated. Once you pull it, the error will go away.
Thanks! It worked. Another quick question: is it possible to do QAT with pretrained weights? I trained CenterNet with a customized model and dataset, and I'm trying to do QAT on the pretrained saved weights.
Yes. See the example here to understand how it uses the config parameter load_from to load the pretrained weights: https://github.com/TexasInstruments/edgeai-mmdetection/blob/master/configs/edgeailite/ssd/ssd_regnet_fpn_bgr_lite.py#L46
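For reference, the load_from mechanism in an mmdetection-style config looks roughly like the sketch below. This is not the actual contents of the linked file; the checkpoint path and the `quantize` flag name are placeholders for illustration:

```python
# Hypothetical mmdetection-style config fragment (a sketch, not the real
# ssd_regnet_fpn_bgr_lite.py). `load_from` points the trainer at a
# floating-point checkpoint so QAT starts from trained weights.
quantize = True  # assumed flag enabling quantization-aware training

# Placeholder path; replace with your own trained checkpoint.
load_from = './work_dirs/my_model/latest.pth' if quantize else None
```

With `load_from` set, the trainer initializes the model from the floating-point checkpoint before wrapping it for QAT, rather than training from scratch.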
By pre-trained weights, do you mean the original PyTorch weights (without QAT), or weights from some epochs of QAT? As I understand it, the model layers change after QAT, so I would like to know whether we can use the original model (before QAT) as pretrained weights for doing QAT.
I would also like to know whether we need to do QAT using all the training images, or whether a small percentage of them is enough.
It is possible to load the original floating-point weights when starting QAT. A few epochs are sufficient for QAT, perhaps 10. If the dataset is large (like ImageNet), a small portion of it may be sufficient for QAT. See the following link and the example code snippet there:
For better accuracy, we have seen that it is better to freeze the BN layers and also the quantization ranges around half way through the epochs. An example is here:
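The freezing schedule described above can be sketched as a small helper. This is illustrative only: the function name and the exact freeze point are assumptions, not part of the edgeai repositories' API.

```python
def should_freeze(epoch, total_epochs, freeze_fraction=0.5):
    """Return True once training has passed the freeze point.

    Per the advice above, BN statistics and quantization ranges are
    frozen around half way through the total number of QAT epochs.
    """
    return epoch >= int(total_epochs * freeze_fraction)

# Example: with 10 QAT epochs, freezing kicks in at epoch 5.
schedule = [should_freeze(e, 10) for e in range(10)]
# In a real training loop, you would freeze the BN layers and the quant
# ranges (via whatever hooks the framework provides) the first time
# should_freeze(...) returns True.
```

The design intent is that early epochs let BN statistics and quantization ranges adapt, while later epochs fine-tune the weights against fixed ranges for a more stable quantized model.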
How much GPU memory do we need for this training? I tried with a single 16 GB GPU instance and it throws CUDA out of memory. I'm moving the model, a dummy input, and xnn.quantize.QuantTrainModule to CUDA memory. Do you have any solution for this?
GPU memory depends on the batch size used. Reduce the batch size if you get CUDA out of memory.
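A common way to apply this advice is to halve the batch size until one training step fits in memory. Below is a minimal pure-Python sketch of that loop; the `train_step` stub and the use of `MemoryError` are stand-ins, since catching the real `torch.cuda.OutOfMemoryError` requires a GPU:

```python
def fit_batch_size(train_step, start_batch_size=16, min_batch_size=1):
    """Halve the batch size until one training step succeeds."""
    bs = start_batch_size
    while bs >= min_batch_size:
        try:
            train_step(bs)
            return bs          # this batch size fits in memory
        except MemoryError:    # stand-in for torch.cuda.OutOfMemoryError
            bs //= 2
    raise RuntimeError('even the minimum batch size does not fit')

# Stub that "fits" only at batch size 4 or below, to illustrate the loop.
def fake_step(bs):
    if bs > 4:
        raise MemoryError

fit_batch_size(fake_step)  # returns 4
```

If even batch size 1 runs out of memory (as reported below), the batch size is not the bottleneck and something else, such as the input resolution or the model wrapping, needs to be investigated.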
I reduced the batch size to 16, 8, 4, 2, and finally 1, and I face the same memory issue even with batch size 1. If I comment out xnn.quantize.QuantTrainModule in the code, training starts without the quantization module.
Can you share the exact error? Which model are you using? What is the input image size being used? You can also try to do that line.
I'm using the CenterNet model and pretrained weights.
ERROR: Loaded train 929 samples
Can you try reducing the input image size? Let's see if this is really related to the memory usage.
The issue is insufficient GPU memory. Changing the input image size doesn't help; it still throws the same runtime error. I removed the section of code that loads the pretrained weights, and training started from scratch with xnn.quantize.QuantTrainModule moved to CUDA memory:

```
creating index...
/home/ubuntu/anaconda3/envs/edge-ai/lib/python3.7/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
ctdet/train_test |#       | train: [1][0/29]|Tot: 0:00:11 |ETA: 0:00:00 |loss 15.7850 |hm_loss 14.5027 |wh_loss 8.0129 |off_loss 0.4810 |Data
ctdet/train_test |##      | train: [1][1/29]|Tot: 0:00:12 |ETA: 0:05:15 |loss 15.6908 |hm_loss 14.3788 |wh_loss 8.2404 |off_loss 0.4879 |Data
ctdet/train_test |###     | train: [1][2/29]|Tot: 0:00:14 |ETA: 0:02:57 |loss 15.6135 |hm_loss 14.3530 |wh_loss 7.8958 |off_loss 0.4709 |Data
ctdet/train_test |####    | train: [1][3/29]|Tot: 0:01:25 |ETA: 0:02:10 |loss 15.0860 |hm_loss 13.8389 |wh_loss 7.7483 |off_loss 0.4723 |Data
ctdet/train_test |#####   | train: [1][4/29]|Tot: 0:02:20 |ETA: 0:08:59 |loss 14.8916 |hm_loss 13.6098 |wh_loss 8.0658 |off_loss 0.4752 |Data
ctdet/train_test |######  | train: [1][5/29]|Tot: 0:03:21 |ETA: 0:11:15 |loss 14.7081 |hm_loss 13.4047 |wh_loss 8.2868 |off_loss 0.4747 |Data
ctdet/train_test |####### | train: [1][6/29]|Tot: 0:04:20 |ETA: 0:12:55 |loss 14.6586 |hm_loss 13.3553 |wh_loss 8.3268 |off_loss 0.4706 |Data 0.353s(1.196s) |Net 37.158s
```
Is that the change that reduced the memory requirement significantly? Surprising!
I'm getting the following error when I try to run ./run_detection_train.sh:
```
work_dir = './work_dirs/yolov3_regnet_bgr_lite'
gpu_ids = range(0, 1)
2022-01-03 09:13:58,990 - mmdet - INFO - Set random seed to 886029822, deterministic: False
2022-01-03 09:13:59,511 - mmdet - INFO - initialize RegNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'open-mmlab://regnetx_1.6gf'}
2022-01-03 09:13:59,512 - mmcv - INFO - load model from: open-mmlab://regnetx_1.6gf
2022-01-03 09:13:59,512 - mmcv - INFO - load checkpoint from openmmlab path: open-mmlab://regnetx_1.6gf
2022-01-03 09:13:59,562 - mmcv - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
Traceback (most recent call last):
  File "./scripts/train_detection_main.py", line 65, in
    train_mmdet.main(args)
  File "/home/ubuntu/edgeai-mmdetection/tools/train.py", line 172, in main
    model = convert_to_lite_model(model, cfg)
  File "/home/ubuntu/edgeai-mmdetection/mmdet/utils/model_surgery.py", line 38, in convert_to_lite_model
    replacements_dict = copy.deepcopy(xnn.model_surgery.get_replacements_dict())
AttributeError: module 'torchvision.edgeailite.xnn.model_surgery' has no attribute 'get_replacements_dict'
Done.
```