
Pretrained model #12

Closed
fiv21 opened this issue Sep 17, 2020 · 9 comments

Comments

@fiv21

fiv21 commented Sep 17, 2020

Hi!
I would like to ask whether you can share the pre-trained model for Jetson along with the calibration file, i.e. a .zip containing the model you trained, so that I can prune it on my side.
The main problem I have is that I cannot reach the same mAP values, even when I retrain.
Thanks in advance!
All the best,

Franco.

@ak-nv
Contributor

ak-nv commented Sep 17, 2020

We cannot provide a pre-trained model for face-mask-detection.
What mAP are you achieving currently? What errors are you seeing? Depending on your batch size and hyper-parameters, your mAP may drop.

@fiv21
Author

fiv21 commented Sep 17, 2020 via email

@ak-nv
Contributor

ak-nv commented Sep 17, 2020

> By the way, the error with a bigger batch size was a memory allocation error.

With a batch size of 24 you will need about 15 GB of GPU memory, whereas once you prune the model (as in the repo, ~12% prune ratio) you can get by with 8 GB.

I have personally tried AWS with a Tesla V100 for this task and it worked out pretty well. I have not tried any other cloud instances yet, so I cannot comment on them.

@fiv21
Author

fiv21 commented Sep 17, 2020

Okay, but I didn't get one point. When you talk about pruning the model, I think you're talking about inference, but my problem is during training; or maybe I misunderstood you, sorry about that. Again, if I can train with my own GPU and then move the model to my AGX Xavier, that would be great. Just to give some context: I'm trying to learn how to build the workflow pipeline with my personal computer and the Xavier before moving to the cloud, following the idea of this repo. If I can train the model on my GTX 1070 and move it to the Jetson for inference, then mission accomplished for now. After that I can think about a way to deploy via Docker or something like that.
On the other hand, if my GPU memory is too low for this experiment, I think that should be added as a requirement in the repo to prevent future problems for someone else. As a possible workaround, do you have any (theoretical) idea whether this process could be done on Google Colab? After all, there are GPU-enabled runtimes with Tesla T4s that could be very useful for my purpose.

@ak-nv
Contributor

ak-nv commented Sep 17, 2020

I feel it should be doable even with a GTX 1070, but you need to experiment with the other hyper-parameters as well when you reduce the batch size; in detectnet_v2_train_resnet18_kitti.txt, look at model_config. I previously had better luck even with batch size 16. As the batch size decreases, your training time will also increase.
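For reference, a minimal sketch of what the relevant spec section typically looks like in TLT 2.x DetectNet_v2 spec files (field names as in the TLT documentation; the values shown are illustrative assumptions, not this repo's defaults):

```
training_config {
  batch_size_per_gpu: 16      # lower this (e.g. to 8) if you hit allocation errors
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.1
      annealing: 0.7
    }
  }
}
```

When lowering batch_size_per_gpu, it is common to lower the learning rate roughly in proportion.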

> When you say about prune the model I think you're talking about infer.

In TLT, we do training first, and once we achieve satisfactory accuracy, we prune the model and retrain the pruned model. See steps 3 to 8 of the TLT workflow.
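As a rough sketch, that train-prune-retrain loop looks like the following (command shapes as in TLT 2.x; directories, spec-file names, and $KEY are placeholders, and the actual notebook cells should be preferred):

```
# 3-4. train the unpruned model, then evaluate it
tlt-train detectnet_v2 -e detectnet_v2_train_resnet18_kitti.txt \
          -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned -k $KEY
tlt-evaluate detectnet_v2 -e detectnet_v2_train_resnet18_kitti.txt \
          -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt -k $KEY

# 5-6. prune once accuracy is satisfactory
tlt-prune -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt \
          -o $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet18_pruned.tlt \
          -eq union -pth 0.12 -k $KEY

# 7-8. retrain the pruned model with the retrain spec, then evaluate again
tlt-train detectnet_v2 -e detectnet_v2_retrain_resnet18_kitti.txt \
          -r $USER_EXPERIMENT_DIR/experiment_dir_retrain -k $KEY
```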

@fiv21
Author

fiv21 commented Sep 17, 2020 via email

@fiv21
Author

fiv21 commented Sep 22, 2020

Okay, working with the GTX 1070 I couldn't get a good result any way I tried, and I don't know exactly why. My suspicion points to my dataset and probably the batch size, but I had no luck finding a scientific answer.
So, to stop wasting time on my computer, I used a Tesla V100 on a VM, and here comes the interesting part: I rebuilt the dataset as you suggested in the repo, and now the unpruned model hits:

Validation cost: 0.001015
Mean average_precision (in %): 70.6842

class name      average precision (in %)
------------  --------------------------
mask                             81.1487
no-mask                          60.2197

Median Inference Time: 0.005730

After this I thought: okay, let's try retraining the pruned model with the default settings from the notebook. I got this:

Validation cost: 0.000977
Mean average_precision (in %): 65.6631

class name      average precision (in %)
------------  --------------------------
mask                             60.5674
no-mask                          70.7589

Median Inference Time: 0.003357

I think I had misunderstood something, so I decided to make some changes.

First, I increased the number of images per class to 12,000, and the results are quite a bit better:

Validation cost: 0.000234
Mean average_precision (in %): 80.8005

class name      average precision (in %)
------------  --------------------------
mask                             84.494
no-mask                          77.1069

Median Inference Time: 0.005580

Then I retrained with a tweak in the prune step; I changed this cell:

!tlt-prune -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned_12k/weights/resnet18_detector.tlt \
           -o $USER_EXPERIMENT_DIR/experiment_dir_pruned_12k_test2/resnet18_nopool_bn_detectnet_v2_pruned.tlt \
           -eq union \
           -pth 0.12 \
           -k $KEY

I changed the -pth from 0.8 to 0.12 and got this result:

=========================

Validation cost: 0.000245
Mean average_precision (in %): 80.6542

class name      average precision (in %)
------------  --------------------------
mask                             84.6146
no-mask                          76.6938

Median Inference Time: 0.004232
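(A hedged aside on what -pth does: as I understand TLT's pruning, channels whose normalized weight magnitude falls below the threshold are removed, so a lower value prunes less aggressively and keeps more of the network's capacity:)

```
# assumption: higher -pth => more channels fall below the threshold => heavier pruning
tlt-prune ... -pth 0.12   # gentle prune: larger model, accuracy closer to the unpruned one
tlt-prune ... -pth 0.8    # aggressive prune: smaller, faster model, mAP more likely to drop
```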

At this point I don't know exactly what to change, or where to focus my effort to get closer to your results. If you can help me with this, I'll really appreciate it.
If needed, I can share my TFRecords to reproduce the training.
Thanks for your time!

NOTE: You should add a warning in the repo about the path changes needed in the config files, for those who, like me, don't know much about TLT and related tooling.

EDIT: I forgot to clarify the configuration. I used the default config files provided in this repo, changing only the paths where the models and data are located.

@ak-nv
Contributor

ak-nv commented Sep 27, 2020

@fiv21 Thanks for working through this and for the recommendations. I am trying to add more detailed steps and examples with a sample video in the upcoming weeks, if that helps.

@fiv21
Author

fiv21 commented Sep 27, 2020

That's great!
That would help me understand this a little better. It's interesting because it could serve as a base example for other cases, perhaps checking whether a person is wearing a safety helmet or similar.
I'm closing the issue while waiting for new info in the repo on how to improve the results, or the right way to achieve the same accuracy.
Thanks in advance!

@fiv21 fiv21 closed this as completed Sep 27, 2020