[SSD/Tensorflow] Super slow training on COCO when switching from resnet50 to mobilenet_v2 backbone

Hi all, 

I was able to reproduce the COCO results on 4 V100 GPUs using [ssd320_full_4gpus.config](https://github.com/NVIDIA/DeepLearningExamples/blob/65211bd9621c781bc29d3a4e1c3c287983645d50/TensorFlow/Detection/SSD/configs/ssd320_full_4gpus.config), reaching mAP=0.28. Training is also quite fast, as stated. Please find a training screenshot below.
<img width="608" alt="ssd_resnet50_coco nvidia training" src="https://user-images.githubusercontent.com/9111479/83706244-d644d200-a5cb-11ea-9d52-73252e036977.png">

However, when I switched to [ssdlite_mobilenet_v2_coco.config](https://github.com/NVIDIA/DeepLearningExamples/blob/65211bd9621c781bc29d3a4e1c3c287983645d50/TensorFlow/Detection/SSD/models/research/object_detection/samples/configs/ssdlite_mobilenet_v2_coco.config), training became extremely slow (see the screenshot below), even slower than training on a single GPU.
<img width="613" alt="mobilenet_v2_coco nvidia training" src="https://user-images.githubusercontent.com/9111479/83706294-f6749100-a5cb-11ea-8fa8-bb9e9b1dc9a6.png">

As the screenshots show, 8 threads were spawned when training with the default config, but only 4 threads were spawned when training with the new config.
Stranger still, with the new config the whole computer/instance freezes, whereas with the default config I can still write code or do other tasks without any problem.
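
To make the thread counts and system load comparable as text rather than screenshots, something like the following could be run against the training process IDs under each config. This is only a minimal sketch: it assumes a Linux host (it reads `/proc/<pid>/status`), and the helper names `parse_thread_count`, `thread_count`, and `snapshot` are hypothetical, not part of the repository.

```python
import os
from pathlib import Path


def parse_thread_count(status_text):
    """Return the integer after 'Threads:' in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def thread_count(pid):
    """Read the kernel-reported thread count for one process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


def snapshot(pids):
    """Collect per-process thread counts and the 1-minute load average."""
    return {
        "threads": {pid: thread_count(pid) for pid in pids},
        "load_1min": os.getloadavg()[0],
    }
```

Calling `snapshot([...])` with the PIDs of the training workers, once per config, would give two dictionaries that can be pasted side by side.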

I am wondering why this is happening. I checked the code, and nothing in it seems to be tied specifically to the retinanet-related code.

**Environment**
I'm using the provided Dockerfile: 
* nvcr.io/nvidia/tensorflow:19.05-py3
* GPUs in the system: 4x Tesla V100-SXM2-16GB
* CUDA driver version: 440.82

Thank you very much!

Regards,
thoang3


[SSD/Tensorflow] Super slow training on COCO when switching from resnet50 to mobilenet_v2 backbone #547
