
[EfficientDet/PyTorch] TypeError: new(): invalid data type 'str' when training EfficientDet on Waymo dataset #1108

@ChongyuNVIDIA

Related to EfficientDet/PyTorch

Describe the bug
When I try to reproduce the EfficientDet training result on the Waymo dataset, as described in:
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Detection/Efficientdet
I hit "TypeError: new(): invalid data type 'str'" after the Waymo dataset has loaded and training starts.

To Reproduce
Steps to reproduce the behavior:

  1. Clone 'https://github.com/NVIDIA/DeepLearningExamples' and cd into DeepLearningExamples/PyTorch/Detection/Efficientdet
  2. Run 'waymo_tool/waymo_data_converter.py' to download the Waymo data and convert it into COCO format
  3. Set the dataset paths following 'scripts/waymo/train_waymo_AMP_8xA100-80G.sh'
  4. Launch './distributed_train.sh 8 /datasets/Waymo_JoC --model efficientdet_d0 -b 8 --amp --lr 0.2 --sync-bn --opt fusedmomentum --warmup-epochs 1 --output Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N --worker 8 --fill-color mean --model-ema --model-ema-decay 0.999 --eval-after 24 --epochs 24 --save-checkpoint-interval 1 --smoothing 0.0 --waymo --remove-weights class_net box_net anchor --input_size 1536 --num_classes 3 --resume --freeze-layers backbone --waymo-train /datasets/Waymo_JoC/waymo_coco_format_train/images --waymo-val /datasets/Waymo_JoC/waymo_coco_format_val/images --waymo-val-annotation /datasets/Waymo_JoC/waymo_coco_format_val/annotations/annotations.json --waymo-train-annotation /datasets/Waymo_JoC/waymo_coco_format_train/annotations/annotations.json'

Expected behavior
EfficientDet training on the Waymo dataset should run without errors.

Environment

  • Container version: pytorch:21.06-py3
  • GPUs in the system: 8x Tesla A100-80GB
  • CUDA version: 11.4
  • CUDA driver version: 470.82.01

Log output from the training run:
Added key: store_based_barrier_key:1 to store for rank: 6
Added key: store_based_barrier_key:1 to store for rank: 5
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 2
Added key: store_based_barrier_key:1 to store for rank: 7
Added key: store_based_barrier_key:1 to store for rank: 4
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 8 nodes.
Rank 6: Completed store-based barrier for 8 nodes.
Rank 5: Completed store-based barrier for 8 nodes.
Rank 3: Completed store-based barrier for 8 nodes.
Rank 2: Completed store-based barrier for 8 nodes.
Rank 7: Completed store-based barrier for 8 nodes.
Rank 4: Completed store-based barrier for 8 nodes.
Rank 1: Completed store-based barrier for 8 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 4, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 5, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 6, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 7, total 8.
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
DLL 2022-04-06 06:03:54.651781 - PARAMETER model_name : efficientdet_d0 param_count : 3826868
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Converted model to use Synchronized BatchNorm. WARNING: You may have issues if using zero initialized BN layers (enabled by default for ResNets) while sync-bn enabled.
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Using torch DistributedDataParallel. Install NVIDIA Apex for Apex DDP.
DLL 2022-04-06 06:03:56.451268 - PARAMETER Scheduled_epochs : 34
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=69.44s)
creating index...
Done (t=71.93s)
creating index...
Done (t=72.49s)
creating index...
Done (t=72.71s)
creating index...
Done (t=72.77s)
creating index...
Done (t=73.04s)
creating index...
Done (t=73.05s)
creating index...
Done (t=73.08s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=22.79s)
creating index...
index created!
Done (t=23.14s)
creating index...
Done (t=22.88s)
creating index...
Done (t=23.24s)
creating index...
Done (t=23.59s)
creating index...
Done (t=23.13s)
creating index...
Done (t=23.37s)
creating index...
Done (t=23.39s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
Traceback (most recent call last):
File "train.py", line 635, in <module>
main()
File "train.py", line 461, in main
train_metrics = train_epoch(
File "train.py", line 522, in train_epoch
input, target = next(loader_iter)
File "/ngc_0/Chong_dxxz_Projects/Gitlab/Adversarial_Detection/DeepLearningExamples/Detection/Waymo/data/loader.py", line 84, in __iter__
for next_input, next_target in self.loader:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/ngc_0/Chong_dxxz_Projects/Gitlab/Adversarial_Detection/DeepLearningExamples/Detection/Waymo/data/loader.py", line 65, in fast_collate
target[tk][i] = torch.tensor(tv, dtype=target[tk].dtype)
TypeError: new(): invalid data type 'str'
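
The traceback ends at `torch.tensor(tv, dtype=target[tk].dtype)` inside `fast_collate`, which suggests that at least one value in a per-sample target dict is a string rather than a number. A minimal sketch of a check that could be run on a single sample to find the offending key is below. The helper names (`is_numeric`, `non_numeric_keys`) and the sample keys and values are hypothetical and not taken from the repository; the sketch assumes target values are plain Python numbers, nested lists, or array-like objects exposing a `dtype`.

```python
from numbers import Number
from typing import Any, Dict, List


def is_numeric(value: Any) -> bool:
    """Return True if value looks convertible to a numeric tensor."""
    # Array-like objects: NumPy dtypes expose a one-letter `kind`
    # ('b', 'i', 'u', 'f', 'c' are numeric; 'U'/'S' are strings).
    # Objects whose dtype has no `kind` are assumed to be numeric.
    dtype = getattr(value, "dtype", None)
    if dtype is not None:
        return getattr(dtype, "kind", "f") in ("b", "i", "u", "f", "c")
    if isinstance(value, Number):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_numeric(v) for v in value)
    return False


def non_numeric_keys(target: Dict[str, Any]) -> List[str]:
    """List the keys of one target dict whose values are not purely numeric."""
    return sorted(k for k, v in target.items() if not is_numeric(v))


# Hypothetical sample: key names and values are illustrative only.
sample_target = {
    "bbox": [[10.0, 20.0, 110.0, 220.0]],
    "cls": [1],
    "img_id": "segment-123_frame-7",  # a string like this cannot become a tensor
    "img_scale": 1.0,
}
print(non_numeric_keys(sample_target))  # ['img_id']
```

If a check like this flags a key on a real sample (for example, by indexing the training dataset once before building the loader), that key is a likely candidate for the value that `fast_collate` fails to convert.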
