[EfficientDet/PyTorch] TypeError: new(): invalid data type 'str' when training EfficientDet on Waymo dataset

Related to **EfficientDet/PyTorch** 

**Describe the bug**
When I try to reproduce the EfficientDet training results on the Waymo dataset as described in:
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Detection/Efficientdet
training fails with "TypeError: new(): invalid data type 'str'" after the Waymo dataset has been loaded and training starts.

**To Reproduce**
Steps to reproduce the behavior:
1. Clone 'https://github.com/NVIDIA/DeepLearningExamples' and cd into DeepLearningExamples/PyTorch/Detection/Efficientdet
2. Run 'waymo_tool/waymo_data_converter.py' to download the Waymo data and convert it into COCO format
3. Set the dataset paths following 'scripts/waymo/train_waymo_AMP_8xA100-80G.sh'
4. Launch './distributed_train.sh 8 /datasets/Waymo_JoC --model efficientdet_d0 -b 8 --amp --lr 0.2 --sync-bn --opt fusedmomentum --warmup-epochs 1 --output Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N --worker 8 --fill-color mean --model-ema --model-ema-decay 0.999 --eval-after 24 --epochs 24 --save-checkpoint-interval 1 --smoothing 0.0 --waymo --remove-weights class_net box_net anchor --input_size 1536 --num_classes 3 --resume --freeze-layers backbone --waymo-train /datasets/Waymo_JoC/waymo_coco_format_train/images --waymo-val /datasets/Waymo_JoC/waymo_coco_format_val/images --waymo-val-annotation /datasets/Waymo_JoC/waymo_coco_format_val/annotations/annotations.json --waymo-train-annotation /datasets/Waymo_JoC/waymo_coco_format_train/annotations/annotations.json'

**Expected behavior**
EfficientDet training on the Waymo dataset should run without errors.

**Environment**
* Container version: pytorch:21.06-py3
* GPUs in the system: 8x Tesla A100-80GB
* CUDA version: 11.4
* CUDA driver version: 470.82.01

Log output from the training run:
Added key: store_based_barrier_key:1 to store for rank: 6
Added key: store_based_barrier_key:1 to store for rank: 5
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 2
Added key: store_based_barrier_key:1 to store for rank: 7
Added key: store_based_barrier_key:1 to store for rank: 4
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 8 nodes.
Rank 6: Completed store-based barrier for 8 nodes.
Rank 5: Completed store-based barrier for 8 nodes.
Rank 3: Completed store-based barrier for 8 nodes.
Rank 2: Completed store-based barrier for 8 nodes.
Rank 7: Completed store-based barrier for 8 nodes.
Rank 4: Completed store-based barrier for 8 nodes.
Rank 1: Completed store-based barrier for 8 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 4, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 5, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 6, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total 8.
Training in distributed mode with multiple processes, 1 GPU per process. Process 7, total 8.
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
model does not have attribute module...
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth'
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
DLL 2022-04-06 06:03:54.651781 - PARAMETER model_name : efficientdet_d0  param_count : 3826868
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Converted model to use Synchronized BatchNorm. WARNING: You may have issues if using zero initialized BN layers (enabled by default for ResNets) while sync-bn enabled.
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Input size to be passed to dataloaders: 1536
Image size used in model: 1536
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint
Using torch DistributedDataParallel. Install NVIDIA Apex for Apex DDP.
DLL 2022-04-06 06:03:56.451268 - PARAMETER Scheduled_epochs : 34
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=69.44s)
creating index...
Done (t=71.93s)
creating index...
Done (t=72.49s)
creating index...
Done (t=72.71s)
creating index...
Done (t=72.77s)
creating index...
Done (t=73.04s)
creating index...
Done (t=73.05s)
creating index...
Done (t=73.08s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=22.79s)
creating index...
index created!
Done (t=23.14s)
creating index...
Done (t=22.88s)
creating index...
Done (t=23.24s)
creating index...
Done (t=23.59s)
creating index...
Done (t=23.13s)
creating index...
Done (t=23.37s)
creating index...
Done (t=23.39s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
Traceback (most recent call last):
  File "train.py", line 635, in <module>
    main()
  File "train.py", line 461, in main
    train_metrics = train_epoch(
  File "train.py", line 522, in train_epoch
    input, target = next(loader_iter)
  File "/ngc_0/Chong_dxxz_Projects/Gitlab/Adversarial_Detection/DeepLearningExamples/Detection/Waymo/data/loader.py", line 84, in __iter__
    for next_input, next_target in self.loader:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/ngc_0/Chong_dxxz_Projects/Gitlab/Adversarial_Detection/DeepLearningExamples/Detection/Waymo/data/loader.py", line 65, in fast_collate
    target[tk][i] = torch.tensor(tv, dtype=target[tk].dtype)
TypeError: new(): invalid data type 'str'
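
The traceback ends in fast_collate, where torch.tensor(tv, dtype=...) is called on a target value; torch.tensor raises exactly this TypeError when it is handed a Python str. One way to narrow this down, assuming the string comes from the converted COCO annotation file (not verified here), is to scan annotations.json for string values in fields that are normally numeric. The field list and the helper names below are hypothetical and only for illustration; they are not part of the repository.

```python
import json
from collections import Counter

# Fields a COCO-style loader usually expects to be numeric.
# This list is an assumption for illustration; adjust it to the converter's output.
NUMERIC_FIELDS = {
    "images": ("id", "width", "height"),
    "annotations": ("id", "image_id", "category_id", "area", "iscrowd", "bbox"),
    "categories": ("id",),
}


def _has_str(value):
    """Return True if value is a str, or a list/tuple that contains a str."""
    if isinstance(value, str):
        return True
    if isinstance(value, (list, tuple)):
        return any(_has_str(v) for v in value)
    return False


def find_string_fields(coco):
    """Count, per (section, field), how many records hold a string value."""
    hits = Counter()
    for section, fields in NUMERIC_FIELDS.items():
        for record in coco.get(section, []):
            for field in fields:
                if field in record and _has_str(record[field]):
                    hits[(section, field)] += 1
    return hits


def check_annotation_file(path):
    """Load a COCO-format JSON file and report string-typed numeric fields."""
    with open(path) as f:
        return find_string_fields(json.load(f))
```

For example, calling check_annotation_file('/datasets/Waymo_JoC/waymo_coco_format_train/annotations/annotations.json') and printing the returned counter would show which fields, if any, contain strings. An empty result would suggest the string is introduced later, in the dataset or transform code rather than in the annotation file.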
