
Training hangs at the end of the first epoch in image classification task. #33

Closed
realgump opened this issue Apr 7, 2022 · 17 comments

realgump commented Apr 7, 2022

Dear author:

When training the UniFormer model with 8 GPUs, I start the code with the following run.sh:

work_path=$(dirname $0)
PYTHONPATH=$PYTHONPATH:../../ \
python -m torch.distributed.launch --nproc_per_node=8 --master_port=22335 --use_env main.py \
    --model uniformer_base \
    --batch-size 64 \
    --num_workers 8 \
    --drop-path 0.3 \
    --epoch 300 \
    --dist-eval \
    --output_dir ${work_path}/ckpt \
    2>&1 | tee -a ${work_path}/log.txt

And the logs are as follows (I have omitted the model architecture printout):

/home/data/user/local/anaconda3/envs/uniformer/lib/python3.9/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : main.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 8
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:22335
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_yejsfquq/none_6nw5du9x
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/data/user/local/anaconda3/envs/uniformer/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=22335
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_yejsfquq/none_6nw5du9x/attempt_0/7/error.json
Please update your PyTorchVideo to latest master
[the same message is printed once by each of the 8 ranks]
| distributed init (rank 3): env://
| distributed init (rank 6): env://
| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 5): env://
| distributed init (rank 2): env://
| distributed init (rank 7): env://
| distributed init (rank 4): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 5 using best-guess GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 6 using best-guess GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 7 using best-guess GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Warning: Enabling distributed evaluation with an eval dataset not divisible by process number. This will slightly alter validation results as extra duplicate entries are added to achieve equal num of samples per-process.
Creating model: uniformer_base
number of params: 49468752
Start training for 300 epochs
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[the same reducer warning is printed once by each of the 8 ranks]
Epoch: [0]  [0/9]  eta: 0:01:24  lr: 0.000001  loss: 6.0915 (6.0915)  time: 9.3436  data: 3.3803  max mem: 11905
Epoch: [0]  [8/9]  eta: 0:00:02  lr: 0.000001  loss: 6.0828 (6.0922)  time: 2.0037  data: 0.3758  max mem: 12141
Epoch: [0] Total time: 0:00:18 (2.0372 s / it)
Averaged stats: lr: 0.000001  loss: 6.0828 (6.0883)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[the same pthreadpool warning is repeated 64 times]

The displayed logs are very messy and maddening, and I have no clue whether the code is running correctly. These warnings only occur with distributed training. Have you ever run into this situation? I would appreciate any advice you can give me.
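From what I can tell, the reducer warning above at least suggests an optional tweak rather than a real error: it says find_unused_parameters is never actually needed for this model. A minimal sketch of turning it off (the helper and variable names here are illustrative, not the ones in this repo's main.py):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    """Illustrative helper, not the repo's actual code."""
    torch.cuda.set_device(local_rank)   # pin this process to its own GPU
    return DDP(
        model.cuda(local_rank),
        device_ids=[local_rank],
        find_unused_parameters=False,   # the reducer warning says the extra graph traversal is unnecessary here
    )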

Andy1621 (Collaborator) commented Apr 7, 2022

I have also seen these warnings. They do not affect performance, so you can just ignore them~

Andy1621 (Collaborator) commented Apr 7, 2022

You can simply check stdout.log instead; it does not show these warnings.

realgump (Author) commented Apr 7, 2022

Thanks for your kind response.

realgump closed this as completed Apr 7, 2022
realgump (Author) commented Apr 7, 2022

However, the training really does hang :(

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802691 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802719 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802704 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802712 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802694 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 74124) of binary: /home/data/user/local/anaconda3/envs/uniformer/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
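The Timeout(ms)=1800000 above is the default 30-minute NCCL timeout, so the traceback only tells us that one rank sat in a broadcast for half an hour. If a longer grace period would help while debugging, the timeout can be raised where the process group is created; a sketch, assuming an env:// initialization like the one in the log (whether it takes effect for NCCL can depend on the PyTorch version and async-error-handling settings):

from datetime import timedelta

import torch.distributed as dist

# Sketch only: raise the NCCL watchdog timeout from the default 30 minutes to 2 hours.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=timedelta(hours=2),
)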

Andy1621 (Collaborator) commented Apr 7, 2022

It seems that you may need to increase the shared memory?
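One quick way to check how much of /dev/shm is actually available inside the training environment, using only the standard library (a sketch; /dev/shm is the usual Linux mount point):

import shutil

# Report the size of the tmpfs that backs shared memory.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, used={used / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")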

Andy1621 (Collaborator) commented Apr 7, 2022

See here. You can try using a smaller batch size if you train UniFormer-B. Note that the learning rate should then be adjusted accordingly.
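DeiT-style training scripts scale the learning rate linearly with the effective batch size, so if the per-GPU batch size is reduced, the base LR should shrink by the same factor. A sketch of that rule (the 512 reference and 5e-4 base value follow DeiT's defaults; the exact numbers used by this repo may differ):

def scaled_lr(base_lr: float, batch_per_gpu: int, world_size: int, reference: int = 512) -> float:
    """Linear LR scaling as used in DeiT-style scripts (sketch)."""
    return base_lr * batch_per_gpu * world_size / reference

# Example: going from 64 to 32 images per GPU on 8 GPUs halves the effective batch and the LR.
print(scaled_lr(5e-4, 64, 8))  # 0.0005
print(scaled_lr(5e-4, 32, 8))  # 0.00025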

realgump (Author) commented Apr 10, 2022

Thanks for your advice.

However, shared memory might not be the cause. My server has more than 30 GB of shared memory, and I watched its usage during training: less than 500 MB was used throughout.

I have also tried many possible solutions from the Internet, but none of them worked.

The training always gets stuck at the end of the first epoch, either just before or just after the last batch, under PyTorch 1.7.0:

Epoch: [0]  [1234/1237]  eta: 0:00:01  lr: 0.000001  loss: 4.8091 (5.1428)  time: 0.6466  data: 0.0003  max mem: 12501
Epoch: [0]  [1235/1237]  eta: 0:00:01  lr: 0.000001  loss: 4.8209 (5.1427)  time: 0.6463  data: 0.0003  max mem: 12501

or

Epoch: [0]  [1234/1237]  eta: 0:00:01  lr: 0.000001  loss: 4.8091 (5.1428)  time: 0.6466  data: 0.0003  max mem: 12501
Epoch: [0]  [1235/1237]  eta: 0:00:01  lr: 0.000001  loss: 4.8209 (5.1427)  time: 0.6463  data: 0.0003  max mem: 12501
Epoch: [0]  [1236/1237]  eta: 0:00:00  lr: 0.000001  loss: 4.8091 (5.1423)  time: 0.6451  data: 0.0003  max mem: 12501
Epoch: [0] Total time: 0:13:13 (0.6412 s / it)

I think something might go wrong when the processes synchronize, but I have no idea how to fix it.

def synchronize_between_processes(self):
    """
    Warning: does not synchronize the deque!
    """
    if not is_dist_avail_and_initialized():
        return
    t = torch.tensor([self.count, self.total],
                     dtype=torch.float64, device='cuda')
    dist.barrier()
    dist.all_reduce(t)
    t = t.tolist()
    self.count = int(t[0])
    self.total = t[1]
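If the stall really is in this barrier, one thing worth ruling out is the "best-guess GPU" issue from the warning earlier in the log, by pinning the collective to this rank's device. A sketch of the same method with that change (device_ids for barrier() needs PyTorch >= 1.8 and the NCCL backend; is_dist_avail_and_initialized is the existing helper):

import torch
import torch.distributed as dist

def synchronize_between_processes(self):
    """Same logic as above, with the barrier pinned to the current GPU (sketch)."""
    if not is_dist_avail_and_initialized():
        return
    t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
    dist.barrier(device_ids=[torch.cuda.current_device()])  # avoid the best-guess device warning
    dist.all_reduce(t)
    t = t.tolist()
    self.count = int(t[0])
    self.total = t[1]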

realgump reopened this Apr 10, 2022
realgump (Author) commented:

About the messy warnings:

They might be caused by the PyTorch version. I reinstalled my virtual environment with PyTorch 1.7.0, and all the warnings and errors disappeared. However, instead of being interrupted, the training is now always stuck at the end of the first epoch.

@realgump
Copy link
Author

There is another strange case worth noticing.

When I use a smaller dataset (about 1,000 images) to debug, everything goes well.
And when I add

debug += 1
if debug > 1000:
    break

at line 33:

for samples, targets in metric_logger.log_every(data_loader, print_freq, header):
    samples = samples.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

everything also goes well.

I really don't know what's wrong with my training.
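If it helps to see exactly where each rank is stuck, the standard-library faulthandler can dump every thread's stack on a timer; a minimal sketch that could go near the top of main.py (the 1800-second value is arbitrary):

import faulthandler

# Dump all thread tracebacks to stderr every 1800 s without killing the process,
# so the stuck collective shows up in log.txt when the hang happens.
faulthandler.dump_traceback_later(timeout=1800, repeat=True, exit=False)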

realgump changed the title from "Lots of warning occurs when training." to "Training hangs at the end of the first epoch in image classification task." on Apr 10, 2022
Andy1621 (Collaborator) commented:

It looks very strange. I have never met this bug...
Since my code is forked from DeiT, maybe you can try DeiT and open an issue there if you meet the same problem. I just added my model and adopted the same training settings.

Andy1621 (Collaborator) commented:

Maybe you can check your CUDA, cuDNN, PyTorch, and torchvision versions...
Recently, I have run into some strange bugs when using mismatched torch and torchvision.
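For the record, all of those versions can be printed from inside the environment with a few lines (a sketch; these are standard torch/torchvision APIs):

import torch
import torchvision

print("torch       :", torch.__version__)
print("torchvision :", torchvision.__version__)
print("CUDA (torch):", torch.version.cuda)
print("cuDNN       :", torch.backends.cudnn.version())
print("NCCL        :", torch.cuda.nccl.version())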

I suggest you try DeiT first and check the log~~

realgump (Author) commented:

It's really strange, and I have never run into a similar problem before. I have been debugging the code for several days but still cannot get the network to train.

realgump (Author) commented:

I have tried PyTorch 1.7 and PyTorch 1.9 with CUDA 11. Considering that DeiT is an earlier work, might using PyTorch 1.9 or CUDA 11 not be a good idea? Do you remember which versions of CUDA and cuDNN you used?

Andy1621 (Collaborator) commented:

Come on! Try DeiT and see whether you meet the same bugs.
If so, maybe you need to create a new environment. If not, you can simply copy the model code and run it!

Andy1621 (Collaborator) commented:

In my environment, torch 1.6-1.10 all work for DeiT.
Now I use torch 1.9 + CUDA 11.1 + cuDNN 8.0.4.
I used torch 1.6 + CUDA 10.1 + cuDNN 7.6.5 before.

Andy1621 (Collaborator) commented:

Hi! Have you solved your problem?

realgump (Author) commented Aug 8, 2022

Thanks a lot. I solved the problem by following your kind advice.
