
python train.py Loss not decreasing #26

Closed
cclamd opened this issue Jan 3, 2024 · 4 comments
cclamd commented Jan 3, 2024

Hi, I added the data according to the README, but when I run python train.py it shows:

class screw
args1.json defaultdict(<class 'str'>, {'img_size': [256, 256], 'Batch_Size': 2, 'EPOCHS': 300, 'T': 1000, 'base_channels': 128, 'beta_schedule': 'linear', 'loss_type': 'l2', 'diffusion_lr': 0.0001, 'seg_lr': 1e-05, 'random_slice': True, 'weight_decay': 0.0, 'save_imgs': True, 'save_vids': False, 'dropout': 0, 'attention_resolutions': '32,16,8', 'num_heads': 4, 'num_head_channels': -1, 'noise_fn': 'gauss', 'channels': 3, 'mvtec_root_path': '/content/drive/MyDrive/DiffusionAD/datasets/mvtec', 'visa_root_path': 'datasets/VisA_1class/1cls', 'dagm_root_path': 'datasets/dagm', 'mpdd_root_path': 'datasets/mpdd', 'anomaly_source_path': '/content/drive/MyDrive/DiffusionAD/datasets/dtd', 'noisier_t_range': 600, 'less_t_range': 300, 'condition_w': 1, 'eval_normal_t': 200, 'eval_noisier_t': 400, 'output_path': 'outputs', 'arg_num': '1'})
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Epoch:0, Train loss: nan: 1% 1/160 [00:04<12:14, 4.62s/it]thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/309.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/309.png
Epoch:0, Train loss: nan: 1% 2/160 [00:06<08:03, 3.06s/it]thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/151.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/151.png
thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/023.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/023.png
thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/180.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/180.png
Epoch:0, Train loss: nan: 2% 3/160 [00:08<06:13, 2.38s/it]thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/015.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/015.png
thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/292.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/292.png
Epoch:0, Train loss: nan: 2% 4/160 [00:09<05:21, 2.06s/it]thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/113.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/113.png
thresh_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/DISthresh/good/152.png
image_path /content/drive/MyDrive/DiffusionAD/datasets/mvtec/screw/train/good/152.png

I printed image_path and thresh_path, and the paths are correct, so why doesn't the loss decrease?

HuiZhang0812 (Owner) commented

The default batch size is 16. With a small batch size such as the 2 you set, an entire batch may consist solely of abnormal samples, which breaks the calculation of the paper's loss formula (Formula 9).
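A minimal sketch of how this failure mode can produce NaN, assuming (as an illustration, not the paper's exact Formula 9) a loss term that normalizes by the number of normal samples in the batch:

```python
import torch

def normal_only_mse(pred, target, is_normal):
    """Hypothetical loss that averages MSE only over normal samples
    (an illustrative stand-in for a term like the paper's Formula 9)."""
    per_sample = ((pred - target) ** 2).flatten(1).mean(dim=1)  # shape (B,)
    mask = is_normal.float()
    # When every sample in the batch is abnormal, mask.sum() == 0
    # and the division yields 0/0 = nan, which then poisons training.
    return (per_sample * mask).sum() / mask.sum()

pred, target = torch.randn(2, 3, 8, 8), torch.randn(2, 3, 8, 8)
loss = normal_only_mse(pred, target, torch.tensor([0, 0]))  # all abnormal
print(torch.isnan(loss))  # tensor(True)
```

With batch size 2 such an all-abnormal batch is common, and a single NaN gradient step leaves the reported running loss at nan for the rest of the epoch.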

cclamd commented Jan 4, 2024

OK, thanks. What is the minimum batch size I should set for the loss to decrease? Do I have to set it to 16?


HuiZhang0812 (Owner) commented

If your GPU RAM is sufficiently large, setting the batch size to 16 is recommended.

@cclamd
Copy link
Author

cclamd commented Jan 8, 2024

Thanks. I tried several values and found that batch size = 6 is the minimum that works.

[Screenshot: Colab training output]
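For intuition on why a larger batch helps: if each training sample is independently given a synthetic anomaly with probability p (p = 0.5 is assumed here purely for illustration; the repo's actual augmentation rate may differ), the chance that a whole batch is abnormal shrinks geometrically with batch size:

```python
def p_all_abnormal(p: float, batch_size: int) -> float:
    """Probability that every sample in a batch is abnormal, assuming
    independent per-sample anomaly synthesis with probability p."""
    return p ** batch_size

# With p = 0.5: batch 2 -> 25% of batches are all-abnormal,
# batch 6 -> ~1.6%, batch 16 -> ~0.0015%.
for b in (2, 6, 16):
    print(b, p_all_abnormal(0.5, b))
```

This is consistent with the observation above: at batch size 2 an all-abnormal batch (and hence a NaN step) is near-certain within the first epoch, while at 6 it becomes rare and at 16 practically negligible.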
