Training and OOM #8

Closed

hcleung3325 opened this issue Oct 8, 2021 · 9 comments

@hcleung3325

Thanks for your code.
I tried to train the model with train_stage1.yml, and it ran out of CUDA memory (OOM).
I am using a 2080 Ti. I tried reducing the batch size from 16 to 2 and the GT_size from 192 to 48.
However, the training still OOMs.
May I know if there is anything I missed?
Thanks.
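
For reference, the reductions described above correspond to these keys in the datasets: train block of train_stage1.yml (a sketch showing only the changed values; the full config is posted later in this thread):

datasets:
  train:
    batch_size: 2  # reduced from 16
    GT_size: 48    # reduced from 192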

@JingyunLiang
Owner

MANet training doesn't take much memory. Did you turn on cal_lr_psnr?

cal_lr_psnr: False # calculate lr psnr consumes huge memory

@hcleung3325
Author

> MANet training doesn't take much memory. Did you turn on cal_lr_psnr?
>
> cal_lr_psnr: False # calculate lr psnr consumes huge memory

Thanks for the reply.
No, it is kept at False.


#### general settings
name: 001_MANet_aniso_x4_TMO_40_stage1
use_tb_logger: true
model: blind
distortion: sr
scale: 4
gpu_ids: [1]
kernel_size: 21
code_length: 15
# train
sig_min: 0.7 # 0.7, 0.525, 0.35 for x4, x3, x2
sig_max: 10.0  # 10, 7.5, 5 for x4, x3, x2
train_noise: False
noise_high: 15
train_jpeg: False
jpeg_low: 70
# validation
sig: 1.6
sig1: 6 # 6, 5, 4 for x4, x3, x2
sig2: 1
theta: 0
rate_iso: 0 # 1 for iso, 0 for aniso
test_noise: False
noise: 15
test_jpeg: False
jpeg: 70
pca_path: ./pca_matrix_aniso21_15_x4.pth
cal_lr_psnr: False # calculate lr psnr consumes huge memory


#### datasets
datasets:
  train:
    name: TMO
    mode: GT
    dataroot_GT: ../datasets/HR
    dataroot_LQ: ~

    use_shuffle: true
    n_workers: 8
    batch_size: 4
    GT_size: 192
    LR_size: ~
    use_flip: true
    use_rot: true
    color: RGB
  val:
    name: Set5
    mode: GT
    dataroot_GT: ../../data
    dataroot_LQ: ~


#### network structures
network_G:
  which_model_G: MANet_s1
  in_nc: 3
  out_nc: ~
  nf: ~
  nb: ~
  gc: ~
  manet_nf: 128
  manet_nb: 1
  split: 2


#### path
path:
  pretrain_model_G: ~
  strict_load: true
  resume_state:  ~ #../experiments/001_MANet_aniso_x4_DIV2K_40_stage1/training_state/5000.state


#### training settings: learning rate scheme, loss
train:
  lr_G: !!float 2e-4
  lr_scheme: MultiStepLR
  beta1: 0.9
  beta2: 0.999
  niter: 300000
  warmup_iter: -1
  lr_steps: [100000, 150000, 200000, 250000]
  lr_gamma: 0.5
  restarts: ~
  restart_weights: ~
  eta_min: !!float 1e-7

  kernel_criterion: l1
  kernel_weight: 1.0

  manual_seed: 0
  val_freq: !!float 2e7


#### logger
logger:
  print_freq: 200
  save_checkpoint_freq: !!float 2e4

@JingyunLiang
Owner

It's strange because MANet is a tiny model and consumes little memory. Do you have any problems testing the model? Can you try to set manet_nf=32 in training?
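
For reference, that change would go in the network_G block of train_stage1.yml (a sketch based on the config posted above; only manet_nf is changed):

network_G:
  which_model_G: MANet_s1
  in_nc: 3
  manet_nf: 32  # reduced from 128 to lower memory usage
  manet_nb: 1
  split: 2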

@hcleung3325
Author

> It's strange because MANet is a tiny model and consumes little memory. Do you have any problems testing the model? Can you try to set manet_nf=32 in training?

Thanks for the reply.
I have tried manet_nf=32, but it still OOMs.

@hcleung3325
Author

Is the correct way to run it
python train.py --opt options/train/train_stage1.yml?

@JingyunLiang
Owner

I think it's a problem with your GPU. Can you train other models normally? Can you test MANet on your GPU?

@hcleung3325
Author

My GPU is a 2080 Ti, which only has 11 GB. Do I need a GPU with more memory to train it?

@JingyunLiang
Owner

I don't think so. A 2080 Ti should at least be enough when manet_nf=32. Can you monitor the GPU usage with watch -d -n 0.5 nvidia-smi when you start training the model?

@hcleung3325
Author

Thanks a lot. The problem is solved. I can run the training now.
