Training and OOM #8

Closed

hcleung3325 opened this issue Oct 8, 2021 · 9 comments

@hcleung3325

Thanks for your code.
I tried to train the model with train_stage1.yml, and it ran out of CUDA memory (OOM).
I am using a 2080 Ti. I tried reducing the batch size from 16 to 2 and the GT_size from 192 to 48.
However, the training still OOMs.
May I know if there is anything I missed?
Thanks.
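
For reference, the reductions described above correspond to these keys in the datasets: train block of train_stage1.yml (a sketch showing only the changed values; the full config is posted later in this thread):

datasets:
  train:
    batch_size: 2  # reduced from 16
    GT_size: 48    # reduced from 192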

@JingyunLiang
Owner

MANet training doesn't take much memory. Did you turn on cal_lr_psnr?

cal_lr_psnr: False # calculate lr psnr consumes huge memory

@hcleung3325
Author

> MANet training doesn't take much memory. Did you turn on cal_lr_psnr?
>
> cal_lr_psnr: False # calculate lr psnr consumes huge memory

Thanks for the reply.
No, it is kept at False.


#### general settings
name: 001_MANet_aniso_x4_TMO_40_stage1
use_tb_logger: true
model: blind
distortion: sr
scale: 4
gpu_ids: [1]
kernel_size: 21
code_length: 15
# train
sig_min: 0.7 # 0.7, 0.525, 0.35 for x4, x3, x2
sig_max: 10.0  # 10, 7.5, 5 for x4, x3, x2
train_noise: False
noise_high: 15
train_jpeg: False
jpeg_low: 70
# validation
sig: 1.6
sig1: 6 # 6, 5, 4 for x4, x3, x2
sig2: 1
theta: 0
rate_iso: 0 # 1 for iso, 0 for aniso
test_noise: False
noise: 15
test_jpeg: False
jpeg: 70
pca_path: ./pca_matrix_aniso21_15_x4.pth
cal_lr_psnr: False # calculate lr psnr consumes huge memory


#### datasets
datasets:
  train:
    name: TMO
    mode: GT
    dataroot_GT: ../datasets/HR
    dataroot_LQ: ~

    use_shuffle: true
    n_workers: 8
    batch_size: 4
    GT_size: 192
    LR_size: ~
    use_flip: true
    use_rot: true
    color: RGB
  val:
    name: Set5
    mode: GT
    dataroot_GT: ../../data
    dataroot_LQ: ~


#### network structures
network_G:
  which_model_G: MANet_s1
  in_nc: 3
  out_nc: ~
  nf: ~
  nb: ~
  gc: ~
  manet_nf: 128
  manet_nb: 1
  split: 2


#### path
path:
  pretrain_model_G: ~
  strict_load: true
  resume_state:  ~ #../experiments/001_MANet_aniso_x4_DIV2K_40_stage1/training_state/5000.state


#### training settings: learning rate scheme, loss
train:
  lr_G: !!float 2e-4
  lr_scheme: MultiStepLR
  beta1: 0.9
  beta2: 0.999
  niter: 300000
  warmup_iter: -1
  lr_steps: [100000, 150000, 200000, 250000]
  lr_gamma: 0.5
  restarts: ~
  restart_weights: ~
  eta_min: !!float 1e-7

  kernel_criterion: l1
  kernel_weight: 1.0

  manual_seed: 0
  val_freq: !!float 2e7


#### logger
logger:
  print_freq: 200
  save_checkpoint_freq: !!float 2e4

@JingyunLiang
Owner

It's strange because MANet is a tiny model and consumes little memory. Do you have any problems testing the model? Can you try to set manet_nf=32 in training?
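
For reference, that change would go in the network_G block of train_stage1.yml (a sketch based on the config posted above; only manet_nf is changed):

network_G:
  which_model_G: MANet_s1
  in_nc: 3
  manet_nf: 32  # reduced from 128 to lower memory usage
  manet_nb: 1
  split: 2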

@hcleung3325
Author

> It's strange because MANet is a tiny model and consumes little memory. Do you have any problems testing the model? Can you try to set manet_nf=32 in training?

Thanks for the reply.
I have tried manet_nf=32, but it still OOMs.

@hcleung3325
Author

Is the correct way to run it
python train.py --opt options/train/train_stage1.yml?

@JingyunLiang
Owner

I think it's a problem with your GPU. Can you train other models normally? Can you test MANet on your GPU?

@hcleung3325
Author

My GPU is a 2080 Ti, which only has 11 GB. Do I need a GPU with more memory to train it?

@JingyunLiang
Owner

I don't think so. A 2080 Ti should at least be enough when manet_nf=32. Can you monitor the GPU usage with watch -d -n 0.5 nvidia-smi when you start training the model?

@hcleung3325
Author

Thanks a lot. The problem is solved. I can run the training now.
