-
I am training a Conformer-CTC model using the encoder weights of a pre-trained English Conformer-CTC model. As a test, I trained on the validation split of the GOLOS dataset together with my own data, about 14 hours in total. For validation, I used other data of ours and the GOLOS test set. At first, the loss and WER on the train set and on the validation set converged together, but then they diverged: the training loss keeps going down while the validation loss grows. Could you help me understand what this means? My config:
name: "Conformer-CTC-Char"
model:
sample_rate: 8000
# timesteps: 128
labels: &labels [ " ", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к", "л", "м", "н", "о", "п",
"р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ", "ъ", "ы", "ь", "э", "ю", "я" ]
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false
train_ds:
manifest_filepath: "../data/processed/manifests/etalon_golos_train.json"
labels: ${model.labels}
sample_rate: ${model.sample_rate}
batch_size: &batch_size 4
shuffle: true
num_workers: 8
pin_memory: true
trim_silence: false # true
max_duration: 55
min_duration: 0.1
# tarred datasets
is_tarred: false
tarred_audio_filepaths: null
shuffle_n: 2048
# bucketing params
bucketing_strategy: "synced_randomized"
bucketing_batch_size: null
# augmentor:
# # shift:
# # prob: 1.0
# # min_shift_ms: -5.0
# # max_shift_ms: 5.0
# white_noise:
# prob: 0.5
# min_level: -90
# max_level: -46
# speed:
# prob: 0.5
# sr: ${model.sample_rate}
# resample_type: 'kaiser_fast'
# min_speed_rate: 0.95
# max_speed_rate: 1.05
validation_ds:
manifest_filepath: "../data/processed/manifests/etalon_val.json"
labels: ${model.labels}
sample_rate: ${model.sample_rate}
batch_size: *batch_size
shuffle: false
num_workers: 8
pin_memory: true
test_ds:
manifest_filepath: "../data/processed/manifests/test.json"
labels: ${model.labels}
sample_rate: ${model.sample_rate}
batch_size: *batch_size
shuffle: false
num_workers: 8
pin_memory: true
preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
sample_rate: ${model.sample_rate}
normalize: "per_feature"
window_size: 0.025
window_stride: 0.01
window: "hann"
features: 80
n_fft: 512
log: true
frame_splicing: 1
dither: 0.00001
pad_to: 0
pad_value: 0.0
spec_augment:
_target_: nemo.collections.asr.modules.SpectrogramAugmentation
# freq_masks: 2 # set to zero to disable it
# # you may use lower time_masks for smaller models to have a faster convergence
# time_masks: 10 # set to zero to disable it
# freq_width: 27
# time_width: 0.05
freq_masks: 2
time_masks: 2
freq_width: 15
time_width: 25
rect_masks: 5
rect_time: 25
rect_freq: 15
# crop_or_pad_augment:
# _target_: nemo.collections.asr.modules.CropOrPadSpectrogramAugmentation
# audio_length: ${model.timesteps}
encoder:
_target_: nemo.collections.asr.modules.ConformerEncoder
feat_in: ${model.preprocessor.features}
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 16
d_model: 176
# Sub-sampling params
subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
subsampling_factor: 4 # must be power of 2
subsampling_conv_channels: 176 # set to -1 to make it equal to the d_model
# Feed forward module's params
ff_expansion_factor: 4
# Multi-headed Attention Module's params
self_attention_model: rel_pos # rel_pos or abs_pos
n_heads: 4 # may need to be lower for smaller d_models
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
att_context_size: [-1, -1] # -1 means unlimited context
xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
pos_emb_max_len: 5000
# Convolution module's params
conv_kernel_size: 31
conv_norm_type: 'batch_norm' # batch_norm or layer_norm
### regularization
dropout: 0.1 # The dropout used in most of the Conformer Modules
dropout_emb: 0.0 # The dropout used for embeddings
dropout_att: 0.1 # The dropout for multi-headed attention modules
decoder:
_target_: nemo.collections.asr.modules.ConvASRDecoder
feat_in: null
num_classes: 33
vocabulary: ${model.labels}
optim:
name: adamw
lr: 2.0
# optimizer arguments
betas: [0.9, 0.98]
# less necessity for weight_decay as we already have large augmentations with SpecAug
# you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
# weight decay of 0.0 with lr of 2.0 also works fine
weight_decay: 1e-3
# scheduler setup NoamAnnealing
sched:
name: NoamAnnealing
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 10000
warmup_ratio: null
min_lr: 1e-6
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 200
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: auto
strategy: dp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
log_every_n_steps: 100 # Interval of logging.
progress_bar_refresh_rate: 10
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
enable_checkpointing: True # Provided by exp_manager
# logger: false # Provided by exp_manager
benchmark: false # needs to be false for models with variable-length speech input as it slows down training
exp_manager:
exp_dir: null
name: ${name}
create_tensorboard_logger: true
create_checkpoint_callback: true
checkpoint_callback_params:
# in case of multiple validation sets, first one is used
monitor: "val_wer"
mode: "min"
save_top_k: 1
always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints
resume_if_exists: false
resume_ignore_no_checkpoint: false
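For reference, this is roughly how I transfer the English encoder weights into this model before training (a minimal sketch, not my exact script: the pretrained model name `stt_en_conformer_ctc_small` and the config filename are assumptions):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# load the YAML config shown above (placeholder filename)
cfg = OmegaConf.load("conformer_ctc_char.yaml")

trainer = pl.Trainer(**cfg.trainer)
asr_model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)

# copy only the encoder weights from the English checkpoint; the decoder stays
# randomly initialized because the output vocabulary here is Russian
pretrained = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_en_conformer_ctc_small")
asr_model.encoder.load_state_dict(pretrained.encoder.state_dict())
del pretrained

trainer.fit(asr_model)
```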
Thanks a lot for any answer!
-
Your model is overfitting severely: the training loss keeps going down while the validation loss spikes. You should use more SpecAugment (say 5 time masks rather than 2) to slow down the overfitting. Another issue is that your training duration distribution is very wide: a max duration of 55 seconds limits your batch size too much, and Conformer needs a global batch size of at least ~256 to converge stably with CTC loss. Validation loss and validation WER are only loosely correlated in ASR training, which is one of the reasons we select models directly on WER rather than on loss; however, I have not seen such a large divergence still manage to reduce WER. The fact that the train WER is so much lower suggests a large mismatch between your train and validation sets, so it may simply be that many words occurring in dev do not occur in train, and vice versa.
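Roughly, those changes would look like this as overrides on the config you posted (a sketch; the accumulation factor assumes a single GPU with per-device batch size 4, and the max-duration value is only illustrative):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("conformer_ctc_char.yaml")  # placeholder filename for the config above

cfg.model.spec_augment.time_masks = 5      # heavier time masking to slow down overfitting
cfg.model.train_ds.max_duration = 20.0     # illustrative cap; 55 s utterances keep the batch tiny
cfg.trainer.accumulate_grad_batches = 64   # 4 (batch) * 1 (GPU) * 64 = 256 effective global batch
```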
-
@VahidooX for any additional advice
-
Thank you very much for your answer! I set 5 time masks and analyzed the dataset; it did have defects. Interestingly, the WER on validation still converges faster, while the losses converge only slowly, and the difference between train and validation WER is much smaller than it was before.
-
We sometimes observe that the validation loss goes up to some extent after many epochs while the WER stays the same or even improves, but those cases were not as bad as yours. We have tested batch sizes of 256 and they work with the default learning-rate policy; it should also work with a batch size of 128. Some suggestions:
- For your final run, set save_top_k to 10 so that you can use checkpoint averaging (a rough averaging sketch follows this list). It always helps the WER.
- It looks like you are using both SpecAugment and Cutout (the rect_* settings). It is better to use just one, so that you have better control over the amount of augmentation. You may try 2 or 5 time masks for small CTC models.
- Training on 50-second audio is not efficient. You may drop the long files if they are a small part of your training data, or use our segmentation tools to split them into shorter pieces.
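Checkpoint averaging over the saved top-k checkpoints can be done roughly like this (a generic sketch over Lightning checkpoints; the directory pattern and output path are placeholders, and NeMo's repo also ships its own averaging scripts):

```python
import glob
import torch

# placeholder path: wherever exp_manager stored the top-k checkpoints
ckpt_paths = sorted(glob.glob("nemo_experiments/Conformer-CTC-Char/checkpoints/*.ckpt"))
assert ckpt_paths, "no checkpoints found"

avg_state, last_state = None, None
for path in ckpt_paths:
    last_state = torch.load(path, map_location="cpu")["state_dict"]
    if avg_state is None:
        avg_state = {k: v.clone().double() for k, v in last_state.items()}
    else:
        for k, v in last_state.items():
            avg_state[k] += v.double()

# divide by the number of checkpoints and cast back to the original dtypes
avg_state = {k: (v / len(ckpt_paths)).to(last_state[k].dtype) for k, v in avg_state.items()}
torch.save({"state_dict": avg_state}, "averaged.ckpt")
```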
-
Thanks for your reply!
-
It's strange, but now I have the opposite: the WER on the train data is 40 and on the validation data it is 20, while the loss is about the same on both, around 40.
Beta Was this translation helpful? Give feedback.