-
I am training a Conformer-CTC model using the encoder weights of a pre-trained English Conformer-CTC model. As a test, I trained on the validation split of the GOLOS dataset together with my own data, about 14 hours in total. For validation, I used other data of ours and the GOLOS test set. At first, the loss and WER on the train set and on the validation set converged together, but then they diverged: the training loss keeps going down while the validation loss grows. Could you help me understand what this means? My config:
name: "Conformer-CTC-Char"
model:
sample_rate: 8000
# timesteps: 128
labels: &labels [ " ", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к", "л", "м", "н", "о", "п",
"р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ", "ъ", "ы", "ь", "э", "ю", "я" ]
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false
train_ds:
manifest_filepath: "../data/processed/manifests/etalon_golos_train.json"
labels: ${model.labels}
sample_rate: ${model.sample_rate}
batch_size: &batch_size 4
shuffle: true
num_workers: 8
pin_memory: true
trim_silence: false # true
max_duration: 55
min_duration: 0.1
# tarred datasets
is_tarred: false
tarred_audio_filepaths: null
shuffle_n: 2048
# bucketing params
bucketing_strategy: "synced_randomized"
bucketing_batch_size: null
# augmentor:
# # shift:
# # prob: 1.0
# # min_shift_ms: -5.0
# # max_shift_ms: 5.0
# white_noise:
# prob: 0.5
# min_level: -90
# max_level: -46
# speed:
# prob: 0.5
# sr: ${model.sample_rate}
# resample_type: 'kaiser_fast'
# min_speed_rate: 0.95
# max_speed_rate: 1.05
validation_ds:
manifest_filepath: "../data/processed/manifests/etalon_val.json"
labels: ${model.labels}
sample_rate: ${model.sample_rate}
batch_size: *batch_size
shuffle: false
num_workers: 8
pin_memory: true
test_ds:
manifest_filepath: "../data/processed/manifests/test.json"
labels: ${model.labels}
sample_rate: ${model.sample_rate}
batch_size: *batch_size
shuffle: false
num_workers: 8
pin_memory: true
preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
sample_rate: ${model.sample_rate}
normalize: "per_feature"
window_size: 0.025
window_stride: 0.01
window: "hann"
features: 80
n_fft: 512
log: true
frame_splicing: 1
dither: 0.00001
pad_to: 0
pad_value: 0.0
spec_augment:
_target_: nemo.collections.asr.modules.SpectrogramAugmentation
# freq_masks: 2 # set to zero to disable it
# # you may use lower time_masks for smaller models to have a faster convergence
# time_masks: 10 # set to zero to disable it
# freq_width: 27
# time_width: 0.05
freq_masks: 2
time_masks: 2
freq_width: 15
time_width: 25
rect_masks: 5
rect_time: 25
rect_freq: 15
# crop_or_pad_augment:
# _target_: nemo.collections.asr.modules.CropOrPadSpectrogramAugmentation
# audio_length: ${model.timesteps}
encoder:
_target_: nemo.collections.asr.modules.ConformerEncoder
feat_in: ${model.preprocessor.features}
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 16
d_model: 176
# Sub-sampling params
subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
subsampling_factor: 4 # must be power of 2
subsampling_conv_channels: 176 # set to -1 to make it equal to the d_model
# Feed forward module's params
ff_expansion_factor: 4
# Multi-headed Attention Module's params
self_attention_model: rel_pos # rel_pos or abs_pos
n_heads: 4 # may need to be lower for smaller d_models
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
att_context_size: [-1, -1] # -1 means unlimited context
xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
pos_emb_max_len: 5000
# Convolution module's params
conv_kernel_size: 31
conv_norm_type: 'batch_norm' # batch_norm or layer_norm
### regularization
dropout: 0.1 # The dropout used in most of the Conformer Modules
dropout_emb: 0.0 # The dropout used for embeddings
dropout_att: 0.1 # The dropout for multi-headed attention modules
decoder:
_target_: nemo.collections.asr.modules.ConvASRDecoder
feat_in: null
num_classes: 33
vocabulary: ${model.labels}
optim:
name: adamw
lr: 2.0
# optimizer arguments
betas: [0.9, 0.98]
# less necessity for weight_decay as we already have large augmentations with SpecAug
# you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
# weight decay of 0.0 with lr of 2.0 also works fine
weight_decay: 1e-3
# scheduler setup NoamAnnealing
sched:
name: NoamAnnealing
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 10000
warmup_ratio: null
min_lr: 1e-6
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 200
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: auto
strategy: dp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
log_every_n_steps: 100 # Interval of logging.
progress_bar_refresh_rate: 10
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
enable_checkpointing: True # Provided by exp_manager
# logger: false # Provided by exp_manager
benchmark: false # needs to be false for models with variable-length speech input as it slows down training
exp_manager:
exp_dir: null
name: ${name}
create_tensorboard_logger: true
create_checkpoint_callback: true
checkpoint_callback_params:
# in case of multiple validation sets, first one is used
monitor: "val_wer"
mode: "min"
save_top_k: 1
always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints
resume_if_exists: false
resume_ignore_no_checkpoint: false
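For reference, this is roughly how I transfer the English encoder weights into this model before training (a minimal sketch, not my exact script: the pretrained model name `stt_en_conformer_ctc_small` and the config filename are assumptions):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# load the YAML config shown above (placeholder filename)
cfg = OmegaConf.load("conformer_ctc_char.yaml")

trainer = pl.Trainer(**cfg.trainer)
asr_model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)

# copy only the encoder weights from the English checkpoint; the decoder stays
# randomly initialized because the output vocabulary here is Russian
pretrained = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_en_conformer_ctc_small")
asr_model.encoder.load_state_dict(pretrained.encoder.state_dict())
del pretrained

trainer.fit(asr_model)
```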
Thanks a lot for any answer!
-
Your model is overfitting severely: the training loss keeps going down while the validation loss spikes. You should use more SpecAugment (say 5 time masks rather than 2) to slow down the overfitting. Another issue is that your training duration distribution is very wide: a max duration of 55 seconds limits your batch size too much, and Conformer needs a global batch size of at least ~256 to converge stably with CTC loss. Validation loss and validation WER are only loosely correlated in ASR training, which is one of the reasons we select models directly on WER rather than on loss; however, I have not seen such a large divergence still manage to reduce WER. The fact that the train WER is so much lower suggests a large mismatch between your train and validation sets, so it may simply be that many words occurring in dev do not occur in train, and vice versa.
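Roughly, those changes would look like this as overrides on the config you posted (a sketch; the accumulation factor assumes a single GPU with per-device batch size 4, and the max-duration value is only illustrative):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("conformer_ctc_char.yaml")  # placeholder filename for the config above

cfg.model.spec_augment.time_masks = 5      # heavier time masking to slow down overfitting
cfg.model.train_ds.max_duration = 20.0     # illustrative cap; 55 s utterances keep the batch tiny
cfg.trainer.accumulate_grad_batches = 64   # 4 (batch) * 1 (GPU) * 64 = 256 effective global batch
```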
-
@VahidooX for any additional advice
-
Thank you very much for your answer! I set 5 time masks and analyzed the dataset; it did have defects. Interestingly, the WER on validation still converges faster, while the losses converge only slowly, and the difference between train and validation WER is much smaller than it was before.
-
We sometimes observe that the validation loss goes up to some extent after many epochs while the WER stays the same or even improves, but those cases were not as bad as yours. We have tested batch sizes of 256 and they work with the default learning-rate policy; it should also work with a batch size of 128. Some suggestions:
- For your final run, set save_top_k to 10 so that you can use checkpoint averaging (a rough averaging sketch follows this list). It always helps the WER.
- It looks like you are using both SpecAugment and Cutout (the rect_* settings). It is better to use just one, so that you have better control over the amount of augmentation. You may try 2 or 5 time masks for small CTC models.
- Training on 50-second audio is not efficient. You may drop the long files if they are a small part of your training data, or use our segmentation tools to split them into shorter pieces.
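Checkpoint averaging over the saved top-k checkpoints can be done roughly like this (a generic sketch over Lightning checkpoints; the directory pattern and output path are placeholders, and NeMo's repo also ships its own averaging scripts):

```python
import glob
import torch

# placeholder path: wherever exp_manager stored the top-k checkpoints
ckpt_paths = sorted(glob.glob("nemo_experiments/Conformer-CTC-Char/checkpoints/*.ckpt"))
assert ckpt_paths, "no checkpoints found"

avg_state, last_state = None, None
for path in ckpt_paths:
    last_state = torch.load(path, map_location="cpu")["state_dict"]
    if avg_state is None:
        avg_state = {k: v.clone().double() for k, v in last_state.items()}
    else:
        for k, v in last_state.items():
            avg_state[k] += v.double()

# divide by the number of checkpoints and cast back to the original dtypes
avg_state = {k: (v / len(ckpt_paths)).to(last_state[k].dtype) for k, v in avg_state.items()}
torch.save({"state_dict": avg_state}, "averaged.ckpt")
```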
-
Thanks for your reply!
-
It's strange, but now I have the opposite: the WER on the train data is 40 and on the validation data it is 20, while the loss is about the same on both, around 40.
Beta Was this translation helpful? Give feedback.