Would a QuartzNet 15x5 pretrained on English have a notably hard time hearing nuances in Vietnamese? #2110

catskillsresearch · 2020-10-24T15:01:24Z

Would a QuartzNet 15x5 pretrained on English have a notably hard time hearing nuances in Vietnamese? Will training gradually adapt the pretrained encoder to fix this, or is the encoder too far down, gradient-wise, to notice? I'm asking because I've been retraining the English QuartzNet 15x5 on Vietnamese for a number of days, and it still quite frequently misses what look like isolated phonemic subtleties. My intuitive sense is that it is listening with an English "ear" that just can't distinguish the Vietnamese phonemes.

Can you think of any workarounds to improve this, other than just grinding on with the training? Here are some examples of the errors I mean:

[NeMo I 2020-10-24 10:51:06 wer:149] reference:chứ sao cũng gần mà chắc là cố gắng cưới lúc mùa hè để cho bé Dung về
[NeMo I 2020-10-24 10:51:06 wer:150] decoded  :chứ sao cũng gần mà chắc là cố gắng cướii lúc mùa hè để cho bé Dung vềé
Epoch 171:  21%| 99/464 [00:58<03:35,  1.69it/s, loss=8.938, v_num=4]
[NeMo I 2020-10-24 10:51:35 wer:148]
[NeMo I 2020-10-24 10:51:35 wer:149] reference:chửi gì chút xíu nữa đi chút xíu nữa à
[NeMo I 2020-10-24 10:51:35 wer:150] decoded  :chứi vì chút xíu nữa đâi chút xíu nữa àa
Epoch 171:  32%| 149/464 [01:28<03:07,  1.68it/s, loss=9.278, v_num=4]
[NeMo I 2020-10-24 10:52:05 wer:148]
[NeMo I 2020-10-24 10:52:05 wer:149] reference:chắc là không ơi cuối tháng sáu mới thi xong mà
[NeMo I 2020-10-24 10:52:05 wer:150] decoded  :chắc là không ơi cuối tháng sáu mới thi xongàé
Epoch 171:  43%| 199/464 [01:55<02:34,  1.72it/s, loss=8.951, v_num=4]
[NeMo I 2020-10-24 10:52:32 wer:148]
[NeMo I 2020-10-24 10:52:32 wer:149] reference:tự nhiên ở nhà với Tư rồi vòng về cho tụi nó ăn bữa còn có một ngày à
[NeMo I 2020-10-24 10:52:32 wer:150] decoded  :tự nhên ở nhà với Tư rồi vòng về cho tụi ng ăn bữa còn có một ngày àé
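
To put a number on reference/decoded pairs like the ones in this log, a word error rate and a character error rate can be computed by hand. The following is a minimal sketch using a plain Levenshtein distance; the function names are made up for illustration, and this is not NeMo's own WER implementation. Comparing the two rates can help show whether the errors are whole wrong words or mostly stray trailing characters, as they appear to be above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, decoded):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, decoded.split()) / len(ref_words)


def char_error_rate(reference, decoded):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, decoded) / len(reference)


if __name__ == "__main__":
    # One of the pairs from the log above.
    reference = "chắc là không ơi cuối tháng sáu mới thi xong mà"
    decoded = "chắc là không ơi cuối tháng sáu mới thi xongàé"
    print("WER:", word_error_rate(reference, decoded))
    print("CER:", char_error_rate(reference, decoded))
```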

gopesh97 · 2020-10-26T06:07:52Z

Hey @catskillsresearch, I don't have a solution to your problem, but I do have a question for you.
I am also working with non-ASCII Unicode characters, like yours. When I train starting from a pretrained model, the reference string I get during training is empty. Did you face the same issue, and if so, how did you resolve it? If not, could you please explain how you went about this transfer learning?
Thanks.

catskillsresearch · 2020-10-26T12:47:39Z

Author

Hi @gopesh97, I had a similar issue. The NVIDIA folks gave me some help. Here is what I do now to start a new language.

To use the pretrained QuartzNet15x5, your training data needs to be sampled at 16000 Hz. If it isn't, start by resampling your data.
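
Before resampling anything, it can help to find out which files are not already at 16000 Hz. Here is a minimal sketch using only Python's standard `wave` module; it assumes the audio is stored as uncompressed PCM WAV files, and the helper names are hypothetical. The resampling itself would be done with whatever audio tool you prefer.

```python
import wave

TARGET_RATE = 16000  # the rate mentioned above for the pretrained model


def sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def files_needing_resampling(paths, target_rate=TARGET_RATE):
    """Return the subset of paths whose sample rate differs from target_rate."""
    return [p for p in paths if sample_rate(p) != target_rate]
```

Calling `files_needing_resampling(list_of_wav_paths)` gives the list of files to convert; an empty list means the data is already at the expected rate.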

Then copy the YAML file ./NeMo/examples/asr/conf/config.yaml and modify the copy, replacing the English vocabulary with your new one. For example, for Vietnamese the top of the file becomes:

name: &name "QuartzNet15x5"
sample_rate: &sample_rate 16000
repeat: &repeat 1
dropout: &dropout 0.0
separable: &separable true
labels: &labels  [' ',  "'",  '-',  'A',  'B',  'C',  'D',  'E',  'F',  'G',  'H',
 'I',  'J',  'K',  'L',  'M',  'N',  'O',  'P',  'Q',  'R',  'S',  'T',  'U',  'V',  'W',
 'X',  'Y',  'Z',  'a',  'b',  'c',  'd',  'e',  'f',  'g',  'h',  'i',  'j',  'k',
 'l',  'm',  'n',  'o',  'p',  'q',  'r',  's',  't',  'u',  'v',  'w',  'x',
 'y',  'z',  'Á',  'Â',  'Ú',  'à',  'á',  'â',  'ã',  'è',  'é',  'ê',  'ì',  'í',  'ò',
 'ó',  'ô',  'õ',  'ù',  'ú',  'ý',  'ă',  'Đ',  'đ',  'ĩ',  'ũ',  'ơ',  'ư',  'ạ',
 'Ả',  'ả',  'ấ',  'ầ',  'ẩ',  'ẫ',  'ậ',  'ắ',  'ằ',  'ẳ',  'ẵ',  'ặ',  'ẹ',  'ẻ',
 'ẽ',  'ế',  'ề',  'ể',  'ễ',  'ệ',  'ỉ',  'ị',  'ọ',  'ỏ',  'ố',  'ồ',  'ổ',  'ỗ',
 'ộ',  'ớ',  'ờ',  'ở',  'ỡ',  'ợ',  'ụ',  'ủ',  'ứ',  'ừ',  'ử',  'ữ',  'ự',  'ỳ',  'ỷ',  'ỹ']
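
A labels list like the one above can be generated rather than typed by hand. The sketch below collects every distinct character from the training transcripts and sorts them by code point, which matches the ordering of the list shown. The function name is hypothetical, and the NFC normalization step is an assumption: it is there so that an accented Vietnamese letter stored as a base letter plus combining marks is counted as a single label rather than several.

```python
import unicodedata


def build_labels(transcripts, form="NFC"):
    """Collect the sorted set of characters used in an iterable of transcript lines."""
    chars = set()
    for line in transcripts:
        # Normalize first so composed and decomposed spellings map to one label.
        chars.update(unicodedata.normalize(form, line))
    chars.discard("\n")  # line endings are not part of the vocabulary
    return sorted(chars)
```

If normalization is applied here, the same normalization should be applied to the transcripts used for training, otherwise some characters in the ground truth will have no matching label.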

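Then, the first time around, load the pretrained model and swap in the new vocabulary:

import nemo.collections.asr as nemo_asr
from ruamel.yaml import YAML

# Load the pretrained English model.
model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Read the new labels from the modified YAML file.
yaml = YAML(typ='safe')
yaml_path = 'new_yaml.yaml'
with open(yaml_path) as f:
    params = yaml.load(f)

# Swap the English vocabulary for the new one.
model.change_vocabulary(new_vocabulary=params['labels'])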

Set up the checkpointing as follows:

import pytorch_lightning as pl
import os, datetime

# Trigger the checkpoint logic at the end of every epoch.
class ModelCheckpointAtEpochEnd(pl.callbacks.ModelCheckpoint):
    def on_epoch_end(self, trainer, pl_module):
        metrics = trainer.callback_metrics
        # Make the epoch number available to the checkpoint callback.
        metrics['epoch'] = trainer.current_epoch
        trainer.checkpoint_callback.on_validation_end(trainer, pl_module)

# Tag the checkpoint names with the start time and process ID of this run.
pid = os.getpid()
dt = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

checkpoint_callback = ModelCheckpointAtEpochEnd(
    filepath='vietnamese_'+f'{dt}_{pid}'+'_{epoch:02d}',
    verbose=True,
    save_top_k=-1,  # keep every checkpoint
    save_weights_only=False,
    period=1)

trainer = pl.Trainer(gpus=[0], max_epochs=1000, amp_level='O1', precision=16, checkpoint_callback=checkpoint_callback)

Subsequently, when you load a .ckpt file to continue training, it looks like this (with last_fn being the path to the checkpoint you want to resume from):

model = nemo_asr.models.EncDecCTCModel.load_from_checkpoint(last_fn)
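
One possible way to fill in `last_fn` is to pick the most recently modified checkpoint file. This is only a sketch: the helper name is hypothetical, and the `vietnamese_*.ckpt` pattern assumes the checkpoints carry the `vietnamese_` prefix from the callback above, have a `.ckpt` extension, and live in the directory you pass in.

```python
from pathlib import Path


def latest_checkpoint(directory=".", pattern="vietnamese_*.ckpt"):
    """Return the path of the most recently modified matching file, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    return str(newest)
```

With this, `last_fn = latest_checkpoint()` would select the newest checkpoint in the current directory, and a `None` result signals that there is nothing to resume from yet.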

dangvansam · 2020-10-29T03:02:00Z

@catskillsresearch Why is it necessary to declare both uppercase and lowercase characters in the labels? And is using a pretrained model better than training a model from scratch? I'm also experimenting with Vietnamese.
Thanks!

catskillsresearch · 2020-10-29T04:05:58Z

Author

Not necessarily better. I am just pulling the letters out of the NIST transcription. If there are upper case letters it's because NIST threw them into the transcription somewhere.

Here is the official position from NIST:

We will use caseless scoring for ASR. Sclite will convert both reference and system to a single case before alignment.

Knowing that, next time around, I would take the uppercase characters out of the vocabulary and lowercase the ground truth, to reduce the size of the model. I don't expect it to have a big effect on results though.
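
The change described above could look roughly like this: fold the labels to lowercase and apply the same folding to every ground-truth transcript so the two stay consistent. This is a sketch with hypothetical helper names, relying on Python's built-in `str.lower()`, which handles Vietnamese letters such as Đ and Ả.

```python
def lowercase_labels(labels):
    """Return the sorted, de-duplicated labels after lowercasing each one."""
    return sorted({ch.lower() for ch in labels})


def lowercase_transcript(text):
    """Lowercase a ground-truth transcript so it only uses the reduced labels."""
    return text.lower()
```

Every uppercase label whose lowercase form is already in the list simply disappears, so the output layer gets correspondingly smaller.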

dangvansam · 2020-10-30T10:51:42Z

Yeah. You can try going through phonemes and converting them to graphemes; that gives better results than character-level training. How many hours of data do you have?

catskillsresearch · 2020-10-30T13:11:16Z

Author

I don't have ground truth phonemic transcription. I have audio with grapheme transcription. The whole point of these fancy new neural net architectures is to cut out the middleman, i.e. go Audio -> Grapheme directly rather than Audio -> Phoneme -> Grapheme. That way we can just drop the study of phonemes altogether. Phonemes are old school.

As far as #hours, NIST gave us 20. 20 hours for 2020. That's enough, right? Maybe with a little data augmentation? And that transfer learning?

hoangtuanvu · 2020-10-31T05:28:48Z

@catskillsresearch Could you please send me the link to the Vietnamese ASR dataset?

dangvansam · 2020-10-31T10:05:41Z

Nice!
Can you give me the OpenASR2020 dataset? I just registered, but the website is closed and the data can no longer be downloaded. If possible, please send the link to my email, dangvansam98@gmail.com, and maybe I will share my dataset with you as well. I would use it for research purposes only.
Thanks very much!

catskillsresearch · 2020-10-31T14:01:14Z

Author

@dangvansam98 and @hoangtuanvu, to get this dataset, you first have to register for the NIST OpenASR20 Challenge by following these instructions. However, the registration deadline has passed and the evaluation period is about to begin. You can email NIST at openasr_poc@nist.gov and ask them to let you in; they might say yes.

I can't give you the data myself because of the data usage agreement. You have to sign this agreement and submit it to them as a later part of the registration process.

hoangtuanvu · 2020-11-01T01:57:14Z

@catskillsresearch Thanks for your suggestion
