Would a QuartzNet 15x5 pretrained on English have a notably hard time hearing nuances in Vietnamese? #2110
Replies: 10 comments
-
Hey @catskillsresearch, I don't have a solution to your problem, but I do have a question for you.
-
Hi @gopesh97, I had a similar issue, and the NVIDIA folks gave some help. Here is what I do now to start a new language. To use the pretrained QuartzNet15x5, your training data needs to be sampled at 16000 Hz. If it isn't, start by resampling your data. Then you need to copy and modify the YAML file.
Then, the first time, load the pretrained model:
Set up the checkpointing as follows:
Subsequently, when you load .CKPT files for retraining, that will look like this:
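For the 16000 Hz requirement mentioned at the start of this comment, here is a minimal sketch of how one might find the files that still need resampling. It assumes the training audio is stored as WAV files and uses only Python's standard `wave` module; the function names `wav_sample_rate` and `files_needing_resample` are made up for illustration and are not part of NeMo.

```python
import wave

# Sample rate the pretrained QuartzNet15x5 expects, per the comment above.
TARGET_RATE = 16000


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def files_needing_resample(paths, target_rate=TARGET_RATE):
    """Return the paths whose sample rate differs from target_rate."""
    return [p for p in paths if wav_sample_rate(p) != target_rate]
```

Anything this returns would then go through a resampler of your choice (for example sox or librosa) before training. The sketch only reads headers, so it is cheap to run over a whole dataset.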
-
@catskillsresearch why is it necessary to declare both uppercase and lowercase characters in the labels? And is using the pretrained model better than training a model from scratch? I'm also experimenting with Vietnamese.
-
Not necessarily better. I am just pulling the letters out of the NIST transcription. If there are upper case letters it's because NIST threw them into the transcription somewhere. Here is the official position from NIST:
Knowing that, next time around, I would take the uppercase characters out of the vocabulary and lowercase the ground truth, to reduce the size of the model. I don't expect it to have a big effect on results though.
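Lowercasing the ground truth and rebuilding the character vocabulary could look roughly like the sketch below. It is an illustration using only the standard library, not the code I actually ran; `normalize_transcript` and `build_vocabulary` are hypothetical helper names. The NFC normalization step is an extra assumption on my part: Vietnamese letters with diacritics can be encoded either as one precomposed character or as a base letter plus combining marks, and without normalization the two forms would count as different labels.

```python
import unicodedata


def normalize_transcript(text, lowercase=True):
    """Normalize to NFC so each accented letter is one code point, then optionally lowercase."""
    text = unicodedata.normalize("NFC", text)
    return text.lower() if lowercase else text


def build_vocabulary(transcripts, lowercase=True):
    """Collect the sorted set of characters that occur in the transcripts."""
    chars = set()
    for transcript in transcripts:
        chars.update(normalize_transcript(transcript, lowercase))
    return sorted(chars)


transcripts = ["Xin chào", "XIN CHÀO"]
cased = build_vocabulary(transcripts, lowercase=False)
lowered = build_vocabulary(transcripts, lowercase=True)
print(len(cased), len(lowered))
```

On this toy input the lowercased vocabulary is noticeably smaller than the cased one, which is the reduction in output labels described above.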
-
Yeah. You can try converting phonemes to graphemes. That has given better results than character-level modeling. How many hours of data do you have?
-
I don't have ground truth phonemic transcription. I have audio with grapheme transcription. The whole point of these fancy new neural net architectures is to cut out the middleman, i.e. go Audio -> Grapheme directly rather than Audio -> Phoneme -> Grapheme. That way we can just drop the study of phonemes altogether. Phonemes are old school. As for the number of hours, NIST gave us 20. 20 hours for 2020. That's enough, right? Maybe with a little data augmentation? And that transfer learning?
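If you want to verify the hour count for your own data, here is a small sketch. It assumes the training set is described by a NeMo-style JSON-lines manifest in which every line is a JSON object with a `duration` field in seconds; the function name `total_hours` and the manifest path you pass in are placeholders, not something from the NIST distribution.

```python
import json


def total_hours(manifest_path):
    """Sum the 'duration' field (seconds) over a JSON-lines manifest and return hours."""
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            total_seconds += float(json.loads(line)["duration"])
    return total_seconds / 3600.0
```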
-
@catskillsresearch Could you please send me the link to the Vietnamese ASR dataset?
-
Nice!
-
@dangvansam98 and @hoangtuanvu, to get this dataset, first register for the NIST OpenASR20 Challenge by following these instructions. However, the registration deadline has passed, and the evaluation period is about to begin. You can email NIST at openasr_poc@nist.gov and ask them to let you in; they might say yes. I can't give you the data myself because of the data usage agreement. You have to sign this agreement and submit it to them as a later part of the registration process.
-
@catskillsresearch Thanks for your suggestion.
-
Would a QuartzNet 15x5 pretrained on English have a notably hard time hearing nuances in Vietnamese? Will training gradually improve the pretrained Encoder layer to fix this, or is the Encoder layer too far down, gradient-wise, to notice? I'm asking because I've been retraining the English QuartzNet 15x5 on Vietnamese for a number of days, and it still quite frequently misses what seem to be isolated phonemic subtleties. My intuitive sense is that it is listening with an English "ear" that just can't distinguish the Vietnamese phonemes.
Can you think of some workarounds to improve this other than just grinding on with the training? For example: