Visual Speech Recognition For Low-Resource Languages with Automatic Labels From Whisper Model
We will provide all VSR models, training code, and inference code for low-resource languages soon.
We release the automatic labels for the four low-resource languages (French, Italian, Portuguese, and Spanish).
To generate the automatic labels, we first identify the language of every video in VoxCeleb2 and AVSpeech, and then produce the transcriptions (the automatic labels) with a pretrained ASR model. In this project, we use the "whisper/large-v2" model for both steps.
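The two steps above (language identification, then transcription) can be sketched with the open-source `openai-whisper` package. This is a minimal sketch under stated assumptions, not necessarily the pipeline used to produce the released labels: the helper names `label_video` and `load_whisper`, the target-language set, and the decision to drop clips outside that set are all illustrative.

```python
from typing import Any, Dict, Optional, Set

# ISO 639-1 codes of the four target languages (assumed filter set).
TARGET_LANGUAGES: Set[str] = {"fr", "it", "pt", "es"}


def load_whisper(name: str = "large-v2") -> Any:
    """Load a Whisper checkpoint (requires `pip install openai-whisper`)."""
    import whisper  # imported lazily so the helper below has no hard dependency

    return whisper.load_model(name)


def label_video(model: Any, audio_path: str,
                targets: Set[str] = TARGET_LANGUAGES) -> Optional[Dict[str, str]]:
    """Return {"language", "text"} for a clip in a target language, else None.

    `model` is any object exposing Whisper's `transcribe(path)` method.
    When no language is passed, `transcribe` detects it from the start of
    the audio and reports it under the "language" key of the result.
    """
    result = model.transcribe(audio_path)
    language = result["language"]
    if language not in targets:
        return None  # hypothetical policy: skip clips in other languages
    return {"language": language, "text": result["text"].strip()}
```

A typical call would be `label_video(load_whisper(), "clip.wav")`, where `clip.wav` is a placeholder for the audio track extracted from one video.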
Multilingual TEDx (mTEDx), VoxCeleb2, and AVSpeech Datasets.
- Download the mTEDx dataset from the mTEDx link of the official website.
- Download the VoxCeleb2 dataset from the VoxCeleb2 link of the official website.
- Download the AVSpeech dataset from the AVSpeech link of the official website.
If you want to train a VSR model for a specific target language, we recommend using the language-detected files (e.g., link provided in this project) instead of the video lists of the AVSpeech dataset provided on the official website. AVSpeech is very large, so downloading only the videos in your target language considerably reduces the dataset preparation time.
After downloading the datasets, detect the facial landmarks in all videos and crop the mouth region using these landmarks. We recommend preprocessing the videos following Visual Speech Recognition for Multiple Languages.
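To make the cropping step concrete, the sketch below computes a fixed-size mouth box from the landmarks of one frame. It assumes the common 68-point landmark layout, in which indices 48 to 67 outline the mouth, and a 96x96 crop; both are assumptions for illustration. The recommended preprocessing pipeline may also align or smooth the landmarks, which this sketch does not do.

```python
from typing import Sequence, Tuple

# Assumption: 68-point landmarks, where indices 48..67 outline the mouth.
MOUTH_START, MOUTH_END = 48, 68


def mouth_crop_box(landmarks: Sequence[Tuple[float, float]],
                   frame_w: int, frame_h: int,
                   size: int = 96) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1) of a size x size box centred on the mouth."""
    if size > frame_w or size > frame_h:
        raise ValueError("crop size exceeds frame size")
    mouth = landmarks[MOUTH_START:MOUTH_END]
    if not mouth:
        raise ValueError("no mouth landmarks found; expected a 68-point layout")
    # Mouth centre = mean of the mouth landmark coordinates.
    cx = sum(x for x, _ in mouth) / len(mouth)
    cy = sum(y for _, y in mouth) / len(mouth)
    # Shift the box back inside the frame when the mouth is near a border.
    x0 = min(max(int(round(cx - size / 2)), 0), frame_w - size)
    y0 = min(max(int(round(cy - size / 2)), 0), frame_h - size)
    return x0, y0, x0 + size, y0 + size
```

With a frame stored as a NumPy array, the crop itself would then be `frame[y0:y1, x0:x1]`.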
The training code will be available soon.
The inference code will be available soon.
mTEDx Fr
Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
---|---|---|---|---|
best_ckpt.pt | mTEDx | 85 | 65.25 | Fr |
best_ckpt.pt | mTEDx + VoxCeleb2 | 209 | 60.61 | Fr |
best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 331 | 58.30 | Fr |
mTEDx It
Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
---|---|---|---|---|
best_ckpt.pt | mTEDx | 46 | 60.40 | It |
best_ckpt.pt | mTEDx + VoxCeleb2 | 84 | 56.48 | It |
best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 152 | 51.79 | It |
mTEDx Es
Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
---|---|---|---|---|
best_ckpt.pt | mTEDx | 72 | 59.91 | Es |
best_ckpt.pt | mTEDx + VoxCeleb2 | 114 | 54.05 | Es |
best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 384 | 45.71 | Es |
mTEDx Pt
Model | Training Datasets | Training data (h) | WER [%] | Target Languages |
---|---|---|---|---|
best_ckpt.pt | mTEDx | 82 | 59.45 | Pt |
best_ckpt.pt | mTEDx + VoxCeleb2 | 91 | 58.82 | Pt |
best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 420 | 47.89 | Pt |
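The WER [%] column in the tables above is the word error rate: the word-level edit distance (substitutions, deletions, and insertions) between the predicted and reference transcripts, divided by the number of reference words. The function below is a minimal sketch of that standard definition for a single utterance. It is not the project's evaluation script, and it applies no text normalization (casing, punctuation), which a real evaluation would need to define.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For a whole test set, the usual convention is to sum the edit distances and the reference lengths over all utterances before dividing, rather than averaging per-utterance values.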