# Visual-Speech-Recognition-for-Low-Resource-Languages

Visual Speech Recognition for Low-Resource Languages with Automatic Labels from Whisper Model

We will provide all VSR models, training code, and inference code for low-resource languages soon.

We release the automatic labels for the four low-resource languages (French, Italian, Portuguese, and Spanish).

To generate the automatic labels, we first identify the language of every video in VoxCeleb2 and AVSpeech, and then produce the transcriptions (automatic labels) with a pretrained ASR model. In this project, we use the "whisper/large-v2" model for both steps.
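As a rough illustration only, the two steps can be sketched with the open-source `openai-whisper` package; the file paths and the per-clip filtering logic below are assumptions for illustration, not the authors' released pipeline.

```python
# Minimal sketch of language ID + transcription with the open-source
# "openai-whisper" package. Paths and filtering are illustrative
# assumptions, not the released labeling pipeline.
import whisper

model = whisper.load_model("large-v2")

def auto_label(audio_path: str, target_lang: str = "fr"):
    # transcribe() runs Whisper's language detection automatically
    # when `language` is not specified.
    result = model.transcribe(audio_path)
    if result["language"] != target_lang:
        return None  # clip is not in the target language; skip it
    return result["text"]  # the transcription used as an automatic label

print(auto_label("example_clip.wav"))  # hypothetical input file
```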

## Dataset preparation

Multilingual TEDx (mTEDx), VoxCeleb2, and AVSpeech datasets.

1. Download the mTEDx dataset from the official mTEDx website.
2. Download the VoxCeleb2 dataset from the official VoxCeleb2 website.
3. Download the AVSpeech dataset from the official AVSpeech website.

If you want to train a VSR model for a specific target language, we recommend using the language-detected files provided in this project instead of the video lists from the official AVSpeech website; because the AVSpeech dataset is very large, filtering it by language first greatly reduces dataset preparation time. A rough filtering sketch follows below.
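A hypothetical sketch of that filtering step, assuming the language-detected file lists one clip ID per line (an assumption about its format) and that the official AVSpeech CSV lists the YouTube ID in its first column:

```python
# Hypothetical sketch: keep only AVSpeech entries whose clip ID appears
# in a language-detected list. The file names and the one-ID-per-line
# format of "avspeech_fr_ids.txt" are assumptions for illustration.
with open("avspeech_fr_ids.txt") as f:
    target_ids = {line.strip() for line in f if line.strip()}

# Official AVSpeech CSV rows: YouTube ID, start, end, face x, face y.
with open("avspeech_train.csv") as src, open("avspeech_train_fr.csv", "w") as dst:
    for row in src:
        if row.split(",")[0] in target_ids:
            dst.write(row)
```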

## Preprocessing

After downloading the datasets, detect the facial landmarks in every video and crop the mouth region using those landmarks. We recommend preprocessing the videos following Visual Speech Recognition for Multiple Languages. A rough per-frame cropping sketch follows below.
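As a minimal sketch of mouth-region cropping, here is one way to do it with dlib's 68-point landmark model; this is an assumption for illustration, since the recommended pipeline above uses its own detector and landmark tools.

```python
# Rough per-frame mouth cropping with dlib's 68-point landmarks
# (an illustrative assumption, not the recommended pipeline's tooling).
# Points 48-67 are the mouth in the 68-point scheme.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame, size=96):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None  # no face detected in this frame
    shape = predictor(gray, faces[0])
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)  # mouth center
    half = size // 2
    return frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
```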

## Training the Model

The training code will be available soon.

## Inference

The inference code will be available soon.

## Models

### mTEDx Fr

| Model | Training Datasets | Training Data (h) | WER [%] | Target Languages |
|--------------|------------------------------|-----|-------|----|
| best_ckpt.pt | mTEDx                        | 85  | 65.25 | Fr |
| best_ckpt.pt | mTEDx + VoxCeleb2            | 209 | 60.61 | Fr |
| best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 331 | 58.30 | Fr |

### mTEDx It

| Model | Training Datasets | Training Data (h) | WER [%] | Target Languages |
|--------------|------------------------------|-----|-------|----|
| best_ckpt.pt | mTEDx                        | 46  | 60.40 | It |
| best_ckpt.pt | mTEDx + VoxCeleb2            | 84  | 56.48 | It |
| best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 152 | 51.79 | It |

### mTEDx Es

| Model | Training Datasets | Training Data (h) | WER [%] | Target Languages |
|--------------|------------------------------|-----|-------|----|
| best_ckpt.pt | mTEDx                        | 72  | 59.91 | Es |
| best_ckpt.pt | mTEDx + VoxCeleb2            | 114 | 54.05 | Es |
| best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 384 | 45.71 | Es |

### mTEDx Pt

| Model | Training Datasets | Training Data (h) | WER [%] | Target Languages |
|--------------|------------------------------|-----|-------|----|
| best_ckpt.pt | mTEDx                        | 82  | 59.45 | Pt |
| best_ckpt.pt | mTEDx + VoxCeleb2            | 91  | 58.82 | Pt |
| best_ckpt.pt | mTEDx + VoxCeleb2 + AVSpeech | 420 | 47.89 | Pt |
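Once a checkpoint is downloaded, a quick way to inspect it is sketched below, assuming it is a standard PyTorch checkpoint; the exact layout (e.g., a `model_state_dict` key) is a guess pending the official inference code.

```python
# Minimal sketch for inspecting a released checkpoint, assuming a
# standard PyTorch file; the "model_state_dict" key is a guess, since
# the official training/inference code is not yet released.
import torch

ckpt = torch.load("best_ckpt.pt", map_location="cpu")
state = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))  # parameter names and shapes
```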
