Visual-Speech-Recognition-for-Low-Resource-Languages

Visual Speech Recognition For Low-Resource Languages with Automatic Labels From Whisper Model

We will provide all VSR models, training code, and inference code for low-resource languages soon.

We release the automatic labels for four low-resource languages (French, Italian, Portuguese, and Spanish).

To generate the automatic labels, we first identify the language of every video in VoxCeleb2 and AVSpeech, and then produce the transcriptions (the automatic labels) with a pretrained ASR model. In this project, we use the "whisper/large-v2" model for both steps.
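The two steps above can be sketched with the open-source openai-whisper package. This is a minimal illustration, not the authors' labeling script: the helper names, the `min_prob` threshold, and the decision to inspect only the first 30 seconds of audio are assumptions made for the example.

```python
from typing import Dict, Optional

# ISO codes of the four target languages (French, Italian, Portuguese, Spanish).
TARGET_LANGUAGES = frozenset({"fr", "it", "pt", "es"})


def pick_target_language(probs: Dict[str, float],
                         targets=TARGET_LANGUAGES,
                         min_prob: float = 0.5) -> Optional[str]:
    """Return the most probable language code if it is a target language
    and its probability reaches min_prob (an assumed threshold); else None."""
    if not probs:
        return None
    lang = max(probs, key=probs.get)
    if lang in targets and probs[lang] >= min_prob:
        return lang
    return None


def label_clip(path: str, model) -> Optional[dict]:
    """Detect the language of one clip and, if it is a target language,
    transcribe it. `model` is a Whisper model that is already loaded."""
    # Imported lazily so the pure helper above has no heavy dependency.
    import whisper

    # Language identification on the first 30 seconds of audio.
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)

    lang = pick_target_language(probs)
    if lang is None:
        return None

    # Transcription with the language fixed to the detected one.
    result = model.transcribe(path, language=lang)
    return {"path": path, "language": lang, "text": result["text"].strip()}
```

A typical call would load the model once with `whisper.load_model("large-v2")` and then run `label_clip` over every clip, keeping only the non-`None` results.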

Dataset preparation

Multilingual TEDx (mTEDx), VoxCeleb2, and AVSpeech datasets.

Download the mTEDx dataset from the mTEDx link of the official website.
Download the VoxCeleb2 dataset from the VoxCeleb2 link of the official website.
Download the AVSpeech dataset from the AVSpeech link of the official website.

If you want to train a VSR model for a specific target language, we recommend using the language-detected files (e.g., the link provided in this project) instead of the video lists of the AVSpeech dataset provided on the official website. AVSpeech is very large, so preparing the full dataset takes a long time; starting from the language-detected files reduces the preparation time.
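As a rough sketch of how a language-detected file can shrink the download list, the snippet below keeps only the AVSpeech rows whose YouTube ID appears in such a file. The format of the language-detected file is an assumption here (one entry per line, with the YouTube ID as the first whitespace-separated field); adapt the parsing to the files actually released. The AVSpeech CSV is assumed to have the YouTube ID in its first column.

```python
import csv
from typing import Iterable, List, Set


def load_detected_ids(lines: Iterable[str]) -> Set[str]:
    """Collect YouTube IDs from a language-detected list.
    Assumed format: one entry per line, ID as the first field."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line:
            ids.add(line.split()[0])
    return ids


def filter_avspeech_rows(csv_lines: Iterable[str],
                         keep_ids: Set[str]) -> List[List[str]]:
    """Keep the AVSpeech CSV rows whose first column (YouTube ID)
    is in keep_ids, preserving the original row order."""
    return [row for row in csv.reader(csv_lines)
            if row and row[0] in keep_ids]
```

Both functions accept any iterable of lines, so they work with open file objects as well as with in-memory lists.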

Preprocessing

After downloading the datasets, detect the facial landmarks in every video and crop the mouth region using those landmarks. We recommend preprocessing the videos as in Visual Speech Recognition for Multiple Languages.
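To illustrate the cropping step, the sketch below computes a fixed-size mouth box from per-frame landmarks. It assumes the common 68-point landmark layout, in which points 48 to 67 outline the mouth, and a 96x96 crop; neither is specified in this README, so treat both as placeholders for whatever the recommended preprocessing pipeline uses.

```python
from typing import Sequence, Tuple

# Mouth points in the 68-point landmark layout (assumed layout).
MOUTH_START, MOUTH_END = 48, 68


def mouth_crop_box(landmarks: Sequence[Tuple[float, float]],
                   frame_w: int, frame_h: int,
                   size: int = 96) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1) of a size x size box centred on the mouth,
    shifted where necessary so that it stays inside the frame."""
    if frame_w < size or frame_h < size:
        raise ValueError("frame is smaller than the requested crop size")

    mouth = landmarks[MOUTH_START:MOUTH_END]
    if not mouth:
        raise ValueError("landmarks do not contain mouth points")

    # Centre of the mouth landmarks.
    cx = sum(p[0] for p in mouth) / len(mouth)
    cy = sum(p[1] for p in mouth) / len(mouth)

    half = size // 2
    # Clamp the top-left corner so the whole box lies within the frame.
    x0 = max(0, min(int(round(cx)) - half, frame_w - size))
    y0 = max(0, min(int(round(cy)) - half, frame_h - size))
    return x0, y0, x0 + size, y0 + size
```

With a frame stored as a height x width array, the crop is then `frame[y0:y1, x0:x1]`. In practice the landmarks are often smoothed over neighbouring frames before cropping to reduce jitter.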

Training the Model

The training code will be available soon.

Inference

The inference code will be available soon.

Models

mTEDx Fr

Model	Training Datasets	Training data (h)	WER [%]	Target Languages
best_ckpt.pt	mTEDx	85	65.25	Fr
best_ckpt.pt	mTEDx + VoxCeleb2	209	60.61	Fr
best_ckpt.pt	mTEDx + VoxCeleb2 + AVSpeech	331	58.30	Fr

mTEDx It

Model	Training Datasets	Training data (h)	WER [%]	Target Languages
best_ckpt.pt	mTEDx	46	60.40	It
best_ckpt.pt	mTEDx + VoxCeleb2	84	56.48	It
best_ckpt.pt	mTEDx + VoxCeleb2 + AVSpeech	152	51.79	It

mTEDx Es

Model	Training Datasets	Training data (h)	WER [%]	Target Languages
best_ckpt.pt	mTEDx	72	59.91	Es
best_ckpt.pt	mTEDx + VoxCeleb2	114	54.05	Es
best_ckpt.pt	mTEDx + VoxCeleb2 + AVSpeech	384	45.71	Es

mTEDx Pt

Model	Training Datasets	Training data (h)	WER [%]	Target Languages
best_ckpt.pt	mTEDx	82	59.45	Pt
best_ckpt.pt	mTEDx + VoxCeleb2	91	58.82	Pt
best_ckpt.pt	mTEDx + VoxCeleb2 + AVSpeech	420	47.89	Pt
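For readers unfamiliar with the metric in the tables above, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The function below is a minimal sketch of that standard definition for a single utterance; it is not the evaluation script used to produce the reported numbers, and it applies no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for one reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Corpus-level WER is usually computed by summing the edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.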

JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
