
Discriminative Multi-modality Speech Recognition

In this paper, we propose a two-stage speech recognition model. In the first stage, the target voice is separated from background noise with the help of the corresponding visual information of lip movements, allowing the model to 'listen' clearly. In the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate.
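The overall flow can be summarized by the sketch below; the function names are only placeholders for the two sub-networks described above, not the repository's actual API.

def ae_msr_pipeline(noisy_audio, lip_frames, ae_subnet, msr_subnet):
    # Stage 1: audio enhancement (AE) -- separate the target voice from
    # background noise, guided by the visual information of lip movements.
    enhanced_audio = ae_subnet(noisy_audio, lip_frames)
    # Stage 2: multi-modality speech recognition (MSR) -- combine the enhanced
    # audio with the visual modality again to decode the spoken sentence.
    transcript = msr_subnet(enhanced_audio, lip_frames)
    return transcript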

Paper

Paper (arXiv)

Preparation

First, clone the repository:

git clone https://github.com/JackSyu/Discriminative-Multi-modality-Speech-Recognition.git

Then, enter the project directory and create a data folder:

cd AE-MSR && mkdir data

Requirements

Python 3.5
TensorFlow 1.12.0
CUDA 9.0 or higher
MATLAB (optional)
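To sanity-check the environment before training, a minimal version check (assuming TensorFlow 1.x is already installed) looks like:

import tensorflow as tf

print("TensorFlow version:", tf.__version__)          # expected: 1.12.0
print("GPU available:", tf.test.is_gpu_available())   # needs CUDA 9.0+ and a matching driver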

Data preprocessing

LRS3:
Download the dataset or use your own data.
Extract the video frames and crop the lip region; a minimal cropping sketch follows the commands below.

cd preprocessing
python dataset_tfrecord_trainval.py
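The repository's preprocessing script builds the TFRecords; for the frame-extraction and lip-cropping step itself, a minimal sketch (assuming OpenCV, dlib, and the standard 68-point landmark model, none of which ship with this repo) could look like:

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Hypothetical path to the standard dlib 68-point landmark model.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_mouth_crops(video_path, crop_size=112):
    """Read a video and return one square lip-region crop per frame."""
    cap = cv2.VideoCapture(video_path)
    crops = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            continue
        shape = predictor(gray, faces[0])
        # Landmarks 48-67 outline the mouth in the 68-point scheme.
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
        cx, cy = pts.mean(axis=0).astype(int)
        half = crop_size // 2
        crop = frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
        crops.append(cv2.resize(crop, (crop_size, crop_size)))
    cap.release()
    return crops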

Training & Testing

We train the audio enhancement sub-network and the MSR sub-network separately.

python Train_Audio_Visual_Speech_Enhancement.py
python Train_Audio_Visual_Speech_Recognition.py

Then we freeze the AE sub-network and perform the subsequent joint training; a minimal sketch of the freezing step follows the commands below.

python Train_AE_MSR.py
python Test_AE_MSR.py
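In TF 1.x, one common way to keep the pre-trained AE weights fixed during the joint stage is to restrict the optimizer's var_list to the MSR variables. Below is a minimal sketch with hypothetical scope names and a toy loss; the repository's actual graphs are built in Train_AE_MSR.py.

import tensorflow as tf

with tf.variable_scope("audio_enhancement"):   # pre-trained AE sub-network (frozen)
    ae_w = tf.get_variable("w", shape=[4, 4])
with tf.variable_scope("msr"):                 # MSR sub-network (still trainable)
    msr_w = tf.get_variable("w", shape=[4, 4])

x = tf.placeholder(tf.float32, shape=[None, 4])
loss = tf.reduce_mean(tf.square(tf.matmul(tf.matmul(x, ae_w), msr_w)))

# Only the MSR variables are passed to the optimizer, so the AE weights stay fixed.
msr_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="msr")
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss, var_list=msr_vars)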

Citation

If you find our code useful, please consider citing:

@InProceedings{Xu_2020_CVPR,
  author    = {Xu, Bo and Lu, Cheng and Guo, Yandong and Wang, Jacob},
  title     = {Discriminative Multi-Modality Speech Recognition},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2020}
}
