Audio-visual speech recognition based on DCM

This repository implements the audio-visual speech recognition (AVSR) task with the Fairseq toolkit (fairseq==0.8.0).

The dependencies are listed in the conda_env.yml file.
The training and inference arguments are the same as those of the speech_recognition example in the original Fairseq toolkit.
The model consists of three blocks: 1) a self-attention Transformer-based modality encoder, 2) a dual cross-modality attention layer, and 3) a Transformer-based attention decoder.
Mel-filterbank audio features and video features from a pre-trained CNN are fed into the model, which then generates a character-level sentence.
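
To make block 2 more concrete, here is a minimal PyTorch sketch of a dual cross-modality attention layer, in which each modality queries the other one. It is an illustration only, not the code in this repository: the class name `DualCrossModalityAttention`, the residual connections, the layer normalization, and the sizes (`embed_dim=256`, `num_heads=4`) are all assumptions. Tensors use the (time, batch, channel) layout that Fairseq generally uses.

```python
import torch
import torch.nn as nn


class DualCrossModalityAttention(nn.Module):
    """Hypothetical sketch: audio attends to video and video attends to audio."""

    def __init__(self, embed_dim=256, num_heads=4):
        super().__init__()
        # One attention module per direction (assumed, not taken from the repo).
        self.audio_to_video = nn.MultiheadAttention(embed_dim, num_heads)
        self.video_to_audio = nn.MultiheadAttention(embed_dim, num_heads)
        self.audio_norm = nn.LayerNorm(embed_dim)
        self.video_norm = nn.LayerNorm(embed_dim)

    def forward(self, audio, video):
        # audio: (T_audio, batch, embed_dim), video: (T_video, batch, embed_dim)
        # Audio frames are the queries; video frames are the keys and values.
        audio_ctx, _ = self.audio_to_video(audio, video, video)
        # Video frames are the queries; audio frames are the keys and values.
        video_ctx, _ = self.video_to_audio(video, audio, audio)
        # Residual connection plus layer norm (an assumed design choice).
        return self.audio_norm(audio + audio_ctx), self.video_norm(video + video_ctx)


if __name__ == "__main__":
    layer = DualCrossModalityAttention()
    audio = torch.randn(50, 2, 256)  # 50 audio frames, batch of 2
    video = torch.randn(20, 2, 256)  # 20 video frames, batch of 2
    fused_audio, fused_video = layer(audio, video)
    print(fused_audio.shape, fused_video.shape)
```

Each output keeps the sequence length of its own modality, so the two streams can have different frame rates.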
WER and CER are computed with the Sclite package from the predicted and ground-truth sentences.
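
For reference, the sketch below shows how the two metrics are defined: the edit distance between reference and hypothesis, divided by the reference length, over words for WER and over characters for CER. It is a plain-Python illustration and not a replacement for Sclite, whose text normalization and alignment may give slightly different numbers. The function names `edit_distance`, `error_rate`, `wer`, and `cer` are hypothetical, and counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces included (assumed)."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    ref = "the cat sat"
    hyp = "the cat sit"
    print(f"WER = {wer(ref, hyp):.3f}")  # 1 wrong word out of 3
    print(f"CER = {cer(ref, hyp):.3f}")  # 1 wrong character out of 11
```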
