This repo implements the AVSR (audio-visual speech recognition) task in the `Fairseq==0.8.0` toolkit.
- The dependencies are listed in the `conda_env.yml` file.
- Arguments for `train` or `inference` are the same as in the `speech_recognition` example in the original `Fairseq` toolkit.
- The model is composed of three blocks: 1) `self-attention transformer based modality encoder`, 2) `dual-cross modality attention layer`, and 3) `transformer based attention decoder`.
- The mel-filterbank audio features and pre-trained CNN video features are fed into the model, which then generates a character-based sentence. `WER` and `CER` are calculated by the `Sclite` package using the prediction and ground-truth sentences.
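
To make the two metrics concrete, here is a minimal pure-Python sketch of how WER and CER can be computed from prediction and ground-truth sentences with an edit distance. It is not the scoring code used in this repo and not a replacement for Sclite. The function names `edit_distance` and `error_rate` are hypothetical, and counting spaces as characters for CER is an assumption. Sclite applies its own alignment and normalization, so its numbers may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (ref token missing in hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits divided by total reference length.

    unit="word" gives WER; unit="char" gives CER (assumption: spaces count
    as characters).
    """
    total_errors = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            r, h = list(ref), list(hyp)
        total_errors += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("reference is empty; error rate is undefined")
    return total_errors / total_len


if __name__ == "__main__":
    refs = ["the cat sat"]
    hyps = ["the cat sit"]
    print("WER:", error_rate(refs, hyps, unit="word"))
    print("CER:", error_rate(refs, hyps, unit="char"))
```

Both rates are normalized by the reference length, so they can exceed 1.0 when the prediction contains many insertions.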