
# MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

ModelScope | Paper

We propose MMSpeech, a novel multi-modal multi-task encoder-decoder pre-training framework for Mandarin automatic speech recognition (ASR), which employs a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with more than a 40% relative improvement over other pre-training methods.




## Datasets & Checkpoints

| Model | Model Size | Unlabeled Speech | Unlabeled Text | Labeled | Pre-Training | Fine-Tuning |
|---|---|---|---|---|---|---|
| MMSpeech-Base1 | 210M | AISHELL-2 | M6-Corpus | AISHELL-1 | checkpoint | checkpoint |
| MMSpeech-Base2 | 210M | WenetSpeech | M6-Corpus | AISHELL-1 | checkpoint | checkpoint |
| MMSpeech-Large | 609M | WenetSpeech | M6-Corpus | AISHELL-1 | checkpoint | checkpoint |

## Results on AISHELL-1

- Comparing MMSpeech-Base1 with models of the same encoder size and the same amount of unlabeled speech data (CER, %):

| Model | dev (w/o LM) | dev (with LM) | test (w/o LM) | test (with LM) |
|---|---|---|---|---|
| w/o pre-training | 6.4 | 5.2 | 6.8 | 5.7 |
| Data2Vec | 3.8 | 3.7 | 4.1 | 3.9 |
| MMSpeech-Base1 | 2.4 | 2.1 | 2.6 | 2.3 |
| MMSpeech-Base1 (w/o Fine-Tuning) | 2.5 | 2.3 | 2.6 | 2.3 |
- Comparing MMSpeech-Base2 with models of the same encoder size and the same amount of unlabeled speech data (CER, %):

| Model | dev (with LM) | test (with LM) |
|---|---|---|
| Wav2vec 2.0-Base | 4.2 | 4.7 |
| HuBERT-Base | 4.1 | 4.3 |
| MMSpeech-Base2 | 2.0 | 2.1 |
- Comparing MMSpeech-Large with models of the same encoder size and the same amount of unlabeled speech data (CER, %):

| Model | dev (with LM) | test (with LM) |
|---|---|---|
| Wav2vec 2.0-Large | 3.8 | 4.1 |
| HuBERT-Large | 3.1 | 3.3 |
| MMSpeech-Large | 1.6 | 1.9 |

## Quick start

### Installation

Note that we have updated the fairseq version for MMSpeech.

```bash
git clone https://github.com/OFA-Sys/OFA
cd OFA  # enter the repo root before installing dependencies
pip install -r requirements.txt
```

### Data preparation

Input files for all tasks contain three tab-delimited ("\t") columns: speech_id, wav_path, and text (see the example rows after the table below).

  • "wav_path" denotes the path for the wav files.
  • "text" denotes raw text inputs.
  • "pseduo-codes" can be obtained by following the steps in wav2seq.
| Data | Task | speech_id_col | wav_path_col | text_col |
|---|---|---|---|---|
| unlabeled speech data | S2C, MSP | speech_id | wav_path | pseudo-codes |
| unlabeled text data | P2T | speech_id | unused | text |
| speech-text data | S2T | speech_id | wav_path | text |
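
For illustration, the rows below sketch what each manifest type might look like. The IDs, paths, placeholder for the unused column, and pseudo-code tokens are all made up; the wide gaps stand for the tab delimiter.

```
# unlabeled speech data (S2C, MSP): speech_id, wav_path, pseudo-codes
spk001_utt0001    /data/wav/spk001_utt0001.wav    12 87 87 3 451
# unlabeled text data (P2T): the wav_path column is unused
text0001    -    今天天气怎么样
# speech-text data (S2T): speech_id, wav_path, transcript
spk001_utt0002    /data/wav/spk001_utt0002.wav    北京欢迎你
```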

We also provide an example config_yaml of input fbank features for your reference here.
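
If you need to build these manifests yourself, a minimal Python sketch along the following lines should work. The directory layout, the transcript lookup, and the output filename are assumptions for illustration, not part of the released scripts.

```python
import csv
from pathlib import Path

def write_s2t_manifest(wav_dir: str, transcripts: dict, out_path: str) -> None:
    """Write a tab-delimited speech-text (S2T) manifest: speech_id, wav_path, text.

    `transcripts` maps speech_id -> raw text; how you obtain it depends on
    your own corpus preparation (hypothetical here).
    """
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for wav in sorted(Path(wav_dir).glob("**/*.wav")):
            speech_id = wav.stem  # assume the filename stem is the speech_id
            if speech_id in transcripts:
                writer.writerow([speech_id, str(wav), transcripts[speech_id]])

# Example usage with hypothetical paths:
# write_s2t_manifest("/data/aishell1/wav", transcripts, "train_s2t.tsv")
```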

### Training

Run the three training stages in order:

```bash
cd run_scripts/mmspeech
sh mmspeech_cn_base_stage1.sh
sh mmspeech_cn_base_stage2.sh
sh mmspeech_cn_base_stage3.sh
```

### Evaluation

```bash
cd run_scripts/mmspeech
sh evaluate_mmspeech_base.sh
```