Sequential2.0: A Self-Supervised Speech Translation Model Based on DeBERTa and Squeezeformer Using Pseudo Languages

Authors: FelHong Liu, RongCai Zhao

Paper link:

Model Checkpoints

Pre-trained Models

Model	Pre-training updates	Dataset	Link
Sequential2.0 (from HuBERT-base)	400K + 25K	LibriSpeech 960h	Download
Sequential2.0 (from HuBERT-base)	400K + 100K	LibriSpeech 960h	Download
Sequential2.0 (from fat_en_zh)	400K + 25K	AiShell 10Kh	Download
Sequential2.0 (from fat_en_zh)	400K + 100K	AiShell 10Kh	Download

Fine-tuned Models

Model	Pre-training updates	Fine-tuning split	Link
Sequential2.0 (from HuBERT-base)	400K + 25K	LibriSpeech 10h	Download
Sequential2.0 (from HuBERT-base)	400K + 100K	LibriSpeech 100h	Download
Sequential2.0 (from fat_en_zh)	400K + 25K	ted_en_zh 10h	Download
Sequential2.0 (from fat_en_zh)	400K + 100K	ted_en_zh 100h	Download

Pre-trained k-means Models for Pseudo Characters

Number of Clusters	Link
25	Download
100	Download
500	Download

Pre-trained BPE Models for Pseudo Subwords

Number of Clusters	Number of Subwords	Link
25	1000	Download
25	3000	Download
25	10000	Download
25	30000	Download
100	3000	Download
100	10000	Download
100	30000	Download
500	3000	Download
500	10000	Download
500	30000	Download

Usage

Dependency

torch==1.9.0+cu111
torchaudio==0.9.0
tqdm==4.62.3
hydra-core==1.0.7
omegaconf==2.0.6
einops==0.3.0
fire==0.4.0
fairseq==1.0.0a0+bba000d
paddlepaddle==2.4.1
paddlespeech==1.4.1
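
Because the pins above are exact, it can help to confirm what is actually installed before running the scripts. The following is a minimal sketch, not part of the repository: `PINNED` and `check_versions` are illustrative names, and it assumes Python 3.8+ so that `importlib.metadata` is available.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "torch": "1.9.0+cu111",
    "torchaudio": "0.9.0",
    "tqdm": "4.62.3",
    "hydra-core": "1.0.7",
    "omegaconf": "2.0.6",
    "einops": "0.3.0",
    "fire": "0.4.0",
    "fairseq": "1.0.0a0+bba000d",
    "paddlepaddle": "2.4.1",
    "paddlespeech": "1.4.1",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (wanted, found)} for every package that is missing
    or whose installed version differs from the pin."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(PINNED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

An empty result means every pinned package is installed at the listed version.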

Installation

git clone git@github.com:961241279/Sequential2.0.git
cd Sequential2.0
pip install -e .

Download the manifests generated by Paddle

Please download the files from: manifests
Unzip them and put the files under "data/".

Creating Pseudo Subword Tokens

Create wav2vec-style manifest files. Please set LIBRISPEECH_PATH to your LibriSpeech folder, which contains three subfolders: train-clean-100, train-clean-360, and train-other-500.

#librispeech:
mkdir -p manifest/librispeech/train-960
python -m examples.wav2vec.wav2vec_manifest LIBRISPEECH_PATH  --dest manifest/librispeech/train-960 --ext flac --valid-percent 0.01 --path-must-contain train
#aishell:
python utils/aishell.py --tgt-dir=YOUR_DATASET_DIR --src-dir=manifest/aishell
#ted_en_zh:
python utils/ted_en_zh.py --tgt-dir=YOUR_DATASET_DIR --src-dir=manifest/ted_en_zh
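
For orientation, a wav2vec-style manifest written by fairseq's `wav2vec_manifest` script is a plain text file: the first line is the audio root directory, and every following line is a relative audio path and its number of samples, separated by a tab. The sketch below reads such a file back. `read_manifest` is an illustrative helper, not a function from this repository, and it is an assumption that the AiShell and ted_en_zh helper scripts write the same layout.

```python
import os
import tempfile


def read_manifest(tsv_path):
    """Parse a wav2vec-style manifest.

    Line 1 is the audio root directory; each later line holds a relative
    audio path and a sample count, separated by a tab.
    Returns (root, [(absolute_path, n_samples), ...]).
    """
    with open(tsv_path, encoding="utf-8") as f:
        root = f.readline().rstrip("\n")
        entries = []
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate trailing blank lines
            rel_path, n_samples = line.split("\t")
            entries.append((os.path.join(root, rel_path), int(n_samples)))
    return root, entries


if __name__ == "__main__":
    # Build a tiny throwaway manifest to show the layout (paths are made up).
    with tempfile.TemporaryDirectory() as tmp:
        tsv = os.path.join(tmp, "train.tsv")
        with open(tsv, "w", encoding="utf-8") as f:
            f.write("/data/LibriSpeech\n")
            f.write("train-clean-100/19/198/19-198-0000.flac\t32000\n")
        root, entries = read_manifest(tsv)
        print(root)
        print(entries)
```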

Train the k-means model and get cluster indices. Please make sure that you have downloaded the pre-trained HuBERT-base checkpoint to HUBERT_PATH. Note that this step requires a GPU for feature extraction and 64GB of main memory for k-means training. Extracting HuBERT features takes about 15 minutes, training k-means may take about an hour, and dumping the cluster IDs of the whole LibriSpeech 960h data takes more than two hours.

HUBERT_PATH="save/pretrained/hubert_base_ls960.pt"
FAT_PATH="save/pretrained/fat_en_zh.pdparams"
mkdir -p save/pretrained
if ! [ -f $HUBERT_PATH ]; then
    wget https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt -O  $HUBERT_PATH
fi
if ! [ -f $FAT_PATH ]; then
    wget https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/paddle.98.pdparams --no-check-certificate -O $FAT_PATH
fi
bash scripts/pl/extract-features.sh $HUBERT_PATH 9 2 2 500 False
bash scripts/pl/extract-features.sh $FAT_PATH 9 2 2 500 True

where 9, 2, 2, 500 means that we use the 9th layer of HuBERT, a kernel size of 2 and a stride of 2 for average pooling, and 500 clusters in k-means.
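
To make the roles of the kernel size, stride, and cluster count concrete, here is a dependency-free toy sketch of the two operations applied to extracted features: average pooling over time, then assigning each pooled frame to its nearest k-means centroid. It is not the code in `scripts/pl/extract-features.sh`. The function names are illustrative, dropping an incomplete trailing window is an assumption, and the real pipeline works on layer-9 features with 500 learned centroids rather than the tiny hand-written values used here.

```python
def average_pool(frames, kernel_size=2, stride=2):
    """Average groups of consecutive feature frames along the time axis.

    frames is a list of equal-length feature vectors. A trailing window
    shorter than kernel_size is dropped (an assumption of this sketch).
    """
    pooled = []
    for start in range(0, len(frames) - kernel_size + 1, stride):
        window = frames[start:start + kernel_size]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / kernel_size for d in range(dim)])
    return pooled


def assign_clusters(frames, centroids):
    """Return, for each frame, the index of the nearest centroid
    under squared Euclidean distance."""
    ids = []
    for frame in frames:
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        ids.append(dists.index(min(dists)))
    return ids


if __name__ == "__main__":
    # Five 2-dimensional toy frames; real features are much wider.
    frames = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 12.0], [5.0, 5.0]]
    centroids = [[0.0, 0.0], [10.0, 10.0]]  # stand-in for learned centroids
    pooled = average_pool(frames, kernel_size=2, stride=2)
    print(pooled)
    print(assign_clusters(pooled, centroids))
```

With kernel size 2 and stride 2, the sequence of cluster IDs is half as long as the sequence of feature frames.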

Train the BPE model and create pseudo subword tokens.

bash scripts/pl/create-pseudo-language.sh labels/hubert_base-l9-k2s2-fp16-ls0.1/c500 30000
bash scripts/pl/create-pseudo-language.sh labels/fat-l9-k2s2-fp16-ls0.1/c500 30000
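
In the commands above, the second argument (30000) presumably corresponds to the number of subwords listed in the BPE table. As a rough illustration of the idea, the toy sketch below maps cluster IDs to single characters ("pseudo characters") and learns byte-pair-encoding merges over them. This is a hand-written BPE for explanation only: the README does not say which tokenizer library the script uses, and `ids_to_chars`, `merge_pair`, `learn_bpe`, and the Unicode offset are all illustrative assumptions.

```python
from collections import Counter


def ids_to_chars(cluster_ids, offset=0x4E00):
    """Map each cluster id to one Unicode character (offset is arbitrary)."""
    return "".join(chr(offset + i) for i in cluster_ids)


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of pair in seq with the
    concatenated symbol, scanning left to right."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to num_merges merges from an iterable of symbol sequences.

    Returns (merges, corpus): the ordered list of merged pairs and the
    corpus re-segmented with those merges. Stops early once no adjacent
    pair occurs at least twice.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in corpus:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:
            break
        merges.append(best)
        corpus = [merge_pair(seq, best) for seq in corpus]
    return merges, corpus


if __name__ == "__main__":
    # Two toy utterances given as cluster-id sequences.
    utterances = [[7, 3, 7, 3, 9], [7, 3, 9]]
    pseudo_text = [ids_to_chars(u) for u in utterances]
    merges, segmented = learn_bpe(pseudo_text, num_merges=2)
    print(len(merges), [len(s) for s in segmented])
```

Each learned merge becomes one pseudo subword, so more merges give a larger vocabulary and shorter target sequences.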

Pre-training Sequential2.0

bash scripts/sequntial2.0-pt.sh wav2seq-hubert-base-ls960
bash scripts/sequntial2.0-pt.sh wav2seq-fat-base-ls960

Fine-tuning Sequential2.0 on LibriSpeech

To fine-tune a pretrained checkpoint on LibriSpeech with 10h of data, use this command.

bash scripts/sequntial2.0-ft-ls.sh $pretrained_ckpt ft-ls-10h

where $pretrained_ckpt is your pretrained checkpoint.

With 100h of supervised data, use this command.

bash scripts/sequntial2.0-ft-ls.sh $pretrained_ckpt ft-ls-100h

Please make sure that your manifest files are stored in manifest/librispeech. We provide our manifest here for reproducibility. Please change the first line of every tsv file so that the data path is set correctly. We use a pretrained subword tokenizer (link) to convert LibriSpeech transcripts into subword tokens.
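
Editing the first line of each tsv file by hand is error-prone when there are several splits. Below is a minimal sketch that rewrites the root line in every `.tsv` under a manifest directory, relying on the fairseq manifest convention that the first line holds the audio root directory. `set_manifest_root` and `set_root_for_all` are illustrative helpers, not scripts shipped with this repository.

```python
import glob
import os
import tempfile


def set_manifest_root(tsv_path, new_root):
    """Overwrite the first line of one manifest with new_root,
    leaving every other line untouched."""
    with open(tsv_path, encoding="utf-8") as f:
        lines = f.readlines()
    if not lines:
        return  # nothing to rewrite in an empty file
    lines[0] = new_root.rstrip("\n") + "\n"
    with open(tsv_path, "w", encoding="utf-8") as f:
        f.writelines(lines)


def set_root_for_all(manifest_dir, new_root):
    """Apply set_manifest_root to every .tsv directly inside manifest_dir
    and return the list of files that were processed."""
    paths = sorted(glob.glob(os.path.join(manifest_dir, "*.tsv")))
    for path in paths:
        set_manifest_root(path, new_root)
    return paths


if __name__ == "__main__":
    # Demonstrate on a throwaway directory; the paths are made up.
    with tempfile.TemporaryDirectory() as tmp:
        tsv = os.path.join(tmp, "train.tsv")
        with open(tsv, "w", encoding="utf-8") as f:
            f.write("/old/root\n19/198/19-198-0000.flac\t32000\n")
        set_root_for_all(tmp, "/my/LibriSpeech")
        with open(tsv, encoding="utf-8") as f:
            print(f.read())
```

For this repository the call would be along the lines of `set_root_for_all("manifest/librispeech", "/path/to/LibriSpeech")`, with the second argument replaced by your own dataset location.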
