The original dataset is available on Phonbank, but a child-project version can be found there; this is where we'll start from.
If you want to prepare the Providence corpus, you must first install this conda environment:

```bash
# Install dependencies
conda env create -f data_prep.yml
conda activate provi

# Install paraphone, used to phonemize sentences
git clone https://github.com/MarvinLvn/paraphone.git
cd paraphone
pip install -e .
```
To force-align the corpus, we'll need abkhazia, whose installation instructions can be found there.

Tips for our cluster users installing abkhazia:

- `gfortran` (required dependency) comes with `module load gcc`
- `clang++` (required dependency) comes with `module load llvm`
First, we need to extract speech segments along with their annotations:

```bash
python scripts/providence/extract_providence.py --audio ~/DATA/CPC_data/train/providence/recordings/raw \
  --annotation ~/DATA/CPC_data/train/providence/annotations/cha/raw \
  --out ~/DATA/CPC_data/train/providence/cleaned
```
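Conceptually, this step pairs each recording with its `.cha` transcript before cutting out the annotated segments. The sketch below illustrates one plausible way of matching files by stem; the function name, file names, and matching logic are assumptions for illustration, not the actual `extract_providence.py` implementation.

```python
# Illustrative sketch: match each audio file to the annotation file
# that shares its stem. Names and layout here are made up.
from pathlib import Path

def pair_by_stem(audio_files, cha_files):
    """Return (audio, annotation) pairs whose file stems match."""
    chas = {Path(c).stem: c for c in cha_files}
    return [(a, chas[Path(a).stem]) for a in audio_files
            if Path(a).stem in chas]

pairs = pair_by_stem(["Alex_010536.wav"],
                     ["Alex_010536.cha", "Lily_020311.cha"])
print(pairs)
# [('Alex_010536.wav', 'Alex_010536.cha')]
```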
As the human-annotated utterance boundaries are a bit off, we'll correct them using what the VTC has identified as SPEECH (we take the intersection of HUMAN & VTC).

WARNING: This script recomputes boundaries in place and will modify the files in the `sentences` and `audio` folders. You might want to back them up before running this command:

```bash
python scripts/providence/correct_boundaries.py --audio ~/DATA/CPC_data/train/providence/cleaned/audio \
  --annotation ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --rttm ~/DATA/CPC_data/train/providence/annotations/vtc/raw
```
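The HUMAN & VTC intersection above can be sketched as a plain interval intersection: keep only the portions of a human-annotated utterance that the VTC also labeled as SPEECH. This is a minimal illustration of the idea with made-up segment times, not the actual `correct_boundaries.py` implementation.

```python
# Sketch of the HUMAN & VTC intersection used to tighten utterance
# boundaries. Segments are (onset, offset) pairs in seconds.
def intersect(human, vtc):
    """Return the pairwise intersections of two segment lists."""
    out = []
    for h_on, h_off in human:
        for v_on, v_off in vtc:
            on, off = max(h_on, v_on), min(h_off, v_off)
            if on < off:  # keep only non-empty overlaps
                out.append((on, off))
    return out

# A human utterance annotated as 0.0-3.0 s, while the VTC detected
# speech in 0.5-2.0 s and 2.5-3.5 s:
print(intersect([(0.0, 3.0)], [(0.5, 2.0), (2.5, 3.5)]))
# [(0.5, 2.0), (2.5, 3.0)]
```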
Then, we need to phonemize the sentences:

```bash
python scripts/providence/phonemize_sentences.py --sentences ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --out ~/DATA/CPC_data/train/providence/cleaned/phonemes
```
This will create two new folders, `phonemes` and `phonemes_with_space`, that contain the phonemized versions of the utterances without and with spaces, respectively. It will also clean the `sentences` folder by removing punctuation and special characters such as `^`. You can deactivate this behavior with the `--no_clean` flag.
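The cleaning step can be sketched as a simple regex filter. The exact character set kept by `phonemize_sentences.py` is an assumption here; this just illustrates removing punctuation and special characters such as `^`.

```python
# Minimal sketch of the sentence cleaning described above; the set of
# characters kept (letters, digits, apostrophes, spaces) is assumed.
import re

def clean_sentence(text):
    # Replace any run of disallowed characters with a space,
    # then collapse repeated whitespace.
    text = re.sub(r"[^A-Za-z0-9' ]+", " ", text)
    return " ".join(text.split())

print(clean_sentence("it's a ^doggy^, isn't it?"))
# it's a doggy isn't it
```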
Next, we BPE-encode the sentences (to later train a BPE-LSTM):

```bash
python scripts/providence/bpe_encode.py --sentences ~/DATA/CPC_data/train/providence/cleaned/sentences
```
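As a reminder of what BPE encoding does, here is a toy sketch: a word is split into characters and a learned table of merge rules is applied in order. The merge rules below are made up for the example; the real script presumably applies a merge table learned from the corpus.

```python
# Toy illustration of BPE encoding, not the bpe_encode.py implementation.
def bpe_encode(word, merges):
    """Apply (left, right) merge rules in order to a character sequence."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

print(bpe_encode("lowest", [("l", "o"), ("lo", "w"), ("e", "s")]))
# ['low', 'es', 't']
```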
If you want to synthesize sentences from Providence, you can run:

```bash
python scripts/providence/synthetize.py --credentials_path /path/to/credentials.json \
```
Then convert the .ogg files to .wav with:

```bash
for ogg in /path/to/providence/audio_synthetized/en-US-Wavenet-I/*/*.ogg; do
  ffmpeg -i ${ogg} -acodec pcm_s16le -ac 1 -ar 16000 ${ogg%.*}.wav;
done
```
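The `${ogg%.*}.wav` substitution in the loop above simply swaps the file extension; the same target path can be derived in Python with `pathlib` (the example path below is made up):

```python
# Deriving the .wav output path from a .ogg input path, mirroring the
# shell's ${ogg%.*}.wav parameter expansion.
from pathlib import Path

ogg = Path("audio_synthetized/en-US-Wavenet-I/Alex/utt_001.ogg")
print(ogg.with_suffix(".wav"))
# audio_synthetized/en-US-Wavenet-I/Alex/utt_001.wav
```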
Last, we create the 30min, 1h, 2h, ..., 128h training sets with:

```bash
python scripts/providence/create_training_sets.py --sentences1 ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --sentences2 ~/DATA/CPC_data/train/providence/cleaned/sentences_bpe \
  --phones1 ~/DATA/CPC_data/train/providence/cleaned/phonemes \
  --phones2 ~/DATA/CPC_data/train/providence/cleaned/phonemes_with_space \
  --audio ~/DATA/CPC_data/train/providence/cleaned/audio \
  --out ~/DATA/CPC_data/train/providence/cleaned/training_sets
```
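One way to think about building duration-limited training sets is to accumulate files until a time budget (30 min, 1 h, ..., 128 h) is reached. The greedy selection below is a hypothetical sketch with made-up file durations; `create_training_sets.py` may select files differently (e.g. balancing speakers).

```python
# Hypothetical sketch of selecting files for a duration-limited
# training set; not the actual create_training_sets.py logic.
def select_subset(durations, budget_sec):
    """Accumulate (file, duration) items until the budget is reached."""
    chosen, total = [], 0.0
    for name, dur in durations:
        if total + dur > budget_sec:
            break
        chosen.append(name)
        total += dur
    return chosen

files = [("a.wav", 900.0), ("b.wav", 700.0), ("c.wav", 500.0)]
print(select_subset(files, 30 * 60))  # 30-minute budget
# ['a.wav', 'b.wav']
```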