# Download the Providence corpus

The original dataset is available on PhonBank. However, a child-project version is also available, and that is the version we start from.
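
The exact download procedure depends on how the child-project version is hosted; if it is distributed as a DataLad dataset (an assumption, with a placeholder URL and dataset name), fetching the recordings and annotations could look like:

```bash
# Hypothetical: install the dataset, then fetch raw recordings and annotations
datalad install <dataset-url>
cd providence
datalad get recordings/raw annotations
```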

## Required dependencies

If you want to prepare the Providence corpus, you must first create the following conda environment:

```bash
# Install dependencies
conda env create -f data_prep.yml
conda activate provi
# Install paraphone, used to phonemize sentences
git clone https://github.com/MarvinLvn/paraphone.git
cd paraphone
pip install -e .
```
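
Before moving on, a quick sanity check (assuming the package is importable as `paraphone`):

```bash
# Should print the confirmation message without an ImportError
python -c "import paraphone; print('paraphone OK')"
```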

To force-align the corpus, we'll also need [abkhazia](https://github.com/bootphon/abkhazia); installation instructions can be found in its repository.

Tips for users of our cluster when installing abkhazia (see the sketch after this list):

- `gfortran` (required dependency) comes with `module load gcc`
- `clang++` (required dependency) comes with `module load llvm`
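
On clusters using Environment Modules or Lmod (an assumption about your setup), loading both toolchains before building abkhazia might look like:

```bash
module load gcc    # provides gfortran
module load llvm   # provides clang++
```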

## Preparing the training sets

First, we need to extract speech segments along with their annotations:

```bash
python scripts/providence/extract_providence.py --audio ~/DATA/CPC_data/train/providence/recordings/raw \
  --annotation ~/DATA/CPC_data/train/providence/annotations/cha/raw \
  --out ~/DATA/CPC_data/train/providence/cleaned
```
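
As a quick sanity check (a sketch assuming the script writes parallel `audio` and `sentences` folders, as the next command expects, with segments saved as .wav files), the two counts below should match:

```bash
# One transcription file per extracted audio segment
find ~/DATA/CPC_data/train/providence/cleaned/audio -name '*.wav' | wc -l
find ~/DATA/CPC_data/train/providence/cleaned/sentences -type f | wc -l
```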

As the human-annotated utterance boundaries are a bit off, we correct them using what the VTC (voice type classifier) has identified as SPEECH: we take the intersection of the HUMAN and VTC segments. WARNING: this script recomputes the boundaries in place, modifying the files in the `sentences` and `audio` folders, so you may want to back them up first, for instance:
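
```bash
# Hypothetical backup: keep pristine copies of the two folders
# that correct_boundaries.py modifies in place
cp -r ~/DATA/CPC_data/train/providence/cleaned/audio ~/DATA/CPC_data/train/providence/cleaned/audio.bak
cp -r ~/DATA/CPC_data/train/providence/cleaned/sentences ~/DATA/CPC_data/train/providence/cleaned/sentences.bak
```

Then recompute the boundaries: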

```bash
python scripts/providence/correct_boundaries.py --audio ~/DATA/CPC_data/train/providence/cleaned/audio \
  --annotation ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --rttm ~/DATA/CPC_data/train/providence/annotations/vtc/raw
```

Then, we need to phonemize sentences:

```bash
python scripts/providence/phonemize_sentences.py --sentences ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --out ~/DATA/CPC_data/train/providence/cleaned/phonemes
```

This will create two new folders, `phonemes` and `phonemes_with_space`, containing the phonemized versions of the utterances without and with spaces, respectively. It will also clean the `sentences` folder by removing punctuation and special characters such as `^`. You can deactivate this behavior with the `--no_clean` flag.
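
To see the difference between the two variants, you can compare the same utterance in both folders (a sketch assuming the two folders mirror each other's layout):

```bash
# Pick the first phonemized file and show it without and with spaces
f=$(find ~/DATA/CPC_data/train/providence/cleaned/phonemes -type f | head -n 1)
cat "$f"
cat "${f/\/phonemes\//\/phonemes_with_space\/}"
```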

We BPE-encode sentences (to later train a BPE-LSTM):

```bash
python scripts/providence/bpe_encode.py --sentences ~/DATA/CPC_data/train/providence/cleaned/sentences
```
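
The folder name below is taken from the training-set command at the end of this page; presumably this step writes a `sentences_bpe` folder alongside `sentences`:

```bash
# BPE-encoded sentences should now exist for the final step
find ~/DATA/CPC_data/train/providence/cleaned/sentences_bpe -type f | head -n 3
```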

If you want to synthesize sentences from Providence, you can run:

```bash
python scripts/providence/synthetize.py --credentials_path /path/to/credentials.json \
  ...
```

Then convert the resulting .ogg files to .wav with:

```bash
for ogg in /path/to/providence/audio_synthetized/en-US-Wavenet-I/*/*.ogg; do
  ffmpeg -i "${ogg}" -acodec pcm_s16le -ac 1 -ar 16000 "${ogg%.*}.wav"
done
```
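
To verify the conversion (`ffprobe` ships with ffmpeg), inspecting one converted file should report `pcm_s16le`, mono, 16000 Hz:

```bash
wav=$(ls /path/to/providence/audio_synthetized/en-US-Wavenet-I/*/*.wav | head -n 1)
ffprobe -hide_banner "$wav"
```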

Last, we create the 30min, 1h, 2h, ..., 128h training sets with the following command:

```bash
python scripts/providence/create_training_sets.py --sentences1 ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --sentences2 ~/DATA/CPC_data/train/providence/cleaned/sentences_bpe \
  --phones1 ~/DATA/CPC_data/train/providence/cleaned/phonemes \
  --phones2 ~/DATA/CPC_data/train/providence/cleaned/phonemes_with_space \
  --audio ~/DATA/CPC_data/train/providence/cleaned/audio \
  --out ~/DATA/CPC_data/train/providence/cleaned/training_sets
```
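
If everything ran, the output folder should contain one subfolder per duration (the exact names are an assumption):

```bash
ls ~/DATA/CPC_data/train/providence/cleaned/training_sets
```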