# Download the Providence corpus

The original dataset is available on PhonBank. However, a child-project version is also available, and that is the version we start from.
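
The exact download procedure depends on how the child-project version is hosted; if it is distributed as a DataLad dataset (an assumption, with a placeholder URL and dataset name), fetching the recordings and annotations could look like:

```bash
# Hypothetical: install the dataset, then fetch raw recordings and annotations
datalad install <dataset-url>
cd providence
datalad get recordings/raw annotations
```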

## Required dependencies

If you want to prepare the Providence corpus, you must first create the following conda environment:

```bash
# Install dependencies
conda env create -f data_prep.yml
conda activate provi
# Install paraphone, used to phonemize sentences
git clone https://github.com/MarvinLvn/paraphone.git
cd paraphone
pip install -e .
```
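
Before moving on, a quick sanity check (assuming the package is importable as `paraphone`):

```bash
# Should print the confirmation message without an ImportError
python -c "import paraphone; print('paraphone OK')"
```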

To force-align the corpus, we'll also need [abkhazia](https://github.com/bootphon/abkhazia); installation instructions can be found in its repository.

Tips for users of our cluster when installing abkhazia (see the sketch after this list):

- `gfortran` (required dependency) comes with `module load gcc`
- `clang++` (required dependency) comes with `module load llvm`
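
On clusters using Environment Modules or Lmod (an assumption about your setup), loading both toolchains before building abkhazia might look like:

```bash
module load gcc    # provides gfortran
module load llvm   # provides clang++
```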

## Preparing the training sets

First, we need to extract speech segments along with their annotations:

```bash
python scripts/providence/extract_providence.py --audio ~/DATA/CPC_data/train/providence/recordings/raw \
  --annotation ~/DATA/CPC_data/train/providence/annotations/cha/raw \
  --out ~/DATA/CPC_data/train/providence/cleaned
```
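
As a quick sanity check (a sketch assuming the script writes parallel `audio` and `sentences` folders, as the next command expects, with segments saved as .wav files), the two counts below should match:

```bash
# One transcription file per extracted audio segment
find ~/DATA/CPC_data/train/providence/cleaned/audio -name '*.wav' | wc -l
find ~/DATA/CPC_data/train/providence/cleaned/sentences -type f | wc -l
```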

As the human-annotated utterance boundaries are a bit off, we correct them using what the VTC (voice type classifier) has identified as SPEECH: we take the intersection of the HUMAN and VTC segments. WARNING: this script recomputes the boundaries in place, modifying the files in the `sentences` and `audio` folders, so you may want to back them up first, for instance:
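
```bash
# Hypothetical backup: keep pristine copies of the two folders
# that correct_boundaries.py modifies in place
cp -r ~/DATA/CPC_data/train/providence/cleaned/audio ~/DATA/CPC_data/train/providence/cleaned/audio.bak
cp -r ~/DATA/CPC_data/train/providence/cleaned/sentences ~/DATA/CPC_data/train/providence/cleaned/sentences.bak
```

Then recompute the boundaries: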

```bash
python scripts/providence/correct_boundaries.py --audio ~/DATA/CPC_data/train/providence/cleaned/audio \
  --annotation ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --rttm ~/DATA/CPC_data/train/providence/annotations/vtc/raw
```

Then, we need to phonemize sentences:

```bash
python scripts/providence/phonemize_sentences.py --sentences ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --out ~/DATA/CPC_data/train/providence/cleaned/phonemes
```

This will create two new folders, `phonemes` and `phonemes_with_space`, containing the phonemized versions of the utterances without and with spaces, respectively. It will also clean the `sentences` folder by removing punctuation and special characters such as `^`. You can deactivate this behavior with the `--no_clean` flag.
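
To see the difference between the two variants, you can compare the same utterance in both folders (a sketch assuming the two folders mirror each other's layout):

```bash
# Pick the first phonemized file and show it without and with spaces
f=$(find ~/DATA/CPC_data/train/providence/cleaned/phonemes -type f | head -n 1)
cat "$f"
cat "${f/\/phonemes\//\/phonemes_with_space\/}"
```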

We BPE-encode sentences (to later train a BPE-LSTM):

```bash
python scripts/providence/bpe_encode.py --sentences ~/DATA/CPC_data/train/providence/cleaned/sentences
```
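
The folder name below is taken from the training-set command at the end of this page; presumably this step writes a `sentences_bpe` folder alongside `sentences`:

```bash
# BPE-encoded sentences should now exist for the final step
find ~/DATA/CPC_data/train/providence/cleaned/sentences_bpe -type f | head -n 3
```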

If you want to synthesize sentences from Providence, you can run:

```bash
python scripts/providence/synthetize.py --credentials_path /path/to/credentials.json \
  ...
```

Then convert the resulting .ogg files to .wav with:

```bash
for ogg in /path/to/providence/audio_synthetized/en-US-Wavenet-I/*/*.ogg; do
  ffmpeg -i "${ogg}" -acodec pcm_s16le -ac 1 -ar 16000 "${ogg%.*}.wav"
done
```
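
To verify the conversion (`ffprobe` ships with ffmpeg), inspecting one converted file should report `pcm_s16le`, mono, 16000 Hz:

```bash
wav=$(ls /path/to/providence/audio_synthetized/en-US-Wavenet-I/*/*.wav | head -n 1)
ffprobe -hide_banner "$wav"
```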

Last, we create the 30min, 1h, 2h, ..., 128h training sets with the following command:

```bash
python scripts/providence/create_training_sets.py --sentences1 ~/DATA/CPC_data/train/providence/cleaned/sentences \
  --sentences2 ~/DATA/CPC_data/train/providence/cleaned/sentences_bpe \
  --phones1 ~/DATA/CPC_data/train/providence/cleaned/phonemes \
  --phones2 ~/DATA/CPC_data/train/providence/cleaned/phonemes_with_space \
  --audio ~/DATA/CPC_data/train/providence/cleaned/audio \
  --out ~/DATA/CPC_data/train/providence/cleaned/training_sets
```
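
If everything ran, the output folder should contain one subfolder per duration (the exact names are an assumption):

```bash
ls ~/DATA/CPC_data/train/providence/cleaned/training_sets
```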