# Voicebox Data Preparation with Montreal Forced Aligner (MFA)
This tutorial shows how to prepare the manifests for the training and validation datasets.

Since Voicebox requires frame-level alignment with phonemes as input, we need to prepare alignments for each utterance in the dataset.
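
To make "frame-level alignment" concrete, here is a minimal sketch of how phoneme intervals (such as those an aligner produces) could be expanded into one label per frame. The function name `intervals_to_frames`, the 10 ms frame shift, the `sil` padding label, and the example intervals are all illustrative assumptions, not part of the Voicebox code.

```python
from typing import List, Tuple


def intervals_to_frames(
    intervals: List[Tuple[float, float, str]],
    num_frames: int,
    frame_shift: float,
    pad: str = "sil",
) -> List[str]:
    """Assign one phoneme label per frame from (start, end, phone) intervals in seconds."""
    labels = [pad] * num_frames  # frames not covered by any interval keep the pad label
    for start, end, phone in intervals:
        first = int(round(start / frame_shift))
        last = min(int(round(end / frame_shift)), num_frames)
        for i in range(first, last):
            labels[i] = phone
    return labels


# Hypothetical alignment: HH spans 0.00-0.05 s, AY1 spans 0.05-0.20 s
intervals = [(0.0, 0.05, "HH"), (0.05, 0.20, "AY1")]
frames = intervals_to_frames(intervals, num_frames=25, frame_shift=0.01)
print(frames)
```

Each frame now carries exactly one phoneme label, which is the kind of input the sentence above refers to.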


## Prepare Environment
**We recommend [installing the MFA environment with conda](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html#general-installation) inside the NeMo docker container.**
Speech editing uses the NeMo environment and MFA at the same time, so both must be accessible from the same place.

### Install [Miniconda](https://docs.anaconda.com/miniconda/#quick-command-line-install)

In [None]:
# Install Miniconda with the official installer script
!mkdir -p ~/miniconda3
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
!bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
!rm ~/miniconda3/miniconda.sh

# Add conda to shell initialization (choose your shell)
!~/miniconda3/bin/conda init bash
!~/miniconda3/bin/conda init zsh

Restart the shell (or reload the window if you are using VS Code) to enable the `conda` command, then run:

In [1]:
! source ~/.bashrc
! echo $PATH

# Set default conda activation to false, so that it doesn't interfere with the NeMo docker environment
!conda config --set auto_activate_base false

/usr/bin:/root/.local/bin:/vscode/vscode-server/bin/linux-x64/fee1edb8d6d72a0ddff41e5f71a671c23ed924b9/bin/remote-cli:/root/miniconda3/bin:/root/miniconda3/condabin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/vscode/vscode-server/bin/linux-x64/fee1edb8d6d72a0ddff41e5f71a671c23ed924b9/bin/remote-cli:/root/miniconda3/bin:/root/miniconda3/condabin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/loc

### Install MFA

In [None]:
# create new conda environment and install montreal forced aligner
!conda create -n aligner -c conda-forge montreal-forced-aligner -y

If the following commands fail in the Jupyter notebook (e.g. with `CondaError: Run 'conda init' before 'conda activate'`), run them in a terminal instead.

In [1]:
# activate conda environment
!conda activate aligner

# download pre-trained MFA models
!mfa model download g2p english_us_arpa
!mfa model download acoustic english_us_arpa
!mfa model download dictionary english_us_arpa
!conda deactivate


CondaError: Run 'conda init' before 'conda activate'

/bin/bash: line 1: mfa: command not found
/bin/bash: line 1: mfa: command not found
/bin/bash: line 1: mfa: command not found



CondaError: Run 'conda init' before 'conda deactivate'



Note: the environment name "aligner" is used throughout the Voicebox project. Keep this name unless you know how to update the code accordingly.
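
As a possible alternative to `conda activate` inside a notebook, `conda run -n <env>` executes a single command in a named environment without activating it. The sketch below only builds the command list; the helper name `mfa_command` is hypothetical, and whether this approach suits your setup is an assumption to verify.

```python
import shlex

ENV_NAME = "aligner"  # environment name used throughout this tutorial


def mfa_command(*args: str, env: str = ENV_NAME) -> list:
    """Build a command that runs `mfa` inside a conda environment without activating it."""
    return ["conda", "run", "-n", env, "mfa", *args]


cmd = mfa_command("model", "download", "acoustic", "english_us_arpa")
print(shlex.join(cmd))
# To execute it from Python: subprocess.run(cmd, check=True)
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.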

### Install Other Pip Requirements

In [None]:
# check if torchaudio is installed
! pip list | grep 'torchaudio'
# if not, run the following commands
!./scripts/installers/install_torchaudio_latest.sh

## Preprocess LibriLight w/ LibriHeavy
- LibriLight is an audio dataset consisting of audiobooks.
- LibriHeavy is a transcribed version of LibriLight: a manifest containing the transcripts of LibriLight's utterances.
- In this section, we cut the audiobook chapter recordings into utterances according to the transcripts provided by LibriHeavy, so that MFA can align each utterance with its transcript.
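
The cutting step can be pictured as slicing a long recording by each utterance's start time and duration. The following is a minimal sketch on a plain Python list; the function name `cut_utterance`, the 16 kHz sample rate, and the segment times are illustrative assumptions. The actual processing below relies on lhotse and the helper functions imported from `mfa_prepare`.

```python
from typing import List, Sequence


def cut_utterance(samples: Sequence[float], sample_rate: int, start: float, duration: float) -> Sequence[float]:
    """Return the samples covering [start, start + duration) seconds."""
    first = int(round(start * sample_rate))
    last = first + int(round(duration * sample_rate))
    return samples[first:last]


sample_rate = 16000  # assumed sample rate for this example
chapter: List[float] = [0.0] * (10 * sample_rate)  # stand-in for a 10 s chapter recording
segments = [(0.5, 2.0), (3.0, 4.25)]  # hypothetical (start, duration) pairs in seconds

utterances = [cut_utterance(chapter, sample_rate, s, d) for s, d in segments]
print([len(u) / sample_rate for u in utterances])
```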

In [6]:
import os
import logging
from functools import partial
import lhotse
from lhotse.recipes.utils import manifests_exist
from lhotse.cut import CutSet, Cut
from lhotse.serialization import load_manifest_lazy_or_eager, load_manifest

from scripts.dataset_processing.tts.libriheavy.mfa_prepare import get_subset_audio, change_prefix, save_texts_and_audios

old_prefix = "download/librilight"  # prefix to replace in libriheavy manifest
librilight_dir = "data/download/LibriLight" # directory with librilight audio data
libriheavy_dir = "data/download/LibriHeavy" # directory with libriheavy manifest data
audio_cuts_dir = "data/aligned/LibriHeavy/raw_data_cuts" # directory to save processed audio data

### Download LibriLight Audios
LibriHeavy repo: https://github.com/k2-fsa/libriheavy/tree/master

In [7]:
# download librilight

dataset_parts = ["small", "medium", "large"]
target_dir = librilight_dir

print("Stage -1: Downloading audio file.")

os.makedirs(target_dir, exist_ok=True)
for subset in dataset_parts:
    logging.info(f"Downloading {subset} subset.")
    if not os.path.exists(f"{target_dir}/{subset}"):
        os.system(f"wget -P {target_dir} -c https://dl.fbaipublicfiles.com/librilight/data/{subset}.tar")
        os.system(f"tar xf {target_dir}/{subset}.tar -C {target_dir}")
    else:
        print(f"Skipping download, {subset} subset exists.")

Stage -1: Downloading audio file.
Skipping download, small subset exists.
Skipping download, medium subset exists.
Skipping download, large subset exists.


### Download LibriHeavy Manifests

In [8]:
# download libriheavy manifests

dataset_parts = ["small", "medium", "large", "dev", "test_clean", "test_other", "test_clean_large", "test_other_large"]
target_dir = libriheavy_dir

print(f"mkdir -p {target_dir}")
os.makedirs(target_dir, exist_ok=True)
for subset in dataset_parts:
    if not manifests_exist(subset, target_dir, ["cuts"], "libriheavy"):
        print(f"Downloading {subset} subset.")
        os.system(f"wget -P {target_dir} -c https://huggingface.co/datasets/pkufool/libriheavy/resolve/main/libriheavy_cuts_{subset}.jsonl.gz")
    else:
        print(f"Skipping download, {subset} subset exists.")

mkdir -p data/download/LibriHeavy
Skipping download, small subset exists.
Skipping download, medium subset exists.
Skipping download, large subset exists.
Skipping download, dev subset exists.
Skipping download, test_clean subset exists.
Skipping download, test_other subset exists.
Skipping download, test_clean_large subset exists.
Skipping download, test_other_large subset exists.


### Process for MFA
Cut the LibriLight audio according to the LibriHeavy transcripts, then store the results in the corpus format and directory structure that MFA requires.

In [None]:
# process libriheavy manifests for MFA

subsets = ["small", "dev", "test_clean", "test_other"]
# subsets = ["medium"]
# subsets = ["large", "test_clean_large", "test_other_large"]

for subset in subsets:
    # can not lazily split with progress bar
    cuts: CutSet = load_manifest_lazy_or_eager(f"{libriheavy_dir}/libriheavy_cuts_{subset}.jsonl.gz", CutSet)
    cuts = cuts.filter(lambda c: ',' not in c.id)
    cuts = cuts.map(partial(change_prefix, old_prefix=old_prefix, new_prefix=librilight_dir))

    storage_path = f"{audio_cuts_dir}/{subset}"
    cuts = cuts.to_eager()
    save_texts_and_audios(cuts=cuts, storage_path=storage_path, num_jobs=32)
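
For reference, MFA reads a corpus in which each speaker has a directory holding audio files and transcript files (`.lab` or `.txt`) that share the same base name. The sketch below writes only the transcript side of such a layout into a temporary directory. The helper `write_transcripts` and the speaker and utterance IDs are hypothetical, and this is not a description of what `save_texts_and_audios` does internally.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_transcripts(corpus_dir: str, utterances: List[Tuple[str, str, str]]) -> List[Path]:
    """Write one transcript per utterance to <corpus_dir>/<speaker>/<utt_id>.lab."""
    paths = []
    for speaker, utt_id, text in utterances:
        speaker_dir = Path(corpus_dir) / speaker
        speaker_dir.mkdir(parents=True, exist_ok=True)
        lab_path = speaker_dir / f"{utt_id}.lab"
        lab_path.write_text(text + "\n", encoding="utf-8")
        paths.append(lab_path)
    return paths


# Hypothetical utterances: (speaker, utterance id, transcript)
utterances = [("spk1", "utt_0001", "hello world"), ("spk2", "utt_0002", "good morning")]
corpus_dir = tempfile.mkdtemp()
written = write_transcripts(corpus_dir, utterances)
print([p.relative_to(corpus_dir).as_posix() for p in written])
```

In a full corpus, the matching audio file (for example `utt_0001.wav`) would sit next to each `.lab` file.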

### MFA align LibriHeavy
If a conda error occurs in the notebook, run the following commands in a terminal instead:

In [None]:
! mkdir -p data/aligned/LibriHeavy/textgrids

# change the subset list in the loop below to match the subsets you processed
# (each `!` line runs in its own subshell, so a variable set on a separate line is not visible to the loop)

! conda activate aligner
! for subset in small dev test_clean test_other; do echo $subset; corpus_dir=data/aligned/LibriHeavy/raw_data_cuts/$subset/; textgrid_dir=data/aligned/LibriHeavy/textgrids/$subset/; mfa align $corpus_dir english_us_arpa english_us_arpa $textgrid_dir --config_path scripts/dataset_processing/tts/libriheavy/mfa_config.yaml --clean True -j 16; done
! conda deactivate

## Preprocess GigaSpeech

### Download GigaSpeech
GigaSpeech is downloaded from the [Hugging Face ESB datasets](https://huggingface.co/datasets/esb/datasets) through the `datasets` API. Remember to request access on the dataset webpage beforehand.

In [None]:
! pip install datasets
import os
import datasets
from datasets import load_dataset
from tqdm import tqdm
import soundfile as sf

# change and modify the following variables
subsets = ["xs", "s", "m", "l", "xl"]
token = "your huggingface token"

def has_valid_audio(ex):
    try:
        sf.read(ex["audio"]["path"])
    except Exception:
        print(ex["audio"]["path"])
        return False
    return True

for subset in tqdm(subsets, desc="subset"):
    ds = load_dataset(
        "esb/datasets", "gigaspeech", subconfig=subset,
        download_config=datasets.DownloadConfig(resume_download=True),
        token=token,  # access token for the gated dataset (`use_auth_token` on older `datasets` versions)
        num_proc=8,
    )
    print(ds)
    ds = ds.cast_column("audio", datasets.Audio(decode=False))
    ds = ds.filter(has_valid_audio)
    ds = ds.cast_column("audio", datasets.Audio(decode=True))
    print(ds)
    for data in ds["train"]:
        pass

### MFA align GigaSpeech
If a conda error occurs in the notebook, run the following commands in a terminal instead:

In [None]:
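# link the Hugging Face cache so that data/download/GigaSpeech/downloads/ resolves to the downloaded files
! mkdir -p data/download
! ln -s ~/.cache/huggingface/datasets data/download/GigaSpeech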
! conda activate aligner
! mfa align data/download/GigaSpeech/downloads/extracted/ english_us_arpa english_us_arpa data/download/GigaSpeech/downloads/MFA/ --config_path scripts/dataset_processing/tts/libriheavy/mfa_config.yaml --clean True -j 16
! conda deactivate

## After MFA
After aligning your training dataset, follow [tutorials/tts/Voicebox_Training.ipynb](Voicebox_Training.ipynb) to train the model.