# Voicebox Data Preparation with Montreal Forced Aligner (MFA)
This tutorial shows how to prepare manifests for the training and validation datasets.

Since Voicebox requires frame-level alignment with phonemes as input, we need to prepare alignments for each utterance in the dataset.
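
To make "frame-level alignment" concrete, here is a minimal sketch of how phone intervals (as a forced aligner reports them) could be expanded into one phoneme label per frame. The function name, the 10 ms frame shift, the `sil` filler label, and the example intervals are all assumptions for illustration, not the values used by the Voicebox code.

```python
def intervals_to_frame_labels(intervals, num_frames, frame_shift=0.01, fill="sil"):
    """Expand (start_sec, end_sec, phone) intervals into one label per frame."""
    labels = [fill] * num_frames
    for start, end, phone in intervals:
        first = round(start / frame_shift)
        last = min(round(end / frame_shift), num_frames)
        for i in range(first, last):
            labels[i] = phone
    return labels


# Hypothetical alignment for a 0.08 s clip (8 frames at an assumed 10 ms shift)
intervals = [(0.00, 0.03, "HH"), (0.03, 0.08, "AY1")]
frames = intervals_to_frame_labels(intervals, num_frames=8)
print(frames)
```

Frames not covered by any interval keep the filler label, which is one simple way to represent silence.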

**We recommend [installing the MFA environment with conda](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html#general-installation) inside the NeMo Docker container**.

## Install [Miniconda](https://docs.anaconda.com/miniconda/#quick-command-line-install)

In [1]:
# Install Miniconda using the official installer script
!mkdir -p ~/miniconda3
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
!bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
!rm ~/miniconda3/miniconda.sh

# Add conda to shell initialization (choose your shell)
!~/miniconda3/bin/conda init bash
!~/miniconda3/bin/conda init zsh

--2024-08-29 06:03:28--  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.32.241, 104.16.191.158, 2606:4700::6810:bf9e, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.32.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148981743 (142M) [application/octet-stream]
Saving to: ‘/root/miniconda3/miniconda.sh’


2024-08-29 06:03:30 (62.9 MB/s) - ‘/root/miniconda3/miniconda.sh’ saved [148981743/148981743]

PREFIX=/root/miniconda3
Unpacking payload ...

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working... done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python

Restart the shell environment (reload the window if you are using VS Code) to enable the `conda` command, then run:

In [1]:
! source ~/.bashrc
! echo $PATH

# Set default conda activation to false, so that it doesn't interfere with the NeMo docker environment
!conda config --set auto_activate_base false

/usr/bin:/root/.local/bin:/vscode/vscode-server/bin/linux-x64/fee1edb8d6d72a0ddff41e5f71a671c23ed924b9/bin/remote-cli:/root/miniconda3/bin:/root/miniconda3/condabin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/vscode/vscode-server/bin/linux-x64/fee1edb8d6d72a0ddff41e5f71a671c23ed924b9/bin/remote-cli:/root/miniconda3/bin:/root/miniconda3/condabin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/local/cmake/bin:/usr/loc

## Install MFA

In [1]:
# create new conda environment and install montreal forced aligner
!conda create -n aligner -c conda-forge montreal-forced-aligner -y

Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /root/miniconda3/envs/aligner

  added / updated specs:
    - montreal-forced-aligner


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |            2_gnu          23 KB  conda-forge
    aom-3.9.1                  |       hac33072_0         2.6 MB  conda-forge
    atk-1.0-2.38.0             |       h04ea711_2         348 KB  conda-forge
    audioread-3.0.1            |   py39hf3d152e_1          36 KB  conda-forge
    baumwelch-0.3.9            |       h434a139_3         368 KB  conda-forge
    biopython-1.79             |   py39hb9d737c_3         2.6 MB  conda-forge
    brotli-1.1.0               |       hd5903

If the following command fails in the Jupyter notebook (e.g. with `CondaError: Run 'conda init' before 'conda activate'`), run it manually in a terminal instead.

In [1]:
# activate conda environment
!conda activate aligner

# download pre-trained MFA models
!mfa model download g2p english_us_arpa
!mfa model download acoustic english_us_arpa
!mfa model download dictionary english_us_arpa
!conda deactivate


CondaError: Run 'conda init' before 'conda activate'

/bin/bash: line 1: mfa: command not found
/bin/bash: line 1: mfa: command not found
/bin/bash: line 1: mfa: command not found



CondaError: Run 'conda init' before 'conda deactivate'



Note: we use "aligner" as the environment name throughout the Voicebox project. Please keep this name unless you know how to update the code accordingly.
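
Each `!` line in a notebook runs in its own subshell, so an activated environment does not carry over to the next line. One possible workaround, sketched below, is `conda run -n <env>`, which executes a single command inside a named environment without activating it. The helper name `conda_run_command` is hypothetical; the sketch only builds and prints the commands, and the `subprocess.run` call is left commented out.

```python
import shlex
import subprocess  # only needed if you uncomment the run call below

ENV_NAME = "aligner"  # environment name used throughout this tutorial


def conda_run_command(env_name, *args):
    """Build a `conda run` command line that executes `args` inside `env_name`."""
    return ["conda", "run", "-n", env_name, *args]


commands = [
    conda_run_command(ENV_NAME, "mfa", "model", "download", kind, "english_us_arpa")
    for kind in ("g2p", "acoustic", "dictionary")
]

for cmd in commands:
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually download the models
```

If this does not work in your setup, running the original commands in a terminal remains the fallback.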

## Other Requirements

In [None]:
!./scripts/installers/install_torchaudio_latest.sh
!pip install textgrid descript-audio-codec openai-whisper s3prl torchode encodec vocos resemblyzer lhotse==1.26.0
!pip install -U jiwer onnx wandb tensorboard
!pip install bitarray git+https://github.com/facebookresearch/fairseq.git#fairseq --no-deps

# Preprocess LibriLight w/ LibriHeavy
- LibriLight is an audio dataset consisting of audiobooks.
- LibriHeavy is a transcribed version of LibriLight: a manifest containing the transcripts of LibriLight's utterances.
- In this section, we cut the audiobook chapter audio into utterances according to the transcripts provided by LibriHeavy, so that MFA can align each utterance with its transcript.
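
As a rough illustration of the cutting step, the sketch below slices a chapter-level waveform into per-utterance clips from `(start, duration, text)` segments. It is not the implementation of `save_texts_and_audios`; the function name, the 16 kHz sample rate, and the segment values are assumptions, and a plain list stands in for real audio.

```python
def cut_utterances(samples, sample_rate, segments):
    """Slice a chapter waveform into clips.

    segments: iterable of (start_sec, duration_sec, text) tuples.
    Returns a list of (text, clip_samples) pairs.
    """
    clips = []
    for start, duration, text in segments:
        begin = int(round(start * sample_rate))
        end = int(round((start + duration) * sample_rate))
        clips.append((text, samples[begin:end]))
    return clips


sample_rate = 16000  # assumed sample rate
chapter = [0.0] * (sample_rate * 3)  # 3 s of silence standing in for a chapter
segments = [(0.0, 1.0, "first sentence"), (1.5, 1.0, "second sentence")]  # hypothetical

clips = cut_utterances(chapter, sample_rate, segments)
for text, clip in clips:
    print(text, len(clip) / sample_rate)
```

Each resulting clip can then be stored next to its transcript, which is the paired layout a forced aligner needs.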

In [6]:
import os
import logging
from functools import partial
import lhotse
from lhotse.recipes.utils import manifests_exist
from lhotse.cut import CutSet, Cut
from lhotse.serialization import load_manifest_lazy_or_eager, load_manifest

from scripts.dataset_processing.tts.libriheavy.mfa_prepare import get_subset_audio, change_prefix, save_texts_and_audios

old_prefix = "download/librilight"  # prefix to replace in libriheavy manifest
librilight_dir = "data/download/LibriLight" # directory with librilight audio data
libriheavy_dir = "data/download/LibriHeavy" # directory with libriheavy manifest data
audio_cuts_dir = "data/aligned/LibriHeavy/raw_data_cuts" # directory to save processed audio data
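
The `old_prefix` setting exists because the audio paths stored in the LibriHeavy manifests start with `download/librilight`, while the audio here lives under `librilight_dir`. The sketch below shows the kind of path rewrite this implies. It is a guess at the intent of `change_prefix`, not its actual code, and both `replace_prefix` and the example path are hypothetical.

```python
def replace_prefix(path, old, new):
    """Swap a leading `old` prefix for `new`; leave other paths unchanged."""
    if path.startswith(old):
        return new + path[len(old):]
    return path


OLD = "download/librilight"       # same value as old_prefix above
NEW = "data/download/LibriLight"  # same value as librilight_dir above

example = "download/librilight/small/speaker/book/chapter.flac"  # hypothetical path
rewritten = replace_prefix(example, OLD, NEW)
print(rewritten)
```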

## Download LibriLight
LibriHeavy repo: https://github.com/k2-fsa/libriheavy/tree/master

In [7]:
# download librilight

dataset_parts = ["small", "medium", "large"]
target_dir = librilight_dir

print("Stage -1: Downloading audio file.")

os.makedirs(target_dir, exist_ok=True)
for subset in dataset_parts:
    logging.info(f"Downloading {subset} subset.")
    if not os.path.exists(f"{target_dir}/{subset}"):
        os.system(f"wget -P {target_dir} -c https://dl.fbaipublicfiles.com/librilight/data/{subset}.tar")
        os.system(f"tar xf {target_dir}/{subset}.tar -C {target_dir}")
    else:
        print(f"Skipping download, {subset} subset exists.")

Stage -1: Downloading audio file.
Skipping download, small subset exists.
Skipping download, medium subset exists.
Skipping download, large subset exists.


In [8]:
# download libriheavy manifests

dataset_parts = ["small", "medium", "large", "dev", "test_clean", "test_other", "test_clean_large", "test_other_large"]
target_dir = libriheavy_dir

print(f"mkdir -p {target_dir}")
os.makedirs(target_dir, exist_ok=True)
for subset in dataset_parts:
    if not manifests_exist(subset, target_dir, ["cuts"], "libriheavy"):
        print(f"Downloading {subset} subset.")
        os.system(f"wget -P {target_dir} -c https://huggingface.co/datasets/pkufool/libriheavy/resolve/main/libriheavy_cuts_{subset}.jsonl.gz")
    else:
        print(f"Skipping download, {subset} subset exists.")

mkdir -p data/download/LibriHeavy
Skipping download, small subset exists.
Skipping download, medium subset exists.
Skipping download, large subset exists.
Skipping download, dev subset exists.
Skipping download, test_clean subset exists.
Skipping download, test_other subset exists.
Skipping download, test_clean_large subset exists.
Skipping download, test_other_large subset exists.
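
The downloaded manifests are gzip-compressed JSON Lines files (`.jsonl.gz`), so a few records can be inspected with the standard library before loading everything through lhotse. This is a minimal sketch: `peek_manifest` is a hypothetical helper, and the demo writes a tiny stand-in file with made-up fields rather than reading a real LibriHeavy manifest.

```python
import gzip
import itertools
import json
import os
import tempfile


def peek_manifest(path, n=3):
    """Return the first `n` JSON records of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in itertools.islice(f, n)]


# Demo on a tiny stand-in manifest (field names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_cuts.jsonl.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"id": "utt-0", "duration": 1.5}) + "\n")
        f.write(json.dumps({"id": "utt-1", "duration": 2.0}) + "\n")
    records = peek_manifest(demo_path, n=1)

print(records)
```

To look at a real manifest, pass a path such as `f"{libriheavy_dir}/libriheavy_cuts_dev.jsonl.gz"` instead of the demo file.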


In [10]:
# process libriheavy manifests for MFA

subsets = ["small", "dev", "test_clean", "test_other"]
# subsets = ["medium"]
# subsets = ["large", "test_clean_large", "test_other_large"]

for subset in subsets:
    # a lazy CutSet cannot be split with a progress bar, so it is converted to eager below
    cuts: CutSet = load_manifest_lazy_or_eager(f"{libriheavy_dir}/libriheavy_cuts_{subset}.jsonl.gz", CutSet)
    cuts = cuts.filter(lambda c: ',' not in c.id)
    cuts = cuts.map(partial(change_prefix, old_prefix=old_prefix, new_prefix=librilight_dir))

    storage_path = f"{audio_cuts_dir}/{subset}"
    cuts = cuts.to_eager()
    save_texts_and_audios(cuts=cuts, storage_path=storage_path, num_jobs=32)

Storing audio and transcripts: 100%|██████████| 3829/3829 [00:00<00:00, 7394.60it/s]
Storing audio and transcripts (chunks progress): 100%|██████████| 32/32 [00:09<00:00,  3.37it/s]
Storing audio and transcripts: 100%|██████████| 168/168 [00:03<00:00, 44.63it/s]
Storing audio and transcripts (chunks progress): 100%|██████████| 32/32 [00:05<00:00,  5.86it/s]
Storing audio and transcripts: 100%|██████████| 80/80 [00:01<00:00, 69.61it/s]
Storing audio and transcripts (chunks progress): 100%|██████████| 32/32 [00:03<00:00,  9.61it/s]
Storing audio and transcripts: 100%|██████████| 88/88 [00:01<00:00, 65.43it/s]
Storing audio and transcripts (chunks progress): 100%|██████████| 32/32 [00:02<00:00, 10.83it/s]


### (Optional) get eval data only
If your environment does not allow 

# Preprocess GigaSpeech

In [None]:
! pip install datasets
import os
import datasets
from datasets import load_dataset
from tqdm import tqdm
import soundfile as sf

subsets = ["xs", "s", "m", "l", "xl"]

def has_valid_audio(ex):
    try:
        sf.read(ex["audio"]["path"])
    except Exception:
        print(ex["audio"]["path"])
        return False
    return True

for subset in tqdm(subsets, desc="subset"):
    ds = load_dataset(
        "esb/datasets", "gigaspeech", subconfig=subset,
        download_config=datasets.DownloadConfig(resume_download=True),
        num_proc=8,
    )
    print(ds)
    ds = ds.cast_column("audio", datasets.Audio(decode=False))
    ds = ds.filter(has_valid_audio)
    ds = ds.cast_column("audio", datasets.Audio(decode=True))
    print(ds)
    # Iterate once over the training split so that every sample gets decoded
    for data in ds["train"]:
        pass