# Voicebox Speech Editing
This tutorial shows how to generate edited speech and zero-shot TTS with Voicebox. The main code is in `scripts/voicebox_edit/test_edit.py`.

## Requirements
Our code evaluates speaker similarity with WavLM-TDNN. To make sure this part of the code runs without errors:
1. Install fairseq with `pip install bitarray git+https://github.com/facebookresearch/fairseq.git#fairseq --no-deps`.
2. Place the [speaker_verification](https://github.com/microsoft/UniSpeech/tree/e3043e2021d49429a406be09b9b8432febcdec73/downstreams/speaker_verification) folder at `scripts/voicebox_edit/speaker_verification`.
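
Before running anything else, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the repository: `check_requirements` is a hypothetical helper, and it assumes it is run from the repository root.

```python
import importlib.util
from pathlib import Path


def check_requirements(root="."):
    """Report whether fairseq is importable and the speaker_verification folder exists.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    sv_dir = Path(root) / "scripts" / "voicebox_edit" / "speaker_verification"
    return {
        # find_spec returns None when the package cannot be found
        "fairseq_installed": importlib.util.find_spec("fairseq") is not None,
        "speaker_verification_present": sv_dir.is_dir(),
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```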

## Load Voicebox From Checkpoints
Voicebox includes a duration prediction model, which predicts phone lengths, and an audio model, which is the main component for speech editing and zero-shot TTS.

In [None]:
from scripts.voicebox_edit.test_edit import MainExc

# checkpoint paths
vb_ckpt_path = "nemo_experiments/checkpoints/a100-GS_XL-DAC-pymha-unet-warmup/checkpoints/vb-val_loss/vb=0.2913-epoch=167-step=500000-last.ckpt"
dp_ckpt_path = "nemo_experiments/checkpoints/dp_no_sil_spn=1.4410-epoch=8.ckpt"
# output path
output_path = "nemo_experiments/gen_dataset"

# create the main executor from the two checkpoints
main_exc = MainExc(vb_ckpt_path=vb_ckpt_path, dp_ckpt_path=dp_ckpt_path, gen_data_dir=output_path, sample_std=0.95)
# print the loaded model to verify it
print(main_exc.model)

Note:
- The code supports training the duration prediction model and the Voicebox audio model separately, so two separate checkpoints may be needed.
- The duration model and the Voicebox audio model are loaded separately, so their configurations do not need to match exactly. It is enough that they use the same acoustic features.
- Warnings about missing keys may appear while loading checkpoints. They are caused by checkpoints saved with an older version of the code.
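
To see what a missing-key warning refers to, the sketch below compares the parameter names a model expects with those stored in a checkpoint. It uses plain Python sets on made-up key names, and `diff_state_dict_keys` is a hypothetical helper. With real PyTorch models, the two inputs would typically come from `model.state_dict().keys()` and the keys of the checkpoint's state dict.

```python
def diff_state_dict_keys(model_keys, ckpt_keys):
    """Return (missing, unexpected) parameter names, each sorted.

    Hypothetical helper for illustration only.
    """
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = sorted(model_keys - ckpt_keys)     # expected by the model, absent from the checkpoint
    unexpected = sorted(ckpt_keys - model_keys)  # stored in the checkpoint, unknown to the model
    return missing, unexpected


# Made-up key names, not taken from the actual Voicebox model
model_keys = ["encoder.weight", "encoder.bias", "proj.weight"]
ckpt_keys = ["encoder.weight", "encoder.bias", "old_head.weight"]

missing, unexpected = diff_state_dict_keys(model_keys, ckpt_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

A checkpoint saved before a layer was added to the code would show that layer under `missing`, which matches the situation described in the last note.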

## Generate SINE Dataset
Paper ref: TBD

In [None]:
# generate json file for LLM to generate transcript edits
main_exc.gen_v3_transcript_json()

# LLM-generated response file; we use zephyr-7b-beta
gpt_file = "nemo_experiments/data_1a_medium.json"

# generate dataset
main_exc.gen_v3(gpt_file=gpt_file)

## Generate RealEdit Dataset with Voicebox
First, ask the VoiceCraft author for the RealEdit dataset and download [`RealEdit.txt`](https://github.com/jasonppy/VoiceCraft/blob/master/RealEdit.txt). Then change the end of line 188 from `5\t5\tsubstitution|substitution` to `5|14,15\t5|15,16\tsubstitution|insertion`.
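
The fix to line 188 can also be applied with a short script. This is a minimal sketch under two assumptions: the file is tab-separated, and line 188 is counted from 1 exactly as a text editor would show it. `patch_line` and `patch_realedit` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path

OLD_SUFFIX = "5\t5\tsubstitution|substitution"
NEW_SUFFIX = "5|14,15\t5|15,16\tsubstitution|insertion"


def patch_line(lines, lineno, old_suffix, new_suffix):
    """Return a copy of `lines` with the end of line `lineno` (1-indexed) replaced.

    Raises ValueError if the line does not end with `old_suffix`, so an
    already-patched or differently numbered file is never changed silently.
    """
    patched = list(lines)
    line = patched[lineno - 1]
    if not line.endswith(old_suffix):
        raise ValueError(f"line {lineno} does not end with the expected text")
    patched[lineno - 1] = line[: len(line) - len(old_suffix)] + new_suffix
    return patched


def patch_realedit(path, lineno=188):
    """Apply the fix to the file at `path`, rewriting it one row per line."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines()
    patched = patch_line(lines, lineno, OLD_SUFFIX, NEW_SUFFIX)
    path.write_text("\n".join(patched) + "\n", encoding="utf-8")
```

Calling `patch_realedit("nemo_experiments/RealEdit/RealEdit.txt")` would apply the change once. A second call raises `ValueError` because the line no longer ends with the old text.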

In [None]:
# RealEdit dataset
realedit_dir = "nemo_experiments/RealEdit"

# RealEdit.txt (renaming it to RealEdit.tsv is recommended, since the file is tab-separated)
filepath = "nemo_experiments/RealEdit/RealEdit.txt"

# dataset output path
output_dir = "nemo_experiments/gen_dataset"

# generate RealEdit dataset
main_exc.gen_RealEdit(realedit_dir=realedit_dir, filepath=filepath, output_dir=output_dir)

Note: during generation, each generated audio file is also evaluated for WER and speaker similarity.
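
For reference, WER is the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. Speaker similarity is commonly reported as the cosine similarity between two speaker embeddings. The sketch below illustrates both metrics in plain Python. It is not the evaluation code in `test_edit.py`, which may apply an ASR model and text normalization first, and the function names are hypothetical.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm
```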

## Generate Speech Editing / Zero-Shot TTS Examples
Use `main_exc.infer.riva_demo(data)` to perform speech editing or zero-shot TTS.
See `main_exc.riva_demo()` for a typical generation pipeline, which is as follows:

```python
# simplified body of main_exc.riva_demo()
datas = main_exc.dataprocessor.get_riva_demo_data(output_dir)
for data in datas:
    ori_mel, edit_mel = main_exc.infer.riva_demo(data)
```

Or, first create a list of editing metadata with the following format, then generate accordingly:

```python
# format of each entry in datas
datas = [
    ...,
    {
        "audio_path": f"{realedit_dir}/Original/{row.wav_fn}",  # path to the original audio, to be edited or used as the reference for zero-shot TTS
        "text": row.orig_transcript,                            # original transcript
        "textgrid_path": textgrid_path,                         # (optional) TextGrid path for the original audio; generated by the script if not provided
        "from": a_data["from"],                                 # part of the original transcript to be replaced
        "to": t_data,                                           # new text that replaces the "from" part
        "edit_type": row.type,                                  # (optional) edit type: "substitution", "insertion", or "deletion"; default is "substitution"
        "out_ori_path": f"{output_dir}_ori/{row.wav_fn}",       # output path for saving the original audio
        "out_gen_path": f"{output_dir}/{row.wav_fn}",           # output path for saving the edited audio
        "out_tts_path": f"{output_dir}/tts_{row.wav_fn}",       # (optional) output path for saving the zero-shot TTS audio used for cut-and-paste editing
    },
]
for data in datas:
    ori_mel, edit_mel = main_exc.infer.riva_demo(data)
```

Note that "out_tts_path" above is used for cut-and-paste speech editing, which takes the original full utterance as a reference and generates the full utterance of the edited transcript. To do actual zero-shot TTS instead, set "from" to the last few words of your original transcript and set "to" to your new transcript.
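
The zero-shot TTS setup described above can be sketched as a small helper that builds one metadata entry. `make_tts_entry` and its `n_tail_words` parameter are hypothetical names introduced here. Only the dictionary keys come from the format shown earlier, and the default of three trailing words is an arbitrary assumption.

```python
def make_tts_entry(audio_path, text, new_text, out_ori_path, out_gen_path, n_tail_words=3):
    """Build one metadata entry for zero-shot TTS.

    Hypothetical helper: "from" is set to the last `n_tail_words` words of the
    reference transcript and "to" is set to the new transcript.
    """
    if n_tail_words < 1:
        raise ValueError("n_tail_words must be at least 1")
    words = text.split()
    if not words:
        raise ValueError("the reference transcript is empty")
    return {
        "audio_path": audio_path,                  # reference audio
        "text": text,                              # transcript of the reference audio
        "from": " ".join(words[-n_tail_words:]),   # last few words of the reference transcript
        "to": new_text,                            # new transcript to synthesize
        "edit_type": "substitution",
        "out_ori_path": out_ori_path,
        "out_gen_path": out_gen_path,
    }


entry = make_tts_entry(
    audio_path="ref.wav",                          # placeholder paths for illustration
    text="hello there my old friend",
    new_text="a brand new sentence",
    out_ori_path="out_ori/ref.wav",
    out_gen_path="out/ref.wav",
)
print(entry["from"], "->", entry["to"])
```

Each entry built this way can then be passed to `main_exc.infer.riva_demo(data)` in the same loop as the other entries.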