# Formatting refined audio

In this tutorial, you will learn how to apply `VocalForge.text` pipelines to audio files that were processed either manually or through `VocalForge.audio`.

Each step moves you closer to converting plain audio file(s) into a standardized format suited to TTS training, speaker identification, content analysis, and more. The steps can be used independently of each other, so if there's a step that isn't useful to you, feel free to skip it. This demo simply shows all the capabilities of VocalForge. The final dataset follows the same format as [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), but the class can be modified fairly easily to export another format.

The pipelines are as follows:

- `Transcribe` uses OpenAI's Whisper to go through each file and, well, transcribe it into a text document. Pretty self-explanatory.

- `NormalizeText` Taking the raw text from Whisper, NormalizeText will output three text files, along with a copied version of the input audio file. The three text files split the text into utterances, such as sentences or exclamations. Each file will vary in terms of text normalization in order to expand compatibility with any range of preprocessors.

- `Segment` generates timestamps (and confidence levels) for each utterance produced by `NormalizeText`, matching it to a segment of the audio file.

- `SplitAudio` is also very straightforward. It splits the audio file based on the timestamps, provided the segment reaches a certain confidence level (along with offsets and padding, but we will get to that).

- `Export` combines the text generated by `NormalizeText` and the audio files from `SplitAudio` into a nice little folder, along with a metadata file.

More pipelines are coming soon™

NOTE: If you are running locally, it is highly recommended to use a conda environment, which you can create with the command
`conda create -n VocalForge python=3.8 pytorch=1.11.0 torchvision=0.12.0 torchaudio=0.11.0 cudatoolkit=11.3.1 -c pytorch`

### Getting Started

First, let's create our work directory and install `VocalForge`'s text features.

In [2]:
import os
root_path = os.getcwd()
print(root_path)
# create work/text (and the parent work folder) if it doesn't exist yet
work_path = os.path.join(root_path, 'work', 'text')
os.makedirs(work_path, exist_ok=True)

/home/rio/Desktop/VocalForgeDev/VocalForge


In [None]:
#this might take a while
!pip install VocalForge['text']

In [3]:
#creates folders to store files for each step
from VocalForge.text.text_utils import create_core_folders
create_core_folders(['Input_Audio', 'Transcription', 'Normalized', 'Segments', 'Sliced_Audio', 'Dataset'], workdir=os.path.join(root_path, 'work/text'))

In [None]:
from torch import cuda
cuda.empty_cache()

Make sure to put your audio in the 'Input_Audio' folder at this point!

### Transcription

Now that all the sitting around is done, we can finally get to...pushing a button and then waiting a little while longer! Welcome to AI. There are, however, a few things we can do before pressing the play button to make the process a little bit better.

First, let's talk models. Whisper can use models of varying size. Personally, I'd go with the largest model you can get your grubby hands on, which depends on how much VRAM your GPU has. `large` requires ~10 GB, while `medium` and `small` require ~5 GB and ~2 GB respectively.
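
If you'd rather not eyeball it, here is a minimal sketch that picks a model size from the VRAM figures quoted above. The helper names (`pick_whisper_model`, `detect_vram_gb`) are made up for this example and are not part of VocalForge, and falling back to Whisper's `base` model on small GPUs is an assumption.

```python
# Approximate VRAM needs (GB) quoted above, largest first.
VRAM_REQUIREMENTS_GB = [("large", 10), ("medium", 5), ("small", 2)]


def pick_whisper_model(vram_gb: float, fallback: str = "base") -> str:
    """Return the largest model whose approximate VRAM need fits in vram_gb."""
    for name, needed in VRAM_REQUIREMENTS_GB:
        if vram_gb >= needed:
            return name
    # Assumption: use a smaller Whisper model when even `small` won't fit.
    return fallback


def detect_vram_gb() -> float:
    """Total memory of GPU 0 in GB, or 0.0 when torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024 ** 3


print(pick_whisper_model(detect_vram_gb()))
```

The returned string could then be passed as the `model` argument of `Transcribe`. The numbers are rough, so treat the result as a starting point rather than a guarantee that the model will fit.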

The other thing to take note of is the prompt. This allows us to guide the diction Whisper will include in the transcription. For instance, if the speaker(s) stutter or use filler words frequently, and our goal is to have as accurate a transcription as possible, we can include an example of that (as seen below). This also works for slang that isn't in the standard lexicon.

In [None]:
from VocalForge.text import Transcribe
transcriber = Transcribe(
    input_dir=os.path.join(work_path, 'Input_Audio'),
    output_dir=os.path.join(work_path, 'Transcription'),
    model='large',
    prompt = "uh, um, I...I think that what, what you're saying is terrible, mhm. Ya'know?",
    do_write=True
)

In [None]:
transcriber.run()

### Normalization

Ok, so this one is pretty straightforward 🙌 Still some things to know:

`max_length` and `min_length` are pretty self explanatory. The only thing to note is that `min_length` is the minimum length of the utterance after normalization. This is to prevent the utterance from being too short, which can cause issues with the model.
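
To make the length limits concrete, here is a rough sketch of the idea in plain Python. It is not how `NormalizeText` is implemented: it assumes the limits are counted in words and that out-of-range utterances are simply dropped, and `split_utterances` is a name invented for this example.

```python
import re


def split_utterances(text: str, min_length: int = 5, max_length: int = 25) -> list:
    """Split text at sentence-ending punctuation and keep utterances whose
    word count lies within [min_length, max_length] (assumed unit: words)."""
    # Split after '.', '!' or '?' when followed by whitespace.
    candidates = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for utterance in candidates:
        n_words = len(utterance.split())
        if min_length <= n_words <= max_length:
            kept.append(utterance)
    return kept


sample = "Hi. This sentence has exactly six words. Wow!"
print(split_utterances(sample, min_length=5, max_length=25))
```

Here "Hi." and "Wow!" fall below `min_length` and are filtered out, which is the kind of too-short utterance the parameter is meant to guard against.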

`lang` can be set to any supported language, such as en, es, ru, cn, and probably more. This changes the normalization process to be more accurate to the language and its phonetics.

`model` can be changed if a model that improves on the current one becomes available, or to better fit your language. The list can be found [here](https://catalog.ngc.nvidia.com/models)

In [None]:
from VocalForge.text import NormalizeText

normalizer = NormalizeText(
    input_dir=os.path.join(work_path, 'Transcription'),
    output_dir=os.path.join(work_path, 'Normalized'),
    max_length=25,
    min_length=5,
)

In [None]:
normalizer.run()

### Segmentation

Time to align the text and audio! Using another NVIDIA ASR model, we can generate timestamps for each utterance. There's not much to say here, but the `window_size` may be changed based on the length of each audio file, as too small a window size can hinder the performance of the model.

In [None]:
from VocalForge.text import Segment

segmenter = Segment(
    input_dir=os.path.join(work_path, 'Normalized'),
    output_dir=os.path.join(work_path, 'Segments'),
)

In [None]:
segmenter.run()

Now that we have these segments, things won't be perfect. There are some things we can do to improve them, such as applying padding and offsets. I made a little function to help with this.

`threshold` is the confidence level that a segment must reach in order to be included in the final dataset. The closer a segment's score is to 0, the more confident the model is of its timing. Feel free to change this around to your liking (it won't affect the dataset, just the previews).
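
As a small illustration of how the threshold is applied, here is a sketch that parses segment lines and filters them. The line layout (`start end score | text | text`, with the audio path on the first line) is inferred from the preview function in this notebook rather than from a documented spec, the sample lines are made up, and both helper names are hypothetical.

```python
def parse_segment_line(line: str):
    """Parse 'start end score | text | text' into (start, end, score, text).

    Returns None for lines without enough '|' fields, such as the leading
    audio-path line.
    """
    fields = line.strip().split("|")
    if len(fields) < 3:
        return None
    start, end, score = (float(value) for value in fields[0].split())
    return start, end, score, fields[2].strip()


def keep_segment(score: float, threshold: float) -> bool:
    """Scores are negative; values closer to 0 mean a more confident alignment."""
    return score >= -threshold


# Made-up lines for illustration only.
lines = [
    "/path/to/audio.wav",
    "0.50 2.25 -0.31 | hello there | hello there",
    "2.25 4.00 -1.80 | um what | um what",
]
threshold = 0.75
for line in lines:
    parsed = parse_segment_line(line)
    if parsed and keep_segment(parsed[2], threshold):
        print(parsed)
```

With `threshold = 0.75`, only the first segment survives, because -0.31 is closer to 0 than -0.75 while -1.80 is not.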

This will iterate through each file, prompting you to input an offset and padding value. Type in the values and press enter. The function will then display an audio clip with the specified offset and padding. Try to find the best values for each file. Once you have, type "stop" and it will move on to the next file.


In [None]:
import scipy.io.wavfile as wav
import IPython.display as ipd
threshold = 0.75

def test_offset_padding(alignment_file: str):
    offset = 0
    padding = 0
    if not os.path.exists(alignment_file):
        raise ValueError(f"{alignment_file} not found")
    # read the segments, note the first line contains the path to the original audio
    print(f"Reading {alignment_file}")
    with open(alignment_file, "r") as f:
        
        # preview at most the first 10 lines (files with fewer lines are fine)
        sample = [line for _, line in zip(range(10), f)]
        offset = input('Input offset (type "stop" to end): ')
        padding = input('Input padding: ')
        padding = float(padding)
        
        while True:
            print(f"offset: {offset}, padding: {padding}")
            for line in sample:
                line = line.split("|")
                if len(line) == 1:
                    audio_file = line[0].strip()
                    continue
                text = line[2]
                line = line[0].split()
                sampling_rate, signal = wav.read(audio_file)
                if float(line[2]) < -threshold:
                    continue
                segment = [float(line[0]) + float(offset) + padding, float(line[1]) + float(offset) - padding]
                if float(line[0]) + float(offset) < 0:
                    segment[0] = 0
                st, end = segment
                audio = signal[round(st * sampling_rate) : round(end * sampling_rate)]
                print(f"loss: {line[2]}")
                print(f'sample text: {text}')
                ipd.display(ipd.Audio(audio, rate=sampling_rate))
            previous_offset = offset
            offset = input('Input offset (type "stop" to end): ')
            if offset == 'stop':
                offset = previous_offset
                print(f"final values for {alignment_file}: offset: {offset}, padding: {padding}") 
                break
            padding = float(input('Input padding: '))
        return offset, padding

In [None]:
offsets = []
paddings = []
from VocalForge.text import get_files
segment_dir = os.path.join(work_path, 'Segments')
for file in get_files(segment_dir, '.txt'):
    file_dir = os.path.join(segment_dir, file)
    offset, padding = test_offset_padding(alignment_file=file_dir)
    offsets.append(offset)
    paddings.append(padding)

### Splitting Audio

We can now move on to cutting the audio files. Passing along the padding and offset values, along with the confidence threshold and the maximum length of each split audio file (in seconds), we can cut the audio files into segments.
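
For intuition, here is a sketch of how a timestamp pair might become sample indices once the offset and padding are applied. It mirrors the arithmetic of the preview function from the previous section, with a bit of extra clamping. `SplitAudio` may well do this differently internally, and `slice_bounds` is a name made up for this example.

```python
def slice_bounds(start: float, end: float, offset: float, padding: float,
                 sampling_rate: int, n_samples: int):
    """Convert a timestamp pair (seconds) into clamped sample indices.

    The offset shifts both edges; the padding moves the start later and
    the end earlier, as in the preview helper.
    """
    first = round((start + offset + padding) * sampling_rate)
    last = round((end + offset - padding) * sampling_rate)
    # Keep both indices inside the signal and make sure last >= first.
    first = max(0, min(first, n_samples))
    last = max(first, min(last, n_samples))
    return first, last


# One second of 16 kHz audio, segment from 0.25 s to 0.75 s, shifted 0.05 s earlier.
print(slice_bounds(0.25, 0.75, offset=-0.05, padding=0.0,
                   sampling_rate=16000, n_samples=16000))
```

The resulting pair can be used directly to slice a NumPy signal, for example `signal[first:last]`.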

In [None]:
from VocalForge.text.split_audio import SplitAudio
threshold = 0.6
max_duration = 20
split = SplitAudio(
    input_dir=segment_dir,
    output_dir=os.path.join(work_path, 'sliced_audio'),
    threshold=threshold,
    offsets=offsets,
    paddings=paddings,
)

In [None]:
split.run()

### Dataset Creation

Almost done! This step essentially combines the audio we just split with the text normalization we did earlier. The product will be a folder containing the audio files and a metadata file that holds the normalized text and the name of the corresponding audio file.

Note that it currently only exports in LJSpeech format, but it can be easily modified to export in any format.
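
For reference, LJSpeech metadata is a pipe-separated file with no header row, where each row is `id|transcription|normalized transcription` and the audio lives at `wavs/<id>.wav`. Below is a small sketch of a reader for that layout. The sample rows are made up, `read_ljspeech_metadata` is a hypothetical helper, and whether VocalForge writes two or three columns is not confirmed here, so the reader tolerates both.

```python
import csv
import io


def read_ljspeech_metadata(handle):
    """Yield (file_id, text, normalized_text) from LJSpeech-style metadata.

    Rows with only two fields reuse the text as the normalized text.
    """
    # QUOTE_NONE keeps apostrophes and quotation marks in the text untouched.
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue
        normalized = row[2] if len(row) > 2 else row[1]
        yield row[0], row[1], normalized


# Made-up rows for illustration only.
example = io.StringIO(
    "clip_0001|Dr. Smith said hi.|Doctor Smith said hi.\n"
    "clip_0002|It's 5 pm.|It's five p m.\n"
)
for file_id, text, normalized in read_ljspeech_metadata(example):
    print(f"wavs/{file_id}.wav -> {normalized}")
```

To read a real dataset, pass an open file handle instead of the `io.StringIO` object.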

In [None]:
from VocalForge.text.create_dataset import GenerateDataset
dataset = GenerateDataset(
    segment_dir=segment_dir,
    sliced_aud_dir=os.path.join(work_path, 'sliced_audio'),
    output_dir=os.path.join(work_path, 'dataset'),
    threshold=threshold,
)

In [None]:
dataset.run()

Now, every now and again there might be an extra audio file or metadata entry that will throw any preprocessor into a fit. But don't fret! This function will help you find and remove them. It can also sync manual changes, whether you removed metadata entries or audio files.

In [None]:
import os
import pandas as pd

dataset_dir = os.path.join(work_path, 'dataset')
wav_dir = os.path.join(dataset_dir, 'wavs')
metadata_path = os.path.join(dataset_dir, 'metadata.csv')

# flag audio files smaller than 50 kB as likely too short or empty
bad_files = [f for f in os.listdir(wav_dir)
             if os.stat(os.path.join(wav_dir, f)).st_size < 50000]
print(bad_files)

# LJSpeech-style metadata has no header row; column 0 holds the file name without extension
df = pd.read_csv(metadata_path, sep='|', header=None, on_bad_lines='skip')
df = df[~(df[0].astype(str) + '.wav').isin(bad_files)]
df.to_csv(metadata_path, sep='|', index=False, header=False)

# delete every flagged audio file so the wavs folder and metadata stay in sync
for file in bad_files:
    os.remove(os.path.join(wav_dir, file))
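
If you also want to see which files and metadata entries have drifted apart after manual edits, a set comparison does the trick. This is only a sketch: `find_orphans` is a name made up for this example, and it assumes the metadata ids are the audio file names without the `.wav` extension.

```python
import os


def find_orphans(wav_names, metadata_ids):
    """Compare audio file names with metadata ids.

    Returns (ids_of_wavs_without_metadata, metadata_ids_without_wav),
    both sorted.
    """
    wav_ids = {os.path.splitext(name)[0] for name in wav_names
               if name.lower().endswith(".wav")}
    ids = set(metadata_ids)
    return sorted(wav_ids - ids), sorted(ids - wav_ids)


extra_audio, missing_audio = find_orphans(
    ["a.wav", "b.wav", "notes.txt"], ["b", "c"])
print(extra_audio, missing_audio)
```

With the real dataset, `wav_names` would come from `os.listdir` on the `wavs` folder and `metadata_ids` from the first column of `metadata.csv`.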

## And done! Get curatin'!