# Learn OpenAI Whisper - Chapter 9 - Cloning a voice step 1: Converting audio files to LJSpeech format


This notebook is the first step in the three-step voice cloning process outlined in the chapter. It takes an audio sample of the target voice as input and converts it into the LJSpeech dataset format, using the OZEN Toolkit and OpenAI's Whisper to extract speech, transcribe it, and organize the data according to the LJSpeech structure. The resulting dataset, consisting of segmented audio files and their transcriptions, is the input for the second step, "Cloning a voice step 2: Fine-tuning a discrete variational autoencoder using the DLAS toolkit," where a voice cloning model is fine-tuned on it.

## Notebook 2: Process audio files to the LJ format with Whisper and OZEN

This notebook complements the book [Learn OpenAI Whisper](https://a.co/d/1p5k4Tg).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wnomL0dxmU9CgPKIgazR8AocolEYjAe5)

This notebook is based on the [OZEN Toolkit](https://github.com/devilismyfriend/ozen-toolkit) project. Given a folder of files or a single audio file, it extracts the speech, transcribes it using Whisper, and saves the result in the LJ format: segmented audio files in WAV format in the `wavs` folder, and transcriptions in the files `train.txt` and `valid.txt`.

**NOTE**: The notebook stores the files using the following format.

`dataset/`
* ---├── `valid.txt`
* ---├── `train.txt`
* ---├── `wavs/`

`wavs/` directory must contain `.wav` files.

Example for `train.txt` and `valid.txt`:

* `wavs/A.wav|Write the transcribed audio here.`
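Each line in `train.txt` and `valid.txt` therefore pairs a relative WAV path with its transcription, separated by a pipe character. The following is a minimal sketch of how such a line could be parsed; `Entry` and `parse_lj_line` are hypothetical helper names introduced here for illustration and are not part of the OZEN Toolkit.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One row of an LJ-style index file (hypothetical helper type)."""
    wav_path: str
    text: str


def parse_lj_line(line: str) -> Entry:
    """Split one 'wavs/NAME.wav|transcription' line into its two fields."""
    # Split on the first pipe only, so a pipe inside the transcription survives.
    wav_path, sep, text = line.rstrip("\n").partition("|")
    if not sep:
        raise ValueError(f"missing '|' separator: {line!r}")
    if not wav_path.endswith(".wav"):
        raise ValueError(f"expected a .wav path, got: {wav_path!r}")
    return Entry(wav_path, text.strip())


entry = parse_lj_line("wavs/A.wav|Write the transcribed audio here.")
print(entry.wav_path)
print(entry.text)
```

Splitting on the first pipe only is a design choice for this sketch: it keeps the file path unambiguous even if a transcription happens to contain a pipe.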



## 1.	Cloning the OZEN Toolkit repository

The following command clones the OZEN Toolkit repository from GitHub, which contains the necessary scripts and utilities for processing audio files:

In [1]:
!git clone https://github.com/devilismyfriend/ozen-toolkit

Cloning into 'ozen-toolkit'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 35 (delta 15), reused 20 (delta 5), pack-reused 0
Receiving objects: 100% (35/35), 11.37 KiB | 11.37 MiB/s, done.
Resolving deltas: 100% (15/15), done.


## 2.	Installing required libraries

The following commands install the necessary libraries for audio processing, speech recognition, and text formatting:

In [2]:
!pip -q install transformers
!pip -q install huggingface
!pip -q install pydub
!pip -q install yt-dlp
!pip -q install pyannote.audio

# RESTART SESSION

In Google Colab, from the top menu, select `Runtime`, then `Restart session`.
<img src="https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/Restart_the_runtime_600x102.png" width=600>
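After the restart, it can be useful to confirm that the packages installed in step 2 are visible to the new session. The sketch below uses the standard library's `importlib.metadata` for this; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the `pip` commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip commands in step 2.
report = installed_versions(["transformers", "pydub", "yt-dlp", "pyannote.audio"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing can be reinstalled with the corresponding `pip` command before continuing.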

In [1]:
!pip -q install colorama
!pip -q install termcolor
!pip -q install pyfiglet

/content/ozen-toolkit


## 3.	Changing the working directory

The next command changes the working directory to the cloned ozen-toolkit directory:

In [2]:
%cd ozen-toolkit

[Errno 2] No such file or directory: 'ozen-toolkit'
/content/ozen-toolkit
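The `[Errno 2]` message in the output above most likely appears because the session was already inside `/content/ozen-toolkit` when the cell ran, so the relative path `ozen-toolkit` could not be found. As a hedged alternative, the following sketch changes directory only when needed; `enter_dir` is a hypothetical helper, not something the toolkit provides.

```python
import os
from pathlib import Path


def enter_dir(name: str) -> Path:
    """Change into ./name unless the current directory is already called name."""
    cwd = Path.cwd()
    if cwd.name == name:
        return cwd  # already inside the directory; nothing to do
    target = cwd / name
    if not target.is_dir():
        raise FileNotFoundError(f"{target} does not exist; was the repository cloned?")
    os.chdir(target)
    return Path.cwd()
```

Calling `enter_dir("ozen-toolkit")` more than once is safe under this sketch: the second call detects that the working directory is already correct and returns it unchanged.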


## 4.	Downloading a sample audio file

If you do not have an audio file for cloning, this command downloads a sample audio file from the specified URL for demonstration purposes:

In [3]:
# Download sample file
!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3

2024-04-11 21:08:54 URL:https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3 [363247/363247] -> "Learn_OAI_Whisper_Sample_Audio01.mp3" [1]
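The `wget` log above reports a file size of 363247 bytes. A quick sanity check, sketched below, can confirm that the download completed; `has_expected_size` is a hypothetical helper name, and the expected size is taken from that log line rather than from any published checksum.

```python
from pathlib import Path


def has_expected_size(path, expected_bytes: int) -> bool:
    """Return True if the file exists and is exactly expected_bytes long."""
    file_path = Path(path)
    return file_path.is_file() and file_path.stat().st_size == expected_bytes


# Size taken from the wget log line above.
print(has_expected_size("Learn_OAI_Whisper_Sample_Audio01.mp3", 363247))
```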


## 5.	Uploading custom audio files

If you have your own audio file, this code block lets you upload it to the Colab environment. It creates a directory under `/content/ozen-toolkit` (here, `./myaudiofile`) to store the uploaded files and saves them there:

In [12]:
# CUSTOM_VOICE_NAME = "custom"

import os
from google.colab import files

custom_voice_folder = "./myaudiofile"

os.makedirs(custom_voice_folder, exist_ok=True)  # Create the directory if it doesn't exist

for filename, file_data in files.upload().items():
    with open(os.path.join(custom_voice_folder, filename), 'wb') as f:
        f.write(file_data)

%ls -l "$PWD"/{*,.*}

Saving Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3 to Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3
-rw-r--r-- 1 root root 1980150 Apr 11 21:10  /content/ozen-toolkit/20150415-Fracking_the_debate.mp4
-rw-r--r-- 1 root root    2392 Apr 11 20:59  /content/ozen-toolkit/Drag_Here.cmd
-rw-r--r-- 1 root root     276 Apr 11 20:59  /content/ozen-toolkit/environment.yaml
-rw-r--r-- 1 root root      66 Apr 11 20:59  /content/ozen-toolkit/.gitattributes
-rw-r--r-- 1 root root     144 Apr 11 20:59  /content/ozen-toolkit/.gitignore
-rw-r--r-- 1 root root  363247 Apr 11 21:08  /content/ozen-toolkit/Learn_OAI_Whisper_Sample_Audio01.mp3
-rw-r--r-- 1 root root   24361 Apr 11 21:17  /content/ozen-toolkit/Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3
-rw-r--r-- 1 root root   14248 Apr 11 20:59  /content/ozen-toolkit/ozen.py
-rw-r--r-- 1 root root    1066 Apr 11 20:59  /content/ozen-toolkit/README.md
-rw-r--r-- 1 root root      80 Apr 11 20:59  /content/ozen-toolkit/requirements.txt
-rw-r--r-- 1 root root 
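To see which uploaded files are candidates for processing, a folder can be filtered by file extension. This is a minimal sketch: `list_audio_files` is a hypothetical helper, and the extension set is an assumption for illustration, not the list of formats that `ozen.py` accepts.

```python
from pathlib import Path

# Assumed set of audio/video extensions; adjust to match your own files.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".mp4"}


def list_audio_files(folder):
    """Return the sorted names of files in folder with a known audio extension."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


# The folder name matches custom_voice_folder in the upload cell above.
print(list_audio_files("./myaudiofile"))
```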

## 6.	Creating a configuration file
The following code section creates a configuration file named `config.ini` using the `configparser` library. It defines various settings such as the Hugging Face API key, Whisper model, device, diarization and segmentation models, validation ratio, and segmentation parameters:

In [14]:
import configparser

# Create a new ConfigParser object
config = configparser.ConfigParser()

# Add the 'DEFAULT' section and set the options
# NOTE: replace the hf_token value below with your own Hugging Face access token
config['DEFAULT'] = {
    'hf_token': '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',
    'whisper_model': 'openai/whisper-medium',
    'device': 'cuda',
    'diaization_model': 'pyannote/speaker-diarization',
    'segmentation_model': 'pyannote/segmentation',
    'valid_ratio': '0.2',
    'seg_onset': '0.7',
    'seg_offset': '0.55',
    'seg_min_duration': '2.0',
    'seg_min_duration_off': '0.0'
}

# Write the configuration to a file
with open('config.ini', 'w') as configfile:
    config.write(configfile)

# Print the contents of the file
with open('config.ini', 'r') as configfile:
    print(configfile.read())

[DEFAULT]
hf_token = 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
whisper_model = openai/whisper-medium
device = cuda
diaization_model = pyannote/speaker-diarization
segmentation_model = pyannote/segmentation
valid_ratio = 0.2
seg_onset = 0.7
seg_offset = 0.55
seg_min_duration = 2.0
seg_min_duration_off = 0.0
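Because `configparser` stores every value as a string, it can help to read the file back with typed getters and check a few values before running the script. The sketch below is illustrative only: `load_settings` is a hypothetical helper, the token is a placeholder, and the rule that `valid_ratio` must lie strictly between 0 and 1 is an assumption about how a validation ratio is normally interpreted, not documented OZEN behavior.

```python
import configparser

# Sample text in the same shape as the config.ini printed above.
SAMPLE_INI = """
[DEFAULT]
hf_token = hf_your_token_here
whisper_model = openai/whisper-medium
device = cuda
valid_ratio = 0.2
seg_onset = 0.7
seg_offset = 0.55
"""


def load_settings(ini_text: str) -> dict:
    """Parse INI text and return selected DEFAULT options with numeric types."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["DEFAULT"]
    settings = {
        "hf_token": section.get("hf_token", fallback=""),
        "whisper_model": section.get("whisper_model", fallback=""),
        "device": section.get("device", fallback=""),
        "valid_ratio": section.getfloat("valid_ratio", fallback=0.0),
        "seg_onset": section.getfloat("seg_onset", fallback=0.0),
        "seg_offset": section.getfloat("seg_offset", fallback=0.0),
    }
    if not settings["hf_token"]:
        raise ValueError("hf_token is empty; set your Hugging Face token")
    # Assumption: a validation ratio is a fraction strictly between 0 and 1.
    if not 0.0 < settings["valid_ratio"] < 1.0:
        raise ValueError("valid_ratio should be between 0 and 1")
    return settings


settings = load_settings(SAMPLE_INI)
print(settings["whisper_model"], settings["valid_ratio"])
```

To check the real file, the same function could be given the contents of `config.ini`, for example `load_settings(open("config.ini").read())`.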




## 7.	Running the OZEN script

The command in this step runs the `ozen.py` script with the sample audio file (or the file you uploaded) as its argument.

# IMPORTANT:
`ozen.py` requires Hugging Face's `pyannote/segmentation` model. This is a gated model; you MUST request access before attempting to run the next cell. Thankfully, getting access is relatively straightforward and fast.

*   You must already have a Hugging Face account; if you do not have one, see the instructions in the notebook for chapter 3:  [LOAIW_ch03_working_with_audio_data_via_Hugging_Face.ipynb](https://colab.research.google.com/drive/1bIiGyv_YiTdq97a7KrowCceOrZlG2hXL#scrollTo=VCEKs-Y4wAYQ)
*   Visit https://hf.co/pyannote/segmentation to accept the user conditions.

<img src="https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/HF_pyanonote_sementation_gated_model.JPG" width=600>

The script processes the audio file, extracts speech, transcribes it using Whisper, and saves the output in the LJSpeech format. The script saves the LJ format files in a folder called `ozen-toolkit/output/<audio file name + timestamp>/`. Here is an example of the expected file structure:
```
ozen-toolkit/output/
---├── Learn_OAI_Whisper_Sample_Audio01.mp3_2024_03_16-16_36/
------------------├── valid.txt
------------------├── train.txt
------------------├── wavs/
--------------------------├── 0.wav
--------------------------├── 1.wav
--------------------------├── 2.wav
```
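Once the script has finished, the output folder can be checked against this structure. The following sketch assumes that every line of `train.txt` and `valid.txt` has the `wavs/NAME.wav|transcription` form described at the top of this notebook; `find_missing_wavs` is a hypothetical helper, not part of the OZEN Toolkit.

```python
from pathlib import Path


def find_missing_wavs(dataset_dir):
    """Return index files or referenced WAV paths that are absent from dataset_dir."""
    root = Path(dataset_dir)
    missing = []
    for index_name in ("train.txt", "valid.txt"):
        index_file = root / index_name
        if not index_file.is_file():
            missing.append(index_name)
            continue
        for line in index_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # skip blank lines
            # The path is everything before the first pipe character.
            wav_rel = line.split("|", 1)[0]
            if not (root / wav_rel).is_file():
                missing.append(wav_rel)
    return missing
```

An empty list means that both index files exist and that every audio file they reference is present; any other result lists what is missing. The function would be called with the timestamped folder that the script creates under `output/`.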

In [15]:
!python ozen.py Learn_OAI_Whisper_Sample_Audio01.mp3

2024-04-11 21:39:36.166216: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-11 21:39:36.166264: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-11 21:39:36.167639: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  ______    ________   _______ .__   __. 
 /  __  \  |       /  |   ____||  \ |  | 
|  |  |  | `---/  /   |  |__   |   \|  | 
|  |  |  |    /  /    |   __|  |  . `  | 
|  `--'  |   /  /----.|  |____ |  |\   | 
 \______/   /________||_______||__| \__| 

Converting to WAV...
Loading Segment 


## 8.	Mounting Google Drive

These lines mount the user's Google Drive to the Colab environment, allowing access to the drive for saving checkpoints and loading datasets:

In [17]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## 9.	Copying the output to Google Drive

The following command copies the processed output files from the `ozen-toolkit/output` directory to your Google Drive.

<img src="https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/images/ch09_2_Google_Colab_directory.JPG" width=600>


In [19]:
%cp -r /content/ozen-toolkit/output/ /content/gdrive/MyDrive/

After running the cell, open Google Drive in a web browser; you will see a directory called `output` containing the LJ format dataset files.

<img src="https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter09/images/ch09_2_Google_Drive_directory.JPG" width=500>
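The same copy can be done from Python with the standard library, which is convenient when the cell may be re-run. This is a sketch under stated assumptions: `copy_output` is a hypothetical helper, and the paths in the usage note are the ones used in this notebook.

```python
import shutil
from pathlib import Path


def copy_output(src, dest_parent) -> Path:
    """Copy the src directory into dest_parent, keeping its name, and return the copy."""
    src = Path(src)
    dest = Path(dest_parent) / src.name
    # dirs_exist_ok=True (Python 3.8+) lets the copy be repeated without an error.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

In this notebook, the call would be `copy_output("/content/ozen-toolkit/output", "/content/gdrive/MyDrive")`, which creates or updates `MyDrive/output`.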

---
With our audio data now converted to the LJSpeech format, we are ready for the next stage of the voice cloning process: fine-tuning a voice cloning model using the DLAS toolkit. The notebook [LOAIW_ch09_3_Fine_tuning_voice_cloning_with_DLAS.ipynb](/Chapter09/LOAIW_ch09_3_Fine_tuning_voice_cloning_with_DLAS.ipynb) covers that process in detail. Using the DLAS toolkit together with the structured LJSpeech dataset, we can create a personalized voice model that captures the characteristics of our target speaker.