# Learn OpenAI Whisper - Chapter 9
## Notebook 1: Synthesizing voices with tortoise-tts-fast

This notebook complements the book [Learn OpenAI Whisper](https://a.co/d/1p5k4Tg).

This notebook is based on the [TorToiSe-TTS-Fast](https://github.com/152334H/tortoise-tts-fast) project, which drastically boosts the performance of [TorToiSe](https://github.com/neonbjb/tortoise-tts) without modifying the base models.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zxZ7TCYr8hiU7ExY6QuXjPfVldWtcz-s)

## 1. Setting up the environment:
   - The code starts by cloning the "tortoise-tts-fast" repository and installing the required dependencies.
   - It uses `notebook_login()` from the `huggingface_hub` library to authenticate with Hugging Face.
   - The necessary imports are made, including `torch`, `torchaudio`, and modules from the `tortoise` package.
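
After the installation cells and the runtime restart, it can help to confirm that the packages this notebook imports are visible to the interpreter. The following is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper name (not part of tortoise-tts-fast), and the package list simply mirrors the imports named above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Packages this notebook imports later on (assumed top-level names).
required = ["torch", "torchaudio", "tortoise", "huggingface_hub"]
print("Missing packages:", missing_packages(required))
```

An empty list suggests the environment is ready. Any name that is reported points to an installation cell that should be re-run.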

In [None]:
%%capture loading_libraries
!git clone https://github.com/152334H/tortoise-tts-fast
%cd tortoise-tts-fast
!pip3 install -r requirements.txt --no-deps
!pip3 install -e .

# RESTART NOTEBOOK BEFORE CONTINUING
![Restart_the_runtime_600x102.png](https://raw.githubusercontent.com/PacktPublishing/Learn-OpenAI-Whisper/main/Chapter08/Restart_the_runtime_600x102.png)

***Install BigVGAN: A Universal Neural Vocoder with Large-Scale Training***

In [None]:
!pip3 install -q git+https://github.com/152334H/BigVGAN.git
!pip install -q transformers==4.29.2
!pip install -q voicefixer==0.1.2
%cd tortoise-tts-fast

  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 60.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.2/52.2 kB 2.1 MB/s eta 0:00:00
/content/tortoise-tts-fast


In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from huggingface_hub import whoami

whoami()
# you should see something like {'type': 'user',  'id': '...',  'name': 'Wauplin', ...}

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


{'type': 'user',
 'id': '6404c20d3d49e1e066b97c10',
 'name': 'jbatista79',
 'fullname': 'Josue Batista',
 'email': 'josue@josuebatista.com',
 'emailVerified': True,
 'canPay': False,
 'periodEnd': None,
 'isPro': False,
 'avatarUrl': '/avatars/e686904a94e267b4570907a7e734fbb4.svg',
 'orgs': [],
 'auth': {'type': 'access_token',
  'accessToken': {'displayName': 'Learn OpenAI Whisper', 'role': 'write'}}}

## 2. Initializing the TextToSpeech model:
   ```python
   from tortoise.api import TextToSpeech
   tts = TextToSpeech()
   ```
   - The `TextToSpeech` class is imported from the `tortoise.api` module.
   - An instance of `TextToSpeech` is created, which will download all the required models from the Hugging Face hub.


In [None]:
# Imports used through the rest of the notebook.
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F

import IPython

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices

# This will download all the models used by Tortoise from the HuggingFace hub.
tts = TextToSpeech()

config.json:   0%|          | 0.00/2.11k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading autoregressive.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth...
Done.
Downloading diffusion_decoder.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth...
Done.
Downloading clvp2.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clvp2.pth...
Done.
Downloading vocoder.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth...
Done.


In [None]:
#@markdown Tortoise will attempt to mimic voices you provide. It comes pre-packaged
#@markdown with some voices you might recognize.

#@markdown Let's list all the voices available. These are just some random clips I've gathered
#@markdown from the internet as well as a few voices from the training dataset.
#@markdown Feel free to add your own clips to the voices/ folder.
%ls tortoise/voices

angie/                emma/     lj/      rainbow/       train_daws/     train_kennard/
applejack/            freeman/  mol/     snakes/        train_dotrice/  train_lescault/
cond_latent_example/  geralt/   myself/  tim_reynolds/  train_dreams/   train_mouse/
daniel/               halle/    pat/     tom/           train_empire/   weaver/
deniro/               jlaw/     pat2/    train_atkins/  train_grace/    william/


## 3. Selecting a voice:
   - The code uses the `os` module to list all the available voice folders in the "tortoise/voices" directory.
   - It creates a dropdown widget using `Dropdown` from the `ipywidgets` library to allow the user to select a voice folder.
   - Another dropdown widget is created to select a specific voice file within the selected voice folder.
   - The selected voice can be played using `IPython.display.Audio`.
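
`os.listdir` returns every entry in a directory, including stray files. The sketch below shows one way to feed the dropdowns only sub-folders and only audio clips. `list_voice_folders` and `list_voice_clips` are hypothetical helper names, and the `.wav` extension filter is an assumption about what the voice folders contain.

```python
import os


def list_voice_folders(voices_dir):
    """Return the sorted names of the sub-directories of voices_dir."""
    return sorted(
        name for name in os.listdir(voices_dir)
        if os.path.isdir(os.path.join(voices_dir, name))
    )


def list_voice_clips(voices_dir, voice_name, extensions=(".wav",)):
    """Return the sorted clip file names inside one voice folder."""
    folder = os.path.join(voices_dir, voice_name)
    return sorted(
        name for name in os.listdir(folder)
        # Compare case-insensitively so '1.WAV' is also accepted.
        if name.lower().endswith(extensions)
    )


# Only runs when the repository layout from this notebook is present.
if os.path.isdir("tortoise/voices"):
    print(list_voice_folders("tortoise/voices"))
```

The returned lists can be passed directly as the `options` argument of the `Dropdown` widgets.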

In [None]:
import os
from ipywidgets import Dropdown

voices_dir = "tortoise/voices"

# Get a list of all directories in the voices directory
voice_names = os.listdir(voices_dir)

voice_folder = Dropdown(
    options=sorted(voice_names),
    description='Select a voice:',
    value='freeman',
    disabled=False,
    style={'description_width': 'initial'},
)

voice_folder

Dropdown(description='Select a voice:', index=6, options=('angie', 'applejack', 'cond_latent_example', 'daniel…

In [None]:
import os
from ipywidgets import Dropdown

voices_dir = f"tortoise/voices/{voice_folder.value}"

# Get a list of all audio files in the selected voice folder
voice_files = os.listdir(voices_dir)

voice = Dropdown(
    options=sorted(voice_files),
    description='Select a voice:',
    # value='tom',
    disabled=False,
    style={'description_width': 'initial'},
)

voice

Dropdown(description='Select a voice:', options=('1.wav', '2.wav', '3.wav'), style=DescriptionStyle(descriptio…

In [None]:
# Play the voice clip selected in the dropdown above
IPython.display.Audio(filename=f'tortoise/voices/{voice_folder.value}/{voice.value}')

## 4. Generating speech with a selected voice:
   - The text to be spoken is defined in the `text` variable.
   - The `preset` variable determines the quality of the generated speech (options: "ultra_fast", "fast", "standard", "high_quality").
   - The selected voice is loaded using `load_voice` from `tortoise.utils.audio`, which returns `voice_samples` and `conditioning_latents`.
   - The `tts_with_preset` method of the `tts` object is called with the text, voice samples, conditioning latents, and preset to generate the speech.
   - The generated speech is saved as a WAV file using `torchaudio.save` and played using `IPython.display.Audio`.
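
Because a single generation can take several minutes, it may be worth validating the preset name before calling the model. The following is a minimal sketch: `VALID_PRESETS` and `output_filename` are hypothetical names introduced for illustration, the preset list is the one given above, and the file name pattern follows the one used in the generation cell.

```python
# Preset names listed in this notebook.
VALID_PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def output_filename(preset, voice):
    """Validate the preset and build the name of the output WAV file."""
    if preset not in VALID_PRESETS:
        raise ValueError(
            f"Unknown preset {preset!r}; expected one of {VALID_PRESETS}"
        )
    return f"generated-{preset}-{voice}.wav"


print(output_filename("standard", "freeman"))  # generated-standard-freeman.wav
```

A misspelled preset then fails immediately with a clear message instead of surfacing after the models have started running.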

In [None]:
# This is the text that will be spoken.
text = "The happiness which comes from long practice, which leads to the end of suffering, which at first is like poison, but at last like nectar – this kind of happiness arises from the serenity of one’s own mind" #@param {type:"string"}
# Pick a "preset mode" to determine quality. Options: {"ultra_fast", "fast" (default), "standard", "high_quality"}. See docs in api.py
preset = "standard" #@param ["ultra_fast", "fast", "standard", "high_quality"]

In [None]:
# Use the voice folder selected in the first dropdown above
voice = voice_folder.value

# Load its reference clips and send them through Tortoise.
voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
                          preset=preset)

generated_filename = f'generated-{preset}-{voice}.wav'
torchaudio.save(generated_filename, gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(generated_filename)

mode 0
Generating autoregressive samples..


100%|██████████| 16/16 [03:14<00:00, 12.14s/it]


Computing best candidates using CLVP


100%|██████████| 16/16 [00:07<00:00,  2.23it/s]


Transforming autoregressive outputs into audio..


  0%|          | 0/200 [00:00<?, ?it/s]

## 5. Generating speech with a random voice:
   - Similar to the previous step, but with `voice_samples` and `conditioning_latents` set to `None`, which generates speech using a random voice.

In [None]:
#@markdown Tortoise can also generate speech using a random voice. The voice changes each time you execute this!
#@markdown (Note: random voices can be prone to strange utterances)
gen = tts.tts_with_preset(text, voice_samples=None, conditioning_latents=None, preset=preset)
torchaudio.save('synthetized_voice_sample.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('synthetized_voice_sample.wav')

Downloading rlg_auto.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth...
Done.
Downloading rlg_diffuser.pth from https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth...
Done.
Generating autoregressive samples..


100%|██████████| 16/16 [02:53<00:00, 10.82s/it]


Computing best candidates using CLVP


100%|██████████| 16/16 [00:07<00:00,  2.25it/s]


Transforming autoregressive outputs into audio..


  0%|          | 0/200 [00:00<?, ?it/s]

## 6. Using a custom voice:
   - The code allows the user to upload their own WAV files (6-10 seconds long) to create a custom voice.
   - It creates a custom voice folder using `os.makedirs` and saves the uploaded files in that folder.
   - The custom voice is then loaded and used to generate speech, similar to steps 4 and 5.
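
Since the clips are expected to be 6-10 seconds long, a quick duration check before generating speech can save a wasted run. The sketch below uses the standard-library `wave` module, which reads uncompressed PCM WAV files only; other encodings raise `wave.Error`. `wav_duration_seconds` and `is_valid_clip` are hypothetical helper names, not part of Tortoise.

```python
import os
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_clip(path, min_seconds=6.0, max_seconds=10.0):
    """Check that a clip falls within the recommended 6-10 second range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds


# Only runs when the custom voice folder from this notebook exists.
folder = "tortoise/voices/custom"
if os.path.isdir(folder):
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        try:
            status = "ok" if is_valid_clip(path) else "outside 6-10 s"
        except (wave.Error, EOFError):
            status = "not a readable PCM WAV file"
        print(name, status)
```

Clips reported as outside the range can be trimmed or re-recorded before they are used as conditioning samples.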

In [None]:
#@markdown Optionally, use your own voice by running the next two cells. I recommend
#@markdown you upload at least 2 audio clips. They must be WAV files, 6-10 seconds long.
CUSTOM_VOICE_NAME = "custom"

import os
from google.colab import files

custom_voice_folder = f"tortoise/voices/{CUSTOM_VOICE_NAME}"
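# exist_ok=True lets this cell be re-run without failing if the folder already exists.
os.makedirs(custom_voice_folder, exist_ok=True)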
for i, file_data in enumerate(files.upload().values()):
  with open(os.path.join(custom_voice_folder, f'{i}.wav'), 'wb') as f:
    f.write(file_data)

Saving 1.wav to 1.wav
Saving 2.wav to 2.wav
Saving 3.wav to 3.wav
Saving 4.wav to 4.wav


In [None]:
# Generate speech with the custom voice.
voice_samples, conditioning_latents = load_voice(CUSTOM_VOICE_NAME)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
                          preset=preset)
torchaudio.save(f'generated-{CUSTOM_VOICE_NAME}.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio(f'generated-{CUSTOM_VOICE_NAME}.wav')

mode 0
Generating autoregressive samples..


100%|██████████| 16/16 [04:06<00:00, 15.39s/it]


Computing best candidates using CLVP


100%|██████████| 16/16 [00:07<00:00,  2.25it/s]


Transforming autoregressive outputs into audio..


  0%|          | 0/200 [00:00<?, ?it/s]

## 7. Combining voices:
   - The `load_voices` function is used to load multiple voices (in this case, 'freeman' and 'deniro').
   - The `tts_with_preset` method is called with the combined voice samples and conditioning latents to generate speech with traits from both voices.

In [None]:
# You can also combine conditioning voices. Combining voices produces a new voice
# with traits from all the parents.
#
# Let's hear what a blend of the 'freeman' and 'deniro' voices sounds like:
voice_samples, conditioning_latents = load_voices(['freeman', 'deniro'])

gen = tts.tts_with_preset("Words, once silent, now dance on digital breath, speaking volumes through the magic of text-to-speech.",
                          voice_samples=voice_samples, conditioning_latents=conditioning_latents,
                          preset=preset)
torchaudio.save('freeman_deniro.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('freeman_deniro.wav')

mode 0
Generating autoregressive samples..


100%|██████████| 16/16 [01:43<00:00,  6.49s/it]


Computing best candidates using CLVP


100%|██████████| 16/16 [00:06<00:00,  2.33it/s]


Transforming autoregressive outputs into audio..


  0%|          | 0/200 [00:00<?, ?it/s]