# Text-To-Speech (TTS) 
(also called speech synthesis)

# Text-to-speech datasets
(only texts)

**Some ASR datasets can also be used to train TTS models.**

## LJSpeech

A single-speaker English dataset: one speaker reading sentences from 7 non-fiction books.

## Multilingual LibriSpeech

A multilingual extension of the LibriSpeech dataset, intended for developing multilingual TTS systems and exploring cross-lingual speech synthesis techniques.

## VCTK (Voice Cloning Toolkit)

Audio recordings of 110 English speakers with various accents. It is suited to training TTS models with varied voices and accents, which makes the synthesized speech more natural and diverse.

## LibriTTS / LibriTTS-R

A multi-speaker English corpus of approximately 585 hours of read speech at a 24 kHz sampling rate.
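Corpora are recorded at different sampling rates (LibriTTS at 24 kHz, for example), while the SpeechT5 example later in this document plays audio at 16 kHz, so audio often has to be resampled before it is used. The following is a minimal sketch of that idea using plain linear interpolation in NumPy. `resample_linear` is a hypothetical helper written for illustration only; it applies no anti-aliasing filter, so a real pipeline would more likely rely on a dedicated audio library.

```python
import numpy as np

def resample_linear(waveform, orig_sr, target_sr):
    """Resample a 1-D waveform by linear interpolation (illustrative only)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    # Number of output samples that covers the same duration
    n_out = int(round(len(waveform) * target_sr / orig_sr))
    # Time stamps (in seconds) of the input and output samples
    t_in = np.arange(len(waveform)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, waveform).astype(np.float32)

# One second of a 440 Hz tone at 24 kHz, resampled to 16 kHz
orig_sr, target_sr = 24_000, 16_000
t = np.arange(orig_sr) / orig_sr
tone_24k = np.sin(2 * np.pi * 440 * t).astype(np.float32)
tone_16k = resample_linear(tone_24k, orig_sr, target_sr)
print(tone_24k.shape, tone_16k.shape)
```

The duration is preserved: one second of audio goes from 24,000 samples to 16,000.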

# Pre-trained models for text-to-speech

## SpeechT5 model

In [1]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")



In [2]:
inputs = processor(text="Don't count the days, make the days count.", return_tensors="pt")

In [3]:
import torch
from datasets import load_dataset

# Load pre-computed x-vector speaker embeddings
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

# Pick one speaker embedding and add a batch dimension
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


In [7]:
# No vocoder is passed, so the result is a log-mel spectrogram, not a waveform
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

In [10]:
from transformers import SpeechT5HifiGan

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")


In [12]:
# With a vocoder, generate_speech returns the audio waveform directly
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

In [14]:
from IPython.display import Audio

Audio(speech, rate=16000)
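
`Audio` only plays the waveform inside the notebook. To keep the result, the samples can be written to a WAV file. The sketch below uses only NumPy and Python's standard `wave` module. `save_wav` is a hypothetical helper, and it assumes a mono float waveform in the range [-1, 1]. A synthetic tone stands in for `speech` so that the block runs without downloading a model.

```python
import os
import tempfile
import wave

import numpy as np

def save_wav(path, waveform, sample_rate=16000):
    """Write a mono float waveform in [-1, 1] as a 16-bit PCM WAV file."""
    samples = np.asarray(waveform, dtype=np.float32).reshape(-1)
    samples = np.clip(samples, -1.0, 1.0)      # guard against overflow
    pcm = (samples * 32767.0).astype("<i2")    # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)               # mono
        wav_file.setsampwidth(2)               # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())

# Stand-in for `speech`: half a second of a 220 Hz tone at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate // 2) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 220 * t)

out_path = os.path.join(tempfile.mkdtemp(), "speech.wav")
save_wav(out_path, tone, sample_rate)
```

With the tensor from the cell above, the call would look something like `save_wav("speech.wav", speech.detach().cpu().numpy())`.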

## Bark

In [1]:
from transformers import BarkModel, BarkProcessor

model = BarkModel.from_pretrained("suno/bark-small")
processor = BarkProcessor.from_pretrained("suno/bark-small")



In [2]:
# add a speaker embedding
inputs = processor("This is a test!", voice_preset="v2/en_speaker_3")

speech_output = model.generate(**inputs).cpu().numpy()


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In [11]:
speech_output

array([[ 7.8175898e-04,  2.1792462e-04,  2.6435815e-04, ...,
        -2.9434550e-06,  4.7763824e-06,  5.0405001e-06]], dtype=float32)

In [4]:
# add a French speaker embedding
fr_inputs = processor("C'est un test!", voice_preset="v2/fr_speaker_1")

fr_speech_output = model.generate(**fr_inputs).cpu().numpy()



In [20]:
fr_speech_output

array([[ 2.7224014e-03,  2.0410451e-03,  2.0769674e-03, ...,
        -1.4201907e-04, -4.4215758e-05,  1.8003853e-05]], dtype=float32)

In [6]:
inputs = processor(
    "[clears throat] This is a test ... and I just took a long pause.",
    voice_preset="v2/fr_speaker_1"
)

speech_output = model.generate(**inputs).cpu().numpy()



In [None]:
inputs = processor("♪ In the mighty jungle, I'm trying to generate barks.")

speech_output = model.generate(**inputs).cpu().numpy()

In [27]:
input_list = [
    "[clears throat] Hello uh ..., my dog is cute [laughter]",
    "Let's try generating speech, with Bark, a text-to-speech model",
    "♪ In the jungle, the mighty jungle, the lion barks tonight ♪",
]

# add a speaker embedding
inputs = processor(input_list, voice_preset="v2/en_speaker_3")

speech_output = model.generate(**inputs).cpu().numpy()



In [30]:
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_output[0], rate=sampling_rate)

In [31]:
Audio(speech_output[1], rate=sampling_rate)
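
`speech_output` is a single 2-D array, so every row in a batch has the same number of samples, and shorter utterances presumably end in padding. A minimal sketch for measuring durations and cutting that tail is shown below, assuming the padding is silent or nearly so. `trim_trailing_silence` and its `threshold` are hypothetical, and a synthetic batch at an assumed 24 kHz stands in for the Bark output.

```python
import numpy as np

def trim_trailing_silence(waveform, threshold=1e-4):
    """Drop trailing samples whose absolute amplitude is below `threshold`."""
    waveform = np.asarray(waveform)
    loud = np.flatnonzero(np.abs(waveform) > threshold)
    if loud.size == 0:
        return waveform[:0]          # the whole clip is silent
    return waveform[: loud[-1] + 1]  # keep everything up to the last loud sample

# Synthetic stand-in for a padded batch: two rows of one second each
demo_rate = 24_000  # assumed for this demo; the notebook reads the rate from the model config
batch = np.zeros((2, demo_rate), dtype=np.float32)
batch[0, :] = 0.1          # signal fills the whole second
batch[1, :12_000] = 0.1    # half a second of signal, then zero padding

for row in batch:
    trimmed = trim_trailing_silence(row)
    print(f"{len(row) / demo_rate:.2f} s -> {len(trimmed) / demo_rate:.2f} s")
```

Trimming by amplitude is a heuristic: it would also cut a deliberately quiet ending, so the threshold needs checking against real output.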