<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/Parler_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Parler_TTS Google Colab

## 📄 Description  
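This Colab notebook uses Parler TTS to generate speech from text. A natural-language description of the voice (speaker, tone, speed, pitch, volume, and recording conditions) guides the generated audio.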

**Languages supported**: English, French, Spanish, Portuguese, Polish, German, Italian and Dutch

**Capabilities**: Text-to-speech, Multi-lingual, Guided generation

---

## How to use
- Edit the configuration cell to set the language, speaker, text to generate, and voice parameters.
- Run all cells. The generated audio is written to `output.wav` and played inline.

---

## 🔗 Resources

- **GitHub Repository:** [huggingface/parler-tts](https://github.com/huggingface/parler-tts)
- **Model Availability:** [parler-tts/parler-tts-mini-multilingual-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-multilingual-v1.1) (multi-lingual model)

---

## 🎙️ Explore More TTS Models  
Want to try out additional TTS models? Check out the curated collection here:  
👉 [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)


## Parler TTS Mini Multilingual v1.1

In [1]:
!pip install git+https://github.com/huggingface/parler-tts.git

Collecting git+https://github.com/huggingface/parler-tts.git
  Cloning https://github.com/huggingface/parler-tts.git to /tmp/pip-req-build-6ztmw2d9
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/parler-tts.git /tmp/pip-req-build-6ztmw2d9
  Resolved https://github.com/huggingface/parler-tts.git to commit d108732cd57788ec86bc857d99a6cabd66663d68
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting descript-audiotools@ git+https://github.com/descriptinc/audiotools (from parler_tts==0.2.2)
  Cloning https://github.com/descriptinc/audiotools to /tmp/pip-install-vuvk2zgv/descript-audiotools_f5fd414d29e740ab877246987cbaf4bd
  Running command git clone --filter=blob:none --quiet https://github.com/descriptinc/audiotools /tmp/pip-install-vuvk2zgv/descript-audiotools_f5fd414d29e740ab877246987cbaf4bd
  Resolved https://github.com/d

In [2]:
language_list = ["English", "Dutch", "French", "German", "Italian", "Polish", "Portuguese", "Spanish"]

dutch_speakers = ["Mark", "Jessica", "Michelle"]
french_speakers = ["Daniel", "Michelle", "Christine", "Megan"]
german_speakers = ["Nicole", "Christopher", "Megan", "Michelle"]
italian_speakers = ["Julia", "Richard", "Megan"]
polish_speakers = ["Alex", "Natalie"]
portuguese_speakers = ["Sophia", "Nicholas"]
spanish_speakers = ["Steven", "Olivia", "Megan"]
english_speakers = ['Alex', 'Christine', 'Christopher', 'Daniel', 'Jessica', 'Julia', 'Mark', 'Megan', 'Michelle', 'Natalie', 'Nicholas', 'Nicole', 'Olivia', 'Richard', 'Sophia', 'Steven']
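
The per-language lists above can also be collected into one dictionary so that a speaker choice is checked against the selected language before generation. This is a minimal sketch, not part of the original notebook: the names `SPEAKERS_BY_LANGUAGE` and `pick_speaker` are hypothetical, and the speaker names are copied from the lists above.

```python
# Hypothetical helper (not in the original notebook): look up and validate
# a speaker for a given language using the lists defined above.
SPEAKERS_BY_LANGUAGE = {
    "English": ["Alex", "Christine", "Christopher", "Daniel", "Jessica", "Julia",
                "Mark", "Megan", "Michelle", "Natalie", "Nicholas", "Nicole",
                "Olivia", "Richard", "Sophia", "Steven"],
    "Dutch": ["Mark", "Jessica", "Michelle"],
    "French": ["Daniel", "Michelle", "Christine", "Megan"],
    "German": ["Nicole", "Christopher", "Megan", "Michelle"],
    "Italian": ["Julia", "Richard", "Megan"],
    "Polish": ["Alex", "Natalie"],
    "Portuguese": ["Sophia", "Nicholas"],
    "Spanish": ["Steven", "Olivia", "Megan"],
}


def pick_speaker(language, name):
    """Return `name` if it is listed for `language`; raise ValueError otherwise."""
    if language not in SPEAKERS_BY_LANGUAGE:
        raise ValueError(f"Unknown language: {language!r}")
    speakers = SPEAKERS_BY_LANGUAGE[language]
    if name not in speakers:
        raise ValueError(f"{name!r} is not listed for {language}; choose from {speakers}")
    return name


# Example: "Daniel" appears in the French list, so this passes the check.
speaker = pick_speaker("French", "Daniel")
```

A lookup like this catches a mismatched pair (for example, a Polish text with a speaker that only appears in the Spanish list) before any model loading or generation time is spent.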

In [5]:
# Modify this cell for your generation; see the previous cell for the available languages and speakers.

language = language_list[0]  # Modify for language
speaker = english_speakers[3]  # Modify for speaker; pick from the list that matches the language
text_to_generate = "This is audio generated by Parler TTS"  # Modify for text to generate

tone = "monotone"  # Modify for tone, like monotone, harsh, gentle, assertive, soft, flat, melodic
speed = "normal"  # Modify for speed, like normal, slow, fast, steady
quality = "very clear audio"  # Modify for audio quality, like very clear audio or very noisy audio
space = "small space"  # Modify for echo, like small space, large hall, open space
background_noise = "quiet"  # Modify for background noise, like quiet, bustling crowd, windy
pitch = "low"  # Modify for pitch, like low, high, medium
volume = "moderate"  # Modify for volume, like loud, soft, moderate

description = f"{speaker}'s voice is {tone}, at a {speed} speed, with {quality}. It has a {pitch} pitch and a {volume} volume. The environment is a {space} with {background_noise} background."
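
To make the effect of each parameter easier to see, the same template can be wrapped in a function and printed before it is passed to the model. This is a sketch under the assumption that the template above is used unchanged; `build_description` is a hypothetical name, and its defaults mirror the values set in this cell.

```python
# Hypothetical wrapper (not in the original notebook) around the description
# template used above; defaults mirror the values set in the cell.
def build_description(speaker, tone="monotone", speed="normal",
                      quality="very clear audio", space="small space",
                      background_noise="quiet", pitch="low", volume="moderate"):
    """Assemble the natural-language voice description from its parts."""
    return (
        f"{speaker}'s voice is {tone}, at a {speed} speed, with {quality}. "
        f"It has a {pitch} pitch and a {volume} volume. "
        f"The environment is a {space} with {background_noise} background."
    )


# Print the default description, then one with a few parameters overridden.
print(build_description("Daniel"))
print(build_description("Olivia", tone="gentle", speed="slow", pitch="high"))
```

Printing the string first is a cheap way to check that the sentence reads naturally, since the description is passed to the model as plain text.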

In [6]:
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
from IPython.display import Audio, display

# Use the GPU when one is available; generation on CPU is much slower.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model, the tokenizer for the text to speak, and the tokenizer
# for the voice description (taken from the model's text encoder config).
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

# Tokenize the voice description and the text to speak.
input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_to_generate, return_tensors="pt").input_ids.to(device)

# Generate the waveform and write it to disk at the model's sampling rate.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)

# Play the result inline in the notebook.
display(Audio("output.wav"))

  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32128
}

  "_name_or_path": "ylacombe/dac_44khz",
  "architectures": [
    "DacModel"
  ],
  "codebook_dim": 8,
  "codebook_loss_weight": 1.0,
  "codebook_size": 1024,
  "commitment_loss_weight": 0.25,
  "decoder_hidden_si