# Text to Speech

### 👽 Import Libraries

In [1]:
#!pip install transformers
#!pip install gradio
#!pip install timm
#!pip install inflect
#!pip install phonemizer

- **!pip install transformers:** Installs the 🤗 Transformers library, which provides pretrained models for text, audio, and vision tasks. This notebook uses its text-to-speech pipeline.

- **!pip install gradio:** Installs Gradio, a Python library that simplifies the creation of interactive web-based user interfaces for machine learning models, allowing users to interact with models via a web browser.

- **!pip install timm:** Installs Timm, a PyTorch library that offers a collection of pre-trained models and a simple interface to use them, primarily focused on computer vision tasks such as image classification and object detection.

- **!pip install inflect:** Installs Inflect, a Python library used for converting numbers to words, pluralizing and singularizing nouns, and generating ordinals and cardinals.

- **!pip install phonemizer:** Installs Phonemizer, a Python library for converting text into phonetic transcriptions, useful for tasks such as text-to-speech synthesis and linguistic analysis.

- To run locally on a Linux machine, run the following commands:

```sh
sudo apt-get update
sudo apt-get install espeak-ng
pip install py-espeak-ng
```


**📕 APT stands for Advanced Package Tool**. It is the package management system used by Debian-based Linux distributions such as Debian and Ubuntu. APT installs, updates, and removes software packages from repositories, and it resolves dependencies automatically, so installing a package also pulls in everything that package requires.


- `sudo apt-get update`: Refreshes APT's local package index from the configured repositories. It does not upgrade any installed packages.
- `sudo apt-get install espeak-ng`: Installs the espeak-ng text-to-speech synthesizer.
- `pip install py-espeak-ng`: Installs a Python interface for espeak-ng.
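Before running the rest of the notebook, it can help to confirm that the pieces listed above are actually present. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and it assumes each package's import name matches its pip name, which holds for the five packages listed here.

```python
import importlib.util
import shutil

# Package names taken from the pip commands above.
PACKAGES = ("transformers", "gradio", "timm", "inflect", "phonemizer")


def check_environment(packages=PACKAGES, binary="espeak-ng"):
    """Return a dict mapping each requirement to True (found) or False."""
    # find_spec returns None when a top-level module cannot be located.
    status = {name: importlib.util.find_spec(name) is not None for name in packages}
    # shutil.which returns the executable's path, or None if it is not on PATH.
    status[binary] = shutil.which(binary) is not None
    return status


for requirement, found in check_environment().items():
    print(f"{requirement:<14} {'found' if found else 'MISSING'}")
```

Anything reported as `MISSING` can be installed with the corresponding `pip` or `apt-get` command above.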

In [2]:
# Show only error messages from Transformers (suppresses warnings and info logs)

from transformers.utils import logging

logging.set_verbosity_error()

### Build the text-to-speech pipeline using the 🤗 Transformers Library

In [3]:
from transformers import pipeline

**🔍 kakao-enterprise/vits-ljs:**

🔊📚 VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

👉 Model: https://huggingface.co/kakao-enterprise/vits-ljs

- Overview:

VITS is an end-to-end speech synthesis model built on a conditional variational autoencoder (VAE). It generates a speech waveform directly from the input text, using a flow-based module and a stochastic duration predictor. Because the duration predictor is stochastic, the same text can be spoken with different rhythms, so repeated runs may produce slightly different audio unless a random seed is fixed.

In [4]:
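# Load the model from a local directory. The same checkpoint is published
# on the Hugging Face Hub as "kakao-enterprise/vits-ljs".
narrator = pipeline(task="text-to-speech",
                    model="./models/kakao-enterprise/vits-ljs")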

In [5]:
print(f"narrator pipeline is {narrator} in memory address.")

narrator pipeline is <transformers.pipelines.text_to_audio.TextToAudioPipeline object at 0x7fb80ccb5fd0> in memory address.


In [7]:
text = """
Why did the Python programmer bring a ladder to the machine learning party? 
Because they heard the models were reaching new heights and they wanted to climb the accuracy tree! 
Now they're scripting their way to the top, one algorithm at a time!

"""
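The triple-quoted string above includes leading and trailing blank lines and trailing spaces. As an optional step, the whitespace can be normalized before synthesis. This is a minimal sketch, not part of the original notebook: the helper name `normalize_whitespace` and the sample string are hypothetical, and whether this changes the generated audio is not something the notebook tests.

```python
def normalize_whitespace(raw: str) -> str:
    """Collapse runs of spaces and newlines into single spaces and trim the ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw.split())


# Hypothetical sample with the same kind of stray whitespace as the cell above.
sample = """
Hello there! 
This is a test.

"""
print(repr(normalize_whitespace(sample)))  # 'Hello there! This is a test.'
```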

### 📯 Pass the text to the pipeline


In [8]:
narrated_text = narrator(text)

print(narrated_text)

{'audio': array([[ 0.00112925,  0.00134222,  0.00107496, ..., -0.00083117,
        -0.00077596, -0.00064528]], dtype=float32), 'sampling_rate': 22050}


**📌 Note:** The pipeline returns a dictionary with two entries. 🎵 `audio` is a NumPy `float32` array of amplitude values with shape `(1, num_samples)`, and `sampling_rate` is the number of samples per second of audio, here 22,050 Hz.
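The sampling rate links the array length to playback time: duration in seconds equals the number of samples divided by the sampling rate. A minimal sketch follows; the function name `audio_duration_seconds` and the sample count are hypothetical, chosen only to illustrate the arithmetic.

```python
def audio_duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Return the playback length in seconds of a mono waveform."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# Hypothetical sample count; 220,500 samples at 22,050 Hz is exactly 10 seconds.
print(audio_duration_seconds(220_500, 22_050))  # 10.0
```

For the dictionary printed above, the sample count would be `narrated_text["audio"].shape[-1]` and the rate would be `narrated_text["sampling_rate"]`.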

In [9]:
from IPython.display import Audio as IPythonAudio

IPythonAudio(narrated_text["audio"][0],
             rate=narrated_text["sampling_rate"])
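The player above only exists inside the notebook. To keep the narration as a file, the waveform can be written to a 16-bit PCM WAV file with Python's standard `wave` module. This is a sketch under stated assumptions: `save_wav` is a hypothetical helper, the samples are assumed to be mono floats in the range [-1.0, 1.0], and a generated sine tone stands in for `narrated_text["audio"][0]` so the example runs without the model.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, sampling_rate, path):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, float(value)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for narrated_text["audio"][0]: one second of a 440 Hz tone.
rate = 22050
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
demo_path = os.path.join(tempfile.gettempdir(), "narration_demo.wav")
save_wav(tone, rate, demo_path)
```

In the notebook, the equivalent call would pass `narrated_text["audio"][0]` and `narrated_text["sampling_rate"]` in place of `tone` and `rate`.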
