<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/Pocket_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Pocket TTS Colab

## 📄 Description

This Colab notebook runs **Pocket TTS**, a **lightweight, CPU-efficient text-to-speech (TTS)** model designed for fast, real-time audio generation. With only **100M parameters**, it delivers **low-latency** speech synthesis without a GPU, which makes it a good fit for laptops and other resource-constrained devices.

**Capabilities:**
CPU-Based Speech Generation, Voice Cloning, Instant Audio Streaming, Low Latency (~200ms), Faster Than Real-Time (~6x), Python API and CLI, English Language Support, Handles Long Text Inputs

---

## How to use

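* Install the required dependencies with pip
* Log in to Hugging Face when prompted (needed to access the voice cloning model)
* Upload a reference audio file, edit the input text, and run the remaining cells to generate speech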

---

## ⚙️ Model Highlights

* 🖥 **CPU-optimized** – designed to run efficiently on devices without a GPU
* ⚡ **Ultra-low latency** – audio begins streaming in under 200ms
* 🚀 **Real-time performance** – faster than real-time (~6x) on a MacBook Air M4
* 🧬 **Voice cloning** – replicate voices from short samples
* 💡 **Python API and CLI support** – easy integration into scripts and applications
* 🌍 **English-only** – currently supports only English text input
* 📏 **Compact size** – lightweight 100M parameter model
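
The "~6x faster than real-time" figure is a real-time factor: seconds of audio produced per second of wall-clock time. The sketch below shows one way to compute it. The helper names `real_time_factor` and `timed` are hypothetical, and the 24 kHz sample rate in the worked example is a made-up placeholder, not a value taken from the model.

```python
import time


def real_time_factor(num_samples, sample_rate, elapsed_seconds):
    """Return seconds of audio produced per second of wall-clock time."""
    if sample_rate <= 0 or elapsed_seconds <= 0:
        raise ValueError("sample_rate and elapsed_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / elapsed_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Worked example with made-up numbers:
# 288,000 samples at an assumed 24 kHz is 12 s of audio, generated in 2 s.
example_rtf = real_time_factor(num_samples=288_000, sample_rate=24_000, elapsed_seconds=2.0)
print(f"{example_rtf:.1f}x real-time")
```

With the objects defined later in this notebook, a measurement could look like `audio, elapsed = timed(tts_model.generate_audio, voice_state, text)` followed by `real_time_factor(audio.shape[-1], tts_model.sample_rate, elapsed)`. Expect the result on a Colab CPU to differ from the MacBook Air M4 figure.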

---

## 🧠 Model Details

* **Base Model:** Kyutai Pocket TTS
* **Supported Language:** English
* **Audio Streaming:** Real-time audio generation with minimal delay
* **Format:** PyTorch Model (CPU-optimized)
* **Performance:** ~6x faster than real-time on a MacBook Air M4 CPU

---

## 🔗 Resources

* **GitHub Repository:** [https://github.com/kyutai-labs/pocket-tts](https://github.com/kyutai-labs/pocket-tts)
* **Model Availability:** [https://huggingface.co/kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts)

---

## 🎙️ Explore More TTS Models

Looking for more cutting-edge voice models?  
👉 Check out the full collection: [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)


## TTS with Voice Cloning

In [None]:
# Install required package
!pip install pocket-tts

In [None]:
# Install the necessary package for Hugging Face authentication
!pip install huggingface_hub

from huggingface_hub import login

# Prompt the user to login to Hugging Face
print("Please log in to Hugging Face to access voice cloning models.")
login()
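
The interactive prompt above requires pasting a token on every run. As a hedged alternative, the sketch below reads a token from an environment variable and only falls back to the prompt when none is set. The helper names `resolve_hf_token` and `login_to_hub` are hypothetical, and storing the token under `HF_TOKEN` is an assumption about how the session is configured.

```python
import os


def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return a non-empty token from the given mapping (default: os.environ), or None."""
    env = os.environ if env is None else env
    token = (env.get(var_name) or "").strip()
    return token or None


def login_to_hub(env=None):
    """Log in with a stored token if one is found, else fall back to the prompt."""
    # Imported lazily so the token helper above works without huggingface_hub installed.
    from huggingface_hub import login

    token = resolve_hf_token(env)
    if token:
        login(token=token)  # non-interactive login with the stored token
    else:
        login()  # same interactive prompt as the cell above
```

Calling `login_to_hub()` in place of `login()` keeps the notebook runnable end to end once the variable is set, and behaves exactly like the original cell when it is not.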


In [None]:
from pocket_tts import TTSModel
import scipy.io.wavfile
from IPython.display import Audio
from google.colab import files

# Load the model
tts_model = TTSModel.load_model()

# Function to generate audio from reference voice
def generate_audio_from_reference(input_text, reference_audio_path):
    # Get the voice state from the reference audio
    voice_state = tts_model.get_state_for_audio_prompt(reference_audio_path)

    # Generate audio from the input text
    audio = tts_model.generate_audio(voice_state, input_text)

    # Save the generated audio as a .wav file
    output_filename = "/content/generated_audio_cloning.wav"
    scipy.io.wavfile.write(output_filename, tts_model.sample_rate, audio.numpy())

    # Display the audio player in the notebook
    return Audio(output_filename)
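
Passing a floating-point array to `scipy.io.wavfile.write` produces a floating-point WAV file, which some players handle poorly. If that becomes a problem, one option is to write 16-bit PCM with the standard-library `wave` module instead. This is a minimal sketch, not part of Pocket TTS: `write_pcm16_wav` is a hypothetical helper, and it assumes the generated audio is mono with float samples in the range [-1.0, 1.0].

```python
import struct
import wave


def write_pcm16_wav(path, sample_rate, samples):
    """Write mono float samples (assumed in [-1.0, 1.0]) as a 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    frames = bytearray()
    for value in samples:
        # Clip to the assumed range so out-of-range values do not overflow int16.
        clipped = max(-1.0, min(1.0, float(value)))
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))

    return len(frames) // 2
```

Inside `generate_audio_from_reference`, the save step could then become `write_pcm16_wav(output_filename, tts_model.sample_rate, audio.tolist())`. The per-sample loop is slow for long clips; it is kept here for clarity rather than speed.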

In [None]:
# Upload a reference audio file (Colab saves uploads to the working directory, /content by default)
uploaded = files.upload()
if not uploaded:
    raise ValueError("No file uploaded; please upload a reference audio file.")

# Use the first uploaded file as the reference voice
reference_audio_path = f"/content/{next(iter(uploaded))}"

# Text to synthesize in the cloned voice
input_text_cloning = "This is a test of voice cloning with Pocket TTS."

# Generate audio from the reference voice and input text
generate_audio_from_reference(input_text_cloning, reference_audio_path)
