<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/Neu_TTS_Air.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ NeuTTS Air Colab

## 📄 Description

This Colab notebook runs **NeuTTS Air**, an **on-device text-to-speech (TTS)** language model with **instant voice cloning**.
Built on a **0.5B-parameter LLM backbone**, NeuTTS Air is designed for **natural-sounding**, **real-time** speech generation that runs entirely on a local device, so text and audio do not need to leave it.

**Capabilities:**
Real-Time On-Device Speech, Ultra-Realistic Human-Like Voices, Instant Voice Cloning (3s sample), Embedded-Optimized GGUF Format, Secure & Watermarked Output

---

## How to use

* Run the first cell. It pins the NumPy version and restarts the session.
* Edit the `INPUT_TEXT` variable to set the text you want synthesized.
* Run the remaining cells in order, upload your reference audio and its transcript when prompted, and wait for generation to finish.

---

## ⚙️ Model Highlights

* 🗣 **Best-in-class realism** for its size – produces natural, expressive voices that sound genuinely human
* 📱 **Optimized for on-device deployment** – runs locally on phones, laptops, or even Raspberry Pis via GGML
* 🧬 **Instant voice cloning** – recreate a custom speaker from as little as 3 seconds of audio
* 🚀 **Lightweight 0.5B architecture** – balances performance, speed, and quality for real-world TTS applications
* 🔒 **Privacy & compliance-safe** – fully local inference with watermarked outputs

---

## 🧠 Model Details

* **Base Model:** Qwen-0.5B
* **Supported Language:** English
* **Audio Codec:** NeuCodec (50 Hz neural codec, high quality at low bitrate)
* **Context Window:** 2048 tokens (~30 seconds of audio, including prompt)
* **Format:** GGUF (efficient on-device inference)
* **Performance:** Real-time generation on mid-range devices
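
The context-window figure can be sanity-checked with simple arithmetic. The sketch below assumes one codec token per 50 Hz frame and a hypothetical text-token overhead for the prompt; the function name and the overhead value are illustrative and do not come from the NeuTTS Air repository.

```python
def audio_seconds_budget(context_tokens: int = 2048,
                         codec_rate_hz: int = 50,
                         text_tokens: int = 0) -> float:
    """Seconds of audio that fit in the context window.

    Assumes one codec token per frame (an assumption for illustration)
    and that text tokens share the same window as audio tokens.
    """
    if text_tokens < 0 or text_tokens > context_tokens:
        raise ValueError("text_tokens must be between 0 and context_tokens")
    return (context_tokens - text_tokens) / codec_rate_hz


# With no text at all, the upper bound is 2048 / 50 = 40.96 s.
print(audio_seconds_budget())
# Reserving a hypothetical 500 tokens for text leaves (2048 - 500) / 50 = 30.96 s.
print(audio_seconds_budget(text_tokens=500))
```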

---

## 🔗 Resources

* **GitHub Repository:** [https://github.com/neuphonic/neutts-air](https://github.com/neuphonic/neutts-air)
* **Model Availability:** [https://huggingface.co/neuphonic/neutts-air](https://huggingface.co/neuphonic/neutts-air)

---

## 🎙️ Explore More TTS Models

Looking for more cutting-edge voice models?
👉 Check out the full collection: [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)

## Dependencies & Setup

In [None]:
# Force the exact NumPy version the repo needs, then restart the kernel so native
# extensions load against the right ABI.

%pip -q install --upgrade pip setuptools wheel
%pip -q install --upgrade --force-reinstall "numpy==2.2.6"

import os
print("NumPy pinned. Restarting kernel to load the correct ABI...")
# Hard restart the runtime so imports bind to this NumPy build.
os.kill(os.getpid(), 9)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.

In [None]:
# OS-level dependency (espeak)
!apt-get update -qq
!apt-get install -y -qq espeak

# Get the repo
!git clone https://github.com/neuphonic/neutts-air.git /content/neutts-air || true

# Make sure NumPy stays pinned while installing everything else
%pip -q install --upgrade --force-reinstall "numpy==2.2.6"
%pip -q install -r /content/neutts-air/requirements.txt

# Optional backends (uncomment if needed)
# %pip -q install llama-cpp-python
# %pip -q install onnxruntime

# I/O + playback
%pip -q install soundfile ipython

import numpy as np
print("NumPy version:", np.__version__)

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Selecting previously unselected package libportaudio2:amd64.
(Reading database ... 126675 files and directories currently installed.)
Preparing to unpack .../libportaudio2_19.6.0-1.1_amd64.deb ...
Unpacking libportaudio2:amd64 (19.6.0-1.1) ...
Selecting previously unselected package libsonic0:amd64.
Preparing to unpack .../libsonic0_0.2.0-11build1_amd64.deb ...
Unpacking libsonic0:amd64 (0.2.0-11build1) ...
Selecting previously unselected package espeak-data:amd64.
Preparing to unpack .../espeak-data_1.48.15+dfsg-3_amd64.deb ...
Unpacking espeak-data:amd64 (1.48.15+dfsg-3) ...
Selecting previously unselected package libespeak1:amd64.
Preparing to unpack .../libespeak1_1.48.15+dfsg-3_amd64.deb ...
Unpacking libespeak1:amd64 (1.48.15+dfsg-3) ...
Selecting previously unselected package espeak.
Preparing
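
Installing the repository requirements can pull in a different NumPy build through transitive dependencies, so it may be worth comparing the installed version against the pin rather than only printing it. A minimal sketch using the standard library; `installed_version` is a hypothetical helper, and `2.2.6` is the pin used in the first cell.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


EXPECTED_NUMPY = "2.2.6"  # the pin used in the first cell
found = installed_version("numpy")
if found != EXPECTED_NUMPY:
    print(f"Warning: expected numpy {EXPECTED_NUMPY}, found {found}")
else:
    print("NumPy pin is in place:", found)
```

If the warning appears, re-running the pin cell (and letting the session restart) is one way to recover.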

## Reference Audio and Text

In [None]:
# Upload at least one mono WAV (3–15 s, 16–44.1 kHz recommended).
# You can also upload a .txt transcript of the reference audio. The transcript is
# required, but you can type it into REF_TEXT_STR in the next cell instead.
# The uploaded filenames are detected automatically.

import os
from google.colab import files

upload_dir = "/content/uploads"
os.makedirs(upload_dir, exist_ok=True)

uploaded = files.upload()  # Choose your files (e.g., reference.wav and optional reference.txt)
for name, data in uploaded.items():
    with open(os.path.join(upload_dir, name), "wb") as f:
        f.write(data)

# Auto-pick the first .wav and .txt (in alphabetical order) if present
ref_audio_path = None
ref_text_path = None
for fn in sorted(os.listdir(upload_dir)):
    lower = fn.lower()
    if lower.endswith(".wav") and ref_audio_path is None:
        ref_audio_path = os.path.join(upload_dir, fn)
    if lower.endswith(".txt") and ref_text_path is None:
        ref_text_path = os.path.join(upload_dir, fn)

print("Detected reference audio:", ref_audio_path)
print("Detected reference text file:", ref_text_path)

Saving ref.txt to ref.txt
Saving trump_promptvn.wav to trump_promptvn.wav
Detected reference audio: /content/uploads/trump_promptvn.wav
Detected reference text file: /content/uploads/ref.txt
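
The recommendations in the upload cell (mono, 3–15 s, 16–44.1 kHz) can be checked before running the model. A minimal sketch using only the standard-library `wave` module, which reads uncompressed PCM WAV files and raises `wave.Error` on other encodings; `check_reference_wav` and its default thresholds are hypothetical, not part of NeuTTS Air.

```python
import wave


def check_reference_wav(path, min_s=3.0, max_s=15.0,
                        min_rate=16_000, max_rate=44_100):
    """Return a list of warnings for a PCM WAV file (empty list = looks fine)."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not min_s <= duration <= max_s:
        problems.append(f"duration {duration:.1f}s is outside {min_s}-{max_s}s")
    if not min_rate <= rate <= max_rate:
        problems.append(f"sample rate {rate} Hz is outside {min_rate}-{max_rate} Hz")
    return problems
```

Calling `check_reference_wav(ref_audio_path)` after the upload cell returns an empty list when the file matches the recommendations.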


In [None]:
# Define your texts here. If you uploaded a reference .txt, you can leave REF_TEXT_STR = None.
# Otherwise write your reference transcript inline in REF_TEXT_STR.

INPUT_TEXT = "Okay, this is generated by Neu TTS Air. How does it sound?"
REF_TEXT_STR = None  # e.g., "Hi! My name is Alex and I'm from Seattle."  (set to None to read from file if uploaded)

# Choose model backbones (defaults usually fine).
BACKBONE_REPO = "neuphonic/neutts-air"  # GGUF variant: "neuphonic/neutts-air-q4-gguf" (requires llama-cpp-python)
BACKBONE_DEVICE = "cpu"                  # "cpu" is safest in Colab free tier; use "cuda" if you know you have a GPU session.
CODEC_REPO = "neuphonic/neucodec"
CODEC_DEVICE = "cpu"

# Output
OUTPUT_WAV = "/content/output_neutts_air.wav"

# Basic sanity checks
assert ref_audio_path is not None, "Please upload a reference .wav file in the previous cell."
if REF_TEXT_STR is None and ref_text_path is None:
    raise ValueError("No reference text provided. Either upload a .txt or set REF_TEXT_STR above.")
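
Rather than hard-coding the device, the choice between `"cpu"` and `"cuda"` could be made at runtime. The sketch below assumes PyTorch is available in the session (it falls back to `"cpu"` if the import fails); `pick_device` is a hypothetical helper, not part of the NeuTTS Air API.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch in this environment: CPU is the only safe answer.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


BACKBONE_DEVICE = pick_device()
CODEC_DEVICE = BACKBONE_DEVICE
print("Using device:", BACKBONE_DEVICE)
```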

## Run TTS and Output Audio

In [None]:
import sys
import soundfile as sf

# Make the cloned repo importable, then load the model class
sys.path.insert(0, "/content/neutts-air")
from neuttsair.neutts import NeuTTSAir

# Resolve reference text (file or inline)
if 'REF_TEXT_STR' in globals() and REF_TEXT_STR is not None:
    ref_text = REF_TEXT_STR.strip()
else:
    with open(ref_text_path, "r", encoding="utf-8") as f:
        ref_text = f.read().strip()

# Initialize TTS
tts = NeuTTSAir(
    backbone_repo=BACKBONE_REPO,
    backbone_device=BACKBONE_DEVICE,
    codec_repo=CODEC_REPO,
    codec_device=CODEC_DEVICE
)

# Encode the reference audio into style codes
ref_codes = tts.encode_reference(ref_audio_path)

# Infer a waveform at 24 kHz sampling rate
wav = tts.infer(INPUT_TEXT, ref_codes, ref_text)

# Save to disk
sf.write(OUTPUT_WAV, wav, 24000)
print(f"Saved: {OUTPUT_WAV} ({len(wav)} samples @ 24 kHz)")
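
Because the context window covers roughly 30 seconds of audio including the prompt, longer passages may need to be split before synthesis. A minimal sketch of sentence-based chunking follows; `split_into_chunks` and the 200-character default are illustrative assumptions, not values from the NeuTTS Air repository.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `tts.infer(chunk, ref_codes, ref_text)` in turn, with the resulting waveforms joined (for example with `numpy.concatenate`, assuming `infer` returns a NumPy array) before writing the file.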


In [None]:
from IPython.display import Audio, display
display(Audio(OUTPUT_WAV, rate=24000, autoplay=False))
