<a href="https://colab.research.google.com/github/Wamp1re-Ai/Google-Colab_Notebooks/blob/main/ZonosTTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Zonos TTS Inference in Colab

## **IMPORTANT:** You MUST restart the runtime after Cell 1 completes.




In [4]:
!apt update && apt install -y espeak-ng # Install system dependency for phonemizer

print("Cloning Zonos repository...")
!git clone https://github.com/Isi-dev/Zonos.git
%cd Zonos

# Install uv - the faster pip alternative
print("\nInstalling uv...")
!pip install uv # Need pip once to install uv

# CRITICAL FIX 1: Install the REQUIRED numpy version FIRST using uv.
print("\nInstalling required numpy version (1.26.4) using uv...")
# --force-reinstall replaces any preinstalled numpy with the pinned version.
!uv pip install numpy==1.26.4 --force-reinstall

# CRITICAL FIX 2: Install zonos AFTER numpy is set using uv.
print("\nInstalling Zonos library and dependencies using uv...")
# Install in editable mode from the current directory
!uv pip install -e .

print("\n" + "*"*60)
print("*** Installation Complete. Please RESTART THE RUNTIME now! ***")
print("*** (Runtime -> Restart Runtime) or (Runtime -> Restart Session) ***")
print("*** After restarting, run the next cell (Cell 2). ***")
print("*"*60)

[33m0% [Working][0m            Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
[33m0% [Connecting to archive.ubuntu.com (185.125.190.82)] [1 InRelease 14.2 kB/129[0m                                                                               Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 384 kB in 1s (323 kB/s)
Reading package lists...


# **>>> IMPORTANT: RESTART RUNTIME BEFORE PROCEEDING <<<**

(Runtime menu -> Restart runtime or Restart session)

# -------------------------------------------------------------------------
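
The restart matters because a Python process keeps whichever copy of a module it imported first; reinstalling a package on disk does not replace a copy that is already in memory. As a rough sketch, the standard library can compare the version recorded on disk with the version loaded in the current session. The helper names `installed_version`, `loaded_version`, and `restart_needed` are illustrative and are not part of Zonos or Colab.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the version recorded on disk for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def loaded_version(module_name):
    """Return __version__ of a module already imported in this process, or None."""
    module = sys.modules.get(module_name)
    return getattr(module, "__version__", None) if module else None


def restart_needed(dist_name, module_name=None):
    """Return True when the imported copy differs from the copy installed on disk."""
    on_disk = installed_version(dist_name)
    in_memory = loaded_version(module_name or dist_name)
    # If nothing is installed or nothing is imported yet, there is no stale copy.
    if on_disk is None or in_memory is None:
        return False
    return on_disk != in_memory


print("Restart needed for numpy:", restart_needed("numpy"))
```

If this reports `True` right after the installation cell, the session is still holding the old module and a restart is the simplest fix.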

In [1]:
import sys
import os
import torch
import torchaudio
import numpy # Import numpy to check version

print("Verifying environment after restart...")

# Ensure we are in the right directory after restart
expected_dir = '/content/Zonos'
if os.getcwd() != expected_dir:
    try:
        os.chdir(expected_dir)
        print(f"Changed directory back to {expected_dir}")
    except FileNotFoundError:
        print(f"ERROR: {expected_dir} directory not found. Please re-run Cell 1 and restart.")
        raise

# Add Zonos directory to path if needed (less critical after restart, but safe)
if expected_dir not in sys.path:
    sys.path.insert(0, expected_dir)
    print(f"Added {expected_dir} to sys.path")

# Verify numpy version *after restart*
print(f"Using numpy version: {numpy.__version__}")
if numpy.__version__ != '1.26.4':
    print("\nWARNING: NumPy version is not 1.26.4! This might cause issues.")
    print("Ensure you restarted the runtime after Cell 1 finished.")
    print("If the problem persists, try Factory Resetting the runtime and running all cells again.")
    # Optionally raise an error: raise RuntimeError("Incorrect NumPy version loaded.")

# Now try the main imports
print("\nImporting Zonos and related libraries...")
try:
    from zonos.model import Zonos
    from zonos.conditioning import make_cond_dict
    import transformers # Explicitly import transformers to verify
    print("Imports successful!")
except ImportError as e:
    print(f"\nERROR: Import failed: {e}")
    print("This often happens if the runtime was not restarted after Cell 1.")
    print("Please ensure you clicked 'Runtime -> Restart Runtime'.")
    print("If you did restart, check the Zonos repository issues or try a different environment.")
    raise e
except Exception as e:
    print(f"\nERROR: An unexpected error occurred during import: {e}")
    raise e

# Check device and define it for later use
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"\nUsing device: {device}")
if device == "cpu":
    print("WARNING: Running on CPU will be significantly slower.")

# Load model
print("\nLoading Zonos model (this may take a while)...")
# Note: The Hugging Face token warning is normal for public models if you're not logged in.
model = Zonos.from_pretrained("Isi99999/Zonos-v0.1-transformer", device=device)
print("Model loaded successfully!")

# Define speaker variable globally (will be set in Cell 3 or Cell 4)
speaker = None


Verifying environment after restart...
Changed directory back to /content/Zonos
Added /content/Zonos to sys.path
Using numpy version: 1.26.4

Importing Zonos and related libraries...
Imports successful!

Using device: cuda

Loading Zonos model (this may take a while)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Model loaded successfully!


# **Cell 3: Upload Reference Voice Audio (Optional)**
Upload a 10-30 second **WAV** file of the voice you want to clone. **WAV format is strongly recommended for compatibility.**
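
To check that a reference clip falls in the recommended 10-30 second range before uploading, a small sketch using only the standard-library `wave` module might look like the following. The helper names and the silent demo file are hypothetical, and `wave` reads uncompressed PCM WAV only, so other formats would need a different loader.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_length(path, min_seconds=10.0, max_seconds=30.0):
    """Return (ok, duration), where ok is True if the clip is within the range."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration


def write_silence(path, seconds, sample_rate=8000):
    """Write a mono 16-bit WAV of silence; used here only to create a demo file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


# Demo on a generated file; replace demo_path with the path to a real recording.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reference.wav")
write_silence(demo_path, seconds=12)
ok, duration = check_reference_length(demo_path)
print(f"Duration: {duration:.1f} s, within 10-30 s: {ok}")
```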

In [4]:
import os
from google.colab import files
import torchaudio # Ensure torchaudio is imported here too

# Ensure assets directory exists
os.makedirs("assets", exist_ok=True)
# Define the standard path for the reference file
reference_audio_path = "assets/reference.wav"

print("Please upload a reference voice file (10-30 seconds).")
print("**WAV format is strongly recommended.**")
print(f"The uploaded file will be saved as: {reference_audio_path}")

uploaded = files.upload()

if not uploaded:
    print("\nNo file uploaded. Will proceed without a custom reference.")
    print("If you want to use a custom voice, re-run this cell and upload a file.")
    speaker = None # Ensure speaker is None if no file is uploaded
else:
    # Process the uploaded file
    try:
        # Get the name of the uploaded file (should only be one)
        uploaded_filename = list(uploaded.keys())[0]

        # Remove the old reference file if it exists
        if os.path.exists(reference_audio_path):
            os.remove(reference_audio_path)
            print(f"Removed existing file at {reference_audio_path}")

        # Rename the uploaded file to the standard path
        os.rename(uploaded_filename, reference_audio_path)
        print(f"Saved uploaded file as {reference_audio_path}")

        # Load the reference audio and create speaker embedding
        print("\nLoading reference audio and creating speaker embedding...")
        wav, sampling_rate = torchaudio.load(reference_audio_path)
        # Ensure the model variable is accessible from Cell 2
        speaker = model.make_speaker_embedding(wav.to(device), sampling_rate) # Move wav to device
        print("Reference audio loaded and speaker embedding created successfully!")

    except Exception as e:
        print(f"\nERROR processing uploaded file: {e}")
        print("Please ensure you uploaded a valid audio file (preferably WAV).")
        speaker = None # Reset speaker on error

Please upload a reference voice file (10-30 seconds).
**WAV format is strongly recommended.**
The uploaded file will be saved as: assets/reference.wav


Saving audiomass-output(1).wav to audiomass-output(1).wav
Removed existing file at assets/reference.wav
Saved uploaded file as assets/reference.wav

Loading reference audio and creating speaker embedding...
Reference audio loaded and speaker embedding created successfully!


# **Cell 4: Configure and Generate Speech**
Enter the text, adjust generation parameters, and run the cell to generate audio.

## Uncheck **use_default_speaker** if you are using reference audio.
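
The eight emotion sliders are rescaled so that they sum to 1 before they are passed to the model, with a fallback to fully neutral when every slider is zero. Here is a minimal standalone sketch of that step, using the slider order from the cell below. The function name `normalize_emotions` and the input checks are additions for illustration, not part of the Zonos API.

```python
EMOTION_NAMES = ("happy", "sad", "disgust", "fear", "surprise", "anger", "other", "neutral")


def normalize_emotions(weights):
    """Scale non-negative emotion weights so they sum to 1.

    Falls back to fully neutral (last position) when all weights are zero.
    """
    if len(weights) != len(EMOTION_NAMES):
        raise ValueError(f"Expected {len(EMOTION_NAMES)} weights, got {len(weights)}")
    if any(w < 0 for w in weights):
        raise ValueError("Emotion weights must be non-negative")

    total = sum(weights)
    if total == 0:
        # Avoid division by zero: put all weight on "neutral".
        return [0.0] * (len(EMOTION_NAMES) - 1) + [1.0]
    return [w / total for w in weights]


# Example: sliders that do not sum to 1 are rescaled proportionally.
example = normalize_emotions([1, 1, 0, 0, 0, 0, 0, 2])
for name, value in zip(EMOTION_NAMES, example):
    print(f"{name:>8}: {value:.2f}")
```

Because only the ratios between sliders matter after rescaling, setting every slider to the same value gives the same result as any other uniform setting.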

In [5]:
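import os # Used below to check that audio files exist
import torch # Ensure torch is available
import torchaudio # Used below to load and save audio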
from IPython.display import Audio, display

# --- Parameters using Colab Forms ---
text = "I am motivated by the simple yet profound joys of being alive\u2014the taste of a good meal, the laughter of a friend, the beauty of a sunrise, and the endless pursuit of knowledge. Even if everything about me ceases when I die, my actions, words, and ideas can leave ripples in the world, affecting others in ways I may never fully grasp." # @param {type:"string"}
seed = 421 # @param {"type":"number"}
use_default_speaker = False # @param {type:"boolean"}
language = 'en-us' # @param ['af', 'am', 'an', 'ar', 'as', 'az', 'ba', 'bg', 'bn', 'bpy', 'bs', 'ca', 'cmn', 'cs', 'cy', 'da', 'de', 'el', 'en-029', 'en-gb', 'en-gb-scotland', 'en-gb-x-gbclan', 'en-gb-x-gbcwmd', 'en-gb-x-rp', 'en-us', 'eo', 'es', 'es-419', 'et', 'eu', 'fa', 'fa-latn', 'fi', 'fr-be', 'fr-ch', 'fr-fr', 'ga', 'gd', 'gn', 'grc', 'gu', 'hak', 'hi', 'hr', 'ht', 'hu', 'hy', 'hyw', 'ia', 'id', 'is', 'it', 'ja', 'jbo', 'ka', 'kk', 'kl', 'kn', 'ko', 'kok', 'ku', 'ky', 'la', 'lfn', 'lt', 'lv', 'mi', 'mk', 'ml', 'mr', 'ms', 'mt', 'my', 'nb', 'nci', 'ne', 'nl', 'om', 'or', 'pa', 'pap', 'pl', 'pt', 'pt-br', 'py', 'quc', 'ro', 'ru', 'ru-lv', 'sd', 'shn', 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'sw', 'ta', 'te', 'tn', 'tr', 'tt', 'ur', 'uz', 'vi', 'vi-vn-x-central', 'vi-vn-x-south', 'yue']
# Emotion sliders (normalized later)
happy = 0.30 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
sad = 0.05 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
disgust = 0.05 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
fear = 0.05 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
surprise = 0.05 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
anger = 0.05 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
other = 0.15 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
neutral = 0.30 # @param {type:"slider", min:0.0, max:1.0, step:0.05}
# Other parameters
pitch = 20 # @param {type:"slider", min:0, max:400, step:1}
speed = 15 # @param {type:"slider", min:0.0, max:40.0, step:1.0}
# --- End Parameters ---

print(f"Using device: {device}") # Confirm device defined in Cell 2

# --- Prepare Speaker Embedding ---
# Check if default speaker is requested OR if no speaker was loaded in Cell 3
if use_default_speaker or speaker is None:
    if use_default_speaker:
        print("Using default speaker embedding.")
    else:
        print("No custom reference audio loaded or upload failed, using default speaker embedding.")

    # Ensure the default audio file exists. You might need to download it if it's not in the repo.
    default_audio_path = "assets/exampleaudio.mp3" # Check if this path is correct in the repo
    if not os.path.exists(default_audio_path):
        print(f"WARNING: Default audio file '{default_audio_path}' not found!")
        # You might want to add code here to download it, e.g.:
        # !wget -P assets/ https://example.com/path/to/exampleaudio.mp3
        # For now, we'll raise an error if it's missing and needed.
        raise FileNotFoundError(f"Default speaker audio '{default_audio_path}' not found.")
    else:
        print(f"Loading default audio from: {default_audio_path}")
        try:
            # Load default audio and create embedding
            default_wav, default_sr = torchaudio.load(default_audio_path)
            speaker = model.make_speaker_embedding(default_wav.to(device), default_sr)
            print("Default speaker embedding loaded.")
        except Exception as e:
            print(f"ERROR loading default speaker audio: {e}")
            raise
else:
    print("Using custom speaker embedding from uploaded reference audio.")
# At this point, 'speaker' should hold a valid embedding tensor (either custom or default)

# --- Prepare Emotions ---
print("Normalizing emotion values...")
emotions_list = [happy, sad, disgust, fear, surprise, anger, other, neutral]
total_emotion = sum(emotions_list)
if total_emotion > 0:
    normalized_emotions = [e / total_emotion for e in emotions_list]
else:
    # Avoid division by zero, default to neutral
    normalized_emotions = [0.0] * 7 + [1.0]
    print("Warning: All emotion values were zero. Defaulting to neutral.")

# Move emotion tensor to the correct device
emotions_tensor = torch.tensor(normalized_emotions, device=device, dtype=torch.float32)
print(f"Using normalized emotions: {normalized_emotions}")


# --- Generation Function ---
def generate_speech(text_to_generate, gen_seed, lang, spkr_embedding, emo_tensor, pitch_val, speed_val):
    """Generates speech using the loaded model and provided parameters."""
    print(f"\nStarting generation for: \"{text_to_generate[:100]}...\"") # Print start of text

    if spkr_embedding is None:
        print("ERROR: Speaker embedding is not available. Cannot generate.")
        return None

    # Set seed for reproducibility
    if gen_seed >= 0:
        torch.manual_seed(gen_seed)
        print(f"Using random seed: {gen_seed}")
    else:
        # Use a random seed if seed < 0
        current_seed = torch.seed()
        print(f"Using random seed: {current_seed}")

    # Create conditioning dictionary
    cond_dict = make_cond_dict(
        text=text_to_generate,
        language=lang,
        speaker=spkr_embedding,
        emotion=emo_tensor,
        pitch_std=pitch_val,
        speaking_rate=speed_val
    )

    # Prepare conditioning tensors
    # Note: Torch Inductor warnings (re: bfloat16 on T4, SMs) may appear here during the first run.
    # These are optimization-related and usually don't indicate a failure.
    print("Preparing conditioning...")
    conditioning = model.prepare_conditioning(cond_dict)

    # Generate audio codes (the main transformer step)
    print("Generating audio codes (Transformer)...")
    codes = model.generate(conditioning) # Add progress bar if available/desired

    # Decode codes into waveform using the Autoencoder
    print("Decoding codes to waveform (Autoencoder)...")
    # .cpu() is important as torchaudio.save expects CPU tensor
    wavs = model.autoencoder.decode(codes).cpu()

    # Save the output
    output_filename = "zonos_output.wav"
    sampling_rate = model.autoencoder.sampling_rate
    torchaudio.save(output_filename, wavs[0], sampling_rate)
    print(f"\nAudio saved to: {output_filename} (Sample Rate: {sampling_rate} Hz)")
    return output_filename

# --- Execute Generation ---
if not text:
    print("\nERROR: Text input cannot be empty.")
else:
    output_file = generate_speech(
        text_to_generate=text,
        gen_seed=seed,
        lang=language,
        spkr_embedding=speaker, # Use the speaker embedding prepared earlier
        emo_tensor=emotions_tensor,
        pitch_val=pitch,
        speed_val=speed
    )

    # Display the audio player if generation was successful
    if output_file and os.path.exists(output_file):
        print("\nGenerated Audio:")
        display(Audio(output_file, autoplay=False))
    elif output_file is None:
        print("\nAudio generation failed (check previous errors).")
    else:
        print(f"\nAudio generation seemed complete, but output file '{output_file}' not found.")

Using device: cuda
Using custom speaker embedding from uploaded reference audio.
Normalizing emotion values...
Using normalized emotions: [0.3, 0.05, 0.05, 0.05, 0.05, 0.05, 0.15, 0.3]

Starting generation for: "I am motivated by the simple yet profound joys of being alive—the taste of a good meal, the laughter..."
Using random seed: 421
Preparing conditioning...
Generating audio codes (Transformer)...


Generating:  65%|██████▍   | 1680/2588 [03:10<01:43,  8.81it/s]


Decoding codes to waveform (Autoencoder)...

Audio saved to: zonos_output.wav (Sample Rate: 44100 Hz)

Generated Audio:
