# Notebook Overview

This notebook walks through fine-tuning a speech synthesis model (Spark-TTS) with the Unsloth and TRL libraries, running inference with the result, and publishing it. The main steps are:

- **Environment Setup:** Installing required dependencies and cloning the Spark-TTS repository, with special handling for Google Colab environments.
- **Model Download and Loading:** Downloading the pretrained Spark-TTS model from the Hugging Face Hub and loading it in float32 with full fine-tuning enabled.
- **LoRA Integration:** Calling `FastModel.get_peft_model` to configure Low-Rank Adaptation (LoRA). Because full fine-tuning is enabled in this run, the call has no effect and all weights are trained.
- **Dataset Loading and Tokenization:** Loading the `MrDragonFox/Elise` dataset from the Hugging Face Hub and converting each audio clip into BiCodec token strings.
- **Trainer Configuration:** Setting up the `SFTTrainer` from the TRL library with custom hyperparameters for supervised fine-tuning.
- **Training Execution:** Running the training loop and capturing training statistics such as runtime and peak GPU memory.
- **Inference:** Generating speech from a text prompt with the fine-tuned model.
- **Model Saving and Uploading:** Saving the fine-tuned model locally and pushing it to the Hugging Face Hub using the `merged_16bit` save method.

The recorded run fits on a single Tesla T4 (about 15 GB of memory), using gradient checkpointing, gradient accumulation, and an 8-bit AdamW optimizer to keep GPU memory use down.


### Installation

In [1]:
!nvidia-smi

Tue Aug 12 05:40:35 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   48C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

<a name="Setup"></a>
### Environment Setup and Dependency Installation

This block installs necessary libraries and dependencies, handling differences between running locally or in Google Colab, and clones the Spark-TTS repository.


In [2]:
%%capture
# Import os module to access environment variables
import os

# Check if the code is running outside Google Colab by looking for "COLAB_" in environment variable keys
if "COLAB_" not in "".join(os.environ.keys()):
    # If NOT running in Colab, install the 'unsloth' package using pip
    !pip install unsloth
else:
    # If running in Google Colab, install specific dependencies without their dependencies to avoid conflicts:
    # - bitsandbytes: for efficient 8-bit optimizers
    # - accelerate: Hugging Face library to easily run on different hardware
    # - xformers (fixed version): efficient transformers implementations
    # - peft: parameter-efficient fine-tuning library
    # - trl: transformer reinforcement learning tools
    # - triton: GPU kernel language for ML
    # - cut_cross_entropy: specialized loss function
    # - unsloth_zoo: additional models and utilities for unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo

    # Install additional essential libraries:
    # - sentencepiece: tokenizer library
    # - protobuf: protocol buffers for data serialization
    # - datasets (version >=3.4.1 and <4.0.0): Hugging Face datasets library
    # - huggingface_hub (>=0.34.0): for model hub interaction
    # - hf_transfer: Hugging Face file transfer utilities
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer

    # Finally, install 'unsloth' without dependencies to avoid conflicts
    !pip install --no-deps unsloth

# Clone the Spark-TTS GitHub repository for text-to-speech project files and code
!git clone https://github.com/SparkAudio/Spark-TTS

# Install additional required Python libraries:
# - omegaconf: configuration management
# - einx: tensor operations written in Einstein-inspired notation
!pip install omegaconf einx
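
The Colab check used in the setup cell can be isolated into a small helper so it can be tested outside a notebook. This is a minimal sketch: the name `running_in_colab` is hypothetical, and it checks each variable name separately, which is slightly stricter than the cell's approach of searching the joined key string.

```python
import os
from typing import Mapping, Optional


def running_in_colab(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if any environment variable name contains 'COLAB_'.

    Hypothetical helper mirroring the heuristic used in the setup cell.
    """
    env = os.environ if environ is None else environ
    return any("COLAB_" in key for key in env)


print("Colab detected:", running_in_colab())
```

Passing a plain dictionary instead of `os.environ` makes the branch easy to exercise on any machine.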


In [None]:
# Import the login function from the huggingface_hub library
from huggingface_hub import login

# Authenticate your Hugging Face account using your access token
# This allows you to interact with private repositories and push models securely
login(token="")
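
Rather than pasting a token into the notebook, one option is to read it from the `HF_TOKEN` environment variable. The sketch below assumes that variable has been set beforehand; the helper name `get_hf_token` is hypothetical.

```python
import os
from typing import Mapping, Optional


def get_hf_token(environ: Optional[Mapping[str, str]] = None) -> str:
    """Read the access token from the HF_TOKEN environment variable."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable before logging in.")
    return token


# Usage in the notebook (requires huggingface_hub and a real token):
# from huggingface_hub import login
# login(token=get_hf_token())
```

This keeps the secret out of the saved notebook and fails early with a clear message when the token is missing.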


<a name="LoadModel"></a>
### Downloading and Loading Spark-TTS Model with Unsloth

This code downloads the Spark-TTS model from Hugging Face Hub and loads it with full fine-tuning enabled using the Unsloth library.


In [4]:
# Import FastModel class from unsloth library for easy model loading and fine-tuning
from unsloth import FastModel

# Import PyTorch library for tensor operations and specifying data types
import torch

# Import snapshot_download to download a snapshot of a model repository from Hugging Face Hub
from huggingface_hub import snapshot_download

# Set maximum sequence length for model inputs; longer context possible with higher values
max_seq_length = 2048  # Adjust as needed for your application

# Download the entire model repository "unsloth/Spark-TTS-0.5B" from Hugging Face Hub
# Save it locally in the folder "Spark-TTS-0.5B"
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")

# Load the Spark-TTS model and tokenizer from the downloaded files using FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="Spark-TTS-0.5B/LLM",  # Load the LLM subfolder of the downloaded snapshot
    max_seq_length=max_seq_length,    # Use the max sequence length defined above
    dtype=torch.float32,               # Use float32 precision (Spark currently supports float32 only)
    full_finetuning=True,              # Enable full fine-tuning of the model weights
    load_in_4bit=False,                # Do NOT load model in 4-bit quantization mode
    #token="hf_...",                 # Uncomment and add token if loading gated/private models
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


==((====))==  Unsloth 2025.8.4: Fast Qwen2 patching. Transformers: 4.55.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Float16 full finetuning uses more memory since we upcast weights to float32.
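
To get a feel for what the `dtype` choice costs, the raw weight size can be estimated as parameter count times bytes per parameter. The sketch below is a rough estimate only: it uses the trainable parameter count reported in the training log later in this notebook and ignores gradients, optimizer state, and activations. The helper name `weight_size_gb` is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_size_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 506_634_112  # trainable parameter count reported in the training log
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_size_gb(num_params, dtype):.2f} GB")
```

For this model the float32 weights come to roughly 2 GB, and a 16-bit copy to about half that.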


<a name="LoRA"></a>

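### Applying LoRA (Low-Rank Adaptation) to the Model

This code calls `FastModel.get_peft_model` to attach LoRA adapters. Two caveats apply to this run. First, the notebook notes that LoRA for this model only works in **bfloat16 precision**, not float32, and the Tesla T4 used here does not support bfloat16 (the Unsloth banner above reports `Bfloat16 = FALSE`). Second, because the model was loaded with `full_finetuning=True`, Unsloth reports that `get_peft_model` has no effect, so all parameters are trained and the settings below only matter if you switch to a LoRA run.

Key parameters include the rank (`r`), target layers for adaptation (`target_modules`), scaling factor (`lora_alpha`), and optimized settings like dropout and bias. The `use_gradient_checkpointing="unsloth"` option helps reduce VRAM usage for longer contexts.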


In [5]:
# Note: per this notebook, LoRA for this model only works with bfloat16 precision,
# not float32. With full_finetuning=True (set when loading), this call has no effect.

# Apply LoRA adapters to the existing model using FastModel's get_peft_model method
model = FastModel.get_peft_model(
    model,
    r=128,  # Rank of the LoRA update matrices; larger values increase capacity. Typical values: 8,16,32,64,128
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projection layers to adapt
        "gate_proj", "up_proj", "down_proj",     # Feedforward network layers to adapt
    ],
    lora_alpha=128,         # LoRA scaling factor to control update magnitude
    lora_dropout=0,         # Dropout rate applied to LoRA layers; 0 means no dropout (optimized)
    bias="none",            # Bias handling mode; "none" is optimized (no bias parameters updated)
    # Use 'unsloth' gradient checkpointing mode which reduces VRAM usage by ~30%,
    # allowing larger batch sizes and longer context lengths.
    use_gradient_checkpointing="unsloth",
    random_state=3407,      # Seed for reproducibility of LoRA parameter initialization
    use_rslora=False,       # Whether to use rank stabilized LoRA (disabled here)
    loftq_config=None,      # Configuration for LoftQ quantization (not used here)
)


Unsloth: Full finetuning is enabled, so .get_peft_model has no effect
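
For readers who do switch to a LoRA run, the effect of `r`, `lora_alpha`, and `use_rslora` can be worked out by hand. The sketch below assumes the usual LoRA conventions: the low-rank update is scaled by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA), and each adapted linear layer gains two matrices of shapes `r x in_features` and `out_features x r`. The helper names and the 896 x 896 layer size are hypothetical and chosen only for illustration.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scale applied to the low-rank update: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


def lora_added_params(in_features: int, out_features: int, r: int) -> int:
    """Extra parameters for one adapted linear layer: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


print(lora_scaling(128, 128))                   # standard LoRA scaling
print(lora_scaling(128, 128, use_rslora=True))  # rank-stabilized scaling
# Hypothetical 896 x 896 projection, chosen only for illustration
print(lora_added_params(896, 896, r=128))
```

With `r=128` and `lora_alpha=128` as in the cell above, the standard scaling factor is exactly 1.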


<a name="LoadDataset"></a>

### Loading a Dataset from Hugging Face Hub

This code uses the `datasets` library to load the `"train"` split of the `"Elise"` dataset published by user `"MrDragonFox"` on Hugging Face Hub.


In [6]:
# Import the 'load_dataset' function from the Hugging Face datasets library
from datasets import load_dataset

# Load the 'Elise' dataset from the user 'MrDragonFox' on Hugging Face Hub
# Here, we specifically load the "train" split of the dataset
dataset = load_dataset("MrDragonFox/Elise", split="train")

In [7]:
#@title Tokenization Function

import sys

import torch
import torchaudio.transforms as T

# Make the cloned Spark-TTS repository importable
sys.path.append('Spark-TTS')
from sparktts.models.audio_tokenizer import BiCodecTokenizer
from sparktts.utils.audio import audio_volume_normalize

# Load the BiCodec audio tokenizer (and its wav2vec2 feature extractor) on the GPU
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")


def extract_wav2vec2_features(wavs: torch.Tensor) -> torch.Tensor:
    """Extract wav2vec2 features by averaging hidden states 11, 14, and 16."""
    if wavs.shape[0] != 1:
        raise ValueError(f"Expected batch size 1, but got shape {wavs.shape}")
    wav_np = wavs.squeeze(0).cpu().numpy()

    # Convert the raw waveform into model inputs at 16 kHz
    processed = audio_tokenizer.processor(
        wav_np,
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
    )
    input_values = processed.input_values.to(audio_tokenizer.feature_extractor.device)

    model_output = audio_tokenizer.feature_extractor(input_values)

    if model_output.hidden_states is None:
        raise ValueError("Wav2Vec2Model did not return hidden states. Ensure config `output_hidden_states=True`.")

    num_layers = len(model_output.hidden_states)
    required_layers = [11, 14, 16]
    if any(layer >= num_layers for layer in required_layers):
        raise IndexError(f"Requested hidden state indices {required_layers} out of range for model with {num_layers} layers.")

    # Average the three selected hidden states into a single feature tensor
    feats_mix = (
        model_output.hidden_states[11] + model_output.hidden_states[14] + model_output.hidden_states[16]
    ) / 3

    return feats_mix


def formatting_audio_func(example):
    text = f"{example['source']}: {example['text']}" if "source" in example else example["text"]
    audio_array = example["audio"]["array"]
    sampling_rate = example["audio"]["sampling_rate"]

    target_sr = audio_tokenizer.config['sample_rate']

    if sampling_rate != target_sr:
        resampler = T.Resample(orig_freq=sampling_rate, new_freq=target_sr)
        audio_tensor_temp = torch.from_numpy(audio_array).float()
        audio_array = resampler(audio_tensor_temp).numpy()

    if audio_tokenizer.config["volume_normalize"]:
        audio_array = audio_volume_normalize(audio_array)

    ref_wav_np = audio_tokenizer.get_ref_clip(audio_array)

    audio_tensor = torch.from_numpy(audio_array).unsqueeze(0).float().to(audio_tokenizer.device)
    ref_wav_tensor = torch.from_numpy(ref_wav_np).unsqueeze(0).float().to(audio_tokenizer.device)


    feat = extract_wav2vec2_features(audio_tensor)

    batch = {

        "wav": audio_tensor,
        "ref_wav": ref_wav_tensor,
        "feat": feat.to(audio_tokenizer.device),
    }


    semantic_token_ids, global_token_ids = audio_tokenizer.model.tokenize(batch)

    global_tokens = "".join(
        [f"<|bicodec_global_{i}|>" for i in global_token_ids.squeeze().cpu().numpy()] # Squeeze batch dim
    )
    semantic_tokens = "".join(
        [f"<|bicodec_semantic_{i}|>" for i in semantic_token_ids.squeeze().cpu().numpy()] # Squeeze batch dim
    )

    inputs = [
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>",
        global_tokens,
        "<|end_global_token|>",
        "<|start_semantic_token|>",
        semantic_tokens,
        "<|end_semantic_token|>",
        "<|im_end|>"
    ]
    inputs = "".join(inputs)
    return {"text": inputs}


dataset = dataset.map(formatting_audio_func, remove_columns=["audio"])
print("Moving Bicodec model and Wav2Vec2Model to cpu.")
audio_tokenizer.model.cpu()
audio_tokenizer.feature_extractor.cpu()
torch.cuda.empty_cache()

  WeightNorm.apply(module, name, dim)


Missing tensor: mel_transformer.spectrogram.window
Missing tensor: mel_transformer.mel_scale.fb




Moving Bicodec model and Wav2Vec2Model to cpu.
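
The last step of `formatting_audio_func` is plain string assembly, which can be tried without a GPU or the audio models. The sketch below reproduces that layout with made-up token ids; the helper name `build_tts_training_text` is hypothetical.

```python
from typing import Iterable


def build_tts_training_text(text: str, global_ids: Iterable[int], semantic_ids: Iterable[int]) -> str:
    """Assemble one training string in the layout used by formatting_audio_func."""
    global_tokens = "".join(f"<|bicodec_global_{i}|>" for i in global_ids)
    semantic_tokens = "".join(f"<|bicodec_semantic_{i}|>" for i in semantic_ids)
    return "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>", global_tokens, "<|end_global_token|>",
        "<|start_semantic_token|>", semantic_tokens, "<|end_semantic_token|>",
        "<|im_end|>",
    ])


# Made-up token ids, for illustration only
print(build_tts_training_text("Hello", [7, 42], [3, 1, 4]))
```

Each training example is therefore a single string: the text, then the global (speaker) tokens, then the semantic (content) tokens, each wrapped in its own start and end markers.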


<a name="SFTTrainer"></a>

### Setting Up Supervised Fine-Tuning (SFT) with TRL's SFTTrainer

This code initializes an `SFTTrainer` from the `trl` library to fine-tune a model with supervised learning. Key configurations include batch size, gradient accumulation, learning rate, optimizer type, and precision settings. The trainer uses the specified dataset and tokenizer, and supports training steps control and logging.


In [8]:
# Import configuration and trainer classes from the trl library for supervised fine-tuning (SFT)
from trl import SFTConfig, SFTTrainer

# Initialize the supervised fine-tuning trainer with the model, tokenizer, and training dataset
trainer = SFTTrainer(
    model=model,                    # The model to be fine-tuned
    tokenizer=tokenizer,            # Corresponding tokenizer
    train_dataset=dataset,          # Dataset used for training
    dataset_text_field="text",      # Name of the field containing text inputs in the dataset
    max_seq_length=max_seq_length, # Maximum sequence length to truncate/pad inputs
    packing=False,                  # Packing concatenates several short examples into one sequence for efficiency (disabled here)

    # Configuration arguments for the training process
    args=SFTConfig(
        per_device_train_batch_size=2,       # Number of samples processed per device (GPU) per step
        gradient_accumulation_steps=4,       # Number of steps to accumulate gradients before updating weights
        warmup_steps=5,                      # Number of warmup steps for learning rate scheduler
        # num_train_epochs=1,                 # Uncomment to train for 1 full epoch
        max_steps=60,                        # Maximum number of training steps (overrides epochs)
        learning_rate=2e-4,                  # Learning rate for the optimizer
        fp16=False,                         # Disable mixed precision FP16 training (using full float32)
        bf16=False,                         # Disable bfloat16 precision (also full float32 training)
        logging_steps=1,                    # Log training info every step
        optim="adamw_8bit",                 # Use 8-bit AdamW optimizer for memory efficiency
        weight_decay=0.01,                  # Weight decay factor for regularization
        lr_scheduler_type="linear",         # Learning rate scheduler type
        seed=3407,                         # Random seed for reproducibility
        output_dir="outputs",               # Directory to save model checkpoints and logs
        report_to="none",                   # Disable reporting to experiment trackers (e.g., WandB)
    ),
)
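
The combination of `warmup_steps=5`, `max_steps=60`, and `lr_scheduler_type="linear"` determines the learning rate at every step. The sketch below assumes the usual shape of a linear schedule with warmup (a linear ramp from 0 to the base rate, then a linear decay to 0); the function name `linear_schedule_lr` is hypothetical and the code is an illustration, not the trainer's own implementation.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4, warmup_steps: int = 5, max_steps: int = 60) -> float:
    """Learning rate after `step` optimizer updates for linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


for step in (0, 1, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at `2e-4` after the fifth update and reaches zero at step 60.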



In [9]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.01 GB of memory reserved.


<a name="TrainModel"></a>

### Starting the Training Process

This line starts training with the configured trainer. `trainer.train()` returns a `TrainOutput` whose `metrics` dictionary holds statistics such as the training loss and runtime. It is stored in `trainer_stats` and used in the memory and time report further down.


In [10]:
# Start training and store the returned statistics (e.g., training loss, runtime) in trainer_stats
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,195 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 506,634,112 of 506,634,112 (100.00% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 6.9905        |
| 2    | 6.776         |
| 3    | 6.8735        |
| 4    | 6.5846        |
| 5    | 6.4273        |
| 6    | 6.2694        |
| 7    | 6.2434        |
| 8    | 6.2096        |
| 9    | 6.2873        |
| 10   | 5.748         |
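
The batch figures in the training banner follow from simple arithmetic, and the same arithmetic shows how much of the dataset a 60-step run covers. The sketch below uses the values from the log (1,195 examples, batch size 2, 4 accumulation steps, 1 GPU) and assumes no examples are dropped or repeated; the helper name is hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device * grad_accum * num_gpus


batch = effective_batch_size(2, 4, 1)
examples_seen = batch * 60             # max_steps = 60
epoch_fraction = examples_seen / 1195  # dataset size from the log
print(batch, examples_seen, round(epoch_fraction, 3))
```

With these settings the run sees 480 examples, well under one full pass over the data, so training for a full epoch requires setting `num_train_epochs` instead of `max_steps`.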


In [11]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

641.5162 seconds used for training.
10.69 minutes used for training.
Peak reserved memory = 7.641 GB.
Peak reserved memory for training = 2.631 GB.
Peak reserved memory % of max memory = 51.835 %.
Peak reserved memory for training % of max memory = 17.848 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the prompt below.


In [23]:
input_text = "Hey there my name is suresh, beekhani <giggles> and I'm a speech generation model that can sound like a person."

chosen_voice = None # None for single-speaker

In [24]:
#@title Run Inference

import torch
import re
import numpy as np
from typing import Dict, Any
import torchaudio.transforms as T

FastModel.for_inference(model) # Enable native 2x faster inference

@torch.inference_mode()
def generate_speech_from_text(
    text: str,
    temperature: float = 0.8,   # Generation temperature
    top_k: int = 50,            # Generation top_k
    top_p: float = 1,        # Generation top_p
    max_new_audio_tokens: int = 2048, # Max tokens for audio part
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
    """
    Generates speech audio from text using default voice control parameters.

    Args:
        text (str): The text input to be converted to speech.
        temperature (float): Sampling temperature for generation.
        top_k (int): Top-k sampling parameter.
        top_p (float): Top-p (nucleus) sampling parameter.
        max_new_audio_tokens (int): Max number of new tokens to generate (limits audio length).
        device (torch.device): Device to run inference on.

    Returns:
        np.ndarray: Generated waveform as a NumPy array.
    """

    torch.compiler.reset()

    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])

    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens, # Limit generation length
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id, # Stop token
        pad_token_id=tokenizer.pad_token_id # Use the model's pad token id
    )
    print("Token sequence generated.")


    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]


    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]
    # print(f"\nGenerated Text (for parsing):\n{predicts_text}\n") # Debugging

    # Extract semantic token IDs using regex
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found in the generated output.")
        # Handle appropriately - perhaps return silence or raise error
        return np.array([], dtype=np.float32)

    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0) # Add batch dim

    # Extract global token IDs using regex (assuming controllable mode also generates these)
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
         print("Warning: No global tokens found in the generated output (controllable mode). Might use defaults or fail.")
         pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
         pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0) # Add batch dim

    pred_global_ids = pred_global_ids.unsqueeze(0) # Shape becomes (1, 1, N_global)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")


    # Detokenize using BiCodecTokenizer
    print("Detokenizing audio tokens...")
    # Ensure audio_tokenizer and its internal model are on the correct device
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    # Squeeze the extra dimension from global tokens as seen in SparkTTS example
    wav_np = audio_tokenizer.detokenize(
        pred_global_ids.to(device).squeeze(0), # Shape (1, N_global)
        pred_semantic_ids.to(device)           # Shape (1, N_semantic)
    )
    print("Detokenization complete.")

    return wav_np

if __name__ == "__main__":
    print(f"Generating speech for: '{input_text}'")
    # Prefix the speaker name when a voice is chosen (multi-speaker models)
    text = f"{chosen_voice}: {input_text}" if chosen_voice else input_text
    generated_waveform = generate_speech_from_text(text)

    if generated_waveform.size > 0:
        import soundfile as sf
        output_filename = "generated_speech_controllable.wav"
        sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
        sf.write(output_filename, generated_waveform, sample_rate)
        print(f"Audio saved to {output_filename}")

        # Optional: Play in notebook
        from IPython.display import Audio, display
        display(Audio(generated_waveform, rate=sample_rate))
    else:
        print("Audio generation failed (no tokens found?).")

Generating speech for: 'Hey there my name is suresh, beekhani <giggles> and I'm a speech generation model that can sound like a person.'
Generating token sequence...
Token sequence generated.
Found 415 semantic tokens.
Found 32 global tokens.
Detokenizing audio tokens...
Detokenization complete.
Audio saved to generated_speech_controllable.wav
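
The token-parsing step inside `generate_speech_from_text` can be checked on its own, without a model. The sketch below applies the same two regular expressions to a made-up output string; the helper name `parse_bicodec_tokens` and the sample ids are hypothetical.

```python
import re
from typing import List, Tuple

GLOBAL_RE = re.compile(r"<\|bicodec_global_(\d+)\|>")
SEMANTIC_RE = re.compile(r"<\|bicodec_semantic_(\d+)\|>")


def parse_bicodec_tokens(generated: str) -> Tuple[List[int], List[int]]:
    """Return (global_ids, semantic_ids) found in a decoded generation string."""
    global_ids = [int(m) for m in GLOBAL_RE.findall(generated)]
    semantic_ids = [int(m) for m in SEMANTIC_RE.findall(generated)]
    return global_ids, semantic_ids


# Made-up sample output, for illustration only
sample = (
    "<|bicodec_global_12|><|bicodec_global_7|><|end_global_token|>"
    "<|start_semantic_token|><|bicodec_semantic_301|><|bicodec_semantic_5|>"
    "<|end_semantic_token|><|im_end|>"
)
print(parse_bicodec_tokens(sample))
```

An empty semantic list corresponds to the "no semantic tokens found" warning path in the inference cell.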


<a name="Save16bit"></a>
### Saving and pushing merged 16-bit finetuned models

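The cell below saves the fine-tuned model locally with `save_pretrained` and uploads it to the Hugging Face Hub with `push_to_hub`, passing `save_method="merged_16bit"` to both.

**[NOTE]** Because this run used full fine-tuning, there are no LoRA adapters to merge: the entire model is saved. Unsloth's dedicated merged-export methods are `save_pretrained_merged` and `push_to_hub_merged`, so it is worth verifying that the plain `save_pretrained` and `push_to_hub` calls honor `save_method`. A quick check is the file size: the `model.safetensors` upload shown in the output below is 2.03 GB, which matches about 507M parameters at 4 bytes each (float32) rather than 2 bytes each (16-bit).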


In [None]:
# Save the merged 16-bit model weights and configuration locally
# "local_16bit_model" is the directory where model files will be saved
model.save_pretrained("local_16bit_model", save_method="merged_16bit")

# Save the tokenizer files to the same local directory
tokenizer.save_pretrained("local_16bit_model")

# Upload (push) the merged 16-bit model to the Hugging Face Hub repository
# Specify the repository path and use the same "merged_16bit" save method
# The 'token' parameter authorizes the push operation with your access token
model.push_to_hub(
    "sureshbeekhani/spark-tts-0.5b-finetune-16bit",
    save_method="merged_16bit",
    token=""
)

# Upload the tokenizer files to the same Hugging Face Hub repository
# Token is required for authentication during the push
tokenizer.push_to_hub(
    "sureshbeekhani/spark-tts-0.5b-finetune-16bit",
    token=""
)



  /tmp/tmpa53dlsy1/model.safetensors    :   0%|          |  544kB / 2.03GB            

Saved model to https://huggingface.co/sureshbeekhani/spark-tts-0.5b-finetune-16bit


  /tmp/tmprywiy8tj/tokenizer.json       :  19%|#9        | 2.73MB / 14.1MB            