<a href="https://colab.research.google.com/github/ABIIHA/DATA-SCIENCE/blob/main/nb/Orpheus_(3B)-TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and then "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News


[Vision RL](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) is now supported! Train Qwen2.5-VL, Gemma 3 etc. with GSPO or GRPO.

Introducing Unsloth [Standby for RL](https://docs.unsloth.ai/basics/memory-efficient-rl): GRPO is now faster, uses 30% less memory with 2x longer context.

Gpt-oss fine-tuning now supports 8× longer context with 0 accuracy loss. [Read more](https://docs.unsloth.ai/basics/long-context-gpt-oss-training)

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [2]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2
!pip install snac

### Unsloth

`FastModel` supports loading nearly any model now, including vision and text models! This notebook loads Orpheus through `FastLanguageModel`.

Thank you to [Etherll](https://huggingface.co/Etherll) for creating this notebook!

In [3]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
    # Qwen3 new models
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    # Other very popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/orpheus-3b-0.1-ft",
    max_seq_length= 2048, # Choose any for long context!
    dtype = None, # Select None for auto detection
    load_in_4bit = False, # Select True for 4bit which reduces memory usage
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.7: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.9.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
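
As a rough sanity check on the adapter size, the sketch below recomputes the trainable-parameter count that the training log later in this notebook reports (97,255,424). The layer shapes are assumptions taken from the public Llama-3.2-3B configuration that Orpheus 3B appears to be built on (hidden size 3072, MLP size 8192, 8 KV heads of dimension 128, 28 layers); they are not printed anywhere in this notebook, and `lora_params` is a hypothetical helper.

```python
# Assumed Llama-3.2-3B shapes (not shown in this notebook): (in_features, out_features)
SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # matches "patched 28 layers" in the output above
RANK = 64        # the r value passed to get_peft_model


def lora_params(rank, shapes, num_layers):
    """Count LoRA weights: each adapted matrix gets A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_params(RANK, SHAPES, NUM_LAYERS)
print(f"{total:,}")
```

Because the count scales linearly with `r`, halving the rank to 32 would halve the number of trainable parameters under the same assumed shapes.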


<a name="Data"></a>
### Data Prep  

We will use the `MrDragonFox/Elise` dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: **text, audio** for single-speaker models or **source, text, audio** for multi-speaker models. You can modify this section to accommodate your own dataset, but maintaining the correct structure is essential for optimal training.
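
Before loading a custom CSV, it can help to confirm that its header has the expected columns. The following is a minimal sketch using only the standard library; `missing_columns` and the sample CSV text are hypothetical and only illustrate the check.

```python
import csv
import io


def missing_columns(csv_text, required=("text", "audio")):
    """Return the required column names that the CSV header lacks."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Made-up single-speaker CSV content, for illustration only
sample = "text,audio\nhello,clip_001.wav\n"

print(missing_columns(sample))
print(missing_columns(sample, required=("source", "text", "audio")))
```

The comparison is case-sensitive. The custom CSV used in the cells below names its columns `Audio` and `Text`, so the `required` tuple would need to match that spelling.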

In [26]:
import os, pathlib, pandas as pd
from datasets import load_dataset, Audio

# 1) where your wav files live in Colab
AUDIO_DIR = "/content/audio"   # adjust if different
CSV_IN    = "/content/cleaned.csv"
CSV_OUT   = "/content/cleaned_abs.csv"

# 2) make absolute paths in the CSV
df = pd.read_csv(CSV_IN)
# assume your audio column is named "audio"; if it's "Audio" or something else, change here
df["audio"] = df["audio"].apply(lambda x: os.path.join(AUDIO_DIR, os.path.basename(str(x).strip())))
df.to_csv(CSV_OUT, index=False)

# (optional) sanity check: any missing files?
missing = [p for p in df["audio"] if not pathlib.Path(p).exists()]
print("Missing files:", len(missing))  # should be 0

# 3) load with datasets
ds = load_dataset("csv", data_files=CSV_OUT, split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 4) now this works
print(ds[0]["audio"])           # shows path/array
# print(ds[0])                  # full first row if you like


Missing files: 0


Generating train split: 0 examples [00:00, ? examples/s]

{'path': '/content/audio/01_Vocab-Short-Batch-1-Akber.wav', 'array': array([1.85519457e-06, 2.27242708e-05, 4.88534570e-05, ...,
       5.50528616e-03, 6.30982872e-03, 4.35098121e-03]), 'sampling_rate': 24000}


In [5]:
from datasets import load_dataset, Audio

dataset = load_dataset("csv", data_files="/content/roman_data_updated_499 (1).csv", split="train")
dataset = dataset.cast_column("Audio", Audio(sampling_rate=24000))

print(dataset[0])


{'Audio': {'path': '/content/audio/01_Vocab-Short-Batch-1-Akber.wav', 'array': array([1.85519457e-06, 2.27242708e-05, 4.88534570e-05, ...,
       5.50528616e-03, 6.30982872e-03, 4.35098121e-03]), 'sampling_rate': 24000}, 'Text': 'Ireland ek khubsurat aur par sakoon mulk hai jo apni fitrati khubsurat aur tareekh ke liye mashhur hai.'}


In [None]:
from datasets import load_dataset
dataset = load_dataset("MrDragonFox/Elise", split = "train")

In [6]:
#@title Tokenization Function

import locale
import torchaudio.transforms as T
import os
import torch
from snac import SNAC
locale.getpreferredencoding = lambda: "UTF-8"
# Changed 'audio' to 'Audio' to match the dataset column name
ds_sample_rate = dataset[0]["Audio"]["sampling_rate"]

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
def tokenise_audio(waveform):
  waveform = torch.from_numpy(waveform).unsqueeze(0)
  waveform = waveform.to(dtype=torch.float32)
  resample_transform = T.Resample(orig_freq=ds_sample_rate, new_freq=24000)
  waveform = resample_transform(waveform)

  waveform = waveform.unsqueeze(0).to("cuda")

  #generate the codes from snac
  with torch.inference_mode():
    codes = snac_model.encode(waveform)

  all_codes = []
  for i in range(codes[0].shape[1]):
    all_codes.append(codes[0][0][i].item()+128266)
    all_codes.append(codes[1][0][2*i].item()+128266+4096)
    all_codes.append(codes[2][0][4*i].item()+128266+(2*4096))
    all_codes.append(codes[2][0][(4*i)+1].item()+128266+(3*4096))
    all_codes.append(codes[1][0][(2*i)+1].item()+128266+(4*4096))
    all_codes.append(codes[2][0][(4*i)+2].item()+128266+(5*4096))
    all_codes.append(codes[2][0][(4*i)+3].item()+128266+(6*4096))


  return all_codes

def add_codes(example):
    # Always initialize codes_list to None
    codes_list = None

    try:
        # Changed 'audio' to 'Audio' to match the dataset column name
        answer_audio = example.get("Audio")
        # If there's a valid audio array, tokenise it
        if answer_audio and "array" in answer_audio:
            audio_array = answer_audio["array"]
            codes_list = tokenise_audio(audio_array)
    except Exception as e:
        print(f"Skipping row due to error: {e}")
        # Keep codes_list as None if we fail
    example["codes_list"] = codes_list

    return example

# Changed remove_columns=["audio"] to remove_columns=["Audio"]
dataset = dataset.map(add_codes, remove_columns=["Audio"])

tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009

start_of_speech = tokeniser_length + 1
end_of_speech = tokeniser_length + 2

start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4

start_of_ai = tokeniser_length + 5
end_of_ai =  tokeniser_length + 6
pad_token = tokeniser_length + 7

audio_tokens_start = tokeniser_length + 10

dataset = dataset.filter(lambda x: x["codes_list"] is not None)
dataset = dataset.filter(lambda x: len(x["codes_list"]) > 0)

def remove_duplicate_frames(example):
    vals = example["codes_list"]
    if len(vals) % 7 != 0:
        raise ValueError("Input list length must be divisible by 7")

    result = vals[:7]

    removed_frames = 0

    for i in range(7, len(vals), 7):
        current_first = vals[i]
        previous_first = result[-7]

        if current_first != previous_first:
            result.extend(vals[i:i+7])
        else:
            removed_frames += 1

    example["codes_list"] = result

    return example

dataset = dataset.map(remove_duplicate_frames)

tok_info = '''*** HERE you can modify the text prompt
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"
'''
print(tok_info)

def create_input_ids(example):
    # Determine whether to include the source field
    # Changed 'text' to 'Text' to match the dataset column name
    text_prompt = f"{example['source']}: {example['Text']}" if "source" in example else example["Text"]

    text_ids = tokenizer.encode(text_prompt, add_special_tokens=True)
    text_ids.append(end_of_text)

    example["text_tokens"] = text_ids
    input_ids = (
        [start_of_human]
        + example["text_tokens"]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
        + example["codes_list"]
        + [end_of_speech]
        + [end_of_ai]
    )
    example["input_ids"] = input_ids
    example["labels"] = input_ids
    example["attention_mask"] = [1] * len(input_ids)

    return example

# Changed remove_columns=["text", "codes_list"] to remove_columns=["Text", "codes_list"]
dataset = dataset.map(create_input_ids, remove_columns=["Text", "codes_list"])
columns_to_keep = ["input_ids", "labels", "attention_mask"]
columns_to_remove = [col for col in dataset.column_names if col not in columns_to_keep]

dataset = dataset.remove_columns(columns_to_remove)



Map:   0%|          | 0/499 [00:00<?, ? examples/s]

Filter:   0%|          | 0/499 [00:00<?, ? examples/s]

Filter:   0%|          | 0/499 [00:00<?, ? examples/s]

Map:   0%|          | 0/499 [00:00<?, ? examples/s]

*** HERE you can modify the text prompt
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"



Map:   0%|          | 0/499 [00:00<?, ? examples/s]
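
The tokenization cell defines its special token IDs relative to `tokeniser_length` and then concatenates them around the text and audio tokens. The sketch below spells out the resulting numbers and the sequence layout. The text IDs and the single audio frame are made up for illustration, and `build_sequence` is a hypothetical stand-in for `create_input_ids`.

```python
TOKENISER_LENGTH = 128256
END_OF_TEXT = 128009

SPECIAL = {
    "start_of_speech": TOKENISER_LENGTH + 1,
    "end_of_speech": TOKENISER_LENGTH + 2,
    "start_of_human": TOKENISER_LENGTH + 3,
    "end_of_human": TOKENISER_LENGTH + 4,
    "start_of_ai": TOKENISER_LENGTH + 5,
    "end_of_ai": TOKENISER_LENGTH + 6,
    "pad_token": TOKENISER_LENGTH + 7,
}


def build_sequence(text_ids, audio_codes):
    """Same ordering as create_input_ids in the cell above."""
    return (
        [SPECIAL["start_of_human"]]
        + list(text_ids)
        + [END_OF_TEXT]
        + [SPECIAL["end_of_human"], SPECIAL["start_of_ai"], SPECIAL["start_of_speech"]]
        + list(audio_codes)
        + [SPECIAL["end_of_speech"], SPECIAL["end_of_ai"]]
    )


# Made-up text ids and one made-up 7-token audio frame
seq = build_sequence([128000, 9906], [128266 + k * 4096 for k in range(7)])

for name, token_id in SPECIAL.items():
    print(f"{name:16s} {token_id}")
print(len(seq), seq[0], seq[-1])
```

These are the same values that the inference cell hard-codes: 128259 (start of human), 128260 (end of human), 128257 (start of speech), 128258 (end of speech), and 128263 (padding).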

<a name="Train"></a>
### Train the model
Now let's use Hugging Face's `Trainer`! More docs here: [Transformers docs](https://huggingface.co/docs/transformers/main_classes/trainer). We train for 200 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` setting.

**Note:** Using `per_device_train_batch_size` > 1 may lead to errors in a multi-GPU setup. To avoid issues, ensure `CUDA_VISIBLE_DEVICES` is set to a single GPU (e.g., `CUDA_VISIBLE_DEVICES=0`).
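
One way to pin the process to a single GPU from inside Python is sketched below. The index `"0"` is only an example. The assignment generally has to happen before `torch` (or anything else that initialises CUDA) is imported, so in an already-running notebook a runtime restart may be needed.

```python
import os

# Set this before importing torch; "0" is an example device index
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```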

In [18]:
from transformers import TrainingArguments,Trainer,DataCollatorForSeq2Seq
trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 200,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

In [19]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
7.676 GB of memory reserved.


In [20]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 499 | Num Epochs = 2 | Total steps = 200
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 97,255,424 of 3,398,122,496 (2.86% trained)


Step,Training Loss
1,5.2874
2,5.3828
3,5.4068
4,5.231
5,5.3478
6,5.229
7,5.2547
8,5.2994
9,5.173
10,5.2243
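
The log above reports `Num Epochs = 2` even though only 200 steps are requested. A quick back-of-the-envelope calculation with the values from this run shows why. This is an estimate only; the `Trainer` does its own rounding internally, which may differ slightly in edge cases.

```python
import math

num_examples = 499      # "Num examples" in the log
per_device_batch = 1    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_gpus = 1
max_steps = 200

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs_needed = math.ceil(max_steps / steps_per_epoch)
examples_seen = max_steps * effective_batch

print(effective_batch, steps_per_epoch, epochs_needed, examples_seen)
```

With an effective batch of 4, 200 steps consume 800 examples, which is more than one pass over 499 rows, so the run spills into a second epoch.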


In [21]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

522.8905 seconds used for training.
8.71 minutes used for training.
Peak reserved memory = 7.676 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 52.072 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the prompts below.



In [27]:
prompts = [
    "Isay azmanye ke liye apni zuban aur an put tool zail mein chunein aur type karna shuru karein."
]

chosen_voice = None # None for single-speaker

In [28]:
#@title Run Inference


FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Moving snac_model cuda to cpu
snac_model.to("cpu")

prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=128258,
     use_cache = True
  )
token_to_find = 128257
token_to_remove = 128258

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

mask = cropped_tensor != token_to_remove

processed_rows = []

for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []

for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)


def redistribute_codes(code_list):
  layer_1 = []
  layer_2 = []
  layer_3 = []
  for i in range((len(code_list)+1)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes]
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
  raise Exception("Number of prompts and samples do not match")
else:
  for i in range(len(my_samples)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
# Clean up to save RAM
del my_samples,samples

Isay azmanye ke liye apni zuban aur an put tool zail mein chunein aur type karna shuru karein.
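
`redistribute_codes` in the inference cell undoes the 7-token interleaving that `tokenise_audio` applies during data prep. The sketch below reproduces both directions in plain Python, without SNAC or torch, to show that they are inverses. `pack`, `unpack`, and the code values are hypothetical and only mirror the index arithmetic used in the cells above.

```python
BASE = 128266     # first audio token id, as in the cells above
CODEBOOK = 4096   # codes per SNAC codebook


def pack(layer_1, layer_2, layer_3):
    """Interleave SNAC codes into 7-token frames (training-side layout)."""
    out = []
    for i in range(len(layer_1)):
        frame = [
            layer_1[i],
            layer_2[2 * i],
            layer_3[4 * i],
            layer_3[4 * i + 1],
            layer_2[2 * i + 1],
            layer_3[4 * i + 2],
            layer_3[4 * i + 3],
        ]
        # Slot k of every frame is shifted into its own id range
        out.extend(code + BASE + slot * CODEBOOK for slot, code in enumerate(frame))
    return out


def unpack(tokens):
    """Invert pack(): recover the three code layers (inference-side layout)."""
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(tokens) // 7):
        chunk = tokens[7 * i : 7 * i + 7]
        frame = [t - BASE - slot * CODEBOOK for slot, t in enumerate(chunk)]
        layer_1.append(frame[0])
        layer_2.extend([frame[1], frame[4]])
        layer_3.extend([frame[2], frame[3], frame[5], frame[6]])
    return layer_1, layer_2, layer_3


# Made-up codes for two frames
l1, l2, l3 = [5, 9], [1, 2, 3, 4], [10, 11, 12, 13, 14, 15, 16, 17]
tokens = pack(l1, l2, l3)
print(len(tokens), unpack(tokens) == (l1, l2, l3))
```

Each frame holds one coarse code, two mid-level codes, and four fine codes, which is why the generated sequence is trimmed to a multiple of 7 before decoding.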


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

### Saving to float16

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 15.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.99 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 27.83it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.


And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!



In [8]:
!pip -q install -U uroman

import uroman as ur

# Load once (takes ~1s)
u = ur.Uroman()

# Use ISO-639-3 language codes; Urdu = "urd" (Persian = "fas", Hindi = "hin", etc.)
def urdu_to_roman(text: str) -> str:
    return u.romanize_string(text, lcode="urd")

print(urdu_to_roman("آج موسم بہت خوشگوار ہے، ہم پارک جا رہے ہیں۔"))


aj mwsm bht khwshgwar hye, hm park ja rhye hin.


In [9]:
# Install once
!pip -q install TTS soundfile

import torch, soundfile as sf
from TTS.api import TTS
from IPython.display import Audio

# Load multilingual TTS model (will use GPU in Colab if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "tts_models/multilingual/multi-dataset/xtts_v2"
tts = TTS(model_name).to(device)

def roman_urdu_tts(text, out_path="roman_urdu.wav", speaker="female_en_5", language="en"):
    """
    For Roman-Urdu, set language='en' (Latin script).
    Try different speakers: 'female_en_5', 'male_en_3', etc.
    """
    wav = tts.tts(text=text, speaker=speaker, language=language)
    # Use the model's own output rate instead of a hard-coded 22050 Hz
    sf.write(out_path, wav, tts.synthesizer.output_sample_rate)
    return out_path

# Example with your romanizer output
roman = urdu_to_roman("آج موسم بہت خوشگوار ہے، ہم پارک جا رہے ہیں۔")
print("Roman:", roman)
path = roman_urdu_tts(roman, speaker="female_en_5", language="en")
Audio(path)

ERROR: Ignored the following versions that require a different python version: 0.0.10.2 Requires-Python >=3.6.0, <3.9; 0.0.10.3 Requires-Python >=3.6.0, <3.9; 0.0.11 Requires-Python >=3.6.0, <3.9; [...]

ModuleNotFoundError: No module named 'TTS'
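
The pip output above suggests that the available `TTS` releases were rejected for the runtime's Python version, so nothing was installed and the import failed. A small guard like the following makes that failure explicit before the import is attempted. It is a sketch using only the standard library, and `module_available` is a hypothetical helper.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if module_available("TTS"):
    print("TTS is installed; the import should succeed.")
else:
    print("TTS is not installed; check the pip output before importing it.")
```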

# **aliya text file**

In [1]:
# %%capture
import os, re, torch, sys

# Install deps (adjust if not in Colab)
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth snac transformers==4.55.4 --quiet
    !pip install --no-deps trl==0.22.2 --quiet
else:
    v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo -q
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer -q
    !pip install --no-deps unsloth -q
    !pip install snac -q

import numpy as np


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth-zoo 2025.9.9 requires msgspec, which is not installed.
unsloth-zoo 2025.9.9 requires tyro, which is not installed.
unsloth-zoo 2025.9.9 requires transformers!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4

In [2]:
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

MODEL_NAME = "unsloth/orpheus-3b-0.1-ft"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name       = MODEL_NAME,
    max_seq_length   = 2048,
    dtype            = None,
    load_in_4bit     = False,
)

# Add minimal special tokens for chat + audio spans
SPECIAL_TOKENS = {
    "additional_special_tokens": [
        "<|human|>", "<|/human|>",
        "<|assistant|>", "<|/assistant|>",
        "<|audio|>", "<|/audio|>",
    ]
}
added = tokenizer.add_special_tokens(SPECIAL_TOKENS)
if added > 0:
    model.resize_token_embeddings(len(tokenizer))

HUMAN_BEG, HUMAN_END = tokenizer.convert_tokens_to_ids(["<|human|>", "<|/human|>"])
ASST_BEG,  ASST_END  = tokenizer.convert_tokens_to_ids(["<|assistant|>", "<|/assistant|>"])
AUD_BEG,   AUD_END   = tokenizer.convert_tokens_to_ids(["<|audio|>", "<|/audio|>"])

# LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_alpha=64,
    lora_dropout=0.05,    # (was 0)
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.7: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.61G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/22.8M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/508 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.9.7 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


In [4]:
from datasets import load_dataset, Audio
from snac import SNAC
import torchaudio.transforms as T
import locale
locale.getpreferredencoding = lambda: "UTF-8"

CSV_PATH = "/content/roman_data_updated_499 (1) - Copy.csv"  # <-- your file
AUDIO_COL = "Audio"
TEXT_COL  = "Text"

dataset = load_dataset("csv", data_files=CSV_PATH, split="train")
dataset = dataset.cast_column(AUDIO_COL, Audio(sampling_rate=24000))
ds_sample_rate = dataset[0][AUDIO_COL]["sampling_rate"]

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cuda")


Generating train split: 0 examples [00:00, ? examples/s]

In [5]:
# SNAC packing params
SNAC_BASE      = 128266
CODEBOOK_SIZE  = 4096
OFF = lambda k: SNAC_BASE + k*CODEBOOK_SIZE  # k=0..6

def tokenise_audio(numpy_wave):
    wav = torch.from_numpy(numpy_wave).unsqueeze(0).to(dtype=torch.float32)
    if ds_sample_rate != 24000:
        wav = T.Resample(orig_freq=ds_sample_rate, new_freq=24000)(wav)
    wav = wav.unsqueeze(0).to("cuda")  # (B=1, C=1, T)

    with torch.inference_mode():
        codes = snac_model.encode(wav)  # tuple of 3 code streams

    t0, t1, t2 = codes[0].shape[1], codes[1].shape[1], codes[2].shape[1]
    assert t1 == 2*t0 and t2 == 4*t0, f"SNAC strides mismatch: {t0=}, {t1=}, {t2=}"

    for j, c in enumerate(codes):
        vmin = int(c.min().item()); vmax = int(c.max().item())
        assert 0 <= vmin and vmax < CODEBOOK_SIZE, f"Codebook {j} out of range {vmin}..{vmax}"

    all_codes = []
    for i in range(t0):
        all_codes.append( int(codes[0][0][i].item())       + OFF(0) )
        all_codes.append( int(codes[1][0][2*i].item())     + OFF(1) )
        all_codes.append( int(codes[2][0][4*i].item())     + OFF(2) )
        all_codes.append( int(codes[2][0][(4*i)+1].item()) + OFF(3) )
        all_codes.append( int(codes[1][0][(2*i)+1].item()) + OFF(4) )
        all_codes.append( int(codes[2][0][(4*i)+2].item()) + OFF(5) )
        all_codes.append( int(codes[2][0][(4*i)+3].item()) + OFF(6) )
    return all_codes

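def add_codes(example):
    # Tokenise one example. Failures are reported and leave codes_list as None,
    # so the filter below drops the row instead of hiding the error.
    codes_list = None
    try:
        a = example.get(AUDIO_COL)
        if a and "array" in a:
            codes_list = tokenise_audio(a["array"])
    except Exception as e:
        print(f"Skipping example, audio tokenisation failed: {e}")
    example["codes_list"] = codes_list
    return example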

dataset = dataset.map(add_codes)
dataset = dataset.filter(lambda x: x["codes_list"] is not None and len(x["codes_list"]) > 0)
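
The interleaving in `tokenise_audio` can be checked without a GPU or the SNAC model. The following is a minimal sketch in plain Python, assuming the same `SNAC_BASE` and `CODEBOOK_SIZE` as above; `pack_frames`, `unpack_frames`, `SLOTS` and `PER_FRAME` are hypothetical names introduced only for this illustration. It packs three code lists into 7-token frames and confirms that unpacking restores them.

```python
# Illustrative sketch: these helper names are hypothetical, not part of the notebook.
SNAC_BASE = 128266      # first audio token id, as in the cell above
CODEBOOK_SIZE = 4096    # entries per SNAC codebook

# For each of the 7 slots in a frame: (codebook level, position within that level's group)
SLOTS = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (2, 2), (2, 3)]
PER_FRAME = [1, 2, 4]   # codes contributed per frame by levels 0, 1, 2


def pack_frames(levels):
    """Interleave three code lists (lengths t0, 2*t0, 4*t0) into 7*t0 token ids."""
    t0 = len(levels[0])
    assert len(levels[1]) == 2 * t0 and len(levels[2]) == 4 * t0
    tokens = []
    for i in range(t0):
        for slot, (lvl, j) in enumerate(SLOTS):
            code = levels[lvl][PER_FRAME[lvl] * i + j]
            tokens.append(SNAC_BASE + slot * CODEBOOK_SIZE + code)
    return tokens


def unpack_frames(tokens):
    """Invert pack_frames: recover the three code lists from packed token ids."""
    assert len(tokens) % 7 == 0
    t0 = len(tokens) // 7
    levels = [[0] * (n * t0) for n in PER_FRAME]
    for i in range(t0):
        for slot, (lvl, j) in enumerate(SLOTS):
            offset = SNAC_BASE + slot * CODEBOOK_SIZE
            levels[lvl][PER_FRAME[lvl] * i + j] = tokens[7 * i + slot] - offset
    return levels


# Two frames of made-up codes
levels = [[5, 6], [10, 11, 12, 13], [20, 21, 22, 23, 24, 25, 26, 27]]
packed = pack_frames(levels)
restored = unpack_frames(packed)
```

Each slot gets its own block of 4096 ids, so the same code value in different slots maps to different tokens. This is what allows the decoder to subtract a fixed offset per position.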





In [6]:
# Ensure model embeddings can index the max audio token id
MAX_ID = OFF(6) + (CODEBOOK_SIZE - 1)
need = MAX_ID + 1
if model.get_input_embeddings().num_embeddings < need:
    model.resize_token_embeddings(need)

def create_input_ids(example):
    text_prompt = f"{example['source']}: {example[TEXT_COL]}" if "source" in example else example[TEXT_COL]
    text_ids = tokenizer.encode(text_prompt, add_special_tokens=True)

    input_ids = (
        [HUMAN_BEG] + text_ids + [HUMAN_END] +
        [ASST_BEG] +
        [AUD_BEG] + example["codes_list"] + [AUD_END] +
        [ASST_END]
    )

    example["input_ids"] = input_ids
    example["labels"] = input_ids
    example["attention_mask"] = [1] * len(input_ids)
    return example

dataset = dataset.map(create_input_ids, remove_columns=[col for col in dataset.column_names if col not in ["input_ids","labels","attention_mask"]])
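
As a rough check of the resize logic and the sequence layout, here is a sketch of the arithmetic, assuming the same base id and codebook size as above. The marker ids in it are placeholders chosen for illustration (the real `HUMAN_BEG`, `AUD_BEG`, etc. come from an earlier cell), and `build_sequence` is a hypothetical helper.

```python
SNAC_BASE = 128266
CODEBOOK_SIZE = 4096
NUM_SLOTS = 7  # slots k = 0..6 in each packed frame

# Largest audio token id and the embedding rows needed to index it
max_audio_id = SNAC_BASE + NUM_SLOTS * CODEBOOK_SIZE - 1
required_vocab = max_audio_id + 1

# Placeholder marker ids for illustration only; not the notebook's real values.
HUMAN_BEG, HUMAN_END, ASST_BEG, ASST_END, AUD_BEG, AUD_END = 1, 2, 3, 4, 5, 6


def build_sequence(text_ids, audio_codes):
    """Same layout as create_input_ids: text turn, then the audio span inside the assistant turn."""
    return (
        [HUMAN_BEG] + text_ids + [HUMAN_END]
        + [ASST_BEG]
        + [AUD_BEG] + audio_codes + [AUD_END]
        + [ASST_END]
    )


one_frame = [SNAC_BASE + k * CODEBOOK_SIZE for k in range(NUM_SLOTS)]
seq = build_sequence([100, 101], one_frame)
```

With these constants, the embedding matrix needs at least 156,938 rows. Every training sequence carries six marker tokens in addition to the text and audio ids.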



In [7]:
from transformers import TrainingArguments, Trainer

trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.03,
        max_steps = 2000,
        learning_rate = 1e-4,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
        save_steps = 500,
    ),
)

trainer_stats = trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100 | Num Epochs = 154 | Total steps = 2,000
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 97,255,424 of 3,398,137,856 (2.86% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 10   | 5.974         |
| 20   | 5.8766        |
| 30   | 5.7357        |
| 40   | 5.612         |
| 50   | 5.7746        |
| 60   | 5.7316        |
| 70   | 5.919         |
| 80   | 5.7805        |
| 90   | 5.4652        |
| 100  | 5.3412        |



KeyboardInterrupt: 
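
The figures in the training banner follow from the arguments passed to `TrainingArguments`. Below is a sketch of one way to arrive at them. It assumes a partial accumulation window at the end of an epoch still counts as one optimizer step, which is an assumption about the trainer's rounding rather than something stated in the log.

```python
import math

num_examples = 100       # rows left after filtering
per_device_batch = 1
grad_accum = 8
num_gpus = 1
max_steps = 2000

# Samples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus

# Assumption: the last, partial accumulation window of an epoch is rounded up to a full step
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)

# Epochs needed to reach max_steps
num_epochs = math.ceil(max_steps / steps_per_epoch)

# The run above was interrupted shortly after step 100
fraction_done = 100 / max_steps
```

Under that assumption the arithmetic gives 13 optimizer steps per epoch and 154 epochs, matching the banner. The interrupted run had covered about 5% of the planned steps.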

In [None]:
import torch

def decode_audio_tokens(token_ids):
    # find <|audio|> ... <|/audio|>
    try:
        i0 = token_ids.index(AUD_BEG) + 1
        i1 = token_ids.index(AUD_END)
    except ValueError:
        raise RuntimeError("Audio span not found.")
    aud = token_ids[i0:i1]
    assert len(aud) % 7 == 0, "Packed audio tokens must be multiple of 7."

    t0 = len(aud) // 7
    c0 = torch.empty((1, t0),    dtype=torch.long)
    c1 = torch.empty((1, 2*t0),  dtype=torch.long)
    c2 = torch.empty((1, 4*t0),  dtype=torch.long)

    for i in range(t0):
        a0 = aud[7*i+0] - OFF(0)
        a1 = aud[7*i+1] - OFF(1)
        a2 = aud[7*i+2] - OFF(2)
        a3 = aud[7*i+3] - OFF(3)
        a4 = aud[7*i+4] - OFF(4)
        a5 = aud[7*i+5] - OFF(5)
        a6 = aud[7*i+6] - OFF(6)
        c0[0, i]       = a0
        c1[0, 2*i]     = a1
        c2[0, 4*i]     = a2
        c2[0, 4*i + 1] = a3
        c1[0, 2*i + 1] = a4
        c2[0, 4*i + 2] = a5
        c2[0, 4*i + 3] = a6

    # snac_model.encode returned (B, T) tensors, so decode gets the same shape;
    # c0/c1/c2 already carry the batch dimension B=1.
    codes = [c0.to("cuda"), c1.to("cuda"), c2.to("cuda")]
    with torch.inference_mode():
        wav = snac_model.decode(codes)  # (B, C=1, T)
    return wav.squeeze().cpu().numpy(), 24000

def synthesize(text_prompt, temperature=0.8, top_p=0.9, max_new_tokens=8192):
    text_ids = tokenizer.encode(text_prompt, add_special_tokens=True)
    inp = torch.tensor([[HUMAN_BEG] + text_ids + [HUMAN_END] + [ASST_BEG]], device=model.device)
    out = model.generate(
        input_ids=inp,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
        eos_token_id=ASST_END,
    )
    gen = out[0].tolist()[len(inp[0]):]
    wav, sr = decode_audio_tokens(gen)
    return wav, sr

# Example:
# urdu_line = "Isay azmanye ke liye apni zuban aur input tool zail mein chunein aur type karna shuru karein."
# wav, sr = synthesize(urdu_line)
# import soundfile as sf; sf.write("ablation_out.wav", wav, sr)
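
Sampling can stop mid-frame (for example when `max_new_tokens` is reached) or emit an id outside the expected slot range, and either case trips the assertion in `decode_audio_tokens` or produces invalid codes. One possible guard, sketched here under the same `SNAC_BASE` and `CODEBOOK_SIZE` assumptions as the packing cell, keeps only the leading run of complete, in-range frames. `clean_audio_tokens` is a hypothetical helper, not part of the notebook.

```python
SNAC_BASE = 128266
CODEBOOK_SIZE = 4096


def clean_audio_tokens(aud):
    """Return the longest prefix of `aud` made of whole, in-range 7-token frames."""
    kept = []
    for i in range(len(aud) // 7):
        frame = aud[7 * i : 7 * i + 7]
        in_range = all(
            0 <= tok - (SNAC_BASE + k * CODEBOOK_SIZE) < CODEBOOK_SIZE
            for k, tok in enumerate(frame)
        )
        if not in_range:
            break  # stop at the first bad frame so later frames stay aligned
        kept.extend(frame)
    return kept


# One valid frame (code value 1 in every slot)
good = [SNAC_BASE + k * CODEBOOK_SIZE + 1 for k in range(7)]
trimmed = clean_audio_tokens(good + good[:3])          # trailing partial frame is dropped
stopped = clean_audio_tokens(good + [0] * 7 + good)    # stops at the out-of-range frame
```

If used, the helper would be applied to the audio span before the length check in `decode_audio_tokens`. Whether silently truncating is preferable to raising depends on how the output is consumed.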
