<a href="https://colab.research.google.com/github/TrelisResearch/ai-worlds-fair-2025/blob/main/Trelis_TTS_Fine_tuning_Worlds_Fair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trelis Text to Speech Fine-tuning at the AI World's Fair

Find more detailed videos on the [Trelis YouTube channel](https://youtube.com/@TrelisResearch).

*Adapted, with appreciation, from the original [Unsloth Sesame CSM (1B) TTS notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Sesame_CSM_(1B)-TTS.ipynb) by **UnslothAI**.*

## Data Generation (YouTube → Whisper → HF Dataset)

This optional section lets you **bootstrap a speech dataset from any YouTube video** in just a few commands:

1. Enter a YouTube URL.  
2. The audio is downloaded (yt‑dlp) and saved locally.  
3. `whisper` transcribes the audio → a single JSON file you can manually edit in Colab if needed.  
4. Audio is automatically sliced into ≤ 30‑second clips, one row per clip (`audio`, `text`).  
5. The resulting `datasets.Dataset` is **pushed to Hugging Face** under the org/repo of your choice.

Feel free to skip this entire block if you already have prepared data.
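
The slicing in step 4 can be sketched as a greedy grouping over transcript segments. This is a minimal illustration rather than the notebook's exact cell: `bundle_segments` and the `demo` data are hypothetical names, and it assumes each segment is a dict with `start`, `end`, and `text`, as in Whisper's JSON output.

```python
def bundle_segments(segments, max_sec=30.0):
    """Greedily merge consecutive segments into bundles no longer than max_sec.

    Each segment is a dict with "start", "end" (seconds) and "text".
    Returns a list of (start, end, text) tuples.
    """
    bundles = []
    current = None  # (start, end, text) of the bundle being built
    for seg in segments:
        start, end, text = seg["start"], seg["end"], seg["text"].strip()
        if end - start > max_sec:
            continue  # a single segment that cannot fit is skipped
        if current is not None and end - current[0] <= max_sec:
            # Extending keeps the bundle within the limit.
            current = (current[0], end, current[2] + " " + text)
        else:
            if current is not None:
                bundles.append(current)
            current = (start, end, text)
    if current is not None:
        bundles.append(current)
    return bundles


demo = [
    {"start": 0.0, "end": 12.0, "text": " First sentence."},
    {"start": 12.0, "end": 25.0, "text": " Second sentence."},
    {"start": 25.0, "end": 40.0, "text": " Third sentence."},
]
for bundle in bundle_segments(demo):
    print(bundle)
```

With the demo data, the first two segments merge into one 25-second bundle and the third starts a new one, because adding it would stretch the bundle to 40 seconds.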


In [5]:
#@title 📥 Download & transcribe a YouTube video with Whisper
#@markdown ℹ️ After running, you'll find **transcript_whisper.json** in the working directory. Edit it manually if you wish and then execute the next cell.

import torch

youtube_url = "https://youtu.be/hFZROKQ0PS0"  #@param {type:"string"}
model_size  = "turbo"                                   #@param ["tiny","base","small","medium","large-v3","turbo"]
device      = "cuda" if torch.cuda.is_available() else "cpu"

In [8]:
# ▸ 1.  Dependencies  ──────────────────────────────────────────────
# ▸ installs only once – runs fast on Colab
!pip -q install --upgrade yt_dlp ffmpeg-python \
                     git+https://github.com/openai/whisper.git \
                     nltk datasets soundfile

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for openai-whisper (pyproject.toml) ... done


In [9]:
import subprocess, json, whisper, uuid, os

# ▸ fetch the audio
audio_out = "source_audio.m4a"
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "m4a", "-o", audio_out, youtube_url],
    check=True
)

# ▸ load Whisper & transcribe
model = whisper.load_model(model_size, device=device)
result = model.transcribe(
    audio_out,
    fp16 = device == "cuda",   # half precision on GPU to keep VRAM low; disabled on CPU
    verbose = False
)

# ▸ save the raw Whisper JSON
json_path = "transcript_whisper.json"
with open(json_path, "w") as f:
    json.dump(result, f, indent=2)

print(f"✅ Transcript saved → {json_path}. "
      "Open it to review or edit before we slice/merge into ≤30-s rows.")

100%|█████████████████████████████████████| 1.51G/1.51G [00:23<00:00, 68.5MiB/s]


Detected language: English


100%|██████████| 111838/111838 [00:51<00:00, 2166.41frames/s]

✅ Transcript saved → transcript_whisper.json. Open it to review or edit before we slice/merge into ≤30-s rows.
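
If you edit the transcript by hand, a quick sanity check before slicing can catch broken timestamps or empty text. The sketch below is an assumption-laden helper, not part of the original notebook: `validate_segments` is a hypothetical name, and the inline `sample` stands in for the `segments` list loaded from `transcript_whisper.json`.

```python
import json


def validate_segments(segments):
    """Return a list of human-readable problems found in Whisper-style segments."""
    problems = []
    previous_end = 0.0
    for i, seg in enumerate(segments):
        missing = [k for k in ("start", "end", "text") if k not in seg]
        if missing:
            problems.append(f"segment {i}: missing keys {missing}")
            continue
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end is not after start")
        if seg["start"] < previous_end:
            problems.append(f"segment {i}: overlaps the previous segment")
        if not seg["text"].strip():
            problems.append(f"segment {i}: empty text")
        previous_end = seg["end"]
    return problems


# Inline stand-in for json.load(open("transcript_whisper.json"))["segments"].
sample = json.loads("""[
  {"start": 0.0, "end": 4.2, "text": " Hello there."},
  {"start": 4.0, "end": 6.5, "text": " "}
]""")
print(validate_segments(sample))
```

An empty list means the segments look consistent; anything else points at the segment index to fix before running the slicing cell.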





In [10]:
from huggingface_hub import login
login()


In [11]:
# ✂️ Group sentences into ≤30-s clips & push to Hugging Face
from pathlib import Path
import json, subprocess, uuid, datasets, os

# ── parameters ───────────────────────────────────────────────────
HF_ORG    = "Trelis"          #@param {type:"string"}
REPO_NAME = "my_youtube_tts"  #@param {type:"string"}
MAX_SEC   = 30                #@param {type:"integer"}

AUDIO_SOURCE = "source_audio.m4a"   # created earlier
JSON_PATH    = "transcript_whisper.json"

# ── load Whisper JSON ────────────────────────────────────────────
with open(JSON_PATH) as f:
    data = json.load(f)

segments = data["segments"]            # list of dicts with start/end/text

rows, bundle = [], None                # bundle = [start, end, text]

def flush_bundle(b):
    """Cut audio [b[0], b[1]) → wav, append row dict to rows."""
    if b is None: return
    start, end, text = b
    clip = f"clip_{uuid.uuid4().hex}.wav"
    subprocess.run([
        "ffmpeg","-loglevel","error","-y",
        "-i", AUDIO_SOURCE,
        "-ss", f"{start}",
        "-to", f"{end}",
        "-ar","16000","-ac","1", clip
    ], check=True)
    rows.append({"audio": clip, "text": text.strip()})

for seg in segments:
    s, e, t = seg["start"], seg["end"], seg["text"]
    dur = e - s
    if dur > MAX_SEC:
        # individual sentence too long → drop
        continue

    if bundle is None:
        bundle = [s, e, t]
        continue

    b_start, b_end, b_text = bundle
    if (e - b_start) <= MAX_SEC:
        # we can extend current bundle
        bundle = [b_start, e, b_text + " " + t]
    else:
        # flush current bundle, start new one
        flush_bundle(bundle)
        bundle = [s, e, t]

# flush last bundle
flush_bundle(bundle)

print(f"Generated {len(rows)} clips.")

# ── build HF dataset ─────────────────────────────────────────────
ds = datasets.Dataset.from_list(rows)
ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))

repo_id = f"{HF_ORG}/{REPO_NAME}"
print(f"Pushing to {repo_id} …")
ds.push_to_hub(repo_id, private=False)
print("✅ Done!")

Generated 41 clips.
Pushing to Trelis/my_youtube_tts …


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/41 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

✅ Done!


## Fine-tuning

### Unsloth Installation

In [12]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.52.3

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [1]:
from unsloth import FastModel
from transformers import CsmForConditionalGeneration
import torch

model, processor = FastModel.from_pretrained(
    model_name = "unsloth/csm-1b",
    max_seq_length= 2048, # Choose any for long context!
    dtype = None, # Leave as None for auto-detection
    auto_model = CsmForConditionalGeneration,
    load_in_4bit = False, # Select True for 4bit - reduces memory usage
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.7: Fast Csm patching. Transformers: 4.52.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


model.safetensors:   0%|          | 0.00/4.15G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/264 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/449 [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/2.00k [00:00<?, ?B/s]

unsloth/csm-1b does not have a padding token! Will use pad_token = <|PAD_TOKEN|>.


In [2]:
print(model)

CsmForConditionalGeneration(
  (lm_head): Linear(in_features=2048, out_features=2051, bias=False)
  (embed_text_tokens): Embedding(128256, 2048)
  (backbone_model): CsmBackboneModel(
    (embed_tokens): CsmBackboneModelEmbeddings(
      (embed_audio_tokens): Embedding(65632, 2048)
    )
    (layers): ModuleList(
      (0-15): 16 x CsmDecoderLayer(
        (self_attn): CsmAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): CsmMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.backbone_model` require gradients
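
To see where the small trainable fraction comes from, the sketch below counts the LoRA weights added to the backbone layers printed earlier and compares the two scaling rules. It relies on two general facts about LoRA: a rank-`r` adapter on a `Linear(in, out)` adds `r * (in + out)` weights, and rank-stabilized LoRA scales the update by `alpha / sqrt(r)` instead of `alpha / r`. The printed model is truncated, so only the 16 backbone layers are counted here; the helper names are hypothetical.

```python
import math

# (in_features, out_features) of the backbone projections printed above.
BACKBONE_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}
NUM_BACKBONE_LAYERS = 16  # "(0-15): 16 x CsmDecoderLayer"


def lora_params(in_features, out_features, rank):
    """A LoRA pair A (rank x in) and B (out x rank) adds rank * (in + out) weights."""
    return rank * (in_features + out_features)


def lora_scale(alpha, rank, rank_stabilized):
    """Standard LoRA scales the update by alpha / r; rsLoRA uses alpha / sqrt(r)."""
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank


rank, alpha = 32, 16
per_layer = sum(lora_params(i, o, rank) for i, o in BACKBONE_SHAPES.values())
backbone_total = per_layer * NUM_BACKBONE_LAYERS
print(f"LoRA weights per backbone layer: {per_layer:,}")
print(f"LoRA weights in the backbone:    {backbone_total:,}")
print(f"scale with rsLoRA:    {lora_scale(alpha, rank, True):.3f}")
print(f"scale without rsLoRA: {lora_scale(alpha, rank, False):.3f}")
```

The backbone accounts for 22,544,384 of the 29,032,448 trainable parameters that the trainer log reports further down; the remainder presumably sits in matching projection layers of modules cut off in the printout. With `r = 32` and `lora_alpha = 16`, enabling `use_rslora` raises the effective scale from 0.5 to about 2.83.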


<a name="Data"></a>
### Data Prep  

We will use the `Trelis/my_youtube_tts` dataset pushed in the data generation section above. Your dataset must provide the columns **text, audio** for single-speaker models, or **source, text, audio** for multi-speaker models, where `source` identifies the speaker. You can adapt this section to your own dataset as long as it keeps that structure.

In [16]:
#@title Dataset Prep functions
from datasets import load_dataset, Audio, Dataset
import os
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("unsloth/csm-1b")

raw_ds = load_dataset("Trelis/my_youtube_tts", split="train")
# raw_ds = load_dataset("Trelis/orpheus-ft", split="train")

# Getting the speaker id is important for multi-speaker models and speaker consistency
speaker_key = "source"
if "source" not in raw_ds.column_names and "speaker_id" not in raw_ds.column_names:
    print("Unsloth: No speaker found, adding default \"source\" of 0 for all examples")
    new_column = ["0"] * len(raw_ds)
    raw_ds = raw_ds.add_column("source", new_column)
elif "source" not in raw_ds.column_names and "speaker_id" in raw_ds.column_names:
    speaker_key = "speaker_id"

target_sampling_rate = 24000
raw_ds = raw_ds.cast_column("audio", Audio(sampling_rate=target_sampling_rate))

# Find the longest clip (in samples) so every example can be padded to the same length.

max_audio_length = 0
for example in raw_ds:
    # Access the audio array length
    audio_length = len(example["audio"]["array"])
    if audio_length > max_audio_length:
        max_audio_length = audio_length

print(f"Maximum audio length in the dataset: {max_audio_length}")

Unsloth: No speaker found, adding default "source" of 0 for all examples
Maximum audio length in the dataset: 718080
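
The maximum length above is measured in samples at the 24 kHz target rate. A small conversion, sketched here with hypothetical helper names, confirms that the longest clip respects the 30-second limit used when slicing.

```python
TARGET_RATE_HZ = 24_000  # target_sampling_rate in the cell above
MAX_CLIP_SEC = 30        # MAX_SEC used when slicing the clips


def duration_seconds(num_samples, sample_rate=TARGET_RATE_HZ):
    """Length of a mono clip in seconds."""
    return num_samples / sample_rate


longest = duration_seconds(718_080)
print(f"Longest clip: {longest:.2f} s")
print("Within the clip limit:", longest <= MAX_CLIP_SEC)
```

Here 718,080 samples at 24,000 samples per second is 29.92 seconds, just under the limit.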


In [17]:
max_text_length = 0
for example in raw_ds:
    # Access the length of the text string
    text_length = len(example["text"])
    if text_length > max_text_length:
        max_text_length = text_length

print(f"Maximum text length in the dataset: {max_text_length}")

Maximum text length in the dataset: 587


In [18]:
def preprocess_example(example):
    # Use the dataset's speaker column as the conversation role.
    # If that column holds names instead of ids, map them to an id string here.
    speaker_id = example[speaker_key]

    conversation = [
        {
            "role": str(speaker_id),
            "content": [
                {"type": "text", "text": example["text"]},
                {"type": "audio", "path": example["audio"]["array"]},
            ],
        }
    ]

    try:
        model_inputs = processor.apply_chat_template(
            conversation,
            tokenize=True,
            return_dict=True,
            output_labels=True,
            text_kwargs = {
                "padding": "max_length", # pad to the max_length
                "max_length": max_text_length, # cap on the tokenized length; taken here from the longest transcript, counted in characters
                "pad_to_multiple_of": 8,
                "padding_side": "right",
            },
            audio_kwargs = {
                "sampling_rate": 24_000,
                "max_length": max_audio_length, # max input_values length of the whole dataset
                "padding": "max_length",
            },
            common_kwargs = {"return_tensors": "pt"},
        )
    except Exception as e:
        print(f"Error processing example with text '{example['text'][:50]}...': {e}")
        return None

    required_keys = ["input_ids", "attention_mask", "labels", "input_values", "input_values_cutoffs"]
    processed_example = {}
    # print(model_inputs.keys())
    for key in required_keys:
        if key not in model_inputs:
            print(f"Warning: Required key '{key}' not found in processor output for example.")
            return None

        value = model_inputs[key][0]
        processed_example[key] = value


    # Final check (optional but good)
    if not all(isinstance(processed_example[key], torch.Tensor) for key in processed_example):
         print(f"Error: Not all required keys are tensors in final processed example. Keys: {list(processed_example.keys())}")
         return None

    return processed_example

processed_ds = raw_ds.map(
    preprocess_example,
    remove_columns=raw_ds.column_names,
    desc="Preprocessing dataset",
)

Preprocessing dataset:   0%|          | 0/41 [00:00<?, ? examples/s]

In [19]:
print(processed_ds)

Dataset({
    features: ['input_ids', 'attention_mask', 'labels', 'input_values', 'input_values_cutoffs'],
    num_rows: 41
})


<a name="Train"></a>
### Train the model
Now let's use Hugging Face's `Trainer`! More details are in the [Transformers docs](https://huggingface.co/docs/transformers/main_classes/trainer). We train for 60 steps to keep the run short; for a full pass over the data, set `num_train_epochs=1` and remove `max_steps`.

In [20]:
from transformers import TrainingArguments, Trainer
from unsloth import is_bfloat16_supported

trainer = Trainer(
    model = model,
    train_dataset = processed_ds,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 3,
        max_steps = 60,
        # num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01, # Turn this on if overfitting
        lr_scheduler_type = "constant", # "constant" applies no warmup; use "constant_with_warmup" for warmup_steps to take effect
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" or "tensorboard" to log metrics.
    ),
)

In [21]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.719 GB of memory reserved.


In [22]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 41 | Num Epochs = 10 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 29,032,448/1,661,132,609 (1.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 6.3424 |
| 2 | 6.3569 |
| 3 | 5.9232 |
| 4 | 5.812 |
| 5 | 5.8806 |
| 6 | 6.4937 |
| 7 | 5.5143 |
| 8 | 5.6623 |
| 9 | 5.4446 |
| 10 | 5.3492 |
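
The batch size and epoch count in the training banner follow from the trainer arguments. The sketch below reproduces that arithmetic under one assumption, flagged in the code: a trailing partial group of accumulated batches still counts as an optimizer step, which is the rounding that matches the banner here. `training_schedule` is a hypothetical helper, not a Transformers API.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Derive effective batch size, optimizer steps per epoch, and epochs covered."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: a trailing partial accumulation group still counts as one step.
    steps_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    epochs = math.ceil(max_steps / steps_per_epoch)
    return effective_batch, steps_per_epoch, epochs


print(training_schedule(num_examples=41, per_device_batch=2, grad_accum=4, max_steps=60))
```

For 41 examples this gives an effective batch of 8, 6 optimizer steps per epoch, and 10 epochs for 60 steps, so the small dataset is seen many times. That is worth keeping in mind when judging overfitting.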


In [23]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

643.9334 seconds used for training.
10.73 minutes used for training.
Peak reserved memory = 7.287 GB.
Peak reserved memory for training = 0.568 GB.
Peak reserved memory % of max memory = 49.434 %.
Peak reserved memory for training % of max memory = 3.853 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the prompt text below.

In [24]:
from IPython.display import Audio, display
import soundfile as sf

text = "We just finished fine tuning a text to speech model... and it's pretty good!"
speaker_id = 0
inputs = processor(f"[{speaker_id}]{text}", add_special_tokens=True).to("cuda")
audio_values = model.generate(
    **inputs,
    max_new_tokens=125, # 125 tokens is 10 seconds of audio, for longer speech increase this
    # play with these parameters to get the best results
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    #########################################################
    output_audio=True
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example_without_context.wav", audio, 24000)
display(Audio(audio, rate=24000))
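
The comment in the cell above says 125 tokens correspond to roughly 10 seconds of audio. Taking that ratio at face value, a hypothetical helper can size `max_new_tokens` for a target duration.

```python
import math

TOKENS_PER_SECOND = 125 / 10  # from the comment above: 125 tokens ~ 10 s


def max_new_tokens_for(seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Smallest token budget that covers `seconds` of generated audio."""
    return math.ceil(seconds * tokens_per_second)


for seconds in (5, 10, 20):
    print(seconds, "s ->", max_new_tokens_for(seconds), "tokens")
```

This only sets an upper bound on the output length; generation can stop earlier once the model finishes the utterance.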

In [25]:
text = "Sesame is a super cool TTS model which can be fine tuned with Unsloth."

speaker_id = 0
# Another equivalent way to prepare the inputs
conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]
audio_values = model.generate(
    **processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
    ).to("cuda"),
    max_new_tokens=125, # 125 tokens is 10 seconds of audio, for longer speech increase this
    # play with these parameters to get the best results
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    #########################################################
    output_audio=True
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example_without_context_chat_template.wav", audio, 24000)  # separate file so the previous example is not overwritten
display(Audio(audio, rate=24000))

#### Voice and style consistency

Sesame CSM's power comes from providing audio context for each speaker. Let's pass a sample utterance from our dataset to ground speaker identity and style.

In [26]:
speaker_id = 0

utterance = raw_ds[3]["audio"]["array"]
utterance_text = raw_ds[3]["text"]
text = "Sesame is a super cool TTS model which can be fine tuned with Unsloth."

# CSM will fill in the audio for the last text.
# You can even provide a conversation history back in as you generate new audio

conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": utterance_text},{"type": "audio", "path": utterance}]},
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

inputs = processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
    )
audio_values = model.generate(
    **inputs.to("cuda"),
    max_new_tokens=125, # 125 tokens is 10 seconds of audio, for longer text increase this
    # play with these parameters to get the best results
    depth_decoder_temperature=0.6,
    depth_decoder_top_k=0,
    depth_decoder_top_p=0.9,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    #########################################################
    output_audio=True
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example_with_context.wav", audio, 24000)
display(Audio(audio, rate=24000))
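
Since earlier turns can be fed back in as context, it may help to build the conversation list programmatically. The helper below is a sketch that mirrors the dictionary layout used in the cell above. `build_conversation` is a hypothetical name, and the short lists stand in for real waveform arrays.

```python
def build_conversation(history, new_text, speaker_id=0):
    """Build a CSM-style conversation: context turns first, target text last.

    history: list of (text, audio) pairs already spoken by the speaker, where
    audio is whatever the processor accepts under "path" (an array above).
    """
    role = str(speaker_id)
    turns = [
        {"role": role, "content": [
            {"type": "text", "text": text},
            {"type": "audio", "path": audio},
        ]}
        for text, audio in history
    ]
    # The final turn has text only, so the model generates its audio.
    turns.append({"role": role, "content": [{"type": "text", "text": new_text}]})
    return turns


# Placeholder lists stand in for real waveform arrays.
conversation = build_conversation(
    history=[("First context line.", [0.0, 0.1]), ("Second context line.", [0.2, 0.3])],
    new_text="Sentence to synthesize.",
)
print(len(conversation), [len(turn["content"]) for turn in conversation])
```

The result can be passed to `processor.apply_chat_template` exactly as in the cell above. Each extra context turn lengthens the input, so memory use likely grows with the amount of history you include.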

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This saves ONLY the LoRA adapters, not the full model. To save merged 16-bit or 4-bit weights, see the next section.

In [None]:
model.save_pretrained("lora_model")  # Local saving
processor.save_pretrained("lora_model")

[]

In [27]:
hub_slug = "Trelis/csm-trelis-voice"

# Push LoRA - NOTE THAT YOU MUST BE LOGGED IN TO HF.
model.push_to_hub(hub_slug) # Online saving
processor.push_to_hub(hub_slug) # Online saving

README.md:   0%|          | 0.00/543 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/116M [00:00<?, ?B/s]

Saved model to https://huggingface.co/Trelis/csm-trelis-voice


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Trelis/csm-trelis-voice/commit/ec49996495890fdd450f1bbef707f62c75aa24f6', commit_message='Upload processor', commit_description='', oid='ec49996495890fdd450f1bbef707f62c75aa24f6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Trelis/csm-trelis-voice', endpoint='https://huggingface.co', repo_type='model', repo_id='Trelis/csm-trelis-voice'), pr_revision=None, pr_num=None)

### Saving to float16

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# `tokenizer` is not defined in this notebook, so pass the `processor` loaded above instead.
# Merge to 16bit
if False: model.save_pretrained_merged("model", processor, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", processor, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", processor, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", processor, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", processor, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", processor, save_method = "lora", token = "")

And we're done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news, join the Unsloth [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
