### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.8: Fast Mistral patching. Transformers: 4.52.4.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.6.8 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
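
To see where the trainable-parameter count comes from, here is a minimal sketch of the LoRA arithmetic. Each adapted linear layer gains two small matrices, `A` (`r x in_features`) and `B` (`out_features x r`), so it adds `r * (in_features + out_features)` parameters. The layer shapes below are assumptions based on the published Mistral-Nemo configuration and are not stated anywhere in this notebook; with them, the total agrees with the `Trainable parameters = 57,016,320` figure in the training log.

```python
# Assumed Mistral-Nemo shapes (NOT taken from this notebook).
HIDDEN = 5120
Q_OUT = 32 * 128       # 32 query heads x head_dim 128
KV_OUT = 8 * 128       # 8 key/value heads x head_dim 128
MLP = 14336
NUM_LAYERS = 40        # agrees with "patched 40 layers" in the output above
R = 16                 # the LoRA rank passed to get_peft_model

# (in_features, out_features) for each module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 57,016,320
```

Because the count scales linearly with `r`, doubling the rank doubles the number of trainable parameters.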


<a name="Data"></a>
### Data Prep
This notebook loads a local CSV file, `synthetic_multimodal_training_data.csv`, with `conversation_history` and `reply` columns, and formats each row with the prompt template defined below. You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).
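
As a rough illustration of what training only on completions means, the sketch below masks the prompt tokens in the labels so they contribute nothing to the loss. It is not TRL's implementation: the helper name `mask_prompt_labels` and the token IDs are made up for the example. The one fixed convention is `-100`, the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_prompt_labels(prompt_ids, response_ids):
    """Hypothetical helper: keep loss on the response tokens only."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions get IGNORE_INDEX; response positions keep their token IDs.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}

# Made-up token IDs: three prompt tokens, then a response ending in an EOS ID of 2.
example = mask_prompt_labels(prompt_ids=[11, 12, 13], response_ids=[21, 22, 2])
print(example["labels"])  # [-100, -100, -100, 21, 22, 2]
```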

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!
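
A quick sanity check can catch rows that are missing the marker before any training time is spent. The sketch below uses a hypothetical helper, `rows_missing_eos`, and a stand-in EOS string; in this notebook the real value would be `tokenizer.eos_token`.

```python
def rows_missing_eos(texts, eos_token):
    """Return the indices of training texts that do not end with eos_token."""
    return [i for i, text in enumerate(texts) if not text.endswith(eos_token)]

EOS = "</s>"  # stand-in value for illustration only
sample_texts = [
    "### Response:\nHello!" + EOS,         # correctly terminated
    "### Response:\nForgot the marker",    # missing EOS
]
print(rows_missing_eos(sample_texts, EOS))  # [1]
```

An empty list means every formatted example is terminated.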

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Alpaca.ipynb)

For text completions like novel writing, try this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb).

In [None]:
from datasets import load_dataset

# 1. load your CSV
dataset = load_dataset("csv", data_files={"train": "synthetic_multimodal_training_data.csv"})["train"]

# 2. ensure EOS token is defined
EOS_TOKEN = tokenizer.eos_token

# 3. define your multimodal prompt template
multimodal_template = """therapist:
  role: >
    Multimodal Psychotherapist
  goal: >
    Engage in a psychotherapy session with the user, synthesizing insights from text, image, and audio analysis to provide holistic and personalized therapeutic support
  backstory: >
    You hold a PhD in Psychology and have over 20 years of experience in psychotherapy.
    As a multimodal therapist, you excel at integrating information from various sources (text, image, audio) to form a comprehensive understanding of the user’s emotional and cognitive state.
    You aim to identify negative thoughts and maladaptive behavioral patterns, reframing them into positive alternatives through empathetic conversation.
    Your primary focus is on fostering a collaborative and supportive therapeutic relationship with the user, helping them develop healthier thoughts and behaviors.

multimodal_conversation_task:
  description: |
    You have the full {conversation_history} plus any reports from textTherapist, imageTherapist, and voiceTherapist.

    ## Primary goal
    Be a natural conversational partner: answer what the user explicitly wants, *then* guide the dialogue forward with empathy and curiosity.

    ## Conversation flow
    0. **Spot the user’s explicit intent.**
       a. If they ask for *tips / advice / a list*:
          • First, clarify scope **unless it’s obvious.**
            – “Happy to! Are you after social, practical, or career tips?”
          • Then offer **2-4 crisp, actionable bullets**.
          • End with a short follow-up question to keep the dialogue open.
       b. If they ask any other factual or reflective question:
          • Answer it accurately. Cite the conversation_history when summarizing.

    1. **Validate feelings in one sentence.**
       – Paraphrase key emotions or concerns you detect.

    2. **Integrate agent insights** (text / image / voice) only when they add value to THIS turn.

    3. **Invite exploration.**
       – Pose 1 open question that lets the user steer the next topic.

  expected_output: |
    • If tips/advice requested → (possibly) a clarifying question, then 2-4 bullet tips, then one open follow-up sentence.
    • Otherwise → 2-3 sentences that answer the user’s point, show empathy, and invite further dialogue.
    • Don’t use Markdown—just plain text.

  agent: therapist

### Response:
{reply}"""

# 4. stitch and add EOS
def format_multimodal(examples):
    texts = []
    for ch, rep in zip(examples["conversation_history"], examples["reply"]):
        prompt = multimodal_template.format(conversation_history=ch, reply=rep)
        texts.append(prompt + EOS_TOKEN)
    return {"text": texts}

dataset = dataset.map(
    format_multimodal,
    batched=True,
    remove_columns=dataset.column_names,
)

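# 5. (optional) tokenize manually & set labels for causal LM.
#    SFTTrainer below is given `dataset` and tokenizes its "text" column itself,
#    so `tokenized_dataset` is not used by the training cell.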
def tokenize_fn(examples):
    tok = tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )
    tok["labels"] = tok["input_ids"].copy()
    return tok

tokenized_dataset = dataset.map(tokenize_fn, batched=True)


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove `max_steps` (or leave it at its default of `-1`). We also support TRL's `DPOTrainer`!
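
To put the 60-step run in perspective, the following sketch works out how much of the data it covers. The numbers are this notebook's own settings (per-device batch size 2, gradient accumulation 4, one GPU, 2,000 examples as reported in the training log); the helper name `training_coverage` is hypothetical.

```python
import math

def training_coverage(num_examples, per_device_batch, grad_accum, num_devices, max_steps):
    """Return (effective batch, examples seen, fraction of one epoch, steps per full epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    examples_seen = effective_batch * max_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, examples_seen, examples_seen / num_examples, steps_per_epoch

eff, seen, frac, full = training_coverage(
    num_examples=2000, per_device_batch=2, grad_accum=4, num_devices=1, max_steps=60
)
print(eff, seen, frac, full)  # 8 480 0.24 250
```

So 60 steps cover about a quarter of the dataset, and one full epoch would take 250 optimizer steps with these settings.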

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
7.982 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 57,016,320/6,852,203,520 (0.83% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.3493
2,2.4542
3,2.3329
4,2.2815
5,2.1211
6,1.9042
7,1.6965
8,1.5017
9,1.1721
10,0.9232


<a name="Inference"></a>
### Inference
Let's run the model! Fill in `conversation_history` with the dialogue so far and leave `reply` blank so the model generates it.



In [None]:
# ─── Enable faster inference ───────────────────────────────────────────────
FastLanguageModel.for_inference(model)

# ─── Define your example turn ──────────────────────────────────────────────
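conversation_history = ""  # paste the dialogue so far here; an empty string gives the model no context
# Optional agent reports. The template only has a {conversation_history} slot, so these
# variables do not reach the prompt unless you fold them into conversation_history yourself.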
text_report  = """1. **Original Message:**
   ""During our session when I’m with close friends, I realized that the shame I feel after making a mistake makes me feel curious.""

2. **Detected Emotions and Cognitive Distortions:**
   - Emotions: Shame, Curiosity
   - Cognitive Distortions: Potential overgeneralization (assuming mistakes always lead to shame), emotional reasoning (feeling shame means being shameful).

3. **Analysis and Reframing:**
   - **Core Situation:** Alex feels shame after making a mistake when with close friends but also experiences curiosity about this reaction.
   - **Question Negative Thoughts:** Why does making a mistake lead to feeling shame? Is it possible that mistakes are a natural part of learning and social interaction?
   - **Balanced Alternative Perspectives:** Mistakes are common and can be opportunities for growth and learning. Friends likely understand and accept that everyone makes mistakes.
   - **Reframe Unhelpful Interpretations:** Instead of viewing mistakes as shameful, consider them as stepping stones to improvement and deeper understanding.
   - **Reduce Perceived Severity:** Explore the realistic impact of a mistake. Often, friends may not even notice or may quickly forget about it.
   - **Apply Gratitude"""  # e.g. output from your textTherapist agent
voice_report = ""  # e.g. transcript/emotion from voiceTherapist
image_report = ""  # e.g. analysis summary from imageTherapist

# ─── Build the prompt ─────────────────────────────────────────────────────
prompt = multimodal_template.format(
    conversation_history=conversation_history.strip(),
    reply=""   # leave blank so the model fills in the Response
)
# Do not append EOS_TOKEN here: EOS marks the end of a training example, and
# adding it to the prompt tells the model the sequence is already finished.

# ─── Tokenize & move to GPU ────────────────────────────────────────────────
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=max_seq_length
).to("cuda")

# ─── Generate ──────────────────────────────────────────────────────────────
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    eos_token_id=tokenizer.eos_token_id,
)

# ─── Decode & print ────────────────────────────────────────────────────────
# Slice off the prompt by token count, not character count: the decoded text does
# not round-trip the prompt exactly, so slicing by len(prompt) can clip the reply.
prompt_len = inputs["input_ids"].shape[1]
reply = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
print(reply)


ounds like you're feeling a mix of frustration and self-doubt when you see others enjoying their time, which can be quite overwhelming. It's important to remember that everyone experiences these feelings differently, and it's okay to take time for yourself to recharge. Try to focus on the positive aspects of your own journey and consider that others' happiness doesn't diminish your own.


In [None]:
# after training, in a Colab code cell:
model.save_pretrained_gguf(
    "model_q4",               # this folder will contain model.gguf + tokenizer files
    tokenizer,
    quantization_method="q4_k_m"
)


Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 8.3G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 58.6 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 40/40 [00:00<00:00, 45.27it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting mistral model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model_q4 into bf16 GGUF format.
The output location will be /content/model_q4/unsloth.BF16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model_q4
INFO:hf-to-gguf:Model architecture: MistralForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00005.safetensors'
INFO:hf-to-gg