### ⚠️ warning: this notebook might contain potentially harmful content - for research purposes only

### LoRA fine-tuning of `gemma-3-4b-it` (a VLM) on text-only data
This notebook has been adapted for text-only fine-tuning: the model is loaded with `FastVisionModel`, the vision layers are left frozen, and the data is formatted with the tokenizer's chat template.
Original source: https://docs.unsloth.ai/get-started/unsloth-notebooks

In [None]:
# Fix tokenizers parallelism warning
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
from unsloth import FastVisionModel  # FastLanguageModel for LLMs
import torch

max_seq_length = 512  # Choose any! We auto support RoPE Scaling internally!

model, tokenizer = FastVisionModel.from_pretrained(
    "google/gemma-3-4b-it",
    # load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for long context
    max_seq_length=max_seq_length,
)

In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,  # False if not finetuning vision layers
    finetune_language_layers=True,  # False if not finetuning language layers
    finetune_attention_modules=True,  # False if not finetuning attention layers
    finetune_mlp_modules=True,  # False if not finetuning MLP layers
    r=64,  # The larger, the higher the accuracy, but might overfit
    lora_alpha=128,  # Recommended alpha == r at least
    lora_dropout=0.05,
    bias="none",
    random_state=3407,
    use_rslora=True,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)
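
With `use_rslora=True`, the adapter update is scaled by `lora_alpha / sqrt(r)` instead of the standard `lora_alpha / r`, so the same `r` and `lora_alpha` give a noticeably larger scale. A small sketch of the two formulas, using the values from the cell above (the helper `lora_scale` is illustrative only, not an Unsloth or PEFT function):

```python
import math


def lora_scale(r, lora_alpha, use_rslora):
    """Return the factor applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r  # standard LoRA


r, lora_alpha = 64, 128  # values from the get_peft_model call above
standard = lora_scale(r, lora_alpha, use_rslora=False)
rank_stabilized = lora_scale(r, lora_alpha, use_rslora=True)
print(standard, rank_stabilized)  # 2.0 16.0
```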

In [None]:
model

<a name="Data"></a>
### Data Prep
For dataset preparation, refer to [data-prep.ipynb](data-prep.ipynb); the prepared data is saved as Parquet.

In [None]:
import pandas as pd

dataset = pd.read_parquet("data/hate_speech_train_10k.parquet")

In [None]:
dataset["prompt"][5]

In [None]:
from datasets import Dataset

# Convert pandas DataFrame to Hugging Face Dataset
dataset_hf = Dataset.from_pandas(dataset)

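For reference, Unsloth's vision finetuning notebooks format each example as follows. This notebook trains on text only, so the `{"type": "image", ...}` entry is omitted and `format_chat_template` below passes plain-string `content` instead: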

```python
[
{ "role": "user",
  "content": [{"type": "text",  "text": Q}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": A} ]
},
]
```
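
For the text-only case, the same structure reduces to a list of text turns with no image entry. A minimal sketch, assuming a hypothetical helper named `to_text_messages` (it is not part of Unsloth or of this notebook) and placeholder strings:

```python
def to_text_messages(prompt, response=None):
    """Build a text-only conversation in the content-list format shown above."""
    messages = [
        {"role": "user", "content": [{"type": "text", "text": prompt}]},
    ]
    if response is not None:
        # Training examples include the assistant turn; inference prompts do not.
        messages.append(
            {"role": "assistant", "content": [{"type": "text", "text": response}]}
        )
    return messages


# Placeholder strings, for illustration only
example = to_text_messages("example question", "example answer")
print(example)
```

Calling it without `response` gives the user-only message list that the inference cells pass to `apply_chat_template` with `add_generation_prompt=True`.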

In [None]:
# Separate chat template formatting and tokenization functions
def format_chat_template(examples):
    """Apply chat template formatting to create formatted text"""
    formatted_texts = []
    for prompt, response in zip(examples["prompt"], examples["response"]):
        messages = [
            {"role": "system", "content": ""},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=False
        )
        formatted_texts.append(text)

    return {"formatted_text": formatted_texts}


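def tokenize(examples, max_seq_length=max_seq_length):
    """Tokenize the formatted text"""
    # Default to the global max_seq_length (512) so truncation matches the model setup
    # Use the language model tokenizer directly
    model_inputs = tokenizer.tokenizer(  # Access the underlying text tokenizer
        examples["formatted_text"],
        max_length=max_seq_length,
        truncation=True,
        padding=False,  # No padding here; the data collator pads each batch
        add_special_tokens=False,  # Chat template already adds <bos> (as in the inference cells)
        return_tensors=None,
    )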

    # Copy input_ids to labels
    model_inputs["labels"] = [ids.copy() for ids in model_inputs["input_ids"]]
    return model_inputs


# First apply chat template formatting to create formatted_text column
formatted_dataset = dataset_hf.map(
    format_chat_template, batched=True, desc="Formatting chat templates"
)

# Then tokenize the formatted dataset
fixed_tokenized_dataset = formatted_dataset.map(
    tokenize,
    batched=True,
    remove_columns=formatted_dataset.column_names,
    desc="Fixed tokenizing dataset",
)


# Use standard language modeling data collator
from transformers import DataCollatorForLanguageModeling
from trl import SFTTrainer, SFTConfig

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer.tokenizer,  # Use underlying text tokenizer
    mlm=False,
)
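
With `mlm=False`, the collator prepares batches for causal language modeling: sequences are padded to a common length, and label positions that correspond to padding are set to `-100` so the loss ignores them. Every real token, prompt and response alike, contributes to the loss. A simplified pure-Python illustration of that behaviour (not the library's implementation; right-padding and the helper name `collate_causal_lm` are assumptions):

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def collate_causal_lm(batch_input_ids, pad_token_id):
    """Right-pad a batch and build labels the way a causal-LM collator would."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Real tokens keep their ids as labels; padded positions are ignored.
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```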

### for inference on the base model

In [None]:
FastVisionModel.for_inference(model)  # Enable for inference!

In [None]:
from PIL import Image

image = Image.open("test.png")
instruction = "describe the image"

messages = [
    {"role": "system", "content": ""},
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    },
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
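
The `temperature=1.5` and `min_p=0.1` settings work together: temperature flattens the distribution, and min-p then drops every token whose probability is below `min_p` times the probability of the most likely token. A pure-Python sketch of that filtering, assuming temperature is applied before the min-p cutoff (the function name and the toy logits are illustrative):

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)  # cutoff is relative to the most likely token
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


# Toy logits for a four-token vocabulary
probs = min_p_filter([4.0, 3.0, 0.0, -2.0], temperature=1.5, min_p=0.1)
print([round(p, 3) for p in probs])
```

Here the two low-probability tokens fall under the cutoff and are removed, and sampling proceeds over the remaining two.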

<a name="Train"></a>
### Train the model
Now let's train our model. The configuration below runs 5 full epochs (`num_train_epochs=5`); for a quick smoke test, set `max_steps=60` instead. We also support TRL's `DPOTrainer`!

Because the data is text-only and already tokenized, we pass the `DataCollatorForLanguageModeling` defined above rather than `UnslothVisionDataCollator`.
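
The batch-size and schedule settings in the `SFTConfig` below translate into step counts as follows. This is a rough sketch: the 10,000-row dataset size is assumed from the file name, a single GPU is assumed, and the trainer's exact rounding of the last partial accumulation step may differ by a step or two.

```python
import math

num_examples = 10_000            # assumed from hate_speech_train_10k.parquet
per_device_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 1                     # assumption: single-GPU run
num_epochs = 5
warmup_steps = 100

# Samples consumed per optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(effective_batch, steps_per_epoch, total_steps, round(warmup_fraction, 3))
# 32 313 1565 0.064
```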

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer.tokenizer,
    train_dataset=fixed_tokenized_dataset,  # already formatted and tokenized above
    data_collator=data_collator,  # causal-LM collator defined above
    args=SFTConfig(
        # LOSS: the dataset is pre-tokenized and a custom collator is passed, so the
        # loss covers every token (prompt and response), not the completion only
        # BATCH SIZE
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,  # Effective batch = 4 * 8 = 32 per device
        # TRAINING DURATION
        num_train_epochs=5,
        # LEARNING RATE
        learning_rate=1e-4,
        max_grad_norm=0.3,  # gradient clipping (0.3 to 1.0 works; 0.3 is safe)
        # OPTIMIZER
        optim="adamw_torch",  # standard PyTorch AdamW
        weight_decay=0.01,  # light weight decay on the trainable (LoRA) parameters
        # SCHEDULING
        warmup_steps=100,  # about 6% of the ~1,560 optimizer steps (10k samples, one GPU)
        # warmup_ratio=0.01,  # alternative to warmup_steps: ~1% of total steps
        lr_scheduler_type="cosine",
        # LOGGING & CHECKPOINTS - Add these!
        logging_steps=10,
        save_strategy="steps",
        save_steps=100,
        # save_total_limit = 5,
        load_best_model_at_end=False,
        # DATA EFFICIENCY
        dataloader_num_workers=4,  # Speed up data loading
        # WANDB - Add this!
        report_to="wandb",
        run_name="gemma-3-hate-speech-lora-long",
        # report_to = "none",
        # PRE-TOKENIZED DATA: skip TRL's own dataset preparation
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_length=max_seq_length,
        # RANDOM SEED
        seed=3407,
        output_dir="outputs-gemma3-long-5ep",
        gradient_checkpointing=True,
    ),
)

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")


### inference with the fine-tuned model

In [None]:
import os

# Fix tokenizers parallelism warning (set before the tokenizer is used)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from unsloth import FastVisionModel

max_seq_length = 512

# Load the base model that was fine-tuned above, then apply the trained LoRA adapters
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="google/gemma-3-4b-it",
    # load_in_4bit=True,
    max_seq_length=max_seq_length,
)

# Load the LoRA adapters from a training checkpoint
# (output_dir above is "outputs-gemma3-long-5ep"; pick a checkpoint-<step> that exists)
from peft import PeftModel

model = PeftModel.from_pretrained(model, "outputs-gemma3-long-5ep/checkpoint-400")

FastVisionModel.for_inference(model)  # Enable for inference!

In [None]:
# import PIL

# image = PIL.Image.open("test.png")
instruction = 'say "hello"'

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": [{"type": "text", "text": instruction}]},
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

inputs = tokenizer(
    text=input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=False,
    temperature=1.5,
    min_p=0.1,
)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
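
The adapters are small because each adapted linear layer stores only two low-rank matrices instead of a full weight. A back-of-the-envelope sketch, where the layer size is a hypothetical value chosen for illustration and `lora_param_count` is an illustrative helper:

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 2560  # hypothetical layer size, for illustration only
r = 64               # rank used in this notebook
full = d_in * d_out
adapter = lora_param_count(d_in, d_out, r)
print(adapter, full, round(adapter / full, 3))  # 327680 6553600 0.05
```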

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastVisionModel

    model, tokenizer = FastVisionModel.from_pretrained(
        model_name="lora_model",  # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit=True,  # Set to False for 16bit LoRA
    )
    FastVisionModel.for_inference(model)  # Enable for inference!

# This notebook is text-only, so prompt with text instead of an image
instruction = 'say "hello"'

messages = [
    {"role": "user", "content": [{"type": "text", "text": instruction}]},
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    text=input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
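
Merging folds each LoRA update into its base weight, so the saved checkpoint is an ordinary dense model that can be loaded without PEFT. Conceptually, the merged weight is `W + scale * (B @ A)`. A toy sketch with made-up numbers (not Unsloth's implementation; `matmul` and `merge_lora` are illustrative helpers):

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scale):
    """Return W + scale * (B @ A), the merged dense weight."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all numbers are made up for illustration)
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (r=1, d_in=2)
lora_b = [[0.5], [0.0]]  # shape (d_out=2, r=1)
merged = merge_lora(weight, lora_a, lora_b, scale=2.0)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```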

In [None]:
# Select ONLY 1 to save! (Both not needed!)

# Save locally to 16bit
if False:
    model.save_pretrained_merged(
        "unsloth_finetune",
        tokenizer,
    )

# To export and save to your Hugging Face account
if False:
    model.push_to_hub_merged(
        "YOUR_USERNAME/unsloth_finetune", tokenizer, token="PUT_HERE"
    )