![image.png](https://i.imgur.com/a3uAqnb.png)

# 🦙 LLaMA 3.2 (1B & 3B) Conversational Lab

Welcome to the **LLaMA 3.2 Conversational Solution Labs**.  
These labs guide you through building, fine-tuning, and running **conversational AI systems** using **Meta's LLaMA 3.2 models** in two sizes: **1B** and **3B** parameters.

## 📚 What You'll Learn
1. **Model Loading & Setup** – How to load and configure LLaMA 3.2 for conversational use.
2. **Tokenization & Preprocessing** – How to prepare data using tokenizers for conversational datasets.
3. **Fine-Tuning** – Techniques to fine-tune the models on dialogue-specific datasets.
4. **Evaluation** – Methods to assess the performance of conversational models.
5. **Inference & Interaction** – Running the model for chat-based interactions in real time.

## 🛠 Tools & Libraries
- **PyTorch** – For model training & inference.
- **Hugging Face Transformers** – For model and tokenizer utilities.
- **Datasets** – For loading and managing conversation data.
- **Evaluate / Scikit-learn** – For computing metrics like accuracy and log loss.

## 🎯 Lab Objectives
By the end of these labs, you will:
- Understand how to work with LLaMA 3.2 models.
- Be able to fine-tune them for your conversational AI needs.
- Evaluate and deploy them for interactive chat experiences.



### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

| Type                   | Meaning & Use Case                                                                                                                                         | Example Models                                         |
| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ |
| **Base**               | Pretrained only on large general text datasets. Not instruction-tuned. Best for fine-tuning into your own style/task. Needs specific prompting.            | `Llama-3.2-3B`, `Mistral-7B`                           |
| **Instruct**           | Fine‑tuned to follow natural‑language instructions and provide direct, helpful answers without multi‑turn chat context. Optimized for single‑prompt tasks. | `unsloth/Llama-3.2-3B-Instruct`, `Mistral-7B-Instruct` |
| **Chat**               | Like Instruct, but fine‑tuned with *multi‑turn conversations* and chat templates. Good for chatbot apps.                                                   | `Llama-2-Chat`, `Gemma-Chat`                           |
| **Code**               | Fine‑tuned on programming code datasets. Good for coding tasks, debugging, doc generation.                                                                 | `CodeLlama`, `StarCoder`                               |
| **Persona**            | Fine‑tuned to keep a personality or role during responses. Often used in AI characters.                                                                    | `MythoMax`, `Pygmalion`                                |
| **RLHF**               | Instruction-tuned + **Reinforcement Learning from Human Feedback** for safe, preference-aligned answers. Usually overlaps with Instruct/Chat models.       | `Llama-2-Chat`                                         |
| **Math / Reasoning**   | Fine‑tuned for logic, mathematics, problem solving.                                                                                                        | `WizardMath`, `DeepSeek-Math`                          |


In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.11: Fast Llama patching. Transformers: 4.54.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]
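
As a rough back-of-the-envelope check on why `load_in_4bit = True` helps, the sketch below estimates weight storage at two precisions. The parameter count comes from the training log later in this notebook (3,237,063,680 total minus 24,313,856 LoRA parameters). The helper name `weight_gib` is hypothetical, and the estimate ignores quantization metadata, any layers kept in higher precision, activations, and optimizer state, so real memory usage will differ.

```python
# Rough weight-memory estimate only; it ignores quantization constants,
# layers kept in higher precision, activations, and optimizer state.
BASE_PARAMS = 3_237_063_680 - 24_313_856  # total minus LoRA params (from the training log)

def weight_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate size in GiB of num_params weights stored at bits_per_param bits."""
    return num_params * bits_per_param / 8 / 1024**3

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gib(BASE_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 4-bit storage needs about a quarter of the memory of 16-bit storage for the same weights.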

## **LoRA Fine‑Tuning with `FastLanguageModel.get_peft_model`**
This method applies **Low‑Rank Adaptation (LoRA)** to a pretrained model by injecting small trainable matrices into selected layers, allowing efficient fine‑tuning while keeping most weights frozen.  
- **`r`** → LoRA rank (capacity of adaptation; higher = more expressive but uses more VRAM).  
- **`target_modules`** → Transformer layers where LoRA is injected (e.g., `q_proj`, `v_proj`).  
- **`lora_alpha`** → Scaling factor for LoRA outputs before merging with base weights.  
- **`lora_dropout`** → Dropout for LoRA path to reduce overfitting (often `0` for speed).  
- **`bias`** → Controls if biases in adapted layers are trainable (`"none"` = fastest).  
- **`use_gradient_checkpointing`** → Saves VRAM by recomputing activations during backprop (`"unsloth"` = optimized).  
- **`random_state`** → Random seed for reproducible LoRA initialization.  
- **`use_rslora`** → Optional stability tweak for large ranks.  
- **`loftq_config`** → Optional config for quantization‑aware LoRA training.
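
To see how `r` and `target_modules` determine the number of trainable parameters, here is a minimal sketch that counts LoRA parameters by hand. The projection shapes below are an assumption about Llama-3.2-3B (they are not stated in this notebook), and `lora_params` is a hypothetical helper. Each adapted linear layer gains two matrices, `A` of shape `r × in_features` and `B` of shape `out_features × r`.

```python
# ASSUMPTION: Llama-3.2-3B projection shapes as (in_features, out_features).
# These dimensions are not given in this notebook.
SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # matches the "patched 28 layers" message in this notebook's output

def lora_params(shapes, num_layers, r):
    """Count LoRA weights: r * (in + out) per adapted linear layer, per block."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_params(SHAPES, NUM_LAYERS, r=16)
print(f"{trainable:,}")  # 24,313,856
```

With these assumed shapes the count agrees with the `Trainable parameters = 24,313,856` line printed by the trainer later in this notebook. Doubling `r` doubles the count. In standard LoRA the adapter output is scaled by `lora_alpha / r`, so `lora_alpha = 16` with `r = 16` gives a scale of 1.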

#### Adding LoRA adapters allows us to update only a small fraction of all parameters (typically 1 to 10%, and about 0.75% in this run)!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # This is the rank number, higher number, more complexity, more memory. Choose any number > 0 ! Suggested 8, 16, 32, 64, 128, for A and B matrices
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], #These are the linear layers in the transformer where LoRA adapters will be injected.
    lora_alpha = 16, #Scaling factor for the LoRA weights.
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # RS-LoRA improves training stability in some cases, but Unsloth disables it by default.
    loftq_config = None, # Optional: LoftQ (LoRA-Fine-Tuning-aware Quantization) initializes the LoRA weights to reduce quantization error. Disabled here.
)

Unsloth 2025.7.11 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style, but convert it to Hugging Face's normal multi-turn format `("role", "content")` instead of `("from", "value")`. Llama-3 renders multi-turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
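
The layout above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: `render_llama3` is a hypothetical helper, not the template that `get_chat_template` installs. The real `llama-3.1` template also adds a system block with date information, as the outputs later in this notebook show.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Simplified, hypothetical renderer for the Llama-3 chat layout shown above."""
    text = "<|begin_of_text|>"
    for m in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        text += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here (inference only).
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
print(render_llama3(messages))
```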

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

### 📝 Step-by-Step: Chat Template & Dataset Preparation

1. **Import Chat Template Utility**  
   Load Unsloth's `get_chat_template` to standardize how chat prompts are formatted.

2. **Apply the LLaMA 3.1 Template**  
   Modify the tokenizer with the `llama-3.1` chat template to match the model's expected input format (e.g., user/assistant roles).

3. **Define a Prompt Formatting Function**  
   Create a function that applies the chat template to each conversation and returns formatted plain-text prompts (not tokenized).

4. **Load Instruction Dataset**  
   Use Hugging Face's `load_dataset` to import `mlabonne/FineTome-100k`, a high-quality dataset of 100k multi-turn instructions for chat fine-tuning.


In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    """Render each conversation to plain text with the chat template."""
    convos = examples["conversations"]
    # Tokenization happens later, in the trainer; add_generation_prompt = True is only for inference.
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
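
Conceptually, the conversion is a key and role rename. The sketch below shows that mapping on a single conversation. `sharegpt_to_messages` and `ROLE_MAP` are hypothetical names used for illustration; this is not Unsloth's implementation, which works on whole datasets and may handle more role aliases.

```python
# Hypothetical role mapping, covering only the three roles shown above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(turns):
    """Convert ShareGPT-style turns ("from"/"value") to "role"/"content" messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]

sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
for message in sharegpt_to_messages(sharegpt):
    print(message)
```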

In [5]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [6]:
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [7]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

### 🔄 Epochs vs Steps in Fine-Tuning

When training a language model, it's important to understand two key terms: **epochs** and **steps**.

---

#### 🧠 What is an **Epoch**?

- One **epoch** = One full pass through the entire training dataset.
- If your dataset has 10,000 examples, then 1 epoch = all 10,000 are seen once.

#### ⚙️ What is a **Step**?

- One **step** = One update to the model's weights, computed from one effective batch.
- For example, if `batch_size = 2` and there is no gradient accumulation, each step sees 2 examples.

---

#### 🔁 Relationship Between Them

The number of steps per epoch depends on:

$$
\text{Steps per Epoch} = \frac{\text{Dataset Size}}{\text{Batch Size} \times \text{Gradient Accumulation Steps}}
$$

**Example**:
- Dataset size = 10,000  
- Batch size = 2  
- Gradient Accumulation = 4  
- ⟶ Effective batch = 8  

So:

$$
\text{Steps per Epoch} = \frac{10,000}{8} = 1,250
$$

---

**Summary:** With gradient accumulation, we process smaller batches (e.g., 2 examples) multiple times (e.g., 4 times) before updating the weights. This simulates training with a larger batch (2 × 4 = 8) while using less GPU memory.
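
The arithmetic above can be checked with a short script. `steps_per_epoch` is a hypothetical helper; rounding up a final partial batch is an assumption, and a given trainer may handle the remainder differently. The second part applies the same numbers to this lab's run (60 steps on the 100,000-example dataset).

```python
import math

def steps_per_epoch(dataset_size, batch_size, grad_accum, num_gpus=1):
    """Optimizer steps needed to see every example once (partial batch rounded up)."""
    effective_batch = batch_size * grad_accum * num_gpus
    return math.ceil(dataset_size / effective_batch)

print(steps_per_epoch(10_000, 2, 4))   # 1250, as in the worked example
print(steps_per_epoch(100_000, 2, 4))  # 12500 steps for one epoch of FineTome-100k

# A 60-step run with an effective batch of 8 sees only a small slice of the data.
examples_seen = 60 * 2 * 4
print(examples_seen, f"{examples_seen / 100_000:.2%}")  # 480 0.48%
```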



In [8]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer), # Dynamically pads each batch to its longest sequence
    packing = False, # Setting this to True can make training 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2, # 2 examples per GPU
        gradient_accumulation_steps = 4, # Simulates a batch size of 8 (2×4)
        warmup_steps = 5, # Gradually increase LR to avoid instability
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4, # Typical LR for LoRA fine-tuning
        logging_steps = 1, # Log metrics after every step
        optim = "adamw_8bit", # 8-bit AdamW optimizer from bitsandbytes (VRAM-efficient)
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Disables logging integrations; set to "wandb" etc. to enable one
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

### 🧠 What `train_on_responses_only()` Does
It ignores the user tokens when computing loss.

This means:

- The model still reads the user input (context)

- But it only gets penalized for mistakes in the assistant's response
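
The idea can be sketched on plain lists of token IDs. This is a simplified illustration under stated assumptions, not Unsloth's actual implementation: `mask_non_responses` and `find_sub` are hypothetical helpers, and the small integers stand in for the tokenized `instruction_part` and `response_part` markers.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    n = len(sub)
    for i in range(start, len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy IDs: [1, 2] plays the user header, [3, 4] plays the assistant header.
ids = [1, 2, 10, 11, 3, 4, 20, 21, 1, 2, 12, 3, 4, 22]
print(mask_non_responses(ids, [1, 2], [3, 4]))
```

In the toy sequence only `20, 21` and `22` (the assistant tokens) keep their labels; every other position becomes `-100`.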

In [9]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

We first decode the `input_ids` for item 5 to see the full sequence the model reads, including the system and user turns:

In [10]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight rel

### 🔍 Why Replace `-100` in Labels When Decoding?

- In language model fine‑tuning, **`-100`** in `labels` means **"ignore this token when computing loss"**  
  (PyTorch’s `CrossEntropyLoss(ignore_index=-100)` skips these positions).
- These usually mark:
  - **User prompt tokens** (when training only on responses)
  - **Padding tokens**
- You **can’t decode** `-100` directly — it’s not a valid token ID.
- Replacing them with a **space token** (or another placeholder) lets you **decode** the label sequence into readable text.
- This is **only for debugging/visualization** — it doesn’t affect training.
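
To make the effect of `-100` on the loss concrete, here is a minimal pure-Python sketch of averaging per-token losses while skipping ignored positions. `masked_mean_loss` is a hypothetical helper and the loss values are made up for illustration; returning `0.0` when every label is ignored is a choice made for this sketch.

```python
def masked_mean_loss(token_losses, labels, ignore_index=-100):
    """Average per-token losses over positions whose label is not ignore_index."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != ignore_index]
    # Sketch-only convention: return 0.0 if nothing is left to average.
    return sum(kept) / len(kept) if kept else 0.0

# Made-up numbers: the first position is a masked prompt token.
losses = [5.0, 1.0, 3.0]
labels = [-100, 7, 8]
print(masked_mean_loss(losses, labels))  # 2.0
```

The large loss on the masked position does not contribute, so only the assistant tokens drive the weight update.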


In [12]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                  Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!

In [13]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
3.07 GB of memory reserved.


In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.7747
2,0.839
3,1.0757
4,0.8919
5,0.7575
6,0.9373
7,0.6192
8,0.9986
9,0.8596
10,0.7613


In [15]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

435.3868 seconds used for training.
7.26 minutes used for training.
Peak reserved memory = 4.248 GB.
Peak reserved memory for training = 1.178 GB.
Peak reserved memory % of max memory = 28.818 %.
Peak reserved memory for training % of max memory = 7.991 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message in `messages` to try your own prompts.



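The reasoning behind pairing `min_p = 0.1` with `temperature = 1.5` is explained in this [Tweet](https://x.com/menhguin/status/1826132708508213629). Note that the next two cells use `temperature = 0.1`, which keeps the output nearly deterministic, while the reloaded-model cell later on uses `temperature = 1.5`.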
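
To make the interaction between `temperature` and `min_p` concrete, the sketch below implements one common formulation of min-p filtering in plain Python: scale the logits by the temperature, apply softmax, drop tokens whose probability is below `min_p` times the top probability, then renormalize. `min_p_probs` is a hypothetical helper and the logits are made up; the exact order of operations inside `model.generate` is assumed here, not confirmed by this notebook.

```python
import math

def min_p_probs(logits, temperature=1.0, min_p=0.1):
    """Temperature-scaled softmax followed by min-p filtering and renormalization."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)  # cutoff is relative to the most likely token
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

toy_logits = [2.0, 1.0, -3.0]  # made-up logits for three candidate tokens
for t in (0.1, 1.5, 5.0):
    print(t, [round(p, 3) for p in min_p_probs(toy_logits, temperature=t, min_p=0.1)])
```

A higher temperature flattens the distribution, so more tokens clear the relative cutoff, while `min_p` still removes the unlikely tail.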

In [21]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

attention_mask = inputs.ne(tokenizer.pad_token_id).long()

outputs = model.generate(
    input_ids=inputs,
    attention_mask=attention_mask,
    max_new_tokens=64,
    temperature=0.1,
    min_p=0.1,
    use_cache=True
)
tokenizer.batch_decode(outputs)

['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers, starting from 1 and 1. The sequence is as follows:\n\n1, 1, 2, 3, 5, 8, 13, 21, 34, 55,']

In [22]:
full_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
assistant_reply = full_text.split("<|start_header_id|>assistant<|end_header_id|>\n\n")[-1]
print(assistant_reply)


The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers, starting from 1 and 1. The sequence is as follows:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55,


You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole response!

In [23]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 0.1, min_p = 0.1)

The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding numbers, starting from 1 and 1. The sequence is as follows:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. Exporting to 16bit or GGUF is not covered in this lab.

In [24]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')

To load the LoRA adapters we just saved for inference, run the cell below. The `if True:` guard is already enabled; change it to `False` to skip reloading:

In [25]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # The folder where the LoRA adapters were saved above
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Disables gradient tracking, Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 256,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

==((====))==  Unsloth 2025.7.11: Fast Llama patching. Transformers: 4.54.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
In the city of Paris, one of France's most iconic landmarks is the Eiffel Tower, also referred to as the La Tour Eiffel. Completed in 1889 as part of the World's Fair, it's now the symbol of Paris and French culture. It stands as a symbol of innovation and engineering.

At 324 meters (1,063 feet) in height, the Eiffel Tower was initially meant to be temporary but became an instant icon. It is made of over 18,000 pieces of wrought iron. The tower is held together by a complex system of girders and beams that form a latticework structure, 

### 🧪 Exercise

Finetune the model on a small dataset that you create (about 10 rows). For example, you can finetune it to answer in a poetic style.

## Make Your Dataset
* hint: array of ?

## Format your dataset as we did before

* hint: `Dataset.from_list` is kinda powerful

## What do we need to define before finetuning? Defining ...

## Train it!

## Try Your Model!

## Contributed by: Yazan Alshoibi

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)
