In [1]:
pip install unsloth transformers trl

Collecting unsloth
  Downloading unsloth-2025.11.4-py3-none-any.whl.metadata (64 kB)
Collecting trl
  Downloading trl-0.25.1-py3-none-any.whl.metadata (11 kB)
Collecting unsloth_zoo>=2025.11.4 (from unsloth)
  Downloading unsloth_zoo-2025.11.5-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.35-py3-none-any.whl.metadata (12 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.33.post1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting bitsandbytes!=0.46.0,!=0.48.0,>=0.45.5 (from unsloth)
  Downloading bitsandbytes-0.48.2

In [2]:
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True
 )

==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

## This code loads an optimized Llama 3.2 (3B) Instruction-tuned model using Unsloth.
Unsloth makes the model faster and lighter, so it can be trained on GPUs with limited VRAM (like the RTX 3050).

The function returns:

model → The neural network

tokenizer → Converts text ↔ tokens


1. model, tokenizer =

This unpacks the result into two variables:

model: the loaded Llama model

tokenizer: the text processor used for encoding/decoding text

The two are used together: the tokenizer must match the vocabulary the model was trained with.

2. FastLanguageModel.from_pretrained(...)

This is Unsloth's optimized version of Hugging Face's from_pretrained().

Unsloth adds:

Memory-efficient loading

Speed optimizations (optimized attention such as FlashAttention or Xformers, fused kernels, etc.)

Support for 4-bit / 8-bit loading

Automatic GPU placement

Essentially:
Loads a large model faster and with less VRAM.

3. model_name="unsloth/Llama-3.2-3B-Instruct"

Specifies which model to download.

Breakdown:

unsloth/ → optimized version maintained by Unsloth

Llama-3.2 → model family and version

3B → about 3 billion parameters (small by LLM standards)

Instruct → trained to follow instructions (good for chat, QA, tasks)

Its size makes it a practical choice for fine-tuning on smaller GPUs.

4. max_seq_length=2048

Sets the maximum number of tokens in a single input/output sequence.

Notes:

2048 tokens ≈ 1,500 English words (a few pages of text)

Higher sequence length = more VRAM required

2048 is a reasonable starting point for the RTX 3050

Lower to 1024 if you get memory errors

This controls context length during training and generation.

5. load_in_4bit=True

Enables 4-bit quantized loading (the approach used by QLoRA).

What this does:

Reduces weight memory by roughly 4× compared with 16-bit weights

Makes 3B models fit on low-VRAM GPUs

Enables training without running out of memory

Keeps most of the accuracy

This is essential for GPUs with 4–8 GB VRAM.
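The memory saving can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate of weight storage: the parameter count of 3.2 billion is an assumed round figure, the helper `weight_memory_gb` is hypothetical, and the estimate ignores quantization constants, any layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a "3B" model has roughly 3.2 billion parameters.
NUM_PARAMS = 3_200_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
print(f"reduction: {fp16_gb / int4_gb:.0f}x")
```

Under these assumptions the weights shrink from about 6.4 GB to about 1.6 GB, which is why the model becomes loadable on a GPU with 4–8 GB of VRAM.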

🧩 What Happens Internally

When this code runs, Unsloth:

Downloads the model

Loads weights in 4-bit quantized format

Initializes the tokenizer

Applies fast GPU kernels (optimized attention, fused RMSNorm, etc.)

Moves model to your GPU (cuda:0)

Returns model and tokenizer ready for training or inference

✔ Summary (One Sentence)

This code loads a memory-efficient, GPU-optimized, 4-bit Llama-3.2-3B model with its tokenizer, supporting sequences up to 2048 tokens, which makes it well suited to fine-tuning on laptop GPUs like the RTX 3050.

In [4]:
model = FastLanguageModel.get_peft_model(
    model, r=16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

Unsloth 2025.11.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


## Adds LoRA adapters to the model so only small adapter matrices are trained instead of the full model.
This makes fine-tuning much faster and requires far less GPU memory.

⚙️ Parameters

r=16 → LoRA rank (adapter capacity).
Higher = more learning ability but more VRAM.
16 is a good balanced choice.

target_modules=[...] → Specific layers that get LoRA adapters.
These are the key transformer layers responsible for attention and MLP computation:

q_proj, k_proj, v_proj, o_proj → attention layers

gate_proj, up_proj, down_proj → feed-forward layers
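
To make the role of the rank concrete, here is a minimal pure-Python sketch of what a LoRA adapter adds to one projection: the frozen weight `W` is left untouched and a trainable low-rank path `B @ A` is added on top, scaled by `alpha / r`. The helper names, the toy sizes, and the `alpha` value are illustrative assumptions, not Unsloth's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) for a single input vector."""
    base = matvec(W, x)                 # frozen pretrained projection
    low_rank = matvec(B, matvec(A, x))  # trainable rank-r path
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy sizes: 4-dimensional input and output, rank 2.
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # frozen, 4 x 4
A = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]              # trainable, r x d_in
B = [[0.0, 0.0] for _ in range(4)]                            # trainable, d_out x r, starts at zero

x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(W, A, B, x, alpha=16, r=2)
print(y)  # with B at zero the adapter contributes nothing, so y equals W x
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pretrained one; training then moves only `A` and `B`, whose combined size grows linearly with `r`.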
# 📘 What Each Projection Layer Does (LoRA Targets)

### 🔹 q_proj (Query Projection)
Turns the hidden state into **Query vectors**.  
Queries decide *what information a token wants to look for* in other tokens.

### 🔹 k_proj (Key Projection)
Turns the hidden state into **Key vectors**.  
Keys represent *what information each token contains*.

### 🔹 v_proj (Value Projection)
Turns the hidden state into **Value vectors**.  
Values contain the *actual content* that gets passed through attention.

### 🔹 o_proj (Output Projection)
Combines attention outputs back into the model's main hidden state.  
It merges all attention heads into one vector.

---

# 🧠 Feed-Forward (MLP) Projections

### 🔹 gate_proj
First part of the MLP layer.  
Its output passes through an activation function and multiplies the `up_proj` output element-wise, gating how much information flows forward.

### 🔹 up_proj
Expands the hidden dimension by projecting it to a larger space.  
This gives the model more capacity to transform information.

### 🔹 down_proj
Compresses the expanded representation back to the original size.  
Completes the MLP transformation.
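
The way the three MLP projections fit together can be sketched in a few lines of pure Python. This is a toy illustration of the gated feed-forward pattern used by Llama-style models, `down_proj(silu(gate_proj(x)) * up_proj(x))`; the helper names and the tiny weight matrices are made up for the example.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_mlp(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]  # gate_proj followed by the activation
    up = matvec(W_up, x)                         # up_proj: expand to the larger space
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(W_down, hidden)                # down_proj: back to the original size


# Toy sizes: hidden dimension 2, expanded dimension 4.
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_up = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

out = gated_mlp([1.0, 2.0], W_gate, W_up, W_down)
print(len(out))  # the output has the same size as the input
```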

---

# ✔ Summary
These projections are the **core trainable parts** of a transformer:
- `q/k/v/o_proj` → attention mechanics  
- `gate/up/down_proj` → MLP transformation  

Adding LoRA to these layers lets the model learn new behaviors **efficiently** without modifying the full model.
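
As a worked check of how small the adapters are, the sketch below counts LoRA parameters for `r=16` across the seven target modules. Each adapter on a projection with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The layer dimensions are assumptions taken from the published Llama-3.2-3B configuration (hidden size 3072, key/value size 1024, MLP size 8192); the 28 layers match the count Unsloth reports when patching.

```python
R = 16               # LoRA rank used in the notebook
NUM_LAYERS = 28      # number of patched transformer layers
HIDDEN = 3072        # assumed hidden size
KV_DIM = 1024        # assumed key/value size (8 KV heads x 128 dims)
INTERMEDIATE = 8192  # assumed MLP expansion size

# (input size, output size) of each projection that receives an adapter
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """Parameters in one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 24,313,856
```

Under these assumptions the total agrees with the `Trainable parameters = 24,313,856` line that Unsloth prints when training starts, about 0.75% of the full model.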


In [5]:
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

In [6]:
dataset = load_dataset("mlabonne/FineTome-100k", split="train")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [7]:
dataset = standardize_sharegpt(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

In [8]:
dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
    },
    batched=True
)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Transforms each conversation in the dataset into a formatted text string that the model understands.

🧠 Breakdown

dataset.map(...)
Applies a function to every batch of examples in the dataset.

examples["conversations"]
A list of conversation objects (each convo is a list of user/assistant messages).

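tokenizer.apply_chat_template(convo, tokenize=False)
Converts a conversation into the model's required chat format, e.g. for the Llama 3.1 template (simplified):

<|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi!<|eot_id|>


Setting tokenize=False returns raw text, not token IDs.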

"text": [...]
Creates a new field "text" in the dataset containing the formatted conversation strings.

batched=True
Processes multiple examples at once for speed.

✔ Summary

This code converts raw conversation data into a properly formatted text field so the model can be trained on chat-style input.
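
To show what the formatting step produces, here is a simplified stand-in for the chat template written as a plain Python function. `render_llama3_chat` is a hypothetical helper, not part of any library: it reproduces only the header and end-of-turn markers of the Llama 3 format and omits the default system block (knowledge cutoff and date) that the real template adds.

```python
def render_llama3_chat(messages):
    """Render a list of {"role", "content"} dicts in a simplified Llama 3 chat format."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return text


convo = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]
print(render_llama3_chat(convo))
```

In the notebook itself the formatting is done by `tokenizer.apply_chat_template`, which should be preferred because it stays in sync with the tokenizer's special tokens.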

In [9]:
dataset[0]

{'conversations': [{'content': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.',
   'role': 'user'},
  {'content': 

In [11]:
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    tokenizer = tokenizer, # tokenizer used to encode the "text" field
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs"
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/100000 [00:00<?, ? examples/s]

# Explanation of Tokenization and Training Setup

This section explains how to tokenize a dataset manually and configure the `SFTTrainer` for training a model using Unsloth + Hugging Face.

---

## 1. **Manual Tokenization Before Training**

Instead of letting `SFTTrainer` tokenize your dataset (via `dataset_text_field`), you manually tokenize it first.  
This helps avoid multiprocessing issues and gives full control over tokenization.

### **Tokenization Function**

```python
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding=False,
    )
```

- Reads `"text"` from each example.
- Applies truncation up to 2048 tokens.
- No padding is added.

---

## 2. **Mapping Tokenizer Over Dataset**

Because multiprocessing caused errors earlier, the tokenization uses `num_proc=1`:

```python
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset.column_names,
    desc="Tokenizing dataset",
)
```

### What this step does:
- Tokenizes the dataset in batches.
- Removes original raw text columns.
- Produces fields like `input_ids` and `attention_mask`.

---

## 3. **Creating the Trainer Without `dataset_text_field`**

Since the dataset is already tokenized, you remove `dataset_text_field` to prevent double-tokenization:

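```python
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset,  # already tokenized, so no dataset_text_field
    max_seq_length=2048,
    packing=False,
    # args=TrainingArguments(...) goes here; see section 4
)
```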

### Why `packing=False`?
- Keeps each training sample as-is.
- Useful when ensuring predictable sequence length behavior.

---

## 4. **Training Arguments Overview**

```python
args=TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    output_dir="outputs",
    dataloader_num_workers=0,
)
```

### Key points:
- **Effective batch size = 2 × 4 = 8** due to gradient accumulation.
- **Learning rate (2e-4)** is a common starting point for LoRA fine-tuning.
- **Automatic FP16/BF16 selection** based on GPU support.
- **No multiprocessing** in the dataloader (`dataloader_num_workers=0`) to avoid Jupyter crashes.
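
The batch-size arithmetic also explains how little of the dataset a 60-step run touches. The sketch below uses only values from this notebook (batch size 2, accumulation 4, one GPU, 60 steps, 100,000 examples); the variable names are illustrative.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 100_000

# Samples that contribute to each optimizer step
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Samples seen over the whole run, and the fraction of one epoch that represents
samples_seen = effective_batch_size * max_steps
epoch_fraction = samples_seen / num_examples

print(effective_batch_size)      # 8
print(samples_seen)              # 480
print(round(epoch_fraction, 4))  # 0.0048
```

The result matches the `train/epoch` value of 0.0048 in the training log: the run is a short smoke test over 480 examples, not a full pass over the data.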

---

## Summary

You have successfully:
- Tokenized the dataset manually.
- Passed tokenized data into `SFTTrainer`.
- Prevented multiprocessing and re-tokenization issues.
- Configured a stable and efficient fine-tuning pipeline.




In [12]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)

Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 1.4399 |
| 2 | 1.8461 |
| 3 | 1.3803 |
| 4 | 1.4138 |
| 5 | 1.3508 |
| 6 | 1.4457 |
| 7 | 0.9463 |
| 8 | 1.5072 |
| 9 | 1.2393 |
| 10 | 1.2541 |


[wandb run history sparklines for train/epoch, train/global_step, train/grad_norm, train/learning_rate, train/loss]

| Metric | Value |
|--------|-------|
| total_flos | 5495762487078912.0 |
| train/epoch | 0.0048 |
| train/global_step | 60.0 |
| train/grad_norm | 0.18614 |
| train/learning_rate | 0.0 |
| train/loss | 0.8742 |
| train_loss | 1.02651 |
| train_runtime | 710.858 |
| train_samples_per_second | 0.675 |
| train_steps_per_second | 0.084 |


TrainOutput(global_step=60, training_loss=1.0265128056208292, metrics={'train_runtime': 710.858, 'train_samples_per_second': 0.675, 'train_steps_per_second': 0.084, 'total_flos': 5495762487078912.0, 'train_loss': 1.0265128056208292, 'epoch': 0.0048})
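
The final `train/learning_rate` of 0.0 is what a linear warmup-then-decay schedule would produce. The sketch below assumes the default linear scheduler is in effect, since the notebook does not set a scheduler type, and uses a hypothetical helper `linear_schedule_lr` with the notebook's values (`learning_rate=2e-4`, `warmup_steps=5`, `max_steps=60`).

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / (total_steps - warmup_steps))


for step in (0, 1, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```

The rate climbs to 2e-4 over the first 5 steps, then falls in a straight line and reaches zero at step 60.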

In [13]:
model.save_pretrained("finetuned_model")

In [14]:
inference_model, inference_tokenizer = FastLanguageModel.from_pretrained(
    model_name="./finetuned_model",
    max_seq_length=2048,
    load_in_4bit=True
)

==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.57.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [15]:
text_prompts = [
    "What are the key principles of investment?"
]

for prompt in text_prompts:
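  # add_generation_prompt=True appends the assistant header so the model starts its reply directly
  formatted_prompt = inference_tokenizer.apply_chat_template([{
      "role": "user",
      "content": prompt
      }], tokenize=False, add_generation_prompt=True)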

  model_inputs = inference_tokenizer(formatted_prompt, return_tensors="pt").to("cuda")
  generated_ids = inference_model.generate(
      **model_inputs,
      max_new_tokens=512,
      temperature=0.7,
      do_sample=True,
      pad_token_id=inference_tokenizer.pad_token_id
  )
  response = inference_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(response)

system

Cutting Knowledge Date: December 2023
Today Date: 30 Nov 2025

user

What are the key principles of investment?assistant

The key principles of investment are:

1. Diversification: Investing in a variety of assets to reduce risk and increase potential returns.
2. Long-term perspective: Investing for the long-term, rather than trying to make quick profits.
3. Risk management: Understanding and managing risk to minimize potential losses.
4. Dollar-cost averaging: Investing a fixed amount of money at regular intervals, regardless of market conditions.
5. Compounding: Reinvesting earnings to take advantage of the power of compounding.
6. Low-cost investing: Choosing low-cost index funds or ETFs to minimize fees and maximize returns.
7. Tax efficiency: Considering the tax implications of investments and aiming to minimize tax liabilities.
8. Active management: Actively managing investment portfolios to take advantage of opportunities and minimize risks.
9. Retirement planning: Inves