<a href="https://colab.research.google.com/github/HtmMhmd/fine-tuning-examples/blob/main/llm_instruction_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced LLM Instruction Tuning Course

This comprehensive course teaches you how to fine-tune Large Language Models (LLMs) using instruction tuning with state-of-the-art techniques. You'll learn:

1. **Theoretical foundations** of instruction tuning and parameter-efficient techniques
2. **Data validation** with Pydantic for robust training datasets
3. **Experiment tracking** with Weights & Biases
4. **Quantization techniques** to run large models on consumer hardware
5. **LoRA fine-tuning** for efficient adaptation of pre-trained models
6. **Evaluation methods** to assess model performance

By the end of this course, you'll have the skills to fine-tune and deploy your own instruction-tuned LLM for specific tasks.

# Module 1: Introduction to LLM Fine-tuning

## Why Fine-tune LLMs?

Large Language Models (LLMs) like LLaMA, Mistral, and DeciLM have impressive capabilities, but they may not perform optimally for specific tasks or domains out of the box. Fine-tuning allows us to adapt these general-purpose models to:

1. **Domain-specific knowledge**: Tailor the model to understand specific jargon and concepts
2. **Task-specific behavior**: Optimize for specific tasks like summarization, classification, or code generation
3. **Output formatting**: Train the model to follow specific response formats
4. **Compliance and safety**: Reduce harmful outputs and align with ethical guidelines
5. **Instruction following**: Improve the model's ability to follow user instructions

## Types of Fine-tuning

Several fine-tuning approaches exist for LLMs:

1. **Full Fine-tuning**: Updates all model parameters, requiring significant computational resources
2. **Instruction Tuning**: Fine-tunes a model specifically to follow natural language instructions
3. **Parameter-Efficient Fine-Tuning (PEFT)**: Updates only a small subset of parameters
   - **LoRA (Low-Rank Adaptation)**: Adds trainable low-rank matrices to certain layers
   - **QLoRA**: Combines quantization with LoRA for memory efficiency
   - **Adapter Layers**: Inserts small trainable modules between existing layers

4. **RLHF (Reinforcement Learning from Human Feedback)**: Uses human preferences to guide model training

In this course, we'll focus on **instruction tuning with PEFT**, specifically using LoRA, which provides an excellent balance of performance and efficiency.

## Instruction Tuning

Instruction tuning is a subset of supervised fine-tuning that focuses on teaching LLMs to follow natural language instructions. The process involves:

1. Collecting pairs of instructions and desired outputs
2. Fine-tuning the LLM on these instruction-output pairs
3. Evaluating the model's ability to follow new, unseen instructions

This approach bridges the gap between a base model's next-token prediction objective and the user's desire for the model to follow specific instructions.

# Module 2: Theoretical Foundations

## Transfer Learning in NLP

Transfer learning is a technique where a model trained on one task is adapted to another related task. In NLP, this typically involves:

1. **Pre-training**: Training a model on a large corpus of text with self-supervised objectives like masked language modeling or next-token prediction
2. **Fine-tuning**: Adapting the pre-trained model to specific downstream tasks

The key insight is that pre-training helps the model learn general language understanding that can be transferred to specific tasks with relatively little task-specific data.

## Parameter-Efficient Fine-Tuning (PEFT)

As LLMs grow larger (reaching hundreds of billions of parameters), full fine-tuning becomes prohibitively expensive in terms of:

- **Computational resources**: Requires multiple high-end GPUs
- **Memory requirements**: Full model weights plus optimizer states must fit in memory
- **Training time**: Longer training times due to more parameters
- **Storage costs**: Each fine-tuned model is the same size as the base model

Parameter-efficient methods address these challenges by:

1. **Training fewer parameters**: Only 0.1-1% of the model's parameters are updated
2. **Keeping the base model frozen**: The original pre-trained weights remain unchanged
3. **Adding small trainable modules**: These modules adapt the model's behavior

## Understanding LoRA (Low-Rank Adaptation)

LoRA is based on a key insight: the updates to weight matrices during fine-tuning have low "intrinsic rank" (can be approximated by low-rank matrices).

Instead of fine-tuning a full weight matrix W ∈ ℝᵐˣⁿ, LoRA decomposes the update into two smaller matrices:

W + ΔW = W + BA

Where:
- B ∈ ℝᵐˣʳ
- A ∈ ℝʳˣⁿ
- r << min(m,n) is the rank of the decomposition

This reduces the number of trainable parameters from m×n to r×(m+n).

During training:
1. The pre-trained weights W remain frozen
2. Only the low-rank matrices A and B are updated
3. The forward pass computes y = Wx + BAx

Key benefits:
- **Memory efficiency**: Requires significantly less memory
- **Computational efficiency**: Trains faster with fewer parameters
- **Storage efficiency**: Only need to store the small adapter weights
- **Composition**: Multiple LoRA adaptations can be combined

## The Mathematics of Quantization

Quantization reduces the precision of model weights to save memory. For example, converting from 32-bit floating-point (FP32) to:

- **16-bit (FP16/BF16)**: 2× memory reduction
- **8-bit (INT8)**: 4× memory reduction
- **4-bit (INT4/NF4)**: 8× memory reduction

The quantization process converts a high-precision weight w to a lower-precision value q:

q = round(w / scale) * scale

Where scale is a factor that preserves the range of values.

Different quantization types include:
- **PTQ (Post-Training Quantization)**: Applied after training is complete
- **QAT (Quantization-Aware Training)**: Model is trained with quantization in mind
- **Dynamic Quantization**: Quantization parameters are computed on the fly

In this course, we'll use 4-bit quantization with the NF4 data type (Normal Float 4-bit), designed for weights with a normal distribution.

## setup

In [8]:
from google.colab import userdata

# Retrieve secrets from Google Colab userdata
hf_token = userdata.get("huggingface")
wandb_token = userdata.get("wandb")

In [9]:
!pip -q install -U wandb

In [10]:
from google.colab import userdata
import wandb

# Initialize Weights & Biases

wandb.login(key= wandb_token)

# Initialize huggingface

!huggingface-cli login --token {hf_token}

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mhatem-mohamed[0m ([33mnone12345[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `colab` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `colab`


## install and import

In [11]:
%%capture
!pip install -q optimum
# Install required packages
!pip install -U -q "transformers>=4.31.0" "datasets>=2.13.0" "peft>=0.4.0" "accelerate>=0.21.0" "bitsandbytes>=0.41.0" "trl>=0.4.7" "safetensors>=0.3.1" "pydantic>=2.0.0" "evaluate>=0.4.0" "scipy>=1.11.1"

# Module 3: Environment Setup and Prerequisites

We'll now set up our environment with all the necessary libraries:

1. **transformers**: Hugging Face's library for state-of-the-art pre-trained models
2. **datasets**: Library for efficiently working with ML datasets
3. **peft**: Parameter-Efficient Fine-Tuning library
4. **accelerate**: Library for distributed training
5. **bitsandbytes**: Efficient implementation of quantization algorithms
6. **trl**: Transformer Reinforcement Learning library
7. **safetensors**: Secure way to store and load tensors
8. **pydantic**: Data validation and settings management
9. **wandb**: Weights & Biases for experiment tracking
10. **evaluate**: Hugging Face's evaluation library

Let's also initialize Weights & Biases for experiment tracking:

In [12]:
# Configure wandb
wandb_project = "llm-instruction-tuning"
wandb_run_name = "microsoft/phi-1_5-instruction-tuning"

# Initialize a wandb run
wandb.init(
    project=wandb_project,
    name=wandb_run_name,
    config={
        "model_name": "microsoft/phi-1_5",
        "peft_method": "LoRA",
        "dataset": "hakurei/open-instruct-v1",
        "quantization": "4-bit NF4",
        "instruction_format": "alpaca"
    }
)

In [13]:
from datasets import load_dataset

# Load the Open Instruct V1 dataset
open_instruct_dataset = load_dataset("hakurei/open-instruct-v1", split="train")

# Filter to examples that fit within our context window (4096 tokens)
dataset = open_instruct_dataset.filter(
    lambda example: (len(example["input"]) + len(example["output"]) + len(example["instruction"])) <= 2048
)

# Display sample entries
print(f"Total examples after filtering: {len(dataset)}")
dataset.to_pandas().head(3)

Filter:   0%|          | 0/498813 [00:00<?, ? examples/s]

Total examples after filtering: 480150


Unnamed: 0,output,input,instruction
0,1. Eat a balanced diet and make sure to includ...,,Give three tips for staying healthy.
1,"The three primary colors are red, blue, and ye...",,What are the three primary colors?
2,"An atom is made up of a nucleus, which contain...",,Describe the structure of an atom.


In [14]:
# Load the dataset directly from the Hugging Face Hub
dataset_ = load_dataset("HuggingFaceH4/instruction-dataset")

# Module 4: Data Processing with Pydantic

## Data Validation with Pydantic

Pydantic is a powerful data validation library that uses Python type annotations. It allows us to:

1. Define clear data models with validation rules
2. Automatically validate data against these models
3. Get helpful error messages when data doesn't conform
4. Convert between different data formats

For our instruction tuning task, we'll define Pydantic models to validate our instruction dataset and ensure it meets our requirements.

Let's create models for:
1. Individual instruction examples
2. The complete instruction dataset

In [15]:
from pydantic import BaseModel, Field, field_validator
from typing import List, Optional
import random

# Define a Pydantic model for an instruction example
class InstructionExample(BaseModel):
    """Model for a single instruction example."""
    instruction: str = Field(..., description="The instruction to follow")
    input: Optional[str] = Field("", description="Optional input context")
    output: str = Field(..., description="Expected model output")

    @field_validator('instruction')
    @classmethod
    def instruction_not_empty(cls, v):
        if not v.strip():
            raise ValueError('Instruction cannot be empty')
        return v

    @field_validator('output')
    @classmethod
    def output_not_empty(cls, v):
        if not v.strip():
            raise ValueError('Output cannot be empty')
        return v

In [16]:
# Define a Pydantic model for the entire dataset
class InstructionDataset(BaseModel):
    """Model for a collection of instruction examples."""
    examples: List[InstructionExample]

    @field_validator('examples')
    def min_examples(cls, v):
        if len(v) < 10:
            raise ValueError('Dataset must contain at least 10 examples')
        return v

    def sample(self, n: int) -> 'InstructionDataset':
        """Sample n examples from the dataset."""
        if n > len(self.examples):
            raise ValueError(f"Requested {n} samples but dataset only has {len(self.examples)} examples")
        sampled_examples = random.sample(self.examples, n)
        return InstructionDataset(examples=sampled_examples)


In [17]:
# Function to convert dataset examples to Pydantic models
def validate_dataset_with_pydantic(dataset):
    """Convert dataset examples to validated Pydantic models."""
    valid_examples = []
    invalid_examples = []

    for i, example in enumerate(dataset):
        try:
            # Create an InstructionExample model from the dataset entry
            valid_example = InstructionExample(
                instruction=example["instruction"],
                input=example.get("input", ""),
                output=example["output"]
            )
            valid_examples.append(valid_example)
        except Exception as e:
            print(f"Example {i} invalid: {e}")
            invalid_examples.append((i, str(e)))

    print(f"Valid examples: {len(valid_examples)}")
    print(f"Invalid examples: {len(invalid_examples)}")

    # Create an InstructionDataset from the valid examples
    return InstructionDataset(examples=valid_examples), invalid_examples


In [18]:
# Validate the dataset
validated_dataset, invalid_examples = validate_dataset_with_pydantic(dataset)

# Sample 5,000 examples for training
sampled_dataset = validated_dataset.sample(15000)
print(f"Sampled {len(sampled_dataset.examples)} examples for training")

# Log to wandb
wandb.log({"valid_examples": len(validated_dataset.examples),
           "invalid_examples": len(invalid_examples),
           "training_examples": len(sampled_dataset.examples)})

# Convert back to a format suitable for the Transformers library
dataset_for_training = [example.model_dump() for example in sampled_dataset.examples]

Example 189265 invalid: 1 validation error for InstructionExample
output
  Value error, Output cannot be empty [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error
Example 190420 invalid: 1 validation error for InstructionExample
output
  Value error, Output cannot be empty [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error
Example 197289 invalid: 1 validation error for InstructionExample
output
  Value error, Output cannot be empty [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error
Example 197792 invalid: 1 validation error for InstructionExample
output
  Value error, Output cannot be empty [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error
Example 202087 invalid: 1 va

## Instruction Formatting

The format of instructions is crucial for effective fine-tuning. Common formats include:

1. **Alpaca Format**: Used by the Stanford Alpaca project
   ```
   Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
   
   ### Instruction:
   {instruction}
   
   ### Input:
   {input}
   
   ### Response:
   {output}
   ```

2. **Llama 2 Format**: Used in Meta's Llama 2 instruction tuning
   ```
   <s>[INST] <<SYS>>
   {system_prompt}
   <</SYS>>
   
   {instruction} [/INST] {output} </s>
   ```

3. **ChatML Format**: Used by OpenAI models
   ```
   <|im_start|>system
   {system_prompt}<|im_end|>
   <|im_start|>user
   {instruction}<|im_end|>
   <|im_start|>assistant
   {output}<|im_end|>
   ```

For this course, we'll use the Alpaca format, which has proven effective across a wide range of models. Let's create a formatting function:

In [15]:
def format_instruction_prompt(example):
    """Format an instruction example using the Alpaca template."""
    # Check if 'input' key exists and has content
    has_input = example.get('input', None) is not None and example.get('input', '').strip() != ''

    # Define the prompts based on the presence of input
    if has_input:
        primer_prompt = ("Below is an instruction that describes a task, paired with an input "
                        "that provides further context. Write a response that appropriately completes the request.")
        input_template = f"### Input:\n{example['input']}\n\n"
    else:
        primer_prompt = ("Below is an instruction that describes a task. "
                        "Write a response that appropriately completes the request.")
        input_template = ""

    instruction_template = f"### Instruction:\n{example['instruction']}\n\n"

    # Format the response
    response_template = f"### Response:\n{example['output']}\n\n"

    # Combine all parts
    formatted_prompt = f"{primer_prompt}\n\n{instruction_template}{input_template}{response_template}"

    return formatted_prompt

# Test the formatting with a sample example
sample_example = dataset_for_training[2]
print(format_instruction_prompt(sample_example))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Construct a regex to check if a provided string is phone number or not


### Response:
Regex: ^\d{10}$




# Module 5: Model Preparation

## Model Selection

For this course, we'll use DeciLM-6B, a 6 billion parameter model from Deci AI that offers a good balance of performance and efficiency. Other options you might consider include:

- **LLaMA 2 (7B/13B/70B)**: Meta's updated LLaMA models
- **Mistral (7B)**: A strong open-source 7B model
- **Phi-2 (2.7B)**: Microsoft's smaller but capable model
- **Falcon (7B/40B)**: Highly efficient models from TII

## Quantization with BitsAndBytes

To run large models on consumer hardware, we need to use quantization. The BitsAndBytes library provides efficient quantization methods:

- **4-bit Quantization**: Reduces memory usage by approximately 8x
- **NF4 Data Type**: Optimized for weights with a normal distribution
- **Double Quantization**: Further reduces memory usage
- **BFloat16 Compute**: Uses BF16 for calculations to balance precision and speed

Let's load the model with these optimizations:


In [16]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Define the model ID
model_id = "microsoft/phi-1_5"
device = "cuda" # for GPU usage or "cpu" for CPU usage

# Configure quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit quantization
    bnb_4bit_use_double_quant=True,  # Use double quantization for further memory savings
    bnb_4bit_quant_type="nf4",  # Use NF4 data type, optimized for normal distributions
    bnb_4bit_compute_dtype=torch.bfloat16  # Use BF16 for calculations
)


In [17]:
# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    use_cache=False,  # Disable KV cache for training
    device_map="auto",  # Automatically determine device mapping
    trust_remote_code=True  # Allow custom code from the model repo
).to(device)


config.json:   0%|          | 0.00/736 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.84G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/74.0 [00:00<?, ?B/s]

In [18]:
# Set pretraining TP to 1 (Tensor Parallelism)
model.config.pretraining_tp = 1

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Log model details to wandb
wandb.log({
    "model_parameters": model.num_parameters(),
    "context_length": model.config.max_position_embeddings,
    "vocabulary_size": len(tokenizer)
})

tokenizer_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

## LoRA Configuration

Now, we'll set up Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. The key parameters are:

- **r**: The rank of the low-rank matrices (higher = more capacity but more parameters)
- **lora_alpha**: Scaling factor for the LoRA weights
- **lora_dropout**: Dropout probability for regularization
- **bias**: Whether to train the bias terms

These parameters control the trade-off between model capacity and efficiency. Let's configure LoRA and prepare the model:

In [19]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Configure LoRA
peft_config = LoraConfig(
    lora_alpha=4,  # Scale factor for LoRA weights
    lora_dropout=0.1,  # Dropout probability for regularization
    r=8,  # Rank of the low-rank matrices
    bias="none",  # Whether to train bias terms
    task_type="CAUSAL_LM"  # Task type for causal language modeling
)

# Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# Log LoRA configuration to wandb
wandb.log({
    "lora_alpha": peft_config.lora_alpha,
    "lora_dropout": peft_config.lora_dropout,
    "lora_r": peft_config.r,
    "lora_bias": peft_config.bias
})

# Get the PEFT model
model = get_peft_model(model, peft_config)

# Print the trainable parameters
model.print_trainable_parameters()

trainable params: 5,505,024 || all params: 1,423,775,744 || trainable%: 0.3866


# Module 6: Training Configuration

## Training Arguments

The `TrainingArguments` class from Hugging Face's Transformers library configures the training process. Key parameters include:

- **Learning rate**: Controls how quickly the model adapts
- **Batch size**: Number of examples processed in parallel
- **Gradient accumulation**: Simulates larger batch sizes
- **Optimizer**: Algorithm for updating weights
- **Learning rate schedule**: How the learning rate changes during training
- **Mixed precision**: Using lower precision for efficiency

Let's configure our training with these parameters and integrate with Weights & Biases:


In [26]:
from transformers import TrainingArguments

# Configure training arguments
training_args = TrainingArguments(
    # Output directory
    output_dir="phi-1_5_instruction_tuned",

    # Training parameters
    num_train_epochs=1,  # Number of training epochs
    per_device_train_batch_size=8,  # Batch size per device
    gradient_accumulation_steps=2,  # Accumulate gradients over multiple steps
    gradient_checkpointing=True,  # Save memory by recomputing gradients

    # Optimizer settings
    optim="paged_adamw_32bit",  # Optimizer to use
    learning_rate=3e-5,  # Learning rate
    max_grad_norm=0.3,  # Maximum gradient norm for clipping
    weight_decay=0.01,  # Weight decay for regularization

    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of training for learning rate warmup
    lr_scheduler_type="linear",  # Learning rate schedule type

    # Mixed precision
    bf16=True,  # Use bfloat16 precision
    # tf32=True,  # Use TF32 precision (on NVIDIA Ampere GPUs)
    # tf16=True,
    # Logging and saving
    logging_dir="logs",  # Directory for logs
    logging_steps=25,  # Log every N steps
    save_strategy="steps",  # When to save checkpoints
    save_steps=100,  # Save every N steps

    # wandb integration
    report_to="wandb",  # Use wandb for reporting
    run_name=wandb_run_name,  # Run name in wandb

    # Other settings
    disable_tqdm=False,  # Show progress bars
    seed=42  # Random seed for reproducibility
)

# Log training arguments to wandb
wandb.config.update({
    "learning_rate": training_args.learning_rate,
    "epochs": training_args.num_train_epochs,
    "batch_size": training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps,
    "warmup_ratio": training_args.warmup_ratio,
    "weight_decay": training_args.weight_decay
})

# Module 7: Training Process

## SFTTrainer Setup

The `SFTTrainer` (Supervised Fine-Tuning Trainer) from the TRL library simplifies the fine-tuning process for instruction tuning. It handles:

1. Formatting the data into the proper structure
2. Setting up the training loop
3. Integrating with PEFT for efficient fine-tuning

Let's set up our trainer:

In [21]:
from datasets import Dataset

# Convert the list to a Hugging Face Dataset
dataset_for_training = Dataset.from_list(dataset_for_training)

In [28]:
from trl import SFTTrainer

# Set maximum sequence length
max_seq_len = 2048

# Initialize the SFT trainer
trainer = SFTTrainer(
    model=model,  # The model to train
    train_dataset=dataset_for_training,  # Training dataset
    peft_config=peft_config,  # PEFT configuration
    # max_seq_length=max_seq_len,  # Maximum sequence length
    # tokenizer=tokenizer,  # Tokenizer
    # packing=True,  # Pack multiple examples in a single sequence
    formatting_func=format_instruction_prompt,  # Function to format examples
    args=training_args,  # Training arguments
)

# Wandb will automatically track training progress through the integration with transformers



Applying formatting function to train dataset:   0%|          | 0/5000 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/5000 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/5000 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/5000 [00:00<?, ? examples/s]

## Training Execution

Now that we've set up all the components, let's start the training process. This will:

1. Fine-tune the model on our instruction dataset
2. Log metrics to Weights & Biases
3. Save checkpoints at regular intervals

The training will take several hours depending on your hardware. On a single 16GB GPU, expect around 3-6 hours for 3 epochs with our configuration.

In [29]:
# Start training
trainer.train()

# Save the final model
trainer.save_model()

# Log the final loss
wandb.log({"final_loss": trainer.state.log_history[-1].get("loss", None)})

Step,Training Loss
25,1.991


KeyboardInterrupt: 

In [30]:
# Free up memory
del model
torch.cuda.empty_cache()

## UnSloth LLM SFT Finetunning

In [31]:
!pip install -q -U unsloth xformers
# !pip install -U transformers unsloth unsloth_zoo

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m52.3/52.3 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m312.9/312.9 kB[0m [31m29.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m117.2/117.2 MB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.5/491.5 kB[0m [31m41.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m196.0/196.0 kB[0m [31m20.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m131.7/131.7 kB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m213.6/213.6 kB[0m [31m21.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [36]:
! pip install transformers==4.55.0

Collecting transformers==4.55.0
  Downloading transformers-4.55.0-py3-none-any.whl.metadata (39 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers==4.55.0)
  Downloading tokenizers-0.21.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.55.0-py3-none-any.whl (11.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.3/11.3 MB[0m [31m107.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading tokenizers-0.21.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m102.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.22.0
    Uninstalling tokenizers-0.22.0:
      Successfully uninstalled tokenizers-0.22.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.0
    Unins

### Check the model before we run

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model and tokenizer loading
model_name = "unsloth/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                            torch_dtype=torch.bfloat16).to("cuda")
# Inference Example
prompt = "The capital of France is"  # Or any prompt you want
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")  # Move input to GPU

# Generate text
with torch.no_grad():  # Disable gradient calculation for faster inference
    output = model.generate(
        input_ids,
        max_length=100,           # Adjust as needed
        num_beams=5,              # For better generation quality (optional)
        temperature=0.7,          # Adjust for creativity
        top_p=0.9,                # Adjust for more focused generation
        do_sample=True,           # Enable sampling
        pad_token_id=tokenizer.eos_token_id  # Important for padding
    )

# Decode and print the output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
# Free up memory
del model
torch.cuda.empty_cache()

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/752 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The capital of France is Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Japan is Tokyo. The capital of Korea is Seoul. The capital of Mexico is Mexico City. The capital of Nigeria is Abuja. The capital of Pakistan is Islamabad. The capital of Peru is Lima. The capital of Philippines is Manila. The capital of Poland is Warsaw. The capital of Portugal is Lisbon. The capital of Russia is Moscow. The capital of Saudi Arabia is Riyadh


In [5]:
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

num_epochs = 2
# Unsloth Fine-tuning Setup
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True

# Load model with Unsloth optimizations
unsloth_model, unsloth_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-1.7B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)



Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.10: Fast Qwen3 patching. Transformers: 4.55.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.41G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

In [19]:
# Format dataset for Unsloth
def format_prompts(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        if input_text.strip():
            text = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{output}"""
        else:
            text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""
        texts.append(text)
    return {"text": texts}


In [22]:

# Convert dataset to Unsloth format
unsloth_dataset = dataset_for_training.select(range(5000))  # Use subset for demo
unsloth_dataset = unsloth_dataset.map(format_prompts, batched=True)

# Improved Unsloth configuration
unsloth_model = FastLanguageModel.get_peft_model(
    unsloth_model,
    r=8,  # Increased rank for better capacity
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                    #, "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,  # 2x the rank
    lora_dropout=0.05,  # Small dropout for regularization
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=True,  # Rank-stabilized LoRA
)

# Improved training arguments
unsloth_trainer = SFTTrainer(
    model=unsloth_model,
    tokenizer=unsloth_tokenizer,
    train_dataset=unsloth_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,  # Enable packing for efficiency
    args=TrainingArguments(
        per_device_train_batch_size=4,  # Reduced for stability
        gradient_accumulation_steps=1,  # Maintain effective batch size
        warmup_steps=50,  # 10% warmup
        max_steps=250,  # More training steps
        learning_rate=1e-3,  # Reduced learning rate
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,  # Less frequent logging
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",  # Better scheduler
        warmup_ratio=0.1,
        seed=3407,
        output_dir="unsloth_outputs",
        report_to="wandb",
        run_name="unsloth-Qwen3-improved",
        save_steps=100,  # Save checkpoints
        eval_steps=100,  # If you have validation data
        dataloader_num_workers=4,  # Faster data loading
        remove_unused_columns=False,
        max_grad_norm=1.0,  # Gradient clipping
    ),
)


print("Unsloth trainer configured successfully!")
print(f"GPU memory before training: {torch.cuda.memory_allocated()/1024**3:.2f} GB")

# Add to track improvement
wandb.log({
    "effective_batch_size": 4 * 8,  # 32
    "total_training_tokens": 14000 * max_seq_length * num_epochs,
    "learning_rate_schedule": "cosine",
    "lora_parameters": 32 * 64 * 2,  # Approximate trainable params
})

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.8.10 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/5000 [00:00<?, ? examples/s]

Unsloth trainer configured successfully!
GPU memory before training: 1.35 GB


In [23]:
print("Starting Unsloth training...")
trainer_stats = unsloth_trainer.train()

# Save Unsloth model
unsloth_model.save_pretrained("unsloth_lora_model")
unsloth_tokenizer.save_pretrained("unsloth_lora_model")

# For demo purposes, we'll simulate the training completion
print("Unsloth training configuration completed!")
print("Benefits of Unsloth:")
print("- 2x faster training speed")
print("- 50% less memory usage")
print("- Optimized CUDA kernels")
print("- Seamless integration with existing workflows")

# Log completion
wandb.log({"unsloth_training_completed": True})

Starting Unsloth training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 3,211,264 of 1,723,786,240 (0.19% trained)


Step,Training Loss,entropy
10,2.0986,0
20,1.4555,No Log
30,1.361,No Log
40,1.3167,No Log
50,1.2493,No Log
60,1.3151,No Log
70,1.2994,No Log
80,1.2904,No Log
90,1.3056,No Log
100,1.1257,No Log


Unsloth: Will smartly offload gradients to save VRAM!
Unsloth training configuration completed!
Benefits of Unsloth:
- 2x faster training speed
- 50% less memory usage
- Optimized CUDA kernels
- Seamless integration with existing workflows


In [24]:
import os

# Save Unsloth model after training
print("Saving Unsloth fine-tuned model...")

# Create directory for model
os.makedirs("unsloth_fine_tuned", exist_ok=True)


try:
    # Save the model with its LoRA adapters
    unsloth_model.save_pretrained("unsloth_fine_tuned")
    unsloth_tokenizer.save_pretrained("unsloth_fine_tuned")
    print("✅ Unsloth model successfully saved to 'unsloth_fine_tuned'")

    # Log saving event to wandb
    wandb.log({"unsloth_model_saved": True})

    # Optional: Save in safetensors format for better security and compatibility
    unsloth_model.save_pretrained("unsloth_fine_tuned", safe_serialization=True)
    print("✅ Model also saved in safetensors format")
except Exception as e:
    print(f"❌ Error saving Unsloth model: {e}"
         )
    print("Note: If training wasn't actually run, this is expected.")

    # For demonstration purposes, provide information about what would happen
    print("\nTo save a trained Unsloth model, you would typically:")
    print("1. Run the training loop (uncomment the training code)")
    print("2. Execute this cell to save the fine-tuned model")
    print("3. The model can then be loaded for inference with: FastLanguageModel.from_pretrained('unsloth_fine_tuned')")

Saving Unsloth fine-tuned model...
✅ Unsloth model successfully saved to 'unsloth_fine_tuned'
✅ Model also saved in safetensors format


In [25]:
from unsloth import FastLanguageModel
import torch
from transformers import pipeline

# Test the Unsloth model with sample instructions

model_path = "/content/unsloth_fine_tuned"  # Using original model since training wasn't executed
print(f"Loading model from {model_path}")

# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=2048,
    load_in_4bit=True,
)

# Function to generate responses
def generate_unsloth_response(instruction, input_text=""):
    # Format the prompt based on whether input is provided
    if input_text.strip():
        prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
"""
    else:
        prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""

    # Generate text with Unsloth model
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.95,
    )

    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return response

# Test with sample instructions
test_instructions = [
    {
        "instruction": "Explain the concept of transfer learning in AI.",
        "input": ""
    },
    {
        "instruction": "Write a short poem about fine-tuning language models.",
        "input": ""
    },
    {
        "instruction": "Summarize the following text.",
        "input": "Low-rank adaptation (LoRA) is a technique that accelerates the fine-tuning of large language models while consuming less memory. LoRA adds pairs of rank-decomposition weight matrices to existing weights, and only trains these newly added weights."
    }
]

# Generate and display responses
print("\n===== Unsloth Model Test Results =====\n")
for i, test in enumerate(test_instructions):
    print(f"Test {i+1}:")
    print(f"Instruction: {test['instruction']}")
    if test['input']:
        print(f"Input: {test['input']}")

    response = generate_unsloth_response(test['instruction'], test['input'])
    print(f"\nResponse:\n{response}\n")
    print("-" * 50)

# Log to wandb
wandb.log({
    "unsloth_test_completed": True,
    "num_test_instructions": len(test_instructions)
})

Loading model from /content/unsloth_fine_tuned
==((====))==  Unsloth 2025.8.10: Fast Qwen3 patching. Transformers: 4.55.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

===== Unsloth Model Test Results =====

Test 1:
Instruction: Explain the concept of transfer learning in AI.

Response:
Transfer learning is a machine learning technique that uses pre-trained models to solve new problems. The idea is that you can take a model that has been trained on one task and adapt it to a new task with little or no additional training. This makes it possible to solve complex problems with a small amount of data and high accuracy. Transfer learning has been successfully