<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/FineTuning_Mistral7B_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
# The `%%capture` magic in Jupyter/Colab captures output, suppressing it from being displayed.

# Install the `unsloth` package from PyPI
!pip install unsloth

# Uninstall `unsloth` to ensure a clean installation, then install the latest version from GitHub
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [None]:
from unsloth import FastLanguageModel  # Importing the FastLanguageModel class from the unsloth library
import torch  # Importing PyTorch for handling tensors and computations

# Set the maximum sequence length for the model's input
max_seq_length = 2048  # The maximum number of tokens the model can process in one sequence. Customize as needed.
# Note: The library internally supports RoPE (Rotary Position Embedding) scaling to handle long sequences.

# Set the data type for model computation
dtype = None  # Automatically detect the best precision.
# Set dtype to 'torch.float16' for Tesla T4/V100 GPUs, or 'torch.bfloat16' for Ampere and newer GPUs.

# Choose whether to use 4-bit quantization for the model
load_in_4bit = True  # 4-bit quantization cuts the memory needed for the weights, so the 7B model fits on smaller GPUs.
# Set to False if higher precision is needed or memory is not a concern.

# Load the model and tokenizer using the FastLanguageModel class
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-v0.3",  # Specify the model name to load. Replace with any model of your choice.
    max_seq_length=max_seq_length,  # Pass the chosen maximum sequence length.
    dtype=dtype,  # Pass the chosen data type for computations.
    load_in_4bit=load_in_4bit,  # Pass whether to use 4-bit quantization.
)

# Explanation:
# - `FastLanguageModel.from_pretrained` is a convenient method to load both the model and tokenizer.
# - `model_name`: The name of the pre-trained model. Example: "unsloth/mistral-7b-v0.3".
# - `max_seq_length`: Configures the maximum token length the model can handle in one input.
# - `dtype`: Allows precise control over computation precision for optimal performance on different hardware.
# - `load_in_4bit`: If True, enables 4-bit quantization to reduce memory footprint while maintaining good accuracy.
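
To make the memory effect of `load_in_4bit` concrete, here is a rough back-of-the-envelope sketch. It assumes about 7.25 billion parameters for Mistral 7B v0.3 (an approximate figure, not taken from this notebook) and ignores quantization metadata, activations, and the CUDA context, so treat the results as lower bounds.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: ~7.25e9 parameters for Mistral 7B v0.3 (approximate).
APPROX_PARAMS = 7.25e9

def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3

fp16_gib = weight_memory_gib(APPROX_PARAMS, 16)
int4_gib = weight_memory_gib(APPROX_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 4-bit weights: ~{int4_gib:.1f} GiB")
```

Under these assumptions the weights alone shrink by a factor of four, which is what lets a 7B model load on a 16 GB GPU with room left for training.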


### We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
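
As a reminder of what a LoRA adapter is, here is the standard low-rank update in its usual notation (the symbols $W_0$, $A$, and $B$ are the conventional ones and do not appear in this notebook's code). Instead of updating a frozen weight matrix $W_0$, LoRA trains two small matrices:

$$
W = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

With `r=16` and `lora_alpha=16` the scaling factor $\alpha / r$ equals 1, and each adapted layer adds only $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters.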

In [None]:
# Configure the model with PEFT (Parameter-Efficient Fine-Tuning) settings using LoRA (Low-Rank Adaptation)
model = FastLanguageModel.get_peft_model(
    model,  # The base model to be fine-tuned using PEFT techniques

    # Low-Rank Adaptation (LoRA) rank
    r=16,  # Defines the rank of the low-rank matrices. Common choices: 8, 16, 32, 64, 128.
    # Larger values increase expressiveness but require more memory.

    # Modules to target for LoRA fine-tuning
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projection layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    # Only these specified modules will be fine-tuned to reduce memory and computational overhead.

    # LoRA-specific hyperparameters
    lora_alpha=16,  # Scaling factor for the LoRA update, which is scaled by lora_alpha / r (here 16 / 16 = 1).
    lora_dropout=0,  # Dropout rate for LoRA. Any value is supported, but 0 is the setting Unsloth optimizes for.

    # Bias handling in fine-tuning
    bias="none",  # Specifies bias tuning. "none" is optimized for performance. Alternatives: "all", "lora_only".

    # Optimizations for VRAM and context length
    use_gradient_checkpointing="unsloth",  # Use gradient checkpointing to save memory during training.
    # The "unsloth" setting reduces VRAM usage by ~30%, allowing larger batch sizes or longer contexts.

    # Random seed for reproducibility
    random_state=3407,  # Ensures the results are reproducible across runs.

    # Advanced fine-tuning features
    use_rslora=False,  # Enables Rank-Stabilized LoRA (rsLoRA) if set to True. Useful for stability at high ranks.
    loftq_config=None,  # Configures LoftQ (LoRA-Fine-Tuning-aware Quantization) initialization, if used. None disables it.
)
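
To get a feel for how many parameters these adapters add, here is a small sketch that counts them by hand. The layer dimensions are assumptions about the Mistral 7B architecture (hidden size 4096, MLP size 14336, 32 layers, key/value projections of width 1024 because of grouped-query attention) and are not read from the loaded model; the base model size of 7.25 billion parameters is likewise approximate.

```python
# Count LoRA parameters for r=16 on the seven target modules.
# ASSUMPTION: the Mistral 7B shapes below are hardcoded, not read from the model.
R = 16
NUM_LAYERS = 32
HIDDEN, KV_DIM, MLP = 4096, 1024, 14336

# (in_features, out_features) for each targeted projection in one layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total_lora = per_layer * NUM_LAYERS

APPROX_BASE_PARAMS = 7.25e9  # ASSUMPTION: approximate size of the base model
print(f"LoRA parameters: {total_lora:,}")
print(f"Share of base model: {100 * total_lora / APPROX_BASE_PARAMS:.2f}%")
```

Under these assumptions, rank 16 trains well under 1% of the weights. The count grows linearly with `r`, so higher ranks move the share toward the upper end of the quoted range.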


<a name="Data"></a>
### Data Prep and Load
We now use the text-summarizer dataset from [SURESHBEEKHANI](https://huggingface.co/datasets/SURESHBEEKHANI/text-summarizer). The formatting function reads its `instruction`, `dialogue`, and `summary` columns.


In [None]:
# Define a string template for the prompt format used for generating responses
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# EOS_TOKEN is the special token that signifies the end of a sequence. It ensures the model stops generating when it reaches this token.
EOS_TOKEN = tokenizer.eos_token  # Retrieve the EOS token from the tokenizer (ensures proper stopping in generation)

# Define the formatting function for processing the dataset
def formatting_prompts_func(examples):
    instructions = examples["instruction"]  # Extract instructions from the dataset
    inputs = examples["dialogue"]  # Extract dialogues (context) from the dataset
    outputs = examples["summary"]  # Extract expected summaries from the dataset

    texts = []  # Initialize an empty list to store the formatted text prompts
    # Loop over the instructions, dialogues, and summaries to create the full prompt
    for instruction, dialogue, summary in zip(instructions, inputs, outputs):
        # Use the alpaca_prompt template to format each instruction, dialogue, and summary
        # Append EOS_TOKEN so the model learns where a response ends; without it, generation may never stop
        text = alpaca_prompt.format(instruction, dialogue, summary) + EOS_TOKEN
        texts.append(text)  # Append the formatted text to the list

    # Return the formatted text wrapped in a dictionary with the key "text" for use in model training
    return { "text": texts }

# Load the dataset from the "SURESHBEEKHANI/text-summarizer" dataset repository
# 'split="train"' indicates we are loading the training data.
from datasets import load_dataset
dataset = load_dataset("SURESHBEEKHANI/text-summarizer", split="train")

# Apply the formatting function to the dataset in batches, preparing the data for model training
# 'batched=True' means the function will process multiple examples at once, improving efficiency.
dataset = dataset.map(formatting_prompts_func, batched=True)
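
To see what `formatting_prompts_func` produces without downloading anything, the sketch below runs the same logic on a tiny hand-made batch. The example rows and the `"</s>"` end-of-sequence string are made up for illustration; in the notebook the token comes from `tokenizer.eos_token`.

```python
# Stand-alone demo of the prompt formatting on a fake two-row batch.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

DEMO_EOS = "</s>"  # ASSUMPTION: placeholder for tokenizer.eos_token

def format_batch(examples: dict) -> dict:
    """Mirror of formatting_prompts_func for a batch given as a dict of lists."""
    texts = [
        ALPACA_PROMPT.format(instruction, dialogue, summary) + DEMO_EOS
        for instruction, dialogue, summary in zip(
            examples["instruction"], examples["dialogue"], examples["summary"]
        )
    ]
    return {"text": texts}

demo_batch = {  # hypothetical rows, not from the real dataset
    "instruction": ["Summarize the dialogue.", "Summarize the dialogue."],
    "dialogue": ["A: Lunch at noon?\nB: Sure, see you then.",
                 "A: Is the report done?\nB: Almost."],
    "summary": ["A and B agree to have lunch at noon.",
                "B has nearly finished the report."],
}

formatted = format_batch(demo_batch)
print(formatted["text"][0])
```

Each row becomes one training string that ends with the EOS marker, which is the `text` field the trainer reads later.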


In [None]:
dataset

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove the `max_steps=60` argument. Unsloth also supports TRL's `DPOTrainer`!
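
To relate `max_steps` to epochs, the following sketch works through the arithmetic. The batch settings match the training cell in this notebook (batch size 2, 4 gradient accumulation steps) and a single GPU is assumed; the dataset size of 10,000 rows is a made-up placeholder, so substitute `len(dataset)` for a real estimate.

```python
import math

PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
NUM_DEVICES = 1        # ASSUMPTION: single GPU
MAX_STEPS = 60
NUM_EXAMPLES = 10_000  # ASSUMPTION: placeholder; use len(dataset) in practice

# Examples consumed per optimizer update
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
# Examples consumed by the short 60-step run
examples_seen = effective_batch * MAX_STEPS
# Approximate number of optimizer updates needed to see every example once
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {MAX_STEPS} steps: {examples_seen}")
print(f"Approximate optimizer steps for one epoch: {steps_per_epoch}")
```

With these numbers, the 60-step run touches only a few hundred examples, which is why it works as a quick smoke test and not as a full fine-tune.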

In [None]:
# Import necessary libraries and modules
from trl import SFTTrainer  # SFTTrainer is a class used for training models using Supervised Fine-Tuning (SFT)
from transformers import TrainingArguments  # TrainingArguments holds the configuration for model training
from unsloth import is_bfloat16_supported  # Utility to check if bfloat16 is supported in the system

# Initialize the SFTTrainer with several parameters for training
trainer = SFTTrainer(
    model=model,  # The LoRA-wrapped model created by get_peft_model above
    tokenizer=tokenizer,  # The tokenizer loaded together with the model
    train_dataset=dataset,  # The formatted dataset prepared in the data section
    dataset_text_field="text",  # Field in the dataset containing the formatted prompt text
    max_seq_length=max_seq_length,  # Maximum sequence length for tokenized inputs (2048, set when loading the model)
    dataset_num_proc=2,  # Number of processes to use for data preprocessing; can speed up data loading
    packing=False,  # Whether to use sequence packing, which can make training faster for short sequences
    args=TrainingArguments(  # Set the training arguments and hyperparameters
        per_device_train_batch_size=2,  # Batch size per device (e.g., GPU or CPU) during training
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 batches per optimizer update (effective batch size 2 * 4 = 8 per device)
        warmup_steps=5,  # Number of steps for the learning rate warmup before it starts decaying
        max_steps=60,  # Total number of optimizer steps. When set, it overrides num_train_epochs.
        learning_rate=2e-4,  # Learning rate for the optimizer
        fp16=not is_bfloat16_supported(),  # Use mixed-precision (float16) if bfloat16 is not supported by the hardware
        bf16=is_bfloat16_supported(),  # Use bfloat16 if supported by the hardware
        logging_steps=1,  # Frequency (in steps) to log training metrics (e.g., loss) during training
        optim="adamw_8bit",  # Optimizer type. Using AdamW with 8-bit precision for memory efficiency
        weight_decay=0.01,  # Weight decay parameter for regularization to prevent overfitting
        lr_scheduler_type="linear",  # Learning rate scheduler type; here it's set to linear decay
        seed=3407,  # Random seed for reproducibility of results
        output_dir="outputs",  # Directory where the model checkpoints and outputs will be saved
        report_to="none",  # Specify where to report metrics (e.g., use "wandb" for reporting to Weights & Biases)
    ),
)

# The trainer is now set up and can be used for training the model.
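
The learning rate schedule configured above (5 warmup steps, then linear decay over 60 steps from a peak of 2e-4) can be sketched in plain Python. This is an illustrative reimplementation of a linear-warmup, linear-decay schedule under those settings, not the trainer's own code.

```python
def linear_schedule_lr(step: int, peak_lr: float = 2e-4,
                       warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for s in (0, 1, 5, 30, 59, 60):
    print(f"step {s:>2}: lr = {linear_schedule_lr(s):.2e}")
```

The rate peaks right after warmup and reaches zero at the final step, so shortening or lengthening `max_steps` also changes how fast the rate decays.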

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
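Let's run the model! You can change the input text - leave the summary blank! Note that the cell below redefines `alpaca_prompt` as a two-field Text/Summary template, which differs from the three-field template used during training.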

In [None]:
# Define the prompt template for text summarization.
alpaca_prompt = """Below is a passage of text. Write a concise summary of the text below.

### Text:
{}

### Summary:
{}"""  # The summary part is left empty for generation.

# FastLanguageModel.for_inference(model) enables optimizations for faster inference.
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference by configuring the model for efficient use

# Example of a text summarization task.
# Here, you provide a longer piece of text as input, and the model will generate a concise summary.
inputs = tokenizer(
    [
        alpaca_prompt.format(  # Format the alpaca_prompt with the specific instruction and input text.
            "The quick brown fox jumps over the lazy dog. The dog, despite being lazy, tries to catch the fox but fails. The fox quickly disappears into the forest, leaving the dog behind. This is a classic example of the speed of the fox being unmatched by the dog’s sluggishness."  # Example of input text to summarize.
            ,  # Leave the summary part empty for generation.
            ""  # Output: empty as the model will generate the summary.
        )
    ], return_tensors="pt"  # Convert input to PyTorch tensors.
).to("cuda")  # Move the input data to the GPU for faster processing.

# Generate the summary using the model.
# 'max_new_tokens' controls how many tokens the model is allowed to generate.
# 'use_cache' reuses the cached attention key/value states of earlier tokens, which speeds up generation.
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the generated tokens back into readable text.
# This will give us the model's summary of the provided input text.
tokenizer.batch_decode(outputs)  # Convert the output tokens to text.
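
`batch_decode` returns the full sequence, prompt included. A small helper like the one below can pull out just the generated summary. It is a sketch that assumes the `### Summary:` marker from the prompt template and a `"</s>"` end-of-sequence string; the function name and the sample decoded text are invented for illustration.

```python
def extract_summary(decoded: str, marker: str = "### Summary:",
                    eos_token: str = "</s>") -> str:
    """Return the text after the last marker, without the EOS token."""
    # rpartition keeps everything after the final occurrence of the marker
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return decoded.strip()  # marker missing: fall back to the whole text
    return tail.replace(eos_token, "").strip()

sample_decoded = (  # hypothetical model output
    "<s>Below is a passage of text. Write a concise summary of the text below.\n\n"
    "### Text:\nThe quick brown fox jumps over the lazy dog.\n\n"
    "### Summary:\nA fox outruns a lazy dog.</s>"
)
print(extract_summary(sample_decoded))
```

With real output, this could be applied as `[extract_summary(t) for t in tokenizer.batch_decode(outputs)]`.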


You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# Define the prompt template for text summarization.
# This will instruct the model to generate a concise summary of the input text.
alpaca_prompt = """Below is a passage of text. Write a concise summary of the text below.

### Text:
{}

### Summary:
{}"""  # The summary part is left empty for generation.

# FastLanguageModel.for_inference(model) optimizes the model for faster inference.
# This typically involves optimizations that speed up processing, such as reducing latency and improving GPU utilization.
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference by configuring the model for efficient use

# Tokenize the input text for the summarization task.
# Provide an example of input text that needs to be summarized. The model will generate a summary based on this input.
inputs = tokenizer(
    [
        alpaca_prompt.format(  # Format the alpaca_prompt with a specific text input and empty summary part.
            "The quick brown fox jumps over the lazy dog. The dog, despite being lazy, tries to catch the fox but fails. The fox quickly disappears into the forest, leaving the dog behind."  # Example input text to summarize.
            ,  # Leave the summary part empty for the model to generate.
            ""  # Output: Empty since the model will generate the summary.
        )
    ], return_tensors="pt"  # Convert input text into PyTorch tensors, as required for model input.
).to("cuda")  # Move the input data to the GPU for faster processing.

# Import TextStreamer from transformers to stream the output generation.
# The TextStreamer class will allow the model to generate the summary token by token.
from transformers import TextStreamer

# Initialize the TextStreamer with the tokenizer to decode the generated tokens during streaming.
# This enables real-time generation of summaries.
text_streamer = TextStreamer(tokenizer)

# Generate the summary using the model, and stream the output token by token.
# This helps in quickly receiving the summary output, especially for longer texts.
_ = model.generate(
    **inputs,  # Provide the tokenized inputs (text) to the model.
    streamer=text_streamer,  # Enable token-by-token streaming for faster output generation.
    max_new_tokens=128  # Set the maximum number of tokens to generate (the length of the summary).
)


<a name="Save"></a>
### Saving, loading finetuned models
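To save the final model as LoRA adapters, use `save_pretrained` for a local save. This stores only the adapter weights, not the full base model.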

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

Now, to load the LoRA adapters we just saved for inference:

In [None]:
# Import FastLanguageModel from the unsloth library.
# This class allows you to load pre-trained models, configure them for fast inference, and perform tasks like text generation.
from unsloth import FastLanguageModel

# Load the pre-trained model and tokenizer using the FastLanguageModel class.
# - 'model_name' is the name of the model you trained (in this case, "lora_model").
# - 'max_seq_length' is the maximum sequence length the model can handle for input.
# - 'dtype' is the data type for model weights (such as float32 or float16).
# - 'load_in_4bit' specifies whether to load the model with reduced 4-bit precision for efficiency (saves memory).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",  # Replace with the name of the model you used for training.
    max_seq_length = max_seq_length,  # Maximum length of the input sequence.
    dtype = dtype,  # Data type for the model (e.g., float32, float16).
    load_in_4bit = load_in_4bit,  # Option to load the model in 4-bit precision to save memory.
)

# Enable optimizations for inference to make the model's token generation up to 2x faster.
# This improves performance by reducing latency and making the model more efficient during text generation.
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference for optimized generation.

# 'alpaca_prompt' is the two-field Text/Summary template defined in the inference cells above.
# It has one placeholder for the passage to summarize and one for the summary.

# Tokenize the input text using the tokenizer.
# Here we prepare the input by filling the prompt template with a passage and an empty summary.
inputs = tokenizer(
    [
        alpaca_prompt.format(  # Format the alpaca_prompt with the input text and an empty summary.
            "The quick brown fox jumps over the lazy dog. The dog, despite being lazy, tries to catch the fox but fails. The fox quickly disappears into the forest, leaving the dog behind.",  # Text: the passage to summarize.
            ""  # Summary: left empty, as the model will generate it.
        ),
    ], return_tensors="pt"  # Convert the formatted input text into PyTorch tensors (required for the model to process).
).to("cuda")  # Move the input tensors to the GPU to speed up computation.

# Generate the output using the model based on the tokenized inputs.
# The model will generate a response with a maximum of 64 new tokens.
# 'use_cache=True' allows for more efficient generation by reusing intermediate states during the generation process.
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the generated token IDs back into human-readable text using the tokenizer.
# 'batch_decode' converts the tokenized outputs into strings of text.
tokenizer.batch_decode(outputs)  # Retrieve the final generated output (e.g., the model's summary of the passage).


### Push the trained model to the Hugging Face Model Hub using the GGUF format

In [None]:
import os  # Used to read the Hugging Face token from the environment

# Push the trained model to the Hugging Face Model Hub using the GGUF format
model.push_to_hub_gguf(
    "SURESHBEEKHANI/Mistral_7B_Summarizer_SFT_GGUF",  # Repository path on the Hugging Face Hub. Replace "SURESHBEEKHANI" with your own Hugging Face username.
    tokenizer,  # Pass the tokenizer associated with the model to ensure compatibility on the hub
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],  # GGUF quantization methods to export
    token=os.environ["HF_TOKEN"],  # Hugging Face write token, read from the HF_TOKEN environment variable; never hardcode a token in a shared notebook
)