In [1]:
%pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git datasets bitsandbytes einops --progress-bar off

Note: you may need to restart the kernel to use updated packages.


In [2]:
from huggingface_hub import notebook_login
notebook_login() # Logging into the Hugging Face Hub from a notebook environment


In [3]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
%pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    %pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    %pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

🌟 **4-Bit Quantization Explained:**

4-bit quantization is a technique used to compress data, especially in machine learning. It involves reducing the number of bits used to represent a value down to just 4 bits.

🧮 **Understanding the Concept:**

- **Bits and Numbers:** Computers store information using bits, which can be 0 or 1. More bits allow for a wider range of numbers to be represented with greater precision.
- **Quantization:** Think of it like compressing an image by reducing the number of colors. Similarly, in machine learning, quantization reduces the bits used to represent values, such as weights in neural networks.
- **4-Bit Quantization:** Specifically, this method uses only 4 bits to represent each value, significantly cutting down on memory requirements compared to traditional 16-bit or 32-bit formats.
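
The idea can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a simple symmetric "absmax" scheme with one shared scale for all values; the function names are made up for this example, and libraries such as bitsandbytes use more elaborate block-wise 4-bit data types.

```python
def quantize_4bit(values):
    """Map floats to signed 4-bit integer codes in [-7, 7] with one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax > 0 else 1.0  # 7 is the largest code used here
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [1.75, -0.5, 0.6, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)

print(codes)     # [7, -2, 2, 0]
print(restored)  # [1.75, -0.5, 0.5, 0.0]
```

Only `0.6` changes (it comes back as `0.5`): that rounding error is the accuracy cost of storing each value in 4 bits.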

🔹 **Benefits of 4-Bit Quantization:**

- **Reduced Memory Footprint:** Makes it practical to run large models, such as language models, on devices with limited memory.
- **Faster Processing:** Smaller models may process information more quickly, leading to faster inference times.
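
A rough back-of-the-envelope calculation shows the size of the saving. The sketch below assumes a hypothetical model with 1.5 billion parameters and counts only the weights; it ignores quantization constants, activations, and any layers that are kept in higher precision.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 1_500_000_000  # assumed round figure for a "1.5B" model

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(num_params, bits):.2f} GB")
```

Under these assumptions the weights take about 6 GB at 32 bits, 3 GB at 16 bits, and 0.75 GB at 4 bits.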

🔸 **Challenges of 4-Bit Quantization:**

- **Loss of Accuracy:** There might be a slight decrease in model accuracy due to reduced bit precision.
- **Computational Overhead:** Quantized weights must be converted back to a higher precision before they are used in a computation, which adds extra work during training and inference.

🚀 **In Conclusion:**

4-bit quantization is a powerful technique for compressing machine learning models, making them suitable for deployment on memory-constrained devices and potentially enhancing processing speed. However, striking a balance between compression and accuracy is essential.

In [4]:
# Import the FastLanguageModel from the unsloth library
from unsloth import FastLanguageModel

# Import the torch library for tensor computations
import torch

# Set the maximum sequence length for the language model
max_seq_length = 2048

# Data type for the model: None lets Unsloth auto-detect it
# (float16 on Tesla T4/V100, bfloat16 on Ampere and newer GPUs)
dtype = None

# Load the model with 4-bit quantization to reduce memory usage
load_in_4bit = True # can be set to False if you have enough GPU memory

# Pre-quantized 4-bit models published by Unsloth (listed for reference; not used below)
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
]

# Load the pretrained model and tokenizer from the unsloth library
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2-1.5B-Instruct", # Specify the model name
    max_seq_length = max_seq_length, # Specify the maximum sequence length
    dtype = dtype, # Specify the data type
    load_in_4bit = load_in_4bit, # Specify whether to load the model in 4-bit quantization
)


    PyTorch 2.3.0+cu121 with CUDA 1201 (you have 2.2.1+cu121)
    Python  3.10.14 (you have 3.10.10)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


Unsloth unsuccessfully patched LoraLayer.update_layer. Please file a bug report.
Luckily, your training run will still work in the meantime!


==((====))==  Unsloth: Fast Qwen2 patching release 2024.5
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None, # And LoftQ
)

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2024.5 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
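
As a sanity check on the configuration above, the number of trainable LoRA parameters can be estimated by hand: an adapter on a layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The layer sizes below are assumptions taken from the published Qwen2-1.5B configuration (hidden size 1536, MLP size 8960, key/value projection width 256, 28 layers); they are not read from this notebook.

```python
r = 16               # LoRA rank used in get_peft_model
num_layers = 28      # assumed: number of decoder layers
hidden = 1536        # assumed: hidden size
kv_dim = 256         # assumed: 2 key/value heads * head dimension 128
intermediate = 8960  # assumed: MLP intermediate size

# (d_in, d_out) for each module listed in target_modules
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(d_in, d_out, rank):
    """LoRA adds matrix A (rank x d_in) and matrix B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in module_shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 18,464,768
```

This agrees with the `Number of trainable parameters = 18,464,768` line that the training banner prints later in the notebook.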


In [6]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("Abhaykoul/Ancient-Indian-Wisdom", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
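
To see what a single training example looks like after formatting, the template can be applied to one record by hand. The record and the `"<eos>"` placeholder below are made up for illustration; in the notebook the real token comes from `tokenizer.eos_token`.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token

# Hypothetical record with the same three fields as the dataset
example = {
    "instruction": "Explain the meaning of this phrase.",
    "input": "Satyameva Jayate",
    "output": "Truth alone triumphs.",
}

text = alpaca_prompt.format(
    example["instruction"], example["input"], example["output"]
) + EOS_TOKEN
print(text)
```

The printed text ends with the response followed directly by the end-of-sequence token, which is what teaches the model where to stop generating.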

🌟 **PEFT (Parameter-Efficient Fine-Tuning) Explained:**

**Goal:** PEFT aims to streamline the training process for large language models (LLMs) by fine-tuning only a subset of parameters, making it more efficient.

**Benefits:**
- 🚀 Reduces computational cost and storage needs significantly.
- 💡 Enables training LLMs on everyday hardware.
- 📈 Achieves performance similar to fully fine-tuned models in many scenarios.

**How it Works:**
- 🛠️ Utilizes techniques like:
    - **LoRA (Low-Rank Adaptation):** Adds small trainable low-rank matrices alongside selected weight matrices. The base model's parameters stay frozen, and only the adapter learns the task.
    - **Soft Prompting:** Prepends a small set of learnable embedding vectors to the input sequence. Only these vectors are trained, so they guide the frozen model towards the desired outcomes without changing its weights.
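
The LoRA idea can be sketched with plain Python lists. This is a minimal illustration, not PEFT's implementation: the helper names are made up, and it assumes the standard LoRA formulation in which the output of the frozen weight `W` is combined with a scaled low-rank update, `y = W x + (alpha / r) * B (A x)`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank adapter output."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scaling = alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0, 1.0]]              # trainable, shape r x d_in with r = 1
B_init = [[0.0], [0.0]]       # trainable, shape d_out x r, zero at the start
x = [1.0, 2.0]

print(lora_forward(W, A, B_init, x, alpha=2, r=1))     # [1.0, 2.0]

B_trained = [[1.0], [2.0]]    # pretend values after some training
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))  # [7.0, 14.0]
```

With `B` at zero the adapter changes nothing, so training starts from the base model's behavior; only `A` and `B` are updated afterwards.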

**Applications:**
- 🌐 Fine-tuning LLMs for tasks like text classification, question answering, and more.
- 🌟 Democratizing LLM training by making it accessible to users with modest hardware resources.

🔗 **Resources:**
- PEFT Library: [PEFT Documentation](https://huggingface.co/docs/peft/en/index)
- GitHub Repository: [PEFT GitHub](https://github.com/huggingface/peft)

---

🔍 **SFT (Supervised Fine-Tuning) Overview:**

**Process:**
1. Pre-train an LLM on a vast general-purpose dataset.
2. Fine-tune this pre-trained model on a smaller task-specific dataset to enhance its performance on that particular task.

**Challenges:**
- 💻 Computationally intensive and time-consuming when every parameter is updated.
- 📊 Demands significant hardware resources.

**Relation to PEFT:**
- 🔗 SFT forms the basis of fine-tuning, which PEFT seeks to optimize.
- 🔄 PEFT techniques can be integrated into an SFT framework for efficient results with reduced computational costs.

🔗 **Resources:**
- Supervised Fine-tuning Trainer: [SFT Documentation](https://huggingface.co/docs/transformers/en/training)

🚀 **In Summary:**

PEFT offers an efficient alternative to full-parameter fine-tuning. Applied within SFT, it reduces training time and resource requirements while maintaining comparable performance in many scenarios. This approach broadens access to LLM training and enhances efficiency across various tasks.

In [7]:
# Here we use supervised fine-tuning (SFT)
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, # number of processes used to preprocess the dataset
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # per device training batch size
        gradient_accumulation_steps = 4, # effective batch size = 2 * 4 = 8
        warmup_steps = 5,
        max_steps = 200,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs
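
The `learning_rate`, `warmup_steps`, `max_steps`, and `lr_scheduler_type = "linear"` settings together determine the learning rate at every step. The function below is a hand-written sketch of linear warmup followed by linear decay, which is how this kind of schedule is commonly defined; it is an illustration under that assumption, not the library's own code.

```python
def linear_schedule_lr(step, peak_lr=2e-4, warmup_steps=5, total_steps=200):
    """Learning rate at a given optimizer step: ramp up, then decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 100, 200):
    print(step, linear_schedule_lr(step))
```

The rate climbs from zero to its peak over the first 5 steps and then falls linearly, reaching zero at step 200.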


In [8]:
trainer.train() # starts training

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 616 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 200
 "-____-"     Number of trainable parameters = 18,464,768


Step,Training Loss
1,1.8516
2,2.194
3,1.8623
4,2.1174
5,1.5413
6,1.715
7,1.5648
8,1.4438
9,1.5207
10,1.3919


TrainOutput(global_step=200, training_loss=0.8591467666625977, metrics={'train_runtime': 253.0825, 'train_samples_per_second': 6.322, 'train_steps_per_second': 0.79, 'total_flos': 4685431193217024.0, 'train_loss': 0.8591467666625977, 'epoch': 2.5974025974025974})
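
The figures in the log can be reproduced with a little arithmetic. This sketch assumes one optimizer step per effective batch and a final partial batch that still counts as a step.

```python
import math

num_examples = 616    # from the training banner
per_device_batch = 2  # per_device_train_batch_size
grad_accum = 4        # gradient_accumulation_steps
num_gpus = 1
max_steps = 200

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs_covered = max_steps / steps_per_epoch

print(effective_batch, steps_per_epoch, round(epochs_covered, 4))  # 8 77 2.5974
```

The result matches the `'epoch': 2.597...` value in `TrainOutput`, and the banner's `Num Epochs = 3` is that figure rounded up.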

In [9]:
trainer.save_model("Wise-Qwen") # saving our new adapter model

**Merging base model with adapter**

In [10]:
# This script merges the base model with the trained LoRA adapter

import torch  # PyTorch, used here for the float16 dtype
from peft import PeftModel  # Loads the saved LoRA adapter on top of the base model
from transformers import AutoModelForCausalLM, AutoTokenizer
# Import only the helper used below; a wildcard import from trl can fail on a broken lazy submodule
from trl import setup_chat_format

# Model
base_model = "unsloth/Qwen2-1.5B-Instruct" # full base model name
new_model = "Wise-Qwen" # just the model name; this is the folder that contains your adapter model

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model)
fp16_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    offload_buffers=True
)
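# Adds ChatML special tokens and resizes the embeddings to match.
# Depending on the TRL version, this may raise an error if the tokenizer already defines a chat template.
fp16_model, tokenizer = setup_chat_format(fp16_model, tokenizer)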

# Merge adapter with base model
model = PeftModel.from_pretrained(fp16_model, new_model)
model = model.merge_and_unload()
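
Conceptually, merging folds the adapter into the base weights so that no extra matrices are needed at inference time. The sketch below shows the arithmetic on tiny made-up matrices, assuming the standard LoRA update `W_merged = W + (alpha / r) * B @ A`; it illustrates the idea and is not PEFT's actual code.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def merge_lora(W, A, B, alpha, r):
    """Return a single weight matrix that includes the scaled adapter update."""
    delta = matmul(B, A)
    scaling = alpha / r
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, d_out x d_in
A = [[1.0, 1.0]]              # adapter matrix, r x d_in with r = 1
B = [[1.0], [2.0]]            # adapter matrix, d_out x r

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[3.0, 2.0], [4.0, 5.0]]
```

After merging, the model has the same shape as the original base model, which is why it can be pushed to the Hub and used without the PEFT library.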

In [None]:
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

# Quantization to GGUF

GGUF is a binary file format for storing models for inference with GGML-based runtimes such as llama.cpp, and it is widely used for large language models (LLMs) like GPT. In simpler terms, it's a way to save and use these models efficiently.
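
To make the file format a little more concrete, the sketch below parses the fixed-size start of a GGUF header. It assumes the layout described in the GGUF specification for versions 2 and 3 (a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). The function name is made up, and the bytes are synthetic instead of being read from a real model file.

```python
import struct

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def parse_gguf_header(data):
    """Parse the first 24 bytes of a GGUF file into a small dictionary."""
    if len(data) < HEADER_SIZE:
        raise ValueError("data is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration: version 3, 2 tensors, 5 metadata entries
fake_header = struct.pack(HEADER_FORMAT, b"GGUF", 3, 2, 5)
print(parse_gguf_header(fake_header))
```

For a real file, the same function could be given the first 24 bytes read with `open(path, "rb")`.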

🤖 **Quantization in AI**

Quantization in AI is a technique that **reduces the precision or bit-width of numerical data** in a neural network model. It allows the model to run well on devices with **limited computational resources**, such as mobile phones or embedded systems.

During quantization, the model's **floating-point values** are transformed into **fixed-point or integer values** with a **reduced number of bits**. This clever compression technique helps to **save memory and computational power**, making the model more efficient when deployed on hardware with lower precision capabilities.

However, it's worth noting that quantization comes with a trade-off between **model accuracy and efficiency**. When we reduce the precision, there's a chance of losing some valuable information, which can lead to a decline in the model's performance. To combat this, experts employ **optimization and calibration techniques** to minimize the loss and maintain an acceptable level of accuracy.
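
The trade-off can be made visible by measuring the worst-case rounding error at different bit widths. This is a small sketch that assumes a simple symmetric scheme with one scale for all values and a made-up set of evenly spaced inputs; real quantizers use calibrated, per-block scales.

```python
def max_quantization_error(values, bits):
    """Largest absolute error after a quantize and dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1  # largest integer code, e.g. 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax
    worst = 0.0
    for v in values:
        code = max(-qmax, min(qmax, round(v / scale)))
        worst = max(worst, abs(v - code * scale))
    return worst


values = [i / 50 - 1 for i in range(101)]  # evenly spaced in [-1, 1]
errors = {bits: max_quantization_error(values, bits) for bits in (8, 4, 2)}

for bits, err in errors.items():
    print(f"{bits}-bit: max error = {err:.4f}")
```

Each reduction in bit width leaves fewer representable levels, so the worst-case error grows as the bit width shrinks.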

Overall, quantization plays a vital role in the AI world by enabling the deployment of neural network models on **resource-constrained devices**. By finding the perfect balance between efficiency and accuracy, we can make the most of the available hardware resources while still achieving satisfactory performance levels.


## Why is AI Quantized? 🤔

AI models are often quantized to achieve various benefits, such as:

1. **Model Size Reduction:** Quantization techniques help reduce the size of AI models by representing the weights and activations with lower precision data types. 📉 This is particularly useful when deploying models on resource-constrained devices with limited storage capacity. 💾

2. **Inference Speed Improvement:** Quantized models can perform computations faster due to the reduced memory bandwidth requirements and optimized hardware instructions for low-precision operations. ⚡️ This enables real-time or near-real-time inference, making AI applications more efficient and responsive. 🚀

3. **Energy Efficiency:** By reducing the precision of AI models, quantization reduces the computational workload, resulting in lower power consumption. 🔋 This is especially important for battery-powered devices or scenarios where energy efficiency is a priority. 💡

4. **Deployment Flexibility:** Quantized models can be deployed on a wide range of platforms, including edge devices, embedded systems, and IoT devices. 🌐 The smaller model size and improved performance make it easier to integrate AI capabilities into various applications. 📱

It's important to note that quantization involves a trade-off between model performance and resource efficiency. ⚖️ While quantized models offer benefits in terms of size and speed, they may experience a slight decrease in accuracy compared to their full-precision counterparts. 🔍 However, advancements in quantization techniques have significantly minimized this accuracy gap, making it a valuable optimization strategy for AI models.

By quantizing AI models, we can unlock their potential to run efficiently on diverse hardware and enable widespread deployment of AI applications across different domains. 🌟


🚀 **AI Quantization in GGUF Format**

ℹ️ **Quantizing AI models is crucial for optimizing performance and memory usage.**

If you wish to quantize your AI models in the GGUF format, head over to the following Google Colab notebook:
[Quantization in GGUF Format Colab](https://colab.research.google.com/drive/1zmrF7Jhe_q4fNLupSWyt1bX0mqilE8sa#scrollTo=fD24jJxq7t3k)

🔧 **Explore the notebook to leverage the benefits of GGUF quantization for enhanced AI model efficiency!**