## 1. Importing Required LibrariesðŸ“¥

In [None]:
import sys
import site
import os
import psutil
import pandas as pd
import torch
import intel_extension_for_pytorch as ipex
import warnings
warnings.filterwarnings("ignore")

num_physical_cores = psutil.cpu_count(logical=False)
num_cores_per_socket = num_physical_cores // 2

print("Number of physical cores: ", num_physical_cores)
print("Number of cores per socket: ", num_cores_per_socket)

## 2. Device SetupðŸ”§ðŸ’»

In [2]:
def get_device() -> torch.device:
    """Check and return the appropriate device (XPU, CUDA, or CPU)."""
    if torch.cuda.is_available():
        device_type = "cuda"
        device = torch.device(device_type)
        print(f"Using CUDA device: {torch.cuda.get_device_name(0)}")
    elif torch.xpu.is_available():
        device_type = "xpu"
        device = torch.device(device_type)
        torch.xpu.empty_cache()  # Empty the XPU cache if using XPU
        print(f"Using device: {torch.xpu.get_device_name()}")
    else:
        device_type = "cpu"
        device = torch.device(device_type)
        print("Using CPU")
        
    return device

- **CUDA Availability Check:** The function first checks if a CUDA-capable GPU is available using torch.cuda.is_available(). If CUDA is available, it selects the GPU as the device and prints the name of the GPU.

- **XPU Availability Check:** If CUDA is not available, the function checks if an XPU (Accelerator) is available using torch.xpu.is_available(). If XPU is available, it selects the XPU device and empties the XPU cache to ensure no previous memory is used.

- **Fallback to CPU:** If neither CUDA nor XPU is available, the function defaults to using the CPU as the device and prints "Using CPU".

## 3. Setting up LoraConfig

In [3]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    # could use q, v and 0 projections as well and comment out the rest
    target_modules=["q_proj", "o_proj", 
                    "v_proj", "k_proj", 
                    "gate_proj", "up_proj",
                    "down_proj"],
    task_type="CAUSAL_LM")

### Explanation of `LoraConfig` Parameters

-  **r=32**
   - **What it is**: The rank (`r`) represents how many dimensions the low-rank adaptation uses. It controls the size of the updates to the modelâ€™s parameters.
   - **Why it matters**: A higher rank allows the model to adapt more flexibly, but it also uses more memory and processing power. In this case, `r=32` means the model can use a moderate amount of flexibility in adapting to new tasks.

-  **lora_alpha=16**
   - **What it is**: This is a scaling factor for the low-rank updates.
   - **Why it matters**: It controls how strongly the low-rank updates affect the modelâ€™s parameters. A higher value (like `16`) means the updates will have a bigger impact on the model, making the adaptation stronger.

-  **lora_dropout=0.1**
   - **What it is**: Dropout is a technique used during training to randomly ignore some parts of the model to prevent overfitting.
   - **Why it matters**: Here, `lora_dropout=0.1` means that during training, 10% of the low-rank connections will be randomly turned off. This helps the model generalize better to new data.

-  **bias="none"**
   - **What it is**: This setting controls whether bias terms are added to the low-rank updates.
   - **Why it matters**: Setting `bias="none"` means no additional bias is added, simplifying the model. This focuses the adaptation on low-rank updates, making it computationally lighter.

-  **target_modules=["q_proj", "o_proj", "v_proj", "k_proj", "gate_proj", "up_proj", "down_proj"]**
   - **What it is**: This is a list of specific layers in the model where the low-rank updates will be applied.
   - **Why it matters**: These layers are parts of the attention mechanism and other model components. By applying low-rank updates to only certain layers (like `q_proj`, `v_proj`, etc.), we can fine-tune the model more efficiently without touching everything. Each of these components has a specific role in processing data:
     - **q_proj**: Handles the "query" in attention, which helps the model look at previous words to predict the next one.
     - **o_proj**: Manages the output after attention.
     - **v_proj**: Deals with the "values" in attention, which represent the information the model is focusing on.
     - **k_proj**: Handles the "keys" in attention, which help match queries with values.
     - **gate_proj**: Often controls or adjusts the flow of information in the model.
     - **up_proj**: Used to expand the feature size in some parts of the model.
     - **down_proj**: Used to reduce the feature size in other parts of the model.

-  **task_type="CAUSAL_LM"**
   - **What it is**: This defines the type of task the model is being adapted for.
   - **Why it matters**: Setting this to `"CAUSAL_LM"` means the model is being trained for causal language modeling. In this task, the model predicts the next word based only on the words before it (not the ones after). This is common in models like GPT that generate text one word at a time.

### Summary:
The `LoraConfig` allows you to fine-tune a pre-trained model by applying low-rank adaptations to specific layers, which helps the model learn more efficiently for new tasks. You can control the strength of these updates, how much the model "ignores" during training to prevent overfitting, and which parts of the model to focus on for adapting to the task (like predicting the next word in a sequence).


## 4. Model Initialization and Optimization for Fine-Tuning

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

#Model ID
model_id = "Qwen/Qwen2.5-0.5B-Instruct"

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Set padding side to the right to ensure proper attention masking during fine-tuning
tokenizer.padding_side = "right"

# Load the model and move it to the appropriate device
device = get_device()
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Disable caching mechanism to reduce memory usage during fine-tuning
model.config.use_cache = False

# Configure the model's pre-training tensor parallelism degree to match the fine-tuning setup
model.config.pretraining_tp = 1

# Enable gradient checkpointing to save memory during backpropagation
model.gradient_checkpointing_enable()

# Enable mixed precision for reduced memory usage and faster computation
model.fp16 = True  

### Explanation of Key Configurations

- **Setting the Padding Token**: 
  - The `pad_token` is set to the `eos_token` (End of Sequence token). This is done to handle padding in the input sequences during training, ensuring the tokenizer uses the EOS token as padding.

- **Padding Side**: 
  - Sets the padding side of the tokenizer to `"right"`. This ensures that when padding is added to sequences, it happens at the end (right side), which is important for attention masking during fine-tuning.

- **Disabling Cache**: 
  - This setting is turned off to save memory during fine-tuning. By default, caching speeds up inference, but it uses additional memory. Disabling it reduces memory usage during training.

- **Tensor Parallelism**: 
  - This sets the degree of tensor parallelism used during pre-training. A value of `1` means no parallelism (default setup). If you are using multiple GPUs for pre-training, you can adjust this value to split the tensor computation across GPUs.

- **Gradient Checkpointing**: 
  - This feature is enabled to reduce memory usage during backpropagation (the process of updating model weights). It saves memory by not storing intermediate activations and recomputing them during backpropagation, which is especially useful for large models.

- **Mixed Precision (FP16)**: 
  - Enables mixed precision training, which uses 16-bit floating point numbers (half precision) instead of the default 32-bit floating point (single precision). This reduces memory usage and speeds up computation, especially when fine-tuning large models on GPUs.


## 5.Testing the Base Model

Let's get answers for the questions from the base model(i.e., not finetuned model)

In [None]:
def generate_response(model, prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)    
    outputs = model.generate(input_ids, max_new_tokens=250,
                             eos_token_id=tokenizer.eos_token_id)    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def test_model(model, test_inputs):
    """quickly test the model using queries."""
    for input_text in test_inputs:
        print("__"*25)
        generated_response = generate_response(model, input_text)
        print(f"{input_text}")
        print(f"Generated Answer: {generated_response}\n")
        print("__"*25)

test_inputs = [
    "Who are the authors of the paper-Attention all you need?",
    "List out the formulas given the paper.",
    "What is llama?",
    "What are the different types of finetuning techniques?",
    "What is Gen-AI?",
    "What are the difference between Machine Learning algorithms and AI algorithms?"
]

print("Testing the model before fine-tuning:")
test_model(model, test_inputs)

## Dataset Loading, Filtering, and Preprocessing

This script demonstrates the process of loading, filtering, and preprocessing a dataset using the Hugging Face `datasets` library. Specifically, it works with the `databricks/databricks-dolly-15k` dataset, which is a collection of instruction-response pairs.

In [None]:
from datasets import load_dataset

# Load the dataset
dataset_name = "databricks/databricks-dolly-15k"
raw_data = load_dataset(dataset_name, split="train")

# Display a summary of the raw dataset
print(raw_data)

In [None]:
# Display a sample instruction and response
print(f"Instruction: {raw_data[0]['instruction']}")
print(f"Response: {raw_data[0]['response']}")

# Filter dataset to keep only question-answer categories
qa_categories = {"close_qa", "open_qa", "general_qa"}
qa_data = raw_data.filter(lambda example: example['category'] in qa_categories)

# Display filtered dataset information
print(f"Filtered dataset contains {len(qa_data)} examples.")
print(f"Categories in filtered dataset: {qa_data['category'][:10]}")

# Remove unnecessary fields for a cleaner dataset
cleaned_data = qa_data.remove_columns(["context", "category"])

# Display the final dataset information
print(f"Final dataset contains {len(cleaned_data)} examples.")
print(f"Fields in final dataset: {list(cleaned_data.features.keys())}")



### Steps:

1. **Loading the Dataset**:
   The dataset `databricks/databricks-dolly-15k` is loaded from the Hugging Face hub using the `load_dataset` function. The script loads the "train" split of the dataset into a variable called `dataset`.

2. **Displaying Sample Data**:
   The script prints the first instruction and its corresponding response from the dataset to give a quick preview of the data.

3. **Filtering the Dataset**:
   The dataset is filtered to keep only examples that belong to the following categories:
   - `"close_qa"`
   - `"open_qa"`
   - `"general_qa"`

   The filter function iterates through the dataset and includes only those examples whose 'category' field matches one of the categories above.

4. **Displaying Filtered Data Information**:
   The script prints out the number of examples that remain after filtering and displays the first few categories present in the filtered dataset.

5. **Removing Unwanted Fields**:
   The columns `"context"` and `"category"` are removed from the dataset to simplify it for further processing. After this step, only the relevant fields, such as `instruction` and `response`, remain.

6. **Final Dataset Information**:
   The script then prints the final number of examples and the names of the remaining fields in the dataset after the cleanup process.


## Dataset Formatting and Splitting

In [None]:
def format_prompts(batch):
    formatted_prompts = []
    for instruction, response in zip(batch["instruction"], batch["response"]):
        prompt = f"Instruction:\n{instruction}\n\nResponse:\n{response}"
        formatted_prompts.append(prompt)
    return {"text": formatted_prompts}

# Apply the formatting function to the dataset in a batched manner
dataset = cleaned_data.map(format_prompts, batched=True)

# Split the dataset into train and validation sets with a 80-20 split
train_dataset, validation_dataset = dataset.train_test_split(test_size=0.2, seed=99).values()

## Fine-Tuning a Model with SFTTrainer and LoRA

This script fine-tunes a model using the `SFTTrainer` from the `trl` library, with the application of the LoRA technique for efficient model adaptation. It sets up the training process, customizes various training arguments, and includes optimizations for memory and speed. 

In [11]:
import transformers
import warnings
from transformers import logging as transformers_logging
warnings.filterwarnings("ignore")
transformers_logging.set_verbosity_error()
 
from trl import SFTTrainer
 
finetuned_model_id = "qwen-0.5B-qa"

# Calculate max_steps based on the subset size
num_train_samples = len(train_dataset)
batch_size = 2
gradient_accumulation_steps = 8
steps_per_epoch = num_train_samples // (batch_size * gradient_accumulation_steps)
num_epochs = 5
max_steps = steps_per_epoch * num_epochs
print(f"Finetuning for max number of steps: {max_steps}")

def print_training_summary(results):
    print(f"Time: {results.metrics['train_runtime']: .2f}")
    print(f"Samples/second: {results.metrics['train_samples_per_second']: .2f}")

training_args = transformers.TrainingArguments(
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_ratio=0.05,
        max_steps=max_steps,
        learning_rate=1e-5,
        evaluation_strategy="steps",
        save_steps=500,
        bf16=True,
        logging_steps=100,
        output_dir=finetuned_model_id,
        use_ipex=True,
        max_grad_norm=0.6,
        weight_decay=0.01,
        group_by_length=True
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True
)

if device != "cpu":
    torch.xpu.empty_cache()
results = trainer.train()
print_training_summary(results)

# save lora model
tuned_lora_model = "Qwen2.5-0.5B-qa-lora"
trainer.model.save_pretrained(tuned_lora_model)

Finetuning for max number of steps: 1480


Generating train split: 951 examples [00:00, 1515.61 examples/s]
Generating train split: 259 examples [00:00, 1459.73 examples/s]


{'loss': 2.5948, 'grad_norm': 0.287109375, 'learning_rate': 9.815078236130868e-06, 'epoch': 1.680672268907563}
{'eval_loss': 2.4375014305114746, 'eval_runtime': 3.8475, 'eval_samples_per_second': 67.317, 'eval_steps_per_second': 8.577, 'epoch': 1.680672268907563}
{'loss': 2.479, 'grad_norm': 0.22265625, 'learning_rate': 9.103840682788051e-06, 'epoch': 3.361344537815126}
{'eval_loss': 2.377213478088379, 'eval_runtime': 3.9288, 'eval_samples_per_second': 65.924, 'eval_steps_per_second': 8.4, 'epoch': 3.361344537815126}
{'loss': 2.4481, 'grad_norm': 0.240234375, 'learning_rate': 8.392603129445237e-06, 'epoch': 5.042016806722689}
{'eval_loss': 2.361212730407715, 'eval_runtime': 3.9084, 'eval_samples_per_second': 66.267, 'eval_steps_per_second': 8.443, 'epoch': 5.042016806722689}
{'loss': 2.414, 'grad_norm': 0.291015625, 'learning_rate': 7.681365576102418e-06, 'epoch': 6.722689075630252}
{'eval_loss': 2.3490004539489746, 'eval_runtime': 3.9152, 'eval_samples_per_second': 66.152, 'eval_steps

### Key Parameters

### Training Parameters:
   - **`num_train_samples`**: The total number of training samples in the `train_dataset`.
   - **`batch_size`**: The batch size for training (set to 2).
   - **`gradient_accumulation_steps`**: The number of steps to accumulate gradients before updating the model (set to 8).
   - **`steps_per_epoch`**: Calculated based on the number of training samples, batch size, and gradient accumulation steps.
   - **`num_epochs`**: The number of epochs (set to 5).
   - **`max_steps`**: The total number of training steps is computed by multiplying `steps_per_epoch` with `num_epochs`.
   - The script prints out the calculated number of maximum steps for finetuning.

### Training Arguments:
   The `transformers.TrainingArguments` are set up with the following parameters:
   - **`per_device_train_batch_size`**: The batch size used per device (set to 2).
   - **`gradient_accumulation_steps`**: The number of gradient accumulation steps (set to 8).
   - **`warmup_ratio`**: Ratio of warmup steps (set to 0.05).
   - **`max_steps`**: The maximum number of training steps (calculated earlier).
   - **`learning_rate`**: The learning rate for training (set to `1e-5`).
   - **`evaluation_strategy`**: Defines how often evaluation should happen (set to "steps").
   - **`save_steps`**: Defines the frequency of saving the model (set to 500 steps).
   - **`bf16`**: Specifies usage of bfloat16 precision for better training speed and memory efficiency.
   - **`logging_steps`**: Defines the frequency of logging (set to 100 steps).
   - **`output_dir`**: Directory where the fine-tuned model will be saved.
   - **`use_ipex`**: Enables the Intel Extension for PyTorch to optimize model training on Intel hardware.
   - **`max_grad_norm`**: Gradient clipping (set to 0.6).
   - **`weight_decay`**: Weight decay for regularization (set to 0.01).
   - **`group_by_length`**: Optimizes batching by grouping examples with similar sequence lengths.

### Trainer Setup:
   The `SFTTrainer` is instantiated with the following parameters:
   - **`model`**: The model to be fine-tuned.
   - **`train_dataset`**: The training dataset.
   - **`eval_dataset`**: The validation dataset.
   - **`tokenizer`**: The tokenizer used for encoding the input.
   - **`args`**: The training arguments defined earlier.
   - **`peft_config`**: The LoRA configuration for efficient fine-tuning.
   - **`dataset_text_field`**: The name of the text field in the dataset (set to `"text"`).
   - **`max_seq_length`**: Maximum sequence length for input examples (set to 512).
   - **`packing`**: Whether to pack sequences to optimize training.

### Key Techniques:
- **LoRA**: Low-Rank Adaptation (LoRA) is used for efficient fine-tuning, enabling parameter-efficient training.
- **`SFTTrainer`**: A custom trainer for fine-tuning models with special support for sequence-to-sequence tasks and additional features like LoRA.


## Let's test our finetuned model

In [13]:
from peft import PeftModel

tuned_model = "qwen-0.5B-qa"
tuned_lora_model = "Qwen2.5-0.5B-qa-lora"

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.bfloat16,
)

model = PeftModel.from_pretrained(base_model, tuned_lora_model)
model = model.merge_and_unload()
# save final tuned model
model.save_pretrained(tuned_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Testing our finetuned model

In [None]:
test_inputs = [
    "Who are the authors of the paper-Attention all you need?",
    "List out the formulas given the paper.",
    "What is llama?",
    "What are the different types of finetuning techniques?",
    "What is Gen-AI?",
    "What are the difference between Machine Learning algorithms and AI algorithms?"
]
device = "xpu:0"

model = model.to(device)
for text in test_inputs:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=200, 
                             do_sample=False, top_k=100,temperature=0.1, 
                             eos_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    print("____"*25)