<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Finetune_Gemma_NRE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install needed packages**


In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Install Flash Attention 2 for softcapping support
import torch
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install --no-deps packaging ninja einops "flash-attn>=2.6.3"

### **Loading a Pre-trained Language Model with Custom Configuration for Efficient Memory Usage**

In [None]:
# Import necessary libraries
from unsloth import FastLanguageModel  # FastLanguageModel provides easy loading of pre-trained models.
import torch  # PyTorch library for tensor computations and deep learning

# Configuration settings for the model
max_seq_length = 2048  # Define the maximum sequence length (2048 tokens in this case).
# The model can handle larger sequence lengths, and RoPE (Rotary Positional Embedding) scaling will be applied internally.

dtype = None  # The data type of the model's parameters. None means auto-detection.
# Use float16 on a Tesla T4 or V100, which do not support bfloat16.
# On Ampere or newer GPUs (such as the A100), bfloat16 is usually the better option.

load_in_4bit = True  # If True, the model weights are loaded with 4-bit quantization, which reduces memory usage.
# This is useful for low-memory environments or large models. Set to False to load the weights in 16-bit precision instead.

# Load the pre-trained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b",  # Specify the name of the pre-trained model to load (in this case, "gemma-2-9b").
    max_seq_length = max_seq_length,  # Pass the defined maximum sequence length.
    dtype = dtype,  # Pass the dtype configuration (None for auto-detection).
    load_in_4bit = load_in_4bit,  # Pass the flag for using 4-bit quantization.
    # token = "hf_...",  # Uncomment and provide a token if using gated models like meta-llama (for example, Llama-2-7b-hf).
)
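
To get a feel for why `load_in_4bit` matters, the following back-of-the-envelope sketch estimates the size of the weights alone at different precisions. It assumes roughly 9 billion parameters (read off the "9b" in the model name) and ignores quantization metadata, layers kept in higher precision, activations, and optimizer state, so treat the numbers as rough lower bounds rather than measured values.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: ~9e9 parameters, inferred from the "9b" in the model name.
NUM_PARAMS = 9e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "4-bit": 0.5,
}

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>18}: ~{weight_memory_gb(NUM_PARAMS, nbytes):.1f} GB")
```

Actual VRAM usage during training will be higher, since gradients, optimizer state, and activations also need memory.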



### **We now add LoRA adapters so we only need to update 1 to 10% of all parameters!**

In [None]:
# Apply PEFT (Parameter Efficient Fine-Tuning) model to the pre-trained model
model = FastLanguageModel.get_peft_model(
    model,  # The pre-trained model to apply PEFT to.

    r = 16,  # Rank of the low-rank adaptation. Choose any number > 0.
    # Common suggested values include 8, 16, 32, 64, or 128, depending on memory and performance needs.

    # Specify the model layers to apply PEFT to. These are typically projection layers in transformer models.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections: query, key, value, output
                      "gate_proj", "up_proj", "down_proj"],  # MLP (feed-forward) projections: gate, up, down

    lora_alpha = 16,  # LoRA scaling factor. The adapter update is scaled by lora_alpha / r (here 16 / 16 = 1), so higher values give the adapter more weight.

    lora_dropout = 0,  # Dropout rate applied to the low-rank adaptation. A value of 0 means no dropout, which is optimized for performance.

    bias = "none",  # Which bias terms to train. "none" trains no bias terms, which is the optimized setting.

    # Gradient checkpointing configuration for memory efficiency.
    # "unsloth" is optimized for very long context and reduces VRAM usage by 30%, allowing larger batch sizes.
    use_gradient_checkpointing = "unsloth",  # Set to True or "unsloth" for long-context memory optimization.

    random_state = 3407,  # Random seed to ensure reproducibility of results.

    use_rslora = False,  # Option to use Rank Stabilized LoRA (a variant of LoRA). Set to False if not using it.

    loftq_config = None,  # LoftQ configuration option. Set to None unless using LoftQ (a specific quantization method).
)
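
The effect of `r` and `lora_alpha` can be made concrete with a little arithmetic. For a linear layer with weight shape `(d_out, d_in)`, standard LoRA adds two small matrices of shapes `(r, d_in)` and `(d_out, r)`, and scales their product by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA). The sketch below uses made-up layer dimensions for illustration; they are not the real shapes of the model loaded above.

```python
import math

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# ASSUMPTION: illustrative dimensions only, not taken from the actual model config.
d_in, d_out, r = 4096, 4096, 16

full_params = d_in * d_out
added_params = lora_param_count(d_in, d_out, r)

print(f"Full layer parameters: {full_params:,}")
print(f"LoRA parameters added: {added_params:,}")
print(f"Share of the layer:    {100 * added_params / full_params:.2f}%")
print(f"Scale with alpha=16:   {lora_scale(16, r)}")
```

The share of trainable parameters for a whole model depends on the rank, on which modules are targeted, and on the real layer shapes.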



<a name="Data"></a>
### Data Preparation
We are using the **Named Entity Recognition (NER)** dataset from [SURESHBEEKHANI](https://huggingface.co/datasets/SURESHBEEKHANI/Named_entity_recognition). This dataset is ideal for training on named entity recognition tasks, where the goal is to identify entities such as person names, locations, and organizations within the text.

You can replace the dataset loading section with your own data preparation steps, depending on your specific use case.

**[NOTE]** To train only on the output (ignoring any extra context like the user's input), please refer to [TRL's documentation](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).
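
The idea behind training only on the output can be sketched without any library: the prompt tokens stay in the model input, but their labels are replaced with `-100`, the value that PyTorch's cross-entropy loss ignores by default. This is a conceptual illustration with hypothetical token IDs and a hypothetical helper name, not TRL's actual implementation.

```python
IGNORE_INDEX = -100  # Label value skipped by PyTorch's cross-entropy loss by default.

def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels and hide the first prompt_length positions from the loss."""
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Hypothetical token IDs: 4 prompt tokens, 2 response tokens, 1 EOS token.
token_ids = [11, 12, 13, 14, 21, 22, 1]
labels = mask_prompt_labels(token_ids, prompt_length=4)
print(labels)
```

Only the response and EOS positions keep real labels, so only they contribute to the loss.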

**[NOTE]** Remember to append the **EOS_TOKEN** to the end of each formatted training example. Without it, the model never learns when to stop and may keep generating until it hits the length limit.

For training on conversational datasets, you may want to use the `llama-3` template. We have prepared a conversational notebook that you can find [here](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

If your goal is to work with text completions (such as for creative writing), consider using this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).


In [None]:
# Define the prompt template for Alpaca-based task instruction
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
# The alpaca_prompt template includes placeholders for instruction, input, and output. These will be filled dynamically during processing.

# EOS_TOKEN (End of Sequence Token) to signal the end of a generated sequence. It's necessary to prevent infinite generation.
EOS_TOKEN = tokenizer.eos_token  # Retrieve the EOS token from the tokenizer.

# Define a function to format the dataset examples into the Alpaca prompt format
def formatting_prompts_func(examples):
    instructions = examples["instruction"]  # Extract instructions from the dataset.
    inputs = examples["input"]  # Extract input data from the dataset.
    outputs = examples["output"]  # Extract the expected output from the dataset.

    texts = []  # Initialize a list to store the formatted prompt texts.

    # Loop through each instruction, input, and output in parallel.
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Format the alpaca_prompt with the current instruction, input, and output.
        # EOS_TOKEN is appended so the model learns where the response ends.
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)  # Append the formatted text to the list.

    # Return the formatted texts in a dictionary with the key "text".
    return { "text" : texts, }

# Load a dataset from Hugging Face's datasets library
from datasets import load_dataset
dataset = load_dataset("SURESHBEEKHANI/Named_entity_recognition", split="train")
# This loads the "train" split of the "Named_entity_recognition" dataset by the user "SURESHBEEKHANI".

# Apply the formatting function to the dataset using the `map` method.
# This will apply the `formatting_prompts_func` to each example in the dataset in a batched manner.
dataset = dataset.map(formatting_prompts_func, batched=True)
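
To see what one formatted training example looks like without downloading anything, here is a small self-contained sketch of the same batched formatting step. The shortened template, the `<eos>` string, and the sample record are made-up stand-ins; in the notebook, the template is `alpaca_prompt` and the EOS string comes from `tokenizer.eos_token`.

```python
# Shortened stand-in for the Alpaca template used in the notebook.
TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<eos>"  # ASSUMPTION: placeholder; the real value comes from tokenizer.eos_token.

def format_batch(examples):
    """Turn a batch (a dict of equal-length lists) into a dict with one 'text' list."""
    texts = [
        TEMPLATE.format(instruction, input_text, output) + EOS
        for instruction, input_text, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# Made-up record in the same column layout the formatting function expects.
batch = {
    "instruction": ["Extract the named entities."],
    "input": ["Ada Lovelace was born in London."],
    "output": ["PERSON: Ada Lovelace; LOCATION: London"],
}

formatted = format_batch(batch)
print(formatted["text"][0])
```

With `batched=True`, `map` passes a dict of lists like `batch` and expects a dict of lists back, which is why the function returns `{"text": texts}`.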

<a name="Train"></a>
### Train the Model
Next, let's train the model using Hugging Face TRL's `SFTTrainer`! For more details, check out the [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).

To keep this example fast, we train for only 100 steps. For a full training run, set `num_train_epochs=1` and remove `max_steps` (or leave it at its default of `-1`) so that training length is determined by the number of epochs instead.

We also support TRL's `DPOTrainer`, which you can explore if you need preference-based fine-tuning.
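
As a quick sanity check on what a 100-step run covers, the sketch below works out the effective batch size from the settings used in the training cell and how many examples the run sees. It assumes a single GPU, and the dataset size in the last part is a hypothetical number chosen only to show the arithmetic.

```python
import math

# Values taken from the training configuration in this notebook.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 100

num_devices = 1  # ASSUMPTION: training on a single GPU.

# Examples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Total examples processed over the whole run.
examples_seen = effective_batch_size * max_steps

def steps_for_one_epoch(num_examples: int, batch_size: int) -> int:
    """Approximate number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")

# HYPOTHETICAL dataset size, for illustration only.
print(f"Steps for one epoch over 10,000 examples: {steps_for_one_epoch(10_000, effective_batch_size)}")
```

If the dataset is much larger than the number of examples seen, a 100-step run only samples part of it.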


In [None]:
# Import necessary libraries
from trl import SFTTrainer  # SFTTrainer runs supervised fine-tuning (SFT) on text datasets.
from transformers import TrainingArguments  # TrainingArguments contains configuration options for model training.
from unsloth import is_bfloat16_supported  # Function to check if the hardware supports bfloat16 precision.

# Initialize the SFTTrainer with the necessary arguments
trainer = SFTTrainer(
    model = model,  # The pre-trained model to be fine-tuned.
    tokenizer = tokenizer,  # The tokenizer used to process input data for the model.
    train_dataset = dataset,  # The dataset to train on, assumed to be preprocessed and formatted.
    dataset_text_field = "text",  # Specifies which field in the dataset contains the input text. Here, it's the "text" field.
    max_seq_length = max_seq_length,  # The maximum length of input sequences for the model. Ensures sequences are not longer than this value.
    dataset_num_proc = 2,  # Number of CPU processes used to preprocess (tokenize) the dataset.
    packing = False,  # If True, several short examples are concatenated into one sequence of up to max_seq_length tokens, which can speed up training on short sequences.

    # Define the training configuration through TrainingArguments
    args = TrainingArguments(
        per_device_train_batch_size = 2,  # Batch size per device during training.
        gradient_accumulation_steps = 4,  # Number of steps to accumulate gradients before performing an update (helps with memory efficiency).
        warmup_steps = 5,  # Number of initial training steps over which the learning rate ramps up from 0 to learning_rate.
        max_steps = 100,  # The total number of training steps. Once reached, training will stop.
        learning_rate = 2e-4,  # Learning rate for the optimizer.

        # fp16 (16-bit floating point) and bf16 (bfloat16) are precision modes used to speed up training and reduce memory usage.
        fp16 = not is_bfloat16_supported(),  # Use fp16 if bfloat16 is not supported by the hardware.
        bf16 = is_bfloat16_supported(),  # Use bf16 if supported by the hardware (typically for Ampere GPUs or newer).

        logging_steps = 1,  # Log training progress every step. Setting this to a higher value reduces logging frequency.
        optim = "adamw_8bit",  # Specifies the optimizer used during training (AdamW with 8-bit precision for memory efficiency).
        weight_decay = 0.01,  # Regularization parameter to prevent overfitting by penalizing large weights.
        lr_scheduler_type = "linear",  # Learning rate scheduler type. "linear" gradually decays the learning rate during training.
        seed = 3407,  # Random seed for reproducibility of results.
        output_dir = "outputs",  # Directory to save model checkpoints and logs during training.
        report_to = "none",  # Specifies where to report metrics (e.g., "none" means no reporting, or use "wandb" for Weights & Biases).
    ),
)
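
The interaction between `warmup_steps`, `max_steps`, `learning_rate`, and the `"linear"` scheduler can be sketched in plain Python. The function below is a simplified re-implementation written for illustration (the name `linear_schedule_lr` is hypothetical): the learning rate rises linearly from 0 during warmup, then decays linearly to 0 at the final step. The real scheduler lives inside the trainer and may differ in small details.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=100):
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at max_steps.
    remaining = max_steps - step
    return base_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))

# Print the learning rate at a few points in the run.
for step in (0, 1, 5, 50, 100):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step):.2e}")
```

With only 5 warmup steps out of 100, the run spends almost all of its time in the decay phase.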
