<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Finetune_Gemma_NRE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install needed packages**


In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Install Flash Attention 2 for softcapping support
import torch
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install --no-deps packaging ninja einops "flash-attn>=2.6.3"

### **Loading a Pre-trained Language Model with Custom Configuration for Efficient Memory Usage**

In [2]:
# Import necessary libraries
from unsloth import FastLanguageModel  # FastLanguageModel provides easy loading of pre-trained models.
import torch  # PyTorch library for tensor computations and deep learning

# Configuration settings for the model
max_seq_length = 2048  # Define the maximum sequence length (2048 tokens in this case).
# The model can handle larger sequence lengths, and RoPE (Rotary Positional Embedding) scaling will be applied internally.

dtype = None  # The data type of the model's parameters. None means auto-detection.
# On a Tesla T4 or V100, use torch.float16, since these GPUs do not support bfloat16.
# On Ampere or newer GPUs (such as the A100), torch.bfloat16 is usually the better option.

load_in_4bit = True  # If set to True, 4-bit quantization is applied to the model weights, reducing memory usage.
# This is useful for low-memory environments or when working with large models. Set to False to load unquantized weights.

# Load the pre-trained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b",  # Specify the name of the pre-trained model to load (in this case, "gemma-2-9b").
    max_seq_length = max_seq_length,  # Pass the defined maximum sequence length.
    dtype = dtype,  # Pass the dtype configuration (None for auto-detection).
    load_in_4bit = load_in_4bit,  # Pass the flag for using 4-bit quantization.
    # token = "hf_...",  # Uncomment and provide a token if using gated models like meta-llama (for example, Llama-2-7b-hf).
)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.7: Fast Gemma2 patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### **We now add LoRA adapters so we only need to update a small fraction of all parameters!**

In [3]:
# Configure the model with PEFT (Parameter-Efficient Fine-Tuning) settings using LoRA (Low-Rank Adaptation)
model = FastLanguageModel.get_peft_model(
    model,  # The base model to be fine-tuned using PEFT techniques

    # Low-Rank Adaptation (LoRA) rank
    r=16,  # Defines the rank of the low-rank matrices. Common choices: 8, 16, 32, 64, 128.
    # Larger values increase expressiveness but require more memory.

    # Modules to target for LoRA fine-tuning
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projection layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    # LoRA adapters are attached only to these modules; the base weights stay frozen.

    # LoRA-specific hyperparameters
    lora_alpha=16,  # Scaling factor: the LoRA update is multiplied by lora_alpha / r (here 16 / 16 = 1).
    lora_dropout=0,  # Dropout applied to the LoRA layers. Any value is supported, but 0 is the optimized setting in Unsloth.

    # Bias handling in fine-tuning
    bias="none",  # Which bias terms to train. "none" is the optimized setting in Unsloth. Alternatives: "all", "lora_only".

    # Optimizations for VRAM and context length
    use_gradient_checkpointing="unsloth",  # Use gradient checkpointing to save memory during training.
    # The "unsloth" setting reduces VRAM usage by ~30%, allowing larger batch sizes or longer contexts.

    # Random seed for reproducibility
    random_state=3407,  # Ensures the results are reproducible across runs.

    # Advanced fine-tuning features
    use_rslora=False,  # Set to True for rank-stabilized LoRA (rsLoRA), which scales by lora_alpha / sqrt(r) and helps stability at high ranks.
    loftq_config=None,  # Optional LoftQ (LoRA-Fine-Tuning-aware Quantization) initialization. None disables it.
)

Unsloth 2025.1.7 patched 42 layers with 42 QKV layers, 42 O layers and 42 MLP layers.
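
As a sanity check on how small the LoRA update is, the sketch below counts the adapter parameters by hand. Each targeted linear layer with shape `(in_features, out_features)` gains two low-rank matrices holding `r * (in_features + out_features)` values. The layer sizes are assumptions taken from the public Gemma 2 9B configuration, not values stated in this notebook; the result agrees with the trainable-parameter count that Unsloth prints when training starts.

```python
# Gemma 2 9B layer shapes. These are assumptions taken from the public model
# config, not values stated in this notebook.
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
Q_OUT = 16 * 256       # num_attention_heads * head_dim
KV_OUT = 8 * 256       # num_key_value_heads * head_dim
NUM_LAYERS = 42        # matches "patched 42 layers" in the log above
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) for each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to each frozen weight."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total:,} in total")
# prints "1,286,144 LoRA parameters per layer, 54,018,048 in total"
```

Doubling `RANK` doubles the adapter size, which is why larger ranks cost more memory.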


<a name="Data"></a>
### Data Preparation
We are using the **Named Entity Recognition (NER)** dataset from [SURESHBEEKHANI](https://huggingface.co/datasets/SURESHBEEKHANI/Named_entity_recognition). This dataset is ideal for training on named entity recognition tasks, where the goal is to identify entities such as person names, locations, and organizations within the text.

You can replace the dataset loading section with your own data preparation steps, depending on your specific use case.

**[NOTE]** To compute the loss only on the completions (ignoring the instruction and input tokens), please refer to [TRL's documentation](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** It's important to append the **EOS_TOKEN** to each formatted training example before tokenization. Without it, the model may not learn when to stop and can keep generating until it reaches the token limit. The EOS token marks the end of the text and helps control the length of the output.

For training on conversational datasets, you may want to use the `llama-3` template. We have prepared a conversational notebook that you can find [here](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

If your goal is to work with text completions (such as for creative writing), consider using this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
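
To make the EOS note concrete, here is a minimal, self-contained sketch of the formatting step on a toy batch. It needs no model or tokenizer: the `"<eos>"` string is a placeholder assumption (the notebook reads the real value from `tokenizer.eos_token`), and the toy rows are made up for illustration rather than taken from the dataset. Only the column names (`instructions`, `input`, `Output`) follow the dataset used here.

```python
# Alpaca-style template with three slots: instruction, input, response.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # Placeholder (assumption); use tokenizer.eos_token in practice.


def format_batch(examples, eos_token=EOS_TOKEN):
    """Turn a batch of columns into a list of prompt strings ending in EOS."""
    texts = [
        ALPACA_PROMPT.format(instruction, input_text, output) + eos_token
        for instruction, input_text, output in zip(
            examples["instructions"], examples["input"], examples["Output"]
        )
    ]
    return {"text": texts}


# Made-up rows for illustration only, not from the real dataset.
toy_batch = {
    "instructions": ["Extract the named entities.", "Extract the named entities."],
    "input": ["John lives in Paris .", "Acme hired Mary ."],
    "Output": ["John: PER; Paris: LOC", "Acme: ORG; Mary: PER"],
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Every formatted example ends with the EOS marker directly after the response, which is the signal the model learns to emit when its answer is complete.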


In [4]:
# Define the prompt template for Alpaca-based task instruction
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
# The alpaca_prompt template includes placeholders for instruction, input, and output. These will be filled dynamically during processing.

# EOS_TOKEN (End of Sequence Token) to signal the end of a generated sequence. It's necessary to prevent infinite generation.
EOS_TOKEN = tokenizer.eos_token  # Retrieve the EOS token from the tokenizer.

# Define a function to format the dataset examples into the Alpaca prompt format
def formatting_prompts_func(examples):
    instructions = examples["instructions"]  # Extract instructions from the dataset.
    inputs = examples["input"]  # Extract input data from the dataset.
    outputs = examples["Output"]  # Extract the expected output from the dataset.

    texts = []  # Initialize a list to store the formatted prompt texts.

    # Loop through each instruction, input, and output in parallel.
    # The loop variable is named input_text to avoid shadowing Python's built-in input().
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Format the alpaca_prompt with the current instruction, input, and output.
        # EOS_TOKEN is appended so the model learns where a response ends.
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)  # Append the formatted text to the list.

    # Return the formatted texts in a dictionary with the key "text".
    return { "text" : texts, }

# Load a dataset from Hugging Face's datasets library
from datasets import load_dataset
dataset = load_dataset("SURESHBEEKHANI/Named_entity_recognition", split="train")
# This loads the "train" split of the "Named_entity_recognition" dataset by the user "SURESHBEEKHANI".

# Apply the formatting function to the dataset using the `map` method.
# This will apply the `formatting_prompts_func` to each example in the dataset in a batched manner.
dataset = dataset.map(formatting_prompts_func, batched=True)


<a name="Train"></a>
### Train the Model
Next, let's train the model using Hugging Face TRL's `SFTTrainer`! For more details, check out the [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).

To keep the run short, this example trains for only 50 steps (`max_steps=50`). For a full training run, set `num_train_epochs=1` and remove `max_steps` so that training length is determined by the number of epochs.

We also support TRL's `DPOTrainer` for preference-based fine-tuning, which you can explore if needed!
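
To see what a short run covers, the arithmetic below relates the batch settings to the dataset size. It is a rough sketch: the numbers are the ones used in this notebook (11,989 training examples, batch size 2, 4 accumulation steps, 1 GPU), and the Trainer's own step count for a full epoch may differ slightly depending on how it handles the last partial batch.

```python
import math

num_examples = 11_989            # size of the train split reported in the notebook's logs
per_device_batch_size = 2        # per_device_train_batch_size
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 50

# One optimizer step consumes this many examples.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Examples seen during the short run, and the share of one epoch that represents.
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples

# Approximate number of optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} ({epoch_fraction:.1%} of one epoch)")
print(f"Approximate steps for one full epoch: {steps_per_epoch}")
```

With these settings the run sees only a small slice of the data, so the resulting model should be treated as a quick demonstration rather than a fully trained one.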


In [5]:
# Import necessary libraries
from trl import SFTTrainer  # SFTTrainer runs supervised fine-tuning (SFT) on text datasets.
from transformers import TrainingArguments  # TrainingArguments contains configuration options for model training.
from unsloth import is_bfloat16_supported  # Function to check if the hardware supports bfloat16 precision.

# Initialize the SFTTrainer with the necessary arguments
trainer = SFTTrainer(
    model = model,  # The pre-trained model to be fine-tuned.
    tokenizer = tokenizer,  # The tokenizer used to process input data for the model.
    train_dataset = dataset,  # The dataset to train on, assumed to be preprocessed and formatted.
    dataset_text_field = "text",  # Specifies which field in the dataset contains the input text. Here, it's the "text" field.
    max_seq_length = max_seq_length,  # The maximum length of input sequences for the model. Ensures sequences are not longer than this value.
    dataset_num_proc = 2,  # Number of CPU processes used to preprocess (tokenize) the dataset.
    packing = False,  # If True, several short examples are concatenated into one sequence of up to max_seq_length tokens (can make training faster for short sequences).

    # Define the training configuration through TrainingArguments
    args = TrainingArguments(
        per_device_train_batch_size = 2,  # Batch size per device during training.
        gradient_accumulation_steps = 4,  # Number of steps to accumulate gradients before performing an update (helps with memory efficiency).
        warmup_steps = 5,  # Number of initial training steps over which the learning rate ramps up from 0 to its target value.
        max_steps = 50,  # The total number of training steps. Once reached, training will stop.
        learning_rate = 2e-4,  # Learning rate for the optimizer.

        # fp16 (16-bit floating point) and bf16 (bfloat16) are precision modes used to speed up training and reduce memory usage.
        fp16 = not is_bfloat16_supported(),  # Use fp16 if bfloat16 is not supported by the hardware.
        bf16 = is_bfloat16_supported(),  # Use bf16 if supported by the hardware (typically for Ampere GPUs or newer).

        logging_steps = 1,  # Log training progress every step. Setting this to a higher value reduces logging frequency.
        optim = "adamw_8bit",  # Specifies the optimizer used during training (AdamW with 8-bit precision for memory efficiency).
        weight_decay = 0.01,  # Regularization parameter to prevent overfitting by penalizing large weights.
        lr_scheduler_type = "linear",  # Learning rate scheduler type. "linear" gradually decays the learning rate during training.
        seed = 3407,  # Random seed for reproducibility of results.
        output_dir = "outputs",  # Directory to save model checkpoints and logs during training.
        report_to = "none",  # Specifies where to report metrics (e.g., "none" means no reporting, or use "wandb" for Weights & Biases).
    ),
)



In [6]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
6.879 GB of memory reserved.


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 11,989 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 50
 "-____-"     Number of trainable parameters = 54,018,048


Step,Training Loss
1,1.6093
2,1.4369
3,1.5298
4,1.4469
5,1.2982
6,1.0349
7,0.8812
8,0.7204
9,0.6962
10,0.6418


In [8]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

928.6021 seconds used for training.
15.48 minutes used for training.
Peak reserved memory = 13.617 GB.
Peak reserved memory for training = 6.738 GB.
Peak reserved memory % of max memory = 92.331 %.
Peak reserved memory for training % of max memory = 45.688 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [9]:
# Define the prompt template for NER inference.
# Note: this template differs from the Alpaca template used for training above;
# matching the training format usually gives more reliable outputs.
alpaca_prompt = """Your AI assistant for NER", "The entities are categorized into different types, such as PERSON, LOCATION, ORGANIZATION, etc.", "Please review the extracted entities for any potential errors or misclassifications.", "For improved accuracy, try using a more domain-specific NER mode.

### input:
{}

### Output:
{}"""  # The output slot is left empty for generation.

# Switch the model to Unsloth's native 2x faster inference mode.
FastLanguageModel.for_inference(model)

# Example NER request: provide a sentence and let the model generate the tagged entities.
inputs = tokenizer(
    [
        alpaca_prompt.format(  # Fill the template with the input sentence and an empty output slot.
            """On the Republican side , Senator John McCain seems on the verge of clinching his party 's nomination """,  # Input sentence to tag.
            "",  # The output slot is empty for the model to fill in.
        )
    ],
    return_tensors="pt",  # Return PyTorch tensors.
).to("cuda")  # Move the inputs to the GPU.

# Generate the entity list.
# 'max_new_tokens' caps how many tokens the model may generate.
# 'use_cache' reuses the attention key/value cache from earlier tokens to speed up generation.
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the generated token IDs (prompt plus completion) back into text.
tokenizer.batch_decode(outputs)

['<bos>Your AI assistant for NER", "The entities are categorized into different types, such as PERSON, LOCATION, ORGANIZATION, etc.", "Please review the extracted entities for any potential errors or misclassifications.", "For improved accuracy, try using a more domain-specific NER mode.\n\n### input:\nOn the Republican side , Senator John McCain seems on the verge of clinching his party \'s nomination \n\n### Output:\n[{\'end\': 10, \'entity\': \'I-PER\', \'index\': 2, \'score\': 0.99999994, \'start\': 7, \'word\': \'John\'}, {\'end\': 13, \'entity\': \'I-PER\',']

In [10]:
# Define the prompt template for NER inference (same template as in the previous cell).
alpaca_prompt = """Your AI assistant for NER", "The entities are categorized into different types, such as PERSON, LOCATION, ORGANIZATION, etc.", "Please review the extracted entities for any potential errors or misclassifications.", "For improved accuracy, try using a more domain-specific NER mode.

### input:
{}

### Output:
{}"""  # The output slot is left empty for generation.

# Switch the model to Unsloth's native 2x faster inference mode.
FastLanguageModel.for_inference(model)

# Example NER request: provide a sentence and let the model generate the tagged entities.
inputs = tokenizer(
    [
        alpaca_prompt.format(  # Fill the template with the input sentence and an empty output slot.
            """On the Republican side , Senator John McCain seems on the verge of clinching his party 's nomination """,  # Input sentence to tag.
            "",  # The output slot is empty for the model to fill in.
        )
    ],
    return_tensors="pt",  # Return PyTorch tensors.
).to("cuda")  # Move the inputs to the GPU.

from transformers import TextStreamer

# TextStreamer decodes and prints tokens as they are generated,
# so the output appears immediately instead of after generation finishes.
text_streamer = TextStreamer(tokenizer)

# Generate the entity list, streaming it token by token.
# Streaming does not speed up generation; it only shows the text sooner.
_ = model.generate(
    **inputs,  # Provide the tokenized prompt to the model.
    streamer=text_streamer,  # Print tokens as they are produced.
    max_new_tokens=128  # Generate at most 128 new tokens.
)

<bos>Your AI assistant for NER", "The entities are categorized into different types, such as PERSON, LOCATION, ORGANIZATION, etc.", "Please review the extracted entities for any potential errors or misclassifications.", "For improved accuracy, try using a more domain-specific NER mode.

### input:
On the Republican side , Senator John McCain seems on the verge of clinching his party 's nomination 

### Output:
[{'end': 10, 'entity': 'I-PER', 'index': 2, 'score': 0.99999994, 'start': 7, 'word': 'John'}, {'end': 13, 'entity': 'I-PER', 'index': 3, 'score': 0.99999994, 'start': 11, 'word': 'Mc'}, {'end': 15, 'entity': 'I-PER', 'index': 4, 'score': 0.999
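
The decoded text echoes the prompt before the generated answer, and the answer itself looks like a Python-literal list of dictionaries that can be cut off when `max_new_tokens` is reached. The sketch below is one hedged way to post-process such a string. The helper names `extract_response` and `parse_entities` are hypothetical, the `"<eos>"` string is an assumed end-of-sequence marker, and the sample strings are shortened stand-ins for real model output.

```python
import ast

RESPONSE_MARKER = "### Output:"  # Matches the inference prompt used in this notebook.


def extract_response(decoded, marker=RESPONSE_MARKER, eos_token="<eos>"):
    """Return the text after the last marker, with the EOS token removed.

    If the marker is missing, the whole string is returned (stripped).
    """
    _, _, tail = decoded.rpartition(marker)
    return tail.replace(eos_token, "").strip()


def parse_entities(response):
    """Parse a Python-literal list of dicts; return None if it is incomplete."""
    try:
        parsed = ast.literal_eval(response)
    except (ValueError, SyntaxError):
        return None  # Truncated or malformed output cannot be parsed safely.
    return parsed if isinstance(parsed, list) else None


# Shortened stand-ins for decoded model output (illustrative only).
complete = "<bos>prompt text\n\n### Output:\n[{'entity': 'I-PER', 'word': 'John'}]<eos>"
truncated = "<bos>prompt text\n\n### Output:\n[{'entity': 'I-PER', 'word': 'John'}, {'entity': 'I-PER',"

print(parse_entities(extract_response(complete)))   # [{'entity': 'I-PER', 'word': 'John'}]
print(parse_entities(extract_response(truncated)))  # None
```

A `None` result is a hint to raise `max_new_tokens` and generate again, since the list was cut off before it closed.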


<a name="Save"></a>
### Saving, loading finetuned models
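To save the final model as LoRA adapters, use `save_pretrained` for a local save or Hugging Face's `push_to_hub` for an online save. This saves only the LoRA adapters, not the full model.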

In [11]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.model',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

##### Push the trained model to the Hugging Face Model Hub using the GGUF format

In [None]:
# Push the trained model to the Hugging Face Model Hub using the GGUF format
model.push_to_hub_gguf(
    "SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF",  # Target repository on the Hugging Face Hub. Replace "SURESHBEEKHANI" with your own Hugging Face username.
    tokenizer,  # Pass the tokenizer associated with the model to ensure compatibility on the hub
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],  # GGUF quantization variants to produce; one file is created and uploaded per method.
    token="",  # Provide the Hugging Face token for authentication. Obtain a token at https://huggingface.co/settings/tokens
)


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.75 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 31%|███       | 13/42 [00:01<00:02, 13.91it/s]
We will save to Disk and not RAM now.
100%|██████████| 42/42 [03:10<00:00,  4.54s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF/pytorch_model-00001-of-00004.bin...
Unsloth: Saving SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF/pytorch_model-00002-of-00004.bin...
Unsloth: Saving SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF/pytorch_model-00003-of-00004.bin...
Unsloth: Saving SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting gemma2 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0', 'q5_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF into f16 GGUF format.
The output location will be /content/SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Finetune_Gemma_NRE_SFT_GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/5.76G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/9.83G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Finetune_Gemma_NRE_SFT_GGUF
Unsloth: Uploading GGUF to Huggingface Hub...
