<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Finetune_Gemma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install needed packages**


In [6]:
# Install or upgrade the libraries used in this notebook:
#   datasets      - dataset loading and preprocessing
#   transformers  - models, tokenizers, and training arguments
#   peft          - parameter-efficient fine-tuning (LoRA)
#   trl           - SFTTrainer for supervised fine-tuning
#   bitsandbytes  - 4-bit quantization for memory-efficient loading
#   accelerate    - device placement and training across devices
#   tensorboard   - visualizing training progress
#   jsonlines     - creating and reading JSON Lines datasets
!pip install --upgrade datasets transformers peft trl bitsandbytes accelerate tensorboard jsonlines
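
Because `--upgrade` installs whatever versions are current, it can help to record what was actually installed before running the remaining cells. The helper below is a minimal sketch using only the standard library; `installed_versions` and `PACKAGES` are illustrative names, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# The packages installed by the cell above
PACKAGES = ["datasets", "transformers", "peft", "trl",
            "bitsandbytes", "accelerate", "tensorboard", "jsonlines"]

def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions().items():
    print(f"{name:>14}: {ver or 'NOT INSTALLED'}")
```

A package reported as not installed here would explain a later `ModuleNotFoundError` when importing it.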

In [None]:
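# Authenticate with the Hugging Face Hub. Gemma is a gated model, so the
# account must have accepted its license before the weights can be downloaded.
from huggingface_hub import notebook_login
notebook_login()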


### **Import Required Libraries and Modules**

In [3]:
# Importing the os module for interacting with the operating system (e.g., file paths, environment variables)
import os

# Importing the transformers library for working with pre-trained models and tokenizers
import transformers

# Importing PyTorch library for tensor computations and deep learning model implementation
import torch

# Importing functions to load and manage datasets
from datasets import load_dataset, Dataset, DatasetDict

# Importing SFTTrainer from trl for supervised fine-tuning of transformers models
from trl import SFTTrainer

# Importing LoraConfig and PeftModel from peft for parameter-efficient fine-tuning
from peft import LoraConfig, PeftModel

# Importing AutoTokenizer and AutoModelForCausalLM from transformers for working with tokenization
# and causal language models
from transformers import AutoTokenizer, AutoModelForCausalLM

# Importing BitsAndBytesConfig for memory-efficient model training
# and GemmaTokenizer for specialized tokenization for the Gemma model
from transformers import BitsAndBytesConfig, GemmaTokenizer


### **Load the Model with 4-bit Quantization and the Tokenizer**

In [10]:
# Specify the ID of the Gemma 2B instruction-tuned model ("it") on the Hugging Face Hub
model_id = "google/gemma-2b-it"

# Configure bitsandbytes for 4-bit quantization to reduce the memory needed for the model weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit precision for model weights
    bnb_4bit_quant_type="nf4",  # Specify the quantization type as Normal Float 4 (NF4)
    bnb_4bit_compute_dtype=torch.bfloat16  # Use bfloat16 precision for computations
)

# Load the tokenizer for the specified Gemma model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the Gemma model with the specified quantization configuration
# and map the model to the first available device (e.g., GPU with ID 0)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # Apply the 4-bit quantization settings
    device_map={"": 0}  # Map the entire model to device 0
)
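
To get a rough sense of what 4-bit loading saves, the weight footprint can be estimated from the checkpoint size. The sketch below assumes the two safetensors shards for this model total about 5.02 GB (4.95 GB + 67.1 MB, as reported during download) and that they store weights at 2 bytes per parameter. The helper name and the 2-byte assumption are illustrative. The estimate ignores quantization constants, activations, and any layers that bitsandbytes leaves unquantized, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

checkpoint_bytes = 4.95e9 + 67.1e6   # shard sizes reported during download
num_params = checkpoint_bytes / 2    # assumption: 16-bit (2-byte) weights on disk

for label, bits in [("16-bit", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>12}: ~{estimate_weight_memory_gb(num_params, bits):.2f} GB")
```

After loading, `model.get_memory_footprint()` reports the actual figure in bytes, which is the number to trust over this back-of-the-envelope estimate.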


### **Perform Inference**

After loading the model with 4-bit quantization, you can proceed with inference as usual:

In [23]:
# Define the input text
input_text = "What is NLP and CV"

# Tokenize the input text and move tensors to the appropriate device
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate the output
outputs = model.generate(**input_ids, max_length=128)

# Decode and print the output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
generated_text

'What is NLP and CV?\n\n**Natural Language Processing (NLP)** is the ability of computers to understand, interpret, and generate human language. This includes tasks such as:\n\n* Text classification\n* Sentiment analysis\n* Named entity recognition\n* Question answering\n* Machine translation\n\n**Computer Vision (CV)** is the ability of computers to understand and interpret visual information. This includes tasks such as:\n\n* Object detection\n* Image segmentation\n* Image classification\n* Facial recognition\n\nNLP and CV are both used in a wide variety of applications, including:\n\n* Natural language processing is used in chatbots, machine translation,'

In [22]:
# Prompt the model to complete a quote
text = "Quote: Imagination is more"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate at most 20 new tokens beyond the prompt
outputs = model.generate(**inputs, max_new_tokens=20)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
generated_text

'Quote: Imagination is more than just a dream. It\'s the blueprint for a better future." - Buck Brann\n\n'
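
The two inference cells bound generation differently: `max_length` caps the total sequence (prompt plus generated tokens), while `max_new_tokens` caps only the generated part. The helper below illustrates that arithmetic. It is not a transformers function; `new_token_budget` is a hypothetical name, and the 7-token prompt length is made up for the example.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Upper bound on generated tokens for a prompt of `prompt_len` tokens.

    Follows the documented behaviour of `generate`: `max_new_tokens`
    takes precedence over `max_length` when both are given.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")

# With max_length the budget shrinks as the prompt grows;
# with max_new_tokens the prompt length does not matter.
print(new_token_budget(7, max_length=128))     # 121
print(new_token_budget(7, max_new_tokens=20))  # 20
```

Either limit is only an upper bound: generation can stop earlier if the model emits an end-of-sequence token.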

In [24]:
# Disable Weights & Biases logging; training metrics are reported to TensorBoard instead
os.environ["WANDB_DISABLED"] = "true"

### **Configure the Low-Rank Adaptation (LoRA) Settings**

In [25]:
# Create a LoraConfig object to configure the Low-Rank Adaptation (LoRA) model settings
peft_config = LoraConfig(
    # Rank of the low-rank update matrices; a higher rank adds more trainable parameters
    r = 8,

    # Modules that receive LoRA adapters
    target_modules = [
        "q_proj",  # Attention query projection
        "o_proj",  # Attention output projection
        "k_proj",  # Attention key projection
        "v_proj",  # Attention value projection
        "gate_proj",  # Feed-forward (MLP) gate projection
        "up_proj",  # Feed-forward (MLP) up projection, which expands the hidden dimension
        "down_proj",  # Feed-forward (MLP) down projection, which maps back to the hidden dimension
    ],

    # Task type: causal (next-token) language modeling
    task_type = "CAUSAL_LM",
)
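
For a sense of scale, the number of parameters LoRA adds can be worked out by hand: each adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` trainable weights. The sketch below applies that formula to the seven target modules. The layer shapes and layer count are assumptions about the Gemma 2B architecture and should be checked against `model.config`; `lora_params` is an illustrative helper.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)

# Assumed Gemma 2B dimensions (verify against model.config)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 16384, 256, 18

shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

r = 8
per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Wrapping the model with `peft.get_peft_model(model, peft_config)` and calling `print_trainable_parameters()` on the result gives the real count for the loaded model.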

### **Load the Dataset**

In [28]:
# load_dataset comes from Hugging Face's datasets library
from datasets import load_dataset

# Load the "yrehan32/llama2-wiki-medical-terms" dataset from the Hugging Face Hub
data = load_dataset("yrehan32/llama2-wiki-medical-terms")


In [29]:
data

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 6861
    })
})

In [30]:
# Tokenize the dataset in batches; the raw text is stored in the "text" column
data = data.map(lambda samples: tokenizer(samples["text"]), batched=True)

# Inspect the first tokenized example from the training split
print(data['train'][0])

In [36]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 6861
    })
})
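
With `input_ids` available, the distribution of token lengths is a reasonable guide for choosing a maximum sequence length later on. The helper below is a sketch in plain Python; `length_summary` is an illustrative name, and the toy list stands in for `data["train"]["input_ids"]`.

```python
import math

def length_summary(input_ids, percentile=95):
    """Return (max, mean, nearest-rank percentile) of token-sequence lengths."""
    lengths = sorted(len(ids) for ids in input_ids)
    if not lengths:
        raise ValueError("no sequences given")
    # Nearest-rank percentile: smallest length covering `percentile` % of examples
    rank = max(math.ceil(percentile / 100 * len(lengths)), 1)
    return lengths[-1], sum(lengths) / len(lengths), lengths[rank - 1]

# Toy input standing in for data["train"]["input_ids"]
toy = [[1, 2, 3], [1, 2], [1, 2, 3, 4, 5], [1]]
print(length_summary(toy))
```

A limit near a high percentile truncates only the longest few examples while keeping padding and memory use down.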

In [1]:
# Helper that tokenizes the "text" field of an example, padding to the tokenizer's
# maximum length and truncating anything longer.
# Note: this function is not passed to the SFTTrainer below.
def formatting_func(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)
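
The training cells that follow refer to a number of hyperparameter variables that are never assigned in this notebook. The block below is one possible set of definitions so those cells can run. Every value here is an illustrative placeholder, not a setting from the original notebook, and should be tuned for the available GPU and dataset.

```python
# Illustrative placeholder values; none of these come from the original notebook.
output_dir = "./results"           # where checkpoints and logs are written
num_train_epochs = 1               # passes over the training set
per_device_train_batch_size = 4    # examples per GPU per forward pass
gradient_accumulation_steps = 4    # forward passes per optimizer step
optim = "paged_adamw_32bit"        # paged AdamW optimizer from bitsandbytes
save_steps = 100                   # checkpoint interval, in optimizer steps
logging_steps = 25                 # logging interval, in optimizer steps
learning_rate = 2e-4               # initial learning rate
weight_decay = 0.001               # weight decay coefficient
fp16 = False                       # set at most one of fp16/bf16 to True,
bf16 = False                       # depending on what the GPU supports
max_grad_norm = 0.3                # gradient clipping threshold
max_steps = -1                     # -1 means "use num_train_epochs instead"
warmup_ratio = 0.03                # fraction of steps used for LR warmup
group_by_length = True             # batch similar-length sequences together
lr_scheduler_type = "cosine"       # learning-rate schedule
max_seq_length = 512               # token limit per training example
packing = False                    # do not pack multiple examples per sequence
```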

In [None]:
# TrainingArguments is not among the imports at the top of the notebook
from transformers import TrainingArguments

# Set training parameters. The hyperparameter variables used below
# (output_dir, num_train_epochs, ...) must be defined before this cell runs.
training_arguments = TrainingArguments(
    output_dir=output_dir,  # Where to save model checkpoints and logs.
    num_train_epochs=num_train_epochs,  # Number of training epochs.
    per_device_train_batch_size=per_device_train_batch_size,  # Batch size per GPU.
    gradient_accumulation_steps=gradient_accumulation_steps,  # Accumulate gradients over several steps for a larger effective batch size.
    optim=optim,  # Optimizer to use during training.
    save_steps=save_steps,  # How often to save model checkpoints.
    logging_steps=logging_steps,  # How often to log training progress.
    learning_rate=learning_rate,  # Initial learning rate.
    weight_decay=weight_decay,  # Weight decay, a regularizer that helps limit overfitting.
    fp16=fp16,  # Mixed-precision training with float16, if supported.
    bf16=bf16,  # Mixed-precision training with bfloat16, if supported.
    max_grad_norm=max_grad_norm,  # Clip gradients to prevent exploding gradients.
    max_steps=max_steps,  # Maximum number of training steps; overrides num_train_epochs when positive.
    warmup_ratio=warmup_ratio,  # Fraction of training steps used for learning-rate warmup.
    group_by_length=group_by_length,  # Batch sequences of similar length together to reduce padding.
    lr_scheduler_type=lr_scheduler_type,  # Learning-rate scheduler.
    report_to="tensorboard"  # Log training metrics to TensorBoard.
)
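
The batch-size, accumulation, and warmup settings interact, so it is worth working out what they imply for this dataset of 6,861 rows. The sketch below does the arithmetic for hypothetical values (batch size 4, accumulation 4, one epoch, warmup ratio 0.03, one GPU). `training_schedule` is an illustrative helper, and the exact step counts used by the trainer may differ slightly depending on library version and rounding.

```python
import math

def training_schedule(num_rows, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Return (effective batch size, total optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# 6861 training rows; the remaining values are hypothetical
print(training_schedule(6861, 4, 4, 1, 0.03))
```

Under these assumptions, each optimizer step sees 16 examples and one epoch takes roughly 429 steps, which helps when choosing `save_steps` and `logging_steps`.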

In [None]:
# Set supervised fine-tuning parameters.
# Note: these keyword arguments follow older TRL releases. Newer releases may expect
# dataset_text_field, max_seq_length, and packing on an SFTConfig instead, so check
# the documentation for the installed version.
trainer = SFTTrainer(
    model=model,  # The 4-bit quantized base model loaded above.
    train_dataset=data["train"],  # The training split of the dataset loaded above.
    peft_config=peft_config,  # Apply the LoRA configuration.
    dataset_text_field="text",  # The dataset column that holds the training text.
    max_seq_length=max_seq_length,  # Maximum sequence length for training examples.
    tokenizer=tokenizer,  # Tokenizer used for text preprocessing.
    args=training_arguments,  # The training arguments defined above.
    packing=packing,  # Enable sequence packing if applicable.
)
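
Constructing the trainer does not start training. A minimal sketch of the next step, assuming the `trainer` above was built successfully, is to call `train()` and then `save_model()`. `train_and_save` is an illustrative wrapper, not a TRL function.

```python
def train_and_save(trainer, save_dir):
    """Run fine-tuning, then write the trained weights to `save_dir`.

    For a model wrapped with a PEFT config, the saved weights are
    typically the LoRA adapter rather than the full base model.
    """
    result = trainer.train()
    trainer.save_model(save_dir)
    return result

# Usage in the notebook (commented out because it needs a GPU and the trainer above):
# train_and_save(trainer, output_dir)
```

Saved adapters can later be attached to the base model with `PeftModel.from_pretrained`, which is already imported at the top of the notebook.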