## Task
Fine-tune a pre-trained large language model (Gemma 2 2B IT, google/gemma-2-2b-it) for a multi-class classification task using Quantized Low-Rank Adaptation (QLoRA).

## Environment Setup

### Login to Hugging Face Hub

In [None]:
from huggingface_hub import login
login()

login() authenticates with the Hugging Face Hub so that the model can be downloaded. Gemma is a gated model, so the account must have accepted its license terms.

### Check GPU Availability

In [2]:
import torch
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Device Setup

In [3]:
torch.cuda.set_device(0)

This makes GPU 0 the current CUDA device. Unlike the availability check above, this call does not fall back to the CPU, so it assumes a CUDA GPU is present.

## Load Pretrained Model and Tokenizer

### Define ID and Label Mapping

In [4]:
id2label = {0: "negative", 1: "positive", 2: "neutral"}
label2id = {"negative": 0, "positive": 1, "neutral": 2}

### Define Model and Tokenizer

In [5]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig, AutoConfig

model_name = "google/gemma-2-2b-it"

bnbConfig = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,
    id2label=id2label,
    label2id=label2id,
    device_map = "auto",
    quantization_config=bnbConfig
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of Gemma2ForSequenceClassification were not initialized from the model checkpoint at google/gemma-2-2b-it and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


* Define the model name to be used.
* Configure BitsAndBytesConfig parameters: This configuration quantizes the model with the bitsandbytes library. Quantization reduces the memory footprint of the model weights, which makes it practical to fine-tune models such as Gemma-2-2b on a single GPU.
    * load_in_4bit: Loads the linear-layer weights in 4-bit precision to reduce memory usage.
    * bnb_4bit_quant_type: Specifies the 4-bit data type. nf4 (4-bit NormalFloat) is designed for normally distributed weights; the alternative is fp4.
    * bnb_4bit_use_double_quant: Also quantizes the quantization constants themselves, which saves a little additional memory per parameter.
    * bnb_4bit_compute_dtype: Sets the data type used for computation. Weights are dequantized to this type for matrix multiplications; bfloat16 is a common choice on GPUs that support it.
* Initialize tokenizer for the defined model
* Load the model:
    * model_name: Name of the defined model.
    * num_labels=3: Use for multi-class classification (e.g., positive, neutral, negative).
    * id2label=id2label: A dictionary that maps output class indices (ids) to human-readable labels. This is used to convert model predictions into interpretable labels.
    * label2id=label2id: A dictionary that maps human-readable labels to their corresponding indices (ids). This is used during model fine-tuning to convert the dataset's labels into the correct format for the model.
    * device_map="auto": Automatically maps model layers to available devices.
    * quantization_config=bnbConfig: This specifies the quantization settings (e.g., 4-bit quantization) to reduce model size and improve efficiency.
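
To make the effect of bnb_4bit_use_double_quant more concrete, the following back-of-the-envelope sketch estimates how many extra bits per parameter are spent on quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) are the values reported in the QLoRA paper and are assumed here; they are not read from this notebook's bitsandbytes installation.

```python
# Overhead of quantization constants, in bits per model parameter.
# Assumed block sizes (from the QLoRA paper, not read from bitsandbytes).
WEIGHT_BLOCK = 64      # weights sharing one first-level constant
CONSTANT_BLOCK = 256   # first-level constants sharing one second-level constant


def constant_overhead_bits(double_quant: bool) -> float:
    """Return the storage cost of quantization constants per parameter."""
    if not double_quant:
        # One 32-bit constant per block of weights.
        return 32 / WEIGHT_BLOCK
    # 8-bit first-level constants, plus one 32-bit constant
    # for every CONSTANT_BLOCK first-level constants.
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)


plain = constant_overhead_bits(double_quant=False)
nested = constant_overhead_bits(double_quant=True)
saving_bits = plain - nested

print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {nested:.3f} bits/param")
print(f"saving:                      {saving_bits:.3f} bits/param")
```

Under these assumptions the saving is a fraction of a bit per parameter, which is small next to the 4 bits used for each weight but adds up over billions of parameters.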

### Check Where Model is Loaded

In [6]:
loaded_device = next(model.parameters()).device
print(f"The model is loaded on: {loaded_device}")

The model is loaded on: cuda:0


## Sentiment Inference Before Fine-Tuning

### Inference 1

In [7]:
input_text = "@united I loved the flight!"
    
inputs = tokenizer(input_text, return_tensors='pt', truncation=True).to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

predicted_logits = outputs.logits
predicted_class_id = torch.argmax(predicted_logits, dim=1).item()
predicted_label = id2label[predicted_class_id]

print(predicted_class_id)
print("Predicted Sentiment:",predicted_label)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


2
Predicted Sentiment: neutral


### Inference 2

In [8]:
input_text = "Need to improve your airline service"
    
inputs = tokenizer(input_text, return_tensors='pt', truncation=True).to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

predicted_logits = outputs.logits
predicted_class_id = torch.argmax(predicted_logits, dim=1).item()
predicted_label = id2label[predicted_class_id]

print(predicted_class_id)
print("Predicted Sentiment:",predicted_label)

2
Predicted Sentiment: neutral


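Both inputs are labeled neutral. The classification head (score.weight) is newly initialized, as the loading warning notes, so predictions before fine-tuning are essentially arbitrary. The steps are: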
* Input Text: The input text is tokenized and sent to the GPU for processing.
* Model Inference: The model outputs logits for each sentiment class (positive, negative, neutral).
* Prediction: The predicted sentiment class is determined using the argmax function on the logits.
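
The prediction step can be sketched without the model. The snippet below is a minimal pure-Python illustration of turning one row of logits into a label; the logits are hypothetical values chosen for illustration, not outputs of the model above, and the softmax is included only to show how logits relate to probabilities.

```python
import math

id2label = {0: "negative", 1: "positive", 2: "neutral"}


def predict_label(logits):
    """Return (class_id, label, probabilities) for one row of logits."""
    # argmax over the class dimension picks the highest-scoring class.
    class_id = max(range(len(logits)), key=lambda i: logits[i])
    # Softmax is not needed for the prediction itself, but it turns
    # the logits into probabilities that sum to 1.
    shifted = [x - max(logits) for x in logits]  # for numerical stability
    exps = [math.exp(x) for x in shifted]
    probs = [e / sum(exps) for e in exps]
    return class_id, id2label[class_id], probs


# Hypothetical logits for a single input (not produced by the model above).
example_logits = [-0.3, 0.1, 1.2]
class_id, label, probs = predict_label(example_logits)
print(class_id, label, [round(p, 3) for p in probs])
```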

## Data Preparation for Fine-Tuning

### Load Dataset

In [9]:
from datasets import load_dataset, Dataset
import pandas as pd

df = pd.read_csv('Datasets/Airline-Sentiment-2-w-AA.csv', encoding='ISO-8859-1')

filtered_df = df[['text', 'airline_sentiment']]

dataset = Dataset.from_pandas(filtered_df)

shuffled_dataset = dataset.shuffle(seed=42)

shuffled_dataset

Dataset({
    features: ['text', 'airline_sentiment'],
    num_rows: 14640
})

The airline sentiment dataset is loaded with pandas, reduced to the text and airline_sentiment columns, converted into a Hugging Face Dataset, and shuffled with a fixed seed.

### Train, Validation, and Test Splits

### Split Dataset

In [10]:
train_val_test_split = shuffled_dataset.train_test_split(test_size=0.2, seed=42)

train_val_dataset = train_val_test_split['train']
test_dataset = train_val_test_split['test']

train_val_split = train_val_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = train_val_split['train']
validation_dataset = train_val_split['test']

The data is split into training, validation, and test sets (72-8-20 split).

### Check Dataset Size

In [11]:
print(f"Train size: {len(train_dataset)}")
print(f"Validation size: {len(validation_dataset)}")
print(f"Test size: {len(test_dataset)}")

Train size: 10540
Validation size: 1172
Test size: 2928
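
The split sizes can be checked by hand. The sketch below reproduces the two-stage split arithmetic; it assumes the test share is rounded up to a whole row, which matches the sizes printed above, and the helper name split_sizes is hypothetical.

```python
def split_sizes(n_rows, test_percent):
    """Return (train, test) sizes, rounding the test share up (assumption)."""
    n_test = -(-n_rows * test_percent // 100)  # integer ceiling division
    return n_rows - n_test, n_test


total = 14640
train_val, test = split_sizes(total, 20)         # first split: 80 / 20
train, validation = split_sizes(train_val, 10)   # second split: 90 / 10 of the 80%

print(train, validation, test)
for name, size in [("train", train), ("validation", validation), ("test", test)]:
    print(f"{name}: {size / total:.1%}")
```

The second split takes 10% of the remaining 80%, which is why the validation set is about 8% of the full dataset rather than 10%.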


## Fine-Tuning the Model with QLoRA

### Define Tokenization Function

In [12]:
def tokenize_function(examples):
    tokenized_output = tokenizer(examples['text'], padding='max_length', truncation=True)
    tokenized_output['labels'] = [label2id[label] for label in examples['airline_sentiment']]
    return tokenized_output

### Tokenization of Dataset

In [13]:
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_validation_dataset = validation_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/10540 [00:00<?, ? examples/s]

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.


Map:   0%|          | 0/1172 [00:00<?, ? examples/s]

Map:   0%|          | 0/2928 [00:00<?, ? examples/s]

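The text is tokenized into input IDs and attention masks, and each sentiment label is mapped to its integer ID with label2id. As the warning shows, no maximum length is defined for this tokenizer, so padding='max_length' has no effect at this stage; batches are padded later by the data collator.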

### QLoRA Preparation

In [14]:
model.gradient_checkpointing_enable()

In [15]:
from peft import prepare_model_for_kbit_training
prepare_model_for_kbit_training(model)

Gemma2ForSequenceClassification(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear4bit(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_feedforw

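Gradient checkpointing is enabled to trade extra computation for lower memory use, and prepare_model_for_kbit_training readies the 4-bit quantized model for training with LoRA (Low-Rank Adaptation), for example by freezing the base weights.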

### Identify LoRA Modules

In [16]:
import bitsandbytes as bnb

def find_linear_names(model):
    """Collect the names of all 4-bit linear layers as LoRA target modules."""
    cls = bnb.nn.Linear4bit

    lora_module_names = set()

    for name, module in model.named_modules():
        if isinstance(module, cls):
            # Keep only the last path component, e.g. 'q_proj'.
            lora_module_names.add(name.split('.')[-1])

    # Exclude the output head if present; checked once, after the loop.
    lora_module_names.discard('lm_head')
    return list(lora_module_names)

modules = find_linear_names(model)

In [17]:
print(modules)

['o_proj', 'down_proj', 'gate_proj', 'q_proj', 'up_proj', 'k_proj', 'v_proj']


This code defines a function that uses the bitsandbytes Linear4bit class to find all quantized linear layers in the Gemma-2-2b model. These layers are the candidates for LoRA adapters, and their names are printed.

### LoRA Configuration

In [18]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=modules,
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 10,390,272 || all params: 2,624,739,072 || trainable%: 0.3959


This configuration defines the LoRA (Low-Rank Adaptation) setup. LoRA fine-tunes large models by freezing the base weights and training only small low-rank matrices added to the targeted layers, which significantly reduces the number of trainable parameters (here about 0.4% of the total).
* r: The rank of the low-rank decomposition. Smaller ranks reduce the number of trainable parameters but may lower accuracy.
* lora_alpha: Scaling factor for the LoRA update. The update is scaled by lora_alpha / r, which is 2 here.
* target_modules: Specifies which layers of the model receive LoRA adapters (identified via find_linear_names()).
* lora_dropout: Dropout probability applied in the LoRA layers to reduce overfitting.
* bias: Controls which bias parameters are trained; "none" means no bias terms are trained.
* task_type: Specifies the type of task (here, sequence classification), which also keeps the classification head trainable.
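
The trainable parameter count reported by print_trainable_parameters() can be reproduced by hand. The sketch below uses the layer shapes from the model printout earlier in this notebook; it assumes that each adapted layer adds two matrices of shapes (r, in_features) and (out_features, r), and that the bias-free classification head (score.weight) is the only other trainable tensor.

```python
RANK = 8
NUM_LAYERS = 26
HIDDEN_SIZE = 2304
NUM_LABELS = 3

# (in_features, out_features) of each targeted projection in one decoder layer,
# copied from the model printout earlier in this notebook.
PROJECTION_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}


def lora_params(in_features, out_features, rank):
    # A has shape (rank, in_features); B has shape (out_features, rank).
    return rank * in_features + out_features * rank


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
adapter_params = per_layer * NUM_LAYERS
# Assumption: the classification head (score.weight) stays trainable.
head_params = HIDDEN_SIZE * NUM_LABELS
trainable = adapter_params + head_params

print(f"adapter params: {adapter_params:,}")
print(f"head params:    {head_params:,}")
print(f"trainable:      {trainable:,}")
```

The total agrees with the 10,390,272 trainable parameters printed above, which is consistent with the adapters plus the classification head being the only trained weights.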

### Define Data Collator

In [19]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer
)

The DataCollatorWithPadding class from Transformers prepares batches of data for training. It pads the sequences in each batch to the length of the longest one and extends the attention masks accordingly.
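
The idea of per-batch (dynamic) padding can be shown with plain Python lists. This is a simplified sketch of the technique, not the library's implementation; the function pad_batch and the token IDs are hypothetical, and pad_token_id=0 is assumed to match the padding_idx=0 shown in the model printout.

```python
def pad_batch(batch, pad_token_id=0, padding_side="right"):
    """Pad every sequence in the batch to the longest one and build masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = [pad_token_id] * (max_len - len(ids))
        mask_padding = [0] * len(padding)  # padded positions are masked out
        if padding_side == "right":
            input_ids.append(list(ids) + padding)
            attention_mask.append([1] * len(ids) + mask_padding)
        else:
            input_ids.append(padding + list(ids))
            attention_mask.append(mask_padding + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three texts of different lengths.
batch = [[2, 101, 57], [2, 88], [2, 45, 67, 12, 9]]
padded = pad_batch(batch)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps short batches small and avoids wasted computation on padding tokens.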

## Training and Evaluation Setup

### Define Training Arguments

In [20]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    num_train_epochs=3,                
    report_to="none",
    metric_for_best_model='eval_loss',
)




These parameters define how the model is trained, optimized, and validated during fine-tuning.
* output_dir: Where all the outputs from training, including model checkpoints, logs, and results, are saved.
* learning_rate: Controls how much to adjust the model's weights with each step. A lower value (e.g., 2e-5) helps in fine-tuning by making smaller, controlled updates.
* per_device_train_batch_size: Number of samples processed in parallel during training for each GPU. A smaller batch size is used for memory efficiency.
* per_device_eval_batch_size: Number of samples processed in parallel during evaluation.
* weight_decay: Regularization to prevent the model from overfitting by shrinking large weights.
* eval_strategy: Defines when to run evaluation during training (e.g., after each epoch).
* save_strategy: When to save checkpoints of the model (here, after each epoch).
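* load_best_model_at_end: Reloads the checkpoint with the best metric_for_best_model value (here, the lowest evaluation loss) once training finishes. This requires eval_strategy and save_strategy to match.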
* num_train_epochs: Specifies how many full passes over the dataset the model will make during training.
* metric_for_best_model: The metric used to decide which model checkpoint is the best. Here it is set to eval_loss, meaning lower evaluation loss is preferred.

### Define Early Stopping Callback Parameters

In [21]:
from transformers import EarlyStoppingCallback

early_stop = EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=.0)

This callback is used to stop training early if the model's performance on the validation set does not improve.
* early_stopping_patience: Number of consecutive evaluations without improvement that are tolerated before stopping. Evaluation runs once per epoch here, so a value of 1 stops training after a single epoch without improvement.
* early_stopping_threshold: Minimum change in the monitored metric (here, eval_loss) that counts as an improvement. A value of 0.0 means any improvement resets the patience counter.
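
The patience logic can be illustrated with a small simulation. This is a simplified sketch of the early-stopping rule described above, not the library's code; the function simulate_early_stopping is hypothetical, and the losses are the validation losses from the training log later in this notebook.

```python
def simulate_early_stopping(eval_losses, patience=1, threshold=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch losses."""
    best_loss, best_epoch, bad_evals = None, None, 0
    for epoch, loss in enumerate(eval_losses, start=1):
        # An evaluation counts as an improvement if the loss is lower
        # than the best so far by more than the threshold.
        improved = best_loss is None or (
            loss < best_loss and abs(loss - best_loss) > threshold
        )
        if improved:
            best_loss, best_epoch, bad_evals = loss, epoch, 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            return best_epoch, epoch  # training would stop here
    return best_epoch, len(eval_losses)  # ran to completion


# Validation losses from the training log later in this notebook.
validation_losses = [0.546998, 0.480931, 0.597767]
best_epoch, stop_epoch = simulate_early_stopping(validation_losses)
print(f"best epoch: {best_epoch}, stopped after epoch: {stop_epoch}")
```

With these losses the stop condition is met after the third epoch, which is also the final configured epoch, so early stopping does not shorten this particular run.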

### Define Function for Metrics Calculation

In [22]:
import evaluate
import numpy as np

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')
    precision = precision_metric.compute(predictions=predictions, references=labels, average='weighted')
    recall = recall_metric.compute(predictions=predictions, references=labels, average='weighted')
    
    return {**accuracy, **f1, **precision, **recall}

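This function computes accuracy, F1 score, precision, and recall for the model's predictions. Precision, recall, and F1 are computed per class and then averaged, weighted by the number of true examples of each class, which accounts for class imbalance.
* Accuracy: Proportion of correct predictions.
* F1 Score: Harmonic mean of precision and recall.
* Precision: For a given class, the proportion of predictions of that class that were correct.
* Recall: For a given class, the proportion of actual examples of that class that were predicted correctly.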
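
A small worked example helps show what the 'weighted' average does. The sketch below computes the metrics in pure Python on toy labels (not real data); the helper names are hypothetical. It also illustrates a general property: support-weighted recall is always equal to accuracy, which is why those two columns are identical in the training log.

```python
from collections import Counter


def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def true_positives(labels, preds, cls):
    return sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)


def weighted_recall(labels, preds):
    support = Counter(labels)  # number of true examples per class
    return sum(
        (n_cls / len(labels)) * (true_positives(labels, preds, cls) / n_cls)
        for cls, n_cls in support.items()
    )


def weighted_precision(labels, preds):
    support = Counter(labels)
    predicted = Counter(preds)  # number of predictions per class
    total = 0.0
    for cls, n_cls in support.items():
        tp = true_positives(labels, preds, cls)
        precision = tp / predicted[cls] if predicted[cls] else 0.0
        total += (n_cls / len(labels)) * precision
    return total


# Toy, imbalanced labels (0 = negative, 1 = positive, 2 = neutral); not real data.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 0, 2, 1, 0, 2, 2]

acc = accuracy(labels, preds)
w_recall = weighted_recall(labels, preds)
w_precision = weighted_precision(labels, preds)
print(f"accuracy={acc:.4f} weighted_recall={w_recall:.4f} weighted_precision={w_precision:.4f}")
```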

### Define Trainer Parameters

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

The Trainer class in Hugging Face's transformers library simplifies the training and evaluation of models. It includes parameters for various aspects of training, data handling, evaluation, and optimizations.
* model=model: Specifies the model that is being fine-tuned. In this case it is the quantized gemma-2-2b-it sequence classification model wrapped with LoRA adapters.
* args=training_args: Contains various training hyperparameters and configurations, defined in the TrainingArguments object.
* train_dataset=tokenized_train_dataset: The tokenized training dataset, which has been processed and labeled for the sentiment classification task.
* eval_dataset=tokenized_validation_dataset: The dataset used for evaluating the model during training to track performance.
* tokenizer=tokenizer: The tokenizer associated with the model, which converts raw text into token IDs that the model can process.
* data_collator=data_collator: A function responsible for collecting a batch of data samples and preparing them for the model.
* compute_metrics=compute_metrics: A function that computes evaluation metrics like accuracy, precision, recall, and F1 score.
* callbacks=[early_stop]: Specifies callback functions to be executed at certain stages of the training process.

### Train the Model

In [24]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4994 | 0.546998 | 0.883106 | 0.877703 | 0.882679 | 0.883106 |
| 2 | 0.4227 | 0.480931 | 0.880546 | 0.881077 | 0.881713 | 0.880546 |
| 3 | 0.3041 | 0.597767 | 0.877986 | 0.877983 | 0.878591 | 0.877986 |




TrainOutput(global_step=7905, training_loss=0.42834574574537476, metrics={'train_runtime': 5562.5396, 'train_samples_per_second': 5.684, 'train_steps_per_second': 1.421, 'total_flos': 1.4641962791786496e+16, 'train_loss': 0.42834574574537476, 'epoch': 3.0})

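The trainer object is used to train the model. Validation loss is lowest after epoch 2 (0.4809) and rises in epoch 3 (0.5978) while training loss keeps falling, which suggests that overfitting begins in the third epoch.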
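
The global_step and throughput reported in TrainOutput follow from the training set size and batch size. The sketch below checks that arithmetic; it assumes a single GPU and no gradient accumulation, neither of which is set explicitly in the training arguments above.

```python
import math

train_rows = 10540   # training set size printed earlier
batch_size = 4       # per_device_train_batch_size
epochs = 3           # num_train_epochs
# Assumptions: one GPU and no gradient accumulation.
num_devices = 1
grad_accum_steps = 1

samples_per_step = batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_rows / samples_per_step)
total_steps = steps_per_epoch * epochs

train_runtime_s = 5562.5396  # train_runtime from TrainOutput
steps_per_second = total_steps / train_runtime_s

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"steps/s:         {steps_per_second:.3f}")
```

The result agrees with global_step=7905 and train_steps_per_second=1.421 in the output above.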

### Evaluate the Trained Model

In [25]:
trainer.evaluate(eval_dataset=tokenized_test_dataset)

{'eval_loss': 0.520151674747467,
 'eval_accuracy': 0.8767076502732241,
 'eval_f1': 0.8766970672982001,
 'eval_precision': 0.876958392165369,
 'eval_recall': 0.8767076502732241,
 'eval_runtime': 157.0042,
 'eval_samples_per_second': 18.649,
 'eval_steps_per_second': 4.662,
 'epoch': 3.0}

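The trainer object is also used to evaluate the model, here on tokenized_test_dataset. On this held-out test set the model reaches about 87.7% accuracy with a weighted F1 of 0.877, in line with the validation results.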

## Save the Model

In [26]:
model.save_pretrained("./fine_tuned_gemma2_2b_it_qlora")
tokenizer.save_pretrained("./fine_tuned_gemma2_2b_it_qlora")

('./fine_tuned_gemma2_2b_it_qlora\\tokenizer_config.json',
 './fine_tuned_gemma2_2b_it_qlora\\special_tokens_map.json',
 './fine_tuned_gemma2_2b_it_qlora\\tokenizer.json')

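After training, the fine-tuned model is saved for future use. Because the model is a PEFT model, save_pretrained stores the LoRA adapter weights and configuration rather than a full copy of the base model.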

In [31]:
fine_tuned_model_name = "./fine_tuned_gemma2_2b_it_qlora"

fine_tuned_model = AutoModelForSequenceClassification.from_pretrained(
    fine_tuned_model_name,
    num_labels=3,
    id2label=id2label,
    label2id=label2id,
)

tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_gemma2_2b_it_qlora")

fine_tuned_model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of Gemma2ForSequenceClassification were not initialized from the model checkpoint at google/gemma-2-2b-it and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Gemma2ForSequenceClassification(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): lora.Linear(
            (base_layer): Linear(in_features=2304, out_features=2048, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=2304, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=2048, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): lora.Linear(
            (base_layer): Linear(in_features=2304, out_features=1024, bias=False)
            (lora_dropout): Modu

## Sentiment Inference After Fine-Tuning

In [32]:
input_text = "@united I loved the flight!"
    
inputs = tokenizer(input_text, return_tensors='pt', truncation=True)

# Disable gradient tracking during inference
with torch.no_grad():
    outputs = fine_tuned_model(**inputs)

predicted_logits = outputs.logits
predicted_class_id = torch.argmax(predicted_logits, dim=1).item()
predicted_label = id2label[predicted_class_id]

print(predicted_class_id)
print("Predicted Sentiment:",predicted_label)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


1
Predicted Sentiment: positive


In [33]:
input_text = "Need to improve your airline service"
    
inputs = tokenizer(input_text, return_tensors='pt', truncation=True)

# Disable gradient tracking during inference
with torch.no_grad():
    outputs = fine_tuned_model(**inputs)

predicted_logits = outputs.logits
predicted_class_id = torch.argmax(predicted_logits, dim=1).item()
predicted_label = id2label[predicted_class_id]

print(predicted_class_id)
print("Predicted Sentiment:",predicted_label)

0
Predicted Sentiment: negative


The saved model is loaded for inference, and predictions are made on new texts to classify their sentiment.

## Summary
This notebook performs sentiment classification on airline-related text data using Google's Gemma-2-2b-it model fine-tuned with QLoRA (Quantized LoRA). It uses the Hugging Face Transformers, PyTorch, bitsandbytes, and PEFT libraries to load, fine-tune, and evaluate the model, and applies 4-bit quantization to reduce GPU memory usage during fine-tuning.

## Limitations
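* Overfitting: validation loss rises in the third epoch while training loss continues to fall.
* Accuracy below 90%: test accuracy is about 87.7%.
* Use of a small variant of the pre-trained model (2B parameters).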

## Ways to Overcome Limitations
* Regularization: Increase weight decay or apply stronger regularization techniques.
* Dropout: Increase the LoRA dropout rate (lora_dropout) to reduce overfitting.
* Data Augmentation: Use text data augmentation techniques to add diversity to the training set.
* Cross-Validation: Perform cross-validation to ensure the model generalizes well across different data splits.
* Class Imbalance: Handle class imbalance by using class weights in the loss function or oversampling underrepresented classes.
* Batch Size and Learning Rate: Experiment with larger batch sizes or reduce the learning rate to stabilize training.
* Early Stopping: Continue using early stopping but consider adjusting the patience to prevent premature stopping.