# Ridham Dholaria - Final Project

# Fine-tuning 

### Definition:
Fine-tuning means taking an existing pre-trained large language model and training at least one of its internal parameters for a particular use case.

### Description: 
Here I take the [IMDB](https://huggingface.co/datasets/imdb) dataset (the Large Movie Review Dataset, containing movie reviews labeled as positive or negative) and work with a random subsample of it (20,000 reviews per split in the code below), since running on the entire dataset is computationally expensive.
I then build the classifier on top of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base), a model pre-trained on English text with a masked language modeling (MLM) objective. Next comes preprocessing, where I tokenize the data with [AutoTokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer), because the text must be converted to numerical form before the model can process it.


Citation link: https://github.com/ShawhinT/YouTube-Blog/blob/main/LLMs/fine-tuning/ft-example.ipynb

Due to limited GPU memory and compute, I trained on a random subsample of the IMDB training split rather than on all 25,000 reviews. When I tried to increase the number of epochs beyond 10, CUDA ran out of memory, so I proceeded with the parameters shown below. The resulting validation accuracy is about 50% (50.21% and 50.54% in the two recorded runs), which is no better than chance on a binary task with balanced classes. With more GPU resources, longer training or larger settings could potentially yield better results. Given these constraints, I am considering submitting the model with roughly 50% accuracy.


### Setup

Importing the libraries and modules used for natural language processing (NLP) in this notebook: datasets, transformers, [peft](https://huggingface.co/docs/peft/en/index), [evaluate](https://huggingface.co/docs/evaluate/en/index), [torch](https://pytorch.org/docs/stable/torch.html) and [numpy](https://numpy.org/doc/stable/user/index.html#user).

In [19]:
from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig, 
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np

### Dataset

In [20]:
# loading imdb data
imdb_dataset = load_dataset("imdb")

# defining subsample size
N = 20000
# generating unique indexes for the random subsample (each IMDB split has 25,000 rows)
rand_idx = np.random.choice(25000, size=N, replace=False)

# extracting train and test data
x_train = imdb_dataset['train'][rand_idx]['text']
y_train = imdb_dataset['train'][rand_idx]['label']

x_test = imdb_dataset['test'][rand_idx]['text']
y_test = imdb_dataset['test'][rand_idx]['label']

# creating new dataset
dataset = DatasetDict({'train':Dataset.from_dict({'label':y_train,'text':x_train}),
                             'validation':Dataset.from_dict({'label':y_test,'text':x_test})})

Loads the IMDb dataset and draws a random subsample of 20,000 data points from each of the training and test splits. The subsampled test split serves as the validation set, and both are organized into a new DatasetDict for further processing.
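
As a minimal sketch of the index sampling step, the snippet below contrasts two ways of drawing N indexes: with replacement, where the same review can be picked more than once, and without replacement, where every index is unique. It assumes NumPy's `Generator` API, and the seed of 0 is chosen only to make the sketch reproducible; it is not a setting taken from this notebook.

```python
import numpy as np

N = 20000           # subsample size used in this notebook
SPLIT_SIZE = 25000  # number of rows in each IMDB split

rng = np.random.default_rng(0)  # seed is an assumption, only for reproducibility

# With replacement: the same review index can appear several times
with_repl = rng.integers(0, SPLIT_SIZE, size=N)

# Without replacement: every index is unique
without_repl = rng.choice(SPLIT_SIZE, size=N, replace=False)

print("unique rows with replacement:   ", len(np.unique(with_repl)))
print("unique rows without replacement:", len(np.unique(without_repl)))
```

Sampling without replacement keeps the effective training set at exactly N distinct reviews.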

### Model

Used the [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) model for the assignment.

In [21]:
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative":0, "Positive":1}

# generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=2, id2label=id2label, label2id=label2id)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Initializes a classification model using the [roberta-base](https://huggingface.co/FacebookAI/roberta-base) architecture for binary sentiment classification (negative or positive). It specifies mappings between label IDs and their corresponding labels ("Negative" and "Positive") for model interpretation.

In [22]:
# display architecture
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

### Preprocessing Data

In [23]:
# creating tokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True)

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

Initializes a tokenizer using the Hugging Face library with add_prefix_space=True, so the first word of a text is tokenized with a leading space, like the words that follow it. If the tokenizer lacks a padding token, it adds one, [PAD], and resizes the model's token embeddings accordingly. This ensures consistent tokenization and padding across different model configurations.

In [24]:
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    #tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

The function takes a batch of examples containing text data as input. It tokenizes the text, truncating from the left side (dropping the start of a review and keeping its end) when a review exceeds the maximum length of 512 tokens, and returns the tokenized inputs as NumPy arrays.
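
The effect of the truncation side can be sketched in plain Python. The `truncate` helper below is hypothetical and only illustrates the idea on a list of integers. The real tokenizer also keeps its special start and end tokens, which this sketch ignores.

```python
def truncate(token_ids, max_length, side="right"):
    """Hypothetical helper: keep at most max_length ids from one side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "left":
        # drop tokens from the start, keep the end of the text
        return list(token_ids[len(token_ids) - max_length:])
    # drop tokens from the end, keep the start of the text
    return list(token_ids[:max_length])


ids = list(range(10))  # stand-in for the token ids of a long review
left = truncate(ids, 4, side="left")
right = truncate(ids, 4, side="right")

print(left)   # [6, 7, 8, 9]
print(right)  # [0, 1, 2, 3]
```

Left truncation keeps the closing part of a review, which is the part that survives when a review is longer than 512 tokens.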

In [25]:
# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 20000
    })
})

In [26]:
# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Initializes a data collator, which is used to pad sequences to the maximum length within a batch during training. It utilizes the tokenizer previously defined.
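
Dynamic padding can be sketched without any library. The `pad_batch` helper below is hypothetical and is not the DataCollatorWithPadding implementation. It assumes right-side padding and a pad id of 1, matching the `padding_idx=1` shown in the model printout above. The token ids are made up.

```python
def pad_batch(batch, pad_id=1):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        # real tokens first, padding on the right
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[0, 713, 2], [0, 713, 16, 205, 2]]  # made-up token ids
padded = pad_batch(batch)

print(padded["input_ids"])       # [[0, 713, 2, 1, 1], [0, 713, 16, 205, 2]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding to the longest sequence in each batch, rather than to 512 tokens everywhere, avoids spending memory on padding for short reviews.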

### Evaluation

In [27]:
# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")

In [28]:
# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    # accuracy.compute already returns a dict of the form {"accuracy": value}
    return accuracy.compute(predictions=predictions, references=labels)

Imports the accuracy evaluation metric and defines a function, compute_metrics, which computes the accuracy metric for model predictions. It compares the predicted labels with the actual labels and returns the accuracy value. This function will be passed into the trainer later for evaluation.
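
The metric itself is simple enough to sketch with NumPy alone. The function below is a hypothetical stand-in that does not use the evaluate library, and the logits and labels are made up. It returns a flat dictionary with a single accuracy value.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Hypothetical stand-in for an accuracy metric computed from raw logits."""
    predictions = np.argmax(logits, axis=1)  # index of the largest logit per row
    return {"accuracy": float(np.mean(predictions == labels))}


# made-up scores for four reviews; columns are (Negative, Positive)
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 1.1]])
labels = np.array([0, 1, 0, 0])

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```

Three of the four made-up predictions match their labels, hence 0.75.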

### Train model

In [29]:
peft_config = LoraConfig(task_type="SEQ_CLS",
                        r=4,
                        lora_alpha=32,
                        lora_dropout=0.01)

This code initializes a configuration object for LoRA (Low-Rank Adaptation), specifying hyperparameters such as the task type (sequence classification), the rank of the low-rank update matrices (r=4), the scaling factor (lora_alpha=32), and the dropout rate applied to the LoRA layers (lora_dropout=0.01).
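
To make the roles of r and lora_alpha concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It assumes the common formulation in which the frozen weight W is augmented by a low-rank product B·A scaled by lora_alpha / r. The weights are random stand-ins, the initialization is simplified, and dropout is omitted, so this is an illustration rather than the peft implementation.

```python
import numpy as np

d, r, lora_alpha = 768, 4, 32  # hidden size from the model printout; r and alpha from LoraConfig
rng = np.random.default_rng(0)  # seed is an assumption, only for reproducibility

W = rng.normal(size=(d, d))              # frozen pre-trained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))  # trainable, small random init (simplified)
B = np.zeros((d, r))                     # trainable, zero init so the update starts at zero

scaling = lora_alpha / r  # 32 / 4 = 8.0
x = rng.normal(size=(1, d))

# forward pass: frozen path plus scaled low-rank path
h = x @ W.T + scaling * (x @ A.T @ B.T)

print("trainable LoRA weights per layer:", A.size + B.size)  # 6144
print("frozen weights per layer:        ", W.size)           # 589824
```

Because B starts at zero, the adapted layer initially produces the same output as the frozen layer, and training only has to learn the 6,144 values in A and B instead of all 589,824 values in W.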

In [30]:
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 739,586 || all params: 125,386,756 || trainable%: 0.5898437949858117
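
The reported count can be reproduced with a short calculation. This sketch assumes that the LoRA matrices are attached to the query and value projections in each of the 12 encoder layers, and that the classification head (dense and out_proj) remains fully trainable. Neither detail is stated in the notebook, but together they match the printed number.

```python
hidden, r, layers, num_labels = 768, 4, 12, 2

# LoRA adds A (r x hidden) and B (hidden x r) to each adapted Linear layer
lora_per_module = r * hidden + hidden * r

# assumption: query and value are adapted in every encoder layer
lora_total = lora_per_module * 2 * layers

# classification head: dense (hidden -> hidden) and out_proj (hidden -> num_labels), with biases
head_total = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

trainable = lora_total + head_total
all_params = 125_386_756  # total reported by print_trainable_parameters()

print(lora_total)   # 147456
print(head_total)   # 592130
print(trainable)    # 739586
print(round(100 * trainable / all_params, 4))  # 0.5898
```

Most of the trainable parameters therefore sit in the newly initialized classification head, not in the LoRA adapters.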


In [31]:
# hyperparameters
lr = 0.001
batch_size = 2
num_epochs = 3

In [37]:
# define training arguments
training_args = TrainingArguments(
    output_dir= 'roberta-base' + "-lora-text-classification-output",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Defines training arguments for the model training process. It specifies parameters such as the output directory for saving the trained model, learning rate, batch size for training and evaluation, number of epochs for training, weight decay, evaluation strategy (per epoch), strategy for saving models (per epoch), and whether to load the best model at the end of training.

The code below creates a trainer object for training the model. It utilizes the defined model, training arguments, tokenized training and evaluation datasets, tokenizer, data collator (for padding examples in each batch), and a function for computing evaluation metrics. Then, it initiates the training process by calling the train() method on the trainer object.

In [35]:
# create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator, # This will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics,
)

# train model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.8214 | 0.723342 | 0.4979 |
| 2 | 0.7692 | 0.732342 | 0.5021 |
| 3 | 0.7004 | 0.693324 | 0.5021 |


TrainOutput(global_step=30000, training_loss=0.7908513966878256, metrics={'train_runtime': 10453.1466, 'train_samples_per_second': 5.74, 'train_steps_per_second': 2.87, 'total_flos': 1.0728339444312432e+16, 'train_loss': 0.7908513966878256, 'epoch': 3.0})
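
The step count and throughput in this output follow from the hyperparameters. The sketch below assumes a single device and no gradient accumulation. `total_steps` is a hypothetical helper, not a Trainer function.

```python
import math


def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


steps = total_steps(num_examples=20000, batch_size=2, num_epochs=3)

# samples seen over three epochs divided by the reported train_runtime in seconds
samples_per_second = 20000 * 3 / 10453.1466

print(steps)                         # 30000
print(round(samples_per_second, 2))  # 5.74
```

By the same formula, the 22,500 steps of the second recorded run correspond to 15,000 training examples, assuming the same batch size and number of epochs.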

### Accuracy: 50.21%

In [16]:
# create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator, # This will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics,
)

# train model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.7674 | 0.811388 | 0.5054666666666666 |
| 2 | 0.7312 | 0.719436 | 0.4945333333333333 |
| 3 | 0.71 | 0.702773 | 0.5054666666666666 |


TrainOutput(global_step=22500, training_loss=0.7535486938476562, metrics={'train_runtime': 7588.9126, 'train_samples_per_second': 5.93, 'train_steps_per_second': 2.965, 'total_flos': 8080859342571024.0, 'train_loss': 0.7535486938476562, 'epoch': 3.0})

### Accuracy: 50.54%

### Generate prediction

In [17]:
text_list = ["It was good.", "Not a fan, don't recommed.", "Better than the first one.", "This is not worth watching even once.", "This one is a pass."]

In [18]:
model.to('cuda')
model.eval()  # switch off dropout for inference

print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to("cuda")

    # gradients are not needed for prediction
    with torch.no_grad():
        logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])

Trained model predictions:
--------------------------
It was good. - Positive
Not a fan, don't recommed. - Positive
Better than the first one. - Positive
This is not worth watching even once. - Positive
This one is a pass. - Positive


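This code moves the model to the GPU for faster computation. It then iterates through a list of texts, encodes each text with the tokenizer, and passes the encoded inputs to the model for prediction. Finally, it prints the predicted label for each text, chosen as the class with the highest logit. In the output above, all five texts are labeled Positive, including the clearly negative ones, which is consistent with the roughly 50% validation accuracy.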