<a href="https://colab.research.google.com/github/AashiDutt/LLM-Projects/blob/main/Sentiment_Analysis_FineTuning_LLMs_with_LoRA_HF_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# -What is fine-tuning?
Fine-tuning means taking a pre-trained model and training at least one of its parameters further.

# -There are three ways to fine-tune an LLM

1. Self-supervised learning - continue training the base model on a raw text corpus, with no labels
2. Supervised learning - fine-tune the base model on a dataset of inputs paired with labels
3. Reinforcement learning (RLHF) - combine supervised fine-tuning with training a reward model, then optimize the model with RL (PPO)

# -Options for parameter training

1. Retrain all parameters (roughly 66M parameters in DistilBERT, the base model used in this repo)
2. Transfer learning - freeze almost all parameters and fine-tune only the head
3. Parameter-Efficient Fine-Tuning (PEFT) - used in this repo -> freeze all the base weights and augment the model with a small set of additional trainable parameters

## -Method used in this Repo
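The parameter training method used in this repo is "***LoRA - Low-Rank Adaptation***", applied through the PEFT library. LoRA fine-tunes the model by adding a small number of new trainable parameters. Their count is controlled by the rank r (the "intrinsic rank"), a hyperparameter set later in the LoRA config.

Hidden layer before LoRA: h(x) = W0*x

Hidden layer after LoRA: h(x) = W0*x + delta(W)*x, where W0 is frozen and delta(W) = B*A is the product of two small trainable matrices of rank r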
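
To make the update concrete, here is a minimal NumPy sketch of a single LoRA-augmented layer. It assumes delta(W) is factored as B*A and scaled by lora_alpha / r, with B initialized to zero; the random W0, the initialization scale of A, and the function name `lora_forward` are illustrative stand-ins, not the PEFT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# hidden size of DistilBERT-base, plus the rank and alpha used later in this notebook
d, r, alpha = 768, 4, 32

W0 = rng.normal(size=(d, d))             # frozen pre-trained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))  # trainable, small random init (assumed)
B = np.zeros((d, r))                     # trainable, zero init so delta(W) starts at 0

def lora_forward(x):
    """h(x) = W0 x + (alpha / r) * B A x"""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
print(np.allclose(lora_forward(x), W0 @ x))  # True: the layer is unchanged before training
print(W0.size, A.size + B.size)              # 589824 6144
```

Only A and B would receive gradients, which is why this layer trains 6,144 values instead of 589,824.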




---


**Note: Turn on the GPU for faster training**

In [2]:
!pip install datasets --quiet
!pip install evaluate --quiet
!pip install peft --quiet
!pip install accelerate==0.25.0 --quiet


In [3]:
# imports
from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig

import evaluate
import torch
import numpy as np



In [4]:
# Choose a base model

model_checkpoint = 'distilbert-base-uncased'

# define label maps
index2label = {0:'Negative', 1:'Positive'}
label2index = {'Negative': 0, 'Positive': 1}

# generate classification model from checkpoint

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels = 2, id2label = index2label,label2id = label2index )




Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# dataset
dataset = load_dataset("shawhin/imdb-truncated")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})

In [6]:
# preprocess the data

# Tokenize data
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space = True)

def tokenize_fun(examples):
  text = examples["text"] # text comes from dataset

  tokenizer.truncation_side = "left" # truncate from the left, so over-long reviews keep their final tokens
  tokenized_inputs = tokenizer(
      text,
      return_tensors = "np", # return numpy tensors
      truncation = True,
      max_length = 512
  )
  return tokenized_inputs

# add pad token
if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token':'[PAD]'})
  model.resize_token_embeddings(len(tokenizer))

tokenized_dataset = dataset.map(tokenize_fun, batched = True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})
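
The effect of `truncation_side` can be sketched in plain Python. The helper below is a hypothetical illustration on whitespace-split words; the real tokenizer works on subword IDs and also keeps its special tokens, which this sketch ignores.

```python
def truncate(tokens, max_length, side="right"):
    """Keep at most max_length tokens, dropping the excess from the given side."""
    if len(tokens) <= max_length:
        return tokens
    # "left" drops the beginning and keeps the end; "right" does the opposite
    return tokens[-max_length:] if side == "left" else tokens[:max_length]

review = "the first half dragged but the ending was great".split()
print(truncate(review, 4, side="right"))  # ['the', 'first', 'half', 'dragged']
print(truncate(review, 4, side="left"))   # ['the', 'ending', 'was', 'great']
```

For reviews, the closing sentences often carry the verdict, which is one plausible reason to truncate from the left.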

In [7]:
data_collator = DataCollatorWithPadding(tokenizer = tokenizer) # helps to pad small sequences dynamically
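
Dynamic padding means each batch is padded only to the length of its own longest sequence, rather than to a fixed global length. A rough pure-Python sketch of the idea follows; `pad_batch` and the token IDs are illustrative, not the collator's actual code.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 2009, 102], [101, 2009, 2001, 2204, 102]]  # illustrative token IDs
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 2009, 102, 0, 0], [101, 2009, 2001, 2204, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```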

In [8]:
# evaluation metrics

accuracy = evaluate.load("accuracy")

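def compute_metrics(p): # evaluation function to pass into the Trainer later
  predictions, labels = p # p holds the model logits and the reference labels
  predictions = np.argmax(predictions, axis = 1) # highest-scoring class per example
  # accuracy.compute() already returns {"accuracy": value}, so this wraps one dict inside another
  return {"accuracy": accuracy.compute(predictions = predictions, references = labels)}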

In [9]:
# untrained model performance

text_list = ["It was good.", "Not a fan, don't recommend", "Better than first one"]
print("Untrained model performance")
print("---------------------------")
for text in text_list:
  inputs = tokenizer.encode(text, return_tensors = "pt")
  logits = model(inputs).logits
  predictions= torch.argmax(logits)

  print(text + " - " + index2label[predictions.tolist()])

Untrained model performance
---------------------------
It was good. - Negative
Not a fan, don't recommend - Negative
Better than first one - Negative


In [10]:
# fine-tuning DistilBERT to do sentiment analysis

peft_config = LoraConfig(task_type = "SEQ_CLS", # sequence classification
                         r = 4, # rank of the trainable low-rank update matrices
                         lora_alpha = 32, # scaling factor: the LoRA update is scaled by lora_alpha / r
                         lora_dropout = 0.01, # dropout probability on the LoRA layers
                         target_modules = ['q_lin'] # apply LoRA to the attention query projections
                         )

In [11]:
# wrap the base model with the LoRA adapters
model = get_peft_model(model,peft_config)
model.print_trainable_parameters()

trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9306847223789819
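
The trainable count printed above can be reproduced by hand. The breakdown below is an inference from the numbers rather than something the output states: it assumes LoRA adds an r x 768 and a 768 x r matrix to the `q_lin` layer in each of DistilBERT's 6 transformer blocks, and that the `pre_classifier` and `classifier` layers also remain trainable.

```python
hidden, r, n_layers, n_labels = 768, 4, 6, 2  # DistilBERT-base sizes and the LoRA rank

lora_per_layer = r * hidden + hidden * r      # A (r x hidden) plus B (hidden x r) for one q_lin
lora_total = n_layers * lora_per_layer

pre_classifier = hidden * hidden + hidden     # weight + bias
classifier = hidden * n_labels + n_labels     # weight + bias

trainable = lora_total + pre_classifier + classifier
print(lora_total, pre_classifier, classifier, trainable)  # 36864 590592 1538 628994
print(f"{100 * trainable / 67_584_004:.4f}%")             # share of all parameters
```

Under these assumptions the LoRA matrices themselves account for only 36,864 of the trainable parameters; most of the rest is the classification head.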


In [12]:
lr = 1e-3
batch_size = 4
num_epochs = 10

training_args = TrainingArguments(
    output_dir = model_checkpoint + "-lora-text-classification",
    learning_rate = lr,
    per_device_train_batch_size = batch_size,
    num_train_epochs = num_epochs,
    weight_decay = 0.01,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    load_best_model_at_end = True,

)
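
As a quick sanity check on these settings, the number of optimizer steps can be worked out by hand. This sketch assumes a single device and no gradient accumulation.

```python
import math

train_examples = 1000  # rows in the train split
batch_size = 4
num_epochs = 10

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 250 2500
```

This agrees with the `global_step=2500` reported by `trainer.train()`. Since `TrainingArguments` logs every 500 steps by default, the training loss would be recorded once every two epochs, which likely explains the "No log" entry for epoch 1.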

In [13]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["validation"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | No log | 0.413647 | 0.873 |
| 2 | 0.412000 | 0.50615 | 0.871 |
| 3 | 0.412000 | 0.500105 | 0.882 |
| 4 | 0.193300 | 0.660631 | 0.887 |
| 5 | 0.193300 | 0.90559 | 0.872 |
| 6 | 0.059600 | 0.935455 | 0.885 |
| 7 | 0.059600 | 0.974872 | 0.886 |
| 8 | 0.022400 | 1.013547 | 0.889 |
| 9 | 0.022400 | 1.087178 | 0.89 |
| 10 | 0.005700 | 1.109543 | 0.89 |


Trainer is attempting to log a value of "{'accuracy': 0.873}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

TrainOutput(global_step=2500, training_loss=0.13859124555587768, metrics={'train_runtime': 477.3194, 'train_samples_per_second': 20.95, 'train_steps_per_second': 5.238, 'total_flos': 1113026652407424.0, 'train_loss': 0.13859124555587768, 'epoch': 10.0})
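
The TensorBoard warning in the output above appears because the metric value passed to the Trainer is itself a dict (`{'accuracy': ...}`) rather than a number. One possible variant that returns a plain float is sketched below; `compute_metrics_scalar` is a hypothetical name, and it computes accuracy directly with NumPy instead of the `evaluate` metric so that it runs on its own.

```python
import numpy as np

def compute_metrics_scalar(p):
    """Return accuracy as a plain float so it can be logged as a scalar."""
    logits, labels = p
    predictions = np.argmax(logits, axis=1)  # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}

# tiny made-up batch: 3 of the 4 predictions match the labels
logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
labels = np.array([0, 1, 1, 1])
print(compute_metrics_scalar((logits, labels)))  # {'accuracy': 0.75}
```

With the `evaluate` metric, returning the result of `accuracy.compute(...)` directly should have the same effect, since that call already yields a dict keyed by `"accuracy"`.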

In [19]:
model.to('cuda') # move the model to the GPU ('cuda' is for NVIDIA GPUs; on an Apple-silicon Mac use 'mps')

text_list = ["It was good.", "Not a fan, don't recommend", "Better than first one"]
print("trained model performance")
print("---------------------------")
for text in text_list:
  inputs = tokenizer.encode(text, return_tensors = "pt").to("cuda")
  logits = model(inputs).logits
  predictions= torch.max(logits, 1).indices

  print(text + " - " + index2label[predictions.tolist()[0]])

trained model performance
---------------------------
It was good. - Positive
Not a fan, don't recommend - Negative
Better than first one - Positive
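
The labels above come from taking the largest logit. To also see how confident a prediction is, the logits can be passed through a softmax. The sketch below uses made-up logits and plain Python; it is not output from this model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

index2label = {0: "Negative", 1: "Positive"}
logits = [-1.2, 2.3]  # made-up scores for one review
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(index2label[best], round(probs[best], 3))
```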
