# Lightweight Fine-Tuning Project

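This notebook makes the following choices:

* PEFT technique: LoRA (rank 8, alpha 32, dropout 0.1) applied to the attention projections `q_lin`, `k_lin`, `v_lin`, and `out_lin`
* Model: `medicalai/ClinicalBERT`, loaded with `AutoModelForSequenceClassification` and a 10-class classification head
* Evaluation approach: accuracy on a 1,000-example subset of the test split, computed with `Trainer.evaluate()` before and after fine-tuning
* Fine-tuning dataset: `ninaa510/diagnosis-text`, subsampled to 5,000 training examples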

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [8]:
from datasets import load_dataset

In [9]:
# Getting a medical diagnosis dataset with 10 different diseases as labels
# ds = "Deysi/spam-detection-dataset"
ds = "ninaa510/diagnosis-text"
orig_dataset = load_dataset(ds)

In [10]:
# Manual mapping from string labels to integer ids.
# It can be inverted to recover the disease name during inference at the end.
label_to_int = {
    'Allergic sinusitis': 0,
    'Anaphylaxis': 1,
    'Chagas': 2,
    'HIV (initial infection)': 3,
    'Influenza': 4,
    'Localized edema': 5,
    'SLE': 6,
    'Sarcoidosis': 7,
    'Tuberculosis': 8,
    'Whooping cough': 9,
}


In [11]:
from datasets import ClassLabel

# Get the unique text labels, in sorted order, to define the ClassLabel feature
unique_labels = sorted(set(orig_dataset["train"]["label"]))
print(f"Original text labels: {unique_labels}\n")

Original text labels: ['Allergic sinusitis', 'Anaphylaxis', 'Chagas', 'HIV (initial infection)', 'Influenza', 'Localized edema', 'SLE', 'Sarcoidosis', 'Tuberculosis', 'Whooping cough']



In [12]:
# Define the new features, including the ClassLabel for the 'label' column for integer mapping

new_features = orig_dataset['train'].features.copy()
new_features['label'] = ClassLabel(names=unique_labels)

new_features

{'label': ClassLabel(names=['Allergic sinusitis', 'Anaphylaxis', 'Chagas', 'HIV (initial infection)', 'Influenza', 'Localized edema', 'SLE', 'Sarcoidosis', 'Tuberculosis', 'Whooping cough']),
 'sentence1': Value('string')}
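
The integer ids that `ClassLabel` assigns can be reproduced in plain Python as a sanity check. The sketch below assumes, as in the cell above, that the names are passed in sorted order; `names`, `str_to_int`, and `int_to_str` are illustrative variables, not part of the `datasets` API.

```python
# Label names copied from the printed output above.
names = [
    "Allergic sinusitis", "Anaphylaxis", "Chagas", "HIV (initial infection)",
    "Influenza", "Localized edema", "SLE", "Sarcoidosis", "Tuberculosis",
    "Whooping cough",
]

# Each name gets the id of its position in the list.
str_to_int = {name: idx for idx, name in enumerate(names)}
int_to_str = {idx: name for name, idx in str_to_int.items()}

# The list is already in Python's default sort order, which is why the
# ids agree with the hand-written label_to_int dictionary.
assert names == sorted(names)

print(str_to_int["SLE"])  # 6
print(int_to_str[8])      # Tuberculosis
```

Note that uppercase letters sort before lowercase ones, so `SLE` precedes `Sarcoidosis`.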

In [14]:
# Cast the dataset columns to the new feature types
# This automatically handles the conversion from string to integer based on the ClassLabel.
dataset = orig_dataset.cast(new_features)

dataset['train'][11979]

{'label': 9,
 'sentence1': 'As a 53-year-old female, my main symptoms include vomiting and coughing, along with diarrhea, pain in the upper abdomen described as a knife stroke, skin lesions or rashes on my right ankle, nausea, shortness of breath, swelling in my nose area, and an allergic reaction. The rash is pink in color.'}

In [15]:
# The 'label' column will contain integers instead of strings.
# You can see the mapping by inspecting the features of the new dataset.

print("Features of the new dataset:")
print(dataset["train"].features)

print("\nFirst 5 examples with integer labels:")
print(dataset["train"][:5])

# You can still access the original text labels by using int2str()
label_feature = dataset["train"].features["label"]
print(f"\nLabel mapping (integer to string): {label_feature.int2str(8)}")

Features of the new dataset:
{'label': ClassLabel(names=['Allergic sinusitis', 'Anaphylaxis', 'Chagas', 'HIV (initial infection)', 'Influenza', 'Localized edema', 'SLE', 'Sarcoidosis', 'Tuberculosis', 'Whooping cough']), 'sentence1': Value('string')}

First 5 examples with integer labels:
{'label': [0, 0, 0, 0, 0], 'sentence1': ['I am a 12-year-old male with an itchy nose, sharp upper abdominal pain, pink skin lesions or rashes on my neck, swelling in my nose, had an allergic reaction, and experiencing high-pitched breathing sounds, lightheadedness, and wheezing on exhale.', 'I am suffering from eye itching with pain, skin rashes, allergic reactions, shortness of breath, swelling, loss of consciousness, high-pitched breathing, and wheezing on exhale.', 'I am a 20-year-old female suffering from an itchy nose, heavy pain in my left top of the foot, and swelling in my right sole.', 'I am a 68-year-old male with eye itching as the main symptom, accompanied by shortness of breath, cough, an

In [16]:
set(dataset["train"]['label'])
# type(dataset["train"])
# dataset['label']

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

In [17]:
# Check the number of samples of each label in the train dataset
import datasets
import pandas as pd

pandas_df = dataset["train"].to_pandas()

# Perform groupby operation using pandas
grouped_df = pandas_df.groupby("label")["sentence1"].count()

grouped_df

label
0    1331
1    1331
2    1331
3    1331
4    1331
5    1331
6    1331
7    1331
8    1331
9    1331
Name: sentence1, dtype: int64

In [18]:
dataset["train"]
splits = ["train", "test"]

In [19]:
# Taking only 5000 from train and 1000 from test splits
dataset["train"] = dataset["train"].shuffle(seed = 41).select(range(5000))
dataset["test"] = dataset["test"].shuffle(seed = 41).select(range(1000))

In [20]:
# Check the label counts after subsetting to confirm the classes are still roughly balanced
pandas_df = dataset["train"].to_pandas()

# Perform groupby operation using pandas
grouped_df = pandas_df.groupby("label")["sentence1"].count()

grouped_df

label
0    504
1    524
2    495
3    484
4    519
5    485
6    496
7    494
8    500
9    499
Name: sentence1, dtype: int64
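
To put a number on "evenly spread", the counts above can be compared against a perfectly even split. This is a small sketch using only the printed values; the variable names are illustrative.

```python
# Per-label counts copied from the output above (label id -> examples).
counts = {0: 504, 1: 524, 2: 495, 3: 484, 4: 519,
          5: 485, 6: 496, 7: 494, 8: 500, 9: 499}

total = sum(counts.values())
expected = total / len(counts)  # examples per class under an even split

# Relative deviation of each class from the even split, in percent.
deviation_pct = {label: 100 * (n - expected) / expected
                 for label, n in counts.items()}

largest = max(deviation_pct.values(), key=abs)
print(total, expected)                         # 5000 500.0
print(f"largest deviation: {largest:+.1f}%")   # largest deviation: +4.8%
```

No class is more than about 5% away from the even split, so plain accuracy remains a reasonable metric for this subset.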

In [21]:
type(dataset["train"])

datasets.arrow_dataset.Dataset

In [22]:
dataset["train"]
dataset["test"]

Dataset({
    features: ['label', 'sentence1'],
    num_rows: 1000
})

In [23]:
dataset["train"][0]
# max(dataset["train"]["label"])
# min(dataset["train"]["label"])

{'label': 4,
 'sentence1': 'My main symptom is pain accompanied by an itchy nose or throat and nasal congestion.'}

In [24]:
# Choose the pre-trained checkpoint; the commented-out names are alternative checkpoints
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# model_name = "distilbert-base-uncased-finetuned-sst-2-english"
# model_name = "distilbert-base-uncased"
# model_name = "roberta-base"
model_name = "medicalai/ClinicalBERT"


In [25]:
# from transformers import DistilBertTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize every example, padding or truncating each one to 512 tokens
def tokenize_function(examples):
    return tokenizer(examples["sentence1"], padding="max_length", truncation=True, max_length=512)

tokenized_ds = dataset.map(tokenize_function, batched=True)

tokenized_ds["train"]    

Dataset({
    features: ['label', 'sentence1', 'input_ids', 'attention_mask'],
    num_rows: 5000
})

In [18]:
# tokenized_ds["train"][0]

In [19]:
# tokenized_ds["test"][0]['input_ids']
len(tokenized_ds["train"][10]['input_ids'])
# tokenized_ds["train"][0]

512
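
Every example is padded to 512 tokens here, even though the symptom descriptions are short. The sketch below illustrates the cost of fixed padding compared with padding each batch to its longest row. The token counts are made up for illustration and are not measured from this dataset.

```python
# Hypothetical token counts for one batch of 10 short symptom descriptions.
# These values are assumptions, not measurements.
token_lengths = [48, 35, 61, 27, 54, 40, 72, 33, 58, 45]

MAX_LENGTH = 512  # value used in tokenize_function above

# padding="max_length": every row is padded to MAX_LENGTH.
fixed_tokens = MAX_LENGTH * len(token_lengths)

# Dynamic padding: every row is padded only to the longest row in the batch.
dynamic_tokens = max(token_lengths) * len(token_lengths)

print(fixed_tokens, dynamic_tokens)  # 5120 720
print(f"fixed padding processes {fixed_tokens / dynamic_tokens:.1f}x as many positions")
```

`DataCollatorWithPadding` pads each batch to its longest row by default, so omitting `padding="max_length"` from the tokenizer call would likely shorten training and evaluation without changing the model inputs that matter.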

In [20]:
# num_labels = 2
# id2label = {0: 'not_spam', 1: 'spam'}
# label2id = {v: k for k,v in id2label.items()}

In [21]:
from transformers import AutoModelForSequenceClassification
# import tensorflow

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=10,
#     from_tf = True
#     id2label=id2label,
#     label2id=label2id,
    ignore_mismatched_sizes=True

)

# Freeze the base (encoder) parameters; the newly initialized classification head stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at medicalai/ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# print(model)

In [23]:
from datetime import datetime as dt
output_dir="./results/bert_" + dt.now().strftime("%m-%d_%H:%M:%S")
print(output_dir)

./results/bert_10-02_20:45:51


In [25]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

model_args=TrainingArguments(
    output_dir=output_dir,
    learning_rate = 2e-5, 
    per_device_train_batch_size = 10, 
    per_device_eval_batch_size = 10, 
    eval_strategy = "epoch", 
    save_strategy = "epoch", 
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    remove_unused_columns=True,

)

trainer = Trainer(
    model=model,
    args=model_args, 
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)


In [24]:
# trainer.train()

In [26]:
trainer.evaluate()



{'eval_loss': 2.27618145942688,
 'eval_model_preparation_time': 0.0009,
 'eval_accuracy': 0.126,
 'eval_runtime': 19.406,
 'eval_samples_per_second': 51.531,
 'eval_steps_per_second': 5.153}
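
For context, the baseline accuracy can be compared with what uniform random guessing over 10 classes would give. The sketch below uses a binomial approximation for the standard error; it is a rough yardstick, not a formal significance test.

```python
import math

num_classes = 10      # labels 0..9
num_eval = 1000       # size of the test subset
baseline_acc = 0.126  # eval_accuracy reported above

# Expected accuracy of a classifier that guesses uniformly at random.
chance_acc = 1 / num_classes

# Standard error of an accuracy estimate under pure guessing.
std_err = math.sqrt(chance_acc * (1 - chance_acc) / num_eval)

print(f"chance accuracy: {chance_acc:.3f}")
print(f"standard error:  {std_err:.4f}")
print(f"baseline is {(baseline_acc - chance_acc) / std_err:.1f} standard errors above chance")
```

With a randomly initialized classification head, an accuracy this close to 0.10 is what one would expect before any fine-tuning.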

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

###  ⚠️ IMPORTANT ⚠️

Because workspace storage is limited, do not save model weights in the working directory. Filling the workspace can cause a crash that cannot be recovered.
Always save the weights under `/tmp` instead.

In [27]:
from peft import LoraConfig

# Apply LoRA to the attention projections only; the feed-forward
# layers (lin1, lin2) are left unadapted.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,
    lora_alpha=32,
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
    lora_dropout=0.1,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=10,
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at medicalai/ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
from peft import get_peft_model
lora_model = get_peft_model(model, lora_config)

In [29]:
# print(lora_model)

In [29]:
lora_model.print_trainable_parameters()

trainable params: 893,194 || all params: 136,225,556 || trainable%: 0.6557
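
The trainable-parameter count can be reconstructed by hand from the LoRA configuration and the layer shapes shown in the model printout later in this notebook (6 transformer blocks, 768-dimensional projections). The sketch assumes that, with `task_type="SEQ_CLS"`, the `pre_classifier` and `classifier` layers are trainable alongside the LoRA matrices; the matching total is consistent with that assumption.

```python
hidden = 768            # in/out features of each attention projection
rank = 8                # r in LoraConfig
num_layers = 6          # "6 x TransformerBlock" in the model printout
targets_per_layer = 4   # q_lin, k_lin, v_lin, out_lin
num_labels = 10

# Each adapted Linear gains A (rank x in) and B (out x rank), without biases.
lora_params = num_layers * targets_per_layer * rank * (hidden + hidden)

# Assumption: the classification head is trainable as well.
pre_classifier = hidden * hidden + hidden      # weight + bias
classifier = hidden * num_labels + num_labels  # weight + bias

trainable = lora_params + pre_classifier + classifier
all_params = 136_225_556  # copied from the printout above

print(lora_params, pre_classifier, classifier)  # 294912 590592 7690
print(trainable)                                # 893194
print(f"{100 * trainable / all_params:.4f}%")   # 0.6557%
```

Only about a third of the trainable parameters are LoRA matrices; most of the rest belong to the `pre_classifier` layer.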


In [31]:
from datetime import datetime as dt
output_dir="./results/bert_lora_" + dt.now().strftime("%m-%d_%H:%M:%S")
print(output_dir)

./results/bert_lora_10-02_20:49:18


In [33]:
# import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
# 5 or 10 epochs is fine
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

lora_train_args=TrainingArguments(
        output_dir=output_dir,
        learning_rate = 2e-5, 
        per_device_train_batch_size = 10, 
        per_device_eval_batch_size = 10, 
        eval_strategy = "epoch", 
        save_strategy = "epoch", 
        num_train_epochs=5,
        weight_decay=0.01,
        load_best_model_at_end=True
    )

lora_trainer = Trainer(
    model=lora_model,
    args = lora_train_args, 
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

lora_trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 2.3067 | 2.294745 | 0.147 |
| 2 | 2.2065 | 1.873657 | 0.398 |
| 3 | 1.6755 | 1.522989 | 0.463 |
| 4 | 1.4846 | 1.416093 | 0.5 |
| 5 | 1.4236 | 1.387991 | 0.506 |




TrainOutput(global_step=2500, training_loss=1.819380810546875, metrics={'train_runtime': 1733.1732, 'train_samples_per_second': 14.424, 'train_steps_per_second': 1.442, 'total_flos': 3380754739200000.0, 'train_loss': 1.819380810546875, 'epoch': 5.0})
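
The step count and throughput in `TrainOutput` follow directly from the training arguments. The arithmetic below assumes a single device and no gradient accumulation, so that one batch corresponds to one optimizer step.

```python
import math

train_examples = 5000
batch_size = 10            # per_device_train_batch_size
epochs = 5                 # num_train_epochs
train_runtime = 1733.1732  # seconds, from TrainOutput

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
global_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / train_runtime
steps_per_second = global_steps / train_runtime

print(global_steps)                  # 2500
print(round(samples_per_second, 3))  # 14.424
print(round(steps_per_second, 3))    # 1.442
```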

In [34]:
lora_trainer.evaluate()



{'eval_loss': 1.387990951538086,
 'eval_accuracy': 0.506,
 'eval_runtime': 22.5722,
 'eval_samples_per_second': 44.302,
 'eval_steps_per_second': 4.43,
 'epoch': 5.0}
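
A quick comparison of the two `evaluate()` results, using only the accuracies reported in this notebook:

```python
baseline_acc = 0.126  # before fine-tuning (frozen base, untrained head)
lora_acc = 0.506      # after 5 epochs of LoRA fine-tuning
num_eval = 1000       # size of the test subset

absolute_gain = lora_acc - baseline_acc
relative_gain = lora_acc / baseline_acc

print(f"absolute gain: {absolute_gain:.3f}")   # absolute gain: 0.380
print(f"relative gain: {relative_gain:.1f}x")  # relative gain: 4.0x
print(f"additional correct predictions: {round(absolute_gain * num_eval)}")  # 380
```

Because the same 1,000 examples are also used to select the best checkpoint (`load_best_model_at_end=True`), the post-training figure may be slightly optimistic. The validation loss was still falling at epoch 5, which suggests that more epochs could improve accuracy further.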

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [35]:
# Save the PEFT adapter weights and config (the base model weights are not duplicated)
lora_model.save_pretrained("lora_peft_model")

In [37]:
print(lora_model)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(119547, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): DistilBertSdpaAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=7

In [3]:
from peft import AutoPeftModelForSequenceClassification

# Load the base model named in the adapter config, then attach the saved LoRA weights
lora_peft = AutoPeftModelForSequenceClassification.from_pretrained(
    "lora_peft_model",
    num_labels=10,
    ignore_mismatched_sizes=True,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at medicalai/ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
print(lora_peft)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(119547, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): DistilBertSdpaAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=7

In [26]:
from transformers import AutoTokenizer 
 
model_name = "medicalai/ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [27]:
# get samples from the tokenized dataset of test split and check some predictions
samples = tokenized_ds["test"]
sample = samples[100] # Predicted correct
# sample = samples[0] # Predicted correct
# sample = samples[50] # Predicted wrong

print(sample["sentence1"], " : ", sample['label']
      , " : ", label_feature.int2str(int(sample['label'])))

At 38 years old and male, I am experiencing an itchy nose as my main symptom, accompanied by pain, skin lesions or rashes, shortness of breath, and mouth ulcers or sores. The sharp pain is located in my left shoulder, and the rash is red on my nose.  :  0  :  Allergic sinusitis


In [29]:
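# Predict the label for the selected sample
import torch

inputs = tokenizer(sample["sentence1"], return_tensors="pt", truncation=True, max_length=512)
lora_peft.eval()  # disable dropout for inference
with torch.no_grad():  # gradients are not needed for prediction
    outputs = lora_peft(**inputs)
predicted_label_id = int(outputs.logits[0].argmax())
predicted_label = label_feature.int2str(predicted_label_id)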

print("Text:", sample["sentence1"])
print("Predicted label:", predicted_label)

Text: At 38 years old and male, I am experiencing an itchy nose as my main symptom, accompanied by pain, skin lesions or rashes, shortness of breath, and mouth ulcers or sores. The sharp pain is located in my left shoulder, and the rash is red on my nose.
Predicted label: Allergic sinusitis
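
The prediction above takes the arg-max of the raw logits. If a confidence score is also wanted, the logits can be passed through a softmax first. The sketch below shows this in plain Python with hypothetical logits; the values are not taken from the model above.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the 10 classes (illustrative values only).
example_logits = [2.1, 0.3, -0.5, 0.0, 1.2, -1.0, 0.4, -0.2, 0.1, -0.8]

probs = softmax(example_logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)

print(predicted_id)  # 0
print(f"confidence: {probs[predicted_id]:.2f}")
```

Softmax preserves the ordering of the logits, so the predicted class is the same either way; it only adds a probability-like score alongside the label.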
