# Lightweight Fine-Tuning Project

This notebook demonstrates a computationally efficient way to adapt an open-source pretrained model to a specific use case. The demonstrated task is sentiment classification, but the same workflow applies to other tasks for which HuggingFace provides task-specific model classes.

* PEFT technique: LoRA (Low-Rank Adaptation)
* Model: [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
* Evaluation approach: Accuracy and Inference Sanity Checking
* Fine-tuning dataset: [stanfordnlp/sentiment140](https://huggingface.co/datasets/stanfordnlp/sentiment140)

In [1]:
import torch
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, pipeline
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

## Loading the BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based encoder designed for natural language understanding tasks. It is not pretrained for sentiment analysis specifically, but it is **well suited to fine-tuning for classification tasks such as sentiment analysis**. It reads text bidirectionally, meaning it considers the context on both the left and the right side of a word when building that word's representation.

## Sentiment140 Dataset from StanfordNLP
Sentiment140 is a good dataset for demonstrating and building sentiment analysis models for several reasons:

**Size:** It contains 1.6 million labeled tweets, providing a large amount of data for training robust models.<br>
**Real-world Data:** The dataset consists of tweets, which are real-world, noisy text data, making it a good representation of actual user-generated content.<br>
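**Binary Training Labels:** The training split has only two sentiment labels, negative (0) and positive (4), which simplifies the classification task. The small test split additionally contains neutral tweets (2).<br>
**Lightly Preprocessed:** The emoticons used to label the training tweets automatically were stripped from the text. Other tweet noise, such as usernames and hashtags, is still present, as the sample predictions at the end of this notebook show.<br>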
**Publicly Available:** It is freely available, allowing easy access for experimentation and learning.

In [2]:
# Check if CUDA is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Specify the model name
model_name = "bert-base-uncased"

# Load the model and tokenizer, adapt model output to number of classes
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the Sentiment140 dataset
dataset = load_dataset("sentiment140")

# Use a subset of the training dataset for quicker experimentation
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(10000))
test_dataset = dataset["test"]  # only contains 498 rows

# Define the mapping for the labels that are kept (4 = positive -> 1, 0 = negative -> 0).
# The test split also contains "neutral" rows (label 2); these are filtered out,
# since the training split only has positive and negative labels
mapping = {4: 1, 0: 0}

# Function to filter and map sentiment values
def filter_and_map_sentiment(example):
    if example['sentiment'] in mapping:
        example['sentiment'] = mapping[example['sentiment']]
        return example
    return None

# Apply the filtering and mapping to the training dataset
filtered_train_dataset = small_train_dataset.filter(lambda x: x['sentiment'] in mapping)
filtered_train_dataset = filtered_train_dataset.map(filter_and_map_sentiment)

# Apply the mapping to the test dataset
test_dataset = test_dataset.filter(lambda x: x['sentiment'] in mapping)
test_dataset = test_dataset.map(filter_and_map_sentiment)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading data:   0%|          | 0.00/81.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1600000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/498 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/498 [00:00<?, ? examples/s]

Map:   0%|          | 0/359 [00:00<?, ? examples/s]
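The filter-then-map step above can be illustrated without the `datasets` library. The sketch below uses a handful of made-up records (hypothetical, not taken from Sentiment140) to show why filtering has to come before mapping: rows labeled 2 have no entry in `mapping`.

```python
# Toy records standing in for dataset rows (made-up examples, not real tweets).
rows = [
    {"text": "love it", "sentiment": 4},
    {"text": "it is a phone", "sentiment": 2},
    {"text": "hate it", "sentiment": 0},
    {"text": "best day ever", "sentiment": 4},
]

mapping = {4: 1, 0: 0}

# Step 1: keep only rows whose label has a target class (drops the neutral row).
kept = [row for row in rows if row["sentiment"] in mapping]

# Step 2: rewrite the surviving labels to the 0/1 scheme.
remapped = [{**row, "sentiment": mapping[row["sentiment"]]} for row in kept]

print(len(rows), "->", len(remapped))                   # 4 -> 3
print(sorted({row["sentiment"] for row in remapped}))   # [0, 1]
```

In the run shown above, the same filter reduces the test split from 498 to 359 rows, while the 10,000-row training sample keeps all of its rows.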

Check that the **neutral** sentiment rows have been removed and that only the labels 0 and 1 remain in both splits

In [3]:
# Verify the changes

# Print the unique values of the 'sentiment' list in the filtered_train_dataset
unique_train_sentiments = list(set(filtered_train_dataset['sentiment']))
print("Unique sentiments in the training dataset:", unique_train_sentiments)

# Print the unique values of the 'sentiment' list in the test_dataset
unique_test_sentiments = list(set(test_dataset['sentiment']))
print("Unique sentiments in the test dataset:", unique_test_sentiments)

Unique sentiments in the training dataset: [0, 1]
Unique sentiments in the test dataset: [0, 1]


In [4]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_train_dataset = filtered_train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare the dataset for training - standardize the column names and set the format to PyTorch tensors
tokenized_train_dataset = tokenized_train_dataset.rename_column("sentiment", "labels")
tokenized_test_dataset = tokenized_test_dataset.rename_column("sentiment", "labels")
tokenized_train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/359 [00:00<?, ? examples/s]
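With `padding="max_length"` and no explicit `max_length`, the tokenizer pads every example to the model's maximum length, which is 512 tokens for bert-base-uncased. Tweets are much shorter, so most positions end up as padding. The sketch below mimics this in plain Python; the token ids and the helper `pad_batch` are made up for illustration and are not part of the tokenizer API.

```python
PAD_ID = 0  # BERT's padding token id (matches padding_idx=0 in the embedding layer)

def pad_batch(batch, target_len):
    """Pad each id list to target_len and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:target_len]  # truncate sequences that are too long
        n_pad = target_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Made-up token ids for three short "tweets".
batch = [[101, 2293, 102], [101, 2009, 2003, 2307, 102], [101, 102]]

# Fixed-length padding, as in the cell above.
fixed_ids, fixed_mask = pad_batch(batch, target_len=512)

# Padding only up to the longest sequence in the batch.
longest = max(len(ids) for ids in batch)
dyn_ids, dyn_mask = pad_batch(batch, target_len=longest)

print(len(fixed_ids[0]), len(dyn_ids[0]))  # 512 5
real_tokens = sum(map(sum, fixed_mask))
print(f"{real_tokens / (len(batch) * 512):.1%} of the fixed-length positions are real tokens")
```

Padding per batch (for example with `DataCollatorWithPadding` from `transformers`) or passing a smaller `max_length` are common ways to cut this overhead; neither was tried in this notebook.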

### Instantiate a Trainer Class
The Trainer is used right away to evaluate the foundation model, and later to train the PEFT model.

In [5]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

# Create a Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    compute_metrics=compute_metrics,
)

# Evaluate the model
trainer.evaluate()

{'eval_loss': 0.707639217376709,
 'eval_accuracy': 0.48746518105849584,
 'eval_runtime': 11.3483,
 'eval_samples_per_second': 31.635,
 'eval_steps_per_second': 2.027}
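An accuracy of about 0.49 on a two-class problem is close to chance level, which is what one would expect from a freshly initialized classification head. As a minimal sketch of what `compute_metrics` computes, the toy logits and labels below are made up for illustration.

```python
# Made-up logits for four examples (one row per example, one column per class).
logits = [
    [0.2, -0.1],
    [-0.3, 0.4],
    [0.1, 0.0],
    [-0.2, 0.5],
]
labels = [0, 1, 1, 1]

# argmax over the class axis: index of the largest logit in each row.
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

# Accuracy is the fraction of predictions that match the labels.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)  # [0, 1, 0, 1]
print(accuracy)     # 0.75
```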

### Alternative manual evaluation function
Both the Trainer above and the custom function below can be used for benchmarking the model.
Another option is the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index). Its convenience is that it provides not only further metrics such as the F1 score, which matters when training on **imbalanced datasets**, but also task-specific metrics like **BLEU**, **ROUGE**, and **METEOR** for evaluating machine translation.
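To make the point about imbalanced datasets concrete, here is a small plain-Python sketch (not the Evaluate library's API) with made-up labels: a classifier that always predicts the majority class reaches high accuracy but an F1 score of zero for the minority class. The helper name `f1_for_class` is hypothetical.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1 score for one class, computed from true/false positives and false negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # by convention: without true positives the F1 score is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up, imbalanced labels: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
always_negative = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                               # 0.9
print(f1_for_class(y_true, always_negative))  # 0.0
```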

In [6]:
def evaluate_model(model, tokenized_test_dataset, device):
    """
    Evaluate the accuracy of a model (untrained or fine-tuned) on the tokenized test dataset.

    Args:
        model: The model to be evaluated.
        tokenized_test_dataset: The tokenized test dataset.
        device: The device to run the evaluation on (e.g., 'cuda' or 'cpu').

    Returns:
        accuracy: The accuracy of the model on the test dataset.
    """
    # Convert the tokenized input samples to tensors
    input_ids = tokenized_test_dataset["input_ids"]
    attention_mask = tokenized_test_dataset["attention_mask"]
    labels = tokenized_test_dataset["labels"]

    # Set the model to evaluation mode
    model.eval()

    # Disable gradient calculation
    with torch.no_grad():
        # Move the model and the inputs to the target device
        model.to(device)
        input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)

        # Create a progress bar
        progress_bar = tqdm(range(len(input_ids)), desc="Evaluation Progress")

        # Initialize an empty list to store predicted labels
        predicted_labels = []

        # Iterate over the input samples with the progress bar
        for i in progress_bar:
            # Get the input sample
            sample_input_ids = input_ids[i]
            sample_attention_mask = attention_mask[i]

            # Forward pass for the current input sample
            outputs = model(sample_input_ids.unsqueeze(0), attention_mask=sample_attention_mask.unsqueeze(0))

            # Get the predicted label for the current input sample
            predicted_label = torch.argmax(outputs.logits, dim=1).item()

            # Append the predicted label to the list
            predicted_labels.append(predicted_label)

    # Convert the predicted labels to a tensor
    predicted_labels = torch.tensor(predicted_labels)

    # Calculate accuracy
    accuracy = (predicted_labels == labels).float().mean().item()

    return accuracy

In [7]:
# Function call for calculating accuracy 
accuracy = evaluate_model(model, tokenized_test_dataset, device)
print("Accuracy:", accuracy)

Evaluation Progress: 100%|██████████| 359/359 [00:12<00:00, 29.49it/s]

Accuracy: 0.48746517300605774





In [8]:
# Store the untrained base model's accuracy for later comparison
base_model_accuracy = accuracy

## Performing Parameter-Efficient Fine-Tuning

Building a LoRA configuration for bert-base-uncased

In [9]:
# Print the architecture to locate the "query" and "value" modules targeted by the LoRA configuration
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [10]:
# Define the LoRA Configuration
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none"
)
    
# Apply the LoRA configuration
model = get_peft_model(model, config)

**Note**: The get_peft_model function is part of the PEFT (Parameter-Efficient Fine-Tuning) framework, which is designed to fine-tune large pre-trained models efficiently. **This function automatically freezes the layers of the foundation model and allows only the LoRA (Low-Rank Adaptation) adapter layers to be trained.**
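For reference, the update that LoRA applies to a frozen weight matrix can be written as follows. This is the standard LoRA formulation; the symbols are introduced here for illustration and are not used elsewhere in the notebook.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad A \in \mathbb{R}^{r \times d_{\text{in}}},
\quad B \in \mathbb{R}^{d_{\text{out}} \times r}
```

Here $W_0$ is the frozen pretrained weight, and only $A$ and $B$ receive gradients. With `r=8` and `lora_alpha=32` as configured above, the scaling factor $\alpha / r$ is 4.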

In [11]:
# bert-base-uncased model including the injected, trainable LoRA adapter layers (lora_A and lora_B weights)
print(model)

PeftModel(
  (base_model): LoraModel(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(
                    in_features=768, out_features=768, bias=True
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=8, bias=False)
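The printed shapes allow a quick estimate of how many parameters the LoRA adapters add. The arithmetic below is a sketch based on the configuration and the printout; `lora_B` is not visible in the truncated output, so its shape (768 x 8) is assumed from the standard LoRA layout.

```python
# Values read from the configuration and the printed model above.
r = 8                  # LoRA rank
d_in = d_out = 768     # query/value projections are 768 x 768
num_layers = 12        # "(0-11): 12 x BertLayer"
targets_per_layer = 2  # "query" and "value"

# Each adapted module adds lora_A (r x d_in) and lora_B (d_out x r), without biases.
params_per_module = r * d_in + d_out * r
lora_params = params_per_module * targets_per_layer * num_layers

# Size of the frozen dense weight that the two small matrices sit next to.
dense_params = d_in * d_out

print(params_per_module)                          # 12288
print(lora_params)                                # 294912
print(f"{params_per_module / dense_params:.1%}")  # 2.1%
```

Compared with the roughly 110 million parameters of bert-base-uncased, the adapters account for well under one percent of the model. On a `PeftModel`, `model.print_trainable_parameters()` reports the actual counts.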
 

In [12]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.688,0.652615,0.623955
2,0.6408,0.527348,0.788301
3,0.565,0.422427,0.857939
4,0.5054,0.393537,0.855153
5,0.497,0.387484,0.855153


TrainOutput(global_step=3125, training_loss=0.5680040612792969, metrics={'train_runtime': 3487.198, 'train_samples_per_second': 14.338, 'train_steps_per_second': 0.896, 'total_flos': 1.32008512512e+16, 'train_loss': 0.5680040612792969, 'epoch': 5.0})
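The numbers in `TrainOutput` can be cross-checked against the training arguments. The sketch below assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

num_examples = 10_000     # training subset size
batch_size = 16           # per_device_train_batch_size
num_epochs = 5            # num_train_epochs
train_runtime = 3487.198  # seconds, taken from TrainOutput

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(num_examples / batch_size)
global_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 625
print(global_steps)     # 3125
print(round(num_examples * num_epochs / train_runtime, 3))  # 14.338
print(round(global_steps / train_runtime, 3))               # 0.896
```

These values match `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.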

In [84]:
# Save the trained LoRA weights to a small checkpoint:
model.save_pretrained("bert_base_uncased-sentiment140")
tokenizer.save_pretrained("bert_base_uncased-sentiment140")

# Extract the classifier head state dictionary
classifier_state_dict = {
    "classifier.weight": model.classifier.weight.cpu().detach().numpy(),
    "classifier.bias": model.classifier.bias.cpu().detach().numpy()
}

# Save the classifier head state dictionary
torch.save(classifier_state_dict, "bert_base_uncased-sentiment140/classifier_head.pth")

('bert_base_uncased-sentiment140/tokenizer_config.json',
 'bert_base_uncased-sentiment140/special_tokens_map.json',
 'bert_base_uncased-sentiment140/vocab.txt',
 'bert_base_uncased-sentiment140/added_tokens.json',
 'bert_base_uncased-sentiment140/tokenizer.json')

## Performing Inference with a PEFT Model

Comparing the base model's accuracy to the accuracy of the trained PEFT model

In [86]:
# loading saved model from checkpoint

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert_base_uncased-sentiment140")

# Load the base model
model = AutoModelForSequenceClassification.from_pretrained("bert_base_uncased-sentiment140")

# Load the classifier head state dictionary
classifier_state_dict = torch.load("bert_base_uncased-sentiment140/classifier_head.pth")

# Load the classifier head weights into the model
model.classifier.weight.data = torch.tensor(classifier_state_dict["classifier.weight"])
model.classifier.bias.data = torch.tensor(classifier_state_dict["classifier.bias"])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [90]:
# Function call for calculating accuracy 
accuracy = evaluate_model(model, tokenized_test_dataset, device)
print("PEFT Model Accuracy:", accuracy)
print("Untrained Base Model Accuracy:", base_model_accuracy)

Evaluation Progress: 100%|██████████| 359/359 [00:12<00:00, 28.32it/s]

PEFT Model Accuracy: 0.8551532030105591
Untrained Base Model Accuracy: 0.48746517300605774





**Note**: Conventionally, a PEFT checkpoint is reloaded with the PEFT-specific auto class, in this case **AutoPeftModelForSequenceClassification**. Here, however, the **classifier-head layers** used during fine-tuning were not restored from the checkpoint and came back randomly initialized, as the warning above shows. A likely cause (an assumption, not verified in this notebook) is that the `LoraConfig` was created without a `task_type`, so PEFT did not register the classifier head as a module to save alongside the LoRA weights. The head's weights and biases were therefore saved separately and loaded manually into the model reloaded with the default **AutoModelForSequenceClassification**. <br>

Issues like this occur regularly with open-source tooling and openly available models. Domain-specific knowledge is needed to **patch** these hurdles.
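One possible way to avoid the manual save and load is sketched below. It rests on the assumption that the absence of `task_type` in the `LoraConfig` is what kept the classifier head out of the checkpoint: with `task_type=TaskType.SEQ_CLS`, PEFT treats the classification head as a module to save next to the adapter, and `AutoPeftModelForSequenceClassification` can restore both. The sketch was not run as part of this notebook. The function names `build_seq_cls_lora_config` and `reload_peft_classifier` are hypothetical, and the imports are deferred into the functions so that defining them does not load the libraries.

```python
def build_seq_cls_lora_config():
    """Same LoRA hyperparameters as above, plus an explicit task type."""
    from peft import LoraConfig, TaskType

    return LoraConfig(
        task_type=TaskType.SEQ_CLS,  # assumption: lets PEFT save the classifier head too
        r=8,
        lora_alpha=32,
        target_modules=["query", "value"],
        lora_dropout=0.1,
        bias="none",
    )


def reload_peft_classifier(checkpoint_dir, num_labels=2):
    """Reload the base model, the LoRA adapter, and the saved head from a PEFT checkpoint."""
    from peft import AutoPeftModelForSequenceClassification

    return AutoPeftModelForSequenceClassification.from_pretrained(
        checkpoint_dir, num_labels=num_labels
    )
```

A model trained with this configuration and saved with `save_pretrained` could then be reloaded with `reload_peft_classifier("bert_base_uncased-sentiment140")`. Comparing its accuracy with the value reported above would confirm whether the head was restored.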

## Sanity Check

Performing inference with the custom-trained PEFT model, demonstrating predictions on samples from the test data.
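The score printed for each sample is the largest softmax probability over the two output logits. The plain-Python sketch below shows that computation for one made-up pair of logits. The readable class names are hypothetical; the model config only provides `LABEL_0` and `LABEL_1`, which correspond to negative and positive under the label mapping used earlier.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one tweet, ordered as [negative, positive].
logits = [-0.9, 0.9]
probs = softmax(logits)

# The predicted class is the index of the largest probability; the score is that probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
score = probs[predicted]

# Hypothetical readable names for the two class ids.
id2label = {0: "negative", 1: "positive"}

print(id2label[predicted], f"{score:.4f}")
```

Because the score is a softmax output for a two-class model, it always lies between 0.5 and 1.0 and should be read as the model's relative confidence, not as a calibrated probability.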

In [91]:
def infer_sentiment(lora_model, tokenizer, test_dataset, device, num_samples=10):
    """
    Perform inference on rows of the test dataset for sanity checking or demonstration.

    Args:
        lora_model: The trained LoRA model.
        tokenizer: The tokenizer used for the model.
        test_dataset: The test dataset; it must provide a raw "text" column.
        device: The device to run the inference on (e.g., 'cuda' or 'cpu').
        num_samples: The number of samples to infer for demonstration.

    Returns:
        results: A list of dictionaries containing the text and predicted label for each sample.
    """
    # Set the model to evaluation mode
    lora_model.eval()
    lora_model.to(device)

    # Extract text data from the test dataset
    texts = test_dataset["text"][:num_samples]

    results = []

    with torch.no_grad():
        for text in texts:
            # Tokenize the input text
            inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)

            # Perform inference
            outputs = lora_model(**inputs)
            logits = outputs.logits

            # Get the predicted label and score
            predicted_label = torch.argmax(logits, dim=1).item()
            score = torch.softmax(logits, dim=1).max().item()

            # Map the predicted label to the corresponding class name
            label_name = lora_model.config.id2label[predicted_label]

            # Append the result
            results.append({
                "text": text,
                "label": label_name,
                "score": score
            })

            # Display the result
            print(f"Text: {text}")
            print(f"Predicted Label: {label_name}")
            print(f"Score: {score:.4f}")
            print("-" * 50)

    return results

# Example function call
results = infer_sentiment(model, tokenizer, dataset["test"], device, num_samples=10)

Text: @stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.
Predicted Label: LABEL_1
Score: 0.8558
--------------------------------------------------
Text: Reading my kindle2...  Love it... Lee childs is good read.
Predicted Label: LABEL_1
Score: 0.8653
--------------------------------------------------
Text: Ok, first assesment of the #kindle2 ...it fucking rocks!!!
Predicted Label: LABEL_1
Score: 0.8620
--------------------------------------------------
Text: @kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)
Predicted Label: LABEL_1
Score: 0.6829
--------------------------------------------------
Text: @mikefish  Fair enough. But i have the Kindle2 and I think it's perfect  :)
Predicted Label: LABEL_1
Score: 0.8499
--------------------------------------------------
Text: @richardebaker no. it is too big. I'm quite happy with the Kindle2.