# Task 3: Predictive Modelling for Sentiment Classification

Build a machine learning model to predict sentiment (positive or negative) based on review text.

# Sentiment Analysis with Fine-tuned DistilBERT

Overview:
------------
This notebook builds a sentiment classifier by fine-tuning a DistilBERT model. The goal is to predict whether a review is positive or negative from its text. Pre-trained models and tokenizers are loaded through the Hugging Face Transformers library.

Workflow:
------------
1. **Data Loading and Preprocessing:**
    - The sentiment dataset is loaded, containing class labels, review titles, and review text.
    - The data is preprocessed by combining review titles and text, mapping class labels to binary values, and dropping unnecessary columns.

2. **Tokenization and Prediction with Pre-trained DistilBERT:**
    - The combined review text is tokenized using a pre-trained DistilBERT tokenizer.
    - The pre-trained DistilBERT model is used to predict sentiment labels for the reviews.
    - Model weights and tokenizer are saved for later use.

3. **Fine-tuning DistilBERT for Sentiment Analysis:**
    - The dataset is split into training, validation, and test sets.
    - A custom PyTorch Dataset class is created for efficient handling of the tokenized data.
    - The pre-trained DistilBERT model is fine-tuned on the training set.
    - Evaluation is performed on the validation and test sets.

4. **Results and Model Saving:**
    - Fine-tuned model results are evaluated on the test set.
    - Model weights and tokenizer for the fine-tuned model are saved.

5. **Load Fine-tuned Model:**
    - The fine-tuned model and tokenizer can be loaded for further analysis or deployment.

------------


In [None]:
! pip install -U accelerate -q
! pip install -U transformers -q

In [None]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

Because the dataset is large, only the first 100 rows of each file are used to reduce computation time.

In [None]:
# Load Data
col_names = ['class', 'review_title', 'review_text']
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prembly/Datasets/train.csv', names=col_names)
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prembly/Datasets/test.csv', names=col_names)


# Keep the first 100 rows; .copy() avoids modifying a slice of the original frame
train_1 = train.iloc[:100].copy()
test_1 = test.iloc[:100].copy()

# Combine 'review_title' and 'review_text'
train_1['full_review'] = train_1['review_title'] + ' ' + train_1['review_text']
test_1['full_review'] = test_1['review_title'] + ' ' + test_1['review_text']


# Rename class labels: positive 2 to 1, negative 1 to 0
train_1['class'] = train_1['class'].map({2: 1, 1: 0})
test_1['class'] = test_1['class'].map({2: 1, 1: 0})


# Drop columns
del train_1['review_title'], train_1['review_text']
del test_1['review_title'], test_1['review_text']


train_1.dropna(inplace=True)
test_1.dropna(inplace=True)

train_1.shape, test_1.shape

((100, 2), (100, 2))
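
The preprocessing steps above (combine title and text, remap labels, drop incomplete rows) can be illustrated without pandas on a few made-up rows. This is a minimal sketch, assuming the same column order as `col_names`; the sample rows and the helper name `preprocess_rows` are hypothetical.

```python
# Hypothetical rows in the same (class, review_title, review_text) layout.
rows = [
    (2, "Great read", "Could not put it down."),
    (1, "Broke in a week", "Would not buy again."),
    (2, None, "Missing title, so this row is dropped."),
]

LABEL_MAP = {2: 1, 1: 0}  # positive 2 -> 1, negative 1 -> 0


def preprocess_rows(rows):
    """Return (label, full_review) pairs, skipping rows with missing fields."""
    cleaned = []
    for cls, title, text in rows:
        if title is None or text is None or cls not in LABEL_MAP:
            continue  # mirrors dropna(): incomplete rows are discarded
        cleaned.append((LABEL_MAP[cls], f"{title} {text}"))
    return cleaned


processed = preprocess_rows(rows)
for label, review in processed:
    print(label, review)
```

Only the two complete rows survive, each with a binary label and a single combined text field.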

## Tokenizer and Model setup

In [None]:
tokenizer_name = 'distilbert-base-uncased-finetuned-sst-2-english'
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Tokenization and Prediction


In [None]:
batch = tokenizer(train_1['full_review'].tolist(), truncation=True, padding=True, max_length=128, return_tensors='pt')
batch.to(device)
model.to(device)

with torch.no_grad():
  train_outputs = model(**batch)
  train_predictions = F.softmax(train_outputs.logits, dim=1)
  train_labels = torch.argmax(train_predictions, dim=1)
  train_labels = [model.config.id2label[label_id] for label_id in train_labels.tolist()]
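
The prediction step turns each pair of logits into probabilities with softmax, picks the index with the highest probability, and maps that index to a label name. The sketch below reproduces that logic in plain Python; the logits are invented for illustration and the helper names are hypothetical.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # same mapping as model.config.id2label


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for two reviews (not produced by the notebook's model).
example_logits = [[-1.2, 2.3], [0.9, -0.4]]
labels = [predict_label(row)[0] for row in example_logits]
print(labels)
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits would select the same label; the softmax is only needed when the probabilities themselves are of interest.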

In [None]:
# Attach predictions so misclassified reviews can be filtered
train_1['pred'] = train_labels

# Negative reviews predicted as POSITIVE, as a percentage of all rows
train_errors_negative = train_1[(train_1['class'] == 0) & (train_1['pred'] == 'POSITIVE')]
train_error_percentage_negative = (train_errors_negative.shape[0]/train_1.shape[0])*100
print(f'Error rate for NEGATIVE classification: {train_error_percentage_negative}%')

# Positive reviews predicted as NEGATIVE, as a percentage of all rows
train_errors_positive = train_1[(train_1['class'] == 1) & (train_1['pred'] == 'NEGATIVE')]
train_error_percentage_positive = (train_errors_positive.shape[0]/train_1.shape[0])*100
print(f'Error rate for POSITIVE classification: {train_error_percentage_positive}%')

# Total error rate
train_total_errors = train_errors_negative.shape[0] + train_errors_positive.shape[0]
train_total_error_rate = (train_total_errors/train_1['pred'].shape[0])*100
print(f'Total error rate is: {train_total_error_rate}%')

Error rate for NEGATIVE classification: 2.0%
Error rate for POSITIVE classification: 9.0%
Total error rate is: 11.0%
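
The percentages above divide each error count by the total number of rows, which is why they add up to the total error rate. A per-class error rate divides by the number of rows in that class instead. The sketch below computes both on invented labels; the function name and the data are hypothetical.

```python
def error_rates(y_true, y_pred):
    """Compare errors as a share of all rows with errors as a share of each class.

    Labels use 0 for negative and 1 for positive.
    """
    n = len(y_true)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n_neg = y_true.count(0)
    n_pos = y_true.count(1)
    return {
        "neg_errors_share_of_all": 100 * false_pos / n,
        "pos_errors_share_of_all": 100 * false_neg / n,
        "neg_class_error_rate": 100 * false_pos / n_neg if n_neg else 0.0,
        "pos_class_error_rate": 100 * false_neg / n_pos if n_pos else 0.0,
        "total_error_rate": 100 * (false_pos + false_neg) / n,
    }


# Hypothetical labels: 4 negative and 6 positive reviews.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
rates = error_rates(y_true, y_pred)
for name, value in rates.items():
    print(f"{name}: {value:.1f}%")
```

On imbalanced data the two views can differ noticeably, so it helps to state which denominator a reported rate uses.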


In [None]:
# False negative
train_1.iloc[[91]]

| | class | full_review | pred |
|---|---|---|---|
| 91 | 1 | Why is SOULWAX ignored by critics & consumers ... | NEGATIVE |


On the surface, the text above reads as negative: the author expresses frustration and confusion that the band SOULWAX is ignored by both critics and consumers despite its long presence in the music scene and the perceived quality of its work. The review is labelled positive, however, which may explain why the model predicted NEGATIVE.

In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

## Save and Load Model and Tokenizer

In [None]:
# Save model weights and tokenizer
save_directory = '/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/model_checkpoint'
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

In [None]:
# Load model weights and tokenizer
# tokenizer = AutoTokenizer.from_pretrained(save_directory)
# model = AutoModelForSequenceClassification.from_pretrained(save_directory)

# Fine-Tuning

In [None]:
# Train, validation, and test split (80/10/10)
X_train, X_temp, y_train, y_temp = train_test_split(train_1['full_review'], train_1['class'], test_size=0.2, stratify=train_1['class'], random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Convert Series to lists
X_train_arr, X_test_arr, X_val_arr = X_train.values.tolist(), X_test.values.tolist(), X_val.values.tolist()

# Convert Series to NumPy arrays
y_train_arr, y_test_arr, y_val_arr = np.array(y_train), np.array(y_test), np.array(y_val)
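
The two-stage split holds out 20% of the rows and then halves that holdout, which yields an 80/10/10 partition. The sketch below shows the arithmetic with the standard library only. It does not reproduce scikit-learn's shuffling or stratification, and the function name is hypothetical.

```python
import random


def two_stage_split(items, holdout_frac=0.2, test_frac_of_holdout=0.5, seed=42):
    """Shuffle, hold out a fraction, then split the holdout into val and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    n_test = round(len(holdout) * test_frac_of_holdout)
    test, val = holdout[:n_test], holdout[n_test:]
    return train, val, test


train_ids, val_ids, test_ids = two_stage_split(range(100))
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```

With 100 rows this leaves only 10 rows each for validation and testing, so metrics computed on them will be coarse.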


In [None]:
# Dataset class
class AmazonDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

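    def __len__(self):
        # Count samples via the labels; len(self.encodings) would count the
        # tokenizer output fields (e.g. input_ids, attention_mask) instead
        return len(self.labels)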
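
A dataset's `__len__` should report the number of samples, which equals the number of labels. A mapping-style tokenizer output has a different length: its number of fields. The sketch below uses a plain dict as a stand-in for the tokenizer output (an assumption made to keep the example dependency-free); the class name and token IDs are hypothetical.

```python
class ToyDataset:
    """Dependency-free stand-in for a torch Dataset over tokenized text."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # field name -> list with one entry per sample
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        # One label per sample, so this is the sample count.
        return len(self.labels)


# Made-up encodings for three short reviews.
encodings = {
    "input_ids": [[101, 2307, 102], [101, 2919, 102], [101, 7929, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
labels = [1, 0, 1]

dataset = ToyDataset(encodings, labels)
print(len(encodings))  # number of fields
print(len(dataset))    # number of samples
```

A training loop that sizes its epochs from `__len__` would see only two items if the field count were returned here, regardless of how many reviews were tokenized.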

In [None]:
# Tokenization for fine-tuning
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

train_encodings = tokenizer(X_train_arr, truncation=True, padding=True)
test_encodings = tokenizer(X_test_arr, truncation=True, padding=True)
val_encodings = tokenizer(X_val_arr, truncation=True, padding=True)

train_dataset = AmazonDataset(train_encodings, y_train_arr)
test_dataset = AmazonDataset(test_encodings, y_test_arr)
val_dataset = AmazonDataset(val_encodings, y_val_arr)
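
With `truncation=True` and `padding=True`, sequences longer than the maximum length are cut and shorter ones are padded to a common width, with an attention mask marking which positions are real tokens. The sketch below imitates that behaviour on made-up token IDs. It is a simplification: a real tokenizer also handles special tokens, and the function name and `pad_id` value are assumptions.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three reviews of different lengths.
toy_batch = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6], [5, 6, 7]], max_length=4)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Because each split is tokenized separately, the padded width can differ between the training, validation, and test encodings.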

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,  # exceeds this run's total optimizer steps, so the learning rate stays in warmup
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir='/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/logs',
    logging_steps=10,
    # max_steps=100,
)
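
The interaction between `warmup_steps`, batch size, and epochs is easier to see with numbers. The sketch below assumes a linear warmup followed by linear decay, which is my understanding of the `Trainer` default schedule, and assumes 80 training rows on a single device. The function name is hypothetical.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed setup: 80 training rows, batch size 16, 2 epochs.
steps_per_epoch = math.ceil(80 / 16)
total_steps = steps_per_epoch * 2
final_lr = linear_warmup_lr(
    total_steps, base_lr=5e-5, warmup_steps=500, total_steps=total_steps
)
print(total_steps, final_lr)
```

Under these assumptions training ends after 10 steps, when the learning rate has reached only about 2% of the configured `5e-5`. For a run this short, a much smaller `warmup_steps` value would let the model train at the intended rate.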

In [None]:
# Model and Trainer setup for fine-tuning
model = DistilBertForSequenceClassification.from_pretrained(save_directory)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Now we feast!
trainer.train()

TrainOutput(global_step=2, training_loss=0.9325298070907593, metrics={'train_runtime': 0.8102, 'train_samples_per_second': 4.937, 'train_steps_per_second': 2.469, 'total_flos': 244236766272.0, 'train_loss': 0.9325298070907593, 'epoch': 2.0})

In [None]:
# Evaluation on the test set
results = trainer.evaluate(test_dataset)
results

{'eval_loss': 0.0006351720076054335,
 'eval_runtime': 0.0481,
 'eval_samples_per_second': 41.571,
 'eval_steps_per_second': 20.785,
 'epoch': 2.0}
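
The evaluation output above contains only the loss because the `Trainer` was created without a metric function. `Trainer` accepts a `compute_metrics` callable that receives the logits and labels; the sketch below shows one possible accuracy implementation. The toy inputs are hypothetical, and the function is written so it works on plain lists as well as the NumPy arrays passed in practice.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); works on lists or arrays."""
    logits, labels = eval_pred
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(t) for p, t in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Hypothetical logits and labels for four reviews.
toy_eval = ([[0.1, 2.0], [1.5, -0.3], [0.2, 0.4], [2.2, 0.1]], [1, 0, 0, 0])
metrics = compute_metrics(toy_eval)
print(metrics)
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate`.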

## Save the fine-tuned model

In [None]:
# Save fine-tuned model weights and tokenizer
save_directory = '/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/fine_tunned_model_checkpoint'
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

('/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/fine_tunned_model_checkpoint/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/fine_tunned_model_checkpoint/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/fine_tunned_model_checkpoint/vocab.txt',
 '/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/fine_tunned_model_checkpoint/added_tokens.json',
 '/content/drive/MyDrive/Colab Notebooks/Prembly/Notebook/fine_tunned_model_checkpoint/tokenizer.json')

# Test data evaluation

In [None]:
# Load fine-tuned model weights and tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(save_directory)
model = DistilBertForSequenceClassification.from_pretrained(save_directory)

In [None]:
batch = tokenizer(test_1['full_review'].tolist(), truncation=True, padding=True, max_length=128, return_tensors='pt')
batch.to(device)
model.to(device)

with torch.no_grad():
  test_outputs = model(**batch)
  test_predictions = F.softmax(test_outputs.logits, dim=1)
  test_labels = torch.argmax(test_predictions, dim=1)
  test_labels = [model.config.id2label[label_id] for label_id in test_labels.tolist()]

In [None]:
# Attach predictions so misclassified reviews can be filtered
test_1['pred'] = test_labels

# Negative reviews predicted as POSITIVE, as a percentage of all rows
test_errors_negative = test_1[(test_1['class'] == 0) & (test_1['pred'] == 'POSITIVE')]
error_percentage_negative = (test_errors_negative.shape[0]/test_1.shape[0])*100
print(f'Error rate for NEGATIVE classification: {error_percentage_negative}%')

# Positive reviews predicted as NEGATIVE, as a percentage of all rows
test_errors_positive = test_1[(test_1['class'] == 1) & (test_1['pred'] == 'NEGATIVE')]
error_percentage_positive = (test_errors_positive.shape[0]/test_1.shape[0])*100
print(f'Error rate for POSITIVE classification: {error_percentage_positive}%')

# Total error rate
total_errors = test_errors_negative.shape[0] + test_errors_positive.shape[0]
total_error_rate = (total_errors/test_1['pred'].shape[0])*100
print(f'Total error rate is: {total_error_rate}%')

Error rate for NEGATIVE classification: 5.0%
Error rate for POSITIVE classification: 8.0%
Total error rate is: 13.0%


References:

*   Hugging Face NLP Course [Link](https://huggingface.co/learn/nlp-course/chapter1/1)
*   Text enhancement with ChatGPT

