In [None]:
{
  "developer": "Swapnendu Banik",
  "version": "1.0.0",
  "projectDescription": """This project fine-tunes a DistilBERT model on the Amazon Customer Review dataset for sentiment analysis, 
  classifying reviews into categories like very positive, positive, neutral, negative, and very negative. 
  The goal is to create an efficient sentiment analysis tool for Amazon product reviews."""
}

## The Dataset Loading and Preprocessing

In [1]:
import pandas as pd

In [2]:
file_path = "dataset/AmazonData.csv"  # forward slash avoids the invalid "\A" escape sequence
df = pd.read_csv(file_path)
df.head(5)

Unnamed: 0,Unique_ID,Category,Review_Header,Review_text,Rating,Own_Rating
0,136040,smartTv,Nice one,I liked it,5,Positive
1,134236,mobile,Huge battery life with amazing display,I bought the phone on Amazon and been using my...,5,Positive
2,113945,books,Four Stars,"Awesome book at reasonable price, must buy ......",4,Positive
3,168076,smartTv,Nice quality,good,5,Positive
4,157302,books,Nice book,"The book is fine,not bad,contains nice concept...",3,Neutral


In [3]:
df=df[["Review_text","Rating"]]
df.head(5)

Unnamed: 0,Review_text,Rating
0,I liked it,5
1,I bought the phone on Amazon and been using my...,5
2,"Awesome book at reasonable price, must buy ......",4
3,good,5
4,"The book is fine,not bad,contains nice concept...",3


In [4]:
## Number of entries
print(f"Length of dataset is {df.shape[0]} entries")

Length of dataset is 60889 entries


In [5]:
## Missing vals
df.isnull().sum()

Review_text    32
Rating          0
dtype: int64

In [6]:
## Drop rows with missing review text (reassign instead of mutating a column slice in place)
df = df.dropna(subset=["Review_text"])

In [7]:
## Missing value Statistics
df.isnull().sum()

Review_text    0
Rating         0
dtype: int64

In [8]:
df.head(5)

Unnamed: 0,Review_text,Rating
0,I liked it,5
1,I bought the phone on Amazon and been using my...,5
2,"Awesome book at reasonable price, must buy ......",4
3,good,5
4,"The book is fine,not bad,contains nice concept...",3


In [9]:
## Check whether the dataset is balanced across ratings
df["Rating"].value_counts()

## The classes are imbalanced: 5-star reviews dominate

Rating
5    34439
4    12968
1     6979
3     4364
2     2107
Name: count, dtype: int64
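
To put the imbalance in numbers, the counts above can be turned into class shares. The snippet below is a minimal sketch in plain Python that hard-codes the counts printed by `value_counts()`; the variable names are illustrative and not part of the notebook.

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5: 34439, 4: 12968, 1: 6979, 3: 4364, 2: 2107}

total = sum(rating_counts.values())

# Share of each rating in the cleaned dataset
shares = {rating: count / total for rating, count in rating_counts.items()}

for rating in sorted(shares, reverse=True):
    print(f"{rating} stars: {rating_counts[rating]:>6} reviews ({shares[rating]:.1%})")

# Ratio between the most and the least frequent class
imbalance_ratio = max(rating_counts.values()) / min(rating_counts.values())
print(f"Most frequent / least frequent: {imbalance_ratio:.1f}x")
```

5-star reviews make up more than half of the data, so metrics that average over classes (such as macro F1) are likely to be more informative here than raw accuracy.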

## The Target Values Need to Be Encoded in Binary Format for DistilBERT

Categorical target values usually have to be converted into binary indicator vectors before a model can train on them. scikit-learn offers two binarizers for this:

- LabelBinarizer: Suitable for single-label categorical data.

- MultiLabelBinarizer: Designed for multi-label categorical data.



### 1. Using LabelBinarizer

The LabelBinarizer is used when each instance belongs to one and only one category. For example, consider target values [1, 2, 3, 4]:

Input Target Values:
[1, 2, 3, 4]

Binary Encoding Output:
- 1 -> [1, 0, 0, 0]
- 2 -> [0, 1, 0, 0]
- 3 -> [0, 0, 1, 0]
- 4 -> [0, 0, 0, 1]
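
The mapping above can be reproduced with scikit-learn. This is a small sketch on the toy values from the example, not on the review data; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy target values from the example above
targets = [1, 2, 3, 4]

binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(targets)

print(binarizer.classes_)  # classes are stored in sorted order
print(one_hot)             # one row per input value, one column per class

# inverse_transform maps the one-hot rows back to the original values
recovered = binarizer.inverse_transform(one_hot)
print(recovered)
```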


### 2. Using MultiLabelBinarizer

The MultiLabelBinarizer is used when instances can belong to multiple categories at once. For example, consider target sets [(1, 2, 3), (2, 4)]:

Input Target Sets:
[(1, 2, 3), (2, 4)]


Binary Encoding Output:
- (1, 2, 3) -> [1, 1, 1, 0]
- (2, 4)     -> [0, 1, 0, 1]
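
The multi-label case can be checked the same way. A small sketch with scikit-learn's `MultiLabelBinarizer` on the toy sets above; the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy target sets from the example above
target_sets = [(1, 2, 3), (2, 4)]

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(target_sets)

print(mlb.classes_)  # every label seen across all sets, in sorted order
print(indicator)     # a row can contain several 1s, one per label in the set

# inverse_transform returns one tuple of labels per row
recovered_sets = mlb.inverse_transform(indicator)
print(recovered_sets)
```

Unlike the single-label case, a row here may contain any number of 1s, which is why the two binarizers are not interchangeable.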

## Label Binarizer

In [10]:
from sklearn.preprocessing import LabelBinarizer
label= LabelBinarizer()

In [11]:
labels= label.fit_transform(df["Rating"]).astype("float32")
texts= df["Review_text"].tolist()

## Train-Test Split

In [12]:
## Train-Test Split
from sklearn.model_selection import train_test_split

train_test,val_text,train_labels,val_labels = train_test_split(texts,labels,test_size=0.2,random_state=42)

In [13]:
len(train_test),len(train_labels),len(val_text),len(val_labels)

(48685, 48685, 12172, 12172)
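
Because the ratings are imbalanced, one option (not used in this notebook) is to pass `stratify` to `train_test_split` so that both splits keep the same class proportions. A minimal sketch on made-up toy labels; all names and values here are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 five-star and 20 one-star reviews
toy_texts = [f"review {i}" for i in range(100)]
toy_ratings = [5] * 80 + [1] * 20

train_x, val_x, train_y, val_y = train_test_split(
    toy_texts,
    toy_ratings,
    test_size=0.2,
    random_state=42,
    stratify=toy_ratings,  # keep the 80/20 class ratio in both splits
)

print(Counter(train_y))
print(Counter(val_y))
```

In the notebook's setting the stratification key would presumably be the original `Rating` column rather than the one-hot labels.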

## Model Building

In [14]:
import torch
from transformers import DistilBertTokenizer, AutoTokenizer
from transformers import DistilBertForSequenceClassification, AutoModelForSequenceClassification
from torch.utils.data import Dataset

- checkpoint: The pre-trained model used as the base. Here it is "distilbert-base-uncased", a smaller and faster distilled version of BERT.
- Tokenizer:
    - Converts input text into token IDs compatible with the DistilBERT model.
    - Handles lowercasing (the uncased model is case-insensitive), tokenizing, and padding or truncating to the desired length.
- DistilBertForSequenceClassification:
    - Designed for sequence classification tasks.
    - Configured with num_labels, the number of classes (five here, one per star rating).
    - The problem_type is set to "multi_label_classification", so the model is trained with a per-class binary cross-entropy loss (BCEWithLogitsLoss) on the one-hot float labels produced by the LabelBinarizer.
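
For intuition about what the "multi_label_classification" setting means for training: in the Hugging Face implementation this problem type selects a binary cross-entropy loss on logits (`BCEWithLogitsLoss`), which scores every class as an independent yes/no decision. The sketch below recomputes that loss in plain Python for a single example; the logits and the helper name `bce_with_logits` are illustrative assumptions, not taken from the notebook.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all classes, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Hypothetical logits for one review and its one-hot label for a 5-star rating
logits = [-2.0, -3.0, -1.0, 0.5, 2.0]
one_hot_label = [0.0, 0.0, 0.0, 0.0, 1.0]

loss = bce_with_logits(logits, one_hot_label)
print(f"loss = {loss:.4f}")

# Pushing the logit of the true class in the wrong direction increases the loss
worse_logits = [-2.0, -3.0, -1.0, 0.5, -2.0]
worse_loss = bce_with_logits(worse_logits, one_hot_label)
print(f"worse loss = {worse_loss:.4f}")
```

Because each class is scored separately, the five probabilities do not have to sum to 1, which is why the inference code applies a sigmoid rather than a softmax.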

In [15]:
checkpoint = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(checkpoint)
model = DistilBertForSequenceClassification.from_pretrained(checkpoint, num_labels=len(labels[0]),
                                                            problem_type="multi_label_classification")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Custom Data-Class to Load Custom Data

Methods the custom Dataset must implement:
1. __init__()
    - Purpose: Initializes the dataset with the input data (texts and labels), tokenizer, and optional settings like max_len.
2. __len__()
    - Purpose:
        - Returns the total number of examples in the dataset.
        - This is necessary for PyTorch's DataLoader to iterate through the dataset.

3. __getitem__()
    - Purpose:
        - Retrieves a single data instance (text and label) by index.
        - Prepares the input by:
            - Tokenizing the text using the specified tokenizer.
            - Padding/truncating the text to the specified max_len.
            - Returning tokenized text, attention mask, and label tensors.

In [41]:
## Custom Dataset

class AmazonReviewDataset(Dataset):
  def __init__(self, texts, labels, tokenizer, max_len=512):
    self.texts = texts
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.texts)

  def __getitem__(self,idx):
    text = str(self.texts[idx])
    label = torch.tensor(self.labels[idx])

    encoding = self.tokenizer(text,
                              truncation=True,
                              padding="max_length",
                              max_length= self.max_len,
                              return_tensors="pt")

    return {
        "input_ids": encoding["input_ids"].flatten(),
        "attention_mask": encoding["attention_mask"].flatten(),
        "labels": label
    }


In [42]:
train_dataset = AmazonReviewDataset(train_test, train_labels, tokenizer)
val_dataset = AmazonReviewDataset(val_text, val_labels, tokenizer)

In [45]:
## Sample
val_dataset[100]

{'input_ids': tensor([  101,  2053,  8272,  2589,  1012,  1019,  2420,  2525,  1012,  3532,
          2968,  2006,  1996,  2112,  1997,  4901,  1012,  1996, 16661,  1998,
          1996,  2158,  4590,  2123,  1005,  1056, 13530,  2426,  3209,  1012,
          1996, 16661,  2018,  2053,  2801,  2055,  2010,  6098,  1012,  8013,
          2729,  2036,  2987,  1005,  1056,  6869,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

## Evaluation Metrics

In [5]:
## Multi-Label Classification Evaluation Metrics

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, hamming_loss
from transformers import EvalPrediction
import torch

In [48]:
def multi_label_metrics(predictions, labels, threshold=0.5):
  # Convert raw logits to independent per-class probabilities
  sigmoid = torch.nn.Sigmoid()
  probs = sigmoid(torch.Tensor(predictions))
  # Binarize: 1 where the probability reaches the threshold, 0 elsewhere
  y_pred = np.zeros(probs.shape)
  y_pred[np.where(probs >= threshold)] = 1
  y_true = labels

  # Macro-averaged F1 across the classes
  f1_macro_average = f1_score(y_true=y_true, y_pred=y_pred, average="macro")
  # Note: ROC AUC is computed on the thresholded predictions here, not on the probabilities
  roc_auc = roc_auc_score(y_true, y_pred, average="macro", multi_class="ovr")
  # Fraction of individual label positions that are predicted wrongly
  hamming_loss_val = hamming_loss(y_true, y_pred)


  return {
      "f1": f1_macro_average,
      "roc_auc": roc_auc,
      "hamming_loss": hamming_loss_val
  }

def compute_metrics(p: EvalPrediction):
  # The Hugging Face Trainer calls this at evaluation time to report metrics; it does not affect the training loss
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  result = multi_label_metrics(predictions=preds, labels=p.label_ids)
  return result
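
One detail worth knowing about the metrics above: `roc_auc_score` receives the thresholded 0/1 predictions. ROC AUC is a ranking metric, so it is more commonly computed from the probabilities themselves. The sketch below compares both on a made-up two-class toy batch; every value in it is invented for illustration and the reported notebook results are unaffected.

```python
import numpy as np
from sklearn.metrics import hamming_loss, roc_auc_score

# Hypothetical multi-label indicator targets and predicted probabilities
y_true = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])
probs = np.array([[0.1, 0.9], [0.3, 0.7], [0.4, 0.6], [0.8, 0.2]])

# Thresholding at 0.5, as multi_label_metrics does
y_hard = (probs >= 0.5).astype(int)

# AUC on hard 0/1 predictions loses the ranking information in the scores
auc_from_hard = roc_auc_score(y_true, y_hard, average="macro")
auc_from_probs = roc_auc_score(y_true, probs, average="macro")

# Hamming loss: fraction of individual label positions predicted wrongly
hl = hamming_loss(y_true, y_hard)

print(f"ROC AUC from thresholded predictions: {auc_from_hard:.2f}")
print(f"ROC AUC from probabilities:           {auc_from_probs:.2f}")
print(f"Hamming loss:                         {hl:.2f}")
```

In this toy batch the probabilities rank every positive above every negative, yet the thresholded version scores lower because ties between identical 0/1 predictions count as half-correct.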

## Training Args and the Trainer

In [60]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    report_to="none",
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset= val_dataset,
                  compute_metrics = compute_metrics)

In [61]:
trainer.train()

Epoch,Training Loss,Validation Loss,F1,Roc Auc,Hamming Loss,Runtime,Samples Per Second,Steps Per Second
1,0.278,0.277687,0.340229,0.629488,0.118377,191.4224,63.618,7.956


TrainOutput(global_step=6089, training_loss=0.29276267365098374, metrics={'train_runtime': 2613.6598, 'train_samples_per_second': 18.637, 'train_steps_per_second': 2.33, 'total_flos': 6452964675855360.0, 'train_loss': 0.29276267365098374, 'epoch': 1.0})

## Model Evaluation

In [62]:
trainer.evaluate()

{'eval_loss': 0.2776871919631958,
 'eval_f1': 0.340228876934184,
 'eval_roc_auc': 0.6294876758641222,
 'eval_hamming_loss': 0.11837740187222862,
 'eval_runtime': 206.6572,
 'eval_samples_per_second': 58.929,
 'eval_steps_per_second': 7.37,
 'epoch': 1.0}

## Saving the Model and Binarizer

In [71]:
trainer.save_model("ENTER YOUR CUSTOM PATH")
tokenizer.save_pretrained("ENTER YOUR CUSTOM PATH")


import pickle
with open("label_encoder.pkl", "wb") as f:
  pickle.dump(label, f)


## Zip For Easy Download

In [72]:
!zip -r distilbert-amazon-rev.zip "ENTER THE CUSTOM PATH OF SAVED MODEL AND TOKENIZER PARENT DIRECTORY"

updating: content/distilbert_amazon_review_model/ (stored 0%)
updating: content/distilbert_amazon_review_model/model.safetensors (deflated 8%)
updating: content/distilbert_amazon_review_model/config.json (deflated 51%)
updating: content/distilbert_amazon_review_model/training_args.bin (deflated 52%)
  adding: content/distilbert_amazon_review_model/vocab.txt (deflated 53%)
  adding: content/distilbert_amazon_review_model/tokenizer_config.json (deflated 75%)
  adding: content/distilbert_amazon_review_model/special_tokens_map.json (deflated 42%)


## Loading the Model and Testing it

In [16]:
## Script for running inference with the model

import torch
from transformers import AutoTokenizer,AutoModelForSequenceClassification
import pickle
import numpy as np

In [17]:
# Load the model and tokenizer (forward slashes avoid invalid escape sequences such as "\m" and "\d")
model_path = "artifacts/model/distilbert_amazon_review_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

## Load the label binarizer
with open("artifacts/label_encoder.pkl", "rb") as f:
  label = pickle.load(f)

In [18]:
sample_text = "The Product is Amazing"

In [19]:
sample_encoding = tokenizer(sample_text,
                              truncation=True,
                              padding="max_length",
                              max_length= 512,
                              return_tensors="pt")

In [20]:
sample_output = model(**sample_encoding)
sample_output

SequenceClassifierOutput(loss=None, logits=tensor([[-6.0313, -6.6475, -5.4563, -2.6830,  2.6301]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [21]:
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(torch.Tensor(sample_output.logits[0].cpu()))
probs

tensor([0.0024, 0.0013, 0.0043, 0.0640, 0.9328], grad_fn=<SigmoidBackward0>)

In [22]:
preds = np.zeros(probs.shape)
preds[np.where(probs >= 0.3)] = 1
preds

array([0., 0., 0., 0., 1.])

In [23]:
# Reshape preds to have a samples dimension
preds = preds.reshape(1, -1) ## Add extra dim
label.inverse_transform(preds)

array([5])
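
A fixed threshold can leave no class, or more than one class, set to 1. Since each review carries exactly one rating, an alternative is to pick the most probable class directly. A minimal NumPy sketch, assuming the binarizer's classes are the sorted ratings 1 to 5 (`LabelBinarizer` stores `classes_` in sorted order); the helper name `decode_rating` and the low-confidence logits are hypothetical.

```python
import numpy as np


def decode_rating(logits, classes):
    """Return the class whose sigmoid probability is highest."""
    logits = np.asarray(logits, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid
    return classes[int(np.argmax(probs))]


# Assumed class order; in the notebook this would come from label.classes_
classes = [1, 2, 3, 4, 5]

# Logits copied from the sample output above
sample_logits = [-6.0313, -6.6475, -5.4563, -2.6830, 2.6301]
print(decode_rating(sample_logits, classes))

# Hypothetical low-confidence logits: no probability reaches 0.3, argmax still decides
weak_logits = [-2.0, -1.5, -1.0, -1.2, -3.0]
print(decode_rating(weak_logits, classes))
```

For the sample review this agrees with the thresholded result; the two approaches only differ when the threshold selects zero or several classes.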