## Sentiment analysis using transformers
In this notebook I use the Hugging Face `transformers` library for sentiment analysis of the IMDB reviews provided by the `keras` library. For this purpose I fine-tune `distilbert-base-uncased` on the dataset.

## Step 1: Load the IMDB Dataset

We begin by importing the IMDB sentiment classification dataset using Keras. This dataset contains 50,000 movie reviews, split evenly into 25,000 training and 25,000 test reviews. The reviews are already preprocessed: each review is encoded as a sequence of word indices (integers).

- `x_train` and `x_test` contain the sequences of word indices for each review.
- `y_train` and `y_test` contain binary sentiment labels:  
  - `1` = positive review  
  - `0` = negative review


In [None]:
# Import the IMDB dataset from Keras's built-in datasets module
from tensorflow.keras.datasets import imdb
# training and testing splits
(x_train, y_train), (x_test, y_test) = imdb.load_data()

In [None]:
# Load the word index dictionary that maps words to their integer indices
word_index = imdb.get_word_index()

In [None]:
# Preview a few (word, index) pairs instead of printing the whole vocabulary
list(word_index.items())[:10]

## Step 2: Decode Reviews and Create DataFrames

The IMDB dataset provides reviews as sequences of integer word indices. In this step, we convert those sequences back into human-readable text for inspection and further processing.

### Key Actions:
- **Build `index_word`**: Create a reverse lookup dictionary to map indices back to words, while accounting for reserved tokens (`<PAD>`, `<START>`, etc.).
- **Decode**: Use the `decode_review()` function to convert each list of integers into a sentence.
- **Clean**: Strip out special tokens using `clean_review()` for cleaner inputs.
- **Store**: Wrap the processed reviews and labels into two `pandas` DataFrames: `train_df` and `test_df`.
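
The index shift is easiest to see on a toy vocabulary. The sketch below uses a made-up three-word `toy_word_index` (an assumption for illustration, not the real Keras vocabulary) and mirrors the build, decode, and clean actions listed above. Its `clean` helper filters tokens rather than using string replacement, so it is a variant of `clean_review()`, not a copy of it.

```python
# Toy vocabulary standing in for imdb.get_word_index(); word ranks start at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# load_data() shifts every word index by 3 by default, so ids 0-3 stay reserved.
RESERVED = {0: "<PAD>", 1: "<START>", 2: "<UNK>", 3: "<UNUSED>"}
toy_index_word = {rank + 3: word for word, rank in toy_word_index.items()}
toy_index_word.update(RESERVED)

def decode(seq):
    """Map integer ids back to tokens; ids missing from the mapping become '?'."""
    return " ".join(toy_index_word.get(i, "?") for i in seq)

def clean(text):
    """Drop reserved tokens; joining on single spaces avoids leftover gaps."""
    return " ".join(tok for tok in text.split() if tok not in RESERVED.values())

encoded = [1, 4, 5, 2, 6]
decoded = decode(encoded)
print(decoded)         # <START> the movie <UNK> great
print(clean(decoded))  # the movie great
```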


In [None]:
# Shift the original word index values by 3 to account for reserved tokens
index_word = {v + 3: k for k, v in word_index.items()}

In [None]:
# Add special tokens for padding, start of review, unknown words, etc.

index_word[0] = "<PAD>"
index_word[1] = "<START>"
index_word[2] = "<UNK>"
index_word[3] = "<UNUSED>"

In [None]:
# Function to decode integer sequences into readable text using the index_word mapping

def decode_review(seq):
    return ' '.join([index_word.get(i, "?") for i in seq])

In [None]:
# Decode all training and testing reviews from integer sequences to text

train_texts = [decode_review(seq) for seq in x_train]
test_texts = [decode_review(seq) for seq in x_test]

In [None]:
# Function to remove special tokens from the decoded reviews

def clean_review(text):
    for special_token in ["<PAD>", "<START>", "<UNK>", "<UNUSED>"]:
        text = text.replace(special_token, "")
    # Collapse the double spaces left behind where tokens were removed
    return " ".join(text.split())

train_texts = [clean_review(t) for t in train_texts]
test_texts = [clean_review(t) for t in test_texts]

In [None]:
# Create pandas DataFrames for both train and test sets with text and corresponding label
import pandas as pd

train_df = pd.DataFrame({
    "text": train_texts,
    "label": y_train
})

test_df = pd.DataFrame({
    "text": test_texts,
    "label": y_test
})

## Step 3: Tokenize and Prepare PyTorch Datasets

Now that we have cleaned text reviews, we prepare them for use with the DistilBERT transformer model.

### Key Actions:

- **Tokenizer Initialization**:
  - We load the `distilbert-base-uncased` tokenizer from Hugging Face using `AutoTokenizer`.
  - The tokenizer converts raw text into model-ready input: `input_ids`, `attention_mask`, etc.

- **Tokenization**:
  - We tokenize both training and test sets with `truncation=True`, which cuts each review to the model's maximum input length (512 tokens for DistilBERT).
  - `padding=False` lets the `DataCollatorWithPadding` handle padding dynamically during training.

- **Data Collator**:
  - `DataCollatorWithPadding` will automatically pad batches to the longest sequence in the batch, saving memory and computation.

- **Custom PyTorch Dataset**:
  - `IMDbDataset` wraps tokenized inputs and labels into a PyTorch-compatible format.
  - Each item is a dictionary of tensors that can be directly passed to the model.

This prepares the data to be loaded efficiently by a `DataLoader` for fine-tuning the model.
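
To make the truncation and dynamic padding ideas concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not the implementation used by the tokenizer or by `DataCollatorWithPadding`: the helper names `truncate` and `pad_batch`, the token ids, and the small `max_length` are all assumptions, and unlike the real tokenizer this `truncate` does not preserve the trailing special token.

```python
def truncate(ids, max_length):
    """Keep at most max_length token ids (the effect truncation=True aims for)."""
    return ids[:max_length]

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three reviews of different lengths.
reviews = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 9, 102]]
reviews = [truncate(ids, max_length=5) for ids in reviews]

# A batch of the first two reviews is padded to length 4, not to the global maximum.
batch = pad_batch(reviews[:2])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because padding is computed per batch, a batch of short reviews never pays for the longest review in the dataset.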


In [None]:
# Load the tokenizer for the DistilBERT model
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
# Tokenize the training and testing text data
# `truncation=True` ensures inputs are not too long for the model
# `padding=False` because padding will be handled dynamically later
train_encodings = tokenizer(train_texts, truncation=True, padding=False)
test_encodings = tokenizer(test_texts, truncation=True, padding=False)

In [None]:
# Data collator will dynamically pad the inputs at runtime during batching
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
train_encodings[0]

In [None]:
# PyTorch dataset class to wrap encodings and labels together
from torch.utils.data import Dataset
import torch

class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
# create dataset with appropriate format for our model
train_dataset = IMDbDataset(train_encodings, y_train)
test_dataset = IMDbDataset(test_encodings, y_test)

## Step 4: Load Model and Fine-Tune on IMDB Sentiment Data

We now fine-tune the pretrained `distilbert-base-uncased` model using Hugging Face’s high-level `Trainer` API.

- We configure training parameters using `TrainingArguments`.
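
The configuration in this step evaluates once per epoch, selects the best checkpoint by F1, and stops early with a patience of 1. The sketch below is a rough illustration of that selection logic only, not the `Trainer` implementation; the function name `select_best_epoch` and the F1 scores are made up for the example.

```python
def select_best_epoch(f1_per_epoch, patience=1):
    """Return (best_epoch, last_epoch_run) for a list of per-epoch F1 scores.

    Higher F1 counts as better. Training stops once F1 has failed to
    improve for `patience` consecutive evaluations.
    """
    best_epoch, best_f1, bad_evals = None, float("-inf"), 0
    for epoch, f1 in enumerate(f1_per_epoch, start=1):
        if f1 > best_f1:
            best_epoch, best_f1, bad_evals = epoch, f1, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_epoch, epoch
    return best_epoch, len(f1_per_epoch)

# Made-up scores, not results from this notebook.
print(select_best_epoch([0.90, 0.92, 0.91, 0.93]))  # (2, 3)
```

In this made-up run, training stops after epoch 3 and the epoch 2 checkpoint is the one kept, even though a later epoch would have scored higher.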


In [None]:
# Load the DistilBERT model for binary classification (positive/negative sentiment)

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)

    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    acc = accuracy_score(labels, predictions)

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
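
For intuition about what `compute_metrics` reports, the same four numbers can be worked out by hand on a tiny batch. This is a dependency-free sketch with made-up logits and labels; `binary_metrics` is a hypothetical helper written for this illustration, and it treats class `1` as the positive class, matching `average="binary"`.

```python
def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 with class 1 as the positive class."""
    preds = [row.index(max(row)) for row in logits]  # argmax over the class axis
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up logits for four reviews; predictions come out as [0, 1, 1, 0].
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
labels = [0, 1, 0, 1]
print(binary_metrics(logits, labels))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```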

In [None]:
!pip install wandb
!wandb login

In [None]:
import wandb
wandb.init(project="distilbert-imdb", name="run-with-tweaks")

In [None]:
# Define training configurations
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./results",            # Directory to save model checkpoints
    eval_strategy="epoch",             # Run evaluation at the end of every epoch
    save_strategy="epoch",             # Save a checkpoint at the end of every epoch
    learning_rate=2e-5,                # Initial learning rate (can be tuned)
    per_device_train_batch_size=8,     # Batch size for training
    per_device_eval_batch_size=64,     # Batch size for evaluation
    num_train_epochs=2,                # Number of training epochs
    weight_decay=0.1,                  # Weight decay (regularization) to reduce overfitting
    logging_dir="./logs",              # Where to write logs
    load_best_model_at_end=True,       # Reload the best checkpoint (by F1) when training ends
    save_total_limit=1,                # Limit how many checkpoints are kept on disk
    metric_for_best_model="f1",        # Use eval F1 to select the best model
    greater_is_better=True,            # Higher F1 is better
    report_to="wandb",                 # Send logs to Weights & Biases
    run_name="run-with-tweaks",        # Run name shown in W&B
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]
)

In [None]:
# Start the training process
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
import wandb

# Get predictions on test set
preds_output = trainer.predict(test_dataset)
preds = np.argmax(preds_output.predictions, axis=1)
labels = preds_output.label_ids

# Compute and log confusion matrix
wandb.log({
    "confusion_matrix": wandb.plot.confusion_matrix(
        probs=None,
        y_true=labels,
        preds=preds,
        class_names=["negative", "positive"]
    )
})

In [None]:
wandb_table = wandb.Table(columns=["Text", "True Label", "Predicted Label"])

for i in range(10):  # Log 10 examples
    text = test_df.iloc[i]["text"]
    # Cast NumPy integers to plain Python ints before logging
    true_label = int(test_df.iloc[i]["label"])
    pred_label = int(preds[i])
    wandb_table.add_data(text, true_label, pred_label)

wandb.log({"predictions_table": wandb_table})

In [None]:
wandb.finish()

In [None]:
# Save model locally
trainer.save_model("imdb-sentiment-analysis")

In [None]:
!hf auth login

In [None]:
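# Attach human-readable label names to the config, then save again so they are stored with the model
model.config.id2label = {0: "NEGATIVE", 1: "POSITIVE"}
model.config.label2id = {"NEGATIVE": 0, "POSITIVE": 1}
trainer.save_model("imdb-sentiment-analysis")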

In [None]:
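# The first positional argument of Trainer.push_to_hub is the commit message, not the repo name.
# The target repo comes from `hub_model_id` in TrainingArguments (by default, the name of `output_dir`).
trainer.push_to_hub(commit_message="imdb-sentiment-analysis")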

In [None]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

result = sentiment_pipeline("Visually, this film was a standout with stunning cinematography, impressive CGI, and a unique retro aesthetic that felt refreshingly different from the usual superhero fare. The story was fairly basic and the short runtime didn't allow enough depth to fully explore the villain's motivations. The casting was nearly perfect, with every actor bringing their character to life in a way that made you love the team not just as a whole but as individuals. Overall, I'd rate it a solid 7.2 out of 10. It's a fun, visually rich ride and I'm genuinely excited to see where the story goes next. The post-credit scene was a 9 out of 10, one of the most dramatic and thrilling I've seen in years.")
print(result)
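
The pipeline returns a label and a score for each input. As a rough sketch of how a two-class result like that can be derived from raw logits (softmax, then the most probable class looked up in `id2label`), assuming made-up logits and hypothetical helpers `softmax` and `to_prediction` that are not part of `transformers`:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Same mapping as the one stored in model.config above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def to_prediction(logits):
    """Turn one row of logits into a {'label', 'score'} dictionary."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for a single review.
print(to_prediction([-1.2, 2.3]))
```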