## 1. Introduction

Transformers have revolutionized the field of Natural Language Processing (NLP).  
Unlike traditional models that process text sequentially (like RNNs and LSTMs), transformers use **self-attention mechanisms** to capture relationships between words — regardless of their position in the sequence.

This allows transformer models to:
- Understand context more effectively
- Process text in parallel (faster training)
- Achieve state-of-the-art performance on a wide range of NLP tasks
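
To make the idea of self-attention concrete, here is a minimal sketch of plain scaled dot-product attention in pure Python. It is an illustration only: the token vectors are made up, the learned query/key/value projections of a real transformer are omitted, and it does not implement DeBERTa's disentangled variant.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Real transformers first apply learned query/key/value matrices;
    they are left out here to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token attends to every other token, wherever it sits in the sequence. The per-token loop is only for readability; in practice all rows are computed together as matrix products, which is what makes parallel processing possible.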

Among these models, **DeBERTa (Decoding-enhanced BERT with Disentangled Attention)** stands out as an architecture that improves on BERT and RoBERTa in two ways: disentangled attention, which represents each token's content and position with separate vectors, and an enhanced mask decoder used during pre-training.

---

In this notebook, we apply **DeBERTa-v3-small** to the **IMDB movie reviews dataset**, performing binary sentiment classification (**positive** or **negative**).  
Unlike the classical pipelines, where we built features such as TF-IDF vectors by hand, here the model's own tokenizer converts text to token IDs and the model produces contextual embeddings internally, which keeps the workflow short.

---

This notebook is part of a broader NLP project:

- 📘 [**NLP with IMDB: Classic Models (TF-IDF + BiLSTM)**](https://www.kaggle.com/code/ahmedgaitani/nlp-with-imdb-classic-models-tf-idf-bilstm): focused on traditional approaches
- 🤖 **This notebook**: focuses on fine-tuning a modern transformer (DeBERTa)
- 📊 **Next notebook**: will compare all models side by side using a range of evaluation metrics

By the end of this notebook, we will have:
- Fine-tuned a transformer model on IMDB reviews
- Generated predictions on the test set
- Saved results for later comparison

Let’s get started!


## 2. Install Dependencies

We start by installing the `evaluate` library from Hugging Face,  
which we use later to compute validation accuracy during fine-tuning; it also provides precision, recall, and F1-score.


In [1]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.4-py3-none-any.whl.metadata (9.5 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.4-py3-none-any.whl (84 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, evaluate
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cesium 0.12.4 requires 

## 3. Import Libraries

Here we import all necessary libraries used in this notebook:

- **Pandas / NumPy** for data manipulation
- **Scikit-learn** for splitting the dataset
- **Transformers & Datasets** from Hugging Face to load and fine-tune DeBERTa
- **Evaluate** for calculating model performance metrics
- **PyTorch** as the backend framework for training


In [2]:
# Core Libraries
import pandas as pd
import numpy as np

# Sklearn Tools
from sklearn.model_selection import train_test_split

# Transformers and Datasets
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          TrainingArguments, Trainer
)
from datasets import Dataset
import evaluate

# PyTorch
import torch

2025-07-03 00:27:24.535575: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751502444.725160      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751502444.779178      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## 4. Load Dataset

In this cell, we load the IMDB dataset using `pandas.read_csv()` and display the first 5 rows.  
This dataset contains 50,000 movie reviews labeled as either positive or negative, which we will use for sentiment analysis.


In [3]:
path = '/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv'
df = pd.read_csv(path)

# Show the first 5 rows
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. \<br />\<br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


## 5. Preprocessing

In this section, we prepare the dataset for training:
- Convert sentiment labels to numerical values (0/1)
- Split the data into train, validation, and test sets

### 5.1 Preprocess Labels

We convert the sentiment column from text labels ("positive", "negative") to binary format:

- `positive` → `1`
- `negative` → `0`

The model's classification head and loss function expect integer class IDs, so this mapping is required before fine-tuning DeBERTa.


In [4]:
# Map sentiment strings to integer class IDs (negative -> 0, positive -> 1)
df['sentiment'] = df['sentiment'].map({'negative': 0, 'positive': 1})

### 5.2 Split Dataset


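We split the dataset in two stages, stratifying on `sentiment` each time so every subset keeps the original class balance. First, we separate the test set:

- **80%** for training and validation
- **20%** for testing

Then, we further split the training-and-validation portion:

- **80%** → actual training set
- **20%** → validation set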

This results in:
- ~64% training
- ~16% validation
- 20% test
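
As a quick check of these percentages, the arithmetic can be worked through directly. This is a sketch using the 50,000-review dataset size stated earlier; the variable names are illustrative, and the real split is done by `train_test_split`.

```python
total_reviews = 50_000   # dataset size stated in Section 4
test_fraction = 0.2      # first split: hold out the test set
val_fraction = 0.2       # second split: fraction of the remaining rows

test_size = round(total_reviews * test_fraction)
remaining = total_reviews - test_size
val_size = round(remaining * val_fraction)
train_size = remaining - val_size

for name, size in [("train", train_size), ("validation", val_size), ("test", test_size)]:
    print(f"{name}: {size} rows ({size / total_reviews:.0%})")
```

The validation set is 20% of the remaining 80%, which is why it ends up at 16% of the full dataset rather than 20%.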


In [5]:
# Step 1: Split into train (80%) and test (20%)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['sentiment'])

# Step 2: Split train into train (80% of 80%) and val (20% of 80%) → 64% train, 16% val
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42, stratify=train_df['sentiment'])

# Show sizes
print(f"Train size: {len(train_df)}")
print(f"Validation size: {len(val_df)}")
print(f"Test size: {len(test_df)}")


Train size: 32000
Validation size: 8000
Test size: 10000


## 6. Tokenize Data and Fine-tune DeBERTa

In this section, we prepare our data and fine-tune a pre-trained **DeBERTa-v3-small** transformer model from Hugging Face for sentiment classification.

---

### 🔹 Step 1: Load Tokenizer and Model
We load the **DeBERTa tokenizer** and **pre-trained model** using the model name `"microsoft/deberta-v3-small"`.  
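The checkpoint was pre-trained on a large text corpus, so its encoder already carries general language representations; only the classification head is newly initialized and must be learned during fine-tuning.  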
We specify `num_labels=2` since our task is binary classification: **positive** vs. **negative** sentiment.

---

### 🔹 Step 2: Convert to Hugging Face Datasets
We convert our `train_df` and `val_df` from pandas DataFrames to `datasets.Dataset` format.  
This is required to use Hugging Face’s efficient tokenization and training utilities.

---

### 🔹 Step 3: Tokenization
We tokenize the review texts using the `tokenizer`, applying:
- **Truncation** to cut long reviews (limit = 512 tokens)
- **Padding** to make all sequences equal length (for batching)
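
The effect of truncation and padding can be sketched in a few lines of plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the token IDs and the `pad_id` value are hypothetical, and a real tokenizer also takes care of special tokens when truncating.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token IDs; a real tokenizer produces these from text.
short_review = [101, 7, 8, 102]
long_review = list(range(1, 11))

short_ids, short_mask = pad_and_truncate(short_review, max_length=6)
long_ids, long_mask = pad_and_truncate(long_review, max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions are padding so they can be ignored. Padding every review to the full 512 tokens is simple but spends compute on padding for short reviews; padding each batch only to its longest sequence is a common alternative.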

Unlike classical NLP pipelines where we manually build features (TF-IDF, Word2Vec),  
transformer models like DeBERTa start from **pre-trained contextual embeddings** and refine them during fine-tuning, so no separate embedding step is needed.

---

### 🔹 Step 4: Dataset Formatting
We:
- Remove the original `review` text column
- Rename the `sentiment` column to `labels` (as required by Hugging Face `Trainer`)
- Set the format to PyTorch tensors to enable model training

---

### 🔹 Step 5: Define Evaluation Metric
We use the `evaluate` library to load **accuracy** as our metric,  
and define a `compute_metrics()` function that:
- Gets logits from model output
- Applies `argmax` to get predicted labels
- Compares with true labels

This function will be used by the `Trainer` during evaluation.
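
To illustrate what this metric function computes, here is a dependency-free sketch of the same logic on made-up numbers. The logits and labels are hypothetical, and the helper names are not part of any library.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)

# Hypothetical logits for four reviews: [score_negative, score_positive].
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 3.0]]
toy_labels = [0, 1, 1, 1]

toy_predictions, toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_predictions, toy_accuracy)
```

The third review is the only miss: its logits slightly favor the negative class while the true label is positive, giving an accuracy of 3 out of 4.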

---

### 🔹 Step 6: Set Training Arguments
We configure training using `TrainingArguments`, including:
- `per_device_train_batch_size`, `learning_rate`, `num_train_epochs`, `weight_decay`
- Saving a checkpoint every epoch and reloading the best one at the end, based on validation accuracy
- Enabling `fp16` (mixed precision) for faster training on GPU
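
These settings determine how many optimizer steps the run takes. The sketch below works through that arithmetic; the single GPU and the gradient accumulation of 1 are assumptions, chosen because they are consistent with the `global_step=24000` reported by `trainer.train()` in this notebook.

```python
import math

train_rows = 32_000        # training split size reported in Section 5.2
per_device_batch = 4       # per_device_train_batch_size
epochs = 3                 # num_train_epochs
num_devices = 1            # assumption: a single GPU
grad_accum_steps = 1       # assumption: no gradient accumulation

effective_batch = per_device_batch * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

With a batch size of 4, each epoch needs 8,000 steps. A larger batch size would reduce the step count, at the cost of more GPU memory per step.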

---

### 🔹 Step 7: Initialize Trainer and Train Model
We initialize the Hugging Face `Trainer`, passing:
- The model
- Training and validation datasets
- Tokenizer
- Training arguments
- Evaluation function

Then we call `.train()` to fine-tune the model on our data.

---

### 💡 Why this is powerful in NLP:

Together, the DeBERTa tokenizer and model cover:
- **Tokenization**
- **Contextual Embedding**
- **Sequence modeling**

…all within one pre-trained package. This greatly simplifies NLP pipelines and typically improves performance on tasks such as sentiment analysis and question answering.

This step completes the model training phase — next, we will prepare the test data and generate predictions.


In [6]:
# 1. Load tokenizer and model
model_name = "microsoft/deberta-v3-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2. Convert pandas DataFrames to HuggingFace Datasets
train_dataset = Dataset.from_pandas(train_df[['review', 'sentiment']], preserve_index=False)
val_dataset = Dataset.from_pandas(val_df[['review', 'sentiment']], preserve_index=False)

# 3. Tokenization
def tokenize_function(example):
    return tokenizer(example['review'], truncation=True, padding='max_length', max_length=512)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Remove original text column
train_dataset = train_dataset.remove_columns(['review'])
val_dataset = val_dataset.remove_columns(['review'])

# Rename label column
train_dataset = train_dataset.rename_column("sentiment", "labels")
val_dataset = val_dataset.rename_column("sentiment", "labels")

# Set format for PyTorch
train_dataset.set_format("torch")
val_dataset.set_format("torch")

# 4. Define evaluation metric
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# 5. TrainingArguments
training_args = TrainingArguments(
    output_dir="./results_deberta",
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="none",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True
)

# 6. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

# 7. Train model
trainer.train()


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2787 | 0.242649 | 0.939125 |
| 2 | 0.1367 | 0.232967 | 0.9525 |
| 3 | 0.0896 | 0.291854 | 0.951875 |


TrainOutput(global_step=24000, training_loss=0.17883110904693603, metrics={'train_runtime': 4813.3262, 'train_samples_per_second': 19.945, 'train_steps_per_second': 4.986, 'total_flos': 1.2717323255808e+16, 'train_loss': 0.17883110904693603, 'epoch': 3.0})
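
Because `load_best_model_at_end=True` is combined with `metric_for_best_model="accuracy"`, the model kept after training is the checkpoint with the highest validation accuracy, not necessarily the last one. The sketch below reproduces that selection from the logged validation results. It is plain Python for illustration, not the `Trainer`'s internal code.

```python
# Validation results copied from the training log in this notebook.
history = [
    {"epoch": 1, "val_loss": 0.242649, "accuracy": 0.939125},
    {"epoch": 2, "val_loss": 0.232967, "accuracy": 0.9525},
    {"epoch": 3, "val_loss": 0.291854, "accuracy": 0.951875},
]

best_by_accuracy = max(history, key=lambda row: row["accuracy"])
best_by_loss = min(history, key=lambda row: row["val_loss"])
print(best_by_accuracy["epoch"], best_by_loss["epoch"])
```

Both criteria point to epoch 2. In epoch 3 the training loss keeps falling while the validation loss rises, which suggests the model is starting to overfit.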

## 7. Prepare Test Data & Generate Predictions

Now that our DeBERTa model is trained, we move on to testing it on unseen data (the test set).  
This step helps us evaluate how well the model generalizes beyond the training and validation data.

---

### 🔹 Step 1: Convert to Hugging Face Dataset
We convert the `test_df` to Hugging Face `Dataset` format so it can be processed just like the training and validation sets.

---

### 🔹 Step 2: Tokenization
We apply the same `tokenize_function` used during training to encode the test reviews into the appropriate format  
that DeBERTa expects — token IDs, attention masks, etc.

---

### 🔹 Step 3: Format and Clean the Dataset
We remove the original `review` column and rename the label column from `sentiment` to `labels`,  
as expected by the Hugging Face `Trainer`.

Finally, we format the dataset as PyTorch tensors so it can be passed into the model.

---

### 🔹 Step 4: Make Predictions
Using the `trainer.predict()` method, we run the trained model on the test set to generate predictions.

The output includes:
- **logits** (raw model outputs)
- **label_ids** (true labels)

We convert logits into final predicted class labels using `argmax`.

---

### 💡 Why this matters in NLP:

Evaluating on held-out test data gives an unbiased estimate of how the model performs on inputs it never saw during training or model selection.  
This ability to generalize is critical in NLP tasks like sentiment analysis, where new and varied language constantly appears.

This sets us up for the final step: saving predictions for comparison with other models.


In [7]:
# Convert test_df to HuggingFace Dataset (drop the pandas index, as for train/val)
test_dataset = Dataset.from_pandas(test_df[['review', 'sentiment']], preserve_index=False)

# Apply tokenization
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Remove the raw text column
test_dataset = test_dataset.remove_columns(['review'])

# Rename the target column to 'labels'
test_dataset = test_dataset.rename_column("sentiment", "labels")

# Set format to PyTorch tensors
test_dataset.set_format("torch")

# Make predictions using the trained model
predictions_output = trainer.predict(test_dataset)

# Extract true labels and predicted labels
y_pred = np.argmax(predictions_output.predictions, axis=1)
y_true = predictions_output.label_ids



## 8. Save Predictions

In this step, we:

- Convert the model's raw output logits into probabilities using the softmax function.
- Create a DataFrame that includes:
  - The original review
  - The true sentiment label
  - The predicted label
  - The predicted probabilities for both classes (negative and positive)
- Save this DataFrame to a CSV file named `deberta_preds.csv`.
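
For reference, the softmax step can be sketched without any libraries. The logits below are hypothetical; the notebook itself applies `torch.nn.functional.softmax` to the full prediction array.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 2.3]                 # hypothetical [negative, positive] scores
prob_negative, prob_positive = softmax(example_logits)
predicted_label = int(prob_positive > prob_negative)
print(round(prob_negative, 4), round(prob_positive, 4), predicted_label)
```

Softmax preserves the ranking of the logits, so the predicted label is the same whether `argmax` is applied to the logits or to the probabilities. The probabilities add a confidence value that is useful when comparing models.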

This file will be used later in the comparison notebook  
**"NLP with IMDB: Classic Models vs. Transformers"** to analyze model performance side by side.

✅ Saving the predictions lets the comparison notebook evaluate every model on the same test rows without retraining DeBERTa.


In [8]:
# Convert logits to probabilities
probs = torch.nn.functional.softmax(torch.tensor(predictions_output.predictions), dim=1).numpy()

# Create DataFrame with the results
results_df = pd.DataFrame({
    'review': test_df['review'].to_numpy(),  # positional order matches the predictions
    'true_label': y_true,
    'predicted_label': y_pred,
    'prob_negative': probs[:, 0],
    'prob_positive': probs[:, 1]
})

# Save to CSV
results_df.to_csv("deberta_preds.csv", index=False)

print("✅ Predictions and probabilities saved to deberta_preds.csv")


✅ Predictions and probabilities saved to deberta_preds.csv


## 9. Conclusion

In this notebook, we fine-tuned a **Transformer-based model (DeBERTa-v3-small)** to perform sentiment analysis on the IMDB movie reviews dataset.  
Unlike traditional models, DeBERTa learns contextual word representations and captures long-range dependencies in text — making it a powerful choice for modern NLP tasks.

---

This notebook is part of a larger workflow:

- ✅ In [NLP with IMDB: Classic Models (TF-IDF + BiLSTM)](https://www.kaggle.com/code/ahmedgaitani/nlp-with-imdb-classic-models-tf-idf-bilstm), we trained traditional models on the same dataset.
- 🔄 Here, we focused on leveraging DeBERTa as a modern alternative.
- 🔜 In the next notebook: **NLP with IMDB: Classic Models vs. Transformers**, we will compare the performance of:
  - Logistic Regression
  - BiLSTM
  - DeBERTa

Using:
- Accuracy, Precision, Recall, F1-score
- Confusion Matrix
- Visual comparisons
- Final insights and recommendations

---

📁 Predictions from this notebook have been saved to `deberta_preds.csv` and will be used in the upcoming comparison.

Stay tuned!
