# **Sentiment Analysis on the IMDB Dataset with DeBERTa-v3-small**

This notebook fine-tunes DeBERTa-v3-small for binary sentiment classification (positive/negative) of IMDB movie reviews. The small variant keeps compute requirements modest compared with the larger DeBERTa checkpoints.

🤖 **Model:** microsoft/deberta-v3-small (about 44M backbone parameters plus roughly 98M embedding parameters, per the model card)


**Dataset Information:**  
The dataset used in this notebook is the ["IMDB Dataset of 50K Movie Reviews"](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) by lakshmi25npathi, which contains 50,000 labeled movie reviews for binary sentiment classification (positive/negative).

In [6]:
from kagglehub import KaggleDatasetAdapter
import pandas as pd
import kagglehub
from bs4 import BeautifulSoup

In [4]:
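# The dataset is loaded directly from Kaggle via kagglehub
file_path = "IMDB Dataset.csv"

# dataset_load is the current name; load_dataset is deprecated in recent kagglehub releases
data = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    file_path,
)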


In [5]:
df = data.copy()
df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | `A wonderful little production. <br /><br />The...` | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


In [7]:
df["review"] = df["review"].str.lower()

In [8]:
def remove_html_tags(text):
    """Strip HTML markup such as <br /> and return only the visible text."""
    return BeautifulSoup(text, "html.parser").get_text()


df["review"] = df["review"].apply(remove_html_tags)
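
To make the effect of this step concrete, here is a dependency-free sketch of the same idea built on the standard library's `html.parser` instead of BeautifulSoup. The names `_TextCollector` and `strip_tags_stdlib` and the sample string are made up for illustration. Unlike `get_text()` with its default empty separator, this variant joins text fragments with a space, so words on either side of a `<br />` do not run together.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_stdlib(text: str) -> str:
    parser = _TextCollector()
    parser.feed(text)
    parser.close()  # flushes any text still buffered by the parser
    # Join fragments with a space, then collapse repeated whitespace
    return " ".join(" ".join(parser.parts).split())


sample = "a great cast.<br /><br />the pacing drags"  # invented example
cleaned = strip_tags_stdlib(sample)
print(cleaned)
```

Whether to insert a separator is a judgment call: without one, sentence boundaries marked only by `<br />` tags disappear, which changes how the text is tokenized.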

In [47]:
import torch
from transformers import AutoTokenizer, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from datasets import Dataset

model_name = "microsoft/deberta-v3-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
MAX_LEN = 512


In [48]:
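def format_input(row):
    # Only the review goes into the model input. Appending
    # f"Sentiment: {row['sentiment']}" here, as the run logged below did,
    # hands the target label to the model and inflates every score.
    return f"Review: {row['review']}"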

In [49]:
df["text"] = df.apply(format_input, axis=1)
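
Whatever ends up in the `text` column is exactly what the classifier reads, so the target must never appear in it. The toy sketch below uses two invented rows and hypothetical helper names (`with_label`, `read_last_line`) to show why: if a `Sentiment:` line is appended to the input, a rule that never looks at the review still recovers every label.

```python
def with_label(review: str, sentiment: str) -> str:
    # Hypothetical formatter that appends the target to the model input
    return f"Review: {review}\nSentiment: {sentiment}"


def read_last_line(text: str) -> str:
    # "Classifier" that ignores the review and just reads the appended label
    return text.rsplit("Sentiment: ", 1)[-1]


# Invented rows, for illustration only
rows = [
    ("dull and far too long", "negative"),
    ("a warm, funny film", "positive"),
]

leaky_hits = [read_last_line(with_label(r, s)) == s for r, s in rows]
review_only = [f"Review: {r}" for r, _ in rows]

print(all(leaky_hits))                              # the shortcut always works
print(any("Sentiment:" in t for t in review_only))  # no shortcut without the label line
```

A fine-tuned transformer can pick up the same shortcut, so accuracy measured on inputs that contain the label says little about how well the model reads reviews.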

In [50]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [51]:
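label_map = {"negative": 0, "positive": 1}
# Work on explicit copies so pandas does not emit a SettingWithCopyWarning
train_df, test_df = train_df.copy(), test_df.copy()
train_df["label"] = train_df["sentiment"].map(label_map)
test_df["label"] = test_df["sentiment"].map(label_map)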

In [52]:
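COLS = ["text", "label"]
# preserve_index=False avoids carrying the shuffled pandas index along as an extra column
train_ds = Dataset.from_pandas(train_df[COLS], preserve_index=False)
test_ds = Dataset.from_pandas(test_df[COLS], preserve_index=False)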

In [53]:
def tokenize_function(batch):
    # Pad every review to MAX_LEN tokens and cut longer ones off at that length
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=MAX_LEN)
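
As a rough mental model of `padding="max_length"` combined with `truncation=True`, the sketch below pads or cuts a list of token ids to a fixed length and builds the matching attention mask. It is a simplification, not the tokenizer's implementation: the function name `pad_or_truncate` and the default `pad_id=0` are assumptions, and the real tokenizer also adds special tokens and keeps them when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input ids plus an attention mask (simplified sketch)."""
    ids = list(token_ids[:max_length])   # truncation: drop everything past max_length
    n_pad = max_length - len(ids)        # padding needed for short sequences
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 0 marks padding positions
    }


short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate(list(range(10, 20)), max_length=5)
print(short)
print(long)
```

Two consequences follow for this notebook: anything at the end of a review longer than the limit is never seen by the model, and padding every short review to 512 tokens costs compute that dynamic padding would avoid.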

In [54]:
tokenized_train = train_ds.map(tokenize_function, batched=True)
tokenized_test = test_ds.map(tokenize_function, batched=True)

In [55]:
from transformers import DebertaV2ForSequenceClassification

model = DebertaV2ForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [60]:
training_args = TrainingArguments(
    output_dir="./sonuclar",  # "sonuclar" is Turkish for "results"
    eval_strategy="steps",
    eval_steps=750,
    save_steps=750,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    logging_dir="./logs",
    metric_for_best_model="accuracy",
    load_best_model_at_end=True,
    save_total_limit=2,
    logging_steps=100,
    report_to="none",
    fp16=True,
)
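
The step counts in the training log follow from these arguments. The sketch below works through the arithmetic for the 40,000-review training split, assuming a single device; `training_schedule` is a hypothetical helper, and the Trainer's own rounding may differ slightly across library versions when the numbers do not divide evenly.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      eval_steps, num_devices=1):
    """Estimate optimizer steps and evaluation count (single-device assumption by default)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer step is taken every `grad_accum` batches
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": per_device_batch * num_devices * grad_accum,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_steps,
    }


schedule = training_schedule(
    num_examples=40_000, per_device_batch=16, grad_accum=2, epochs=3, eval_steps=750
)
print(schedule)
```

With these settings the effective batch size is 32, which gives 1,250 optimizer steps per epoch and 3,750 in total, so evaluating every 750 steps yields five checkpoints to choose from.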

In [61]:
import numpy as np
from sklearn.metrics import accuracy_score


def compute_metrics(eval_pred):
    # eval_pred unpacks into NumPy arrays: logits of shape (n_examples, 2) and integer labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

In [62]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

In [63]:
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 750 | 0.0464 | 0.041565 | 0.984 |
| 1500 | 0.03 | 0.04706 | 0.9869 |
| 2250 | 0.0201 | 0.039079 | 0.9883 |
| 3000 | 0.0132 | 0.043568 | 0.9889 |
| 3750 | 0.0075 | 0.047542 | 0.9886 |


TrainOutput(global_step=3750, training_loss=0.027821315022309622, metrics={'train_runtime': 4399.217, 'train_samples_per_second': 27.278, 'train_steps_per_second': 0.852, 'total_flos': 1.589665406976e+16, 'train_loss': 0.027821315022309622, 'epoch': 3.0})

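The cells below compute precision, recall, and F1 on the held-out split. `test_results` holds the predictions for the same 10,000 reviews that were used for evaluation during training, so these are validation results rather than scores on an independent test set. The logged scores also come from a run whose input text contained a `Sentiment: <label>` line, which lets the model read the answer from its input wherever truncation does not cut that line off; treat them as inflated.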

In [65]:
test_results = trainer.predict(tokenized_test)
predictions = test_results.predictions
true_labels = test_results.label_ids

In [67]:
import numpy as np

In [68]:
predicted_labels = np.argmax(predictions, axis=-1)

In [70]:
from sklearn.metrics import classification_report

report = classification_report(
    true_labels,
    predicted_labels,
    target_names=["negative", "positive"],
    digits=4
)
print(report)

              precision    recall  f1-score   support

    negative     0.9875    0.9901    0.9888      4961
    positive     0.9903    0.9877    0.9890      5039

    accuracy                         0.9889     10000
   macro avg     0.9889    0.9889    0.9889     10000
weighted avg     0.9889    0.9889    0.9889     10000
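
The F1 column can be cross-checked by hand, since F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values printed in the report; `f1_from` is a hypothetical helper, and because the inputs are already rounded to four digits the result is only expected to agree after rounding.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the classification report
reported = {"negative": (0.9875, 0.9901), "positive": (0.9903, 0.9877)}

f1 = {name: round(f1_from(p, r), 4) for name, (p, r) in reported.items()}
macro_f1 = round(sum(f1.values()) / len(f1), 4)  # unweighted mean over the two classes
print(f1, macro_f1)
```

The recomputed values match the report's per-class and macro-averaged F1. The macro and weighted averages coincide here because the two classes have nearly equal support (4,961 vs. 5,039).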

