## Model_1: Model with DistilBERT

Set up a basic pipeline using **standardized preprocessing from preprocess.py** and DistilBERT.

## 🔧 Steps:
1. Import libraries and load data using **preprocess.py functions**
2. Preprocessing: see the data_preprocessing.ipynb notebook
3. Vectorization: DistilBERT tokenizer (token IDs and attention masks, used here in place of TF-IDF)
4. Model: DistilBERT
5. Evaluation: accuracy, confusion matrix, classification report

## ✅ Purpose:
Establish a working pipeline using **standardized preprocessing functions** and match the baseline score (roughly 70-80% accuracy expected).

## Step 1: Import Libraries and Read Cleaned Data

In [34]:
# Step 1: Install dependencies (Trainer with PyTorch also needs accelerate>=0.26.0)
%pip install transformers datasets scikit-learn pandas matplotlib seaborn torch "accelerate>=0.26.0"

Note: you may need to restart the kernel to use updated packages.


In [35]:
%pip install --upgrade transformers

Note: you may need to restart the kernel to use updated packages.


In [36]:
%pip install transformers[torch]

Note: you may need to restart the kernel to use updated packages.
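Because `Trainer` with PyTorch requires `accelerate>=0.26.0`, it can be worth confirming what the kernel actually sees after installing (and after the restart the note mentions). The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the notebook.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("accelerate")
if version is None:
    print("accelerate is not installed; run: pip install 'accelerate>=0.26.0'")
else:
    print(f"accelerate {version} is installed")
```

Running this in a fresh cell after the kernel restart shows whether the install reached the environment the notebook is using.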


In [37]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
import matplotlib.pyplot as plt
import seaborn as sns
import os
import transformers
print(f"Transformers version: {transformers.__version__}") 

Transformers version: 4.52.3


## Step 2: Load and Verify Cleaned Data

In [38]:
df = pd.read_csv("data/cleaned_amazon_reviews_final.csv")
print(df.head())

                                                name       asins   brand  \
0  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
1  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
2  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
3  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
4  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   

                                          categories primaryCategories  \
0  Computers,Electronics Features,Tablets,Electro...       Electronics   
1  Computers,Electronics Features,Tablets,Electro...       Electronics   
2  Computers,Electronics Features,Tablets,Electro...       Electronics   
3  Computers,Electronics Features,Tablets,Electro...       Electronics   
4  Computers,Electronics Features,Tablets,Electro...       Electronics   

                                           imageURLs  doRecommend  rating  \
0  https://pisces.bby

In [39]:
try:
    df = pd.read_csv('data/cleaned_amazon_reviews_final.csv')
    print('Dataset Head:')
    print(df.head())
    print('\nColumn Names:')
    print(df.columns)
    if 'clean_text' not in df.columns or 'label' not in df.columns:
        raise KeyError("Required columns 'clean_text' and 'label' not found in dataset.")
    df = df[['clean_text', 'label']].dropna()
    print('\nUnique Labels:')
    print(df['label'].unique())
    if df.empty:
        raise ValueError('DataFrame is empty after filtering! Check data loading and file path.')
    if not all(df['label'].apply(lambda x: isinstance(x, (int, np.integer)))):
        raise ValueError("Labels must be integers (e.g., 0, 1, 2). Check 'label' column.")
    label_encoder = LabelEncoder()
    label_encoder.fit(df['label'])
    print(f"Label classes: {label_encoder.classes_}")
except FileNotFoundError:
    print("Error: 'data/cleaned_amazon_reviews_final.csv' not found. Verify file path.")
    raise
except KeyError as e:
    print(f"Error: {e}")
    raise
except Exception as e:
    print(f"Error loading data: {e}")
    raise

Dataset Head:
                                                name       asins   brand  \
0  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
1  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
2  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
3  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   
4  Amazon Kindle E-Reader 6" Wifi (8th Generation...  B00ZV9PXP2  Amazon   

                                          categories primaryCategories  \
0  Computers,Electronics Features,Tablets,Electro...       Electronics   
1  Computers,Electronics Features,Tablets,Electro...       Electronics   
2  Computers,Electronics Features,Tablets,Electro...       Electronics   
3  Computers,Electronics Features,Tablets,Electro...       Electronics   
4  Computers,Electronics Features,Tablets,Electro...       Electronics   

                                           imageURLs  doRecommend  rating  \
0  http

In [40]:
# Reload the cleaned data (this overwrites the filtered df from the previous cell)
df = pd.read_csv("data/cleaned_amazon_reviews_final.csv")

# View column names
print(df.columns)

# Keep only the raw review text and the 1-5 star rating, dropping rows with missing values
df = df[['text', 'rating']].dropna()
print(df['rating'].unique())
if df.empty:
    raise ValueError('DataFrame is empty! Please check data loading and file path.')

# Fit the encoder on the ratings so they can be mapped to 0-based class indices
label_encoder = LabelEncoder()
label_encoder.fit(df['rating'])

Index(['name', 'asins', 'brand', 'categories', 'primaryCategories',
       'imageURLs', 'doRecommend', 'rating', 'text', 'sourceURLs',
       'full_review', 'label', 'clean_text'],
      dtype='object')
[3 5 4 1 2]
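The ratings printed above run from 1 to 5, whereas a classification head with five labels expects class indices 0 to 4. As a rough illustration, the mapping a fitted `LabelEncoder` applies (sorted unique values to consecutive integers) can be sketched in plain Python. The helper `build_label_mapping` is hypothetical and only mirrors that behaviour.

```python
def build_label_mapping(values):
    """Map each distinct value to a 0-based index, in sorted order."""
    classes = sorted(set(values))
    return {value: index for index, value in enumerate(classes)}

ratings = [3, 5, 4, 1, 2]  # order printed by df['rating'].unique()
rating_to_id = build_label_mapping(ratings)
id_to_rating = {index: value for value, index in rating_to_id.items()}

encoded = [rating_to_id[r] for r in ratings]
print(rating_to_id)  # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
print(encoded)       # [2, 4, 3, 0, 1]
```

Keeping the inverse mapping (`id_to_rating`) around makes it straightforward to report predictions as star ratings later.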


## Step 3: Train/Test Split

In [41]:
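# Encode ratings 1-5 as class indices 0-4; the classification head expects 0-based labels
encoded_ratings = label_encoder.transform(df['rating']).tolist()
X_train, X_test, Y_train, Y_test = train_test_split(
    df['text'].tolist(),
    encoded_ratings,
    test_size=0.2,
    random_state=42
)
train_dataset = Dataset.from_dict({'Review': X_train, 'Rating': Y_train})
val_dataset = Dataset.from_dict({'Review': X_test, 'Rating': Y_test})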

## Step 4: Vectorization with DistilBERT Tokenizer 

In [42]:
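tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_function(examples):
    # Pad or truncate every review to exactly 128 tokens
    return tokenizer(examples['Review'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Trainer passes the column named 'labels' to the model, so rename 'Rating'
train_dataset = train_dataset.rename_column('Rating', 'labels')
val_dataset = val_dataset.rename_column('Rating', 'labels')
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])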

Map:   0%|          | 0/3508 [00:00<?, ? examples/s]

Map:   0%|          | 0/877 [00:00<?, ? examples/s]
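To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the token IDs are made up, the helper `pad_or_truncate` is hypothetical, and the real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # truncation: drop the overflow
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

Every example ends up with the same length, which is what lets the batches be stacked into fixed-size tensors; the attention mask tells the model which positions to ignore.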

## Step 5: Model Setup

In [43]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=len(df['rating'].unique())
)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy='epoch',  
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy'
)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, predictions)
    return {'accuracy': acc}
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>=0.26.0'`
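Apart from the import error, the schedule in `training_args` is worth a quick sanity check: with 3508 training examples, `warmup_steps=500` covers most of the run. The arithmetic below is a sketch that assumes a single device and `gradient_accumulation_steps=1`, neither of which the notebook states explicitly.

```python
import math

num_train_examples = 3508  # from the Map progress bar for the training split
batch_size = 16            # per_device_train_batch_size
num_epochs = 3             # num_train_epochs
warmup_steps = 500

# Assumes one device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch)           # 220
print(total_steps)               # 660
print(f"{warmup_fraction:.0%}")  # 76%
```

If that share is larger than intended, a smaller `warmup_steps` or the `warmup_ratio` argument of `TrainingArguments` is one option to consider.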

## Step 6: Model Training and Evaluation

In [None]:
trainer.train()
eval_results = trainer.evaluate()
print('Evaluation Results:')
print(eval_results)
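predictions = trainer.predict(val_dataset)
y_pred = np.argmax(predictions.predictions, axis=1)
# Use the labels the Trainer evaluated against, so y_true and y_pred share one encoding
y_true = predictions.label_ids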
accuracy = accuracy_score(y_true, y_pred)
print(f'Validation Accuracy: {accuracy:.4f}')
print('\nClassification Report:')
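The predicted values are class indices, not star ratings. A minimal sketch of turning logits into ratings follows, assuming the ratings were encoded as 0-based indices in sorted order; the logits are invented for illustration and the `argmax` helper is a plain-Python stand-in for `np.argmax`.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

classes = [1, 2, 3, 4, 5]  # assumed: sorted star ratings, position = class index

# Made-up logits for two reviews, one row per review
toy_logits = [
    [-1.2, -0.5, 0.1, 0.9, 2.3],
    [1.7, 0.4, -0.3, -0.8, -1.1],
]

predicted_ids = [argmax(row) for row in toy_logits]
predicted_ratings = [classes[i] for i in predicted_ids]
print(predicted_ids)      # [4, 0]
print(predicted_ratings)  # [5, 1]
```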

## Step 7: Confusion Matrix

In [None]:
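# target_names must be strings; the encoder's classes are the integer ratings
class_names = [str(c) for c in label_encoder.classes_]
print(classification_report(y_true, y_pred, target_names=class_names))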
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
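For reading the heatmap, it helps to recall how the matrix is built: rows are true classes, columns are predicted classes, and the diagonal holds the correct predictions. The sketch below reproduces that counting in plain Python on toy labels, which are invented and not results from this notebook; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Toy 0-based labels, for illustration only
toy_true = [0, 1, 2, 2, 4, 4, 4]
toy_pred = [0, 2, 2, 2, 4, 3, 4]

matrix = confusion_counts(toy_true, toy_pred, num_classes=5)
correct = sum(matrix[i][i] for i in range(5))  # diagonal = correct predictions
accuracy = correct / len(toy_true)

for row in matrix:
    print(row)
print(f"{accuracy:.3f}")  # 0.714
```

Off-diagonal cells close to the diagonal (for example, a true 4 predicted as 3) indicate near misses between adjacent ratings, which is often the dominant error pattern for star-rating data.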