# Sentiment Analysis using Machine Learning and LLM (DistilBERT)
**Author:** Reza Adousi  
**Date:** 2025  
**Dataset:** Amazon Product Reviews  
**Goal:** Compare traditional ML models (Logistic Regression, SVM, Naive Bayes) with a modern LLM (DistilBERT), then combine them in a stacking ensemble for sentiment prediction.

In [None]:
# === Install requirements ===
!pip install -q gdown pandas numpy scikit-learn transformers torch

## 1. Data Loading and Cleaning

In [2]:
!gdown 1zBrBGoteMOCnU8MW8N291G03kFeU_CJT # download dataset from google drive

import pandas as pd
df = pd.read_csv('./Reviews.csv')  # load reviews
print(df.info())
print(df.isnull().sum())

Downloading...
From (original): https://drive.google.com/uc?id=1zBrBGoteMOCnU8MW8N291G03kFeU_CJT
From (redirected): https://drive.google.com/uc?id=1zBrBGoteMOCnU8MW8N291G03kFeU_CJT&confirm=t&uuid=a01537ba-9674-41c3-9b36-0b89f8d4d5e2
To: /content/Reviews.csv
100% 301M/301M [00:02<00:00, 106MB/s] 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568428 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text

In [3]:
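df.drop_duplicates(inplace=True)   # remove exact duplicate rows
df.dropna(inplace=True)            # drop rows with any missing field
df = df[df['Score'] != 3].copy()   # drop neutral 3-star reviews; copy() avoids SettingWithCopyWarning when adding columns later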

print(df.info())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
Index: 525763 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      525763 non-null  int64 
 1   ProductId               525763 non-null  object
 2   UserId                  525763 non-null  object
 3   ProfileName             525763 non-null  object
 4   HelpfulnessNumerator    525763 non-null  int64 
 5   HelpfulnessDenominator  525763 non-null  int64 
 6   Score                   525763 non-null  int64 
 7   Time                    525763 non-null  int64 
 8   Summary                 525763 non-null  object
 9   Text                    525763 non-null  object
dtypes: int64(5), object(5)
memory usage: 44.1+ MB
None
Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time  

In [4]:
# Binary label: 1 = positive (4-5 stars), 0 = negative (1-2 stars)
df['label'] = df['Score'].apply(lambda x: 1 if x >= 4 else 0)
texts = df['Text'].tolist()
labels = df['label'].tolist()
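
The labeling rule above can be checked on a toy frame, and the resulting class balance inspected before training. This is a minimal sketch, assuming a small hypothetical DataFrame `demo` in place of the real reviews:

```python
import pandas as pd

# Hypothetical toy data standing in for the review scores (3-star rows already removed)
demo = pd.DataFrame({"Score": [5, 4, 1, 2, 5, 5]})

# Same rule as above: 4-5 stars -> 1 (positive), 1-2 stars -> 0 (negative)
demo["label"] = (demo["Score"] >= 4).astype(int)

# Share of each class; a strong skew means accuracy alone can be misleading
balance = demo["label"].value_counts(normalize=True).sort_index()
print(balance)
```

On the real data, the same `value_counts(normalize=True)` call on `df['label']` would show how skewed the classes are.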

## 2. Train-Test Split and Text Vectorization

In [5]:
from sklearn.model_selection import train_test_split

import numpy as np

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))

420610 105153


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF
vectorizer = TfidfVectorizer(max_features=500)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


#print(vectorizer.get_feature_names_out())

## 3. Classical Machine Learning Models

We train three baseline models:
- Logistic Regression  
- Support Vector Machine (SVM)  
- Naive Bayes  
We will evaluate each model using accuracy.

In [7]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train_tfidf, y_train)
lr_probs_train = lr.predict_proba(X_train_tfidf)[:,1]
lr_probs_test = lr.predict_proba(X_test_tfidf)[:,1]


LR_pred = (lr_probs_test >= 0.5).astype(int)


In [8]:
# SVM
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svc = LinearSVC()

svm_calibrated = CalibratedClassifierCV(svc)
svm_calibrated.fit(X_train_tfidf, y_train)
svm_probs_train = svm_calibrated.predict_proba(X_train_tfidf)[:,1]
svm_probs_test = svm_calibrated.predict_proba(X_test_tfidf)[:,1]


SVC_pred = (svm_probs_test >= 0.5).astype(int)

In [9]:
# Naive Bayes (MultinomialNB)
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

nb.fit(X_train_tfidf, y_train)
nb_probs_train = nb.predict_proba(X_train_tfidf)[:,1]
nb_probs_test = nb.predict_proba(X_test_tfidf)[:,1]


NB_pred = (nb_probs_test >= 0.5).astype(int)

In [10]:
from sklearn.metrics import accuracy_score, classification_report

models = {
    "Logistic Regression": LR_pred,
    "SVM": SVC_pred,
    "Naive Bayes": NB_pred,
}

for name, pred in models.items():
    print(f"{name}: {accuracy_score(y_test, pred):.4f}")

Logistic Regression: 0.8974
SVM: 0.8977
Naive Bayes: 0.8457


Based on the results, we selected SVM and Logistic Regression as the traditional models to be combined with the LLM in the next stage.
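
Accuracy alone can hide weak performance on the minority class when labels are imbalanced, which is common for product reviews. The sketch below uses hypothetical labels (80% positive) and a classifier that always predicts positive, to show how balanced accuracy and the confusion matrix expose the problem:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hypothetical labels: 8 positive, 2 negative
y_true_demo = [1] * 8 + [0] * 2
# A degenerate classifier that predicts "positive" for everything
y_pred_demo = [1] * 10

acc = accuracy_score(y_true_demo, y_pred_demo)
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)  # mean of per-class recall
cm = confusion_matrix(y_true_demo, y_pred_demo)              # rows: true class, cols: predicted

print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
print(cm)
```

Here accuracy is 0.80 while balanced accuracy is 0.50, because no negative review is ever detected.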

## 4. LLM (DistilBERT) Sentiment Analysis

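Now we use a pre-trained transformer model, **DistilBERT**, in a checkpoint already fine-tuned for sentiment classification on SST-2.
We evaluate it on random samples (10,000 training and 5,000 test reviews), keeping only the reviews that fit within the model's 512-token input limit.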

In [11]:
from transformers import pipeline, AutoTokenizer
import numpy as np

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

llm = pipeline("sentiment-analysis", model=model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)


# === Train subset ===
subset_idx_train = np.random.choice(len(X_train), size=10000, replace=False)
lengths_train = [len(tokenizer.encode(X_train[i])) for i in subset_idx_train]
filtered_idx_train = [idx for idx, l in zip(subset_idx_train, lengths_train) if l <= 512]

X_train_filtered = [X_train[i] for i in filtered_idx_train]
y_train_filtered = [y_train[i] for i in filtered_idx_train]


print(f"Train Original: {len(X_train)}, Filtered: {len(X_train_filtered)}")

# === Test subset ===
subset_idx_test = np.random.choice(len(X_test), size=5000, replace=False)
lengths_test = [len(tokenizer.encode(X_test[i])) for i in subset_idx_test]
filtered_idx_test = [idx for idx, l in zip(subset_idx_test, lengths_test) if l <= 512]

X_test_filtered = [X_test[i] for i in filtered_idx_test]
y_test_filtered = [y_test[i] for i in filtered_idx_test]

print(f"Test Original: {len(X_test)}, Filtered: {len(X_test_filtered)}")


Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (679 > 512). Running this sequence through the model will result in indexing errors


Train Original: 420610, Filtered: 9885
Test Original: 105153, Filtered: 4951
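
The subset indices above are drawn with `np.random.choice` without a fixed seed, so the filtered counts and the downstream accuracies can differ slightly between runs. A minimal sketch of reproducible sampling, assuming NumPy's `Generator` API and an arbitrary seed of 42 (both the seed and the sizes are illustrative):

```python
import numpy as np

# Seed value is an assumption; any fixed integer makes the draw repeatable
rng = np.random.default_rng(42)

n_reviews = 1000  # hypothetical stand-in for len(X_train)
sample_idx = rng.choice(n_reviews, size=100, replace=False)

# Re-creating the generator with the same seed reproduces the same indices
repeat_idx = np.random.default_rng(42).choice(n_reviews, size=100, replace=False)

print(np.array_equal(sample_idx, repeat_idx))
```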


In [12]:
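def llm_to_prob_batch(text_list, batch_size=512):
    """Return P(positive) for each text, calling the pipeline on chunks of batch_size texts."""
    probs = []
    for i in range(0, len(text_list), batch_size):
        batch = text_list[i:i+batch_size]
        preds = llm(batch)
        for pred in preds:
            # The pipeline reports the winning label and its score; convert to P(positive)
            probs.append(pred['score'] if pred['label'] in ['LABEL_2','POSITIVE'] else 1-pred['score'])
    return np.array(probs)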

llm_probs_test = llm_to_prob_batch(X_test_filtered)
llm_probs_train = llm_to_prob_batch(X_train_filtered)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
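
The conversion inside `llm_to_prob_batch` relies on the pipeline returning only the winning label and its score. For a two-class model, the positive-class probability is the score itself when the label is positive, and one minus the score otherwise. The sketch below shows this on hand-written mock outputs (the dictionaries are illustrative, not real model predictions, and `positive_prob` is a hypothetical helper):

```python
def positive_prob(pred):
    """Return P(positive) from one {'label', 'score'} dict of a two-class classifier."""
    if pred["label"] == "POSITIVE":
        return pred["score"]
    # For a NEGATIVE prediction the score is P(negative), so flip it
    return 1.0 - pred["score"]

# Mock outputs in the same shape the sentiment pipeline returns
mock_preds = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.90},
]

probs_demo = [positive_prob(p) for p in mock_preds]
print(probs_demo)
```

Putting every model's output on the same P(positive) scale is what allows the probabilities to be stacked as features later.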


In [13]:
llm_pred = (llm_probs_test >= 0.5).astype(int)

from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test_filtered, llm_pred)
print("LLM Accuracy:", accuracy)

print(classification_report(y_test_filtered, llm_pred))

LLM Accuracy: 0.8497273278125631
              precision    recall  f1-score   support

           0       0.50      0.89      0.64       749
           1       0.98      0.84      0.90      4202

    accuracy                           0.85      4951
   macro avg       0.74      0.87      0.77      4951
weighted avg       0.91      0.85      0.87      4951



## 5. Stacking Ensemble

We combine the positive-class probabilities from Logistic Regression, SVM, and the LLM as three input features to a Logistic Regression meta-model (stacking).

In [14]:
lr_train_filtered = lr_probs_train[filtered_idx_train]
svm_train_filtered = svm_probs_train[filtered_idx_train]

lr_test_filtered = lr_probs_test[filtered_idx_test]
svm_test_filtered = svm_probs_test[filtered_idx_test]

# Stacking
X_stack_train = np.column_stack((lr_train_filtered, svm_train_filtered, llm_probs_train))
X_stack_test = np.column_stack((lr_test_filtered, svm_test_filtered, llm_probs_test))

In [15]:
from sklearn.linear_model import LogisticRegression

stack_model = LogisticRegression()
stack_model.fit(X_stack_train, y_train_filtered)

stack_probs = stack_model.predict_proba(X_stack_test)[:,1]
stack_pred = (stack_probs >= 0.5).astype(int)
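
One caveat: the Logistic Regression and SVM training probabilities used above were predicted on the same rows those models were fitted on, so they tend to look more confident than they would on unseen data. A common alternative is to feed the meta-model out-of-fold predictions. The sketch below is not the notebook's method; it shows the idea on synthetic data from `make_classification`, with all sizes and the 5-fold setting chosen as assumptions:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and labels
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

base_models = [LogisticRegression(max_iter=1000), CalibratedClassifierCV(LinearSVC())]

oof_cols, test_cols = [], []
for model in base_models:
    # Out-of-fold probabilities: each training row is scored by a model that never saw it
    oof = cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    oof_cols.append(oof)
    # Refit on the full training split to score the held-out test rows
    model.fit(X_tr, y_tr)
    test_cols.append(model.predict_proba(X_te)[:, 1])

meta_train = np.column_stack(oof_cols)
meta_test = np.column_stack(test_cols)

meta_model = LogisticRegression().fit(meta_train, y_tr)
stack_acc = meta_model.score(meta_test, y_te)
print(f"Stacked accuracy on synthetic data: {stack_acc:.3f}")
```

The pre-trained LLM column needs no such treatment in the notebook, since that model is not fitted on the training reviews.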

In [16]:
from sklearn.metrics import accuracy_score, classification_report

print("Stacking Accuracy:", accuracy_score(y_test_filtered, stack_pred))
print(classification_report(y_test_filtered, stack_pred))

Stacking Accuracy: 0.916380529186023
              precision    recall  f1-score   support

           0       0.75      0.66      0.71       749
           1       0.94      0.96      0.95      4202

    accuracy                           0.92      4951
   macro avg       0.85      0.81      0.83      4951
weighted avg       0.91      0.92      0.91      4951
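
The classical models in section 3 were scored on the full test set, while the LLM and the stacking model are scored on the filtered subset. To compare models on equal footing, each probability vector can be scored on the same indices. A minimal sketch with hypothetical probabilities (`accuracy_on_subset` and all arrays are illustrative names and values):

```python
import numpy as np

def accuracy_on_subset(probs, y_true, idx, threshold=0.5):
    """Accuracy of thresholded probabilities, restricted to the rows in idx."""
    sub_probs = np.asarray(probs)[idx]
    sub_true = np.asarray(y_true)[idx]
    return float(np.mean((sub_probs >= threshold).astype(int) == sub_true))

# Hypothetical labels and positive-class probabilities from two models
y_demo = np.array([1, 0, 1, 1, 0, 1])
probs_a = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.7])
probs_b = np.array([0.8, 0.1, 0.9, 0.3, 0.4, 0.6])

subset = [0, 2, 3, 4]  # the shared evaluation rows

acc_a = accuracy_on_subset(probs_a, y_demo, subset)
acc_b = accuracy_on_subset(probs_b, y_demo, subset)
print(acc_a, acc_b)
```

In the notebook, calling this with `lr_probs_test`, `y_test`, and `filtered_idx_test` would score Logistic Regression on the same 4,951 reviews used for the LLM and the ensemble.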



## Results & Discussion

Below are the accuracy results for each model:

| Model | Accuracy |
|:------|:---------:|
| Logistic Regression | **0.8974** |
| SVM | **0.8977** |
| Naive Bayes | 0.8457 |
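| LLM (DistilBERT) | 0.8497 |
| **Stacking Ensemble** | **0.9164** |

Logistic Regression, SVM, and Naive Bayes were scored on the full test set (105,153 reviews), while DistilBERT and the stacking ensemble were scored on the filtered test subset (4,951 reviews), so the two groups of numbers are not strictly comparable.

- Both **SVM** and **Logistic Regression** achieved very similar accuracy (~0.90).
- The **LLM (DistilBERT)** reached 0.85 accuracy, below SVM and Logistic Regression and close to Naive Bayes. It was used as released (fine-tuned on SST-2, with no further training on Amazon reviews) and evaluated only on reviews of at most 512 tokens. Its precision on negative reviews was 0.50, so about half of the reviews it flagged as negative were actually positive.
- The **Stacking Ensemble**, which combines the predictions of SVM, Logistic Regression, and the LLM, achieved the **highest accuracy (0.916)**.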

### Conclusion

The results suggest that classical models remain strong baselines for text classification, and that combining them with a pre-trained transformer through **stacking** gave the best accuracy in this experiment.
