# Project: Sentiment Analysis on Product Reviews

- **Dataset**: [Women's Clothing E-Commerce Reviews](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews)
- **HuggingFace Model:** [cardiffnlp/twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)

# PART 0: Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')


## Imports & Setup

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tf_keras.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    ReduceLROnPlateau,
)
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification
import os
import datetime
import json

## Configuration

In [3]:
# Hyperparameters
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"  # Hugging Face Hub ID of the pre-trained model and tokenizer
NUM_LABELS = 3 #positive, neutral, and negative

# Tokenization
MAX_LEN = 128

# Random seed
SEED = 42

# Training
BATCH_SIZE = 8
EPOCHS = 3
LEARNING_RATE = 2e-5

# Early stopping patience
PATIENCE = 3

# Create timestamped run directory
TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
RUN_DIR = f"runs/run_{TIMESTAMP}"
CHECKPOINT_DIR = os.path.join(RUN_DIR, "checkpoints")
LOG_DIR = os.path.join(RUN_DIR, "logs")
MODEL_DIR = os.path.join(RUN_DIR, "models")

In [3]:
# Create all directories
for dir_path in [CHECKPOINT_DIR, LOG_DIR, MODEL_DIR]:
    os.makedirs(dir_path, exist_ok=True)

print(f"Run directory: {RUN_DIR}")

# Save configuration
config = {
    "model_name": MODEL_NAME,
    "max_len": MAX_LEN,
    "batch_size": BATCH_SIZE,
    "epochs": EPOCHS,
    "learning_rate": LEARNING_RATE,
    "num_labels": NUM_LABELS,
    "patience": PATIENCE,
    "seed": SEED,
    "timestamp": TIMESTAMP
}

config_path = os.path.join(RUN_DIR, "config.json")
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)
print(f"Configuration saved to: {config_path}")

Run directory: runs/run_20260101_150332
Configuration saved to: runs/run_20260101_150332/config.json
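
To reuse a saved run's settings later, the JSON file can be read back into a dict. The following is a minimal sketch, assuming the same file name as above; the `load_config` helper, the reduced `demo_config`, and the temporary directory standing in for `RUN_DIR` are illustrative and not part of the notebook.

```python
import json
import os
import tempfile

def load_config(run_dir):
    """Read config.json from a run directory and return it as a dict."""
    with open(os.path.join(run_dir, "config.json")) as f:
        return json.load(f)

# Round-trip demo in a throwaway directory (stands in for RUN_DIR).
demo_config = {"max_len": 128, "batch_size": 8, "learning_rate": 2e-5, "seed": 42}
with tempfile.TemporaryDirectory() as run_dir:
    with open(os.path.join(run_dir, "config.json"), "w") as f:
        json.dump(demo_config, f, indent=2)
    restored = load_config(run_dir)

print(restored["learning_rate"])
```

Reading the hyperparameters from the file rather than retyping them keeps a later evaluation tied to the values the run was actually trained with.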


## Setting random seed for reproducibility

In [4]:
np.random.seed(SEED)
tf.random.set_seed(SEED)

## Load Dataset
Download the dataset `nicapotato/womens-ecommerce-clothing-reviews` with `kagglehub`, read it into a DataFrame, and drop the columns that are not used for sentiment analysis.

In [5]:
import kagglehub

path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")  # Download dataset
df = pd.read_csv(os.path.join(path, "Womens Clothing E-Commerce Reviews.csv"))  # Read dataset

# Remove all columns not used for sentiment analysis
df = df.drop(columns=[
    "Unnamed: 0", "Clothing ID", "Age", "Positive Feedback Count",
    "Division Name", "Department Name", "Class Name",
])

#For review
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (23486, 4)


| | Title | Review Text | Rating | Recommended IND |
|---|---|---|---|---|
| 0 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 |
| 1 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 |


## Data Cleaning & Text Preparation
Remove reviews that lack text and standardize review format.

In [6]:
# Drop empty reviews
df = df.dropna(subset=["Title", "Review Text"], how='all')  # Drop rows where both title and review text are missing
# Drop rows where either field is an empty or whitespace-only string (missing values pass this comparison)
df = df[(df["Review Text"].str.strip() != "") & (df["Title"].str.strip() != "")]

# Combine title and review text into text column
df["text"] = df["Title"].fillna("").str.strip() + ". " + df["Review Text"].fillna("").str.strip()

# Remove very short reviews
df = df[df["text"].str.split().str.len() >= 5]

# Reset index
df = df.reset_index(drop=True)

print(f"Cleaned dataset: {len(df)} reviews")

Cleaned dataset: 22631 reviews
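
The text-building and length rules above have one side effect worth seeing on a small example: when a title is missing, the combined text starts with `". "`, and that lone period counts as a word in the length filter. The sketch below mirrors the two rules in plain Python; `build_text` and `keep` are hypothetical helper names, and `None` stands in for a missing (NaN) value.

```python
def build_text(title, review):
    """Mirror of the `text` column: missing values become empty strings."""
    title = (title or "").strip()
    review = (review or "").strip()
    return title + ". " + review

def keep(text, min_words=5):
    """Same length rule as the cell above: at least `min_words` whitespace tokens."""
    return len(text.split()) >= min_words

with_title = build_text("Flattering shirt", "This shirt is very flattering")
no_title = build_text(None, "Love this dress")

print(repr(with_title), keep(with_title))
print(repr(no_title), keep(no_title))
```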


## Sentiment Label Creation
Translate ratings of 4 or 5 stars to positive (2), 1 or 2 stars to negative (0), and 3 stars to neutral (1).

In [7]:
#Translate star rating to sentiment
def rating_to_sentiment(r):
    if r <= 2:
        return 0  # negative
    elif r == 3:
        return 1  # neutral
    else:
        return 2  # positive

#Adds sentiment value to all data
df["sentiment"] = df["Rating"].apply(rating_to_sentiment)

print("Label distribution:")
print(df["sentiment"].value_counts().sort_index()) #How many of each label
print("\n0=negative, 1=neutral, 2=positive")

Label distribution:
sentiment
0     2370
1     2822
2    17439
Name: count, dtype: int64

0=negative, 1=neutral, 2=positive
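
The distribution is heavily skewed toward the positive class, which matters when reading accuracy numbers later. As a rough reference point, the sketch below computes each class's share from the counts printed above, along with the accuracy a classifier would get by always predicting the majority class. The variable names are illustrative.

```python
# Label counts copied from the output above.
counts = {"negative": 2370, "neutral": 2822, "positive": 17439}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most common label.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label:>8}: {share:.1%}")
print(f"Always predicting '{majority_label}' gives {majority_baseline:.1%} accuracy")
```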


## Train / Validation / Test Split (80/10/10)

In [8]:
X = df["text"]
y = df["sentiment"]

# Shuffle indices
indices = np.arange(len(X))
np.random.shuffle(indices)

# Split points
train_end = int(0.8 * len(X))
val_end = int(0.9 * len(X))

# Separate data into training, validation, and test sets
train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]

X_train, y_train = X.iloc[train_idx].values, y.iloc[train_idx].values
X_val, y_val = X.iloc[val_idx].values, y.iloc[val_idx].values
X_test, y_test = X.iloc[test_idx].values, y.iloc[test_idx].values

print(f"Train: {len(X_train)} samples")
print(f"Val:   {len(X_val)} samples")
print(f"Test:  {len(X_test)} samples")

Train: 18104 samples
Val:   2263 samples
Test:  2264 samples
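
The split sizes follow directly from truncating `0.8 * n` and `0.9 * n` to integers. The sketch below reproduces that index arithmetic with the standard library only. It uses Python's `random` module instead of NumPy, so the shuffled order differs from the notebook's, but the sizes and the no-overlap property are the same. `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n, seed=42, train_frac=0.8, val_end_frac=0.9):
    """Shuffle 0..n-1 and cut it into train / val / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_end = int(train_frac * n)
    val_end = int(val_end_frac * n)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = split_indices(22631)
print(len(train_idx), len(val_idx), len(test_idx))

# Every index lands in exactly one split.
all_idx = set(train_idx) | set(val_idx) | set(test_idx)
print(len(all_idx) == 22631)
```

Note that this split, like the one above, is not stratified, so the class proportions in each subset match the full dataset only approximately.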


## Load Tokenizer

In [9]:
#Load tokenizer from source of MODEL_NAME
tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_NAME)

print(f"Tokenizer loaded: {MODEL_NAME}")

Tokenizer loaded: cardiffnlp/twitter-roberta-base-sentiment-latest


## Define Test Reviews for Before/After Comparison

These exact reviews will be used to compare baseline vs fine-tuned performance.

In [10]:
TEST_REVIEWS = [
    # Clearly negative
    "This is the worst product I've ever bought. Complete waste of money.",

    # Clearly positive
    "Absolutely love this dress! Perfect fit and beautiful fabric. Highly recommend!",

    # Mixed/Neutral - quality good but not my style
    "This dress isn't really my style, but the fabric feels good and high quality.",

    # Disappointed expectations
    "I was really excited about this dress, but the fabric feels cheap and it fits oddly.",

    # Long-time user with price complaint
    "Have used this brand for decades and while it is our favorite, the increase in prices over the years is ridiculous.",

    # Detailed positive review
    "I'm really enjoying this blush. Application is smooth and easy. Long lasting color throughout the day.",

    # Product didn't meet claims
    "Went to the beach with this water bottle full of ice. By 2pm my water was warm! Does not last as described.",

    # Sizing issue but liked the product
    "Beautiful sweater but runs very small. Had to return for a larger size. The quality is excellent though."
]

# Expected sentiments (human judgment)
EXPECTED_LABELS = [
    "negative",   # worst product
    "positive",   # absolutely love
    "neutral",    # not my style but good quality
    "negative",   # disappointed
    "negative",   # price complaint
    "positive",   # enjoying, smooth, long lasting
    "negative",   # didn't meet claims
    "neutral"     # issue but good quality
]

print(f"Defined {len(TEST_REVIEWS)} test reviews for before/after comparison")

Defined 8 test reviews for before/after comparison


---

# PART 1: Baseline Evaluation (Before Fine-Tuning)

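Load the pre-trained model and evaluate it on the curated test reviews without any fine-tuning. Note that the load warning in the cell output below reports the `classifier` head as newly initialized rather than loaded from the checkpoint, so these baseline predictions appear to come from an untrained head. That is consistent with the near-uniform confidences (about 35%) reported in this part.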

In [None]:
# Load pre-trained model from source of MODEL_NAME (NO fine-tuning yet)
baseline_model = TFRobertaForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

print(f"Baseline model loaded: {MODEL_NAME}")
print("This model was trained on ~124M tweets for sentiment analysis.")

from tf_keras.optimizers import Adam
optimizer = Adam(learning_rate=LEARNING_RATE)

# Compile so that baseline_model.evaluate() can report loss and accuracy later
baseline_model.compile(
    optimizer=optimizer,
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

# save_pretrained writes a directory, so the path has no file extension
baseline_model_path = os.path.join(MODEL_DIR, "baseline_model")
baseline_model.save_pretrained(baseline_model_path)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Baseline model loaded: cardiffnlp/twitter-roberta-base-sentiment-latest
This model was trained on ~124M tweets for sentiment analysis.


In [12]:
LABEL_MAP = {0: "negative", 1: "neutral", 2: "positive"}

# Returns the predicted label, its confidence, and the per-label probabilities for one text
def predict_sentiment(model, text):
    """Predict sentiment for a single text."""
    inputs = tokenizer(
        text,
        truncation=True,
        padding=True,
        max_length=MAX_LEN,
        return_tensors="tf"
    )

    logits = model(**inputs).logits
    probs = tf.nn.softmax(logits, axis=1).numpy()[0]
    pred_idx = np.argmax(probs)

    return {
        "label": LABEL_MAP[pred_idx], #prediction
        "confidence": float(probs[pred_idx]), #prediction label probability
        "probabilities": {
            "negative": float(probs[0]), #probability of negative label
            "neutral": float(probs[1]), #probability of neutral label
            "positive": float(probs[2]) #probability of positive label
        }
    }
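
The `confidence` value is the largest softmax probability, so it helps to see how logits map to probabilities. The sketch below is a plain-Python softmax over three made-up logit vectors (the numbers are hypothetical, not taken from the model): equal logits give exactly 1/3 per class, while a clearly separated logit gives a confident prediction.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [negative, neutral, positive].
uniform = softmax([0.0, 0.0, 0.0])    # no preference between classes
peaked = softmax([-2.0, -1.0, 3.0])   # strong preference for "positive"

print([round(p, 4) for p in uniform])
print([round(p, 4) for p in peaked])
```

With three labels, a confidence close to 33% therefore means the model is barely distinguishing between the classes.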

### Evaluation - Before Fine-Tuning

In [13]:
baseline_results = []

# Check prediction accuracy on the curated reviews before any fine-tuning
for i, (review, expected) in enumerate(zip(TEST_REVIEWS, EXPECTED_LABELS), 1):
    result = predict_sentiment(baseline_model, review)
    baseline_results.append(result)

    match = "✓" if result["label"] == expected else "✗"

    print(f"\nReview {i}: \"{review[:60]}...\"")
    print(f"  Expected: {expected}")
    print(f"  Baseline: {result['label']} (confidence: {result['confidence']:.2%}) {match}")

TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.



Review 1: "This is the worst product I've ever bought. Complete waste o..."
  Expected: negative
  Baseline: negative (confidence: 35.20%) ✓

Review 2: "Absolutely love this dress! Perfect fit and beautiful fabric..."
  Expected: positive
  Baseline: neutral (confidence: 35.07%) ✗

Review 3: "This dress isn't really my style, but the fabric feels good ..."
  Expected: neutral
  Baseline: negative (confidence: 35.21%) ✗

Review 4: "I was really excited about this dress, but the fabric feels ..."
  Expected: negative
  Baseline: neutral (confidence: 35.02%) ✗

Review 5: "Have used this brand for decades and while it is our favorit..."
  Expected: negative
  Baseline: negative (confidence: 36.32%) ✓

Review 6: "I'm really enjoying this blush. Application is smooth and ea..."
  Expected: positive
  Baseline: neutral (confidence: 34.50%) ✗

Review 7: "Went to the beach with this water bottle full of ice. By 2pm..."
  Expected: negative
  Baseline: negative (confidence: 36.94%) ✓

Review 8:

In [14]:
# Calculate baseline accuracy
baseline_predictions = [r["label"] for r in baseline_results]
baseline_correct = sum(1 for pred, exp in zip(baseline_predictions, EXPECTED_LABELS) if pred == exp)
baseline_accuracy = baseline_correct / len(EXPECTED_LABELS)

print(f"\nBaseline Accuracy: {baseline_correct}/{len(EXPECTED_LABELS)} = {baseline_accuracy:.1%}")


Baseline Accuracy: 4/8 = 50.0%


---

# PART 2: Fine-Tuning

Fine-tune the model on our e-commerce review dataset.

## Tokenize Data

In [15]:
def tokenize(texts):
    return tokenizer(
        list(texts),
        truncation=True,
        padding=True,
        max_length=MAX_LEN,
        return_tensors="tf"
    )

train_enc = tokenize(X_train)
val_enc = tokenize(X_val)
test_enc = tokenize(X_test)

print("Tokenization complete.")

Tokenization complete.
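
In the `tokenize` call, `truncation=True` with `max_length=MAX_LEN` cuts long reviews, and `padding=True` pads every sequence to the longest one in the call. The sketch below imitates that behavior on made-up token ID lists. `pad_and_truncate` is a hypothetical helper, `pad_id=1` is assumed to match RoBERTa's padding token, and unlike the real tokenizer it does not re-append the end-of-sequence token after truncating.

```python
def pad_and_truncate(sequences, max_len, pad_id=1):
    """Cut each sequence to max_len, then right-pad to the longest remaining one."""
    truncated = [seq[:max_len] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Two hypothetical token ID sequences of different lengths.
ids, mask = pad_and_truncate([[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2]], max_len=6)
print(ids)
print(mask)
```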


## Build TensorFlow Datasets

In [16]:
# Pair encodings with labels and group them into batches of BATCH_SIZE; only the training set is shuffled
train_ds = tf.data.Dataset.from_tensor_slices(
    (dict(train_enc), y_train)
).shuffle(1000).batch(BATCH_SIZE)

val_ds = tf.data.Dataset.from_tensor_slices(
    (dict(val_enc), y_val)
).batch(BATCH_SIZE)

test_ds = tf.data.Dataset.from_tensor_slices(
    (dict(test_enc), y_test)
).batch(BATCH_SIZE)

print(f"Train batches: {len(train_ds)}")
print(f"Val batches: {len(val_ds)}")
print(f"Test batches: {len(test_ds)}")

Train batches: 2263
Val batches: 283
Test batches: 283
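
The batch counts are the split sizes divided by `BATCH_SIZE`, rounded up, because the last partial batch is kept. A quick check of that arithmetic, using the sizes printed earlier:

```python
import math

BATCH_SIZE = 8
split_sizes = {"train": 18104, "val": 2263, "test": 2264}

# Number of batches per split, counting a final partial batch.
batches = {name: math.ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}
# Size of the last batch in each split (a full batch if the split divides evenly).
last_batch = {name: n % BATCH_SIZE or BATCH_SIZE for name, n in split_sizes.items()}

print(batches)
print(last_batch)
```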


## Load Fresh Model for Fine-Tuning

In [17]:
# Load a fresh copy of the pre-trained model for fine-tuning
model = TFRobertaForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

print("Model loaded for fine-tuning.")

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded for fine-tuning.


#### Freeze early layers to reduce overfitting

In [18]:
NUM_LAYERS_TO_FREEZE = 6
for i, layer in enumerate(model.roberta.encoder.layer[:NUM_LAYERS_TO_FREEZE]):
    layer.trainable = False
    print(f"Froze encoder layer {i}")

print(f"Frozen {NUM_LAYERS_TO_FREEZE}/12 encoder layers")

Froze encoder layer 0
Froze encoder layer 1
Froze encoder layer 2
Froze encoder layer 3
Froze encoder layer 4
Froze encoder layer 5
Frozen 6/12 encoder layers


In [19]:
model.summary()

Model: "tf_roberta_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFRobertaMainLaye  multiple                  124055040 
 r)                                                              
                                                                 
 classifier (TFRobertaClass  multiple                  592899    
 ificationHead)                                                  
                                                                 
Total params: 124647939 (475.49 MB)
Trainable params: 82120707 (313.27 MB)
Non-trainable params: 42527232 (162.23 MB)
_________________________________________________________________


In [20]:
print(f"Transformer blocks: {model.config.num_hidden_layers}")

#For each weight layer in the classifier block
for weight in model.classifier.weights:
    print(weight.name, weight.shape)

Transformer blocks: 12
tf_roberta_for_sequence_classification_1/classifier/dense/kernel:0 (768, 768)
tf_roberta_for_sequence_classification_1/classifier/dense/bias:0 (768,)
tf_roberta_for_sequence_classification_1/classifier/out_proj/kernel:0 (768, 3)
tf_roberta_for_sequence_classification_1/classifier/out_proj/bias:0 (3,)
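
The parameter counts in the summary can be reproduced by hand, which is a useful check that freezing six encoder layers did what was intended. The sketch below assumes the standard `roberta-base` dimensions (hidden size 768, feed-forward width 3072); the 3072 value is an assumption, since the notebook only prints the 768-wide classifier shapes.

```python
HIDDEN = 768          # matches the classifier weight shapes printed above
INTERMEDIATE = 3072   # assumed roberta-base feed-forward width
NUM_LABELS = 3

def dense(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# One encoder layer: Q, K, V and output projections, then the feed-forward block.
attention = 4 * dense(HIDDEN, HIDDEN) + layer_norm
feed_forward = dense(HIDDEN, INTERMEDIATE) + dense(INTERMEDIATE, HIDDEN) + layer_norm
per_layer = attention + feed_forward

classifier = dense(HIDDEN, HIDDEN) + dense(HIDDEN, NUM_LABELS)
frozen = 6 * per_layer
trainable = 124647939 - frozen  # total taken from model.summary()

print(per_layer, classifier, frozen, trainable)
```

Under these assumptions the computed values agree with the `classifier`, non-trainable, and trainable figures reported by `model.summary()`.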


### Activation Function

In [21]:
#What activation function is being used
print(model.config.hidden_act)

gelu


## Compile Model

In [22]:
from tf_keras.optimizers import Adam
optimizer = Adam(learning_rate=LEARNING_RATE)

model.compile(
    optimizer=optimizer,
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

print(f"Model compiled with learning rate: {LEARNING_RATE}")

Model compiled with learning rate: 2e-05


## Setup Callbacks
`EarlyStopping` halts training when `val_loss` has not improved for `PATIENCE` consecutive epochs and restores the best weights, `ModelCheckpoint` saves the model each time `val_loss` reaches a new best, and `ReduceLROnPlateau` halves the learning rate when `val_loss` stops improving.

In [23]:
#Stop early if val_loss does not improve in PATIENCE epochs, to prevent overfitting
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=PATIENCE,
    restore_best_weights=True,
    verbose=1
)

# Save the model to disk whenever val_loss reaches a new best
checkpoint = ModelCheckpoint(
    filepath=os.path.join(CHECKPOINT_DIR, "best_model.keras"),
    monitor="val_loss",
    save_best_only=True,
    verbose=1
)

#Reduce learning rate when val_loss does not improve, to allow for more stable learning
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=1,
    min_lr=1e-7,
    verbose=1
)

callbacks = [checkpoint, reduce_lr, early_stopping]

print("Callbacks configured.")

Callbacks configured.
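
To see how these settings interact, the sketch below replays a short sequence of validation losses through simplified versions of the plateau and early-stopping rules. It is an approximation rather than the Keras implementation: it ignores `min_delta` and `cooldown`, `simulate` is a hypothetical helper, and the third loss value is made up to represent an epoch that does not improve.

```python
def simulate(val_losses, lr=2e-5, factor=0.5, lr_patience=1,
             min_lr=1e-7, stop_patience=3):
    """Replay val_loss values through simplified plateau / early-stopping logic."""
    best, best_epoch = float("inf"), 0
    lr_wait = stop_wait = 0
    stopped_early = False
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
            lr_wait = stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:       # plateau: shrink the learning rate
                lr = max(lr * factor, min_lr)
                lr_wait = 0
            if stop_wait >= stop_patience:   # no improvement for too long: stop
                stopped_early = True
                break
    return {"best_epoch": best_epoch, "final_lr": lr, "stopped_early": stopped_early}

# First two values are plausible improving losses; the third is a hypothetical worse epoch.
result = simulate([0.34078, 0.33083, 0.3400])
print(result)
```

With `PATIENCE = 3` and `EPOCHS = 3`, the early-stopping condition cannot be reached in this sketch, because the first epoch always counts as an improvement; only the learning-rate reduction fires.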


## Train (Fine-Tune)
Each epoch takes roughly 10 minutes on a GPU (the logged run below used an NVIDIA GeForce RTX 3070) and can take several hours on CPU.

In [24]:
print(f"Fine-tuning for up to {EPOCHS} epochs...")
print("=" * 60)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    callbacks=callbacks
)

print("\nFine-tuning complete!")

Fine-tuning for up to 3 epochs...
Epoch 1/3


I0000 00:00:1767297847.978148  481581 service.cc:152] XLA service 0x78fcd9c95a30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1767297847.986650  481581 service.cc:160]   StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
2026-01-01 15:04:08.238204: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1767297848.351959  481581 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1767297848.698696  481581 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 1: val_loss improved from inf to 0.34078, saving model to runs/run_20260101_150332/checkpoints/best_model.keras




Epoch 2/3
Epoch 2: val_loss improved from 0.34078 to 0.33083, saving model to runs/run_20260101_150332/checkpoints/best_model.keras
Epoch 3/3
Epoch 3: val_loss did not improve from 0.33083

Epoch 3: ReduceLROnPlateau reducing learning rate to 9.999999747378752e-06.
Restoring model weights from the end of the best epoch: 2.

Fine-tuning complete!


### Save final model

In [25]:
final_model_path = os.path.join(MODEL_DIR, "final_model")
model.save_pretrained(final_model_path)
print(f"\nFinal model saved to: {final_model_path}")


Final model saved to: runs/run_20260101_150332/models/final_model


## Evaluate on Test Set

In [None]:
test_loss, test_acc = model.evaluate(test_ds)

print(f"\nTest Set Results:")
print(f"  Loss: {test_loss:.4f}")
print(f"  Accuracy: {test_acc:.4f}")


Test Set Results:
  Loss: 0.3420
  Accuracy: 0.8388


---

# PART 3: Post-Training Evaluation (After Fine-Tuning)

Run the same curated test reviews through the fine-tuned model, using the same expected labels as in Part 1.

In [None]:
finetuned_results = []

# Check prediction accuracy on the curated reviews after fine-tuning
for i, (review, expected) in enumerate(zip(TEST_REVIEWS, EXPECTED_LABELS), 1):
    result = predict_sentiment(model, review)
    finetuned_results.append(result)

    match = "✓" if result["label"] == expected else "✗"

    print(f"\nReview {i}: \"{review[:60]}...\"")
    print(f"  Expected:   {expected}")
    print(f"  Fine-tuned: {result['label']} (confidence: {result['confidence']:.2%}) {match}")


Review 1: "This is the worst product I've ever bought. Complete waste o..."
  Expected:   negative
  Fine-tuned: negative (confidence: 96.01%) ✓

Review 2: "Absolutely love this dress! Perfect fit and beautiful fabric..."
  Expected:   positive
  Fine-tuned: positive (confidence: 99.90%) ✓

Review 3: "This dress isn't really my style, but the fabric feels good ..."
  Expected:   neutral
  Fine-tuned: positive (confidence: 76.14%) ✗

Review 4: "I was really excited about this dress, but the fabric feels ..."
  Expected:   negative
  Fine-tuned: negative (confidence: 67.88%) ✓

Review 5: "Have used this brand for decades and while it is our favorit..."
  Expected:   negative
  Fine-tuned: neutral (confidence: 51.12%) ✗

Review 6: "I'm really enjoying this blush. Application is smooth and ea..."
  Expected:   positive
  Fine-tuned: positive (confidence: 99.89%) ✓

Review 7: "Went to the beach with this water bottle full of ice. By 2pm..."
  Expected:   negative
  Fine-tuned: negative (co

In [None]:
# Calculate fine-tuned accuracy
finetuned_predictions = [r["label"] for r in finetuned_results]
finetuned_correct = sum(1 for pred, exp in zip(finetuned_predictions, EXPECTED_LABELS) if pred == exp)
finetuned_accuracy = finetuned_correct / len(EXPECTED_LABELS)

print(f"\nFine-tuned Accuracy: {finetuned_correct}/{len(EXPECTED_LABELS)} = {finetuned_accuracy:.1%}")


Fine-tuned Accuracy: 5/8 = 62.5%


---

# PART 4: Before vs After Comparison

In [None]:
baseline_test_loss, baseline_test_acc = baseline_model.evaluate(test_ds)

print(f"\nTest Results for Baseline Model:")
print(f"  Loss: {baseline_test_loss:.4f}")
print(f"  Accuracy: {baseline_test_acc:.4f}")

test_loss, test_acc = model.evaluate(test_ds)

print(f"\nTest Results for Fine-Tuned Model:")
print(f"  Loss: {test_loss:.4f}")
print(f"  Accuracy: {test_acc:.4f}")

print(f"\nDifference between Baseline and Fine-Tuned:")
print(f"  Loss Difference: {(baseline_test_loss-test_loss):.4f}")
print(f"  Accuracy Difference: {(test_acc-baseline_test_acc):.4f}")




Test Results for Baseline Model:
  Loss: 1.1735
  Accuracy: 0.1325

Test Results for Fine-Tuned Model:
  Loss: 0.3420
  Accuracy: 0.8388

Difference between Baseline and Fine-Tuned:
  Loss Difference: 0.8315
  Accuracy Difference: 0.7063
