# Project description
This project is aimed at the analysis of the sentiments of users who purchased beauty products from Amazon during 2023. The dataset used for this task can be found on the huggingface website under this name: McAuley-Lab/Amazon-Reviews-2023 or at this link: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023

# Objectives for sentiment analysis:
- Understand Customer Opinions: Identify and analyze the emotions, opinions, and attitudes expressed in customer reviews.
- Classify Sentiment: Automatically categorize text as positive or negative.
- Monitor Brand Reputation: Track how people feel about a brand, product, or service over time to detect trends or issues early.
- Improve Decision-Making: Provide actionable insights to marketing, product development, and customer service teams.
- Automate Feedback Analysis: Reduce manual effort by using models to quickly process large volumes of textual data.

# Plan of action
- The preprocessing of texts will be done by two different methods
    1. TF-IDF
    2. Text embedding
- The sentiment analysis task will be classified by different classification models.
- A final best model will be selected based on computational expenses, accuracy, inference time, ...

In [14]:
import re
import numpy as np
import pandas as pd
from datasets import load_dataset
from nltk.corpus import stopwords
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Loading data

In [None]:
# loading the dataset using the datasets library
ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_All_Beauty", trust_remote_code=True);
print(ds["full"]["features"])

# Extracting the input features and target values
text = ds["full"]["text"]
stars = ds["full"]["rating"]

# Reviews are seen as either positive or negative. Ratings of 4 and 5 star are seen as positive.
labels = np.array(stars)>3

# TF-IDF + classification

## Preprocessing for TF-IDF

In [12]:
def clean_lemmatize(text):
    """
    Cleans the text from URLs, punctuations, stopwords, numbers, and html tags and then lemmatizes it.
    returns: cleaned text as a string
    """
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = re.sub(r'<[^>]+>', '', text) # Remove html tags

    # Tokenization and stopword removal
    tokens = text.split()
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(tokens)


# clean and lemmatize the review strings
text = [clean_lemmatize(review) for review in text]

# splitting the data into training and testing set
x_train, x_test, y_train, y_test = train_test_split(text, labels, test_size=0.2, random_state=42, stratify=labels)

## Classification model selection

In [5]:
# checking the performance of different models on the data
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, n_jobs=-1),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Multinomial NB": MultinomialNB()
}

vectorizer = TfidfVectorizer()
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# not viable to do 5-fold cross validation on the entire training set => 5% stratified sample is selected
x_val, _, y_val, _ = train_test_split(x_train, y_train, test_size=0.95, random_state=42, stratify=y_train)
print("5-Fold Cross-Validation Results:\n")
cross_validation_results = {}
for name, model in models.items():
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', model)
    ])

    scores = cross_val_score(pipeline, x_val, y_val, cv=stratified_kfold, scoring='accuracy')
    cross_validation_results[name] = scores.mean()
    print(f"{name}:")
    print(f"  Accuracy scores: {scores}")
    print(f"  Mean Accuracy: {scores.mean():.4f}")
    print(f"  Std Dev: {scores.std():.4f}\n")

best_model_name = max(cross_validation_results, key=lambda x: cross_validation_results[x])
print(f"The best model based on mean accuracy is {best_model_name}")

5-Fold Cross-Validation Results:

Logistic Regression:
  Accuracy scores: [0.86335293 0.8544191  0.85602281 0.86724875 0.85584462]
  Mean Accuracy: 0.8594
  Std Dev: 0.0050

Linear SVM:
  Accuracy scores: [0.86477819 0.85548824 0.85334996 0.86154669 0.84907341]
  Mean Accuracy: 0.8568
  Std Dev: 0.0056

Random Forest:
  Accuracy scores: [0.85230714 0.8424804  0.84978617 0.85352815 0.84889522]
  Mean Accuracy: 0.8494
  Std Dev: 0.0038

Multinomial NB:
  Accuracy scores: [0.78157848 0.77833215 0.77423378 0.77583749 0.77583749]
  Mean Accuracy: 0.7772
  Std Dev: 0.0026

The best model based on mean accuracy is Logistic Regression


## inference timing

In [6]:
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(text)
y = labels
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=labels)
model = models[best_model_name]
model.fit(x_train, y_train)

In [15]:
%%timeit
text = ds["full"]["text"][137854]
text = clean_lemmatize(text)
x = vectorizer.transform([text])
pred = model.predict(x)

645 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Text embedding + classification

In [3]:
def clean_text(text):
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r'<[^>]+>', '', text) # Remove html tags
    return text

def embed_text(text, model):
    embedding = model.encode(text, convert_to_tensor=False, batch_size=256, show_progress_bar=True)
    return embedding

text = ds["full"]["text"]
text = [clean_text(review) for review in text]

model = SentenceTransformer('all-MiniLM-L6-v2')     # Alternatives: 'paraphrase-MiniLM-L3-v2', 'all-distilroberta-v1'
# # embeddings = embed_text(text, model)
# # pd.DataFrame(embeddings).to_parquet("embeddings.parquet") # saving the embeddings to file
embeddings = pd.read_parquet("embeddings.parquet")# loading the embeddings from file
x_train, x_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2, random_state=42, stratify=labels)

In [18]:
# checking the performance of different models on the data
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, n_jobs=-1),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# not viable to do 5-fold cross validation on the entire training set => 1% stratified sample is selected
x_val, _, y_val, _ = train_test_split(x_train, y_train, test_size=0.95, random_state=42, stratify=y_train)
print("5-Fold Cross-Validation Results:\n")
cross_validation_results = {}
for name, model in models.items():
    pipeline = Pipeline([('clf', model)])

    scores = cross_val_score(pipeline, x_val, y_val, cv=stratified_kfold, scoring='accuracy')
    cross_validation_results[name] = scores.mean()
    print(f"{name}:")
    print(f"  Accuracy scores: {scores}")
    print(f"  Mean Accuracy: {scores.mean():.4f}")
    print(f"  Std Dev: {scores.std():.4f}\n")

best_model_name = max(cross_validation_results, key=lambda x: cross_validation_results[x])
print(f"The best model based on mean accuracy is {best_model_name}")

5-Fold Cross-Validation Results:

Logistic Regression:
  Accuracy scores: [0.86032425 0.85940841 0.8629722  0.86582324 0.86065574]
  Mean Accuracy: 0.8618
  Std Dev: 0.0023

Linear SVM:
  Accuracy scores: [0.86032425 0.8647541  0.86404134 0.86920884 0.8629722 ]
  Mean Accuracy: 0.8643
  Std Dev: 0.0029

Random Forest:
  Accuracy scores: [0.811509   0.81628653 0.81200998 0.81468282 0.81307912]
  Mean Accuracy: 0.8135
  Std Dev: 0.0018

The best model based on mean accuracy is Linear SVM


## inference timing

In [7]:
embedder_model = SentenceTransformer('all-MiniLM-L6-v2')     # Alternatives: 'paraphrase-MiniLM-L3-v2', 'all-distilroberta-v1'
model = LogisticRegression(max_iter=1000, n_jobs=-1)
model.fit(x_train, y_train);

In [10]:
%%timeit
text = ds["full"]["text"][137854]
text = clean_text(text)
embedding = embed_text(text, embedder_model)
pred = model.predict(embedding.reshape(1, -1))

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

715 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Final model evaluation


In [15]:
text = ds["full"]["text"]
text = [clean_lemmatize(review) for review in text]
vectorizer = TfidfVectorizer()
vectorized_texts = vectorizer.fit_transform(text)
x_train, x_test, y_train, y_test = train_test_split(vectorized_texts, labels, test_size=0.2, random_state=42, stratify=labels)
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy*100:.4f}%")

Test Accuracy: 87.9791%


# Summary and conclusions
- Model comparisons are done through 5-fold cross validation scores.
- The best model was found to be Text Embedding (with the all-MiniLM-L6-v2 model) + Linear SVM classifier with a cross validation score of 86.43%.
- The next best model was found to be TF-IDF + Logistic Regression with a cross validation score of 85.94%.
- The TF-IDF method seems to offer lower computational costs due to not needing GPUs, while offering competitive performance.
- The most suitable model was selected as TF-IDF + Logistic Regression and offered a test accuracy of 87.98%.
- The same methods can be used for different families of products to observe and track user opinions on products and brands over time.