# **Sentiment Analysis of Amazon Product Reviews: All Beauty Category**

## Problem Statement
Manual analysis of thousands of customer reviews is slow and prone to human bias, so in this project we use NLP libraries to automate sentiment classification. We focus on beauty products because they are everyday essentials and user opinions about them tend to be highly subjective. We therefore concentrate on Amazon’s Beauty category to help shoppers make more informed choices. Although star ratings offer a rough performance signal, they miss the nuanced feedback found in the free-text reviews themselves. We compare three sentiment-labeling strategies (star-rating mapping, TextBlob, VADER) and train five classifiers (Logistic Regression, Naïve Bayes, SVM, Random Forest, Decision Tree) on each resulting label set. By evaluating accuracy and F1-score, we identify the most reliable approach: one that can deliver fast, consistent insights to product managers so they can pinpoint key pain points, guide feature improvements, and ultimately boost customer satisfaction and sales.

## What Is Sentiment Analysis and Why Does It Matter?
Sentiment analysis is a text-classification technique that automatically determines whether a piece of writing expresses positive, negative, or neutral sentiment. As customers share their thoughts more openly than ever in product reviews, surveys, and social media, manually reading thousands of comments is impractical and prone to bias. By converting free-form language into structured sentiment scores at scale, organizations can quickly surface emerging product issues, identify high-impact features, and tailor products and services to meet customer needs. In e-commerce, sentiment analysis complements star ratings by capturing the nuance of what customers actually say, making it indispensable for data-driven decisions in product development, marketing, and customer-experience management.

## Objectives of Project

*   Data Collection
*   Preprocessing and Cleaning
*   Exploratory Data Analysis
*   Generate Sentiment Labels via Multiple Strategies
*   Extract Numerical Features
*   Train and Compare Five Classifiers
*   Evaluate Model Performance

## Project Overview

Customer reviews on e-commerce platforms such as Amazon are a valuable source of insights into customer satisfaction, product quality, and overall brand perception. This project focuses on sentiment analysis of customer reviews in the 'All Beauty' category on Amazon.


#### Module submission group
- Group member 1: Aishwarya Shastry Viswanath 
- Group member 2: Mi Kin Swan 
- Group member 3: Fatimah Aljohani

In [None]:
# Import core data manipulation libraries
import pandas as pd
import numpy as np

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import libraries for text processing and regular expressions
import re
import spacy

# Import progress bar and word cloud generation tools
from tqdm.notebook import tqdm
from wordcloud import WordCloud


# Data Collection (Swan)
The dataset was sourced from the [McAuley Lab Amazon Reviews 2023](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023) collection via Hugging Face. Using the Hugging Face Datasets library, we loaded the “all_beauty” category of both User Reviews and Item Metadata directly into our notebook.

In [None]:
# Mount Google Drive to access datasets stored in Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Specify the file path to the Amazon beauty reviews CSV file stored in Google Drive
file_path = '/content/drive/MyDrive/PROJECT AMAZON/raw_review_All_Beauty.csv'

In [None]:
# Install the Hugging Face 'datasets' library to access the metadata
pip install datasets


🔽 First, we loaded the User Reviews dataset, which contains 701,528 rows and 10 columns, and saved it as a CSV file.

In [None]:
from datasets import load_dataset

# Load the All_Beauty reviews
dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_All_Beauty", trust_remote_code=True)

# Access the full split
reviews = dataset["full"]

# Display the first review
print(reviews[0])


In [None]:
reviews

In [None]:
# Show the first 10 rows
for i in range(10):
    print(reviews[i])


In [None]:
import pandas as pd

df = pd.DataFrame(reviews)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.to_csv('raw_review_All_Beauty.csv', index=False)

🔽 Second, we loaded the Item Metadata dataset, which contains 112,590 rows and 14 columns, and saved it as a JSONL file.

In [None]:
pip install requests

In [None]:
import requests

# URL of the metadata file
url = "https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/resolve/main/raw/meta_categories/meta_All_Beauty.jsonl"

# Local file name to save
output_file = "meta_All_Beauty.jsonl"

# Download the file
response = requests.get(url, stream=True)

if response.status_code == 200:
    with open(output_file, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
    print(f"Download complete. File saved as {output_file}")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

In [None]:
df_meta = pd.read_json("/content/drive/MyDrive/PROJECT AMAZON/meta_All_Beauty.jsonl", lines=True)

In [None]:
df_meta.head()

In [None]:
!curl -L -o meta_All_Beauty.jsonl "https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/resolve/main/raw/meta_categories/meta_All_Beauty.jsonl"


import pandas as pd

df_meta = pd.read_json("meta_All_Beauty.jsonl", lines=True)
df_meta.head()


# Reading and Cleaning the data
🔽 After that, we read both the metadata and review files and combined them into a single dataset before beginning preprocessing and cleaning on our selected features for sentiment analysis.
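
A left merge keeps every review even when no metadata row matches. The sketch below uses toy data (all values are invented for illustration) to show how two standard `pd.merge` options, `validate` and `indicator`, could be used to confirm that `parent_asin` is unique in the metadata and to count reviews that found no match. It also shows where the `title_x` and `title_y` suffixes come from.

```python
import pandas as pd

# Toy stand-ins for df_reviews and df_meta (values are illustrative only)
toy_reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "B2"],
    "title": ["Love it", "Too oily", "Broke fast"],
    "rating": [5.0, 2.0, 1.0],
})
toy_meta = pd.DataFrame({
    "parent_asin": ["A1"],
    "title": ["Argan Oil Shampoo"],
})

toy_merged = pd.merge(
    toy_reviews, toy_meta,
    on="parent_asin", how="left",
    validate="many_to_one",  # raises if parent_asin repeats in the metadata
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Columns named 'title' on both sides receive the _x / _y suffixes
print(toy_merged.columns.tolist())

unmatched = (toy_merged["_merge"] == "left_only").sum()
print(f"Reviews without metadata: {unmatched}")
```

If `validate` raises on the real data, the metadata contains duplicate `parent_asin` values and the merge would otherwise multiply review rows.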

In [None]:
import json
import pandas as pd

clean_data = []

with open("/content/drive/MyDrive/DSCI-521/meta_All_Beauty.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        try:
            clean_data.append(json.loads(line))
        except json.JSONDecodeError as e:
            print(f"Skipping line {i} due to error: {e}")
            continue

# Convert the parsed JSON records to a DataFrame
df_meta = pd.DataFrame(clean_data)


df_meta.head()


In [None]:
#check for the shape
df_meta.shape

In [None]:
#check for columns
df_meta.columns

In [None]:
# reading reviews from google drive
df_reviews = pd.read_csv("/content/drive/MyDrive/DSCI-521/raw_review_All_Beauty.csv")
df_reviews.head()

In [None]:
#check for the shape
df_reviews.shape

In [None]:
#check for columns
df_reviews.columns

In [None]:
#Merge with Product Metadata
df_combined = pd.merge(df_reviews, df_meta, on="parent_asin", how="left")

In [None]:
df_combined.head()

In [None]:
#check for the shape
df_combined.shape

In [None]:
#check for columns
df_combined.columns

In [None]:
# Display summary info
df_combined.info()

In [None]:
# Save the combined dataset
df_combined.to_csv('All_Beauty.csv', index=False)

In [None]:
# rename the duplicate columns
df_combined = df_combined.rename(columns={
    "title_x": "review_title",
    "title_y": "product_title",
    "images_x": "review_images",
    "images_y": "product_images",
    "text": "review_text"
})


In [None]:
#check for the columns
df_combined.columns

In [None]:
#check for null value
df_combined.isnull().sum().sort_values(ascending=False)

In [None]:
# Check which star ratings appear in the data
df_combined["rating"].unique()

In [None]:
# Convert the rating column to int
df_combined["rating"] = df_combined["rating"].astype(int)

In [None]:
# Check the unique values again after the conversion
df_combined["rating"].unique()

🔽 Next, we examine the rating column by mapping 4–5 → “positive,” 3 → “neutral,” and 1–2 → “negative.” We then visualize the distribution using a bar chart, a box plot, and a pie chart. The results show that the majority of ratings fall into the “positive” category.

In [None]:
# Define a function to map rating to sentiment
def label_sentiment(rating):
    if rating >= 4:
        return "positive"
    elif rating <= 2:
        return "negative"
    else:
        return "neutral"

# Apply the sentiment function to create a new column
df_combined["sentiment"] = df_combined["rating"].apply(label_sentiment)

# Display the unique values to verify
df_combined["sentiment"].value_counts()

In [None]:
import matplotlib.pyplot as plt

# Count the number of reviews per sentiment category
sentiment_counts = df_combined["sentiment"].value_counts()

# Plot the sentiment distribution
plt.figure(figsize=(6, 4))
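# Look up each bar's color from its label so colors stay correct whatever the sort order
label_colors = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}
sentiment_counts.plot(kind='bar', color=[label_colors[s] for s in sentiment_counts.index])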

# Add titles and labels
plt.title("Sentiment Distribution of Reviews")
plt.xlabel("Sentiment")
plt.ylabel("Number of Reviews")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the plot
plt.tight_layout()
plt.show()


In [None]:
# Explore how price varies across sentiment classes

import seaborn as sns
import matplotlib.pyplot as plt

# Define custom colors for sentiment
sentiment_palette = {
    "positive": "green",
    "neutral": "gray",
    "negative": "red"
}

# Boxplot: Price distribution by sentiment
plt.figure(figsize=(8, 5))
sns.boxplot(x="sentiment", y="price", data=df_combined, palette=sentiment_palette)
plt.title("Price Distribution by Sentiment")
plt.xlabel("Sentiment")
plt.ylabel("Price")
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()



In [None]:
import matplotlib.pyplot as plt

# Assuming df_combined is already loaded

# Calculate rating counts
rating_counts = df_combined['rating'].value_counts()

# Create pie chart
plt.figure(figsize=(8, 8))  # Adjust size as needed
plt.pie(rating_counts, labels=rating_counts.index, autopct='%1.1f%%',
        startangle=90, colors=['lightgreen', 'green', 'lightblue', 'blue', 'orange'])
plt.title('Distribution of Customer Ratings')
plt.show()

🔽 Next, we examine how many products fall into each price range, which shows that the vast majority of beauty products in the dataset are low-priced, and only a few appear in higher price brackets. In other words, the price distribution is heavily right-skewed: most items cost relatively little, and only a small number are very expensive.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.hist(df_combined['price'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Product Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

🔽 Next, we examine the main category column to see how many beauty-product categories we have. Unfortunately, it only shows two: “All Beauty” and “Premium Beauty.” Later, we will create our own subcategories to analyze how customer sentiment is distributed within each custom group.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Check if 'main_category' or 'categories' exists
if 'main_category' in df_combined.columns:
    top_categories = df_combined['main_category'].value_counts().head(10)
elif 'categories' in df_combined.columns and isinstance(df_combined['categories'].iloc[0], list):
    # Assuming 'categories' contains lists of categories, flatten and count
    all_categories = [cat for sublist in df_combined['categories'].tolist() for cat in sublist]
    top_categories = pd.Series(all_categories).value_counts().head(10)
else:
    print("Neither 'main_category' nor 'categories' column found with expected format.")

    # If neither of the above works, you might need to examine the DataFrame further
    # to see if any column represents category information under a different name.
    # Print the column names to check:
    print(df_combined.columns)

# Only proceed if top_categories is defined
if 'top_categories' in locals():
    plt.figure(figsize=(10, 6))
    sns.barplot(x=top_categories.index, y=top_categories.values, palette='viridis')
    plt.title('Top 10 Product Categories')
    plt.xlabel('Category')
    plt.ylabel('Number of Products')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

🔽 Next, we examine the distribution of sentiment labels for verified versus unverified purchases. We observe that most reviews, particularly positive ones, come from verified purchasers, suggesting a link between genuine product experience and favorable feedback. In contrast, unverified purchases exhibit a relatively higher share of negative sentiment, which may indicate that reviewers without firsthand experience are more likely to submit critical or misleading feedback.
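
A count plot shows absolute numbers, so the much larger verified group can hide differences in proportions. As a hedged complement, the sketch below uses `pd.crosstab` with `normalize="index"` on toy data (values invented for illustration) to compute the share of each sentiment within the verified and unverified groups.

```python
import pandas as pd

# Toy data standing in for df_combined (values are illustrative only)
toy = pd.DataFrame({
    "verified_purchase": [True, True, True, True, False, False],
    "sentiment": ["positive", "positive", "positive", "negative",
                  "positive", "negative"],
})

# normalize="index" makes each row (verified / unverified) sum to 1
shares = pd.crosstab(toy["verified_purchase"], toy["sentiment"], normalize="index")
print(shares.round(2))
```

Applied to the real columns, the same call would give per-group sentiment shares that can be compared directly.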

In [None]:
# explore Sentiment count by Verified Purchase
plt.figure(figsize=(8, 5))

# Map the True/False values of verified_purchase to bar colors
verified_palette = {True: "green", False: "red"}

sns.countplot(x="sentiment", hue="verified_purchase", data=df_combined, palette=verified_palette)
plt.title("Sentiment by Verified Purchase")
plt.xlabel("Sentiment")
plt.ylabel("Number of Reviews")
plt.legend(title="Verified Purchase")
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

🔽 Finally, we defined a function to clean text by removing stopwords, punctuation, and numbers while retaining only words tagged as NOUN, VERB, ADJ, or ADV. We applied this function to both review_title and review_text to create clean_title and clean_text, and then saved the resulting DataFrame as All_Beauty_cleaned.csv.
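
The cleaning function removes stop words but deliberately keeps a few, such as negations and intensifiers, because they can flip or strengthen sentiment. The sketch below illustrates that set-difference idea in plain Python. The tiny `stop_words` set and the `drop_stopwords` helper are hypothetical stand-ins; the notebook itself uses spaCy's `nlp.Defaults.stop_words`.

```python
# Toy stand-in for spaCy's stop-word list (assumption: the real list is far larger)
stop_words = {"the", "is", "a", "not", "no", "very", "this", "it"}

# Words that carry sentiment and should survive stop-word removal
important_words = {"not", "no", "never", "very"}

# Remove the important words from the stop-word set
custom_stopwords = stop_words - important_words

def drop_stopwords(text):
    """Lowercase, split on whitespace, and drop the custom stop words."""
    return [tok for tok in text.lower().split() if tok not in custom_stopwords]

kept = drop_stopwords("This is not a very good shampoo")
print(kept)
```

Without the set difference, "not" and "very" would be removed and the phrase would reduce to "good shampoo", reversing its meaning.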

In [None]:
def clean_text(df, column):
    """Lemmatize a text column, keeping only NOUN, VERB, ADJ, and ADV tokens."""
    # Read from the DataFrame that was passed in, not the global df_combined
    texts = df[column].fillna('').astype(str).to_list()
    nlp = spacy.load("en_core_web_sm")

    # Build the stop-word set once, keeping words that carry sentiment
    important_words = {"not", "no", "never", "very", "just", "really", "like", "empty", "ten", "must", "serious", "yes"}
    custom_stopwords = nlp.Defaults.stop_words.difference(important_words)
    content_pos = {"NOUN", "VERB", "ADJ", "ADV"}

    cleaned = []
    for doc in tqdm(nlp.pipe(texts, batch_size=1000, disable=["ner", "parser"]), total=len(texts)):
        # Note: kept words can still be dropped by the POS filter
        # (for example, spaCy usually tags "not" as PART)
        tokens = [
            word.lemma_.lower().strip()
            for word in doc
            if not word.is_punct and not word.is_space and not word.like_num
            and word.text.lower() not in custom_stopwords
            and word.pos_ in content_pos
        ]
        cleaned.append(" ".join(tokens))
    return cleaned

In [None]:
df_combined['clean_title'] = clean_text(df_combined,'review_title')

In [None]:
df_combined['clean_text'] = clean_text(df_combined,'review_text')

In [None]:
 # saving the clean dataset

df_combined.to_csv('All_Beauty_cleaned.csv', index=False)

### 📥 Loading the Cleaned Dataset and Preparing Date Features 

In this section, we load the cleaned and preprocessed version of the Amazon Beauty reviews dataset. This dataset has undergone multiple cleaning steps such as lowercasing, removal of punctuation and stopwords, lemmatization, and creation of new fields for modeling.

---

#### 🧹 Loading the Dataset

We read the cleaned dataset `All_Beauty_cleaned.csv` from Google Drive using `pandas`. This dataset includes 27+ columns such as:
- `review_title`, `review_text`, `rating`
- Metadata like `asin`, `store`, `categories`, `details`
- Cleaned text fields: `clean_title`, `clean_text`, and `clean_review`

In [None]:
# reading the cleaned dataset
df = pd.read_csv("/content/drive/MyDrive/DSCI-521/All_Beauty_cleaned.csv")
df.head()

The dataset contains 27 columns and includes new features such as:

- `clean_title`: cleaned version of the review title
- `clean_text`: cleaned version of the review body
- `clean_review`: concatenation of cleaned title and text
- Metadata fields like `store`, `categories`, `product_images`, and `videos`



---

### **Timestamp Conversion & Date Features**

The original `timestamp` column (in Unix milliseconds) was converted to a human-readable `review_date`. We also extracted the `review_year` and `review_month` to enable time-based analysis.
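
Because the timestamps are in milliseconds, the `unit` argument of `pd.to_datetime` matters: reading the same values as seconds would place them far outside any plausible date range. A minimal sketch with two made-up timestamps:

```python
import pandas as pd

# Two illustrative Unix timestamps in milliseconds (not taken from the dataset)
toy_ts = pd.Series([1588687728923, 1609459200000])

# unit="ms" interprets the integers as milliseconds since 1970-01-01 UTC
toy_dates = pd.to_datetime(toy_ts, unit="ms")

toy_years = toy_dates.dt.year.tolist()
toy_months = toy_dates.dt.month.tolist()
print(toy_dates)
```

A quick sanity check on real data is to confirm that the resulting years fall within the period the dataset is expected to cover.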


We visualized the number of reviews per year to identify trends in customer feedback over time.





In [None]:
# Convert the Unix timestamp (in milliseconds) to a datetime

df['review_date'] = pd.to_datetime(df['timestamp'], unit='ms', errors='coerce')

Then, we extract the year and month of each review to allow time-based analysis:

In [None]:
# Convert timestamp from milliseconds to datetime
df['review_date'] = pd.to_datetime(df['timestamp'], unit='ms')

# Optional: extract year/month for analysis
df['review_year'] = df['review_date'].dt.year
df['review_month'] = df['review_date'].dt.month

# Check the result
print(df[['timestamp', 'review_date', 'review_year', 'review_month']].head())


In [None]:
# Plot number of reviews by year
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
sns.countplot(data=df, x='review_year', order=sorted(df['review_year'].dropna().unique()))
plt.title("Number of Reviews by Year")
plt.xlabel("Year")
plt.ylabel("Number of Reviews")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# Convert timestamp from milliseconds to datetime
df['review_date'] = pd.to_datetime(df['timestamp'], unit='ms')

# Drop the original timestamp column if not needed
df.drop('timestamp', axis=1, inplace=True)

# Optional: Reorder columns to place 'review_date' earlier
# Move 'review_date' to the 4th column (for example)
cols = list(df.columns)
cols.insert(3, cols.pop(cols.index('review_date')))
df = df[cols]

# Check a sample
df.sample(5)


## Sample View
Here's a snapshot of what the updated DataFrame looks like:

- `review_date`: human-readable review date
- `review_year` and `review_month`: useful for time-series or seasonal analysis
- Cleaned review content (`clean_review`), ready for NLP modeling

This step ensures that we are working with a clean, enriched dataset that's structured for both exploratory analysis and machine learning tasks.

In [None]:
df.sample(5)

In [None]:
df.head()

In [None]:
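# Assign the result back instead of using inplace on a single column
df['categories'] = df['categories'].fillna('[]')

In [None]:
# Parse the stringified lists with literal_eval, which only accepts Python literals
from ast import literal_eval
df['categories'] = df['categories'].apply(literal_eval)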


### 🛠️ **Feature Enrichment and Preparation**

After loading the cleaned dataset, we performed additional data preparation and feature extraction steps to enrich the dataset for deeper analysis and model training.

---
#### 📂 **Combining Cleaned Text**

We created a new feature, `clean_review`, by concatenating `clean_title` and `clean_text`. This unified text field will serve as the input for text vectorization and sentiment modeling.
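
One detail worth checking: after a CSV round trip, empty strings are typically read back as `NaN`, and concatenating a string with `NaN` yields `NaN`. The sketch below, on toy data, contrasts the plain concatenation with a variant that fills missing parts first. The `fillna` variant is an assumption about what may be desirable, not the notebook's method.

```python
import pandas as pd

# Toy rows: the second review has no cleaned title (values are illustrative)
toy = pd.DataFrame({
    "clean_title": ["great", None],
    "clean_text": ["love shampoo", "not work"],
})

# Plain concatenation: a missing title makes the whole result missing
naive = toy["clean_title"] + " " + toy["clean_text"]

# Filling missing parts first keeps the review body
safe = (toy["clean_title"].fillna("") + " " + toy["clean_text"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```

With the plain version, reviews that lack a cleaned title would later be removed by any `dropna` on `clean_review`.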


In [None]:
df['clean_review'] = df['clean_title'] + " " + df['clean_text']

In [None]:
df.sample(5)

In [None]:
# Remove duplicates
df.drop_duplicates(subset='clean_text', inplace=True)

## Cleaning Up

After the conversion, we dropped the original `timestamp` column and moved `review_date` near the front for better visibility. Here we also drop the `review_images` column.

In [None]:
df.drop(columns=['review_images'], inplace=True)



#### **Extracting 'Hair Type' Information**

From the newly created `details_dict`, we extracted the `Hair Type` attribute and created a new column. Missing or unknown values were labeled as `'Unknown'`.



In [None]:
# Convert string dict to actual dict
from ast import literal_eval
df['details_dict'] = df['details'].apply(lambda x: literal_eval(x) if pd.notnull(x) else {})

# Extract a sample feature
df['Hair_Type'] = df['details_dict'].apply(lambda x: x.get('Hair Type', None))




#### 🧾 **Parsing the `details` Column**

The `details` column contains product specifications in stringified dictionary format. To extract useful features (like Hair Type), we used Python’s `literal_eval` to safely convert strings into dictionaries.
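
To illustrate why `literal_eval` is the safer choice, the sketch below parses a made-up dictionary string and then shows that a string containing a function call is rejected instead of executed. The sample strings are invented for illustration.

```python
from ast import literal_eval

# A stringified dict like those in the details column (sample values are made up)
parsed = literal_eval("{'Hair Type': 'Dry', 'Brand': 'Example'}")
hair_type = parsed.get('Hair Type', 'Unknown')
print(hair_type)

# literal_eval only accepts literals, so code such as a function call is refused
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

Python's built-in `eval` would have executed the second string, which is why it is best avoided on data read from files.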



In [None]:
# Function to safely convert strings to dictionaries
def safe_literal_eval(x):
    try:
        return literal_eval(x) if isinstance(x, str) and x not in ['', 'nan'] else {}
    except (ValueError, SyntaxError):  # Handle malformed strings
        return {}

# Convert string dict to actual dict, handling errors
df['details_dict'] = df['details'].apply(safe_literal_eval)

# Extract 'Hair Type' feature, providing a default value
df['Hair_Type'] = df['details_dict'].apply(lambda x: x.get('Hair Type', 'Unknown'))

#### **Binning Helpfulness Votes**

To better understand review usefulness, we categorized `helpful_vote` into bins using `pd.cut`. This `helpful_bin` column allows for grouped analysis of reviews by perceived helpfulness.
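
By default `pd.cut` uses right-closed intervals, so the edges `[-1, 0, 2, 10, 50, ...]` produce the groups 0, 1 to 2, 3 to 10, 11 to 50, and above 50. Any value beyond the last edge is left as `NaN`. The sketch below compares a finite top edge with an open-ended one on a few made-up vote counts.

```python
import pandas as pd

# Illustrative vote counts, including one above 1000
votes = pd.Series([0, 1, 2, 3, 10, 11, 50, 51, 5000])
labels = ['0', '1-2', '3-10', '11-50', '50+']

# Finite top edge: 5000 falls outside every bin and becomes NaN
capped = pd.cut(votes, bins=[-1, 0, 2, 10, 50, 1000], labels=labels)

# Open-ended top edge: every non-negative count receives a label
open_ended = pd.cut(votes, bins=[-1, 0, 2, 10, 50, float('inf')], labels=labels)

print(pd.DataFrame({"votes": votes, "capped": capped, "open_ended": open_ended}))
```

Whether a finite cap matters depends on whether any review in the data has more votes than the last edge.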

In [None]:
# Use an open-ended top bin so reviews with more than 1000 votes are not left as NaN
df['helpful_bin'] = pd.cut(df['helpful_vote'], bins=[-1, 0, 2, 10, 50, float('inf')], labels=['0', '1-2', '3-10', '11-50', '50+'])



#### 🧾 Sample View of Final Dataset

After these transformations, our enriched dataset contains over **30 columns**, including:

* `clean_review`: preprocessed text for NLP
* `Hair_Type`: extracted from product metadata
* `helpful_bin`: binned helpfulness scores
* `review_date`, `review_year`, `review_month`: derived from timestamps

These features provide a robust foundation for sentiment analysis, user behavior modeling, and product category analysis.


---

### 🛠️ **Feature Enrichment and Preparation – Summary**

After loading the cleaned Amazon Beauty reviews dataset, we performed several enrichment steps to prepare it for sentiment analysis and modeling. We began by combining `clean_title` and `clean_text` into a unified `clean_review` column, which serves as the main input for NLP tasks. We then processed the `categories` column by safely converting stringified lists into actual Python lists. For the `details` column—containing structured product metadata—we used a robust parsing function to convert string dictionaries into usable Python dictionaries and extracted specific attributes like `Hair Type`, labeling unknown values accordingly.

To support analysis of review quality, we binned the `helpful_vote` count into a new categorical variable, `helpful_bin`, representing perceived helpfulness ranges (e.g., 0, 1–2, 3–10, etc.). We also converted the original Unix `timestamp` to human-readable `review_date` and extracted `review_year` and `review_month` for time-based analysis.

These enhancements resulted in a richly structured dataset of over 30 columns, enabling advanced text modeling, sentiment classification, and user behavior insights.




In [None]:
df.head()




### 🧴🧼 **Sub-Category Classification Based on Product Titles**

To better understand the types of beauty products in our dataset, we created a new feature called `sub_category` that classifies each product based on keywords found in the `product_title`. This helps in segmenting the dataset for more targeted analysis such as sentiment trends within each product category.

---

#### 🛠️ **Classification Process**

We defined a custom function `extract_subcategory()` to scan each product title for relevant keywords and assign it to one of the following predefined sub-categories:

* **Hair Care** (e.g., shampoo, conditioner, hair)
* **Skin Care** (e.g., moisturizer, serum, face wash)
* **Makeup** (e.g., lipstick, foundation, mascara)
* **Fragrance** (e.g., perfume, cologne)
* **Body Care** (e.g., body wash, soap)
* **Nail Care** (e.g., nail polish, manicure)
* **Sun Care** (e.g., sunscreen, SPF)
* **Deodorant**
* **Other** (when no matching keywords were found)
* **Unknown** (if title was missing)

We then applied this function to the dataset:

```python
df["sub_category"] = df["product_title"].apply(extract_subcategory)
```

---

#### 📊 **Sub-Category Distribution**

After classification, we used value counts and a bar plot to visualize how reviews are distributed across different sub-categories.

```python
sub_category_counts = df['sub_category'].value_counts()
```
This helped us understand which types of products dominate the dataset. For example:

* **Hair Care** and **Other** were the most frequent sub-categories.
* Niche categories like **Sun Care** and **Deodorant** had fewer reviews.


In [None]:
import pandas as pd

# Define a function to extract sub-category from the product title
def extract_subcategory(title):
    if pd.isnull(title):
        return "Unknown"

    # Convert title to lowercase for easier keyword matching
    title = title.lower()

    # Classify based on presence of keywords in the title
    if any(keyword in title for keyword in ["shampoo", "conditioner", "hair"]):
        return "Hair Care"
    elif any(keyword in title for keyword in ["moisturizer", "serum", "cream", "lotion", "cleanser", "face wash"]):
        return "Skin Care"
    elif any(keyword in title for keyword in ["lipstick", "mascara", "eyeshadow", "foundation", "concealer", "blush"]):
        return "Makeup"
    elif any(keyword in title for keyword in ["perfume", "fragrance", "eau de", "cologne"]):
        return "Fragrance"
    elif any(keyword in title for keyword in ["body wash", "soap", "scrub", "bath"]):
        return "Body Care"
    elif any(keyword in title for keyword in ["nail", "polish", "manicure", "pedicure"]):
        return "Nail Care"
    elif any(keyword in title for keyword in ["sunscreen", "spf", "sunblock"]):
        return "Sun Care"
    elif any(keyword in title for keyword in ["deodorant", "antiperspirant"]):
        return "Deodorant"
    else:
        return "Other"

# Apply the function to classify products based on their title
df["sub_category"] = df["product_title"].apply(extract_subcategory)

# Display the frequency of each sub-category
df["sub_category"].value_counts()
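
Two properties of the keyword approach are worth keeping in mind. First, `keyword in title` is a substring test, so "hair" also matches "chair". Second, the first matching branch wins, so the order of the checks decides ties. The sketch below is a hypothetical stricter variant that uses word-boundary regular expressions. It covers only two of the sub-categories, and `extract_subcategory_strict` is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical subset of the keyword lists, compiled with word boundaries
SUBCATEGORY_PATTERNS = [
    ("Hair Care", re.compile(r"\b(shampoo|conditioner|hair)\b")),
    ("Body Care", re.compile(r"\b(body wash|soap|scrub|bath)\b")),
]

def extract_subcategory_strict(title):
    """Return the first sub-category whose keywords match as whole words."""
    if not isinstance(title, str):
        return "Unknown"
    lowered = title.lower()
    for label, pattern in SUBCATEGORY_PATTERNS:
        if pattern.search(lowered):
            return label
    return "Other"

print("hair" in "salon chair cover")                      # substring test matches
print(extract_subcategory_strict("Salon Chair Cover"))    # whole-word test does not
print(extract_subcategory_strict("Hair and Body Wash"))   # first matching label wins
```

The trade-off is that whole-word matching misses plurals such as "shampoos" unless the patterns are extended, so the substring version may still be preferable for recall.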


We visualized the distribution with the following plot:


#### ✅ **Outcome**

This sub-category classification enables us to:

* Perform more detailed exploratory data analysis
* Compare sentiment scores or review patterns across categories
* Use sub-category as an additional feature in predictive models

---



In [None]:
#plot
import matplotlib.pyplot as plt
import seaborn as sns
sub_category_counts = df['sub_category'].value_counts()
plt.figure(figsize=(10, 6))
sns.barplot(x=sub_category_counts.index, y=sub_category_counts.values, palette='viridis')
plt.title('Sub-Category Distribution')
plt.xlabel('Sub-Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()





In [None]:
# Remove Unknowns; copy so that later column assignments do not trigger a SettingWithCopyWarning
df = df[df['sub_category'] != 'Unknown'].copy()


In [None]:
small_cats = ['Fragrance', 'Deodorant', 'Sun Care']
df['sub_category_grouped'] = df['sub_category'].apply(lambda x: x if x not in small_cats else 'Other')


### 📊 Sentiment Distribution & Time-Based Rating Trends

In this section, we performed two key analyses:

---

#### 1️⃣ Sentiment Distribution by Product Sub-Category

We created sentiment labels based on review ratings:
- Ratings **≥ 4** → **Positive**
- Ratings **= 3** → **Neutral**
- Ratings **≤ 2** → **Negative**

Using this classification, we plotted the distribution of sentiments across different `sub_category` values (e.g., Hair Care, Skin Care, Makeup). This allowed us to see which product types had more balanced or polarized reviews.

---

#### 2️⃣ Time-Based Product Rating Trends

We calculated the **change in product rating over time** to assess how customer perception has shifted. Using the `groupby().diff()` method, we computed a column `Delta` to reflect the difference in ratings spaced six entries apart (as a proxy for time progression).

Focusing on the **Skin Care** sub-category, we identified the **Top 10 products** with the most improved ratings and visualized them in a bar chart. This highlighted which skincare products have gained favor among users over the years.

These analyses help uncover:
- Sentiment tendencies across product types
- Long-term improvements in customer satisfaction
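
Note that `diff(periods=k)` compares each row with the row `k` positions earlier in the same group, so it measures change across reviews, not calendar time, and the result depends on row order. The toy sketch below (invented values, with `periods=2` to keep it small) sorts by date first so that the difference runs from older to newer reviews.

```python
import pandas as pd

# Toy reviews in shuffled date order (values are illustrative only)
toy = pd.DataFrame({
    "product_title": ["Serum A"] * 4 + ["Cream B"] * 2,
    "review_date": pd.to_datetime([
        "2021-03-01", "2019-01-01", "2020-06-01", "2022-02-01",
        "2020-01-01", "2021-01-01",
    ]),
    "rating": [3, 1, 2, 5, 4, 2],
})

# Sort chronologically so each review is compared with an earlier one
toy = toy.sort_values("review_date")

# Rating minus the rating two reviews earlier for the same product
toy["Delta"] = toy.groupby("product_title")["rating"].diff(periods=2)
print(toy)
```

Products with no more than `k` reviews have no defined difference, so they drop out of any ranking based on `Delta`.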


In [None]:
# Create sentiment labels from ratings
df['Sentiment'] = df['rating'].apply(lambda x: 'positive' if x >= 4 else 'negative' if x <= 2 else 'neutral')

In [None]:
sns.countplot(data=df, x='sub_category', hue='Sentiment')
plt.xticks(rotation=45)
plt.title('Sentiment Distribution per Subcategory')
plt.show()


In [None]:

import pandas as pd

# Sort chronologically so that diff() compares each review with an earlier one
df = df.sort_values('review_date')

# Change in rating relative to the review six entries earlier for the same product
# (diff(periods=6) counts rows, not years)
df['Delta'] = df.groupby('product_title')['rating'].diff(periods=6)

# Filter for Skin Care and select the 10 rows with the largest increase
top_10_rise_skin = df[df['sub_category'] == 'Skin Care'].sort_values(by='Delta', ascending=False).head(10)

# Create the bar plot
Top_10_plt = sns.barplot(x='product_title', y='Delta', data=top_10_rise_skin)
Top_10_plt.set_title('Top 10 Skin Care Rating Increases (6 Reviews Apart)', fontsize=15, y=1.05)
plt.xticks(rotation=90)
plt.show()

#### 🗂️ Product Classification by Main Category

We checked how reviews are distributed across the two main product categories:


In [None]:
# each product classification
df['main_category'].value_counts()





#### Handling Missing Values

We checked for missing values using `df.isnull().sum()`.

Then we dropped rows with missing values in the `clean_review` or `Sentiment` columns as these are critical for sentiment modeling:






In [None]:
# See nulls
df.isnull().sum().sort_values(ascending=False)

# Drop rows with missing clean_review or sentiment
df = df.dropna(subset=['clean_review', 'Sentiment'])



#### 🆕 Creating a Binary Helpfulness Feature

We added a new feature `helpful` that marks a review as "helpful" (1) if it received at least one helpful vote:



In [None]:
# Create a new binary feature: was the review helpful or not
df['helpful'] = df['helpful_vote'].apply(lambda x: 1 if x > 0 else 0)

# You can also look at helpfulness by sentiment:
helpful_summary = df.groupby('Sentiment')['helpful_vote'].mean()
helpful_summary

### 📅 Monthly Review Trends Over Time

In this section, we analyzed how the volume of product reviews changed over time.


#### Creating a Monthly Period Column

We extracted the **month and year** from the `review_date` column using `dt.to_period('M')` and stored the result in `review_month`, replacing the integer month created earlier. We then grouped the dataset by `review_month` and counted the number of reviews in each month.



📌 **Insight:**
This visualization helps us understand periods of high or low user engagement and whether review activity increased over specific years (e.g., spikes around 2020–2021).
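
As a small illustration of the monthly grouping, the sketch below applies `dt.to_period('M')` to three made-up dates and counts the reviews per month.

```python
import pandas as pd

# Three illustrative review dates, two of them in the same month
toy = pd.DataFrame({
    "review_date": pd.to_datetime(["2020-05-05", "2020-05-20", "2020-06-01"]),
})

# to_period('M') collapses each date to its year-month period
toy["review_month"] = toy["review_date"].dt.to_period("M")

# size() counts the rows in each monthly group
monthly = toy.groupby("review_month").size()
print(monthly)
```

Unlike the integer month from `dt.month`, a period keeps the year, so May 2020 and May 2021 are counted separately.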




In [None]:
# Monthly reviews over time
df['review_month'] = df['review_date'].dt.to_period('M')
monthly_reviews = df.groupby('review_month').size()

monthly_reviews.plot(kind='line', figsize=(12, 6), title='Monthly Review Volume')
plt.ylabel('Number of Reviews')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()






#### Verifying Product Categories

We inspected the unique values in the `main_category` column, then filtered the data to show only products in the **All Beauty** category and verified the result. This ensures that subsequent analyses (like time trends or sentiment by category) focus specifically on the dominant category in our dataset.

In [None]:
# List all product categories
df['main_category'].unique()


In [None]:
# Filter the DataFrame to select rows where the 'main_category' column equals 'All Beauty'
filtered_df = df[df['main_category'] == 'All Beauty']

# Get the unique values in the 'main_category' column of the filtered DataFrame
unique_categories = filtered_df['main_category'].unique()

# Print the unique categories (should only be 'All Beauty' if the filtering worked)
print(unique_categories)

In [None]:
print(df['main_category'].unique())

In [None]:
# Show top 10 frequent words
word_freq = word_count(df, 'clean_text')
top_10_words = word_freq.most_common(10)
top_10_words = pd.DataFrame(top_10_words, columns=['Word', 'Frequency'])

# Bar plot of the ten most frequent words
ax = sns.barplot(data=top_10_words, x='Word', y='Frequency')

ax.set_title('Top 10 Most Frequent Words')
plt.show()

In [None]:
#Word cloud of frequent words
word_cloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
plt.figure(figsize=(10, 5))
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Frequent Words')
plt.show()

### 📊 Sentiment Class Distribution

To understand the balance of sentiment labels in our dataset, we visualized the distribution of `Sentiment` values using a bar plot:


In [None]:
sns.countplot(data=df, x='Sentiment')
plt.title("Class Distribution of Review Sentiment")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.show()

# Check actual counts/ratios
print(df['Sentiment'].value_counts(normalize=True))


## 🔍 Exploratory Data Analysis (EDA) & Feature Engineering – Summary

In this section, we explored the dataset thoroughly and engineered meaningful features to prepare it for sentiment classification and modeling.

---

### 📊 Exploratory Data Analysis (EDA)

#### 1️⃣ Dataset Overview
- Loaded a cleaned dataset with ~600,000 Amazon product reviews in the **All Beauty** and **Premium Beauty** categories.
- Key columns included: `rating`, `review_text`, `verified_purchase`, `helpful_vote`, `categories`, and product `details`.

#### 2️⃣ Temporal Analysis
- Converted the Unix `timestamp` column into a human-readable `review_date`.
- Extracted `review_year` and `review_month` for time-based trend analysis.
- Visualized **monthly review volume** using a line plot to observe customer engagement over time.
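
The temporal steps above can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's exact code: it assumes the raw `timestamp` column is stored in milliseconds since the Unix epoch (use `unit='s'` if the data is in seconds), and the `toy` frame is made up.

```python
import pandas as pd

# Toy data; the millisecond unit is an assumption about the raw `timestamp` column.
toy = pd.DataFrame({"timestamp": [1577836800000, 1580515200000, 1612137600000]})

# Convert Unix time to a datetime column, then derive the year and a monthly period.
toy["review_date"] = pd.to_datetime(toy["timestamp"], unit="ms")
toy["review_year"] = toy["review_date"].dt.year
toy["review_month"] = toy["review_date"].dt.to_period("M")

print(toy[["review_date", "review_year", "review_month"]])
```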

#### 3️⃣ Rating Distribution
- Plotted the distribution of `rating` values (1–5 stars).
- Identified that the dataset is **skewed toward positive ratings** (4–5 stars).

#### 4️⃣ Sentiment Distribution
- Created a new `Sentiment` column from ratings:
  - 4–5 → Positive  
  - 3 → Neutral  
  - 1–2 → Negative
- Count plot showed class imbalance with ~70% positive reviews.
- Calculated exact proportions using `value_counts(normalize=True)`.

#### 5️⃣ Sub-Category Classification
- Created a new column `sub_category` by extracting product types (e.g., "shampoo" → Hair Care).
- Visualized the distribution of reviews across sub-categories using a bar plot.
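
One simple way to derive a sub-category from title keywords is a lookup table. The sketch below is illustrative only: the keyword list, the `assign_sub_category` helper, and the `"Other"` fallback are assumptions, and the notebook's actual mapping may differ.

```python
# Hypothetical keyword-to-category map; the notebook's actual keyword list may differ.
KEYWORD_TO_SUBCATEGORY = {
    "shampoo": "Hair Care",
    "conditioner": "Hair Care",
    "lipstick": "Makeup",
    "mascara": "Makeup",
    "moisturizer": "Skin Care",
    "serum": "Skin Care",
}

def assign_sub_category(title, default="Other"):
    """Return the sub-category of the first keyword found in the product title."""
    text = str(title).lower()
    for keyword, category in KEYWORD_TO_SUBCATEGORY.items():
        if keyword in text:
            return category
    return default

print(assign_sub_category("Argan Oil Shampoo 16 oz"))
print(assign_sub_category("Stainless Steel Tweezers"))
```

On the full dataset this could be applied with something like `df['sub_category'] = df['title'].apply(assign_sub_category)`, assuming the product title lives in a column named `title`.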

#### 6️⃣ Product Metadata Exploration
- Parsed the `details` column (stringified dictionary) using `literal_eval`.
- Extracted meaningful product attributes like `Hair_Type`.
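
Parsing a stringified dictionary with `ast.literal_eval` might look like the sketch below. The `parse_details` helper and the key name `"Hair Type"` are assumptions for illustration; the real metadata may spell the key differently.

```python
import ast

def parse_details(raw):
    """Safely turn a stringified dict into a dict; return {} when parsing fails."""
    if isinstance(raw, dict):
        return raw
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Malformed strings and missing values (e.g. NaN) end up here.
        return {}
    return parsed if isinstance(parsed, dict) else {}

# The key name "Hair Type" is an assumption about how the raw metadata spells it.
example = "{'Hair Type': 'Curly', 'Brand': 'ExampleBrand'}"
details = parse_details(example)
print(details.get("Hair Type"))
print(parse_details("not a dict"))
```

Returning an empty dict on failure keeps a later `.get(...)` call safe for rows with missing or malformed metadata.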

#### 7️⃣ Helpfulness Analysis
- Created a binary feature `helpful` (1 if `helpful_vote` > 0).
- Grouped by sentiment to calculate average helpfulness:
  - Positive and negative reviews were more often marked helpful than neutral ones.

#### 8️⃣ Time-Based Product Rating Trends
- Calculated `Delta` (change in rating) to track improvement over time.
- Identified the **top 10 Skin Care products** with the highest increase in rating over a 6-year span.

---

### ⚙️ Feature Engineering

We created several new features to enhance the dataset and support effective model training:

#### ✅ Text Features
- `clean_title` and `clean_text`: Preprocessed versions of raw review content.
- `clean_review`: Combined `clean_title` + `clean_text` as a unified input for sentiment modeling.

#### 🕓 Time Features
- `review_year`, `review_month`: Extracted from converted `review_date`.

#### 📦 Categorical Features
- `sub_category`: Product type (Hair Care, Makeup, etc.) derived from title keywords.
- `Hair_Type`: Extracted from `details_dict`.

#### 👍 Helpfulness Features
- `helpful`: Binary feature based on `helpful_vote`.
- `helpful_bin`: Grouped helpful vote ranges (e.g., 0, 1–2, 3–10, etc.)

These enriched features laid the foundation for robust sentiment classification and deeper product insights.
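
As a rough sketch of the two helpfulness features, the binary flag can be computed with a vectorized comparison and the vote ranges with `pd.cut`. The bin edges follow the ranges listed above; the top label `"11+"` and the toy data are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"helpful_vote": [0, 1, 2, 3, 10, 25]})

# Binary flag: 1 if the review received at least one helpful vote.
toy["helpful"] = (toy["helpful_vote"] > 0).astype(int)

# Bin edges are right-inclusive: (-1, 0], (0, 2], (2, 10], (10, inf).
# The last label "11+" is an assumption; the text only lists 0, 1-2, 3-10, etc.
toy["helpful_bin"] = pd.cut(
    toy["helpful_vote"],
    bins=[-1, 0, 2, 10, float("inf")],
    labels=["0", "1-2", "3-10", "11+"],
)

print(toy)
```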


## 🧠 Sentiment Classification: Three Approaches

In this section, we classified the sentiment of customer reviews into three categories:
- **Positive**
- **Negative**
- **Neutral**

We applied and compared three different sentiment classification techniques:

---

### 1️⃣ Manual Labeling Based on Star Ratings

We first created sentiment labels using the `rating` column, which reflects the star rating provided by the customer (from 1 to 5).

This method assumes that star ratings accurately reflect the customer's sentiment toward the product.

---

### 2️⃣ Rule-Based Classification Using TextBlob

We then applied the `TextBlob` library, which uses a built-in sentiment lexicon to compute a **polarity score** for each review. Each review is then assigned a sentiment category based on that score.

---

### 3️⃣ Rule-Based Classification Using VADER

Finally, we used `VADER` (Valence Aware Dictionary and sEntiment Reasoner) from the NLTK library. It is especially effective for short texts and reviews. VADER calculates a **compound sentiment score** between -1 and +1 for each review.

Like TextBlob, this method also assigns sentiment directly from the text content of the review.

---

### 🎯 Why Use Three Methods?

Using multiple sentiment classification approaches allows us to:

- Compare rule-based predictions (TextBlob & VADER) to the star-based ground truth
- Explore whether textual sentiment aligns with user ratings
- Identify inconsistencies or patterns between rating and review content
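
One way to quantify how well a text-based method aligns with the star ratings is a row-normalized cross-tabulation plus an overall agreement rate. The sketch below uses made-up toy data in place of the real `label` and `vader_sentiment` columns, and the `rating_sentiment` helper column is an assumption introduced for the comparison.

```python
import pandas as pd

# Toy data standing in for the real columns `label` and `vader_sentiment`.
toy = pd.DataFrame({
    "label": [2, 2, 2, 0, 0, 1],
    "vader_sentiment": ["positive", "positive", "neutral", "negative", "positive", "neutral"],
})

# Put both columns on the same text scale before comparing.
id_to_name = {0: "negative", 1: "neutral", 2: "positive"}
toy["rating_sentiment"] = toy["label"].map(id_to_name)

# Rows: rating-based label; columns: VADER label; each row sums to 1.
agreement = pd.crosstab(toy["rating_sentiment"], toy["vader_sentiment"], normalize="index")
print(agreement.round(2))

# Overall share of reviews where the two methods agree.
match_rate = (toy["rating_sentiment"] == toy["vader_sentiment"]).mean()
print(f"Agreement rate: {match_rate:.2f}")
```

Off-diagonal cells show where the review text and the star rating disagree, for example low-star reviews whose wording still scores as positive.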



In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
import pandas as pd

df= pd.read_csv('/content/drive/MyDrive/Amazon/All_Beauty_cleaned.csv')

In [None]:
df.columns

In [None]:
df['rating'].value_counts()

### 1️⃣ Manual Labeling Based on Star Ratings


To establish a baseline for sentiment classification, we manually labeled each review using the numerical star ratings provided by customers (from 1 to 5).

We defined the sentiment classes as follows:

- Ratings **1 or 2** → **Negative (0)**
- Rating **3** → **Neutral (1)**
- Ratings **4 or 5** → **Positive (2)**

This approach assumes that the star rating reflects the user's sentiment toward the product.

We then created a new column called `label` in the dataset to store these sentiment categories.

This manual labeling will be used as the **ground truth** when comparing results from rule-based models such as **TextBlob** and **VADER**.

In [None]:
# Convert rating column to numeric
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# Keep only ratings 1 through 5 (drop any missing or invalid ratings)
df = df[df['rating'].isin([1, 2, 3, 4, 5])]

# Create multi-class sentiment labels:
# 0 = Negative, 1 = Neutral, 2 = Positive
def label_sentiment(rating):
    if rating in [1, 2]:
        return 0
    elif rating == 3:
        return 1
    else:  # rating 4 or 5
        return 2

df['label'] = df['rating'].apply(label_sentiment)


In [None]:
df['label'].value_counts()

In [None]:
df['label'].value_counts(normalize=True) * 100

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


plt.figure(figsize=(6,4))


sns.countplot(x='label', data=df, palette='viridis')


plt.title('Distribution of Sentiment Labels')
plt.xlabel('Label (0 = Negative, 1 = Neutral, 2= Positive)')
plt.ylabel('Count')


for p in plt.gca().patches:
    plt.gca().annotate(f'{int(p.get_height())}', (p.get_x() + 0.3, p.get_height() + 50))

plt.tight_layout()
plt.show()


## 2️⃣ Sentiment Classification Using TextBlob

In this section, we used the **TextBlob** library to automatically classify the sentiment of each customer review based on its textual content.

TextBlob uses a built-in sentiment lexicon to compute a **polarity score** for each piece of text. The polarity ranges from -1 (very negative) to +1 (very positive).

We defined the following thresholds for classification:

- Polarity **> 0.05** → **Positive**
- Polarity **< -0.05** → **Negative**
- Otherwise → **Neutral**

We applied this function to the `review_text` column and stored the results in a new column named `textblob_sentiment`.

This approach provides a **rule-based sentiment classification** that is independent of the star ratings, allowing us to later compare it with the manual labels.


In [None]:
# Install and import TextBlob (only once)
# !pip install textblob

from textblob import TextBlob

# Define a function to classify sentiment using TextBlob polarity
def textblob_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.05:
        return "positive"
    elif polarity < -0.05:
        return "negative"
    else:
        return "neutral"

# Apply the function to the review_text column
df['textblob_sentiment'] = df['review_text'].astype(str).apply(textblob_sentiment)

In [None]:
df['textblob_sentiment'].value_counts()

## 3️⃣ Sentiment Classification Using VADER

In this step, we used the **VADER (Valence Aware Dictionary and sEntiment Reasoner)** sentiment analysis tool from the **NLTK** library to classify review sentiment directly from the text.

VADER is particularly effective for short, informal text such as product reviews. It calculates a **compound score** for each piece of text based on a lexicon of valence-scored words and a set of heuristics (for example, for negation, punctuation, and capitalization).

The compound score ranges from -1 (most negative) to +1 (most positive). We used the following thresholds to categorize the reviews:

- Compound **> 0.05** → **Positive**
- Compound **< -0.05** → **Negative**
- Between -0.05 and 0.05 → **Neutral**

We applied this method to the `review_text` column and stored the result in a new column called `vader_sentiment`.

This method allows us to compare a second rule-based classifier with both the manual labels and the TextBlob results.


In [None]:
# Import NLTK and download the sentiment lexicon
import nltk
nltk.download('vader_lexicon')

# Import the VADER sentiment analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the analyzer
sid = SentimentIntensityAnalyzer()

# Define a function to classify sentiment based on compound score
def vader_sentiment(text):
    compound = sid.polarity_scores(text)['compound']
    if compound > 0.05:
        return "positive"
    elif compound < -0.05:
        return "negative"
    else:
        return "neutral"

# Apply the function to the review_text column
df['vader_sentiment'] = df['review_text'].astype(str).apply(vader_sentiment)

In [None]:
df['vader_sentiment'].value_counts()

In [None]:
df.head(5)

We compared the sentiment predictions from TextBlob and VADER against manual labels using precision, recall, and F1-score. This evaluation helped assess how closely each tool reflects actual customer sentiment.


In [None]:
from sklearn.metrics import classification_report

# Use the same encoding as the manual `label` column:
# 0 = negative, 1 = neutral, 2 = positive
sentiment_map = {'negative': 0, 'neutral': 1, 'positive': 2}

# Apply the mapping to the TextBlob and VADER results
df['textblob_label'] = df['textblob_sentiment'].map(sentiment_map)
df['vader_label'] = df['vader_sentiment'].map(sentiment_map)

# True labels (from star ratings) and readable class names for the report
y_true = df['label']
target_names = ['negative', 'neutral', 'positive']

# Show classification performance for TextBlob
print("📊 TextBlob Performance:")
print(classification_report(y_true, df['textblob_label'], target_names=target_names))

# Show classification performance for VADER
print("📊 VADER Performance:")
print(classification_report(y_true, df['vader_label'], target_names=target_names))

# **Visual Comparison of Sentiment Distributions**

To better understand how each method classified the review sentiments, we visualized the distribution of sentiment categories across the three approaches:

- **True Labels**: Manually derived from star ratings
- **TextBlob Sentiment**: Rule-based classification from textual polarity
- **VADER Sentiment**: Rule-based classification using compound sentiment scores

This side-by-side comparison helped us quickly observe differences and potential biases in how each method perceives customer sentiment.


In [None]:
import matplotlib.pyplot as plt

# Create 3 bar plots to compare label distributions
fig, axs = plt.subplots(1, 3, figsize=(15, 4))

#  Plot original labels (true labels)
df['label'].value_counts().sort_index().plot(
    kind='bar',
    ax=axs[0],
    title='True Labels',
    color=['red', 'gray', 'green']  # 0=negative, 1=neutral, 2=positive
)
axs[0].set_xlabel("label")
axs[0].set_ylabel("Count")

# Plot TextBlob sentiment results
df['textblob_sentiment'].value_counts().loc[['negative', 'neutral', 'positive']].plot(
    kind='bar',
    ax=axs[1],
    title='TextBlob Sentiment',
    color=['red', 'gray', 'green']
)
axs[1].set_xlabel("textblob_sentiment")
axs[1].set_ylabel("Count")

# Plot VADER sentiment results
df['vader_sentiment'].value_counts().loc[['negative', 'neutral', 'positive']].plot(
    kind='bar',
    ax=axs[2],
    title='VADER Sentiment',
    color=['red', 'gray', 'green']
)
axs[2].set_xlabel("vader_sentiment")
axs[2].set_ylabel("Count")

# Adjust layout
plt.tight_layout()
plt.show()


##  **Machine Learning: Sentiment Classification Approaches**

---



In this section, we build machine learning models to classify review sentiment using three different labeling strategies.

Each strategy provides a different perspective on how sentiment can be derived, allowing us to compare how the **source of the labels** affects model performance.

---

### 🔹 Track 1: Model trained on manually labeled data
- Uses sentiment labels created directly from the star rating (`rating` column).
- Labels: 0 = Negative, 1 = Neutral, 2 = Positive.
- Serves as the primary supervised learning setup.

### 🔹 Track 2: Model trained using TextBlob-generated labels
- Uses the `textblob_sentiment` column (positive/neutral/negative).
- Allows us to see how a model learns from rule-based lexicon predictions.

### 🔹 Track 3: Model trained using VADER-generated labels
- Uses the `vader_sentiment` column.
- Helps evaluate how another rule-based system performs as a labeling strategy for training.

---

By training and evaluating models on each of these label types, we can compare:
- The reliability of each labeling method.
- The generalization power of models based on different label sources.
- How well rule-based vs. human-proxy labels support learning.

Each track will follow the same modeling steps for consistency:
1. Text preprocessing
2. Feature extraction (TF-IDF)
3. Model training (e.g., Logistic Regression)
4. Evaluation using accuracy, precision, recall, and F1-score
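
Because every track repeats the same steps, they can be wrapped in one helper. The following is a minimal sketch rather than the notebook's implementation: `run_track` is a hypothetical function, it fits only Logistic Regression with scikit-learn's default solver, and the toy corpus stands in for `df['clean_review']` and a label column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def run_track(texts, labels, test_size=0.2, seed=42):
    """Split, fit TF-IDF + Logistic Regression, and return macro-averaged metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=5000)),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy, already-cleaned corpus; real runs would pass df['clean_review'] and a label column.
texts = ["love this cream", "great shampoo love it", "awful smell", "terrible broke fast",
         "okay nothing special", "average product okay"] * 5
labels = [2, 2, 0, 0, 1, 1] * 5
print(run_track(texts, labels))
```

Calling the same helper once per label source keeps the split, vectorizer settings, and metrics identical across tracks, so differences in scores come from the labels alone.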


##  Machine Learning – Track 1: Training on Manual Labels

In this track, we trained multiple machine learning models to classify review sentiment using manually labeled data derived from star ratings. These labels represent three sentiment classes:
- 0 → Negative
- 1 → Neutral
- 2 → Positive

---

### 🔹 Models Used
We trained and evaluated the following classifiers:
- Naive Bayes
- Support Vector Machine (SVM)
- Random Forest
- Decision Tree
- Logistic Regression

Each model was implemented as a `Pipeline` consisting of:
1. **TF-IDF Vectorizer**: to convert text into numerical features.
2. **Classifier**: the selected machine learning algorithm.

---

### 🔹 Evaluation Metrics
For each model, we calculated:
- **Precision**
- **Recall**
- **F1 Score**

We also visualized the **confusion matrix** for each model to understand the distribution of correct and incorrect predictions.


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
# Features and target
X_manual = df['clean_review']
y_manual = df['label']

In [None]:
from sklearn.model_selection import train_test_split

X_train_manual, X_test_manual, y_train_manual, y_test_manual = train_test_split(
    X_manual, y_manual, test_size=0.2, random_state=42, stratify=y_manual
)


In [None]:
classifiers = [
    ("Naive Bayes", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', MultinomialNB())
    ])),
    ("Logistic Regression", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', LogisticRegression(class_weight='balanced', max_iter=1000, solver='liblinear', random_state=42))
    ])),
    ("SVM", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', LinearSVC(class_weight='balanced', max_iter=1000))
    ])),
    ("Decision Tree", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', DecisionTreeClassifier(random_state=42))
    ])),
    ("Random Forest", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', RandomForestClassifier(n_estimators=20, random_state=42))
    ]))
]


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

for name, classifier in classifiers:
    classifier.fit(X_train_manual, y_train_manual)
    predictions = classifier.predict(X_test_manual)

    print(f"\n{name} Metrics:")
    print(" Precision:", precision_score(y_test_manual, predictions, average='macro'))
    print(" Recall:   ", recall_score(y_test_manual, predictions, average='macro'))
    print(" F1 Score: ", f1_score(y_test_manual, predictions, average='macro'))
    print("-" * 50)


In [None]:

results = []

for name, classifier in classifiers:
    preds = classifier.predict(X_test_manual)
    results.append((name, preds))


In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

for name, preds in results:
    cm = confusion_matrix(y_test_manual, preds)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"Confusion Matrix for {name}")
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.tight_layout()
    plt.show()



After training and evaluating five different models on the manually labeled dataset, we observed the following:

- **Logistic Regression** achieved the highest macro **F1 Score (0.67)**, with nearly equal precision (0.66) and recall (0.67), making it the most balanced model in this track.

- **SVM** performed almost identically, with an F1 Score of **0.66** (precision 0.66, recall 0.67), just below Logistic Regression.

- **Naive Bayes** had the highest precision (**0.71**) but low recall (**0.51**), meaning it was too conservative: it missed many reviews in the classes it rarely predicts, most likely the minority negative and neutral classes. This gave it the lowest F1 Score (**0.52**).

- **Decision Tree** (F1 **0.58**) and **Random Forest** (F1 **0.60**) also trailed the linear models, likely due to overfitting on sparse TF-IDF features or sensitivity to noise in the labels.

**Overall**, models based on linear decision boundaries (Logistic Regression and SVM) performed more consistently than tree-based or probabilistic models in this track.


##  Machine Learning – Track 2: Training on TextBlob Sentiment Labels

In this track, we train machine learning models using labels generated by the **TextBlob** sentiment analyzer. Each review was automatically classified by TextBlob into:

- `positive` → 2
- `neutral`  → 1
- `negative` → 0

The goal is to assess how well models can learn sentiment when trained on rule-based (lexicon) labels, and compare performance against other labeling approaches.


In [None]:
# Map text sentiment to numeric labels
sentiment_map = {'positive': 2, 'neutral': 1, 'negative': 0}
y_textblob = df['textblob_sentiment'].map(sentiment_map)
X_textblob = df['clean_review']


In [None]:
X_train_blob, X_test_blob, y_train_blob, y_test_blob = train_test_split(
    X_textblob, y_textblob, test_size=0.2, random_state=42, stratify=y_textblob
)

In [None]:
classifiers = [
    ("Naive Bayes", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', MultinomialNB())
    ])),
    ("Logistic Regression", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', LogisticRegression(class_weight='balanced', max_iter=1000, solver='liblinear', random_state=42))
    ])),
    ("SVM", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', LinearSVC(class_weight='balanced', max_iter=1000))
    ])),
    ("Decision Tree", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', DecisionTreeClassifier(random_state=42))
    ])),
    ("Random Forest", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', RandomForestClassifier(n_estimators=20, random_state=42))
    ]))
]


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Train and evaluate each classifier
for name, classifier in classifiers:
    print(f"Training {name} on textblob labels…")
    classifier.fit(X_train_blob, y_train_blob)
    predictions = classifier.predict(X_test_blob)


    print(f"\nFor {name}:")
    print("Precision:", precision_score(y_test_blob, predictions, average='macro'))
    print("Recall:   ", recall_score(y_test_blob, predictions, average='macro'))
    print("F1 Score: ", f1_score(y_test_blob, predictions, average='macro'))
    print("-" * 50)


In [None]:
results_textblob = []

for name, classifier in classifiers:
    preds = classifier.predict(X_test_blob)
    results_textblob.append((name, preds))

In [None]:
for name, preds in results_textblob:
    cm = confusion_matrix(y_test_blob, preds)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"Confusion Matrix for {name} (TextBlob)")
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.tight_layout()
    plt.show()

In this track, we trained the same models using sentiment labels generated by the TextBlob library. Here’s what we observed:

- **Logistic Regression** again performed best with an **F1 Score of 0.70**, showing a strong balance between precision (0.69) and recall (0.71). This reinforces its consistency across both manual and rule-based labeling.

- **Random Forest** was close behind with an F1 Score of **0.69** (precision 0.76, recall 0.66), a clear improvement over its result on manual labels. It appears to generalize better with the more regular sentiment labels from TextBlob.

- **SVM** matched it with an F1 Score of **0.69**, proving it remains a strong choice even when trained on weak labels.

- **Naive Bayes**, although it had high precision (**0.77**), once again suffered from **low recall (0.45)** and the lowest F1 Score (**0.48**), meaning it failed to detect many relevant cases, similar to its behavior in Track 1.

- **Decision Tree** also improved from its manual-label performance, but still lagged behind the other models in overall F1 Score (**0.65**), apart from Naive Bayes.

💡 **Overall**, most models trained on TextBlob labels showed higher F1 scores than with manual labels; Naive Bayes was the exception. This could be because TextBlob's labels are computed directly from the review wording, so they are more consistent with the text (less noisy) than star ratings, which do not always match what the reviewer wrote.


##  Machine Learning – Track 3: Training on VADER Sentiment Labels

In this final track, we train the same machine learning models using sentiment labels generated by the **VADER** sentiment analyzer.

VADER classifies text into:
- `positive` → 2
- `neutral`  → 1
- `negative` → 0

This track allows us to evaluate how machine learning models perform when trained on rule-based sentiment annotations from a linguistically tuned model like VADER.


In [None]:
# Map VADER output to numeric labels
sentiment_map = {'positive': 2, 'neutral': 1, 'negative': 0}
y_vader = df['vader_sentiment'].map(sentiment_map)
X_vader = df['clean_review']


In [None]:
from sklearn.model_selection import train_test_split

X_train_vader, X_test_vader, y_train_vader, y_test_vader = train_test_split(
    X_vader, y_vader, test_size=0.2, random_state=42, stratify=y_vader
)

In [None]:
classifiers = [
    ("Naive Bayes", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', MultinomialNB())
    ])),
    ("Logistic Regression", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', LogisticRegression(class_weight='balanced', max_iter=1000, solver='liblinear', random_state=42))
    ])),
    ("SVM", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', LinearSVC(class_weight='balanced', max_iter=1000))
    ])),
    ("Decision Tree", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', DecisionTreeClassifier(random_state=42))
    ])),
    ("Random Forest", Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000)),
        ('clf', RandomForestClassifier(n_estimators=20, random_state=42))
    ]))
]


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Train and evaluate each classifier
for name, classifier in classifiers:
  print(f"Training {name} on VADER labels…")
  classifier.fit(X_train_vader, y_train_vader)
  predictions = classifier.predict(X_test_vader)

  print(f"\n{name} Metrics:")
  print(" Precision:", precision_score(y_test_vader, predictions, average='macro'))
  print(" Recall:   ", recall_score(y_test_vader, predictions, average='macro'))
  print(" F1 Score: ", f1_score(y_test_vader, predictions, average='macro'))
  print("-" * 50)


In [None]:
results_vader = []

for name, classifier in classifiers:
    preds = classifier.predict(X_test_vader)
    results_vader.append((name, preds))

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

for name, preds in results_vader:
    cm = confusion_matrix(y_test_vader, preds)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f"Confusion Matrix for {name} (VADER)")
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.tight_layout()
    plt.show()


When training the models on sentiment labels generated by VADER, the linear models improved over both Track 1 and Track 2, while the other classifiers stayed close to their Track 2 results.

- **Logistic Regression** achieved the highest F1 Score (**0.73**), confirming its strength across all tracks. It also had the best recall (0.75) and balanced performance, making it the most reliable model overall.

- **SVM** came in a close second with an F1 Score of **0.73** (0.727 versus 0.731), showing stable performance and a strong balance between precision (0.71) and recall (0.75).

- **Random Forest** reached an F1 Score of **0.69** with the highest precision in this track (0.76), but its lower recall (0.65) kept it behind the linear models.

- **Decision Tree** reached an F1 Score of **0.65**, still behind the ensemble and linear models.

- **Naive Bayes**, despite having decent precision (**0.74**), had the lowest F1 Score (**0.49**) due to its poor recall (0.46), consistent with previous tracks.

💡 Overall, **Track 3 produced the strongest results**, likely because VADER's lexicon and scoring rules produce labels that follow the review wording consistently, which makes them easier for text-based models to learn.


In [None]:
import pandas as pd

# collecting the data for (Model, Precision, Recall, F1 Score, Track)
data = [
    # Manual
    ("Naive Bayes", 0.712, 0.507, 0.518, "Manual"),
    ("Logistic Regression", 0.659, 0.673, 0.666, "Manual"),
    ("SVM", 0.658, 0.671, 0.664, "Manual"),
    ("Decision Tree", 0.580, 0.573, 0.576, "Manual"),
    ("Random Forest", 0.699, 0.585, 0.595, "Manual"),

    # TextBlob
    ("Naive Bayes", 0.771, 0.453, 0.476, "TextBlob"),
    ("Logistic Regression", 0.687, 0.713, 0.698, "TextBlob"),
    ("SVM", 0.683, 0.710, 0.694, "TextBlob"),
    ("Decision Tree", 0.648, 0.644, 0.646, "TextBlob"),
    ("Random Forest", 0.755, 0.659, 0.694, "TextBlob"),

    # VADER
    ("Naive Bayes", 0.738, 0.455, 0.485, "VADER"),
    ("Logistic Regression", 0.715, 0.751, 0.731, "VADER"),
    ("SVM", 0.710, 0.749, 0.727, "VADER"),
    ("Decision Tree", 0.651, 0.642, 0.646, "VADER"),
    ("Random Forest", 0.761, 0.651, 0.693, "VADER")
]

# convert to DataFrame
df_results = pd.DataFrame(data, columns=["Model", "Precision", "Recall", "F1 Score", "Track"])


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.barplot(data=df_results, x="Model", y="F1 Score", hue="Track")

plt.title("Model Comparison Across Tracks (F1 Score)")
plt.ylabel("F1 Score")
plt.ylim(0, 1)
plt.legend(title="Label Source")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.barplot(data=df_results, x="Model", y="Precision", hue="Track")
plt.title("Model Comparison Across Tracks (Precision)")
plt.ylim(0, 1)
plt.ylabel("Precision")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=df_results, x="Model", y="Recall", hue="Track")
plt.title("Model Comparison Across Tracks (Recall)")
plt.ylim(0, 1)
plt.ylabel("Recall")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


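# ***Model Evaluation Summary***

**Definitions:**

*   **Precision:** Of all reviews predicted to belong to a class, the fraction that truly belong to it: TP / (TP + FP).
*   **Recall:** Of all reviews that truly belong to a class, the fraction the model identified: TP / (TP + FN).
*   **F1 Score:** The harmonic mean of precision and recall: 2 × Precision × Recall / (Precision + Recall). The scores reported in this notebook are macro averages, i.e. the unweighted mean of the per-class values.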
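
As a worked example of how these three metrics are computed, the sketch below derives them per class from a confusion matrix and then takes the macro average. The counts are made up for illustration and are not results from this notebook; `per_class_metrics` and `macro_average` are hypothetical helper names.

```python
def per_class_metrics(cm):
    """Compute (precision, recall, f1) per class from a square confusion matrix.

    cm[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum = TP + FP
        actual_k = sum(cm[k])                          # row sum = TP + FN
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics

def macro_average(metrics):
    """Unweighted mean of each metric across classes (like average='macro')."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))

# Made-up counts for illustration only: rows = true class, columns = predicted class.
cm = [
    [8, 1, 1],  # negative
    [2, 5, 3],  # neutral
    [0, 2, 8],  # positive
]
per_class = per_class_metrics(cm)
for name, (p, r, f) in zip(["negative", "neutral", "positive"], per_class):
    print(f"{name:8s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print("macro (P, R, F1):", tuple(round(v, 3) for v in macro_average(per_class)))
```

Because the macro average weights every class equally, a model that does well on the large positive class but poorly on the neutral class is penalized more than it would be by plain accuracy.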


## Model Optimization Using GridSearchCV

To further improve the model performance, we apply `GridSearchCV` to perform hyperparameter tuning on the best-performing model: **Logistic Regression trained on VADER sentiment labels**.

This step allows us to explore different parameter combinations and select the one that achieves the best average F1-score during cross-validation.

The optimization is performed using the training data from the VADER-based sentiment classification track.
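
To get a feel for the cost of the search, the number of candidate settings and model fits can be counted ahead of time. The sketch below copies the parameter grid from the search cell and enumerates it with `itertools.product`; it only illustrates the arithmetic and is not part of the tuning itself.

```python
from itertools import product

# Same grid as used in the search cell.
param_grid = {
    "tfidf__max_features": [1000, 3000, 5000],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
    "clf__solver": ["liblinear"],
}
cv_folds = 5

# Every combination of one value per parameter is a candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
print(candidates[0])
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after the cross-validated fits.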


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# Define pipeline for logistic regression with TF-IDF
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

# Define hyperparameter grid
param_grid = {
    'tfidf__max_features': [1000, 3000, 5000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
    'clf__class_weight': [None, 'balanced'],
    'clf__solver': ['liblinear']
}

# Setup GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro', n_jobs=-1, verbose=1)

# Fit the grid search using VADER-labeled training data
grid_search.fit(X_train_vader, y_train_vader)

# Show best parameters and performance
print("\n--- GridSearchCV Results ---")
print("Best parameters found:")
print(grid_search.best_params_)

print("\nBest score (average F1 score on the cross-validation folds):")
print(grid_search.best_score_)

# Save the best model for predictions
best_model = grid_search.best_estimator_


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Predict using the best model from GridSearchCV
y_pred = best_model.predict(X_test_vader)

# Classification report: Precision, Recall, F1 for each class
print("Classification Report:")
print(classification_report(y_test_vader, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test_vader, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=best_model.classes_)

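# Plot the confusion matrix on an explicitly sized axes
# (disp.plot() creates its own figure unless an axes is passed in)
fig, ax = plt.subplots(figsize=(6, 4))
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title("Confusion Matrix - Best Model (VADER Track)")
plt.show()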


## Model Optimization Summary – VADER Track (Logistic Regression)

After performing hyperparameter tuning using `GridSearchCV`, the performance of the Logistic Regression model improved significantly.


- The model performs very well in identifying positive reviews.
- Some confusion remains between neutral and positive labels.
- Most errors come from misclassifying neutral and negative reviews as positive.

