<h1 style="text-align: center;">Sentiment Analysis on Amazon Reviews 📊</h1>

## Objective

The rapid growth of e-commerce, which accelerated during and after the COVID-19 pandemic, has reshaped how consumers buy both essential and non-essential goods. One result is a sharp increase in online customer reviews, which give businesses insight into customer satisfaction, product performance, and areas for improvement. The volume of these reviews, however, makes manual analysis infeasible for organizations that want to understand and act on customer sentiment.
Sentiment analysis addresses this problem by using Natural Language Processing (NLP) and machine learning techniques to automatically identify and classify the opinions expressed in text. This project applies these techniques to e-commerce reviews, with the aim of extracting actionable insights at scale. Automated sentiment analysis can help businesses improve customer experience, personalize offerings, and make data-driven decisions that track changing customer preferences. The study therefore addresses the practical challenges of large-scale sentiment analysis and illustrates its value for business strategy in e-commerce.

## Data Description

The dataset used in this project is titled **Amazon Product Reviews** and was sourced from both Kaggle and the University of San Diego’s website. It is a publicly available dataset under the **CC0 1.0 Universal license**, which means it is free to use, share, and adapt without legal restrictions. The dataset can be accessed through [this Kaggle link](https://www.kaggle.com/datasets/arhamrumi/amazon-product-reviews/data).

### Dataset Structure

The dataset comprises the following fields:

1. **Id**: A unique identifier for each review entry.
2. **ProductId**: A unique identifier for the product being reviewed.
3. **UserId**: A unique identifier for the user who submitted the review.
4. **ProfileName**: The name of the user who submitted the review.
5. **HelpfulnessNumerator**: The number of users who found the review helpful.
6. **HelpfulnessDenominator**: The total number of users who rated the helpfulness of the review.
7. **Score**: The rating provided by the user, typically on a scale of 1 to 5.
8. **Time**: A timestamp representing when the review was submitted.
9. **Summary**: A short title or summary of the review.
10. **Text**: The full review text.

### Data Preprocessing and Ethical Considerations

For this project, the **UserId** and **ProfileName** columns are dropped from the dataset so that no personal identifiers are used in the analysis, in line with data privacy principles. Removing these fields does not affect the sentiment analysis, which relies on the review text and score.
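
As a minimal sketch of the anonymization step described above, the snippet below drops the two identifier columns from a small, made-up DataFrame. The sample rows and the `drop_identifiers` helper are illustrative assumptions and are not part of the project code.

```python
import pandas as pd

# Hypothetical sample rows that mimic the dataset's schema (values are made up).
sample = pd.DataFrame({
    "Id": [1, 2],
    "ProductId": ["B001", "B002"],
    "UserId": ["U1", "U2"],
    "ProfileName": ["alice", "bob"],
    "Score": [5, 2],
    "Text": ["Great taste", "Arrived stale"],
})

def drop_identifiers(df, columns=("UserId", "ProfileName")):
    """Return a copy of df without the personal-identifier columns."""
    # errors="ignore" keeps the helper safe if a column was already removed.
    return df.drop(columns=list(columns), errors="ignore")

anonymized = drop_identifiers(sample)
print(anonymized.columns.tolist())
```

Because `drop` returns a new DataFrame, the original `sample` is left unchanged, which makes it easy to confirm what was removed.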

## Key Research Questions to be Addressed

- **How accurately can various machine learning models classify sentiment in e-commerce reviews?**
- **How do different text preprocessing techniques impact the performance of sentiment classification models?**
- **How do various feature extraction methods affect the accuracy of sentiment classification?**
- **How do different machine learning models compare in terms of performance when classifying sentiment in e-commerce reviews?**

## Methodology

### Imports

Run the following command in your terminal or command prompt to install all necessary libraries:

```bash
pip install pandas seaborn matplotlib numpy scikit-learn nltk textblob wordcloud beautifulsoup4
```

In [1]:
# All imports used in this notebook

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk

# BeautifulSoup
from bs4 import BeautifulSoup

# Data cleaning tools
import re
import string

# Removing special characters
import unicodedata

# Removing stopwords
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Calculating Polarity and Subjectivity
from textblob import TextBlob

# N-grams
from nltk.util import ngrams

# Word clouds
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer

# NLTK and standard-library helpers
import nltk
import collections

### Step 1: Load & Inspect Data

In [2]:
# The balanced dataset contains an equal number of reviews for each star rating (1 to 5),
# so that models are not biased toward the most frequent rating.
# 25,000 records are included for each star rating.
balanced_data = pd.read_csv('Datasets/balanced_reviews.csv')
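
How `balanced_reviews.csv` was produced is not shown here. One possible way to build such a file, sketched under the assumption that the raw reviews are in a DataFrame with a `Score` column, is to draw the same number of rows per rating with `groupby(...).sample(...)`. The toy data, the variable names, and the sample size of 2 are all illustrative.

```python
import pandas as pd

# Hypothetical, imbalanced stand-in for the raw reviews (values are made up).
raw = pd.DataFrame({
    "Score": [5] * 6 + [4] * 4 + [3] * 3 + [2] * 2 + [1] * 2,
    "Text": [f"review {i}" for i in range(17)],
})

# Draw the same number of rows per star rating. The project uses 25,000 each;
# this toy example uses 2 so that every rating has enough rows to sample from.
n_per_score = 2
balanced = (
    raw.groupby("Score")
       .sample(n=n_per_score, random_state=42)
       .reset_index(drop=True)
)

print(balanced["Score"].value_counts().sort_index())
```

Fixing `random_state` keeps the sample reproducible between runs.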

In [None]:
balanced_data.shape

In [None]:
balanced_data.head()

In [None]:
balanced_data.tail()

In [None]:
balanced_data.info()

In [None]:
balanced_data.describe()

In [None]:
balanced_data.columns

In [None]:
balanced_data['Text']

### Step 2: Exploratory Data Analysis (EDA)

### Data Cleaning & Preprocessing

#### a. Drop Unnecessary Columns

In [None]:
# Select only the relevant columns; .copy() avoids a SettingWithCopyWarning when columns are added later
balanced_data = balanced_data[['ProductId', 'Score', 'Text']].copy()
# Display the updated DataFrame
balanced_data.head()

#### b. Remove HTML Tags

In [12]:
# Function for removing html tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

In [None]:
# Apply the function and update DataFrame
balanced_data['review'] = balanced_data['Text'].apply(strip_html)
balanced_data = balanced_data.drop('Text', axis=1)

balanced_data.head()

In [13]:
# # Step 2: Remove HTML tags
# def step_2_remove_html(Text):
#     return re.sub(r'<.*?>', '', Text)

# # Apply step 2 to the 'Lowercase_Text' column
# data['No_HTML_Text'] = data['Lowercase_Text'].apply(step_2_remove_html)

# # Display the result
# print("\nStep 2: No HTML Text Sample:")
# data[['Text', 'No_HTML_Text']]

#### c. Convert to Lowercase

In [None]:
def lowercaseFunction(Text):
    return Text.lower()

# Lowercase the 'review' column
balanced_data['LowercaseReview'] = balanced_data['review'].apply(lowercaseFunction)

# Display the result
print("Lowercased review sample:")
balanced_data[['Score', 'LowercaseReview']]

#### d. Remove Special Characters, Numbers, and Punctuation

In [15]:
# Remove special characters, numbers, and punctuation (expects lowercased input)
def remove_special_characters(Text):
    return re.sub(r'[^a-z\s]', '', Text)

In [None]:
# Apply the cleaning function, then drop the intermediate columns
balanced_data['Review'] = balanced_data['LowercaseReview'].apply(remove_special_characters)
balanced_data = balanced_data.drop(['LowercaseReview', 'review'], axis=1)

# Display the result
print("\nCleaned Text (No Special Characters) Sample:")
balanced_data.head()
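
One side effect of deleting characters outright is that words joined by punctuation are fused together. The sketch below compares the pattern used above with a hypothetical variant, `replace_special_characters`, that substitutes a space instead and then collapses repeated whitespace. The variant and the sample sentence are assumptions for illustration, not part of the project pipeline.

```python
import re

def remove_special_characters(text):
    # Same pattern as above: delete everything except lowercase letters and whitespace.
    return re.sub(r'[^a-z\s]', '', text)

def replace_special_characters(text):
    # Hypothetical variant: put a space where a character is removed,
    # then collapse runs of whitespace into a single space.
    spaced = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', spaced).strip()

sample = "well-made,but 2 stars... didn't last"
print(repr(remove_special_characters(sample)))
print(repr(replace_special_characters(sample)))
```

Which behavior is preferable depends on the downstream tokenizer; the second form keeps word boundaries but splits contractions such as "didn't" into two tokens.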

#### e. Create Sentiment Column (Positive, Neutral, Negative)

In [None]:
# Define a function to classify scores
def classify_score(score):
    if score in [4, 5]:
        return 'Positive'
    elif score == 3:
        return 'Neutral'
    elif score in [1, 2]:
        return 'Negative'
    return None  # unexpected score outside the 1-5 range

# Apply the function to create a new column
balanced_data['Sentiment'] = balanced_data['Score'].apply(classify_score)

# Display the updated DataFrame
balanced_data.head()
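
The same mapping can be expressed as a lookup table and applied with `Series.map`, which avoids a Python-level function call per row and turns unexpected ratings into `NaN`. This is a sketch on made-up scores, and `SCORE_TO_SENTIMENT` is a hypothetical name. Note also that if the file holds 25,000 reviews per star rating, the three-class label is no longer balanced: roughly 50,000 Positive, 25,000 Neutral, and 50,000 Negative.

```python
import pandas as pd

# Lookup table equivalent to classify_score above.
SCORE_TO_SENTIMENT = {
    1: "Negative", 2: "Negative",
    3: "Neutral",
    4: "Positive", 5: "Positive",
}

scores = pd.Series([5, 3, 1, 4, 2, 0])  # 0 is a deliberately invalid rating
sentiment = scores.map(SCORE_TO_SENTIMENT)

# Ratings outside 1-5 map to NaN, which makes unexpected values easy to spot.
print(sentiment.tolist())
print("Unmapped rows:", int(sentiment.isna().sum()))
```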

#### f. Tokenization

In [18]:
# Function to tokenize text
def tokenize_text(text):
    return word_tokenize(text)

In [None]:
from nltk.tokenize import word_tokenize

# Ensure that necessary nltk resources are downloaded
nltk.download('punkt')  # for word_tokenize
nltk.download('punkt_tab')

balanced_data['Tokenized_Text'] = balanced_data['Review'].apply(tokenize_text)


In [None]:
balanced_data.head()

#### g. Stop Word Removal

In [None]:
# Ensure necessary NLTK resources are downloaded
nltk.download('stopwords')  # for stopwords

# Get the list of stopwords in English
stop_words = set(stopwords.words('english'))

# Define the function to remove stopwords from tokenized text
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

# Apply stopword removal to the tokenized text
balanced_data['Cleaned_Tokens'] = balanced_data['Tokenized_Text'].apply(remove_stopwords)


In [None]:
balanced_data.head()
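
NLTK's English stop list includes negation words such as "not" and "no", so removing every stop word can erase the difference between "good" and "not good". The sketch below illustrates the effect with a small hand-written stop list (an assumption made to keep the example self-contained) and a hypothetical refinement that keeps negations.

```python
# Small hand-written stop list for illustration only; the project uses NLTK's English list.
illustrative_stop_words = {"the", "is", "was", "a", "this", "not", "no", "nor"}

# Hypothetical refinement: keep negation words so that polarity survives filtering.
negations = {"not", "no", "nor"}
sentiment_stop_words = illustrative_stop_words - negations

def remove_stopwords(tokens, stop_words):
    # Keep only the tokens that are not in the given stop list.
    return [word for word in tokens if word.lower() not in stop_words]

tokens = ["this", "coffee", "was", "not", "good"]
print(remove_stopwords(tokens, illustrative_stop_words))
print(remove_stopwords(tokens, sentiment_stop_words))
```

Whether keeping negations helps is an empirical question that fits the research question on preprocessing choices.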

#### h. Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer

# Ensure necessary NLTK resources are downloaded
nltk.download('wordnet')  # for WordNetLemmatizer
nltk.download('omw-1.4')  # for WordNet lexical resources

In [None]:
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Define the function to lemmatize tokens
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply lemmatization to the cleaned tokens (without stopwords)
balanced_data['Lemmatized_Tokens'] = balanced_data['Cleaned_Tokens'].apply(lemmatize_tokens)

# Display the result
balanced_data[['Review', 'Cleaned_Tokens', 'Lemmatized_Tokens']].head()

### TF-IDF (To FIX)

In [None]:
balanced_data.head()

In [None]:
tfidf_data = balanced_data.copy()
tfidf_data = tfidf_data.drop(['Score', 'Review','Tokenized_Text', 'Cleaned_Tokens'], axis=1)
tfidf_data.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Join the tokens in 'Lemmatized_Tokens' back into a single string per review
tfidf_data['Processed_Text'] = tfidf_data['Lemmatized_Tokens'].apply(' '.join)

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the processed text; the result is a SciPy sparse matrix
tfidf_matrix = vectorizer.fit_transform(tfidf_data['Processed_Text'])
print(tfidf_matrix.shape)

# Densify only the first few rows for inspection: calling .toarray() on the full
# matrix (one row per review, one column per vocabulary term) can exhaust memory
tfidf_df = pd.DataFrame(tfidf_matrix[:5].toarray(), columns=vectorizer.get_feature_names_out())

# Check the first few rows of the TF-IDF matrix
tfidf_df.head()
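
To make the weighting concrete, the sketch below computes TF-IDF by hand on three made-up documents. It assumes the scheme that `TfidfVectorizer` uses with its default settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy corpus and the `tfidf_vector` helper are illustrative only.

```python
import math
from collections import Counter

docs = [
    "great coffee great price",
    "bad coffee",
    "great service",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for doc in tokenized for term in doc})
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

for doc, tokens in zip(docs, tokenized):
    row = tfidf_vector(tokens)
    print(doc, {t: round(w, 3) for t, w in zip(vocabulary, row) if w > 0})
```

Terms that occur in fewer documents ("bad", "price", "service") receive a higher IDF than terms shared across documents ("coffee", "great"), while repeating a term within a document raises its weight through the term count.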