<h1 style="text-align: center;">Sentiment Analysis on Amazon Reviews 📊</h1>

## Objective

The rapid growth of e-commerce, accelerated significantly during and after the COVID-19 pandemic, has reshaped consumer purchasing behaviors for both essential and non-essential goods. This shift has resulted in an overwhelming increase in online customer reviews, offering businesses a wealth of insights into customer satisfaction, product performance, and potential areas for improvement. However, the sheer volume of these reviews makes manual analysis infeasible for organizations striving to understand and act on customer sentiments effectively.
Sentiment analysis has emerged as an essential solution, leveraging Natural Language Processing (NLP) and machine learning techniques to automatically identify and classify opinions expressed in text. This research explores the application of these techniques to analyze e-commerce reviews, aiming to uncover actionable insights at scale. By automating sentiment analysis, businesses can enhance customer experiences, personalize offerings, and make informed, data-driven decisions that align with evolving customer preferences. This study not only addresses the challenges of large-scale sentiment analysis but also highlights its transformative potential for improving business strategies in the dynamic e-commerce landscape.

## Data Description

The dataset used in this project is titled **Amazon Product Reviews** and was sourced from both Kaggle and the University of San Diego’s website. It is a publicly available dataset under the **CC0 1.0 Universal license**, which means it is free to use, share, and adapt without legal restrictions. The dataset can be accessed through [this Kaggle link](https://www.kaggle.com/datasets/arhamrumi/amazon-product-reviews/data).

### Dataset Structure

The dataset comprises the following fields:

1. **Id**: A unique identifier for each review entry.
2. **ProductId**: A unique identifier for the product being reviewed.
3. **UserId**: A unique identifier for the user who submitted the review.
4. **ProfileName**: The name of the user who submitted the review.
5. **HelpfulnessNumerator**: The number of users who found the review helpful.
6. **HelpfulnessDenominator**: The total number of users who rated the helpfulness of the review.
7. **Score**: The rating provided by the user, typically on a scale of 1 to 5.
8. **Time**: A timestamp representing when the review was submitted.
9. **Summary**: A short title or summary of the review.
10. **Text**: The full review text.

### Data Preprocessing and Ethical Considerations

For this project, the **UserId** and **ProfileName** columns are dropped from the dataset so that no personal identifiers are used, in line with data privacy principles. The remaining fields retain all the information needed for sentiment analysis.
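
A minimal sketch of this step, assuming the column names listed under Dataset Structure. The two-row frame is made-up illustration data, not taken from the dataset.

```python
import pandas as pd

# Made-up rows that only mimic the dataset's column layout
sample = pd.DataFrame({
    "Id": [1, 2],
    "ProductId": ["B000000001", "B000000002"],
    "UserId": ["A1EXAMPLE", "A2EXAMPLE"],
    "ProfileName": ["user one", "user two"],
    "Score": [5, 1],
    "Text": ["Great product", "Arrived broken"],
})

# Drop the personal identifiers; errors="ignore" tolerates a file
# in which these columns are already absent
anonymized = sample.drop(columns=["UserId", "ProfileName"], errors="ignore")
print(anonymized.columns.tolist())
```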

## Key Research Questions to be Addressed

- **How accurately can various machine learning models classify sentiment in e-commerce reviews?**
- **How do different text preprocessing techniques impact the performance of sentiment classification models?**
- **How do various feature extraction methods affect the accuracy of sentiment classification?**
- **How do different machine learning models compare in terms of performance when classifying sentiment in e-commerce reviews?**

## Methodology

### Imports & Downloads

To run this notebook you will need the following installed:
- `pip install pandas`
- `pip install numpy`
- `pip install seaborn`
- `pip install matplotlib`
- `pip install scikit-learn`
- `pip install nltk`
- `pip install textblob`
- `pip install wordcloud`
- `pip install beautifulsoup4`
- `pip install emoji`
- `pip install contractions`

In [2]:
# Data Manipulation and Analysis
import pandas as pd
import numpy as np

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning and Text Vectorization
import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Text Preprocessing
import re
import string
import unicodedata
from bs4 import BeautifulSoup  # For parsing HTML
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer

# Sentiment Analysis
from textblob import TextBlob

# Word Clouds
from wordcloud import WordCloud

# Miscellaneous
import collections
import emoji
import contractions

In [4]:
# Downloading NLTK resources
# nltk.download('all')

### Step 1: Load Data

In [3]:
# The balanced dataset contains an equal number of reviews for each star rating (1 to 5),
# so that no rating class dominates training and biases the models
# 25,000 records of each star rating are included
df_balancedData = pd.read_csv('Datasets/balanced_reviews.csv')
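
How `balanced_reviews.csv` was produced is not shown here. One possible way to draw an equal number of rows per rating is sketched below; the helper name `balance_by_score`, the seed, and the toy frame are all hypothetical.

```python
import pandas as pd

def balance_by_score(df, per_class, seed=42):
    """Return `per_class` randomly chosen rows for every distinct Score."""
    return (
        df.groupby("Score")
          .sample(n=per_class, random_state=seed)
          .reset_index(drop=True)
    )

# Made-up, skewed toy frame: many 5-star rows, few of the others
toy = pd.DataFrame({
    "Score": [5] * 6 + [4] * 3 + [3] * 2 + [2] * 2 + [1] * 2,
    "Text": [f"review {i}" for i in range(15)],
})

balanced = balance_by_score(toy, per_class=2)
print(balanced["Score"].value_counts().sort_index())
```

With the real data, `per_class` would be 25000 to match the comment above, provided every rating has at least that many reviews.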

### Step 2: Exploratory Data Analysis (EDA)

In [None]:
df_balancedData.shape

In [None]:
df_balancedData.head()

In [None]:
df_balancedData.tail()

In [None]:
df_balancedData.info()

In [None]:
df_balancedData.describe()

In [None]:
df_balancedData.columns

In [None]:
df_balancedData['Text']

In [None]:
# Create a pie chart
plt.figure(figsize=(10, 8))
df_balancedData['Score'].value_counts().plot(
    kind='pie', 
    autopct='%.0f%%', 
    startangle=90,  # Rotate the pie chart for better orientation
    colors=plt.cm.Paired.colors  # Add distinct colors for each segment
)

# Add title and legend
plt.title('Data Distribution of Scores', fontsize=16)
plt.ylabel('')  # Remove the default y-axis label for better aesthetics
plt.legend(title='Scores', loc='upper right')

# Show the plot
plt.show()

In [None]:
# Define stopwords
stop_words = set(stopwords.words('english'))

# Combine all text from the 'Text' column into a single string
# (missing values are skipped so that join does not fail on NaN)
all_reviews = " ".join(df_balancedData["Text"].dropna().astype(str))

# Generate the Word Cloud
wordcloud = WordCloud(
    stopwords=stop_words, 
    width=800, 
    height=800, 
    background_color='white', 
    min_font_size=10
).generate(all_reviews)

# Plot the Word Cloud
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Select numeric columns only
numeric_data = df_balancedData.select_dtypes(include=['int64', 'float64'])

# Compute the correlation matrix
correlation_matrix = numeric_data.corr()

plt.figure(figsize=(16, 6))

# Generate the heatmap
heatmap = sns.heatmap(
    correlation_matrix, 
    vmin=-1, vmax=1, annot=True, fmt=".2f", cmap='coolwarm', cbar_kws={'label': 'Correlation Coefficient'}
)

# Customize the title
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 16, 'fontweight': 'bold'}, pad=20)

# Rotate x and y axis labels for clarity
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

# Tighten the layout for better spacing
plt.tight_layout()

# Display the heatmap
plt.show()

### Step 3: Data Cleaning & Preprocessing

#### Drop Unnecessary Columns

In [None]:
# Select only relevant columns
df_balancedData = df_balancedData[['ProductId', 'Score', 'Text']]
# Display the updated DataFrame
df_balancedData.head()

#### Convert Emojis to Text

In [None]:

class CleanEmojis(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        # demojize replaces each emoji with its text name, e.g. ":thumbs_up:";
        # the colons and underscores are stripped later by StringProcessing
        X['Text'] = X['Text'].apply(emoji.demojize)
        return X

# Apply the transformer and keep the result for the following steps
cleaner = CleanEmojis()
df_balancedData = cleaner.transform(df_balancedData)

# Display the updated DataFrame
df_balancedData.head()

#### Method for Removing Special Characters, URLs, and HTML, and Converting to Lowercase

In [None]:
# Define the StringProcessing class
class StringProcessing(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        filtered_reviews = []
        for review in X:
            # Converting to lowercase
            review = review.lower()
            # Substituting ampersands (including the HTML entity "&amp;") with 'and'
            review = re.sub(r"&amp;?", " and ", review)
            review = re.sub(r"&", " and ", review)
            # Removing @ mentions and hashtags
            review = re.sub(r"[@#][^\s]+", " ", review)
            # Removing HTML tags (before URLs, so that a URL inside a tag
            # attribute does not swallow the closing bracket)
            review = re.sub(r"<[^<]+?>", " ", review)
            # Removing URLs
            review = re.sub(r"(http\S+)|(www\S+)", "", review)
            # Cleaning up demojized emoji names such as ":thumbs_up:"
            review = re.sub(r":", " ", review)
            review = review.replace("_", " ")
            # Removing special characters
            review = re.sub(r"[^a-z0-9'’ ]", " ", review)
            # Collapsing extra whitespace
            review = re.sub(r"\s+", " ", review).strip()
            filtered_reviews.append(review)
        return filtered_reviews

# Create an instance of the StringProcessing transformer
string_processor = StringProcessing()

# Apply the transformation to the 'Text' column and overwrite it with the cleaned text
df_balancedData.loc[:, 'Text'] = string_processor.transform(df_balancedData['Text'])

# Display the updated DataFrame
df_balancedData.head()
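
To make the effect of these substitutions concrete, the sketch below applies the same kinds of patterns to a single made-up string. The function name `clean_review` and the sample text are illustrative assumptions. Here HTML tags are stripped before URLs so that a URL inside a tag attribute cannot swallow the closing bracket.

```python
import re

def clean_review(text):
    """Apply the cleaning substitutions to one review string."""
    text = text.lower()
    text = re.sub(r"&(amp;)?", " and ", text)    # ampersands and "&amp;"
    text = re.sub(r"[@#]\S+", " ", text)         # @mentions and #hashtags
    text = re.sub(r"<[^<]+?>", " ", text)        # HTML tags
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"[:_]", " ", text)            # demojized names like :thumbs_up:
    text = re.sub(r"[^a-z0-9'’ ]", " ", text)    # remaining special characters
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

raw = "Loved it &amp; my kids did too!<br />See http://example.com :thumbs_up:"
print(clean_review(raw))
# loved it and my kids did too see thumbs up
```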

#### Keep the Preprocessed Text in the 'Text' Column

In [None]:
# The cleaned text has overwritten the original 'Text' column, so only the relevant columns are kept
df_balancedData = df_balancedData[['ProductId', 'Score', 'Text']]
df_balancedData.head()

#### Method to substitute contractions

In [None]:
def collect_short_contractions(short_contractions):
    short_contractions_dict = {}
    # Sort the contractions alphabetically
    for key, value in sorted(short_contractions.items()):
        # Remove apostrophes from the contraction keys
        new_key = key.replace("'", "")
        short_contractions_dict[new_key] = value
    return short_contractions_dict

# (Maarten et al., 2013)
short_contractions = {
    "i'd": "i would",
    "we'd": "we would",
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "doesn’t": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "haven’t": "have not",
    "he'd": "he would",
    "he'll": "he will",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'll": "I will",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'd": "it would",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "oughtn't": "ought not",
    "shan't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "that'd": "that would",
    "that's": "that is",
    "there'd": "there had",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'll": "we will",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who's": "who is",
    "who've": "who have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "y'all": "you all",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have"
}

short_contractions_dict = collect_short_contractions(short_contractions)
short_contractions_dict

In [19]:
class ExpandContractions(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        expanded_reviews = [contractions.fix(review) for review in X]
        return expanded_reviews

#### Method to handle Slang words

In [None]:
def collect_slangs(slangs):
    slang_dict = {}
    for line in slangs.strip().split('\n'):
        key, value = line.lower().strip().split(': ', 1)  # Added limit to split on first ':'
        slang_dict[key] = value

    # Sort the dictionary by keys alphabetically
    return dict(sorted(slang_dict.items()))

# (SlickText, 2023)
slangs = '''
  ROFL: Rolling on the floor laughing
  STFU: Shut up
  ICYMI: In case you missed it
  TL;DR: Too long, didn’t read
  TMI: Too much information
  AFAIK: As far as I know
  LMK: Let me know
  NVM: Nevermind
  FTW: For the win
  BYOB: Bring your own beer
  BOGO: Buy one get one
  JK: Just kidding
  TBH: To be honest
  TBF: To be frank
  RN: Right now
  BRB: Be right back
  BTW: By the way
  GG: Good game
  IRL: In real life
  LOL: Laugh out loud
  SMH: Shaking my head
  NGL: Not gonna lie
  IKR: I know right
  TTYL: Talk to you later
  IMO: In my opinion
  WYD: What are you doing?
  IDK: I don’t know
  IDC: I don’t care
  IDGAF: I don’t care
  TBA: To be announced
  TBD: To be decided
  AFK: Away from keyboard
  IYKYK: If you know you know
  B4: Before
  FOMO: Fear of missing out
  GTG: Got to go
  G2G: Got to go
  H8: Hate
  LMAO: Laughing my ass off
  IYKWIM: If you know what I mean
  MYOB: Mind your own business
  POV: Point of view
  HBD: Happy birthday
  WYSIWYG: What you see is what you get
  FWIW: For what it’s worth
  TW: Trigger warning
  EOD: End of day
  FAQ: Frequently asked question
  AKA: Also known as
  ASAP: As soon as possible
  DIY: Do it yourself
  NP: No problem
  U: you
  R: are
  PLS: please
  WYD: what are you doing
  N/A: Not applicable
  K: Okay
  WUT: what
  FYI: For your information
  NSFW: Not safe for work
  WFH: Work from home
  OMW: On my way
  DM: Direct message
  FB: Facebook
  IG: Instagram
  YT: YouTube
  QOTD: Quote of the day
  OOTD: Outfit of the day
  AMA: Ask me anything
  HMU: Hit me up
  ILY: I love you
  BF: Boyfriend
  GF: Girlfriend
  BAE: Before anyone else
  LYSM: Love you so much
  PDA: Public display of affection
  XOXO: Hugs and kisses
  LOML: Love of my life
  THX: thanks
  V: very
  OMG: Oh My God
'''

slang_dict = collect_slangs(slangs)
slang_dict

In [21]:
import re
from sklearn.base import BaseEstimator, TransformerMixin

class ReplaceSlangs(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        filtered_reviews = []
        for review in X:
            # Replace slang
            for key, value in slang_dict.items():
                review = re.sub(r'\b{}\b'.format(re.escape(key)), value, review, flags=re.IGNORECASE)
            
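            # Replace short contractions. Caution: the keys have no apostrophes, so
            # "were" (from "we're") or "well" (from "we'll") also match ordinary words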
            for key, value in short_contractions_dict.items():
                review = re.sub(r'\b{}\b'.format(re.escape(key)), value, review, flags=re.IGNORECASE)
            
            filtered_reviews.append(review)
        return filtered_reviews
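
The loop above builds and applies one pattern per dictionary entry for every review. One possible alternative, sketched here with a tiny hypothetical subset of the slang table, is a single precompiled alternation with a lookup callback. Unlike the sequential loop, this replaces in one pass, so replacement text is never scanned again.

```python
import re

# Tiny hypothetical subset of the slang table, keys already lowercased
slang_subset = {
    "btw": "by the way",
    "idk": "i don't know",
    "thx": "thanks",
}

# One alternation of all keys, longest first so that a longer key is
# preferred over a shorter key that shares its prefix
_keys = sorted(slang_subset, key=len, reverse=True)
slang_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in _keys) + r")\b",
    flags=re.IGNORECASE,
)

def replace_slang(text):
    """Replace every whole-word slang match in a single pass."""
    return slang_pattern.sub(lambda m: slang_subset[m.group(0).lower()], text)

print(replace_slang("BTW idk if it fits, thx"))
# by the way i don't know if it fits, thanks
```

The `\b` anchors keep the match to whole words, so a string such as "thxgiving" is left unchanged.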

#### Method for Tokenization

In [22]:
class Tokenize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        tokenized_reviews = []
        for review in X:
            # Split each review string into a list of word tokens
            tokenized_reviews.append(word_tokenize(review))
        return tokenized_reviews

#### Method to remove stopwords
**Some words in NLTK's built-in stop word list change the sentiment of a text when they are removed. We therefore modified the original list by excluding words such as "no", "against", and "above" from it, so that these words are kept in the reviews.**

In [23]:
class RemoveStopwords(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Stop words that carry sentiment (negations, direction words) and must be kept
        sentiment_stopwords = ['against', 'above', 'below', 'up', 'down', 'over', 'under', 'no', 'nor', 'not', "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
        
        # NLTK's standard list of stopwords
        stop_words = set(stopwords.words('english'))
        
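        # Take the sentiment-bearing words out of the stop word set so they are kept;
        # discard (unlike remove) does not raise KeyError if a word is missing
        # from the installed NLTK list
        for word in sentiment_stopwords:
            stop_words.discard(word)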
        
        # List to hold processed texts (after stopwords removal)
        processed_reviews = []
        
        # Process each review
        for review in X:
            # Each review is expected to be a list of tokens already
            tokens_without_stopwords = [word for word in review if word not in stop_words]
            
            # Append the processed tokens for this review
            processed_reviews.append(tokens_without_stopwords)
        
        return processed_reviews
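
The effect of keeping negations can be seen on a single tokenized sentence. The sketch below uses a small hand-picked set as a stand-in for NLTK's list, so the exact words are assumptions.

```python
# Small hand-picked stand-in for NLTK's English stop word list
base_stop_words = {"the", "is", "was", "it", "this", "a", "not", "no", "down"}

# Words that carry sentiment and should survive stop word removal
keep_words = {"not", "no", "down"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the given stop word set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["this", "is", "not", "a", "good", "product"]

with_default = remove_stop_words(tokens, base_stop_words)
with_custom = remove_stop_words(tokens, base_stop_words - keep_words)

print(with_default)  # ['good', 'product']
print(with_custom)   # ['not', 'good', 'product']
```

With the unmodified list, the negative review collapses to the same tokens as a positive one, which is the problem the custom list avoids.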

#### Method to apply lemmatization

In [24]:
class Lemmatize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Initialize the lemmatizer
        lemmatizer = WordNetLemmatizer()

        # List to store the lemmatized tokens for each review
        lemmatized_reviews = []

        # Define a function to convert the POS tags to WordNet tags
        def get_wordnet_pos(pos_tag):
            if pos_tag.startswith('J'):
                return wordnet.ADJ
            elif pos_tag.startswith('V'):
                return wordnet.VERB
            elif pos_tag.startswith('N'):
                return wordnet.NOUN
            elif pos_tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # Set default to noun
        
        # Process each review in the input data
        for review in X:
            # Perform POS tagging on the review
            pos_tags = nltk.pos_tag(review)
            
            # Lemmatize each token based on its POS tag
            lemmatized_tokens = [
                lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tags
            ]
            
            # Append the lemmatized tokens for the current review to the list
            lemmatized_reviews.append(lemmatized_tokens)
        
        # Return the lemmatized reviews (list of tokenized and lemmatized words)
        return lemmatized_reviews
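
The tag conversion inside `Lemmatize` can be looked at in isolation. The sketch below writes WordNet's part-of-speech codes as plain strings so that it runs without NLTK data, and the `(token, tag)` pairs are hand-written in the format `nltk.pos_tag` returns rather than produced by a tagger.

```python
# WordNet's part-of-speech codes as plain strings; they correspond to
# wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return WORDNET_POS.get(treebank_tag[:1], "n")

# Hand-written (token, tag) pairs, not the output of a real tagger
tagged = [
    ("batteries", "NNS"),
    ("died", "VBD"),
    ("quickly", "RB"),
    ("cheap", "JJ"),
    ("the", "DT"),
]

mapped = [(token, to_wordnet_pos(tag)) for token, tag in tagged]
print(mapped)
```

The mapping matters because the lemmatizer treats every word as a noun unless told otherwise, so a verb form such as "died" would typically only be reduced to "die" when it is passed with the verb code.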

#### Basic cleaning after preprocessing

In [25]:
class FinalClean(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Precompile regex patterns for better performance
        self.remove_digits = re.compile(r'\d+')
        self.remove_single_letters = re.compile(r'\b\w{1}\b')
        self.remove_apostrophes = re.compile(r"'")
        self.remove_href = "href"

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # List to store cleaned reviews
        cleaned_reviews = []

        # Loop through each review in the input dataset
        for review in X:
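            # Clean each word, skipping leftover "href" tokens
            cleaned_words = [
                self.clean_word(word)
                for word in review
                if word and word != self.remove_href
            ]
            # Drop words that became empty after cleaning (e.g. pure digits)
            cleaned_words = [word for word in cleaned_words if word]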
            
            # Append the cleaned review (list of words) to the final list
            cleaned_reviews.append(cleaned_words)
        
        return cleaned_reviews

    def clean_word(self, word):
        # Removing digits
        word = self.remove_digits.sub("", word)
        
        # Removing single letters
        word = self.remove_single_letters.sub("", word)
        
        # Removing apostrophes
        word = self.remove_apostrophes.sub("", word)

        return word
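
A short trace of this cleaning on a made-up token list is sketched below, using module-level patterns and hypothetical function names instead of the class. It filters after cleaning, so tokens that become empty are dropped rather than left behind as empty strings.

```python
import re

DIGITS = re.compile(r"\d+")
SINGLE_LETTER = re.compile(r"\b\w\b")
APOSTROPHE = re.compile(r"'")

def clean_token(token):
    """Strip digits, isolated single letters and apostrophes from one token."""
    token = DIGITS.sub("", token)
    token = SINGLE_LETTER.sub("", token)
    return APOSTROPHE.sub("", token)

def clean_tokens(tokens):
    cleaned = (clean_token(tok) for tok in tokens if tok != "href")
    # Filter after cleaning so tokens that became empty are dropped
    return [tok for tok in cleaned if tok]

print(clean_tokens(["bought", "2", "packs", "x", "href", "kid's", "4k"]))
# ['bought', 'packs', 'kid']
```

Note that "kid's" becomes "kid" because the trailing "s" counts as an isolated single letter once the apostrophe separates it from the rest of the word.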

#### Preprocessing Pipeline

In [None]:
df_balancedData.head()

In [27]:
def preprocess_reviews(reviews_frame):
    # Define the list of preprocessing steps as tuples
    # Each tuple contains a step name (for identification) and an instance of the corresponding transformer
    preprocessing_steps = [
        ('expand_contractions', ExpandContractions()),  # Step to expand contractions like "don't" -> "do not"
        ('replace_slangs', ReplaceSlangs()),         # Step to replace slang words with their full form
        ('tokenize', Tokenize()),                    # Step to tokenize the text (split into words)
        ('remove_stopwords', RemoveStopwords()),     # Step to remove common stopwords
        ('lemmatize', Lemmatize()),                  # Step to lemmatize words (reduce to root form)
        ('final_clean', FinalClean())                # Final cleaning step (removing unwanted characters, etc.)
    ]

    # Create the pipeline object using the preprocessing steps defined above
    # The Pipeline applies each step sequentially to the data
    preprocessing_pipeline = Pipeline(preprocessing_steps)
    
    # Apply the pipeline to the 'Text' column of the DataFrame
    # This will preprocess the 'Text' data column by applying each step defined in the pipeline
    preprocessed_reviews = preprocessing_pipeline.fit_transform(reviews_frame['Text'])
    
    # Convert the list of tokenized words into strings (space-separated)
    # List comprehension is used to join each list of tokens into a single string for each review
    # Each review is a list of words, so ' '.join(review) joins them back into a single string
    preprocessed_strings = [' '.join(review) for review in preprocessed_reviews]

    # Add the preprocessed reviews as a new column to the original DataFrame
    # This new column will contain the cleaned and processed review text
    reviews_frame['Preprocessed_Review'] = preprocessed_strings

    # Return the DataFrame with the added preprocessed reviews
    return reviews_frame


In [None]:
preprocess_reviews(df_balancedData)