# Amazon Review Sentiment Analysis

## Main Goals
- Predict whether an Amazon review has a positive or negative sentiment.
- Convert unstructured text data into meaningful numerical features.
    - Clean and preprocess raw text from reviews.
    - Apply TF-IDF vectorization to represent the text data.
- Explore word-level features to understand what drives sentiment.
- Compare and analyze results from two different classification models.

### Context
In the modern digital marketplace, customer reviews are a cornerstone of consumer decision-making and a vital source of feedback for businesses. Understanding the sentiment expressed in these reviews at a large scale presents a significant challenge. Automatically classifying reviews as positive or negative is crucial for businesses to gauge customer satisfaction, identify product strengths and weaknesses, and manage their brand reputation. In the field of data science, Natural Language Processing (NLP) offers a robust toolkit for converting unstructured text into features for predictive modeling. This project leverages a real-world dataset of past Amazon reviews to build a classification model that predicts sentiment, enabling a more data-driven approach to understanding the voice of the customer.

## 1. Loading in the Data
For this project, we will use the [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?select=Reviews.csv) from Kaggle. In accordance with Kaggle licenses, please directly visit the Kaggle website and download the `Reviews.csv` dataset for this activity, and then upload the file to the same directory as the notebook file.

We can start by loading the dataset into a pandas DataFrame and displaying it, both to confirm that it loaded correctly and to see what the features are and how the target is represented. This means we need to import pandas first.

It's worth mentioning that anytime you have a dataset from an external source, such as Kaggle, you can and should refer back to the source of the data to clear up misconceptions and also to get a better understanding of the data.

In [1]:
#Import pandas
import pandas as pd

#Read the CSV file
df = pd.read_csv('Reviews.csv')

#Display the dataframe
display(df)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
568449,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,5,1299628800,Will not do without,Great for sesame chicken..this is a good if no...
568450,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,2,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...
568451,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,5,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o..."
568452,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,5,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...


Using display() on our data shows us our features. Let's list some of the features worth noting and clarifying. We'll use information from Kaggle to supplement what the display tells us.

- HelpfulnessNumerator: Number of users who found the review helpful
- HelpfulnessDenominator: Number of users who indicated whether they found the review helpful or not 
- Score: The rating for the product on a scale of 1 to 5.
- Time: While it should simply just be the time of the review, the entries might look unfamiliar. The data is in Unix timestamp (also known as epoch time), which records time as the number of seconds since January 1st, 1970. Something like 1303862400 would be the same as 2011-04-27 00:00:00.
- Text: This is just the text of the reviews, and we'll use this in combination with other features to train our model on sentiment
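
To make the Unix timestamp format in the Time feature concrete, here is a minimal sketch of converting it to a readable date with pandas. The small `toy` frame is a hypothetical stand-in for our real dataframe, and the `review_date` column name is our own choice, not part of the dataset.

```python
import pandas as pd

# The example timestamp quoted above, in seconds since 1970-01-01 (UTC)
example_seconds = 1303862400

# unit='s' tells pandas the integer counts seconds rather than nanoseconds
example_datetime = pd.to_datetime(example_seconds, unit='s')
print(example_datetime)  # 2011-04-27 00:00:00

# The same call works on a whole column; this tiny frame stands in for df
toy = pd.DataFrame({'Time': [1303862400, 1346976000]})
toy['review_date'] = pd.to_datetime(toy['Time'], unit='s')
print(toy)
```

We don't need this conversion for sentiment prediction, but it is a handy check when a time column looks unfamiliar.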

Note that we don't have a clear sentiment target, so when the time for feature engineering comes, we'll take the data we have now and create a binary target that strictly tells our model whether a review is positive or negative.

With that, we're ready to move on to preprocessing our data.

## 2. Preprocessing

Let's now start to clean up our data. We do want to make it as easy as possible for our model to read our data.

### Handling Null Values
When preprocessing data, a good place to start is with handling null entries. They're easy to check, but leaving them in can cause major issues for our model down the line. Thankfully, pandas has plenty of tools for checking whether we have any null values. Something to note is that since we are handling a lot of text, a missing value might be stored as a blank string instead of a true null. As such, we'll check for blank strings as well. We'll use a bit more code than usual for this, so do follow along.

Let's start by printing out the null values in the data frame. In a different code cell, we can check for blank strings in specific columns.

In [2]:
#Check for null values
print(df.isnull().sum())

Id                         0
ProductId                  0
UserId                     0
ProfileName               26
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64


Here we can see that we have 26 null entries in the ProfileName feature and 27 in the Summary feature. Before handling these rows, let's continue with what we previously said and check for blank strings as well, since they wouldn't have been counted among the null entries we just saw.

In [3]:
#List of the object/string columns we want to inspect
columns_to_check = [
    'Summary',
    'Text',
    'ProductId',
    'UserId',
    'ProfileName'
]

print("Analyzing String Columns for Missing or Empty Values")

# Loop through each column in our list
for col in columns_to_check:
    #Print the column name for clarity
    print(f"\n'{col}':")

    #Check for empty strings ('')
    #Comparing a NaN value to '' simply evaluates to False, so nulls are not counted here.
    #This only inspects the column; the DataFrame itself is left unchanged.
    empty_strings = (df[col] == '').sum()
    print(f"Empty strings: {empty_strings}")
    

    #Check for whitespace-only strings ('   ')
    #We fill potential NaN values with an empty string first so the .str accessor doesn't cause an error.
    whitespace_strings = df[col].fillna('').str.isspace().sum()
    print(f"Whitespace-only strings: {whitespace_strings}")


Analyzing String Columns for Missing or Empty Values

'Summary':
Empty strings: 0
Whitespace-only strings: 0

'Text':
Empty strings: 0
Whitespace-only strings: 0

'ProductId':
Empty strings: 0
Whitespace-only strings: 0

'UserId':
Empty strings: 0
Whitespace-only strings: 0

'ProfileName':
Empty strings: 0
Whitespace-only strings: 0


Fortunately for us, our data doesn't contain empty or whitespace-only strings, so let's just handle the original null entries we saw in the ProfileName and Summary features. Something important to note is that while we are missing these values, we aren't missing other important information in the same rows, such as the text or scores. As such, we can fill them in rather than dropping the rows. It doesn't really matter what we fill them in with, as long as it's a string. An empty string is a sensible choice because it adds no words of its own, so we'll do just that.

In [4]:
#Fill the nulls with empty strings
df['Summary'] = df['Summary'].fillna('')
df['ProfileName'] = df['ProfileName'].fillna('')

#You can then verify the nulls are gone:
print(df[['Summary', 'ProfileName']].isnull().sum())

Summary        0
ProfileName    0
dtype: int64


### Removing Neutrality
Removing reviews with a 3-star rating is a deliberate strategic decision to improve the quality and clarity of our data. These neutral reviews are often ambiguous, containing a mix of positive and negative language that makes it difficult for a model to learn a clear signal. By dropping them, we create a distinct binary problem with clearly positive (4-5 stars) and clearly negative (1-2 stars) classes. While this does reduce the overall quantity of our data, it significantly improves the quality, which allows models using techniques like Bag-of-Words and TF-IDF to more easily identify the words and phrases strongly associated with each sentiment, leading to a more robust and better-performing final model.

Note that while this simplification is a great technique for building a strong binary classifier, it isn't always the right approach. For more complex, real-world applications, keeping the neutral class can provide a more nuanced understanding of customer feedback. In those cases, it would be ideal to use a more advanced model, such as a neural network, which is better equipped to handle the subtlety and complexity of a three-class (positive, neutral, negative) sentiment problem. For our current purpose of learning TF-IDF later on, however, the binary approach is better suited.

With that in mind, let's go ahead and remove rows where the score is 3.

In [5]:
#Keep only rows where the score is not 3
df = df[df['Score'] != 3].copy()

#Display the filtered DataFrame
display(df)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
568449,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,5,1299628800,Will not do without,Great for sesame chicken..this is a good if no...
568450,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,2,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...
568451,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,5,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o..."
568452,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,5,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...


It might be difficult to see any changes on the surface, but we did lose roughly 40,000 rows. Getting rid of data like this isn't always ideal, so do make sure you have a large enough dataset, and a valid reason, before making such a decision.
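
Since the displayed table doesn't make the row count obvious, one way to measure what a filter removed is to compare lengths before and after. The sketch below uses a small hypothetical `toy` frame rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical scores)
toy = pd.DataFrame({'Score': [5, 3, 1, 4, 3, 2]})

rows_before = len(toy)

# Same filter as in the cell above: keep everything except 3-star rows
toy_filtered = toy[toy['Score'] != 3].copy()
rows_after = len(toy_filtered)

print(f"Rows before: {rows_before}")
print(f"Rows after:  {rows_after}")
print(f"Removed:     {rows_before - rows_after}")
```

The same three lines can be wrapped around any filtering step on the real dataframe to confirm how much data it costs.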

### Defining Sentiment 

As of right now, we have a lot of indicators of sentiment, but nothing that clearly categorizes a review as positive or negative. As such, we'll take the liberty of defining that ourselves, and this will be our target as well. We'll say that anything with a score greater than 3 is positive, and anything with a score below 3 is negative.

In [6]:
# Create the sentiment column: 1 for positive, 0 for negative
df['sentiment'] = df['Score'].apply(lambda score: 1 if score > 3 else 0)

#Display the DataFrame with the new sentiment column
display(df[['Score', 'sentiment']])

Unnamed: 0,Score,sentiment
0,5,1
1,1,0
2,4,1
3,2,0
4,5,1
...,...,...
568449,5,1
568450,2,0
568451,5,1
568452,5,1


Great! With that, we have our target. Many models, such as logistic regression or random forests, handle binary targets well, so we have options.
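
As an aside, the same target can be built without a row-by-row lambda. The sketch below, on a hypothetical `toy` frame, compares the `apply` version with a vectorized comparison. Both column names are ours, chosen for illustration.

```python
import pandas as pd

# Hypothetical scores with the 3-star rows already removed
toy = pd.DataFrame({'Score': [5, 1, 4, 2]})

# Row-by-row version, as in the cell above
toy['sentiment_apply'] = toy['Score'].apply(lambda score: 1 if score > 3 else 0)

# Vectorized version: compare the whole column, then cast True/False to 1/0
toy['sentiment_vectorized'] = (toy['Score'] > 3).astype(int)

print(toy)
```

Both columns hold the same values. The vectorized form is usually faster on large frames, though either is fine for a dataset of this size.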

### Handling Duplicates
While not obvious from simply inspecting the dataframe as we have done so far, the Kaggle documentation for this dataset reveals that even though we have over 500,000 rows, there are only 393,579 unique reviews. Without reviews scored at 3 stars, this number is likely even lower. On Amazon, certain products are grouped together, typically as different versions of the same product from the same brand. Any review on one of those products is also visible on the other versions, which is likely what is causing our duplication issue. So that repeated reviews don't carry extra weight in our model, we'll be filtering out these duplicates.

Fortunately, pandas makes this easy with its drop_duplicates function, which does exactly what we need. We'll specify a subset of columns when we call it: for a duplicated review, the text, time, profile name, and user Id will all match, while features like the product Id will differ from copy to copy.

In [7]:
#Drop duplicates based on the user and their review text
df = df.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'])

#Display the DataFrame after dropping duplicates to see how many rows were removed
display(df)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,sentiment
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,1
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,1
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,1
...,...,...,...,...,...,...,...,...,...,...,...
568449,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,5,1299628800,Will not do without,Great for sesame chicken..this is a good if no...,1
568450,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,2,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...,0
568451,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,5,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o...",1
568452,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,5,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...,1


While losing over 100,000 entries doesn't sound good, it's important to remember that we did this so that repeated copies of the same review don't skew our model.
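
To see why the `subset` argument matters, here is a small sketch on made-up rows. The user, product, and text values are hypothetical; only the column names match our dataset.

```python
import pandas as pd

# The first two rows are the same review posted under two product versions
toy = pd.DataFrame({
    'ProductId': ['A1', 'A2', 'B1'],
    'UserId': ['u1', 'u1', 'u2'],
    'ProfileName': ['Ann', 'Ann', 'Bob'],
    'Time': [100, 100, 200],
    'Text': ['Great tea', 'Great tea', 'Too sweet'],
})

# Without subset, nothing is dropped because ProductId differs on every row
print(len(toy.drop_duplicates()))  # 3

# With subset, rows that match on these four columns count as duplicates
deduped = toy.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'])
print(len(deduped))  # 2
print(deduped)
```

By default pandas keeps the first occurrence of each duplicate group, so the copy attached to product `A1` survives here.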

### Dropping Noisy Features

To further clean our data, we'll be taking a look at certain features once again. Our goal is to have this data ready in a form that is easy for a model to learn from. Features such as the Ids and the profile name are unlikely to help our model learn sentiment. In fact, they might even cause a form of data leakage: the model could learn that a particular person or item tends to receive negative or positive reviews, and then guess based on the person or item rather than the text. When confronted with a new item being reviewed by a new person, it would then have trouble making a proper prediction.

We also want to keep the HelpfulnessNumerator and HelpfulnessDenominator features away from the model out of worry about data leakage. How helpful a review was to others simply won't be available to us when reading a new review, or at least when inputting a new review into the model. Note that the cell below leaves these two columns in the dataframe; they never reach the model because only the review text is used as a feature later on.

As such, we'll be dropping the Id columns, as well as the profile name column. This includes the regular Id column, since although it simply acts as an index, it doesn't help our model in the end.

In [8]:
#Drop our noisy columns
#We drop time too, since we don't want bias for time of reviews.
#Note that if we were analyzing something like sentiment over seasons, we would need it.
df = df.drop(columns=['Id', 'ProductId', 'UserId', 'Time','ProfileName'])

#View our changes
display(df)

Unnamed: 0,HelpfulnessNumerator,HelpfulnessDenominator,Score,Summary,Text,sentiment
0,1,1,5,Good Quality Dog Food,I have bought several of the Vitality canned d...,1
1,0,0,1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0
2,1,1,4,"""Delight"" says it all",This is a confection that has been around a fe...,1
3,3,3,2,Cough Medicine,If you are looking for the secret ingredient i...,0
4,0,0,5,Great taffy,Great taffy at a great price. There was a wid...,1
...,...,...,...,...,...,...
568449,0,0,5,Will not do without,Great for sesame chicken..this is a good if no...,1
568450,0,0,2,disappointed,I'm disappointed with the flavor. The chocolat...,0
568451,2,2,5,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o...",1
568452,1,1,5,Favorite Training and reward treat,These are the BEST treats for training and rew...,1


### Create a balanced dataset

Another important preprocessing step is balancing the data. As it is right now, we have significantly more positive reviews than negative reviews. Simply feeding this into our model might cause it to guess positive by default, as that would be the safe option. Our solution is to create a balanced dataset with an equal number of positive and negative reviews. While this does mean discarding even more reviews, note that we still have ample data to train our model with, especially since it is text data. With much smaller datasets, however, steps like these might be detrimental, so do be cautious before doing so. In our case, it should help our model significantly.

Let's start by creating two dataframes, one for each sentiment, and then taking a sample of the positive reviews. We'll then concatenate them together to create our new balanced dataset.

In [None]:
#Separate the reviews into two DataFrames: one for positive, one for negative.
df_positive = df[df['sentiment'] == 1]
df_negative = df[df['sentiment'] == 0]

#Randomly sample the positive reviews to match the number of negative reviews.
#This takes a random subset of the majority class.
df_positive_downsampled = df_positive.sample(n=len(df_negative), random_state=64)

#Concatenate the negative reviews and the downsampled positive reviews back together.
df_balanced = pd.concat([df_positive_downsampled, df_negative])

#Shuffle the balanced DataFrame to mix the rows up. Not really necessary, but good practice.
#We also reassign this balanced DataFrame to our original Dataframe for consistency.
df = df_balanced.sample(frac=1, random_state=64).reset_index(drop=True)

#Just to show and compare the original and new balanced datasets
print("--- Dataset Balancing Complete ---")
print(f"Original positive reviews: {len(df_positive)}")
print(f"Original negative reviews: {len(df_negative)}")
print("\n--- New Balanced Dataset ---")
print(df['sentiment'].value_counts())

--- Dataset Balancing Complete ---
Original positive reviews: 307063
Original negative reviews: 57110

--- New Balanced Dataset ---
sentiment
0    57110
1    57110
Name: count, dtype: int64


Here we can see that, while we previously had more than five times as many positive reviews as negative reviews, we now have an equal number of each. When the time comes to train the model, it should be far less inclined to default to predicting positive.
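
To put the imbalance in numbers, the short calculation below uses the two counts printed by the balancing cell. It also shows the accuracy a model would have reached before balancing by always guessing positive, which is why raw accuracy on imbalanced data can be misleading. The variable names are ours.

```python
# Counts reported by the balancing cell above
positive_count = 307063
negative_count = 57110

# How many positive reviews there were for each negative one
ratio = positive_count / negative_count
print(f"Positive-to-negative ratio before balancing: {ratio:.2f}")  # 5.38

# Accuracy of a model that always guessed "positive" before balancing
majority_baseline = positive_count / (positive_count + negative_count)
print(f"Always-positive baseline accuracy: {majority_baseline:.3f}")  # 0.843
```

After balancing, that same always-positive strategy would score only 50%, so any accuracy well above that reflects what the model actually learned from the text.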

## 3. Text Cleaning 

The next essential step in our project is to perform text cleaning. The main goal here is to standardize our raw review text by removing all the noise that doesn't help in determining sentiment. This ensures that when our model analyzes the data, it focuses only on what's important. If we feed it messy data, we'll get a messy, unreliable model. To do this, we'll create a single function that performs a few key cleaning steps on each review. 

Note that we'll be cleaning both the summary and text features with this function and then combining them. While the summary may not contain many words, the few it does contain tend to be impactful.

First, it will remove any stray HTML tags. While it might seem odd to check for such things in Amazon reviews, there's a chance that the web scraper, or whatever tool was used to collect the reviews, pulled in the raw HTML that contained the text. This might be something like `<br />`, which marks a line break in the review on the website. We might not have directly seen any HTML tags, but this is a good check to have.

Next, it will convert all text to lowercase so that words like "Good", "good", and "GOOD" are treated as the same word. It will also strip out all punctuation so that words like "bad" and "bad!" are also treated the same.

Typically this process would also include dealing with "stop words", which are common, unimportant words such as "the", "a", "is", and "in". Removing them allows the model to pay attention to more meaningful words. Fortunately for us, the scikit-learn library has a list of stop words saved for this exact reason, so we won't have to worry about handling those right now.

For our cleaning function, we'll also import Python's built-in re module to handle tasks that require Regular Expressions, often called "regex." Think of regex as a powerful tool for finding and replacing specific patterns within text, much like a supercharged "Find & Replace." We use it for two key jobs. First, we find and remove any HTML tags by searching for text enclosed in angle brackets. Second, we strip out punctuation (and digits) by searching for any character that is not a letter and replacing it with a space. The re module handles both of these cleaning needs concisely.
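
Before wrapping these patterns in a function, it can help to watch them work on a single string. The `sample` review below is made up for illustration; the two patterns are the ones described above.

```python
import re

# A made-up review containing an HTML tag, punctuation, and digits
sample = 'Great taffy!<br />Loved it, 10/10.'

# Pattern 1: '<.*?>' matches a '<', then as few characters as possible, then '>'
without_tags = re.sub(r'<.*?>', '', sample)
print(without_tags)  # Great taffy!Loved it, 10/10.

# Pattern 2: '[^a-zA-Z]' matches any single character that is not a letter
letters_only = re.sub(r'[^a-zA-Z]', ' ', without_tags.lower())

# Collapse the runs of spaces left behind by the replacements
print(' '.join(letters_only.split()))  # great taffy loved it
```

One design point to be aware of: the tag is replaced with an empty string, so if a tag sat directly between two words with no punctuation or space, the words would be fused together. Replacing tags with a space instead would avoid that, at the cost of slightly different output.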

In [10]:
#Import Python's regular expression module for text processing
import re

#Define a function to clean text
def clean_text(text):
    """
    Take a text string and clean it up for analysis.
    Specifically, this function:
    1. Removes HTML tags
    2. Converts text to lowercase
    3. Removes punctuation, digits, and any other non-letter characters
    4. Removes words shorter than 3 characters
    """
    
    #Remove HTML tags using a regular expression
    clean = re.sub('<.*?>', '', text)
    
    #Convert to lowercase
    clean = clean.lower()
    
    #Remove punctuation. This regex keeps only letters and replaces everything else with a space.
    clean = re.sub('[^a-zA-Z]', ' ', clean)
    
    #Remove extra whitespace. Possibly not necessary, but it helps to ensure clean text.
    clean = ' '.join(clean.split())
    
    #Split text into a list of words
    words = clean.split()
    
    #Remove short words (e.g., 's', 't', 'aa', 'bc')
    #We keep words that are 3 or more characters long.
    words = [word for word in words if len(word) > 2]
    
    # Join the words back into a single string
    clean = ' '.join(words)
    
    
    return clean

#Apply the function to the 'Text' and 'Summary' columns
#We store the results in new 'cleaned_text' and 'cleaned_summary' columns.
df['cleaned_text'] = df['Text'].apply(clean_text)
df['cleaned_summary'] = df['Summary'].apply(clean_text)

#Create a new column by combining the cleaned summary and text
#Adding a space in between ensures the last word of the summary and the
#first word of the text don't merge together.
df['full_text'] = df['cleaned_summary'] + ' ' + df['cleaned_text']

#Display the results to compare the original vs. cleaned text.
print("\nComparing Original Text with Cleaned Text:")
print(df[['Text', 'full_text']].head())


Comparing Original Text with Cleaned Text:
                                                Text  \
0  These singles sell for $2.50 - $3.36 at the st...   
1  This Australian ginger is the best.  The ginge...   
2  I used to be able to buy these at our local gr...   
3  I decided to try Feline Pine because I was sic...   
4  We love this seasoning.  We use it on Stake an...   

                                           full_text  
0  rip off price these singles sell for the store...  
1  delicious ginger this australian ginger the be...  
2  really delicious used able buy these our local...  
3  one the worst cat litter have ever used decide...  
4  greek spice don love this seasoning use stake ...  


While displaying our data shows only a small change for now, most notably that our summary is tacked on in front of the review, our text is now ready for TF-IDF. Before that, however, we'll have to separate our features from our target and perform the test-train split.

## 4. Test-Train Split

With our data clean, it's time to split it into training and testing sets, so that there is one set of data our model can learn from and a separate set to evaluate it on. We can do this simply using the train_test_split function from Sklearn.

Before using the function we'll split the target we engineered from the rest of our features as well.

In [11]:
#Import the train_test_split function
from sklearn.model_selection import train_test_split

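#Separate features and target variable
#Note: X uses 'cleaned_text' (the review body only). Swap in df['full_text'] to also
#include the cleaned summary described in Section 3.
X = df['cleaned_text']
y = df['sentiment']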

#Split the data into training and testing sets
#Stratify ensures that the proportion of positive and negative reviews is maintained in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=64, stratify=y)
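
To check what `stratify` does, the sketch below runs the same call on a tiny hypothetical dataset and prints the class proportions in each split. The `toy_X` and `toy_y` names and their contents are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced data: 10 positive and 10 negative hypothetical reviews
toy_X = pd.Series([f"review {i}" for i in range(20)])
toy_y = pd.Series([1] * 10 + [0] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=64, stratify=toy_y
)

# With stratify, both splits keep the 50/50 class balance of the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Running `value_counts(normalize=True)` on the real `y_train` and `y_test` is a quick way to confirm the split behaved as intended.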


## 5. Feature Extraction

With our data cleaned and split, we can now proceed to the most critical step of our NLP pipeline: feature extraction. This is the process where we will finally convert our cleaned review text into a numerical format that a machine learning model can understand and learn from. For this project, as outlined in our goals, we will be using a powerful and standard technique called TF-IDF (Term Frequency-Inverse Document Frequency).

TF-IDF is an intelligent way to represent text that goes beyond simply counting how many times a word appears. It works by calculating a score for each word in each review, a score that reflects how important that word is to that specific document. This is done by balancing two metrics: Term Frequency (TF), which is how often a word appears in a single review, and Inverse Document Frequency (IDF), which lowers the score for words that are very common across all reviews (like "and" or "it") and boosts the score for words that are rare and more descriptive. The result is a numerical feature for each word that effectively represents its importance.
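
To make the TF and IDF balance concrete, here is a small hand-rolled sketch in plain Python. It follows the smoothed IDF variant, ln((1 + n) / (1 + df)) + 1, and the unit-length scaling documented for scikit-learn's default settings; other libraries define IDF slightly differently. The three reviews and all function names are hypothetical.

```python
import math

# Three tiny hypothetical reviews, already cleaned and tokenized
reviews = [
    ['great', 'taffy', 'great', 'price'],
    ['stale', 'taffy'],
    ['great', 'flavor', 'taffy'],
]
n_reviews = len(reviews)

def document_frequency(word):
    """Number of reviews that contain the word at least once."""
    return sum(1 for review in reviews if word in review)

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_reviews) / (1 + document_frequency(word))) + 1

def tfidf_vector(review):
    """Raw count times IDF for each distinct word, scaled to unit length."""
    raw = {word: review.count(word) * smoothed_idf(word) for word in set(review)}
    length = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / length for word, value in raw.items()}

# Score the second review, which contains one rare and one common word
weights = tfidf_vector(reviews[1])
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

In the second review, "stale" and "taffy" each appear once, yet "stale" ends up with the larger weight because it occurs in only one review while "taffy" occurs in all three.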

Our plan is to use the TfidfVectorizer from the Sklearn library to implement this. It's crucial that we perform this step correctly to avoid data leakage: we will fit the vectorizer only on our training data to learn the vocabulary and word importances, and then use that same fitted vectorizer to transform both our training and test sets into numerical matrices. By setting parameters like `max_features`, we can also cap the size of our vocabulary at the most frequent terms, which keeps our model efficient. Another parameter we'll use is `stop_words`, which, as we mentioned before, deals with incredibly common words like "the".

Let's start by importing the module from Sklearn. We'll set up an instance of the TF-IDF object, fit it on the training data, and use it to transform both training and testing data.

In [None]:
#Import the TFIDFVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Initialize the TfidfVectorizer.
#This is where we set all our rules for feature creation.
tfidf_vectorizer = TfidfVectorizer(
    stop_words='english',  #Use the built-in English stop word list
    max_features=10000,    #Keep only the 10,000 most frequent words in the training data
    min_df=5,              #Keep only words that appear in at least 5 reviews
    max_df=0.8             #Ignore words that appear in more than 80% of reviews
)

#Fit the vectorizer and transform the training data.
#This learns the vocabulary from X_train and converts it to a numerical matrix.
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

#Transform the test data.
#This uses the vocabulary learned from X_train to transform X_test.
X_test_tfidf = tfidf_vectorizer.transform(X_test)

#Analyzing the Transformed Data
#Show the original shape of the data (a 1D series of text)
print(f"Original shape of X_train: {X_train.shape}")
print(f"Original shape of X_test:  {X_test.shape}")

#Show the new shape of the data after TF-IDF
#It is now a 2D matrix where columns are words from the vocabulary.
print(f"\nShape of X_train after TF-IDF: {X_train_tfidf.shape}")
print(f"Shape of X_test after TF-IDF:  {X_test_tfidf.shape}")

Initializing TfidfVectorizer...
Vectorizer initialized.

Fitting and transforming X_train...
X_train transformed.
Transforming X_test...
X_test transformed.

--- Data Transformation Analysis ---
Original shape of X_train: (91376,)
Original shape of X_test:  (22844,)

Shape of X_train after TF-IDF: (91376, 10000)
Shape of X_test after TF-IDF:  (22844, 10000)

Number of learned features (vocabulary size): 10000
A few example features: ['aafco', 'abandoned', 'abc', 'abdominal', 'ability', 'able', 'abroad', 'absence', 'absent', 'absolute']

Example Transformation ---
Original first review in X_test: 'someone gave these olives gift basket christmas last year they are amazing buying them for others this year'

Transformed TF-IDF vector for that review (sparse format):
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 8 stored elements and shape (1, 10000)>
  Coords	Values
  (0, 254)	0.30040135019160796
  (0, 627)	0.38705181677153544
  (0, 1122)	0.2336722662803621
  (0, 1510)	0.32

As we can see, TF-IDF turned each review into a vector of 10,000 features, one for each word in the learned vocabulary. Each word is given a score based on how frequently it shows up in a single review and how frequently it shows up in reviews in general. These scores help our model determine the sentiment of individual words, and of the review overall.

## 6. Building and Training the Models
Now that our text data has been cleaned and transformed into a numerical format using TF-IDF, we are ready to proceed with the model building and training phase. For this project, we will implement two different classification models: Logistic Regression and Random Forest. Using both allows us to establish a strong, interpretable baseline and then see if a more complex model can improve upon it. Logistic Regression is an excellent starting point as it is very fast to train and highly interpretable, which will allow us to easily explore the word-level features that are most predictive of sentiment. We will then train a Random Forest model, a more powerful ensemble method, to determine if its complexity leads to an increase in predictive performance.

We'll start by importing both models from Sklearn. We'll then initialize them, and fit them on the training data.

In [None]:
#Import logistic regression and random forest classifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#Initialize the classifiers
print("Initializing classifiers...\n")
logistic_model = LogisticRegression(max_iter=1000, random_state=64)
rf_model = RandomForestClassifier(n_estimators=100, random_state=64)

#Fit the logistic regression model
print("Fitting Logistic Regression model...")
logistic_model.fit(X_train_tfidf, y_train)

#Fit the random forest model
print("Fitting Random Forest model...")
rf_model.fit(X_train_tfidf, y_train)


Initializing classifiers...

Fitting Logistic Regression model...
Fitting Random Forest model...


You might notice that the Random Forest took quite a while to fit. Training time depends on more than the number of rows: with 10,000 features to consider, each of the 100 trees has many candidate splits to evaluate, so the model can fit more slowly here than it would on a larger dataset with fewer features. That's alright, however. As our projects become more advanced, steps like this will tend to take longer as well.
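If you want to measure the difference in fit time yourself, a sketch along the following lines may help. It uses a small synthetic dataset from `make_classification` as a stand-in for the TF-IDF matrix, and the helper `time_fit` and the `demo_` names are assumptions introduced only for this illustration. It also tries `n_jobs=-1`, a `RandomForestClassifier` parameter that builds trees on all available CPU cores; whether that helps depends on your machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Small synthetic stand-in for the TF-IDF matrix (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=200, n_informative=20, random_state=64
)


def time_fit(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=64),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=64),
    # n_jobs=-1 asks scikit-learn to use all available CPU cores
    "Random Forest (n_jobs=-1)": RandomForestClassifier(
        n_estimators=100, random_state=64, n_jobs=-1
    ),
}

# Fit each model once and record how long it took
fit_times = {
    name: time_fit(model, X_demo, y_demo) for name, model in demo_models.items()
}

for name, seconds in fit_times.items():
    print(f"{name}: {seconds:.2f} s")
```

The exact timings will vary from run to run and from machine to machine, so treat them as a rough comparison rather than a benchmark.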

## 7. Evaluating Our Models
With both models trained on our cleaned data, it's time to see how well they perform on the test set. We'll import several evaluation metrics from scikit-learn. Fortunately, these metrics work with both models, so we can go straight ahead and compare them.

In [17]:
#Import metrics of success
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

#Make predictions on the test set, and store them in variables.
y_pred1 = logistic_model.predict(X_test_tfidf)
y_pred2 = rf_model.predict(X_test_tfidf)

#Check the accuracy of our predictions
print("Validation Accuracy (Logistic Regression):", accuracy_score(y_test, y_pred1))
print("Validation Accuracy (Random Forest):", accuracy_score(y_test, y_pred2))

#Display the confusion matrix
print("Confusion Matrix (Logistic Regression):\n", confusion_matrix(y_test, y_pred1))
print("Confusion Matrix (Random Forest):\n", confusion_matrix(y_test, y_pred2))

#Print the classification report for more details
print("Classification Report (Logistic Regression):\n", classification_report(y_test, y_pred1))
print("Classification Report (Random Forest):\n", classification_report(y_test, y_pred2))

Validation Accuracy (Logistic Regression): 0.8831640693398705
Validation Accuracy (Random Forest): 0.8554543862721065
Confusion Matrix (Logistic Regression):
 [[10114  1308]
 [ 1361 10061]]
Confusion Matrix (Random Forest):
 [[9875 1547]
 [1755 9667]]
Classification Report (Logistic Regression):
               precision    recall  f1-score   support

           0       0.88      0.89      0.88     11422
           1       0.88      0.88      0.88     11422

    accuracy                           0.88     22844
   macro avg       0.88      0.88      0.88     22844
weighted avg       0.88      0.88      0.88     22844

Classification Report (Random Forest):
               precision    recall  f1-score   support

           0       0.85      0.86      0.86     11422
           1       0.86      0.85      0.85     11422

    accuracy                           0.86     22844
   macro avg       0.86      0.86      0.86     22844
weighted avg       0.86      0.86      0.86     22844



### Analysis

#### Confusion Matrix
The confusion matrices for both models show how well each approach distinguishes between negative reviews (class 0) and positive reviews (class 1). For logistic regression, the model correctly predicted 10114 negative reviews and 10061 positive reviews, but also misclassified 1308 negative reviews as positive, and 1361 positive reviews as negative. The random forest model follows a similar pattern, with 9875 correct predictions for negative reviews and 9667 for positive, while making more errors overall: 1547 false positives and 1755 missed positive reviews. Both models show strong performance overall, though logistic regression appears more balanced and slightly more accurate in distinguishing sentiment polarity.
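The counts discussed above can be read off the matrices programmatically. The sketch below copies the two matrices from the output and unpacks them with `ravel()`, which for a 2x2 scikit-learn confusion matrix yields the cells in the order TN, FP, FN, TP. The helper name `summarize` is an assumption made for this example.

```python
import numpy as np

# Confusion matrices copied from the output above
# (rows: true class, columns: predicted class)
cm_logistic = np.array([[10114, 1308], [1361, 10061]])
cm_forest = np.array([[9875, 1547], [1755, 9667]])


def summarize(cm):
    """Return the four cell counts, the total errors, and the accuracy."""
    # For a 2x2 matrix, ravel() flattens row by row: TN, FP, FN, TP
    tn, fp, fn, tp = (int(value) for value in cm.ravel())
    total = tn + fp + fn + tp
    return {
        "true_negatives": tn,
        "false_positives": fp,
        "false_negatives": fn,
        "true_positives": tp,
        "total_errors": fp + fn,
        "accuracy": (tn + tp) / total,
    }


summary_logistic = summarize(cm_logistic)
summary_forest = summarize(cm_forest)

print("Logistic Regression:", summary_logistic)
print("Random Forest:", summary_forest)
```

The accuracies computed this way agree with the `accuracy_score` values printed earlier, and the error totals (2,669 versus 3,302) make the gap between the two models concrete.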

#### Classification Report
Precision, recall, and F1-score offer additional insight into the models’ respective strengths:

- Precision: For negative reviews (class 0), both models are strong, with precision at 0.88 for logistic regression and 0.85 for random forest, indicating that most negative predictions are indeed correct. The same holds true for positive reviews (class 1), with precision at 0.88 for logistic regression and 0.86 for random forest. In other words, when either model predicts a class, it is right roughly 85 to 88 percent of the time, regardless of which class it predicts.

- Recall: Logistic regression achieves nearly identical recall for both classes, at 0.89 for negatives and 0.88 for positives, suggesting the model is adept at correctly identifying both types of sentiment. Random forest lags slightly behind, with 0.86 recall for negative reviews and 0.85 for positive reviews, showing a minor dip in its ability to recognize the true class labels.

- F1-score: F1-scores are closely matched between the two classes in both models. Logistic regression yields an F1-score of 0.88 for both classes, showing balanced precision and recall. Random forest is slightly lower, at 0.86 for negative reviews and 0.85 for positive reviews, but still maintains respectable performance. The consistency across precision, recall, and F1 in logistic regression indicates stronger overall robustness.

The macro and weighted averages mirror these findings, hovering around 0.88 for logistic regression and 0.86 for random forest. This suggests that class imbalance is not an issue (thanks to our prior balancing), and that performance is uniform across both sentiment categories.
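To see where the figures in the classification report come from, the per-class precision, recall, and F1-score can be recomputed from the confusion matrices alone. This is a sketch of the standard formulas; the helper `per_class_metrics` is a name chosen for this illustration and is not a scikit-learn function.

```python
import numpy as np

# Confusion matrices copied from the output above
# (rows: true class, columns: predicted class)
cm_logistic = np.array([[10114, 1308], [1361, 10061]])
cm_forest = np.array([[9875, 1547], [1755, 9667]])


def per_class_metrics(cm, label):
    """Precision, recall, and F1 for one class of a confusion matrix."""
    correct = cm[label, label]
    # Precision: correct predictions / everything predicted as this class
    precision = correct / cm[:, label].sum()
    # Recall: correct predictions / everything that truly is this class
    recall = correct / cm[label, :].sum()
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)


metrics = {}
for name, cm in [("Logistic Regression", cm_logistic), ("Random Forest", cm_forest)]:
    for label in (0, 1):
        metrics[(name, label)] = per_class_metrics(cm, label)
        p, r, f1 = metrics[(name, label)]
        print(f"{name}, class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values match the classification reports above.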

#### Overall Analysis
Validation accuracy for logistic regression stands at about 0.88, slightly outperforming random forest at 0.86. While both models are strong performers, logistic regression shows better balance, fewer errors, and slightly stronger classification power. The close alignment of its precision, recall, and F1-score for both classes makes it particularly reliable. In contrast, the random forest model, while still highly capable, makes more mistakes and shows slightly lower consistency.

Given the size of the dataset and the nature of the task, logistic regression emerges as a well-suited baseline model. Its solid performance, combined with efficiency and interpretability, makes it a great choice for binary sentiment analysis. In a practical setting, one could consider further tuning the models, adding feature engineering, or exploring more advanced architectures like neural nets if higher performance is needed. But for many use cases, this level of accuracy and stability is already actionable.

Comparing both models highlights not only their strengths and limitations, but also underscores the value of using multiple metrics beyond accuracy. Whether you’re looking to understand product sentiment, streamline customer feedback, or just explore the world of natural language processing, these results are a strong foundation to build on.


## 8. Exploring Word-Level Features

Now that we have confirmed that our models are performing well, we can move beyond simply measuring their predictive accuracy to understanding how they arrive at their conclusions. The next logical step is to explore the word-level features that the model identified as the most powerful predictors of sentiment. The primary reason for this analysis is interpretability; we want to validate that the model is learning logical patterns from the text and gain direct insight into the specific language that drives positive and negative reviews.

To accomplish this, we will inspect our trained Logistic Regression model, as it is highly transparent, especially compared to our Random Forest. This model assigns a numerical weight, or coefficient, to every word in the vocabulary created by our TF-IDF vectorizer. By extracting and ranking these coefficients, we can create a definitive list of the words with the strongest positive weights (the top predictors of a positive review) and those with the strongest negative weights (the top predictors of a negative review). This process effectively allows us to translate the model's internal logic into a clear and actionable analysis of customer sentiment.
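To illustrate how those coefficients turn into a prediction, here is a minimal sketch on a toy dataset. The texts, labels, and `toy_` names are made up for this example. It computes the weighted sum of a review's TF-IDF scores plus the intercept, passes the result through the sigmoid function, and compares it with `predict_proba`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical): 1 = positive, 0 = negative
toy_texts = [
    "great delicious snack",
    "worst stale chips",
    "great taste",
    "awful stale taste",
]
toy_labels = [1, 0, 1, 0]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)
toy_model = LogisticRegression(max_iter=1000, random_state=64).fit(X_toy, toy_labels)

# Vectorize a new review with the already fitted vocabulary
new_vector = toy_vectorizer.transform(["great snack but stale"])

# Weighted sum of TF-IDF scores plus the intercept gives the log-odds of class 1
log_odds = new_vector @ toy_model.coef_[0] + toy_model.intercept_[0]

# The sigmoid function converts log-odds into a probability
manual_probability = 1 / (1 + np.exp(-log_odds))
model_probability = toy_model.predict_proba(new_vector)[:, 1]

print("Manual probability of positive:", manual_probability)
print("predict_proba for positive:    ", model_probability)
```

Because each word contributes its TF-IDF score multiplied by its coefficient, words with large positive or negative weights move the prediction the most, which is why ranking the coefficients is informative.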



In [None]:
#Get the feature names (words) from the TfidfVectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

#Get the coefficients from the trained Logistic Regression model
#The [0] is because we have a binary classification problem
coefficients = logistic_model.coef_[0]

#Create a DataFrame to hold the words and their corresponding weights
feature_weights = pd.DataFrame({
    'word': feature_names,
    'weight': coefficients
})

#Sort the DataFrame by the weights to find the most influential words
#Ascending sort will show the most negative words first
most_negative_words = feature_weights.sort_values(by='weight', ascending=True)

#Descending sort will show the most positive words first
most_positive_words = feature_weights.sort_values(by='weight', ascending=False)

#Display the results
print("--- Top 20 Most Powerful Predictors of a NEGATIVE Review ---")

#reset_index(drop=True) makes the output cleaner
print(most_negative_words.head(20).reset_index(drop=True))


--- Top 20 Most Powerful Predictors of a NEGATIVE Review ---
              word    weight
0            worst -9.016542
1    disappointing -8.631358
2            awful -7.999173
3         terrible -7.943300
4     disappointed -7.678569
5         horrible -7.526429
6   disappointment -7.456513
7    unfortunately -6.595148
8            bland -6.380178
9           return -5.968935
10           threw -5.859926
11           waste -5.725940
12           stale -5.674093
13          hoping -5.634528
14            weak -5.589291
15      disgusting -5.437771
16       tasteless -5.319889
17           sorry -5.193882
18         thought -5.042910
19           worse -5.022502



In [None]:
#And our positive words.
#These are printed in a separate cell so that the full list is displayed.
print("--- Top 20 Most Powerful Predictors of a POSITIVE Review ---")
print(most_positive_words.head(20).reset_index(drop=True))

--- Top 20 Most Powerful Predictors of a POSITIVE Review ---
          word     weight
0        great  10.407474
1    delicious   9.506926
2         best   9.283243
3       highly   7.976964
4      perfect   7.831176
5    excellent   6.791275
6         love   6.649609
7    wonderful   6.649027
8        loves   6.561121
9         good   5.986042
10     amazing   5.945671
11    favorite   5.523752
12     pleased   5.474018
13      hooked   5.386439
14       yummy   5.251717
15        nice   5.241427
16     awesome   5.234656
17        beat   5.220004
18       thank   5.181713
19  pleasantly   5.036770


### Analysis of Word-Level Features
An inspection of the model's learned coefficients provides useful validation for our entire pipeline. The lists of the most influential words for both positive and negative sentiment align closely with human intuition, which gives us confidence that the model has learned meaningful patterns from the text data rather than relying on noise.

For negative reviews, the model identified words like worst, disappointing, awful, and terrible as the strongest predictors. The inclusion of more specific, product-related terms such as bland, stale, and tasteless, as well as action-oriented words like return and threw, further demonstrates the model's ability to pick up on the details of customer dissatisfaction. Less obvious entries such as hoping, thought, and unfortunately plausibly reflect unmet expectations, as in "I was hoping..." or "I thought...".

Conversely, the top positive words are dominated by superlatives and enthusiastic descriptors. Words like great, delicious, best, perfect, and excellent were assigned the highest positive weights, clearly signaling strong customer satisfaction. The presence of words like love, favorite, and pleasantly reinforces the model's successful identification of positive emotional language. The fact that the model learned these word associations on its own is a testament to the effectiveness of the TF-IDF feature extraction and the robustness of the classifier.

### Real-World Applications
This analysis is far more than just an academic exercise, as it provides actionable business intelligence. A company could use these insights in several key ways:

- Automated Customer Support: The sentiment model could be used to automatically tag and route incoming customer feedback. A review containing words with strong negative weights could be immediately flagged and sent to a support team for priority follow-up, enabling proactive customer service.

- Product Development Insights: By analyzing the most common negative keywords (e.g., bland, stale), a company can identify specific, recurring issues with their products and direct their quality assurance or product development teams to address them.

- Marketing and Voice of Customer Analysis: The top positive keywords provide a clear picture of what customers value most. A marketing team could leverage this language, using words like delicious or perfect in their campaigns, knowing that these terms resonate strongly with their satisfied customers.
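As a rough illustration of the automated support idea above, the sketch below flags reviews whose predicted probability of being positive falls below a cutoff. The function `route_reviews`, the sample reviews, the probabilities, and the 0.2 threshold are all assumptions made for this example. In this notebook the probabilities might come from something like `logistic_model.predict_proba(X_new_tfidf)[:, 1]`, where `X_new_tfidf` is a hypothetical matrix of newly vectorized reviews.

```python
def route_reviews(reviews, positive_probabilities, threshold=0.2):
    """Return (review, probability) pairs that fall below the threshold,
    ordered from most to least negative."""
    flagged = [
        (text, float(prob))
        for text, prob in zip(reviews, positive_probabilities)
        if prob < threshold
    ]
    # Lowest probability of being positive comes first
    return sorted(flagged, key=lambda pair: pair[1])


# Hypothetical reviews and hypothetical model probabilities
sample_reviews = [
    "worst purchase ever, threw it away",
    "delicious, highly recommend",
    "stale and bland, very disappointed",
    "it was okay",
]
sample_probabilities = [0.03, 0.91, 0.15, 0.55]

priority_queue = route_reviews(sample_reviews, sample_probabilities)

for text, prob in priority_queue:
    print(f"{prob:.2f}  {text}")
```

In practice the threshold would need to be chosen with the support team's capacity in mind, since a higher cutoff flags more reviews for follow-up.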

### Conclusion
Congratulations on reaching the end of this project! You have successfully navigated a complete Natural Language Processing pipeline, from handling a large, messy, real-world dataset to performing sophisticated text cleaning and feature extraction. You've built and evaluated multiple models and, most importantly, have interpreted their internal logic to extract meaningful, actionable insights. This project demonstrates a deep understanding of both the technical and analytical skills required for sentiment analysis. Job well done!