# End-to-End NLP Pipeline for Customer Sentiment Analysis of Amazon Food Reviews

## Problem Statement

Business Problem: Amazon seeks to enhance customer satisfaction by analyzing the sentiment of customer reviews for food products sold on its website. Knowing whether reviews are negative, neutral, or positive can guide marketing strategy, product improvements, and customer service.

Importance: Positive sentiment drives customer loyalty, while negative sentiment can harm brand reputation. An automated sentiment analysis system reduces manual review effort, provides real-time insights, and supports data-driven decisions.

Data Collection: We use the Amazon Food Reviews dataset, a publicly available Kaggle dataset with more than 500,000 customer reviews (568,454 rows in the copy loaded in this notebook), each with a star rating from 1 to 5. We map ratings to sentiments: 1-2 (Negative), 3 (Neutral), 4-5 (Positive). The dataset is well suited to the task because of its size, diversity, and relevance to retail.

NLP Task: The problem is formulated as a multi-class text classification task: the input is the review text and the output is one of three sentiment labels (negative, neutral, positive).

Benefits: The pipeline will enable the company to prioritize customer concerns, monitor sentiment trends, and improve product offerings, ultimately boosting revenue and customer retention.

## System Design

The NLP pipeline consists of the following stages, connected sequentially to process the review text:

Data Collection: Load and sample the Amazon Food Reviews dataset to keep it manageable.

Preprocessing: Clean the text by lowercasing, removing stopwords and punctuation, and lemmatizing.

Feature Extraction: Use TF-IDF features for the baseline model and BERT embeddings for better accuracy.

Model Training: Train a logistic regression model as the baseline, then fine-tune a BERT model for better accuracy.

Evaluation: Assess accuracy, precision, recall, F1-score, and the confusion matrix.

Discussion: Analyze results, limitations, and business implications.
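
As a rough illustration of how the baseline stages fit together, the sketch below chains TF-IDF feature extraction and logistic regression in a scikit-learn `Pipeline`. The toy reviews, the labels, and the `baseline` variable are made up for illustration, and the hyperparameters are assumptions rather than the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (not taken from the dataset).
toy_reviews = [
    "great taste love it",
    "delicious and fresh",
    "terrible stale product",
    "awful flavor never again",
    "okay nothing special",
    "average taste acceptable",
]
toy_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Feature extraction and classification chained as a single estimator.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(toy_reviews, toy_labels)

# Raw strings go in; the pipeline vectorizes them before classifying.
predictions = baseline.predict(["fresh and delicious", "stale and awful"])
print(predictions)
```

Wrapping both steps in one object means the vectorizer is fitted only on the training data and then reused unchanged at prediction time, which helps avoid leaking test-set vocabulary into the features.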



## 1. Data Acquisition

We work with a subset of the Amazon Food Reviews dataset (100,000 reviews) to balance computational efficiency and representativeness. The cells below first load the full CSV file, a local copy of the Kaggle dataset.

In [1]:
# Importing the libraries used throughout the pipeline
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, BertModel
import torch
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from transformers import TFBertForSequenceClassification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Loading the dataset; malformed rows are skipped instead of raising an error
df_nlp = pd.read_csv("/Users/phionanamugga/Documents/coding/datascience/NLP_projects/AmazonFoodReviews.csv", on_bad_lines='skip')
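
The sampling step itself is not shown in the cell above. One possible way to draw the 100,000-review subset is sketched below, assuming a stratified sample that keeps the share of each star rating. The helper name `stratified_sample`, the seed, and the toy frame are hypothetical and only stand in for the real data.

```python
import pandas as pd

def stratified_sample(df, n_rows, by="Score", seed=42):
    """Draw about n_rows rows while keeping the share of each `by` value."""
    if n_rows >= len(df):
        return df.copy()
    frac = n_rows / len(df)
    # GroupBy.sample draws the same fraction from every rating group.
    return df.groupby(by).sample(frac=frac, random_state=seed)

# Toy frame standing in for the real reviews (values invented for illustration).
toy = pd.DataFrame({
    "Score": [5] * 600 + [4] * 150 + [3] * 100 + [2] * 50 + [1] * 100,
    "Text": ["review text"] * 1000,
})
subset = stratified_sample(toy, n_rows=100)
print(subset["Score"].value_counts().sort_index())
```

On the real frame the call would look like `stratified_sample(df_nlp, n_rows=100_000)`. A plain `df_nlp.sample(n=100_000, random_state=42)` is a simpler alternative if exact rating proportions are not important.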

## 2. Data Exploration

In [3]:
# Getting summary information: data types, non-null counts, and column names
df_nlp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568428 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [4]:
# Summary statistics for the numeric columns
df_nlp.describe()

| | Id | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 284227.5 | 1.743817 | 2.22881 | 4.183199 | 1296257000.0 |
| std | 164098.679298 | 7.636513 | 8.28974 | 1.310436 | 48043310.0 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 142114.25 | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 284227.5 | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 426340.75 | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 568454.0 | 866.0 | 923.0 | 5.0 | 1351210000.0 |


In [5]:
# Previewing the first five rows of the dataset
df_nlp.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [6]:
# Checking for the value counts of each star rating
print(df_nlp['Score'].value_counts())

Score
5    363122
4     80655
1     52268
3     42640
2     29769
Name: count, dtype: int64


In [7]:
# Mapping star ratings to sentiment labels and keeping only the columns needed
df_nlp['Sentiment'] = df_nlp['Score'].map({1: 'negative', 2: 'negative', 3: 'neutral', 4: 'positive', 5: 'positive'})
df_nlp = df_nlp[['Text', 'Sentiment']].dropna()

In [8]:
# Printing Column names to verify information
print("column names:", df_nlp.columns)

column names: Index(['Text', 'Sentiment'], dtype='object')
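
The star-rating counts printed earlier imply a strongly imbalanced label distribution once ratings are grouped into three sentiments. The short sketch below aggregates those counts with the same mapping to make the class shares explicit. It uses only the standard library, and the variable names are illustrative.

```python
# Star-rating counts copied from the value_counts() output above.
score_counts = {5: 363122, 4: 80655, 1: 52268, 3: 42640, 2: 29769}

# Same rating-to-sentiment mapping as used in the notebook.
score_to_sentiment = {
    1: "negative", 2: "negative", 3: "neutral", 4: "positive", 5: "positive",
}

# Sum the rating counts that fall under each sentiment label.
sentiment_counts = {}
for score, count in score_counts.items():
    label = score_to_sentiment[score]
    sentiment_counts[label] = sentiment_counts.get(label, 0) + count

total = sum(sentiment_counts.values())
for label, count in sorted(sentiment_counts.items()):
    print(f"{label:<8} {count:>7} {count / total:6.1%}")
```

About 78% of the reviews are positive, about 14% negative, and under 8% neutral, so accuracy alone can look good for a model that mostly predicts "positive". Per-class precision, recall, and F1 are more informative here.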


## 3. Preprocessing

Text is cleaned by converting to lowercase, tokenizing, removing stopwords, dropping non-alphabetic tokens (punctuation and numbers), and lemmatizing. This reduces noise and standardizes input for feature extraction.

In [None]:
# Downloading the required NLTK data
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

# Initializing the Lemmatizer and Stop words
Stop_Words = set(stopwords.words('english'))
LemmatiZer = WordNetLemmatizer()

# Creating a preprocessing function
def Preprocess_Text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in Stop_Words]  # Remove stop words
    tokens = [t for t in tokens if t.isalpha()]  # Keeping only alphabetic tokens
    tokens = [LemmatiZer.lemmatize(t) for t in tokens]  # Lemmatize tokens
    return ' '.join(tokens)

# Applying preprocessing to the 'Text' column
df_nlp['cleaned_review'] = df_nlp['Text'].apply(Preprocess_Text)
print(df_nlp[['Text', 'cleaned_review']].head())

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/phionanamugga/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/phionanamugga/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/phionanamugga/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
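
One thing worth checking in the cleaning step: NLTK's English stop-word list includes negations such as "not" and "no", so removing every stop word can turn "not good" into "good". The sketch below illustrates the effect with only the standard library. `toy_stop_words`, `negations`, and `clean` are hypothetical stand-ins, and the regex tokenizer is a simplification of `word_tokenize` with no lemmatization. Whether keeping negations improves results on this dataset is an assumption that would need to be tested.

```python
import re

# Tiny hand-written stop-word list standing in for NLTK's (assumption).
toy_stop_words = {"the", "is", "a", "this", "was", "not", "no", "and", "it"}
negations = {"not", "no", "nor", "never"}

def clean(text, stop_words):
    """Lowercase, keep alphabetic tokens, and drop the given stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

review = "This coffee is not good."
without_negation = clean(review, toy_stop_words)
with_negation = clean(review, toy_stop_words - negations)

print(without_negation)  # coffee good
print(with_negation)     # coffee not good
```

In the notebook's own code, the equivalent change would be to build the stop-word set as `set(stopwords.words('english')) - {'not', 'no', 'nor'}` before calling `Preprocess_Text`.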
