# End-to-End NLP Pipeline for Customer Sentiment Analysis of Amazon Food Reviews

## Problem Statement

Business Problem: Amazon seeks to improve customer satisfaction by analyzing the sentiment of reviews for food products sold on its website. Knowing whether reviews are negative, neutral, or positive can guide marketing strategy, product improvements, and customer service.

Importance: Positive sentiment drives customer loyalty, while negative sentiment can harm brand reputation. An automated sentiment analysis system reduces manual review effort, provides real-time insights, and supports data-driven decisions.

Data Collection: We use the Amazon Food Reviews dataset, a publicly available Kaggle dataset containing about 500,000 customer reviews with star ratings (1-5). We map ratings to sentiments: 1-2 (Negative), 3 (Neutral), 4-5 (Positive). The dataset suits this task because of its size, diversity, and relevance to retail.

NLP Task: The problem is formulated as a multi-class text classification task: the input is the review text and the output is one of three sentiment labels (negative, neutral, positive).

Benefits: The pipeline will enable the company to prioritize customer concerns, monitor sentiment trends, and improve product offerings, ultimately boosting revenue and customer retention.

## System Design

The NLP pipeline consists of the following stages, connected sequentially to process the text data:

Data Collection: Load and sample the Amazon Food Reviews dataset to keep it manageable.

Preprocessing: Clean the text by removing stopwords and punctuation and lemmatizing the remaining tokens.

Feature Extraction: Use TF-IDF features for the baseline model and BERT embeddings with the aim of higher accuracy.

Model Training: Train a logistic regression model as the baseline, then fine-tune a BERT model with the aim of higher accuracy.

Evaluation: Report accuracy, precision, recall, F1-score, and the confusion matrix.

Discussion: Analyze results, limitations, and business implications.
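
The baseline path through these stages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's final implementation: `clean_text` is a hypothetical helper, the six reviews are invented toy data rather than rows from the dataset, scikit-learn's built-in English stopword list stands in for the NLTK one, and lemmatization is left out to keep the sketch free of NLTK downloads.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

LABELS = ["negative", "neutral", "positive"]


def clean_text(text: str) -> str:
    """Lowercase, drop HTML tags, and replace non-letters with spaces."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy reviews invented for illustration; they are not taken from the dataset.
texts = [
    "Great taste, my kids love it!",
    "Delicious and fresh, will buy again.",
    "Terrible flavor, arrived stale.",
    "Awful product.<br />Waste of money.",
    "It is okay, nothing special.",
    "Average snack, neither good nor bad.",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# TF-IDF features feeding a logistic regression classifier.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_text, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(texts, labels)

# Evaluated on the training texts only to show the metric calls;
# a real run would score a held-out test split instead.
preds = baseline.predict(texts)
print(classification_report(labels, preds, labels=LABELS, zero_division=0))
cm = confusion_matrix(labels, preds, labels=LABELS)
print(cm)
```

Wrapping the vectorizer and classifier in one `Pipeline` keeps the TF-IDF vocabulary fitted on training data only, which avoids leaking test-set terms once a train/test split is introduced.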



## 1. Data Acquisition

We work with a 100,000-review subset of the Amazon Food Reviews dataset to balance computational efficiency and representativeness. The data comes from the public Kaggle release described in the Problem Statement and is read from a local CSV file.

In [1]:
# Import the libraries used across the pipeline
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, BertModel
import torch
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
# TensorFlow variant of the BERT classifier; it needs TensorFlow installed alongside PyTorch
from transformers import TFBertForSequenceClassification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the dataset, skipping malformed rows
df_nlp = pd.read_csv("/Users/phionanamugga/Documents/coding/datascience/NLP_projects/AmazonFoodReviews.csv", on_bad_lines='skip')
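
The load cell reads the full file, so the 100,000-review subset still has to be drawn. One way to do this, sketched under assumptions: `sample_reviews` is a hypothetical helper, the seed of 42 is arbitrary, and the `demo` frame is synthetic data standing in for `df_nlp` so the snippet runs without the CSV.

```python
import pandas as pd


def sample_reviews(df: pd.DataFrame, n: int = 100_000, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample of at most n rows."""
    n = min(n, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Synthetic stand-in for df_nlp so the sketch runs without the CSV file.
demo = pd.DataFrame({
    "Score": [1, 2, 3, 4, 5] * 200,
    "Text": [f"review {i}" for i in range(1000)],
})
subset = sample_reviews(demo, n=100)
print(len(subset))
```

In the notebook this would be applied as `df_nlp = sample_reviews(df_nlp)`. Fixing `random_state` makes the subset reproducible across runs.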

## 2. Data Exploration

In [None]:
# Summary information: column names, data types, and non-null counts
df_nlp.info()

In [None]:
# Summary statistics for the numeric columns
df_nlp.describe()

In [None]:
# Preview the first five rows of the dataset
df_nlp.head()

In [None]:
# Count the reviews for each star rating
print(df_nlp['Score'].value_counts())

In [None]:
# Map star ratings to sentiment labels, then keep only the text and label columns
df_nlp['Sentiment'] = df_nlp['Score'].map({1: 'negative', 2: 'negative', 3: 'neutral', 4: 'positive', 5: 'positive'})
df_nlp = df_nlp[['Text', 'Sentiment']].dropna()
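
Collapsing five ratings into three classes can leave the classes unevenly sized, so it may be worth checking the class shares after the mapping. The sketch below uses a made-up toy frame in place of `df_nlp` and a hypothetical constant `SCORE_TO_SENTIMENT` holding the same mapping. It also shows that a score outside 1-5 maps to NaN and is removed by `dropna()`.

```python
import pandas as pd

SCORE_TO_SENTIMENT = {1: "negative", 2: "negative", 3: "neutral", 4: "positive", 5: "positive"}

# Toy frame standing in for df_nlp; the scores and texts are made up.
toy = pd.DataFrame({
    "Score": [5, 5, 4, 3, 1, 2, 5, 0],
    "Text": ["a", "b", "c", "d", "e", "f", "g", "h"],
})
toy["Sentiment"] = toy["Score"].map(SCORE_TO_SENTIMENT)

# Scores outside 1-5 (here the 0) map to NaN and are removed by dropna().
toy = toy[["Text", "Sentiment"]].dropna()

# Share of each sentiment class, as fractions that sum to 1.
shares = toy["Sentiment"].value_counts(normalize=True)
print(shares)
```

A strongly skewed distribution would be a reason to report per-class precision, recall, and F1-score rather than accuracy alone.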

In [None]:
# Print the column names to verify the result
print("column names:", df_nlp.columns)