# Sentiment Analysis Project

### Group 3
### Members:
- Gideon Ochieng
- Ann Mwangi
- Victor Masinde
- Lorna Gatimu
- Charles Odhiambo
### Technical mentor:
- Maryann Mwikali

## Project Overview
Online business platforms like Amazon generate millions of customer reviews daily, influencing purchasing decisions and shaping brand reputations. These reviews offer valuable insights into customer satisfaction, product quality, and service efficiency. Sentiment analysis, a branch of Natural Language Processing (NLP), enables businesses to analyze and interpret customer emotions from text data. By automating this process, companies can enhance customer experience, improve products, and drive sales.

## Business Understanding
### Real-World Problem
With thousands of reviews per product, customers and businesses struggle to extract meaningful insights manually. The challenge lies in identifying positive, negative, or neutral sentiment efficiently. Traditional rating systems (1-5 stars) may not always reflect the true sentiment behind a review, as users may express mixed opinions in text form. A sentiment analysis system can provide a more accurate and automated way of understanding customer feedback, helping businesses enhance their products and services. 

### Stakeholders
This project serves several groups of stakeholders, each of whom benefits differently from sentiment analysis of Amazon review data.

1) E-commerce Businesses & Product Sellers

- Gain insights into customer satisfaction and product performance.
- Identify recurring complaints and areas for improvement.
- Monitor brand reputation and respond to negative feedback effectively.

2) Consumers & Online Shoppers
- Get data-driven product recommendations based on real customer sentiments.
- Make informed purchasing decisions by understanding overall product sentiment.
- Avoid misleading star ratings by analyzing actual customer experiences.

3) Marketing & Customer Support Teams
- Track customer sentiment trends to refine marketing strategies.
- Automate review analysis to address complaints and improve customer service.
- Identify key influencers and brand advocates from positive reviews.

In [48]:
import pandas as pd # pandas library for working with the data

In [49]:
df = pd.read_csv('Data/shoes_reviews.csv')  # forward slash is portable; '\s' is an invalid escape sequence
df.head()

|   | Unnamed: 0 | product_name | review_text | product_rating | review_date | avg_rating |
|---|---|---|---|---|---|---|
| 0 | 0 | Ecetana Water Shoes for Women Men Quick Dry Be... | Comfortable. Great with jeans and dresses. Dre... | 5 | 5/29/2022 | 4.581395 |
| 1 | 1 | Ecetana Water Shoes for Women Men Quick Dry Be... | Perfect fit, very comfortable, and I have rece... | 5 | 11/18/2023 | 4.581395 |
| 2 | 2 | Ecetana Water Shoes for Women Men Quick Dry Be... | Besides, baby, it was exactly what I needed. T... | 5 | 7/7/2021 | 4.581395 |
| 3 | 3 | Ecetana Water Shoes for Women Men Quick Dry Be... | Excellent, the only thing took on the size mor... | 5 | 7/24/2021 | 4.581395 |
| 4 | 4 | Ecetana Water Shoes for Women Men Quick Dry Be... | Perfect product, as description… Posting very ... | 5 | 11/4/2021 | 4.581395 |


In [50]:
def load_reviews(path):
    """Load a single review CSV into a pandas DataFrame."""
    return pd.read_csv(path)

def merge_reviews(review_files):
    """Concatenate every review file into one DataFrame with a fresh index."""
    return pd.concat([load_reviews(file) for file in review_files], ignore_index=True)

review_files = [
    'Data/computer_reviews.csv', 'Data/Fridge_reviews.csv', 'Data/hoodie_reviews.csv',
    'Data/parfum_reviews.csv', 'Data/Playstation_reviews.csv', 'Data/shoes_reviews.csv',
    'Data/toy_reviews.csv', 'Data/Water_reviews.csv', 'Data/Xbox_reviews.csv',
]

df = merge_reviews(review_files)  # merge the listed review files
df.head()

|   | Unnamed: 0 | product_name | review_text | product_rating | review_date | avg_rating |
|---|---|---|---|---|---|---|
| 0 | 0 | Microsoft Xbox Series S – 1TB White | The series S will set you up to game for years... | 5 | 10/31/2024 | 4.8 |
| 1 | 1 | Microsoft Xbox Series S – 1TB White | Ordered Xbox series S, received a PS5 controll... | 5 | 12/17/2024 | 4.8 |
| 2 | 2 | Microsoft Xbox Series S – 1TB White | This product is absolutely amazing, the loadin... | 5 | 2/1/2025 | 4.8 |
| 3 | 3 | Microsoft Xbox Series S – 1TB White | This console works fantastic. I was easily abl... | 5 | 11/1/2024 | 4.8 |
| 4 | 4 | Microsoft Xbox Series S – 1TB White | This product was the least expensive from the ... | 5 | 12/11/2024 | 4.8 |
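
Concatenating the files discards the information about which file each review came from. The sketch below is one possible variant of the merge step that keeps it. The function name `merge_reviews_with_source` and the `category` column are hypothetical and do not appear in the original notebook; the label is derived from the file name on the assumption that every file follows the `<name>_reviews.csv` pattern.

```python
from pathlib import Path

import pandas as pd


def merge_reviews_with_source(review_files):
    """Concatenate review CSVs and record the source file of each row."""
    frames = []
    for file in review_files:
        frame = pd.read_csv(file)
        # Hypothetical label: 'Data/shoes_reviews.csv' -> 'shoes'
        frame["category"] = Path(file).stem.replace("_reviews", "").lower()
        frames.append(frame)
    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling `merge_reviews_with_source(review_files)` with the same list of paths would return the merged frame with one extra column, which could later be used to compare sentiment across product categories.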


## Data Understanding

In [51]:
df.head()

|   | Unnamed: 0 | product_name | review_text | product_rating | review_date | avg_rating |
|---|---|---|---|---|---|---|
| 0 | 0 | Microsoft Xbox Series S – 1TB White | The series S will set you up to game for years... | 5 | 10/31/2024 | 4.8 |
| 1 | 1 | Microsoft Xbox Series S – 1TB White | Ordered Xbox series S, received a PS5 controll... | 5 | 12/17/2024 | 4.8 |
| 2 | 2 | Microsoft Xbox Series S – 1TB White | This product is absolutely amazing, the loadin... | 5 | 2/1/2025 | 4.8 |
| 3 | 3 | Microsoft Xbox Series S – 1TB White | This console works fantastic. I was easily abl... | 5 | 11/1/2024 | 4.8 |
| 4 | 4 | Microsoft Xbox Series S – 1TB White | This product was the least expensive from the ... | 5 | 12/11/2024 | 4.8 |


In [52]:
df.info()#Structure of the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7095 entries, 0 to 7094
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      7095 non-null   int64  
 1   product_name    7095 non-null   object 
 2   review_text     7069 non-null   object 
 3   product_rating  7095 non-null   int64  
 4   review_date     7095 non-null   object 
 5   avg_rating      7095 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 332.7+ KB


In [53]:
df.describe()#numerical columns statistics

|   | Unnamed: 0 | product_rating | avg_rating |
|---|---|---|---|
| count | 7095.0 | 7095.0 | 7095.0 |
| mean | 446.683016 | 3.54771 | 4.244355 |
| std | 309.00904 | 1.724285 | 0.612714 |
| min | 0.0 | 1.0 | 1.0 |
| 25% | 197.0 | 1.0 | 4.0 |
| 50% | 397.0 | 5.0 | 4.5 |
| 75% | 645.0 | 5.0 | 4.7 |
| max | 1373.0 | 5.0 | 5.0 |
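
The summary statistics show a 25th percentile of 1 and a median of 5 for `product_rating`, which suggests the ratings cluster at the two extremes. A frequency count makes that shape explicit. The sketch below uses a small made-up frame rather than the project data, and the helper name `rating_distribution` is hypothetical.

```python
import pandas as pd


def rating_distribution(frame, column="product_rating"):
    """Return the count and share of rows for each rating value."""
    counts = frame[column].value_counts().sort_index()
    share = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": share})


# Toy data standing in for the merged review frame
toy = pd.DataFrame({"product_rating": [5, 5, 1, 5, 3, 1, 5, 4]})
print(rating_distribution(toy))
```

Running the same helper on the merged `df` would show how many reviews fall in each star bucket, which is useful to know before building sentiment classes from imbalanced data.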


In [54]:
df.dtypes #datatypes of the columns

Unnamed: 0          int64
product_name       object
review_text        object
product_rating      int64
review_date        object
avg_rating        float64
dtype: object

In [55]:
df.nunique() #number of unique values in each column

Unnamed: 0        1374
product_name       683
review_text       5662
product_rating       5
review_date        945
avg_rating         157
dtype: int64
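
`review_text` has 5,662 distinct values against 7,069 non-null rows, so 1,407 rows repeat text that already appears elsewhere in the frame. The notebook does not say how these repeats are handled; the following is a minimal sketch, on made-up data, of how they could be counted and optionally removed. Whether to deduplicate on text alone or on product plus text is an assumption left to the reader.

```python
import pandas as pd

# Toy data standing in for the merged review frame
toy = pd.DataFrame({
    "product_name": ["A", "A", "B", "B"],
    "review_text": ["Great fit", "Great fit", "Great fit", "Too small"],
})

# Rows whose text repeats an earlier row, regardless of product
text_dupes = toy.duplicated(subset=["review_text"]).sum()

# Rows that repeat both the product and the text of an earlier row
exact_dupes = toy.duplicated(subset=["product_name", "review_text"]).sum()

# Keep the first occurrence of each (product, text) pair
deduped = toy.drop_duplicates(subset=["product_name", "review_text"]).reset_index(drop=True)

print(text_dupes, exact_dupes, len(deduped))
```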

In [56]:
df['review_text'].unique()

array(['The series S will set you up to game for years.  Add game pass ultimate and for less than the cost of a good lincy you will have hundreds of hours of fun a month.  The S with a terrabyte of storage is the perfect gateway to the world of xbox!',
       "Ordered Xbox series S, received a PS5 controller instead. Called customer service, ended up overseas with some idiot that don't understand what happened. Meanwhile I'm forced to dispute the charges with my credit card company because the idiots in customer service say the package was delivered. I never said I didn't get a delivery...I said you delivered the wrong 💩❗",
       "This product is absolutely amazing, the loading times are great, graphics are wonderful and it is silent. It doesn't sound like a jet engine going off like a PS4. Absolutely love it. This product is absolutely amazing, the loading times are great, graphics are wonderful and it is silent. It doesn't sound like a jet engine going off like a PS4 absolutely love

In [57]:
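df.corr(numeric_only=True)  # restrict to numeric columns; pandas 2.x raises an error on text columns otherwise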

|   | Unnamed: 0 | product_rating | avg_rating |
|---|---|---|---|
| Unnamed: 0 | 1.0 | 0.006472 | -0.204858 |
| product_rating | 0.006472 | 1.0 | 0.357765 |
| avg_rating | -0.204858 | 0.357765 | 1.0 |


In [58]:
df.shape

(7095, 6)

## Data Cleaning

In [59]:
# checking for missing values
df.isnull().sum()

Unnamed: 0         0
product_name       0
review_text       26
product_rating     0
review_date        0
avg_rating         0
dtype: int64

In [60]:
df=df.dropna(subset=['review_text'])#remove rows with missing text
df.isnull().sum()

Unnamed: 0        0
product_name      0
review_text       0
product_rating    0
review_date       0
avg_rating        0
dtype: int64
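
`dropna` only removes rows where the text is actually missing; reviews that consist of an empty string or only whitespace would survive it. The notebook does not check for these, so the following is a sketch, on made-up data, of an additional filter that could be applied. Resetting the index afterwards is optional and is shown here as an assumption about what downstream code prefers.

```python
import pandas as pd

# Toy data: one missing value, one whitespace-only and one empty review
toy = pd.DataFrame({"review_text": ["Nice", None, "   ", "", "Bad zipper"]})

# Step 1: drop rows with missing text, as in the cell above
cleaned = toy.dropna(subset=["review_text"])

# Step 2: drop rows whose text is empty once surrounding whitespace is removed
has_text = cleaned["review_text"].str.strip().ne("")
cleaned = cleaned[has_text].reset_index(drop=True)

print(cleaned)
```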

In [62]:
df = df.drop(columns=['Unnamed: 0'])  # drop the leftover index column, which carries no information
df.head()

|   | product_name | review_text | product_rating | review_date | avg_rating |
|---|---|---|---|---|---|
| 0 | Microsoft Xbox Series S – 1TB White | The series S will set you up to game for years... | 5 | 10/31/2024 | 4.8 |
| 1 | Microsoft Xbox Series S – 1TB White | Ordered Xbox series S, received a PS5 controll... | 5 | 12/17/2024 | 4.8 |
| 2 | Microsoft Xbox Series S – 1TB White | This product is absolutely amazing, the loadin... | 5 | 2/1/2025 | 4.8 |
| 3 | Microsoft Xbox Series S – 1TB White | This console works fantastic. I was easily abl... | 5 | 11/1/2024 | 4.8 |
| 4 | Microsoft Xbox Series S – 1TB White | This product was the least expensive from the ... | 5 | 12/11/2024 | 4.8 |


In [63]:
# Convert 'review_date' to datetime format
df['review_date'] = pd.to_datetime(df['review_date'])
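
The sample rows show dates such as `10/31/2024` and `2/1/2025`, which are in month/day/year order. Letting pandas infer the format works here, but stating it explicitly avoids any ambiguity for values like `2/1/2025`. The sketch below, on made-up data, assumes every date in the files follows that pattern; `errors="coerce"` turns anything that does not into `NaT` so it can be inspected rather than raising.

```python
import pandas as pd

# Toy data: two valid dates and one malformed entry
toy = pd.DataFrame({"review_date": ["10/31/2024", "2/1/2025", "not a date"]})

# Parse with an explicit month/day/year format; unparseable values become NaT
parsed = pd.to_datetime(toy["review_date"], format="%m/%d/%Y", errors="coerce")

print(parsed)
print("unparseable rows:", parsed.isna().sum())
```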


In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7069 entries, 0 to 7094
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   product_name    7069 non-null   object        
 1   review_text     7069 non-null   object        
 2   product_rating  7069 non-null   int64         
 3   review_date     7069 non-null   datetime64[ns]
 4   avg_rating      7069 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 651.4+ KB
