<a href="https://colab.research.google.com/github/E-Juliet/Mobile-Phone-Sentiment-Analysis/blob/main/Mobile_Phone_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding

## 1.1 Problem Statement

Purchasing a product is an interaction between two parties: consumers and business owners. Consumers often rely on reviews to decide what to buy, while businesses want not only to sell their products but also to receive feedback through those reviews. Reviews shared online therefore carry considerable weight. People tend to base decisions on the experiences and opinions of others, because others strongly influence our beliefs, behaviors, perception of reality, and choices. Hence, we ask others for feedback whenever we are deciding whether to do something. This applies not only to consumers but also to organizations and institutions.

As social media networks have evolved, so have the ways that consumers express their opinions and feelings. With the vast amount of data now available online, it has become a challenge to extract useful information from it all. Sentiment analysis has emerged as a way to predict the polarity (positive, negative, or neutral) of consumer opinion, which can help consumers better understand the textual data.

E-commerce websites have grown so popular that consumers now rely on them for buying and selling. These websites let consumers comment on products and services, which has produced a huge number of reviews. Consequently, both vendors and consumers have a growing need to analyze these reviews to understand consumer feedback. However, it is difficult to read all the feedback for a particular item, especially for popular items with many comments.

In this research, we attempt to build a predictor of consumer satisfaction with mobile phone products based on their reviews. We will also attempt to understand the factors that contribute to classifying reviews as positive, negative, or neutral (based on important or most frequent words). We expect this to help companies improve their products and to help potential buyers make better purchasing decisions.

### Main objective
- To perform a sentiment analysis of mobile phone reviews from the Amazon website to determine how these reviews help consumers feel confident that they have made the right purchasing decision.

### Specific Objectives
- To help companies understand their consumers’ feedback so they can maintain or enhance their products and services.
- To provide insights to companies in curating offers on specific products to increase their profits and customer satisfaction.
- To understand the factors that contribute to classifying reviews as positive, negative, or neutral (based on important or most frequent words).
- To determine the key mobile phone features that influence smartphone purchases.
- To perform a market segmentation of consumers based on their reviews.
- To advise companies' advertising departments on which key features to use as selling points, and for which customer segments, in upcoming advertisements.


## 1.2 Metrics of Success

The best performing model will be selected based on:
- An accuracy score > 80%
- An F1 score > 0.85 
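As a rough illustration of how these two thresholds could be checked, the sketch below computes accuracy and a macro-averaged F1 score in plain Python. The averaging scheme is an assumption (the notebook does not state how F1 is aggregated across the positive, negative, and neutral classes), and the function names are hypothetical helpers rather than part of the project code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def meets_success_criteria(y_true, y_pred, min_accuracy=0.80, min_f1=0.85):
    """True only if both thresholds from section 1.2 are strictly exceeded."""
    return accuracy(y_true, y_pred) > min_accuracy and macro_f1(y_true, y_pred) > min_f1


# Toy labels for illustration only (not taken from the dataset)
y_true = ["positive", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]
print(accuracy(y_true, y_pred), meets_success_criteria(y_true, y_pred))
```

With real model output, a library implementation of these metrics would normally be used instead; the hand-written version only makes the definitions explicit.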


# 2. Data Understanding

The data used for this project was obtained from [data.world](https://data.world/promptcloud/amazon-mobile-phone-reviews) and contains more than 400 thousand reviews of unlocked mobile phones sold on [amazon.com](https://www.amazon.com/). The data has been collected since 2016 and was last updated in April 2022. The data contains 6 columns:
- Product_name: Contains the name of the product
- Brand: Contains the brand of the product
- Price: Contains the price of the product
- Rating: Contains the rating awarded to that product
- Reviews: Contains the review of that product
- Review_votes: Number of people who found the review helpful



# 3. Loading the Data

## 3.1 Loading the Libraries

In [70]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [71]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 3.2 Loading the Data

In [72]:
# loading the data

df = pd.read_csv('/content/drive/Shareddrives/Alpha/Data/Amazon Combined Data.csv')

## 3.3 Previewing the Data

In [73]:
# checking the shape of the data

print(f'The data has {df.shape[0]} rows and {df.shape[1]} columns')

The data has 17198 rows and 7 columns


In [74]:
# checking the data types of the data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17198 entries, 0 to 17197
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Rating                       17198 non-null  object
 1   Review Title                 17198 non-null  object
 2   Review                       17168 non-null  object
 3   Location and Date of Review  17198 non-null  object
 4   Affiliated Company           17198 non-null  object
 5   Brand and Features           17198 non-null  object
 6   Price                        17198 non-null  object
dtypes: object(7)
memory usage: 940.6+ KB


# 4. Data Cleaning

## 4.1 Missing values


In [75]:
# Getting the sum of missing values per column

df.isnull().sum()

Rating                          0
Review Title                    0
Review                         30
Location and Date of Review     0
Affiliated Company              0
Brand and Features              0
Price                           0
dtype: int64

Of the 7 columns, only the Review column has missing values (30 rows).

Since these 30 rows are a tiny share of the 17,198 rows, they can be dropped without losing much information.

In [76]:
# Dropping the missing values

df.dropna(inplace = True)

In [77]:
# Confirming there are no missing values 

df.isna().sum()

Rating                         0
Review Title                   0
Review                         0
Location and Date of Review    0
Affiliated Company             0
Brand and Features             0
Price                          0
dtype: int64

## 4.2 Duplicates

In [78]:
# Checking for duplicates

print(f"The data has {df.duplicated().sum()} duplicated rows")

The data has 6595 duplicated rows


In [79]:
# Exploring the duplicates

duplicates = df[df.duplicated(keep = 'first')]

duplicates.head(10)

| | Rating | Review Title | Review | Location and Date of Review | Affiliated Company | Brand and Features | Price |
|---|---|---|---|---|---|---|---|
| 30 | 4.0 out of 5 stars | \n.. not what ordered, not New... but it works... | \nSo first off...it's not what I ordered, but ... | Reviewed in the United States on February 11, ... | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 31 | 3.0 out of 5 stars | \nNot for Cricket Wireless and this two review... | \nThe phone itself is a okay android device, b... | Reviewed in the United States on February 4, 2021 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 32 | 3.0 out of 5 stars | \nWill not work on T-Mobile sysem!\n | \nNew phone write up indicates T-Mobile system... | Reviewed in the United States on June 7, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 33 | 3.0 out of 5 stars | \nA burner or for a kid\n | \nI use this as a burner w/o a sim card in it.... | Reviewed in the United States on April 14, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 34 | 4.0 out of 5 stars | \nIt works okay\n | \nIt works fine\n | Reviewed in the United States on August 13, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 35 | 3.0 out of 5 stars | \nPhone\n | \nSo far I don't like this phone at all, I thr... | Reviewed in the United States on May 10, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 36 | 1.0 out of 5 stars | \nIt died after a month. I figured out the pro... | \n\n\n\n\n The media could ... | Reviewed in the United States on June 17, 2021 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 37 | 5.0 out of 5 stars | \nBuena Compra\n | \nTal. Como està descrito….Todo lo necesario a... | Reviewed in the United States on July 25, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 38 | 3.0 out of 5 stars | \nEso no me gustó\n | \nNo vale la pena gastar dinero en el.\n | Reviewed in the United States on April 7, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |
| 39 | 2.0 out of 5 stars | \nDemasiado básico y lento, bajo costo pero no... | \nDemasiado básico y lento, bajo costo pero no... | Reviewed in the United States on July 6, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | $69.99 |


In [80]:
# Dropping the duplicates

df.drop_duplicates(inplace = True)




In [81]:
# Confirming if there are duplicates

df.duplicated().sum()

0
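The row counts reported so far can be tied together with a little arithmetic. The sketch below uses only the numbers printed in the outputs above; the variable names are illustrative. The result, 10,573 rows, agrees with the sum of the Brand value counts shown in section 4.3.

```python
# Figures copied from the notebook outputs above
raw_rows = 17198             # rows reported after loading the data
missing_review_rows = 30     # rows with a missing Review
duplicate_rows = 6595        # duplicated rows, counted after the missing rows were dropped

rows_after_dropna = raw_rows - missing_review_rows
rows_after_dedup = rows_after_dropna - duplicate_rows
duplicate_share = duplicate_rows / rows_after_dropna

print(f"Rows after dropping missing reviews: {rows_after_dropna}")
print(f"Rows after dropping duplicates: {rows_after_dedup}")
print(f"Share of rows that were duplicates: {duplicate_share:.1%}")
```

Roughly 38% of the non-missing rows were exact duplicates, so the cleaned dataset is considerably smaller than the raw one.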

## 4.3 Cleaning Specific Columns

#### Rating Column

In [82]:

# Extracting the leading digits in the Rating column and converting them to float type

df["Rating"] = df["Rating"].str.extract(r'(\d+)', expand=False).astype(float)

df["Rating"].head()

0    4.0
1    3.0
2    3.0
3    3.0
4    4.0
Name: Rating, dtype: float64

The numeric rating was extracted from the column and converted to the float data type.
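To make the extraction step concrete, here is a minimal standalone sketch using Python's `re` module on a single string. It assumes every rating follows the "X out of 5 stars" form seen in the preview; `parse_rating` and `RATING_PATTERN` are hypothetical names. Unlike `(\d+)`, this pattern also keeps a fractional part if one is present.

```python
import re

# Matches the number in front of "out of 5 stars", with an optional decimal part
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+out of\s+5 stars")


def parse_rating(text):
    """Return the star rating as a float, or None if the text has no rating."""
    match = RATING_PATTERN.search(text)
    return float(match.group(1)) if match else None


print(parse_rating("4.0 out of 5 stars"))
print(parse_rating("no rating here"))
```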

#### Price Column

In [83]:
# Extracting the leading digits (whole dollars) in the Price column and converting them to integer

df["Price"] = df["Price"].str.extract(r'(\d+)', expand=False).astype(int)

df["Price"].head()

0    69
1    69
2    69
3    69
4    69
Name: Price, dtype: int64

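The dollar sign was removed and the Price column was converted to the integer data type. Because the pattern `(\d+)` captures only the first run of digits, the cents are dropped (`$69.99` becomes `69`).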
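One caveat with taking only the first run of digits is that a price containing a thousands separator would be cut at the comma. The sketch below is one possible alternative, not the approach used in this notebook: it keeps the cents and removes commas before converting. `parse_price` is a hypothetical helper, and the `$1,299.99` value is made up for illustration.

```python
import re

# A digit, then any mix of digits and commas, then an optional decimal part
PRICE_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the price as a float (commas removed), or None if no number is found."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


print(parse_price("$69.99"))
print(parse_price("$1,299.99"))                           # hypothetical price
print(int(re.search(r"(\d+)", "$1,299.99").group(1)))     # first digit run only
```

Whether whole dollars are precise enough depends on how Price is used later; for coarse price bands the integer version may be sufficient.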

#### Affiliated company column

In [84]:
# Renaming "Affiliated Company" to "Brand" and "Brand and Features" to "Product_name"

df.rename(columns = {"Affiliated Company":"Brand","Brand and Features":"Product_name"},inplace = True)
df.head()

| | Rating | Review Title | Review | Location and Date of Review | Brand | Product_name | Price |
|---|---|---|---|---|---|---|---|
| 0 | 4.0 | \n.. not what ordered, not New... but it works... | \nSo first off...it's not what I ordered, but ... | Reviewed in the United States on February 11, ... | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 1 | 3.0 | \nNot for Cricket Wireless and this two review... | \nThe phone itself is a okay android device, b... | Reviewed in the United States on February 4, 2021 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 2 | 3.0 | \nWill not work on T-Mobile sysem!\n | \nNew phone write up indicates T-Mobile system... | Reviewed in the United States on June 7, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 3 | 3.0 | \nA burner or for a kid\n | \nI use this as a burner w/o a sim card in it.... | Reviewed in the United States on April 14, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 4 | 4.0 | \nIt works okay\n | \nIt works fine\n | Reviewed in the United States on August 13, 2022 | Visit the RCA Store | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |


The columns "Affiliated Company" and "Brand and Features" were renamed to "Brand" and "Product_name" respectively.

In [85]:
#Getting the value counts for the brand column

df['Brand'].value_counts()

Visit the Amazon Renewed Store    1747
Brand: Motorola                   1578
Visit the BLU Store               1544
Visit the TCL Store               1448
Brand: Amazon Renewed              989
Visit the OnePlus Store            684
Visit the SAMSUNG Store            522
Visit the Nokia Store              492
Visit the Google Store             389
Visit the JTEMAN Store             263
Visit the RCA Store                252
Visit the TracFone Store           140
Visit the NUU Store                 93
Brand: Nokia                        84
Visit the Dilwe Store               62
Visit the Black Shark Store         55
Visit the Ulefone Store             51
Visit the OUKITEL Store             46
Visit the UMIDIGI Store             40
Visit the Easyfone Store            30
Visit the Punkt. Store              23
Visit the CUBOT Store               14
Brand: Hipipooo                     14
Visit the total wireless Store      13
Name: Brand, dtype: int64

In [86]:
# Removing unnecessary words from the column to get the brand name
# (regex=False treats each word as a literal substring, e.g. the ':' in 'Brand:')
word_vocabulary = ['Visit', 'the', 'store', 'Brand:', 'Store']

for word in word_vocabulary:
    df['Brand'] = df['Brand'].str.replace(word, '', regex=False)

df['Brand'].value_counts()

  Amazon Renewed     1747
 Motorola            1578
  BLU                1544
  TCL                1448
 Amazon Renewed       989
  OnePlus             684
  SAMSUNG             522
  Nokia               492
  Google              389
  JTEMAN              263
  RCA                 252
  TracFone            140
  NUU                  93
 Nokia                 84
  Dilwe                62
  Black Shark          55
  Ulefone              51
  OUKITEL              46
  UMIDIGI              40
  Easyfone             30
  Punkt.               23
  CUBOT                14
 Hipipooo              14
  total wireless       13
Name: Brand, dtype: int64

In [87]:
# Removing leading and trailing whitespace

df['Brand'] = df['Brand'].str.strip()

# Renaming 'Amazon Renewed' to 'Amazon Refurbished'

df['Brand'] = df['Brand'].str.replace('Amazon Renewed', 'Amazon Refurbished', regex=False)

df['Brand'].value_counts()


Amazon Refurbished    2736
Motorola              1578
BLU                   1544
TCL                   1448
OnePlus                684
Nokia                  576
SAMSUNG                522
Google                 389
JTEMAN                 263
RCA                    252
TracFone               140
NUU                     93
Dilwe                   62
Black Shark             55
Ulefone                 51
OUKITEL                 46
UMIDIGI                 40
Easyfone                30
Punkt.                  23
CUBOT                   14
Hipipooo                14
total wireless          13
Name: Brand, dtype: int64

The Brand column contained boilerplate words (such as "Visit the ... Store" and "Brand:"), which were removed to leave only the brand names.
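Removing words one at a time as substrings works for the brands in this dataset, but it would also delete "the" or "Store" if they occurred inside a brand name. The sketch below shows one possible alternative that matches the two storefront formats as whole patterns. `clean_brand` and `BRAND_PATTERN` are hypothetical names, and "Lethe" is a made-up brand used only to show the difference; none of the brands listed above appears to be affected.

```python
import re

# Either "Visit the <name> Store" or "Brand: <name>", matched against the whole string
BRAND_PATTERN = re.compile(
    r"^(?:Visit the\s+(?P<store>.+?)\s+Store|Brand:\s*(?P<brand>.+))$"
)


def clean_brand(text):
    """Return the brand name without the storefront boilerplate."""
    text = text.strip()
    match = BRAND_PATTERN.match(text)
    if match is None:
        return text  # unknown format: leave the value unchanged
    return (match.group("store") or match.group("brand")).strip()


for raw in ["Visit the RCA Store", "Brand: Motorola", "Visit the Lethe Store"]:
    print(repr(raw), "->", repr(clean_brand(raw)))
```

In a DataFrame this could be applied with `df['Brand'].map(clean_brand)`, assuming the column contains only strings.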

#### Product_name column

In [89]:
# Getting value counts of the product_name column

df['Product_name'].value_counts()

        BLU Tank II T193 Unlocked GSM Dual-SIM Cell Phone w/ Camera and 1900 mAh Big Battery - Unlocked Cell Phones - Retail Packaging - Black Blue                                                                       989
        Apple iPhone 8, 64GB, Gold - Unlocked (Renewed)                                                                                                                                                                   989
        Samsung Galaxy S20 5G, 128GB, Cosmic Gray - Unlocked (Renewed)                                                                                                                                                    658
        TCL 20 SE 6.82" Unlocked Cellphone, 4GB RAM + 128GB ROM, US Version Android Smartphone 48MP AI Quad-Camera, 5000mAh Mobile Phone, Dual Speaker, OTG Reverse Charging, Android 11, Aurora Green                    629
        TCL 20 SE 6.82" Unlocked Cellphone, 4GB RAM + 128GB ROM, US Version Android Smartphone 48MP AI Quad-Came

In [88]:
df.head()

| | Rating | Review Title | Review | Location and Date of Review | Brand | Product_name | Price |
|---|---|---|---|---|---|---|---|
| 0 | 4.0 | \n.. not what ordered, not New... but it works... | \nSo first off...it's not what I ordered, but ... | Reviewed in the United States on February 11, ... | RCA | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 1 | 3.0 | \nNot for Cricket Wireless and this two review... | \nThe phone itself is a okay android device, b... | Reviewed in the United States on February 4, 2021 | RCA | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 2 | 3.0 | \nWill not work on T-Mobile sysem!\n | \nNew phone write up indicates T-Mobile system... | Reviewed in the United States on June 7, 2022 | RCA | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 3 | 3.0 | \nA burner or for a kid\n | \nI use this as a burner w/o a sim card in it.... | Reviewed in the United States on April 14, 2022 | RCA | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |
| 4 | 4.0 | \nIt works okay\n | \nIt works fine\n | Reviewed in the United States on August 13, 2022 | RCA | RCA Reno Smartphone, 4G LTE, 16GB, And... | 69 |


# 5. Feature Engineering

# 6. Exploratory Data Analysis (EDA)

# 7. Implementing the Solution

## 7.1 Preprocessing

# 8. Challenging the Solution

# 9. Conclusions

# 10. Recommendations