<a href="https://colab.research.google.com/github/E-Juliet/Mobile-Phone-Sentiment-Analysis/blob/main/Mobile_Phone_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding

## 1.1. Problem Statement

Purchasing a product is an interaction between two parties: consumers and business owners. Consumers often use reviews to decide which products to buy, while businesses want not only to sell their products but also to receive feedback through consumer reviews. Reviews of purchased products that are shared online therefore carry considerable weight. People tend to base decisions on the experiences and opinions of others, because others strongly influence our beliefs, behaviors, perception of reality, and choices. We therefore ask others for feedback whenever we are deciding whether to do something. This applies not only to individual consumers but also to organizations and institutions.

As social media networks have evolved, so have the ways that consumers express their opinions and feelings. With the vast amount of data now available online, extracting useful information from it has become a challenge. Sentiment analysis has emerged as a way to predict the polarity (positive, negative, or neutral) of consumer opinion, which helps readers make sense of large volumes of text.

E-commerce websites have grown so popular that consumers rely on them for buying and selling. These websites let consumers comment on products and services, which has produced a huge number of reviews. Consequently, both vendors and consumers increasingly need to analyze these reviews to understand consumer feedback. However, it is difficult to read all the feedback for a particular item, especially for popular items with many comments.

In this research, we attempt to build a model that predicts consumer satisfaction with mobile phones from their reviews. We also attempt to understand the factors that contribute to classifying reviews as positive, negative or neutral (based on important or most frequent words). This should help companies improve their products and help potential buyers make better purchasing decisions.


### 1.1.1. Main objective

- To perform a sentiment analysis of mobile phone reviews from the Amazon website to determine how these reviews help consumers feel confident that they have made the right purchasing decision.

### 1.1.2. Specific Objectives

- To help companies understand their consumers’ feedback so that they can maintain or enhance their products and services.
- To provide insights to companies in curating offers on specific products to increase their profits and customer satisfaction.
- To understand the factors that contribute to classifying reviews as positive, negative or neutral (based on important or most frequent words).
- To determine the key mobile phone features that influence smartphone purchases.
- To perform a market segmentation of consumers based on their reviews.
- To advise companies' advertising departments on which key features to use as selling points, and for which customer segments, in upcoming advertisements.

### 1.1.3. Metrics of Success

The best performing model will be selected based on:
- An accuracy score > 80%
- An F1 score > 0.85 
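
As a rough illustration of how these two thresholds could be checked, the sketch below computes accuracy and a macro-averaged F1 score in plain Python. The label lists are made up, and macro averaging is an assumption of this sketch, since the averaging method for the F1 score is not specified above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)

# Both targets from this section must be exceeded
meets_targets = acc > 0.80 and f1 > 0.85
print(f"accuracy={acc:.2f}, macro F1={f1:.2f}, meets targets: {meets_targets}")
```

With three classes, a model can reach a high accuracy while scoring poorly on a rare class, which is why the F1 score is worth reporting alongside accuracy.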


# 2. Data Understanding

The data used for this project was obtained by scraping phone reviews from the Amazon website. It contains 17,198 reviews of unlocked mobile phones sold on [amazon.com](https://www.amazon.com/). The scraped reviews date from November 2014 to July 2022.

The data contains 7 columns:
- Rating: the rating awarded to the product. Ratings are made on a 5-star scale, 5 being the highest.
- Review Title: a summary of the review.
- Review: the text of the review.
- Location and Date of Review: where the review was written and the date it was written.
- Affiliated Company: the brand selling the phone.
- Brand and Features: the name of the specific phone and its features.
- Price: the price of the phone in US dollars.



# 3. Loading Relevant Libraries & Data

## 3.1. Loading Libraries

In [None]:
import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS

# text processing
import nltk
import string
import re
from nltk import pos_tag
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import RegexpTokenizer, word_tokenize

# vectorization, splitting and evaluation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# expanding contractions (installed first, since it is not a default package)
!pip install contractions
import contractions

# nltk downloads
nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('tagsets', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 3.2. Loading Data

In [None]:
# loading the data

df = pd.read_csv('/content/drive/Shareddrives/Alpha/Data/Amazon Combined Data.csv')

# previewing the data

df.head()

Unnamed: 0,Rating,Review Title,Review,Location and Date of Review,Affiliated Company,Brand and Features,Price
0,4.0 out of 5 stars,"\n.. not what ordered, not New... but it works...","\nSo first off...it's not what I ordered, but ...","Reviewed in the United States on February 11, ...",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99
1,3.0 out of 5 stars,\nNot for Cricket Wireless and this two review...,"\nThe phone itself is a okay android device, b...","Reviewed in the United States on February 4, 2021",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99
2,3.0 out of 5 stars,\nWill not work on T-Mobile sysem!\n,\nNew phone write up indicates T-Mobile system...,"Reviewed in the United States on June 7, 2022",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99
3,3.0 out of 5 stars,\nA burner or for a kid\n,\nI use this as a burner w/o a sim card in it....,"Reviewed in the United States on April 14, 2022",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99
4,4.0 out of 5 stars,\nIt works okay\n,\nIt works fine\n,"Reviewed in the United States on August 13, 2022",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99


## 3.3. Previewing Data

In [None]:
# checking the shape of the data

print(f'The data has {df.shape[0]} rows and {df.shape[1]} columns')

The data has 17198 rows and 7 columns


In [None]:
# checking the data types of the data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17198 entries, 0 to 17197
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Rating                       17198 non-null  object
 1   Review Title                 17198 non-null  object
 2   Review                       17168 non-null  object
 3   Location and Date of Review  17198 non-null  object
 4   Affiliated Company           17198 non-null  object
 5   Brand and Features           17198 non-null  object
 6   Price                        17198 non-null  object
dtypes: object(7)
memory usage: 940.6+ KB


# 4. Data Cleaning

## 4.1. Missing values
- Checking for missing values


In [None]:
# Define a function to get missing data

def missing_data(data: pd.DataFrame) -> pd.Series:
    """
    Find the columns that have missing values and return a Series
    mapping each such column to its number of missing rows.
    """
    missing_counts = data.isna().sum()

    # keep only the columns with at least one missing value
    return missing_counts[missing_counts > 0]

In [None]:
# Getting the sum of missing values per column

missing_data(df).to_frame()

Unnamed: 0,0
Review,30


Out of the 7 columns, only the `Review` column has missing values.

Since only 30 of the 17,198 rows are affected, they can be dropped without losing much information.

In [None]:
# Dropping the missing values

df.dropna(inplace = True)

# Confirming there are no missing values 

print('The data has {} missing values'.format(df['Review'].isna().sum()))

The data has 0 missing values


## 4.2. Duplicates

In [None]:
# Checking for duplicates

print(f"The data has {df.duplicated().sum()} duplicated rows")

The data has 6595 duplicated rows


In [None]:
# Exploring the duplicates

duplicates = df[df.duplicated()]

duplicates.head(4)

Unnamed: 0,Rating,Review Title,Review,Location and Date of Review,Affiliated Company,Brand and Features,Price
30,4.0 out of 5 stars,"\n.. not what ordered, not New... but it works...","\nSo first off...it's not what I ordered, but ...","Reviewed in the United States on February 11, ...",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99
31,3.0 out of 5 stars,\nNot for Cricket Wireless and this two review...,"\nThe phone itself is a okay android device, b...","Reviewed in the United States on February 4, 2021",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99
32,3.0 out of 5 stars,\nWill not work on T-Mobile sysem!\n,\nNew phone write up indicates T-Mobile system...,"Reviewed in the United States on June 7, 2022",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99
33,3.0 out of 5 stars,\nA burner or for a kid\n,\nI use this as a burner w/o a sim card in it....,"Reviewed in the United States on April 14, 2022",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",$69.99


Duplicated rows will be dropped so that they do not bias the analysis and prediction process.

In [None]:
# Dropping the duplicates

df.drop_duplicates(inplace = True)


In [None]:
# Confirming if there are duplicates

print(f"The data has {df.duplicated().sum()} duplicated rows")

The data has 0 duplicated rows


## 4.3. Cleaning Specific Columns

### 4.3.1. Rating Column

For easier analysis, the numeric rating needs to be extracted and cast to an integer.

In [None]:
# Extracting the first run of digits in the Rating column and converting it to an integer type

df["Rating"] = df["Rating"].str.extract(r'(\d+)').astype(int)

# previewing the data

df["Rating"].head().to_frame()

Unnamed: 0,Rating
0,4
1,3
2,3
3,3
4,4


### 4.3.2. Price Column

The `Price` column contains a dollar sign, which prevents numeric analysis. The sign needs to be stripped and the value converted to a number. The cell below keeps only the whole-dollar part, so `$69.99` becomes `69`.

In [None]:
# Extracting the first run of digits in the Price column and converting it to an integer

df["Price"] = df["Price"].str.extract(r'(\d+)').astype(int)

df["Price"].head().to_frame()

Unnamed: 0,Price
0,69
1,69
2,69
3,69
4,69
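
The pattern `(\d+)` captures only the first run of digits in a string. The sketch below reproduces that behaviour with Python's `re` module and compares it with a hypothetical `parse_price` helper that keeps the cents. Both function names and the `$1,299.00` input are illustrative assumptions and are not part of the notebook.

```python
import re


def first_digits(text):
    """Return the first run of digits in the text as an int, or None."""
    match = re.search(r"(\d+)", text)
    return int(match.group(1)) if match else None


def parse_price(text):
    """Hypothetical alternative: drop thousands separators and keep the cents."""
    match = re.search(r"(\d+(?:\.\d+)?)", text.replace(",", ""))
    return float(match.group(1)) if match else None


rating = first_digits("4.0 out of 5 stars")   # first digit run is "4"
price_truncated = first_digits("$69.99")      # cents are dropped
price_full = parse_price("$69.99")            # cents are kept

# A price written with a thousands separator is cut at the comma
price_with_comma = first_digits("$1,299.00")
price_with_comma_full = parse_price("$1,299.00")

print(rating, price_truncated, price_full, price_with_comma, price_with_comma_full)
```

Truncating to whole dollars is harmless for most comparisons, but any price string containing a comma would be reduced to its leading digits, so it may be worth checking the raw `Price` values for commas before relying on the minimum price.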


### 4.3.3. Affiliated Company Column

In [None]:
# Rename the columns to Brand and Product_name

df.rename(columns = {"Affiliated Company":"Brand","Brand and Features":"Product_name"},inplace = True)

df.head(3)

Unnamed: 0,Rating,Review Title,Review,Location and Date of Review,Brand,Product_name,Price
0,4,"\n.. not what ordered, not New... but it works...","\nSo first off...it's not what I ordered, but ...","Reviewed in the United States on February 11, ...",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",69
1,3,\nNot for Cricket Wireless and this two review...,"\nThe phone itself is a okay android device, b...","Reviewed in the United States on February 4, 2021",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",69
2,3,\nWill not work on T-Mobile sysem!\n,\nNew phone write up indicates T-Mobile system...,"Reviewed in the United States on June 7, 2022",Visit the RCA Store,"RCA Reno Smartphone, 4G LTE, 16GB, And...",69


The `Affiliated Company` and `Brand and Features` columns were renamed to `Brand` and `Product_name` respectively.

In [None]:
#Getting the value counts for the brand column

df['Brand'].value_counts().to_frame()

Unnamed: 0,Brand
Visit the Amazon Renewed Store,1747
Brand: Motorola,1578
Visit the BLU Store,1544
Visit the TCL Store,1448
Brand: Amazon Renewed,989
Visit the OnePlus Store,684
Visit the SAMSUNG Store,522
Visit the Nokia Store,492
Visit the Google Store,389
Visit the JTEMAN Store,263


- The brand name will be extracted.
- White space will be stripped.
- Refurbished phones will be grouped under a single name.

In [None]:
# Removing the storefront wording ("Visit the ... Store", "Brand: ...") to get the brand name.
# The pattern is anchored to the start and end of the string so that
# letters inside a brand name (for example "the") are never removed.

df['Brand'] = df['Brand'].str.replace(r'^(?:Visit the|Brand:)\s*|\s*[Ss]tore$', '', regex=True)

# Removing leading and trailing white space

df['Brand'] = df['Brand'].str.strip()

# Renaming Amazon Renewed to Amazon Refurbished

df['Brand'] = df['Brand'].str.replace('Amazon Renewed', 'Amazon Refurbished', regex=False)

In [None]:
df['Brand'].value_counts().to_frame(name='Count')

Unnamed: 0,Count
Amazon Refurbished,2736
Motorola,1578
BLU,1544
TCL,1448
OnePlus,684
Nokia,576
SAMSUNG,522
Google,389
JTEMAN,263
RCA,252


# 5. Feature Engineering

### 5.1. Product_name Column

The `Product_name` column contains the model name, which will be extracted into a new column.
Some of the model names include the word "Smartphone", which will be removed as well.

In [None]:
# removing parenthesised text and punctuation from the column

df['Product_name'] = df['Product_name'].str.replace(r"\(.*\)", "", regex=True)

# regex=False treats each character literally; as a regex, '|' alone matches
# the empty string, so the pipe characters would not be removed
for char in ['-', ',', '|']:
    df['Product_name'] = df['Product_name'].str.replace(char, "", regex=False)

# splitting the strings in this column into one column per space-separated piece

string_cols = df["Product_name"].str.split(" ", expand = True)

# selecting the three words that form the phone type; positions 8 to 10 are used,
# presumably because leading spaces in the scraped names produce empty pieces first

model_words = [string_cols[8], string_cols[9], string_cols[10]]

# joining the three words into a single Model_Type column

df["Model_Type"] = model_words[0].str.cat(model_words[1:], sep = " ")

# removing unnecessary words from the model type column

for word in ['Smartphone']:
    df['Model_Type'] = df['Model_Type'].str.replace(word, '', regex=False)

df['Model_Type'] = df['Model_Type'].str.strip()
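
The same idea can be sketched on a single string in plain Python. The example product name is made up, and the `model_type` helper is hypothetical. Unlike the cell above, it uses `str.split()` without an argument, which ignores leading and repeated spaces, so the first three words can be taken directly instead of by fixed position.

```python
import re


def model_type(product_name, n_words=3):
    """Return the first few words of a product name, without 'Smartphone'."""
    # drop parenthesised text (greedy: from the first "(" to the last ")")
    text = re.sub(r"\(.*\)", "", product_name)
    # drop the hyphens, commas and pipes
    text = re.sub(r"[-,|]", "", text)
    # split() with no argument skips leading and repeated white space
    words = text.split()[:n_words]
    return " ".join(w for w in words if w != "Smartphone")


# Hypothetical scraped name with leading spaces
example = "        RCA Reno Smartphone, 4G LTE, 16GB (Unlocked) | Black"
result = model_type(example)
print(result)
```

Taking words by position after a white-space split is less sensitive to how many leading spaces the scraper left in each name.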


### 5.1.1 Location and Date of Review

All of this data was collected from the United States. The location text will be stripped, leaving only the review dates.

In [None]:
# previewing the column

df['Location and Date of Review'].head(3)

0    Reviewed in the United States on February 11, ...
1    Reviewed in the United States on February 4, 2021
2        Reviewed in the United States on June 7, 2022
Name: Location and Date of Review, dtype: object

In [None]:
# Extract the review dates from the Location and Date of Review Column

df['Location and Date of Review'] = df['Location and Date of Review']\
.str.replace('Reviewed in the United States on ', '')


# Rename the column to Review Date

df.rename(columns = {'Location and Date of Review': 'Review Date'}, inplace = True)

# Convert the column into datetime format

df['Review Date'] = pd.to_datetime(df['Review Date'], errors = 'coerce')
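
The date extraction can be illustrated on single strings with the standard library. In this sketch, `parse_review_date` is a hypothetical helper and the second input is a made-up review from another country. It returns `None` where the text does not match, which is comparable to the missing dates that `errors = 'coerce'` produces for unparseable values.

```python
from datetime import datetime

PREFIX = "Reviewed in the United States on "


def parse_review_date(raw):
    """Return the review date, or None if the string has an unexpected form."""
    if not raw.startswith(PREFIX):
        return None
    try:
        # "%B %d, %Y" matches dates such as "February 4, 2021"
        return datetime.strptime(raw[len(PREFIX):], "%B %d, %Y").date()
    except ValueError:
        return None


us_review = parse_review_date("Reviewed in the United States on February 4, 2021")
other_review = parse_review_date("Reviewed in Mexico on June 7, 2022")
print(us_review, other_review)
```

Any row whose text does not start with the United States prefix would end up without a date, so counting the missing values in `Review Date` after the conversion is a quick way to confirm that every review really came from the United States.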

In [None]:
df.head()

Unnamed: 0,Rating,Review Title,Review,Review Date,Brand,Product_name,Price,Model_Type
0,4,"\n.. not what ordered, not New... but it works...","\nSo first off...it's not what I ordered, but ...",2022-02-11,RCA,RCA Reno Smartphone 4G LTE 16GB Androi...,69,RCA Reno
1,3,\nNot for Cricket Wireless and this two review...,"\nThe phone itself is a okay android device, b...",2021-02-04,RCA,RCA Reno Smartphone 4G LTE 16GB Androi...,69,RCA Reno
2,3,\nWill not work on T-Mobile sysem!\n,\nNew phone write up indicates T-Mobile system...,2022-06-07,RCA,RCA Reno Smartphone 4G LTE 16GB Androi...,69,RCA Reno
3,3,\nA burner or for a kid\n,\nI use this as a burner w/o a sim card in it....,2022-04-14,RCA,RCA Reno Smartphone 4G LTE 16GB Androi...,69,RCA Reno
4,4,\nIt works okay\n,\nIt works fine\n,2022-08-13,RCA,RCA Reno Smartphone 4G LTE 16GB Androi...,69,RCA Reno


### 5.1.2. Review Title and Review 

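Some rows in `Review` and `Review Title` are not in English.
These rows will be dropped. They are identified by checking for non-ASCII characters, which is only a heuristic: it also removes English text that contains emoji or typographic quotes, and it keeps non-English text written without accents.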


In [None]:
# Explore review title rows not in English
df['Review Title'].iloc[7:10].to_frame()

Unnamed: 0,Review Title
7,\nBuena Compra\n
8,\nEso no me gustó\n
9,"\nDemasiado básico y lento, bajo costo pero no..."


In [None]:
# Explore review rows not in English
df['Review'].iloc[7:10].to_frame()

Unnamed: 0,Review
7,\nTal. Como està descrito….Todo lo necesario a...
8,\nNo vale la pena gastar dinero en el.\n
9,"\nDemasiado básico y lento, bajo costo pero no..."


In [None]:
# Drop rows containing non-ASCII characters (a proxy for non-English text)
df = df[df['Review Title'].map(lambda x: x.isascii())]
df = df[df['Review'].map(lambda x: x.isascii())]
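
A short sketch of how the `isascii()` filter behaves on individual strings. The two Spanish examples come from the review titles shown above; the other two strings are made up for illustration.

```python
reviews = [
    "It works fine",             # English, ASCII only: kept
    "Eso no me gustó",           # Spanish with an accented character: dropped
    "Buena Compra",              # Spanish, but ASCII only: kept
    "Great phone, love it 😀",   # English with an emoji: dropped
]

# keep only the strings made up entirely of ASCII characters
kept = [text for text in reviews if text.isascii()]
print(kept)
```

The filter makes both kinds of mistake, so a dedicated language detection library would be a more reliable option if the remaining non-English rows turn out to matter.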

### 5.1.3. Reorder Columns in the Dataframe
The columns will be reordered so that the review text is the last column.

In [None]:
df = df.reindex(columns=['Product_name', 'Model_Type', 'Brand', 'Price', 'Review Date', 
                         'Rating', 'Review Title', 'Review'])

In [None]:
# Explore cleaned dataframe

df.head()

Unnamed: 0,Product_name,Model_Type,Brand,Price,Review Date,Rating,Review Title,Review
0,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-02-11,4,"\n.. not what ordered, not New... but it works...","\nSo first off...it's not what I ordered, but ..."
2,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-06-07,3,\nWill not work on T-Mobile sysem!\n,\nNew phone write up indicates T-Mobile system...
3,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-04-14,3,\nA burner or for a kid\n,\nI use this as a burner w/o a sim card in it....
4,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-08-13,4,\nIt works okay\n,\nIt works fine\n
5,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-05-10,3,\nPhone\n,"\nSo far I don't like this phone at all, I thr..."


It can be noted that the `Review` and `Review Title` columns have leading and trailing newline characters (`\n`). These need to be removed.

In [None]:
# removing leading and trailing \n from the texts

df['Review Title'] = df['Review Title'].str.strip()
df['Review'] = df['Review'].str.strip()

# previewing the dataframe 

df.head()

Unnamed: 0,Product_name,Model_Type,Brand,Price,Review Date,Rating,Review Title,Review
0,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-02-11,4,".. not what ordered, not New... but it works s...","So first off...it's not what I ordered, but I ..."
2,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-06-07,3,Will not work on T-Mobile sysem!,New phone write up indicates T-Mobile system c...
3,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-04-14,3,A burner or for a kid,I use this as a burner w/o a sim card in it. J...
4,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-08-13,4,It works okay,It works fine
5,RCA Reno Smartphone 4G LTE 16GB Androi...,RCA Reno,RCA,69,2022-05-10,3,Phone,"So far I don't like this phone at all, I threw..."


In [None]:
# Checking whether the changes have introduced new missing values, and dropping them if so
print(missing_data(df))
df.dropna(inplace = True)

Series([], dtype: int64)


In [None]:
# Explore the shape of the cleaned dataframe
df.shape

(9156, 8)

In [None]:
# Dropping unnecessary columns

df.drop('Product_name', axis=1, inplace=True)

### Creating the Labels Based on the Ratings

In [None]:
df.Rating.head()

0    4
2    3
3    3
4    4
5    3
Name: Rating, dtype: int64

In [None]:
def to_sentiment(rating):
    """Map a star rating to a label: 1-2 negative, 3 neutral, 4-5 positive."""
    if rating <= 2:
        return 'negative'
    elif rating == 3:
        return 'neutral'
    else:
        return 'positive'

df['ratings_sentiment'] = df['Rating'].apply(to_sentiment)

In [None]:
df['ratings_sentiment'].head()

0    positive
2     neutral
3     neutral
4    positive
5     neutral
Name: ratings_sentiment, dtype: object

In [None]:
df.head()

Unnamed: 0,Model_Type,Brand,Price,Review Date,Rating,Review Title,Review,ratings_sentiment
0,RCA Reno,RCA,69,2022-02-11,4,".. not what ordered, not New... but it works s...","So first off...it's not what I ordered, but I ...",positive
2,RCA Reno,RCA,69,2022-06-07,3,Will not work on T-Mobile sysem!,New phone write up indicates T-Mobile system c...,neutral
3,RCA Reno,RCA,69,2022-04-14,3,A burner or for a kid,I use this as a burner w/o a sim card in it. J...,neutral
4,RCA Reno,RCA,69,2022-08-13,4,It works okay,It works fine,positive
5,RCA Reno,RCA,69,2022-05-10,3,Phone,"So far I don't like this phone at all, I threw...",neutral


In [None]:
# Make a copy of the dataset
data = df.copy()

# 6. Exploratory Data Analysis (EDA)

- Create a pandas profile.
- Find out the relationship between product rating and reviews.
- Explore the relationship between brand and price.
- Explore the relationship between brand and number of reviews.
- Plot a word cloud of the most-used words in reviews.
- Plot the trend of reviews over the years.
- Find out the relationship between price and product rating.




### 6.1 Pandas Profiling


In [None]:
# report = ProfileReport(df, title='Pandas Profiling Report')
# report

The pandas profile report can be saved to an HTML file:

In [None]:
# report.to_file(output_file='Amazon_Pandas_profile.html')

Summary of Profile Report:

Overview
- The dataset has 9144 rows and 9 columns, 6 of which are categorical, 1 is a date column and 2 are numerical variables.

Variables

- The minimum price is 1 dollar and the maximum price is 799 dollars. This may indicate outliers that need to be addressed later.
- The top smartphone brands within the dataset are Amazon Refurbished, Motorola, BLU, TCL and OnePlus.
- The review dates range from November 2014 to September 2022.
- The most common words in the review column are Good, Love it, excelente, great phone and nice.
- The most common words in the review title column are great phone, good phone and good.

Missing Values

- No missing values in the dataset.

### 6.2. Relationship between ratings and reviews

In [None]:
# Group by relevant columns
ratings_review = df.groupby('Rating')['Review'].count()

In [None]:
# plot the data
ax1 = ratings_review.plot(kind='bar', figsize=(15,8), color="green", fontsize = 13);
ax1.set_alpha(0.8)
ax1.set_title('Distribution of Reviews by Product Rating', fontsize = 20)
ax1.set_ylabel("Number of Reviews", fontsize = 15);
ax1.set_xlabel("Ratings", fontsize = 15 , rotation = 60)
plt.show();


Reviews with the highest rating are the most numerous, followed by reviews with the lowest rating.

### 6.3. Brands with the most reviews and their highest prices

In [None]:
# Group relevant columns
brand_reviews = df.groupby('Brand')['Review'].count().sort_values(ascending = False).head(10)
brand_prices = df.groupby('Brand')['Price'].max().sort_values(ascending = False).head(10)

In [None]:
# Plot the data
ax1 = brand_reviews.plot(kind = 'bar', figsize = (15,8), color = 'green', fontsize = 13);
ax1.set_alpha(0.8)
ax1.set_title('Number of Reviews by Brands', fontsize = 26)
ax1.set_ylabel('Number of Reviews', fontsize = 20);
ax1.set_xlabel('Brand', fontsize = 20)
plt.show();



In [None]:
ax2 = brand_prices.plot(kind = 'bar', figsize = (15,8), color = 'green', fontsize = 13);
ax2.set_alpha(0.8)
ax2.set_title('Priciest Phones by Brands', fontsize = 26)
ax2.set_ylabel('Price', fontsize = 20);
ax2.set_xlabel('Brand', fontsize = 20)
plt.xticks(fontsize = 18)
plt.show();

Refurbished phones have the highest number of reviews, while Samsung has the priciest phones. Samsung is one of the world's top phone brands, and its high-end models may explain the high maximum price.

### 6.4. Word cloud of the most-used words in reviews.

In [None]:
# Plot a word cloud
from wordcloud import WordCloud, STOPWORDS

# named wc_stopwords so that it does not shadow nltk's stopwords module
wc_stopwords = set(STOPWORDS)

# lowercase every review and join them all into one long string
comment_words = " ".join(str(val).lower() for val in df.Review)

wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = wc_stopwords,
                min_font_size = 10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

Good, great, battery, price, issue and work are some of the words that appear most often in the reviews.

### 6.5. Trend of reviews over the years

In [None]:
# Create a new dataframe to use while maintaining the original
new_df = df.copy()

# Extract year from the new dataframe
new_df['year'] = new_df['Review Date'].dt.year

# Group by relevant columns
review_date = new_df.groupby('year')['Review'].count()
review_date = pd.DataFrame(review_date).reset_index()

# plot the data
review_date.plot.line(x = 'year', y = 'Review', color = 'green', figsize=(15, 8));


The number of customer reviews rose sharply from 2020. This could be attributed to the COVID-19 pandemic, when people spent a lot of time indoors and online shopping was at an all-time high.

### 6.6. Relationship between price and Ratings

In [None]:
# Count the number of ratings at each price point
review_price = new_df.groupby('Price')['Rating'].count()
review_price = pd.DataFrame(review_price).reset_index()

# Plot the data (the y-axis is the number of ratings, not the rating value)
fig, ax = plt.subplots(figsize=(15,8))
sns.scatterplot(x = 'Price', y = 'Rating', data = review_price, color = 'green')
ax.set_ylabel('Number of Ratings')
plt.show();

The scatter plot above shows how many ratings each price point received. It does not show a correlation between price and rating.
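
To relate price to the rating value itself rather than to the number of ratings, one option is to average the ratings at each price and compute a correlation coefficient. The sketch below does this in plain Python on made-up (price, rating) pairs. The data and the `pearson` helper are illustrative assumptions, not results from the dataset.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (price, rating) pairs
rows = [(69, 4), (69, 3), (199, 5), (199, 4), (799, 5), (799, 2)]

# collect the ratings observed at each price
by_price = defaultdict(list)
for price, rating in rows:
    by_price[price].append(rating)

# mean rating per price point
mean_rating = {price: sum(r) / len(r) for price, r in by_price.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


prices = sorted(mean_rating)
r = pearson(prices, [mean_rating[p] for p in prices])
print(mean_rating, round(r, 3))
```

In the notebook, the equivalent aggregation would replace `count()` with `mean()` in the grouping step. A coefficient close to zero would support the statement that price and rating are not correlated.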

# 7. Implementing the Solution

## 7.1 Preprocessing

- tokenization
- lowercasing our words
- lemmatization/stemming
- vectorization

The first step is to expand the contractions in the reviews, so that, for example, `isn't` becomes `is not`.

In [None]:
# fixing contractions
def text_contraction(text):
  
  # creating an empty list
  expanded_words = []

  for word in text.split():
    # using contractions.fix to expand the shortened words
    expanded_words.append(contractions.fix(word))  
    
  expanded_text = ' '.join(expanded_words)

  return expanded_text

In [None]:
# apply the contraction function to our reviews

df['Review Title'] = df['Review Title'].map(lambda x: text_contraction(x))
df['Review'] = df['Review'].map(lambda x: text_contraction(x))
df.head()

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

In [None]:
# build the stop word list from stopwords.words('english') plus punctuation

stopwords_list = stopwords.words('english') + list(string.punctuation)
stopwords_list += ["''", '""', '...', '``']

# try the stop word removal on the white-space separated words of the first review
tokens = df['Review'].iloc[0].split()
clean_list = [word.lower() for word in tokens if word.lower() not in stopwords_list]

print(clean_list)

In [None]:
# tokenization

def tokenize_words(text):
  # remove the digits
  text = re.sub(r'\d', '', text)

  # create our word tokens
  tokens = word_tokenize(text)

  # punctuation marks and punctuation-like tokens to discard
  # (stop words are not removed at this stage)
  punctuation_list = list(string.punctuation)
  punctuation_list += ["''", '""', '...', '``', '..', '....']

  # lowercase the tokens and remove the punctuation
  clean_list = [word.lower() for word in tokens if word.lower() not in punctuation_list]

  # return a clean list of tokens
  return clean_list

In [None]:
df['Review'] = df['Review'].map(lambda x: tokenize_words(x))
df['Review Title'] = df['Review Title'].map(lambda x: tokenize_words(x))
df.head()

After tokenization, lemmatization reduces the words to their base forms (lemmas). Before that, the words are tagged with their parts of speech so that the lemmatizer treats each word correctly.

In [None]:
# create a function that takes in the nltk POS tags
# and transforms them to wordnet tags
def wordnet_pos(word_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if word_tag.startswith('J'):
        return wordnet.ADJ
    elif word_tag.startswith('V'):
        return wordnet.VERB
    elif word_tag.startswith('N'):
        return wordnet.NOUN
    elif word_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [None]:
# check to see how the word tag performs
sentence = df["Review"][0]
pos_tag(sentence[0:20])

In [None]:
# instantiate the lemmatizer
lemmatizer = WordNetLemmatizer() 

In [None]:
df['Review'][0:2]

In [None]:
def word_lemma(tokens):
    '''
    Translate the POS tags of the tokens to WordNet tags, then pass them to the lemmatizer
    '''
    # get the pos tags for the tokens
    word_pos_tags = pos_tag(tokens)

    # translate the pos tags to word net tags
    word_net_tags = [(word, wordnet_pos(tag)) for word, tag in word_pos_tags]

    # pass each word with its wordnet tag to the lemmatizer
    lemma_words = [lemmatizer.lemmatize(word, tag) for word, tag in word_net_tags]

    return lemma_words
    

In [None]:
nltk.download('all', quiet=True)
df['Review'] = df['Review'].apply(word_lemma)
df['Review Title'] = df['Review Title'].apply(word_lemma)
df.head()

Frequency Distribution Plot

In [None]:
# create the frequency distribution plot
sample = df['Review']
freq1_dist = []

for review in sample:
    freq1_dist.extend(review)
    
fdist = FreqDist(freq1_dist)
plt.figure(figsize=(15, 10))
fdist.plot(50);

In [None]:
# get the 200 most common tokens and display the top 30
word_frequency = fdist.most_common(200)
word_frequency[:30]

In [None]:
# convert the list into a string

X = df['Review']

df["Review"]= X.map(lambda x: ' '.join(map(str, x)))
df["Review"]

In [None]:
df.columns

Vader Analysis

In [None]:
df["id"] = df.index + 1
df = df.reindex(columns=['id', 'Model_Type', 'Brand', 'Price', 'Review Date', 
                         'Rating', 'Review Title', 'Review', 'ratings_sentiment'])
df.drop("Review Date", axis=1, inplace=True)
df.head()

In [None]:
# creating polarity scores on the entire dataset
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
  text = row["Review"]
  review_id = row["id"]  # named review_id to avoid shadowing the built-in id()
  res[review_id] = sia.polarity_scores(text)


In [None]:
# populating results into a df
vader = pd.DataFrame(res).T
vader = vader.reset_index().rename(columns={"index":"id"})

# merge on the shared id column so that the join key is explicit
vader_df = vader.merge(df, how="left", on="id")

In [None]:
# sentiment scores combined with the meta data


# pd.set_option('display.max_colwidth',10000)
vader_df.head()

In [None]:
# plotting a bar plot of the compound value by rating
# to see if the ratings align with the sentiment scores
df_sorted = vader_df.sort_values('Rating')
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(data=df_sorted, x = "Rating", y="compound")
ax.set_title("Compound Score by Amazon Star Rating")
plt.show()

The working assumption is that a review with a rating of 5 is likely to have a more positive compound score than a review with a rating of 1.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

sns.barplot(data=df_sorted, x = "Rating", y="pos", ax=axs[0])
sns.barplot(data=df_sorted, x = "Rating", y="neu", ax=axs[1])
sns.barplot(data=df_sorted, x = "Rating", y="neg", ax=axs[2])
axs[0].set_title("Positive")
axs[1].set_title("Neutral")
axs[2].set_title("Negative")
plt.tight_layout()
plt.show()

The more positive the compound value, the higher the rating. High ratings of 4 and 5 mostly have positive values, while low ratings of 1 and 2 mainly show negative values. This supports the assumption made about the data.
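
The bar plots show the mean score per rating (seaborn's `barplot` uses the mean as its default estimator). The sketch below computes the same kind of per-rating mean in plain Python so the trend can be checked numerically. The `toy_scores` pairs and the helper name `mean_compound_by_rating` are illustrative assumptions, not values or code from the notebook; on the real data, `vader_df.groupby("Rating")["compound"].mean()` should give the equivalent result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (rating, compound) pairs; real values would come from
# vader_df[["Rating", "compound"]]
toy_scores = [(1, -0.62), (1, -0.10), (3, 0.05), (5, 0.71), (5, 0.88)]

def mean_compound_by_rating(pairs):
    """Return {rating: mean compound score}, ordered by rating."""
    grouped = defaultdict(list)
    for rating, compound in pairs:
        grouped[rating].append(compound)
    return {rating: mean(values) for rating, values in sorted(grouped.items())}

means = mean_compound_by_rating(toy_scores)
print(means)
```

If the assumption holds, the means should increase as the rating goes from 1 to 5.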

## Modeling

In [None]:
# splitting data 

# df_train, df_test = train_test_split(
#   df,
#   test_size=0.3,
#   random_state=23
# )
# df_val, df_test = train_test_split(
#   df_test,
#   test_size=0.5,
#   random_state=23
# )

### Vader Analysis

In [None]:
y = df['ratings_sentiment']
X = df.drop('ratings_sentiment', axis=1)

In [None]:
# Exploring vader as a model for sentiment analysis

analyzer = SentimentIntensityAnalyzer()

In [None]:
# score each review, keep the compound value, and map it to a label
# using cut-offs of +0.05 (positive) and -0.05 (negative)
X["scores"] = X["Review"].apply(analyzer.polarity_scores)
X["compound"] = X["scores"].apply(lambda score_dict: score_dict["compound"])
X["vader_label"] = X["compound"].apply(
    lambda c: "positive" if c >= 0.05 else "negative" if c <= -0.05 else "neutral"
)

X.head()

In [None]:
# Exploring the unique elements in label column

X["vader_label"].value_counts()

In [None]:
# plotting distribution of sentiment scores
colors = ['green', 'blue', 'magenta']
X["vader_label"].value_counts().plot(kind="bar", color = colors, figsize=(10, 8))
plt.xlabel("Sentiment Label", size=12)
plt.ylabel("Count", size=12)
plt.title("Distribution of Sentiment Scores", size=14);

In [None]:
print(classification_report(y, X['vader_label']))

### Text Blob

In [None]:
# import the model
from textblob import TextBlob

In [None]:
# Implementing the textblob analysis

# TextBlob returns a (polarity, subjectivity) pair for each review
X[["polarity", "subjectivity"]] = X["Review"].apply(lambda text: pd.Series(TextBlob(text).sentiment))

In [None]:
# exploring the positive comments

X[X.polarity > 0].tail()

In [None]:
# exploring the negative comments

# the polarity column lives on X, not df
X[X.polarity < 0].head()

In [None]:
# exploring neutral comments

# the polarity column lives on X, not df
X[X.polarity == 0].head()

In [None]:
X["Text_blob_labels"] = X["polarity"].apply(lambda c : "positive" if c > 0 else "negative" if c < 0 else "neutral")

In [None]:
X.head()

In [None]:
print(classification_report(y, X["Text_blob_labels"]))

### RoBERTa Model: Neural Network Model

In [None]:
# Transformer models account for context
!pip install transformers
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [None]:
# reuse a pretrained sentiment model: load its tokenizer and weights
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
# preview the data

data.head()

In [None]:
data["id"] = data.index + 1
data = data.reindex(columns=['id', 'Model_Type', 'Brand', 'Price', 'Review Date', 
                         'Rating', 'Review Title', 'Review', 'ratings_sentiment'])
data.head()

In [None]:
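def polarity_scores_roberta(words):
  # tokenize the review and run it through the model
  encoded_text = tokenizer(words, return_tensors="pt")
  output = model(**encoded_text)
  # output[0] holds the logits; take the first (only) item in the batch
  scores = output[0][0].detach().numpy()
  # convert the logits to probabilities
  scores = softmax(scores)
  # label order for this model: negative, neutral, positive
  scores_dict = {
      "roberta_neg": scores[0],
      "roberta_neu": scores[1],
      "roberta_pos": scores[2],
  }
  return scores_dict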
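
The model returns raw logits, and `softmax` turns them into three probabilities that sum to 1. The snippet below is a plain-Python sketch of that conversion for illustration only; `softmax_sketch` and `toy_logits` are hypothetical names and values, and the notebook itself uses `scipy.special.softmax`.

```python
import math

def softmax_sketch(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [negative, neutral, positive]
toy_logits = [-1.2, 0.3, 2.1]
probs = softmax_sketch(toy_logits)

# Pair each probability with the key used in scores_dict and pick the largest
labels = ["roberta_neg", "roberta_neu", "roberta_pos"]
top_label = labels[probs.index(max(probs))]
print(dict(zip(labels, probs)), top_label)
```

Taking the label with the highest probability is one simple way to turn the three scores into a single predicted sentiment that can be compared with `ratings_sentiment`.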


In [None]:
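res = {}
for i, row in tqdm(data.iterrows(), total=len(data)):
  # read the id first so the error message always refers to the current row
  review_id = row["id"]
  try:
    text = row["Review"]
    # store the scores; without this assignment res would stay empty
    res[review_id] = polarity_scores_roberta(text)
  except RuntimeError:
    # raised for reviews the model cannot process, e.g. inputs that are too long
    print(f'Broke for id {review_id}')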

In [None]:
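# collect the RoBERTa scores into their own dataframe (one row per review id)
# so the earlier VADER results are not overwritten
roberta = pd.DataFrame(res).T
# roberta = roberta.reset_index().rename(columns={"index":"id"})
# roberta_df = roberta.merge(df, how="left")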

### Bert Model

In [None]:
! pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

In [None]:
! pip install transformers

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification


In [None]:
import torch

# 8. Challenging the Solution

# 9. Conclusions

# 10. Recommendations