# Analysis of Cell Phone Reviews

***************************************************

## Importing required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing and cleaning the dataset

In [2]:
df = pd.read_csv('data/20191226-reviews.csv')

In [3]:
df.head()

Unnamed: 0,asin,name,rating,date,verified,title,body,helpfulVotes
0,B0000SX2UC,Janet,3,"October 11, 2005",False,"Def not best, but not worst",I had the Samsung A600 for awhile which is abs...,1.0
1,B0000SX2UC,Luke Wyatt,1,"January 7, 2004",False,Text Messaging Doesn't Work,Due to a software issue between Nokia and Spri...,17.0
2,B0000SX2UC,Brooke,5,"December 30, 2003",False,Love This Phone,"This is a great, reliable phone. I also purcha...",5.0
3,B0000SX2UC,amy m. teague,3,"March 18, 2004",False,"Love the Phone, BUT...!","I love the phone and all, because I really did...",1.0
4,B0000SX2UC,tristazbimmer,4,"August 28, 2005",False,"Great phone service and options, lousy case!",The phone has been great for every purpose it ...,1.0


In [4]:
df.shape

(67986, 8)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67986 entries, 0 to 67985
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   asin          67986 non-null  object 
 1   name          67984 non-null  object 
 2   rating        67986 non-null  int64  
 3   date          67986 non-null  object 
 4   verified      67986 non-null  bool   
 5   title         67972 non-null  object 
 6   body          67965 non-null  object 
 7   helpfulVotes  27215 non-null  float64
dtypes: bool(1), float64(1), int64(1), object(5)
memory usage: 3.7+ MB


In [6]:
df.isnull().sum()

asin                0
name                2
rating              0
date                0
verified            0
title              14
body               21
helpfulVotes    40771
dtype: int64

- We have $14$ null values in the *title* column of our dataset (the *rating* column has none).
- Since the titles will be the input to our model, we have to drop the rows where the title is null.

In [7]:
title_miss = df[df.title.isnull()]

In [8]:
title_miss

Unnamed: 0,asin,name,rating,date,verified,title,body,helpfulVotes
30010,B01NB1KG8U,Sylvester Ofosuhene,5,"December 24, 2019",True,,,
30949,B06XR1K6HR,MOHAMED ALI,5,"January 17, 2019",True,,Almost like pretty new,
32883,B06XSF5C42,Candice,5,"June 13, 2019",True,,Love this phone. Everything's worked great. So...,
35016,B071H9KKKF,Wauany,5,"June 17, 2018",True,,Like the phone so far!!! Never had an expensiv...,
42935,B077T4MVZ6,Evaldina,4,"November 14, 2018",True,,Love it,
45899,B079X7DQ4Q,Roberto,5,"November 25, 2019",True,,,
45905,B079X7DQ4Q,Mahmood al rahawi,5,"December 7, 2018",True,,"I get that phone I needed ,, thanks .",
46470,B07BHT4KGM,Roberto,5,"November 25, 2019",True,,,
46476,B07BHT4KGM,Mahmood al rahawi,5,"December 7, 2018",True,,"I get that phone I needed ,, thanks .",
50404,B07FZH9BGV,Henry,5,"November 1, 2018",True,,Great phone...A++,1.0


In [9]:
title_miss[['rating', 'title', 'body']]

Unnamed: 0,rating,title,body
30010,5,,
30949,5,,Almost like pretty new
32883,5,,Love this phone. Everything's worked great. So...
35016,5,,Like the phone so far!!! Never had an expensiv...
42935,4,,Love it
45899,5,,
45905,5,,"I get that phone I needed ,, thanks ."
46470,5,,
46476,5,,"I get that phone I needed ,, thanks ."
50404,5,,Great phone...A++


- Looking closely at the rows shown, every **missing title** comes with a rating of 4 or 5 in the *rating* column.
- Where the *body* column is filled in for a **missing title**, it also suggests that the customer is satisfied with the purchase.

> So missing titles appear to go along with high ratings.<br>
> However, we cannot use a missing title to train our model, so we have to drop these rows.
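The observation above can also be checked numerically rather than by eye, by comparing the mean rating of rows with and without a title. This is a minimal sketch on a toy frame; the values are made up for illustration and are not rows from the dataset, and the names `toy`, `missing` and `clean` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "rating": [5, 5, 4, 1, 3, 5, 2],
    "title": [None, None, None, "Bad", "Okay", "Great", "Meh"],
})

# Flag rows whose title is missing, then compare mean ratings per group.
missing = toy["title"].isnull()
mean_by_missing = toy.groupby(missing)["rating"].mean()
print(mean_by_missing)

# Dropping the rows is the same mask inverted. reset_index(drop=True)
# renumbers the rows without keeping the old index as an extra column.
clean = toy[~missing].reset_index(drop=True)
print(clean.shape)
```

On the real data the same comparison would use `df` in place of `toy`.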

In [84]:
dataset = df[~df.title.isnull()]

In [85]:
dataset.reset_index(inplace = True)

In [86]:
dataset.isnull().sum()

index               0
asin                0
name                2
rating              0
date                0
verified            0
title               0
body               16
helpfulVotes    40758
dtype: int64

Now we can see that the dataset does not contain any missing **title** values.

- Now we can choose the *independent* and *dependent* variables.
- We can use the *title* column as our **Reviews** (*independent variable*) and <br> the *rating* column as our **Rating** (*dependent variable*).

In [87]:
reviews = dataset.title
ratings = dataset.rating

In [88]:
dataset.shape

(67972, 9)

In [89]:
reviews.head()

0                     Def not best, but not worst
1                     Text Messaging Doesn't Work
2                                 Love This Phone
3                         Love the Phone, BUT...!
4    Great phone service and options, lousy case!
Name: title, dtype: object

In [90]:
reviews[3]

'Love the Phone, BUT...!'

## Text Preprocessing

### Import required libraries

In [42]:
import nltk  # Natural Language Toolkit
from nltk.corpus import stopwords  # stopword lists (the English list is used below)
from nltk.stem.porter import PorterStemmer  # Porter stemmer, reduces words to their stems
import re  # regular expressions, used to strip non-letter characters

In [43]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Aditya
[nltk_data]     Dash\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [44]:
ps = PorterStemmer()

In [45]:
m = len(reviews)

In [91]:
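data = []
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
for i in range(m):
    review = reviews[i]
    review = re.sub('[^a-zA-Z]', ' ', review)  # keep letters only
    review = review.lower()
    review = review.split()
    # Drop stopwords, then stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    data.append(review)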
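To see what the cleaning steps in the loop above do to a single title, here is a small sketch that uses only the standard library. It leaves out stemming, and `STOPWORDS` is a tiny hand-written stand-in for NLTK's English list (an assumption for this sketch, not the real list). `clean_title` is a hypothetical helper name.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for this sketch.
STOPWORDS = {"the", "but", "not", "and", "t", "doesn"}

def clean_title(title):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", title)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

print(clean_title("Love the Phone, BUT...!"))      # love phone
print(clean_title("Text Messaging Doesn't Work"))  # text messaging work
```

Since NLTK's English stopword list includes negations such as "not", a title like "Def not best, but not worst" loses them at this step. It may be worth checking whether that hurts rating prediction.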

### Creating bag of words model

In [93]:
from sklearn.feature_extraction.text import CountVectorizer

In [94]:
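# Keep the 2000 most frequent terms. max_features has to be passed by keyword:
# the first positional parameter of CountVectorizer is `input`, not max_features.
cv = CountVectorizer(max_features = 2000)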

In [96]:
X = cv.fit_transform(data).toarray()
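As a rough illustration of what the bag-of-words matrix contains, the sketch below fits a `CountVectorizer` on three made-up, already-cleaned titles. The toy strings and the choice of `max_features=2` are assumptions for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned titles for illustration only.
docs = ["love phone", "love love case", "lousi case"]

# Keep only the 2 most frequent terms across the corpus.
vectorizer = CountVectorizer(max_features=2)
counts = vectorizer.fit_transform(docs).toarray()

# Columns follow the alphabetical order of the kept vocabulary.
print(sorted(vectorizer.vocabulary_))  # ['case', 'love']
print(counts)
```

Each row is one title and each column counts one kept term, so the second row holds one "case" and two "love". The terms "phone" and "lousi" are dropped because they are less frequent than the two kept terms.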

## Creating train and test set

In [98]:
from sklearn.model_selection import train_test_split

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, ratings, test_size = 0.25, random_state = 42)
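A quick sanity check on the split is to look at the resulting shapes and to confirm that a fixed `random_state` reproduces the same partition. The sketch below does this on a toy array; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `ratings`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 8 samples with 2 features each, and 8 made-up ratings.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([1, 5, 5, 3, 4, 5, 1, 2])

# Two splits with the same seed should be identical.
split_a = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 6 training rows, 2 test rows

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```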