# Data Cleaning and Exploring

This section covers cleaning and exploring the data. I'll check that the ratings are balanced enough before modelling, and I'll also do the preprocessing here.

In [64]:
#Importing the libraries needed
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from textblob import TextBlob    
from sklearn.model_selection import train_test_split


nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [65]:
#Import dataframe to work with
df_old = pd.read_excel("thursday-murder-club-book.xlsx")
df_old.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   product      2500 non-null   object
 1   title        2499 non-null   object
 2   rating       2500 non-null   int64 
 3   text_review  2500 non-null   object
dtypes: int64(1), object(3)
memory usage: 78.2+ KB


In [66]:
#Checking number of reviews for each rating
df_old.rating.value_counts()

5    1600
4     365
1     202
3     181
2     152
Name: rating, dtype: int64

There aren't enough negative ratings (only 354 of the 2,500 reviews have 1 or 2 stars), so I'll need more data. I decided to import a dataset from https://nijianmo.github.io/amazon/index.html#subsets
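To put a number on the imbalance, the counts above can be turned into shares. This is a small sketch that re-enters the counts by hand; on the real dataframe, `df_old.rating.value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Rating counts copied from the value_counts() output above
counts = pd.Series({5: 1600, 4: 365, 1: 202, 3: 181, 2: 152}, name="rating")

# Share of each rating out of all reviews
shares = counts / counts.sum()

# Combined share of 1- and 2-star reviews
negative_share = shares.loc[[1, 2]].sum()

print(shares.round(3))
print(f"negative share: {negative_share:.3f}")
```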

In [67]:
#Importing new data
df = pd.read_csv("pantry.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137788 entries, 0 to 137787
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0      137788 non-null  int64  
 1   overall         137788 non-null  float64
 2   verified        137788 non-null  bool   
 3   reviewTime      137788 non-null  object 
 4   reviewerID      137788 non-null  object 
 5   asin            137788 non-null  object 
 6   reviewerName    137759 non-null  object 
 7   reviewText      137611 non-null  object 
 8   summary         137725 non-null  object 
 9   unixReviewTime  137788 non-null  int64  
 10  vote            9437 non-null    float64
 11  image           665 non-null     object 
 12  style           1152 non-null    object 
dtypes: bool(1), float64(2), int64(2), object(8)
memory usage: 12.7+ MB


In [68]:
df.head()

Unnamed: 0.1,Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,image,style
0,0,4.0,True,"09 24, 2015",A31Y9ELLA1JUB0,B0000DIWNI,Her Royal Peepness Princess HoneyBunny Blayze,I purchased this Saran premium plastic wrap af...,Pretty Good For plastic Wrap,1443052800,,,
1,1,5.0,True,"06 23, 2015",A2FYW9VZ0AMXKY,B0000DIWNI,Mary,I am an avid cook and baker. Saran Premium Pl...,"The Best Plastic Wrap for your Cooking, Baking...",1435017600,,,
2,2,5.0,True,"06 13, 2015",A1NE43T0OM6NNX,B0000DIWNI,Tulay C,"Good wrap, keeping it in the fridge makes it e...",Good and strong.,1434153600,,,
3,3,4.0,True,"06 3, 2015",AHTCPGK2CNPKU,B0000DIWNI,OmaShops,I prefer Saran wrap over other brands. It does...,Doesn't cling as well to dishes as other brand...,1433289600,,,
4,4,5.0,True,"04 20, 2015",A25SIBTMVXLB59,B0000DIWNI,Nitemanslim,Thanks,Five Stars,1429488000,,,


The main focus of the project is predicting product ratings from the review text. I may use the review title (summary) as well. The data covers the whole pantry department rather than a single product. Most of the columns won't be needed: the two main ones are reviewText and overall. I will also keep summary and verified in case I want to use them later.

In [69]:
#Dropping the columns I won't be using
df = df.drop(["Unnamed: 0", "reviewTime", "reviewerID", "asin", "reviewerName", "unixReviewTime", "vote", "image", "style"], axis=1)

In [70]:
#Check to see if there are enough negative reviews
df.overall.value_counts()

5.0    101456
4.0     20308
3.0      9109
2.0      3661
1.0      3254
Name: overall, dtype: int64

In [71]:
#Checking that ratings are in float format and looking for null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137788 entries, 0 to 137787
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   overall     137788 non-null  float64
 1   verified    137788 non-null  bool   
 2   reviewText  137611 non-null  object 
 3   summary     137725 non-null  object 
dtypes: bool(1), float64(1), object(2)
memory usage: 3.3+ MB


reviewText and summary both have missing values, so I'll drop those rows.

In [72]:
df.dropna(inplace=True)

In [73]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 137566 entries, 0 to 137787
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   overall     137566 non-null  float64
 1   verified    137566 non-null  bool   
 2   reviewText  137566 non-null  object 
 3   summary     137566 non-null  object 
dtypes: bool(1), float64(1), object(2)
memory usage: 4.3+ MB
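As a sketch of the check-then-drop step, the example below counts nulls per column before dropping and resets the index afterwards (the output above shows the index still running from 0 to 137787 after the drop). The `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import pandas as pd

# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    "overall": [5.0, 1.0, 4.0],
    "reviewText": ["great", None, "ok"],
    "summary": ["five stars", "one star", None],
})

# Number of nulls in each column before dropping
nulls_before = toy.isna().sum()
print(nulls_before)

# Drop rows with a null in either text column, then rebuild a 0..n-1 index
cleaned = toy.dropna(subset=["reviewText", "summary"]).reset_index(drop=True)
print(cleaned)
```

Resetting the index is optional; it only matters if later code relies on positional row labels.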


# Preprocessing

- Lowercase
- Remove punctuation 


In [74]:
#Lowercase text reviews and summaries
df["reviewText"] = df["reviewText"].str.lower()
df["summary"] = df["summary"].str.lower()
df.head()

Unnamed: 0,overall,verified,reviewText,summary
0,4.0,True,i purchased this saran premium plastic wrap af...,pretty good for plastic wrap
1,5.0,True,i am an avid cook and baker. saran premium pl...,"the best plastic wrap for your cooking, baking..."
2,5.0,True,"good wrap, keeping it in the fridge makes it e...",good and strong.
3,4.0,True,i prefer saran wrap over other brands. it does...,doesn't cling as well to dishes as other brand...
4,5.0,True,thanks,five stars


In [75]:
#Building the set of English stop words from nltk
#(punctuation is removed in the cells that follow)
stop_words = set(stopwords.words("english"))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [76]:
def remove_punct(text):
    #Iterating over a string yields characters, so this drops every punctuation character
    removed_text = "".join([char for char in text if char not in string.punctuation])
    return removed_text


In [77]:
df["review_final"] = df["reviewText"].apply(remove_punct)

In [78]:
#Checking to see if new column is cleaned
df.head()

Unnamed: 0,overall,verified,reviewText,summary,review_final
0,4.0,True,i purchased this saran premium plastic wrap af...,pretty good for plastic wrap,i purchased this saran premium plastic wrap af...
1,5.0,True,i am an avid cook and baker. saran premium pl...,"the best plastic wrap for your cooking, baking...",i am an avid cook and baker saran premium pla...
2,5.0,True,"good wrap, keeping it in the fridge makes it e...",good and strong.,good wrap keeping it in the fridge makes it ea...
3,4.0,True,i prefer saran wrap over other brands. it does...,doesn't cling as well to dishes as other brand...,i prefer saran wrap over other brands it doesn...
4,5.0,True,thanks,five stars,thanks
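The `stop_words` set built earlier is not applied by `remove_punct`, so `review_final` still contains stop words. A minimal sketch of how both steps could be combined is below. The helper name `remove_punct_and_stop_words` and the small `example_stop_words` set are assumptions for illustration; in the notebook the nltk `stop_words` set would be passed in instead.

```python
import string

# Small illustrative stop-word set; the notebook's nltk `stop_words` set would be used instead
example_stop_words = {"i", "this", "it", "the", "in", "am", "an", "and"}

def remove_punct_and_stop_words(text, stop_words):
    """Strip punctuation characters, then drop tokens that are stop words."""
    # str.translate removes all punctuation characters in one pass
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = remove_punct_and_stop_words("i am an avid cook and baker.", example_stop_words)
print(cleaned)
```

Two things to weigh before using this: stripping punctuation first turns "doesn't" into "doesnt", which no longer matches the "doesn't" entry in the nltk list, and that list includes "no", "nor" and "not", which may carry useful signal for rating prediction.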


In [79]:
#Dropping null rows again as a safety check. Reviews that were only punctuation (e.g. ":)")
#become empty strings rather than NaN, so dropna() leaves them in
df.dropna(inplace=True)
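If the aim is to remove reviews that end up with no text after cleaning, filtering on empty strings is one option. The sketch below uses a made-up `toy` frame standing in for `df`; applying the same filter to `df` would be expected to lower the row counts reported in later cells.

```python
import pandas as pd

# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame({"review_final": ["thanks", "", "   ", "yummy"]})

# Keep only rows whose cleaned text is non-empty once whitespace is stripped
has_text = toy["review_final"].str.strip() != ""
toy_clean = toy[has_text]
print(toy_clean)
```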

In [80]:
# Exploring and modelling

In [81]:
pol = lambda x: TextBlob(x).sentiment.polarity
df['polarity'] = df['review_final'].apply(pol)

In [82]:
df.head()

Unnamed: 0,overall,verified,reviewText,summary,review_final,polarity
0,4.0,True,i purchased this saran premium plastic wrap af...,pretty good for plastic wrap,i purchased this saran premium plastic wrap af...,0.062963
1,5.0,True,i am an avid cook and baker. saran premium pl...,"the best plastic wrap for your cooking, baking...",i am an avid cook and baker saran premium pla...,0.184354
2,5.0,True,"good wrap, keeping it in the fridge makes it e...",good and strong.,good wrap keeping it in the fridge makes it ea...,0.7
3,4.0,True,i prefer saran wrap over other brands. it does...,doesn't cling as well to dishes as other brand...,i prefer saran wrap over other brands it doesn...,-0.145833
4,5.0,True,thanks,five stars,thanks,0.2


In [83]:
subj = lambda x: TextBlob(x).sentiment.subjectivity
df['subjectivity'] = df['review_final'].apply(subj)

I'll now split the data. The plan is to take 3,000 reviews from each rating and then split that sample into train/test.

In [84]:
df.head(5)

Unnamed: 0,overall,verified,reviewText,summary,review_final,polarity,subjectivity
0,4.0,True,i purchased this saran premium plastic wrap af...,pretty good for plastic wrap,i purchased this saran premium plastic wrap af...,0.062963,0.451852
1,5.0,True,i am an avid cook and baker. saran premium pl...,"the best plastic wrap for your cooking, baking...",i am an avid cook and baker saran premium pla...,0.184354,0.671542
2,5.0,True,"good wrap, keeping it in the fridge makes it e...",good and strong.,good wrap keeping it in the fridge makes it ea...,0.7,0.6
3,4.0,True,i prefer saran wrap over other brands. it does...,doesn't cling as well to dishes as other brand...,i prefer saran wrap over other brands it doesn...,-0.145833,0.220833
4,5.0,True,thanks,five stars,thanks,0.2,0.2


In [85]:
#Viewing the data sorted by overall rating. sort_values returns a new dataframe,
#so df itself keeps its original order for the split
df.sort_values(by=["overall"])

Unnamed: 0,overall,verified,reviewText,summary,review_final,polarity,subjectivity
104852,1.0,True,bag of sauce,one star,bag of sauce,0.000000,0.000000
121098,1.0,True,dogs are burying them. thought they liked them...,dogs hate them.,dogs are burying them thought they liked them ...,0.600000,0.800000
42349,1.0,True,terrible coffee. very weak and the flavor is n...,terrible coffee.,terrible coffee very weak and the flavor is no...,-0.024028,0.568194
23857,1.0,True,the bag was already ripped open. my favorite c...,my favorite cookies though,the bag was already ripped open my favorite co...,0.250000,0.750000
5500,1.0,False,bottle crushed..hard to get unwrapped,lime away,bottle crushedhard to get unwrapped,0.000000,0.000000
...,...,...,...,...,...,...,...
67755,5.0,True,:),five stars,,0.000000,0.000000
67754,5.0,True,yummy,five stars,yummy,0.000000,0.000000
67753,5.0,False,everything as expected.,five stars,everything as expected,-0.100000,0.400000
67763,5.0,True,"just as expected, cheaper than the local store.",just as i expected.,just as expected cheaper than the local store,-0.050000,0.200000


In [86]:
#80/20 split without shuffling, so the last 20% of rows become the test set
train, test = train_test_split(df, test_size=0.2, shuffle=False)

In [87]:
train["overall"].value_counts()

5.0    81703
4.0    15989
3.0     7042
2.0     2836
1.0     2482
Name: overall, dtype: int64

In [88]:
test["overall"].value_counts()

5.0    19555
4.0     4305
3.0     2063
2.0      822
1.0      769
Name: overall, dtype: int64
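The counts above show that train and test still carry the original imbalance, because the split was made on the full dataframe. The plan mentioned earlier, taking a fixed number of reviews from each rating and then splitting, could be sketched as follows. This is an illustration on a made-up `toy` frame, not the notebook's actual result: `n_per_rating`, `balanced`, `train_bal` and `test_bal` are hypothetical names, and `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df: imbalanced ratings, made-up text
toy = pd.DataFrame({
    "overall": [5.0] * 50 + [4.0] * 20 + [3.0] * 10 + [2.0] * 10 + [1.0] * 10,
})
toy["review_final"] = [f"review {i}" for i in range(len(toy))]

# 3000 in the plan described earlier; 10 here so the toy data is large enough
n_per_rating = 10

# Draw the same number of rows from every rating
balanced = toy.groupby("overall").sample(n=n_per_rating, random_state=42)

# Shuffled, stratified 80/20 split so each rating keeps the same share in both parts
train_bal, test_bal = train_test_split(
    balanced, test_size=0.2, stratify=balanced["overall"], random_state=42
)

print(train_bal["overall"].value_counts())
print(test_bal["overall"].value_counts())
```

With 3,254 one-star reviews in the pantry data, 3,000 per rating is within reach for every class. Fixing `random_state` keeps the sample and the split reproducible between runs.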

In [89]:
#Saving csv files so I can work with them in another document
df.to_csv("clean_pantry.csv")
train.to_csv("train_data")
test.to_csv("test_data")