
Springboard -- Data Science Career Program

Capstone Project #2: Yelp Sentiment Analysis

Data Wrangling -- By Kevin Cole -- July 2020

This document describes the data wrangling steps for the project and the code that supports them. The capstone proposal is linked below.

Initial Proposal https://github.com/ABitNutty/Capstone-2/blob/master/Capstone%202%20Proposal.pdf

The Data - The Yelp Open Dataset, found at https://www.yelp.com/dataset, is a collection of reviews, users, and business information across many different industries. The download contains 5 JSON files, only two of which were determined to be relevant to this project.

Notes: Some issues occurred due to the size of the dataset and the limits of personal hardware. Multiple savepoints exist in the document so that progress is not lost if the kernel needs to be restarted.
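
The savepoint idea can be sketched as a small helper that loads a pickle when it already exists and only recomputes when it does not. This is a minimal illustration, not the code used in this notebook: `load_or_compute` and the demo file name are hypothetical names introduced here.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object pickled at `path`, or build it with `compute()` and save it."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


# Hypothetical usage: the second call reads the file instead of recomputing.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_savepoint.pkl")
first = load_or_compute(demo_path, lambda: {"rows": 3})
second = load_or_compute(demo_path, lambda: {"rows": 999})
print(first, second)
```

With a helper like this, an expensive step only runs when its savepoint file is missing, so restarting the kernel does not repeat the work.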

In [1]:
import pandas as pd
import numpy as np
import json
import pickle
import string
import timeit
import nltk
from nltk.corpus import stopwords


from langdetect import detect
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Names of the downloaded files to import
business_json = 'yelp_academic_dataset_business.json'
review_json = 'yelp_academic_dataset_review.json'

# Business Info

In [None]:
# Initial conversion of the business JSON file into a dataframe
business = pd.read_json(business_json, lines=True)
business.head()

In [None]:
# How many businesses are in the dataset
len(business)

In [None]:
# How many restaurants are in the dataset
business.categories.str.contains('Restaurant').sum()

In [None]:
# Getting a count for each category

# Dictionary Count
counts = {}

for entry in business.categories:
    # Looping each entry in the dataframe
    
    # Split string into multiple categories
    categories = str(entry).split(', ')
    
    for category in categories:
        if category in counts:
            counts[category] += 1
        else:
            counts[category] = 1

counts
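
The same tally can be sketched with `collections.Counter`. This version is an alternative illustration rather than the notebook's method: it skips missing category values instead of converting them with `str()`, which would otherwise count them under a literal string such as 'None'. The function name and sample rows are made up for the example.

```python
from collections import Counter


def count_categories(category_strings):
    """Count each category across comma-separated category strings."""
    counts = Counter()
    for entry in category_strings:
        # Skip missing values instead of counting them as a category
        if not isinstance(entry, str):
            continue
        counts.update(part.strip() for part in entry.split(",") if part.strip())
    return counts


# Made-up sample rows, not taken from the Yelp data
sample = ["Restaurants, Pizza", "Pizza, Italian, Restaurants", None, "Nail Salons"]
category_counts = count_categories(sample)
print(category_counts.most_common(2))
```

`most_common` returns the categories sorted by frequency, which is easier to scan than the raw dictionary.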

In [None]:
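# Keeping only businesses whose categories mention 'Restaurant' (missing categories count as no match)
restaurants = business[business.categories.str.contains('Restaurant', na=False)]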
restaurants = restaurants[['business_id','is_open']]
restaurants

In [None]:
restaurants.to_pickle('restaurants_business_id.pkl')

# Review Info

In [3]:
################ Caution: This import is  long as the file is 6+ GB. ###################
################          Last run was 56 minutes on 7/7/20.         ###################
################ Savepoint 1 occurs after this data is merged with business info #######

# Timing start
start_time = timeit.default_timer()

# Initial conversion of the review JSON file into a dataframe
review = pd.read_json(review_json, lines=True)

# Elapsed time calculation
elapsed = timeit.default_timer() - start_time

print('Elapsed Time (minutes):')
print(elapsed/60)

Elapsed Time (minutes):
56.526052133166665
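
One possible way to reduce the load time and memory use is to stream the file line by line and keep only restaurant reviews and the needed fields. The sketch below assumes the file holds one JSON object per line, which is what `lines=True` implies. It was not run against the full dataset, and `iter_matching_reviews` and the demo file are hypothetical.

```python
import json
import os
import tempfile


def iter_matching_reviews(path, business_ids, fields=("business_id", "stars", "text")):
    """Yield trimmed review dicts whose business_id is in `business_ids`."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("business_id") in business_ids:
                yield {name: record.get(name) for name in fields}


# Tiny made-up file standing in for the review JSON
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({"business_id": "a1", "stars": 5, "text": "Great", "cool": 0}) + "\n")
    handle.write(json.dumps({"business_id": "b2", "stars": 1, "text": "Bad", "cool": 1}) + "\n")

kept = list(iter_matching_reviews(demo_path, {"a1"}))
print(kept)
```

Because the restaurant business IDs are pickled before this import, they could be passed in as a set so that non-restaurant reviews never reach the dataframe.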


In [4]:
# Loading the dataframe of restaurant business IDs pickled above
businesses = pd.read_pickle('restaurants_business_id.pkl')

In [10]:
businesses.head()

Unnamed: 0,business_id,is_open
8,pQeaRpvuhoEqudo3uymHIQ,1
20,CsLQLiRoafpJPJSkNX2h5Q,0
24,eBEfgOPG7pvFhb2wcG9I7w,1
25,lu7vtrp_bE9PnxWfA8g4Pg,1
30,9sRGfSVEfLhN_km60YruTA,1


In [8]:
review

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16
1,UmFMZ8PyXZTY2QcwzsfQYA,nIJD_7ZXHq-FX8byPMOkMQ,lbrU8StCq3yDfr-QMnGrmQ,1,1,1,0,I am actually horrified this place is still in...,2013-12-07 03:16:52
2,LG2ZaYiOgpr2DK_90pYjNw,V34qejxNsCbcgD8C0HVk-Q,HQl28KMwrEKHqhFrrDqVNQ,5,1,0,0,I love Deagan's. I do. I really do. The atmosp...,2015-12-05 03:18:11
3,i6g_oA9Yf9Y31qt0wibXpw,ofKDkJKXSKZXu5xJNGiiBQ,5JxlZaqCnk1MnbgRirs40Q,1,0,0,0,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",2011-05-27 05:30:52
4,6TdNDKywdbjoTkizeMce8A,UgMW8bLE0QMJDCkQ1Ax5Mg,IS4cv902ykd8wj1TR0N3-A,4,0,0,0,"Oh happy day, finally have a Canes near my cas...",2017-01-14 21:56:57
...,...,...,...,...,...,...,...,...,...
8021117,LAzw2u1ucY722ryLEXHdgg,6DMFD3BRp-MVzDQelRx5UQ,XW2kaXdahICaJ27A0dhGHg,1,1,0,1,"Fricken unbelievable, I ordered 2 space heater...",2019-12-11 01:07:06
8021118,gMDU14Fa_DVIcPvsKtubJA,_g6P8H3-qfbz1FxbffS68g,IsoLzudHC50oJLiEWpwV-w,3,1,3,1,Solid American food with a southern comfort fl...,2019-12-10 04:15:00
8021119,EcY_p50zPIQ2R6rf6-5CjA,Scmyz7MK4TbXXYcaLZxIxQ,kDCyqlYcstqnoqnfBRS5Og,5,15,6,13,I'm honestly not sure how I have never been to...,2019-06-06 15:01:53
8021120,-z_MM0pAf9RtZbyPlphTlA,lBuAACBEThaQHQGMzAlKpg,VKVDDHKtsdrnigeIf9S8RA,3,2,0,0,Food was decent but I will say the service too...,2018-07-05 18:45:21


In [9]:
review.set_index('business_id')

Unnamed: 0_level_0,review_id,user_id,stars,useful,funny,cool,text,date
business_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
-MhfebM0QIsKt87iDN-FNw,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,2,5,0,0,"As someone who has worked with many museums, I...",2015-04-15 05:21:16
lbrU8StCq3yDfr-QMnGrmQ,UmFMZ8PyXZTY2QcwzsfQYA,nIJD_7ZXHq-FX8byPMOkMQ,1,1,1,0,I am actually horrified this place is still in...,2013-12-07 03:16:52
HQl28KMwrEKHqhFrrDqVNQ,LG2ZaYiOgpr2DK_90pYjNw,V34qejxNsCbcgD8C0HVk-Q,5,1,0,0,I love Deagan's. I do. I really do. The atmosp...,2015-12-05 03:18:11
5JxlZaqCnk1MnbgRirs40Q,i6g_oA9Yf9Y31qt0wibXpw,ofKDkJKXSKZXu5xJNGiiBQ,1,0,0,0,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",2011-05-27 05:30:52
IS4cv902ykd8wj1TR0N3-A,6TdNDKywdbjoTkizeMce8A,UgMW8bLE0QMJDCkQ1Ax5Mg,4,0,0,0,"Oh happy day, finally have a Canes near my cas...",2017-01-14 21:56:57
...,...,...,...,...,...,...,...,...
XW2kaXdahICaJ27A0dhGHg,LAzw2u1ucY722ryLEXHdgg,6DMFD3BRp-MVzDQelRx5UQ,1,1,0,1,"Fricken unbelievable, I ordered 2 space heater...",2019-12-11 01:07:06
IsoLzudHC50oJLiEWpwV-w,gMDU14Fa_DVIcPvsKtubJA,_g6P8H3-qfbz1FxbffS68g,3,1,3,1,Solid American food with a southern comfort fl...,2019-12-10 04:15:00
kDCyqlYcstqnoqnfBRS5Og,EcY_p50zPIQ2R6rf6-5CjA,Scmyz7MK4TbXXYcaLZxIxQ,5,15,6,13,I'm honestly not sure how I have never been to...,2019-06-06 15:01:53
VKVDDHKtsdrnigeIf9S8RA,-z_MM0pAf9RtZbyPlphTlA,lBuAACBEThaQHQGMzAlKpg,3,2,0,0,Food was decent but I will say the service too...,2018-07-05 18:45:21


In [14]:
# Joining business info 
review = review.join(businesses.set_index('business_id'), on='business_id')

In [15]:
# Subsetting all reviews with those matching a business in the restaurant category
restaurant_reviews = review[review.business_id.isin(list(businesses.business_id))]

In [16]:
# Taking desired columns
restaurant_reviews = restaurant_reviews[['stars','text','is_open']]

# Good Bad Neutral Tagging

In [17]:
restaurant_reviews.head()

Unnamed: 0,stars,text,is_open
2,5,I love Deagan's. I do. I really do. The atmosp...,1.0
3,1,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",0.0
4,4,"Oh happy day, finally have a Canes near my cas...",1.0
5,5,This is definitely my favorite fast food sub s...,1.0
6,5,"Really good place with simple decor, amazing f...",1.0


In [18]:
# Dictionary for Good/Bad/Neutral categories
rating_dict = {5:'Good', 4:'Good', 3:'Neutral', 2:'Bad', 1:'Bad'}

In [19]:
# Mapping star ratings 
restaurant_reviews['good_bad'] = restaurant_reviews.stars.map(rating_dict)
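
The tagging rule can also be written as a plain function. Mapping a column through a dictionary leaves any rating that is missing from the dictionary as NaN, so this sketch raises an error instead to make unexpected values visible. `RATING_LABELS` and `label_rating` are illustrative names, not part of the notebook.

```python
RATING_LABELS = {5: "Good", 4: "Good", 3: "Neutral", 2: "Bad", 1: "Bad"}


def label_rating(stars):
    """Return the Good/Neutral/Bad label for a star rating, or raise on an unexpected value."""
    try:
        return RATING_LABELS[stars]
    except (KeyError, TypeError):
        raise ValueError(f"unexpected star rating: {stars!r}") from None


labels = [label_rating(s) for s in (5, 4, 3, 2, 1)]
print(labels)
```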

In [20]:
restaurant_reviews.head()

Unnamed: 0,stars,text,is_open,good_bad
2,5,I love Deagan's. I do. I really do. The atmosp...,1.0,Good
3,1,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",0.0,Bad
4,4,"Oh happy day, finally have a Canes near my cas...",1.0,Good
5,5,This is definitely my favorite fast food sub s...,1.0,Good
6,5,"Really good place with simple decor, amazing f...",1.0,Good


In [21]:
# Saving off dataframe after business subsetting and good/bad/neutral tagging
restaurant_reviews.to_pickle('restaurant_reviews.pkl')

In [22]:
# Loading savepoint 1
restaurant_reviews = pd.read_pickle('restaurant_reviews.pkl')

In [23]:
restaurant_reviews

Unnamed: 0,stars,text,is_open,good_bad
2,5,I love Deagan's. I do. I really do. The atmosp...,1.0,Good
3,1,"Dismal, lukewarm, defrosted-tasting ""TexMex"" g...",0.0,Bad
4,4,"Oh happy day, finally have a Canes near my cas...",1.0,Good
5,5,This is definitely my favorite fast food sub s...,1.0,Good
6,5,"Really good place with simple decor, amazing f...",1.0,Good
...,...,...,...,...
8021113,5,"Confections, cash, and casinos! Welcome to Las...",0.0,Good
8021118,3,Solid American food with a southern comfort fl...,1.0,Neutral
8021119,5,I'm honestly not sure how I have never been to...,1.0,Good
8021120,3,Food was decent but I will say the service too...,1.0,Neutral


# Text Cleaning

The following code will: 
- Remove reviews shorter than 11 characters
- Convert all reviews to lower case
- Remove special characters including punctuation and carriage returns
- Determine the language of each review
- Remove all non-English reviews
- Remove stopwords
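
The steps above, apart from language detection, can be sketched for a single review as follows. This is a simplified illustration under stated assumptions: `clean_review` is a hypothetical helper, and `DEMO_STOPWORDS` is a tiny stand-in for the NLTK English stopword list used later in this notebook.

```python
import string

# Small illustrative stopword set; the notebook uses NLTK's English list instead
DEMO_STOPWORDS = {"i", "the", "a", "was", "and", "it", "is"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text, stopwords=DEMO_STOPWORDS, min_length=11):
    """Return cleaned text, or None when the review is too short to keep."""
    stripped = text.translate(PUNCT_TABLE)
    if len(stripped) < min_length:
        return None
    tokens = stripped.lower().split()  # split() also collapses newlines and tabs
    return " ".join(word for word in tokens if word not in stopwords)


print(clean_review("The food was GREAT,\n and it is cheap!"))
print(clean_review("Yum!"))
```

The length check runs after punctuation is stripped, matching the order of the cells that follow.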

# Punctuation & Short Review removal

In [24]:
# Removing punctuation
restaurant_reviews.text = restaurant_reviews.text.str.translate(str.maketrans('','',string.punctuation))

In [25]:
len(restaurant_reviews)

5056227

In [26]:
# Removing reviews with 10 or fewer characters
restaurant_reviews = restaurant_reviews[restaurant_reviews.text.apply(len) > 10]
len(restaurant_reviews)

5056004

In [27]:
restaurant_reviews.head()

Unnamed: 0,stars,text,is_open,good_bad
2,5,I love Deagans I do I really do The atmosphere...,1.0,Good
3,1,Dismal lukewarm defrostedtasting TexMex glop\n...,0.0,Bad
4,4,Oh happy day finally have a Canes near my casa...,1.0,Good
5,5,This is definitely my favorite fast food sub s...,1.0,Good
6,5,Really good place with simple decor amazing fo...,1.0,Good


# Lower Case

In [28]:
# Converting all text to lower case
restaurant_reviews.text = restaurant_reviews.text.str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [30]:
restaurant_reviews.tail()

Unnamed: 0,stars,text,is_open,good_bad
8021113,5,confections cash and casinos welcome to las ve...,0.0,Good
8021118,3,solid american food with a southern comfort fl...,1.0,Neutral
8021119,5,im honestly not sure how i have never been to ...,1.0,Good
8021120,3,food was decent but i will say the service too...,1.0,Neutral
8021121,5,oh yeah not only that the service was good the...,1.0,Good


# Whitespace normalizing

In [31]:
# Removing multiple spaces, carriage returns, tabs
restaurant_reviews['text'] = restaurant_reviews.text.str.split().apply(' '.join)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [32]:
# Saving a version of the data with lower case applied and punctuation removed
restaurant_reviews.to_pickle('restaurant_reviews2.pkl')

# Data Save Point 2

In [None]:
# Loading savepoint 2
restaurant_reviews = pd.read_pickle('restaurant_reviews2.pkl')

In [None]:
# Warning: This cell takes a long time to run
# Last run was 7.15 hours on 7/6/20. Results immediately saved off in restaurant_reviews3.pkl
# If this cell is not run, load the data at savepoint 3 instead

# Timing start
start_time = timeit.default_timer()

# Detecting language of reviews
restaurant_reviews['language'] = restaurant_reviews.text.apply(detect)

# Elapsed time calculation
elapsed = timeit.default_timer() - start_time

print('Elapsed Time (minutes):')
print(elapsed/60)
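
Because this step runs for hours, a single failing review could waste the whole run. My understanding is that `detect` can raise an exception when a text has no usable features (for example digits only), so a defensive wrapper may be worth considering. The sketch below is not what was run here: `safe_detect` is a hypothetical helper and `fake_detector` is a stand-in so the example runs without langdetect.

```python
def safe_detect(text, detector, fallback="unknown"):
    """Call `detector(text)` and return `fallback` if it raises."""
    try:
        return detector(text)
    except Exception:
        return fallback


def fake_detector(text):
    """Stand-in for a real language detector, used only for this demo."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"


print(safe_detect("great tacos and friendly staff", fake_detector))
print(safe_detect("12345 67890", fake_detector))
```

In this notebook the real detector would be passed in, for example `restaurant_reviews.text.apply(lambda t: safe_detect(t, detect))`, and rows labelled with the fallback value would then be dropped along with the non-English reviews.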


# Data Save Point 3

In [41]:
# Saved version of data with language identified

############ save command commented out to prevent accidental overwrite  ######################

# restaurant_reviews.to_pickle('restaurant_reviews3.pkl')

In [36]:
# Loading savepoint 3
restaurant_reviews = pd.read_pickle('restaurant_reviews3.pkl')

# Removing non-English reviews

In [38]:
len(restaurant_reviews)

5056004

In [44]:
restaurant_reviews = restaurant_reviews[restaurant_reviews.language == 'en']

In [46]:
len(restaurant_reviews)

5026166

# Remove Stopwords

In [47]:
# Viewing NLTK Stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Since I've already removed punctuation from my text, I should remove the punctuation from the stopwords before stripping them from my reviews.

In [48]:
stopwords_list = stopwords.words('english')

In [49]:
stopwords_list = [word.translate(str.maketrans('','',string.punctuation)) for word in stopwords_list]

In [50]:
print(stopwords_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'youre', 'youve', 'youll', 'youd', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'shes', 'her', 'hers', 'herself', 'it', 'its', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'thatll', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', '

In [51]:
# Tokenizing words for stopwords removal -- Do I need a tokenizer for this or is split sufficient?
restaurant_reviews.text = restaurant_reviews.text.str.split()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [52]:
# Function for removing stopwords from a list of tokens
def remove_stopwords(text_tokens):
    filtered_words = [word for word in text_tokens if word not in stopwords_list]
    return filtered_words
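
Checking membership in a Python list scans the list on every lookup, while a set lookup takes roughly constant time on average. A set-backed variant of the function above might therefore run faster over millions of reviews, although that was not measured here. `make_stopword_filter` and the short demo list are hypothetical.

```python
def make_stopword_filter(stopword_list):
    """Build a token filter backed by a set for fast membership checks."""
    stopword_set = frozenset(stopword_list)

    def remove_stopwords_fast(text_tokens):
        return [word for word in text_tokens if word not in stopword_set]

    return remove_stopwords_fast


# Short made-up list standing in for the punctuation-stripped NLTK stopwords
demo_filter = make_stopword_filter(["i", "the", "was", "youre", "its"])
filtered = demo_filter("the service was slow but its fine".split())
print(filtered)
```

The returned function takes a list of tokens and returns a list, so it could be passed to `apply` in the same way as `remove_stopwords`.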

In [53]:
# Removing stopwords

# Timing start
start_time = timeit.default_timer()

# Applying function to dataframe
restaurant_reviews['text'] = restaurant_reviews.text.apply(remove_stopwords)

# Elapsed time calculation
elapsed = timeit.default_timer() - start_time

print('Elapsed Time (minutes):')
print(elapsed/60)

Elapsed Time (minutes):
62.181068632566664


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


In [55]:
# Joining text back together
restaurant_reviews['text'] = restaurant_reviews.text.apply(' '.join)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [57]:
# Reset the index
restaurant_reviews = restaurant_reviews.reset_index(drop=True)

# Cleaned Data Set

In [59]:
restaurant_reviews

Unnamed: 0,stars,text,good_bad,language,is_open
0,5,love deagans really atmosphere cozy festive sh...,Good,en,1.0
1,1,dismal lukewarm defrostedtasting texmex glop m...,Bad,en,0.0
2,4,oh happy day finally canes near casa yes other...,Good,en,1.0
3,5,definitely favorite fast food sub shop ingredi...,Good,en,1.0
4,5,really good place simple decor amazing food gr...,Good,en,1.0
...,...,...,...,...,...
5026161,5,confections cash casinos welcome las vegas fin...,Good,en,0.0
5026162,3,solid american food southern comfort flare war...,Neutral,en,1.0
5026163,5,im honestly sure never place im definitely goi...,Good,en,1.0
5026164,3,food decent say service took way long order ev...,Neutral,en,1.0


In [60]:
# Saving the cleaned data set
restaurant_reviews.to_pickle('restaurant_reviews_cleaned.pkl')