# Yelp - Sentiment analysis and business growth predictors 

# Context
(source: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/version/6) - This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. The dataset contains information about businesses across 11 metropolitan areas in four countries.

# Objective

The objective of this project is to find the best growth predictors for businesses given user reviews and ratings for each business. NLP and exploratory data analysis techniques are used to determine these metrics. The focus of this project is on restaurants in Ontario, Canada, and how they can use these insights to drive better business decisions.

# Import 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
%%time 
business = pd.read_csv("./yelp_business.csv", chunksize = 1000000)
business_giant = pd.concat(business)

review = pd.read_csv("./yelp_review.csv", chunksize = 1000000)
review_giant = pd.concat(review)

check_in = pd.read_csv("./yelp_checkin.csv", chunksize = 1000000)
check_in_giant = pd.concat(check_in)

CPU times: user 36.3 s, sys: 10.8 s, total: 47.1 s
Wall time: 59.6 s
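Reading with `chunksize` and then concatenating every chunk still holds the full table in memory at the end. A minimal sketch, assuming only a subset of rows is needed later (here, rows whose `state` is `ON`), filters each chunk before concatenating. The in-memory CSV and the `load_filtered` helper are hypothetical stand-ins for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for yelp_business.csv (made-up rows).
csv_text = """business_id,state,stars
a1,ON,4.0
b2,AZ,3.5
c3,ON,2.0
d4,PA,5.0
"""

def load_filtered(source, state, chunksize=2):
    """Read a CSV in chunks and keep only the rows for one state."""
    chunks = pd.read_csv(source, chunksize=chunksize)
    kept = [chunk[chunk["state"] == state] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)

ontario = load_filtered(io.StringIO(csv_text), "ON")
print(ontario)
```

With this pattern the peak memory is roughly one chunk plus the rows kept so far, rather than the whole file.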


In [3]:
business_giant.info()
business_giant.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174567 entries, 0 to 174566
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   174567 non-null  object 
 1   name          174567 non-null  object 
 2   neighborhood  68015 non-null   object 
 3   address       174567 non-null  object 
 4   city          174566 non-null  object 
 5   state         174566 non-null  object 
 6   postal_code   173944 non-null  object 
 7   latitude      174566 non-null  float64
 8   longitude     174566 non-null  float64
 9   stars         174567 non-null  float64
 10  review_count  174567 non-null  int64  
 11  is_open       174567 non-null  int64  
 12  categories    174567 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 17.3+ MB


Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",,"""4855 E Warner Rd, Ste B9""",Ahwatukee,AZ,85044,33.33069,-111.978599,4.0,22,1,Dentists;General Dentistry;Health & Medical;Or...
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",,"""3101 Washington Rd""",McMurray,PA,15317,40.291685,-80.1049,3.0,11,1,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",,"""6025 N 27th Ave, Ste 1""",Phoenix,AZ,85017,33.524903,-112.11531,1.5,18,1,Departments of Motor Vehicles;Public Services ...
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",,"""5000 Arizona Mills Cr, Ste 435""",Tempe,AZ,85282,33.383147,-111.964725,3.0,9,0,Sporting Goods;Shopping
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",,"""581 Howe Ave""",Cuyahoga Falls,OH,44221,41.119535,-81.47569,3.5,116,1,American (New);Nightlife;Bars;Sandwiches;Ameri...


In [4]:
review_giant.info()
review_giant.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5261668 entries, 0 to 5261667
Data columns (total 9 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   review_id    object
 1   user_id      object
 2   business_id  object
 3   stars        int64 
 4   date         object
 5   text         object
 6   useful       int64 
 7   funny        int64 
 8   cool         int64 
dtypes: int64(4), object(5)
memory usage: 361.3+ MB


Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool
0,vkVSCC7xljjrAI4UGfnKEQ,bv2nCi5Qv5vroFiqKGopiw,AEx2SYEUJmTxVVB18LlCwA,5,2016-05-28,Super simple place but amazing nonetheless. It...,0,0,0
1,n6QzIUObkYshz4dz2QRJTw,bv2nCi5Qv5vroFiqKGopiw,VR6GpWIda3SfvPC-lg9H3w,5,2016-05-28,Small unassuming place that changes their menu...,0,0,0
2,MV3CcKScW05u5LVfF6ok0g,bv2nCi5Qv5vroFiqKGopiw,CKC0-MOWMqoeWf6s-szl8g,5,2016-05-28,Lester's is located in a beautiful neighborhoo...,0,0,0
3,IXvOzsEMYtiJI0CARmj77Q,bv2nCi5Qv5vroFiqKGopiw,ACFtxLv8pGrrxMm6EgjreA,4,2016-05-28,Love coming here. Yes the place always needs t...,0,0,0
4,L_9BTb55X0GDtThi6GlZ6w,bv2nCi5Qv5vroFiqKGopiw,s2I_Ni76bjJNK9yG60iD-Q,4,2016-05-28,Had their chocolate almond croissant and it wa...,0,0,0


In [5]:
check_in_giant.info()
check_in_giant.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911218 entries, 0 to 3911217
Data columns (total 4 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   business_id  object
 1   weekday      object
 2   hour         object
 3   checkins     int64 
dtypes: int64(1), object(3)
memory usage: 119.4+ MB


Unnamed: 0,business_id,weekday,hour,checkins
0,3Mc-LxcqeguOXOVT_2ZtCg,Tue,0:00,12
1,SVFx6_epO22bZTZnKwlX7g,Wed,0:00,4
2,vW9aLivd4-IorAfStzsHww,Tue,14:00,1
3,tEzxhauTQddACyqdJ0OPEQ,Fri,19:00,1
4,CEyZU32P-vtMhgqRCaXzMA,Tue,17:00,1


# Data Cleaning/ Data Wrangling
Clean each data set, filter by category and year, and combine them on business id.

## Review 
    1. Change the date column from object to datetime format
    2. Keep only reviews dated after 2016 (that is, 2017 onward)

In [6]:
# Review
review_giant['date'] = pd.to_datetime(review_giant['date'])
review_giant['date'][0]

Timestamp('2016-05-28 00:00:00')

In [7]:
# check how many reviews in each year
pd.DatetimeIndex(review_giant['date']).year.value_counts()

2017    1128518
2016    1052916
2015     911487
2014     678351
2013     472595
2012     350381
2011     290933
2010     187073
2009      98288
2008      61553
2007      23020
2006       5669
2005        870
2004         14
Name: date, dtype: int64

In [8]:
# select only reviews after 2016 (2017 onward)
review_years = review_giant[review_giant['date'] >= '2017']
pd.DatetimeIndex(review_years['date']).year.value_counts()

2017    1128518
Name: date, dtype: int64
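The comparison against the string `'2017'` works because pandas parses the partial date as the start of that year. A small sketch of the same filter written with an explicit year comparison is shown here; the rows are made up for illustration.

```python
import pandas as pd

# Made-up reviews straddling the 2016/2017 boundary.
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "date": pd.to_datetime(
        ["2015-03-01", "2016-12-31", "2017-01-01", "2017-11-02"]
    ),
})

# .dt.year returns the calendar year of each timestamp.
recent = reviews[reviews["date"].dt.year >= 2017]

print(recent["date"].dt.year.value_counts())
```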

In [9]:
# feature selection 
review_years_features = review_years[['review_id','user_id','business_id','stars','date','text']]
review_years_features.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text
33,7GsVl-wMaSfG1VoEK6-s6g,u0LXt3Uea_GidxRW1xcsfg,3E5umUqaU5OZAV3jNLW3kQ,4,2017-11-02,Great place to bring dogs! It's really a dog p...
43,Ebggx4Zlc4VWReJMG1nT6w,u0LXt3Uea_GidxRW1xcsfg,l1_S1mfGbEMxfT1f9omhEA,1,2017-10-16,Terrible service and not so great drinks.\n\nW...
49,O4RZMP8IFyJTNfRp0QXEsw,u0LXt3Uea_GidxRW1xcsfg,vcxvQyAggPqxcHwvJXvjGg,4,2017-01-04,Love this place!\n\nThe cakes are delicious bu...
50,66KqTwiQ1oB9-aTsoEN35Q,u0LXt3Uea_GidxRW1xcsfg,DKiRDPtQ5cTN-eX1oEgA9w,3,2017-01-04,It's a pub... nice and clean one.\n\nCame here...
51,jREsaout3cuhKbROVDXUFg,u0LXt3Uea_GidxRW1xcsfg,I44P6Pfoey2pArOhhx2RnA,4,2017-10-17,Cute little hole in the wall place. Two entran...


In [10]:
# check for null 
review_years_features.isnull().sum()

review_id      0
user_id        0
business_id    0
stars          0
date           0
text           0
dtype: int64

### Review comments cleaning
    1. Tokenize the review comments at word level
    2. Change all upper case to lower case
    3. Remove English stop words using scikit-learn's ENGLISH_STOP_WORDS list
    4. Remove punctuation and words of one or two characters
    5. Untokenize the words and return a clean text column in the dataframe
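The steps above can be sketched in a single function that uses only the standard library. This is an illustration under stated assumptions, not the notebook's implementation: it splits on whitespace instead of calling NLTK's `word_tokenize`, it uses a deliberately tiny hypothetical stop-word set, and it strips punctuation before the stop-word check. The name `clean_review` is made up.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set; a real run would use a full list.
STOP_WORDS = {"the", "and", "but", "was", "this", "are", "for"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_review(text, min_len=3):
    """Lowercase, tokenize, strip punctuation, drop stop words and short tokens."""
    tokens = re.findall(r"\S+", text.lower())            # whitespace tokenization
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]  # remove punctuation
    tokens = [t for t in tokens
              if len(t) >= min_len and t not in STOP_WORDS]
    return " ".join(tokens)                              # untokenize

print(clean_review("Love this place! The cakes are delicious, but it was busy."))
```

Returning a plain string, rather than a one-element list, means no extra unwrapping step is needed before storing the result in a column.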
    

In [11]:
import nltk
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

In [12]:
# define function to remove stop words
def remove_stopwords(list_of_token):
    '''
    Remove English stop words
    '''
    
    cleaned_tokens = []
    for token in list_of_token:
        if token in ENGLISH_STOP_WORDS: continue
        cleaned_tokens.append(token)
        
    return cleaned_tokens

In [13]:
# define function to remove punctuation and short words
def remove_punctuations(list_of_tokens):
    '''
    Strip punctuation from each token and drop words of one or two characters
    '''
    
    cleaned_tokens = []
    for word in list_of_tokens:
        
        for punctuation in string.punctuation:
            word = word.replace(punctuation, '')
            
            # remove words of one or two characters
            word = re.sub(r'\b\w{1,2}\b', '', word)
            
        # skip tokens that are empty after cleaning
        if word:
            cleaned_tokens.append(word)
        
    return cleaned_tokens


In [14]:
# define a function to untokenize the tokens
def the_untokenizer(token_list):
    
    return " ".join(token_list)

In [15]:
#define a function that combines the functions needed to clean out texts
def cleaning_out_texts(text):
    
    cleaned_text = [] 
    tokenizer_list = word_tokenize(text)
    
    lower_word = []
    for word in tokenizer_list:
        word = word.lower()
        lower_word.append(word)
        
    remove_stopwords_list = remove_stopwords(lower_word)
    remove_punctuation_list = remove_punctuations(remove_stopwords_list)
    back_to_string = the_untokenizer(remove_punctuation_list)
    
    cleaned_text.append(back_to_string)
    
    return cleaned_text

In [16]:
def unlist(items):
    # convert the one-element list returned by cleaning_out_texts back to a plain string
    return str(items).strip("[],''")

In [17]:
# review_years_features['cleaned_text'] = review_years_features['text'].apply(cleaning_out_texts)
# review_years_features.head()

## Business 
    1. Only interested in restaurants and food businesses
    2. Restrict to Ontario, Canada (state code ON)

In [18]:
# Cleaning business table

#Categories with word 'Restaurants' or 'Food'
business_filter = business_giant[business_giant['categories'].str.contains("Food")| business_giant['categories'].str.contains('Restaurants')]

#Only look at Ontario 
business_filter_state = business_filter[business_filter['state']== 'ON']
business_filter_state.nunique()

business_id     16845
name            12345
neighborhood      114
address         13461
city               95
state               1
postal_code      6197
latitude        14603
longitude       14620
stars               9
review_count      378
is_open             2
categories       7983
dtype: int64
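`str.contains` performs a substring match, so any category label that merely contains the text "Food" also passes the filter. A hedged sketch of an exact-label alternative follows: it splits the semicolon-separated `categories` string and keeps rows that share at least one label with a wanted set. The rows and the `wanted` set are made up for illustration.

```python
import pandas as pd

businesses = pd.DataFrame({
    "business_id": ["a1", "b2", "c3", "d4"],
    "categories": [
        "Bakeries;Bagels;Food",
        "Seafood;Restaurants",
        "Pet Stores;Pet Food Supplies",   # made-up label containing "Food"
        "Hair Salons;Beauty & Spas",
    ],
})

wanted = {"Food", "Restaurants"}

# Split the labels and keep rows that share at least one label with `wanted`.
labels = businesses["categories"].str.split(";")
is_food = labels.apply(lambda items: bool(wanted.intersection(items)))

exact = businesses[is_food]
substring = businesses[businesses["categories"].str.contains("Food|Restaurants")]

print(exact["business_id"].tolist())
print(substring["business_id"].tolist())
```

Whether the broader substring match or the exact match is preferable depends on which businesses should count as restaurants for the analysis.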

In [19]:
# merge table 
print(business_filter_state.shape)
print(review_years_features.shape)

(16845, 13)
(1128518, 6)


In [20]:
#create master table joined on business id 
master_df = pd.merge(left= business_filter_state, right= review_years_features, on='business_id')
master_df

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars_x,review_count,is_open,categories,review_id,user_id,stars_y,date,text
0,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,vt2HH1VjeZLQE6tMjP48Pg,7aWI9ruXTb0w5WADh-FEVw,1,2017-04-10,Supremacy vibes. You might want to avoid it al...
1,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,JGafXFmK7dDNSIk__-vseA,sno00i53Bv_d0pcMFwTH_w,4,2017-09-20,I will start by saying this is a great place t...
2,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,dLjAr4WtzAPx_L618bVgYA,RizIc-wSzzDu1tcb6Ditbw,5,2017-10-17,"I've been coming here fore almost 10 years, I ..."
3,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,j_TeLG6B7VRueO7ritvImw,ugWgc6T9ZywmFyxh2ScpEA,4,2017-05-16,Fantastic place to stop in for breakfast if yo...
4,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,XugUlErEm1t1ybegUzcW-g,f6AwvBRnwcTyn3teZIE0hw,2,2017-05-30,"Nice to have a more classic, mom-and-pop style..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109848,fukaxeFh8W9ijOp8sCrDyA,"""Heritage Fish & Chips""",,"""1 Fisherman Drive""",Brampton,ON,L7A 2X9,43.714303,-79.799411,4.0,14,1,Seafood;Restaurants,V7vps0z2ZI2f2bXJSsSv4w,FgCfj74mpOZ0uMcb9lXu8g,4,2017-01-05,Food is good. I prefer not such thick fries a...
109849,fukaxeFh8W9ijOp8sCrDyA,"""Heritage Fish & Chips""",,"""1 Fisherman Drive""",Brampton,ON,L7A 2X9,43.714303,-79.799411,4.0,14,1,Seafood;Restaurants,ocw3ex9uGiAAaoCLk1OQ3Q,LHBXPaYdByaADlu-z7DEDQ,3,2017-11-19,"Well, definitely don't go here if you're looki..."
109850,nGjEV4bn0DPk8bcb0C6Aig,"""Sweet Serendipity Bake Shop""",The Danforth,"""1335 Danforth Avenue""",Toronto,ON,M4J 1N1,43.682054,-79.328996,4.5,22,1,Bakeries;Food,Oflm0Jj9Q96ijW0OqJKYMg,Qer0Ab50__RMQ0g_NlHivA,5,2017-05-18,Yummy carrot cake... And I don't like Carrot C...
109851,nGjEV4bn0DPk8bcb0C6Aig,"""Sweet Serendipity Bake Shop""",The Danforth,"""1335 Danforth Avenue""",Toronto,ON,M4J 1N1,43.682054,-79.328996,4.5,22,1,Bakeries;Food,fmY1YjzzwnyDDDOxwg_aAg,yvcZN7H2pmeGXMV2pHpozw,3,2017-07-18,Ordered a dozen cupcakes for a party with vari...
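`pd.merge` defaults to an inner join, so only businesses with at least one 2017 review remain, and the business columns repeat on every review row. The sketch below, on made-up rows, spells out the join type, asks pandas to verify that `business_id` is unique on the business side, and names the two `stars` columns explicitly. The suffix names are hypothetical choices, not the notebook's.

```python
import pandas as pd

business = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "stars": [4.0, 3.5, 5.0],          # business-level average rating
})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["a1", "a1", "b2"],
    "stars": [5, 3, 4],                # rating given in a single review
})

merged = pd.merge(
    business,
    reviews,
    on="business_id",
    how="inner",                        # the default, spelled out
    validate="one_to_many",             # raises MergeError if business_id repeats on the left
    suffixes=("_business", "_review"),  # hypothetical names instead of stars_x / stars_y
)
print(merged)
```

Business `c3` has no reviews in this toy data, so it drops out of the result, which mirrors how the row count above falls below the number of filtered businesses times their reviews.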


In [21]:
master_df['cleaned_text'] = master_df['text'].apply(cleaning_out_texts)
master_df['cleaned_text'] = master_df['cleaned_text'].apply(unlist)
master_df.head()

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars_x,review_count,is_open,categories,review_id,user_id,stars_y,date,text,cleaned_text
0,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,vt2HH1VjeZLQE6tMjP48Pg,7aWI9ruXTb0w5WADh-FEVw,1,2017-04-10,Supremacy vibes. You might want to avoid it al...,supremacy vibes want avoid altogether places f...
1,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,JGafXFmK7dDNSIk__-vseA,sno00i53Bv_d0pcMFwTH_w,4,2017-09-20,I will start by saying this is a great place t...,start saying great place home cooked meals com...
2,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,dLjAr4WtzAPx_L618bVgYA,RizIc-wSzzDu1tcb6Ditbw,5,2017-10-17,"I've been coming here fore almost 10 years, I ...",coming fore years believe written review servi...
3,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,j_TeLG6B7VRueO7ritvImw,ugWgc6T9ZywmFyxh2ScpEA,4,2017-05-16,Fantastic place to stop in for breakfast if yo...,fantastic place stop breakfast passing area st...
4,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,38,1,Bakeries;Bagels;Food,XugUlErEm1t1ybegUzcW-g,f6AwvBRnwcTyn3teZIE0hw,2,2017-05-30,"Nice to have a more classic, mom-and-pop style...",nice classic momandpop style diner main needs ...


## Check-in 
    1. Aggregate check-ins by business id and weekday
   

In [22]:
# Clean check-in table
checkin_df = check_in_giant.groupby(['business_id','weekday'])['checkins'].sum().reset_index(name='checkins')
checkin_days_df = checkin_df.pivot_table(values='checkins', index = ['business_id'], columns=['weekday'], aggfunc='first', dropna=True).fillna(0).reset_index()
checkin_days_df

weekday,business_id,Fri,Mon,Sat,Sun,Thu,Tue,Wed
0,--6MefnULPED_I942VcFNA,21.0,19.0,40.0,37.0,7.0,7.0,8.0
1,--7zmmkVg-IMGaXbuVd0SQ,21.0,6.0,64.0,33.0,16.0,7.0,6.0
2,--8LPVSo5i0Oo61X01sV9A,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,--9QQLMTbFzLJ_oT-ON3Xw,6.0,1.0,10.0,6.0,1.0,5.0,4.0
4,--9e1ONYQuAa-CB_Rrw7Tw,294.0,281.0,480.0,791.0,249.0,251.0,222.0
...,...,...,...,...,...,...,...,...
146345,zzvlwkcNR1CCqOPXwuvz2A,1.0,0.0,0.0,0.0,0.0,1.0,0.0
146346,zzwaS0xn1MVEPEf0hNLjew,71.0,34.0,283.0,220.0,56.0,19.0,33.0
146347,zzwhN7x37nyjP0ZM8oiHmw,8.0,1.0,9.0,4.0,3.0,2.0,6.0
146348,zzwicjPC9g246MK2M1ZFBA,17.0,18.0,25.0,30.0,15.0,10.0,12.0
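The `groupby` followed by a pivot with `aggfunc='first'` can also be written as one `pivot_table` call that sums directly. A minimal sketch on made-up rows, assuming the same column names as the check-in table:

```python
import pandas as pd

check_ins = pd.DataFrame({
    "business_id": ["a1", "a1", "a1", "b2"],
    "weekday": ["Fri", "Fri", "Sat", "Mon"],
    "hour": ["18:00", "19:00", "12:00", "9:00"],
    "checkins": [2, 3, 4, 1],
})

# Sum check-ins per business and weekday; missing combinations become 0.
by_day = check_ins.pivot_table(
    values="checkins",
    index="business_id",
    columns="weekday",
    aggfunc="sum",
    fill_value=0,
).reset_index()
print(by_day)
```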


In [23]:
# combine with master_df
master_df = master_df.merge(right=checkin_days_df, on ='business_id', how='left')

In [24]:
# drop null values for weekdays 
master_df.isnull().sum()
master_df = master_df.dropna(subset=['Fri','Sat' ,'Sun','Thu','Mon','Tue','Wed'])

In [25]:
master_df.isnull().sum()
master_df = master_df.reset_index()
master_df

Unnamed: 0,index,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,...,date,text,cleaned_text,Fri,Mon,Sat,Sun,Thu,Tue,Wed
0,0,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,...,2017-04-10,Supremacy vibes. You might want to avoid it al...,supremacy vibes want avoid altogether places f...,9.0,14.0,11.0,15.0,12.0,9.0,13.0
1,1,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,...,2017-09-20,I will start by saying this is a great place t...,start saying great place home cooked meals com...,9.0,14.0,11.0,15.0,12.0,9.0,13.0
2,2,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,...,2017-10-17,"I've been coming here fore almost 10 years, I ...",coming fore years believe written review servi...,9.0,14.0,11.0,15.0,12.0,9.0,13.0
3,3,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,...,2017-05-16,Fantastic place to stop in for breakfast if yo...,fantastic place stop breakfast passing area st...,9.0,14.0,11.0,15.0,12.0,9.0,13.0
4,4,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,...,2017-05-30,"Nice to have a more classic, mom-and-pop style...",nice classic momandpop style diner main needs ...,9.0,14.0,11.0,15.0,12.0,9.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109112,109848,fukaxeFh8W9ijOp8sCrDyA,"""Heritage Fish & Chips""",,"""1 Fisherman Drive""",Brampton,ON,L7A 2X9,43.714303,-79.799411,...,2017-01-05,Food is good. I prefer not such thick fries a...,food good prefer fries cooked fish cooked nice...,5.0,0.0,3.0,1.0,3.0,0.0,2.0
109113,109849,fukaxeFh8W9ijOp8sCrDyA,"""Heritage Fish & Chips""",,"""1 Fisherman Drive""",Brampton,ON,L7A 2X9,43.714303,-79.799411,...,2017-11-19,"Well, definitely don't go here if you're looki...",definitely looking healthy meal ate mushrooms ...,5.0,0.0,3.0,1.0,3.0,0.0,2.0
109114,109850,nGjEV4bn0DPk8bcb0C6Aig,"""Sweet Serendipity Bake Shop""",The Danforth,"""1335 Danforth Avenue""",Toronto,ON,M4J 1N1,43.682054,-79.328996,...,2017-05-18,Yummy carrot cake... And I don't like Carrot C...,yummy carrot cake like carrot cake lol able ac...,0.0,0.0,2.0,4.0,1.0,2.0,1.0
109115,109851,nGjEV4bn0DPk8bcb0C6Aig,"""Sweet Serendipity Bake Shop""",The Danforth,"""1335 Danforth Avenue""",Toronto,ON,M4J 1N1,43.682054,-79.328996,...,2017-07-18,Ordered a dozen cupcakes for a party with vari...,ordered dozen cupcakes party various flavors d...,0.0,0.0,2.0,4.0,1.0,2.0,1.0


In [30]:
master_df = master_df.drop('index', axis=1)

# Feature Engineering
## Convert review words to numbers

1. Apply the TextBlob package to get the polarity and subjectivity scores for each review and append them to the dataframe
2. Apply the VADER sentiment package to get the negative, neutral, positive, and compound scores for each review and append them to the dataframe
<!-- 3. Get the length of the review comments for each row and append to the dataframe -->


In [27]:
# Using TextBlob to get polarity and subjectivity for each review
from textblob import TextBlob

polarity = []
subjectivity = []

for n in range(master_df.shape[0]):
    
    # compute the sentiment once per review and read both scores from it
    sentiment = TextBlob(master_df['cleaned_text'][n]).sentiment
    
    polarity.append(sentiment.polarity)
    subjectivity.append(sentiment.subjectivity)
    
master_df['polarity'] = polarity
master_df['subjectivity'] = subjectivity

In [31]:
master_df.tail()

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars_x,...,cleaned_text,Fri,Mon,Sat,Sun,Thu,Tue,Wed,polarity,subjectivity
109112,fukaxeFh8W9ijOp8sCrDyA,"""Heritage Fish & Chips""",,"""1 Fisherman Drive""",Brampton,ON,L7A 2X9,43.714303,-79.799411,4.0,...,food good prefer fries cooked fish cooked nice...,5.0,0.0,3.0,1.0,3.0,0.0,2.0,0.666667,0.733333
109113,fukaxeFh8W9ijOp8sCrDyA,"""Heritage Fish & Chips""",,"""1 Fisherman Drive""",Brampton,ON,L7A 2X9,43.714303,-79.799411,4.0,...,definitely looking healthy meal ate mushrooms ...,5.0,0.0,3.0,1.0,3.0,0.0,2.0,0.448148,0.685185
109114,nGjEV4bn0DPk8bcb0C6Aig,"""Sweet Serendipity Bake Shop""",The Danforth,"""1335 Danforth Avenue""",Toronto,ON,M4J 1N1,43.682054,-79.328996,4.5,...,yummy carrot cake like carrot cake lol able ac...,0.0,0.0,2.0,4.0,1.0,2.0,1.0,0.5,0.508333
109115,nGjEV4bn0DPk8bcb0C6Aig,"""Sweet Serendipity Bake Shop""",The Danforth,"""1335 Danforth Avenue""",Toronto,ON,M4J 1N1,43.682054,-79.328996,4.5,...,ordered dozen cupcakes party various flavors d...,0.0,0.0,2.0,4.0,1.0,2.0,1.0,0.085256,0.679915
109116,nGjEV4bn0DPk8bcb0C6Aig,"""Sweet Serendipity Bake Shop""",The Danforth,"""1335 Danforth Avenue""",Toronto,ON,M4J 1N1,43.682054,-79.328996,4.5,...,stopped sunny saturday having brunch street fr...,0.0,0.0,2.0,4.0,1.0,2.0,1.0,0.451111,0.654444


In [32]:
master_df.columns

Index(['business_id', 'name', 'neighborhood', 'address', 'city', 'state',
       'postal_code', 'latitude', 'longitude', 'stars_x', 'review_count',
       'is_open', 'categories', 'review_id', 'user_id', 'stars_y', 'date',
       'text', 'cleaned_text', 'Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed',
       'polarity', 'subjectivity'],
      dtype='object')

## VADER-Sentiment-Analysis
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

In [33]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [35]:
score = []
for i in range(master_df.shape[0]):
    score.append(analyser.polarity_scores(master_df['cleaned_text'][i]))
    

In [41]:
#check length
len(score)

109117

In [43]:
score_df = pd.DataFrame(score)
score_df.head(2)

Unnamed: 0,neg,neu,pos,compound
0,0.318,0.455,0.227,-0.25
1,0.187,0.511,0.302,0.802
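`pd.DataFrame` turns a list of dictionaries into one column per key. Attaching those columns to another frame relies on index alignment, which works here because `master_df` was reset to a 0..n-1 index beforehand. The sketch below shows a way to make the alignment explicit. `toy_scores` is a hypothetical stand-in for a scorer that returns a dictionary per review, and the rows are made up.

```python
import pandas as pd

def toy_scores(text):
    """Hypothetical stand-in for a scorer that returns a dict per review."""
    return {"n_words": len(text.split()), "n_chars": len(text)}

reviews = pd.DataFrame(
    {"cleaned_text": ["great place", "terrible service slow", "nice"]},
    index=[10, 11, 12],   # deliberately not 0..n-1
)

# Build the score frame on the same index so the join lines the rows up.
scores = pd.DataFrame(
    [toy_scores(t) for t in reviews["cleaned_text"]],
    index=reviews.index,
)
combined = reviews.join(scores)
print(combined)
```

Without the explicit `index=reviews.index`, the score frame would be indexed 0..2 and the join would produce missing values for this toy data.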


In [46]:
master_df['negative'] = score_df['neg']
master_df['neutral'] = score_df['neu']
master_df['positive'] = score_df['pos']
master_df['compound'] = score_df['compound']
master_df.head(2)

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars_x,...,Sun,Thu,Tue,Wed,polarity,subjectivity,negative,neutral,positive,compound
0,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,...,15.0,12.0,9.0,13.0,0.0,0.0,0.318,0.455,0.227,-0.25
1,xcgFnd-MwkZeO5G2HQ0gAQ,"""T & T Bakery and Cafe""",Markham Village,"""35 Main Street N""",Markham,ON,L3P 1X3,43.875177,-79.260153,4.0,...,15.0,12.0,9.0,13.0,0.162262,0.596502,0.187,0.511,0.302,0.802
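The `compound` score is a single normalized value between -1 and 1. One common way to use it, sketched here as an assumption rather than a step from the notebook, is to map it to a coarse label with the conventional VADER cut-offs of 0.05 and -0.05. The helper name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the two rows shown above.
for value in (-0.25, 0.802):
    print(value, label_from_compound(value))
```

Such a label could be compared against the review's star rating to see how often the lexicon-based sentiment and the rating agree.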


# EDA

# Modelling

## Train/Test split

# Conclusion