## <span style="color:#730101">Data Import and Cleaning</span>

In this notebook, we import the initial dataset from Yelp.com and apply the modifications needed to create the final data used in the rest of the analysis. More specifically, we clean the data, keep only the meaningful columns and apply thorough text preprocessing.

##### <span style="color:#3A3A3A">Import Libraries</span>

In [1]:
import pandas as pd
import numpy as np
import json
import pickle
import glob
import os
import nltk
#nltk.download('words')
import re
import string 

from collections import Counter
from IPython.display import display
from typing import Tuple, List
from unidecode import unidecode
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from ipywidgets import interact
from tqdm.notebook import tqdm
tqdm.pandas()

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')

pd.options.mode.chained_assignment = None


In [2]:
def set_pandas_display_options() -> None:
    """Set pandas display options."""
    display = pd.options.display
    display.max_columns = 1000
    display.max_rows = 1000
    display.max_colwidth = 199
    display.width = None
    # display.precision = 2  # set as needed

set_pandas_display_options()

##### <span style="color:#3A3A3A">Import Datasets</span>

In [3]:
'''
Import Business Dataset
-----------------------
The business dataset contains business data including location data, attributes, and categories.
'''

# Read the dataset in .json format
business = pd.read_json('yelp_academic_dataset_business.json', lines=True)
display(business.head(3))




Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikeParking': 'True', 'GoodForKids': 'False', 'BusinessParking': '{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}', 'By...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping","{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0', 'Wednesday': '10:0-18:0', 'Thursday': '11:0-20:0', 'Friday': '11:0-20:0', 'Saturday': '11:0-20:0', 'Sunday': '13:0-18:0'}"
1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': 'True'}","Health & Medical, Fitness & Instruction, Yoga, Active Life, Pilates",
2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers",


In [4]:
# The business dataset has 209393 unique business_ids
print(business.shape)

(209393, 14)


In [5]:
'''
Import Review Dataset
---------------------
The Review Dataset contains full review text data including the user_id that wrote the review and the business_id the 
review is written for.
'''

# # Process the dataset in chunks of size 400000
# out_path = os.getcwd() #Path to save the pickle files to
# chunk_size = 400000 #size of chunks relies on your available memory

# reader = pd.read_json('yelp_academic_dataset_review.json', lines=True, chunksize=chunk_size)

# for i, chunk in enumerate(reader):
#     out_file = out_path + "/review_{}.pkl".format(i+1)
#     with open(out_file, "wb") as f:
#         pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)      

pickle_path = os.getcwd()  # Same path as out_path, i.e. where the pickle files are

# Collect the pickled chunks
data_p_files = glob.glob(os.path.join(pickle_path, "review_*.pkl"))

# Concatenate all chunks into one dataframe (DataFrame.append was removed in pandas 2.0)
df_reviews = pd.concat([pd.read_pickle(f) for f in data_p_files], ignore_index=True)
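
The chunk-and-pickle workflow above can be tried end to end on a toy file. The following is a minimal sketch, not the code that produced this notebook's data: the file name `toy_reviews.json`, the chunk size of 2 and the helper names `split_json_lines` and `load_pickled_chunks` are assumptions made for the illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def split_json_lines(json_path: str, out_dir: str, chunk_size: int) -> int:
    """Read a JSON-lines file in chunks and pickle each chunk; return the chunk count."""
    n_chunks = 0
    reader = pd.read_json(json_path, lines=True, chunksize=chunk_size)
    for i, chunk in enumerate(reader, start=1):
        chunk.to_pickle(os.path.join(out_dir, "review_{}.pkl".format(i)))
        n_chunks = i
    return n_chunks


def load_pickled_chunks(pattern: str) -> pd.DataFrame:
    """Load every pickle matching the pattern and concatenate them into one dataframe."""
    files = sorted(glob.glob(pattern))
    return pd.concat([pd.read_pickle(f) for f in files], ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical toy input: five reviews, one JSON object per line
    json_path = os.path.join(tmp, "toy_reviews.json")
    with open(json_path, "w") as f:
        for i in range(5):
            f.write('{"review_id": "r%d", "stars": %d}\n' % (i, i % 5 + 1))

    n_chunks = split_json_lines(json_path, tmp, chunk_size=2)
    toy_reviews = load_pickled_chunks(os.path.join(tmp, "review_*.pkl"))

print(n_chunks, toy_reviews.shape)
```

Writing the chunks once and reloading the pickles afterwards avoids re-parsing the large JSON file every time the notebook is run.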

In [6]:
display(df_reviews.head(3))

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2,5,0,0,"As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the...",2015-04-15 05:21:16
1,UmFMZ8PyXZTY2QcwzsfQYA,nIJD_7ZXHq-FX8byPMOkMQ,lbrU8StCq3yDfr-QMnGrmQ,1,1,1,0,I am actually horrified this place is still in business. My 3 year old son needed a haircut this past summer and the lure of the $7 kids cut signs got me in the door. We had to wait a few minutes...,2013-12-07 03:16:52
2,LG2ZaYiOgpr2DK_90pYjNw,V34qejxNsCbcgD8C0HVk-Q,HQl28KMwrEKHqhFrrDqVNQ,5,1,0,0,"I love Deagan's. I do. I really do. The atmosphere is cozy and festive. The shrimp tacos and house fries are my standbys. The fries are sometimes good and sometimes great, and the spicy dipping s...",2015-12-05 03:18:11


In [7]:
# The reviews dataset has 8021122 unique review_ids
df_reviews.shape

(8021122, 9)

In [8]:
'''
Import User Dataset
-------------------
The User Dataset contains the user's friend mapping and all the metadata associated with the user.
'''

# # Process the dataset in chunks of size 400000
# out_path = os.getcwd() #Path to save the pickle files to
# chunk_size = 400000 #size of chunks relies on your available memory

# reader = pd.read_json('yelp_academic_dataset_user.json', lines=True, chunksize=chunk_size)

# for i, chunk in enumerate(reader):
#     out_file = out_path + "/user_{}.pkl".format(i+1)
#     with open(out_file, "wb") as f:
#         pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)
        
pickle_path = os.getcwd()  # Same path as out_path, i.e. where the pickle files are

# Collect the pickled chunks
data_p_files = glob.glob(os.path.join(pickle_path, "user_*.pkl"))

# Concatenate all chunks into one dataframe (DataFrame.append was removed in pandas 2.0)
df_user = pd.concat([pd.read_pickle(f) for f in data_p_files], ignore_index=True)

In [9]:
display(df_user.head(3))

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,ntlvfPzc8eglqvk92iDIAw,Rafael,553,2007-07-06 03:27:11,628,225,227,,"oeMvJh94PiGQnx_6GlndPQ, wm1z1PaJKvHgSDRKfwhfDg, IkRib6Xs91PPW7pon7VVig, A8Aq8f0-XvLBcyMk2GJdJQ, eEZM1kogR7eL4GOBZyPvBA, e1o1LN7ez5ckCpQeAab4iw, _HrJVzFaRFUhPva8cwBjpQ, pZeGZGzX-ROT_D5lam5uNg, 0S6...",14,3.57,3,2,1,0,1,11,15,22,22,10,0
1,FOBRPlBHa3WPHFB5qYDlVg,Michelle,564,2008-04-28 01:29:25,790,316,400,200820092010201120122013,"ly7EnE8leJmyqyePVYFlug, pRlR63iDytsnnniPb3AOug, kc-rnN-ndnFTdHG4TfIgeQ, GYndf-h6dAwpGP0lDBz2Wg, FPo3SwQuAK53QVZm_eIyBg, 9fF_T3pQu3ay1oA7h_VYNA, G5T3bd6dUs5zkQ2VMZtRUw, tufuEc5f9TWR05_yko46QQ, 4lM...",27,3.84,36,4,5,2,1,33,37,63,63,21,5
2,zZUnPeh2hEp0WydbAZEOOg,Martin,60,2008-08-28 23:40:05,151,125,103,2010,"Uwlk0txjQBPw_JhHsQnyeg, Ybxr1tSCkv3lYA0I1qmnPQ, DNmeLov3wXNxlxjN5feBoQ, x7n69vEsYFh9xnW3D5lPPQ, -AaBjWJYiQxXkCMDlXfPGw, COXnA2hnzFDai3ywx_iM8A, dUFoyswTt5ZQbleF3_4TCg, uj2AWSvsspbrkebc_jqt4w, MWa...",5,3.44,9,6,0,1,0,3,7,17,17,4,1


In [10]:
# The user dataset has 1968703 unique user_ids
df_user.shape

(1968703, 22)

##### <span style="color:#3A3A3A">Data Cleaning</span>

In [11]:
# We keep the columns we need from the reviews and user datasets and merge them with the business dataset:
# reviews and businesses are joined on business_id, and the result is joined with users on user_id.
# The aim is a dataframe that holds, for every review, the reviewed business and the user who wrote it.

# Keep columns from reviews dataset
reviews = df_reviews[['review_id','user_id','business_id','stars','text','date']]
# Merge the reviews and business datasets
review_business = reviews.merge(business, on='business_id')

# Keep columns from user dataset
user = df_user[['user_id', 'name', 'review_count', 'average_stars']]
# Merge the user, business and reviews datasets
restaurants = review_business.merge(user, on='user_id')
display(restaurants.head(3))

Unnamed: 0,review_id,user_id,business_id,stars_x,text,date,name_x,address,city,state,postal_code,latitude,longitude,stars_y,review_count_x,is_open,attributes,categories,hours,name_y,review_count_y,average_stars
0,xQY8N_XvtGbearJ5X4QryQ,OwjRMXRC0KyPrIlcjaXeFQ,-MhfebM0QIsKt87iDN-FNw,2,"As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the...",2015-04-15 05:21:16,Bellagio Gallery of Fine Art,3600 S Las Vegas Blvd,Las Vegas,NV,89109,36.112896,-115.177637,3.5,180,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'GoodForKids': 'False', 'BusinessParking': '{'garage': True, 'street': False, 'validated': False, 'lot': False, 'valet': Fals...","Shopping, Arts & Entertainment, Art Galleries, Museums","{'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0', 'Wednesday': '10:0-20:0', 'Thursday': '10:0-20:0', 'Friday': '10:0-20:0', 'Saturday': '10:0-20:0', 'Sunday': '10:0-20:0'}",Jamie,58,3.36
1,SjfnCrMCgOiWafnQuCKlhw,OwjRMXRC0KyPrIlcjaXeFQ,9SU7ZZhaFUJJ6m2k5HKHeg,1,SLS just opened in August and they have so many kinks to work out. Worst Vegas hotel stay ever (03/13-15). \n\nFriends and I booked two hotel rooms (Kings in the World Tower) and there were hiccu...,2015-03-19 06:17:28,Sahara,2535 Las Vegas Blvd S,Las Vegas,NV,89109,36.142375,-115.156723,3.0,2259,1,"{'BikeParking': 'False', 'Ambience': '{'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'divey': False, 'touristy': False, 'trendy': True, 'upscale': False, 'casual': Fals...","Hotels & Travel, Nightlife, Hotels, Event Planning & Services, Lounges, Bars, Arts & Entertainment, Casinos","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'Wednesday': '0:0-0:0', 'Thursday': '0:0-0:0', 'Friday': '0:0-0:0', 'Saturday': '0:0-0:0', 'Sunday': '0:0-0:0'}",Jamie,58,3.36
2,t7xOZF5UKXjSpVcXLOSAgw,owbC7FP8SNAlwv6f9S5Stw,-MhfebM0QIsKt87iDN-FNw,2,"I have been there. I believe more than once. \nI was not in awe by this gallery.\nIt was Ok.\nI thought it would be bigger and have more art, and flair.\nI guess not.\nI am glad it exists, I gues...",2014-03-14 08:24:25,Bellagio Gallery of Fine Art,3600 S Las Vegas Blvd,Las Vegas,NV,89109,36.112896,-115.177637,3.5,180,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'GoodForKids': 'False', 'BusinessParking': '{'garage': True, 'street': False, 'validated': False, 'lot': False, 'valet': Fals...","Shopping, Arts & Entertainment, Art Galleries, Museums","{'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0', 'Wednesday': '10:0-20:0', 'Thursday': '10:0-20:0', 'Friday': '10:0-20:0', 'Saturday': '10:0-20:0', 'Sunday': '10:0-20:0'}",Girl,250,3.42
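
The `_x` and `_y` suffixes in the merged table are the defaults pandas applies when both sides of a merge share a column name other than the join key. As a hedged alternative, the sketch below passes explicit `suffixes` so the columns are readable straight after the merge. The toy frames and the suffix names are assumptions for the illustration, and `validate="many_to_one"` is an optional guard that raises if the right-hand key is not unique.

```python
import pandas as pd

# Toy stand-ins for the three Yelp tables (values are illustrative only)
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u1", "u2"],
    "business_id": ["b1", "b2", "b1"],
    "stars": [2, 1, 5],
})
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Gallery", "Hotel"],
    "stars": [3.5, 3.0],
})
user = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "name": ["Jamie", "Girl"],
})

# "stars" exists on both sides, so it receives the explicit suffixes
review_business = reviews.merge(
    business, on="business_id",
    suffixes=("_review", "_business"), validate="many_to_one",
)
# "name" now exists on both sides (business name vs. user name)
merged = review_business.merge(
    user, on="user_id",
    suffixes=("_business", "_user"), validate="many_to_one",
)

print(list(merged.columns))
```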


In [12]:
# From the merged dataset we keep only the records whose categories contain the word "restaurant"
restaurants['categories'] = restaurants['categories'].str.lower()
restaurants = restaurants[restaurants['categories'].str.contains('restaurant', na=False)].reset_index()
# We also keep only the restaurants that are open
restaurants = restaurants[restaurants['is_open']==1]

# We rename some columns in order to facilitate their use
restaurants = restaurants.rename(columns={
    "stars_x": "review_stars",
    "name_x": "business_name",
    "stars_y": "rating",
    "review_count_x": "num_of_business_reviews",
    "name_y": "user_name",
    "review_count_y": "num_of_reviews",
})

# We also remove records in which the user has done more than one review in the same business and we keep only the 
# most recent one
restaurants=restaurants.sort_values('date', ascending=False).drop_duplicates(subset=['user_id', 'business_id'])
# The dataset, now, contains 4078659 records
print(restaurants.shape)

(4078659, 23)
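
The deduplication step relies on `drop_duplicates` keeping the first occurrence of each key by default, so sorting by date in descending order first means the most recent review survives. A small sketch on made-up rows (the toy data is an assumption for the illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u1"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "date": pd.to_datetime(["2015-03-19", "2019-12-13", "2014-03-14", "2016-01-01"]),
    "review_stars": [1, 4, 2, 5],
})

latest = (
    toy.sort_values("date", ascending=False)
       # keep="first" is the default, so the newest row per (user, business) is kept
       .drop_duplicates(subset=["user_id", "business_id"])
       .reset_index(drop=True)
)

print(latest)
```

User `u1` reviewed business `b1` twice; only the 2019 review remains.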


In [13]:
# We take a look at the states included in our dataset in order to reduce the number of records.
group_restaur = restaurants[['state','text']]
group_restaur.groupby(['state']).count().sort_values(by=['text'], ascending=False)

state,text
NV,1362885
AZ,1168347
ON,521055
NC,278926
OH,225093
PA,202979
QC,136063
WI,84839
AB,57286
IL,25048


In [14]:
# we keep only Nevada State as it contains enough records for our analysis
Nevada = restaurants[restaurants['state']=='NV']
Nevada = Nevada.drop(columns='index')
Nevada = Nevada.reset_index(drop=True)
display(Nevada.head(3))

Unnamed: 0,review_id,user_id,business_id,review_stars,text,date,business_name,address,city,state,postal_code,latitude,longitude,rating,num_of_business_reviews,is_open,attributes,categories,hours,user_name,num_of_reviews,average_stars
0,QiImvuUidV_SKI7EZYD83A,cvA8vHPR0Gs0zsPnyv6JEQ,L6URxfBxYvwbr0U6xOzY1Q,4,"Hubby and I were cutting through the Tropicana to go elsewhere for breakfast. I was sold on the $2 mimosas and made to order omelets and waffles. \n\nThere wasn't a crowd, but the staff kept ever...",2019-12-13 15:50:49,Savor Brunch Buffet,3801 S Las Vegas Blvd,Las Vegas,NV,89109,36.099635,-115.171358,3.5,21,1,"{'Ambience': '{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casual': False}', 'HasTV': 'True', '...","breakfast & brunch, buffets, restaurants","{'Monday': '7:0-13:0', 'Tuesday': '7:0-13:0', 'Wednesday': '7:0-13:0', 'Thursday': '7:0-13:0', 'Friday': '7:0-13:0', 'Saturday': '7:0-13:0', 'Sunday': '7:0-13:0'}",Taheerah,496,3.89
1,OKLFoPEC2bmj0gkd3mg2og,qPt9_SU60HY9NJUFOBh8Ew,0m99JzzybBddbP9mr2y5XA,5,"I went to the Chicken Shack with a group of friends as part of the Pinecrest Sloan Canyon Foodie Club, and I really enjoyed the experience! The chicken tenders had a delicious flavor due to their...",2019-12-13 15:48:50,The Chicken Shack,"10445 Spencer St, Ste 120",Las Vegas,NV,89183,35.999082,-115.127911,4.0,152,1,"{'BusinessAcceptsCreditCards': 'True', 'OutdoorSeating': 'False', 'RestaurantsTableService': 'False', 'Caters': 'True', 'GoodForKids': 'True', 'Alcohol': 'u'none'', 'RestaurantsDelivery': 'True',...","fast food, restaurants, chicken shop","{'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0', 'Wednesday': '10:0-21:0', 'Thursday': '10:0-21:0', 'Friday': '10:0-22:0', 'Saturday': '10:0-22:0', 'Sunday': '10:0-21:0'}",Dane,1,5.0
2,izOSwMP2js_ptjDQZsynig,KriIEvoyWwhoswBoqqUpzA,faPVqws-x-5k2CQKDNtHxw,5,"We just recently returned from Las Vegas and had the pleasure of stopping by your restaurant on Tuesday, December 10th for lunch. We were seated in the lounge and was greeted by an amazing server...",2019-12-13 15:34:17,Yardbird Southern Table & Bar,3355 Las Vegas Blvd S,Las Vegas,NV,89109,36.122328,-115.170112,4.5,4828,1,"{'BusinessAcceptsCreditCards': 'True', 'BikeParking': 'True', 'OutdoorSeating': 'False', 'RestaurantsReservations': 'True', 'RestaurantsPriceRange2': '2', 'NoiseLevel': ''average'', 'RestaurantsA...","restaurants, american (new), southern, nightlife, bars, cocktail bars","{'Monday': '0:0-0:0', 'Tuesday': '11:0-23:0', 'Wednesday': '11:0-23:0', 'Thursday': '11:0-23:0', 'Friday': '11:0-0:0', 'Saturday': '10:0-0:0', 'Sunday': '10:0-23:0'}",Christine,1,5.0


In [15]:
# The Nevada Dataset, now, contains 1362885 unique reviews.
Nevada.shape

(1362885, 22)

##### <span style="color:#3A3A3A">Dataset Transformation</span>

In [16]:
# First, we are going to transform the cities in order to achieve a uniform appearance
Nevada['city'] = Nevada['city'].str.upper()
cities = Nevada.city
# Calculate the number of reviews per city
counts = cities.value_counts()
# Calculate number of unique businesses per city
unique_businesses =  Nevada.groupby('city')['business_id'].nunique()
# Calculate the average number of reviews per business in each city
avg_reviews = round((counts/unique_businesses),2)
table = pd.DataFrame({'counts':counts,'avg_reviews':avg_reviews,'unique_businesses':unique_businesses}).sort_values(by=['unique_businesses'], ascending=False)
display(table)

city,counts,avg_reviews,unique_businesses
LAS VEGAS,1212842,276.65,4384
HENDERSON,113564,189.27,600
NORTH LAS VEGAS,25772,98.37,262
BOULDER CITY,6428,169.16,38
N LAS VEGAS,194,38.8,5
NELLIS AFB,93,18.6,5
ENTERPRISE,336,112.0,3
LAS  VEGAS,488,162.67,3
SPRING VALLEY,1999,666.33,3
N. LAS VEGAS,55,27.5,2


In [17]:
# Replace the cities with uniform names
Nevada['city'] = Nevada['city'].replace(['N LAS VEGAS', 'N. LAS VEGAS', 'N.LAS VEGAS'], 'NORTH LAS VEGAS')
Nevada['city'] = Nevada['city'].replace(['ENTERPRISE', 'SPRING VALLEY','4321 W FLAMINGO RD','SOUTH LAS VEGAS','SUMMERLIN','BLUE DIAMOND'], 'LAS VEGAS')
Nevada['city'] = Nevada['city'].replace(['LAS  VEGAS'], 'LAS VEGAS')

# Delete the rows of these cities from the dataframe, as they contain little information
cities_to_drop = ['SAN ANTONIO', 'SUNRISE MANOR', 'NELLIS AIR FORCE BASE', 'NELLIS AFB']
Nevada.drop(Nevada.loc[Nevada['city'].isin(cities_to_drop)].index, inplace=True)

In [18]:
# We have another look at the cities now
cities = Nevada.city
# Calculate the number of reviews per city
counts = cities.value_counts()
# Calculate number of unique businesses per city
unique_businesses =  Nevada.groupby('city')['business_id'].nunique()
# Calculate the average number of reviews per business in each city
avg_reviews = round((counts/unique_businesses),2)
table = pd.DataFrame({'counts':counts,'avg_reviews':avg_reviews,'unique_businesses':unique_businesses}).sort_values(by=['unique_businesses'], ascending=False)
table

city,counts,avg_reviews,unique_businesses
LAS VEGAS,1216664,276.7,4397
HENDERSON,113564,189.27,600
NORTH LAS VEGAS,26073,96.57,270
BOULDER CITY,6428,169.16,38
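
The city clean-up and the per-city summary can also be expressed with a single alias mapping and one `groupby`. The sketch below is a compact variant under stated assumptions: the toy rows, the `aliases` dictionary and the `dropped` set are illustrative, and named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Las Vegas", "LAS  VEGAS", "N. Las Vegas", "Henderson", "North Las Vegas", "Nellis AFB"],
    "business_id": ["b1", "b1", "b2", "b3", "b2", "b4"],
})

# Map every spelling variant to its canonical name (illustrative subset)
aliases = {
    "N LAS VEGAS": "NORTH LAS VEGAS",
    "N. LAS VEGAS": "NORTH LAS VEGAS",
    "N.LAS VEGAS": "NORTH LAS VEGAS",
    "LAS  VEGAS": "LAS VEGAS",
}
dropped = {"NELLIS AFB"}  # cities with too few records to keep

toy["city"] = toy["city"].str.upper().replace(aliases)
toy = toy[~toy["city"].isin(dropped)].reset_index(drop=True)

# One row per city: number of reviews, number of distinct businesses, and their ratio
summary = toy.groupby("city").agg(
    counts=("business_id", "size"),
    unique_businesses=("business_id", "nunique"),
)
summary["avg_reviews"] = (summary["counts"] / summary["unique_businesses"]).round(2)

print(summary)
```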


In [19]:
# The Nevada Dataset, now, contains 1362729 unique reviews.
Nevada.shape

(1362729, 22)

## <span style="color:#730101">Text Processing</span>

##### <span style="color:#3A3A3A">Lower Text</span>

In [20]:
# Create a new column to which the text processing will be applied.
Nevada["cleanText"] = Nevada["text"]

# Convert to lowercase
Nevada.cleanText = Nevada.cleanText.str.lower()

##### <span style="color:#3A3A3A">Remove non-English words and symbols</span>

In [21]:
# We discovered that our dataset also contains non-English characters, such as Chinese characters

words = set(nltk.corpus.words.words())

def eng_words(text):
    new=" ".join(w for w in nltk.wordpunct_tokenize(text) 
         if w.lower() in words or not w.isalpha())
    eng = re.sub("[^a-zA-Z]+", " ", new)
    return eng

Nevada['cleanText'] = Nevada['cleanText'].progress_apply(eng_words)
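
To see what this filter does to a single review, here is a self-contained sketch that mirrors `eng_words` without downloading the NLTK corpus. Two things are stand-ins and not the notebook's actual inputs: `TOY_VOCAB` replaces `nltk.corpus.words`, and the regular expression `\w+|[^\w\s]+` approximates `nltk.wordpunct_tokenize`.

```python
import re

# Hypothetical miniature vocabulary standing in for nltk.corpus.words.words()
TOY_VOCAB = {"the", "staff", "was", "not", "friendly", "we", "loved", "a", "t"}


def keep_known_words(text: str, vocab: set) -> str:
    """Drop alphabetic tokens missing from vocab, then replace non-letters with spaces."""
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    kept = " ".join(t for t in tokens if t.lower() in vocab or not t.isalpha())
    return re.sub("[^a-zA-Z]+", " ", kept)


sample = "we loved the tropicana, the staff wasn't friendly 很好"
cleaned = keep_known_words(sample, TOY_VOCAB)
print(cleaned)  # we loved the the staff t friendly
```

Besides the Chinese characters, the filter also removes proper nouns that are not in the word list ("tropicana") and splits "wasn't" into pieces, of which only the stray "t" survives. The same effect is visible in the `cleanText` column of this notebook.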


##### <span style="color:#3A3A3A">Contractions</span>

In [22]:
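# Based on https://stackoverflow.com/questions/43018030/replace-apostrophe-short-words-in-python
# we expand contractions, i.e. short forms written with an apostrophe (e.g. "won't" becomes "will not").
# Note: the previous cell already replaced every non-letter character, apostrophes included, with a space,
# so with this cell order the patterns below find nothing to match in cleanText.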

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    
    return phrase

Nevada['cleanText'] = Nevada['cleanText'].progress_apply(decontracted)
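
Every pattern in `decontracted` contains an apostrophe, so the function only changes text that still has its apostrophes. The sketch below compares the two possible orderings on one sentence. `expand_contractions` is a shortened copy of the function above and `letters_only` is a hypothetical helper that reproduces the letter-only substitution from the earlier step.

```python
import re


def expand_contractions(phrase: str) -> str:
    """Shortened version of decontracted: expand a few common contractions."""
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase


def letters_only(text: str) -> str:
    """Replace every run of non-letter characters with a single space."""
    return re.sub("[^a-zA-Z]+", " ", text).strip()


raw = "there wasn't a crowd and we'll be back"

expanded_first = letters_only(expand_contractions(raw))
stripped_first = expand_contractions(letters_only(raw))

print(expanded_first)  # there was not a crowd and we will be back
print(stripped_first)  # there wasn t a crowd and we ll be back
```

Expanding contractions before stripping non-letters preserves the negation ("was not"), which can matter for sentiment-related analysis.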


##### <span style="color:#3A3A3A">Remove Punctuation</span>

In [23]:
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

Nevada['cleanText'] = Nevada['cleanText'].progress_apply(remove_punctuations)
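
An equivalent way to delete punctuation in a single pass is a translation table built with `str.maketrans`. This is offered as an alternative sketch; the name `strip_punctuation` is an assumption for the example.

```python
import string

# Translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Delete all ASCII punctuation characters from text."""
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("great food!!! (5/5)"))  # great food 55
```

Like the loop above, this deletes the characters and does not replace them with spaces, so "5/5" becomes "55".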


##### <span style="color:#3A3A3A">Convert Accented Characters</span>

In [24]:
# Based on the insights of https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79
# we convert accented characters, like the é in café, to their closest ASCII equivalents

def remove_accented_chars(text):
    """Convert accented characters to their closest ASCII equivalents, e.g. café -> cafe."""
    text = unidecode(text)
    return text

Nevada['cleanText'] = Nevada['cleanText'].progress_apply(remove_accented_chars)
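
When `unidecode` is not available, accents on Latin letters can be removed with the standard library alone. The sketch below uses a different technique from the cell above: it decomposes each character with `unicodedata.normalize` and drops the combining marks. Unlike `unidecode`, it does not transliterate characters that have no decomposition, so it is only a partial substitute.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining accent marks after Unicode decomposition (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café crème brûlée"))  # cafe creme brulee
```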


##### <span style="color:#3A3A3A">Remove Whitespace</span>

In [25]:
def remove_whitespace(text):
    text = text.strip()
    return text

Nevada['cleanText'] = Nevada['cleanText'].progress_apply(remove_whitespace)
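
`str.strip` only removes leading and trailing whitespace. For text that may also contain repeated spaces, tabs or newlines in the middle, a common idiom is to split on whitespace and re-join with single spaces. A minimal sketch (the function name is an assumption for the example):

```python
def normalize_whitespace(text: str) -> str:
    """Trim the ends and collapse every internal whitespace run to one space."""
    return " ".join(text.split())


print(normalize_whitespace("  great   food \n here "))  # great food here
```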


##### <span style="color:#3A3A3A">Tokenization</span>

In [26]:
# The next step is to tokenize the review text with word_tokenize from NLTK.
# We split the text into word tokens and remove any tokens that are only one character long.
# Finally, we remove tokens that are purely numeric, but keep words that contain numbers.

# Create a new column to which the tokenization will be applied.
Nevada["tokenText"]=Nevada["cleanText"]

def clean_token(text):
    return word_tokenize(text)

Nevada["tokenText"] = Nevada["tokenText"].progress_apply(clean_token)

#Remove words that are only one character.
Nevada["tokenText"] = [[token for token in doc if len(token) > 1] for doc in Nevada["tokenText"]]

# Remove numbers, but not words that contain numbers.
Nevada["tokenText"] = [[token for token in doc if not token.isnumeric()] for doc in Nevada["tokenText"]]

display(Nevada.tokenText)


0          [hubby, and, were, cutting, through, the, to, go, elsewhere, for, breakfast, was, sold, on, the, and, made, to, order, and, there, crowd, but, the, staff, kept, everything, on, the, buffet, fresh...
1          [went, to, the, chicken, shack, with, group, of, as, part, of, the, sloan, canyon, club, and, really, the, experience, the, chicken, had, delicious, flavor, due, to, their, seasoning, on, the, an...
2          [we, just, recently, returned, from, las, and, had, the, pleasure, of, stopping, by, your, restaurant, on, th, for, lunch, we, were, seated, in, the, lounge, and, was, by, an, amazing, server, we...
3          [been, here, three, times, now, one, of, my, me, to, this, place, the, staff, are, very, friendly, the, restaurant, to, clean, and, the, soup, is, delicious, if, you, re, in, the, area, and, you,...
4          [when, we, in, we, that, the, was, nice, and, the, restaurant, was, clean, hubby, and, were, at, all, of, the, on, the, menu, we, were, in, heaven, o
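
The two token filters can be checked on a single sentence. In the sketch below, `str.split` is a simplified stand-in for `word_tokenize` (an assumption made so the example runs without NLTK data); the length and `isnumeric` filters are the same as in the cell above.

```python
def tokenize_and_filter(text: str) -> list:
    """Split on whitespace, then drop one-character tokens and purely numeric tokens."""
    tokens = text.split()  # stand-in for nltk.tokenize.word_tokenize
    tokens = [t for t in tokens if len(t) > 1]
    tokens = [t for t in tokens if not t.isnumeric()]
    return tokens


print(tokenize_and_filter("i went in 2019 to gate b12"))
# ['went', 'in', 'to', 'gate', 'b12']
```

"2019" is removed because it is purely numeric, while "b12" is kept because it mixes letters and digits.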

##### <span style="color:#3A3A3A">Stopwords</span>

In [27]:
# Another step in the text cleaning process is the removal of stopwords. Stopwords are commonly used,
# low-information words such as "a", "the" and "is". Removing them lets the analysis focus on the
# words that carry meaning.

stopwords = (['ll',  'namely', 'fifty', 'sure', 'bottom', 'rd', 'indicate', 'usefully', 'anyway', 'or', 'haven', 'thorough', 'different', 'she', 'looking', 'whim',
 've', 'nay', 'fify', 'system', 'brief', 're', 'indicated', 'usefulness', 'anyways', 'ord', 'havent', 'thoroughly', 'do', 'shed', 'looks', 'whither',
 'a', 'nd', 'fill', 't', 'briefly', 'readily', 'indicates', 'uses', 'anywhere', 'other', 'having', 'those', 'does', 'shell', 'ltd', 'who',
 'able', 'near', 'find', 'take', 'but', 'really', 'information', 'using', 'apart', 'others', 'he', 'thou', 'doesn', 'shes', 'm', 'whod',
 'about', 'nearly', 'fire', 'taken', 'by', 'reasonably', 'inner', 'usually', 'apparently', 'otherwise', 'hed', 'though', 'doesnt', 'should', 'ma', 'whoever',
 'above', 'necessarily', 'first', 'taking', 'c', 'recent', 'insofar', 'v', 'appear', 'ought', 'hell', 'thoughh', 'doing', 'shouldn', 'made', 'whole',
 'abst', 'necessary', 'five', 'tell', "c'mon", 'recently', 'instead', 'value', 'appreciate', 'our', 'hello', 'thousand', 'don', 'shouldnt', 'mainly', 'wholl',
 'accordance', 'need', 'fix', 'ten', "c's", 'ref', 'interest', 'various', 'appropriate', 'ours', 'help', 'three', "don't", 'shouldve', 'make', 'whom',
 'according', 'needn', 'followed', 'tends', 'ca', 'refs', 'into', 've', 'approximately', 'ourselves', 'hence', 'throug', 'done', 'show', 'makes', 'whomever',
 'accordingly', "needn't", 'following', 'th', 'call', 'regarding', 'invention', 'very', 'are', 'out', 'her', 'through', 'down', 'showed', 'many', 'whos',
 'across', 'needs', 'follows', 'than', 'came', 'regardless', 'inward', 'via', 'aren', 'outside', 'here', 'throughout', 'downwards', 'shown', 'may', 'whose',
 'act', 'neither', 'for', 'thank', 'can', 'regards', 'is', 'viz', 'arent', 'over', 'hereafter', 'thru', 'due', 'showns', 'maybe', 'why',
 'actually', 'never', 'former', 'thanks', 'cannot', 'related', 'isn', 'vol', 'arise', 'overall', 'hereby', 'thus', 'during', 'shows', 'me', 'whys',
 'added', 'nevertheless', 'formerly', 'thanx', 'cant', 'relatively', 'isnt', 'vols', 'around', 'owing', 'herein', 'til', 'e', 'side', 'mean', 'widely',
 'adj', 'new', 'forth', 'that', 'cause', 'research', 'it', 'vs', 'as', 'own', 'heres', 'tip', 'each', 'significant', 'means', 'will',
 'affected', 'next', 'forty', "that'll", 'causes', 'respectively', 'itd', 'w', 'aside', 'p', 'hereupon', 'to', 'ed', 'significantly', 'meantime', 'willing',
 'affecting', 'nine', 'found', 'thats', 'certain', 'resulted', 'itll', 'want', 'ask', 'page', 'hers', 'today', 'edu', 'similar', 'meanwhile', 'wish',
 'affects', 'ninety', 'four', 'thatve', 'certainly', 'resulting', 'its', 'wants', 'asking', 'pages', 'herself', 'together', 'effect', 'similarly', 'merely', 'with',
 'after', 'nobody', 'from', 'the', 'changes', 'results', 'itself', 'was', 'associated', 'part', 'hes', 'tomorrow', 'eg', 'since', 'mg', 'within',
 'afterwards', 'non', 'front', 'their', 'clearly', 'right', 'ive', 'wasn', 'at', 'particular', 'hi', 'too', 'eight', 'sincere', 'might', 'without',
 'again', 'none', 'full', 'theirs', 'co', 'run', 'j', 'wasnt', 'auth', 'particularly', 'hid', 'took', 'eighty', 'six', 'mightn', 'won',
 'against', 'nonetheless', 'further', 'them', 'com', 's', 'just', 'way', 'available', 'past', 'him', 'top', 'either', 'sixty', 'mightnt', 'wonder',
 'ah', 'noone', 'furthermore', 'themselves', 'come', 'said', 'k', 'we', 'away', 'per', 'himself', 'toward', 'eleven', 'slightly', 'mill', 'wont',
 'ain', 'nor', 'g', 'then', 'comes', 'same', 'keep', 'wed', 'awfully', 'perhaps', 'his', 'towards', 'else', 'so', 'million', 'words',
 'aint', 'normally', 'gave', 'thence', 'con', 'saw', 'keeps', 'welcome', 'b', 'placed', 'hither', 'tried', 'elsewhere', 'some', 'mine', 'world',
 'all', 'nos', 'get', 'there', 'concerning', 'say', 'kept', 'well', 'back', 'please', 'home', 'tries', 'empty', 'somebody', 'miss', 'would',
 'allow', 'noted', 'gets', 'thereafter', 'consequently', 'saying', 'kg', 'went', 'be', 'plus', 'hopefully', 'truly', 'end', 'somehow', 'ml', 'wouldn',
 'allows', 'nothing', 'getting', 'thereby', 'consider', 'says', 'km', 'were', 'became', 'poorly', 'how', 'try', 'ending', 'someone', 'more', 'wouldnt',
 'almost', 'novel', 'give', 'thered', 'considering', 'sec', 'know', 'weren', 'because', 'possible', 'howbeit', 'trying', 'enough', 'somethan', 'moreover', 'www',
 'alone', 'now', 'given', 'therefore', 'contain', 'second', 'known', 'werent', 'become', 'possibly', 'however', 'ts', 'entirely', 'something', 'morning', 'x',
 'along', 'nowhere', 'gives', 'therein', 'containing', 'secondly', 'knows', 'weve', 'becomes', 'potentially', 'hows', 'twelve', 'especially', 'sometime', 'most', 'y',
 'already', 'o', 'giving', 'therell', 'contains', 'section', 'l', 'what', 'becoming', 'pp', 'http', 'twenty', 'et', 'sometimes', 'mostly', 'yes',
 'also', 'obtain', 'go', 'thereof', 'corresponding', 'see', 'largely', 'whatever', 'been', 'predominantly', 'https', 'twice', 'etc', 'somewhat', 'move', 'yet',
 'although', 'obtained', 'goes', 'therere', 'could', 'seeing', 'last', 'whatll', 'before', 'present', 'hundred', 'two', 'even', 'somewhere', 'mr', 'you',
 'always', 'obviously', 'going', 'theres', 'couldn', 'seem', 'lately', 'whats', 'beforehand', 'presumably', 'i', 'u', 'ever', 'soon', 'mrs', 'youd',
 'am', 'of', 'gone', 'thereto', 'couldnt', 'seemed', 'later', 'when', 'begin', 'previously', 'id', 'un', 'every', 'sorry', 'much', 'youll',
 'among', 'off', 'got', 'thereupon', 'course', 'seeming', 'latter', 'whence', 'beginning', 'primarily', 'ie', 'under', 'everybody', 'specifically', 'mug', 'your',
 'amongst', 'often', 'gotten', 'thereve', 'cry', 'seems', 'latterly', 'whenever', 'beginnings', 'probably', 'if', 'unfortunately', 'everyone', 'specified', 'must', 'youre',
 'amoungst', 'oh', 'greetings', 'these', 'currently', 'seen', 'least', 'whens', 'begins', 'promptly', 'ignored', 'unless', 'everything', 'specify', 'mustn', 'yours',
 'amount', 'ok', 'h', 'they', 'd', 'self', 'less', 'where', 'behind', 'proud', 'ill', 'unlike', 'everywhere', 'specifying', 'mustnt', 'yourself',
 'an', 'okay', 'had', 'theyd', 'date', 'selves', 'lest', 'whereafter', 'being', 'provides', 'im', 'unlikely', 'ex', 'still', 'my', 'yourselves',
 'and', 'old', 'hadn', 'theyll', 'de', 'sensible', 'let', 'whereas', 'believe', 'put', 'immediate', 'until', 'exactly', 'stop', 'myself', 'yourselvesme',
 'announce', 'omitted', "hadn't", 'theyre', 'definitely', 'sent', 'lets', 'whereby', 'below', 'q', 'immediately', 'unto', 'example', 'strongly', 'n', 'youve',
 'another', 'on', 'happens', 'theyve', 'describe', 'serious', 'like', 'wherein', 'beside', 'que', 'importance', 'up', 'except', 'sub', 'na', 'z',
 'any', 'once', 'hardly', 'thick', 'described', 'seriously', 'liked', 'wheres', 'besides', 'quickly', 'important', 'upon', 'f', 'substantially', 'name', 'zero',
 'anybody', 'one', 'has', 'thickv', 'despite', 'seven', 'likely', 'whereupon', 'between', 'quite', 'in', 'ups', 'far', 'successfully',
 'anyhow', 'ones', 'hasn', 'thin', 'detail', 'several', 'line', 'wherever', 'beyond', 'qv', 'inasmuch', 'us', 'few', 'such',
 'anymore', 'only', "hasn't", 'think', 'did', 'shall', 'little', 'whether', 'bill', 'r', 'inc', 'use', 'ff', 'sufficiently',
 'anyone', 'onto', 'hasnt', 'third', 'didn', 'shan', 'll', 'which', 'biol', 'ran', 'indeed', 'used', 'fifteen', 'suggest',
 'anything', 'opa', 'have', 'this', "didn't", 'shant', 'look', 'while', 'both', 'rather', 'index', 'useful', 'fifth', 'sup'
])

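# Membership tests on a set take constant time, which matters with more than a million reviews
stopword_set = set(stopwords)
Nevada["tokenText"] = [[token for token in doc if token not in stopword_set] for doc in Nevada["tokenText"]]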

display(Nevada.head(3))

Unnamed: 0,review_id,user_id,business_id,review_stars,text,date,business_name,address,city,state,postal_code,latitude,longitude,rating,num_of_business_reviews,is_open,attributes,categories,hours,user_name,num_of_reviews,average_stars,cleanText,tokenText
0,QiImvuUidV_SKI7EZYD83A,cvA8vHPR0Gs0zsPnyv6JEQ,L6URxfBxYvwbr0U6xOzY1Q,4,"Hubby and I were cutting through the Tropicana to go elsewhere for breakfast. I was sold on the $2 mimosas and made to order omelets and waffles. \n\nThere wasn't a crowd, but the staff kept ever...",2019-12-13 15:50:49,Savor Brunch Buffet,3801 S Las Vegas Blvd,LAS VEGAS,NV,89109,36.099635,-115.171358,3.5,21,1,"{'Ambience': '{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'upscale': False, 'classy': False, 'casual': False}', 'HasTV': 'True', '...","breakfast & brunch, buffets, restaurants","{'Monday': '7:0-13:0', 'Tuesday': '7:0-13:0', 'Wednesday': '7:0-13:0', 'Thursday': '7:0-13:0', 'Friday': '7:0-13:0', 'Saturday': '7:0-13:0', 'Sunday': '7:0-13:0'}",Taheerah,496,3.89,hubby and i were cutting through the to go elsewhere for breakfast i was sold on the and made to order and there t a crowd but the staff kept everything on the buffet fresh i the omelet grits and...,"[hubby, cutting, breakfast, sold, order, crowd, staff, buffet, fresh, omelet, grits, hubby, fruity, pebble, waffle, whipped, cream, good, sweet, decided, channel, child, addition, breakfast, buff..."
1,OKLFoPEC2bmj0gkd3mg2og,qPt9_SU60HY9NJUFOBh8Ew,0m99JzzybBddbP9mr2y5XA,5,"I went to the Chicken Shack with a group of friends as part of the Pinecrest Sloan Canyon Foodie Club, and I really enjoyed the experience! The chicken tenders had a delicious flavor due to their...",2019-12-13 15:48:50,The Chicken Shack,"10445 Spencer St, Ste 120",LAS VEGAS,NV,89183,35.999082,-115.127911,4.0,152,1,"{'BusinessAcceptsCreditCards': 'True', 'OutdoorSeating': 'False', 'RestaurantsTableService': 'False', 'Caters': 'True', 'GoodForKids': 'True', 'Alcohol': 'u'none'', 'RestaurantsDelivery': 'True',...","fast food, restaurants, chicken shop","{'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0', 'Wednesday': '10:0-21:0', 'Thursday': '10:0-21:0', 'Friday': '10:0-22:0', 'Saturday': '10:0-22:0', 'Sunday': '10:0-21:0'}",Dane,1,5.0,i went to the chicken shack with a group of as part of the sloan canyon club and i really the experience the chicken had a delicious flavor due to their seasoning on the and they were meaty in th...,"[chicken, shack, group, sloan, canyon, club, experience, chicken, delicious, flavor, seasoning, meaty, good, chicken, tender, normal, cajun, garlic, lot, admiring, crunchy, texture, flavor, taste..."
2,izOSwMP2js_ptjDQZsynig,KriIEvoyWwhoswBoqqUpzA,faPVqws-x-5k2CQKDNtHxw,5,"We just recently returned from Las Vegas and had the pleasure of stopping by your restaurant on Tuesday, December 10th for lunch. We were seated in the lounge and was greeted by an amazing server...",2019-12-13 15:34:17,Yardbird Southern Table & Bar,3355 Las Vegas Blvd S,LAS VEGAS,NV,89109,36.122328,-115.170112,4.5,4828,1,"{'BusinessAcceptsCreditCards': 'True', 'BikeParking': 'True', 'OutdoorSeating': 'False', 'RestaurantsReservations': 'True', 'RestaurantsPriceRange2': '2', 'NoiseLevel': ''average'', 'RestaurantsA...","restaurants, american (new), southern, nightlife, bars, cocktail bars","{'Monday': '0:0-0:0', 'Tuesday': '11:0-23:0', 'Wednesday': '11:0-23:0', 'Thursday': '11:0-23:0', 'Friday': '11:0-0:0', 'Saturday': '10:0-0:0', 'Sunday': '10:0-23:0'}",Christine,1,5.0,we just recently returned from las and had the pleasure of stopping by your restaurant on th for lunch we were seated in the lounge and was by an amazing server we ordered some lunch after a long...,"[returned, las, pleasure, stopping, restaurant, lunch, seated, lounge, amazing, server, ordered, lunch, long, cold, server, quick, respond, extremely, fast, food, cold, star, rating, amazing, ser..."
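
The stop-word filter above tests every token for membership in `stopwords`. The sketch below reproduces that step on toy data. It assumes the stop-word collection might be a plain list and stores it as a `frozenset`, so that each lookup is constant time on average. The names `STOPWORDS_SAMPLE` and `remove_stopwords` and the sample documents are hypothetical and not part of the notebook.

```python
from typing import Iterable, List

# Hypothetical, tiny stop-word sample; the notebook's real list is much longer.
STOPWORDS_SAMPLE = frozenset(["i", "and", "the", "were", "to", "for", "was"])


def remove_stopwords(tokens: Iterable[str], stopwords=STOPWORDS_SAMPLE) -> List[str]:
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stopwords]


# Two toy tokenized reviews (assumed input shape: one list of tokens per review).
docs = [
    ["hubby", "and", "i", "were", "cutting", "through"],
    ["the", "staff", "was", "friendly"],
]
filtered = [remove_stopwords(doc) for doc in docs]
print(filtered)
```

With more than a million reviews, a set-based lookup can matter: membership tests against a list scan it element by element, while a set hashes the token once.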


##### <span style="color:#3A3A3A">Lemmatization</span>

In [28]:
# Lemmatization is another normalization technique: it converts each word to its base form,
# the lemma found in dictionaries. Our approach follows
# https://simonhessner.de/lemmatize-whole-sentences-with-python-and-nltks-wordnetlemmatizer/

# Create a new column to which lemmatization will be applied.
Nevada["LemaText"] = Nevada["tokenText"]

In [29]:
lemmatizer = WordNetLemmatizer()

def nltk2wn_tag(nltk_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant (or None)."""
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(tokens):
    """Lemmatize a list of tokens and return them joined into one string."""
    res_words = []
    # Tag each token, then lemmatize with the WordNet POS when one is available
    for word, nltk_tag in nltk.pos_tag(tokens):
        tag = nltk2wn_tag(nltk_tag)
        if tag is None:
            # No WordNet equivalent for this tag: keep the token unchanged
            res_words.append(word)
        else:
            res_words.append(lemmatizer.lemmatize(word, tag))
    return " ".join(res_words)

Nevada["LemaText"] = Nevada["LemaText"].progress_apply(lemmatize_sentence)

display(Nevada.LemaText)

0          hubby cut breakfast sell order crowd staff buffet fresh omelet grit hubby fruity pebble waffle whip cream good sweet decide channel child addition breakfast buffet pizza good pizza ate price good...
1          chicken shack group sloan canyon club experience chicken delicious flavor season meaty good chicken tender normal cajun garlic lot admire crunchy texture flavor taste creamy garlic satisfactory h...
2                                return las pleasure stop restaurant lunch seat lounge amaze server order lunch long cold server quick respond extremely fast food cold star rating amazing service provide merry
3                                                                                                                             time place staff friendly restaurant clean soup delicious area bite eat highly soup
4          nice restaurant clean hubby menu heaven order chicken waffle order breakfast burrito honest breakfast burrito good chicken waffle great breakfast bur
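
The lemmatization cell above depends on translating Penn Treebank tags into WordNet part-of-speech codes, because `WordNetLemmatizer.lemmatize` treats a word as a noun unless a tag is supplied. The following is a minimal sketch of that translation as a dictionary lookup. It uses only the standard library, so the four codes are written as plain strings (they are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`). The name `penn_to_wordnet` is hypothetical.

```python
from typing import Optional

# WordNet part-of-speech codes; these are the string values of
# nltk.corpus.wordnet.ADJ, VERB, NOUN and ADV.
WN_ADJ, WN_VERB, WN_NOUN, WN_ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet code.
_TAG_PREFIX_TO_WN = {"J": WN_ADJ, "V": WN_VERB, "N": WN_NOUN, "R": WN_ADV}


def penn_to_wordnet(tag: str) -> Optional[str]:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code, or None."""
    return _TAG_PREFIX_TO_WN.get(tag[:1])


# Adjective, past-tense verb, plural noun, adverb, preposition.
for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as prepositions (`IN`), map to `None`, which is the case where the notebook keeps the token unchanged.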

In [30]:
Nevada = Nevada[Nevada.LemaText.str.len()!=0]

In [31]:
Nevada['split_text'] = Nevada['LemaText'].str.split()
# Remove records that contain no word 
Nevada = Nevada[Nevada.split_text.str.len()!=0]
# Remove records that contain only one word
Nevada = Nevada[Nevada.split_text.str.len()!=1]

In [32]:
# Get the average length of a review, in words
round(Nevada.split_text.str.len().mean(),2)

30.57

In [33]:
# Get the maximum length of a review, in words
Nevada.split_text.str.len().max()

400

In [34]:
Nevada = Nevada.drop(columns=['split_text'])

In [35]:
Nevada.shape

(1361876, 25)

##### <span style="color:#3A3A3A">Treat Categories</span>

In [36]:
# Create a new column where the text processing for the categories will be applied
Nevada['cleanCat'] = Nevada['categories']

# Remove punctuation
Nevada['cleanCat'] = Nevada['cleanCat'].progress_apply(remove_punctuations)

# Remove Extra Space
Nevada['cleanCat'] = [text.replace("  "," ") for text in Nevada['cleanCat']]

display(Nevada['cleanCat'])

0                                                         breakfast brunch buffets restaurants
1                                                           fast food restaurants chicken shop
2                               restaurants american new southern nightlife bars cocktail bars
3                                                          restaurants vegetarian thai noodles
4                         vegan restaurants breakfast brunch american new american traditional
                                                  ...                                         
1362880    american new food american traditional seafood restaurants nightlife bars wine bars
1362881                            latin american restaurants japanese sushi bars asian fusion
1362882                                 sandwiches breakfast brunch buffets restaurants french
1362883                                 sandwiches breakfast brunch buffets restaurants french
1362884    restaurants hotels travel event plannin
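
The "Remove Extra Space" step above replaces double spaces in a single pass. The sketch below, on a hypothetical input string, compares that approach with a `split`/`join` idiom that collapses whitespace runs of any length and also trims both ends. Whether the real category strings ever contain runs longer than two spaces is not known from the notebook, so this is offered only as an alternative to consider.

```python
def collapse_single_pass(text: str) -> str:
    """Mirror the notebook's approach: replace each double space once."""
    return text.replace("  ", " ")


def collapse_all(text: str) -> str:
    """Collapse any run of whitespace to one space and trim both ends."""
    return " ".join(text.split())


# Hypothetical input with a triple space, a double space and a trailing space.
sample = "breakfast   brunch  buffets restaurants "
print(repr(collapse_single_pass(sample)))
print(repr(collapse_all(sample)))
```

A single `replace` pass turns three spaces into two, so a double space can survive it. The `split`/`join` form has no such leftover.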

In [37]:
# Get a view of the categories and their frequency in our dataset
most_common = Counter(" ".join(Nevada['cleanCat']).split(" ")).most_common()
most_common

[('restaurants', 1361791),
 ('bars', 645385),
 ('food', 586133),
 ('american', 539031),
 ('nightlife', 336341),
 ('breakfast', 275415),
 ('brunch', 274730),
 ('new', 269627),
 ('traditional', 262745),
 ('services', 169709),
 ('event', 169531),
 ('hotels', 153820),
 ('seafood', 150290),
 ('planning', 146432),
 ('mexican', 143487),
 ('sandwiches', 138584),
 ('burgers', 138012),
 ('japanese', 133669),
 ('steakhouses', 121840),
 ('italian', 115787),
 ('arts', 111194),
 ('pizza', 110214),
 ('entertainment', 106762),
 ('asian', 100719),
 ('fusion', 99511),
 ('sushi', 98052),
 ('buffets', 87109),
 ('fast', 86670),
 ('salad', 85858),
 ('wine', 83321),
 ('chinese', 80310),
 ('travel', 79670),
 ('casinos', 79345),
 ('tea', 78845),
 ('desserts', 76840),
 ('cocktail', 71728),
 ('cafes', 67913),
 ('barbeque', 67662),
 ('beer', 64180),
 ('coffee', 63040),
 ('chicken', 53606),
 ('sports', 50821),
 ('vegan', 47960),
 ('lounges', 46982),
 ('caterers', 46893),
 ('venues', 42466),
 ('korean', 42394),
 ('
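
The frequency count above joins every category string into one large string before splitting it. The sketch below shows an equivalent count that updates a `Counter` one row at a time, which avoids building that intermediate string. The function name `count_words` and the three sample rows are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_words(texts: Iterable[str]) -> Counter:
    """Count whitespace-separated words across all texts, one text at a time."""
    counts = Counter()
    for text in texts:
        # split() with no argument ignores repeated or trailing spaces,
        # so no empty-string "words" are counted.
        counts.update(text.split())
    return counts


# Three toy category strings in the same shape as the cleanCat column.
categories = [
    "breakfast brunch buffets restaurants",
    "fast food restaurants chicken shop",
    "restaurants american new southern nightlife bars cocktail bars",
]
word_counts = count_words(categories)
print(word_counts.most_common(2))
```

Note that `split(" ")`, as used in the notebook, yields empty strings when two spaces are adjacent, whereas `split()` does not. The two approaches agree only when the text is already whitespace-normalized.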

In [38]:
# Keep only the reviews whose categories mention one of the 23 most common cuisines
Nevada = Nevada[(Nevada['cleanCat'].str.contains('american|mexican|japanese|italian|asian|chinese|korean|french|mediterranean|hawaiian|vietnamese|greek|ethnic|spanish|brazilian|taiwanese|african|british|caribbean|pakistani|thai|indian|irish',case=False))].copy()


In [39]:
# We keep only the 23 most common cuisine categories for our analysis

cuisine = ['american', 'mexican', 'japanese', 'italian', 'asian', 'chinese', 'korean', 'french', 'mediterranean', 'hawaiian', 'vietnamese', 'greek', 'ethnic', 'spanish', 'brazilian', 'taiwanese', 'african', 'british', 'caribbean', 'pakistani', 'thai', 'indian', 'irish']   
Nevada['cuisines'] = Nevada.cleanCat.apply(lambda x: ' '.join([word for word in x.split() if word in (cuisine)]))
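
The row filter uses substring matching (`str.contains` with an alternation), while the `cuisines` column is built from whole-word matches. The two can disagree when a cuisine name appears inside a longer word. The sketch below illustrates the difference with the standard `re` module. The shortened cuisine list and the sample strings are hypothetical, and nothing here claims that such cases actually occur in the Yelp data.

```python
import re

# Shortened, hypothetical subset of the notebook's cuisine list.
CUISINES = ["american", "thai", "greek"]

# Substring match, in the spirit of the str.contains filter.
substring_pattern = re.compile("|".join(CUISINES), flags=re.IGNORECASE)

# Whole-word match: \b anchors each alternative at word boundaries.
word_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, CUISINES)) + r")\b",
    flags=re.IGNORECASE,
)

samples = ["restaurants thai noodles", "greektown deli sandwiches"]
for text in samples:
    print(
        text,
        "| substring:", bool(substring_pattern.search(text)),
        "| whole word:", bool(word_pattern.search(text)),
    )
```

The group is non-capturing (`(?:...)`), which should keep pandas from emitting its "match groups" warning if a pattern like this were passed to `str.contains`.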

In [40]:
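# Remove duplicate cuisine categories

def uniquify(text):
    """Drop repeated words from a space-separated string, keeping first occurrences in order."""
    # The parameter is named `text` so it does not shadow the imported `string` module
    output = []
    seen = set()
    for word in text.split():
        if word not in seen:
            output.append(word)
            seen.add(word)
    return ' '.join(output)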

Nevada['cuisines'] = Nevada['cuisines'].apply(uniquify)
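
An order-preserving de-duplication like `uniquify` can also be written with `dict.fromkeys`, since dictionaries keep insertion order in Python 3.7 and later. This is a compact sketch of the same idea, not a replacement mandated by the notebook. The name `unique_words` is hypothetical.

```python
def unique_words(text: str) -> str:
    """Drop repeated words, keeping the first occurrence of each in order."""
    # dict.fromkeys keeps one key per distinct word, in first-seen order.
    return " ".join(dict.fromkeys(text.split()))


# "american new" and "american traditional" both contribute "american".
print(unique_words("american american asian american"))
```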

In [41]:
# Remove leading and trailing whitespace
def remove_whitespace(text):
    text = text.strip()
    return text

Nevada['cuisines'] = Nevada['cuisines'].apply(remove_whitespace)

In [42]:
# Get a view of the cuisines and their frequency in our dataset.
# American is expected to be the most frequent cuisine among the businesses.
most_common = Counter(" ".join(Nevada['cuisines']).split(" ")).most_common()
most_common

[('american', 443459),
 ('mexican', 134194),
 ('japanese', 130244),
 ('italian', 115787),
 ('asian', 100142),
 ('chinese', 80310),
 ('korean', 42394),
 ('french', 38635),
 ('thai', 38349),
 ('mediterranean', 33491),
 ('hawaiian', 27765),
 ('vietnamese', 25482),
 ('greek', 16535),
 ('ethnic', 14919),
 ('indian', 14555),
 ('spanish', 14248),
 ('brazilian', 9666),
 ('taiwanese', 7079),
 ('british', 6922),
 ('african', 6319),
 ('caribbean', 5108),
 ('pakistani', 4916),
 ('irish', 4703)]

In [43]:
# Merge the lemmatized review text and the cuisines into one column to enrich the information in our dataset.
Nevada['input_text'] = Nevada[['LemaText', 'cuisines']].agg(' '.join, axis=1)

In [44]:
# Drop columns that we will not use in our analysis
Nevada = Nevada.drop(columns=['num_of_business_reviews', 'is_open','hours','num_of_reviews','cleanText','tokenText', 'cleanCat']).reset_index(drop=True)

In [46]:
display(Nevada.head(3))

Unnamed: 0,review_id,user_id,business_id,review_stars,text,date,business_name,address,city,state,postal_code,latitude,longitude,rating,attributes,categories,user_name,average_stars,LemaText,cuisines,input_text
0,izOSwMP2js_ptjDQZsynig,KriIEvoyWwhoswBoqqUpzA,faPVqws-x-5k2CQKDNtHxw,5,"We just recently returned from Las Vegas and had the pleasure of stopping by your restaurant on Tuesday, December 10th for lunch. We were seated in the lounge and was greeted by an amazing server...",2019-12-13 15:34:17,Yardbird Southern Table & Bar,3355 Las Vegas Blvd S,LAS VEGAS,NV,89109,36.122328,-115.170112,4.5,"{'BusinessAcceptsCreditCards': 'True', 'BikeParking': 'True', 'OutdoorSeating': 'False', 'RestaurantsReservations': 'True', 'RestaurantsPriceRange2': '2', 'NoiseLevel': ''average'', 'RestaurantsA...","restaurants, american (new), southern, nightlife, bars, cocktail bars",Christine,5.0,return las pleasure stop restaurant lunch seat lounge amaze server order lunch long cold server quick respond extremely fast food cold star rating amazing service provide merry,american,return las pleasure stop restaurant lunch seat lounge amaze server order lunch long cold server quick respond extremely fast food cold star rating amazing service provide merry american
1,qUxCvPEkl7xrmY-n1szciA,D3XxyNOy8b_1484Oi1eYOg,VrGI7_nRjXpn0415S3coGQ,5,I've been here three times now one of my coworkers introduced me to this place the staff are very friendly the restaurant to clean and the soup is Delicious if you're in the area and you're looki...,2019-12-13 15:32:07,Vegas Noodle House,3516 Wynn Rd,LAS VEGAS,NV,89103,36.125887,-115.194425,4.0,"{'RestaurantsDelivery': 'True', 'GoodForKids': 'True', 'RestaurantsGoodForGroups': 'True', 'RestaurantsAttire': 'u'casual'', 'RestaurantsReservations': 'False', 'BusinessAcceptsCreditCards': 'Tru...","restaurants, vegetarian, thai, noodles",Aaron,3.88,time place staff friendly restaurant clean soup delicious area bite eat highly soup,thai,time place staff friendly restaurant clean soup delicious area bite eat highly soup thai
2,oBbpt5C7BwKaTXy6ylzJ_g,cvA8vHPR0Gs0zsPnyv6JEQ,dVp1llwjZUmhCF4pNsJnQg,4,"When we walked in, we noticed that the decor was nice, and the restaurant was clean. Hubby and I were shocked at all of the choices on the menu. We were in vegan heaven. \n\nI order the vegan ""ch...",2019-12-13 15:29:12,The Modern Vegan,700 E Naples Dr,LAS VEGAS,NV,89119,36.105993,-115.149127,4.0,"{'RestaurantsAttire': ''casual'', 'ByAppointmentOnly': 'False', 'BikeParking': 'True', 'RestaurantsPriceRange2': '2', 'RestaurantsTableService': 'True', 'BusinessParking': '{'garage': False, 'str...","vegan, restaurants, breakfast & brunch, american (new), american (traditional)",Taheerah,3.89,nice restaurant clean hubby menu heaven order chicken waffle order breakfast burrito honest breakfast burrito good chicken waffle great breakfast burrito level experience notch service awesome ha...,american,nice restaurant clean hubby menu heaven order chicken waffle order breakfast burrito honest breakfast burrito good chicken waffle great breakfast burrito level experience notch service awesome ha...


In [47]:
# The final dataset contains 1013794 unique reviews.
Nevada.shape

(1013794, 21)

In [48]:
file_name="Nevada.pkl"
Nevada.to_pickle(file_name)