# Yelp Data Challenge - Data Preprocessing

LeiChen

Sep 2018

## Dataset Introduction

[Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge)

The Challenge Dataset:

    4.1M reviews and 947K tips by 1M users for 144K businesses
    1.1M business attributes, e.g., hours, parking availability, ambience.
    Aggregated check-ins over time for each of the 125K businesses
    200,000 pictures from the included businesses

Cities:

    U.K.: Edinburgh
    Germany: Karlsruhe
    Canada: Montreal and Waterloo
    U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Cleveland

Files:

    yelp_academic_dataset_business.json
    yelp_academic_dataset_checkin.json
    yelp_academic_dataset_review.json
    yelp_academic_dataset_tip.json
    yelp_academic_dataset_user.json

Notes on the Dataset

    Each file is composed of a single object type, one json-object per-line.
    Take a look at some examples to get you started: https://github.com/Yelp/dataset-examples.
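Because each file stores one JSON object per line, a few records can be inspected without loading the whole file. The following is a minimal sketch using only the standard library; the helper name `peek_json_lines` and the demo records are made up for illustration and are not part of the dataset tooling.

```python
import itertools
import json
import os
import tempfile


def peek_json_lines(path, n=2):
    """Return the first n records of a JSON-lines file as dicts."""
    with open(path, encoding='utf-8') as f:
        # islice stops after n lines, so the rest of the file is never read
        return [json.loads(line) for line in itertools.islice(f, n)]


# Demo on a tiny throwaway file (made-up records, not real Yelp data)
demo_records = [
    {'business_id': 'b1', 'city': 'Las Vegas'},
    {'business_id': 'b2', 'city': 'Phoenix'},
    {'business_id': 'b3', 'city': 'Henderson'},
]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='utf-8') as tmp:
    for rec in demo_records:
        tmp.write(json.dumps(rec) + '\n')

first_two = peek_json_lines(tmp.name, n=2)
os.remove(tmp.name)
print(first_two)
```

The same helper could be pointed at any of the five dataset files to check which keys each record carries before deciding what to load.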



## Read data from file and load to Pandas DataFrame

**Warning**: Loading all 1.8 GB of data into Pandas at once takes a long time and a lot of memory!
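One possible way to reduce the memory footprint, shown here as an assumption rather than what this notebook does, is to read a file in chunks with `pd.read_json(..., lines=True, chunksize=...)` and keep only the rows of interest from each chunk. The helper name `load_city_in_chunks`, the chunk size, and the demo records are illustrative.

```python
import json
import os
import tempfile

import pandas as pd


def load_city_in_chunks(path, city, chunksize=2):
    """Read a JSON-lines file chunk by chunk, keeping only rows for one city."""
    kept = []
    # With lines=True and a chunksize, read_json yields DataFrames one chunk at a time
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        kept.append(chunk[chunk['city'] == city])
    return pd.concat(kept, ignore_index=True)


# Demo on a small temporary file (made-up records, not real Yelp data)
demo_path = os.path.join(tempfile.mkdtemp(), 'business_demo.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    for i, city in enumerate(['Las Vegas', 'Phoenix', 'Calgary', 'Las Vegas', 'Henderson'], 1):
        f.write(json.dumps({'business_id': 'b%d' % i, 'city': city}) + '\n')

df_lv = load_city_in_chunks(demo_path, 'Las Vegas', chunksize=2)
print(df_lv)
```

On the real files a much larger chunk size (tens of thousands of rows) would be more sensible; the value of 2 only makes the chunking visible on five records.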

In [1]:
import json
import pandas as pd

In [2]:
file_business, file_checkin, file_review, file_tip, file_user = [
    '../dataset/yelp_academic_dataset_business.json',
    '../dataset/yelp_academic_dataset_checkin.json',
    '../dataset/yelp_academic_dataset_review.json',
    '../dataset/yelp_academic_dataset_tip.json',
    '../dataset/yelp_academic_dataset_user.json',
]

#### Business Data

In [3]:
# Read as UTF-8 so that non-ASCII names (e.g. "Montréal") are decoded correctly
with open(file_business, encoding='utf-8') as f:
    df_business = pd.DataFrame(json.loads(line) for line in f)

In [4]:
df_business.head(2)

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,1314 44 Avenue NE,"{'BikeParking': 'False', 'BusinessAcceptsCredi...",Apn5Q_b6Nz61Tq4XzPdf9A,"Tours, Breweries, Pizza, Restaurants, Food, Ho...",Calgary,"{'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'...",1,51.091813,-114.031675,Minhas Micro Brewery,,T2E 6L6,24,4.0,AB
1,,"{'Alcohol': 'none', 'BikeParking': 'False', 'B...",AjEbIBw6ZFfln7ePHha9PA,"Chicken Wings, Burgers, Caterers, Street Vendo...",Henderson,"{'Friday': '17:0-23:0', 'Saturday': '17:0-23:0...",0,35.960734,-114.939821,CK'S BBQ & Catering,,89002,3,4.5,NV


In [5]:
df_business.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188593 entries, 0 to 188592
Data columns (total 15 columns):
address         188593 non-null object
attributes      162807 non-null object
business_id     188593 non-null object
categories      188052 non-null object
city            188593 non-null object
hours           143791 non-null object
is_open         188593 non-null int64
latitude        188587 non-null float64
longitude       188587 non-null float64
name            188593 non-null object
neighborhood    188593 non-null object
postal_code     188593 non-null object
review_count    188593 non-null int64
stars           188593 non-null float64
state           188593 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 21.6+ MB


#### Checkin Data

In [5]:
with open(file_checkin, encoding='utf-8') as f:
    df_checkin = pd.DataFrame(json.loads(line) for line in f)
df_checkin.head(2)

Unnamed: 0,business_id,time
0,7KPBkxAOEtb3QeIL9PEErg,"{'Fri-0': 2, 'Sat-0': 1, 'Sun-0': 1, 'Wed-0': ..."
1,kREVIrSBbtqBhIYkTccQUg,"{'Mon-13': 1, 'Thu-13': 1, 'Sat-16': 1, 'Wed-1..."


#### Review Data

In [None]:
with open(file_review, encoding='utf-8') as f:
    df_review = pd.DataFrame(json.loads(line) for line in f)
df_review.head(2)

#### Tip Data

In [None]:
# with open(file_tip) as f:
#     df_tip = pd.DataFrame(json.loads(line) for line in f)
# df_tip.head(2)

#### User Data

In [63]:
# with open(file_user) as f:
#     df_user = pd.DataFrame(json.loads(line) for line in f)
# df_user.head(2)
df_business.head(5)

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,1314 44 Avenue NE,"{'BikeParking': 'False', 'BusinessAcceptsCredi...",Apn5Q_b6Nz61Tq4XzPdf9A,"Tours, Breweries, Pizza, Restaurants, Food, Ho...",Calgary,"{'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'...",1,51.091813,-114.031675,Minhas Micro Brewery,,T2E 6L6,24,4.0,AB
1,,"{'Alcohol': 'none', 'BikeParking': 'False', 'B...",AjEbIBw6ZFfln7ePHha9PA,"Chicken Wings, Burgers, Caterers, Street Vendo...",Henderson,"{'Friday': '17:0-23:0', 'Saturday': '17:0-23:0...",0,35.960734,-114.939821,CK'S BBQ & Catering,,89002,3,4.5,NV
2,1335 rue Beaubien E,"{'Alcohol': 'beer_and_wine', 'Ambience': '{'ro...",O8S5hYJ1SMc8fA4QBtVujA,"Breakfast & Brunch, Restaurants, French, Sandw...",Montréal,"{'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'...",0,45.540503,-73.5993,La Bastringue,Rosemont-La Petite-Patrie,H2G 1K7,5,4.0,QC
3,211 W Monroe St,,bFzdJJ3wp3PZssNEsyU23g,"Insurance, Financial Services",Phoenix,,1,33.449999,-112.076979,Geico Insurance,,85003,8,1.5,AZ
4,2005 Alyth Place SE,{'BusinessAcceptsCreditCards': 'True'},8USyCYqpScwiNEb58Bt6CA,"Home & Garden, Nurseries & Gardening, Shopping...",Calgary,"{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...",1,51.035591,-114.027366,Action Engine,,T2H 0N5,4,2.0,AB


## Filter data by city and category

#### Create filters/masks

* Create filters that select businesses
    * that are located in "Las Vegas"
    * that contain "Restaurants" in their categories (you may need to filter out null categories first)
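As an alternative sketch of the two masks described above, `Series.str.contains` with `na=False` treats missing categories as non-matching, so the null check and the substring test can be done in one step. The toy DataFrame `df_demo` is made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the business DataFrame
df_demo = pd.DataFrame({
    'city': ['Las Vegas', 'Las Vegas', 'Phoenix', 'Las Vegas'],
    'categories': ['Mexican, Restaurants', None,
                   'Restaurants, Pizza', 'Insurance, Financial Services'],
})

in_las_vegas = df_demo['city'] == 'Las Vegas'
# na=False maps missing categories to False; regex=False makes this a plain substring test
is_restaurant = df_demo['categories'].str.contains('Restaurants', na=False, regex=False)

df_demo_filtered = df_demo[in_las_vegas & is_restaurant]
print(df_demo_filtered)
```

Like the `'Restaurants' in cat` test used in this notebook, this is a substring match, so it would also match a hypothetical category such as "Pop-Up Restaurants".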

In [4]:
# Create Pandas DataFrame filters
condi_la = (df_business['city'] == "Las Vegas")
condi_not_null = (df_business['categories'].notnull())


In [5]:
# Create filtered DataFrame, and name it df_filtered
df_filtered = df_business[condi_la & condi_not_null]
df_filtered.head(5)
df_with_res = df_filtered['categories'].apply(lambda cat: 'Restaurants' in cat)
df_res = df_filtered[df_with_res]
df_filtered = df_res
df_filtered.loc[33, 'categories']

'Beer, Wine & Spirits, Italian, Food, American (Traditional), Breakfast & Brunch, Restaurants'

#### Keep relevant columns

* only keep some useful columns
    * business_id
    * name
    * categories
    * stars

In [6]:
selected_features = ['business_id', 'name', 'categories', 'stars']

In [11]:
# Make a DataFrame that contains only the columns listed above, and name it df_selected_business
df_selected_business = df_filtered[selected_features]
df_selected_business.head(5)

Unnamed: 0,business_id,name,categories,stars
19,vJIuDBdu01vCA8y1fwR1OQ,CakesbyToi,"American (Traditional), Food, Bakeries, Restau...",1.5
32,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5
33,0jtRI7hVMpQHpUVtUy4ITw,Omelet House Summerlin,"Beer, Wine & Spirits, Italian, Food, American ...",4.0
61,JJEx5wIqs9iGGATOagE8Sg,Baja Fresh Mexican Grill,"Mexican, Restaurants",2.0
141,zhxnD7J5_sCrKSw5cwI9dQ,Popeyes Louisiana Kitchen,"Chicken Wings, Restaurants, Fast Food",1.5


In [12]:
# Rename the column "stars" to "avg_stars" to avoid a naming conflict with the "stars" column of the review dataset
df_selected_business = df_selected_business.rename(columns={'stars':"avg_stars"})
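To illustrate the naming conflict that the rename avoids, here is a small sketch on made-up data. It uses `DataFrame.merge`, which resolves overlapping column names by appending suffixes; the frames `business` and `reviews` are hypothetical stand-ins.

```python
import pandas as pd

business = pd.DataFrame({'business_id': ['b1'], 'stars': [4.0]})
reviews = pd.DataFrame({'business_id': ['b1', 'b1'], 'stars': [5, 3]})

# Without renaming, both frames carry a "stars" column and merge adds suffixes
clashing = reviews.merge(business, on='business_id')
print(clashing.columns.tolist())

# Renaming first keeps the two meanings explicit
renamed = reviews.merge(business.rename(columns={'stars': 'avg_stars'}),
                        on='business_id')
print(renamed.columns.tolist())
```

The suffixed names say nothing about which column is the per-review rating and which is the business average, which is why an explicit `avg_stars` is easier to work with later.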

In [13]:
# Inspect your DataFrame
df_selected_business.head(5)

Unnamed: 0,business_id,name,categories,avg_stars
19,vJIuDBdu01vCA8y1fwR1OQ,CakesbyToi,"American (Traditional), Food, Bakeries, Restau...",1.5
32,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5
33,0jtRI7hVMpQHpUVtUy4ITw,Omelet House Summerlin,"Beer, Wine & Spirits, Italian, Food, American ...",4.0
61,JJEx5wIqs9iGGATOagE8Sg,Baja Fresh Mexican Grill,"Mexican, Restaurants",2.0
141,zhxnD7J5_sCrKSw5cwI9dQ,Popeyes Louisiana Kitchen,"Chicken Wings, Restaurants, Fast Food",1.5


#### Save results to csv files

In [10]:
# Save to ../data/selected_business.csv for your next task
selected_business_csv = '../data/selected_business.csv'
df_selected_business.to_csv(selected_business_csv, index=False)

In [3]:
# Reload the CSV file to check that everything works (to_csv writes UTF-8 by default)
df_selected_business_1 = pd.read_csv('../data/selected_business.csv', encoding='utf-8')
df_selected_business_1.head(5)

Unnamed: 0,business_id,name,categories,avg_stars
0,vJIuDBdu01vCA8y1fwR1OQ,CakesbyToi,"American (Traditional), Food, Bakeries, Restau...",1.5
1,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5
2,0jtRI7hVMpQHpUVtUy4ITw,Omelet House Summerlin,"Beer, Wine & Spirits, Italian, Food, American ...",4.0
3,JJEx5wIqs9iGGATOagE8Sg,Baja Fresh Mexican Grill,"Mexican, Restaurants",2.0
4,zhxnD7J5_sCrKSw5cwI9dQ,Popeyes Louisiana Kitchen,"Chicken Wings, Restaurants, Fast Food",1.5
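A reload check like the one above can also be made explicit with `pd.testing.assert_frame_equal`. This is a sketch on made-up data written to a temporary directory; the file name and the `original` frame are illustrative.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame(
    {'business_id': ['b1', 'b2'],
     'name': ['Café Demo', 'Subway'],
     'avg_stars': [4.0, 2.5]},
    index=[19, 32],  # non-default index, like a filtered DataFrame
)

path = os.path.join(tempfile.mkdtemp(), 'selected_business_demo.csv')
original.to_csv(path, index=False, encoding='utf-8')
reloaded = pd.read_csv(path, encoding='utf-8')

# index=False dropped the original index, so compare against a reset copy
pd.testing.assert_frame_equal(original.reset_index(drop=True), reloaded,
                              check_dtype=False)
round_trip_ok = True
print(reloaded)
```

Reading with the same encoding that was used for writing keeps non-ASCII names intact across the round trip.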


### Use the "business_id" column to filter review data

* We want to make a DataFrame that contains only the reviews of the businesses we just selected
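One direct way to express "only the reviews of the selected businesses" is a membership test with `Series.isin`. The sketch below uses made-up frames and is an alternative to a join, not the approach taken later in this notebook.

```python
import pandas as pd

selected = pd.DataFrame({'business_id': ['b1', 'b3']})
reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3', 'r4'],
    'business_id': ['b1', 'b2', 'b3', 'b1'],
})

# True for every review whose business_id appears in the selected businesses
keep = reviews['business_id'].isin(selected['business_id'])
reviews_for_selected = reviews[keep]
print(reviews_for_selected)
```

This keeps the review columns unchanged; a join is only needed when business columns such as the name or average rating should be attached to each review.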

#### Load review dataset

In [None]:
with open(file_review, encoding='utf-8') as f:
    df_review = pd.DataFrame(json.loads(line) for line in f)
df_review.head(2)

#### Prepare DataFrames to be joined on business_id

In [21]:
# Prepare the business DataFrame: rename "business_id" to "df_right" and use it as the index
df_business = df_business.rename(columns={'business_id': 'df_right'})
df_business = df_business.set_index('df_right')
df_business.head(5)

df_right,address,attributes,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
Apn5Q_b6Nz61Tq4XzPdf9A,1314 44 Avenue NE,"{'BikeParking': 'False', 'BusinessAcceptsCredi...","Tours, Breweries, Pizza, Restaurants, Food, Ho...",Calgary,"{'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'...",1,51.091813,-114.031675,Minhas Micro Brewery,,T2E 6L6,24,4.0,AB
AjEbIBw6ZFfln7ePHha9PA,,"{'Alcohol': 'none', 'BikeParking': 'False', 'B...","Chicken Wings, Burgers, Caterers, Street Vendo...",Henderson,"{'Friday': '17:0-23:0', 'Saturday': '17:0-23:0...",0,35.960734,-114.939821,CK'S BBQ & Catering,,89002,3,4.5,NV
O8S5hYJ1SMc8fA4QBtVujA,1335 rue Beaubien E,"{'Alcohol': 'beer_and_wine', 'Ambience': '{'ro...","Breakfast & Brunch, Restaurants, French, Sandw...",Montréal,"{'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'...",0,45.540503,-73.5993,La Bastringue,Rosemont-La Petite-Patrie,H2G 1K7,5,4.0,QC
bFzdJJ3wp3PZssNEsyU23g,211 W Monroe St,,"Insurance, Financial Services",Phoenix,,1,33.449999,-112.076979,Geico Insurance,,85003,8,1.5,AZ
8USyCYqpScwiNEb58Bt6CA,2005 Alyth Place SE,{'BusinessAcceptsCreditCards': 'True'},"Home & Garden, Nurseries & Gardening, Shopping...",Calgary,"{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...",1,51.035591,-114.027366,Action Engine,,T2H 0N5,4,2.0,AB


In [11]:
# Prepare the review DataFrame: rename "business_id" to "df_right" and use it as the index
df_review = df_review.rename(columns={'business_id':'df_right'})
df_review = df_review.set_index('df_right')
df_review.head(5)

df_right,cool,date,funny,review_id,stars,text,useful,user_id
iCQpiavjjPzJ5_3gPD5Ebg,0,2011-02-25,0,x7mDIiDB3jEiPGPHOmDzyw,2,The pizza was okay. Not the best I've had. I p...,0,msQe1u7Z_XuqjGoqhB0J5g
pomGBqfbxcqPv14c3XH-ZQ,0,2012-11-13,0,dDl8zu1vWPdKGihJrwQbpw,5,I love this place! My fiance And I go here atl...,0,msQe1u7Z_XuqjGoqhB0J5g
jtQARsP6P-LbkyjbO1qNGg,1,2014-10-23,1,LZp4UX5zK3e-c5ZGSeo3kA,1,Terrible. Dry corn bread. Rib tips were all fa...,3,msQe1u7Z_XuqjGoqhB0J5g
elqbBhBfElMNSrjFqW3now,0,2011-02-25,0,Er4NBWCmCD4nM8_p1GRdow,2,Back in 2005-2007 this place was my FAVORITE t...,2,msQe1u7Z_XuqjGoqhB0J5g
Ums3gaP2qM3W1XcA5r6SsQ,0,2014-09-05,0,jsDu6QEJHbwP2Blom1PLCA,5,Delicious healthy food. The steak is amazing. ...,0,msQe1u7Z_XuqjGoqhB0J5g


#### Join! and reset index

In [14]:
print('rows of business:')
print(df_selected_business.shape[0])
print('rows of reviews:')
print(df_review.shape[0])

rows of business:
6148
rows of reviews:
5996996


In [20]:
print(len(df_business.index.unique()))
print(len(df_review.index.unique()))

188593
188593


In [6]:
df_business = df_business.rename(columns={'stars':"avg_stars"})
business_review = df_review.set_index('business_id').join(df_business.set_index('business_id'))
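For comparison, the same kind of join can be written with `DataFrame.merge` on the column directly, without setting an index first. The sketch below is an assumption about how one might join against only the selected businesses; the frames are made up, and `how='inner'` and `validate='many_to_one'` are choices made for this illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b1'],
    'stars': [5, 2, 4],
})
selected = pd.DataFrame({
    'business_id': ['b1'],
    'name': ['Demo Diner'],
    'avg_stars': [4.5],
})

# how='inner' keeps only reviews whose business_id appears in `selected`;
# validate raises a MergeError if a business_id is duplicated on the right side
joined = reviews.merge(selected, on='business_id', how='inner',
                       validate='many_to_one')
print(joined)
```

An inner merge drops reviews of businesses outside the selection at join time, so the joined frame is smaller than a join against every business.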

In [7]:
business_review.head(5)

business_id,cool,date,funny,review_id,stars,text,useful,user_id,address,attributes,...,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,avg_stars,state
--1UhMGODdWsrMastO9DZw,0,2016-07-25,0,0tFxHz2j1GJ8-IYPp5NxWA,4,Came here for lunch last week and was pleasant...,0,t4cYW73lVcBb-1R_Wms1RQ,821 4 Avenue SW,"{'Alcohol': 'beer_and_wine', 'BusinessParking'...",...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
--1UhMGODdWsrMastO9DZw,1,2016-12-06,0,KByp3bKAt9GMqU7koJ8htg,5,If you want Mexican for a reasonable price in ...,0,hqk4eugYhjmhM-3S2FwGjA,821 4 Avenue SW,"{'Alcohol': 'beer_and_wine', 'BusinessParking'...",...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
--1UhMGODdWsrMastO9DZw,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw,821 4 Avenue SW,"{'Alcohol': 'beer_and_wine', 'BusinessParking'...",...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
--1UhMGODdWsrMastO9DZw,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew,821 4 Avenue SW,"{'Alcohol': 'beer_and_wine', 'BusinessParking'...",...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
--1UhMGODdWsrMastO9DZw,1,2016-07-16,0,-i9u19L08A4VvhFVfqXK9Q,5,First time here since it changed ownership. Ha...,0,qUnvyCfCpr9ZG_F5oezJMw,821 4 Avenue SW,"{'Alcohol': 'beer_and_wine', 'BusinessParking'...",...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB


In [9]:
business_review = business_review.reset_index()
business_review.head(5)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,address,...,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,avg_stars,state
0,--1UhMGODdWsrMastO9DZw,0,2016-07-25,0,0tFxHz2j1GJ8-IYPp5NxWA,4,Came here for lunch last week and was pleasant...,0,t4cYW73lVcBb-1R_Wms1RQ,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
1,--1UhMGODdWsrMastO9DZw,1,2016-12-06,0,KByp3bKAt9GMqU7koJ8htg,5,If you want Mexican for a reasonable price in ...,0,hqk4eugYhjmhM-3S2FwGjA,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
2,--1UhMGODdWsrMastO9DZw,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
3,--1UhMGODdWsrMastO9DZw,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB
4,--1UhMGODdWsrMastO9DZw,1,2016-07-16,0,-i9u19L08A4VvhFVfqXK9Q,5,First time here since it changed ownership. Ha...,0,qUnvyCfCpr9ZG_F5oezJMw,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1,51.049673,-114.079977,The Spicy Amigos,,T2P 0K5,24,4.0,AB


In [16]:
# The index was reset above, so write the CSV without the index column
business_review_csv = '../data/business_review.csv'
business_review.to_csv(business_review_csv, index=False)

#### Further filter the data by date, e.g. keep reviews from the last 2 years

* Otherwise your laptop may run out of memory when running machine learning algorithms
* We purposefully ignore reviews that were written too long ago

In [4]:
business_review_1.columns.values.tolist()

['cool',
 'date',
 'funny',
 'review_id',
 'stars',
 'text',
 'useful',
 'user_id',
 'address',
 'attributes',
 'categories',
 'city',
 'hours',
 'is_open',
 'latitude',
 'longitude',
 'name',
 'neighborhood',
 'postal_code',
 'review_count',
 'avg_stars',
 'state']

In [13]:
# Make a filter that selects date after 2015-01-20
date_filter = (pd.to_datetime(business_review['date']) > pd.to_datetime('2015-01-20'))

In [15]:
# Filter the joined DataFrame and name it as df_final
df_final = business_review[date_filter]
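Instead of a hard-coded date, the cutoff could be derived from the newest review in the data. The following sketch assumes a two-year window measured back from the latest date; the frame `reviews` is made up for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'date': ['2014-06-01', '2016-03-15', '2018-01-10'],
})

dates = pd.to_datetime(reviews['date'])
# Cutoff relative to the newest review instead of a fixed calendar date
cutoff = dates.max() - pd.DateOffset(years=2)
recent = reviews[dates > cutoff]
print(cutoff.date(), recent['review_id'].tolist())
```

A relative cutoff keeps the window length the same if the notebook is rerun on a newer release of the dataset.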

In [16]:
df_final_csv = '../data/df_final.csv'
df_final.to_csv(df_final_csv, index=False)

#### Take a glance at the final dataset

* Do more EDA here as you like!

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
# e.g. calculate counts of reviews per business entity, and plot it
df_final_1 = pd.read_csv('../data/df_final.csv', encoding='utf-8')
df_final_1.head(5)


Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,address,...,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,avg_stars,state
0,--1UhMGODdWsrMastO9DZw,0,2016-07-25,0,0tFxHz2j1GJ8-IYPp5NxWA,4,Came here for lunch last week and was pleasant...,0,t4cYW73lVcBb-1R_Wms1RQ,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1.0,51.049673,-114.08,The Spicy Amigos,,T2P 0K5,24.0,4.0,AB
1,--1UhMGODdWsrMastO9DZw,1,2016-12-06,0,KByp3bKAt9GMqU7koJ8htg,5,If you want Mexican for a reasonable price in ...,0,hqk4eugYhjmhM-3S2FwGjA,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1.0,51.049673,-114.08,The Spicy Amigos,,T2P 0K5,24.0,4.0,AB
2,--1UhMGODdWsrMastO9DZw,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1.0,51.049673,-114.08,The Spicy Amigos,,T2P 0K5,24.0,4.0,AB
3,--1UhMGODdWsrMastO9DZw,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1.0,51.049673,-114.08,The Spicy Amigos,,T2P 0K5,24.0,4.0,AB
4,--1UhMGODdWsrMastO9DZw,1,2016-07-16,0,-i9u19L08A4VvhFVfqXK9Q,5,First time here since it changed ownership. Ha...,0,qUnvyCfCpr9ZG_F5oezJMw,821 4 Avenue SW,...,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",1.0,51.049673,-114.08,The Spicy Amigos,,T2P 0K5,24.0,4.0,AB


In [4]:
df_final_1.columns.values.tolist()

['business_id',
 'cool',
 'date',
 'funny',
 'review_id',
 'stars',
 'text',
 'useful',
 'user_id',
 'address',
 'attributes',
 'categories',
 'city',
 'hours',
 'is_open',
 'latitude',
 'longitude',
 'name',
 'neighborhood',
 'postal_code',
 'review_count',
 'avg_stars',
 'state']
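As a starting point for the suggested EDA, review counts and mean ratings per business can be computed with `groupby`. This is a sketch on made-up data; the column names `n_reviews` and `mean_stars` are chosen here for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2', 'b1', 'b3', 'b2'],
    'stars': [5, 4, 2, 3, 1, 4],
})

per_business = (
    reviews.groupby('business_id')
    .agg(n_reviews=('stars', 'size'), mean_stars=('stars', 'mean'))
    .sort_values('n_reviews', ascending=False)
)
print(per_business)

# With matplotlib available, per_business['n_reviews'].plot(kind='hist')
# would draw the distribution of review counts.
```

On the real data, the `n_reviews` column shows how many reviews each business has, which helps when deciding whether to drop businesses with very few reviews.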

## Save preprocessed dataset to csv file


In [5]:
# Build the Las Vegas and non-null category filters; the result is saved to ../data/last_2_years_restaurant_reviews.csv for your next task
condi_la = (df_final_1['city'] == "Las Vegas")
condi_not_null = (df_final_1['categories'].notnull())

In [11]:
df_filtered = df_final_1[condi_la & condi_not_null]
df_with_res = df_filtered['categories'].apply(lambda cat: 'Restaurants' in cat)
df_res = df_filtered[df_with_res]
df_res.head(5)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,address,...,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,avg_stars,state
113,--9e1ONYQuAa-CB_Rrw7Tw,0,2017-02-14,0,VETXTwMw6qxzOVDlXfe6Tg,5,went for dinner tonight. Amazing my husband ha...,0,ymlnR8UeFvB4FZL56tCZsA,3355 Las Vegas Blvd S,...,"{'Monday': '11:30-14:0', 'Tuesday': '11:30-14:...",1.0,36.123183,-115.169,Delmonico Steakhouse,The Strip,89109,1546.0,4.0,NV
114,--9e1ONYQuAa-CB_Rrw7Tw,0,2017-12-04,0,S8-8uZ7fa5YbjnEtaW15ng,5,This was an amazing dinning experience! ORDER ...,0,9pSSL6X6lFpY3FCRLEH3og,3355 Las Vegas Blvd S,...,"{'Monday': '11:30-14:0', 'Tuesday': '11:30-14:...",1.0,36.123183,-115.169,Delmonico Steakhouse,The Strip,89109,1546.0,4.0,NV
115,--9e1ONYQuAa-CB_Rrw7Tw,0,2016-08-22,1,1nK5w0VNfDlnR3bOz13dJQ,5,My husband and I went there for lunch on a Sat...,1,gm8nNoA3uB4In5o_Hxpq3g,3355 Las Vegas Blvd S,...,"{'Monday': '11:30-14:0', 'Tuesday': '11:30-14:...",1.0,36.123183,-115.169,Delmonico Steakhouse,The Strip,89109,1546.0,4.0,NV
116,--9e1ONYQuAa-CB_Rrw7Tw,0,2016-09-13,0,N1Z93BthdJ7FT2p5S22jIA,3,Went for a nice anniversary dinner. Researched...,0,CEtidlXNyQzgJSdF1ubPFw,3355 Las Vegas Blvd S,...,"{'Monday': '11:30-14:0', 'Tuesday': '11:30-14:...",1.0,36.123183,-115.169,Delmonico Steakhouse,The Strip,89109,1546.0,4.0,NV
117,--9e1ONYQuAa-CB_Rrw7Tw,0,2015-02-02,0,_Uwp6FO1X-avE9wqTMC59w,5,This place is first class in every way. Lobste...,0,-Z7Nw2UF7NiBSAzfXNA_XA,3355 Las Vegas Blvd S,...,"{'Monday': '11:30-14:0', 'Tuesday': '11:30-14:...",1.0,36.123183,-115.169,Delmonico Steakhouse,The Strip,89109,1546.0,4.0,NV


In [12]:
last_2_years_restaurant_reviews = '../data/last_2_years_restaurant_reviews.csv'
df_res.to_csv(last_2_years_restaurant_reviews, index=False)

In [8]:
df_with_res.head(5)

113    True
114    True
115    True
116    True
117    True
Name: categories, dtype: bool