# Yelp Data Challenge - Data Preprocessing

BitTiger DS501

Jun 2017

## Dataset Introduction

[Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge)

The Challenge Dataset:

    4.1M reviews and 947K tips by 1M users for 144K businesses
    1.1M business attributes, e.g., hours, parking availability, ambience.
    Aggregated check-ins over time for each of the 125K businesses
    200,000 pictures from the included businesses

Cities:

    U.K.: Edinburgh
    Germany: Karlsruhe
    Canada: Montreal and Waterloo
    U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Cleveland

Files:

    yelp_academic_dataset_business.json
    yelp_academic_dataset_checkin.json
    yelp_academic_dataset_review.json
    yelp_academic_dataset_tip.json
    yelp_academic_dataset_user.json

Notes on the Dataset:

    Each file is composed of a single object type, one JSON object per line.
    Take a look at some examples to get you started: https://github.com/Yelp/dataset-examples.



## Read data from file and load to Pandas DataFrame

**Warning**: Loading the full 1.8 GB dataset into Pandas at once takes a long time and a lot of memory!
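
One way to reduce the memory cost, sketched here on a tiny in-memory stand-in for a JSON-lines file, is to parse the file line by line and keep only the fields you need. The helper name `read_json_lines`, the `keep_fields` and `chunk_size` parameters, and the sample records are all hypothetical; this is an illustration under those assumptions, not the approach the notebook takes.

```python
import io
import json
import pandas as pd

# Tiny stand-in for a JSON-lines file such as the review file (hypothetical records).
sample = io.StringIO(
    '{"business_id": "a", "stars": 5, "text": "great"}\n'
    '{"business_id": "b", "stars": 2, "text": "meh"}\n'
    '{"business_id": "a", "stars": 4, "text": "good"}\n'
)

def read_json_lines(handle, keep_fields, chunk_size=2):
    """Parse one JSON object per line, keeping only the selected fields, in small batches."""
    frames, batch = [], []
    for line in handle:
        record = json.loads(line)
        batch.append({key: record.get(key) for key in keep_fields})
        if len(batch) == chunk_size:
            frames.append(pd.DataFrame(batch))
            batch = []
    if batch:  # flush the last, possibly shorter, batch
        frames.append(pd.DataFrame(batch))
    return pd.concat(frames, ignore_index=True)

df_small = read_json_lines(sample, keep_fields=["business_id", "stars"])
print(df_small)
```

Dropping large text fields while parsing is what saves memory here; the batching only keeps the intermediate Python lists short.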

In [2]:
import json
import pandas as pd
import numpy as np

In [3]:
file_business, file_checkin, file_review, file_tip, file_user = [
    'yelp_dataset_challenge_round9/yelp_academic_dataset_business.json',
    'yelp_dataset_challenge_round9/yelp_academic_dataset_checkin.json',
    'yelp_dataset_challenge_round9/yelp_academic_dataset_review.json',
    'yelp_dataset_challenge_round9/yelp_academic_dataset_tip.json',
    'yelp_dataset_challenge_round9/yelp_academic_dataset_user.json'
]

#### Business Data

In [4]:
with open(file_business) as f:
    df_business = pd.DataFrame(json.loads(line) for line in f)

In [5]:
df_business.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state,type
0,"227 E Baseline Rd, Ste J2","[BikeParking: True, BusinessAcceptsBitcoin: Fa...",0DI8Dt2PJp07XkVvIElIcQ,"[Tobacco Shops, Nightlife, Vape Shops, Shopping]",Tempe,"[Monday 11:0-21:0, Tuesday 11:0-21:0, Wednesda...",0,33.378214,-111.936102,Innovative Vapors,,85283,17,4.5,AZ,business
1,495 S Grand Central Pkwy,"[BusinessAcceptsBitcoin: False, BusinessAccept...",LTlCaCGZE14GuaUXUGbamg,"[Caterers, Grocery, Food, Event Planning & Ser...",Las Vegas,"[Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday 0:...",1,36.192284,-115.159272,Cut and Taste,,89106,9,5.0,NV,business
2,979 Bloor Street W,"[Alcohol: none, Ambience: {'romantic': False, ...",EDqCEAGXVGCH4FJXgqtjqg,"[Restaurants, Pizza, Chicken Wings, Italian]",Toronto,"[Monday 11:0-2:0, Tuesday 11:0-2:0, Wednesday ...",1,43.661054,-79.429089,Pizza Pizza,Dufferin Grove,M6H 1L5,7,2.5,ON,business
3,7014 Steubenville Pike,"[AcceptsInsurance: False, BusinessAcceptsCredi...",cnGIivYRLxpF7tBVR_JwWA,"[Hair Removal, Beauty & Spas, Blow Dry/Out Ser...",Oakdale,"[Tuesday 10:0-21:0, Wednesday 10:0-21:0, Thurs...",1,40.444544,-80.17454,Plush Salon and Spa,,15071,4,4.0,PA,business
4,321 Jarvis Street,"[BusinessAcceptsCreditCards: True, Restaurants...",cdk-qqJ71q6P7TJTww_DSA,"[Hotels & Travel, Event Planning & Services, H...",Toronto,,1,43.659829,-79.375401,Comfort Inn,Downtown Core,M5B 2C2,8,3.0,ON,business


In [6]:
df_business.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144072 entries, 0 to 144071
Data columns (total 16 columns):
address         144072 non-null object
attributes      127162 non-null object
business_id     144072 non-null object
categories      143747 non-null object
city            144072 non-null object
hours           102464 non-null object
is_open         144072 non-null int64
latitude        144072 non-null float64
longitude       144072 non-null float64
name            144072 non-null object
neighborhood    144072 non-null object
postal_code     144072 non-null object
review_count    144072 non-null int64
stars           144072 non-null float64
state           144072 non-null object
type            144072 non-null object
dtypes: float64(3), int64(2), object(11)
memory usage: 17.6+ MB


#### Checkin Data

In [7]:
# with open(file_checkin) as f:
#     df_checkin = pd.DataFrame(json.loads(line) for line in f)
# df_checkin.head(2)

#### Review Data

In [8]:
# with open(file_review) as f:
#     df_review = pd.DataFrame(json.loads(line) for line in f)
# df_review.head(2)

#### Tip Data

In [9]:
# with open(file_tip) as f:
#     df_tip = pd.DataFrame(json.loads(line) for line in f)
# df_tip.head(2)

#### User Data

In [10]:
# with open(file_user) as f:
#     df_user = pd.DataFrame(json.loads(line) for line in f)
# df_user.head(2)

## Filter data by city and category

#### Create filters/masks

* Create filters that select businesses
    * that are located in "Las Vegas"
    * that contain "Restaurants" in their categories (you may need to filter out null categories first)
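
The two filters described above can be sketched on a toy frame as follows. The frame `df_toy` and its rows are made up for illustration; the real notebook works on `df_business`, and this is only one of several ways to build the masks.

```python
import pandas as pd

# Toy data standing in for df_business (hypothetical rows).
df_toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "city": ["Las Vegas", "Las Vegas", "Tempe", "Las Vegas"],
    "categories": [["Restaurants", "Pizza"], None, ["Restaurants"], ["Shopping"]],
})

# Mask 1: businesses located in Las Vegas.
in_las_vegas = df_toy["city"] == "Las Vegas"

# Mask 2: categories is a list that contains "Restaurants".
# Anything that is not a list (None/NaN) counts as "no categories".
is_restaurant = df_toy["categories"].apply(
    lambda cats: isinstance(cats, list) and "Restaurants" in cats
)

# Combine the masks with element-wise AND and select the matching rows.
df_toy_filtered = df_toy[in_las_vegas & is_restaurant]
print(df_toy_filtered["name"].tolist())
```

Checking the type inside the lambda handles missing categories in the same pass, so no separate null-filtering step is needed.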

In [11]:
# Create Pandas DataFrame filters
names = df_business.columns
shape = df_business.shape
print(names)

Index([u'address', u'attributes', u'business_id', u'categories', u'city',
       u'hours', u'is_open', u'latitude', u'longitude', u'name',
       u'neighborhood', u'postal_code', u'review_count', u'stars', u'state',
       u'type'],
      dtype='object')


In [12]:
# Build boolean masks: city is Las Vegas; categories is non-null and contains 'Restaurants'
Correct_City = (df_business['city'] == 'Las Vegas')
Correct_Category = ~ pd.isnull(df_business['categories'])
for i in range(shape[0]):
    if Correct_Category[i]:
        Correct_Category[i] = 'Restaurants' in df_business['categories'][i]

In [13]:
filters = Correct_City & Correct_Category

In [14]:
# Create filtered DataFrame, and name it df_filtered
df_filtered = df_business[filters]

In [22]:
CorrectCity = df_business['city'] == 'Las Vegas'
CorrectCategory = df_business['categories'].apply(lambda x: x is not None and 'Restaurants' in x)
Filter = CorrectCity&CorrectCategory
df_business[Filter]

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state,type
37,4811 S Rainbow Blvd,"[Alcohol: full_bar, BusinessAcceptsCreditCards...",saWZO6hB4B8P-mIzS1--Xw,"[Persian/Iranian, Restaurants, Ethnic Food, Fo...",Las Vegas,,0,36.101020,-115.244312,Kabob Palace,Spring Valley,89103,15,2.5,NV,business
71,"4972 S Maryland Pkwy, Ste 22","[Alcohol: none, Ambience: {'romantic': False, ...",hMh9XOwNQcu31NAOCqhAEw,"[Restaurants, Vegetarian, Indian]",Las Vegas,,1,36.099142,-115.136192,Taste of India,Southeast,89119,33,3.5,NV,business
72,2053 Pama Ln,"[Ambience: {'romantic': False, 'intimate': Fal...",pmJqSsCfgbo3TxPWpQNLIw,"[American (New), Cafes, Restaurants]",Las Vegas,"[Wednesday 10:0-14:0, Thursday 10:0-14:0, Frid...",0,36.065839,-115.123944,Artisanal Foods Cafe,Southeast,89119,35,4.5,NV,business
102,4341 N Rancho Dr,"[Alcohol: full_bar, Ambience: {'romantic': Fal...",kUUBBLBHCasOl2a5nW9nAw,"[Nightlife, Bars, Restaurants, Thai, Sports Bars]",Las Vegas,,0,36.238462,-115.231950,Bailey's Sports Bar & Eatery,Northwest,89130,9,3.5,NV,business
103,4949 N Rancho Dr,"[Alcohol: none, Ambience: {'romantic': False, ...",A2pZTpFXWC38z506XIhnBQ,"[Chicken Wings, Fast Food, Restaurants]",Las Vegas,"[Monday 10:30-0:0, Tuesday 10:30-0:0, Wednesda...",1,36.249719,-115.244528,Wingstop,Northwest,89130,41,3.5,NV,business
154,2230 W Bonanza Rd,"[Alcohol: none, BusinessAcceptsCreditCards: Tr...",InDH4ZQ_byiQ5PyaqgHI8Q,"[Barbeque, Restaurants, Southern]",Las Vegas,,0,36.177516,-115.173218,Big Mama's Soul Food Rib Shack,,89106,6,2.0,NV,business
177,7002 W Charleston Blvd,"[Alcohol: beer_and_wine, BusinessAcceptsCredit...",WleVOQ9YhBYl4SWrlsLDhA,"[Restaurants, Burgers, Local Flavor, Barbeque]",Las Vegas,"[Monday 10:0-20:0, Tuesday 10:0-20:0, Wednesda...",0,36.159109,-115.250935,Red Apple Grill,Westside,89145,6,3.5,NV,business
179,"7501 N Cimarron Rd, Ste 108","[Alcohol: none, Ambience: {'romantic': False, ...",gMUAn6xcuE-TbY1seFw_Ww,"[Desserts, Food, Pizza, Restaurants, Bakeries]",Las Vegas,"[Monday 11:0-21:0, Tuesday 11:0-20:0, Wednesda...",1,36.297330,-115.270268,Presto Calzone Bakery,Centennial,89131,132,4.5,NV,business
184,7930 W Tropical Pkwy,"[Alcohol: beer_and_wine, Ambience: {'romantic'...",kUntNQ5P9IrRzEoHdRxV-w,"[Restaurants, Pizza]",Las Vegas,"[Monday 11:0-21:0, Tuesday 11:0-21:0, Wednesda...",0,36.271169,-115.267759,Mark Rich's New York Pizza & Pasta,Centennial,89149,74,3.5,NV,business
260,"1930 N Decatur Blvd, Ste 1","[Alcohol: beer_and_wine, Ambience: {'romantic'...",-vb_yx5QnIhpXUIdPVD2og,"[Restaurants, Chinese]",Las Vegas,"[Monday 11:30-22:0, Tuesday 11:30-22:0, Wednes...",0,36.194891,-115.205114,Fair View Chinese Cuisine,,89108,7,3.5,NV,business


#### Keep relevant columns

* only keep some useful columns
    * business_id
    * name
    * categories
    * stars
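
A minimal sketch of the column selection on toy data, also showing the `stars` to `avg_stars` rename that the notebook performs a few cells later. The frame `df_toy` is hypothetical. Taking an explicit `.copy()` and using `rename` is one way to avoid pandas' chained-assignment warning; it is offered as an option, not as the only correct form.

```python
import pandas as pd

# Toy stand-in for the filtered business DataFrame (hypothetical rows).
df_toy = pd.DataFrame({
    "business_id": ["x1", "x2"],
    "name": ["A", "B"],
    "categories": [["Restaurants"], ["Restaurants", "Pizza"]],
    "stars": [4.0, 3.5],
    "city": ["Las Vegas", "Las Vegas"],
})

keep = ["business_id", "name", "categories", "stars"]

# .copy() makes the selection an independent DataFrame, so later edits
# do not act on a slice of the original.
df_toy_selected = df_toy[keep].copy()

# rename() returns a new DataFrame; the renamed column keeps its position.
df_toy_selected = df_toy_selected.rename(columns={"stars": "avg_stars"})
print(df_toy_selected.columns.tolist())
```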

In [124]:
selected_features = [u'business_id', u'name', u'categories', u'stars']

In [125]:
# Make a DataFrame that contains only the columns listed above, and name it df_selected_business
df_selected_business = df_filtered[selected_features]
df_selected_business.head()

Unnamed: 0,business_id,name,categories,stars
37,saWZO6hB4B8P-mIzS1--Xw,Kabob Palace,"[Persian/Iranian, Restaurants, Ethnic Food, Fo...",2.5
71,hMh9XOwNQcu31NAOCqhAEw,Taste of India,"[Restaurants, Vegetarian, Indian]",3.5
72,pmJqSsCfgbo3TxPWpQNLIw,Artisanal Foods Cafe,"[American (New), Cafes, Restaurants]",4.5
102,kUUBBLBHCasOl2a5nW9nAw,Bailey's Sports Bar & Eatery,"[Nightlife, Bars, Restaurants, Thai, Sports Bars]",3.5
103,A2pZTpFXWC38z506XIhnBQ,Wingstop,"[Chicken Wings, Fast Food, Restaurants]",3.5


In [126]:
# Rename the column "stars" to "avg_stars" to avoid a naming conflict with the review dataset.
# rename() returns a new DataFrame, so nothing is assigned into a slice of df_filtered.
df_selected_business = df_selected_business.rename(columns = {'stars': 'avg_stars'})


In [127]:
# Inspect your DataFrame
df_selected_business
df_selected_business.iloc[35]

business_id                               Uux8P1ruzjPTQgOeYx68rg
name                                                   Nook Café
categories     [Cafes, Coffee & Tea, Food, Breakfast & Brunch...
avg_stars                                                    3.5
Name: 903, dtype: object

#### Save results to csv files

In [128]:
# Save to ./data/selected_business.csv for your next task
df_selected_business.to_csv('./data/selected_business.csv', encoding = 'utf-8', index = False)

In [129]:
# Reload the CSV file to check that everything works
df_selected_business = pd.read_csv('./data/selected_business.csv')
df_selected_business.head()

Unnamed: 0,business_id,name,categories,avg_stars
0,saWZO6hB4B8P-mIzS1--Xw,Kabob Palace,"[Persian/Iranian, Restaurants, Ethnic Food, Fo...",2.5
1,hMh9XOwNQcu31NAOCqhAEw,Taste of India,"[Restaurants, Vegetarian, Indian]",3.5
2,pmJqSsCfgbo3TxPWpQNLIw,Artisanal Foods Cafe,"[American (New), Cafes, Restaurants]",4.5
3,kUUBBLBHCasOl2a5nW9nAw,Bailey's Sports Bar & Eatery,"[Nightlife, Bars, Restaurants, Thai, Sports Bars]",3.5
4,A2pZTpFXWC38z506XIhnBQ,Wingstop,"[Chicken Wings, Fast Food, Restaurants]",3.5


### Use the "business_id" column to filter review data

* We want a DataFrame that contains only the reviews of the businesses we just selected
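
The notebook does this filtering with a join in the cells below. As an alternative sketch, assuming only the review rows are needed and not the business columns, `Series.isin` can select matching reviews directly. The frames `df_biz` and `df_rev` are toy data made up for this example.

```python
import pandas as pd

# Toy stand-ins for the selected businesses and the full review table.
df_biz = pd.DataFrame({"business_id": ["b1", "b2"], "name": ["A", "B"]})
df_rev = pd.DataFrame({
    "business_id": ["b1", "b3", "b2", "b1"],
    "stars": [5, 1, 4, 3],
})

# Keep only reviews whose business_id appears among the selected businesses.
wanted_ids = set(df_biz["business_id"])
df_rev_kept = df_rev[df_rev["business_id"].isin(wanted_ids)]
print(len(df_rev_kept))
```

Unlike a join, this keeps only the review columns, so the business name and average stars would still have to be attached afterwards if they are needed.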

#### Load review dataset

In [130]:
with open(file_review) as f:
    df_review = pd.DataFrame(json.loads(line) for line in f)
df_review.head(2)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,type,useful,user_id
0,2aFiy99vNLklCx3T_tGS9A,0,2011-10-10,0,NxL8SIC5yqOdnlXCg18IBg,5,If you enjoy service by someone who is as comp...,review,0,KpkOkG6RIf4Ra25Lhhxf1A
1,2aFiy99vNLklCx3T_tGS9A,0,2010-12-29,0,pXbbIgOXvLuTi_SPs1hQEQ,5,After being on the phone with Verizon Wireless...,review,1,bQ7fQq1otn9hKX-gXRsrgA


#### Prepare DataFrames to be joined on business_id

In [131]:
# Prepare the business dataframe and set index to column "business_id", and name it as df_left
df_left = df_selected_business.set_index('business_id')

In [132]:
# Prepare the review dataframe and set index to column "business_id", and name it as df_right
df_right = df_review.set_index('business_id')

In [133]:
# check df_left and df_right
df_left.head()

business_id,name,categories,avg_stars
saWZO6hB4B8P-mIzS1--Xw,Kabob Palace,"[Persian/Iranian, Restaurants, Ethnic Food, Fo...",2.5
hMh9XOwNQcu31NAOCqhAEw,Taste of India,"[Restaurants, Vegetarian, Indian]",3.5
pmJqSsCfgbo3TxPWpQNLIw,Artisanal Foods Cafe,"[American (New), Cafes, Restaurants]",4.5
kUUBBLBHCasOl2a5nW9nAw,Bailey's Sports Bar & Eatery,"[Nightlife, Bars, Restaurants, Thai, Sports Bars]",3.5
A2pZTpFXWC38z506XIhnBQ,Wingstop,"[Chicken Wings, Fast Food, Restaurants]",3.5


In [134]:
df_right.head()

business_id,cool,date,funny,review_id,stars,text,type,useful,user_id
2aFiy99vNLklCx3T_tGS9A,0,2011-10-10,0,NxL8SIC5yqOdnlXCg18IBg,5,If you enjoy service by someone who is as comp...,review,0,KpkOkG6RIf4Ra25Lhhxf1A
2aFiy99vNLklCx3T_tGS9A,0,2010-12-29,0,pXbbIgOXvLuTi_SPs1hQEQ,5,After being on the phone with Verizon Wireless...,review,1,bQ7fQq1otn9hKX-gXRsrgA
2aFiy99vNLklCx3T_tGS9A,0,2011-04-29,0,wslW2Lu4NYylb1jEapAGsw,5,Great service! Corey is very service oriented....,review,0,r1NUhdNmL6yU9Bn-Yx6FTw
2LfIuF3_sX6uwe-IR-P0jQ,1,2014-07-14,0,GP6YEearUWrzPtQYSF1vVg,5,Highly recommended. Went in yesterday looking ...,review,0,aW3ix1KNZAvoM8q-WghA3Q
2LfIuF3_sX6uwe-IR-P0jQ,0,2014-01-15,0,25RlYGq2s5qShi-pn3ufVA,4,I walked in here looking for a specific piece ...,review,0,YOo-Cip8HqvKp_p9nEGphw


#### Join! and reset index

In [135]:
# Join df_left and df_right. What type of join?
# inner join
df_merge = pd.merge(df_left, df_right, left_index = True, right_index = True, how = 'inner')
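
To make the choice of join type concrete, here is a small sketch comparing an inner join with a left join on the index. The frames `left` and `right` are toy data, not the notebook's `df_left` and `df_right`.

```python
import pandas as pd

# Toy business table indexed by business_id (hypothetical rows).
left = pd.DataFrame(
    {"name": ["A", "B"], "avg_stars": [4.0, 3.5]},
    index=pd.Index(["b1", "b2"], name="business_id"),
)
# Toy review table: b1 has two reviews, b3 is not a selected business, b2 has none.
right = pd.DataFrame(
    {"stars": [5, 1, 4], "text": ["good", "bad", "fine"]},
    index=pd.Index(["b1", "b3", "b1"], name="business_id"),
)

# Inner join: only ids present on both sides survive.
inner = pd.merge(left, right, left_index=True, right_index=True, how="inner")

# Left join: every business is kept, with NaN where it has no reviews.
left_join = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(len(inner), len(left_join))
```

The inner join drops both the reviews of unselected businesses and the businesses without reviews, which is the behavior wanted for a review dataset.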

In [136]:
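# You may want to reset the index.
# Note: reset_index() returns a new DataFrame and does not modify df_merge in place.
# The result is not assigned here, so business_id stays as the index,
# which the later groupby and merge cells rely on.
df_merge.reset_index()
df_merge.head()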

business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,type,useful,user_id
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-26,0,nCqdz-NW64KazpxqnDr0sQ,1,I mainly went for the ceasar salad prepared ta...,review,0,0XVzm4kVIAaH4eQAxWbhvw
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,1,2010-03-25,1,pa4MASGD-2EFoR_rGDZILw,5,"""WOW!!!"" that's what she said... literally! lo...",review,1,5aFBj0emFzoXsUcKbDQZiA
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,1,2013-04-03,0,8N_ZSR4q3m2dTpWyB-ncCg,1,I visited this place with a few of my friends ...,review,1,n0y7p7B1NMia_3lpk7xK3A
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2013-07-13,0,aVor8Ttm0RT3JBvv6DOWAQ,4,"Delmonico is a terrific steakhouse, with great...",review,1,aP4BkNgP4wzQ5woQM-BI1A
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-29,0,iwx6s6yQxc7yjS7NFANZig,4,Nice atmosphere and wonderful service. I had t...,review,0,2aeNFntqY2QDZLADNo8iQQ


#### Further filter data by date, e.g. keep reviews from the last 2 years

* Otherwise your laptop may run out of memory when running machine learning algorithms
* We purposefully ignore reviews that were written too long ago
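
A minimal sketch of date filtering on toy data. The fixed boundary mirrors the date used in the cells below; the rolling boundary (two years before the newest review) is an assumption added for illustration, not something the notebook does.

```python
import pandas as pd

# Toy reviews with dates stored as strings, as they come out of the JSON file.
df_toy = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "date": ["2014-12-31", "2015-01-20", "2016-02-16"],
})
# Convert to datetime so comparisons are chronological, not string-based.
df_toy["date"] = pd.to_datetime(df_toy["date"])

# Fixed boundary: keep reviews dated on or after 2015-01-20.
boundary = pd.to_datetime("2015-01-20")
recent_fixed = df_toy[df_toy["date"] >= boundary]

# Alternative (assumption): a boundary two years before the newest review.
rolling_boundary = df_toy["date"].max() - pd.DateOffset(years=2)
recent_rolling = df_toy[df_toy["date"] >= rolling_boundary]

print(recent_fixed["review_id"].tolist())
```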

In [137]:
# Convert the date column to datetime so it can be compared against a boundary date (2015-01-20)
df_merge['date'] = pd.to_datetime(df_merge['date'])
df_merge.head()

business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,type,useful,user_id
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-26,0,nCqdz-NW64KazpxqnDr0sQ,1,I mainly went for the ceasar salad prepared ta...,review,0,0XVzm4kVIAaH4eQAxWbhvw
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,1,2010-03-25,1,pa4MASGD-2EFoR_rGDZILw,5,"""WOW!!!"" that's what she said... literally! lo...",review,1,5aFBj0emFzoXsUcKbDQZiA
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,1,2013-04-03,0,8N_ZSR4q3m2dTpWyB-ncCg,1,I visited this place with a few of my friends ...,review,1,n0y7p7B1NMia_3lpk7xK3A
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2013-07-13,0,aVor8Ttm0RT3JBvv6DOWAQ,4,"Delmonico is a terrific steakhouse, with great...",review,1,aP4BkNgP4wzQ5woQM-BI1A
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-29,0,iwx6s6yQxc7yjS7NFANZig,4,Nice atmosphere and wonderful service. I had t...,review,0,2aeNFntqY2QDZLADNo8iQQ


In [138]:
df_merge_filtered = df_merge.copy()
boundary = pd.to_datetime('2015-01-20')

In [139]:
# Filter the joined DataFrame and name it as df_final
df_final = df_merge_filtered[df_merge_filtered['date'] >= boundary].copy()

#### Take a glance at the final dataset

* Do more EDA here as you like!

In [140]:
import matplotlib.pyplot as plt

%matplotlib inline

In [141]:
# e.g. calculate counts of reviews per business entity, and plot it
df_final['count'] = 1
df_review_num = df_final[['count']].groupby('business_id').sum()
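
The same per-business count can be computed without a helper column. The sketch below uses `groupby(...).size()` on a toy frame indexed by `business_id`; the data is made up, and this is an alternative formulation rather than the notebook's own code.

```python
import pandas as pd

# Toy reviews indexed by business_id, mirroring the shape of the joined data.
df_toy = pd.DataFrame(
    {"stars": [5, 4, 1]},
    index=pd.Index(["b1", "b1", "b2"], name="business_id"),
)

# size() counts rows per group; level= groups on the index directly.
review_counts = df_toy.groupby(level="business_id").size().rename("count")
print(review_counts.to_dict())
```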



In [142]:
# Output the number of reviews per business entity
df_review_num.head()

business_id,count
--9e1ONYQuAa-CB_Rrw7Tw,318
-1vfRrlnNnNJ5boOVghMPA,14
-3zffZUHoY8bQjGfPSoBKQ,126
-8R_-EkGpUhBk55K9Dd4mg,35
-9YyInW1wapzdNZrhQJ9dg,50


In [143]:
# Same date filter as df_final, but without the helper 'count' column
df_reviews_text = df_merge_filtered[df_merge_filtered['date'] >= boundary].copy()
df_reviews_text.head()

business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,type,useful,user_id
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-26,0,nCqdz-NW64KazpxqnDr0sQ,1,I mainly went for the ceasar salad prepared ta...,review,0,0XVzm4kVIAaH4eQAxWbhvw
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-29,0,iwx6s6yQxc7yjS7NFANZig,4,Nice atmosphere and wonderful service. I had t...,review,0,2aeNFntqY2QDZLADNo8iQQ
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-04-05,0,2HrBENXZTiitcCJfzkELgA,2,To be honest it really quit aweful. First the ...,review,0,WFhv5pMJRDPWSyLnKiWFXA
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-16,0,6YNPXoq41qTMZ2TEi0BYUA,2,"The food was decent, but the service was defin...",review,0,2S6gWE-K3DHNcKYYSgN7xA
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-08,1,4bQrVUiRZ642odcKCS0OhQ,2,If you're looking for craptastic service and m...,review,1,rCTVWx_Tws2jWi-K89iEyw


## Save your preprocessed dataset to csv file

* Respect your laptop's hard work! You don't want to make it run everything again.

In [145]:
df_reviews_merge = pd.merge(df_reviews_text, df_review_num, how = 'inner', left_index = True, right_index = True)
df_reviews_merge.head()

business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,type,useful,user_id,count
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-26,0,nCqdz-NW64KazpxqnDr0sQ,1,I mainly went for the ceasar salad prepared ta...,review,0,0XVzm4kVIAaH4eQAxWbhvw,318
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-29,0,iwx6s6yQxc7yjS7NFANZig,4,Nice atmosphere and wonderful service. I had t...,review,0,2aeNFntqY2QDZLADNo8iQQ,318
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-04-05,0,2HrBENXZTiitcCJfzkELgA,2,To be honest it really quit aweful. First the ...,review,0,WFhv5pMJRDPWSyLnKiWFXA,318
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-16,0,6YNPXoq41qTMZ2TEi0BYUA,2,"The food was decent, but the service was defin...",review,0,2S6gWE-K3DHNcKYYSgN7xA,318
--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-08,1,4bQrVUiRZ642odcKCS0OhQ,2,If you're looking for craptastic service and m...,review,1,rCTVWx_Tws2jWi-K89iEyw,318


In [146]:
# Save to ./data/last_2_years_restaurant_reviews.csv for your next task
df_reviews_merge.to_csv('./data/last_2_years_restaurant_reviews.csv', encoding = 'utf-8', index = True)

In [147]:
df_read_reviews = pd.read_csv('./data/last_2_years_restaurant_reviews.csv')
df_read_reviews.head()

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,type,useful,user_id,count
0,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-26,0,nCqdz-NW64KazpxqnDr0sQ,1,I mainly went for the ceasar salad prepared ta...,review,0,0XVzm4kVIAaH4eQAxWbhvw,318
1,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-06-29,0,iwx6s6yQxc7yjS7NFANZig,4,Nice atmosphere and wonderful service. I had t...,review,0,2aeNFntqY2QDZLADNo8iQQ,318
2,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2015-04-05,0,2HrBENXZTiitcCJfzkELgA,2,To be honest it really quit aweful. First the ...,review,0,WFhv5pMJRDPWSyLnKiWFXA,318
3,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-16,0,6YNPXoq41qTMZ2TEi0BYUA,2,"The food was decent, but the service was defin...",review,0,2S6gWE-K3DHNcKYYSgN7xA,318
4,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-02-08,1,4bQrVUiRZ642odcKCS0OhQ,2,If you're looking for craptastic service and m...,review,1,rCTVWx_Tws2jWi-K89iEyw,318


In [148]:
df_read_reviews.shape

(348016, 14)
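
One thing to keep in mind when reloading the CSV: list-valued columns such as `categories` and datetime columns such as `date` are written as text, so a plain `pd.read_csv` returns them as strings. The sketch below, on a toy frame with made-up values, shows one way to parse them back using `converters` and `parse_dates`. It assumes the list column was written in Python's list literal form.

```python
import ast
import io
import pandas as pd

# Toy frame with a list column and a datetime column (hypothetical values).
df_toy = pd.DataFrame({
    "business_id": ["b1"],
    "categories": [["Steakhouses", "Restaurants"]],
    "date": pd.to_datetime(["2015-06-26"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)

# Plain reload: the list and the date both come back as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with explicit parsing for the list and date columns.
buffer.seek(0)
parsed = pd.read_csv(
    buffer,
    converters={"categories": ast.literal_eval},
    parse_dates=["date"],
)

print(type(plain.loc[0, "categories"]), type(parsed.loc[0, "categories"]))
```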