## Subsetting the Yelp data

In [1]:
import numpy 
import pandas as pd 
import json
import matplotlib.pyplot as plt
import seaborn as sb

### _Filtering the relevant subset of businesses and saving the corresponding data subsets as CSV files for easier loading:_

***Filtering businesses:***

First, we will load the businesses dataframe using pandas:

In [2]:
all_business_df = pd.read_json('yelp_dataset/yelp_academic_dataset_business.json', lines = True)

Next, we focus our analysis on the city chosen for this project: **Toronto**. We also drop businesses that have no `categories` value, since the filters that follow match on that column.

In [3]:
all_business_toronto_df = all_business_df[all_business_df.city == 'Toronto']
all_business_toronto_df = all_business_toronto_df.dropna(subset = ['categories'])
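
The two steps above can be checked on a small toy frame. This is only a sketch: the rows and the `toy_` names are invented for illustration and are not part of the Yelp data.

```python
import pandas as pd

# Toy stand-in for the business table; all rows are made up.
toy_business_df = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3', 'b4'],
    'city': ['Toronto', 'Toronto', 'Las Vegas', 'Toronto'],
    'categories': ['Restaurants, Japanese', None, 'Bars', 'Shopping'],
})

# Keep the Toronto rows, then drop the ones with no categories value.
toy_toronto_df = toy_business_df[toy_business_df.city == 'Toronto']
toy_toronto_df = toy_toronto_df.dropna(subset=['categories'])

toy_ids = list(toy_toronto_df.business_id)
```

Here `b2` is dropped because it has no categories, and `b3` because it is not in Toronto.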

We would like to save a dataset of all the restaurants in Toronto, which could later be used for an extended competitor analysis:

In [4]:
restaurants_business_toronto_df = all_business_toronto_df[all_business_toronto_df.categories.str.contains('Restaurant')]

Then, we select the restaurants whose categories mention Japanese food:

In [5]:
japanese_restaurants_business_toronto_df = restaurants_business_toronto_df[restaurants_business_toronto_df.categories.str.contains('Japanese')]
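
As a rough illustration of how these two filters behave, the sketch below applies `str.contains` to a made-up series of category strings. The match is a case-sensitive substring search, so `'Restaurant'` also matches `'Restaurants'`. The `na=False` argument is an addition for this example only: it treats missing values as non-matches, which the notebook does not need because rows without categories were already dropped.

```python
import pandas as pd

# Made-up category strings, including one missing value.
toy_categories = pd.Series([
    'Restaurants, Japanese, Sushi Bars',
    'Restaurants, Italian',
    'Japanese, Grocery',
    None,
])

# Boolean masks; na=False turns the missing entry into False instead of NaN.
is_restaurant = toy_categories.str.contains('Restaurant', na=False)
is_japanese = toy_categories.str.contains('Japanese', na=False)

# Filtering restaurants first and Japanese second is equivalent to combining both masks.
japanese_restaurant_mask = is_restaurant & is_japanese
```

Only the first entry satisfies both conditions; the Japanese grocery is excluded because it is not tagged as a restaurant.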

Finally, now that we have the relevant subsets of businesses, we create a list of business IDs for each of them, which we will use to subset the other datasets:

In [10]:
business_id_all_business_toronto = list(all_business_toronto_df.business_id)
business_id_restaurants_toronto = list(restaurants_business_toronto_df.business_id)
business_id_japanese_restaurants_toronto = list(japanese_restaurants_business_toronto_df.business_id)

Now that the three subsets are ready, let us save each of them to its own CSV file:

In [11]:
all_business_toronto_df.to_csv('yelp_dataset/toronto_all_business.csv', index = False)
restaurants_business_toronto_df.to_csv('yelp_dataset/toronto_restaurant_business.csv', index = False)
japanese_restaurants_business_toronto_df.to_csv('yelp_dataset/toronto_japanese_business.csv', index = False)
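
One thing to keep in mind when reloading these files: CSV has no notion of nested objects, so a column holding dictionaries is written out as text and comes back as plain strings. The sketch below shows this on a toy frame; the column name `attributes` and its values are assumptions chosen for illustration, and `ast.literal_eval` is just one possible way to recover the dictionary.

```python
import ast
import os
import tempfile

import pandas as pd

# Toy frame with a dictionary-valued column; values are made up.
toy_business_df = pd.DataFrame({
    'business_id': ['b1', 'b4'],
    'attributes': [{'WiFi': 'free'}, {'WiFi': 'no'}],
})

# Write to a temporary CSV and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_csv_path = os.path.join(tmp_dir, 'toy_business.csv')
    toy_business_df.to_csv(toy_csv_path, index=False)
    reloaded_df = pd.read_csv(toy_csv_path)

# After the round trip the dictionary is a string and has to be parsed again.
first_attributes = reloaded_df.attributes[0]
parsed_attributes = ast.literal_eval(first_attributes)
```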

In [12]:
del all_business_df
del all_business_toronto_df
del restaurants_business_toronto_df
del japanese_restaurants_business_toronto_df

***Filtering reviews:***

The *review* JSON file is extremely large, and reading it directly with the specialized `read_json` functions of `pandas` or `dask` often raises memory errors. Therefore, we parse it line by line with plain Python instead:

In [15]:
with open('yelp_dataset/yelp_academic_dataset_review.json', mode = 'r', encoding = 'utf8') as json_file:
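    # Parse one JSON record per line while iterating over the file,
    # so the raw text of the whole file is never held in memory at once
    data = [json.loads(line) for line in json_file]
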
all_reviews_df = pd.DataFrame(data)
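
If memory is still tight, one possible alternative (not the approach used in this notebook) is to read the file in chunks and filter each chunk before keeping it, so the full table never has to exist in memory. The sketch below relies on the `chunksize` option that `pd.read_json` accepts together with `lines=True`; the helper name `filter_json_lines` and the toy file are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd


def filter_json_lines(path, wanted_business_ids, chunksize=100_000):
    """Read a JSON-lines file chunk by chunk, keeping rows of the wanted businesses."""
    wanted = set(wanted_business_ids)
    kept = []
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            # Only the matching rows of each chunk are retained.
            kept.append(chunk[chunk.business_id.isin(wanted)])
    return pd.concat(kept, ignore_index=True)


# Toy JSON-lines file standing in for the review file; rows are made up.
toy_rows = [
    {'review_id': 'r1', 'business_id': 'b1', 'stars': 5},
    {'review_id': 'r2', 'business_id': 'b9', 'stars': 3},
    {'review_id': 'r3', 'business_id': 'b4', 'stars': 4},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_review.json')
    with open(toy_path, mode='w', encoding='utf8') as toy_file:
        for row in toy_rows:
            toy_file.write(json.dumps(row) + '\n')
    toy_reviews_df = filter_json_lines(toy_path, ['b1', 'b4'], chunksize=2)
```

The chunk size trades memory for speed: smaller chunks keep less data in memory at a time but add more per-chunk overhead.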

Once the data is loaded, we immediately keep only the reviews that are relevant to our businesses, which lowers the memory footprint of the *reviews* dataframe.

First, we keep the reviews of all businesses in **Toronto**:

In [16]:
all_reviews_toronto_df = all_reviews_df[all_reviews_df.business_id.isin(business_id_all_business_toronto)]

Next, we keep the reviews of all restaurants in Toronto:

In [17]:
restaurants_reviews_toronto_df = all_reviews_toronto_df[all_reviews_toronto_df.business_id.isin(business_id_restaurants_toronto)]

We then narrow this down further to the reviews of Japanese restaurants in Toronto:

In [18]:
japanese_restaurants_reviews_toronto_df = restaurants_reviews_toronto_df[restaurants_reviews_toronto_df.business_id.isin(business_id_japanese_restaurants_toronto)]

Lastly, we store the IDs of the users who wrote these reviews, in order to filter the Tips and Users datasets:

In [19]:
user_id_all_reviews_toronto = list(all_reviews_toronto_df.user_id)
user_id_restaurants_toronto = list(restaurants_reviews_toronto_df.user_id)
user_id_japanese_restaurants_toronto = list(japanese_restaurants_reviews_toronto_df.user_id)
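
Note that a user who wrote several reviews appears several times in these lists. This does not change what `isin` selects later on, but the lists can be shortened by keeping each ID once. A minimal sketch on made-up data, with hypothetical `toy_` names:

```python
import pandas as pd

# Toy reviews in which u1 and u2 each wrote more than one review.
toy_reviews_df = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u1', 'u3', 'u2'],
    'business_id': ['b1', 'b1', 'b4', 'b4', 'b4'],
})

user_ids_with_repeats = list(toy_reviews_df.user_id)
# unique() keeps each ID once, in order of first appearance.
unique_user_ids = list(toy_reviews_df.user_id.unique())

# Both lists select exactly the same rows of a users table.
toy_users_df = pd.DataFrame({'user_id': ['u1', 'u2', 'u3', 'u4']})
same_selection = toy_users_df.user_id.isin(user_ids_with_repeats).equals(
    toy_users_df.user_id.isin(unique_user_ids)
)
```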

Let us save these three datasets for future reuse:

In [20]:
all_reviews_toronto_df.to_csv('yelp_dataset/toronto_all_reviews.csv', index = False)
restaurants_reviews_toronto_df.to_csv('yelp_dataset/toronto_restaurant_reviews.csv', index = False)
japanese_restaurants_reviews_toronto_df.to_csv('yelp_dataset/toronto_japanese_reviews.csv', index = False)

In [21]:
del all_reviews_df
del all_reviews_toronto_df
del restaurants_reviews_toronto_df
del japanese_restaurants_reviews_toronto_df

***Filtering checkins:***

The same steps are repeated for the check-in dataset:

In [22]:
all_checkins_df = pd.read_json('yelp_dataset/yelp_academic_dataset_checkin.json', lines = True)
all_checkins_toronto_df = all_checkins_df[all_checkins_df.business_id.isin(business_id_all_business_toronto)]
restaurants_checkins_toronto_df = all_checkins_toronto_df[all_checkins_toronto_df.business_id.isin(business_id_restaurants_toronto)]
japanese_restaurants_checkins_toronto_df = restaurants_checkins_toronto_df[restaurants_checkins_toronto_df.business_id.isin(business_id_japanese_restaurants_toronto)]

Let's save the datasets:

In [23]:
all_checkins_toronto_df.to_csv('yelp_dataset/toronto_all_checkins.csv', index = False)
restaurants_checkins_toronto_df.to_csv('yelp_dataset/toronto_restaurant_checkins.csv', index = False)
japanese_restaurants_checkins_toronto_df.to_csv('yelp_dataset/toronto_japanese_checkins.csv', index = False)

In [24]:
del all_checkins_df
del all_checkins_toronto_df
del restaurants_checkins_toronto_df
del japanese_restaurants_checkins_toronto_df

***Filtering tips:***

The tips dataset is filtered in the same way, with one addition: a tip is kept only if its author also appears among the reviewers of the corresponding subset:

In [25]:
all_tips_df = pd.read_json('yelp_dataset/yelp_academic_dataset_tip.json', lines = True)
all_tips_toronto_df = all_tips_df[(all_tips_df.business_id.isin(business_id_all_business_toronto) & all_tips_df.user_id.isin(user_id_all_reviews_toronto))]
restaurants_tips_toronto_df = all_tips_toronto_df[(all_tips_toronto_df.business_id.isin(business_id_restaurants_toronto) & all_tips_toronto_df.user_id.isin(user_id_restaurants_toronto))]
japanese_restaurants_tips_toronto_df = restaurants_tips_toronto_df[(restaurants_tips_toronto_df.business_id.isin(business_id_japanese_restaurants_toronto) & restaurants_tips_toronto_df.user_id.isin(user_id_japanese_restaurants_toronto))]
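
To make the effect of the combined condition concrete, here is a small sketch on invented data. The two masks are built separately and joined with `&`, so a tip survives only when both its business and its author are in the respective ID lists. All names and rows below are hypothetical.

```python
import pandas as pd

# Toy tips table; rows are made up.
toy_tips_df = pd.DataFrame({
    'user_id': ['u1', 'u5', 'u2', 'u1'],
    'business_id': ['b1', 'b1', 'b9', 'b4'],
    'text': ['Try the ramen', 'Cash only', 'Great patio', 'Book ahead'],
})
toy_business_ids = ['b1', 'b4']
toy_reviewer_ids = ['u1', 'u2', 'u3']

# One boolean mask per condition, combined element-wise with &.
business_mask = toy_tips_df.business_id.isin(toy_business_ids)
reviewer_mask = toy_tips_df.user_id.isin(toy_reviewer_ids)
toy_kept_tips_df = toy_tips_df[business_mask & reviewer_mask]

kept_texts = list(toy_kept_tips_df.text)
```

The tip from `u5` is dropped even though it concerns a selected business, because `u5` wrote no review in the subset. This is a consequence of the filter worth remembering when counting tips per business.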

In [34]:
all_tips_toronto_df.to_csv('yelp_dataset/toronto_all_tips.csv', index = False)
restaurants_tips_toronto_df.to_csv('yelp_dataset/toronto_restaurant_tips.csv', index = False)
japanese_restaurants_tips_toronto_df.to_csv('yelp_dataset/toronto_japanese_tips.csv', index = False)

In [35]:
del all_tips_df
del all_tips_toronto_df
del restaurants_tips_toronto_df
del japanese_restaurants_tips_toronto_df

***Filtering users:***

Finally, the same steps are applied to the users dataset, this time filtering on the user IDs collected from the reviews:

In [36]:
with open('yelp_dataset/yelp_academic_dataset_user.json', mode = 'r', encoding = 'utf8') as json_file:
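    # Parse one JSON record per line while iterating over the file,
    # so the raw text of the whole file is never held in memory at once
    data = [json.loads(line) for line in json_file]
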
all_users_df = pd.DataFrame(data)

In [37]:
all_users_toronto_df = all_users_df[all_users_df.user_id.isin(user_id_all_reviews_toronto)]
restaurants_users_toronto_df = all_users_toronto_df[all_users_toronto_df.user_id.isin(user_id_restaurants_toronto)]
japanese_restaurants_users_toronto_df = restaurants_users_toronto_df[restaurants_users_toronto_df.user_id.isin(user_id_japanese_restaurants_toronto)]

In [38]:
all_users_toronto_df.to_csv('yelp_dataset/toronto_all_users.csv', index = False)
restaurants_users_toronto_df.to_csv('yelp_dataset/toronto_restaurant_users.csv', index = False)
japanese_restaurants_users_toronto_df.to_csv('yelp_dataset/toronto_japanese_users.csv', index = False)

In [39]:
del all_users_df
del all_users_toronto_df
del restaurants_users_toronto_df
del japanese_restaurants_users_toronto_df