# Business Attributes Dataset Modification Details
## Author: Christian Smith

These are the details and procedures for how I created the "reduced_business" dataset that I used in the "Reimagining Reviews With Sentence Transformers" article.

## Necessary Python Libraries

In [1]:
import numpy as np
import pandas as pd
import json
import math

## reduced_business Dataframe

This section shows how I built the "reduced_business" dataframe. It contains information on individual businesses without their reviews, including their business_id, name, a selection of attributes, and categories.
Note: the original dataset is the "business.json" file from the [Yelp Open Dataset](https://www.yelp.com/dataset).
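
The loading step in the next cell reads the file with `lines = True`, meaning one JSON object per line, and then spreads each business's nested "attributes" object into its own columns. As a rough, standard-library-only sketch of that idea (the two records and the helper name `attributes_to_row` are made up for illustration and are not taken from the Yelp data):

```python
import json

# Two made-up records, serialized one JSON object per line (hypothetical values).
raw_lines = [
    json.dumps({"business_id": "a1", "name": "Cafe A",
                "attributes": {"WiFi": "u'free'", "GoodForKids": "True"}}),
    json.dumps({"business_id": "b2", "name": "Shop B", "attributes": None}),
]

# Parse each line independently, as a line-delimited JSON reader would.
records = [json.loads(line) for line in raw_lines]


def attributes_to_row(record, columns):
    """Return one flat row with a value (or None) for every attribute column."""
    attrs = record.get("attributes") or {}  # a missing/null attributes object becomes empty
    return {col: attrs.get(col) for col in columns}


# The column set is the union of every attribute key seen in any record.
columns = sorted({key for r in records for key in (r.get("attributes") or {})})
rows = [attributes_to_row(r, columns) for r in records]
```

Businesses with no attributes end up with `None` in every attribute column, and the attribute values are still raw strings such as `"True"` or `"u'free'"`, which is why the later cleaning steps are needed.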

In [2]:
# 1) Load "business.json" into a pandas DataFrame.
# Notes about business.json:
# - 150,346 rows x 14 columns.
# - Not all businesses are "open" (active).
# - Contains business data, including location data, attributes, and categories.
# - Attributes include parking, type of restaurant, whether the business is good for kids, noise level, etc.
# - Connects with the other datasets on "business_id".
business = pd.read_json('yelp_academic_dataset_business.json', lines = True)

# 2) Use pd.json_normalize() on the "attributes" column to create a new dataframe with one column per possible attribute.
business_attr = pd.json_normalize(business['attributes'])

# 3) Apply a map over the entire dataframe that simplifies the attribute values to 1 (true), 0 (false), or None. The dataset
# tends to store booleans and nulls as strings ('True', 'False', 'None'), so without this step comparisons and null checks
# can give unexpected results. Dropping columns is optional, but I dropped the ones that do not fit this pattern
# (e.g. "BusinessParking") because I did not plan to use them in this analysis.
# Note: DataFrame.applymap is deprecated as of pandas 2.1 in favor of DataFrame.map, which takes the same function.
business_attr = business_attr.applymap(lambda x: 1 if x == 'True' or x == True 
                                 else 0 if x == 'False' or x == False 
                                 else None if x in ['None', 'NaN', 'nan'] or (isinstance(x, float) and math.isnan(x))
                                 else x).drop(['BusinessParking', 'Ambience', 'GoodForMeal', 'Music', 'BestNights',
                                               'HairSpecializesIn', 'DietaryRestrictions'], axis = 1)

# 4) For attribute columns that do not fit the map pattern, I simplified their values using .apply().
business_attr['RestaurantsAttire'] = business_attr['RestaurantsAttire'].apply(lambda x: 'casual' if x == "u'casual'" or x == "'casual'" 
                                                                        else 'formal' if x == "'formal'" or x == "u'formal'" 
                                                                        else 'semi-formal' if x == "'dressy'" or x == "u'dressy'" 
                                                                        else x)
business_attr['WiFi'] = business_attr['WiFi'].apply(lambda x: 'free' if x == "u'free'" or x == "'free'" 
                                                                        else 'paid' if x == "'paid'" or x == "u'paid'" 
                                                                        else 'no' if x == "'no'" or x == "u'no'" 
                                                                        else x)
business_attr['Alcohol'] = business_attr['Alcohol'].apply(lambda x: 'full_bar' if x == "u'full_bar'" or x == "'full_bar'" 
                                                                        else 'beer_and_wine' if x == "'beer_and_wine'" or x == "u'beer_and_wine'" 
                                                                        else None if x == "'none'" or x == "u'none'" 
                                                                        else x)
business_attr['NoiseLevel'] = business_attr['NoiseLevel'].apply(lambda x: 'average' if x == "u'average'" or x == "'average'" 
                                                                        else 'quiet' if x == "'quiet'" or x == "u'quiet'" 
                                                                        else 'loud' if x == "'loud'" or x == "u'loud'"
                                                                        else 'very_loud' if x == "'very_loud'" or x == "u'very_loud'"
                                                                        else x)
business_attr['Smoking'] = business_attr['Smoking'].apply(lambda x: 'no' if x == "u'no'" or x == "'no'" 
                                                                        else 'only_outdoor' if x == "'outdoor'" or x == "u'outdoor'" 
                                                                        else 'yes' if x == "'yes'" or x == "u'yes'" 
                                                                        else x)
business_attr['BYOBCorkage'] = business_attr['BYOBCorkage'].apply(lambda x: 'yes_free' if x == "u'yes_free'" or x == "'yes_free'" 
                                                                        else 'yes_paid' if x == "'yes_corkage'" or x == "u'yes_corkage'" 
                                                                        else 'no' if x == "'no'" or x == "u'no'" 
                                                                        else x)
business_attr['AgesAllowed'] = business_attr['AgesAllowed'].apply(lambda x: 'all_ages' if x == "u'all_ages'" or x == "'all_ages'" 
                                                                        else 'above_21' if x == "'21plus'" or x == "u'21plus'" 
                                                                        else 'above_18' if x == "'18plus'" or x == "u'18plus'" 
                                                                        else x)

# 5) Now that business_attr has all of the attributes we want in their own columns, we can add these attributes to the original dataframe.
total_business = pd.concat([business, business_attr], axis = 1)

# 6) Drop the old attributes column that is no longer needed.
total_business = total_business.drop('attributes', axis = 1)

# 7) This next step is optional, but I decided to reduce the number of attributes to observe to just 'RestaurantsPriceRange2',
# 'GoodForKids', 'NoiseLevel', 'GoodForDancing', 'RestaurantsAttire', 'BikeParking', 'WheelchairAccessible'.
reduced_business = total_business[['business_id', 'name', 'RestaurantsPriceRange2', 'GoodForKids', 'NoiseLevel', 'GoodForDancing',
                                   'RestaurantsAttire', 'BikeParking', 'WheelchairAccessible', 'categories']]

# 8) Because we want to analyze reviews by whether their business belongs to a certain category, we must drop the rows where
# category data is missing. This only removes the 103 rows that have missing category data.
reduced_business = reduced_business.dropna(subset=['categories'])
reduced_business.head()

| | business_id | name | RestaurantsPriceRange2 | GoodForKids | NoiseLevel | GoodForDancing | RestaurantsAttire | BikeParking | WheelchairAccessible | categories |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | | | | | | | | Doctors, Traditional Chinese Medicine, Naturop... |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | | | | | | | | Shipping Centers, Local Services, Notaries, Ma... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 2.0 | | | | | 1.0 | 1.0 | Department Stores, Shopping, Fashion, Home & G... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 1.0 | | | | | 1.0 | | Restaurants, Food, Bubble Tea, Coffee & Tea, B... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | | 1.0 | | | | 1.0 | 1.0 | Brewpubs, Breweries, Food |
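
Steps 3 and 4 in the cell above can also be written as two small named helpers, which are easier to test than nested conditional lambdas. This is only a sketch of an alternative, not the code I ran: the names `normalize_flag` and `clean_label` are hypothetical, and unlike step 4 above, `clean_label` strips the `u'...'` or `'...'` wrapper from any quoted string rather than only from the listed values.

```python
import math


def normalize_flag(value):
    """Map boolean-like values to 1/0 and null-like values to None; pass others through."""
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, str):
        if value == 'True':
            return 1
        if value == 'False':
            return 0
        if value in ('None', 'NaN', 'nan'):
            return None
        return value
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def clean_label(value, renames=None):
    """Strip a u'...' or '...' wrapper from a raw string, then apply an optional rename."""
    if not isinstance(value, str):
        return value
    if value.startswith("u'") and value.endswith("'") and len(value) >= 3:
        text = value[2:-1]
    elif value.startswith("'") and value.endswith("'") and len(value) >= 2:
        text = value[1:-1]
    else:
        return value  # not a quoted value, leave it untouched
    if renames and text in renames:
        return renames[text]
    return text


# Renames taken from step 4 above; values without a rename keep their cleaned text.
attire_renames = {'dressy': 'semi-formal'}
alcohol_renames = {'none': None}

raw_attire = ["u'casual'", "'dressy'", None, "u'formal'"]
cleaned_attire = [clean_label(v, attire_renames) for v in raw_attire]
```

On a dataframe, the same helpers could be passed to `business_attr.applymap(normalize_flag)` and to `business_attr['RestaurantsAttire'].apply(clean_label, args=(attire_renames,))`.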


In [3]:
reduced_business.to_csv('reduced_business.csv', index = False)
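
Since the point of keeping the "categories" column is to select businesses by category, here is a minimal sketch of a membership check on that column. It assumes the comma-separated format visible in the output above, and `has_category` is a hypothetical helper name, not part of the original notebook.

```python
def has_category(categories, target):
    """Return True if `target` is one of the comma-separated entries in `categories`."""
    if not isinstance(categories, str):
        return False  # missing category data never matches
    entries = [entry.strip().lower() for entry in categories.split(',')]
    return target.strip().lower() in entries


# Category string taken from row 4 of the output above.
brewery_categories = "Brewpubs, Breweries, Food"
is_food = has_category(brewery_categories, "Food")
is_restaurant = has_category(brewery_categories, "Restaurants")
```

Comparing whole entries instead of using a substring search avoids false matches, for example a search for "Food" matching an entry such as "Fast Food". With the dataframe loaded, the check could be applied as `reduced_business[reduced_business['categories'].apply(lambda c: has_category(c, 'Restaurants'))]`.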