In [22]:
import pandas as pd
import json

**Converting JSON into a dataframe and csv**

In [23]:
# Filepath for the json version of the dataset
json_filepath = "yelp_academic_dataset_business.json"

# Reading the JSON file; lines=True is required because each line is a separate JSON object
df = pd.read_json(json_filepath, lines=True)

# Normalizing the nested JSON parts and re-adding them as separate columns
df_attributes = pd.json_normalize(df["attributes"])
df_hours = pd.json_normalize(df["hours"])
df = df.drop(["attributes", "hours"], axis=1)
df = pd.concat([df, df_attributes, df_hours], axis=1)

# Saving the dataframe to a csv file
df.to_csv("temp.csv")
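
To make the normalization step concrete, here is a minimal sketch on a made-up two-row frame. The `toy` names and values are hypothetical and only mimic the shape of the Yelp records; the nested columns are passed as plain lists of dictionaries. Keys missing from a record become NaN in the flattened result.

```python
import pandas as pd

# Hypothetical toy records standing in for the Yelp business file.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "attributes": [{"WiFi": "u'free'", "HasTV": "True"}, {"WiFi": "'no'"}],
    "hours": [{"Monday": "8:0-18:0"}, {"Monday": "9:0-17:0", "Tuesday": "9:0-17:0"}],
})

# Each dictionary key becomes its own column; absent keys are filled with NaN.
toy_attributes = pd.json_normalize(toy["attributes"].tolist())
toy_hours = pd.json_normalize(toy["hours"].tolist())

# Replace the nested columns with their flattened versions (rows align on the index).
toy_flat = pd.concat(
    [toy.drop(columns=["attributes", "hours"]), toy_attributes, toy_hours],
    axis=1,
)
print(toy_flat.columns.tolist())
```

Because every distinct key gets a column, a large dataset with many optional attributes ends up with many sparsely populated columns.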

**General Features of the Dataset**

Printing out the dimensions, attributes and their types, and first rows of the dataframe. The issues with the columns mentioned above can be seen here.

In [24]:
print(df.shape)
print(df.dtypes)
df.head()

(100000, 58)
business_id                    object
name                           object
address                        object
city                           object
state                          object
postal_code                    object
latitude                      float64
longitude                     float64
stars                         float64
review_count                    int64
is_open                         int64
categories                     object
ByAppointmentOnly              object
BusinessAcceptsCreditCards     object
BikeParking                    object
RestaurantsPriceRange2         object
CoatCheck                      object
RestaurantsTakeOut             object
RestaurantsDelivery            object
Caters                         object
WiFi                           object
BusinessParking                object
WheelchairAccessible           object
HappyHour                      object
OutdoorSeating                 object
HasTV                          object

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | ... | RestaurantsCounterService | AgesAllowed | DietaryRestrictions | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | ... | | | | | | | | | | |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | ... | | | | 0:0-0:0 | 8:0-18:30 | 8:0-18:30 | 8:0-18:30 | 8:0-18:30 | 8:0-14:0 | |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | ... | | | | 8:0-22:0 | 8:0-22:0 | 8:0-22:0 | 8:0-22:0 | 8:0-23:0 | 8:0-23:0 | 8:0-22:0 |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | ... | | | | 7:0-20:0 | 7:0-20:0 | 7:0-20:0 | 7:0-20:0 | 7:0-21:0 | 7:0-21:0 | 7:0-21:0 |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | ... | | | | | | 14:0-22:0 | 16:0-22:0 | 12:0-22:0 | 12:0-22:0 | 12:0-18:0 |


**Cleaning the Data**

The "business_id" feature is an opaque identifier that is likely only meaningful on Yelp's side and carries no information useful for our analysis, so we drop the column.

In [25]:
df.drop("business_id", axis=1, inplace=True)
df.head(1)

| | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | ... | RestaurantsCounterService | AgesAllowed | DietaryRestrictions | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | ... | | | | | | | | | | |


After normalizing the "attributes" and "hours" features, the dataframe gained a significant number of columns (one for each possible attribute and one for each day of the week).

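We'll handle the attribute columns first. We can see that their values are encoded inconsistently: in the "WiFi" column, for example, the same answer appears both with and without a `u` prefix (`u'free'` and `'free'`), and the string 'None' appears alongside NaN.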

In [43]:
print(df["WiFi"].unique())

[nan "u'no'" "u'free'" "'free'" "'no'" 'None' "u'paid'" "'paid'"]
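
One possible way to reconcile these encodings is sketched below. This is an assumption about how the cleanup could be done, not necessarily the approach taken in the rest of this notebook. The helper name `clean_attribute_value` is hypothetical, and `sample` simply reproduces the unique values printed above.

```python
import numpy as np
import pandas as pd

def clean_attribute_value(value):
    """Hypothetical helper: strip the u'...' or '...' wrapper and map 'None' to NaN."""
    if not isinstance(value, str):
        return value  # NaN (a float) and other non-strings pass through unchanged
    text = value.strip()
    if text.startswith("u'") and text.endswith("'"):
        text = text[2:-1]
    elif text.startswith("'") and text.endswith("'"):
        text = text[1:-1]
    return np.nan if text == "None" else text

# The unique "WiFi" values shown in the output above.
sample = pd.Series([np.nan, "u'no'", "u'free'", "'free'", "'no'", "None", "u'paid'", "'paid'"])
cleaned = sample.map(clean_attribute_value)
print(cleaned.dropna().unique())
```

Treating the string 'None' as missing is a design choice: it merges "not reported" with "explicitly none", which may or may not be appropriate depending on the analysis.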


In [39]:
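# Positions 11 to 49 hold the 39 normalized attribute columns
# (after the 11 remaining base columns and before the 7 day-of-week columns)
attribute_cols = df.columns[11:50]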
for a in attribute_cols:
    print(a)

ByAppointmentOnly
BusinessAcceptsCreditCards
BikeParking
RestaurantsPriceRange2
CoatCheck
RestaurantsTakeOut
RestaurantsDelivery
Caters
WiFi
BusinessParking
WheelchairAccessible
HappyHour
OutdoorSeating
HasTV
RestaurantsReservations
DogsAllowed
Alcohol
GoodForKids
RestaurantsAttire
Ambience
RestaurantsTableService
RestaurantsGoodForGroups
DriveThru
NoiseLevel
GoodForMeal
BusinessAcceptsBitcoin
Smoking
Music
GoodForDancing
AcceptsInsurance
BestNights
BYOB
Corkage
BYOBCorkage
HairSpecializesIn
Open24Hours
RestaurantsCounterService
AgesAllowed
DietaryRestrictions
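
The positional slice above depends on the exact column order. As a hedged alternative, the attribute columns can be selected by name from the normalized frame instead. The sketch below uses a hypothetical miniature of the flattened dataframe; in the notebook, `df_attributes.columns` would play the role of `toy_attributes.columns`.

```python
import pandas as pd

# Hypothetical miniature of the flattened frame: base columns, attribute columns, then days.
toy_base = pd.DataFrame({"name": ["A"], "stars": [4.0]})
toy_attributes = pd.DataFrame({"WiFi": ["u'free'"], "HasTV": ["True"]})
toy_hours = pd.DataFrame({"Monday": ["8:0-18:0"]})
toy = pd.concat([toy_base, toy_attributes, toy_hours], axis=1)

# Select by name, so the result does not depend on column positions.
attribute_names = set(toy_attributes.columns)
attribute_cols_by_name = [c for c in toy.columns if c in attribute_names]
print(attribute_cols_by_name)
```

Selecting by name keeps working if a base column is later added or dropped, whereas a hard-coded slice would silently shift.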
