# Exploratory Data Analysis
This notebook covers configuration, importing the data correctly, and performing basic
statistical analysis on the data received.

We will start with the basic imports:

In [1]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Utilities
import json

The next task is to import the data, which is stored in JSON format across multiple files.

*NOTE: See the README.md of this repository for where the data files are expected to be located
and where they can be downloaded.*

In [3]:
# File Locations
file_location = 'dataset/data/yelp_academic_dataset_'
file_type = '.json'
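
Because every later cell depends on these two strings, it can help to confirm that the files are present before loading anything. The sketch below is one way to do that. `dataset_path` and `missing_files` are hypothetical helpers, not part of the original notebook, and the list of file names is taken from the loading cells further down.

```python
from pathlib import Path

# Same values as the cell above, repeated so the sketch is self-contained.
file_location = 'dataset/data/yelp_academic_dataset_'
file_type = '.json'


def dataset_path(name):
    """Hypothetical helper: build the path for one file, e.g. name='business'."""
    return Path(f'{file_location}{name}{file_type}')


def missing_files(names):
    """Hypothetical helper: return the names whose files are not on disk."""
    return [name for name in names if not dataset_path(name).is_file()]


names = ['business', 'checkin', 'review', 'tip', 'user']
# An empty list means every expected file was found.
print(missing_files(names))
```

If the printed list is not empty, the README.md is the place to check the expected folder layout.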

## Utility Loading Function
Next we create a utility function called `load_json`. It loads the `.json` data files, which a plain `pd.read_json(path)` call
cannot parse because each line of a file holds a separate JSON object. The function follows this specification:

**Description:** It takes the path to the JSON file to read, plus an optional `line_limit` argument that caps how many lines are read. As written, the limit check runs after a line has been appended, so a limited call returns `line_limit + 1` rows.

**Returns:** It returns a single pandas data frame that contains the rows read from the provided file.

**Errors:** Any errors raised come from the operating system failing to locate or open the file, from `json.loads` on a malformed line,
or from the construction of the data frame.

In [4]:
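def load_json(path, line_limit=None):
    # Each line of the file is one JSON object, so parse the lines one at a time.
    temp = []
    with open(path, encoding='utf-8') as fl:
        for i, line in enumerate(fl):
            temp.append(json.loads(line))
            # The check runs after the append, so a limited read keeps line_limit + 1 rows.
            if line_limit is not None and i + 1 > line_limit:
                break
    return pd.DataFrame(temp)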
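
The shapes reported later in this notebook show 10001 rows for a `line_limit` of `1e+4`, because the limit check above runs after the append. If the intent is to cap the read at exactly `line_limit` rows, one possible variant is sketched below. `load_json_capped` is a hypothetical name, and the exact-cap behavior is an assumption about the intent rather than something the notebook states.

```python
import json
from itertools import islice

import pandas as pd


def load_json_capped(path, line_limit=None):
    """Hypothetical variant of load_json that returns at most line_limit rows."""
    # int() lets callers keep passing floats such as 1e+4.
    stop = None if line_limit is None else int(line_limit)
    with open(path, encoding='utf-8') as fl:
        # islice stops after `stop` lines; stop=None reads the whole file.
        rows = [json.loads(line) for line in islice(fl, stop)]
    return pd.DataFrame(rows)
```

Using this variant would change the row counts shown in the outputs below from 10001 to 10000. Recent pandas versions can also read line-delimited files directly with `pd.read_json(path, lines=True)`, which may be worth trying as an alternative.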

## Reading the Input
Now we use the file path pieces defined above and the newly created function to read each data file into a named data frame.

For each data frame, we display the first 5 rows as a sample of the file's contents.

In [5]:
business_data = load_json(f'{file_location}business{file_type}', line_limit=1e+4)
business_data.head()

|   | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... |  |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |


In [6]:
checkin_data = load_json(f'{file_location}checkin{file_type}', line_limit=1e+4)
checkin_data.head()

|   | business_id | date |
|---|---|---|
| 0 | ---kPU91CF4Lq2-WlRu9Lw | 2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020... |
| 1 | --0iUa4sNDFiZFrAdIWhZQ | 2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011... |
| 2 | --30_8IhuyMHbSOcNWd6DQ | 2013-06-14 23:29:17, 2014-08-13 23:20:22 |
| 3 | --7PUidqRWpRSpXebiyxTg | 2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012... |
| 4 | --7jw19RH9JKXgFohspgQw | 2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014... |


In [7]:
review_data = load_json(f'{file_location}review{file_type}', line_limit=1e+4)
review_data.head()

|   | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3.0 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5.0 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3.0 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5.0 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4.0 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |


In [8]:
tip_data = load_json(f'{file_location}tip{file_type}', line_limit=1e+4)
tip_data.head()

|   | user_id | business_id | text | date | compliment_count |
|---|---|---|---|---|---|
| 0 | AGNUgVwnZUey3gcPCJ76iw | 3uLgwr0qeCNMjKenHJwPGQ | Avengers time with the ladies. | 2012-05-18 02:17:21 | 0 |
| 1 | NBN4MgHP9D3cw--SnauTkA | QoezRbYQncpRqyrLH6Iqjg | They have lots of good deserts and tasty cuban... | 2013-02-05 18:35:10 | 0 |
| 2 | -copOvldyKh1qr-vzkDEvw | MYoRNLb5chwjQe3c_k37Gg | It's open even when you think it isn't | 2013-08-18 00:56:08 | 0 |
| 3 | FjMQVZjSqY8syIO-53KFKw | hV-bABTK-glh5wj31ps_Jw | Very decent fried chicken | 2017-06-27 23:05:38 | 0 |
| 4 | ld0AperBXk1h6UbqmM80zw | _uN0OudeJ3Zl_tf6nxg5ww | Appetizers.. platter special for lunch | 2012-10-06 19:43:09 | 0 |


In [9]:
user_data = load_json(f'{file_location}user{file_type}', line_limit=1e+4)
user_data.head()

|   | user_id | name | review_count | yelping_since | useful | funny | cool | elite | friends | fans | ... | compliment_more | compliment_profile | compliment_cute | compliment_list | compliment_note | compliment_plain | compliment_cool | compliment_funny | compliment_writer | compliment_photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | qVc8ODYU5SZjKXVBgXdI7w | Walker | 585 | 2007-01-25 16:47:26 | 7217 | 1259 | 5994 | 2007 | NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA... | 267 | ... | 65 | 55 | 56 | 18 | 232 | 844 | 467 | 467 | 239 | 180 |
| 1 | j14WgRoU_-2ZE1aw1dXrJg | Daniel | 4333 | 2009-01-25 04:35:42 | 43091 | 13066 | 27281 | 2009,2010,2011,2012,2013,2014,2015,2016,2017,2... | ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A... | 3138 | ... | 264 | 184 | 157 | 251 | 1847 | 7054 | 3131 | 3131 | 1521 | 1946 |
| 2 | 2WnXYQFK0hXEoTxPtV2zvg | Steph | 665 | 2008-07-25 10:41:00 | 2086 | 1010 | 1003 | 2009,2010,2011,2012,2013 | LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA... | 52 | ... | 13 | 10 | 17 | 3 | 66 | 96 | 119 | 119 | 35 | 18 |
| 3 | SZDeASXq7o05mMNLshsdIA | Gwen | 224 | 2005-11-29 04:38:33 | 512 | 330 | 299 | 2009,2010,2011 | enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg... | 28 | ... | 4 | 1 | 6 | 2 | 12 | 16 | 26 | 26 | 10 | 9 |
| 4 | hA5lMy-EnncsH4JoR-hFGQ | Karen | 79 | 2007-01-05 19:40:59 | 29 | 15 | 7 |  | PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA... | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


## Describing the Input
This section summarizes each data frame with its `shape` and `describe()` output, to better understand
the structure of the given data.

In [10]:
print(f'The shape of the business data: {business_data.shape}')
business_data.describe()

The shape of the business data: (10001, 14)


|   | latitude | longitude | stars | review_count | is_open |
|---|---|---|---|---|---|
| count | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 |
| mean | 36.687318 | -89.414182 | 3.609589 | 46.961304 | 0.79672 |
| std | 5.907242 | 14.966271 | 0.975007 | 122.296957 | 0.402459 |
| min | 27.5843 | -120.095137 | 1.0 | 5.0 | 0.0 |
| 25% | 32.192524 | -90.358515 | 3.0 | 8.0 | 1.0 |
| 50% | 38.766872 | -86.126907 | 4.0 | 15.0 | 1.0 |
| 75% | 39.95337 | -75.411889 | 4.5 | 39.0 | 1.0 |
| max | 53.647812 | -74.658572 | 5.0 | 4554.0 | 1.0 |


In [11]:
print(f'The shape of the checkin data: {checkin_data.shape}')
checkin_data.describe()

The shape of the checkin data: (10001, 2)


|   | business_id | date |
|---|---|---|
| count | 10001 | 10001 |
| unique | 10001 | 10001 |
| top | ---kPU91CF4Lq2-WlRu9Lw | 2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020... |
| freq | 1 | 1 |


In [12]:
print(f'The shape of the review data: {review_data.shape}')
review_data.describe()

The shape of the review data: (10001, 9)


|   | stars | useful | funny | cool |
|---|---|---|---|---|
| count | 10001.0 | 10001.0 | 10001.0 | 10001.0 |
| mean | 3.854315 | 0.889111 | 0.246475 | 0.335466 |
| std | 1.346653 | 2.092224 | 0.88518 | 1.050976 |
| min | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4.0 | 0.0 | 0.0 | 0.0 |
| 75% | 5.0 | 1.0 | 0.0 | 0.0 |
| max | 5.0 | 91.0 | 26.0 | 44.0 |


In [13]:
print(f'The shape of the tip data: {tip_data.shape}')
tip_data.describe()

The shape of the tip data: (10001, 5)


|   | compliment_count |
|---|---|
| count | 10001.0 |
| mean | 0.013399 |
| std | 0.121739 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 3.0 |


In [14]:
print(f'The shape of the user data: {user_data.shape}')
user_data.describe()

The shape of the user data: (10001, 22)


|   | review_count | useful | funny | cool | fans | average_stars | compliment_hot | compliment_more | compliment_profile | compliment_cute | compliment_list | compliment_note | compliment_plain | compliment_cool | compliment_funny | compliment_writer | compliment_photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 |
| mean | 278.590641 | 911.575442 | 465.394061 | 606.29967 | 32.556544 | 3.832163 | 60.310969 | 8.323368 | 6.861914 | 5.40446 | 3.414959 | 38.886911 | 95.09769 | 85.79732 | 85.79732 | 33.113989 | 23.823818 |
| std | 502.613179 | 3695.182871 | 2292.865663 | 3071.13992 | 162.938223 | 0.447322 | 337.469773 | 64.611146 | 97.449155 | 38.825055 | 39.375894 | 186.291158 | 606.296052 | 451.360398 | 451.360398 | 184.164556 | 245.55829 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 34.0 | 45.0 | 10.0 | 14.0 | 2.0 | 3.59 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 50% | 115.0 | 176.0 | 55.0 | 68.0 | 7.0 | 3.84 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | 5.0 | 6.0 | 6.0 | 6.0 | 3.0 | 1.0 |
| 75% | 323.0 | 628.0 | 239.0 | 299.0 | 25.0 | 4.08 | 20.0 | 5.0 | 2.0 | 2.0 | 1.0 | 19.0 | 28.0 | 33.0 | 33.0 | 16.0 | 4.0 |
| max | 16567.0 | 173089.0 | 98459.0 | 144849.0 | 12497.0 | 5.0 | 12391.0 | 4347.0 | 7039.0 | 1744.0 | 2607.0 | 8616.0 | 28974.0 | 13280.0 | 13280.0 | 7309.0 | 14045.0 |
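
By default, `describe()` summarizes only the numeric columns of a frame with mixed types, which is why text columns such as `name` or `text` do not appear in the tables above. A per-column overview of types and missing values can fill that gap. The sketch below is one possible approach. `column_overview` is a hypothetical helper, and the toy frame only stands in for `business_data`, with values made up for illustration.

```python
import pandas as pd


def column_overview(frame):
    """Hypothetical helper: one row per column with its dtype and missing-value counts."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'non_null': frame.notna().sum(),
        'missing': frame.isna().sum(),
    })


# Toy frame standing in for business_data; the values are invented for illustration.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'stars': [4.0, 3.5, 5.0],
    'hours': [{'Monday': '8:0-22:0'}, None, None],
})
overview = column_overview(toy)
print(overview)
```

The helper avoids `nunique()` on purpose, since columns that hold dictionaries (such as `attributes` or `hours`) are not hashable.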


## Data Cleanup
The next major task is to parse specific columns of the data set further, so that their contents are held in a more
convenient, Python-native format.

For example, the business data file has an `attributes` column containing attributes/tags that business owners supply
voluntarily. These should be separated out and merged into a new data frame. This is not the only column whose
format is awkward to work with. All such columns will be handled here, and new, separate frames will be generated containing
this data in a more manageable format.
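
As a rough illustration of this kind of cleanup, the sketch below expands the `attributes` dictionaries into one column per key and splits the comma-separated `date` strings of the check-in data into one row per timestamp. Both `expand_attributes` and `explode_checkins` are hypothetical helpers, not the notebook's actual cleanup code, and the toy frames use made-up IDs that only mirror the column layout shown in the samples above.

```python
import pandas as pd


def expand_attributes(business):
    """Hypothetical helper: one row per business, one column per attribute key."""
    # Rows without an attribute dict become empty dicts so every business is kept.
    records = [a if isinstance(a, dict) else {} for a in business['attributes']]
    index = pd.Index(business['business_id'], name='business_id')
    return pd.DataFrame(records, index=index)


def explode_checkins(checkin):
    """Hypothetical helper: one row per (business_id, check-in timestamp)."""
    # Assumes timestamps are separated by ', ' as in the sample output above.
    stamps = checkin['date'].str.split(', ')
    long = checkin.assign(date=stamps).explode('date', ignore_index=True)
    long['date'] = pd.to_datetime(long['date'], format='%Y-%m-%d %H:%M:%S')
    return long


# Toy inputs with invented IDs; only the column names match the real frames.
toy_business = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'attributes': [
        {'BikeParking': 'True'},
        None,
        {'BikeParking': 'False', 'OutdoorSeating': 'True'},
    ],
})
toy_checkin = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'date': ['2020-03-13 21:10:56, 2020-06-02 22:18:06', '2013-06-14 23:29:17'],
})

attrs = expand_attributes(toy_business)
visits = explode_checkins(toy_checkin)
print(attrs)
print(visits)
```

The attribute values in the sample output are strings such as `'True'` rather than booleans, so a later step would likely need to convert them before any numeric analysis.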