# 1. Converting Yelp Dataset

In [1]:
import pandas as pd

### Business table

First, I load the business JSON file into a pandas DataFrame.

In [2]:
# Load the business JSON file (raw string so the backslash is not read as an escape).
business_json_path = r'JSON\yelp_academic_dataset_business.json'
business = pd.read_json(business_json_path, lines=True)
business.categories[0]

'Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping'
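
The Yelp files store one JSON object per line, which is why `lines=True` is passed to `pd.read_json`. As a rough illustration, the standard library can parse the same layout line by line. The two sample records below are made up for the example and are not taken from the dataset.

```python
import json

# Two hypothetical records in JSON Lines form: one object per line.
json_lines = (
    '{"business_id": "b1", "categories": "Coffee & Tea, Food"}\n'
    '{"business_id": "b2", "categories": null}\n'
)

# Parse each non-empty line as its own JSON document.
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

for record in records:
    print(record["business_id"], record["categories"])
```

A JSON `null` becomes `None` here, which matches the missing categories that are filtered out with `notna()` in the next step.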

I am interested in giving insights to CoffeeKing, a new startup coffee company that wishes to provide a unique and novel experience to its customers. Therefore, the businesses I will need to analyze are coffee shops. I select the businesses that belong to a coffee category.

In [3]:
# Find businesses that belong to a Coffee category to reduce the business table. 
business = business[business['categories'].notna()]
df_b = business[business['categories'].str.contains('Coffee')]
df_b

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
32,DCsS3SgVFO56F6wRO_ewgA,Missy Donuts & Coffee,1255 W Main St,Mesa,AZ,85201,33.414409,-111.858378,2.5,7,0,"{'BikeParking': 'True', 'BusinessParking': '{'...","Donuts, Juice Bars & Smoothies, Food, Coffee &...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."
45,_xOeoXfPUQTNlUAhXl32ug,Starbucks,150 Boulevard Crémazie E,Montréal,QC,H2P 1E2,45.542993,-73.640218,3.5,4,1,"{'RestaurantsTakeOut': 'True', 'RestaurantsPri...","Coffee & Tea, Food","{'Monday': '5:30-23:0', 'Tuesday': '5:30-23:0'..."
54,lK-wuiq8b1TuU7bfbQZgsg,Hingetown,,Cleveland,OH,44113,41.489343,-81.711029,3.0,4,1,"{'Alcohol': 'u'none'', 'GoodForKids': 'True', ...","Shopping Centers, Food, Coffee & Tea, Cafes, M...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."
110,8k62wYhDVq1-652YbJi5eg,Tim Hortons,90 Adelaide Street W,Toronto,ON,M5H 3V9,43.649859,-79.382060,3.0,8,1,"{'OutdoorSeating': 'False', 'RestaurantsDelive...","Bagels, Donuts, Food, Cafes, Coffee & Tea, Res...",
115,8Hvp1tYKiQbBgGIwkCRK5g,Tony's Family Restaurant,1515 W Pleasant Valley Rd,Parma,OH,44134,41.361185,-81.688755,4.0,60,1,"{'OutdoorSeating': 'False', 'RestaurantsReserv...","Coffee & Tea, Restaurants, Food, Breakfast & B...","{'Monday': '6:0-21:0', 'Tuesday': '6:0-21:0', ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209247,_rZyr1lrIoBaz65XiDPP6A,Fixe,5985 Rue Saint-Hubert,Montréal,QC,H2S 2L7,45.534416,-73.598967,4.0,24,0,"{'RestaurantsTakeOut': 'True', 'RestaurantsAtt...","Cafes, Food, Internet Cafes, Coffee & Tea, Bre...","{'Wednesday': '9:0-16:0', 'Thursday': '9:0-16:..."
209301,NeM7anGnTOTn7sEJavS3sw,Starbucks,"1597 Washington Pike, Space A-1",Bridgeville,PA,15017,40.381477,-80.095680,4.5,26,1,"{'BikeParking': 'True', 'Caters': 'False', 'Ou...","Food, Coffee & Tea","{'Monday': '5:0-21:30', 'Tuesday': '5:0-21:30'..."
209318,00liP5s4IKsq97EH4Cc0Tw,Starbucks,9051 E. Indian Bend Road,Scottsdale,AZ,85250,33.538119,-111.886227,2.0,69,1,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Coffee & Tea, Food","{'Monday': '4:0-20:0', 'Tuesday': '4:0-20:0', ..."
209334,7V82ANZ7_ARkA7o0pAMAlA,Galaxy Cafe,"835 Seven Hills Dr, Ste 190",Henderson,NV,89052,35.996722,-115.124645,3.5,60,0,"{'GoodForKids': 'True', 'RestaurantsReservatio...","Restaurants, Breakfast & Brunch, Food, Coffee ...","{'Monday': '6:0-19:0', 'Tuesday': '6:0-19:0', ..."
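
Note that `str.contains('Coffee')` is a substring test, so it keeps every business whose category string mentions "Coffee" anywhere. If a stricter rule were wanted, one option is to split the string and test for an exact category name. The sketch below runs on made-up rows; the helper `has_category` and the sample data are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample; the real table comes from the Yelp business file.
business = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "categories": ["Coffee & Tea, Food", "Coffee Roasteries, Shopping",
                   "Bars, Nightlife", None],
})

def has_category(categories, wanted):
    """Return True if `wanted` is one of the comma-separated categories."""
    if not isinstance(categories, str):
        return False
    return wanted in [c.strip() for c in categories.split(",")]

# Substring match, as in the cell above (na=False treats missing values as no match).
substring_match = business[business["categories"].str.contains("Coffee", na=False)]

# Exact match on a single category name.
exact_match = business[
    business["categories"].apply(lambda c: has_category(c, "Coffee & Tea"))
]

print(list(substring_match["business_id"]))
print(list(exact_match["business_id"]))
```

Which rule fits better depends on how broadly "coffee shop" should be read for the analysis.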


### Tips table 

Next, I reduce the tip table so that it keeps only the tips written for coffee shops.

In [4]:
# Load tip JSON file.
tip_json_path = r'JSON\yelp_academic_dataset_tip.json'
tips = pd.read_json(tip_json_path, lines=True)

In [5]:
# Keep only the business_id and name columns of the business table.
df_b_tmp = df_b[['business_id', 'name']]
# Left-join the tips onto the reduced business table using business_id.
df_t = df_b_tmp.set_index('business_id').join(tips.set_index('business_id'))
df_t

business_id,name,user_id,text,date,compliment_count
--Rsj71PBe31h5YljVseKA,Circle K,GwWjnIPiaw9uBOVGC8s-RQ,3.39,2012-07-06 23:48:39,0.0
--Rsj71PBe31h5YljVseKA,Circle K,v_rQGT2VgIbveQfDK7KAXw,Rude,2015-08-08 00:45:42,0.0
--Rsj71PBe31h5YljVseKA,Circle K,R5WcogaoAwjdHxrB2v5NsQ,"Avoid. \n\nCan she do it, ladies & gentlemen y...",2016-06-02 20:46:49,0.0
--Rsj71PBe31h5YljVseKA,Circle K,B_4YTmV1JsFqC-pZM2Im-w,Stay away. The staff acts like its a burden t...,2014-03-05 01:17:25,0.0
--Rsj71PBe31h5YljVseKA,Circle K,HL6oi7o6VSfd6zOMYPcgLw,Staff looks like a bunch of monkeys trying to ...,2014-08-04 16:30:11,0.0
...,...,...,...,...,...
zzZfgEpwrpi4Ywdaj3OIuQ,Adesso Cafe,QXXqp9SplGzmtRYFyYZx7w,Adesso is a local favorite and Frank runs it p...,2018-07-12 22:00:56,0.0
zzZfgEpwrpi4Ywdaj3OIuQ,Adesso Cafe,fmzIm7RxEdii5Jz44PtO7g,Coffee comes from Commonplace coffee. Espresso...,2018-05-30 16:49:45,0.0
zzZfgEpwrpi4Ywdaj3OIuQ,Adesso Cafe,fmzIm7RxEdii5Jz44PtO7g,Now serving Rooted vegan ice cream for purchase,2019-12-11 10:28:59,0.0
zzZfgEpwrpi4Ywdaj3OIuQ,Adesso Cafe,6tbXpUIU6upoeqWNDo9k_A,Cuban Coffee,2018-12-09 14:23:46,0.0
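
`DataFrame.join` performs a left join by default, so a coffee shop without any tips still gets one row, with missing values in the tip columns. That is also why an integer column such as `compliment_count` can show up as a float. A small sketch with made-up tables (all names and values are hypothetical) compares the default with `how='inner'`:

```python
import pandas as pd

# Hypothetical reduced business table and tip table.
shops = pd.DataFrame({"business_id": ["b1", "b2"],
                      "name": ["Cafe One", "Cafe Two"]})
tips = pd.DataFrame({"business_id": ["b1", "b1", "b9"],
                     "text": ["Great latte", "Slow service", "Not a cafe"],
                     "compliment_count": [0, 1, 0]})

# Default (left) join: every shop is kept, even b2, which has no tips.
left = shops.set_index("business_id").join(tips.set_index("business_id"))

# Inner join: only shops that have at least one tip remain.
inner = shops.set_index("business_id").join(tips.set_index("business_id"),
                                            how="inner")

print(len(left), len(inner))
print(left["compliment_count"].dtype)
```

If rows without tips are not wanted, `how='inner'` or a later `dropna(subset=['text'])` would remove them.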


### Review table

Both the review and user JSON files are huge, which causes memory issues when trying to load them in one go.
There is, however, a way to load these datasets into a pandas DataFrame.

First, I can reduce memory usage by specifying the data type of each column. The specifications of the table are in the documentation from Yelp.

In [6]:
review_json_path = r'JSON\yelp_academic_dataset_review.json'
size = 1000000
review = pd.read_json(review_json_path, lines=True,
                      dtype={'review_id':str,'user_id':str,
                             'business_id':str,'stars':int,
                             'date':str,'text':str,'useful':int,
                             'funny':int,'cool':int},
                      chunksize=size)
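
With `chunksize` set, `pd.read_json` does not return a DataFrame but a reader that yields one DataFrame per chunk when iterated. A minimal sketch of that behaviour, using an in-memory buffer and three made-up records instead of the real review file:

```python
import io
import pandas as pd

# Three hypothetical review records in JSON Lines form.
json_lines = "\n".join([
    '{"review_id": "r1", "business_id": "b1", "stars": 5}',
    '{"review_id": "r2", "business_id": "b2", "stars": 3}',
    '{"review_id": "r3", "business_id": "b1", "stars": 1}',
])

# chunksize=2 yields DataFrames of at most two rows each.
reader = pd.read_json(io.StringIO(json_lines), lines=True,
                      dtype={"review_id": str, "business_id": str, "stars": int},
                      chunksize=2)

chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)
```

Only one chunk is held in memory at a time, and the last chunk is usually smaller than `chunksize`.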

Then I merge the review chunks with the business table on business_id, so that only the reviews of coffee shops remain. The loop uses the code provided at:

    https://gist.github.com/gyhou/66628a4bfff04b4b67b625173d6ec194#file-review_chunks-py

In [7]:
chunk_list = []
for chunk_review in review:
    # Renaming column name to avoid conflict with business overall star rating
    chunk_review = chunk_review.rename(columns={'stars': 'review_stars'})
    # Inner merge with edited business file so only reviews related to the business remain
    chunk_merged = pd.merge(df_b, chunk_review, on='business_id', how='inner')
    # Show feedback on progress
    print(f"{chunk_merged.shape[0]} out of {size:,} related reviews")
    chunk_list.append(chunk_merged)
# After trimming down the review file, concatenate all relevant data back to one dataframe
df = pd.concat(chunk_list, ignore_index=True, join='outer', axis=0)

54663 out of 1,000,000 related reviews
54295 out of 1,000,000 related reviews
50885 out of 1,000,000 related reviews
53518 out of 1,000,000 related reviews
56664 out of 1,000,000 related reviews
51699 out of 1,000,000 related reviews
56663 out of 1,000,000 related reviews
57903 out of 1,000,000 related reviews
1719 out of 1,000,000 related reviews


The last step is to remove all the columns that belong to the business table. 

In [8]:
drop_columns = ['name','address', 'city', 'state', 'postal_code',
                'latitude', 'longitude', 'stars',
                'review_count', 'is_open', 'attributes',
                'hours','categories']
df_r = df.drop(drop_columns, axis=1)
df_r

Unnamed: 0,business_id,review_id,user_id,review_stars,useful,funny,cool,text,date
0,DCsS3SgVFO56F6wRO_ewgA,MWWbCEb6gxwzOGQD60K8eg,cSQnJ7JTY78ki5ai57kZ9A,4,0,0,0,9.99 for a dozen raised is a Lil much the cake...,2015-08-07 15:02:50
1,DCsS3SgVFO56F6wRO_ewgA,Q-p7Q2VGcTf4OnArCNODJw,5_CoaRC22jwmuUzZwlRG_g,1,1,0,0,Dirty dinning list will not go here again sad ...,2016-02-13 17:50:23
2,DCsS3SgVFO56F6wRO_ewgA,udUMB6LLEUP1veOguc1TvQ,5_CoaRC22jwmuUzZwlRG_g,5,0,0,0,Just went here for a late night donut fix lol....,2015-09-29 06:18:47
3,DCsS3SgVFO56F6wRO_ewgA,mokg21BrvNdAiF3ByjhcJw,c5ebpS7ex6npffT9Nlvqvw,1,1,0,0,Would of given 0 stars if possible first impre...,2015-11-10 18:55:53
4,DCsS3SgVFO56F6wRO_ewgA,g3tNpTu7fADO_-flxfreRg,EXys-sSmm5auoqs6Jkyh7g,5,0,0,0,This is my first Yelp review. We just tried Mi...,2016-01-25 02:26:47
...,...,...,...,...,...,...,...,...,...
438004,IE1lzZvdD9UnGeB1kXjuOQ,0tATaGK4gh_HMt5rwQqDAg,M42BPIClbFMUxF5fuwxUkg,1,1,0,0,Coffee passable. Patio impossible. Occupied by...,2019-06-22 20:17:58
438005,r6bvqwhWy73SgyK_w8Y5Lg,OcoT1AeVDn8f95uXAEfxsA,MxMTnZ86FlqBeUTCWTHKIg,2,3,1,0,Dont use the mobile to order. If You lineup an...,2018-12-15 15:06:24
438006,W39f_7mEdphd-wzGwgPxow,NR4t67FLeVqCHh3vEtCQOg,a_tWnn_sCYEPmWdO68BUpQ,1,0,0,0,I had a horrible time at this mcdonalds. my fr...,2019-11-04 19:56:08
438007,5ebXuHpZRNdalRULfnfnpw,YMTGS-9Yb_9oTXdJj7XDTA,p3im0kHlpI1d0NIOdjB-3A,1,0,0,0,"Ordered a latte to go in my reusable mug, but ...",2019-06-24 19:50:22
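
As an alternative to merging and then dropping the business columns, each chunk could be filtered directly on the set of coffee-shop ids. This is a different technique from the merge used above, shown only as a sketch on made-up data; `coffee_shops` and `review_chunks` are hypothetical stand-ins for `df_b` and the chunked reader.

```python
import pandas as pd

# Hypothetical stand-ins for the coffee-shop table and the review chunks.
coffee_shops = pd.DataFrame({"business_id": ["b1", "b2"],
                             "name": ["Cafe One", "Cafe Two"],
                             "stars": [4.0, 3.5]})
review_chunks = [
    pd.DataFrame({"review_id": ["r1", "r2"], "business_id": ["b1", "b9"],
                  "stars": [5, 2]}),
    pd.DataFrame({"review_id": ["r3"], "business_id": ["b2"], "stars": [4]}),
]

coffee_ids = set(coffee_shops["business_id"])

kept = []
for chunk in review_chunks:
    # Same rename as in the notebook, to keep the column names comparable.
    chunk = chunk.rename(columns={"stars": "review_stars"})
    # Keep only the reviews whose business is a coffee shop.
    kept.append(chunk[chunk["business_id"].isin(coffee_ids)])

reviews = pd.concat(kept, ignore_index=True)
print(reviews)
```

The result holds the same reviews as the merge would, but the rows follow the order of the review file and the columns keep the review file's order.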


### User table

To build the user table, I follow the same procedure as for the reviews. The only difference is that I now join on user_id, against the review table created in the previous step.

The column types follow the user table specification:

In [9]:
users_json_path = r'JSON\yelp_academic_dataset_user.json'
size = 1000000
users = pd.read_json(users_json_path, lines=True,
                     dtype={'user_id':str,'name':str,'review_count':int,
                            'yelping_since':str,'friends':str,
                            'useful':int,'funny':int,'cool':int,
                            'fans':int,'elite':str,'average_stars':float,
                            'compliment_hot':int,'compliment_more':int,
                            'compliment_profile':int,'compliment_cute':int,
                            'compliment_list':int,'compliment_note':int,
                            'compliment_plain':int,'compliment_cool':int,
                            'compliment_funny':int,'compliment_writer':int,
                            'compliment_photos':int},
                     chunksize=size)
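
How much memory an explicit dtype saves depends on the type chosen. As a rough sketch on made-up values, a narrower integer type takes a fraction of the space of a 64-bit one, provided every value fits in its range. The choice of `int8` here is an assumption that suits star ratings, not necessarily the compliment counters.

```python
import pandas as pd

# Hypothetical column of star ratings (values 1 to 5).
stars = pd.Series([5, 3, 1, 4] * 1000)

wide = stars.astype("int64")   # 8 bytes per value
narrow = stars.astype("int8")  # 1 byte per value, range -128..127

wide_bytes = wide.memory_usage(index=False)
narrow_bytes = narrow.memory_usage(index=False)
print(wide_bytes, narrow_bytes)
```

Counters that can grow large would need a wider type such as `int32`, so the range of each column is worth checking first.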

Join users and reviews by chunk using user_id:

In [10]:
chunk_list = []
for chunk_user in users:
    # Rename the vote columns to avoid clashing with the review's useful/funny/cool columns
    chunk_user = chunk_user.rename(columns={'useful': 'user_useful',
                                            'cool': 'user_cool',
                                            'funny': 'user_funny'})
    # Inner merge with the coffee-shop reviews so only users who wrote one of them remain
    chunk_merged = pd.merge(df_r, chunk_user, on='user_id', how='inner')
    # Show feedback on progress
    print(f"{chunk_merged.shape[0]} out of {size:,} related reviews")
    chunk_list.append(chunk_merged)
# After trimming down the user file, concatenate all relevant data back to one dataframe
df_tmp = pd.concat(chunk_list, ignore_index=True, join='outer', axis=0)

375848 out of 1,000,000 related reviews
62161 out of 1,000,000 related reviews


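Drop the columns that belong to the review table. The merge produces one row per review, so a user with several coffee-shop reviews appears several times (rows 1 and 2 in the output are the same user):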

In [11]:
drop_columns = ['review_id','business_id', 'review_stars', 'useful', 'funny',
                'cool', 'text', 'date']

df_u = df_tmp.drop(drop_columns, axis=1)
df_u

Unnamed: 0,user_id,name,review_count,yelping_since,user_useful,user_funny,user_cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,cSQnJ7JTY78ki5ai57kZ9A,J,21,2014-01-28 12:26:38,11,7,2,,"61CaXPw6l1noO7_iorhj7g, cRUqQ0jqKpBbyJbiEI6dWQ",1,...,0,0,0,0,0,0,0,0,0,0
1,5_CoaRC22jwmuUzZwlRG_g,jennifer,18,2009-05-10 02:36:33,13,1,1,,"7TYEPCNmD0ykAWZu785EOw, Fk2lnGF41--KCPGYFZS07A...",0,...,0,0,0,0,0,0,0,0,0,0
2,5_CoaRC22jwmuUzZwlRG_g,jennifer,18,2009-05-10 02:36:33,13,1,1,,"7TYEPCNmD0ykAWZu785EOw, Fk2lnGF41--KCPGYFZS07A...",0,...,0,0,0,0,0,0,0,0,0,0
3,c5ebpS7ex6npffT9Nlvqvw,Jeremy,24,2014-10-06 18:44:57,11,0,3,,"3gmYVRwIprY-LuAyLeSSUg, Eimr4sd_Kec_zRGj64uU4g...",1,...,0,0,0,0,1,0,0,0,0,0
4,c5ebpS7ex6npffT9Nlvqvw,Jeremy,24,2014-10-06 18:44:57,11,0,3,,"3gmYVRwIprY-LuAyLeSSUg, Eimr4sd_Kec_zRGj64uU4g...",1,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438004,adrJBGpSrxSh7gBFNaRKNA,Jack,1,2018-12-01 22:20:27,0,0,0,,,0,...,0,0,0,0,0,0,0,0,0,0
438005,M42BPIClbFMUxF5fuwxUkg,Tasso,1,2012-03-14 21:28:20,1,0,0,,1ZYdej2yiKhBzyzV9wf-sw,0,...,0,0,0,0,0,1,0,0,0,0
438006,MxMTnZ86FlqBeUTCWTHKIg,Cherry,1,2017-08-05 02:19:53,3,1,0,,"ViVgZP1VGmBWAeotEc1RdA, ipWGIUoVWNqkontzz1tZtw...",0,...,0,0,0,0,0,0,0,0,0,1
438007,p3im0kHlpI1d0NIOdjB-3A,Alison,1,2019-06-24 19:50:20,0,0,0,,,0,...,0,0,0,0,0,0,0,0,0,0
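
Because the merge returns one row per review, users with several coffee-shop reviews are repeated in the result. If one row per user is wanted, the duplicates can be removed with `drop_duplicates`. A minimal sketch on made-up tables (the data and variable names are hypothetical):

```python
import pandas as pd

# Hypothetical reviews: user u1 wrote two of them.
reviews = pd.DataFrame({"review_id": ["r1", "r2", "r3"],
                        "user_id": ["u1", "u1", "u2"]})
users = pd.DataFrame({"user_id": ["u1", "u2", "u3"],
                      "name": ["Ann", "Bob", "Cy"]})

# Inner merge: one row per review, so u1 appears twice and u3 not at all.
merged = pd.merge(reviews, users, on="user_id", how="inner")

# Drop the review column, then keep a single row per user.
user_table = (merged.drop(columns=["review_id"])
                    .drop_duplicates(subset="user_id")
                    .reset_index(drop=True))
print(user_table)
```

Applied to the notebook, the same call on `df_u` would leave each reviewer exactly once.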


Finally, I save the resulting tables as CSV files.

In [12]:
csv_name = "yelp_business.csv"
df_b.to_csv(csv_name, index=False)
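csv_name = "yelp_tip.csv"
# df_t is indexed by business_id after the join, so restore it as a column before saving.
df_t.reset_index().to_csv(csv_name, index=False)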
csv_name = "yelp_review.csv"
df_r.to_csv(csv_name, index=False)
csv_name = "yelp_user.csv"
df_u.to_csv(csv_name, index=False)
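
One detail to watch when saving: `index=False` leaves the index out of the file. The tip table is indexed by business_id after the join, so that key only reaches the CSV if the index is turned back into a column first. A small sketch with a made-up one-row table shows the difference; `to_csv` returns the text when no path is given.

```python
import pandas as pd

# Hypothetical table indexed by business_id, like the joined tip table.
df = pd.DataFrame({"name": ["Cafe One"], "text": ["Great latte"]},
                  index=pd.Index(["b1"], name="business_id"))

# Writing with index=False silently drops business_id.
without_index = df.to_csv(index=False)

# Resetting the index first keeps it as a regular column.
with_index = df.reset_index().to_csv(index=False)

print(without_index.splitlines()[0])
print(with_index.splitlines()[0])
```

The other three tables use a default integer index, so `index=False` is the right choice for them.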