# Extra Examples - Merging

Here's a dataset dumped directly from a database, so we need to stitch it together ourselves.
https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings

The dataset comes with a README file that outlines where everything comes from, which might help.

Let's try to:

1. Merge all restaurant data
2. Merge all user data
3. Merge restaurant data and user data together using user ratings
4. Realise that we've merged too much, and merge user ratings + user profile + geoplaces
5. Use some groupby power and determine the top five restaurants in the dataset

In [16]:
import pandas as pd
import os

files = [f for f in os.listdir() if f.endswith(".csv")]
print(files)

['chefmozaccepts.csv', 'chefmozcuisine.csv', 'chefmozhours4.csv', 'chefmozparking.csv', 'geoplaces2.csv', 'rating_final.csv', 'usercuisine.csv', 'userpayment.csv', 'userprofile.csv']


## Merging restaurant data

In [17]:
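# Start from the first file, then merge every remaining file into it.
# Note: the loop runs over *all* remaining CSVs, so the user tables get merged in too.
df = pd.read_csv(files[0])
files.remove(files[0])
for file in files:
    dfs = pd.read_csv(file)
    # Restaurant tables share 'placeID'; the user tables only have 'userID',
    # which is available in df once rating_final.csv has been merged.
    if 'placeID' not in dfs.columns:
        df = df.merge(dfs, on='userID')
    else:
        df = df.merge(dfs, on='placeID')
    print(file)
    display(df.head())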

chefmozcuisine.csv


Unnamed: 0,placeID,Rpayment,Rcuisine
0,135110,cash,Spanish
1,135110,VISA,Spanish
2,135110,MasterCard-Eurocard,Spanish
3,135110,American_Express,Spanish
4,135110,bank_debit_cards,Spanish


chefmozhours4.csv


Unnamed: 0,placeID,Rpayment,Rcuisine,hours,days
0,135110,cash,Spanish,08:00-19:00;,Mon;Tue;Wed;Thu;Fri;
1,135110,cash,Spanish,00:00-00:00;,Sat;
2,135110,cash,Spanish,00:00-00:00;,Sun;
3,135110,VISA,Spanish,08:00-19:00;,Mon;Tue;Wed;Thu;Fri;
4,135110,VISA,Spanish,00:00-00:00;,Sat;


chefmozparking.csv


Unnamed: 0,placeID,Rpayment,Rcuisine,hours,days,parking_lot
0,135110,cash,Spanish,08:00-19:00;,Mon;Tue;Wed;Thu;Fri;,none
1,135110,cash,Spanish,00:00-00:00;,Sat;,none
2,135110,cash,Spanish,00:00-00:00;,Sun;,none
3,135110,VISA,Spanish,08:00-19:00;,Mon;Tue;Wed;Thu;Fri;,none
4,135110,VISA,Spanish,00:00-00:00;,Sat;,none


geoplaces2.csv


Unnamed: 0,placeID,Rpayment,Rcuisine,hours,days,parking_lot,latitude,longitude,the_geom_meter,name,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,Wine-Beer,not permitted,informal,no_accessibility,medium,?,quiet,f,closed,Internet
1,135109,cash,Italian,08:00-21:00;,Sat;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,Wine-Beer,not permitted,informal,no_accessibility,medium,?,quiet,f,closed,Internet
2,135109,cash,Italian,08:00-21:00;,Sun;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,Wine-Beer,not permitted,informal,no_accessibility,medium,?,quiet,f,closed,Internet
3,135106,cash,Mexican,18:00-23:30;,Mon;Tue;Wed;Thu;Fri;,none,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,...,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
4,135106,cash,Mexican,18:00-23:30;,Sat;,none,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,...,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none


rating_final.csv


Unnamed: 0,placeID,Rpayment,Rcuisine,hours,days,parking_lot,latitude,longitude,the_geom_meter,name,...,price,url,Rambience,franchise,area,other_services,userID,rating,food_rating,service_rating
0,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,medium,?,quiet,f,closed,Internet,U1030,0,0,0
1,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,medium,?,quiet,f,closed,Internet,U1020,2,2,1
2,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,medium,?,quiet,f,closed,Internet,U1051,1,1,1
3,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,medium,?,quiet,f,closed,Internet,U1041,1,2,1
4,135109,cash,Italian,08:00-21:00;,Sat;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,medium,?,quiet,f,closed,Internet,U1030,0,0,0


usercuisine.csv


Unnamed: 0,placeID,Rpayment,Rcuisine_x,hours,days,parking_lot,latitude,longitude,the_geom_meter,name,...,url,Rambience,franchise,area,other_services,userID,rating,food_rating,service_rating,Rcuisine_y
0,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,?,quiet,f,closed,Internet,U1030,0,0,0,Mexican
1,135109,cash,Italian,08:00-21:00;,Sat;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,?,quiet,f,closed,Internet,U1030,0,0,0,Mexican
2,135109,cash,Italian,08:00-21:00;,Sun;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,?,quiet,f,closed,Internet,U1030,0,0,0,Mexican
3,135088,cash,Cafeteria,09:00-16:00;,Mon;Tue;Wed;Thu;Fri;,public,18.876011,-99.21989,0101000020957F0000E14AD4DBC7765AC1F7B33C85B153...,Cafeteria cenidet,...,www.cenidet.edu.mx,quiet,f,closed,none,U1030,1,0,1,Mexican
4,135088,cash,Cafeteria,00:00-00:00;,Sat;,public,18.876011,-99.21989,0101000020957F0000E14AD4DBC7765AC1F7B33C85B153...,Cafeteria cenidet,...,www.cenidet.edu.mx,quiet,f,closed,none,U1030,1,0,1,Mexican


userpayment.csv


Unnamed: 0,placeID,Rpayment,Rcuisine_x,hours,days,parking_lot,latitude,longitude,the_geom_meter,name,...,Rambience,franchise,area,other_services,userID,rating,food_rating,service_rating,Rcuisine_y,Upayment
0,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,quiet,f,closed,Internet,U1030,0,0,0,Mexican,cash
1,135109,cash,Italian,08:00-21:00;,Sat;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,quiet,f,closed,Internet,U1030,0,0,0,Mexican,cash
2,135109,cash,Italian,08:00-21:00;,Sun;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,quiet,f,closed,Internet,U1030,0,0,0,Mexican,cash
3,135088,cash,Cafeteria,09:00-16:00;,Mon;Tue;Wed;Thu;Fri;,public,18.876011,-99.21989,0101000020957F0000E14AD4DBC7765AC1F7B33C85B153...,Cafeteria cenidet,...,quiet,f,closed,none,U1030,1,0,1,Mexican,cash
4,135088,cash,Cafeteria,00:00-00:00;,Sat;,public,18.876011,-99.21989,0101000020957F0000E14AD4DBC7765AC1F7B33C85B153...,Cafeteria cenidet,...,quiet,f,closed,none,U1030,1,0,1,Mexican,cash


userprofile.csv


Unnamed: 0,placeID,Rpayment,Rcuisine_x,hours,days,parking_lot,latitude_x,longitude_x,the_geom_meter,name,...,hijos,birth_year,interest,personality,religion,activity,color,weight,budget,height
0,135109,cash,Italian,08:00-21:00;,Mon;Tue;Wed;Thu;Fri;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,independent,1991,eco-friendly,hard-worker,Catholic,student,black,64,medium,1.75
1,135109,cash,Italian,08:00-21:00;,Sat;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,independent,1991,eco-friendly,hard-worker,Catholic,student,black,64,medium,1.75
2,135109,cash,Italian,08:00-21:00;,Sun;,none,18.921785,-99.23535,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,...,independent,1991,eco-friendly,hard-worker,Catholic,student,black,64,medium,1.75
3,135088,cash,Cafeteria,09:00-16:00;,Mon;Tue;Wed;Thu;Fri;,public,18.876011,-99.21989,0101000020957F0000E14AD4DBC7765AC1F7B33C85B153...,Cafeteria cenidet,...,independent,1991,eco-friendly,hard-worker,Catholic,student,black,64,medium,1.75
4,135088,cash,Cafeteria,00:00-00:00;,Sat;,public,18.876011,-99.21989,0101000020957F0000E14AD4DBC7765AC1F7B33C85B153...,Cafeteria cenidet,...,independent,1991,eco-friendly,hard-worker,Catholic,student,black,64,medium,1.75
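
The repeated rows in the outputs above appear because `placeID` is not unique in these tables: every payment row for a place is paired with every cuisine row, every opening-hours row, and so on. Here is a minimal sketch of that effect, using made-up toy frames rather than the real files (the values and the frame names `accepts` and `hours` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for two restaurant tables; the values are made up.
accepts = pd.DataFrame({
    "placeID": [1, 1, 1, 2],
    "Rpayment": ["cash", "VISA", "MasterCard", "cash"],
})
hours = pd.DataFrame({
    "placeID": [1, 1, 2],
    "days": ["Mon-Fri", "Sat", "Mon-Fri"],
})

# placeID repeats on both sides, so each payment row for a place is
# combined with each hours row for the same place (3 * 2 + 1 * 1 rows).
merged = accepts.merge(hours, on="placeID")
print(len(accepts), len(hours), len(merged))  # 4 3 7
```

Comparing row counts before and after each merge is a quick way to notice when a merge has multiplied the data.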


## Merging User data

In [25]:
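# Merge only the user tables (file names starting with 'user') on 'userID'.
df_user = None
for file in files:
    if file.startswith('user'):
        df_temp = pd.read_csv(file)
        if df_user is not None:
            # Every later user file is merged into the running result.
            df_user = df_user.merge(df_temp, on='userID')
        else:
            # The first user file seeds the result.
            df_user = df_temp.copy()
        display(df_user.head())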

Unnamed: 0,userID,Rcuisine
0,U1001,American
1,U1002,Mexican
2,U1003,Mexican
3,U1004,Bakery
4,U1004,Breakfast-Brunch


here we go


Unnamed: 0,userID,Rcuisine,Upayment
0,U1001,American,cash
1,U1002,Mexican,cash
2,U1003,Mexican,cash
3,U1004,Bakery,cash
4,U1004,Bakery,bank_debit_cards


here we go


Unnamed: 0,userID,Rcuisine,Upayment,latitude,longitude,smoker,drink_level,dress_preference,ambience,transport,...,hijos,birth_year,interest,personality,religion,activity,color,weight,budget,height
0,U1001,American,cash,22.139997,-100.978803,False,abstemious,informal,family,on foot,...,independent,1989,variety,thrifty-protector,none,student,black,69,medium,1.77
1,U1002,Mexican,cash,22.150087,-100.983325,False,abstemious,informal,family,public,...,independent,1990,technology,hunter-ostentatious,Catholic,student,red,40,low,1.87
2,U1003,Mexican,cash,22.119847,-100.946527,False,social drinker,formal,family,public,...,independent,1989,none,hard-worker,Catholic,student,blue,60,low,1.69
3,U1004,Bakery,cash,18.867,-99.183,False,abstemious,informal,family,public,...,independent,1940,variety,hard-worker,none,professional,green,44,medium,1.53
4,U1004,Bakery,bank_debit_cards,18.867,-99.183,False,abstemious,informal,family,public,...,independent,1940,variety,hard-worker,none,professional,green,44,medium,1.53
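
One thing to keep in mind: `merge` defaults to `how="inner"`, so a user missing from any one of the user files silently disappears from the result. Whether that actually happens in this dataset is not checked here. The sketch below uses made-up toy frames (the names `profile` and `payment` and all values are assumptions) to show how `how="left"` with `indicator=True` can reveal such users:

```python
import pandas as pd

# Toy user tables; U2 has a profile but no payment entry.
profile = pd.DataFrame({
    "userID": ["U1", "U2", "U3"],
    "budget": ["low", "medium", "high"],
})
payment = pd.DataFrame({
    "userID": ["U1", "U1", "U3"],
    "Upayment": ["cash", "VISA", "cash"],
})

# Default inner merge: only users present in both tables survive.
inner = profile.merge(payment, on="userID")

# Left merge keeps every profile row; the _merge column says where each row came from.
left = profile.merge(payment, on="userID", how="left", indicator=True)
missing = left.loc[left["_merge"] == "left_only", "userID"].tolist()

print(sorted(inner["userID"].unique()))  # ['U1', 'U3']
print(missing)                           # ['U2']
```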


## Merging User ratings as well

In [27]:
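# Merge the ratings with the user data, then with the big frame built earlier.
# df already contains the rating and user columns, so the second merge
# duplicates them (note the _x / _y suffixes in the output): we have merged too much.
user_rating = pd.read_csv('rating_final.csv')
user_rating = user_rating.merge(df_user, on='userID')
user_rating = user_rating.merge(df, on='userID')
user_rating.head()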

Unnamed: 0,userID,placeID_x,rating_x,food_rating_x,service_rating_x,Rcuisine,Upayment_x,latitude,longitude,smoker_x,...,hijos_y,birth_year_y,interest_y,personality_y,religion_y,activity_y,color_y,weight_y,budget_y,height_y
0,U1077,135085,2,2,2,Mexican,VISA,22.156469,-100.98554,False,...,kids,1987,technology,thrifty-protector,Catholic,student,blue,65,medium,1.71
1,U1077,135085,2,2,2,Mexican,VISA,22.156469,-100.98554,False,...,kids,1987,technology,thrifty-protector,Catholic,student,blue,65,medium,1.71
2,U1077,135085,2,2,2,Mexican,VISA,22.156469,-100.98554,False,...,kids,1987,technology,thrifty-protector,Catholic,student,blue,65,medium,1.71
3,U1077,135085,2,2,2,Mexican,VISA,22.156469,-100.98554,False,...,kids,1987,technology,thrifty-protector,Catholic,student,blue,65,medium,1.71
4,U1077,135085,2,2,2,Mexican,VISA,22.156469,-100.98554,False,...,kids,1987,technology,thrifty-protector,Catholic,student,blue,65,medium,1.71
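
The `_x` and `_y` suffixes in the output above are what pandas adds when both frames contain columns with the same name that are not part of the merge key. A small sketch on made-up toy frames (the names `ratings` and `everything` and their values are assumptions) shows the effect, along with one way to avoid it by merging in only the columns that are new:

```python
import pandas as pd

# Toy frames: 'everything' already contains the rating columns.
ratings = pd.DataFrame({
    "userID": ["U1", "U2"],
    "placeID": [10, 20],
    "rating": [2, 1],
})
everything = pd.DataFrame({
    "userID": ["U1", "U2"],
    "placeID": [10, 20],
    "rating": [2, 1],
    "name": ["Place A", "Place B"],
})

# Overlapping non-key columns get suffixed instead of being matched up.
too_much = ratings.merge(everything, on="userID")
print(too_much.columns.tolist())
# ['userID', 'placeID_x', 'rating_x', 'placeID_y', 'rating_y', 'name']

# Merging on every shared key and bringing in only the new column avoids duplicates.
tidy = ratings.merge(everything[["userID", "placeID", "name"]], on=["userID", "placeID"])
print(tidy.columns.tolist())
# ['userID', 'placeID', 'rating', 'name']
```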


## Merge Subsets

In [4]:
# your code here
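
One possible approach to this step, sketched on made-up toy frames rather than the real CSVs: start from the ratings and attach one row of user profile per `userID` and one row of place information per `placeID`. The frames `ratings`, `profile`, and `places` below are assumed stand-ins for `rating_final.csv`, `userprofile.csv`, and `geoplaces2.csv`. The sketch also assumes that each user and each place appears only once in its own table, which `validate="many_to_one"` checks at merge time.

```python
import pandas as pd

# Toy stand-ins for rating_final.csv, userprofile.csv and geoplaces2.csv.
ratings = pd.DataFrame({
    "userID": ["U1", "U1", "U2"],
    "placeID": [10, 20, 10],
    "rating": [2, 1, 0],
})
profile = pd.DataFrame({
    "userID": ["U1", "U2"],
    "latitude": [22.1, 18.9],
    "budget": ["medium", "low"],
})
places = pd.DataFrame({
    "placeID": [10, 20],
    "latitude": [22.2, 18.8],
    "name": ["Place A", "Place B"],
})

# validate raises if the right-hand key is not unique, which guards against
# row multiplication; suffixes label the clashing latitude columns.
subset = (
    ratings
    .merge(profile, on="userID", validate="many_to_one")
    .merge(places, on="placeID", validate="many_to_one",
           suffixes=("_user", "_place"))
)
print(len(ratings), len(subset))  # 3 3
print(subset.columns.tolist())
```

Because both lookups are many-to-one, the result has exactly one row per rating.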

## Top 5 restaurants based on rating

Note that to answer this, we didn't actually need the user profile data. But we might use it to remove votes from users who don't satisfy certain criteria (for example, we might want to make sure the user has been to multiple restaurants, is a certain age, or doesn't show suspicious voting trends, such as giving every restaurant a one).
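
As a rough sketch of that kind of filtering, the toy example below keeps only ratings from users who rated at least two places and who did not give every place the same score. The thresholds and the toy values are assumptions chosen for illustration, not rules taken from the dataset.

```python
import pandas as pd

# Toy ratings: U2 rated a single place, U3 gave every place the same score.
ratings = pd.DataFrame({
    "userID": ["U1", "U1", "U2", "U3", "U3", "U3"],
    "placeID": [10, 20, 10, 10, 20, 30],
    "rating": [2, 1, 2, 1, 1, 1],
})

per_user = ratings.groupby("userID")
# transform broadcasts each per-user statistic back onto that user's rows.
n_places = per_user["placeID"].transform("nunique")
n_scores = per_user["rating"].transform("nunique")

# Assumed criteria: at least two places rated, and more than one distinct score.
kept = ratings[(n_places >= 2) & (n_scores > 1)]
print(kept["userID"].unique().tolist())  # ['U1']
```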

In [5]:
# your code here
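
A minimal sketch of the groupby step, again on made-up toy ratings: group by `placeID`, compute the mean rating and the number of ratings, then sort and take the first five. Ranking by mean rating with the rating count as a tie-breaker is an assumption; the exercise does not say how "top" should be defined.

```python
import pandas as pd

# Toy ratings for six places; the values are made up.
ratings = pd.DataFrame({
    "placeID": [1, 1, 2, 2, 3, 3, 4, 5, 6, 6],
    "rating":  [2, 2, 1, 2, 0, 1, 2, 1, 0, 0],
})

# One row per place: average rating and how many ratings it received.
summary = ratings.groupby("placeID").agg(
    mean_rating=("rating", "mean"),
    n_ratings=("rating", "count"),
)

# Highest mean first; ties are broken by the number of ratings.
top5 = summary.sort_values(["mean_rating", "n_ratings"], ascending=False).head(5)
print(top5.index.tolist())  # [1, 4, 2, 5, 3]
```

On the real data, merging `top5` with the `name` column from `geoplaces2.csv` on `placeID` would turn the IDs into restaurant names.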
