## Merge

When data becomes multi-dimensional, covering several aspects of the same subject, a lot of information tends to be stored redundantly.
Take the next dataset as an example: we have collected ratings of restaurants from users. When a single user rates 2 restaurants, the information about that user relates to both rows, yet it would be wasteful to store it twice.
The same happens when a restaurant has 2 ratings: the location of the restaurant would be kept twice in our data, which does not scale.

We solve this problem using relational data: the information is split over several tables, and 2 of those tables share a common key column that we can use to join them for further processing.
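
As a minimal sketch of this idea, consider two tiny hand-made tables. The values and the `birth_year` column are invented for illustration and are not taken from the dataset. One table stores each user's details once, the other stores ratings that refer to the user only through the key.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only.
users = pd.DataFrame({
    "userID": ["U1", "U2"],
    "birth_year": [1990, 1985],
})
ratings = pd.DataFrame({
    "userID": ["U1", "U1", "U2"],
    "placeID": [10, 11, 10],
    "rating": [2, 1, 0],
})

# The user details are stored once and attached to each rating on demand,
# using the shared key column 'userID'.
ratings_with_users = pd.merge(ratings, users, on="userID")
print(ratings_with_users)
```

The user details only appear repeated in the merged result, which we can rebuild whenever we need it, so the stored tables stay free of duplication.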

In our example we use a dataset with consumers, restaurants and the ratings between them. You can find more information [here](https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings).

In [1]:
import pandas as pd

In [2]:
rating_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c3_data_preprocessing/data/cuisine/rating_final.csv')
rating_df

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2
...,...,...,...,...,...
1156,U1043,132630,1,1,1
1157,U1011,132715,1,1,0
1158,U1068,132733,1,1,0
1159,U1068,132594,1,1,1


The first table we read contains the userID of the user who gave the rating, the placeID of the restaurant that was rated, and the numerical values of the 3 different ratings.

Perhaps you can find out what the min and max values for the ratings are?
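
One possible way to approach this is sketched below on a toy frame with invented values, so it does not give away the answer for the real data. The name `toy_ratings` is a stand-in for `rating_df`.

```python
import pandas as pd

# Hypothetical stand-in for rating_df, with invented values.
toy_ratings = pd.DataFrame({
    "rating": [2, 1, 0],
    "food_rating": [2, 2, 1],
    "service_rating": [1, 0, 2],
})

# One row per statistic, one column per rating type.
rating_range = toy_ratings.agg(["min", "max"])
print(rating_range)
```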

To know the type of restaurant, we can now read another table.

In [3]:
cuisine_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c3_data_preprocessing/data/cuisine/chefmozcuisine.csv')
cuisine_df

Unnamed: 0,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food
...,...,...
911,132005,Seafood
912,132004,Seafood
913,132003,International
914,132002,Seafood


This table also contains the placeID, so we should be able to merge/join these 2 tables and create a new table with info from both.
Notice how we specify the 'on' parameter to denote placeID as our common key, while how='inner' keeps only the placeIDs that occur in both tables.

In [4]:
merged_df = pd.merge(rating_df, cuisine_df, on='placeID', how='inner')
merged_df

Unnamed: 0,userID,placeID,rating,food_rating,service_rating,Rcuisine
0,U1077,135085,2,2,2,Fast_Food
1,U1108,135085,1,2,1,Fast_Food
2,U1081,135085,1,2,1,Fast_Food
3,U1056,135085,2,2,2,Fast_Food
4,U1134,135085,2,1,2,Fast_Food
...,...,...,...,...,...,...
1038,U1061,132958,2,2,2,American
1039,U1025,132958,1,0,0,American
1040,U1097,132958,2,1,1,American
1041,U1096,132958,1,2,2,American
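
A side note on the merge itself: an inner merge can change the number of rows. A rating whose placeID has no cuisine entry is dropped, and a rating for a restaurant listed under several cuisines is repeated once per cuisine. The sketch below shows both effects on toy tables with invented values. Comparing `len(rating_df)` with `len(merged_df)` is a quick way to check whether either effect is present in the real data.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only.
toy_ratings = pd.DataFrame({
    "userID": ["U1", "U2", "U3"],
    "placeID": [10, 11, 12],
    "rating": [2, 1, 0],
})
toy_cuisine = pd.DataFrame({
    # place 10 is listed under two cuisines, place 12 is absent
    "placeID": [10, 10, 11],
    "Rcuisine": ["Bar", "Mexican", "Pizzeria"],
})

# Inner merge: only placeIDs present in both tables survive.
inner = pd.merge(toy_ratings, toy_cuisine, on="placeID", how="inner")

# Left merge: every rating is kept; indicator=True adds a '_merge' column
# telling us whether a matching cuisine row was found.
left = pd.merge(toy_ratings, toy_cuisine, on="placeID", how="left", indicator=True)

print(len(toy_ratings), len(inner), len(left))
print(left)
```

With `how='left'` the unmatched rating stays in the table with a missing Rcuisine, which can be preferable when we do not want to silently lose ratings.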


Great! Now we have more info about the ratings that were given, namely the type of cuisine that was rated.
We could figure out which cuisines are available in our dataset and compare them. Let us count the occurrences of each cuisine.

In [5]:
merged_df.Rcuisine.value_counts()

Mexican             238
Bar                 140
Cafeteria           102
Fast_Food            91
Seafood              62
Bar_Pub_Brewery      59
Pizzeria             51
Chinese              41
American             39
International        37
Contemporary         32
Burgers              31
Japanese             29
Italian              26
Family               14
Cafe-Coffee_Shop     12
Breakfast-Brunch      9
Game                  7
Vietnamese            6
Bakery                5
Mediterranean         4
Armenian              4
Regional              4
Name: Rcuisine, dtype: int64

A lot of Mexican, which is not surprising as this dataset comes from Mexico.
I wonder if there is a difference between 'Bar' and 'Bar_Pub_Brewery'. We can check whether the average ratings for those 2 differ.

In [6]:
for cuisine in ['Bar', 'Bar_Pub_Brewery']:
    print(cuisine)
    print(merged_df[merged_df.Rcuisine==cuisine][['rating', 'food_rating', 'service_rating']].mean())
    print()

Bar
rating            1.200000
food_rating       1.135714
service_rating    1.085714
dtype: float64

Bar_Pub_Brewery
rating            1.305085
food_rating       1.169492
service_rating    1.203390
dtype: float64



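Just looking at the averages, we can deduce that while the food ratings hardly differ (1.14 versus 1.17), the service rating is somewhat higher at the Bar_Pub_Brewery (1.20 versus 1.09). Keep in mind that the 2 groups differ in size (140 versus 59 rows), so this is an indication rather than a conclusion.

We can make the same comparison for 'Cafeteria' and 'Cafe-Coffee_Shop'.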


In [7]:
merged_df[merged_df.Rcuisine=='Cafeteria'][['rating', 'food_rating', 'service_rating']].mean()

rating            1.205882
food_rating       1.127451
service_rating    1.078431
dtype: float64

In [8]:
merged_df[merged_df.Rcuisine=='Cafe-Coffee_Shop'][['rating', 'food_rating', 'service_rating']].mean()

rating            1.583333
food_rating       1.333333
service_rating    1.416667
dtype: float64

As easy as it looks, we can now merge information from different tables in our dataset and perform some simple comparisons. In later sections we will see how we can improve on those.
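
As a small preview, and not necessarily the approach taken in those later sections, the per-cuisine filtering above could be replaced by a single `groupby`. The sketch uses a toy frame with invented values; `toy_merged` stands in for `merged_df`, and the `n_ratings` column name is an assumption made for this example.

```python
import pandas as pd

# Hypothetical stand-in for merged_df, with invented values.
toy_merged = pd.DataFrame({
    "Rcuisine": ["Bar", "Bar", "Bar_Pub_Brewery", "Bar_Pub_Brewery"],
    "rating": [1, 2, 2, 1],
    "food_rating": [1, 1, 2, 1],
    "service_rating": [0, 2, 1, 2],
})
rating_cols = ["rating", "food_rating", "service_rating"]

# Mean of every rating column per cuisine.
summary = toy_merged.groupby("Rcuisine")[rating_cols].mean()

# Number of rows behind each mean, aligned on the cuisine index.
summary["n_ratings"] = toy_merged.groupby("Rcuisine").size()
print(summary)
```

Keeping the row count next to the means makes it easier to judge how much weight a difference between two cuisines deserves.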

As an exercise, I have already read in the table containing the type of payment each user has opted for.
Could you find out whether the type of payment has an influence on the rating?

In [9]:
user_payment_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c3_data_preprocessing/data/cuisine/userpayment.csv')
user_payment_df

Unnamed: 0,userID,Upayment
0,U1001,cash
1,U1002,cash
2,U1003,cash
3,U1004,cash
4,U1004,bank_debit_cards
...,...,...
172,U1134,cash
173,U1135,cash
174,U1136,cash
175,U1137,cash
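
If you want a starting point, here is a minimal sketch of one possible approach on toy tables with invented values. `toy_ratings` and `toy_payment` stand in for `rating_df` and `user_payment_df`. Note that a user can have several payment types (U1004 in the table above has both cash and bank_debit_cards), so after the merge such a user's ratings count towards each of them.

```python
import pandas as pd

# Hypothetical stand-ins for rating_df and user_payment_df, with invented values.
toy_ratings = pd.DataFrame({
    "userID": ["U1", "U1", "U2", "U3"],
    "rating": [2, 1, 0, 2],
})
toy_payment = pd.DataFrame({
    # U2 has two payment types, U3 has none listed
    "userID": ["U1", "U2", "U2"],
    "Upayment": ["cash", "cash", "bank_debit_cards"],
})

# Attach the payment type to each rating through the shared 'userID' key.
rated_payment = pd.merge(toy_ratings, toy_payment, on="userID", how="inner")

# Average rating and number of rows per payment type.
payment_summary = rated_payment.groupby("Upayment")["rating"].agg(["mean", "count"])
print(payment_summary)
```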
