## Pivot

With the groupby operation we used one categorical variable to separate our data into groups.
Here we go a step further and aggregate over two categorical variables, which results in a comparison matrix.

Aside from that, the pivot operation can more generally be used to go from a long data format to a wide data format.
To keep things uniform, we stick with the same cuisine dataset.
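
To make the long-to-wide idea concrete, here is a minimal sketch on a made-up table. The `toy_long` dataframe and its values are assumptions for illustration only and are not taken from the cuisine dataset.

```python
import pandas as pd

# Hypothetical long-format data: one row per (user, place) rating.
toy_long = pd.DataFrame({
    "userID": ["U1", "U1", "U2", "U2"],
    "placeID": [101, 102, 101, 102],
    "rating": [2, 1, 0, 2],
})

# Wide format: one row per user, one column per place.
# This works because every (userID, placeID) pair occurs only once.
toy_wide = toy_long.pivot(index="userID", columns="placeID", values="rating")
print(toy_wide)
```

Each unique value of the `columns` argument becomes its own column, which is what turns the long table into a wide one.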

In [1]:
import pandas as pd

In [2]:
rating_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c3_data_preprocessing/data/cuisine/rating_final.csv')
rating_df

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2
...,...,...,...,...,...
1156,U1043,132630,1,1,1
1157,U1011,132715,1,1,0
1158,U1068,132733,1,1,0
1159,U1068,132594,1,1,1


Again we merge with the geolocation data; I feel it becomes obvious here how closely these operations are related to each other.

In [3]:
geo_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c3_data_preprocessing/data/cuisine/geoplaces2.csv')

A subtle difference from last time is that I did not first group per restaurant. This leads to a dataframe with a lot of redundant information!
Look through the merged dataframe and try to spot the copies of data.

In [4]:
geo_rating_df = pd.merge(rating_df, geo_df, on='placeID')
geo_rating_df

Unnamed: 0,userID,placeID,rating,food_rating,service_rating,latitude,longitude,the_geom_meter,name,address,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,U1077,135085,2,2,2,22.150802,-100.982680,0101000020957F00009F823DA6094858C18A2D4D37F9A4...,Tortas Locas Hipocampo,Venustiano Carranza 719 Centro,...,No_Alcohol_Served,not permitted,informal,no_accessibility,medium,?,familiar,f,closed,none
1,U1108,135085,1,2,1,22.150802,-100.982680,0101000020957F00009F823DA6094858C18A2D4D37F9A4...,Tortas Locas Hipocampo,Venustiano Carranza 719 Centro,...,No_Alcohol_Served,not permitted,informal,no_accessibility,medium,?,familiar,f,closed,none
2,U1081,135085,1,2,1,22.150802,-100.982680,0101000020957F00009F823DA6094858C18A2D4D37F9A4...,Tortas Locas Hipocampo,Venustiano Carranza 719 Centro,...,No_Alcohol_Served,not permitted,informal,no_accessibility,medium,?,familiar,f,closed,none
3,U1056,135085,2,2,2,22.150802,-100.982680,0101000020957F00009F823DA6094858C18A2D4D37F9A4...,Tortas Locas Hipocampo,Venustiano Carranza 719 Centro,...,No_Alcohol_Served,not permitted,informal,no_accessibility,medium,?,familiar,f,closed,none
4,U1134,135085,2,1,2,22.150802,-100.982680,0101000020957F00009F823DA6094858C18A2D4D37F9A4...,Tortas Locas Hipocampo,Venustiano Carranza 719 Centro,...,No_Alcohol_Served,not permitted,informal,no_accessibility,medium,?,familiar,f,closed,none
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1156,U1061,132958,2,2,2,22.144979,-101.005683,0101000020957F000049095EB34A4858C15CB4BD1EE1AB...,tacos los volcanes,avenida hivno nacional,...,No_Alcohol_Served,none,informal,completely,low,?,quiet,t,closed,none
1157,U1025,132958,1,0,0,22.144979,-101.005683,0101000020957F000049095EB34A4858C15CB4BD1EE1AB...,tacos los volcanes,avenida hivno nacional,...,No_Alcohol_Served,none,informal,completely,low,?,quiet,t,closed,none
1158,U1097,132958,2,1,1,22.144979,-101.005683,0101000020957F000049095EB34A4858C15CB4BD1EE1AB...,tacos los volcanes,avenida hivno nacional,...,No_Alcohol_Served,none,informal,completely,low,?,quiet,t,closed,none
1159,U1096,132958,1,2,2,22.144979,-101.005683,0101000020957F000049095EB34A4858C15CB4BD1EE1AB...,tacos los volcanes,avenida hivno nacional,...,No_Alcohol_Served,none,informal,completely,low,?,quiet,t,closed,none
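
If you would rather measure the redundancy than spot it by eye, the sketch below shows one possible check on a small made-up example. The `ratings` and `places` dataframes are hypothetical stand-ins for `rating_df` and `geo_df`.

```python
import pandas as pd

# Hypothetical stand-ins for the ratings and restaurant tables.
ratings = pd.DataFrame({
    "userID": ["U1", "U2", "U3", "U4"],
    "placeID": [101, 101, 101, 102],
    "rating": [2, 1, 2, 0],
})
places = pd.DataFrame({
    "placeID": [101, 102],
    "name": ["Place A", "Place B"],
    "alcohol": ["No_Alcohol_Served", "Full_Bar"],
})

merged = pd.merge(ratings, places, on="placeID")

# Restaurant columns never vary within one placeID:
# the largest number of distinct values per place is 1.
place_cols = ["name", "alcohol"]
max_distinct = merged.groupby("placeID")[place_cols].nunique().max().max()
print(max_distinct)

# So each restaurant's details are repeated once per rating it received.
copies_per_place = merged.groupby("placeID").size()
print(copies_per_place)
```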


Now that we have workable data, we can choose two categories and create a comparison matrix using the pivot operation.
There is still a problem that we have to resolve, though. Can you figure out what it is by reading the error at the end of the stack trace below?

In [5]:
geo_rating_df.pivot(index='alcohol', columns='smoking_area', values='rating')

ValueError: Index contains duplicate entries, cannot reshape

It says 'Index contains duplicate entries, cannot reshape', meaning that some combinations of our two categories, alcohol and smoking area, occur more than once. That is understandable, since every rating is a separate row.
I opted to solve this by grouping over the two categories and taking the mean for each combination. I then pivot this grouped data, setting the alcohol category as index and the smoking area as columns.

In [6]:
grouped_geo_rating_df = geo_rating_df.groupby(['alcohol','smoking_area'])[['rating','food_rating', 'service_rating']].mean().reset_index()
grouped_geo_rating_df.pivot(index='alcohol', columns='smoking_area', values='rating')

smoking_area,none,not permitted,only at bar,permitted,section
alcohol,,,,,
Full_Bar,1.305556,0.857143,,1.5,1.272727
No_Alcohol_Served,1.186788,1.124402,,1.114286,1.265823
Wine-Beer,1.217391,1.0,1.368421,1.3,1.275
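
As a side note, the group-then-pivot steps can also be written as a single call to `pivot_table`, which aggregates duplicate combinations for you. The sketch below uses a made-up `toy` dataframe (an assumption, not the real data) to show the failing pivot and both working routes.

```python
import pandas as pd

# Hypothetical data: two rows share the (Full_Bar, none) combination.
toy = pd.DataFrame({
    "alcohol": ["Full_Bar", "Full_Bar", "Wine-Beer", "Wine-Beer"],
    "smoking_area": ["none", "none", "none", "section"],
    "rating": [2, 1, 0, 2],
})

# A plain pivot cannot decide which of the duplicate rows to keep.
try:
    toy.pivot(index="alcohol", columns="smoking_area", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Route 1: aggregate explicitly, then pivot.
two_step = (
    toy.groupby(["alcohol", "smoking_area"])["rating"]
    .mean()
    .reset_index()
    .pivot(index="alcohol", columns="smoking_area", values="rating")
)

# Route 2: let pivot_table aggregate the duplicates with aggfunc.
one_step = toy.pivot_table(
    index="alcohol", columns="smoking_area", values="rating", aggfunc="mean"
)
print(one_step)
```

Which route to prefer is a matter of taste: the two-step version makes the aggregation explicit, while `pivot_table` is shorter.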


Wonderful! Now we have an average rating for each combination. Notice, however, that not every combination has the same sample size, so comparing can be tricky when a combination has only a few ratings.

To figure that out, I counted the ratings per combination.

In [7]:
geo_rating_df.groupby(['alcohol','smoking_area']).count().reset_index().pivot(index='alcohol', columns='smoking_area', values='rating')

smoking_area,none,not permitted,only at bar,permitted,section
alcohol,,,,,
Full_Bar,36.0,7.0,,4.0,33.0
No_Alcohol_Served,439.0,209.0,,35.0,79.0
Wine-Beer,161.0,9.0,19.0,10.0,120.0


It seems that there might be a correlation between the two categories: in a lot of places where the smoking area is 'not permitted' or 'none', no alcohol is served, which makes sense.
Comparing ratings across the alcohol categories for places where smoking is not permitted is not a good idea: the counts are 7, 209 and 9, which is very unbalanced.
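
One way to guard against reading too much into small groups is to hide the means that are backed by only a few ratings. The sketch below does this on made-up data; the `toy` dataframe and the `min_count` threshold are assumptions, and a sensible threshold depends on your own data.

```python
import pandas as pd

# Hypothetical ratings with unbalanced group sizes.
toy = pd.DataFrame({
    "alcohol": ["Full_Bar", "Full_Bar", "Full_Bar", "Wine-Beer", "Wine-Beer"],
    "smoking_area": ["none", "none", "section", "none", "none"],
    "rating": [2, 1, 2, 0, 1],
})

# Mean rating per combination.
means = toy.pivot_table(
    index="alcohol", columns="smoking_area", values="rating", aggfunc="mean"
)

# Number of ratings per combination (0 where a combination never occurs).
counts = pd.crosstab(toy["alcohol"], toy["smoking_area"])

# Keep a mean only when enough ratings support it; the rest becomes NaN.
min_count = 2  # hypothetical threshold
reliable_means = means.where(counts >= min_count)
print(reliable_means)
```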

In [8]:
geo_df.columns

Index(['placeID', 'latitude', 'longitude', 'the_geom_meter', 'name', 'address',
       'city', 'state', 'country', 'fax', 'zip', 'alcohol', 'smoking_area',
       'dress_code', 'accessibility', 'price', 'url', 'Rambience', 'franchise',
       'area', 'other_services'],
      dtype='object')

I printed the columns above. Perhaps you could figure out a relation between the price category and the (R)ambience of the restaurant?
There may be other combinations that I did not think of, so try some out!