## Groupby

In the previous section we saw how to combine information from multiple tables in our dataset.
Here we build on that by using the merged information to group on categorical variables.

In [1]:
import pandas as pd

In [2]:
rating_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c3_data_preprocessing/data/cuisine/rating_final.csv')
rating_df

,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2
...,...,...,...,...,...
1156,U1043,132630,1,1,1
1157,U1011,132715,1,1,0
1158,U1068,132733,1,1,0
1159,U1068,132594,1,1,1


Again we have our rating data, containing the users, the places and the ratings they gave.
As a simple example, we could group by the placeID column and take the mean, which gives us the mean rating for each restaurant.

In [3]:
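# numeric_only=True leaves out the non-numeric userID column, which pandas 2.0+ would otherwise refuse to average
grouped_rating_df = rating_df.groupby('placeID').mean(numeric_only=True).sort_values('rating')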
grouped_rating_df

placeID,rating,food_rating,service_rating
132654,0.250000,0.25,0.250000
135040,0.250000,0.25,0.250000
132560,0.500000,1.00,0.250000
132663,0.500000,0.50,0.666667
135069,0.500000,0.50,0.750000
...,...,...,...
132755,1.800000,2.00,1.600000
132922,1.833333,1.50,1.833333
134986,2.000000,2.00,2.000000
135034,2.000000,2.00,1.600000


Keep in mind that this can be misleading, as not every group has the same number of records. We can count the records per group by combining groupby with count.

In [4]:
rating_df.groupby('placeID').rating.count()

placeID
132560     4
132561     4
132564     4
132572    15
132583     4
          ..
135088     6
135104     7
135106    10
135108    11
135109     4
Name: rating, Length: 130, dtype: int64

Taking an average of only 4 ratings might not be ideal, so we should make sure that our groups have a reasonable sample size.
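
As a rough sketch of how the sample size could be checked alongside the mean, the example below computes both in one pass and keeps only the groups with enough ratings. The `toy_ratings` frame and the `MIN_REVIEWS` threshold are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for rating_df (values are invented for illustration).
toy_ratings = pd.DataFrame({
    "placeID": [1, 1, 1, 1, 1, 2, 2, 3],
    "rating":  [2, 1, 2, 2, 1, 0, 1, 2],
})

# Compute the mean and the number of ratings per place in one pass.
per_place = toy_ratings.groupby("placeID")["rating"].agg(["mean", "count"])

# Keep only places with at least MIN_REVIEWS ratings.
# The threshold is an arbitrary assumption; pick one that suits your data.
MIN_REVIEWS = 5
reliable = per_place[per_place["count"] >= MIN_REVIEWS]
print(reliable)
```

Applied to the real data, the same pattern would use `rating_df` instead of `toy_ratings`.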

Let's make things more interesting and insert some location data.

In [5]:
geo_df = pd.read_csv('./data/cuisine/geoplaces2.csv').set_index('placeID')
geo_df

placeID,latitude,longitude,the_geom_meter,name,address,city,state,country,fax,zip,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
134999,18.915421,-99.184871,0101000020957F000088568DE356715AC138C0A525FC46...,Kiku Cuernavaca,Revolucion,Cuernavaca,Morelos,Mexico,?,?,No_Alcohol_Served,none,informal,no_accessibility,medium,kikucuernavaca.com.mx,familiar,f,closed,none
132825,22.147392,-100.983092,0101000020957F00001AD016568C4858C1243261274BA5...,puesto de tacos,esquina santos degollado y leon guzman,s.l.p.,s.l.p.,mexico,?,78280,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,open,none
135106,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,Universidad 169,San Luis Potosi,San Luis Potosi,Mexico,?,78000,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
132667,23.752697,-99.163359,0101000020957F00005D67BCDDED8157C1222A2DC8D84D...,little pizza Emilio Portes Gil,calle emilio portes gil,victoria,tamaulipas,?,?,?,No_Alcohol_Served,none,informal,completely,low,?,familiar,t,closed,none
132613,23.752903,-99.165076,0101000020957F00008EBA2D06DC8157C194E03B7B504E...,carnitas_mata,lic. Emilio portes gil,victoria,Tamaulipas,Mexico,?,?,No_Alcohol_Served,permitted,informal,completely,medium,?,familiar,t,closed,none
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132866,22.141220,-100.931311,0101000020957F000013871838EC4A58C1B5DF74F8E396...,Chaires,Ricardo B. Anaya,San Luis Potosi,San Luis Potosi,Mexico,?,?,No_Alcohol_Served,not permitted,informal,completely,medium,?,familiar,f,closed,none
135072,22.149192,-101.002936,0101000020957F0000E7B79B1DB94758C1D29BC363D8AA...,Sushi Itto,Venustiano Carranza 1809 C Polanco,San Luis Potosi,SLP,Mexico,?,78220,No_Alcohol_Served,none,informal,no_accessibility,medium,sushi-itto.com.mx,familiar,f,closed,none
135109,18.921785,-99.235350,0101000020957F0000A6BF695F136F5AC1DADF87B20556...,Paniroles,?,?,?,?,?,?,Wine-Beer,not permitted,informal,no_accessibility,medium,?,quiet,f,closed,Internet
135019,18.875011,-99.159422,0101000020957F0000B49B2E5C6E785AC12F9D58435241...,Restaurant Bar Coty y Pablo,Paseo de Las Fuentes 24 Pedregal de Las Fuentes,Jiutepec,Morelos,Mexico,?,?,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,closed,none


Here we have location information for each restaurant. I mentioned earlier that grouping per restaurant might be dangerous, as some restaurants have very few reviews.
By adding information such as city, state and country, we have other categorical variables to group by.
Notice how we use the merge operation from the previous section, but this time specify that our common key is the index.

In [6]:
geo_rating_df = pd.merge(grouped_rating_df, geo_df, left_index=True, right_index=True)
geo_rating_df

placeID,rating,food_rating,service_rating,latitude,longitude,the_geom_meter,name,address,city,state,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
132654,0.250000,0.25,0.250000,23.735523,-99.129588,0101000020957F000040E8F628488557C18224E8B94845...,Carnitas Mata Calle 16 de Septiembre,16 de Septiembre,victoria,tamaulipas,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,closed,none
135040,0.250000,0.25,0.250000,22.135617,-100.969709,0101000020957F00001B552189B84A58C15A2AAEFD2CA2...,Restaurant los Compadres,Camino a Simon Diaz 155 Centro,San Luis Potosi,SLP,...,Wine-Beer,none,informal,no_accessibility,high,?,familiar,f,closed,none
132560,0.500000,1.00,0.250000,23.752304,-99.166913,0101000020957F0000FC60BDA8E88157C1B2C357D6DA4E...,puesto de gorditas,frente al tecnologico,victoria,tamaulipas,...,No_Alcohol_Served,permitted,informal,no_accessibility,low,?,familiar,f,open,none
132663,0.500000,0.50,0.666667,23.752511,-99.166954,0101000020957F0000FDF8D26EE08157C1FEDB6A1FDB4E...,tacos abi,?,victoria,tamaulipas,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,closed,none
135069,0.500000,0.50,0.750000,22.140129,-100.944872,0101000020957F000038E5D546B74A58C18FD29AD0D29A...,Abondance Restaurante Bar,Industrias 908 Valle Dorado,San Luis Potosi,SLP,...,Wine-Beer,none,informal,no_accessibility,low,?,familiar,f,closed,none
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132755,1.800000,2.00,1.600000,22.153324,-101.019546,0101000020957F000026CADE45A14658C1F011EBCA55AF...,La Estrella de Dimas,Av. de los Pintores,San Luis Potosi,S.L.P.,...,No_Alcohol_Served,none,informal,partially,medium,?,familiar,f,closed,variety
132922,1.833333,1.50,1.833333,22.151135,-100.982311,0101000020957F000060A98A38FF4758C146718E41D9A4...,cafe punta del cielo,?,?,?,...,No_Alcohol_Served,permitted,formal,completely,medium,?,familiar,f,closed,none
134986,2.000000,2.00,2.000000,18.928798,-99.239513,0101000020957F00002A0D05E2D96D5AC1AB058CB1EC56...,Restaurant Las Mananitas,Ricardo Linares 107,Cuernavaca,Morelos,...,Wine-Beer,none,formal,no_accessibility,high,lasmananitas.com.mx,familiar,f,closed,none
135034,2.000000,2.00,1.600000,22.140517,-101.021422,0101000020957F000026D92BB4894858C161A7552DA2B0...,Michiko Restaurant Japones,Cordillera de Los Alpes 160 Lomas 2 Seccion,San Luis Potosi,SLP,...,No_Alcohol_Served,none,informal,no_accessibility,medium,?,familiar,f,closed,none


Adding this much data makes things a bit cluttered. Thankfully, we can use pandas to get a list of all our columns.

In [7]:
geo_rating_df.columns

Index(['rating', 'food_rating', 'service_rating', 'latitude', 'longitude',
       'the_geom_meter', 'name', 'address', 'city', 'state', 'country', 'fax',
       'zip', 'alcohol', 'smoking_area', 'dress_code', 'accessibility',
       'price', 'url', 'Rambience', 'franchise', 'area', 'other_services'],
      dtype='object')

How about we try and see if we can find a difference between countries for the ratings?

In [8]:
geo_rating_df.groupby('country')[['rating', 'food_rating', 'service_rating']].mean()

country,rating,food_rating,service_rating
?,1.166045,1.232946,1.069169
Mexico,1.200977,1.229093,1.118162
mexico,1.06266,1.069006,0.900064


Ah, it seems we forgot to do some data cleaning here. Perhaps you could jump in and fix this string problem, and tackle the missing value ('?') while you are at it.
Aside from that, we can see that lower-case mexico is not doing very well. Perhaps the food was so bad they forgot how to write Mexico?
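
One possible way to approach this cleaning step is sketched below on a small made-up frame. The `toy_geo` data is hypothetical, and the choice to trim whitespace, title-case the names and treat '?' as missing is an assumption about what "clean" should mean here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the country column (values are invented for illustration).
toy_geo = pd.DataFrame({
    "country": ["Mexico", "mexico", "?", "Mexico "],
    "rating": [1.5, 1.0, 2.0, 0.5],
})

cleaned = toy_geo.copy()
cleaned["country"] = (
    cleaned["country"]
    .str.strip()            # remove stray whitespace
    .str.title()            # 'mexico' becomes 'Mexico'
    .replace("?", np.nan)   # treat the '?' placeholder as a missing value
)

# groupby leaves out rows with a missing key by default.
country_means = cleaned.groupby("country")["rating"].mean()
print(country_means)
```

After cleaning, the spelling variants collapse into a single group and the '?' rows no longer form a category of their own.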

Jokes aside, do you see the resemblance between this and our rudimentary approach of comparing different categories?
We are slowly getting more efficient with these operations. How about the difference between the types of alcohol service?

In [9]:
geo_rating_df.groupby('alcohol')[['rating', 'food_rating', 'service_rating']].mean()

alcohol,rating,food_rating,service_rating
Full_Bar,1.287124,1.218315,1.170311
No_Alcohol_Served,1.148075,1.19473,1.042417
Wine-Beer,1.231887,1.26184,1.174437


Something we can remark here is that the food rating for locations without alcohol holds up, whilst the general rating and service rating fall behind.
This would suggest that the food rating really is about the food, and that the type of drinks served has little influence on it.

Lastly, we look at the difference between accessibility levels. Does accessibility influence our ratings?

In [10]:
geo_rating_df.groupby('accessibility')[['rating', 'food_rating', 'service_rating']].mean()

accessibility,rating,food_rating,service_rating
completely,1.132494,1.203597,1.049709
no_accessibility,1.196189,1.206242,1.091278
partially,1.275356,1.330294,1.219991


It seems partial accessibility is the way to go here, performing better than complete accessibility.
However, this may well be due to the low sample size of only 9 restaurants, which makes the average prone to variation.

In [11]:
geo_rating_df.accessibility.value_counts()

no_accessibility    76
completely          45
partially            9
Name: accessibility, dtype: int64

You should be getting the hang of it by now. Perhaps you can play some more with the other categories.

There is one thing I would still like to address. You may have noticed that in the beginning I first took the average rating per restaurant, and later took the average of those averages per category.
This is bad practice, as a bad restaurant with one review has the same influence as a good restaurant with 100 reviews. Perhaps you can think of a way to group all reviews from a category instead of the average for each restaurant?
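
To make the difference concrete, here is a minimal sketch on made-up data, comparing the mean of per-restaurant means with the mean over all individual reviews. The `toy_ratings` and `toy_geo` frames are hypothetical. Merging the individual reviews with the location data before grouping is one possible answer, not necessarily the only one.

```python
import pandas as pd

# Toy data: place 1 has a single bad review, place 2 has four good ones.
toy_ratings = pd.DataFrame({
    "placeID": [1, 2, 2, 2, 2],
    "rating":  [0, 2, 2, 2, 2],
})
toy_geo = pd.DataFrame(
    {"alcohol": ["Wine-Beer", "Wine-Beer"]},
    index=pd.Index([1, 2], name="placeID"),
)

# Approach used so far: average per restaurant, then average per category.
per_place = toy_ratings.groupby("placeID")[["rating"]].mean()
mean_of_means = (
    pd.merge(per_place, toy_geo, left_index=True, right_index=True)
    .groupby("alcohol")["rating"]
    .mean()
)

# Alternative: attach the category to every review, then average once.
reviews = pd.merge(toy_ratings, toy_geo, left_on="placeID", right_index=True)
mean_of_reviews = reviews.groupby("alcohol")["rating"].mean()

print(mean_of_means["Wine-Beer"])    # 1.0
print(mean_of_reviews["Wine-Beer"])  # 1.6
```

In the first approach the single bad review counts as much as the four good ones combined. In the second, every review carries equal weight.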

In the previous section we added the cuisine type, perhaps you could do some groupby operations on that too here?