### Pandas Lab -- Grouping & Merging

Welcome to today's lab!  It comes in two parts:  

The first section uses the `groupby` method to answer different questions about our data.  

The second section combines grouping & merging to create summary statistics -- some of the more important features you can add to a dataset for statistical modeling.  
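As a preview of the second part, here is a minimal sketch of the group-then-merge pattern on made-up data. The toy frame and the names `avg_visitors` and `visits_with_avg` are assumptions for illustration only, not part of the lab dataset.

```python
import pandas as pd

# Toy data standing in for the restaurant visits table (values are made up).
visits = pd.DataFrame({
    "id": ["a", "a", "b", "b", "b"],
    "visitors": [10, 20, 5, 7, 9],
})

# Step 1: group to get one summary row per restaurant.
avg_visitors = (
    visits.groupby("id")["visitors"]
    .mean()
    .rename("avg_visitors")
    .reset_index()
)

# Step 2: merge the summary back so every original row carries its group's statistic.
visits_with_avg = visits.merge(avg_visitors, on="id", how="left")
print(visits_with_avg)
```

A left merge keeps every original row, so the new column can be used directly as a feature.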

### Section I - Grouping

In [1]:
import numpy as np
import pandas as pd

In [214]:
pd.options.display.max_rows = 1000

In [2]:
df = pd.read_csv("/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurants.csv")

In [3]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                252108 non-null  object 
 1   visit_date        252108 non-null  object 
 2   visitors          252108 non-null  int64  
 3   calendar_date     252108 non-null  object 
 4   day_of_week       252108 non-null  object 
 5   holiday           252108 non-null  int64  
 6   genre             252108 non-null  object 
 7   area              252108 non-null  object 
 8   latitude          252108 non-null  float64
 9   longitude         252108 non-null  float64
 10  reserve_visitors  108394 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 21.2+ MB


In [9]:
df['visit_date'] = pd.to_datetime(df['visit_date'])

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                252108 non-null  object        
 1   visit_date        252108 non-null  datetime64[ns]
 2   visitors          252108 non-null  int64         
 3   calendar_date     252108 non-null  object        
 4   day_of_week       252108 non-null  object        
 5   holiday           252108 non-null  int64         
 6   genre             252108 non-null  object        
 7   area              252108 non-null  object        
 8   latitude          252108 non-null  float64       
 9   longitude         252108 non-null  float64       
 10  reserve_visitors  108394 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(5)
memory usage: 21.2+ MB


**Question 1:** Which restaurant had the highest total number of visitors across the dataset?

In [9]:
# your answer here

In [4]:
df.groupby('id')['visitors'].sum().max()

18717

In [5]:
a = df.groupby('id')['visitors'].sum().reset_index()

In [6]:
a[(a['visitors']==18717)]

Unnamed: 0,id,visitors
172,air_399904bdb7685ca0,18717


In [16]:
df.iloc[172]

id                          air_ba937bf13d40fb24
visit_date                            2016-08-09
visitors                                      22
calendar_date                         2016-08-09
day_of_week                              Tuesday
holiday                                        0
genre                                 Dining bar
area                Tōkyō-to Minato-ku Shibakōen
latitude                                 35.6581
longitude                                139.752
reserve_visitors                             NaN
Name: 172, dtype: object

In [None]:
air_399904bdb7685ca0

In [None]:
# Proposed solution

In [70]:
visits = df.groupby('id')['visitors'].sum()

In [59]:
visits

id
air_00a91d42b08b08d9    6051
air_0164b9927d20bcc3    1378
air_0241aa3964b7f861    3919
air_0328696196e46f18     921
air_034a3d5b40d5b1b1    3722
                        ... 
air_fea5dc9594450608    3969
air_fee8dcf4d619598e    7496
air_fef9ccb3ba0da2f7    2357
air_ffcc2d5087e1b476    4919
air_fff68b929994bfbd    1369
Name: visitors, Length: 829, dtype: int64

In [62]:
visits.idxmax()

'air_399904bdb7685ca0'

**Question 2:** What was the average difference in attendance between holidays & non-holidays for each restaurant?

In [53]:
# your answer here
?pd.Series.diff

In [49]:
df.groupby(['id','holiday'])['visitors'].mean().diff()

id                    holiday
air_00a91d42b08b08d9  0                NaN
                      1          -5.103896
air_0164b9927d20bcc3  0         -11.708333
                      1          -1.291667
air_0241aa3964b7f861  0           1.883905
                                   ...    
air_fef9ccb3ba0da2f7  1           2.534783
air_ffcc2d5087e1b476  0           8.436975
                      1          -9.436975
air_fff68b929994bfbd  0          -5.906615
                      1          -0.093385
Name: visitors, Length: 1646, dtype: float64

In [34]:
q2 = df.groupby(['id','holiday'])['visitors'].mean().reset_index()

In [41]:
q2[(q2['holiday']==1)]

Unnamed: 0,id,holiday,visitors
1,air_00a91d42b08b08d9,1,21.000000
3,air_0164b9927d20bcc3,1,8.000000
5,air_0241aa3964b7f861,1,10.176471
7,air_0328696196e46f18,1,7.166667
9,air_034a3d5b40d5b1b1,1,14.400000
...,...,...,...
1637,air_fea5dc9594450608,1,14.400000
1639,air_fee8dcf4d619598e,1,28.882353
1641,air_fef9ccb3ba0da2f7,1,12.000000
1643,air_ffcc2d5087e1b476,1,11.000000


In [42]:
q2[(q2['holiday']==0)]

Unnamed: 0,id,holiday,visitors
0,air_00a91d42b08b08d9,0,26.103896
2,air_0164b9927d20bcc3,0,9.291667
4,air_0241aa3964b7f861,0,9.883905
6,air_0328696196e46f18,0,7.981818
8,air_034a3d5b40d5b1b1,0,14.855932
...,...,...,...
1636,air_fea5dc9594450608,0,14.488636
1638,air_fee8dcf4d619598e,0,25.848708
1640,air_fef9ccb3ba0da2f7,0,9.465217
1642,air_ffcc2d5087e1b476,0,20.436975


In [45]:
q2[q2['holiday']==0].set_index('id')['visitors'] - q2[q2['holiday']==1].set_index('id')['visitors']

### Suggested solution Q2:

In [76]:
restauarant_visits = df.groupby(['id','holiday'])['visitors'].mean().reset_index()

In [77]:
restauarant_visits

Unnamed: 0,id,holiday,visitors
0,air_00a91d42b08b08d9,0,26.103896
1,air_00a91d42b08b08d9,1,21.000000
2,air_0164b9927d20bcc3,0,9.291667
3,air_0164b9927d20bcc3,1,8.000000
4,air_0241aa3964b7f861,0,9.883905
...,...,...,...
1641,air_fef9ccb3ba0da2f7,1,12.000000
1642,air_ffcc2d5087e1b476,0,20.436975
1643,air_ffcc2d5087e1b476,1,11.000000
1644,air_fff68b929994bfbd,0,5.093385


In [79]:
restauarant_visits.groupby('id')['visitors'].diff().dropna()

1      -5.103896
3      -1.291667
5       0.292566
7      -0.815152
9      -0.455932
          ...   
1637   -0.088636
1639    3.033644
1641    2.534783
1643   -9.436975
1645   -0.093385
Name: visitors, Length: 817, dtype: float64
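Another way to get at the same differences, sketched on made-up data, is to pivot the holiday flag into columns with `unstack` and subtract. The toy frame and the names `means` and `holiday_diff` are assumptions for illustration.

```python
import pandas as pd

# Toy data: restaurant "c" has no holiday rows at all.
toy = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "c"],
    "holiday": [0, 0, 1, 0, 1, 0],
    "visitors": [10, 20, 12, 8, 11, 5],
})

# One row per restaurant, one column per holiday flag (0 and 1).
means = toy.groupby(["id", "holiday"])["visitors"].mean().unstack("holiday")

# Holiday mean minus non-holiday mean; restaurants missing either side become NaN and are dropped.
holiday_diff = (means[1] - means[0]).dropna()

print(holiday_diff)
print(holiday_diff.mean())  # average difference across restaurants
```

Because the subtraction is between named columns, the sign convention (holiday minus non-holiday) is explicit rather than depending on row order.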

**Question 3:** Can you grab the first 15 rows of dates for each restaurant?  The last 15 rows?

In [2]:
# your answer here

In [43]:
?df.groupby

In [47]:
?df.sort_values

In [51]:
q3 = df.sort_values(by=['id','visit_date'],ascending=[True,True]).groupby('id').head(15)

In [52]:
q3.iloc[0:45,:]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
166836,air_00a91d42b08b08d9,2016-07-01,35,2016-07-01,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166837,air_00a91d42b08b08d9,2016-07-02,9,2016-07-02,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0
166838,air_00a91d42b08b08d9,2016-07-04,20,2016-07-04,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166839,air_00a91d42b08b08d9,2016-07-05,25,2016-07-05,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166840,air_00a91d42b08b08d9,2016-07-06,29,2016-07-06,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166841,air_00a91d42b08b08d9,2016-07-07,34,2016-07-07,Thursday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166842,air_00a91d42b08b08d9,2016-07-08,42,2016-07-08,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166843,air_00a91d42b08b08d9,2016-07-09,11,2016-07-09,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166844,air_00a91d42b08b08d9,2016-07-11,25,2016-07-11,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166845,air_00a91d42b08b08d9,2016-07-12,24,2016-07-12,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,


In [65]:
q3b = (df.sort_values(by=['id','visit_date'],ascending=[True,False]).groupby('id').head(15)).sort_values(by=['id','visit_date'],ascending=[True,True])

In [66]:
q3b.iloc[0:45,:]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
167048,air_00a91d42b08b08d9,2017-04-05,35,2017-04-05,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2.0
167049,air_00a91d42b08b08d9,2017-04-06,29,2017-04-06,Thursday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,8.0
167050,air_00a91d42b08b08d9,2017-04-07,17,2017-04-07,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,1.0
167051,air_00a91d42b08b08d9,2017-04-08,9,2017-04-08,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,33.0
167052,air_00a91d42b08b08d9,2017-04-10,17,2017-04-10,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
167053,air_00a91d42b08b08d9,2017-04-11,43,2017-04-11,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2.0
167054,air_00a91d42b08b08d9,2017-04-12,28,2017-04-12,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2.0
167055,air_00a91d42b08b08d9,2017-04-13,34,2017-04-13,Thursday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,7.0
167056,air_00a91d42b08b08d9,2017-04-14,39,2017-04-14,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0
167057,air_00a91d42b08b08d9,2017-04-17,19,2017-04-17,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,


`(
    df.sort_values(by=['id','visit_date'], ascending=[True,False])
    .groupby('id')
    .head(15)
    .sort_values(by=['id','visit_date'], ascending=[True,True])
)`

So, step by step:

*1:* Sort the values so that, within each `id`, visit date is in descending order

*2:* Group all the rows with the same `id` together

*3:* Take the first 15 rows of each group with `head`, which are the 15 most recent dates

*4:* The result comes back as its own data frame

*5:* Re-sort the resulting groups of 15 rows so that the dates are in ascending order
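The steps above can also be sketched with a single ascending sort, using `head` for the first rows and `tail` for the last ones. This is only an illustration on made-up data; `toy`, `n`, `first_n` and `last_n` are hypothetical names, and `n = 2` stands in for the 15 used in the lab.

```python
import pandas as pd

# Toy data, deliberately out of date order.
toy = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "visit_date": pd.to_datetime([
        "2016-01-03", "2016-01-01", "2016-01-02",
        "2016-02-02", "2016-02-03", "2016-02-01",
    ]),
    "visitors": [3, 1, 2, 20, 30, 10],
})

n = 2  # the lab uses 15

# Sort once, ascending, then take rows from either end of each group.
ordered = toy.sort_values(["id", "visit_date"])
first_n = ordered.groupby("id").head(n)  # earliest n dates per restaurant
last_n = ordered.groupby("id").tail(n)   # latest n dates per restaurant

print(first_n)
print(last_n)
```

Since `tail` keeps the rows in their sorted order, no second sort is needed.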

### Jonathan's approach (Q3):

In [6]:
df.iloc[:15]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
5,air_ba937bf13d40fb24,2016-01-19,9,2016-01-19,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
6,air_ba937bf13d40fb24,2016-01-20,31,2016-01-20,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
7,air_ba937bf13d40fb24,2016-01-21,21,2016-01-21,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
8,air_ba937bf13d40fb24,2016-01-22,18,2016-01-22,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
9,air_ba937bf13d40fb24,2016-01-23,26,2016-01-23,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


In [15]:
restaurant_vals = df.groupby('id').apply(lambda x: x.iloc[:15])

# groupby().apply() adds the group key (id) as an extra outer index level

In [14]:
restaurant_vals[:17]

id,,index,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
air_00a91d42b08b08d9,0,166836,air_00a91d42b08b08d9,2016-07-01,35,2016-07-01,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,1,166837,air_00a91d42b08b08d9,2016-07-02,9,2016-07-02,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0
air_00a91d42b08b08d9,2,166838,air_00a91d42b08b08d9,2016-07-04,20,2016-07-04,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,3,166839,air_00a91d42b08b08d9,2016-07-05,25,2016-07-05,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,4,166840,air_00a91d42b08b08d9,2016-07-06,29,2016-07-06,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,5,166841,air_00a91d42b08b08d9,2016-07-07,34,2016-07-07,Thursday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,6,166842,air_00a91d42b08b08d9,2016-07-08,42,2016-07-08,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,7,166843,air_00a91d42b08b08d9,2016-07-09,11,2016-07-09,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,8,166844,air_00a91d42b08b08d9,2016-07-11,25,2016-07-11,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,9,166845,air_00a91d42b08b08d9,2016-07-12,24,2016-07-12,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,


In [19]:
restaurant_vals_rev = df.groupby('id').apply(lambda x: x.iloc[-15:])

In [20]:
restaurant_vals_rev[:20]

id,,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
air_00a91d42b08b08d9,167053,air_00a91d42b08b08d9,2017-04-11,43,2017-04-11,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2.0
air_00a91d42b08b08d9,167054,air_00a91d42b08b08d9,2017-04-12,28,2017-04-12,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,2.0
air_00a91d42b08b08d9,167055,air_00a91d42b08b08d9,2017-04-13,34,2017-04-13,Thursday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,7.0
air_00a91d42b08b08d9,167056,air_00a91d42b08b08d9,2017-04-14,39,2017-04-14,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0
air_00a91d42b08b08d9,167057,air_00a91d42b08b08d9,2017-04-17,19,2017-04-17,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,167058,air_00a91d42b08b08d9,2017-04-18,35,2017-04-18,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,167059,air_00a91d42b08b08d9,2017-04-19,17,2017-04-19,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
air_00a91d42b08b08d9,167060,air_00a91d42b08b08d9,2017-04-20,38,2017-04-20,Thursday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,1.0
air_00a91d42b08b08d9,167061,air_00a91d42b08b08d9,2017-04-21,55,2017-04-21,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,6.0
air_00a91d42b08b08d9,167062,air_00a91d42b08b08d9,2017-04-22,18,2017-04-22,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,37.0


**Question 4:** Grab the quarterly sales for each individual restaurant within our dataset.

In [3]:
# your answer here

In [69]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


In [138]:
df['visitors'].isnull().sum()

0

In [137]:
df['reserve_visitors'].isnull().sum()

143714

In [None]:
# Interpreting quarterly sales as the number of visitors each quarter

In [70]:
df.visit_date.dt.year

0         2016
1         2016
2         2016
3         2016
4         2016
          ... 
252103    2017
252104    2017
252105    2017
252106    2017
252107    2017
Name: visit_date, Length: 252108, dtype: int64

In [68]:
df.visit_date.dt.quarter

0         1
1         1
2         1
3         1
4         1
         ..
252103    2
252104    2
252105    1
252106    1
252107    2
Name: visit_date, Length: 252108, dtype: int64

In [126]:
q4 = df.groupby(by=['id',df.visit_date.dt.year,df.visit_date.dt.quarter])['visitors'].sum()

`df.groupby(by=['id',df.visit_date.dt.year,df.visit_date.dt.quarter])['visitors'].sum()`
The above runs and produces the expected output.

The problem comes when resetting the index: the year and quarter levels are both named `visit_date`, so `reset_index()` fails because it cannot create two columns with the same name.
The index levels therefore need to be renamed first.


In the end that was simple: `q4df.index.names = ['id','visit_year','visit_date']`. Once the level names are unique, `reset_index()` turns all three levels into ordinary columns.
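One way to avoid the renaming step altogether is to name the year and quarter Series before grouping, since `groupby` uses each Series' name as the index level name. A minimal sketch on made-up data; `visit_year`, `visit_quarter` and `quarterly_visits` are names chosen here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "visit_date": pd.to_datetime(["2016-01-15", "2016-02-15", "2016-04-15", "2017-07-01"]),
    "visitors": [10, 20, 30, 40],
})

# Give each grouping key its own name so the index levels do not collide.
year = toy["visit_date"].dt.year.rename("visit_year")
quarter = toy["visit_date"].dt.quarter.rename("visit_quarter")

quarterly = (
    toy.groupby(["id", year, quarter])["visitors"]
    .sum()
    .reset_index(name="quarterly_visits")  # name= labels the summed values column
)
print(quarterly)
```

This leaves the original frame untouched, whereas adding `year` and `quarter` columns to it is equally valid and arguably easier to read.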


In [140]:
# Select 'visitors' before aggregating so that only that column is summed
df \
.groupby(by=['id',df.visit_date.dt.year,df.visit_date.dt.quarter])['visitors'].sum()

id                    visit_date  visit_date
air_00a91d42b08b08d9  2016        3             1780
                                  4             1740
                      2017        1             2041
                                  2              490
air_0164b9927d20bcc3  2016        4              627
                                                ... 
air_ffcc2d5087e1b476  2017        2              390
air_fff68b929994bfbd  2016        3              404
                                  4              452
                      2017        1              411
                                  2              102
Name: visitors, Length: 3903, dtype: int64

In [110]:
q4

id                    visit_date  visit_date
air_00a91d42b08b08d9  2016        3             1780
                                  4             1740
                      2017        1             2041
                                  2              490
air_0164b9927d20bcc3  2016        4              627
                                                ... 
air_ffcc2d5087e1b476  2017        2              390
air_fff68b929994bfbd  2016        3              404
                                  4              452
                      2017        1              411
                                  2              102
Name: visitors, Length: 3903, dtype: int64

In [111]:
type(q4)

pandas.core.series.Series

In [94]:
?pd.DataFrame

In [115]:
q4df = q4.to_frame('quarterly_visits')

In [157]:
q4df

id,visit_date,visit_date,quarterly_visits
air_00a91d42b08b08d9,2016,3,1780
air_00a91d42b08b08d9,2016,4,1740
air_00a91d42b08b08d9,2017,1,2041
air_00a91d42b08b08d9,2017,2,490
air_0164b9927d20bcc3,2016,4,627
...,...,...,...
air_ffcc2d5087e1b476,2017,2,390
air_fff68b929994bfbd,2016,3,404
air_fff68b929994bfbd,2016,4,452
air_fff68b929994bfbd,2017,1,411


In [166]:
q4df.index

MultiIndex([('air_00a91d42b08b08d9', 2016, 3),
            ('air_00a91d42b08b08d9', 2016, 4),
            ('air_00a91d42b08b08d9', 2017, 1),
            ('air_00a91d42b08b08d9', 2017, 2),
            ('air_0164b9927d20bcc3', 2016, 4),
            ('air_0164b9927d20bcc3', 2017, 1),
            ('air_0164b9927d20bcc3', 2017, 2),
            ('air_0241aa3964b7f861', 2016, 1),
            ('air_0241aa3964b7f861', 2016, 2),
            ('air_0241aa3964b7f861', 2016, 3),
            ...
            ('air_fef9ccb3ba0da2f7', 2017, 1),
            ('air_fef9ccb3ba0da2f7', 2017, 2),
            ('air_ffcc2d5087e1b476', 2016, 3),
            ('air_ffcc2d5087e1b476', 2016, 4),
            ('air_ffcc2d5087e1b476', 2017, 1),
            ('air_ffcc2d5087e1b476', 2017, 2),
            ('air_fff68b929994bfbd', 2016, 3),
            ('air_fff68b929994bfbd', 2016, 4),
            ('air_fff68b929994bfbd', 2017, 1),
            ('air_fff68b929994bfbd', 2017, 2)],
           names=['id', 'visit_date', 'visi

In [172]:
q4df.index.names = ['id','visit_year','visit_date']

In [173]:
q4df.index

MultiIndex([('air_00a91d42b08b08d9', 2016, 3),
            ('air_00a91d42b08b08d9', 2016, 4),
            ('air_00a91d42b08b08d9', 2017, 1),
            ('air_00a91d42b08b08d9', 2017, 2),
            ('air_0164b9927d20bcc3', 2016, 4),
            ('air_0164b9927d20bcc3', 2017, 1),
            ('air_0164b9927d20bcc3', 2017, 2),
            ('air_0241aa3964b7f861', 2016, 1),
            ('air_0241aa3964b7f861', 2016, 2),
            ('air_0241aa3964b7f861', 2016, 3),
            ...
            ('air_fef9ccb3ba0da2f7', 2017, 1),
            ('air_fef9ccb3ba0da2f7', 2017, 2),
            ('air_ffcc2d5087e1b476', 2016, 3),
            ('air_ffcc2d5087e1b476', 2016, 4),
            ('air_ffcc2d5087e1b476', 2017, 1),
            ('air_ffcc2d5087e1b476', 2017, 2),
            ('air_fff68b929994bfbd', 2016, 3),
            ('air_fff68b929994bfbd', 2016, 4),
            ('air_fff68b929994bfbd', 2017, 1),
            ('air_fff68b929994bfbd', 2017, 2)],
           names=['id', 'visit_year', 'visi

In [175]:
q4df_final = q4df.reset_index()

In [176]:
q4df_final

Unnamed: 0,id,visit_year,visit_date,quarterly_visits
0,air_00a91d42b08b08d9,2016,3,1780
1,air_00a91d42b08b08d9,2016,4,1740
2,air_00a91d42b08b08d9,2017,1,2041
3,air_00a91d42b08b08d9,2017,2,490
4,air_0164b9927d20bcc3,2016,4,627
...,...,...,...,...
3898,air_ffcc2d5087e1b476,2017,2,390
3899,air_fff68b929994bfbd,2016,3,404
3900,air_fff68b929994bfbd,2016,4,452
3901,air_fff68b929994bfbd,2017,1,411


### Jonathan's approach (Q4):

In [21]:
df.groupby(['id',df.visit_date.dt.year, df.visit_date.dt.quarter])['visitors'].sum()

id                    visit_date  visit_date
air_00a91d42b08b08d9  2016        3             1780
                                  4             1740
                      2017        1             2041
                                  2              490
air_0164b9927d20bcc3  2016        4              627
                                                ... 
air_ffcc2d5087e1b476  2017        2              390
air_fff68b929994bfbd  2016        3              404
                                  4              452
                      2017        1              411
                                  2              102
Name: visitors, Length: 3903, dtype: int64

In [22]:
df.groupby(['id',df.visit_date.dt.year, df.visit_date.dt.quarter])['visitors'].sum().index

MultiIndex([('air_00a91d42b08b08d9', 2016, 3),
            ('air_00a91d42b08b08d9', 2016, 4),
            ('air_00a91d42b08b08d9', 2017, 1),
            ('air_00a91d42b08b08d9', 2017, 2),
            ('air_0164b9927d20bcc3', 2016, 4),
            ('air_0164b9927d20bcc3', 2017, 1),
            ('air_0164b9927d20bcc3', 2017, 2),
            ('air_0241aa3964b7f861', 2016, 1),
            ('air_0241aa3964b7f861', 2016, 2),
            ('air_0241aa3964b7f861', 2016, 3),
            ...
            ('air_fef9ccb3ba0da2f7', 2017, 1),
            ('air_fef9ccb3ba0da2f7', 2017, 2),
            ('air_ffcc2d5087e1b476', 2016, 3),
            ('air_ffcc2d5087e1b476', 2016, 4),
            ('air_ffcc2d5087e1b476', 2017, 1),
            ('air_ffcc2d5087e1b476', 2017, 2),
            ('air_fff68b929994bfbd', 2016, 3),
            ('air_fff68b929994bfbd', 2016, 4),
            ('air_fff68b929994bfbd', 2017, 1),
            ('air_fff68b929994bfbd', 2017, 2)],
           names=['id', 'visit_date', 'visi

In [24]:
type(df.groupby(['id',df.visit_date.dt.year, df.visit_date.dt.quarter])['visitors'].sum())

# A series is what comes back - it's a series with an index 3 levels deep.
# The groupby creates index levels rather than columns. It helps to organise the data

pandas.core.series.Series

In [None]:
# Getting it into a data frame

In [25]:
df['year'] = df['visit_date'].dt.year
df['quarter'] = df['visit_date'].dt.quarter

In [26]:
df.groupby(['id','year','quarter'])['visitors'].sum().reset_index()

Unnamed: 0,id,year,quarter,visitors
0,air_00a91d42b08b08d9,2016,3,1780
1,air_00a91d42b08b08d9,2016,4,1740
2,air_00a91d42b08b08d9,2017,1,2041
3,air_00a91d42b08b08d9,2017,2,490
4,air_0164b9927d20bcc3,2016,4,627
...,...,...,...,...
3898,air_ffcc2d5087e1b476,2017,2,390
3899,air_fff68b929994bfbd,2016,3,404
3900,air_fff68b929994bfbd,2016,4,452
3901,air_fff68b929994bfbd,2017,1,411


**Question 6:** Which restaurant had the highest number of reservations?

In [4]:
# your answer here

In [183]:
df.groupby('id')['visitors'].sum().sort_values(ascending=False)


id
air_399904bdb7685ca0    18717
air_f26f36ec4dc5adb0    18577
air_e55abd740f93ecc4    18101
air_99157b6163835eec    18097
air_5c817ef28f236bdf    18009
                        ...  
air_9dd7d38b0f1760c4      803
air_5b704df317ed1962      800
air_fdcfef8bd859f650      625
air_bbe1c1a47e09f161      581
air_a21ffca0bea1661a      190
Name: visitors, Length: 829, dtype: int64

In [None]:
air_399904bdb7685ca0

In [187]:
df.groupby('id')['visitors'].sum().idxmax()

'air_399904bdb7685ca0'

In [None]:
# Jonathan's answers

In [29]:
df.groupby('id')['visitors'].sum().idxmax()

'air_399904bdb7685ca0'
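The answers above rank restaurants by `visitors`. If the question is read literally as reservations, the same pattern could be applied to `reserve_visitors` instead. That reading is an assumption, and the sketch below uses made-up data with hypothetical names `reservations` and `top_id`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "a", "b", "b", "c"],
    "visitors": [50, 60, 5, 6, 7],
    "reserve_visitors": [np.nan, 2.0, 10.0, 4.0, np.nan],
})

# sum() skips NaN, so a restaurant with no recorded reservations totals 0.
reservations = toy.groupby("id")["reserve_visitors"].sum()

# idxmax() returns the index label (the restaurant id) of the largest total.
top_id = reservations.idxmax()
print(top_id)
```

Note that the restaurant with the most visitors need not be the one with the most reservations, as the toy data shows.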

**Question 7:** What is the total number of missing entries for each restaurant?  

In [5]:
# your answer here

In [191]:
(df['visit_date'].max()-df['visit_date'].min()).days

477

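So the data spans 477 days, which the cells below treat as the maximum number of entries per restaurant (counting both the first and last date would give 478).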

In [199]:
df.groupby('id').count()['visit_date']

id
air_00a91d42b08b08d9    232
air_0164b9927d20bcc3    149
air_0241aa3964b7f861    396
air_0328696196e46f18    116
air_034a3d5b40d5b1b1    251
                       ... 
air_fea5dc9594450608    274
air_fee8dcf4d619598e    288
air_fef9ccb3ba0da2f7    245
air_ffcc2d5087e1b476    243
air_fff68b929994bfbd    269
Name: visit_date, Length: 829, dtype: int64

In [202]:
q7 = df.groupby('id').apply(lambda x: (df['visit_date'].max()-df['visit_date'].min()).days - x.count()['visit_date'])

In [203]:
q7.reset_index()

Unnamed: 0,id,0
0,air_00a91d42b08b08d9,245
1,air_0164b9927d20bcc3,328
2,air_0241aa3964b7f861,81
3,air_0328696196e46f18,361
4,air_034a3d5b40d5b1b1,226
...,...,...
824,air_fea5dc9594450608,203
825,air_fee8dcf4d619598e,189
826,air_fef9ccb3ba0da2f7,232
827,air_ffcc2d5087e1b476,234
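The same idea can be sketched without `apply`, by computing the window length once and subtracting a per-restaurant count. This is an illustration on made-up data; it counts both the first and last date (one more than the plain day difference used above), which is an assumption about how "missing" should be defined. `n_dates`, `observed` and `missing_dates` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "visit_date": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-05", "2016-01-03"]),
})

# Number of calendar dates in the overall window, counting both endpoints.
n_dates = (toy["visit_date"].max() - toy["visit_date"].min()).days + 1

# Distinct dates on which each restaurant actually has a row.
observed = toy.groupby("id")["visit_date"].nunique()

# Dates in the window with no row for that restaurant.
missing_dates = (n_dates - observed).rename("missing_dates")
print(missing_dates)
```

Computing the window length outside the groupby avoids recalculating it for every restaurant.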


In [None]:
# Jonathan's answer

In [30]:
df.isnull().sum()

id                       0
visit_date               0
visitors                 0
calendar_date            0
day_of_week              0
holiday                  0
genre                    0
area                     0
latitude                 0
longitude                0
reserve_visitors    143714
year                     0
quarter                  0
dtype: int64

In [31]:
# Can't do this
df.groupby('id').isnull().sum()

AttributeError: 'DataFrameGroupBy' object has no attribute 'isnull'

In [35]:
df.groupby('id').apply(lambda x: x.isnull().sum())

id,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,year,quarter
air_00a91d42b08b08d9,0,0,0,0,0,0,0,0,0,0,122,0,0
air_0164b9927d20bcc3,0,0,0,0,0,0,0,0,0,0,50,0,0
air_0241aa3964b7f861,0,0,0,0,0,0,0,0,0,0,249,0,0
air_0328696196e46f18,0,0,0,0,0,0,0,0,0,0,50,0,0
air_034a3d5b40d5b1b1,0,0,0,0,0,0,0,0,0,0,130,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
air_fea5dc9594450608,0,0,0,0,0,0,0,0,0,0,139,0,0
air_fee8dcf4d619598e,0,0,0,0,0,0,0,0,0,0,149,0,0
air_fef9ccb3ba0da2f7,0,0,0,0,0,0,0,0,0,0,123,0,0
air_ffcc2d5087e1b476,0,0,0,0,0,0,0,0,0,0,124,0,0


In [36]:

df.groupby('id').apply(lambda x: x.isnull().sum())['reserve_visitors']

id
air_00a91d42b08b08d9    122
air_0164b9927d20bcc3     50
air_0241aa3964b7f861    249
air_0328696196e46f18     50
air_034a3d5b40d5b1b1    130
                       ... 
air_fea5dc9594450608    139
air_fee8dcf4d619598e    149
air_fef9ccb3ba0da2f7    123
air_ffcc2d5087e1b476    124
air_fff68b929994bfbd    132
Name: reserve_visitors, Length: 829, dtype: int64

In [38]:
# This adds it up for each of the individual grouped data sets.
df.groupby('id').apply(lambda x: x.isnull().sum().sum()) 

id
air_00a91d42b08b08d9    122
air_0164b9927d20bcc3     50
air_0241aa3964b7f861    249
air_0328696196e46f18     50
air_034a3d5b40d5b1b1    130
                       ... 
air_fea5dc9594450608    139
air_fee8dcf4d619598e    149
air_fef9ccb3ba0da2f7    123
air_ffcc2d5087e1b476    124
air_fff68b929994bfbd    132
Length: 829, dtype: int64
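Since `isnull` is not available on the grouped object, another option is to flip the order: compute the null mask first, then group the mask by `id`. A minimal sketch on made-up data, with `missing_per_column` and `missing_total` as hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "a", "b", "b", "b"],
    "visitors": [1, 2, 3, 4, 5],
    "reserve_visitors": [np.nan, 1.0, np.nan, np.nan, 2.0],
})

# Boolean mask of missing cells, grouped by the id Series (aligned on the row index).
missing_per_column = toy.drop(columns="id").isna().groupby(toy["id"]).sum()

# Total missing cells per restaurant across all columns.
missing_total = missing_per_column.sum(axis=1)

print(missing_per_column)
print(missing_total)
```

This should give the same counts as the `apply` version while avoiding a Python-level lambda call for every group.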

**Question 8:**  Create two variables, `train` and `test`.  Make `train` a dataset that contains all but the **last 15 rows** for each restaurant.  Make `test` the last 15 rows for each restaurant.

In [6]:
# your answer here

In [215]:
(df.assign(rn=df.sort_values(['visit_date']).groupby('id').cumcount()+1)).sort_values(by=['id','rn']).iloc[:500]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,rn
166836,air_00a91d42b08b08d9,2016-07-01,35,2016-07-01,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,1
166837,air_00a91d42b08b08d9,2016-07-02,9,2016-07-02,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0,2
166838,air_00a91d42b08b08d9,2016-07-04,20,2016-07-04,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,3
166839,air_00a91d42b08b08d9,2016-07-05,25,2016-07-05,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,4
166840,air_00a91d42b08b08d9,2016-07-06,29,2016-07-06,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,5
166841,air_00a91d42b08b08d9,2016-07-07,34,2016-07-07,Thursday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,6
166842,air_00a91d42b08b08d9,2016-07-08,42,2016-07-08,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,7
166843,air_00a91d42b08b08d9,2016-07-09,11,2016-07-09,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,8
166844,air_00a91d42b08b08d9,2016-07-11,25,2016-07-11,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,9
166845,air_00a91d42b08b08d9,2016-07-12,24,2016-07-12,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,,10


In [216]:
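# Number each restaurant's rows in date order (rn = 1 is the earliest visit), then sort by restaurant and row number.
q8 = (df.assign(rn=df.sort_values(['visit_date']).groupby('id').cumcount()+1)).sort_values(by=['id','rn'])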

In [222]:
q8.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 252108 entries, 166836 to 216647
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                252108 non-null  object        
 1   visit_date        252108 non-null  datetime64[ns]
 2   visitors          252108 non-null  int64         
 3   calendar_date     252108 non-null  object        
 4   day_of_week       252108 non-null  object        
 5   holiday           252108 non-null  int64         
 6   genre             252108 non-null  object        
 7   area              252108 non-null  object        
 8   latitude          252108 non-null  float64       
 9   longitude         252108 non-null  float64       
 10  reserve_visitors  108394 non-null  float64       
 11  rn                252108 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(3), object(5)
memory usage: 35.0+ MB


In [234]:
# Largest row number per restaurant, i.e. how many rows each restaurant has.
q8b = q8.groupby('id')['rn'].max().to_frame('max_rows').reset_index()

In [235]:
q8b

Unnamed: 0,id,max_rows
0,air_00a91d42b08b08d9,232
1,air_0164b9927d20bcc3,149
2,air_0241aa3964b7f861,396
3,air_0328696196e46f18,116
4,air_034a3d5b40d5b1b1,251
5,air_036d4f1ee7285390,281
6,air_0382c794b73b51ad,298
7,air_03963426c9312048,429
8,air_04341b588bde96cd,472
9,air_049f6d5b402a31b2,258


In [239]:
q8c = q8.merge(q8b,how="left",on="id")

In [240]:
q8c.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 252108 entries, 0 to 252107
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                252108 non-null  object        
 1   visit_date        252108 non-null  datetime64[ns]
 2   visitors          252108 non-null  int64         
 3   calendar_date     252108 non-null  object        
 4   day_of_week       252108 non-null  object        
 5   holiday           252108 non-null  int64         
 6   genre             252108 non-null  object        
 7   area              252108 non-null  object        
 8   latitude          252108 non-null  float64       
 9   longitude         252108 non-null  float64       
 10  reserve_visitors  108394 non-null  float64       
 11  rn                252108 non-null  int64         
 12  max_rows          252108 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(5)
memor

In [243]:
(q8c['rn'] <=  q8c['max_rows'] -15)

0          True
1          True
2          True
3          True
4          True
          ...  
252103    False
252104    False
252105    False
252106    False
252107    False
Length: 252108, dtype: bool

In [249]:
?np.where

In [250]:
# Label each restaurant's last 15 rows as 'test' and everything before them as 'train'.
q8c['train_test'] = np.where(q8c['rn'] <= q8c['max_rows'] - 15, 'train', 'test')
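
The question asks for two variables, `train` and `test`, while the cells here stop at a `train_test` label column. A minimal sketch of the last step, run on a small made-up frame (the ids `a`, `b`, `c` and their row counts are hypothetical stand-ins for the restaurant data), might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three "restaurants" with 20, 18 and 16 visits each.
sizes = {"a": 20, "b": 18, "c": 16}
toy = pd.concat(
    [pd.DataFrame({"id": rid, "visit_date": pd.date_range("2016-01-01", periods=n)})
     for rid, n in sizes.items()],
    ignore_index=True,
)

# Same idea as the cells above: row number within each id, in date order.
toy = toy.sort_values(["id", "visit_date"]).reset_index(drop=True)
toy["rn"] = toy.groupby("id").cumcount() + 1

# transform("max") broadcasts the group size back to every row, so no merge is needed.
toy["max_rows"] = toy.groupby("id")["rn"].transform("max")
toy["train_test"] = np.where(toy["rn"] <= toy["max_rows"] - 15, "train", "test")

# Split on the label to get the two variables the question asks for.
train = toy[toy["train_test"] == "train"].copy()
test = toy[toy["train_test"] == "test"].copy()
```

Using `transform` here is a design choice rather than a requirement; the merge-based route in the cells above produces the same `max_rows` column.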

In [264]:
# q8c.iloc[48574]
q8c.iloc[208574]

id                              air_d1418d6fd6d634f2
visit_date                       2017-02-21 00:00:00
visitors                                          10
calendar_date                             2017-02-21
day_of_week                                  Tuesday
holiday                                            0
genre                                        Izakaya
area                Hyōgo-ken Kōbe-shi Motomachidōri
latitude                                     34.6882
longitude                                    135.187
reserve_visitors                                  16
rn                                               214
max_rows                                         274
train_test                                     train
Name: 208574, dtype: object

In [261]:
q8c[q8c['id']=='air_00a91d42b08b08d9']['train_test'].value_counts()

train    217
test      15
Name: train_test, dtype: int64

In [263]:
q8c[q8c['id']=='air_36429b5ca4407b3e']['train_test'].value_counts()

train    245
test      15
Name: train_test, dtype: int64

In [265]:
q8c[q8c['id']=='air_d1418d6fd6d634f2']['train_test'].value_counts()

train    259
test      15
Name: train_test, dtype: int64

In [279]:
# Count the rows per restaurant in each split.
q8t = q8c.groupby(['id','train_test'])['visit_date'].count().reset_index()

In [280]:
q8t

Unnamed: 0,id,train_test,visit_date
0,air_00a91d42b08b08d9,test,15
1,air_00a91d42b08b08d9,train,217
2,air_0164b9927d20bcc3,test,15
3,air_0164b9927d20bcc3,train,134
4,air_0241aa3964b7f861,test,15
...,...,...,...
1653,air_fef9ccb3ba0da2f7,train,230
1654,air_ffcc2d5087e1b476,test,15
1655,air_ffcc2d5087e1b476,train,228
1656,air_fff68b929994bfbd,test,15


In [287]:
q8t.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1658 entries, 0 to 1657
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          1658 non-null   object
 1   train_test  1658 non-null   object
 2   visit_date  1658 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 39.0+ KB


In [294]:
# Confirm that every restaurant contributes exactly 15 test rows.
(q8t.loc[q8t['train_test']=='test','visit_date'] == 15).value_counts()

True    829
Name: visit_date, dtype: int64

In [300]:
# The smallest restaurant has 20 observations, so every restaurant keeps at least 5 training rows.
q8b['max_rows'].min()

20
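
As an alternative to numbering rows and merging, `GroupBy.tail` can pick out the last rows of each group directly. The sketch below uses a made-up two-restaurant frame (hypothetical ids and sizes) and assumes the index is unique, since `drop` removes rows by index label:

```python
import pandas as pd

# Hypothetical toy data: two "restaurants" with 20 and 18 visits.
sizes = {"a": 20, "b": 18}
toy = pd.concat(
    [pd.DataFrame({"id": rid, "visit_date": pd.date_range("2016-01-01", periods=n)})
     for rid, n in sizes.items()],
    ignore_index=True,  # guarantees a unique index, which drop() relies on below
)

# Sort so that "last 15 rows" means "15 most recent visits" within each id.
ordered = toy.sort_values(["id", "visit_date"])

# tail(15) keeps the final 15 rows of every group and preserves the original index.
test = ordered.groupby("id").tail(15)

# Everything that is not in test becomes train.
train = ordered.drop(test.index)
```

This only splits cleanly because every restaurant has more than 15 rows; a group with 15 or fewer rows would end up entirely in `test`.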

### Grouping & Merging

In this section of the lab, we are going to create group-level summary statistics and attach them to every row, so that each individual observation can be compared with the statistic for its larger group.

**Bonus:** For a more realistic setup, compute each grouping from the `train` variable you just created instead of the entire `df`, then use the grouping's values to populate both the training and test sets.
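
One way to read the bonus is sketched below: the statistics are computed from `train` only and then merged onto both sets. The two small frames and the `restaurant-` column prefix are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the train and test variables from Question 8.
train = pd.DataFrame({"id": ["a", "a", "a", "b", "b"],
                      "visitors": [10, 20, 30, 4, 6]})
test = pd.DataFrame({"id": ["a", "b"],
                     "visitors": [100, 50]})

# Per-restaurant statistics computed from the training rows only.
stats = (train.groupby("id")["visitors"]
              .agg(["mean", "median", "std"])
              .add_prefix("restaurant-"))

# The same lookup table populates both sets, so test rows never
# contribute to their own features.
train_feat = train.merge(stats, left_on="id", right_index=True, how="left")
test_feat = test.merge(stats, left_on="id", right_index=True, how="left")
```

A restaurant that appears in `test` but not in `train` would get `NaN` for these columns because of the left join.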

Use the technique discussed in class to create columns for the following stats:

**Question 1:** Create columns that list the average, median, and standard deviation of visitors for each restaurant.

In [7]:
# your answer here

In [39]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,year,quarter
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1


In [41]:
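# Aggregate by name ('mean', 'median', 'std'), then rename the columns so they stay unambiguous after merging.
restaurant_stats = df.groupby('id')['visitors'].agg(['mean','median','std']).rename({'mean':'restaurant-mean','median':'restaurant-median','std':'restaurant-standard'},axis=1)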

In [42]:
restaurant_stats.head()

Unnamed: 0_level_0,restaurant-mean,restaurant-median,restaurant-standard
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
air_00a91d42b08b08d9,26.081897,26.0,12.435364
air_0164b9927d20bcc3,9.248322,8.0,6.34898
air_0241aa3964b7f861,9.896465,9.0,6.214877
air_0328696196e46f18,7.939655,6.0,6.733807
air_034a3d5b40d5b1b1,14.828685,12.0,13.154107


In [44]:
df.merge(restaurant_stats,left_on='id',right_index=True,how='left')

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,year,quarter,restaurant-mean,restaurant-median,restaurant-standard
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,22.782609,22.0,11.810526
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,22.782609,22.0,11.810526
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,22.782609,22.0,11.810526
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,22.782609,22.0,11.810526
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,22.782609,22.0,11.810526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0,2017,2,44.595745,43.0,24.796265
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0,2017,2,44.595745,43.0,24.796265
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0,2017,1,44.595745,43.0,24.796265
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0,2017,1,44.595745,43.0,24.796265


**Question 2:** Create columns that list the average and median number of visitors for each restaurant on each day of the week.

In [8]:
# your answer here

In [46]:
# One row per (restaurant, day of week) pair.
day_of_week_stats = df.groupby(['id','day_of_week'])['visitors'].agg(['mean','median']).rename({'mean':'day_of_week-mean','median':'day_of_week-median'},axis=1)

In [49]:
day_of_week_stats.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,day_of_week-mean,day_of_week-median
id,day_of_week,Unnamed: 2_level_1,Unnamed: 3_level_1
air_00a91d42b08b08d9,Friday,36.5,35.5
air_00a91d42b08b08d9,Monday,22.457143,19.0
air_00a91d42b08b08d9,Saturday,14.973684,11.0
air_00a91d42b08b08d9,Sunday,2.0,2.0
air_00a91d42b08b08d9,Thursday,29.868421,30.0
air_00a91d42b08b08d9,Tuesday,24.35,24.5
air_00a91d42b08b08d9,Wednesday,28.125,28.0
air_0164b9927d20bcc3,Friday,11.464286,10.5
air_0164b9927d20bcc3,Monday,7.5,6.0
air_0164b9927d20bcc3,Saturday,6.409091,4.5


In [47]:
df.merge(day_of_week_stats,left_on=['id','day_of_week'],right_index=True, how='left')

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,year,quarter,day_of_week-mean,day_of_week-median
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,23.843750,25.0
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,20.292308,21.0
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,34.738462,35.0
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,27.651515,27.0
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,13.754386,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0,2017,2,68.428571,65.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0,2017,2,57.285714,60.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0,2017,1,44.000000,34.5
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0,2017,1,26.333333,27.5
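
The group-then-merge pattern can also be written with `GroupBy.transform`, which returns a result aligned to the original rows and so skips the merge. This is an alternative sketch, not necessarily the technique covered in class, and the toy values below are made up:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lab frame.
toy = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "day_of_week": ["Monday", "Monday", "Tuesday", "Monday", "Monday"],
    "visitors": [10, 20, 7, 4, 6],
})

grouped = toy.groupby(["id", "day_of_week"])["visitors"]

# transform() returns one value per original row, in the original order.
toy["day_of_week-mean"] = grouped.transform("mean")
toy["day_of_week-median"] = grouped.transform("median")
```

The merge route is still the better fit when the statistics come from a different frame than the one being populated, as in the bonus.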


**Question 3:** Create columns that display the average and median number of visitors for each genre in each city on each day of the week.  To do this, first create a column called `city` that captures the first value of `area`.  Values should be `Tokyo`, `Hiroshima`, etc.  **Hint:** You should use the `str` attribute combined with `split` in order to do this.

In [9]:
# your answer here

In [51]:
df.head(5)

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,year,quarter
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1


In [56]:
# Take the first space-separated token of `area`, using the `str` accessor as the hint suggests.
df['city'] = df['area'].str.split(' ').str[0]

In [59]:
df['city'].value_counts()

Tōkyō-to         133063
Fukuoka-ken       39645
Ōsaka-fu          22821
Hyōgo-ken         17846
Hokkaidō          13055
Hiroshima-ken      9858
Miyagi-ken         5959
Shizuoka-ken       5798
Niigata-ken        4063
Name: city, dtype: int64
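
The values above keep their administrative suffixes and macrons (`Tōkyō-to` rather than `Tokyo`). If the plain names from the question are wanted, one possible cleanup is sketched below. It assumes the only suffixes are `-to`, `-fu` and `-ken`, as in the counts above; `strip_macrons` is a hypothetical helper, and the part of each sample string after the first token is a placeholder.

```python
import unicodedata
import pandas as pd

# First tokens taken from the value counts above; the rest of each string is a placeholder.
area = pd.Series([
    "Tōkyō-to Minato-ku Shibakōen",
    "Hyōgo-ken Kōbe-shi Kumoidōri",
    "Hokkaidō X-shi Y",
    "Ōsaka-fu X-shi Y",
])

def strip_macrons(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# First token of the area, as in the lab cell.
first_token = area.str.split(" ").str[0]

# Remove a trailing -to / -fu / -ken, then strip the macrons.
city = first_token.str.replace(r"-(to|fu|ken)$", "", regex=True).map(strip_macrons)
```

Whether the grouping uses the raw or the cleaned label makes no difference to the statistics, since the mapping is one-to-one for these values.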

In [62]:
# Note: this groups by genre and city only; day_of_week is not part of the key here.
city_genre_stats = df.groupby(['genre','city'])['visitors'].agg(['mean','median']).rename({'mean':'genre_city-mean','median':'genre_city-median'},axis=1)

In [63]:
df.merge(city_genre_stats,left_on=['genre','city'],right_index=True, how='left')

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,year,quarter,city,genre_city-mean,genre_city-median
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to,17.928582,14.0
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to,17.928582,14.0
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to,17.928582,14.0
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to,17.928582,14.0
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to,17.928582,14.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0,2017,2,Hyōgo-ken,21.655303,17.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0,2017,2,Hyōgo-ken,21.655303,17.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0,2017,1,Hyōgo-ken,21.655303,17.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0,2017,1,Hyōgo-ken,21.655303,17.0
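
Question 3 asks for statistics per genre, per city, and per day of the week, while the grouping above uses only `genre` and `city`. A sketch that adds `day_of_week` as a third key is shown below on made-up rows; the `genre_city_day-` column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the lab frame.
toy = pd.DataFrame({
    "genre": ["Izakaya", "Izakaya", "Izakaya", "Dining bar"],
    "city": ["Tōkyō-to"] * 4,
    "day_of_week": ["Friday", "Friday", "Monday", "Friday"],
    "visitors": [10, 30, 5, 8],
})

keys = ["genre", "city", "day_of_week"]

# One row per (genre, city, day of week) combination.
stats = (toy.groupby(keys)["visitors"]
            .agg(["mean", "median"])
            .rename(columns={"mean": "genre_city_day-mean",
                             "median": "genre_city_day-median"}))

# Merge on all three keys against the three-level index of stats.
out = toy.merge(stats, left_on=keys, right_index=True, how="left")
```

With three keys some groups can become quite small, so the resulting means and medians may be noisier than the two-key version.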


In [64]:
df.head(5)

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,year,quarter,city
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,2016,1,Tōkyō-to


In [66]:
df_all = df.merge(restaurant_stats,left_on='id',right_index=True,how='left')
df_all = df_all.merge(day_of_week_stats,left_on=['id','day_of_week'],right_index=True, how='left')
df_all = df_all.merge(city_genre_stats,left_on=['genre','city'],right_index=True, how='left')

In [67]:
df_all.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,...,year,quarter,city,restaurant-mean,restaurant-median,restaurant-standard,day_of_week-mean,day_of_week-median,genre_city-mean,genre_city-median
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,...,2016,1,Tōkyō-to,22.782609,22.0,11.810526,23.84375,25.0,17.928582,14.0
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,...,2016,1,Tōkyō-to,22.782609,22.0,11.810526,20.292308,21.0,17.928582,14.0
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,...,2016,1,Tōkyō-to,22.782609,22.0,11.810526,34.738462,35.0,17.928582,14.0
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,...,2016,1,Tōkyō-to,22.782609,22.0,11.810526,27.651515,27.0,17.928582,14.0
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,...,2016,1,Tōkyō-to,22.782609,22.0,11.810526,13.754386,12.0,17.928582,14.0


In [68]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 21 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   id                   252108 non-null  object        
 1   visit_date           252108 non-null  datetime64[ns]
 2   visitors             252108 non-null  int64         
 3   calendar_date        252108 non-null  object        
 4   day_of_week          252108 non-null  object        
 5   holiday              252108 non-null  int64         
 6   genre                252108 non-null  object        
 7   area                 252108 non-null  object        
 8   latitude             252108 non-null  float64       
 9   longitude            252108 non-null  float64       
 10  reserve_visitors     108394 non-null  float64       
 11  year                 252108 non-null  int64         
 12  quarter              252108 non-null  int64         
 13  city          