### Pandas Lab -- Cleaning, Merging, & Grouping

This lab is designed to introduce students to common use cases for Pandas when working with data:

 - Creating new information out of your existing data set
 - Merging, concatenating, and joining different data sources
 - Grouping -- With both time & non-time based data

In [3]:
import pandas as pd
import numpy as np
import datetime

In [59]:
df = pd.read_csv(r"/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurants.csv")

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                252108 non-null  object 
 1   visit_date        252108 non-null  object 
 2   visitors          252108 non-null  int64  
 3   calendar_date     252108 non-null  object 
 4   day_of_week       252108 non-null  object 
 5   holiday           252108 non-null  int64  
 6   genre             252108 non-null  object 
 7   area              252108 non-null  object 
 8   latitude          252108 non-null  float64
 9   longitude         252108 non-null  float64
 10  reserve_visitors  108394 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 21.2+ MB


In [3]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


### Section I: Creating Data Out of Your Existing Columns

Go ahead and create the following columns in your dataset.

**Column 1:**

  - **Column Name:** Weekend
  - **Values:** `True` if `day_of_week` is either Friday or Saturday, `False` if not

In [5]:
df = pd.read_csv(r"/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurants.csv",parse_dates = ['visit_date'])

In [6]:
# your answer here

conditions = [
    df['day_of_week'] == 'Friday',
    df['day_of_week'] == "Saturday"
]

results = [
    True,
    True
]



df['Weekend'] = np.select(conditions, results, False)
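
A shorter route to the same boolean column is `Series.isin`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the lab data) to show the idea:

```python
import pandas as pd

# Toy frame standing in for the lab data (values are illustrative only).
toy = pd.DataFrame({"day_of_week": ["Wednesday", "Friday", "Saturday", "Sunday"]})

# isin returns a boolean Series directly, so no np.select is needed.
toy["Weekend"] = toy["day_of_week"].isin(["Friday", "Saturday"])
print(toy)
```

Because the result is already `True`/`False`, it matches the column definition given in the exercise without a separate list of results.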

In [7]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,Weekend
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,True
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,True
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False


In [12]:
df['weekend_2'] = np.where((df.day_of_week == 'Saturday')|(df.day_of_week == 'Sunday'), 'Weekend', 'Weekday')

In [13]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,Weekend,weekend_2
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False,Weekday
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False,Weekday
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,True,Weekday
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,True,Weekend
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,False,Weekday


0         False
1         False
2         False
3          True
4         False
          ...  
252103    False
252104     True
252105     True
252106    False
252107     True
Name: day_of_week, Length: 252108, dtype: bool

**Column 2:**

 - **Column Name:** Reservation Activity
 - **Values:**
   - `Low` if `reserve_visitors` is below the 25th percentile (the 0.25 quantile)
   - `Medium` if `reserve_visitors` falls between the 25th and 75th percentiles
   - `High` if `reserve_visitors` is above the 75th percentile (the 0.75 quantile)
   
**Hint:** Use the `quantile` method to get the cutoff values
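
As a rough illustration of the hint, `quantile` has to be called (with parentheses) to return a number, and it skips missing values by default. The series `toy` below is made up for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical reservation counts, including one missing value.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, np.nan], name="reserve_visitors")

# Calling quantile(...) returns a scalar; NaN is ignored by default.
low_cut = toy.quantile(0.25)
high_cut = toy.quantile(0.75)

# Passing a list returns both cutoffs at once as a Series indexed by quantile.
both_cuts = toy.quantile([0.25, 0.75])
print(low_cut, high_cut)
print(both_cuts)
```

Writing `toy.quantile` without parentheses only displays the bound method, not a cutoff.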

In [None]:
# your answer here

In [5]:
df['reserve_visitors'].quantile

<bound method Series.quantile of 0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
          ... 
252103     6.0
252104    37.0
252105    35.0
252106     3.0
252107    32.0
Name: reserve_visitors, Length: 252108, dtype: float64>

In [21]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


In [20]:
df = df.drop(labels='reserve_visitors_quantile',axis=1)

In [24]:
df['reserve_visitors'].quantile(q=0.25)

4.0

In [14]:
df['reserve_visitors'].quantile(0.25)

4.0

In [15]:
df['reserve_visitors'].quantile(0.5)

10.0

In [16]:
df['reserve_visitors'].quantile(0.75)

24.0

In [17]:
# my workings
conditions = [
    df['reserve_visitors'] <  df['reserve_visitors'].quantile(q=0.25),
    df['reserve_visitors'] <  df['reserve_visitors'].quantile(q=0.75),
    df['reserve_visitors'] <= df['reserve_visitors'].quantile(q=1)
]

results = [
    'Low',
    'Medium',
    'High'
]



df['reservation_activity'] = np.select(conditions, results, 'Unknown')

In [18]:
df['reservation_activity'].value_counts()

Unknown    143714
Medium      60263
High        27193
Low         20938
Name: reservation_activity, dtype: int64
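
The 143714 `Unknown` rows equal 252108 minus the 108394 non-null `reserve_visitors` values, which suggests they are the rows with no reservation data. A comparison against `NaN` is always `False`, so those rows match no condition and fall through to the default. A minimal sketch with made-up values, reusing the cutoffs 4 and 24 from the quantile cells above:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one per category plus a missing entry.
toy = pd.Series([2.0, 10.0, 40.0, np.nan], name="reserve_visitors")

# Each comparison is False for NaN, so the last row matches nothing.
conditions = [toy < 4, toy < 24, toy >= 24]
labels = np.select(conditions, ["Low", "Medium", "High"], "Unknown")
print(labels)
```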

In [None]:
# Review lessons:

In [22]:
df['reserve_visitors'] > df['reserve_visitors'].quantile(0.75)

0         False
1         False
2         False
3         False
4         False
          ...  
252103    False
252104     True
252105     True
252106    False
252107     True
Name: reserve_visitors, Length: 252108, dtype: bool

In [23]:
(df['reserve_visitors'] < df['reserve_visitors'].quantile(0.75))&(df['reserve_visitors'] > df['reserve_visitors'].quantile(0.25))


0         False
1         False
2         False
3         False
4         False
          ...  
252103     True
252104    False
252105    False
252106    False
252107    False
Name: reserve_visitors, Length: 252108, dtype: bool

In [None]:
# Review Lesson:

In [28]:
q25 = df['reserve_visitors'].quantile(0.25)
q75 = df['reserve_visitors'].quantile(0.75)

# 'Medium' uses inclusive bounds so values equal to a cutoff are not left as 'Other'
conditions = [
    df['reserve_visitors'] > q75,
    df['reserve_visitors'].between(q25, q75),
    df['reserve_visitors'] < q25
]

results = [
    'High',
    'Medium',
    'Low'
]

# Rows with a missing reserve_visitors value match no condition and get 'Other'
df['ReserveQ'] = np.select(conditions, results, 'Other')

**Column 3:**

 - **Column Name:** Days
 - **Values:**
   - The length of time that has passed from the beginning of the time series, in days
 - **Note:** When you subtract these columns, the result will be a **time delta**. See if you can use the `dt` accessor to convert these values into integers, i.e., if a value reads `3 days`, you want it to be `3` instead. You can read more about different time periods in pandas here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
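
A minimal sketch of the conversion described in the note, using three made-up dates rather than the lab file:

```python
import pandas as pd

# Hypothetical visit dates, already parsed as datetimes.
dates = pd.Series(pd.to_datetime(["2016-01-01", "2016-01-13", "2016-01-18"]))

# Subtracting the earliest date gives a timedelta Series ("12 days", ...).
elapsed = dates - dates.min()

# The dt accessor exposes the whole-day component as plain integers.
days = elapsed.dt.days
print(days.tolist())
```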

### My solution:

In [35]:
# your answer here

df['visit_date'].min()



'2016-01-01'

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    252108 non-null  object 
 1   visit_date            252108 non-null  object 
 2   visitors              252108 non-null  int64  
 3   calendar_date         252108 non-null  object 
 4   day_of_week           252108 non-null  object 
 5   holiday               252108 non-null  int64  
 6   genre                 252108 non-null  object 
 7   area                  252108 non-null  object 
 8   latitude              252108 non-null  float64
 9   longitude             252108 non-null  float64
 10  reserve_visitors      108394 non-null  float64
 11  reservation_activity  252108 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 23.1+ MB


In [45]:
pd.to_datetime(df['visit_date'])

0        2016-01-13
1        2016-01-14
2        2016-01-15
3        2016-01-16
4        2016-01-18
            ...    
252103   2017-04-21
252104   2017-04-22
252105   2017-03-26
252106   2017-03-20
252107   2017-04-09
Name: visit_date, Length: 252108, dtype: datetime64[ns]

In [46]:
df['visit_date_dt'] = pd.to_datetime(df['visit_date'])

In [48]:
df.head(1)

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,reservation_activity,visit_date_dt
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,Unknown,2016-01-13


In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   id                    252108 non-null  object        
 1   visit_date            252108 non-null  object        
 2   visitors              252108 non-null  int64         
 3   calendar_date         252108 non-null  object        
 4   day_of_week           252108 non-null  object        
 5   holiday               252108 non-null  int64         
 6   genre                 252108 non-null  object        
 7   area                  252108 non-null  object        
 8   latitude              252108 non-null  float64       
 9   longitude             252108 non-null  float64       
 10  reserve_visitors      108394 non-null  float64       
 11  reservation_activity  252108 non-null  object        
 12  visit_date_dt         252108 non-null  datetime64[ns]
dtyp

In [74]:
df['Days']=(df['visit_date_dt']-pd.to_datetime('2016-01-01')).dt.days

In [75]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,reservation_activity,visit_date_dt,Days
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,Unknown,2016-01-13,12
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,Unknown,2016-01-14,13
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,Unknown,2016-01-15,14
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,Unknown,2016-01-16,15
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,,Unknown,2016-01-18,17


### Class 6 review:

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 15 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   id                    252108 non-null  object        
 1   visit_date            252108 non-null  datetime64[ns]
 2   visitors              252108 non-null  int64         
 3   calendar_date         252108 non-null  object        
 4   day_of_week           252108 non-null  object        
 5   holiday               252108 non-null  int64         
 6   genre                 252108 non-null  object        
 7   area                  252108 non-null  object        
 8   latitude              252108 non-null  float64       
 9   longitude             252108 non-null  float64       
 10  reserve_visitors      108394 non-null  float64       
 11  Weekend               252108 non-null  bool          
 12  weekend_2             252108 non-null  object        
 13 

In [29]:
df['visit_date'].min()

Timestamp('2016-01-01 00:00:00')

In [34]:
df['visit_date'] - df['visit_date'].min()

0         12 days
1         13 days
2         14 days
3         15 days
4         17 days
           ...   
252103   476 days
252104   477 days
252105   450 days
252106   444 days
252107   464 days
Name: visit_date, Length: 252108, dtype: timedelta64[ns]

In [35]:
(df['visit_date'] - df['visit_date'].min()).dtype

dtype('<m8[ns]')

In [36]:
(df['visit_date'] - df['visit_date'].min()) - 15
# This fails: a bare integer can't be subtracted from a timedelta column. Subtract pd.Timedelta(days=15), or convert with .dt.days first.

TypeError: Addition/subtraction of integers and integer-arrays with TimedeltaArray is no longer supported.  Instead of adding/subtracting `n`, use `n * obj.freq`
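
Two ways around this error, sketched on a made-up timedelta series (`elapsed` here is a hypothetical stand-in for the subtraction result above): either subtract another timedelta, or convert to integer days first.

```python
import pandas as pd

# Hypothetical elapsed times of 12, 17 and 30 days.
elapsed = pd.Series(pd.to_timedelta([12, 17, 30], unit="D"))

# Option 1: stay in timedelta space by subtracting a Timedelta.
shifted = elapsed - pd.Timedelta(days=15)

# Option 2: convert to integer days, then use ordinary arithmetic.
shifted_days = elapsed.dt.days - 15
print(shifted)
print(shifted_days.tolist())
```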

In [37]:
df['time'] = (df['visit_date'] - df['visit_date'].min())

### Section II: Merging Dataframes

The dataset we have been working with so far (`master.csv`) is actually a combined version of several datasets.  

In this section of the lab, we are going to re-create it manually from its individual pieces.

In the `restaurant data` folder, you'll find the following files:

 - `air_reserve.csv`
 - `air_store_info.csv`
 - `air_visit_data.csv`
 - `date_info.csv`
 
They contain all the constituent info for the `master.csv` file that we're currently using. 

You should have 252108 rows when you are finished.

Using merges, piece the files together to recreate the one we are currently working on.  
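
One way to keep the row count stable is to start from the visits table and attach the other tables with left merges. The sketch below uses tiny made-up tables (`visits_toy` and `store_toy` are hypothetical); `validate="many_to_one"` asks pandas to raise an error if the lookup table has duplicate keys, which would otherwise multiply rows.

```python
import pandas as pd

# Hypothetical miniature versions of the visits and store tables.
visits_toy = pd.DataFrame({
    "air_store_id": ["a", "a", "b"],
    "visit_date": ["2016-01-13", "2016-01-14", "2016-01-13"],
    "visitors": [25, 32, 6],
})
store_toy = pd.DataFrame({
    "air_store_id": ["a", "b"],
    "air_genre_name": ["Dining bar", "Cafe/Sweets"],
})

# A left merge keeps every visit row; validate checks the store keys are unique.
merged = visits_toy.merge(store_toy, how="left", on="air_store_id",
                          validate="many_to_one")

# The number of rows should be unchanged by the merge.
print(len(visits_toy), len(merged))
```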

**Hint:** To get the number of reservations in the `reserve_visitors` column, you will have to use the `groupby` method first for each store_id and day before doing the merging.

You will also have to make sure the columns you merge on have the same datatype in both dataframes.

Some operations that might come in handy:

 - `dt.date` -- converts a datetime to a date
 - `pd.to_datetime` if you need to convert something from a string to a date
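
To illustrate the two operations on made-up timestamps: `dt.date` yields Python `date` objects (an `object` column), so it may need converting again before a merge. `dt.normalize()` is an alternative not mentioned above that sets the time to midnight and keeps the datetime dtype.

```python
import pandas as pd

# Hypothetical reservation timestamps as strings.
raw = pd.Series(["2016-01-01 19:00:00", "2016-01-01 20:00:00", "2016-01-02 12:00:00"])

# String -> datetime64.
stamps = pd.to_datetime(raw)

# dt.date drops the time but returns Python date objects (object dtype).
as_date = stamps.dt.date

# dt.normalize() also drops the time but keeps the datetime64 dtype.
as_midnight = stamps.dt.normalize()

print(as_date.dtype, as_midnight.dtype)
```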

In [None]:
# your answer here

**Hint:** The fairly difficult part is `air_reserve`.
You need to use a group-by style operation. This is more complicated because a store can have several reservations at different times of the same day.

`visit_datetime` is at a lower level of detail than we want, so it needs to be rolled up to the day level. Parse it as a datetime with `pd.to_datetime`, then use `.dt.date`.

 `reservation.groupby(['air_store_id','visit_datetime'])['reserve_visitors'].sum().reset_index()`
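
The roll-up can be sketched on a tiny made-up table (`reserve_toy` is hypothetical). This version uses `dt.normalize()` instead of `.dt.date` so the day column keeps a datetime dtype, which is an assumption of the sketch rather than part of the hint:

```python
import pandas as pd

# Hypothetical reservations: store "a" has two bookings on the same day.
reserve_toy = pd.DataFrame({
    "air_store_id": ["a", "a", "b"],
    "visit_datetime": pd.to_datetime(
        ["2016-01-01 19:00:00", "2016-01-01 20:00:00", "2016-01-01 19:00:00"]
    ),
    "reserve_visitors": [1, 2, 3],
})

# Reduce each timestamp to its day, then total the reservations per store and day.
reserve_toy["visit_date"] = reserve_toy["visit_datetime"].dt.normalize()
daily = (
    reserve_toy
    .groupby(["air_store_id", "visit_date"], as_index=False)["reserve_visitors"]
    .sum()
)
print(daily)
```

The result has one row per store and day, which is the level of detail of the visits table.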

In [49]:
reserve = pd.read_csv("/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurant_data/air_reserve.csv")

store = pd.read_csv("/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurant_data/air_store_info.csv")

visits = pd.read_csv("/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurant_data/air_visit_data.csv")

dateinfo = pd.read_csv("/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurant_data/date_info.csv")

In [50]:
reserve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92378 entries, 0 to 92377
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   air_store_id      92378 non-null  object
 1   visit_datetime    92378 non-null  object
 2   reserve_datetime  92378 non-null  object
 3   reserve_visitors  92378 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 2.8+ MB


In [52]:
reserve.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,air_877f79706adbfb06,2016-01-01 19:00:00,2016-01-01 16:00:00,1
1,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,3
2,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,6
3,air_877f79706adbfb06,2016-01-01 20:00:00,2016-01-01 16:00:00,2
4,air_db80363d35f10926,2016-01-01 20:00:00,2016-01-01 01:00:00,5


In [51]:
store.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 829 entries, 0 to 828
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   air_store_id    829 non-null    object 
 1   air_genre_name  829 non-null    object 
 2   air_area_name   829 non-null    object 
 3   latitude        829 non-null    float64
 4   longitude       829 non-null    float64
dtypes: float64(2), object(3)
memory usage: 32.5+ KB


In [53]:
store.head()

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
3,air_a17f0778617c76e2,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852
4,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599


In [55]:
visits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   air_store_id  252108 non-null  object
 1   visit_date    252108 non-null  object
 2   visitors      252108 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 5.8+ MB


In [56]:
visits.head()

Unnamed: 0,air_store_id,visit_date,visitors
0,air_ba937bf13d40fb24,2016-01-13,25
1,air_ba937bf13d40fb24,2016-01-14,32
2,air_ba937bf13d40fb24,2016-01-15,29
3,air_ba937bf13d40fb24,2016-01-16,22
4,air_ba937bf13d40fb24,2016-01-18,6


In [57]:
dateinfo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   calendar_date  517 non-null    object
 1   day_of_week    517 non-null    object
 2   holiday_flg    517 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 12.2+ KB


In [58]:
dateinfo.head()

Unnamed: 0,calendar_date,day_of_week,holiday_flg
0,2016-01-01,Friday,1
1,2016-01-02,Saturday,1
2,2016-01-03,Sunday,1
3,2016-01-04,Monday,0
4,2016-01-05,Tuesday,0


In [69]:
# Start from the store table; .copy() keeps later changes from touching `store`
bdf_1 = store.copy()

In [72]:
visits.head()

Unnamed: 0,air_store_id,visit_date,visitors
0,air_ba937bf13d40fb24,2016-01-13,25
1,air_ba937bf13d40fb24,2016-01-14,32
2,air_ba937bf13d40fb24,2016-01-15,29
3,air_ba937bf13d40fb24,2016-01-16,22
4,air_ba937bf13d40fb24,2016-01-18,6


In [74]:
bdf_2 = bdf_1.merge(visits, how="inner", on="air_store_id")

In [75]:
bdf_2

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude,visit_date,visitors
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,18
1,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-02,37
2,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-03,20
3,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-04,16
4,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-05,15
...,...,...,...,...,...,...,...
252103,air_c8fe396d6c46275d,Karaoke/Party,Hokkaidō Sapporo-shi Minami 3 Jōnishi,43.055460,141.340956,2017-04-18,25
252104,air_c8fe396d6c46275d,Karaoke/Party,Hokkaidō Sapporo-shi Minami 3 Jōnishi,43.055460,141.340956,2017-04-19,12
252105,air_c8fe396d6c46275d,Karaoke/Party,Hokkaidō Sapporo-shi Minami 3 Jōnishi,43.055460,141.340956,2017-04-20,11
252106,air_c8fe396d6c46275d,Karaoke/Party,Hokkaidō Sapporo-shi Minami 3 Jōnishi,43.055460,141.340956,2017-04-21,35


In [76]:
dateinfo

Unnamed: 0,calendar_date,day_of_week,holiday_flg
0,2016-01-01,Friday,1
1,2016-01-02,Saturday,1
2,2016-01-03,Sunday,1
3,2016-01-04,Monday,0
4,2016-01-05,Tuesday,0
...,...,...,...
512,2017-05-27,Saturday,0
513,2017-05-28,Sunday,0
514,2017-05-29,Monday,0
515,2017-05-30,Tuesday,0


In [80]:
bdf_3 = bdf_2.merge(dateinfo,how='inner',left_on='visit_date',right_on="calendar_date")

In [81]:
bdf_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 252108 entries, 0 to 252107
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   air_store_id    252108 non-null  object 
 1   air_genre_name  252108 non-null  object 
 2   air_area_name   252108 non-null  object 
 3   latitude        252108 non-null  float64
 4   longitude       252108 non-null  float64
 5   visit_date      252108 non-null  object 
 6   visitors        252108 non-null  int64  
 7   calendar_date   252108 non-null  object 
 8   day_of_week     252108 non-null  object 
 9   holiday_flg     252108 non-null  int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 21.2+ MB


In [82]:
reserve.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors
0,air_877f79706adbfb06,2016-01-01 19:00:00,2016-01-01 16:00:00,1
1,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,3
2,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,6
3,air_877f79706adbfb06,2016-01-01 20:00:00,2016-01-01 16:00:00,2
4,air_db80363d35f10926,2016-01-01 20:00:00,2016-01-01 01:00:00,5


In [83]:
reserve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92378 entries, 0 to 92377
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   air_store_id      92378 non-null  object
 1   visit_datetime    92378 non-null  object
 2   reserve_datetime  92378 non-null  object
 3   reserve_visitors  92378 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 2.8+ MB


In [84]:
reserve['visit_datetime'] = pd.to_datetime(reserve['visit_datetime'])

In [85]:
reserve['reserve_datetime'] = pd.to_datetime(reserve['reserve_datetime'])

In [86]:
reserve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92378 entries, 0 to 92377
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   air_store_id      92378 non-null  object        
 1   visit_datetime    92378 non-null  datetime64[ns]
 2   reserve_datetime  92378 non-null  datetime64[ns]
 3   reserve_visitors  92378 non-null  int64         
dtypes: datetime64[ns](2), int64(1), object(1)
memory usage: 2.8+ MB


In [87]:
# reserve["visit_date"] = 
reserve["visit_datetime"].dt.date

0        2016-01-01
1        2016-01-01
2        2016-01-01
3        2016-01-01
4        2016-01-01
            ...    
92373    2017-05-29
92374    2017-05-30
92375    2017-05-31
92376    2017-05-31
92377    2017-05-31
Name: visit_datetime, Length: 92378, dtype: object

In [88]:
reserve["visit_date"] = reserve["visit_datetime"].dt.date

In [89]:
reserve.head()

Unnamed: 0,air_store_id,visit_datetime,reserve_datetime,reserve_visitors,visit_date
0,air_877f79706adbfb06,2016-01-01 19:00:00,2016-01-01 16:00:00,1,2016-01-01
1,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,3,2016-01-01
2,air_db4b38ebe7a7ceff,2016-01-01 19:00:00,2016-01-01 19:00:00,6,2016-01-01
3,air_877f79706adbfb06,2016-01-01 20:00:00,2016-01-01 16:00:00,2,2016-01-01
4,air_db80363d35f10926,2016-01-01 20:00:00,2016-01-01 01:00:00,5,2016-01-01


In [96]:
bdf_4 = reserve.groupby(['air_store_id','visit_date'])['reserve_visitors'].sum().reset_index()

In [112]:
bdf_4.head()

Unnamed: 0,air_store_id,visit_date,reserve_visitors
0,air_00a91d42b08b08d9,2016-10-31,2
1,air_00a91d42b08b08d9,2016-12-05,9
2,air_00a91d42b08b08d9,2016-12-14,18
3,air_00a91d42b08b08d9,2016-12-17,2
4,air_00a91d42b08b08d9,2016-12-20,4


In [97]:
bdf_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29830 entries, 0 to 29829
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   air_store_id      29830 non-null  object
 1   visit_date        29830 non-null  object
 2   reserve_visitors  29830 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 699.3+ KB


In [99]:
bdf_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 252108 entries, 0 to 252107
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   air_store_id    252108 non-null  object 
 1   air_genre_name  252108 non-null  object 
 2   air_area_name   252108 non-null  object 
 3   latitude        252108 non-null  float64
 4   longitude       252108 non-null  float64
 5   visit_date      252108 non-null  object 
 6   visitors        252108 non-null  int64  
 7   calendar_date   252108 non-null  object 
 8   day_of_week     252108 non-null  object 
 9   holiday_flg     252108 non-null  int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 21.2+ MB


In [101]:
bdf_3.merge(bdf_4,how="left",left_on = ['air_store_id','visit_date'], right_on = ['air_store_id','visit_date'])

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude,visit_date,visitors,calendar_date,day_of_week,holiday_flg,reserve_visitors
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,18,2016-07-01,Friday,0,
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,3,2016-07-01,Friday,0,
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,18,2016-07-01,Friday,0,
3,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2016-07-01,10,2016-07-01,Friday,0,
4,air_99c3eae84130c1cb,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2016-07-01,64,2016-07-01,Friday,0,
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_c7d30ab0e07f31d5,Other,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,2016-04-11,16,2016-04-11,Monday,0,
252104,air_8a906e5801eac81c,Other,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,2016-04-11,13,2016-04-11,Monday,0,
252105,air_2cee51fa6fdf6c0d,Western food,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,2016-04-11,6,2016-04-11,Monday,0,
252106,air_0728814bd98f7367,Western food,Tōkyō-to Suginami-ku Asagayaminami,35.699566,139.636438,2016-04-11,3,2016-04-11,Monday,0,
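
Every `reserve_visitors` value in the output above is missing. A likely cause is that the two `visit_date` columns hold different kinds of values: strings on one side and Python `date` objects (from `.dt.date`) on the other, so no keys compare equal. The sketch below reproduces this on made-up one-row frames and uses `indicator=True` to show which rows found a match:

```python
import datetime
import pandas as pd

# Hypothetical frames: the key is a string on the left and a date object on the right.
left = pd.DataFrame({
    "visit_date": pd.Series(["2016-07-01"], dtype=object),
    "visitors": [18],
})
right = pd.DataFrame({
    "visit_date": pd.Series([datetime.date(2016, 7, 1)], dtype=object),
    "reserve_visitors": [3],
})

# Both key columns are object dtype, so the merge runs but nothing matches.
mismatched = left.merge(right, how="left", on="visit_date", indicator=True)

# Converting both keys to datetimes makes the values comparable.
left["visit_date"] = pd.to_datetime(left["visit_date"])
right["visit_date"] = pd.to_datetime(right["visit_date"])
aligned = left.merge(right, how="left", on="visit_date", indicator=True)

print(mismatched["_merge"].tolist(), aligned["_merge"].tolist())
```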


In [116]:
bdf_3['visit_date'] = pd.to_datetime(bdf_3['visit_date'])

In [117]:
bdf_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 252108 entries, 0 to 252107
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   air_store_id    252108 non-null  object        
 1   air_genre_name  252108 non-null  object        
 2   air_area_name   252108 non-null  object        
 3   latitude        252108 non-null  float64       
 4   longitude       252108 non-null  float64       
 5   visit_date      252108 non-null  datetime64[ns]
 6   visitors        252108 non-null  int64         
 7   calendar_date   252108 non-null  object        
 8   day_of_week     252108 non-null  object        
 9   holiday_flg     252108 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(2), object(5)
memory usage: 21.2+ MB


In [119]:
bdf_4['visit_date'] = pd.to_datetime(bdf_4['visit_date'])

In [120]:
bdf_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29830 entries, 0 to 29829
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   air_store_id      29830 non-null  object        
 1   visit_date        29830 non-null  datetime64[ns]
 2   reserve_visitors  29830 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 699.3+ KB


In [121]:
bdf_3.merge(bdf_4,how="left",left_on = ['air_store_id','visit_date'], right_on = ['air_store_id','visit_date'])

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude,visit_date,visitors,calendar_date,day_of_week,holiday_flg,reserve_visitors
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,18,2016-07-01,Friday,0,3.0
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,3,2016-07-01,Friday,0,
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,18,2016-07-01,Friday,0,
3,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2016-07-01,10,2016-07-01,Friday,0,
4,air_99c3eae84130c1cb,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2016-07-01,64,2016-07-01,Friday,0,
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_c7d30ab0e07f31d5,Other,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,2016-04-11,16,2016-04-11,Monday,0,
252104,air_8a906e5801eac81c,Other,Tōkyō-to Taitō-ku Higashiueno,35.712607,139.779996,2016-04-11,13,2016-04-11,Monday,0,
252105,air_2cee51fa6fdf6c0d,Western food,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,2016-04-11,6,2016-04-11,Monday,0,6.0
252106,air_0728814bd98f7367,Western food,Tōkyō-to Suginami-ku Asagayaminami,35.699566,139.636438,2016-04-11,3,2016-04-11,Monday,0,


In [122]:
bdf_5 = bdf_3.merge(bdf_4, how="left", on=['air_store_id', 'visit_date'])

In [123]:
bdf_5.head()

Unnamed: 0,air_store_id,air_genre_name,air_area_name,latitude,longitude,visit_date,visitors,calendar_date,day_of_week,holiday_flg,reserve_visitors
0,air_0f0cdeee6c9bf3d7,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,18,2016-07-01,Friday,0,3.0
1,air_7cc17a324ae5c7dc,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,3,2016-07-01,Friday,0,
2,air_fee8dcf4d619598e,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,2016-07-01,18,2016-07-01,Friday,0,
3,air_83db5aff8f50478e,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2016-07-01,10,2016-07-01,Friday,0,
4,air_99c3eae84130c1cb,Italian/French,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2016-07-01,64,2016-07-01,Friday,0,
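
The rebuilt frame still uses the source file column names (`air_store_id`, `air_genre_name`, `air_area_name`, `holiday_flg`), while the combined file shown at the top of the lab uses `id`, `genre`, `area` and `holiday`. A possible last step, sketched on a made-up one-row frame and assuming that mapping is the intended one, is a rename:

```python
import pandas as pd

# Hypothetical one-row frame with the source column names.
rebuilt = pd.DataFrame({
    "air_store_id": ["air_0f0cdeee6c9bf3d7"],
    "air_genre_name": ["Italian/French"],
    "air_area_name": ["Hyōgo-ken Kōbe-shi Kumoidōri"],
    "holiday_flg": [0],
})

# Assumed mapping from source names to the names in the combined file.
rebuilt = rebuilt.rename(columns={
    "air_store_id": "id",
    "air_genre_name": "genre",
    "air_area_name": "area",
    "holiday_flg": "holiday",
})
print(list(rebuilt.columns))
```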

