### Lab:  Model Validation With Gradient Boosting

Welcome to this evening's lab! It's going to be a fun one. In today's class, we're going to take a crack at model building in a holistic way.

Specifically, we're going to do three things:

 - Try out different versions of our data, using our validation scores to see whether each change was an improvement
 - Adjust model parameters to help curb overfitting
 - Find model parameters that maximize our score on our dataset
 
The idea is to do a mini-walkthrough of what it's like to build and validate a model, and then try to improve the results.

**Step 1:** Using the suggestions from the previous homework prompt, try adding 3-4 different features (columns) to your data, and use your validation score to determine whether they improved your results.

This is meant to be open-ended, and to give you a chance to rediscover material from previous labs.
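As a rough sketch of the "add a feature, check the validation score" loop, here is a toy comparison on synthetic data. The arrays `base` and `candidate`, the helper `val_score`, and the 450/150 split are all made up for illustration; in the lab you would swap in your own columns and the splits created further down.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 600
base = rng.normal(size=n)        # stands in for the columns you already have
candidate = rng.normal(size=n)   # stands in for a newly engineered column
y = 3 * base + 2 * candidate + rng.normal(scale=0.5, size=n)

X_base = base.reshape(-1, 1)
X_plus = np.column_stack([base, candidate])
train, val = slice(0, 450), slice(450, n)

def val_score(X):
    """Fit on the training rows and return R^2 on the validation rows."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X[train], y[train])
    return model.score(X[val], y[val])

score_base = val_score(X_base)
score_plus = val_score(X_plus)
print(f"baseline: {score_base:.3f}  with candidate: {score_plus:.3f}")
```

Keeping everything else fixed (same model settings, same split, same random seed) is what lets you attribute the change in score to the new column.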

In [None]:
# your code here

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
import category_encoders as ce
from sklearn.pipeline import make_pipeline


In [2]:
df = pd.read_csv("../../data/restaurants.csv", parse_dates=['visit_date'])

In [3]:
# Save this in a file somewhere!
def denote_null_values(df):
    """Adds a boolean `<col>_missing` column for every column that contains null values"""
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    return df

In [4]:
def create_val_splits(df, val_units=15, return_val=False):
    """Function that will take in a dataset and split it up into training, validation, and test sets"""
    # split into training, validation, and test sets
    train = df.groupby('id').apply(lambda x: x.iloc[:-val_units]).reset_index(drop=True)
    test  = df.groupby('id').apply(lambda x: x.iloc[-val_units:]).reset_index(drop=True)
    if return_val:
        val   = train.groupby('id').apply(lambda x: x.iloc[-val_units:]).reset_index(drop=True)
        train = train.groupby('id').apply(lambda x: x.iloc[:-val_units]).reset_index(drop=True)
        return train, val, test
    else:
        return train, test
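The split above holds out the last `val_units` rows of each restaurant, so it only makes sense once the rows are sorted by `id` and date. Here is a minimal sketch of the same idea on a toy frame, using `groupby(...).tail()` instead of `groupby(...).apply()` (newer pandas versions may warn when `apply` operates on the grouping column). The helper name `split_last_n` and the toy data are hypothetical.

```python
import pandas as pd

def split_last_n(df, group_col="id", n=2):
    """Hold out the last n rows of each group; df must already be sorted by group and time."""
    holdout = df.groupby(group_col).tail(n)
    train = df.drop(holdout.index)
    return train.reset_index(drop=True), holdout.reset_index(drop=True)

toy = pd.DataFrame({
    "id": ["a"] * 5 + ["b"] * 4,
    "day": list(range(5)) + list(range(4)),
    "visitors": [10, 12, 9, 14, 11, 3, 4, 6, 5],
})

train_toy, test_toy = split_last_n(toy, n=2)
print(train_toy)
print(test_toy)
```

This relies on the index being unique, which is why the sketch starts from a default `RangeIndex`.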

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                252108 non-null  object        
 1   visit_date        252108 non-null  datetime64[ns]
 2   visitors          252108 non-null  int64         
 3   calendar_date     252108 non-null  object        
 4   day_of_week       252108 non-null  object        
 5   holiday           252108 non-null  int64         
 6   genre             252108 non-null  object        
 7   area              252108 non-null  object        
 8   latitude          252108 non-null  float64       
 9   longitude         252108 non-null  float64       
 10  reserve_visitors  108394 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(5)
memory usage: 21.2+ MB


In [14]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
166836,air_00a91d42b08b08d9,2016-07-01,35,2016-07-01,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166837,air_00a91d42b08b08d9,2016-07-02,9,2016-07-02,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0
166838,air_00a91d42b08b08d9,2016-07-04,20,2016-07-04,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166839,air_00a91d42b08b08d9,2016-07-05,25,2016-07-05,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166840,air_00a91d42b08b08d9,2016-07-06,29,2016-07-06,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,


In [6]:
df.sort_values(by=['id','visit_date'], ascending=True, inplace=True)

In [7]:
grouping = df.groupby('id').apply(lambda x: x['visitors'].shift())
df['visitors_yesterday'] = grouping.values

In [8]:
grouping = df.groupby('id').apply(lambda x: x['visitors'].shift(7))
df['visitors_last_week'] = grouping.values

In [9]:
grouping = df.groupby('id').apply(lambda x: x['visitors'].rolling(7).mean())
df['visitors_7_day_ma'] = grouping.values
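One thing worth checking with rolling features: `rolling(7).mean()` as written includes the current row's `visitors` value, which is also the target we're predicting. A possible variant, sketched below on toy data, shifts by one day first so the average only uses prior days. The column names `ma_incl_today` and `ma_prior_days` are made up, and the window of 2 is just to keep the example small.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 3,
    "visitors": [10, 20, 30, 40, 1, 2, 3],
})

# Rolling mean that includes the current day (same shape as the lab's feature)
toy["ma_incl_today"] = toy.groupby("id")["visitors"].transform(
    lambda s: s.rolling(2).mean()
)

# Rolling mean over prior days only: shift within each group, then roll
toy["ma_prior_days"] = toy.groupby("id")["visitors"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)
print(toy)
```

Whether the shifted version is the right choice depends on what information you would actually have at prediction time.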


In [10]:
df['month'] = df['visit_date'].dt.month

In [11]:
df['quarter'] = df['visit_date'].dt.quarter

In [12]:
# Set up data 
date_col = df['visit_date']
df.drop('visit_date',axis=1,inplace=True)
df.drop('calendar_date',axis=1,inplace=True)
df = denote_null_values(df)

In [35]:
df.head()

Unnamed: 0,id,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,visitors_yesterday,visitors_last_week,visitors_7_day_ma,month,quarter,reserve_visitors_missing,visitors_yesterday_missing,visitors_last_week_missing,visitors_7_day_ma_missing
166836,air_00a91d42b08b08d9,35,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,35.0,35.0,27.714286,7,3,True,True,True,True
166837,air_00a91d42b08b08d9,9,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0,35.0,35.0,27.714286,7,3,False,False,True,True
166838,air_00a91d42b08b08d9,20,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,9.0,35.0,27.714286,7,3,True,False,True,True
166839,air_00a91d42b08b08d9,25,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,20.0,35.0,27.714286,7,3,True,False,True,True
166840,air_00a91d42b08b08d9,29,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,25.0,35.0,27.714286,7,3,True,False,True,True


In [13]:
df['reserve_visitors'] = df['reserve_visitors'].fillna(0)

In [14]:
df['visitors_yesterday'] = df['visitors_yesterday'].bfill()
df['visitors_last_week'] = df['visitors_last_week'].bfill()
df['visitors_7_day_ma'] = df['visitors_7_day_ma'].bfill()

In [25]:
ordinal_encoder = ce.OrdinalEncoder(cols='day_of_week')
target_encoder = ce.TargetEncoder()
gbm = GradientBoostingRegressor()

pipe = make_pipeline(ordinal_encoder, target_encoder, gbm)
pipe.verbose = True

In [26]:
pipe


Pipeline(steps=[('ordinalencoder', OrdinalEncoder(cols='day_of_week')),
                ('targetencoder', TargetEncoder()),
                ('gradientboostingregressor', GradientBoostingRegressor())],
         verbose=True)

In [None]:
# Set up training, validation, and test data sets

In [35]:
train, val, test = create_val_splits(df,return_val = True)

In [36]:
X_train, X_val, X_test = (
    train.drop('visitors',axis=1)
    ,val.drop('visitors',axis=1)
    ,test.drop('visitors',axis=1)
    )
y_train, y_val, y_test = (
    train['visitors'],val['visitors'], test['visitors'])

In [29]:
pipe.fit(X_train, y_train)

[Pipeline] .... (step 1 of 3) Processing ordinalencoder, total=   0.1s




[Pipeline] ..... (step 2 of 3) Processing targetencoder, total=   0.3s
[Pipeline]  (step 3 of 3) Processing gradientboostingregressor, total=  27.2s


Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['day_of_week'],
                                mapping=[{'col': 'day_of_week',
                                          'data_type': dtype('O'),
                                          'mapping': Friday       1
Saturday     2
Monday       3
Tuesday      4
Wednesday    5
Thursday     6
Sunday       7
NaN         -2
dtype: int64}])),
                ('targetencoder', TargetEncoder(cols=['id', 'genre', 'area'])),
                ('gradientboostingregressor', GradientBoostingRegressor())],
         verbose=True)

In [30]:
pipe.score(X_val, y_val)

0.6046829140012382
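For reference, when the last step of a pipeline is a regressor, `score` returns R², the share of variance in the target that the predictions explain. A small sketch with made-up numbers, computing it by hand and comparing against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

A score of 0.60 therefore means the model does noticeably better than predicting the average, while still leaving about 40% of the variance unexplained.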

In [31]:
grouping = df.groupby('id').apply(lambda x: x['reserve_visitors'].rolling(7).mean())
df['reservations_7_day_ma'] = grouping.values

In [32]:
grouping = df.groupby('id').apply(lambda x: x['reserve_visitors'].shift())
df['reservations_yesterday'] = grouping.values

In [33]:
# Address null values in new reservation variables
df = denote_null_values(df)
df['reservations_7_day_ma'] = df['reservations_7_day_ma'].bfill()
df['reservations_yesterday'] = df['reservations_yesterday'].bfill()

In [50]:
df.head()

Unnamed: 0,id,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,visitors_yesterday,...,month,quarter,reserve_visitors_missing,visitors_yesterday_missing,visitors_last_week_missing,visitors_7_day_ma_missing,reservations_7_day_ma,reservations_yesterday,reservations_7_day_ma_missing,reservations_yesterday_missing
166836,air_00a91d42b08b08d9,35,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,35.0,...,7,3,True,True,True,True,0.571429,0.0,True,True
166837,air_00a91d42b08b08d9,9,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0,35.0,...,7,3,False,False,True,True,0.571429,0.0,True,False
166838,air_00a91d42b08b08d9,20,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,9.0,...,7,3,True,False,True,True,0.571429,4.0,True,False
166839,air_00a91d42b08b08d9,25,Tuesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,20.0,...,7,3,True,False,True,True,0.571429,0.0,True,False
166840,air_00a91d42b08b08d9,29,Wednesday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,25.0,...,7,3,True,False,True,True,0.571429,0.0,True,False


In [38]:
# Re-run the split cells above first so X_train / X_val include the new reservation columns
pipe1 = make_pipeline(ordinal_encoder, target_encoder, gbm)
pipe1.verbose = True

In [39]:
pipe1.fit(X_train,y_train)



[Pipeline] .... (step 1 of 3) Processing ordinalencoder, total=   0.1s
[Pipeline] ..... (step 2 of 3) Processing targetencoder, total=   0.3s
[Pipeline]  (step 3 of 3) Processing gradientboostingregressor, total=  32.6s


Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['day_of_week'],
                                mapping=[{'col': 'day_of_week',
                                          'data_type': dtype('O'),
                                          'mapping': Friday       1
Saturday     2
Monday       3
Tuesday      4
Wednesday    5
Thursday     6
Sunday       7
NaN         -2
dtype: int64}])),
                ('targetencoder', TargetEncoder(cols=['id', 'genre', 'area'])),
                ('gradientboostingregressor', GradientBoostingRegressor())],
         verbose=True)

In [40]:
pipe1.score(X_val, y_val)

0.6119449313777545

In [None]:
# Adding the reservation features lifted the validation score from 0.605 to 0.612, so the gain is not meaningful

**Step 2:** Try to reduce overfitting in your model, if it persists. Ideally, you want your in-sample and out-of-sample scores to be about the same, or at least to rise and fall in proportion to one another.

The goal here is two-fold: narrow the gap between in-sample and out-of-sample results (using the training and validation sets) while **not** decreasing your model scores (or at least not by very much). The closer these two scores are, the more reliable your results are likely to be.

Some knobs you can turn:
 - `min_samples_leaf`: parameter in the category encoder that sets the cutoff for relying on a category's local average vs. the global average
 - `subsample`: parameter in gbm that sets what fraction of your dataset to use at each boosting round. This both reduces training time and makes each fitting round less correlated with the others
 - `max_features`: what portion of columns to consider at each split. This is very similar in purpose to `subsample`, but it randomizes columns at each split rather than rows at each round.
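To make the "local vs. global average" idea concrete, here is a simplified sketch of smoothed target encoding in plain pandas. This is an additive-smoothing illustration only: the helper `smoothed_target_mean` and its weight `m` are hypothetical, and the weighting that `category_encoders.TargetEncoder` applies through `min_samples_leaf` is not the same formula.

```python
import pandas as pd

def smoothed_target_mean(categories, target, m=2.0):
    """Blend each category's mean with the global mean; small categories lean on the global mean."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    blended = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return categories.map(blended)

toy = pd.DataFrame({
    "genre": ["a", "a", "a", "b"],
    "visitors": [10, 20, 30, 100],
})
toy["genre_encoded"] = smoothed_target_mean(toy["genre"], toy["visitors"], m=2.0)
print(toy)
```

Category `b` has a single row, so its encoded value is pulled well away from its raw mean of 100 toward the global mean of 40, which is the behavior that helps curb overfitting on rare categories.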

In [None]:
# your code here

In [41]:
pipe1.score(X_train, y_train), pipe1.score(X_val,y_val), pipe1.score(X_test, y_test)

# Very similar values across all sets, so the model is probably not overfitting

(0.614479215896371, 0.6119449313777545, 0.5853980331843698)
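If the gap had been larger, one way to explore the knobs above is to track the train/validation gap for a couple of settings side by side. The sketch below uses synthetic data, and the helper `train_val_gap` and the specific parameter values are assumptions for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(800, 10))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=800)
X_tr, y_tr, X_va, y_va = X[:600], y[:600], X[600:], y[600:]

def train_val_gap(**params):
    """Return (train R^2, validation R^2, gap) for one GBM configuration."""
    model = GradientBoostingRegressor(n_estimators=100, random_state=0, **params)
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)
    val_r2 = model.score(X_va, y_va)
    return train_r2, val_r2, train_r2 - val_r2

gaps = {
    "default": train_val_gap(),
    "regularized": train_val_gap(subsample=0.7, max_features=0.5, min_samples_leaf=20),
}
for name, (tr, va, gap) in gaps.items():
    print(f"{name:12s} train={tr:.3f}  val={va:.3f}  gap={gap:.3f}")
```

The thing to watch is whether the gap shrinks without the validation score dropping much.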

**Step 3:** Using the setup that gave you the best results above, now try to find model parameters that maximize information extraction. The three main ones are:

 - `n_estimators`:  how many boosting rounds to use
 - `learning_rate`: how much shrinkage to use at each update (keep this from .05 to .2)
 - `max_depth`: how deep each tree in your model goes
 
 **Important:** fitting these models could take a looooong time. We don't have all night, so don't try to make this exhaustive; just do a little parameter exploration to see in which directions to push the model parameters to improve your results.
 
 Note your validation score before proceeding to the next step.
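Since each fit is slow, one possible shortcut is to fit once with a generous `n_estimators` and use `staged_predict` to read off the validation score after every boosting round, rather than refitting for each tree count. A minimal sketch on synthetic data follows; the small grid of `learning_rate` and `max_depth` values is arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(800, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=800)
X_tr, y_tr, X_va, y_va = X[:600], y[:600], X[600:], y[600:]

results = []
for learning_rate in (0.05, 0.2):
    for max_depth in (2, 4):
        model = GradientBoostingRegressor(
            n_estimators=150, learning_rate=learning_rate,
            max_depth=max_depth, random_state=0,
        )
        model.fit(X_tr, y_tr)
        # staged_predict yields predictions after 1, 2, ..., n_estimators trees
        val_curve = [r2_score(y_va, pred) for pred in model.staged_predict(X_va)]
        best_n = int(np.argmax(val_curve)) + 1
        results.append((learning_rate, max_depth, best_n, max(val_curve)))

best = max(results, key=lambda row: row[3])
print("learning_rate, max_depth, best n_estimators, val R^2:", best)
```

If the best round is the last one, that hints that more trees (or a higher learning rate) could still help; if it peaks early, the extra trees are mostly fitting noise.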

In [None]:
# your code here

In [43]:
?GradientBoostingRegressor

In [46]:
ordinal_encoder = ce.OrdinalEncoder(cols='day_of_week')
target_encoder = ce.TargetEncoder()
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.2, max_depth=5)

pipe2 = make_pipeline(ordinal_encoder, target_encoder, gbm)
pipe2.fit(X_train, y_train)
pipe2.score(X_val, y_val)




0.6256446375526998

In [47]:
ordinal_encoder = ce.OrdinalEncoder(cols='day_of_week')
target_encoder = ce.TargetEncoder()
gbm = GradientBoostingRegressor(
    n_estimators=400, 
    learning_rate=0.08, 
    max_depth=4, 
    min_samples_leaf=10)

pipe3 = make_pipeline(ordinal_encoder, target_encoder, gbm)
pipe3.fit(X_train, y_train)
pipe3.score(X_val, y_val)



0.633447066739966

In [53]:
score_train = pipe3.score(X_train, y_train) 
score_val = pipe3.score(X_val, y_val) 
score_test = pipe3.score(X_test, y_test) 

print(f"Using 400 trees / learning rate 0.08, depth of 4, and min samples per leaf of 10 gives:")
print(f"Training: {score_train};\tValidation: {score_val};\tTesting: {score_test}")


Using 400 trees / learning rate 0.08, depth of 4, and min samples per leaf of 10 gives:
Training: 0.6461142741765327;	Validation: 0.633447066739966;	Testing: 0.6506914614590198


In [58]:
ordinal_encoder = ce.OrdinalEncoder(cols='day_of_week')
target_encoder = ce.TargetEncoder()
gbm = GradientBoostingRegressor(
    n_estimators=300, 
    learning_rate=0.1, 
    max_depth=6, 
    min_samples_leaf=10)

pipe4 = make_pipeline(ordinal_encoder, target_encoder, gbm)
pipe4.fit(X_train, y_train)
pipe4.score(X_val, y_val)





0.6300630851425253

In [57]:
score_train = pipe4.score(X_train, y_train) 
score_val = pipe4.score(X_val, y_val) 
score_test = pipe4.score(X_test, y_test) 

print(f"Using 300 trees / learning rate 0.1, depth of 6, and min samples per leaf of 10 gives:")
print(f"Training: {score_train};\tValidation: {score_val};\tTesting: {score_test}")


Using 600 trees / learning rate 0.05, depth of 3, and min samples per leaf of 20 gives:
Training: 0.6169707994240498;	Validation: 0.6292984105382688;	Testing: 0.6344626266204751


**Step 4:** Take the best version of your model and your data, and fit it on **all** of your training + validation data. Now that we've found the best version of what we have to work with, we want to give it as many training samples as possible.

In [None]:
# your code here

In [None]:
ordinal_encoder = ce.OrdinalEncoder(cols='day_of_week')
target_encoder = ce.TargetEncoder()
gbm = GradientBoostingRegressor(
    n_estimators=400, 
    learning_rate=0.08, 
    max_depth=4, 
    min_samples_leaf=10)

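# Refit the best configuration on training + validation data combined
X_train_val = pd.concat([X_train, X_val], ignore_index=True)
y_train_val = pd.concat([y_train, y_val], ignore_index=True)
pipe3 = make_pipeline(ordinal_encoder, target_encoder, gbm)
pipe3.fit(X_train_val, y_train_val)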

**Step 5:** Score your model on your test set.

Note how your validation and test scores compare to one another.

In [None]:
# your code here

In [60]:
train_final, test_final = create_val_splits(df,return_val = False)

In [61]:
train_final.head(3)

Unnamed: 0,id,visitors,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,visitors_yesterday,...,month,quarter,reserve_visitors_missing,visitors_yesterday_missing,visitors_last_week_missing,visitors_7_day_ma_missing,reservations_7_day_ma,reservations_yesterday,reservations_7_day_ma_missing,reservations_yesterday_missing
0,air_00a91d42b08b08d9,35,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,35.0,...,7,3,True,True,True,True,0.571429,0.0,True,True
1,air_00a91d42b08b08d9,9,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0,35.0,...,7,3,False,False,True,True,0.571429,0.0,True,False
2,air_00a91d42b08b08d9,20,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,0.0,9.0,...,7,3,True,False,True,True,0.571429,4.0,True,False


In [63]:
X_train_f, y_train_f = train_final.drop('visitors',axis=1), train_final['visitors']
X_test_f, y_test_f = test_final.drop('visitors',axis=1), test_final['visitors']

In [64]:
ordinal_encoder = ce.OrdinalEncoder(cols='day_of_week')
target_encoder = ce.TargetEncoder()
gbm = GradientBoostingRegressor(
    n_estimators=400, 
    learning_rate=0.08, 
    max_depth=4, 
    min_samples_leaf=10)

pipe_final = make_pipeline(ordinal_encoder, target_encoder, gbm)
pipe_final.fit(X_train_f, y_train_f)
pipe_final.score(X_test_f, y_test_f)



0.6686809929381787

The final model's R² on the test set was 0.669.

In [65]:
import eli5



In [74]:
X_train_f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239673 entries, 0 to 239672
Data columns (total 21 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              239673 non-null  object 
 1   day_of_week                     239673 non-null  object 
 2   holiday                         239673 non-null  int64  
 3   genre                           239673 non-null  object 
 4   area                            239673 non-null  object 
 5   latitude                        239673 non-null  float64
 6   longitude                       239673 non-null  float64
 7   reserve_visitors                239673 non-null  float64
 8   visitors_yesterday              239673 non-null  float64
 9   visitors_last_week              239673 non-null  float64
 10  visitors_7_day_ma               239673 non-null  float64
 11  month                           239673 non-null  int64  
 12  quarter         

In [70]:
X_train_f.columns.to_list()

['id',
 'day_of_week',
 'holiday',
 'genre',
 'area',
 'latitude',
 'longitude',
 'reserve_visitors',
 'visitors_yesterday',
 'visitors_last_week',
 'visitors_7_day_ma',
 'month',
 'quarter',
 'reserve_visitors_missing',
 'visitors_yesterday_missing',
 'visitors_last_week_missing',
 'visitors_7_day_ma_missing',
 'reservations_7_day_ma',
 'reservations_yesterday',
 'reservations_7_day_ma_missing',
 'reservations_yesterday_missing']

In [73]:
pipe_final.named_steps

{'ordinalencoder': OrdinalEncoder(cols=['day_of_week'],
                mapping=[{'col': 'day_of_week', 'data_type': dtype('O'),
                          'mapping': Friday       1
 Saturday     2
 Monday       3
 Tuesday      4
 Wednesday    5
 Thursday     6
 Sunday       7
 NaN         -2
 dtype: int64}]),
 'targetencoder': TargetEncoder(cols=['id', 'genre', 'area']),
 'gradientboostingregressor': GradientBoostingRegressor(learning_rate=0.08, max_depth=4, min_samples_leaf=10,
                           n_estimators=400)}

In [83]:
feature_list = X_train_f.columns.to_list()
len(feature_list)

21

In [76]:
?eli5.explain_weights

In [84]:
eli5.explain_weights(pipe_final.named_steps['gradientboostingregressor'],top=30,feature_names=feature_list)

Weight,Feature
0.8566  ± 0.4915,visitors_7_day_ma
0.0706  ± 0.2349,day_of_week
0.0225  ± 0.2318,visitors_last_week
0.0105  ± 0.3222,visitors_yesterday
0.0064  ± 0.1867,longitude
0.0061  ± 0.1921,genre
0.0051  ± 0.1915,latitude
0.0047  ± 0.1758,id
0.0039  ± 0.2210,reservations_7_day_ma
0.0032  ± 0.1618,reserve_visitors
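If `eli5` isn't available, a similar ranking can be read straight off the fitted model's `feature_importances_` attribute, which should be closely related to the weights above. Here is a small sketch on synthetic data; the column names `strong`, `weak`, and `noise` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["strong", "weak", "noise"])
y = 5 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.5, size=500)

model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For the lab's pipeline, the equivalent would be pulling the `gradientboostingregressor` step out of `named_steps` and pairing its importances with the column list, as done for `eli5` above.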
