### Lab -- Data Prep & Gradient Boosting

Welcome to today's lab!  Today we're going to shift our attention to a more demanding dataset -- the restaurants data.  A quarter million rows, dates, and categorical data make this a more interesting, realistic use case of boosting.  

The point of today's lab will be to experiment with different encoding methods and model parameters.

In [2]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
import category_encoders as ce



**Step 1:**  Load in your dataset

In [3]:
# your code here
df = pd.read_csv('../../data/restaurants.csv',parse_dates=['visit_date'])

**Step 2:** Create a training and test set.

Make the test set the **last 15 observations for each restaurant**.

Turn each of these variables into `X_train, y_train`, and `X_test, y_test`, respectively.

**Hint:**  This harkens back to our grouping lab -- check this if you forget how to do it.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                252108 non-null  object        
 1   visit_date        252108 non-null  datetime64[ns]
 2   visitors          252108 non-null  int64         
 3   calendar_date     252108 non-null  object        
 4   day_of_week       252108 non-null  object        
 5   holiday           252108 non-null  int64         
 6   genre             252108 non-null  object        
 7   area              252108 non-null  object        
 8   latitude          252108 non-null  float64       
 9   longitude         252108 non-null  float64       
 10  reserve_visitors  108394 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(5)
memory usage: 21.2+ MB


In [47]:
df = df.fillna(0)

In [48]:
# From solutions manual:

# we'll sort the values
df.sort_values(by=['id', 'visit_date'], ascending=True, inplace=True)


# split into training & test
train = df.groupby('id').apply(lambda x: x.iloc[:-15])
test  = df.groupby('id').apply(lambda x: x.iloc[-15:])

# drop the date column -- no need for it
train.drop('visit_date', axis=1, inplace=True)
test.drop('visit_date', axis=1, inplace=True)

# and turn it into X & y
X_train, y_train = train.drop('visitors', axis=1), train['visitors']
X_test, y_test   = test.drop('visitors', axis=1), test['visitors']

In [None]:
# your code here

**Step 3:** Experiment with different encoding methods

Let's do a quick check to see how different encoding methods work out of the box on our dataset.

You're going to repeat the same process for each of `OrdinalEncoder`, `TargetEncoder`, and `OneHotEncoder` and see which one gives you the best results on our data.

**3a:** Use an `OrdinalEncoder` to transform your training set with the `fit_transform` method.  Then use the `transform` method to transform your test set.  

**Important:** An important detail here is that the test set is being transformed according to the values in your training set.  

If you are confused about how the transformation is happening, try using the `mapping()` method on your category encoder to get a hang of what's going on.

In [49]:
# your code here
oe = ce.OrdinalEncoder()
X_train_oe = X_train.copy()
X_test_oe = X_test.copy()
oe.fit_transform(X_train_oe)
X_train_oe = oe.transform(X_train_oe)
X_test_oe = oe.transform(X_test_oe)

In [50]:
X_train_oe

Unnamed: 0_level_0,Unnamed: 1_level_0,id,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
air_00a91d42b08b08d9,166836,1,1,1,0,1,1,35.694003,139.753595,0.0
air_00a91d42b08b08d9,166837,1,2,2,0,1,1,35.694003,139.753595,4.0
air_00a91d42b08b08d9,166838,1,3,3,0,1,1,35.694003,139.753595,0.0
air_00a91d42b08b08d9,166839,1,4,4,0,1,1,35.694003,139.753595,0.0
air_00a91d42b08b08d9,166840,1,5,5,0,1,1,35.694003,139.753595,0.0
...,...,...,...,...,...,...,...,...,...,...
air_fff68b929994bfbd,216629,829,216,3,0,11,87,35.708146,139.666288,0.0
air_fff68b929994bfbd,216630,829,217,4,0,11,87,35.708146,139.666288,0.0
air_fff68b929994bfbd,216631,829,227,5,0,11,87,35.708146,139.666288,2.0
air_fff68b929994bfbd,216632,829,435,6,0,11,87,35.708146,139.666288,8.0


In [51]:
X_test_oe[0:30]

Unnamed: 0_level_0,Unnamed: 1_level_0,id,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
air_00a91d42b08b08d9,167048,1,227.0,5,0,1,1,35.694003,139.753595,2.0
air_00a91d42b08b08d9,167049,1,435.0,6,0,1,1,35.694003,139.753595,8.0
air_00a91d42b08b08d9,167050,1,436.0,1,0,1,1,35.694003,139.753595,1.0
air_00a91d42b08b08d9,167051,1,-1.0,2,0,1,1,35.694003,139.753595,33.0
air_00a91d42b08b08d9,167052,1,-1.0,3,0,1,1,35.694003,139.753595,0.0
air_00a91d42b08b08d9,167053,1,-1.0,4,0,1,1,35.694003,139.753595,2.0
air_00a91d42b08b08d9,167054,1,-1.0,5,0,1,1,35.694003,139.753595,2.0
air_00a91d42b08b08d9,167055,1,-1.0,6,0,1,1,35.694003,139.753595,7.0
air_00a91d42b08b08d9,167056,1,-1.0,1,0,1,1,35.694003,139.753595,4.0
air_00a91d42b08b08d9,167057,1,-1.0,3,0,1,1,35.694003,139.753595,0.0


**3b:** Initialize a `GradientBoostingRegressor` with the default parameters, fit it on your training set, and score it on your test set.

In [None]:
# your code here

In [52]:
gbr_oe = GradientBoostingRegressor()

In [53]:
gbr_oe.fit(X_train_oe,y_train)

GradientBoostingRegressor()

In [57]:
gbr_oe.score(X_train_oe,y_train) # Score on training set

0.16976661376016133

In [58]:
gbr_oe.score(X_test_oe,y_test) # Score on actual test

0.1539151787555132

**3c:** Repeat these same steps for the `TargetEncoder` and the `OneHotEncoder`

**Important:** The `OneHotEncoder` can take awhile to fit.  If nothing happens in around 4 minutes, just cancel the process and try it again later on when you have more time.

# Target encoding

In [60]:
# your code here
te = ce.TargetEncoder()
X_train_te = X_train.copy()
X_test_te = X_test.copy()
te.fit_transform(X_train_te,y_train)
X_train_te = te.transform(X_train_te)
X_test_te = te.transform(X_test_te)



  elif pd.api.types.is_categorical(cols):


In [61]:
gbr_te = GradientBoostingRegressor()

In [62]:
gbr_te.fit(X_train_oe,y_train)

GradientBoostingRegressor()

In [63]:
gbr_te.score(X_train_te,y_train) # Score on training set # -0.9

-0.906887999634423

In [64]:
gbr_oe.score(X_test_te,y_test) # Score on actual test # - 0.8

-0.8187349268600508

# OneHotEncoding

In [67]:
ohc = ce.OneHotEncoder()
X_train_ohc= X_train.copy()
X_test_ohc= X_test.copy()
ohc.fit_transform(X_train_ohc)
X_train_ohc= ohc.transform(X_train_ohc)
X_test_ohc= ohc.transform(X_test_ohc)

  elif pd.api.types.is_categorical(cols):


In [68]:
gbr_ohc = GradientBoostingRegressor()

In [69]:
gbr_ohc.fit(X_train_ohc,y_train)

GradientBoostingRegressor()

In [70]:
gbr_te.score(X_train_ohc,y_train)

ValueError: Number of features of the model must match the input. Model n_features is 9 and input n_features is 1420 

In [None]:
gbr_oe.score(X_test_ohc,y_test)

In [None]:
# Expect to get similar results for Ordinal and OneHot encoding when using tree based models.

**Step 4:** Look at your most important features

Similar to the previous lab, take your model's most important features and load them into a dataframe to see what's driving your results.

In [None]:
# your code here

**Step 5:** Can model parameters improve your score?  

Take the **best** version of your encoding method and try changing some parameters with your model to see if it improves your score.  

You won't have a ton of time to do this, but try some of the following:

 - Try increasing the number of trees your model uses -- 250, 500, or perhaps more trees if time permits
 - Try experimenting with differing values for tree depth -- the default is 3, but perhaps 4, 5 or 6 works better
 - Try improving fitting time by introducing some **randomness** into your data with the following two model parameters:
   - `subsample`: this dictates what proportion of your data will be used for each tree.  A value of `0.7` means 70% of your data will be used for a particular tree, chosen at random
   - `max_features`: this is the portion of columns that are used at each individual split.  If you enter an integer the model will randomly select that number of columns, if you enter a decimal it will randomly select that portion of columns.
   - It can be very useful to find the most sparse model that will still give you comparable results.  Ie, if you find a gbm with 500 trees and a max_depth of 4 gives you the best results, it can be very beneficial if you can get those same results with a `subsample` value of 0.6 and a `max_features` score of 0.7, because your model will fit ~50% faster.
   
This step is open ended, so we will likely have to end class in the middle of it.

In [None]:
# your code here