# Modeling Demand

**Target variable**: `Proj_TRN_RoomsPickup`. How many transient rooms will be booked for each stay date, from this point (8/1/17) forward, at current prices?

In [1]:
import pandas as pd
import numpy as np

from agg import prep_demand_features
from demand_features import rf_cols, rf2_cols

pd.options.display.max_rows = 160
pd.options.display.max_columns = 250
pd.options.display.max_colwidth = None

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import r2_score

from xgboost import XGBRegressor

from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV, HalvingGridSearchCV

DATE_FMT = "%Y-%m-%d"



In [2]:
print(len(rf_cols))
len(set(rf_cols))

40


40

In [3]:
print(len(rf2_cols))
len(set(rf2_cols))

54


54

In [5]:
df_stats = pd.read_csv("../data/h2_stats.csv")

## Splitting Up Our Data for Train/Test

Our training set contains every row with a `StayDate` before the as-of date (2017-08-01).

Our test set contains the 31 stay dates starting on the as-of date, selected as the rows whose `AsOfDate` is 2017-08-01. Predictions on this set will be used to provide price recommendations later on.
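As a rough illustration of the split described above, here is a minimal sketch on a toy frame. The column names `StayDate`, `AsOfDate`, and `ACTUAL_TRN_RoomsPickup` come from this notebook; the toy rows and the `AS_OF` constant are made up for the example, and the dates are parsed to timestamps rather than compared as strings.

```python
import pandas as pd

AS_OF = pd.Timestamp("2017-08-01")  # hypothetical constant for the as-of date

# Toy stand-in for df_stats: one row per (AsOfDate, StayDate) pair.
toy = pd.DataFrame(
    {
        "AsOfDate": ["2017-07-30", "2017-07-31", "2017-08-01", "2017-08-01"],
        "StayDate": ["2017-07-30", "2017-07-31", "2017-08-01", "2017-08-02"],
        "ACTUAL_TRN_RoomsPickup": [0, 2, 5, 8],
    }
)
toy["AsOfDate"] = pd.to_datetime(toy["AsOfDate"])
toy["StayDate"] = pd.to_datetime(toy["StayDate"])

train = toy.loc[toy["StayDate"] < AS_OF]   # history before the as-of date
test = toy.loc[toy["AsOfDate"] == AS_OF]   # snapshot taken on the as-of date

# Check that the two masks select disjoint rows in this toy data.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

Running the same disjointness check on the real `df_train` and `df_test` indexes is a cheap way to confirm that no test row leaks into training.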

In [6]:
mask = (df_stats["StayDate"] < '2017-08-01')
test_mask = (df_stats['AsOfDate'] == '2017-08-01')
df_train = df_stats.loc[mask].copy()
df_test = df_stats.loc[test_mask].copy()

X_train = df_train[rf2_cols].copy()
X_test = df_test[rf2_cols].copy()
y_train = df_train['ACTUAL_TRN_RoomsPickup'].copy()
y_test = df_test['ACTUAL_TRN_RoomsPickup'].copy()

In [7]:
X_train.shape

(11216, 54)

In [8]:
X_train.head()

Unnamed: 0,week_of_year,RoomsOTB,RoomsOTB_STLY,TRN_RoomsOTB,TRN_RoomsOTB_STLY,TRNP_RoomsOTB,TRNP_RoomsOTB_STLY,WE,DaysUntilArrival,RemSupply,RemSupply_STLY,Mon,Sat,Sun,Thu,Tue,Wed,ACTUAL_RoomsPickup_STLY,ACTUAL_TRN_RoomsPickup_STLY,ACTUAL_TRNP_RoomsPickup_STLY,OTB_GapToLYA_RoomsSold,OTB_GapToLYA_TRN_RoomsSold,OTB_GapToLYA_TRNP_RoomsSold,TM30_RoomsPickup,TM30_RoomsPickup_STLY,TM30_TRN_RoomsPickup,TM30_TRN_RoomsPickup_STLY,TM30_TRNP_RoomsPickup,TM30_TRNP_RoomsPickup_STLY,TM15_RoomsPickup,TM15_RoomsPickup_STLY,TM15_TRN_RoomsPickup,TM15_TRN_RoomsPickup_STLY,TM15_TRNP_RoomsPickup,TM15_TRNP_RoomsPickup_STLY,TM05_RoomsPickup,TM05_RoomsPickup_STLY,TM05_TRN_RoomsPickup,TM05_TRN_RoomsPickup_STLY,TM05_TRNP_RoomsPickup,TM05_TRNP_RoomsPickup_STLY,Pace_RoomsOTB,Pace_RemSupply,Pace_TRN_RoomsOTB,Pace_TRNP_RoomsOTB,Pace_TM30_RoomsPickup,Pace_TM30_TRN_RoomsPickup,Pace_TM30_TRNP_RoomsPickup,Pace_TM15_RoomsPickup,Pace_TM15_TRN_RoomsPickup,Pace_TM15_TRNP_RoomsPickup,Pace_TM05_RoomsPickup,Pace_TM05_TRN_RoomsPickup,Pace_TM05_TRNP_RoomsPickup
0,30.0,212.0,34.0,103.0,15.0,109.0,19.0,False,0.0,24.0,211.0,False,False,True,False,False,False,0.0,0.0,0.0,-178.0,-88.0,-90.0,-3.0,-12.0,3.0,2.0,-6.0,-14.0,0.0,-9.0,6.0,3.0,-6.0,-12.0,-3.0,-10.0,3.0,3.0,-6.0,-13.0,178.0,-187.0,88.0,90.0,9.0,1.0,8.0,9.0,3.0,6.0,7.0,0.0,7.0
1,31.0,189.0,48.0,149.0,16.0,40.0,32.0,False,1.0,49.0,200.0,True,False,False,False,False,False,3.0,2.0,0.0,-138.0,-131.0,-8.0,7.0,-28.0,8.0,3.0,-1.0,-31.0,15.0,-26.0,16.0,2.0,-1.0,-28.0,10.0,-18.0,11.0,0.0,-1.0,-18.0,141.0,-151.0,133.0,8.0,35.0,5.0,30.0,41.0,14.0,27.0,28.0,11.0,17.0
2,31.0,210.0,30.0,172.0,14.0,38.0,16.0,False,2.0,32.0,209.0,False,False,False,False,True,False,8.0,5.0,0.0,-172.0,-153.0,-22.0,14.0,-16.0,13.0,2.0,1.0,-18.0,17.0,-16.0,16.0,1.0,1.0,-17.0,11.0,-17.0,10.0,0.0,1.0,-17.0,180.0,-177.0,158.0,22.0,30.0,11.0,19.0,33.0,15.0,18.0,28.0,10.0,18.0
3,31.0,218.0,80.0,178.0,16.0,40.0,64.0,False,3.0,27.0,156.0,False,False,False,False,False,True,21.0,13.0,2.0,-117.0,-149.0,26.0,4.0,-35.0,5.0,0.0,-1.0,-35.0,2.0,-2.0,3.0,0.0,-1.0,-2.0,-1.0,-19.0,0.0,-1.0,-1.0,-18.0,138.0,-129.0,162.0,-24.0,39.0,5.0,34.0,4.0,3.0,1.0,18.0,1.0,17.0
4,31.0,213.0,80.0,181.0,10.0,32.0,70.0,False,4.0,33.0,154.0,False,False,False,True,False,False,33.0,23.0,2.0,-100.0,-148.0,40.0,6.0,-5.0,6.0,-1.0,0.0,-4.0,2.0,-3.0,2.0,-1.0,0.0,-2.0,-1.0,0.0,-1.0,0.0,0.0,0.0,133.0,-121.0,171.0,-38.0,11.0,7.0,4.0,5.0,3.0,2.0,-1.0,-1.0,0.0


## LINEAR REGRESSION

Failed to generalize: the mean cross-validated $R^2$ is about 0.75, but the score on the test set drops to about 0.21. Our target variable is not a linear combination of the rate and revenue features that we know have an impact on demand.

In [9]:
%%time
lm = LinearRegression()
lr_model = lm.fit(X_train, y_train)
scores = cross_val_score(lm, X_train, y_train, scoring='r2', cv=5)
scores.mean()

CPU times: total: 641 ms
Wall time: 649 ms


np.float64(0.7478181826688248)

In [10]:
lr_model.score(X_test, y_test)

0.20894690077559708
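For reference, both `scoring='r2'` and the `score` method of scikit-learn regressors report the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on made-up values (the helper `r2` and the sample numbers are illustrative, not part of the notebook) to show how the score behaves for perfect, mean-only, and badly wrong predictions.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]              # made-up pickup values
perfect = r2(y_true, y_true)                # predictions match exactly
baseline = r2(y_true, [6.0] * 4)            # always predict the mean
worse = r2(y_true, [9.0, 7.0, 5.0, 3.0])    # trend predicted backwards

print(perfect, baseline, worse)
```

A score near 0.21 therefore means the linear model explains only a small share of the variance in the test set, not much better than predicting the mean.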

## RANDOM FOREST MODEL

I had high hopes for RF, and on the resort data it came through. It works there because of the amount and quality of the features I have engineered, despite the small training set.

That's just not the case for H2, even after adding back in TRNT.

In [11]:
%%time
rfm = RandomForestRegressor(n_jobs=-1, random_state=21)
rf_model = rfm.fit(X_train, y_train)
scores = cross_val_score(rfm, X_train, y_train, scoring='r2', cv=5)
scores.mean()

CPU times: total: 3min 24s
Wall time: 59.1 s


np.float64(0.7935590705679733)

In [12]:
rf_model.score(X_test, y_test)

0.4640531596946865

In [13]:
len(rf2_cols)

54

## XGBOOST MODEL (GRADIENT BOOSTING TREES)

XGBoost failed to generalize as well (mean CV $R^2$ of about 0.80 versus about 0.59 on the test set), likely due to the small training sample.

In [14]:
# %%time
xgbm = XGBRegressor(n_jobs=-1, random_state=21)
xgb_model = xgbm.fit(X_train, y_train)
scores = cross_val_score(xgbm, X_train, y_train, scoring='r2', cv=5)
scores.mean()

np.float64(0.7994357274460274)

In [15]:
xgbm.score(X_test, y_test)

0.5876031314617223
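Lining up the numbers reported so far makes the generalization gap easier to compare across the three baselines. The sketch below only reuses the scores printed in the cells above; the `scores` and `gaps` names are introduced here for illustration.

```python
# Mean 5-fold CV R^2 and held-out R^2, copied from the cell outputs above.
scores = {
    "LinearRegression": (0.7478181826688248, 0.20894690077559708),
    "RandomForestRegressor": (0.7935590705679733, 0.4640531596946865),
    "XGBRegressor": (0.7994357274460274, 0.5876031314617223),
}

# Gap = CV score minus test score; a large gap suggests poor generalization.
gaps = {name: cv - test for name, (cv, test) in scores.items()}

for name, (cv, test) in scores.items():
    print(f"{name:<22} cv={cv:.3f} test={test:.3f} gap={gaps[name]:.3f}")
```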

## MOVING FORWARD WITH RANDOM FOREST....

The H2 model is not as good (not even close). I'm hoping it can be fixed with hyperparameter tuning, but the gap is more likely due to the features not predicting city demand as well as they predict resort demand. After all, resorts tend to have more seasonal demand than city hotels.


## Successive Halving Grid Search

In [16]:
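# Candidate depths: 10, 13, ..., 37, plus None (no depth limit).
max_depth = list(range(10, 40, 3))
max_depth.append(None)

# Exhaustive grid: 8 values of n_estimators x 11 values of max_depth = 88 candidates.
param_grid = {
    "n_estimators": range(300, 700, 50),
    "max_depth": max_depth,
}

rf = RandomForestRegressor()
rf_hgs = HalvingGridSearchCV(rf, param_grid, verbose=10, random_state=20, cv=5, n_jobs=-1)

rf_hgs.fit(X_train, y_train)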

n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 138
max_resources_: 11216
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 88
n_resources: 138
Fitting 5 folds for each of 88 candidates, totalling 440 fits
----------
iter: 1
n_candidates: 30
n_resources: 414
Fitting 5 folds for each of 30 candidates, totalling 150 fits
----------
iter: 2
n_candidates: 10
n_resources: 1242
Fitting 5 folds for each of 10 candidates, totalling 50 fits
----------
iter: 3
n_candidates: 4
n_resources: 3726
Fitting 5 folds for each of 4 candidates, totalling 20 fits
----------
iter: 4
n_candidates: 2
n_resources: 11178
Fitting 5 folds for each of 2 candidates, totalling 10 fits
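The candidate and resource counts in the log above follow a simple pattern: each iteration keeps roughly one third of the candidates and triples the number of training samples. The sketch below reproduces that schedule in plain Python. It is a simplified reading of how the default `min_resources='exhaust'` setting appears to behave, checked only against this log; the helper `halving_schedule` is hypothetical and ignores details such as `aggressive_elimination` and the lower bound on `min_resources`.

```python
from math import ceil


def halving_schedule(n_candidates, n_samples, factor=3):
    """Return [(n_candidates, n_resources), ...] for each halving iteration."""
    # Iterations required: 1 + floor(log_factor(n_candidates)), using integers.
    n_iterations = 1
    while factor ** n_iterations <= n_candidates:
        n_iterations += 1

    # Start small enough that the last iteration uses (almost) all samples.
    min_resources = n_samples // factor ** (n_iterations - 1)

    schedule = []
    for i in range(n_iterations):
        schedule.append((n_candidates, min_resources * factor ** i))
        n_candidates = ceil(n_candidates / factor)  # keep the top 1/factor
    return schedule


# Grid from the cell above: 8 n_estimators values x (10 depths + None).
grid_size = len(range(300, 700, 50)) * (len(range(10, 40, 3)) + 1)
schedule = halving_schedule(grid_size, n_samples=11216)

for n_cand, n_res in schedule:
    print(f"n_candidates={n_cand:>3}  n_resources={n_res}")
```

Note that the first iteration ranks all 88 candidates on only 138 rows, so early eliminations are noisy for a data set of this size.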


An earlier randomized halving search (`HalvingRandomSearchCV`) did not improve the score much. Its resulting params were:
{'n_estimators': 740,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 80}

Trying `HalvingGridSearchCV` now; maybe it can tell me something.

In [17]:
rf_hgs.score(X_test, y_test)

0.46164266736740356

In [18]:
rf_hgs.best_params_

{'max_depth': None, 'n_estimators': 650}

In [None]:
# The search object has no to_csv; export its cv_results_ dict via a DataFrame instead.
pd.DataFrame(rf_hgs.cv_results_).to_csv("halving_random_results_h2.csv", index=False)

Parameters of random grid search:
```
random_grid = {
    "n_estimators": range(200, 2000, 100),
    "max_features": ["auto", "sqrt"],
    "max_depth": range(10, 110, 11),
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False]
}

rf = RandomForestRegressor()
rf_random = (RandomizedSearchCV(rf, random_grid, verbose=2, n_iter=50, random_state=42, n_jobs=-1))

rf_random.fit(X_train, y_train)
```

Results of random grid search:

```
{'n_estimators': 500,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 43,
 'bootstrap': True}
```

Score: 0.6519058137402494
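To get a feel for how much of the space the randomized search above actually explored, the sketch below counts the possible combinations and compares that with `n_iter=50`. The grid is copied from the block above; the names `n_combinations` and `coverage` are introduced here for illustration.

```python
from math import prod

# Search space copied from the random-search block above.
random_grid = {
    "n_estimators": range(200, 2000, 100),
    "max_features": ["auto", "sqrt"],
    "max_depth": range(10, 110, 11),
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Total number of distinct settings = product of the option counts.
n_combinations = prod(len(values) for values in random_grid.values())
n_iter = 50  # number of sampled settings, as in the RandomizedSearchCV call
coverage = n_iter / n_combinations

print(f"{n_iter} of {n_combinations} combinations sampled ({coverage:.2%})")
```

Separately, `max_features="auto"` was removed in scikit-learn 1.3. For `RandomForestRegressor` it meant using all features, so `1.0` is the equivalent setting when rerunning this search on a current version.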

## Brute Force Hyperparameter Tuning (GridSearchCV)

Best params thus far.

Setup params:
```
GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'bootstrap': [True], 'max_depth': [30, 56, 2],
                         'max_features': ['auto'],
                         'min_samples_split': [2, 3, 4, 8],
                         'n_estimators': range(300, 800, 40)},
             verbose=10)
```
Best resulting params:
```
{'bootstrap': True,
 'max_depth': 56,
 'max_features': 'auto',
 'min_samples_split': 3,
 'n_estimators': 300}
```

$R^2$ CV score: `0.7785714200550233`
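One detail worth checking in the setup above: `max_depth` is the literal list `[30, 56, 2]`, so only the depths 30, 56, and 2 were tried. If a stepped sweep was intended (an assumption, suggested by the `range(32,56,2)` used in the next round), the grid would have looked quite different. The sketch below compares the two readings; the variable names are illustrative.

```python
# As written in the setup above: a list holding three literal depths.
as_written = [30, 56, 2]

# What a stepped sweep would look like (an assumption about the intent).
stepped = list(range(30, 56, 2))

n_estimators = range(300, 800, 40)
min_samples_split = [2, 3, 4, 8]

# Candidates per grid = product of the option counts (bootstrap and
# max_features each contribute a single option).
n_as_written = len(as_written) * len(min_samples_split) * len(n_estimators)
n_stepped = len(stepped) * len(min_samples_split) * len(n_estimators)

print(as_written, "->", n_as_written, "candidates")
print(stepped, "->", n_stepped, "candidates")
```

The best `max_depth` of 56 reported above is the largest of the three values actually tried, and it would not have been part of a `range(30, 56, 2)` sweep at all, since the stop value is excluded.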


### Round 2 (Best Results, **Final Model**)


Param grid:
```
rf_grid = {
    "n_estimators": range(150, 500, 50),
    "max_features": ['auto'],
    "max_depth": range(32,56,2),
    "bootstrap": [True],
    "min_samples_split": [2, 3, 4]
}
```

And the **results**:
```
{'bootstrap': True,
 'max_depth': 48,
 'min_samples_split': 2,
 'n_estimators': 150}
```
$R^2$ CV score: `0.779336423856766`
 
### Round 3 (Worse than Round 2)

Param grid:
```
rf_grid = {
    "n_estimators": range(75, 225, 25),
    "max_depth": [47, 48, 49],
    "bootstrap": [True],
    "min_samples_split": [2],
}
```

And the **results**:

Best params:
```
{'bootstrap': True,
 'max_depth': 47,
 'min_samples_split': 2,
 'n_estimators': 125}
```
$R^2$ CV score: `0.7775378755829061`

In [None]:
# rf_grid = {
#     "n_estimators": range(75, 200, 25),
#     "max_depth": [47, 48],
#     "bootstrap": [True],
#     "min_samples_split": [2],
# }
# rfm = RandomForestRegressor()

# rf_grid = GridSearchCV(rfm, rf_grid, n_jobs=-1, verbose=10, cv=5)
# rf_grid.fit(X1_train, y1_train)

In [None]:
# rf_grid.best_params_

In [None]:
# rf_grid.best_score_

In [None]:
# rf_grid.score(X1_test, y1_test)

## Final Model

In [22]:
rf = RandomForestRegressor(n_estimators=350, max_depth=25, n_jobs=-1)

rf.fit(X_train, y_train)

In [1]:
rf.score(X_test, y_test)


## Now that we have our model, let's get it in the simulation so we can evaluate our results.

Head over to `demand_model_evaluation.ipynb` for more.