## 6.10 Homework

The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).

In this homework we'll again use the New York City Airbnb Open Data dataset - the same one we used in homeworks 2 and 3.

You can take it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

Let's load the data:

In [47]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [48]:
columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
    'price'
]

df = pd.read_csv('AB_NYC_2019.csv', usecols=columns)
df.reviews_per_month = df.reviews_per_month.fillna(0)

* Apply the log transform to `price`
* Do train/validation/test split with 60%/20%/20% distribution.
* Use the `train_test_split` function and set the `random_state` parameter to 1

In [49]:
price_logs = np.log1p(df.price)
df['price'] = price_logs
df.price

0        5.010635
1        5.420535
2        5.017280
3        4.499810
4        4.394449
           ...   
48890    4.262680
48891    3.713572
48892    4.753590
48893    4.025352
48894    4.510860
Name: price, Length: 48895, dtype: float64
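
As a small illustration of the transform used above, the sketch below applies `np.log1p` to a few made-up prices and recovers them with `np.expm1`. The array values and the names `prices`, `log_prices`, and `restored` are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

# Hypothetical prices in dollars, including a zero to show why log1p is used
prices = np.array([0.0, 49.0, 149.0, 999.0])

# log1p(x) = log(1 + x), so a price of 0 maps to 0 instead of -inf
log_prices = np.log1p(prices)

# expm1 is the inverse: it converts log-scale values back to dollars
restored = np.expm1(log_prices)

print(log_prices)
print(restored)
```

Predictions from a model trained on the transformed target are on the same log scale, so `np.expm1` would be needed to read them as prices.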

In [50]:
from sklearn.model_selection import train_test_split

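# Hold out 20% of the data for testing
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
# 25% of the remaining 80% gives a 20% validation set.
# Note: the instructions ask for random_state=1, but this run used 11 here,
# so the numbers below may differ slightly from the answer options.
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)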

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.price
y_val = df_val.price
y_test = df_test.price

del df_train['price']
del df_val['price']
del df_test['price']
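
The two-step split above can be checked on a toy array. This is a minimal sketch that assumes 1000 rows (a made-up size) and uses `random_state=1` for both calls; it only shows that `test_size=0.25` on the remaining 80% yields a 60%/20%/20% split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the dataframe: 1000 row indices (hypothetical size)
rows = np.arange(1000)

# First split: hold out 20% of all rows for testing
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% equals 20% of the original data
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

print(len(train), len(val), len(test))
```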

Now, use `DictVectorizer` to turn train and validation into matrices:

In [70]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import export_text

In [71]:
dv = DictVectorizer(sparse=False)

train_dicts = df_train.fillna(0).to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val.fillna(0).to_dict(orient='records')
X_val = dv.transform(val_dicts)

## Question 1

Let's train a decision tree regressor to predict the price variable. 

* Train a model with `max_depth=1`

In [72]:
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=1)

In [73]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))

|--- room_type=Entire home/apt <= 0.50
|   |--- value: [4.29]
|--- room_type=Entire home/apt >  0.50
|   |--- value: [5.16]
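
The `value` shown in each leaf is the mean target of the training rows that fall into it. A minimal sketch on made-up data (the arrays and the name `stump` are illustrative assumptions) shows this for a depth-1 tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-hot feature (1 = entire home) and log-price targets
X_demo = np.array([[0], [0], [0], [1], [1], [1]])
y_demo = np.array([4.0, 4.2, 4.4, 5.0, 5.2, 5.4])

stump = DecisionTreeRegressor(max_depth=1).fit(X_demo, y_demo)

# Each leaf predicts the mean target of the training rows that reach it
leaf_predictions = stump.predict(np.array([[0], [1]]))
print(leaf_predictions)
```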



Which feature is used for splitting the data?

* `room_type`
* `neighbourhood_group`
* `number_of_reviews`
* `reviews_per_month`

## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1`  (optional - to make training faster)

In [89]:
from sklearn.ensemble import RandomForestRegressor

In [90]:
def rmse(y, y_pred):
    # Root mean squared error: square the residuals, average, take the root
    se = (y - y_pred) ** 2
    mse = se.mean()
    return np.sqrt(mse)
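
As a sanity check, a hand-written RMSE like the one above should agree with the square root of scikit-learn's `mean_squared_error`. The sketch below redefines `rmse` so it runs on its own; the arrays `y_true` and `y_hat` are made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y, y_pred):
    # Same formula as in the notebook, written in one line
    return np.sqrt(((y - y_pred) ** 2).mean())

# Hypothetical log-scale targets and predictions
y_true = np.array([5.0, 4.5, 4.0])
y_hat = np.array([5.5, 4.5, 3.5])

manual = rmse(y_true, y_hat)
reference = np.sqrt(mean_squared_error(y_true, y_hat))
print(manual, reference)
```

Because the target was transformed with `np.log1p`, the RMSE values in this homework are in log units, not dollars.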

In [95]:
clf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_val)

In [96]:
rmse(y_val , y_pred).round(3)

0.466

What's the RMSE of this model on validation?

* 0.059
* 0.259
* 0.459
* 0.659

The closest option to the 0.466 obtained here is 0.459.

## Question 3

Now let's experiment with the `n_estimators` parameter

* Try different values of this parameter from 10 to 200 with step 10
* Set `random_state` to `1`
* Evaluate the model on the validation dataset

In [112]:
n_e = np.linspace(start = 10,stop = 200, num = 20)
n_e

array([ 10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100., 110.,
       120., 130., 140., 150., 160., 170., 180., 190., 200.])

In [116]:
for i in n_e:
    # linspace returns floats, but n_estimators must be an integer
    f = i.astype(int)
    clf = RandomForestRegressor(n_estimators=f, random_state=1, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_val)
    rmse_ = rmse(y_val, y_pred).round(4)
    print('%5s   %5s' % (f, rmse_))

   10   0.4655
   20   0.4526
   30   0.4502
   40   0.4493
   50   0.4476
   60   0.4466
   70   0.4462
   80   0.4461
   90   0.4455
  100   0.4453
  110   0.4451
  120   0.4448
  130   0.4448
  140   0.4445
  150   0.4442
  160   0.4442
  170   0.4442
  180   0.4442
  190   0.4442
  200   0.4442


After which value of `n_estimators` does RMSE stop improving?

- 10
- 50
- 70
- 120

Of the listed options, 120 is the answer: RMSE still improves slightly after 120 (from 0.4448 to 0.4442 at 150), but the gain is negligible.
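
Retraining the forest from scratch for every value of `n_estimators` repeats a lot of work. One possible alternative, sketched here on synthetic data rather than the Airbnb dataset, is the `warm_start=True` option of `RandomForestRegressor`, which keeps the trees already fitted and only adds new ones. The data, the shorter range of 10 to 50, and the names `rf` and `scores` are assumptions for illustration, and this approach may not reproduce the exact numbers above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Airbnb dataset)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

# warm_start=True reuses the existing trees on each call to fit
rf = RandomForestRegressor(n_estimators=10, warm_start=True,
                           random_state=1, n_jobs=-1)

scores = {}
for n in range(10, 51, 10):
    rf.set_params(n_estimators=n)   # grow the forest to n trees
    rf.fit(X_tr, y_tr)              # only the missing trees are trained
    pred = rf.predict(X_va)
    scores[n] = float(np.sqrt(((y_va - pred) ** 2).mean()))

for n, score in scores.items():
    print('%5s   %.4f' % (n, score))
```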

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
* Fix the random seed: `random_state=1`

In [118]:
for max_dep in [10, 15, 20, 25]:
    print(max_dep)
    for i in n_e:
        # linspace returns floats, but n_estimators must be an integer
        f = i.astype(int)
        clf = RandomForestRegressor(max_depth=max_dep, n_estimators=f,
                                    random_state=1, n_jobs=-1)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_val)
        rmse_ = rmse(y_val, y_pred).round(4)
        print('%5s   %5s' % (f, rmse_))
    

10
   10   0.4492
   20   0.4459
   30   0.4451
   40   0.4449
   50   0.4445
   60   0.4442
   70   0.4442
   80   0.4442
   90   0.4442
  100   0.4443
  110   0.4443
  120   0.4442
  130   0.4442
  140   0.4441
  150   0.4438
  160   0.4438
  170   0.4437
  180   0.4437
  190   0.4438
  200   0.4438
15
   10   0.455
   20   0.4469
   30   0.4452
   40   0.4441
   50   0.4435
   60   0.4425
   70   0.4422
   80   0.4418
   90   0.4415
  100   0.4415
  110   0.4414
  120   0.4412
  130   0.4411
  140   0.4409
  150   0.4406
  160   0.4406
  170   0.4405
  180   0.4405
  190   0.4405
  200   0.4406
20
   10   0.4611
   20   0.4494
   30   0.4474
   40   0.4463
   50   0.445
   60   0.4441
   70   0.4439
   80   0.4437
   90   0.4434
  100   0.4432
  110   0.4431
  120   0.443
  130   0.4429
  140   0.4428
  150   0.4425
  160   0.4425
  170   0.4424
  180   0.4424
  190   0.4425
  200   0.4425
25
   10   0.4646
   20   0.4519
   30    0.45
   40   0.4487
   50   0.4472
   60   0.4461
  

What's the best `max_depth`?

* 10
* 15
* 20
* 25

The best `max_depth` is 15: among the results shown, it reaches the lowest validation RMSE (0.4405).

Bonus question (not graded):

Will the answer be different if we change the seed for the model?
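
To compare depths without reading the full printout, the scores can be collected in a DataFrame and summarized per `max_depth`. This is a sketch on synthetic data with a smaller grid; the data, the grid values, and the names `results` and `best_per_depth` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Airbnb dataset)
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

rows = []
for depth in [2, 5, 10]:
    for n in [10, 20, 30]:
        rf = RandomForestRegressor(max_depth=depth, n_estimators=n, random_state=1)
        rf.fit(X_tr, y_tr)
        err = np.sqrt(((y_va - rf.predict(X_va)) ** 2).mean())
        rows.append((depth, n, err))

results = pd.DataFrame(rows, columns=['max_depth', 'n_estimators', 'rmse'])

# One row per max_depth with its lowest RMSE across n_estimators
best_per_depth = results.groupby('max_depth')['rmse'].min()
print(best_per_depth)
print('best max_depth:', best_per_depth.idxmin())
```

Wrapping the same loop in an outer loop over several `random_state` values would be one way to explore the bonus question.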

## Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split.
When doing so, we can calculate the "gain": the reduction in impurity from before the split to after it.
This gain is quite useful for understanding which features are important
for tree-based models.

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
    * `n_estimators=10`,
    * `max_depth=20`,
    * `random_state=1`,
    * `n_jobs=-1` (optional)
* Get the feature importance information from this model

What's the most important feature? 

* `neighbourhood_group=Manhattan`
* `room_type=Entire home/apt`
* `longitude`
* `latitude`
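
The steps for this question can be sketched as follows. The example uses synthetic data with made-up feature names (`signal`, `weak`, `noise`) so that it runs on its own; it does not reveal the answer for the Airbnb data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on the first column,
# weakly on the second, and not at all on the third
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
feature_names = ['signal', 'weak', 'noise']  # hypothetical names

rf = RandomForestRegressor(n_estimators=10, max_depth=20,
                           random_state=1, n_jobs=-1)
rf.fit(X, y)

# Pair each feature name with its importance and sort from high to low
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print('%-8s %.4f' % (name, importance))
```

In the homework, the feature names would come from the fitted `DictVectorizer` instead of a hand-written list.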

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter.

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` first to `0.1` and then to `0.01`

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* 0.01

## Submit the results


Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline


The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.

