## 6.10 Homework

The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).

In this homework we'll again use the New York City Airbnb Open Data dataset, the same one we used in homeworks 2 and 3.

You can take it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up to Kaggle.

Let's load the data:

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365',
    'price'
]

df = pd.read_csv('AB_NYC_2019.csv', usecols=columns)
df.reviews_per_month = df.reviews_per_month.fillna(0)

* Apply the log transform to `price`
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)


df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)


# log transformation
y_train = np.log1p(df_train.price.values)
y_val = np.log1p(df_val.price.values)
y_test = np.log1p(df_test.price.values)

del df_train['price']
del df_test['price']
del df_val['price']

In [5]:
df_train.shape, df_test.shape, df_val.shape

((29337, 9), (9779, 9), (9779, 9))
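As a quick sanity check on the split, the two-step `test_size` values can be worked out by hand: taking 25% of the remaining 80% gives another 20% of the full data. The sketch below also shows that `log1p` is undone by `expm1`, which is how a log-scale prediction would be turned back into a price. The example price of 150 is made up for illustration.

```python
import math

test_share = 0.2                      # first split: 20% of all rows go to test
val_share = (1 - test_share) * 0.25   # second split: 25% of the remaining 80%
train_share = 1 - test_share - val_share

print(round(train_share, 2), round(val_share, 2), round(test_share, 2))

# log1p(x) = log(1 + x) is defined at x = 0, and expm1 is its inverse
price = 150.0                         # made-up example price
log_price = math.log1p(price)
recovered = math.expm1(log_price)
print(log_price, recovered)
```

The same shares show up in the shapes above: 29337 of the 48895 rows are in train, and 9779 each in validation and test.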

Now, use `DictVectorizer` to turn train and validation into matrices:

In [6]:
from sklearn.feature_extraction import DictVectorizer

In [7]:
features = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews','reviews_per_month',
    'calculated_host_listings_count', 'availability_365'
]


In [8]:
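dv = DictVectorizer(sparse=True)

# orient='records' yields one dict per row
train_dict = df_train[features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# reuse the columns fitted on train: transform only, don't refit on validation
val_dict = df_val[features].to_dict(orient='records')
X_val = dv.transform(val_dict)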

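To see why the vectorizer is fitted on train only, here is a minimal pure-Python sketch of the one-hot idea. It mimics the behaviour of `DictVectorizer` but is not its implementation; the helper names `fit_columns` and `transform` and the toy records are hypothetical.

```python
def fit_columns(records):
    """Learn the output columns from the training records only."""
    columns = set()
    for rec in records:
        for key, value in rec.items():
            # Strings become a "key=value" indicator column; numbers keep their key
            columns.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(columns)


def transform(records, columns):
    """Encode records with previously learned columns; unseen categories are ignored."""
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            name = f"{key}={value}" if isinstance(value, str) else key
            if name in index:
                row[index[name]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return rows


# Made-up toy records
train = [
    {"room_type": "Private room", "minimum_nights": 2},
    {"room_type": "Entire home/apt", "minimum_nights": 5},
]
val = [{"room_type": "Shared room", "minimum_nights": 1}]

columns = fit_columns(train)
X_train_toy = transform(train, columns)
X_val_toy = transform(val, columns)
print(columns)
print(X_train_toy)
print(X_val_toy)
```

Because the columns come from train, the validation matrix has the same width and column order as the training matrix, which is what the model expects. Refitting on validation could silently change both.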


## Question 1

Let's train a decision tree regressor to predict the price variable. 

* Train a model with `max_depth=1`

In [9]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

In [10]:
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=1)

In [11]:
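# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))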

|--- room_type=Entire home/apt <= 0.50
|   |--- value: [4.29]
|--- room_type=Entire home/apt >  0.50
|   |--- value: [5.15]





Which feature is used for splitting the data?

* `room_type`
* `neighbourhood_group`
* `number_of_reviews`
* `reviews_per_month`

Answer: `room_type` (the split is on the one-hot column `room_type=Entire home/apt`).
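The tree output above can be read as follows: a one-hot column is 0 or 1, so the threshold `0.50` simply separates the two groups, and each leaf value is the mean target of its group. A minimal sketch with made-up data (the function `stump_leaves` is hypothetical, not part of Scikit-Learn):

```python
import math


def stump_leaves(feature, target, threshold=0.5):
    """Split on feature <= threshold and return the mean target of each side."""
    left = [y for x, y in zip(feature, target) if x <= threshold]
    right = [y for x, y in zip(feature, target) if x > threshold]
    return sum(left) / len(left), sum(right) / len(right)


# Made-up toy data: 1 marks an "Entire home/apt" listing, targets are log1p(price)
is_entire_home = [0, 0, 1, 1, 0, 1]
log_price = [4.0, 4.5, 5.0, 5.5, 4.1, 5.1]

left_value, right_value = stump_leaves(is_entire_home, log_price)
print(left_value, right_value)

# The leaf values printed by export_text (4.29 and 5.15) are on the log1p scale;
# expm1 maps them back to the price scale
print(math.expm1(4.29), math.expm1(5.15))
```

Note that `expm1` of a mean log price is closer to a geometric mean than to the average price, so the converted values are only a rough guide.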

## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1`  (optional - to make training faster)

In [12]:
from sklearn.ensemble import RandomForestRegressor

In [13]:
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=1)

In [14]:
from sklearn.metrics import mean_squared_error

In [15]:
y_pred = rf.predict(X_val)

In [16]:
np.sqrt(mean_squared_error(y_val, y_pred))

0.4615925727520376

What's the RMSE of this model on validation?

* 0.059
* 0.259
* 0.459
* 0.659
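For reference, the RMSE computed with `np.sqrt(mean_squared_error(...))` is the square root of the mean squared difference between targets and predictions. A minimal pure-Python sketch with made-up numbers:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up log-scale targets and predictions
y_true = [4.0, 5.0, 6.0]
y_pred = [4.5, 5.0, 5.5]

score = rmse(y_true, y_pred)
print(score)
```

Because the target here is `log1p(price)`, the RMSE is an error on the log scale, not in dollars.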

## Question 3

Now let's experiment with the `n_estimators` parameter.

* Try different values of this parameter from 10 to 200 with step 10
* Set `random_state` to `1`
* Evaluate the model on the validation dataset

In [17]:
from tqdm.auto import tqdm

In [18]:
scores = []

for n in tqdm(range(10, 201, 10)):
    # a new forest is created for every n, so warm_start is not needed here
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    
    y_pred = rf.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    scores.append((n, rmse))


In [None]:
df_scores = pd.DataFrame(scores, columns=['n_estimators', 'rmse'])

In [None]:
plt.plot(df_scores.n_estimators, df_scores.rmse.round(3))

After which value of `n_estimators` does RMSE stop improving?

- 10
- 50
- 70
- 120

Answer: `120`
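One way to read the answer off the `scores` list instead of the plot is to round the RMSE to 3 digits and take the first `n_estimators` that reaches the lowest rounded value. This is only one possible interpretation of "stops improving", and the numbers below are made up, not results from this dataset.

```python
def first_plateau(scores, digits=3):
    """Return the smallest n whose rounded RMSE equals the best rounded RMSE."""
    rounded = [(n, round(rmse, digits)) for n, rmse in scores]
    best = min(r for _, r in rounded)
    return next(n for n, r in rounded if r == best)


# Made-up (n_estimators, rmse) pairs for illustration only
toy_scores = [(10, 0.4620), (20, 0.4481), (30, 0.4456), (40, 0.4458), (50, 0.4456)]

plateau_n = first_plateau(toy_scores)
print(plateau_n)
```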

In [None]:
scores = []

rf = RandomForestRegressor(n_estimators=0, random_state=1,
                           n_jobs=-1, warm_start=True)
    
    
for n in tqdm(range(10, 201, 10)):
    rf.n_estimators = n
    rf.fit(X_train, y_train)
    
    y_pred = rf.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    scores.append((n, rmse))
    
df_scores = pd.DataFrame(scores, columns=['n_estimators', 'rmse'])

In [None]:
plt.plot(df_scores.n_estimators, df_scores.rmse.round(3))

## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
* Fix the random seed: `random_state=1`

In [None]:
scores = []

for d in tqdm([10, 15, 20, 25]):
    rf = RandomForestRegressor(n_estimators=0, 
                               max_depth=d,
                               random_state=1,
                               n_jobs=-1, warm_start=True)


    for n in tqdm(range(10, 201, 10)):
        rf.n_estimators = n
        rf.fit(X_train, y_train)

        y_pred = rf.predict(X_val)
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        
        scores.append((d, n, rmse))
        
        
columns = ['max_depth', 'n_estimators', 'rmse']
df_scores = pd.DataFrame(scores, columns=columns)

What's the best `max_depth`?

* 10
* 15
* 20
* 25

Bonus question (not graded):

Will the answer be different if we change the seed for the model?

In [None]:
df_scores

In [None]:
for d in [10, 15, 20, 25]:
    df_subset = df_scores[df_scores.max_depth == d]
    plt.plot(df_subset.n_estimators, df_subset.rmse, label=d)
    
plt.legend()
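Alongside the plot, the best depth can be picked numerically by taking the lowest RMSE reached for each `max_depth`. A minimal sketch over `(max_depth, n_estimators, rmse)` tuples shaped like the `scores` list; the values are made up and the helper name is hypothetical.

```python
from collections import defaultdict


def best_rmse_per_depth(scores):
    """Map each max_depth to the lowest RMSE seen across all n_estimators."""
    best = defaultdict(lambda: float("inf"))
    for depth, _, rmse in scores:
        best[depth] = min(best[depth], rmse)
    return dict(best)


# Made-up (max_depth, n_estimators, rmse) tuples for illustration only
toy_scores = [(10, 10, 0.50), (10, 20, 0.48), (15, 10, 0.47), (15, 20, 0.45)]

per_depth = best_rmse_per_depth(toy_scores)
best_depth = min(per_depth, key=per_depth.get)
print(per_depth, best_depth)
```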

## Question 5

We can extract feature importance information from tree-based models. 

At each step of the decision tree learning algorithm, it finds the best split. 
When doing so, we can calculate the "gain": the reduction in impurity from before the split to after it. 
This gain is quite useful for understanding which features are important 
for tree-based models.

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field. 
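To make the "gain" concrete, here is a minimal sketch that uses variance as the impurity for regression: the gain of a split is the parent's variance minus the size-weighted variance of the two children. The data and function names are made up. Scikit-Learn's `feature_importances_` are based on the same idea, accumulated over all splits on a feature and normalized, so this is only an illustration of a single split.

```python
def variance(values):
    """Population variance, used here as the impurity of a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def gain(feature, target, threshold):
    """Impurity reduction from splitting on feature <= threshold."""
    left = [y for x, y in zip(feature, target) if x <= threshold]
    right = [y for x, y in zip(feature, target) if x > threshold]
    n = len(target)
    weighted_children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(target) - weighted_children


# Made-up toy data: one feature separates low and high targets, the other does not
target = [4.0, 4.2, 5.0, 5.2]
informative = [0, 0, 1, 1]
noisy = [0, 1, 0, 1]

gain_informative = gain(informative, target, threshold=0.5)
gain_noisy = gain(noisy, target, threshold=0.5)
print(gain_informative, gain_noisy)
```

The feature that separates the targets well gets a much larger gain, which is why it would rank higher in importance.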

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
    * `n_estimators=10`,
    * `max_depth=20`,
    * `random_state=1`,
    * `n_jobs=-1` (optional)
* Get the feature importance information from this model

In [21]:
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1,
                           n_jobs=-1)
rf.fit(X_train, y_train)

RandomForestRegressor(max_depth=20, n_estimators=10, n_jobs=-1, random_state=1)

In [27]:
rf.feature_importances_

array([7.63710105e-02, 3.04991234e-02, 1.51383518e-01, 1.54262713e-01,
       5.41530341e-02, 3.07601099e-04, 7.81672494e-04, 3.41951124e-02,
       1.11962413e-03, 1.23220309e-04, 4.31537678e-02, 5.26686508e-02,
       3.91899191e-01, 4.36076475e-03, 4.72099572e-03])

In [28]:
df_importance = pd.DataFrame()

In [29]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_importance['features'] = dv.get_feature_names_out()
df_importance['importance'] = rf.feature_importances_



In [30]:
df_importance.head()

|   | features | importance |
|---|---|---|
| 0 | availability_365 | 0.076371 |
| 1 | calculated_host_listings_count | 0.030499 |
| 2 | latitude | 0.151384 |
| 3 | longitude | 0.154263 |
| 4 | minimum_nights | 0.054153 |


In [31]:
df_importance.sort_values(by='importance', ascending=False)

|   | features | importance |
|---|---|---|
| 12 | room_type=Entire home/apt | 0.391899 |
| 3 | longitude | 0.154263 |
| 2 | latitude | 0.151384 |
| 0 | availability_365 | 0.076371 |
| 4 | minimum_nights | 0.054153 |
| 11 | reviews_per_month | 0.052669 |
| 10 | number_of_reviews | 0.043154 |
| 7 | neighbourhood_group=Manhattan | 0.034195 |
| 1 | calculated_host_listings_count | 0.030499 |
| 14 | room_type=Shared room | 0.004721 |


What's the most important feature? 

* `neighbourhood_group=Manhattan`
* `room_type=Entire home/apt`
* `longitude`
* `latitude`

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter.

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

In [32]:
# !pip install XGBoost
import xgboost as xgb

In [None]:
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}

Now change `eta` first to `0.1` and then to `0.01`

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* 0.01
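To build some intuition for what `eta` does, here is a toy gradient-boosting loop in pure Python. This is not XGBoost and it won't answer the question: it uses depth-1 stumps on a single made-up feature and reports training error only. It just illustrates that `eta` scales how much each new tree corrects the current predictions, so smaller values need more rounds to fit the same data. All names and numbers are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold on x that minimizes the squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, eta, rounds):
    """Return the training RMSE after each boosting round with learning rate eta."""
    pred = [sum(y) / len(y)] * len(y)  # start from the mean target
    history = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        # Each stump's correction is shrunk by eta before being added
        pred = [pi + eta * (left_mean if xi <= threshold else right_mean)
                for xi, pi in zip(x, pred)]
        rmse = (sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)) ** 0.5
        history.append(rmse)
    return history


# Made-up toy data: one numeric feature and a log-scale target
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [4.0, 4.1, 4.3, 4.2, 5.0, 5.1, 5.3, 5.2]

final_rmse = {eta: boost(x, y, eta, rounds=20)[-1] for eta in (0.3, 0.1, 0.01)}
for eta, value in final_rmse.items():
    print(eta, round(value, 4))
```

On validation data the picture can differ, since a model that fits the training set faster may also overfit sooner. That is why the question asks for the validation RMSE from the watchlist.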

## Submit the results


Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8

It's possible that your answers won't match exactly. If that's the case, select the closest one.


## Deadline


The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.

