# Data Analysis for Software Engineers

## Practical Assignment 5
## Ensemble Learning

<hr/>
**General Information**

**Due date:** 20 May 2018, 23:59 <br/>
**Submission link:** [here](https://www.dropbox.com/request/8ZfcRf2dAiGPAH8nL8Zy)

Add your name to this notebook's title<br/>
<hr/>

Take into account that some tasks may not have a rigorous and comprehensive solution.<br/>
Support your code with comments and illustrations if needed. The more conclusions, derivations and explanations you provide, the better.<br/>


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12,8)

# Prepare dataset (1 point)

In this task we are going to evaluate several ensemble models on the California Housing dataset. This is a regression task, so the error metric is **RMSE**.

Load the `housing.csv` dataset.

You are going to work mostly with tree-based models, so:
* No feature normalization is required
* The categorical feature "`ocean_proximity`" should be encoded either with labels `(0, 1, 2, 3, 4)` or with one-hot encoding (OHE)
* Missing values should be filled with "extraordinary" values, i.e. values far outside the feature's normal range (for example, a large negative number)
* The target column is "`median_house_value`"

As a result you should get an `np.array` "`y`" with the target feature and an `np.array` "`X`" with the other features.
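The preparation steps above can be sketched roughly as follows. This is only one possible approach, not a reference solution: the helper name `prepare_xy`, the fill value `-999.0` and the tiny made-up frame (standing in for `pd.read_csv("housing.csv")`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def prepare_xy(df, target="median_house_value", cat_col="ocean_proximity",
               fill_value=-999.0):
    """Hypothetical helper: return (X, y, feature_names) from a housing-style frame."""
    df = df.copy()
    # Label-encode the categorical column with integer codes 0..k-1.
    df[cat_col] = df[cat_col].astype("category").cat.codes
    # Fill missing values with a value far outside the normal feature range.
    df = df.fillna(fill_value)
    y = df[target].to_numpy(dtype=float)
    features = df.drop(columns=[target])
    return features.to_numpy(dtype=float), y, list(features.columns)


# Tiny made-up frame used here instead of the real housing.csv.
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})
X, y, feature_names = prepare_xy(toy)
print(feature_names)
print(X)
```

Keeping `feature_names` around is convenient later, when feature importances have to be shown next to their names.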

In [None]:
## Your Code Here

## Split into train and validation sets
Choose an arbitrary `random_seed` and split your data into train and validation parts in an 80/20 proportion.

As a result you should get `X_train`, `X_valid`, `y_train` and `y_valid`
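A minimal sketch of such a split with scikit-learn's `train_test_split` is shown below. The seed value `42` and the synthetic arrays `X_demo` and `y_demo` are placeholders; in the assignment you would pass your own `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # arbitrary choice; any fixed integer works

# Synthetic stand-in data (assumption: replace with the real X and y).
rng = np.random.default_rng(RANDOM_SEED)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# test_size=0.2 gives the 80/20 proportion.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=RANDOM_SEED
)
print(X_train.shape, X_valid.shape)
```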

In [None]:
## Your Code Here

# Random Forest Model (3 points)

Implement function `learn_rf(...)`

**Inputs:**
`X_train`, `y_train`, `X_valid`, `y_valid`, `num_trees`

**Outputs:**
* `importance` - `np.array` with feature importances
* `pred_train` - `np.array` with predictions of RF on train set
* `pred_train_ind` - `np.array` with predictions of each individual tree on train set
* `pred_valid` - `np.array` with predictions of RF on validation set
* `pred_valid_ind` - `np.array` with predictions of each individual tree on validation set

Train Random Forest Regression models with 10, 20, 30, 50, 100 and 200 trees.

* Show feature names and their importance (at any number of trees)
* For every random forest model, compare the variance of the train and validation errors of the individual trees with that of the ensemble model. Explain your observations.
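One way `learn_rf` might be structured with scikit-learn is sketched below. It is an illustration under assumptions rather than the expected solution: `random_state` is an extra argument added for reproducibility, `rmse` is a small helper defined here, and the data is synthetic. Per-tree predictions are read from the fitted forest's `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def learn_rf(X_train, y_train, X_valid, y_valid, num_trees, random_state=0):
    # y_valid is kept only to match the signature in the task; it is not used.
    rf = RandomForestRegressor(n_estimators=num_trees, random_state=random_state)
    rf.fit(X_train, y_train)
    # One row per tree: shape (num_trees, n_samples).
    pred_train_ind = np.stack([tree.predict(X_train) for tree in rf.estimators_])
    pred_valid_ind = np.stack([tree.predict(X_valid) for tree in rf.estimators_])
    return (rf.feature_importances_, rf.predict(X_train), pred_train_ind,
            rf.predict(X_valid), pred_valid_ind)


# Synthetic stand-in data (assumption: replace with the housing split).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)
X_tr, X_va, y_tr, y_va = X_demo[:160], X_demo[160:], y_demo[:160], y_demo[160:]

importance, pred_tr, pred_tr_ind, pred_va, pred_va_ind = learn_rf(
    X_tr, y_tr, X_va, y_va, num_trees=20
)

# RMSE of every single tree versus RMSE of the whole forest.
tree_rmse_valid = np.array([rmse(y_va, p) for p in pred_va_ind])
print("ensemble valid RMSE:", rmse(y_va, pred_va))
print("per-tree valid RMSE: mean", tree_rmse_valid.mean(),
      "variance", tree_rmse_valid.var())
```

Looping this call over the tree counts 10, 20, 30, 50, 100 and 200 gives the numbers needed for the comparison.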

In [None]:
## Your Code Here

# Gradient Boosting Model (3 points)

Implement function `learn_gbm(...)`

**Inputs:**
`X_train`, `y_train`, `X_valid`, `y_valid`, `num_stages`, `learning_rate`.

**Outputs:**
* `importance` - `np.array` with feature importances
* `pred_train` - `np.array` with **final** predictions of GBM on train set
* `pred_train_staged` - `np.array` with **staged** predictions on train set (after each new tree)
* `pred_valid` - `np.array` with **final** predictions of GBM on validation set
* `pred_valid_staged` - `np.array` with **staged** predictions on validation set (after each new tree) 

Useful functions and classes: `GradientBoostingRegressor.staged_predict()`

You are highly encouraged to use a dedicated gradient boosting library instead of the sklearn implementation, such as [XGBoost](https://github.com/dmlc/xgboost), [CatBoost](https://github.com/catboost/catboost) or [LightGBM](https://github.com/Microsoft/LightGBM).

So pick one, make sure you have installed the latest version, and use it in this task.

Use the `learn_gbm` function to train a Gradient Boosting Tree Regression model with `1000` trees.

* Try to pick 3 values of `learning_rate`, one for each of the following regimes:
    * Aggressive learning - error on the train set keeps getting lower, but validation error decreases only during the first steps and goes up afterwards
    * Slow learning - error on the train and validation sets decreases VERY slowly
    * Healthy learning - error on the train and validation sets decreases, and the optimal number of iterations (where validation error (almost) stops decreasing) is somewhere between 800 and 1000
* Show feature names and importances (for the last `learning_rate` value). Are the importances different from the Random Forest case? Why?
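The sketch below shows how the staged outputs could be collected. Note that it uses sklearn's `GradientBoostingRegressor` (because of the `staged_predict()` hint above), not one of the external libraries the task recommends, so treat it as an illustration of the interface only. `random_state`, the `rmse_per_stage` helper and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def learn_gbm(X_train, y_train, X_valid, y_valid, num_stages, learning_rate,
              random_state=0):
    # y_valid is kept only to match the signature in the task; it is not used.
    gbm = GradientBoostingRegressor(n_estimators=num_stages,
                                    learning_rate=learning_rate,
                                    random_state=random_state)
    gbm.fit(X_train, y_train)
    # staged_predict yields the prediction after each added tree;
    # stacking gives shape (num_stages, n_samples).
    pred_train_staged = np.stack([np.array(p) for p in gbm.staged_predict(X_train)])
    pred_valid_staged = np.stack([np.array(p) for p in gbm.staged_predict(X_valid)])
    return (gbm.feature_importances_, gbm.predict(X_train), pred_train_staged,
            gbm.predict(X_valid), pred_valid_staged)


def rmse_per_stage(y_true, pred_staged):
    """Hypothetical helper: RMSE after every boosting stage."""
    return np.sqrt(np.mean((pred_staged - np.asarray(y_true)) ** 2, axis=1))


# Synthetic stand-in data (assumption: replace with the housing split).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.3, size=200)
X_tr, X_va, y_tr, y_va = X_demo[:160], X_demo[160:], y_demo[:160], y_demo[160:]

importance, pred_tr, pred_tr_staged, pred_va, pred_va_staged = learn_gbm(
    X_tr, y_tr, X_va, y_va, num_stages=50, learning_rate=0.1
)
train_curve = rmse_per_stage(y_tr, pred_tr_staged)
valid_curve = rmse_per_stage(y_va, pred_va_staged)
print("best stage on validation:", int(valid_curve.argmin()) + 1)
```

Plotting `train_curve` and `valid_curve` against the stage number for each `learning_rate` is a direct way to tell the aggressive, slow and healthy regimes apart.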

In [None]:
## Your Code Here

# Linear Regression with Random Trees Embedding (3 points)

Set up a baseline pipeline with the following steps:

* FeatureUnion of:
    * OneHotEncoding of categorical features
    * StandardScaling of other features
* Linear Regression

Fit on (`X_train`, `y_train`) and predict on `X_valid` - you should get `pred_valid_base`.
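A possible shape for this baseline is sketched below, assuming `X` is a NumPy array in which the categorical feature is stored as label codes in a known column. The helpers `select_columns` and `make_base_pipeline`, the column indices and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def select_columns(idx):
    """Hypothetical helper: transformer that keeps only the given column indices."""
    return FunctionTransformer(lambda X: X[:, idx])


def make_base_pipeline(cat_idx, num_idx):
    features = FeatureUnion([
        # Categorical branch: pick the columns, then one-hot encode them.
        ("ohe", make_pipeline(select_columns(cat_idx),
                              OneHotEncoder(handle_unknown="ignore"))),
        # Numeric branch: pick the columns, then standardize them.
        ("scale", make_pipeline(select_columns(num_idx), StandardScaler())),
    ])
    return make_pipeline(features, LinearRegression())


# Synthetic stand-in data: column 0 holds category codes, columns 1-2 are numeric.
rng = np.random.default_rng(0)
cat = rng.integers(0, 3, size=200).astype(float)
num = rng.normal(size=(200, 2))
X_demo = np.column_stack([cat, num])
y_demo = 2.0 * (cat == 1) + num[:, 0] + rng.normal(scale=0.1, size=200)
X_tr, X_va, y_tr, y_va = X_demo[:160], X_demo[160:], y_demo[:160], y_demo[160:]

base = make_base_pipeline(cat_idx=[0], num_idx=[1, 2])
base.fit(X_tr, y_tr)
pred_valid_base = base.predict(X_va)
rmse_base = float(np.sqrt(np.mean((y_va - pred_valid_base) ** 2)))
print("baseline valid RMSE:", rmse_base)
```

`FeatureUnion` applies every branch to the full input, which is why each branch starts with its own column selector.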

In [None]:
## Your Code Here

Implement function `learn_lm(...)` which will learn the following pipeline:
* RandomTreesEmbedding
* LinearRegression

**Inputs:**
`X_train`, `y_train`, `X_valid`, `y_valid`, `num_trees`, `max_depth`.

**Outputs:**
* `pred_valid` - `np.array` with predictions on validation set

Useful functions and classes: [RandomTreesEmbedding](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomTreesEmbedding.html#sklearn.ensemble.RandomTreesEmbedding)

Run it with various values of `max_depth` and `num_trees`. Compare this pipeline with the base pipeline on the validation set using the ratio RMSE(this_pipeline)/RMSE(base_pipeline).
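The pipeline and the RMSE ratio could look roughly like the sketch below. The `random_state` argument, the `rmse` helper and the synthetic data are assumptions. The stand-in baseline here is a plain `LinearRegression` on raw features, not the FeatureUnion baseline from the task.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def learn_lm(X_train, y_train, X_valid, y_valid, num_trees, max_depth,
             random_state=0):
    # y_valid is kept only to match the signature in the task; it is not used.
    model = make_pipeline(
        # Each sample is encoded by the leaves it falls into (sparse one-hot).
        RandomTreesEmbedding(n_estimators=num_trees, max_depth=max_depth,
                             random_state=random_state),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    return model.predict(X_valid)


# Synthetic stand-in data with a nonlinear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2.0, 2.0, size=(300, 3))
y_demo = np.sin(3.0 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_tr, X_va, y_tr, y_va = X_demo[:240], X_demo[240:], y_demo[:240], y_demo[240:]

# Stand-in baseline (assumption): plain linear regression on the raw features.
pred_valid_base = LinearRegression().fit(X_tr, y_tr).predict(X_va)

for num_trees, max_depth in [(10, 3), (50, 5)]:
    pred_valid = learn_lm(X_tr, y_tr, X_va, y_va, num_trees, max_depth)
    ratio = rmse(y_va, pred_valid) / rmse(y_va, pred_valid_base)
    print(num_trees, max_depth, "RMSE ratio:", ratio)
```

A ratio below 1 means the embedding pipeline has lower validation error than the baseline it is compared against.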

In [None]:
## Your Code Here