### Gradient Boosting & Scikit-Learn Intro

This lab is designed to give everyone their first introduction to the Scikit-Learn API, and Gradient Boosting, one of the most powerful techniques in predictive modeling.

During this lab you'll see if you can build a model, understand its working parts, and make improvements to your results!  

The great thing about `Scikit Learn` is that its API is almost identical from one algorithm to another, so once you get the hang of how to use it, using different methods is fairly seamless.

In [15]:
import numpy as np
import pandas as pd
import random
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor


**Step 1:** Load in the `housing.csv` file

In [37]:
# your answer here
df = pd.read_csv('../../data/housing.csv')

In [38]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


**Step 2:** Randomly shuffle your dataset using the `sample` method, using a `random_state` of 42.

In [35]:
?df.sample

In [32]:
# your answer here
df = df.sample(random_state=42)

**Step 3:** Declare your `X` & `y` variables -- We'll be predicting price.

In [33]:
# your answer here
X = df.drop('PRICE',axis=1)
y = df['PRICE']

In [34]:
X.shape

(1, 13)

**Step 4:** Create a training and test set.

The training set will be the first 80% of the dataset (for both `X` & `y`), and test set will be the last 20%.  Do this for both `X` & `y`, using your shuffled data.

Subsequent questions will refer to the variables you created in this step as `X_train`, `X_test`, `y_train`, `y_test`.

In [93]:
dfn = df.sample(n=506, random_state=42)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
173,0.09178,0.0,4.05,0,0.510,6.416,84.1,2.6463,5,296,16.6,395.50,9.04,23.6
274,0.05644,40.0,6.41,1,0.447,6.758,32.9,4.0776,4,254,17.6,396.90,3.53,32.4
491,0.10574,0.0,27.74,0,0.609,5.983,98.8,1.8681,4,711,20.1,390.11,18.07,13.6
72,0.09164,0.0,10.81,0,0.413,6.065,7.8,5.2873,4,305,19.2,390.91,5.52,22.8
452,5.09017,0.0,18.10,0,0.713,6.297,91.8,2.3682,24,666,20.2,385.09,17.27,16.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439,9.39063,0.0,18.10,0,0.740,5.627,93.9,1.8172,24,666,20.2,396.90,22.88,12.8
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
337,0.03041,0.0,5.19,0,0.515,5.895,59.6,5.6150,5,224,20.2,394.81,10.56,18.5
236,0.52058,0.0,6.20,1,0.507,6.631,76.5,4.1480,8,307,17.4,388.45,9.54,25.1


In [96]:
X_train = dfn[0:404].drop('PRICE',axis=1)
X_test = dfn[404:].drop('PRICE',axis=1)
y_train = dfn[0:404]['PRICE']
y_test = dfn[404:]['PRICE']

**Step 5:** Import `GradientBoostingRegressor` and initialize it.

In [97]:
# your answer here
from sklearn.ensemble import GradientBoostingRegressor


In [98]:
gbm = GradientBoostingRegressor()

**Step 6:** Call the `fit()` method on `X_train` & `y_train`

In [None]:
# your answer here

In [99]:
gbm.fit(X_train,y_train)

GradientBoostingRegressor()

**Step 7:** Make a column that represents the predictions your model made for each sample in your original dataset.

In [None]:
# your answer here

In [101]:
X_all = df.drop('PRICE',axis=1)

In [103]:
df['Prediction'] = gbm.predict(X_all)

In [104]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE,Prediction
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,25.265672
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,22.720981
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,34.579933
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,34.49691
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,35.383703


**Step 8:** Check the score of your model using the `score()` method.  Compare your score on both the training set and test set.

In [106]:
# your answer here
gbm.score(X_train, y_train)

0.979070400857534

In [107]:

gbm.score(X_test, y_test)

0.8881050127378671

**Step 9:** Take a look at the values returned from the `feature_importances_` attribute

In [None]:
# your answer here

In [108]:
gbm.feature_importances_

array([2.23551365e-02, 2.61425264e-04, 1.78567603e-03, 5.27250438e-04,
       3.23981096e-02, 3.53916273e-01, 6.47102363e-03, 6.48034070e-02,
       1.19174289e-03, 1.68877723e-02, 2.59209282e-02, 1.15403967e-02,
       4.61940858e-01])

In [109]:
X.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

**Step 10:** To make a bit more sense out of these, let's put these values into a more readable format.  

Try making a 2 column dataframe using `X.columns` and the values from `feature_importances_` (they should correspond to one another).

In [None]:
# your answer here

In [116]:
importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': gbm.feature_importances_
}).sort_values(by='Importance',ascending=False)

In [117]:
importances

Unnamed: 0,Feature,Importance
12,LSTAT,0.461941
5,RM,0.353916
7,DIS,0.064803
4,NOX,0.032398
10,PTRATIO,0.025921
0,CRIM,0.022355
9,TAX,0.016888
11,B,0.01154
6,AGE,0.006471
2,INDUS,0.001786


**Step 11:** Can you improve your results?  For now, toy around a little bit with a few different options for getting different results.  These could be any of the following:

 - changing the number of boosting rounds used via `n_estimators`
 - changing the learning rate
 - removing columns that have lower feature importance, or very low correlation with the target variable

In [None]:
# your answer here

# Example from solutions manual:
# we'll try doing a parameter sweep
n_estimators  = [100, 250, 500]
learning_rate = [.05, .1]
tree_depth    = [3, 4, 5, 6]
cv_scores     = []

for estimators in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            # print(f"Fitting model for: {estimators, rate, depth}")
            gbm.set_params(n_estimators=estimators, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train, y_train)
            cv_scores.append((gbm.score(X_test, y_test), estimators, rate, depth))
            # print(gbm.score(X_test, y_test), estimators, rate, depth)