### Gradient Boosting & Scikit-Learn Intro

This lab is a first introduction to the Scikit-Learn API and to Gradient Boosting, one of the most powerful techniques in predictive modeling.

During this lab you'll build a model, examine its working parts, and try to improve your results.

The great thing about Scikit-Learn is that its API is almost identical from one algorithm to another, so once you get the hang of it, switching between methods is fairly seamless.
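As a rough illustration of that shared API, the sketch below trains two unrelated regressors with exactly the same calls. It uses synthetic data from `make_regression` as a stand-in (an assumption for this example, not the lab's housing data), and the names `X_demo`, `y_demo`, and `demo_scores` are made up for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the housing data used in this lab).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_scores = {}
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    model.fit(X_demo, y_demo)        # same call to train either model
    preds = model.predict(X_demo)    # same call to predict
    demo_scores[type(model).__name__] = model.score(X_demo, y_demo)  # same call to score

print(demo_scores)
```

Only the line that constructs the estimator changes; `fit`, `predict`, and `score` stay the same.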

**Step 1:** Load in the `housing.csv` file

In [3]:
# your answer here
import pandas as pd
df = pd.read_csv('../../data/housing.csv')

**Step 2:** Randomly shuffle your dataset with a `random_state` of 42. pandas DataFrames have no `shuffle` method, so use `df.sample` over every row (or `sklearn.utils.shuffle`).

In [13]:
# your answer here
df = df.sample(df.shape[0], random_state=42)
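Two common ways to shuffle the rows of a DataFrame are sketched below on a small made-up frame (`toy` is hypothetical, not the housing data). Both return every row exactly once and keep the original index labels, which is why the index looks scrambled afterwards. The two calls are not guaranteed to produce the same ordering as each other.

```python
import pandas as pd
from sklearn.utils import shuffle

# A tiny hypothetical frame, used only to show the two calls.
toy = pd.DataFrame({"a": range(6), "b": list("uvwxyz")})

# Option 1: sample 100% of the rows without replacement.
shuffled_sample = toy.sample(frac=1, random_state=42)

# Option 2: scikit-learn's shuffle utility, which also accepts DataFrames.
shuffled_sklearn = shuffle(toy, random_state=42)

print(shuffled_sample.index.tolist())
print(shuffled_sklearn.index.tolist())

# Optional: renumber the rows 0..n-1 after shuffling.
renumbered = shuffled_sample.reset_index(drop=True)
```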

In [11]:
df.sample(10, random_state=30)

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 410 | 51.1358 | 0.0 | 18.1 | 0 | 0.597 | 5.757 | 100.0 | 1.413 | 24 | 666 | 20.2 | 2.6 | 10.11 | 15.0 |
| 376 | 15.288 | 0.0 | 18.1 | 0 | 0.671 | 6.649 | 93.3 | 1.3449 | 24 | 666 | 20.2 | 363.02 | 23.24 | 13.9 |
| 473 | 4.64689 | 0.0 | 18.1 | 0 | 0.614 | 6.98 | 67.6 | 2.5329 | 24 | 666 | 20.2 | 374.68 | 11.66 | 29.8 |
| 499 | 0.17783 | 0.0 | 9.69 | 0 | 0.585 | 5.569 | 73.5 | 2.3999 | 6 | 391 | 19.2 | 395.77 | 15.1 | 17.5 |
| 172 | 0.13914 | 0.0 | 4.05 | 0 | 0.51 | 5.572 | 88.5 | 2.5961 | 5 | 296 | 16.6 | 396.9 | 14.69 | 23.1 |
| 107 | 0.13117 | 0.0 | 8.56 | 0 | 0.52 | 6.127 | 85.2 | 2.1224 | 5 | 384 | 20.9 | 387.69 | 14.09 | 20.4 |
| 280 | 0.03578 | 20.0 | 3.33 | 0 | 0.4429 | 7.82 | 64.5 | 4.6947 | 5 | 216 | 14.9 | 387.31 | 3.76 | 45.4 |
| 314 | 0.3692 | 0.0 | 9.9 | 0 | 0.544 | 6.567 | 87.3 | 3.6023 | 4 | 304 | 18.4 | 395.69 | 9.28 | 23.8 |
| 274 | 0.05644 | 40.0 | 6.41 | 1 | 0.447 | 6.758 | 32.9 | 4.0776 | 4 | 254 | 17.6 | 396.9 | 3.53 | 32.4 |
| 460 | 4.81213 | 0.0 | 18.1 | 0 | 0.713 | 6.701 | 90.0 | 2.5975 | 24 | 666 | 20.2 | 255.23 | 16.42 | 16.4 |


In [14]:
df.head()

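| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 360 | 4.54192 | 0.0 | 18.1 | 0 | 0.77 | 6.398 | 88.0 | 2.5182 | 24 | 666 | 20.2 | 374.56 | 7.79 | 25.0 |
| 10 | 0.22489 | 12.5 | 7.87 | 0 | 0.524 | 6.377 | 94.3 | 6.3467 | 5 | 311 | 15.2 | 392.52 | 20.45 | 15.0 |
| 324 | 0.34109 | 0.0 | 7.38 | 0 | 0.493 | 6.415 | 40.1 | 4.7211 | 5 | 287 | 19.6 | 396.9 | 6.12 | 25.0 |
| 315 | 0.25356 | 0.0 | 9.9 | 0 | 0.544 | 5.705 | 77.7 | 3.945 | 4 | 304 | 18.4 | 396.42 | 11.5 | 16.2 |
| 164 | 2.24236 | 0.0 | 19.58 | 0 | 0.605 | 5.854 | 91.8 | 2.422 | 5 | 403 | 14.7 | 395.11 | 11.64 | 22.7 |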


**Step 3:** Declare your `X` & `y` variables -- We'll be predicting price.

In [15]:
# your answer here
X = df.drop('PRICE', axis=1)
y = df['PRICE']

**Step 4:** Create a training and test set.

Using your shuffled data, the training set will be the first 80% of the rows and the test set will be the last 20%. Apply the same split to both `X` and `y`.

Subsequent questions will refer to the variables you created in this step as `X_train`, `X_test`, `y_train`, `y_test`.

In [19]:
cutoff

404

In [20]:
# your answer here
cutoff = int(df.shape[0]*.8)

X_train, X_test = X[:cutoff].copy(), X[cutoff:].copy()
y_train, y_test  = y[:cutoff].copy(), y[cutoff:].copy()
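For comparison, scikit-learn's `train_test_split` can make the same kind of ordered split when called with `shuffle=False`. The sketch below checks this on a made-up 100-row frame (`X_toy` and `y_toy` are hypothetical); for other dataset sizes the two approaches may round the cutoff differently, so treat the equivalence as something to verify rather than assume.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 100 rows.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series(np.arange(100) * 2.0, name="target")

# Manual split: first 80% for training, last 20% for testing.
cutoff_toy = int(X_toy.shape[0] * 0.8)
X_tr_manual, X_te_manual = X_toy[:cutoff_toy], X_toy[cutoff_toy:]

# Library split: shuffle=False keeps the existing row order.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)

print(X_tr.equals(X_tr_manual), X_te.equals(X_te_manual))
```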

**Step 5:** Import `GradientBoostingRegressor` and initialize it.

In [21]:
# your answer here
from sklearn.ensemble import GradientBoostingRegressor
gbm = GradientBoostingRegressor()

**Step 6:** Call the `fit()` method on `X_train` & `y_train`

In [22]:
y_train.shape

(404,)

In [23]:
# your answer here
gbm.fit(X_train, y_train)

GradientBoostingRegressor()

In [29]:
gbm.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'ls',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'presort': 'deprecated',
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}
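To give a feel for what `n_estimators`, `learning_rate`, and `max_depth` control, here is a deliberately simplified boosting loop for squared-error loss. It is a teaching sketch on synthetic data, not scikit-learn's actual implementation: each round fits a shallow tree to the current residuals and adds a shrunken copy of that tree's predictions to the running total. All variable names here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the housing data).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate, n_estimators, max_depth = 0.1, 100, 3

# Start from a constant prediction: the mean of the target.
prediction = np.full(y_demo.shape, y_demo.mean())
mse_per_round = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_estimators):
    # For squared error, the residuals are proportional to the negative gradient.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_demo, residuals)
    # Take a small step in the direction the tree suggests.
    prediction += learning_rate * tree.predict(X_demo)
    mse_per_round.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:.1f}, after: {mse_per_round[-1]:.1f}")
```

More rounds or a larger learning rate drive the training error down faster, which is also how a boosted model ends up overfitting if neither is kept in check.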

**Step 7:** Make a column that represents the predictions your model made for each sample in your original dataset.

In [30]:
# your answer here
df['Prediction'] = gbm.predict(X)

**Step 8:** Check the score of your model using the `score()` method (for a regressor, this is the R² coefficient of determination). Compare your score on both the training set and test set.

In [31]:
# your answer here
print(f"Train score: {gbm.score(X_train, y_train)}, Test Score: {gbm.score(X_test, y_test)}")

Train score: 0.9801409607979961, Test Score: 0.8446952433760847
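For regressors, `score()` returns R², so it should agree with `sklearn.metrics.r2_score` applied to the model's predictions. The sketch below confirms that on synthetic data (the split and variable names are illustrative, not the lab's).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data, split 80/20 by position.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

score_from_method = model.score(X_te, y_te)               # built-in scorer
score_from_metric = r2_score(y_te, model.predict(X_te))   # explicit R^2

print(score_from_method, score_from_metric)
```

A training score well above the test score, as in the output above, is the usual sign that the model has fit some noise in the training set.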


**Step 9:** Take a look at the values returned from the `feature_importances_` attribute

In [23]:
# your answer here
gbm.feature_importances_

array([2.38910315e-02, 1.62021421e-04, 1.57662443e-03, 5.29198447e-04,
       3.17380199e-02, 3.53906434e-01, 6.33477836e-03, 6.90423871e-02,
       1.19487226e-03, 1.75971501e-02, 2.20623401e-02, 9.71452873e-03,
       4.62250613e-01])

**Step 10:** To make a bit more sense out of these, let's put these values into a more readable format.  

Try making a 2 column dataframe using `X.columns` and the values from `feature_importances_` (they should correspond to one another).

In [32]:
# your answer here
importances = pd.DataFrame({
    'Column': X.columns,
    'Importance': gbm.feature_importances_
}).sort_values(by='Importance', ascending=False)

importances

| | Column | Importance |
|---|---|---|
| 12 | LSTAT | 0.496847 |
| 5 | RM | 0.302499 |
| 7 | DIS | 0.060755 |
| 0 | CRIM | 0.038028 |
| 10 | PTRATIO | 0.028727 |
| 4 | NOX | 0.026016 |
| 6 | AGE | 0.017759 |
| 9 | TAX | 0.014207 |
| 11 | B | 0.011008 |
| 2 | INDUS | 0.001435 |


In [37]:
importances.Importance.sum()

1.0
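A slightly more compact way to get the same readable view is a pandas `Series` indexed by column name. The sketch below uses synthetic data with made-up column names (`f0` to `f3`); it also shows that the importances of a fitted model are normalized to sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: 4 features, only 2 of which carry signal.
X_arr, y_demo = make_regression(
    n_samples=200, n_features=4, n_informative=2, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first.
importance_series = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importance_series)
print(importance_series.sum())
```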

**Step 11:** Can you improve your results?  For now, toy around a little bit with a few different options for getting different results.  These could be any of the following:

 - changing the number of boosting rounds used via `n_estimators`
 - changing the learning rate
 - removing columns that have lower feature importance, or very low correlation with the target variable

In [43]:
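# parameter sweep: refit the model for every combination and record its test-set score
# (despite the name, cv_scores holds test-set R^2 values, not cross-validation scores)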
n_estimators  = [100, 250, 500]
learning_rate = [.05, .1]
tree_depth    = [3, 4, 5, 6]
cv_scores     = []

for estimators in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            # print(f"Fitting model for: {estimators, rate, depth}")
            gbm.set_params(n_estimators=estimators, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train, y_train)
            cv_scores.append((gbm.score(X_test, y_test), estimators, rate, depth))
            # print(gbm.score(X_test, y_test), estimators, rate, depth)

In [44]:
cv_scores

[(0.8224341074159518, 100, 0.05, 3),
 (0.8560376485333699, 100, 0.05, 4),
 (0.8502636533485133, 100, 0.05, 5),
 (0.8523252868976864, 100, 0.05, 6),
 (0.8408998751138631, 100, 0.1, 3),
 (0.8692618828172283, 100, 0.1, 4),
 (0.8454196105446514, 100, 0.1, 5),
 (0.8521875487483637, 100, 0.1, 6),
 (0.8450412356597884, 250, 0.05, 3),
 (0.8611501650085038, 250, 0.05, 4),
 (0.852773784162489, 250, 0.05, 5),
 (0.8481232737116362, 250, 0.05, 6),
 (0.855592518212879, 250, 0.1, 3),
 (0.8680290987999356, 250, 0.1, 4),
 (0.8511478148386187, 250, 0.1, 5),
 (0.841276640854057, 250, 0.1, 6),
 (0.8527748643036183, 500, 0.05, 3),
 (0.872006685799987, 500, 0.05, 4),
 (0.8660593958188115, 500, 0.05, 5),
 (0.8622696815290509, 500, 0.05, 6),
 (0.8580326227992955, 500, 0.1, 3),
 (0.8665834332422507, 500, 0.1, 4),
 (0.8518759868696968, 500, 0.1, 5),
 (0.8517225426661426, 500, 0.1, 6)]

In [45]:
max(cv_scores)

(0.872006685799987, 500, 0.05, 4)

In [46]:
min(cv_scores)

(0.8224341074159518, 100, 0.05, 3)
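One caveat with a sweep like this is that the winning combination was chosen by its test-set score, so that score is no longer an unbiased estimate of performance on new data. A common alternative, sketched below under the assumption of synthetic data and a smaller grid than the one above, is `GridSearchCV`: it picks parameters by cross-validation on the training set only and leaves the test set for a single final check.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, split 80/20 by position.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# A deliberately small, illustrative grid.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4],
}

# 3-fold cross-validation on the training set selects the parameters.
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # R^2 of the refit best model on held-out data
```

By default `GridSearchCV` refits the best combination on the full training set, so `search` can be used like any other fitted estimator afterwards.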