### Gradient Boosting & Scikit-Learn Intro

This lab gives you a first introduction to the Scikit-Learn API and to gradient boosting, one of the most powerful techniques in predictive modeling.

In this lab you'll build a model, examine its working parts, and try to improve your results.

The great thing about Scikit-Learn is that its API is almost identical from one algorithm to another: estimators share the same `fit()`, `predict()`, and `score()` methods, so once you get the hang of one, switching to a different method is fairly seamless.
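
As a minimal sketch of that shared API, the snippet below fits two different regressors with exactly the same calls. The data is a synthetic stand-in generated with `make_regression`, not the housing data used in this lab, and the names `X_demo`, `y_demo`, and `scores` are made up for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: NOT the housing data from this lab).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Two very different algorithms, driven by the same three method calls.
scores = {}
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    model.fit(X_demo, y_demo)            # learn from the data
    preds = model.predict(X_demo)        # one prediction per row
    scores[type(model).__name__] = model.score(X_demo, y_demo)  # R² for regressors

print(scores)
```

Only the class being instantiated changes between the two iterations, which is why the rest of this lab carries over to other estimators with little effort.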

**Step 1:** Import the libraries and load the `iowa_train.csv` file

In [1]:
import pandas as pd
import numpy as np

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
housing = pd.read_csv("/Users/imac/DAT07-28-AG/ClassMaterial/Unit3/data/iowa_train.csv")

**Step 2:** Preview the first few rows with `head()`

In [3]:
housing.head()

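| Unnamed: 0 | Id | MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | GrLivArea | 1stFlrSF | 2ndFlrSF | GrLivArea.1 | FullBath | HalfBath | GarageCars | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 8450 | 7 | 5 | 2003 | 1710 | 856 | 854 | 1710 | 2 | 1 | 2 | 208500 |
| 1 | 2 | 20 | 9600 | 6 | 8 | 1976 | 1262 | 1262 | 0 | 1262 | 2 | 0 | 2 | 181500 |
| 2 | 3 | 60 | 11250 | 7 | 5 | 2001 | 1786 | 920 | 866 | 1786 | 2 | 1 | 2 | 223500 |
| 3 | 4 | 70 | 9550 | 7 | 5 | 1915 | 1717 | 961 | 756 | 1717 | 1 | 0 | 3 | 140000 |
| 4 | 5 | 60 | 14260 | 8 | 5 | 2000 | 2198 | 1145 | 1053 | 2198 | 2 | 1 | 3 | 250000 |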


**Step 3:** Declare your `X` & `y` variables. We'll be predicting `SalePrice`, so it becomes `y` and every other column goes into `X`.

In [4]:
X = housing.drop("SalePrice", axis=1)
y = housing["SalePrice"]

**Step 4:** Import `GradientBoostingRegressor` and initialize it.

In [5]:
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor()

**Step 5:** Call the `fit()` method on `X` & `y`

In [6]:
gbm.fit(X, y)

GradientBoostingRegressor()

**Step 6:** Make a column that represents the predictions your model made for each sample

In [7]:
housing["Predictions"] = gbm.predict(X)

In [8]:
housing.head()

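| Unnamed: 0 | Id | MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | GrLivArea | 1stFlrSF | 2ndFlrSF | GrLivArea.1 | FullBath | HalfBath | GarageCars | SalePrice | Predictions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 8450 | 7 | 5 | 2003 | 1710 | 856 | 854 | 1710 | 2 | 1 | 2 | 208500 | 197493.442572 |
| 1 | 2 | 20 | 9600 | 6 | 8 | 1976 | 1262 | 1262 | 0 | 1262 | 2 | 0 | 2 | 181500 | 174523.841519 |
| 2 | 3 | 60 | 11250 | 7 | 5 | 2001 | 1786 | 920 | 866 | 1786 | 2 | 1 | 2 | 223500 | 206374.028689 |
| 3 | 4 | 70 | 9550 | 7 | 5 | 1915 | 1717 | 961 | 756 | 1717 | 1 | 0 | 3 | 140000 | 158308.884054 |
| 4 | 5 | 60 | 14260 | 8 | 5 | 2000 | 2198 | 1145 | 1053 | 2198 | 2 | 1 | 3 | 250000 | 287262.76413 |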
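
With predictions sitting next to the actual prices, one natural follow-up is to look at the errors directly. The sketch below is only an illustration: it re-types the five rows shown above into a small frame called `demo` (a made-up name) and computes a residual column plus the root mean squared error. In the lab you would run the same two lines against the full `housing` frame; an RMSE over five rows says little about the model as a whole.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# The five rows printed above, re-typed by hand for illustration only.
demo = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "Predictions": [197493.442572, 174523.841519, 206374.028689,
                    158308.884054, 287262.76413],
})

# Residual = actual minus predicted; positive means the model under-predicted.
demo["Residual"] = demo["SalePrice"] - demo["Predictions"]

# Root mean squared error, expressed in the same units as SalePrice.
rmse = np.sqrt(mean_squared_error(demo["SalePrice"], demo["Predictions"]))

print(demo)
print(f"RMSE over these five rows: {rmse:.2f}")
```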


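**Step 7:** Check the score of your model using the `score()` method. For a regressor this returns R², and here it is computed on the same data the model was trained on, so treat it as an optimistic estimate.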

In [9]:
gbm.score(X, y)

0.9371367715410339
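
A score computed on the training data tends to flatter the model. One common way to get a more honest number is to hold out part of the data and score on that instead. The sketch below shows the idea on synthetic stand-in data (not the housing data); `X_demo`, `y_demo`, and `gbm_demo` are hypothetical names, and the 25% test size is an arbitrary choice.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the housing data from this lab).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=0
)

# Hold out 25% of the rows; the model never sees them during fit().
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

gbm_demo = GradientBoostingRegressor(random_state=0)
gbm_demo.fit(X_train, y_train)

train_r2 = gbm_demo.score(X_train, y_train)  # R² on data the model has seen
test_r2 = gbm_demo.score(X_test, y_test)     # R² on unseen data

print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

In the lab, the same pattern would apply with `X` and `y` in place of the synthetic arrays.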

**Step 8:** Take a look at the values returned from the `feature_importances_` attribute

In [11]:
gbm.feature_importances_

array([0.0008086 , 0.00292751, 0.02844633, 0.56745621, 0.01227233,
       0.04760634, 0.07011041, 0.08204154, 0.03678904, 0.07635157,
       0.0086471 , 0.00183628, 0.06470675])

**Step 9:** To make more sense of these numbers, let's put them into a more readable format.

Try making a two-column DataFrame from `X.columns` and the values in `feature_importances_` (the importances are listed in the same order as the columns).
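
One possible way to build that table is sketched below. To keep the snippet runnable on its own it uses synthetic data with made-up column names (`feat_a` through `feat_d`) and a separate model called `gbm_demo`; none of these come from the lab.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data with hypothetical column names.
features, target = make_regression(
    n_samples=300, n_features=4, n_informative=2, random_state=0
)
X_demo = pd.DataFrame(features, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

gbm_demo = GradientBoostingRegressor(random_state=0).fit(X_demo, target)

# Pair each column name with its importance, then sort largest first.
importances = (
    pd.DataFrame({
        "feature": X_demo.columns,
        "importance": gbm_demo.feature_importances_,
    })
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)

print(importances)
```

For the lab itself, swapping in `X.columns` and `gbm.feature_importances_` should give the equivalent table for the housing features.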

**Step 10:** Can you improve your results? For now, experiment with a few different options and see how the score changes. These could be any of the following:

 - changing the number of boosting rounds via `n_estimators`
 - changing the learning rate via `learning_rate`
 - removing columns that have low feature importance, or very low correlation with the target variable
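
As a starting point for the first two options, the sketch below tries a small grid of `n_estimators` and `learning_rate` values and scores each combination on held-out data. It runs on synthetic stand-in data, and the specific values in the grid are arbitrary picks for illustration, not recommendations.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the housing data from this lab).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=20.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit one model per (n_estimators, learning_rate) pair and record test R².
results = []
for n_estimators, learning_rate in product([50, 200], [0.05, 0.2]):
    model = GradientBoostingRegressor(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=0,
    )
    model.fit(X_train, y_train)
    results.append((n_estimators, learning_rate, model.score(X_test, y_test)))

# Show the combinations from best to worst held-out score.
for n_estimators, learning_rate, r2 in sorted(
    results, key=lambda row: row[2], reverse=True
):
    print(f"n_estimators={n_estimators:<4} learning_rate={learning_rate:<5} R^2={r2:.3f}")
```

Scoring on held-out rows matters here: adding more boosting rounds will usually keep raising the training score even when the model has stopped improving on new data.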

In [None]:
# your answer here