<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Ensemble Methods: Random Forests and Gradient Boosted Trees</h1></center>

In today's notebook, we're going to cover two of the more powerful and resilient machine learning algorithms used in predictive analytics--**_Random Forests_** and **_Gradient Boosted Trees_**.  These algorithms belong to a class of algorithms called **_Ensemble Methods_**.  

<center><h3>What are Ensemble Methods?</h3></center>

Ensemble Methods are machine learning algorithms that rely on the "Wisdom of the Crowd".  That is, they take the approach that many weak algorithms working together do better than one big, monolithic algorithm.  In practice, they're often right.  Both of these algorithms create many small, weakly predictive learners, each of which may do only slightly better than chance.  However, as we'll see when we begin using them, with enough of these learners voting on the overall prediction, we often get great results, with the added benefit that the resulting models are more resistant to variance in the dataset and to overfitting than many other model types (we'll talk about why later).  
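To make the "Wisdom of the Crowd" idea concrete, here is a minimal sketch that computes how often a majority vote is right when every voter is right only 55% of the time.  It assumes the voters make their mistakes independently of one another, which real trees trained on the same dataset do not fully satisfy, so treat the numbers as an idealized upper bound.  The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is correct."""
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Each voter is barely better than a coin flip (55% accurate).
for n in (1, 11, 101, 501):
    print(f"{n:>3} voters -> majority is right {majority_vote_accuracy(n, 0.55):.3f} of the time")
```

The accuracy of the vote climbs as voters are added, even though no individual voter gets any better.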

Before working through examples, let's gain some intuition for how each algorithm works.  

<center><h3>Random Forests</h3></center>

**_Random Forest_** is the name of a supervised learning method published by Berkeley professor Leo Breiman in 2001, although others had done prior work on this problem before him (Breiman's paper is available [here](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf)).  The name gives an intuition for how the algorithm works--a **_Random Forest_** is just a collection of many small **_Decision Trees_**.  The secret to this algorithm is using **_Bootstrap Aggregation_** (or **_bagging_**, for short) and **_subspace sampling_**, which is just a fancy way of saying that the algorithm selects random samples from the dataset with replacement (the _bagging_ step), and then selects a random subset of columns from the data (the _subspace sampling_ step) to use when creating each new "weak" Decision Tree.  

In order to understand this model, let's visualize an example. 

Pretend that we have a dataset with 10 columns and thousands of rows.  Our random forest algorithm would start by randomly selecting around 2/3 of the rows, and then randomly selecting 6 columns in the data that it will train on (this step is important--the learner does NOT have access to all of the columns for each data point, only a randomly selected subset!).  It will then train its first **_weak learner_**--a decision tree that is only allowed to use the 6 columns that were randomly selected.  This becomes the first "tree" planted in our Random Forest.  The Random Forest algorithm will then repeat this step, sampling another 2/3 of the data and grabbing another 6 columns from the dataset (recall that the sampling is done with replacement, which means that some of the same data and/or feature columns will likely be chosen again--including an exceedingly small chance that the exact same data/columns will be chosen again!).  After a sufficient number of trees have been created, the algorithm is ready to go!  
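The two sampling steps described above can be sketched with nothing but the standard library.  This is an illustration of the idea, not `sklearn`'s actual code: the helper `sample_for_one_tree` and its arguments are hypothetical, and the sizes (5000 rows, 10 columns, 6 columns per tree) are chosen to mirror the example.

```python
import random


def sample_for_one_tree(n_rows, n_cols, n_cols_per_tree, rng):
    """Pick the row and column indices that one tree is allowed to see."""
    # Bagging: draw n_rows row indices WITH replacement, so duplicates occur.
    row_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Subspace sampling: draw distinct column indices for this tree.
    col_idx = sorted(rng.sample(range(n_cols), n_cols_per_tree))
    return row_idx, col_idx


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows, cols = sample_for_one_tree(n_rows=5000, n_cols=10, n_cols_per_tree=6, rng=rng)
unique_fraction = len(set(rows)) / 5000

print(f"columns used by this tree: {cols}")
print(f"fraction of distinct rows in the bootstrap sample: {unique_fraction:.2f}")
```

Drawing as many rows as the dataset has, with replacement, leaves roughly 63% of the distinct rows in each sample, which is where the "around 2/3" figure comes from.  Note that `sklearn`'s `RandomForestClassifier` draws a fresh random subset of columns at every split rather than once per tree, but the intuition is the same.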
<br> 
<center>**_Wait! How Many Trees Should be in my Random Forest?_**</center>
 
The number of trees created for a Random Forest is a parameter specified by the user.  Common choices are 10, 30, or 100 (recent versions of `sklearn` default to 100).  The more trees you have, the more accurate your Random Forest will likely be.  However, this algorithm is subject to _diminishing returns_ for each new tree--that is, each new tree tends to add less accuracy than the tree before it.  At some point, adding new trees just takes up more memory and training time without making the model any more accurate.  
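One way to see the diminishing returns for yourself is to fit forests of increasing size and compare their test accuracy.  The sketch below uses a synthetic dataset from `make_classification` as a stand-in (an assumption made so the example runs on its own; it is not the notebook's dataset), so the exact numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 10 columns, 5 of them informative.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for n_trees in (1, 10, 30, 100, 300):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    scores[n_trees] = forest.score(X_test, y_test)  # mean accuracy on held-out rows
    print(f"{n_trees:>3} trees -> test accuracy {scores[n_trees]:.3f}")
```

Typically the jump from 1 tree to 10 is large, while the jump from 100 to 300 is small or lost in the noise.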

For more background on how Random Forests work, check out the video below:

In [4]:
from IPython.display import HTML

HTML("""
<iframe width="560" height="315" src="https://www.youtube.com/embed/D_2LkhMJcfY" 
frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
""")

<center><h3>Building a Random Forest</h3></center>

As with the other machine learning algorithms we've covered, `sklearn` has a great implementation of Random Forests that we can use.  Let's start by building a classifier on the `pima_indians_diabetes` dataset contained within the `datasets` folder in this repo.  

You'll find the `RandomForestClassifier` object contained within `sklearn.ensemble`.  For more information, see [sklearn's documentation for this classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). 

<center>**_Tuneable Parameters_**</center>

You might be able to increase the accuracy of your Random Forest Classifier by tuning some of its parameters.  Think about the values you pass in for the following parameters, and see how they affect the accuracy of your model:

**_n_estimators:_** The number of Trees in your Random Forest. 

**_max_depth:_** How deep each Tree in the forest is allowed to go. 

**_min_samples_split:_** The minimum number of samples required to split a node in a Decision Tree.  
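One way to explore these three parameters systematically is a small grid search with cross-validation.  The sketch below is only a starting point: it runs on synthetic data from `make_classification` (an assumption, so that it is self-contained), and the candidate values in `param_grid` are arbitrary picks rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; swap in your own features and labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Candidate values for each tunable parameter (illustrative choices only).
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, None],        # None lets each tree grow until its leaves are pure
    "min_samples_split": [2, 10],
}

# Try every combination, scoring each with 3-fold cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Because every combination is fit once per fold, the cost grows quickly with the size of the grid, so it pays to start small.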

**_Challenge:_** Import the `pima_indians_diabetes` dataset, clean and scale it as needed, and then fit a random forest to the data. Create predictions and test the accuracy of the model. 


**_Stretch Challenge:_** Tune the parameters of the model, and track how it affects your accuracy.  (This algorithm is stochastic, so remember to set a random seed!)

In [40]:
# Import the dataset, clean it, and then fit a RandomForestClassifier
# and make predictions on it below!
import pandas as pd

pima_df = pd.read_csv('datasets/pima_indians_diabetes.csv')

# Separate the target column from the features
outcome = pima_df['Outcome']
pima_df = pima_df.drop(['Outcome'], axis=1)

# Treat 0 as a missing measurement in these columns and replace it with
# the column mean (truncated to a whole number by int())
for col in ['BloodPressure', 'SkinThickness', 'BMI']:
    pima_df[col] = pima_df[col].replace(0, int(pima_df[col].mean()))
pima_df

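| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 20 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| 5 | 5 | 116 | 74 | 20 | 0 | 25.6 | 0.201 | 30 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 |
| 7 | 10 | 115 | 69 | 20 | 0 | 35.3 | 0.134 | 29 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 |
| 9 | 8 | 125 | 96 | 20 | 0 | 31.0 | 0.232 | 54 |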


In [70]:
# Now that the dataset is clean, split it, fit a forest, and score it
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# No random_state is set here, so the scores will vary from run to run
X_train, X_test, y_train, y_test = train_test_split(pima_df, outcome)
forest = RandomForestClassifier(n_estimators=100, oob_score=True)
forest.fit(X_train, y_train)

# Predict on the held-out test set and measure accuracy
predicted = forest.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {forest.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')

Out-of-bag score estimate: 0.76
Mean accuracy score: 0.724


<center><h3>Gradient Boosted Trees</h3></center>

The other ensemble method we'll cover in this notebook is **_Gradient Boosted Trees_**, also referred to as _Gradient Boosting_ for short (or GBT for really short).  

Gradient Boosting also uses the concept of **_weak learners_**, but whereas Random Forest uses full Decision Trees, GBT typically uses very shallow ones, in the simplest case **_stumps_**--Decision Trees with 1 split.  

For an intuitive visualization that shows how Gradient Boosted Trees can create very accurate models with trees that are kept purposefully weak, take a look at the visualizations on [this website](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)  (don't worry too much about the math, although you are encouraged to click on the explanations such as "what is gradient boosting?"). 

When you've played around with those visualizations, take a look at [this article](http://mccormickml.com/2013/12/13/adaboost-tutorial/), which gives a more in-depth explanation of **_Adaboost_**, the classic boosting algorithm and the usual starting point for understanding Gradient Boosted Trees. 

<center><h3>How Does Adaboost Work?</h3></center>

Adaboost starts by grabbing a random subsample of the dataset.  It then creates a weak learner based on this subsample.  This weak learner is then used to make predictions on the remaining data, with the algorithm keeping track of which points it gets right and which points it gets wrong.  Each data point is given a weight.  The ones that previous learners got wrong will have a high weight, since it is increasingly important to create weak learners that can get these points correct.  Conversely, the "easy" data points--the ones that many classifiers can get right--will see their weights shrink.  This is intuitive--if most of our weak learners can get a data point right, it isn't that "hard", so we shouldn't worry about it too much.  

The higher the weight for a given data point, the more likely it is to be included in the training set used to create the next weak learner, thereby increasing the chances that a weak learner will be created that can get the "hard" data points correct. In this way, the chances of correctly classifying "hard" data points will be _boosted_ each round!
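The weighting loop described above can be sketched in a few lines.  Two caveats: this sketch passes the weights straight to each stump through `sample_weight` instead of resampling the training set by weight (both variants of Adaboost exist, and reweighting is simpler to show), and it runs on synthetic data from `make_classification` rather than the notebook's dataset.  The variable names are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=8, random_state=0)
y = np.where(y01 == 1, 1, -1)  # Adaboost is usually written with labels -1 / +1

n_rounds = 25
weights = np.full(len(y), 1 / len(y))  # every point starts with the same weight
stumps, alphas = [], []

for _ in range(n_rounds):
    # A stump is a Decision Tree with a single split.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error: total weight of the points this stump got wrong.
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # how much say this stump gets

    # Misclassified points (y * pred == -1) gain weight, correct ones lose it.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: a vote of all stumps, each weighted by its alpha.
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
first_acc = np.mean(stumps[0].predict(X) == y)
ensemble_acc = np.mean(ensemble_pred == y)

print(f"first stump alone, training accuracy: {first_acc:.3f}")
print(f"{n_rounds} stumps voting, training accuracy: {ensemble_acc:.3f}")
```

These are training-set accuracies, so they show the boosting effect but say nothing about overfitting; use a held-out set or cross-validation to judge a real model.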

For more information on how Gradient Boosted Trees work, check out the video below on Adaboost! Again, don't worry about the math--just try to gain an intuition for how the algorithm works!

In [65]:
HTML("""<iframe width="560" height="315" src="https://www.youtube.com/embed/BoGNyWW9-mE" frameborder="0" 
     allow="autoplay; encrypted-media" allowfullscreen></iframe>""")

<center><h3>Using Adaboost for Classification</h3></center>

As with Random Forests, `sklearn` contains a great implementation of this algorithm, the `GradientBoostingClassifier`, which is also found within `sklearn.ensemble`.  As you did above with Random Forests, you're going to use `sklearn`'s implementation to make classifications on the `pima_indians_diabetes` dataset.  

**_Challenge:_** Create a `GradientBoostingClassifier` object, fit it to the `pima_indians_diabetes` dataset, and then use it to make predictions and test the overall accuracy of the model.  


**_Stretch Challenge:_** Take a look at the documentation for [GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) and look at the parameters available.  Try tuning different parameters in the model and see how it affects the quality of the predictions made by the classifier!

**_Stretch Challenge:_** Adaboost is the classic algorithm usually covered when learning GBT, but there are many more robust implementations of GBT that exist today.  One of the most popular is `XGBoost`.  Work through [this tutorial](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/) to install, fit, and use `XGBoost` on the dataset.   

In [76]:
# Write your code below!
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

X, Y = pima_df, outcome
seed = 7
num_trees = 100

# shuffle=True is needed for random_state to have any effect in KFold
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)

# Mean accuracy across the 10 cross-validation folds
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())