## 1: Introduction
In the last lecture, we learned about decision trees, and looked at ways to reduce overfitting. The most powerful method to reduce decision tree overfitting is called the random forest algorithm. In this notebook, we'll learn how to construct and apply random forests.

This notebook uses the Carseats dataset (Data/Carseats.csv). The data contains ... 

## 2: Ensemble Models
A random forest is a kind of **ensemble model**. Ensembles combine the predictions of multiple models to create a more accurate final prediction. We'll make a simple ensemble to see how it works.

We'll create two decision trees with slightly different parameters:

- one with min_samples_leaf set to 2
- one with max_depth set to 5

We'll then check their accuracy separately. In the next screen, we'll combine their predictions and compare the combined accuracy with either tree's accuracy.

In [57]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score

In [58]:
# Instructions
'''
- Fit both clf and clf2 to the data.
- Use X_train as the predictors, and y_train as the target.
- Make predictions on the test set predictors (X_test) using both clf and clf2.
- For both sets of predictions, compute the AUC between the predictions and the actual values (y_test)
   using the roc_auc_score function.
- Use the print function to display the AUC values for both.
'''

df3 = pd.read_csv('Data/Carseats.csv').drop('Unnamed: 0', axis=1)
df3.head()

# convert sales to binary
df3['High'] = df3.Sales.map(lambda x: 1 if x>8 else 0)


df3.ShelveLoc = pd.factorize(df3.ShelveLoc)[0]
df3.Urban = df3.Urban.map({'No':0, 'Yes':1})
df3.US = df3.US.map({'No':0, 'Yes':1})
df3.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 12 columns):
Sales          400 non-null float64
CompPrice      400 non-null int64
Income         400 non-null int64
Advertising    400 non-null int64
Population     400 non-null int64
Price          400 non-null int64
ShelveLoc      400 non-null int32
Age            400 non-null int64
Education      400 non-null int64
Urban          400 non-null int64
US             400 non-null int64
High           400 non-null int64
dtypes: float64(1), int32(1), int64(10)
memory usage: 36.0 KB


In [59]:
X = df3.drop(['Sales', 'High'], axis=1)
y = df3.High

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)

clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
clf.fit(X_train, y_train)

clf2 = DecisionTreeClassifier(random_state=1, max_depth=5)
clf2.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(roc_auc_score(y_test, predictions))

predictions = clf2.predict(X_test)
print(roc_auc_score(y_test, predictions))

0.698945845391
0.69605208764


## 3: Combining Our Predictions
When we have multiple classifiers making predictions, we can treat each set of predictions as a column in a matrix. 

Here's an example where we have Decision Tree 1 (DT1), Decision Tree 2 (DT2), and DT3:

DT1  |  DT2 | DT3
--- | --- | --- 
0 | 1 | 0
1 | 1 | 1
0 | 0 | 1
1 | 0 | 0

When we add more models to our ensemble, we just add more columns to the combined predictions. Ultimately, we don't want this matrix, though -- we want one prediction per row of the data we're predicting on. To do this, we'll need to create rules to turn each row of our matrix of predictions into a single number.

We want to create a Final Prediction vector:

DT1 | DT2 | DT3 | Final Prediction
--- | --- | --- | ---
0 | 1 | 0 | 0
1 | 1 | 1 | 1
0 | 0 | 1 | 0
1 | 0 | 0 | 0

There are many ways to get from the output of multiple models to a final vector of predictions. One method is **majority voting**, where each classifier gets a "vote", and the most commonly voted value for each row wins. This only works if there are more than 2 classifiers (and ideally an odd number so we don't have to write a rule to break ties). Majority voting is what we applied in the example above.
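
As a minimal sketch of majority voting, the snippet below applies the rule to the DT1, DT2, DT3 matrix from the table above. The variable names (`votes`, `final_prediction`) are illustrative and not part of the original notebook.

```python
import numpy as np

# Each column holds one tree's 0/1 predictions (DT1, DT2, DT3 from the table above).
votes = np.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
])

# A row's final prediction is 1 when more than half of the classifiers voted 1.
n_classifiers = votes.shape[1]
final_prediction = (votes.sum(axis=1) > n_classifiers / 2).astype(int)

print(final_prediction)  # [0 1 0 0]
```

With an odd number of classifiers the comparison can never tie, which is why an odd ensemble size is convenient.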

Since in the last screen we only had two classifiers, we'll have to use a different method to combine predictions. We'll take the mean of all the items in a row. Right now, we're using the predict method, which returns either 0 or 1. predict returns something like this:

0
1
0
1


We can instead use the predict_proba method, which will predict a probability from 0 to 1 that a given class is the right one for a row. Since 0 and 1 are our two classes, we'll get a matrix with one row for each row of the test set and 2 columns. predict_proba will return something like this:


0  |  1
--- | --- 
.7  | .3
.2  | .8
.1  | .9


Each row will correspond to a prediction. The first column is the probability that the prediction is a 0, the second column is the probability that the prediction is a 1. Each row adds up to 1.

If we just take the second column, we get the probability that each row belongs to class 1, which we can treat as the classifier's prediction on a continuous scale. If there's a .9 probability that the correct classification is 1, we can use the .9 as the value the classifier is predicting. This will give us a continuous output in a single vector instead of just 0 or 1.

We can then add all of the vectors we get through this method together and divide by the number of vectors to get the mean prediction by all the members of the ensemble. We can then round off to get 0 or 1 predictions.
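
The averaging step can be sketched on toy numbers. The probability matrices below are made up for illustration (`proba_a` reuses the example values above, `proba_b` is hypothetical); they are not output from the notebook's classifiers.

```python
import numpy as np

# Hypothetical predict_proba outputs from two classifiers for the same three rows.
# Columns are P(class 0) and P(class 1); every row sums to 1.
proba_a = np.array([[0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])
proba_b = np.array([[0.4, 0.6], [0.3, 0.7], [0.6, 0.4]])

# Keep only the second column: the probability that each row belongs to class 1.
p1_a = proba_a[:, 1]
p1_b = proba_b[:, 1]

# Average the two vectors, then round to recover 0/1 predictions.
mean_p1 = (p1_a + p1_b) / 2
ensemble = np.round(mean_p1).astype(int)

print(ensemble)  # [0 1 1]
```

One detail worth knowing: numpy.round sends exact ties to the nearest even value, so a mean of exactly 0.5 rounds to 0 rather than 1.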

If we use the predict_proba method on both classifiers from the last screen to generate probabilities, take the mean for each row, and then round the results, we'll get ensemble predictions.

In [60]:
'''
<Instruction>
Add predictions and predictions2, then divide by 2 to get the mean.
Use numpy.round to round all of the resulting predictions.
Print the resulting AUC score between the actual values and the predictions.
'''

predictions = clf.predict_proba(X_test)[:,1]
predictions2 = clf2.predict_proba(X_test)[:,1]
combined = (predictions + predictions2) / 2
rounded = np.round(combined)

print(roc_auc_score(y_test, rounded))

0.711140967342


## 4: Why Ensembling Works
As we can see from the previous screen, the combined predictions of the two trees had a higher AUC than either tree:

settings|test AUC
--- | --- 
min_samples_leaf: 2|0.698
max_depth: 5|0.696
combined predictions|0.711


To intuitively understand why this makes sense, think about two people at the same talent level. One learned programming in college. The other learned on their own.

If you give both of them a project, since they both have different knowledge and experience, they'll both approach it in slightly different ways. They may both produce code that achieves the same result, but one may run faster in certain areas. The other may have a better interface. Even though both of them have about the same talent level, because they approach the problem differently, their solutions are stronger in different areas.

If we combine the best parts of both of their projects, we'll end up with a stronger combined project.

Ensembling works the same way. Both models approach the problem slightly differently, and build a different tree because we used different parameters for each. Each tree makes different predictions in different areas. Even though both trees have about the same accuracy, when we combine them, the result is stronger because it leverages the strengths of both approaches.

The more "diverse", or dissimilar, the models used to construct an ensemble, the stronger the combined predictions will be (assuming that all models have about the same accuracy). Ensembling a decision tree and a logistic regression model, which use very different approaches to arrive at their answers, will result in stronger predictions than ensembling two decision trees with similar parameters.
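
As a rough sketch of ensembling two dissimilar models, the snippet below averages a decision tree and a logistic regression. It runs on synthetic data from make_classification, which is an assumption made so the example is self-contained; the names (`X_demo`, `tree`, `logit`, `mixed`) are illustrative, and the scores it prints say nothing about the notebook's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.5, random_state=0
)

# Two models that reach their answers in very different ways.
tree = DecisionTreeClassifier(max_depth=5, random_state=1).fit(Xd_train, yd_train)
logit = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)

# Average the class-1 probabilities, then round to 0/1 as in the earlier screens.
tree_p1 = tree.predict_proba(Xd_test)[:, 1]
logit_p1 = logit.predict_proba(Xd_test)[:, 1]
mixed = np.round((tree_p1 + logit_p1) / 2)

for name, preds in [("tree", np.round(tree_p1)),
                    ("logistic", np.round(logit_p1)),
                    ("ensemble", mixed)]:
    print(name, roc_auc_score(yd_test, preds))
```

Whether the ensemble beats both members depends on the data, so the printed scores are best read as something to inspect rather than a guaranteed improvement.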

On the other hand, if the models you ensemble are very similar in how they make predictions, you'll get a negligible boost from ensembling.

Ensembling models with very different accuracies will not generally improve your accuracy. Ensembling a model with a .75 AUC and a model with a .85 AUC on a test set will usually result in an AUC somewhere in between the two original values. There's a way around this which we'll discuss later on, called weighting.
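
One common form of weighting is a weighted average of the predicted probabilities, sketched below. The probabilities and the weights are hypothetical, and the treatment of weighting promised for later may differ from this sketch.

```python
import numpy as np

# Hypothetical class-1 probabilities from a weaker and a stronger model.
weak_p1 = np.array([0.8, 0.4, 0.7, 0.2])
strong_p1 = np.array([0.3, 0.8, 0.9, 0.1])

# Hypothetical weights: the stronger model gets three times the say.
weights = [1, 3]

# Plain mean versus weighted mean of the two probability vectors.
unweighted_pred = np.round((weak_p1 + strong_p1) / 2).astype(int)
weighted_p1 = np.average([weak_p1, strong_p1], axis=0, weights=weights)
weighted_pred = np.round(weighted_p1).astype(int)

print(unweighted_pred)  # [1 1 1 0]
print(weighted_pred)    # [0 1 1 0]
```

The first row shows the effect: the plain mean follows the weaker model's confident 0.8, while the weighted mean sides with the stronger model.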

## 5: Bagging
A random forest is an ensemble of decision trees. If we don't make any modifications to the trees and follow the same building algorithm, each tree will be the exact same, so we'll get no boost when we ensemble them. In order to make ensembling effective, we have to introduce variation into each individual decision tree model.

If we introduce variation, each tree will be constructed slightly differently, and therefore will make different predictions. This variation is why the word "random" is in "random forest".

There are two main ways to introduce variation in a random forest -- bagging and random feature subsets. We'll dive into bagging first.

In a random forest, each tree is trained on a random sample of the data, or a "bag". This sampling is performed with replacement. When we sample with replacement, after we select a row from the data we're sampling, we put the row back in the data so it can be picked again. Some rows from the original data may appear in the "bag" multiple times.
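
Sampling with replacement can be sketched with NumPy. The snippet draws a bag the same size as the data, which is an assumption (the text above does not state a bag size), and uses ten made-up row ids in place of real rows.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows = 10
row_ids = np.arange(n_rows)

# Draw a "bag" the same size as the data, with replacement:
# a row goes back into the pool after it is picked, so it can be picked again.
bag = rng.choice(row_ids, size=n_rows, replace=True)

# Rows that were never drawn are left out of this tree's training sample.
left_out = np.setdiff1d(row_ids, bag)

print("bag:", bag)
print("left out:", left_out)
```

Each tree gets its own bag, so different trees see different mixes of repeated and missing rows.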

In [61]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, BaggingRegressor, RandomForestRegressor

# Bagging only: with max_features=10, every split can consider all 10 predictors,
# so the trees differ only in the bootstrap sample each one is trained on
rfc1 = RandomForestClassifier(max_features=10, random_state=1)
rfc1.fit(X_train, y_train)
pred1 = rfc1.predict(X_test)
print(roc_auc_score(y_test, pred1))

0.741318726747


## 6: Selecting Random Features
With the bagging example from the previous screen, we gained some accuracy over a single decision tree. We achieved an AUC score of around 0.741 with bagging.

settings|test AUC
--- | --- 
min_samples_leaf: 2|0.698
max_depth: 5|0.696
combined predictions|0.711
random forest, max_features: 10 (bagging only)|0.741

In this section, we'll restrict each split to a randomly selected subset of the features. This introduces more variation into the trees, and makes for more powerful ensembles.

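scikit-learn can perform this random subset selection for us. On DecisionTreeClassifier, we set the splitter parameter to "random" and the max_features parameter to "sqrt" (older scikit-learn releases called this setting "auto"). If we have N columns, each split will then consider a random subset of about sqrt(N) features. RandomForestClassifier takes the same max_features parameter, and passing an integer sets the subset size directly.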
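
To make the sqrt(N) subset concrete, here is a small sketch that draws one random feature subset by hand. The column names are the ten predictors listed in the df3.info() output earlier; truncating sqrt(N) to an integer is an assumption of this sketch, and the snippet illustrates the idea rather than reproducing scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

feature_names = ["CompPrice", "Income", "Advertising", "Population", "Price",
                 "ShelveLoc", "Age", "Education", "Urban", "US"]
n_features = len(feature_names)

# sqrt(10) is about 3.16; truncating gives 3 candidate features.
subset_size = int(np.sqrt(n_features))

# Draw without replacement so the same feature is not picked twice.
candidates = rng.choice(feature_names, size=subset_size, replace=False)

print(subset_size)  # 3
print(candidates)
```

A fresh subset is drawn every time a split is chosen, so different trees end up relying on different predictors.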

In [56]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, BaggingRegressor, RandomForestRegressor

# Bagging only: max_features=10 lets every split consider all 10 predictors
rfc1 = RandomForestClassifier(max_features=10, random_state=1)
rfc1.fit(X_train, y_train)
pred1 = rfc1.predict(X_test)
print(roc_auc_score(y_test, pred1))

# Random feature subsets: each split considers only 8 of the 10 predictors
rfc2 = RandomForestClassifier(max_features=8, random_state=1)
rfc2.fit(X_train, y_train)
pred2 = rfc2.predict(X_test)
print(roc_auc_score(y_test, pred2))

0.741318726747
0.766225713105


## 7: When To Use Random Forests

**Putting It All Together:**

settings|test AUC
--- | --- 
min_samples_leaf: 2|0.698
max_depth: 5|0.696
combined predictions|0.711
random forest, max_features: 10 (bagging only)|0.741
random forest, max_features: 8 (bagging and random subsets)|0.766


The random forest algorithm is incredibly powerful, but isn't applicable to all tasks. The main strengths of a random forest are:

- Very accurate predictions -- Random forests achieve near state of the art performance on many machine learning tasks. Along with neural networks and gradient boosted trees, they are typically one of the top performing algorithms.
- Resistance to overfitting -- due to how they're constructed, random forests are fairly resistant to overfitting. Parameters like max_depth still have to be set and tweaked, though.

The main weaknesses are:

- Hard to interpret -- because we're averaging the results of many trees, it can be hard to figure out why a random forest is making predictions the way it is.
- Longer creation time -- making two trees takes twice as long as making one, three takes three times as long, and so on. Luckily, we can exploit multicore processors to parallelize tree construction.

Given these tradeoffs, it makes sense to use random forests in situations where accuracy is of the utmost importance, and being able to interpret or explain the decisions the model is making isn't key. In cases where time is of the essence, or interpretability is important, a single decision tree may be a better choice.
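
The parallel tree construction mentioned among the weaknesses above is exposed in scikit-learn through the n_jobs parameter of RandomForestClassifier. The sketch below uses synthetic data from make_classification, an assumption made so the example is self-contained; the names `X_demo`, `y_demo`, and `forest` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available CPU cores when building trees.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=1)
forest.fit(X_demo, y_demo)

print(len(forest.estimators_))  # 100
```

On a small dataset like this one the speedup is unlikely to be noticeable; the setting matters more as the number of trees and rows grows.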