# Intro to Random Forests

## Objective

Construct and apply random forests.

## Data Set

The data set contains individual income in the United States. The data is from the 1994 census, and contains information on an individual's marital status, age, type of work, and more. 

The target column is whether individuals make less than or equal to 50k a year, or more than 50k a year.

The data set can be downloaded from [the University of California, Irvine's website.](http://archive.ics.uci.edu/ml/datasets/Adult), the column description can be found [here](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names)

## Exploring the Data

In [1]:
import pandas as pd
import numpy as np
import math

income_file = "C:/Users/i7/csv/income.csv"

# Column names, not included in file.
names = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week','native-country','income']

income = pd.read_csv(income_file, names=names)
income.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [2]:
# Convert a single column from text categories to numbers
col = pd.Categorical(income["workclass"])
income["workclass"] = col.codes
print(income["workclass"].head(5))

0    7
1    6
2    4
3    4
4    4
Name: workclass, dtype: int8


In [3]:
# Convert all categorical column to numbers
for name in ["education", "marital-status", "occupation", "relationship", "race", "sex", "native-country", "income"]:
    col = pd.Categorical(income[name])
    income[name] = col.codes

## Splitting Data Into Train and Test Set

In [4]:
# Set a random seed so the shuffle is the same every time
np.random.seed(1)

# Shuffle the rows  
# This permutes the index randomly using numpy.random.permutation
# Then, it reindexes the dataframe with the result
# The net effect is to put the rows into random order
income = income.reindex(np.random.permutation(income.index))

train_max_row = math.floor(income.shape[0] * .8)
train = income.iloc[:train_max_row]
test = income.iloc[train_max_row:]

## Combining Model Predictions With Ensembles

A random forest is a kind of ensemble model. Ensembles combine the predictions of multiple models to create a more accurate final prediction. 

Create two decision trees with slightly different parameters:

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

columns = ["age", "workclass", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "hours-per-week", "native-country"]

clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
clf.fit(train[columns], train["income"])

clf2 = DecisionTreeClassifier(random_state=1, max_depth=5)
clf2.fit(train[columns], train["income"])

predictions = clf.predict(test[columns])
print(roc_auc_score(test["income"], predictions))

predictions = clf2.predict(test[columns])
print(roc_auc_score(test["income"], predictions))


0.687896422606
0.675985390651


## Combining the Predictions

In [8]:
predictions = clf.predict_proba(test[columns])[:,1]
predictions2 = clf2.predict_proba(test[columns])[:,1]
combined = (predictions + predictions2) / 2
rounded = np.round(combined)

roc_auc = roc_auc_score(rounded, test["income"])
print("roc_auc:", roc_auc)

roc_auc: 0.747124642443


##### the combined predictions of the two trees are more accurate than any single tree.

## Introducing Variation With Bagging

Introducing variation, each tree will be be constructed slightly differently, and will therefore make different predictions. This variation is what puts the "random" in "random forest."

In [10]:
# Build 10 trees
tree_count = 10

# Each "bag" will have 60% of the number of original rows.
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # Select 60% of the rows from train, sampling with replacement.
    # Set a random state to ensure be able to replicate the results.
    # Set it to i instead of a fixed value so don't get the same sample every loop.
    # That would make all of the trees the same.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    
    # Fit a decision tree model to the "bag".
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75)
    clf.fit(bag[columns], bag["income"])
    
    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])
    
combined = np.sum(predictions, axis=0) / 10
rounded = np.round(combined)

roc_auc = roc_auc_score(rounded, test["income"])
print("roc_auc:", roc_auc)

roc_auc: 0.785415640465


## Random Subsets In Scikit-Learn

In [12]:
# Build 10 trees
tree_count = 10

# Each "bag" will have 70% of the number of original rows.
bag_proportion = .7

predictions = []
for i in range(tree_count):
    # Select 80% of the rows from train, sampling with replacement.
    # Set a random state to ensure able to replicate our results.
    # Set it to i instead of a fixed value so don't get the same sample every time.
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    
    # Fit a decision tree model to the "bag".
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=75, splitter="random", max_features="auto")
    clf.fit(bag[columns], bag["income"])
    
    # Using the model, make predictions on the test data.
    predictions.append(clf.predict_proba(test[columns])[:,1])

combined = np.sum(predictions, axis=0) / 10
rounded = np.round(combined)

roc_auc = roc_auc_score(rounded, test["income"])
print("roc_auc:", roc_auc)

roc_auc: 0.789767997764


##### using random subsets above improved the accuracy versus just using bagging.

## Putting It All Together using sklearn.ensemble

In [13]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["income"])

predictions = clf.predict(test[columns])

roc_auc = roc_auc_score(rounded, test["income"])
print("roc_auc:", roc_auc)

roc_auc: 0.789767997764


## Tweaking Parameters To Increase Accuracy

In [14]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=75)

clf.fit(train[columns], train["income"])

predictions = clf.predict(test[columns])

roc_auc = roc_auc_score(rounded, test["income"])
print("roc_auc:", roc_auc)

roc_auc: 0.789767997764


## Overfitting

In [15]:
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=75)
clf.fit(train[columns], train["income"])

predictions = clf.predict(train[columns])
print(roc_auc_score(predictions, train["income"]))

predictions = clf.predict(test[columns])
print(roc_auc_score(predictions, test["income"]))

0.794137608735
0.793788646293


##### overfitting decreases significantly with a random forest, and accuracy goes up overall.