#### Random Forest

The most powerful tool to reduce the overfitting decision tree is random forest. A random forest is a kind of ensemble model.Ensembles combines the prediction of multiple models to create a more accurate predictions.

In [1]:
import pandas as pd
import numpy as np
import math
income = pd.read_csv("income.csv")
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

create two decision trees with differen parameters

- One with min_samples_leaf set to 2
- One with max_depth set to 5

In [2]:
cols = ['workclass','education', "marital_status", "occupation", "relationship", "race", "sex", "native_country", "high_income"]

for name in  cols:
    col = pd.Categorical(income[name])
    income[name]= col.codes
    
income.head()    

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


In [3]:
np.random.seed(1)
income=income.reindex(np.random.permutation(income.index))
train_max= math.floor(income.shape[0]*0.8)

train=income.iloc[0:train_max]
test = income.iloc[train_max:]

In [4]:
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]


clf= DecisionTreeClassifier(random_state=1,min_samples_leaf =2 )
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
error = roc_auc_score(test['high_income'], predictions)


clf1= DecisionTreeClassifier(random_state=1,max_depth =5 )
clf1.fit(train[columns], train["high_income"])
clf1_predictions = clf1.predict(test[columns])
error1 = roc_auc_score(test['high_income'],predictions)

print(error)
print(error1)

0.6878964226062301
0.6878964226062301


Combining their predictions and compare the combined accuracy with the individual accuracies of both trees. Using the DecisionTreeClassifier.predict_proba() method on both classifiers take the mean for each row, and then round the results, we'll get ensemble predictions.

In [5]:
predictions = clf.predict_proba(test[columns])[:,1]
predictions2 = clf1.predict_proba(test[columns])[:,1]
combined = (predictions + predictions2) /2
rounded =np.round(combined)
print(roc_auc_score(test['high_income'], rounded))

0.7150846804038882


Combined predictions of the two trees have a higher AUC  than either tree on its own. Random forest is an ensemble of decision tree.In order to make ensembling effective, we have to introduce variation into each individual decision tree model. There are two ways to introduce variations 

 1) Bagging 2) Random feature subsets

#### Bagging
In a random forest we don't train each tree on entire data set. We train it on a random sample of the data, or a "bag," We perform sampling with replacement.which means that after we select a row from the data we're sampling, we put the row back in the data so it can be picked again. Some rows from the original data may appear in the "bag" multiple times

In [6]:
# We'll build 10 trees
tree_count = 10

# Each "bag" will have 60% of the number of original rows
bag_proportion = .6

predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement
    # We set a random state to ensure we'll be able to replicate our results
    # We set it to i instead of a fixed value so we don't get the same sample in every loop
    # That would make all of our trees the same
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    
    # Fit a decision tree model to the "bag"
    clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
    clf.fit(bag[columns], bag["high_income"])
    
    # Using the model, make predictions on the test data
    predictions.append(clf.predict_proba(test[columns])[:,1])
    
combined = np.sum(predictions,axis=0) /tree_count
    
rounded= np.round(combined)
    
print(roc_auc_score(test['high_income'], rounded))

0.7329963297474371


#### Variation using the random feature subsets

In [7]:
def cal_entropy(column):
    count = np.bincount(column)
    #print(count)
    probability = count/len(column)
    print(probability)
    
    entropy = 0
    for prob in probability:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
    return -entropy
def calc_information_gain(data, split_name, target_name):
    
    ### calculate the entrophy
    original_entropy =  cal_entropy(data[target_name])
    
    column = data[split_name]
    median = column.median()
    
    left_split = data[column <= median]
    right_split = data[column > median]
    
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = (subset.shape[0] / data.shape[0]) 
        to_subtract += prob * cal_entropy(subset[target_name])
    return original_entropy - to_subtract

In [8]:
# Create the data set that we used two missions ago
data = pd.DataFrame([
    [0,4,20,0],
    [0,4,60,2],
    [0,5,40,1],
    [1,4,25,1],
    [1,5,35,2],
    [1,5,55,1]
    ])
data.columns = ["high_income", "employment", "age", "marital_status"]

# Set a random seed to make the results reproducible
np.random.seed(1)

# The dictionary to store our tree
tree = {}
nodes = []

# The function to find the column to split on
def find_best_column(data, target_name, columns):
    information_gains = []
    
    # Insert your code here
    cols = np.random.choice(columns, 2)
    for col in cols:
        information_gain = calc_information_gain(data, col, "high_income")
        information_gains.append(information_gain)

    # Find the name of the column with the highest gain
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = cols[highest_gain_index]
    return highest_gain

# The function to construct an ID3 decision tree
def id3(data, target, columns, tree):
    unique_targets = pd.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return
    
    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()
    
    tree["column"] = best_column
    tree["median"] = column_median
    
    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]
    
    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])


# Run the ID3 algorithm on our data set and print the resulting tree
id3(data, "high_income", ["employment", "age", "marital_status"], tree)
print(tree)

[0.5 0.5]
[0.33333333 0.66666667]
[0.66666667 0.33333333]
[0.5 0.5]
[0.66666667 0.33333333]
[0.33333333 0.66666667]
[0.33333333 0.66666667]
[0.5 0.5]
[0. 1.]
[0.33333333 0.66666667]
[0.5 0.5]
[0. 1.]
[0.5 0.5]
[1.]
[0. 1.]
[0.5 0.5]
[0.5 0.5]
[]
[0.66666667 0.33333333]
[0.66666667 0.33333333]
[]
[0.66666667 0.33333333]
[0.5 0.5]
[1.]
[0.5 0.5]
[0.5 0.5]
[]
[0.5 0.5]
[1.]
[0. 1.]
{'number': 1, 'column': 'age', 'median': 37.5, 'left': {'number': 2, 'column': 'employment', 'median': 4.0, 'left': {'number': 3, 'column': 'age', 'median': 22.5, 'left': {'number': 4, 'label': 0}, 'right': {'number': 5, 'label': 1}}, 'right': {'number': 6, 'label': 1}}, 'right': {'number': 7, 'column': 'age', 'median': 55.0, 'left': {'number': 8, 'column': 'age', 'median': 47.5, 'left': {'number': 9, 'label': 0}, 'right': {'number': 10, 'label': 1}}, 'right': {'number': 11, 'label': 0}}}


#### Random Subsets in scikit-learn

In [11]:
tree_count =10

#Each "bag" will have 60% of the number of original rows
bag_proportion =.6
predictions = []
for i in range(tree_count):
    # We select 60% of the rows from train, sampling with replacement
    # We set a random state to ensure we'll be able to replicate our results
    bag= train.sample(frac=bag_proportion, replace=True, random_state=i)
    
    clf= DecisionTreeClassifier(random_state=1, min_samples_leaf=2,splitter="random", max_features='auto')
    clf.fit(bag[columns],  bag["high_income"])
    predictions.append(clf.predict_proba(test[columns])[:,1])
    
combined = np.sum(predictions, axis=0)/ 10
rounded = np.round(combined)

print(roc_auc_score(test["high_income"], rounded))

0.7345958637997538


 RandomForestClassifier

In [12]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=5, random_state=1, min_samples_leaf=2)
clf.fit(train[columns], train['high_income'])
predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))

0.7347461391939776


Tweaking parameters to increase the accuracy:

Tweaking parameters can increase the accuracy of the forest. The easiest tweak is to increase the number of estimators



In [13]:
clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=2)

clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))

0.7379403213124711


####  Reducing Overfitting

In [16]:
clf = DecisionTreeClassifier(random_state=1, min_samples_leaf=5)

clf.fit(train[columns], train["high_income"])

predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))

predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))

clf = RandomForestClassifier(n_estimators=150, random_state=1, min_samples_leaf=5)


clf.fit(train[columns], train['high_income'])
predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], predictions))

predictions = clf.predict(test[columns])
print(roc_auc_score(test["high_income"], predictions))

0.8192570489534683
0.7139325899284541
0.7917047295143252
0.7498874343962398


overfitting decreased with a random forest, and accuracy went up overall.