
In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# Predicting income using a Random Forest Classifier

In [4]:
# Added after a warning appeared when loading the dataset.
# Replacing warnings.warn with a no-op silences warnings issued through
# warnings.warn in general, not only that one.
import warnings

def warn(*args, **kwargs):
    pass

warnings.warn = warn

## The income data

This project uses income data supplied by UCI's machine learning repository:

https://archive.ics.uci.edu/ml/index.php

The aim is to accurately predict a person's income based on a number of features present in the dataset using a Random Forest Classifier.

In [5]:
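# a multi-character delimiter requires pandas' Python parsing engine;
# naming the engine explicitly avoids the ParserWarning about falling back to it
income_data = pd.read_csv('income.csv', delimiter=', ', engine='python')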
income_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [32]:
income_data['income'].unique()

array(['<=50K', '>50K'], dtype=object)

The income column does not contain continuous data. Instead, each person is classified as having an income of either at most $50k (`<=50K`) or more than $50k (`>50K`).

This is therefore a binary classification problem rather than a regression problem.
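Before training a binary classifier it can help to see how the two classes are balanced. The sketch below uses a small made-up Series (`toy_income` is a hypothetical stand-in, not the real column) to show how `value_counts` reports both the counts and the share of each class.

```python
import pandas as pd

# Toy stand-in for the `income` column; the values are made up for illustration.
toy_income = pd.Series(
    ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"], name="income"
)

# Number of rows in each class.
class_counts = toy_income.value_counts()

# Share of each class. A lopsided split means raw accuracy
# should be read against the share of the larger class.
class_shares = toy_income.value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

On the real data, the same two calls on `income_data['income']` would show how far the classes are from an even split.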

In [6]:
income_data.iloc[0]

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
income                    <=50K
Name: 0, dtype: object

In [7]:
len(income_data)

32561

# Format the data for Scikit-Learn

In [8]:
# the labels for the data are in the column "income"
labels = income_data['income']

In [11]:
# also have to select the features that we'll use
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex']]

In [12]:
# we then have to split the data into training set and test set:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
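
When no `test_size` is passed, `train_test_split` holds out 25% of the rows for testing. The sketch below illustrates this on made-up data (`toy_data` and `toy_labels` are hypothetical stand-ins) and also passes `stratify`, an optional argument the notebook does not use, which keeps the class proportions similar in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy features and labels, standing in for `data` and `labels`.
toy_data = pd.DataFrame({"age": range(20, 40), "hours-per-week": [40] * 20})
toy_labels = pd.Series(["<=50K"] * 15 + [">50K"] * 5)

# No test_size is given, so 25% of the rows are held out.
# stratify=toy_labels keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    toy_data, toy_labels, random_state=1, stratify=toy_labels
)

print(len(X_train), len(X_test))
print(y_test.value_counts())
```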

# Creating the Random Forest

In [13]:
forest = RandomForestClassifier(random_state = 1)

In [14]:
forest.fit(train_data, train_labels)

ValueError: could not convert string to float: 'Female'

The error occurs because the `sex` column contains the strings `Male` and `Female`, and the classifier needs numeric input. We can convert it into a binary column where `0 = Male` and `1 = Female`.

## Fixing the feature problem

In [15]:
# creating a new column called "sex-int":
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
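
The `apply` call above encodes every value other than `"Male"` as 1, which would include missing or misspelled entries. As an alternative sketch, an explicit dictionary passed to `Series.map` leaves unexpected values as `NaN` so they are easy to spot. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the `sex` column.
toy = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

# Explicit mapping: a value missing from the dict becomes NaN
# rather than being silently encoded as 1.
toy["sex-int"] = toy["sex"].map({"Male": 0, "Female": 1})

print(toy)
```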

In [16]:
income_data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income', 'sex-int'],
      dtype='object')

In [17]:
# reassigning the data variable to have the new `sex-int` column and remove the `sex` column:
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', "sex-int"]]

In [18]:
# re-splitting the data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)

## Re-training the model

In [19]:
forest = RandomForestClassifier(random_state = 1)
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [20]:
print(forest.score(test_data, test_labels))

0.8272939442328953
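
An accuracy score is easier to interpret next to a trivial baseline. The sketch below is not part of the original analysis: it uses synthetic data (not the income dataset) and scikit-learn's `DummyClassifier`, which always predicts the most frequent training class, to show how such a comparison could be set up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the income dataset):
# the label depends on the first feature and the classes are imbalanced.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] > 0.7, ">50K", "<=50K")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(random_state=1).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"baseline: {baseline_acc:.3f}  forest: {forest_acc:.3f}")
```

Applied to the notebook's own `train_data` and `test_data`, the gap between the two scores would indicate how much the forest learns beyond the class imbalance.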


## Attempting to improve classification accuracy

There are other columns containing strings that could be changed and added to the DataFrame to increase accuracy.

Most of the rows have `United-States` as their `native-country`, so we encode that column as a binary feature: `0` for `United-States` and `1` for every other country.

In [21]:
# adding a new column that contains 0 if the country is the USA or 1 otherwise
income_data['country-int'] = income_data['native-country'].apply(lambda row: 0 if row == 'United-States' else 1)
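
String columns with more than two meaningful categories, such as `workclass`, do not reduce cleanly to a single 0/1 column. One common option, shown here only as a sketch on made-up rows, is one-hot encoding with `pd.get_dummies`, which creates one indicator column per distinct value.

```python
import pandas as pd

# Made-up rows standing in for two of the dataset's columns.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# One indicator column per distinct workclass value;
# the numeric `age` column passes through unchanged.
encoded = pd.get_dummies(toy, columns=["workclass"])

print(encoded.columns.tolist())
```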

In [22]:
# re-selecting the data
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', "sex-int", 'country-int']]

In [26]:
# re-splitting the data
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)

In [27]:
# re-training the model
forest = RandomForestClassifier(random_state = 1)
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [28]:
print(forest.score(test_data, test_labels))

0.8225033779633951


Adding this new feature has actually reduced the classification accuracy slightly, from about 0.827 to about 0.823!
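
A difference this small on a single train/test split may be within the noise of the split itself. One way to check, sketched here on synthetic data (not the income dataset), is to compare the two feature sets with `cross_val_score`, which returns one accuracy per fold. The array names and the random binary column are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the income dataset).
rng = np.random.default_rng(1)
informative = rng.normal(size=(300, 2))
# Random 0/1 column unrelated to the label, playing the role of a weak feature.
noise = rng.integers(0, 2, size=(300, 1))
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

X_without = informative
X_with = np.hstack([informative, noise])

forest = RandomForestClassifier(n_estimators=50, random_state=1)

# Five accuracy scores per feature set, one for each fold.
scores_without = cross_val_score(forest, X_without, y, cv=5)
scores_with = cross_val_score(forest, X_with, y, cv=5)

print(f"without: {scores_without.mean():.3f} +/- {scores_without.std():.3f}")
print(f"with:    {scores_with.mean():.3f} +/- {scores_with.std():.3f}")
```

If the fold-to-fold spread is larger than the gap between the two means, the extra feature has not clearly helped or hurt.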