Welcome

In this project we will be using a Random Forest Classifier to predict whether or not a person earns more than $50,000 a year

The dataset is publicly available and can be found here https://archive.ics.uci.edu/ml/datasets/census+income

Random Forests are an ensemble method for classification: multiple decision trees are built when the classifier is trained, and the classifier then predicts the class chosen most often across the individual trees
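
The voting step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Scikit Learn implements it: the `majority_vote` function and the example votes are made up for this sketch, and Scikit Learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting one hard vote per tree.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a list holding the single (label, count) pair
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical predictions from five trees for one person
votes = ['<=50K', '>50K', '<=50K', '<=50K', '>50K']
print(majority_vote(votes))
```

Three of the five hypothetical trees say '<=50K', so that is the class the sketch returns.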

This will be a very brief project to get familiar with the Random Forest Classifier

In [2]:
#We first import the libraries we will use throughout the project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
#Now we import our data and name the columns
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
census_data = pd.read_csv('adult_data.txt', names = column_names)

In [5]:
census_data.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


Okay let's now get the data into a format that Scikit Learn can work with

The labels we need are in the column 'income' which we will store in a separate variable

In [6]:
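#Selecting the column with a single pair of brackets gives a 1-D Series, which is the label shape Scikit Learn expects
labels = census_data['income']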

In [20]:
#Now let's select the variables we will work with. Note that Scikit Learn's Random Forest Classifier expects numeric input, so columns that contain strings cannot be used until they are converted to numbers
data = census_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week']]

Okay we now split the data into a training set and a test set. We use the first to train the Random Forest Classifier and the second to test its accuracy

In [8]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
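
To make the split less of a black box, here is a rough sketch of the idea in plain Python. It is not Scikit Learn's implementation; `holdout_split` is a hypothetical helper written for this illustration. It assumes the default behaviour of train_test_split, which holds out 25% of the rows for testing when no size is given, and it uses a fixed seed in the same spirit as random_state = 1:

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=1):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 rows of the census data
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed means the same rows land in the test set on every run, which is what makes the accuracy scores in this notebook comparable from one experiment to the next.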

In [9]:
#We now create the Random Forest Classifier from Scikit Learn library
forest = RandomForestClassifier(random_state = 1)
#And fit our training data
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

Time to check the accuracy of the classifier

In [10]:
print(forest.score(test_data, test_labels))

0.82201203783319


### Great, we achieved an accuracy of about 82%
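
For a classifier, `score` returns the mean accuracy, that is, the fraction of test rows whose predicted label matches the true one. A useful sanity check is to compare that number with the accuracy of always predicting the most common class. The sketch below shows both calculations on made-up labels; `accuracy` and `majority_baseline` are hypothetical helpers, not Scikit Learn functions.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def majority_baseline(actual):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

# Hypothetical labels, not taken from the census data
actual = ['<=50K', '<=50K', '<=50K', '>50K']
predicted = ['<=50K', '<=50K', '>50K', '>50K']
print(accuracy(predicted, actual))
print(majority_baseline(actual))
```

If the classifier's accuracy is only slightly above the baseline, most of the score comes from the class imbalance rather than from what the model has learned.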

Let's try to improve the score. We will add a new column 'sex_int' by transforming 'sex' from male and female to 0 and 1 respectively

In [11]:
census_data['sex_int'] = census_data['sex'].apply(lambda row: 1 if row == ' Female' else 0)
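
Note that the comparison above uses ' Female' with a leading space, because the values in this file keep the space that follows each comma. A slightly more forgiving variant strips the whitespace before comparing. This is a minimal sketch; `encode_binary` is a hypothetical helper and not part of pandas.

```python
def encode_binary(value, positive):
    """Return 1 if the cleaned value equals `positive`, else 0."""
    # strip() removes any leading or trailing whitespace before comparing
    return 1 if value.strip() == positive else 0

raw_values = [' Male', ' Female', 'Female']
print([encode_binary(v, 'Female') for v in raw_values])
```

In the notebook it could be applied with something like census_data['sex'].apply(lambda value: encode_binary(value, 'Female')). Another option, assuming the file is read with pd.read_csv as above, is to pass skipinitialspace=True so the spaces are dropped at load time, in which case the comparisons would use 'Female' without the space.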

In [12]:
#Let's see if that worked
census_data.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,sex_int
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,1
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K,1
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K,1
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K,0
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K,1
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K,0


In [13]:
#We go through the same process again, this time adding the sex_int column

data = census_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex_int']]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
forest = RandomForestClassifier(random_state = 1)
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

In [14]:
print(forest.score(test_data, test_labels))

0.8253285837120747


Nothing much changed, huh? Let's transform and add more columns to our training set

In [15]:
#Let's check the values in the native-country column
print(census_data['native-country'].value_counts())

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

We can see that most of the values are United States, so let's try setting the U.S. to 0 and all other countries to 1

In [16]:
census_data['country_int'] = census_data['native-country'].apply(lambda row: 0 if row == ' United-States' else 1)

In [17]:
#Small check to see our new column
census_data.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,sex_int,country_int
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,0,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,0,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,1,1


In [18]:
#We go through the same process again, this time adding the country_int column

data = census_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex_int', 'country_int']]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
forest = RandomForestClassifier(random_state = 1)
forest.fit(train_data, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

In [19]:
print(forest.score(test_data, test_labels))

0.823731728288908


The accuracy of the Random Forest Classifier did not change much with this additional column: it went from about 82.5% to about 82.4%

We can keep transforming columns and adding new features to try and improve the accuracy of the classifier
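
Columns with more than two categories, such as 'workclass' or 'occupation', do not fit the 0/1 trick used so far. One common option is one-hot encoding, where every category gets its own 0/1 column. The sketch below illustrates the idea in plain Python on a few 'workclass' values from the rows shown earlier; `one_hot` is a hypothetical helper written for this example. In pandas, pd.get_dummies offers this kind of encoding.

```python
def one_hot(values):
    """Map each value to a dict with one 0/1 entry per distinct category."""
    categories = sorted(set(values))
    return [
        {f'is_{category}': int(value == category) for category in categories}
        for value in values
    ]

# A few 'workclass' values from the rows shown earlier
workclass = ['State-gov', 'Self-emp-not-inc', 'Private', 'Private']
for row in one_hot(workclass):
    print(row)
```

Each row ends up with exactly one entry set to 1, so no artificial ordering is imposed on the categories the way a single integer code would.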

That is all for this project for now

Thank you for reading