# Predicting Income
### Random Forest & Decision Tree

In this project, we will be using a dataset containing census information from UCI’s Machine Learning Repository.
By using this census data with a random forest, we will try to predict whether or not a person makes more than $50,000 a year.

In [1]:
# def warn(*args, **kwargs):
#     pass
# import warnings
# warnings.warn = warn

In [85]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [36]:
income_data = pd.read_csv('income.csv', header=0, index_col=0, skipinitialspace=True)

In [69]:
income_data.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,sexint,countryint
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,0,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,0,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,1,1
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K,1,0
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K,1,1
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K,0,0
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K,1,0
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K,0,0


In [13]:
income_data.shape

(32561, 15)

In [14]:
income_data.iloc[0]

age                            39
 workclass              State-gov
 fnlwgt                     77516
 education              Bachelors
 education-num                 13
 marital-status     Never-married
 occupation          Adm-clerical
 relationship       Not-in-family
 race                       White
 sex                         Male
 capital-gain                2174
 capital-loss                   0
 hours-per-week                40
 native-country     United-States
 income                     <=50K
Name: 0, dtype: object

Inspecting the first row, we can see from the 'income' column that this person did not make more than $50,000 a year.

In [18]:
income_data.columns

Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',
       ' marital-status', ' occupation', ' relationship', ' race', ' sex',
       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',
       ' income'],
      dtype='object')

There’s a small problem with our data that is a little hard to catch: every string has an extra space at the start.
For example, the first row’s native-country is " United-States", but we want it to be "United-States".
This happens because income.csv has a space after each comma.

Let's fix it by passing skipinitialspace=True to our earlier read_csv call.
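To see the effect in isolation, here is a minimal sketch that reads the same text twice, with and without `skipinitialspace`. The two-row CSV string is a made-up stand-in for income.csv, not the real file.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the ", " separators in income.csv
raw = (
    "age, workclass, native-country\n"
    "39, State-gov, United-States\n"
    "50, Private, Cuba\n"
)

# Default parsing keeps the space that follows each comma
default_df = pd.read_csv(io.StringIO(raw))
# skipinitialspace=True drops whitespace right after the delimiter
clean_df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(default_df.columns))
print(list(clean_df.columns))
print(repr(default_df[" native-country"][0]), repr(clean_df["native-country"][0]))
```

The column names and the string values are both affected, which is why the leading space shows up in `income_data.columns` as well as in the cell values.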

In [37]:
income_data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [43]:
income_data['race'].values

array(['White', 'White', 'White', ..., 'White', 'White', 'White'],
      dtype=object)

Now the problem has been fixed.

In [54]:
income_data['sexint'] = income_data['sex'].apply(lambda row: 0 if row == "Male" else 1)

In [61]:
income_data['native-country'].value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

Since the vast majority of rows have "United-States" as the native country, it makes sense to add a column
where every row with "United-States" becomes 0 and every other country becomes 1.

In [62]:
income_data['countryint'] = income_data['native-country'].apply(lambda row : 0 if row == 'United-States' else 1)
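The same 0/1 encoding can also be written without `apply`, by comparing the whole column at once. This is only a sketch on a made-up three-row frame (the `df` variable and its values are assumptions for illustration), using the same column names as the census data.

```python
import pandas as pd

# Hypothetical mini-frame with the same column names as the census data
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "native-country": ["United-States", "Cuba", "?"],
})

# The comparison yields booleans; astype(int) turns False/True into 0/1
df["sexint"] = (df["sex"] != "Male").astype(int)
df["countryint"] = (df["native-country"] != "United-States").astype(int)

print(df)
```

Note that with either approach the unknown country marker "?" is encoded as 1, the same as any country other than the United States.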

In [64]:
data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sexint", "countryint"]]
labels = income_data["income"]  # 1-D Series; a one-column DataFrame makes forest.fit() emit a DataConversionWarning

In [65]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
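When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below shows this on synthetic arrays (`X` and `y` are made up), and also shows the optional `stratify` argument, which keeps the class proportions the same in both splits. The notebook does not use `stratify`; it is included only as a possible variation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 80 of class 0 and 20 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Default split: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train), len(X_test))

# Stratified split: the test set keeps the 80/20 class ratio
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=1, stratify=y
)
print(int(y_test_s.sum()), "of", len(y_test_s), "test rows are class 1")
```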

### Random Forest Classifier

In [66]:
forest = RandomForestClassifier(random_state=1)

In [67]:
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

In [68]:
forest.score(test_data, test_labels)

0.8225033779633951

In [84]:
print(list(zip(["age", "capital-gain", "capital-loss", "hours-per-week", "sexint", "countryint"],forest.feature_importances_)))

[('age', 0.31351873523247886), ('capital-gain', 0.29270793491347497), ('capital-loss', 0.1174545721257282), ('hours-per-week', 0.20309967949670482), ('sexint', 0.0643516046682865), ('countryint', 0.008867473563326705)]
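The importances are easier to read when sorted. A minimal sketch, assuming a forest fitted on a DataFrame: the data here is generated with `make_classification` and the column names `f0` to `f3` are placeholders, not the census features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with four placeholder feature names
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
print("sum:", importances.sum())
```

The importances are normalized to sum to 1, so each value can be read as a share of the total. In the output above, age and capital-gain together account for roughly 60%.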


Let's use our model to predict the income of a few new people.

In [78]:
# Feature order: age, capital-gain, capital-loss, hours-per-week, sexint, countryint
Kai = np.array([29, 12000, 0, 25, 0, 1])
Lisa = np.array([26, 2345, 0, 50, 1, 1])
Yeol = np.array([32, 0, 0, 13, 0, 0])

In [79]:
sample_test = np.array([Kai, Lisa, Yeol])

In [80]:
forest.predict(sample_test)



array(['>50K', '<=50K', '<=50K'], dtype=object)

Using the Random Forest Classifier,
we predict that Kai earns more than $50,000 per year, while Lisa and Yeol are predicted to earn
$50,000 or less per year.
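Because the forest was fitted on a DataFrame, recent scikit-learn versions warn about missing feature names when `predict` receives a bare NumPy array. One way around this is to wrap the samples in a DataFrame with the same columns. The sketch below is self-contained, so it trains on synthetic rows with a made-up labelling rule; only the column names and the three sample rows come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ["age", "capital-gain", "capital-loss",
                 "hours-per-week", "sexint", "countryint"]

# Synthetic stand-in for the training data (not the census rows)
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "capital-gain": rng.integers(0, 15000, size=200),
    "capital-loss": np.zeros(200, dtype=int),
    "hours-per-week": rng.integers(10, 60, size=200),
    "sexint": rng.integers(0, 2, size=200),
    "countryint": rng.integers(0, 2, size=200),
})
# Made-up labelling rule, for illustration only
labels = np.where(train["capital-gain"] > 7000, ">50K", "<=50K")

model = RandomForestClassifier(random_state=1).fit(train, labels)

# Same three people as above, with named columns and a readable index
samples = pd.DataFrame(
    [[29, 12000, 0, 25, 0, 1],
     [26, 2345, 0, 50, 1, 1],
     [32, 0, 0, 13, 0, 0]],
    columns=feature_names,
    index=["Kai", "Lisa", "Yeol"],
)

predictions = pd.Series(model.predict(samples), index=samples.index)
print(predictions)
```

Naming the columns also guards against accidentally passing the features in the wrong order.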

### Decision Tree Classifier

In [87]:
tree = DecisionTreeClassifier(random_state=1)
tree.fit(train_data, train_labels)

DecisionTreeClassifier(random_state=1)

In [88]:
tree.score(test_data, test_labels)

0.8226262129959464

The decision tree's accuracy is nearly identical to the random forest's: both are at about 82%.
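An accuracy figure is easier to judge next to a baseline that always predicts the most common class. The sketch below shows one way to compute such a baseline with `DummyClassifier`. It runs on synthetic, imbalanced data (the arrays and the labelling rule are assumptions), so its numbers say nothing about the census results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in which roughly three quarters of rows fall in "<=50K"
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = np.where(X[:, 0] > 0.7, ">50K", "<=50K")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Baseline: always predict the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

tree_model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
tree_score = tree_model.score(X_test, y_test)

print("baseline:", baseline_score)
print("decision tree:", tree_score)
```

Running the same comparison on `train_data` and `test_data` would show how much of the 82% goes beyond simply guessing the majority class.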

In [89]:
tree.predict(sample_test)



array(['>50K', '<=50K', '<=50K'], dtype=object)

Here we can see that both classifiers make the same income predictions for our sample data.