# Predicting Income with Random Forests
In this project, we use a dataset of census information from [UCI’s Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census%20income).

We train a random forest on this census data to predict whether a person earns more than $50,000 a year.

In [1]:
# Silence library warnings in the notebook output by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

In [2]:
income_data = pd.read_csv('income.csv', header=0, delimiter=", ", engine='python')

In [3]:
income_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
income_data.iloc[0]

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
income                    <=50K
Name: 0, dtype: object

In [5]:
income_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [6]:
income_data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

## Clean data

In [7]:
income_data['sex-int'] = income_data['sex'].map({'Male':1, 'Female':0})
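
`Series.map` with a dictionary matches keys exactly, so any value that is not a key becomes `NaN`. The sketch below uses a made-up three-row series (not rows from `income.csv`) to show how a stray spelling would surface as a missing value.

```python
import pandas as pd

# Hypothetical values; "male" is deliberately misspelled to show the effect.
sex = pd.Series(["Male", "Female", "male"])

# Keys are matched exactly (case-sensitive); unmatched values map to NaN.
encoded = sex.map({"Male": 1, "Female": 0})

unmapped = int(encoded.isna().sum())
print(unmapped)  # 1
```

Checking `isna().sum()` after a mapping like this is a cheap way to confirm that every category was covered.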

In [8]:
def factorize(column):
    return pd.factorize(column)[0]+1

print(factorize(income_data['occupation']))

[1 2 3 ... 1 1 2]
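
To make the encoding concrete, here is a small sketch of what `pd.factorize` returns. The four-row series is illustrative only and is not taken from `income.csv`.

```python
import pandas as pd

# Toy column with one repeated category.
occupations = pd.Series(["Adm-clerical", "Exec-managerial", "Adm-clerical", "Sales"])

# factorize returns (codes, uniques); codes start at 0 in order of first appearance.
codes, uniques = pd.factorize(occupations)
shifted = codes + 1  # the same +1 shift used by factorize() in this notebook

print(shifted.tolist())  # [1, 2, 1, 3]
print(list(uniques))     # ['Adm-clerical', 'Exec-managerial', 'Sales']
```

The integer codes reflect only the order in which categories first appear, not any meaningful ranking.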


In [9]:
income_data['occupation'].value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

In [10]:
income_data['native-country'].value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                      

In [11]:
income_data['workclass'].value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [12]:
income_data['education'].value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

In [13]:
income_data['marital-status'].value_counts()

Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital-status, dtype: int64

In [14]:
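def string_to_01(data, column, string):
    # 0 where the value equals `string`, 1 otherwise (the comparison is case-sensitive)
    return data[column].apply(lambda x: 0 if x == string else 1)

income_data['native-int'] = string_to_01(income_data, 'native-country', 'United-States')
# The category is spelled 'Private' in the data; lowercase 'private' never matches,
# which leaves the column constant at 1 and gives it zero feature importance.
income_data['workclass-int'] = string_to_01(income_data, 'workclass', 'Private')
# Integer codes for the multi-category columns (order of first appearance, starting at 1)
income_data['occupation-int'] = factorize(income_data['occupation'])
income_data['education-int'] = factorize(income_data['education'])
income_data['marital-status-int'] = factorize(income_data['marital-status'])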
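
Integer codes put categories on an arbitrary numeric scale. One common alternative, not used in this notebook, is one-hot encoding with `pd.get_dummies`. The sketch below applies it to a made-up three-row frame; the column values are borrowed from the categories listed above, but the frame itself is hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["workclass"], dtype=int)

print(encoded.columns.tolist())
```

Each category gets its own column, so no ordering is implied, at the cost of a wider feature matrix.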

## Format The Data For Scikit-learn

In [15]:
labels = income_data['income']
many_data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week"]]

train_data, test_data, train_labels, test_labels = train_test_split(many_data, labels, random_state=1)
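
When `test_size` is not given, `train_test_split` holds out 25% of the rows. The sketch below shows that default on synthetic arrays, along with the optional `stratify` argument, which this notebook does not use; the array shapes and labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 rows, with a 75/25 class imbalance in the labels.
X = np.arange(160).reshape(80, 2)
y = np.array([0] * 60 + [1] * 20)

# Default split: 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train), len(X_test))  # 60 20

# stratify=y keeps the class ratio the same in both splits.
_, _, y_train_s, y_test_s = train_test_split(X, y, stratify=y, random_state=1)
print(y_train_s.mean(), y_test_s.mean())  # 0.25 0.25
```

Stratifying can matter for the income labels, since the two classes are unlikely to be equally frequent.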

## Create The Random Forest

In [16]:
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
score = forest.score(test_data, test_labels)
print(score)

0.8222577078982926


In [17]:
list(zip(["age", "capital-gain", "capital-loss", "hours-per-week"],forest.feature_importances_))

[('age', 0.2899902941678581),
 ('capital-gain', 0.35559953754473134),
 ('capital-loss', 0.14510085058672306),
 ('hours-per-week', 0.2093093177006875)]
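
Pairing names with importances is easier to read once the values are sorted. A minimal sketch, assuming synthetic data from `make_classification` and placeholder feature names (`f0` to `f3`) rather than the census columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the census features.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
names = ["f0", "f1", "f2", "f3"]  # placeholder names

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Index the importances by name and sort from most to least important.
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.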

In [18]:
# add sex-int

labels = income_data['income']
many_data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int"]]

train_data, test_data, train_labels, test_labels = train_test_split(many_data, labels, random_state=1)

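# Create The Random Forest
forest = RandomForestClassifier(random_state=1)
# Fit on the training split; fitting on test_data would score the model on rows it has already seen
forest.fit(train_data, train_labels)
score = forest.score(test_data, test_labels)

print(score)
print(list(zip(["age", "capital-gain", "capital-loss", "hours-per-week", "sex"],forest.feature_importances_)))

0.8777791426114728
[('age', 0.3567466687648515), ('capital-gain', 0.26051378101374656), ('capital-loss', 0.10547162079308159), ('hours-per-week', 0.21412400088101088), ('sex', 0.06314392854730949)]

The output above was recorded with the forest fit on `test_data` rather than `train_data`, so 0.878 is accuracy on rows the model had already seen and is not comparable with the 0.822 held-out score of the first model.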
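
Scoring a forest on the same rows it was fit on overstates accuracy, because fully grown trees largely memorize their training data. The sketch below illustrates the gap on synthetic data; the dataset, the `flip_y` label noise, and the variable names are assumptions for illustration, not part of the census analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y adds label noise so the task cannot be learned perfectly.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, flip_y=0.2, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X_train, y_train)

seen_score = forest.score(X_train, y_train)     # rows the model was fit on
held_out_score = forest.score(X_test, y_test)   # rows the model has never seen
print(seen_score, held_out_score)
```

Only the held-out score estimates how the model would do on new people.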


In [20]:
# add native-country 

labels = income_data['income']
many_data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int", "native-int"]]

train_data, test_data, train_labels, test_labels = train_test_split(many_data, labels, random_state=1)

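# Create The Random Forest
forest = RandomForestClassifier(random_state=1)
# Fit on the training split; fitting on test_data would score the model on rows it has already seen
forest.fit(train_data, train_labels)
score = forest.score(test_data, test_labels)
print(score)
# The name list needs one entry per feature; zip silently drops importances that have no name
print(list(zip(["age", "capital-gain", "capital-loss", "hours-per-week", "sex", "native-int"],forest.feature_importances_)))

0.8831838840437293
[('age', 0.36810265849433305), ('capital-gain', 0.24551815314306222), ('capital-loss', 0.09803541590672699), ('hours-per-week', 0.21421150906431205), ('sex', 0.06354219573424456)]

The output above was recorded with the forest fit on `test_data` and with `native-int` missing from the name list, so 0.883 is accuracy on already-seen rows and the sixth importance is not shown. Because the importances sum to 1, the omitted `native-int` value is about 0.011.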


## Explore On Your Own

In [21]:
# Random Forest

labels_2 = income_data['income']
many_data_2 = income_data[["age", "capital-gain", "capital-loss", "hours-per-week","sex-int", "native-int",
                        "workclass-int", "occupation-int", "education-int", "marital-status-int"]]

train_data, test_data, train_labels, test_labels = train_test_split(many_data_2, labels_2, test_size=0.2, random_state=1)

forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
score = forest.score(test_data, test_labels)
print(score)
list(zip(["age", "capital-gain", "capital-loss", "hours-per-week","sex-int", "native-int","workclass-int", 
          "occupation-int", "education-int", "marital-status-int"], forest.feature_importances_))

0.8484569322892677


[('age', 0.2484423946970997),
 ('capital-gain', 0.15457577071511933),
 ('capital-loss', 0.052094246428671304),
 ('hours-per-week', 0.12432737532346944),
 ('sex-int', 0.02800263369225799),
 ('native-int', 0.013422610650784065),
 ('workclass-int', 0.0),
 ('occupation-int', 0.11678639214042627),
 ('education-int', 0.10343861031626786),
 ('marital-status-int', 0.15890996603590413)]

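- Which features tend to be more relevant? In this run, `age` ranks highest (about 0.25), followed by `marital-status-int` (0.16) and `capital-gain` (0.15). `hours-per-week`, `occupation-int`, and `education-int` each contribute roughly 0.10 to 0.12, while `capital-loss` (0.05), `sex-int` (0.03), and `native-int` (0.01) contribute little. `workclass-int` scores 0.0 in this output because the run compared `workclass` against the lowercase string `'private'`, which never matches `'Private'` in the data, so the column was constant.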

In [22]:
# Decision tree

labels_2 = income_data['income']
many_data_2 = income_data[["age", "capital-gain", "capital-loss", "hours-per-week","sex-int", "native-int",
                        "workclass-int", "occupation-int", "education-int", "marital-status-int"]]

train_data, test_data, train_labels, test_labels = train_test_split(many_data_2, labels_2, test_size=0.2, random_state=1)

classifier = tree.DecisionTreeClassifier(random_state=1)
classifier.fit(train_data, train_labels)
score = classifier.score(test_data, test_labels)

print(score)
list(zip(["age", "capital-gain", "capital-loss", "hours-per-week","sex-int", "native-int","workclass-int", 
          "occupation-int", "education-int", "marital-status-int"], classifier.feature_importances_))

0.8226623675725472


[('age', 0.20223783546788687),
 ('capital-gain', 0.1897618957019309),
 ('capital-loss', 0.054188247571789104),
 ('hours-per-week', 0.12437844306211633),
 ('sex-int', 0.019687006042570013),
 ('native-int', 0.019908724666832068),
 ('workclass-int', 0.0),
 ('occupation-int', 0.11838657844176449),
 ('education-int', 0.08356634224159795),
 ('marital-status-int', 0.18788492680351232)]

- When does the random forest do better than the single tree? 
- When does a single tree do just as well as the forest?
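
One way to explore these two questions is to compare both models under cross-validation while varying how deep the trees may grow. In the runs above, the forest scored 0.848 against 0.823 for the single tree. The sketch below is a starting point on synthetic data, not a result for the census dataset; the dataset parameters and the two depth settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification task standing in for the census data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=1
)

results = {}
for depth in (2, None):  # very shallow trees vs. fully grown trees
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    forest_clf = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=1
    )
    # Mean 5-fold accuracy for (single tree, forest) at this depth.
    results[depth] = (
        cross_val_score(tree_clf, X, y, cv=5).mean(),
        cross_val_score(forest_clf, X, y, cv=5).mean(),
    )
    print(depth, results[depth])
```

A common expectation is that averaging helps most when individual trees are deep and overfit, and helps less when trees are shallow or a few features dominate; rerunning this comparison on `many_data_2` would show whether that holds here.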