# Income prediction on census data

# Objective: 
To predict whether income exceeds 50K/yr based on census data

Dataset: Adult Data Set

https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data


Variable description:
    
age: continuous

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

class: >50K, <=50K

In [1]:
# Pandas and Numpy libraries
import pandas as pd
import numpy as np
import sklearn

# For preprocessing the data
from sklearn.preprocessing import Imputer
from sklearn import preprocessing

In [2]:
sklearn.__version__

'0.20.1'

In [3]:
# To split the dataset into train and test datasets
from sklearn.model_selection import train_test_split
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score

In [4]:
adult_df = pd.read_csv('adult.data', header = None)

In [5]:
adult_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


Load the dataset. Observe that this file has a .data extension, although its contents are plain comma-separated text.

To import the census data, we use the pandas read_csv() method, a simple and fast way to load delimited text.

We pass two parameters. The first, ‘adult.data’, is the file name. The header parameter tells pandas whether the first row of the file contains column names. Our dataset has no header row, so we pass None and pandas numbers the columns from 0 to 14.

We do not pass a delimiter, so read_csv() splits on its default separator, the comma. The file separates fields with a comma followed by a space, so every string value keeps a leading space (for example “ State-gov” and “ ?”). This is why the comparisons later in this notebook include that space.
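If the leading spaces are unwanted, one option is to let read_csv() strip them while loading. The sketch below is not what this notebook does; it assumes a two-row in-memory sample and a shortened, hypothetical column list, and uses the skipinitialspace and na_values parameters of read_csv() so that “?” arrives as NaN.

```python
import io

import pandas as pd

# Two made-up rows in the same "comma followed by a space" layout as adult.data
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, <=50K\n"
    "54, ?, 180211, Some-college, >50K\n"
)

# Shortened, hypothetical column list for the illustration
columns = ["age", "workclass", "fnlwgt", "education", "income"]

df = pd.read_csv(
    sample,
    header=None,            # the data has no header row
    names=columns,          # supply column names at load time
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat "?" as a missing value
)

print(df["workclass"].tolist())
print(df.isnull().sum())
```

With this approach isnull() reports the missing values directly, so the later comparisons against “ ?” would no longer be needed.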

In [6]:
# Print columns in the adult data set
adult_df.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype='int64')

In [7]:
# Adding headers to the dataframe 
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

In [8]:
# Number of records(rows) in the dataframe
print(len(adult_df))
print(len(adult_df.index))

32561
32561


In [9]:
adult_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [10]:
# Handling missing data
# Test whether there is any null value in our dataset or not. We can do this using isnull() method.
adult_df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

The above output shows that there are no null values in our dataset.

Next, let’s test whether any categorical attribute contains “ ?”. Datasets sometimes use “?” or a blank string in place of missing values. The code snippet below counts, for each categorical column of the adult_df data frame, the values equal to “ ?” (note the leading space).

In [11]:
for value in ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']:
    print(value,":", sum(adult_df[value] == ' ?'))

workclass : 1836
education : 0
marital_status : 0
occupation : 1843
relationship : 0
race : 0
sex : 0
native_country : 583
income : 0
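The same count can be obtained without an explicit loop by comparing the whole frame against the placeholder. This is a minimal sketch on a made-up three-row frame (the name toy is hypothetical), not on the census data itself.

```python
import pandas as pd

# Made-up frame using the same " ?" placeholder (with its leading space)
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" ?", " ?", " Sales"],
    "sex": [" Male", " Female", " Male"],
})

# Element-wise comparison gives a boolean frame; sum() counts True per column
missing_counts = (toy == " ?").sum()
print(missing_counts)
```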


In [12]:
categorical_columns= ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']
for column in categorical_columns:
    print("Unique values in column {0} are : ".format(column))
    print(adult_df[column].unique())
    print(" ")
    print(" ")

Unique values in column workclass are : 
[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']
 
 
Unique values in column education are : 
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
 
 
Unique values in column marital_status are : 
[' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']
 
 
Unique values in column occupation are : 
[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']
 
 
Unique values in column relationship are : 
[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-rela

The counts printed in In [11] show that there are 1836 missing values in the workclass attribute, 1843 missing values in the occupation attribute and 583 missing values in the native_country attribute.

# Data preprocessing

For preprocessing, we make a duplicate of our original dataframe: adult_df is copied to adult_df_rev. Note that copy() makes a deep copy by default (deep=True), so changes made to adult_df_rev do not affect adult_df and the raw data stays available for comparison.

In [13]:
## Deep copy of adult_df
adult_df_rev = adult_df.copy()
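The difference between a deep copy and a plain assignment can be seen on a small made-up frame. In this sketch the names original, alias and deep are hypothetical: alias is just a second name for the same object, while deep is an independent copy.

```python
import pandas as pd

original = pd.DataFrame({"workclass": [" ?", " Private"]})

alias = original        # no copy: both names point at the same DataFrame
deep = original.copy()  # deep copy (deep=True is the default)

# Editing the deep copy leaves the original untouched
deep.loc[0, "workclass"] = " Private"

# Editing through the alias changes the original as well
alias.loc[1, "workclass"] = " State-gov"

print(original)
print(deep)
```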

Before handling the missing values, we need some summary statistics of our dataframe. For this, we can use the describe() method, which generates summary statistics and excludes NaN values from the calculations.

In [14]:
adult_df_rev.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [15]:
adult_df_rev.describe(include= 'all')['workclass']['top']

' Private'

In [16]:
adult_df_rev.describe(include= 'all')['workclass'][2]

' Private'

We pass the “include” parameter with the value “all” to request summary statistics for every attribute, not only the numeric ones. For categorical columns, the top row holds the most frequent value.
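As a small illustration of what describe(include='all') returns, the sketch below uses a made-up three-row frame (toy is a hypothetical name) and reads the top and freq rows by label.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" Private", " ?", " Private"],
})

summary = toy.describe(include="all")

# For a categorical column, 'top' is the most frequent value and 'freq' its count
top_workclass = summary.loc["top", "workclass"]
top_count = summary.loc["freq", "workclass"]

print(top_workclass, top_count)
```

Reading the row by its label ('top') is less fragile than reading it by position, because the position depends on which statistics describe() reports.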

In [17]:
adult_df_rev.describe(include= 'all')

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
count,32561.0,32561,32561.0,32561,32561.0,32561,32561,32561,32561,32561,32561.0,32561.0,32561.0,32561,32561
unique,,9,,16,,7,15,6,5,2,,,,42,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,22696,,10501,,14976,4140,13193,27816,21790,,,,29170,24720
mean,38.581647,,189778.4,,10.080679,,,,,,1077.648844,87.30383,40.437456,,
std,13.640433,,105550.0,,2.57272,,,,,,7385.292085,402.960219,12.347429,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117827.0,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178356.0,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237051.0,,12.0,,,,,,0.0,0.0,45.0,,


# Data imputation 

Some of the categorical attributes have missing values, i.e. “ ?”. We replace each “ ?” with the value in the top row of the describe() output above, which is the most frequent value of that column. For example, we replace the “ ?” values of the workclass attribute with the “ Private” value.

In [18]:
for value in ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']:
    # Most frequent value of the column: the 'top' row of describe()
    replaceValue = adult_df_rev.describe(include='all')[value]['top']
    print(replaceValue)
    # Assign through .loc so that pandas writes to adult_df_rev itself, not to a temporary copy
    adult_df_rev.loc[adult_df_rev[value] == ' ?', value] = replaceValue

 Private
 HS-grad
 Married-civ-spouse
 Prof-specialty
 Husband
 White
 Male
 United-States
 <=50K


In [19]:
for value in ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']:
    print(value,":", sum(adult_df_rev[value] == ' ?'))

workclass : 0
education : 0
marital_status : 0
occupation : 0
relationship : 0
race : 0
sex : 0
native_country : 0
income : 0
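The imputation step can also be written with mode(), which returns the most frequent value directly. This is a minimal sketch on a made-up column, assuming that “ ?” should be excluded before the most frequent value is computed.

```python
import pandas as pd

toy = pd.DataFrame({"workclass": [" Private", " ?", " Private", " State-gov"]})

for column in ["workclass"]:
    is_missing = toy[column] == " ?"
    # Most frequent value among the rows that are not the placeholder
    most_frequent = toy.loc[~is_missing, column].mode()[0]
    # Write through .loc to modify the frame in place
    toy.loc[is_missing, column] = most_frequent

print(toy["workclass"].tolist())
```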


Gaussian Naive Bayes needs numeric input, so we convert the categorical values to numbers.

We encode the labels of each column with values between 0 and n_classes-1. For sex, which has two classes, the codes are 0 and 1; for native_country, which has 42 distinct values, they run from 0 to 41.

To implement this, we use LabelEncoder from the scikit-learn library.

In [20]:
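# Label encoding (integer codes per category, not one-hot encoding)
# Note: the columns are read from adult_df, not adult_df_rev, so ' ?' is encoded as a category of its own
# and the outputs below reflect that. Read from adult_df_rev instead to encode the imputed values.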
le = preprocessing.LabelEncoder()
workclass_cat = le.fit_transform(adult_df.workclass)
education_cat = le.fit_transform(adult_df.education)
marital_cat   = le.fit_transform(adult_df.marital_status)
occupation_cat = le.fit_transform(adult_df.occupation)
relationship_cat = le.fit_transform(adult_df.relationship)
race_cat = le.fit_transform(adult_df.race)
sex_cat = le.fit_transform(adult_df.sex)
native_country_cat = le.fit_transform(adult_df.native_country)
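LabelEncoder assigns codes in the sorted order of the labels, not in the order in which they first appear. The sketch below shows this on a short made-up list; the names encoder and mapping are hypothetical, and classes_ is the fitted attribute that holds the sorted labels.

```python
from sklearn.preprocessing import LabelEncoder

relationship = [" Not-in-family", " Husband", " Wife", " Husband", " Own-child"]

encoder = LabelEncoder()
codes = encoder.fit_transform(relationship)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(codes.tolist())
print(mapping)

# inverse_transform converts codes back to the original labels
print(encoder.inverse_transform(codes).tolist())
```

Because the notebook reuses one encoder for every column, only the mapping of the last fitted column remains in le.classes_; keeping one encoder per column is a common way to retain all mappings.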

In [21]:
#initialize the encoded categorical columns
adult_df_rev['workclass_cat'] = workclass_cat
adult_df_rev['education_cat'] = education_cat
adult_df_rev['marital_cat'] = marital_cat
adult_df_rev['occupation_cat'] = occupation_cat
adult_df_rev['relationship_cat'] = relationship_cat
adult_df_rev['race_cat'] = race_cat
adult_df_rev['sex_cat'] = sex_cat
adult_df_rev['native_country_cat'] = native_country_cat

In [22]:
adult_df_rev.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,...,native_country,income,workclass_cat,education_cat,marital_cat,occupation_cat,relationship_cat,race_cat,sex_cat,native_country_cat
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,...,United-States,<=50K,7,9,4,1,1,4,1,39
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,United-States,<=50K,6,9,2,4,0,4,1,39
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,...,United-States,<=50K,4,11,0,6,1,4,1,39
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,...,United-States,<=50K,4,1,2,6,0,2,1,39
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,...,Cuba,<=50K,4,9,2,10,5,2,0,5


In [23]:
#drop the old categorical columns from dataframe
dummy_fields = ['workclass','education','marital_status','occupation','relationship','race', 'sex', 'native_country']
adult_df_rev = adult_df_rev.drop(dummy_fields, axis = 1)

In [24]:
adult_df_rev

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,income,workclass_cat,education_cat,marital_cat,occupation_cat,relationship_cat,race_cat,sex_cat,native_country_cat
0,39,77516,13,2174,0,40,<=50K,7,9,4,1,1,4,1,39
1,50,83311,13,0,0,13,<=50K,6,9,2,4,0,4,1,39
2,38,215646,9,0,0,40,<=50K,4,11,0,6,1,4,1,39
3,53,234721,7,0,0,40,<=50K,4,1,2,6,0,2,1,39
4,28,338409,13,0,0,40,<=50K,4,9,2,10,5,2,0,5
5,37,284582,14,0,0,40,<=50K,4,12,2,4,5,4,0,39
6,49,160187,5,0,0,16,<=50K,4,6,3,8,1,2,0,23
7,52,209642,9,0,0,45,>50K,6,11,2,4,0,4,1,39
8,31,45781,14,14084,0,50,>50K,4,12,4,10,1,4,0,39
9,42,159449,13,5178,0,40,>50K,4,9,2,4,0,4,1,39


Reorder the columns so that the target, income, comes last. We pass the list of column names through the columns parameter of reindex(), which replaces the deprecated reindex_axis() method.

In [25]:
adult_df_rev = adult_df_rev.reindex(columns=['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                                    'education_num', 'marital_cat', 'occupation_cat',
                                    'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                                    'capital_loss', 'hours_per_week', 'native_country_cat',
                                    'income'])
adult_df_rev.head(5)


Unnamed: 0,age,workclass_cat,fnlwgt,education_cat,education_num,marital_cat,occupation_cat,relationship_cat,race_cat,sex_cat,capital_gain,capital_loss,hours_per_week,native_country_cat,income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,<=50K
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,<=50K
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,<=50K
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,<=50K
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,<=50K


In [31]:
X  = adult_df_rev.values[:,:-1]

array([[39, 7, 77516, ..., 0, 40, 39],
       [50, 6, 83311, ..., 0, 13, 39],
       [38, 4, 215646, ..., 0, 40, 39],
       ...,
       [58, 4, 151910, ..., 0, 40, 39],
       [22, 4, 201490, ..., 0, 20, 39],
       [52, 5, 287927, ..., 0, 40, 39]], dtype=object)

In [32]:
adult_df_rev.values[:,-1:]

array([[' <=50K'],
       [' <=50K'],
       [' <=50K'],
       ...,
       [' <=50K'],
       [' <=50K'],
       [' >50K']], dtype=object)

In [34]:
adult_df_rev['income'].values

array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype=object)

Now we have created multiple encoded categorical columns such as “marital_cat” and “race_cat”.

# Data Slicing

In [35]:
# Arrange data into independent variables and dependent variables
X = adult_df_rev.values[:,:-1]  ## Features
Y = adult_df_rev.values[:,14]  ## Target

In [36]:
# Split the data into train and test
# Train data size: 70% of original data
# Test data size: 30% of original data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 10)

Implement Gaussian Naive Bayes

In [37]:
clf = GaussianNB()
clf.fit(X_train, Y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

The GaussianNB classifier is now built and trained: the fit() method trains it on the training data. The model is ready to make predictions, which we obtain by calling the predict() method with the test set features as its argument.

In [38]:
Y_pred = clf.predict(X_test)
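The fit() and predict() calls can be tried in isolation on a tiny, well-separated example. The numbers below are made up for illustration (two features per row) and are not taken from the census data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up training rows: [age, hours_per_week]
X_toy = np.array([[25, 20], [30, 25], [28, 22],
                  [50, 60], [55, 65], [52, 58]])
y_toy = np.array([" <=50K", " <=50K", " <=50K", " >50K", " >50K", " >50K"])

model = GaussianNB()
model.fit(X_toy, y_toy)  # estimates a mean and variance per feature and class

new_rows = np.array([[27, 21], [53, 62]])
pred = model.predict(new_rows)
proba = model.predict_proba(new_rows)  # one probability per class, per row

print(pred)
print(model.classes_)
```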

Accuracy of our Gaussian Naive Bayes model

In [39]:
accuracy_score(Y_test, Y_pred, normalize = True)

0.7925069096120381
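To put this accuracy in context, it helps to compare it with a model that always predicts the majority class. The rough check below uses only counts already shown in the describe() output (24720 of the 32561 rows have income <=50K); it is computed on the full dataset, so the proportion in the 30% test split may differ slightly.

```python
# Counts taken from the describe(include='all') output above
total_rows = 32561
majority_rows = 24720  # rows with income ' <=50K'

# Accuracy of always predicting ' <=50K'
baseline_accuracy = majority_rows / total_rows

# Accuracy reported by accuracy_score for the Gaussian Naive Bayes model
model_accuracy = 0.7925069096120381

print(round(baseline_accuracy, 4))
print(round(model_accuracy - baseline_accuracy, 4))
```

The baseline comes out at roughly 0.76, so the Gaussian Naive Bayes model improves on it by only a few percentage points.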

In [41]:
# LabelEncoder assigns codes in sorted (alphabetical) order of the labels
relation_dict = {0: ' Husband',
                 1: ' Not-in-family',
                 2: ' Other-relative',
                 3: ' Own-child',
                 4: ' Unmarried',
                 5: ' Wife'}

In [42]:
relation_dict

{0: ' Husband',
 1: ' Not-in-family',
 2: ' Other-relative',
 3: ' Own-child',
 4: ' Unmarried',
 5: ' Wife'}

# With One hot Encoding of categorical features

In [43]:
adult_df_rev.columns

Index(['age', 'workclass_cat', 'fnlwgt', 'education_cat', 'education_num',
       'marital_cat', 'occupation_cat', 'relationship_cat', 'race_cat',
       'sex_cat', 'capital_gain', 'capital_loss', 'hours_per_week',
       'native_country_cat', 'income'],
      dtype='object')

In [50]:
non_ordinal_cols = ['workclass_cat','marital_cat','occupation_cat','relationship_cat','race_cat','sex_cat','native_country_cat']

In [51]:
ohe_df = pd.get_dummies(adult_df_rev, columns=non_ordinal_cols)

In [53]:
ohe_df.shape

(32561, 94)

In [54]:
adult_df_rev.shape

(32561, 15)

In [55]:
ohe_df.head()

Unnamed: 0,age,fnlwgt,education_cat,education_num,capital_gain,capital_loss,hours_per_week,income,workclass_cat_0,workclass_cat_1,...,native_country_cat_32,native_country_cat_33,native_country_cat_34,native_country_cat_35,native_country_cat_36,native_country_cat_37,native_country_cat_38,native_country_cat_39,native_country_cat_40,native_country_cat_41
0,39,77516,9,13,2174,0,40,<=50K,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,9,13,0,0,13,<=50K,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,11,9,0,0,40,<=50K,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,1,7,0,0,40,<=50K,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,9,13,0,0,40,<=50K,0,0,...,0,0,0,0,0,0,0,0,0,0
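The growth from 15 to 94 columns follows from the number of distinct codes in each encoded column: the seven columns are replaced by 9 + 7 + 15 + 6 + 5 + 2 + 42 = 86 indicator columns, and the remaining 8 columns are left untouched. The sketch below shows the same mechanism on a made-up three-row frame (toy and encoded are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 50, 28],
    "race_cat": [4, 4, 2],
    "sex_cat": [1, 1, 0],
})

# Each listed column is replaced by one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["race_cat", "sex_cat"])

print(encoded.columns.tolist())
print(encoded.shape)
```

The original race_cat and sex_cat columns no longer exist in the result, which is why they do not need to be dropped afterwards.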


In [None]:
# Curse of Dimensionality
# PCA

In [57]:
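# get_dummies() has already replaced these columns with indicator columns, so dropping them again raises a KeyError
ohe_df.drop(non_ordinal_cols, axis = 1)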

KeyError: "['workclass_cat' 'marital_cat' 'occupation_cat' 'relationship_cat'\n 'race_cat' 'sex_cat' 'native_country_cat'] not found in axis"

In [65]:
X = ohe_df.drop('income',axis=1).values

In [61]:
y = ohe_df['income'].values

array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype=object)

In [79]:
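# Use y, the target extracted from ohe_df above. No random_state is set, so the split and the accuracy vary between runs.
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.3)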
clf = GaussianNB()
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)

accuracy_score(Y_test, Y_pred, normalize = True)

0.7934281912171154

In [None]:
X_train, X_test, Y_train, Y_test

In [80]:
from sklearn.ensemble import RandomForestClassifier

In [81]:
from sklearn.datasets import make_regression

In [82]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

In [83]:
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,random_state=0)

In [84]:
clf.fit(X_train,Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [85]:
Y_pred = clf.predict(X_test)

In [86]:
accuracy_score(Y_test, Y_pred, normalize = True)

0.8103183539768656
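A single train/test split gives one accuracy figure that depends on how the rows happened to be divided. The cross_val_score function imported above averages over several splits instead. The sketch below is an illustration on synthetic data from make_blobs, not on the census data, so its scores say nothing about the income model; X_demo, y_demo and tree are hypothetical names.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only to demonstrate the API
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=0)

tree = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)

# Five-fold cross-validation: five fits, each scored on a different held-out fold
scores = cross_val_score(tree, X_demo, y_demo, cv=5)

print(scores)
print(scores.mean())
```

Applied to the notebook's X and Y arrays, the same call would give a mean accuracy that is less sensitive to one particular split.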