In [3]:
#necessary imports
import numpy as np
import pandas as pd

In [4]:
data = pd.read_csv("data.csv") #open the file containing the data 

## Data Preprocessing

This week we will be working with a dataset containing information about people's backgrounds; the target feature is income. The task is, given background information about a person, to predict whether their annual income is >50K USD or <=50K USD.

We will be using the popular pandas library to preprocess our data before building an ML model from it. You can go through this tutorial (http://pandas.pydata.org/pandas-docs/stable/10min.html) before going ahead with the lab work.


When viewing the data, we can see that most features are categorical. We need to perform some preprocessing to convert the categorical features into numerical ones.

In [5]:
data.head() #view first few rows of data

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,<=50K
2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,<=50K
3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,<=50K
4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K


## Data exploration 

When looking at the unique values of the columns in our dataset, we can see that there are some unknowns as well (denoted by "?"). We will keep them as they are, since they won't make a big difference for our example. Note also that the string values carry a leading space (for example ' ?' and ' Private').
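Before deciding to keep the unknowns, it can help to see how many there are. The following is a minimal sketch on a made-up dataframe (`toy` and its rows are illustrative assumptions, not the lab data); the same expression could be applied to the categorical columns of `data`.

```python
import pandas as pd

# Toy stand-in for the lab data; the values mimic the leading space
# seen in the real columns (e.g. ' ?').
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "occupation": [" Sales", " ?", " ?", " Adm-clerical"],
})

# Strip surrounding whitespace, then count the "?" placeholders per column.
unknown_counts = toy.apply(lambda col: (col.str.strip() == "?").sum())
print(unknown_counts)
```

Stripping first avoids missing the placeholder because of the leading space.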

In [6]:
data['age'].unique() #get the unique values from the 'age' column

array([39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 30, 23, 32, 40, 34, 25, 43,
       54, 35, 59, 56, 19, 20, 45, 22, 48, 21, 24, 57, 44, 41, 29, 18, 47,
       46, 36, 79, 27, 67, 33, 76, 17, 55, 61, 70, 64, 71, 68, 66, 51, 58,
       26, 60, 90, 75, 65, 77, 62, 63, 80, 72, 74, 69, 73, 81, 78, 88, 82,
       83, 84, 85, 86, 87], dtype=int64)

In [7]:
data['workclass'].unique() #get the unique values from the 'workclass' column

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [8]:
data['education'].unique() #get the unique values from the 'education' column

array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

In [9]:
data['marital-status'].unique() #get the unique values from the 'marital-status' column

array([' Never-married', ' Married-civ-spouse', ' Divorced',
       ' Married-spouse-absent', ' Separated', ' Married-AF-spouse',
       ' Widowed'], dtype=object)

In [10]:
data['occupation'].unique() #get the unique values from the 'occupation' column

array([' Adm-clerical', ' Exec-managerial', ' Handlers-cleaners',
       ' Prof-specialty', ' Other-service', ' Sales', ' Craft-repair',
       ' Transport-moving', ' Farming-fishing', ' Machine-op-inspct',
       ' Tech-support', ' ?', ' Protective-serv', ' Armed-Forces',
       ' Priv-house-serv'], dtype=object)

In [11]:
data['relationship'].unique() #get the unique values from the 'relationship' column

array([' Not-in-family', ' Husband', ' Wife', ' Own-child', ' Unmarried',
       ' Other-relative'], dtype=object)

In [12]:
data['race'].unique() #get the unique values from the 'race' column

array([' White', ' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo',
       ' Other'], dtype=object)

In [13]:
data['sex'].unique() #get the unique values from the 'sex' column

array([' Male', ' Female'], dtype=object)

In [14]:
data['hours-per-week'].unique() #get the unique values from the 'hours-per-week' column

array([40, 13, 16, 45, 50, 80, 30, 35, 60, 20, 52, 44, 15, 25, 38, 43, 55,
       48, 58, 32, 70,  2, 22, 56, 41, 28, 36, 24, 46, 42, 12, 65,  1, 10,
       34, 75, 98, 33, 54,  8,  6, 64, 19, 18, 72,  5,  9, 47, 37, 21, 26,
       14,  4, 59,  7, 99, 53, 39, 62, 57, 78, 90, 66, 11, 49, 84,  3, 17,
       68, 27, 85, 31, 51, 77, 63, 23, 87, 88, 73, 89, 97, 94, 29, 96, 67,
       82, 86, 91, 81, 76, 92, 61, 74, 95], dtype=int64)

In [15]:
data['native-country'].unique() #get the unique values from the 'native-country' column

array([' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico',
       ' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada',
       ' Germany', ' Iran', ' Philippines', ' Italy', ' Poland',
       ' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos',
       ' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic',
       ' El-Salvador', ' France', ' Guatemala', ' China', ' Japan',
       ' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland',
       ' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong',
       ' Ireland', ' Hungary', ' Holand-Netherlands'], dtype=object)

In [16]:
data['income'].unique() #get the unique values from the 'income' column

array([' <=50K', ' >50K'], dtype=object)

In [17]:
data.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,<=50K
2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,<=50K
3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,<=50K
4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K


## Categorical into numerical values

Here we perform a simple operation to convert the categorical features into numerical ones. For each value in a categorical column, we find its index in that column's list of unique values and use that index as the numerical replacement (stored in a new column of the dataframe, so the original column is kept). Below is an example.

| categorical value | numerical value |
|--------------------|-----------------|
| yes | 0 |
| yes | 0 |
| no | 1 |
| yes | 0 |
| no | 1 |
| no | 1 |

The list of unique values in the column "categorical value" is
```python
['yes', 'no']
```
so each value in the column "categorical value" is assigned a number in the column "numerical value", which is simply the index of that value in the list of unique values.

To perform the above described operation, lambda functions (anonymous functions) have been used. You can go through this tutorial (https://www.w3schools.com/python/python_lambda.asp) to learn how these functions work. 
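The yes/no example above can be reproduced in a few lines. This is a minimal sketch on a made-up dataframe (`toy` is an illustrative name); it also shows `pd.factorize`, which assigns codes in order of first appearance and so yields the same numbers here.

```python
import pandas as pd

# Toy column matching the example table
toy = pd.DataFrame({"categorical value": ["yes", "yes", "no", "yes", "no", "no"]})

# List of unique values, in order of first appearance
unique_values = list(toy["categorical value"].unique())

# Replace each value by its index in the list of unique values
toy["numerical value"] = toy["categorical value"].apply(
    lambda x: unique_values.index(x)
)

# Alternative: pd.factorize returns the integer codes and the unique values
codes, uniques = pd.factorize(toy["categorical value"])

print(toy)
print(list(codes))
```

Note that the resulting numbers impose an arbitrary order on the categories; they are identifiers, not magnitudes.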

In [29]:
#creating new columns for storing the numerical versions of the respective categorical features

workclass_unique = list(data['workclass'].unique())
education_unique = list(data['education'].unique())
marital_status_unique = list(data['marital-status'].unique())
occupation_unique = list(data['occupation'].unique())
relationship_unique = list(data['relationship'].unique())
race_unique = list(data['race'].unique())
sex_unique = list(data['sex'].unique())
native_country_unique = list(data['native-country'].unique())
#converting text/categorical value in numerical values
data['workclass_'] = data['workclass'].apply(lambda x: workclass_unique.index(x))
data['education_'] = data['education'].apply(lambda x: education_unique.index(x))
data['marital-status_'] = data['marital-status'].apply(lambda x: marital_status_unique.index(x))
data['occupation_'] = data['occupation'].apply(lambda x: occupation_unique.index(x))
data['relationship_'] = data['relationship'].apply(lambda x: relationship_unique.index(x))
data['race_'] = data['race'].apply(lambda x: race_unique.index(x))
data['sex_'] = data['sex'].apply(lambda x: sex_unique.index(x))
data['native-country_'] = data['native-country'].apply(lambda x: native_country_unique.index(x))
data.tail()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income,workclass_,education_,marital-status_,occupation_,relationship_,race_,sex_,native-country_
32556,27,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,38,United-States,<=50K,2,6,1,10,2,0,1,0
32557,40,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,40,United-States,>50K,2,1,1,9,1,0,0,0
32558,58,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,40,United-States,<=50K,2,1,6,0,4,0,1,0
32559,22,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,20,United-States,<=50K,2,1,0,0,3,0,0,0
32560,52,Self-emp-inc,HS-grad,Married-civ-spouse,Exec-managerial,Wife,White,Female,40,United-States,>50K,6,1,1,1,2,0,1,0


In [30]:
data.to_csv("data_preprocessed.csv") #save the preprocessed data to a file

## Data preparation 

Now we create two new objects containing only the required columns from the original dataframe. Basically, we want all the examples in X and all the labels in Y.

In [20]:
X = data[['age','workclass_','education_','marital-status_','occupation_','relationship_','race_','sex_','hours-per-week','native-country_']]
X.head()

Unnamed: 0,age,workclass_,education_,marital-status_,occupation_,relationship_,race_,sex_,hours-per-week,native-country_
0,39,0,0,0,0,0,0,0,40,0
1,50,1,0,1,1,1,0,0,13,0
2,38,2,1,2,2,0,0,0,40,0
3,53,2,2,1,2,1,1,0,40,0
4,28,2,0,1,3,2,1,1,40,1


In [21]:
unique_income = list(data['income'].unique())
Y = data['income'].apply(lambda x: unique_income.index(x))
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: income, dtype: int64

## Building the Decision Tree Classifier

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) #split the data in train and test (no validation set this time)

In [23]:
len(X_train)

26048

In [24]:
len(X_test)

6513
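The split above is random, so the exact rows in each set change from run to run. If a repeatable split is wanted, `train_test_split` accepts `random_state`, and `stratify` keeps the class proportions the same in both parts. A minimal sketch on made-up arrays (`X_toy` and `y_toy` are illustrative assumptions, not the lab data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and Y: 100 rows, 20% positive labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# random_state fixes the shuffle; stratify preserves the label ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(len(X_tr), len(X_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # fraction of positives in each part
```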

In [25]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion="entropy") #initialize the classifier
clf.fit(X_train,y_train) #build the tree

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [26]:
pred = clf.predict(X_test) #make predictions on the test set

In [27]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,pred) #print out the accuracy of the classifier on the test data

0.7710732381391064
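Accuracy is the fraction of test examples whose predicted label matches the true label. To judge a score like the one above, it is worth comparing it with a baseline that always predicts the most frequent class. A minimal sketch on made-up labels (`y_true` and `y_pred` here are illustrative, not the lab's results):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels: 5 examples of class 0 and 3 of class 1
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# accuracy_score is the mean of the element-wise matches
acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()

# Baseline: always predict the most frequent class in y_true
majority = np.bincount(y_true).argmax()
baseline = accuracy_score(y_true, np.full_like(y_true, majority))

print(acc, manual, baseline)
```

On the lab data, the same baseline could be computed from `y_test`; a classifier is only useful to the extent that it beats it.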