# Income prediction using Decision Tree #

A decision tree converts a set of features and observations into a set of rules that produce a prediction.

It produces a tree in which each internal node splits the observations on one feature, with the objective of creating homogeneous subsets in the resulting branches.


The impurity of a split is commonly measured in two ways:
- GINI Index 
- Entropy

The splitting variable is the one that maximizes information gain, which is the difference between the impurity before the split and the size-weighted impurity of the subsets after the split.
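As a minimal sketch of these quantities, the snippet below computes the Gini index, entropy, and information gain for a toy split. The function names (`gini`, `entropy`, `information_gain`) and the toy labels are illustrative assumptions, not part of the lab code.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Toy labels, made up for illustration: 1 stands for ">50K", 0 for "<=50K".
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0
print(information_gain(parent, [left, right]))
```

A split that separates the two classes perfectly would recover the full parent impurity as gain, while the partial split above recovers only a fraction of it.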

### Dataset Source

#### Source:

*http://archive.ics.uci.edu/ml/datasets/Adult*

#### Donor:

Ronny Kohavi and Barry Becker
<br>Data Mining and Visualization
<br>Silicon Graphics.
<br>e-mail: ronnyk@live.com for questions.

### Dataset(s) Description
- Please follow the link above and read the feature descriptions in detail.
- They give a detailed picture of all the features available in the dataset.
- The data folder at that link contains two files, one for training and one for testing.
- For this lab activity we therefore do not need to split the data into train and test sets. Instead, we load and process the two files separately: **`adult_train.csv`** and **`adult_test.csv`**.

#### Import required libraries

In [2]:
# Import required libraries

import pandas as pd
import numpy as np

#### Read both train & test datasets from the given URL


In [3]:
# read both train & test datasets from http://archive.ics.uci.edu/ml/datasets/Adult

df_train = pd.read_csv('adult_train.csv', header=None, na_values='?')
df_test = pd.read_csv('adult_test.csv', header=None, na_values='?')
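The cell above reads CSV files that have already been prepared. As a hedged sketch, assuming you start instead from raw files in which every comma is followed by a space, `skipinitialspace=True` lets `na_values='?'` match the missing-value marker. The rows below are a made-up in-memory sample, not real records.

```python
import io

import pandas as pd

# Made-up rows imitating a raw layout: ", " separators and "?" for missing values.
raw_sample = (
    "39, State-gov, 77516, <=50K\n"
    "54, ?, 140359, >50K\n"
)

# skipinitialspace drops the blank after each comma, so "?" is recognised as NA
# and the string fields carry no leading whitespace.
clean = pd.read_csv(
    io.StringIO(raw_sample),
    header=None,
    na_values="?",
    skipinitialspace=True,
)

print(clean)
print(clean[1].isna().sum())
```

If your copy of the test file starts with a non-data line, passing `skiprows=1` to `pd.read_csv` would skip it; check the file before relying on that.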

#### Look at a sample of the data to check for any abnormalities


In [4]:
# look at some sample of the data to check for any abnormalities

df_train.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


#### How many train and test records are given?


In [57]:
print(df_train.shape)
print(df_test.shape)

(32561, 15)
(16281, 15)


#### Update the columns with the column names available in data description

In [5]:
df_train.columns = ['age','workclass','fnlwgt','education','education_num','marital_status',
                    'occupation','relationship','race','sex','capital_gain','capital_loss',
                    'hours_per_week','native_country','target']

df_test.columns = df_train.columns

#### Check for missing values

In [6]:
df_train.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
target               0
dtype: int64

#### Rename the target labels 

Replace **`<=50K`** with **`0`** and **`>50K`** with **`1`**. Note that the labels in the test file end with a period (`<=50K.`), which the test-set comparison has to match.

In [7]:
df_train.target = np.where(df_train['target']== '<=50K', 0, 1)
df_test.target =  np.where(df_test['target']== '<=50K.', 0, 1)
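Note that `np.where` assigns `1` to every value that is not an exact match for the `<=50K` string, including any unexpected or misspelled label. A minimal sketch of a stricter alternative is shown below; `label_map` and `encode_target` are hypothetical names introduced here, and the toy labels are made up.

```python
import pandas as pd

# Hypothetical mapping from cleaned label text to the numeric target.
label_map = {"<=50K": 0, ">50K": 1}


def encode_target(labels: pd.Series) -> pd.Series:
    """Strip whitespace and a trailing period, then map labels to 0/1.

    Raises ValueError instead of silently coding unknown labels as 1.
    """
    cleaned = labels.str.strip().str.rstrip(".")
    encoded = cleaned.map(label_map)
    if encoded.isna().any():
        unexpected = cleaned[encoded.isna()].unique().tolist()
        raise ValueError(f"Unexpected target labels: {unexpected}")
    return encoded.astype(int)


toy = pd.Series(["<=50K", ">50K.", " <=50K."])
print(encode_target(toy).tolist())  # [0, 1, 0]
```

The same function can be applied to both the train and the test labels, since the trailing period is removed before mapping.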

#### Check the frequency distribution of the target variable

In [8]:
df_train.target.value_counts(normalize=True)

0    0.75919
1    0.24081
Name: target, dtype: float64

In [9]:
df_test.target.value_counts(normalize=True)

0    0.763774
1    0.236226
Name: target, dtype: float64

#### Separate Independent and Dependent attributes.

After this pre-processing we are ready to start building the model.

For this lab activity we are going to use all the features.

Remember that the categorical features need to be dummy-encoded (one-hot encoded).

First we need to prepare our train and test data.

You may ask: we already have two datasets, one for train and one for test, so what else is there to do?

Many scikit-learn models expect the independent features and the dependent feature (the target column) to be passed separately.

In [10]:
X_train = df_train.drop(['target'], axis = 1)
y_train = df_train["target"]

In [11]:
X_test = df_test.drop(['target'], axis = 1)
y_test = df_test["target"]

###  Data preparation

Once the independent and dependent features are separated, the next step is to
prepare the data that will be given to the model for training and prediction.

As discussed before, we need to dummy-encode the categorical features.
This produces many new features (normally one per category), which we combine
with the numerical features to build the final data matrix used to train the model.

1. Identify categorical and numerical features.
2. Impute missing values where required (in this dataset only categorical features have any).
3. Dummy-encode the categorical features.
4. Combine the numerical features and the dummy-encoded categorical features into the final dataset.

Note: For this lab activity we use all the features. When you practice this experiment, you are welcome to try different sets of features to see how they affect performance.
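The four steps can also be expressed as a single scikit-learn pipeline. The sketch below is an alternative formulation, not the approach used in this lab, and it runs on a small made-up dataframe. The names `toy`, `toy_target`, `preprocess`, and `model` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up rows with one missing numerical and one missing categorical value.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "hours_per_week": [40, 13, 40, np.nan, 40, 40],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", np.nan, "Private", "Private"],
})
toy_target = [0, 0, 0, 1, 1, 1]

# Step 1: identify numerical and categorical columns.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = [col for col in toy.columns if col not in num_cols]

# Steps 2 and 3: impute, then one-hot encode the categorical columns.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Step 4: combine both groups of columns into one feature matrix.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", categorical, cat_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
model.fit(toy, toy_target)
print(model.predict(toy))
```

Because the encoder is fitted only on the training data inside the pipeline, the same fitted `model` can be applied to a test dataframe whose categories differ from those seen in training.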

#### 1. Identify categorical and numerical features.

In [12]:
cat_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
num_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]
print("Categorical Columns:")
print(cat_cols)
print("Numerical Columns:")
print(num_cols)

Categorical Columns:
['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
Numerical Columns:
['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']


#### Check for missing values

In [13]:
X_train[num_cols].isnull().sum()

age               0
fnlwgt            0
education_num     0
capital_gain      0
capital_loss      0
hours_per_week    0
dtype: int64

In [14]:
X_test[cat_cols].isnull().sum()

workclass         963
education           0
marital_status      0
occupation        966
relationship        0
race                0
sex                 0
native_country    274
dtype: int64

In [15]:
X_train[cat_cols].isnull().sum()

workclass         1836
education            0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
native_country     583
dtype: int64

In [16]:
set(pd.get_dummies(X_train).columns) - set(pd.get_dummies(X_test).columns)

{'native_country_Holand-Netherlands'}
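The set difference above shows a dummy column that exists only in the training data. If you encode with `pd.get_dummies`, one common way to line up the two matrices is to reindex the test columns against the training columns. The sketch below uses made-up toy frames; `train_dummies` and `test_dummies` are illustrative names.

```python
import pandas as pd

# Toy frames: the test set lacks one category that appears in the training set.
train = pd.DataFrame({"native_country": ["United-States", "Cuba", "Holand-Netherlands"]})
test = pd.DataFrame({"native_country": ["Cuba", "United-States"]})

train_dummies = pd.get_dummies(train)

# Reindex against the training columns: missing columns are added and filled
# with 0, and any test-only columns would be dropped.
test_dummies = pd.get_dummies(test).reindex(columns=train_dummies.columns, fill_value=0)

print(set(train_dummies.columns) - set(test_dummies.columns))  # set()
print(test_dummies.columns.tolist())
```

After reindexing, both frames have the same columns in the same order, which a fitted model requires.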

#### Impute missing categorical values

In [17]:
# Function to impute the missing values.
def fn_impute_missing_values(df):
    # Assign the filled columns back instead of calling fillna(inplace=True)
    # on a column selection, which newer pandas versions warn about.
    # Fill 'workclass' and 'occupation' columns with the 'unknown' value.
    df['workclass'] = df['workclass'].fillna("unknown")
    df['occupation'] = df['occupation'].fillna("unknown")
    # Fill 'native_country' column with the 'United-States' value.
    df['native_country'] = df['native_country'].fillna("United-States")
    # Return the imputed dataframe.
    return df

In [18]:
X_train = fn_impute_missing_values(X_train)
X_test = fn_impute_missing_values(X_test)

In [19]:
X_test.index

RangeIndex(start=0, stop=16281, step=1)

#### One-hot encoding

In [20]:
from sklearn.preprocessing import OneHotEncoder

In [21]:
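# Note: `sparse=False` and `get_feature_names()` are the older scikit-learn spellings.
# From scikit-learn 1.2 onward, use `sparse_output=False` and `get_feature_names_out()`.
# With get_feature_names() the generated columns look like 'x7_Holand-Netherlands',
# where x<i> is the position of the source column in cat_cols.
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

X_train_cat = pd.DataFrame(ohe.fit_transform(X_train[cat_cols]), columns = ohe.get_feature_names())
X_test_cat = pd.DataFrame(ohe.transform(X_test[cat_cols]), columns = ohe.get_feature_names())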

In [22]:
X_train = X_train[num_cols].join(X_train_cat)

In [23]:
X_test = X_test[num_cols].join(X_test_cat)

In [24]:
df_test.shape

(16281, 15)

In [25]:
X_test.shape

(16281, 107)

In [26]:
X_test['x7_Holand-Netherlands'].value_counts()

0.0    16281
Name: x7_Holand-Netherlands, dtype: int64

#### Build a Decision Tree Model

In [52]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()

In [53]:
dt.fit(X_train, y_train)

DecisionTreeClassifier()

In [54]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

In [55]:
from sklearn.metrics import classification_report

print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     24720
           1       1.00      1.00      1.00      7841

    accuracy                           1.00     32561
   macro avg       1.00      1.00      1.00     32561
weighted avg       1.00      1.00      1.00     32561



In [56]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.88      0.87      0.88     12435
           1       0.60      0.62      0.61      3846

    accuracy                           0.81     16281
   macro avg       0.74      0.75      0.74     16281
weighted avg       0.81      0.81      0.81     16281
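The training report shows perfect scores while test accuracy is 0.81, which is the pattern you would expect when an unrestricted tree memorizes its training set. As a hedged sketch of one way to probe this, the snippet below compares an unrestricted tree with a depth-limited one on synthetic stand-in data (not the Adult dataset). The value `max_depth=5` is an arbitrary assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y), for illustration only.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# An unrestricted tree keeps splitting until every training leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting the depth trades training fit for a simpler set of rules.
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", full_tree), ("max_depth=5", shallow_tree)]:
    print(
        name,
        "depth:", model.get_depth(),
        "train acc:", round(model.score(X_tr, y_tr), 3),
        "test acc:", round(model.score(X_te, y_te), 3),
    )
```

On the lab data, the same comparison can be made by passing `max_depth` (or `min_samples_leaf`) to `DecisionTreeClassifier` and re-running the two classification reports.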

