# Income prediction

Recall that we have done a homework of data exploration on 'income.csv' to master the knowledge of Exploratory Data Analysis. In this homework, you are required to predict whether a person's income is high or low according to his relevant information including his age, education, occupation, race and so on. 


The attribute information is:

- **income**: the label of this dataset, belongs to \[high, low\] 
- **age**: the age of a person, a continuous variable.
- **work_class**: work class, belongs to \[Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked\].
- **education**: belongs to \[Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, - Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool\].
- **education_degree**: the education level of a person, an ordinal number variable.
- **marital_status**: marital status, belongs to \[Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse\]. 
- **job**: occupation, belongs to \[Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces\]. 
- **relationship**: belongs to \[Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried\].
- **race**: belongs to \[White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black\]. 
- **sex**: belongs to \[Female, Male\]. 
- **capital_gain**: capital gain, a continuous variable. 
- **capital_loss**: capital loss, a continuous variable. 
- **hours_per_week**: how long a person works every week, a continuous variable. 
- **birthplace**: belongs to \[United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, - Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands\].

Specifically, you are required to **fill the blanks of this notebook** based on your results. In this assignment, you will analyze how different features, models and hyper-parameters influence the performance.

## 1. Load Data

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

%matplotlib inline
# %config InlineBackend.figure_format = 'svg'


In [3]:
df = pd.read_csv('income.csv')

## 2. Exploratory Data Analysis

### Take a brief look at the data using `head()`

In [4]:
df.head()

Unnamed: 0,age,work_class,education,education_degree,marital_status,job,relationship,race,sex,capital_gain,capital_loss,hours_per_week,birthplace,income
0,90,,HS-grad,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,low
1,82,Private,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,low
2,66,,Some-college,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,low
3,54,Private,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,low
4,41,Private,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,low


### Observe the basic statistical information of continuous attributes

In [5]:
df.describe() # only describe the continuous variables

Unnamed: 0,age,education_degree,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,10.080679,1077.648844,87.30383,40.437456
std,13.640433,2.57272,7385.292085,402.960219,12.347429
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,48.0,12.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


### Count the NaN values

In [6]:
df.isnull().sum()  ### before

age                    0
work_class          1836
education              0
education_degree       0
marital_status         0
job                 1843
relationship           0
race                   0
sex                    0
capital_gain           0
capital_loss           0
hours_per_week         0
birthplace           583
income                 0
dtype: int64

### Remove NaN values due to small proportion to the whole dataset

In [7]:
df = df.dropna()

In [8]:
df.isnull().sum()  ### after

age                 0
work_class          0
education           0
education_degree    0
marital_status      0
job                 0
relationship        0
race                0
sex                 0
capital_gain        0
capital_loss        0
hours_per_week      0
birthplace          0
income              0
dtype: int64

### Pick out categorical and continuous variables

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 1 to 32560
Data columns (total 14 columns):
age                 30162 non-null int64
work_class          30162 non-null object
education           30162 non-null object
education_degree    30162 non-null int64
marital_status      30162 non-null object
job                 30162 non-null object
relationship        30162 non-null object
race                30162 non-null object
sex                 30162 non-null object
capital_gain        30162 non-null int64
capital_loss        30162 non-null int64
hours_per_week      30162 non-null int64
birthplace          30162 non-null object
income              30162 non-null object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB


### Observe categorical attributes

In [9]:
for col in df.select_dtypes([np.object]).columns:
    print('{}: {}\n'.format(col, df[col].unique()))

work_class: ['Private' 'State-gov' 'Federal-gov' 'Self-emp-not-inc' 'Self-emp-inc'
 'Local-gov' 'Without-pay']

education: ['HS-grad' '7th-8th' 'Some-college' '10th' 'Doctorate' 'Prof-school'
 'Bachelors' 'Masters' '11th' 'Assoc-voc' '1st-4th' '5th-6th' 'Assoc-acdm'
 '12th' '9th' 'Preschool']

marital_status: ['Widowed' 'Divorced' 'Separated' 'Never-married' 'Married-civ-spouse'
 'Married-spouse-absent' 'Married-AF-spouse']

job: ['Exec-managerial' 'Machine-op-inspct' 'Prof-specialty' 'Other-service'
 'Adm-clerical' 'Transport-moving' 'Sales' 'Craft-repair'
 'Farming-fishing' 'Tech-support' 'Protective-serv' 'Handlers-cleaners'
 'Armed-Forces' 'Priv-house-serv']

relationship: ['Not-in-family' 'Unmarried' 'Own-child' 'Other-relative' 'Husband' 'Wife']

race: ['White' 'Black' 'Asian-Pac-Islander' 'Other' 'Amer-Indian-Eskimo']

sex: ['Female' 'Male']

birthplace: ['United-States' 'Mexico' 'Greece' 'Vietnam' 'China' 'Taiwan' 'India'
 'Philippines' 'Trinadad&Tobago' 'Canada' 'South' 'Holan

### Merge values of similar semantics

In [10]:
df.education.replace({
    'Preschool': 'dropout',
    '10th': 'dropout',
    '11th': 'dropout',
    '12th': 'dropout',
    '1st-4th': 'dropout',
    '5th-6th': 'dropout',
    '7th-8th': 'dropout',
    '9th': 'dropout',
    'HS-Grad': 'HighGrad',
    'HS-grad': 'HighGrad',
    'Some-colloge': 'CommunityCollege',
    'Assoc-acdm': 'CommunityCollege',
    'Assoc-voc': 'CommunityCollege',
    'Prof-school': 'Masters',
}, inplace=True)

## 3. Classification Models

In [17]:
from sklearn.model_selection import train_test_split
from sklearn import metrics


# tentatively take 3 numerical attributes for convenience
X = df[['education_degree', 'age', 'hours_per_week']].values
Y = df[['income']].values

# train, test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=88, stratify=Y)

### KNN

In [12]:
## Example: Use KNN to predict income
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)

# change the shape of Y_train to (n_samples, ) using `.ravel()`
knn.fit(X_train, Y_train.ravel())

knn_pred = knn.predict(X_test)

print('The accuracy of the KNN is', metrics.accuracy_score(knn_pred, Y_test))

The accuracy of the KNN is 0.7892584816001769


### Hyper-parameter tuning with `GridSearchCV()`

In [13]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': np.arange(30, 70)}

knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=5)

# change the shape of Y_train to (n_samples, ) using `.ravel()`
knn_cv.fit(X_train, Y_train.ravel())

print(knn_cv.best_params_)
print(knn_cv.best_score_)

{'n_neighbors': 66}
0.7899872116705348


### Your Tasks

As far as you can see, we have built a KNN classification model and select the best hyper-parameters with `GridSearchCV()`. In this task, you are asked to build your own models using `scikit-learn` APIs.

**Question 1 [10pts]**. Build a `Logistic Regression` model on training data and calculate accuracy over testing data.

**Question 2 [10pts]**. Build a `Decision Tree` model on training data and calculate accuracy over testing data. 

**Question 3 [20pts]**. Use graphviz to visualize the decision tree of Question 2, and use a proper tool to visualize the decision boundary of the decision tree.

**Question 4 [10pts]**. Build a `Random Forest` model with your customized parameters on training data and calculate accuracy over testing data.

**Question 5 [20pts]**. For `Random Forest`, use `GridSearchCV()` to find the **optimal** hyper-parameter combination over: 
    - `n_estimator`: the number of trees in the forest
    - `max_depth`: the maximum depth of the tree
    - `max_leaf_nodes`: grow trees with ``max_leaf_nodes`` in best-first fashion.
    
   You should specify your own sets of values for these hyper-parameters. What's more, you are required to print the importance of each features of the dataset.
   
   (*tip: using the `feature_importances_` attributes of the `RandomForestClassifier()` as we have learned in class*)

**Question 6 [10pts]**. Build a `AdaBoost` model on training data and calculate accuracy over testing data.

In [14]:
# Question 1: Build a `Logistic Regression` model on training data and calculate accuracy over testing data.

from sklearn.linear_model import LogisticRegression

Lr = LogisticRegression(solver='liblinear')

# change the shape of Y_train to (n_samples, ) using `.ravel()`
Lr.fit(X_train, Y_train.ravel())

Lr_pred = Lr.predict(X_test)

# print the accuracy (we can also use different kinds of solver to find the optimal one for this task)
print('The accuracy of the Logistic Regression is', metrics.accuracy_score(Lr_pred, Y_test))


The accuracy of the Logistic Regression is 0.7874903304232512


In [35]:
# Question 2: Build a `Decision Tree` model on training data and calculate accuracy over testing data.

from sklearn import tree
 
Tree = tree.DecisionTreeClassifier(criterion='gini')
 
# train the model on the reaining set
Tree.fit(X_train,Y_train.ravel())

# use the model to predict the values on test set
Tree_pred = Tree.predict(X_test)

# print the accuracy (we can also use different kinds criterion for this task - 'gini' & ''entropy)
print('The accuracy of the Decision Tree is', metrics.accuracy_score(Tree_pred, Y_test))

The accuracy of the Decision Tree is 0.7638413084318709


In [59]:
# Question 3: Use graphviz to visualize the decision tree of Question 2, and use a proper tool to visualize the decision boundary of the decision tree.

# !pip install graphviz
# !pip install IPython
# !pip install pydotplus

import graphviz
from IPython.display import Image  
from sklearn import tree
import pydotplus 


# There are two versions, I because of environment problems, I cannot visualize it, so I keep 2 versions

# versoin 1
tree.export_graphviz(Tree)

# versoin 2
# dot_data = tree.export_graphviz(Tree, out_file=None,  #Tree is the classifier in Question #2
#                          feature_names=df.income,   #name of corresponding features
#                          class_names=df.capital_gain  #name of corresponding classes
#                          filled=True, rounded=True,  
#                          special_characters=True)  

# # defining the graph (maybe there are some problem with environment, and I have changed the 
# graph = pydotplus.graph_from_dot_data(dot_data)

# graph.write_png('example.png')    #save the image

# Image(graph.create_png()) 

'digraph Tree {\nnode [shape=box] ;\n0 [label="X[0] <= 12.5\\ngini = 0.374\\nsamples = 21113\\nvalue = [5256, 15857]"] ;\n1 [label="X[1] <= 33.5\\ngini = 0.278\\nsamples = 15817\\nvalue = [2643, 13174]"] ;\n0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;\n2 [label="X[1] <= 27.5\\ngini = 0.12\\nsamples = 6745\\nvalue = [431, 6314]"] ;\n1 -> 2 ;\n3 [label="X[1] <= 23.5\\ngini = 0.048\\nsamples = 4242\\nvalue = [105, 4137]"] ;\n2 -> 3 ;\n4 [label="X[2] <= 54.5\\ngini = 0.009\\nsamples = 2665\\nvalue = [12, 2653]"] ;\n3 -> 4 ;\n5 [label="X[1] <= 21.5\\ngini = 0.006\\nsamples = 2585\\nvalue = [8, 2577]"] ;\n4 -> 5 ;\n6 [label="X[1] <= 20.5\\ngini = 0.001\\nsamples = 1779\\nvalue = [1, 1778]"] ;\n5 -> 6 ;\n7 [label="gini = 0.0\\nsamples = 1377\\nvalue = [0, 1377]"] ;\n6 -> 7 ;\n8 [label="X[0] <= 9.5\\ngini = 0.005\\nsamples = 402\\nvalue = [1, 401]"] ;\n6 -> 8 ;\n9 [label="X[2] <= 39.5\\ngini = 0.011\\nsamples = 177\\nvalue = [1, 176]"] ;\n8 -> 9 ;\n10 [label="gini = 0.0\\nsamp

In [28]:
# Question 4: Build a `Random Forest` model with your customized parameters on training data and calculate accuracy over testing data.

from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                            max_depth=None, max_features='auto', max_leaf_nodes=None,
                            min_samples_leaf=1, min_samples_split=2,
                            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
                            oob_score=False, random_state=None, verbose=0,
                            warm_start=False)

# training the model
RF.fit(X_train, Y_train.ravel())
# making predictions
RF_pred = RF.predict(X_test)

# print the accuracy (we can also use different combinition of parameters: criterion<gini...> & min_samples_leaf...)
print('The accuracy of the Random Forest is', metrics.accuracy_score(RF_pred, Y_test))

The accuracy of the Random Forest is 0.7737871588020776


In [52]:
# Question 5: Hyper-parameter serach over Random Forest and print feature importance list.

# Search round 1

# Below are the initial round of training
# We need to change the range of parameters step by step, to find the optimal ones
# just like binary search, we need to narrow down the range gradually
from sklearn.model_selection import GridSearchCV
param_set = {
    'n_estimators': range(90, 110, 5),
    'max_depth': range(10,21,3),
    'max_leaf_nodes': range(45,55,5),
} 

# Gsearch = GridSearchCV( RF, param_grid = param_set, scoring='roc_auc', cv=5 )
RF = RandomForestClassifier()
Gsearch = GridSearchCV( RF, param_grid = param_set, cv=5 )
Gsearch.fit(X_train, Y_train.ravel())

# Gsearch.grid_scores_, gsearch.best_params_, gsearch.best_score_


def print_best_score(gsearch,param_set):
     # print best score
    print("Best score: %0.3f" % gsearch.best_score_)
    print("Best parameters set:")
    # print the parameters
    best_parameters = gsearch.best_estimator_.get_params()
    for param_name in sorted(param_set.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
print_best_score(Gsearch,param_set)

# Output log
# Best score: 0.795
# Best parameters set:
# 	max_depth: 16
# 	max_leaf_nodes: 45
# 	n_estimators: 105

Best score: 0.795
Best parameters set:
	max_depth: 16
	max_leaf_nodes: 45
	n_estimators: 105


In [53]:
# Another round of training
# I just did 2 rounds of searching, beacuase of my limited computing resources
# each round of training takes about 50 mins on my PC

# the method & strategy is clear, so the rest is not difficult, I will not further carry them out, due to the poor computing capacity

from sklearn.model_selection import GridSearchCV
param_set = {
    'n_estimators': range(100, 111, 1),
    'max_depth': range(13,22,1),
    'max_leaf_nodes': range(35,45,2),
} 

# Gsearch = GridSearchCV( RF, param_grid = param_set, scoring='roc_auc', cv=5 )
RF = RandomForestClassifier()
Gsearch = GridSearchCV( RF, param_grid = param_set, cv=5 )
Gsearch.fit(X_train, Y_train.ravel())

# Gsearch.grid_scores_, gsearch.best_params_, gsearch.best_score_


def print_best_score(gsearch,param_set):
     # best score has improved 0.1%, compared with round #
    print("Best score: %0.3f" % gsearch.best_score_)
    print("Best parameters set:")
    # parameters
    best_parameters = gsearch.best_estimator_.get_params()
    for param_name in sorted(param_set.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
print_best_score(Gsearch,param_set)

# each time, we can check, if the output parameters are on the border of the range, if so, we need to expand the range in this direction

# Best score: 0.796
# Best parameters set:
# 	max_depth: 19
# 	max_leaf_nodes: 35
# 	n_estimators: 105


Best score: 0.796
Best parameters set:
	max_depth: 19
	max_leaf_nodes: 35
	n_estimators: 105


In [29]:
# Question 6: Build a `AdaBoost` model on training data and calculate accuracy over testing data.

from sklearn.ensemble import AdaBoostClassifier

# here the hyper-parameter is n_estimator, ew may as well take 100
Ada = AdaBoostClassifier(n_estimators=100, random_state=0)

Ada.fit(X_train, Y_train.ravel())

Ada_pred = Ada.predict(X_test)

# to improve performance, we can use loop to find the optimal parameter
print('The accuracy of the Ada Boost is', metrics.accuracy_score(Ada_pred, Y_test))

The accuracy of the Ada Boost is 0.7975466902420157


## 4. Feature Engineering

Before you start this part, we recommend you to read this [article](https://www.cnblogs.com/jasonfreak/p/5448385.html)

### Using `LabelEncoder()`: map categorical features to [0, C)

In [9]:
from sklearn.preprocessing import LabelEncoder

encoded_df = df.apply(LabelEncoder().fit_transform)
encoded_df.head()

Unnamed: 0,age,work_class,education,education_degree,marital_status,job,relationship,race,sex,capital_gain,capital_loss,hours_per_week,birthplace,income
1,65,2,11,8,6,3,1,4,0,0,89,17,38,1
3,37,2,5,3,0,6,4,4,0,0,88,39,38,1
4,24,2,15,9,5,9,3,4,0,0,88,39,38,1
5,17,2,11,8,0,7,4,4,0,0,87,44,38,1
6,21,2,0,5,5,0,4,4,1,0,87,39,38,1


### Using `pandas.get_dummies()`: map categorical features into one-hot encoding

In [10]:
cols = list(set(df.select_dtypes([np.object]).columns) - set(['income']))

onehot_df = pd.get_dummies(df, columns=cols)
onehot_df.head()

Unnamed: 0,age,education_degree,capital_gain,capital_loss,hours_per_week,income,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,...,birthplace_Thailand,birthplace_Trinadad&Tobago,birthplace_United-States,birthplace_Vietnam,birthplace_Yugoslavia,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White
1,82,9,0,4356,18,low,0,1,0,0,...,0,0,1,0,0,0,0,0,0,1
3,54,4,0,3900,40,low,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
4,41,10,0,3900,40,low,0,0,0,1,...,0,0,1,0,0,0,0,0,0,1
5,34,9,0,3770,45,low,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
6,38,6,0,3770,40,low,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1


The aforementioned machine learning models are built upon **3 distinct attributes** (`education_degree`, `age` and `hours_per_week`) with **10 more attributes unused**. You are required to utilize those unused columns using the feature engineering methods introduced above to address this issue.]

**Question 7  [20pts]**. Compare the performance (accuracy) of different algorithms and different preprocessing methods on the dataset. Specifically, please fill the blanks in the table below:

|         Alg.        | Original 3 columns | All columns with `LabelEncoder` | All columns with `OneHot` |
|        :---:        |        :----:      |             :----:              |           :----:          |
| Logistic Regression |     &#xfeff;       |            &#xfeff;             |         &#xfeff;          |
| Decision Tree      |     &#xfeff;       |            &#xfeff;             |         &#xfeff;          |
| Random Forest       |     &#xfeff;       |            &#xfeff;             |         &#xfeff;          |
| AdaBoost      |     &#xfeff;       |            &#xfeff;             |         &#xfeff;          |

In [22]:
# Question 7: Compare the performance (accuracy) of different algorithms and different preprocessing methods on the dataset

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
import warnings

warnings.filterwarnings('ignore')

encoded_feat = encoded_df.drop(columns=['income']).values
encoded_labl = encoded_df[['income']].values
encoded_X_train, encoded_X_test, encoded_Y_train, encoded_Y_test = train_test_split(encoded_feat, encoded_labl, test_size=0.3)

onehot_feat = onehot_df.drop(columns=['income']).values
onehot_labl = onehot_df[['income']].values
onehot_X_train, onehot_X_test, onehot_Y_train, onehot_Y_test = train_test_split(onehot_feat, onehot_labl, test_size=0.3)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)


LR = LogisticRegression()
Tree = tree.DecisionTreeClassifier(criterion='gini')
RF = RandomForestClassifier()
Ada = AdaBoostClassifier()

models = [LR, Tree, RF, Ada]
model_names = ['LR', 'Tree', 'RF', 'Ada']
dataset = [[encoded_X_train, encoded_X_test, encoded_Y_train, encoded_Y_test], [onehot_X_train, onehot_X_test, onehot_Y_train, onehot_Y_test], [X_train, X_test, Y_train, Y_test]]

for index in range(len(models)):
    model = models[index]
    name = model_names[index]
    for data in dataset:
        model.fit(data[0], data[2])
        model_pred = model.predict(data[1])
        print('The accuracy of {} is'.format(name), metrics.accuracy_score(model_pred, data[3]))
        
print("The 3 accuracy are encoded, onehot and original in order.\n. To avoid filling in the table, I print them above")

The accuracy of LR is 0.8142336169742513
The accuracy of LR is 0.8470549231959332
The accuracy of LR is 0.7859431981434413
The accuracy of Tree is 0.804066747706929
The accuracy of Tree is 0.8049508232953918
The accuracy of Tree is 0.762294176152061
The accuracy of RF is 0.8292629019781191
The accuracy of RF is 0.8356724499944745
The accuracy of RF is 0.7735661399049619
The accuracy of Ada is 0.8503702066526688
The accuracy of Ada is 0.8523593767267101
The accuracy of Ada is 0.795115482373743
