### adult census income dataset :

to predict and classify the income as >50K or <=50K using logistic regression, decision tree, and random forest

In [None]:
import pandas as pd
import numpy as np

In [None]:
dataset = pd.read_csv("adult.csv")
dataset.sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
30120,51,Self-emp-not-inc,32372,12th,8,Married-civ-spouse,Other-service,Husband,White,Male,0,0,99,United-States,<=50K
9882,33,Private,213226,HS-grad,9,Divorced,Sales,Not-in-family,White,Male,0,0,50,United-States,<=50K
18523,50,Private,143664,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,40,United-States,<=50K
15346,57,Private,231232,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
17195,41,Private,225892,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,48,United-States,>50K


### data preprocessing : cleaning, dropping columns, encoding

In [None]:
dataset.isnull().sum()

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
education.num,0
marital.status,0
occupation,0
relationship,0
race,0
sex,0


In [None]:
dataset.nunique()

Unnamed: 0,0
age,73
workclass,9
fnlwgt,21648
education,16
education.num,16
marital.status,7
occupation,15
relationship,6
race,5
sex,2


dropping columns assumed to have little or no predictive value for the model (fnlwgt, native.country, race)

In [None]:
df1 = dataset.drop(["fnlwgt","native.country","race"], axis="columns")

In [None]:
df1

Unnamed: 0,age,workclass,education,education.num,marital.status,occupation,relationship,sex,capital.gain,capital.loss,hours.per.week,income
0,90,?,HS-grad,9,Widowed,?,Not-in-family,Female,0,4356,40,<=50K
1,82,Private,HS-grad,9,Widowed,Exec-managerial,Not-in-family,Female,0,4356,18,<=50K
2,66,?,Some-college,10,Widowed,?,Unmarried,Female,0,4356,40,<=50K
3,54,Private,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,Female,0,3900,40,<=50K
4,41,Private,Some-college,10,Separated,Prof-specialty,Own-child,Female,0,3900,40,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,Some-college,10,Never-married,Protective-serv,Not-in-family,Male,0,0,40,<=50K
32557,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,Female,0,0,38,<=50K
32558,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,Male,0,0,40,>50K
32559,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,Female,0,0,40,<=50K


removing rows with rare categories or with missing values (encoded as "?" in this dataset, which is why isnull() reports 0)

In [None]:
df1["workclass"].value_counts()
# ?, Without-pay, Never-worked

Unnamed: 0_level_0,count
workclass,Unnamed: 1_level_1
Private,22696
Self-emp-not-inc,2541
Local-gov,2093
?,1836
State-gov,1298
Self-emp-inc,1116
Federal-gov,960
Without-pay,14
Never-worked,7


In [None]:
df1["education"].value_counts()
# Preschool

Unnamed: 0_level_0,count
education,Unnamed: 1_level_1
HS-grad,10501
Some-college,7291
Bachelors,5355
Masters,1723
Assoc-voc,1382
11th,1175
Assoc-acdm,1067
10th,933
7th-8th,646
Prof-school,576


In [None]:
df1["occupation"].value_counts()
# Armed-Forces, ?

Unnamed: 0_level_0,count
occupation,Unnamed: 1_level_1
Prof-specialty,4140
Craft-repair,4099
Exec-managerial,4066
Adm-clerical,3770
Sales,3650
Other-service,3295
Machine-op-inspct,2002
?,1843
Transport-moving,1597
Handlers-cleaners,1370


In [None]:
df1["marital.status"].value_counts()
# Married-AF-spouse

Unnamed: 0_level_0,count
marital.status,Unnamed: 1_level_1
Married-civ-spouse,14976
Never-married,10683
Divorced,4443
Separated,1025
Widowed,993
Married-spouse-absent,418
Married-AF-spouse,23


In [None]:
df2 = df1[~(df1['workclass'].isin(['?', 'Without-pay', 'Never-worked']) |
        df1['education'].isin(['Preschool']) |
        df1['occupation'].isin(['Armed-Forces', '?']) |
        df1['marital.status'].isin(['Married-AF-spouse'])
    )
]
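
The filter above removes the "?" rows by listing them next to the rare categories. As a minimal sketch of an alternative, the placeholder can first be turned into a real missing value so that pandas can count and drop it directly. The toy frame below is made up for illustration and is not the adult dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "?", "Adm-clerical"],
    "age": [39, 50, 38, 53],
})

# Turn the "?" placeholder into NaN so that isna() can detect it.
toy_nan = toy.replace("?", np.nan)

# Number of missing values per column.
missing_per_column = toy_nan.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
toy_clean = toy_nan.dropna()
print(toy_clean)
```

With the real file, the same effect could probably be had at load time through the `na_values` argument of `pd.read_csv`, assuming "?" is the only placeholder in use.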

In [None]:
df2.head()

Unnamed: 0,age,workclass,education,education.num,marital.status,occupation,relationship,sex,capital.gain,capital.loss,hours.per.week,income
1,82,Private,HS-grad,9,Widowed,Exec-managerial,Not-in-family,Female,0,4356,18,<=50K
3,54,Private,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,Female,0,3900,40,<=50K
4,41,Private,Some-college,10,Separated,Prof-specialty,Own-child,Female,0,3900,40,<=50K
5,34,Private,HS-grad,9,Divorced,Other-service,Unmarried,Female,0,3770,45,<=50K
6,38,Private,10th,6,Separated,Adm-clerical,Unmarried,Male,0,3770,40,<=50K


encoding with dummy variables for categorical data

In [None]:
df3 = pd.get_dummies(df2, drop_first=True)
df3.head()

Unnamed: 0,age,education.num,capital.gain,capital.loss,hours.per.week,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,...,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,sex_Male,income_>50K
1,82,9,0,4356,18,False,True,False,False,False,...,False,False,False,True,False,False,False,False,False,False
3,54,4,0,3900,40,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,41,10,0,3900,40,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
5,34,9,0,3770,45,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
6,38,6,0,3770,40,False,True,False,False,False,...,False,False,False,False,False,False,True,False,True,False


converting the boolean dummy columns to 0/1 integers so that every feature is numeric

In [None]:
# every column is already int or bool, so a plain cast gives 0/1 without the replace() downcasting warning
df3 = df3.astype(int)
df3.head()


Unnamed: 0,age,education.num,capital.gain,capital.loss,hours.per.week,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,...,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,sex_Male,income_>50K
1,82,9,0,4356,18,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,54,4,0,3900,40,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,41,10,0,3900,40,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5,34,9,0,3770,45,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,38,6,0,3770,40,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
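
The encoding and the integer conversion above can also be done in one step, since `pd.get_dummies` accepts a `dtype` argument. A small sketch on a made-up two-column frame (not the notebook's data):

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "age": [30, 41, 25],
})

# dtype=int makes the dummy columns 0/1 integers instead of booleans.
# drop_first=True drops the first category ("Female") to avoid a redundant column.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

The numeric column is passed through unchanged and only the object column is expanded.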


### splitting into training and testing

In [None]:
x = df3.drop(["income_>50K"], axis="columns")
y = df3["income_>50K"]

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
len(x_test)

6126
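
The split above is random on every run and does not fix the class proportions. As a hedged sketch, `random_state` makes the split reproducible and `stratify` keeps the share of each class the same in both parts. The arrays below are synthetic, and the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% positive class (illustrative only).
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 75 + [1] * 25)

# random_state fixes the shuffle; stratify preserves the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_te), y_tr.mean(), y_te.mean())
```

Using this in the notebook would change the exact scores reported further down, since those come from an unseeded split.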

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

hyperparameter tuning

In [None]:
model_param = {
    'decision_tree' : {
        'model' : DecisionTreeClassifier(),
        'param' : {
            'criterion' : ['gini', 'entropy'],
            'min_samples_split': [2, 5, 10, 20],
            'max_features': [None, 'sqrt', 'log2']
        }
    },
    'random_forest' : {
        'model' : RandomForestClassifier(),
        'param' : {
            'n_estimators' : [10,50,100,200],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 5],
            'max_features': ['sqrt', 'log2'],
        }
    },
    'logistic_regression' : {
        'model' : LogisticRegression(max_iter=1000),
        'param' : {
            'C' : [1,5,10,20],
            'penalty': ['l2'],
            'solver': ['liblinear', 'saga'],
        }
    }
}

In [None]:
models = []
for model, param in model_param.items():
    grid = GridSearchCV(param['model'], param['param'], cv=5, return_train_score=False)
    grid.fit(x_train, y_train)
    models.append({
        'model': model,
        'best_score': grid.best_score_,
        'best_params': grid.best_params_,
    })



In [None]:
results = pd.DataFrame(models, columns=['model', 'best_score', 'best_params'])
results

Unnamed: 0,model,best_score,best_params
0,decision_tree,0.840788,"{'criterion': 'gini', 'max_features': None, 'm..."
1,random_forest,0.860379,"{'max_features': 'sqrt', 'min_samples_leaf': 2..."
2,logistic_regression,0.84842,"{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}"


### training the models

In [None]:
# logistic regression
log_reg = LogisticRegression(C=1, penalty='l2', solver='liblinear', max_iter=1000)
log_reg.fit(x_train, y_train)

In [None]:
# decision tree
dec_tree = DecisionTreeClassifier(criterion='gini', max_features=None, min_samples_split=20)
dec_tree.fit(x_train, y_train)

In [None]:
# random forest
forest = RandomForestClassifier(max_features='sqrt', min_samples_leaf=2, min_samples_split=10, n_estimators=100)
forest.fit(x_train, y_train)
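
The cells above retype the best parameters by hand. With the default `refit=True`, `GridSearchCV` already refits the best configuration on the full training set and exposes it as `best_estimator_`, so a sketch like the following avoids copying values manually. It uses synthetic data from `make_classification` and a deliberately small grid, not the notebook's data or grids.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Small grid for the sketch; refit=True is the default.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"min_samples_split": [2, 10, 20]},
    cv=5,
)
grid.fit(X_train, y_train)

# The winning model, already trained on all of X_train.
best_tree = grid.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)
print(grid.best_params_, test_accuracy)
```

In the tuning loop, storing `grid.best_estimator_` next to `best_score` and `best_params` would make the separate training cells optional.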

### evaluation metrics

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

In [None]:
metrics_rows = []  # renamed from eval to avoid shadowing the built-in eval()
for mod in [log_reg, dec_tree, forest]:
    pred = mod.predict(x_test)              # hard class labels
    prob = mod.predict_proba(x_test)[:, 1]  # probability of the >50K class
    acc = accuracy_score(y_test, pred)
    prec = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, prob)
    metrics_rows.append({
        "model": mod.__class__.__name__,
        "accuracy": acc,
        "precision": prec,
        "recall": recall,
        "roc_auc": roc_auc
    })

In [None]:
eval_results = pd.DataFrame(metrics_rows, columns=["model", "accuracy", "precision", "recall", "roc_auc"])
eval_results

Unnamed: 0,model,accuracy,precision,recall,roc_auc
0,LogisticRegression,0.842964,0.747934,0.579385,0.896714
1,DecisionTreeClassifier,0.838394,0.711538,0.615877,0.864049
2,RandomForestClassifier,0.857329,0.778317,0.615877,0.91246


The RandomForestClassifier performs best on the test set: it has the highest accuracy, precision, and ROC-AUC, and it ties the decision tree on recall.

Accuracy - the three models are close (0.838 to 0.857), i.e. they predict the correct class for a similar share of the test rows. Random forest is still the highest at 0.857.

Precision - random forest has the highest precision (0.778, against 0.748 for logistic regression and 0.712 for the decision tree), i.e. a smaller share of its >50K predictions are false positives.

Recall - all the models have a comparatively low recall (0.58 to 0.62), which means they miss a sizeable share of the actual >50K earners and label them <=50K (false negatives). The two tree-based models (0.616 each) do slightly better than logistic regression (0.579).

ROC_AUC - random forest has the highest roc-auc (0.912). A likely reason is that averaging many trees reduces variance, which gives smoother probability estimates and a more stable ranking than a single tree.
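
The precision and recall figures discussed above both come from counts of true positives, false positives, and false negatives. A minimal pure-Python sketch on made-up labels (1 stands for >50K; `precision_recall` is a helper written for this illustration, not a library function):

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # precision: share of predicted positives that are really positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: share of actual positives that the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here 2 of the 3 predicted positives are correct (precision 2/3) and 2 of the 4 actual positives are found (recall 0.5), the same pattern of higher precision than recall seen in the table.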

Random forest has the best bias-variance tradeoff of the three. The logistic model, despite its relatively higher bias, also scores better on roc-auc (0.897) than the decision tree (0.864), whose higher variance makes it the weakest model on accuracy, precision, and roc-auc.

The random forest reduces variance by averaging over an ensemble of trees, and keeps bias low because it is a non-linear model.
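
The variance argument above can be illustrated with a toy simulation. This is a sketch under a strong simplifying assumption: each "model" is an independent noisy estimate of the same value. The trees in a real forest are correlated, so the reduction there is smaller. All numbers are invented.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 0.5   # quantity every "model" tries to estimate
NOISE_STD = 0.2    # spread of a single model's error
ENSEMBLE_SIZE = 50
TRIALS = 2000


def noisy_estimate():
    """One high-variance estimate: the true value plus Gaussian noise."""
    return TRUE_VALUE + random.gauss(0, NOISE_STD)


# Predictions of a single model, repeated over many trials.
single = [noisy_estimate() for _ in range(TRIALS)]

# Predictions of an ensemble that averages ENSEMBLE_SIZE independent models.
averaged = [
    statistics.fmean([noisy_estimate() for _ in range(ENSEMBLE_SIZE)])
    for _ in range(TRIALS)
]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(var_single, var_averaged)
```

For independent estimates the variance of the average shrinks roughly in proportion to the ensemble size, while the mean stays near the true value, which is the sense in which averaging lowers variance without adding bias.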