The goal is to determine whether a person earns more than 50K a year, based on features such as the person's age, sex and education.

Submissions are evaluated on the F1 score between the ground-truth labels and the predicted labels.
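
As a quick illustration of the metric, the sketch below computes F1 by hand from precision and recall and checks the result against scikit-learn's `f1_score`. The labels are invented for the example and are not taken from the dataset.

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Count true positives, false positives and false negatives for class 1.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred)  # positive class defaults to label 1
print(f1_manual, f1_sklearn)
```

F1 is the harmonic mean of precision and recall, so it only rewards a model that does reasonably well on both.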

1. id: id of individual sample
2. age: age of a person
3. Working_class: working class to which the person belongs like Private, Self-emp-not-inc, Self-emp-inc etc.
4. fnlwgt: continuous
5. education: educational background of a person like Bachelors, 11th, 9th, 7th-8th, 12th, Masters.
6. education_num: continuous.
7. marital_status: marital status of a person.
8. occupation: Adm-clerical, Exec-managerial etc.
9. relationship: Wife, Husband, Unmarried etc.
10. race: White, Other, Black etc.
11. sex: Female, Male.
12. capital_gain: continuous.
13. capital_loss: continuous.
14. hours_per_week: continuous.
15. native_country: United-States, Cambodia, England, Puerto-Rico, etc.
16. earning: {>50K:0, <=50K:1}

Import the libraries

In [543]:
import numpy as np
from array import array
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [544]:
import warnings
warnings.filterwarnings("ignore")

Load the train data

In [545]:
df = pd.read_csv("Week5_train.csv")
df.head()

,id,Age,Working_class,fnlwgt,education,education_num,marital_status,Occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,earning
0,0,37,Private,280966,Bachelors,13,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,40,United-States,0
1,1,41,Private,205153,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,1
2,2,23,Private,237720,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,38,United-States,1
3,3,35,Private,276153,Bachelors,13,Never-married,Tech-support,Not-in-family,Asian-Pac-Islander,Female,4650,0,40,United-States,1
4,4,28,Private,216178,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,1


Check the dimensions

In [546]:
df.shape

(13842, 16)

Null values in the dataset

In [547]:
df.isna().sum()

id                0
Age               0
Working_class     0
fnlwgt            0
education         0
education_num     0
marital_status    0
Occupation        0
relationship      0
race              0
gender            0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
earning           0
dtype: int64

Data types of the columns

In [548]:
df.dtypes

id                 int64
Age                int64
Working_class     object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
Occupation        object
relationship      object
race              object
gender            object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
earning            int64
dtype: object

Categorical (object) columns: Working_class, education, marital_status, Occupation, relationship, race, gender, native_country

In [549]:
print(df.gender.values)

[' Male' ' Male' ' Male' ... ' Female' ' Female' ' Female']


In [550]:
df.gender.value_counts()

 Male      9882
 Female    3960
Name: gender, dtype: int64
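
The values printed above carry a leading space (`' Male'`, `' Female'`). The notebook leaves them as they are. If cleaner labels are wanted, one option is to strip the whitespace, sketched here on a small made-up frame (`toy` and `cols` are hypothetical names):

```python
import pandas as pd

# Toy frame mimicking the leading space seen in the raw values.
toy = pd.DataFrame({"gender": [" Male", " Female", " Male"],
                    "race": [" White", " Black", " White"]})

# Strip surrounding whitespace from the listed string columns.
cols = ["gender", "race"]
toy[cols] = toy[cols].apply(lambda s: s.str.strip())

print(toy.gender.unique())
```

`pd.read_csv` also accepts `skipinitialspace=True`, which may avoid the issue at load time, depending on how the file is delimited.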

In [551]:
df.race.value_counts()

 White                 12034
 Black                  1142
 Asian-Pac-Islander      442
 Amer-Indian-Eskimo      117
 Other                   107
Name: race, dtype: int64

In [552]:
df.Working_class.value_counts()

 Private             9386
 Self-emp-not-inc    1142
 Local-gov            950
 ?                    666
 Self-emp-inc         628
 State-gov            582
 Federal-gov          484
 Without-pay            4
Name: Working_class, dtype: int64

In [553]:
df.education.value_counts()

 HS-grad         4039
 Some-college    2989
 Bachelors       2725
 Masters          977
 Assoc-voc        614
 Assoc-acdm       422
 11th             399
 Prof-school      378
 10th             320
 Doctorate        264
 7th-8th          210
 9th              185
 12th             142
 5th-6th          110
 1st-4th           54
 Preschool         14
Name: education, dtype: int64

In [554]:
df.marital_status.value_counts()

 Married-civ-spouse       7648
 Never-married            3622
 Divorced                 1648
 Widowed                   381
 Separated                 365
 Married-spouse-absent     165
 Married-AF-spouse          13
Name: marital_status, dtype: int64

In [555]:
df.Occupation.value_counts()

 Prof-specialty       2185
 Exec-managerial      2166
 Craft-repair         1683
 Sales                1625
 Adm-clerical         1455
 Other-service        1110
 Machine-op-inspct     747
 ?                     666
 Transport-moving      638
 Handlers-cleaners     473
 Tech-support          378
 Farming-fishing       348
 Protective-serv       311
 Priv-house-serv        54
 Armed-Forces            3
Name: Occupation, dtype: int64

In [556]:
df.relationship.value_counts()

 Husband           6760
 Not-in-family     3107
 Own-child         1644
 Unmarried         1200
 Wife               808
 Other-relative     323
Name: relationship, dtype: int64

In [557]:
df.native_country.value_counts()

 United-States                 12486
 ?                               237
 Mexico                          215
 Philippines                      78
 Germany                          58
 Canada                           55
 Puerto-Rico                      54
 India                            48
 England                          48
 Cuba                             41
 El-Salvador                      41
 China                            40
 Jamaica                          35
 South                            35
 Italy                            31
 Vietnam                          30
 Japan                            29
 Poland                           28
 Dominican-Republic               21
 Guatemala                        21
 Iran                             18
 Ireland                          18
 Columbia                         17
 France                           15
 Taiwan                           15
 Haiti                            14
 Greece                           14
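
`df.isna().sum()` reported no nulls, yet Working_class, Occupation and native_country each contain a ` ?` category. That looks like a placeholder for missing values, although the source does not say so. Under that assumption, a minimal sketch of surfacing the placeholders as NaN on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame with the same " ?" placeholder seen in the value counts.
toy = pd.DataFrame({
    "Working_class": [" Private", " ?", " State-gov", " ?"],
    "native_country": [" United-States", " Mexico", " ?", " India"],
})

# Strip whitespace, then treat an exact "?" as a missing value.
cleaned = toy.apply(lambda s: s.str.strip()).replace("?", np.nan)

missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Whether to impute, drop, or keep `?` as its own category is a modelling choice. The notebook keeps it as a category.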
 

In [558]:
objects = ['Working_class', 'education', 'marital_status', 'relationship', 'race', 'Occupation', 'native_country', 'gender']

In [559]:
df[objects] = df[objects].astype("category")

One Hot Encoding

In [560]:
from sklearn.preprocessing import OneHotEncoder

In [561]:
df.dtypes

id                   int64
Age                  int64
Working_class     category
fnlwgt               int64
education         category
education_num        int64
marital_status    category
Occupation        category
relationship      category
race              category
gender            category
capital_gain         int64
capital_loss         int64
hours_per_week       int64
native_country    category
earning              int64
dtype: object

In [562]:
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('enc', OneHotEncoder(categories='auto',handle_unknown='ignore'), objects)], remainder= 'passthrough')

xt = ct.fit_transform(df[objects]).toarray()

In [563]:
df_encd = pd.DataFrame(xt, index= df.index)
df_other_cols = df.drop(columns= objects)
df_new = pd.concat([df_other_cols, df_encd], axis= 1)

In [564]:
df_new.head()

,id,Age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,earning,0,1,...,90,91,92,93,94,95,96,97,98,99
0,0,37,280966,13,0,0,40,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,1,41,205153,11,0,0,40,1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,2,23,237720,13,0,0,38,1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,3,35,276153,13,4650,0,40,1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,4,28,216178,9,0,0,40,1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [566]:
df_new.shape

(13842, 108)

Separating features and target variable

In [567]:
X = df_new.drop(['earning'], axis = 1)
y = df_new.earning

Perform train test split

In [568]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size= 0.1, random_state= 42)
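
The split above is random but not stratified. When the classes are imbalanced, passing `stratify=y` keeps the class proportions the same in both splits. The sketch below shows the effect on synthetic labels, and the 80/20 ratio is made up for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 ones and 20 zeros.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

# With stratify, the 10-row test split keeps the 80/20 class ratio.
print(np.bincount(y_te))
```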

In [569]:
from sklearn.metrics import classification_report

Apply machine learning algorithms

In [570]:
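def score(model, title = "Default"):
    # Fit on the training split and report metrics on the held-out split.
    # Note: classification_report expects (y_true, y_pred). The arguments are
    # reversed here, so in the reports below precision and recall are swapped
    # and "support" counts predicted labels; per-class f1-score is unaffected.
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(classification_report(preds, y_test))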
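
scikit-learn documents the argument order as `classification_report(y_true, y_pred)`, whereas `score` passes `(preds, y_test)`. The sketch below uses made-up labels to show what reversing the order does: precision and recall trade places, while F1 stays the same.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]

# Documented order: ground truth first, predictions second.
p_ok = precision_score(y_true, y_pred)   # 2 TP, 0 FP
r_ok = recall_score(y_true, y_pred)      # 2 TP, 2 FN

# Reversed order: precision and recall trade places.
p_rev = precision_score(y_pred, y_true)
r_rev = recall_score(y_pred, y_true)

# F1 is symmetric in FP and FN, so it is the same either way.
f1_ok = f1_score(y_true, y_pred)
f1_rev = f1_score(y_pred, y_true)
print(p_ok, r_ok, p_rev, r_rev, f1_ok, f1_rev)
```

Because the comparison in this notebook relies on per-class F1, the model ranking is not affected, but the precision, recall and support columns should be read with the reversal in mind.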

Logistic Regression

In [571]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression()
score(log, "Logistic Regression")

              precision    recall  f1-score   support

           0       0.36      0.66      0.47       311
           1       0.87      0.66      0.75      1074

    accuracy                           0.66      1385
   macro avg       0.62      0.66      0.61      1385
weighted avg       0.76      0.66      0.69      1385
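
Columns such as fnlwgt and capital_gain are on a much larger scale than the one-hot columns, and the model above is fitted on unscaled data. Standardizing the features first may help scale-sensitive models such as logistic regression, SVC and k-nearest neighbours, though this was not tried in the notebook and the effect on this dataset is unverified. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census features.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_syn[:, 0] *= 100_000  # mimic a large-valued column such as fnlwgt

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=42
)

# The scaler is fitted on the training split only, inside the pipeline.
scaled_log = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log.fit(X_tr, y_tr)
preds = scaled_log.predict(X_te)
print(preds[:5])
```

Wrapping the scaler and the model in one pipeline means the same object can be passed to a helper like `score` unchanged.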



Support Vector Machine

In [572]:
from sklearn.svm import SVC
sv = SVC(kernel = 'rbf', C = 5, gamma = 'scale', probability= True)
score(sv, "SVC")

              precision    recall  f1-score   support

           0       0.18      0.91      0.30       111
           1       0.99      0.63      0.77      1274

    accuracy                           0.66      1385
   macro avg       0.58      0.77      0.54      1385
weighted avg       0.92      0.66      0.73      1385



Decision Tree Classifier

In [574]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
score(dt, "Decision Tree")

              precision    recall  f1-score   support

           0       0.72      0.70      0.71       581
           1       0.79      0.80      0.79       804

    accuracy                           0.76      1385
   macro avg       0.75      0.75      0.75      1385
weighted avg       0.76      0.76      0.76      1385



Random Forest Classifier

In [575]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
score(rf, "Random Forest")

              precision    recall  f1-score   support

           0       0.79      0.74      0.77       603
           1       0.81      0.85      0.83       782

    accuracy                           0.80      1385
   macro avg       0.80      0.80      0.80      1385
weighted avg       0.80      0.80      0.80      1385



KNeighbors Classifier

In [577]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors= 5, metric = 'minkowski', p =2)
score(knn, "KNeighbors")

              precision    recall  f1-score   support

           0       0.38      0.51      0.44       427
           1       0.74      0.64      0.69       958

    accuracy                           0.60      1385
   macro avg       0.56      0.57      0.56      1385
weighted avg       0.63      0.60      0.61      1385



Gaussian NB

In [578]:
from sklearn.naive_bayes import GaussianNB
g = GaussianNB()
score(g, "Gaussian")

              precision    recall  f1-score   support

           0       0.29      0.77      0.43       216
           1       0.94      0.66      0.77      1169

    accuracy                           0.68      1385
   macro avg       0.62      0.72      0.60      1385
weighted avg       0.84      0.68      0.72      1385



Gradient Boosting Classifier

In [579]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
score(gbc, "Gradient Boosting")

              precision    recall  f1-score   support

           0       0.77      0.77      0.77       571
           1       0.84      0.84      0.84       814

    accuracy                           0.81      1385
   macro avg       0.80      0.80      0.80      1385
weighted avg       0.81      0.81      0.81      1385



Histogram Gradient Boosting Classifier

In [580]:
from sklearn.ensemble import HistGradientBoostingClassifier
hgbc = HistGradientBoostingClassifier()
score(hgbc, "Histogram Gradient Boosting")

              precision    recall  f1-score   support

           0       0.80      0.77      0.79       590
           1       0.84      0.86      0.85       795

    accuracy                           0.82      1385
   macro avg       0.82      0.82      0.82      1385
weighted avg       0.82      0.82      0.82      1385



XGB Classifier

In [581]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
score(xgb, "XGBoost")

              precision    recall  f1-score   support

           0       0.81      0.76      0.79       602
           1       0.83      0.86      0.84       783

    accuracy                           0.82      1385
   macro avg       0.82      0.81      0.81      1385
weighted avg       0.82      0.82      0.82      1385



CatBoost Classifier

In [582]:
from catboost import CatBoostClassifier
catb = CatBoostClassifier(verbose= 0, n_estimators= 100)
score(catb, "CatBoost")

              precision    recall  f1-score   support

           0       0.80      0.77      0.79       589
           1       0.84      0.86      0.85       796

    accuracy                           0.82      1385
   macro avg       0.82      0.82      0.82      1385
weighted avg       0.82      0.82      0.82      1385



F1 score for class 1 (earning = 1), read from the reports above:
1. Logistic Regression = 0.75
2. SVC = 0.77
3. Decision Tree = 0.79
4. Random Forest = 0.83
5. KNeighbors = 0.69
6. GaussianNB = 0.77
7. Gradient Boosting = 0.84
8. Histogram Gradient Boosting = 0.85
9. XGBoost = 0.84
10. CatBoost = 0.85
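
The summary above was copied by hand from the individual reports. The same table can be built programmatically, sketched below on synthetic data with two scikit-learn models. The `models` dictionary and the data are placeholders, not the notebook's own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=42
)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its F1 score for class 1 on the held-out split.
f1_by_model = {
    name: f1_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}
summary = pd.Series(f1_by_model, name="f1").sort_values(ascending=False)
print(summary.round(2))
```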

Load test data

In [583]:
df_t = pd.read_csv("Week5_test.csv")
df_t.head()

,id,Age,Working_class,fnlwgt,education,education_num,marital_status,Occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country
0,0,34,Private,174789,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,45,United-States
1,1,38,Private,181943,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,35,United-States
2,2,45,Private,175625,HS-grad,9,Separated,Adm-clerical,Unmarried,White,Female,0,0,38,United-States
3,3,20,Private,121023,Some-college,10,Never-married,Adm-clerical,Own-child,White,Female,0,0,15,United-States
4,4,41,Local-gov,81054,Some-college,10,Divorced,Exec-managerial,Unmarried,White,Female,0,0,25,United-States


In [584]:
df_t.shape

(13840, 15)

In [585]:
df_t.isna().sum()

id                0
Age               0
Working_class     0
fnlwgt            0
education         0
education_num     0
marital_status    0
Occupation        0
relationship      0
race              0
gender            0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
dtype: int64

In [586]:
df_t[objects] = df_t[objects].astype("category")

One Hot Encoding

In [587]:
from sklearn.preprocessing import OneHotEncoder

In [588]:
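# Reuse the encoder fitted on the training data; refitting on the test set
# could change the number or order of the one-hot columns.
xt_t = ct.transform(df_t[objects]).toarray()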

In [589]:
df_encd_t = pd.DataFrame(xt_t, index= df_t.index)
df_other_cols_t = df_t.drop(columns= objects)
df_new_t = pd.concat([df_other_cols_t, df_encd_t], axis= 1)

In [590]:
df_new_t.shape

(13840, 107)
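
The test features must have the same columns, in the same order, as the ones the model was trained on. The usual way to ensure this is to fit the encoder on the training data only and then call `transform` on new data. The sketch below does so on two made-up frames and also shows how `handle_unknown='ignore'` treats a category that was absent from training:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy train and test frames, invented for illustration only.
train = pd.DataFrame({"gender": ["Male", "Female", "Male"],
                      "race": ["White", "Black", "White"]})
test = pd.DataFrame({"gender": ["Female", "Male"],
                     "race": ["Other", "White"]})  # "Other" is unseen in train

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)                          # learn categories from train only
test_encoded = enc.transform(test).toarray()

# Columns follow the training categories: gender_Female, gender_Male,
# race_Black, race_White. The unseen "Other" becomes all zeros for race.
print(test_encoded)
```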

Apply model to test data

In [591]:
target = catb.predict(df_new_t)
d = pd.DataFrame(target)
d.index = df_t.id
d.columns = ['earning']
d.to_csv('submission_final.csv', index= True)

In [592]:
d.head()

id,earning
0,0
1,1
2,1
3,1
4,1
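
An alternative way to build the submission frame is to put `id` and `earning` side by side as ordinary columns and write without the index. The sketch uses made-up ids and predictions and an in-memory buffer in place of a file:

```python
import io

import numpy as np
import pandas as pd

ids = pd.Series([0, 1, 2], name="id")
preds = np.array([0, 1, 1])  # stand-in for model.predict(...)

submission = pd.DataFrame({"id": ids, "earning": preds})

# Write to an in-memory buffer instead of disk for this demonstration.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Reading the text back with `pd.read_csv(io.StringIO(csv_text))` is a cheap way to confirm the header and row count before uploading.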
