In this problem, I developed a logistic regression (LR) model, a Support Vector
Machine (SVM) model, and a K-means model to solve a binary classification problem. The goal was to
create a model that predicts whether a person’s income is above or below 50k
per year based on 9 attributes, and to compare the models' performance.


### I. Exploratory Data Analysis:

a) Describe the dataset

In [4]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(100)

In [None]:
data = pd.read_csv("income.csv")
data.info()
data.describe()

data.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26215 entries, 0 to 26214
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   income          26215 non-null  int64 
 1   age             26215 non-null  int64 
 2   workclass       24819 non-null  object
 3   education       26215 non-null  object
 4   marital-status  26215 non-null  object
 5   occupation      24814 non-null  object
 6   relationship    26215 non-null  object
 7   race            26215 non-null  object
 8   sex             26215 non-null  object
 9   hours-per-week  26215 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 2.0+ MB


(26215, 10)

In [6]:
data.head(5)

Unnamed: 0,income,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week
0,0,39,State-gov,Bachelors,NotMarried,Adm-clerical,Not-in-family,White,Male,40
1,0,50,Self-emp-not-inc,Bachelors,Married,Exec-managerial,Husband,White,Male,13
2,0,38,Private,HS-grad,Separated,Handlers-cleaners,Not-in-family,White,Male,40
3,0,53,Private,11th,Married,Handlers-cleaners,Husband,Black,Male,40
4,0,28,Private,Bachelors,Married,Prof-specialty,Wife,Black,Female,40


b) Deal with missing values (if there are any).

In [7]:
# check whether there are any missing values
data.isna().sum()

income               0
age                  0
workclass         1396
education            0
marital-status       0
occupation        1401
relationship         0
race                 0
sex                  0
hours-per-week       0
dtype: int64

In [8]:
# remove rows with missing value
data = data.dropna()

In [9]:
# check again whether there are any missing values
data.isna().sum()

income            0
age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
hours-per-week    0
dtype: int64

c) Remove duplicated inputs if there are any

In [10]:
# check whether there are any duplicated rows
data.duplicated().any()

True

In [11]:
# remove duplicates and confirm that none remain
data = data.drop_duplicates()
data.duplicated().any()

False

d) Handle the categorical variables.

• For the ordinal variable education, assign values 1 to 16 to the
categories in this order: Preschool, 1st-4th, 5th-6th, 7th-8th, 9th,
10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm,
Bachelors, Masters, Prof-school, Doctorate.

• For the binary variable sex, assign value 0 to Male and value 1 to
Female

• For the rest of the variables, apply dummy coding to deal with them
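The ordinal mapping described above can also be generated from the ordered list of categories instead of being typed out by hand. The following is a minimal sketch, not the notebook's own code; `EDUCATION_ORDER`, `education_codes`, and the sample values are names introduced here for illustration.

```python
# Ordered education levels, lowest to highest, as listed above.
EDUCATION_ORDER = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th",
    "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm",
    "Bachelors", "Masters", "Prof-school", "Doctorate",
]

# Rank each level starting from 1, so Preschool -> 1 and Doctorate -> 16.
education_codes = {
    level: rank for rank, level in enumerate(EDUCATION_ORDER, start=1)
}

# Hypothetical sample values standing in for the "education" column.
sample = ["Bachelors", "HS-grad", "11th"]
encoded = [education_codes[level] for level in sample]
print(encoded)  # [13, 9, 7]
```

A dictionary built this way could be passed to `replace` or `map` on the column, and a misplaced category only needs to be fixed in one list.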

In [12]:
# drop the column "relationship"
data = data.drop("relationship", axis = 1)

In [13]:

# handle the variable "education"
data['education'] = data['education'].replace({
    'Preschool': 1,
    '1st-4th': 2,
    '5th-6th': 3,
    '7th-8th': 4,
    '9th': 5,
    '10th': 6,
    '11th': 7,
    '12th': 8,
    'HS-grad': 9,
    'Some-college': 10,
    'Assoc-voc': 11,
    'Assoc-acdm': 12,
    'Bachelors': 13,
    'Masters': 14,
    'Prof-school': 15,
    'Doctorate': 16
})


In [14]:
# handle the variable "sex"
data['sex'] = data['sex'].replace({
    'Male': 0,
    'Female': 1
})

In [15]:
# handle the rest of the categorical variables
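data = pd.get_dummies(data, columns=['workclass', 'marital-status', 'occupation', 'race'])
# the first 5 columns are the original numeric ones; the rest are boolean dummies
dummy_cols = data.columns[5:]
# convert the boolean dummy columns to 0/1 integers
data[dummy_cols] = data[dummy_cols].astype('int64')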

In [16]:
data.info()
data.describe()
data.shape

<class 'pandas.core.frame.DataFrame'>
Index: 21537 entries, 0 to 26214
Data columns (total 35 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   income                        21537 non-null  int64
 1   age                           21537 non-null  int64
 2   education                     21537 non-null  int64
 3   sex                           21537 non-null  int64
 4   hours-per-week                21537 non-null  int64
 5   workclass_Federal-gov         21537 non-null  int64
 6   workclass_Local-gov           21537 non-null  int64
 7   workclass_Private             21537 non-null  int64
 8   workclass_Self-emp-inc        21537 non-null  int64
 9   workclass_Self-emp-not-inc    21537 non-null  int64
 10  workclass_State-gov           21537 non-null  int64
 11  workclass_Without-pay         21537 non-null  int64
 12  marital-status_Married        21537 non-null  int64
 13  marital-status_NotMarried     21537 

(21537, 35)

After preprocessing, the dataset consists of 21,537 rows and 35 columns with no missing
values. The target variable is "income". The features "age" and "hours-per-week" are numeric,
"education" is ordinal-coded from 1 to 16, and "sex" is binary-coded. The remaining
categorical features have been transformed into dummy variables.
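As a small illustration of dummy coding, the sketch below encodes one categorical column of a toy frame. The frame `toy` and its values are made up for this example. Passing `dtype=int` to `pd.get_dummies` is one way to get 0/1 integer columns directly.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({"age": [39, 50, 38], "race": ["White", "Black", "White"]})

# Each category becomes its own 0/1 column named "<column>_<category>".
toy_encoded = pd.get_dummies(toy, columns=["race"], dtype=int)

print(toy_encoded.columns.tolist())  # ['age', 'race_Black', 'race_White']
print(toy_encoded["race_White"].tolist())  # [1, 0, 1]
```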

e) Split the dataset into training and testing (with 10% of the dataset for
testing).

Training set: X_train_norm, y_train

Testing set: X_test_norm, y_test

In [17]:
# define the input and the target variable
array = data.values
X = array[:,1:35]
y = array[:,0]

In [18]:
# split the training and testing dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

f) Apply normalization on X (both training and test set)

In [19]:
from sklearn.preprocessing import MinMaxScaler

norm = MinMaxScaler().fit(X_train)

X_train_norm = norm.transform(X_train)

X_test_norm = norm.transform(X_test)
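
The scaler is fitted on the training set only and then applied to both sets, so no information from the test set leaks into preprocessing. The sketch below reproduces the min-max formula, (x - min) / (max - min), by hand on made-up numbers. The arrays `X_train_toy` and `X_test_toy` are hypothetical.

```python
import numpy as np

# Hypothetical toy data with two features (think age and hours-per-week).
X_train_toy = np.array([[20.0, 10.0], [40.0, 40.0], [60.0, 50.0]])
X_test_toy = np.array([[30.0, 60.0]])

# "Fit": the column-wise minimum and range come from the training data only.
col_min = X_train_toy.min(axis=0)
col_range = X_train_toy.max(axis=0) - col_min

# "Transform": the same statistics are applied to both sets.
X_train_scaled = (X_train_toy - col_min) / col_range
X_test_scaled = (X_test_toy - col_min) / col_range

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))  # [0. 0.] [1. 1.]
print(X_test_scaled)  # [[0.25 1.25]]
```

Because the statistics come from the training data, a test value outside the training range (60 hours here) can be scaled to a value above 1.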

### II. Train 2 machine learning models Logistic Regression and SVM:

a) Train 2 classification models, Logistic Regression and
SVM, with their default settings

In [104]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

model1 = LogisticRegression()
model1.fit(X_train_norm, y_train)
test_score = model1.score(X_test_norm, y_test)
print("Testing Accuracy of LogisticRegression:", test_score)

model2 = SVC()
model2.fit(X_train_norm, y_train)
test_score = model2.score(X_test_norm, y_test)
print("Testing Accuracy of SVC:", test_score)

Testing Accuracy of LogisticRegression: 0.7776230269266481
Testing Accuracy of SVC: 0.7924791086350975


b) Define 10-fold cross-validation to train and evaluate the two models
based on the average score.

The training data (X_train_norm, y_train) is split into 10 folds (based on KFold).

For each fold:

- 9 folds are used to train a fresh model instance.

- 1 remaining fold is used to test it.

This repeats 10 times, so the model is re-trained 10 times, each time on a different subset of data.
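
The procedure described above can be written out as an explicit loop. This is a minimal sketch on synthetic data, not the notebook's own cell. `X_toy`, `y_toy`, `kfold_toy`, and `base_model` are names introduced here, and `make_classification` stands in for the income data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical synthetic data standing in for X_train_norm and y_train.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

kfold_toy = KFold(n_splits=10, shuffle=True, random_state=2)
base_model = LogisticRegression()

fold_scores = []
for train_idx, val_idx in kfold_toy.split(X_toy):
    model = clone(base_model)  # fresh, unfitted copy for this fold
    model.fit(X_toy[train_idx], y_toy[train_idx])  # train on 9 folds
    # accuracy on the 1 held-out fold
    fold_scores.append(model.score(X_toy[val_idx], y_toy[val_idx]))

print(len(fold_scores), np.mean(fold_scores))
```

`cross_val_score` performs the same clone, fit, and score steps internally and returns the per-fold scores.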

In [105]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=2) 

results = cross_val_score(model1, X_train_norm, y_train, cv=kfold)
print("Average Accuracy of Logistic Regression:",results.mean())

results = cross_val_score(model2, X_train_norm, y_train, cv=kfold)
print("Average Accuracy of SVM:",results.mean())

Average Accuracy of Logistic Regression: 0.8068915919018188
Average Accuracy of SVM: 0.8008038784580904


c) Apply parameter finetuning steps to the two models separately to
optimize the model performances and compare the cross-validated
results before and after finetuning for each model.

In [None]:
# Fine tune the parameters for logistic regression
from sklearn.model_selection import GridSearchCV

# The hyper-parameters to be tuned are:
# - LR:
#   o The regularization can be of type L1 or L2. L1 performs feature selection by
#     driving some weights to exactly 0, whereas L2 shrinks the weights and
#     discourages large ones without eliminating them.
#   o C, the inverse of the regularization strength, can be 1 or 10 (a smaller C
#     means stronger regularization).
#   o The solver can be "saga" (a variant of Stochastic Average Gradient descent)
#     or "liblinear" (a coordinate-descent solver from the LIBLINEAR library). The
#     first one is suitable for large datasets, and the second one is better for
#     smaller datasets.

# - SVM:
#   o The kernel can be linear, which means no kernel function is applied to
#     transform the data, or polynomial.
#   o C, the inverse of the regularization strength, can be 1 or 10.
#   o If the polynomial kernel is chosen, the degree can be 3 or 8.
#   o "gamma" can be "auto" or "scale", which are 2 ways gamma is calculated:
#     "auto" uses 1 / n_features and "scale" uses 1 / (n_features * X.var()).
#     gamma is the kernel coefficient of the non-linear kernels (for the polynomial
#     kernel it scales the inner product) and is ignored by the linear kernel.

grid_params_lr = {
    'penalty': ['l1', 'l2'], 
    'C': [1, 10],
    'solver': ['saga', 'liblinear']
}

lr = LogisticRegression(max_iter = 300) 
gs_lr_result = GridSearchCV(lr, grid_params_lr, cv=kfold)
gs_lr_result.fit(X_train_norm, y_train) 
print(gs_lr_result.best_score_)  # best cross validation score
print(gs_lr_result.best_estimator_)



0.8068915919018188
LogisticRegression(C=1, max_iter=300, solver='liblinear')


In [107]:
# Fine tune the parameters for SVM  
from sklearn.model_selection import GridSearchCV

grid_params_svc = {
    'kernel': ['linear', 'poly'],
    'C': [1, 10],
    'degree': [3, 8],
    'gamma': ['auto','scale']
}

svc = SVC()
gs_svc_result = GridSearchCV(svc, grid_params_svc, cv=kfold)
gs_svc_result.fit(X_train_norm, y_train)
print(gs_svc_result.best_score_) # best cross validation score
print(gs_svc_result.best_params_)

0.8073558551294354
{'C': 1, 'degree': 8, 'gamma': 'scale', 'kernel': 'poly'}


The cross-validated accuracy of the logistic regression model is 0.8069 both before and after tuning, so the grid search did not improve on the default settings. The selected model keeps the default C=1 and L2 penalty and differs from the untuned model only in its solver (liblinear) and max_iter=300.

The cross-validated accuracy of the SVM model is 0.8008 before tuning and 0.8074 after tuning, so the grid search found a slightly better set of parameters for the model.

After tuning, SVM performs slightly better than LR on this problem. For
both models, the cross-validated accuracy on the training data and the accuracy on the test set are similar,
which suggests that neither model overfits the dataset.
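
One detail of the SVM grid above is that `degree` and `gamma` have no effect on the linear kernel, so some of the combinations it tries are redundant. `GridSearchCV` also accepts a list of dictionaries as `param_grid`, which allows one sub-grid per kernel. The sketch below only counts the settings in each layout. `count_combinations` is a hypothetical helper, and `split_grid` is an assumed alternative, not the grid used in this notebook.

```python
from itertools import product

def count_combinations(param_grid):
    """Count the parameter settings in a dict grid or a list of dict grids."""
    grids = param_grid if isinstance(param_grid, list) else [param_grid]
    return sum(len(list(product(*grid.values()))) for grid in grids)

# Grid as used above: every combination is tried, including those where
# a parameter is ignored by the chosen kernel.
flat_grid = {
    'kernel': ['linear', 'poly'],
    'C': [1, 10],
    'degree': [3, 8],
    'gamma': ['auto', 'scale'],
}

# Assumed alternative: one sub-grid per kernel, so the linear kernel
# is only combined with the parameter it actually uses.
split_grid = [
    {'kernel': ['linear'], 'C': [1, 10]},
    {'kernel': ['poly'], 'C': [1, 10], 'degree': [3, 8], 'gamma': ['auto', 'scale']},
]

n_flat = count_combinations(flat_grid)
n_split = count_combinations(split_grid)
print(n_flat, n_split)  # 16 10
```

With 10-fold cross-validation this is the difference between 160 and 100 model fits during the search.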


d) Evaluate the two optimized models (with the best parameter setting from
the above step for each model type) on the test set.

In [108]:
test_accuracy = model1.score(X_test_norm, y_test)
print("Accuracy in testing of LR before tuning:", test_accuracy)
test_accuracy = model2.score(X_test_norm, y_test)
print("Accuracy in testing of SVM before tuning:", test_accuracy)

test_accuracy = gs_lr_result.best_estimator_.score(X_test_norm, y_test)
print("Accuracy in testing of LR after tuning:", test_accuracy)
test_accuracy = gs_svc_result.best_estimator_.score(X_test_norm, y_test)
print("Accuracy in testing of SVM after tuning:", test_accuracy)


Accuracy in testing of LR before tuning: 0.7776230269266481
Accuracy in testing of SVM before tuning: 0.7924791086350975
Accuracy in testing of LR after tuning: 0.7776230269266481
Accuracy in testing of SVM after tuning: 0.7966573816155988


The accuracy of the Logistic Regression model on the test set is 0.7776 both before and after fine-tuning, so tuning made no difference.

The accuracy of the SVM model on the test set is 0.7925 before fine-tuning and 0.7967 after fine-tuning, so the fine-tuning step slightly increased the accuracy of the model.

### III. Apply K-Means Clustering on the normalized training input X:

a) Apply clustering on the normalized training input X


In [20]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit(X_train_norm)

b) Identify how many data samples have been assigned to each cluster.

In [60]:
kmeans_labels = kmeans.labels_
unique_labels, unique_counts = np.unique(kmeans_labels, return_counts=True)
dict(zip(unique_labels, unique_counts))

{0: 10113, 1: 9270}

c) Extract a prototype from each cluster and investigate their similarity and
difference.


In [25]:
from sklearn.metrics.pairwise import pairwise_distances_argmin

kmeans_cluster_centers = kmeans.cluster_centers_
closest = pairwise_distances_argmin(kmeans.cluster_centers_, X_train_norm)

column_names = data.columns[1:].tolist()
X_train_df = pd.DataFrame(X_train, columns=column_names)

pd.set_option('display.max_columns', None)  
X_train_df.iloc[closest, :]

Unnamed: 0,age,education,sex,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,marital-status_Married,marital-status_NotMarried,marital-status_Separated,marital-status_Widowed,occupation_Adm-clerical,occupation_Armed-Forces,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White
6255,36,10,1,40,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
10068,43,11,0,45,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


Compare the prototypes:
- Cluster 0, taken to represent ‘income’ of 0:
Age: 36, education: 10, sex: female, hours per week: 40, workclass_Private: 1,
marital-status_NotMarried: 1, occupation_Adm-clerical: 1, white.
- Cluster 1, taken to represent ‘income’ of 1:
Age: 43, education: 11, sex: male, hours per week: 45, workclass_Private: 1, marital-status_Married: 1, occupation_Exec-managerial: 1, white.

The two prototypes differ in age and working hours: the richer person is 7 years older and works
5 more hours per week. This is consistent with the expectation that people tend to earn more when
they are older, have more education, and work more. The richer person
is a married male, while the poorer person is a female who is not married. They also have different
occupations.

On the other hand, both are around middle age, have similar education levels (10 and 11), work
at least 40 hours per week, belong to the private work class, and are white.
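
With 34 columns, the comparison is easier to read when the columns on which the prototypes differ are separated from the ones they share. The following is a minimal sketch. The frame `prototypes` is a hand-built stand-in for `X_train_df.iloc[closest, :]`, restricted to a few of the columns shown above.

```python
import pandas as pd

# Hypothetical two-row frame with values copied from the prototypes above.
prototypes = pd.DataFrame(
    {
        "age": [36, 43],
        "education": [10, 11],
        "sex": [1, 0],
        "hours-per-week": [40, 45],
        "workclass_Private": [1, 1],
        "race_White": [1, 1],
    },
    index=["cluster 0", "cluster 1"],
)

# Boolean Series, indexed by column name: True where the two rows disagree.
differs = prototypes.iloc[0] != prototypes.iloc[1]

differing = prototypes.loc[:, differs]
shared = differs.index[~differs.values].tolist()

print(differing)
print(shared)  # ['workclass_Private', 'race_White']
```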

d) Evaluate the clustering accuracy with the testing set and compare with
the results from 2d.

In [84]:
from sklearn.metrics import accuracy_score

kmeans_test_labels = kmeans.predict(X_test_norm)

accuracy = accuracy_score(y_test, kmeans_test_labels)
print("k means prediction accuracy on test set:", accuracy)

k means prediction accuracy on test set: 0.71355617455896


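The accuracy of the k-means model on the test set is 0.7136, which is lower than that of the logistic regression and SVM models in 2d (0.7776 and 0.7967 after tuning). This comparison treats cluster 0 as income 0 and cluster 1 as income 1. K-means is unsupervised and never sees the income labels, so the two supervised models are better suited to this problem.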
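
One caveat when scoring a clustering with `accuracy_score` is that cluster IDs are arbitrary: k-means could just as well have numbered the two clusters the other way round. A minimal sketch of a check that scores both possible mappings follows. `aligned_accuracy` is a hypothetical helper and the labels are made up.

```python
import numpy as np

def aligned_accuracy(y_true, cluster_ids):
    """Accuracy of a 2-cluster assignment under the better of the two label mappings."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    direct = np.mean(y_true == cluster_ids)       # cluster 0 -> class 0
    flipped = np.mean(y_true == 1 - cluster_ids)  # cluster 0 -> class 1
    return max(direct, flipped)

# Hypothetical labels where the cluster numbering is mostly inverted.
y_true = [0, 0, 0, 1, 1]
cluster_ids = [1, 1, 0, 0, 0]
print(aligned_accuracy(y_true, cluster_ids))  # 0.8
```

In the run above the direct mapping already gives an accuracy above 0.5, so flipping the labels would not raise the score. Choosing the mapping on the training labels rather than the test labels would avoid an optimistic estimate.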