# Income Prediction and Clustering Analysis

## Overview

This project explores supervised and unsupervised learning techniques for income classification using a census dataset.

The study includes:

- Data preprocessing and feature engineering
- Logistic Regression and Support Vector Machine classification
- Hyperparameter tuning using GridSearchCV
- K-Means clustering for unsupervised pattern discovery
- Comparative performance analysis

The objective is to evaluate model performance and understand the relationship between supervised and unsupervised learning approaches on structured data.

## Data Preprocessing and Feature Engineering

In [None]:
#importing necessary packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.metrics import accuracy_score

In [None]:
# Load dataset
df = pd.read_csv("income.csv")

print(df.head())
print("\nDataset shape/dimension before preprocessing", df.shape, "\n")
print("Data types(before preprocessing):\n",df.dtypes, "\n")

# Class distribution to check if the dataset is balanced or imbalanced
print("Income distribution before preprocessing:\n", df['income'].value_counts(normalize=True), "\n")

# Correlation (optional, only numeric vars)
print("Correlation among numerical features before preprocessing:\n", df[['age', 'hours-per-week', 'income']].corr(), "\n")

   income  age         workclass  education marital-status         occupation  \
0       0   39         State-gov  Bachelors     NotMarried       Adm-clerical   
1       0   50  Self-emp-not-inc  Bachelors        Married    Exec-managerial   
2       0   38           Private    HS-grad      Separated  Handlers-cleaners   
3       0   53           Private       11th        Married  Handlers-cleaners   
4       0   28           Private  Bachelors        Married     Prof-specialty   

    relationship   race     sex  hours-per-week  
0  Not-in-family  White    Male              40  
1        Husband  White    Male              13  
2  Not-in-family  White    Male              40  
3        Husband  Black    Male              40  
4           Wife  Black  Female              40  

Dataset shape/dimension before preprocessing (26215, 10) 

Data types(before preprocessing):
 income             int64
age                int64
workclass         object
education         object
marital-status    

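Rows where both workclass and occupation are missing are assigned an explicit "Unknown" category, which allows the model to learn whether missingness itself carries predictive information rather than treating it purely as noise. The five remaining rows, where only occupation is missing, are filled with the most frequent value.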
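The imputation strategy can be sketched on a toy frame. The values below are hypothetical and only illustrate the two missingness patterns (both fields missing versus one field missing); they are not taken from the census data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with both missingness patterns
toy = pd.DataFrame({
    "workclass":  ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales",   np.nan, np.nan,    "Sales"],
})

# Rows missing both fields get an explicit "Unknown" category
both = toy["workclass"].isna() & toy["occupation"].isna()
toy.loc[both, ["workclass", "occupation"]] = "Unknown"

# Any remaining gaps are filled with the column mode, assigning the result back
for col in ["workclass", "occupation"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Assigning the result of fillna back to the column avoids relying on an inplace call on an intermediate object.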

In [None]:
missing_count_before = df.isnull().sum()
print("Missing values count (before):\n", missing_count_before, "\n")

# observation of rows where both workclass and occupation are missing
both_missing = df[df['workclass'].isnull() & df['occupation'].isnull()].shape[0]

print(f"Rows where both workclass and occupation are missing: {both_missing}\n")

both_missing_idx = df[df['workclass'].isnull() & df['occupation'].isnull()].index
df.loc[both_missing_idx, ['workclass', 'occupation']] = 'Unknown'
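# Fill the remaining gaps with the column mode; assign back instead of a chained inplace call
df['workclass'] = df['workclass'].fillna(df['workclass'].mode()[0])
df['occupation'] = df['occupation'].fillna(df['occupation'].mode()[0])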

missing_count_after = df.isnull().sum()
print("Missing values count(after):\n", missing_count_after, "\n")

Missing values count (before):
 income               0
age                  0
workclass         1396
education            0
marital-status       0
occupation        1401
relationship         0
race                 0
sex                  0
hours-per-week       0
dtype: int64 

Rows where both workclass and occupation are missing: 1396

Missing values count(after):
 income            0
age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
hours-per-week    0
dtype: int64 





Removing duplicate observations ensures that model performance reflects genuine patterns in the data rather than repeated samples.

In [None]:
# Check shape and number of duplicate rows
print("Shape before removing duplicates:", df.shape)
print("Number of duplicate rows:", df.duplicated().sum())

# Remove duplicates
df = df.drop_duplicates()

# Verify
print("Shape after removing duplicates:", df.shape)


Shape before removing duplicates: (26215, 10)
Number of duplicate rows: 3433
Shape after removing duplicates: (22782, 10)


Careful encoding prevents introducing artificial relationships between categories and helps maintain interpretability of model coefficients.
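A minimal sketch of the two encoding styles on a toy frame, assuming a three-level subset of the education mapping and a two-level race column. The frame and the name edu_rank are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical rows)
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "Masters"],
    "race": ["White", "Black", "White"],
})

# Ordinal variable: integer ranks preserve the natural order of the levels
edu_rank = {"HS-grad": 9, "Bachelors": 13, "Masters": 14}
toy["education"] = toy["education"].map(edu_rank)

# Nominal variable: one-hot encode; drop_first removes the first level
# ("Black" here, in sorted order), which becomes the reference category
encoded = pd.get_dummies(toy, columns=["race"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per nominal variable avoids perfectly collinear dummy columns, so each remaining coefficient is read relative to the reference category.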

In [None]:
# Ordinal: education
education_mapping = {
    'Preschool':1, '1st-4th':2, '5th-6th':3, '7th-8th':4, '9th':5, '10th':6,
    '11th':7, '12th':8, 'HS-grad':9, 'Some-college':10, 'Assoc-voc':11, 'Assoc-acdm':12,
    'Bachelors':13, 'Masters':14, 'Prof-school':15, 'Doctorate':16
}
# .map avoids the downcasting FutureWarning that .replace raises on object columns
df['education'] = df['education'].map(education_mapping)

# Binary: sex
df['sex'] = df['sex'].map({'Male':0, 'Female':1})

# Nominal variables: use one-hot encoding
df = pd.get_dummies(df, columns=['workclass','marital-status','occupation','relationship','race'], drop_first=True)

# Verify
print(df.head())

   income  age  education  sex  hours-per-week  workclass_Local-gov  \
0       0   39         13    0              40                False   
1       0   50         13    0              13                False   
2       0   38          9    0              40                False   
3       0   53          7    0              40                False   
4       0   28         13    1              40                False   

   workclass_Never-worked  workclass_Private  workclass_Self-emp-inc  \
0                   False              False                   False   
1                   False              False                   False   
2                   False               True                   False   
3                   False               True                   False   
4                   False               True                   False   

   workclass_Self-emp-not-inc  ...  occupation_Unknown  \
0                       False  ...               False   
1                       



In [None]:
print(df.head())
print("\nDataset shape/dimension after preprocessing", df.shape, "\n")
print("Data types(after preprocessing):\n",df.dtypes, "\n")

# Class distribution to check if the dataset is balanced or imbalanced
print("Income distribution after preprocessing:\n", df['income'].value_counts(normalize=True), "\n")

# Correlation (optional, only numeric vars)
print("Correlation among numerical features after preprocessing:\n", df[['age', 'hours-per-week', 'income']].corr(), "\n")
print(df.describe())

   income  age  education  sex  hours-per-week  workclass_Local-gov  \
0       0   39         13    0              40                False   
1       0   50         13    0              13                False   
2       0   38          9    0              40                False   
3       0   53          7    0              40                False   
4       0   28         13    1              40                False   

   workclass_Never-worked  workclass_Private  workclass_Self-emp-inc  \
0                   False              False                   False   
1                   False              False                   False   
2                   False               True                   False   
3                   False               True                   False   
4                   False               True                   False   

   workclass_Self-emp-not-inc  ...  occupation_Unknown  \
0                       False  ...               False   
1                       

## Train-Test Split

The dataset was split into training and test sets (90/10) to evaluate model performance on unseen data.

In [None]:
array = df.values
print(array)
X = array[:,1:]
y = array[:,0]
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)
# Verify shapes
print("Training set:", X_train.shape, y_train.shape)
print("Testing set:", X_test.shape, y_test.shape)

[[0 39 13 ... False False True]
 [0 50 13 ... False False True]
 [0 38 9 ... False False True]
 ...
 [0 27 12 ... False False True]
 [0 58 9 ... False False True]
 [1 52 9 ... False False True]]
Shape of X: (22782, 38)
Shape of y: (22782,)
Training set: (20503, 38) (20503,)
Testing set: (2279, 38) (2279,)


## Feature Scaling

Features were normalised using training-set parameters and applied consistently to the test set to ensure fair evaluation and prevent data leakage.
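The following sketch shows what "training-set parameters" means in practice, using a single hypothetical feature. The scaler learns its minimum and maximum from the training values only, so a test value outside that range maps outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature train and test sets
X_train_toy = np.array([[20.0], [40.0], [60.0]])
X_test_toy = np.array([[30.0], [70.0]])

# Learn min and max from the training data only
scaler = MinMaxScaler().fit(X_train_toy)

# Apply the same parameters to the test data
X_test_scaled = scaler.transform(X_test_toy)
print(X_test_scaled.ravel())
```

Here the training range is 20 to 60, so the test value 70 is scaled to 1.25. This is expected behaviour and not an error: the test set plays no part in choosing the scaling parameters.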

In [None]:
# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)
y_train = y_train.astype(int)

# transform testing data
X_test_norm = norm.transform(X_test)
y_test = y_test.astype(int)

print("Shape of X_train_norm: ", X_train_norm.shape, "\nand the X_train_norm:\n",X_train_norm,"\n")
print("Shape of X_test_norm: ", X_test_norm.shape, "\nand the X_test_norm:\n",X_test_norm)

Shape of X_train_norm:  (20503, 38) 
and the X_train_norm:
 [[0.28767123 0.6        1.         ... 0.         0.         1.        ]
 [0.30136986 0.6        0.         ... 1.         0.         0.        ]
 [0.61643836 0.4        1.         ... 1.         0.         0.        ]
 ...
 [0.10958904 0.6        1.         ... 0.         0.         1.        ]
 [0.38356164 0.6        0.         ... 0.         0.         1.        ]
 [0.16438356 0.53333333 1.         ... 0.         0.         1.        ]] 

Shape of X_test_norm:  (2279, 38) 
and the X_test_norm:
 [[0.53424658 0.53333333 1.         ... 0.         0.         1.        ]
 [0.49315068 0.66666667 1.         ... 1.         0.         0.        ]
 [0.24657534 0.53333333 1.         ... 0.         0.         1.        ]
 ...
 [0.31506849 0.53333333 0.         ... 0.         0.         1.        ]
 [0.1369863  0.6        1.         ... 0.         0.         1.        ]
 [0.21917808 0.53333333 1.         ... 0.         0.         1.  

## Supervised Learning: Classification Models

### Model Selection

Two supervised classification models were implemented and evaluated:

### Logistic Regression

Logistic Regression models the probability of a binary outcome using a logistic function. It assumes a linear relationship between the input features and the log-odds of the target variable. The model was trained with the liblinear solver and otherwise default parameters as a baseline before hyperparameter tuning.
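In standard notation, with a weight vector w and intercept b (symbols introduced here for illustration), the model and its log-odds form can be written as:

$$
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}},
\qquad
\log \frac{P(y = 1 \mid \mathbf{x})}{1 - P(y = 1 \mid \mathbf{x})} = \mathbf{w}^\top \mathbf{x} + b
$$

The second expression is the linearity assumption stated above: each feature shifts the log-odds by a fixed amount per unit.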

### Support Vector Machine (SVM)

Support Vector Machines construct a decision boundary that maximises the margin between classes. Kernel functions allow SVM to handle non-linear decision boundaries. The model was initially trained using default hyperparameters before further optimisation.

In [None]:
# logistic regression model, parameters can be changed
model = LogisticRegression(solver="liblinear")
model.fit(X_train_norm, y_train)
test_score = model.score(X_test_norm, y_test)
print("Testing Accuracy of LR:", test_score)

# Support Vector Machine for classification, parameters can be changed
model = SVC()
model.fit(X_train_norm, y_train)
test_score = model.score(X_test_norm, y_test)
print("Testing Accuracy of SVC:", test_score)

Testing Accuracy of LR: 0.799034664326459
Testing Accuracy of SVC: 0.7959631417288284


### Model Evaluation Strategy

To ensure robust performance estimation, 10-fold cross-validation was applied to both models.

The training set was partitioned into ten subsets, where nine folds were used for training and one fold for validation in each iteration. The final performance metric was computed as the average accuracy across all folds.

This approach reduces variance in evaluation and provides a more reliable estimate of generalisation performance compared to a single train-test split.
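One caveat: the notebook scales the training set once before cross-validation, so each validation fold contributes to the scaler's minimum and maximum. A possible alternative, sketched here on synthetic data from make_classification (not the census data), wraps the scaler and the classifier in a pipeline so the scaler is re-fitted on the training folds of every split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaler and classifier are fitted together inside each fold
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(solver="liblinear"))

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=2)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold_demo)

# Report the spread as well as the mean across the ten folds
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

With min-max scaling the effect on the scores is usually small, but the pipeline form keeps each validation fold fully unseen.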

In [None]:
kfold = KFold(n_splits=10, shuffle=True, random_state=2)

model = LogisticRegression(solver="liblinear")
results = cross_val_score(model, X_train_norm, y_train, cv=kfold)
print("Average Accuracy of LR:",results.mean())

model = SVC()
results = cross_val_score(model, X_train_norm, y_train, cv=kfold)
print("Average Accuracy of SVM:",results.mean())

Average Accuracy of LR: 0.8104181660344152
Average Accuracy of SVM: 0.8071995100545838


## Hyperparameter Tuning

To improve model performance, hyperparameter tuning was conducted separately for both Logistic Regression and Support Vector Machine models using GridSearchCV.

A range of candidate hyperparameters was defined for each model, and 10-fold cross-validation was applied to identify the optimal configuration based on average validation accuracy.

The performance of each model before and after tuning was compared to assess the impact of regularisation strength, kernel choice, and solver selection on generalisation performance.

In [None]:
# fine tune parameters for lr model
grid_params_lr = {
    'penalty': ['l1', 'l2'],
    'C': [1, 10],
    'solver': ['saga', 'liblinear']
}

lr = LogisticRegression(max_iter=150)
gs_lr_result = GridSearchCV(lr, grid_params_lr, cv=kfold).fit(X_train_norm, y_train)
print(gs_lr_result.best_score_)

grid_params_svc = {
    'kernel': ['linear', 'poly'],
    'C': [1, 10],
    'degree': [3, 8],
    'gamma': ['auto','scale']
}

svc = SVC()
gs_svc_result = GridSearchCV(svc, grid_params_svc, cv=kfold).fit(X_train_norm, y_train)
print(gs_svc_result.best_score_)



0.8104670654410103
0.8076381776884565
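In the SVC grid, degree and gamma are combined with both kernels, although the linear kernel ignores them, so some fits are repeated with effectively identical settings. As a sketch of one way to avoid this, scikit-learn accepts a list of grids; the names single_grid and split_grid are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# One dictionary: every kernel is crossed with every degree and gamma
single_grid = {
    "kernel": ["linear", "poly"],
    "C": [1, 10],
    "degree": [3, 8],
    "gamma": ["auto", "scale"],
}

# List of dictionaries: degree and gamma are only searched for the poly kernel
split_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["poly"], "C": [1, 10], "degree": [3, 8], "gamma": ["auto", "scale"]},
]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))
print(n_single, n_split)  # 16 candidates versus 10
```

The same list can be passed as the parameter grid to GridSearchCV; with 10-fold cross-validation this would save 60 model fits without removing any distinct configuration.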


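Comparing baseline and tuned models provides insight into how regularisation strength and kernel selection influence model stability and predictive accuracy. Here tuning changed the results only slightly: test accuracy moved from about 0.799 to 0.800 for Logistic Regression and from about 0.796 to 0.803 for the SVM.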

In [None]:
# evaluate the best Logistic Regression configuration on the held-out test set
test_accuracy = gs_lr_result.best_estimator_.score(X_test_norm, y_test)
print("LR accuracy in testing:", test_accuracy)
print(gs_lr_result.best_params_)

# evaluate the best SVC configuration on the held-out test set
test_accuracy = gs_svc_result.best_estimator_.score(X_test_norm, y_test)
print("SVC accuracy in testing:", test_accuracy)
print(gs_svc_result.best_params_)

LR accuracy in testing: 0.800351031154015
{'C': 10, 'penalty': 'l2', 'solver': 'saga'}
SVC accuracy in testing: 0.8034225537516455
{'C': 10, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}


## Unsupervised Learning: K-Means Clustering

K-Means clustering was applied to the full normalised dataset (no train-test split) to explore underlying structure without using class labels.

The number of clusters was set to two, aligning with the binary income classes, to investigate whether natural groupings in the data correspond to income categories.

Cluster centroids were analysed to interpret the prototype characteristics of each group, providing insight into typical demographic patterns within the dataset.

Although K-Means does not use label information during training, aligning the number of clusters with the number of target classes enables indirect comparison with supervised models.
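Centroids live in the scaled feature space, which makes them hard to read directly. A minimal sketch, assuming two hypothetical features (age and hours-per-week) and toy values, maps the centroids back to original units with the scaler's inverse_transform.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): columns are age and hours-per-week
X_toy = np.array([
    [25, 20], [27, 22], [26, 21],
    [50, 45], [52, 47], [51, 46],
], dtype=float)

# Scale, then cluster in the scaled space
scaler_toy = MinMaxScaler().fit(X_toy)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler_toy.transform(X_toy))

# Map centroids back to the original units for interpretation
centers_original = scaler_toy.inverse_transform(km.cluster_centers_)
print(np.round(centers_original, 1))
```

For one-hot columns, a centroid value read this way is the share of cluster members in that category.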

In [None]:
array2 = df.values
X2 = array2[:,1:]

# fit scaler on the full feature matrix (clustering uses all rows, not only the training split)
norm2 = MinMaxScaler().fit(X2)
X2 = norm2.transform(X2)

kmeans = KMeans(n_clusters=2, random_state=0).fit(X2)

Cluster sizes show how K-Means partitions the data. A strong imbalance would suggest that one cluster captures only a small subgroup; here the two clusters are of similar size (11,855 and 10,927 samples).

In [None]:
kmeans_labels = kmeans.labels_
unique_labels, unique_counts = np.unique(kmeans_labels, return_counts=True)
dict(zip(unique_labels, unique_counts))

{np.int32(0): np.int64(11855), np.int32(1): np.int64(10927)}

Inspecting the sample closest to each centroid gives a concrete prototype for each cluster and helps reveal which variables contribute most strongly to cluster separation.

In [None]:
kmeans_cluster_centers = kmeans.cluster_centers_
# index of the sample closest to each centroid
closest = pairwise_distances_argmin(kmeans_cluster_centers, X2)

# show the two data samples that best represent the two clusters
df.iloc[closest, :]

Unnamed: 0,income,age,education,sex,hours-per-week,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,...,occupation_Unknown,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White
11653,0,35,10,1,40,False,False,True,False,False,...,False,True,False,False,False,False,False,False,False,True
3351,1,47,10,0,44,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,True


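The clustering accuracy (about 0.69) is lower than that of the supervised models (about 0.80), reflecting the absence of target label information during training and reinforcing the advantage of supervised learning for predictive tasks. Because cluster IDs are arbitrary, the score is only meaningful once clusters are matched to classes; with two clusters, a value below 0.5 would simply mean the IDs are swapped.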

In [None]:
y2 = array2[:,0]
y2 = y2.astype(int)
kmeans_labels = kmeans.labels_

accuracy = accuracy_score(y2, kmeans_labels)
print("k means prediction accuracy:", accuracy)

k means prediction accuracy: 0.6935738741111404
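Since K-Means assigns cluster IDs arbitrarily, a label-invariant score takes the better of the two possible cluster-to-class assignments. The sketch below uses hypothetical labels to illustrate the idea; it is not the computation used above, which happens to give a value above 0.5 without any remapping.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true classes and cluster IDs (IDs are arbitrary)
y_true_toy = np.array([0, 0, 1, 1, 1])
cluster_ids = np.array([1, 1, 0, 0, 1])

# Accuracy with the IDs as given, and with 0 and 1 swapped
raw = accuracy_score(y_true_toy, cluster_ids)
flipped = accuracy_score(y_true_toy, 1 - cluster_ids)

# Keep the better of the two assignments
aligned = max(raw, flipped)
print("raw:", raw, "aligned:", aligned)
```

With more than two clusters, the same idea requires searching over cluster-to-class assignments rather than a single flip.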
