# Introduction

The dataset utilized is the Family Income and Expenditure Survey 2012 Vol 1. 


This notebook tackles a binary classification task on the label 'ELECTRIC': predicting whether or not a family has electricity in their home.

# Dataset

The Family Income and Expenditure Survey (FIES) 2012 is a national survey conducted by the Philippine Statistics Authority (PSA) to gather comprehensive data on family income and spending patterns across households in the Philippines. The dataset provides insights into income distribution, expenditure habits, and economic well-being, helping policymakers, researchers, and analysts assess poverty levels, economic growth, and financial behavior among Filipino households.

# Python Libraries and Modules

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preprocessing and Cleaning

The data was preprocessed and cleaned through these methods in order:

**1.) Standardizing Column Names**
- Column names were stripped of whitespace, converted to uppercase, and spaces were replaced with underscores to ensure consistency and avoid potential errors when referencing column names in code.
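The column-name standardization described above could be sketched as follows. This is a minimal illustration, not the code used to produce the cleaned file; the helper name `standardize_column_names` and the toy headers are assumptions made for the example.

```python
import pandas as pd

def standardize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, upper-cased, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                   # remove leading/trailing whitespace
        .str.upper()                              # convert to uppercase
        .str.replace(" ", "_", regex=False)       # replace inner spaces with underscores
    )
    return out

# Toy frame with messy headers (hypothetical, not taken from the FIES file)
toy = pd.DataFrame({" toinc ": [1.0], "Food Accom Srvc": [2.0]})
clean = standardize_column_names(toy)
print(list(clean.columns))
```

Working on a copy keeps the raw frame untouched, which makes it easier to compare headers before and after cleaning.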

**2.) Handling Missing Values**
- Columns with more than 50% missing values were dropped, since columns with excessive missing values can lead to misleading analysis.
- For numerical columns, missing values were filled with the median of the respective column. The median is used because it is less influenced by outliers than the mean.
- For categorical columns, missing values were filled with the mode (most frequent value).
- Three specific categorical columns were handled separately: OCCUP, KB, and CW. Whenever OCCUP has no value, KB and CW have no value either, so the N/A entries in all three were replaced with 0. Because CW already uses 0 as a valid code, its existing zeroes were first recoded to 7 before the N/A entries were filled with 0.
- Asset ownership features that did not have any input were assumed to have a value of 0 (meaning they do not own that item).
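The missing-value rules above could look roughly like the sketch below. It is an illustration under assumptions: the helper names `fill_missing` and `recode_cw`, the `drop_threshold` parameter, and the toy values are all hypothetical, and the actual cleaning code is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """Drop mostly-empty columns, then fill the rest with median or mode."""
    out = df.copy()
    # Drop columns whose share of missing values exceeds the threshold
    missing_share = out.isna().mean()
    out = out.loc[:, missing_share <= drop_threshold].copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numerical column: fill with the median
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical column: fill with the most frequent value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def recode_cw(cw: pd.Series) -> pd.Series:
    """Move existing 0 codes to 7 first, then use 0 to mean 'no value'."""
    return cw.replace(0, 7).fillna(0)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "TOINC": [100.0, np.nan, 300.0, 500.0],
    "HHTYPE": ["single", "single", None, "extended"],
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, 1.0],
})
filled = fill_missing(toy)
cw_recoded = recode_cw(pd.Series([0.0, 3.0, np.nan]))
print(filled)
print(cw_recoded.tolist())
```

Recoding the existing zeroes in CW before filling matters for ordering: if the N/A entries were filled first, the original 0 code and the new "no value" 0 could no longer be told apart.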

**3.) Removing Duplicates**
- Duplicate rows were removed as they could distort results and introduce bias.

**4.) Converting Data Types**
- Categorical columns (e.g., region, employment status, household type) were converted to category type.
- Income and expenditure columns were converted to float.
- Converting numerical columns to float ensures compatibility with mathematical operations whilst categorical columns were converted to category type for easier referencing.
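Steps 3 and 4 can be illustrated with a short sketch on toy data. The region labels and income values below are hypothetical, and the choice of columns is only an assumption for demonstration.

```python
import pandas as pd

# Toy data with one exact duplicate row (hypothetical values)
toy = pd.DataFrame({
    "W_REGN": ["NCR", "NCR", "CAR"],
    "TOINC": [1000, 1000, 2500],
})

# Step 3: remove exact duplicate rows
toy = toy.drop_duplicates().reset_index(drop=True)

# Step 4: categorical columns to category, income columns to float
toy["W_REGN"] = toy["W_REGN"].astype("category")
toy["TOINC"] = toy["TOINC"].astype(float)

print(toy.dtypes)
```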

**5.) Handling Outliers**
- The Interquartile Range (IQR) method was used to cap extreme values:
  - Values below Q1 - 1.5 * IQR were replaced with the lower bound.
  - Values above Q3 + 1.5 * IQR were replaced with the upper bound.
- Outliers can distort statistical analysis and machine learning models. Capping with the IQR bounds retains the distribution's general shape while reducing the influence of extreme values.

In [None]:
import pandas as pd
import numpy as np

# Load the cleaned dataset (imports must come before the first use of pd)
incexp_df = pd.read_csv("FIES_2012_Cleaned.csv")



# List of numeric columns
numeric_columns = ['W_SHSN', 'NATDC', 'HSE_ALTERTN', 'PSU', 'REGPC', 'T_ACTRENT',
                   'T_BIMPUTED_RENT', 'T_RENTVAL', 'BLDG_TYPE', 'RFACT', 'FSIZE', 'WALLS',
                   'BWEIGHT', 'AGRI_SAL', 'RSTR', 'NONAGRI_SAL', 'T_IMPUTED_RENT', 'AGE',
                   'T_FOOD_NEC', 'MEMBERS', 'NATPC', 'ROOF', 'FOOD_ACCOM_SRVC', 'W_HCN',
                   'POP_ADJ', 'SPOUSE_EMP', 'REGDC', 'HGC']

income_expenditure_columns = [
    "WAGES", "NETSHARE", "CASH_ABROAD", "CASH_DOMESTIC", "RENTALS_REC", "INTEREST", "PENSION", "DIVIDENDS",
    "OTHER_SOURCE", "NET_RECEIPT", "REGFT", "NET_CFG", "NET_LPR", "NET_FISH", "NET_FOR", "NET_RET", "NET_MFG",
    "NET_COM", "NET_TRANS", "NET_MIN", "NET_CONS", "NET_NEC", "EAINC", "TOINC", "LOSSES", "T_BREAD", "T_MEAT",
    "T_FISH", "T_MILK", "T_OIL", "T_FRUIT", "T_VEG", "T_SUGAR", "T_COFFEE", "T_MINERAL", "T_ALCOHOL", "T_TOBACCO",
    "T_OTHER_VEG", "T_FOOD_HOME", "T_FOOD_OUTSIDE", "T_FOOD", "T_CLOTH", "T_FURNISHING", "T_HEALTH", "T_HOUSING_WATER",
    "T_TRANSPORT", "T_COMMUNICATION", "T_RECREATION", "T_EDUCATION", "T_MISCELLANEOUS", "T_OTHER_EXPENDITURE",
    "T_OTHER_DISBURSEMENT", "T_NFOOD", "T_TOTEX", "T_TOTDIS", "T_OTHREC", "T_TOREC", "PCINC"
]

# Combine both lists
all_columns = numeric_columns + income_expenditure_columns

# Outlier handling function using IQR
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Cap the outliers
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

    return df

# Apply the outlier handling function to the selected columns
for col in all_columns:
    if col in incexp_df.columns:
        incexp_df = handle_outliers(incexp_df, col)


# Exploratory Data Analysis



As a sanity check, we verify whether any missing values remain in the cleaned dataset file.

In [None]:
missing_values = incexp_df.isnull().sum()
print("Missing values per column:\n", missing_values[missing_values > 0])
# Get data types of each column
data_types = incexp_df.dtypes

# Display columns with missing values along with their data types
missing_data_info = pd.DataFrame({
    "Missing Values": missing_values[missing_values > 0],
    "Data Type": data_types[missing_values > 0]
})

print("Columns with Missing Values and Their Data Types:\n")
print(missing_data_info)

## Display Dataset/Table Information
To familiarize ourselves with our dataset, we first display its summary using the `.info()` function. Here we can see the number of rows and columns and the data type of each feature.

In [None]:
# prints the info on the dataset
incexp_df.info()

# prints the rows and columns of the dataset
print("\nRows and columns of the Dataset:")
print(incexp_df.shape)

## Using `.head()` and `.tail()` function
We use the head and tail functions to see the first and last rows of the dataset. This gives an idea of what kind of values we are working with.

In [None]:
incexp_df.head()

In [None]:
incexp_df.tail()

In [None]:
print(incexp_df.dtypes)

## Label Variable Distribution (Electric)

The distribution of the ELECTRIC column is heavily skewed towards households that already have electricity. Because of this, the classes need to be rebalanced for the machine learning models to work properly.

In [None]:
incexp_df['ELECTRIC'].value_counts().plot(kind='bar')
plt.title('Target Variable Distribution')
plt.xlabel('ELECTRIC')
plt.ylabel('Count')
plt.show()

The two categories (1 and 2) of ELECTRIC need to be equalized, either by creating synthetic data for the minority category (2) or by deleting rows from the majority category (1). The majority class was undersampled so that training still uses only real-world data and to reduce the risk of overfitting from too many rows.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Define categorical columns to exclude from standardization
categorical_columns = [
    # Text-based categorical
    "W_REGN", "URB", "SEX", "MS", 'W_OID', "JOB", "HHTYPE", "TENURE", "TOILET", "ELECTRIC", "WATER",
    # Numeric categorical 
    "OCCUP", "KB", "CW", "AGELESS5", "AGE5_17", "EMPLOYED_PAY", "EMPLOYED_PROF"
]

# Checking class distribution before balancing
print("Class distribution before balancing:")
print(incexp_df['ELECTRIC'].value_counts())

# Splitting features and target variable
X = incexp_df.drop('ELECTRIC', axis=1)  # Drop the target column
y = incexp_df['ELECTRIC']               # Target column

# Separate categorical and numerical columns
numerical_columns = [col for col in X.columns if col not in categorical_columns]

# Standardize only numerical columns
scaler = StandardScaler()
X_scaled = X.copy()
X_scaled[numerical_columns] = scaler.fit_transform(X[numerical_columns])

# Combine scaled features and target back into a single DataFrame
# (X_scaled and y keep the same original index, so rows stay aligned)
balanced_df = pd.concat([X_scaled, y], axis=1)

# Separate majority and minority classes
majority_class = balanced_df[balanced_df['ELECTRIC'] == 1]
minority_class = balanced_df[balanced_df['ELECTRIC'] == 2]

### OPTION 1: Undersample Majority Class ###
majority_undersampled = majority_class.sample(len(minority_class), random_state=42)
balanced_df = pd.concat([majority_undersampled, minority_class])

# Shuffle the balanced dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Update incexp_df with the balanced data
incexp_df = balanced_df

# Checking the new class distribution after balancing
print("\nClass distribution after balancing:")
print(incexp_df['ELECTRIC'].value_counts())

# Displaying a few rows of the updated DataFrame
print("\nSample of the balanced dataset:")
print(incexp_df.head())


## Feature Importance


In [None]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

X = incexp_df.drop(columns=['ELECTRIC'])  
y = incexp_df['ELECTRIC']

rf = RandomForestClassifier()
selector = RFE(rf, n_features_to_select=20, step=10)  # step=10 removes 10 features per iteration
selector.fit(X, y)

selected_features = X.columns[selector.support_]
print("Selected features:", selected_features)


Selected features: Index(['W_OID', 'RFACT', 'NET_RECEIPT', 'REGFT', 'T_MEAT', 'T_MILK', 'T_VEG',
       'T_SUGAR', 'T_HOUSING_WATER', 'T_RENTVAL', 'T_COMMUNICATION', 'T_NFOOD',
       'T_TOTEX', 'TOILET', 'WATER', 'DISTANCE', 'TV_QTY', 'CD_QTY',
       'CELLPHONE_QTY', 'PCINC'],
      dtype='object')

## Feature Analysis

## Investigating Correlation between Features
In this section, we will investigate which features have a high correlation with each other. This is to determine which features should be used or removed depending on the data analysis we want to apply.

Here we compute the correlation of each selected feature with the label `ELECTRIC`.

In [None]:
correlations = {}
for feature in selected_features:
    correlations[feature] = incexp_df[[feature, 'ELECTRIC']].corr().iloc[0, 1]

# Convert the dictionary to a DataFrame for better visualization
correlation_df = pd.DataFrame(list(correlations.items()), columns=['Feature', 'Correlation_with_ELECTRIC'])

# Sort the correlations in descending order
correlation_df = correlation_df.sort_values(by='Correlation_with_ELECTRIC', ascending=False)

# Print the correlation with ELECTRIC
print(correlation_df)

Nothing changed.

In [None]:
# List of features to include in the correlation test
features = ['W_OID', 'RFACT', 'NET_RECEIPT', 'TOINC', 'T_MEAT', 'T_SUGAR',
            'T_HOUSING_WATER', 'T_RENTVAL', 'T_COMMUNICATION', 'T_MISCELLANEOUS',
            'T_NFOOD', 'T_TOTEX', 'T_TOTDIS', 'T_TOREC', 'TOILET', 'WATER',
            'TV_QTY', 'CD_QTY', 'REF_QTY', 'PCINC']

# Selecting the features and computing the correlation matrix
correlation_test = incexp_df[features].corr()

# Display the correlation matrix
print(correlation_test)

T_TOREC, T_TOTDIS, and T_TOTEX are highly correlated with TOINC, so they are dropped.

In [None]:
#Remaining Features
selected_features = ['WATER', 'NET_RECEIPT', 'TOILET', 'TV_QTY', 'T_HOUSING_WATER', 
            'CD_QTY', 'T_NFOOD', 'TOINC', 'PCINC', 'T_COMMUNICATION', 
            'T_MISCELLANEOUS', 'REF_QTY', 'W_OID', 'AGE', 'DISTANCE', 'T_MEAT', 'T_SUGAR', 'RFACT']


# Initial Model Training

## Logistic Regression

Since this is a binary classification problem with a categorical target variable, logistic regression is well suited to it. It is also easy to interpret, as it provides probabilities for each class, which helps in understanding the confidence of predictions. Lastly, logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable, which makes it a reasonable baseline here.
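To illustrate the point about class probabilities, the sketch below fits a logistic regression on synthetic data and prints the predicted probability of each class. The data comes from `make_classification`, not from FIES, and the names `X_demo`, `y_demo`, and `clf` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (stand-in for the real features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
y_demo = y_demo + 1  # relabel 0/1 as 1/2 to mirror the ELECTRIC codes

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Each row holds P(class 1) and P(class 2); the two values sum to 1
proba = clf.predict_proba(X_demo[:3])
print(clf.classes_)
print(proba.round(3))
```

A probability close to 0.5 signals a low-confidence prediction, which is useful when reading the error analysis later on.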



## Naive Bayes

Naive Bayes was one of the chosen models as it is a well-known probabilistic algorithm for classification. Gaussian Naive Bayes was used specifically because the data contains both continuous and categorical features, all of which are numerically encoded, so the model can use them to compute its predictions by modeling each feature with a per-class Gaussian distribution.

## K-Nearest Neighbors Classifier

We now use the KNN Classifier for predicting whether a given household has electricity. KNN, or K-Nearest Neighbors, is known as one of the simplest kinds of supervised machine learning, as it bases its prediction only on the distance to the nearest data point(s).

We decided to use this since our task involves classification and KNN is easy to understand and implement, with no need for complex training. However, we also want to test whether KNN works well on a big dataset, since it is commonly claimed to work well only with small to medium datasets.
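The idea of predicting from the nearest points can be sketched from scratch in a few lines. This is a simplified illustration with Euclidean distance and majority voting on toy points; the function `knn_predict` and the data are hypothetical, and the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters labelled 1 and 2
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]))
pred_b = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]))
print(pred_a, pred_b)
```

Note that there is no training step: all the work happens at prediction time, which is why KNN can become slow on large datasets.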

## All Models and their Outputs

In [None]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Define features and target variable
X = incexp_df[selected_features]  # Using the selected features from RFE
y = incexp_df['ELECTRIC']

# Dictionary to store model names and their instances
models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(weights='distance')
}

# K-Fold Cross-Validation (K=5)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Iterate through each model
for model_name, model in models.items():
    accuracies = []
    print(f"\nModel: {model_name}")
    
    # Perform K-Fold Cross-Validation
    for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
        # Use iloc[] for row selection
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        # Train the model
        model.fit(X_train, y_train)
        
        # Predict using the model
        y_pred = model.predict(X_test)
        
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append(accuracy)
        
        print(f"Fold {fold} Accuracy: {accuracy * 100:.2f}%")
        
        # Print classification report
        print(f"Classification Report for {model_name} (Fold {fold}):")
        print(classification_report(y_test, y_pred, zero_division=1))
        
        # Generate and display confusion matrix
        #cm = confusion_matrix(y_test, y_pred)
        # plt.figure(figsize=(6, 5))
        #sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_)
        #plt.title(f'Confusion Matrix for {model_name} (Fold {fold})')
        #plt.xlabel('Predicted')
        #plt.ylabel('Actual')
        #plt.show()

    # Average accuracy over all folds
    avg_accuracy = np.mean(accuracies) * 100
    # Variance of the fold accuracies, computed on percentages
    # (units are squared percentage points, so no % sign is printed)
    variance = np.var(np.array(accuracies) * 100)
    print(f"Average Accuracy for {model_name}: {avg_accuracy:.2f}%")
    print(f"Variance of Accuracy for {model_name}: {variance:.4f}")


# Error Analysis

## Logistic Regression

![Screenshot 2025-03-27 084622.png](attachment:ec5ed643-94d8-46b5-a159-14e8242f69a0.png)

We can see that the precision of class 1 is 100%, which would suggest that every class 1 prediction is a true positive. However, this figure is not meaningful: the recall of class 1 is 0.00, meaning class 1 was never predicted in the first place, and the report shows 1.00 only because `zero_division=1` substitutes that value when a class has no predicted samples. Class 2 has a precision of 51% because every instance was predicted as class 2; with two classes and a balanced dataset (50/50 distribution of both classes), this explains the average accuracy of about 50%. The main challenge now is to make the model predict both classes and not just one of them.
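The pattern described above can be reproduced on a toy example in which a degenerate model always predicts class 2. The labels below are made up for illustration; only the metric functions and the `zero_division=1` setting mirror what the notebook uses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Balanced toy labels and a model that always predicts class 2
y_true = np.array([1, 1, 1, 2, 2, 2])
y_pred = np.full_like(y_true, 2)

# Class 1 is never predicted, so its precision is undefined;
# zero_division=1 makes scikit-learn report 1.0 instead
p1 = precision_score(y_true, y_pred, pos_label=1, zero_division=1)
r1 = recall_score(y_true, y_pred, pos_label=1, zero_division=1)
p2 = precision_score(y_true, y_pred, pos_label=2, zero_division=1)
acc = accuracy_score(y_true, y_pred)

print(f"class 1 precision={p1}, recall={r1}")
print(f"class 2 precision={p2}, accuracy={acc}")
```

Setting `zero_division=0` instead would report a precision of 0 for the never-predicted class, which arguably makes this failure mode easier to spot.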

## Naive Bayes

The reports show that the Gaussian Naive Bayes model tends to misclassify class 1 somewhat more often than class 2. Class 1 has a precision between 50% and 60%, but its recall is below 50%. The predictions for class 2 are more reliable, with precision scores between 53% and 56% and a recall above 60%. This shows that the model predicts class 2 more often than class 1.

## KNN Classifier

![download.png](attachment:da223108-d1a5-4422-9018-c46b0b25b6c8.png)

Here, we can see the last KNN fold results in an accuracy of 72.85%, a reasonable accuracy but not really a high performing one. This could still be improved.

Additionally, we can see that the precision, recall, and F-score of classes 1 and 2 are balanced, all sitting around 0.7, and that the evaluated fold is almost balanced with 1035 vs. 1079 instances.

Since class 1 has higher recall than precision, it catches most of the true positives of the category but also produces some false positives. Conversely, class 2 has higher precision than recall, meaning that it is better at avoiding false positives but misses some true positives in its own class. We now try to improve the model performance of KNN.

# Improving Model Performance

## Logistic Regression

The Optuna library was utilized to determine the hyperparameters that give the highest accuracy. Optuna uses a Bayesian optimization approach that draws on the results of past trials to choose which hyperparameters to try next.

In Optuna the tuning process goes from: 
- Defining the Objective Function
- Creating a Study
- Running the Optimization
- Extracting the Best Hyperparameters

In [None]:
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import numpy as np

# Define the objective function
def objective(trial):
    # Suggest hyperparameters to optimize
    penalty = trial.suggest_categorical("penalty", ["l1", "l2", "elasticnet", "none"])
    # suggest_float replaces the deprecated suggest_loguniform / suggest_uniform
    C = trial.suggest_float("C", 1e-4, 1e2, log=True)
    solver = trial.suggest_categorical("solver", ["liblinear", "saga"])
    l1_ratio = trial.suggest_float("l1_ratio", 0.0, 1.0) if penalty == "elasticnet" else None
    
    # Logistic Regression model with suggested hyperparameters
    model = LogisticRegression(
        penalty=penalty,
        C=C,
        solver=solver,
        l1_ratio=l1_ratio if penalty == "elasticnet" else None,
        max_iter=1000
    )

    # K-Fold Cross-Validation (K=5)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    accuracies = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        # Train the model
        try:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            accuracies.append(accuracy)
        except Exception as e:
            return 0.0  # Return zero if training fails
    
    # Return the average accuracy over all folds
    avg_accuracy = np.mean(accuracies)
    return avg_accuracy


In [None]:
import optuna
from sklearn.linear_model import LogisticRegression

# Create a study and optimize the objective function
study = optuna.create_study(direction="maximize")

def objective(trial):
    # Suggest solver and penalty independently
    solver = trial.suggest_categorical("solver", ["liblinear", "saga"])
    penalty = trial.suggest_categorical("penalty", ["l1", "l2", "elasticnet", "none"])

    # Check for incompatible combinations and skip trial if invalid
    if solver == "liblinear" and penalty not in ["l1", "l2"]:
        raise optuna.TrialPruned()

    if solver == "saga" and penalty == "none":
        raise optuna.TrialPruned()

    # Use suggest_float with log=True for C
    C = trial.suggest_float("C", 1e-4, 1e2, log=True)

    # Use suggest_float instead of suggest_uniform for l1_ratio if using elasticnet
    l1_ratio = trial.suggest_float("l1_ratio", 0.0, 1.0) if penalty == "elasticnet" else None

    # Handle the 'none' penalty correctly
    penalty = None if penalty == "none" else penalty

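    # Create and train the model.
    # Note: X_train, X_test, y_train, and y_test are the globals left over from
    # the last fold of the earlier cross-validation loop, so each trial is
    # scored on that single split rather than on all five folds.
    model = LogisticRegression(penalty=penalty, C=C, solver=solver, l1_ratio=l1_ratio, max_iter=1000)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    return accuracy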

study.optimize(objective, n_trials=50)

# Print the best hyperparameters and score
print(f"\nBest trial: {study.best_trial.params}")
print(f"Best accuracy: {study.best_trial.value * 100:.2f}%")


After running the optimization, the accuracy is now approximately 90%.

In [None]:
best_params = {
    'solver': 'liblinear',
    'penalty': 'l1',
    'C': 2.1575802199275707
}


# Train the optimized Logistic Regression model
improved_log_model = LogisticRegression(
    solver=best_params['solver'],
    penalty=best_params['penalty'],
    C=best_params['C'],
    max_iter=1000  # Increase if convergence issues arise
)

The hyperparameters that most strongly affect the Logistic Regression model are the solver (liblinear) and the penalty (l1). The C value affects it as well, although to a lesser extent.

The liblinear solver likely helped because it is well suited to binary classification, which is the task here (with or without electricity), and because it supports the L1 penalty. The l1 penalty in turn introduces L1 regularization, which has the effect of shrinking less important feature coefficients to exactly zero.
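The sparsity effect of the L1 penalty can be observed on synthetic data, as in the sketch below. The dataset is generated with `make_classification` and the small `C=0.01` is chosen only to make the effect visible; neither reflects the tuned values or the FIES data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative features and 17 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

# Same solver and regularization strength, different penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.01).fit(X_demo, y_demo)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.01).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("Zero coefficients with L1:", n_zero_l1)
print("Zero coefficients with L2:", n_zero_l2)
```

L2 regularization shrinks coefficients toward zero without usually reaching it, whereas L1 tends to zero out uninformative features entirely, which acts as a form of built-in feature selection.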

## Naive Bayes


GridSearchCV was used to tune the `var_smoothing` hyperparameter of the Gaussian Naive Bayes model.



In [None]:
params_nb = {'var_smoothing': np.logspace(0,-9, num=100)}

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

In [None]:
grid_model = GridSearchCV(estimator=GaussianNB(), param_grid = params_nb, verbose = 1, cv = 10, n_jobs=1)

In [None]:
accuracies = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Tune var_smoothing on the training fold; the best estimator is refit automatically
    grid_model.fit(X_train, y_train)

    # Predict with the tuned grid_model (not the leftover `model` variable)
    y_pred = grid_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

print(f"Average Accuracy: {np.mean(accuracies) * 100:.2f}%")

## KNN Classifier

There are different ways of improving a KNN model. The most common is hyperparameter tuning. We decided to tune both the hyperparameter k and the distance metric used by the model.

### Cross-validation for getting the Hyperparameter K

Since KNN relies heavily on the hyperparameter k, we use cross-validation to determine its best value.

We use the sklearn.model_selection.cross_val_score() to do this.

In [None]:
from sklearn.model_selection import cross_val_score

We add an array of possible values for the hyperparameter k. We'll add 10 arbitrary values.


In [None]:
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 45, 100]

Additionally, we instantiate a zero-filled array to hold the scores of each value of the hyperparameter k, with one column per fold.

In [None]:
k_scores = np.zeros((len(k_choices), 5))

We combine the X_train, X_test, y_train, and y_test values to be used.

In [None]:
X = np.concatenate((X_train, X_test))
y = np.concatenate((y_train, y_test))

for i, k in enumerate(k_choices):
    print(f"k is : {k}")
    # Evaluate a KNN model with this k using 5-fold cross-validation
    model = KNeighborsClassifier(n_neighbors=k)
    k_scores[i] = cross_val_score(model, X, y, cv=5)
    print(k_scores[i])

To understand the results better, we plot them using matplotlib.pyplot, imported as plt.

In [None]:
import matplotlib.pyplot as plt

def scatter_plot(scores):
    for i in range(len(scores)):
        x=[k_choices[i]] * 5
        plt.scatter(x, scores[i])

scatter_plot(k_scores)

To visualize the results more clearly, we calculate the mean and standard deviation of the fold scores for each value of the hyperparameter k.

We use np.mean to get the average score per KNN model and np.std for getting the standard deviation.

In [None]:
avg_k_scores = np.mean(k_scores, axis=1)
print("Average of k_scores: ", avg_k_scores)

std_k_scores = np.std(k_scores, axis=1)
print("\nStandard Deviation of k_scores: ", std_k_scores)

We now visualize this in a graph.

In [None]:
scatter_plot(k_scores)

plt.errorbar(k_choices, avg_k_scores, yerr=std_k_scores)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')

Here we can see that the accuracy peaks at the orange dots. Based on the k_choices array, the orange dots correspond to k = 3, which means that setting the hyperparameter k to 3 results in a higher accuracy than the other values.

We now use 3 as our hyperparameter k in modeling our new KNN classifier.

In [None]:
knn_classifier = KNeighborsClassifier(3)
knn_classifier.fit(X_train, y_train)

We first test it on the training data to determine the training accuracy.

In [None]:
y_predicted = knn_classifier.predict(X_train)
def compute_accuracy(predictions, actual):
    num_correct = (predictions == actual).sum()
    accuracy = num_correct / predictions.shape[0] * 100
    return accuracy
print("Training accuracy: ", compute_accuracy(y_train, y_predicted))
print("\nClassification report:\n ", classification_report(y_train, y_predicted))

We now test it on the testing data.

In [None]:
y_predicted = knn_classifier.predict(X_test)

print("Testing accuracy: ",  compute_accuracy(y_test, y_predicted))
print("\nClassification_report:\n", classification_report(y_test, y_predicted))

From the results, simply adjusting the hyperparameter k did not improve the model by much.

Next, we adjust the distance metric used by the model while keeping the same value for the hyperparameter k.

## Testing other Distance Metrics

Since KNN classifier uses Euclidean distance as the default, we will explore other distance metrics such as Manhattan, Minkowski, and Cosine distance to see whether we can improve our KNN model more.

### Manhattan Distance

Instead of the straight-line (Euclidean) distance, the Manhattan distance is the sum of the absolute differences of the points' Cartesian coordinates.
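As a small worked example of the difference, the sketch below computes both distances for two toy points. The points are arbitrary and serve only as an illustration.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Manhattan: sum of absolute coordinate differences, |1-4| + |2-6| = 7
manhattan = np.sum(np.abs(a - b))

# Euclidean: straight-line distance, sqrt(3^2 + 4^2) = 5
euclidean = np.sqrt(np.sum((a - b) ** 2))

print(manhattan, euclidean)
```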

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors= 3, metric= 'manhattan')

In [None]:
knn_classifier.fit(X_train, y_train)

Test it on the training data first.

Next, testing it on the test data.

Here we can see that the accuracy for both training and testing has slightly improved by 3% (based on the testing accuracy) when using Manhattan distance.

### Minkowski Distance

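The Minkowski distance can be considered the generalization of the Euclidean and Manhattan distances, controlled by a power parameter p (p = 1 gives Manhattan, p = 2 gives Euclidean).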
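The relationship between the three metrics can be checked numerically. The helper `minkowski` below is a hypothetical from-scratch function written for illustration, not a scikit-learn API.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

d_p1 = minkowski(a, b, 1)  # equals the Manhattan distance, 7
d_p2 = minkowski(a, b, 2)  # equals the Euclidean distance, 5

print(d_p1, d_p2)
```

In scikit-learn, `KNeighborsClassifier` exposes this order through its `p` parameter, which defaults to 2, so `metric='minkowski'` without changing `p` behaves like the Euclidean distance.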

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors= 3, metric= 'minkowski')

In [None]:
knn_classifier.fit(X_train, y_train)

We first test it on the training dataset.

In [None]:
y_predicted = knn_classifier.predict(X_train)

print("Training accuracy: ", compute_accuracy(y_train, y_predicted))
print("\nClassification report:\n ", classification_report(y_train, y_predicted))

Next, the test dataset.

In [None]:
y_predicted = knn_classifier.predict(X_test)

print("Testing accuracy: ", compute_accuracy(y_test, y_predicted))
print("\nClassification report:\n ", classification_report(y_test, y_predicted))

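Here, we can see that using the Minkowski distance does not improve the accuracy for either training or testing. This is expected, since with scikit-learn's default of p = 2 the Minkowski metric is the same as the Euclidean distance. We now move on to the last one.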

### Cosine Distance

This measures the difference between two vectors using the cosine of the angle between them, so it depends on their direction rather than their magnitude.
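A minimal sketch of the computation, assuming the usual definition of cosine distance as one minus the cosine similarity. The function name `cosine_distance` and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same direction, different magnitude: distance is (numerically) 0
same_direction = cosine_distance(np.array([1.0, 2.0]), np.array([2.0, 4.0]))

# Perpendicular vectors: distance is 1
orthogonal = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 3.0]))

print(same_direction, orthogonal)
```

Because magnitude is ignored, two households with proportionally similar feature values but very different scales would be treated as close, which may be one reason this metric is a poor fit for amount-based features.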

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors=3, metric= 'cosine')

knn_classifier.fit(X_train, y_train)

In [None]:
y_predicted = knn_classifier.predict(X_train)

print("Training accuracy: ", compute_accuracy(y_train, y_predicted))
print("\nClassification report:\n ", classification_report(y_train, y_predicted))

Just from the training accuracy, we can already tell that the Cosine distance does not improve our model; instead, it worsens it.

### Final Hyperparameters

From the cross-validation and the tests on different metrics, we conclude that the hyperparameter k = 3 with the Manhattan distance provides an increase in accuracy over the original KNN model. While the gain may not be big, it shows that further tuning of hyperparameters helps improve the model.

We will now finalize the knn_classifier variable with the said hyperparameters.

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors=3, metric='manhattan')

knn_classifier.fit(X_train, y_train)

# Model Performance Summary

In [None]:
models = {
    "Logistic Regression": improved_log_model,
    "Naive Bayes": grid_model,
    "K-Nearest Neighbors": knn_classifier
}

# K-Fold Cross-Validation (K=5)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Iterate through each model
for model_name, model in models.items():
    accuracies = []
    print(f"\nModel: {model_name}")
    
    # Perform K-Fold Cross-Validation
    for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
        # Use indexing for NumPy arrays
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        # Train the model
        model.fit(X_train, y_train)
        
        # Predict using the model
        y_pred = model.predict(X_test)
        
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append(accuracy)
        
        print(f"Fold {fold} Accuracy: {accuracy * 100:.2f}%")
        
        # Print classification report
        print(f"Classification Report for {model_name} (Fold {fold}):")
        print(classification_report(y_test, y_pred, zero_division=1))
        
        # Generate and display confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(6, 5))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_)
        plt.title(f'Confusion Matrix for {model_name} (Fold {fold})')
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.show()

    # Average accuracy over all folds
    avg_accuracy = np.mean(accuracies) * 100
    print(f"Average Accuracy for {model_name}: {avg_accuracy:.2f}%")


# Insights and Conclusions

Training the models was easier than expected, since there are libraries that aid in finding the best hyperparameters for each model. The hardest part is still the preprocessing and feature extraction. Preprocessing is tedious, especially with the number of columns present, although it helped that the categorical variables were one-hot encoded beforehand and that there were not many missing values.

Some models produced better results on the initial model training but worse compared to the other models after tuning. 

# References

AI Tools were utilized for this work, namely ChatGPT.


Jain, K. (2021, Apr 2). How to Improve Naive Bayes?. Medium. Retrieved from https://medium.com/analytics-vidhya/how-to-improve-naive-bayes-9fa698e14cba

Gaussian Naive Bayes. (2025, Jan 29). GeeksforGeeks. Retrieved from https://www.geeksforgeeks.org/gaussian-naive-ba

Past notebooks were used as a guide. These notebooks include:

K-Nearest Neighbors.ipynb
Additionally, short descriptions for the KNN metrics were taken from:

https://www.datacamp.com/tutorial/cosine-distance
http://en.wikipedia.org/wiki/Taxicab_geometry
STINTSY slides provided in Canvases/