In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore') # mute warning messages
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load the dataset
# www.kaggle.com/datasets/rohit265/credit-card-eligibility-df-determining-factors
data = pd.read_csv('../data/raw/dataset.csv')
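# One-hot encode the categorical columns so they can be used as numeric features.
categorical_cols = ['Income_type', 'Education_type', 'Family_status', 'Housing_type', 'Occupation_type']
data_processed = pd.get_dummies(data, columns=categorical_cols)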
print(data_processed)

In [None]:
# Define features and target.
# Based on the team's previous analysis, only the following features were found to have a significant effect on 'Target'
# (absolute value of the coefficient above 0.05 and p-value below 0.05).
feature_cols = [
    'Num_children',
    'Income_type_Commercial associate',
    'Income_type_Pensioner',
    'Income_type_State servant',
    'Income_type_Working',
    'Family_status_Single / not married',
]
X = data_processed[feature_cols]
y = data_processed['Target']

In [None]:
# Split the dataset into training and test sets so the model is evaluated on data it was not trained on.
# Stratifying on y preserves the class distribution in both splits, which matters for an imbalanced dataset.
# Fixing random_state makes the split reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
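
A quick way to confirm that stratification behaved as described is to compare the class shares in the two splits. The sketch below uses a small synthetic dataset (the 13% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration, not values taken from the real data); the same two `value_counts` calls can be applied to `y_train` and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, roughly 13% positives (assumed ratio).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"Num_children": rng.integers(0, 4, size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.13).astype(int), name="Target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y_demo, each class should make up nearly the same share of both splits.
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```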


In [None]:
# K-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("KNN Accuracy: ", accuracy_score(y_test, y_pred_knn))
print("Classification Report: \n", classification_report(y_test, y_pred_knn))


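The K-Nearest Neighbors (KNN) model reached an overall accuracy of 0.681. On its own this number overstates the model's usefulness: the per-class recalls reported below imply that roughly 87% of the test cases are rejections, so a classifier that always predicts rejection would score about 0.87. The classification report gives a more detailed picture of the model's effectiveness: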
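
One way to put an accuracy figure in context on an imbalanced dataset is to compare it with a trivial classifier that always predicts the majority class. The sketch below does this with scikit-learn's `DummyClassifier` on synthetic labels (the 13% positive rate and the `X_demo`/`y_demo` names are assumptions for illustration); in the notebook, the same baseline could be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, imbalanced labels: roughly 13% positives (assumed ratio).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.13).astype(int)

# Baseline that ignores the features and always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, y_base)
baseline_recall_1 = recall_score(y_demo, y_base, pos_label=1)
print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Baseline recall for class 1: {baseline_recall_1:.3f}")
```

The baseline scores high on accuracy while never identifying a single positive case, which is why per-class metrics matter more than accuracy here.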

For Class 0 (Rejections):

Precision: 0.87
Recall: 0.74
F1-Score: 0.80
Precision for rejections is 87%: when the model predicts a rejection, it is correct 87% of the time. Because rejections dominate the dataset, this figure is close to what the class balance alone would produce. Recall is lower at 74%: the model identifies 74% of all actual rejections and misclassifies the remaining 26% as approvals.

For Class 1 (Approvals):

Precision: 0.14
Recall: 0.27
F1-Score: 0.18
The model performs poorly on approvals. Precision of 14% means that 86% of predicted approvals are actually rejections (false positives), and recall of 27% means that the model misses 73% of actual approvals.

Definitions
Precision: The ratio of true positive predictions to the total number of positive predictions made. It reflects how reliable a positive prediction is.
Recall: The ratio of true positive predictions to the total number of actual positives. It measures the model's ability to identify all relevant cases.
F1-Score: The harmonic mean of precision and recall.
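
The definitions of precision and recall above can be checked by hand against a confusion matrix. The following is a minimal sketch on ten made-up labels (`y_true_demo` and `y_pred_demo` are illustrative values, not model output); it computes both metrics from the raw counts and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 6 actual negatives, 4 actual positives.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)  # true positives / all predicted positives
recall_manual = tp / (tp + fn)     # true positives / all actual positives

print(f"Precision: {precision_manual:.3f} (sklearn: {precision_score(y_true_demo, y_pred_demo):.3f})")
print(f"Recall:    {recall_manual:.3f} (sklearn: {recall_score(y_true_demo, y_pred_demo):.3f})")
```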

Addressing the Imbalanced Dataset
The dataset is highly imbalanced, with significantly more rejections (Class 0) than approvals (Class 1). To improve the model's performance, especially for the minority class (approvals), several strategies can be employed:

1. Increase sample size for Class 1:
    Collect more samples of approved applications to balance the dataset better.

2. Advanced models:
    Explore more sophisticated models, such as gradient boosting machines, which often perform well on tabular data and accept per-sample weights that can help offset class imbalance.

3. Hyperparameter tuning:
    Use grid search or another tuning technique to choose hyperparameters such as n_neighbors, scoring on a metric that reflects minority-class performance (for example, F1) rather than accuracy.

4. Ensemble methods:
    Implement ensemble methods to combine the predictions of multiple models, potentially improving overall performance.
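
The hyperparameter tuning strategy listed above could be sketched as follows. This is an illustration under stated assumptions, not the notebook's method: the data comes from `make_classification` with an assumed 87/13 class split, the parameter grid is arbitrary, and a `StandardScaler` step is added because KNN is distance-based and the original features are not scaled. F1 on class 1 is used as the selection metric so that the search does not simply reward predicting the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real features (assumed 87/13 split).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.87, 0.13], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling lives inside the pipeline so it is refitted on each training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(classification_report(y_te, search.predict(X_te), zero_division=0))
```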

By addressing these points, we can aim to develop a more robust model that accurately predicts both credit card approvals and rejections, leading to better decision-making in credit card application processes.