# AMSA: K-Nearest Neighbors Classification

## Relevant libraries

In [11]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

from sklearn.model_selection import StratifiedKFold # for creating k-fold cv and dealing with the class imbalance issue

from sklearn.neighbors import KNeighborsClassifier # for the k-nn classification model

from sklearn.pipeline import Pipeline # for generating the pipeline
from sklearn.compose import make_column_transformer  # for applying the appropriate transformation to each column
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler # OneHotEncoder converts categorical variables into numbers, MinMaxScaler rescales the features to lie between zero and one

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # for tuning the hyperparameters

## Data

Importing the data

In [2]:
default_train_complete = pd.read_csv('../data/default_train_complete.csv').drop(columns = ['LoanID'])
X_train = default_train_complete.drop(columns = ['Default'])
y_train = default_train_complete['Default']

### Cross validation

Next, we create a 10-fold cross-validation split, stratified on the target variable "Default".

In [3]:
cv_folds = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 30897) # the shuffle argument is used to randomize the order of the elements in the data before splitting them into folds
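
To illustrate what stratification does, here is a minimal sketch on a synthetic target (an assumption for illustration, not the loan data) with 20% positives. It checks the share of positives in each held-out fold; the names `y_toy`, `X_toy`, and `toy_folds` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy target: 80 non-defaulters (0) and 20 defaulters (1)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # the features do not influence the split

toy_folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=30897)

# Share of positives in each held-out fold
fold_positive_rates = [
    float(y_toy[test_idx].mean()) for _, test_idx in toy_folds.split(X_toy, y_toy)
]
print(fold_positive_rates)
```

Each fold should keep roughly the overall class ratio, which is what makes the per-fold sensitivity estimates comparable on imbalanced data.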

## Choice regarding the assessment metrics

Sensitivity: our main goal is to correctly predict defaulters on loans, so we care about the true positive rate and want to minimize the false negative rate (failing to flag actual defaulters). 

Specificity: the predictions will also be used to decide whether to offer a loan to an individual, based on their probability of defaulting. It is therefore also important to correctly identify individuals who will not default, so that banks do not miss out on customers who would have repaid. Thus, we also care about the true negative rate and want to minimize the false positive rate (overpredicting defaulters). 

As a result, the two primary metrics should be sensitivity and specificity. A reasonable target here is a sensitivity of at least 70% and a specificity of at least 60%. 

It is also useful to look at other metrics for additional perspectives. More specifically, we will use balanced accuracy (bal_accuracy) as the third assessment metric, since it is the arithmetic mean of sensitivity and specificity and therefore captures the balance between the two.

We also include precision, which shows whether the model correctly predicts defaulters while keeping the number of non-defaulters incorrectly flagged as defaulters low (high TP with few FP). Finally, to examine how well the model balances sensitivity and precision, we include the F1 score (f_meas). 

To conclude, our assessment metrics of choices are:

1. sensitivity (recall), 
2. specificity, 
3. bal_accuracy, 
4. precision, 
5. f_meas
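
As a rough illustration of how these five metrics relate to each other, the sketch below computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical, chosen so that the model exactly meets the 70% / 60% targets mentioned above.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the five assessment metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)    # share of predicted defaulters that really default
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "bal_accuracy": (sensitivity + specificity) / 2,
        "precision": precision,
        "f_meas": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical counts: 100 actual defaulters and 100 actual non-defaulters
example_metrics = classification_metrics(tp=70, fn=30, tn=60, fp=40)
print(example_metrics)
```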

It is important to note that Scikit-Learn does not have a built-in metric for specificity, so we have to create it ourselves and wrap it with the make_scorer function to obtain a scorer object. Notably, when there are multiple scoring metrics and one of them is a custom function like specificity, the value passed to the scoring parameter of GridSearchCV() must be a dictionary whose keys are the names of the scorers and whose values are the scorer objects (or the names of built-in scorers). The dictionary below uses built-in scorers only and adds ROC AUC. 

In [4]:
metrics_set = {
    'recall': 'recall', 
    'bal_accuracy': 'balanced_accuracy', 
    'precision': 'precision', 
    'f1': 'f1', 
    'roc_auc': 'roc_auc'
}
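
The `metrics_set` dictionary above does not contain the custom specificity scorer described in the text. A minimal sketch of how one could be built follows, assuming the `Default` column is coded 0 (no default) and 1 (default). Specificity is the recall of the negative class, so `recall_score` with `pos_label = 0` is sufficient. The names `specificity_score`, `specificity_scorer`, and `metrics_set_with_specificity` are hypothetical.

```python
from sklearn.metrics import make_scorer, recall_score

def specificity_score(y_true, y_pred):
    # Specificity = TN / (TN + FP), i.e. recall computed for the negative class.
    # Assumption: the negative class (no default) is labelled 0.
    return recall_score(y_true, y_pred, pos_label=0)

# Wrap the function so that GridSearchCV / RandomizedSearchCV can use it
specificity_scorer = make_scorer(specificity_score)

# Hypothetical extended scoring dictionary mixing built-in names and a custom scorer
metrics_set_with_specificity = {
    'recall': 'recall',
    'specificity': specificity_scorer,
    'bal_accuracy': 'balanced_accuracy',
    'precision': 'precision',
    'f1': 'f1',
    'roc_auc': 'roc_auc',
}

# Quick check on toy labels: 3 true negatives in total, 2 of them predicted as 0
y_true_toy = [0, 0, 0, 1, 1]
y_pred_toy = [0, 1, 0, 1, 0]
toy_specificity = specificity_score(y_true_toy, y_pred_toy)
print(toy_specificity)
```

Such a dictionary could be passed to the `scoring` parameter in place of `metrics_set`; the results table would then gain `mean_test_specificity` and related columns.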

## Modeling

### k-NN Classification

First, we have to define and initialize the knn classification model object.

In [5]:
knn_class_model = KNeighborsClassifier(n_jobs = -1)

#### Creating the preprocessing pipeline for k-NN

By using the ColumnTransformer class, we can apply the appropriate transformation to each column without having to split the data into categorical and numerical columns and transform them separately. 

Notably, R conveniently handles a categorical variable once it is converted into a factor, which allows us to fit all models without any issues. Scikit-learn in Python, on the other hand, requires the categories to be encoded as numbers before the models will work. Consequently, we apply OneHotEncoder() to the categorical columns. 

In [6]:
# we use ColumnTransformer() every time when features in the data need different preprocessing transformers
preprocessor = make_column_transformer(
    (MinMaxScaler(), [str(col) for col in X_train.select_dtypes(['int64', 'float64'])]), 
    (OneHotEncoder(), [str(col) for col in X_train.select_dtypes(['object', 'category'])]), 
    remainder = 'passthrough'
)

##### One Hot Encoding vs Label Encoding vs Target Encoding

- One Hot Encoding creates a new dummy variable for each unique category of the original discrete feature.
- Label Encoding replaces each unique category of the original discrete feature with an integer. Since the numbers are arbitrary, they impose an ordering that does not exist, which can cause problems. For instance, if colours are label encoded, a decision tree split on the encoded value groups together colours that merely happen to have neighbouring numbers. 
- Target Encoding uses the target variable to determine the values that replace the categories.
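
The difference between the first two encodings can be seen in a small sketch on a hypothetical colour feature. `OrdinalEncoder` is used here as the feature-side counterpart of label encoding (scikit-learn's `LabelEncoder` is intended for target labels); the data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature with three unique colours
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

# One dummy column per category (categories are sorted: blue, green, red)
one_hot = OneHotEncoder().fit_transform(colours).toarray()

# One integer per category: blue -> 0, green -> 1, red -> 2
ordinal = OrdinalEncoder().fit_transform(colours).ravel()

print(one_hot)
print(ordinal)
```

With the integer encoding, blue and green end up closer to each other than blue and red, which is an artefact of alphabetical order. A distance-based model such as k-NN would treat that artefact as real, which is one reason one hot encoding is used in the preprocessor.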

Then, we formulate the pipeline workflow for our knn model.

In [7]:
knn_class_pipeline = Pipeline(
    steps = [
        ('preprocessor', preprocessor),  
        ('knn_class_model', knn_class_model)
    ]
)

#### Hyperparameter tuning

Then, we have to set up a tuning grid for the hyperparameter k. Notably, the parameters of a pipeline step are addressed as the step name and the parameter name joined by '__': 

In [8]:
knn_class_tune_grid = {
    "knn_class_model__n_neighbors": [i*2 + 1 for i in range(10, 60, 1)]
}

knn_class_tune_grid

{'knn_class_model__n_neighbors': [21,
  23,
  25,
  27,
  29,
  31,
  33,
  35,
  37,
  39,
  41,
  43,
  45,
  47,
  49,
  51,
  53,
  55,
  57,
  59,
  61,
  63,
  65,
  67,
  69,
  71,
  73,
  75,
  77,
  79,
  81,
  83,
  85,
  87,
  89,
  91,
  93,
  95,
  97,
  99,
  101,
  103,
  105,
  107,
  109,
  111,
  113,
  115,
  117,
  119]}
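
The '__' naming convention can be checked directly on a pipeline. The sketch below uses a reduced, hypothetical pipeline (`toy_pipeline`, with a plain scaler in place of the full preprocessor) and sets the number of neighbors through the step name.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical reduced pipeline that reuses the step name 'knn_class_model'
toy_pipeline = Pipeline(
    steps=[
        ("scaler", MinMaxScaler()),
        ("knn_class_model", KNeighborsClassifier()),
    ]
)

# '<step name>__<parameter name>' reaches into the named step
toy_pipeline.set_params(knn_class_model__n_neighbors=21)
tuned_k = toy_pipeline.named_steps["knn_class_model"].n_neighbors

# All parameter names of the k-NN step that a search could tune
tunable_names = [
    name for name in toy_pipeline.get_params() if name.startswith("knn_class_model__")
]
print(tuned_k)
print(tunable_names)
```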

Then, we initialize the grid search. Notably, the scoring parameter of GridSearchCV() determines the metrics used to evaluate the model's performance. When several metrics are supplied, the refit parameter must name the one that is used to select the best hyperparameter and refit the final model. 

In [9]:
knn_class_grid_search = GridSearchCV(
    estimator = knn_class_pipeline, 
    param_grid = knn_class_tune_grid, 
    cv = cv_folds,  
    scoring = metrics_set, 
    refit = 'recall', # required with multiple metrics: names the metric used to select the best hyperparameter and refit the final model
    error_score = "raise" # raise an error when a fit fails instead of recording NaN scores
)
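
The role of `refit` with several metrics can be illustrated with a small sketch on synthetic data (generated with `make_classification`; the data, the grid, and the name `toy_search` are assumptions for illustration). After fitting, the candidate reported as best is the one ranked first on the metric named in `refit`, while the other metrics are still recorded.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced toy data (roughly 80% negatives, 20% positives)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)

toy_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring={"recall": "recall", "precision": "precision"},
    refit="recall",  # the best candidate is chosen on recall only
)
toy_search.fit(X_toy, y_toy)

recall_ranks = toy_search.cv_results_["rank_test_recall"]
print(toy_search.best_params_)
print(recall_ranks)
```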

Subsequently, we can fit the grid search object created.

In [None]:
knn_class_grid_search.fit(X  = X_train, y = y_train)

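Alternatively, a randomized search evaluates only a random sample of the grid (10 candidates by default, n_iter = 10) instead of all 50 values.

In [13]: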
knn_class_randomized_grid_search = RandomizedSearchCV(
    estimator = knn_class_pipeline,
    param_distributions = knn_class_tune_grid, 
    cv = cv_folds, 
    scoring = metrics_set,
    refit = 'recall',
    error_score = 'raise'
)

In [14]:
knn_class_randomized_grid_search.fit(X = X_train, y = y_train)

##### Tuning results

In [16]:
knn_class_tune_results = pd.DataFrame(knn_class_randomized_grid_search.cv_results_)

knn_class_tune_results

# Note: initially, the test estimates were NaN. After running the following command in the terminal, it works: pip install threadpoolctl==3.1.0

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn_class_model__n_neighbors,params,split0_test_recall,split1_test_recall,split2_test_recall,split3_test_recall,...,split3_test_roc_auc,split4_test_roc_auc,split5_test_roc_auc,split6_test_roc_auc,split7_test_roc_auc,split8_test_roc_auc,split9_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,0.153628,0.024971,1.435827,0.299012,25,{'knn_class_model__n_neighbors': 25},0.539573,0.565038,0.559532,0.547832,...,0.585439,0.57402,0.603895,0.590726,0.598757,0.592183,0.59048,0.592449,0.012773,9
1,0.160248,0.001573,1.73691,0.071784,95,{'knn_class_model__n_neighbors': 95},0.565038,0.567791,0.561597,0.582932,...,0.614385,0.597524,0.630634,0.609702,0.615175,0.611565,0.608276,0.614224,0.009268,1
2,0.164847,0.014853,1.639717,0.050291,67,{'knn_class_model__n_neighbors': 67},0.551961,0.586373,0.561597,0.579491,...,0.609524,0.596154,0.629439,0.616937,0.619252,0.607557,0.608946,0.613264,0.011315,5
3,0.161945,0.002554,1.886137,0.229906,103,{'knn_class_model__n_neighbors': 103},0.562973,0.574673,0.56022,0.577426,...,0.611509,0.598966,0.631072,0.608928,0.615913,0.609464,0.609155,0.613835,0.009034,3
4,0.169973,0.016134,2.033369,0.362318,51,{'knn_class_model__n_neighbors': 51},0.557467,0.581555,0.567791,0.570544,...,0.597183,0.590648,0.621721,0.606937,0.611305,0.598517,0.597699,0.605267,0.011233,8
5,0.162473,0.001383,1.901752,0.1713,83,{'knn_class_model__n_neighbors': 83},0.565726,0.579491,0.565038,0.589126,...,0.611318,0.598818,0.627266,0.611192,0.619942,0.609694,0.607521,0.613941,0.008703,2
6,0.167105,0.015585,2.061016,0.358297,115,{'knn_class_model__n_neighbors': 115},0.559532,0.571232,0.555403,0.57605,...,0.61073,0.597306,0.630836,0.606597,0.610155,0.608467,0.606721,0.612107,0.009308,6
7,0.167975,0.016034,1.710283,0.203671,23,{'knn_class_model__n_neighbors': 23},0.532003,0.556779,0.554714,0.558156,...,0.587322,0.571435,0.60599,0.592567,0.598407,0.59011,0.589746,0.591503,0.012845,10
8,0.161524,0.002029,1.697659,0.077493,69,{'knn_class_model__n_neighbors': 69},0.546456,0.589814,0.56435,0.58362,...,0.610215,0.594334,0.629188,0.61576,0.619522,0.608635,0.607457,0.613438,0.011423,4
9,0.160244,0.001309,1.78717,0.038649,119,{'knn_class_model__n_neighbors': 119},0.560908,0.570544,0.550585,0.577426,...,0.611518,0.598103,0.630424,0.603842,0.609249,0.60709,0.606506,0.611274,0.00921,7


We can also plot this in a graph.

In [18]:
# sort by k so that the lines run from left to right (the randomized search returns the candidates in sampling order)
plot_data = knn_class_tune_results.sort_values(by = "param_knn_class_model__n_neighbors")
k_values = plot_data["param_knn_class_model__n_neighbors"].astype(int)

# (column in the results, panel title)
panels = [
    ("mean_test_recall", "sensitivity"), 
    ("mean_test_bal_accuracy", "balanced accuracy"), 
    ("mean_test_precision", "precision"), 
    ("mean_test_f1", "f1")
]

fig, axes = plt.subplots(2, 2, figsize = (10, 10))

for ax, (column, title) in zip(axes.flat, panels):
    ax.plot(k_values, plot_data[column])
    ax.scatter(k_values, plot_data[column])
    ax.set_title(title)
    ax.set_ylabel('mean')
    ax.set_xlabel('n_neighbors')
    ax.grid()

The best hyperparameter is shown below.

In [20]:
knn_class_randomized_grid_search.best_params_

{'knn_class_model__n_neighbors': 103}

In addition, the best estimator can also be obtained below

In [22]:
knn_class_randomized_grid_search.best_estimator_

For a simpler view, we can manually create a dataframe of the search results that also contains the standard error of each mean estimate. 

In [23]:
# First, create a data frame that contains the number of neighbors, the mean test estimates and the standard error of each mean test estimates. 
n_cv = 10
n_neighbors = []
mean_estimates = []
std_err = []

for i in range(len(knn_class_tune_results)):
    n_neighbors.append(knn_class_tune_results.loc[i, "param_knn_class_model__n_neighbors"])
    mean_estimates.append(knn_class_tune_results.loc[i, "mean_test_recall"])
    std_err_i = knn_class_tune_results.loc[i, "std_test_recall"]/np.sqrt(n_cv)
    std_err.append(std_err_i)

knn_class_tune_results_cleaned = pd.DataFrame({
    "n_neighbors": n_neighbors, 
    "mean_test_recall": mean_estimates, 
    "std_err_test_recall": std_err
})

knn_class_tune_results_cleaned = knn_class_tune_results_cleaned.sort_values(by = "mean_test_recall", ascending = False)

knn_class_tune_results_cleaned

Unnamed: 0,n_neighbors,mean_test_recall,std_err_test_recall
3,103,0.573916,0.003505
2,67,0.573159,0.003842
9,119,0.573159,0.005307
1,95,0.572884,0.003447
8,69,0.572815,0.00453
5,83,0.572195,0.002991
4,51,0.571989,0.002991
6,115,0.571507,0.004291
0,25,0.55788,0.0042
7,23,0.557054,0.004419


Here, it is evident that the best hyperparameter for our knn is n_neighbors of 103, which matches best_params_ above.  

However, the estimator with the highest mean sensitivity might not be significantly better than a simpler model whose mean sensitivity lies within one standard error of the best. In that case, it is preferable to choose the simpler estimator, because it gives essentially the same level of sensitivity with a less complex model. For k-NN, a larger k gives a smoother and therefore simpler decision boundary. 

Since sklearn in Python has no function similar to select_by_one_std_err() from R's tidymodels, we have to apply the one standard error rule manually. 

In [24]:
best_estimator = knn_class_tune_results_cleaned.iloc[0, :] # be careful to use iloc (position), not loc (index label): after sorting, the best estimator is the first row
#best_estimator["mean_test_score"]
best_estimator_range = best_estimator["mean_test_recall"] - best_estimator["std_err_test_recall"] # lower bound of the one standard error band
#best_estimator_range

# keep every candidate within one standard error of the best that is at least as simple (k at least as large)
best_estimator_one_std_err_list = []
for i in knn_class_tune_results_cleaned.index:
    if knn_class_tune_results_cleaned.loc[i, "mean_test_recall"] >= best_estimator_range and knn_class_tune_results_cleaned.loc[i, "n_neighbors"] >= best_estimator["n_neighbors"]: 
        best_estimator_one_std_err_list.append(knn_class_tune_results_cleaned.loc[i, :])
            
best_estimator_one_std_err_df = pd.DataFrame(best_estimator_one_std_err_list)

best_estimator_one_std_err_df

# sort by k so that tail(1) returns the simplest qualifying model (largest k)
knn_best_estimator_one_std_err = best_estimator_one_std_err_df.sort_values(by = "n_neighbors").tail(1)
knn_best_estimator_one_std_err

Unnamed: 0,n_neighbors,mean_test_recall,std_err_test_recall
9,119.0,0.573159,0.005307


Based on the result above, the best hyperparameter under the one standard error rule is n_neighbors = 119, which yields a mean sensitivity of 57.3159%.
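
The same rule can also be written without an explicit loop. The following is a sketch only: the helper name `select_k_by_one_std_err` is hypothetical, and the data frame simply re-enters the values from the table above. It keeps every candidate whose mean sensitivity lies within one standard error of the best and returns the largest k among them.

```python
import pandas as pd

# Values copied from the cleaned tuning results shown above
tune_summary = pd.DataFrame({
    "n_neighbors": [103, 67, 119, 95, 69, 83, 51, 115, 25, 23],
    "mean_test_recall": [0.573916, 0.573159, 0.573159, 0.572884, 0.572815,
                         0.572195, 0.571989, 0.571507, 0.55788, 0.557054],
    "std_err_test_recall": [0.003505, 0.003842, 0.005307, 0.003447, 0.00453,
                            0.002991, 0.002991, 0.004291, 0.0042, 0.004419],
})

def select_k_by_one_std_err(summary):
    """Return the largest k whose mean recall is within one SE of the best."""
    best_row = summary.loc[summary["mean_test_recall"].idxmax()]
    lower_bound = best_row["mean_test_recall"] - best_row["std_err_test_recall"]
    within_band = summary[summary["mean_test_recall"] >= lower_bound]
    # For k-NN, the largest k is the simplest (smoothest) model
    return int(within_band["n_neighbors"].max())

selected_k = select_k_by_one_std_err(tune_summary)
print(selected_k)
```

Returning a plain `int` also avoids passing a float to `n_neighbors` later on.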

Finally, we finalize the workflow by specifying the model with the tuned hyperparameter.

In [25]:
best_estimator_one_std_err_K = int(knn_best_estimator_one_std_err["n_neighbors"].iloc[0]) # n_neighbors must be an integer, but the data frame stores it as a float (119.0)
best_estimator_one_std_err_K

knn_class_best_estimator = KNeighborsClassifier(n_neighbors = best_estimator_one_std_err_K)

knn_class_best_estimator