<a href="https://colab.research.google.com/github/RenatodaCostaSantos/Machine-Learning---Lessons/blob/main/Supervised%20ML/k-NN/K_Nearest_Neighbors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting if a client will sign up for a term deposit

In this lesson, we will introduce the machine learning workflow and use the K-nearest neighbors algorithm to predict if clients of a bank will sign up for a term deposit.

We will use the [Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing). It contains data from a Portuguese banking institution that runs marketing campaigns aimed at getting clients to sign up for a term deposit.

Note:

- We start this lesson without the help of scikit-learn. This makes the code less convenient, but it illustrates how scikit-learn works under the hood. As the workflow grows more complex, we will switch to the scikit-learn library.

- This lesson is deliberately repetitive, to show explicitly how building a machine learning model works in practice. There is a lot of fine-tuning to be done, which takes time and plenty of repetitive tasks.

Let's import and read the file:

In [1]:
import pandas as pd
import numpy as np
from google.colab import drive

drive.mount('/content/drive') 


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
banking_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/subscription_prediction.csv')

In [3]:
banking_df.shape

(10122, 21)

There are 10122 observations and 21 columns: 20 features plus the target column, named $y$.

In [4]:
banking_df.head()

,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [5]:
banking_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10122 entries, 0 to 10121
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             10122 non-null  int64  
 1   job             10122 non-null  object 
 2   marital         10122 non-null  object 
 3   education       10122 non-null  object 
 4   default         10122 non-null  object 
 5   housing         10122 non-null  object 
 6   loan            10122 non-null  object 
 7   contact         10122 non-null  object 
 8   month           10122 non-null  object 
 9   day_of_week     10122 non-null  object 
 10  duration        10122 non-null  int64  
 11  campaign        10122 non-null  int64  
 12  pdays           10122 non-null  int64  
 13  previous        10122 non-null  int64  
 14  poutcome        10122 non-null  object 
 15  emp.var.rate    10122 non-null  float64
 16  cons.price.idx  10122 non-null  float64
 17  cons.conf.idx   10122 non-null 

There are no null values, but there are 11 categorical columns (including the target) that have to be transformed before proceeding with machine learning.

In [6]:
banking_df['y'].value_counts(normalize = True)

no     0.541593
yes    0.458407
Name: y, dtype: float64

The 'yes' and 'no' classes are fairly balanced (about 46% versus 54%). This is good because a strongly imbalanced target tends to bias a model toward the majority class.
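
A useful reference point for the accuracies computed later is the score of a model that always predicts the most frequent class. The sketch below computes it for a small hypothetical label series (`labels` is an illustrative stand-in, not the lesson's data); with the proportions shown above, the same calculation would give about 54%.

```python
import pandas as pd

# Hypothetical labels, for illustration only
labels = pd.Series(['no', 'no', 'yes', 'no', 'yes', 'no'])

# Share of each class
proportions = labels.value_counts(normalize=True)

# A classifier that always predicts the most frequent class
# is correct exactly as often as that class occurs
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()

print(majority_class, round(baseline_accuracy, 3))
```

Any model worth keeping should beat this baseline on held-out data.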

In [7]:
banking_df.describe()

,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0
mean,40.313673,373.414049,2.369789,896.476882,0.297471,-0.432671,93.492407,-40.250573,3.035134,5138.838975
std,11.855014,353.277755,2.472392,302.175859,0.680535,1.714657,0.628615,5.271326,1.884191,85.859595
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,31.0,140.0,1.0,999.0,0.0,-1.8,92.963,-42.7,1.252,5076.2
50%,38.0,252.0,2.0,999.0,0.0,-0.1,93.444,-41.8,4.076,5191.0
75%,48.0,498.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.959,5228.1
max,98.0,4199.0,42.0,999.0,6.0,1.4,94.767,-26.9,5.045,5228.1


## Data preparation

First, we will convert the target column to 1 instead of 'yes' and 0 instead of 'no'.

In [8]:
# Transforming outcomes into a numerical variable ('yes' -> 1, 'no' -> 0)
banking_df['y'] = banking_df['y'].map({'yes': 1, 'no': 0})

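Random sampling keeps the class proportions of the original data approximately the same in the train and test sets. We will use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method to draw 85% of the rows for training, keep the remaining 15% for testing, and then check the proportions: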



In [9]:
# Splitting the dataframe into train and test
train_df = banking_df.sample(frac = 0.85, random_state = 417)
test_df = banking_df.drop(index = train_df.index)

In [10]:
# Checking proportions of outcome values on the train set
train_df['y'].value_counts(normalize=True)

0    0.540098
1    0.459902
Name: y, dtype: float64

In [11]:
# Checking proportions of outcome values on the test set
test_df['y'].value_counts(normalize = True)

0    0.550066
1    0.449934
Name: y, dtype: float64
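
Plain random sampling matches the class proportions only approximately, as the small differences above show. If the proportions need to match as closely as possible, one option is to sample within each class. The following is a minimal sketch on hypothetical toy data (`toy`, `toy_train`, and `toy_test` are illustrative names), assuming a pandas version that provides `DataFrameGroupBy.sample`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 100 rows, 40% positives
toy = pd.DataFrame({
    'x': np.arange(100),
    'y': [1] * 40 + [0] * 60,
})

# Sample 85% of the rows separately within each class of 'y',
# so both classes are represented in the same proportion
toy_train = toy.groupby('y').sample(frac=0.85, random_state=417)
toy_test = toy.drop(index=toy_train.index)

print(toy_train['y'].mean(), toy_test['y'].mean())
```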

In [12]:
# Separate train features and target
X_train = train_df.drop('y', axis = 1)
y_train = train_df['y']

In [13]:
# Separate test features and target
X_test = test_df.drop('y', axis = 1)
y_test = test_df['y']

## KNN algorithm: fundamentals

The K-nearest neighbors algorithm takes a data point to classify and calculates its distance from every observation in the training set. The user sets a number K, which the algorithm uses to answer the following questions:

- What is the outcome for the K nearest neighbors to this point?
- Which value for the outcome is more prevalent?

Once it knows the answers, it assigns the most common value as the prediction for that data point.

To illustrate how the KNN algorithm works, let's build a function that does exactly that and test it on one of the data points of the banking_df dataframe.

In [14]:
def knn(feature, test_input, K):
  # Distance between the test input and every training value
  # (with a single feature, the Euclidean distance is the absolute difference)
  distances = np.abs(feature - test_input)

  # Get indices of the K nearest neighbours of the test input
  indices = distances.nsmallest(n = K).index
  # Keep only the labels of those neighbours and get the most common value
  prediction = y_train.loc[indices].mode()

  return prediction

In [15]:
# Testing knn function
test_input = X_test['age'].iloc[9]
knn(X_train['age'], test_input, 3)

0    0
dtype: int64

The function predicted 0 for the test_input. It did so by taking the mode of the $y$ labels of the 3 nearest neighbors in the training set. Let's check whether the true label in the test set matches:

In [16]:
y_test.iloc[9]

0

The prediction was correct for this particular test point. Trying other test points will lead to some wrong predictions, as expected.

## Evaluating the model

To check whether a model is performing well, we need a metric. Since we care equally about identifying clients who will sign up and clients who will not, we will use accuracy as the metric for this problem.

The accuracy measures the proportion of correct predictions. In other words:
$$
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN},
$$
where $TP, TN, FP$, and $FN$ stand for true positives, true negatives, false positives and false negatives, respectively.
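
As a small worked example of this formula, the sketch below counts the four outcomes for hypothetical labels and predictions (`y_true` and `y_pred` are made up for illustration) and checks that the result equals the share of matching entries:

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / (tp + fp + tn + fn)

# Equivalent shortcut: the fraction of predictions that match the labels
accuracy_shortcut = float(np.mean(y_true == y_pred))

print(tp, tn, fp, fn, accuracy)
```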


To compute the accuracy, we need the number of correct predictions. Let's apply the knn function to every point in the test set:

In [17]:
# Calculating accuracy for the knn function
predictions = []
for i in range(len(y_test)):
  test_input = X_test['age'].iloc[i]
  prediction = knn(X_train['age'], test_input, 3).iloc[0]
  predictions.append(prediction)

X_test['age_predicted_y'] = predictions

In [18]:
results = y_test == X_test['age_predicted_y']

In [19]:
results.value_counts(normalize = True)

True     0.538867
False    0.461133
dtype: float64

The model is not doing very well: only ~53.9% of the test points were classified correctly, slightly below the ~55% that always predicting 0 would achieve on this test set.

Let's see if it improves if we choose a different feature:

In [20]:
predictions = []
for i in range(len(y_test)):
  test_input = X_test['campaign'].iloc[i]
  prediction = knn(X_train['campaign'], test_input, 3).iloc[0]
  predictions.append(prediction)

X_test['campaign_predicted_y'] = predictions

In [21]:
results_campaign = y_test == X_test['campaign_predicted_y']

In [22]:
results_campaign.value_counts(normalize = True)

True     0.551383
False    0.448617
dtype: float64

The 'campaign' feature improved the accuracy by about 1.3 percentage points. This can happen in machine learning: some features have a stronger relationship with the target variable than others. That alone does not guarantee better performance, but it can lead to better predictions.

# Feature engineering

There are some categorical features in the data set. In this lesson we will not try to use all of them, but just a few, to clarify how KNN works.

We will use one-hot encoding to transform the 'marital' feature into dummy variables.

In [23]:
banking_df_copy = banking_df.copy()

In [24]:
# Transforming 'marital' categorical feature into a numerical feature
banking_df_copy= pd.get_dummies(banking_df_copy,columns = ['marital'], drop_first= True)
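
To make the effect of `drop_first` concrete, here is a minimal sketch on a hypothetical one-column dataframe (`toy` and `encoded` are illustrative names). The first category in sorted order is dropped and becomes the all-zeros baseline; `dtype=int` is passed only so the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column with the same kind of categories
toy = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'married']})

# drop_first removes the first category in sorted order ('divorced'),
# which is then represented by zeros in all remaining dummy columns
encoded = pd.get_dummies(toy, columns=['marital'], drop_first=True, dtype=int)

print(encoded)
```

Dropping one category loses no information, since a row with zeros everywhere can only belong to the dropped category.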

In [25]:
banking_df_copy

,age,job,education,default,housing,loan,contact,month,day_of_week,duration,...,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,marital_married,marital_single,marital_unknown
0,40,admin.,basic.6y,no,no,no,telephone,may,mon,151,...,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0
1,56,services,high.school,no,no,yes,telephone,may,mon,307,...,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0
2,41,blue-collar,unknown,unknown,no,no,telephone,may,mon,217,...,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0,1,0,0
3,57,housemaid,basic.4y,no,yes,no,telephone,may,mon,293,...,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0,0,0,0
4,39,management,basic.9y,unknown,no,no,telephone,may,mon,195,...,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10117,64,retired,professional.course,no,yes,no,cellular,nov,fri,151,...,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,0,0,0,0
10118,37,admin.,university.degree,no,yes,no,cellular,nov,fri,281,...,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,1,1,0,0
10119,73,retired,professional.course,no,yes,no,cellular,nov,fri,334,...,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,1,1,0,0
10120,44,technician,professional.course,no,no,no,cellular,nov,fri,442,...,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,1,1,0,0


Let's include more features in the knn function and see if the performance of the model improves. Again, we use the pandas DataFrame.sample method to separate the dataset into train and test sets (note that we are using the copy of the dataset, which includes the converted 'marital' feature):

In [26]:
# Defining train and test sets
train_df_copy = banking_df_copy.sample(frac = 0.85, random_state = 417)
test_df_copy = banking_df_copy.drop(index = train_df_copy.index)

In [27]:
# Checking proportions of target outcomes for the train set
train_df_copy['y'].value_counts(normalize=True)

0    0.540098
1    0.459902
Name: y, dtype: float64

In [28]:
# Checking proportions of target outcomes for the test set
test_df_copy['y'].value_counts(normalize = True)

0    0.550066
1    0.449934
Name: y, dtype: float64

In [29]:
# Separate train features and target
X_train_copy = train_df_copy.drop('y', axis = 1)
y_train_copy = train_df_copy['y']

In [30]:
# Separate test features and target
X_test_copy = test_df_copy.drop('y', axis = 1)
y_test_copy = test_df_copy['y']

Let's generalize the KNN function to include more features:

In [31]:
def knn(feature, test_input, K):
  # Euclidean distance between the test input and every training row,
  # computed over the selected feature columns
  distance = np.sqrt(((X_train_copy[feature] - test_input)**2).sum(axis=1))

  # Get indices of the K nearest neighbours of the test input
  indices = distance.nsmallest(n = K).index
  # Keep only the labels of those neighbours and get the most common value
  prediction = y_train_copy.loc[indices].mode()

  return prediction

In [32]:
# Testing knn function
test_input = X_test_copy[["age", "campaign", "marital_married", "marital_single"]].iloc[9]
knn(["age", "campaign", "marital_married", "marital_single"], test_input, 3)

0    0
dtype: int64

Let's evaluate the model:

In [33]:
predictions = []
for i in range(len(y_test_copy)):
  test_input = X_test_copy[["age", "campaign", "marital_married", "marital_single"]].iloc[i]
  prediction = knn(["age", "campaign", "marital_married", "marital_single"], test_input, 3).iloc[0]
  predictions.append(prediction)

X_test_copy['predicted_y'] = predictions

In [34]:
results_campaign_copy = y_test_copy == X_test_copy['predicted_y']

In [35]:
results_campaign_copy.value_counts(normalize = True)

True     0.554677
False    0.445323
dtype: float64

The model with four features performed essentially the same as the model using only the 'campaign' feature (~55.5% versus ~55.1%). This is not a real improvement, but it illustrates how the KNN algorithm generalizes to several features.

## Feature engineering II

A very important step in the KNN algorithm is to normalize the features. Since classification relies on distances, a feature measured on a much larger scale than the others dominates the distance between data points, which can lead to misclassifications.

A way to avoid that is to rescale all features to a common range before computing the distances.

We will normalize the age and campaign features and compute the accuracy of the model again. Since the values of both features are not very spread out (ages range from 17 to 98 and campaign from 1 to 42), we do not expect a significant improvement in accuracy.

We use the following equation to normalize a feature:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}.$$
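
A minimal sketch of this formula on hypothetical values (`train_age` and `test_age` are made up for illustration). One assumption of this sketch: the minimum and maximum are taken from the training values only and reused for the test values, a common convention that keeps both sets on the same scale.

```python
import numpy as np

# Hypothetical ages, for illustration only
train_age = np.array([20.0, 35.0, 50.0, 80.0])
test_age = np.array([29.0, 65.0])

# Minimum and maximum learned from the training values
age_min, age_max = train_age.min(), train_age.max()

# The same min and max are applied to both sets
train_scaled = (train_age - age_min) / (age_max - age_min)
test_scaled = (test_age - age_min) / (age_max - age_min)

print(train_scaled, test_scaled)
```

With this convention, a test value outside the training range maps outside the interval from 0 to 1, which is expected and harmless for distance computations.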

In [36]:
# Normalizing 'age' feature of train and test sets
X_train_copy['age'] = (X_train_copy['age'] - X_train_copy['age'].min()) / (X_train_copy['age'].max() - X_train_copy['age'].min())
X_test_copy['age'] = (X_test_copy['age'] - X_test_copy['age'].min()) / (X_test_copy['age'].max() - X_test_copy['age'].min())

In [37]:
# Normalizing 'campaign' feature of train and test sets
X_train_copy['campaign'] = (X_train_copy['campaign'] - X_train_copy['campaign'].min()) / (X_train_copy['campaign'].max() - X_train_copy['campaign'].min())
X_test_copy['campaign'] = (X_test_copy['campaign'] - X_test_copy['campaign'].min()) / (X_test_copy['campaign'].max() - X_test_copy['campaign'].min())

In [38]:
# Check normalized values
X_train_copy[['age','campaign']].head()

,age,campaign
7472,0.259259,0.02439
3408,0.469136,0.04878
5851,0.148148,0.0
9132,0.777778,0.02439
3790,0.345679,0.02439


In [39]:
# Testing predictions with normalized variables
predictions = []
for i in range(len(y_test_copy)):
  test_input = X_test_copy[["age", "campaign", "marital_married", "marital_single"]].iloc[i]
  prediction = knn(["age", "campaign", "marital_married", "marital_single"], test_input, 3).iloc[0]
  predictions.append(prediction)

X_test_copy['predicted_y'] = predictions

In [40]:
results_campaign_copy = y_test_copy == X_test_copy['predicted_y']

In [41]:
results_campaign_copy.value_counts(normalize = True)

True     0.550066
False    0.449934
dtype: float64

As expected, the accuracy remains roughly the same.

# Validation set

Evaluating on the test set too often can bias our estimate of model performance. Each time we adjust the model based on its test score, information about the test set leaks into our choices, and the model can end up tuned to this particular dataset without performing as well on new data. To avoid that, it is good practice to introduce a validation set for tuning and keep the test set for a final check. In this section we will split the dataset into three parts:

- Train (60% of the dataset)
- Validation (20% of the dataset)
- Test (20% of the dataset).

In [42]:
from sklearn.model_selection import train_test_split
# Split features and outcome
X = banking_df_copy.drop('y', axis=1)
y = banking_df_copy['y']

In [43]:
# Defining X_val and y_val as 20% of X and y
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state= 417)

In [44]:
# Defining X_train, y_train as 60% of X and X_test and y_test as 20% of X and y
X_train, X_test, y_train, y_test = train_test_split(X_train,y_train, test_size = 0.2*X.shape[0]/X_train.shape[0], random_state = 417)
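
The second split uses the fraction `0.2*X.shape[0]/X_train.shape[0]` because 20% of the original rows corresponds to 25% of the 80% that remain after the first split. The sketch below checks the resulting sizes on hypothetical toy arrays (`X_toy`, `y_toy`, and the other names are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 1000 rows
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.arange(1000) % 2

# First split: hold out 20% of all rows
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=417
)

# Second split: 20% of the original rows, expressed as a
# fraction of the rows that remain after the first split
second_fraction = 0.2 * X_toy.shape[0] / X_rest.shape[0]
X_tr, X_other, y_tr, y_other = train_test_split(
    X_rest, y_rest, test_size=second_fraction, random_state=417
)

# Sizes of the three parts (roughly 60% / 20% / 20%)
print(len(X_tr), len(X_hold), len(X_other))
```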

## Building and training a k-NN model

Now that we know how the k-NN algorithm works, we will import the scikit-learn library and start building a more realistic machine-learning workflow for the banking_df dataframe. We will transform and include the 'default' feature and train the model once again:

In [45]:
from sklearn.neighbors import KNeighborsClassifier

# Instantiate a classifier
model = KNeighborsClassifier(n_neighbors = 5)

In [46]:
# Transforming 'default' feature into dummy variables (the target 'y' is dropped so that X holds features only)
X = pd.get_dummies(banking_df_copy.drop('y', axis = 1), columns = ['default'], drop_first = True)

In [47]:
# define X_val and y_val as 20% of X and y
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state= 417)

In [48]:
# Defines X_train, y_train as 60% of X and X_test and y_test as 20% of X and y
X_train, X_test, y_train, y_test = train_test_split(X_train,y_train, test_size = 0.2*X.shape[0]/X_train.shape[0], random_state = 417)

In [49]:
from sklearn.preprocessing import MinMaxScaler
# Instantiate a scaler to normalize some of the features
scaler = MinMaxScaler()

In [50]:
# Use the scaler to transform all features that will be used for this model
X_train_scaled = scaler.fit_transform(X_train[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])

# Underfitting and overfitting on k-NN

We follow the standard ML workflow once again. However, we use the validation set to evaluate the model before presenting it to the test set, so that the test score remains an honest estimate of performance on new data.

To show how the number of neighbors impacts the model's predictions, let's check the accuracy using $k=1$ and $k=2000$:

In [51]:
# Instantiate a classifier and fit the data (training the model)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_scaled,y_train)

KNeighborsClassifier(n_neighbors=1)

In [52]:
# Normalizing the features
X_val_scaled = scaler.fit_transform(X_val[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])
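
`MinMaxScaler` separates learning the minimum and maximum (`fit`) from applying them (`transform`). A common convention, sketched below on hypothetical values, is to call `fit_transform` on the training data only and `transform` on validation or test data, so that every set is mapped with the same minimum and maximum. This is an alternative to refitting the scaler on each set; `toy_scaler`, `train_vals`, and `val_vals` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, for illustration only
train_vals = np.array([[20.0], [40.0], [60.0]])
val_vals = np.array([[30.0], [70.0]])

toy_scaler = MinMaxScaler()

# fit_transform learns min=20 and max=60 from the training values
train_scaled = toy_scaler.fit_transform(train_vals)

# transform reuses the learned min and max without refitting
val_scaled = toy_scaler.transform(val_vals)

print(train_scaled.ravel(), val_scaled.ravel())
```

Note that 70 lies above the training maximum, so it maps to a value greater than 1.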

In [53]:
# Calculating a score using the validation set
val_score = knn.score(X_val_scaled,y_val)

In [54]:
val_score

0.7002469135802469

With one neighbor and six features, the validation accuracy improved considerably, from ~55% to ~70%.

Let's check the impact on the score if we increase the number of neighbors drastically:

In [55]:
# Build a knn classifier using 2000 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=2000)
knn.fit(X_train_scaled,y_train)
val_score = knn.score(X_val_scaled,y_val)
print(val_score)

0.5837037037037037


As we see, drastically increasing the number of neighbors made our model underperform. In the first case ($k=1$), the model tends to overfit: by looking only at the closest neighbor, it creates a very complex decision boundary that follows noise in the training data. In the second case ($k=2000$), it is underfitting: each prediction averages over so many neighbors, most of them far from the point being classified, that local patterns in the data are smoothed away.
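
The two extremes can be seen directly by comparing training and validation accuracy. The sketch below uses synthetic data from `make_classification` (an assumption made for illustration; it is not the banking data) and three values of $k$: 1, a moderate value, and the size of the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, for illustration only
X_syn, y_syn = make_classification(
    n_samples=600, n_features=5, n_informative=3,
    n_redundant=0, flip_y=0.1, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

scores = {}
for k in [1, 15, len(X_tr)]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for each k
    scores[k] = (clf.score(X_tr, y_tr), clf.score(X_va, y_va))

for k, (train_acc, val_acc) in scores.items():
    print(f"k={k}: train={train_acc:.3f}, validation={val_acc:.3f}")
```

With $k=1$ every training point is its own nearest neighbor, so the training accuracy is perfect even though the validation accuracy is not. When $k$ equals the size of the training set, every prediction is simply the majority class.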

## Evaluating the model on the test set

To avoid underfitting or overfitting, let's choose an intermediate value for $k$, fit a model using the training set and check how it performs on the test set:

In [56]:
# Scaling the test set features
X_test_scaled = scaler.fit_transform(X_test[["marital_married", "marital_single", "marital_unknown", "default_unknown", "age", "duration"]])

In [57]:
# Building a knn classifier using 45 neighbors and testing it on the test set
knn = KNeighborsClassifier(n_neighbors=45)
knn.fit(X_train_scaled,y_train)
val_score = knn.score(X_val_scaled,y_val)
test_score = knn.score(X_test_scaled,y_test)
print(val_score)

0.7570370370370371


In [77]:
print(test_score)

0.7595061728395062


The model delivers ~76% accuracy on the test set with a reasonable number of neighbors.

# Improving the model

To exemplify a full ML workflow, we will convert all categorical features into dummy variables.

In [58]:
banking_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10122 entries, 0 to 10121
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              10122 non-null  int64  
 1   job              10122 non-null  object 
 2   education        10122 non-null  object 
 3   default          10122 non-null  object 
 4   housing          10122 non-null  object 
 5   loan             10122 non-null  object 
 6   contact          10122 non-null  object 
 7   month            10122 non-null  object 
 8   day_of_week      10122 non-null  object 
 9   duration         10122 non-null  int64  
 10  campaign         10122 non-null  int64  
 11  pdays            10122 non-null  int64  
 12  previous         10122 non-null  int64  
 13  poutcome         10122 non-null  object 
 14  emp.var.rate     10122 non-null  float64
 15  cons.price.idx   10122 non-null  float64
 16  cons.conf.idx    10122 non-null  float64
 17  euribor3m   

In [59]:
# Converting all categorical variables into dummy ones
banking_df_copy= pd.get_dummies(banking_df_copy,columns = ['job','education','default','housing','loan','contact','month','day_of_week','poutcome'], drop_first= True)

## Feature selection

Previously, we picked features somewhat arbitrarily to build a model. There are more systematic ways of doing it, though. Two common approaches are:
- To use domain knowledge to exclude features that are unlikely to contribute to model accuracy, or
- To find features that have a strong correlation with the target variable.

Let's compute Pearson's correlation coefficient to search for linear correlations between the features and the target variable:

In [60]:
correlations = abs(banking_df_copy.corr())

In [61]:
correlations['y'].nlargest(n=6)

y               1.000000
nr.employed     0.468524
duration        0.468197
euribor3m       0.445328
emp.var.rate    0.429680
pdays           0.317997
Name: y, dtype: float64

It is important to be aware that Pearson's coefficient picks up only linear correlations, so any non-linear relationship would be missed. It helps to select features, but it is not a perfect approach.
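
A small illustration of this limitation, on hypothetical data: a column that depends on `x` linearly has a Pearson coefficient of 1, while a column that depends on `x` quadratically (and symmetrically around zero) has a coefficient of essentially 0, even though `x` determines it completely.

```python
import numpy as np
import pandas as pd

# Hypothetical data, symmetric around zero
x = np.linspace(-1, 1, 201)
toy = pd.DataFrame({
    'x': x,
    'linear': 2 * x + 1,   # perfectly linear in x
    'quadratic': x ** 2,   # fully determined by x, but not linearly
})

# Pearson correlation of every column with 'x'
corr_with_x = toy.corr()['x']

print(corr_with_x.round(3))
```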

We could keep trying different features and checking the model performance. We will leave that for another time and focus on other parameters that can change the model's performance. Let's check in more detail how it changes with the number of neighbors $k$.

We will split the data into train, validation and test once more and check the model's performance for different numbers of neighbors:

In [62]:
# Splitting the features and target variables
X = banking_df_copy.drop('y', axis = 1)
y = banking_df_copy['y']

In [63]:
# Splitting the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.2, random_state = 417)

In [64]:
# Splitting the test set
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size= 0.2*X.shape[0]/X_train.shape[0], random_state = 417)

In [65]:
# Scaling the train set
X_train_scaled = scaler.fit_transform(X_train)

In [66]:
# Scaling the validation set
X_val_scaled = scaler.fit_transform(X_val)

In [67]:
# Training a knn model for different values of k
k = [1,2,3,4,5]
accuracies = {}

for i in k:
  knn = KNeighborsClassifier(n_neighbors=i)
  knn.fit(X_train_scaled,y_train)
  val_accuracy = knn.score(X_val_scaled,y_val)
  accuracies[i] = val_accuracy

In [68]:
# Impact of k values on accuracy
accuracies

{1: 0.694320987654321,
 2: 0.6834567901234568,
 3: 0.7229629629629629,
 4: 0.7303703703703703,
 5: 0.7377777777777778}

## Hyperparameter tuning

Every model has some settings that we choose before training. These are called hyperparameters, and they can influence the performance of a model. As we saw above, $k$ is one of them for the k-NN model.

We will experiment with other values for the hyperparameters of the k-NN model. We change $p$, which sets the power of the Minkowski distance metric; [$weights$](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), which controls how much each neighbor's vote counts (equally, or in inverse proportion to its distance); and the number of neighbors:
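
To give a feel for these two hyperparameters, the sketch below computes the Minkowski distance for a few values of $p$ and compares a uniform vote with an inverse-distance vote. The helper `minkowski` and all values are hypothetical and written for illustration; scikit-learn performs these computations internally.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d1 = minkowski(a, b, 1)  # p = 1: Manhattan distance
d2 = minkowski(a, b, 2)  # p = 2: Euclidean distance
d5 = minkowski(a, b, 5)  # larger p: the largest coordinate difference dominates

# Three hypothetical neighbours: one very close with label 1,
# two farther away with label 0
neighbour_labels = np.array([1, 0, 0])
neighbour_dists = np.array([0.1, 0.5, 0.5])

# Uniform vote: every neighbour counts the same
uniform_prediction = int(neighbour_labels.mean() > 0.5)

# Distance-weighted vote: each neighbour counts with weight 1/distance
vote_weights = 1 / neighbour_dists
vote_for_1 = vote_weights[neighbour_labels == 1].sum()
vote_for_0 = vote_weights[neighbour_labels == 0].sum()
weighted_prediction = int(vote_for_1 > vote_for_0)

print(d1, d2, round(d5, 3), uniform_prediction, weighted_prediction)
```

Here the uniform vote follows the two distant neighbours, while the weighted vote follows the single close one.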

In [69]:
# Impact of different choices of some of the hyperparameters for the kNN
k = [1,10,30,60,90]
accuracies = {}

for i in k:
  # weights = 'distance' gives a higher weight to neighbours that are closer to the test point
  # p is the power of the Minkowski metric (p = 2 recovers the Euclidean metric)
  knn = KNeighborsClassifier(n_neighbors=i, weights = 'distance', p = 5)
  knn.fit(X_train_scaled,y_train)
  val_accuracy = knn.score(X_val_scaled,y_val)
  accuracies[i] = val_accuracy

In [70]:
print(accuracies)

{1: 0.6869135802469136, 10: 0.7279012345679012, 30: 0.7422222222222222, 60: 0.745679012345679, 90: 0.745679012345679}


## Grid search

Grid search is a way to fine-tune a model faster and more systematically. Scikit-learn provides the GridSearchCV class for that. One specifies candidate values for some of the hyperparameters, and GridSearchCV searches for the best combination. It also performs cross-validation, which splits the training data into several different training and validation folds while searching for the best hyperparameters. Let's try it here:

In [71]:
# Splitting the dataset into train and test sets and scaling the training set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 417)
X_train_scaled = scaler.fit_transform(X_train)

In [72]:
# Performing a gridsearch
from sklearn.model_selection import GridSearchCV
# Defining the grid
parameters = {'n_neighbors': [1,3,5], 'weights': ['distance','uniform'], 'p' : [1,2]}
# Instantiate a model
knn = KNeighborsClassifier()
# Instantiate a gridsearch classifier
clf = GridSearchCV(estimator = knn, param_grid = parameters, scoring = 'accuracy')
# Training the classifier
clf.fit(X_train_scaled,y_train)
# Finding best accuracy and best parameters
best_score = clf.best_score_
best_params = clf.best_params_

In [73]:
print(f"Best model's accuracy:  {best_score*100:.2f}%")
print(f"Best model's parameters: {best_params}")

Best model's accuracy:  75.35%
Best model's parameters: {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}
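
One possible variation, sketched here on synthetic data rather than the banking data, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` and pass the pipeline to `GridSearchCV`. The scaler is then refitted on the training part of every cross-validation fold. The step names `'scaler'` and `'knn'` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, for illustration only
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Scaling and classification chained into a single estimator
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('knn', KNeighborsClassifier()),
])

# Hyperparameters of a pipeline step are addressed as <step>__<parameter>
param_grid = {
    'knn__n_neighbors': [1, 3, 5],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

# score() applies the fitted scaler and the best model to the test data
test_accuracy = search.score(X_te, y_te)

print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```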


## Evaluating the model on the test set

The final step is to check if the model will perform similarly once new data is presented to it. Let's check it using the test set:

In [74]:
# Scaling the test set
X_test_scaled = scaler.fit_transform(X_test)

In [75]:
# Finding the accuracy on test set
accuracy = clf.best_estimator_.score(X_test_scaled,y_test)

In [76]:
print(f"Model accuracy on test set: {accuracy*100:.2f}%")

Model accuracy on test set: 74.37%


The model's accuracy on the test set (74.37%) is similar to its cross-validation accuracy on the training set (75.35%). This is what one should expect of a model that generalizes well.

The best model we found has ~74% accuracy. That means we correctly predict, for ~74% of the clients, whether or not they will sign up for a term deposit with the bank.

# Summary

In this lesson we learned the steps for building an ML model:

1 - Split the data into X (features) and y (target).

1.1 - Convert categorical variables into dummy ones.

1.2 - Instantiate a scaler and use the fit_transform() method to scale all values into the interval 0-1 (this reduces misclassification, since k-NN classifies a new data point by looking at distances, and features with larger ranges would otherwise weigh more heavily on the classification).

2 - Split the X and y variables into train, validation and test sets.

3 - Instantiate a model.

3.1 - Use grid search to find a better model by changing its hyperparameter values.

4 - Fit the model using the training set (X_train and y_train).

4.1 - Get the best values for the hyperparameters (GridSearchCV divides the training set into train and validation folds and performs cross-validation under the hood).

5 - Use the best_estimator_ attribute of GridSearchCV to find the score on the test set.
