# Introduction to K-Nearest Neighbors

## Introduction to the Dataset

In the previous lesson, we looked at the machine learning workflow and trained a simple classifier to predict if a patient has breast cancer. We learned how we can quickly prototype a machine learning model and experiment with it to get reasonable results.

![image.png](attachment:5982c7df-f2b1-492b-a5fe-a5874cb597aa.png)

While quick experimentation and iteration are useful, not understanding how a model's algorithm actually works can limit what we get out of those experiments.

In this lesson, we'll learn a different machine learning algorithm and implement it from scratch. We'll use it to build and train a classifier that can predict whether a bank customer will subscribe to a term deposit or not.

We'll use a modified version of the [Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing). It contains data on customers of a Portuguese banking institution that ran marketing campaigns to assess whether customers would subscribe to their product. The dataset consists of 21 columns, including the target variable:



- age: (numeric)
- job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- education: (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- default: has credit in default? (categorical: 'no','yes','unknown')
- housing: has housing loan? (categorical: 'no','yes','unknown')
- loan: has personal loan? (categorical: 'no','yes','unknown')
- contact: contact communication type (categorical: 'cellular','telephone')
- month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
- day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- duration: last contact duration, in seconds (numeric).
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
- emp.var.rate: employment variation rate - quarterly indicator (numeric)
- cons.price.idx: consumer price index - monthly indicator (numeric)
- cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- euribor3m: euribor 3 month rate - daily indicator (numeric)
- nr.employed: number of employees - quarterly indicator (numeric)
- y: has the client subscribed to a term deposit? (binary: 'yes','no')

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
banking_df = pd.read_csv('../../Datasets/subscription_prediction.csv')
banking_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [3]:
# Select the categorical features
banking_df.select_dtypes(include='O')

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome,y
0,admin.,married,basic.6y,no,no,no,telephone,may,mon,nonexistent,no
1,services,married,high.school,no,no,yes,telephone,may,mon,nonexistent,no
2,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,nonexistent,no
3,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,nonexistent,no
4,management,single,basic.9y,unknown,no,no,telephone,may,mon,nonexistent,no
...,...,...,...,...,...,...,...,...,...,...,...
10117,retired,divorced,professional.course,no,yes,no,cellular,nov,fri,nonexistent,no
10118,admin.,married,university.degree,no,yes,no,cellular,nov,fri,nonexistent,yes
10119,retired,married,professional.course,no,yes,no,cellular,nov,fri,nonexistent,yes
10120,technician,married,professional.course,no,no,no,cellular,nov,fri,nonexistent,yes


In [4]:
# Select the numerical features
banking_df.select_dtypes(exclude='O')

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0
1,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0
2,41,217,1,999,0,1.1,93.994,-36.4,4.857,5191.0
3,57,293,1,999,0,1.1,93.994,-36.4,4.857,5191.0
4,39,195,1,999,0,1.1,93.994,-36.4,4.857,5191.0
...,...,...,...,...,...,...,...,...,...,...
10117,64,151,3,999,0,-1.1,94.767,-50.8,1.028,4963.6
10118,37,281,1,999,0,-1.1,94.767,-50.8,1.028,4963.6
10119,73,334,1,999,0,-1.1,94.767,-50.8,1.028,4963.6
10120,44,442,1,999,0,-1.1,94.767,-50.8,1.028,4963.6


In [5]:
banking_df.dtypes.value_counts()

object     11
int64       5
float64     5
Name: count, dtype: int64

In [6]:
banking_df.shape

(10122, 21)

In [7]:
# Check for missing values
banking_df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

In [8]:
banking_df['y'].value_counts()

y
no     5482
yes    4640
Name: count, dtype: int64

In [9]:
banking_df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0
mean,40.313673,373.414049,2.369789,896.476882,0.297471,-0.432671,93.492407,-40.250573,3.035134,5138.838975
std,11.855014,353.277755,2.472392,302.175859,0.680535,1.714657,0.628615,5.271326,1.884191,85.859595
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,31.0,140.0,1.0,999.0,0.0,-1.8,92.963,-42.7,1.252,5076.2
50%,38.0,252.0,2.0,999.0,0.0,-0.1,93.444,-41.8,4.076,5191.0
75%,48.0,498.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.959,5228.1
max,98.0,4199.0,42.0,999.0,6.0,1.4,94.767,-26.9,5.045,5228.1


## K-Nearest Neighbors (k-NN)

Upon exploring the data, we discovered that the dataset has:

- 10122 observations, 20 features, and 1 target variable.
- No missing values in the dataset.
- 5482 customers who didn't subscribe and 4640 who did.
- 10 categorical columns and 10 numeric columns, excluding the target column.

If we explore our dataset further, we could ask several questions to better analyze it. For example:

- How many customers under the age of 30 subscribed to the product?
- Were the customers who subscribed contacted more often during the marketing campaign than those who didn't?
- Which customers were contacted more often before this campaign?

We could potentially answer each of these questions ourselves and develop a complex set of rules that could tell us which customers are likely to subscribe given all the features available to us.
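As a rough illustration, questions like these map onto short pandas expressions. The sketch below uses a tiny made-up DataFrame (`sample_df` and its values are hypothetical) that borrows the column names `age`, `campaign`, `previous`, and `y` from our dataset; the same expressions should work on `banking_df`.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the dataset;
# the values are made up for illustration only.
sample_df = pd.DataFrame({
    'age':      [25, 28, 35, 41, 29, 52],
    'campaign': [1, 3, 2, 5, 1, 2],
    'previous': [0, 1, 0, 2, 0, 1],
    'y':        ['yes', 'no', 'yes', 'no', 'yes', 'no'],
})

# How many customers under the age of 30 subscribed?
under_30_subscribed = ((sample_df['age'] < 30) & (sample_df['y'] == 'yes')).sum()

# Average number of contacts during the campaign, per label
avg_contacts = sample_df.groupby('y')['campaign'].mean()

# Customers contacted most often before this campaign
most_contacted_before = sample_df.sort_values('previous', ascending=False).head(2)

print(under_30_subscribed)
print(avg_contacts)
print(most_contacted_before)
```

Each answer is easy to get on its own, but combining many of them into a reliable set of rules by hand quickly becomes unwieldy.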



Let's look at a visual representation of the above. The following plot depicts customers who subscribed (purple) and those who didn't (blue). Our two axes correspond to two features. For example, one could be `age` and the other `campaign`.

![image.png](attachment:ad390314-b44e-4db3-82b1-3eb03e061014.png)

Each customer is a data point in a 2-dimensional feature space and is defined by two numerical values. If we know the age of a customer and how many times they were contacted during the campaign, we can locate that point in that space.

The proximity of those customers in the feature space can tell us how similar they are to one another in relation to their label. For example, let's say that 3 out of 5 customers who are 30 to 32 years old and were contacted 2 to 4 times during the campaign subscribed to the product. In the plot, the data points for those customers would be relatively close to one another. We could say that customers within that age and campaign range of values are more likely to subscribe to the bank's product.

That's the kind of rule we could develop through our analysis and by looking at the data points in the feature space.

![image.png](attachment:9f615637-a208-4d5b-83ab-7b1f51e41231.png)

What if we add another customer (blue dot) to our feature space above?

![image.png](attachment:d4b03bcc-7d18-453e-bcc0-6ba07c7f550c.png)

How can we predict if this new customer is going to subscribe, given just those two features?

With what we learned above, we can calculate the distance of that blue dot from all the other points and look at the ones closest to it. If a majority of the points closest to it are purple, we can classify the new point as purple. If they are blue, we can classify it as blue.
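To make the idea of "distance" concrete, here is a minimal sketch that assumes Euclidean distance (the text above doesn't commit to a specific metric) and uses a handful of made-up customers described by `age` and `campaign`. The names `points`, `labels`, and `new_customer` are hypothetical.

```python
import numpy as np

# Hypothetical customers as (age, campaign) pairs with labels
# (1 = subscribed, 0 = did not); the values are made up.
points = np.array([[30, 2], [31, 4], [33, 3], [45, 1], [50, 6]], dtype=float)
labels = np.array([1, 1, 0, 0, 0])

new_customer = np.array([31, 2], dtype=float)

# Euclidean distance from the new customer to every existing point
distances = np.sqrt(((points - new_customer) ** 2).sum(axis=1))

# Indices of the three closest points, nearest first
closest = np.argsort(distances)[:3]

print(distances)
print(labels[closest])
```

Two of the three closest points in this toy example have the label 1, so a majority vote would classify the new customer as a subscriber.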

![image.png](attachment:99dd64e4-efea-4d69-974d-05dc6d14b2de.png)

By looking at how closely related those data points are in the context of their labels, we are allowing rules like the ones we mentioned above to develop on their own. This is the K-Nearest Neighbors algorithm.

1. For an unseen data point, the algorithm calculates the distance between that point and all the observations across all features in the training dataset.

2. It sorts those distances in ascending order.

3. It selects $K$ observations with the smallest distances from the above step. These $K$ observations are the K-nearest neighbors of that unseen data point.

Note that $K \geq 1$, and there should be at least $K$ observations in the dataset.

4. It determines which label is the most common among those neighbors and assigns that label to the unseen data point.
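As a preview, the steps above can be sketched in a few lines of NumPy. This is only a rough sketch under stated assumptions, not the implementation this lesson builds: the function name `knn_predict`, its parameters, and the toy data are hypothetical, and Euclidean distance is assumed as the metric.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    if not 1 <= k <= len(X_train):
        raise ValueError("k must be between 1 and the number of training observations")

    # Step 1: distance from x_new to every training observation (Euclidean, assumed)
    distances = np.sqrt(((X_train - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))

    # Steps 2-3: sort ascending and keep the indices of the k smallest distances
    nearest = np.argsort(distances, kind="stable")[:k]

    # Step 4: most common label among the k neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Made-up (age, campaign) pairs with labels: 1 = subscribed, 0 = did not
X = [[30, 2], [31, 4], [33, 3], [45, 1], [50, 6]]
y = [1, 1, 0, 0, 0]
prediction = knn_predict(X, y, [31, 2], k=3)
print(prediction)
```

In this sketch, a tie between labels goes to whichever tied label appears first among the nearest neighbors; choosing an odd `k` for a two-class problem avoids ties altogether.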

Before we implement the algorithm, let's prepare our data.

## Data Preparation

When we explored our data, we noticed that our target column, `y`, stores the labels as the strings `yes` and `no`. While those are reasonable categories and we could continue working with them as is, we'll encode them as numbers: 0 for `no` and 1 for `yes`.

In the previous lesson, we learned how to split the dataset into a training and test set. Instead of using scikit-learn's `train_test_split()` function, we'll implement the split ourselves. We'll opt for an 85-15% split.

To split the dataset, we could take the direct approach of selecting the first N observations as the training set and the rest as the test set. But that poses a problem: we don't know how many of those N observations have a label of 0 and how many have a label of 1.

Let's say N = 100. What if, out of those 100, only 5 observations had a label of 1?

When the dataset is imbalanced, a machine learning model might struggle to accurately predict the labels because it hasn't had enough information to learn to distinguish between the classes. Ideally, the model should have enough data corresponding to each class so it can learn from the data effectively.

Even though our dataset has a reasonably balanced class distribution, we need to make sure that both the train and test sets have a similar percentage of subscribed customers.

The data collection process can also introduce certain biases. It's possible that the clients were selected in a specific order. For example, the collection process could've added the newest clients first. If we were to select the first N observations, we could be introducing bias into our model. That's why, when creating our training and test sets, randomly selecting observations is important, as it can help reduce any such biases.

Fortunately for us, this isn't complicated to implement in pandas.

In [10]:
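# Encode the target: 1 for 'yes', 0 for 'no'
banking_df['y'] = banking_df['y'].apply(lambda x: 1 if x == 'yes' else 0)

# Randomly sample 85% of the rows for the training set
train_df = banking_df.sample(frac=0.85, random_state=417)
# The remaining 15% of the rows form the test set
test_df = banking_df.drop(train_df.index)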

In [11]:
banking_df['y'].value_counts(normalize=True) * 100

y
0    54.159257
1    45.840743
Name: proportion, dtype: float64

In [12]:
train_df['y'].value_counts(normalize=True) * 100

y
0    54.009763
1    45.990237
Name: proportion, dtype: float64

In [13]:
test_df['y'].value_counts(normalize=True) * 100

y
0    55.006588
1    44.993412
Name: proportion, dtype: float64
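The random split above keeps the class percentages close to those of the full dataset, but it doesn't guarantee them. If we wanted the proportions to match as closely as possible, one option is to sample within each label group. The sketch below assumes pandas 1.1 or later (for `groupby(...).sample()`) and uses a made-up DataFrame; `df`, `train`, and `test` are hypothetical names.

```python
import pandas as pd

# Hypothetical data with a 60/40 label distribution; the values are made up.
df = pd.DataFrame({
    'age': range(100),
    'y': [0] * 60 + [1] * 40,
})

# Sample 85% of the rows within each label group
train = df.groupby('y').sample(frac=0.85, random_state=417)
# Everything not sampled goes to the test set
test = df.drop(train.index)

print(train['y'].value_counts(normalize=True) * 100)
print(test['y'].value_counts(normalize=True) * 100)
```

Because each label group is sampled separately, both sets in this toy example keep the original 60/40 split.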