##  Introduction to the Dataset
In the previous lesson, we looked at the machine learning workflow and trained a simple classifier to predict if a patient has breast cancer. We learned how we can quickly prototype a machine learning model and experiment with it to get reasonable results.
While there is a benefit to being able to quickly experiment and iterate, not understanding how the algorithm for a model actually works can impact the outcome of those random experiments.

In this lesson, we'll learn a different machine learning algorithm and implement it from scratch. We'll use it to build and train a classifier that can predict whether a bank customer will subscribe to a term deposit or not.We'll use a modified version of the Bank Marketing Dataset https://archive.ics.uci.edu/ml/datasets/bank+marketing. It contains data on customers of a Portuguese banking institution that ran marketing campaigns to assess whether customers would subscribe to their product. The dataset consists of 21 columns, including the target variable:
- age: (numeric)
- job: type of job (categorical: 'admin.','blue collar','entrepreneur','housemaid','management','retired','self employed','services','student','technician','unemployed','unknown')
- marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- education: (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- default: has credit in default? (categorical: 'no','yes','unknown')
- housing: has housing loan? (categorical: 'no','yes','unknown')
- loan: has personal loan? (categorical: 'no','yes','unknown')
- contact: contact communication type (categorical: 'cellular','telephone')
- month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
- day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- duration: last contact duration, in seconds (numeric).
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
- emp.var.rate: employment variation rate - quarterly indicator (numeric)
- cons.price.idx: consumer price index - monthly indicator (numeric)
- cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- euribor3m: euribor 3 month rate - daily indicator (numeric)
- nr.employed: number of employees - quarterly indicator
- y: has the client subscribed a term deposit? (binary: 'yes','no')





In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
subscription=pd.read_csv('subscription_prediction.csv')
subscription.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [3]:
subscription.shape

(10122, 21)

In [4]:
subscription.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [5]:
subscription.info

<bound method DataFrame.info of        age          job   marital            education  default housing loan  \
0       40       admin.   married             basic.6y       no      no   no   
1       56     services   married          high.school       no      no  yes   
2       41  blue-collar   married              unknown  unknown      no   no   
3       57    housemaid  divorced             basic.4y       no     yes   no   
4       39   management    single             basic.9y  unknown      no   no   
...    ...          ...       ...                  ...      ...     ...  ...   
10117   64      retired  divorced  professional.course       no     yes   no   
10118   37       admin.   married    university.degree       no     yes   no   
10119   73      retired   married  professional.course       no     yes   no   
10120   44   technician   married  professional.course       no      no   no   
10121   74      retired   married  professional.course       no     yes   no   

       

In [6]:
subscription.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

In [7]:
subscription['y'].value_counts()

no     5482
yes    4640
Name: y, dtype: int64

In [8]:
subscription.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0,10122.0
mean,40.313673,373.414049,2.369789,896.476882,0.297471,-0.432671,93.492407,-40.250573,3.035134,5138.838975
std,11.855014,353.277755,2.472392,302.175859,0.680535,1.714657,0.628615,5.271326,1.884191,85.859595
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,31.0,140.0,1.0,999.0,0.0,-1.8,92.963,-42.7,1.252,5076.2
50%,38.0,252.0,2.0,999.0,0.0,-0.1,93.444,-41.8,4.076,5191.0
75%,48.0,498.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.959,5228.1
max,98.0,4199.0,42.0,999.0,6.0,1.4,94.767,-26.9,5.045,5228.1


## K-Nearest Neighbors (k-NN) I
***Upon exploring the data, we discovered that the dataset has:***

-  observations, 20 features and 1 target variable.
- No missing values in the dataset.
- 5482 customers who didn't subscribe and 4640 who did.
- 10 categorical columns and 10 numeric columns, excluding the target column.
**If we explore our dataset further, we could ask several questions to better analyze it. For example:**
- How many customers under the age of 30 subscribed to the product?
- Were the customers who subscribed contacted more often than those who weren't during the marketing campaign?
- Which customers were contacted more often before this campaign?

We could potentially answer each of these questions ourselves and develop a complex set of rules that could tell us which customers are likely to subscribe given all the features available to us.

Let's look at a visual representation of the above. The following plot depicts customers who subscribed (purple) and those who didn't (blue). Our two axes correspond to two features. For example, one could be age and another campaign.

<img src='scatter.svg' width=500 height=500>

Each customer is a data point in a 2-dimensional feature space and is defined by two numerical values. If we know the age of a customer and how many times they were contacted during the campaign, we can locate that point in that space.

The proximity of those customers in the feature space can tell us how similar they are to one another in relation to their label. For example, let's say that 3 out of 5 customers who are 30 to 32 years old and were contacted 2 to 4 times during the campaign subscribed to the product. In the plot, the data points for those customers would be relatively close to one another. We could say that customers within that age and campaign range of values are more likely to subscribe to the bank's product.
That's the kind of rule we could develop through our analysis and by looking at the data points in the feature space.
<img src='scatter2.svg' width=400 height=400>

What if we add another customer (blue dot) to our feature space above?
## K-Nearest Neighbors (k-NN) II
<img src='scatter3.svg' width=400 height=400>

How can we predict if this new customer is going to subscribe, given just those two features?

With what we learned above, we can calculate the distance of that blue dot from all the other points and look at the ones closest to it. If a majority of the points closest to it are purple, we can classify the new point as purple. If they are blue, we can classify it as blue.
<img src='scatter4.svg' width=400 height=400>

By looking at how closely-related those data points are in context of their labels, we are allowing those rules, like the ones we mentioned above, to develop on their own. This is the K-Nearest Neighbors algorithm.
- 1 For an unseen data point, the algorithm calculates the distance between that point and all the observations across all features in the training dataset.

- 2 It sorts those distances in ascending order.

- 3 It selects K observations with the smallest distances from the above step. These K
  observations are the K-nearest neighbors of that unseen data point.

`Note that there should be at least K ≥ 1` observations in the dataset.
- 4 It calculates which labels of those neighbors is the most common, and assigns that label to the unseen data point.

Before we implement the algorithm, let's prepare our data.

## Data Preparation
When we explored our data, we noticed that our target column, y, stores the labels as yes or no strings. While those are reasonable categories and we can continue working with them as is, we'll encode those strings as the numbers 0 for no and 1 for yes.

In the previous lesson, we learned how to split the dataset into a training and test set. Instead of using scikit-learn's train_test_split() function, we'll implement the split ourselves. We'll opt for a 85-15% split.
In order to split the dataset, we could take a direct approach of selecting the first N observations as the training set and the rest as the test set. But that poses a problem. We don't know how many observations of those N have a label of 0 and how many have a label of 1.

Let's say N = 100. What if, out of those 100, only 5 observations had a label of 1?

When the dataset is imbalanced, a machine learning model might struggle to accurately predict the labels because it hasn't had enough information to learn to distinguish between the classes.Ideally, the model should have enough data corresponding to each class so it can learn from the data effectively.

Even though our dataset has a reasonably balanced class distribution, we need to make sure that both the train and test sets have a similar percentage of subscribed customers.
The data collection process can also introduce certain biases. It's possible that the clients were selected in a specific order. For example, the collection process could've added the newest clients first. If we were to select the first N observations, we could be introducing bias into our model. That's why, when creating our training and test sets, randomly selecting observations is important, as it can help reduce any such biases.

Fortunately for us, this isn't complicated to implement in pandas.

In [9]:
for col in subscription.index:
    sub=subscription['y'][col]
    if sub=='yes':
        sub=1
    else:
        sub=0
    subscription['y'][col]=sub

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  subscription['y'][col]=sub


In [10]:
subscription['y'].value_counts()

0    5482
1    4640
Name: y, dtype: int64

In [11]:
subscription.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [12]:
#85% of the data 
eight_fiveper=subscription.shape[0]*85/100
eight_fiveper

8603.7

In [13]:
#randomly select 85% of the data
train_df=subscription.sample(frac=0.85,random_state=417 )
test_df=subscription.drop(train_df.index)

In [14]:
train_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
7472,38,technician,married,university.degree,no,yes,no,cellular,may,tue,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.266,5099.1,0
3408,55,retired,married,high.school,no,yes,yes,cellular,jul,tue,...,3,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228.1,0
5851,29,technician,single,university.degree,no,no,yes,cellular,apr,mon,...,1,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099.1,0
9132,80,retired,divorced,basic.4y,no,yes,yes,cellular,apr,wed,...,2,999,0,nonexistent,-1.8,93.749,-34.6,0.642,5008.7,1
3790,45,blue-collar,married,basic.4y,no,no,no,cellular,aug,fri,...,2,999,0,nonexistent,1.4,93.444,-36.1,4.966,5228.1,1


In [15]:
train_df.shape

(8604, 21)

In [16]:
test_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
5,54,blue-collar,divorced,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
8,38,admin.,single,professional.course,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
11,45,services,married,high.school,unknown,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
19,31,admin.,divorced,high.school,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
22,37,management,married,university.degree,no,no,no,telephone,may,mon,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [17]:
test_df.shape

(1518, 21)

In [18]:
x_train=train_df.drop(columns=['y'],axis=1)
x_test=test_df.drop(columns=['y'],axis=1)
y_train=train_df['y']
y_test=test_df['y']

## k-NN for One Feature
Now that we have our training and test sets, we can implement our algorithm!

Before we begin, we need to select a **distance metric** to calculate the distance between observations.One of the most common distance metrics is the Euclidean distance https://en.wikipedia.org/wiki/Euclidean_distance. The Euclidean distance between two observations (x1 to xn)  and (y1 to yn)

<img src='euclidean.png' width=350 height=350>


- xi is the value for a feature for one observation, and
- yi is the value for the same feature for another observation.

When we are working with only one feature, the above formula simplifies to:
<img src='onefeature.png' width=350 height=350>

We'll implement our algorithm with this distance metric. You're encouraged to learn about some of the other distance metrics that can be used as well:
- Manhattan distance https://en.wikipedia.org/wiki/Taxicab_geometry
- Minkowski distance https://en.wikipedia.org/wiki/Minkowski_distance
- Hamming distance  https://en.wikipedia.org/wiki/Hamming_distance

As per the algorithm, we'll calculate how far an unseen observation is from all the observations in the training data for a specific feature.

But wait a minute. We learned that a machine learning model is trained over some data and then the trained model can be used to classify unseen data. Why are we using an unseen observation to implement the k-NN algorithm? Aren't we training the model using unseen data?

K-nearest neighbors is a bit of a unique case. It works at the time of prediction and it doesn't technically have a "training phase." The model classifies every new input by comparing it to its neighbors. Those neighbors are from the training set.

As a result, the algorithm can be time-consuming depending on the number of observations and features in our dataset.

So why did we have to create a separate test set?

So why did we have to create a separate test set?

Even though there's no training phase, the algorithm does rely on an unseen observation to be able to make a prediction. In the next exercise, we'll use a random value from our test set to implement the algorithm.

In [19]:
def Knn(feature, test_input, k):
    """Calculate the Euclidean distance between the test_input and every observation 
    in X_train for the given feature. 
    Save the distances in a new column, distance, in X_train."""
    x_train['distance']=np.abs(x_train[feature]- test_input[feature])
    
    prediction=y_train[x_train['distance'].nsmallest(n=k).index].mode()[0]
    return prediction
    

In [20]:
model_prediction=Knn('age', x_test.iloc[417],3)

print('predicted label:' ,model_prediction)
print('actual label:' ,y_test.iloc[417])

predicted label: 0
actual label: 0


## Evaluating the Model
In the previous screen, we implemented the algorithm and used it to classify a sample input from the test set.

We'll now evaluate our model's performance by calculating how accurately it is able to correctly classify a given value. We can calculate the accuracy by comparing how many predictions the model gets correct compared to the total number of predictions.

<img src='accuracy.png' width=350 height=350>
In order to do that, we'll classify every data point in our test set and compare the predictions to the actual labels for those data points.

In [21]:
x_test['age_predicted_y']= Knn('age', x_test.iloc[417], 3)

In [22]:
model_accuracy = (x_test["age_predicted_y"] == y_test).value_counts(normalize=True)[0]*100
print(f"Accuracy of model trained on the column 'age': {model_accuracy:.2f}%")

x_test["campaign_predicted_y"] = x_test.apply(lambda x: Knn("campaign", x, 3), axis=1)

model_accuracy = (x_test["campaign_predicted_y"] == y_test).value_counts(normalize=True)[0]*100
print(f"Accuracy of model trained on the column 'campaign': {model_accuracy:.2f}%")

Accuracy of model trained on the column 'age': 44.99%
Accuracy of model trained on the column 'campaign': 44.86%


## Feature Engineering I
For a single feature, age, our model got an accuracy of ~46%! It's not performing too well.

Fortunately, we don't have to rely on using just one feature at a time. We can use multiple at the same time, which might improve our model's performance.

When building machine learning models, we'll often have to transform features so they can be effectively used to train models and yield better performance. The process of transforming those features is called feature engineering. **Feature engineering** will often be part of the data preparation step of the machine learning workflow.
We'll work through a couple of commonly used techniques in this lesson.

Both age and campaign are numerical features. There are several categorical variables in our data set as well. However, all the categorical variables in our data set contain string values.

We can't calculate the distance between two strings. That's another limitation of the k-nearest neighbor algorithm. We can work around it without much effort.
Just like we encoded the yes and no values as 0 and 1 for our target variable, we can encode our categorical features.

What values could we use for the categories in the marital column, for example? Maybe 1 to 4 for all 4 categories?
While that sounds like a good approach, it's not suitable in practice.

We're calculating the distance between the feature(s) of two observations. By assigning each category a unique number, we're inadvertently turning the feature into an ordinal variable. The distance between a 4 and a 1 is higher than that between a 1 and a 2, even though there is no inherent order or rank to those categories.

An alternative approach is to convert each category into its own column. For every observation in each column, we set a binary value, 0 or 1, for that observation, depending on how the observation was originally categorized.

This is what the outcome would look like:

<img src='marital.png' width=400 heigt=400>

Each row above represents an observation. The Marital column lists the category for each observation. The rest of the columns store a 0 or 1, depending on the category for that observation.

This process is known as **one-hot encoding**. The new columns that get created in this process are called **dummy variables**.

There are two ways we can use one-hot encoding. We discussed the first one above. An alternative to that is to set all values for one of the categories, like Unknown, in the dummy variables as 0 instead of its having its own separate column.

In [23]:
bank_df=subscription.copy()

In [24]:
#get dummies of the marital column
bank_dummies=pd.get_dummies(bank_df['marital'], drop_first=True)

In [25]:
bank_dummies

Unnamed: 0,married,single,unknown
0,1,0,0
1,1,0,0
2,1,0,0
3,0,0,0
4,0,1,0
...,...,...,...
10117,0,0,0
10118,1,0,0
10119,1,0,0
10120,1,0,0


##  k-NN for Multiple Features
We'll now implement the k-nearest neighbor algorithm for multiple features.
We'll modify our knn() function to account for this and select the following features for the model:

- age
- campaign
- marital_married
- marital_single

In [26]:
bank_df=pd.concat([bank_dummies, bank_df], axis=1)

In [27]:
bank_df.shape

(10122, 24)

In [28]:
#randomly select 85% of the data
train_df=bank_df.sample(frac=0.85,random_state=417 )
test_df=bank_df.drop(train_df.index)

In [29]:
x_train=train_df.drop(columns=['y'],axis=1)
x_test=test_df.drop(columns=['y'],axis=1)
y_train=train_df['y']
y_test=test_df['y']

In [30]:
def Knn(features, test_input,k):
    """trains a knn model using multiple features and uses the euclidean distance metric,returns
    predictions
    """
    squared_distance=0
    for feature in features:
        squared_distance +=(x_train[feature] - test_input[feature])**2
        x_train['distance']=squared_distance**0.5
        
        predictions= y_train[x_train['distance'].nsmallest(n=k).index].mode()[0]
    return predictions
    
    
    

In [31]:
model_prediction=Knn(["age", "campaign", "married", "single"], x_test.iloc[417], 3)

In [32]:
print('predicted label', model_prediction)
print('Actual label', y_test.iloc[417])


#apply it on every row in x_test
x_test["predicted_y"] =Knn(["age", "campaign", "married", "single"], x_test, 3)

predicted label 0
Actual label 0


In [33]:
x_test

Unnamed: 0,married,single,unknown,age,job,marital,education,default,housing,loan,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,predicted_y
5,0,0,0,54,blue-collar,divorced,basic.4y,no,no,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
8,0,1,0,38,admin.,single,professional.course,no,no,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
11,1,0,0,45,services,married,high.school,unknown,yes,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
19,0,0,0,31,admin.,divorced,high.school,no,no,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
22,1,0,0,37,management,married,university.degree,no,no,no,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10092,1,0,0,31,services,married,high.school,no,yes,no,...,3,999,0,nonexistent,-1.1,94.767,-50.8,1.040,4963.6,0
10099,0,1,0,35,admin.,single,professional.course,no,yes,no,...,3,999,0,nonexistent,-1.1,94.767,-50.8,1.040,4963.6,0
10105,0,0,0,35,technician,divorced,basic.4y,no,yes,no,...,1,9,4,success,-1.1,94.767,-50.8,1.035,4963.6,0
10106,1,0,0,33,admin.,married,university.degree,no,no,no,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.035,4963.6,0


In [34]:
#calculate the accuracy
model_accuracy = (x_test["predicted_y"] == y_test).value_counts(normalize=True)[0]*100
print(f"Accuracy of the model: {model_accuracy:.2f}%")

Accuracy of the model: 44.99%


In [35]:
x_test["predicted_y"].value_counts()

0    1518
Name: predicted_y, dtype: int64

In [36]:
y_test.value_counts()

0    835
1    683
Name: y, dtype: int64

## Feature Engineering II
In the previous screen, we calculated the distance between multiple features. Let's look at how the distance column we created is distributed in two different situations.

If we only used the features age and campaign, the distance column, when calculated using a single data point, would have the following distribution:

<img src='hist.png' width=400 height=400>

Now, what if we used age and nr.employed?
<img src='hist2.png' width=400 height=400>

The distance calculated using age and nr.employed has a lot of variation in its values. There are distance values in the range 125 to 160 and also in the range 200 to 280. In contrast, most of the distance values for age and campaign are between 0 and 20.
If we go back to our summary statistics, we notice that the maximum value in nr.employed is 5228, while the maximum of age is only 98. Any distance calculated using the two will result in a large value because of nr.employed.
That wouldn't yield a fair estimate of the similarity between two observations. Especially when we add more features into the mix. One feature will continue to have a larger contribution to the distance calculation, and that could negatively impact our model's performance.
In order to address this, we can **normalize** our features by rescaling their values to a specific range. One common approach is to normalize the features to the range [0, 1]; this is called **min-max scaling or min-max normalization.**

We can scale our features this way:

<img src='minmax.png' width=400 height=400>



Where x is the original value of the feature.

In this final exercise, we'll normalize our age and campaign features and implement the algorithm using the same features as before.

Note that the distance values calculated using age and campaign do not vary drastically. Even with normalization, our model might not show a significant improvement.

In [37]:
#normalize the data
def Knn(features, test_input, k):
    for feature in features:
        x_train[feature]=(x_train[feature]-x_train[feature].min())/(x_train[feature].max()-x_train[feature].min())
        
        x_test[feature]=(x_test[feature]-x_test[feature].min())/(x_test[feature].max() - x_test[feature].min())
        
x_test['predicted_model']=Knn(["age", "campaign", "married", "single"], x_test, 3)

model_accuracy = (x_test["predicted_y"] == y_test).value_counts(normalize=True)[0]*100
print(f"Accuracy of the model: {model_accuracy:.2f}%")
        

Accuracy of the model: 44.99%


## Review 
Our model's performance didn't improve significantly. In fact, our model's accuracy is less than 50%. If we were to randomly guess a particular data point's class, we would be correct 50% of the time since there are only two classes. Our model is performing worse than a random guess right now.

That's ok! In future lessons we'll learn how we can improve upon our model. For now, it's important to remember that building, training and fine-tuning machine learning models is an iterative process

In this lesson, we learned:

- What the K-Nearest Neighbors algorithm is.
- How to implement it from scratch, both using a single feature and using multiple features.
#### About feature engineering, specifically:
- Encoding our categorical variables into numerical values, and
- Normalizing our features using min-max scaling.
In the next lesson, we'll learn more about evaluating our models.

## Evaluating Model Performance
In the previous lesson, we implemented the K-Nearest Neighbors algorithm and trained a classifier to predict if a bank client would subscribe to the bank's product.

In this lesson, we'll build and train the classifier using scikit-learn and try to improve upon our previous model's performance.

### Validation Set
We learned in the previous lesson that our data set doesn't contain any missing values. We don't currently need to wrangle our data any further. We can move on to preparing it for training our model by splitting it into training and test sets.

Our goal with training our model is to see how well it can perform on the test set or on unseen data. However, if we repeatedly evaluate the model on the test set and re-train it, we are introducing bias.Our model will start to indirectly learn from our test set.

In this situation, we won't be able to effectively judge how well our model performs on data it hasn't seen before.

That's why we need a buffer between our training and test sets. We want to evaluate our model and improve upon it without having to use the test set. We'll create a validation set, sometimes referred to as a development set or dev set.

We can then train our model and evaluate it on the **validation set**. Depending on its performance on the validation set, we can re-train it with some tweaks and evaluate it again.

Once we're satisfied with the model's performance on the validation set, we can evaluate it one last time on the test set.


We'll split that data set into three parts:

- Training Set (60%)
- Validation Set (20%)
- Test Set (20%)

In [38]:
subscription.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [39]:
#split the data into train and validation set
from sklearn.model_selection import train_test_split

x=subscription.drop(columns=['y'])
y=subscription['y']

x_train,x_val,y_train,y_val= train_test_split(x,y,test_size=0.2, random_state=417)

In [40]:
#split the data into train and test set.
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.2*x.shape[0]/x_train.shape[0] ,
                                                random_state=417)

In [41]:
x.shape

(10122, 20)

In [42]:
x_train.shape

(7591, 20)

##  Building and Training a k-NN
Now that we have our training set, we can build a classifier and fit the model to the data.

We learned in the first lesson that fitting a model is the same as training the model. The model learns from the data. However, when we implemented our model from scratch in the previous lesson, we learned that k-NNs don't really have a training phase.

So how would we use scikit-learn to "fit" our model?
When we implemented a k-NN from scratch, we calculated the distance between observations. Across a large number of features and observations, this can be a computationally expensive task. Instead of a brute force approach, we can use different data structures to help speed that up.

scikit-learn uses the training phase to set up such a data structure. For different algorithms, scikit-learn handles the training phase differently. This again brings up the point of experimenting without understanding an algorithm's inner workings. We wouldn't have learned about this distinction if we hadn't implemented a k-NN from scratch!

In [43]:
from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=6)

#fit the model
knn.fit(x_train, y_train)

ValueError: could not convert string to float: 'unknown'

## Feature Engineering
This is because we have categorical columns in our dataset that haven't yet been converted into dummy variables. Since a k-NN uses a distance metric, it can't work with string data.
Convert all our categorical columns into dummy variables.
Normalize the features by scaling their values to the range [0, 1]
In the previous lesson, we implemented the normalization from scratch. Now we'll use scikit-learn's MinMaxScaler method https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.

From a functional perspective, it works similar to the way we'd instantiate a model in scikit-learn and then call fit() on the data.
However, fitting to the data is just the first step. sklearn.preprocessing.MinMaxScaler.fit() calculates the minimum and maximum values for each feature we input. We then need to transform those features, using sklearn.preprocessing.MinMaxScaler.transform() to normalize those features.

scikit-learn provides us with a single function, sklearn.preprocessing.MinMaxScaler.fit_transform(), that allows us to carry out both operations

In [44]:
subscription.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,57,housemaid,divorced,basic.4y,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,39,management,single,basic.9y,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [45]:
subscription.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [46]:
subscription['y']=subscription['y'].astype('category')

In [47]:
subscription['y'].dtype

CategoricalDtype(categories=[0, 1], ordered=False)

In [48]:
subscription['default'].unique()

array(['no', 'unknown'], dtype=object)

In [49]:
dummies= pd.get_dummies(subscription[['marital','default']], drop_first=True)


In [50]:
df=pd.concat([dummies,subscription], axis=1)

In [51]:
df.head()

Unnamed: 0,marital_married,marital_single,marital_unknown,default_unknown,age,job,marital,education,default,housing,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,1,0,0,0,40,admin.,married,basic.6y,no,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,1,0,0,0,56,services,married,high.school,no,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,1,0,0,1,41,blue-collar,married,unknown,unknown,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,0,0,0,0,57,housemaid,divorced,basic.4y,no,yes,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,0,1,0,1,39,management,single,basic.9y,unknown,no,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [52]:
x=df[['marital_married','marital_single','default_unknown']]
y=df[['y']]

from sklearn.model_selection import train_test_split

x_train,x_val,y_train,y_val= train_test_split(x,y, test_size=0.2, random_state=417)
x_train,x_test,y_train,y_test= train_test_split(x,y, test_size=0.2*x.shape[0]/x_train.shape[0],
                                               random_state=417)

In [53]:
from sklearn.preprocessing import MinMaxScaler

#create an instance of it
scaler=MinMaxScaler()

x_train_scaled= scaler.fit_transform(x_train)


In [54]:
x_val_scaled= scaler.fit_transform(x_val)

##  Evaluating the Model on Validation Set
We can now build and train our model again. On this screen, we'll also evaluate our model on our validation set.

Since we transformed some of the features in our training data in the previous screen, we need to make sure we transform those same features in our validation set.

We don't need to use sklearn.preprocessing.MinMaxScaler.fit() again. Our scaler has already "learned" how to scale the training data and we can directly transform our validation (or test) data set using the already-defined scaler

In [55]:
from sklearn.neighbors import KNeighborsClassifier

#create an instance
knn=KNeighborsClassifier(n_neighbors=1)

knn.fit(x_train_scaled,y_train)

#validation score
Val_accuracy=knn.score(x_val_scaled,y_val)
Val_accuracy


  return self._fit(X, y)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


0.5076543209876543

In [56]:
#train another model

knn=KNeighborsClassifier(n_neighbors=2000)

knn.fit(x_train_scaled, y_train)

#validation score
val_accuracy=knn.score(x_val_scaled, y_val)
val_accuracy

  return self._fit(X, y)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


0.5560493827160494

### Underfitting and Overfitting
In the previous screen, selecting two drastically different values for 
K
 results in different model accuracies. One performs much better than the other.

<img src='knn.svg' width=500 height=500>

If k = 1, the model will only look at the closest neighbor to classify the new data point. If we had 100s of new data points, each one would only look at its closest neighbor only. As a result, instead of the classifier having a decision boundary that might look like the following..

<img src='cla.svg' width=400 height=400>

it might now look like this.
<img src='knn2.svg' width=400 height=400>

We get an overly complex boundary that tries to classify each point because the model is struggling to generalize the data. It starts to memorize specific aspects or features of the data.

What would happen if we evaluated the model on new data? It would likely result in poor performance because the model might not fit the test set as well as it did the training set. The model, in this situation, is overfitting to the training data.

On the other hand, for large values of K , we might get a smoother decision boundary:

<img src='largek.svg' width=400 height=400>

In this situation, the model struggles to represent the data well enough, leading to relatively poor performance. The model is **underfitting.** It doesn't have enough complexity to reasonably capture relevant patterns or insights from the training set. This impacts the model's performance when evaluated on the test set.

Achieving balance and a building a model that generalizes the data well can be a difficult task. If we look at our validation accuracies, we can't tell if our model is overfitting or underfitting. We'll explore this in more depth and learn to tackle this problem when we work with more complicated models.

###  Evaluating the Model on Test Set
`Experiment with different features and values for K
.

1.1. Evaluate the model's performance on the validation set for the different combinations.

1.2. Identify features and values for K
 that result in a reasonably good (to you) accuracy value.

Using the features and values for K
 identified in the previous step, evaluate the model on the test set
 Normalize the test set before evaluating the model.

2.2. The model should be evaluated on the test set only once.`

In [57]:
#normalize the test set
x_test_scaled=scaler.fit_transform(x_test)

#evaluate on test set
test_accuracy=knn.score(x_test_scaled, y_test)
test_accuracy

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


0.5527459502173054

### Review
In this lesson, we learned:

How to build and train the K-Nearest Neighbors algorithm using scikit-learn.
What a validation set is used for.
What overfitting and underfitting are.
In the next lesson, we will learn more about improving our model's performance.

## Hyperparameter Optimization
We previously built and trained a K-Nearest Neighbors classifier using scikit-learn. We also learned about the importance of a validation set.

Now we'll learn how to improve a machine learning model's performance.

In [58]:
#convert all features to dummy
dummies=pd.get_dummies(data=df, drop_first=True)

In [59]:
dummies

Unnamed: 0,marital_married,marital_single,marital_unknown,default_unknown,age,duration,campaign,pdays,previous,emp.var.rate,...,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_nonexistent,poutcome_success,y_1
0,1,0,0,0,40,151,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
1,1,0,0,0,56,307,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
2,1,0,0,1,41,217,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
3,0,0,0,0,57,293,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
4,0,1,0,1,39,195,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10117,0,0,0,0,64,151,3,999,0,-1.1,...,1,0,0,0,0,0,0,1,0,0
10118,1,0,0,0,37,281,1,999,0,-1.1,...,1,0,0,0,0,0,0,1,0,1
10119,1,0,0,0,73,334,1,999,0,-1.1,...,1,0,0,0,0,0,0,1,0,1
10120,1,0,0,0,44,442,1,999,0,-1.1,...,1,0,0,0,0,0,0,1,0,1


In [60]:
dummies.columns

Index(['marital_married', 'marital_single', 'marital_unknown',
       'default_unknown', 'age', 'duration', 'campaign', 'pdays', 'previous',
       'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m',
       'nr.employed', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_married', 'marital_single', 'marital_unknown',
       'education_basic.6y', 'education_basic.9y', 'education_high.school',
       'education_illiterate', 'education_professional.course',
       'education_university.degree', 'education_unknown', 'default_unknown',
       'housing_unknown', 'housing_yes', 'loan_unknown', 'loan_yes',
       'contact_telephone', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_mon', 'day_of_week_thu', 'day_of_week_

### Feature Selection
We previously either used all the features in our dataset to build and train a model or randomly selected a handful of them.

While both can be appropriate approaches to train a model, they aren't necessarily the best ones. We've already observed how our models didn't always perform as well as we'd hoped in either of those scenarios.Not all features in a dataset might be relevant to a model's performance. Identifying and removing such features in the data preparation step, before training a model, can not only boost its performance, but also reduce the computational cost. The latter is especially important when we have to work with large datasets and complex machine learning models.

**Random selection.**
- We've utilized this in the previous lesson by selecting features at random.

**Domain Expertise.**
- For example, one of the features is euribor3m. A reasonable understanding of what Euribor is could inform of us whether it is likely to have any impact on the prediction.

**Identifying features that are strongly correlated to our target variable**
We'll learn about other approaches later. For now, we'll calculate the Pearson Correlation Coefficient on our columns to identify which features are strongly correlated to the target variable.

We could also plot the heat map for those values to make it easier to identify those features. Since the categorical columns of our dataset have been one-hot encoded, we have over fifty features in our dataset right now. Creating a heat map using all those pairs will make it difficult to identify correlations

In [61]:
#get the corellating coefficinets of each feature against the taregt variable
c_coef=dummies.corr()
c_coef

Unnamed: 0,marital_married,marital_single,marital_unknown,default_unknown,age,duration,campaign,pdays,previous,emp.var.rate,...,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_nonexistent,poutcome_success,y_1
marital_married,1.0,-0.789171,-0.057211,0.125024,0.298633,0.000678,0.019554,0.033757,-0.036281,0.085846,...,-0.004659,-0.00049,-0.003359,0.02839,-0.01116,-0.012167,0.001498,0.039735,-0.032894,-0.062697
marital_single,-0.789171,1.0,-0.032784,-0.127203,-0.447032,0.000674,-0.019207,-0.042416,0.041814,-0.105098,...,-0.02307,0.012975,0.011346,-0.028571,0.01599,0.004052,0.003785,-0.046029,0.040089,0.074975
marital_unknown,-0.057211,-0.032784,1.0,-0.010735,0.003165,0.014313,-0.012222,-0.003465,0.002569,-0.001204,...,-0.008712,0.001466,-0.008723,0.001131,0.009581,-0.014309,0.010925,6.2e-05,0.005018,0.00407
default_unknown,0.125024,-0.127203,-0.010735,1.0,0.126715,0.000489,0.034345,0.119375,-0.129074,0.262759,...,-0.075468,-0.068877,-0.064214,0.022893,-0.009961,-0.013803,-0.015985,0.138126,-0.113058,-0.172503
age,0.298633,-0.447032,0.003165,0.126715,1.0,-0.014483,0.002096,-0.060699,0.064021,-0.056307,...,0.027199,0.049819,0.038864,0.01536,-0.019865,0.023564,-0.015088,-0.054391,0.063597,0.046524
duration,0.000678,0.000674,0.014313,0.000489,-0.014483,1.0,-0.030583,0.01879,-0.031276,0.048823,...,-0.018013,-0.016366,-0.010895,-0.03015,0.026132,-0.0133,0.010703,0.035481,-0.020561,0.468197
campaign,0.019554,-0.019207,-0.012222,0.034345,0.002096,-0.030583,1.0,0.087687,-0.098093,0.193798,...,-0.074295,-0.067082,-0.050503,0.035515,-0.010702,-0.018065,-0.022615,0.109391,-0.083469,-0.118361
pdays,0.033757,-0.042416,-0.003465,0.119375,-0.060699,0.01879,0.087687,1.0,-0.705404,0.332098,...,-0.027969,-0.125969,-0.17908,0.0212,-0.023904,-0.014594,-0.007478,0.660379,-0.954401,-0.317997
previous,-0.036281,0.041814,0.002569,-0.129074,0.064021,-0.031276,-0.098093,-0.705404,1.0,-0.388545,...,0.065595,0.112461,0.173833,-0.013236,0.017924,0.006864,0.00346,-0.850795,0.639402,0.263903
emp.var.rate,0.085846,-0.105098,-0.001204,0.262759,-0.056307,0.048823,0.193798,0.332098,-0.388545,1.0,...,-0.077132,-0.209719,-0.174362,0.003497,-0.016777,-0.010187,0.034893,0.465448,-0.315389,-0.42968


In [62]:
#show how each correlates with the target
correlation=c_coef['y_1']
correlation

marital_married                 -0.062697
marital_single                   0.074975
marital_unknown                  0.004070
default_unknown                 -0.172503
age                              0.046524
duration                         0.468197
campaign                        -0.118361
pdays                           -0.317997
previous                         0.263903
emp.var.rate                    -0.429680
cons.price.idx                  -0.202009
cons.conf.idx                    0.080425
euribor3m                       -0.445328
nr.employed                     -0.468524
job_blue-collar                 -0.125771
job_entrepreneur                -0.025231
job_housemaid                   -0.008715
job_management                   0.009571
job_retired                      0.113650
job_self-employed               -0.005562
job_services                    -0.067254
job_student                      0.111979
job_technician                  -0.005098
job_unemployed                   0

In [63]:
high_corr=correlation[correlation > 0.15]
high_corr

duration            0.468197
previous            0.263903
month_oct           0.157058
poutcome_success    0.307181
y_1                 1.000000
Name: y_1, dtype: float64

In [64]:
x=dummies[['duration','previous','month_oct','poutcome_success']]
y=dummies['y_1']

In [65]:
from sklearn.model_selection import train_test_split


#split the data into train,test and validation set
x_train,x_val,y_train,y_val=train_test_split(x,y, test_size=0.2, random_state=417)
x_train,x_test,y_train,y_test=train_test_split(x,y, test_size=0.2*x.shape[0]/x_train.shape[0])

In [66]:
#normalize the data
from sklearn.preprocessing import      MinMaxScaler

#instantiate the class
scaler=MinMaxScaler()

x_train_scaled=scaler.fit_transform(x_train)
x_test_scaled=scaler.fit_transform(x_test)
x_val_scaled=scaler.fit_transform(x_val)

### Training and Evaluating the Model
We previously selected some features based on how strongly they correlated to the target variable. It's important to note that calculating the Pearson's r is not an ideal approach.

It's a measure of linear correlation between variables. Therefore, it will fail to capture any non-linear relationships. Additionally, the dummy variables aren't technically continuous. There are alternative approaches (such as Cramér's V https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) that are more suitable than Pearson's correlation coefficient when working with categorical columns. We won't discuss that here, however.

Let's train our model and then evaluate it on the validation set. We'll try multiple values for K
 when evaluating our model.

In [67]:
neighbors= np.arange(1,6)

accuracies={}

for k in neighbors:
    knn=KNeighborsClassifier(n_neighbors=k)
    
    #fit the model on the training set
    knn.fit(x_train_scaled, y_train)
    
    #evaluate the model on the scaled validation set
    val_score=knn.score(x_val_scaled, y_val)
    accuracies[k]=val_score
print(accuracies)    

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


{1: 0.7106172839506173, 2: 0.7298765432098765, 3: 0.7619753086419753, 4: 0.7654320987654321, 5: 0.7728395061728395}


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


Our validation accuracy changes depending on the number of neighbors (K) that we set for our model.

There are certain parameters that we can set or input ourselves when training machine learning models. These parameters can influence the training process and can have an impact on the model's performance. They are called as **hyperparameters.** For k-nearest neighbors, K is one such hyperparameter.We've tried several values for 
K
 so far. In the previous lesson, k = 2000 resulted in a poorly performing model. On the previous screen, however, k = 5 yielded a relatively good accuracy score.

This process of tuning the hyperparameter values in order to maximize the model's performance is called **hyperparameter tuning or hyperparameter optimization** Finding optimal hyperparameter values often requires experimentation. Different models might have a wide range of hyperparameters to tune. The documentation for scikit-learn's KNeighborsClassifier https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html lists multiple parameters that we can set values for when training our model.

We'll experiment with a few of those and observe how they impact our model's performanc

In [68]:
#tune again 

neighbors= np.arange(1,6)

val_accuracies={}

for neighbor in neighbors:
    knn=KNeighborsClassifier(n_neighbors=neighbor, weights='distance', p=5)
    
    #fit the model
    knn.fit(x_train_scaled, y_train)
    
    #evaluate the model on validation set
    val_accuracy=knn.score(x_val_scaled,y_val)
    
    val_accuracies[neighbor]=val_accuracy
val_accuracies

{1: 0.7106172839506173,
 2: 0.7377777777777778,
 3: 0.7540740740740741,
 4: 0.7644444444444445,
 5: 0.7664197530864197}

In [69]:
#tune again 

neighbors= np.arange(1,6)

val_accuracies={}

for neighbor in neighbors:
    knn=KNeighborsClassifier(n_neighbors=neighbor, weights='distance', p=2)
    
    #fit the model
    knn.fit(x_train_scaled, y_train)
    
    #evaluate the model on validation set
    val_accuracy=knn.score(x_val_scaled,y_val)
    
    val_accuracies[neighbor]=val_accuracy
val_accuracies

{1: 0.7106172839506173,
 2: 0.7377777777777778,
 3: 0.7540740740740741,
 4: 0.7644444444444445,
 5: 0.7664197530864197}

### Experimentation vs Fundamentals
Before moving forward, let's look at one of the hyperparameters for our model. On the previous screen, we set the weights to distance.

But what does that do? The documentation presents an explanation, but how do we decide that a particular value makes sense for our use-case?

This will often come through trial and error, but having a better understanding of the algorithm behind our model can give us insights for reflection.

<img src='algorithm.svg' width=400 height=400>

We already know we can calculate the distance between an uncategorized data point and all the other data points, identify the K
 nearest neighbors, and--based on the classes of those neighbors--classify the uncategorized data point by selecting the most common class.

Each neighbor in the above algorithm is given the same weight--no neighbor is more important or relevant than another. When creating our model, this is the same as setting the weights parameter to **uniform.** All neighbors have a uniform impact when our model decides on a class for the new data point.
Is that what we want every time? What if our dataset had 1000 points corresponding to one label and only 200 corresponding to the other?

We might end up with a higher likelihood of selecting the former class instead of the latter if all neighbors were considered equally. Our model would be biased towards one class. A potential solution to that is to assign weights to those neighbors.
We calculate the distance of each neighbor from the unknown data point. We assign a weight equal to the inverse of that distance. The closer the neighbor to the new point, the more likely the new point belongs to the same class.
This comes with its own potential drawbacks. For example, the additional computations will add to our computational costs. The model might also overfit the data, since it would start to learn from the closest features instead of generalizing appropriately.

If we didn't fully understand how k-NNs work, we might not have understood this specific example of a weighted k-NN, or thought about where it could be potentially useful. Hyperparameter optimization allows us to quickly iterate through different values without thinking about how they impact the model. But it's still important to understand the underlying algorithm as often as possible.

### Grid Search
Previously, modifying two more hyperparameters improved our model's performance corresponding to some of the K's, but the rest worsened. Not every attempt will result in improvement.

We can't always try every possible permutation and combination. Depending on the size of the dataset, the number of hyperparameters, and the range of values they could take, it would be computationally expensive.We can try out a smaller subset of values. A commonly used approach that can help us find the optimal hyperparameter values is called **grid search.**

We've already applied this technique, in part. In the grid search technique, we define a grid of hyperparameter values. This grid contains the range of values for different hyperparameters we want to explore and train our model with.
We created a list containing the number of neighbors we wanted to use to train our model. That was an example of such a grid. We then expanded the grid by setting the values for two more hyperparameters. We could expand the grid further by having multiple nested loops to explore different combinations of hyperparameters and their values.
We don't need to keep creating multiple loops ourselves. We can use Scikit-learn's GridSearchCV https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html to conduct this search.
GridSearchCV allows us to input a dictionary of hyperparameters and the values we want to search. Additionally, GridSearchCV automatically evaluates the different models on validation sets it creates from the training data. It simplifies our workflow in that regard

In [70]:
dummies.head()

Unnamed: 0,marital_married,marital_single,marital_unknown,default_unknown,age,duration,campaign,pdays,previous,emp.var.rate,...,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_nonexistent,poutcome_success,y_1
0,1,0,0,0,40,151,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
1,1,0,0,0,56,307,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
2,1,0,0,1,41,217,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
3,0,0,0,0,57,293,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0
4,0,1,0,1,39,195,1,999,0,1.1,...,0,0,0,1,0,0,0,1,0,0


In [71]:
x=dummies[['duration','previous','month_oct','poutcome_success','default_unknown']]
y=dummies['y_1']

#split the data
x_train,x_test,y_train,y_test=train_test_split(x,y, test_size=0.2, random_state=417)

from sklearn.preprocessing import MinMaxScaler

scaler=MinMaxScaler()

#scale the data
x_train_scaled=scaler.fit_transform(x_train)
x_test_scaled=scaler.fit_transform(x_test)

params_grid= { 'n_neighbors': range(1,10),
             'metric': ['minkowski','manhattan']}

from sklearn.model_selection import GridSearchCV

#create an instance of the model
knn=KNeighborsClassifier()
#perform the grid search
knn_grid=GridSearchCV(knn,params_grid, scoring='accuracy' )

knn_grid.fit(x_train_scaled, y_train)

best_score= knn_grid.best_score_
best_parameter= knn_grid.best_params_

print('best models accuracy is {}'.format(best_score) )
print('best models parameter is {}'.format(best_parameter))






  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


best models accuracy is 0.7855998596908623
best models parameter is {'metric': 'minkowski', 'n_neighbors': 9}


### Evaluating the Model on Test Set
Because of the grid search technique and the features we selected earlier, we were able to obtain a model that has:

**An accuracy of ~87.7%.**
**The following hyperparameters and values:**
- metric = "manhattan"
- n_neighbors: 9

Note that your results might vary depending on your experimentation.

As a reminder: the above model was evaluated on a validation set and resulted in the highest accuracy compared to all the other models obtained from that hyperparameter grid.

We can now use this model and evaluate it on the test set. Scikit-learn again makes this simple for us to do:
- We can obtain our best model, known as an estimator, from GridSearchCV.
- We can evaluate the test set by calculating the accuracy score using the best estimator .
Before we evaluate the model, let's look at how the machine learning workflow has evolved over these lessons:

<img src='MLworkflow.svg' width=500 height=500>

In [72]:
#best model
best_model=knn_grid.best_estimator_
best_model_accuracy=best_model.score(x_test_scaled,y_test)

print('best model is :', best_model )
print('best model accuracy is :', best_model_accuracy)

best model is : KNeighborsClassifier(n_neighbors=9)
best model accuracy is : 0.7846913580246914


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


##  Review
With an accuracy of 87%, our final model is much better than the ones we trained in the previous lessons!

In this lesson, we learned that there are two ways we can improve our model's performance:

Identifying and selecting relevant features for training our model.
Tuning or optimizing hyperparameters.
Next, you'll apply everything you've learned in a Guided Project