<div style="text-align:center">
    <img src="../files/monolearn-logo.png" height="150px">
    <h1>ML course</h1>
    <h3>Session 06: Grid search, Cross Validation, K-Nearest Neighbor (KNN)</h3>
    <h4><a href="https://amzenterprise.ir/">Ali Momenzadeh</a></h4>
</div>

<img src = "../files/6/1_UgJe_U6SI9wo9uicjcnNqA.png" width=50%>

### Cross-Validation

<img src = "../files/6/1_3XvSvKfde8u89TMwjkz3kg.png" width=60%>

*Cross-Validation* is a validation technique for assessing how the results of a statistical analysis (a model) will generalize to an independent dataset. It is primarily used when prediction is the main goal and you want to estimate how accurately a predictive model will perform on data it has not seen.

Cross-Validation sets aside part of the training data as a validation set and tests the model on it during the training phase, which helps detect problems like overfitting and underfitting. Keep in mind that the validation and training sets must be drawn from the same distribution; otherwise the validation scores will be misleading.

<img src = "../files/6/DlBMc.png" width=50% style="background-color:white">

#### Benefits of Cross-Validation

* It helps evaluate the quality of your model.
* It helps to detect, and therefore reduce, problems of overfitting and underfitting.
* It lets you select the model that will deliver the best performance on unseen data.

#### Hold-out (Train-Test-Split) vs. Cross-validation

Cross-validation is usually the preferred method because it trains and evaluates your model on multiple train-test splits. This gives you a better indication of how well your model will perform on unseen data. Hold-out, on the other hand, relies on just one train-test split, so its score depends on how the data happens to be divided into train and test sets.

The hold-out method is a good choice when you have a very large dataset, you are short on time, or you are building a first baseline model in your data science project. Keep in mind that because cross-validation uses multiple train-test splits, it takes more computational power and time to run than the hold-out method.
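
To make the difference concrete, here is a minimal sketch that compares several hold-out splits with one 10-fold cross-validation run. The synthetic data from `make_classification` and the `_demo` variable names are assumptions chosen only to keep the example self-contained; they are not part of the course dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic data, used only so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)

# Hold-out: one score per split, so the result moves with the random seed.
holdout_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed
    )
    model = SVC(kernel="rbf").fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

# Cross-validation: every sample is used for testing exactly once.
cv_scores = cross_val_score(SVC(kernel="rbf"), X_demo, y_demo, cv=10)

print("Hold-out scores per seed:", np.round(holdout_scores, 3))
print("10-fold CV mean: {:.3f} (std {:.3f})".format(cv_scores.mean(), cv_scores.std()))
```

The spread of the hold-out scores across seeds shows how much a single split can vary, while the cross-validation mean averages over ten different test folds.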

<img src = "../files/6/1_pJ5jQHPfHDyuJa4-7LR11Q.png" width=50%>

#### Cross-Validation: Different Validation Strategies

1. Validation set 
2. Train/Test split 
3. K-fold 
4. Leave one out

<img src = "../files/6/screenshot-miro.medium.com-2020.02.14-17_27_05.png" width=50%>
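
As a rough illustration of how K-fold and leave-one-out differ, the sketch below prints the index splits that scikit-learn's `KFold` and `LeaveOneOut` generate. The ten-sample array `X_toy` is a made-up example.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature

# K-fold: the data is cut into 5 consecutive folds; each fold is the test set once.
kfold = KFold(n_splits=5)
for train_idx, test_idx in kfold.split(X_toy):
    print("K-fold train:", train_idx, "test:", test_idx)

# Leave-one-out: every single sample becomes a test set of size 1.
loo = LeaveOneOut()
n_kfold_splits = kfold.get_n_splits(X_toy)
n_loo_splits = loo.get_n_splits(X_toy)
print("Number of K-fold splits:", n_kfold_splits)
print("Number of leave-one-out splits:", n_loo_splits)
```

Leave-one-out is the extreme case of K-fold where K equals the number of samples, which is why it becomes expensive on large datasets.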

#### Import Libraries

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

#### Load and prepare data

In [None]:
dataset = pd.read_csv("Social_Network_Ads.csv")

In [None]:
dataset

In [None]:
dataset.columns

In [None]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

#### Train and test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

#### Feature Scaling

In [None]:
dataset

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X_train

In [None]:
from sklearn.svm import SVC

classifier = SVC(kernel = "rbf", random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

#### Evaluation

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
accuracy_score(y_test, y_pred)

#### Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# Cross-validate on the scaled training data that the classifier was fitted on
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)

In [None]:
accuracies

In [None]:
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

<img src = "../files/6/1_4G__SV580CxFj78o9yUXuQ.png" width=50%>

StratifiedKFold: https://stackoverflow.com/a/72139320/10304611
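
The following sketch shows what stratification changes, using a small imbalanced label vector invented for this example. Plain `KFold` can produce folds that contain no samples of the minority class, while `StratifiedKFold` keeps the class ratio roughly the same in every fold. Note that `cross_val_score` already uses stratified folds when `cv` is an integer and the estimator is a classifier.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 16 samples of class 0 followed by 4 of class 1.
y_toy = np.array([0] * 16 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not matter for splitting

def positives_per_fold(splitter):
    """Count class-1 samples in the test part of each fold."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]

plain_counts = positives_per_fold(KFold(n_splits=4))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=4))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```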

#### Applying Grid Search to find the best model and the best parameters

<img src = "../files/6/0_0SGzQwbkOwSmE13A.jpg" width=40%>

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = [{'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},
              {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]

In [None]:
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = "accuracy",
                           cv = 10,
                           n_jobs = -1)

grid_search.fit(X_train, y_train)

In [None]:
best_accuracy = grid_search.best_score_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))

In [None]:
best_parameters = grid_search.best_params_
print("Best Parameters:", best_parameters)

##### The two cells above report the best cross-validated accuracy and the hyperparameter values that produced it.
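
To get a feel for the cost of a grid search, the sketch below counts how many hyperparameter combinations the grid contains. The grid is copied into a new name, `demo_parameters`, so the example is self-contained; `ParameterGrid` is the scikit-learn helper that enumerates the combinations.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search cell, repeated here for a standalone example.
demo_parameters = [
    {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},
    {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'],
     'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]},
]

n_candidates = len(ParameterGrid(demo_parameters))  # 4 linear + 4 * 9 rbf
n_folds = 10
n_fits = n_candidates * n_folds

print("Candidates:", n_candidates)
print("Model fits during the search:", n_fits)
```

With 40 candidates and 10 folds, the search trains 400 models, plus one final refit of the best candidate on the whole training set. This is why adding values to a grid quickly becomes expensive.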

<hr/>

### K-Nearest Neighbor

K-Nearest Neighbors, or **KNN** for short, is one of the simplest machine learning algorithms and is used across a wide range of applications. KNN is a `non-parametric`, `lazy learning` algorithm.



* When we say a technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution. In other words, it makes its prediction based on proximity to other data points, regardless of what the feature values represent.

* Being a lazy learning algorithm implies that there is little to no training phase. Therefore, we can immediately classify new data points as they present themselves.

> K-NN is a lazy learner because it doesn’t learn a discriminative function from the training data but “memorizes” the training dataset instead.
For example, the logistic regression algorithm learns its model weights (parameters) during training time. In contrast, there is no training time in K-NN.

**To summarize: An eager learner has a model fitting or training step. A lazy learner does not have a training phase.**

<img src = "../files/6/1_YWKvGH4kKOtCvlX950LM9g.jpg" width=75%>

#### KNN in a nutshell

1. Pick a value for K (e.g., 5).

<img src = "../files/6/1_mAgqYN_HLbYYXXkQdyBA6Q.png" width=50%>

2. Take the K nearest neighbors of the new data point according to their Euclidean distance.

<img src = "../files/6/1_4F-q86XFr2-EsaAcz0Zu5A.png" width=50%>


3. Among these neighbors, count the number of data points in each category and assign the new data point to the category where you counted the most neighbors.

<img src = "../files/6/1_OMHr6KZl7nHnKgLb8pq0Jg.png" width=50%>


#### Algorithm...

1. Initialize the K value.

2. Calculate the distance between the test input and every training point, and keep the K nearest neighbors.

3. Check the class labels of those nearest neighbors.

4. Classify the test input by taking the majority vote among them.

5. Return the class category.
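
The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not how scikit-learn implements KNN internally; the function name `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Keep the indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Steps 3-4: majority vote over the neighbors' labels.
    votes = Counter(y_train[nearest].tolist())
    # Step 5: return the winning class.
    return votes.most_common(1)[0][0]

# Two well-separated clusters, one per class.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.2, 1.4]), k=3))  # near the class-0 cluster
print(knn_predict(X_toy, y_toy, np.array([6.4, 6.9]), k=3))  # near the class-1 cluster
```

Notice that there is no fitting step: all the work happens inside `knn_predict` at prediction time, which is what "lazy learning" refers to.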

#### Distance Metrics

The distance metric is a hyperparameter that defines how we measure the distance between the feature values of the training points and a new test input.

<img src = "../files/6/1__i1PCxvSDw5TIfzyq90aag.png" width=50%>
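
As a small worked example, the sketch below computes two common metrics with a single Minkowski distance function. The helper `minkowski` and the points `a` and `b` are assumptions made for illustration. In scikit-learn, `KNeighborsClassifier` exposes this choice through its `metric` and `p` parameters, and the default is Minkowski with `p=2`, which is the Euclidean distance.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)
manhattan = minkowski(a, b, p=1)  # |3| + |4|

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
```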

#### Some pros and cons of KNN

> Pros:

* No assumptions about data
* Simple algorithm — easy to understand
* Can be used for classification and regression

> Cons:

* High memory requirement — All of the training data must be present in memory in order to calculate the closest K neighbors
* Sensitive to irrelevant features
* Sensitive to the scale of the data since we’re computing the distance to the closest K points
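
The last point is easy to demonstrate with a tiny, hypothetical dataset in which one feature is measured in much larger units than the other. In this sketch the nearest neighbor, and therefore the prediction, changes once the features are standardized.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are (age in years, salary in dollars).
X_toy = np.array([[25.0, 50000.0],   # class 0
                  [60.0, 50500.0]])  # class 1
y_toy = np.array([0, 1])
x_new = np.array([[58.0, 50200.0]])  # age is very close to the class-1 sample

# Raw features: the salary column dominates the distance.
raw_pred = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy).predict(x_new)[0]

# Standardized features: both columns contribute on a comparable scale.
scaler = StandardScaler().fit(X_toy)
scaled_pred = (
    KNeighborsClassifier(n_neighbors=1)
    .fit(scaler.transform(X_toy), y_toy)
    .predict(scaler.transform(x_new))[0]
)

print("Prediction on raw features:   ", raw_pred)
print("Prediction on scaled features:", scaled_pred)
```

On the raw data a salary gap of 200 outweighs an age gap of 33 years, so the query is matched to the class-0 sample. After scaling, the age difference counts as much as the salary difference and the query is matched to the class-1 sample.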

#### Apply KNN on Breast Cancer Dataset

#### Import libraries

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

#### Load and prepare data

In [None]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

#### EDA

In [None]:
cancer.data.shape

In [None]:
print(cancer.feature_names)

In [None]:
print(cancer.target_names)

<img src = "../files/6/1_-7Gwli-yhmHA7XNRmJwSRg.jpg" width=60%>

In [None]:
cancer.target

#### Train and test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

In [None]:
print("Shape of Train Data:", X_train.shape)
print("Shape of Test Data:", X_test.shape)

In [None]:
y_train.shape

In [None]:
# Pass the labels as a keyword argument; recent seaborn versions require it
sns.countplot(x=cancer.target)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
# knn = KNeighborsClassifier(n_neighbors = 5)

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [None]:
knn.score(X_train, y_train)

In [None]:
knn.score(X_test, y_test)

#### Evaluation

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, square=True , annot=True)

#### How to choose a K value?

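The K value is the number of nearest neighbors that vote on each prediction. To classify a test point, we have to compute its distance to every labeled training point. KNN defers all of this work to prediction time instead of fitting a model up front, which is why it is called a lazy learning algorithm and why predictions become computationally expensive on large training sets. A small K makes the model sensitive to noise, while a large K smooths the decision boundary and can blur the border between classes.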

<img src = "../files/6/0_FakkqTKdMPDb3gof.jpg" width=40%>
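
One common way to pick K, sketched below, is to score each candidate with cross-validation on the training set and touch the test set only once at the end. This is one possible approach rather than the only one. The variable names (`X_tr`, `k_values`, `best_k`) and the choice of 5 folds are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)

# Odd values of K avoid tied votes in a two-class problem.
k_values = list(range(1, 40, 2))

# Mean 5-fold cross-validation accuracy for each candidate K (training data only).
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    for k in k_values
]

best_k = k_values[int(np.argmax(cv_means))]
print("Best K by 5-fold CV on the training set:", best_k)

# The test set is used only once, after K has been chosen.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Choosing K this way keeps the test score an honest estimate, because the test data played no part in selecting the hyperparameter.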

#### Optimal value of K

In [None]:
error_rate = []

# Might take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    # (pred_i != y_test) => [True, False, ...]; the mean of a boolean vector is the
    # number of True values divided by the total number of values, i.e. the error rate
    # np.mean([True, True, False]) => 2 / 3 => 0.6666666666666666
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
knn = KNeighborsClassifier(n_neighbors=11)
#knn = KNeighborsClassifier(n_neighbors=30)

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [None]:
knn.score(X_train, y_train)

In [None]:
knn.score(X_test, y_test)