<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width="500" height="450">
<h3 style="text-align: center;"><b>Phystech School of Applied Mathematics and Informatics (PSAMI) MIPT</b></h3>

<h2 style="text-align: center;"><b>k Nearest Neighbors (kNN)</b></h2>

The k Nearest Neighbors method (kNN) is a very popular classification method that is also sometimes used for regression. It is one of the most intuitive approaches to classification: you resemble your neighbors. Formally, the method rests on the compactness hypothesis: if the distance metric between examples is chosen well, then similar examples fall in the same class much more often than in different classes.

<img src='https://hsto.org/web/68d/a45/6f0/68da456f00f8434e87628dbe7e3f54a7.png'>


To classify each object in the test sample, perform the following steps in order:

* Compute the distance to each object in the training sample.
* Select the $k$ training objects with the smallest distances.
* Assign the class that occurs most often among these $k$ nearest neighbors.
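
The three steps above can be sketched directly in NumPy. This is a minimal illustration on made-up toy data, assuming Euclidean distance and a plain majority vote; the function name `knn_predict` and the toy arrays are hypothetical and are not part of the assignment.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_X, k=3):
    """Classify each row of test_X by majority vote among its k nearest training points."""
    predictions = []
    for x in test_X:
        # Step 1: Euclidean distance from x to every training object
        distances = np.sqrt(((train_X - x) ** 2).sum(axis=1))
        # Step 2: indices of the k closest training objects
        nearest = np.argsort(distances)[:k]
        # Step 3: the most frequent class among those neighbors
        votes = Counter(train_y[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data (assumption): two well-separated clusters
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_y = np.array([0, 0, 0, 1, 1, 1])
test_X = np.array([[0.05, 0.05], [5.05, 5.0]])

toy_predictions = knn_predict(train_X, train_y, test_X, k=3)
print(toy_predictions)  # [0 1]
```

The explicit loop keeps each step visible; library implementations such as `KNeighborsClassifier` follow the same idea but use faster neighbor search.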

We will work with a subsample of the [forest cover type data from the UCI repository](http://archive.ics.uci.edu/ml/datasets/Covertype) (the subsample is `forest_dataset.csv`). The data contains 7 different classes. Each object is described by 54 features, 40 of which are binary. A description of the data is available on the dataset page linked above.

### Data preprocessing

In [None]:
import pandas as pd

In [None]:
all_data = pd.read_csv('forest_dataset.csv')
all_data.head()

In [None]:
all_data.shape

Extract the class labels into the variable `labels` and the feature descriptions into the variable `feature_matrix`. Convert both to `numpy` arrays using the `.values` attribute.
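
As a small illustration of this extraction pattern, here is a sketch on a made-up three-row table. The frame `toy_df` and its column names are hypothetical; the only assumption carried over from the assignment is that the label sits in the last column.

```python
import pandas as pd

# Hypothetical toy table: two feature columns followed by the label column
toy_df = pd.DataFrame({'f1': [1.0, 2.0, 3.0],
                       'f2': [0, 1, 0],
                       'label': [2, 1, 2]})

toy_labels = toy_df.iloc[:, -1].values     # last column -> 1-D numpy array
toy_features = toy_df.iloc[:, :-1].values  # all other columns -> 2-D numpy array

print(type(toy_features).__name__, toy_features.shape, toy_labels.shape)
# ndarray (3, 2) (3,)
```

The pandas documentation recommends `.to_numpy()` over `.values` in new code; both return the same array here.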

In [None]:
labels = all_data[all_data.columns[-1]].values
feature_matrix = all_data[all_data.columns[:-1]].values

We will work with all 7 cover types. Split the sample into training and test sets using the `train_test_split` function with the parameters `test_size=0.2` and `random_state=42`.
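
Because kNN relies on distances, features measured on larger scales dominate unless the data is standardized, and the scaling statistics should come from the training part only. The sketch below shows the split-then-scale pattern on synthetic data; the arrays `X_toy` and `y_toy` and all variable names are assumptions made for illustration, not the assignment's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 100 objects, 4 features, 2 classes
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
y_toy = rng.integers(0, 2, size=100)

# Hold out 20% of the objects for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train)  # learn mean and std on the training part
X_test_scaled = scaler_toy.transform(X_test)        # reuse the training statistics

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```

Fitting the scaler on the training part alone keeps information about the test set out of the model.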

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
train_feature_matrix, test_feature_matrix, train_labels, test_labels = train_test_split('''your code''')

# standardize the data using statistics computed on train_feature_matrix only
scaler = '''your code'''
train_feature_matrix = scaler.'''your code'''(train_feature_matrix)
test_feature_matrix = scaler.'''your code'''(test_feature_matrix)

### Model training

The quality of classification and regression by the method of nearest neighbors depends on several parameters:

* the number of neighbors, `n_neighbors`
* the distance metric between objects, `metric`
* the neighbor weights, `weights` (the neighbors of a test example may vote with different weights; for example, the farther away a neighbor is, the less its vote counts)
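
To see how `weights` can change a prediction, consider a tiny one-dimensional example. The data below is made up for illustration: with three neighbors, a uniform vote favors the two distant points of class 1, while distance weighting favors the single close point of class 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one class-0 point near the query, two class-1 points far away
X_demo = np.array([[0.0], [2.0], [2.1]])
y_demo = np.array([0, 1, 1])
query = np.array([[0.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform', metric='euclidean')
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance', metric='euclidean')

# Uniform: 2 votes for class 1 against 1 vote for class 0
uniform_pred = uniform_knn.fit(X_demo, y_demo).predict(query)
# Distance: class 0 gets 1/0.5 = 2.0, class 1 gets 1/1.5 + 1/1.6 (about 1.29)
distance_pred = distance_knn.fit(X_demo, y_demo).predict(query)

print(uniform_pred, distance_pred)  # [1] [0]
```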


Train a `KNeighborsClassifier` from `sklearn` on the dataset.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

clf = KNeighborsClassifier()
clf.fit('''your code''')
pred_labels = clf.predict('''your code''')
accuracy_score(test_labels, '''your code''')

### Question 1:
* What accuracy did you get?

Let's tune the parameters of our model:

* Search the number of neighbors over the grid from `1` to `10`.

* Also try different metrics: `['manhattan', 'euclidean']`.

* Try different weighting strategies: `['uniform', 'distance']`.
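
A search over a grid like the one described above can be sketched as follows. This is only an illustration on synthetic data from `make_classification`; the names `X_syn`, `y_syn`, `param_grid`, and `search` are assumptions, and the best parameters found here say nothing about the forest dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 200 objects, 6 features, 2 classes
X_syn, y_syn = make_classification(n_samples=200, n_features=6,
                                   n_informative=4, random_state=0)

param_grid = {
    'n_neighbors': list(range(1, 11)),     # 1..10 inclusive
    'metric': ['manhattan', 'euclidean'],
    'weights': ['uniform', 'distance'],
}

# Every combination is scored with 5-fold cross-validation on accuracy
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_syn, y_syn)

print(len(search.cv_results_['params']))  # 40 combinations = 10 * 2 * 2
print(search.best_params_)
```

The grid grows multiplicatively with each parameter, so on the full dataset the search can take a while; `n_jobs=-1` spreads the work across CPU cores.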

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params = {'weights': ['''your code'''], 'n_neighbors': '''your code''', 'metric': ['''your code''']}

# GridSearchCV: exhaustive search over specified parameter values for an estimator.
clf_grid = GridSearchCV(clf, params, cv=5, scoring='accuracy', n_jobs=-1)
clf_grid.fit(train_feature_matrix, train_labels)

Display the best parameters:

In [None]:
clf_grid.best_params_

### Question 2:
* What metric should be used?

### Question 3:
* How many neighbors should be used?

### Question 4:
* Which weighting strategy should be used?

Using the optimal parameters found above, compute the class membership probabilities for the test sample (`.predict_proba`).
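
For kNN with uniform weights, `predict_proba` returns the fraction of the $k$ neighbors that belong to each class, with columns ordered as in `classes_`. The sketch below checks this on made-up one-dimensional data; the arrays and the name `small_model` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data with three classes labeled 1, 2, 3
X_small = np.array([[0.0], [0.2], [1.0], [1.2], [3.0]])
y_small = np.array([1, 1, 2, 2, 3])

small_model = KNeighborsClassifier(n_neighbors=3).fit(X_small, y_small)

# The 3 nearest neighbors of 0.4 are 0.2, 0.0 and 1.0, i.e. classes 1, 1, 2
proba = small_model.predict_proba(np.array([[0.4]]))

print(small_model.classes_)  # [1 2 3]
print(proba.shape)           # (1, 3): one row per object, one column per class
```

Here the row equals 2/3, 1/3, 0 for classes 1, 2, 3. Averaging such rows over all test objects with `.mean(axis=0)` gives one mean probability per class.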

In [None]:
optimal_clf = KNeighborsClassifier('''your code''')
optimal_clf.fit('''your code''')
pred_prob = optimal_clf.'''your code'''(test_feature_matrix)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# share of each class among the true test labels
unique, freq = np.unique(test_labels, return_counts=True)
freq = freq / len(test_labels)

# mean predicted probability of each class over the test sample

pred_freq = pred_prob.mean(axis=0)
plt.figure(figsize=(10, 8))
plt.bar(range(1, 8), pred_freq, width=0.4, align="edge", label='prediction')
plt.bar(range(1, 8), freq, width=-0.4, align="edge", label='real')
plt.legend()
plt.show()

### Question 5:
* What is the mean predicted probability `pred_freq` of class 3 for the optimal model (rounded to 2 decimal places)?