# SUPERVISED LEARNING

# kNN vs Random Forest

In this example we will work with the _digits dataset_ from the sklearn library.
The goal is to implement the simplest metric classifier, the k-Nearest Neighbors method, and to compare its accuracy with that of a Random Forest classifier.

We will need:
- sklearn
- scipy
- numpy

### Data Load

In [1]:
from sklearn import model_selection, metrics, ensemble

In [2]:
import numpy as np

In [3]:
from sklearn.datasets import load_digits
digits = load_digits()
type(digits)
digits.keys()

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

In [4]:
X = digits.data
y = digits.target
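# Hold out 25% of the samples for testing; stratify=y keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, stratify=y)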

## 1. kNN

### 1.1. kNN by hand

Let's implement our own 1-Nearest Neighbor classifier with the Euclidean metric: each test sample receives the label of the closest training sample.

In [5]:
from scipy.spatial.distance import euclidean as E_dist

In [6]:
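y_pred = np.zeros(y_test.shape[0], dtype=y_train.dtype)   # one integer label per test sample (450)

for i in range(X_test.shape[0]):                # loop over the 450 test samples
    dist = np.zeros(X_train.shape[0])           # distances to the 1347 training samples
    for j in range(X_train.shape[0]):
        dist[j] = E_dist(X_test[i], X_train[j])
    y_pred[i] = y_train[dist.argmin()]          # label of the nearest training sample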


In [7]:
print('Score: ', metrics.accuracy_score(y_test, y_pred))

Score:  0.9866666666666667
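
The double loop above calls the distance function once per pair of samples, which is slow in pure Python. As a minimal sketch of the same 1-Nearest Neighbor rule, the full distance matrix can be computed in one call to `scipy.spatial.distance.cdist`. The names `distances`, `nearest` and `y_pred_vec`, as well as the fixed `random_state=0`, are assumptions introduced here for illustration; they are not part of the original notebook, so the exact score may differ from the one printed above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# random_state=0 is an assumption added for reproducibility of this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Pairwise Euclidean distances, shape (n_test, n_train)
distances = cdist(X_test, X_train, metric="euclidean")

# Index of the closest training sample for every test sample
nearest = distances.argmin(axis=1)

# The prediction is the label of that closest training sample
y_pred_vec = y_train[nearest]

accuracy = np.mean(y_pred_vec == y_test)
print("Score: ", accuracy)
```

The trade-off is memory: the matrix holds one float per (test, train) pair, which is fine for this dataset but may not scale to much larger ones.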


### 1.2. KNeighborsClassifier

In [8]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

In [9]:
y_predict = clf.predict(X_test)

In [10]:
print('Score: ', metrics.accuracy_score(y_test, y_predict))

Score:  0.9866666666666667
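
The classifier above uses a single neighbor. A hedged sketch of how other values of `n_neighbors` could be compared without touching the test sample is to run cross-validation on the training part only. The candidate values `(1, 3, 5, 7)`, the 5-fold setting, the `random_state=0` seed and the names `cv_scores`, `best_k` and `final_clf` are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
# random_state=0 is an assumption added for reproducibility of this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Mean 5-fold cross-validated accuracy on the training part for each candidate k
cv_scores = {}
for k in (1, 3, 5, 7):
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(clf, X_train, y_train, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {cv_scores[k]:.4f}")

# Refit with the best k and evaluate once on the held-out test sample
best_k = max(cv_scores, key=cv_scores.get)
final_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_clf.score(X_test, y_test)
print("Best k:", best_k, "test accuracy:", test_accuracy)
```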


## 2. Random Forest

Now we will train a _RandomForestClassifier_ with 1000 trees on the training sample, make predictions on the test sample, and estimate its accuracy.

In [11]:
RF_clf = ensemble.RandomForestClassifier(n_estimators=1000)
RF_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [12]:
y_predict = RF_clf.predict(X_test)

In [13]:
print('Score: ', metrics.accuracy_score(y_test, y_predict))

Score:  0.9688888888888889


Note how the accuracy of Random Forest (about 0.969) compares with that of kNN (about 0.987), which is perhaps one of the simplest methods. This difference is a peculiarity of this dataset, but it is worth remembering that such situations do occur and that simple methods should not be overlooked.
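
Because the comparison above rests on a single random split, the gap between the two scores can shift from run to run. A rough sketch of a steadier comparison is to repeat the same 75/25 stratified split several times and average the accuracies. The number of splits, the `random_state=0` seeds and the reduced `N_TREES = 100` (the notebook uses 1000 trees) are assumptions made to keep the sketch quick; they are not the settings used above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Five stratified 75/25 splits; the fixed seed gives both models the same splits
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

# Assumption: fewer trees than the 1000 used in the notebook, to keep this fast
N_TREES = 100

knn_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=1), X, y, cv=splitter)
rf_scores = cross_val_score(
    RandomForestClassifier(n_estimators=N_TREES, random_state=0), X, y, cv=splitter)

print(f"1-NN:          {knn_scores.mean():.4f} +/- {knn_scores.std():.4f}")
print(f"Random Forest: {rf_scores.mean():.4f} +/- {rf_scores.std():.4f}")
```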