# Simple Machine Learning Entry

## kNN

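The k-nearest-neighbors (kNN) algorithm classifies a sample by a majority vote among the `k` training samples closest to it.

Example: the ***Iris*** dataset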

In [2]:
from sklearn.datasets import load_iris

# load the Iris dataset
iris = load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

---
We usually split the dataset into a **training set** and a **test set**, so that the model is evaluated on samples it has not seen.

In [31]:
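# Features: four measurements per flower
X = iris.data
#print(X.shape)

# Labels (class index of each flower)
Y = iris.target
#print(Y.shape)

import numpy as np

# append the labels as a last column so each row stays paired with its label
data = np.hstack((X,Y.reshape(-1,1)))

#print(data.shape)

np.random.shuffle(data)
# shuffles the rows in place (no seed is set, so the split differs between runs)

train_size = int(len(data)*0.8)
# 80% of the rows (120 of 150) go to the training set

train_data , test_data = data[:train_size] , data[train_size:]
# slice data into 2 parts


x_train , y_train = train_data[: , :-1 ] , train_data[ : , -1: ]
x_test  , y_test  = test_data [: , :-1 ] , test_data[ : , -1: ]
# the `-1:` slice keeps the labels as column vectors of shape (n, 1)


# A simpler way is to call the `train_test_split` function from `sklearn.model_selection`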
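As a minimal sketch of that simpler route, the same 80/20 split can be written with `train_test_split`. The variable names, the `random_state=0` seed and the `stratify` argument are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 20% of the samples for testing. stratify keeps the class ratio
# the same in both parts; random_state=0 is an arbitrary seed for repeatability.
x_tr, x_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)

print(x_tr.shape, x_te.shape)  # (120, 4) (30, 4)
```

Unlike the manual split, this returns the labels as 1-D arrays, which is the shape scikit-learn estimators expect.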

---
First, implement the algorithm with NumPy to better understand it:

In [32]:
def euclid_dis(u,v):
    # NumPy applies the subtraction and the square to every element
    return np.sqrt(np.sum((u-v)**2))


from scipy import stats
# used to get the mode (the most frequent label)

def mklabel (x_train,y_train,x_one,k):
    distances = [ euclid_dis( x_one, x_i)  for x_i in x_train ]
    # distance from the query sample to every training sample (smaller = more similar)
    
    labels = y_train [ np.argpartition(distances , k-1)[:k] ]
    # argpartition(array, k-1) returns indices arranged so that the k smallest
    # elements come first (in no particular order); the slice [:k] keeps them,
    # and *fancy indexing* into y_train picks out their labels

    # most frequent label among the k neighbors
    # (the shape of `.mode` depends on the SciPy version)
    return stats.mode(labels).mode
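To see what the `argpartition` step does, here is a small sketch on made-up numbers. The distances and labels are hypothetical, and `np.bincount(...).argmax()` is swapped in for `stats.mode` only to keep the example free of SciPy.

```python
import numpy as np

# Made-up distances from one query point to five training points
distances = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
# Made-up integer labels of those five training points
labels = np.array([2, 0, 1, 0, 2])
k = 3

# Indices of the k smallest distances occupy the first k slots (unordered)
nearest = np.argpartition(distances, k - 1)[:k]
print(sorted(nearest.tolist()))  # [1, 2, 3]

# Majority vote: count each label among the neighbors, take the most common
vote = np.bincount(labels[nearest]).argmax()
print(vote)  # 0
```

`argpartition` avoids a full sort, which is enough here because the vote does not depend on the order of the k neighbors.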

---
Now everything is in place.
Define the actual prediction function:


In [33]:
def predict_by_kNN (x_train,y_train,x_new,k=5):
    return np.asarray([mklabel(x_train,y_train,x_one,k) for x_one in x_new])
    # label each sample in x_new in turn (a list comprehension)

In [34]:
y_pred = predict_by_kNN(x_train, y_train, x_test)
y_pred == y_test

array([[ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True]])
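A long boolean array is hard to read at a glance. One way to summarize it, sketched here on made-up values rather than the notebook's data, is to take the mean of the comparison, which gives the fraction of correct predictions.

```python
import numpy as np

# Made-up labels and predictions, stored as column vectors like y_test above
y_true = np.array([[0.0], [1.0], [2.0], [1.0]])
y_hat = np.array([[0.0], [1.0], [1.0], [1.0]])

# Both arrays must have the same shape; otherwise == would broadcast
assert y_hat.shape == y_true.shape

# True counts as 1 and False as 0, so the mean is the accuracy
accuracy = np.mean(y_hat == y_true)
print(accuracy)  # 0.75
```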

---
It is easier to do the same with `scikit-learn`:

In [50]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

# fit expects 1-D labels; passing a column vector raises a DataConversionWarning
model.fit(x_train,y_train.ravel())

y_pred = model.predict(x_test)
y_test = y_test.flatten()

print(y_pred)
print(y_test)


y_pred == y_test

[1. 2. 2. 2. 2. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 2. 0. 0. 1. 1. 2. 0. 0. 0.
 1. 1. 2. 2. 1. 2.]
[1. 2. 2. 2. 2. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 2. 0. 0. 1. 1. 2. 0. 0. 0.
 1. 1. 2. 2. 1. 2.]


array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [51]:
model.score(x_test,y_test)

1.0

## Conclusion

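The `kNN` model is a simple but powerful algorithm.
However, prediction can be computationally expensive, because the brute-force version compares each query against every training sample, and the choice of `k` strongly affects the result.
Moreover, imbalanced training data may lead to *Class Imbalance Bias*: the majority class tends to dominate the vote.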
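Since the choice of `k` matters, one common way to compare candidates is cross-validation. The sketch below is only an illustration: the candidate values and the 5-fold setting are arbitrary assumptions, and the scores depend on the data and the library version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values of k are arbitrary choices for illustration
mean_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    # 5-fold cross-validated accuracy for a classifier with this k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()
    print(k, round(mean_scores[k], 3))
```

A `k` that is too small makes the vote sensitive to noisy neighbors, while a `k` that is too large lets distant samples, and the largest class, outvote the local ones.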