# 7.1: kNN ML Algorithm

## Intro to ML (Machine Learning)

* Supervised learning: labeled data (i.e. there is an attribute (AKA feature) whose value you want to predict for unseen instances)
    * The attribute is called the "class" or the "class label"
    * If the attribute is categorical, the task is classification
    * If the attribute is numeric, the task is regression
    * Example algorithm: kNN (k nearest neighbors)

* Unsupervised learning: unlabeled data
    * Example algorithm: k-means clustering

## Supervised Learning

* Need a way to divide a dataset into a training set and a testing set
    * The training set is used to build/train an algorithm/model
    * The testing set is used to evaluate the algorithm/model
    * The training set and the testing set *are different*
* Example: 
    * We have a super-tiny t-shirt sizes dataset
        * 4 instances
        * 3 attributes (1 is the class (t-shirt size))
        * Goal is to use height and weight attributes to predict t-shirt size
        * We will do this for a test set with a single unseen instance
            * height=161 weight=63 t-shirt size=?
            * Let's say the "ground truth value" is M (medium)
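
One common way to obtain a training set and a testing set is to shuffle the labeled instances and hold out a fraction for testing. The sketch below is a minimal plain-Python illustration; the `train_test_split_simple` helper, its `test_fraction` parameter, and the fixed seed are assumptions made for this example and are not part of the notes. In the t-shirt example itself the test instance is supplied separately rather than split off from the four rows.

```python
import random

def train_test_split_simple(instances, test_fraction=0.25, seed=0):
    """Shuffle a copy of the instances and hold out a test set (hypothetical helper)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test

# The four labeled t-shirt instances: (height, weight, size)
data = [(158, 58, "M"), (163, 61, "M"), (165, 61, "L"), (168, 66, "L")]
train, test = train_test_split_simple(data)
print("train:", train)
print("test:", test)
```

Because the two lists come from disjoint slices of the same shuffled list, no instance appears in both sets. scikit-learn provides a `train_test_split` function in `sklearn.model_selection` for the same job.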

## kNN Algorithm

* Identify the $k$ nearest neighbors in the training set to a test set instance
    * The most frequently occurring class label amongst the $k$ nearest neighbors will be the class label prediction for the unseen instance
* We need a way to measure "nearness" AKA "closeness"
    * 2D: $\sqrt{(a_1-b_1)^2 + (a_2-b_2)^2}$ for two points $a$ and $b$
    * ND: Euclidean distance - $\sqrt{\sum_{i=1}^n (a_i-b_i)^2}$
* We need to normalize (AKA scale) our attributes so that no attribute is unintentionally weighted more than another (e.g. an attribute with a larger range of values will dominate the distance formula)
    * We will use the min-max scaling approach
    * For each attribute, the min becomes 0 and the max becomes 1 (i.e. every value is bounded to $[0,1]$ so the units have no weighting effect)
    * For each attribute, for each value, subtract the min then divide by the original range (max - min): $x' = \frac{x - \min}{\max - \min}$
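
The distance formula and the min-max rule above can be sketched in a few lines of plain Python. The function names `euclidean_distance` and `min_max_scale` are illustrative choices, and the scaling helper assumes the attribute is not constant (otherwise the range would be zero).

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

def min_max_scale(values):
    """Map a list of numbers onto [0, 1] using (value - min) / (max - min)."""
    lo, hi = min(values), max(values)
    value_range = hi - lo  # assumes hi > lo, i.e. the attribute is not constant
    return [(v - lo) / value_range for v in values]

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(min_max_scale([10, 15, 20]))         # [0.0, 0.5, 1.0]
```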



### Tracing the kNN Algorithm

| row # | height (cm) | weight (kg) | t-shirt size |
| - | - | - | - |
| 0 | 158 | 58 | M |
| 1 | 163 | 61 | M | 
| 2 | 165 | 61 | L |
| 3 | 168 | 66 | L |

**STEP 1: Scale the height and weight**

* For HEIGHT:
    * min = 158
    * max = 168
    * range = 10

* For WEIGHT:
    * min = 58
    * max = 66 
    * range = 8

| row # | height (scaled) | weight (scaled) | t-shirt size |
| - | - | - | - |
| 0 | $\frac{158-158}{10} = 0$ | $\frac{58-58}{8} = 0$ | M |
| 1 | $\frac{163-158}{10} = 0.5$ | $\frac{61-58}{8} = 0.375$ | M | 
| 2 | $\frac{165-158}{10} = 0.7$ | $\frac{61-58}{8} = 0.375$ | L |
| 3 | $\frac{168-158}{10} = 1$ | $\frac{66-58}{8} = 1$ | L |


Our UNSEEN INSTANCE: $(161, 63) = (\frac{161-158}{10}, \frac{63-58}{8}) = (0.3, 0.625)$

**STEP 2: Calculate the distance**

| row # | height (scaled) | weight (scaled) | t-shirt size | Distance from $(0.3, 0.625)$ |
| - | - | - | - | - |
| 0 | 0 | 0 | M | $\sqrt{(0-0.3)^2 + (0-0.625)^2} = 0.6933$ |
| 1 | 0.5 | 0.375 | M | $\sqrt{(0.5-0.3)^2 + (0.375-0.625)^2} = 0.32$ | 
| 2 | 0.7 | 0.375 | L | $\sqrt{(0.7-0.3)^2 + (0.375-0.625)^2} = 0.47$ |
| 3 | 1 | 1 | L | $\sqrt{(1-0.3)^2 + (1-0.625)^2} = 0.79$ |
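
Steps 1 and 2 can be checked by hand with a short plain-Python sketch. The variable names are illustrative, and `math.dist` requires Python 3.8 or newer. The min and range are taken from the training rows only and then reused for the unseen instance, mirroring the trace.

```python
import math

# Training instances from the trace: (height in cm, weight in kg)
train = [(158, 58), (163, 61), (165, 61), (168, 66)]
unseen = (161, 63)

# Per-attribute min and range, computed from the training set only
mins = [min(col) for col in zip(*train)]
ranges = [max(col) - min(col) for col in zip(*train)]

def scale(instance):
    """Apply min-max scaling using the training set's min and range."""
    return tuple((v - lo) / r for v, lo, r in zip(instance, mins, ranges))

scaled_train = [scale(row) for row in train]
scaled_unseen = scale(unseen)

# Euclidean distance from each scaled training row to the scaled unseen instance
distances = [math.dist(row, scaled_unseen) for row in scaled_train]
for i, d in enumerate(distances):
    print(f"row {i}: {d:.4f}")
```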

**STEP 3: Get the class labels for the $k = 3$ smallest distances**

* Looking at the table directly above, the closest rows are 1, 2, and 0 (in order of increasing distance). Their class labels are M, L, and M.


**STEP 4: Pick the majority class and use that as the prediction**

* Since the majority (2 of 3) are M, our prediction will be M, which matches the ground truth value
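
Steps 3 and 4 amount to sorting by distance, keeping the first $k$ labels, and counting them. A minimal sketch with `collections.Counter`, using the (rounded) distances from the Step 2 table; the variable names are illustrative:

```python
from collections import Counter

# (distance, class label) pairs taken from the Step 2 table, rows 0 to 3
neighbors = [(0.69, "M"), (0.32, "M"), (0.47, "L"), (0.79, "L")]
k = 3

k_nearest = sorted(neighbors)[:k]           # sort by distance, keep the k closest
labels = [label for _, label in k_nearest]  # labels in order of increasing distance
prediction = Counter(labels).most_common(1)[0][0]
print(labels, "->", prediction)
```

With two classes, an odd $k$ avoids tied votes. With an even $k$ or more classes, a tie-breaking rule would need to be chosen.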


## Coding the kNN Algorithm

We will start by loading our data and separating the features from the class labels:

```python
import pandas as pd

# kNN with the scikit-learn library
# Notation:
#   X: a feature matrix (one row per instance) with the class labels stripped off
#   y: a class label vector
#   X and y are parallel (row i of X goes with element i of y)
#   _train and _test suffixes denote the training and test sets respectively

df = pd.read_csv('shirt_sizes.csv')

X_train = df.drop("t-shirt size", axis=1)  # axis=1 drops a column rather than a row
print('training dataset:\n', X_train)

y_train = df["t-shirt size"]
print('\ntraining y-vector:\n', y_train)

X_test = [[161, 63]]  # the single unseen instance: height=161, weight=63
print('\ntest data:\n', X_test)
```

```
training dataset:
    height(cm)  weight(kg)
0         158          58
1         163          61
2         165          61
3         168          66

training y-vector:
 0    M
1    M
2    L
3    L
Name: t-shirt size, dtype: object

test data:
 [[161, 63]]
```


Next, we will normalize our data:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)  # learn each attribute's min and max from the training set only
X_train_normalized = scaler.transform(X_train)
print('Normalized training data:\n', X_train_normalized)
X_test_normalized = scaler.transform(X_test)  # reuse the training min and max
print('\nNormalized test data:\n', X_test_normalized)
```

```
Normalized training data:
 [[0.    0.   ]
 [0.5   0.375]
 [0.7   0.375]
 [1.    1.   ]]

Normalized test data:
 [[0.3   0.625]]
```


We will then set up the kNN classifier:

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 3 neighbors, Euclidean distance (matches the hand trace above)
knn_clf = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_clf.fit(X_train_normalized, y_train)
```

```
KNeighborsClassifier(metric='euclidean', n_neighbors=3)
```

Finally, we need to make our prediction with the classifier for the unseen instance:

```python
y_predicted = knn_clf.predict(X_test_normalized)
print('y predicted:', y_predicted)
# kneighbors returns (distances, row indices) of the k nearest training instances
print('nearest neighbors:', knn_clf.kneighbors(X_test_normalized))
```

```
y predicted: ['M']
nearest neighbors: (array([[0.32015621, 0.47169906, 0.69327123]]), array([[1, 2, 0]]))
```
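
Since the testing set exists to evaluate the model, the prediction can be compared with the ground truth value (M, as stated earlier). A minimal sketch of an accuracy calculation in plain Python follows; the `accuracy` helper is an illustrative name, and the two label lists are typed in by hand rather than taken from the classifier object.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_test = ["M"]       # ground truth for the unseen instance, as stated in the notes
y_predicted = ["M"]  # the prediction printed by the classifier
print("accuracy:", accuracy(y_test, y_predicted))
```

scikit-learn classifiers also expose a `score(X, y)` method that returns the mean accuracy, so `knn_clf.score(X_test_normalized, ["M"])` should give the same kind of result.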


## Closing Thoughts on kNN

* What if our attributes are not numeric (meaning they are categorical)?
    * Simple approach: map labels to integers (note that this imposes an artificial ordering on the labels)
        * `from sklearn.preprocessing import LabelEncoder`
    * Another approach: write your own distance function (0 if labels are the same, 1 otherwise)
* kNN is NOT the only ML algorithm
    * Naive Bayes
    * Decision trees (and random forests)
    * SVMs (support vector machines)
    * Neural networks (deep learning)
    * etc.
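
Returning to the categorical-attribute question above, the "write your own distance function" idea can be sketched as follows. This is only one possible design: the `mixed_distance` name, the rule of treating every string value as categorical, and the "sleeve style" attribute are all assumptions made for illustration.

```python
import math

def mixed_distance(a, b):
    """Euclidean-style distance where a categorical (string) attribute
    contributes 0 if the labels are the same and 1 otherwise."""
    total = 0.0
    for a_i, b_i in zip(a, b):
        if isinstance(a_i, str) or isinstance(b_i, str):
            diff = 0.0 if a_i == b_i else 1.0  # categorical: match or mismatch
        else:
            diff = a_i - b_i                   # numeric: ordinary difference
        total += diff ** 2
    return math.sqrt(total)

# Hypothetical instances: (scaled height, scaled weight, sleeve style)
same_label = mixed_distance((0.3, 0.625, "short"), (0.5, 0.375, "short"))
diff_label = mixed_distance((0.3, 0.625, "short"), (0.5, 0.375, "long"))
print(same_label, diff_label)
```

Because the numeric attributes are already scaled to $[0,1]$, a mismatch of 1 puts the categorical attribute on a comparable footing with the numeric ones.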