# Machine Learning in Action

## Chapter 2

### k-Nearest Neighbours

k-Nearest Neighbours (kNN) classifies a data point by measuring its distance to examples whose classes are already known. Points that lie close together are assumed to belong to the same class.

When a new, unlabelled input is given, we calculate its distance to every point in the labelled dataset and pick the *k* closest points (its nearest neighbours). The input is assigned the class that is most common among those *k* neighbours.

**Pros:** High accuracy, insensitive to outliers, no assumptions about data

**Cons:** Computationally expensive, requires a lot of memory

**Works with:** Numeric values, nominal values

## Import Dependencies

In [0]:
import numpy as np
import operator

### Let's create a sample dataset

In [0]:
def create_dataset():
    group = np.array([[1., 1.1], [1., 1.], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

In [0]:
group, labels = create_dataset()

In [12]:
group

array([[1. , 1.1],
       [1. , 1. ],
       [0. , 0. ],
       [0. , 0.1]])

In [13]:
labels

['A', 'A', 'B', 'B']

## Classification using kNN

To classify an input data point `inX`, we will:

*   Calculate the distance between `inX` and every point in the training dataset, whose classification labels are already known
*   Sort the distances in increasing order
*   Select the *k* points with the lowest distances to `inX`
*   Find the majority class among these *k* points
*   Return the majority class as the classification of `inX`



### Distance calculation logic

The distance between two data points is calculated using the Euclidean distance: the square root of the sum of the squared differences between their coordinates.

![Euclidean distance formula](https://i.imgur.com/I9sKgfn.png)
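As a quick check of the distance calculation, here is a minimal sketch that computes the distance from one query point to every row of the sample dataset. The names `points`, `query` and `distances` are illustrative assumptions, and NumPy broadcasting is used here instead of `np.tile`.

```python
import numpy as np

# Training points from the sample dataset and a query point
# (the query is chosen purely for illustration).
points = np.array([[1., 1.1], [1., 1.], [0, 0], [0, 0.1]])
query = np.array([0., 0.])

# Broadcasting subtracts the query from every row. Summing the squared
# differences along axis 1 and taking the square root gives one
# distance per training point.
distances = np.sqrt(((points - query) ** 2).sum(axis=1))
print(distances)
```

The smallest value belongs to the row `[0, 0]`, which coincides with the query, so its distance is zero.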

In [0]:
def knn_classify(inX, dataset, labels, k):
    
    # Step 1: Calculate distance between inX and
    # datapoints in dataset
    
    dataset_size = dataset.shape[0] # No. of rows
    diff_mat = np.tile(inX, (dataset_size, 1)) - dataset # Calculate difference between points
    
    sq_diff_mat = diff_mat**2 # Square the differences
    
    sq_distance = sq_diff_mat.sum(axis=1) # Sum the squared differences
    
    # This is the Euclidean distance
    distance = sq_distance**0.5 # Take square root of sum of squared difference
    
    # Step 2: Sort distances in increasing order
    sorted_distance = distance.argsort()
    
    # Step 3: Select k items with lowest distance
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_distance[i]] # Get the labels/class of ith index
        class_count[vote_label] = class_count.get(vote_label, 0) + 1 # Add count of label
    
    # Step 4: Find label/class with majority of items (max count)
    sorted_class_count = sorted(class_count.items(),
                                key=operator.itemgetter(1), reverse=True)
    
    # Step 5: Return the label with majority/max count
    return sorted_class_count[0][0]
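
To try the five steps on the sample dataset, here is a compact, self-contained sketch. It is not the notebook's `knn_classify` itself: `knn_classify_compact` is a hypothetical name, and this version assumes NumPy broadcasting in place of `np.tile` and `collections.Counter` in place of the manual vote dictionary.

```python
from collections import Counter

import numpy as np


def knn_classify_compact(inX, dataset, labels, k):
    # Hypothetical helper following the same five steps as above.
    # Steps 1: Euclidean distance from inX to every row of dataset.
    distances = np.sqrt(((dataset - np.asarray(inX)) ** 2).sum(axis=1))
    # Steps 2-3: indices of the k training points closest to inX.
    nearest = distances.argsort()[:k]
    # Steps 4-5: count the neighbours' labels and return the most common.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Same sample data as create_dataset() returns.
group = np.array([[1., 1.1], [1., 1.], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_classify_compact([0, 0], group, labels, 3))    # prints B
print(knn_classify_compact([1, 1.2], group, labels, 3))  # prints A
```

With *k* = 3, the point `[0, 0]` has two `B` neighbours and one `A` neighbour, so it is classified as `B`; the point `[1, 1.2]` has two `A` neighbours and one `B` neighbour, so it is classified as `A`.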