# KNN Search from Scratch

## Step 1: Calculate the Euclidean Distance
We calculate the Euclidean distance between two rows of the dataset.

Euclidean distance = sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2 + ....)

In the test dataset below, the first two columns are the input variables and the last column is the output: a class label that is either 0 or 1. The distance function therefore skips the last column of each row.


In [7]:
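import math
def euclidean_distance(row1, row2):
    distance = 0.0
    # Skip the last column: it holds the class label, not an input variable
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return math.sqrt(distance)
# Small test dataset: two input columns followed by a class label (0 or 1)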
dataset = [[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]]
# Finding Euclidean Distance of every row from first row
for row in dataset:
    distance = euclidean_distance(dataset[0], row)
    print(distance)


0.0
1.3290173915275787
1.9494646655653247
1.5591439385540549
0.5356280721938492
4.850940186986411
2.592833759950511
4.214227042632867
6.522409988228337
4.985585382449795
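
As a cross-check on the hand-written loop, the same distance can be computed with the standard library's `math.dist` (available from Python 3.8). This is a minimal sketch and not part of the original notebook: the names `row0` and `row1` are introduced here for illustration, and the label column is sliced off before the call.

```python
import math

# First two rows of the dataset above: two input variables, then a class label
row0 = [2.7810836, 2.550537003, 0]
row1 = [1.465489372, 2.362125076, 0]

# Drop the label column, then let math.dist compute the Euclidean distance
d = math.dist(row0[:-1], row1[:-1])
print(d)  # should agree with the second distance printed above (about 1.329)
```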


## Step 2: Get the Nearest Neighbors
After calculating the distances, you can find the k nearest neighbors: the k rows with the smallest distance to the target row.


In [14]:
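def get_neighbors(train, test_row, k):
    # Pair every training row with its distance to the test row
    distances = []
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    # Sort ascending by distance (element 1 of each tuple)
    distances.sort(key=lambda tup: tup[1])
    # Keep the rows of the k closest entries
    neighbors = []
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors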
neighbors = get_neighbors(dataset, dataset[0], 3)
for neighbor in neighbors:
    print(neighbor)

[2.7810836, 2.550537003, 0]
[3.06407232, 3.005305973, 0]
[1.465489372, 2.362125076, 0]


As the code above shows, the function first calculates the distance from test_row to each train_row and appends each train_row together with its distance, as a tuple, to the list __distances__. It then sorts __distances__ in ascending order: `key=lambda tup: tup[1]` tells the sort to use the element at position 1 of each tuple, i.e. the distance, as the sort key.

Finally, the train_row of each of the first k tuples is added to the neighbors list. Because the query row dataset[0] is itself part of the dataset, it comes back as its own nearest neighbor, at distance 0.
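
Sorting the whole list is more work than needed when only k rows are kept. One possible alternative, sketched here under the assumption that the same distance function and dataset are used, is `heapq.nsmallest` from the standard library. The name `get_neighbors_heap` is hypothetical and does not appear in the original notebook.

```python
import heapq
import math

def euclidean_distance(row1, row2):
    # Same metric as above: the last column (the class label) is ignored
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1[:-1], row2[:-1])))

def get_neighbors_heap(train, test_row, k):
    # nsmallest returns the k rows with the smallest key, closest first
    return heapq.nsmallest(k, train, key=lambda row: euclidean_distance(test_row, row))

dataset = [[2.7810836, 2.550537003, 0],
           [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0],
           [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0],
           [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1],
           [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1],
           [7.673756466, 3.508563011, 1]]

neighbors = get_neighbors_heap(dataset, dataset[0], 3)
for neighbor in neighbors:
    print(neighbor)
```

One difference from the loop-based version: if k is larger than the number of rows, `nsmallest` simply returns every row instead of raising an IndexError.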

## Step 3: Make Predictions
After finding the most similar neighbors, we can make a prediction.
For classification, we return the most common class among the neighbors.

In [17]:
def predict_classification(train, test_row, k):
    neighbors = get_neighbors(train, test_row, k)
    # Class label (last column) of each neighbor
    output_values = [row[-1] for row in neighbors]
    # Most frequent label among the neighbors
    prediction = max(set(output_values), key=output_values.count)
    return prediction
prediction = predict_classification(dataset, dataset[0], 3)
print('Expected %d, Got %d.' % (dataset[0][-1], prediction))

Expected 0, Got 0.


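In the code above, output_values stores the last value (the class label) of each row in the neighbors list.
prediction is then the label that occurs most often: `max` is applied to the set of distinct labels in __output_values__, with each label's frequency in the list as the key. If two labels tie, the winner depends on the iteration order of the set.

Finally, we print the actual value (Expected) and the predicted value (Got).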
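
The majority vote can also be written with `collections.Counter`, which counts each label once instead of calling `count` for every distinct label. This is a sketch of an alternative, not the notebook's own method, and the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(labels).most_common(1)[0][0]

# Labels of the three neighbors found above, nearest first
print(majority_vote([0, 0, 0]))
# A mixed neighborhood: two votes for class 1, one for class 0
print(majority_vote([1, 0, 1]))
```

With `Counter`, labels with equal counts are ordered by first occurrence, so if the labels are passed nearest first, a tie goes to the label of the closer neighbor.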

# KNN Search on Images:

To run a KNN search for visual search, we first use a pretrained CNN model to extract features from the images and then run the KNN search on those features.

A CNN (convolutional neural network) is a type of deep learning model used to analyze visual imagery. CNNs are applied to image recognition, image classification, object detection, face recognition, and similar tasks.

Here we'll use a pretrained model based on the ResNet architecture from the MXNet Model Zoo.

MXNet is an open-source deep learning framework built to ease the development of deep learning algorithms. It is used to define, train, and deploy deep neural networks; it allows fast model training and supports a flexible programming model.

model_zoo is a package that provides predefined and pretrained models, so a model does not have to be built and trained from scratch.

The ResNet architecture has achieved high accuracy in classifying the more than 11 million images of the ImageNet dataset.
Many other architectures are also available, such as AlexNet, SqueezeNet, and DenseNet.

This model’s output layer generates feature vectors instead of labels for classifying images. Feature vectors are one of the core components of visual search.

For the KNN search, the first step is to construct an index. With the index, we can look up the "neighbors" of a feature vector by Euclidean distance or cosine similarity.
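
The two measures can rank neighbors differently, because cosine similarity looks only at direction while Euclidean distance also depends on vector length. The sketch below illustrates this on made-up three-dimensional vectors; it assumes NumPy is installed, and the vectors `a`, `b`, `c` and both helper functions are illustrative, not taken from the notebook.

```python
import numpy as np

def euclidean(u, v):
    # Straight-line distance between the two vectors
    return float(np.linalg.norm(u - v))

def cosine_similarity(u, v):
    # Cosine of the angle between the vectors; assumes neither is all zeros
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                      # same direction as a, twice the length
c = np.array([3.0, -1.5, 0.0])   # perpendicular to a (dot product is 0)

print(euclidean(a, b), cosine_similarity(a, b))
print(euclidean(a, c), cosine_similarity(a, c))
```

Here `b` has a cosine similarity of 1 with `a` even though the two are a nonzero Euclidean distance apart, while `c` has a cosine similarity of 0.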


In [18]:
pip install hnswlib

Note: you may need to restart the kernel to use updated packages.


In [20]:
import hnswlib

__hnswlib__ is a library for fast approximate nearest neighbor search. It implements HNSW (Hierarchical Navigable Small World) graphs, an efficient, fully graph-based approach.

So, first we'll create the index.

`num_elements = len(features)
labels_index = np.arange(num_elements)`

num_elements stores the length of features, i.e. the number of feature vectors, and labels_index is a NumPy array of the integers from 0 to num_elements - 1, one ID per feature vector.

`p = hnswlib.Index(space = 'cosine', dim = EMBEDDING_SIZE)`
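__hnswlib.Index__ creates an uninitialized index. `space = 'cosine'` selects cosine distance as the metric, and `dim` is the dimensionality of the feature vectors (here EMBEDDING_SIZE).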