## K-Nearest Neighbors algorithm
K-nearest neighbors (KNN) is a supervised learning method for classification and regression that assigns a label to a new point based on its similarity to labeled training points.

KNN assumption:

- data points that lie close to each other are similar, while points that are far apart are dissimilar

Steps of the algorithm
1. We have a point whose label we want to predict, the query
2. We calculate the distance between the query and every data point
3. We sort the distances and take the k points with the k smallest distances
4. For regression we return the mean of the labels of these k points; for classification, the mode
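
The steps above can be traced by hand on a tiny example. The sketch below uses made-up 2-D points, class labels and numeric targets (none of them come from the dataset used later) and NumPy broadcasting for the distances:

```python
from collections import Counter

import numpy as np

# Made-up training set: five 2-D points, each with a class label and a numeric target
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 3.0], [6.0, 5.0], [7.0, 7.0]])
y_class = np.array([0, 0, 0, 1, 1])
y_value = np.array([1.5, 1.0, 3.0, 5.5, 7.0])

query = np.array([4.0, 4.0])  # step 1: the point we want a label for
k = 3

# step 2: Euclidean distance from the query to every training point
distances = np.sqrt(((X - query) ** 2).sum(axis=1))

# step 3: positions of the k smallest distances
nearest_idx = np.argsort(distances, kind="stable")[:k]

# step 4: mode of the neighbours' labels (classification), mean of their targets (regression)
predicted_class = Counter(y_class[nearest_idx].tolist()).most_common(1)[0][0]
predicted_value = y_value[nearest_idx].mean()

print(nearest_idx, predicted_class, predicted_value)
```

With these values the three nearest points are those at positions 2, 3 and 0, so the vote gives class 0 and the regression estimate is the mean of 3.0, 5.5 and 1.5.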

In [132]:
import numpy as np
import pandas as pd
import sklearn as skl
import sklearn.metrics  # load the submodule explicitly so that skl.metrics is available
from scipy import stats as st

## Implementation of algorithm

In [128]:
def euclidean_dist(vec1,vec2):
    '''
    Calculate the Euclidean distance between two vectors of equal length.
    '''
    sum_of_squares=0
    for i in range(0,len(vec1)):
        sum_of_squares+=(vec1[i]-vec2[i])**2
    return np.sqrt(sum_of_squares)

def knn(data,query,k,dist=euclidean_dist,regression=True):
    '''
    Solve regression and classification tasks with the k-nearest neighbors algorithm.
    Inputs: data as a data frame whose last column holds the labels, query as a list of feature values,
    k number of nearest neighbors, dist distance function,
    regression if True the function solves a regression problem, otherwise a classification problem.
    '''
    index_neighbour_distance = []

    #Calculate distance between query point and every other point in the data
    for index, coordinate in enumerate(data.values):
        distance=dist(coordinate[:-1],query)
        index_neighbour_distance.append((distance,index)) # add calculated distances and indexes to the list

    sorted_neighbour_distance=sorted(index_neighbour_distance) # sort neighbor distances in ascending order
    result = [data.values[i][-1] for x,i in sorted_neighbour_distance[:k]] # take k nearest points and read their labels
    
    # for regression we choose the mean of the result and for classification we take mode
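    if regression:
        return np.mean(result)
    else:
        # depending on the SciPy version, st.mode returns an array or a scalar;
        # np.atleast_1d makes the indexing work in both cases
        return np.atleast_1d(st.mode(result)[0])[0]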
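
As a possible alternative to the loop-based version above, the same idea can be written with NumPy array operations. This is only a sketch: `knn_vectorized` is a hypothetical name, and unlike `knn` it assumes the features and labels are passed as two separate arrays and that the distance is always Euclidean.

```python
import numpy as np

def knn_vectorized(features, labels, query, k, regression=True):
    """Hypothetical NumPy variant of knn(); Euclidean distance only."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    query = np.asarray(query, dtype=float)

    # distance from the query to every row at once (broadcasting)
    distances = np.linalg.norm(features - query, axis=1)
    # positions of the k smallest distances
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbour_labels = labels[nearest]

    if regression:
        return neighbour_labels.mean()
    # majority vote; if two labels are equally frequent, the smaller one wins
    values, counts = np.unique(neighbour_labels, return_counts=True)
    return values[np.argmax(counts)]

# Made-up toy data: two well-separated groups of three points
toy_features = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_labels = [0, 0, 0, 1, 1, 1]
print(knn_vectorized(toy_features, toy_labels, [0.2, 0.1], k=3, regression=False))
print(knn_vectorized(toy_features, toy_labels, [5.5, 5.5], k=3, regression=False))
```

Computing all distances in one array operation avoids the Python-level loop over rows, which usually matters once the training set grows.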


## Dataset
This dataset is part of a larger dataset; from the original data I kept only seven principal components (which retain 90% of the variability) and the labels.
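
How the number of components was chosen is not shown here. As a rough sketch of the general idea, the snippet below runs PCA through NumPy's SVD on synthetic stand-in data (not the cancer data) and keeps the smallest number of components whose cumulative explained variance reaches 90%. All names and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 10 correlated features built from 3 latent factors
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# PCA works on centred data
X_centered = X - X.mean(axis=0)

# squared singular values are proportional to the variance along each component
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# smallest number of components whose cumulative explained variance reaches 90%
n_components = int(np.searchsorted(cumulative, 0.90) + 1)

# project the data onto the retained components
scores = X_centered @ components[:n_components].T
print(n_components, scores.shape)
```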

In [137]:
data = pd.read_csv('data_pca_cancer',sep=';')
data =data.iloc[:,1:]
data.head()

        pc1        pc2        pc3       pc4        pc5        pc6        pc7  diagnosis
0  9.192837   1.948583  -1.123166  3.633731   -1.19511   1.411424    2.15937          1
1  2.387802  -3.768172  -0.529293  1.118264   0.621775   0.028656   0.013358          1
2  5.733896  -1.075174  -0.551748  0.912083  -0.177086   0.541452  -0.668166          1
3  7.122953  10.275589   -3.23279  0.152547  -2.960878   3.053422   1.429911          1
4  3.935302  -1.948072   1.389767  2.940639   0.546747  -1.226495  -0.936213          1


The data is split into a training set and a test set in order to measure the accuracy of the algorithm.

In [138]:
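from sklearn.model_selection import train_test_split

# hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    data.iloc[:, :7], data['diagnosis'], test_size=0.2)
# knn() expects the label in the last column of the training data
X_train = X_train.assign(diagnosis=y_train)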
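
Because the split above is random, the numbers reported below change from run to run. The sketch below shows, on made-up arrays, roughly what a shuffled 80/20 split does and how a fixed seed makes it repeatable. It is a hand-written illustration, not scikit-learn's implementation; with `train_test_split` the usual way to get the same effect is its `random_state` parameter.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed (arbitrary value) makes the split repeatable

n_samples = 10
X = np.arange(n_samples * 2, dtype=float).reshape(n_samples, 2)  # made-up features
y = np.array([0, 1] * (n_samples // 2))                          # made-up labels

# shuffle the row positions, then hold out the first 20% as the test set
shuffled = rng.permutation(n_samples)
n_test = int(np.ceil(0.2 * n_samples))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(len(X_train), len(X_test))
```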


## Prediction

In [139]:
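prediction=[]
for i in range(len(X_test)):
    # pass the query as a plain array so that vec2[i] inside dist() is positional indexing
    query=X_test.iloc[i,:].values
    prediction.append(knn(data=X_train, query=query, k=5, dist=euclidean_dist,regression=False))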
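
One way to sanity-check a hand-written KNN is to compare its predictions with scikit-learn's `KNeighborsClassifier`. The sketch below does this on synthetic blobs rather than on the data above; `knn_predict` is a hypothetical compact stand-in for the `knn` function, and the dataset parameters are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with 7 features (stand-in for the real dataset)
X, y = make_blobs(n_samples=200, n_features=7, centers=2,
                  cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

def knn_predict(query, k=5):
    """Hypothetical compact KNN classifier: majority vote among the k nearest points."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    return np.bincount(y_train[nearest]).argmax()

manual = np.array([knn_predict(q) for q in X_test])
reference = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)

agreement = np.mean(manual == reference)
print(f"agreement with scikit-learn: {agreement:.2%}")
```

If the two sets of predictions disagree on more than a handful of points, that usually points to a bug in the distance computation or in the voting step.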

In [142]:
print(skl.metrics.confusion_matrix(y_test,prediction))  # argument order is (y_true, y_pred)

[[72  1]
 [ 1 40]]


In [145]:
# share of correct predictions: (72 + 40) / 114
accuracy=skl.metrics.accuracy_score(y_test,prediction)
accuracy

0.9824561403508771
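
The accuracy value can also be read directly off the confusion matrix: the diagonal holds the correct predictions and the sum of all cells is the number of test points. A minimal sketch, with the matrix printed above typed in by hand:

```python
import numpy as np

# confusion matrix from the output above
cm = np.array([[72, 1],
               [1, 40]])

# correct predictions (diagonal) divided by all predictions
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```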

In [141]:
print(skl.metrics.classification_report(y_test,prediction))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99        73
           1       0.98      0.98      0.98        41

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
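
The per-class numbers in the report can be reproduced from the confusion matrix. The sketch below assumes scikit-learn's convention of true classes in rows and predicted classes in columns; the matrix here is symmetric, so the orientation does not change the result.

```python
import numpy as np

# confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[72, 1],
               [1, 40]])

correct = np.diag(cm)
precision = correct / cm.sum(axis=0)  # correct / everything predicted as that class
recall = correct / cm.sum(axis=1)     # correct / everything that truly is that class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)              # number of true samples per class

for label in range(len(cm)):
    print(f"{label}: precision={precision[label]:.2f} recall={recall[label]:.2f} "
          f"f1={f1[label]:.2f} support={support[label]}")
```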



## Conclusions
All metrics are high, about 98-99%, so we can conclude that the algorithm performed well on this test split.

## Remark
Since I am still learning and want to improve my coding and machine learning skills, I would be grateful for any feedback and advice.

14.07.2022