# K-nearest neighbors classification algorithm

## What is the K-nearest neighbors classification algorithm?
The K-nearest neighbors (KNN) classification algorithm is a classification model that predicts which class (category) a new data point belongs to. It is based on the distance between the new data point and the data points that are already assigned to a class. The algorithm finds the K labeled data points closest to the new data point and counts how many of them belong to each class. The new data point is assigned to the class with the most votes.<br>
For example, with 2 classes A and B, a new data point is assigned to class A if it lies closer to the data points of class A.<br>
More precisely, with K=3, if 2 of the 3 nearest neighbors belong to class A and 1 belongs to class B, the new data point is assigned to class A.
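
The voting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later; the function name `knn_predict` and the toy points are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled points."""
    # Euclidean distance from the new point to every labeled point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy data (made up for illustration): class A in the lower left, class B in the upper right
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))  # all 3 neighbors are A
print(knn_predict(points, labels, (4.0, 4.0), k=3))  # 2 neighbors are B, 1 is A
```

For the point `(4.0, 4.0)`, the three nearest neighbors are split 2 to 1, so the majority class wins even though one neighbor disagrees.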

This classification has several requirements:
1. A dataset with already labeled data points
1. A distance metric (e.g. Euclidean distance)
1. An appropriately chosen 'K' (e.g. with an even number of classes, an even K can lead to tied votes)
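
To see why the choice of K matters, the sketch below counts votes for different values of K on a made-up list of neighbor labels. The helper `vote` is hypothetical and only reports whether the top two classes are tied; how a tie is actually resolved depends on the implementation.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return (winner, is_tie) for a list of neighbor labels."""
    counts = Counter(neighbor_labels).most_common()
    # a tie occurs when the two most frequent classes have the same count
    is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], is_tie

# labels of the nearest neighbors, ordered by distance (made-up values)
ordered_labels = ["A", "B", "B", "A", "A"]

for k in (2, 3, 4, 5):
    winner, is_tie = vote(ordered_labels[:k])
    print(k, winner, "tie" if is_tie else "clear majority")
```

With two classes, K=2 and K=4 produce tied votes here, while K=3 and K=5 always yield a clear majority.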

In [122]:
import numpy as np
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd

# Download the breast cancer patient data from the UC Irvine Machine Learning Repository:
# 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic'

# Save the data in the same directory as your file.

# Add a header row with the column names to the data file (e.g. in a text editor).
# Do not put whitespace after the commas.
# You do not need to name every column; leave the remaining headers blank.

# import the datafile
df = pd.read_csv('8.breast-cancer-wisconsin.data')

df.head()

Unnamed: 0,ID,Diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave_points,...,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [123]:
# keep only the label (Diagnosis) and the ten named feature columns;
# this drops the ID and the unnamed columns
df = df.loc[:, "Diagnosis":"frac_dimension"]

df.head()

Unnamed: 0,Diagnosis,radius,texture,perimeter,area,smoothness,compactness,concavity,concave_points,symmetry,frac_dimension
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883


In [149]:
# creating features (X) and the label (y)
X = np.array(df.drop(['Diagnosis'], axis=1))
y = np.array(df['Diagnosis'])

# split the data into training (80 %) and test (20 %) sets;
# the split is random, so the accuracy varies between runs
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

# use k-nearest neighbors (KNN) as the classifier (scikit-learn's default is n_neighbors=5)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

# Accuracy vs. confidence: accuracy is the number of correct predictions divided by
# the total number of predictions on the test set, while 'confidence' is not a
# common term in the context of classification models.
accuracy = clf.score(X_test, y_test)
print("Accuracy: %.2f %%" % (accuracy*100))

# let's make predictions for measurements from two new patients
example_measures = np.array([[18.12,14.73,128.3,1088.0,0.1123,0.2425,0.2110,0.09513,0.2184,0.06895],
                             [18.12,14.73,128.3,1098.0,0.1123,0.2425,0.2180,0.09513,0.2184,0.06895]])
# ensure the shape (n_samples, n_features); a no-op here because the array is already 2-D
example_measures = example_measures.reshape(len(example_measures),-1)

predictions = clf.predict(example_measures)

print("Predictions:", predictions)

Accuracy: 84.21 %
Predictions: ['M' 'M']
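
The accuracy printed above is the share of test samples whose predicted label matches the true label. As a rough sketch of that calculation, the hypothetical helper `accuracy` below does the same arithmetic on made-up labels (in this dataset, M stands for malignant and B for benign).

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# made-up labels for illustration, not taken from the dataset
y_true = ["M", "B", "B", "M", "B"]
y_pred = ["M", "B", "M", "M", "B"]

print("Accuracy: %.2f %%" % (accuracy(y_true, y_pred) * 100))  # Accuracy: 80.00 %
```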
