# Classification using the K-Nearest Neighbors algorithm

Given a new data point $P$, you find the $K$ training points nearest to $P$ and assign $P$ to the class that gets the most votes among those $K$ neighbors.
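To make the voting rule concrete, here is a minimal from-scratch sketch using only the standard library. It is not how scikit-learn implements the classifier used later in this notebook; the function name `knn_predict`, the Euclidean distance metric, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two small, well-separated clusters
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2), k=3))  # a
print(knn_predict(train_points, train_labels, (7, 9), k=3))  # b
```

With two classes, an odd $K$ avoids tied votes.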

In [1]:
import numpy as np 
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd

In [8]:
df = pd.read_csv('./data/breast-cancer-wisconsin.data')


**Note:** It is important to drop any column that gives no information about which class a data point belongs to, because KNN uses every column in the dataset when computing proximity.
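A small sketch of why an uninformative column hurts, assuming Euclidean distance. The three rows below are made-up values laid out as `(id, feature_1, feature_2)`, not rows from the dataset.

```python
import math

# Hypothetical rows: (id, feature_1, feature_2)
a = (1000025, 5, 1)
b = (1000030, 9, 10)  # id close to a's, features very different
c = (1300000, 5, 1)   # id far from a's, features identical to a's

# Distances with the id column included
with_id_ab = math.dist(a, b)
with_id_ac = math.dist(a, c)

# Distances with the id column dropped
without_id_ab = math.dist(a[1:], b[1:])
without_id_ac = math.dist(a[1:], c[1:])

# With the id included, b looks nearer to a than c does,
# even though c has exactly the same features as a.
print(with_id_ab < with_id_ac)        # True
# Without the id, c is correctly the nearer point.
print(without_id_ac < without_id_ab)  # True
```

The id differences are orders of magnitude larger than the feature differences, so they decide which points count as neighbors.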

In [9]:
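# Missing values are marked with '?'; replace them with an extreme outlier value
df.replace('?', -99999, inplace=True)
# The id column says nothing about the class, so drop it
df.drop('id', axis=1, inplace=True)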

In [10]:
X = np.array(df.drop(['class'], axis=1))  # features: every column except the label
y = np.array(df['class'])  # labels

In [11]:
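# Hold out 20% of the rows for testing; no random_state is set, so the split
# (and the accuracy below) varies from run to run
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

# Default KNeighborsClassifier uses n_neighbors=5
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)

# Fraction of test rows classified correctly
accuracy = clf.score(X_test, y_test)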

In [12]:
print(accuracy)

0.9857142857142858


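If you don't drop the `id` column, the accuracy nosedives: the IDs are large, arbitrary numbers that dominate the distance computation but say nothing about the class.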

## Prediction

In [14]:
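# One new sample with the 9 feature values (no id, no class);
# predict expects a 2D array of shape (n_samples, n_features)
example_measures = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1]])

# In this dataset, class 2 is benign and class 4 is malignant
prediction = clf.predict(example_measures)
print(prediction)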

[2]
