# K Nearest Neighbors

To learn how to apply the K Nearest Neighbors algorithm to a dataset, we will use the Breast Cancer Wisconsin (Original) dataset, provided by UC Irvine: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
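Before turning to scikit-learn, it may help to see the idea behind the algorithm in miniature. The following is a minimal sketch, not the scikit-learn implementation: it assumes Euclidean distance and a simple majority vote, and the function name `knn_predict` and the toy data are made up for illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (sorted by distance only).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Return the most common label among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features per point, labels 2 and 4.
toy_points = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
toy_labels = [2, 2, 2, 4, 4, 4]

print(knn_predict(toy_points, toy_labels, (2, 2)))
print(knn_predict(toy_points, toy_labels, (8, 8)))
```

A query point near the first cluster receives label 2, and one near the second cluster receives label 4.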

In [1]:
# Import support libraries
import os

# Import analytical libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import neighbors

In [2]:
# Import data
data_file_path = os.path.join('Data', 'breast-cancer-wisconsin.data')
data = pd.read_csv(data_file_path)

We read in breast-cancer-wisconsin.names that,

<em>There are 16 instances in Groups 1 to 6 that contain a single missing 
   (i.e., unavailable) attribute value, now denoted by "?". </em>

In [3]:
# Preview data
display(data.head())

# Summarize data
display(data.describe())

# Check for null data
display(data.isna().sum())

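| | id | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |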


| | id | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bland_chromatin | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


id                             0
clump_thickness                0
uniformity_cell_size           0
uniformity_cell_shape          0
marginal_adhesion              0
single_epithelial_cell_size    0
bare_nuclei                    0
bland_chromatin                0
normal_nucleoli                0
mitoses                        0
class                          0
dtype: int64

As noted above, missing values in this dataset are denoted by "?" rather than left empty, which is why the null check reports zero missing values. We will replace each "?" with -99999, a value far outside the 1 to 10 range of the features, so that the affected rows act as extreme outliers instead of being dropped.

In [4]:
data.replace('?', -99999, inplace = True)
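The reason `isna()` finds nothing is that "?" is an ordinary string, not a null value. The sketch below illustrates this on a small made-up frame (the `sample` data is hypothetical, not taken from the dataset). It also shows `pd.to_numeric` with `errors="coerce"` as one possible alternative for exposing the placeholders as real missing values; this is an illustration, not the approach the notebook takes.

```python
import pandas as pd

# Hypothetical miniature of the bare_nuclei column, read as strings.
sample = pd.DataFrame({
    "bare_nuclei": ["1", "10", "?", "4"],
    "class": [2, 4, 2, 4],
})

# isna() does not flag "?" because it is a regular string value.
null_count = int(sample["bare_nuclei"].isna().sum())

# Counting the placeholder explicitly reveals the missing entry.
placeholder_count = int((sample["bare_nuclei"] == "?").sum())

# Coercing to numeric turns unparseable values such as "?" into NaN.
numeric = pd.to_numeric(sample["bare_nuclei"], errors="coerce")
missing_after_coerce = int(numeric.isna().sum())

print(null_count, placeholder_count, missing_after_coerce)
```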

In [5]:
# Define features and labels
X = np.array(data.drop(['class'], axis=1))
y = np.array(data['class'])

# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [6]:
# Define classifier
classifier = neighbors.KNeighborsClassifier()

# Fit data
classifier.fit(X_train, y_train)

# Test accuracy
classifier.score(X_test, y_test)

0.6357142857142857

Our classifier reaches only about 64% accuracy on the test split, which is poor. This illustrates why feature selection matters. The dataset includes an "id" column, which simply labels each row with a unique identifier. The id carries no meaningful information about breast cancer, and because its values are far larger than the 1 to 10 feature values, it only adds noise to the distance calculations. We will therefore drop this column and observe how the accuracy changes.
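To make the effect of the id column concrete, here is a small sketch using three rows copied from the data preview. It assumes plain Euclidean distance and treats every value as an integer for simplicity; the helper `nearest_to_a` is a made-up name for illustration.

```python
import math

# Three rows from the data preview: id followed by the nine features.
row_a = [1000025, 5, 1, 1, 1, 2, 1, 3, 1, 1]
row_b = [1002945, 5, 4, 4, 5, 7, 10, 3, 2, 1]
row_c = [1015425, 3, 1, 1, 1, 2, 2, 3, 1, 1]


def nearest_to_a(with_id):
    """Return which of row_b or row_c is closer to row_a."""
    # Skip the first element (the id) when with_id is False.
    start = 0 if with_id else 1
    dist_b = math.dist(row_a[start:], row_b[start:])
    dist_c = math.dist(row_a[start:], row_c[start:])
    return "b" if dist_b < dist_c else "c"


print(nearest_to_a(with_id=True))
print(nearest_to_a(with_id=False))
```

With the id included, row b counts as the nearer neighbor purely because its id is numerically closer. Once the id is removed, row c, whose measurements almost match row a, becomes the nearer one.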

In [7]:
# Drop id column
data.drop(['id'], axis=1, inplace = True)

# Define features and labels
X = np.array(data.drop(['class'], axis=1))
y = np.array(data['class'])

# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define classifier
classifier = neighbors.KNeighborsClassifier()

# Fit data
classifier.fit(X_train, y_train)

# Test accuracy
classifier.score(X_test, y_test)

0.9642857142857143
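Note that `train_test_split` shuffles the rows, so the two scores above come from different random splits and will vary from run to run. A minimal sketch of making a split repeatable follows; the stand-in data and the seed value 42 are arbitrary choices for illustration, not values used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 10 samples, 2 features, labels 2 and 4.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([2, 4] * 5)

# Passing the same random_state yields the same shuffle each time.
first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each result is [X_train, X_test, y_train, y_test]; compare them pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

Fixing the seed in both experiments would make the comparison between the runs with and without the id column depend only on the features, not on which rows happened to land in the test set.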

...

We will now use this classifier to predict a sample case.

In [8]:
# Define example measures
example_measures = np.array([4, 2, 1, 1, 1, 2, 3, 2, 1]).reshape(1,-1)

# Preview classifier accuracy on the training data
display(classifier.score(X_train, y_train))

# Predict
display(classifier.predict(example_measures))

0.9731663685152058

array([2], dtype=int64)

Our classifier predicted a value of "2" based on the sample measures, which per breast-cancer-wisconsin.names corresponds to benign.

We can also predict multiple samples at once.

In [15]:
# Define example measures
example_measures = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1], [10, 2, 2, 4, 6, 2, 5, 7, 2], [5, 4, 4, 2, 6, 4, 3, 1, 2]])

example_measures = example_measures.reshape(len(example_measures),-1)

# Preview classifier accuracy on the training data
display(classifier.score(X_train, y_train))

# Predict
display(classifier.predict(example_measures))

0.9731663685152058

array([2, 4, 2], dtype=int64)
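The numeric codes can be translated into readable labels. The sketch below uses the class coding documented in breast-cancer-wisconsin.names (2 for benign, 4 for malignant); the names `CLASS_NAMES` and `readable` are made up for illustration, and the predictions are copied from the output above.

```python
# Class codes as documented in breast-cancer-wisconsin.names.
CLASS_NAMES = {2: "benign", 4: "malignant"}

# Predictions copied from the output above.
predictions = [2, 4, 2]

# Look up the label for each predicted code.
readable = [CLASS_NAMES[code] for code in predictions]
print(readable)
```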

# Sources

1. <a href='https://pythonprogramming.net/k-nearest-neighbors-application-machine-learning-tutorial/'>Applying K Nearest Neighbors to Data</a>