# KNN Demo

This <a href="https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/">dataset</a> contains tumor measurements, each labeled as benign (B) or malignant (M). The goal is to predict the diagnosis from the measurements with a k-nearest-neighbors (KNN) classifier.

In [1]:
# Import the data and drop the identifier column
import pandas as pd
data = pd.read_csv("tumorData.csv")
data = data.drop(columns=['id'])

# Count missing values per column (nothing is removed here)
print(data.isnull().sum())
data.head()

diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed: 32                569
dtype: i

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


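`DataFrame.corr` only works on numerical columns, so first convert 'diagnosis' to a number (B becomes 1, M becomes 0). Then keep the columns whose correlation with the diagnosis is stronger than 0.75 in absolute value.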

In [2]:
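# Encode the label as a number, then pick columns to use
data['diagnosis'] = data['diagnosis'].map({'B': 1, 'M': 0})
corr = data.corr()
# Show only correlations with diagnosis above 0.75 in absolute value
corr['diagnosis'][abs(corr['diagnosis']) > 0.75]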

diagnosis               1.000000
concave points_mean    -0.776614
radius_worst           -0.776454
perimeter_worst        -0.782914
concave points_worst   -0.793566
Name: diagnosis, dtype: float64
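
The filtered result above is a Series whose index holds the column names, so the feature list could be built from it rather than typed by hand. The following is a minimal sketch on a small made-up frame, not the tumor data; the helper name `select_features` and the toy columns `f_strong` and `f_weak` are hypothetical.

```python
import numpy as np
import pandas as pd

def select_features(df, target, threshold):
    """Return the columns whose absolute correlation with `target` exceeds `threshold`."""
    corr = df.corr()[target]
    strong = corr[corr.abs() > threshold]
    # The target always correlates perfectly with itself, so leave it out.
    return strong.drop(target).index.tolist()

# Toy data (assumption): one column that tracks the label closely, one that is pure noise.
rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "diagnosis": label,
    "f_strong": label + rng.normal(0, 0.1, size=200),
    "f_weak": rng.normal(0, 1, size=200),
})

features = select_features(toy, "diagnosis", 0.75)
print(features)
```

On the real data, the same call with `data`, `'diagnosis'`, and `0.75` should return the four columns listed in the output above.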

Select the features from above that are strongly correlated with the diagnosis. Then split the data into training and validation sets, holding out 20% of the rows for validation.

In [3]:
# Split the data into training and validation (test) sets
from sklearn.model_selection import train_test_split

X = data[['concave points_mean', 'radius_worst', 'perimeter_worst', 'concave points_worst']]
Y = data['diagnosis']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
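
To make `test_size=0.2` concrete, here is a small sketch that splits placeholder arrays with the same number of rows as the dataset (569, per the null counts above) and prints the resulting sizes. The arrays and the names `train_rows` and `test_rows` are stand-ins chosen for illustration, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): one row per tumor, labels all zero.
rows = np.arange(569).reshape(-1, 1)
labels = np.zeros(569, dtype=int)

train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=0.2, random_state=42
)

# Every row lands in exactly one of the two sets.
print(len(train_rows), len(test_rows))
```

Because `random_state` is fixed, rerunning the split gives the same partition each time, which keeps the accuracy figures reproducible.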

Create the model using k = 3, so each prediction is decided by the 3 nearest training points.

In [4]:
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train) 
predict = knn.predict(X_test)
pd.DataFrame({"Actual": Y_test, "Predictions": predict}).head(10)

| | Actual | Predictions |
| --- | --- | --- |
| 204 | 1 | 1 |
| 70 | 0 | 0 |
| 131 | 0 | 0 |
| 431 | 1 | 1 |
| 540 | 1 | 1 |
| 567 | 0 | 0 |
| 369 | 0 | 0 |
| 29 | 0 | 0 |
| 81 | 1 | 1 |
| 477 | 1 | 1 |
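
To show what the classifier does for each prediction, here is a minimal hand-written sketch of the k = 3 vote: compute the Euclidean distance from the query to every training row, take the 3 closest, and return the most common label. It runs on random toy data and assumes the scikit-learn defaults (Euclidean distance, uniform weights); `knn_predict_one` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_predict_one(X_fit, y_fit, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    distances = np.linalg.norm(X_fit - query, axis=1)  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]                # indices of the k closest rows
    votes = Counter(y_fit[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (assumption): 4 random features, label decided by the sign of the first one.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)
queries = rng.normal(size=(10, 4))

manual = np.array([knn_predict_one(X_toy, y_toy, q) for q in queries])
library = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict(queries)
print((manual == library).all())
```

With two classes, an odd k means the vote can never tie.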


In [5]:
# Get accuracy
from sklearn.metrics import accuracy_score

accuracy_score(Y_test, predict)

0.92105263157894735

Training accuracy is computed by predicting on the same training set the model was fit on. Validation accuracy is computed by predicting on a separate validation set that the model has not seen. With a very small k, the training accuracy is very high but the validation accuracy is lower, because the model overfits the training data.
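
The extreme case is k = 1: every training point is its own nearest neighbor (at distance 0), so the model reproduces the training labels exactly. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption made for the example and not the tumor data; the variable names are illustrative too.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (assumption), with 4 features like the notebook's X.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

one_nn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

# Predicting on the training rows: each row finds itself as its nearest neighbor.
train_acc = accuracy_score(y_tr, one_nn.predict(X_tr))
# Predicting on held-out rows: no such shortcut exists.
val_acc = accuracy_score(y_te, one_nn.predict(X_te))
print(train_acc, val_acc)
```

A perfect training score at k = 1 therefore says nothing about how the model will do on unseen rows; only the validation score does.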

In [6]:
# Try every odd k from 1 to 199 and compare training and validation accuracy
for x in range(1, 201, 2):
    knn = KNeighborsClassifier(n_neighbors = x)
    knn.fit(X_train, Y_train) 
    predictTrain = knn.predict(X_train)
    predictTest = knn.predict(X_test)
    print("n-neighbors: %d. Train accuracy: %f. Validation accuracy: %f" % (x, accuracy_score(Y_train, predictTrain), accuracy_score(Y_test, predictTest)))

n-neighbors: 1. Train accuracy: 1.000000. Validation accuracy: 0.885965
n-neighbors: 3. Train accuracy: 0.923077. Validation accuracy: 0.921053
n-neighbors: 5. Train accuracy: 0.914286. Validation accuracy: 0.929825
n-neighbors: 7. Train accuracy: 0.914286. Validation accuracy: 0.921053
n-neighbors: 9. Train accuracy: 0.907692. Validation accuracy: 0.938596
n-neighbors: 11. Train accuracy: 0.912088. Validation accuracy: 0.947368
n-neighbors: 13. Train accuracy: 0.912088. Validation accuracy: 0.947368
n-neighbors: 15. Train accuracy: 0.916484. Validation accuracy: 0.938596
n-neighbors: 17. Train accuracy: 0.914286. Validation accuracy: 0.938596
n-neighbors: 19. Train accuracy: 0.912088. Validation accuracy: 0.947368
n-neighbors: 21. Train accuracy: 0.914286. Validation accuracy: 0.947368
n-neighbors: 23. Train accuracy: 0.914286. Validation accuracy: 0.947368
n-neighbors: 25. Train accuracy: 0.905495. Validation accuracy: 0.938596
n-neighbors: 27. Train accuracy: 0.907692. Validation ac