# Breast Cancer - Machine Learning Example

The goal is to determine how accurately machine learning can predict breast cancer, using scikit-learn's K Nearest Neighbors algorithm.

Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

Creator:
Dr. William H. Wolberg (physician)
University of Wisconsin Hospitals
Madison, Wisconsin, USA 

Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. The grouping information has been removed from the data itself.

Attribute Information:
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

Import dependencies

In [26]:
import pandas as pd
import numpy as np
import sklearn as sk
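# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides the same train_test_split, so it is aliased to the old name here
from sklearn import preprocessing, neighbors
from sklearn import model_selection as cross_validation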




Read the CSV file into a pandas DataFrame

In [27]:
columns = ['id','Clump_thickness','Uniformity_Size','Uniformity_Shape','Marginal_Adhesion',
          'Epithelial_Size','Bare Nuclei','Bland_Chromatin','Normal Nucleoli','Mitoses','Class']
df = pd.read_csv('breast-cancer-wisconsin.csv',names = columns)

In [28]:
df.head()

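|   | id | Clump_thickness | Uniformity_Size | Uniformity_Shape | Marginal_Adhesion | Epithelial_Size | Bare Nuclei | Bland_Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |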


Clean Dataset

In [29]:
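# The id column carries no predictive information, so drop it.
# axis is passed by keyword; the positional form no longer works in pandas 2.0+
df.drop(['id'], axis=1, inplace=True)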

In [30]:
# Entries recorded as '?' are replaced with a large negative outlier value
df.replace('?', -999999, inplace=True)
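
The replacement step can be illustrated on a small stand-in frame. This is only a sketch: the `toy` DataFrame and its values are hypothetical, not rows from the real file, and the explicit `pd.to_numeric` conversion is an extra step the notebook itself does not perform. Because K Nearest Neighbors is distance-based, a value such as -999999 places the affected rows far away from all ordinary samples.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; "?" marks a placeholder entry
toy = pd.DataFrame({
    "Bare Nuclei": pd.Series(["1", "10", "?", "4"], dtype=object),
    "Class": [2, 4, 2, 4],
})

# Count the placeholder entries before replacing them
n_missing = int((toy["Bare Nuclei"] == "?").sum())

# Same replacement as in the notebook, followed by an explicit numeric conversion
toy_clean = toy.replace("?", -999999)
toy_clean["Bare Nuclei"] = pd.to_numeric(toy_clean["Bare Nuclei"])

print(n_missing)
print(toy_clean["Bare Nuclei"].tolist())
```

Counting the placeholders first gives a quick sense of how many rows the outlier value will affect.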

Create the X and y datasets for training. Shuffle the data and hold out an out-of-sample test set (20% of the original data).

In [31]:
# Features: every column except the label (axis passed by keyword for pandas 2.0+)
X = np.array(df.drop(['Class'], axis=1))

In [32]:
y = np.array(df['Class'])

In [34]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)
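
What a shuffled 80/20 split does can be sketched in plain Python. The `shuffle_split` helper below is hypothetical and is not scikit-learn's implementation; it only mirrors the idea of shuffling row indices and holding out a fraction of them. The `seed` parameter is an addition for repeatability; the notebook's call does not fix a seed, so its split differs from run to run.

```python
import math
import random

def shuffle_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a `test_size` fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    # Number of held-out rows, rounded up
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Features and labels are selected with the same indices so they stay aligned
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: ten rows, labels 2 (benign) or 4 (malignant)
rows = [[i, i + 1] for i in range(10)]
labels = [2 if i < 5 else 4 for i in range(10)]

X_tr, X_te, y_tr, y_te = shuffle_split(rows, labels, test_size=0.2, seed=42)
print(len(X_tr), len(X_te))
```

The key property is that no row lands in both sets, so the test rows are genuinely unseen during training.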

Fit the classifier to the training data

In [54]:
clf = neighbors.KNeighborsClassifier(n_neighbors=5,n_jobs=3)
clf.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=3, n_neighbors=5, p=2,
           weights='uniform')
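
The settings shown in the output above (`metric='minkowski'` with `p=2`, `weights='uniform'`, `n_neighbors=5`) correspond to Euclidean distance with an unweighted majority vote among the five closest training points. A minimal pure-Python sketch of that idea follows; `knn_predict` and the toy data are hypothetical illustrations, not scikit-learn's implementation, which uses faster neighbor-search structures.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    ]
    # Keep the k closest points, comparing by distance only
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Unweighted vote: each neighbor counts once
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per sample, labels 2 (benign) or 4 (malignant)
toy_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
toy_y = [2, 2, 2, 4, 4, 4]

print(knn_predict(toy_X, toy_y, (2, 2), k=3))  # 2
print(knn_predict(toy_X, toy_y, (8, 8), k=3))  # 4
```

An odd `k` avoids tied votes when there are only two classes, which is one reason values such as 3 or 5 are common choices.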

Test Accuracy on out of sample data

In [55]:
accuracy = clf.score(X_test, y_test)
accuracy

0.9714285714285714
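
For a classifier, `score` reports mean accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below computes the same quantity by hand; `accuracy_sketch` and the label lists are hypothetical examples, not the notebook's actual predictions.

```python
def accuracy_sketch(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: 2 for benign, 4 for malignant; one of ten predictions is wrong
y_true = [2, 2, 4, 4, 2, 4, 2, 2, 4, 2]
y_pred = [2, 2, 4, 2, 2, 4, 2, 2, 4, 2]

print(accuracy_sketch(y_true, y_pred))  # 0.9
```

The reported value 0.9714285714285714 equals 136/140, which would fit a 140-sample test set with four misclassifications, though the notebook does not show the test-set size. Since the split is random, the figure will vary between runs.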

Example prediction: (2 for benign, 4 for malignant)

In [61]:
patient_sample = np.array([4,2,1,1,1,2,3,2,1])
patient_sample = patient_sample.reshape(1,-1)

prediction = clf.predict(patient_sample)
print("Result (2 for benign, 4 for malignant): ",prediction)

Result (2 for benign, 4 for malignant):  [2]


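The model therefore classifies this sample as benign (class 2). The 97% figure is the model's accuracy on the held-out test set; it describes overall performance, not the confidence of this single prediction.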