# Cancer cell classification using Scikit-learn

## Algorithms used: K-Nearest Neighbors, Gaussian Naive Bayes

For this project we will use scikit-learn and the Breast Cancer Wisconsin (Diagnostic) dataset. Because the dataset ships with scikit-learn, we can load it directly from the library.

In [129]:
# Only the dataset loader is needed here; the classifiers and
# cross-validation helpers are imported in the cell that uses them.
from sklearn.datasets import load_breast_cancer

Load the dataset, organize it into appropriately named variables, and inspect the number of samples and features available.

In [131]:
data = load_breast_cancer()

# Organize the data into targets, target names, features, and feature names
target_names = data['target_names']
targets = data['target']
features = data['data']
feature_names = data['feature_names']

print(features.shape)
print(target_names)
print(feature_names)

(569, 30)
['malignant' 'benign']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


With 569 observations and 30 features per observation, we have a decent amount of data to work with and can select an algorithm accordingly.

The organized data also shows the two target classes, malignant and benign, and the names of the features we are given to work with.
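
Accuracy is easier to interpret once the class balance is known, since a classifier that always predicts the majority class already reaches that class's share of the data. The following is a minimal sketch of how the balance could be checked; it assumes NumPy is available (the notebook itself does not import it) and reloads the dataset so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
targets = data['target']
target_names = data['target_names']

# Count how many samples carry each label (0 = malignant, 1 = benign)
class_counts = np.bincount(targets)

for name, count in zip(target_names, class_counts):
    print(f"{name}: {count} samples ({count / len(targets):.1%})")
```

The share of the larger class is a useful baseline to compare the cross-validated accuracies against.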

We will try two classification algorithms, K-Nearest Neighbors (KNN) and Gaussian Naive Bayes, and choose between them based on which one achieves the higher accuracy.

In [135]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB


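# KNN classifier that votes among the 10 nearest training samples
knn = KNeighborsClassifier(n_neighbors=10)
# 5-fold cross-validation; shuffling with a fixed seed keeps the folds reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=10)
accuracy_sum = 0
for train_index, test_index in kf.split(features):
    features_train, features_test = features[train_index], features[test_index]
    target_train, target_test = targets[train_index], targets[test_index]

    # Fit on the training folds and score on the held-out fold
    knn.fit(features_train, target_train)
    predictions = knn.predict(features_test)
    accuracy = accuracy_score(target_test, predictions)
    accuracy_sum += accuracy
# Average the per-fold accuracies over the number of folds
print("Mean Accuracy for KNN:", accuracy_sum / kf.get_n_splits())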

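# Gaussian Naive Bayes classifier with default settings
gnb = GaussianNB()
# Same fold definition and seed as above, so both models see identical splits
kf = KFold(n_splits=5, shuffle=True, random_state=10)
accuracy_sum = 0
for train_index, test_index in kf.split(features):
    features_train, features_test = features[train_index], features[test_index]
    target_train, target_test = targets[train_index], targets[test_index]

    # Fit on the training folds and score on the held-out fold
    gnb.fit(features_train, target_train)
    predictions = gnb.predict(features_test)
    accuracy = accuracy_score(target_test, predictions)
    accuracy_sum += accuracy
# Average the per-fold accuracies over the number of folds
print("Mean Accuracy for GNB:", accuracy_sum / kf.get_n_splits())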

Mean Accuracy for KNN: 0.9384722869119703
Mean Accuracy for GNB: 0.9384257102934328


The K-Nearest Neighbors classifier achieves a mean accuracy of about 93.85% across the five folds, while the Gaussian Naive Bayes classifier achieves about 93.84%. The gap is under 0.01 percentage points, so on this dataset the two classifiers perform essentially the same.
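
The manual cross-validation loops in this notebook can also be written more compactly with scikit-learn's `cross_val_score`. The sketch below is one possible way to do that, not the notebook's original method; the `models` and `results` dictionaries are names introduced here for illustration. It reuses the same `KFold` settings and also reports the standard deviation across folds, which helps judge whether a small difference in mean accuracy is meaningful.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
features, targets = data['data'], data['target']

# Same fold definition as in the manual loops
kf = KFold(n_splits=5, shuffle=True, random_state=10)

# Hypothetical container names, used only for this illustration
models = {
    "KNN": KNeighborsClassifier(n_neighbors=10),
    "GNB": GaussianNB(),
}

results = {}
for name, model in models.items():
    # cross_val_score fits a fresh clone of the model on each training split
    # and returns one accuracy value per held-out fold
    scores = cross_val_score(model, features, targets, cv=kf, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.4f}, std {scores.std():.4f}")
```

If the spread across folds is much larger than the gap between the two means, accuracy alone does not give a strong reason to prefer one classifier over the other.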