# ML | Cancer Cell Classification using scikit-learn

#### This is a question for the Udacity Nanodegree Accelerator Challenge

Here, we use scikit-learn's breast cancer dataset to build a classification model. The goal is to reach at least __95%__ accuracy on held-out test data.



In [30]:
# Import the dataset loader and the required libraries

from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

In [15]:
data = load_breast_cancer()

Let's inspect the dataset by separating it into features and labels, together with their names.

In [29]:
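# Unpack the parts of the dataset we want to inspect
target_names = data['target_names']    # class names
target = data['target']                # one integer label per sample
feature_names = data['feature_names']  # column names of the feature matrix
features = data['data']                # feature matrix, one row per sample

# Class names: a label of 0 means 'malignant' and 1 means 'benign'
print("The dataset has the following in 'target_names': \n" + str(target_names))

# Labels
print("\nBelow are the target values:\n")
print(target)

# Feature names
print("\nThe dataset has the following feature_names: \n" + str(feature_names))

# Feature values
print("\nThe features have the following values: \n")
print(features)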

The dataset has the following in 'target_names': 
['malignant' 'benign']

Below are the target values:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 
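
Printing the raw arrays shows what the data looks like, but a short summary is often easier to read. The sketch below is one possible way to check the dataset's size and class balance; the names `n_samples`, `n_features`, and `class_counts` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Shape of the feature matrix: (number of samples, number of features)
n_samples, n_features = data.data.shape
print(f"{n_samples} samples, {n_features} features")

# Count how many samples carry each label (0 = malignant, 1 = benign)
counts = np.bincount(data.target)
class_counts = {str(name): int(count) for name, count in zip(data.target_names, counts)}
print(class_counts)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading an accuracy figure later on.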

## Converting Dataset to Pandas DataFrame

To work conveniently with this data, we'll convert it to a pandas DataFrame in the following code cell.

In [33]:
# Build a DataFrame of the features, then append the labels as a 'target' column
cancer_data = pd.DataFrame(data.data, columns=data.feature_names)
cancer_data['target'] = pd.Series(data.target)
cancer_data.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |
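
As an alternative to assembling the DataFrame by hand, recent scikit-learn releases can return pandas objects directly. The sketch below assumes scikit-learn 0.23 or newer, where `load_breast_cancer` accepts `as_frame=True`; the names `bunch` and `cancer_frame` are chosen here for illustration only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects
# (assumption: scikit-learn >= 0.23 is installed)
bunch = load_breast_cancer(as_frame=True)

# bunch.frame holds the 30 feature columns followed by a 'target' column
cancer_frame = bunch.frame
print(cancer_frame.shape)

# Number of samples per label (0 = malignant, 1 = benign)
print(cancer_frame['target'].value_counts())
```

Either route should give the same table; building it manually, as this notebook does, also works on older scikit-learn versions.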


## Splitting The Dataset

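So far so good. Let us now split this dataset into training and test sets, holding out 25% of the samples for testing and fixing `random_state` so that the split is reproducible.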

In [51]:
from sklearn.model_selection import train_test_split

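# Features: every column except the label
X = cancer_data.drop(['target'], axis=1)
# Labels, kept as a single-column DataFrame
y = cancer_data[['target']]

# Hold out 25% of the rows for testing; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)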
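
To make the effect of the split concrete, the sketch below repeats it in a self-contained form and prints the resulting sizes and the share of benign samples in each part. The second call, with `stratify=y`, is an optional variant that is not used in this notebook; it is shown only as an assumption about how one might keep the class proportions equal in both parts.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
cancer_data = pd.DataFrame(data.data, columns=data.feature_names)
cancer_data['target'] = data.target

X = cancer_data.drop(columns=['target'])
y = cancer_data['target']

# Same settings as the notebook: 25% held out, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)

# Share of benign samples (label 1) in each part
print(round(y_train.mean(), 3), round(y_test.mean(), 3))

# Optional variant (not used in the notebook): stratify on the labels
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(round(ys_train.mean(), 3), round(ys_test.mean(), 3))
```

With 569 samples, a 25% test share gives 143 test rows and 426 training rows.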

## Defining and Fitting The Classifier

In [77]:
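from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Gaussian Naive Bayes; ravel() turns the single-column labels into a 1-D array
classifier1 = GaussianNB()
model1 = classifier1.fit(X_train, y_train.values.ravel())

# A small random forest: 10 shallow trees (depth 2).
# No random_state is set, so its predictions can differ slightly between runs.
classifier2 = RandomForestClassifier(max_depth = 2, n_estimators = 10)
model2 = classifier2.fit(X_train, y_train.values.ravel())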
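
Because a random forest draws random subsets of rows and features, two runs of the cell above may not produce the same model. The sketch below shows one way to make the forest repeatable by passing `random_state`; the seed value `0` and the names `forest_a`, `forest_b`, and `same_predictions` are arbitrary choices made for this illustration, not settings from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Two forests built with the same seed (0 is an arbitrary choice)
forest_a = RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0)
forest_b = RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0)
forest_a.fit(X_train, y_train)
forest_b.fit(X_train, y_train)

# With identical seeds and data, both forests make identical predictions
same_predictions = bool((forest_a.predict(X_test) == forest_b.predict(X_test)).all())
print(same_predictions)
```

Fixing the seed makes a reported accuracy reproducible, although the exact figure will then depend on the seed that was chosen.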

## Making Predictions

In [78]:
predictions1 = model1.predict(X_test)
predictions2 = model2.predict(X_test)
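
`predict` returns integer labels. As a rough sketch of how they might be made more readable, the block below maps the labels back to class names and also prints the class probabilities reported by the model. It refits a `GaussianNB` model so that it runs on its own; `predicted_names` and `probabilities` are names introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)
model = GaussianNB().fit(X_train, y_train)

# Integer labels (0 or 1), one per test sample
predictions = model.predict(X_test)

# Index the array of class names with the integer labels
predicted_names = data.target_names[predictions]
print(predicted_names[:5])

# One row per sample; columns follow model.classes_, i.e. [0, 1]
probabilities = model.predict_proba(X_test)
print(probabilities[:5].round(3))
```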

## Evaluating Predictions

In [91]:
from sklearn.metrics import accuracy_score

# accuracy_score returns a fraction in [0, 1]; multiply by 100 for a percentage
accuracy1 = accuracy_score(y_test, predictions1) * 100
accuracy2 = accuracy_score(y_test, predictions2) * 100

# The %.2f format already rounds to two decimals
print("The GaussianNB Classifier performed with %.2f%% accuracy!" % accuracy1)
print("The Random Forest Classifier performed with %.2f%% accuracy!" % accuracy2)

The GaussianNB Classifier performed with 95.80% accuracy!
The Random Forest Classifier performed with 95.10% accuracy!
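
Accuracy condenses all errors into one number. With 143 test samples, 95.80% corresponds to 137 correct predictions and 95.10% to 136. The sketch below, which goes slightly beyond what the notebook does, uses a confusion matrix to show which kind of mistake was made, since a missed malignant case is a different error from a false alarm. It refits `GaussianNB` so that it is self-contained; `matrix` and `accuracy` are names introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)
predictions = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes,
# both ordered [0 = malignant, 1 = benign]
matrix = confusion_matrix(y_test, predictions)
print(matrix)

# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = matrix.trace() / matrix.sum()
print(f"{accuracy:.4f}")

# Per-class precision and recall
print(classification_report(y_test, predictions, target_names=data.target_names))
```

The top-right cell of the matrix counts malignant samples predicted as benign, which is usually the error one most wants to keep low in this setting.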
