# Tutorial: classifying breast cancer tumors with scikit-learn
- Implement a simple machine learning algorithm in Python using scikit-learn.
- Using a dataset of breast cancer tumor information, train a Naive Bayes (NB) classifier that predicts whether a tumor is malignant or benign.

Source for tutorial: https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn

In [2]:
import sklearn

We will be working with the breast cancer dataset:
- It includes various measurements of breast cancer tumors, along with a classification label of malignant or benign for each one.
- The dataset has 569 *instances* (one per tumor) and 30 *attributes*, or features, such as the tumor's radius, texture, smoothness, and area.

The dataset ships with scikit-learn and can be loaded directly:

In [4]:
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()  # returns a Bunch, a scikit-learn object that works like a dictionary

Important dictionary keys: 
- classification label names (```target_names```), 
- the actual labels (```target```), 
- the attribute/feature names (```feature_names```), 
- and the attributes (```data```).
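To make the structure of the loaded object concrete, the following minimal sketch lists its keys and the shapes of the two main arrays. It only uses the keys named above; the assumption is that `data['data']` and `data['target']` are NumPy arrays, which is how scikit-learn returns them by default.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset; the result behaves like a dictionary
data = load_breast_cancer()

# List every available key, including the four described above
print(sorted(data.keys()))

# Shape of the feature matrix: (number of tumors, number of features)
print(data['data'].shape)

# Shape of the label array: one label per tumor
print(data['target'].shape)
```

Keys can also be read as attributes (for example `data.target_names`), but the square-bracket form matches ordinary dictionary access.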

In our case, potentially useful attributes include the radius, texture, and area of the tumor.  
Assign each of the keys above to its own variable (the values are NumPy arrays rather than plain Python lists):

In [29]:
# Dictionary items are accessed by key name inside square brackets
# (see https://www.w3schools.com/python/python_dictionaries.asp)
label_names = data['target_names']      # names of the two classes
labels = data['target']                 # class label (0 or 1) for each tumor
feature_names = data['feature_names']   # names of the 30 features, in column order
features = data['data']                 # feature values: one row per tumor

# Look at our data
print(label_names)        # index 0 is 'malignant' and index 1 is 'benign'
print(labels[0])          # the first tumor is labeled 0, i.e. malignant
print(feature_names)      # names of all the features
print(len(features))      # there are 569 tumors
print(len(features[0]))   # each tumor has 30 features
print(features[0])        # for this tumor, the mean radius is 1.799e+01 and the mean texture is 1.038e+01


['malignant' 'benign']
0
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
569
30
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
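Knowing how many tumors fall into each class makes the accuracy figure reported at the end easier to interpret. The sketch below counts the labels per class; it assumes NumPy is available (scikit-learn depends on it), and `counts` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# bincount tallies how often each integer label (0 or 1) occurs
counts = np.bincount(data['target'])

# Pair each class name with its count
for name, count in zip(data['target_names'], counts):
    print(name, count)
```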


Next, organize the data into sets. To evaluate how well the classifier performs, we need to test it on data it has not seen during training, so we split the data into a training set and a test set.
```sklearn``` provides ```train_test_split()``` for this.

In [14]:
from sklearn.model_selection import train_test_split
# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)

This function randomly splits the data, with the ```test_size``` parameter setting the fraction held out for testing; ```random_state=42``` fixes the random seed so the split is reproducible.  
In this example, the test set (```test```) holds 33% of the original dataset, and the remaining data (```train```) makes up the training set. ```train_labels``` and ```test_labels``` hold the corresponding labels.
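As a quick check on the split, the sketch below repeats it with the same parameters and prints the resulting shapes. With 569 instances and ```test_size=0.33```, scikit-learn rounds the test set up to 188 rows, which leaves 381 for training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Same split as above: 33% test, fixed seed for reproducibility
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42
)

# Each set keeps all 30 feature columns; only the rows are divided
print(train.shape, test.shape)

# The label arrays have one entry per row of the matching feature set
print(len(train_labels), len(test_labels))
```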


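Building and training the model:  
the Naive Bayes (NB) algorithm usually performs well on binary classification tasks. Here we use ```GaussianNB```, the variant that models each feature as normally distributed within each class.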

In [16]:
from sklearn.naive_bayes import GaussianNB

# Initialize our classifier/model
gnb = GaussianNB()

# Train our classifier/model
# fit() returns the fitted estimator itself, so model and gnb refer to the same object
model = gnb.fit(train, train_labels)

After training the model, we can make predictions.  
The ```predict()``` method returns an array with one predicted label (0 or 1) for each data instance in the test set.

In [17]:
# Make predictions
preds = gnb.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]
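Each predicted label comes from comparing the model's estimated class probabilities. As a minimal sketch, ```predict_proba()``` exposes those probabilities; the example rebuilds the split and model from the cells above so it runs on its own, and `probs` is a name introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42
)
gnb = GaussianNB().fit(train, train_labels)

# One row per test instance, one column per class.
# Columns follow the order of gnb.classes_, i.e. [0, 1].
probs = gnb.predict_proba(test[:5])
print(gnb.classes_)
print(probs)

# predict() picks the class with the highest probability in each row
print(gnb.predict(test[:5]))
```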


Now we evaluate the model's accuracy by comparing its predictions to the true labels, using ```accuracy_score()```

In [18]:
from sklearn.metrics import accuracy_score

# Evaluate accuracy
print(accuracy_score(test_labels, preds))

0.9414893617021277


Thus, the Naive Bayes classifier correctly labels about 94% of the tumors in the test set.
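Accuracy is simply the fraction of test instances whose predicted label matches the true label. The sketch below computes it by hand and also prints a confusion matrix, which breaks the errors down by class. It rebuilds the split and model so it runs on its own; `manual_accuracy` and `cm` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Fraction of predictions that equal the true label
manual_accuracy = np.mean(preds == test_labels)
print(manual_accuracy)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(test_labels, preds)
print(cm)
```

The diagonal of the matrix counts correct predictions, so dividing its sum by the number of test instances gives the same accuracy value.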