This notebook implements a simple machine learning workflow in Python using Scikit-learn. Using a dataset of breast cancer tumor information, it trains a Naive Bayes (NB) classifier that predicts whether a tumor is malignant or
benign.

1. Install Scikit-learn

In [2]:
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


2. Import module

In [3]:
import sklearn

3. Import the dataset. The dataset is the Breast Cancer
Wisconsin Diagnostic Database. It includes various information
about breast cancer tumors, as well as classification labels of malignant or
benign. The dataset has 569 instances, one per tumor, and
includes information on 30 attributes, or features, such as the radius of
the tumor, texture, smoothness, and area.

In [4]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
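
As a quick sanity check on the description above, the loaded object can be inspected directly. This is a minimal sketch; the variable names `n_samples` and `n_features` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The feature matrix has one row per tumor and one column per attribute.
n_samples, n_features = data["data"].shape
print(n_samples, n_features)

# There is one label per tumor and one name per feature column.
print(data["target"].shape)
print(len(data["feature_names"]))
```

If the dataset matches the description, the first line printed should show 569 rows and 30 columns.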

4. Create a variable for each important piece of information, and look at the resulting data.

In [5]:
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

In [6]:
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

['malignant' 'benign']
0
mean radius
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
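The output above shows that label 0 corresponds to 'malignant' and label 1 to 'benign'. Before training, it can help to see how many tumors fall into each class. The following is a small sketch using NumPy's `bincount`; the name `class_counts` is an illustrative choice, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count how many samples carry each integer label (0 and 1).
counts = np.bincount(data["target"])

# Pair each count with its human-readable class name.
class_counts = {
    str(name): int(count)
    for name, count in zip(data["target_names"], counts)
}
print(class_counts)
```

A noticeable imbalance between the two classes would be a reason to look beyond plain accuracy when evaluating the model.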


5. Organize the data into training and testing sets, with the testing set representing 20% of the original dataset.

In [8]:
from sklearn.model_selection import train_test_split

train, test, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size=0.20,
    random_state=42,
)
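
To confirm that the split behaves as intended, the shapes of the resulting arrays can be printed. This sketch repeats the loading and splitting steps so that it runs on its own; it assumes the same `test_size` and `random_state` as the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Same split as in the notebook: 20% held out, fixed seed for repeatability.
train, test, train_labels, test_labels = train_test_split(
    data["data"],
    data["target"],
    test_size=0.20,
    random_state=42,
)

# Rows are samples, columns are features.
print(train.shape, test.shape)
print(train_labels.shape, test_labels.shape)
```

Because `random_state` is fixed, rerunning the cell should give the same partition each time.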

6. Build and train the model

In [9]:
from sklearn.naive_bayes import GaussianNB

# Initialize the classifier
gnb = GaussianNB()

# Train the classifier
model = gnb.fit(train, train_labels)
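
After fitting, a Gaussian Naive Bayes model stores simple per-class statistics that can be inspected. The sketch below looks at the fitted attributes `classes_`, `class_prior_`, and `theta_` (the per-class feature means); it is only meant as a peek inside the trained model and is not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.20, random_state=42
)

gnb = GaussianNB()
gnb.fit(train, train_labels)

# Class labels seen during training.
print(gnb.classes_)

# Estimated prior probability of each class, taken from the training labels.
print(gnb.class_prior_)

# Mean of each feature within each class: one row per class, one column per feature.
print(gnb.theta_.shape)
```

The priors reflect the class proportions in the training set, and the means are what the classifier uses, together with per-class variances, to score a new tumor.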

7. Make predictions on the test set

In [10]:
preds = gnb.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0]
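
The predictions above are integer labels. As a hedged illustration, the sketch below maps them back to class names and also retrieves the class probabilities with `predict_proba`; the names `pred_names` and `probs` are illustrative and do not appear in the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.20, random_state=42
)

gnb = GaussianNB().fit(train, train_labels)
preds = gnb.predict(test)

# Translate integer labels (0, 1) into 'malignant' / 'benign'.
pred_names = data["target_names"][preds]
print(pred_names[:5])

# Estimated probability of each class for every test sample.
probs = gnb.predict_proba(test)
print(probs[:5])
```

Each row of `probs` sums to one, with the columns ordered as in `gnb.classes_`.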


8. Evaluate the model

In [11]:
from sklearn.metrics import accuracy_score

# Evaluate accuracy
print(accuracy_score(test_labels, preds))

0.9736842105263158
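
Accuracy alone does not show which kind of mistake the model makes. One possible follow-up, sketched here under the same split and model settings as above, is a confusion matrix, which separates malignant tumors predicted as benign from benign tumors predicted as malignant. The names `cm` and `acc` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.20, random_state=42
)

gnb = GaussianNB().fit(train, train_labels)
preds = gnb.predict(test)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(test_labels, preds)
print(cm)

# The diagonal holds the correct predictions, so its share equals the accuracy.
acc = accuracy_score(test_labels, preds)
print(acc)
```

In a medical setting the off-diagonal entries are usually not equally costly, so looking at them separately can matter more than the single accuracy figure.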
