# Classification

We will use a decision tree classifier to classify breast tumors as malignant or benign, using the Breast Cancer Wisconsin Dataset from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

First, let's import some modules and functions.



In [1]:
import numpy as np
import pandas as pd
np.random.seed(42)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

The Breast Cancer Wisconsin Dataset is included with scikit-learn, so we don't need to download it.

In [2]:
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

Split the data into a training set and a test set, with 20% in the test set. The training data is used to build the model, and the test data is used to evaluate the model. It is important to remember that the data used to train the model should not be used to evaluate the model, because the training error will underestimate the true error rate on new data.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [4]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Now we evaluate the model using the test data. The `score` method returns the accuracy, which is about 93% for our model.

In [5]:
model.score(X_test, y_test)

0.9298245614035088
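
The earlier point that training error underestimates the true error can be checked directly. The following is a minimal, self-contained sketch rather than part of the analysis above: it reloads the data, uses `random_state=0` (an arbitrary assumption, chosen only for reproducibility), and compares accuracy on the two sets. The names `tree`, `train_acc`, and `test_acc` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()

# Hold out 20% of the samples; random_state=0 is an arbitrary choice
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Accuracy on data the tree has already seen vs. data it has not
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print("train accuracy = %.3f" % train_acc)
print("test accuracy  = %.3f" % test_acc)
```

A fully grown decision tree usually fits its training data almost perfectly, so the training accuracy is an optimistic estimate of how the model will do on new tumors.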

We use the `predict` method to make predictions on the test data.

In [6]:
y_pred = model.predict(X_test)

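The confusion matrix shows how many benign tumors were classified as malignant (false positives) and how many malignant tumors were classified as benign (false negatives). Rows are the true classes and columns are the predicted classes, in label order (0 = malignant, 1 = benign).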

In [7]:
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

[[39  4]
 [ 4 67]]


This matrix would be easier to interpret if it were labeled, don't you think?


In [8]:
labels = breast_cancer.target_names
pd.DataFrame(conf_matrix, columns=labels, index=labels)

           malignant  benign
malignant         39       4
benign             4      67


The 4 in the top right is the number of false negatives -- malignant tumors that were misclassified as benign.

The 4 in the bottom left is the number of false positives -- benign tumors that were misclassified as malignant.
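
The four counts can also be read off the matrix in code. This sketch hard-codes the matrix shown above and assumes malignant (label 0) is the positive class, so the flattened order is TP, FN, FP, TN. That differs from the `tn, fp, fn, tp` order that applies when label 1 is the positive class.

```python
import numpy as np

# Confusion matrix from above: rows are true labels, columns are predictions,
# in label order 0 = malignant, 1 = benign.
conf_matrix = np.array([[39, 4],
                        [4, 67]])

# With malignant (label 0) as the positive class, row-major order is:
# true positives, false negatives, false positives, true negatives
TP, FN, FP, TN = conf_matrix.ravel()

# Accuracy is the share of correct predictions (the diagonal)
accuracy = (TP + TN) / conf_matrix.sum()
print("TP=%d FN=%d FP=%d TN=%d" % (TP, FN, FP, TN))
print("accuracy = %.4f" % accuracy)
```

The accuracy computed this way, 106 correct out of 114, agrees with the value returned by `model.score`.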


## Exercises:
1. Use the confusion matrix to calculate precision and recall.
2. Use the `classification_report` function to find precision and recall.
3. Try a random forest classifier instead of a decision tree.

Precision answers this question: If a tumor is diagnosed as malignant, what is the probability that it is actually malignant?

    Precision = (True Positive) / (True Positive + False Positive)

Recall answers this question: If a tumor is actually malignant, what is the probability that it will be diagnosed as malignant?

    Recall = (True Positive) / (True Positive + False Negative)

In [9]:
TP = 39; FN = 4; FP = 4; TN = 67
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('Precision = %.2f' % precision)
print('Recall = %.2f' % recall)

Precision = 0.91
Recall = 0.91
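
scikit-learn also provides `precision_score` and `recall_score`. By default they treat label 1 (benign here) as the positive class, so `pos_label=0` is needed to get the values for malignant. The sketch below rebuilds label vectors that reproduce the decision tree's confusion matrix, so that it runs on its own; `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and `y_pred`.

```python
from sklearn.metrics import precision_score, recall_score

# Stand-in labels reproducing the matrix [[39, 4], [4, 67]]
# (0 = malignant, 1 = benign)
y_true_demo = [0] * 43 + [1] * 71
y_pred_demo = [0] * 39 + [1] * 4 + [0] * 4 + [1] * 67

# pos_label=0 makes malignant the positive class
precision_malignant = precision_score(y_true_demo, y_pred_demo, pos_label=0)
recall_malignant = recall_score(y_true_demo, y_pred_demo, pos_label=0)
print('Precision = %.2f' % precision_malignant)
print('Recall = %.2f' % recall_malignant)
```

With the real split, the same calls would take `y_test` and `y_pred` in place of the stand-ins.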


In [10]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.91      0.91      0.91        43
          1       0.94      0.94      0.94        71

avg / total       0.93      0.93      0.93       114
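
The rows labeled 0 and 1 in the report can be given readable names. `classification_report` expects the true labels first and the predictions second, and accepts a `target_names` argument listed in label order. The following is a self-contained sketch; the split uses `random_state=0`, an arbitrary assumption, so its numbers will not match the report above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)
pred = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

# True labels first, predictions second; target_names follows label order 0, 1
report = classification_report(y_te, pred, target_names=data.target_names)
print(report)
```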



The F1-score is the harmonic mean of precision and recall. For the malignant class (label 0), precision and recall are both 0.91, so the F1-score is also 0.91.

In [11]:
def harmonic_mean(x, y):
    return 2/(1/x + 1/y)

print("F1-score = %.2f" % harmonic_mean(0.91, 0.91))

F1-score = 0.91
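
When precision and recall differ, the harmonic mean sits closer to the smaller of the two than the arithmetic mean does. The sketch below uses made-up values (0.95 and 0.50 are hypothetical, not results from this notebook) and the equivalent closed form 2PR / (P + R); `f1_from` is an illustrative helper name.

```python
def f1_from(precision, recall):
    # Harmonic mean written in its usual closed form: 2PR / (P + R)
    return 2 * precision * recall / (precision + recall)

# Hypothetical values, chosen only to show the effect of an imbalance
p, r = 0.95, 0.50
f1 = f1_from(p, r)
arithmetic = (p + r) / 2
print("F1 = %.3f, arithmetic mean = %.3f" % (f1, arithmetic))
```

This is why a model cannot earn a high F1-score by doing well on only one of the two measures.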


Let's try a random forest model instead of a decision tree.

In [12]:
random_forest_model = RandomForestClassifier(random_state=42)
random_forest_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
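
The output above shows `n_estimators=10`, the default in the scikit-learn version used here. Since scikit-learn 0.22 the default is 100, so a newer installation will build a larger forest and may give slightly different numbers. A minimal sketch that pins the value explicitly, assuming an arbitrary `random_state=0` for the split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

# Set n_estimators explicitly so the forest size does not depend on the
# installed scikit-learn version
forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_tr, y_tr)

n_trees = len(forest.estimators_)
forest_acc = forest.score(X_te, y_te)
print("trees in forest = %d" % n_trees)
print("test accuracy = %.3f" % forest_acc)
```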

Our random forest model is almost 96% accurate on the test set.

In [13]:
random_forest_model.score(X_test, y_test)

0.956140350877193

In [14]:
rf_pred = random_forest_model.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, rf_pred),
             columns=labels, index=labels)

           malignant  benign
malignant         40       3
benign             2      69


We now have 3 false negatives and 2 false positives.

In [15]:
precision = 40 / (40 + 2)
recall = 40 / (40 + 3)
print('Precision = %.2f' % precision)
print('Recall = %.2f' % recall)

Precision = 0.95
Recall = 0.93


In [16]:
print(classification_report(y_test, rf_pred))

             precision    recall  f1-score   support

          0       0.95      0.93      0.94        43
          1       0.96      0.97      0.97        71

avg / total       0.96      0.96      0.96       114

