# Evaluating Classification Models
This notebook replicates the leukaemia classifier example from *Data Science from Scratch* by Joel Grus (2015), O'Reilly Media, Inc. The classifier achieves high accuracy despite being utterly useless: it never identifies a single case of leukaemia.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Generate the dataset

In [2]:
# Define constants for number of rows and probabilities
N_ROWS = 1_000_000
PROB_LUKE = 5 / 1000         # 5/1000 chance of name being Luke
PROB_LEUKAEMIA = 14 / 1000   # 14/1000 chance of having leukaemia

# Ensure repeatable results
np.random.seed(42)

In [3]:
def generate_dataframe():
    """
    Generate a DataFrame with N_ROWS, with two Boolean columns:
      - 'Name is Luke'      : True with probability PROB_LUKE
      - 'has leukaemia'     : True with probability PROB_LEUKAEMIA
    """
    # Draw uniform random numbers and compare to thresholds
    luke_flags = np.random.rand(N_ROWS) < PROB_LUKE
    leukaemia_flags = np.random.rand(N_ROWS) < PROB_LEUKAEMIA

    return pd.DataFrame({
        'Name is Luke': luke_flags,
        'has leukaemia': leukaemia_flags
    })

In [4]:
df = generate_dataframe()

## Examine the data

In [5]:
df.sample(10)

|        |Name is Luke |has leukaemia |
|--------|-------------|--------------|
|174677  |False        |False         |
|350388  |False        |False         |
|245525  |False        |False         |
|765851  |False        |False         |
|949508  |False        |False         |
|650972  |False        |False         |
|686008  |False        |False         |
|282451  |False        |False         |
|260213  |False        |False         |
|770050  |False        |False         |


In [6]:
df['Name is Luke'].value_counts()

Name is Luke
False    994988
True       5012
Name: count, dtype: int64

In [7]:
df['has leukaemia'].value_counts()

has leukaemia
False    985880
True      14120
Name: count, dtype: int64

## Create training and test sets

In [8]:
X = df[['Name is Luke']].astype(int)
y = df['has leukaemia'].astype(int)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2
)

## Build a decision tree

In [10]:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

The tree printed below shows that the model has learned to give everyone the all clear. Only about 1.4% of the people in the dataset have leukaemia, and being called Luke carries no information about the disease, so predicting "all clear" is almost always correct. It is also not useful: the model never recommends treatment for anyone.

In [11]:
text_tree = export_text(
    clf,
    feature_names=X.columns,
    class_names=['All clear', 'Treatment recommended']
)

print(text_tree)

|--- Name is Luke <= 0.50
|   |--- class: All clear
|--- Name is Luke >  0.50
|   |--- class: All clear
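
Both leaves predict "All clear" because a leaf predicts its majority class, and leukaemia is rare on either side of the split. The following is a minimal sketch of that check, not part of the original notebook. It assumes a fresh random generator and a smaller sample (`rng` and `n_rows` are illustrative names), so its exact figures will differ from the cells above.

```python
import numpy as np

# Assumption: a new generator and a smaller sample than the notebook uses,
# so the counts here will not match the notebook's DataFrame exactly.
rng = np.random.default_rng(0)
n_rows = 200_000
is_luke = rng.random(n_rows) < 5 / 1000          # same probability as PROB_LUKE
has_leukaemia = rng.random(n_rows) < 14 / 1000   # same probability as PROB_LEUKAEMIA

# Leukaemia rate within each branch of the tree
rate_if_luke = has_leukaemia[is_luke].mean()
rate_if_not_luke = has_leukaemia[~is_luke].mean()

print(f"Luke:     {rate_if_luke:.4f}")
print(f"Not Luke: {rate_if_not_luke:.4f}")
```

Since the two columns are drawn independently, both rates should sit near 14 / 1000, far below the 50% a leaf would need before it predicted "Treatment recommended".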



## Evaluate the model

To evaluate the model, we first get the predictions it makes for the test set.

In [12]:
y_pred = clf.predict(X_test)

### `accuracy_score`

Now measure the accuracy. At over 98% it looks excellent, yet the model gives everyone the all clear and never identifies a single case of leukaemia. A high score for such a terrible model tells us that accuracy alone is not an appropriate way to evaluate it.

In [13]:
accuracy_score(y_test, y_pred)

0.986115
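
The score above is simply the share of people in the test set who do not have leukaemia. As a rough check, the sketch below recomputes it from the test-set class counts that appear in the classification report later in this notebook. The variable names are illustrative and not part of the original code.

```python
# Test-set class counts, copied from the "support" column of the
# classification report in this notebook
n_negative = 197_223  # people without leukaemia
n_positive = 2_777    # people with leukaemia

# A model that always predicts "no leukaemia" is right on every negative
# case and wrong on every positive case
baseline_accuracy = n_negative / (n_negative + n_positive)
print(baseline_accuracy)
```

Any classifier evaluated on data this imbalanced has to beat this do-nothing baseline before its accuracy means anything.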

### `classification_report`

The classification report goes into more detail about what's going on. When the model predicted no leukaemia, it was correct over 98% of the time (a precision of 0.9861 for class 0). But of the 2,777 people in the test set who have leukaemia, it detected none (a recall of 0 for class 1).

In [14]:
print(classification_report(
    y_test,
    y_pred,
    digits=4,
    zero_division=0.0))

              precision    recall  f1-score   support

           0     0.9861    1.0000    0.9930    197223
           1     0.0000    0.0000    0.0000      2777

    accuracy                         0.9861    200000
   macro avg     0.4931    0.5000    0.4965    200000
weighted avg     0.9724    0.9861    0.9792    200000
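
The "macro avg" recall of 0.5 in the report is a more honest summary than accuracy here, because it weights both classes equally. One way to get that figure directly is scikit-learn's `balanced_accuracy_score`. The sketch below is an illustration only: it rebuilds stand-in label arrays from the counts in the report (`y_true` and `y_hat` are assumed names) rather than reusing `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Stand-in arrays rebuilt from the report's support counts (an assumption
# made so this cell runs on its own; the notebook would use y_test, y_pred)
y_true = np.array([0] * 197_223 + [1] * 2_777)
y_hat = np.zeros_like(y_true)  # the model predicts "no leukaemia" for everyone

# Recall for the positive class: share of leukaemia cases the model found
positive_recall = recall_score(y_true, y_hat)

# Balanced accuracy: the unweighted mean of the recall for each class
balanced = balanced_accuracy_score(y_true, y_hat)

print(positive_recall, balanced)
```

A balanced accuracy of 0.5 on a two-class problem is what a constant prediction achieves, which matches what the tree is doing.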



### `confusion_matrix`

The confusion matrix shows the counts of correct and incorrect classifications, with the actual classes in rows and the predicted classes in columns:

|             |No leukaemia predicted |Leukaemia predicted |
|-------------|-----------------------|--------------------|
|No leukaemia |True negatives         |False positives     |
|Leukaemia    |False negatives        |True positives      |

In this case, the consequences of being wrong depend on whether the person does or does not have leukaemia: a missed case (false negative) and a false alarm (false positive) carry very different costs. A simple accuracy calculation, i.e. (true positives + true negatives) / total predictions, does not take this into account.

In [15]:
print(confusion_matrix(y_test, y_pred))

[[197223      0]
 [  2777      0]]
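
To connect the printed matrix to the table above, the four cells can be unpacked by name. This is a small sketch that copies the matrix values printed above into a literal array (`cm` is an assumed name) so that it runs on its own.

```python
import numpy as np

# Values copied from the confusion matrix printed above
cm = np.array([[197_223, 0],
               [2_777, 0]])

# For a 2x2 matrix, ravel() reads row by row:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # (true positives + true negatives) / total
recall = tp / (tp + fn)           # share of leukaemia cases detected
miss_rate = fn / (fn + tp)        # share of leukaemia cases missed

print(f"accuracy={accuracy:.6f} recall={recall:.1f} miss_rate={miss_rate:.1f}")
```

Precision for the leukaemia class is left out on purpose: with no positive predictions it would be 0 / 0, which is why the report above was called with `zero_division=0.0`.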
