**Decision Boundaries**

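The value of k controls the complexity of our k-NN classifier, i.e. the amount of “corner case” rules it encodes: a small k produces intricate decision boundaries that follow individual training points, while a large k produces smoother ones. Note that k is a hyperparameter since we set it before applying the algorithm. Hence, in practice we tune it, for example using grid search.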
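To make the idea of tuning k concrete, here is a minimal sketch that tries a few values of k on a held-out validation set and keeps the best one. It is a hand-rolled sweep in plain Python, not Scikit-learn's grid search, and the 1-D toy data, the function names `knn_predict` and `validation_accuracy`, and the candidate values are all assumptions made for illustration.

```python
from collections import Counter


def knn_predict(train_points, train_labels, x, k):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Rank training indices by absolute distance to x (1-D toy setting)
    order = sorted(range(len(train_points)), key=lambda i: abs(train_points[i] - x))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def validation_accuracy(train_points, train_labels, val_points, val_labels, k):
    """Proportion of validation points classified correctly with a given k."""
    hits = sum(
        knn_predict(train_points, train_labels, x, k) == y
        for x, y in zip(val_points, val_labels)
    )
    return hits / len(val_labels)


# Hypothetical toy data: two noisy training labels sit at 4.0 and 6.0
train_points = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
train_labels = ["a", "a", "a", "b", "a", "b", "b", "b"]
val_points = [1.5, 3.5, 4.5, 5.5, 6.5, 8.5]
val_labels = ["a", "a", "a", "b", "b", "b"]

# Evaluate each candidate value of k and keep the best one
candidate_ks = [1, 3, 5]
scores = {
    k: validation_accuracy(train_points, train_labels, val_points, val_labels, k)
    for k in candidate_ks
}
best_k = max(scores, key=scores.get)

print(scores)
print("Best k:", best_k)
```

With k=1 the classifier follows the two noisy training points and makes mistakes around them, while larger values of k smooth them out. In practice, a tool such as Scikit-learn's GridSearchCV performs this kind of search with cross-validation instead of a single validation set.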

In the next units, we will work on the heart disease diagnosis dataset. We will first see how to set a baseline for this classification task and then try with k-NN using Scikit-learn.

**Example**

The goal is to diagnose the presence (or absence) of heart disease based on a set of observations for each patient:

* age - the age of the patient
* trestbps - the resting blood pressure in mm Hg
* chol - the amount of cholesterol in mg/dl
* thalach - maximum heart rate during the tests
* oldpeak - another measure obtained using an electrocardiogram
* ca - the number of major vessels colored by fluoroscopy

Our goal is to use these features to predict a target disease variable.

In [1]:
import pandas as pd

# Load data
data_df = pd.read_csv("c4_heart-numerical.csv")

# First five rows
data_df.head()

   age  trestbps  chol  thalach  oldpeak  ca   disease
0   63       145   233      150      2.3   0   absence
1   67       160   286      108      1.5   3  presence
2   67       120   229      129      2.6   2  presence
3   37       130   250      187      3.5   0   absence
4   41       130   204      172      1.4   0   absence


We can see that all variables are numerical except the target, which has two possible values: presence and absence. We can verify this using the NumPy unique() function.

In [2]:
import numpy as np

# Create X/y arrays
X = data_df.drop("disease", axis=1).values
y = data_df.disease.values
print("X:", X.shape, X.dtype)
print("y:", y.shape, y.dtype)

X: (303, 6) float64
y: (303,) object


In [3]:
# Print labels
labels = np.unique(y)
print("Labels:", labels)

Labels: ['absence' 'presence']


Let’s now split the data into train/test sets. This time, we will use a 70-30 train/test split.

In [4]:
from sklearn.model_selection import train_test_split

# Split data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

print("Train set:", X_tr.shape, y_tr.shape)
print("Test set:", X_te.shape, y_te.shape)


Train set: (212, 6) (212,)
Test set: (91, 6) (91,)


The code to load and split the dataset is similar to what we saw for regression tasks. The only difference is that the output vector y contains categorical values instead of continuous ones.

Before fitting k-NN, let’s first define the baseline.

**BASELINE**

The “most frequent” baseline: We saw that it makes sense to predict the mean or median of the target variable for regression tasks. In the classification case, we can predict the most frequent category. In our example, there are only two categories and the most frequent one is 'absence'.



In [5]:
# Count the number of entries labeled with 'absence'
n_absence = np.sum(y_tr == "absence")

print("Total absence:", n_absence)

Total absence: 117


We use NumPy sum() on an array of boolean values to count the number of True entries. We get that 117 of the 212 patients from the training set are labeled with 'absence'. In other words, the proportion of patients not diagnosed with heart disease is around 55%.

In [6]:
# Probability of 'absence'
p_absence = n_absence / len(y_tr)

print("Probability of absence: {:.2f}".format(p_absence))

Probability of absence: 0.55


Note that we used the train_test_split() function which shuffles the data before splitting. Hence, the train and test sets are good samples of the data, and they should have approximately the same proportion of patients in each category.

In [7]:
# On the test set
p_absence_te = np.sum(y_te == "absence") / len(y_te)

print("Probability of absence: {:.2f}".format(p_absence_te))


Probability of absence: 0.52


The number 0.52 is the accuracy of the baseline model: since it always predicts 'absence', it is correct exactly for the test patients labeled 'absence'. The accuracy is the proportion of data points from the test set that the model classified correctly. In classification tasks, this is often what we want to maximize. Similarly, the error rate is the proportion of data points misclassified. In our case, the error rate of the “most frequent” baseline is 0.48.

In general, the accuracy of the “most frequent” baseline should be equal to the percentage of samples in the most frequent category. Hence, it’s only better than random guessing when the distribution of the target variable is not uniform. In our case, the dataset contains approximately the same number of patients labeled with absence and presence. Hence, its accuracy is very close to 50%, which corresponds to random guessing for two classes.
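The logic of the “most frequent” baseline can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the label lists `y_train` and `y_test` below are made-up toy data, not the heart disease dataset.

```python
from collections import Counter

# Hypothetical toy labels (not the heart disease data)
y_train = ["absence", "presence", "absence", "absence", "presence"]
y_test = ["presence", "absence", "absence", "presence"]

# "Fit": find the most frequent category in the training labels
most_frequent_label, _ = Counter(y_train).most_common(1)[0]

# "Predict": always answer with that category
predictions = [most_frequent_label] * len(y_test)

# Accuracy is the proportion of correct predictions, error rate is the rest
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
error_rate = 1 - accuracy

print("Most frequent label:", most_frequent_label)
print("Accuracy: {:.2f}, error rate: {:.2f}".format(accuracy, error_rate))
```

Here the test accuracy equals the share of 'absence' labels in `y_test`, which is the general property described above.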

**Multiple classes**
The code from above doesn’t scale well to target variables with many categories because we have to count the number of entries in each one manually. One solution is to use the value_counts() method of a Pandas Series, which groups the entries by values and counts the number of entries in each group.

In [8]:
# Compute distribution using Pandas
pd.Series(y_tr).value_counts() / len(y_tr)


absence     0.551887
presence    0.448113
dtype: float64

**OPTION** In this code, we normalize the counts by the total number of entries to get a probability, but we could also pass normalize=True to value_counts().

In [9]:
# Compute distribution using Pandas
pd.Series(y_tr).value_counts(normalize=True)

absence     0.551887
presence    0.448113
dtype: float64

Scikit-learn provides a DummyClassifier() to compute baselines for classification tasks. We can set its strategy parameter to 'most_frequent' to get the baseline from above.



In [10]:
from sklearn.dummy import DummyClassifier

# Create the dummy classifier
dummy = DummyClassifier(strategy="most_frequent")

In [11]:
# Let's fit and evaluate our baseline estimator

# Fit it (the baseline ignores the input features, so X is set to None here)
dummy.fit(None, y_tr)

# Compute test accuracy
accuracy = dummy.score(None, y_te)
print("Accuracy: {:.2f}".format(accuracy))

Accuracy: 0.52


One important difference with regression: the score() method of Scikit-learn classifiers returns the accuracy instead of the R² coefficient.

**Confusion matrix**
So far in this unit, we reported the accuracy score of our predictions, which simply corresponds to the proportion of predictions that match the true label.

In [12]:
# "Most-frequent" predictions
y_pred_absence = dummy.predict(X_te)
print("Predicted:", y_pred_absence[:5], "..")
print("True labels:", y_te[:5], "..")

Predicted: ['absence' 'absence' 'absence' 'absence' 'absence'] ..
True labels: ['absence' 'absence' 'presence' 'absence' 'presence'] ..


However, it’s usually helpful to investigate where errors occur. This can be done by computing the confusion matrix.



In [13]:
from sklearn.metrics import confusion_matrix

# Confusion matrix
matrix = confusion_matrix(y_true=y_te, y_pred=y_pred_absence)
print(matrix)

[[47  0]
 [44  0]]


This is simply a frequency table of true classes versus predictions: each row corresponds to a true class, each column to a predicted class, and each cell counts how many test points fall into that combination.
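To see how such a frequency table is built, here is a minimal sketch that counts (true, predicted) pairs by hand. The toy label lists are assumptions made for illustration; rows follow the true class and columns the predicted class, as in the matrix printed above.

```python
# Hypothetical toy labels (not the heart disease data)
labels = ["absence", "presence"]
y_true = ["absence", "absence", "presence", "absence", "presence"]
y_pred = ["absence", "presence", "presence", "absence", "absence"]

# Map each label to its row/column position
index = {label: i for i, label in enumerate(labels)}

# Start from a table of zeros and count each (true, predicted) pair
matrix = [[0] * len(labels) for _ in labels]
for true_label, predicted_label in zip(y_true, y_pred):
    matrix[index[true_label]][index[predicted_label]] += 1

for label, row in zip(labels, matrix):
    print("true: {:<9}".format(label), row)
```

The diagonal cells count the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of points.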

The confusion matrix is easier to read as a DataFrame.

In [14]:
# Confusion matrix as a DataFrame
matrix_df = pd.DataFrame(
    matrix,
    columns=["pred: absence", "pred: presence"],
    index=["true: absence", "true: presence"],
)

matrix_df

                pred: absence  pred: presence
true: absence              47               0
true: presence             44               0


In our case, we can see that our “most-frequent” baseline always predicts “absence”, which is correct 47 times and incorrect 44 times.

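From this matrix, we can derive two important metrics: the precision and recall of our classifier. Those scores are defined in terms of the number of true positives (tp), false positives (fp) and false negatives (fn), where one class, here 'presence', is treated as the positive one.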

Intuitively, the precision answers “How many times are we correct when we predict positive?”. The formula is simply

**precision = tp/(tp+fp)**

We can compute it in Scikit-learn using precision_score().


In [16]:
from sklearn.metrics import precision_score

precision = precision_score(y_true=y_te, y_pred=y_pred_absence, pos_label="presence")
# Emits "UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples."

  _warn_prf(average, modifier, msg_start, len(result))


You should get a warning saying “UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.”. This is because the precision is not defined here: our “most-frequent” baseline never predicts ‘presence’, so tp + fp = 0 and the formula divides by zero.

In [17]:
# Let's check the value of precision
precision

0.0

In [18]:
# However, we can compute the precision of an "always predicts presence" baseline
y_pred_presence = np.full_like(y_te, fill_value="presence")
precision_score(y_true=y_te, y_pred=y_pred_presence, pos_label="presence")  # ~ 0.48


0.4835164835164835
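Both results can be reproduced by hand from the counts of the test set (47 patients labeled 'absence' and 44 labeled 'presence'). The sketch below is an illustration only: the helper `precision_from_counts` and its `zero_division` argument are names assumed here, chosen to mimic the behavior described by the warning.

```python
def precision_from_counts(tp, fp, zero_division=0.0):
    """Precision = tp / (tp + fp), with a fallback when nothing is predicted positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # Ill-defined case: the classifier never predicted the positive class
        return zero_division
    return tp / predicted_positive


# "Most-frequent" baseline: never predicts 'presence', so tp = fp = 0
precision_most_frequent = precision_from_counts(tp=0, fp=0)

# "Always predicts presence" baseline: 44 true positives, 47 false positives
precision_always_presence = precision_from_counts(tp=44, fp=47)

print(precision_most_frequent)
print(round(precision_always_presence, 4))
```

The second value is 44/91, i.e. the share of 'presence' patients in the test set.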

Intuitively, the recall measures “How many times do we predict positive when it is?”. The formula is simply

**recall = tp/(tp+fn)**

We can compute it in Scikit-learn using recall_score().

In [19]:
from sklearn.metrics import recall_score

recall_score(y_true=y_te, y_pred=y_pred_absence, pos_label="presence")

0.0

Our “most-frequent” baseline never predicts ‘presence’ when the disease is present, so the recall score is 0.0 in our case.

However, we can get a score of 1.0 with the “always predicts presence” baseline!

In [20]:
# Recall of the "always predicts presence" baseline
recall_score(y_true=y_te, y_pred=y_pred_presence, pos_label="presence")  # 1.0

1.0

**F1 SCORE**
The F1 score is a way to combine the precision and recall metrics into a single score. The formula is

**f1_score = 2 * (precision * recall) / (precision + recall)**

Again, we can compute it in Scikit-learn using f1_score().

In [21]:
from sklearn.metrics import f1_score

f1_score(y_true=y_te, y_pred=y_pred_presence, pos_label="presence")  # ~0.65

0.6518518518518518
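As a worked check, the recall and F1 score of the “always predicts presence” baseline can be recomputed from its counts on the test set: 44 true positives, 47 false positives and no false negatives. This is a plain-Python sketch of the formulas above; the comparison with the arithmetic mean is added here only to illustrate how the F1 score penalizes an imbalance between precision and recall.

```python
# Counts for the "always predicts presence" baseline on the test set
tp, fp, fn = 44, 47, 0

precision = tp / (tp + fp)  # 44/91
recall = tp / (tp + fn)     # 44/44

# F1 score: harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)

# For comparison only: the arithmetic mean is more forgiving
arithmetic_mean = (precision + recall) / 2

print("Recall: {:.2f}".format(recall))
print("F1 score: {:.4f}".format(f1))
print("Arithmetic mean: {:.4f}".format(arithmetic_mean))
```

Even though the recall is perfect, the low precision pulls the F1 score down to about 0.65, below the arithmetic mean of the two scores.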

**Classification report**
It’s common to print those metrics when analyzing the performance of classifiers. For this reason, Scikit-learn provides a classification_report() function that gives the different scores.



In [22]:
from sklearn.metrics import classification_report

report = classification_report(y_true=y_te, y_pred=y_pred_presence, zero_division=0.0)
print(report)


              precision    recall  f1-score   support

     absence       0.00      0.00      0.00        47
    presence       0.48      1.00      0.65        44

    accuracy                           0.48        91
   macro avg       0.24      0.50      0.33        91
weighted avg       0.23      0.48      0.32        91



The table shows the scores obtained when each class, absence or presence, is in turn treated as the positive class.

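In this table, support corresponds to the number of test points in each class. Macro and weighted averages refer to different ways to combine the per-class results when there are multiple classes: the macro average is the plain mean of the per-class scores, while the weighted average weights each class by its support.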
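The two averages in the precision column of the report can be recomputed by hand. The sketch below uses the per-class precision and support values from the table; the dictionary layout and variable names are assumptions made for illustration.

```python
# Per-class precision and support, taken from the report above
per_class = {
    "absence": {"precision": 0.0, "support": 47},
    "presence": {"precision": 44 / 91, "support": 44},
}

# Macro average: plain mean of the per-class scores
macro_avg = sum(c["precision"] for c in per_class.values()) / len(per_class)

# Weighted average: each class counts in proportion to its support
total_support = sum(c["support"] for c in per_class.values())
weighted_avg = (
    sum(c["precision"] * c["support"] for c in per_class.values()) / total_support
)

print("Macro avg: {:.2f}".format(macro_avg))
print("Weighted avg: {:.2f}".format(weighted_avg))
```

The two values differ slightly because the 'absence' class, which has the lower precision, has more test points and therefore more influence on the weighted average.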