# Fine-tuning your model

We've evaluated the performance of our k-NN classifier based on its accuracy (the fraction of correctly classified samples). However, accuracy is not always an informative metric, especially when the classes are imbalanced. Suppose you are detecting spam and only 1% of emails are spam, while the remaining 99% are real. A classifier that labels every email as real would be 99% accurate, which would normally be considered good, yet it would not catch a single spam email, which is the whole purpose of the classifier.
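To make the imbalance problem concrete, here is a minimal sketch using a made-up inbox of 1,000 emails and a "classifier" that always predicts real mail. The inbox size and the always-real classifier are assumptions chosen only to mirror the 1% / 99% split described above.

```python
# Hypothetical inbox: 1,000 emails, of which 10 (1%) are spam.
# Labels: 1 = spam (positive class), 0 = real email.
y_true = [1] * 10 + [0] * 990

# A "classifier" that ignores its input and always predicts "real".
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Number of spam emails the classifier actually caught.
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy:    {accuracy:.2f}")
print(f"spam caught: {spam_caught} of {sum(y_true)}")
```

The accuracy comes out at 0.99 even though no spam is caught, which is why accuracy alone can be misleading on imbalanced data.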

A better way of evaluating the performance of **binary classifiers** is by computing a **confusion matrix** and generating a **classification report**.

Given a binary classification task, e.g. detecting spam email, we can draw up a 2x2 matrix that summarises performance, called a **confusion matrix**.

The predicted labels run across the top and the actual labels run down the side.

![](../imgs/confusion-matrix.png)

As we're trying to detect spam, spam is the **positive** class.

- `true positive`: spam emails correctly labelled as spam
- `true negative`: real emails correctly labelled as real
- `false positive`: real emails incorrectly labelled (identified as spam)
- `false negative`: spam emails incorrectly labelled (identified as real mail)
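The four outcomes can be counted directly from a list of actual labels and a list of predicted labels. The helper `confusion_counts` and the eight example labels below are hypothetical, written only to illustrate the definitions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # spam predicted as spam
            else:
                fp += 1   # real mail predicted as spam
        else:
            if actual == positive:
                fn += 1   # spam predicted as real mail
            else:
                tn += 1   # real mail predicted as real mail
    return tp, tn, fp, fn


# Made-up labels for eight emails: 1 = spam, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

With these labels the filter catches two spam emails, misses one, and wrongly flags one real email.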

We can compute the **accuracy** of our model using the confusion matrix:

- **sum of the diagonal** / **total sum of the matrix**

![Confusion matrix](../imgs/confusion-matrix-2.png)
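As a small sketch of that calculation, the snippet below computes accuracy from a hypothetical 2x2 matrix stored as nested lists. The counts and the row/column order are assumptions for illustration only.

```python
# Hypothetical 2x2 confusion matrix: rows are actual labels,
# columns are predicted labels, both in the order [spam, real].
matrix = [
    [40, 10],   # actual spam: 40 predicted spam, 10 predicted real
    [5, 945],   # actual real: 5 predicted spam, 945 predicted real
]

# Correct predictions sit on the diagonal.
diagonal = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)

accuracy = diagonal / total
print(f"accuracy = {diagonal} / {total} = {accuracy:.3f}")
```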

There are several other metrics that can be derived from the confusion matrix:

- **precision**  

    - true positives / (true positives + false positives)  
    
- **recall / sensitivity / true positive rate**  

    - true positives / (true positives + false negatives)  
    
- **F1 score**  

    - 2 * (precision * recall) / (precision + recall)  
    
**High precision** means that not many real emails are predicted as spam, i.e. few false positives.

**High recall** means that most spam emails were correctly identified as spam, i.e. few false negatives.

![Confusion matrix](../imgs/confusion-matrix-3.png)
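The three formulas can be sketched as one small function. The name `precision_recall_f1` and the counts passed to it are hypothetical, and returning `0.0` when a denominator is zero is a convention assumed here, not something the text above prescribes.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Assumed convention: fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical spam filter: 40 spam caught, 5 real emails flagged, 10 spam missed.
precision, recall, f1 = precision_recall_f1(tp=40, fp=5, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In practice, `sklearn.metrics` provides `precision_score`, `recall_score` and `f1_score`, which compute the same quantities from label arrays.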

### Implementing a confusion matrix with sklearn

The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a **binary classification** problem. A target value of `0` indicates that the patient does not have diabetes, while a value of `1` indicates that the patient does have diabetes.

We will fit a **k-NN classifier** to the data and evaluate its performance by generating a confusion matrix and classification report.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
# prepare the data
df = pd.read_csv('../data/diabetes.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnancies    768 non-null int64
glucose        768 non-null int64
diastolic      768 non-null int64
triceps        768 non-null int64
insulin        768 non-null int64
bmi            768 non-null float64
dpf            768 non-null float64
age            768 non-null int64
diabetes       768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [3]:
df.sample(5)

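|     | pregnancies | glucose | diastolic | triceps | insulin | bmi  | dpf   | age | diabetes |
|-----|-------------|---------|-----------|---------|---------|------|-------|-----|----------|
| 402 | 5           | 136     | 84        | 41      | 88      | 35.0 | 0.286 | 35  | 1        |
| 407 | 0           | 101     | 62        | 0       | 0       | 21.9 | 0.336 | 25  | 0        |
| 418 | 1           | 83      | 68        | 0       | 0       | 18.2 | 0.624 | 27  | 0        |
| 740 | 11          | 120     | 80        | 37      | 150     | 42.3 | 0.785 | 48  | 1        |
| 140 | 3           | 128     | 78        | 0       | 0       | 21.1 | 0.268 | 55  | 0        |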


In [6]:
X = df.drop('diabetes', axis=1).values
y = df.diabetes.values

print(type(X), X.shape)
print(type(y), y.shape)

<class 'numpy.ndarray'> (768, 8)
<class 'numpy.ndarray'> (768,)


In [7]:
# split the data into training (60%) and test (40%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# fit a k-NN classifier on the training set
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# predict the labels of the test set
y_pred = knn.predict(X_test)

# generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[176  30]
 [ 56  46]]
              precision    recall  f1-score   support

           0       0.76      0.85      0.80       206
           1       0.61      0.45      0.52       102

   micro avg       0.72      0.72      0.72       308
   macro avg       0.68      0.65      0.66       308
weighted avg       0.71      0.72      0.71       308



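The `support` column gives the number of test samples whose true label falls in each class (206 without diabetes, 102 with diabetes). The `precision`, `recall`, and `f1-score` columns give the respective metrics for each class. Here the recall for class `1` is only 0.45: the classifier misses 56 of the 102 patients who have diabetes, a weakness that the overall accuracy of 0.72 hides.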
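As a check on how the report relates to the confusion matrix, the sketch below recomputes the per-class rows from the matrix printed above. The helper `class_report` is hypothetical. It assumes sklearn's layout, where rows are actual classes and columns are predicted classes, both in the order `[0, 1]`.

```python
# Confusion matrix from the output above:
# rows = actual class, columns = predicted class, order [0, 1].
cm = [[176, 30],
      [56, 46]]


def class_report(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = cm[k][k]                             # correctly predicted as k
    predicted_k = sum(row[k] for row in cm)   # column total: all predicted as k
    support = sum(cm[k])                      # row total: all actually k
    precision = tp / predicted_k
    recall = tp / support
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1, support


report = {k: class_report(cm, k) for k in (0, 1)}
for k, (p, r, f, s) in report.items():
    print(f"{k}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}  support={s}")
```

Rounded to two decimals, these values match the `0` and `1` rows of the classification report.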

By analyzing the confusion matrix and classification report, you can get a much better understanding of your classifier's performance.