## Understanding Metrics

In this exercise we will work with the Pima Indians Diabetes dataset.  This data set is 
[originally](https://archive.ics.uci.edu/ml/datasets/Diabetes) from the 
[UC-Irvine machine learning repository](https://archive.ics.uci.edu/ml/datasets.php).  
We will use a cleaned-up version from 
[Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).  For convenience, I've already downloaded the dataset to the exercise folder. The dataset has the following variables (columns):

- Pregnancies
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI
- DiabetesPedigreeFunction
- Age
- Outcome

Spend some time on the Kaggle site familiarizing yourself with the dataset.

In [1]:
import pandas as pd
file = 'diabetes.csv'
data = pd.read_csv(file)
data.head()

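| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |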


For the purposes of this exercise, we are going to explore whether we can predict the diabetes status of a patient given the following four health measurements.

In [2]:
features = ['Pregnancies', 'Insulin', 'BMI', 'Age']
X = data[features]
y = data.Outcome

In [3]:
total_cases = len(data)  # == len(X) == len(y)
total_cases

768

There are 768 rows in the dataset.  We split them into a _Training data set_ and a _Test data set_ with scikit-learn's `train_test_split` function.  If we all use the same value for `random_state`, our splits will be the same.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)
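
The sizes of the two pieces follow from the defaults of `train_test_split`. As a rough sketch, assuming neither `test_size` nor `train_size` is passed (in which case scikit-learn falls back to a 25% test fraction, rounded up), the arithmetic looks like this. The helper name `default_split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math

def default_split_sizes(n_samples, test_fraction=0.25):
    """Approximate the train/test sizes used when no sizes are given.

    Assumption: the test set gets ceil(test_fraction * n_samples) rows
    and the training set gets whatever is left.
    """
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = default_split_sizes(768)
print(n_train, n_test)  # 576 192
```

So with 768 rows we should expect roughly 576 training cases and 192 test cases.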

Now, let's use Logistic Regression to classify by training it on the `(X_train, y_train)` combo.

In [5]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train, y_train)

LogisticRegression()

We use the fitted model to make a prediction with `X_test`.  We get the predictions as a numpy array.

In [6]:
y_pred_class = clf.predict(X_test)
y_pred_class[:5]

array([0, 0, 0, 0, 0], dtype=int64)

For the rest of this exercise we will examine various metrics for measuring the performance of the classifier.  You will use the built-in scikit-learn functions for these metrics and will also calculate them yourself.  Wherever you see a function of the form `my_<metric>`, you need to define the function yourself so that it does the same calculation as the built-in scikit-learn function.

#### Classification accuracy: percentage of correct predictions

In [7]:
len(X_test)

192

Of the 192 test cases how many did we get right?

In [8]:
from sklearn import metrics

print(metrics.accuracy_score(y_test, y_pred_class))

0.65625


In [9]:
type(y_test), type(y_pred_class)

(pandas.core.series.Series, numpy.ndarray)

In [10]:
def my_accuracy_score(actual, predicted):
    '''
    Given the actual and predicted labels (two equal-length sequences, e.g. a
    pandas Series and a numpy array), return the fraction of predictions
    that match the actual values
    '''
    # Count the positions where the prediction equals the actual label
    matches = sum(1 for a, b in zip(actual, predicted) if a == b)
    return matches / len(actual)

my_accuracy_score(y_test, y_pred_class)

0.65625

#### Null accuracy:

This is defined as the accuracy that could be achieved by always predicting the most frequent class.

In [11]:
y_test.value_counts(normalize=True)

0    0.619792
1    0.380208
Name: Outcome, dtype: float64

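The null accuracy is 61.98%, so our classifier's accuracy of about 65.6% is only slightly better than always predicting class 0.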
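
The same number can be computed directly from the labels. Below is a minimal sketch in plain Python; `null_accuracy` is a hypothetical helper name, and the label list is reconstructed from the proportions above (0.619792 of 192 test cases is 119 negatives, leaving 73 positives) rather than read from `y_test`.

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# 119 negatives and 73 positives, reconstructed from the proportions above
labels = [0] * 119 + [1] * 73
print(round(null_accuracy(labels), 6))  # 0.619792
```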

### Confusion matrix

While the confusion matrix itself is not a metric, all of the metrics below can be calculated from it. Read the scikit-learn [documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
on the confusion matrix.

In [12]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[101,  18],
       [ 48,  25]], dtype=int64)

In [13]:
import pandas as pd

def my_confusion_matrix(actual, predicted):
    '''
    For the given actual and predicted values of a variable (a pandas Series
    and a numpy array here), creates a confusion matrix using pandas functions
    instead of scikit-learn.
    Similar to scikit-learn, the rows represent the Actual values and the
    columns represent the Predicted values
    '''
    cmat = pd.crosstab(index=actual, columns=predicted)
    # Label the axes so the orientation of the table is explicit
    cmat = cmat.rename_axis(index='Actual', columns='Predicted')

    return cmat

my_confusion_matrix(y_test, y_pred_class).values

array([[101,  18],
       [ 48,  25]], dtype=int64)

In [14]:
my_confusion_matrix(y_test, y_pred_class)

| Actual (rows) / Predicted (columns) | 0 | 1 |
|---|---|---|
| 0 | 101 | 18 |
| 1 | 48 | 25 |


**Basic terminology**

- **True Positives (TP):** we *correctly* predicted that they *do* have diabetes
- **True Negatives (TN):** we *correctly* predicted that they *don't* have diabetes
- **False Positives (FP):** we *incorrectly* predicted that they *do* have diabetes (a "Type I error")
- **False Negatives (FN):** we *incorrectly* predicted that they *don't* have diabetes (a "Type II error")
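
To make the four definitions concrete, here is a small sketch that counts them pair by pair in plain Python. The function name `count_outcomes` and the toy label lists are made up for illustration; they are not part of the exercise data.

```python
def count_outcomes(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(count_outcomes(actual, predicted))  # (2, 2, 1, 1)
```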

In [15]:


cm = metrics.confusion_matrix(y_test, y_pred_class)

TP = cm[1][1]
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP,TN,FP,FN

(25, 101, 18, 48)

In [16]:
def metric_maker(cm):
    '''
    Given a confusion matrix (a DataFrame with actual classes as rows and
    predicted classes as columns), return a function that calculates the
    TP, FP, TN, FN counts for a class
    '''
    def TFPN_metric(klass, metric):
        '''
        Given a class, calculate the metric specified as a 2 character string,
        e.g. TFPN_metric(1, 'TP').
        klass values are integers as given in Piazza
        '''
        metric = metric.upper()
        if metric == 'TP':
            ans = cm.loc[klass,klass]
        elif metric == 'FP':
            ans = cm.loc[cm.index!=klass,klass].values.sum()
        elif metric == 'TN':
            ans = cm.loc[cm.index!=klass,cm.columns!=klass].values.sum()
        elif metric == 'FN':
            ans = cm.loc[klass,cm.columns!=klass].values.sum()
        else:
            # Fail loudly; returning here would hit an undefined `ans`
            raise ValueError(f'{metric!r} is not a metric; use TP, FP, TN or FN')
        return ans
    return TFPN_metric

cm = my_confusion_matrix(y_test, y_pred_class)

TFPN_metric = metric_maker(cm)


In [17]:
TFPN_metric(1, 'TP'), TFPN_metric(1, 'FN'), TFPN_metric(1, 'TN'), TFPN_metric(1, 'FP')

(25, 48, 101, 18)
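
The same one-vs-rest bookkeeping can be sketched without pandas, which may help when checking the `.loc` selections by hand. This version assumes the matrix is a nested list where `matrix[i][j]` counts cases of actual class `i` predicted as class `j`; the name `one_vs_rest_counts` is hypothetical.

```python
def one_vs_rest_counts(matrix, klass):
    """Return (TP, FN, TN, FP) for `klass`.

    matrix[i][j] is the number of cases with actual class i
    that were predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[klass][klass]
    fn = sum(matrix[klass]) - tp                 # rest of the actual-class row
    fp = sum(row[klass] for row in matrix) - tp  # rest of the predicted-class column
    tn = total - tp - fn - fp                    # everything outside that row and column
    return tp, fn, tn, fp

binary_cm = [[101, 18], [48, 25]]
print(one_vs_rest_counts(binary_cm, 1))  # (25, 48, 101, 18)
```

Because nothing in the function is specific to two classes, it should work unchanged on larger square matrices.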

#### Metrics computed from a confusion matrix

Now we will calculate the following metrics:

- accuracy
- error, misclassification rate
- recall, sensitivity, True Positive Rate (TPR)
- specificity
- false positive rate (FPR)
- precision


**Accuracy:** Overall, how often is the classifier correct?

In [18]:
metrics.accuracy_score(y_test, y_pred_class)

0.65625

In [19]:
TP,FN,TN,FP = TFPN_metric(1, 'TP'), TFPN_metric(1, 'FN'), TFPN_metric(1, 'TN'), TFPN_metric(1, 'FP')

In [20]:

accuracy = (TP+TN)/(TP+TN+FP+FN)
accuracy

0.65625

**Error:** Overall, how often is the classifier incorrect? Also known as "Misclassification Rate"

In [21]:
1 - metrics.accuracy_score(y_test, y_pred_class)

0.34375

**Recall:** When the actual value is positive, how often is the prediction correct?

- How "sensitive" is the classifier to detecting positive instances?
- Also known as _Sensitivity_ or _True Positive Rate_ (TPR)

In [22]:
metrics.recall_score(y_test, y_pred_class)

0.3424657534246575

In [23]:

recall = TP/(TP+FN)
recall

0.3424657534246575

**Specificity:** When the actual value is negative, how often is the prediction correct?

- How "specific" (or "selective") is the classifier in predicting positive instances?

In [24]:
# Interestingly, there is no built-in function in scikit-learn to calculate specificity
specificity = TN/(FP+TN)
specificity

0.8487394957983193

**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?

In [25]:

FPR = FP/(FP+TN)
FPR

0.15126050420168066
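
Specificity and the false positive rate are computed over the same set of cases (the actual negatives), so they are complements of each other. A quick check using the counts from the confusion matrix above:

```python
TN, FP = 101, 18  # counts from the confusion matrix above

specificity = TN / (TN + FP)
fpr = FP / (TN + FP)

# The two rates share a denominator, so they add up to 1
print(round(specificity, 4), round(fpr, 4))  # 0.8487 0.1513
print(abs(specificity + fpr - 1) < 1e-12)    # True
```

Put differently, specificity is the recall of the negative class, so a call such as `metrics.recall_score(y_test, y_pred_class, pos_label=0)` should give the same value as the manual calculation.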

**Precision:** When a positive value is predicted, how often is the prediction correct?

- How "precise" is the classifier when predicting positive instances?

In [26]:
metrics.precision_score(y_test, y_pred_class)

0.5813953488372093

In [27]:

precision = TP/(TP+FP)
precision

0.5813953488372093
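
As a recap, all of the metrics in this section can be gathered into one function of the four counts. This is only a sketch: `binary_metrics` is a hypothetical name, and the function assumes every denominator is non-zero (for example, that at least one positive prediction was made).

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the metrics from this section from the four counts."""
    total = tp + tn + fp + fn
    return {
        'accuracy': (tp + tn) / total,
        'error': (fp + fn) / total,
        'recall': tp / (tp + fn),        # sensitivity, TPR
        'specificity': tn / (tn + fp),
        'fpr': fp / (tn + fp),
        'precision': tp / (tp + fp),
    }

# Counts taken from the confusion matrix above
for name, value in binary_metrics(tp=25, tn=101, fp=18, fn=48).items():
    print(f'{name}: {value:.6f}')
```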

### Piazza Test

In [28]:
cm2 = pd.DataFrame({0: [7, 8, 9],1: [1, 2, 3],2: [3, 2, 1],})
cm2

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 7 | 1 | 3 |
| 1 | 8 | 2 | 2 |
| 2 | 9 | 3 | 1 |


In [29]:
TFPN_metric = metric_maker(cm2)
TFPN_metric(0, 'TP'), TFPN_metric(0, 'FN'), TFPN_metric(0, 'TN'), TFPN_metric(0, 'FP')

tbl = pd.DataFrame(index = [0,1,2], columns = 'precision recall f1_score'.split())

for klass in tbl.index:
    TP = TFPN_metric(klass, 'TP')
    FN = TFPN_metric(klass, 'FN')
    TN = TFPN_metric(klass, 'TN')
    FP = TFPN_metric(klass, 'FP')
    precision = TP/(TP+FP)
    recall = TP/(TP+FN)
    f1_score = 2*precision*recall/(precision+recall)
    tbl.loc[klass,'precision'] = precision
    tbl.loc[klass,'recall'] = recall
    tbl.loc[klass, 'f1_score'] = f1_score
    
tbl

| | precision | recall | f1_score |
|---|---|---|---|
| 0 | 0.291667 | 0.636364 | 0.4 |
| 1 | 0.333333 | 0.166667 | 0.222222 |
| 2 | 0.166667 | 0.076923 | 0.105263 |


## References:
1. https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html