
# Exercise 5: Metrics and Object Oriented Programming

Object oriented programming (OOP) is a popular programming paradigm.  
It is based on the idea of 'objects', which bundle data (attributes) with the methods that operate on that data.  

Before you start, please get familiar with the basic concept of OOP by looking at some of the available resources online. 
If you search the internet for object oriented programming with Python, you will find plenty of material, from short beginner tutorials to whole courses,  
for example [this tutorial](https://realpython.com/python3-object-oriented-programming/) or [this course module](https://www.codecademy.com/learn/learn-python/modules/learn-python-introduction-to-classes-u).



In [5]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

The next cell shows an example of how sensitivity, the fraction of actual positives that are correctly predicted as positive, can be computed.

In [6]:
def compute_sensitivity(x,y):
    """
    Args:
        x: one hot encoded vector of predictions
        y: one hot encoded vector of ground truth
    """
    if x.shape != y.shape:
        raise ValueError("x and y should have the same shape.")
    
    tp = ((x+y)==2).sum()
    p = y.sum()
    return tp/p
    

Let's define small vectors x and y and test the sensitivity function.  
The vectors should contain only the values 0 and 1 (the docstring above calls this "one hot encoded").

Imagine we are dealing with a task of disease classification for four different patients. To represent whether a patient has a disease or not, we might use the values 0 and 1. Here, 1 means "disease present," and 0 means "no disease."

We can store this information in an array. For example, we might have an array [1, 1, 0, 1] for four patients, where the first, second, and fourth patients have a disease, but the third patient does not.

Strictly speaking, this is a binary label vector with one value per patient. A one-hot encoding would instead represent each patient as an array with one entry per class, where only one value is "hot" (or '1') and the rest of the array is "cold" (or '0'). With only two classes, the binary vector carries the same information, which is why the functions in this exercise work on it directly.
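
To make the encoding concrete, here is a minimal sketch that expands a vector of 0/1 labels into a two-column one-hot array. The variable names and the column order (column 0 for "no disease", column 1 for "disease") are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical labels for four patients: 1 = disease present, 0 = no disease
labels = np.array([1, 1, 0, 1])

# Indexing the 2x2 identity matrix with the labels yields one row per patient
# and one column per class; exactly one entry in each row is 1.
one_hot = np.eye(2, dtype=int)[labels]
print(one_hot)
```

Column 1 of `one_hot` is identical to `labels`, so for two classes the plain 0/1 vector is enough.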

You can play around with the values in x and y vectors and their length. 

In [7]:
x = np.array([1, 1, 0, 1])
y = np.array([1, 1, 1, 0])
compute_sensitivity(x,y)

0.6666666666666666

## Understanding Sensitivity and Boolean Operations in Python

In our function `compute_sensitivity(x, y)`, we encounter the line `tp = ((x+y)==2).sum()`. This line might be a little tricky to understand without prior programming experience, so let's break it down.

### Boolean Operations

In Python, certain operations return a **boolean value** - `True` or `False`. In the context of numerical computations, Python treats `True` as 1 and `False` as 0.

### Understanding the Line `tp = ((x+y)==2).sum()`

This line is computing the **True Positives (tp)**, which are the cases where the patient has a disease (`y=1`) and our prediction (`x`) is also positive (`x=1`).

1. `x+y` is an element-wise addition of the two vectors, resulting in a new vector where each element is the sum of the corresponding elements in `x` and `y`. 

2. `(x+y)==2` is a boolean operation that checks which elements in the resulting vector are equal to 2. It creates a new boolean vector, where elements are `True` if the corresponding element in `x+y` is 2, and `False` otherwise.

3. `.sum()` adds up all the elements in this boolean vector. Since `True` is treated as 1 and `False` as 0, the sum gives us the count of `True` elements. This is the number of true positives, i.e., cases where both the prediction and the ground truth indicate disease.
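
The three steps can be inspected one at a time. The sketch below reuses the example vectors from this exercise; the intermediate names `summed` and `is_tp` are introduced here only for illustration.

```python
import numpy as np

x = np.array([1, 1, 0, 1])  # predictions
y = np.array([1, 1, 1, 0])  # ground truth

summed = x + y        # step 1: element-wise sum
is_tp = summed == 2   # step 2: boolean mask, True where both x and y are 1
tp = is_tp.sum()      # step 3: True counts as 1, so this counts the matches

print(summed)  # [2 2 1 1]
print(is_tp)   # [ True  True False False]
print(tp)      # 2
```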


### Implementing with a For Loop

If the above operation seems a bit complex, we can achieve the same result using a `for` loop. Here's how:


In [8]:
def compute_sensitivity_with_loop(x, y):
    """
    Args:
        x: one hot encoded vector of predictions
        y: one hot encoded vector of ground truth
    """
    if x.shape != y.shape:
        raise ValueError("x and y should have the same shape.")
    
    tp = 0  # Initialize the count of true positives
    for i in range(len(x)):
        if x[i] == 1 and y[i] == 1:
            tp += 1  # Increase the count if both x and y are 1 at this position
    
    p = y.sum()
    return tp / p

In [9]:
x = np.array([1, 1, 0, 1])
y = np.array([1, 1, 1, 0])
compute_sensitivity_with_loop(x,y)

0.6666666666666666

Now let's define a function to compute precision in a similar fashion:

In [4]:
def compute_precision(x,y):
    """
    Args:
        x: one hot encoded vector of predictions
        y: one hot encoded vector of ground truth
    """
    if x.shape != y.shape:
        raise ValueError("x and y should have the same shape.")
    
    tp = ((x+y)==2).sum()
    fp = ((x-y)==1).sum()
    return tp/(tp+fp)

And let's test it with our example vectors:

In [5]:
compute_precision(x,y)

0.6666666666666666

You can see that both functions had to compute the true positives (tp), which is fine in this small example, but would be wasteful with very large multi-dimensional tensors.
Also, if you want to change how tp is calculated, you would have to do it in both functions.
One solution would be to extract the true positive calculation into a new function, but in this exercise we want to focus on OOP.  
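
For comparison, the function-based alternative could look roughly like the sketch below. The helper name `count_true_positives` and the shortened function names are hypothetical and not part of the exercise.

```python
import numpy as np

def count_true_positives(x, y):
    """Count positions where prediction and ground truth are both 1."""
    return ((x + y) == 2).sum()

def sensitivity(x, y):
    return count_true_positives(x, y) / y.sum()

def precision(x, y):
    tp = count_true_positives(x, y)
    fp = ((x - y) == 1).sum()  # predicted 1 where the truth is 0
    return tp / (tp + fp)

x = np.array([1, 1, 0, 1])
y = np.array([1, 1, 1, 0])
print(sensitivity(x, y), precision(x, y))
```

This removes the duplicated formula, but tp is still recomputed on every call; storing it once in an object avoids that.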

Let's define a class ConfusionMatrix that is initialized with our two vectors x and y and stores values such as tp and fp as attributes of the object.

In [6]:
class ConfusionMatrix():
    def __init__(self, x, y):
        """
        Args:
            x: one hot encoded vector of predictions
            y: one hot encoded vector of ground truth
        """
        self.tp = ((x+y)==2).sum()
        self.fp = ((x-y)==1).sum()
        self.p = y.sum()


Now let's create an instance (object) of the class ConfusionMatrix:

In [7]:
cm = ConfusionMatrix(x,y)

Now we can have a look at the object's attributes. Note that a notebook cell only displays the value of its last expression, here `cm.fp`:

In [8]:
cm.tp
cm.fp

1

We can also define methods on the class, e.g. to compute sensitivity:

In [9]:
class ConfusionMatrix():
    def __init__(self, x, y):
        """
        Args:
            x: one hot encoded vector of predictions
            y: one hot encoded vector of ground truth
        """
        self.tp = ((x+y)==2).sum()
        self.fp = ((x-y)==1).sum()
        self.p = y.sum()
    
    def get_sensitivity(self):
        return self.tp/self.p

In [10]:
cm = ConfusionMatrix(x,y)
cm.get_sensitivity()

0.6666666666666666


### Homework:


Now you can complete the class ConfusionMatrix by adding methods for specificity, precision and F1 score. Feel free to add more attributes in `__init__` if needed.  
The formulas to compute confusion matrix based metrics can be found in the lecture slides or here: [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
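
For reference, the standard textbook definitions are summarized below. TN (true negatives) and FN (false negatives) are counts that the class above does not store yet, and P = TP + FN is the number of actual positives. This is a summary of the usual formulas, not a prescribed implementation.

$$
\begin{aligned}
\text{sensitivity} &= \frac{TP}{TP + FN} = \frac{TP}{P} \\
\text{specificity} &= \frac{TN}{TN + FP} \\
\text{precision} &= \frac{TP}{TP + FP} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}
\end{aligned}
$$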

In [10]:
# Homework: complete the class ConfusionMatrix here
class ConfusionMatrix():
    def __init__(self, x, y):
        self.tp = ((x+y)==2).sum()
        self.fp = ((x-y)==1).sum()
        self.p = y.sum()
    
    def get_sensitivity(self):
        return self.tp/self.p
    
    def get_specificity(self):
        # fill here
        pass
    
    def get_precision(self):
        # fill here
        pass
    
    def get_f1score(self):
        # fill here
        pass

Finally, we want to use our ConfusionMatrix class to plot the ROC (receiver operating characteristic) curve.  
First we will load a .csv file with values for the predictions and the actual labels:

In [11]:
!wget https://github.com/CS4MS/CS4MS_S23/raw/main/data/exercise5_prediction.csv
pred_df = pd.read_csv('exercise5_prediction.csv')

pred_df

--2023-05-23 14:39:19--  https://github.com/CS4MS/CS4MS_S23/raw/main/data/exercise5_prediction.csv
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-05-23 14:39:19 ERROR 404: Not Found.



FileNotFoundError: [Errno 2] No such file or directory: 'exercise5_prediction.csv'

The dataframe has 100 rows and 2 columns, one for the actual values and one for the predictions.  
The prediction values are floating point numbers rather than 0 or 1.  
We need to define a threshold that decides whether a prediction gets assigned to class 0 or class 1.


In [None]:
threshold = 0.5
# first let's turn the dataframe into separate numpy arrays
actual = np.array(pred_df['actual'])
pred = np.array(pred_df['prediction'])

# now we can define a new array new_pred, filled with zeros,
# and set the entries whose prediction is above the threshold to 1; all others stay 0
new_pred = np.zeros_like(pred)
new_pred[pred > threshold] = 1


In [None]:
# and we can compute our metrics with the confusion matrix:
cm = ConfusionMatrix(new_pred,actual)
#cm.get_sensitivity()
#cm.get_specificity()
cm.get_f1score()

The threshold of 0.5 seems to have been a good choice. But what if we had used a different value?  
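
To get a feeling for the effect, here is a small sketch on made-up numbers. The `scores` and `labels` arrays and the helper `sensitivity_at` are hypothetical and unrelated to the exercise file.

```python
import numpy as np

# Hypothetical scores and labels, not the values from the exercise file
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])
labels = np.array([0, 0, 1, 1, 1, 1])

def sensitivity_at(threshold):
    """Binarize the scores at the given threshold and return the sensitivity."""
    pred = (scores > threshold).astype(int)
    tp = ((pred + labels) == 2).sum()
    return tp / labels.sum()

results = {t: sensitivity_at(t) for t in (0.3, 0.5, 0.7)}
for t, s in results.items():
    print(f"threshold={t}: sensitivity={s:.2f}")
```

On this toy data the sensitivity drops from 1.00 to 0.75 to 0.50 as the threshold rises, because fewer samples are predicted as positive.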


In [None]:
# for plotting the ROC we need to use multiple thresholds:
# using numpy linspace we can create 100 evenly spaced numbers between 0 and 1:
thresholds = np.linspace(0,1,100)
#print(thresholds)


### Homework

Use the thresholds to binarize the predictions, and then compute the metrics needed to plot the ROC (sensitivity and 1-specificity, matching the axis labels of the plot).  
You can use the lists below to store the x and y values.

In [None]:
y_values = [] 
x_values = []
# put your code here:

Now you can use the cell below to plot the ROC.  
The dashed red line shows the ROC of a random classifier.

In [None]:
# plot ROC curve
plt.title('ROC')
plt.plot(x_values,y_values, 'b')
#plt.plot(roc_values)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('Sensitivity')
plt.xlabel('1-Specificity')
plt.show()