In [60]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')

### Read the set of labels into a dataframe:

In [45]:
labels = pd.read_csv('labels.csv')

In [46]:
labels.tail()

Unnamed: 0,perfect_labeler,radiologist,algorithm
144,benign,cancer,0.06
145,benign,cancer,0.98
146,benign,cancer,0.77
147,benign,cancer,0.03
148,benign,cancer,0.02


### Start with assessing the radiologist's performance:
* Assess the _accuracy_ of the radiologist by just looking at the percent of cases that they correctly labeled
* Next, look at the true positive and true negative rates of the radiologist by generating a _confusion matrix_ 

In [47]:
radiologist_accuracy = sum(labels.perfect_labeler == labels.radiologist)/len(labels)

In [48]:
radiologist_accuracy

0.8993288590604027

In [49]:
confusion_matrix(labels.perfect_labeler.values,labels.radiologist.values,labels=["cancer","benign"])

array([[ 25,   4],
       [ 11, 109]])
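
The true positive and true negative rates asked for above can be read off this matrix. With `labels=["cancer","benign"]`, the rows are the perfect labeler's classes and the columns are the radiologist's, so the layout is `[[TP, FN], [FP, TN]]`. A minimal sketch, with the matrix values copied from the output above (the variable names are illustrative, not part of the exercise):

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: perfect labeler (cancer, benign); columns: radiologist (cancer, benign).
cm = np.array([[25, 4],
               [11, 109]])

tp, fn = cm[0]
fp, tn = cm[1]

sensitivity = tp / (tp + fn)     # true positive rate
specificity = tn / (tn + fp)     # true negative rate
accuracy = (tp + tn) / cm.sum()  # should match radiologist_accuracy

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"accuracy    = {accuracy:.3f}")
```

The radiologist catches 25 of the 29 cancers (sensitivity of about 0.862) and correctly clears 109 of the 120 benign cases (specificity of about 0.908).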

### Now look at the algorithm's performance compared to the perfect labeler:
* The algorithm doesn't produce a binary label; it returns a _probability_ of cancer. Choose a probability cut-off for labeling a case as cancer vs. benign. _(Hint: 0.5 is a reasonable starting place)_
* Start with assessing _accuracy_ again here
* Generate a confusion matrix

In [58]:
def performance_of_algorithm(X, y):
    # Label a case "cancer" when the predicted probability reaches the 0.5 cut-off.
    # A single vectorized assignment avoids the chained X[col][i] = ... writes,
    # which pandas may apply to a copy instead of the dataframe.
    X["algorithm_prediction"] = np.where(X.algorithm >= 0.5, "cancer", "benign")

    algorithm_accuracy = sum(y == X.algorithm_prediction)/len(X) # assessing accuracy
    print(algorithm_accuracy)

    # Rows are the true labels, columns the predictions, both ordered cancer, benign
    print(confusion_matrix(y.values, X.algorithm_prediction.values, labels = ["cancer", "benign"])) # confusion matrix

In [61]:
performance_of_algorithm(labels, labels.perfect_labeler)

0.8859060402684564
[[ 21   8]
 [  9 111]]


What happens now if you change the threshold cut-off for your algorithm's classification to 0.4? What if you raise it to 0.6? How do accuracy, fp, fn, tp, and tn change?

In [63]:
# With the cut-off at 0.4, accuracy dips (0.886 -> 0.859): TP rises from 21 to 25,
# but FP rises from 9 to 17.
def performance_of_algorithm(X, y):
    # Label a case "cancer" when the predicted probability reaches the 0.4 cut-off
    X["algorithm_prediction"] = np.where(X.algorithm >= 0.4, "cancer", "benign")

    algorithm_accuracy = sum(y == X.algorithm_prediction)/len(X) # assessing accuracy
    print(algorithm_accuracy)

    print(confusion_matrix(y.values, X.algorithm_prediction.values, labels = ["cancer", "benign"])) # confusion matrix


performance_of_algorithm(labels, labels.perfect_labeler)

0.8590604026845637
[[ 25   4]
 [ 17 103]]


In [64]:
# With the cut-off at 0.6, accuracy increases (0.886 -> 0.906): FP falls from 9 to 5,
# but FN rises from 8 to 9, so one more cancer is missed.
def performance_of_algorithm(X, y):
    # Label a case "cancer" when the predicted probability reaches the 0.6 cut-off
    X["algorithm_prediction"] = np.where(X.algorithm >= 0.6, "cancer", "benign")

    algorithm_accuracy = sum(y == X.algorithm_prediction)/len(X) # assessing accuracy
    print(algorithm_accuracy)

    print(confusion_matrix(y.values, X.algorithm_prediction.values, labels = ["cancer", "benign"])) # confusion matrix


performance_of_algorithm(labels, labels.perfect_labeler)

0.9060402684563759
[[ 20   9]
 [  5 115]]
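
The cells above differ only in the cut-off value, so the comparison can also be written as one function that takes the threshold as a parameter. A minimal sketch under stated assumptions: `evaluate_at_threshold` is a hypothetical helper, not part of the exercise, and the toy data below is made up for illustration and is not the content of `labels.csv`.

```python
import numpy as np

def evaluate_at_threshold(y_true, scores, threshold):
    """Return (accuracy, tp, fn, fp, tn) for a given probability cut-off."""
    y_true = np.asarray(y_true)
    predicted_cancer = np.asarray(scores) >= threshold
    actual_cancer = y_true == "cancer"

    tp = int(np.sum(predicted_cancer & actual_cancer))
    fn = int(np.sum(~predicted_cancer & actual_cancer))
    fp = int(np.sum(predicted_cancer & ~actual_cancer))
    tn = int(np.sum(~predicted_cancer & ~actual_cancer))

    accuracy = (tp + tn) / len(y_true)
    return accuracy, tp, fn, fp, tn

# Made-up toy data, NOT the values in labels.csv.
toy_truth = ["cancer", "cancer", "cancer", "benign", "benign", "benign", "benign", "benign"]
toy_scores = [0.95, 0.58, 0.45, 0.62, 0.42, 0.30, 0.10, 0.05]

results = {t: evaluate_at_threshold(toy_truth, toy_scores, t) for t in (0.4, 0.5, 0.6)}
for t, (acc, tp, fn, fp, tn) in results.items():
    print(f"cut-off {t}: accuracy={acc:.3f} tp={tp} fn={fn} fp={fp} tn={tn}")
```

On the notebook's data the same helper could be called as `evaluate_at_threshold(labels.perfect_labeler, labels.algorithm, 0.5)`. Lowering the cut-off can only move cases from the "benign" column to the "cancer" column, so TP and FP never decrease while FN and TN never increase; raising it does the opposite.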


### Finally, let's compare our algorithm to the radiologist
* A "perfect labeler" might not exist in the real world, and in fact, it often does not
* In AI for medical imaging, using a radiologist's labels as the "true" label is often the standard of practice. Both in academic settings and in the regulated industry landscape, algorithm performance is judged against an expert human

* Repeat the steps above using a set threshold for your algorithm (again, 0.5 is perfectly reasonable) but now computing accuracy, tp, tn, fp, fn against the radiologist. 

In [89]:
# Cut-off of 0.55, now scored against the radiologist's labels instead of the perfect labeler.
def performance_of_algorithm(X, y):
    # Label a case "cancer" when the predicted probability reaches the 0.55 cut-off
    X["algorithm_prediction"] = np.where(X.algorithm >= 0.55, "cancer", "benign")

    algorithm_accuracy = sum(y == X.algorithm_prediction)/len(X) # assessing accuracy
    print("Accuracy of the algorithm against the radiologist's labels = ", algorithm_accuracy)

    # Rows are the radiologist's labels, columns the algorithm's, both ordered cancer, benign,
    # so the layout is [[tp, fn], [fp, tn]]
    c_matrix = confusion_matrix(y.values, X.algorithm_prediction.values, labels = ["cancer", "benign"])
    print("\nThe confusion matrix : \n", c_matrix) # confusion matrix

    tp = c_matrix[0][0]
    print("\nTrue Positive = ", tp)

    tn = c_matrix[1][1]
    print("\nTrue Negative = ", tn)

    fp = c_matrix[1][0]
    print("\nFalse Positive = ", fp)

    fn = c_matrix[0][1]
    print("\nFalse Negative = ", fn)

performance_of_algorithm(labels, labels.radiologist)

Accuracy of the algorithm against the radiologist's labels =  0.8791946308724832

The confusion matrix : 
 [[ 23  13]
 [  5 108]]

True Positive =  23

True Negative =  108

False Positive =  5

False Negative =  13
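
Indexing the matrix cell by cell makes it easy to mix up the four counts. One alternative is to flatten the 2x2 matrix and unpack all four values in a single statement. A minimal sketch, assuming the same `labels=["cancer", "benign"]` ordering used throughout this notebook; the toy labels below are made up for illustration and are not the content of `labels.csv`.

```python
from sklearn.metrics import confusion_matrix

# Made-up toy labels for illustration, NOT the values in labels.csv.
reference = ["cancer", "cancer", "cancer", "benign", "benign", "benign"]
predicted = ["cancer", "benign", "cancer", "cancer", "benign", "benign"]

# Rows are the reference labels, columns the predictions, both ordered
# cancer, benign, so the flattened matrix reads tp, fn, fp, tn.
tp, fn, fp, tn = confusion_matrix(
    reference, predicted, labels=["cancer", "benign"]
).ravel()

print(f"tp={tp} fn={fn} fp={fp} tn={tn}")
```

The unpacking order depends on the `labels` argument: with `labels=["benign", "cancer"]` the same call would yield `tn, fp, fn, tp` instead.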


## Reflection: 
* In the exercise above, you assessed the performance of a human and of an algorithm against a 'perfect labeler', and then the algorithm against the human.
* Does accuracy seem like the appropriate statistic to use when evaluating these labels? Why or why not? 
* In what clinical settings does it seem more or less acceptable to have a high level of FNs? FPs? 
* How did changing the threshold on the algorithm performance change the different performance statistics? 
* How did your opinion of the algorithm's performance change when you started comparing it to a radiologist instead of the perfect labeler? What does this mean for a real-world scenario when a perfect labeler doesn't exist, and we only have a radiologist's read to base our performance on? 
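
One way to approach the accuracy question is to look at the class balance. The confusion matrices against the perfect labeler have row sums of 29 cancer cases and 120 benign cases, so a labeler that never finds a single cancer still scores well on accuracy. A minimal sketch, in which the "always benign" labeler is a hypothetical baseline, not something evaluated in the exercise:

```python
# Class counts taken from the row sums of the confusion matrices
# computed against the perfect labeler above.
n_cancer, n_benign = 29, 120

# Hypothetical baseline that calls every case benign.
tp, fn = 0, n_cancer
fp, tn = 0, n_benign

accuracy = (tp + tn) / (n_cancer + n_benign)
sensitivity = tp / (tp + fn)

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
```

This baseline reaches an accuracy of about 0.805 while missing all 29 cancers, which is why sensitivity and specificity are worth reporting alongside accuracy on imbalanced data.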