# Model Performance Evaluation: Extra Credit (10 pts) #

In predictive modeling, the area under the receiver operating characteristic curve (AUROC, often shortened to AUC) can be used to measure the performance of a model with a binary response variable (containing values 0 or 1 only) that attempts to predict the probability of an event (1) given a set of characteristics X. The goal of such a model is to assign high probabilities to examples whose actual label is 1 and low probabilities to examples whose actual label is 0. You can read more about AUC here: 
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Computing the area under a curve generally requires a numerical approximation technique such as the trapezoid rule. Another way to estimate this area is to count how often the model ranks an actual 1 above an actual 0. In particular, given a set of actual labels and the corresponding predicted probabilities, the estimate is obtained by looking at all possible pairs of one actual 1 and one actual 0 and counting the pairs in which the predicted probability for the actual 1 is higher than the predicted probability for the actual 0.

The complete rule is to give +1 when the predicted probability of the actual 1 exceeds the predicted probability of the actual 0; to give 0 when the predicted probability of the actual 0 exceeds the predicted probability of the actual 1; and to give 0.5 when the two predicted probabilities are equal. These scores are then added and divided by the total number of pairs to obtain the estimated area. The higher the number, the better the model is at ranking positive examples above negative ones.

For example, consider 5 pairs of predicted probabilities, where the first number in each pair corresponds to the actual 1 (Pr1) and the second number corresponds to the actual 0 (Pr0). Then,

| Pr1 | Pr0 | Add to correct | Add to N |
| --- | --- | --- | --- |
| 0.8 | 0.2 | +1 | +1 |
| 0.6 | 0.7 | +0 | +1 |
| 0.5 | 0.5 | +0.5 | +1 |
| 0.3 | 0.1 | +1 | +1 |
| 0.9 | 0.7 | +1 | +1 |


<font color='red'> estimated area = sum of correct / N = 3.5/5 = 0.7. </font>
<br>
<br>
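
The arithmetic in the table above can be reproduced in a few lines of numpy. This is only a sketch of the scoring rule applied to the five example pairs; the variable names (`pr1`, `pr0`, `scores`, `estimated_area`) are chosen here for illustration.

```python
import numpy as np

# The five example pairs from the table: Pr1 is the actual 1, Pr0 the actual 0.
pr1 = np.array([0.8, 0.6, 0.5, 0.3, 0.9])
pr0 = np.array([0.2, 0.7, 0.5, 0.1, 0.7])

# +1 when Pr1 > Pr0, 0.5 when they are equal, 0 otherwise.
scores = np.where(pr1 > pr0, 1.0, np.where(pr1 == pr0, 0.5, 0.0))

# Sum of correct divided by the number of pairs N.
estimated_area = scores.sum() / scores.size
print(estimated_area)  # 0.7
```
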
Your job in this exercise is to write a function that takes in a 2-d numpy array of predicted probabilities and actual labels (each row has 2 values: the first is the predicted probability and the second is the actual label) and uses numpy methods and techniques to compute the estimated area without any loops. Remember that you have to consider ALL possible pairs of a positive and a negative actual example, as above. 
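
When checking a vectorized solution, a plain loop-based version of the same rule can serve as a reference. The sketch below deliberately uses loops, so it is not a valid answer to the exercise; the function name `pairwise_auc_reference` is hypothetical and is meant only for comparing results.

```python
import numpy as np


def pairwise_auc_reference(arr):
    """Loop-based reference: score every (actual 1, actual 0) pair."""
    preds_1 = arr[arr[:, 1] == 1, 0]  # predicted probabilities of actual 1's
    preds_0 = arr[arr[:, 1] == 0, 0]  # predicted probabilities of actual 0's
    total = 0.0
    for p1 in preds_1:
        for p0 in preds_0:
            if p1 > p0:
                total += 1.0   # the actual 1 is ranked higher
            elif p1 == p0:
                total += 0.5   # tie
    return total / (len(preds_1) * len(preds_0))


example = np.array([[0.3, 0], [0.3, 1], [0.7, 1], [0.9, 0]])
print(pairwise_auc_reference(example))  # 0.375
```

The four pairs in this example score 0.5, 0, 1 and 0, so the estimate is 1.5 / 4.
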

In [5]:
import numpy as np


def compute_estimated_AUC(myarray): 
    """
        This function takes a 2-d array where the first column has predicted probabilities and the second column has actual
        labels (either 0 or 1) of a binary-classification problem, and returns the estimated area under the curve using the
        rules above.
    """

    df_array=myarray

    #Step 1: extract predicted probabilities of the 0 actual dep value only: 
    #your code goes here
    # second value in the array is 0
    pred_prob_0 = df_array[df_array[:, 1] == 0][:, 0]
    # print(pred_prob_0)


    #step 2: extract predicted probabilities of the 1 actual dep value only and transform into proper format: 
    #your code goes here
    # second value in the array is 1
    pred_prob_1 = df_array[df_array[:, 1] == 1][:, 0]
    #print(pred_prob_1)
    
    #now create an array of 1's of size equals to the number of zeros: 
    #your code goes here
    # create an array of 1's of size equals to the number of zeros
    ones_array = np.ones(len(pred_prob_0))
    # print(ones_array)

    #perform proper vectorized matrix multiplication
    #your code goes here
    # 8.2 Combining and Merging Datasets McKinney 2020
    # the outer product with the ones vector repeats pred_prob_1 once per actual 0,
    # giving a matrix of shape (number of 0's, number of 1's)
    score_matrix = np.outer(ones_array, pred_prob_1)
    # print(ones_array)
    # print(pred_prob_1)
    # print(score_matrix)


    #Step 3: perform the proper broadcasting step: 
    #your code goes here
    # subtract each actual-0 probability (as a column vector) from every actual-1
    # probability; broadcasting keeps one entry per (actual 0, actual 1) pair
    score_matrix = score_matrix - pred_prob_0[:, np.newaxis]
    # keep only the sign: +1 if the actual 1 scored higher, 0 for a tie, -1 otherwise
    score_matrix = np.sign(score_matrix)
    # print(score_matrix)
    
    #Step 4: now, replace 0 with 0.5: 
    #your code goes here
    score_matrix[score_matrix == 0] = 0.5
    # print(score_matrix)

    #Step 5: finally, replace -1 with 0: 
    #your code goes here
    score_matrix[score_matrix == -1] = 0

    #Step 6: estimate the area under the curve using the given formula above: 
    #your code goes here
    n0 = len(pred_prob_0)
    n1 = len(pred_prob_1)
    # sum of pair scores divided by the total number of (0, 1) pairs
    cstat = np.sum(score_matrix) / (n0 * n1)
    
    return cstat


In [6]:
#test your code: 
myarray=np.array([[0.3,0],[0.3,1],[0.7,1],[0.9,0]])
auc=compute_estimated_AUC(myarray)
auc

0.375

# Cross-check: scikit-learn's roc_auc_score gives the same answer on this array.
import numpy as np
from sklearn.metrics import roc_auc_score

myarray = np.array([[0.3, 0], [0.3, 1], [0.7, 1], [0.9, 0]])
y_true = myarray[:, 1]
y_score = myarray[:, 0]

#Calculate AUC using scikit-learn
auc = roc_auc_score(y_true, y_score)

print(auc)   
0.375
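
As a further cross-check that needs only numpy, the pairwise rule from the exercise can be written compactly with broadcasting. This is a minimal sketch rather than the step-by-step approach the assignment asks for, and the name `pairwise_auc_compact` is hypothetical. On the test array it gives 0.375, the same value `roc_auc_score` printed above.

```python
import numpy as np


def pairwise_auc_compact(arr):
    """Pairwise AUC estimate: +1 for a win, 0.5 for a tie, 0 for a loss."""
    preds_1 = arr[arr[:, 1] == 1, 0]  # predicted probabilities of actual 1's
    preds_0 = arr[arr[:, 1] == 0, 0]  # predicted probabilities of actual 0's
    # A (1, n1) row minus an (n0, 1) column broadcasts to an (n0, n1) matrix
    # holding one difference per (actual 0, actual 1) pair.
    diff = preds_1[np.newaxis, :] - preds_0[:, np.newaxis]
    # Boolean masks act as 0/1, so wins count 1 and ties count 0.5.
    scores = (diff > 0) + 0.5 * (diff == 0)
    return scores.mean()


myarray = np.array([[0.3, 0], [0.3, 1], [0.7, 1], [0.9, 0]])
print(pairwise_auc_compact(myarray))  # 0.375
```
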