In [4]:
# INTRODUCTION TO EVALUATING CLASSIFICATION MODELS
# Imagine the following scenario:
# A credit card company wants to detect fraudulent transactions in real-time.
# Historically, 10 of every 100,000 transactions have been fraudulent.
# An analyst writes a program to detect fraudulent transactions.
# But because of a bug, it flags every transaction as not fraudulent.
# So for every 100,000 transactions, it correctly classifies the 99,990 transactions that aren't fraudulent, and it misclassifies the 10 fraudulent transactions.
# The program's accuracy score appears impressive at 99.99%.
# However, it spectacularly fails at its job - detecting 0 of every 10 fraudulent transactions.
# Its success rate at detecting fraud is therefore 0%.
# To catch problems like this, we need additional evaluation metrics.
# In this lesson, you'll learn how to calculate and apply advanced evaluation metrics to evaluate the performance of your classification models.
# Previous lessons covered the model-fit-predict pattern.
# This lesson adds another stage to our machine learning pattern, giving us model-fit-predict-evaluate.
# After identifying the right model, we fit our data to it.
# We can then use the model to make predictions about new data.
# Finally, we evaluate how well those predictions performed.
# If the last stage shows that the model didn't make predictions well, we might select a different machine learning model.
# By evaluating additional metrics for model performance in this lesson, you'll sharpen your skills regarding the last stage of this machine learning pattern.
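# The arithmetic behind the buggy fraud detector described at the start of this introduction can be checked with a short sketch.
# It uses only the counts stated in the scenario; the variable names are illustrative assumptions.

```python
# Counts taken from the scenario: 100,000 transactions, 10 of them fraudulent.
total_transactions = 100_000
fraudulent = 10
legitimate = total_transactions - fraudulent

# The buggy program labels every transaction "not fraudulent".
correct_predictions = legitimate  # every legitimate transaction is labeled correctly
detected_fraud = 0                # no fraudulent transaction is ever flagged

accuracy = correct_predictions / total_transactions
detection_rate = detected_fraud / fraudulent

print(f"Accuracy: {accuracy:.2%}")
print(f"Fraud detected: {detection_rate:.0%}")
```

# The accuracy looks excellent while the share of detected fraud is zero, which is the gap the additional metrics are meant to expose.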

In [5]:
# ASSESSING MODEL PERFORMANCE
# It's often not enough to train and then use a machine learning model for making predictions.
# We also need to know how well the model performs at its prediction task.
# In the previous lesson, we learned one way of assessing performance - by using the accuracy score.
# But beyond the overall percentage of predictions that the model gets right, we also want to know how well it predicts each outcome.
# To help answer these questions, we can use the following four metrics to give us additional insight into the model's performance:
    # 1. Accuracy
    # 2. Precision
    # 3. Recall
    # 4. F1 Score

In [6]:
# ACCURACY
# Earlier, we learned that we can measure a model's performance based on the differences between its predicted and its actual values.
# We examined the accuracy score to summarize this performance.
# In this section, we'll define and explain accuracy in more detail.
# A classification algorithm results in two or more outcomes.
# Classifying fraudulent transactions, for example, results in two outcomes: A transaction is fraudulent or not.
# We can categorize these two predictions according to a confusion matrix.
# A confusion matrix groups our model's predictions according to whether the model predicted each category correctly or confused it with the wrong category.
# By comparing the accurate predictions to the inaccurate (confused) predictions, a confusion matrix supplies a way for us to evaluate the model's performance.
# It contains the number of times that a model correctly predicted a positive or negative value and the number of times that it didn't.

# CONFUSION MATRIX
# For a two-class problem, a confusion matrix consists of two rows and two columns:
# The two columns are the 'Predicted True' and 'Predicted False' columns, respectively.
# The two rows are the 'Actually True' and 'Actually False' rows, respectively.
# Within the first column of the matrix are the positive predictions: the 'True Positives' and the 'False Positives'.
# This first column separates all the predictions that the model made for the positive class into whether they were accurate or not. (In our example, the positive class is the 'fraudulent' class).
# The second column contains all the negative predictions - the 'False Negatives' and 'True Negatives'.
# It separates all the predictions that the model made for the negative class. (In our example, the negative class is the 'not fraudulent' class).
# To further explain, any prediction falls into one of two categories: positive (true) or negative (false).
# In the context of fraud detection, a positive prediction means that the model categorized the transaction as fraudulent.
# A negative prediction means that the model categorized the transaction as not fraudulent.
# If the model predicted a transaction as fraudulent, and the transaction really was fraudulent, we call that prediction a TRUE POSITIVE (TP).
# If the model predicted a transaction as fraudulent, but the transaction wasn't fraud, we call that prediction a FALSE POSITIVE (FP).
# If the model predicted a transaction as not fraud, but it was fraud, we call the prediction a FALSE NEGATIVE (FN).
# Lastly, if the model predicted a transaction as not fraud, and it wasn't fraud, we call this prediction a TRUE NEGATIVE (TN).
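# The four definitions above can be sketched as a small counting function.
# This is a minimal illustration, not a library API: `confusion_counts` is a hypothetical helper, and the eight labels are made up for the example.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, and TN for a binary problem where True means 'fraudulent'."""
    tp = fp = fn = tn = 0
    for is_fraud, flagged in zip(actual, predicted):
        if flagged and is_fraud:
            tp += 1  # predicted fraud, and it really was fraud
        elif flagged and not is_fraud:
            fp += 1  # predicted fraud, but it was legitimate
        elif not flagged and is_fraud:
            fn += 1  # predicted not fraud, but it was fraud
        else:
            tn += 1  # predicted not fraud, and it was legitimate
    return tp, fp, fn, tn


# Made-up labels for eight transactions (illustration only).
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```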
# We can use all this information to calculate the accuracy of the model.
# The ACCURACY measures how often the model was correct. 
# It does so by calculating the ratio of the number of correct predictions to the total number of predictions.
# The formula for accuracy is as follows:
    # accuracy = (TP + TN) / (TP + TN + FP + FN)
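# As a quick check of the formula, here is a minimal sketch that computes the accuracy from the four confusion matrix counts.
# The function name and the counts in the first call are assumptions chosen for illustration; the second call uses the counts from the buggy fraud detector in the introduction.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, chosen only to illustrate the formula.
example_accuracy = accuracy(tp=30, tn=940, fp=20, fn=10)
print(example_accuracy)

# The introduction's detector: it never predicts fraud, so TP = 0 and FP = 0.
buggy_accuracy = accuracy(tp=0, tn=99_990, fp=0, fn=10)
print(buggy_accuracy)
```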
# Now, we can manually calculate the accuracy each time.
# However, FinTech professionals always seek a shortcut to a problem.
# In this case, we can continue to use the `accuracy_score` function to calculate the accuracy of the model.
# The next metric that we want to explore is the precision.

In [7]:
# PRECISION
# The precision metric relates to the accuracy metric, but it considers only the positive predictions that the model made.
# As with the accuracy, one way to illustrate the concept of precision is through a confusion matrix.
# To build a confusion matrix, you compare a list of the known values with a list of the values the model predicted.
# For example, consider a confusion matrix that demonstrates the performance of a model on the fraudulent transaction dataset.
# This confusion matrix shows the comparison between the actual and the predicted values.
# But, it splits these values according to whether the predictions were positive (true, predicting fraud) or negative (false, predicting no fraud).
# Let's say, in our case, the intersection of 'Actually True' and 'Predicted True' shows that the model predicted 30 TPs.
# That is, of all the transactions that the model predicted as fraudulent, 30 of them actually were.
# This gets us to the precision.
# PRECISION, also known as the POSITIVE PREDICTIVE VALUE (PPV), measures how confident we are that the model correctly made the positive predictions.
# We get the precision by dividing the number of TPs by the number of all the predicted positives. (The latter is the sum of the TPs and the FPs).
# The formula for precision is as follows:
    # precision = TPs / (TPs + FPs)
# Let's say that aside from the 30 TPs, the model made 20 FPs - that is, it flagged 20 transactions as fraudulent that turned out to be legitimate.
# The precision would therefore be the following calculation:
    # precision = 30 / (30 + 20) = 30 / 50 = 0.6
# Note that we use the cells in the 'Predicted True' column to calculate the precision.
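# The same calculation can be written as a small function.
# This is a sketch: `precision` is a hypothetical helper, and returning 0.0 when the model makes no positive predictions is an assumption, not a rule from this lesson.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return 0.0  # assumption: report 0.0 when there are no positive predictions
    return tp / predicted_positives


# The lesson's example: 30 TPs and 20 FPs.
example_precision = precision(tp=30, fp=20)
print(example_precision)
```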
# To summarize, in machine learning, the precision measures the reliability of positive classification.
# Asking yourself the following question might help you remember how to think about the precision:
    # I know the model just predicted a fraudulent transaction, but, how likely is it that the transaction is actually fraudulent?    

In [8]:
# RECALL
# Another way that we can assess a model's performance is by using the recall, which is also called the sensitivity.
# People in machine learning more commonly use the term recall.
# In our example, the RECALL measures the proportion of actually fraudulent transactions that the model correctly classified as fraudulent.
# To understand the recall, ask yourself the following question:
    # I know that this transaction is fraudulent, but how likely is it that the model will predict it as fraudulent?
# The formula for the recall is as follows:
    # recall = TPs / (TPs + FNs)
# To get the recall, we start with the number of TPs - that is, the number of times that the model correctly predicted a fraudulent transaction.
# We then compare this number to the total number of actually fraudulent transactions - including the ones that the model missed (that is, the FNs).
# Let's explain this by using the same confusion matrix as before.
# Let's say that in our hypothetical confusion matrix, the 'Actually True' row contains the 30 TPs and 10 FNs.
# The recall is thus as follows:
    # recall = 30 / (30 + 10) = 30 / 40 = 0.75
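# A minimal sketch of the recall calculation, mirroring the formula above.
# `recall` is a hypothetical helper, and returning 0.0 when there are no actual positives is an assumption made for the example.

```python
def recall(tp, fn):
    """Fraction of actual positives that the model correctly identified."""
    actual_positives = tp + fn
    if actual_positives == 0:
        return 0.0  # assumption: report 0.0 when there are no actual positives
    return tp / actual_positives


# The lesson's example: 30 TPs and 10 FNs.
example_recall = recall(tp=30, fn=10)
print(example_recall)
```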

In [9]:
# IMPORTANT
# A fundamental tension exists between the precision and the recall.
# Highly sensitive tests and algorithms tend to be aggressive.
# They flag cases readily, so they effectively detect the intended targets.
# But, this means that they also risk producing numerous false positives.
# High precision, by contrast, usually results from a conservative process.
# In this case, the predicted positives are likely actual positives.
# But, the model might miss many other actual positives, which become false negatives.
# In practice, we need to strike a balance between the recall and the precision.
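# One way to see this tension is to vary the cutoff at which a model flags a transaction.
# The sketch below assumes a model that outputs a fraud score between 0 and 1; the scores, labels, and the helper `precision_recall_at` are all invented for illustration.

```python
# Hypothetical (fraud_score, is_fraud) pairs; the scores are invented for illustration.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.40, True), (0.30, False), (0.20, False), (0.10, False), (0.05, False),
]


def precision_recall_at(threshold, scored_transactions):
    """Flag every transaction scoring at or above the threshold, then score the flags."""
    tp = sum(1 for s, is_fraud in scored_transactions if s >= threshold and is_fraud)
    fp = sum(1 for s, is_fraud in scored_transactions if s >= threshold and not is_fraud)
    fn = sum(1 for s, is_fraud in scored_transactions if s < threshold and is_fraud)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A low threshold is aggressive; a high threshold is conservative.
for threshold in (0.25, 0.50, 0.85):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

# With this made-up data, raising the threshold increases the precision while the recall falls.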

In [13]:
# F1 SCORE
# We can characterize the F1 SCORE, which is the HARMONIC MEAN of the precision and the recall, as a single summary statistic for the two metrics.
# The formula for the F1 score is as follows:
    # F1 = 2 * (precision * recall) / (precision + recall)
# The F1 score for our fraudulent transaction classifier is thus as follows:
    # F1 = 2 * (0.6 * 0.75) / (0.6 + 0.75) = 0.9 / 1.35 = 0.67 (rounded)
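# The F1 formula can be checked with a short sketch that uses the precision and recall values from the earlier sections.
# `f1` is a hypothetical helper, and returning 0.0 when both inputs are zero is an assumption made for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: report 0.0 when both inputs are zero
    return 2 * (precision * recall) / (precision + recall)


# Precision 0.6 and recall 0.75 from the lesson's fraud classifier example.
example_f1 = f1(0.6, 0.75)
print(round(example_f1, 2))
```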
# To illustrate the F1 score, say that 10 transactions out of 10,000 are fraudulent.
# What does the F1 score tell us in this scenario?
# Note that a large class imbalance exists between the transactions that are fraudulent and those that aren't.
# That is, we have far fewer transactions that are fraudulent than those that aren't.
# Our model might have a high F1 score, but that can prove deceptive: if the score is calculated for the larger class, it shows only that the model does a good job of predicting that class.
# The larger class consists of 9,990 transactions that aren't fraudulent.
# However, we're interested in the smaller class - that is the fraudulent transactions that the model correctly identified.
# Consider a model that we trained for such a dataset.
# Treating the larger class ('not fraudulent') as the positive class, the precision of this model is 0.99939 (9900 / 9906).
# The recall is 0.99099 (9900 / 9990).
# Plugging these values into the formula, 2 * (0.99939 * 0.99099) / (0.99939 + 0.99099), we arrive at an F1 score of 0.995.
# An F1 score of 1 is considered perfect, so by this measure, the model appears to have performed well.
# In fact, this model excels at predicting the larger class - but we're not interested in that.
# The model correctly identified the fraudulent transactions only 40% of the time: of the 10 fraudulent transactions, it caught 4 and missed 6.
# With 'not fraudulent' as the positive class, those are 4 TNs and 6 FPs, so 4 TNs / (6 FPs + 4 TNs) = 0.4.
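# To make the contrast concrete, the sketch below computes the metrics twice from the same counts: once with 'not fraudulent' as the positive class and once with 'fraudulent' as the positive class.
# The four counts are the ones implied by the ratios in this example (9900 / 9906, 9900 / 9990, and 4 of 10 frauds caught); `precision_recall_f1` and the variable names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one choice of positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts implied by the example: 9,990 legitimate and 10 fraudulent transactions.
legit_as_legit = 9900  # legitimate, predicted legitimate
legit_as_fraud = 90    # legitimate, predicted fraudulent
fraud_as_legit = 6     # fraudulent, predicted legitimate
fraud_as_fraud = 4     # fraudulent, predicted fraudulent

# Treating "not fraudulent" (the larger class) as the positive class.
majority = precision_recall_f1(tp=legit_as_legit, fp=fraud_as_legit, fn=legit_as_fraud)

# Treating "fraudulent" (the smaller class) as the positive class.
minority = precision_recall_f1(tp=fraud_as_fraud, fp=legit_as_fraud, fn=fraud_as_legit)

print("not fraudulent: precision=%.5f recall=%.5f f1=%.3f" % majority)
print("fraudulent:     precision=%.5f recall=%.5f f1=%.3f" % minority)
```

# The F1 score for the larger class is close to 1, while the F1 score for the fraudulent class is far lower, which is why the class of interest should be the one that we evaluate.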

In [None]:
# ON THE JOB
# If you want to discuss the predictive power of your model, you need to make sure that it will do as you claim.
# You need empirical data to show that your model does what you say it does.
# Evaluating the model's performance and explaining its limitations are fundamental to instilling confidence in a team of coworkers that your model works.