# Assignment 1 - Binary Classification Evaluation Metrics

**Objective:**
The objective of this assignment is to assess your understanding of fundamental concepts in model evaluation for machine learning tasks. This assignment covers topics discussed in the first half of the course, including key evaluation metrics, confusion matrices, ROC curves, and Precision-Recall curves.

**Instructions:**

1. Theory Questions:
Answer the following theoretical questions:

    1. Explain the limitations of accuracy as an evaluation metric in imbalanced datasets. How does accuracy behave when classes are heavily skewed, and why might it provide misleading results?
    2. Describe the purpose and interpretation of a confusion matrix. How does it help in assessing a classification model's performance?
    3. Explain the concept of ROC curves. What does each point on an ROC curve represent? How is the area under the ROC curve (AUC-ROC) calculated?
    4. Compare and contrast the advantages and disadvantages of ROC curves and Precision-Recall curves. In what scenarios would you prefer to use one over the other, and why?

2. Practical Exercises:
    * Implement Python code to calculate log loss for a given binary classification problem.
    * Select the best metric for an applied scenario.

**Submission Guidelines:**
* Submit your responses to the theory questions in neatly organized Markdown.
* Include your Python code for the practical exercises.
* Submit your assignment as a single `.ipynb` file named `MY NAME Assignment 1 - Log Loss` via the course submission platform (Slack).

## Part 1: Theory Questions (20 points)
Provide your answers here:

### 1. Limitations of Accuracy as an Evaluation Metric in Imbalanced Datasets

Accuracy is limited in imbalanced datasets because it simply measures the proportion of correct predictions among all predictions, without considering class distribution. When classes are heavily skewed, accuracy becomes misleading for several reasons:

- The majority class dominates the accuracy calculation, causing the metric to be insensitive to performance on minority classes
- A naive classifier that always predicts the majority class will achieve high accuracy without providing any useful information about minority classes
- Accuracy fails to reflect the model's ability to correctly identify positive cases in scenarios where the positive class is rare but important (like fraud detection or disease diagnosis)

For example, in a dataset with 98% negative cases and 2% positive cases, a model that always predicts "negative" would achieve 98% accuracy while completely failing to detect any positive cases.
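
The 98%/2% example can be checked with a few lines of Python. This is a minimal sketch on made-up labels (1,000 instances is an arbitrary size chosen for illustration), with an "always negative" classifier standing in for the naive model:

```python
# Hypothetical data: 1,000 instances, 2% positive (the 98%/2% split described above).
y_true = [1] * 20 + [0] * 980

# A naive classifier that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")  # 0.98
print(f"Recall:   {recall:.2f}")    # 0.00
```

Accuracy looks excellent while recall shows that no positive case was found, which is the gap the paragraph above describes.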

### 2. Purpose and Interpretation of a Confusion Matrix

A confusion matrix is a table that describes the performance of a classification model by showing the counts of:
- True Positives (TP): Correctly predicted positive cases
- True Negatives (TN): Correctly predicted negative cases
- False Positives (FP): Negative cases incorrectly predicted as positive (Type I error)
- False Negatives (FN): Positive cases incorrectly predicted as negative (Type II error)

The matrix helps assess model performance by enabling the calculation of various metrics:
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Precision: TP / (TP + FP)
- Recall/Sensitivity: TP / (TP + FN)
- Specificity: TN / (TN + FP)
- F1-Score: Harmonic mean of precision and recall

Confusion matrices provide a comprehensive view of model performance across all classes, revealing where the model excels and where it struggles, particularly in identifying specific error types.
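
As a small illustration, the metrics listed above can be derived directly from the four counts. The helper name `confusion_metrics` and the counts passed to it are hypothetical, chosen only to show the arithmetic:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive common metrics from confusion-matrix counts.

    Ratios with an empty denominator are reported as 0.0 (a convention
    assumed here, not something the formulas themselves define).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced problem (20 positives, 980 negatives).
metrics = confusion_metrics(tp=15, tn=950, fp=30, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.965 even though precision is only about 0.333, which shows why the individual cells matter more than the single accuracy figure.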

### 3. Concept of ROC Curves

ROC (Receiver Operating Characteristic) curves plot the True Positive Rate (TPR, also called recall or sensitivity) against the False Positive Rate (FPR, or 1-specificity) at various classification thresholds.

Each point on an ROC curve represents a specific threshold value for classifying predictions as positive or negative. Moving along the curve corresponds to changing this threshold:
- Points toward the upper-left corner indicate better classification performance
- The diagonal line represents random guessing

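The Area Under the ROC Curve (AUC-ROC) is calculated by integrating the ROC curve (TPR as a function of FPR) from FPR = 0 to FPR = 1; in practice the integral is approximated with the trapezoidal rule over the curve's points. It quantifies the overall ability of the model to discriminate between positive and negative classes: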
- AUC-ROC of 1.0 indicates perfect classification
- AUC-ROC of 0.5 indicates performance equivalent to random guessing
- AUC-ROC values below 0.5 suggest worse-than-random performance

AUC-ROC can be interpreted as the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
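
The sketch below builds ROC points by sweeping the threshold, integrates them with the trapezoidal rule, and compares the result with the ranking interpretation. The labels and scores are invented for illustration, and the three helper functions are hypothetical names rather than a library API:

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step marks more instances as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(auc_trapezoid(roc_points(y_true, scores)))
print(auc_by_ranking(y_true, scores))
```

Both calls give 8/9 (about 0.889) for this data: of the nine positive/negative pairs, only the positive scored 0.6 against the negative scored 0.7 is ranked in the wrong order.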

### 4. ROC Curves vs. Precision-Recall Curves

**ROC Curves:**
- Advantages:
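  - TPR and FPR are each computed within a single class, so the curve itself does not change when the class ratio changes, which makes it convenient for comparing models
  - Provides a visual representation of the trade-off between TPR and FPR
  - AUC-ROC summarizes performance across all thresholds instead of depending on one chosen threshold

- Disadvantages:
  - Can look overly optimistic on highly imbalanced datasets
  - FPR can be misleadingly low when the negative class is large: a small FPR can still correspond to many false positives relative to the number of true positives
  - Less informative when the focus is on positive class performance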

**Precision-Recall Curves:**
- Advantages:
  - Better suited for imbalanced datasets
  - Focuses on the performance of the positive (typically minority) class
  - More sensitive to improvements in positive class detection
  - Directly shows the trade-off between precision and recall
  
- Disadvantages:
  - More sensitive to changes in class distribution
  - Not suitable for comparing models across datasets with different class distributions
  - Does not account for true negative performance

**When to prefer each:**

Use ROC curves when:
- Class distribution is relatively balanced
- Both classes are equally important
- You need to compare models across different datasets
- False positive and false negative costs are similar

Use Precision-Recall curves when:
- Datasets are highly imbalanced
- The positive/minority class is of primary interest
- False positives are especially costly or concerning
- Working in domains like information retrieval, anomaly detection, or medical diagnosis where finding rare positive cases is crucial
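
To make the imbalance argument concrete, the following sketch evaluates the same scores at a fixed threshold of 0.5, first on a balanced sample and then with the negatives repeated ten times. All scores are invented, and `rates_at_threshold` is a hypothetical helper:

```python
def rates_at_threshold(y_true, scores, threshold=0.5):
    """Return (TPR, FPR, precision) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn), tp / (tp + fp)


# Hypothetical model scores for positive and negative instances.
pos_scores = [0.9, 0.8, 0.7, 0.3]
neg_scores = [0.6, 0.4, 0.2, 0.1]

# Balanced sample: 4 positives, 4 negatives.
balanced = rates_at_threshold([1] * 4 + [0] * 4, pos_scores + neg_scores)

# Imbalanced sample: same positives, negatives repeated 10 times (4 vs 40).
imbalanced = rates_at_threshold([1] * 4 + [0] * 40, pos_scores + neg_scores * 10)

print("balanced   (TPR, FPR, precision):", balanced)
print("imbalanced (TPR, FPR, precision):", imbalanced)
```

TPR and FPR stay at 0.75 and 0.25 in both cases, so this ROC point does not move, while precision falls from 0.75 to 3/13 (about 0.23). That drop is what a Precision-Recall curve exposes and an ROC curve hides.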

## Practicing Log Loss (25 points)

**Objective:**
The objective of this assignment is to deepen your understanding of log loss, also known as logarithmic loss or cross-entropy loss, and its application in evaluating the performance of classification models.

**Instructions:**
In this assignment, you will be given a set of binary classification predictions along with their corresponding actual class labels. Your task is to calculate the log loss for each prediction and then analyze the overall log loss performance of the model.

**Dataset:**
You are provided with a dataset containing the following information:

* Predicted probabilities for the positive class (ranging from 0 to 1) for a set of instances.
* Actual binary class labels (0 or 1) indicating whether the instance belongs to the positive class or not.

**Assignment Tasks:**
1. Calculate the log loss for each instance in the dataset using the predicted probabilities and actual class labels.
2. Summarize the individual log losses and compute the overall log loss performance for the model.
3. Interpret the overall log loss value and analyze the model's performance. Discuss any insights or observations derived from the log loss analysis.
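
For reference, tasks 1 and 2 correspond to the standard binary log loss, written here with the natural logarithm (an assumption about the intended base). With $y_i \in \{0, 1\}$ the actual label and $p_i$ the predicted probability of the positive class for instance $i$:

$$
\ell_i = -\bigl[\, y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \,\bigr],
\qquad
\text{LogLoss} = \frac{1}{N} \sum_{i=1}^{N} \ell_i
$$

Only one of the two terms is active for any instance. For Instance 1 in the table below ($y = 1$, $p = 0.9$), $\ell_1 = -\ln 0.9 \approx 0.105$; for Instance 4 ($y = 0$, $p = 0.8$), $\ell_4 = -\ln(1 - 0.8) \approx 1.609$.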


**Dataset:**

| Instance | Predicted Probability | Actual Label |
|----------|------------------------|--------------|
|    1     |          0.9           |       1      |
|    2     |          0.3           |       0      |
|    3     |          0.6           |       1      |
|    4     |          0.8           |       0      |
|    5     |          0.1           |       1      |


**Grading Criteria:**

* Correctness of log loss calculations.
* Clarity and completeness of the analysis.
* Insights derived from the log loss interpretation.
* Overall presentation and adherence to submission guidelines.

In [1]:
import pandas as pd

# Create a DataFrame with the dataset
data = {
    'Instance': [1, 2, 3, 4, 5],
    'Predicted Probability': [0.9, 0.3, 0.6, 0.8, 0.1],
    'Actual Label': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

   Instance  Predicted Probability  Actual Label
0         1                    0.9             1
1         2                    0.3             0
2         3                    0.6             1
3         4                    0.8             0
4         5                    0.1             1


In [7]:
# Calculate the per-instance and overall log loss
import numpy as np


def log_loss(y_true, y_pred):
    """Calculates the log loss.

    Args:
        y_true: The true binary labels (0 or 1).
        y_pred: The predicted probabilities for the positive class (between 0 and 1).

    Returns:
        The log loss value.
    """
    epsilon = 1e-15  # Small value to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)  # Keep probabilities strictly inside (0, 1)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss


predicted_probabilities = df['Predicted Probability']
actual_labels = df['Actual Label']

# Log loss of each instance (the function is applied to one label/probability pair at a time)
log_loss_values = [log_loss(y_true, y_pred) for y_true, y_pred in zip(actual_labels, predicted_probabilities)]

# The overall log loss is the mean of the per-instance values
overall_log_loss = np.mean(log_loss_values)
print(f"Log Loss Values: {log_loss_values}")
print(f"Overall Log Loss: {overall_log_loss}")

Log Loss Values: [np.float64(0.10536051565782628), np.float64(0.35667494393873245), np.float64(0.5108256237659907), np.float64(1.6094379124341005), np.float64(2.3025850929940455)]
Overall Log Loss: 0.976976817758139


*Question: Interpret the log loss above. How would it change if the predicted probability for the first instance (Instance 1, row index 0) changed from 0.9 to 0.6? Why?*

*Your answer:*

**Overall log loss (0.977):** This value is the average log loss across all five instances in the dataset. Lower is better, and 0 would mean every prediction was both correct and fully confident. For reference, a model that always predicts 0.5 scores ln 2 ≈ 0.693, so 0.977 is worse than that uninformative baseline and the model's predictions have clear room for improvement. The two confident mistakes (Instances 4 and 5) account for most of the total.

**Individual log loss values:**

- Instance 1 (0.1054): A confident and accurate prediction. The model predicted a high probability (0.9) for the positive class, which matched the actual label (1).
- Instance 2 (0.3567): A reasonably good prediction. The model predicted a low probability (0.3) for the positive class, and the actual label was 0.
- Instance 3 (0.5108): A correct but less confident prediction. The model predicted a moderate probability (0.6) for the positive class, so the penalty is larger than for Instance 1.
- Instance 4 (1.6094): A poor prediction. The model predicted a high probability (0.8) for the positive class, but the actual label was 0, resulting in a significant penalty.
- Instance 5 (2.3026): A severely incorrect prediction. The model predicted a low probability (0.1) for the positive class while the actual label was 1, leading to the largest penalty.

**If the predicted probability for Instance 1 changed from 0.9 to 0.6:**

- The log loss for Instance 1 would increase from -ln(0.9) ≈ 0.105 to -ln(0.6) ≈ 0.511. The prediction would still be classified correctly at a 0.5 threshold, but 0.6 is further from the true label (1) than 0.9, and log loss penalizes that loss of confidence.
- The overall log loss would increase from about 0.977 to about 1.058, because it is the average of the individual values and one of them has grown by roughly 0.405 (0.405 / 5 ≈ 0.081).
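
The effect of changing Instance 1 from 0.9 to 0.6 can be checked numerically. This is a minimal sketch using only the standard library and the five rows from the table; the helper names are hypothetical, and no clipping is applied because none of the probabilities is exactly 0 or 1:

```python
import math


def instance_log_loss(y_true, p):
    """Log loss of a single prediction (natural logarithm)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def mean_log_loss(labels, probs):
    """Average of the per-instance log losses."""
    losses = [instance_log_loss(y, p) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


labels = [1, 0, 1, 0, 1]
original = [0.9, 0.3, 0.6, 0.8, 0.1]
modified = [0.6, 0.3, 0.6, 0.8, 0.1]  # Instance 1 changed from 0.9 to 0.6

before = mean_log_loss(labels, original)
after = mean_log_loss(labels, modified)

print(f"Instance 1: {instance_log_loss(1, 0.9):.4f} -> {instance_log_loss(1, 0.6):.4f}")
print(f"Overall:    {before:.4f} -> {after:.4f}")
# Instance 1: 0.1054 -> 0.5108
# Overall:    0.9770 -> 1.0581
```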

*Question: Why might you select log loss over precision, recall, or accuracy (in the context of any problem, not this one specifically)?*

*Your answer:*
- **When prediction confidence is crucial:** Log loss evaluates how confident the model is, not only whether it is right. In medical diagnosis or financial fraud detection, for example, it matters both whether a prediction is correct and how certain the model is about it.
- **When dealing with probabilistic models:** If the model outputs probabilities rather than class labels (e.g., logistic regression, neural networks), log loss is a natural choice because it works directly with those probabilities and needs no decision threshold.
- **When optimizing model training:** Log loss is often used as the objective function during training, because it is continuous and differentiable and can be minimized effectively by gradient-based optimization algorithms.
- **When dealing with imbalanced datasets:** Log loss can be more informative than accuracy. A model that confidently predicts the majority class for everything is penalized heavily on each minority instance, whereas its accuracy still looks high. The average is still dominated by the majority class, so log loss is best read alongside recall or precision.
- **When predicted probabilities drive a ranking or prioritization:** Log loss rewards well-calibrated probabilities, which is useful when items are ordered by predicted relevance or risk. It does not measure ranking quality directly, however; AUC-ROC is the metric that captures whether positive instances are ranked above negative ones.
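
A short sketch of the first two points: two models can have identical accuracy while log loss separates them by confidence. The labels and both sets of probabilities are invented, and the helper functions are hypothetical:

```python
import math


def mean_log_loss(labels, probs):
    """Average binary log loss (natural logarithm)."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def accuracy_at_half(labels, probs):
    """Accuracy when predicting positive for probabilities of at least 0.5."""
    hits = sum((p >= 0.5) == (y == 1) for y, p in zip(labels, probs))
    return hits / len(labels)


labels = [1, 0, 1, 0]
confident_model = [0.95, 0.05, 0.90, 0.10]  # correct and sure
hesitant_model = [0.55, 0.45, 0.60, 0.40]   # correct but barely

for name, probs in [("confident", confident_model), ("hesitant", hesitant_model)]:
    print(name, accuracy_at_half(labels, probs), round(mean_log_loss(labels, probs), 4))
```

Both models classify every instance correctly, so accuracy, precision, and recall are all 1.0, yet the hesitant model's log loss (about 0.554) is several times the confident model's (about 0.078).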

## Application Scenario: Select a Metric (55 points)

**Application Scenario: Fraud Detection System**

You are working as a data scientist for a financial institution that wants to develop a fraud detection system to identify potentially fraudulent transactions. The dataset contains information about various transactions, including transaction amount, merchant ID, and transaction type. Your task is to build a machine learning model to classify transactions as either fraudulent or non-fraudulent.

**Problem Description:**

* Dataset: The dataset consists of historical transaction data, with labels indicating whether each transaction was fraudulent or not.
* Class Distribution: The dataset is highly imbalanced: most transactions are legitimate (non-fraudulent), and only a small percentage are fraudulent.
* Objective: The objective is to develop a fraud detection model that minimizes false negatives (fraudulent transactions incorrectly classified as non-fraudulent) while maintaining a reasonable level of precision.

**Stakeholder Requirements:**
Recall (sensitivity) is the priority: as many fraudulent transactions as possible must be detected, because a missed fraud (false negative) is the most costly error. Precision still matters, since false positives trigger unnecessary investigations of legitimate transactions.

**Task:**
Your task is to develop Python code to evaluate the performance of different machine learning models using various evaluation metrics, including accuracy, precision, recall, and F2 score. *Select the evaluation metric that best suits the problem and explain your choice*.

**Additional Guidelines:**
* You should preprocess the dataset as needed and split it into training and testing sets.
* Implement machine learning models of your choice (e.g., logistic regression, random forest) and evaluate their performance.
* Use appropriate evaluation metrics for binary classification tasks.
* Discuss the rationale behind your choice of evaluation metric and how it aligns with the problem requirements.
* Present your findings and recommendations for selecting the best model based on the chosen evaluation metric.
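
One way to see how the choice of metric changes the recommendation is to compare two candidate models from their confusion-matrix counts. The counts below are hypothetical (1,000 transactions, 50 of them fraudulent), and `f_beta` is an illustrative helper implementing the usual F-beta formula, in which beta = 2 weights recall more heavily than precision:

```python
def f_beta(precision, recall, beta):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


def summarize(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F2 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f2": f_beta(precision, recall, beta=2),
    }


# Hypothetical results: model A is cautious, model B flags more transactions.
model_a = summarize(tp=20, fp=5, fn=30, tn=945)
model_b = summarize(tp=45, fp=40, fn=5, tn=910)

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Accuracy slightly favors model A (0.965 vs 0.955) even though it misses 30 of the 50 frauds, while F2 clearly favors model B (about 0.789 vs 0.444). Under the stated requirement of minimizing false negatives at reasonable precision, a recall-weighted score such as F2 ranks the models in the order the stakeholders would.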

**Dataset Sample:**

| Transaction ID | Transaction Amount | Merchant ID | Transaction Type | Fraudulent |
|----------------|--------------------|-------------|------------------|------------|
| 1              | 1000               | M123        | Online Purchase  | 0          |
| 2              | 500                | M456        | ATM Withdrawal   | 0          |
| 3              | 2000               | M789        | Online Purchase  | 1          |
| 4              | 1500               | M123        | POS Transaction  | 0          |
| 5              | 800                | M456        | Online Purchase  | 0          |
| 6              | 3000               | M789        | ATM Withdrawal   | 1          |

* Transaction ID: Unique identifier for each transaction.
* Transaction Amount: The amount of money involved in the transaction.
* Merchant ID: Identifier for the merchant involved in the transaction.
* Transaction Type: The type of transaction (e.g., online purchase, ATM withdrawal, POS transaction).
* Fraudulent: Binary indicator (0 or 1) specifying whether the transaction is fraudulent (1) or not (0).

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, fbeta_score

# Creating the dataset
data = {
    'Transaction ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                       11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                       21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                       31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
    'Transaction Amount': [1000, 500, 2000, 1500, 800, 3000, 1200, 700, 1800, 1300,
                           900, 400, 2200, 1600, 850, 2800, 1100, 600, 1900, 1400,
                           950, 300, 2100, 1700, 820, 3200, 1250, 720, 1850, 1350,
                           880, 420, 2400, 1750, 830, 3100, 1150, 620, 1950, 1450],
    'Merchant ID': ['M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123',
                    'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456',
                    'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123',
                    'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456'],
    'Transaction Type': ['Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction', 'Online Purchase',
                         'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase',
                         'POS Transaction', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction', 'Online Purchase',
                         'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase',
                         'POS Transaction', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal'],
    'Fraudulent': [0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
                   0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
                   1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
                   0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
}

# Creating DataFrame
df = pd.DataFrame(data)

# One-hot encoding for categorical features
df = pd.get_dummies(df, columns=['Merchant ID', 'Transaction Type'], drop_first=True)

# Splitting the dataset
X = df.drop(['Fraudulent', 'Transaction ID'], axis=1)
y = df['Fraudulent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f2_score = fbeta_score(y_test, y_pred, beta=2)

# Printing the results
print("Precision:", precision)
print("Recall:", recall)
print("F2 Score:", f2_score)


Precision: 0.0
Recall: 0.0
F2 Score: 0.0
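
With 40 rows and `test_size=0.2`, the scores above are computed on only 8 test transactions, so they are very unstable. The sketch below shows one possible evaluation setup on a larger sample: a stratified split, two candidate models, and all four metrics named in the task. The data comes from `make_classification` and is purely synthetic (an assumption standing in for real transactions), so the numbers it prints say nothing about the assignment's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the assignment's transactions):
# 2,000 rows with roughly 5% positives to mimic class imbalance.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# stratify=y keeps the positive rate nearly identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}

results = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_hat),
        # zero_division=0 returns 0 instead of warning when a ratio is undefined.
        "precision": precision_score(y_test, y_hat, zero_division=0),
        "recall": recall_score(y_test, y_hat, zero_division=0),
        "f2": fbeta_score(y_test, y_hat, beta=2, zero_division=0),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

Models can then be compared on the `f2` column, with recall and precision reported alongside it to show where each model's errors fall.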


In [None]:
# YOUR CODE HERE