# Assignment 1 - Binary Classification Evaluation Metrics

**Objective:**
The objective of this assignment is to assess your understanding of fundamental concepts in model evaluation for machine learning tasks. This assignment covers topics discussed in the first half of the course, including key evaluation metrics, confusion matrices, ROC curves, and Precision-Recall curves.

**Instructions:**

1. Theory Questions:
Answer the following theoretical questions:

    1. Explain the limitations of accuracy as an evaluation metric in imbalanced datasets. How does accuracy behave when classes are heavily skewed, and why might it provide misleading results?
    2. Describe the purpose and interpretation of a confusion matrix. How does it help in assessing a classification model's performance?
    3. Explain the concept of ROC curves. What does each point on an ROC curve represent? How is the area under the ROC curve (AUC-ROC) calculated?
    4. Compare and contrast the advantages and disadvantages of ROC curves and Precision-Recall curves. In what scenarios would you prefer to use one over the other, and why?

2. Practical Exercises:
* Implement Python code to calculate the following evaluation metrics for a given binary classification problem: Log Loss
* Select the best metric for an applied scenario

**Submission Guidelines:**
* Submit your responses to the theory questions in a neatly organized markdown.
* Include your Python code for the practical exercise.
* Submit your assignment as a single `.ipynb` file named `MY NAME Assignment 1 - Log Loss` via the course submission platform (slack).

## Part 1: Theory Questions (20 points)
Provide your answers here:

    1. Accuracy measures the percentage of correct predictions made by the model. However, in imbalanced datasets (where one class is much more common than the other), accuracy can be misleading. If, say, 95% of samples are type A and only 5% are type B, a model that always predicts type A will be 95% accurate yet never detects the minority class. Accuracy also does not distinguish false positives from false negatives.
    2. A confusion matrix is a 2x2 table for binary classification problems. It shows how many predictions fall into each category: predicted positive or negative versus actual positive or negative (true positives, false positives, false negatives, and true negatives). It helps to visualize performance and allows calculation of key metrics like precision, recall, F1-score, and specificity.
    3. The Receiver Operating Characteristic (ROC) curve plots the true positive rate (recall) against the false positive rate. Each point on the ROC curve shows the model's performance at a specific decision threshold. The AUC (area under the curve) is the area under the ROC curve, typically computed by numerical integration (for example, the trapezoidal rule) over the points obtained at each threshold. It ranges from 0 to 1, and higher is better. AUC = 0.5 is equivalent to random guessing, and AUC = 1 is perfect classification.
    4. ROC curves work best for balanced datasets but can be misleading when the dataset is imbalanced: the false positive rate can look low even when there are many false positives, because it is computed relative to a large number of true negatives. Precision-Recall curves work best for imbalanced datasets, as they show how well the model detects positives without being misled by the many negatives.
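
The answers above can be made concrete with a small sketch. The labels and scores below are hypothetical, chosen only to mirror the 95%/5% example and to give a tiny ROC curve; the variable names are illustrative and not part of the assignment. The first part counts the confusion-matrix cells for a model that always predicts the majority class, and the second part computes AUC-ROC with the trapezoidal rule, assuming there are no tied scores.

```python
import numpy as np

# --- Part 1: accuracy on an imbalanced dataset (hypothetical 95/5 split) ---
y_true = np.array([0] * 95 + [1] * 5)   # 0 = type A (majority), 1 = type B (minority)
y_pred = np.zeros_like(y_true)          # a "model" that always predicts type A

# Confusion-matrix cells, counted directly from their definitions.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)        # share of actual positives that were detected
specificity = tn / (tn + fp)   # share of actual negatives that were detected

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy: {accuracy:.2f}  Recall: {recall:.2f}  Specificity: {specificity:.2f}")

# --- Part 2: AUC-ROC via the trapezoidal rule (hypothetical scores, no ties) ---
roc_labels = np.array([1, 1, 0, 1, 0, 0])
roc_scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])

# Sort by descending score; lowering the threshold admits one more instance each step.
order = np.argsort(-roc_scores)
sorted_labels = roc_labels[order]

# Cumulative true/false positive counts, starting from the "predict nothing" point.
tps = np.concatenate([[0], np.cumsum(sorted_labels)])
fps = np.concatenate([[0], np.cumsum(1 - sorted_labels)])
tpr = tps / sorted_labels.sum()
fpr = fps / (1 - sorted_labels).sum()

# Area under the piecewise-linear curve through the (FPR, TPR) points.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Equivalent view: the fraction of (positive, negative) pairs ranked correctly.
pos = roc_scores[roc_labels == 1]
neg = roc_scores[roc_labels == 0]
pairwise = float(np.mean(pos[:, None] > neg[None, :]))

print(f"AUC (trapezoid): {auc:.4f}  AUC (pairwise ranking): {pairwise:.4f}")
```

In the first part the accuracy is high while recall is zero, which is the failure mode described in answer 1. In the second part the two AUC values agree, which shows why AUC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative.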

## Practicing Log Loss (25 Points)

**Objective:**
The objective of this assignment is to deepen your understanding of log loss, also known as logarithmic loss or cross-entropy loss, and its application in evaluating the performance of classification models.

**Instructions:**
In this assignment, you will be given a set of binary classification predictions along with their corresponding actual class labels. Your task is to calculate the log loss for each prediction and then analyze the overall log loss performance of the model.

**Dataset:**
You are provided with a dataset containing the following information:

Predicted probabilities for the positive class (ranging from 0 to 1) for a set of instances.
Actual binary class labels (0 or 1) indicating whether the instance belongs to the positive class or not.

**Assignment Tasks:**
1. Calculate the log loss for each instance in the dataset using the predicted probabilities and actual class labels.
2. Summarize the individual log losses and compute the overall log loss performance for the model.
3. Interpret the overall log loss value and analyze the model's performance. Discuss any insights or observations derived from the log loss analysis.


**Dataset:**

| Instance | Predicted Probability | Actual Label |
|----------|------------------------|--------------|
|    1     |          0.9           |       1      |
|    2     |          0.3           |       0      |
|    3     |          0.6           |       1      |
|    4     |          0.8           |       0      |
|    5     |          0.1           |       1      |


**Grading Criteria:**

* Correctness of log loss calculations.
* Clarity and completeness of the analysis.
* Insights derived from the log loss interpretation.
* Overall presentation and adherence to submission guidelines.

In [4]:
import pandas as pd
import numpy as np

# Create a DataFrame with the dataset
data = {
    'Instance': [1, 2, 3, 4, 5],
    'Predicted Probability': [0.9, 0.3, 0.6, 0.8, 0.1],
    'Actual Label': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original Data:")
print(df)

# --- Add this part below ---
# Define a function to compute log loss for one prediction
def log_loss(p, y):
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Apply the log loss function to each row
df['Log Loss'] = df.apply(lambda row: log_loss(row['Predicted Probability'], row['Actual Label']), axis=1)

# Calculate overall log loss
overall_log_loss = df['Log Loss'].mean()

# Display the updated DataFrame with Log Loss
print("\nData with Log Loss:")
print(df)

# Print the overall Log Loss
print(f"\nOverall Log Loss: {overall_log_loss:.4f}")


Original Data:
   Instance  Predicted Probability  Actual Label
0         1                    0.9             1
1         2                    0.3             0
2         3                    0.6             1
3         4                    0.8             0
4         5                    0.1             1

Data with Log Loss:
   Instance  Predicted Probability  Actual Label  Log Loss
0         1                    0.9             1  0.105361
1         2                    0.3             0  0.356675
2         3                    0.6             1  0.510826
3         4                    0.8             0  1.609438
4         5                    0.1             1  2.302585

Overall Log Loss: 0.9770
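
As a cross-check on the cell above, the same calculation can be written in vectorized NumPy and compared against a simple baseline. This is a sketch: `binary_log_loss` is an illustrative helper name, and the baseline (predicting 0.5 for every instance) is an assumption added here to give the overall value some context.

```python
import numpy as np

# Same five instances as in the assignment table.
p = np.array([0.9, 0.3, 0.6, 0.8, 0.1])
y = np.array([1, 0, 1, 0, 1])

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Per-instance binary log loss; probabilities are clipped to avoid log(0)."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

per_instance = binary_log_loss(y, p)
model_loss = float(per_instance.mean())

# Baseline: an uninformative model that outputs 0.5 everywhere scores ln(2).
baseline_loss = float(binary_log_loss(y, np.full_like(p, 0.5)).mean())

print("Per-instance log loss:", np.round(per_instance, 6))
print(f"Model log loss:    {model_loss:.4f}")
print(f"Baseline log loss: {baseline_loss:.4f}")
```

On this data the model's mean log loss is higher than the 0.5 baseline, driven mainly by the two confidently wrong predictions (instances 4 and 5).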


In [7]:
import pandas as pd
import numpy as np

# Create a DataFrame with the dataset
data = {
    'Instance': [1, 2, 3, 4, 5],
    'Predicted Probability': [0.6, 0.3, 0.6, 0.8, 0.1],
    'Actual Label': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original Data:")
print(df)

# --- Add this part below ---
# Define a function to compute log loss for one prediction
def log_loss(p, y):
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Apply the log loss function to each row
df['Log Loss'] = df.apply(lambda row: log_loss(row['Predicted Probability'], row['Actual Label']), axis=1)

# Calculate overall log loss
overall_log_loss = df['Log Loss'].mean()

# Display the updated DataFrame with Log Loss
print("\nData with Log Loss:")
print(df)

# Print the overall Log Loss
print(f"\nOverall Log Loss: {overall_log_loss:.4f}")

Original Data:
   Instance  Predicted Probability  Actual Label
0         1                    0.6             1
1         2                    0.3             0
2         3                    0.6             1
3         4                    0.8             0
4         5                    0.1             1

Data with Log Loss:
   Instance  Predicted Probability  Actual Label  Log Loss
0         1                    0.6             1  0.510826
1         2                    0.3             0  0.356675
2         3                    0.6             1  0.510826
3         4                    0.8             0  1.609438
4         5                    0.1             1  2.302585

Overall Log Loss: 1.0581


*Question: Interpret the log loss above. How would it change if the predicted probability for instance 0 changed from 0.9 to 0.6? Why?*

 Log loss penalizes incorrect and overconfident predictions. The original predicted probability of 0.9 is close to the true label (1), so its log loss is low. The new predicted probability of 0.6 is less confident about a correct label, so the log loss for that instance rises from 0.105 to 0.511. This in turn raises the overall average log loss from 0.9770 to 1.0581, signaling worse model performance.
 Log loss rewards correct, confident predictions and penalizes wrong or unsure ones, especially when they're confidently wrong.
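
The change can be worked through by hand. The following uses the standard binary log loss formula (the same one implemented in `log_loss(p, y)` above) with the values from the two outputs:

$$
\ell(p, y) = -\bigl[\, y \ln p + (1 - y) \ln(1 - p) \,\bigr]
$$

For the first instance the label is $y = 1$, so only the first term contributes:

$$
\ell(0.9, 1) = -\ln 0.9 \approx 0.1054, \qquad \ell(0.6, 1) = -\ln 0.6 \approx 0.5108
$$

Only one of the five instances changes, so the mean shifts by one fifth of the difference:

$$
\Delta = \frac{0.5108 - 0.1054}{5} \approx 0.0811, \qquad 0.9770 + 0.0811 \approx 1.0581
$$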

*Question: Why might you select log loss over precision, recall, or accuracy (in the context of any problem, not this one specifically)?*

You'd choose log loss when you care about how confident your predictions are. A model that predicts 0.99 and is correct is rewarded, but a model that predicts 0.99 and is wrong is heavily penalized. This is especially useful in high-stakes applications like medical diagnosis or fraud detection, where it's vital to know how confident the model is in its prediction. It's also useful for comparing probabilistic models like logistic regression, XGBoost, and neural networks, and it helps to distinguish between two models that both give the right answer but with different confidence.

Regarding the other metrics: accuracy tells you the percentage of correct predictions but is misleading on imbalanced data. Precision is the percentage of predicted positives that are correct, but it ignores false negatives. Recall is the percentage of actual positives that are correctly predicted, but it ignores false positives. Log loss measures the quality of the probability estimates themselves.

To summarize, we use log loss when we are ranking models that output probabilities, when the cost of a confident wrong prediction is high, and when we are optimizing for overall predictive uncertainty rather than individual yes/no outputs.
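
The point about two models with the same answers but different confidence can be illustrated with a small sketch. Both sets of probabilities below are hypothetical, and `mean_log_loss` and `accuracy_at` are illustrative helper names rather than library functions.

```python
import numpy as np

y = np.array([1, 0, 1, 0])

# Two hypothetical models that make identical hard predictions at a 0.5 threshold.
p_confident = np.array([0.95, 0.05, 0.90, 0.10])
p_hesitant = np.array([0.55, 0.45, 0.60, 0.40])

def mean_log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log loss over all instances."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    losses = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return float(losses.mean())

def accuracy_at(y_true, p_pred, threshold=0.5):
    """Accuracy after converting probabilities to labels at the given threshold."""
    return float(np.mean((p_pred >= threshold).astype(int) == y_true))

acc_confident = accuracy_at(y, p_confident)
acc_hesitant = accuracy_at(y, p_hesitant)
ll_confident = mean_log_loss(y, p_confident)
ll_hesitant = mean_log_loss(y, p_hesitant)

print(f"Confident model: accuracy={acc_confident:.2f}, log loss={ll_confident:.4f}")
print(f"Hesitant model:  accuracy={acc_hesitant:.2f}, log loss={ll_hesitant:.4f}")
```

Accuracy cannot separate the two models, while log loss favors the one whose probabilities sit closer to the true labels.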

## Application Scenario: Select a Metric (55 points)

**Application Scenario: Fraud Detection System**

You are working as a data scientist for a financial institution that wants to develop a fraud detection system to identify potentially fraudulent transactions. The dataset contains information about various transactions, including transaction amount, merchant ID, and transaction type. Your task is to build a machine learning model to classify transactions as either fraudulent or non-fraudulent.

**Problem Description:**

* Dataset: The dataset consists of historical transaction data, with labels indicating whether each transaction was fraudulent or not.
* Class Distribution: The dataset consists mostly of non-fraudulent cases, with only a small percentage of transactions being fraudulent.
* Objective: The objective is to develop a fraud detection model that minimizes false negatives (fraudulent transactions incorrectly classified as non-fraudulent) while maintaining a reasonable level of precision.

**Stakeholder Requirements:**
Given the nature of the problem, it is crucial to prioritize recall (sensitivity) to ensure that as many fraudulent transactions as possible are detected. However, precision is also important to minimize false positives and avoid unnecessary investigations of legitimate transactions. Minimizing false negatives (missing fraudulent transactions) is of utmost importance.

**Task:**
Your task is to develop Python code to evaluate the performance of different machine learning models using various evaluation metrics, including accuracy, precision, recall, and F2 score. *Select the evaluation metric that best suits the problem and explain your choice*.
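
For reference, the F2 score named in the task is the F-beta score with beta = 2, which can be written as $F_\beta = (1 + \beta^2) \cdot TP \,/\, \bigl((1 + \beta^2) \cdot TP + \beta^2 \cdot FN + FP\bigr)$. The sketch below computes it from confusion-matrix counts; the counts are hypothetical and are not taken from the fraud dataset, and `fbeta_from_counts` is an illustrative helper name.

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives predicted or present at all.
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical counts for illustration only.
tp, fp, fn = 8, 4, 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = fbeta_from_counts(tp, fp, fn, beta=1)
f2 = fbeta_from_counts(tp, fp, fn, beta=2)

print(f"Precision: {precision:.3f}  Recall: {recall:.3f}")
print(f"F1: {f1:.3f}  F2: {f2:.3f}")
```

Because recall exceeds precision in this example, F2 comes out higher than F1. With beta = 2 each false negative costs four times as much as a false positive in the denominator, which matches the stakeholder emphasis on missed fraud. In scikit-learn the same quantity is available as `fbeta_score(y_true, y_pred, beta=2)`.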

**Additional Guidelines:**
* You should preprocess the dataset as needed and split it into training and testing sets.
* Implement machine learning models of your choice (e.g., logistic regression, random forest) and evaluate their performance.
* Use appropriate evaluation metrics for binary classification tasks.
* Discuss the rationale behind your choice of evaluation metric and how it aligns with the problem requirements.
* Present your findings and recommendations for selecting the best model based on the chosen evaluation metric.

**Dataset Sample:**

| Transaction ID | Transaction Amount | Merchant ID | Transaction Type | Fraudulent |
|----------------|--------------------|-------------|------------------|------------|
| 1              | 1000               | M123        | Online Purchase  | 0          |
| 2              | 500                | M456        | ATM Withdrawal   | 0          |
| 3              | 2000               | M789        | Online Purchase  | 1          |
| 4              | 1500               | M123        | POS Transaction  | 0          |
| 5              | 800                | M456        | Online Purchase  | 0          |
| 6              | 3000               | M789        | ATM Withdrawal   | 1          |

* Transaction ID: Unique identifier for each transaction.
* Transaction Amount: The amount of money involved in the transaction.
* Merchant ID: Identifier for the merchant involved in the transaction.
* Transaction Type: The type of transaction (e.g., online purchase, ATM withdrawal, POS transaction).
* Fraudulent: Binary indicator (0 or 1) specifying whether the transaction is fraudulent (1) or not (0).

In [8]:
import pandas as pd

# Creating the dataset
data = {
    'Transaction ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                       11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                       21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                       31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
    'Transaction Amount': [1000, 500, 2000, 1500, 800, 3000, 1200, 700, 1800, 1300,
                           900, 400, 2200, 1600, 850, 2800, 1100, 600, 1900, 1400,
                           950, 300, 2100, 1700, 820, 3200, 1250, 720, 1850, 1350,
                           880, 420, 2400, 1750, 830, 3100, 1150, 620, 1950, 1450],
    'Merchant ID': ['M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123',
                    'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456',
                    'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123',
                    'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456', 'M789', 'M123', 'M456'],
    'Transaction Type': ['Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction', 'Online Purchase',
                         'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase',
                         'POS Transaction', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction', 'Online Purchase',
                         'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'POS Transaction',
                         'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase',
                         'POS Transaction', 'Online Purchase', 'ATM Withdrawal', 'Online Purchase', 'ATM Withdrawal'],
    'Fraudulent': [0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
                   0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
                   1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
                   0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
}

# Creating DataFrame
df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)


    Transaction ID  Transaction Amount Merchant ID Transaction Type  \
0                1                1000        M123  Online Purchase   
1                2                 500        M456   ATM Withdrawal   
2                3                2000        M789  Online Purchase   
3                4                1500        M123  POS Transaction   
4                5                 800        M456  Online Purchase   
5                6                3000        M789   ATM Withdrawal   
6                7                1200        M123  Online Purchase   
7                8                 700        M456   ATM Withdrawal   
8                9                1800        M789  Online Purchase   
9               10                1300        M123  POS Transaction   
10              11                 900        M456  Online Purchase   
11              12                 400        M789   ATM Withdrawal   
12              13                2200        M123  Online Purchase   
13    

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score, classification_report

# Step 1: Define features and target
X = df.drop(columns=['Transaction ID', 'Fraudulent'])
y = df['Fraudulent']

# Step 2: Preprocessing (OneHotEncode categorical columns)
categorical_features = ['Merchant ID', 'Transaction Type']
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Keeps numerical features like Transaction Amount
)

# Step 3: Split the data into training and testing sets (stratify to preserve fraud balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Step 4: Create a pipeline with preprocessing + Random Forest classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Step 5: Train the model
pipeline.fit(X_train, y_train)

# Step 6: Predict and evaluate
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f2 = fbeta_score(y_test, y_pred, beta=2)

# Step 7: Output results
print("\n📊 Evaluation Metrics:")
print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F2 Score:  {f2:.2f}")

print("\n🧾 Classification Report:")
print(classification_report(y_test, y_pred))
print("In evaluating metrics for this solution, we need to consider that a false negative (missing an actual fraudulent transaction) means that fraud goes undetected, which could result in financial losses and damage to the institution's reputation. A false positive (flagging a legitimate transaction as fraud) could inconvenience users and lead to unnecessary investigations, but is less costly than missing actual fraud. Therefore we should aim to maximize recall (sensitivity) so that we catch as many fraudulent transactions as possible, while maintaining a reasonable level of precision to avoid too many false alarms. To balance this, I'd select the F2 score as the main evaluation metric, as it gives more weight to recall than precision and aligns with the requirement to minimize FN while still considering FP. I used a random forest classifier because it is a common choice for tabular classification problems like fraud detection, handles both numerical and categorical data, is less sensitive to noise and overfitting than a single decision tree, and ranks feature importance, which helps with interpretability. Findings: the model achieves a recall of 0.33, meaning it catches one of the three fraud cases in the test set. Precision is lower (0.25): only one of the four transactions flagged as fraud was actually fraudulent. The F2 score of 0.31 combines both with extra weight on recall, and indicates that fraud detection coverage needs to improve.")


📊 Evaluation Metrics:
Accuracy:  0.50
Precision: 0.25
Recall:    0.33
F2 Score:  0.31

🧾 Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.57      0.62         7
           1       0.25      0.33      0.29         3

    accuracy                           0.50        10
   macro avg       0.46      0.45      0.45        10
weighted avg       0.54      0.50      0.52        10

In evaluating metrics for this solution, we need to consider that a false negative (missing an actual fraudulent transaction) means that fraud goes undetected, which could result in financial losses and damage to the institution's reputation. A false positive (flagging a legitimate transaction as fraud) could inconvenience users and lead to unnecessary investigations, but is less costly than