# Task 2: Threshold Optimization

In this task, you'll practice finding optimal thresholds for different use cases.

## Instructions

Complete the code cells below to find optimal thresholds based on different criteria.

**Key Concepts:**
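- Most classifiers use a default decision threshold of 0.5: a sample is labeled class 1 when its predicted probability is at least 0.5
- Raising the threshold never increases recall and usually increases precision; lowering it does the opposite
- The optimal threshold depends on business requirements, such as the relative cost of false positives and false negatives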

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    precision_recall_curve, roc_curve
)

## Load Data

In [None]:
# Load the classification data
df = pd.read_csv('../../fixtures/input/classification_data.csv')

y_true = df['true_label'].values
y_prob = df['predicted_probability'].values  # Probability of class 1

print(f"Dataset size: {len(y_true)}")
print(f"Class distribution: {np.bincount(y_true.astype(int))}")  # bincount requires integer labels
print(f"Probability range: [{y_prob.min():.3f}, {y_prob.max():.3f}]")

## Task 2.1: Find Threshold that Maximizes F1-Score

Search through different thresholds and find the one that gives the highest F1-score.
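
One way to structure such a search is a grid sweep. The sketch below runs on synthetic data generated just for illustration; the names `demo_true`, `demo_prob`, `demo_best_threshold`, and `demo_best_f1` are hypothetical, and the grid of 81 steps is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import f1_score

# Synthetic scores for illustration only: positives tend to score higher
rng = np.random.default_rng(0)
demo_true = rng.integers(0, 2, size=500)
demo_prob = np.clip(0.35 * demo_true + rng.normal(0.35, 0.2, size=500), 0, 1)

# Evaluate F1 on an evenly spaced grid of candidate thresholds
thresholds = np.linspace(0.1, 0.9, 81)
f1_values = np.array([
    f1_score(demo_true, (demo_prob >= t).astype(int), zero_division=0)
    for t in thresholds
])

# argmax returns the first (lowest) threshold if several tie for the maximum
best_idx = int(np.argmax(f1_values))
demo_best_threshold = thresholds[best_idx]
demo_best_f1 = f1_values[best_idx]
print(f"best threshold={demo_best_threshold:.2f}, F1={demo_best_f1:.4f}")
```

A finer grid gives a more precise threshold at the cost of more `f1_score` calls; the distinct values of the predicted probabilities are another reasonable candidate set.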

In [None]:
# YOUR CODE HERE
# 1. Create a range of thresholds from 0.1 to 0.9
# 2. For each threshold, convert probabilities to predictions
# 3. Calculate F1-score
# 4. Find threshold with maximum F1-score

best_threshold_f1 = None
best_f1_score = None

# TEST - Do not modify
assert best_threshold_f1 is not None, "Find the best threshold"
assert best_f1_score is not None, "Calculate the best F1 score"
assert 0 < best_threshold_f1 < 1, "Threshold should be between 0 and 1"
assert 0 < best_f1_score <= 1, "F1 score should be between 0 and 1"

# Verify by calculating F1 at this threshold
y_pred_best = (y_prob >= best_threshold_f1).astype(int)
verify_f1 = f1_score(y_true, y_pred_best)
assert abs(verify_f1 - best_f1_score) < 0.001, "F1 score doesn't match threshold"

print(f"✓ Best threshold for F1: {best_threshold_f1:.4f}")
print(f"✓ Best F1-score: {best_f1_score:.4f}")

## Task 2.2: Find Threshold for High Precision

Find a threshold that achieves at least 95% precision while maximizing recall.
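
A constrained search like this can be sketched with `precision_recall_curve`, which evaluates every distinct score as a threshold. The toy arrays and the `toy_hp_*` names below are hypothetical and only illustrate the pattern.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical toy data for illustration only (not the task's dataset)
toy_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
toy_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

precision, recall, pr_thresholds = precision_recall_curve(toy_true, toy_prob)

# The final (precision=1, recall=0) point has no threshold, so drop it
precision, recall = precision[:-1], recall[:-1]

# Keep thresholds meeting the precision constraint, then maximize recall
feasible = precision >= 0.95
if feasible.any():
    best_idx = int(np.argmax(np.where(feasible, recall, -1.0)))
    toy_hp_threshold = pr_thresholds[best_idx]
    toy_hp_recall = recall[best_idx]
    print(f"threshold={toy_hp_threshold:.2f}, recall={toy_hp_recall:.2f}")
else:
    toy_hp_threshold = toy_hp_recall = None
    print("No threshold reaches 95% precision")
```

On this toy data, the lowest threshold with no false positives is 0.75, which gives a recall of 0.75. The `else` branch matters on real data, where the precision target may be unreachable.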

In [None]:
# YOUR CODE HERE
# 1. Try different thresholds
# 2. Filter those with precision >= 0.95
# 3. Among those, find the one with maximum recall

high_precision_threshold = None
high_precision_recall = None

# TEST - Do not modify
assert high_precision_threshold is not None, "Find threshold for high precision"
assert high_precision_recall is not None, "Calculate recall at this threshold"

# Verify precision is >= 0.95 (the assertion allows a 0.01 tolerance)
y_pred_hp = (y_prob >= high_precision_threshold).astype(int)
verify_precision = precision_score(y_true, y_pred_hp, zero_division=0)
assert verify_precision >= 0.94, f"Precision should be >= 0.95, got {verify_precision:.4f}"

verify_recall = recall_score(y_true, y_pred_hp)
assert abs(verify_recall - high_precision_recall) < 0.001, "Recall doesn't match"

print(f"✓ Threshold for precision >= 0.95: {high_precision_threshold:.4f}")
print(f"  Precision: {verify_precision:.4f}")
print(f"  Recall: {high_precision_recall:.4f}")

## Task 2.3: Find Threshold for High Recall

Find a threshold that achieves at least 90% recall while maximizing precision.
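
The mirrored constraint can also be handled with an explicit loop over candidate thresholds. The sketch below uses the distinct predicted scores as candidates; the toy arrays and `toy_hr_*` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy data for illustration only (not the task's dataset)
toy_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
toy_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

toy_hr_threshold = None
toy_hr_precision = -1.0

# Predictions only change at observed scores, so those are sufficient candidates
for t in np.unique(toy_prob):
    y_pred = (toy_prob >= t).astype(int)
    rec = recall_score(toy_true, y_pred, zero_division=0)
    prec = precision_score(toy_true, y_pred, zero_division=0)
    # Keep the feasible threshold (recall >= 0.90) with the highest precision
    if rec >= 0.90 and prec > toy_hr_precision:
        toy_hr_threshold, toy_hr_precision = float(t), float(prec)

print(f"threshold={toy_hr_threshold:.2f}, precision={toy_hr_precision:.2f}")
```

With only four positives in this toy data, a recall of at least 0.90 means catching all of them, so the highest feasible threshold is 0.55, with precision 0.80.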

In [None]:
# YOUR CODE HERE
# 1. Try different thresholds
# 2. Filter those with recall >= 0.90
# 3. Among those, find the one with maximum precision

high_recall_threshold = None
high_recall_precision = None

# TEST - Do not modify
assert high_recall_threshold is not None, "Find threshold for high recall"
assert high_recall_precision is not None, "Calculate precision at this threshold"

# Verify recall is >= 0.90 (the assertion allows a 0.01 tolerance)
y_pred_hr = (y_prob >= high_recall_threshold).astype(int)
verify_recall = recall_score(y_true, y_pred_hr)
assert verify_recall >= 0.89, f"Recall should be >= 0.90, got {verify_recall:.4f}"

verify_precision = precision_score(y_true, y_pred_hr)
assert abs(verify_precision - high_recall_precision) < 0.001, "Precision doesn't match"

print(f"✓ Threshold for recall >= 0.90: {high_recall_threshold:.4f}")
print(f"  Precision: {high_recall_precision:.4f}")
print(f"  Recall: {verify_recall:.4f}")

## Task 2.4: Optimal Threshold Using Youden's J Statistic

Youden's J statistic = Sensitivity + Specificity - 1 = TPR - FPR, since Sensitivity = TPR and Specificity = 1 - FPR.

Find the threshold that maximizes this statistic.
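
As a rough sketch of this computation, `roc_curve` returns the TPR and FPR at each threshold, so J is a single array subtraction. The toy arrays and `toy_youden_*` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical toy data for illustration only (not the task's dataset)
toy_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
toy_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

fpr, tpr, roc_thresholds = roc_curve(toy_true, toy_prob)

# J = TPR - FPR at every threshold on the ROC curve.
# The first threshold is a sentinel above the largest score (TPR = FPR = 0, so J = 0);
# it is never selected as long as some real threshold has J > 0.
j_scores = tpr - fpr
best_idx = int(np.argmax(j_scores))
toy_youden_threshold = roc_thresholds[best_idx]
toy_youden_j = j_scores[best_idx]

print(f"threshold={toy_youden_threshold:.2f}, J={toy_youden_j:.4f}")
```

On this toy data, the threshold 0.55 catches every positive (TPR = 1) while admitting one of six negatives (FPR = 1/6), giving J = 5/6.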

In [None]:
# YOUR CODE HERE
# 1. Calculate ROC curve to get TPR and FPR at different thresholds
# 2. Calculate J = TPR - FPR for each threshold
# 3. Find threshold with maximum J

youden_threshold = None
youden_j_score = None

# TEST - Do not modify
assert youden_threshold is not None, "Find Youden's optimal threshold"
assert youden_j_score is not None, "Calculate J statistic"
assert 0 < youden_threshold < 1, "Threshold should be between 0 and 1"
assert 0 < youden_j_score <= 1, "J statistic should be between 0 and 1"

print(f"✓ Youden's optimal threshold: {youden_threshold:.4f}")
print(f"✓ J statistic: {youden_j_score:.4f}")

## Task 2.5: Visualize Threshold Effects

Create a plot showing how precision, recall, and F1-score change with threshold.

In [None]:
# YOUR CODE HERE
# 1. Calculate precision, recall, F1 for thresholds from 0.1 to 0.9
# 2. Create a plot with threshold on x-axis and metrics on y-axis
# 3. Mark the optimal points found above

# Your plotting code here

# TEST - Do not modify
# Check that a plot was created
import matplotlib.pyplot as plt
assert len(plt.gcf().axes) > 0, "Create a plot showing threshold effects"
print("✓ Plot created successfully")

## Task 2.6: Compare Thresholds

Create a summary table comparing all the thresholds you found.
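
One possible layout builds one row per threshold from a small helper. In the sketch below, `summarize_threshold`, the `Criterion` column, and the three example thresholds are hypothetical, and the data is a toy set rather than the task's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical toy data for illustration only (not the task's dataset)
toy_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
toy_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

def summarize_threshold(name, threshold, y_true, y_prob):
    """Return one table row with the metrics obtained at the given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "Criterion": name,
        "Threshold": threshold,
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1-Score": f1_score(y_true, y_pred, zero_division=0),
    }

# Arbitrary example thresholds; in the task these would be the ones found above
example_thresholds = {"Lower": 0.3, "Default": 0.5, "Higher": 0.7}
toy_comparison = pd.DataFrame([
    summarize_threshold(name, t, toy_true, toy_prob)
    for name, t in example_thresholds.items()
])
print(toy_comparison.to_string(index=False))
```

Computing every metric through the same helper keeps the rows comparable, since each threshold is applied with the same `>=` rule.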

In [None]:
# YOUR CODE HERE
# Create a DataFrame comparing:
# - Best F1 threshold
# - High precision threshold
# - High recall threshold
# - Youden's threshold
# Include columns: Threshold, Precision, Recall, F1-Score

comparison_df = None

# TEST - Do not modify
assert comparison_df is not None, "Create comparison DataFrame"
assert isinstance(comparison_df, pd.DataFrame), "Should be a DataFrame"
assert len(comparison_df) >= 4, "Should have at least 4 rows"
assert 'Threshold' in comparison_df.columns, "Should have Threshold column"
assert 'Precision' in comparison_df.columns, "Should have Precision column"
assert 'Recall' in comparison_df.columns, "Should have Recall column"
assert 'F1-Score' in comparison_df.columns, "Should have F1-Score column"

print("✓ Threshold Comparison:")
print(comparison_df.to_string(index=False))

## Task 2.7: Use Case Analysis

For each scenario below, choose the most appropriate threshold from those you found:

1. **Medical Screening**: Missing a sick patient (a false negative) is the costliest error
2. **Spam Filter**: Blocking a legitimate email (a false positive) is the costliest error
3. **Fraud Detection**: Catching fraud must be balanced against raising false alarms

Assign the appropriate threshold to each variable.

In [None]:
# YOUR CODE HERE
# Choose from: best_threshold_f1, high_precision_threshold, 
#              high_recall_threshold, youden_threshold

medical_screening_threshold = None  # Don't miss sick patients (high recall)
spam_filter_threshold = None        # Don't block legitimate emails (high precision)
fraud_detection_threshold = None    # Balance (best F1)

# TEST - Do not modify
assert medical_screening_threshold is not None, "Choose threshold for medical screening"
assert spam_filter_threshold is not None, "Choose threshold for spam filter"
assert fraud_detection_threshold is not None, "Choose threshold for fraud detection"

# Verify choices make sense
# Medical should prioritize recall (lower threshold)
assert medical_screening_threshold <= best_threshold_f1, \
    "Medical screening should use lower threshold for high recall"

# Spam should prioritize precision (higher threshold)
assert spam_filter_threshold >= best_threshold_f1, \
    "Spam filter should use higher threshold for high precision"

print("✓ Use Case Recommendations:")
print(f"  Medical Screening: {medical_screening_threshold:.4f}")
print(f"  Spam Filter: {spam_filter_threshold:.4f}")
print(f"  Fraud Detection: {fraud_detection_threshold:.4f}")

## Task 2.8: Calculate Cost-Sensitive Threshold

Assume the following misclassification costs:
- Cost of False Positive: $10
- Cost of False Negative: $100

Find the threshold that minimizes total cost.
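
The cost sweep can be vectorized with NumPy broadcasting. The sketch below is illustrative only: the toy arrays and `toy_*` names are hypothetical, and the costs mirror the $10 and $100 stated above.

```python
import numpy as np

# Hypothetical toy data for illustration only (not the task's dataset)
toy_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
toy_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

toy_cost_fp = 10   # cost of one false positive
toy_cost_fn = 100  # cost of one false negative

# Candidate thresholds: the distinct observed scores
candidates = np.unique(toy_prob)

# preds[i, j] is True when sample j is predicted positive at candidates[i]
preds = toy_prob[None, :] >= candidates[:, None]
fp_counts = np.sum(preds & (toy_true == 0), axis=1)
fn_counts = np.sum(~preds & (toy_true == 1), axis=1)
total_costs = fp_counts * toy_cost_fp + fn_counts * toy_cost_fn

best_idx = int(np.argmin(total_costs))
toy_cost_threshold = candidates[best_idx]
toy_min_cost = int(total_costs[best_idx])
print(f"threshold={toy_cost_threshold:.2f}, cost=${toy_min_cost}")
```

On this toy data, the minimum cost is $10 at threshold 0.55: one false positive and no false negatives. For reference, if the probabilities were perfectly calibrated and correct predictions cost nothing, the expected-cost-minimizing threshold would be cost_fp / (cost_fp + cost_fn) = 10 / 110 ≈ 0.09; an empirical sweep can land elsewhere when the scores are not calibrated.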

In [None]:
# YOUR CODE HERE
# 1. For each threshold, calculate total cost
# 2. Cost = (FP * 10) + (FN * 100)
# 3. Find threshold with minimum cost

cost_fp = 10
cost_fn = 100

optimal_cost_threshold = None
minimum_cost = None

# TEST - Do not modify
assert optimal_cost_threshold is not None, "Find cost-optimal threshold"
assert minimum_cost is not None, "Calculate minimum cost"
assert 0 < optimal_cost_threshold < 1, "Threshold should be between 0 and 1"
assert minimum_cost >= 0, "Cost should be non-negative"

# Verify cost calculation
from sklearn.metrics import confusion_matrix
y_pred_cost = (y_prob >= optimal_cost_threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_cost).ravel()
verify_cost = (fp * cost_fp) + (fn * cost_fn)
assert abs(verify_cost - minimum_cost) < 1, "Cost calculation doesn't match"

print(f"✓ Cost-optimal threshold: {optimal_cost_threshold:.4f}")
print(f"  Minimum cost: ${minimum_cost:.2f}")
print(f"  FP: {fp}, FN: {fn}")
print("\nNote: Since FN costs 10x more than FP, we should use a lower threshold")
print("to reduce false negatives (prioritize recall).")

## Summary

Congratulations! You've completed Task 2. You've learned to:
- Find thresholds that optimize different metrics (F1, precision, recall)
- Use Youden's J statistic for balanced optimization
- Choose appropriate thresholds for different use cases
- Calculate cost-sensitive thresholds

Key insights:
- The default threshold (0.5) is often not optimal
- Different use cases require different thresholds
- Cost-sensitive optimization accounts for real-world consequences
- Always validate your threshold choice with domain experts