# Exercise 1: Analyzing Confusion Matrix

## 1. Definitions of Confusion Matrix Terms in Email Spam Detection

- **True Positives (TP)**:
  - Emails correctly classified as "Spam."
  - Example: An actual spam email is detected as spam by the classifier.

- **True Negatives (TN)**:
  - Emails correctly classified as "Not Spam."
  - Example: A legitimate email is correctly identified as not spam.

- **False Positives (FP)**:
  - Emails incorrectly classified as "Spam" when they are actually "Not Spam."
  - Example: A legitimate email from your boss is mistakenly flagged as spam.

- **False Negatives (FN)**:
  - Emails incorrectly classified as "Not Spam" when they are actually "Spam."
  - Example: A phishing email is mistakenly marked as not spam and reaches your inbox.
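
The four definitions above can be sketched as a small counting routine. This is a minimal illustration, not a production filter: the function name `confusion_counts`, the label strings `"spam"` and `"ham"`, and the six sample emails are all assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, TN, FP, FN for a binary task; `positive` is the spam label."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # spam correctly flagged
            else:
                fp += 1  # legitimate email flagged as spam
        else:
            if actual == positive:
                fn += 1  # spam that reached the inbox
            else:
                tn += 1  # legitimate email left alone
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels for six emails (illustrative data only).
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

counts = confusion_counts(y_true, y_pred)
print(counts)
```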



## 2. Calculating Metrics

Given the confusion matrix values:
- **TP** = Number of True Positives
- **TN** = Number of True Negatives
- **FP** = Number of False Positives
- **FN** = Number of False Negatives

### Formulas:
1. **Accuracy**:
   $$
   \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
   $$
   - Measures the overall correctness of the classifier.

2. **Precision**:
   $$
   \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
   $$
   - Focuses on how many predicted "Spam" emails are actually spam.
   - High precision means that few of the emails flagged as spam are actually legitimate.

3. **Recall (Sensitivity)**:
   $$
   \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
   $$
   - Measures how many actual spam emails were correctly identified.
   - High recall ensures most spam emails are detected.

4. **F1-Score**:
   $$
   F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
   $$
   - Harmonic mean of precision and recall, balancing both metrics.

### Example Calculation:
Assume the confusion matrix values are:
- TP = 50, TN = 40, FP = 10, FN = 20

1. Accuracy:
   $$
   \text{Accuracy} = \frac{50 + 40}{50 + 40 + 10 + 20} = \frac{90}{120} = 0.75
   $$

2. Precision:
   $$
   \text{Precision} = \frac{50}{50 + 10} = \frac{50}{60} = 0.833
   $$

3. Recall:
   $$
   \text{Recall} = \frac{50}{50 + 20} = \frac{50}{70} = 0.714
   $$

4. F1-Score:
   $$
   F1 = 2 \times \frac{0.833 \times 0.714}{0.833 + 0.714} = 2 \times \frac{0.595}{1.547} = 0.769
   $$
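
The arithmetic above can be double-checked with a short script. This is a sketch under one assumption not stated in the text: when a denominator is zero, the helper (hypothetically named `classification_metrics`) returns 0.0 for that metric.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Counts from the example calculation: TP = 50, TN = 40, FP = 10, FN = 20.
accuracy, precision, recall, f1 = classification_metrics(tp=50, tn=40, fp=10, fn=20)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Computing F1 from the unrounded precision and recall gives 100/130, which rounds to the same 0.769 obtained above.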



## 3. Impact of False Positives vs False Negatives

### Higher Number of False Positives (FP):
- **Impact**:
  - Legitimate emails are incorrectly flagged as spam.
  - This can frustrate users, especially if important emails (e.g., work-related) are sent to the spam folder.
- **Metrics Affected**:
  - Precision decreases because more non-spam emails are misclassified as spam.
    - Lower precision means a higher proportion of flagged emails are not actually spam.
  
### Higher Number of False Negatives (FN):
- **Impact**:
  - Spam emails are incorrectly classified as not spam and reach the inbox.
  - This can pose security risks, such as phishing attacks or malware delivery.
- **Metrics Affected**:
  - Recall decreases because fewer actual spam emails are detected.
    - Lower recall means the classifier fails to catch a significant portion of spam.
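
Both effects can be checked numerically. The sketch below starts from the counts used in the example calculation (TP = 50, FP = 10, FN = 20) and applies two hypothetical shifts of ten emails each; the shifts are illustrative assumptions, not measured data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Baseline counts from the example calculation.
baseline = precision_recall(tp=50, fp=10, fn=20)

# Hypothetical shift: 10 legitimate emails move from TN to FP.
more_fp = precision_recall(tp=50, fp=20, fn=20)

# Hypothetical shift: 10 spam emails move from TP to FN.
more_fn = precision_recall(tp=40, fp=10, fn=30)

for name, (p, r) in [("baseline", baseline), ("more FP", more_fp), ("more FN", more_fn)]:
    print(f"{name:9s} precision={p:.3f} recall={r:.3f}")
```

Extra false positives lower precision and leave recall untouched. Extra false negatives lower recall sharply; precision also dips slightly in that scenario, because the true positive count shrinks while the false positive count stays fixed.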

### Trade-Off Between FP and FN:
- The balance between FP and FN depends on the use case:
  - In email spam detection, minimizing FN might be prioritized to ensure harmful emails do not reach users' inboxes, even at the cost of more FP (i.e., stricter filters).
  - However, excessive FP can reduce user trust in the system, so a balance is necessary.


---

# Exercise 2: Evaluating Trade-offs in Metrics

## 1. **Why High Recall is More Important in Medical Diagnosis**
In a medical diagnosis context, recall (sensitivity) measures the ability of the model to correctly identify all patients with the disease. High recall is critical because:
- Missing a true positive (false negative) means failing to detect a patient who actually has the disease, which could lead to severe health consequences or even death.
- False positives (incorrectly diagnosing a healthy person as diseased) are usually less harmful because they typically result in additional testing or follow-up, which is less risky than missing a diagnosis.

For example, in cancer detection, it is far more critical to identify all potential cancer cases (even if some healthy individuals are flagged) than to risk leaving undiagnosed patients untreated.



## 2. **Scenario Where Precision Becomes More Important**
Precision becomes more important in scenarios where false positives have significant consequences. For instance:
- **Fraud Detection**: Flagging legitimate transactions as fraudulent (false positives) can frustrate customers and damage trust in the system. Here, high precision means that most flagged transactions are actually fraudulent.
- **Spam Email Filtering**: Marking important legitimate emails as spam (false positives) can lead to missed opportunities or critical communication failures. High precision minimizes these errors.

In these cases, it can be acceptable to miss some true positives (lower recall) if doing so makes the flagged instances highly reliable.



## 3. **Consequences of Solely Focusing on Accuracy in Imbalanced Datasets**
Accuracy measures the overall correctness of predictions but can be misleading in imbalanced datasets, where one class significantly outweighs the other. For example:
- In a dataset where 95% of patients are healthy and only 5% have a disease, a model that predicts "healthy" for every case achieves 95% accuracy but fails to detect any diseased patients (0% recall for the diseased class).
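
The 95%/5% example can be reproduced in a few lines. The 100-patient dataset below is synthetic and exists only to mirror the proportions described above; label 1 stands for "diseased" and 0 for "healthy".

```python
# Synthetic dataset: 5 diseased patients (1) and 95 healthy patients (0).
y_true = [1] * 5 + [0] * 95

# A trivial "model" that predicts healthy for everyone.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```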

### Potential Consequences:
1. **False Sense of Model Performance**:
   - High accuracy may mask poor performance on the minority class (e.g., diseased patients), leading to dangerous real-world outcomes.
   
2. **Neglecting Minority Class**:
   - The model might prioritize majority class predictions, ignoring the minority class entirely.
   
3. **Unethical Outcomes**:
   - In medical or safety-critical applications, focusing solely on accuracy could result in harm to individuals in the minority class.

### Mitigation Strategies:
- Use metrics like **F1-score**, which balances precision and recall.
- Evaluate performance with class-specific metrics like recall for minority classes.
- Consider using weighted accuracy or cost-sensitive learning to account for imbalances.
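
As a rough illustration of these alternatives, the sketch below scores the always-"healthy" model from the 95%/5% example. It assumes that "weighted accuracy" is read as balanced accuracy (the mean of per-class recall), which is one common interpretation rather than the only one, and it assumes a metric is reported as 0.0 when its denominator is zero.

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Mean of recall on the positive class and recall on the negative class."""
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall_pos + recall_neg) / 2


def f1_from_counts(tp, fp, fn):
    """F1 written directly in counts: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Always-"healthy" model on 5 diseased and 95 healthy patients.
tp, fn, tn, fp = 0, 5, 95, 0

plain_accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced = balanced_accuracy(tp, tn, fp, fn)
f1 = f1_from_counts(tp, fp, fn)

print(f"accuracy={plain_accuracy:.2f} balanced_accuracy={balanced:.2f} f1={f1:.2f}")
```

Plain accuracy looks strong at 0.95, while balanced accuracy falls to 0.50 and F1 to 0.00, which exposes the missed minority class.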

---

# Exercise 3: Understanding Cross-Validation and Learning Curves

## 1. **Difference Between K-Fold Cross-Validation and Stratified K-Fold Cross-Validation**

### K-Fold Cross-Validation:
- Divides the dataset into $k$ equally sized folds.
- The model is trained on $k-1$ folds and tested on the remaining fold, iterating $k$ times so that each fold is used as a test set once.
- Suitable for datasets where class distribution is not a concern.

### Stratified K-Fold Cross-Validation:
- Ensures that each fold maintains the same class distribution as the original dataset.
- Particularly useful for imbalanced datasets (e.g., when one class is significantly underrepresented).
  
### Choice for Housing Price Prediction:
- **K-Fold Cross-Validation** would be preferred because predicting housing prices is a regression task, where maintaining class distribution is not relevant. Stratification is more applicable to classification problems.
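
The difference between the two schemes can be sketched with plain index lists. This is a simplified illustration, not the implementation used by any library: both helper functions are hypothetical, no shuffling is applied, and the 16/4 class split is made-up data. The stratified variant needs class labels, which is why it does not apply directly to a regression target such as price.

```python
from collections import defaultdict


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stratified_k_fold_indices(labels, k):
    """Deal each class's indices round-robin into k folds to keep class ratios similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical labels: 16 samples of class 0 followed by 4 samples of class 1.
labels = [0] * 16 + [1] * 4

plain = k_fold_indices(len(labels), k=4)
stratified = stratified_k_fold_indices(labels, k=4)

plain_positives = [sum(labels[i] for i in fold) for fold in plain]
stratified_positives = [sum(labels[i] for i in fold) for fold in stratified]

print("positives per fold, plain:     ", plain_positives)
print("positives per fold, stratified:", stratified_positives)
```

Without shuffling, the plain split puts every minority sample into the last fold, whereas the stratified split gives each fold one. In practice, shuffling before a plain split reduces this problem but does not guarantee equal class ratios.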



## 2. **Learning Curves and Their Use**

### What Are Learning Curves?
- **Definition**: Graphical representations of model performance (e.g., loss or accuracy) on training and validation sets as a function of training iterations or dataset size.
- **Purpose**: Diagnose model behavior, such as underfitting, overfitting, or good fit.

### How They Help:
1. **Diagnose Underfitting**:
   - Training and validation errors are both high and do not improve with more training data or iterations.
   - Indicates the model is too simple or lacks capacity to capture the patterns in the data.

2. **Diagnose Overfitting**:
   - Training error decreases significantly, but validation error increases after a point.
   - Indicates the model is memorizing training data instead of generalizing.

3. **Assess Data Sufficiency**:
   - If validation error decreases with more data, adding more training samples may improve performance.
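
The reading rules above can be turned into a rough heuristic. The sketch below is only illustrative: the function name `diagnose_fit` and both thresholds (`high_error` and `max_gap`) are hypothetical values chosen for the example, and sensible values depend entirely on the task and the error metric.

```python
def diagnose_fit(train_error, val_error, high_error=0.30, max_gap=0.10):
    """Label the final point of a learning curve using two hypothetical thresholds."""
    if train_error >= high_error and val_error >= high_error:
        return "underfitting"  # both errors are high
    if val_error - train_error > max_gap:
        return "overfitting"  # validation error is much higher than training error
    return "reasonable fit"


# Hypothetical final (training error, validation error) pairs.
print(diagnose_fit(0.40, 0.42))  # both high
print(diagnose_fit(0.02, 0.25))  # large gap
print(diagnose_fit(0.08, 0.11))  # low and close together
```

A check on the final point alone ignores the shape of the curve, so it complements rather than replaces inspecting the plot.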



## 3. **Implications of Underfitting and Overfitting**

### Underfitting:
- **Symptoms on Learning Curve**:
  - Training loss remains high and flat.
  - Validation loss mirrors training loss with no significant improvement.
- **Causes**:
  - Model complexity is too low (e.g., insufficient features or too simple algorithm).
  - Insufficient training time or poor hyperparameter tuning.
- **Solutions**:
  - Use a more complex model (e.g., increase number of features or layers).
  - Train for more epochs or adjust learning rate.
  - Add relevant features to better capture patterns in the data.

### Overfitting:
- **Symptoms on Learning Curve**:
  - Training loss decreases significantly but validation loss increases after a certain point.
  - Large gap between training and validation losses.
- **Causes**:
  - Model complexity is too high (e.g., too many parameters or layers).
  - Insufficient regularization or noisy training data.
- **Solutions**:
  - Apply regularization techniques (e.g., L1/L2 penalties, dropout).
  - Use early stopping to halt training before overfitting occurs.
  - Reduce model complexity by simplifying architecture or pruning features.
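
Early stopping, one of the solutions listed above, can be sketched as a patience rule applied to a sequence of validation losses. The function name, the `patience` value, and the loss values are assumptions made for illustration; real training loops usually also save the model weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) under a simple patience rule.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    stop_epoch = len(val_losses) - 1  # default: training ran to the end
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch


# Hypothetical validation losses: improving at first, then rising (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch={best_epoch}, stopped at epoch={stop_epoch}")
```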

---

## Summary
By leveraging cross-validation to evaluate generalization and learning curves to diagnose model behavior, you can iteratively refine your housing price prediction model to achieve an optimal balance between bias (underfitting) and variance (overfitting). This ensures robust performance on unseen data while avoiding pitfalls like overtraining or insufficient learning.


---

# Exercise 4: Impact of Class Imbalance on Model Evaluation

## 1. Why Using Accuracy Might Be Misleading
- **Definition**: Accuracy measures the proportion of correct predictions (both true positives and true negatives) out of all predictions.
- **Problem in Imbalanced Datasets**:
  - In a dataset where only 2% of instances are positive (diseased), a model that predicts "Not Diseased" for every instance achieves 98% accuracy, despite failing to identify any actual diseased cases (0% recall for the positive class).
  - This high accuracy is misleading because it reflects the dominance of the majority class (negative cases) rather than the model's ability to detect rare positive cases.



## 2. Importance of Precision and Recall

### **Precision**:
- Measures the proportion of correctly identified positive cases out of all predicted positives.
- **Importance**:
  - High precision ensures that most predicted positive cases are truly diseased, reducing false positives.
  - In medical diagnosis, false positives might lead to unnecessary tests or treatments, which can be costly or stressful for patients.

### **Recall (Sensitivity)**:
- Measures the proportion of actual positive cases that are correctly identified.
- **Importance**:
  - High recall ensures that most diseased patients are correctly detected, minimizing false negatives.
  - Missing a true positive (false negative) could result in undiagnosed and untreated diseases, potentially leading to severe health consequences.

### Trade-Off Between Precision and Recall:
- In this context, **recall is more critical** because failing to identify diseased patients is far more harmful than subjecting healthy individuals to additional tests (false positives).



## 3. Strategies to Evaluate and Improve Performance

### **Evaluation Strategies**
1. **Precision-Recall Curve**:
   - Plot precision vs. recall across different classification thresholds.
   - The area under the precision-recall curve (PR-AUC) is more informative than ROC-AUC in imbalanced datasets as it focuses on the minority class performance[1][6].

2. **F1-Score**:
   - Combines precision and recall into a single metric using their harmonic mean:
     $$
     F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
     $$
   - Useful for balancing false positives and false negatives when both metrics are important.

3. **Confusion Matrix Analysis**:
   - Examine TP, TN, FP, and FN values directly to understand how well the model performs on each class.

4. **Class-Specific Metrics**:
   - Evaluate metrics like recall specifically for the minority class to ensure it is being detected effectively.
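
The precision-recall curve can be traced by ranking predictions by score and recomputing both metrics at each rank. The sketch below summarizes the curve with average precision, a step-wise sum that is one common stand-in for PR-AUC (a substitution made here, not something the text prescribes). It assumes all scores are distinct, and the scores and labels are made-up data.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) after each prediction, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points, tp = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / total_positives, tp / rank))
    return points


def average_precision(scores, labels):
    """Sum precision at each point, weighted by the increase in recall."""
    ap, previous_recall = 0.0, 0.0
    for recall, precision in pr_curve(scores, labels):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical imbalanced data: 2 positives among 8 samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0, 0, 0, 0]

points = pr_curve(scores, labels)
ap = average_precision(scores, labels)
print(f"average precision={ap:.3f}")
```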



### **Improvement Strategies**
1. **Resampling Techniques**:
   - **Oversampling**: Duplicate or synthetically generate samples from the minority class (e.g., SMOTE).
   - **Undersampling**: Reduce samples from the majority class to balance the dataset.

2. **Cost-Sensitive Learning**:
   - Assign higher misclassification costs to false negatives compared to false positives.
   - Train models with class weights to penalize errors on the minority class more heavily.

3. **Algorithmic Adjustments**:
   - Use models that can weight classes unevenly, such as XGBoost (e.g., its `scale_pos_weight` parameter) or Random Forest with balanced class weights.
   - Adjust classification thresholds to prioritize recall over precision.

4. **Anomaly Detection Approach**:
   - Treat rare disease detection as an anomaly detection problem where minority cases are flagged as anomalies.

5. **Data Augmentation**:
   - Generate synthetic data points for the minority class using techniques like GANs (Generative Adversarial Networks).
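
Two of these strategies, random oversampling and class weighting, are simple enough to sketch with the standard library. Both helpers are hypothetical and simplified: the weighting uses the common heuristic `n_samples / (n_classes * class_count)`, the oversampler only duplicates existing samples (it is not SMOTE), and the 98/2 dataset is synthetic.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, label in zip(samples, labels) if label == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Synthetic data: 98 negative cases (0) and 2 positive cases (1).
labels = [0] * 98 + [1] * 2
samples = list(range(100))  # stand-ins for feature vectors

weights = balanced_class_weights(labels)
new_samples, new_labels = random_oversample(samples, labels)

print("class weights:", weights)
print("class counts after oversampling:", Counter(new_labels))
```

Resampling should be applied to the training split only, so that duplicated minority samples do not leak into the validation or test data.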

---

## Summary
In highly imbalanced datasets like rare disease detection, accuracy is not a reliable metric due to its bias toward the majority class. Instead, metrics like precision, recall, F1-score, and PR-AUC should be prioritized. Strategies such as resampling, cost-sensitive learning, and threshold adjustments can help improve model performance while ensuring that critical cases in the minority class are not overlooked.

---

# Exercise 5: Role of Threshold Tuning in Classification Models

## 1. **Effect of Changing the Threshold on Precision and Recall**

- **Threshold**: The probability score above which a case is classified as "positive" (loan default).
- **Impact of Increasing Threshold from 0.5 to 0.7**:
  - **Precision**:
    - Precision is likely to increase because fewer cases will be classified as "default" (positive), and those that are classified will have a higher probability of actually being defaults.
    - This reduces false positives, improving the proportion of true positives among predicted positives.
  - **Recall**:
    - Recall is likely to decrease because some true positive cases (actual defaults) with probabilities between 0.5 and 0.7 will now be missed.
    - This increases false negatives, reducing the proportion of actual defaults that are correctly identified.

### Trade-Off:
- Increasing the threshold improves precision at the cost of recall, while lowering the threshold does the opposite.
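
The trade-off can be seen on a small made-up example. In the sketch below, the scores and labels are hypothetical (1 = default), and a case is treated as positive when its score is greater than or equal to the threshold, which is an assumed convention.

```python
def precision_recall_at(threshold, scores, labels):
    """Classify score >= threshold as positive and return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted default probabilities and true outcomes (1 = default).
scores = [0.95, 0.85, 0.75, 0.65, 0.60, 0.55, 0.45, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

at_05 = precision_recall_at(0.5, scores, labels)
at_07 = precision_recall_at(0.7, scores, labels)

print(f"threshold 0.5: precision={at_05[0]:.3f} recall={at_05[1]:.3f}")
print(f"threshold 0.7: precision={at_07[0]:.3f} recall={at_07[1]:.3f}")
```

On this data, raising the threshold removes both false positives (precision rises from about 0.667 to 1.0) but also drops the default scored at 0.65 (recall falls from 0.8 to 0.6).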



## 2. **Consequences of Setting the Threshold Too High or Too Low**

### **Threshold Too High (e.g., 0.9)**:
- **Consequences**:
  - Very few cases will be classified as "default."
  - High precision but very low recall.
  - Many actual defaulters (false negatives) will go undetected, leading to significant financial losses for the bank as these customers are incorrectly approved for loans.

### **Threshold Too Low (e.g., 0.3)**:
- **Consequences**:
  - Many cases will be classified as "default."
  - High recall but low precision.
  - A large number of non-defaulters (false positives) will be flagged as defaulters, potentially resulting in unnecessary loan rejections or higher interest rates for reliable clients, damaging customer satisfaction and trust.



## 3. **Using ROC Curves and AUC to Find the Optimal Threshold**

### **ROC Curve**:
- A plot of the True Positive Rate (TPR or Recall) vs. False Positive Rate (FPR) at various thresholds.
- Helps visualize the trade-off between recall and false positives as you adjust the threshold.
- Each point on the curve corresponds to a specific threshold.

### **AUC (Area Under the Curve)**:
- A single scalar value summarizing the ROC curve.
- Ranges from 0 to 1, where:
  - AUC = 1 indicates perfect classification.
  - AUC = 0.5 indicates random guessing; values below 0.5 indicate a ranking that is worse than random.
- Higher AUC values indicate better model performance across all thresholds.

### **Finding the Optimal Threshold**:
1. Analyze the ROC curve to identify a threshold that provides a good balance between TPR and FPR.
2. Consider business-specific priorities:
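   - If minimizing false negatives (maximizing recall) is critical, choose a lower threshold, which corresponds to a point toward the upper right of the ROC curve (higher TPR, at the cost of a higher FPR).
   - If minimizing false positives (protecting precision) is critical, choose a higher threshold, which corresponds to a point toward the lower left of the curve (lower FPR, at the cost of a lower TPR).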
3. Use metrics like F1-score or a cost-benefit analysis to determine an optimal trade-off point based on business goals.
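
The steps above can be sketched end to end: compute the ROC points, summarize them with a trapezoidal AUC, and pick the threshold with the lowest total cost. Everything specific here is an assumption for illustration: the scores and labels are made up, a score greater than or equal to the threshold counts as positive, and the cost of a missed default (`cost_fn=5`) relative to a wrongly rejected applicant (`cost_fp=1`) is hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) tuples from the strictest threshold to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(float("inf"), 0.0, 0.0)]  # nothing is flagged at an infinite threshold
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under the (FPR, TPR) points."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def cheapest_threshold(scores, labels, cost_fn, cost_fp):
    """Return (threshold, total_cost) minimizing cost_fn * FN + cost_fp * FP."""
    best = None
    for threshold in sorted(set(scores), reverse=True):
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical predicted default probabilities and true outcomes (1 = default).
scores = [0.95, 0.85, 0.75, 0.65, 0.60, 0.55, 0.45, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
best = cheapest_threshold(scores, labels, cost_fn=5, cost_fp=1)

print(f"AUC={area:.2f}")
print(f"cheapest threshold={best[0]} with total cost={best[1]}")
```

Because a missed default is assumed to cost five times as much as a false alarm, the selected threshold is low enough to catch every default in this sample, accepting two false positives in exchange.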

---

## Summary
Threshold tuning is crucial for balancing precision and recall in binary classification tasks like loan default prediction. While increasing the threshold improves precision at the expense of recall, lowering it does the opposite. ROC curves and AUC provide valuable tools for understanding model performance across thresholds and selecting an optimal threshold based on business priorities, such as minimizing financial losses or maintaining customer satisfaction.