# Exercise 1: Defining the Problem and Data Collection for Loan Default Prediction

1. Problem Statement
- Develop a machine learning model to predict the likelihood of loan default by borrowers at the time of loan application, enabling the financial institution to make informed lending decisions and minimize credit risk.

2. Specific Goals
- Predict whether a loan applicant will default (binary classification: default vs. non-default)
- Estimate the probability of default for risk-based pricing and decision-making
- Identify key risk factors that contribute to loan defaults
- Reduce financial losses from bad loans while maintaining responsible lending practices

3. Identify and list the types of data you would need for this project
- Applicant Personal Information (Age, Gender, Marital status, Number of dependents, Education level, Employment status and stability, Residential status (owner, renter, etc.), Geographic location (zip code, region))
- Financial Information (Annual income, Employment history and job tenure, Other sources of income, Existing financial obligations, Savings and assets, Bank account details (checking/savings balances), Debt-to-income ratio)
- Credit History (Credit score (FICO, VantageScore, etc.), Credit report history length, Number of credit accounts, Types of credit (revolving, installment), Payment history on previous loans, Number of late payments and their severity, Bankruptcies, foreclosures, or collections, Credit utilization ratio, Recent credit inquiries)
- Loan-Specific Information (Requested loan amount, Loan purpose (debt consolidation, home improvement, business, etc.), Loan term/duration, Interest rate, Collateral (if applicable), Loan-to-value ratio (for secured loans))
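
Two of the items listed above, debt-to-income ratio and loan-to-value ratio, are derived from other fields rather than collected directly. The sketch below shows one way they might be computed; the class and all field names are hypothetical and chosen only for illustration, not taken from any real loan system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoanApplication:
    # All field names are hypothetical, for illustration only.
    annual_income: float
    monthly_debt_payments: float
    requested_amount: float
    collateral_value: Optional[float] = None  # None for unsecured loans

    def debt_to_income(self) -> float:
        """Monthly debt payments divided by gross monthly income."""
        return self.monthly_debt_payments / (self.annual_income / 12)

    def loan_to_value(self) -> Optional[float]:
        """Requested amount divided by collateral value (secured loans only)."""
        if not self.collateral_value:
            return None
        return self.requested_amount / self.collateral_value


app = LoanApplication(annual_income=60_000, monthly_debt_payments=1_500,
                      requested_amount=160_000, collateral_value=200_000)
dti = app.debt_to_income()   # 1,500 / 5,000
ltv = app.loan_to_value()    # 160,000 / 200,000
unsecured_ltv = LoanApplication(60_000, 1_500, 10_000).loan_to_value()
```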

4. Discuss the sources where you can collect this data
- Institution's Loan Management System
- Customer Relationship Management (CRM) System
- Core Banking System
- Credit Bureaus
- Financial Data Aggregators
- Public and Government Records

# Exercise 2: Feature Selection and Model Choice for Loan Default Prediction

Instructions: identify which features might be most relevant for predicting loan defaults (Ordered from Most to Least Important)
- Applicant Income
- Loan Amount / Loan Amount Term
- Credit History
- Education
- Property Area
- Dependents / Marital Status

# Exercise 3: Training, Evaluating, and Optimizing the Model

Which model(s) would you pick for loan prediction?

Logistic Regression Model:
- Highly interpretable (regulatory requirement in many jurisdictions)
- Provides probability estimates naturally
- Fast to train and deploy
- Well-understood by business stakeholders and regulators
- Coefficients can be directly explained in adverse action notices
- Good for understanding linear relationships
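
To make the points about probability estimates and explainable coefficients concrete, here is a minimal from-scratch logistic regression trained with plain gradient descent. It is a sketch under stated assumptions: the two features, the toy data, and the hyperparameters are all hypothetical, and a real project would more likely rely on an established library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Full-batch gradient descent on the mean log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability of default for one applicant."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical toy data: [debt_to_income, scaled_late_payments]; label 1 = default
X = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.2], [0.4, 0.0],
     [0.5, 0.4], [0.6, 0.6], [0.8, 0.8], [0.9, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(X, y)
low_risk = predict_proba(w, b, [0.15, 0.0])
high_risk = predict_proba(w, b, [0.85, 0.9])
print("coefficients:", w, "intercept:", b)
print("P(default) low-risk:", low_risk, "high-risk:", high_risk)
```

Each coefficient is the change in log-odds of default per unit of its feature, which is what makes this kind of model comparatively easy to explain.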

# Exercise 4: Designing Machine Learning Solutions for Specific Problems
For each of these scenarios, decide which type of machine learning would be most suitable. Explain.

1. Predicting Stock Prices: predict future prices
- Supervised Learning; historical prices (and other market data) serve as the features, and the known later prices serve as the labels. Because the target is a continuous value, this is a regression task.
2. Organizing a Library of Books: group books into genres or categories based on similarities.
- Unsupervised Learning; we want to cluster the books by their similarities without predefined labels, which is a clear-cut case for unsupervised learning.

3. Program a robot to navigate and find the shortest path in a maze.
- Reinforcement Learning; there are no labels, only feedback from the environment. The task is goal-oriented with delayed rewards, and the robot learns an optimal policy through interaction.

# Exercise 5: Designing an Evaluation Strategy for Different ML Models

## Part 1: Supervised Learning Model - Random Forest Classifier (Fraud Detection)

### Context: Binary classification to detect fraudulent transactions

---

### Evaluation Strategy:

#### A. Performance Metrics

**Primary Metrics:**

**1. Precision**
- Formula: TP / (TP + FP)
- **Why important:** Minimizes false alarms (legitimate transactions flagged as fraud)
- High false positives annoy customers and waste investigation resources
- Target: >80%

**2. Recall (Sensitivity)**
- Formula: TP / (TP + FN)
- **Why important:** Catch as many actual frauds as possible
- Missing fraud (false negatives) leads to financial losses
- Target: >70%

**3. F1-Score**
- Harmonic mean of precision and recall
- **Why important:** Balances both metrics, especially useful for imbalanced data
- Single metric to compare models

**4. AUC-ROC (Area Under ROC Curve)**
- **Why important:** Threshold-independent evaluation
- Shows model's ability to distinguish between classes across all decision thresholds
- Target: >0.85

**5. AUC-PR (Precision-Recall Curve)**
- **Why important:** More informative than ROC for highly imbalanced datasets
- Fraud is typically <1% of transactions
- Better reflects performance on minority class

**Secondary Metrics:**

**6. Confusion Matrix**
- Visual breakdown of TP, TN, FP, FN
- Helps understand error types

**7. Specificity**
- TN / (TN + FP)
- Rate of correctly identifying legitimate transactions

**8. Matthews Correlation Coefficient (MCC)**
- Balanced metric even for imbalanced classes
- Range: -1 to +1 (higher is better)
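
The formulas above can be checked by hand from the four confusion-matrix counts. The sketch below does that in plain Python; the counts are hypothetical (10,000 transactions, 1% fraud) and the function name is illustrative. It also shows why accuracy is a weak metric here: a model that never flags fraud still reaches 99% accuracy.

```python
import math


def safe_div(num, den):
    """Return 0.0 instead of raising when a denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, fp, fn, tn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical model: catches 80 of 100 frauds, raises 20 false alarms
model = classification_metrics(tp=80, fp=20, fn=20, tn=9880)
# Baseline that always predicts "not fraud"
baseline = classification_metrics(tp=0, fp=0, fn=100, tn=9900)

for name, m in (("model", model), ("always-negative", baseline)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```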

---

#### B. Validation Methods

**1. Stratified K-Fold Cross-Validation (k=5 or 10)**
- **Why:** Maintains class distribution in each fold (crucial for imbalanced data)
- Provides robust performance estimates
- Reduces variance in metrics
- Detects overfitting

**2. Hold-Out Validation Set**
- 70% train, 15% validation, 15% test
- Validation set for hyperparameter tuning
- Test set touched only once for final evaluation

**3. Time-Based Split (if temporal data)**
- Train on older data, validate on recent data
- Simulates real deployment scenario
- Prevents data leakage

**4. Class Imbalance Handling**
- **Stratification** to maintain fraud ratio
- Consider SMOTE or undersampling for training
- Always evaluate on original class distribution
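
As an illustration of the stratification idea above, the following sketch splits indices into k folds while keeping the fraud ratio the same in every fold. It is a simplified stand-in for what library implementations do; the function name, the seed, and the 10% fraud rate in the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Return k lists of indices, each with roughly the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical labels: 10 fraud cases among 100 transactions
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
fraud_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(fraud_per_fold)  # each fold holds the same number of fraud cases
```

In each round, one fold serves as the validation set and the remaining folds form the training set. Any resampling such as SMOTE would be applied to the training folds only.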

---

#### C. Additional Evaluation Techniques

**1. ROC Curve Analysis**
- Plot TPR vs FPR at various thresholds
- Identify optimal operating point
- Compare multiple models visually
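
The area under the ROC curve has a useful equivalent reading: it is the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one, with ties counted as half. The sketch below computes it that way by comparing every positive/negative pair; this is fine for small examples but quadratic in the data size. The scores are made up for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one fraud case (0.35) is ranked below one legitimate case (0.4)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])
print(auc, perfect)
```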

**2. Precision-Recall Curve**
- More informative for imbalanced data
- Shows precision-recall tradeoff
- Find threshold that balances business needs

**3. Threshold Optimization**
- Default: 0.5
- Optimize based on business cost:
  - Cost of false positive (customer friction)
  - Cost of false negative (fraud loss)
- May use lower threshold (0.3) to catch more fraud
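
A cost-based threshold search along these lines can be sketched as follows. The scores, labels, and cost figures are hypothetical; in practice the two costs would come from the business, and the search would run on a validation set.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Business cost of flagging every transaction with score >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, scores, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))


# Hypothetical validation data (1 = fraud)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.05, 0.1, 0.2, 0.35, 0.4, 0.15, 0.9, 0.45, 0.32, 0.6]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# A missed fraud assumed to cost 20x as much as a false alarm
costly_misses = best_threshold(y_true, scores, cost_fp=1, cost_fn=20,
                               candidates=candidates)
# Both error types assumed equally costly
equal_costs = best_threshold(y_true, scores, cost_fp=1, cost_fn=1,
                             candidates=candidates)
print(costly_misses, equal_costs)
```

When missed fraud is assumed to be much more expensive than a false alarm, the search settles on a lower threshold than it does under equal costs.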

**4. Feature Importance Analysis**
- Identify most predictive features
- Ensure model isn't using spurious correlations
- Regulatory compliance and interpretability

**5. Calibration Curves**
- Check if predicted probabilities match actual frequencies
- If model says 30% fraud probability, do 30% of those actually turn out fraudulent?
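
The calibration check described above amounts to binning predictions and comparing the mean predicted probability with the observed fraud rate in each bin. A minimal sketch, with hypothetical scores constructed to be perfectly calibrated:

```python
def calibration_table(y_true, probs, n_bins=5):
    """Return (mean predicted, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical: ten transactions scored 0.1 (one fraud), ten scored 0.3 (three frauds)
probs = [0.1] * 10 + [0.3] * 10
y_true = [1] + [0] * 9 + [1, 1, 1] + [0] * 7
table = calibration_table(y_true, probs)
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

For a well-calibrated model the two columns track each other; large gaps suggest the scores should not be read as probabilities without recalibration.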

---

### Challenges and Limitations:

**1. Class Imbalance**
- Fraud might be <1% of data
- **Challenge:** High accuracy (99%) can be achieved by always predicting "not fraud"
- **Solution:** Focus on precision, recall, and AUC-PR instead of accuracy

**2. Concept Drift**
- Fraud patterns evolve over time
- **Challenge:** Model trained on old data may not catch new fraud types
- **Solution:** Regular retraining, monitoring performance degradation

**3. Threshold Selection Subjectivity**
- Different thresholds optimize different metrics
- **Challenge:** Business must decide cost tradeoff between FP and FN
- **Solution:** Involve stakeholders, calculate business cost explicitly

**4. Data Leakage Risk**
- Features that include future information
- **Challenge:** Inflated performance estimates
- **Solution:** Careful feature engineering, temporal validation

**5. Limited Interpretability**
- Random Forest is less interpretable than simpler models
- **Challenge:** Difficult to explain decisions to customers or regulators
- **Solution:** Use SHAP values, feature importance, have simpler fallback model

**6. Evaluation vs. Production Mismatch**
- Static test set vs. dynamic real-world environment
- **Challenge:** Test performance may not reflect production performance
- **Solution:** A/B testing, champion-challenger framework

---

## Part 2: Unsupervised Learning Model - K-Means Clustering (Customer Segmentation)

### Context: Segment customers into groups for targeted marketing

---

### Evaluation Strategy:

#### A. Internal Validation Metrics (No Ground Truth)

**1. Silhouette Score**
- **Range:** -1 to +1
- **Interpretation:**
  - +1: Perfect clustering (points close to own cluster, far from others)
  - 0: Overlapping clusters
  - -1: Wrong cluster assignment
- **Why important:** Measures both cohesion and separation
- Calculate per-sample and average across dataset
- **Target:** >0.5 (good), >0.7 (excellent)

**2. Elbow Method (Inertia/Within-Cluster Sum of Squares)**
- Plot inertia vs. number of clusters
- Look for "elbow" where improvement rate decreases
- **Why important:** Helps choose optimal k
- **Limitation:** Elbow not always clear, subjective interpretation
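
To show where the numbers in an elbow plot come from, the sketch below runs a bare-bones k-means for several values of k and records the inertia. Two assumptions to note: the data are three hand-made, well-separated blobs, and the initialization is a deterministic farthest-point rule chosen only to keep the example reproducible (it is not k-means++ or random restarts).

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then add the farthest ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans_inertia(points, k, iterations=20):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical data: three square blobs of four points each
points = [(0, 0), (0, 2), (2, 0), (2, 2),
          (10, 0), (10, 2), (12, 0), (12, 2),
          (5, 10), (5, 12), (7, 10), (7, 12)]
inertia = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertia.items():
    print(k, round(value, 2))
```

On data like this the inertia drops sharply up to k=3 and only slightly afterwards, which is the "elbow" the method looks for.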

**3. Davies-Bouldin Index**
- **Range:** 0 to ∞ (lower is better)
- Average ratio of within-cluster to between-cluster distances
- **Why important:** Considers both compactness and separation
- Less computationally expensive than silhouette

**4. Calinski-Harabasz Index (Variance Ratio Criterion)**
- **Range:** 0 to ∞ (higher is better)
- Ratio of between-cluster to within-cluster variance
- **Why important:** Higher values indicate better-defined clusters

**5. Dunn Index**
- Ratio of minimum inter-cluster distance to maximum intra-cluster distance
- **Why important:** Rewards compact, well-separated clusters
- **Limitation:** Sensitive to outliers
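
As a worked example of one internal metric, here is the silhouette score written out directly from its definition: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The points and labelings are hypothetical, and this quadratic implementation is meant for illustration only.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # The point itself contributes distance 0, so divide by len(own) - 1
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical customers in a 2-D feature space: two tight, distant groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])   # matches the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])    # mixes the groups
print(round(good, 3), round(bad, 3))
```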

---

#### B. Stability and Robustness Checks

**1. Clustering Stability**
- Run algorithm multiple times with different initializations
- Check if same clusters emerge
- **Why important:** Good clustering should be reproducible

**2. Subsample Stability**
- Cluster subsets of data
- Measure consistency with full dataset clustering
- **Why important:** Clusters should be stable, not artifacts of specific sample

**3. Perturbation Analysis**
- Add small noise to data
- Check if cluster assignments change drastically
- **Why important:** Robust clusters shouldn't be overly sensitive to small changes
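
One simple way to quantify these stability checks is to compare two cluster assignments of the same customers with the Rand index: the fraction of customer pairs on which both assignments agree (both put the pair together, or both put it apart). It ignores how clusters are numbered, which matters because k-means label IDs are arbitrary. The label lists below are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]   # same grouping, cluster IDs swapped
run_3 = [0, 1, 0, 1]   # a genuinely different grouping

same = rand_index(run_1, run_2)
different = rand_index(run_1, run_3)
print(same, round(different, 3))
```

The plain Rand index is not corrected for chance agreement, so it is best read as a relative measure across runs rather than an absolute one.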

---

#### C. Domain-Specific Validation

**1. Cluster Interpretation**
- Examine cluster characteristics (mean values, distributions)
- **Do clusters make business sense?**
- Can we name/describe each segment meaningfully?

**2. Cluster Size Distribution**
- Check if clusters are reasonably sized
- **Warning signs:**
  - One cluster with 95% of data (not useful)
  - Many tiny clusters (overfitting)

**3. Feature Analysis per Cluster**
- Compare feature distributions across clusters
- Are clusters distinguishable on important dimensions?

**4. Business Value Assessment**
- Can marketing campaigns be differentiated per segment?
- Do segments show different behaviors/preferences?
- **Ultimate test:** Does clustering lead to actionable insights?

---

#### D. Comparative Evaluation

**1. Compare Different k Values**
```
k=2: Silhouette=0.65, DBI=0.8
k=3: Silhouette=0.71, DBI=0.6  ← Best
k=4: Silhouette=0.68, DBI=0.7
k=5: Silhouette=0.62, DBI=0.9
```

**2. Compare Different Algorithms**
- K-Means vs. Hierarchical vs. DBSCAN
- Different algorithms may reveal different structures

**3. Gap Statistic**
- Compare clustering quality to random data
- **Why important:** Determines if structure exists at all
- Tests if data actually has natural clusters

---

### Challenges and Limitations:

**1. No Ground Truth**
- **Challenge:** No "correct answer" to validate against
- Unlike supervised learning, can't calculate accuracy
- **Solution:** Use multiple metrics, domain expertise, business validation

**2. Metric Disagreement**
- Different metrics may suggest different optimal k
- **Challenge:** Silhouette says k=3, elbow suggests k=5
- **Solution:** Consider business context, not just mathematical optimum

**3. Curse of Dimensionality**
- Many features make distance metrics less meaningful
- **Challenge:** Silhouette and distance-based metrics become unreliable in high dimensions
- **Solution:** Dimensionality reduction (PCA) before clustering, feature selection

**4. Subjectivity in Interpretation**
- **Challenge:** Two analysts might interpret same clusters differently
- What makes clusters "good" depends on use case
- **Solution:** Clear business objectives, stakeholder involvement

**5. Algorithm Assumptions**
- K-Means assumes spherical clusters of similar size
- **Challenge:** Real data may have irregular shapes
- **Solution:** Try multiple algorithms, visual inspection

**6. Local Optima**
- K-Means sensitive to initialization
- **Challenge:** Different runs give different results
- **Solution:** Multiple random initializations, k-means++ initialization

**7. Scalability of Validation Metrics**
- Silhouette score O(n²) complexity
- **Challenge:** Computationally expensive for large datasets
- **Solution:** Sample-based approximation, use faster metrics like DBI

**8. Determining "True" Number of Clusters**
- **Challenge:** Real data may not have clear natural clustering
- Continuous spectrum rather than discrete groups
- **Solution:** Accept that k is somewhat arbitrary, choose based on business utility

---

## Part 3: Reinforcement Learning Model - Q-Learning Agent (Game Playing / Navigation)

### Context: Agent learning to navigate maze or play game

---

### Evaluation Strategy:

#### A. Performance Metrics

**1. Cumulative Reward**
- **Definition:** Total reward accumulated over episode
- **Why important:** Primary objective of RL - maximize long-term reward
- Track across training episodes to monitor improvement
- **Analysis:**
  - Mean cumulative reward
  - Variance in rewards (consistency)
  - Maximum achieved reward

**2. Average Episode Reward**
- Mean reward per episode over evaluation period
- **Why important:** Normalized measure of performance
- Compare across different evaluation windows

**3. Success Rate**
- Percentage of episodes where goal is achieved
- **Why important:** Binary measure of task completion
- Example: % of times agent reaches maze exit

**4. Episode Length (Steps to Goal)**
- Number of actions taken per episode
- **Why important:** 
  - Shorter = more efficient policy
  - Combined with success rate shows quality of solution
- Track minimum, average, and maximum

**5. Reward Per Step**
- Average reward earned per action
- **Why important:** Efficiency metric - achieving goals with less waste
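
Given a log of evaluation episodes, the metrics above reduce to a few aggregates. The sketch below assumes a hypothetical log format (a list of dicts with `reward`, `steps`, and `reached_goal` keys); the values are invented for illustration.

```python
from statistics import mean


def summarize_episodes(episodes):
    """Aggregate per-episode logs into the headline evaluation metrics."""
    successes = [e for e in episodes if e["reached_goal"]]
    total_reward = sum(e["reward"] for e in episodes)
    total_steps = sum(e["steps"] for e in episodes)
    return {
        "mean_episode_reward": total_reward / len(episodes),
        "success_rate": len(successes) / len(episodes),
        # Episode length is only meaningful for episodes that reached the goal
        "mean_steps_to_goal": mean(e["steps"] for e in successes) if successes else None,
        "reward_per_step": total_reward / total_steps,
    }


# Hypothetical evaluation log: three successful runs and one timeout
episodes = [
    {"reward": 0.9, "steps": 10, "reached_goal": True},
    {"reward": 0.8, "steps": 20, "reached_goal": True},
    {"reward": -1.0, "steps": 50, "reached_goal": False},
    {"reward": 0.9, "steps": 10, "reached_goal": True},
]
summary = summarize_episodes(episodes)
print(summary)
```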

---

#### B. Learning Progress Metrics

**1. Learning Curve**
- Plot cumulative/average reward vs. training episodes
- **What to look for:**
  - Upward trend (learning)
  - Plateau (convergence)
  - Fluctuations (exploration noise)

**2. Convergence Analysis**
- Has policy stabilized?
- **Methods:**
  - Moving average of rewards flattens
  - Policy changes become minimal
  - Q-value updates become small
- **Why important:** Know when to stop training

**3. Exploration Rate Decay**
- Track epsilon (ε) in ε-greedy policy
- **Why important:** Should decrease over time as agent gains experience
- Verify exploration schedule is appropriate

**4. Q-Value Evolution**
- Monitor Q-value estimates for key state-action pairs
- **Why important:** Should converge to stable values
- Diverging Q-values indicate instability

**5. Loss/TD Error**
- Track temporal difference error over time
- **Why important:** Should decrease as Q-values converge
- Large persistent errors suggest learning problems
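
A smoothed learning curve and a basic plateau test can be sketched as below. The window size, tolerance, and reward series are hypothetical, and comparing the last two windows is only one of several reasonable convergence heuristics.

```python
def moving_average(values, window):
    """Trailing moving average; output is shorter than input by window - 1."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def has_plateaued(values, window, tolerance):
    """True if the means of the last two non-overlapping windows are close."""
    if len(values) < 2 * window:
        return False
    recent = sum(values[-window:]) / window
    previous = sum(values[-2 * window:-window]) / window
    return abs(recent - previous) < tolerance


# Hypothetical per-episode rewards: steady improvement, then a flat stretch
rewards = list(range(10)) + [10] * 10

smoothed = moving_average([1, 2, 3, 4], window=2)
still_learning = has_plateaued(rewards[:10], window=5, tolerance=0.1)
converged = has_plateaued(rewards, window=5, tolerance=0.1)
print(smoothed, still_learning, converged)
```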

---

#### C. Policy Quality Assessment

**1. Deterministic Evaluation**
- Turn off exploration (ε=0), use greedy policy
- **Why important:** See what agent actually learned, without random actions
- Run multiple episodes to average out environment stochasticity
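
The sketch below ties several of these ideas together: tabular Q-learning with a decaying ε-greedy policy during training, followed by a greedy (ε=0) evaluation run. The environment is a hypothetical five-cell corridor with the goal at the right end, and the rewards, learning rate, discount factor, and decay schedule are assumptions chosen for the example.

```python
import random

N_STATES, GOAL = 5, 4      # corridor cells 0..4, goal at the right end
ACTIONS = (-1, +1)         # move left, move right


def step(state, action):
    """Hypothetical environment: -0.01 per move, +1.0 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == GOAL:
        return next_state, 1.0, True
    return next_state, -0.01, False


def greedy_action(q, state):
    return max(range(len(ACTIONS)), key=lambda a: q[state][a])


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    epsilon = 1.0
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # step cap per episode
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))   # explore
            else:
                a = greedy_action(q, state)       # exploit
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: bootstrap from the best next action
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
        epsilon = max(0.05, epsilon * 0.99)  # decay exploration over time
    return q


def evaluate_greedy(q, max_steps=50):
    """Run one episode with exploration switched off (epsilon = 0)."""
    state, total, steps = 0, 0.0, 0
    while steps < max_steps:
        state, reward, done = step(state, ACTIONS[greedy_action(q, state)])
        total += reward
        steps += 1
        if done:
            return True, steps, total
    return False, steps, total


q_table = train()
reached_goal, n_steps, total_reward = evaluate_greedy(q_table)
print(reached_goal, n_steps, round(total_reward, 2))
```

Because this environment is deterministic, one greedy episode is enough here; in a stochastic environment the evaluation would be repeated and averaged.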

**2. Policy Stability**
- How much does policy change between evaluations?
- **Why important:** Stable policy indicates convergence
- Measure: percentage of state-action pairs that change

**3. Optimal Trajectory Analysis**
- Compare agent's path to known optimal solution (if available)
- **Why important:** Benchmark against best possible performance
- Measure deviation from optimality

**4. State Coverage**
- What percentage of state space has agent visited?
- **Why important:** 
  - Poor coverage suggests insufficient exploration
  - Some areas never learned
- Visualize visited states (if low-dimensional)

---

#### D. Exploration vs. Exploitation Balance

**1. Exploration Efficiency**
- How quickly does agent discover high-reward states?
- Time to first goal achievement
- **Why important:** Measures exploration strategy effectiveness

**2. Exploitation Verification**
- Does agent consistently choose best known actions?
- Compare greedy vs. ε-greedy performance
- **Why important:** Ensure agent isn't over-exploring late in training

**3. Regret Analysis**
- Cumulative difference between optimal and achieved rewards
- **Why important:** Quantifies cost of learning
- Lower regret = better exploration-exploitation balance
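
When the optimal per-episode reward is known, cumulative regret is a running sum of the shortfall. The sketch below assumes a hypothetical optimum of 0.97 per episode and an invented reward series.

```python
from itertools import accumulate


def cumulative_regret(optimal_reward, achieved_rewards):
    """Running sum of (optimal - achieved) over episodes."""
    return list(accumulate(optimal_reward - r for r in achieved_rewards))


# Hypothetical per-episode rewards for an agent that gradually improves
achieved = [0.2, 0.5, 0.8, 0.95, 0.97]
regret = cumulative_regret(0.97, achieved)
print([round(r, 2) for r in regret])
```

A regret curve that flattens out indicates the agent has stopped paying a cost for learning; one that keeps climbing indicates it is still acting suboptimally.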

**4. Visit Count Distribution**
- Frequency of visiting each state
- **Why important:** 
  - Uniform = good exploration
  - Concentrated = exploitation or poor exploration

---

#### E. Robustness and Generalization

**1. Environment Variation Testing**
- Test in slightly different environments
- Add noise or perturbations
- **Why important:** Agent shouldn't be overfitted to training environment

**2. Different Initial States**
- Start from various positions
- **Why important:** Policy should work from any valid starting point

**3. Transfer Learning Assessment**
- Test on related but different tasks
- **Why important:** Indicates if agent learned general principles

**4. Adversarial Testing**
- Introduce worst-case scenarios
- **Why important:** Identify failure modes and edge cases

---

#### F. Comparison Baselines

**1. Random Policy**
- Agent taking random actions
- **Why important:** Minimum baseline - learned policy must beat this

**2. Heuristic/Rule-Based Policy**
- Hand-crafted solution
- **Why important:** Shows if RL provides value over simpler approaches

**3. Optimal Policy (if computable)**
- Known best solution
- **Why important:** Upper bound on performance

**4. Different RL Algorithms**
- Compare Q-Learning vs. SARSA vs. DQN
- **Why important:** Validate algorithm choice

---

### Challenges and Limitations:

**1. Credit Assignment Problem**
- **Challenge:** Long delay between action and reward
- Hard to determine which actions were responsible for success/failure
- **Impact on evaluation:** 
  - Slow learning manifests as flat learning curves
  - Difficult to diagnose specific problems

**2. Exploration-Exploitation Tradeoff**
- **Challenge:** Too much exploration = suboptimal performance during training
- Too little exploration = missing better strategies
- **Impact on evaluation:**
  - Training performance (with exploration) ≠ learned policy quality
  - Must evaluate with exploration OFF for true assessment
  - Choosing exploration schedule affects results

**3. High Variance in Performance**
- **Challenge:** RL training is inherently noisy
- Stochastic environment and policy lead to variable results
- **Impact on evaluation:**
  - Single episode performance unreliable
  - Need many episodes to get stable estimates
  - Require multiple training runs with different seeds

**4. Sample Efficiency**
- **Challenge:** RL often requires millions of interactions
- Expensive in real-world applications
- **Impact on evaluation:**
  - Long training times make iteration slow
  - Difficult to do extensive hyperparameter search
  - May need to evaluate before full convergence

**5. Non-Stationary Learning**
- **Challenge:** The learning target (bootstrapped value estimates) and the data the agent collects both shift as its policy changes
- Unlike supervised learning with fixed dataset
- **Impact on evaluation:**
  - Can't use traditional train/test split
  - Early performance doesn't predict final performance
  - Learning curves may show temporary degradation

**6. Reward Engineering**
- **Challenge:** Reward function dramatically affects learning
- Poor rewards lead to unintended behaviors
- **Impact on evaluation:**
  - High reward doesn't always mean desired behavior
  - Agent may exploit reward function
  - Need to monitor actual behavior, not just rewards

**7. Sparse Rewards**
- **Challenge:** Reward only at goal, nothing during exploration
- Agent may never discover reward signal
- **Impact on evaluation:**
  - Initial performance may be zero for long periods
  - Difficult to distinguish "not learned" from "can't learn"
  - May need reward shaping to evaluate learning progress

**8. Partial Observability**
- **Challenge:** Agent may not see full state
- Memory/history needed for optimal decisions
- **Impact on evaluation:**
  - Performance ceiling limited by observability
  - Hard to distinguish poor learning from fundamental limitations

**9. Catastrophic Forgetting**
- **Challenge:** Learning new skills may degrade previously learned ones
- Especially in continual learning scenarios
- **Impact on evaluation:**
  - Performance on old tasks may decrease
  - Need to monitor multiple metrics simultaneously

**10. Evaluation Environment Mismatch**
- **Challenge:** Evaluation often happens in the same environment used for training
- May not reflect real-world performance
- **Impact on evaluation:**
  - Overfitting to training environment
  - Need separate test environments

**11. Reproducibility Issues**
- **Challenge:** Sensitive to random seeds, initialization, hyperparameters
- **Impact on evaluation:**
  - Results may not replicate
  - Need multiple runs with statistical testing
  - Publication requires careful documentation

**12. No Ground Truth Optimal Policy**
- **Challenge:** Usually don't know theoretical optimal performance
- **Impact on evaluation:**
  - Can't calculate optimality gap
  - Don't know how much improvement is possible
  - Hard to know when to stop training

---

## Summary Comparison Table

| Aspect | Supervised | Unsupervised | Reinforcement |
|--------|------------|--------------|---------------|
| **Ground Truth** | ✅ Yes (labels) | ❌ No | ⚠️ Indirect (rewards) |
| **Primary Metric** | Precision/Recall/F1 | Silhouette/DBI | Cumulative Reward |
| **Validation Method** | Train/Test Split | Internal metrics | Episode evaluation |
| **Main Challenge** | Class imbalance | No correct answer | High variance |
| **Evaluation Clarity** | High | Medium | Low |
| **Interpretability** | Direct | Subjective | Behavior-based |

---

## Conclusion

Each model type requires fundamentally different evaluation approaches tailored to their unique characteristics and challenges. The key is to:

1. **Supervised Learning:** Use multiple metrics beyond accuracy, especially for imbalanced data
2. **Unsupervised Learning:** Combine mathematical metrics with domain knowledge and business value
3. **Reinforcement Learning:** Monitor learning progress over time and evaluate with exploration disabled