# **Theoretical Questions:**

# **1. What is Ensemble Learning in Machine Learning? Explain the key idea behind it.**

- Ensemble learning is a machine learning technique that combines multiple individual models, often called **base learners** or **weak learners**, into a single, more accurate predictive model.  
The main idea is that a group of weak models, when combined properly, can often outperform any one of its members, and frequently a single strong model as well.

In ensemble learning, each model contributes to the final prediction, and errors made by one model can be offset by the others, provided the models make different mistakes.  
This typically leads to better generalization, higher accuracy, and reduced overfitting.

There are three main types of ensemble methods:

1. **Bagging (Bootstrap Aggregating):**
   - Multiple models are trained independently on random subsets of the data (with replacement).
   - Final prediction is made by averaging (for regression) or voting (for classification).
   - Example: **Random Forest**

2. **Boosting:**
   - Models are trained sequentially, where each new model focuses more on correcting the errors of the previous ones.
   - Example: **AdaBoost, Gradient Boosting, XGBoost**

3. **Stacking:**
   - Multiple models (base learners) make predictions, and a meta-model learns to combine these predictions optimally.

**Key Idea:**  
> “The wisdom of the crowd” — combining multiple diverse models leads to a more accurate and robust overall prediction than relying on a single model.
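
The "wisdom of the crowd" effect can be illustrated with a small simulation. The sketch below rests on idealized assumptions that are not part of the answer above: 25 classifiers, each correct 60% of the time, with fully independent errors. Real base learners are correlated, so the gain in practice is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10_000   # number of simulated test points (assumed)
n_models = 25        # odd number of voters avoids ties (assumed)
p_correct = 0.60     # accuracy of each individual model (assumed)

# correct[i, j] is True when model j classifies sample i correctly.
# Independence between models is the key (idealized) assumption.
correct = rng.random((n_samples, n_models)) < p_correct

individual_acc = correct.mean(axis=0).mean()
# A majority vote is right when more than half of the models are right.
majority_acc = (correct.sum(axis=1) > n_models / 2).mean()

print(f"Average individual accuracy: {individual_acc:.3f}")
print(f"Majority-vote accuracy:      {majority_acc:.3f}")
```

The majority vote should come out clearly more accurate than the average individual model, which is the intuition behind voting in Bagging.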


# **2. What is the difference between Bagging and Boosting?**

- **Bagging (Bootstrap Aggregating)** and **Boosting** are both ensemble learning techniques used to improve model accuracy by combining multiple models,  
but they differ in how these models are trained and combined.

| Feature | Bagging | Boosting |
|----------|----------|-----------|
| **Objective** | Reduce variance and prevent overfitting | Reduce bias and improve weak learners |
| **Training Method** | Models are trained **independently** on random subsets of data (with replacement) | Models are trained **sequentially**, each new model focuses on the errors of the previous one |
| **Data Sampling** | Uses **bootstrap sampling** (random sampling with replacement) | Typically uses the **entire dataset**; AdaBoost assigns higher weights to misclassified samples, while gradient boosting fits each new model to the remaining errors |
| **Model Weighting** | All models have **equal weight** in the final prediction | Models are **weighted based on their performance** |
| **Error Handling** | Each model works independently, so errors are averaged out | Each subsequent model tries to **correct** the previous model’s errors |
| **Overfitting Tendency** | Less prone to overfitting | More prone to overfitting if not regularized properly |
| **Examples** | Random Forest | AdaBoost, Gradient Boosting, XGBoost |

**In summary:**  
- **Bagging** reduces **variance** by averaging predictions from independent models.  
- **Boosting** reduces **bias** by sequentially improving weak models to create a strong learner.


# **3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

- **Bootstrap sampling** is a statistical technique in which multiple random samples are drawn **with replacement** from the original dataset.  
Each sample (called a **bootstrap sample**) has the same size as the original dataset, but because of replacement, some data points may appear multiple times, while others may not appear at all.

**Role in Bagging (e.g., Random Forest):**

1. **Diversity Creation:**  
   - Each model (like a decision tree in Random Forest) is trained on a different bootstrap sample.
   - This introduces variability among the models, making them less correlated.

2. **Variance Reduction:**  
   - Since each model is trained on a slightly different dataset, their errors tend to cancel each other out when combined.
   - The final prediction (by averaging or voting) becomes more stable and less sensitive to noise.

3. **Improved Generalization:**  
   - By combining diverse models trained on different samples, the ensemble generalizes better to unseen data.

**In summary:**  
> Bootstrap sampling allows Bagging methods like Random Forest to create multiple diverse training datasets from the same data,  
> leading to a robust ensemble that reduces variance and improves predictive performance.
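
A short NumPy sketch makes the "some points repeat, others are left out" behaviour concrete. The dataset size is an arbitrary choice for this example; for large samples, the share of distinct points in a bootstrap sample approaches 1 - 1/e (about 63.2%).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                      # size of the toy dataset (assumed)
indices = np.arange(n)

# Draw n indices with replacement: one bootstrap sample.
bootstrap_idx = rng.choice(indices, size=n, replace=True)

unique_fraction = np.unique(bootstrap_idx).size / n
oob_fraction = 1.0 - unique_fraction   # points never drawn (out-of-bag)

print(f"Unique points in bootstrap sample: {unique_fraction:.3f}")
print(f"Points left out of the sample:     {oob_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e:         {1 - np.exp(-1):.3f}")
```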


# **4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

- In **Bagging methods** such as **Random Forest**, each base model (e.g., decision tree) is trained on a **bootstrap sample** of the dataset —  
that is, a random sample **with replacement**. As a result, some data points are not included in this sample.

These data points that are **not selected** in the bootstrap sample for a particular model are called **Out-of-Bag (OOB) samples**.

### **Role of OOB Samples:**
- OOB samples act as a **built-in validation set** for each model.
- Since a model hasn’t seen its OOB samples during training, they can be used to test that model’s performance.

### **OOB Score:**
- The **OOB score** is the accuracy (or error) obtained when each training sample is predicted using only the models that did not see it during training, aggregated over all samples.
- It provides a nearly **unbiased estimate of the model’s performance** without needing a separate validation or test set.

### **Advantages of Using OOB Score:**
1. Eliminates the need for a separate validation dataset.
2. Provides an efficient and quick way to estimate model accuracy.
3. Helps detect overfitting in ensemble models.

**In summary:**  
> **OOB samples** are the unused data points in bootstrap sampling, and the **OOB score** evaluates model accuracy using those samples,  
> serving as an internal cross-validation method in ensemble techniques like Random Forest.
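
In scikit-learn, the OOB score is available by passing `oob_score=True` to `RandomForestClassifier`. The sketch below compares it with accuracy on a held-out test set; the dataset, split, and number of trees are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# oob_score=True asks the forest to score each training sample using only
# the trees whose bootstrap sample did not contain it.
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close, which is why the OOB score is often used as a cheap stand-in for a validation set.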


# **5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

- **Feature importance** measures how much each feature contributes to predicting the target variable.  
Both **Decision Tree** and **Random Forest** provide feature importance scores, but they differ in how these scores are calculated and interpreted.

| Aspect | Decision Tree | Random Forest |
|---------|----------------|----------------|
| **Model Type** | Single model | Ensemble of many decision trees |
| **Computation Method** | Calculated based on the **reduction in impurity** (e.g., Gini or entropy) caused by each feature during splits | Computed as the **average of feature importance scores** across all trees in the forest |
| **Stability** | Can be **unstable** — small changes in data can lead to large differences in feature importance | More **stable and reliable**, as it aggregates results from many trees |
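| **Bias** | Impurity-based scores may be **biased** toward continuous features or categorical features with many levels | The same bias **persists** for impurity-based scores, because averaging over trees reduces variance rather than bias; permutation importance is a common alternative |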
| **Interpretation** | Shows how important a feature is **for one specific tree** | Shows the **overall importance** of each feature across the entire ensemble |
| **Overfitting Tendency** | High — single tree may overfit, giving misleading importance | Lower — averaging across trees gives more **generalized importance** |

**In summary:**  
> A **Decision Tree** provides feature importance for a single model and may be unstable,  
> whereas a **Random Forest** gives a more **robust and generalizable** estimate of feature importance by averaging across many trees.
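
The stability difference can be checked empirically. The sketch below refits a single tree and a forest on several bootstrap resamples of the Breast Cancer data and measures how much each feature's importance moves between refits. The number of resamples and trees is arbitrary, and the spread measure is just one simple way to quantify stability.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

tree_imps, forest_imps = [], []
for seed in range(8):
    # Refit both models on a fresh bootstrap resample of the data.
    X_rs, y_rs = resample(X, y, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(X_rs, y_rs)
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_rs, y_rs)
    tree_imps.append(tree.feature_importances_)
    forest_imps.append(forest.feature_importances_)

# Standard deviation of each feature's importance across resamples,
# averaged over features: lower means more stable importance scores.
tree_spread = np.std(tree_imps, axis=0).mean()
forest_spread = np.std(forest_imps, axis=0).mean()
print(f"Mean importance spread, single tree:   {tree_spread:.4f}")
print(f"Mean importance spread, random forest: {forest_spread:.4f}")
```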


# **Practical Questions:**

In [None]:
# 6) Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.
# (Include your Python code and output in the code box below.)


from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importances
importances = pd.Series(model.feature_importances_, index=data.feature_names)

# Sort and get top 5 features
top_features = importances.sort_values(ascending=False).head(5)

# Print top 5 most important features
print("Top 5 Most Important Features:")
print(top_features)




Top 5 Most Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [None]:
# 7) Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

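# (Optional) Add slight noise to make the problem less trivial.
# Note: the noise is unseeded, so the accuracies vary from run to run;
# call np.random.seed(...) first for reproducible results.
X = X + np.random.normal(0, 0.2, X.shape)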

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)

# Train a Bagging Classifier using Decision Trees (with limited depth)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    random_state=42
)
bagging_model.fit(X_train, y_train)
bagging_predictions = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)

# Print the accuracies
print("Accuracy of Single Decision Tree: {:.2f}%".format(dt_accuracy * 100))
print("Accuracy of Bagging Classifier: {:.2f}%".format(bagging_accuracy * 100))




Accuracy of Single Decision Tree: 95.56%
Accuracy of Bagging Classifier: 95.56%


In [None]:
# 8) Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy



from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

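# Add slight random noise to make classification less trivial.
# Note: the noise is unseeded, so the best parameters and accuracy vary
# from run to run; call np.random.seed(...) first for reproducible results.
X = X + np.random.normal(0, 0.2, X.shape)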

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define the model
rf = RandomForestClassifier(random_state=42)

# Define the parameter grid for tuning
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [2, 4, 6, 8, None]
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best model
best_rf = grid_search.best_estimator_

# Make predictions
y_pred = best_rf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the results
print("Best Parameters Found:", grid_search.best_params_)
print("Final Accuracy: {:.2f}%".format(accuracy * 100))



Best Parameters Found: {'max_depth': 4, 'n_estimators': 50}
Final Accuracy: 95.56%


In [None]:
# 9) Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# ● Compare their Mean Squared Errors (MSE)


from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Bagging Regressor with shallow trees
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=6),
    n_estimators=50,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

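# Train a Random Forest Regressor with the same depth.
# Note: RandomForestRegressor considers all features at each split by default,
# so this setup behaves much like the bagging model above; lower max_features
# (e.g. max_features="sqrt") to decorrelate the trees.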
rf_reg = RandomForestRegressor(
    n_estimators=50,
    max_depth=6,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print and compare the Mean Squared Errors
print("Mean Squared Error (Bagging Regressor): {:.4f}".format(bagging_mse))
print("Mean Squared Error (Random Forest Regressor): {:.4f}".format(rf_mse))



Mean Squared Error (Bagging Regressor): 0.4126
Mean Squared Error (Random Forest Regressor): 0.4126


# **10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:**
- Choose between Bagging or Boosting
- Handle overfitting
- Select base models
- Evaluate performance using cross-validation
- Justify how ensemble learning improves decision-making in this real-world context.


- Ensemble approach for predicting loan default — step-by-step

Context:
We have customer demographic + transaction-history data and must predict loan default (binary classification). Business constraints: class imbalance, regulatory need for explainability, cost of false negatives (missed defaults) vs false positives (unnecessary denial).

1) Choosing between Bagging and Boosting

Decision rule:
- Use **Bagging (e.g., Random Forest)** when:
  - The main problem is **high variance** (models overfit to training data).
  - Data may contain noisy labels, and we want robust estimates.
  - We want simpler hyperparameter tuning and faster parallel training.
- Use **Boosting (e.g., Gradient Boosting, XGBoost, LightGBM, CatBoost)** when:
  - The main problem is **high bias** (weak learners underfit).
  - We want to squeeze maximum predictive accuracy from features.
  - We can control overfitting (early stopping, regularization) and accept sequential training.

Practical choice for loan default:
- Start with **Random Forest** as a baseline (robust, interpretable feature importances).
- Use **Boosted Trees** (LightGBM/XGBoost/CatBoost) next to improve performance and capture subtle patterns.
- Final choice: ensemble (stack) of both families if regulatory & compute budgets allow.
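
One hedged way to apply the decision rule is to compare training and cross-validated scores of a simple model: a large gap points to high variance, while low scores with a small gap point to high bias. The sketch below uses synthetic data as a stand-in for the loan data; `diagnose` and its `gap_threshold` are hypothetical helpers for illustration, not an established rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: ~10% positives).
X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

def diagnose(model, X, y, gap_threshold=0.05):
    """Return mean train/validation ROC-AUC and a rough bias/variance hint.

    gap_threshold is an illustrative cut-off, not an established rule.
    """
    res = cross_validate(
        model, X, y, cv=5, scoring="roc_auc", return_train_score=True
    )
    train = res["train_score"].mean()
    valid = res["test_score"].mean()
    if train - valid > gap_threshold:
        hint = "high variance: bagging is a natural first try"
    else:
        hint = "small gap: consider boosting if the score is too low"
    return train, valid, hint

for depth in (None, 2):
    train, valid, hint = diagnose(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y
    )
    print(f"max_depth={depth}: train={train:.3f}, valid={valid:.3f} -> {hint}")
```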

2) Handling overfitting

Techniques to apply (training & model-level):
- **Data level**
  - More data if possible (historical transactions).
  - Feature engineering & careful cross-validation — avoid leakage (time-based CV for temporal features).
- **Model regularization**
  - For Decision Trees / RF: limit `max_depth`, `min_samples_leaf`, `min_samples_split`.
  - For Boosting: use `learning_rate` (small), `n_estimators` with early stopping on validation, `max_depth`, `subsample`, `colsample_bytree`.
- **Sampling**
  - Use bootstrap + subsample features (RF) or row/feature subsampling (boosting) to reduce correlation.
- **Early stopping**
  - Monitor validation metric (AUC/PR-AUC/cost metric) and stop when no improvement.
- **Feature selection / dimensionality reduction**
  - Remove leakage features, use domain-driven features, or apply regularized models to select features.
- **Ensemble averaging / stacking**
  - Blend multiple model families to reduce variance.
- **Model stability checks**
  - Retrain with different seeds and confirm consistent performance.
- **Calibration**
  - After training, calibrate probabilities (Platt scaling, isotonic) to avoid overconfident predictions.
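
The regularization and calibration items can be combined in a few lines. The sketch below limits tree depth and leaf size in a Random Forest, then wraps it in `CalibratedClassifierCV` with isotonic regression and compares Brier scores. The synthetic data and all hyperparameter values are assumptions made for this example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumption).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=8,
    weights=[0.93, 0.07], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Regularized forest: depth and leaf-size limits curb overfitting.
raw = RandomForestClassifier(
    n_estimators=100, max_depth=8, min_samples_leaf=5,
    class_weight="balanced_subsample", random_state=42,
)
raw.fit(X_train, y_train)

# Isotonic calibration, fitted with internal 3-fold CV on the training data.
calibrated = CalibratedClassifierCV(raw, method="isotonic", cv=3)
calibrated.fit(X_train, y_train)

brier_raw = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])
brier_cal = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
print(f"Brier score, uncalibrated: {brier_raw:.4f}")
print(f"Brier score, calibrated:   {brier_cal:.4f}")
```

A lower Brier score means the predicted probabilities are closer to the observed outcomes. Class weighting tends to distort raw probabilities, which is one reason calibration is worth checking here.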

3) Selecting base models

Practical candidates:
- **Decision Tree** (as base learner for bagging; shallow trees for robustness)
- **Random Forest** (bagging ensemble) — good baseline
- **Gradient Boosted Trees** (XGBoost / LightGBM / CatBoost) — high accuracy for tabular data
- **Logistic Regression with regularization** — strong, interpretable baseline; good for scorecards
- **Simple neural networks** or **MLP** only if many high-dim engineered features exist

How to choose:
- Start simple (Logistic) → baseline.
- Add Random Forest for variance reduction.
- Add a tuned GBT for top accuracy.
- Optionally use stacking: meta-learner (regularized LR) that combines RF + GBT + LR predictions.

4) Evaluate performance using cross-validation

Design CV carefully to reflect production use:
- **Stratified K-fold** for balanced class representation (e.g., stratified 5 or 10 folds).
- If data is time-ordered (transactions), use **time-based CV** (walk-forward / rolling window) to prevent leakage.
- Use **nested CV** for robust hyperparameter tuning:
  - Outer loop: estimate generalization (k folds).
  - Inner loop: hyperparameter tuning (grid/random/optuna).
- **Metrics**:
  - Primary: **AUC-ROC** and **Precision-Recall AUC** (PR-AUC is critical if positives are rare).
  - Business metrics: **Expected Loss**, **Cost-weighted error**, or **Profit/Loss** based on credit decision cost matrix.
  - Secondary: Accuracy, F1-score, but avoid relying solely on accuracy for imbalanced data.
- **Probability calibration & evaluation**:
  - Use **Brier score**, **calibration plots**, and **reliability diagrams** to check predicted probabilities.
- **Stability & fairness checks**:
  - Evaluate across cohorts (age groups, geography, income bands) for performance drift or bias.
- **Confidence intervals & statistical tests**:
  - Use bootstrapping or repeated CV to get confidence intervals for metrics; compare models statistically before deployment.
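
Nested cross-validation can be sketched by passing a `GridSearchCV` object to `cross_val_score`: the inner loop tunes hyperparameters, and the outer loop scores the tuned model on folds it never saw during tuning. The data, grid, and fold counts below are deliberately small assumptions to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the loan data (assumption).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search, scored by PR-AUC (average precision).
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [4, 8], "min_samples_leaf": [2, 10]},
    scoring="average_precision",
    cv=inner_cv,
)

# Outer loop: each fold refits the whole search on its training part
# and evaluates the selected model on the held-out part.
nested_scores = cross_val_score(
    search, X, y, cv=outer_cv, scoring="average_precision"
)
print(f"Nested CV PR-AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

For time-ordered transactions, the stratified splitters would be replaced by a time-aware splitter such as `TimeSeriesSplit`.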

5) Practical pipeline (high-level)

- Data ingestion & cleaning (handle missing values, outliers, feature engineering).
- Create transaction-level aggregates (recency, frequency, monetary, behavioral trends).
- Split: time-ordered holdout or stratified CV.
- Baseline: train Logistic Regression (regularized) → evaluate.
- Train Random Forest (baseline ensemble) → tune `n_estimators`, `max_depth`, `min_samples_leaf`.
- Train Gradient Boosted model → tune `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`.
- If helpful, train `BaggingClassifier` variants (this is a classification task, so a bagging regressor does not apply).
- Evaluate (nested CV), calibrate probabilities, and compute business KPIs (expected loss).
- Perform explainability: SHAP / LIME / tree feature importances and produce per-decision explanations for regulatory records.
- Model validation: backtesting on recent unseen months.
- Deployment: monitor performance & drift; schedule model retraining.

6) Justify how ensemble learning improves decision-making (business-ready)

- **Higher predictive performance**: Ensembles (bagging and boosting) reduce variance/bias and typically produce more accurate risk scores than single models — leading to fewer missed defaults and fewer unnecessary denials.
- **Robustness**: Bagging reduces sensitivity to noisy training examples; boosting captures subtle patterns. Combined, they yield stable predictions across customers.
- **Better probability estimates (after calibration)**: Ensembles often produce more reliable ranking of risk; calibrated probabilities support better thresholding for lending decisions and consistent provisioning.
- **Risk-adjusted decisions**: Improved discrimination allows fine-grained decision rules (e.g., accept/accept-with-conditions/reject), optimizing expected portfolio return and loss.
- **Model explainability (operationalized)**: Tree-based ensembles allow feature importance and SHAP explanations which can be recorded for regulatory audits and for explaining adverse actions to customers.
- **Operational benefits**: Reduced false negatives (missed defaulters) lowers expected credit losses. Reduced false positives improves customer experience and increases revenue.
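
The link between model scores and lending decisions can be made explicit with a cost-based threshold. The sketch below picks the rejection threshold that minimizes average cost on a toy example. The helper names `expected_cost` and `best_threshold`, the cost values, and the data are all hypothetical; real costs would come from the institution's loss and revenue figures.

```python
import numpy as np

def expected_cost(y_true, y_prob, threshold, cost_fn=10.0, cost_fp=1.0):
    """Average cost per applicant when rejecting everyone with y_prob >= threshold.

    cost_fn: loss from approving a loan that defaults (missed default).
    cost_fp: loss from rejecting a customer who would have repaid.
    Both values are illustrative assumptions.
    """
    y_true = np.asarray(y_true)
    reject = np.asarray(y_prob) >= threshold
    false_neg = np.sum((y_true == 1) & ~reject)
    false_pos = np.sum((y_true == 0) & reject)
    return (cost_fn * false_neg + cost_fp * false_pos) / y_true.size

def best_threshold(y_true, y_prob, grid=None, **costs):
    """Return the grid threshold with the lowest expected cost."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    costs_per_t = [expected_cost(y_true, y_prob, t, **costs) for t in grid]
    return float(grid[int(np.argmin(costs_per_t))])

# Toy example with hand-made labels (1 = default) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.04, 0.11, 0.21, 0.29, 0.32, 0.62, 0.72, 0.88])

print("Cost at threshold 0.5:", expected_cost(y_true, y_prob, 0.5))
print("Cost-minimizing threshold:", round(best_threshold(y_true, y_prob), 2))
```

Because a missed default is assumed to cost ten times more than an unnecessary denial, the cost-minimizing threshold falls well below 0.5 in this toy example.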

7) Additional operational & regulatory considerations

- **Audit trail**: log features and model version for every decision.
- **Bias & fairness checks**: test for disparate impact; implement mitigation strategies if required.
- **Model governance**: document the model lifecycle, approvals, and performance monitoring plan.
- **Monitoring**: track input feature distributions, performance metrics, calibration, and business KPIs. Trigger alerts and retraining if drift detected.
- **Explainability for decisions**: produce short textual reasons from SHAP/top features for each declined applicant.

8) Summary checklist (short)

- [ ] Baseline: regularized Logistic Regression.
- [ ] Bagging: Random Forest (baseline ensemble).
- [ ] Boosting: LightGBM/XGBoost/CatBoost (tuned).
- [ ] CV: nested, stratified or time-based as appropriate.
- [ ] Metrics: AUC-ROC, PR-AUC, calibration, business metrics.


In [None]:
# Ensemble approach for loan-default prediction (ready for Colab)
# - simulates demographic + transaction features
# - trains RandomForest (bagging-style) and HistGradientBoosting (boosting-style)
# - uses stratified CV, GridSearchCV, class imbalance handling, permutation importance
# - prints best params and final evaluation metrics

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score, average_precision_score, accuracy_score, confusion_matrix,
    classification_report
)
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# ===========================
# 1) Simulate dataset
# ===========================
# We'll create a dataset that imitates: demographic (age, income) + transaction-derived features
# Imbalanced target (default rare) to mimic real loan-default distribution.
X, y = make_classification(
    n_samples=20000,
    n_features=20,
    n_informative=8,
    n_redundant=4,
    n_repeated=0,
    n_classes=2,
    weights=[0.93, 0.07],   # ~7% defaults (class imbalance)
    class_sep=1.0,
    random_state=42
)

# Create human-readable feature names: demographics and transaction aggregates
feature_names = [
    "age", "annual_income", "loan_amount", "loan_term", "num_prev_loans",
    "delinquency_count", "avg_monthly_balance", "std_monthly_balance",
    "txn_freq_6mo", "txn_amt_mean", "txn_amt_std", "credit_utilization",
    "num_credit_inquiries", "employment_years", "home_ownership_flag",
    "months_since_last_default", "ratio_debit_credit", "savings_to_income",
    "recent_large_txn_count", "avg_days_between_txn"
]
X = pd.DataFrame(X, columns=feature_names)
y = pd.Series(y, name="default")

print("Dataset shape:", X.shape)
print("Default rate: {:.2f}%".format(y.mean() * 100))

# ===========================
# 2) Train / Test split (stratified)
# ===========================
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

# ===========================
# 3) Baseline: Logistic Regression (simple, interpretable) - optional quick baseline
# ===========================
from sklearn.linear_model import LogisticRegression
baseline_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42))
])
baseline_pipe.fit(X_train, y_train)
y_pred_baseline = baseline_pipe.predict(X_test)
y_prob_baseline = baseline_pipe.predict_proba(X_test)[:,1]
print("\nBaseline Logistic Regression (quick check):")
print(" - Accuracy:", accuracy_score(y_test, y_pred_baseline))
print(" - ROC-AUC:", roc_auc_score(y_test, y_prob_baseline))
print(" - PR-AUC :", average_precision_score(y_test, y_prob_baseline))

# ===========================
# 4) Model candidates (Bagging vs Boosting)
#    - RandomForestClassifier (bagging family)
#    - HistGradientBoostingClassifier (boosting family; fast and handles large data)
# Overfitting controls included via hyperparams
# ===========================
rf = RandomForestClassifier(
    n_jobs=-1,
    class_weight='balanced_subsample',  # handle imbalance
    random_state=42
)

hgb = HistGradientBoostingClassifier(
    random_state=42,
    early_stopping=True,
    scoring='roc_auc'
)

# ===========================
# 5) Cross-validation strategy (Stratified K-Fold)
# ===========================
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ===========================
# 6) Hyperparameter grids (small grid for demonstration)
# ===========================
rf_param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [6, 12, None],
    'min_samples_leaf': [2, 5],
    'max_features': ['sqrt']  # typical choice for RF
}

hgb_param_grid = {
    'max_iter': [100, 300],
    'max_leaf_nodes': [15, 31],
    'learning_rate': [0.05, 0.1],
    'min_samples_leaf': [20, 50]  # large to reduce overfitting
}

# ===========================
# 7) GridSearchCV for RF
# ===========================
print("\nTuning RandomForest (GridSearchCV) ... (this may take a minute)")
rf_grid = GridSearchCV(
    rf,
    rf_param_grid,
    scoring='roc_auc',
    cv=cv,
    n_jobs=-1,
    verbose=0
)
rf_grid.fit(X_train, y_train)
best_rf = rf_grid.best_estimator_
print("Best RF params:", rf_grid.best_params_)
print("Best RF CV ROC-AUC: {:.4f}".format(rf_grid.best_score_))

# ===========================
# 8) GridSearchCV for HGB
# ===========================
print("\nTuning HistGradientBoosting (GridSearchCV) ...")
hgb_grid = GridSearchCV(
    hgb,
    hgb_param_grid,
    scoring='roc_auc',
    cv=cv,
    n_jobs=-1,
    verbose=0
)
hgb_grid.fit(X_train, y_train)
best_hgb = hgb_grid.best_estimator_
print("Best HGB params:", hgb_grid.best_params_)
print("Best HGB CV ROC-AUC: {:.4f}".format(hgb_grid.best_score_))

# ===========================
# 9) Final evaluation on hold-out test set
# ===========================
def evaluate_model(model, X_test, y_test, name="Model"):
    y_prob = model.predict_proba(X_test)[:,1]
    y_pred = model.predict(X_test)
    roc = roc_auc_score(y_test, y_prob)
    pr = average_precision_score(y_test, y_prob)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n{name} final test results:")
    print(" - Accuracy : {:.4f}".format(acc))
    print(" - ROC-AUC  : {:.4f}".format(roc))
    print(" - PR-AUC   : {:.4f}".format(pr))
    print(" - Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print(" - Classification Report:\n", classification_report(y_test, y_pred, digits=4))
    return {"roc_auc": roc, "pr_auc": pr, "accuracy": acc}

rf_metrics = evaluate_model(best_rf, X_test, y_test, name="RandomForest (Bagging)")
hgb_metrics = evaluate_model(best_hgb, X_test, y_test, name="HistGradientBoosting (Boosting)")

# ===========================
# 10) Compare via cross-validated ROC-AUC
#     (single 5-fold pass on the same folds used for tuning, so slightly optimistic)
# ===========================
print("\nCross-validated ROC-AUC (5-fold) on training set:")
rf_cv_scores = cross_val_score(best_rf, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
hgb_cv_scores = cross_val_score(best_hgb, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
print(" - RF CV ROC-AUC mean ± std:", np.mean(rf_cv_scores), "±", np.std(rf_cv_scores))
print(" - HGB CV ROC-AUC mean ± std:", np.mean(hgb_cv_scores), "±", np.std(hgb_cv_scores))

# ===========================
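# 11) Feature importance for interpretation (permutation importance on test)
#     Uses the estimator's default scorer (accuracy); pass scoring='roc_auc'
#     to measure importance in terms of ranking quality instead.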
# ===========================
print("\nComputing permutation importances (test set) for best model (RF)...")
perm = permutation_importance(best_rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
imp_df = pd.DataFrame({
    "feature": X_test.columns,
    "importance_mean": perm.importances_mean,
    "importance_std": perm.importances_std
}).sort_values(by="importance_mean", ascending=False).reset_index(drop=True)
print(imp_df.head(8))

# ===========================
# 12) Notes printed for assignment
# ===========================
print("\nNotes:")
print("- We used class_weight/balancing and PR-AUC alongside ROC-AUC because default is a rare event.")
print("- We controlled overfitting via max_depth/min_samples_leaf (RF) and learning_rate/early stopping + min_samples_leaf (HGB).")
print("- Use time-based CV in production if data has time ordering (here we used StratifiedKFold for demo).")


Dataset shape: (20000, 20)
Default rate: 7.46%
Train shape: (15000, 20) Test shape: (5000, 20)

Baseline Logistic Regression (quick check):
 - Accuracy: 0.7804
 - ROC-AUC: 0.8131123357423584
 - PR-AUC : 0.5082986233210026

Tuning RandomForest (GridSearchCV) ... (this may take a minute)
Best RF params: {'max_depth': 12, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'n_estimators': 100}
Best RF CV ROC-AUC: 0.9524

Tuning HistGradientBoosting (GridSearchCV) ...
Best HGB params: {'learning_rate': 0.05, 'max_iter': 100, 'max_leaf_nodes': 31, 'min_samples_leaf': 50}
Best HGB CV ROC-AUC: 0.9489

RandomForest (Bagging) final test results:
 - Accuracy : 0.9748
 - ROC-AUC  : 0.9499
 - PR-AUC   : 0.8624
 - Confusion Matrix:
 [[4618    9]
 [ 117  256]]
 - Classification Report:
               precision    recall  f1-score   support

           0     0.9753    0.9981    0.9865      4627
           1     0.9660    0.6863    0.8025       373

    accuracy                         0.9748      5000
   