**Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.**


Ans:-  Ensemble learning is a machine learning technique that combines predictions from multiple individual models to improve overall performance and accuracy. Instead of relying on a single model, an ensemble leverages the collective intelligence of a group of models, often referred to as weak learners or base learners. This approach usually yields more robust and accurate predictions than a single model would produce alone.

Key Idea Behind Ensemble Learning
The fundamental idea behind ensemble learning is the "wisdom of the crowd" principle.  This concept suggests that the collective judgment of a diverse group of non-experts is often more accurate than the judgment of a single expert. In machine learning, this translates to combining the outputs of several diverse models to cancel out their individual errors and biases. The errors made by one model are often different from the errors made by another, and by averaging or voting on their predictions, the ensemble can reduce the total error.
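The effect of voting can be illustrated with a small simulation. This is a toy sketch, not a real training run: it assumes 25 hypothetical classifiers that are each right 60% of the time and, importantly, make their mistakes independently of one another. Real models are never fully independent, so the gain in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10_000   # hypothetical number of test points
n_models = 25        # hypothetical ensemble size (odd, so votes cannot tie)
p_correct = 0.6      # assumed accuracy of each individual model

# correct[i, j] is True when model j classifies sample i correctly.
# Independence between models is an assumption of this toy setup.
correct = rng.random((n_samples, n_models)) < p_correct

single_accuracy = correct[:, 0].mean()
# The majority vote is right when more than half of the models are right.
ensemble_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority-vote accuracy: {ensemble_accuracy:.3f}")
```

The majority vote comes out clearly more accurate than any one voter, which is the "wisdom of the crowd" effect in its idealized form.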

Ensemble methods achieve this diversity in a few key ways, depending on the specific technique:

Bagging (Bootstrap Aggregating): This method trains multiple instances of the same type of model on different bootstrap samples (random samples drawn with replacement) of the original training data. The randomness in the samples ensures that each model learns slightly different patterns, leading to a diverse set of predictions. The final prediction is determined by averaging the outputs (for regression) or using a majority vote (for classification). A well-known example is the Random Forest algorithm, which uses bagging with decision trees and additionally considers a random subset of features at each split.

Boosting: Boosting is a sequential process where models are trained one after another. Each new model is trained to correct the errors made by the previous models. This technique focuses on improving the performance on data points that were difficult for earlier models to predict correctly. The final prediction is a weighted combination of the individual models' predictions; in AdaBoost, for example, more accurate models receive larger weights.

Stacking: Also known as stacked generalization, this technique involves training multiple diverse models (e.g., a decision tree, a support vector machine, and a neural network) on the same dataset. The predictions from these models are then used as input features to train a final, higher-level model (a meta-model). This meta-model learns how to best combine the predictions of the base models to make a final, improved prediction.
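As a rough illustration of stacking, the sketch below uses scikit-learn's `StackingClassifier`. The choice of base models (a shallow tree and k-nearest neighbours), the logistic-regression meta-model, and the Breast Cancer dataset are assumptions made for the example. Note that scikit-learn trains the meta-model on cross-validated predictions of the base models rather than on predictions for data they were fit on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two different kinds of base model (chosen only for illustration).
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The meta-model learns how to combine the base models' predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking test accuracy: {stack_accuracy:.3f}")
```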

**Question 2: What is the difference between Bagging and Boosting?**

Ans:-  Bagging and Boosting are both ensemble learning methods that combine the predictions of multiple models to improve overall performance, but they differ fundamentally in their approach to model training. Bagging trains models in parallel, while Boosting trains them sequentially, with each new model learning from the errors of the previous ones.

Bagging vs. Boosting
Here's a point-by-point comparison:

| Aspect | Bagging (Bootstrap Aggregating) | Boosting |
|---|---|---|
| Model Training | Models are trained independently and in parallel. | Models are trained sequentially. Each new model is trained to correct the errors of its predecessors. |
| Data Subset | Each model is trained on a random subset of the original training data, with replacement (bootstrapping). All data points have an equal chance of being selected. | Each new model focuses on data points that were misclassified by the previous models, giving them more weight. |
| Primary Goal | To reduce variance and prevent overfitting. Bagging is particularly effective with models that are prone to high variance, like unpruned decision trees. | To reduce bias and convert a set of weak learners into a single strong learner. Boosting is effective for models that have low performance or high bias. |
| Final Prediction | Predictions from all models are combined using a simple method, like averaging (for regression) or majority voting (for classification). | Predictions are combined using a weighted average or sum, where more accurate models are given more importance. |
| Examples | Random Forest is a well-known example that uses bagging with decision trees. | AdaBoost, Gradient Boosting, and XGBoost are popular boosting algorithms. |
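The comparison can be tried out with a minimal sketch like the one below. It relies on scikit-learn's defaults, where `BaggingClassifier` bags fully grown decision trees and `AdaBoostClassifier` boosts depth-1 trees ("stumps"). The dataset and the number of estimators are arbitrary choices for illustration, and which method scores higher will vary from dataset to dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent, high-variance trees whose votes are aggregated.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: sequential, high-bias stumps, each reweighting the hard cases.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print(f"Bagging  mean CV accuracy: {bagging_scores.mean():.3f}")
print(f"Boosting mean CV accuracy: {boosting_scores.mean():.3f}")
```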




**Question 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?**

Ans:-  Bootstrap sampling is a resampling technique where you create new datasets by randomly sampling from the original dataset with replacement. This means that a single data point can be selected multiple times in a single bootstrap sample, and other data points might not be selected at all. Each bootstrap sample typically has the same number of data points as the original dataset. This process generates multiple diverse datasets from a single source.
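A bootstrap sample is easy to draw by hand. The sketch below uses NumPy on a toy "dataset" of ten row indices; the size and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # a toy "dataset" of 10 row indices

# Draw as many rows as the original has, with replacement.
bootstrap_sample = rng.choice(data, size=data.size, replace=True)
# Rows that were never drawn are left out of this sample.
left_out = np.setdiff1d(data, bootstrap_sample)

print("Bootstrap sample:", np.sort(bootstrap_sample))
print("Rows never drawn:", left_out)
```

Because sampling is with replacement, some indices usually appear more than once while others do not appear at all.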




**The Role of Bootstrap Sampling in Bagging and Random Forest**

Bootstrap sampling is the core of Bagging, which stands for "Bootstrap Aggregating." It plays a crucial role in creating the diverse set of models that make up the ensemble. Here's how it works, especially in a Random Forest:

1. Creating Diverse Subsets: The original training data is used to generate many different bootstrap samples. Because each sample is created by drawing data points with replacement, they all contain a slightly different distribution of data. Some data points appear multiple times, while others are omitted. This ensures that the individual models trained on these subsets are all slightly different from each other.


2. Training Individual Models: For a Random Forest, a separate decision tree is trained on each of these bootstrap samples. The trees are typically grown deep and left unpruned, so each one has high variance and tends to overfit its own training subset. Random Forest also considers only a random subset of features at each split, which makes the trees differ from each other even more.

3. Reducing Variance: The diversity introduced by the bootstrap samples is key to the success of the ensemble. By training models on different data, you make their individual errors less correlated. The final prediction is made by aggregating the outputs of all the individual trees (e.g., using a majority vote for classification or averaging for regression). This aggregation averages out much of the individual models' variance, typically leading to a more stable and accurate final prediction than a single decision tree achieves.


In essence, bootstrap sampling provides the necessary data variability to make the individual models diverse, which is the foundational principle of bagging for reducing overall model variance and preventing overfitting.
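The variance argument can be made concrete with a standard result. The symbols here are assumptions introduced for illustration: suppose the ensemble averages $B$ models $f_1, \dots, f_B$, each with variance $\sigma^2$ and pairwise correlation $\rho$. Then the variance of the averaged prediction is

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more models shrinks only the second term, so the benefit of bagging is limited by how correlated the models are. This is why making the trees diverse (lowering $\rho$) matters as much as adding more of them.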

**Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?**


Ans:- Out-of-Bag (OOB) samples are the data points from the original training set that are not included in a particular bootstrap sample. In bagging methods, like Random Forest, each individual model (e.g., a decision tree) is trained on a different bootstrap sample. Because bootstrap sampling is done with replacement, each bootstrap sample contains on average about 63% of the distinct original data points (some of them repeated), leaving the remaining 37% or so as the out-of-bag samples for that specific model. These OOB samples are essentially "unseen" data for the model they were left out of.
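The 63% / 37% split follows from a short calculation. Assume a training set of $n$ points and a bootstrap sample of $n$ draws with replacement. A given point is missed by one draw with probability $1 - 1/n$, so it is missed by all $n$ draws with probability

$$
P(\text{point } i \text{ is out-of-bag}) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow{\,n \to \infty\,} e^{-1} \approx 0.368
$$

The expected share of distinct points that do appear in the sample is therefore about $1 - 0.368 = 0.632$.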




How OOB Score is Used to Evaluate Ensemble Models
The OOB score is a powerful method for evaluating the performance of ensemble models without the need for a separate validation or test set. Here’s how it works:

Prediction on OOB Samples: For each data point in the original training set, an OOB prediction is made. This prediction is generated by aggregating the predictions from only the models that did not see that data point during their training (i.e., the models for which that data point was an OOB sample). For example, if a data point was left out of the bootstrap samples for trees 1, 5, and 10, the OOB prediction for that data point would be the majority vote (for classification) or the average (for regression) of the predictions from those three trees.


Nearly Unbiased Performance Estimate: The OOB score is then calculated by comparing these aggregated OOB predictions to the actual labels of the data points. Since the models making each prediction have never seen the data point they are evaluating, the OOB score gives a reasonable estimate of the model's generalization performance. It tends to be slightly pessimistic, because each point is scored by only the subset of trees that left it out rather than by the full ensemble.


Comparison to Cross-Validation: The OOB score is a computationally efficient alternative to traditional cross-validation. Cross-validation requires creating multiple folds and training the model on different subsets, which can be computationally expensive. OOB evaluation, however, is a seamless part of the bagging process itself, as the "validation" data is naturally available for each model. The OOB score can be used to tune hyperparameters and provides a reliable measure of how well the ensemble model will perform on unseen data.
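In scikit-learn, the OOB estimate is available by passing `oob_score=True` and reading the fitted `oob_score_` attribute. The sketch below compares it with accuracy on a held-out test set; the dataset, split, and number of trees are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# oob_score=True requires bootstrap=True, which is the default.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_            # estimated from training data only
test_accuracy = forest.score(X_test, y_test)  # measured on held-out data

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The two numbers are usually close, which is what makes the OOB score a convenient substitute when holding out data is costly.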

**Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**


Ans:- Comparing feature importance in a single Decision Tree versus a Random Forest reveals a key difference in reliability and stability. While both methods use the same underlying principle of measuring how much each feature contributes to the model's decisions, the ensemble nature of Random Forest provides a much more robust and trustworthy result.


Feature Importance in a Single Decision Tree
In a single Decision Tree, feature importance is calculated based on how much a feature reduces the impurity (e.g., Gini impurity or entropy for classification, or mean squared error for regression) at each split. The importance score for a feature is the total reduction in impurity it achieves across all nodes where it is used for a split.


Calculation: The algorithm sums up the impurity reduction for each time a feature is used to split a node. A feature that is used to create a "purer" split (a split that separates the data into more homogeneous groups) is given a higher score.

Drawbacks: The main issue is that a single Decision Tree is highly sensitive to the training data. A small change in the data can lead to a completely different tree structure, resulting in wildly different feature importance scores. This makes the importance ranking unstable and unreliable as a general measure. It can also be biased towards features with a high number of unique values, as these features offer more potential split points.
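The impurity reduction for a single split can be worked through by hand. The sketch below uses made-up class counts: a parent node with 5 samples of each class, split into children holding 4 vs. 1 and 1 vs. 4 samples. scikit-learn additionally weights each node's decrease by the share of training samples that reach it and normalizes the totals so they sum to 1; that step is omitted here.

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical split: counts are [class 0, class 1].
parent = [5, 5]
left, right = [4, 1], [1, 4]

n_parent = sum(parent)
# Children are weighted by the share of the parent's samples they receive.
weighted_child_impurity = (
    sum(left) / n_parent * gini(left) + sum(right) / n_parent * gini(right)
)
impurity_decrease = gini(parent) - weighted_child_impurity

print(f"Parent Gini:       {gini(parent):.2f}")
print(f"Weighted children: {weighted_child_impurity:.2f}")
print(f"Impurity decrease: {impurity_decrease:.2f}")
```

Here the parent impurity is 0.50, the weighted child impurity is 0.32, and the split is credited with a decrease of 0.18 for the feature it uses.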


Feature Importance in a Random Forest
A Random Forest calculates feature importance by aggregating the importance scores from all the individual decision trees within the forest.

Calculation: For each tree in the forest, the feature importance is calculated just as it would be for a single Decision Tree. Then, the final feature importance score for the entire Random Forest is the average of the importance scores for that feature across all the trees.


Advantages: This averaging process is the key to its robustness. Because each tree in the forest is trained on a different bootstrap sample of the data and uses a random subset of features for each split, the individual trees are de-correlated. A feature that appears important by chance in one tree is unlikely to appear important in many others. By averaging across the entire forest, the random noise is canceled out, and the resulting importance scores provide a much more stable, accurate, and reliable ranking of the features' true predictive power.


Ultimately, while a single Decision Tree gives you a quick and interpretable look at feature importance, a Random Forest offers a more stable and reliable assessment due to its ensemble nature, making it the preferred method for feature importance analysis in most practical applications.
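One way to check the stability claim is to refit both models on several bootstrap resamples of the same data and measure how much each feature's importance moves. This is an illustrative experiment, not a standard procedure; the number of resamples, the forest size, and the spread measure are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

tree_importances, forest_importances = [], []
for seed in range(10):
    # Refit both models on a fresh bootstrap resample of the data.
    X_boot, y_boot = resample(X, y, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(X_boot, y_boot)
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_boot, y_boot)
    tree_importances.append(tree.feature_importances_)
    forest_importances.append(forest.feature_importances_)

# Standard deviation of each feature's importance across resamples,
# averaged over features: lower means a more stable importance profile.
tree_spread = np.std(tree_importances, axis=0).mean()
forest_spread = np.std(forest_importances, axis=0).mean()

print(f"Single tree spread:   {tree_spread:.4f}")
print(f"Random forest spread: {forest_spread:.4f}")
```

The forest's importances should vary noticeably less from resample to resample than the single tree's.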

**Question 6: Write a Python program to:

● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores**

Ans:- Here is the Python code:

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# 1. Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

print("Dataset loaded successfully! 📊")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}\n")

# 2. Train a Random Forest Classifier
# We use a random_state for reproducibility of results
# n_estimators is the number of trees in the forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X, y)

print("Random Forest Classifier trained! 🌳🌲\n")

# 3. Print the top 5 most important features
# Get feature importances from the trained model
feature_importances = rf_classifier.feature_importances_

# Create a pandas Series for easier sorting and viewing
feature_importance_series = pd.Series(feature_importances, index=X.columns)

# Sort the features by importance in descending order
top_features = feature_importance_series.sort_values(ascending=False)

print("Top 5 Most Important Features:\n")
print(top_features.head(5))

Dataset loaded successfully! 📊
Number of features: 30
Number of samples: 569

Random Forest Classifier trained! 🌳🌲

Top 5 Most Important Features:

worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


**Question 7: Write a Python program to:

● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree**

Ans:- The program below:

- Loads the Iris dataset.

- Trains a single Decision Tree classifier.

- Trains a Bagging Classifier (using Decision Trees as base estimators).

- Evaluates and compares their accuracies.

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_acc = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier with Decision Trees as base estimators
# (the parameter is named `estimator` since scikit-learn 1.2; the older
# `base_estimator` name was removed in 1.4 and raises a TypeError there)
bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
bag_acc = accuracy_score(y_test, y_pred_bag)

# Print the results
print(f"Decision Tree Accuracy: {dt_acc:.2f}")
print(f"Bagging Classifier Accuracy: {bag_acc:.2f}")

**Question 8: Write a Python program to:

● Train a Random Forest Classifier

● Tune hyperparameters max_depth and n_estimators using GridSearchCV

● Print the best parameters and final accuracy**


Ans:- Here's a Python program that trains a Random Forest Classifier, tunes its max_depth and n_estimators hyperparameters using GridSearchCV, and then prints the best parameters and the final accuracy:


In [4]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

print("Breast Cancer dataset loaded successfully! 🎗️\n")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}\n")

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# stratify=y ensures that the proportion of target classes is the same in both train and test sets

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}\n")

# 2. Define the Random Forest Classifier
# We set a random_state for reproducibility of the base model's randomness
rf = RandomForestClassifier(random_state=42)

# Define the parameter grid for hyperparameter tuning
# max_depth: The maximum depth of the tree. Limiting this can prevent overfitting.
# n_estimators: The number of trees in the forest. More trees generally improve performance
#               but increase computation time.
param_grid = {
    'n_estimators': [50, 100, 150],  # Number of trees
    'max_depth': [None, 10, 20]      # Maximum depth of each tree (None means unlimited depth)
}

print("Starting GridSearchCV for hyperparameter tuning... 🛠️\n")

# 3. Tune hyperparameters using GridSearchCV
# GridSearchCV performs an exhaustive search over the specified parameter values.
# It uses cross-validation (cv=5 means 5-fold cross-validation) to evaluate each combination.
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

print("GridSearchCV complete! ✨\n")

# 4. Print the best parameters
print("Best hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)

# Get the best model from the grid search
best_rf_model = grid_search.best_estimator_

# 5. Evaluate the final accuracy on the test set using the best model
y_pred_best_model = best_rf_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred_best_model)

print(f"\nFinal accuracy of the best Random Forest model on the test set: {final_accuracy:.4f} 🎉")

Breast Cancer dataset loaded successfully! 🎗️

Number of features: 30
Number of samples: 569

Training data shape: (398, 30)
Testing data shape: (171, 30)

Starting GridSearchCV for hyperparameter tuning... 🛠️

Fitting 5 folds for each of 9 candidates, totalling 45 fits
GridSearchCV complete! ✨

Best hyperparameters found by GridSearchCV:
{'max_depth': None, 'n_estimators': 100}

Final accuracy of the best Random Forest model on the test set: 0.9357 🎉


**Question 9: Write a Python program to:

● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

● Compare their Mean Squared Errors (MSE)**

Ans:- Here's a Python program that trains a Bagging Regressor and a Random Forest Regressor on the California Housing dataset and then compares their Mean Squared Errors (MSE).


In [13]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data
y = california_housing.target

print("California Housing dataset loaded successfully! 🏡\n")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}\n")

# Split the dataset into training and testing sets
# We use a random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}\n")

# 2. Train a Bagging Regressor using Decision Trees
# estimator is the type of model to be used in the ensemble (DecisionTreeRegressor here);
# this parameter was named base_estimator before scikit-learn 1.2, and the old name
# was removed in 1.4, where it raises a TypeError
# n_estimators is the number of base estimators (decision trees) in the ensemble
bagging_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=100,  # Using 100 decision trees
    random_state=42,
    n_jobs=-1,  # Use all available CPU cores for parallel training
)

print("Training Bagging Regressor... 🌳\n")

bagging_regressor.fit(X_train, y_train)

# Make predictions and calculate MSE for the Bagging Regressor
y_pred_bagging = bagging_regressor.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

print(f"Mean Squared Error (MSE) for Bagging Regressor: {mse_bagging:.4f}\n")

# 3. Train a Random Forest Regressor
# RandomForestRegressor builds on bagging and can add feature randomness via max_features
# (by default scikit-learn's regressor considers all features at each split)
# n_estimators is the number of trees in the forest
random_forest_regressor = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1,  # Use all available CPU cores for parallel training
)

print("Training Random Forest Regressor... 🌲🌲\n")
random_forest_regressor.fit(X_train, y_train)

# Make predictions and calculate MSE for the Random Forest Regressor
y_pred_random_forest = random_forest_regressor.predict(X_test)
mse_random_forest = mean_squared_error(y_test, y_pred_random_forest)

print(f"Mean Squared Error (MSE) for Random Forest Regressor: {mse_random_forest:.4f}\n")

# 4. Compare their Mean Squared Errors (MSE)
print("--- Comparison of Regressor Performance ---")
print(f"Bagging Regressor MSE:    {mse_bagging:.4f}")
print(f"Random Forest Regressor MSE: {mse_random_forest:.4f}")

if mse_random_forest < mse_bagging:
    print("\nThe Random Forest Regressor achieved a lower MSE, indicating better performance. 🎉")
elif mse_random_forest > mse_bagging:
    print("\nThe Bagging Regressor achieved a lower MSE in this run. This can sometimes happen depending on the dataset and specific random states.")
else:
    print("\nBoth regressors achieved the same MSE. 🤝")


California Housing dataset loaded successfully! 🏡

Number of features: 8
Number of samples: 20640

Training data shape: (14448, 8)
Testing data shape: (6192, 8)

**Explanation**

The program performs the following steps to compare the two ensemble regression models:

Load Dataset: It starts by loading the California Housing dataset using fetch_california_housing from sklearn.datasets. This dataset is commonly used for regression tasks, where the goal is to predict house prices.

Data Splitting: The dataset is then split into training and testing sets using train_test_split. This ensures that the models are evaluated on data they haven't seen during training, providing an unbiased assessment of their generalization ability.

Bagging Regressor Training:

A BaggingRegressor is initialized. Its base model, passed through the estimator parameter (named base_estimator before scikit-learn 1.2), is set to DecisionTreeRegressor(random_state=42), meaning each individual model in the ensemble will be a decision tree.

n_estimators=100 specifies that 100 decision trees will be trained.

n_jobs=-1 allows the training of these 100 trees to run in parallel across all available CPU cores, speeding up the process.

The model is then fit to the training data.

Random Forest Regressor Training:

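A RandomForestRegressor is initialized with n_estimators=100. The Random Forest algorithm is essentially a specialized form of bagging that can introduce additional randomness by considering only a random subset of features at each split point in the decision trees. This is controlled by max_features; note that scikit-learn's RandomForestRegressor considers all features by default, so the parameter must be lowered to get this effect. Feature subsampling further decorrelates the trees, often leading to improved performance.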

n_jobs=-1 is also used here for parallel processing.

The model is fit to the training data.

Mean Squared Error (MSE) Comparison:

After training, both regressors make predictions on the X_test data.

The mean_squared_error metric is calculated for both sets of predictions against the actual y_test values. MSE measures the average squared difference between the estimated values and the actual value. A lower MSE indicates a better fit of the model to the data.

Finally, the MSE values for both models are printed and compared, allowing you to see which ensemble method performed better on this specific dataset and split. Random Forests often outperform simple Bagging due to the added feature randomness, which further reduces correlation among the base trees.
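The role of feature randomness can be examined directly through `max_features`. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_regression` instead of the California Housing dataset so that it runs without a download, and `max_features=1/3` is just one commonly suggested setting for regression. No particular ranking of the three models is implied.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the California Housing dataset).
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    # Bagged decision trees: every split may use any feature.
    "Bagging": BaggingRegressor(n_estimators=100, random_state=0),
    # scikit-learn's default for regression also considers all features.
    "Random Forest (default)": RandomForestRegressor(n_estimators=100, random_state=0),
    # Restricting max_features adds the per-split feature randomness.
    "Random Forest (max_features=1/3)": RandomForestRegressor(
        n_estimators=100, max_features=1 / 3, random_state=0
    ),
}

mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse[name]:.2f}")
```

With default settings the first two models are close to the same algorithm, so large differences between them should not be expected; the third shows what feature subsampling changes on a given dataset.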

**Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world
context.**

Ans:-  **Step-by-Step Approach for a Loan Default Prediction Model**

As a data scientist predicting loan default, a robust and accurate model is critical. Ensemble learning offers a powerful way to achieve this. Here's a step-by-step approach for building and evaluating such a model using ensemble techniques.

1. **Choose Between Bagging and Boosting**

For loan default prediction, Boosting is usually the stronger first choice over Bagging, though the choice should be confirmed by validation on the actual data.

Boosting excels at reducing bias and focusing on the most challenging cases. In a loan default scenario, correctly identifying high-risk individuals (who are often a minority class) is the most critical task. Boosting algorithms like Gradient Boosting or XGBoost sequentially train models to correct the errors of previous models, naturally putting more emphasis on the customers who were difficult to classify, thereby improving the predictive power on the default class.

Bagging, while great for reducing variance and preventing overfitting, treats all data points equally. It doesn't specifically target the "hard cases" of potential defaulters, which are the most important for the business to identify.

2. **Handle Overfitting**

Boosting models can be prone to overfitting, so we'll use several strategies to mitigate this:

Regularization Parameters: We'll limit the complexity of our individual base models (e.g., setting a max_depth for decision trees). We'll also use a learning_rate parameter to shrink the contribution of each new tree, which prevents the model from rapidly over-indexing on the training data.

Early Stopping: We can monitor the model's performance on a validation set and stop training if the performance metric (e.g., log-loss) stops improving. This prevents the model from continuing to learn noise in the training data.

Subsampling: We'll train each tree on a random subset of the training data and features, similar to Random Forest. This adds more randomness and helps to reduce variance.
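These three controls map onto parameters of scikit-learn's `GradientBoostingClassifier`. The sketch below is one possible configuration on synthetic data; the specific values are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for loan data.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gbc = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # shrink each tree's contribution
    max_depth=3,              # keep individual trees shallow
    subsample=0.8,            # fit each tree on 80% of the rows
    max_features="sqrt",      # consider a random subset of features per split
    validation_fraction=0.1,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gbc.fit(X_train, y_train)

trees_used = gbc.n_estimators_  # number of trees actually fitted
print(f"Trees fitted before early stopping: {trees_used}")
print(f"Test accuracy: {gbc.score(X_test, y_test):.3f}")
```

Setting `n_iter_no_change` is what switches early stopping on; without it, all `n_estimators` trees are fitted.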

3. **Select Base Models**

Decision trees are an excellent choice for base models in boosting ensembles.

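Individual decision trees are easy to interpret, and tree ensembles expose feature importance scores, which helps in a regulated industry like finance. A boosted ensemble of hundreds of trees is, however, harder to explain than a single tree.

They handle numerical data (e.g., income, credit score) without feature scaling. Categorical data (e.g., gender, marital status) must be encoded numerically for scikit-learn's tree models, although some boosting libraries such as LightGBM and CatBoost accept categorical features directly.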

When used as "weak learners" (i.e., shallow trees with limited max_depth), they are computationally efficient and form the perfect building blocks for a boosting algorithm.

4. **Evaluate Performance Using Cross-Validation**

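We will use k-fold cross-validation to get a reliable estimate of the model's performance. Instead of a single train-test split, we'll divide the data into k folds, train the model k times (each time holding out a different fold for evaluation), and average the performance metrics. Because defaults are rare, the folds should be stratified so that each one keeps roughly the same default rate. This ensures our evaluation is not dependent on a specific data split.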
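A minimal sketch of this evaluation, assuming synthetic imbalanced data and a small `GradientBoostingClassifier`, could look as follows. `StratifiedKFold` keeps the class ratio roughly constant across folds, and `scoring="roc_auc"` is used because accuracy is uninformative here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with roughly 10% positives ("defaults").
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)

# Each fold preserves the overall class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)

auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("AUC per fold:", auc_scores.round(3))
print(f"Mean AUC: {auc_scores.mean():.3f} (std {auc_scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of how sensitive the estimate is to the particular split.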

Performance Metrics: Given that loan defaults are a relatively rare event (class imbalance), simple accuracy is misleading. A model that predicts "non-default" for everyone might have a high accuracy but would be useless. We'll use more appropriate metrics like:

AUC-ROC (Area Under the Receiver Operating Characteristic curve): Measures the model's ability to distinguish between the two classes.

F1-Score: The harmonic mean of precision and recall, which is a better measure for imbalanced datasets.

Precision and Recall: Precision is the proportion of predicted defaults that were actually defaults, while recall is the proportion of actual defaults that were correctly identified.
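The point about accuracy can be shown with a made-up example: 100 hypothetical customers, 10 of whom default, scored by a "model" that predicts non-default for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 90 non-defaults (0) and 10 defaults (1); labels are invented for illustration.
y_true = np.array([0] * 90 + [1] * 10)
# A trivial model that never predicts a default.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
```

Accuracy comes out at 0.90 even though the model catches none of the defaults, while recall and F1 are both 0, which is why the imbalance-aware metrics are preferred.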

5. **Justify How Ensemble Learning Improves Decision-Making**   

Ensemble learning's strength lies in its ability to produce a more reliable and robust prediction than any single model. In the context of financial decision-making, this translates to:

More Accurate Risk Assessment: By combining the insights of multiple models, the ensemble model can identify complex patterns that a single decision tree might miss. This leads to a more precise estimate of a customer's likelihood to default.

Reduced Financial Loss: A more accurate model means fewer high-risk customers are incorrectly approved for loans, directly reducing the institution's financial losses from defaults.

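More Consistent Decision-Making: A robust model is less likely to be swayed by random noise in the data, leading to more consistent decisions for customers with similar profiles. Consistency supports fairness but does not guarantee it, so the model should still be audited for bias.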

Python Code for the Approach
The following code demonstrates a Boosting approach using GradientBoostingClassifier and evaluates it using GridSearchCV with cross-validation. We'll use a synthetic dataset for demonstration.







In [None]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, classification_report

# 1. Simulate a loan default dataset with class imbalance
# We create a dataset with 10,000 samples and 20 features, with a 90/10 class split.
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, weights=[0.9, 0.1],
                           flip_y=0.01, random_state=42)

# Convert to DataFrame for better readability
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y = pd.Series(y, name='default')

print("Synthetic loan default dataset created. 📊")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of defaults (Class 1): {sum(y == 1)}")
print(f"Number of non-defaults (Class 0): {sum(y == 0)}\n")

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}\n")

# 2. Define the Boosting model and hyperparameters for tuning
# We use GradientBoostingClassifier, a popular boosting algorithm.
gbc = GradientBoostingClassifier(random_state=42)

# Define the parameter grid for GridSearchCV
# We are tuning n_estimators, learning_rate, and max_depth to handle overfitting.
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees
    'learning_rate': [0.05, 0.1, 0.2], # Contribution of each tree
    'max_depth': [3, 5, 7]             # Max depth of each tree (weak learner)
}

print("Starting GridSearchCV for hyperparameter tuning... 🛠️\n")

# 3. Use GridSearchCV with cross-validation to find the best model
# cv=5 means 5-fold cross-validation.
# We use 'roc_auc' as the scoring metric due to class imbalance.
grid_search = GridSearchCV(estimator=gbc, param_grid=param_grid, scoring='roc_auc', cv=5, n_jobs=-1, verbose=1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

print("\nGridSearchCV complete! ✨\n")

# Get the best model and its parameters
best_gbc_model = grid_search.best_estimator_
best_params = grid_search.best_params_

print("Best hyperparameters found: 🎉")
for param, value in best_params.items():
    print(f"- {param}: {value}")

# 4. Evaluate the best model on the test set
# Get predictions and probabilities
y_pred = best_gbc_model.predict(X_test)
y_pred_proba = best_gbc_model.predict_proba(X_test)[:, 1]

# Calculate and print evaluation metrics
auc_roc = roc_auc_score(y_test, y_pred_proba)
print(f"\nFinal AUC-ROC score on the test set: {auc_roc:.4f}\n")

print("Classification Report on Test Set:")
print(classification_report(y_test, y_pred))

Synthetic loan default dataset created. 📊
Number of samples: 10000
Number of defaults (Class 1): 1034
Number of non-defaults (Class 0): 8966

Training set size: 8000
Testing set size: 2000

Starting GridSearchCV for hyperparameter tuning... 🛠️

Fitting 5 folds for each of 27 candidates, totalling 135 fits
