### **01.What is Ensemble Learning in machine learning? Explain the key idea behind it.**

Ensemble learning in machine learning combines predictions from multiple individual models (also known as base learners or weak learners) to create a more accurate and robust prediction than any single model alone. The core idea is that by aggregating the outputs of diverse models, the ensemble can overcome individual model limitations and achieve better performance.

Put simply, ensemble learning uses many models instead of just one. Each model may not be very strong on its own, but combining their results usually gives a better and more accurate answer. It's like asking a group of people for advice instead of just one person: each one might be a little wrong, but together they usually give a better answer.

Here's a breakdown:

Key Idea:

Instead of relying on a single model, ensemble learning leverages the collective intelligence of multiple models. These models can be trained on different subsets of the data, use different algorithms, or have different hyperparameter settings. Depending on the method, combining their predictions can reduce variance (as in bagging), bias (as in boosting), or both, which improves accuracy and generalization.
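
To make the "collective intelligence" intuition concrete, here is a minimal sketch. It assumes the base models make errors independently and all share the same accuracy `p`; both are simplifying assumptions that real models rarely satisfy. The function name `majority_vote_accuracy` is hypothetical and used only for illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a majority of n_models independent classifiers,
    each correct with probability p, gives the right answer (n_models odd)."""
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Accuracy of the majority vote as the ensemble grows (each model: p = 0.6)
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

In practice the errors of real models are correlated, so the gain is smaller than this idealized calculation suggests. Note also that the vote only helps when each model is better than chance.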

How it Works:

1. Base Learners:

Individual machine learning models are trained on the data. These models can be decision trees, neural networks, support vector machines, or any other suitable algorithm.

2. Ensemble Creation:

The base learners are combined to create an ensemble. Common methods for combining predictions include:
Bagging: Training multiple models independently on random subsets of the data (with replacement) and averaging their predictions.

Boosting: Training models sequentially, with each model focusing on correcting the errors of its predecessors.

Stacking: Training multiple models and then training a meta-model on their predictions to make the final prediction.

3. Prediction:

When a new data point needs to be classified or predicted, each base learner in the ensemble makes a prediction. These predictions are then combined, often by averaging or voting, to produce the final ensemble prediction.
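
The three combination strategies above can be sketched with scikit-learn as follows. This is an illustration under assumptions rather than a recommended setup: the synthetic dataset, the choice of base learners, and all hyperparameter values are placeholders picked for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    # Bagging: independent trees on bootstrap samples, combined by voting
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    # Boosting: trees added sequentially, each correcting the previous ones
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Stacking: a meta-model learns how to combine the base models' predictions
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("logreg", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```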

Benefits of Ensemble Learning:

Increased Accuracy: Ensemble methods often achieve higher accuracy than individual models.

Reduced Variance: By combining multiple models, the overall variance of the predictions is reduced, leading to more stable results.

Improved Generalization: Ensembles can generalize better to unseen data, reducing the risk of overfitting.

Robustness: Ensembles are more robust to noise and outliers in the data.

Benefits of Ensemble Learning in Machine Learning
Ensemble learning is a versatile approach that offers machine learning models the following benefits:

Reduction in Overfitting:

By aggregating the predictions of multiple models, ensembles can reduce the overfitting that individual complex models might exhibit.

Improved Generalization:

Ensembles generalize better to unseen data by reducing variance, bias, or both.

Increased Accuracy:

Combining multiple models usually gives higher predictive accuracy than any single model.

Robustness to Noise:

Averaging the predictions of diverse models mitigates the effect of noisy or incorrect data points.

Flexibility:

Ensembles can combine diverse models, including decision trees, neural networks and support vector machines, which makes them highly adaptable.

Bias-Variance Tradeoff:

Techniques like bagging reduce variance, while boosting mainly reduces bias, leading to better overall performance.


### **02.What is the difference between Bagging and Boosting?**



Bagging (Bootstrap Aggregating):

Bagging is a popular ensemble learning technique that focuses on reducing variance and improving the stability of machine learning models. The term “bagging” is derived from the idea of creating multiple subsets or bags of the training data through a process known as bootstrapping. Bootstrapping involves randomly sampling the dataset with replacement to generate multiple subsets of the same size as the original data. Each of these subsets is then used to train a base learner independently.

Boosting:

Boosting, like bagging, is an ensemble learning technique, but it aims to improve the performance of weak learners by combining them in a sequential manner. The core idea behind boosting is to make each new learner concentrate on the mistakes of its predecessors, for example by giving more weight to misclassified instances (as in AdaBoost) or by fitting the errors that remain (as in gradient boosting).

Differences Between Bagging and Boosting:

Sequential vs. Parallel:

Bagging:

 The base learners are trained independently in parallel, as each learner works on a different subset of the data. The final prediction is typically an average or vote of all base learners.

Boosting:

The base learners are trained sequentially, and each learner focuses on correcting the mistakes of its predecessors. The final prediction is a weighted sum of the individual learner predictions.

Data Sampling:

Bagging:

Utilizes bootstrapping to create multiple subsets of the training data, allowing for variations in the training sets for each base learner.

Boosting:

Assigns weights to instances in the training set, with higher weights given to misclassified instances to guide subsequent learners.

Weighting of Base Learners:

Bagging:

All base learners typically have equal weight when making the final prediction.

Boosting:

Assigns different weights to each base learner based on its performance, giving more influence to learners that perform well on challenging instances.

Handling Noisy Data and Outliers:

Bagging:

 Robust to noisy data and outliers due to the averaging or voting mechanism, which reduces the impact of individual errors.

Boosting:

More sensitive to noisy data and outliers, as the focus on misclassified instances might lead to overfitting on these instances.

Model Diversity:

Bagging:

Aims to create diverse base learners through random subsets of the data and, in the case of Random Forests, random feature selection for each tree.

Boosting:

 Focuses on improving the performance of weak learners sequentially, with each learner addressing the weaknesses of its predecessors.

Bias and Variance:

Bagging:

 Primarily reduces variance by averaging predictions from multiple models, making it effective for models with high variance.

Boosting:

 Addresses both bias and variance, with a focus on reducing bias by sequentially correcting mistakes made by weak learners.
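
The instance re-weighting and learner weighting described for boosting can be sketched by hand. The following is a minimal discrete AdaBoost loop with decision stumps, written under assumptions: binary labels recoded to {-1, +1}, a synthetic dataset, and 20 rounds chosen arbitrarily. The helper `boosted_predict` is a hypothetical name, and library implementations such as scikit-learn's `AdaBoostClassifier` differ in their details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
y = np.where(y01 == 1, 1, -1)  # AdaBoost is easiest to write with labels in {-1, +1}

n_rounds = 20
weights = np.full(len(y), 1.0 / len(y))  # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this learner, clipped to keep the logarithm finite
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # learner weight: larger when err is small

    # Increase the weights of misclassified instances, decrease the rest
    weights = weights * np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def boosted_predict(X_new):
    """Sign of the alpha-weighted sum of stump predictions."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)


train_accuracy = (boosted_predict(X) == y).mean()
print(f"Training accuracy after {n_rounds} rounds: {train_accuracy:.3f}")
```

Unlike bagging, each round here depends on the weights produced by the previous one, so the loop cannot be parallelized across learners.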








### **03.What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**


Bootstrap sampling is a resampling technique where multiple samples of the same size as the original dataset are created by randomly selecting data points with replacement. This means that a single data point can be selected multiple times within a single bootstrap sample, and some data points from the original dataset may not be included in a particular bootstrap sample at all.
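
A quick way to see both effects (repeated points and left-out points) is to draw one bootstrap sample with NumPy. This is a small sketch; the sample size of 10,000 and the seed are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
indices = np.arange(n_samples)

# Draw one bootstrap sample: same size as the original, with replacement
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

unique_fraction = np.unique(bootstrap_idx).size / n_samples
oob_idx = np.setdiff1d(indices, bootstrap_idx)  # points that were never drawn

print(f"unique points in the sample:  {unique_fraction:.3f}")
print(f"left-out (out-of-bag) points: {oob_idx.size / n_samples:.3f}")
print(f"theoretical left-out share:   {(1 - 1 / n_samples) ** n_samples:.3f}")
```

For large datasets the left-out share approaches 1/e, so each bootstrap sample contains roughly 63% of the distinct original points.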

Bagging Classifier :-

Bagging, or bootstrap aggregating, is a type of ensemble learning in which multiple base models are trained independently, and often in parallel, on different subsets of the training data. Each subset is generated using bootstrap sampling, in which data points are picked at random with replacement. In a bagging classifier, the final prediction is made by aggregating the predictions of all base models using majority voting. For regression, the final prediction is the average of the predictions of all base models, and the method is known as bagging regression.

In Bagging (Bootstrap Aggregating) methods like Random Forest, bootstrap sampling plays a crucial role:

Creating Diverse Training Sets:

Each individual model (e.g., decision tree in a Random Forest) within the ensemble is trained on a different bootstrap sample. This introduces variability among the training sets, which in turn leads to the creation of diverse base models.

Reducing Variance and Overfitting:

By training multiple models on these varied bootstrap samples and then aggregating their predictions (e.g., through majority voting for classification or averaging for regression), Bagging methods effectively reduce the variance of the overall model. This helps in mitigating overfitting, as the ensemble is less sensitive to the specific characteristics of any single training sample.

Enhancing Model Robustness:

The diversity introduced by bootstrap sampling makes the ensemble more robust to noise and outliers in the data. Since each base model is trained on a slightly different view of the data, the combined prediction is less likely to be swayed by anomalies present in only a subset of the data.

In essence, bootstrap sampling provides the foundation for creating a collection of slightly different base models, which are then combined to form a more stable, accurate, and robust ensemble model in techniques such as Random Forest.
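
One standard way to see why averaging reduces variance is the following textbook identity. It is stated here under the assumption that the $B$ base models $f_b$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$; this notation is introduced for the illustration and is not part of the original answer.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more models are added, while the first term remains. This is why bagging benefits from base models that are as decorrelated as possible, which bootstrap sampling (and, in Random Forest, random feature selection) encourages.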

How does Bagging Classifier Work :-

Bootstrap Sampling:

In bootstrap sampling, 'n' subsets are drawn at random, with replacement, from the original training dataset. This step ensures that the base models are trained on diverse subsets of the data, as some samples may appear multiple times in a subset while others may be left out. This diversity is what allows the aggregated model to reduce overfitting and improve accuracy.

Base Model Training:

In bagging, multiple base models are used. After bootstrap sampling, each base model is trained independently on a different bootstrapped subset of the data, using a learning algorithm such as decision trees, support vector machines or neural networks. These models are often loosely called "weak learners", although in bagging they are usually flexible, high-variance models such as fully grown decision trees. Because the base models are trained independently, training can run in parallel, which saves time.

Aggregation:

Once all the base models are trained, each one makes a prediction on new, unseen data, and the bagging classifier predicts the class label for a given instance by majority voting across all base learners. The class that receives the most votes is the prediction of the model.

Out-of-Bag (OOB) Evaluation:

Some samples are excluded from the training subset of particular base models during the bootstrapping method. These "out-of-bag" samples can be used to estimate the model's performance without the need for cross-validation.

The bagging classifier process begins with the original training dataset, which is used to create bootstrap samples (random subsets drawn with replacement) for training multiple base learners, ensuring diversity. Each learner independently predicts outcomes, capturing different patterns.

These predictions are aggregated using majority voting, where the final classification is the class with the most votes. Out-of-Bag (OOB) evaluation measures the model's performance on the data excluded from each bootstrap sample. This approach enhances accuracy and reduces overfitting.
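
The sampling, training and aggregation steps can be sketched by hand as follows. This is a simplified illustration, not how scikit-learn's `BaggingClassifier` is implemented internally; the Iris dataset, the 10 trees and the seeds are assumptions chosen to mirror the style of the later exercises.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_estimators = 10
n_train = len(X_train)
trees = []

# Steps 1 and 2: bootstrap sampling and base model training
for _ in range(n_estimators):
    idx = rng.integers(0, n_train, size=n_train)  # row indices drawn with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: aggregation by majority vote
all_votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_estimators, n_test)
y_pred = np.array([np.bincount(votes).argmax() for votes in all_votes.T])

accuracy = (y_pred == y_test).mean()
print(f"Hand-rolled bagging accuracy: {accuracy:.4f}")
```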

### **04.What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

Out-of-bag (OOB) samples are data points that are not included in a specific bootstrap sample during the training of a random forest model. In other words, when creating multiple decision trees in a random forest, each tree is trained on a random subset of the data (with replacement), and the remaining data points that are not selected for that particular tree are considered OOB samples. These OOB samples can be used to estimate the model's performance without needing a separate validation set.

Here's a more detailed explanation:

Bootstrapping:

Random forests use a technique called bootstrapping, where random samples with replacement are drawn from the original dataset to create multiple training sets for individual decision trees.

OOB Samples as a Validation Set:

Because of bootstrapping, some data points will not be included in the training data for a specific tree. These unselected data points are the OOB samples for that particular tree.

OOB Error:

The OOB samples can be used to estimate the model's generalization error (how well it performs on unseen data). By predicting the OOB samples for each tree and aggregating those predictions, you can get an OOB error, which serves as an estimate of the model's performance.

Advantages of OOB Samples:

Using OOB samples eliminates the need for a separate validation set, simplifying the model evaluation process. It also provides an approximately unbiased estimate of the model's performance, as the OOB samples are not used in training the specific tree they are being evaluated on.

The Out-of-Bag (OOB) score is a method used to evaluate the performance of ensemble models, particularly those that utilize bagging, such as Random Forests. It provides an internal, approximately unbiased estimate of the model's generalization performance without the need for a separate validation set.

How OOB Score is Used:-

Bootstrapped Sampling and OOB Samples:

During the training of an ensemble model with bagging, each base learner (e.g., a decision tree in a Random Forest) is trained on a bootstrapped sample of the original dataset. A bootstrapped sample is created by randomly sampling with replacement from the original data. The data points that are not included in a particular bootstrapped sample for a specific base learner are called "out-of-bag" (OOB) samples for that base learner.

Individual Base Learner Evaluation:

Each base learner is then evaluated on its respective OOB samples. This means that for each data point in the original dataset, there will be a subset of base learners for which that data point was OOB.

Aggregated OOB Predictions:

For each original data point, predictions are made by all the base learners for which that data point was an OOB sample. These individual predictions are then aggregated (e.g., by majority vote for classification or averaging for regression) to form a final OOB prediction for that data point.

OOB Score Calculation:

The OOB score is calculated by comparing these aggregated OOB predictions to the actual values of the corresponding data points. For classification, this typically involves calculating the accuracy (proportion of correctly classified OOB samples). For regression, metrics like Mean Squared Error (MSE) or R-squared are commonly used on the OOB samples.
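
In scikit-learn, these steps are carried out during `fit` when `oob_score=True` is set. The sketch below compares the OOB accuracy with the accuracy on a held-out test set; the Breast Cancer dataset, the 200 trees and the seeds are assumptions chosen for the illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # compute the OOB accuracy during fit
    bootstrap=True,   # OOB scoring requires bootstrap sampling (the default)
    random_state=42,
)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_           # aggregated OOB predictions vs. true labels
test_accuracy = rf.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The two numbers are usually close, which is why the OOB score is often used as a cheap stand-in for a validation set. The per-sample aggregated OOB class probabilities are available in `rf.oob_decision_function_`.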

Benefits of OOB Score:

Unbiased Estimate:

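The OOB score provides an approximately unbiased estimate of the model's performance on unseen data because the OOB samples were not used in the training of the specific base learners that are predicting on them. Since each point is scored by only a subset of the learners, the estimate tends to be slightly conservative rather than optimistic.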

Efficient Cross-Validation:

It acts as a form of internal cross-validation, eliminating the need to explicitly split the data into separate training and validation sets, thus maximizing the use of available data for training.

Reduced Data Leakage:

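Because each OOB prediction comes only from base learners that never saw that data point during training, the estimate is not contaminated by the training data. This avoids a common source of leakage in hand-written cross-validation code, provided that any preprocessing is not fit on the points being evaluated.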


### **05.Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

Feature importance analysis in a single Decision Tree and a Random Forest differs primarily due to the ensemble nature of Random Forests.

Single Decision Tree:-

A Decision Tree is a very popular supervised machine learning algorithm used for both regression and classification problems. It builds a flowchart-like structure in which each internal node tests a feature, each branch represents a decision rule, and each leaf gives the final result.

Calculation:

Feature importance in a single decision tree is typically calculated based on how much each feature reduces impurity (e.g., Gini impurity or entropy) when used for splitting nodes. Features that lead to larger reductions in impurity are considered more important.
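
As a worked example of "impurity reduction", the sketch below computes the Gini impurity of a node and the weighted decrease achieved by a split. The helper names `gini` and `impurity_decrease` are hypothetical. Library implementations additionally weight each split by the share of training samples reaching the node and normalize the totals per feature, which this sketch omits.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 1, 1, 1]
perfect_split = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])  # children are pure
weak_split = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])     # children stay mixed

print(f"perfect split: {perfect_split:.4f}")
print(f"weak split:    {weak_split:.4f}")
```

A feature that produces splits like the first one accumulates a large total decrease and therefore a high importance score.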

Interpretation:

Feature importance in a single tree is straightforward to interpret and visualize, as it directly reflects the tree's decision-making process.

Limitations:

A single decision tree can be prone to overfitting, and its feature importance may be highly sensitive to small changes in the training data. If two features are highly correlated, the tree might only pick one for splitting, artificially diminishing the importance of the other.

Random Forest:-

Random Forest is a very powerful supervised machine learning algorithm used for classification and regression tasks. It uses ensemble learning (combining multiple models/classifiers to solve a complex problem and improve overall accuracy). In a Random Forest, multiple decision trees are built on different bootstrap subsets of the given data, with a random subset of features considered at each split, and their outputs are combined by majority vote (classification) or averaging (regression). As the number of trees increases, accuracy typically improves and then levels off, and adding more trees does not by itself cause overfitting.

Calculation:

Random Forests calculate feature importance by averaging the impurity reduction across all individual decision trees within the forest (Mean Decrease in Impurity, MDI). Another common method is Permutation Importance, which measures the decrease in model performance when a feature's values are randomly shuffled.

Interpretation:

While less directly interpretable than a single tree's importance due to the ensemble, Random Forest feature importance provides a more robust and stable measure by aggregating insights from multiple trees.

Advantages:

Random Forests inherently mitigate the overfitting issues of single trees, leading to more generalized and reliable feature importance scores. They are better at handling correlated features, as different trees in the forest might utilize different correlated features, leading to a more balanced importance attribution.

Considerations:

Random Forest feature importance can still exhibit biases, such as favoring high-cardinality features or potentially understating the importance of correlated features if their contributions are split across multiple trees. Permutation importance often offers a more robust alternative in such cases.
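
The comparison can be made concrete with a short sketch that computes impurity-based (MDI) importances for a single tree and for a forest, plus permutation importance for the forest on held-out data. The dataset, split and hyperparameters are assumptions chosen for the illustration, and the exact rankings will depend on them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based (MDI) importances; each vector is normalized to sum to 1
tree_mdi = tree.feature_importances_
forest_mdi = forest.feature_importances_

# Permutation importance of the forest, measured on held-out data
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=42)

print("features with non-zero importance (tree):  ", int((tree_mdi > 0).sum()))
print("features with non-zero importance (forest):", int((forest_mdi > 0).sum()))

top = np.argsort(perm.importances_mean)[::-1][:5]
for i in top:
    print(
        f"{data.feature_names[i]:<25} "
        f"MDI={forest_mdi[i]:.3f}  permutation={perm.importances_mean[i]:.3f}"
    )
```

A single tree typically assigns zero importance to every feature it never split on, whereas the forest spreads importance over many more features. On correlated features such as these, permutation scores can be small even for features with high MDI, because shuffling one feature leaves its correlated neighbours intact.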

When to Use Random Forest vs. Decision Tree?

* Use a decision tree when interpretability is important, and you need a simple and easy-to-understand model.

* Use a random forest when you want better generalization performance, robustness to overfitting, and improved accuracy, especially on complex datasets with high-dimensional feature spaces.

* If computational efficiency is a concern and you have a small dataset, a decision tree might be more appropriate.

* If you have a large dataset with complex relationships between features and labels, a random forest is likely to provide better results.



In [1]:
#06.Write a Python program to:

#● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()


import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset and wrap the feature matrix in a DataFrame
data = load_breast_cancer()
data_df = pd.DataFrame(data=data.data, columns=data.feature_names)

# Show the first five rows, transposed so that each feature is a row
data_df.head().T

                                0        1        2        3        4
mean radius                 17.99    20.57    19.69    11.42    20.29
mean texture                10.38    17.77    21.25    20.38    14.34
mean perimeter              122.8    132.9    130.0    77.58    135.1
mean area                  1001.0   1326.0   1203.0    386.1   1297.0
mean smoothness            0.1184  0.08474   0.1096   0.1425   0.1003
mean compactness           0.2776  0.07864   0.1599   0.2839   0.1328
mean concavity             0.3001   0.0869   0.1974   0.2414    0.198
mean concave points        0.1471  0.07017   0.1279   0.1052   0.1043
mean symmetry              0.2419   0.1812   0.2069   0.2597   0.1809
mean fractal dimension    0.07871  0.05667  0.05999  0.09744  0.05883


In [None]:
#06.Write a Python program to:

 #using sklearn.datasets.load_breast_cancer()
 #●Train a Random Forest Classifier


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Breast Cancer dataset
breast_cancer_data = load_breast_cancer()

# Separate features (X) and target variable (y)
X = breast_cancer_data.data
y = breast_cancer_data.target

# Split the data into training and testing sets
# test_size=0.2 means 20% of the data will be used for testing
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
# n_estimators is the number of trees in the forest
# random_state ensures reproducibility of the model training
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest Classifier on the training data
random_forest_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = random_forest_model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, target_names=breast_cancer_data.target_names)

print(f"Accuracy of the Random Forest Classifier: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_rep)

Accuracy of the Random Forest Classifier: 0.9649

Classification Report:
              precision    recall  f1-score   support

   malignant       0.98      0.93      0.95        43
      benign       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



In [None]:
#06.Write a Python program using sklearn.datasets.load_breast_cancer()

#●Print the top 5 most important features based on feature importance scores.


import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

# Train a RandomForestClassifier model
# n_estimators: The number of trees in the forest.
# random_state: Controls the randomness of the estimator.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importance scores
feature_importances = model.feature_importances_

# Create a DataFrame to store feature names and their importance scores
importance_df = pd.DataFrame({
    'Feature': breast_cancer.feature_names,
    'Importance': feature_importances
})

# Sort features by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the top 5 most important features
print("Top 5 most important features:")
print(importance_df.head(5))

Top 5 most important features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [None]:
#07.Write a Python program to:

#● Train a Bagging Classifier using Decision Trees on the Iris dataset



from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable (species)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a Decision Tree Classifier as the base estimator
base_estimator = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging Classifier
# n_estimators: Number of base estimators (Decision Trees) in the ensemble
# estimator: The individual estimator to be bagged
#            (this parameter was named base_estimator before scikit-learn 1.2)
# random_state: For reproducibility
bagging_classifier = BaggingClassifier(
    estimator=base_estimator,
    n_estimators=10,  # You can adjust the number of estimators
    random_state=42
)

# Train the Bagging Classifier
bagging_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_classifier.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Bagging Classifier: {accuracy:.4f}")

Accuracy of the Bagging Classifier: 1.0000


In [None]:
#07.Write a Python program to:

#● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable

# 2. Split the dataset into training and testing sets (same split as the previous cell)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train a single Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# 4. Train a Bagging Classifier that uses Decision Trees as base estimators
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    random_state=42
)
bagging_classifier.fit(X_train, y_train)

# 5. Evaluate both models on the same test set
accuracy_dt = accuracy_score(y_test, dt_classifier.predict(X_test))
accuracy_bagging = accuracy_score(y_test, bagging_classifier.predict(X_test))
print(f"Accuracy of the single Decision Tree: {accuracy_dt:.4f}")
print(f"Accuracy of the Bagging Classifier: {accuracy_bagging:.4f}")

# 6. Comparison statement
if accuracy_bagging > accuracy_dt:
    print("The Bagging Classifier performed better in this case.")
elif accuracy_dt > accuracy_bagging:
    print("The single Decision Tree performed better in this case.")
else:
    print("Both models achieved the same accuracy.")



Accuracy of the single Decision Tree: 1.0000
Accuracy of the Bagging Classifier: 1.0000
Both models achieved the same accuracy.


In [None]:
#08.Write a Python program to:

#● Train a Random Forest Classifier


import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

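# 1. Prepare a sample dataset (replace with your actual data)
# For demonstration, we'll create a synthetic dataset.
# Note: the labels are random and unrelated to the features, so accuracy close
# to 0.5 (chance level) is expected. NumPy is not seeded here, so the exact
# value changes from run to run.
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(0, 2, 100) # 100 binary labels (0 or 1)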

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Instantiate the Random Forest Classifier
# n_estimators: number of trees in the forest
# random_state: for reproducibility of results
rfc = RandomForestClassifier(n_estimators=100, random_state=42)

# 4. Train the model
rfc.fit(X_train, y_train)

# 5. Make predictions on the test set
y_pred = rfc.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Random Forest Classifier: {accuracy:.2f}")

# You can also explore other metrics like:
# from sklearn.metrics import classification_report, confusion_matrix
# print(classification_report(y_test, y_pred))
# print(confusion_matrix(y_test, y_pred))

Accuracy of the Random Forest Classifier: 0.50


In [None]:
#08.Write a Python program to:

#Tune hyperparameters max_depth and n_estimators using GridSearchCV



from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import make_classification

# 1. Prepare data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Define the parameter grid
param_grid = {
    'max_depth': [5, 10, 15, 20],
    'n_estimators': [50, 100, 150, 200]
}

# 3. Instantiate the estimator
estimator = RandomForestClassifier(random_state=42)

# 4. Instantiate GridSearchCV
grid_search = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# 5. Fit GridSearchCV
print("Performing Grid Search...")
grid_search.fit(X_train, y_train)
print("Grid Search complete.")

# 6. Retrieve best parameters and score
print("\nBest Parameters:", grid_search.best_params_)
print("Best Cross-validation Score:", grid_search.best_score_)

# You can also access the best estimator directly
best_model = grid_search.best_estimator_
print("\nBest Estimator:", best_model)

# Evaluate the best model on the test set
test_accuracy = best_model.score(X_test, y_test)
print("Test Set Accuracy of Best Model:", test_accuracy)

Performing Grid Search...
Grid Search complete.

Best Parameters: {'max_depth': 15, 'n_estimators': 100}
Best Cross-validation Score: 0.9275

Best Estimator: RandomForestClassifier(max_depth=15, random_state=42)
Test Set Accuracy of Best Model: 0.95


In [None]:
#08.Write a Python program to:

#Print the best parameters and final accuracy


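from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier # Example model

# Recreate the synthetic dataset and split from the GridSearchCV cell above
# so that this cell can run on its own
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)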

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# Initialize the model
model = RandomForestClassifier(random_state=42)

# Perform Grid Search Cross-Validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print the best parameters found by Grid Search
print(f"Best parameters: {grid_search.best_params_}")

# Get the best estimator (model with best parameters)
best_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred = best_model.predict(X_test)

# Calculate and print the final accuracy on the test set
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {test_accuracy}")

Best parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Test set accuracy: 0.945


In [None]:
#09.Write a Python program to:

#Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset

#Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Bagging Regressor ---
# A Bagging Regressor with a Decision Tree as the base estimator
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42),
    n_estimators=100,  # Number of base estimators (trees)
    max_samples=0.8,   # Use 80% of the samples for each base estimator
    bootstrap=True,    # Sample with replacement
    random_state=42,
    n_jobs=-1          # Use all available CPU cores
)
bagging_reg.fit(X_train, y_train)
y_pred_bagging = bagging_reg.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
print(f"Bagging Regressor Mean Squared Error: {mse_bagging:.4f}")

# --- Random Forest Regressor ---
random_forest_reg = RandomForestRegressor(
    n_estimators=100,  # Number of trees in the forest
    max_features=0.8,  # Use 80% of features for each split
    random_state=42,
    n_jobs=-1          # Use all available CPU cores
)
random_forest_reg.fit(X_train, y_train)
y_pred_rf = random_forest_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f"Random Forest Regressor Mean Squared Error: {mse_rf:.4f}")


Bagging Regressor Mean Squared Error: 0.2599
Random Forest Regressor Mean Squared Error: 0.2536


### **10.You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

You decide to use ensemble techniques to increase model performance.

Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world context.



**ANSWER-**

Here's a step-by-step approach to using ensemble techniques for loan default prediction:

1. Choose between Bagging or Boosting:

Consider Boosting (e.g., Gradient Boosting, XGBoost, LightGBM) for this problem.
Loan default prediction often involves complex relationships and potentially imbalanced datasets. Boosting algorithms build models sequentially, each one correcting the errors of its predecessors. This mainly reduces bias and makes boosting effective at capturing intricate patterns, which is crucial in financial risk assessment. Bagging (e.g., Random Forest) trains models independently and mainly reduces variance, so it is less prone to overfitting and less sensitive to hyperparameter choices. Boosting often achieves higher accuracy on complex classification tasks like this one, but it is worth benchmarking both with cross-validation before committing.
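One way to ground this choice is to benchmark a bagging-style and a boosting-style model under the same cross-validation split. The sketch below is only an illustration: it assumes a synthetic dataset from `make_classification` with roughly 10% positives as a stand-in for real loan data, and the names `candidates` and `auc_by_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data (assumption): about 10% "default" cases
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

candidates = {
    "bagging (random forest)": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    "boosting (gradient boosting)": GradientBoostingClassifier(
        n_estimators=100, random_state=42
    ),
}

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

auc_by_model = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    auc_by_model[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Which model comes out ahead depends on the data, so the same comparison should be rerun on the real features before a family is chosen.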

2. Handle Overfitting:

Regularization:

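Apply L1 or L2 regularization during training to penalize large coefficients (in tree-based boosting libraries such as XGBoost, the penalty applies to leaf weights), which discourages overly complex models that fit noise in the data. For tree ensembles, limiting tree depth, lowering the learning rate, and subsampling rows or features serve the same purpose.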

Cross-validation:

Use k-fold cross-validation during hyperparameter tuning to estimate performance on unseen data and to choose settings that generalize rather than fit the training set.

Feature Engineering/Selection:

Carefully select and engineer features to reduce dimensionality and focus on the most relevant information, minimizing the chances of the model learning spurious correlations.

Early Stopping:

For iterative algorithms like boosting, monitor performance on a validation set and stop training once that performance stops improving, before the model begins memorizing the training data.
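Several of these controls can be combined in a single boosting model. The following is a minimal sketch, assuming scikit-learn's `GradientBoostingClassifier` and a synthetic dataset in place of real loan records. The specific values (learning rate, depth, patience of 10 rounds) are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data (assumption)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # smaller steps shrink each tree's contribution
    max_depth=3,              # shallow trees limit model complexity
    subsample=0.8,            # each tree sees a random 80% of the rows
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
print(f"Boosting rounds used: {gb.n_estimators_} (limit {gb.n_estimators})")
print(f"Test ROC AUC: {test_auc:.3f}")
```

The fitted attribute `n_estimators_` reports how many rounds were actually kept, which shows whether early stopping triggered before the limit.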

3. Select Base Models:

Decision Trees:

These are commonly used as base learners in both bagging (e.g., Random Forest) and boosting (e.g., Gradient Boosting Machines). An individual tree is easy to interpret and can capture non-linear relationships, although the combined ensemble is harder to explain than a single tree.

Consider variations:

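For boosting, explore different types of base learners, such as shallow decision trees (including depth-1 stumps) or even linear models if they suit certain features. For bagging, deeper trees are the usual choice, because averaging many high-variance models is where bagging helps most.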
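The pairing of base learner and ensemble type can be sketched as follows. This assumes scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` with decision trees passed as the first positional argument, and uses synthetic data. The dictionary names `ensembles` and `base_model_auc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for loan data (assumption)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=1,
)

ensembles = {
    # Bagging averages many fully grown (high-variance) trees
    "bagging, fully grown trees": BaggingClassifier(
        DecisionTreeClassifier(random_state=1),
        n_estimators=50, random_state=1,
    ),
    # Boosting adds many depth-1 stumps (high-bias learners) one at a time
    "boosting, depth-1 stumps": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),
        n_estimators=50, random_state=1,
    ),
}

base_model_auc = {}
for name, model in ensembles.items():
    # An integer cv with a classifier uses stratified folds
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    base_model_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```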

4. Evaluate Performance using Cross-Validation:

K-Fold Cross-Validation:

Divide the dataset into 'k' folds. Train the model on 'k-1' folds and evaluate on the remaining fold. Repeat this 'k' times, using each fold once as the validation set. This provides a robust estimate of the model's generalization performance.

Metrics:

Use appropriate evaluation metrics for imbalanced datasets, such as AUC-ROC, Precision, Recall, F1-score, or Gini coefficient, rather than just accuracy, as accuracy can be misleading when one class is much larger than the other.
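The evaluation described above might look like the sketch below. It assumes stratified 5-fold cross-validation with scikit-learn's `cross_validate` on synthetic imbalanced data, and it derives the Gini coefficient from the AUC using the relation Gini = 2 × AUC − 1. Accuracy is included only to contrast it with the other metrics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data (assumption): about 10% "default" cases
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=7,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
scoring = ["roc_auc", "precision", "recall", "f1", "accuracy"]

results = cross_validate(
    GradientBoostingClassifier(random_state=7), X, y, cv=cv, scoring=scoring
)

# One score per fold for each metric; report mean and spread
for metric in scoring:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric:>9}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")

# Gini coefficient derived from the mean AUC
gini = 2 * results["test_roc_auc"].mean() - 1
print(f"     gini: {gini:.3f}")
```

On data this imbalanced, a model that always predicts "no default" would already score about 90% accuracy, so recall and AUC are the more informative numbers.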

5. Justify how ensemble learning improves decision-making in this real-world context:

Increased Accuracy and Robustness:

Ensemble methods, by combining multiple models, generally achieve higher predictive accuracy and are more robust to noise and outliers compared to single models. This leads to more reliable loan default predictions.

Reduced Risk:

More accurate predictions of loan default allow the financial institution to make better-informed decisions regarding loan approvals, interest rates, and risk management strategies, ultimately reducing financial losses due to defaults.

Improved Decision Support:

The ensemble model provides a more comprehensive and reliable assessment of loan applicant risk, enabling faster and more consistent decision-making for loan officers and automated systems.

Handling Complexity:

Ensemble methods can effectively capture complex, non-linear relationships within the customer demographic and transaction history data that might be missed by simpler models, providing a more nuanced understanding of default risk.
