1. Can we use Bagging for regression problems ?


Yes, bagging works for regression problems, and it is a common and effective way to improve the stability and accuracy of regression models. Bagging, or bootstrap aggregating, trains multiple regression models on different bootstrap samples of the training data (random samples drawn with replacement) and then averages their predictions to produce a final, more robust prediction.

2. What is the difference between multiple model training and single model training ?



Single Model Training:
Focus:
Develops a single model to learn patterns and make predictions from data.
Process:
The model is trained using a specific algorithm and parameters, and its performance is evaluated based on its ability to generalize to unseen data.
Advantages:
Simpler to implement and manage.
Can be faster to train than multiple models.
May be more interpretable, depending on the model type.
Disadvantages:
Performance can be limited by the model's architecture and training data.
Can be prone to overfitting or underfitting.


Multiple Model Training (Ensemble Methods):
Focus:
Uses multiple models, often trained independently, and combines their predictions.
Process:
Training: Each model is trained using different algorithms, parameters, or data subsets.
Combination: The predictions from the individual models are combined using a voting scheme, averaging, or other methods to produce a final prediction.
Advantages:
Can achieve higher accuracy and robustness than single models.
Can be more resilient to errors or biases in individual models.
Disadvantages:
More complex to implement and manage.
Can be slower to train and predict than single models.
May be less interpretable than single models.
Examples of Ensemble Methods:
Bagging:
Training multiple models on different subsets of the training data and averaging their predictions (e.g., Random Forest).
Boosting:
Training models sequentially, where each model focuses on correcting the errors of the previous models (e.g., XGBoost, LightGBM).
Stacking:
Training multiple models and then using another model (the meta-model) to combine their predictions.
In summary: Choose single model training for simplicity and speed, and ensemble methods when higher accuracy and robustness are crucial, even at the cost of complexity and training time.
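
As a rough illustration of the trade-off summarized above, the sketch below fits a single decision tree and one ensemble from each family on a synthetic dataset. The dataset, the choice of base models, and all hyperparameter values are assumptions made for this example, and the resulting scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees on bootstrap samples, combined by voting
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Boosting: trees added sequentially to correct earlier errors
    "boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: a meta-model (logistic regression) combines different base models
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(random_state=0)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(),
    ),
}

scores = {}
for name, model in models.items():
    scores[name] = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:12s} accuracy = {scores[name]:.3f}")
```

The single tree trains fastest, while the ensembles cost more time in exchange for (usually, not always) better test accuracy.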

3. Explain the concept of feature randomness in Random Forest.



In Random Forest, feature randomness (sometimes called feature bagging) means that at each split of a decision tree the algorithm considers only a random subset of the features, rather than all available features, when choosing the best split. The closely related random subspace method instead draws one feature subset per tree. This randomness decorrelates the trees, so combining them reduces variance more effectively and yields a more robust model with better generalization.
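
The per-split feature sampling can be sketched in a few lines of NumPy. The helper name `candidate_features` is hypothetical; its `max_features` argument only mimics the `"sqrt"`, `"log2"`, and `None` options of scikit-learn's Random Forest parameter of the same name, and this is not how the library is implemented internally.

```python
import numpy as np


def candidate_features(n_features, rng, max_features="sqrt"):
    """Return the feature indices that one split is allowed to consider."""
    if max_features == "sqrt":
        k = max(1, int(np.sqrt(n_features)))
    elif max_features == "log2":
        k = max(1, int(np.log2(n_features)))
    else:  # None: consider every feature, as a plain decision tree would
        k = n_features
    # Sample k distinct feature indices without replacement
    return np.sort(rng.choice(n_features, size=k, replace=False))


rng = np.random.default_rng(0)
n_features = 16
for split in range(3):
    # Each split draws a fresh random subset of candidate features
    print(f"split {split}: candidates = {candidate_features(n_features, rng)}")
```

Because every split sees a different subset, a single dominant feature cannot be chosen at the top of every tree, which is what keeps the trees from all looking alike.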

4. What is OOB (Out-of-Bag) Score ?



The out-of-bag (OOB) score estimates the generalization performance of models built with bagging, such as Random Forests, without needing a separate validation or test set. Each base model is trained on a bootstrap sample, so some training points are left out of that sample; each of these "out-of-bag" points is predicted using only the models that never saw it during training, and the accuracy of those predictions is the OOB score. For classification, the OOB error is one minus the OOB score.

5.  How can you measure the importance of features in a Random Forest model ?



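In Random Forest models, feature importance is typically measured using the mean decrease in impurity (MDI, also called Gini importance), which quantifies how much a feature reduces node impurity when it is used for splitting, averaged across all trees in the forest. Because MDI is computed on the training data and tends to favor features with many distinct values, permutation importance on held-out data is a common cross-check.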
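
A minimal sketch of reading importance scores from a fitted forest, next to permutation importance on a held-out set for comparison. The synthetic dataset and parameter values are assumptions for illustration; with `shuffle=False`, `make_classification` places the informative features first (features 0 to 2 here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 6 features, of which only the first 3 carry signal (illustrative assumption)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    random_state=0, shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Mean decrease in impurity: computed during training, sums to 1
mdi = forest.feature_importances_

# Permutation importance: drop in test score when one feature is shuffled
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: MDI = {mdi[i]:.3f}, permutation = {perm.importances_mean[i]:.3f}")
```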

6.  Explain the working principle of a Bagging Classifier.


A Bagging classifier is an ensemble meta-estimator that fits each base classifier on a random subset of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.

7. How do you evaluate a Bagging Classifier’s performance ?



To evaluate a Bagging Classifier's performance, you should train the model, make predictions on a test set, and then use metrics like accuracy, precision, recall, and F1-score to assess its performance.
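
The evaluation steps described above might look like the following sketch. The breast cancer dataset, the 70/30 split, and the number of estimators are assumptions chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Train the model, then predict on data it has not seen
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Compute the metrics on the held-out test set
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name:9s} = {value:.3f}")
```

Accuracy alone can be misleading on imbalanced classes, which is why precision, recall, and F1 are reported alongside it.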

8.  How does a Bagging Regressor work ?


A Bagging Regressor works by training multiple base regression models on different random subsets of the training data (bootstrap samples), and then aggregating their predictions (typically by averaging) to produce a final prediction.

9.  What is the main advantage of ensemble techniques ?

The main advantage of ensemble techniques in machine learning is improved predictive performance and generalization by combining multiple models, often resulting in higher accuracy and robustness than individual models.


10.  What is the main challenge of ensemble methods ?


The main challenge of ensemble methods lies in their increased computational complexity and the remaining potential for overfitting, which requires careful selection and tuning of the base models and, for stacking, of the meta-model.

11.  Explain the key idea behind ensemble techniques.


Ensemble techniques in machine learning combine multiple models to achieve better performance than any single model, leveraging the collective strengths of diverse models to reduce errors and improve accuracy.



The Core Idea:
Ensemble learning is based on the principle that combining the predictions of multiple models can overcome the limitations of individual models, leading to a more robust and accurate final prediction.

How it Works:
Multiple Models: Ensemble methods involve training multiple models (often "weak learners") on the same dataset or different subsets of it.
Combining Predictions: The predictions from these individual models are then combined using a specific strategy, such as averaging (for regression), majority voting (for classification), or a more sophisticated method like weighted averaging.

Improved Performance:
The resulting ensemble model often exhibits better generalization performance, meaning it performs well on unseen data, compared to any of the individual models alone.

Types of Ensemble Techniques:
Bagging: (Bootstrap Aggregating) trains multiple models on different subsets of the training data, and then combines their predictions.

Boosting: trains models sequentially, with each model focusing on the errors made by previous models.

Stacking: trains multiple models and then uses their predictions as input to a meta-model (a model that learns how to combine the predictions of the base models).

Benefits of Ensemble Learning:
Improved Accuracy: By combining multiple perspectives, ensemble models often achieve higher accuracy than single models.

Reduced Overfitting: Ensemble methods can help to mitigate overfitting by averaging out the predictions of multiple models.

Robustness: Ensemble models are often more robust to noisy data and outliers because they rely on the collective wisdom of multiple models.

Examples of Ensemble Algorithms:
Random Forest (a bagging method)
Gradient Boosting (a boosting method)
XGBoost (a boosting method)
Stacking (a stacking method)

12.What is a Random Forest Classifier?


 A Random Forest classifier is a machine learning algorithm that uses an ensemble of decision trees to make predictions, combining the results of multiple trees to improve accuracy and reduce overfitting.

13. What are the main types of ensemble techniques ?



Three of the most popular ensemble learning techniques are bagging, boosting, and stacking. Together they illustrate the main distinctions among ensemble methods: parallel versus sequential training, and homogeneous versus heterogeneous base models.

14. What is ensemble learning in machine learning ?


Ensemble learning in machine learning combines multiple models to improve predictive performance, leveraging the strengths of diverse models to achieve better accuracy and robustness than a single model.

15. When should we avoid using ensemble methods ?


Avoid ensemble methods when computational resources are severely limited, data is insufficient or highly correlated, or interpretability is paramount, as they can be computationally intensive, complex, and potentially difficult to explain.

16. How does Bagging help in reducing overfitting ?



Bagging attempts to reduce the chance of overfitting with complex models. It trains a large number of “strong” learners in parallel, each on a different bootstrap sample. A strong learner is a model that's relatively unconstrained, so it has low bias but high variance. Bagging then combines all the strong learners, by averaging or voting, in order to “smooth out” their predictions and reduce that variance.
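
The smoothing effect can be made more precise with a standard variance identity. As a sketch, assume (this notation is introduced here, not taken from the text above) that the predictions of the B base learners at a point x are identically distributed with variance σ² and pairwise correlation ρ. The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as more learners are added, while the first term depends on how correlated the learners are. Training each learner on a different bootstrap sample lowers ρ, which is why the averaged prediction varies less than any single learner's.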

17. Why is Random Forest better than a single Decision Tree ?


Random Forest is generally considered better than a single decision tree because it mitigates overfitting, improves accuracy and generalization by combining multiple trees, and provides insights into feature importance.




18. What is the role of bootstrap sampling in Bagging ?

In bagging (Bootstrap Aggregating), bootstrap sampling creates diverse training sets by randomly sampling the training data with replacement, so the same data point can appear multiple times in one sample while others are left out. Each sample is used to train one model, and the models are then aggregated to improve prediction accuracy and stability.
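
A small NumPy sketch of one bootstrap draw, showing how many distinct rows end up in the sample and how many are left out. The sample size of 10,000 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: draw n row indices from n rows, with replacement
idx = rng.integers(0, n_samples, size=n_samples)

in_bag = np.unique(idx)                                   # rows drawn at least once
out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)   # rows never drawn

print(f"distinct rows in the bootstrap sample: {len(in_bag) / n_samples:.1%}")
print(f"rows left out (out-of-bag):            {len(out_of_bag) / n_samples:.1%}")
# Probability that a given row is never drawn in n draws
print(f"theoretical left-out share (1 - 1/n)^n: {(1 - 1 / n_samples) ** n_samples:.1%}")
```

For large n the left-out share approaches 1/e, roughly 37%, so each model sees only about 63% of the distinct rows. The left-out rows are the ones an out-of-bag estimate relies on.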

19. What are some real-world applications of ensemble techniques ?



Ensemble techniques, which combine multiple models to improve prediction accuracy, find applications in diverse fields like fraud detection, medical diagnostics, stock market prediction, and customer behavior analysis.

20. What is the difference between Bagging and Boosting?


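Bagging and boosting are both ensemble learning methods that combine multiple models to improve accuracy and stability. The main difference between the two is how the models are trained: bagging trains its models independently (and therefore in parallel) on bootstrap samples and then averages or votes over them, which mainly reduces variance, while boosting trains its models sequentially, with each new model concentrating on the errors of the previous ones, which mainly reduces bias.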
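
The difference in training procedure can be sketched by hand for a regression problem. This is a simplified illustration under stated assumptions: the data are synthetic, the boosting loop is a bare-bones residual-fitting scheme for squared error (not the exact algorithm of any particular library), and the depths and learning rate are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.2, size=300)

n_models = 50

# Bagging: every tree is trained independently on its own bootstrap sample,
# and the final prediction is the plain average of all trees.
bagged_trees = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X[idx], y[idx])
    bagged_trees.append(tree)
bagging_pred = np.mean([t.predict(X) for t in bagged_trees], axis=0)

# Boosting: trees are trained one after another, each fit to the residuals
# (the errors) left by the trees that came before it.
learning_rate = 0.1
boosting_pred = np.full(len(y), y.mean())
boosted_trees = []
for _ in range(n_models):
    residuals = y - boosting_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    boosting_pred += learning_rate * tree.predict(X)
    boosted_trees.append(tree)

print("Bagging  training MSE:", np.mean((y - bagging_pred) ** 2))
print("Boosting training MSE:", np.mean((y - boosting_pred) ** 2))
```

The bagging loop could run in any order or in parallel, since no tree depends on another. The boosting loop cannot, because each tree needs the predictions accumulated so far.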

**#practical**

In [None]:
#21.  Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy




import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Parameters
n_models = 100
random_states = [i for i in range(n_models)]


# Helper function for bootstrapping
def bootstrapping(X, y):
    n_samples = X.shape[0]
    idxs = np.random.choice(n_samples, n_samples, replace=True)
    return X[idxs], y[idxs]


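# Helper function for bagging prediction (majority vote across models)
def predict(X, models):
    predictions = np.array([model.predict(X) for model in models])
    # Mode along axis 0 gives the most common label per sample; ravel() returns
    # a 1-D array regardless of the SciPy version's keepdims default
    return np.ravel(stats.mode(predictions, axis=0)[0])



# Create a list to store all the tree models
tree_models = []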

# Iteratively train decision trees on bootstrapped samples
for i in range(n_models):
    X_, y_ = bootstrapping(X_train, y_train)
    tree = DecisionTreeClassifier(max_depth=2, random_state=random_states[i])
    tree.fit(X_, y_)
    tree_models.append(tree)

# Predict on the test set
y_pred = predict(X_test, tree_models)

# Print the accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))

Accuracy:  1.0


In [None]:
#22.Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)



# Import required libraries
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Diabetes dataset (a built-in regression dataset)
X, y = load_diabetes(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree regressor as a baseline
single_tree = DecisionTreeRegressor(random_state=42)
single_tree.fit(X_train, y_train)
y_pred_single = single_tree.predict(X_test)

# Evaluate the single Decision Tree
mse_single = mean_squared_error(y_test, y_pred_single)
print(f"MSE of Single Decision Tree: {mse_single:.2f}")

# Train a Bagging Regressor with Decision Trees as base learners
bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # Base model is a decision tree
    n_estimators=10,                    # Train 10 decision trees
    random_state=42,
    bootstrap=True                      # Use bootstrap sampling
)

bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)

# Evaluate the Bagging Regressor
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
print(f"MSE of Bagging Regressor: {mse_bagging:.2f}")


In [None]:
#23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores

# importing the libraries
# data processing, CSV file I/O
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

dataset = pd.read_csv("../data/data.csv")


def get_xy(data: pd.DataFrame, list_drp: list):
    """
    Split the DataFrame into the target y and the features X.

    args:
        data (pd.DataFrame): the DataFrame from which X and y are extracted
        list_drp (list): columns to drop to obtain the feature matrix

    returns:
        y as a pandas Series and X as a pandas DataFrame
    """
    y = data.diagnosis  # M or B
    X = data.drop(list_drp, axis=1)
    return y, X


d_list = ['Unnamed: 32', 'id', 'diagnosis']
y, x = get_xy(dataset, d_list)
x.head()

# Train the Random Forest and print the feature importance scores
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
clf_rf = RandomForestClassifier(random_state=42)
clf_rf.fit(x_train, y_train)

importances = pd.Series(clf_rf.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))

In [None]:
 #24.Train a Random Forest Regressor and compare its performance with a single Decision Tree





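import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (an assumption: the original position/salary data is not shown)
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 1))
y = 1000 * X.ravel() ** 2 + rng.normal(0, 5000, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
regressor = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("Decision Tree MSE:", mean_squared_error(y_test, tree.predict(X_test)))
print("Random Forest MSE:", mean_squared_error(y_test, regressor.predict(X_test)))

X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)
plt.scatter(X, y, color='blue') #plotting real points
plt.plot(X_grid, regressor.predict(X_grid), color='green') #plotting predicted points
plt.title("Random Forest Regression Results")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()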


In [None]:
#25.Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier



# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

from collections import OrderedDict

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 123

# Generate a binary classification dataset.
X, y = make_classification(
    n_samples=500,
    n_features=25,
    n_clusters_per_class=1,
    n_informative=15,
    random_state=RANDOM_STATE,
)

# NOTE: Setting the `warm_start` construction parameter to `True` disables
# support for parallelized ensembles but is necessary for tracking the OOB
# error trajectory during training.
ensemble_clfs = [
    (
        "RandomForestClassifier, max_features='sqrt'",
        RandomForestClassifier(
            warm_start=True,
            oob_score=True,
            max_features="sqrt",
            random_state=RANDOM_STATE,
        ),
    ),
    (
        "RandomForestClassifier, max_features='log2'",
        RandomForestClassifier(
            warm_start=True,
            max_features="log2",
            oob_score=True,
            random_state=RANDOM_STATE,
        ),
    ),
    (
        "RandomForestClassifier, max_features=None",
        RandomForestClassifier(
            warm_start=True,
            max_features=None,
            oob_score=True,
            random_state=RANDOM_STATE,
        ),
    ),
]

# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 150

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1, 5):
        clf.set_params(n_estimators=i)
        clf.fit(X, y)

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()

In [None]:
#26. Train a Bagging Classifier using SVM as a base estimator and print accuracy



# Import required libraries
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Bagging Classifier with SVMs as base learners.
# Features are standardized first because SVMs are sensitive to feature scale.
bagging_model = BaggingClassifier(
    estimator=make_pipeline(StandardScaler(), SVC()),  # Base model is a scaled SVM
    n_estimators=10,                                   # Train 10 SVMs
    random_state=42,
    bootstrap=True                                     # Use bootstrap sampling
)

bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)

# Evaluate the Bagging Classifier
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"Accuracy of Bagging Classifier (SVM): {accuracy_bagging:.2f}")


In [None]:
#27 Random Forest Classifier with different numbers of trees and compare accuracy

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create and prepare dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
                   72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
                   88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

# Prepare data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']

df = df[column_order]

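# Prepare features and target
X,y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Train Random Forests with different numbers of trees and compare test accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for n_trees in [1, 10, 50, 100]:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    rf.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"n_estimators={n_trees:>3}: accuracy = {acc:.2f}")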







In [None]:
#29.Train a Random Forest Regressor and analyze feature importance scores





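from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Diabetes is a built-in regression dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target


reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X, y)

importance = reg.feature_importances_

# Print the features from most to least important
for name, score in sorted(zip(diabetes.feature_names, importance), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")


import matplotlib.pyplot as plt

plt.bar(range(X.shape[1]), importance)
plt.xticks(range(X.shape[1]), diabetes.feature_names, rotation=90)
plt.title("Feature Importance in Random Forest")
plt.show()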

In [None]:
#31.Train a Random Forest Classifier and tune hyperparameters using GridSearchCV

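from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Built-in dataset used as a stand-in, since no input file is specified here
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hyperparameter grid to search (values chosen for illustration)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", "log2"],
}

# 5-fold cross-validated grid search over every combination in param_grid
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validation accuracy: {grid.best_score_:.3f}")
print(f"Test accuracy: {accuracy_score(y_test, grid.predict(X_test)):.3f}")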


In [None]:
#32.Train a Bagging Regressor with different numbers of base estimators and compare performance

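from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=100, n_features=4,
                       n_informative=2, n_targets=1,
                       random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the same base estimator with different ensemble sizes and compare test MSE
for n in [1, 5, 10, 50]:
    regr = BaggingRegressor(estimator=SVR(),
                            n_estimators=n, random_state=0).fit(X_train, y_train)
    mse = mean_squared_error(y_test, regr.predict(X_test))
    print(f"n_estimators={n:>2}: MSE = {mse:.2f}")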

In [None]:
#34.Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier




from sklearn.tree import DecisionTreeClassifier  # For creating decision tree classifiers
from sklearn.svm import SVC  # For creating Support Vector Machine classifiers

# Import model evaluation and ensemble method
from sklearn.metrics import accuracy_score  # For evaluating model accuracy
from sklearn.ensemble import BaggingClassifier  # For ensemble learning using bagging

# Import utility functions for data handling and preparation
from sklearn.model_selection import train_test_split  # For splitting the dataset into train and test sets
from sklearn.datasets import make_classification  # For generating a synthetic classification dataset
import pandas as pd  # For handling data frames
import numpy as np  # For numerical operations

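X, y = make_classification(n_samples=10000, n_features=10, n_informative=3, random_state=42)

# Splitting the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: a single Decision Tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print("Single Decision Tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))

# Bagging Classifier with Decision Trees as base learners
# (the hyperparameter values here are illustrative assumptions)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)

# Predicting the labels of the test data
y_pred = bag.predict(X_test)

# Printing the accuracy of the Bagging Classifier
print("Bagging Classifier accuracy:", accuracy_score(y_test, y_pred))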

# Printing the shape of the samples and features used by the first estimator
# to illustrate how the Bagging Classifier diversifies training across different learners
print("Shape of the samples used by the first estimator:", bag.estimators_samples_[0].shape)
print("Shape of the features used by the first estimator:", bag.estimators_features_[0].shape)

In [None]:
#35.Train a Random Forest Classifier and visualize the confusion matrix


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary dataset (a stand-in, since no dataset is specified here)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Fit the model
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)
#Generate predictions with the model on the held-out test set
y_pred = forest.predict(X_test)
#Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)

#Annotate each cell with its name, count, and share of all samples
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

In [None]:
#37. Train a Random Forest Classifier and print the top 5 most important features

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)


from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)
import time

import numpy as np

start_time = time.time()
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
elapsed_time = time.time() - start_time

print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")

import pandas as pd

forest_importances = pd.Series(importances, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
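fig.tight_layout()

# Print the top 5 most important features
print(forest_importances.sort_values(ascending=False).head(5))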

In [None]:
#41.Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score

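import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic binary dataset (a stand-in, since no dataset is specified here)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=42)
X_train, X_test, y_train, y_true = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Random Forest and score the test set with positive-class probabilities
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_scores = forest.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC score: {roc_auc:.3f}")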

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

In [1]:
#42 Train a Bagging Classifier and evaluate its performance using cross-validation



from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

# Bagging Classifier with Decision Trees as base learners
clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=0)

# 5-fold cross-validation: each fold is held out once for scoring
scores = cross_val_score(clf, X, y, cv=5)
print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

In [None]:
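#43.Train a Random Forest Classifier and plot the Precision-Recall curve
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary dataset (a stand-in, since no dataset is specified here)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Random Forest and get positive-class probabilities for the test set
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_scores = forest.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)

# Calculate AUC
auc_score = auc(recall, precision)

print(f"Area Under the Precision-Recall Curve: {auc_score:.4f}")

# Plot Precision-Recall curve with AUC
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'AUC = {auc_score:.4f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve with AUC')
plt.legend(loc='lower left')
plt.fill_between(recall, precision, alpha=0.2)
plt.show()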

In [None]:
#45.Train a Bagging Regressor with different levels of bootstrap samples and compare performance.

# Importing necessary libraries
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Diabetes dataset (a built-in regression dataset)
X, y = load_diabetes(return_X_y=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the base regressor (in this case, a decision tree)
base_regressor = DecisionTreeRegressor()

# Train one BaggingRegressor per bootstrap sample size.
# max_samples is the fraction of the training set drawn (with replacement)
# for each base estimator.
for max_samples in [0.25, 0.5, 0.75, 1.0]:
    bagging_regressor = BaggingRegressor(
        estimator=base_regressor,
        n_estimators=10,
        max_samples=max_samples,
        bootstrap=True,
        random_state=42,
    )

    # Train the BaggingRegressor
    bagging_regressor.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = bagging_regressor.predict(X_test)

    # Calculate the Mean Squared Error
    mse = mean_squared_error(y_test, y_pred)
    print(f"max_samples={max_samples:.2f}: MSE = {mse:.2f}")