Theory Questions

1. Can we use Bagging for regression problems?
Ans 1. Yes, bagging (Bootstrap Aggregating) can definitely be used for regression problems. In regression, the goal is to predict a continuous output variable, and bagging helps improve the accuracy and robustness of these predictions by combining the outputs of multiple base regressors. It works by creating multiple subsets of the original training dataset through bootstrapping (random sampling with replacement), training a separate regression model on each subset, and then aggregating the individual predictions — usually by averaging them — to obtain the final output. This process reduces variance, helps prevent overfitting, and can significantly enhance the performance of unstable models like decision trees. Therefore, bagging is a powerful ensemble method suitable not only for classification but also for regression tasks.


2. What is the difference between multiple model training and single model training?
Ans 2. The primary difference between multiple model training and single model training lies in the number of models being developed and the objectives they serve. In **single model training**, only one machine learning model is trained on the entire dataset to perform a specific task, such as classification or regression. This approach is straightforward, easier to manage, and suitable when the problem is well-defined and the dataset is homogeneous. On the other hand, **multiple model training** involves training several models either independently or as part of an ensemble approach. This can be done to compare performance across different algorithms, improve overall prediction accuracy through methods like bagging, boosting, or stacking, or handle complex tasks such as multi-output problems or domain adaptation. Multiple model training often yields better generalization and robustness, but it is more computationally intensive and complex to implement and maintain compared to single model training.


3. Explain the concept of feature randomness in Random Forest.
Ans 3. Feature randomness in Random Forest refers to the technique of introducing randomness in the selection of features at each split during the construction of decision trees. Unlike a traditional decision tree, where the best feature is chosen from the entire set of input features to split the data at each node, Random Forest selects a random subset of features at each node and then chooses the best feature among them for splitting. This random selection helps to reduce the correlation between individual trees in the ensemble, making the overall model more robust and less prone to overfitting. By ensuring that not all trees use the same dominant features, feature randomness promotes diversity among the trees, which improves the generalization performance of the Random Forest algorithm. This is a key reason why Random Forest often performs better than individual decision trees in predictive accuracy and model stability.
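
The per-split feature sampling described above can be sketched in a few lines. This is only an illustration, not how any particular library implements it: the helper name `candidate_features` and the square-root rule for the subset size are assumptions chosen for the example.

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick the feature indices a single split is allowed to consider."""
    # Square-root rule: a common choice of subset size for classification
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
n_features = 16

# Each split draws its own random subset, so different nodes (and trees)
# end up evaluating different features.
splits = [candidate_features(n_features, rng) for _ in range(5)]
for node, features in enumerate(splits):
    print(f"split {node}: consider features {features}")
```

In scikit-learn, the size of this subset is controlled by the `max_features` parameter of `RandomForestClassifier` and `RandomForestRegressor`.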


4. What is OOB (Out-of-Bag) Score?
Ans 4. The Out-of-Bag (OOB) score is a performance estimate used in bagging-based ensembles, particularly Random Forest, to gauge generalization without a separate validation dataset. Each decision tree is trained on a bootstrap sample (drawn with replacement), so on average about 37% of the training points are left out of any given tree. These excluded points are that tree's "out-of-bag" samples. To compute the OOB score, each training point is predicted using only the trees that did not see it during training, those predictions are aggregated (by vote or averaged probabilities for classification, by mean for regression), and the aggregated predictions are scored against the true targets. In scikit-learn, the default OOB score is accuracy for classifiers and R² for regressors. The result behaves much like a built-in cross-validation estimate, which makes it a useful and efficient tool, especially when data is limited.
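
To make the OOB idea concrete, here is a hand-rolled sketch that aggregates, for every training point, the votes of only those trees that never saw it. It is an illustration under stated assumptions (Iris data, 50 trees, hard voting, variable names of my choosing), not a reproduction of scikit-learn's internal code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_samples, n_classes, n_trees = len(X), len(np.unique(y)), 50

# votes[i, c] counts how many trees that did NOT train on sample i predicted class c
votes = np.zeros((n_samples, n_classes))

for seed in range(n_trees):
    boot = rng.integers(0, n_samples, size=n_samples)      # bootstrap indices
    oob = np.setdiff1d(np.arange(n_samples), boot)         # samples this tree never saw
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X[boot], y[boot])
    votes[oob, tree.predict(X[oob])] += 1

# Score only the samples that were out-of-bag for at least one tree
covered = votes.sum(axis=1) > 0
oob_accuracy = np.mean(votes[covered].argmax(axis=1) == y[covered])
print("Hand-computed OOB accuracy:", round(float(oob_accuracy), 4))
```

With `RandomForestClassifier(oob_score=True)`, the fitted model exposes the corresponding estimate as `oob_score_`.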


5. How can you measure the importance of features in a Random Forest model?
Ans 5. The importance of features in a Random Forest model can be measured using two primary methods: **Mean Decrease in Impurity (MDI)** and **Mean Decrease in Accuracy (MDA)**. MDI, also known as Gini importance, is calculated during training and is based on how much each feature decreases the impurity (such as Gini impurity or entropy) across all trees in the forest. Each time a feature is used to split a node, the resulting decrease in impurity is recorded, and the decrease accumulated over all trees (normalized so the scores sum to 1) indicates the feature's importance. MDA, or permutation importance, instead measures the impact of each feature on predictive performance: the values of one feature are randomly shuffled and the resulting drop in the model's score is recorded. A large drop indicates an important feature. The two methods have different weaknesses. MDI is fast, but it is computed from the training data and tends to favor continuous or high-cardinality features. Permutation importance can be computed on held-out data, so it better reflects what matters for generalization, but it can understate the importance of strongly correlated features, because the model can fall back on a correlated partner when one of them is shuffled. In scikit-learn, MDI is available as the `feature_importances_` attribute and permutation importance as `sklearn.inspection.permutation_importance`.
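
A short sketch of both measures side by side, assuming the Breast Cancer dataset and a 70/30 split purely for illustration. `feature_importances_` gives the impurity-based scores, and `permutation_importance` is evaluated here on the held-out split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importance (MDI): computed from the training process
mdi = rf.feature_importances_

# Permutation importance: mean drop in test accuracy when a feature is shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)

# Show the five features ranked highest by permutation importance
top = perm.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} MDI={mdi[i]:.4f}  permutation={perm.importances_mean[i]:.4f}")
```

The two rankings need not agree; comparing them is often more informative than trusting either one alone.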


6. Explain the working principle of a Bagging Classifier.
Ans 6. A Bagging Classifier applies bagging (short for **Bootstrap Aggregating**), an ensemble technique that improves the accuracy and stability of a model by combining the predictions of multiple base learners, typically decision trees. The core idea is to create several training sets from the original data using **random sampling with replacement** (bootstrap sampling), so each model is trained on a different sample. The models are trained independently and differ from one another because of these variations in their data. At prediction time, the Bagging Classifier aggregates the outputs of all base learners by **majority voting** (the regression counterpart uses **averaging**) to produce the final prediction. This reduces **variance**, lowers the risk of overfitting, and makes the model more robust, especially with high-variance, low-bias algorithms like decision trees. A popular bagging-based method is the **Random Forest**, which adds a further layer of randomness by considering only a random subset of features at each split.
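
The aggregation step can be illustrated without any library. The sketch below assumes three already-trained models whose predictions are given as plain lists; the function name `majority_vote` and the toy labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority vote."""
    # zip(*...) regroups the predictions so each tuple holds all votes for one sample
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions_per_model)]

# Predictions from three hypothetical base learners for three samples
model_outputs = [
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
]

final = majority_vote(model_outputs)
print(final)  # ['cat', 'cat', 'dog']
```

Note that scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides `predict_proba` and falls back to voting otherwise, so hard voting as shown here is the simpler of the two variants.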


7. How do you evaluate a Bagging Classifier’s performance?
Ans 7. To evaluate the performance of a **Bagging Classifier**, several key metrics and validation strategies are used. Firstly, you assess its accuracy by using a **confusion matrix** that gives insight into true positives, true negatives, false positives, and false negatives, which helps in calculating accuracy, precision, recall, and F1-score. These metrics are especially important when dealing with imbalanced datasets. Next, **cross-validation**—such as k-fold cross-validation—is commonly used to ensure the model's performance is not dependent on a particular train-test split and to check its generalizability. Additionally, **ROC-AUC score** is useful for evaluating the classifier’s ability to distinguish between classes, particularly in binary classification tasks. You can also compare the performance of the Bagging Classifier to its base estimator (e.g., decision tree) to verify whether bagging improved the performance by reducing variance. Lastly, evaluating metrics like **training time**, **prediction time**, and **overfitting tendencies** can provide a more comprehensive picture of the model’s efficiency and robustness.
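
A minimal sketch of the evaluation steps listed above, assuming the Breast Cancer dataset, a stratified 70/30 split, and 5-fold cross-validation as illustrative choices. The base estimator is passed positionally so the call does not depend on the keyword name used by a given scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = BaggingClassifier(DecisionTreeClassifier(random_state=42), n_estimators=50, random_state=42)

# k-fold cross-validation on the training data checks stability across splits
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy: %.4f +/- %.4f" % (cv_scores.mean(), cv_scores.std()))

# Held-out evaluation: confusion matrix, precision/recall/F1, and ROC-AUC
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(cm)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", round(auc, 4))
```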


8. How does a Bagging Regressor work?
Ans 8. A Bagging Regressor works by combining the predictions of multiple individual regression models to improve overall accuracy and reduce overfitting. The term "Bagging" stands for *Bootstrap Aggregating*: each base estimator (often a decision tree) is trained on a different random sample of the training data created by bootstrapping (sampling with replacement). Each model learns independently from its own sample, and at prediction time their outputs are averaged to produce the final result. This ensemble approach reduces variance, making the model more robust and stable than a single estimator, particularly in the presence of noise or small fluctuations in the data.
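
The mechanism can be written out by hand in a few lines. The following is a sketch under assumptions of my choosing (the bundled Diabetes dataset, 50 fully grown trees), not a replacement for `BaggingRegressor`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_models = 50
all_preds = []

for seed in range(n_models):
    # Bootstrap: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=seed).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)

# Aggregate: the bagged prediction is the mean over all trees
bagged_pred = all_preds.mean(axis=0)

mean_individual_mse = np.mean([mean_squared_error(y_test, p) for p in all_preds])
bagged_mse = mean_squared_error(y_test, bagged_pred)
print("Average MSE of individual trees:", round(float(mean_individual_mse), 2))
print("MSE of the bagged average:      ", round(float(bagged_mse), 2))
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual trees, which is the variance reduction at work.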


9. What is the main advantage of ensemble techniques?
Ans 9. The main advantage of ensemble techniques lies in their ability to significantly improve predictive performance by combining multiple models to make a final decision. Unlike individual models that may suffer from high variance, bias, or overfitting, ensemble methods like bagging, boosting, and stacking leverage the strengths of diverse learners to reduce errors and enhance generalization. By aggregating the outcomes of several weak or strong learners, ensembles produce more robust, accurate, and stable predictions, making them highly effective for complex and high-dimensional datasets.


10. What is the main challenge of ensemble methods?
Ans 10. The main challenge of ensemble methods lies in balancing **model complexity and computational cost** while ensuring that the combined models offer meaningful improvements over individual models. Ensemble techniques, such as bagging, boosting, and stacking, rely on generating multiple base learners and aggregating their predictions to enhance performance. However, this often results in increased **training time, memory usage, and reduced interpretability**, making them less suitable for real-time or resource-constrained applications. Additionally, if the base models are not sufficiently diverse or if the ensemble is poorly constructed, the method may suffer from **overfitting** or provide only marginal gains. Therefore, designing effective ensembles requires careful selection of algorithms, hyperparameters, and validation strategies to truly benefit from their collective strength.


11. Explain the key idea behind ensemble techniques.
Ans 11. The key idea behind ensemble techniques is to combine the predictions of multiple individual models to create a more robust, accurate, and generalizable predictive model. Instead of relying on a single model, ensemble methods leverage the strengths and diversity of several models to reduce errors caused by bias, variance, or noise. By aggregating their outputs through techniques like bagging, boosting, or stacking, ensemble approaches often outperform individual models, especially on complex or high-dimensional datasets. This collective decision-making strategy enhances overall performance and stability, making ensemble techniques a powerful tool in both classification and regression tasks.


12. What is a Random Forest Classifier?
Ans 12. A **Random Forest Classifier** is an ensemble machine learning algorithm used for classification tasks, which operates by constructing multiple decision trees during training and outputting the class that is the mode (majority vote) of the classes predicted by the individual trees. It combines "bagging" (bootstrap aggregating) with decision trees: each tree is trained on a bootstrap sample of the data and considers only a random subset of features at each split, which ensures diversity among the trees. This reduces the risk of overfitting and improves generalization compared to a single decision tree. Random Forest is known for its robustness, strong accuracy with little tuning, and relative insensitivity to outliers and feature scaling. In scikit-learn, categorical features must be encoded numerically before training, and missing values generally need to be imputed, since native missing-value support in the forest estimators is available only in recent releases.


13. What are the main types of ensemble techniques?
Ans 13. Ensemble techniques in machine learning are methods that combine multiple models to improve overall performance, robustness, and generalization. The main types of ensemble techniques include **Bagging (Bootstrap Aggregating)**, **Boosting**, and **Stacking**. **Bagging** involves training multiple instances of the same model on different random subsets of the data (e.g., Random Forest), which helps reduce variance and prevent overfitting. **Boosting**, on the other hand, builds models sequentially, where each new model tries to correct the errors made by the previous ones (e.g., AdaBoost, Gradient Boosting, XGBoost), making it effective at reducing bias. **Stacking** combines predictions from several different types of models using a meta-learner, which learns how best to combine the base models' outputs. These ensemble techniques leverage the strengths of individual models and compensate for their weaknesses, which often leads to better predictive performance than any single model.


14. What is ensemble learning in machine learning?
Ans 14. Ensemble learning in machine learning is a technique that combines the predictions of multiple models to produce a more accurate and robust output than any single model alone. By leveraging the strengths and compensating for the weaknesses of individual models, ensemble methods aim to reduce variance, reduce bias, or both. There are various ensemble strategies such as bagging (e.g., Random Forest), boosting (e.g., AdaBoost, XGBoost), and stacking, each with a different approach to model combination. These methods are particularly useful in complex tasks where a single model might struggle to generalize well, making ensemble learning a powerful tool in both classification and regression problems.


15. When should we avoid using ensemble methods?
Ans 15. Ensemble methods should be avoided in situations where model interpretability is crucial, such as in regulatory environments or clinical decision-making, where understanding the reasoning behind a prediction is essential. They are also less suitable when working with small datasets, as complex ensemble models can easily overfit the data, reducing generalizability. Additionally, if computational resources or time are limited, ensemble methods—especially those like boosting or bagging with many base learners—can be inefficient due to their high processing demands. Lastly, if a single, simpler model performs sufficiently well, adding ensemble complexity may yield only marginal improvements that do not justify the added effort and resource cost.


16. How does Bagging help in reducing overfitting?
Ans 16. Bagging, or Bootstrap Aggregating, helps reduce overfitting by combining the predictions of multiple base models trained on different random subsets of the training data. Each model is trained on a bootstrap sample—created by sampling with replacement—which introduces diversity among the models. While a single model, especially a high-variance one like a decision tree, may overfit the training data, averaging the predictions (in regression) or taking a majority vote (in classification) across many such models reduces variance without significantly increasing bias. This ensemble approach smooths out the noise and idiosyncrasies of individual learners, resulting in a more generalized model that performs better on unseen data.
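
The variance argument can be made concrete with a standard result. As a simplifying assumption not stated in the answer above, suppose the predictions $\hat{f}_1(x), \dots, \hat{f}_B(x)$ of the $B$ base models are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \frac{1}{B^2}\Bigl(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\Bigr)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As $B$ grows, the second term vanishes, so the remaining variance is governed by how correlated the models are. This is why the diversity introduced by bootstrap sampling matters: the less correlated the base models, the more averaging helps.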


17. Why is Random Forest better than a single Decision Tree?
Ans 17. Random Forest is generally considered better than a single Decision Tree because it combines the predictions of multiple decision trees to improve overall performance, reduce overfitting, and increase accuracy. While a single decision tree can be highly sensitive to noise and small changes in the training data—leading to high variance—Random Forest mitigates this by training multiple trees on different subsets of the data and features, then averaging their results (for regression) or using majority voting (for classification). This ensemble approach reduces the risk of overfitting and enhances generalization to unseen data, making Random Forest more robust, stable, and accurate than a single decision tree in most practical scenarios.


18. What is the role of bootstrap sampling in Bagging?
Ans 18. Bootstrap sampling plays a crucial role in Bagging (Bootstrap Aggregating) by introducing randomness and diversity into the training process, which helps to reduce overfitting and improve model stability. In this approach, multiple subsets of the original training data are created by randomly sampling with replacement. This means that each new dataset may contain duplicate instances and omit others from the original data. These bootstrapped datasets are then used to train multiple base models (typically decision trees). The ensemble of these models combines their predictions—through averaging for regression or majority voting for classification—leading to a more robust and accurate final prediction. By leveraging bootstrap sampling, Bagging ensures that each base model learns slightly different patterns, reducing variance and enhancing generalization.
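
The "duplicates and omissions" property of a bootstrap sample is easy to check numerically. The sketch below assumes a dataset of 10,000 rows identified only by index; for large n, the expected fraction of distinct rows in a bootstrap sample is 1 - (1 - 1/n)^n, which approaches 1 - 1/e, or about 0.632.

```python
import random

rng = random.Random(42)
n = 10_000

# Draw n row indices with replacement, as bagging does for each base model
bootstrap = [rng.randrange(n) for _ in range(n)]

unique_fraction = len(set(bootstrap)) / n
left_out_fraction = 1 - unique_fraction

print(f"Distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Rows left out (out-of-bag):            {left_out_fraction:.3f}")
```

Each base model therefore sees roughly 63% of the distinct training rows, and the remaining rows are what the out-of-bag estimate is built from.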


19. What are some real-world applications of ensemble techniques?
Ans 19. Ensemble techniques are widely applied in real-world scenarios where high accuracy and robust performance are critical. In finance, they are used for credit scoring, stock market prediction, and fraud detection by combining multiple models to reduce risk and improve decision-making. In healthcare, ensemble methods aid in disease diagnosis and prognosis prediction by integrating various clinical models to enhance diagnostic accuracy. In e-commerce and marketing, ensemble techniques improve customer segmentation, recommendation systems, and churn prediction by aggregating diverse predictive models. Additionally, in cybersecurity, they help in detecting anomalies and preventing attacks by combining classifiers to filter threats more reliably. Overall, ensemble methods offer a powerful approach to tackling complex, data-driven problems across numerous industries.


20. What is the difference between Bagging and Boosting?
Ans 20. Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques used to improve the accuracy and robustness of machine learning models, but they differ fundamentally in how they build and combine multiple models. Bagging works by creating multiple independent models in parallel using different subsets of the training data (obtained through random sampling with replacement), and then aggregates their predictions, typically through majority voting (for classification) or averaging (for regression), to reduce variance and prevent overfitting. In contrast, Boosting builds models sequentially, where each new model tries to correct the errors made by the previous ones, focusing more on the misclassified or poorly predicted instances. This sequential approach reduces bias and can lead to higher accuracy, but it also makes Boosting more sensitive to noise and prone to overfitting if not carefully tuned.
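
A small side-by-side sketch of the two approaches, assuming the Breast Cancer dataset, 50 estimators each, and 5-fold cross-validation as illustrative settings. `GradientBoostingClassifier` stands in for boosting here; AdaBoost or XGBoost would serve equally well.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fully grown trees trained independently on bootstrap samples, then combined
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0)

# Boosting: shallow trees added one after another, each fitted to the remaining errors
boosting = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)

bagging_acc = cross_val_score(bagging, X, y, cv=5).mean()
boosting_acc = cross_val_score(boosting, X, y, cv=5).mean()

print(f"Bagging  (parallel, variance reduction): {bagging_acc:.4f}")
print(f"Boosting (sequential, bias reduction):   {boosting_acc:.4f}")
```

Which of the two scores higher depends on the dataset and tuning, so the numbers are best read as a comparison of mechanics rather than a verdict.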


Practical Questions

In [None]:
#21.  Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base model (Decision Tree)
base_model = DecisionTreeClassifier(random_state=42)

# Create Bagging Classifier (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=50, random_state=42)

# Train the model
bagging_model.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Bagging Classifier Accuracy:", accuracy)


In [None]:
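#22. Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)
from sklearn.datasets import load_diabetes  # load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load data, hold out a test set, then bag 50 decision trees and report test MSE
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42), n_estimators=50, random_state=42)
bagging_reg.fit(X_train, y_train)
print("Bagging Regressor MSE:", mean_squared_error(y_test, bagging_reg.predict(X_test)))
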
In [None]:
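#23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
for name, score in zip(data.feature_names, rf_model.feature_importances_):
    print(f"{name}: {score:.4f}")
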
In [None]:
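#24. Train a Random Forest Regressor and compare its performance with a single Decision Tree
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for name, model in [("Decision Tree", DecisionTreeRegressor(random_state=42)),
                    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42))]:
    mse = mean_squared_error(y_test, model.fit(X_train, y_train).predict(X_test))
    print(f"{name} MSE: {mse:.4f}")
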
In [None]:
#25. Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split the dataset (not necessary for OOB, but useful for comparison)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest with OOB scoring enabled
rf_model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_model.fit(X_train, y_train)

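# Print the Out-of-Bag score next to held-out test accuracy for comparison
print("Random Forest OOB Score:", rf_model.oob_score_)
print("Random Forest Test Accuracy:", rf_model.score(X_test, y_test))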


In [None]:
#26. Train a Bagging Classifier using SVM as a base estimator and print accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base model (SVM)
base_model = SVC(kernel='rbf', probability=True, random_state=42)

# Create Bagging Classifier using SVM as base estimator
# (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=30, random_state=42)

# Train the model
bagging_model.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Bagging Classifier (SVM) Accuracy:", accuracy)


In [None]:
#27. Train a Random Forest Classifier with different numbers of trees and compare accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try different numbers of trees
tree_counts = [10, 50, 100, 150, 200]
print("Random Forest Accuracy with Different Numbers of Trees:\n")

for n in tree_counts:
    # Initialize Random Forest with n trees
    model = RandomForestClassifier(n_estimators=n, random_state=42)

    # Train the model
    model.fit(X_train, y_train)

    # Predict on test set
    y_pred = model.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{n} Trees: Accuracy = {accuracy:.4f}")


In [None]:
#28.  Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create base model (Logistic Regression)
base_model = LogisticRegression(max_iter=1000, solver='liblinear')

# Create Bagging Classifier using Logistic Regression
# (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=50, random_state=42)

# Train the model
bagging_model.fit(X_train, y_train)

# Predict probabilities for AUC calculation
y_proba = bagging_model.predict_proba(X_test)[:, 1]

# Calculate and print AUC score
auc = roc_auc_score(y_test, y_proba)
print("Bagging Classifier (Logistic Regression) AUC Score:", auc)


In [None]:
#29.  Train a Random Forest Regressor and analyze feature importance scores
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import matplotlib.pyplot as plt

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Get feature importance scores
importances = rf_regressor.feature_importances_
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance Score': importances
}).sort_values(by='Importance Score', ascending=False)

# Display the feature importance scores
print("\nFeature Importance Scores:\n")
print(importance_df)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance Score'], color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel("Importance Score")
plt.title("Feature Importances in Random Forest Regressor")
plt.tight_layout()
plt.show()


In [None]:
#30. Train an ensemble model using both Bagging and Random Forest and compare accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Bagging Classifier using Decision Trees (`estimator` replaces `base_estimator` since scikit-learn 1.2)
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)

# 2. Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_preds)

# Print comparison
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")
print(f"Random Forest Accuracy:     {rf_accuracy:.4f}")


In [None]:
#31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predict on test set
y_pred = best_rf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Random Forest Accuracy after Hyperparameter Tuning:", round(accuracy, 4))


In [None]:
#32. Train a Bagging Regressor with different numbers of base estimators and compare performance
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Estimators to try
estimator_counts = [10, 50, 100, 150, 200]
mse_scores = []

print("Bagging Regressor MSE with Different Numbers of Base Estimators:\n")

for n in estimator_counts:
    # `estimator` replaces `base_estimator` since scikit-learn 1.2
    model = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"{n} Estimators: MSE = {mse:.4f}")

# Plot the comparison
plt.figure(figsize=(8, 5))
plt.plot(estimator_counts, mse_scores, marker='o', color='teal')
plt.title("Bagging Regressor Performance vs Number of Estimators")
plt.xlabel("Number of Estimators")
plt.ylabel("Mean Squared Error (MSE)")
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
#33. Train a Random Forest Classifier and analyze misclassified samples
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred = rf_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", round(accuracy, 4))

# Analyze misclassified samples
misclassified_indices = [i for i, (true, pred) in enumerate(zip(y_test, y_pred)) if true != pred]

# Create DataFrame for misclassified samples
misclassified_df = pd.DataFrame(X_test[misclassified_indices], columns=feature_names)
misclassified_df['True Label'] = [target_names[i] for i in y_test[misclassified_indices]]
misclassified_df['Predicted Label'] = [target_names[i] for i in y_pred[misclassified_indices]]

print("\nMisclassified Samples:\n")
print(misclassified_df)


In [None]:
#34. Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Train a single Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_preds = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_preds)

# 2. Train a Bagging Classifier using Decision Trees (`estimator` replaces `base_estimator` since scikit-learn 1.2)
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)

# Print accuracy comparison
print("Single Decision Tree Accuracy:", round(dt_accuracy, 4))
print("Bagging Classifier Accuracy:  ", round(bagging_accuracy, 4))


In [None]:
#35. Train a Random Forest Classifier and visualize the confusion matrix
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
import matplotlib.pyplot as plt

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
class_names = data.target_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on test set
y_pred = rf_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", round(accuracy, 4))

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap='Blues', values_format='d')
plt.title("Confusion Matrix - Random Forest Classifier")
plt.tight_layout()
plt.show()


In [None]:
#36. Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
estimators = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, kernel='rbf', random_state=42))
]

# Define stacking classifier with Logistic Regression as final estimator
stacking_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)

# Train stacking classifier
stacking_model.fit(X_train, y_train)
y_pred_stack = stacking_model.predict(X_test)
stacking_accuracy = accuracy_score(y_test, y_pred_stack)

# Train and evaluate base models individually
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_accuracy = accuracy_score(y_test, dt_model.predict(X_test))

svm_model = SVC(kernel='rbf', probability=True, random_state=42)
svm_model.fit(X_train, y_train)
svm_accuracy = accuracy_score(y_test, svm_model.predict(X_test))

# Train and evaluate final estimator alone (Logistic Regression)
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
lr_accuracy = accuracy_score(y_test, lr_model.predict(X_test))

# Print accuracy comparison
print("Accuracy Comparison:\n")
print(f"Decision Tree Accuracy:       {dt_accuracy:.4f}")
print(f"SVM Accuracy:                 {svm_accuracy:.4f}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")
print(f"Stacking Classifier Accuracy: {stacking_accuracy:.4f}")


In [None]:
#37. Train a Random Forest Classifier and print the top 5 most important features
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Create DataFrame of features and their importance scores
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print top 5 most important features
print("Top 5 Most Important Features:\n")
print(feature_importance_df.head(5))
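
The `feature_importances_` attribute used above is impurity-based and is computed from the training data. As a cross-check, the sketch below ranks the same features by permutation importance on the held-out test set, using `sklearn.inspection.permutation_importance`. The choice of `n_repeats=5` and the column names are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature in the test set and measure the drop in accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)

perm_df = pd.DataFrame({
    "Feature": data.feature_names,
    "Impurity importance": rf.feature_importances_,
    "Permutation importance": perm.importances_mean,
}).sort_values(by="Permutation importance", ascending=False)

top5_perm = perm_df.head(5)
print(top5_perm)
```

The two rankings need not agree. When several features are strongly correlated, permuting one of them may barely change accuracy even if the impurity-based score for it is high.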


In [None]:
#38. Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Bagging Classifier using Decision Trees
# (scikit-learn >= 1.2 names this parameter `estimator`; `base_estimator` was removed in 1.4)
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)

# Predict on test set
y_pred = bagging_model.predict(X_test)

# Evaluate performance (binary default: the positive class is label 1, which is 'benign' in this dataset)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics
print("Bagging Classifier Performance Metrics:\n")
print(f"Precision:  {precision:.4f}")
print(f"Recall:     {recall:.4f}")
print(f"F1-Score:   {f1:.4f}")

# Optional: detailed class-wise report
print("\nDetailed Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
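
To make the three metrics above concrete, the sketch below recomputes them by hand from the confusion-matrix counts and checks the results against scikit-learn. The ten labels are made-up toy values chosen for illustration, not output from the bagging model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical toy labels (not taken from the breast cancer data)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

# For binary labels {0, 1}, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

manual_precision = tp / (tp + fp)  # share of predicted positives that are correct
manual_recall = tp / (tp + fn)     # share of actual positives that were found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision: {manual_precision:.4f} (sklearn: {precision_score(y_true, y_hat):.4f})")
print(f"Recall:    {manual_recall:.4f} (sklearn: {recall_score(y_true, y_hat):.4f})")
print(f"F1-Score:  {manual_f1:.4f} (sklearn: {f1_score(y_true, y_hat):.4f})")
```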


In [None]:
#39. Train a Random Forest Classifier and analyze the effect of max_depth on accuracy
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try different max_depth values
depth_values = [1, 2, 3, 5, 7, 10, None]
accuracies = []

print("Random Forest Accuracy with Different max_depth Values:\n")

for depth in depth_values:
    rf_model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    rf_model.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"max_depth = {depth}: Accuracy = {acc:.4f}")

# Plotting the results
plt.figure(figsize=(8, 5))
depth_labels = ['None' if d is None else str(d) for d in depth_values]
plt.plot(depth_labels, accuracies, marker='o', linestyle='-', color='green')
plt.title("Effect of max_depth on Random Forest Accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
#40. Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
dt_base = DecisionTreeRegressor()
knn_base = KNeighborsRegressor()

# Create Bagging Regressors with different base estimators
# (scikit-learn >= 1.2 names this parameter `estimator`; `base_estimator` was removed in 1.4)
bagging_dt = BaggingRegressor(estimator=dt_base, n_estimators=50, random_state=42)
bagging_knn = BaggingRegressor(estimator=knn_base, n_estimators=50, random_state=42)

# Train models
bagging_dt.fit(X_train, y_train)
bagging_knn.fit(X_train, y_train)

# Predict
y_pred_dt = bagging_dt.predict(X_test)
y_pred_knn = bagging_knn.predict(X_test)

# Evaluate using MSE
mse_dt = mean_squared_error(y_test, y_pred_dt)
mse_knn = mean_squared_error(y_test, y_pred_knn)

# Print comparison
print("Performance Comparison of Bagging Regressors:\n")
print(f"Bagging with Decision Tree Regressor:   MSE = {mse_dt:.4f}")
print(f"Bagging with KNeighbors Regressor:      MSE = {mse_knn:.4f}")


In [None]:
#41. Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the binary classification dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict probabilities for positive class
y_proba = rf_model.predict_proba(X_test)[:, 1]  # probabilities for class 1

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)

# Print the result
print(f"Random Forest Classifier ROC-AUC Score: {roc_auc:.4f}")
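
The ROC-AUC score above is the area under the curve of true positive rate against false positive rate as the decision threshold varies. The sketch below rebuilds that curve with `roc_curve` and integrates it with `auc` to confirm that both routes give the same number. Variable names such as `area_from_curve` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_proba = rf_model.predict_proba(X_test)[:, 1]  # probabilities for class 1

# One (FPR, TPR) point per retained threshold
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Trapezoidal area under the (FPR, TPR) points
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_test, y_proba)

print(f"AUC from roc_curve + auc: {area_from_curve:.4f}")
print(f"AUC from roc_auc_score:   {area_direct:.4f}")
print("Match:", np.isclose(area_from_curve, area_direct))
```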


In [None]:
#42. Train a Bagging Classifier and evaluate its performance using cross-validation
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Define the Bagging Classifier with Decision Tree as base estimator
# (scikit-learn >= 1.2 names this parameter `estimator`; `base_estimator` was removed in 1.4)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)

# Perform 5-fold cross-validation using accuracy
cv_scores = cross_val_score(bagging_model, X, y, cv=5, scoring='accuracy')

# Print results
print("Bagging Classifier Cross-Validation Accuracy Scores:", np.round(cv_scores, 4))
print("Mean Accuracy:", round(np.mean(cv_scores), 4))
print("Standard Deviation:", round(np.std(cv_scores), 4))


In [None]:
#43. Train a Random Forest Classifier and plot the Precision-Recall curve
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_scores = rf_model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)

# Plot the precision-recall curve
plt.figure(figsize=(8, 5))
plt.plot(recall, precision, color='blue', label=f'AP = {avg_precision:.4f}')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve - Random Forest Classifier")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
#44. Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000))
]

# Define stacking classifier with Logistic Regression as final estimator
stacking_model = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)

# Train and evaluate stacking classifier
stacking_model.fit(X_train, y_train)
y_pred_stack = stacking_model.predict(X_test)
acc_stack = accuracy_score(y_test, y_pred_stack)

# Train and evaluate Random Forest alone
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
acc_rf = accuracy_score(y_test, rf_model.predict(X_test))

# Train and evaluate Logistic Regression alone
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
acc_lr = accuracy_score(y_test, lr_model.predict(X_test))

# Print accuracy comparison
print("Accuracy Comparison:\n")
print(f"Random Forest Accuracy:       {acc_rf:.4f}")
print(f"Logistic Regression Accuracy: {acc_lr:.4f}")
print(f"Stacking Classifier Accuracy: {acc_stack:.4f}")


In [None]:
#45. Train a Bagging Regressor with different levels of bootstrap samples and compare performance.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Load the dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define different max_samples fractions (share of the training set drawn, with replacement, for each estimator)
sample_sizes = [0.4, 0.6, 0.8, 1.0]
mse_scores = []

print("Bagging Regressor Performance with Varying Bootstrap Sample Sizes:\n")

for sample in sample_sizes:
    model = BaggingRegressor(
        estimator=DecisionTreeRegressor(),  # named `base_estimator` before scikit-learn 1.2
        n_estimators=50,
        max_samples=sample,
        bootstrap=True,
        random_state=42
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_samples = {sample:.1f}: MSE = {mse:.4f}")

# Plot the results
plt.figure(figsize=(8, 5))
plt.plot([int(s*100) for s in sample_sizes], mse_scores, marker='o', color='navy')
plt.title("Effect of Bootstrap Sample Size on Bagging Regressor Performance")
plt.xlabel("Bootstrap Sample Size (% of training data)")
plt.ylabel("Mean Squared Error")
plt.grid(True)
plt.tight_layout()
plt.show()
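
Because each bootstrap sample leaves some training rows out, a bagging ensemble can also be scored on those out-of-bag rows without a separate validation set. The sketch below sets `oob_score=True` for the same range of `max_samples` values. It uses `load_diabetes` instead of the California housing data only so that it runs without a download, which is an assumption of this example. The `oob_score_` attribute of `BaggingRegressor` is an R² value, not an MSE.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

results = []
for sample in [0.4, 0.6, 0.8, 1.0]:
    model = BaggingRegressor(
        DecisionTreeRegressor(),  # base learner passed positionally (works across versions)
        n_estimators=50,
        max_samples=sample,
        bootstrap=True,
        oob_score=True,           # score each row using only trees that did not see it
        random_state=42,
    )
    model.fit(X_train, y_train)
    oob_r2 = model.oob_score_
    test_r2 = model.score(X_test, y_test)
    results.append((sample, oob_r2, test_r2))
    print(f"max_samples = {sample:.1f}: OOB R^2 = {oob_r2:.4f}, test R^2 = {test_r2:.4f}")
```

If the out-of-bag and test scores track each other across sample sizes, the out-of-bag estimate can serve as a cheap proxy when tuning `max_samples`.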
