# Theoretical

1. Can we use Bagging for regression problems?
- Yes, Bagging can be used for regression problems, for example with scikit-learn's BaggingRegressor. It combines predictions from multiple regressors by averaging their outputs, which reduces variance and usually improves performance.


2. What is the difference between multiple model training and single model training?
- Single model training: Trains one model on the entire dataset.

- Multiple model training: Trains several models on different subsets (or variations) of the dataset and combines their outputs (ensemble), leading to better generalization and robustness.


3. Explain the concept of feature randomness in Random Forest.
- In Random Forest, at each split in a decision tree, only a random subset of features is considered. This promotes diversity among trees and reduces correlation, leading to a more generalized ensemble.
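
A minimal sketch of the idea, assuming the common "square root of the number of features" rule for the subset size. The helper name `candidate_features` is hypothetical and only illustrates how each split draws its own feature subset; it is not how any particular library implements it.

```python
import math
import random


def candidate_features(n_features, rng, max_features="sqrt"):
    """Return the random subset of feature indices examined at one split."""
    if max_features == "sqrt":
        k = max(1, int(math.sqrt(n_features)))  # assumed default rule
    else:
        k = int(max_features)
    # Sample without replacement: a feature appears at most once per split
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
# Each split draws a fresh subset, so different nodes and trees see different features
for split_id in range(3):
    print(f"split {split_id}: candidate features {candidate_features(30, rng)}")
```

With 30 features, each split considers only 5 of them, so a single dominant feature cannot drive the top split of every tree.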


4. What is OOB (Out-of-Bag) Score?
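- OOB score is an internal validation estimate used in Bagging/Random Forest. Each training sample is predicted using only the trees whose bootstrap sample did not include it (its out-of-bag trees), and the score is computed from those predictions. It works much like cross-validation, but without a separate validation set or extra training runs.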
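
To see why OOB data is always available, the following sketch simulates bootstrap sampling and measures how many samples each draw leaves out. The function name `oob_fraction` and the sizes used are illustrative assumptions.

```python
import random


def oob_fraction(n_samples, rng):
    """Fraction of samples left out of one bootstrap sample of size n_samples."""
    # Draw n_samples indices with replacement and keep the distinct ones
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples


rng = random.Random(42)
fractions = [oob_fraction(1000, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)

# Theory: each sample is missed with probability (1 - 1/n)^n, roughly 0.368 for large n
print(f"Average out-of-bag fraction: {mean_oob:.3f}")
```

So each tree has roughly a third of the training set available as unseen data for evaluation.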


5. How can you measure the importance of features in a Random Forest model?
- Feature importance in Random Forest can be measured by:

  - Gini Importance (mean decrease in impurity): Total decrease in impurity contributed by a feature's splits, averaged over all trees.

  - Permutation Importance: Decrease in model performance when a feature’s values are randomly shuffled.
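
Permutation importance can be sketched in plain Python as below. The toy data, the `predict` stand-in for a fitted model, and the helper names are all hypothetical; in scikit-learn the corresponding utility is `sklearn.inspection.permutation_importance`.

```python
import random


def accuracy(predict, X, y):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, feature, rng, n_repeats=10):
    """Mean drop in accuracy after shuffling one feature column."""
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the link between this feature and the label
        X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats


# Toy data (hypothetical): the label depends only on feature 0; feature 1 is noise
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def predict(row):
    """Stand-in for a fitted model that uses only feature 0."""
    return int(row[0] > 0.5)


imp0 = permutation_importance(predict, X, y, 0, rng)
imp1 = permutation_importance(predict, X, y, 1, rng)
print(f"Feature 0 importance: {imp0:.3f}")
print(f"Feature 1 importance: {imp1:.3f}")
```
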

6. Explain the working principle of a Bagging Classifier.
- Bootstrap sampling: Random subsets with replacement.
- Train multiple base classifiers (e.g., Decision Trees) on each subset.
- Aggregate predictions by majority vote (for classification).
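
The three steps above can be sketched without any library. This is an illustration under simplifying assumptions: one numeric feature, noise-free labels, and a threshold "stump" as the base classifier. All function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-feature threshold classifier: predict 1 when x > threshold."""
    best_threshold, best_correct = None, -1
    for t in sorted(set(X)):
        correct = sum((x > t) == bool(label) for x, label in zip(X, y))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


def fit_bagging(X, y, n_estimators, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagging(stumps, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
X = [rng.uniform(0, 10) for _ in range(200)]
y = [int(x > 5) for x in X]  # toy labels: class 1 when x > 5

stumps = fit_bagging(X, y, n_estimators=25, rng=rng)  # odd count avoids tied votes
print("Prediction for x=1.0:", predict_bagging(stumps, 1.0))
print("Prediction for x=9.0:", predict_bagging(stumps, 9.0))
```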

7. How do you evaluate a Bagging Classifier’s performance?
- Use metrics like:
  - Accuracy, Precision, Recall, F1-Score (for classification)
  - OOB Score as a quick internal validation
  - Cross-validation for robustness

8. How does a Bagging Regressor work?
- Trains multiple regressors on different bootstrap samples.
- Combines predictions by averaging.
- Reduces variance, usually without a large increase in bias.

9. What is the main advantage of ensemble techniques?
- They combine multiple models to improve performance, reduce overfitting, and enhance generalization compared to individual models.

10. What is the main challenge of ensemble methods?
- Complexity: Harder to interpret and maintain.
- Computational cost: Requires more memory and time to train multiple models.


11. Explain the key idea behind ensemble techniques.
- The core idea is that a group of weak learners can come together to form a strong learner by combining their predictions, thereby reducing bias and/or variance.
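
A small calculation illustrates the claim for majority voting. It assumes, purely for illustration, that the models make independent errors and that each is correct with the same probability `p`; real ensemble members are correlated, so actual gains are smaller.

```python
from math import comb


def majority_vote_accuracy(n_models, p):
    """P(majority of n independent models is correct), each correct with probability p."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Odd ensemble sizes avoid tied votes
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> ensemble accuracy {majority_vote_accuracy(n, 0.6):.4f}")
```

Under these assumptions, learners that are only slightly better than chance become a much stronger ensemble as more of them vote.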


12. What is a Random Forest Classifier?
- An ensemble model that builds multiple decision trees using bootstrap samples and random feature subsets, then predicts by majority voting.

13. What are the main types of ensemble techniques?
- Bagging (Bootstrap Aggregating)
- Boosting (e.g., AdaBoost, Gradient Boosting)
- Stacking (model of models)


14. What is ensemble learning in machine learning?
- A technique where multiple models (classifiers or regressors) are trained and their predictions combined to improve overall performance.

15. When should we avoid using ensemble methods?
- When interpretability is critical.
- When the dataset is small and simple models perform well.
- When training time/resources are limited.


16. How does Bagging help in reducing overfitting?
- Bagging reduces overfitting by:

   - Training on different subsets of data (bootstrapping),

   - Averaging predictions, which reduces variance.

17. Why is Random Forest better than a single Decision Tree?
- Random Forest reduces overfitting and improves generalization by:
  - Using multiple trees trained on bootstrapped data,
  - Introducing randomness in feature selection.

18. What is the role of bootstrap sampling in Bagging?
- Bootstrap sampling provides different training subsets for each base model, encouraging model diversity and reducing variance when aggregating.
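
The variance reduction can be made concrete with a standard textbook result, stated here as background rather than taken from the answer above. The symbols are assumptions introduced for this illustration: $B$ base models $\hat{f}_b$, each with prediction variance $\sigma^2$, and pairwise correlation $\rho$ between their predictions.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding models shrinks the second term, while the first term falls only when the models become less correlated. That is why diversity from bootstrap sampling (and from random feature subsets in Random Forest) matters.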

19. What are some real-world applications of ensemble techniques?
- Fraud detection
- Medical diagnosis
- Credit scoring
- Spam filtering
- Recommendation systems

20. What is the difference between Bagging and Boosting?
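- Bagging:
  - Goal: reduce variance.
  - Training: parallel (models are trained independently of each other).
  - Sample weights: uniform (every sample is equally likely to be drawn).
- Boosting:
  - Goal: reduce bias.
  - Training: sequential (each model tries to correct the errors of the previous ones).
  - Sample weights: adjusted after each round (e.g., AdaBoost up-weights misclassified samples).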
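
As an illustration of "adjusted after each round", here is a sketch of the sample-weight update in discrete AdaBoost, one specific boosting algorithm. The function name and the toy weights are hypothetical, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, correct):
    """One AdaBoost round: return the model weight alpha and the new sample weights."""
    # Weighted error of the current weak learner (assumed to satisfy 0 < err < 1)
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correctly classified samples, grow the misclassified ones
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(new)
    return alpha, [w / total for w in new]


weights = [0.2] * 5                         # uniform starting weights
correct = [True, True, True, True, False]   # the last sample was misclassified
alpha, new_weights = adaboost_reweight(weights, correct)

print(f"alpha = {alpha:.3f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update, the misclassified sample carries half of the total weight, so the next learner is pushed to get it right. Bagging, by contrast, never changes sample weights between models.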


# Practical

21. Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy.

In [None]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Bagging Classifier with Decision Tree as base estimator
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)

# Train the model
bagging_clf.fit(X_train, y_train)

# Make predictions
y_pred = bagging_clf.predict(X_test)

# Evaluate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier Accuracy: {accuracy:.2f}")


22. Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE).

In [None]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Bagging Regressor with Decision Tree as base estimator
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # named base_estimator before scikit-learn 1.2
    n_estimators=50,
    random_state=42
)

# Train the model
bagging_reg.fit(X_train, y_train)

# Make predictions
y_pred = bagging_reg.predict(X_test)

# Evaluate and print Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Bagging Regressor Mean Squared Error: {mse:.2f}")


23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Get feature importance scores
importances = rf_clf.feature_importances_

# Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance Score': importances
}).sort_values(by='Importance Score', ascending=False)

# Print feature importance scores
print("Feature Importance Scores (Descending Order):\n")
print(feature_importance_df.to_string(index=False))



24. Train a Random Forest Regressor and compare its performance with a single Decision Tree.

In [None]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a single Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)
dt_preds = dt_regressor.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_preds)

# Train a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)
rf_preds = rf_regressor.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_preds)

# Print the comparison results
print(f"Decision Tree Regressor MSE: {dt_mse:.2f}")
print(f"Random Forest Regressor MSE: {rf_mse:.2f}")


25. Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier.

In [None]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and test sets (although OOB score uses only training data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest with OOB enabled
rf_clf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,         # Enable Out-of-Bag scoring
    bootstrap=True,         # Required for OOB to work
    random_state=42
)
rf_clf.fit(X_train, y_train)

# Print the OOB Score
print(f"Out-of-Bag (OOB) Score: {rf_clf.oob_score_:.4f}")


26. Train a Bagging Classifier using SVM as a base estimator and print accuracy.

In [None]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Bagging Classifier using SVM as base estimator
bagging_svm = BaggingClassifier(
    estimator=SVC(),  # named base_estimator before scikit-learn 1.2
    n_estimators=10,
    random_state=42
)

# Train the model
bagging_svm.fit(X_train, y_train)

# Predict on the test set
y_pred = bagging_svm.predict(X_test)

# Evaluate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier with SVM Accuracy: {accuracy:.2f}")


27. Train a Random Forest Classifier with different numbers of trees and compare accuracy.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try different numbers of trees
tree_counts = [1, 5, 10, 50, 100, 200]
print("Random Forest Accuracy with Different Numbers of Trees:\n")
print(f"{'Trees':<10} {'Accuracy'}")
print("-" * 25)

# Loop through different tree counts and evaluate accuracy
for n in tree_counts:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{n:<10} {acc:.4f}")


28. Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target  # Binary labels: 0 or 1

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Bagging Classifier with Logistic Regression as base estimator
bagging_lr = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000, solver='liblinear'),  # Ensure convergence; parameter was base_estimator before scikit-learn 1.2
    n_estimators=10,
    random_state=42
)

# Train the model
bagging_lr.fit(X_train, y_train)

# Predict probabilities for the positive class
y_proba = bagging_lr.predict_proba(X_test)[:, 1]

# Compute AUC score
auc = roc_auc_score(y_test, y_proba)
print(f"Bagging Classifier with Logistic Regression AUC Score: {auc:.4f}")


29. Train a Random Forest Regressor and analyze feature importance scores.

In [None]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd

# Load California Housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Get feature importances
importances = rf_regressor.feature_importances_

# Create a DataFrame for better readability
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Display feature importance scores
print("Random Forest Regressor - Feature Importance Scores:\n")
print(feature_importance_df.to_string(index=False))


30. Train an ensemble model using both Bagging and Random Forest and compare accuracy.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Bagging Classifier with Decision Tree as base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_preds)

# 2. Random Forest Classifier
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
rf_acc = accuracy_score(y_test, rf_preds)

# Print accuracy comparison
print("Model Accuracy Comparison:\n")
print(f"Bagging Classifier Accuracy      : {bagging_acc:.4f}")
print(f"Random Forest Classifier Accuracy: {rf_acc:.4f}")


31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy']
}

# Initialize the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,   # Use all available cores for parallel processing
    verbose=1
)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and corresponding cross-validation accuracy
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy: {:.4f}".format(grid_search.best_score_))

# Evaluate the best model on the test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print("Test Set Accuracy: {:.4f}".format(test_accuracy))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


32. Train a Bagging Regressor with different numbers of base estimators and compare performance.

In [None]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# List of different n_estimators to evaluate
n_estimators_list = [1, 5, 10, 50, 100, 200]

print("Bagging Regressor Performance with Varying Base Estimators:\n")
print(f"{'Estimators':<12}{'MSE':>10}")
print("-" * 25)

# Loop through different values of n_estimators
for n in n_estimators_list:
    # Train Bagging Regressor
    bagging = BaggingRegressor(
        estimator=DecisionTreeRegressor(),  # named base_estimator before scikit-learn 1.2
        n_estimators=n,
        random_state=42
    )
    bagging.fit(X_train, y_train)

    # Predict and evaluate MSE
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    # Print result
    print(f"{n:<12}{mse:>10.4f}")


33. Train a Random Forest Classifier and analyze misclassified samples.

In [None]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
target_names = data.target_names

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")

# Analyze misclassified samples by comparing actual and predicted labels

# Create a DataFrame with test data and predictions
result_df = X_test.copy()
result_df['Actual'] = y_test
result_df['Predicted'] = y_pred
result_df['Correct'] = y_test == y_pred

# Filter misclassified samples
misclassified_samples = result_df[~result_df['Correct']]

# Display misclassified samples
print("\nMisclassified Samples:")
print(misclassified_samples[['Actual', 'Predicted'] + list(X.columns)].head())


34. Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Train a single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_preds = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_preds)

# 2. Train a Bagging Classifier with Decision Trees
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)

# Print accuracy comparison
print("Performance Comparison:")
print(f"Single Decision Tree Accuracy : {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy   : {bagging_accuracy:.4f}")


35. Train a Random Forest Classifier and visualize the confusion matrix.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
target_names = data.target_names

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on test set
y_pred = rf.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=target_names, yticklabels=target_names)
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.tight_layout()
plt.show()


36. Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators
base_estimators = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42))  # probability=True lets stacking use predict_proba; without it, decision_function is used
]

# Define meta-learner (final estimator)
meta_learner = LogisticRegression(max_iter=1000)

# Create the Stacking Classifier
stacking_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=meta_learner,
    cv=5
)

# Train stacking model
stacking_model.fit(X_train, y_train)

# Predict and evaluate
stacking_preds = stacking_model.predict(X_test)
stacking_acc = accuracy_score(y_test, stacking_preds)

# Train and evaluate individual models
dt = DecisionTreeClassifier(random_state=42)
svm = SVC(probability=True, random_state=42)
lr = LogisticRegression(max_iter=1000)

dt.fit(X_train, y_train)
svm.fit(X_train, y_train)
lr.fit(X_train, y_train)

# Accuracy for individual models
acc_dt = accuracy_score(y_test, dt.predict(X_test))
acc_svm = accuracy_score(y_test, svm.predict(X_test))
acc_lr = accuracy_score(y_test, lr.predict(X_test))

# Print results
print("Model Accuracy Comparison:")
print(f"Decision Tree Accuracy       : {acc_dt:.4f}")
print(f"SVM Accuracy                 : {acc_svm:.4f}")
print(f"Logistic Regression Accuracy: {acc_lr:.4f}")
print(f"Stacking Classifier Accuracy: {stacking_acc:.4f}")


37. Train a Random Forest Classifier and print the top 5 most important features.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_
features = X.columns

# Create DataFrame for feature importances
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances
})

# Sort by importance and display top 5
top_5_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)
print("Top 5 Most Important Features:")
print(top_5_features)


38. Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score.

In [None]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Bagging Classifier with Decision Tree base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Train the model
bagging_model.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_model.predict(X_test)

# Evaluate using precision, recall, and F1-score
print("Classification Report (Precision, Recall, F1-score):")
print(classification_report(y_test, y_pred, target_names=data.target_names))


39. Train a Random Forest Classifier and analyze the effect of max_depth on accuracy.

In [None]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate effect of max_depth
depths = list(range(1, 21))
accuracies = []

for depth in depths:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(depths, accuracies, marker='o', linestyle='-', color='green')
plt.title("Effect of max_depth on Random Forest Accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.grid(True)
plt.xticks(depths)
plt.show()


40. Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance.

In [None]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging with Decision Tree
bagging_dt = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)
bagging_dt.fit(X_train, y_train)
dt_preds = bagging_dt.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_preds)

# Bagging with K-Nearest Neighbors
bagging_knn = BaggingRegressor(
    estimator=KNeighborsRegressor(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)
bagging_knn.fit(X_train, y_train)
knn_preds = bagging_knn.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_preds)

# Print comparison
print("Bagging Regressor Performance Comparison:")
print(f"Decision Tree Base Estimator MSE: {dt_mse:.4f}")
print(f"K-Neighbors Base Estimator MSE   : {knn_mse:.4f}")


41. Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict probabilities for the positive class
y_probs = rf.predict_proba(X_test)[:, 1]

# Calculate ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_probs)
print(f"Random Forest ROC-AUC Score: {roc_auc:.4f}")


42. Train a Bagging Classifier and evaluate its performance using cross-validation.


In [None]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Define Bagging Classifier with Decision Tree base estimator
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)

# Evaluate with 5-fold cross-validation
cv_scores = cross_val_score(bagging_model, X, y, cv=5, scoring='accuracy')

# Print results
print("Bagging Classifier Cross-Validation Results:")
print(f"Fold Accuracies : {cv_scores}")
print(f"Mean Accuracy   : {np.mean(cv_scores):.4f}")
print(f"Standard Deviation : {np.std(cv_scores):.4f}")


43. Train a Random Forest Classifier and plot the Precision-Recall curve
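
No code accompanies this question in the original. The following is one possible sketch, assuming the same Breast Cancer dataset, 80/20 split, and 100-tree forest used in the neighbouring cells. Reporting average precision as a summary number is an addition for illustration.

```python
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split (assumed to match the other cells)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predicted probabilities for the positive class
y_scores = rf.predict_proba(X_test)[:, 1]

# Precision and recall at every score threshold, plus a single-number summary
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
avg_precision = average_precision_score(y_test, y_scores)
print(f"Average Precision: {avg_precision:.4f}")

# Plot the Precision-Recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, marker='.', label=f"Random Forest (AP = {avg_precision:.4f})")
plt.title("Random Forest Precision-Recall Curve")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.grid(True)
plt.show()
```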


44. Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=1000, random_state=42)

# Define stacking classifier
stack_model = StackingClassifier(
    estimators=[('rf', rf), ('lr', lr)],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=False
)

# Fit the individual models and the stacking model
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)
stack_model.fit(X_train, y_train)

# Predict and evaluate
rf_acc = accuracy_score(y_test, rf.predict(X_test))
lr_acc = accuracy_score(y_test, lr.predict(X_test))
stack_acc = accuracy_score(y_test, stack_model.predict(X_test))

# Print results
print("Accuracy Comparison:")
print(f"Random Forest Accuracy      : {rf_acc:.4f}")
print(f"Logistic Regression Accuracy: {lr_acc:.4f}")
print(f"Stacking Classifier Accuracy: {stack_acc:.4f}")


45. Train a Bagging Regressor with different levels of bootstrap samples and compare performance.

In [None]:
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Different max_samples values (proportions of training data)
sample_sizes = [0.3, 0.5, 0.7, 1.0]
mse_scores = []

# Train and evaluate for each bootstrap sample size
for size in sample_sizes:
    model = BaggingRegressor(
        estimator=DecisionTreeRegressor(),  # named base_estimator before scikit-learn 1.2
        n_estimators=100,
        max_samples=size,
        bootstrap=True,
        random_state=42
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_samples = {size} --> MSE: {mse:.4f}")

# Plotting
plt.figure(figsize=(8, 5))
plt.plot(sample_sizes, mse_scores, marker='o', linestyle='--', color='blue')
plt.title('Effect of max_samples on Bagging Regressor Performance')
plt.xlabel('max_samples (proportion of training data)')
plt.ylabel('Mean Squared Error')
plt.grid(True)
plt.show()
