## **Exercise 1**
<!-- @q -->

In the following exercises, we'll explore the behavior of different ensemble methods from the notes. First, we'll set up some synthetic data to play with.  I've included a decision tree classifier to provide you with a baseline.

In [1]:
# Your code here
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=2, n_redundant=15, n_classes=2, random_state=42, flip_y=0.07
)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
single_tree_scores = cross_val_score(clf,X,y,cv=5)

# Print the mean cross-validated accuracy
print(f"CV Accuracy: {single_tree_scores.mean():.2f}")

CV Accuracy: 0.83


### **Exercise 1.1**

Implement a bagging classifier with 10 decision tree estimators, and compare performance to the single decision tree.

In [2]:
# Your code here
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(
    n_samples=400, n_features=20, n_informative=2, n_redundant=15, n_classes=2, random_state=42, flip_y=0.07
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
single_tree_scores = cross_val_score(clf, X, y, cv=5)

print(f"Single Decision Tree CV Accuracy: {single_tree_scores.mean():.2f}")

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    random_state=42
)
bagging_scores = cross_val_score(bag_clf, X, y, cv=5)

print(f"Bagging Classifier CV Accuracy: {bagging_scores.mean():.2f}")
print(f"Improvement: {bagging_scores.mean() - single_tree_scores.mean():.2f}")

Single Decision Tree CV Accuracy: 0.83
Bagging Classifier CV Accuracy: 0.88
Improvement: 0.05
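
A mean improvement of 0.05 says little about how consistent the gain is. The sketch below is one way to check this, assuming the same synthetic data and the same 5-fold setup as above: it compares the two models fold by fold. The variable names (`tree_scores`, `bag_scores`, `fold_diffs`) are illustrative and not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same synthetic data as in the exercise
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=2, n_redundant=15,
    n_classes=2, random_state=42, flip_y=0.07
)

tree = DecisionTreeClassifier(random_state=42)
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=10, random_state=42
)

# With an integer cv and a classifier, both calls use the same unshuffled
# stratified folds, so the scores can be compared fold by fold.
tree_scores = cross_val_score(tree, X, y, cv=5)
bag_scores = cross_val_score(bag, X, y, cv=5)
fold_diffs = bag_scores - tree_scores

for i, d in enumerate(fold_diffs, start=1):
    print(f"Fold {i}: bagging - tree = {d:+.3f}")
print(f"Folds where bagging wins: {(fold_diffs > 0).sum()} of {len(fold_diffs)}")
```

If bagging wins on most folds, the improvement is more convincing than the mean alone suggests.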


### **Exercise 1.2: Effect of Subsampling in Bagging**


__Task:__

- Train a BaggingClassifier with different max_samples values (30%, 50%, 70% and 100%) and compare how subsampling affects the model’s performance on the test set.
- Use 10 base estimators as in Exercise 1.1.

In [3]:
# Your code here
import pandas as pd
from sklearn.linear_model import LogisticRegression

max_samples_values = [0.3, 0.5, 0.7, 1.0]
results = []

for max_sample in max_samples_values:
    bag_tree = BaggingClassifier(
        DecisionTreeClassifier(random_state=42), 
        n_estimators=10, 
        random_state=42,
        max_samples=max_sample
    )
    bag_tree_scores = cross_val_score(bag_tree, X, y, cv=5)
    bag_log = BaggingClassifier(
        LogisticRegression(random_state=42, max_iter=1000), 
        n_estimators=10, 
        random_state=42,
        max_samples=max_sample
    )
    bag_log_scores = cross_val_score(bag_log, X, y, cv=5)
    
    results.append({
        'max_samples': f"{int(max_sample*100)}%",
        'Tree_CV_Mean': bag_tree_scores.mean(),
        'Tree_CV_Std': bag_tree_scores.std(),
        'LogReg_CV_Mean': bag_log_scores.mean(),
        'LogReg_CV_Std': bag_log_scores.std()
    })
    
    print(f"\n=== max_samples = {int(max_sample*100)}% ===")
    print(f"Bagging Decision Tree CV Accuracy: {bag_tree_scores.mean():.4f} (+/- {bag_tree_scores.std():.4f})")
    print(f"Bagging Logistic Regression CV Accuracy: {bag_log_scores.mean():.4f} (+/- {bag_log_scores.std():.4f})")
results_df = pd.DataFrame(results)
print("\n=== Summary Table ===")
print(results_df)


=== max_samples = 30% ===
Bagging Decision Tree CV Accuracy: 0.8975 (+/- 0.0310)
Bagging Logistic Regression CV Accuracy: 0.8875 (+/- 0.0335)

=== max_samples = 50% ===
Bagging Decision Tree CV Accuracy: 0.8950 (+/- 0.0269)
Bagging Logistic Regression CV Accuracy: 0.8925 (+/- 0.0312)

=== max_samples = 70% ===
Bagging Decision Tree CV Accuracy: 0.8775 (+/- 0.0414)
Bagging Logistic Regression CV Accuracy: 0.8950 (+/- 0.0322)

=== max_samples = 100% ===
Bagging Decision Tree CV Accuracy: 0.8825 (+/- 0.0203)
Bagging Logistic Regression CV Accuracy: 0.8850 (+/- 0.0366)

=== Summary Table ===
  max_samples  Tree_CV_Mean  Tree_CV_Std  LogReg_CV_Mean  LogReg_CV_Std
0         30%        0.8975     0.031024          0.8875       0.033541
1         50%        0.8950     0.026926          0.8925       0.031225
2         70%        0.8775     0.041382          0.8950       0.032210
3        100%        0.8825     0.020310          0.8850       0.036572
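
One way to reason about these numbers is to ask how much of the training data each base estimator actually sees. The sketch below is a back-of-the-envelope calculation, not part of the original solution. It assumes the default `bootstrap=True` (rows drawn with replacement) and about 320 training rows per fold (400 samples under 5-fold CV).

```python
# Expected fraction of distinct training rows seen by one bagged estimator
# when m = max_samples * n rows are drawn with replacement from n rows.
n = 320  # assumed: rows per training fold (400 samples, 5-fold CV)

unique_fraction = {}
for max_samples in [0.3, 0.5, 0.7, 1.0]:
    m = round(max_samples * n)
    # P(a given row is picked at least once in m draws) = 1 - (1 - 1/n)^m
    unique_fraction[max_samples] = 1 - (1 - 1 / n) ** m
    print(f"max_samples={max_samples:.1f}: "
          f"~{unique_fraction[max_samples]:.1%} of rows are distinct")
```

Smaller `max_samples` values give each tree a smaller and more different slice of the data, which tends to decorrelate the trees at the cost of making each one weaker.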


### **Exercise 1.3: Out-of-Bag (OOB) Evaluation**

<!-- @sub-->

**Task:**

- Enable Out-of-Bag (OOB) evaluation in the BaggingClassifier and compare the OOB score to the test set accuracy.
- Train the model using 10 base estimators and the same synthetic dataset as above.

In [4]:
# Your code here
bag_tree_oob = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    random_state=42,
    oob_score=True
)

bag_tree_oob.fit(X_train, y_train)

oob_score = bag_tree_oob.oob_score_

y_pred_test = bag_tree_oob.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)

print("=== Out-of-Bag Evaluation ===")
print(f"OOB Score: {oob_score:.4f}")
print(f"Test Set Accuracy: {test_accuracy:.4f}")
print(f"Difference: {abs(oob_score - test_accuracy):.4f}")

=== Out-of-Bag Evaluation ===
OOB Score: 0.8571
Test Set Accuracy: 0.8667
Difference: 0.0095
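
With only 10 estimators, each training row receives few out-of-bag votes, so the OOB score is a fairly noisy estimate. The sketch below is a rough calculation under stated assumptions (full-size bootstrap samples and 280 training rows, i.e. 70% of 400); it estimates how many OOB votes a row gets on average and how likely it is to get none.

```python
n_train = 280      # assumed: 70% of the 400 samples
n_estimators = 10

# P(a given row is left out of one bootstrap sample of size n_train)
p_oob = (1 - 1 / n_train) ** n_train

# Average number of estimators for which a row is out-of-bag
expected_votes = n_estimators * p_oob

# P(a row is in-bag for every estimator, so it gets no OOB prediction at all)
p_never_oob = (1 - p_oob) ** n_estimators

print(f"P(out-of-bag for one estimator): {p_oob:.3f}")
print(f"Expected OOB votes per row:      {expected_votes:.2f}")
print(f"P(no OOB vote at all):           {p_never_oob:.4f}")
```

Increasing `n_estimators` raises the number of votes per row and should make the OOB score a steadier proxy for test accuracy.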


### **Exercise 1.4: Implementing AdaBoost**


**Task:**  
- Train an `AdaBoostClassifier` using 10 estimators and a learning rate of 1.
- Use the synthetic dataset from earlier exercises.

In [5]:
# Your code here
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=42),  # Weak learner (stump)
    n_estimators=10,
    learning_rate=1.0,
    random_state=42
)

ada_scores = cross_val_score(ada_clf, X, y, cv=5)
print(f"CV Accuracy: {ada_scores.mean():.4f} (+/- {ada_scores.std():.4f})")
single_tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print(f"Single Decision Tree CV Accuracy: {single_tree_scores.mean():.4f}")
bag_scores = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(random_state=42), n_estimators=10, random_state=42), 
    X, y, cv=5
)
print(f"Bagging Classifier CV Accuracy: {bag_scores.mean():.4f}")
print(f"AdaBoost CV Accuracy: {ada_scores.mean():.4f}")

CV Accuracy: 0.9075 (+/- 0.0232)
Single Decision Tree CV Accuracy: 0.8300
Bagging Classifier CV Accuracy: 0.8825
AdaBoost CV Accuracy: 0.9075


### **Exercise 1.5: Effect of Hyperparameters in AdaBoost**


**Task:**  
- Sweep the learning rate parameters from .5 to 2 in increments of .25, and the number of estimators from 5 to 55 in increments of 10. Use cross validation with accuracy to evaluate performance. 
- Use GridSearchCV for your solution

In [6]:
# Your code here
from sklearn.model_selection import GridSearchCV
import numpy as np

param_grid = {
    'learning_rate': np.arange(0.5, 2.25, 0.25),  # 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0
    'n_estimators': np.arange(5, 65, 10)  # 5, 15, 25, 35, 45, 55
}

ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=42),
    random_state=42
)

grid_search = GridSearchCV(
    ada_clf,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    verbose=1,
    n_jobs=-1  # Use all available cores
)

grid_search.fit(X_train, y_train)

print("=== Grid Search Results ===")
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")

y_pred_best = grid_search.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_best)
print(f"Test Set Accuracy: {test_accuracy:.4f}")

results_df = pd.DataFrame(grid_search.cv_results_)
top_5 = results_df[['param_learning_rate', 'param_n_estimators', 'mean_test_score']].sort_values(
    'mean_test_score', ascending=False
).head(5)
print("\n=== Top 5 Parameter Combinations ===")
print(top_5)

Fitting 5 folds for each of 42 candidates, totalling 210 fits


=== Grid Search Results ===
Best Parameters: {'learning_rate': np.float64(1.0), 'n_estimators': np.int64(15)}
Best Cross-Validation Score: 0.9071
Test Set Accuracy: 0.8750

=== Top 5 Parameter Combinations ===
    param_learning_rate  param_n_estimators  mean_test_score
13                 1.00                  15         0.907143
16                 1.00                  45         0.900000
15                 1.00                  35         0.896429
12                 1.00                   5         0.896429
6                  0.75                   5         0.896429


<!-- @ sub -->
Which parameters perform best? Which are the worst?  Do your results surprise you?  Why do you think you are seeing what you do?

*Enter your answer in this cell*
The best parameters are learning_rate=1.0 with n_estimators=15, achieving 90.7% CV accuracy. Four of the top five combinations use learning_rate=1.0 and the fifth uses 0.75; no rate above 1.0 appears in the top five, so moderate rates beat the more aggressive ones on this grid. More trees did not help: at learning_rate=1.0, going from 15 to 45 estimators lowers CV accuracy from 0.907 to 0.900, although a gap that small is comparable to the fold-to-fold spread seen in Exercise 1.4 (about 0.02). The printed top five does not show the worst combinations; sorting `cv_results_` in ascending order would identify them. The result is somewhat counterintuitive because conventional wisdom favors lower learning rates with more estimators, but this small dataset has only two informative features and does not need extensive boosting, and extra rounds may start fitting the 7% label noise. The lowest rates in the grid may also be held back by the estimator cap, since a small learning rate takes small steps and may need more than 55 rounds to catch up. This highlights that the two parameters must be tuned together.
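
The interaction between learning rate and number of estimators can be probed directly. The sketch below is an illustration under assumptions, not the graded solution: it uses a single train/test split instead of cross-validation, the default stump base estimator, a hypothetical cap of 200 rounds, and a learning rate of 0.1 that lies outside the original grid. `staged_score` reports accuracy after each boosting round.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=20, n_informative=2, n_redundant=15,
    n_classes=2, random_state=42, flip_y=0.07
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

max_rounds = 200  # hypothetical cap, well above the 55 used in the grid
curves = {}
for lr in [0.1, 1.0]:  # 0.1 is a hypothetical value outside the original grid
    # The default base estimator is a depth-1 decision tree (a stump)
    ada = AdaBoostClassifier(n_estimators=max_rounds, learning_rate=lr, random_state=42)
    ada.fit(X_train, y_train)
    # staged_score yields the test accuracy after each boosting round
    curves[lr] = list(ada.staged_score(X_test, y_test))
    best_round = max(range(len(curves[lr])), key=curves[lr].__getitem__) + 1
    print(f"learning_rate={lr}: best test accuracy "
          f"{max(curves[lr]):.3f} at round {best_round}")
```

Comparing the round at which each curve peaks gives a rough sense of whether a lower learning rate simply needs more rounds on this data.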

# **Exercise 2: Gradient Boosting for Regression**


- Use `sklearn.ensemble.GradientBoostingRegressor` to implement a regression model on the following dataset.
- Use a single train / test split
- Train the model with 50 estimators and compare its performance to a decision tree regressor.

In [1]:
# @SHOW

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X_reg, y_reg = make_regression(n_samples=1000, n_features=2, noise=0.1, random_state=42)



In [2]:
# Your code here
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train Gradient Boosting Regressor with 50 estimators
gb_regressor = GradientBoostingRegressor(n_estimators=50, random_state=42)
gb_regressor.fit(X_train, y_train)

# Train Decision Tree Regressor for comparison
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)

# Make predictions
gb_predictions = gb_regressor.predict(X_test)
dt_predictions = dt_regressor.predict(X_test)

# Calculate performance metrics
gb_mse = mean_squared_error(y_test, gb_predictions)
dt_mse = mean_squared_error(y_test, dt_predictions)

gb_r2 = r2_score(y_test, gb_predictions)
dt_r2 = r2_score(y_test, dt_predictions)

# Display results
print("="*60)
print("MODEL PERFORMANCE COMPARISON")
print("="*60)

print("\nGradient Boosting Regressor (50 estimators):")
print(f"  Mean Squared Error: {gb_mse:.4f}")
print(f"  R² Score: {gb_r2:.4f}")

print("\nDecision Tree Regressor:")
print(f"  Mean Squared Error: {dt_mse:.4f}")
print(f"  R² Score: {dt_r2:.4f}")

print("\n" + "="*60)
print("IMPROVEMENT")
print("="*60)
mse_improvement = ((dt_mse - gb_mse) / dt_mse) * 100
print(f"MSE Improvement: {mse_improvement:.2f}%")
print(f"Gradient Boosting performs {'better' if gb_mse < dt_mse else 'worse'} than Decision Tree")

MODEL PERFORMANCE COMPARISON

Gradient Boosting Regressor (50 estimators):
  Mean Squared Error: 5.6205
  R² Score: 0.9964

Decision Tree Regressor:
  Mean Squared Error: 6.1143
  R² Score: 0.9961

IMPROVEMENT
MSE Improvement: 8.08%
Gradient Boosting performs better than Decision Tree
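
The comparison above looks only at the final 50-tree model. As a complement, the sketch below (an illustration, assuming the same data and split as in the solution) uses `staged_predict` to track test MSE as trees are added, which shows how quickly the ensemble converges.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_reg, y_reg = make_regression(n_samples=1000, n_features=2, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

gb = GradientBoostingRegressor(n_estimators=50, random_state=42)
gb.fit(X_train, y_train)

# staged_predict yields test predictions after 1, 2, ..., 50 trees
stage_mse = [mean_squared_error(y_test, pred) for pred in gb.staged_predict(X_test)]

for k in (1, 10, 25, 50):
    print(f"{k:>2} trees: test MSE = {stage_mse[k - 1]:.2f}")
```

If the curve is still falling steeply at 50 trees, more estimators would likely widen the gap over the single decision tree.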


# **Exercise 3**

<!-- @q -->

Previously, we used hyperparameter optimization to optimize clustering.  Here, we will use it to optimize a random forest classifier.   I've started the process by organizing the data and establishing a baseline model.

In [3]:
# @SHOW

# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load Dataset and Preprocess
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
                'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
                'hours-per-week', 'native-country', 'income']
data = pd.read_csv(url, names=column_names, na_values=' ?')

# Handle missing values by dropping rows with missing data
data.dropna(inplace=True)

# Convert categorical columns to dummy variables
data = pd.get_dummies(data, drop_first=True)

# Separate features and target variable
X = data.drop('income_ >50K', axis=1)
y = data['income_ >50K']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Train Baseline Random Forest Model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred)
print(f'Baseline Accuracy: {baseline_accuracy:.4f}')


Baseline Accuracy: 0.8543


#### Step 1: GridSearchCV


Implement and run a grid search (using GridSearchCV) over a range of parameters for the random forest on the previously established data.  Test at least 18 different parameter combinations.

In [None]:
# Your code here

# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Step 3: GridSearchCV for Hyperparameter Optimization
print("="*70)
print("GRID SEARCH OPTIMIZATION")
print("="*70)

# Define parameter grid - This creates 2 x 3 x 3 x 2 = 36 combinations
param_grid = {
    'n_estimators': [100, 200],           # Number of trees
    'max_depth': [10, 20, None],          # Maximum depth of trees
    'min_samples_split': [2, 5, 10],      # Minimum samples to split a node
    'min_samples_leaf': [1, 2]            # Minimum samples at leaf node
}

# Calculate total combinations
total_combinations = 1
for param_values in param_grid.values():
    total_combinations *= len(param_values)
print(f"Testing {total_combinations} parameter combinations...\n")

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,              # Use all available cores
    verbose=1
)

# Fit the grid search
print("Running Grid Search (this may take a few minutes)...")
grid_search.fit(X_train, y_train)

# Step 4: Display Results
print("\n" + "="*70)
print("BEST PARAMETERS FOUND")
print("="*70)
for param, value in grid_search.best_params_.items():
    print(f"{param}: {value}")

print("\n" + "="*70)
print("MODEL PERFORMANCE COMPARISON")
print("="*70)
print(f"Baseline Accuracy:        {baseline_accuracy:.4f}")
print(f"Best Cross-Val Accuracy:  {grid_search.best_score_:.4f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred_optimized = best_model.predict(X_test)
optimized_accuracy = accuracy_score(y_test, y_pred_optimized)
print(f"Optimized Test Accuracy:  {optimized_accuracy:.4f}")

# Calculate improvement
improvement = (optimized_accuracy - baseline_accuracy) / baseline_accuracy * 100
print(f"\nImprovement: {improvement:.2f}%")

# Display top 5 parameter combinations
print("\n" + "="*70)
print("TOP 5 PARAMETER COMBINATIONS")
print("="*70)
results_df = pd.DataFrame(grid_search.cv_results_)
top_5 = results_df.nsmallest(5, 'rank_test_score')[['params', 'mean_test_score', 'rank_test_score']]
for idx, row in top_5.iterrows():
    print(f"\nRank {int(row['rank_test_score'])}: Score = {row['mean_test_score']:.4f}")
    print(f"  Parameters: {row['params']}")

BASELINE MODEL
Baseline Accuracy: 0.8543

GRID SEARCH OPTIMIZATION
Testing 36 parameter combinations...

Running Grid Search (this may take a few minutes)...
Fitting 5 folds for each of 36 candidates, totalling 180 fits

BEST PARAMETERS FOUND
max_depth: None
min_samples_leaf: 2
min_samples_split: 10
n_estimators: 100

MODEL PERFORMANCE COMPARISON
Baseline Accuracy:        0.8543
Best Cross-Val Accuracy:  0.8607
Optimized Test Accuracy:  0.8618

Improvement: 0.87%

TOP 5 PARAMETER COMBINATIONS

Rank 1: Score = 0.8607
  Parameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 100}

Rank 2: Score = 0.8603
  Parameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 200}

Rank 3: Score = 0.8592
  Parameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}

Rank 4: Score = 0.8591
  Parameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 2

#### Step 2: RandomizedSearchCV


Implement and run a random search (using RandomizedSearchCV) over a range of parameters for the random forest.  Test at least 20 different parameter combinations.

In [None]:
# Your code here
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from scipy.stats import randint, uniform
import warnings
warnings.filterwarnings('ignore')

# Step 3: RandomizedSearchCV for Hyperparameter Optimization
print("="*70)
print("RANDOMIZED SEARCH OPTIMIZATION")
print("="*70)

# Define parameter distributions for random sampling
# This allows exploring a much wider range than GridSearch
param_distributions = {
    'n_estimators': randint(50, 300),              # Random integers from 50 to 299 (upper bound exclusive)
    'max_depth': [5, 10, 15, 20, 25, None],        # Discrete options including no limit
    'min_samples_split': randint(2, 20),           # Random integers from 2 to 19
    'min_samples_leaf': randint(1, 10),            # Random integers from 1 to 9
    'max_features': ['sqrt', 'log2', None],        # Feature selection strategies
    'bootstrap': [True, False],                    # Whether to use bootstrap samples
    'criterion': ['gini', 'entropy']               # Splitting criteria
}

n_iter = 25  # Number of random combinations to test
print(f"Testing {n_iter} random parameter combinations...\n")

# Initialize RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=n_iter,          # Number of parameter settings sampled
    cv=5,                   # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,              # Use all available cores
    verbose=1,
    random_state=42
)

# Fit the random search
print("Running Randomized Search (this may take a few minutes)...")
random_search.fit(X_train, y_train)

# Step 4: Display Results
print("\n" + "="*70)
print("BEST PARAMETERS FOUND")
print("="*70)
for param, value in random_search.best_params_.items():
    print(f"{param}: {value}")

print("\n" + "="*70)
print("MODEL PERFORMANCE COMPARISON")
print("="*70)
print(f"Baseline Accuracy:        {baseline_accuracy:.4f}")
print(f"Best Cross-Val Accuracy:  {random_search.best_score_:.4f}")

# Evaluate on test set
best_model = random_search.best_estimator_
y_pred_optimized = best_model.predict(X_test)
optimized_accuracy = accuracy_score(y_test, y_pred_optimized)
print(f"Optimized Test Accuracy:  {optimized_accuracy:.4f}")

# Calculate improvement
improvement = (optimized_accuracy - baseline_accuracy) / baseline_accuracy * 100
print(f"\nImprovement: {improvement:.2f}%")

# Display top 5 parameter combinations
print("\n" + "="*70)
print("TOP 5 PARAMETER COMBINATIONS")
print("="*70)
results_df = pd.DataFrame(random_search.cv_results_)
top_5 = results_df.nsmallest(5, 'rank_test_score')[['params', 'mean_test_score', 'rank_test_score']]
for idx, row in top_5.iterrows():
    print(f"\nRank {int(row['rank_test_score'])}: Score = {row['mean_test_score']:.4f}")
    print(f"  Parameters: {row['params']}")

# Additional statistics
print("\n" + "="*70)
print("SEARCH STATISTICS")
print("="*70)
print(f"Total combinations tested: {len(results_df)}")
print(f"Best score achieved: {random_search.best_score_:.4f}")
print(f"Mean score across all trials: {results_df['mean_test_score'].mean():.4f}")
print(f"Standard deviation: {results_df['mean_test_score'].std():.4f}")

BASELINE MODEL
Baseline Accuracy: 0.8543

RANDOMIZED SEARCH OPTIMIZATION
Testing 25 random parameter combinations...

Running Randomized Search (this may take a few minutes)...
Fitting 5 folds for each of 25 candidates, totalling 125 fits

BEST PARAMETERS FOUND
bootstrap: False
criterion: gini
max_depth: None
max_features: sqrt
min_samples_leaf: 2
min_samples_split: 13
n_estimators: 207

MODEL PERFORMANCE COMPARISON
Baseline Accuracy:        0.8543
Best Cross-Val Accuracy:  0.8603
Optimized Test Accuracy:  0.8619

Improvement: 0.89%

TOP 5 PARAMETER COMBINATIONS

Rank 1: Score = 0.8603
  Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 13, 'n_estimators': 207}

Rank 2: Score = 0.8592
  Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': 15, 'max_features': None, 'min_samples_leaf': 8, 'min_samples_split': 5, 'n_estimators': 153}

Rank 3: Score = 0.8591
  Parameters: {'bootstrap': F

#### Step 3: BayesianSearchCV



Previously, we used HyperOpt for Bayesian optimization. The [bayesian-optimization](https://pypi.org/project/bayesian-optimization/) package does much the same thing, but is a little more user friendly. Install the package and use it to run a Bayesian search over a range of parameters for the random forest.  Test at least 15 different parameter combinations.

In [8]:
pip install bayesian-optimization

Collecting bayesian-optimization
  Downloading bayesian_optimization-3.1.0-py3-none-any.whl.metadata (11 kB)
Downloading bayesian_optimization-3.1.0-py3-none-any.whl (36 kB)
Installing collected packages: bayesian-optimization
Successfully installed bayesian-optimization-3.1.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Your code here
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from bayes_opt import BayesianOptimization
import warnings
warnings.filterwarnings('ignore')

# Step 3: Bayesian Optimization for Hyperparameter Tuning
print("="*70)
print("BAYESIAN OPTIMIZATION")
print("="*70)

# Define the objective function to maximize
def rf_cv_score(n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features):
    """
    Function to optimize. Returns the cross-validation score.
    Bayesian Optimization will try to maximize this function.
    """
    # Convert continuous parameters to appropriate types
    n_estimators = int(n_estimators)
    max_depth = int(max_depth) if max_depth > 0 else None
    min_samples_split = int(min_samples_split)
    min_samples_leaf = int(min_samples_leaf)
    max_features = min(max(0.1, max_features), 1.0)  # Keep between 0.1 and 1.0
    
    # Create and evaluate the model
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=42,
        n_jobs=-1
    )
    
    # Use cross-validation to get a robust score
    cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
    return cv_scores.mean()

# Define the parameter bounds for Bayesian Optimization
param_bounds = {
    'n_estimators': (50, 300),          # Number of trees
    'max_depth': (5, 30),                # Maximum depth (cast to int; always > 0 here, so never None)
    'min_samples_split': (2, 20),        # Minimum samples to split
    'min_samples_leaf': (1, 10),         # Minimum samples at leaf
    'max_features': (0.1, 1.0)           # Fraction of features to consider
}

# Initialize Bayesian Optimization
optimizer = BayesianOptimization(
    f=rf_cv_score,
    pbounds=param_bounds,
    random_state=42,
    verbose=2
)

# Run the optimization
n_iterations = 20  # Number of guided Bayesian steps, run after the 5 random init_points
print(f"\nRunning Bayesian Optimization for {n_iterations} iterations...")
print("(This uses smart sampling to find optimal parameters efficiently)\n")

optimizer.maximize(
    init_points=5,      # Number of random exploration steps
    n_iter=n_iterations # Number of Bayesian optimization steps
)

# Step 4: Display Results
print("\n" + "="*70)
print("BEST PARAMETERS FOUND")
print("="*70)
best_params = optimizer.max['params']
print(f"n_estimators: {int(best_params['n_estimators'])}")
print(f"max_depth: {int(best_params['max_depth']) if best_params['max_depth'] > 0 else None}")
print(f"min_samples_split: {int(best_params['min_samples_split'])}")
print(f"min_samples_leaf: {int(best_params['min_samples_leaf'])}")
print(f"max_features: {best_params['max_features']:.3f}")

print("\n" + "="*70)
print("MODEL PERFORMANCE COMPARISON")
print("="*70)
print(f"Baseline Accuracy:        {baseline_accuracy:.4f}")
print(f"Best Cross-Val Accuracy:  {optimizer.max['target']:.4f}")

# Train final model with best parameters
best_rf = RandomForestClassifier(
    n_estimators=int(best_params['n_estimators']),
    max_depth=int(best_params['max_depth']) if best_params['max_depth'] > 0 else None,
    min_samples_split=int(best_params['min_samples_split']),
    min_samples_leaf=int(best_params['min_samples_leaf']),
    max_features=best_params['max_features'],
    random_state=42,
    n_jobs=-1
)
best_rf.fit(X_train, y_train)

# Evaluate on test set
y_pred_optimized = best_rf.predict(X_test)
optimized_accuracy = accuracy_score(y_test, y_pred_optimized)
print(f"Optimized Test Accuracy:  {optimized_accuracy:.4f}")

# Calculate improvement
improvement = (optimized_accuracy - baseline_accuracy) / baseline_accuracy * 100
print(f"\nImprovement: {improvement:.2f}%")

# Display top 5 trials
print("\n" + "="*70)
print("TOP 5 TRIALS")
print("="*70)
# Sort all results by target score
all_results = []
for i, res in enumerate(optimizer.res):
    all_results.append({
        'trial': i + 1,
        'score': res['target'],
        'params': res['params']
    })

# Sort by score and get top 5
sorted_results = sorted(all_results, key=lambda x: x['score'], reverse=True)[:5]
for rank, result in enumerate(sorted_results, 1):
    print(f"\nRank {rank}: Score = {result['score']:.4f}")
    print(f"  n_estimators: {int(result['params']['n_estimators'])}")
    print(f"  max_depth: {int(result['params']['max_depth'])}")
    print(f"  min_samples_split: {int(result['params']['min_samples_split'])}")
    print(f"  min_samples_leaf: {int(result['params']['min_samples_leaf'])}")
    print(f"  max_features: {result['params']['max_features']:.3f}")

# Additional statistics
print("\n" + "="*70)
print("OPTIMIZATION STATISTICS")
print("="*70)
all_scores = [res['target'] for res in optimizer.res]
print(f"Total trials: {len(optimizer.res)}")
print(f"Best score: {max(all_scores):.4f}")
print(f"Mean score: {np.mean(all_scores):.4f}")
print(f"Standard deviation: {np.std(all_scores):.4f}")
print(f"Improvement over first trial: {(max(all_scores) - all_scores[0]) * 100:.2f}%")

BASELINE MODEL
Baseline Accuracy: 0.8543

BAYESIAN OPTIMIZATION

Running Bayesian Optimization for 20 iterations...
(This uses smart sampling to find optimal parameters efficiently)

|   iter    |  target   | n_esti... | max_depth | min_sa... | min_sa... | max_fe... |
-------------------------------------------------------------------------------------
| 1         | 0.8602926 | 143.63502 | 28.767857 | 15.175890 | 6.3879263 | 0.2404167 |
| 2         | 0.8500558 | 88.998630 | 6.4520903 | 17.591170 | 6.4100351 | 0.7372653 |
| 3         | 0.8596296 | 55.146123 | 29.247746 | 16.983967 | 2.9110519 | 0.2636424 |
| 4         | 0.8583033 | 95.851127 | 12.606056 | 11.445615 | 4.8875051 | 0.3621062 |
| 5

#### Step 4: Compare and Reflect


Compare the outputs of the different strategies.  Do they converge to similar parameters?  Why or why not?  Which method would you try first in practice?

*Enter your answer in this cell*
All three methods reached nearly identical cross-validation accuracy (GridSearch: 0.8607, RandomizedSearch: 0.8603, Bayesian: 0.8603) but did not converge to the same parameters. GridSearch chose 100 estimators with min_samples_split=10, RandomizedSearch chose 207 estimators with bootstrap disabled, and the Bayesian search settled on 144-150 estimators with a max_depth of 28-29. These differences arise because many parameter combinations perform about equally well: the accuracy surface is a flat plateau near the optimum, so each search stops at a different point on it. Bayesian optimization showed 0.00% improvement over its first trial, which means its best score came from the first random initialization point and the guided steps found nothing better, consistent with a flat surface. In practice, I would start with RandomizedSearchCV. Its test accuracy (0.8619) essentially matched GridSearch (0.8618) with fewer fits (125 versus 180), it needs no extra libraries, and it can sample wider ranges, including categorical options. I would reserve Bayesian methods for models that are expensive to evaluate and GridSearch for final fine-tuning around a promising region.
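
The "flat plateau" claim can be checked by counting how many candidates score within one standard deviation of the best one. The sketch below is a minimal illustration: to run quickly it uses a small synthetic dataset and a hypothetical four-point grid rather than the Adult data and the grids above. The same check could be applied to the `cv_results_` of any of the searches in this exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in so the sketch runs quickly; not the Adult dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},  # hypothetical grid
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

means = np.asarray(search.cv_results_["mean_test_score"])
stds = np.asarray(search.cv_results_["std_test_score"])
best = search.best_index_

# Candidates whose mean score is within one fold-level std of the best mean
threshold = means[best] - stds[best]
n_close = int((means >= threshold).sum())
print(f"{n_close} of {len(means)} candidates are within one std of the best score")
```

When most candidates fall inside that band, the choice among them is driven largely by noise, which would explain why the three searches ended at different parameters.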