<a href="https://colab.research.google.com/github/MattValSE/AutoML2024_Team5/blob/main/AutoML_project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explainable Automated Machine Learning (LTAT.02.023)
## Course project 1
Authors:  Kristjan Laid, Mattias Väli and Evely Kirsiaed

Description:
1. Build and train a baseline and record the result: train several machine learning algorithms (random forest, decision tree, etc.) with their default hyperparameters, then select the one that achieves the best performance.
2. Based on the problem at hand, study the potential pipeline structure, the algorithms or feature transformers at each step, and the hyperparameter ranges. Use Hyperopt with the resulting search space to beat the baseline if possible.

Assessment Criteria:
1. The project’s code should be available and ready to run if needed.
2. Everyone should understand the whole project and be ready to answer any question regarding it, not only the part they contributed to.
3. The total assessment time is 15 minutes: 10 minutes for the presentation and 5 minutes for questions.
4. You may be interrupted during the presentation for some questions.
6. Your presentation should include a dataset description, the search space configurations, the baseline used, the selected pipeline (AutoML output), monitoring of the selection process over time, a comparison between the selected and baseline pipelines, statistical test results, and a justification for each step.
7. Evaluation criteria:

a. Correctness of the code - 33% of the mark.

b. Completeness of the presentation - 33% of the mark.

c. Answers to questions - 33% of the mark.

8. All team members share the same mark for a and b and might get a different mark for c, based on each student’s answers.

## Dataset description
todo

## Baseline model


In [None]:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

In [None]:
!pip install xgboost



In [None]:
# Load the Iris dataset
data = datasets.load_iris()
X = data.data
y = data.target

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
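
Fitting `StandardScaler` on all of `X` before cross-validation lets every validation fold contribute to the scaling statistics. The following is a minimal sketch of one alternative, assuming scikit-learn's `make_pipeline`, in which the scaler is refit inside each fold. The names `X_raw`, `y_raw`, `pipeline` and `pipeline_accuracy` are illustrative and not part of the notebook.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the unscaled Iris features and labels
X_raw, y_raw = datasets.load_iris(return_X_y=True)

# The scaler is refit on the training part of every fold,
# so the held-out part never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), SVC())
pipeline_accuracy = cross_val_score(pipeline, X_raw, y_raw, cv=5, scoring='accuracy').mean()
print(f"Cross-validated accuracy with in-fold scaling: {pipeline_accuracy:.5f}")
```

On a small, clean dataset such as Iris the difference is likely to be minor, but the pipeline form keeps the cross-validated estimate free of this kind of leakage.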

In [None]:
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
res = dict()
for model in [GradientBoostingClassifier(random_state = 42),
              RandomForestClassifier(random_state = 42),
              XGBClassifier(random_state = 42),
              RidgeClassifier(random_state = 42),
              LogisticRegression(random_state = 42),
              GaussianNB(),
              BernoulliNB(),
              NearestCentroid(),
              KNeighborsClassifier(),
              DecisionTreeClassifier(random_state = 42),
              SVC()]:
  # Perform cross-validation and calculate the mean accuracy for each baseline model
  accuracy = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
  res[model] = accuracy

print("Mean accuracy for baseline models")
for el in dict(sorted(res.items(), key=lambda item: item[1], reverse=True)):
  print(f"Cross-validated accuracy: {res[el]:.5f}, Model: {el}")

Mean accuracy for baseline models
Cross-validated accuracy: 0.96667, Model: RandomForestClassifier(random_state=42)
Cross-validated accuracy: 0.96667, Model: SVC()
Cross-validated accuracy: 0.96000, Model: GradientBoostingClassifier(random_state=42)
Cross-validated accuracy: 0.96000, Model: LogisticRegression(random_state=42)
Cross-validated accuracy: 0.96000, Model: KNeighborsClassifier()
Cross-validated accuracy: 0.95333, Model: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
             

## Hyperparameter optimization

### Hyperparameter optimization using Hyperopt

### 1. Random search

In [None]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from hyperopt.pyll.base import scope

In [None]:
# Define the objective function to minimize
def objective(params):
    # max_depth is optional so the same objective also works for search
    # spaces that do not include it (None means unlimited depth)
    max_depth = params.get('max_depth')

    # Create the model using parameters from Hyperopt
    rf = RandomForestClassifier(n_estimators=int(params['n_estimators']),
                                criterion='gini',
                                max_depth=None if max_depth is None else int(max_depth),
    #                            min_samples_split=int(params['min_samples_split']),
    #                            min_samples_leaf=int(params['min_samples_leaf']),
                                random_state=42)

    # Score the candidate forest with 5-fold cross-validated accuracy
    accuracy = cross_val_score(rf, X, y, cv=5, scoring='accuracy').mean()

    # Hyperopt minimizes the loss, so return the negative accuracy
    return {'loss': -accuracy, 'status': STATUS_OK}

# Define the search space for Hyperopt
search_space = {
    'n_estimators': scope.int(hp.quniform('n_estimators', 30, 250, 5)),
#    'max_depth': scope.int(hp.quniform('max_depth', 2, 20, 1)),  # Integer values between 2 and 20
#    'min_samples_split': scope.int(hp.quniform('min_samples_split', 1, 50, 1)),
#    'min_samples_leaf': scope.int(hp.quniform('min_samples_leaf', 1, 50, 1))  # Integer values between 1 and 50
}


trials = Trials()
# Run Hyperopt to minimize the objective function.
# Note: tpe.suggest is the model-based Tree-structured Parzen Estimator (TPE);
# hyperopt's rand.suggest would give pure random search instead.
best = fmin(fn=objective,                # Objective function
            space=search_space,          # Search space
            algo=tpe.suggest,            # Search algorithm
            max_evals=100,               # Number of evaluations
            trials=trials,               # Store results
            rstate=np.random.default_rng(42))  # Ensure reproducibility with a fixed random seed

# Print the best hyperparameters
print("Best hyperparameters found: ", best)

100%|██████████| 100/100 [00:02<00:00, 48.17trial/s, best loss: -0.9666666666666666]
Best hyperparameters found:  {'n_estimators': 210.0}
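
To monitor the search over time, the per-trial losses stored in `trials` can be turned into a running best. The sketch below is one possible way to do this. The helper `running_best` and the values in `example_losses` are hypothetical and only for illustration; in the notebook the input would come from `trials.losses()`, which can contain `None` for failed trials that would need filtering first.

```python
from itertools import accumulate


def running_best(losses):
    """Return the best (lowest) loss seen up to and including each trial."""
    return list(accumulate(losses, min))


# Hypothetical losses for illustration; in the notebook use trials.losses()
example_losses = [-0.90, -0.95, -0.93, -0.96]
best_so_far = running_best(example_losses)
print(best_so_far)
```

Plotting `best_so_far` against the trial index shows how quickly the search converges and whether further evaluations still improve the result.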


In [None]:
# Re-evaluate the best hyperparameters with 5-fold cross-validation (no separate test set is held out)

best_rf = RandomForestClassifier(n_estimators=int(best['n_estimators']),
                                 criterion='gini',
#                                 max_depth=int(best['max_depth']),
#                                 min_samples_split=int(best['min_samples_split']),
#                                 min_samples_leaf=int(best['min_samples_leaf']),
                                 random_state=42)

#best_rf = RandomForestClassifier(random_state=42)
accuracy = cross_val_score(best_rf, X, y, cv=5, scoring='accuracy').mean()
print(f"Cross-validated accuracy: {accuracy:.5f}")


Cross-validated accuracy: 0.96667


### 2. Grid search

In [None]:
import itertools
from hyperopt import space_eval

n_estimators_values = np.arange(30, 250, 5).tolist()
max_depth_values = np.arange(2, 20, 1).tolist()

grid_search_space = {
    'n_estimators': hp.choice('n_estimators', n_estimators_values),
    'max_depth': hp.choice('max_depth', max_depth_values)  # Grid of fixed values
}
# Use itertools.product to find the Cartesian product (i.e., all possible combinations)
all_combinations = list(itertools.product(n_estimators_values, max_depth_values))
total_combinations = len(all_combinations)
# Search the discrete grid using fmin.
# TPE samples grid points with replacement, so this approximates an
# exhaustive grid search but does not guarantee that every point is visited.
best_grid = fmin(fn=objective,        # Objective function
                 space=grid_search_space,  # Grid search space
                 algo=tpe.suggest,    # Search algorithm
                 max_evals=total_combinations,       # Number of evaluations, adjust if necessary
                 rstate=np.random.default_rng(42))  # Ensure reproducibility
# For hp.choice, fmin returns the index of each chosen option;
# space_eval maps those indices back to the actual values
best_grid_params = space_eval(grid_search_space, best_grid)
best_grid_rf = RandomForestClassifier(n_estimators=int(best_grid_params['n_estimators']),
                                      max_depth=int(best_grid_params['max_depth']),
                                      criterion='gini',
                                      random_state=42)

100%|██████████| 792/792 [00:22<00:00, 34.46trial/s, best loss: -0.9666666666666666]


In [None]:
accuracy = cross_val_score(best_grid_rf, X, y, cv=5, scoring='accuracy').mean()
print(f"Cross-validated accuracy: {accuracy:.5f}")

Cross-validated accuracy: 0.94667
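
For comparison, an exhaustive grid search evaluates every combination exactly once. The sketch below shows the idea with a plain loop over `itertools.product`. The grid is deliberately much smaller than the one used in the notebook so that it runs quickly, and the names `X_iris`, `y_iris`, `small_n_estimators`, `small_max_depth`, `grid_results` and `best_setting` are illustrative assumptions.

```python
import itertools

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# Deliberately small illustrative grid so the example runs quickly
small_n_estimators = [30, 100]
small_max_depth = [2, 5]

# Evaluate every (n_estimators, max_depth) combination exactly once
grid_results = {}
for n_estimators, max_depth in itertools.product(small_n_estimators, small_max_depth):
    rf = RandomForestClassifier(n_estimators=n_estimators,
                                max_depth=max_depth,
                                random_state=42)
    grid_results[(n_estimators, max_depth)] = cross_val_score(
        rf, X_iris, y_iris, cv=5, scoring='accuracy').mean()

# Pick the combination with the highest mean accuracy
best_setting = max(grid_results, key=grid_results.get)
print(f"Best (n_estimators, max_depth): {best_setting}, "
      f"accuracy: {grid_results[best_setting]:.5f}")
```

scikit-learn's `GridSearchCV` wraps the same loop and may be more convenient when the grid is large.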


## Bayesian optimization

In [None]:
!pip install scikit-optimize
from skopt.space import Real
from skopt.utils import use_named_args
from skopt import gp_minimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl.metadata (9.7 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Downloading pyaml-24.9.0-py3-none-any.whl.metadata (11 kB)
Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl (107 kB)
Downloading pyaml-24.9.0-py3-none-any.whl (24 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-24.9.0 scikit-optimize-0.10.2


In [None]:
# Define the SVM model's hyperparameter space
space = [Real(1e-6, 100.0, "log-uniform", name='C'),
         Real(1e-6, 100.0, "log-uniform", name='gamma')]

# Define the objective function to minimize
@use_named_args(space)
def objective(**params):
    # Create the SVM model with the given hyperparameters
    model = SVC(C=params['C'], gamma=params['gamma'])

    # Perform cross-validation and calculate the mean accuracy
    accuracy = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()

    # Return the negative accuracy (because we want to minimize the objective)
    return -accuracy

# Run Bayesian optimization to find the best hyperparameters
result = gp_minimize(objective, space, n_calls=50, random_state=42)
#print(result)

# Extract the optimal hyperparameters
best_C = result.x[0]
best_gamma = result.x[1]
print(f"Optimal hyperparameters: C = {best_C}, gamma = {best_gamma}")

# Train the final model with the optimal hyperparameters
final_model = SVC(C=best_C, gamma=best_gamma)
final_model.fit(X, y)

# Evaluate the model
accuracy = cross_val_score(final_model, X, y, cv=5, scoring='accuracy').mean()
print(f"Cross-validated accuracy of the final model: {accuracy:.5f}")

Optimal hyperparameters: C = 5.094785234755079, gamma = 0.0713002448227164
Cross-validated accuracy of the final model: 0.98000
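
One possible way to check whether the tuned SVM differs from the default SVM beyond fold-to-fold noise is a paired test on per-fold scores. The sketch below assumes repeated stratified 5-fold cross-validation and SciPy's paired t-test (`ttest_rel`). The names `X_iris`, `y_iris`, `baseline_svc`, `tuned_svc`, `baseline_scores` and `tuned_scores` are illustrative, and the hyperparameter values are copied from the optimizer output above. Fold scores are not independent, so the p-value should be read as a rough indication only.

```python
from scipy.stats import ttest_rel
from sklearn import datasets
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# A fixed random_state yields the same folds for both models, so the scores are paired
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

baseline_svc = make_pipeline(StandardScaler(), SVC())
# C and gamma copied from the Bayesian optimization output
tuned_svc = make_pipeline(StandardScaler(),
                          SVC(C=5.094785234755079, gamma=0.0713002448227164))

baseline_scores = cross_val_score(baseline_svc, X_iris, y_iris, cv=cv, scoring='accuracy')
tuned_scores = cross_val_score(tuned_svc, X_iris, y_iris, cv=cv, scoring='accuracy')

# Paired t-test on the per-fold accuracy differences
statistic, p_value = ttest_rel(tuned_scores, baseline_scores)
print(f"Baseline mean: {baseline_scores.mean():.5f}, tuned mean: {tuned_scores.mean():.5f}")
print(f"Paired t-test: statistic = {statistic:.3f}, p-value = {p_value:.3f}")
```

If the two models score identically on every fold, the test statistic is undefined and `ttest_rel` returns `nan`.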
