# ML Modeling
This notebook outlines a method for constructing an ensemble of **Random Forest Classifiers** using **Bayesian optimization**.

In [1]:
from time import time
import math
import warnings
import pickle as pkl

In [2]:
import numpy as np
import pandas as pd

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

In [4]:
from skopt.space import Real, Integer
from skopt import gp_minimize

In [5]:
import import_ipynb
from ecosystem_classifier import EcosystemClassifier

importing Jupyter notebook from ecosystem_classifier.ipynb


In [6]:
timer_start = time()

We begin by importing the preprocessed DataFrame containing the dataset (see **data_preprocessing.ipynb**).

In [7]:
df = pd.read_pickle(filepath_or_buffer="dataframe.pkl")

In [8]:
_, nr_columns = df.shape

## Building the Variables Vector Space 
We first separate the independent variables $x$ from the target variable $y$, then split the dataset into training and test subsets. The split holds out 20% of the rows for testing and is stratified on $y$, so both subsets keep the same class proportions.

In [9]:
x = df.drop(labels="y",
            axis=1)
y = df.y

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=.2,
                                                    stratify=y)

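For each categorical hyperparameter we construct a dictionary that maps an integer index to one of its permissible values, and we collect these dictionaries under the hyperparameter names. We then define the vector space within which the optimization takes place: integer dimensions for the encoded categorical hyperparameters and for **max_depth**, and a real-valued dimension for **max_samples**.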

In [11]:
criterion = {0: "gini", 
             1: "entropy", 
             2: "log_loss"}
max_features = {0: "sqrt",
                1: "log2",
                2: None}
class_weight = {0: "balanced",
                1: None}

variables_dict = {"criterion": criterion,
                  "max_features": max_features, 
                  "class_weight": class_weight}

In [12]:
space = [Integer(low=0,
                 high=2,
                 name="criterion"),
        Integer(low=2, 
                high=nr_columns,
                name="max_depth"), 
        Integer(low=0, 
                high=2, 
                name="max_features"), 
        Integer(low=0, 
                high=1, 
                name="class_weight"),
        Real(low=.01, 
             high=1, 
             name="max_samples")]
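
The sketch below illustrates how a point of this space can be decoded back into keyword arguments for a Random Forest. It mirrors the loop used later inside the objective function; the helper name `decode_point` and the example point are assumptions made for illustration only.

```python
# Hypothetical helper mirroring the decoding loop of the objective function.
def decode_point(names, point, variables):
    """Map a raw point of the search space to classifier keyword arguments."""
    decoded = {}
    for name, value in zip(names, point):
        # Encoded dimensions are looked up in their dictionary;
        # numeric dimensions are passed through unchanged.
        decoded[name] = variables[name][value] if name in variables else value
    return decoded


variables_dict = {
    "criterion": {0: "gini", 1: "entropy", 2: "log_loss"},
    "max_features": {0: "sqrt", 1: "log2", 2: None},
    "class_weight": {0: "balanced", 1: None},
}
# Dimension names in the same order as in `space`.
names = ["criterion", "max_depth", "max_features", "class_weight", "max_samples"]

# Made-up point, as the optimizer might propose it.
params = decode_point(names, [1, 8, 2, 0, 0.5], variables_dict)
print(params)
```

Here the point `[1, 8, 2, 0, 0.5]` decodes to the entropy criterion, a depth of 8, no feature subsampling, balanced class weights, and half of the samples per tree.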

In [13]:
del df, nr_columns, x, y, class_weight, criterion, max_features

## Building the Ecosystem Model
The **build_ecosystem** function constructs an ensemble of Random Forest Classifiers (an Ecosystem Classifier) tuned through Bayesian optimization. It begins by defining an objective function that decodes a point of the vector space into hyperparameters, fits a Random Forest on a fresh stratified 80/20 split of the data it receives, and returns the negated macro-averaged **F1 Score** measured on the held-out part. The score is negated because **gp_minimize** minimizes its objective, so minimizing the loss maximizes the score. Macro averaging weights every class equally regardless of its frequency, which suits the main focus on the **True** outputs.
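
As a worked illustration of the quantity returned by the objective function in the code below, the following sketch computes a macro-averaged F1 score by hand and negates it. It is a plain-Python illustration on made-up labels, not scikit-learn's implementation; the function name `macro_f1` is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: two True samples, four False samples.
y_true = [True, True, False, False, False, False]
y_pred = [True, False, False, False, False, True]

score = macro_f1(y_true, y_pred)  # (0.75 + 0.5) / 2
loss = -score                     # value handed to the minimizer
print(score, loss)
```

The False class reaches an F1 of 0.75 and the True class 0.5, so the macro average is 0.625 even though the True class holds only a third of the samples.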

Each optimization run calls the objective function **nr_calls** times, training one Random Forest Classifier per call with hyperparameters drawn from the variables vector space, and the best-performing model of each run is stored. Finally, the runs are ranked by their best validation score and the top floor(sqrt(**nr_forests**)) models are kept. These scores serve as weights, and an instance of EcosystemClassifier is built, encapsulating the selected models and their weights.
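
The selection step can be sketched in isolation as follows. The scores are made up and the helper name `select_top` is hypothetical; the notebook itself performs the same ranking with a sorted DataFrame.

```python
import math


def select_top(scores):
    """Keep the floor(sqrt(n)) best-scoring entries, highest score first."""
    keep = math.floor(math.sqrt(len(scores)))
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:keep]


# Made-up best validation scores of nine optimization runs.
values = [0.71, 0.80, 0.65, 0.78, 0.74, 0.69, 0.82, 0.77, 0.73]
scores = {f"forest.{i}": value for i, value in enumerate(values)}

selected = select_top(scores)
print(selected)
```

With nine runs, three models survive. With the 100 runs used later in this notebook, the same rule keeps 10 models.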

In [14]:
def build_ecosystem(x: pd.DataFrame,
                    y: pd.Series,
                    space: list,
                    variables: dict,
                    nr_calls: int = 10,
                    nr_forests: int = 1
                    ) -> EcosystemClassifier:
    """
    Build an ecosystem of Random Forest Classifiers whose hyperparameters are tuned via Bayesian optimization using Gaussian Processes to maximize the macro F1 score.
    :param x: Variables.
    :param y: Target.
    :param space: Search space of the hyperparameters, given as a list of skopt dimensions.
    :param variables: Dictionary mapping encoded hyperparameter names to their {index: value} dictionaries.
    :param nr_calls: Nr. of calls of the objective function per optimization run.
    :param nr_forests: Nr. of optimization runs, each contributing one candidate Random Forest Classifier.
    :return: EcosystemClassifier built from the top floor(sqrt(nr_forests)) performing models.
    """
    
    def objective_function(params: list
                           ) -> float:
        """
        Loss function to be optimized by the Bayesian optimization.
        :param params: list of parameters belonging to the vector space 'space'.
        :return: Loss.
        """
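        # Decode the point: encoded dimensions are looked up in 'variables',
        # numeric dimensions are passed through unchanged.
        params_value = {}
        for count, param in enumerate(space):
            name = param.name
            params_value[name] = (variables[name][params[count]]
                                  if name in variables
                                  else params[count])

        # Draw a fresh stratified hold-out split for this evaluation.
        x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                            test_size=.2,
                                                            stratify=y)

        random_forest = RandomForestClassifier()
        random_forest.set_params(**params_value)
        random_forest.fit(x_train, y_train)
        # Keep every fitted model so the best one can be retrieved by index.
        model_list.append(random_forest)

        # gp_minimize minimizes, so return the negated macro F1 score.
        return -f1_score(y_true=y_test,
                         y_pred=random_forest.predict(x_test),
                         average="macro")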

    result_list = {}
    for count in range(nr_forests):
        # One independent optimization run per candidate forest.
        model_list = []
        result = gp_minimize(func=objective_function,
                             dimensions=space,
                             n_calls=nr_calls)
        # func_vals holds negated scores, so argmin selects the best model of the run.
        result_list[f"forest.{count}"] = (-result.fun,
                                          model_list[np.argmin(result.func_vals)],
                                          result)

    df = pd.DataFrame(data=result_list).T
    df.rename(columns = {0: "weight",
                         1: "model",
                         2: "report"},
              inplace=True)
    df.sort_values(by="weight",
                   ascending=False,
                   inplace=True)
    # Keep only the floor(sqrt(nr_forests)) best models.
    df = df.head(math.floor(math.sqrt(nr_forests)))
    
    return EcosystemClassifier(weights=df.weight,
                               models=df.model, 
                               reports=df.report)
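
The EcosystemClassifier itself is defined in **ecosystem_classifier.ipynb** and is not shown here, so how it combines the models is not visible in this notebook. Purely as an assumption, one common way to use such weights is weighted soft voting, sketched below with made-up probabilities; the function name `weighted_soft_vote` is hypothetical and the actual class may behave differently.

```python
def weighted_soft_vote(probabilities, weights):
    """Average per-model class probabilities, weighting each model by its score."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probabilities)) / total
        for k in range(n_classes)
    ]


# Made-up [P(False), P(True)] outputs of three models for a single sample.
probabilities = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
# Made-up weights, standing in for validation scores.
weights = [0.5, 0.25, 0.25]

combined = weighted_soft_vote(probabilities, weights)
print(combined)
```

Dividing by the sum of the weights keeps the combined values a valid probability distribution even when the weights do not sum to one.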

In [15]:
warnings.filterwarnings("ignore",
                        category=UserWarning)

In [16]:
ecosystem = build_ecosystem(x=x_train,
                            y=y_train, 
                            space=space,
                            variables=variables_dict,
                            nr_calls=60,
                            nr_forests=100)

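Finally, we evaluate the Ecosystem Classifier on the held-out test set and store the results in binary format: the true labels with the predicted labels and probabilities, the feature importances, and the loss values recorded during each optimization run.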

In [17]:
df = pd.DataFrame(columns=["y_true",
                           "y_pred",
                           "y_prob"])
df.y_true, df.y_pred, df.y_prob = y_test.values, ecosystem.predict(x_test), ecosystem.predict_proba(x_test)

test = {"df": df,
        "feature_importance": ecosystem.feature_importance(feature_names=x_train.columns), 
        "loss_eval": np.array([report.func_vals for report in ecosystem.reports])}

with open(file="test_results.pkl",
          mode="wb") as results:
    pkl.dump(obj=test,
             file=results)
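
For reference, a pickled dictionary of this kind can be read back with `pickle.load`. The round trip below is a minimal sketch on a stand-in dictionary written to a temporary directory, with plain lists in place of the DataFrame and arrays; loading the real file additionally requires every class referenced by its contents to be importable.

```python
import os
import pickle as pkl
import tempfile

# Stand-in for the `test` dictionary, using made-up values.
results = {"df": {"y_true": [1, 0], "y_pred": [1, 1]},
           "loss_eval": [[-0.70, -0.75], [-0.68, -0.72]]}

# Write to a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "test_results.pkl")
with open(path, mode="wb") as handle:
    pkl.dump(obj=results, file=handle)

# Read the dictionary back in binary mode.
with open(path, mode="rb") as handle:
    restored = pkl.load(handle)

print(restored["loss_eval"])
```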

In [18]:
print(f"Total running time of the script: {(time() - timer_start) / 3600: .2f}h")

Total running time of the script:  4.37h
