# **PV056 project**

**Theme:** XGBoost with random forest base learner

**Description:**
Use XGBoost with the random forest as a base learner.

**Supervisors:**
- Bc. Terézia Mikulová, učo [483657](https://is.muni.cz/auth/person/483657) (supervisor)
- doc. RNDr. Lubomír Popelínský, Ph.D., učo [1945](https://is.muni.cz/auth/person/1945) (consultant)


**Students:**
- Josef Karas, učo [511737](https://is.muni.cz/auth/person/511737)
- Filip Chladek, učo [514298](https://is.muni.cz/auth/person/514298)
- Bc. Martin Beňa, učo [485152](https://is.muni.cz/auth/person/485152)




# **First, copy this notebook to your drive.**

To run this notebook, you need the modified auto-sklearn library and the other modules provided at the link below. The following three cells download and unpack the archive, install the required libraries, and remove files that are no longer needed.
\
**Run these cells only once OR if you wish to start again.**

In [1]:
%%capture
!gdown --fuzzy https://drive.google.com/file/d/17zt3tN_xUB_GDHbtzmRG78Ds3Vd9vugX/view?usp=sharing

In [2]:
import shutil
import os

shutil.unpack_archive('/content/project_tools.zip', '/content', 'zip')
os.remove('/content/project_tools.zip')

In [3]:
%%capture
!pip install -r /content/requirements.txt

After installing the pip requirements, **you need to restart the session to apply the changes made by the installed libraries.**

You can restart the session from the GUI:
\
**Runtime** -> **Restart session**

# Common settings

Core functions are implemented in and imported from the module ***utils.py***.

*   Params.BASE_LEARNER - selects the base learner from two variants (Random forest and Decision tree).
*   MEM - maximum amount of allocated memory.






## Additional info
The outputs of trained models and their results are stored in the results folder in **.pkl** format.

The ```n_jobs``` parameter is set to -1 everywhere in this notebook. However, if you prefer to train your own Auto-sklearn and Bagging models instead of using the provided **.pkl** files, set the ```n_jobs``` parameter for XGBoost to 1. Otherwise, each parallel worker may start its own full set of XGBoost threads, and the resulting oversubscription can degrade performance.
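
As a rough illustration of the thread count involved, the sketch below estimates a worst-case upper bound when a parallel outer loop (for example, bagging) wraps a parallel inner model. The helpers `resolve_jobs` and `worst_case_threads` are hypothetical and not part of ***utils.py***. The sketch only models the convention that `-1` means "all CPUs", and it assumes that every worker really starts as many threads as its own `n_jobs` allows.

```python
def resolve_jobs(n_jobs, cpu_count):
    """Translate an n_jobs value into a worker count (only -1 and positive values are modelled)."""
    return cpu_count if n_jobs == -1 else n_jobs


def worst_case_threads(outer_jobs, inner_jobs, n_tasks, cpu_count):
    """Upper bound on concurrent threads for nested parallelism."""
    # The outer loop never needs more workers than it has tasks
    outer = min(resolve_jobs(outer_jobs, cpu_count), n_tasks)
    # Assumption: each outer worker runs one inner model with its own threads
    inner = resolve_jobs(inner_jobs, cpu_count)
    return outer * inner


# Hypothetical machine with 8 CPUs and 11 bagged estimators
print(worst_case_threads(-1, -1, n_tasks=11, cpu_count=8))
print(worst_case_threads(-1, 1, n_tasks=11, cpu_count=8))
```

Under these assumptions the bound is 64 threads when both levels use `-1`, and 8 threads when the inner model uses `n_jobs=1`.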


In [None]:
from utils import (
    Task, LearnerType, Params,
    get_AutoSklearnClassifier, get_XGBModel,
    AutoSklearnClassifier, XGBClassifier,
    dump_pkl, dump_json, load_pkl, load_json,
    sorted_class_count
)
import inspect

## Dataset loading

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
np.random.seed(Params.SEED)

IDS = [
    ("credit-g", 31),
    ("spambase", 44),
    ("electricity", 151),
    ("pc4", 1049),
    ("pc3", 1050),
    ("JM1", 1053),
    ("kc1", 1067),
    ("pc1", 1068),
    ("bank-marketing/bank-marketing-full", 1461),
    ("madelon", 1485),
    ("ozone-level-8hr", 1487),
    ("phoneme", 1489),
    ("qsar-biodeg", 1494),
    ("churn", 40701),
]


DATASETS = []
for dataset_name, dataset_id in IDS:
    data = fetch_openml(data_id=dataset_id, parser="auto", as_frame=True)
    assert len(data.target_names) == 1
    target = data.target_names[0]

    if data.frame.shape[0] < 1000:
        continue

    if len(data.frame[target].unique()) != 2:
        continue

    X = data.frame.drop(columns=[target])
    (l1, l2), (c1, c2) = np.unique(data.frame[target], return_counts=True)
    (c1, l1), (c2, l2) = sorted(((c1, l1), (c2, l2)))
    # NOTE: convention that minority label will always be 1
    y = data.frame[target] == l1
    DATASETS.append((dataset_name, X, y))


[name for name, _, _ in DATASETS]

## Categorical features

XGBoost's native handling of categorical features is experimental and requires ```enable_categorical=True```. We therefore use one-hot encoding instead, so that the pipeline does not depend on experimental features.

In [None]:
# OneHotEncoding

for i, (name, X, y) in enumerate(DATASETS):
    categorical_columns = X.select_dtypes(include=['category']).columns
    if len(categorical_columns) != 0:
        preprocessor = ColumnTransformer(
            transformers=[
                ('cat', OneHotEncoder(sparse_output=False, handle_unknown="error"), categorical_columns)
            ],
            remainder='passthrough'  # Pass through numerical columns without any transformation
        )
        X = pd.DataFrame(preprocessor.fit_transform(X, y))
        DATASETS[i] = name, X, y

## Datasets info

In [None]:
from tabulate import tabulate

headers = ["dataset_name", "IR", "#minority", "#majority", "#instances", "#features", "#int features", "#float features", "#category features", "#NaN"]
table = []
for name, X, y in DATASETS:
    c1, c2 = sorted_class_count(y)
    int_features = len(X.select_dtypes(include=['int64']).columns)
    float_features = len(X.select_dtypes(include=['float64']).columns)
    category_features = len(X.select_dtypes(include=['category']).columns)
    table.append([name, round(c1/c2, 2), c1, c2, X.shape[0], X.shape[1], int_features, float_features, category_features, X.isna().sum().sum()])
table.sort(key=lambda i: i[1])

print(tabulate(table, headers))

## Utils for validation

The following cell defines a utility function for validating machine learning models using stratified k-fold cross-validation. It supports models from AutoSklearnClassifier and XGBClassifier and evaluates them across various metrics.

### Key Points:

**Imports:**
Essential libraries for model validation, timing, and performance metrics.

**Function Parameters:**
*   *X:* Features
*   *y:* Target
*   *model:* Model to evaluate
*   *task:* Task type
*   *name:* Dataset name
*   *ratio:* Task-specific level (imbalance ratio, noise amount, or hidden fraction; `None` when not applicable), used to identify the stored results
*   *extra:* Additional pseudo-labelled training data for the semi-supervised task

**Workflow:**
1.   *Model Handling:* Extracts the core model and initializes relevant structures.
2.   *K-Fold Split:* Performs stratified k-fold cross-validation.
3.   *Training & Testing:* Splits data, applies preprocessing, trains the model, and makes predictions.
4.   *Metrics Calculation:* Computes metrics like ROC AUC, confusion matrix, precision, recall, F1 score, and F-beta score.
5.   *Results Storage:* Collects the results of every fold and saves them to a JSON file.

Every model is therefore evaluated on the same folds and with the same metrics, which keeps the results comparable.
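
As a worked example of the precision, recall, and F-beta metrics listed above, the sketch below computes them from raw confusion-matrix counts. The counts are made up, and `fbeta_from_counts` is a hypothetical helper written only for illustration; the notebook itself relies on the scikit-learn metric functions.

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        # Same convention as zero_division=0.0
        return 0.0
    beta_sq = beta ** 2
    # beta > 1 weights recall more heavily than precision
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Made-up counts: precision = 30/40 = 0.75, recall = 30/50 = 0.60
f1 = fbeta_from_counts(tp=30, fp=10, fn=20, beta=1)
f2 = fbeta_from_counts(tp=30, fp=10, fn=20, beta=2)
print(round(f1, 4), round(f2, 4))
```

Because recall is lower than precision in this example, the recall-weighted F2 score (0.625) is lower than F1 (about 0.667).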

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from time import perf_counter
import os

from sklearn.metrics import (
    confusion_matrix,
    recall_score,
    precision_score,
    f1_score,
    fbeta_score,
    roc_auc_score,
    roc_curve
)

def validate(X, y, model, task, name, ratio, extra=None):
    y = pd.Series(y)

    if isinstance(model, AutoSklearnClassifier):
        model = model.get_models_with_weights()[0][1]
        _, clf = model.steps.pop()
    elif isinstance(model, XGBClassifier):
        clf = model
        class Model: pass
        model = Model()
        model.steps = []
    else:
        raise TypeError(f"Unsupported model type: {type(model).__name__}")

    res = load_json(Params.BASE_LEARNER, task, name, ratio)
    skf = StratifiedKFold(n_splits=Params.K_FOLDS, shuffle=True, random_state=Params.SEED)
    for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
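        # Skip folds that are already stored; results are nested as res[threshold][fold]
        if any(str(fold) in folds for folds in res.values()):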
            continue
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        if task == Task.SEMI_SUPERVISED:
            X_extra_train, y_extra_train = extra
            X_train = np.concatenate([X_train, X_extra_train])
            y_train = np.concatenate([y_train, y_extra_train])

        for _, step in model.steps:
            if hasattr(step, "fit_resample"):
                X_train, y_train = step.fit_resample(X_train, y_train)
            elif hasattr(step, "fit") and hasattr(step, "transform"):
                preprocessor = step.fit(X_train, y_train)
                X_train = preprocessor.transform(X_train)
                X_test = preprocessor.transform(X_test)
            else:
                assert False, f"This step is not a transformer or resampler: {step}."

        train_time = perf_counter()
        clf = clf.fit(X_train, y_train)
        train_time = perf_counter() - train_time

        inference_time = perf_counter()
        y_prob = clf.predict_proba(X_test)[:, 1]
        inference_time = perf_counter() - inference_time

        if task in (Task.DATA_IMBALANCE, Task.FEATURE_INADEQUACY):
            y_test_threshold, y_test, y_prob_threshold, y_prob = train_test_split(
                y_test, y_prob, test_size=0.5, random_state=Params.SEED, stratify=y_test
            )
            fpr, tpr, thresholds = roc_curve(y_test_threshold, y_prob_threshold)
            threshold_selection = {
                "(3*tpr*(1-fpr)/(2*(1-fpr)+tpr)": thresholds[(3*tpr*(1-fpr)/(2*(1-fpr)+tpr)).argmax()] <= y_prob,
                "tpr-fpr": thresholds[(tpr-fpr).argmax()] <= y_prob,
            }
        else:
            threshold_selection = {
                "None": clf.predict(X_test)
            }

        for threshold, y_pred in threshold_selection.items():
            if threshold not in res:
                res[threshold] = {}
            res[threshold][fold] = {
                "cpu_count": len(os.sched_getaffinity(0)),
                "train_time": train_time,
                "inference_time": inference_time,

                "auc_roc": roc_auc_score(y_test, y_prob),
                "confusion_matrix": [int(n) for n in confusion_matrix(y_test, y_pred).ravel()],

                "minority_precision": precision_score(y_test, y_pred, pos_label=1, average='binary', zero_division=0.0),
                "majority_precision": precision_score(y_test, y_pred, pos_label=0, average='binary', zero_division=0.0),

                "minority_recall": recall_score(y_test, y_pred, pos_label=1, average='binary', zero_division=0.0),
                "majority_recall": recall_score(y_test, y_pred, pos_label=0, average='binary', zero_division=0.0),

                "minority_f1": f1_score(y_test, y_pred, pos_label=1, average='binary', zero_division=0.0),
                "majority_f1": f1_score(y_test, y_pred, pos_label=0, average='binary', zero_division=0.0),
                "macro_f1": f1_score(y_test, y_pred, average='macro', zero_division=0.0),

                "minority_f2": fbeta_score(y_test, y_pred, beta=2, pos_label=1, average='binary', zero_division=0.0),
                "majority_f2": fbeta_score(y_test, y_pred, beta=2, pos_label=0, average='binary', zero_division=0.0),
                "macro_f2": fbeta_score(y_test, y_pred, beta=2, average='macro', zero_division=0.0),
            }

    dump_json(res, Params.BASE_LEARNER, task, name, ratio)
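
For the data-imbalance and feature-inadequacy tasks, `validate` does not rely on the default 0.5 cut-off: it holds out half of each test fold and picks the threshold that maximises a criterion on the ROC curve, one of which is `tpr - fpr` (known as Youden's J statistic). The dependency-free sketch below illustrates that criterion on made-up scores. `youden_threshold` is a hypothetical helper, and unlike `roc_curve` it simply tries every distinct score as a threshold; it assumes both classes are present.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximising J = TPR - FPR over all distinct scores."""
    positives = sum(1 for t in y_true if t)
    negatives = len(y_true) - positives
    best_j, best_thr = float("-inf"), None
    for thr in sorted(set(y_score), reverse=True):
        # A sample is predicted positive when its score reaches the threshold
        tp = sum(1 for t, s in zip(y_true, y_score) if t and s >= thr)
        fp = sum(1 for t, s in zip(y_true, y_score) if not t and s >= thr)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_thr = j, thr
    return best_thr, best_j


# Made-up labels (1 = minority class) and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

thr, j = youden_threshold(y_true, y_score)
y_pred = [s >= thr for s in y_score]
print(thr, round(j, 2), y_pred)
```

Here the selected threshold is 0.4 rather than 0.5, which recovers all three minority samples at the cost of one false positive.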



# Data imbalance

In [None]:
from imblearn.datasets import make_imbalance

IMBALANCED_DATSETS = []
ratios = [0.5, 0.25] + [r/100 for r in range(1, 21)]
for name, X, y in DATASETS:
    X, y = X.copy(), y.copy()
    c1, c2 = sorted_class_count(y)
    IMBALANCED_DATSETS.append((name, c1/c2, X, y))
    for ratio in sorted(ratios, reverse=True):
        c1, c2 = sorted_class_count(y)
        new_minority_count = int(c2 * ratio)
        if c1 < new_minority_count:
            continue
        X, y = make_imbalance(X, y, sampling_strategy={0: c2, 1: new_minority_count}, random_state=Params.SEED)
        IMBALANCED_DATSETS.append((name, ratio, X, y))

for _, ratio, _, y in IMBALANCED_DATSETS:
    c1, c2 = sorted_class_count(y)
    assert (c1 / c2 - ratio) < 0.001

[(name, ratio) for name, ratio, _, _ in IMBALANCED_DATSETS]
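
The loop above downsamples the minority class step by step: for each target ratio, from largest to smallest, the new minority count is `int(c2 * ratio)`, and any ratio that would require more minority samples than are still available is skipped. A minimal sketch of that arithmetic, using made-up class counts and a hypothetical helper `minority_schedule`:

```python
def minority_schedule(minority, majority, ratios):
    """List the (ratio, minority_count) pairs that the downsampling would produce."""
    schedule = []
    for ratio in sorted(ratios, reverse=True):
        target = int(majority * ratio)
        if minority < target:
            # Not enough minority samples to reach this ratio
            continue
        schedule.append((ratio, target))
        # Later ratios start from the already reduced minority class
        minority = target
    return schedule


# Made-up dataset with 300 minority and 700 majority samples
print(minority_schedule(300, 700, [0.5, 0.25, 0.1]))
```

In this example the ratio 0.5 is skipped because it would need 350 minority samples, while 0.25 and 0.1 yield 175 and 70 samples respectively.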

In [None]:
for name, ratio, X, y in IMBALANCED_DATSETS:
    try:
        print(name, ratio)
        try:
            model = load_pkl(Params.BASE_LEARNER, Task.DATA_IMBALANCE, name, ratio)
        except Exception:
            # No cached model available: train a new one and cache it
            model = get_AutoSklearnClassifier(X, y, Task.DATA_IMBALANCE)
            dump_pkl(model, Params.BASE_LEARNER, Task.DATA_IMBALANCE, name, ratio)
        validate(X, y, model, Task.DATA_IMBALANCE, name, ratio)
    except Exception as error:
        # Report the failure instead of silently skipping the dataset
        print(f"Skipping {name} ({ratio}): {error!r}")

# Noisy data

In [None]:
import numpy as np

NOISY_DATASETS = []
noise_amount = [a / 100 for a in range(1, 11)]

for name, X, y in DATASETS:
    X, y = X.copy(), y.copy()

    NOISY_DATASETS.append((name, 0, X.copy(), y.copy()))
    indices_left = [X.index.copy() for _ in range(X.shape[1])]
    for noise in sorted(noise_amount):
        noise_to_add = round(X.shape[0] * noise) - (X.shape[0] - len(indices_left[0]))
        for i, feature in enumerate(X.columns):
            noise_indices = np.random.choice(indices_left[i], noise_to_add, replace=False)
            if X[feature].dtype == "float64":
                X.loc[noise_indices, feature] = np.random.uniform(X[feature].min(), X[feature].max(), noise_to_add)
            elif X[feature].dtype == "int64":
                X.loc[noise_indices, feature] = np.random.randint(X[feature].min(), X[feature].max()+1, noise_to_add)
            elif X[feature].dtype == "category":
                X.loc[noise_indices, feature] = np.random.choice(X[feature].unique(), noise_to_add)
            else:
                assert False, X[feature].dtype
            indices_left[i] = indices_left[i].drop(noise_indices)

        NOISY_DATASETS.append((name, noise, X.copy(), y.copy()))


for i in range(1, len(NOISY_DATASETS)):
    name, noise, X2, y2 = NOISY_DATASETS[i]
    _, _, X, y = [
        (name2, noise2, X2, y2)
        for name2, noise2, X2, y2 in NOISY_DATASETS
        if name2 == name and noise2 == 0
    ][0]
    for feature in X.columns:
        assert len(X[feature]) == len(X2[feature])
        sm = sum(a != a2 for a, a2 in zip(X[feature], X2[feature]))
        diff = 0.075 if len(X[feature].unique()) <= 10 else 0.015
        assert abs(sm / len(X[feature]) - noise) < diff


[(name, noise) for name, noise, _, _ in NOISY_DATASETS]
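
The noise levels above are cumulative: each level keeps the values corrupted at the previous level and corrupts only as many additional rows per feature as are needed to reach the new total, that is `round(N * noise)` minus the rows already corrupted. A minimal sketch of that bookkeeping, with a made-up row count and a hypothetical helper `noise_increments`:

```python
def noise_increments(n_rows, levels):
    """Number of newly corrupted rows per feature at each cumulative noise level."""
    increments = []
    already_noisy = 0
    for level in sorted(levels):
        to_add = round(n_rows * level) - already_noisy
        increments.append(to_add)
        already_noisy += to_add
    return increments


# Made-up dataset with 1000 rows and noise levels of 1 %, 2 % and 5 %
print(noise_increments(1000, [0.01, 0.02, 0.05]))
```

For 1000 rows this gives 10, 10, and 30 newly corrupted rows, so the 5 % level contains 50 noisy rows per feature in total.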

In [None]:
class NoiseRemover:
    def fit_resample(self, X, y):
        model = get_XGBModel(X, y, "DecisionTree", LearnerType.REGRESSION)
        model = model.fit(X, y)
        X_transformed, y_transformed = X, model.predict(X)
        return X_transformed, 0.5 <= y_transformed


for name, noise, X, y in NOISY_DATASETS:
    try:
        print(name, noise)
        X_transformed, y_transformed = NoiseRemover().fit_resample(X, y)
        model = get_XGBModel(X_transformed, y_transformed, Params.BASE_LEARNER, LearnerType.CLASSIFICATION)
        validate(X_transformed, y_transformed, model, Task.NOISY_DATA, name, noise)
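    except Exception as error:
        # Report the failure instead of silently skipping the dataset
        print(f"Skipping {name} ({noise}): {error!r}")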

# Semi-supervised

In [None]:
import numpy as np
import pandas as pd

HIDDEN_DATASETS = []
hidden_amount = [a / 100 for a in range(1, 11)]

for name, X, y in DATASETS:
    X, y = X.copy(), y.copy()
    N = X.shape[0]

    X_hidden = pd.DataFrame(np.empty((0, X.shape[1])), columns=X.columns)

    HIDDEN_DATASETS.append((name, 0, X.copy(), X_hidden.copy(), y.copy()))
    indices_left = X.index
    for hidden in sorted(hidden_amount):
        indices_to_hide = np.random.choice(indices_left, round(N * hidden) - (N - len(indices_left)), replace=False)
        X_hidden = pd.concat([X_hidden, X.loc[indices_to_hide]], ignore_index=True)
        indices_left = indices_left.drop(indices_to_hide)
        X = X.drop(indices_to_hide)
        y = y.drop(indices_to_hide)

        HIDDEN_DATASETS.append((name, hidden, X.copy(), X_hidden.copy(), y.copy()))

for name, hidden, X, X_hidden, y in HIDDEN_DATASETS:
    assert abs(len(X_hidden) / (len(X)+len(X_hidden)) - hidden) < 0.01, f"{len(X_hidden)}, {len(X)}, {hidden}"
    assert len(X) == len(y)
    assert y.shape == (len(y), )
    assert X.shape[1] == X_hidden.shape[1], f"{X.shape}, {X_hidden.shape}"
    assert all(a==b for a, b in zip(X.columns, X_hidden.columns)), f"{X.columns}, {X_hidden.columns}"
    for i in range(len(X_hidden)):
        assert i in X_hidden.index, f"{i}, {X_hidden.index}"

[(name, hidden) for name, hidden, _, _, _ in HIDDEN_DATASETS]

In [None]:
from sklearn.ensemble import BaggingClassifier
import pandas as pd

N_ESTIMATORS = 11
ITERS = 10
for name, hidden, X, X_hidden, y in HIDDEN_DATASETS:
    print(name, hidden)
    X_extra_train = pd.DataFrame(np.empty((0, X.shape[1])), columns=X.columns)
    y_extra_train = pd.Series(np.empty(0))
    try:
        X, y, X_extra_train, y_extra_train = load_pkl(
            Params.BASE_LEARNER, Task.SEMI_SUPERVISED, name, hidden
        )
    except Exception:
        # No cached pseudo-labels available: run the self-training loop
        num_indices_to_select = (len(X_hidden) + ITERS - 1) // ITERS
        while len(X_hidden) != 0:
            model = BaggingClassifier(
                estimator=get_XGBModel(X, y, Params.BASE_LEARNER, LearnerType.CLASSIFICATION),
                n_estimators = N_ESTIMATORS,
                max_samples = 0.5,
                max_features = 0.5,
                bootstrap = True,
                bootstrap_features = True,
                n_jobs=-1,
                random_state = Params.SEED
            )
            X_train = pd.concat([X, X_extra_train], ignore_index=True)
            if len(y_extra_train) != 0:
                y_train = pd.concat([y, y_extra_train], ignore_index=True)
            else:
                y_train = y
            model = model.fit(X_train, y_train.to_numpy().ravel())
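            # Column 1 holds the probability of the positive (minority) class, so 0.5 <= y_prob yields its label
            y_prob = model.predict_proba(X_hidden)[:, 1].flatten()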
            max_indices = np.argsort(np.maximum(y_prob, 1 - y_prob))[::-1][:num_indices_to_select]
            X_extra_train = pd.concat([X_extra_train, pd.DataFrame(X_hidden.iloc[max_indices])], ignore_index=True)
            if len(y_extra_train) != 0:
                y_extra_train = pd.concat([y_extra_train, pd.Series(0.5 <= y_prob[max_indices])], ignore_index=True)
            else:
                y_extra_train = pd.Series(0.5 <= y_prob[max_indices])
            X_hidden = X_hidden.drop(X_hidden.index[max_indices])
        dump_pkl((X, y, X_extra_train, y_extra_train), Params.BASE_LEARNER, Task.SEMI_SUPERVISED, name, hidden)

    model = get_XGBModel(X, y, Params.BASE_LEARNER, LearnerType.CLASSIFICATION)
    validate(X, y, model, Task.SEMI_SUPERVISED, name, hidden, extra=(X_extra_train, y_extra_train))
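
The self-training loop above moves the hidden samples into the training set in at most `ITERS` batches, each time taking the predictions that the bagging ensemble is most confident about and using the predicted class as a pseudo-label. A minimal sketch of the selection step with made-up probabilities follows; `batch_size` and `select_most_confident` are hypothetical helpers, and `prob_positive` is assumed to hold the predicted probability of the positive class.

```python
def batch_size(n_hidden, iters):
    """Ceiling division: samples per batch so that all are used within `iters` rounds."""
    return (n_hidden + iters - 1) // iters


def select_most_confident(prob_positive, k):
    """Pick the k most confident predictions and their pseudo-labels."""
    # Confidence is the probability of whichever class is predicted
    confidence = [max(p, 1 - p) for p in prob_positive]
    order = sorted(range(len(prob_positive)), key=lambda i: confidence[i], reverse=True)
    chosen = order[:k]
    labels = [prob_positive[i] >= 0.5 for i in chosen]
    return chosen, labels


# Made-up ensemble output for five hidden samples
probs = [0.95, 0.55, 0.10, 0.48, 0.80]
chosen, labels = select_most_confident(probs, k=2)
print(batch_size(25, 10), chosen, labels)
```

In this example, samples 0 and 2 are selected because the ensemble is most certain about them, and they receive the pseudo-labels `True` and `False`.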

# Inadequate features

In [None]:
for name, X, y in DATASETS:
    print(name)
    try:
        model = load_pkl(Params.BASE_LEARNER, Task.FEATURE_INADEQUACY, name, None)
    except Exception:
        # No cached model available: train a new one and cache it
        model = get_AutoSklearnClassifier(X, y, task=Task.FEATURE_INADEQUACY)
        dump_pkl(model, Params.BASE_LEARNER, Task.FEATURE_INADEQUACY, name, None)
    validate(X, y, model, Task.FEATURE_INADEQUACY, name, None)