## Naive Bayes Model Selection

We have implemented several different variants of the Naive Bayes classifier, each with its own assumptions and characteristics. To determine which model performs best for our specific dataset, we will conduct a model selection process using some resampling technique.

### Choice of Resampling Technique

By default, we will use k-fold cross-validation for model selection and hyperparameter tuning. However, in some algorithms, training is extremely costly, and in such cases, because we have a large dataset, we may opt for a simple train-validation split instead. The choice of resampling technique will be made based on the computational cost of training each model.

Note that in all cases, the test set will remain untouched until the final evaluation phase.

### Model Space

We will consider the following Naive Bayes variants for model selection:
1. Gaussian Naive Bayes Variants:
   1. Gaussian Naive Bayes (As a baseline model): The standard Gaussian Naive Bayes model that assumes features are normally distributed.
      1. Dropping rows with missing values
      2. Dropping features with missing values
   2. Robust Gaussian Naive Bayes: A variant of the Gaussian Naive Bayes that uses the missingness of a feature as an additional categorical feature. (`laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]`)
   3. Categorical-Aware Robust Gaussian Naive Bayes with: A variant of the Robust Gaussian Naive Bayes that treats continuous features as independent of one another (as does the standard Naive Bayes model) but dependent on the categorical features (not the missingness indicators). (`laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]`)
2. Histogram-Based Naive Bayes Variants:
   1. Histogram Naive Bayes: A Naive Bayes model that uses histograms to estimate the probability density functions of continuous features.
      1. Dropping rows with missing values (`bins in range(10, 1000, 10) + [None]`)
      2. Dropping features with missing values (`bins in range(10, 1000, 10) + [None]`)
   2. Robust Histogram Naive Bayes: A variant of the Histogram Naive Bayes that incorporates missingness indicators as additional categorical features. (`bins in range(10, 1000, 10) + [None]`, `laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]`)
   3. Categorical-Aware Robust Histogram Naive Bayes: A variant of the Robust Histogram Naive Bayes that treats continuous features as independent of one another but dependent on the categorical features. (`bins in range(10, 1000, 10) + [None]`, `laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]`)
3. Kernel Density Estimation (KDE) Based Naive Bayes Variants:
   1. Gaussian KDE Naive Bayes: A Naive Bayes model that uses Kernel Density Estimation with a Gaussian Kernel to estimate the probability density functions of continuous features.
      1. Dropping rows with missing values (`bandwidth in [0.05, 0.1, 0.5, 1, 2, None]`, `num_points in [100, 1000, 3000, 5000]`, `range_padding in [0, 0.1, 0.5]`)
      2. Dropping features with missing values (`bandwidth in [0.05, 0.1, 0.5, 1, 2, None]`, `num_points in [100, 1000, 3000, 5000]`, `range_padding in [0, 0.1, 0.5]`)
   2. Robust Gaussian KDE Naive Bayes: A variant of the KDE Naive Bayes that incorporates missingness indicators as additional categorical features. (`bandwidth in [0.05, 0.1, 0.5, 1, 2, None]`, `num_points in [100, 1000, 3000, 5000]`, `range_padding in [0, 0.1, 0.5]`, `laplace_smoothing in <about_the_best_as_in_previous_models> + [0]`)
   3. Categorical-Aware Robust Gaussian KDE Naive Bayes: A variant of the Robust KDE Naive Bayes that treats continuous features as independent of one another but dependent on the categorical features. (`bandwidth in [0.05, 0.1, 0.5, 1, 2, None]`, `num_points in [100, 1000, 3000, 5000]`, `range_padding in [0, 0.1, 0.5]`, `laplace_smoothing in <about_the_best_as_in_previous_models> + [0]`)
   > **Note:** KDE models are by design lazily evaluated, meaning that training consists on just storing all the training data, and actual density estimation is performed at prediction time. Therefore, they are extremely costly to evaluate. For that reason, an **approximation** is used instead, where a large-enough sample of the feature axis is drawn and the density is estimated at those points only, using them to approximate the density at prediction time via 1-nearest-neighbor.
4. Box-Cox Transformed Gaussian Naive Bayes Variants:
   1. Box-Cox Transformed Gaussian Naive Bayes: A Naive Bayes model that applies a Box-Cox transformation to continuous features before modeling them with Gaussian distributions.
      1. Dropping rows with missing values
      2. Dropping features with missing values
   2. Robust Box-Cox Transformed Gaussian Naive Bayes: A variant of the Box-Cox Transformed Gaussian Naive Bayes that incorporates missingness indicators as additional categorical features. (`laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]`)
   3. Categorical-Aware Robust Box-Cox Transformed Gaussian Naive Bayes: A variant of the Robust Box-Cox Transformed Gaussian Naive Bayes that treats continuous features as independent of one another but dependent on the categorical features. (`laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]`)


> **Note:** In all cases, categorical features are modeled using a multi-class generalization of the Bernoulli distribution ($f(x=i|\vec{p})=p_i$), while continuous features are modeled using different Density Estimation techniques (Gaussian, KDE, or Histogram).

In [1]:
# Load the data and prepare the clean variables.
import pandas as pd
import numpy as np

df = pd.read_csv("../data/processed/train_no_preprocess.csv")
df.head()

Unnamed: 0,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,...,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label
0,90.901,85.57,75.316,40.945,,,,1.869,3.111,135.816,...,1,42.605,-1.962,-2.519,,,,42.605,0.589395,b
1,133.477,3.669,99.223,227.121,2.243,365.016,2.278,1.223,3.539,440.917,...,2,173.249,-0.759,2.545,87.317,-3.002,-2.594,260.566,0.000461,s
2,115.111,26.919,77.658,50.266,,,,2.691,3.655,133.495,...,1,53.895,0.685,-0.613,,,,53.895,0.623627,b
3,,83.642,74.642,25.176,,,,2.646,25.176,53.813,...,0,,,,,,,0.0,1.772114,b
4,81.958,7.074,46.894,90.979,0.952,83.883,-0.226,1.626,17.517,174.686,...,2,80.028,-0.456,-1.902,33.561,0.496,-0.555,113.589,0.41876,b


In [2]:
X = df.drop(columns=["Label", "Weight"])
y = df["Label"]
weights = df["Weight"]
categorical_features = df.columns.get_indexer(["PRI_jet_num"]).tolist()

df_drop_rows = df[~X.isna().any(axis=1)].reset_index(drop=True)
X_drop_rows = df_drop_rows.drop(columns=["Label", "Weight"])
y_drop_rows = df_drop_rows["Label"]
weights_drop_rows = df_drop_rows["Weight"]
categorical_features_drop_rows = df_drop_rows.columns.get_indexer(["PRI_jet_num"]).tolist()

df_drop_cols = df.drop(columns=X.columns[X.isna().any(axis=0)])
X_drop_cols = df_drop_cols.drop(columns=["Label", "Weight"])
y_drop_cols = df_drop_cols["Label"]
weights_drop_cols = df_drop_cols["Weight"]
categorical_features_drop_cols = df_drop_cols.columns.get_indexer(["PRI_jet_num"]).tolist()

# Convert them to numpy and store them in a datasets dictionary for easy reference.
datasets = {
    "original": (X.to_numpy(), y.to_numpy(), weights.to_numpy(), categorical_features),
    "drop-rows": (
        X_drop_rows.to_numpy(),
        y_drop_rows.to_numpy(),
        weights_drop_rows.to_numpy(),
        categorical_features_drop_rows,
    ),
    "drop-columns": (
        X_drop_cols.to_numpy(),
        y_drop_cols.to_numpy(),
        weights_drop_cols.to_numpy(),
        categorical_features_drop_cols,
    ),
}
del (
    df,
    X,
    y,
    df_drop_rows,
    X_drop_rows,
    y_drop_rows,
    df_drop_cols,
    X_drop_cols,
    y_drop_cols,
    categorical_features,
    categorical_features_drop_rows,
    categorical_features_drop_cols,
)  # Free memory

In [3]:
from typing import Dict, List, Optional, TypedDict, Literal, NewType, Union
from pydantic import BaseModel


class Experiment(BaseModel):
    model_class: Literal["BespokeNB", "CategoricalAwareBespokeNB"]
    categorical_estimator_class: Literal["CategoricalEstimator", "RobustCategoricalEstimator"]
    continuous_estimator_class: Literal[
        "GaussianEstimator",
        "RobustGaussianEstimator",
        "HistogramEstimator",
        "RobustHistogramEstimator",
        "EagerGaussianKDEstimator",
        "RobustEagerGaussianKDEstimator",
        "BoxCoxGaussianEstimator",
        "RobustBoxCoxGaussianEstimator",
    ]
    dataset: Literal["original", "drop-rows", "drop-columns"]
    num_folds: int
    categorical_estimator_params: Dict[str, Union[Optional[float], Optional[int]]] = {}
    continuous_estimator_params: Dict[str, Union[Optional[float], Optional[int]]] = {}


class ExperimentResult(Experiment):
    fold_index: int
    accuracy: float
    b_recall: float
    b_precision: float
    b_f1_score: float
    s_recall: float
    s_precision: float
    s_f1_score: float
    ams_score: float

Let's setup the experiments:

In [4]:
experiments: List[Experiment] = []

#### Gaussian experiments

In [5]:
# Gaussian Naive Bayes Variants:

# Standard
for dataset in ["drop-rows", "drop-columns"]:
    experiments.append(
        Experiment(
            model_class="BespokeNB",
            categorical_estimator_class="CategoricalEstimator",
            continuous_estimator_class="GaussianEstimator",
            dataset=dataset,
            num_folds=10,
        )
    )

# Robust and categorical-aware
for laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
    experiments.append(
        Experiment(
            model_class="BespokeNB",
            categorical_estimator_class="RobustCategoricalEstimator",
            continuous_estimator_class="RobustGaussianEstimator",
            dataset=dataset,
            num_folds=10,
            continuous_estimator_params=dict(laplace_smoothing=laplace_smoothing),
        )
    )

    experiments.append(
        Experiment(
            model_class="CategoricalAwareBespokeNB",
            categorical_estimator_class="RobustCategoricalEstimator",
            continuous_estimator_class="RobustGaussianEstimator",
            dataset=dataset,
            num_folds=10,
            continuous_estimator_params=dict(laplace_smoothing=laplace_smoothing),
        )
    )

#### Histogram experiments

In [6]:
# Histogram Naive Bayes Variants:

# Standard
for dataset in ["drop-rows", "drop-columns"]:
    for bins in list(range(10, 1000, 10)) + [None]:
        experiments.append(
            Experiment(
                model_class="BespokeNB",
                categorical_estimator_class="CategoricalEstimator",
                continuous_estimator_class="HistogramEstimator",
                dataset=dataset,
                num_folds=10,
                continuous_estimator_params=dict(bins=bins),
            )
        )

# Robust and categorical-aware
for laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
    for bins in list(range(10, 1000, 10)) + [None]:
        experiments.append(
            Experiment(
                model_class="BespokeNB",
                categorical_estimator_class="RobustCategoricalEstimator",
                continuous_estimator_class="RobustHistogramEstimator",
                dataset=dataset,
                num_folds=10,
                continuous_estimator_params=dict(laplace_smoothing=laplace_smoothing, bins=bins),
            )
        )

        experiments.append(
            Experiment(
                model_class="CategoricalAwareBespokeNB",
                categorical_estimator_class="RobustCategoricalEstimator",
                continuous_estimator_class="RobustHistogramEstimator",
                dataset=dataset,
                num_folds=10,
                continuous_estimator_params=dict(laplace_smoothing=laplace_smoothing, bins=bins),
            )
        )

Let us skip the KDE estimators, as they are much much more computationally expensive to run, and perform first the Box-Cox experiments.

#### Box-Cox experiments

In [7]:
# Box-Cox Gaussian Naive Bayes Variants:

# Standard
for dataset in ["drop-rows", "drop-columns"]:
    experiments.append(
        Experiment(
            model_class="BespokeNB",
            categorical_estimator_class="CategoricalEstimator",
            continuous_estimator_class="BoxCoxGaussianEstimator",
            dataset=dataset,
            num_folds=10,
        )
    )

# Robust and categorical-aware
for laplace_smoothing in [0, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
    experiments.append(
        Experiment(
            model_class="BespokeNB",
            categorical_estimator_class="RobustCategoricalEstimator",
            continuous_estimator_class="RobustBoxCoxGaussianEstimator",
            dataset=dataset,
            num_folds=10,
            continuous_estimator_params=dict(laplace_smoothing=laplace_smoothing),
        )
    )

    experiments.append(
        Experiment(
            model_class="CategoricalAwareBespokeNB",
            categorical_estimator_class="RobustCategoricalEstimator",
            continuous_estimator_class="RobustBoxCoxGaussianEstimator",
            dataset=dataset,
            num_folds=10,
            continuous_estimator_params=dict(laplace_smoothing=laplace_smoothing),
        )
    )

Now let's create an experiment runner function to run all the experiments:

In [8]:
import sys
import os
from typing import Type
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support

sys.path.append(os.path.abspath("../"))

import src.naive_bayes
import src.evaluate


def _instantiate_estimator(
    estimator_cls: Type[src.naive_bayes.ProbabilityEstimator],
    estimator_params: Dict[str, Union[Optional[float], Optional[int]]],
) -> src.naive_bayes.ProbabilityEstimator:
    init_kwargs = {}
    for key in estimator_cls.__init__.__code__.co_varnames[1:]:
        if not key in estimator_params:
            raise ValueError(f"Missing value for estimator parameter: {key}")
        init_kwargs[key] = estimator_params[key]
    return estimator_cls(**init_kwargs)


def _get_estimator_instances(
    experiment: Experiment, num_features: int, categorical_features: list[int]
) -> Dict[int, src.naive_bayes.ProbabilityEstimator]:
    categorical_estimator_cls = getattr(src.naive_bayes, experiment.categorical_estimator_class)
    continuous_estimator_cls = getattr(src.naive_bayes, experiment.continuous_estimator_class)
    instances = {}
    for feature in range(num_features):
        if feature in categorical_features:
            instances[feature] = _instantiate_estimator(
                categorical_estimator_cls, experiment.categorical_estimator_params
            )
        else:
            instances[feature] = _instantiate_estimator(
                continuous_estimator_cls, experiment.continuous_estimator_params
            )
    return instances


def _get_model_instance(
    experiment: Experiment, num_features: int, categorical_features: list[int]
) -> src.naive_bayes.BespokeNB | src.naive_bayes.CategoricalAwareBespokeNB:
    model_cls = getattr(src.naive_bayes, experiment.model_class)
    estimators = _get_estimator_instances(
        experiment, num_features=num_features, categorical_features=categorical_features
    )
    if model_cls == src.naive_bayes.BespokeNB:
        return model_cls(estimators=estimators)
    elif model_cls == src.naive_bayes.CategoricalAwareBespokeNB:
        return model_cls(
            estimators=estimators,
            categorical_features=categorical_features,
        )
    else:
        raise ValueError(f"Unknown model class: {experiment.model_class}")


def run_cv_experiments(experiment: Experiment) -> List[ExperimentResult]:
    X, y, weights, categorical_features = datasets[experiment.dataset]

    # Create cross-validation folds
    results = []
    kf = KFold(n_splits=experiment.num_folds, shuffle=False)
    for fold_index, (train, test) in enumerate(kf.split(X)):
        model = _get_model_instance(experiment, num_features=X.shape[1], categorical_features=categorical_features)
        model.fit(X[train], y[train])
        y_pred = model.predict(X[test])
        accuracy = accuracy_score(y[test], y_pred)
        (b_precision, s_precision), (b_recall, s_recall), (b_f1_score, s_f1_score), _ = precision_recall_fscore_support(
            y[test], y_pred, labels=["b", "s"], average=None, zero_division=0
        )
        ams_score = src.evaluate.ams_score(y_true=y[test], y_pred=y_pred, weights=weights[test])
        result = ExperimentResult(
            **experiment.model_dump(),
            fold_index=fold_index,
            accuracy=accuracy,
            b_recall=b_recall,
            b_precision=b_precision,
            b_f1_score=b_f1_score,
            s_recall=s_recall,
            s_precision=s_precision,
            s_f1_score=s_f1_score,
            ams_score=ams_score,
        )
        results.append(result)
    return results

And finally, let's run the experiments and collect the results:

In [9]:
from tqdm import tqdm
import warnings
import pandas as pd

np.seterr(divide="ignore", invalid="ignore")

experiment_results: List[ExperimentResult] = []
failed_experiments: List[Experiment] = []
with warnings.catch_warnings():
    bar = tqdm(total=len(experiments), desc="Running experiments")
    for experiment in experiments:
        bar.set_postfix_str(
            f"{experiment.model_class} with {experiment.continuous_estimator_class} on {experiment.dataset}"
        )
        bar.refresh()
        try:
            experiment_results.extend(run_cv_experiments(experiment))
        except Exception as e:
            print(f"Experiment failed: {e}")
            failed_experiments.append(experiment)
        bar.update(1)
        # Save intermediate results
        df_results = pd.DataFrame([result.model_dump() for result in experiment_results])
        df_results.to_csv("../results/naive_bayes_model_selection_results_intermediate.csv", index=False)
        failed_results = pd.DataFrame([exp.model_dump() for exp in failed_experiments])
        failed_results.to_csv("../results/naive_bayes_model_selection_failed_experiments_intermediate.csv", index=False)
    bar.close()

Running experiments:  99%|█████████▉| 2223/2244 [3:28:18<01:40,  4.81s/it, BespokeNB with BoxCoxGaussianEstimator on drop-rows]                      

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2224/2244 [3:28:24<01:42,  5.14s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2225/2244 [3:28:30<01:42,  5.37s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2226/2244 [3:28:33<01:25,  4.72s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2227/2244 [3:28:39<01:26,  5.07s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2228/2244 [3:28:42<01:11,  4.49s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2229/2244 [3:28:48<01:13,  4.93s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2230/2244 [3:28:51<01:01,  4.41s/it, CategoricalAwareBespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2231/2244 [3:28:57<01:02,  4.84s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments:  99%|█████████▉| 2232/2244 [3:29:01<00:52,  4.35s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2233/2244 [3:29:06<00:52,  4.81s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2234/2244 [3:29:10<00:43,  4.33s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2235/2244 [3:29:15<00:43,  4.79s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2236/2244 [3:29:19<00:34,  4.30s/it, CategoricalAwareBespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2237/2244 [3:29:25<00:33,  4.80s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2238/2244 [3:29:28<00:25,  4.31s/it, CategoricalAwareBespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2239/2244 [3:29:34<00:23,  4.80s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2240/2244 [3:29:37<00:17,  4.30s/it, CategoricalAwareBespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2241/2244 [3:29:43<00:14,  4.78s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2242/2244 [3:29:46<00:08,  4.30s/it, CategoricalAwareBespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments: 100%|█████████▉| 2243/2244 [3:29:52<00:04,  4.79s/it, BespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]                

Experiment failed: Data must be positive.


Running experiments: 100%|██████████| 2244/2244 [3:29:55<00:00,  4.30s/it, CategoricalAwareBespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]

Experiment failed: Data must be positive.


Running experiments: 100%|██████████| 2244/2244 [3:29:55<00:00,  5.61s/it, CategoricalAwareBespokeNB with RobustBoxCoxGaussianEstimator on drop-columns]
