# ANODE Walkthrough

This is a simple conceptual guide to how the [ANODE method](https://journals.aps.org/prd/abstract/10.1103/PhysRevD.101.075042) for anomaly detection works. The notebook is deliberately simplified and does not use all the optimization measures implemented in the main paper. Instead, it shows the core concept while hiding the technical implementation details behind a scikit-learn style API.

The core assumption of ANODE is that you have a resonant feature $m$, in which a potential (a-priori unknown) signal process is localized. Furthermore, we want to make use of extra dimensions, our auxiliary features $x$, to discriminate between such a signal and the background. This is illustrated below.

![resonant anomaly detection](images/resonant_anomaly_detection.png)

We would now like to learn to distinguish signal from background in a data-driven manner. An optimal test statistic between signal and background would be the likelihood ratio $\frac{p_{sig}}{p_{bkg}}$. But for this we would need to know what the signal looks like. What we can try to estimate instead is the data-to-background likelihood ratio, which is monotonically linked to the signal-to-background one (data=sig+bkg): $\frac{p_{data}}{p_{bkg}} = \frac{f_{sig} p_{sig} + (1- f_{sig}) p_{bkg}}{p_{bkg}} = f_{sig} \frac{p_{sig}}{p_{bkg}} + (1 - f_{sig})$ where $f_{sig}$ is the (unknown) signal fraction in the data. For any $f_{sig} > 0$ this is an increasing function of $\frac{p_{sig}}{p_{bkg}}$, so both ratios rank events in the same order.
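The monotonic link can be checked numerically. The following is a minimal sketch with made-up numbers: the signal fraction `f_sig` and the list of signal-to-background ratios are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
f_sig = 0.01                                          # assumed signal fraction
r_sig_bkg = np.array([0.0, 0.5, 1.0, 10.0, 100.0])    # p_sig / p_bkg per event

# p_data / p_bkg = f_sig * p_sig / p_bkg + (1 - f_sig)
r_data_bkg = f_sig * r_sig_bkg + (1.0 - f_sig)

# The map is affine with positive slope f_sig, so sorting events by either
# ratio yields the same ordering.
same_ordering = np.array_equal(np.argsort(r_sig_bkg), np.argsort(r_data_bkg))

print(r_data_bkg)
print(same_ordering)
```

Note that for a small signal fraction the data-to-background ratio stays close to 1 for most events, which is why the density estimates need to be fairly precise in practice.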

To this end, we first divide the $m$ spectrum into a signal region (SR), in which we want to look for a localized signal, and the complementary sidebands (SB). Then we train a conditional normalizing flow on the SB to learn the background distribution $p_{bkg}(x|m)$ in $x$ as a function of $m$, and interpolate it into the SR. Within the SR, we can learn the data likelihood directly, also as a function of $m$: $p_{data}(x|m)$.
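As a rough illustration of the SR/SB split, the sketch below cuts a toy array on its first column. The toy data, the SR boundaries `sr_min` and `sr_max`, and the column layout are assumptions for this example; in this notebook the actual split is performed by the data preparation script.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the data: column 0 is m, the remaining columns are x.
m = rng.uniform(2.0, 5.0, size=1000)
x = rng.normal(size=(1000, 4))
data = np.column_stack([m, x])

# Hypothetical SR boundaries in the units of m (an assumption for this sketch).
sr_min, sr_max = 3.3, 3.7

in_sr = (data[:, 0] >= sr_min) & (data[:, 0] < sr_max)
innerdata = data[in_sr]     # signal region (SR)
outerdata = data[~in_sr]    # sidebands (SB), everything outside the SR

print(innerdata.shape, outerdata.shape)
```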

We can then simply take the ratio of these two likelihoods for every data point in the test set. This ratio serves as an anomaly score: selecting events with a high score increases the relative fraction of signal over background.

These steps are now illustrated via code below.

In [None]:
!pip install vector scikit-learn==1.4.0

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import subprocess
import sys

from os.path import exists, join, dirname, realpath
from sklearn.metrics import roc_curve

# adding parent directory to path
parent_dir = dirname(realpath(globals()["_dh"][0]))
sys.path.append(parent_dir)

from sk_cathode.generative_models.conditional_normalizing_flow_torch import ConditionalNormalizingFlow
from sk_cathode.utils.preprocessing import ExtStandardScaler, LogitScaler, make_ext_pipeline

In [None]:
# :sunglasses:
plt.style.use('dark_background')

The input data are preprocessed via another script `demos/utils/data_preparation.py`. It downloads the LHCO R\&D dataset and applies the preprocessing to extract the conditional feature $m=m_{jj}$ and four auxiliary features $x=(m_{j1}, \Delta m_{jj}, \tau_{21,j1}, \tau_{21,j2})$. Moreover, it divides the $m$ spectrum into SR and SB, and splits the data into training/validation/test sets.

In [None]:
data_path = "/global/cfs/cdirs/ntrain1/anomaly/input_data_deltaR/"

In [None]:
# data preparation (download and high-level preprocessing)
if not exists(join(data_path, "innerdata_test.npy")):
    process = subprocess.run(f"{sys.executable} {join(parent_dir, 'demos', 'utils', 'data_preparation.py')} --outdir {data_path}", shell=True, check=True)

In [None]:
# data loading
outerdata_train = np.load(join(data_path, "outerdata_train.npy"))
outerdata_val = np.load(join(data_path, "outerdata_val.npy"))
innerdata_train = np.load(join(data_path, "innerdata_train.npy"))
innerdata_val = np.load(join(data_path, "innerdata_val.npy"))
innerdata_test = np.load(join(data_path, "innerdata_test.npy"))

Now the conditional normalizing flow is trained on SB data, in order to model $p_{bkg}(x|m)$. Since flows learn a smooth mapping, it is hard for them to learn steep edges. Thus, we first apply a logit transformation to smooth out the boundaries and then apply a standard scaler transformation to normalize the data to zero mean and unit variance. The flow is then trained on these transformed data.
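Because the flow models the density in the transformed space, the preprocessing has to contribute its Jacobian determinant to obtain a normalized density in the original space. The following one-dimensional sketch illustrates this. It is hand-written and is not the `sk_cathode` implementation: the Beta-distributed toy feature and the standard normal used as a stand-in for the trained flow are assumptions.

```python
import numpy as np

def logit(x):
    return np.log(x) - np.log1p(-x)

def log_standard_normal(u):
    # Stand-in for the flow's log-density in the transformed space.
    return -0.5 * u**2 - 0.5 * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
x_train = rng.beta(2.0, 5.0, size=10_000)   # toy feature with hard edges at 0 and 1

# "Fit" the standard scaler on the logit-transformed training data.
z_train = logit(x_train)
mu, sigma = z_train.mean(), z_train.std()

def log_prob_x(x):
    u = (logit(x) - mu) / sigma
    # log|du/dx| = -log(sigma) - log(x) - log(1 - x)
    log_jac = -np.log(sigma) - np.log(x) - np.log1p(-x)
    return log_standard_normal(u) + log_jac

# Midpoint rule on (0, 1): with the Jacobian the density integrates to ~1,
# without it the result is not a normalized density in x.
n_grid = 200_000
x_grid = (np.arange(n_grid) + 0.5) / n_grid
integral = np.exp(log_prob_x(x_grid)).mean()
integral_no_jac = np.exp(log_standard_normal((logit(x_grid) - mu) / sigma)).mean()

print(integral, integral_no_jac)
```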

In [None]:
# either train new flow model from scratch

outer_flow_savedir = "./trained_flows/"

m_train = outerdata_train[:, 0:1]
X_train = outerdata_train[:, 1:-1]
m_val = outerdata_val[:, 0:1]
X_val = outerdata_val[:, 1:-1]

# We streamline the preprocessing with an (extended) sklearn pipeline.
# The sklearn pipeline does not properly normalize probabilities, so we
# use an extended version that properly tracks jacobian determinants.
full_outer_model = make_ext_pipeline(LogitScaler(),
                                     ExtStandardScaler(),
                                     ConditionalNormalizingFlow(
                                         save_path=outer_flow_savedir,
                                         num_inputs=outerdata_train[:, 1:-1].shape[1],
                                         early_stopping=True, epochs=None,
                                         verbose=True))

# Let's protect ourselves from accidentally overwriting a trained model.
if not exists(join(outer_flow_savedir, "DE_models")):
    full_outer_model.fit(X_train, m_train, X_val=X_val, m_val=m_val)
else:
    print(f"The model exists already in {outer_flow_savedir}. Remove first if you want to overwrite.")

In [None]:
# or loading existing flow model, fitting preprocessing on the fly
# and stacking it with the pre-trained flow into our extended pipeline

outer_scaler = make_ext_pipeline(LogitScaler(), ExtStandardScaler())
outer_scaler.fit(outerdata_train[:, 1:-1])

outer_flow_savedir = "./trained_flows/"
outer_flow_model = ConditionalNormalizingFlow(save_path=outer_flow_savedir,
                                              num_inputs=outerdata_train[:, 1:-1].shape[1],
                                              load=True)

full_outer_model = make_ext_pipeline(outer_scaler, outer_flow_model)

We do exactly the same on the SR data, in order to model $p_{data}(x|m)$.

In [None]:
# either train new flow model from scratch

inner_flow_savedir = "./trained_flows_inner/"

m_train = innerdata_train[:, 0:1]
X_train = innerdata_train[:, 1:-1]
m_val = innerdata_val[:, 0:1]
X_val = innerdata_val[:, 1:-1]

# We streamline the preprocessing with an (extended) sklearn pipeline.
# The sklearn pipeline does not properly normalize probabilities, so we
# use an extended version that properly tracks jacobian determinants.
full_inner_model = make_ext_pipeline(LogitScaler(),
                                     ExtStandardScaler(),
                                     ConditionalNormalizingFlow(
                                         save_path=inner_flow_savedir,
                                         num_inputs=innerdata_train[:, 1:-1].shape[1],
                                         early_stopping=True, epochs=None,
                                         verbose=True))

# Let's protect ourselves from accidentally overwriting a trained model.
if not exists(join(inner_flow_savedir, "DE_models")):
    full_inner_model.fit(X_train, m_train, X_val=X_val, m_val=m_val)
else:
    print(f"The model exists already in {inner_flow_savedir}. Remove first if you want to overwrite.")

In [None]:
# or loading existing flow model

inner_scaler = make_ext_pipeline(LogitScaler(), ExtStandardScaler())
inner_scaler.fit(innerdata_train[:, 1:-1])

inner_flow_savedir = "./trained_flows_inner/"
inner_flow_model = ConditionalNormalizingFlow(save_path=inner_flow_savedir,
                                              num_inputs=innerdata_train[:, 1:-1].shape[1],
                                              load=True)

full_inner_model = make_ext_pipeline(inner_scaler, inner_flow_model)

Now all we have to do is to evaluate both likelihoods on the test set and take their ratio. For numeric stability, we get both likelihoods in log space, take their difference and exponentiate it to yield the ratio. This is our anomaly score, which should ideally be high for signal and low for background events.
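To see why the log-space detour helps, consider the toy sketch below. The two Gaussian densities are assumptions that stand in for the two flows. Far in the tail both densities underflow to zero, so the direct ratio is undefined, while the difference of log-likelihoods stays well behaved.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

# Toy densities (assumptions): "background" N(0, 1) and "data" N(0.1, 1).
x = np.array([0.0, 2.0, 50.0])
log_p_data = gaussian_logpdf(x, 0.1, 1.0)
log_p_bkg = gaussian_logpdf(x, 0.0, 1.0)

# Direct ratio: at x = 50 both densities underflow to 0.0, giving 0/0 = NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio_direct = np.exp(log_p_data) / np.exp(log_p_bkg)

# Log-space ratio: subtract first, exponentiate once.
ratio_logspace = np.exp(log_p_data - log_p_bkg)

print(ratio_direct)
print(ratio_logspace)
```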

Let's evaluate how well this classifier performs in terms of distinguishing signal from background. We can of course only do this because we have the true labels for this simulation, which we don't in a real analysis.

We quantify the performance via the significance improvement characteristic (SIC): the significance $S/\sqrt{B}$ after applying a selection on the classifier output, divided by the significance before the selection. In terms of the true and false positive rates, this is $\mathrm{SIC} = \mathrm{TPR}/\sqrt{\mathrm{FPR}}$. The true positive rate on the horizontal axis quantifies how tightly we apply this cut on the anomaly score.

So higher numbers are better and we should see a non-trivial (non-random and even $>1$) SIC for ANODE, which means that we substantially improve the significance of the signal over background in the analysis, even without ever showing true signal labels to the classifier during training.
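As a small worked example of the SIC at a single working point, the sketch below uses toy scores. The Gaussian score distributions, the sample sizes and the threshold are assumptions for illustration only. It also cross-checks the efficiency-based formula against the definition in terms of event counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy anomaly scores (assumption): signal scores are shifted upward.
scores_bkg = rng.normal(0.0, 1.0, size=100_000)
scores_sig = rng.normal(2.0, 1.0, size=1_000)
threshold = 2.0

tpr = np.mean(scores_sig > threshold)   # signal efficiency of the cut
fpr = np.mean(scores_bkg > threshold)   # background efficiency of the cut
sic = tpr / np.sqrt(fpr)

# Cross-check: (S / sqrt(B)) after the cut divided by (S / sqrt(B)) before it.
s_before, b_before = len(scores_sig), len(scores_bkg)
s_after = np.sum(scores_sig > threshold)
b_after = np.sum(scores_bkg > threshold)
sic_from_counts = (s_after / np.sqrt(b_after)) / (s_before / np.sqrt(b_before))

print(tpr, fpr, sic, sic_from_counts)
```

For a random classifier the cut removes signal and background equally (FPR = TPR), so its SIC is $\sqrt{\mathrm{TPR}} \le 1$.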

In [None]:
X_test = innerdata_test[:, 1:-1]
m_test = innerdata_test[:, 0:1]
y_test = innerdata_test[:, -1]

with np.errstate(divide='ignore', invalid='ignore'):
    outer_logprobs = full_outer_model.predict_log_proba(X_test, m=m_test)
    inner_logprobs = full_inner_model.predict_log_proba(X_test, m=m_test)

    # taking the ratio in log space
    preds_test = np.exp(inner_logprobs - outer_logprobs).flatten()

    # remove non-finite scores (NaN or inf), using the same mask for labels
    finite_mask = np.isfinite(preds_test)
    preds_test_clean = preds_test[finite_mask]
    y_test_clean = y_test[finite_mask]

    fpr, tpr, _ = roc_curve(y_test_clean, preds_test_clean)
    sic = tpr / np.sqrt(fpr)

    random_tpr = np.linspace(0, 1, 300)
    # a random classifier has FPR = TPR, so SIC = TPR / sqrt(TPR) = sqrt(TPR)
    random_sic = np.sqrt(random_tpr)

plt.plot(tpr, sic, label="ANODE")
plt.plot(random_tpr, random_sic, "w:", label="random")
plt.xlabel("True Positive Rate")
plt.ylabel("Significance Improvement")
plt.legend(loc="upper right")
plt.show()

Conceptually, ANODE is closely related to [CATHODE](https://journals.aps.org/prd/abstract/10.1103/PhysRevD.106.055006). There the same likelihood ratio between data and background is learned, but it is approximated via a classifier that learns to distinguish samples from the two distributions, rather than by modeling the two distributions separately and taking the ratio. In practice, neural network classifiers are seen as more powerful in learning likelihood ratios than two separate flows. However, there might still be use cases where the ANODE approach is more suitable.
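The classifier-based estimate of a likelihood ratio can be illustrated with a toy example. This is only a sketch of the general idea, not the CATHODE implementation: the one-dimensional Gaussian samples are assumptions, and a logistic regression fitted with Newton steps stands in for the neural network. For balanced classes, the classifier output $p$ satisfies $p/(1-p) \approx p_{data}/p_{bkg}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Toy 1D samples (assumptions): "background" ~ N(0, 1), "data" ~ N(1, 1).
x_bkg = rng.normal(0.0, 1.0, size=n)
x_data = rng.normal(1.0, 1.0, size=n)

x = np.concatenate([x_bkg, x_data])
y = np.concatenate([np.zeros(n), np.ones(n)])
design = np.column_stack([np.ones_like(x), x])   # intercept and slope

# Logistic regression via Newton's method (stand-in for a neural network).
w = np.zeros(2)
for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-(design @ w)))
    grad = design.T @ (p - y)
    hess = design.T @ (design * (p * (1.0 - p))[:, None])
    w -= np.linalg.solve(hess, grad)

def log_ratio_classifier(x_eval):
    # logit of the classifier output, i.e. log(p / (1 - p))
    return w[0] + w[1] * np.asarray(x_eval)

def log_ratio_true(x_eval):
    # log N(x; 1, 1) - log N(x; 0, 1) = x - 0.5
    return np.asarray(x_eval) - 0.5

x_eval = np.array([-1.0, 0.0, 1.0, 2.0])
print(log_ratio_classifier(x_eval))
print(log_ratio_true(x_eval))
```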