Tutorial for MMD with the TorchDrift library: https://towardsai.net/p/machine-learning/drift-detection-using-torchdrift-for-tabular-and-time-series-data

More documentation on TorchDrift MMD: https://torchdrift.org/notebooks/note_on_mmd.html

In [1]:
RED = "\033[91m"
AUTO = "\033[0m"

In [2]:
import pandas as pd
import torch
import numpy as np
from tqdm import tqdm
import torchdrift.detectors as detectors
from joblib import Parallel, delayed

In [3]:
DATA_PATH = r'default SAC 500 norm space results\baseline_obs-a.csv'
SAVE_DIR = 'default SAC 500 norm space results' + '/'
SEGMENT_NAME = 'day'
SEGMENTS = 365
BOOTSTRAP = 10_000
PVAL = 0.05
JOBS = 48
SAVE_NAME = f'{SEGMENT_NAME}ly baseline MMD'


In [4]:
df_data = pd.read_csv(DATA_PATH,
                      index_col=0,
                      usecols=lambda x: x != 'actions',  # load every column except 'actions'
                      )

##### On the (Statistical) Detection of Adversarial Examples

**Two-sample hypothesis testing**: As stated before, the test we chose is appropriate for high-dimensional inputs and small sample sizes. We compute the biased estimate of MMD using a **Gaussian kernel**, and then apply **10 000 bootstrapping iterations** to estimate the distributions. Based on this, we compute the **p-value** and compare it to the threshold, in our experiments **0.05**. For samples of **legitimate data, the observed p-value should always be very high**, whereas for sample sets containing adversarial examples, we expect it to be low, since they are sampled from a different distribution and thus the hypothesis should be rejected. The test is more likely to detect a difference in two distributions when it considers samples of large size (i.e., the sample contains more inputs from the distribution).
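
To make the procedure in the paragraph above concrete, here is a minimal NumPy sketch of a biased squared-MMD estimate with a Gaussian kernel, followed by a resampling test for the p-value. It is not TorchDrift's implementation: the function names (`biased_mmd2`, `mmd_permutation_test`) are hypothetical, the bandwidth uses the median heuristic as an assumption, and the null distribution is built by permuting the pooled sample rather than by bootstrapping.

```python
import numpy as np


def biased_mmd2(k, n):
    """Biased (V-statistic) squared MMD from a pooled kernel matrix.

    The first n rows/columns of k belong to sample X, the rest to sample Y.
    """
    k_xx = k[:n, :n]
    k_yy = k[n:, n:]
    k_xy = k[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


def mmd_permutation_test(x, y, n_perm=1000, seed=0):
    """Return (squared MMD, p-value) for two samples given as 2-D arrays."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)

    # Pairwise squared Euclidean distances over the pooled sample.
    sq_dists = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(axis=-1)

    # Assumption: median heuristic for the Gaussian kernel bandwidth.
    sigma2 = np.median(sq_dists[sq_dists > 0])
    k = np.exp(-sq_dists / (2.0 * sigma2))

    observed = biased_mmd2(k, n)

    # Null distribution: shuffle which rows count as X and which as Y.
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if biased_mmd2(k[np.ix_(perm, perm)], n) >= observed:
            exceed += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value


# Toy data: two samples from the same distribution, and one shifted sample.
rng = np.random.default_rng(42)
same_a = rng.normal(size=(40, 3))
same_b = rng.normal(size=(40, 3))
shifted = rng.normal(loc=3.0, size=(40, 3))

mmd_same, p_same = mmd_permutation_test(same_a, same_b, n_perm=200)
mmd_shift, p_shift = mmd_permutation_test(same_a, shifted, n_perm=200)
print(f"same distribution:    mmd2={mmd_same:.4f}, p={p_same:.4f}")
print(f"shifted distribution: mmd2={mmd_shift:.4f}, p={p_shift:.4f}")
```

The kernel matrix is computed once and only re-indexed for each permutation, which keeps the loop cheap. A p-value at or below the threshold rejects the hypothesis that both samples come from the same distribution; a p-value above it only means the test found no evidence of a difference.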

In [5]:
kernel = detectors.mmd.GaussianKernel()

In [6]:
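def show_results(results):  # results: iterable of (segment index, (mmd, p_value)) tuples
    for result in results:
        if result[1][1] > PVAL:  # p-value above the threshold: the test does not reject equality
            dist = 'identical'
            colour = AUTO
        else:  # p-value at or below the threshold: the test rejects equality
            dist = 'distinct'
            colour = RED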
        print(f'For {SEGMENT_NAME} {result[0]}: mmd:{result[1][0]:.5f}, p-value:{result[1][1]}, {colour}distributions are {dist}{AUTO}')

Time series splits

In [7]:
# Split the rows into SEGMENTS contiguous, time-ordered chunks.
# Using sklearn's time series split, which returns indices, might allow loading all the data
# as a single CUDA tensor instead of transferring it segment by segment.
samples = np.array_split(df_data.to_numpy(), SEGMENTS)
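
As a small illustration of how the split behaves, the sketch below runs `np.array_split` on a toy array (the array and its shape are made up for the example). Unlike `np.split`, it accepts a row count that is not divisible by the number of segments, so some segments end up one row longer than others.

```python
import numpy as np

# Toy stand-in for the data: 10 time-ordered rows with 2 features each.
rows = np.arange(20).reshape(10, 2)

# 10 rows do not divide evenly into 3 segments, so the leading segment gets the extra row.
segments = np.array_split(rows, 3)
sizes = [len(segment) for segment in segments]
print(sizes)  # [4, 3, 3]

# The segments are contiguous and keep the original row order.
restored = np.concatenate(segments)
print(np.array_equal(restored, rows))  # True
```

With real data this means the segments compared by the test can differ in length by one row, which the two-sample test handles without any padding.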

Compare each segment to the one that follows it

In [8]:
def process_action(i, dist1, dist2):
    # Run the kernel MMD two-sample test on segment i (dist1) and segment i + 1 (dist2).
    return i, detectors.kernel_mmd(torch.from_numpy(dist1).to('cuda'),
                                  torch.from_numpy(dist2).to('cuda'),
                                  n_perm=BOOTSTRAP,
                                  kernel=kernel)

results = Parallel(n_jobs=JOBS,  # set n_jobs low enough that you don't run out of VRAM
            prefer='threads'  # threads were roughly 8 times faster than multiprocessing here: less overhead, and the CPU work is negligible
            )(delayed(process_action)(i, samples[i], samples[i + 1]) for i in tqdm(range(SEGMENTS - 1)))  # stop at SEGMENTS - 1 because the last segment has no successor
show_results(results)


100%|██████████| 364/364 [20:17<00:00,  3.34s/it]


For week 0: mmd:0.10257, p-value:0.16279999911785126, distributions are identical
For week 1: mmd:0.10587, p-value:0.13539999723434448, distributions are identical
For week 2: mmd:0.11805, p-value:0.06659999489784241, distributions are identical
For week 3: mmd:0.10113, p-value:0.1850999891757965, distributions are identical
For week 4: mmd:0.10680, p-value:0.14489999413490295, distributions are identical
For week 5: mmd:0.10473, p-value:0.17299999296665192, distributions are identical
For week 6: mmd:0.13454, p-value:0.03689999878406525, distributions are distinct
For week 7: mmd:0.07986, p-value:0.5388999581336975, distributions are identical
For week 8: mmd:0.07984, p-value:0.5317000150680542, distributions are identical
For week 9: mmd:0.08249, p-value:0.4777999818325043, distributions are identical
For week 10: mmd:0.08545, p-value:0.41029998660087585, distributions are identical
For week 11: 

In [9]:
mmd = [result[1][0].item() for result in results]
pval = [result[1][1].item() for result in results]
segment = [result[0] for result in results]
df_results = pd.DataFrame({'MMD': mmd, 'P_value': pval}, index=segment)  # index rows by segment number
df_results.index.name = SEGMENT_NAME

In [10]:
df_results.to_csv(SAVE_DIR + SAVE_NAME + '.csv', 
                  #index=0,
                  )