Tutorial for MMD with the TorchDrift library: https://towardsai.net/p/machine-learning/drift-detection-using-torchdrift-for-tabular-and-time-series-data

More documentation on MMD in TorchDrift: https://torchdrift.org/notebooks/note_on_mmd.html

In [1]:
import pandas as pd
import torch
import torchdrift.detectors as detectors

In [2]:
SAMPLE = r'default SAC 500 norm space results\baseline_obs-a.csv'
SAVE_DIR = r'default SAC 500 norm space results' + '/'
ATK_NAME = 'MMD_baseline_random_daily_samples_2'

##### On the (Statistical) Detection of Adversarial Examples

**Two-sample hypothesis testing**: As stated before, the test we chose is appropriate to handle high-dimensional inputs and small sample sizes. We compute the biased estimate of MMD using a **Gaussian kernel**, and then apply **10,000 bootstrapping iterations** to estimate the distributions. Based on this, we compute the **p-value** and compare it to the threshold, in our experiments **0.05**. For samples of **legitimate data, the observed p-value should always be very high**, whereas for sample sets containing adversarial examples, we expect it to be low, since they are sampled from a different distribution and thus the hypothesis should be rejected. The test is more likely to detect a difference in two distributions when it considers samples of large size (i.e., the sample contains more inputs from the distribution).
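
To make the procedure above concrete, here is a minimal NumPy sketch of a biased MMD² estimate with a Gaussian kernel, followed by a resampling test for the p-value. It is an illustration only, not TorchDrift's implementation: the function names `biased_mmd2` and `mmd_permutation_test` are hypothetical, the median-distance lengthscale and the `+1` correction in the p-value are assumptions of this sketch, and the null distribution is built by permuting the pooled sample rather than by bootstrapping.

```python
import numpy as np

def biased_mmd2(k, n):
    """Biased (V-statistic) MMD^2 from a pooled kernel matrix.

    The first n rows/columns of k belong to sample x, the rest to sample y.
    """
    return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()

def mmd_permutation_test(x, y, n_perm=1000, seed=0):
    """Return (mmd2, p_value) for the null hypothesis that x and y share a distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y], axis=0)
    n = len(x)
    # Pairwise squared Euclidean distances over the pooled sample.
    sq = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(axis=-1)
    # Assumption of this sketch: lengthscale set by the median heuristic.
    lengthscale = np.sqrt(np.median(sq[sq > 0]))
    k = np.exp(-sq / (2.0 * lengthscale ** 2))  # Gaussian kernel matrix
    observed = biased_mmd2(k, n)
    exceed = 0
    for _ in range(n_perm):
        # Reassign points to the two groups at random and recompute the statistic.
        perm = rng.permutation(len(pooled))
        if biased_mmd2(k[np.ix_(perm, perm)], n) >= observed:
            exceed += 1
    # Assumption of this sketch: +1 correction so the p-value is never exactly 0.
    return observed, (exceed + 1) / (n_perm + 1)
```

Two samples from the same distribution should give a small MMD² and a large p-value; shifting one sample should push the p-value towards its minimum of `1 / (n_perm + 1)`.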

In [3]:
BOOTSTRAP = 10_000
PVAL = 0.05
kernel = detectors.mmd.GaussianKernel()

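Because our dataset is a time series, we compare samples drawn from the same time segments rather than shuffling the whole dataset: each day's observations are shuffled and split evenly between the two sample sets.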

Load unperturbed observations from untargeted adversarial attack

In [4]:
df_obs = pd.read_csv(SAMPLE,
                     index_col=0,
                     dtype='float32',
                     )
df_obs.index = df_obs.index.astype(int)  # all data is loaded as float32, but the index should be an int

Remove actions if stored in df

In [5]:
if 'a' in df_obs.columns:
    df_obs.drop(columns=['a'], inplace=True)
elif 'actions' in df_obs.columns:
    df_obs.drop(columns=['actions'], inplace=True)

In [6]:
samples_per_day = 24
# Shuffle each day's samples and split them evenly between two DataFrames
halves1, halves2 = [], []

for i in range(0, len(df_obs), samples_per_day):
    daily_samples = df_obs.iloc[i:i+samples_per_day]
    daily_samples = daily_samples.sample(frac=1)  # shuffle the daily samples
    # DataFrame.append was removed in pandas 2.0; collect the pieces and concatenate once
    halves1.append(daily_samples.iloc[:samples_per_day//2])
    halves2.append(daily_samples.iloc[samples_per_day//2:])

df1 = pd.concat(halves1).reset_index(drop=True)
df2 = pd.concat(halves2).reset_index(drop=True)

In [7]:
result = detectors.kernel_mmd(torch.from_numpy(df1.values).to('cuda'),  # first half of each day's baseline obs
                              torch.from_numpy(df2.values).to('cuda'),  # second half of each day's baseline obs
                              n_perm=BOOTSTRAP,
                              kernel=kernel)
torch.cuda.empty_cache() #free gpu memory
print(f'mmd:{result[0]}, p-value:{result[1]}')

mmd:0.0003371238708496094, p-value:0.9971999526023865
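
The decision rule from the quoted passage can be sketched as a one-line comparison against the threshold. The helper name `drift_detected` is hypothetical and only for illustration; the threshold repeats the `PVAL` value set earlier, and the p-value is the one printed above.

```python
PVAL = 0.05  # significance threshold, same value as set earlier in the notebook

def drift_detected(p_value, threshold=PVAL):
    # Reject the null hypothesis (both samples share a distribution)
    # only when the p-value falls below the threshold.
    return p_value < threshold

print(drift_detected(0.9971999526023865))  # False
```

With a p-value of about 0.997, the null hypothesis is not rejected, which is the expected outcome when both samples come from the baseline observations.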


Convert the CUDA tensors to Python floats

In [8]:
cpu_result = [tensor.item() for tensor in result]

In [9]:
mmd_savename = SAVE_DIR+'MMDs.csv'
new_row = pd.DataFrame([cpu_result],
                       columns=['MMD','p_value'],
                       index=[ATK_NAME])
try:
    df_mmd = pd.read_csv(mmd_savename,
                         index_col=0)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df_mmd = pd.concat([df_mmd, new_row])
    df_mmd.to_csv(mmd_savename)
    print(f'{mmd_savename} updated')
except FileNotFoundError:  # no results file yet, so create one
    df_mmd = new_row
    df_mmd.to_csv(mmd_savename)
    print(f'{mmd_savename} created')

default SAC 500 norm space results/MMDs.csv updated
