In this notebook, TranAD is used to perform anomaly detection. The source code of TranAD can be found [here](https://github.com/imperial-qore/TranAD). However, the most recent upstream version (Sep 13, 2023) has a few bugs in the plotting and TranAD model logic. These have been fixed in the local copy in this repository, which also adds support for SUE's data.

To use TranAD, or any other model supported in the TranAD project, the data has to be processed and encoded as numeric vectors, with the features scaled to the range 0-1.

In [1]:
import pandas as pd
import numpy as np

In [2]:
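def scale_feature(dataframe_col):
    """Min-max scale a column to the range 0-1."""
    col_min, col_max = dataframe_col.min(), dataframe_col.max()
    if col_max == col_min:
        # A constant column would divide by zero; map it to all zeros instead
        return dataframe_col * 0.0
    return (dataframe_col - col_min) / (col_max - col_min)

def process_data(base_sue_df):
    """Sort the records by timestamp and return the scaled features plus the label."""
    df = base_sue_df.copy()
    df['_source_@timestamp'] = pd.to_datetime(df['_source_@timestamp'])
    # TranAD models time series, so the rows must be in chronological order
    df = df.sort_values(by="_source_@timestamp", ascending=True)
    feature_df = df[["_source_network_bytes", "_source_event_duration", "udp", "tcp", "label"]]

    # Only the two continuous columns are scaled; udp, tcp and label are left as they are
    normalized_feature_df = feature_df.copy()
    normalized_feature_df["_source_network_bytes"] = scale_feature(feature_df["_source_network_bytes"])
    normalized_feature_df["_source_event_duration"] = scale_feature(feature_df["_source_event_duration"])
    return normalized_feature_df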
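
Note that `process_data` scales each file with its own minimum and maximum, so the same raw value can end up with different scaled values in the training and test sets. The following is a minimal sketch of one possible alternative, not what this notebook does: the bounds are taken from the training column and reused for the test column. The helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical and only serve as illustration.

```python
import pandas as pd

def fit_min_max(train_col):
    """Return the (min, max) pair observed in the training column."""
    return float(train_col.min()), float(train_col.max())

def apply_min_max(col, col_min, col_max):
    """Scale col with the given bounds and clip the result to the range 0-1."""
    if col_max == col_min:
        # Constant training column: avoid dividing by zero
        return col * 0.0
    return ((col - col_min) / (col_max - col_min)).clip(0.0, 1.0)

# Toy stand-ins for a byte-count column (made-up values)
train_bytes = pd.Series([100, 300, 500])
test_bytes = pd.Series([200, 700])

lo, hi = fit_min_max(train_bytes)
scaled_train = apply_min_max(train_bytes, lo, hi)
scaled_test = apply_min_max(test_bytes, lo, hi)

print(scaled_train.tolist())  # [0.0, 0.5, 1.0]
print(scaled_test.tolist())   # [0.25, 1.0], since 700 exceeds the training maximum and is clipped
```

Clipping keeps test values inside 0-1 at the cost of flattening values beyond the training range; whether that trade-off is acceptable depends on the data.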

Let's process the data and save it as NumPy arrays so that TranAD can load it as vector data. The number of rows is capped, because the full dataset of 5 million records is too large for the transformer to process in a reasonable amount of time.

In [3]:
train_rows = 100000
test_rows = 100000

In [4]:
train_df = process_data(pd.read_csv("../data/train_data.csv"))
test_df = process_data(pd.read_csv("../data/test_data.csv"))

feature_cols = ["_source_network_bytes", "_source_event_duration", "udp", "tcp"]
np.save("TranAD/processed/SUE/train.npy", train_df[feature_cols].values[:train_rows])
np.save("TranAD/processed/SUE/test.npy", test_df[feature_cols].values[:test_rows])
# Repeat the label column once per feature so that labels.npy has the same shape
# as test.npy; this is needed for TranAD to parse the data properly
np.save("TranAD/processed/SUE/labels.npy", test_df[["label"] * len(feature_cols)].values[:test_rows])
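
Since the label column is duplicated to match the feature matrix, it can help to check the saved arrays before starting a long training run. The sketch below uses synthetic data and a temporary directory instead of the real `TranAD/processed/SUE` files. The array sizes and the assumption that labels are 0/1 are made up for illustration. It also shows `np.repeat` as one way to build the repeated label matrix.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 8, 4

test = rng.random((n_rows, n_features))   # stand-in for the scaled feature matrix
label = rng.integers(0, 2, size=n_rows)   # stand-in for the label column (assumed 0/1)

# Repeat the 1-D label vector once per feature column
labels = np.repeat(label[:, None], n_features, axis=1)

with tempfile.TemporaryDirectory() as out_dir:
    np.save(os.path.join(out_dir, "test.npy"), test)
    np.save(os.path.join(out_dir, "labels.npy"), labels)
    loaded_test = np.load(os.path.join(out_dir, "test.npy"))
    loaded_labels = np.load(os.path.join(out_dir, "labels.npy"))

# The two arrays should line up row by row and column by column
assert loaded_test.shape == loaded_labels.shape
# Scaled features should stay inside the range 0-1
assert loaded_test.min() >= 0.0 and loaded_test.max() <= 1.0
```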

Now that the data is properly structured, we can run TranAD or any of the other models supported by this project.

In [1]:
import os
# Change the working directory to TranAD; this assumes the current working directory is the one containing this notebook, which should be the case
original_cwd = os.getcwd()
try:
    os.chdir(os.path.join(original_cwd, "TranAD"))
    %run main.py --model TranAD --dataset SUE --retrain
finally:
    # Change working dir back when done
    os.chdir(original_cwd)

Creating new model: TranAD
Training TranAD on SUE

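
The change-directory-then-restore pattern used in the cell above can also be wrapped in a small context manager, which keeps the `try`/`finally` logic in one place if several models are run in sequence. This is a minimal sketch; the name `working_directory` is hypothetical and is not part of TranAD or this repository. A temporary directory stands in for the `TranAD` folder so the example runs anywhere.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch the working directory, restoring it on exit."""
    original_cwd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, just like the finally block above
        os.chdir(original_cwd)

before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    with working_directory(tmp_dir):
        inside = os.getcwd()   # now inside the temporary directory
after = os.getcwd()            # back where we started
```

In the notebook, the `%run main.py ...` line would go inside a `with working_directory("TranAD"):` block.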