## Description

The Jupyter notebook consists of three parts: 

1. Preprocessing of the NSL-KDD data set
2. Training of a fully connected DNN
3. Execution of the XAI methods for getting explanations for the model

The code for these steps is not part of the notebook itself. Instead, each step is implemented in a separate class written in Python. The Jupyter notebook acts like a 'main.py' that runs the individual steps of the paper.

### Dependencies

In [None]:
# common dependencies
from os.path import exists
from IPython.display import display
import numpy as np
import tensorflow as tf

# Load own modules
from xai_anomaly_detection.explanations import protodash
from xai_anomaly_detection.explanations import brcg
from xai_anomaly_detection.explanations.shap import shap_explanations
from xai_anomaly_detection.explanations.lime import lime_explanations
from xai_anomaly_detection.preprocessing import preprocessing
from xai_anomaly_detection.model.FCModel import FCModel, f1_m, precision_m, recall_m, get_sequential_model

### Data preprocessing

In [None]:
# Initialise instance which loads the data
Preprocessing = preprocessing.PreprocessNSLKDD()
# show head of train data set
display(Preprocessing.train_data.head(5))

In [None]:
# Start preprocessing step
# one-hot encoding of categorical features
# min-max normalization 
# convert all sub attack classes to common 'attack' label
Preprocessing.preprocessing()

# show head of train data set after preprocessing
display(Preprocessing.train_data.head(5))

# The paper states that there are 122 features after preprocessing,
# but I get 124 columns (including the label column)
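
The three preprocessing steps listed in the cell above can be illustrated on a toy frame. This is only a minimal sketch with pandas, not the code of `PreprocessNSLKDD`: the function name `preprocess_sketch`, the toy data, and the assumption that benign rows carry the label `normal` are all hypothetical.

```python
import pandas as pd


def preprocess_sketch(df, categorical, label_col="outcome", normal_label="normal"):
    """Hypothetical stand-in for the three preprocessing steps (not the repo's class)."""
    df = df.copy()
    # 1. collapse all attack sub-classes into one binary label: 0 = normal, 1 = attack
    labels = (df[label_col] != normal_label).astype(int)
    # 2. one-hot encode the categorical feature columns
    features = pd.get_dummies(df.drop(columns=[label_col]), columns=categorical, dtype=float)
    # 3. min-max normalise every feature column to [0, 1]
    col_min = features.min()
    col_range = (features.max() - col_min).replace(0, 1)  # constant columns: avoid division by zero
    features = (features - col_min) / col_range
    return pd.concat([features, labels.rename(label_col)], axis=1)


# toy data, assumed for illustration only
toy = pd.DataFrame({
    "duration": [0, 5, 10],
    "protocol_type": ["tcp", "udp", "tcp"],
    "outcome": ["normal", "neptune", "smurf"],
})
processed = preprocess_sketch(toy, categorical=["protocol_type"])
print(processed)
```

One-hot encoding is what inflates the column count: each categorical feature is replaced by one column per distinct value, so the final width depends on which values occur in the data.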

In [None]:
# get train data separated in features and labels
(x_train, y_train) = Preprocessing.get_data()

print("Shape y: ", y_train.shape)
print("Shape x: ", x_train.shape)

# columns of features
columns = Preprocessing.test_data.columns[Preprocessing.test_data.columns != 'outcome']
display(columns)

### Model initialization and training

In [None]:
# initialise subclassed tf model
model = FCModel(x_train.shape[1])
# compile model
model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics = ['accuracy', precision_m, recall_m, f1_m]
)
model.build(x_train.shape)
model.summary()

In [None]:
# train the model unless saved weights already exist
if exists('tmp/weights.index'):
    model.load_weights('tmp/weights')
else:
    model.fit(x_train, y_train, epochs=5, batch_size=64)
    model.save_weights('tmp/weights', save_format='tf')


In [None]:
# evaluate model
(x_test, y_test) = Preprocessing.get_data(test_data=True)
scores = model.evaluate(x_test, y_test)
for i in range(1, len(model.metrics_names)):
    print("%s: %.2f%%" % (model.metrics_names[i], scores[i]*100))
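
The evaluation above reports precision, recall, and F1 through the custom metrics `precision_m`, `recall_m`, and `f1_m`, whose implementation lives in the repository and is not shown here. As a hedged illustration of what such metrics measure, the following NumPy sketch computes them from binary labels; the function `binary_scores` and the example labels are assumptions, not the repository's code.

```python
import numpy as np


def binary_scores(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = attack, 0 = normal)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # attacks predicted as attacks
    fp = int(np.sum(~y_true & y_pred))   # normal traffic predicted as attacks
    fn = int(np.sum(y_true & ~y_pred))   # attacks predicted as normal
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# made-up labels for illustration
demo_scores = binary_scores([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
for name, value in demo_scores.items():
    print("%s: %.2f%%" % (name, value * 100))
```

Metrics computed per batch inside Keras and then averaged can differ slightly from values computed once over the whole test set, so small deviations between the two are expected.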

### Generating explanations

#### Build another model for SHAP
Reason: see below

In [None]:
# A bug causes 'model.outputs' to be 'None' for subclassed models,
# see https://github.com/tensorflow/tensorflow/issues/45202
# This forces me to create another model

# get compiled model
seq_model = get_sequential_model(x_train.shape[1])

# train model
if exists('tmp/seq_model_weights.index'):
    seq_model.load_weights('tmp/seq_model_weights')
else:
    seq_model.fit(x_train, y_train, epochs=5, batch_size=64)
    seq_model.save_weights('tmp/seq_model_weights', save_format='tf')

# evaluate
scores = seq_model.evaluate(x_test, y_test)
for i in range(1, len(seq_model.metrics_names)):
    print("%s: %.2f%%" % (seq_model.metrics_names[i], scores[i]*100))

#### SHAP

In [None]:
# initialise shap class and create explainer for model
Shap = shap_explanations(seq_model, x_train, x_test)

In [None]:
# generate global explanation with SHAP summary plot
Shap.generate_summary_plot(columns)

# https://github.com/slundberg/shap
# shap seems to have some version issues:
# there should be a legend and more colors,
# but I couldn't fix it with another matplotlib version

In [None]:
# local explanation with a SHAP force plot
Shap.generate_force_plot(columns)

#### LIME

In [None]:
# Local explanations with LIME

# select random sample
x_rand = x_test[np.random.randint(x_test.shape[0], size=1)].flatten()

Lime = lime_explanations(x_train, columns)

# note: the graph background is transparent,
# so it looks a little ugly in dark mode

# here I used the original model instead of the sequential model
# this shows that the model is built correctly and that only the bug
# in tf prevents shap from running on it
Lime.generate_lime_explanation(model, x_rand, num_features=10, show_table=True)

#### BRCG

In [None]:
# brcg needs dataframes as input
(x_train_df, y_train_df) = Preprocessing.get_data(test_data=False, as_df=True)
(x_test_df, y_test_df) = Preprocessing.get_data(test_data=True, as_df=True)

# after 12 minutes I stopped the training and decided to train with a smaller set
# with 0.1 of the data set it still took 3.15 min -> Accuracy: 0.7955
indices = np.random.choice(x_train_df.shape[0], replace = False, size=int(0.1*x_train_df.shape[0]))
x_train_df = x_train_df.iloc[indices]
y_train_df = y_train_df.iloc[indices]

# generate and print BRCG rules
display(brcg.explain_rules(x_train_df, x_test_df, y_train_df, y_test_df))
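
The 10% subsample in the cell above is drawn without a fixed seed, so the rules and the accuracy change from run to run. A small helper along the following lines would make the draw reproducible; the name `subsample`, its `seed` parameter, and the demo data are hypothetical and not part of the repository.

```python
import numpy as np
import pandas as pd


def subsample(x_df, y_df, fraction=0.1, seed=0):
    """Draw the same random row subset from features and labels (no replacement)."""
    rng = np.random.default_rng(seed)  # seeded generator -> repeatable draw
    n_rows = x_df.shape[0]
    indices = rng.choice(n_rows, size=int(fraction * n_rows), replace=False)
    return x_df.iloc[indices], y_df.iloc[indices]


# demo data, assumed for illustration only
x_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series(np.arange(100) % 2, name="outcome")
x_small, y_small = subsample(x_demo, y_demo, fraction=0.1, seed=42)
print(x_small.shape, y_small.shape)
```

Using one index array for both frames keeps features and labels aligned, which is the same idea as in the cell above.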

#### ProtoDash

In [None]:
# Explanations with ProtoDash from data
(x_train_df, y_train_df) = Preprocessing.get_data(test_data=False, as_df=True)
indices = np.random.choice(x_train_df.shape[0], replace = False, size=int(0.1*x_train_df.shape[0]))
x_train_df = x_train_df.iloc[indices]
y_train_df = y_train_df.iloc[indices]
# for full data set I get:
# MemoryError: Unable to allocate 17.4 GiB for an array with shape (125972, 18488) and data type float64
# so I also use a smaller data set

# generate protodash explanations from data
display(protodash.generate_protodash_explanations(x_train_df))

# sometimes the generation crashes with error:
# TypeError: bad operand type for unary -: 'NoneType'
# I couldn't find the cause
# maybe a bug? https://githubhelp.com/Trusted-AI/AIX360/issues/75
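
The size in the `MemoryError` quoted above follows directly from the array shape: a dense `float64` array needs 8 bytes per element. The short calculation below reproduces the figure; the helper name `array_gib` is made up for this illustration.

```python
def array_gib(shape, bytes_per_element=8):
    """Memory of a dense array in GiB; float64 uses 8 bytes per element."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**30


full = array_gib((125972, 18488))
print("float64: %.1f GiB" % full)
print("float32: %.1f GiB" % array_gib((125972, 18488), bytes_per_element=4))
```

Because the first dimension equals the number of rows passed in, reducing the number of rows shrinks that dimension proportionally.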