# DDoSNet2019

This project implements the RNN autoencoder model detailed in this paper: https://arxiv.org/pdf/2006.13981.pdf


We used the **CIC-DDoS2019** dataset, which contains a _"comprehensive variety of DDoS
attacks and addresses the gaps of the existing current datasets"_.

The dataset contains over 30 GB of data and over 70 million rows, spread over 18 files that detail different attacks.

Handling files of this size on a typical desktop computer requires some workarounds:

- We combined the csv files into one large csv file. The data contained over 5.5 billion cells after removing multiple features. 

- To use less memory, we had to process the data in chunks.

- We saved the model after processing each chunk, so that we can continue training it at a later time. 

- We used an early-stopping callback to monitor the loss, with a patience of 3 epochs.

- Because the data arrives in chunks, we had to change the model's logic at run time.
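
The patience rule from the list above can be illustrated without Keras. The sketch below is a simplified stand-in for `EarlyStopping(monitor='loss', patience=3)`; the function name `early_stop_epoch` and the loss values are made up for illustration, and options such as `min_delta` are left out.

```python
def early_stop_epoch(losses, patience=3):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical per-epoch losses: the loss stops improving after epoch 2.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]))
# A run that keeps improving is never stopped early.
print(early_stop_epoch([1.0, 0.9, 0.8]))
```

As far as we understand, Keras resets this counter at the start of every `fit` call, so the rule can only trigger when a single call runs for more epochs than the patience value.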


_Note: the code comments give further detail on some of these points._


In [None]:
import numpy as np
import pandas as pd

# generate large CSV from the files from https://www.unb.ca/cic/datasets/ddos-2019.html

files = ["Portmap.csv", "NetBIOS.csv", "LDAP.csv", "MSSQL.csv", "UDP.csv", "UDPLag2.csv", "Syn2.csv", "DrDoS_NTP.csv",
         "DrDoS_DNS.csv", "DrDoS_LDAP.csv", "DrDoS_MSSQL.csv", "DrDoS_NetBIOS.csv", "DrDoS_SNMP.csv", "DrDoS_SSDP.csv",
         "DrDos_UDP.csv", "UDPLag.csv", "Syn.csv", "TFTP.csv"]

first = True
counter = 0
for file in files:
    file_path = "data/combined/" + file
    for chunk in pd.read_csv(file_path, low_memory=False, chunksize=1048576):
        chunk = chunk.dropna()  # dropna returns a new frame, so the result must be kept
        counter += 1
        print(counter)
        # create the new file
        if first:
            chunk.to_csv('all_data.csv', index=False)
            first = False
            continue
        chunk.to_csv('all_data.csv', mode='a', header=False, index=False)  # append the data



Throughout development, we noticed that the code ran faster on our CPU than on our GPU, so the GPU is disabled in the cell below. Because the model is inherently random, we also fixed the random seed.

We implemented the model by subclassing the Keras `Model` API. The model below works in 3 steps:

1) Pretraining step - the data only goes through the encoder/decoder layers.

2) Training step - the data only goes through the final SimpleRNN layer.

3) Finish step - the data goes through the autoencoder, and then through the final layer before a result is output.
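
The three steps map onto the two boolean flags, `pretrained` and `finished_training`, of the `AnomalyDetector` class defined later in this notebook. The plain-Python sketch below only mirrors that routing; the helper name `active_blocks` is hypothetical and no Keras layers are involved.

```python
def active_blocks(pretrained, finished_training):
    """Return the sub-networks whose output is used for the given flags."""
    if finished_training:
        # Finish step: autoencoder first, then the final layer.
        return ["encoder", "decoder", "final"]
    if pretrained:
        # Training step: only the final layer produces the output.
        return ["final"]
    # Pretraining step: only the autoencoder produces the output.
    return ["encoder", "decoder"]


for flags in [(False, False), (True, False), (True, True)]:
    print(flags, "->", active_blocks(*flags))
```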

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # My CPU is faster than my GPU! Can be removed for GPU tensorflow. 
import tensorflow as tf
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers, losses
from tensorflow.keras.models import Model


# Count the lines by streaming the file, so it is never loaded into memory at once.
# The count includes the header line.
with open('all_data.csv', 'r') as f:
    total_rows = sum(1 for _ in f)

# seed to remove randomness and reproduce results
np.random.seed(42)
tf.random.set_seed(42)

In [None]:

class AnomalyDetector(Model):
    def __init__(self):
        super(AnomalyDetector, self).__init__()
        self.pretrained = False
        self.finished_training = False
        self.encoder = tf.keras.Sequential([
            layers.SimpleRNN(64, activation="relu", return_sequences=True),
            layers.SimpleRNN(32, activation="relu", return_sequences=True),
            layers.SimpleRNN(16, activation="relu", return_sequences=True),
            layers.SimpleRNN(8, activation="relu", return_sequences=True)])

        self.decoder = tf.keras.Sequential([
            layers.SimpleRNN(16, activation="relu", return_sequences=True),
            layers.SimpleRNN(32, activation="relu", return_sequences=True),
            layers.SimpleRNN(64, activation="relu", return_sequences=True),
            layers.SimpleRNN(79, activation="sigmoid")])

        self.final = tf.keras.Sequential([
            layers.SimpleRNN(1, activation="sigmoid")
            # The original model uses a two-class softmax layer.
            # However, according to the TensorFlow documentation, a sigmoid activation is equivalent to a two-class softmax,
            # so a softmax layer is not needed for binary classification.

            # In the future, a softmax layer could be added for medium- or fine-grained classification.
        ])

    def call(self, x):
        decoded = None
        if self.finished_training:
            x = tf.expand_dims(x, -1)
            encoded = self.encoder(x)
            decoded = self.decoder(encoded)
            decoded = tf.expand_dims(decoded, -1) # re-expanding the dimensions.
            final = self.final(decoded)
            return final
        x = tf.expand_dims(x, -1)
        if not self.pretrained:
            encoded = self.encoder(x)
            decoded = self.decoder(encoded)
        final = self.final(x)
        if self.pretrained:
            return final
        return decoded
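
To see why the last decoder layer has 79 units, it helps to trace the tensor shapes through the class above. The sketch below does this with plain tuples instead of TensorFlow: a `SimpleRNN` layer returns `(batch, timesteps, units)` when `return_sequences=True` and `(batch, units)` otherwise. The helper `simple_rnn_shape`, the batch size of 32 and the assumption of 79 input features are illustrative.

```python
def simple_rnn_shape(input_shape, units, return_sequences):
    """Output shape of a SimpleRNN layer for a (batch, timesteps, features) input."""
    batch, timesteps, _features = input_shape
    if return_sequences:
        return (batch, timesteps, units)
    return (batch, units)


# Assumed input: 32 rows of 79 features, expanded to (32, 79, 1) by tf.expand_dims(x, -1).
shape = (32, 79, 1)

for units in (64, 32, 16, 8):  # encoder layers, all with return_sequences=True
    shape = simple_rnn_shape(shape, units, True)
encoded_shape = shape

for units in (16, 32, 64):  # first three decoder layers
    shape = simple_rnn_shape(shape, units, True)
decoded_shape = simple_rnn_shape(shape, 79, False)  # last decoder layer

# In the finish step the reconstruction is expanded again and fed to the final layer.
output_shape = simple_rnn_shape(decoded_shape + (1,), 1, False)

print(encoded_shape, decoded_shape, output_shape)
```

Under these assumptions the reconstruction has the same `(batch, 79)` shape as the input rows, and the final layer yields one value per row.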


In [None]:
autoencoder = AnomalyDetector() # create a model
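
The comment in the `final` block of the class above notes that a sigmoid output is equivalent to a two-class softmax. A minimal numeric check of that statement, using only the standard library (the helper names are ours): applying softmax to the logits `[z, 0]` gives the same probability for the first class as `sigmoid(z)`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


for z in (-3.0, -0.5, 0.0, 2.0):
    two_class = softmax([z, 0.0])[0]
    print(z, sigmoid(z), two_class)
    assert abs(sigmoid(z) - two_class) < 1e-12
```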

Our model uses a learning rate of 0.00001, which gave the best results in our experiments.

In [None]:
# Compile the model and create an early-stopping callback.
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),
                    loss=tf.keras.losses.CategoricalCrossentropy())
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

We use 70% of the data for training and hold out the remaining 30% for validation. Within each training chunk, the data is further split into 75% training rows and 25% test rows.
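
How the 70% limit interacts with chunked reading can be sketched as follows. The helper `chunk_phases` and the row counts are hypothetical; the logic mirrors the loop in the next cell, which advances a row counter by the chunk size and compares it with the limit before deciding whether to train or evaluate.

```python
def chunk_phases(total_rows, chunk_size, train_fraction=0.7):
    """Return the row limit and the phase ("train" or "evaluate") of each chunk."""
    limit = int(total_rows * train_fraction)
    phases = []
    current_row = 0
    while current_row < total_rows:
        current_row += chunk_size  # the counter is advanced before the comparison
        phases.append("train" if current_row < limit else "evaluate")
    return limit, phases


# Hypothetical file with 10.5 million rows, read in chunks of 1 million rows.
limit, phases = chunk_phases(10_500_000, 1_000_000)
print(limit)
print(phases)
```

Because the comparison is made per chunk, the switch to evaluation happens at a chunk boundary rather than exactly at the 70% row.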

During the preprocessing stage, we map every "BENIGN" label to 0 and every other label to 1, and split the labels from the rest of the data.

Rows containing "na" or infinite values are dropped, as are all socket features. This prevents the model from overfitting to socket data, which varies from network to network. We train the model on packet features alone.
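
The row filtering and label encoding described above can be shown on a toy example. This is a standard-library sketch of the idea rather than the pandas code used in the notebook; the helper `clean_rows`, the column names and the values are invented for illustration.

```python
import math


def clean_rows(rows, label_key="Label"):
    """Drop rows with NaN/inf features and encode BENIGN as 0, anything else as 1."""
    features, labels = [], []
    for row in rows:
        values = [v for k, v in row.items() if k != label_key]
        if not all(math.isfinite(v) for v in values):
            continue  # skip rows that contain a missing or infinite value
        features.append(values)
        labels.append(0 if row[label_key] == "BENIGN" else 1)
    return features, labels


# Toy rows with made-up column names and values.
rows = [
    {"duration": 10.0, "bytes_per_s": 250.0, "Label": "BENIGN"},
    {"duration": 3.0, "bytes_per_s": float("inf"), "Label": "Syn"},
    {"duration": float("nan"), "bytes_per_s": 12.0, "Label": "BENIGN"},
    {"duration": 7.0, "bytes_per_s": 90.0, "Label": "DrDoS_DNS"},
]

features, labels = clean_rows(rows)
print(features)
print(labels)
```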

Lastly, we normalize the data with min-max scaling, using the minimum and maximum of the training split.
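
A minimal sketch of this normalization, assuming (as in the cell below, where `tf.reduce_min` and `tf.reduce_max` are called without an axis) that a single minimum and maximum are taken over all training values and then reused for the test split. Plain Python lists stand in for the tensors, and the helper names and numbers are made up.

```python
def fit_min_max(rows):
    """Return the global minimum and maximum over every value in the rows."""
    flat = [value for row in rows for value in row]
    return min(flat), max(flat)


def scale(rows, lo, hi):
    """Apply (value - lo) / (hi - lo) to every value."""
    span = hi - lo
    return [[(value - lo) / span for value in row] for row in rows]


train = [[0.0, 5.0], [10.0, 2.0]]
test = [[5.0, 20.0]]

lo, hi = fit_min_max(train)  # statistics come from the training split only
train_scaled = scale(train, lo, hi)
test_scaled = scale(test, lo, hi)

print(train_scaled)
print(test_scaled)
```

Test values that exceed the training maximum end up above 1, since the statistics are not refitted on the test split.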



In [None]:
file_path = "all_data.csv" # the large file created

current_row = 0
limit = int(total_rows * 0.7)  # 70% of the rows are used for training (itself split into train/test), the remaining 30% for validation
for chunk in pd.read_csv(file_path, low_memory=False, chunksize=1000000):
    current_row += 1000000
    print("Current chunk: ", current_row)
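    # Treat infinities as missing values and drop every row that contains one
    # (the `use_inf_as_na` option is deprecated in recent pandas releases).
    chunk = chunk.replace([np.inf, -np.inf], np.nan).dropna()

    # Socket features and other identifiers that are removed before training.
    to_drop = ['Flow ID', ' Source IP', ' Source Port', ' Destination IP', ' Destination Port', ' Protocol',
               ' Timestamp', 'SimillarHTTP']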
    
    
    # Data preprocessing stage
    chunk = chunk.drop(columns=to_drop)

    chunk[' Label'] = (chunk[' Label'] != "BENIGN").astype(int)  # BENIGN -> 0, any attack -> 1
    raw_data = chunk.values

    # The last column contains the labels
    labels = raw_data[:, -1]

    # The remaining columns are the features
    data = raw_data[:, 0:-1]

    train_data, test_data, train_labels, test_labels = train_test_split(
        data, labels, test_size=0.25, random_state=42
    )

    min_val = tf.reduce_min(train_data)
    max_val = tf.reduce_max(train_data)

    train_data = (train_data - min_val) / (max_val - min_val)
    test_data = (test_data - min_val) / (max_val - min_val)

    train_data = tf.cast(train_data, tf.float32)
    test_data = tf.cast(test_data, tf.float32)

    train_labels = train_labels.astype(bool)
    test_labels = test_labels.astype(bool)

    # Labels are True for attack rows and False for BENIGN rows (see the encoding above),
    # so despite its name `normal_train_data` holds the rows labelled 1.
    normal_train_data = train_data[train_labels]  # this is fed to the autoencoder
    normal_test_data = test_data[test_labels]

    anomalous_train_data = train_data[~train_labels]  # rows labelled 0; not used here, but available for testing
    anomalous_test_data = test_data[~test_labels]

    # Train while we are within the first 70% of the rows; afterwards, start the evaluation phase.
    if current_row < limit:
        autoencoder.pretrained = False
        size_to_train = int(len(normal_train_data) * 0.1)
        pretraining_normal_train_data = normal_train_data[0:size_to_train]
        autoencoder.fit(pretraining_normal_train_data, pretraining_normal_train_data,
                        epochs=1,
                        batch_size=32,
                        validation_data=(test_data, test_data), callbacks=[callback],
                        shuffle=True)

        autoencoder.pretrained = True

        autoencoder.fit(normal_train_data, normal_train_data,
                        epochs=1,
                        batch_size=32,
                        validation_data=(test_data, test_data),
                        shuffle=True)

        my_predictions = autoencoder.predict(data)
        my_predictions = np.rint(my_predictions)

        print("Partial check")
        print("Accuracy = {}".format(accuracy_score(labels, my_predictions)))
        print("Precision = {}".format(precision_score(labels, my_predictions)))
        print("Recall = {}".format(recall_score(labels, my_predictions)))
        print("Saving model. Current row is ", current_row)
        print("Total rows: ", total_rows)
        print("70%: ", limit)
        autoencoder.save("AnomalyDetectorModel")
        print("Saving complete.")
        
        
    else:
        print("Training complete. Saving model with finished_training = True")
        autoencoder.finished_training = True
        autoencoder.save("AnomalyDetectorModel")
        print("Done.")

        print("Attempting to reconstruct model from save...")
        reconstructed_model = keras.models.load_model("AnomalyDetectorModel")
        print()
        print("Testing accuracy on unseen data:")
        print("Chunk:", current_row)

        predictions = reconstructed_model.predict(data)
        preds = np.rint(predictions)

        print("Accuracy = {}".format(accuracy_score(labels, preds)))
        print("Precision = {}".format(precision_score(labels, preds)))
        print("Recall = {}".format(recall_score(labels, preds)))


Future work should focus on a few things:

- Testing the model on different datasets.

- Adding finer classification - the data contains labels about the type of attacks. 

Results:

Accuracy = __0.9949557173639242__

Precision = __0.9949557173639242__

Recall = __1.0__
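
When reading these figures, it helps to recall how the three metrics relate to the confusion matrix. Accuracy and precision coincide, and recall equals 1.0, whenever a binary classifier makes no negative predictions. The sketch below shows that arithmetic with made-up counts; it does not reproduce the run above, and the helper `binary_metrics` is ours.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical chunk: 995 attack rows, 5 benign rows, every row predicted as attack.
accuracy, precision, recall = binary_metrics(tp=995, fp=5, tn=0, fn=0)
print(accuracy, precision, recall)
```

Comparing the metrics with the share of attack rows in the evaluated chunk helps separate the effect of class balance from what the model has learned.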
