## Botnet Detection with an Autoencoder Ensemble
21 June 2021  
This notebook was created for a course at Istanbul Technical University.

It is a continuation of our [preliminary experiments](https://www.kaggle.com/happyemoji/botnet-detection-with-an-autoencoder). Here we implement a simplified version of the autoencoder-ensemble-based anomaly detection described in the Kitsune paper [1].

In [None]:
import os
import numpy as np

from scipy.cluster.hierarchy import dendrogram, to_tree
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import layers, losses, Sequential
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

---
1. Dataset
2. Autoencoder Ensemble
3. Conclusion  
4. References
---
# 1. Dataset
N-BaIoT [4]  
https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT
- Nine IoT devices were used.
- The devices were 2 smart doorbells, 1 smart thermostat, 1 smart baby monitor, 4 security cameras and 1 webcam.
- Traffic was captured when the devices were in normal execution and after infection with malware.
- Mirai and BashLite (aka Gafgyt) malware were used.
- From the network traffic, 115 features were extracted as described in [1].

In [None]:
def load_nbaiot(filename):
    return np.genfromtxt(
        os.path.join("/kaggle/input/nbaiot-dataset", filename),
        delimiter=",",
        skip_header=1
    )

benign = load_nbaiot("1.benign.csv")
X_train = benign[:40000]
X_test0 = benign[40000:]
X_test1 = load_nbaiot("1.mirai.scan.csv")
X_test2 = load_nbaiot("1.mirai.ack.csv")
X_test3 = load_nbaiot("1.mirai.syn.csv")
X_test4 = load_nbaiot("1.mirai.udp.csv")
X_test5 = load_nbaiot("1.mirai.udpplain.csv")

In [None]:
print(X_train.shape, X_test0.shape, X_test1.shape, X_test2.shape,
      X_test3.shape, X_test4.shape, X_test5.shape)

---
# 2. Autoencoder Ensemble
The paper [1] describes Kitsune, an ensemble of autoencoders for network intrusion detection. Using an ensemble of small autoencoders instead of one big autoencoder, as was presented in [4], helps against the curse of dimensionality and makes training and execution computationally more efficient. Source code for a Python implementation is provided at https://github.com/ymirsky/Kitsune-py.

At the start, a number of samples are analyzed to group the features into disjoint subsets. This is done by calculating the pairwise correlation distances between all features of these samples. The subsets are selected with agglomerative hierarchical clustering. So each subset is a group of features that are highly inter-correlated. The maximum size of a subset is a parameter of the algorithm.
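
As a small illustration of the correlation distance, the following sketch builds four made-up features (the toy data and all names in it are assumptions, not part of N-BaIoT) and shows that correlated features end up close together, which is what lets the clustering group them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
b = rng.normal(size=n)

# Hypothetical toy features: columns 0 and 1 follow `a`, columns 2 and 3 follow `b`
toy = np.column_stack([
    a,
    a + 0.1 * rng.normal(size=n),
    b,
    b + 0.1 * rng.normal(size=n),
])

corr = np.corrcoef(toy.T)  # pairwise correlations, shape (4, 4)
# clip guards against tiny negative values caused by floating-point rounding
dist = np.sqrt(np.clip(1 - corr, 0, None))

print(np.round(dist, 2))
```

Features 0 and 1 (and likewise 2 and 3) should have a distance near 0, while features from different groups should have a distance near 1.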

After that, an autoencoder is created for each subset of features. These autoencoders form the ensemble layer. Each autoencoder sees only its subset of features of each sample. The reconstruction errors of all ensemble members form the input for the final autoencoder. That is, it learns how well the ensemble can reconstruct the training data. The final autoencoder's own reconstruction error is used for the output, i.e., surpassing a given threshold is reported as an anomaly.
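
The data flow can be sketched without any neural network. In the following, a hypothetical `MeanReconstructor` stands in for each autoencoder (it simply "reconstructs" every sample as the training mean), and the toy data, the feature subsets and the mean-plus-one-standard-deviation threshold are assumptions chosen for illustration. Only the wiring of ensemble layer, output layer and threshold is the point here.

```python
import numpy as np

class MeanReconstructor:
    """Stand-in for an autoencoder (illustration only):
    reconstructs every sample as the training mean."""
    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        return self

    def error(self, x):
        # per-sample mean squared reconstruction error
        return np.mean((x - self.mean_) ** 2, axis=1)

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=(500, 6))
toy_subsets = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]

# Ensemble layer: one model per feature subset
members = [MeanReconstructor().fit(train[:, s]) for s in toy_subsets]

def ensemble_errors(x):
    # shape (n_samples, n_subsets): one error column per ensemble member
    return np.column_stack(
        [m.error(x[:, s]) for m, s in zip(members, toy_subsets)]
    )

# Output layer: trained on the ensemble's reconstruction errors
output = MeanReconstructor().fit(ensemble_errors(train))
train_score = output.error(ensemble_errors(train))
threshold = train_score.mean() + train_score.std()

def is_anomaly(x):
    return output.error(ensemble_errors(x)) > threshold

shifted = rng.normal(5.0, 1.0, size=(100, 6))  # clearly unlike the training data
print(is_anomaly(train).mean(), is_anomaly(shifted).mean())
```

The first printed value is the fraction of training samples above the threshold, the second the fraction of shifted samples flagged as anomalies.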

We reimplement a basic version of this below.

## Clustering the Features
For the clustering phase, a part of the training data is split off.

We use `scikit-learn`'s Agglomerative Clustering. Building clusters with a given maximum size requires re-creating the linkage matrix "manually".

In [None]:
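def agglomerative_clustering(data):
    # sqrt makes this a proper distance metric; clip guards against tiny
    # negative values of 1 - r caused by floating-point rounding
    correlation_distance = np.sqrt(np.clip(1 - np.corrcoef(data.T), 0, None))
    ac = AgglomerativeClustering(
        n_clusters=None,
        # scikit-learn >= 1.2 calls this parameter `metric`;
        # older releases used `affinity="precomputed"`
        metric="precomputed",
        linkage="single",
        # a threshold of 0 makes the model build the full merge tree
        distance_threshold=0
    )
    ac.fit(correlation_distance)
    return ac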

feature_mapping_phase = 7777
ac = agglomerative_clustering(X_train[:feature_mapping_phase])

Next, to create the linkage matrix, we follow the tutorial at https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html.

In [None]:
def linkage_matrix(model):
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    return np.column_stack([model.children_, model.distances_, counts]).astype(float)

lm = linkage_matrix(ac)
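
To make the counting step concrete, here is a minimal sketch on a made-up merge list for four leaves. The arrays `children` and `distances` are hypothetical stand-ins for `model.children_` and `model.distances_`; each row of the result has the form (left child, right child, merge distance, number of leaves under the new node).

```python
import numpy as np

# Hypothetical merge list for 4 leaves (ids 0..3), laid out like `children_`:
# merge 0 joins leaves 0 and 1 -> node 4
# merge 1 joins leaves 2 and 3 -> node 5
# merge 2 joins nodes 4 and 5  -> node 6 (root)
children = np.array([[0, 1], [2, 3], [4, 5]])
distances = np.array([0.1, 0.2, 0.9])  # made-up merge heights
n_leaves = 4

counts = np.zeros(len(children))
for i, (left, right) in enumerate(children):
    for child in (left, right):
        # ids below n_leaves are original leaves; larger ids refer to earlier merges
        counts[i] += 1 if child < n_leaves else counts[child - n_leaves]

toy_lm = np.column_stack([children, distances, counts]).astype(float)
print(toy_lm)
```

The root row ends with a count equal to the number of leaves, which is a quick sanity check for the real linkage matrix as well.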

In [None]:
import matplotlib.pyplot as plt

dendrogram(lm)
plt.close("all")

With the algorithm described in the Kitsune paper, we split the dendrogram until no cluster is larger than a given size.

In [None]:
def find_subsets(tree, max_cluster_size=10):
    if tree.count <= max_cluster_size:
        return [np.array(tree.pre_order())]
    recursion1 = find_subsets(tree.get_left(), max_cluster_size)
    recursion2 = find_subsets(tree.get_right(), max_cluster_size)
    return recursion1+recursion2
    
subsets = find_subsets(to_tree(lm))

subsets
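
To see how the split behaves, the following sketch applies the same top-down recursion to a hand-made tree. It uses nested tuples in place of SciPy's tree nodes, and the tree itself and the helper names `leaves` and `split` are assumptions for illustration.

```python
def leaves(tree):
    """Return the leaf ids of a nested-tuple tree, left to right."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def split(tree, max_cluster_size):
    """Cut the tree top-down until every cluster fits the size limit."""
    members = leaves(tree)
    if len(members) <= max_cluster_size:
        return [members]
    left, right = tree
    return split(left, max_cluster_size) + split(right, max_cluster_size)

# Hypothetical dendrogram over 7 features; each tuple is one merge
toy_tree = (((0, 1), 2), ((3, 4), (5, 6)))

print(split(toy_tree, 3))
print(split(toy_tree, 2))
```

A smaller size limit cuts deeper into the tree, so it yields more and smaller feature subsets, and therefore more but smaller autoencoders in the ensemble.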

## Autoencoder Ensemble for Anomaly Detection
Here we implement the ensemble as outlined above. Details are on pages 8 and 9 in the Kitsune paper [1].

In [None]:
class Autoencoder(Model):
    def __init__(self, n):
        super(Autoencoder, self).__init__()
        self.encoder = Sequential([
            layers.Dense(n, activation="relu"),
            layers.Dense(int(0.75*n), activation="relu"),
        ])
        self.decoder = layers.Dense(n, activation="relu")
    
    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

def compile_and_train(ae, x):
    ae.compile(optimizer=Adam(learning_rate=0.01), loss='mse')
    ae.fit(
        x=x,
        y=x,
        # in reality, it is supposed to be an online algorithm, so
        # we make only 1 pass over the training data
        epochs=1
    )

class Ensemble:
    def __init__(self, feature_subsets):
        self.map = feature_subsets
        self.scaler_ensemble = MinMaxScaler()
        self.scaler_output = MinMaxScaler()
        self.ensemble_layer = []
        for subset in feature_subsets:
            ae = Autoencoder(len(subset))
            self.ensemble_layer += [ae]
        self.output_layer = Autoencoder(len(feature_subsets))
        
    def train(self, data):
        scaled = self.scaler_ensemble.fit_transform(data)
        loss_ensemble = []
        
        for i, (features, ae) in enumerate(zip(self.map, self.ensemble_layer)):
            x = scaled[:, features]
            print(f"##**~~__ Autoencoder {i+1}/{len(self.map)} for {len(features)} dimensions")
            compile_and_train(ae, x)
            loss_ensemble += [losses.mse(x, ae(x))]
            
        # Because of the above loop, loss_ensemble now has shape
        # (n_autoencoders, n_samples). But for the output layer, the previous
        # layer outputs are actually treated as features. Therefore transpose
        loss_ensemble = self.scaler_output.fit_transform(np.array(loss_ensemble).T)
        print(f"##**~~__ Output Autoencoder for {loss_ensemble.shape[1]} dimensions")
        compile_and_train(self.output_layer, loss_ensemble)
        loss_out = losses.mse(loss_ensemble, self.output_layer(loss_ensemble))
        self.threshold = np.mean(loss_out)+np.std(loss_out)
    
    def predict(self, data):
        scaled = self.scaler_ensemble.transform(data)
        loss_ensemble = []
        
        for features, ae in zip(self.map, self.ensemble_layer):
            x = scaled[:, features]
            loss_ensemble += [losses.mse(x, ae(x))]
            
        loss_ensemble = self.scaler_output.transform(np.array(loss_ensemble).T)
        loss_out = losses.mse(loss_ensemble, self.output_layer(loss_ensemble))
        return loss_out > self.threshold

In [None]:
ensemble = Ensemble(subsets)
ensemble.train(X_train[feature_mapping_phase:])

In [None]:
test_data = [X_test0, X_test1, X_test2, X_test3, X_test4, X_test5]

for i, x in enumerate(test_data):
    print(i)
    print(f"Shape of data: {x.shape}")
    outcome = ensemble.predict(x)
    print(f"Detected anomalies: {np.mean(outcome)*100}%")
    print()
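
Since `X_test0` contains only benign traffic and the other test sets contain only Mirai traffic, the percentage printed for set 0 is a false alarm rate, while the others are detection rates. A minimal sketch of that reading, with a hypothetical helper `detection_summary` and made-up flags instead of real output from `ensemble.predict`:

```python
import numpy as np

def detection_summary(benign_flags, attack_flags):
    """Summarise boolean anomaly flags for one benign and one attack set."""
    benign_flags = np.asarray(benign_flags, dtype=bool)
    attack_flags = np.asarray(attack_flags, dtype=bool)
    return {
        # benign samples flagged as anomalies are false alarms
        "false_positive_rate": float(benign_flags.mean()),
        # attack samples flagged as anomalies are correct detections
        "true_positive_rate": float(attack_flags.mean()),
    }

# Made-up flags for illustration only, not results from this notebook
summary = detection_summary(
    benign_flags=[False, False, True, False],
    attack_flags=[True, True, True, False],
)
print(summary)
```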

---
# 3. Conclusion

The results are almost exactly the same as with one big autoencoder; see [our example](https://www.kaggle.com/happyemoji/botnet-detection-with-an-autoencoder).

The following will be our next steps.

### Reduced Feature Set
In [2], it is suggested to use a subset of 23 features instead of all 115. However, that work used different, supervised algorithms. Are these 23 features also sufficient for the unsupervised anomaly detection system?

- Compare performance of the full and the reduced feature set using
    - a shallow autoencoder,
    - a deep autoencoder and
    - an ensemble of autoencoders.

### Datasets for Comparison
MedBIoT [3]  
https://cs.taltech.ee/research/data/medbiot/
- Three real and 80 simulated IoT devices were used.
- The real devices were 2 smart switches and 1 smart light bulb.
- Traffic was captured when the devices were in normal execution and after infection with malware.
- Mirai, BashLite, and Torii malware were used.
- There is an unprocessed version of the dataset (pcap files) and one where 115 features were extracted as described in [1].

IoT-23 [5]  
https://www.stratosphereips.org/datasets-iot23
- Three real IoT devices and a Raspberry Pi were used.
- The real devices were 1 smart light bulb, 1 smart doorbell and 1 smart speaker / virtual assistant.
- Traffic was captured when the IoT devices were in normal execution and when the Raspberry Pi was infected with malware.
- Eleven malware families -- including Mirai, Torii and Gafgyt (aka BashLite) -- were used across 20 different captures.

`pcap` files are provided, so the feature extraction has to be performed before comparing with the other datasets.

### Algorithmic Improvements
Have there been any advances in anomaly detection with autoencoders that can be incorporated? For example, can we use Variational Autoencoders for better performance?

# 4. References
[1] Mirsky, Yisroel, et al. "Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection." NDSS (2018). https://arxiv.org/abs/1802.09089v2  
[2] Alhowaide, Alaa, et al. "Towards the design of real-time autonomous IoT NIDS." Cluster Computing (2021): 1-14. https://doi.org/10.1007/s10586-021-03231-5  
[3] Guerra-Manzanares, Alejandro, et al. "MedBIoT: Generation of an IoT Botnet Dataset in a Medium-sized IoT Network." ICISSP 1 (2020): 207-218. https://doi.org/10.5220/0009187802070218  
[4] Meidan, Yair, et al. "N-BaIoT—Network-based Detection of IoT Botnet Attacks Using Deep Autoencoders." IEEE Pervasive Computing 17.3 (2018): 12-22. https://arxiv.org/abs/1805.03409  
[5] Garcia, Sebastian, et al. "IoT-23: A labeled dataset with malicious and benign IoT network traffic." (2020). (Version 1.0.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4743746