# Unsupervised anomaly detection with RNN autoencoders based on LSTM cells, with some statistical data filtering

### Case: The Stolen Szechuan Sauce

Since the logs are presented in chronological order, it is reasonable to assume that an anomaly is not a single event but a sequence of events. This is why we use a recurrent neural network (RNN) to detect anomalies. The RNN is an autoencoder, which means that it learns to reconstruct the input sequence. The reconstruction error is then used to detect anomalies: entries that the model reconstructs poorly are the candidates.
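
To make the idea of scoring by reconstruction error concrete, here is a minimal NumPy sketch on toy data. The helper names `reconstruction_errors` and `flag_anomalies`, and the "mean plus three standard deviations" rule, are assumptions made for illustration; they are not taken from this notebook's model.

```python
import numpy as np


def reconstruction_errors(inputs, reconstructions):
    """Mean squared error per sample, averaged over all remaining axes."""
    diff = np.asarray(inputs, dtype=np.float64) - np.asarray(reconstructions, dtype=np.float64)
    return (diff ** 2).reshape(len(diff), -1).mean(axis=1)


def flag_anomalies(errors, n_sigmas=3.0):
    """Flag samples whose error exceeds mean + n_sigmas * std (an assumed rule)."""
    errors = np.asarray(errors)
    threshold = errors.mean() + n_sigmas * errors.std()
    return errors > threshold


# Toy data: 21 samples of shape (1, 4); all are reconstructed almost
# perfectly except sample 7, which is reconstructed badly.
rng = np.random.default_rng(0)
x = rng.random((21, 1, 4))
x_hat = x + 0.01
x_hat[7] = x[7] + 1.0

errors = reconstruction_errors(x, x_hat)
flags = flag_anomalies(errors)
print(np.flatnonzero(flags))  # indices of the flagged samples
```

A fixed threshold is only one option; sorting the entries by error and inspecting the top of the list works just as well when there are no labels to tune against.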

#### The implementation is based on TensorFlow

First, we load and preprocess the data.

In [1]:
import numpy as np
import pandas as pd


# loading the data
data = pd.read_csv('./data/dc_file_modified2.csv')

# filtering out the unnecessary columns
sub_data = data[[
            'inode', 
            'M',
            'A',
            'C',
            'B', 
            'file_stat',
            'NTFS_file_stat',
            'file_entry_shell_item',
            'NTFS_USN_change', 'filef',
            'directory',
            'link', 
            'dir_appdata', 
            'dir_win', 
            'dir_user',
            'dir_other',
            'file_executable',
            'file_graphic',
            'file_documents',
            'file_ps', 
            'file_other', 
            'mft', 
            'lnk_shell_items',
            'olecf_olecf_automatic_destinations/lnk/shell_items',
            'winreg_bagmru/shell_items',
            'usnjrnl', 
            'is_allocated1',
            'is_allocated0',
            'filename'
            ]].copy()  # explicit copy, so the assignment below does not write to a view of `data`

# reshaping the columns
sub_data["inode + filename"] = sub_data['inode'].astype(str) + " - " + sub_data["filename"]
inodes = sub_data['inode'].astype(int).to_list()
sub_data = sub_data.drop(['inode', 'filename'], axis=1)



Next, we filter out the noise: we ignore all entries with inodes below 100, which mostly belong to system files.
We also drop inode `84656`, which is responsible for journaling.

In [2]:
file_names = sub_data['inode + filename'].to_list()
sub_data = sub_data.drop(['inode + filename'], axis=1)
sub_data = sub_data.to_numpy(dtype=np.float32)  # converting to NumPy

boring_inodes = set(range(100)) | {84656}

# boolean mask: True for the rows whose inode is not in the ignore set
keep = np.array([inode not in boring_inodes for inode in inodes])

sub_data = sub_data[keep]
file_names = [name for name, kept in zip(file_names, keep) if kept]

## Time to build the model

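We start by defining a minimalistic autoencoder layer, which uses LSTM cells. The layer itself lives in the `online_autoencoder` module, which is imported when the model is created.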
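
The source of `OnlineLSTMAutoencoder` is not shown in this notebook, so the following is only a generic sketch of what an LSTM autoencoder can look like in Keras, not the layer used here. The function name `build_lstm_autoencoder` is hypothetical; the encoder compresses the sequence into one vector of size `encoding_dim`, and the decoder expands it back to the original shape.

```python
import numpy as np
import tensorflow as tf


def build_lstm_autoencoder(timesteps, features, encoding_dim):
    """Generic LSTM autoencoder (illustrative, not OnlineLSTMAutoencoder)."""
    inputs = tf.keras.Input(shape=(timesteps, features))
    # encoder: the last LSTM state summarises the whole input sequence
    encoded = tf.keras.layers.LSTM(encoding_dim)(inputs)
    # repeat the summary once per timestep so the decoder can unroll it
    repeated = tf.keras.layers.RepeatVector(timesteps)(encoded)
    # decoder: produce one hidden vector per timestep
    decoded = tf.keras.layers.LSTM(encoding_dim, return_sequences=True)(repeated)
    # map every hidden vector back to the original feature space
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(features)
    )(decoded)
    return tf.keras.Model(inputs=inputs, outputs=outputs)


# toy check: the output has the same shape as the input
toy_model = build_lstm_autoencoder(timesteps=1, features=4, encoding_dim=8)
toy_output = toy_model(np.zeros((2, 1, 4), dtype=np.float32))
print(tuple(toy_output.shape))
```

Training such a model with the input as its own target makes it a plain reconstruction autoencoder; training it against the following entry turns it into a next-step predictor.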

Reshaping the data to fit the model.

In [3]:
input_data = sub_data.reshape((sub_data.shape[0], 1, sub_data.shape[1]))

# shifting the targets by 1, so that the model can predict the next value;
# the last row is repeated so that inputs and targets keep the same length
target_data = np.concatenate((sub_data[1:], sub_data[-1:]), axis=0)
target_data = target_data.reshape((target_data.shape[0], 1, target_data.shape[1]))
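
As a small illustration of what shifting the targets by one step means, here is a sketch on a toy array. The helper name `shift_targets` is hypothetical, and repeating the last row to keep the lengths equal is an assumption made for this example.

```python
import numpy as np


def shift_targets(data):
    """Row i of the result holds data[i + 1]; the last row is repeated."""
    data = np.asarray(data)
    return np.concatenate((data[1:], data[-1:]), axis=0)


# toy data with 4 entries and 2 features: rows [0 1], [2 3], [4 5], [6 7]
toy = np.arange(8, dtype=np.float32).reshape(4, 2)
targets = shift_targets(toy)

# add the length-1 time axis, giving the shape (samples, 1, features)
targets_3d = targets.reshape((targets.shape[0], 1, targets.shape[1]))
print(targets)
```

With targets built this way, the error computed for input `i` describes how surprising entry `i + 1` was, and the last input has no real successor.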


In [4]:
import tensorflow as tf
from online_autoencoder import OnlineLSTMAutoencoder, ReconstructionLoss

# Creating and compiling the model
inputs = tf.keras.Input(shape=(1, input_data.shape[-1]))
outputs = OnlineLSTMAutoencoder(
    timesteps=50, features=input_data.shape[-1], encoding_dim=248,
)(inputs)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=ReconstructionLoss(),
)

Finally, we train the model.

Since we use batches of size 1 to preserve the sequence order, we train the model for only one epoch.

In [5]:
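# Keras defaults to batch_size=32 and shuffle=True, so both are set explicitly
model.fit(x=input_data, y=target_data, epochs=1, batch_size=1, shuffle=False)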



<keras.callbacks.History at 0x2b7349a2690>

We then write the error (MSE) of each prediction, together with the corresponding file name, to `./outputs/anomalies.csv`.

In [11]:
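import csv

loss = tf.keras.losses.MeanSquaredError()
predictions = model.predict(input_data)

# newline='' is what the csv module expects for files it writes to
with open('./outputs/anomalies.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    for i in range(predictions.shape[0]):
        mse = loss(target_data[i], predictions[i])

        try:
            # each error is attributed to the entry at position i + 1
            writer.writerow([mse.numpy(), file_names[i + 1].strip()])
        except (IndexError, AttributeError):
            # IndexError: the last prediction has no following entry;
            # AttributeError: the entry has no string name (e.g. a missing filename)
            continue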
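
Once the scores are written, the most suspicious entries are the ones with the highest error. Below is a minimal sketch of ranking such a file with the standard library only. The helper name `top_anomalies` is hypothetical, and the rows are made-up placeholders standing in for the real contents of `./outputs/anomalies.csv`.

```python
import csv
import io


def top_anomalies(rows, k=3):
    """Return the k (score, name) pairs with the highest score, highest first."""
    scored = [(float(score), name) for score, name in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# made-up rows in the same "score,name" layout as the output file
sample = io.StringIO(
    "0.0012,101 - fileA\n"
    "0.4310,2045 - fileB\n"
    "0.0009,333 - fileC\n"
    "0.1270,512 - fileD\n"
)

ranked = top_anomalies(csv.reader(sample), k=2)
for score, name in ranked:
    print(f"{score:.4f}  {name}")
```

To rank the real output, the `io.StringIO` object would be replaced by the opened CSV file.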


