<a href="https://colab.research.google.com/github/jyonalee/Insider-Threat-and-Anomaly-Detection-from-User-Activities/blob/master/Anomaly_Detection_LSTM_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# install awscli to download the data
!pip3 install awscli --upgrade --user

# download the data and save it in `data`
!mkdir -p data
!~/.local/bin/aws s3 sync --no-sign-request --region us-west-1 "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/" data/.

# Anomaly Detection with LSTM in Network Traffic Data

This project explores anomaly detection in network traffic, using an RNN-LSTM as the model.

The CSE-CIC-IDS2018 dataset can be obtained [here](https://www.unb.ca/cic/datasets/ids-2018.html).

This is part of the capstone project for the Machine Learning Nanodegree from Udacity.

## Data Exploration

---



In [1]:
import pandas as pd
import numpy as np
import os
import glob

from lib.helper_functions import *

In [2]:
# if saved dataframe file exists, load
# if dataframe isn't saved, load raw csv file and save the dataframe
exists = os.path.isfile('flowmeter_dataframe.pkl')
if exists:
    df = pd.read_pickle('flowmeter_dataframe.pkl')
else:
    # load data and do preliminary cleaning
    directory = '/home/jlee/cse-cic-ids2018/Processed Traffic Data for ML Algorithms'

    # (filename, whether rows repeating the header must be dropped)
    files = [
        ('Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv', True),
        ('Friday-16-02-2018_TrafficForML_CICFlowMeter.csv', True),
        ('Friday-02-03-2018_TrafficForML_CICFlowMeter.csv', False),
        ('Friday-23-02-2018_TrafficForML_CICFlowMeter.csv', False),
        ('Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv', False),
        ('Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv', False),
        ('Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv', False),
        ('Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv', True),
        ('Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv', False),
    ]

    frames = []
    for filename, drop_header_rows in files:
        part = pd.read_csv(os.path.join(directory, filename))
        if drop_header_rows:
            # drop rows whose Protocol value is the literal column name
            part = part[part['Protocol'] != 'Protocol']
        frames.append(optimize_and_clean_df(part))

    # combine dataframes into one
    df = pd.concat(frames, ignore_index=True)

    # save dataframe to file for future use
    pd.to_pickle(df, 'flowmeter_dataframe.pkl')

    # clean up intermediary dataframes to free memory
    del frames, part
    
    ### this file is significantly larger (~4 GB CSV, which crashes a 16 GB machine with an out-of-memory error), so it is excluded for now
    # filepath = os.path.join(directory,'Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv')
    # df10 = pd.read_csv(filepath)
    # df10 = df10[df10['Protocol'] != 'Protocol']
    # df10 = optimize_and_clean_df(df10)

    # df = pd.concat([df,df10], ignore_index=True)
    # del df10
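
The excluded file could possibly be read in pieces instead of in a single call. The following is a minimal sketch of chunked reading with `pd.read_csv(..., chunksize=...)`, run on a tiny in-memory CSV rather than the real file. `load_csv_in_chunks` and the `chunksize` value are hypothetical and not part of `lib.helper_functions`, and the notebook's own `optimize_and_clean_df` is not applied here.

```python
import io

import pandas as pd


def load_csv_in_chunks(source, chunksize=500_000):
    """Read a CSV piece by piece, dropping repeated header rows per chunk.

    Hypothetical helper for illustration; not part of lib.helper_functions.
    """
    pieces = []
    # dtype=str keeps column types consistent across chunks even when
    # header rows are repeated inside the data
    for chunk in pd.read_csv(source, chunksize=chunksize, dtype=str):
        # drop rows whose Protocol value is the literal column name
        chunk = chunk[chunk['Protocol'] != 'Protocol']
        pieces.append(chunk)
    return pd.concat(pieces, ignore_index=True)


# tiny in-memory stand-in for a CSV with a repeated header row
sample_csv = io.StringIO(
    "Dst Port,Protocol,Label\n"
    "80,6,Benign\n"
    "Dst Port,Protocol,Label\n"
    "443,6,Benign\n"
    "53,17,Benign\n"
)
sample_df = load_csv_in_chunks(sample_csv, chunksize=2)
print(len(sample_df))
```

Chunking alone only bounds the size of each read. To lower peak memory on the real file, each chunk would presumably need to be reduced (for example, downcast to smaller dtypes) before it is appended.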

In [3]:
# total memory usage of the dataframe in MiB
df.memory_usage().sum() / 1024**2

2338.5255813598633

In [5]:
len(df)

8284195

In [11]:
# sort flows chronologically
df = df.sort_values(by=['Timestamp'])

In [12]:
# keep only rows with a timestamp after 2018-01-01 and rebuild the index
df = df[df['Timestamp'] > pd.to_datetime('2018-01-01')].reset_index(drop=True)

In [13]:
# get count of each label
print(df['Label'].value_counts())

Benign                      6112137
DDOS attack-HOIC             686012
DoS attacks-Hulk             461912
Bot                          286191
FTP-BruteForce               193360
SSH-Bruteforce               187589
Infilteration                161934
DoS attacks-SlowHTTPTest     139890
DoS attacks-GoldenEye         41508
DoS attacks-Slowloris         10990
DDOS attack-LOIC-UDP           1730
Brute Force -Web                611
Brute Force -XSS                230
SQL Injection                    87
Name: Label, dtype: int64


In [14]:
# get the proportion of each label
print(df['Label'].value_counts()/len(df))

Benign                      0.737808
DDOS attack-HOIC            0.082810
DoS attacks-Hulk            0.055758
Bot                         0.034547
FTP-BruteForce              0.023341
SSH-Bruteforce              0.022644
Infilteration               0.019547
DoS attacks-SlowHTTPTest    0.016886
DoS attacks-GoldenEye       0.005011
DoS attacks-Slowloris       0.001327
DDOS attack-LOIC-UDP        0.000209
Brute Force -Web            0.000074
Brute Force -XSS            0.000028
SQL Injection               0.000011
Name: Label, dtype: float64


In essence, 73.8% of the data points in this dataset are 'Benign', while the remaining 26.2% are some form of malicious attack.
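
The benign versus malicious split can be checked by collapsing the attack labels into one group. This is a minimal sketch that reuses the label counts printed above instead of the full dataframe; the `Malicious` group name is an illustrative choice.

```python
import pandas as pd

# label counts copied from the value_counts() output in this notebook
label_counts = pd.Series({
    'Benign': 6112137,
    'DDOS attack-HOIC': 686012,
    'DoS attacks-Hulk': 461912,
    'Bot': 286191,
    'FTP-BruteForce': 193360,
    'SSH-Bruteforce': 187589,
    'Infilteration': 161934,
    'DoS attacks-SlowHTTPTest': 139890,
    'DoS attacks-GoldenEye': 41508,
    'DoS attacks-Slowloris': 10990,
    'DDOS attack-LOIC-UDP': 1730,
    'Brute Force -Web': 611,
    'Brute Force -XSS': 230,
    'SQL Injection': 87,
})

# collapse every non-benign label into a single 'Malicious' group
is_benign = label_counts.index == 'Benign'
binary_counts = pd.Series({
    'Benign': label_counts[is_benign].sum(),
    'Malicious': label_counts[~is_benign].sum(),
})

# share of each group in the whole dataset
binary_share = binary_counts / binary_counts.sum()
print(binary_share.round(3))
```

On the full dataframe, the same grouping could be obtained with `(df['Label'] == 'Benign').value_counts(normalize=True)`.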

In [15]:
df.head()

| | Dst Port | Protocol | Timestamp | Flow Duration | Tot Fwd Pkts | Tot Bwd Pkts | TotLen Fwd Pkts | TotLen Bwd Pkts | Fwd Pkt Len Max | Fwd Pkt Len Min | ... | Fwd Seg Size Min | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3389 | 6 | 2018-02-14 01:00:00 | 1671932 | 8 | 7 | 1144 | 1581 | 677 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | Benign |
| 1 | 3389 | 6 | 2018-02-14 01:00:00 | 3641507 | 8 | 10 | 1148 | 1581 | 677 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | Benign |
| 2 | 80 | 6 | 2018-02-14 01:00:00 | 89 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | Benign |
| 3 | 3389 | 6 | 2018-02-14 01:00:00 | 4363661 | 8 | 11 | 1148 | 1581 | 677 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | Benign |
| 4 | 3389 | 6 | 2018-02-14 01:00:00 | 1297112 | 8 | 7 | 1138 | 1581 | 677 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | Benign |


In [None]:
# todo:
# more stats
# plot timeline of events
# determine what is a normal sequence vs not normal sequence and visualize if possible
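
One possible starting point for the timeline item is to count flows per label in fixed time bins. This is a minimal sketch on a small synthetic frame that only shares the `Timestamp` and `Label` column names with `df`; the hourly bin size and the sample rows are assumptions for illustration.

```python
import pandas as pd

# small synthetic stand-in for df with the same column names
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2018-02-14 01:00:00', '2018-02-14 01:20:00',
        '2018-02-14 02:05:00', '2018-02-14 02:10:00',
        '2018-02-14 02:45:00',
    ]),
    'Label': ['Benign', 'Benign', 'FTP-BruteForce', 'FTP-BruteForce', 'Benign'],
})

# round each timestamp down to the start of its hour
hour_bin = toy['Timestamp'].dt.floor('60min')

# count flows per label per hour; rows are hours, columns are labels
timeline = pd.crosstab(hour_bin, toy['Label'])
print(timeline)
```

With matplotlib installed, `timeline.plot()` would draw one line per label over time, which may help show when each attack type occurs.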

In [None]:
# todo:
# with the definition of a `normal sequence`, need to transform & process the data accordingly before training with LSTM
# with the baseline algorithms for anomaly detection, should be fine to train with given data as-is
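
One common way to prepare time-ordered flow records for an LSTM is to slice them into fixed-length, overlapping windows. This is a minimal sketch of that idea, not the notebook's eventual definition of a normal sequence; `make_windows`, the window length, and the random feature matrix are all hypothetical.

```python
import numpy as np


def make_windows(features, window_len, step=1):
    """Stack overlapping windows into shape (n_windows, window_len, n_features).

    Hypothetical helper for illustration only.
    """
    n_rows = features.shape[0]
    if n_rows < window_len:
        # not enough rows for even one window
        return np.empty((0, window_len, features.shape[1]), dtype=features.dtype)
    starts = range(0, n_rows - window_len + 1, step)
    return np.stack([features[s:s + window_len] for s in starts])


# stand-in for time-sorted numeric flow features: 10 flows, 4 features each
rng = np.random.default_rng(0)
flow_features = rng.normal(size=(10, 4))

windows = make_windows(flow_features, window_len=3)
print(windows.shape)  # (8, 3, 4)
```

The resulting three-dimensional array follows the (samples, timesteps, features) layout that recurrent layers typically expect. How each window would be labelled is left open here.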