# Setting the stage for our anomaly detection experiments

## Data gathering strategy
- The capture period ran from 9 a.m. on Monday to 5 p.m. on Friday, covering 5 days.
- Monday is the normal day and contains only benign traffic.
- The implemented attacks include Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet and DDoS. They were executed in both the morning and the afternoon on Tuesday, Wednesday, Thursday and Friday.
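
The schedule above can be sanity-checked by tabulating labels per weekday. The following is a minimal sketch on a made-up toy frame, not the real capture: the `toy` rows are invented for illustration, and only the `Timestamp` and `Label` column names are taken from the dataset preview further down.

```python
import pandas as pd

# Toy flows standing in for the real capture; timestamps and labels are invented.
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2017-07-03 09:15:00",  # Monday
        "2017-07-03 16:40:00",  # Monday
        "2017-07-04 10:05:00",  # Tuesday
        "2017-07-04 14:30:00",  # Tuesday
        "2017-07-07 15:55:00",  # Friday
    ]),
    "Label": ["BENIGN", "BENIGN", "FTP-Patator", "BENIGN", "DDoS"],
})

# Count flows per (weekday, label) pair; missing combinations become 0.
weekday = toy["Timestamp"].dt.day_name()
per_day = toy.groupby([weekday, "Label"]).size().unstack(fill_value=0)

# Labels observed on Monday: the strategy says this should be benign only.
monday_labels = set(toy.loc[weekday == "Monday", "Label"])

print(per_day)
print(monday_labels)
```

On the full dataset, a non-benign label appearing in the Monday set would indicate either a labelling issue or a timestamp parsing problem worth investigating before training.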

In [16]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
import sys
from pprint import pprint
import seaborn as sns
import time
import os
import math

In [17]:
# # In case dask is needed
# Keep an eye on [Dask's best practices](https://docs.dask.org/en/stable/best-practices.html)!
# import dask.dataframe as dd
# import dask.array as da
# import dask.bag as db
# from dask.distributed import LocalCluster

# # Setting up dask cluster
# cluster = LocalCluster()
# client = cluster.get_client()
# print(f'Dask dashboard at: {client.dashboard_link}')
# print(
#     f"For an explanation on how to interpret the dashboard: https://docs.dask.org/en/stable/dashboard.html")

## Data Loading and Exploring

### Notes 
- Confirmed that all .csv files have the same features.
- From the literature review (see README), we know that the original dataset was flawed. We use the corrected version and take the "Attempted Category" column as our gold label.


In [18]:
# Path to dataset folder
path = "../data/raw/"

# Getting all file paths
paths = []
for dirname, _, filenames in os.walk(path):
    for filename in filenames:
        if filename.endswith(".csv"):
            paths.append(os.path.join(dirname, filename))
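
The note that all .csv files share the same features can be re-checked cheaply by reading only the header of each file. This is a sketch of one possible check; the helper name `find_schema_mismatches` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def find_schema_mismatches(csv_paths):
    """Map each file whose header differs from the first file to the differing columns.

    A file whose columns match but appear in a different order is also
    reported, with an empty set of differing columns.
    """
    if not csv_paths:
        return {}
    # nrows=0 parses only the header row, so this stays cheap for large files.
    reference = list(pd.read_csv(csv_paths[0], nrows=0).columns)
    mismatches = {}
    for p in csv_paths[1:]:
        columns = list(pd.read_csv(p, nrows=0).columns)
        if columns != reference:
            # Symmetric difference: columns present in one file but not the other.
            mismatches[p] = set(columns) ^ set(reference)
    return mismatches
```

Calling `find_schema_mismatches(paths)` should return an empty dict if the files really do share one schema.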

### Exploring

In [19]:
# Load only a few rows of a single file to explore the data structure
df_proto = pd.read_csv(paths[0], nrows=10)
df_proto.head()

| Unnamed: 0 | id | Flow ID | Src IP | Src Port | Dst IP | Dst Port | Protocol | Timestamp | Flow Duration | Total Fwd Packet | ... | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | ICMP Code | ICMP Type | Total TCP Flow Time | Label | Attempted Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 192.168.10.5-192.168.10.3-49159-445-6 | 192.168.10.5 | 49159 | 192.168.10.3 | 445 | 6 | 2017-07-04 11:53:44.398274 | 90030854 | 10 | ... | 57 | 29987510.0 | 35592.5 | 30013373 | 29946916 | -1 | -1 | 90030854 | BENIGN | -1 |
| 1 | 2 | 8.6.0.1-8.0.6.4-0-0-0 | 8.6.0.1 | 0 | 8.0.6.4 | 0 | 0 | 2017-07-04 11:54:12.355218 | 106007973 | 26 | ... | 19 | 19816840.0 | 8154881.0 | 27220170 | 7234941 | -1 | -1 | 0 | BENIGN | -1 |
| 2 | 3 | 192.168.10.5-192.168.10.3-123-123-17 | 192.168.10.5 | 123 | 192.168.10.3 | 123 | 17 | 2017-07-04 11:54:32.240412 | 64015367 | 4 | ... | 139 | 64015130.0 | 0.0 | 64015127 | 64015127 | -1 | -1 | 0 | BENIGN | -1 |
| 3 | 4 | 192.168.10.3-192.168.10.1-60280-53-17 | 192.168.10.3 | 60280 | 192.168.10.1 | 53 | 17 | 2017-07-04 11:55:07.615878 | 46870 | 1 | ... | 0 | 0.0 | 0.0 | 0 | 0 | -1 | -1 | 0 | BENIGN | -1 |
| 4 | 5 | 192.168.10.3-192.168.10.1-61995-53-17 | 192.168.10.3 | 61995 | 192.168.10.1 | 53 | 17 | 2017-07-04 11:54:12.427035 | 62958 | 1 | ... | 0 | 0.0 | 0.0 | 0 | 0 | -1 | -1 | 0 | BENIGN | -1 |


### Preprocessing

In [10]:
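# Read every CSV and concatenate once at the end.
# Concatenating inside the loop re-copies the growing frame on each iteration,
# and reusing `path` as the loop variable would overwrite the dataset folder path.
frames = [pd.read_csv(csv_path) for csv_path in tqdm(paths)]
df = pd.concat(frames, ignore_index=True)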

100%|██████████| 5/5 [00:14<00:00,  3.00s/it]


In [29]:
# Printing all unique labels
unique_labels = df['Label'].unique()
print(f'Unique labels:\n{unique_labels}')

Unique labels:
['BENIGN' 'FTP-Patator - Attempted' 'FTP-Patator' 'SSH-Patator'
 'SSH-Patator - Attempted' 'Web Attack - Brute Force - Attempted'
 'Web Attack - Brute Force' 'Infiltration - Attempted' 'Infiltration'
 'Infiltration - Portscan' 'Web Attack - XSS - Attempted'
 'Web Attack - XSS' 'Web Attack - SQL Injection - Attempted'
 'Web Attack - SQL Injection' 'DoS Slowloris' 'DoS Slowloris - Attempted'
 'DoS Slowhttptest' 'DoS Slowhttptest - Attempted' 'DoS Hulk'
 'DoS Hulk - Attempted' 'DoS GoldenEye' 'Heartbleed'
 'DoS GoldenEye - Attempted' 'Botnet - Attempted' 'Botnet' 'Portscan'
 'DDoS']


In [27]:
[label for label in unique_labels if 'Attempted' not in label]

['BENIGN',
 'FTP-Patator',
 'SSH-Patator',
 'Web Attack - Brute Force',
 'Infiltration',
 'Infiltration - Portscan',
 'Web Attack - XSS',
 'Web Attack - SQL Injection',
 'DoS Slowloris',
 'DoS Slowhttptest',
 'DoS Hulk',
 'DoS GoldenEye',
 'Heartbleed',
 'Botnet',
 'Portscan',
 'DDoS']

In [None]:
# Dropping all rows that belong to the attempted attacks
# (labels ending in "- Attempted", as listed in the unique labels above)
df = df[~df['Label'].str.contains('Attempted')]

In [None]:
# Adding normal/anomalous label: 1 = anomalous, 0 = benign
# (plain pandas here; the `meta` argument is only needed with dask)
df['ad_label'] = (df['Label'] != 'BENIGN').astype('int64')
df.head()

In [None]:
df.columns
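
To make the intended effect of the last preprocessing steps concrete, here is a minimal sketch on a toy frame. It assumes plain pandas and that attempted attacks are identified by the substring "Attempted" in `Label`; the `toy` and `kept` names are illustrative only.

```python
import pandas as pd

# Toy frame with label values taken from the unique labels printed above.
toy = pd.DataFrame({
    "Label": [
        "BENIGN",
        "FTP-Patator",
        "FTP-Patator - Attempted",
        "DDoS",
        "BENIGN",
    ]
})

# Drop rows whose label marks an attempted (unsuccessful) attack.
kept = toy[~toy["Label"].str.contains("Attempted")].reset_index(drop=True)

# Binary anomaly label: 1 for any attack, 0 for benign traffic.
kept["ad_label"] = (kept["Label"] != "BENIGN").astype("int64")

print(kept)
```

An alternative would be to filter on the "Attempted Category" column instead of the label text, but the meaning of its values is not documented in this notebook, so the sketch does not rely on it.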