## CICIDS-2017 dataset labelling

**Credit**: data labelling inspired by G. Engelen's work, [_Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study_](https://doi.org/10.1109/SPW53761.2021.00009), whose code is hosted on [GitHub](https://github.com/GintsEngelen/WTMC2021-Code/blob/main/labelling_CSV_flows.py).

In [1]:
import os
import pandas as pd

from datetime import datetime

### 1. Data loading

**Note**: currently only Friday is supported.

In [2]:
exp_day = "friday"

By default, `conn.log` contains 21 fields, listed in the `#fields` line of the file header (see [here](https://f.hubspotusercontent00.net/hubfs/8645105/Corelight_May2021/Pdf/002_CORELIGHT_080420_ZEEK_LOGS_US_ONLINE.pdf) or [there](https://www.icir.org/vern/cs261n-Sp20/slides/Protocols.pdf)):
```sh
$ head -n 8 friday_zeek_logs/conn.log 
#separator \x09
#set_separator	,
#empty_field	(empty)
#unset_field	-
#path	conn
#open	YYYY-MM-DD-HH-MM-SS
#fields	ts	uid	id.orig_h	id.orig_p	id.resp_h	id.resp_p	proto	service	duration	orig_bytes	resp_bytes	conn_state	local_orig	local_resp	missed_bytes	history	orig_pkts	orig_ip_bytes	resp_pkts	resp_ip_bytes	tunnel_parents
#types	time	string	addr	port	addr	port	enum	string	interval	count	count	string	bool	bool	count	string	count	count	count	count	set[string]
```

Build the dataframe:

In [3]:
# open data from zeek_logs
path_to_log = os.path.join(os.getcwd(), f"{exp_day}_zeek_logs", "conn.log")
header_sequence = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto",
                   "service", "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "local_resp",
                   "missed_bytes", "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"]
df = pd.read_csv(path_to_log, delimiter="\t", header=None, names=header_sequence, comment='#')
print(f"initial dataframe length: {df.shape[0]} rows")
n_rows = df.shape[0]

initial dataframe length: 547334 rows
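The column names hard-coded above mirror the `#fields` line of the log header. As a minimal sketch, the names could instead be read from the header itself, which avoids a mismatch if a log was produced with a different field set. The helper `read_zeek_fields` and the shortened toy header are hypothetical and only serve as an illustration:

```python
import io

def read_zeek_fields(handle):
    """Return the column names listed on the '#fields' header line of a Zeek log."""
    for line in handle:
        if not line.startswith("#"):
            break  # header is over and no '#fields' line was seen
        if line.startswith("#fields"):
            # the header line is tab-separated: '#fields<TAB>ts<TAB>uid<TAB>...'
            return line.rstrip("\n").split("\t")[1:]
    raise ValueError("no '#fields' line found in the log header")

# toy header, shortened from the one shown above (hypothetical data row)
sample = io.StringIO(
    "#separator \\x09\n"
    "#path\tconn\n"
    "#fields\tts\tuid\tid.orig_h\tproto\n"
    "#types\ttime\tstring\taddr\tenum\n"
    "1499430000.0\tCabc\t192.168.10.5\ttcp\n"
)
fields = read_zeek_fields(sample)
print(fields)
```

With a real log, the file opened from `path_to_log` would be passed to the helper and the returned list used as the `names` argument of `pd.read_csv`.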


Add the label column; every flow is initially marked as `benign`:

In [4]:
df["label"] = "benign"

### 2. Dataset cleaning and labelling

In [5]:
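# Zeek writes "-" for unset fields: treat an unset orig_bytes as 0
# so that the column can be compared numerically during labelling
df["orig_bytes"] = df["orig_bytes"].replace("-", "0")
df["orig_bytes"] = df["orig_bytes"].astype("float32")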

Botnet labelling

In [6]:
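DATE_FORMAT_INTERNAL = '%d/%m/%Y %I:%M:%S %p'
# offset in seconds (5 hours) subtracted from the Zeek timestamps
# to align them with the attack windows defined below
TIME_DIFFERENCE = 18000
# note: these datetimes are naive, so .timestamp() interprets them
# in the local timezone of the machine running the notebook
t_start = datetime.strptime('07/07/2017 09:30:00 AM', DATE_FORMAT_INTERNAL).timestamp()
t_end = datetime.strptime('07/07/2017 12:59:59 PM', DATE_FORMAT_INTERNAL).timestamp()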
df.loc[
    (
        (df["ts"] - TIME_DIFFERENCE >= t_start)
        & (df["ts"] - TIME_DIFFERENCE <= t_end)
    )
    & (
        ((df["id.orig_h"] == "205.174.165.73") | (df["id.resp_h"] == "205.174.165.73"))
        | ((df["id.orig_h"] == "192.168.10.17") & (df["id.resp_h"] == "52.7.235.158"))
        | ((df["id.orig_h"] == "192.168.10.12") & (df["id.resp_h"] == "52.6.13.28"))
    )
    & (df["orig_bytes"] > 0)
    & (df["proto"] == "tcp"),
    "label"
] = "Bot"

Portscan labelling

In [7]:
t_start = datetime.strptime('07/07/2017 12:30:00 PM', DATE_FORMAT_INTERNAL).timestamp()
t_end = datetime.strptime('07/07/2017 03:40:00 PM', DATE_FORMAT_INTERNAL).timestamp()
attacker = '172.16.0.1'
victim = '192.168.10.50'
df.loc[
    ((df["id.orig_h"] == attacker) & (df["id.resp_h"] == victim))
    & ((df["ts"] - TIME_DIFFERENCE >= t_start) & (df["ts"] - TIME_DIFFERENCE <= t_end))
    & (df["proto"] == "tcp"),
    "label"
] = "portscan"

DDoS labelling

In [8]:
t_start = datetime.strptime('07/07/2017 03:40:00 PM', DATE_FORMAT_INTERNAL).timestamp()
t_end = datetime.strptime('07/07/2017 04:30:00 PM', DATE_FORMAT_INTERNAL).timestamp()
attacker = '172.16.0.1'
victim = '192.168.10.50'
df.loc[
    ((df["id.orig_h"] == attacker) & (df["id.resp_h"] == victim))
    & ((df["ts"] - TIME_DIFFERENCE >= t_start) & (df["ts"] - TIME_DIFFERENCE <= t_end))
    & (df["proto"] == "tcp"),
    "label"
] = "ddos"
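The portscan and DDoS cells share the same structure: a time window, an attacker/victim pair and a protocol filter. A minimal sketch of a helper that factors this out is given below. The names `to_epoch` and `label_flows` are hypothetical, and, unlike the cells above, the sketch interprets the date strings as UTC (an assumption made here only so that the result does not depend on the local timezone of the machine):

```python
from datetime import datetime, timezone
import pandas as pd

DATE_FORMAT = "%d/%m/%Y %I:%M:%S %p"

def to_epoch(text):
    # Assumption: the string is read as UTC, whereas the notebook uses
    # naive datetimes that follow the machine's local timezone.
    return datetime.strptime(text, DATE_FORMAT).replace(tzinfo=timezone.utc).timestamp()

def label_flows(df, label, start, end, attacker, victim, offset=0):
    """Label TCP flows from attacker to victim whose shifted timestamp lies in [start, end]."""
    shifted = df["ts"] - offset
    mask = (
        shifted.between(to_epoch(start), to_epoch(end))  # inclusive on both ends
        & (df["id.orig_h"] == attacker)
        & (df["id.resp_h"] == victim)
        & (df["proto"] == "tcp")
    )
    df.loc[mask, "label"] = label
    return int(mask.sum())

# toy flows: one inside the window, one two hours too early, one over UDP
t0 = to_epoch("07/07/2017 01:00:00 PM")
toy = pd.DataFrame({
    "ts": [t0, t0 - 7200, t0],
    "id.orig_h": ["172.16.0.1"] * 3,
    "id.resp_h": ["192.168.10.50"] * 3,
    "proto": ["tcp", "tcp", "udp"],
    "label": "benign",
})
n = label_flows(toy, "portscan", "07/07/2017 12:30:00 PM", "07/07/2017 03:40:00 PM",
                "172.16.0.1", "192.168.10.50")
print(n, toy["label"].tolist())
```

Because both window bounds are inclusive, a flow stamped exactly at 03:40:00 PM matches the portscan and the DDoS windows; with the cells above, the label applied last (`ddos`) wins.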

In [9]:
# print attack statistics
df["label"].value_counts()

label
benign      290779
portscan    160134
ddos         95683
Bot            738
Name: count, dtype: int64

The next step consists of dropping unnecessary features and one-hot encoding the categorical ones:

In [10]:
# drop features
features_to_drop = [
    "ts", "uid",
    "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
    "service", "history", "tunnel_parents"
]
df.drop(columns=[f for f in features_to_drop if f in df.columns], inplace=True)

# one-hot encode categorical features
hot_encoded_features = ["proto", "local_orig", "local_resp", "conn_state"]
for f in hot_encoded_features:
    if f in df.columns:
        df = pd.get_dummies(df, columns=[f], dtype=float)
print(f"resulting number of input features: {len(df.columns) - 1}")

resulting number of input features: 27


In [11]:
print("retained input features: \n--> " + "\n--> ".join(f for f in df.columns if f != "label"))

retained input features: 
--> duration
--> orig_bytes
--> resp_bytes
--> missed_bytes
--> orig_pkts
--> orig_ip_bytes
--> resp_pkts
--> resp_ip_bytes
--> proto_icmp
--> proto_tcp
--> proto_udp
--> local_orig_F
--> local_orig_T
--> local_resp_F
--> local_resp_T
--> conn_state_OTH
--> conn_state_REJ
--> conn_state_RSTO
--> conn_state_RSTR
--> conn_state_RSTRH
--> conn_state_S0
--> conn_state_S1
--> conn_state_S2
--> conn_state_S3
--> conn_state_SF
--> conn_state_SH
--> conn_state_SHR
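The `proto_*`, `local_*` and `conn_state_*` names listed above are generated by `pd.get_dummies`, which creates one column per distinct value found in the data. The toy example below (hypothetical values) is a small sketch of that behaviour:

```python
import pandas as pd

toy = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp"],
    "conn_state": ["SF", "S0", "REJ"],
    "orig_pkts": [10, 1, 2],
})
# one column per (feature, value) pair; untouched columns come first
encoded = pd.get_dummies(toy, columns=["proto", "conn_state"], dtype=float)
print(list(encoded.columns))
```

A value that never occurs in the log produces no column, so logs from different capture days may yield different feature sets unless the columns are aligned afterwards.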


Unset values (written as `-` by Zeek) are replaced by zeros, and all input features are finally converted to floats:

In [12]:
for f in df.columns:
    if "-" in df[f].unique():
        df[f] = df[f].replace("-", "0")
        print(f"replaced dash character with 0 in {f}")

replaced dash character with 0 in duration
replaced dash character with 0 in resp_bytes


In [13]:
for f in df.columns:
    if f != "label":
        df[f] = df[f].astype("float32")
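The two cells above could also be merged into a single pass. The sketch below is an alternative, not the method used in this notebook: it relies on `pd.to_numeric` with `errors="coerce"` to turn any non-numeric marker into `NaN` before filling with zero. The toy values are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "duration": ["0.5", "-", "1.25"],
    "resp_bytes": ["120", "-", "0"],
    "label": ["benign", "benign", "ddos"],
})
for c in toy.columns:
    if c != "label":
        # "-" cannot be parsed as a number, so it becomes NaN, then 0
        toy[c] = pd.to_numeric(toy[c], errors="coerce").fillna(0).astype("float32")
print(toy.dtypes)
```

Either way, a flow whose duration is unset becomes indistinguishable from a flow with a duration of zero, which may matter for downstream models.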

The dataset can finally be exported:

In [14]:
df.to_csv(f"{exp_day}_dataset.csv", index=False)
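As a usage sketch, the exported file can be reloaded and split into inputs and labels. The example writes a small hypothetical frame to a temporary directory rather than reading the real `friday_dataset.csv`:

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "duration": [0.5, 0.0],
    "proto_tcp": [1.0, 0.0],
    "label": ["benign", "ddos"],
})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_dataset.csv")
    toy.to_csv(path, index=False)   # same call as in the cell above
    reloaded = pd.read_csv(path)

X = reloaded.drop(columns="label")  # input features
y = reloaded["label"]               # class labels
print(X.shape, y.tolist())
```

CSV files carry no type information, so the `float32` columns come back as `float64` unless a `dtype` is passed to `pd.read_csv`.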