# Reduced Dataset Composition
- In this notebook, we present the process used to compose a dataset with a reduced number of rows and columns, starting from the complete dataset (which contains all the features and data points obtained after the merging and labelling steps).
- We show the methodology we apply, not only to investigate the features, but also to prepare the data for use by ML algorithms.
- The reduction of the rows, which completes the composition of the reduced dataset, is reported in the <strong>"NAME OF THE NOTEBOOK" </strong>
- We select the features by manual evaluation (deleting malformed or irrelevant features and hidden labels) and by applying a feature selection algorithm to the Tshark features, keeping only the most useful ones for detection.

## Dataset Processing and Evaluation

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval

In [2]:
PATH = '/data/puccetti/space_data/final_merging_all_feature_ordered.csv'

In [3]:
pd.set_option("display.max_columns", None)
pd.get_option("display.max_columns")

In [4]:
df = pd.read_csv(PATH, nrows=5000000)


In [5]:
df = df.sort_values('timestamp')

In [6]:
print(df['timestamp'])

0          2023-03-16 14:22:23.903192576
1          2023-03-16 14:22:23.903765248
2          2023-03-16 14:22:23.904402432
3          2023-03-16 14:22:23.904744448
4          2023-03-16 14:22:23.905764096
                       ...              
4999995    2023-06-15 09:40:56.200875264
4999996    2023-06-15 09:40:56.201245184
4999997    2023-06-15 09:40:56.201245184
4999998    2023-06-15 09:40:56.201554432
4999999    2023-06-15 09:40:56.201554432
Name: timestamp, Length: 5000000, dtype: object
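
The output above shows that `timestamp` is loaded with dtype `object`, i.e. as strings. Sorting strings gives the chronological order as long as every value uses this same fixed-width format, but for time series analysis it may be convenient to parse the column first. A minimal sketch on a toy series (the two values are copied from the output above; `raw`, `parsed` and `ordered` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Two timestamps in the same textual format as the dataset, deliberately out of order
raw = pd.Series(
    ["2023-03-16 14:22:23.903765248", "2023-03-16 14:22:23.903192576"],
    name="timestamp",
)

# Parse the strings into datetime values, then sort chronologically
parsed = pd.to_datetime(raw)
ordered = parsed.sort_values()

print(parsed.dtype)
print(ordered.tolist())
```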


In [7]:
print(df.shape)

(5000000, 483)


## Some checks on columns and values

In [8]:
print(df['attack'].value_counts())

observe                4091350
nmap discovery          500000
ros2 reconnaissance     401494
ros2 reflection           3753
ros2 node crashing        3403
Name: attack, dtype: int64


Delete some useless columns:
- 'Unnamed' columns are just duplicate indexes of dataframes

In [9]:
subs = "Unnamed"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

1
['Unnamed: 0']


In [10]:
print(df.columns)

Index(['timestamp', 'layers.frame.frame.time', 'layers.frame.frame.time_delta',
       'layers.frame.frame.time_delta_displayed',
       'layers.frame.frame.time_relative', 'layers.frame.frame.number',
       'layers.frame.frame.len', 'layers.frame.frame.cap_len',
       'layers.frame.frame.protocols', 'layers.sll.sll.pkttype',
       ...
       'Active', 'pgalloc_dma', 'pgmajfault', 'SwapFree', 'src_topic',
       'subscribers_count', 'publishers_count', 'msg_type', 'msg_data',
       'attack'],
      dtype='object', length=482)


## Delete Features related to time 
If we want to train a detector on a shuffled dataset (that is, we only want to distinguish between normal and attack data points, without considering their chronological order), we have to delete the time-related features: they mark when attacks and normal behavior occurred during the capture, so they act as hidden labels and do not generalize.

However, we keep the timestamp so that it can be used for time series analysis. It is dropped later, depending on the detector that we want to build.

In [11]:
subs = "time"
res = [i for i in df.columns if subs in i and i != 'timestamp']
print(len(res))
print(res)
df=df.drop(res, axis=1)

17
['layers.frame.frame.time', 'layers.frame.frame.time_delta', 'layers.frame.frame.time_delta_displayed', 'layers.frame.frame.time_relative', 'layers.tcp.tcp.options_tree.tcp.options.timestamp', 'layers.tcp.tcp.options_tree.tcp.options.timestamp_tree.tcp.option_kind', 'layers.tcp.tcp.options_tree.tcp.options.timestamp_tree.tcp.option_len', 'layers.tcp.tcp.options_tree.tcp.options.timestamp_tree.tcp.options.timestamp.tsval', 'layers.tcp.tcp.options_tree.tcp.options.timestamp_tree.tcp.options.timestamp.tsecr', 'layers.tcp.Timestamps.tcp.time_relative', 'layers.tcp.Timestamps.tcp.time_delta', 'layers.dns.dns.time', 'layers.dhcpv6.Client Identifier.dhcpv6.duidllt.time', 'layers.dhcpv6.Elapsed time.dhcpv6.option.type', 'layers.dhcpv6.Elapsed time.dhcpv6.option.length', 'layers.dhcpv6.Elapsed time.dhcpv6.option.value', 'layers.dhcpv6.Elapsed time.dhcpv6.elapsed_time']


In [12]:
print(df.shape)

(5000000, 465)
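
The same "find the columns whose name contains a substring, print them, drop them" pattern is repeated in the cells that follow. As a sketch only, it could be wrapped in a small helper; `drop_columns_containing` and its `keep` parameter are hypothetical names introduced here for illustration, and the toy dataframe reuses a few column names from this notebook:

```python
import pandas as pd

def drop_columns_containing(frame, substring, keep=()):
    """Drop every column whose name contains `substring`, except those listed in `keep`."""
    to_drop = [c for c in frame.columns if substring in c and c not in keep]
    return frame.drop(columns=to_drop), to_drop

# Toy dataframe with no rows: only the column names matter here
toy = pd.DataFrame(
    columns=["timestamp", "layers.frame.frame.time", "layers.ip.ip.ttl", "attack"]
)

reduced, dropped = drop_columns_containing(toy, "time", keep=("timestamp",))
print(dropped)
print(list(reduced.columns))
```

Returning the list of dropped names keeps the manual review step: the names can still be printed and checked before the reduced dataframe is used.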


### Delete features that specify Source or Destination at different layers of the protocol stack
Model generalization could be degraded by knowledge of specific values observed during the monitoring campaign. The source and destination addresses are not generalizable, so we drop them.

In [13]:
subs = "dst"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

11
['layers.ip.ip.dst', 'layers.ip.ip.dst_host', 'layers.tcp.tcp.dstport', 'layers.udp.udp.dstport', 'layers.arp.arp.dst.hw_mac', 'layers.arp.arp.dst.proto_ipv4', 'layers.ipv6.ipv6.dst', 'layers.ipv6.ipv6.dst_host', 'layers.icmp.ip.ip.dst', 'layers.icmp.ip.ip.dst_host', 'layers.icmp.udp.udp.dstport']


In [14]:
print(df.shape)

(5000000, 454)


In [15]:
subs = "src"
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

21
['layers.sll.sll.src.eth', 'layers.ip.ip.src', 'layers.ip.ip.src_host', 'layers.tcp.tcp.srcport', 'layers.udp.udp.srcport', 'layers.rtps.rtps.guidPrefix.src', 'layers.rtps.rtps.guidPrefix.src_tree.rtps.hostId', 'layers.rtps.rtps.guidPrefix.src_tree.rtps.appId', 'layers.rtps.rtps.guidPrefix.src_tree.rtps.sm.guidPrefix.instanceId', 'layers.arp.arp.src.hw_mac', 'layers.arp.arp.src.proto_ipv4', 'layers.vssmonitoring.vssmonitoring.srcport', 'layers.ipv6.ipv6.src', 'layers.ipv6.ipv6.src_host', 'layers.icmp.ip.ip.src', 'layers.icmp.ip.ip.src_host', 'layers.icmp.udp.udp.srcport', 'layers.ipv6.ipv6.src_sa_mac', 'layers.icmp.udp.udp.srcport_tree._ws.expert._ws.expert.message', 'layers.icmp.udp.udp.srcport_tree._ws.expert._ws.expert.severity', 'layers.icmp.udp.udp.srcport_tree._ws.expert._ws.expert.group']


In [16]:
print(df.shape)

(5000000, 433)


In [17]:
subs = "host"
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

4
['layers.ip.ip.host', 'layers.ipv6.ipv6.host', 'layers.icmp.ip.ip.host', 'layers.http.http.host']


In [18]:
print(df.shape)

(5000000, 429)


In [19]:
subs = "addr"
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

4
['layers.ip.ip.addr', 'layers.ipv6.ipv6.addr', 'layers.icmp.ip.ip.addr', 'layers.dhcpv6.Client Identifier.dhcpv6.duidllt.link_layer_addr']


In [20]:
print(df.shape)

(5000000, 425)


### Delete features related to the "Frame" protocol
From the Wireshark documentation (https://wiki.wireshark.org/Protocols/frame):

"The frame protocol isn't a real protocol itself, but used by Wireshark as a base for all the protocols on top of it. It shows information from capturing, such as the exact time a specific frame was captured. You could think of it as a pseudo dissector."

In [21]:
subs = ".frame."
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

4
['layers.frame.frame.number', 'layers.frame.frame.len', 'layers.frame.frame.cap_len', 'layers.frame.frame.protocols']


In [22]:
print(df.shape)

(5000000, 421)


### Delete features that contain the ID keyword
We want the dataset to be as general as possible. We drop the IDs, which are specific to the execution of the system during the monitoring campaign. An ID can also implicitly act as a hidden label: for example, during training the attacker can be associated with a specific ID, while at test time the association can be different, degrading the performance of the model.

First, the matching features are printed to check that we do not drop relevant features that merely contain the substring "id" in their name.

In [23]:
subs = "id"
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

37
['layers.ip.ip.id', 'layers.dns.dns.id', 'layers.rtps.rtps.guidPrefix', 'layers.rtps.Default port mapping: MULTICAST_METATRAFFIC, domainId=0.rtps.domain_id', 'layers.rtps.rtps.sm.id', 'layers.rtps.rtps.sm.id_tree.rtps.sm.flags', 'layers.rtps.rtps.sm.id_tree.rtps.sm.flags_tree.rtps.flag.reserved', 'layers.rtps.rtps.sm.id_tree.rtps.sm.flags_tree.rtps.flag.data.serialized_key', 'layers.rtps.rtps.sm.id_tree.rtps.sm.flags_tree.rtps.flag.data_present', 'layers.rtps.rtps.sm.id_tree.rtps.sm.flags_tree.rtps.flag.inline_qos', 'layers.rtps.rtps.sm.id_tree.rtps.sm.flags_tree.rtps.flag.endianness', 'layers.rtps.rtps.sm.id_tree.rtps.sm.octetsToNextHeader', 'layers.rtps.rtps.sm.id_tree.rtps.extra_flags', 'layers.rtps.rtps.sm.id_tree.rtps.octets_to_inline_qos', 'layers.rtps.rtps.sm.id_tree.rtps.sm.rdEntityId', 'layers.rtps.rtps.sm.id_tree.rtps.sm.rdEntityId_tree.rtps.sm.rdEntityId.entityKey', 'layers.rtps.rtps.sm.id_tree.rtps.sm.rdEntityId_tree.rtps.sm.rdEntityId.entityKind', 'layers.rtps.rtps.sm.i

In [24]:
print(df.shape)

(5000000, 384)
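
The substring test used above also matches names in which "id" is only part of a longer word (for instance `guidPrefix`). To make the manual review easier, the matches could be split into two groups before dropping anything. This is a sketch under assumptions: `ID_TOKEN` and `split_id_matches` are hypothetical names, and treating `.` and `_` as token delimiters is a choice made here, not something the notebook does:

```python
import re

# "id" as a standalone name token, delimited by '.', '_' or the string boundaries
ID_TOKEN = re.compile(r"(?:^|[._])id(?:$|[._])")

def split_id_matches(columns):
    """Split the names containing 'id' into explicit ID tokens and incidental matches."""
    explicit, incidental = [], []
    for name in columns:
        if "id" not in name:
            continue  # same case-sensitive substring test as the cell above
        (explicit if ID_TOKEN.search(name) else incidental).append(name)
    return explicit, incidental

columns = [
    "layers.ip.ip.id",
    "layers.rtps.rtps.guidPrefix",
    "layers.rtps.rtps.sm.id_tree.rtps.sm.flags",
    "layers.ip.ip.ttl",
]
explicit, incidental = split_id_matches(columns)
print(explicit)
print(incidental)
```

The incidental group is the one that deserves a closer look, since those names were matched only by coincidence of spelling.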


In [25]:
subs = "port"
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

4
['layers.tcp.tcp.port', 'layers.udp.udp.port', 'layers.rtps.Default port mapping: MULTICAST_METATRAFFIC, domainId=0.rtps.traffic_nature', 'layers.icmp.udp.udp.port']


In [26]:
print(df.shape)

(5000000, 380)


### Delete malformed features
These features seem to be badly formatted and may be the result of a formatting exception during the captures.

In [27]:
subs = "ubuntu"
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

30
['layers.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.name', 'layers.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.name.len', 'layers.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.count.labels', 'layers.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.type', 'layers.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.class', 'layers.icmp.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.name', 'layers.icmp.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.name.len', 'layers.icmp.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.count.labels', 'layers.icmp.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.type', 'layers.icmp.dns.Queries.connectivity-check.ubuntu.com: type A, class IN.dns.qry.class', 'layers.icmp.dns.Queries.connectivity-check.ubuntu.com: type AAAA, class IN.dns.qry.name', 'layers.icmp.dns.Queries.connect

In [28]:
print(df.shape)

(5000000, 350)


In [29]:
subs = "microsoft"
res = [i for i in df.columns if subs in i and i != 'src_topic']
print(len(res))
print(res)
df=df.drop(res, axis=1)

50
['layers.dns.Queries.eu-v20.events.data.microsoft.com: type A, class IN.dns.qry.name', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type A, class IN.dns.qry.name.len', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type A, class IN.dns.count.labels', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type A, class IN.dns.qry.type', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type A, class IN.dns.qry.class', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type AAAA, class IN.dns.qry.name', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type AAAA, class IN.dns.qry.name.len', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type AAAA, class IN.dns.count.labels', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type AAAA, class IN.dns.qry.type', 'layers.dns.Queries.eu-v20.events.data.microsoft.com: type AAAA, class IN.dns.qry.class', 'layers.dns.Queries.winatp-gw-neu.microsoft.com: type A, class IN.dns.qry.name', 'layers.dns.Queries.wi

In [30]:
print(df.shape)

(5000000, 300)


### Manual evaluation of the Tshark features 

In [31]:
subs = "."
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
#df=df.drop(res, axis=1)

268
['layers.sll.sll.pkttype', 'layers.sll.sll.hatype', 'layers.sll.sll.unused', 'layers.sll.sll.etype', 'layers.ip.ip.version', 'layers.ip.ip.hdr_len', 'layers.ip.ip.dsfield', 'layers.ip.ip.dsfield_tree.ip.dsfield.dscp', 'layers.ip.ip.dsfield_tree.ip.dsfield.ecn', 'layers.ip.ip.len', 'layers.ip.ip.flags', 'layers.ip.ip.flags_tree.ip.flags.rb', 'layers.ip.ip.flags_tree.ip.flags.df', 'layers.ip.ip.flags_tree.ip.flags.mf', 'layers.ip.ip.flags_tree.ip.frag_offset', 'layers.ip.ip.ttl', 'layers.ip.ip.proto', 'layers.ip.ip.checksum', 'layers.ip.ip.checksum.status', 'layers.tcp.tcp.stream', 'layers.tcp.tcp.len', 'layers.tcp.tcp.seq', 'layers.tcp.tcp.nxtseq', 'layers.tcp.tcp.ack', 'layers.tcp.tcp.hdr_len', 'layers.tcp.tcp.flags', 'layers.tcp.tcp.flags_tree.tcp.flags.res', 'layers.tcp.tcp.flags_tree.tcp.flags.ns', 'layers.tcp.tcp.flags_tree.tcp.flags.cwr', 'layers.tcp.tcp.flags_tree.tcp.flags.ecn', 'layers.tcp.tcp.flags_tree.tcp.flags.urg', 'layers.tcp.tcp.flags_tree.tcp.flags.ack', 'layers.tcp

### Drop the malformed features 
We drop the features with "/" or "\"

In [32]:
subs = "/"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

13
['layers.http.HTTP/1.1 200 OK\\r\\n._ws.expert._ws.expert.message', 'layers.http.HTTP/1.1 200 OK\\r\\n._ws.expert._ws.expert.severity', 'layers.http.HTTP/1.1 200 OK\\r\\n._ws.expert._ws.expert.group', 'layers.http.HTTP/1.1 200 OK\\r\\n.http.response.version', 'layers.http.HTTP/1.1 200 OK\\r\\n.http.response.code', 'layers.http.HTTP/1.1 200 OK\\r\\n.http.response.code.desc', 'layers.http.HTTP/1.1 200 OK\\r\\n.http.response.phrase', 'layers.http.GET / HTTP/1.1\\r\\n._ws.expert._ws.expert.message', 'layers.http.GET / HTTP/1.1\\r\\n._ws.expert._ws.expert.severity', 'layers.http.GET / HTTP/1.1\\r\\n._ws.expert._ws.expert.group', 'layers.http.GET / HTTP/1.1\\r\\n.http.request.method', 'layers.http.GET / HTTP/1.1\\r\\n.http.request.uri', 'layers.http.GET / HTTP/1.1\\r\\n.http.request.version']


In [33]:
subs = "\\"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

0
[]


In [34]:
print(df.shape)

(5000000, 287)


In [35]:
subs = "PTR"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

18
['layers.mdns.Queries._pgpkey-hkp._tcp.local: type PTR, class IN, "QM" question.dns.qry.name', 'layers.mdns.Queries._pgpkey-hkp._tcp.local: type PTR, class IN, "QM" question.dns.qry.name.len', 'layers.mdns.Queries._pgpkey-hkp._tcp.local: type PTR, class IN, "QM" question.dns.count.labels', 'layers.mdns.Queries._pgpkey-hkp._tcp.local: type PTR, class IN, "QM" question.dns.qry.type', 'layers.mdns.Queries._pgpkey-hkp._tcp.local: type PTR, class IN, "QM" question.dns.qry.class', 'layers.mdns.Queries._pgpkey-hkp._tcp.local: type PTR, class IN, "QM" question.dns.qry.qu', 'layers.mdns.Queries._ipp._tcp.local: type PTR, class IN, "QM" question.dns.qry.name', 'layers.mdns.Queries._ipp._tcp.local: type PTR, class IN, "QM" question.dns.qry.name.len', 'layers.mdns.Queries._ipp._tcp.local: type PTR, class IN, "QM" question.dns.count.labels', 'layers.mdns.Queries._ipp._tcp.local: type PTR, class IN, "QM" question.dns.qry.type', 'layers.mdns.Queries._ipp._tcp.local: type PTR, class IN, "QM" questi

In [36]:
print(df.shape)

(5000000, 269)


In [37]:
subs = "full_uri"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

1
['layers.http.http.request.full_uri']


In [38]:
subs = "request_number"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

1
['layers.http.http.request_number']


In [39]:
subs = "<Root>"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

18
['layers.dns.Additional records.<Root>: type OPT.dns.resp.name', 'layers.dns.Additional records.<Root>: type OPT.dns.resp.type', 'layers.dns.Additional records.<Root>: type OPT.dns.rr.udp_payload_size', 'layers.dns.Additional records.<Root>: type OPT.dns.resp.ext_rcode', 'layers.dns.Additional records.<Root>: type OPT.dns.resp.edns0_version', 'layers.dns.Additional records.<Root>: type OPT.dns.resp.z', 'layers.dns.Additional records.<Root>: type OPT.dns.resp.z_tree.dns.resp.z.do', 'layers.dns.Additional records.<Root>: type OPT.dns.resp.z_tree.dns.resp.z.reserved', 'layers.dns.Additional records.<Root>: type OPT.dns.resp.len', 'layers.icmp.dns.Additional records.<Root>: type OPT.dns.resp.name', 'layers.icmp.dns.Additional records.<Root>: type OPT.dns.resp.type', 'layers.icmp.dns.Additional records.<Root>: type OPT.dns.rr.udp_payload_size', 'layers.icmp.dns.Additional records.<Root>: type OPT.dns.resp.ext_rcode', 'layers.icmp.dns.Additional records.<Root>: type OPT.dns.resp.edns0_ver

In [40]:
print(df.shape)

(5000000, 249)


In [41]:
subs = "len"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

17
['layers.ip.ip.hdr_len', 'layers.ip.ip.len', 'layers.tcp.tcp.len', 'layers.tcp.tcp.hdr_len', 'layers.tcp.tcp.options_tree.tcp.options.mss_tree.tcp.option_len', 'layers.tcp.tcp.options_tree.tcp.options.sack_perm_tree.tcp.option_len', 'layers.tcp.tcp.options_tree.tcp.options.wscale_tree.tcp.option_len', 'layers.ssl.ssl.record.ssl.record.length', 'layers.udp.udp.length', 'layers.ipv6.ipv6.plen', 'layers.dhcpv6.Option Request.dhcpv6.option.length', 'layers.icmp.ip.ip.hdr_len', 'layers.icmp.ip.ip.len', 'layers.icmp.udp.udp.length', 'layers.dhcpv6.Client Identifier.dhcpv6.option.length', 'layers.data.data.len', 'layers.tcp.tcp.options_tree.tcp.options.sack_tree.tcp.option_len']


In [42]:
print(df.shape)

(5000000, 232)


In [43]:
subs = "seq"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

2
['layers.tcp.tcp.seq', 'layers.tcp.tcp.nxtseq']


### Save list of features 

In [44]:
features = df.columns

In [45]:
# Avoid the name `dict`, which would shadow the Python built-in
feature_table = {'features': features}

df_features = pd.DataFrame(feature_table)

In [46]:
df_features.to_csv("/data/puccetti/space_data/features_usable_temp.csv")

### Create dataset with the subset of the features 
The objective is to understand the memory occupation of the resulting dataset.

In [47]:
PATH = '/data/puccetti/space_data/final_merging_all_feature_ordered.csv'

In [48]:
features = pd.read_csv("/data/puccetti/space_data/features_usable_temp.csv")
to_load = features['features'].values.tolist()

In [49]:
df = pd.read_csv(PATH, usecols=to_load)


In [50]:
print(df.shape)

(30247050, 230)
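
The shape alone does not give the memory occupation mentioned above. One way to measure it, sketched here on a toy dataframe, is `DataFrame.memory_usage(deep=True)`, which also accounts for the content of string columns. `memory_report` is a hypothetical helper name and the toy columns are made up for the example:

```python
import pandas as pd

def memory_report(frame, top=3):
    """Return the total size in MiB and the `top` heaviest columns in bytes."""
    per_column = frame.memory_usage(deep=True, index=False).sort_values(ascending=False)
    total_mib = per_column.sum() / 2**20
    return total_mib, per_column.head(top)

# Toy dataframe: one integer column and one column of long strings
toy = pd.DataFrame({"a": range(1000), "msg_data": ["payload" * 10] * 1000})

total_mib, heaviest = memory_report(toy)
print(f"{total_mib:.3f} MiB")
print(heaviest)
```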


In [51]:
df.to_csv("/data/puccetti/space_data/usable_temp.csv")

In [52]:
df = pd.read_csv("/data/puccetti/space_data/usable_temp.csv")


# Prepare the data for training
In this section, we prepare the data to be processed by ML algorithms. In particular, we perform the following steps:
- <strong>Convert mixed dtypes</strong>: we unify the type of features with mixed-type values.
- <strong>Handle NaN values</strong>: we replace NaN and infinite values with -1.
- <strong>Convert Label to Numeric</strong>: we substitute label values with numeric values. Then, we create two versions of the dataset: one with binary labels (attack, normal) and one with multiple labels (one label for each attack).
- <strong>Convert String To Numeric</strong>: we convert string values to numbers using categorical encoding. This technique assigns a unique number to each unique string value of a feature.
- <strong>Split the dataset in Training and Test sets</strong>: after removing the label and timestamp columns, we split the dataframe into training and test sets with a 60/40 split.

## Convert mixed dtypes
We convert mixed-dtype columns to strings with the `convert_dtype` helper defined below (`convert_hex` is defined alongside it but is not used in the cells that follow):

In [53]:
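def convert_dtype(x):
    """Return x as a string; falsy values (None, '', 0) become ''."""
    if not x:
        return ''
    try:
        return str(x)
    except Exception:
        return ''

def convert_hex(x):
    """Evaluate a Python literal such as '0x10'; return 0 if x is empty or cannot be parsed."""
    if not x:
        return 0
    try:
        return literal_eval(x)
    except Exception:
        return 0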

In [54]:
#Indexes of the columns to be converted
to_convert = [7,10,17,21,31,35,38,41,42,45,49,53,62,64,65,66,69,81,82,85,86,87,90,91,94,96,103,106,111,113,116,120,122,124,127,134,136,139,162,173,176,178,181,184,190,195,197,198,199,225,228,229]
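
The positions above are hard-coded, so they are only valid for this exact column order. As a sketch of how such columns could be located instead, one option is to count the distinct Python types stored in each `object` column. `mixed_type_columns` is a hypothetical helper, and this is not necessarily how the list above was obtained:

```python
import pandas as pd

def mixed_type_columns(frame):
    """Return the positions of object columns that hold more than one Python type."""
    mixed = []
    for position, name in enumerate(frame.columns):
        if frame[name].dtype != object:
            continue  # homogeneous numeric or string columns are skipped
        kinds = set(map(type, frame[name].dropna()))
        if len(kinds) > 1:
            mixed.append(position)
    return mixed

# Toy dataframe: only column "b" mixes integers, strings and floats
toy = pd.DataFrame({"a": [1, 2, 3], "b": [1, "0x10", 2.5], "c": ["x", "y", None]})
print(mixed_type_columns(toy))
```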

In [55]:
#Convert
for i in to_convert:
    df[df.columns[i]] = df[df.columns[i]].apply(convert_dtype)

In [56]:
print(df.shape)

(30247050, 231)


## Handling NaN values

In [57]:
nanv = []
for col in df.columns:
    nanv.append(df[col].isnull().values.any())

In [58]:
print(nanv)

[False, False, False, False, False, False, True, False, True, True, False, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, False, True, True, False, False, True, True, False, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, False, True, False, False, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, False, False, False, True, True, False, False, True, True, False, True, False, True, True, True, True, True, True, False, True, True, False, True, True, True, True, False, True, False, True, True, False, True, True, True, False, True, False, True, False, True, True, False, True, True, True, True, True, True, False, True, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, T

In [59]:
df.replace([np.inf, -np.inf], -1, inplace=True)
df.fillna(-1, inplace=True)
df=df.dropna(thresh=1, axis=1)
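
A list of booleans is hard to map back to column names. The following sketch, on a toy dataframe with made-up numeric columns, shows the same replacement of infinite and NaN values with -1 while keeping a per-column count of missing values, which is easier to inspect:

```python
import numpy as np
import pandas as pd

# Toy numeric dataframe containing NaN and infinite values
toy = pd.DataFrame(
    {
        "ttl": [64.0, np.nan, np.inf],
        "window": [1.0, -np.inf, 3.0],
        "len": [1, 2, 3],
    }
)

# Number of missing values per column, indexed by column name
missing_before = toy.isna().sum()

# Same policy as the notebook: infinite and NaN values become -1
cleaned = toy.replace([np.inf, -np.inf], -1).fillna(-1)

print(missing_before)
print(cleaned)
```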

In [60]:
df.replace('nan', -1, inplace=True)

In [61]:
nanv = []
for col in df.columns:
    nanv.append(df[col].isnull().values.any())
print(nanv)

[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False

In [62]:
#Save the processed dataset
df.to_csv('/data/puccetti/space_data/usable_temp_nan.csv')
#df = pd.read_csv('/data/puccetti/space_data/final_full_dataset_nan.csv')

## Convert Label columns to Numeric values
We substitute label values with numeric values. We create two versions of the dataset:
- binary classification
- multiple label classification

In [63]:
print(df['timestamp'])

0           2023-03-16 14:22:23.903192576
1           2023-03-16 14:22:23.903765248
2           2023-03-16 14:22:23.904402432
3           2023-03-16 14:22:23.904744448
4           2023-03-16 14:22:23.905764096
                        ...              
30247045    2023-06-16 19:58:12.266677760
30247046    2023-06-16 19:58:12.267390208
30247047    2023-06-16 19:58:12.267472640
30247048    2023-06-16 19:58:12.267490048
30247049    2023-06-16 19:58:12.309100288
Name: timestamp, Length: 30247050, dtype: object


In [64]:
print(df['attack'].value_counts())

metasploit SYN flood    14696890
nmap SYN flood           7993846
observe                  6645324
nmap discovery            500000
ros2 reconnaissance       401494
ros2 node crashing          5743
ros2 reflection             3753
Name: attack, dtype: int64


In [65]:
df['attack'] = df['attack'].replace('metasploit SYN flood', 1) 
df['attack'] = df['attack'].replace('nmap discovery', 2)
df['attack'] = df['attack'].replace('nmap SYN flood', 3) 
df['attack'] = df['attack'].replace('ros2 node crashing', 4)
df['attack'] = df['attack'].replace('ros2 reconnaissance', 5)
df['attack'] = df['attack'].replace('ros2 reflection', 6)
df['attack'] = df['attack'].replace('observe', 0)

df['attack'] = pd.to_numeric(df['attack'])

df['attack'].unique(), df['attack'].nunique()

(array([0, 2, 5, 6, 4, 3, 1]), 7)

In [66]:
print(df['attack'].value_counts())

1    14696890
3     7993846
0     6645324
2      500000
5      401494
4        5743
6        3753
Name: attack, dtype: int64


In [67]:
df.to_csv('/data/puccetti/space_data/usable_temp_multi.csv')

In [68]:
df['attack'] = df['attack'].replace(2, 1)
df['attack'] = df['attack'].replace(3, 1) 
df['attack'] = df['attack'].replace(4, 1)
df['attack'] = df['attack'].replace(5, 1)
df['attack'] = df['attack'].replace(6, 1)

In [69]:
print(df['attack'].value_counts())

1    23601726
0     6645324
Name: attack, dtype: int64
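
The two relabelling steps above can also be expressed with a single mapping. The sketch below uses the same label-to-code assignment as the notebook on a toy series; `ATTACK_CODES`, `multi` and `binary` are illustrative names. Raising on unmapped labels is an extra safety check assumed here, not something the notebook does:

```python
import pandas as pd

# Same codes as assigned in the cells above
ATTACK_CODES = {
    "observe": 0,
    "metasploit SYN flood": 1,
    "nmap discovery": 2,
    "nmap SYN flood": 3,
    "ros2 node crashing": 4,
    "ros2 reconnaissance": 5,
    "ros2 reflection": 6,
}

labels = pd.Series(["observe", "nmap discovery", "ros2 reflection", "observe"])

# Multi-class labels: one code per attack type
multi = labels.map(ATTACK_CODES)
if multi.isna().any():
    raise ValueError("found a label that is missing from ATTACK_CODES")
multi = multi.astype("int64")

# Binary labels: 0 for normal traffic, 1 for any attack
binary = (multi != 0).astype("int64")

print(multi.tolist())
print(binary.tolist())
```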


In [70]:
df.to_csv('/data/puccetti/space_data/usable_temp_bin.csv')

## Convert String to Numeric

In [71]:
list_column_string=df.select_dtypes(exclude=[np.number])

for i in list_column_string:
    if i != 'timestamp':
        df[i] = pd.Categorical(df[i])

In [72]:
for i in list_column_string:
    if i != 'timestamp':
        df[i] = df[i].cat.codes
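
As a small illustration of what the categorical encoding above produces, the sketch below encodes a toy column (the values "tcp", "udp" and "icmp" are made up). It also shows that reusing the categories learned from one set of values on new values maps unseen strings to -1, which matters if the encoding has to stay consistent across files:

```python
import pandas as pd

toy = pd.Series(["tcp", "udp", "tcp"])

# Categories are the sorted unique values; codes are their positions
encoded = pd.Categorical(toy)
print(list(encoded.categories))
print(encoded.codes.tolist())

# Reusing the learned categories on new values: unseen strings get code -1
new_codes = pd.Categorical(["udp", "icmp"], categories=encoded.categories).codes
print(new_codes.tolist())
```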

## Split the dataset to create Train and Test Sets

In [73]:
from sklearn.model_selection import train_test_split

In [74]:
#df = df.drop(['Unnamed: 0'], axis=1)
df = df.drop(['timestamp'], axis=1)
print("Dataset shape: " + str(df.shape))

Dataset shape: (30247050, 230)


In [75]:
subs = "Unnamed"
res = [i for i in df.columns if subs in i]
print(len(res))
print(res)
df=df.drop(res, axis=1)

1
['Unnamed: 0']


We perform feature selection only on the features related to the network monitor (Tshark), i.e. the columns whose name contains a ".".

In [76]:
subs = "."
res = [i for i in df.columns if subs in i]
print(len(res))

198


In [78]:
label = df['attack']
df = df.drop(['attack'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(df[res], label, test_size=0.4, random_state=42)

x_train = x_train.to_numpy()
x_test = x_test.to_numpy()

In [79]:
print("Train Set Shape: " + str(x_train.shape))
print("Train Set Label Shape: " + str(y_train.shape))
print("Test Set Shape: " + str(x_test.shape))
print("Test Set Label Shape: " + str(y_test.shape))

Train Set Shape: (18148230, 198)
Train Set Label Shape: (18148230,)
Test Set Shape: (12098820, 198)
Test Set Label Shape: (12098820,)
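
The split above is purely random. Since some classes are rare in this dataset, a variant worth considering is a stratified split, which keeps the class proportions the same in both sets. The sketch below is not the notebook's method: it only adds the `stratify` argument of `train_test_split` to the same 60/40 configuration, on synthetic data generated for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 10% of them in the minority class
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# Same 60/40 split and seed as above, but preserving the class proportions
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.4, random_state=42, stratify=y
)

print(x_tr.shape, x_te.shape)
print(int(y_tr.sum()), int(y_te.sum()))
```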


# Select best features for the light version of the dataset
We use the impurity-based feature importances of an ExtraTreesClassifier to rank the Tshark features and keep the 30 highest-ranked ones.

In [80]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

In [81]:
clf = ExtraTreesClassifier(n_estimators=30)
clf = clf.fit(x_train, y_train)
clf.feature_importances_

array([1.13910822e-02, 1.27566710e-02, 2.36111938e-02, 1.10788548e-04,
       1.37356228e-02, 3.10693957e-05, 1.10606544e-04, 1.24532835e-02,
       6.39916232e-02, 1.54908340e-02, 3.75477375e-02, 2.28914434e-05,
       2.46530753e-05, 7.15428216e-03, 3.25095565e-03, 4.99499203e-02,
       1.67866265e-02, 5.69900745e-02, 3.37684839e-02, 6.76913524e-04,
       1.11525403e-04, 6.21370943e-04, 2.80709863e-04, 5.31840261e-04,
       8.59062744e-04, 4.12366630e-02, 2.62099899e-03, 1.58707979e-03,
       1.64999210e-02, 2.39206703e-02, 4.62840850e-02, 1.54767023e-02,
       3.91403165e-03, 4.97502319e-04, 9.34643428e-03, 1.00535905e-02,
       7.58778324e-03, 1.26580733e-03, 2.40636031e-07, 1.00687336e-01,
       1.97235373e-04, 1.14213919e-04, 5.93018142e-05, 3.33300611e-04,
       2.04132413e-04, 4.32268501e-02, 8.03508674e-02, 7.24221504e-04,
       1.97092415e-04, 8.74856162e-05, 7.30467572e-05, 1.43462290e-03,
       5.14176284e-04, 9.73569276e-04, 4.53423076e-03, 2.03125732e-02,
      

In [82]:
importances = clf.feature_importances_
indices = np.argsort(importances)[-30:]

In [83]:
print(indices)

[ 34  35 103   0   7   1   4  60 100  31   9  61  28  16  63 107  55  57
   2  29  18  10  25  45  30  15  17   8  46  39]
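
The positions printed above refer to the columns of the array passed to `fit`, so they have to be mapped back to names through the same column list. A minimal sketch of the ranking on synthetic data, where only one feature is informative (`names`, `top` and `top_names` are illustrative, and `random_state` is fixed here only to make the example reproducible):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: the label depends only on the third feature
rng = np.random.default_rng(0)
names = ["f0", "f1", "f2", "f3", "f4"]
x = rng.normal(size=(500, 5))
y = (x[:, 2] > 0).astype(int)

clf = ExtraTreesClassifier(n_estimators=30, random_state=0).fit(x, y)
importances = clf.feature_importances_

# Positions of the two most important features, most important first
top = np.argsort(importances)[-2:][::-1]

# Map positions back to names using the list that matches the columns of x
top_names = [names[i] for i in top]
print(top_names)
```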


In [84]:
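# indices are positions within the Tshark column list `res` used to build x_train,
# so they are mapped back through `res` rather than through df.columns
best_features = [res[i] for i in indices]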

In [85]:
subs = "."
res = [i for i in df.columns if subs not in i]
print(len(res))

30


In [86]:
best_features = set(best_features)
res = set(res)
union = list(best_features.union(res))

In [87]:
print(union)
print(len(union))

['Tcp_Listen', 'layers.tcp.tcp.flags_tree.tcp.flags.ack', 'layers.tcp.tcp.stream', 'layers.ip.ip.checksum.status', 'layers.sll.sll.hatype', 'layers.tcp.tcp.analysis.tcp.analysis.acks_frame', 'Disk_Read', 'layers.ip.ip.flags_tree.ip.flags.rb', 'layers.tcp.tcp.flags_tree.tcp.flags.syn_tree._ws.expert._ws.expert.severity', 'Cached', 'publishers_count', 'SwapFree', 'layers.tcp.tcp.flags_tree.tcp.flags.syn_tree._ws.expert._ws.expert.message', 'pgdeactivate', 'Tcp_Close', 'layers.tcp.tcp.window_size', 'Buffers', 'Active', 'layers.ip.ip.checksum', 'layers.tcp.tcp.options_tree.tcp.options.nop_tree.tcp.option_kind', 'pgactivate', 'Inactive', 'msg_type', 'Net_Received', 'layers.ssl.ssl.record.ssl.record.content_type', 'layers.icmpv6.icmpv6.type', 'layers.tcp.tcp.window_size_value', 'layers.tcp.tcp.payload', 'Net_Sent', 'pgfault', 'layers.tcp.tcp.flags_tree.tcp.flags.syn', 'Tcp_Syn', 'layers.ip.ip.version', 'subscribers_count', 'layers.tcp.tcp.analysis.tcp.analysis.initial_rtt', 'layers.tcp.tcp.o
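
Because Python sets do not preserve insertion order, the order of the names in `union` can differ from one run to another. If a stable column order is needed, an order-preserving union is one option. `ordered_union` is a hypothetical helper sketched here, not part of the notebook:

```python
def ordered_union(first, second):
    """Union of two sequences that keeps the order of first appearance."""
    return list(dict.fromkeys(list(first) + list(second)))

# Toy example reusing a few feature names from this notebook
selected = ["layers.tcp.tcp.stream", "layers.ip.ip.version"]
system = ["Cached", "layers.ip.ip.version", "SwapFree"]

print(ordered_union(selected, system))
```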

In [88]:
np.save('/data/puccetti/space_data/usable_features.npy', union, allow_pickle=True)

# Compose the reduced dataset

In [89]:
features = np.load('/data/puccetti/space_data/usable_features.npy')

In [91]:
features = list(features)

In [92]:
features.append('attack')

In [93]:
df = pd.read_csv(PATH, usecols=features)


In [94]:
print(df['attack'])

0           observe
1           observe
2           observe
3           observe
4           observe
             ...   
30247045    observe
30247046    observe
30247047    observe
30247048    observe
30247049    observe
Name: attack, Length: 30247050, dtype: object


In [95]:
df.to_csv('/data/puccetti/space_data/reduced_final.csv')

In [None]:
#df = pd.read_csv('/data/puccetti/space_data/usable_temp_bin.csv', usecols=features)

In [None]:
#df.to_csv('/data/puccetti/space_data/usable_final_bin.csv')

In [None]:
#df = pd.read_csv('/data/puccetti/space_data/usable_temp_multi.csv', usecols=features)

In [None]:
#df.to_csv('/data/puccetti/space_data/usable_final_multi.csv')