# Reading network frames as if you were a hardware component

## Encapsulation of different networking protocols

On any network, messages are sent as frames. These frames travel over a physical medium (e.g. copper wire, EM radiation (WiFi), ...). When they arrive at a network host, they have to be interpreted and passed on to the next component in the chain.

<center>
<img src="images/01_network.png"/>
</center>


From a **Networking 101** course (or by common sense) you should know that files or other datasets are fragmented into a set of frames that are sent one by one over the chosen medium. To manage the massive amount of data being sent, different networking protocols are used.

These network protocols typically encapsulate each other. All of these *layers* work in a similar fashion: first they provide metadata in a **header**, then the **payload** follows. The payload of one layer contains the entirety of the next layer, which again consists of a header and a payload. This nesting continues, similar to [Russian Matryoshka dolls](https://en.wikipedia.org/wiki/Matryoshka_doll).

<center>
<img src="images/01_encaps.png"/>
</center>

The image above shows a simplified version of the **encapsulation** of the different layers. One important remark is that one of the fields in a header section defines which protocol is used in its payload. We will use this heavily :)
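As a minimal sketch of how such a "next protocol" field chains the layers together, the snippet below builds a hand-made Ethernet II frame that carries an IPv4 header, then reads the EtherType and the IPv4 protocol field. The frame bytes, the lookup tables and the helper names `parse_ethernet` and `parse_ipv4` are made up for illustration; they are not part of the workshop library or dataset.

```python
import struct

# Hand-made example frame (NOT taken from the workshop dataset).
ETH_HEADER = bytes.fromhex(
    "ffffffffffff"   # destination MAC (broadcast)
    "020000000001"   # source MAC (locally administered)
    "0800"           # EtherType: 0x0800 means the payload is IPv4
)
IPV4_HEADER = bytes.fromhex(
    "4500"           # version 4, header length 5 words (20 bytes)
    "0028"           # total length: 40 bytes
    "00004000"       # identification, flags / fragment offset
    "4006"           # TTL 64, protocol 6 means the payload is TCP
    "0000"           # header checksum (left at 0 in this sketch)
    "c0a80001"       # source address 192.168.0.1
    "c0a80002"       # destination address 192.168.0.2
)
# 20 zero bytes stand in for the TCP header.
frame = ETH_HEADER + IPV4_HEADER + bytes(20)

# Small, incomplete lookup tables for the well-known values used here.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ethernet(frame_bytes):
    """Return (ethertype, payload) of an untagged Ethernet II frame."""
    _dst, _src, ethertype = struct.unpack("!6s6sH", frame_bytes[:14])
    return ethertype, frame_bytes[14:]


def parse_ipv4(packet_bytes):
    """Return (protocol, payload) of an IPv4 packet."""
    header_length = (packet_bytes[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    protocol = packet_bytes[9]
    return protocol, packet_bytes[header_length:]


ethertype, layer3 = parse_ethernet(frame)
protocol, layer4 = parse_ipv4(layer3)
print(ETHERTYPES.get(ethertype, hex(ethertype)), "->", IP_PROTOCOLS.get(protocol, protocol))
```

Each parser only looks at its own header and hands the remaining bytes to the next one, which is the same peeling of layers that the image illustrates.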

## Reading the data

As we are not the first people on the globe to build hardware for sending and receiving network frames, a lot of components already exist to *translate* the symbols on the physical layer into digital signals. The incoming data is often presented over a bus, which is typically 8, 32 or even 64 bits wide. It can be even wider in the networking hardware found in server rooms.
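To make the effect of the bus width concrete, the sketch below counts how many bus transfers (beats) are needed to present one frame on an 8-, 32- or 64-bit bus. The 62-byte frame length and the function name `bus_beats` are arbitrary choices for illustration.

```python
import math


def bus_beats(frame_length_bytes, bus_width_bits):
    """Number of bus transfers needed to present a frame of the given length."""
    bytes_per_beat = bus_width_bits // 8
    # The last beat may be only partially filled, hence the rounding up.
    return math.ceil(frame_length_bytes / bytes_per_beat)


frame_length = 62  # bytes, an arbitrary example length
for width in (8, 32, 64):
    print("{:2d}-bit bus: {} beats".format(width, bus_beats(frame_length, width)))
```

A wider bus needs fewer beats per frame, at the cost of a final beat in which not all byte lanes carry valid data.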

## The dataset
For the purposes of this workshop, we will work with a custom subset of the [CIC-IDS2017](https://www.unb.ca/cic/datasets/ids-2017.html) dataset. This dataset is commonly used in the network intrusion detection community, especially in literature that performs intrusion detection using machine learning.

This dataset contains roughly three kinds of traffic: unlabelled traffic, labelled benign traffic and labelled attack traffic. Its authors created it by combining agents that generated regular, benign traffic with specific attacks. The capture spans five weekdays, from Monday to Friday, with different attacks taking place on different days.

The dataset comes in two forms: Packet CAPture (.pcap) files that contain the binary packet traces as transmitted on the network, and Comma Separated Value (.csv) files that list the individual traffic flows together with their features and their label.
For this workshop, we used a subset of the data from Friday. Throughout the exercises you will discover what kinds of traffic are present in our selection.

## Interface with the dataset

The data we will use for this workshop is stored inside Numpy files (with extension .npy). To allow for easy use, we prepared an interface that provides access to packets.

In [None]:
from lib.dataset import NIDSDataset

data_file = 'data/dataset_packets_v2.npy'
labels_file = 'data/dataset_labels_v1.npy'

dataset = NIDSDataset(data_file, labels_file)

Reading packets from the dataset simply requires iterating over the dataset in a for-loop, as is shown below.

Please note that the previous **code block** must be executed before the for-loop example; otherwise the variable *dataset* is undefined.

In [None]:
packet_counter = 0

for packet in dataset:
    packet_counter += 1

print("Dataset contains {} packets: {}".format(packet_counter, packet_counter == len(dataset)))

For the purpose of this workshop, the data is presented sequentially in 32-bit words, each organised as an array of 4 bytes. You get these words by iterating over a packet:

In [3]:
for example_packet in dataset:
    for word in example_packet:
        print(word)
    break

As the packets are stored inside a Numpy file, each word is a Numpy array, which you should keep in mind when manipulating the data. For the sake of manageability, the packets are truncated to 100 bytes: only the first 100 bytes of each packet are stored in the dataset. Shorter packets contain fewer than 100 bytes, and the missing trailing bytes are zero-padded in the Numpy file. The zero-padding does not show up when iterating over a packet, as this only provides the actual packet bytes. For example, a packet containing 62 bytes returns a Numpy array with 2 elements as its final word.
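The truncation, zero-padding and word layout described above can be mimicked with plain Numpy. This is only a sketch of the idea, not the actual `NIDSDataset` implementation: the stored row, the explicitly passed packet length and the helper `iter_words` are assumptions made for illustration.

```python
import numpy as np

MAX_BYTES = 100   # packets are truncated to this many bytes
WORD_BYTES = 4    # one 32-bit word


def iter_words(stored_row, packet_length):
    """Yield 4-byte words of a packet, skipping the zero-padding behind it."""
    data = stored_row[:min(packet_length, MAX_BYTES)]
    for start in range(0, len(data), WORD_BYTES):
        yield data[start:start + WORD_BYTES]


# A made-up 62-byte packet, stored zero-padded to 100 bytes.
row = np.zeros(MAX_BYTES, dtype=np.uint8)
row[:62] = np.arange(1, 63, dtype=np.uint8)

words = list(iter_words(row, 62))
print(len(words), "words; the final word has", len(words[-1]), "elements")

# A full word can be combined into one integer in network byte order.
first_word = int.from_bytes(words[0].tobytes(), byteorder="big")
print(hex(first_word))
```

Because the final word can be shorter than 4 elements, code that combines a word into an integer or indexes into it should check the word's length first.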

Each packet has a label: Use the *get_label()* method to get a string that provides the label:

In [None]:
traffic_labels = []

for packet in dataset:
    label = packet.get_label()
    
    if label not in traffic_labels:
        traffic_labels.append(label)

print("These are the traffic classes present in the dataset: {}.".format(", ".join(traffic_labels)))

<hr/>
<center>
Continue with the <a href="10_regexes.ipynb">next notebook</a> in a new browser tab.<br/><br/>
<img src="images/footer.png"/>
</center>