# NIDS based on Network Flow Measurements

### What are network flows and flow measurement?

A network flow consists of all network packets that have the same flow identifier (ID). The flow ID can be extracted from the packet header and is usually defined by the 5-tuple: source and destination IP addresses, source and destination ports, and the protocol that is used at the transport layer.

<center>
<img src="images/counting/5tuple.png"/>
</center>
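
To make the 5-tuple concrete, here is a minimal sketch that extracts it from a raw Ethernet frame. It assumes an untagged Ethernet II frame (no VLAN tag) that carries IPv4, and the names `FlowID` and `extract_flow_id` are hypothetical helpers introduced only for this illustration. The dataset used later in this notebook may store packets in a different layout, so treat this as an outline of the header offsets rather than a solution to the exercise.

```python
import socket
import struct
from collections import namedtuple

# Hypothetical container for the 5-tuple (name chosen for this sketch only).
FlowID = namedtuple("FlowID", "src_ip dst_ip src_port dst_port proto")

def extract_flow_id(frame: bytes):
    """Return the 5-tuple of a raw frame, or None if it is not IPv4 TCP/UDP."""
    if len(frame) < 14 + 20:             # Ethernet header + minimal IPv4 header
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:              # 0x0800 = IPv4
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4             # IPv4 header length in bytes
    proto = ip[9]
    if proto not in (6, 17):             # 6 = TCP, 17 = UDP
        return None
    if len(ip) < ihl + 4:                # need the first 4 transport bytes
        return None
    src_ip = socket.inet_ntoa(ip[12:16])
    dst_ip = socket.inet_ntoa(ip[16:20])
    src_port, dst_port = struct.unpack("!HH", ip[ihl:ihl + 4])
    return FlowID(src_ip, dst_ip, src_port, dst_port, proto)

# Hand-built example frame: 10.0.0.1:12345 -> 10.0.0.2:80 over TCP.
eth = bytes(12) + b"\x08\x00"
ip_hdr = (bytes([0x45, 0, 0, 40, 0, 0, 0, 0, 64, 6, 0, 0])
          + bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2]))
tcp_start = struct.pack("!HH", 12345, 80) + bytes(16)
frame = eth + ip_hdr + tcp_start

flow = extract_flow_id(frame)
print(flow)
```

Because the result is a tuple, it is hashable and can be used directly as a dictionary key; every packet of the same flow maps to the same key.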

Flow measurement gathers metadata about the flows on a network, such as the presence of certain features, the average volume, or the peak volume. This flow data is indispensable in applications such as traffic analysis, network visibility, congestion control, heavy-hitter detection, anomaly detection, and intrusion detection.

Flow-measurement-based NIDSs use network flow data (flow size, flow volume, flow features, etc.) for intrusion detection.

### Getting started

For flow measurement on high-speed networks, especially in hardware, two operations must run at line speed: lookup and counting. The best way to reduce their latency is to keep the required state within on-chip memory, which can be achieved with efficient probabilistic lookup data structures and approximate counting techniques. This tutorial focuses only on high-speed lookup; the closing notebook gives a useful introduction to approximate counting.

In this exercise we measure flows by their size or volume and use these per-flow counts to detect anomalies and intrusions. We define **flow size** as the number of packets in a flow and **flow volume** as the total number of bytes in those packets. We start with flow size and turn to flow volume later.
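
The difference between the two measures can be seen on a toy trace. The sketch below uses made-up flow IDs (`"A"`, `"B"`) and packet lengths that do not come from the dataset; it only illustrates that flow size adds 1 per packet while flow volume adds the packet length.

```python
# Toy trace: (flow ID, packet length in bytes). Values are invented for illustration.
packets = [
    ("A", 1500), ("B", 60), ("A", 1500), ("A", 40), ("B", 60),
]

flow_size = {}    # flow ID -> number of packets
flow_volume = {}  # flow ID -> total number of bytes

for fid, length in packets:
    flow_size[fid] = flow_size.get(fid, 0) + 1            # one more packet
    flow_volume[fid] = flow_volume.get(fid, 0) + length   # add the byte count

print(flow_size)    # {'A': 3, 'B': 2}
print(flow_volume)  # {'A': 3040, 'B': 120}
```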

This exercise will walk you through the simplest of data structures and will illustrate how malicious flows can be detected through flow measurements.

As in the earlier notebooks, don't forget to load the dataset first!

In [6]:
from lib.dataset import NIDSDataset

data_file = 'data/dataset_packets_v2.npy'
labels_file = 'data/dataset_labels_v1.npy'

dataset = NIDSDataset(data_file, labels_file)

print("Dataset is loaded")

Dataset is loaded


We need to keep a count for every flow ID that passes, so we need a structure that stores one counter per flow ID. In Python, we can use a dictionary (a key-value data structure) to store the flow IDs and their corresponding flow sizes.
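
As a rough sketch of this bookkeeping, the snippet below builds a string key from header fields and updates a dictionary. It assumes the fields are already available as integers (IPv4 addresses as 32-bit values), which may differ from how the dataset presents them, and `make_flowid` is a hypothetical helper. One design point worth noting: when concatenating hex strings, fixed-width formatting avoids ambiguity, because stripping `0x` from `hex()` output also drops leading zeros.

```python
def make_flowid(src_ip, dst_ip, src_port, dst_port):
    """Concatenate the fields as fixed-width hex: 8 + 8 + 4 + 4 = 24 hex digits."""
    return f"{src_ip:08x}{dst_ip:08x}{src_port:04x}{dst_port:04x}"

# Without padding, different field values can produce the same string:
ambiguous_a = hex(0x1)[2:] + hex(0x23)[2:]   # "1" + "23"
ambiguous_b = hex(0x12)[2:] + hex(0x3)[2:]   # "12" + "3"
print(ambiguous_a == ambiguous_b)            # True: both are "123"

library = {}  # flowid -> flow size (number of packets)

# Invented example packets: (source IP, destination IP, source port, destination port).
observed = [
    (0x0A000001, 0x0A000002, 12345, 80),
    (0x0A000001, 0x0A000002, 12345, 80),
    (0x0A000003, 0x0A000002, 5353, 53),
]

for fields in observed:
    flowid = make_flowid(*fields)
    if flowid in library:
        library[flowid] += 1      # known flow: one more packet
    else:
        library[flowid] = 1       # first packet of a new flow

for flowid in library:
    print(flowid, '->', library[flowid])
```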

In [7]:
# INITIALISE THE COUNTERS TO ZERO

wordcounter=0
flowid = ""

# Python dictionary that maps each flowid to its flow size.
# Only flow sizes are counted here
# (i.e., every packet counts as 1, regardless of its length)
library = {} 

# loop over all packets in the dataset
for d in dataset:
    
    wordcounter = 0
    flowid = ""
    flowvolume = ""
    flowid_complete = 0
    # loop over all words
    for word in d:

        # examine if Ethertype is 0x0800 - in link layer header
        # if Ethertype is not equal to 0x0800, break
        """WRITE code here"""
        
        # examine if proto is tcp or udp 6/17 - in network layer header
        # if proto is not tcp or udp, break
        """WRITE code here"""
        
        # extract Total Length (flow volume) - in network layer header
        # hint: convert to hex and concatenate the bytes
        # (we will not use the flow volume in this exercise, but it is needed in the next one)
        """WRITE code here"""
                
        # Extract flowid. Flowid is source IPv4 address, dest IPv4 address, 
        # source port, dest port
        # hint: convert numbers to hex and remove the leading 0x. 
        #       then concatenate together
        
        # extract Source Address - in network layer header
        """WRITE code here"""
        
        # extract Destination Address - in network layer header
        """WRITE code here"""
        
        # extract Source Port and Destination Port - in transport layer header
        """WRITE code here"""

        # examine Destination port - in transport layer header
        # set the flag flowid_complete to 1 when the flowid is complete, and break the loop
        # If the flowid is complete, update the dictionary
        """WRITE code here"""
        
        wordcounter += 1
    
    # Check whether the extracted flowid is present in library. If not,
    # add the flowid with a size of 1. If present, increment its size by 1.
    if flowid_complete == 1:
        """WRITE code here"""
        
    # end of processing for this packet

# Print all the flowids along with their sizes
for flowid in library:  
    print(flowid, '->', library[flowid])


To flag a possibly malicious flow, we can impose a maximum on the flow size. Choosing the exact threshold is a delicate task and requires experience.

In the next exercise, we set a threshold of **20** packets to determine which flow IDs exhibit anomalous behaviour by exceeding the allowed bandwidth.

In [8]:
"""Print the flow IDs that exceed a threshold"""

threshold = 20

# Print only the flowids whose size exceeds the threshold
for flowid in library:
    if library[flowid] > threshold:
        print(flowid, '->', library[flowid])


<center><div style="background-color: #b9ffb9; padding: 10px,20px;  width: 80%;">The code above should report that there are 195 FlowIDs in the dataset, of which 10 exceed the allowed bandwidth.</div></center>

<hr/>
<center>
Continue with the <a href="21_counting.ipynb">next notebook</a> in a new browser tab.<br/><br/>
<img src="images/footer.png"/>
</center>