# NIDS based on Network Flow Measurements

### What is network flow and flow measurement?

A network flow consists of all network packets that have the same flow identifier (ID). The flow ID is extracted from the packet headers and is usually defined by the 5-tuple: source address, destination address, source port, destination port, and protocol.

<hr/>
<center>
<img src="images/counting/5tuple.png"/>
</center>
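
As a rough illustration of how a flow ID can be derived from header fields, the sketch below builds a tiny Ethernet/IPv4/UDP frame by hand and pulls out the 5-tuple. It assumes an untagged Ethernet II frame carrying IPv4; the helper `extract_five_tuple` and the sample frame are made up for this example and are not part of the exercise library.

```python
import struct

def extract_five_tuple(frame: bytes):
    """Return (src_ip, dst_ip, src_port, dst_port, proto), or None.

    Assumption: untagged Ethernet II frame (14-byte link layer header).
    """
    if len(frame) < 14 + 20:
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:            # not IPv4
        return None
    ihl = (frame[14] & 0x0F) * 4       # IPv4 header length in bytes
    proto = frame[14 + 9]              # protocol field of the IPv4 header
    if proto not in (6, 17):           # keep only TCP (6) and UDP (17)
        return None
    src_ip = ".".join(str(b) for b in frame[26:30])
    dst_ip = ".".join(str(b) for b in frame[30:34])
    l4 = 14 + ihl                      # start of the transport layer header
    if len(frame) < l4 + 4:
        return None
    src_port, dst_port = struct.unpack("!HH", frame[l4:l4 + 4])
    return (src_ip, dst_ip, src_port, dst_port, proto)

# Hand-built sample frame (illustrative values only)
eth = bytes(6) + bytes(6) + b"\x08\x00"
ip = bytes([0x45, 0, 0, 28, 0, 0, 0, 0, 64, 17, 0, 0,
            10, 0, 0, 1, 10, 0, 0, 2])
udp = struct.pack("!HHHH", 1234, 53, 8, 0)
frame = eth + ip + udp

five_tuple = extract_five_tuple(frame)
print(five_tuple)
```

The exercises below build the flow ID as a concatenated hex string instead of a tuple; either form works as a dictionary key.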

Flow measurement is the collection of flow data. Measuring flow data is useful in a number of applications such as traffic analysis, network visibility, congestion control, heavy-hitter detection, anomaly detection, and intrusion detection.

A flow-measurement-based NIDS uses network flow data (flow size, flow volume, flow features, etc.) for intrusion detection.

### Getting started

In this exercise we use flow size and flow volume as our parameters: we count the flow size or flow volume per flow ID to detect anomalies and intrusions. We define flow size as the number of packets in a flow and flow volume as the number of bytes in those packets. We start with flow size and move on to flow volume later.

This exercise walks you through the simplest of data structures to show how malicious traffic can be detected through flow measurements.

#### How do we measure the flow size/volume?
We require a counter array to count the flow size. In software, we can make use of a dictionary to store the flow IDs and corresponding flow sizes. 
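
A minimal sketch of this idea, using a hand-made list of `(flow_id, packet_length)` pairs instead of the exercise dataset. The flow IDs, lengths, and variable names are invented for illustration.

```python
# Toy packet trace: (flow ID, packet length in bytes). Illustrative values only.
packets = [
    ("10.0.0.1:1234->10.0.0.2:80", 60),
    ("10.0.0.3:5555->10.0.0.2:53", 80),
    ("10.0.0.1:1234->10.0.0.2:80", 1500),
    ("10.0.0.1:1234->10.0.0.2:80", 40),
]

flow_size = {}    # flow ID -> number of packets
flow_volume = {}  # flow ID -> number of bytes

for flow_id, length in packets:
    # dict.get returns 0 for a flow ID that has not been seen yet
    flow_size[flow_id] = flow_size.get(flow_id, 0) + 1
    flow_volume[flow_id] = flow_volume.get(flow_id, 0) + length

for flow_id in flow_size:
    print(flow_id, "->", flow_size[flow_id], "packets,",
          flow_volume[flow_id], "bytes")
```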


#### Load the dataset first

In [130]:
#Load dataset
import import_ipynb
from lib.dataset import NIDSDataset

data_file = 'data/packets.npy'
labels_file = 'data/labels.npy'

dataset = NIDSDataset(data_file, labels_file)

#### Exercise 20.1

In [131]:
# INITIALISE THE COUNTERS TO ZERO

wordcounter=0
flowid = ""
library = {} # Dictionary mapping each flowid to its flow size. Here we count only flow size (i.e., each
             # packet counts as 1)

# loop over all datasets
for d in dataset:

    decision_is_made = 0 # decision_is_made = 1 when ethertype is not 0x0800 or packet is neither TCP nor UDP
                         # decision_is_made = 2 when the flow ID is extracted
    wordcounter = 0
    flowid = ""

    # loop over all words
    for word in d:
        # stop parsing once a decision is made
        if decision_is_made == 0:

            # examine if Ethertype is 0x0800 - in link layer header
            # if Ethertype is not equal to 0x0800, break
            """WRITE code here"""
            
            # examine if proto is tcp or udp 6/17 - in network layer header
            # if proto is not tcp or udp, break
            """WRITE code here"""
            
                    
            # Extract flowid. flowid is (source address, dest address, source port, dest port)
            # hint: convert to hex and remove 0x. concatenate the addresses and ports
            
            # extract Source Address - in network layer header
            """WRITE code here"""
            

            # extract Destination Address - in network layer header
            """WRITE code here"""
            
            
            # extract source port and destination port - in transport layer header
            """WRITE code here"""
            

            # examine Destination port - in transport layer header
            # If the flowid is complete, break out of the loop
            """WRITE code here"""
            
        wordcounter += 1
    
    if decision_is_made == 2:
        # check whether the flowid is already in the library:
        # if it is present, increment its flow size by 1; if not, add the new
        # flowid with a size of 1.
        """ WRITE CODE here"""
        
    # end of iteration over words

# PRINTS all the flowids along with the sizes
for flowid in library:  
    print(flowid, '->', library[flowid])


#### Do any of the flows exhibit anomalous behaviour?

#### Exercise 20.2

We set a threshold to determine which flow IDs exhibit anomalous behaviour by exceeding the allocated bandwidth.
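
The check itself is a simple comparison per flow. A small sketch on a made-up table of flow sizes follows; the names `sizes`, `limit`, and `heavy` are placeholders and not the variables used in the exercise.

```python
# Made-up flow sizes (packets per flow ID), for illustration only.
sizes = {"flowA": 3, "flowB": 42, "flowC": 11}
limit = 10  # hypothetical threshold

# Keep only the flows whose size is strictly above the limit.
heavy = {fid: count for fid, count in sizes.items() if count > limit}
print(heavy)
```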

In [132]:
"""Print the flow IDs that exceed a threshold"""

threshold = 10

# iterate through the library and print all the flowids whose size exceeds the threshold
""" WRITE code here """




#### Making use of a hashtable
If we want to reduce the memory footprint, we can hash each flow ID to obtain an index into a fixed-size table and store the flow ID and its flow size at that index. Python's dictionary is itself a hash table, in which the flow IDs and associated flow sizes are stored as key-value pairs. In an HDL there is no such concept as a hash table, so we have to implement it manually.

<hr/>
<center>
<img src="images/counting/hashmap.png"/>
</center>
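
As a sketch of the indexing step, the snippet below maps a flow ID string to a table index by hashing it with MD5 and reducing the result modulo the table length. The function name `bucket_index`, the table length of 8, and the sample flow IDs are assumptions chosen for illustration.

```python
import hashlib

def bucket_index(flow_id: str, table_len: int) -> int:
    """Map a flow ID to an index in the range 0 .. table_len - 1."""
    digest = hashlib.md5(flow_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % table_len

table_len = 8  # deliberately small, illustrative only
sample_ids = ["0a0000010a00000204d20035", "0a0000030a00000215b30050"]

indices = [bucket_index(fid, table_len) for fid in sample_ids]
for fid, idx in zip(sample_ids, indices):
    print(fid, "->", idx)
```

The same flow ID always yields the same index, while many different flow IDs share the few available indices.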

#### How to address hash collisions?
Because the hash value is reduced to the length of the hash table, different flow IDs can map to the same index, so collisions will occur. We need a measure such as chaining to handle hash collisions in the hash table.

Chaining is simple and works like a linked list: each index holds a separate list that can contain many elements. The advantage is that the hash table never fills up; we can always add more elements to a chain.

<hr/>
<center>
<img src="images/counting/chaining.png"/>
</center>
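
The idea can be sketched with a tiny table and a deliberately weak hash function so that collisions are easy to provoke. The names `toy_hash`, `add`, and `query` are hypothetical and only loosely mirror the functions of the following exercise.

```python
table_len = 4
table = [None] * table_len   # each used slot will hold a list (the chain)

def toy_hash(key: str) -> int:
    # Deliberately weak hash (key length) so that collisions are easy to see.
    return len(key) % table_len

def add(key, value):
    idx = toy_hash(key)
    if table[idx] is None:
        table[idx] = []              # start a new chain
    for pair in table[idx]:
        if pair[0] == key:           # key already stored: accumulate
            pair[1] += value
            return
    table[idx].append([key, value])  # new key: extend the chain

def query(key):
    chain = table[toy_hash(key)]
    if chain is None:
        return 0
    for stored_key, stored_value in chain:
        if stored_key == key:
            return stored_value
    return 0                         # key not found in the chain

add("ab", 1)
add("cd", 5)   # same length as "ab", so it collides with it
add("ab", 2)
print(table[toy_hash("ab")])
```

Both keys end up in the same chain, and each keeps its own accumulated value.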

#### Exercise 20.3a

In [133]:
""" HASHTABLE """

import hashlib

n = 256 # length of the hashtable
array = [None] * n

""" Define functions here for the hashtable """

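def _hash(key):
    """MD5-based hash function that maps a key to a table index"""
    # Hash the key's string form directly: the built-in hash() of a str
    # changes between interpreter runs, which would make indices unstable.
    md5 = hashlib.md5(str(key).encode('utf-8'))
    return int(md5.hexdigest(), 16) % n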
    
def add_ht(key, value):
    """Add a value to hashtable by its key and update the contents if the cell is not empty"""
    index = _hash(key)
    if array[index] is not None:
        # WRITE the code to check whether the flowid of a [key, value] pair in array[index] is equal to
        # the incoming flowid. If it is, add the value to the existing value.
        # If no stored flowid matches, we have to chain: append the new
        # [key, value] pair to the list in array[index].
        """ WRITE code here """
        
        
    else: 
        # If the index is empty, create an empty list in array[index] and
        # append the key-value pair to the list.
        """ WRITE code here """
        
        
def query_ht(key):
    """Get a value by key"""
    index = _hash(key)
    if array[index] is None:
        return 0
    else:
        # iterate through all key-value pairs and check whether the flowid exists.
        # If it exists, return its value. If nothing was returned during the iteration,
        # the flowid does not exist, so return 0.
        """ WRITE code here """
        
        


#### Exercise 20.3b

In [134]:
""" Now read the dataset again to update the hashtable. You can copy and """
""" paste the code from the previous exercise. """
""" In this exercise we take the actual flow volume instead of counting each packet as flow size = 1. """
wordcounter=0
flowid = ""     # flow id
flowvolume = "" # flow volume (Total length from the ip header)
flowlist = []  # keeps a list to store flows

# loop over all datasets
for d in dataset:

    decision_is_made = 0 # decision_is_made = 1 when ethertype is not 0x0800 or packet is neither TCP nor UDP
                         # decision_is_made = 2 when the flow ID is extracted
    wordcounter = 0
    flowid = ""
    flowvolume = "" 

    # loop over all words
    for word in d:
        # stop parsing once a decision is made
        if decision_is_made == 0:

            # examine if Ethertype is 0x0800 - in link layer header
            # if Ethertype is not equal to 0x0800, break loop
            """WRITE code here"""
            
            # examine if proto is tcp or udp 6/17 - in network layer header
            # if proto is not tcp or udp, break loop
            """WRITE code here"""
            
                    
            # extract Total length (flow volume) - in network layer header
            # hint: convert to hex and concatenate the bytes
            """WRITE code here"""
            
                
            # Extract flowid. flowid is (source address, dest address, source port, dest port)
            # hint: convert to hex and remove 0x. concatenate the addresses and ports
            
            # extract Source Address - in network layer header
            """WRITE code here"""
            

            # extract Destination Address - in network layer header
            """WRITE code here"""
            
            
            # extract source port and destination port - in transport layer header
            """WRITE code here"""
            

            # examine Destination port - in transport layer header
            # If the flowid is complete, break out of the loop
            """ WRITE code here """
            
            
        wordcounter += 1
        
        
    # Convert the flow volume to integer
    """ WRITE code here """
    
        
    if flowid not in flowlist:
        flowlist.append(flowid)


    """ Updating the table """
    # add flowid and flowvolume to the hashtable using the function
    """ WRITE code here (1 line of code)"""
    

#### Checking for anomalies
#### Exercise 20.3c

In [135]:
"""Print the flow IDs that exceed a threshold"""
threshold = 600
# iterate through the flowlist and print those flowids having total volume greater than the threshold
"""WRITE code here """




<hr/>
<center>
Continue with the <a href="01_readingframes.ipynb">next notebook</a> in a new browser tab.<br/><br/>
<img src="images/footer.png"/>
</center>