# NIDS based on Network Flow Measurements

### Are previous data structures really hardware friendly?

Even though hash tables with collision-resolution mechanisms such as linked lists are nice solutions, linked lists are not tailored for hardware. Also, a one-to-one mapping requires a considerable amount of memory, as the number of encountered flows could be in the millions.

A hardware-friendly alternative is to use **sketches**, which additionally have a built-in mechanism for handling collisions. Instead of a one-to-one mapping, each flow is mapped to multiple counters, and each counter is shared by multiple flows when collisions occur. Sketches require less memory because the flowIDs do not have to be stored. The downside of sketches is the possibility of overestimation.
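To get a feeling for the memory argument, the following rough calculation compares an exact per-flow table with a sketch. It is only a back-of-the-envelope sketch: the 12-byte key (two IPv4 addresses and two ports), the one million flows, and the helper names `exact_table_bytes` and `sketch_bytes` are assumptions made for illustration, and pointer or bucket overhead of a real hash table is ignored.

```python
import array

# Assumed key size: source address + destination address + source port + destination port
FLOW_ID_BYTES = 4 + 4 + 2 + 2
# Size of one "l" (signed long) counter; platform dependent
COUNTER_BYTES = array.array("l").itemsize

def exact_table_bytes(n_flows):
    """Memory for one (flowID, counter) entry per flow, ignoring any overhead."""
    return n_flows * (FLOW_ID_BYTES + COUNTER_BYTES)

def sketch_bytes(width, depth):
    """Memory for a sketch: only counters are stored, no flowIDs."""
    return width * depth * COUNTER_BYTES

n_flows = 1_000_000  # assumed number of flows
print("exact table:", exact_table_bytes(n_flows), "bytes")
print("sketch 256 x 4:", sketch_bytes(256, 4), "bytes")
```

The sketch size is fixed by its width and depth and does not grow with the number of flows; what grows instead is the number of collisions, and with it the overestimation.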

### CM Sketch

A CM (Count-Min) sketch is represented by a 2D array of counters with width w (called `m` in the code below) and depth d. The depth corresponds to the number of hash functions that are used.

<center>
<img src="images/counting/CM_sketch.png"/>
</center>

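In the example above, h<sub>1</sub>, h<sub>2</sub>, ..., h<sub>d</sub> are independent hash functions applied to the FlowID *f1*. During an update operation, each FlowID is mapped to *d* counters in the sketch, one per row, and the update value is added to each of them. When the CM sketch is queried, the minimum of the *d* counters is returned as the result: collisions can only increase a counter, so the smallest counter is the one closest to the true value.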
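The update and query rules can be traced by hand on a toy sketch. In the minimal example below, the counter positions are hand-picked instead of being computed by hash functions; the `positions` table and the flow names are hypothetical and only serve to force one collision.

```python
# Toy sketch: d = 2 rows, w = 4 counters per row
sketch = [[0] * 4 for _ in range(2)]

# Hand-picked counter position per row (stand-in for real hash functions)
positions = {
    "f1": (0, 2),
    "f2": (0, 3),  # collides with f1 in row 0, but not in row 1
}

def update(flow, value):
    """Add value to the one counter per row that the flow maps to."""
    for row, col in enumerate(positions[flow]):
        sketch[row][col] += value

def query(flow):
    """Return the minimum over the flow's counters."""
    return min(sketch[row][col] for row, col in enumerate(positions[flow]))

update("f1", 100)
update("f2", 40)

print(sketch[0][0])  # 140: shared counter holds the sum of both flows
print(query("f1"))   # 100: row 1 is collision-free for f1
print(query("f2"))   # 40:  row 1 is collision-free for f2
```

A flow is only overestimated when it collides with other flows in every one of its *d* rows.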

In [2]:
#Load dataset
from lib.dataset import NIDSDataset

data_file = 'data/dataset_packets_v2.npy'
labels_file = 'data/dataset_labels_v1.npy'

dataset = NIDSDataset(data_file, labels_file)

#### Implementing CM Sketch

#### Exercise 22.1a

In [1]:
import array
import hashlib
""" CM Sketch parameters and functions  """    
m = 256
d = 4

#initializes an empty 2-d array
tables = []
for i in range(d):
    table = array.array("l", (0 for i in range(m))) # "l" is the typecode indicating that the type is signed long
    tables.append(table)
    
# Here, instead of using d independent hash functions, d hash values are derived from one md5 object:
# the row index i is fed into it before each digest is read.
def _hash(flowid):
    """ hash computation """
    m=256
    d=4
    md5 = hashlib.md5(str(hash(flowid)).encode('utf-8'))
    for i in range(d):
        md5.update(str(i).encode('utf-8'))
        yield int(md5.hexdigest(), 16) % m # yield gives a generator object and has to be iterated to read the values.
        
def add_cms(flowid, value):
    """Add a value to the d counters of the sketch that the flowid maps to"""
    # get the d indexed locations of the sketch by hashing. Uncomment 'indices =' and complete the code.
    """WRITE code here """ 
    indices = _hash(flowid)
    
    # Iterate through tables and indices and update the value stored in each indexed location
    # Whether the location is empty or not, just add the value to the already existing value
    for table, i in zip(tables,indices):
        """ WRITE Code here """
        table[i] += value

def query_cms(flowid):
    """Get a value by key"""
    # get the d indexed locations of the sketch by hashing. Uncomment 'indices =' and complete the code.
    """WRITE code here """ 
    indices = _hash(flowid)
    
    # Iterate through tables and indices and return the minimum of the values stored in the indexed locations
    """ WRITE Code """
    return min(table[i] for table, i in zip(tables, indices))


Loop through the dataset again and determine which of the flowids exhibit anomalous behaviour by exceeding the allocated bandwidth.

(***HINT***: you can recover most of the work from your previous exercises.)

#### Exercise 22.1b

In [3]:
m = 256
d = 4
#resets the sketch to an empty 2-d array
tables = []
for i in range(d):
    table = array.array("l", (0 for i in range(m))) # "l" is the typecode indicating that the type is signed long
    tables.append(table)
    
""" Now, read the dataset again to update the sketch. You can copy and """
""" paste the code from the previous exercise. """

wordcounter=0
flowid = ""     # flow id
flowvolume = "" # flow volume
flowlist = []  # keeps a list to store flows

# loop over all datasets
for d in dataset:

    decision_is_made = 0 # decision_is_made = 1 when ethertype is not 0x0800 or packet is neither TCP nor UDP
    wordcounter = 0
    flowid = ""
    flowvolume = ""
    flowid_complete = 0

    # loop over all words
    for word in d:
        # stop parsing if a decision is made
        if decision_is_made == 0:

            # examine if Ethertype is 0x0800 - in link layer header
            # if Ethertype is not equal to 0x0800, break loop
            """WRITE code here"""
            if wordcounter == 3:
                if(word[0] == 8) and (word[1] == 0):
                    decision_is_made = 0
                else:
                    decision_is_made = 1
                    break
            # examine if proto is tcp or udp 6/17 - in network layer header
            # if proto is not tcp or udp, break loop
            """WRITE code here"""
            if wordcounter == 5:
                if(word[3] == 6)or(word[3] == 17):
                    decision_is_made = 0
                else:
                    decision_is_made = 1
                    break
                    
            # extract Total length (flow volume) - in network layer header
            # hint: convert to hex and concatenate the bytes
            """WRITE code here"""
            if(wordcounter == 4):
                flowvolume += hex(word[0])[2:] # ip len 1/2
                flowvolume += hex(word[1])[2:]  # ip len 2/2
                
            # Extract flowid. flowid is (source address, dest address, source port, dest port)
            # hint: convert to hex and remove 0x. concatenate the addresses and ports
            
            # extract Source Address - in network layer header
            """WRITE code here"""
            if wordcounter == 6:
                flowid += hex(word[2])[2:]  # ip SA 1/4
                flowid += hex(word[3])[2:]  # ip SA 2/4
            if wordcounter == 7:
                flowid += hex(word[0])[2:]  # ip SA 3/4
                flowid += hex(word[1])[2:]  # ip SA 4/4
            # extract Destination Address - in network layer header
            """WRITE code here"""
            if wordcounter == 7:
                flowid += hex(word[2])[2:]  # ip DA 1/4
                flowid += hex(word[3])[2:]  # ip DA 2/4
            if wordcounter == 8:
                flowid += hex(word[0])[2:]  # ip DA 3/4
                flowid += hex(word[1])[2:]  # ip DA 4/4
            
            # extract Source port - in transport layer header
            """WRITE code here"""
            if wordcounter == 8:
                flowid += hex(word[2])[2:]  # ip SPort 1/2
                flowid += hex(word[3])[2:]  # ip SPort 2/2

            # examine Destination port - in transport layer header
            # If the flowid is complete, set the flag flowid_complete to 1 and break out of the loop
            """ WRITE code here """
            if wordcounter == 9:
                flowid += hex(word[0])[2:]  # ip DPort 1/2
                flowid += hex(word[1])[2:]  # ip DPort 2/2
                flowid_complete = 1
                break
            
        wordcounter += 1
        
    if(flowid_complete == 1): 
        if flowid not in flowlist:
            flowlist.append(flowid)
            
        # Convert the flow volume to integer
        """ WRITE code here """
        flowvolume = (int("0x"+flowvolume,16))

        """ Updating the table """
        # add flowid and flowvolume to the sketch using the function
        """ WRITE code here """
        add_cms(flowid,flowvolume)
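
The parsing code above builds hex strings with `hex(b)[2:]`, which drops leading zeros, so a byte such as 0x05 contributes a single digit. A fixed-width variant could look like the sketch below; `bytes_to_hex` is a hypothetical helper and the two length bytes are made-up values, not taken from the dataset. Switching the exercise code to such a helper would change the flowid strings and may change the volumes, so the outputs recorded in this notebook correspond to the unpadded version.

```python
def bytes_to_hex(byte_values):
    """Concatenate bytes as fixed-width, two-digit hex (leading zeros kept)."""
    return "".join(f"{int(b):02x}" for b in byte_values)

# Made-up Total length field: bytes 0x01 0x05, i.e. 0x0105 = 261
length_bytes = [0x01, 0x05]

unpadded = hex(length_bytes[0])[2:] + hex(length_bytes[1])[2:]
padded = bytes_to_hex(length_bytes)

print(unpadded, int(unpadded, 16))  # 15 21
print(padded, int(padded, 16))      # 0105 261
```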
            

#### Checking for anomalies

#### Exercise 22.1c

In [4]:
"""Print the flow IDs that exceed a threshold"""
threshold = 1000

# iterate through the flowlist and print those flowids having total volume greater than the threshold
count_flows = 0
count_malicious = 0
""" WRITE code here """
for flowid in flowlist:
    count_flows += 1
    volume = query_cms(flowid)
    if(volume>threshold):
        print(flowid, '->', volume)
        count_malicious += 1

print("Total flows = ",count_flows)
print("Number of malicious flows = ", count_malicious)

c0a8a32c0a8a3db2ccc4 -> 14560
c0a8a3c0a8a32cc4db2c -> 1984
c0a8a32c0a8a3a4a0185 -> 14560
c0a8a3c0a8a32185a4a0 -> 5888
c0a8a9e000fcf6ea14eb -> 1232
c0a8a9c0a8a345c27 -> 1152
c0a8a9c0a8a3089089 -> 1920
c0a8a3c0a8a9089089 -> 1800
c0a8a9c0a8a3461bd -> 1360
c0a8a3c0a8a91bd46 -> 1040
c0a8a19c0a8aff08a08a -> 11712
c0a8a32c0a8aff08a08a -> 1150
c0a8a19c0a8a3089089 -> 2880
c0a8a3c0a8a19089089 -> 2700
c0a8a19e000fb14e914e9 -> 3024
Total flows =  195
Number of malicious flows =  15


<center><div style="background-color: #10FF107f;">The code above should *again* report that there are 195 FlowIDs in the dataset, of which 15 exceed the allowed bandwidth.</div></center>

### Some extra info
The threshold check can also be performed during the update itself, so that a malicious flow ID can be blacklisted in real time.
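
One way this could look is sketched below. It is a self-contained illustration, not the code from 22.1a: the names `add_and_check`, `row_indices` and `blacklist` are hypothetical, and the row index is derived from md5 over the row number and the flow ID, which is a simplification of the `_hash` function used in the exercise.

```python
import hashlib

WIDTH, DEPTH = 256, 4   # same shape as the sketch in 22.1a
THRESHOLD = 1000        # same threshold as in 22.1c
rows = [[0] * WIDTH for _ in range(DEPTH)]
blacklist = set()       # hypothetical container for flagged flow IDs

def row_indices(flowid):
    """Yield one counter index per row (assumed hashing scheme)."""
    for row in range(DEPTH):
        digest = hashlib.md5(f"{row}:{flowid}".encode("utf-8")).hexdigest()
        yield int(digest, 16) % WIDTH

def add_and_check(flowid, value):
    """Update the sketch; return True if the flow's estimate now exceeds THRESHOLD."""
    estimate = None
    for row, i in zip(rows, row_indices(flowid)):
        row[i] += value
        # Track the minimum of the freshly updated counters in the same pass
        estimate = row[i] if estimate is None else min(estimate, row[i])
    if estimate > THRESHOLD:
        blacklist.add(flowid)
        return True
    return False

print(add_and_check("flowA", 600))  # False: estimate is 600
print(add_and_check("flowA", 600))  # True: estimate is 1200
```

Because the minimum is computed while the counters are being updated, no separate query pass over a stored list of flow IDs is needed to flag a flow.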

To see how collisions cause overestimation in the CM sketch, change the value of m to 64 in 22.1a and 22.1b (you have to make changes in three places: two in 22.1a and one in 22.1b), and then rerun exercises 22.1a and 22.1b. Then run 22.1d below to check the difference. (DO NOT run cell 22.1c again. Keep its result so that we can compare the results of 22.1c and 22.1d.)

#### Exercise 22.1d

In [10]:
"""Print the flow IDs that exceed a threshold"""
threshold = 1000

# iterate through the flowlist and print those flowids having total volume greater than the threshold
count_flows = 0
count_malicious = 0
""" WRITE code here """
for flowid in flowlist:
    count_flows += 1
    volume = query_cms(flowid)
    if(volume>threshold):
        print(flowid, '->', volume)
        count_malicious += 1

print("Total flows = ",count_flows)
print("Number of malicious flows = ", count_malicious)

Total flows =  0
Number of malicious flows =  0


<center><div style="background-color: #10FF107f;">The code above should report that there are 195 FlowIDs in the dataset, of which 31 exceed the allowed bandwidth.</div></center>

We can see that 31 flows are now marked as malicious, even though 16 of them are not. These false positives are caused by overestimation: with fewer counters there are more collisions, i.e., there is always a memory vs. accuracy trade-off.
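
The same effect can be reproduced without the dataset. The sketch below is a minimal, self-contained experiment under stated assumptions: the traffic is synthetic (195 made-up flows with volumes between 100 and 1070), and the hashing scheme and the helper names `make_sketch`, `indices`, `add`, `query` and `false_positives` are illustrative, not the ones from 22.1a. It counts flows that are flagged although their true volume does not exceed the threshold.

```python
import hashlib

def make_sketch(width, depth=4):
    return [[0] * width for _ in range(depth)]

def indices(flowid, width, depth):
    """One counter index per row (assumed hashing scheme)."""
    for row in range(depth):
        digest = hashlib.md5(f"{row}:{flowid}".encode("utf-8")).hexdigest()
        yield int(digest, 16) % width

def add(sketch, flowid, value):
    for row, i in zip(sketch, indices(flowid, len(sketch[0]), len(sketch))):
        row[i] += value

def query(sketch, flowid):
    return min(row[i] for row, i in
               zip(sketch, indices(flowid, len(sketch[0]), len(sketch))))

# Synthetic traffic, made up for illustration
true_volume = {f"flow{n}": 100 + 5 * n for n in range(195)}

def false_positives(width, threshold=1000):
    """Count flows flagged by the sketch whose true volume is within the threshold."""
    sketch = make_sketch(width)
    for flowid, volume in true_volume.items():
        add(sketch, flowid, volume)
    estimates = {f: query(sketch, f) for f in true_volume}
    # A CM sketch with non-negative updates never underestimates
    assert all(estimates[f] >= true_volume[f] for f in true_volume)
    return sum(1 for f in true_volume
               if estimates[f] > threshold and true_volume[f] <= threshold)

for width in (256, 64):
    print("width", width, "-> false positives:", false_positives(width))
```

In this setup the narrower sketch can never produce fewer false positives than the wider one: since 64 divides 256, every counter of the 64-wide sketch is the sum of four counters of the 256-wide sketch.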

<hr/>
<center>
Continue with the <a href="30_machine_learning.ipynb">next notebook</a> in a new browser tab.<br/><br/>
<img src="images/footer.png"/>
</center>