## 1. Sampling Task
Pick any of the CTU-13 datasets (Malware capture 42 to 54). Download the unidirectional netflows. DO NOT DOWNLOAD THE VIRUS THAT WAS USED TO GENERATE THE DATA UNLESS USING A VM OR OTHER SANDBOX. The flows are collected from a host in the network. Its IP address should be obvious from the data sample. We are interested in the other addresses the host connects with.
Estimate the distribution over the other IP addresses. What are the 10 most frequent values? Write code for RESERVOIR sampling and use it to estimate the distribution in one pass (there is no need to actually stream the data: you may store it in memory, or run every file separately, but do store and load the intermediate results). Use a range of reservoir sizes. What are the 10 most frequent IP addresses and their frequencies when sampled? Use the theory to explain any approximation errors you observe.

In [1]:
import pandas as pd
import matplotlib
import numpy as np

The dataset is CTU-Malware-Capture-Botnet-43. The infected host's IP address is 147.32.84.165.

In [2]:
# Pre-process: split the raw data into the following columns: Date, Duration, Protocol, Source, Dest and Label
columns=['Date','Duration','Protocol','Source','Dest','Label']
lst=[]
with open("capture20110811.pcap.netflow.labeled") as fp:  
    for cnt, line in enumerate(fp):
        if cnt!=0:
            dat=line.split("\t")
            lst.append([dat[0],dat[1],dat[2],dat[3].split(':')[0],dat[5].split(':')[0],dat[11].split('\n')[0]])
dataset=pd.DataFrame(lst, columns=columns)

Keep only the flows in which the infected host is the source or the destination.

In [256]:
infected_host='147.32.84.165'
infected_dataset=dataset.loc[(dataset['Source']==infected_host) | (dataset['Dest']==infected_host)]
infected_dataset=infected_dataset.reset_index()
infected_dataset.to_csv('infected_dataset.csv')
infected_dataset.head()

Unnamed: 0,index,Date,Duration,Protocol,Source,Dest,Label
0,540283,2011-08-11 10:27:20.087,0.0,UDP,147.32.84.165,147.32.80.9,1
1,541362,2011-08-11 10:27:22.334,0.0,UDP,147.32.84.165,147.32.80.9,1
2,541377,2011-08-11 10:27:22.355,0.045,TCP,147.32.84.165,74.125.232.198,Botnet
3,541384,2011-08-11 10:27:22.362,0.034,TCP,74.125.232.198,147.32.84.165,Botnet
4,702906,2011-08-11 10:32:25.092,0.0,UDP,147.32.84.165,147.32.80.9,1


Calculate the ground truth distribution

In [4]:
from collections import Counter
# Exclude flows whose destination is the infected host itself
counter=Counter(infected_dataset.loc[infected_dataset['Dest']!=infected_host,'Dest'])
top10=counter.most_common(10)
# Note: the denominator counts every flow involving the host, including incoming ones
top10_percentage=[[ip,n/len(infected_dataset)] for ip,n in top10]
top10_percentage

[['193.23.181.44', 0.11903337258005366],
 ['174.128.246.102', 0.06614680846957093],
 ['174.37.196.55', 0.06479569186820823],
 ['67.19.72.206', 0.0605107220753151],
 ['72.20.15.61', 0.05724874056631087],
 ['173.236.31.226', 0.03296724507324982],
 ['184.154.89.154', 0.032388195101237235],
 ['46.4.36.120', 0.031403810148815846],
 ['147.32.80.9', 0.015190410932463472],
 ['217.163.21.37', 0.013530467679360728]]

In [375]:
# To save memory, process the file line by line instead of reading the whole dataset at once.
def RESERVIOR(fp,m):
    # fp is an open file object; m is the reservoir size
    reservoir=[]
    count=0  # number of flows that have passed the filter so far
    for newline in fp:
        dat=newline.split("\t")
        Source=dat[3].split(':')[0].strip()
        Dest=dat[5].split(':')[0].strip()
        # Keep only flows sent by the infected host to another address
        if Dest==infected_host or Source!=infected_host:
            continue
        count=count+1
        if count<=m:
            # The first m items fill the reservoir
            reservoir.append(Dest)
        # The i-th item (i>m) is sampled with probability p_i = m/i
        elif np.random.uniform(0,1)<m/count:
            # It replaces a uniformly chosen slot of the reservoir
            s=np.random.randint(0,m)
            reservoir[s]=Dest
    return reservoir

Test the reservoir algorithm with sizes 10, 100, 1000 and 10000. This process can be very time-consuming.

In [376]:
for j in [10,100,1000,10000]:
    with open("capture20110811.pcap.netflow.labeled") as fp:
        fp.readline()
        sampledData=RESERVIOR(fp,j)
        counter=Counter(sampledData)
        top10=counter.most_common(10)
        top10_percentage=[]
        for i in range(len(top10)):
            top10_percentage.append([top10[i][0],top10[i][1]/len(sampledData)])
        print(f'distribution when k={j} is: {top10_percentage}')

distribution when k=10 is: [['193.23.181.44', 0.2], ['80.12.204.241', 0.1], ['174.128.246.102', 0.1], ['217.163.21.37', 0.1], ['147.32.80.9', 0.1], ['174.37.196.55', 0.1], ['184.154.89.154', 0.1], ['80.232.168.198', 0.1], ['67.19.72.206', 0.1]]
distribution when k=100 is: [['193.23.181.44', 0.14], ['174.37.196.55', 0.12], ['72.20.15.61', 0.08], ['173.236.31.226', 0.06], ['67.19.72.206', 0.05], ['174.128.246.102', 0.05], ['184.82.155.107', 0.05], ['46.4.36.120', 0.04], ['184.154.89.154', 0.03], ['173.192.170.88', 0.02]]
distribution when k=1000 is: [['193.23.181.44', 0.15], ['174.128.246.102', 0.073], ['174.37.196.55', 0.067], ['67.19.72.206', 0.067], ['72.20.15.61', 0.06], ['184.154.89.154', 0.051], ['173.236.31.226', 0.042], ['46.4.36.120', 0.03], ['147.32.80.9', 0.016], ['217.163.21.39', 0.013]]
distribution when k=10000 is: [['193.23.181.44', 0.139], ['174.128.246.102', 0.0782], ['174.37.196.55', 0.0752], ['67.19.72.206', 0.0701], ['72.20.15.61', 0.0627], ['46.4.36.120', 0.0381], ['
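
One way to read the sampled frequencies against the theory, assuming the reservoir behaves like a uniform random sample of k flows drawn from a much larger stream: the count of a given address in the reservoir is then approximately binomial, so an estimated frequency p has a standard deviation of roughly sqrt(p(1-p)/k). The helper name `sampling_std_error` below is hypothetical and only illustrates this approximation.

```python
import math

def sampling_std_error(p, k):
    """Approximate standard deviation of a frequency estimated from k uniform samples."""
    return math.sqrt(p * (1 - p) / k)

# p = 0.119 is the ground-truth share of 193.23.181.44 reported above.
for k in [10, 100, 1000, 10000]:
    print(k, round(sampling_std_error(0.119, k), 4))
```

Under this approximation the noise shrinks as 1/sqrt(k), which is consistent with the top-10 list being unstable at k=10 and settling down at larger k. A gap that does not shrink with k, such as 0.139 against 0.119 for 193.23.181.44 at k=10000, is unlikely to be sampling noise. One candidate cause is that the ground-truth cell divides by `len(infected_dataset)`, which also counts flows directed at the host, whereas the sampled frequencies are divided by the number of outgoing flows only.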

## 2. Sketching task
Build code for computing a COUNT-MIN sketch and experiment with different heights and widths for the Count-Min sketch matrix. Compare it to the RESERVOIR sampling strategy. Is it more space-efficient or accurate? What about run-time? Use the theory to explain any differences you observe.
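
A minimal sketch of what such a Count-Min structure could look like is given here as a starting point, not as the notebook's own solution. The class name `CountMinSketch`, the use of salted MD5 digests as row hash functions, and the toy stream are all assumptions made for illustration. The width and depth follow the usual sizing rule: width = ceil(e/ε) and depth = ceil(ln(1/δ)) give an overestimate of at most εN with probability at least 1-δ, where N is the total number of items added. With ε = δ = 0.01 this yields width 272 and depth 5.

```python
import hashlib

class CountMinSketch:
    """Hypothetical minimal Count-Min sketch: `depth` rows of `width` counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salt the item with the row number so each row acts as a separate hash function.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Toy stream for illustration only.
cms = CountMinSketch(width=272, depth=5)
stream = ['193.23.181.44'] * 3 + ['147.32.80.9']
for ip in stream:
    cms.add(ip)
print(cms.estimate('193.23.181.44'))  # never below the true count of 3
```

The sketch stores width × depth counters regardless of stream length and never underestimates a count, whereas a reservoir stores k sampled addresses and its frequency estimates can err in either direction.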

## 3. Flow data discretization task
We aim to learn a sequential model from NetFlow data from an infected host (unidirectional netflows). Consider scenario 10 from the CTU-13 data sets (see paper 4 from below resources). Remove all background flows from the data. You are to discretize the NetFlows. Investigate the data from one of the infected hosts. Select and visualize two features that you believe are most relevant for modeling the behavior of the infected host. Discretize these features using any of the methods discussed in class (combine the two values into a single discrete value). Do you observe any behavior in the two features that could be useful for detecting the infection? Explain. Apply the discretization to data from all hosts in the selected scenario.
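
As a rough illustration of combining two features into a single discrete value, the sketch below assumes protocol (categorical) and duration (numeric) as the two features, bins the duration at evenly spaced ranks, and encodes each pair as one integer. The feature choice, the helper names `percentile_edges` and `discretize`, and the toy flows are all assumptions; the methods discussed in class may differ.

```python
from bisect import bisect_right

def percentile_edges(values, n_bins):
    """Cut points at evenly spaced ranks of the sorted values (n_bins - 1 edges)."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * q) // n_bins] for q in range(1, n_bins)]

def discretize(protocol, duration, protocols, edges):
    """Map (protocol, duration) to one code in range(len(protocols) * (len(edges) + 1))."""
    duration_bin = bisect_right(edges, duration)
    return protocols.index(protocol) * (len(edges) + 1) + duration_bin

# Toy flows invented for illustration only (not taken from CTU-13).
flows = [("UDP", 0.0), ("TCP", 0.045), ("TCP", 0.034),
         ("UDP", 0.0), ("ICMP", 1.2), ("TCP", 3.5)]
protocols = sorted({p for p, _ in flows})
edges = percentile_edges([d for _, d in flows], n_bins=3)
codes = [discretize(p, d, protocols, edges) for p, d in flows]
print(codes)
```

Computing the bin edges once, on the host under investigation, and then reusing the same `protocols` and `edges` for every other host keeps the codes comparable across hosts.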