# Feature Selection using the Adjusted Mutual Information (AMI) Score 

### (UDP analysis only)

This notebook uses the Adjusted Mutual Information (AMI) score to determine which packet features are most relevant for distinguishing DDoS attack traffic from normal traffic. The normal traffic comes from the bigflows pcap [http://tcpreplay.appneta.com/wiki/captures.html]. It is compared against the DDoS attack pcaps from DDoSDB [https://ddosdb.org/].
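
As a rough illustration of what the score measures, the sketch below compares two made-up features against a binary attack/normal label: one that separates the two classes perfectly and one that is constant. The arrays are synthetic and exist only for this example; they are not taken from the pcaps.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

# Synthetic labels: 1 = attack packet, 0 = normal packet (made up for this sketch).
labels = np.array([1] * 40 + [0] * 60)

# A feature that takes one value for attack packets and another for normal ones.
separating = np.array([53] * 40 + [443] * 60)

# A feature that is identical for every packet carries no class information.
constant = np.full(labels.shape, 17)

ami_separating = adjusted_mutual_info_score(labels, separating, average_method="arithmetic")
ami_constant = adjusted_mutual_info_score(labels, constant, average_method="arithmetic")

print("separating feature:", ami_separating)
print("constant feature:  ", ami_constant)
```

The separating feature should score at (or extremely close to) 1 and the constant feature at 0, which is the range in which the per-attack scores further down can be read.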

In [1]:
#all relevant libraries need to be imported
from sklearn.metrics.cluster import adjusted_mutual_info_score, mutual_info_score, normalized_mutual_info_score
from sklearn.feature_selection import mutual_info_classif
import numpy as np
import dpkt
import Utils
import ipaddress
import pandas as pd
import time
import os
import json
import operator

### Load bigflows containing only UDP traffic

In [2]:
start = time.time()

input = open('normal_pcaps/bigflows_udp.pcap', 'rb')
normal = dpkt.pcap.Reader(input)
length = 152733 #number of packets in bigflows_udp.pcap

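#use zeros so that fields which are never filled in (e.g. udp_length for non-UDP packets) do not hold uninitialized memory
ips_n = np.zeros(length)
protocol_n = np.zeros(length)
total_length_n = np.zeros(length)
src_port_n = np.zeros(length)
dst_port_n = np.zeros(length)
udp_length_n = np.zeros(length)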
labels_n = np.zeros(length)

udp_packet = 0
tcp_packet = 0
other = 0
total_packet = 0
 
i=0
for ts, buf in normal:
    eth = dpkt.ethernet.Ethernet(buf)
    total_packet += 1 

    if eth.type == 2048:
        ip = eth.data
        ips_n[i] = int.from_bytes(ip.src,"little")
        protocol_n[i] = ip.p
        total_length_n[i] = ip.len
        
        if ip.p == 6:
            tcp_packet += 1
            TCP=ip.data 
            dst_port_n[i] = TCP.dport
            src_port_n[i] = TCP.sport
        elif ip.p == 17:
            udp_packet += 1
            UDP=ip.data
            dst_port_n[i] = UDP.dport
            src_port_n[i] = UDP.sport
            udp_length_n[i] = UDP.ulen
        else:
            other += 1
            dst_port_n[i] = 0
            src_port_n[i] = 0
    else:
        other += 1
        
    i += 1

input.close()
print("total # of packets: %s"  % (total_packet))
print("# of UDP packets: %s" % (udp_packet))
print("# of TCP packets: %s" % (tcp_packet))
print("# of other packets except for TCP or UDP: %s"  % (other))

end = time.time()
print(end - start)


total # of packets: 152733
# of UDP packets: 152733
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
2.6143155097961426
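
To make the extracted fields concrete, here is a minimal sketch that builds one Ethernet/IPv4/UDP frame by hand and pulls out the same values with the standard library only. It is not how dpkt works internally; the helper names `build_udp_frame` and `extract_features` and the addresses are invented for this example. It also shows that `2048` is the IPv4 EtherType `0x0800`, that the UDP length field includes the 8-byte UDP header, and what `int.from_bytes(..., "little")` does to an address.

```python
import ipaddress
import struct

def build_udp_frame(src_ip, dst_ip, sport, dport, payload):
    """Build a minimal Ethernet/IPv4/UDP frame (checksums left at zero)."""
    udp_len = 8 + len(payload)      # the UDP header is 8 bytes
    ip_len = 20 + udp_len           # an IPv4 header without options is 20 bytes
    eth = b"\x00" * 12 + struct.pack("!H", 0x0800)  # zeroed MACs, EtherType IPv4
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, ip_len, 0, 0, 64, 17, 0,           # version/IHL ... protocol 17 = UDP
        ipaddress.IPv4Address(src_ip).packed,
        ipaddress.IPv4Address(dst_ip).packed,
    )
    udp = struct.pack("!HHHH", sport, dport, udp_len, 0)
    return eth + ip + udp + payload

def extract_features(frame):
    """Return the fields the notebook collects, or None for non-IPv4 frames."""
    (eth_type,) = struct.unpack("!H", frame[12:14])
    if eth_type != 0x0800:          # 0x0800 == 2048
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4        # IPv4 header length in bytes
    (total_length,) = struct.unpack("!H", ip[2:4])
    protocol = ip[9]
    src = ip[12:16]
    features = {
        "ip_little": int.from_bytes(src, "little"),  # byte order used in the notebook
        "ip_big": int.from_bytes(src, "big"),        # conventional integer form
        "protocol": protocol,
        "total_length": total_length,
    }
    if protocol == 17:
        sport, dport, ulen = struct.unpack("!HHH", ip[ihl:ihl + 6])
        features.update(src_port=sport, dst_port=dport, udp_length=ulen)
    return features

frame = build_udp_frame("192.168.1.10", "10.0.0.1", 53, 40000, b"x" * 32)
features = extract_features(frame)
print(features)
```

Because AMI treats feature values purely as category labels, the little-endian conversion should not change the scores; it only matters if the integers are ever turned back into addresses.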


### Get attack keys of DDoS attacks containing only UDP traffic

In [3]:
path_to_json = 'signatures/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

udp_keys=[]

for jf in json_files:
    with open('signatures/' + jf) as f:
        data=json.load(f)
    protocol = data.get("protocol")

    if protocol == "UDP":
        udp_keys.append(jf[:-5])

print("# of UDP DDoS attacks: ", len(udp_keys))

# of UDP DDoS attacks:  159
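
The same selection can be written as a small reusable function. The sketch below assumes only what the cell above relies on, namely that each signature is a JSON file with a top-level `"protocol"` field; the function name `keys_for_protocol` and the sample file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def keys_for_protocol(signature_dir, protocol):
    """Return the file stems of all JSON signatures whose 'protocol' matches."""
    keys = []
    for path in sorted(Path(signature_dir).glob("*.json")):
        with path.open() as f:
            data = json.load(f)
        if data.get("protocol") == protocol:
            keys.append(path.stem)   # file name without the .json suffix
    return keys

# Hypothetical signature files, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    samples = {"aaa111": "UDP", "bbb222": "TCP", "ccc333": "UDP"}
    for key, proto in samples.items():
        (Path(tmp) / f"{key}.json").write_text(json.dumps({"protocol": proto}))
    udp_keys_demo = keys_for_protocol(tmp, "UDP")

print(udp_keys_demo)
```

Using `path.stem` avoids the `jf[:-5]` slice, and sorting the paths makes the order of the attack keys reproducible between runs.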


### Get AMI score and other stats per UDP DDoS Attack

In [8]:
start = time.time()

counter = 1
most_important_features = []

#iterate over all attack keys with UDP attacks
for attack_key in udp_keys:
    try:
        input = open('attack_pcaps/' + attack_key + '.pcap', 'rb')
        attack = dpkt.pcap.Reader(input)

        udp_packet = 0
        tcp_packet = 0
        other = 0
        total_packet = 0
        i = 0
        
        ############### 
        #loop through packets to get the number of packets in the attack
        #############
        for ts, buf in attack:
            eth = dpkt.ethernet.Ethernet(buf)
            total_packet += 1 

        input.close()
        
        #######################
        #initialize numpy arrays for all features to measure
        #(zeros, so fields skipped for non-UDP or non-IP packets hold 0 instead of uninitialized memory)
        ##############################
        ips_a = np.zeros(total_packet)
        protocol_a = np.zeros(total_packet)
        total_length_a = np.zeros(total_packet)
        src_port_a = np.zeros(total_packet)
        dst_port_a = np.zeros(total_packet)
        udp_length_a = np.zeros(total_packet)
        labels_a = np.ones(total_packet)
        
        ###############
        #loop through packets and gather features
        #########
        input = open('attack_pcaps/' + attack_key + '.pcap', 'rb')
        attack = dpkt.pcap.Reader(input)
        i = 0
        for ts, buf in attack:
            eth = dpkt.ethernet.Ethernet(buf)
            if eth.type == 2048:
                ip = eth.data
                ips_a[i] = int.from_bytes(ip.src,"little")
                protocol_a[i] = ip.p
                total_length_a[i] = ip.len

                if ip.p == 6:
                    tcp_packet += 1
                    TCP=ip.data 
                    dst_port_a[i] = TCP.dport
                    src_port_a[i] = TCP.sport
                elif ip.p == 17:
                    udp_packet += 1
                    UDP=ip.data
                    udp_length_a[i] = UDP.ulen
                    dst_port_a[i] = UDP.dport
                    src_port_a[i] = UDP.sport
                else:
                    other += 1
                    dst_port_a[i] = 0
                    src_port_a[i] = 0
            else:
                other += 1

            i += 1

        input.close()

        ###############
        #make Numpy arrays with attack features and normal features together
        ################
        ips = np.append(ips_a, ips_n)
        protocol = np.append(protocol_a, protocol_n)
        total_length = np.append(total_length_a, total_length_n)
        src_port = np.append(src_port_a, src_port_n)
        dst_port = np.append(dst_port_a, dst_port_n)
        udp_length = np.append(udp_length_a, udp_length_n)
        labels = np.append(labels_a, labels_n)

        features = [ips,protocol,total_length,src_port,dst_port,udp_length]

        
        ##########
        #Calculate adjusted mutual information score
        ############
        scores = []
        for x in features:
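            #every sample is duplicated four times; average_method is keyword-only in recent scikit-learn versions
            score = adjusted_mutual_info_score(np.repeat(x,4), np.repeat(labels,4), average_method="arithmetic")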
            scores.append(score)
            
        ###########
        #Print all stats per attack
        #######
        print("attackkey: ", attack_key)
        print("total # of packets: %s"  % (total_packet))
        print("# of UDP packets: %s" % (udp_packet))
        print("# of TCP packets: %s" % (tcp_packet))
        print("# of other packets except for TCP or UDP: %s"  % (other))
        print("Features: ips,protocol,total_length,src_port,dst_port,udp_length")
        print("AMI scores: ", scores)
        
        max_index, max_value = max(enumerate(scores), key=operator.itemgetter(1))
        most_important_features.append(max_index)
        print("Most important feature: ", max_index)
        end = time.time()
        print(end - start)
        print(counter,"/",len(udp_keys))
        counter += 1
        print("#####################################")
    except Exception as e:
        print("something went wrong with attack ", attack_key, ":", e)
print(end - start)

attackkey:  a3e78a6e687223edf712320712686ab2
total # of packets: 247
# of UDP packets: 247
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.006548274996752432, 8.869476593269325e-13, 0.006570755099051521, 0.0083814682667021, 0.007996524118674544, 0.006571443148166646]
Most important feature:  3
1.3508050441741943
1 / 159
#####################################
attackkey:  af254ac963139fc85e9a42ceaabc91d6
total # of packets: 475
# of UDP packets: 475
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.011468554019273255, 5.058903570033307e-13, 0.013082023237447902, 0.014653962661291023, 0.01387406720611379, 0.013083262731725696]
Most important feature:  3
2.7843987941741943
2 / 159
#####################################
attackkey:  f686ce6cfb9e038f7071e080228698d2
total # of packets: 219
# of UDP packets

attackkey:  967a3709db351926be99ddc0f7d4697a
total # of packets: 681
# of UDP packets: 681
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.015501244594714494, -2.4367108534375903e-16, 0.017702573596698228, 0.019820242198163147, 0.01874812023426459, 0.017704201729856418]
Most important feature:  3
93.43812656402588
19 / 159
#####################################
attackkey:  7f19b44817e1d16ea99933c6bce54131
total # of packets: 66618
# of UDP packets: 66618
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.16167265044565451, 0.0, 0.4126516174331586, 0.1801784361694599, 0.37856030967213306, 0.4118634886988974]
Most important feature:  2
99.2564013004303
20 / 159
#####################################
attackkey:  079c27d54c337110c12ae61ea3efe580
total # of packets: 176
# of UDP packets: 176
# of TCP pack

attackkey:  e6c3668c7cd00f4b287b6fc92a0afcb9
total # of packets: 319
# of UDP packets: 319
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.008168936358077167, 7.117474725007162e-13, 0.010208246371887325, 0.010444008460021506, 0.009869985129128976, 0.010209258671937756]
Most important feature:  3
217.00736331939697
37 / 159
#####################################
attackkey:  acc4394329ad248f5e77c165804a0bc6
total # of packets: 98
# of UDP packets: 98
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.0029154746380653644, 6.629798910045078e-13, 0.0031999559775679038, 0.0037353306096356243, 0.003503529972764894, 0.003200327752157312]
Most important feature:  3
218.3472306728363
38 / 159
#####################################
attackkey:  ae87f4dbbd8c4ccc941995ff673e29b0
total # of packets: 494
# of UDP pa

attackkey:  974dc23908b40cd3c28be24e053de886
total # of packets: 280
# of UDP packets: 280
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.007308298653755319, -7.961231232547978e-13, 0.007244845896968431, 0.005570285673717165, 0.008935304032764204, 0.007245589840367675]
Most important feature:  4
249.90180587768555
55 / 159
#####################################
attackkey:  45cb858201114ef20e94ce24ba917d8d
total # of packets: 107
# of UDP packets: 62
# of TCP packets: 0
# of other packets except for TCP or UDP: 45
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.003151168036510986, 0.5219976296250405, 0.0027110999702208765, 0.004035913391098665, 0.00385690482164785, 0.0035072365207143067]
Most important feature:  1
251.25247240066528
56 / 159
#####################################
attackkey:  58c00bfa1a7cdeed42dec16b50b7a299
total # of packets: 1040
# of UDP pa

attackkey:  5bdf5fc1273bda23ccaff90e5cffb40f
total # of packets: 204
# of UDP packets: 204
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.005550629742280533, 1.0473341165442937e-12, 0.006265296228650655, 0.007102287754663141, 0.006735456266263002, 0.0062659623143266245]
Most important feature:  3
322.70003962516785
73 / 159
#####################################
attackkey:  68387e4ccd91c2f0a78bd8050165b33a
total # of packets: 106
# of UDP packets: 106
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.0031262318950289864, 3.023270100609267e-16, 0.0035460912858169148, 0.004003494117569772, 0.003801094634059248, 0.003546497874512349]
Most important feature:  3
324.0506055355072
74 / 159
#####################################
attackkey:  5e08c144bd92404c22e7c8bf5aa1f75a
total # of packets: 9
# of UDP p

attackkey:  187e8c4e14e803d71c75b2cd102b6fc2
total # of packets: 1404
# of UDP packets: 1404
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.027859874232852775, 6.789731082341956e-14, 0.03837643177038228, 0.03577888047394688, 0.03338410614586096, 0.03837988321702712]
Most important feature:  5
393.03588485717773
91 / 159
#####################################
attackkey:  72666503ae7af37518dbba7c85c80418
total # of packets: 1143
# of UDP packets: 1143
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.0237462283877188, 2.4144709554982115e-13, 0.03242600574495582, 0.030322925300331583, 0.028631631266849582, 0.032428936827655054]
Most important feature:  5
394.5491931438446
92 / 159
#####################################
attackkey:  e9500bc4af5548ba856451a23466073e
total # of packets: 7394
# of UDP pack

attackkey:  3e04da1987736f143f2a69ca98ac95cc
total # of packets: 179
# of UDP packets: 179
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.004953641184347925, 3.9114373894123436e-13, 0.0052655928409190735, 0.006338950768391168, 0.006002927458855988, 0.00526616720000382]
Most important feature:  3
464.12977480888367
109 / 159
#####################################
attackkey:  2ea180355c6612d69993431075783e86
total # of packets: 153
# of UDP packets: 153
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.004319244507161743, 0.0, 0.005062109286943288, 0.005527241278397546, 0.005226905955493438, 0.005062664821909646]
Most important feature:  3
465.4859080314636
110 / 159
#####################################
attackkey:  54d639d921ceef8060eb804a5672749b
total # of packets: 1291
# of UDP packets: 1291
# o

attackkey:  6297de9c19ce35bb4738bd3687512a9a
total # of packets: 102
# of UDP packets: 102
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.0030204653368750252, 1.9203873630859377e-12, 0.0034740696188898966, 0.0038697804777831813, 0.003672212946621696, 0.0034744683550611386]
Most important feature:  3
527.9913449287415
127 / 159
#####################################
attackkey:  e4ed2d7237e72181e0185f1d46e022de
total # of packets: 291
# of UDP packets: 291
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.007550170844360365, 0.0, 0.009106422618021186, 0.00965252352625253, 0.009108950080537538, 0.009107338593140137]
Most important feature:  3
529.4660859107971
128 / 159
#####################################
attackkey:  e85739e64cf0f7882f36f676c5114fac
total # of packets: 134
# of UDP packets: 79
# of

attackkey:  c8b2ee81d5d417e9bb9e23d520c477ee
total # of packets: 634
# of UDP packets: 634
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.014634253755934294, 1.335241883680795e-13, 0.01921880614528435, 0.01863975417615, 0.017805231354103963, 0.019220580705656704]
Most important feature:  5
683.6802277565002
145 / 159
#####################################
attackkey:  f621936359aca6116a3bd69b370c2a67
total # of packets: 872
# of UDP packets: 872
# of TCP packets: 0
# of other packets except for TCP or UDP: 0
Features: ips,protocol,total_length,src_port,dst_port,udp_length
AMI scores:  [0.018990067069445265, 3.026651105553221e-13, 0.02369229803392268, 0.024315856859065024, 0.02262508126618706, 0.023694456910560272]
Most important feature:  3
685.1737365722656
146 / 159
#####################################
attackkey:  e9beae87c340afe2668dce0a7001ac24
total # of packets: 293
# of UDP packets: 
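
The scoring step can also be kept in terms of feature names instead of list indices. The following is a sketch on synthetic data, not a rerun of the cell above: the function `rank_features_by_ami`, the packet counts and the port and length values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def rank_features_by_ami(features, labels):
    """Return (name, AMI) pairs sorted from highest to lowest score."""
    scores = {
        name: adjusted_mutual_info_score(labels, values, average_method="arithmetic")
        for name, values in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Synthetic data: 50 attack packets followed by 200 normal packets.
n_attack, n_normal = 50, 200
labels = np.array([1] * n_attack + [0] * n_normal)
idx = np.arange(n_attack + n_normal)

features = {
    # every attack packet comes from port 53, normal traffic uses ten other ports
    "src_port": np.where(labels == 1, 53, 1024 + idx % 10),
    # constant across both classes
    "protocol": np.full(idx.shape, 17),
    # the same five lengths occur equally often in both classes
    "total_length": 60 + idx % 5,
}

ranking = rank_features_by_ami(features, labels)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Returning names rather than positions means the per-attack winner no longer has to be looked up in the printed `Features:` line.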

### Analysis and Conclusions

In [18]:
for i in range(0,6):
    print(i,most_important_features.count(i))

0 0
1 31
2 3
3 93
4 4
5 26


The number of attacks for which each feature had the highest AMI score: <br/>

0 Source IP address: 0 <br/>
1 Protocol: 31 <br/>
2 Total IP packet length: 3 <br/>
3 Source port: 93 <br/>
4 Destination port: 4 <br/>
5 UDP length (header plus payload): 26 <br/>
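
The tally above can be produced with feature names attached directly. This sketch rebuilds a winner list with the same counts as the printed output; in the notebook the list would simply be `most_important_features`. The name `FEATURE_NAMES` is introduced here and follows the order of the `features` list used for scoring.

```python
from collections import Counter

FEATURE_NAMES = ["ips", "protocol", "total_length", "src_port", "dst_port", "udp_length"]

# Winner indices with the same counts as the printed output above.
winners = [1] * 31 + [2] * 3 + [3] * 93 + [4] * 4 + [5] * 26

tally = Counter(FEATURE_NAMES[i] for i in winners)
# Include features that never won so they appear with a count of zero.
table = {name: tally.get(name, 0) for name in FEATURE_NAMES}

for name, count in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13}{count:>4}")
print("total attacks counted:", sum(table.values()))
```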

No attack has the source IP address as its most important feature. A likely explanation is that there are too many distinct IP addresses, and that the address values of normal and attack traffic are interleaved rather than falling into separate ranges.
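
One way to see why a feature with very many distinct values scores low is to push the situation to its extreme. In the sketch below, which uses synthetic data only, every packet gets its own unique value. The plain mutual information is then as high as it can be, yet the chance adjustment in AMI brings the score back to roughly zero. This is offered as a plausible reading of the IP result, not as a proof of it.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

# Synthetic data: 100 attack packets and 100 normal packets.
labels = np.array([1] * 100 + [0] * 100)

# Every packet has its own "address", so each value occurs exactly once.
unique_ids = np.arange(labels.size)

raw_mi = mutual_info_score(labels, unique_ids)
ami = adjusted_mutual_info_score(labels, unique_ids, average_method="arithmetic")

print("raw MI:", raw_mi)  # equals the label entropy, ln(2) for balanced classes
print("AMI:   ", ami)     # close to zero after the chance adjustment
```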

The count of 31 for the protocol feature is remarkable: its AMI should be zero, because every packet is expected to be UDP. Manual inspection showed that some of the attack captures also contain ICMP packets.
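
A check like the following could flag such mixed captures before they are scored. It is a sketch with hypothetical helper names (`protocol_mix`, `is_pure_udp`) and made-up packet counts; the protocol numbers 1, 6 and 17 are the IANA values for ICMP, TCP and UDP.

```python
from collections import Counter

# A few IANA-assigned IP protocol numbers.
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}

def protocol_mix(protocol_numbers):
    """Count packets per protocol name; unknown numbers are kept as 'proto-N'."""
    return Counter(PROTOCOL_NAMES.get(p, f"proto-{p}") for p in protocol_numbers)

def is_pure_udp(protocol_numbers):
    """True when every packet in the capture is UDP."""
    return all(p == 17 for p in protocol_numbers)

# Hypothetical capture: UDP packets mixed with ICMP packets.
capture = [17] * 62 + [1] * 45
mix = protocol_mix(capture)
print(dict(mix), "pure UDP:", is_pure_udp(capture))
```

Captures that fail the check could either be skipped or filtered down to their UDP packets, so that the protocol field does not win merely because the normal traffic contains no ICMP.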

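For most attacks (93) the source port is the distinguishing feature. Leaving aside the protocol count, which stems from the mixed-in ICMP packets, the UDP length (feature 5) is second best with 26 attacks.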
