# Snort log formatting

We are given 17 files produced by Snort running on a network. Although the files carry a .pcap extension, they are not packet captures: they are plain-text Snort alert logs. We therefore parse them into a form from which we can extract the information needed to label the conn.log packets as either normal or malicious traffic.
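One quick way to confirm that a file is not a real capture is to look at its first four bytes. The sketch below is not part of the original analysis; the helper name `looks_like_pcap` is hypothetical, and it only checks the well-known magic numbers of the classic pcap and pcapng formats.

```python
import struct

# Magic numbers of the classic pcap format (microsecond and nanosecond
# variants, in both byte orders) as seen when read little-endian.
PCAP_MAGICS = {0xA1B2C3D4, 0xD4C3B2A1, 0xA1B23C4D, 0x4D3CB2A1}
# pcapng files start with a Section Header Block of type 0x0A0D0D0A.
PCAPNG_MAGIC = 0x0A0D0D0A


def looks_like_pcap(path):
    """Return True if the file starts with a pcap or pcapng magic number."""
    with open(path, 'rb') as f:
        head = f.read(4)
    if len(head) < 4:
        return False  # empty or tiny file: cannot be a capture
    (magic,) = struct.unpack('<I', head)
    return magic in PCAP_MAGICS or magic == PCAPNG_MAGIC
```

A text alert log begins with printable characters, so this check should return False for it.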

In [1]:
lines = []
for i in range(17):
    # File numbers are zero-padded to two digits: 00, 01, ..., 16
    path = '../../alert.full.maccdc2012_000' + str(i).zfill(2) + '.pcap'
    with open(path, 'r') as f:  # close each file once it has been read
        lines.append(f.readlines())
    print(len(lines[i]))

1144879
380932
91765
46519
489802
0
534753
589684
386575
608430
1233926
341013
1231744
199433
1029444
2416650
906679


In [2]:
g = open('../../alert.full.maccdc2012_00000.pcap','r')
len(g.readlines())

1144879

Here we look at the format of the files so that we can extract the important information. Note that the timestamps are mostly, but not completely, in order, which makes the comparison with conn.log harder.

In [3]:
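for j in range(17):
    print(j)
    if j == 5:  # file 5 is empty, so there is nothing to print
        continue
    # print the last 19 lines of the file, from the end backwards
    for i in range(1, 20):
        print(lines[j][-i])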

0


TCP Options (3) => NOP NOP TS: 646535267 1343654 

***A**** Seq: 0x35E70F53  Ack: 0x3C19364E  Win: 0xB5  TcpLen: 32

TCP TTL:63 TOS:0x0 ID:22287 IpLen:20 DgmLen:1500 DF

03/16-08:30:07.770000 192.168.27.253:443 -> 192.168.202.110:49653

[Classification: Potential Corporate Privacy Violation] [Priority: 1] 

[**] [1:2013659:3] ET POLICY Self Signed SSL Certificate (SomeOrganizationalUnit) [**]



[Xref => http://www.networkforensics.com/2010/05/16/network-detection-of-x86-buffer-overflow-shellcode/]

***A**** Seq: 0x2CD9A251  Ack: 0x8E7AD34A  Win: 0x3908  TcpLen: 20

TCP TTL:64 TOS:0x0 ID:6111 IpLen:20 DgmLen:1420 DF

03/16-08:30:07.760000 192.168.202.68:8080 -> 192.168.24.100:1038

[Classification: Executable Code was Detected] [Priority: 1] 

[**] [1:2012088:1] ET SHELLCODE Possible Call with No Offset TCP Shellcode [**]



TCP Options (3) => NOP NOP TS: 646535185 1343633 

***A**** Seq: 0x35DA9994  Ack: 0x7277A9A3  Win: 0xB5  TcpLen: 32

TCP TTL:63 TOS:0x0 ID:49566 IpLen:20 DgmLe

We want attributes that can be matched against the main log: the IP addresses, the port numbers and the time. In the output above, every line containing this data starts with '03/', and no other line appears to start that way, so I use that prefix to pick these lines out. The same reasoning applies to the attack type, whose lines start with '[Class'. We now filter the lines and then format each one into a table row.

In [4]:
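filtered = []  # timestamp/address lines, which start with '03/'
desc = []      # classification lines, which start with '[Class'
for j in range(17):
    if j != 5:  # file 5 is empty
        # iterate over every line, including the last one in each file
        for line in lines[j]:
            if line.startswith('03/'):
                filtered.append(line)
            if line.startswith('[Class'):
                desc.append(line)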

In [5]:
print(len(filtered))
print(len(desc))
s = 0
for j in range(17):
    s = s + len(lines[j])
print(s)
print(s/len(filtered))

1682304
1682304
11632228
6.9144625466027545


There are approximately 7 lines for each line that starts with '03/', which agrees with the alert blocks seen in the data. The two counts are also equal, so I am reasonably confident that the filter has picked up all the lines it needed to.
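Equal counts do not by themselves guarantee that the i-th timestamp line and the i-th classification line come from the same alert. A minimal sketch of a stricter check is given below, assuming that alerts are separated by blank lines, as they appear to be in the output above. The function name `pair_headers` is hypothetical and not part of the original notebook.

```python
def pair_headers(lines):
    """Yield (timestamp_line, classification_line) for each alert block.

    Blocks are assumed to be separated by blank lines. The classification
    is None when a block has no line starting with '[Class'.
    """
    header = None
    classification = None
    for line in lines:
        if line.strip() == '':
            # A blank line closes the current block, if it had a timestamp.
            if header is not None:
                yield header, classification
            header, classification = None, None
        elif line.startswith('03/'):
            header = line
        elif line.startswith('[Class'):
            classification = line
    # Emit the final block if the file does not end with a blank line.
    if header is not None:
        yield header, classification
```

Counting the pairs whose classification is None would show whether any alert lacks a '[Class' line, which is the case that would push the two lists out of step.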

Here we extract the information that we need, i.e. the components of the timestamp, the source and destination IPs and ports, and the attack type, along with whether the IPs are IPv4 or IPv6 and whether ports are included in the log.

In [6]:
day=[]
hour=[]
minute=[]
seconds=[]
milli=[]
srcIP=[]
srcPort=[]
dstIP=[]
dstPort=[]
IPv4=[]
Port=[]
num1=[]
num2=[]
num3=[]
atcktype=[]
for i in range(0,len(filtered)):
    y = desc[i].find(']')
    atcktype.append(desc[i][17:y])
    day.append(filtered[i][3:5])
    hour.append(filtered[i][6:8])
    minute.append(filtered[i][9:11])
    seconds.append(filtered[i][12:14])
    milli.append(filtered[i][15:21])
    #only IPv4 addresses contain '.', so we use this to identify whether the IP is v4 or v6
    #(checking the whole source field avoids assuming a three-digit first octet)
    #the xi values mark important positions in the string and are used to slice out the relevant sections
    if '.' in filtered[i][22:filtered[i].find('->')]:
        IPv4.append(True)
        #Checks if a port is given by the log
        if (filtered[i][22:].find(':') < filtered[i][22:].find('->')) and (filtered[i][22:].find(':') != -1):
            Port.append(True)
            x1 = filtered[i][22:].find(':') + 22
            srcIP.append(filtered[i][22:x1])
            x2 = filtered[i][22:].find('->') + 22
            srcPort.append(filtered[i][x1+1:x2-1])
            x3 = x2 + 3
            x4 = filtered[i][x3:].find(':') + x3
            dstIP.append(filtered[i][x3:x4])
            dstPort.append(filtered[i][x4+1:-1])
        else:
            num1.append(i)
            Port.append(False)
            x1 = filtered[i][22:].find('->') + 22
            srcIP.append(filtered[i][22:x1-1])
            dstIP.append(filtered[i][x1+3:-1])  
            srcPort.append(0)
            dstPort.append(0)
    else:
        IPv4.append(False)
        arrow = filtered[i][22:].find('->') + 22
        #Checks if a port is given by the log
        if filtered[i][22:arrow].count(':') == 8:
            num2.append(i)
            Port.append(True)
            x1 = filtered[i][22:arrow].rfind(':') + 22
            srcIP.append(filtered[i][22:x1])
            x2 = arrow
            srcPort.append(filtered[i][x1+1:x2-1])
            x3 = x2 + 3
            x4 = filtered[i][x3:].rfind(':') + x3
            dstIP.append(filtered[i][x3:x4])
            dstPort.append(filtered[i][x4+1:-1])
        else:
            num3.append(i)
            Port.append(False)
            x1 = arrow
            srcIP.append(filtered[i][22:x1-1])
            dstIP.append(filtered[i][x1+3:-1])
            srcPort.append(0)
            dstPort.append(0)
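The same split can also be written with a regular expression and the standard `ipaddress` module, which avoids the fixed character offsets. This is a sketch rather than the method used above: the names `HEADER`, `split_endpoint` and `parse_header` are hypothetical, and ports are returned as integers (0 when absent). One known limitation: a compressed IPv6 address followed by a port (for example `fe80::1:80`) is itself a valid address, so its port would not be detected.

```python
import ipaddress
import re

# MM/DD-HH:MM:SS.ffffff <source> -> <destination>
HEADER = re.compile(
    r'^(\d\d)/(\d\d)-(\d\d):(\d\d):(\d\d)\.(\d{6}) (\S+) -> (\S+)\s*$'
)


def split_endpoint(endpoint):
    """Return (ip, port, is_ipv4); port is 0 when the log gives none."""
    try:
        # The whole field is a valid address, so there is no port.
        ip = ipaddress.ip_address(endpoint)
        return endpoint, 0, ip.version == 4
    except ValueError:
        # Otherwise treat the text after the last ':' as the port.
        host, _, port = endpoint.rpartition(':')
        ip = ipaddress.ip_address(host)
        return host, int(port), ip.version == 4


def parse_header(line):
    """Parse one timestamp/address line into a dict, or None if no match."""
    m = HEADER.match(line)
    if m is None:
        return None
    month, day, hour, minute, second, micro, src, dst = m.groups()
    src_ip, src_port, is_v4 = split_endpoint(src)
    dst_ip, dst_port, _ = split_endpoint(dst)
    return {
        'day': day, 'hour': hour, 'minute': minute,
        'seconds': second, 'micro': micro,
        'srcIP': src_ip, 'srcPort': src_port,
        'dstIP': dst_ip, 'dstPort': dst_port,
        'IPv4': is_v4, 'Port': src_port != 0 or dst_port != 0,
    }
```

Calling `parse_header` on the four kinds of line seen in this log (IPv4 or IPv6, with or without ports) gives a direct cross-check of the lists built above.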

We now need to check that the code above is accurate in what information it has extracted. 

In [7]:
print(atcktype[1:10])

['Web Application Attack', 'Web Application Attack', 'Web Application Attack', 'Unsuccessful User Privilege Gain', 'Unsuccessful User Privilege Gain', 'Web Application Attack', 'Misc activity', 'Web Application Attack', 'Web Application Attack']


In [8]:
for i in range(5):
    print(filtered[num2[i]])
    print(srcIP[num2[i]])
    print(srcPort[num2[i]])
    print(dstIP[num2[i]])
    print(dstPort[num2[i]])
    print('/')
    print(filtered[num1[i]])
    print(srcIP[num1[i]])
    print(dstIP[num1[i]])
    print('/')
    print(filtered[num3[i]])
    print(srcIP[num3[i]])
    print(dstIP[num3[i]])
    print('/')

03/16-11:22:13.200000 2001:dbb:c18:202:20c:29ff:fe93:571e:60681 -> 2001:dbb:c18:1:216:47ff:fe9d:f2d7:3306

2001:dbb:c18:202:20c:29ff:fe93:571e
60681
2001:dbb:c18:1:216:47ff:fe9d:f2d7
3306
/
03/16-07:30:00.060000 192.168.27.25 -> 192.168.202.100

192.168.27.25
192.168.202.100
/
03/16-07:31:58.510000 :: -> ff02::1:ff8e:385a

::
ff02::1:ff8e:385a
/
03/16-11:22:13.280000 2001:dbb:c18:202:20c:29ff:fe93:571e:60681 -> 2001:dbb:c18:1:216:47ff:fe9d:f2d7:1433

2001:dbb:c18:202:20c:29ff:fe93:571e
60681
2001:dbb:c18:1:216:47ff:fe9d:f2d7
1433
/
03/16-07:30:00.300000 192.168.27.103 -> 192.168.202.100

192.168.27.103
192.168.202.100
/
03/16-07:34:33.350000 :: -> ff02::1:ff8e:385a

::
ff02::1:ff8e:385a
/
03/16-11:22:13.280000 2001:dbb:c18:202:20c:29ff:fe93:571e:60681 -> 2001:dbb:c18:1:216:47ff:fe9d:f2d7:161

2001:dbb:c18:202:20c:29ff:fe93:571e
60681
2001:dbb:c18:1:216:47ff:fe9d:f2d7
161
/
03/16-07:30:00.500000 192.168.202.1 -> 192.168.202.100

192.168.202.1
192.168.202.100
/
03/16-07:38:16.020000 :: -

The above makes me confident that the code separates the relevant information correctly, regardless of IP version and of whether ports are included. We will now format the time. We use two anchors: the first packet in the snort log, which occurs at the first time instance in conn.log (time = 1331901000.000000), and the packet at index 1037495, which occurs on the second day (time = 1331987253.530000). Two anchors are needed because the conversion below works only from the hour, minute and second fields, so an anchor from the 16th cannot be used directly for packets on the 17th. (It has since been brought to my attention that there are functions for this.)

In [9]:
def seconds_of_day(i):
    # seconds since midnight; note that milli actually holds the six-digit fractional part
    return (int(milli[i]) / 1000000) + int(seconds[i]) + int(minute[i]) * 60 + int(hour[i]) * 3600

time=[]
scaler16 = seconds_of_day(0)
scaler17 = seconds_of_day(1037495) #1037495 1175157
for i in range(len(filtered)):
    if int(day[i]) == 16:
        temp = seconds_of_day(i) - scaler16 + 1331901000
    else:
        temp = seconds_of_day(i) - scaler17 + 1331987253.53 #1331987253.53 1331997339.739999
    # fixed six decimals, so floating-point noise cannot lengthen the string
    time.append('{:.6f}'.format(temp))
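For comparison, the standard library's `datetime` module can do the conversion without per-day anchors. The sketch below rests on two assumptions that are not stated in the logs themselves: the year is 2012 (taken from the file names), and the snort timestamps are five hours behind UTC, which is the offset implied by the first anchor (07:30:00 on the 16th corresponding to 1331901000). The function name `snort_to_epoch` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log's clock is UTC-5, inferred from the first anchor.
LOG_TZ = timezone(timedelta(hours=-5))


def snort_to_epoch(stamp, year=2012):
    """Convert 'MM/DD-HH:MM:SS.ffffff' to an epoch string with 6 decimals."""
    dt = datetime.strptime(f'{year}/{stamp}', '%Y/%m/%d-%H:%M:%S.%f')
    dt = dt.replace(tzinfo=LOG_TZ)
    return f'{dt.timestamp():.6f}'
```

Under these assumptions, `snort_to_epoch('03/16-07:30:00.000000')` gives 1331901000.000000 and `snort_to_epoch('03/17-15:59:39.110000')` gives 1332017979.110000, the same values as the anchor-based conversion.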

Here we check that the time code above does work as intended by checking against the matching packets in conn.log. 

In [10]:
for i in range (20,50):
    print(time[-i])
    print(filtered[-i])

1332017979.110000
03/17-15:59:39.110000 192.168.202.83:56138 -> 192.168.206.44:161

1332017979.110000
03/17-15:59:39.110000 192.168.202.83:55579 -> 192.168.206.44:1433

1332017979.090000
03/17-15:59:39.090000 192.168.202.83:52619 -> 192.168.206.44:5432

1332017979.090000
03/17-15:59:39.090000 192.168.202.83:38159 -> 192.168.206.44:705

1332017979.080000
03/17-15:59:39.080000 192.168.202.83:54484 -> 192.168.206.44:1521

1332017979.080000
03/17-15:59:39.080000 192.168.202.83:48842 -> 192.168.206.44:3306

1332017979.080000
03/17-15:59:39.080000 192.168.202.83:52618 -> 192.168.206.44:22

1332017959.020000
03/17-15:59:19.020000 192.168.202.83:52307 -> 192.168.206.44:5432

1332017959.020000
03/17-15:59:19.020000 192.168.202.83:54705 -> 192.168.206.44:1433

1332017959.020000
03/17-15:59:19.020000 192.168.202.83:55143 -> 192.168.206.44:161

1332017959.020000
03/17-15:59:19.020000 192.168.202.83:54255 -> 192.168.206.44:1521

1332017959.010000
03/17-15:59:19.010000 192.168.202.83:55663 -> 192.16

In [11]:
for i in range(1,100):
    print('new')
    print(filtered[-i])
    print(time[-i])
    print(len(filtered) - i)

new
03/17-15:59:41.320000 192.168.202.1 -> 192.168.202.80

1332017981.320000
1682303
new
03/17-15:59:39.590000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.590000
1682302
new
03/17-15:59:39.590000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.590000
1682301
new
03/17-15:59:39.590000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.590000
1682300
new
03/17-15:59:39.570000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.570000
1682299
new
03/17-15:59:39.570000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.570000
1682298
new
03/17-15:59:39.570000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.570000
1682297
new
03/17-15:59:39.550000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.550000
1682296
new
03/17-15:59:39.550000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.550000
1682295
new
03/17-15:59:39.550000 192.168.1.254:1900 -> 239.255.255.250:1900

1332017979.550000
1682294
new
03/17-15:59:39.540000 192.168.1.254:1900 -> 239

While looking at how well my code matches the timings in conn.log, I have found a couple of things:

* There are snort log entries that do not correspond to any entry in conn.log. I believe this is because conn.log is a subset of the data from which the snort log was formed, so not every snort entry has an equivalent in conn.log.

* Some packets may be assigned slightly different times because the recorded time fell between two discrete time steps, e.g. filtered[1682258] is at .85 according to the snort log but is recorded as .84 in conn.log. As I believe this will not be the case for most packets, I will not account for it when matching with conn.log. If I have time, I may see how also checking times +.01 and -.01 changes the number of packets flagged as malicious in conn.log.

* While I haven't seen it, given the above point it is quite possible that some of the .xx9999 times in one log will be .xx in the other. I will ignore this for the time being; if I have time, I will have the program check for these and see how this affects the number of packets flagged as malicious in conn.log.
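The last two points could both be handled by comparing times as whole hundredths of a second and allowing a difference of one step either way. The following is only a sketch of that idea: the function names are hypothetical, and the conn.log rows are assumed to be dictionaries with the same keys as the snort table (`ts`, `srcIP`, `srcPort`, `dstIP`, `dstPort`), which may not match the real conn.log layout.

```python
def to_centis(ts):
    """Convert a timestamp string to a whole number of hundredths of a second.

    Rounding means that '.839999' and '.840000' give the same value.
    """
    return round(float(ts) * 100)


def build_index(conn_rows):
    """Index conn.log rows by (time in hundredths, srcIP, srcPort, dstIP, dstPort)."""
    index = {}
    for row in conn_rows:
        key = (to_centis(row['ts']), row['srcIP'], row['srcPort'],
               row['dstIP'], row['dstPort'])
        index.setdefault(key, row)  # keep the first row for each key
    return index


def match_with_tolerance(alert, index, tolerance=1):
    """Return the matching conn.log row, trying the exact time first and then
    times up to `tolerance` hundredths either side; None if nothing matches."""
    base = to_centis(alert['ts'])
    flow = (alert['srcIP'], alert['srcPort'], alert['dstIP'], alert['dstPort'])
    for delta in sorted(range(-tolerance, tolerance + 1), key=abs):
        hit = index.get((base + delta,) + flow)
        if hit is not None:
            return hit
    return None
```

Running the match once with `tolerance=0` and once with `tolerance=1` would show how many extra packets are flagged once the .01 differences are allowed.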

Overall, I think that the algorithm above converts the snort time to the conn.log time correctly. I will now collect all the information in a table and export it to a file for use in labelling conn.log.

In [12]:
import pandas as pd

In [13]:
d = {
    'ts' : time,
    'srcIP': srcIP,
    'srcPort': srcPort,
    'dstIP': dstIP,
    'dstPort': dstPort,
    'IPv4': IPv4,
    'Port': Port,
    'type': atcktype
}
df = pd.DataFrame(data=d)
df

Unnamed: 0,ts,srcIP,srcPort,dstIP,dstPort,IPv4,Port,type
0,1331901000.000000,192.168.202.79,50465,192.168.229.251,80,True,True,Web Application Attack
1,1331901000.010000,192.168.202.79,50467,192.168.229.251,80,True,True,Web Application Attack
2,1331901000.030000,192.168.202.79,50469,192.168.229.251,80,True,True,Web Application Attack
3,1331901000.040000,192.168.202.79,50471,192.168.229.251,80,True,True,Web Application Attack
4,1331901000.050000,192.168.229.153,445,192.168.202.79,55173,True,True,Unsuccessful User Privilege Gain
...,...,...,...,...,...,...,...,...
1682299,1332017979.570000,192.168.1.254,1900,239.255.255.250,1900,True,True,Misc Attack
1682300,1332017979.590000,192.168.1.254,1900,239.255.255.250,1900,True,True,Information Leak
1682301,1332017979.590000,192.168.1.254,1900,239.255.255.250,1900,True,True,Misc Attack
1682302,1332017979.590000,192.168.1.254,1900,239.255.255.250,1900,True,True,Misc Attack


In [14]:
df.to_csv('../data/raw/snort.csv')