In [1]:
from helper_functions import *

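# Zeek logs to load; each file name is also the key used in the dictionaries below
chosen_files = ['conn.log','dns.log','ftp.log','http.log','ssh.log','ssl.log']
# Parse each log into a pandas DataFrame, keyed by file name
orig_log_dataframes = build_dataframes(chosen_files)
# Build anonymised copies so originals and results can be compared side by side
log_dataframes = anonymise_dataframes(orig_log_dataframes, chosen_files)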

## Bro/Zeek log parser and anonymiser

#### This file parses Bro/Zeek logs and identifies and anonymises Personally Identifiable Information (PII). The Bro logs of the UNSW-NB15 dataset [1] are used as example data.

Bro/Zeek produces a large number of logs. For the purpose of this research, however, six logs have been selected whose information will be used for intrusion detection. The six selected logs are:

> *conn.log, dns.log, ftp.log, http.log, ssh.log and ssl.log*

The fields of all logs have been evaluated to determine which are the most useful. These six logs were found to give valuable information for detection while holding minimal PII. Some PII remains in these logs, and the fields that have been identified as PII are highlighted in red below.

An example is given here for conn.log, where the source IP and destination IP are identified as PII and marked in red. The full list of logs and fields can be found at the bottom of this page.

[1] https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/

In [2]:
print("---- " + 'conn.log' + " ---- Normal")
display(orig_log_dataframes['conn.log'].iloc[:1,:].style.apply(highlight_col, axis=None))

---- conn.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
0,1424256987.747828,C8kfmaJvH8YSujzW5,59.166.0.2,57872,149.171.126.9,56104,tcp,-,0.027552,227.0,11587.0,SF,-,0,ShADadfF,42,2646,44,25470,(empty)


### Anonymised fields

The following fields have been selected for anonymisation:

**conn.log:** src IP - dst IP

**dns.log:** src IP - dst IP - dns query - dns response
   
**ftp.log:** src IP - dst IP - user - password - arg - src data_channel - dst data_channel

**http.log:** src IP - dst IP - host - uri - filename - username - password
* The user agent could be considered PII; however, in this work it is kept as is.

**ssh.log:** src IP - dst IP

**ssl.log:** src IP - dst IP - server name - subject - issuer - client_subject - client_issuer
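
The list above can also be written down against the actual column names in the logs. The following is only a sketch: `PII_FIELDS` and `pii_columns` are hypothetical names, and the mapping from the informal names to columns (for example src IP to `id.orig_h`, dns response to `answers`) is inferred from the column headers and the sample output in this notebook, not taken from `helper_functions`. `server_name` is placed under ssl.log because that is the only log in the sample output that has such a column.

```python
# Hypothetical mapping of each log to the columns treated as PII.
# Column names are taken from the DataFrame headers shown in this notebook.
ENDPOINTS = ["id.orig_h", "id.resp_h"]  # src IP and dst IP

PII_FIELDS = {
    "conn.log": ENDPOINTS,
    "dns.log": ENDPOINTS + ["query", "answers"],
    "ftp.log": ENDPOINTS + ["user", "password", "arg",
                            "data_channel.orig_h", "data_channel.resp_h"],
    "http.log": ENDPOINTS + ["host", "uri", "filename", "username", "password"],
    "ssh.log": ENDPOINTS,
    "ssl.log": ENDPOINTS + ["server_name", "subject", "issuer",
                            "client_subject", "client_issuer"],
}


def pii_columns(log_name):
    """Return the PII column names for one log file (empty list if unknown)."""
    return list(PII_FIELDS.get(log_name, []))


print(pii_columns("dns.log"))
```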

These fields can be seen, marked in red, below.

In [3]:
for key in chosen_files:
    print("---- " + key + " ---- Normal")
    display(orig_log_dataframes[key].iloc[:1,:].style.apply(highlight_col, axis=None))

---- conn.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
0,1424256987.747828,C8kfmaJvH8YSujzW5,59.166.0.2,57872,149.171.126.9,56104,tcp,-,0.027552,227.0,11587.0,SF,-,0,ShADadfF,42,2646,44,25470,(empty)


---- dns.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,trans_id,query,qclass,qclass_name,qtype,qtype_name,rcode,rcode_name,AA,TC,RD,RA,Z,answers,TTLs,rejected
0,1424256988.447831,CcCQqj3E6KlwtFnRja,59.166.0.4,7745,149.171.126.2,53,udp,48100,server-95ab7e07.int,1.0,C_INTERNET,1.0,A,0.0,NOERROR,F,F,T,T,0,149.171.126.7,60.0,F


---- ftp.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,user,password,command,arg,mime_type,file_size,reply_code,reply_msg,data_channel.passive,data_channel.orig_h,data_channel.resp_h,data_channel.resp_p,fuid
0,1424256989.952859,CsfksnMVMnOYhPiB7,59.166.0.8,5146,149.171.126.3,21,anonymous,jobs@server.com,EPSV,-,-,,229,Extended Passive Mode OK (|||24196|),T,59.166.0.8,149.171.126.3,24196.0,-


---- http.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,trans_depth,method,host,uri,referrer,user_agent,request_body_len,response_body_len,status_code,status_msg,info_code,info_msg,filename,tags,username,password,proxied,orig_fuids,orig_mime_types,resp_fuids,resp_mime_types
0,1424256990.350022,CwgILB4FJZI9P9o4z1,59.166.0.1,41195,149.171.126.1,80,1,GET,Tracker,/announce?peer_id=-AR2621-949883860326&port=15836&uploaded=0&downloaded=0&left=8388610&compact=1&numwant=0&event=started&info_hash=\x1d]\xfb\xcc\x9f\xeb\xfckTfW\xe3e\xe8\xed\xa9 \x0f5\xf0,-,-,0,83,200.0,OK,,-,-,(empty),-,-,-,-,-,FdHIAGPzf33uaIXE9,text/plain


---- ssh.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,status,direction,client,server
0,1424256988.242291,ChBvzV2CEZnYgi9gaa,59.166.0.0,3778,149.171.126.2,22,success,INBOUND,SSH-2.0-PuTTY_Release_0.60,SSH-1.99-OpenSSH_4.3


---- ssl.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,version,cipher,curve,server_name,session_id,last_alert,established,cert_chain_fuids,client_cert_chain_fuids,subject,issuer,client_subject,client_issuer
0,1424257005.940392,CkIFoc3DTlsBeaFW66,175.45.176.1,17478,149.171.126.12,443,TLSv10,TLS_RSA_WITH_3DES_EDE_CBC_SHA,-,-,08a79d45b747f5c828da00000000000000000000000000000000000000000000,-,F,-,-,-,-,-,-


## Anonymisation process

The PII fields are then anonymised according to three categories: IP addresses, queries and other.

##### IP addresses

IP addresses are transformed to random IP addresses: each address is split into octets and each octet is mapped to a random number. Addresses that share leading octets therefore still share them after anonymisation, so addresses from the same range stay grouped together. For example: 1.2.3.50 and 1.2.3.100 might be transformed to 9.222.51.2 and 9.222.51.244.
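
A minimal sketch of this idea, not the code in `helper_functions`: the names `build_octet_tables` and `anonymise_ip` are hypothetical, and it is an assumption that each octet position gets its own random permutation of 0..255 (the text only states that each octet is given a random number). The sketch handles plain IPv4 strings only.

```python
import random


def build_octet_tables(rng):
    """Create one random permutation of 0..255 for each of the four octet positions."""
    tables = []
    for _ in range(4):
        values = list(range(256))
        rng.shuffle(values)
        tables.append(values)
    return tables


def anonymise_ip(ip, tables):
    """Replace every octet by its entry in the table for that position."""
    octets = [int(part) for part in ip.split(".")]
    return ".".join(str(tables[pos][octet]) for pos, octet in enumerate(octets))


rng = random.Random()  # unseeded, so the mapping differs on every run
tables = build_octet_tables(rng)

a = anonymise_ip("1.2.3.50", tables)
b = anonymise_ip("1.2.3.100", tables)
print(a, b)  # the first three octets of both results are identical
```

Because the same table is used for every address, two addresses that share a prefix before the transformation also share a prefix afterwards, which is the grouping described above.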

##### Queries

The same is done with DNS queries. To preserve information, each domain name at a given level is always mapped to the same number (label) using scikit-learn's LabelEncoder. For example: google.com and dns.google.com could be mapped to 5.3 and 1.5.3.
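
The per-level encoding can be sketched as follows. This is an illustration under assumptions rather than the original implementation: `anonymise_queries` is a hypothetical name, levels are assumed to be counted from the top-level domain with one label table per level, and a plain dictionary stands in for scikit-learn's LabelEncoder (it assigns 0..n-1 to the sorted unique values, the same labelling LabelEncoder produces) so that the example has no dependencies.

```python
from collections import defaultdict


def anonymise_queries(queries):
    """Encode each domain level separately; level 0 is the top-level domain."""
    split = [query.split(".") for query in queries]

    # Collect the distinct names seen at each level.
    seen = defaultdict(set)
    for parts in split:
        for level, name in enumerate(reversed(parts)):
            seen[level].add(name)

    # Stand-in for LabelEncoder: sorted unique values are labelled 0..n-1.
    tables = {
        level: {name: label for label, name in enumerate(sorted(names))}
        for level, names in seen.items()
    }

    encoded = []
    for parts in split:
        depth = len(parts)
        labels = [str(tables[depth - 1 - pos][name]) for pos, name in enumerate(parts)]
        encoded.append(".".join(labels))
    return encoded


queries = ["google.com", "dns.google.com", "mail.example.org"]
print(anonymise_queries(queries))  # ['1.0', '0.1.0', '1.0.1']
```

In the output, google.com and dns.google.com still end in the same two labels, so the relation between a domain and its subdomain survives the anonymisation.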

##### Other

The other PII fields are encoded in their entirety with scikit-learn's LabelEncoder, which means that identical inputs are mapped to the same number (label).
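
As a small illustration of whole-field encoding, the sketch below labels a column of values without splitting them. `label_encode` is a hypothetical helper that mimics the labelling of LabelEncoder (sorted unique values become 0..n-1) with a plain dictionary; it is not the code used in `helper_functions`.

```python
def label_encode(values):
    """Map each distinct value to an integer label; equal inputs get equal labels."""
    table = {value: label for label, value in enumerate(sorted(set(values)))}
    return [table[value] for value in values]


# Example column in the style of the ftp.log 'user' field ('-' means not set).
users = ["anonymous", "-", "anonymous"]
print(label_encode(users))  # [1, 0, 1]
```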

## Reversibility

All transformation tables and LabelEncoders are deleted after the anonymisation process, which makes it irreversible.

## Results - Normal and Anonymised

Below, the results of the fields marked as PII and the anonymisation of the 6 selected log files can be found.

In [4]:
for key in chosen_files:
    print("---- " + key + " ---- Normal")
    display(orig_log_dataframes[key].iloc[:1,:].style.apply(highlight_col, axis=None))
    print("---- " + key + " ---- Anonymised")
    display(log_dataframes[key].iloc[:1,:].style.apply(highlight_col, axis=None))
    

---- conn.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
0,1424256987.747828,C8kfmaJvH8YSujzW5,59.166.0.2,57872,149.171.126.9,56104,tcp,-,0.027552,227.0,11587.0,SF,-,0,ShADadfF,42,2646,44,25470,(empty)


---- conn.log ---- Anonymised


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
0,1424256987.747828,C8kfmaJvH8YSujzW5,43.50.116.71,57872,221.210.86.69,56104,tcp,-,0.027552,227.0,11587.0,SF,-,0,ShADadfF,42,2646,44,25470,(empty)


---- dns.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,trans_id,query,qclass,qclass_name,qtype,qtype_name,rcode,rcode_name,AA,TC,RD,RA,Z,answers,TTLs,rejected
0,1424256988.447831,CcCQqj3E6KlwtFnRja,59.166.0.4,7745,149.171.126.2,53,udp,48100,server-95ab7e07.int,1.0,C_INTERNET,1.0,A,0.0,NOERROR,F,F,T,T,0,149.171.126.7,60.0,F


---- dns.log ---- Anonymised


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,trans_id,query,qclass,qclass_name,qtype,qtype_name,rcode,rcode_name,AA,TC,RD,RA,Z,answers,TTLs,rejected
0,1424256988.447831,CcCQqj3E6KlwtFnRja,43.50.116.15,7745,221.210.86.71,53,udp,48100,0.0,1.0,C_INTERNET,1.0,A,0.0,NOERROR,F,F,T,T,0,221.210.86.231,60.0,F


---- ftp.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,user,password,command,arg,mime_type,file_size,reply_code,reply_msg,data_channel.passive,data_channel.orig_h,data_channel.resp_h,data_channel.resp_p,fuid
0,1424256989.952859,CsfksnMVMnOYhPiB7,59.166.0.8,5146,149.171.126.3,21,anonymous,jobs@server.com,EPSV,-,-,,229,Extended Passive Mode OK (|||24196|),T,59.166.0.8,149.171.126.3,24196.0,-


---- ftp.log ---- Anonymised


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,user,password,command,arg,mime_type,file_size,reply_code,reply_msg,data_channel.passive,data_channel.orig_h,data_channel.resp_h,data_channel.resp_p,fuid
0,1424256989.952859,CsfksnMVMnOYhPiB7,43.50.116.241,5146,221.210.86.125,21,1,2,EPSV,0,-,,229,Extended Passive Mode OK (|||24196|),T,43.50.116.241,221.210.86.125,24196.0,-


---- http.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,trans_depth,method,host,uri,referrer,user_agent,request_body_len,response_body_len,status_code,status_msg,info_code,info_msg,filename,tags,username,password,proxied,orig_fuids,orig_mime_types,resp_fuids,resp_mime_types
0,1424256990.350022,CwgILB4FJZI9P9o4z1,59.166.0.1,41195,149.171.126.1,80,1,GET,Tracker,/announce?peer_id=-AR2621-949883860326&port=15836&uploaded=0&downloaded=0&left=8388610&compact=1&numwant=0&event=started&info_hash=\x1d]\xfb\xcc\x9f\xeb\xfckTfW\xe3e\xe8\xed\xa9 \x0f5\xf0,-,-,0,83,200.0,OK,,-,-,(empty),-,-,-,-,-,FdHIAGPzf33uaIXE9,text/plain


---- http.log ---- Anonymised


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,trans_depth,method,host,uri,referrer,user_agent,request_body_len,response_body_len,status_code,status_msg,info_code,info_msg,filename,tags,username,password,proxied,orig_fuids,orig_mime_types,resp_fuids,resp_mime_types
0,1424256990.350022,CwgILB4FJZI9P9o4z1,43.50.116.153,41195,221.210.86.153,80,1,GET,251,216,-,-,0,83,200.0,OK,,-,0,(empty),0,0,-,-,-,FdHIAGPzf33uaIXE9,text/plain


---- ssh.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,status,direction,client,server
0,1424256988.242291,ChBvzV2CEZnYgi9gaa,59.166.0.0,3778,149.171.126.2,22,success,INBOUND,SSH-2.0-PuTTY_Release_0.60,SSH-1.99-OpenSSH_4.3


---- ssh.log ---- Anonymised


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,status,direction,client,server
0,1424256988.242291,ChBvzV2CEZnYgi9gaa,43.50.116.118,3778,221.210.86.71,22,success,INBOUND,SSH-2.0-PuTTY_Release_0.60,SSH-1.99-OpenSSH_4.3


---- ssl.log ---- Normal


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,version,cipher,curve,server_name,session_id,last_alert,established,cert_chain_fuids,client_cert_chain_fuids,subject,issuer,client_subject,client_issuer
0,1424257005.940392,CkIFoc3DTlsBeaFW66,175.45.176.1,17478,149.171.126.12,443,TLSv10,TLS_RSA_WITH_3DES_EDE_CBC_SHA,-,-,08a79d45b747f5c828da00000000000000000000000000000000000000000000,-,F,-,-,-,-,-,-


---- ssl.log ---- Anonymised


Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,version,cipher,curve,server_name,session_id,last_alert,established,cert_chain_fuids,client_cert_chain_fuids,subject,issuer,client_subject,client_issuer
0,1424257005.940392,CkIFoc3DTlsBeaFW66,138.48.142.153,17478,221.210.86.78,443,TLSv10,TLS_RSA_WITH_3DES_EDE_CBC_SHA,-,0,08a79d45b747f5c828da00000000000000000000000000000000000000000000,-,F,-,-,0,0,0,0
