# <b>DarkVec: Automatic Analysis of Darknet Traffic with Word Embeddings</b>
## <b>Appendix 3: Intermediate Preprocessing</b>  

___
# <b>Table of Contents</b> <a id="toc"></a>
* [<b>Intermediate Preprocessing</b>](#preprocessing)  
    * [darknet.csv.gz](#darkvec)
    * [ips.json](#ips)
    * [darknet_d5.csv.gz](#darkvecd5)
    * [darknet_d1.csv.gz](#darkvecd1)
    * [darknet_d1_f5.csv.gz](#darkvecd1f5)
    * [embeddings_ip2vec.csv.gz](#ip2vec)
    
Here we report the code used for the intermediate preprocessing steps. Their outputs can be reused to speed up the notebook execution.

All the intermediate datasets are saved in the `DATASETS` folder specified in the configuration file.

___
***Note:*** All the code and data we provide are the ones included in the paper. To speed up the notebook execution, by default we trim the files when reading them. Comments on how to run on complete files are provided in the notebook. Note that running the notebook on the complete dataset requires *a machine with a significant amount of memory*. 

In [1]:
from config import *
from src.callbacks import *
from src.utils import *
import pandas as pd
import numpy as np
import warnings
import json
from glob import glob
from datetime import datetime
from src.knngraph import *
from keras.models import load_model as k_load_model

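try:
    # Location in pandas >= 1.5
    from pandas.errors import SettingWithCopyWarning
except ImportError:
    # Location in older pandas releases
    from pandas.core.common import SettingWithCopyWarning
from pandas.errors import DtypeWarning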

warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=SettingWithCopyWarning)
warnings.filterwarnings("ignore", category=DtypeWarning)

___
# <b>Intermediate Preprocessing</b> <a name="preprocessing"></a>




Load the full ground truth we have manually built and define the local helper functions.

In [2]:
demonstrative = True

In [3]:
# Get the paths of the raw traces to process
full_logs = []
traces_logs = DEBUG.split('file://')[1:]
for trace_day in traces_logs:
    sub_traces = glob(trace_day.split('*')[0]+'*')
    for sub_trace in sub_traces:
        full_logs += [sub_trace+'/packets.log.gz']

In [4]:
def convert_proto(x):
    """Convert the network protocol decimal representation to string

    Parameters
    ----------
    x : int
        network protocol decimal index

    Returns
    -------
    str
        string id of the protocol

    """
    if x == 6: return 'TCP'
    elif x == 17: return 'UDP'
    elif x == 47: return 'GRE'
    elif x == 1: return 'ICMP'
    else: return 'OTH'

def get_portproto(x):
    """Convert the destination port and the protocol of the packet to the 
    port/protocol pair or unknown

    Parameters
    ----------
    x : list
        destination port and protocol of the packet

    Returns
    -------
    str
        unk or port/proto pair

    """
    try: val = f'{x[0]}/{x[1]}'
    except Exception: val = 'unk'

    return val
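
The two helpers above work one value at a time. As a rough sketch of the same mapping applied to whole columns, the snippet below uses a lookup dictionary and `Series.map`; the name `PROTO_NAMES` and the toy values are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup table mirroring the branches of convert_proto
PROTO_NAMES = {6: 'TCP', 17: 'UDP', 47: 'GRE', 1: 'ICMP'}

# Toy protocol numbers; 50 has no entry and falls back to 'OTH'
proto = pd.Series([6, 17, 1, 47, 50])
proto_str = proto.map(PROTO_NAMES).fillna('OTH')

# Toy destination ports, combined into port/protocol strings
port = pd.Series([23, 53, 0, 0, 0])
pp = port.astype(str) + '/' + proto_str

print(pp.tolist())
```

On large traces a column-wise mapping of this kind usually avoids the per-row Python calls of `apply`.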

In [5]:
gt = pd.read_csv(GROUNDTRUTH).drop(columns=['Unnamed: 0'])\
       .rename(columns={'s_ip':'ip'})
gt.head()

,ip,class
0,185.200.118.73,adscore
1,185.200.118.49,adscore
2,185.200.118.70,adscore
3,185.200.118.39,adscore
4,185.200.118.41,adscore


### <b>darknet.csv.gz</b> <a name="darkvec"></a>



Raw darknet traces. We adjust the trace fields and organize them in a dataframe. Each row is a received packet, and the columns are:
- `ts`. Timestamp of the packet arrival
- `ip`. Source IP address that sent the packet
- `port`. Destination (darknet) port
- `proto`. Protocol used, among TCP, UDP, ICMP, GRE, and OTH (for others)
- `pp`. `port/proto` pair used for the language definition
- `class`. Ground truth class of the source IP (`unknown` if the IP is not in the ground truth)
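
To make the column layout concrete, here is a minimal sketch that builds the same columns from a three-packet toy trace. The raw column names (`ts`, `src_ip`, `dst_port`, `proto`) follow the notebook code; the IP addresses, the `scanner_a` label, and the `toy_gt` and `proto_names` names are invented for illustration.

```python
import pandas as pd

# Toy raw trace with the column names used by the notebook code
raw = pd.DataFrame({
    'ts': [1616884144.4, 1616884145.0, 1616884146.2],
    'src_ip': ['10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'dst_port': [23, 53, 445],
    'proto': [6, 17, 6],
})
# Toy ground truth: only one sender is labelled (hypothetical class name)
toy_gt = pd.DataFrame({'ip': ['10.0.0.2'], 'class': ['scanner_a']})

proto_names = {6: 'TCP', 17: 'UDP', 47: 'GRE', 1: 'ICMP'}
packets = raw.rename(columns={'src_ip': 'ip', 'dst_port': 'port'})
packets['proto'] = packets['proto'].map(proto_names).fillna('OTH')
packets['pp'] = packets['port'].astype(str) + '/' + packets['proto']
# Left join keeps every packet; unlabelled senders become 'unknown'
packets = packets.merge(toy_gt, on='ip', how='left')
packets['class'] = packets['class'].fillna('unknown')

print(packets[['ip', 'port', 'proto', 'pp', 'class']])
```

Filling only the `class` column, as done here, leaves missing values in the other columns untouched, which is a design choice of this sketch rather than of the notebook.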

In [None]:
if demonstrative:
    logs = full_logs[:5]
else:
    logs = full_logs

# Load the raw traces
darknet = pd.concat([pd.read_csv(x, sep=' ') for x in logs])
# Convert the protocol numbers to the string identifier
darknet.proto = darknet.proto.apply(convert_proto)
# Convert the timestamp to datetime
darknet.ts = darknet.ts.apply(lambda x: datetime.fromtimestamp(x))
# Extract the classes of services aka languages (port/protocol pairs)
# (label-based access: positional x[0] on a labelled row is deprecated)
darknet['pp'] = darknet[['dst_port', 'proto']]\
                .apply(lambda x: f'{x["dst_port"]}/{x["proto"]}', axis=1)
darknet = darknet[['ts', 'src_ip', 'dst_port', 'proto', 'pp']]\
            .rename(columns={'src_ip':'ip', 'dst_port':'port'})
darknet.index = pd.DatetimeIndex(darknet.ts)
darknet = darknet.drop(columns=['ts']).reset_index()
# Add the ground truth class column
darknet = darknet.merge(gt, on='ip', how='left').fillna('unknown')

if not demonstrative:
    darknet.to_csv(f'{DATASETS}/darknet.csv.gz', compression='gzip', index=False)

darknet.head()

### <b>ips.json</b> <a name="ips"></a>



It contains the lists of IPs used as filters, namely:

* `d30_u`: IPs of the unfiltered 30-day dataset;
* `d30_f`: IPs of the filtered 30-day dataset, i.e. those sending at least 10 packets;
* `d1_u`: IPs of the unfiltered last day;
* `d1_f30`: IPs of the last day, filtered over 30 days.
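
The four lists can be illustrated on toy data. The sketch below assumes one row per packet and the 10-packet threshold used in the notebook code; the sender names and the `MIN_PACKETS` and `toy_filters` names are hypothetical.

```python
import pandas as pd

# Toy packet logs: one row per packet, only the sender matters here
month = pd.DataFrame({'ip': ['a'] * 12 + ['b'] * 3 + ['c'] * 10})
last_day = pd.DataFrame({'ip': ['a', 'b', 'd']})

MIN_PACKETS = 10  # threshold used in the notebook code

counts = month['ip'].value_counts()
toy_filters = {
    'd30_u': sorted(month['ip'].unique()),
    'd30_f': sorted(counts[counts >= MIN_PACKETS].index),
    'd1_u': sorted(last_day['ip'].unique()),
}
# Last-day senders that also pass the filter over the whole period
toy_filters['d1_f30'] = sorted(set(toy_filters['d1_u']) & set(toy_filters['d30_f']))

print(toy_filters)
```

Sender `b` appears on the last day but sends fewer than 10 packets overall, so it is in `d1_u` and not in `d1_f30`.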


In [49]:
filters = dict()
# Get 30 days of traffic unfiltered
ip_d30_u = list(darknet.ip.unique())
filters['d30_u'] = ip_d30_u
# Get 30 days of traffic filtered
ip_d30_f = darknet.value_counts('ip')
ip_d30_f = list(set(ip_d30_f[ip_d30_f>=10].index))
filters['d30_f'] = ip_d30_f
# Get the last day of traffic from raw traces
last_day_traces = []
for trace in glob(DEBUG.split('file://')[-1].split('*')[0]+'*'):
    last_day_traces.append(trace+'/packets.log.gz')
last_day = pd.concat([pd.read_csv(x, sep=' ') for x in last_day_traces])
ip_d1_u = list(last_day.src_ip.unique())
filters['d1_u'] = ip_d1_u
# Filter the last day
ip_d1_f30 = list(set(last_day.src_ip.unique()).intersection(ip_d30_f))
filters['d1_f30'] = ip_d1_f30


print(f'IPs 30 days unfiltered: {ip_d30_u[:3]}')
print(f'IPs 30 days filtered over 30 days: {ip_d30_f[:3]}')
print(f'IPs last day unfiltered: {ip_d1_u[:3]}')
print(f'IPs last day filtered over 30 days: {ip_d1_f30[:3]}')

if not demonstrative:
    with open(f'{DATASETS}/ips.json', 'w') as file:
        # Serialize the dictionary of IP lists
        json.dump(filters, file)

IPs 30 days unfiltered: ['145.239.33.107', '89.40.70.51', '94.102.51.17']
IPs 30 days filtered over 30 days: ['45.12.49.222', '68.119.89.115', '20.194.18.193']
IPs last day unfiltered: ['192.3.136.75', '172.245.10.231', '45.155.205.93']
IPs last day filtered over 30 days: ['159.203.165.156', '71.6.199.23', '162.142.125.21']
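
A dictionary of IP lists like the one printed above can be stored as a JSON object and read back unchanged. The following is a small round-trip sketch on toy values; the temporary directory and the addresses are assumptions and do not correspond to the `DATASETS` folder.

```python
import json
import os
import tempfile

# Toy stand-in for the dictionary of IP lists
toy_filters = {'d30_u': ['10.0.0.1', '10.0.0.2'], 'd1_f30': ['10.0.0.1']}

# Write to a throwaway location instead of the DATASETS folder
path = os.path.join(tempfile.mkdtemp(), 'ips.json')
with open(path, 'w') as file:
    json.dump(toy_filters, file)

with open(path) as file:
    loaded = json.load(file)

print(sorted(loaded.keys()))
```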


### <b>darknet_d5.csv.gz</b> <a name="darkvecd5"></a>



Last 5 days of unfiltered darknet traffic. We adjust the trace fields and organize them in a dataframe. Each row is a received packet, and the columns are:
- `ts`. Timestamp of the packet arrival
- `ip`. Source IP address that sent the packet
- `port`. Destination (darknet) port
- `proto`. Protocol used, among TCP, UDP, ICMP, GRE, and OTH (for others)
- `pp`. `port/proto` pair used for the language definition
- `class`. Ground truth class of the source IP

In [50]:
last5 = []
for days in DEBUG.split('file://')[-5:]:
    days = days.split('*')[0]+'*/packets.log.gz'
    for sub_day in glob(days):
        last5.append(sub_day)

In [51]:
# Get the last 5 days of traffic from raw traces
if demonstrative:
    _last5 = last5[:3]
else:
    _last5 = last5

# Load the raw traces    
last_day5 = pd.concat([pd.read_csv(x, sep=' ') for x in _last5])
# Convert the protocol numbers to the string identifier
last_day5.proto = last_day5.proto.apply(convert_proto)
# Convert the timestamp to datetime
last_day5.ts = last_day5.ts.apply(lambda x: datetime.fromtimestamp(x))
# Extract the classes of services aka languages (port/protocol pairs)
# (label-based access: positional x[0] on a labelled row is deprecated)
last_day5['pp'] = last_day5[['dst_port', 'proto']]\
                .apply(lambda x: f'{x["dst_port"]}/{x["proto"]}', axis=1)
last_day5 = last_day5[['ts', 'src_ip', 'dst_port', 'proto', 'pp']]\
            .rename(columns={'src_ip':'ip', 'dst_port':'port'})
last_day5.index = pd.DatetimeIndex(last_day5.ts)
last_day5 = last_day5.drop(columns=['ts']).reset_index()
# Add the ground truth class column
last_day5 = last_day5.merge(gt, on='ip', how='left').fillna('unknown')

if not demonstrative:
    last_day5.to_csv(f'{DATASETS}/darknet_d5.csv.gz', compression='gzip', index=False)

last_day5.head()

,ts,ip,port,proto,pp,class
0,2021-03-27 22:29:04.445707,94.232.46.25,3393,TCP,3393/TCP,unknown
1,2021-03-27 22:29:04.445723,94.232.46.25,3393,TCP,3393/TCP,unknown
2,2021-03-27 22:29:04.570144,45.146.164.196,3995,TCP,3995/TCP,unknown
3,2021-03-27 22:29:04.570161,45.146.164.196,3995,TCP,3995/TCP,unknown
4,2021-03-27 22:29:04.577582,192.3.136.75,1300,TCP,1300/TCP,unknown


### <b>darknet_d1.csv.gz</b> <a name="darkvecd1"></a>



Last day of unfiltered darknet traffic. We adjust the trace fields and organize them in a dataframe. Each row is a received packet, and the columns are:
- `ts`. Timestamp of the packet arrival
- `ip`. Source IP address that sent the packet
- `port`. Destination (darknet) port
- `proto`. Protocol used, among TCP, UDP, ICMP, GRE, and OTH (for others)
- `pp`. `port/proto` pair used for the language definition
- `class`. Ground truth class of the source IP

In [52]:
last_day_traces = []
for trace in glob(DEBUG.split('file://')[-1].split('*')[0]+'*'):
    last_day_traces.append(trace+'/packets.log.gz')

if demonstrative:
    _last_day_traces = last_day_traces[:3]
else:
    _last_day_traces = last_day_traces
    
# Load the raw traces    
last_day = pd.concat([pd.read_csv(x, sep=' ') for x in _last_day_traces])
# Convert the protocol numbers to the string identifier
last_day.proto = last_day.proto.apply(convert_proto)
# Convert the timestamp to datetime
last_day.ts = last_day.ts.apply(lambda x: datetime.fromtimestamp(x))
# Extract the classes of services aka languages (port/protocol pairs)
# (label-based access: positional x[0] on a labelled row is deprecated)
last_day['pp'] = last_day[['dst_port', 'proto']]\
                .apply(lambda x: f'{x["dst_port"]}/{x["proto"]}', axis=1)
last_day = last_day[['ts', 'src_ip', 'dst_port', 'proto', 'pp']]\
            .rename(columns={'src_ip':'ip', 'dst_port':'port'})
last_day.index = pd.DatetimeIndex(last_day.ts)
last_day = last_day.drop(columns=['ts']).reset_index()
# Add the ground truth class column
last_day = last_day.merge(gt, on='ip', how='left').fillna('unknown')

# Relabel the following ground truth classes as unknown
to_replace = ['netscout', 'esrg_stanford', 'quadmetrics', 'criminalip', 'adscore']
to_replace_idx = last_day.loc[last_day['class'].isin(to_replace)].index
last_day.loc[to_replace_idx, 'class'] = 'unknown'

if not demonstrative:
    last_day.to_csv(f'{DATASETS}/darknet_d1.csv.gz', compression='gzip', index=False)

last_day.head()

,ts,ip,port,proto,pp,class
0,2021-03-31 08:29:08.449076,192.3.136.75,1970,TCP,1970/TCP,unknown
1,2021-03-31 08:29:08.449088,192.3.136.75,1970,TCP,1970/TCP,unknown
2,2021-03-31 08:29:08.451478,192.3.136.75,1970,TCP,1970/TCP,unknown
3,2021-03-31 08:29:08.451491,192.3.136.75,1970,TCP,1970/TCP,unknown
4,2021-03-31 08:29:08.459322,172.245.10.231,3956,TCP,3956/TCP,unknown
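
The cell above sets some ground truth classes back to `unknown` for the last-day dataset. A compact way to express that step is a boolean mask, sketched here on toy rows; the addresses and the two-element `to_replace` list are a hypothetical subset chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'ip': ['10.0.0.1', '10.0.0.2', '10.0.0.3'],
    'class': ['adscore', 'stretchoid', 'unknown'],
})
# Hypothetical subset of the classes relabelled in the notebook
to_replace = ['adscore', 'netscout']

# A boolean mask selects exactly the matching rows
mask = toy['class'].isin(to_replace)
toy.loc[mask, 'class'] = 'unknown'

print(toy['class'].tolist())
```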


### <b>darknet_d1_f5.csv.gz</b> <a name="darkvecd1f5"></a>



Last day of darknet traffic, filtered over the last 5 days: only the IPs sending at least 10 packets over the 5 days are kept. We adjust the trace fields and organize them in a dataframe. Each row is a received packet, and the columns are:
- `ts`. Timestamp of the packet arrival
- `ip`. Source IP address that sent the packet
- `port`. Destination (darknet) port
- `proto`. Protocol used, among TCP, UDP, ICMP, GRE, and OTH (for others)
- `pp`. `port/proto` pair used for the language definition
- `class`. Ground truth class of the source IP

In [55]:
# Load last 5 days of traffic
if demonstrative:
    last_day5 = pd.read_csv(f'{DATASETS}/darknet_d5.csv.gz').iloc[:1000]
else:
    last_day5 = pd.read_csv(f'{DATASETS}/darknet_d5.csv.gz')
# Filter: keep IPs sending at least 10 packets
# over 5 days
freq = last_day5.value_counts('ip')
filter5 = freq[freq>=10].index

# Load last day
if demonstrative:
    last5 = pd.read_csv(f'{DATASETS}/darknet_d1.csv.gz').iloc[:1000]
else:
    last5 = pd.read_csv(f'{DATASETS}/darknet_d1.csv.gz')
# Filter last day
last5 = last5[last5.ip.isin(set(filter5))]

if not demonstrative:
    last5.to_csv(f'{DATASETS}/darknet_d1_f5.csv.gz', index=False, compression='gzip')
    
last5.head()

,ts,ip,port,proto,pp,class
0,2021-03-31 08:29:08.449076,192.3.136.75,1970,TCP,1970/TCP,unknown
1,2021-03-31 08:29:08.449088,192.3.136.75,1970,TCP,1970/TCP,unknown
2,2021-03-31 08:29:08.451478,192.3.136.75,1970,TCP,1970/TCP,unknown
3,2021-03-31 08:29:08.451491,192.3.136.75,1970,TCP,1970/TCP,unknown
6,2021-03-31 08:29:08.553020,45.155.205.93,5543,TCP,5543/TCP,unknown


### <b>embeddings_ip2vec.csv.gz</b> <a name="ip2vec"></a>



Embeddings generated through the IP2VEC methodology after 5 days of training.
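
Retrieving an embedding amounts to mapping each word to a token id and reading the matching row of the embedding weight matrix. The sketch below shows that lookup with NumPy only. The random `weights` matrix stands in for the trained layer, and the 4-dimensional size, the toy words, and the `word_to_token` name are assumptions.

```python
import numpy as np
import pandas as pd

# Toy target/context words standing in for the corpus arrays
x = np.array(['10.0.0.2', '10.0.0.1', '10.0.0.2'])
y = np.array(['23/TCP', '53/UDP', '23/TCP'])

# Token ids follow the sorted order of the distinct words
vocab = sorted(set(x.tolist()) | set(y.tolist()))
word_to_token = {word: token for token, word in enumerate(vocab)}

# Random stand-in for the trained embedding layer (4 dimensions here)
rng = np.random.default_rng(0)
weights = rng.normal(size=(len(vocab), 4))

ips = ['10.0.0.1', '10.0.0.2']
tokens = [word_to_token[ip] for ip in ips]
# One row per IP, in the same order as the token list
embeddings = pd.DataFrame(weights[tokens], index=pd.Index(ips, name='ip'))

print(embeddings.shape)
```

Keeping the IP list and the token list in the same order is what guarantees that each row is labelled with the right address.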

In [7]:
def extract_corpus(data, w2v):
    """Extract the IP2VEC corpus

    Parameters
    ----------
    data : numpy.ndarray
        dataset
    w2v : dict
        word to embedding lookup

    Returns
    -------
    list
        tokens constituting the corpus
    """
    corpus = [[w2v[w] for w in ww]  for ww in data]
    return corpus

In [13]:
# Load Corpus
if demonstrative:
    files = glob(f'{CORPUS}/ip2vec5/*.npz')[:2]
else:
    files = glob(f'{CORPUS}/ip2vec5/*.npz')
# Get target words
x = np.concatenate([np.load(a)['x'] for a in files])
# Get context words
y = np.concatenate([np.load(a)['y'] for a in files])
merged = set(x).union(set(y))
# Tokenize the distinct words (token id -> word, and word -> token id)
v2w = {token: word for token, word in enumerate(sorted(merged))}
w2v = {word: token for token, word in v2w.items()}

# Load the last-day dataset filtered over 5 days
d1_f5 = pd.read_csv(f'{DATASETS}/darknet_d1_f5.csv.gz')

if demonstrative:
    d1_f5 = d1_f5.iloc[:50]

embedder = k_load_model(f'{MODELS}/ip2vec5embedder')
# Retrieve the embeddings from the embedder weights
single_ips = d1_f5.ip.unique()

# Only because we trimmed the input for demonstrative purposes:
# drop the IPs that are missing from the trimmed vocabulary
if demonstrative:
    single_ips = [x for x in single_ips if x in w2v]
single_token = [w2v[x] for x in single_ips]

ip2vec_embs = [embedder.get_weights()[0][x] for x in single_token]
# Save the retrieved embeddings
ip2vecEmbeddings = pd.DataFrame(ip2vec_embs)
# The IPs are in the same order as the retrieved embeddings
ip2vecEmbeddings['ip'] = single_ips
ip2vecEmbeddings = ip2vecEmbeddings.merge(d1_f5[['ip', 'class']].drop_duplicates(),
                                          on='ip').set_index('ip')

if not demonstrative:
    ip2vecEmbeddings.to_csv(f'{DATASETS}/embeddings_ip2vec.csv.gz', 
                            index=True, compression='gzip')

ip2vecEmbeddings.head()



ip,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,class
192.3.136.75,0.140516,-0.121416,0.082213,-0.109613,-0.11198,0.073576,-0.110181,0.06992,0.1224,0.115999,...,-0.152108,0.127227,0.140479,0.132453,-0.120265,-0.123345,-0.116033,0.16564,0.130773,unknown
172.245.10.231,0.571844,-0.57298,0.628429,-0.611953,-0.662718,0.6416,-0.585989,0.598031,0.589008,0.608322,...,-0.663602,0.646712,0.638398,0.584536,-0.661189,-0.66424,-0.588468,0.664454,0.611667,unknown
45.155.205.93,0.680109,-0.716235,0.650782,-0.680528,-0.70418,0.703452,-0.640864,0.632664,0.711205,0.705902,...,-0.663054,0.706435,0.686659,0.631891,-0.720504,-0.65523,-0.70976,0.714704,0.627607,unknown
23.129.64.232,0.627756,-0.544646,0.551813,-0.632093,-0.546742,0.631491,-0.627662,0.566308,0.562425,0.618965,...,-0.536841,0.594436,0.615056,0.562984,-0.532072,-0.620822,-0.628711,0.613212,0.540006,unknown
192.241.222.5,0.252327,-0.250525,0.244665,-0.219606,-0.188968,0.241786,-0.204343,0.195317,0.219202,0.281548,...,-0.196166,0.281408,0.257982,0.247891,-0.265824,-0.280848,-0.195806,0.210411,0.273139,stretchoid
