# <b>DarkVec: Automatic Analysis of Darknet Traffic with Word Embeddings</b>
## <b>Appendix 2: Models Training</b>  

___

# <b>Table of Contents</b> <a id="toc_"></a>

* [<b>DarkVec Training</b>](#darkvec)  
    * [Per-Service Language](#xserv)  
    * [Auto-Defined Language](#auto)  
    * [Single Language](#single)  
* [<b>DarkVec 5 Days</b>](#darkvec5)
* [<b>DANTE Training</b>](#dante)  
* [<b>IP2VEC Training</b>](#ip2vec)  

In this notebook we report the snippets for training and saving the models used in the paper. The corpus is loaded from the `CORPUS` path in the configuration file. Models and scalers are saved in the `MODELS` path in the configuration file.

Model names encode the parameters _C_ (context window size) and _V_ (embedding size). Namely, they are:
* `single_cC_vV_iter20`: DarkVec model trained on 30 days. Single language;
* `auto_cC_vV_iter20`: DarkVec model trained on 30 days. Auto-defined languages;
* `service_cC_vV_iter20`: DarkVec model trained on 30 days. Per-service languages;
* `fivedays_c25_v50_iter20`: DarkVec model trained on 5 days. Per-service languages;
* `ip2vec5embedder`: Keras embedder generated through our implementation following the IP2VEC paper.
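
As a small illustration of the naming scheme above, the sketch below builds a model name from its parameters. The helper `model_name` and its `prefix` argument are hypothetical and do not appear in the original code.

```python
def model_name(prefix: str, c: int, v: int, epochs: int = 20) -> str:
    """Return a name such as 'service_c25_v50_iter20' (hypothetical helper)."""
    return f"{prefix}_c{c}_v{v}_iter{epochs}"


# One name per DarkVec variant, all with C=25 and V=50
names = [model_name(p, 25, 50) for p in ("single", "auto", "service", "fivedays")]
print(names)
```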


___
***Note:*** All the code and data we provide are the ones included in the paper. To speed up the notebook execution, by default we trim the files when reading them. Comments on how to run on complete files are provided in the notebook. Note that running the notebook with the complete dataset requires *a PC with a significant amount of memory*. 

***Note:*** Be aware that these scripts require a large amount of time to train all the models tested in the paper.

In [None]:
from config import *
import multiprocessing

from glob import glob
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from keras.models import Model
from keras.layers import Input, Embedding, Dense, Reshape, Dot

import logging
from gensim.models import Word2Vec 
from gensim.models.word2vec import PathLineSentences
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)

# <b>DarkVec Training</b> <a name="darkvec"></a>  



In [None]:
demonstrative = True

### <b>Per-Service Language</b>  <a name="xserv"></a>



Grid-search DarkVec training over 30 days of traffic. Per-service language means that a separate sequence of IPs is extracted for each class of service, where a class groups (port/protocol) pairs. The language definition is stored in the `SERVICES` path of the configuration file.
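
The idea of a per-service language can be sketched on toy data as follows. This is only an illustration: the corpus used in this notebook is pre-built, and the `SERVICE_OF` mapping, the packet tuples and the `per_service_sentences` helper are assumptions made for the example, not the content of the `SERVICES` file.

```python
from collections import defaultdict

# Hypothetical mapping from (port, protocol) to a class of service
SERVICE_OF = {
    (23, "tcp"): "telnet",
    (22, "tcp"): "ssh",
    (53, "udp"): "dns",
}

# Toy packets in arrival order: (source IP, destination port, protocol)
packets = [
    ("10.0.0.1", 23, "tcp"),
    ("10.0.0.2", 22, "tcp"),
    ("10.0.0.3", 23, "tcp"),
    ("10.0.0.1", 53, "udp"),
    ("10.0.0.2", 23, "tcp"),
]


def per_service_sentences(packets, service_of, default="other"):
    """Group source IPs by class of service, preserving arrival order."""
    sentences = defaultdict(list)
    for ip, port, proto in packets:
        sentences[service_of.get((port, proto), default)].append(ip)
    return dict(sentences)


sentences = per_service_sentences(packets, SERVICE_OF)
for service, ips in sentences.items():
    print(service, " ".join(ips))
```

Each resulting list plays the role of one sentence whose words are sender IPs.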

In [None]:
# Specify models parameter
if demonstrative:
    Cs = [5]
    Vs = [10]
    EPOCHS = 1
else:
    Cs = [5, 25, 50, 75]
    Vs = [50, 100, 150, 200]
    EPOCHS = 20
METHODOLOGY = 'darkvec'
DAYS = 30
LANGUAGE = 'xserv'

for C in Cs:
    for V in Vs:
        model_name = f'service_c{C}_v{V}_iter{EPOCHS}'
        model_path_name = f'{MODELS}/{model_name}.model'

        # Define the path for streaming training
        x = PathLineSentences(f'{CORPUS}/{METHODOLOGY}{DAYS}{LANGUAGE}', max_sentence_length=100000000);
        # Init a new model
        model = Word2Vec(sentences=None, min_count = 10, workers = multiprocessing.cpu_count(),
                         sg = 1, size = V, window = C, sample = 0, seed = 15)
        # Build the vocabulary
        model.build_vocab(sentences=x)
        vocab_size = len(model.wv.vocab)
        tot_examples = model.corpus_count
        # Train the model
        model.train(x, total_examples=tot_examples, epochs=EPOCHS, 
                    queue_factor=2, report_delay=0.0)
        if not demonstrative:
            model.save(model_path_name)
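
The nested loops above visit every combination of _C_ and _V_. As a quick check of how many models the full grid produces, the combinations can be enumerated without training anything; this sketch only builds the names and uses the non-demonstrative parameter values from the cell above.

```python
from itertools import product

Cs = [5, 25, 50, 75]
Vs = [50, 100, 150, 200]
EPOCHS = 20

# One name per (C, V) combination, in the same order as the nested loops
grid = [f"service_c{C}_v{V}_iter{EPOCHS}" for C, V in product(Cs, Vs)]
print(len(grid))
print(grid[0], grid[-1])
```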

### <b>Auto-Defined Language</b> <a id="auto"></a>


Grid-search DarkVec training over 30 days of traffic. Auto-defined language means that a separate sequence of IPs is extracted for each of the top-10 (port/protocol) pairs. All the other pairs are classified as _oth_. The language definition is stored in the `SERVICES` path of the configuration file.
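
A minimal sketch of the auto-defined language on toy data, assuming packets are available as (source IP, port, protocol) tuples. The helper `auto_defined_sentences` and its parameters are hypothetical; the example uses `top_n=2` instead of 10 only to keep the data small.

```python
from collections import Counter, defaultdict


def auto_defined_sentences(packets, top_n=10, other="oth"):
    """Keep the top_n most frequent (port, protocol) pairs as their own
    language and merge every remaining pair into `other`."""
    counts = Counter((port, proto) for _, port, proto in packets)
    top = {pair for pair, _ in counts.most_common(top_n)}
    sentences = defaultdict(list)
    for ip, port, proto in packets:
        key = f"{port}/{proto}" if (port, proto) in top else other
        sentences[key].append(ip)
    return dict(sentences)


packets = [
    ("10.0.0.1", 23, "tcp"),
    ("10.0.0.2", 23, "tcp"),
    ("10.0.0.3", 22, "tcp"),
    ("10.0.0.4", 22, "tcp"),
    ("10.0.0.5", 53, "udp"),
    ("10.0.0.6", 80, "tcp"),
]
sentences = auto_defined_sentences(packets, top_n=2)
print(sentences)
```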

In [None]:
# Specify models parameter
if demonstrative:
    Cs = [5]
    Vs = [10]
    EPOCHS = 1
else:
    Cs = [5, 25, 50, 75]
    Vs = [50, 100, 150, 200]
    EPOCHS = 20
METHODOLOGY = 'darkvec'
DAYS = 30
LANGUAGE = 'auto'

for C in Cs:
    for V in Vs:
        model_name = f'auto_c{C}_v{V}_iter{EPOCHS}'
        model_path_name = f'{MODELS}/{model_name}.model'

        # Define the path for streaming training
        x = PathLineSentences(f'{CORPUS}/{METHODOLOGY}{DAYS}{LANGUAGE}', max_sentence_length=100000000);
        # Init a new model
        model = Word2Vec(min_count = 10, workers = multiprocessing.cpu_count(),
                         sg = 1, size = V, window = C, sample = 0, seed = 15)
        # Build the vocabulary
        model.build_vocab(sentences=x)
        vocab_size = len(model.wv.vocab)
        tot_examples = model.corpus_count
        # Train the model
        model.train(x, total_examples=tot_examples, epochs=EPOCHS, 
                    queue_factor=2, report_delay=0.0)
        if not demonstrative:
            model.save(model_path_name)

### <b>Single Language</b> <a name="single"></a>


DarkVec training over 30 days of traffic. Single language means that all IPs form one sequence, in the order in which their packets reached the darknet.

In [None]:
# Specify models parameter
if demonstrative:
    # This cell trains a single model, so set the scalars C and V directly
    C = 5
    V = 10
    EPOCHS = 1
else:
    C = 75
    V = 50
    EPOCHS = 20
METHODOLOGY = 'darkvec'
DAYS = 30
LANGUAGE = 'single'

model_name = f'single_c{C}_v{V}_iter{EPOCHS}'
model_path_name = f'{MODELS}/{model_name}.model'

# Define the path for streaming training
x = PathLineSentences(f'{CORPUS}/{METHODOLOGY}{DAYS}{LANGUAGE}', max_sentence_length=100000000);
# Init a new model
model = Word2Vec(min_count = 10, workers = multiprocessing.cpu_count(),
                 sg = 1, size = V, window = C, sample = 0, seed = 15)
# Build the vocabulary
model.build_vocab(sentences=x)
vocab_size = len(model.wv.vocab)
tot_examples = model.corpus_count
# Train the model
model.train(x, total_examples=tot_examples, epochs=EPOCHS, 
            queue_factor=2, report_delay=0.0)
if not demonstrative:
    model.save(model_path_name)

# <b>DarkVec 5 days</b>  <a name="darkvec5"></a>


Model training for the state-of-the-art comparison. Here DarkVec is trained on the last 5 days with the per-service language definition.

In [None]:
# Specify models parameter
if demonstrative:
    # This cell trains a single model, so set the scalars C and V directly
    C = 5
    V = 10
    EPOCHS = 1
else:
    C = 25
    V = 50
    EPOCHS = 20
METHODOLOGY = 'darkvec'
DAYS = 5
LANGUAGE = 'xserv'

# Keep the `_c` prefix so the name matches `fivedays_c25_v50_iter20`
model_name = f'fivedays_c{C}_v{V}_iter{EPOCHS}'
model_path_name = f'{MODELS}/{model_name}.model'

# Define the path for streaming training
x = PathLineSentences(f'{CORPUS}/{METHODOLOGY}{DAYS}{LANGUAGE}', max_sentence_length=100000000);
# Init a new model
model = Word2Vec(min_count = 10, workers = multiprocessing.cpu_count(),
                 sg = 1, size = V, window = C, sample = 0, seed = 15)
# Build the vocabulary
model.build_vocab(sentences=x)
vocab_size = len(model.wv.vocab)
tot_examples = model.corpus_count
# Train the model
model.train(x, total_examples=tot_examples, epochs=EPOCHS, 
            queue_factor=2, report_delay=0.0)
if not demonstrative:
    model.save(model_path_name)

# <b>DANTE Training</b> <a name="dante"></a>


Word2Vec model training. The model is implemented following the original DANTE paper.

In [None]:
# Specify models parameter
if demonstrative:
    # This cell trains a single model, so set the scalars C and V directly
    C = 5
    V = 10
    EPOCHS = 1
else:
    C = 25
    V = 50
    EPOCHS = 10
DAYS = 5
METHODOLOGY = 'dante'

model_name = f'{METHODOLOGY}_c{C}_v{V}_iter{EPOCHS}'
model_path_name = f'{MODELS}/{model_name}.model'

# Define the path for streaming training
x = PathLineSentences(f'{CORPUS}/{METHODOLOGY}{DAYS}', max_sentence_length=100000000);
# Init a new model
model = Word2Vec(min_count = 0, workers = multiprocessing.cpu_count(),
                 sg = 1, size = V, window = C, sample = 0, seed = 15)
# Build the vocabulary
model.build_vocab(sentences=x)
vocab_size = len(model.wv.vocab)
tot_examples = model.corpus_count
# Train the model
model.train(x, total_examples=tot_examples, epochs=EPOCHS, 
            queue_factor=2, report_delay=0.0)
if not demonstrative:
    model.save(model_path_name)

    Excerpt of the DANTE logs during corpus loading:

    [...]
    INFO : PROGRESS: at sentence #104260000, processed 7199935012 words, keeping 42135 word types
    INFO : PROGRESS: at sentence #104270000, processed 7200855910 words, keeping 42135 word types
    INFO : PROGRESS: at sentence #104280000, processed 7201857738 words, keeping 42135 word types
    [...]

    Due to the huge number of samples, we were able to finish neither the data loading nor the training.

# <b>IP2VEC Training</b> <a name="ip2vec"></a>  



Keras-based model training. The model is implemented following the original IP2VEC paper.

In [None]:
def extract_corpus(data, w2v):
    """Extract the IP2VEC corpus

    Parameters
    ----------
    data : numpy.ndarray
        dataset
    w2v : dict
        word to token index lookup

    Returns
    -------
    list
        tokens constituting the corpus
    """
    corpus = [[w2v[w] for w in ww]  for ww in data]
    return corpus

In [None]:
# Load Corpus
if demonstrative:
    files = glob(f'{CORPUS}/ip2vec5/*.npz')[:2]
else:
    files = glob(f'{CORPUS}/ip2vec5/*.npz')
# Get target words
x = np.concatenate([np.load(a)['x'] for a in files])
# Get context words
y = np.concatenate([np.load(a)['y'] for a in files])
merged = set(x).union(set(y))
# Tokenize distinct IPs
v2w = {idx: word for idx, word in enumerate(sorted(merged))}  # token index -> word
w2v = {word: idx for idx, word in v2w.items()}                # word -> token index
# Merge target words and context words
data = np.array([[target, context] for target, context in zip(x, y)])
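
The tokenization step can be followed on a toy example. The target and context values below are made up for illustration, and the names `index_to_word` and `word_to_index` are assumptions standing in for `v2w` and `w2v`.

```python
# Toy target and context words (hypothetical values)
targets = ["10.0.0.2", "10.0.0.1", "10.0.0.2"]
contexts = ["23", "tcp", "22"]

# One integer token per distinct word, assigned in sorted order
vocabulary = sorted(set(targets) | set(contexts))
index_to_word = dict(enumerate(vocabulary))
word_to_index = {word: idx for idx, word in index_to_word.items()}

# Each (target, context) pair becomes a pair of integer tokens
pairs = [[word_to_index[t], word_to_index[c]] for t, c in zip(targets, contexts)]
print(word_to_index)
print(pairs)
```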

In [None]:
# Retrieve the full corpus
corpus = pd.DataFrame(extract_corpus(data, w2v)).to_numpy()
# Number of negative samples: 10% of the corpus size
ns = int(10*corpus.shape[0]/100)
clist = corpus.tolist()
clist = {f'{a[0]}_{a[1]}' for a in clist}
negative_x = []
negative_y = []
negative = set()
cnt = 0
while cnt < ns:
    idx1 = np.random.randint(0, corpus.shape[0])
    idx2 = np.random.randint(0, corpus.shape[0])
    pair = f'{corpus[idx1, 0]}_{corpus[idx2, 1]}'
    # A negative sample must be a (target, context) pair never observed in the corpus
    if pair not in clist and pair not in negative:
        negative.add(pair)
        negative_x.append(corpus[idx1, 0])
        negative_y.append(corpus[idx2, 1])
        cnt+=1
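
A minimal sketch of negative sampling on toy data, assuming that a negative is a (target, context) pair that does not occur among the observed pairs. The helper `sample_negatives` and its `max_tries` guard against an endless loop are hypothetical additions.

```python
import random


def sample_negatives(positives, n, seed=0, max_tries=100_000):
    """Draw up to n distinct (target, context) pairs absent from `positives`."""
    rng = random.Random(seed)
    observed = set(positives)
    targets = [t for t, _ in positives]
    contexts = [c for _, c in positives]
    negatives = set()
    tries = 0
    while len(negatives) < n and tries < max_tries:
        # Combine a random observed target with a random observed context
        pair = (rng.choice(targets), rng.choice(contexts))
        if pair not in observed:
            negatives.add(pair)
        tries += 1
    return sorted(negatives)


positives = [(0, 10), (1, 11), (2, 12), (3, 13)]
negatives = sample_negatives(positives, n=2)
print(negatives)
```

Storing pairs as tuples in a set keeps the membership test constant-time on average, which matters when the corpus is large.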

In [None]:
# Final IP2VEC training dataset
neg = pd.DataFrame([zipped for zipped in zip(negative_x, negative_y)], columns=['target', 'context'])
neg['label'] = 0
pos = pd.DataFrame(corpus, columns=['target', 'context'])
pos['label'] = 1
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
data = shuffle(pd.concat([neg, pos])).reset_index(drop=True)
data.head()

In [None]:
# Define IP2VEC architecture
input_target = Input((1,))
input_context = Input((1,))

embedding = Embedding(len(w2v), 32, input_length=1, name='embedding')
target = embedding(input_target)
target = Reshape((32, 1))(target)
context = embedding(input_context)
context = Reshape((32, 1))(context)

dot_product = Dot(axes=1)([target, context])
dot_product = Reshape((1,))(dot_product)

output = Dense(1, activation='sigmoid')(dot_product)

model = Model([input_target, input_context], output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.summary()

embedder = Model(input_target, Reshape((32,))(target))
embedder.summary()
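
The architecture above scores a (target, context) pair as the sigmoid of a scaled dot product between two rows of the shared embedding matrix. The pure-Python sketch below reproduces that computation for hand-picked 2-dimensional vectors. The `weight` and `bias` arguments stand in for the parameters of the final `Dense(1)` layer and are set to arbitrary values here.

```python
import math


def score(target_vec, context_vec, weight=1.0, bias=0.0):
    """Sigmoid of a scaled dot product between two embeddings."""
    dot = sum(t * c for t, c in zip(target_vec, context_vec))
    return 1.0 / (1.0 + math.exp(-(weight * dot + bias)))


# Aligned embeddings score above 0.5, opposite ones below 0.5
similar = score([0.5, 1.0], [0.5, 1.0])
opposite = score([0.5, 1.0], [-0.5, -1.0])
print(round(similar, 3), round(opposite, 3))
```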

In [None]:
# Train the model
x1 = data.target.values
x2 = data.context.values
y = data.label

if demonstrative:
    model.fit([x1, x2], y, batch_size=1024, epochs=1)
else:
    model.fit([x1, x2], y, batch_size=1024, epochs=10)

if not demonstrative:
    # f-string so that {MODELS} is replaced by the configured path
    embedder.save(f'{MODELS}/ip2vec5embedder')
    print('Model Saved')