# Exploring Temporal GNN Embeddings for Darknet Traffic Analysis
## Embeddings Generation
___

## Table of Contents
1. NLP Embeddings: i-DarkVec
2. GNN Embeddings Without Node Features
3. GNN Embeddings With Node Features

This notebook contains the main code to (i) produce NLP embeddings through i-DarkVec; (ii) produce (t)GNN embeddings without node features; (iii) produce (t)GNN embeddings with node features.

The tested GNNs are i-GCN, GCN-GRU, and i-GCN-GRU.

In [1]:
from glob import glob
import pandas as pd
import pickle
import torch
import json
import sys
sys.path.append('../')

## 1. NLP Embeddings: i-DarkVec

This code demonstrates the process of training and updating word embeddings models using daily text data. The models learn and represent the meanings of words in a way that facilitates various NLP tasks and understanding textual data across different time periods.

1. **Model Initialization:** The code begins by initializing a word embeddings model, referred to as `word2vec`. The model is configured with specific parameters, including a context window size (`c`), embedding dimensionality (`e`), number of training epochs (`epochs`), and a random seed (`seed`).

2. **Processing Multiple Days' Data:** The code then enters a loop to process data from multiple days. This loop iterates over files located in the '../data/raw/' directory.

3. **Extracting Day Information:** For each file, the code extracts the corresponding day from the filename. This day information is important for identifying and processing data from different time periods.

4. **Loading Corpus Data:** Within the loop, the code loads a corpus of text data associated with the current day. The corpus data is stored in files with filenames like 'corpus_YYYYMMDD.pkl' and is read using the `pickle.load()` function.

5. **Training or Updating the Model:** On the first day (`20211201`), the initialized model (`word2vec`) is trained from scratch on that day's corpus.

    On every later day, the code updates the already trained model (`word2vec`) with the current day's corpus, so the embeddings are refined as new data arrives.

6. **Retrieving and Saving Embeddings:** After training or updating the model, the code retrieves the word embeddings learned by `word2vec`. These embeddings capture the semantic relationships between words in the corpus. The embeddings are then saved to a CSV file named 'idarkvec_embeddings_YYYYMMDD.csv' to be used for further natural language processing (NLP) tasks or analysis.
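
The control flow of steps 2 to 5 can be sketched without the model itself. This is a minimal sketch, assuming raw files are named `raw_YYYYMMDD.csv` (as the string replacements in the cell below suggest). `extract_day` and `plan_steps` are hypothetical helpers, not part of `src`. Unlike the notebook, the sketch treats the first sorted day as the training day instead of hard-coding a date.

```python
from pathlib import Path


def extract_day(path):
    """Return the YYYYMMDD part of a path like '../data/raw/raw_20211201.csv'."""
    stem = Path(path).stem  # file name without directory and extension
    return stem.replace('raw_', '')


def plan_steps(paths):
    """Pair each day with the action the loop would take for it."""
    days = sorted(extract_day(p) for p in paths)
    # The first day trains from scratch; every later day updates the model
    return [(day, 'train' if pos == 0 else 'update') for pos, day in enumerate(days)]


steps = plan_steps(['../data/raw/raw_20211202.csv', '../data/raw/raw_20211201.csv'])
print(steps)
```

Deriving the training day from the sorted file list keeps the loop valid if the dataset starts on a different date.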

In [4]:
from src.models.nlp import iWord2Vec
from tqdm.notebook import tqdm_notebook as tqdm

# Initialize the model
word2vec = iWord2Vec(c=5, e=128, epochs=1, seed=15)

for path in tqdm(sorted(glob('../data/raw/*'))):
    # Extract day
    day = path.split('/')[-1].replace('.csv', '').replace('raw_', '')
    
    # Load the corpus (the handle gets its own name so it does not shadow the loop variable)
    with open(f'../data/corpus/corpus_{day}.pkl', 'rb') as handle:
        corpus = pickle.load(handle)
    
    if day == '20211201':
        # Train the initialized model
        word2vec.train(corpus)
    else:
        # Update the pre-trained model
        word2vec.update(corpus)
    
    # Retrieve the embeddings and save them
    embeddings = word2vec.get_embeddings()
    embeddings.to_csv(f'../data/nlp_embeddings/idarkvec_embeddings_{day}.csv')


## 2. GNN Embeddings Without Node Features

This code segment focuses on essential data loading and preprocessing steps for graph analysis, including host node mapping, feature conversion, and adjacency matrix generation.

1. **Loading Host Node Lookup:** The code loads a lookup dictionary for host nodes from a JSON file. This dictionary maps IP addresses to nodes in a graph.

2. **Loading and Converting Traffic Features:** A list of traffic feature files is loaded and converted into PyTorch tensors. **Note:** in this section the features are not passed to the models; they are only used to obtain the number of nodes.

3. **Generating Weighted Adjacency Matrices:** Weighted adjacency matrices are generated from graph data files. These matrices represent connections between nodes in the network.

4. **Obtaining Node and Feature Counts:** The code determines the number of nodes and features in the data, which is essential for graph analysis and modeling.
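
Steps 1 and 3 can be illustrated on toy data. This is a minimal sketch under two assumptions that the notebook does not confirm: that `ip_lookup.json` maps IP strings to integer node ids, and that each graph file holds one `source destination weight` edge per line. `weighted_adjacency` is a hypothetical stand-in for `generate_adjacency_matrices`, whose actual behavior may differ (for example, it may symmetrize the matrix).

```python
import json


def weighted_adjacency(edge_lines, n_nodes):
    """Build a dense n_nodes x n_nodes matrix from 'src dst weight' lines (assumed format)."""
    matrix = [[0.0] * n_nodes for _ in range(n_nodes)]
    for line in edge_lines:
        src, dst, weight = line.split()
        # Repeated edges accumulate their weights
        matrix[int(src)][int(dst)] += float(weight)
    return matrix


# Toy lookup standing in for ip_lookup.json: IP address -> node id
ip_lookup = json.loads('{"10.0.0.1": 0, "10.0.0.2": 1}')
# Inverted mapping, as in the notebook: node id -> IP address
reverse_lookup = {node: ip for ip, node in ip_lookup.items()}

adjacency = weighted_adjacency(['0 1 3', '1 0 2', '0 1 1'], n_nodes=2)
print(reverse_lookup[1], adjacency)
```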

In [3]:
from src.preprocessing.gnn import generate_adjacency_matrices
from tqdm.notebook import tqdm_notebook as tqdm

# Number of days to use as history for tGNN
HISTORY = 5
# Number of training days for GCN-GRU
TRAIN = 20

# Load saved lookup dictionaries for host nodes
with open(f'../data/graph/ip_lookup.json', 'r') as file:
    ip_lookup = json.loads(file.read())
    reverse_lookup = {v:k for k,v in ip_lookup.items()}
    ip_nodes = len(reverse_lookup)

# Load traffic features and convert them to torch tensors
flist = [x for x in sorted(glob(f'../data/features/*'))]
features = []
for file in tqdm(flist, desc='Loading features'):
    feat = pd.read_csv(file, index_col=[0]).sort_index()
    feat = torch.tensor(feat.to_numpy())
    features.append(feat)

    
# Generate the weighted adjacency matrices
flist = [x for x in sorted(glob(f'../data/graph/*')) if '.txt' in x]
X = generate_adjacency_matrices(flist, weighted=True)

# Get number of nodes and features
n_nodes, n_features = features[0].shape


### i-GCN

This code segment focuses on training a Graph Neural Network model and extracting embeddings from snapshots of data. The embeddings capture important features of the graph structure and can be utilized for downstream tasks or analysis.


1. **GNN Model Configuration:** A GNN model (`igcn`) is configured with various parameters, including the number of nodes (`n_nodes`), GCN layers, input and output sizes, embedding size, and other hyperparameters for training.

2. **Training Loop:** The code enters a loop to iterate through each snapshot in `X`. For each snapshot the GNN model is trained or updated using the snapshot data. The embeddings are retrieved from the trained model, the index of the embeddings is adjusted to match IP nodes and the embeddings are saved to a CSV file with a name indicating the day.
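
The bookkeeping at the end of each iteration (IP index and day-stamped file name) can be sketched as follows. `embedding_path` and `ip_index` are hypothetical helpers introduced for illustration; the sketch assumes, as the notebook does, that snapshot `0` corresponds to 1 December 2021.

```python
def embedding_path(model_name, snapshot_index, prefix='../data/gnn_embeddings'):
    """Map a 0-based snapshot index to the December 2021 file name used in the notebook."""
    day = f'{snapshot_index + 1:02d}'  # zero-pad the 1-based day to two digits
    return f'{prefix}/embeddings_{model_name}_202112{day}.csv'


def ip_index(reverse_lookup, ip_nodes):
    """List the IP address of each host node, in node-id order."""
    return [reverse_lookup[node] for node in range(ip_nodes)]


print(embedding_path('igcn', 0))
print(ip_index({0: '10.0.0.1', 1: '10.0.0.2'}, 2))
```

The `02d` format specifier produces the same names as an explicit check for days below 10, in one expression.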

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from src.models.gnn import GCN

epochs=5 # Set training epochs
pbar = tqdm(total=len(X)*epochs) # Initialize progress bar

# Initialize the model
igcn = GCN(
    n_nodes=n_nodes, 
    ns=1, 
    gcn_layers=2, 
    input_size=n_nodes, 
    gcn_units=1024, 
    gcn_output_size=512,
    embedding_size=128, 
    predictor_units=64, 
    dropout=.0, 
    lr=1e-3, 
    cuda=False, 
    epochs=epochs
)

# Incremental training
for i in range(len(X)):
    # Get current snapshot
    X_train = X[i]
    
    # Train/Update the model
    igcn.fit(X_train, pbar=pbar, day=i)

    # Retrieve embeddings
    embeddings = igcn.get_embeddings(X_train)[:ip_nodes]

    # Adjust index with IP nodes
    new_index = [reverse_lookup[x] for x in range(ip_nodes)]
    embeddings = pd.DataFrame(embeddings, index=new_index)
    
    # Manage day name for saving (zero-padded to two digits)
    day = f'{i+1:02d}'
        
    # Save the embeddings
    ename = f'../data/gnn_embeddings/embeddings_igcn_202112{day}.csv'
    embeddings.to_csv(ename)

### GCN-GRU

This code segment focuses on training a GNN-GRU model for graph data and extracting embeddings from sequential snapshots, which can be utilized for various downstream tasks or analysis.

1. **Model Initialization:** The GNN model with a Gated Recurrent Unit (GRU), named `gcngru`, is initialized. It is configured with various parameters, including the number of nodes (`n_nodes`), history length (`HISTORY`), GCN layers, input and output sizes, embedding size, and training hyperparameters.

2. **Standard Model Training:** The model is trained in a standard way using the initial snapshot data (`X[:TRAIN]`) with a specified number of epochs.

3. **Embeddings Extraction:** The code loops over the snapshot indices from `TRAIN` to the end of the data. For each index `idx`, it takes the `HISTORY` snapshots that precede it (`x = X[idx-HISTORY:idx]`), obtains embeddings from the trained GCN-GRU model, re-indexes them by IP address, and saves them to a CSV file named after day `idx+1`.
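
To make the slicing in step 3 concrete, the sketch below lists which snapshot indices feed each saved day. `extraction_windows` is a hypothetical helper; it only reproduces the index arithmetic of the loop and does not call the model. Whether the model treats the result as the embedding of day `idx+1` is not stated in the notebook.

```python
def extraction_windows(n_snapshots, train, history):
    """Return (day, snapshot_indices) for each extraction step of the loop."""
    windows = []
    for idx in range(train, n_snapshots):
        day = idx + 1  # day used in the output file name
        window = list(range(idx - history, idx))  # same bounds as X[idx-HISTORY:idx]
        windows.append((day, window))
    return windows


windows = extraction_windows(n_snapshots=31, train=20, history=5)
print(windows[0])   # first extraction step
print(windows[-1])  # last extraction step
```

With 31 snapshots, `TRAIN = 20`, and `HISTORY = 5`, the loop produces 11 files, for days 21 to 31.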

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from src.models.gnn import GCN_GRU

epochs=50 # Set training epochs

# Initialize the model
gcngru = GCN_GRU(
    n_nodes=n_nodes, 
    history=HISTORY, 
    ns=1, 
    gcn_layers=2, 
    input_size=n_nodes, 
    gcn_units=1024, 
    gcn_output_size=512, 
    embedding_size=128, 
    predictor_units=64, 
    dropout=.0, 
    lr=1e-3, 
    early_stop=3,
    cuda=False
)

# Train the model in a standard way
gcngru.fit(X[:TRAIN], epochs=epochs)

for idx in tqdm(range(TRAIN, len(X))):
    day=idx+1
    # Get the HISTORY snapshots that precede index idx
    x = X[idx-HISTORY:idx]
    
    # Retrieve embeddings
    embeddings = gcngru.get_embeddings(x)[:ip_nodes]
    
    # Adjust index with IP nodes
    new_index = [reverse_lookup[x] for x in range(ip_nodes)]
    embeddings = pd.DataFrame(embeddings, index=new_index)
    
    # Save the embeddings
    embeddings.to_csv(f'../data/gnn_embeddings/embeddings_gcngru_202112{day}.csv')

### i-GCN-GRU

This code segment demonstrates the process of incrementally training a GNN-GRU model on sequential data, extracting embeddings for each snapshot, and storing them for potential downstream tasks or analysis.

1. **Model Initialization:** The GNN model with a Gated Recurrent Unit (GRU), named `igcngru`, is initialized. It is configured with various parameters, including the number of nodes (`n_nodes`), history length (`HISTORY`), GCN layers, input and output sizes, embedding size, and training hyperparameters.

2. **Incremental Training Loop:** The code loops over the data to train the GNN-GRU model incrementally. At each iteration, a window of `HISTORY + 1` consecutive snapshots (`X_train`) is selected, covering the current snapshot and the `HISTORY` snapshots before it. The model is trained or updated on this window, and the embeddings are then retrieved for the same window. Finally, the embeddings are re-indexed by IP address and saved to a CSV file named after the day of the current snapshot.
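
The window arithmetic of the incremental loop can be checked in isolation. `incremental_windows` is a hypothetical helper that mirrors the slice `X[i:HISTORY+1+i]` and the day label used in the file name; it is a sketch of the indexing only, not of the training.

```python
def incremental_windows(n_snapshots, history):
    """Return (day_label, snapshot_indices) for each incremental training step."""
    steps = []
    for i in range(n_snapshots - history):
        window = list(range(i, history + 1 + i))  # same bounds as X[i:HISTORY+1+i]
        day = f'{history + 1 + i:02d}'            # 1-based day of the last snapshot in the window
        steps.append((day, window))
    return steps


steps = incremental_windows(n_snapshots=31, history=5)
print(steps[0])
print(steps[-1])
```

With 31 snapshots and `HISTORY = 5`, the first embeddings are saved for day 06 and the last for day 31, so no embeddings exist for days 01 to 05.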

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from src.models.gnn import IncrementalGcnGru

epochs=5 # Set training epochs
pbar = tqdm(total=(len(X)-HISTORY)*epochs) # Initialize progress bar

# Initialize the model
igcngru = IncrementalGcnGru(
    n_nodes=n_nodes, 
    history=HISTORY, 
    ns=1, 
    gcn_layers=2, 
    input_size=n_nodes, 
    gcn_units=1024, 
    gcn_output_size=512, 
    embedding_size=128, 
    predictor_units=64, 
    dropout=.0, 
    lr=1e-3, 
    cuda=False, 
    epochs=epochs
)

# Incremental training
for i in range(len(X)-HISTORY):
    # Get the current snapshot together with the HISTORY snapshots before it
    X_train = X[i:HISTORY+1+i]
    
    # Train/Update the model
    igcngru.fit(X_train, pbar=pbar, day=i)

    # Retrieve embeddings
    embeddings = igcngru.get_embeddings(X_train)[:ip_nodes]

    # Adjust index with IP nodes
    new_index = [reverse_lookup[x] for x in range(ip_nodes)]
    embeddings = pd.DataFrame(embeddings, index=new_index)
    
    # Manage day name for saving (zero-padded to two digits)
    day = f'{HISTORY+1+i:02d}'
        
    # Save the embeddings
    ename = f'../data/gnn_embeddings/embeddings_igcngru_202112{day}.csv'
    embeddings.to_csv(ename)

## 3. GNN Embeddings With Node Features

This code segment focuses on essential data loading and preprocessing steps for graph analysis, including host node mapping, feature conversion, and adjacency matrix generation.

1. **Loading Host Node Lookup:** The code loads a lookup dictionary for host nodes from a JSON file. This dictionary maps IP addresses to nodes in a graph.

2. **Loading and Converting Traffic Features:** A list of traffic feature files is loaded and converted into PyTorch tensors. In this section the tensors are passed to the models as node features.

3. **Generating Weighted Adjacency Matrices:** Weighted adjacency matrices are generated from graph data files. These matrices represent connections between nodes in the network.

4. **Obtaining Node and Feature Counts:** The code determines the number of nodes and features in the data, which is essential for graph analysis and modeling.
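
Because the models in this section consume both an adjacency matrix and a feature matrix per day, it can help to verify that the two agree before training. The following is a minimal sketch, assuming each feature matrix has one row per graph node (the notebook takes `n_nodes` from `features[0].shape`, which suggests this but does not guarantee it). `check_alignment` is a hypothetical helper operating on plain nested lists.

```python
def check_alignment(adjacency, features):
    """Return (n_nodes, n_features) or raise ValueError if the shapes disagree."""
    n_nodes = len(adjacency)
    if any(len(row) != n_nodes for row in adjacency):
        raise ValueError('adjacency matrix must be square')
    if len(features) != n_nodes:
        raise ValueError(f'expected {n_nodes} feature rows, got {len(features)}')
    widths = {len(row) for row in features}
    if len(widths) != 1:
        raise ValueError('feature rows have different lengths')
    return n_nodes, widths.pop()


shape = check_alignment([[0, 1], [1, 0]], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(shape)
```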

In [None]:
from glob import glob
from src.preprocessing.gnn import generate_adjacency_matrices
from src.models.gnn import GCN_GRU
from src.models.gnn import IncrementalGcnGru
from tqdm.notebook import tqdm_notebook as tqdm
import pandas as pd
import torch
import json

# Number of days to use as history for tGNN
HISTORY = 5
# Number of training days for GCN-GRU
TRAIN = 20

# Load saved lookup dictionaries for host nodes
with open('../data/graph/ip_lookup.json', 'r') as file:
    ip_lookup = json.loads(file.read())
    reverse_lookup = {v:k for k,v in ip_lookup.items()}
    ip_nodes = len(reverse_lookup)

# Load traffic features and convert them to torch tensors
flist = [x for x in sorted(glob('../data/features/*'))]
features = []
for file in tqdm(flist, desc='Loading features'):
    feat = pd.read_csv(file, index_col=[0]).sort_index()
    feat = torch.tensor(feat.to_numpy())
    features.append(feat)

    
# Generate the weighted adjacency matrices
flist = [x for x in sorted(glob('../data/graph/*')) if '.txt' in x]
X = generate_adjacency_matrices(flist, weighted=True)

# Get number of nodes and features
n_nodes, n_features = features[0].shape

### i-GCN

This code segment focuses on training a Graph Neural Network model and extracting embeddings from snapshots of data. The embeddings capture important features of the graph structure and can be utilized for downstream tasks or analysis.


1. **GNN Model Configuration:** A GNN model (`igcn`) is configured with various parameters, including the number of nodes (`n_nodes`), GCN layers, input and output sizes, embedding size, and other hyperparameters for training.

2. **Training Loop:** The code enters a loop to iterate through each snapshot in `X`. For each snapshot the GNN model is trained or updated using the snapshot data. The embeddings are retrieved from the trained model, the index of the embeddings is adjusted to match IP nodes and the embeddings are saved to a CSV file with a name indicating the day.
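
For readers unfamiliar with how a GCN layer combines the adjacency matrix with node features, the sketch below implements the widely used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W) in plain Python. It is an illustration only: the `GCN` class in `src.models.gnn` may use a different normalization or activation, and the toy matrices and the helper names (`normalize_adjacency`, `matmul`, `gcn_layer`) are assumptions made for this example.

```python
import math


def normalize_adjacency(adjacency):
    """Compute D^-1/2 (A + I) D^-1/2 for a dense, non-negative adjacency matrix."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)] for i in range(n)]


def matmul(left, right):
    """Multiply two dense matrices given as nested lists."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)] for row in left]


def gcn_layer(adjacency, features, weights):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(normalize_adjacency(adjacency), features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


adjacency = [[0.0, 1.0], [1.0, 0.0]]   # two connected nodes
features = [[1.0, 2.0], [3.0, 4.0]]    # n_nodes x n_features
weights = [[1.0], [0.5]]               # n_features x output size
print(gcn_layer(adjacency, features, weights))
```

In this rule the first dimension of `weights` equals the number of feature columns, which matches `input_size=n_features` in the cell below. Setting `input_size=n_nodes` in Section 2 would be consistent with feeding one-hot (identity) features, but that is an assumption about the implementation.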

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from src.models.gnn import GCN

epochs=1 # Set training epochs
pbar = tqdm(total=len(X)*epochs) # Initialize progress bar

# Initialize the model
igcn = GCN(
    n_nodes=n_nodes, 
    ns=1, 
    gcn_layers=2, 
    input_size=n_features, 
    gcn_units=1024, 
    gcn_output_size=512,
    embedding_size=128, 
    predictor_units=64, 
    dropout=.0, 
    lr=1e-3, 
    cuda=False, 
    epochs=epochs
)

# Incremental training
for i in range(len(X)):
    # Get current snapshot
    X_train, features_train = X[i], features[i]
    
    # Train/Update the model
    igcn.fit(X_train, features=features_train, pbar=pbar, day=i)

    # Retrieve embeddings
    embeddings = igcn.get_embeddings(X_train, features_train)[:ip_nodes]

    # Adjust index with IP nodes
    new_index = [reverse_lookup[x] for x in range(ip_nodes)]
    embeddings = pd.DataFrame(embeddings, index=new_index)
    
    # Manage day name for saving (zero-padded to two digits)
    day = f'{i+1:02d}'
        
    # Save the embeddings
    ename = f'../data/gnn_embeddings/embeddings_igcn_features_202112{day}.csv'
    embeddings.to_csv(ename)

### GCN-GRU

This code segment focuses on training a GNN-GRU model for graph data and extracting embeddings from sequential snapshots, which can be utilized for various downstream tasks or analysis.

1. **Model Initialization:** The GNN model with a Gated Recurrent Unit (GRU), named `gcngru`, is initialized. It is configured with various parameters, including the number of nodes (`n_nodes`), history length (`HISTORY`), GCN layers, input and output sizes, embedding size, and training hyperparameters.

2. **Standard Model Training:** The model is trained in a standard way using the initial snapshot data (`X[:TRAIN]`) and node features (`features[:TRAIN]`) with a specified number of epochs.

3. **Embeddings Extraction:** The code loops over the snapshot indices from `TRAIN` to the end of the data. For each index `idx`, it takes the `HISTORY` snapshots that precede it (`x = X[idx-HISTORY:idx]`) together with the matching node features (`feature`), obtains embeddings from the trained GCN-GRU model, re-indexes them by IP address, and saves them to a CSV file named after day `idx+1`.

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from src.models.gnn import GCN_GRU

epochs=50 # Set training epochs

# Initialize the model
gcngru = GCN_GRU(
    n_nodes=n_nodes, 
    history=HISTORY, 
    ns=1, 
    gcn_layers=2, 
    input_size=n_features, 
    gcn_units=1024, 
    gcn_output_size=512, 
    embedding_size=128, 
    predictor_units=64, 
    dropout=.0, 
    lr=1e-3, 
    early_stop=3,
    cuda=False
)

# Train the model in a standard way
gcngru.fit(X[:TRAIN], epochs=epochs, features=features[:TRAIN])

for idx in tqdm(range(TRAIN, len(X))):
    day=idx+1
    
    # Get the HISTORY snapshots that precede index idx, with their node features
    x = X[idx-HISTORY:idx]
    feature = features[idx-HISTORY:idx]
    
    # Retrieve embeddings
    embeddings = gcngru.get_embeddings(x, feature)[:ip_nodes]
    
    # Adjust index with IP nodes
    new_index = [reverse_lookup[x] for x in range(ip_nodes)]
    embeddings = pd.DataFrame(embeddings, index=new_index)
    
    # Save the embeddings
    embeddings.to_csv(f'../data/gnn_embeddings/embeddings_gcngru_features_202112{day}.csv')

### i-GCN-GRU

This code segment demonstrates the process of incrementally training a GNN-GRU model on sequential data, extracting embeddings for each snapshot, and storing them for potential downstream tasks or analysis.

1. **Model Initialization:** The GNN model with a Gated Recurrent Unit (GRU), named `igcngru`, is initialized. It is configured with various parameters, including the number of nodes (`n_nodes`), history length (`HISTORY`), GCN layers, input and output sizes, embedding size, and training hyperparameters.

2. **Incremental Training Loop:** The code loops over the data to train the GNN-GRU model incrementally. At each iteration, a window of `HISTORY + 1` consecutive snapshots (`X_train`) and their node features (`features_train`) is selected, covering the current snapshot and the `HISTORY` snapshots before it. The model is trained or updated on this window, and the embeddings are then retrieved for the same window. Finally, the embeddings are re-indexed by IP address and saved to a CSV file named after the day of the current snapshot.

In [None]:
from tqdm.notebook import tqdm_notebook as tqdm
from src.models.gnn import IncrementalGcnGru

epochs=1 # Set training epochs
pbar = tqdm(total=(len(X)-HISTORY)*epochs) # Initialize progress bar

# Initialize the model
igcngru = IncrementalGcnGru(
    n_nodes=n_nodes, 
    history=HISTORY, 
    ns=1, 
    gcn_layers=2, 
    input_size=n_features, 
    gcn_units=1024, 
    gcn_output_size=512, 
    embedding_size=128, 
    predictor_units=64, 
    dropout=.0, 
    lr=1e-3, 
    cuda=False, 
    epochs=epochs
)

# Incremental training
for i in range(len(X)-HISTORY):
    # Get the current snapshot and the HISTORY snapshots before it, with their node features
    X_train, features_train = X[i:HISTORY+1+i], features[i:HISTORY+1+i]

    # Train/Update the model
    igcngru.fit(X_train, features=features_train, pbar=pbar, day=i)

    # Retrieve embeddings
    embeddings = igcngru.get_embeddings(X_train, features_train)[:ip_nodes]

    # Adjust index with IP nodes
    new_index = [reverse_lookup[x] for x in range(ip_nodes)]
    embeddings = pd.DataFrame(embeddings, index=new_index)

    # Manage day name for saving (zero-padded to two digits)
    day = f'{HISTORY+1+i:02d}'
        
    # Save the embeddings
    ename = f'../data/gnn_embeddings/embeddings_igcngru_features_202112{day}.csv'
    embeddings.to_csv(ename)