# Exploring Temporal GNN Embeddings for Darknet Traffic Analysis
## Dataset Characterization
___

## Table of Contents
1. Loading and Processing Raw Trace Data
2. Total statistics
3. Average Daily Statistics
4. Graph Statistics

This notebook contains the main code to characterize both the filtered dataset (in total and on a daily basis) and the resulting temporal graph.

In [16]:
import json
import pandas as pd
from glob import glob 
from tqdm.notebook import tqdm_notebook as tqdm

import sys
sys.path.append('../')

from src.preprocessing import *

# Total snapshots of the collection
TOT_DAYS = 31 
# Drop source hosts sending fewer than FILTER packets per snapshot
FILTER = 5 
# Generate the corpora keeping the top TOP_PORTS daily ports +1 as languages
TOP_PORTS = 2500 

## 1. Loading and Processing Raw Trace Data

This code loads, processes, and stores raw trace data for multiple days, with each day's data organized by hour and aggregated into a daily dataframe. Filters can be applied to refine the data further before storage.

1. **Load Raw Trace Files:** The code retrieves a list of raw trace files using the `glob` function.

2. **Process Hourly Traces:** For each day, the code iterates over that day's 24 hourly trace files, taken as consecutive slices of the sorted file list. This assumes that the filenames sort chronologically and that no hourly file is missing. For each hourly trace file:
     - The day is extracted from the filename as the 8 characters following `trace_`.
     - The `load_single_file` function is called to load the trace data from the current hourly file for the specified day.
     - The loaded dataframe is appended to the `dfs` list, which collects the data for all hours of the current day.

3. **Daily Dataframe:** After processing all hourly files for the current day, the code concatenates the dataframes in the `dfs` list along the rows axis (`axis=0`) to create a daily dataframe, denoted as `df`. This daily dataframe contains data for the entire day, with data from each hour stacked on top of one another.

4. **Packets Filtering:** The code calls `apply_packets_filter` on the daily dataframe `df` with the threshold `FILTER`, dropping the source hosts that send fewer than `FILTER` packets in the snapshot.

5. **Data Storage:** The processed daily dataframe (`df`) is appended to the `processed_df` list. This step ensures that data for each day is stored for subsequent analysis.
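
The implementation of `apply_packets_filter` lives in `src.preprocessing` and is not shown in this notebook. The snippet below is only a minimal sketch of what such a filter could look like, assuming one row per packet and a `src_ip` column; the function name `packets_filter_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def packets_filter_sketch(df: pd.DataFrame, min_pkts: int) -> pd.DataFrame:
    """Keep only rows whose source host sent at least `min_pkts` packets.

    Hypothetical stand-in for `apply_packets_filter`; assumes one row per packet.
    """
    # Count the rows (packets) of each source host and map the count back to every row
    pkts_per_host = df['src_ip'].map(df['src_ip'].value_counts())
    return df[pkts_per_host >= min_pkts].reset_index(drop=True)

# Toy daily dataframe: the first host sends 5 packets, the second only 2
toy = pd.DataFrame({
    'src_ip': ['10.0.0.1'] * 5 + ['10.0.0.2'] * 2,
    'dst_port': [23, 23, 2323, 80, 22, 445, 445],
})
filtered = packets_filter_sketch(toy, min_pkts=5)
print(filtered.src_ip.unique())
```

With `min_pkts=5`, only the rows of the first host survive, mirroring the role of `FILTER` in the pipeline.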

In [17]:
# Load raw trace files list, sorted so that hourly files are in chronological order
flist = sorted(glob('../data/traces/*.log.gz'))

# Initialize progress bar
pbar = tqdm(total=len(flist)+TOT_DAYS, desc='Setting up')

processed_df = []
for i in range(TOT_DAYS):
    dfs = []
    for file in sorted(flist)[24*i:24*(i+1)]:
        # Retrieve current day traces
        day = file.split('trace_')[-1][:8]
        df = load_single_file(file, day)
        dfs.append(df)
        
        # Update progress bar
        pbar.set_description(f'{day} Loading raw traces')
        pbar.update(1)
        
    # Get daily dataframe
    df = pd.concat(dfs, axis=0, ignore_index=True)
    
    # Packets filter
    pbar.set_description(f'{day} Running packets filter')
    df = apply_packets_filter(df, FILTER)
    
    processed_df.append(df)
        
    # Update progress bar
    pbar.update(1)

Setting up:   0%|          | 0/775 [00:00<?, ?it/s]

## 2. Total statistics

The following code generates label-specific statistics over all the snapshots in the collection. Namely, it extracts (i) the number of hosts, (ii) the number of contacted ports and (iii) the total number of sent packets.

In [18]:
# List of labels to keep for data filtering
to_keep = [
    'mirai', 'unk_bruteforcer', 'unk_spammer', 'shadowserver', 
    'driftnet', 'internetcensus', 'censys', 'rapid7', 'onyphe', 
    'netsystems', 'shodan', 'unk_exploiter', 'securitytrails', 
    'intrinsec']

# Load ground truth
gt = pd.read_csv(f'../data/ground_truth/ground_truth.csv')

# Concatenate processed dataframes, select relevant columns, and add a 'pkts' column
maindf = pd.concat(processed_df, axis=0, ignore_index=True)[['src_ip', 'dst_port']]
maindf['pkts'] = 1
print(f'Total: {maindf.src_ip.unique().shape[0], maindf.shape}')

# Merge ground truth data with the main dataframe and drop rows with missing values
maindf = maindf.merge(gt, on='src_ip', how='left').dropna()

# Group data by label and aggregate: distinct hosts, distinct ports, total packets
stats = maindf.groupby('label').agg({
    'src_ip': 'nunique',
    'dst_port': 'nunique',
    'pkts': 'sum'
}).loc[to_keep]
print(stats)

# Sum the per-label statistics over the selected labels
# (a port contacted by several labels is counted once per label)
print(stats.sum())

Total: (60106, (62030013, 3))
                 src_ip  dst_port      pkts
label                                      
mirai             16147      2094   1982205
unk_bruteforcer     976      9791  10191173
unk_spammer        1014     49783   4891353
shadowserver        289        42    218443
driftnet            252      9246    564854
internetcensus      271       252    213909
censys              329     65069   3400900
rapid7              344       139     60469
onyphe              115       186     39030
netsystems           45       199    226559
shodan               36      1232    320861
unk_exploiter       430        33    148210
securitytrails       18       207    107826
intrinsec            12         8      9403
src_ip         20278
dst_port      138281
pkts        22375195
dtype: int64
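
The `dst_port` total above (138281) exceeds the 65,536 possible port numbers because it is a sum of per-label distinct counts, so a port contacted by several labels is counted once for each of them. The toy example below, with made-up rows, is a small sketch of the difference between that sum and the number of globally distinct ports.

```python
import pandas as pd

# Toy labeled traffic: port 80 is contacted by both labels
toy = pd.DataFrame({
    'label': ['censys', 'censys', 'shodan', 'shodan'],
    'src_ip': ['a', 'a', 'b', 'c'],
    'dst_port': [80, 443, 80, 22],
})

# Distinct hosts and ports per label, as in the cell above
per_label = toy.groupby('label').agg(
    src_ip=('src_ip', 'nunique'),
    dst_port=('dst_port', 'nunique'),
)

summed_ports = int(per_label['dst_port'].sum())  # port 80 is counted twice
distinct_ports = int(toy['dst_port'].nunique())  # each port is counted once
print(summed_ports, distinct_ports)  # prints "4 3"
```

The host total is not affected in the same way as long as each `src_ip` carries a single label in the ground truth, which is an assumption about `ground_truth.csv` rather than something this notebook checks.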


## 3. Average Daily Statistics

The following code generates label-specific statistics averaged over the snapshots in the collection. Namely, it extracts the daily average of (i) the number of hosts, (ii) the number of contacted ports and (iii) the number of sent packets.
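
The averaging is a two-stage aggregation: statistics are first computed per label and per snapshot, and the per-snapshot values are then averaged per label. The sketch below illustrates this on a made-up dataframe; the column names follow the notebook, while the rows and the `day1`/`day2` interval values are hypothetical.

```python
import pandas as pd

# Toy traffic: one row per packet, two labels, two daily snapshots
toy = pd.DataFrame({
    'label':    ['mirai', 'mirai', 'mirai', 'mirai', 'shodan', 'shodan'],
    'interval': ['day1', 'day1', 'day2', 'day2', 'day1', 'day2'],
    'src_ip':   ['a', 'b', 'a', 'a', 'c', 'c'],
    'dst_port': [23, 23, 23, 2323, 80, 443],
})
toy['pkts'] = 1

# Stage 1: distinct hosts, distinct ports and packets per (label, snapshot)
daily = toy.groupby(['label', 'interval']).agg(
    src_ip=('src_ip', 'nunique'),
    dst_port=('dst_port', 'nunique'),
    pkts=('pkts', 'sum'),
)

# Stage 2: average the daily values for each label
avg_daily = daily.groupby(level='label').mean()
print(avg_daily)
```

Here `mirai` has 2 hosts on the first day and 1 on the second, so its average number of daily hosts is 1.5.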

In [19]:
# List of labels to keep for data filtering
to_keep = [
    'mirai', 'unk_bruteforcer', 'unk_spammer', 'shadowserver', 
    'driftnet', 'internetcensus', 'censys', 'rapid7', 'onyphe', 
    'netsystems', 'shodan', 'unk_exploiter', 'securitytrails', 
    'intrinsec']

# Load ground truth
gt = pd.read_csv(f'../data/ground_truth/ground_truth.csv')

# Concatenate processed dataframes, select relevant columns, and add a 'pkts' column
maindf = pd.concat(processed_df, axis=0, ignore_index=True)[['src_ip', 'dst_port', 'interval']]
maindf['pkts'] = 1

# Merge ground truth data with the main dataframe and drop rows with missing values
maindf1 = maindf.merge(gt, on='src_ip', how='left').dropna()

# Group data by label and snapshot and perform aggregations
tmp = maindf1.groupby(['label', 'interval']).agg({
    'src_ip': 'nunique',
    'dst_port': 'nunique',
    'pkts': 'sum'
})
# Average the daily statistics per label; grouping on the index level
# keeps the 'interval' values out of the mean
print(tmp.groupby(level='label')
         .mean()
         .loc[to_keep]
         .astype(int))

# Get 'Unknown' statistics
maindf2 = maindf.merge(gt, on='src_ip', how='left').fillna('unknown')
# Group data by snapshot only and perform aggregations
print(maindf2[~maindf2.label.isin(to_keep)]
          .groupby(['interval'])
          .agg({'src_ip': 'nunique',
                'dst_port': 'nunique',
                'pkts': 'sum'}).mean())

                 src_ip  dst_port      pkts
label                                      
mirai             16147      2094   1982205
unk_bruteforcer     976      9791  10191173
unk_spammer        1014     49783   4891353
shadowserver        289        42    218443
driftnet            252      9246    564854
internetcensus      271       252    213909
censys              329     65069   3400900
rapid7              344       139     60469
onyphe              115       186     39030
netsystems           45       199    226559
shodan               36      1232    320861
unk_exploiter       430        33    148210
securitytrails       18       207    107826
intrinsec            12         8      9403
src_ip         39828.0
dst_port       66028.0
pkts        39654818.0
dtype: float64


## 4. Graph Statistics

The following code loads the graph snapshots and summarizes their size, reporting the number of nodes and edges both in total and on average per snapshot.

1. **Load Snapshots:** Graph snapshot data is loaded from '.txt' files in the '../data/graph/' directory into a list named `traces`.

2. **Concatenate Snapshots:** All loaded snapshots are merged into a single DataFrame, `traces`.

3. **Total Edges:** The total number of edges in `traces` is computed and stored in `tot_edges`.

4. **Snapshot Analysis:** For each unique `interval` value in the DataFrame, the code counts the unique nodes and the edges in that snapshot and appends the two counts to the `nodes` and `edges` lists.

5. **Summary Statistics:** The code calculates and prints:
   - Total nodes (sum of the per-snapshot unique node counts, so a node active in several snapshots is counted once per snapshot).
   - Total edges (stored as `tot_edges`).
   - Average nodes per snapshot (mean of node counts, truncated to an integer).
   - Average edges per snapshot (mean of edge counts, truncated to an integer).
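
As an illustration of the per-snapshot counting described above, the sketch below works on a made-up edge list with the same `src`, `dst`, `weight` and `interval` columns. It takes the set union of sources and destinations, which is one possible way to count nodes: it gives the same result as concatenating the unique values whenever the two sides use disjoint identifiers, and avoids double counting otherwise. The node names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy temporal graph: two snapshots with two edges each
toy = pd.DataFrame({
    'src': ['h1', 'h2', 'h1', 'h3'],
    'dst': ['p23', 'p23', 'p80', 'p23'],
    'weight': [3, 1, 2, 5],
    'interval': [0, 0, 1, 1],
})

stats = {}
for t, snap in toy.groupby('interval'):
    # Distinct nodes: union of the source and destination identifiers
    n_nodes = int(np.union1d(snap['src'].unique(), snap['dst'].unique()).size)
    # Edges: one row per edge
    n_edges = len(snap)
    stats[int(t)] = (n_nodes, n_edges)

print(stats)  # prints "{0: (3, 2), 1: (4, 2)}"
```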

In [14]:
from glob import glob
import numpy as np
import pandas as pd

# Load generated graphs -- each file is a snapshot
traces = []
flist = [x for x in sorted(glob(f'../data/graph/*')) if '.txt' in x]
for i, fname in enumerate(flist):
    data = pd.read_csv(fname, header=None, 
                       names=["src", "dst", "weight", "label"])
    data['interval'] = i # Manage the snapshot info
    traces.append(data)
    
# Concatenate loaded graphs
traces = pd.concat(traces, ignore_index=True)

tot_edges = traces.shape[0]
nodes, edges = [], []
for t in traces.interval.unique():
    # Extract snapshots
    trace = traces[traces.interval==t]
    
    # Count the nodes and edges in each snapshot; concatenating the unique sources
    # and destinations counts an identifier twice if it appears in both columns
    node = np.hstack([trace.src.unique(), trace.dst.unique()]).shape[0] # Nodes
    edge = trace.shape[0] # Edges
    nodes.append(node)
    edges.append(edge)
    
# Calculate and print the total number of nodes, total edges, average nodes, and average edges
print(f'Total nodes: {np.sum(nodes)}')
print(f'Total edges: {tot_edges}') 
print(f'Avg. nodes per snapshot: {np.mean(nodes).astype(int)}') 
print(f'Avg. edges per snapshot: {np.mean(edges).astype(int)}')

Total nodes: 198171
Total edges: 1525152
Avg. nodes per snapshot: 6392
Avg. edges per snapshot: 49198
