# GCN for Network Anomaly Detection


This notebook introduces a method for using [**Graph Convolutional Networks**](https://tkipf.github.io/graph-convolutional-networks/) (**GCNs**) to detect anomalies in network flows.
Detecting such anomalies could help prevent or stop network attacks.

## Introduction

### Introduction to GCN

The basic idea behind a **GCN** is to aggregate, for each node, the features of its neighborhood. Repeating this operation $k$ times for a given node $U$ yields features that combine those of the $k$-hop neighborhood of $U$.

This simple yet powerful operation captures implicit topological information about the node as well as its explicit features. The resulting features can be used to determine the role of a node within the sub-graph it belongs to.

> *Increasing $k$ enlarges the sub-graph centered on $U$ that contributes to its features. However, a larger $k$ does not necessarily make the model better: it can lead to oversmoothing, where all the nodes in the graph end up with nearly identical features.*

In short, a GCN propagates node features along edges.
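
As a minimal sketch of this aggregation, the snippet below applies plain mean aggregation (no learned weights, no non-linearity) on a hypothetical 4-node path graph. The toy graph, the one-hot features, and the `aggregate` helper are assumptions made for illustration; an actual GCN layer also applies a learned linear map and an activation.

```python
import numpy as np

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 (undirected).
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Self-loops let each node keep part of its own features.
a_hat = adjacency + np.eye(4)
# Row-normalize so that one step is a mean over the node and its neighbors.
propagate = a_hat / a_hat.sum(axis=1, keepdims=True)

# One-hot features make it easy to see where information comes from.
features = np.eye(4)


def aggregate(x, k):
    """Apply k rounds of mean aggregation."""
    for _ in range(k):
        x = propagate @ x
    return x


one_hop = aggregate(features, 1)
two_hop = aggregate(features, 2)
many_hops = aggregate(features, 200)

print(one_hop[0])   # node 0 only mixes nodes 0 and 1
print(two_hop[0])   # node 0 now also receives information from node 2
print(many_hops)    # all rows are nearly identical: oversmoothing
```

After two steps, node 0 has a non-zero weight on node 2 but still none on node 3, which is three hops away. After many steps every row converges to the same vector, which is the oversmoothing effect mentioned above.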

### GCN for network anomaly detection

#### Network capture to Graph
A network snapshot can be represented as a graph by treating IPs as nodes and packet exchanges between IPs as edges (*e.g., IP_A sends a TCP request to IP_B*).

Below is a visualization of a 60-second network snapshot:

![Network graph representation](./data/network_image.png)
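
The mapping from a capture to a graph can be sketched in a few lines of plain Python. The flow records and the `node_id` helper below are hypothetical; the resulting `src` and `dst` lists of integer node ids are the kind of edge list that graph libraries typically accept as input.

```python
# Hypothetical flow records: (source IP, destination IP, protocol).
flows = [
    ("10.0.0.1", "10.0.0.53", "UDP"),
    ("10.0.0.2", "10.0.0.53", "UDP"),
    ("10.0.0.1", "10.0.0.2", "TCP"),
    ("10.0.0.1", "10.0.0.53", "UDP"),  # repeated exchange, same edge
]

node_ids = {}  # IP address -> integer node id
edges = {}     # (src id, dst id) -> number of observed exchanges


def node_id(ip):
    """Return a stable integer id for an IP, creating one if needed."""
    return node_ids.setdefault(ip, len(node_ids))


for src_ip, dst_ip, _protocol in flows:
    key = (node_id(src_ip), node_id(dst_ip))
    edges[key] = edges.get(key, 0) + 1

# Parallel lists of edge endpoints, one entry per directed edge.
src = [s for s, _ in edges]
dst = [d for _, d in edges]

print(node_ids)
print(edges)
```

Repeated exchanges between the same pair of IPs are merged into a single directed edge here; keeping one edge per protocol or per flow is an equally valid choice.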

#### Useful graph geometry

The role of some nodes in the network can be guessed just by looking at the graph representation.

> For example, the **DNS** servers are probably the nodes at the center of the big clusters.

We can hence infer that **topological information is meaningful** in such graphs and that a **GCN** could be relevant for detecting anomalies, as detailed [here](https://github.com/harvardnlp/botnet-detection).

#### Need for more features

However, **topological information is not sufficient** to detect more elaborate attacks. For that purpose, **we need more features**, for example:

- IP geolocation
- IP reputation score (from websites such as [UrlVoid](https://www.urlvoid.com/), for example)
- IP Service Provider (ISP)

However, this information is not necessarily available in the public datasets we can use to train our model. More generally, **we don't have access to any IP-related (graph node) information, for privacy reasons**.

#### Edge-based features

Fortunately, we can **interpolate features from packets**, within a certain time window:

- Quantity of bytes (min, mean, median, max, std)
- Number of packets
- Protocol (TCP, UDP, ICMP)
- ...

> That's true for almost all the available public datasets I've worked on: [CTU-13](https://www.stratosphereips.org/datasets-ctu13), [CSE-CIC-IDS2018 (2018)](https://www.unb.ca/cic/datasets/ids-2018.html), [UGR'16 (2016)](https://nesg.ugr.es/nesg-ugr16/index.php).
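
As a rough illustration of how such edge features could be computed, the sketch below groups packets by (source, destination, protocol) within one time window and summarizes their sizes. The packet tuples, the `edge_features` helper, and the 60-second default window are assumptions for this example, not the format of any of the datasets above.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Hypothetical packet records:
# (timestamp in seconds, src IP, dst IP, protocol, size in bytes).
packets = [
    (0.5, "10.0.0.1", "10.0.0.53", "UDP", 80),
    (1.2, "10.0.0.1", "10.0.0.53", "UDP", 120),
    (3.0, "10.0.0.2", "10.0.0.1", "TCP", 1500),
    (61.0, "10.0.0.1", "10.0.0.53", "UDP", 90),  # falls in the next window
]


def edge_features(packets, window_start, window_length=60.0):
    """Summarize packets per (src, dst, protocol) inside one time window."""
    window_end = window_start + window_length
    sizes = defaultdict(list)
    for timestamp, src, dst, protocol, size in packets:
        if window_start <= timestamp < window_end:
            sizes[(src, dst, protocol)].append(size)
    return {
        edge: {
            "packets": len(values),
            "bytes_min": min(values),
            "bytes_mean": mean(values),
            "bytes_median": median(values),
            "bytes_max": max(values),
            "bytes_std": pstdev(values),  # population standard deviation
        }
        for edge, values in sizes.items()
    }


first_window = edge_features(packets, window_start=0.0)
second_window = edge_features(packets, window_start=60.0)
print(first_window[("10.0.0.1", "10.0.0.53", "UDP")])
```

Sliding the window then amounts to calling the helper with successive `window_start` values, each call producing the edge features of one graph snapshot.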

Unfortunately, **GCNs are not designed to handle edge features**: they require node features, which we cannot provide.

#### Node features interpolation from edges

Even if we had access to private IP (node) information, and therefore had features on the nodes, we could not ignore the information carried by interactions between IPs (edges).

The issue is that, to the best of my knowledge, there is no research on the subject.

In this notebook, I explore a solution that I've naively named **Edge2Node**, which consists of **interpolating node features as a non-linear combination of the in/out edges**.
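
To make the idea concrete, here is one possible reading of it in NumPy, not necessarily the exact layer built later in the notebook: average the features of the outgoing and incoming edges of each node, concatenate both, then apply a linear map followed by a ReLU. The toy graph, the `mean_by_node` and `edge2node` helpers, and the choice of mean pooling are all assumptions; in practice the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical graph: 3 nodes, 4 directed edges with 5 features each.
num_nodes = 3
src = np.array([0, 0, 1, 2])
dst = np.array([1, 2, 2, 0])
edge_feats = rng.normal(size=(4, 5))


def mean_by_node(index, values, num_nodes):
    """Average edge features grouped by node; nodes with no edge get zeros."""
    sums = np.zeros((num_nodes, values.shape[1]))
    counts = np.zeros(num_nodes)
    np.add.at(sums, index, values)
    np.add.at(counts, index, 1)
    return sums / np.maximum(counts, 1)[:, None]


def edge2node(src, dst, edge_feats, weight, bias, num_nodes):
    """Node features = ReLU([mean(out edges) ; mean(in edges)] W + b)."""
    out_mean = mean_by_node(src, edge_feats, num_nodes)
    in_mean = mean_by_node(dst, edge_feats, num_nodes)
    combined = np.concatenate([out_mean, in_mean], axis=1)
    return np.maximum(combined @ weight + bias, 0.0)


hidden = 8
weight = rng.normal(size=(2 * 5, hidden))  # random here, learned in practice
bias = np.zeros(hidden)

node_feats = edge2node(src, dst, edge_feats, weight, bias, num_nodes)
print(node_feats.shape)  # (3, 8)
```

Keeping the incoming and outgoing averages separate preserves the direction of the traffic, which matters when a node mostly receives requests (a server) or mostly sends them (a client).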

### Requirements
At the time of writing this notebook, there are three major deep learning graph libraries in Python:

- [PyTorch Geometric](https://github.com/rusty1s/pytorch_geometric)
- [GraphNets](https://github.com/deepmind/graph_nets)
- [Deep Graph Library (DGL)](https://github.com/dmlc/dgl)

There aren't any concrete comparisons between the three yet, so I just went with the one that appealed to me the most. Since I'm more used to **PyTorch**, I've ignored **GraphNets**, which is built on TensorFlow. **PyTorch Geometric** implements a lot of GCN papers; however, it seemed a bit rough to me compared to **DGL**, which appeared to have a well-thought-out pipeline and plans for the future.

In conclusion, I had no strong reason to go with **DGL**; it was just intuition.

In [7]:
!pip install --user torch==1.6.0 dgl==0.5.2 networkx==2.4



### Project decomposition

The project involved the following steps:

1. Dataset Preparation:
    - Truncate raw PCAP (drop useless data)
    - Slide a time window
    - Extract features for each IP_A to IP_B interaction within the window (number of packets, number of bytes sent...)
    - Graph Generation
    
2. GCN Model Design:
    - Generate node features from in/out edges (**Edge2Node**)
    - Apply GCN on the graph (using the predicted node features)
    - Generate edge embeddings from the new node features (**Node2Edge**)
    - Classify nodes and edges using the computed features
    
3. Training:
    - Loss to penalize errors on edge/node classification
    - Basic ML PyTorch training loop
    - Model Evaluation
    
However, we will mainly focus on step **2. GCN Model Design** here, since the other steps are just drafts used for the proof of concept.
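
For the **Node2Edge** and classification steps listed above, a minimal NumPy sketch could look as follows. It assumes that an edge embedding is simply the concatenation of the features of its two endpoint nodes, and that edges and nodes are each classified by a linear layer followed by a softmax. These choices, the random weights, and the `node2edge` helper are illustrative assumptions, not the notebook's final design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the GCN step: 3 nodes with 8 features each.
node_feats = rng.normal(size=(3, 8))
src = np.array([0, 0, 1, 2])
dst = np.array([1, 2, 2, 0])


def node2edge(node_feats, src, dst):
    """Edge embedding = [source node features ; destination node features]."""
    return np.concatenate([node_feats[src], node_feats[dst]], axis=1)


def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


edge_embed = node2edge(node_feats, src, dst)  # shape (4, 16)

# Two classes (normal / anomalous) for both edges and nodes.
edge_weight = rng.normal(size=(16, 2))  # random here, learned in practice
node_weight = rng.normal(size=(8, 2))
edge_probs = softmax(edge_embed @ edge_weight)
node_probs = softmax(node_feats @ node_weight)

print(edge_probs.shape, node_probs.shape)  # (4, 2) (3, 2)
```

Because the source features always come first in the concatenation, the edge A to B and the edge B to A get different embeddings, which keeps the direction of the exchange.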

## Edge2Node: Proof Of Concept

In [None]:
import torch
import networkx as nx
import numpy as np

import dgl
