# Exploratory Analysis of Netflow

## Setup

### Get Data

This notebook assumes that you have downloaded one or more netflow files from the [LANL dataset](https://csr.lanl.gov/data/2017.html) and converted them to HDF5 using something like `hdflow`. Example:

```bash
pip install hdflow
csv2hdf --format=lanl /path/to/lanl/netflow*
```

### Chapel

[Download](https://chapel-lang.org/download.html) and [build](https://chapel-lang.org/docs/usingchapel/building.html) the Chapel [programming language](https://chapel-lang.org/). Be sure to build for a multi-locale system, if appropriate.

### Arkouda

```bash
pip install arkouda
cd arkouda/install/dir
chpl --fast -senableParScan arkouda_server.chpl
./arkouda_server -nl <number_of_locales>
```

In [1]:
import arkouda as ak
from glob import glob

In [2]:
ak.connect(server='node01')

4.2.5
psp =  tcp://node01:5555
connected to tcp://node01:5555


### Load the Data

In [3]:
hdffiles = glob('/mnt/data/lanl_netflow/hdf5/*.hdf')
fields = ['srcIP', 'dstIP', 'srcPort', 'dstPort', 'start']

In [4]:
%time data = {field: ak.read_hdf(field, hdffiles) for field in fields}

CPU times: user 4.18 ms, sys: 340 µs, total: 4.53 ms
Wall time: 1min 3s


In [5]:
data

{'srcIP': array([22058209181, 22058295678, 5266512788, ..., 22058739089, 22058739089, 22058739089]),
 'dstIP': array([22058391981, 22058674074, 22057986724, ..., 22057863347, 22058450761, 22058554651]),
 'srcPort': array([5507, 3137, 5060, ..., 58889, 75615, 67796]),
 'dstPort': array([46272, 445, 5060, ..., 80, 80, 80]),
 'start': array([118781, 118783, 118785, ..., 345599, 345599, 345599])}

### Are src and dst Meaningful?

In typical netflow, src and dst are not meaningful labels, but the curators of this dataset may have used them to encode the identity of the client and the server. If so, the frequency of well-known server ports should differ sharply between src and dst.

In [6]:
%time (data['srcPort'] == 80).sum(), (data['dstPort'] == 80).sum()

CPU times: user 0 ns, sys: 2.55 ms, total: 2.55 ms
Wall time: 1.5 s


(1, 52123410)

In [7]:
%time (data['srcPort'] == 443).sum(), (data['dstPort'] == 443).sum()

CPU times: user 80 µs, sys: 3.04 ms, total: 3.12 ms
Wall time: 1.49 s


(6064, 74255132)

Ports 80 (HTTP) and 443 (HTTPS) appear tens of millions of times in dst (52,123,410 and 74,255,132 flows) but almost never in src (1 and 6,064 flows). Thus, unlike typical netflow, dst is probably the server side in this dataset, while src is the client side.
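
The same comparison can be repeated for several server ports at once. The following is a minimal sketch in plain Python on made-up toy lists, not on the LANL data; `side_counts` and the sample values are hypothetical and chosen only for illustration. With Arkouda arrays, the per-port counts would come from expressions like `(data['dstPort'] == port).sum()`, as in the cells above.

```python
from collections import Counter

def side_counts(src_ports, dst_ports, server_ports):
    """For each port, return (count as source, count as destination)."""
    src = Counter(src_ports)
    dst = Counter(dst_ports)
    return {p: (src[p], dst[p]) for p in server_ports}

# Toy flows (hypothetical values, for illustration only)
toy_src = [50123, 50124, 443, 50125, 50126, 50127]
toy_dst = [80, 443, 50200, 443, 80, 53]

counts = side_counts(toy_src, toy_dst, [53, 80, 443])
for port, (n_src, n_dst) in counts.items():
    print(port, n_src, n_dst)
```

A port that is far more common on one side than the other suggests that side plays the server role.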

Confirm by looking at more of the port distributions:

In [8]:
%time sport, scount = data['srcPort'].value_counts()

CPU times: user 2.34 ms, sys: 36 µs, total: 2.37 ms
Wall time: 1.91 s


In [9]:
from collections import Counter
# Map each distinct source port to its flow count, then list the 10 most frequent
sportCounts = Counter()
for i in range(sport.size):
    sportCounts[sport[i]] = scount[i]
sportCounts.most_common(10)

[(15379, 2036008),
 (5060, 972016),
 (137, 697136),
 (95765, 661667),
 (41101, 450218),
 (123, 298170),
 (59844, 243825),
 (84474, 241587),
 (87103, 234884),
 (20995, 225309)]
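
The loop above indexes the two Arkouda arrays one element at a time, which is presumably slow if each lookup is a separate request to the server. If only the top few entries are needed, one alternative is to copy both arrays to the client once (for example with `to_ndarray()`, assuming the installed Arkouda version provides it) and select the largest counts locally. The sketch below shows only the local selection step, using the standard-library `heapq` module on toy lists; `top_k` and the sample values are hypothetical.

```python
import heapq

def top_k(values, counts, k=10):
    """Return the k (value, count) pairs with the largest counts."""
    return heapq.nlargest(k, zip(values, counts), key=lambda pair: pair[1])

# Toy stand-ins for the (port, count) arrays returned by value_counts()
toy_ports = [53, 80, 123, 443, 5060]
toy_counts = [40, 25, 3, 31, 7]

top3 = top_k(toy_ports, toy_counts, k=3)
print(top3)  # [(53, 40), (443, 31), (80, 25)]
```

Unlike the `Counter` approach, this does not build a dictionary of every distinct port, so it does not replace `len(sportCounts)` for counting distinct values.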

In [10]:
len(sportCounts)

65055

In [11]:
dport, dcount = data['dstPort'].value_counts()

In [12]:
# Same ranking for destination ports
dportCounts = Counter()
for i in range(dport.size):
    dportCounts[dport[i]] = dcount[i]
dportCounts.most_common(10)

[(53, 97719325),
 (443, 74255132),
 (80, 52123410),
 (514, 29138686),
 (389, 18679249),
 (427, 13094194),
 (88, 12218516),
 (161, 11970935),
 (445, 10192812),
 (95765, 6444028)]

In [13]:
len(dportCounts)

62727
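
One way to summarize the two top-10 lists is to check which entries fall in the well-known port range (below 1024), where standard services are conventionally assigned. The sketch below copies the two lists from the outputs above; `well_known` and `WELL_KNOWN_LIMIT` are hypothetical names introduced here for illustration.

```python
WELL_KNOWN_LIMIT = 1024  # ports 0-1023 are the conventional "well-known" range

# Top-10 (port, count) pairs copied from the outputs above
top_src = [(15379, 2036008), (5060, 972016), (137, 697136), (95765, 661667),
           (41101, 450218), (123, 298170), (59844, 243825), (84474, 241587),
           (87103, 234884), (20995, 225309)]
top_dst = [(53, 97719325), (443, 74255132), (80, 52123410), (514, 29138686),
           (389, 18679249), (427, 13094194), (88, 12218516), (161, 11970935),
           (445, 10192812), (95765, 6444028)]

def well_known(pairs):
    """Return the ports from (port, count) pairs that are below 1024."""
    return [port for port, _ in pairs if port < WELL_KNOWN_LIMIT]

print(well_known(top_src))  # [137, 123]
print(well_known(top_dst))  # [53, 443, 80, 514, 389, 427, 88, 161, 445]
```

Nine of the ten most common dst ports are in the well-known range, compared with two of the ten most common src ports, which is consistent with dst being the server side.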