# Exploratory Analysis of Netflow

## Setup

### Get Data

This notebook assumes that you have downloaded one or more netflow files from the [LANL dataset](https://csr.lanl.gov/data/2017.html) and converted them to HDF5 using something like `hdflow`. Example:

```bash
pip install hdflow
csv2hdf --format=lanl /path/to/lanl/netflow*
```

### Chapel

[Download](https://chapel-lang.org/download.html) and [build](https://chapel-lang.org/docs/usingchapel/building.html) the Chapel [programming language](https://chapel-lang.org/). Be sure to build for a multi-locale system, if appropriate.

### Arkouda

```bash
pip install arkouda
cd arkouda/install/dir
chpl --fast -senableParScan arkouda_server.chpl
./arkouda_server -nl <number_of_locales>
```

In [None]:
import arkouda as ak
from glob import glob

In [None]:
ak.connect()

In [None]:
ak.get_config()

In [None]:
ak.get_mem_used()

### Load the Data

In [None]:
hdffiles = glob('/Volumes/Crucial X8/Data/lanl_netflow/hdf5/*.hdf')
fields = ['srcIP', 'dstIP', 'srcPort', 'dstPort', 'start']

In [None]:
%time data = {field: ak.read_hdf(field, hdffiles) for field in fields}

In [None]:
data

### Are src and dst Meaningful?
Typically, src and dst are not meaningful labels, but the curators of this dataset may have used them to encode the identity of the client and server. If so, then the frequency of well-known server ports such as 80 and 443 should differ substantially between src and dst.

In [None]:
%time (data['srcPort'] == 80).sum(), (data['dstPort'] == 80).sum()

In [None]:
%time (data['srcPort'] == 443).sum(), (data['dstPort'] == 443).sum()

dst has many occurrences of port 80 (HTTP) and 443 (HTTPS), while src has very few. Thus, unlike typical netflow, dst is probably the server side in this dataset, while src is the client side.
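The same comparison can be extended to several well-known ports at once. The sketch below is plain Python run on small made-up counts, not on the LANL data; `dst_share`, `toy_src`, and `toy_dst` are hypothetical names introduced only for illustration. A share close to 1.0 means the port appears almost exclusively on the dst side.

```python
def dst_share(src_counts, dst_counts, ports):
    """Return {port: fraction of that port's appearances that fall on the dst side}."""
    shares = {}
    for port in ports:
        s = src_counts.get(port, 0)
        d = dst_counts.get(port, 0)
        total = s + d
        # A port never seen on either side has no defined share.
        shares[port] = d / total if total else None
    return shares

# Toy counts for illustration only (not taken from the LANL dataset).
toy_src = {80: 5, 443: 10, 51234: 900}
toy_dst = {80: 95, 443: 190, 22: 40}

shares = dst_share(toy_src, toy_dst, ports=[22, 80, 443, 53])
for port, share in shares.items():
    print(port, share)
```

With real data, the two dictionaries could be built from the unique values and counts computed in the cells that follow.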

Confirm by looking at more of the port distributions:

## src port values and counts

In [None]:
%time sport, scount = ak.value_counts(data['srcPort'])

## top 10 src port counts on the client (Python Counter)

In [None]:
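from collections import Counter
# Transfer the unique ports and their counts to the client once;
# indexing sport[i] inside a loop costs one server round-trip per element.
sportCounts = Counter(dict(zip(sport.to_ndarray().tolist(), scount.to_ndarray().tolist())))
sportCounts.most_common(10)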

In [None]:
len(sportCounts)

## top 10 src port counts in arkouda

In [None]:
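# Indices of the 10 largest counts, computed on the server
ix = ak.argmaxk(scount, 10)
# Reverse so that the largest count prints first
for i in ix.to_ndarray()[::-1]:
    print((sport[i], scount[i]))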
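For readers unfamiliar with `ak.argmaxk`, its role here is to return the positions of the k largest counts, so that only k elements need to be fetched from the server. The pure-Python sketch below mimics that behavior on a toy list. `argmaxk_sketch` is a hypothetical stand-in, not Arkouda's implementation, and it assumes the indices come back ordered from smallest to largest value, which is what the `[::-1]` reversal in the cell above implies.

```python
def argmaxk_sketch(values, k):
    """Indices of the k largest entries, ordered from smallest to largest value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[-k:] if k > 0 else []

# Toy data for illustration only (not taken from the LANL dataset).
toy_ports = [22, 80, 443, 53, 8080]
toy_counts = [40, 95, 190, 7, 12]

toy_ix = argmaxk_sketch(toy_counts, 3)
# Reverse so that the port with the largest count comes first.
top = [(toy_ports[i], toy_counts[i]) for i in toy_ix[::-1]]
print(top)
```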

## dest port values and counts

In [None]:
dport, dcount = ak.value_counts(data['dstPort'])

## top 10 dest port counts on the client (Python Counter)

In [None]:
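# Transfer the unique ports and their counts to the client once;
# indexing dport[i] inside a loop costs one server round-trip per element.
dportCounts = Counter(dict(zip(dport.to_ndarray().tolist(), dcount.to_ndarray().tolist())))
dportCounts.most_common(10)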

In [None]:
len(dportCounts)

## top 10 dest port counts in arkouda

In [None]:
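# Indices of the 10 largest counts, computed on the server
ix = ak.argmaxk(dcount, 10)
# Reverse so that the largest count prints first
for i in ix.to_ndarray()[::-1]:
    print((dport[i], dcount[i]))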