# RAPIDS Academy Lab: GPU Security Analytics
## Mapping 2.6GB of Zeek connection logs with cuDF and Graphistry

The exercises below use GPU technologies from the RAPIDS ecosystem to quickly load a large network security log dump into a notebook (cuDF), analyze the IPs involved, and visualize them (Graphistry).

This notebook is part of the security analytics track's introductory session in RAPIDS Academy. RAPIDS Academy is a group of GPU computing leaders working together to provide tutorials and trainings for learning GPU analytics.


#### Instructions

* **Video + chat**: See email/Slack: RAPIDS.ai Slack channel group + Zoom invite + GPU environment
* **Schedule**: Introduction (Zoom), then 10min per task, with 5-10min discussion in between each task

#### Assumptions

* **Python**: Comfortable with basic Python
* **PyData**: Helpful but not necessary: Familiarity with Jupyter Notebooks & Pandas
* **Security**: Minimal

#### Topics

* **Introduction: Jupyter Python notebooks**
  * Writing, running, & saving cells, `nvidia-smi` & GPU dashboard
* **Task 1: Starting & loading data**
  * `help()`, `%%time`, `cudf.read_csv()`, memory management
  * Advanced: comparing to `pandas.read_csv` and speedup via `usecols`
* **Task 2: Inspection & analysis** 
  * `head()`, `[[]]`, `describe()`, `count()`, `sort_values()`, `len()`
  * Advanced: `unique()`, `value_counts`
* **Task 3: Shaping & visualization**
  * `drop_duplicates`, `groupby().agg()`, `graphistry.plot()`
  * Advanced: `graphistry.hypergraph`

### After
* Solution notebook
* Slack channel
* [Subscribe to future sessions](learnrapids.com): Multi-GPU, ...
* Guide our roadmap! RAPIDS Academy + RAPIDS ecosystem survey

## Import RAPIDS and prepare data

* For Graphistry, get a free account at [alpha.graphistry.com](https://alpha.graphistry.com) and put your credentials in the `graphistry.register()` call below


In [None]:
# Install graphistry client library if not available 
# NOTE: No local GPU needed as it uses a remote graphistry GPU server of your choice
# ! pip install --user graphistry

In [None]:
import cudf, json, graphistry, pandas as pd
from collections import OrderedDict
pd.options.display.max_rows = 1000
pd.options.display.max_columns = 100

graphistry.register(
    api=3,
    username='*****', 
    password='*****')

{'cudf': cudf.__version__, 'graphistry': graphistry.__version__}

In [None]:
! rm -f conn.log conn.log.gz
! echo "Downloading data..."
! wget https://www.secrepo.com/maccdc2012/conn.log.gz
! ls -alh conn.log.gz 
! echo "Decompressing data..."
! gunzip conn.log.gz
! ls -alh conn.log
! echo "DONE: DOWNLOADED DATA"

In [None]:
! echo "generating sample conn10.log ..."
! head -n 10 ./conn.log > conn10.log
! echo "generating sample conn1K.log ..."
! head -n 1000 ./conn.log > conn1K.log
! echo "generating sample conn1M.log ..."
! head -n 1000000 ./conn.log > conn1M.log
! echo "generating sample conn5M.log ..."
! head -n 5000000 ./conn.log > conn5M.log
! ls -alh conn*
! echo "DONE: SAMPLES GENERATED"

## Quick demo A: Jupyter

1. Add cell: `+` button
1. Run (ctrl-enter): 
```
x = 1 + 2
x
```
1. Run shell command: `ls -alh` and then `nvidia-smi`
1. Save

In [None]:
x = 1 + 1
x

In [None]:
ls

In [None]:
! nvidia-smi

## Quick Demo B: GPU Dashboard

In the left Jupyterlab menu, find the icon for the GPU dashboard extension and open the tab

# Task 1: Load data

* Read the 2.6GB file `conn.log` into a cuDF GPU dataframe using `cudf.read_csv`
* Count the unique IPs
* Compare the speed against CPU pandas
* Intermediate + advanced: Keep GPU memory down

### Task 1a: Load the data... without leaking GPU memory

Load the data using `cudf.read_csv()`. Initially, it will be parsed in the wrong format.

Your task is to do extra work around keeping GPU memory use under control. 
* At the end of every cell, set your dataframe variables to `None` (ex: `gdf = None`) so the garbage collector can free the memory
* You can check GPU memory use via the dashboard and `! nvidia-smi`
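
Why `gdf = None` helps: Python releases an object once nothing refers to it anymore, and cuDF is expected to release a dataframe's GPU buffers when the dataframe object itself is released. The sketch below shows only the Python side of that, using a hypothetical stand-in class (`FakeFrame` is not a cuDF type) and assuming CPython's reference counting.

```python
import gc
import weakref


class FakeFrame:
    """Hypothetical stand-in for a dataframe that owns a large buffer."""

    def __init__(self, n_bytes):
        self.buffer = bytearray(n_bytes)


gdf = FakeFrame(1_000_000)
probe = weakref.ref(gdf)    # observes the object without keeping it alive

assert probe() is not None  # still referenced by `gdf`

gdf = None                  # drop the last reference, as the lab asks
gc.collect()                # only needed for reference cycles; harmless here

print('released:', probe() is None)
```

If memory stays high in `nvidia-smi`, look for other variables that still point at the dataframe. Notebook output history (for example `_`) also keeps references to displayed results.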

Start by loading sample `./conn1K.log`, and when ready, try `./conn1M.log`, `./conn5M.log`, and the full `./conn.log`.

#### 1a. Reference
Print the last 3 lines of `./conn1K.log`

In [None]:
%%time

gdf = cudf.read_csv('./conn1K.log')

print(gdf.tail(3))

gdf=None

#### 1a. Task

* Print the last 3 lines of `./conn5M.log` using `cudf.read_csv`
* Use `! nvidia-smi` or the GPU monitor to check that memory consumption is low afterward

After, try on `./conn.log` and see if it can hold the 2.6GB in memory. (RAPIDS has a max str size.)
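
A hedged back-of-the-envelope check for the parenthetical above. It assumes the limit in question is a signed 32-bit character offset per string column (about 2.1 billion characters), which may not hold for the cuDF release you are running. Read without a tab delimiter, most of each line lands in a single string column, so that column holds roughly the whole file.

```python
# Assumption: a string column indexes its characters with signed 32-bit offsets.
INT32_MAX = 2**31 - 1   # 2,147,483,647 characters

file_bytes = 2.6e9      # approximate size of conn.log quoted by this lab

fits = file_bytes <= INT32_MAX
print(f'limit: {INT32_MAX / 1e9:.2f} GB, file: {file_bytes / 1e9:.2f} GB, fits: {fits}')
```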

In [None]:
%%time

gdf = cudf.read_csv('./conn5M.log')

print(gdf.tail(3))

gdf = None

### Task 1b: Load 500MB

Load the data using tab separation

#### 1b. Reference

In [None]:
%%time

! ls -alh conn1K.log conn5M.log conn.log
! echo "lines  1K: `wc -l ./conn1K.log`"
! echo "lines  5M: `wc -l ./conn5M.log`"
! echo "lines all: `wc -l ./conn.log`"
! echo
! echo "==========================="
! echo "./conn1K.log first 3 lines:"
! head -n 3 ./conn1K.log
! echo
! echo "==========================="
! echo "via cudf.read_csv()"

gdf = cudf.read_csv('./conn1K.log', sep='\t')
print(gdf.head(3))
gdf = None

#### 1b. Task
1. Read `help(cudf.read_csv)` to learn about the delimiter parameter
1. Use the delimiter option to read the file with tab separation (`"\t"`) 
1. Get it right with `./conn1K.log` and then try `./conn5M.log`
1. ... remember to clear out your GPU memory!
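
To see why the delimiter matters, here is a small CPU-only sketch using Python's standard `csv` module rather than cuDF. The two records are made up but follow the tab-separated layout of `conn.log`. With the default comma delimiter, each line comes back as a single field.

```python
import csv
import io

# Two made-up records in a tab-separated layout (values are illustrative).
sample = (
    "1331901000.0\tC1\t192.168.202.79\t50463\t192.168.229.251\t80\ttcp\n"
    "1331901000.0\tC2\t192.168.202.79\t46117\t192.168.229.254\t443\ttcp\n"
)

as_commas = list(csv.reader(io.StringIO(sample)))                # default ','
as_tabs = list(csv.reader(io.StringIO(sample), delimiter='\t'))  # explicit tab

print(len(as_commas[0]), 'field with the default delimiter')
print(len(as_tabs[0]), 'fields with a tab delimiter')
```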

In [None]:
help(cudf.read_csv)

In [None]:
%%time

gdf = cudf.read_csv('./conn1K.log', sep='\t')

print(gdf.head(3))

gdf = None

### Task 1c: Format & load 2.6GB

`read_csv` has many useful parameters to create data that is cleaner, friendlier, smaller, and faster. 

By using them, we will be able to load the whole dataset without crashing, fairly quickly, and get native operations on top!


#### 1c. Reference

* Read about the parameters `names`, `dtype`, `usecols`, and `byte_range` in `cudf.read_csv()`
* Load the Zeek format
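
Byte ranges let you read a large file in pieces. A minimal sketch, assuming the `(offset, size)` form of `byte_range` described in `help(cudf.read_csv)`. `byte_ranges` is a hypothetical helper, not part of cuDF, and the sizes are toy numbers.

```python
def byte_ranges(file_size, chunk_size):
    """Return (offset, size) pairs that together cover file_size bytes."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


# Example: a 2,600-byte "file" read in 1,000-byte pieces.
ranges = byte_ranges(2_600, 1_000)
print(ranges)

# Each pair could then be passed along, e.g. (not run here):
#   cudf.read_csv('./conn.log', sep='\t', names=cols, byte_range=(offset, size))
```

Check the cuDF documentation for how rows that straddle a range boundary are handled before relying on this for exact row counts.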

In [None]:
# help(cudf.read_csv)

In [None]:
cols = ['ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'proto', 'service', 'duration',
        'orig_bytes', 'resp_bytes', 'conn_state', 
        'local_orig?',
        'missed_bytes', 'history', 'orig_pkts',
        'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents']

# Keys match the names in `cols`; `ts` holds fractional epoch seconds, so it is read as float64
dtypes=OrderedDict([
    ('ts', 'float64'), ('uid', 'str'),
    ('id.orig_h', 'str'), ('id.orig_p', 'int32'), ('id.resp_h', 'str'), ('id.resp_p', 'int32'),
    ('proto', 'str'), ('service', 'str'), ('duration', 'float64'),
    ('orig_bytes', 'int64'), ('resp_bytes', 'int64'),
    ('conn_state', 'str'), ('local_orig?', 'str'),
    ('missed_bytes', 'int64'), ('history', 'str'),
    ('orig_pkts', 'int64'), ('orig_ip_bytes', 'int64'), ('resp_pkts', 'int64'), ('resp_ip_bytes', 'int64'),
    ('tunnel_parents', 'str')
])

# optional
cols_subset = [
    'ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 
    'proto', 'duration', 'orig_bytes', 'resp_bytes'
]
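
Since column names and dtypes are maintained as two separate structures, they can drift apart. The sketch below is one way to cross-check them before a long load. `check_schema` and the miniature demo schema are hypothetical, for illustration only.

```python
def check_schema(names, dtypes):
    """Compare column names against dtype keys; return (untyped, unknown)."""
    untyped = [c for c in names if c not in dtypes]   # named, but no dtype given
    unknown = [c for c in dtypes if c not in names]   # typed, but never named
    return untyped, unknown


# Hypothetical miniature schema, for illustration only.
demo_names = ['ts', 'uid', 'proto', 'service']
demo_dtypes = {'ts': 'float64', 'uid': 'str', 'proto': 'str', 'extra': 'str'}

untyped, unknown = check_schema(demo_names, demo_dtypes)
print('untyped:', untyped)
print('unknown:', unknown)
```

Calling `check_schema(cols, dtypes)` on the lists defined in this notebook is a quick way to catch a mismatch before reading the full file.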

#### 1c. Task

* Plug in `names` and `dtypes` to `read_csv`
* ... Take care not to leak GPU memory, and try first on `./conn1K.log` before doing `./conn1M.log` / `./conn5M.log` / the full `./conn.log`.

Advanced: Only load in the columns you will use for your analysis and compare the impact on memory & load time
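
To get a feel for what column selection saves, here is a rough estimate of the memory taken by fixed-width columns alone. It is a sketch under simplifying assumptions: string columns and null masks are ignored, and the row count is the approximate figure quoted in the References section. `fixed_width_bytes` is a hypothetical helper.

```python
# Bytes per value for the fixed-width dtypes used in this lab.
BYTES = {'int32': 4, 'int64': 8, 'float64': 8}


def fixed_width_bytes(n_rows, column_dtypes):
    """Estimate memory for fixed-width columns; other dtypes are skipped."""
    return sum(n_rows * BYTES[d] for d in column_dtypes if d in BYTES)


n_rows = 22_000_000  # approximate row count of conn.log

# All numeric columns of the schema vs. just the two port columns.
all_numeric = ['float64', 'int32', 'int32', 'float64'] + ['int64'] * 7
just_ports = ['int32', 'int32']

print(fixed_width_bytes(n_rows, all_numeric) / 1e6, 'MB')
print(fixed_width_bytes(n_rows, just_ports) / 1e6, 'MB')
```

Compare these estimates with what `nvidia-smi` reports before and after a load. The difference gives a sense of how much the string columns cost.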

In [None]:
%%time

gdf = cudf.read_csv('./conn1K.log',  sep='\t',
    names=cols,
    dtype=dtypes,
    usecols=cols_subset,
    na_values=[None, '-', '-','(empty)'])

print(gdf.dtypes)

print(gdf.head(3))

gdf = None

## Task 2. Analytics & Wrangling

CPU Pandas operators largely carry over to GPU cuDF. We'll take a look at activity by top IPs.


### Task 2a: Column manipulation

#### 2a. Reference
* df with subset of cols: `gdf[['col1', 'col2']]`
* get one col: `gdf['col1']`
* get df/col length: `len(gdf)` / `len(gs)`
* series of unique elements from a series: `gs.unique()`
* stats on one series: `gs.min()`, `gs.max()`, `gs.sum()`, ...

#### 2a. Task

* For the column of IPs `id.resp_h`, how many unique IPs are there?
* Intermediate/advanced: If you have time after 2b/c, for the column of bytes `orig_bytes`, what is the biggest payload?

In [None]:
%%time

### id.resp_h unique value count

gdf = cudf.read_csv('./conn.log', 
              sep='\t', 
              names=cols,
              dtype=dtypes,
              usecols=['id.resp_h'],
              na_values=[None, '-', '-','(empty)'])

unique_resp_ips = gdf['id.resp_h'].unique()

print('# unique', len(unique_resp_ips))

print(unique_resp_ips[:10])

unique_resp_ips = None
gdf = None

In [None]:
%%time

### col orig_bytes max

gdf = cudf.read_csv('./conn.log', 
              sep='\t', 
              names=cols,
              dtype=dtypes,
              usecols=['orig_bytes'],
              na_values=[None, '-', '-','(empty)'])

mx = gdf['orig_bytes'].max()

print('max orig_bytes', mx)

gdf = None

### Task 2b: Group by & column summaries

Grouping is a powerful tool: you can group dataframe rows and get summary statistics for each group.

In this task, we'll summarize flows between different computers.

#### 2b. Reference

Pattern

```python
gdf\
    .groupby(['col1', 'col2', ...])\
    .agg({
        'col2': ['min', 'max'],
        'col3': 'min',
        'col4': ['count', 'mean', 'nunique'],
    })\
    .reset_index()
```
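
To make the semantics of the pattern concrete, the sketch below computes the same kind of per-group summary in plain Python on three made-up flow records. It is an illustration of what `groupby().agg()` produces, not how cuDF implements it.

```python
from collections import defaultdict

# Made-up flow records: (id.orig_h, id.resp_h, id.resp_p, orig_bytes)
rows = [
    ('10.0.0.1', '10.0.0.9', 22, 100),
    ('10.0.0.1', '10.0.0.9', 80, 300),
    ('10.0.0.2', '10.0.0.9', 22, 50),
]

# Step 1: bucket rows by the group keys.
groups = defaultdict(list)
for orig_h, resp_h, resp_p, orig_bytes in rows:
    groups[(resp_h, orig_h)].append((resp_p, orig_bytes))

# Step 2: reduce each bucket to summary statistics.
summary = {}
for key, members in groups.items():
    ports = [p for p, _ in members]
    sizes = [b for _, b in members]
    summary[key] = {
        'count': len(members),
        'resp_p_nunique': len(set(ports)),
        'orig_bytes_sum': sum(sizes),
        'orig_bytes_mean': sum(sizes) / len(sizes),
    }

for key, stats in summary.items():
    print(key, stats)
```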

#### 2b. Task

Compute summaries for `./conn1K.log` then the full `./conn.log` for:

* `ts`: # entries, min/max/mean time
* `id.resp_p`: min/max/nunique
* `duration`: min/max/mean
* `orig_bytes`: min/max/mean/sum

Advanced: Which are the biggest exfils? Longest? SSH?

In [None]:
%%time

gdf = cudf.read_csv('./conn.log', 
              sep='\t', 
              names=cols,
              dtype=dtypes,
              usecols=cols_subset,
              na_values=[None, '-', '-','(empty)'])

out = gdf.groupby(['id.resp_h', 'id.orig_h'])\
    .agg({
        'ts': ['count', 'min', 'max', 'mean'],
        'uid': 'nunique',
        'id.resp_p': ['min', 'max', 'nunique'],
        'proto': ['nunique'],
        'duration': ['min', 'max', 'mean', 'sum'],
        'orig_bytes': ['min', 'max', 'mean', 'sum'],
        'resp_bytes': ['min', 'max', 'mean', 'sum'],
    }).reset_index()


gdf = None
print(out.shape)
print(out.dtypes)
print(out.head(3))
out = None

## Task 3: Visualize!

Graphistry lets you plot many points and, more interestingly, the relationships between them, then visually filter and cluster them on the fly.

#### Task 3a: Run Graphistry

(Precoded)

* Nodes: IPs
* Edges: Summarized flows

Try to:

* Pan/zoom: Similar to Google Maps - click/drag/scroll (or pinch); recenter
* Inspect: click on a node/edge, open table inspector
* Advanced: histogram, filter, cluster, color
* Advanced: rendering + clustering settings

In [None]:
def compute_groupby():

    gdf = cudf.read_csv('./conn.log', 
                  sep='\t', 
                  names=cols,
                  dtype=dtypes,
                  usecols=cols_subset,
                  na_values=[None, '-', '-','(empty)'])

    out = gdf.groupby(['id.resp_h', 'id.orig_h'])\
        .agg({
            'ts': ['count', 'min', 'max', 'mean'],
            'uid': 'nunique',
            'id.resp_p': ['min', 'max', 'nunique'],
            'proto': ['nunique'],
            'duration': ['min', 'max', 'mean', 'sum'],
            'orig_bytes': ['min', 'max', 'mean', 'sum'],
            'resp_bytes': ['min', 'max', 'mean', 'sum'],
        }).reset_index()


    ########### Data cleaning: normal column names and times as actual timestamps

    out.columns = out.columns.to_flat_index() # -> col_name = (col, stat)
    out.columns = [ '%s_%s' % c for c in out.columns ]

    out = out.rename(columns={
        'id.resp_h_': 'id.resp_h',
        'id.orig_h_': 'id.orig_h',
    })

    out['ts_min'] = cudf.Series(pd.to_datetime((out['ts_min']*1000000000).to_pandas()))
    out['ts_max'] = cudf.Series(pd.to_datetime((out['ts_max']*1000000000).to_pandas()))
    out['ts_mean'] = cudf.Series(pd.to_datetime((out['ts_mean']*1000000000).to_pandas()))
    
    return out

out = compute_groupby()
    
print('# rows', len(out))
print('dtypes', out.dtypes)
print(out.head(3))

out = None
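
The data-cleaning step in `compute_groupby` can be hard to follow at first, so here is a CPU-only sketch of its two ideas with illustrative values. The labels and timestamp are made up, and `rstrip('_')` is a shortcut used here in place of the notebook's explicit `rename`. The notebook multiplies by 1,000,000,000 because `pd.to_datetime` treats plain numbers as nanoseconds by default.

```python
from datetime import datetime, timezone

# After .agg(...).reset_index(), column labels are (column, statistic) pairs;
# the group keys carry an empty statistic.
labels = [('id.resp_h', ''), ('id.orig_h', ''), ('ts', 'min'), ('orig_bytes', 'sum')]

flat = ['%s_%s' % pair for pair in labels]   # ('ts', 'min') -> 'ts_min'
flat = [name.rstrip('_') for name in flat]   # 'id.resp_h_' -> 'id.resp_h'
print(flat)

# Timestamps in conn.log are seconds since the Unix epoch.
ts_min = 1331901000.0   # illustrative value
as_utc = datetime.fromtimestamp(ts_min, tz=timezone.utc)
print(as_utc.isoformat())
```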

In [None]:
%%time

gdf = compute_groupby()

g = graphistry.edges(gdf).bind(source='id.orig_h', destination='id.resp_h')

gdf = None

g.plot()

In [None]:
g = None

### Task 3b: Map SSH activity

* Copy & rename `compute_groupby` as `compute_groupby2` 
* Filter connections to just SSH traffic (port 22) based on column `id.resp_p` 
* Plot

#### 3b. Reference

Hint: To filter, `gdf2 = gdf[ gdf['some_col'] == some_val]`
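
The hint works in two steps: the comparison produces one boolean per row, and indexing with those booleans keeps only the rows marked `True`. A plain-Python sketch of the same idea, with made-up ports and hosts:

```python
# Made-up responder ports and hosts, one entry per connection.
resp_ports = [22, 80, 22, 443]
resp_hosts = ['10.0.0.9', '10.0.0.7', '10.0.0.5', '10.0.0.9']

# Step 1: one boolean per row, like gdf['id.resp_p'] == 22
mask = [port == 22 for port in resp_ports]

# Step 2: keep rows where the mask is True, like gdf[mask]
ssh_hosts = [host for host, keep in zip(resp_hosts, mask) if keep]

print(mask)
print(ssh_hosts)
```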

#### 3b. Task

In [None]:
def compute_groupby2():

    gdf = cudf.read_csv('./conn.log', 
                  sep='\t', 
                  names=cols,
                  dtype=dtypes,
                  usecols=cols_subset,
                  na_values=[None, '-', '-','(empty)'])
    
    gdf = gdf[ gdf['id.resp_p'] == 22 ]

    out = gdf.groupby(['id.resp_h', 'id.orig_h'])\
        .agg({
            'ts': ['count', 'min', 'max', 'mean'],
            'uid': 'nunique',
            'id.resp_p': ['min', 'max', 'nunique'],
            'proto': ['nunique'],
            'duration': ['min', 'max', 'mean', 'sum'],
            'orig_bytes': ['min', 'max', 'mean', 'sum'],
            'resp_bytes': ['min', 'max', 'mean', 'sum'],
        }).reset_index()


    ########### Data cleaning: normal column names and times as actual timestamps

    out.columns = out.columns.to_flat_index() # -> col_name = (col, stat)
    out.columns = [ '%s_%s' % c for c in out.columns ]

    out = out.rename(columns={
        'id.resp_h_': 'id.resp_h',
        'id.orig_h_': 'id.orig_h',
    })

    out['ts_min'] = cudf.Series(pd.to_datetime((out['ts_min']*1000000000).to_pandas()))
    out['ts_max'] = cudf.Series(pd.to_datetime((out['ts_max']*1000000000).to_pandas()))
    out['ts_mean'] = cudf.Series(pd.to_datetime((out['ts_mean']*1000000000).to_pandas()))
    
    return out

In [None]:
out = compute_groupby2()
    
print('# rows', len(out))
print('dtypes', out.dtypes)
print(out.head(3))

out = None

In [None]:
%%time

gdf = compute_groupby2()

g = graphistry.edges(gdf).bind(source='id.orig_h', destination='id.resp_h')

gdf = None

g.plot()

### After

#### Do
* Solution notebook
* Slack channel
* [Subscribe to future sessions](learnrapids.com): Multi-GPU, ...
* Guide our roadmap! RAPIDS Academy + RAPIDS ecosystem survey

#### References
* [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
* 2.6GB of Zeek connection logs (22M rows) from https://www.secrepo.com/ 
* RAPIDS docs: https://docs.rapids.ai/api/cudf/stable/api.html