# RAPIDS Academy Lab: GPU Security Analytics
## Mapping 2.6GB of Zeek connection logs with cuDF and Graphistry

The exercises below walk you through using GPU technologies from the RAPIDS ecosystem to quickly load a large network security log dump into a notebook (cuDF), analyze the IPs involved, and visualize them (Graphistry).

This notebook is part of the security analytics track's introductory session in RAPIDS Academy. RAPIDS Academy is a group of GPU computing leaders working together to provide tutorials and trainings for learning GPU analytics.


#### Instructions

* **Video + chat**: See email/Slack: RAPIDS.ai Slack channel group + Zoom invite + GPU environment
* **Schedule**: Introduction (Zoom), then 10min per task, with 5-10min of discussion between tasks

#### Assumptions

* **Python**: Comfortable with basic Python
* **PyData**: Helpful but not necessary: Familiarity with Jupyter Notebooks & Pandas
* **Security**: Minimal

#### Topics

* **Introduction: Jupyter Python notebooks**
  * Writing, running, & saving cells, `nvidia-smi` & GPU dashboard
* **Task 1: Starting & loading data**
  * `help()`, `%%time`, `cudf.read_csv()`, memory management
  * Advanced: comparing to `pandas.read_csv` and speedup via `usecols`
* **Task 2: Inspection & analysis** 
  * `head()`, `[[]]`, `describe()`, `count()`, `sort_values()`, `len()`
  * Advanced: `unique()`, `value_counts`
* **Task 3: Shaping & visualization**
  * `drop_duplicates`, `groupby().agg()`, `graphistry.plot()`
  * Advanced: `graphistry.hypergraph`

### After
* Solution notebook
* Slack channel
* [Subscribe to future sessions](learnrapids.com): Multi-GPU, ...
* Guide our roadmap! RAPIDS Academy + RAPIDS ecosystem survey

## Import RAPIDS and prepare data

* For Graphistry, get a free account at [Graphistry Hub](https://www.graphistry.com/get-started) and fill your credentials into the `creds` used by the `graphistry.register()` call below


In [None]:
# Install graphistry client library if not available 
# NOTE: No local GPU needed as it uses a remote graphistry GPU server of your choice
# ! pip install --user graphistry

In [3]:
import cudf, json, graphistry, pandas as pd
from collections import OrderedDict
pd.options.display.max_rows = 1000
pd.options.display.max_columns = 100


### Get free Graphistry Hub account & creds at https://www.graphistry.com/get-started
### First run: set to True and fill in creds
### Future runs: set to False and erase your creds
### When done: delete graphistry.json
if False:
    #creds = {'token': '...'}
    creds = {'username': '***', 'password': '***'}
    with open('graphistry.json', 'w') as outfile:
        json.dump(creds, outfile)
with open('graphistry.json') as f:
    creds = json.load(f)

graphistry.register(
    api=3, key='', protocol='https', server='hub.graphistry.com', 
    **creds)
    

{'cudf': cudf.__version__, 'graphistry': graphistry.__version__}



{'cudf': '0.14.0', 'graphistry': '0.11.2'}

In [None]:
! rm -f conn.log conn.log.gz
! echo "Downloading data..."
! wget https://www.secrepo.com/maccdc2012/conn.log.gz
! ls -alh conn.log.gz 
! echo "Decompressing data..."
! gunzip conn.log.gz
! ls -alh conn.log
! echo "DONE: DOWNLOADED DATA"

In [37]:
! echo "generating sample conn10.log ..."
! head -n 10 ./conn.log > conn10.log
! echo "generating sample conn1K.log ..."
! head -n 1000 ./conn.log > conn1K.log
! echo "generating sample conn1M.log ..."
! head -n 1000000 ./conn.log > conn1M.log
! echo "generating sample conn5M.log ..."
! head -n 5000000 ./conn.log > conn5M.log
! ls -alh conn*
! echo "DONE: SAMPLES GENERATED"

generating sample conn10.log ...
generating sample conn1K.log ...
generating sample conn1M.log ...
generating sample conn5M.log ...
-rw-r--r-- 1 leo@graphistry.com leo@graphistry.com 2.6G Sep 21  2014 conn.log
-rw-r--r-- 1 leo@graphistry.com leo@graphistry.com 1.4K Jul 14 19:15 conn10.log
-rw-r--r-- 1 leo@graphistry.com leo@graphistry.com 130K Jul 14 19:15 conn1K.log
-rw-r--r-- 1 leo@graphistry.com leo@graphistry.com 117M Jul 14 19:15 conn1M.log
-rw-r--r-- 1 leo@graphistry.com leo@graphistry.com 575M Jul 14 19:15 conn5M.log
DONE: SAMPLES GENERATED


## Quick demo A: Jupyter

1. Add cell: `+` button
1. Run (ctrl-enter): 
```
x = 1 + 2
x
```
1. Run shell command: `ls -alh` and then `nvidia-smi`
1. Save

In [38]:
#x = 1 + 1
#x

In [39]:
#! ls -alh

In [40]:
#! nvidia-smi

## Quick demo B: GPU Dashboard

In the left JupyterLab menu, find the icon for the GPU dashboard extension and open its tab.

## Task 1: Load data

* Read the 2.6GB file `conn.log` into a cuDF GPU dataframe using `cudf.read_csv`
* Count the unique IPs
* Compare the speed against CPU pandas
* Intermediate + advanced: Keep GPU memory down

### Task 1a: Load the data... without leaking GPU memory

Load the data using `cudf.read_csv()`. It will be in the wrong format initially: every row lands in a single column because the file is tab-separated.

Your task is to do extra work around keeping GPU memory use under control. 
* At the end of every cell, set your dataframe variables to `None` (ex: `gdf = None`) so the garbage collector can free the memory
* You can check GPU memory use via the dashboard and `! nvidia-smi`
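
If you would rather check memory from Python than read the `nvidia-smi` table by eye, the sketch below may help. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--format` options; the helper names `parse_gpu_memory` and `gpu_memory_mib` are made up for this example and are not part of RAPIDS.

```python
import shutil
import subprocess

def parse_gpu_memory(text):
    """Parse 'used, total' MiB pairs (one line per GPU) into a list of tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(','))
        rows.append((used, total))
    return rows

def gpu_memory_mib():
    """Return [(used_mib, total_mib), ...] per GPU, or None if it cannot be read."""
    if shutil.which('nvidia-smi') is None:
        return None  # no NVIDIA driver tools on this machine
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=memory.used,memory.total',
             '--format=csv,noheader,nounits'],
            capture_output=True, text=True, timeout=10)
        if result.returncode != 0:
            return None
        return parse_gpu_memory(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        return None  # e.g. the GPU reports memory as unsupported

print(gpu_memory_mib())
```

Calling `gpu_memory_mib()` once before and once after `gdf = None` gives you two numbers to compare.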

Start by loading sample `./conn1K.log`, and when ready, try `./conn1M.log`, `./conn5M.log`, and the full `./conn.log`.

#### 1a. Reference
Print the last 3 lines of `./conn1K.log`

In [41]:
%%time

gdf = cudf.read_csv('./conn1K.log')

print(gdf.tail(3))

gdf=None

    1331901000.000000\tCCUIP21wTjqkj8ZqX5\t192.168.202.79\t50463\t192.168.229.251\t80\ttcp\t-\t-\t-\t-\tSH\t-\t0\tFa\t1\t52\t1\t52\t(empty)
996  1331901057.960000\tCI1Kjf11YgCRL3jcQ4\t192.168...                                                                                     
997  1331901057.950000\tCaNY2R3azdPp3cLYNg\t192.168...                                                                                     
998  1331901057.620000\tCJS2P118ZwEv34NwR5\t192.168...                                                                                     
CPU times: user 9.37 ms, sys: 4.51 ms, total: 13.9 ms
Wall time: 35.6 ms


#### 1a. Task

* Print the last 3 lines of `./conn5M.log` using `cudf.read_csv`
* Use `! nvidia-smi`, the GPU monitor, and `gdf.memory_usage()` to check memory consumption, including that it drops again after setting `gdf = None`

Afterward, try `./conn.log` and see whether the GPU can hold the full 2.6GB in memory. (RAPIDS has a max string size.)

In [None]:
%%time

gdf = cudf.read_csv('./conn5M.log')

print( gdf.tail(3) )

In [None]:
!nvidia-smi
gdf = None
!nvidia-smi

### Task 1b: Load 500MB

Load the data using tab separation
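
To see why the separator matters, here is a small CPU-only sketch with Python's standard `csv` module (not cuDF). The two rows are made-up, shortened stand-ins for Zeek lines. With the default comma delimiter each row is one long field; with a tab delimiter it splits into columns.

```python
import csv
import io

# Made-up, shortened Zeek-style rows: tab-separated, no commas.
sample = (
    "1331901000.000000\tuidA\t10.0.0.1\t50463\n"
    "1331901001.000000\tuidB\t10.0.0.2\t22\n"
)

comma_rows = list(csv.reader(io.StringIO(sample)))                  # default ','
tab_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))   # tab-separated

print(len(comma_rows[0]))  # 1 field: the whole line
print(len(tab_rows[0]))    # 4 fields
print(tab_rows[1])
```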

#### 1b. Reference

In [None]:
%%time

! ls -alh conn1K.log conn5M.log conn.log
! echo "lines  1K: `wc -l ./conn1K.log`"
! echo "lines  5M: `wc -l ./conn5M.log`"
! echo "lines all: `wc -l ./conn.log`"
! echo
! echo "==========================="
! echo "./conn1K.log first 3 lines:"
! head -n 3 ./conn1K.log
! echo
! echo "==========================="
! echo "via cudf.read_csv()"

gdf = cudf.read_csv('./conn1K.log', sep='\t')
print(gdf.head(3))
gdf = None

#### 1b. Task
1. Read `help(cudf.read_csv)` to learn about the delimiter parameter
1. Use the delimiter option to read the file with tab separation (`"\t"`) 
1. Get it right with `./conn1K.log` and then try `./conn5M.log`
1. ... remember to clear out your GPU memory!

In [None]:
help(cudf.read_csv)

In [None]:
%%time

gdf = cudf.read_csv('./conn1K.log',
                    #### FILL IN ###
                    )

print(gdf.head(3))

gdf = None

### Task 1c: Format & load 2.6GB

`read_csv` has many useful parameters to create data that is cleaner, friendlier, smaller, and faster. 

By using them, we will be able to load the whole dataset without crashing, fairly quickly, and get native operations on top!


#### 1c. Reference

* Read about the parameters `names`, `dtypes`, `usecols`, and byte ranges in `cudf.read_csv()`
* Load the Zeek format
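
As a CPU-side illustration of what these parameters do, here is a sketch using `pandas.read_csv`, where the type parameter is spelled `dtype`. The rows and the reduced column list are made up for the example, and it is not the task solution for the full Zeek schema.

```python
import io
import pandas as pd

# Made-up tab-separated rows without a header; '-' marks a missing value.
sample = (
    "1331901000.5\tuidA\t10.0.0.1\t50463\t10.0.0.9\t80\ttcp\t-\t-\n"
    "1331901001.0\tuidB\t10.0.0.2\t50464\t10.0.0.9\t22\ttcp\t1.25\t300\n"
)
names = ['ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p',
         'proto', 'duration', 'orig_bytes']
usecols = ['ts', 'id.orig_h', 'id.resp_h', 'id.resp_p', 'duration', 'orig_bytes']
# float64 for columns with missing values so pandas can store them as NaN.
dtype = {'ts': 'float64', 'id.orig_h': 'str', 'id.resp_h': 'str',
         'id.resp_p': 'int32', 'duration': 'float64', 'orig_bytes': 'float64'}

df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=names,
                 usecols=usecols, dtype=dtype, na_values=['-', '(empty)'])

print(df.dtypes)
print(df)
```

`names` supplies the header the file lacks, `usecols` drops columns at parse time, and `na_values` turns Zeek's `-` placeholders into missing values.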

In [None]:
help(cudf.read_csv)

In [4]:
cols = ['ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'proto', 'service', 'duration',
        'orig_bytes', 'resp_bytes', 'conn_state', 
        'local_orig?',
        'missed_bytes', 'history', 'orig_pkts',
        'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents']

# One entry per name in `cols`; `ts` holds fractional epoch seconds
dtypes=OrderedDict([
    ('ts', 'float64'), ('uid', 'str'),
    ('id.orig_h', 'str'), ('id.orig_p', 'int32'), ('id.resp_h', 'str'), ('id.resp_p', 'int32'),
    ('proto', 'str'), ('service', 'str'), ('duration', 'float64'),
    ('orig_bytes', 'int64'), ('resp_bytes', 'int64'),
    ('conn_state', 'str'), ('local_orig?', 'str'),
    ('missed_bytes', 'int64'), ('history', 'str'),
    ('orig_pkts', 'int64'), ('orig_ip_bytes', 'int64'), ('resp_pkts', 'int64'), ('resp_ip_bytes', 'int64'),
    ('tunnel_parents', 'str')
])

# optional
cols_subset = [
    'ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 
    'proto', 'duration', 'orig_bytes', 'resp_bytes'
]

#### 1c. Task

* Plug in `names` and `dtypes` to `read_csv`
* ... Take care not to leak GPU memory, and try first on `./conn1K.log` before doing `./conn1M.log` / `./conn5M.log` / the full `./conn.log`.

In [None]:
%%time

gdf = cudf.read_csv(
    ### file ###,
    sep='\t',
    names=### fill in ###,
    dtypes=### fill in ###,
    na_values=['-', '-','(empty)'])

print(gdf.dtypes)

print(gdf.head(3))

gdf = None

**Advanced**: Load in only the columns you will use for your analysis and compare the impact on memory & load time
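
One way to reason about the memory impact is to compare `memory_usage()` totals with and without the wide string columns. The sketch below does this in CPU pandas on made-up data; the assumption is that the same comparison carries over to `gdf.memory_usage()` in cuDF, though the exact numbers will differ.

```python
import pandas as pd

# Made-up flows; the `uid` strings dominate the footprint.
df = pd.DataFrame({
    'uid': ['C%017d' % i for i in range(1000)],
    'id.resp_p': [22, 80, 443, 53] * 250,
    'orig_bytes': list(range(1000)),
})

full_bytes = int(df.memory_usage(deep=True).sum())
subset_bytes = int(df[['id.resp_p', 'orig_bytes']].memory_usage(deep=True).sum())

print('all columns:', full_bytes, 'bytes')
print('numeric subset:', subset_bytes, 'bytes')
```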

## Task 2: Analytics & Wrangling

CPU Pandas operators largely carry over to GPU cuDF. We'll take a look at activity by top IPs.


### Task 2a: Column manipulation

#### 2a. Reference
* df with subset of cols: `gdf[['col1', 'col2']]`
* get one col: `gdf['col1']`
* get df/col length: `len(gdf)` / `len(gs)`
* series of unique elements from a series: `gs.unique()`
* stats on one series: `gs.min()`, `gs.max()`, `gs.sum()`, ...
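
A quick CPU pandas sketch of the operators above, on a made-up table and on the `id.orig_h` column rather than the task's `id.resp_h`. The same calls are expected to work on a cuDF dataframe.

```python
import pandas as pd

df = pd.DataFrame({
    'id.orig_h': ['10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.3'],
    'id.resp_h': ['10.0.0.9', '10.0.0.9', '10.0.0.8', '10.0.0.9'],
    'orig_bytes': [100, 250, 40, 10],
})

pairs = df[['id.orig_h', 'id.resp_h']]   # dataframe with a subset of columns
s = df['id.orig_h']                      # one column as a series
uniq = s.unique()                        # unique elements of the series

print(len(df), len(s), len(uniq))        # 4 4 3
print(df['orig_bytes'].min(), df['orig_bytes'].max(), df['orig_bytes'].sum())  # 10 250 400
print(s.value_counts())                  # how often each IP appears
```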

#### 2a. Task

**Intro**: For the column of IPs `id.resp_h`, how many unique IPs are there? Start with `conn1K.log` and then try on the full `conn.log`.

In [None]:
%%time

gdf = cudf.read_csv('./conn1K.log', 
              sep='\t', 
              names=cols,
              dtypes=dtypes,
              usecols=['id.resp_h'],
              na_values=['-', '-','(empty)'])


unique_resp_ips = ### get 'id.resp_h' column and then its unique values ###


print('# unique', len(unique_resp_ips))
print(unique_resp_ips[:10])
unique_resp_ips = None
gdf = None

**Intermediate/advanced**: If you have time after 2b/c, for the column of bytes `orig_bytes`, what is the biggest payload?

### Task 2b: Group by & column summaries

Group-by is quite powerful: you can group dataframe rows and get summary statistics for each group.

In this task, we'll summarize flows between different computers.

#### 2b. Reference

Pattern

```python
gdf\
    .groupby(['col1', 'col2', ...])\
    .agg({
        'col2': ['min', 'max'],
        'col3': 'min',
        'col4': ['count', 'mean', 'nunique'],
    })\
    .reset_index()
```
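
Here is the pattern run end to end in CPU pandas on a made-up table with hypothetical column names (`src`, `dst`, `bytes`, `port`). Passing a list of statistics produces two-level column names, so the last step flattens them; joining the non-empty parts is one option among several.

```python
import pandas as pd

df = pd.DataFrame({
    'src': ['a', 'a', 'b', 'a'],
    'dst': ['x', 'x', 'x', 'y'],
    'bytes': [10, 30, 5, 7],
    'port': [22, 80, 22, 22],
})

out = (df.groupby(['src', 'dst'])
         .agg({'bytes': ['min', 'max', 'sum'], 'port': 'nunique'})
         .reset_index())

# Columns are now tuples such as ('bytes', 'min') and ('src', '');
# join the non-empty parts to get flat names such as 'bytes_min' and 'src'.
out.columns = ['_'.join(part for part in col if part) for col in out.columns]

print(out)
```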

#### 2b. Task

Compute summaries for `./conn1K.log` then the full `./conn.log` for:

* `duration`: min/max/mean
* `orig_bytes`: min/max/mean/sum
* `resp_bytes`: min/max/mean/sum

In [None]:
%%time

gdf = cudf.read_csv('./conn1K.log', 
              sep='\t', 
              names=cols,
              dtypes=dtypes,
              usecols=cols_subset,
              na_values=['-', '-','(empty)'])


out = gdf.groupby(['id.resp_h', 'id.orig_h'])\
    .agg({
        'ts': ['count', 'min', 'max', 'mean'],
        'uid': 'nunique',
        'id.resp_p': ['min', 'max', 'nunique'],
        'proto': ['nunique'],
        'duration': ### min/max/mean ###
        'orig_bytes': ### min/max/mean/sum ###
        'resp_bytes': ### min/max/mean/sum ###
    }).reset_index()


gdf = None
print(out.shape)
print(out.dtypes)
print(out.head(3))
out = None

**Advanced**: Which are the biggest exfils and longest sessions for SSH connections? Hint: use `gdf.sort_values`.
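
A minimal sketch of the `sort_values` hint in CPU pandas, on a made-up summary table whose column names follow the `<column>_<stat>` style. It shows the sorting step only; filtering to SSH is left to you.

```python
import pandas as pd

out = pd.DataFrame({
    'id.orig_h': ['10.0.0.1', '10.0.0.2', '10.0.0.3'],
    'id.resp_h': ['10.0.0.9', '10.0.0.9', '10.0.0.8'],
    'orig_bytes_sum': [1200, 98000, 40],
    'duration_max': [3.5, 0.2, 610.0],
})

# Largest first: sort descending, then keep the top rows.
top_bytes = out.sort_values('orig_bytes_sum', ascending=False).head(2)
longest = out.sort_values('duration_max', ascending=False).head(1)

print(top_bytes)
print(longest)
```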

## Task 3: Visualize!

Graphistry lets you plot many points and, even more interestingly, the relationships between them, then visually filter and cluster them on the fly.

### Task 3a: Run Graphistry

Already coded:

* Nodes: IPs
* Edges: Summarized flows

**Try to:**

* Pan/zoom: Similar to Google Maps - click/drag/scroll (or pinch); recenter
* Inspect: click on a node/edge, open table inspector
* Advanced: histogram, filter, cluster, color
  * Add a histogram for `resp_bytes_sum` and hover over bars to see the heaviest download links
  * Open the timebar and shift-click to create a filter for edges on the second day
* Advanced: rendering + clustering settings

In [5]:
def compute_groupby(file='./conn.log'):

    gdf = cudf.read_csv(file, 
                  sep='\t', 
                  names=cols,
                  dtypes=dtypes,
                  usecols=cols_subset,
                  na_values=['-', '-','(empty)'])

    out = gdf.groupby(['id.resp_h', 'id.orig_h'])\
        .agg({
            'ts': ['count', 'min', 'max', 'mean'],
            'uid': 'nunique',
            'id.resp_p': ['min', 'max', 'nunique'],
            'proto': ['nunique'],
            'duration': ['min', 'max', 'mean', 'sum'],
            'orig_bytes': ['min', 'max', 'mean', 'sum'],
            'resp_bytes': ['min', 'max', 'mean', 'sum'],
        }).reset_index()


    ########### Data cleaning: normal column names and times as actual timestamps

    out.columns = out.columns.to_flat_index() # -> col_name = (col, stat)
    out.columns = [ '%s_%s' % c for c in out.columns ]

    out = out.rename(columns={
        'id.resp_h_': 'id.resp_h',
        'id.orig_h_': 'id.orig_h',
    })

    out['ts_min'] = cudf.Series(pd.to_datetime((out['ts_min']*1000000000).to_pandas()))
    out['ts_max'] = cudf.Series(pd.to_datetime((out['ts_max']*1000000000).to_pandas()))
    out['ts_mean'] = cudf.Series(pd.to_datetime((out['ts_mean']*1000000000).to_pandas()))
    
    return out

out = compute_groupby('./conn1K.log')
    
print('# rows', len(out))
print('dtypes', out.dtypes)
print(out.head(3))

out = None

# rows 58
dtypes id.resp_h                    object
id.orig_h                    object
ts_count                      int32
ts_min               datetime64[ns]
ts_max               datetime64[ns]
ts_mean              datetime64[ns]
uid_nunique                   int32
id.resp_p_min                 int64
id.resp_p_max                 int64
id.resp_p_nunique             int32
proto_nunique                 int32
duration_min                float64
duration_max                float64
duration_mean               float64
duration_sum                float64
orig_bytes_min              float64
orig_bytes_max              float64
orig_bytes_mean             float64
orig_bytes_sum              float64
resp_bytes_min              float64
resp_bytes_max              float64
resp_bytes_mean             float64
resp_bytes_sum              float64
dtype: object
      id.resp_h       id.orig_h  ts_count                        ts_min  \
0    10.21.6.40  192.168.202.89         2 2012-03-16 12:30:32.1799
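
The `*1000000000` in `compute_groupby` is there because Zeek's `ts` is in seconds since the Unix epoch, while `pd.to_datetime` reads bare numbers as nanoseconds by default. The standard-library sketch below makes the same conversion explicit for one value; the helper name `epoch_to_utc` is made up for this example.

```python
from datetime import datetime, timezone

def epoch_to_utc(ts_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc)

ts = 1331901000.0  # a `ts` value from the conn1K.log sample output
ns = int(ts * 1_000_000_000)  # the same instant in nanoseconds

print(epoch_to_utc(ts).isoformat())  # 2012-03-16T12:30:00+00:00
print(ns)                            # 1331901000000000000
```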

In [10]:
%%time

gdf = compute_groupby('./conn.log')

g = graphistry.edges(gdf).bind(source='id.orig_h', destination='id.resp_h')

print('Computed network, now creating plot...')
gdf = None

#### render=False returns the plot URL instead of rendering inline; try g.plot() for an inline view
g.plot(render=False)

Computed network, now creating plot...
CPU times: user 5.64 s, sys: 322 ms, total: 5.97 s
Wall time: 6.45 s


'https://hub.graphistry.com/graph/graph.html?dataset=d14c52196b6443b6a5d53dadd438652e&type=arrow&viztoken=d39feb19-4345-4834-ab66-7a8820af1cfb&usertag=82077d8e-pygraphistry-0.11.2&splashAfter=1594787644&info=true'

In [None]:
g = None

### Task 3b: Map SSH activity

* Copy & rename `compute_groupby` as `compute_groupby2` 
* Filter connections to just SSH traffic (port 22) based on column `id.resp_p` 
* Plot
* Answer: Which link transferred the most data out over SSH, and is that typical of the sender/receiver?

#### 3b. Reference

Hint: To filter, `gdf2 = gdf[ gdf['some_col'] == some_val]`
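
The hint is a boolean mask: the comparison yields one True/False per row, and indexing with it keeps the True rows. A CPU pandas sketch on made-up rows follows; the same indexing is expected to work on a cuDF dataframe.

```python
import pandas as pd

df = pd.DataFrame({
    'id.resp_p': [22, 80, 22, 443],
    'resp_bytes': [500, 20, 1500, 70],
})

mask = df['id.resp_p'] == 22   # one boolean per row
ssh = df[mask]                 # keep only rows where the mask is True

print(mask.tolist())             # [True, False, True, False]
print(len(ssh))                  # 2
print(ssh['resp_bytes'].sum())   # 2000
```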

#### 3b. Task

In [11]:
def compute_groupby2(file='./conn1K.log'):

    gdf = cudf.read_csv(file, 
                  sep='\t', 
                  names=cols,
                  dtypes=dtypes,
                  usecols=cols_subset,
                  na_values=['-', '-','(empty)'])
    
    gdf = gdf[ gdf['id.resp_p'] == 22 ]

    out = gdf.groupby(['id.resp_h', 'id.orig_h'])\
        .agg({
            'ts': ['count', 'min', 'max', 'mean'],
            'uid': 'nunique',
            'id.resp_p': ['min', 'max', 'nunique'],
            'proto': ['nunique'],
            'duration': ['min', 'max', 'mean', 'sum'],
            'orig_bytes': ['min', 'max', 'mean', 'sum'],
            'resp_bytes': ['min', 'max', 'mean', 'sum'],
        }).reset_index()


    ########### Data cleaning: normal column names and times as actual timestamps

    out.columns = out.columns.to_flat_index() # -> col_name = (col, stat)
    out.columns = [ '%s_%s' % c for c in out.columns ]

    out = out.rename(columns={
        'id.resp_h_': 'id.resp_h',
        'id.orig_h_': 'id.orig_h',
    })

    out['ts_min'] = cudf.Series(pd.to_datetime((out['ts_min']*1000000000).to_pandas()))
    out['ts_max'] = cudf.Series(pd.to_datetime((out['ts_max']*1000000000).to_pandas()))
    out['ts_mean'] = cudf.Series(pd.to_datetime((out['ts_mean']*1000000000).to_pandas()))
    
    return out

In [12]:
### Test

out = compute_groupby2('./conn1K.log')
    
print('# rows', len(out))
print('dtypes', out.dtypes)
print(out.head(3))

out = None

# rows 4
dtypes id.resp_h                    object
id.orig_h                    object
ts_count                      int32
ts_min               datetime64[ns]
ts_max               datetime64[ns]
ts_mean              datetime64[ns]
uid_nunique                   int32
id.resp_p_min                 int64
id.resp_p_max                 int64
id.resp_p_nunique             int32
proto_nunique                 int32
duration_min                float64
duration_max                float64
duration_mean               float64
duration_sum                float64
orig_bytes_min              float64
orig_bytes_max              float64
orig_bytes_mean             float64
orig_bytes_sum              float64
resp_bytes_min              float64
resp_bytes_max              float64
resp_bytes_mean             float64
resp_bytes_sum              float64
dtype: object
        id.resp_h       id.orig_h  ts_count                        ts_min  \
0  192.168.23.254  192.168.202.68         1 2012-03-16 12:30:30.2

In [14]:
%%time

### Plot

gdf = compute_groupby2('./conn.log')

g = graphistry.edges(gdf).bind(source='id.orig_h', destination='id.resp_h')

gdf = None

#### render=False returns the plot URL instead of rendering inline; try g.plot() for an inline view
g.plot(render=False)

CPU times: user 1.52 s, sys: 237 ms, total: 1.75 s
Wall time: 2.16 s


'https://hub.graphistry.com/graph/graph.html?dataset=d5f6d907ad784bdf8d2bac6d91f87be8&type=arrow&viztoken=cad9999f-3b37-473f-a1bd-e3b2ac592096&usertag=82077d8e-pygraphistry-0.11.2&splashAfter=1594788176&info=true'

### After

#### Do
* Solution notebook
* Slack channel
* [Subscribe to future sessions](learnrapids.com): Multi-GPU, ...
* Guide our roadmap! RAPIDS Academy + RAPIDS ecosystem survey

#### References
* [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
* 2.6GB of Zeek connection logs (22M rows) from https://www.secrepo.com/ 
* RAPIDS docs: https://docs.rapids.ai/api/cudf/stable/api.html