# Jupyter Playbooks for Suricata

* Markus Kont
* Stamus Networks
* github.com/markuskont
* twitter.com/markuskont

## Introduction

* Introduce a tool
  * not for experienced data scientists
  * spark some ideas
* Focus on use-cases around Suricata
  * no iris dataset
* Might not have time to cover everything
  * presentation is meant to be a resource

### Fmt.presentation()

 * Presentation **IS** a notebook
 * it is public
 * code examples are live
 * all data is generated by the notebook
   * (from `malware-traffic-analysis.net`)

### whoami

* 2011: Server Administrator
* 2014: Cyber Security MSc, TalTech
  * 2015: PhD candidate
* 2015: Technology Branch Researcher, NATO CCDCOE
  * trainings, exercises, research
* 2020: Developer & Threat Researcher, Stamus Networks
  * focus on analytics and threat hunting
* 2021: Dad
* Always: hacker

## About Jupyter

 * Initially IPython Notebooks
   * interactive coding
   * instant feedback
 * Then rebranded to Jupyter
   * de facto tool for a data scientist
 * Supports different *kernels*
   * R
   * nodejs
   * julia
   * Go
   * ...

### pip install jupyter

#### Basic concepts

 * Organized into *cells*
 * *Cell* can be *code* or *markdown*
 * Cell is executed by *kernel*
 * JupyterLab is like an IDE

#### Installing

```
pip install jupyter jupyterlab
```

#### Starting it up

```
(general) ➜  suricata-analytics-1 git:(next-suricon-2022-10-28) ✗ jupyter lab
[I 2022-10-30 06:10:48.141 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-10-30 06:10:48.150 ServerApp] nbclassic | extension was successfully linked.
[I 2022-10-30 06:10:48.170 LabApp] JupyterLab extension loaded from /home/markus/venvs/general/lib/python3.10/site-packages/jupyterlab
[I 2022-10-30 06:10:48.170 LabApp] JupyterLab application directory is /home/markus/venvs/general/share/jupyter/lab
[I 2022-10-30 06:10:48.173 ServerApp] jupyterlab | extension was successfully loaded.
[I 2022-10-30 06:10:48.177 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-10-30 06:10:48.177 ServerApp] The port 8888 is already in use, trying another port.
[I 2022-10-30 06:10:48.178 ServerApp] Serving notebooks from local directory: /home/markus/Projects/SN/suricata-analytics-1
[I 2022-10-30 06:10:48.178 ServerApp] Jupyter Server 1.21.0 is running at:
[I 2022-10-30 06:10:48.178 ServerApp] http://localhost:8889/lab?token=b675c4daec9a6c2beb11b0a6cd38a314509ae62b1989b2e2
[I 2022-10-30 06:10:48.178 ServerApp]  or http://127.0.0.1:8889/lab?token=b675c4daec9a6c2beb11b0a6cd38a314509ae62b1989b2e2
[I 2022-10-30 06:10:48.178 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2022-10-30 06:10:48.216 ServerApp]

    To access the server, open this file in a browser:
        file:///home/markus/.local/share/jupyter/runtime/jpserver-395207-open.html
    Or copy and paste one of these URLs:
        http://localhost:8889/lab?token=b675c4daec9a6c2beb11b0a6cd38a314509ae62b1989b2e2
     or http://127.0.0.1:8889/lab?token=b675c4daec9a6c2beb11b0a6cd38a314509ae62b1989b2e2
Opening in existing browser session.
```

#### Code

It is SuriCon, so let's start the demo by downloading a PCAP file with **pure Python**. The purpose is to:

* show simple Python code in a notebook;
* get the initial input for the next *slides*.

Firstly, import supporting libraries.

In [None]:
import requests
from zipfile import ZipFile

Then define download link and output path as variables.

In [None]:
URL = "https://malware-traffic-analysis.net/2022/01/03/2022-01-01-thru-03-server-activity-with-log4j-attempts.pcap.zip"
OUTPUT = "/tmp/malware-pcap.zip"

Download and store the file. Notice the real-time output as the code gets evaluated.

In [None]:
response = requests.get(URL, stream=True)
if response.status_code == 200:
    # Content-Length may be absent, so fall back to 0 instead of failing on int(None)
    size_kb = int(response.headers.get("Content-Length", 0)) / 1024
    print("Download good, writing %d KBytes to %s" % (size_kb, OUTPUT))
    with open(OUTPUT, 'wb') as f:
        f.write(response.raw.read())
    print("Done")
else:
    print("Demo effect has kicked in")

Then unzip the archive.

In [None]:
file_name = OUTPUT
# named `archive` so that the builtin `zip` is not overwritten for later cells
with ZipFile(file_name, "r") as archive:
    archive.extractall(path="/tmp", pwd="infected".encode("utf-8"))

Find the PCAP and store for later use.

In [None]:
import glob
FILES = glob.glob("/tmp/*.pcap")
FILES

In [None]:
print(FILES[0])

#### Invoking a Shell command

* Writing code to do some simple things can be a hassle
* Jupyter provides some helpers
    * `%` calls builtin magic commands
    * `!` invokes any shell command

For example, we need a Suricata ruleset to proceed with presentation.

In [None]:
%pip install suricata-update

In [None]:
!/home/jovyan/.local/bin/suricata-update enable-source tgreen/hunting

In [None]:
!/home/jovyan/.local/bin/suricata-update

In [None]:
!rm -rf /tmp/logs && mkdir /tmp/logs

In [None]:
!suricata -S /var/lib/suricata/rules/suricata.rules -l /tmp/logs -r /tmp/2022-01-01-thru-03-server-activity-with-log4j-attempts.pcap -v
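
Outside a notebook the `!` escape is not available. A rough plain-Python counterpart is `subprocess.run`. The sketch below is only an illustration, not what Jupyter does internally. It runs the Python interpreter itself as the child process, as a stand-in for a tool such as `suricata-update`, so that it works on any machine.

```python
import subprocess
import sys

# Run a child process and capture its output, roughly what `!command` gives us in a cell.
# The command (the current Python interpreter printing one line) is a placeholder.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError on a non-zero exit code
)
print(result.stdout.strip())
```

Passing the command as a list avoids shell quoting problems, which matters once file paths come from variables such as `FILES[0]`.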

## Import pandas as pd

* `pandas` is a python library that provides *dataframes*
* more than a library, it's actually a language by itself
* think R and Julia
* forget what you know about for loops
  * but it's totally worth it!

In [None]:
%pip install pandas

In [None]:
import pandas as pd
import numpy as np
import json

In [None]:
pd.options.display.html.use_mathjax = False

### pd.DataFrame

* dataframe gives us nice row-column view of data

In [None]:
pd.DataFrame([{"src_ip": "1.1.1.1", "flow_id": 123}, {"src_ip": "2.2.2.2", "flow_id": 124}])

### Loading EVE data

Parse json documents into python dictionaries and then normalize nested key-value pairs with `pd.json_normalize`.

In [None]:
with open("/tmp/logs/eve.json", "r") as handle:
    DF = pd.json_normalize([
        json.loads(line) for line in handle
    ])
DF
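
To see what `pd.json_normalize` does without a full `eve.json`, here is a minimal sketch on two hand-written lines. The events are invented and carry far fewer fields than real EVE output. The point is only how nested keys turn into dotted column names.

```python
import io
import json

import pandas as pd

# Two made-up lines imitating EVE output; real events have many more fields.
SAMPLE_EVE = io.StringIO(
    '{"event_type": "alert", "flow_id": 1, "alert": {"signature": "demo sig", "category": "demo"}}\n'
    '{"event_type": "http", "flow_id": 1, "http": {"hostname": "example.com", "url": "/index.html"}}\n'
)

sample_df = pd.json_normalize([json.loads(line) for line in SAMPLE_EVE])

# Nested keys become dotted column names; fields missing from an event are NaN.
print(sorted(sample_df.columns))
print(sample_df[["event_type", "alert.signature", "http.url"]])
```

This is why later cells refer to columns such as `alert.category` and `http.url`.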

### Measuring the data

Pandas method for getting total number of *rows* and *columns*.

In [None]:
DF.shape

Both can also be measured in native python.

In [None]:
len(DF)

In [None]:
len(DF.columns.values)

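The native Python approach does have an advantage: it can be combined with a filter, for example to count only the columns that are not `stats` counters.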

In [None]:
len([c for c in list(DF.columns.values) if not c.startswith("stats")])

### Describe

* `describe` for statistical overview per column
* not terribly useful for EVE NSM
  * ...unless we actually want to see `stats` or `flow` data
* but interesting nevertheless

In [None]:
DF.describe()

## Basic Exploration and Hunting

* Prepare
* Locate
* Subset
* Pivot
* Enrich
* Dump

### Prepare

Before moving on, it's a good idea to apply some data preparation techniques. Here we:
* convert `timestamp` column to actual `datetime` objects
* convert `flow_id` to `int` values
    * numbers default to `float`, data science tools assume we do statistical analysis
    * we need to convert `NaN` values to something, `stats` events do not have `flow_id`!

In [None]:
DF["timestamp"] = pd.to_datetime(DF["timestamp"])

In [None]:
DF["flow_id"] = (
    DF
    .flow_id
    .fillna(0)
    .astype(int)
)
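
The same preparation on a toy frame makes the effect on dtypes visible. This is a sketch with invented values. The second option, the nullable `Int64` dtype, is an alternative added here for comparison and is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame: the second row imitates a `stats` event, which has no flow_id.
toy = pd.DataFrame({
    "timestamp": ["2022-01-01T10:00:00.000000+0000", "2022-01-01T10:00:08.000000+0000"],
    "event_type": ["alert", "stats"],
    "flow_id": [123.0, np.nan],
})

toy["timestamp"] = pd.to_datetime(toy["timestamp"])
# Option 1 (as in the notebook): replace NaN with 0, then cast to int.
toy["flow_id_int"] = toy["flow_id"].fillna(0).astype(int)
# Option 2 (alternative): the nullable integer dtype keeps the gap as <NA>.
toy["flow_id_nullable"] = toy["flow_id"].astype("Int64")

print(toy.dtypes)
print(toy)
```

With option 1, a `flow_id` of 0 means "no flow", so filters on real flow IDs are unaffected.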

### Locate

Consider a simple example of *flow ID correlation*. Before we can actually do it, we need to explore our data.

* we start by digging into `event_type` `alert`
* exclude noisy categories that are not interesting for now
* sort output by timestamp
* drop columns that contain only missing values

In [None]:
DF_ALERT = (
    DF
    .loc[DF.event_type == "alert"]
    .loc[DF["alert.category"] != "Generic Protocol Command Decode"]
    .sort_values(by=["timestamp"], ascending=True)
    .dropna(how="all", axis=1)
)

In [None]:
len(DF_ALERT)

### Subset

Alerts are pretty easy, since we already know what columns to look at first.

* Timestamp
* *Flow* IP pair
* flow ID to pick up interesting ones for correlation
* signature is most interesting data point
* category for extra context and logical grouping

In [None]:
DF_ALERT[["timestamp", "flow.src_ip", "flow.dest_ip", "flow_id", "alert.signature", "alert.category"]]

### Pivot

Let's do some Flow correlation:
* Dig into the flow with the Log4j hunting alert
* See what fields are available
* Make selection
* Display data
* Dump it for analysis

In [None]:
FLOW_ID = input()

In [None]:
DF_LOG4J = (
    DF
    .loc[DF.flow_id == int(FLOW_ID)]
    .dropna(how="all", axis=1)
)
DF_LOG4J

In [None]:
pd.Series(list(DF_LOG4J.columns.values))

In [None]:
COLUMNS = ["src_ip", "dest_ip", "event_type", "http.hostname", "http.url", "fileinfo.filename", "http.http_refer"]
DF_LOG4J[COLUMNS]
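
The pivot above types one flow ID in by hand. As a sketch of the same idea for several alerts at once, the flow IDs can be collected from the alert rows and fed to `isin`. The events below are invented, and the variable names are placeholders.

```python
import pandas as pd

# Invented events: flow 1 has an alert plus protocol logs, flow 2 has no alert.
events = pd.DataFrame([
    {"flow_id": 1, "event_type": "alert", "alert.signature": "demo hunting sig"},
    {"flow_id": 1, "event_type": "http", "http.url": "/a"},
    {"flow_id": 1, "event_type": "fileinfo", "fileinfo.filename": "/a"},
    {"flow_id": 2, "event_type": "http", "http.url": "/b"},
])

# Step 1: take the flow_id values of the alerts we care about.
alert_flows = events.loc[events.event_type == "alert", "flow_id"].unique()

# Step 2: pull every event, of any type, that shares one of those flow IDs.
correlated = (
    events
    .loc[events.flow_id.isin(alert_flows)]
    .dropna(how="all", axis=1)
)
print(correlated)
```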

### Extract interesting fields

We can clearly see what we're looking for, but the dataframe is noisy and long cell values are truncated in the display. We can easily extract the `unique` values of the columns we're really interested in.

In [None]:
(
    DF_LOG4J["http.url"]
    .dropna()
    .unique()
)

In [None]:
(
    DF_LOG4J["http.http_refer"]
    .dropna()
    .unique()
)

### Prepare extractor

* We now know we are hunting for malicious base64 payloads
* Analyst might need to report all injections
* So we can prepare some code to extract and decode all those scripts

In [None]:
import re
PATTERN = re.compile(r"Base64/([A-Za-z0-9]+={,2})")
def extract_base64(x: str):
    # rows without a URL stay missing
    if pd.isna(x):
        return np.nan  # np.NaN was removed in NumPy 2.0
    match = PATTERN.search(x)
    if match:
        return match.group(1)
    return np.nan
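
A quick way to check such a pattern is to run it against a URL built by hand. In this sketch the command and the URL path are made up, and the payload is encoded on the spot so the round trip can be verified. Note that the character class does not cover `+` and `/`, which are part of the standard Base64 alphabet, so a payload containing them would be cut short.

```python
import base64
import re

PATTERN = re.compile(r"Base64/([A-Za-z0-9]+={0,2})")

# Made-up command and URL path, built here so the example checks itself.
command = "echo hello"
encoded = base64.b64encode(command.encode("utf-8")).decode("ascii")
url = "/Basic/Command/Base64/" + encoded

match = PATTERN.search(url)
payload = match.group(1) if match else None
decoded = base64.b64decode(payload).decode("utf-8")

print(payload)
print(decoded)
```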

### Extract payload

* Forget what you know about `for` loops
* Instead, we `apply` functions over vectors of data
* Likely not most efficient over text fields
* Crucial when analysing numeric statistics
* But hey, the code is nice and clean
    * and, *linear*, easy to put into report
    * it's meant for writing academic papers

In [None]:
DF["base64_payload"] = DF["http.url"].apply(extract_base64)

In [None]:
DF_REPORT = DF.loc[pd.notna(DF["base64_payload"])].copy()

In [None]:
len(DF_REPORT)
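
For plain pattern extraction, pandas also has a built-in string method that avoids the custom function. The sketch below uses `Series.str.extract` on three invented values. It is an alternative for comparison, not the approach the notebook takes.

```python
import pandas as pd

# Invented URL column: one payload, one ordinary URL, one missing value.
urls = pd.Series([
    "/Basic/Command/Base64/ZWNobyBoZWxsbw==",
    "/index.html",
    None,
])

# One capture group with expand=False returns a Series; non-matches become NaN.
payloads = urls.str.extract(r"Base64/([A-Za-z0-9]+={0,2})", expand=False)
print(payloads)
```

The `apply` version remains the more flexible one, since the function can hold any Python logic.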

### Decode payloads

* Beauty of working with pure python
* Can call anything that's useful, such as `base64.b64decode`

In [None]:
import base64

In [None]:
DF_REPORT["decoded_payload"] = DF_REPORT["base64_payload"].apply(base64.b64decode)
DF_REPORT["decoded_payload"] = DF_REPORT["decoded_payload"].apply(lambda x: x.decode("utf-8"))

In [None]:
pd.set_option('display.max_colwidth', None)
DF_REPORT[["flow_id", "base64_payload", "decoded_payload"]]
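
Real traffic can contain truncated or binary payloads, in which case the two `apply` calls above would raise. A more forgiving decoder could look like the sketch below. The name `safe_b64decode` is hypothetical and is not part of the notebook.

```python
import base64
import binascii
from typing import Optional


def safe_b64decode(value: str) -> Optional[str]:
    """Decode a Base64 string to text; return None when it is not valid Base64."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None
    # Replace undecodable bytes instead of raising, since payloads may be binary.
    return raw.decode("utf-8", errors="replace")


good = base64.b64encode(b"echo hello").decode("ascii")
print(safe_b64decode(good))
print(safe_b64decode("not base64!"))
```

It can be used the same way, for example `DF_REPORT["base64_payload"].apply(safe_b64decode)`.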

### Dump report

In [None]:
(
    DF_REPORT[[
        "event_type",
        "src_ip", 
        "dest_ip", 
        "flow_id", 
        "http.hostname", 
        "http.url", 
        "base64_payload", 
        "decoded_payload"
    ]]
    .to_csv("./report.csv")
)

## Analytics

Notebooks and data science tricks bring a lot of tools to the table:
* Aggregations
* Visualizations
* Clustering
* **Widgets**

### Aggregate

* Previous example was the *long way*
* Aggregations and *uniqueness* could have brought us to same point

In [None]:
DF_AGG_HTTP = (
    DF
    .loc[DF["alert.category"] != "Generic Protocol Command Decode"]
    .loc[pd.notna(DF["http.url"])]
    .groupby("alert.signature")
    .agg({
        "timestamp": ["min", "max"],
        "http.url": ["unique", "count"],
    })
)
DF_AGG_HTTP
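
The shape of that result is easier to see on a tiny frame. This sketch uses invented signatures and URLs, and swaps `unique` for `nunique` so that every cell is a plain number. It mainly shows that the aggregated columns get two-level names.

```python
import pandas as pd

# Invented alerts: two hits for one signature, one for another.
toy = pd.DataFrame({
    "alert.signature": ["sig A", "sig A", "sig B"],
    "timestamp": pd.to_datetime(["2022-01-01 10:00", "2022-01-01 11:00", "2022-01-02 09:00"]),
    "http.url": ["/a", "/a", "/b"],
})

agg = (
    toy
    .groupby("alert.signature")
    .agg({
        "timestamp": ["min", "max"],
        "http.url": ["nunique", "count"],
    })
)

# Columns are tuples such as ("http.url", "count").
print(agg.columns.tolist())
print(agg[("http.url", "count")])
```

Those tuple names are what the plotting call in the next section selects with `y=("http.url", "count")`.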

### Simple visualizations

* Pandas can easily wrap around `matplotlib` to generate simple visualizations
* Aggregations are not done seamlessly, data must be prepared
* But then you just select what columns you want

In [None]:
DF_AGG_HTTP.plot.pie(y=("http.url", "count"))

### Look into numbers

* Pie charts look nice but are also a meme
* Hard to scale
* *why not just use Kibana*
* data science tools really shine when analyzing numerical data

In [None]:
DF_FLOW = (
    DF
    .loc[DF.event_type == "flow"]
    [["flow_id", "app_proto", "flow.bytes_toserver", "flow.bytes_toclient"]]
    .copy()  # independent copy, so adding a cluster column later is safe
)

In [None]:
DF_FLOW.plot.scatter(x="flow.bytes_toserver", y="flow.bytes_toclient")

### Clustering - k-means

Let's try to apply `k-means` *clustering* to group those flows!

In [None]:
from sklearn.cluster import KMeans
KMEANS = KMeans(n_clusters=3, max_iter=500000, init='k-means++')
KMEANS.fit(DF_FLOW[["flow.bytes_toserver", "flow.bytes_toclient"]])

Importantly, we need to extract the cluster numbers as *labels* and attach them as another dataframe vector.

In [None]:
DF_FLOW["cluster"] = KMEANS.labels_
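
To make `KMeans` less of a black box, here is a minimal sketch of the underlying idea (Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation: initialization is naive rather than `k-means++`, and the flow sizes are invented.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init: k random points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Invented (bytes_toserver, bytes_toclient) pairs: three small flows, three large ones.
flows = [(100, 200), (120, 180), (90, 210), (10000, 50000), (11000, 52000), (9000, 48000)]
labels, centroids = kmeans(flows, k=2)
print(labels)
print(centroids)
```

Cluster numbers carry no meaning of their own. Which group ends up as 0 or 1 depends on the initialization.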

### Plot clustered data

Then map colors based on cluster. Notice the `map` method, which here is basically `apply` with a dictionary lookup: each cluster number is replaced by its color.

In [None]:
COLOR = DF_FLOW.cluster.map({0:'b', 1:'r', 2: 'g'})

In [None]:
DF_FLOW.plot.scatter(x="flow.bytes_toserver", y="flow.bytes_toclient", c=COLOR)
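
The `map` lookup used for the colors can be tried in isolation. In this small sketch the cluster numbers are invented, and the value 3 is included on purpose to show what happens to a key that is missing from the dictionary.

```python
import pandas as pd

clusters = pd.Series([0, 2, 1, 0, 3])
colors = clusters.map({0: "b", 1: "r", 2: "g"})

# A value without a dictionary entry (3 here) becomes NaN rather than raising.
print(colors.tolist())
```

So if `n_clusters` is raised, the color dictionary needs a matching entry for every new label.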

### Holoviews and Bokeh

* Matplotlib is nice but rather basic
* It's mostly for exporting pictures into academic papers
* We want interactions and pretty javascript-style pictures!
* Enter `hvplot`
    * basically a drop-in replacement

In [None]:
import hvplot.pandas
DF_FLOW.hvplot.scatter(
    x="flow.bytes_toserver", 
    y="flow.bytes_toclient", 
    c=COLOR,
    hover_cols=["flow_id", "cluster"]
)

### Holoviews and Bokeh - builtin grouping

* Can bring a lot more to the table
* For example, it wraps around pandas grouping and aggregation feature
* Different view of data

In [None]:
# Just a deprecation notice I did not want, nothing to see here
import warnings
warnings.filterwarnings("ignore")

import hvplot.pandas
DF_FLOW.hvplot.scatter(
    by=["app_proto"],
    x="flow.bytes_toserver", 
    y="flow.bytes_toclient", 
)

### msticpy

* https://msticpy.readthedocs.io/en/latest/
* python library from MSTIC - Microsoft Threat Intelligence Center
* lot of useful helpers to visualize, enrich, explore data
* for example - timeline graph

In [None]:
from msticpy.vis.timeline import display_timeline, display_timeline_values
from msticpy.vis.timeline_duration import display_timeline_duration

In [None]:
display_timeline(
    DF.loc[DF["alert.category"] != "Generic Protocol Command Decode"], 
    group_by="alert.signature_id", 
    time_column="timestamp",
    source_columns=["src_ip", "dest_ip"],
    legend="right",
    width=800,
    height=600
)

### msticpy - advanced visualizations

When was something first observed? How long did it last?

In [None]:
display_timeline_duration(
    DF.loc[DF.event_type == "alert"], 
    group_by="alert.signature", 
    time_column="timestamp",
    width=800,
    height=600
)

### Widgets

* Notebooks require a lot of coding
* Not always convenient, especially when exploring
* Hunting workflows are often non-linear
* https://ipywidgets.readthedocs.io/en/latest/
* Widgets make the notebook come alive
* Though at cost, output is not recorded like manual code

In [None]:
import ipywidgets as widgets

### Handler function

* display filtered global data
* we need pattern matching
* dataframe has too many columns, would be nice to select
* dataframe has too many rows, might overload the notebook
* can add any interaction we want really
* sometimes tricky workarounds are needed

In [None]:
def show_http(limit: int, url_pattern: str, columns: tuple, src_ip: str) -> pd.DataFrame:
    pd.set_option('display.max_rows', limit)
    pd.set_option('display.min_rows', limit)
    df = DF.loc[pd.notna(DF["http.url"])]
    if src_ip != "":
        df = df.loc[df.src_ip.str.contains(src_ip)]
    return (
        df[list(columns)]
        .loc[df["http.url"].str.contains(url_pattern, flags=re.IGNORECASE)]
    )
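
The handler is an ordinary function, so its filtering logic can be tried without any widget. The sketch below uses a hypothetical variant, `filter_http`, that takes the dataframe as an argument instead of reading the global `DF`. The data is invented, and `regex=False` for the IP match is a choice made here, not something the notebook does.

```python
import re

import pandas as pd


def filter_http(df: pd.DataFrame, url_pattern: str, columns: tuple, src_ip: str = "") -> pd.DataFrame:
    """Hypothetical variant of show_http that takes the dataframe as an argument."""
    out = df.loc[pd.notna(df["http.url"])]
    if src_ip != "":
        # regex=False treats the dots in an IP address literally
        out = out.loc[out["src_ip"].str.contains(src_ip, regex=False)]
    mask = out["http.url"].str.contains(url_pattern, flags=re.IGNORECASE)
    return out.loc[mask, list(columns)]


# Invented events; the last row has no URL and is always dropped.
toy = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "http.url": ["/Login.php", "/index.html", None],
})

by_url = filter_http(toy, "login", ("src_ip", "http.url"))
by_ip = filter_http(toy, "", ("src_ip",), src_ip="10.0.0.2")
print(by_url)
print(by_ip)
```

An empty `url_pattern` matches every URL, which is why the widget can start with an empty text box.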

### Widgets - interact

* Then tie the handler to an interaction object
* Define widgets
* Some widget types are autodetected!

In [None]:
widgets.interact(
    show_http,
    limit=widgets.IntSlider(min=10, max=50),
    url_pattern="",
    columns=widgets.SelectMultiple(
        options=list(DF.columns.values),
        value=["src_ip", "dest_ip", "flow_id", "http.hostname", "http.url"]
    ),
    src_ip=widgets.Combobox(options=list(DF.loc[DF.event_type != "stats"].src_ip.unique()))
)

## Suricata Analytics

* OSS project by Stamus Networks
* Jupyter notebooks for hunting and data exploration
* Python data connector library
  * Special REST API endpoints in SELKS
* Kibana dashboards

### Dynamic Hunting Notebook Demo

## Thank you