# Suricata Analytics

## Background

* Prepared tools are great, *but*;
    * Specific ones are often too restrictive;
    * Generic ones solve all problems *well enough*;
* Lessons learned while developing a product;
    * Analytics moves faster;
    * Backend is easy to patch in;
    * Figuring out a feature is *hard*;
    * Building good UX is even harder;
    * Frankensteins cannot be avoided;
    * Figuring out analytics AND building UX at the same time - **bad idea**;
* I love [Arkime](https://arkime.com/) for NSM hunting;
    * Access and pivot from any indexed value;
    * Focus on raw data, not alerts;
    * Best hunting UI ever - white text file with unique value listings;

## What if

* Analytics and backend could move faster than presentation;
* We had a public test ground for ideas;
* Feedback loop from real users was faster;
* We could do unique value listings per query;

## REST API to the rescue

* Scirius is built on a REST model;
    * every frontend component actually queries backend API;
    * structured framework - `qfilter`, timestamp handling, etc;
    * some endpoints already existed - `events_tail`;
    * new ones pretty easy to add - unique, graph, etc;

## How to query the REST API

```bash
curl -XGET "https://$SCS/rest/rules/es/alerts_count/" \
    -H "Authorization: Token $TOKEN" \
    -H 'Content-Type: application/json'
```

```json
{
  "prev_doc_count": 0,
  "doc_count": 810605
}
```

## Using python

In [None]:
host = "CHANGEME"
token = "CHANGEME"

In [None]:
import requests
URL = f"https://{host}/rest/rules/es/events_tail"
HEADERS = {
    "Authorization": f"Token {token}"
}

In [None]:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [None]:
GET_PARAMS = {
    "qfilter": "event_type: http AND http.url: *wget*",
    "page_size": 100
}

In [None]:
resp = requests.get(URL,
                    headers=HEADERS,
                    verify=False,
                    params=GET_PARAMS)

## Problem?

* A lot of boilerplate;
* We don't want to commit secrets;
* Ongoing UX gets messy;
* Deal with data cleaning / preparation;
* Do it in *every* notebook;
* What if something changes?
* Jupyter is great... but using it as an IDE is like coding in MS Word;
    * Write-only-code;
    * Developing a notebook - making a new one and copy-pasting code from old ones;

## Solution

* Why not hide the boilerplate behind an importable Python library?
* Easier to develop the python code;
* No need to deal with authentication, data fetching, data cleaning, etc in every notebook;
* *Just get to the data ASAP*;
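
A minimal sketch of the idea, not the actual `surianalytics` implementation: a small class (the hypothetical `MiniConnector`) reads the host and token from environment variables and builds the request pieces in one place, so notebooks never contain secrets or URL handling. The environment variable names and the default host are assumptions for illustration.

```python
import os


class MiniConnector:
    """Hypothetical stand-in for a connector that hides request boilerplate."""

    def __init__(self, host=None, token=None):
        # Fall back to environment variables so secrets stay out of notebooks.
        self.host = host or os.environ.get("SCIRIUS_HOST", "localhost")
        self.token = token or os.environ.get("SCIRIUS_TOKEN", "")

    def build_request(self, endpoint, **params):
        """Return the URL, headers and GET parameters for one API call."""
        url = f"https://{self.host}/rest/rules/es/{endpoint}/"
        headers = {"Authorization": f"Token {self.token}"}
        return url, headers, params


connector = MiniConnector(host="scirius.example.com", token="secret")
url, headers, params = connector.build_request("events_tail", qfilter="event_type: http")
print(url)
```

A notebook would then only need the import and a single call, while authentication and URL changes are fixed once in the library.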

## Original idea (TM)??

* Nope;
* [Infosec Jupyterthon](https://infosecjupyterthon.com/introduction.html)
* [MSTICPy](https://github.com/microsoft/msticpy)

## Suricata Analytics

* [Open Source Project](https://git.stamus-networks.com/devel/suricata-analytics);
* Python library;
    * API Connector;
    * Data prep;
    * Importable widgets;
    * Misc helpers;
* Notebook repository;
* Kibana dashboards;

## Basic setup

### Clone the repository

```bash
git clone https://github.com/StamusNetworks/suricata-analytics.git
cd suricata-analytics
```

### Install in virtual env

```bash
python -m venv .venv
source .venv/bin/activate
pip install .
```

### Install jupyterlab

```bash
pip install -r requirements.txt
```

### Set up authentication

```bash
cp .env.example .env
```

```bash
# This is the scirius API token you can generate under Account Settings -> Edit Token
SCIRIUS_TOKEN=<TOKEN VALUE>
# This points to SELKS / Scirius / SSP manager instance
SCIRIUS_HOST=<IP or Hostname>
# Set to "no" if your SELKS / Scirius / SSP manager uses self-signed HTTPS certificate
SCIRIUS_TLS_VERIFY=yes
```
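
How the library loads this file is not shown here. As a rough sketch, a `.env` file of this shape can be parsed with a few lines of standard-library Python. The `parse_env` helper is hypothetical, and treating any value other than `no` as enabling certificate verification is an assumption.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


# Example values only, not real credentials.
EXAMPLE = """
# Example values only
SCIRIUS_TOKEN=abc123
SCIRIUS_HOST=192.0.2.10
SCIRIUS_TLS_VERIFY=no
"""

config = parse_env(EXAMPLE)
# Assumption: only an explicit "no" disables certificate verification.
tls_verify = config.get("SCIRIUS_TLS_VERIFY", "yes").lower() != "no"
print(config["SCIRIUS_HOST"], tls_verify)
```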

### Start the jupyter server

```bash
jupyter lab
```

* http://localhost:8888
* If port 8888 is already in use, the port number is incremented by one;
* Startup sequence automatically redirects the browser with the auth token;
* Link with token also visible on CLI;

## Basic usage

In [None]:
from surianalytics.connectors import RESTSciriusConnector

In [None]:
CONNECTOR = RESTSciriusConnector()

In [None]:
CONNECTOR.get_unique_fields()[:10]

In [None]:
CONNECTOR.last_request

## Basic usage

In [None]:
DF = CONNECTOR.get_events_df(qfilter="event_type: http")

In [None]:
DF

## Typical usage

### Setting core parameters

Page size hacks around pagination to increase the amount of data fetched. Maximum is 10k.

In [None]:
CONNECTOR.page_size = 10000

The query timeframe can be changed in many ways, but the query delta setter is the easiest.

In [None]:
CONNECTOR.set_query_delta(minutes=15)
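
Conceptually, a query delta is a time window that ends now and starts the given amount of time earlier. The sketch below illustrates that arithmetic with the standard library. The `time_window` helper is hypothetical and is not necessarily how `set_query_delta` is implemented.

```python
from datetime import datetime, timedelta, timezone


def time_window(now=None, **delta):
    """Return (start, end) for a window of the given length ending at `now`."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(**delta)
    return start, end


start, end = time_window(minutes=15)
print(start.isoformat(), "->", end.isoformat())
```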

The base filter is concatenated with the query filter using a logical AND.

In [None]:
CONNECTOR.basefilter = "event_type: http"
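
The effect of the base filter can be pictured as simple string composition. This is a sketch of the concept with a hypothetical `combine_filters` helper. The exact string the library sends, including whether it adds parentheses or falls back to `*`, is an assumption.

```python
def combine_filters(basefilter, qfilter):
    """Join two query strings with AND, tolerating a missing side."""
    parts = [f for f in (basefilter, qfilter) if f]
    if len(parts) == 2:
        # Parentheses keep OR clauses inside either filter from leaking out.
        return " AND ".join(f"({p})" for p in parts)
    # One filter is used as-is; with none, match everything.
    return parts[0] if parts else "*"


print(combine_filters("event_type: http", "dest_port: 443 OR dest_port: 8443"))
```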

### Event fetch

* `qfilter` is a Kibana query;
* Can be omitted if you are happy with `basefilter`;
* Data Frame is a two-dimensional tabular data structure with labeled axes;
* Method converts the response into [pandas](https://pandas.pydata.org/) dataframe;
    * Go-to data manipulation tool for data scientists;
    * Inspired by statistical languages like R;
    * Everything is a vector;
    * Ton of built-in methods to *wrangle* the data;

In [None]:
DF = CONNECTOR.get_events_df(qfilter="dest_port: [* TO 1000] AND NOT dest_port: (80 OR 8080)")
DF

### Where to even start?

* Unique value listings are your friend;
* Some values are more interesting than others;

In [None]:
DF.dest_port.unique()

We need to know what fields are actually available.

In [None]:
list(DF.columns.values)
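
Column names such as `http.hostname` are dotted paths into nested EVE JSON. As an illustration of where such names come from, nested events can be flattened as below. This is an assumption about the flattening step, using a hypothetical `flatten` helper rather than the library's own code.

```python
def flatten(event, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in event.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


# Made-up event for illustration only.
event = {
    "event_type": "http",
    "dest_port": 80,
    "http": {"hostname": "example.com", "url": "/"},
}
print(sorted(flatten(event)))
```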

### Data wrangling

Pandas truncates large frames to avoid exploding the browser.

In [None]:
import pandas as pd

Show more rows.

In [None]:
pd.set_option('display.max_rows', 50)
pd.set_option('display.min_rows', 50)

Show *all* columns.

In [None]:
pd.set_option('display.max_columns', None)

Show more data per column.

In [None]:
pd.set_option('display.max_colwidth', 150)

### Inspecting the raw events

* Filter on an interesting field;
* Omit missing values;
* Only inspect relevant fields;
    * We can do it programmatically;
* Drop empty columns;
* Sort values;
    * Can use multiple fields for hierarchical sorting;

In [None]:
(
    DF
    .loc[DF.dest_port == 443]
    .loc[pd.notna(DF["http.hostname"])]
    [
        ["timestamp", "src_ip", "src_port", "dest_ip", "dest_port"] + 
        [c for c in list(DF.columns.values) if c.startswith("http")]
    ]
    .dropna(how="all", axis=1)
    .sort_values(by=["timestamp"])
)

### Aggregating the data

* Too much to inspect;
* Aggregations provide a really nice overview;
* *Unique values* and *number of unique values* for text fields;
* Statistical view of numerics;
* Min, max agg for timestamp == first_seen / last_seen;

In [None]:
(
    DF
    .loc[DF.dest_port == 443]
    .groupby("dest_ip")
    .agg({
        "timestamp": ["min", "max", "count"],
        "src_ip": ["unique", "nunique"],
        "http.hostname": ["unique", "nunique"],
        "http.http_user_agent": ["unique", "nunique"],
        "http.url": ["unique", "nunique"],
        "http.length": ["min", "max", "mean", "median"]
    })
)

## Unique values

* Wrangling raw events provides a lot of value;
* But takes a lot of effort;
    * Hard to be consistent;
* Arkime has the best hunting interface in the world - `unique.txt`;
* Extracting unique values with pandas does not scale!
* Scirius did not have unique value listing API endpoint;
    * So I made one;

### Initial query

* Proceed from prior investigation;
* Investigate plaintext HTTP to port 443;
* Get simple listing;

In [None]:
CONNECTOR.basefilter = "event_type: http AND dest_port: 443"

In [None]:
CONNECTOR.get_eve_unique_values(field="http.http_user_agent")

Or to get a *txt* version for simple copy-paste.

In [None]:
print("\n".join(CONNECTOR.get_eve_unique_values(field="http.http_user_agent")))

### Unique listing with counts

* `counts` set to `yes` will return structured data;

In [None]:
CONNECTOR.get_eve_unique_values(field="http.url", counts="yes")

* Pivot from an interesting value, list out affected IPs;

In [None]:
pd.DataFrame(CONNECTOR.get_eve_unique_values(qfilter="http.url: \\/", field="dest_ip", counts="yes"))
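
For intuition, a unique listing with counts pairs each distinct value of a field with the number of matching events. The sketch below reproduces that shape locally with `collections.Counter` on made-up events. The `key` / `doc_count` names follow the record structure used later in this notebook, while the `unique_with_counts` helper is hypothetical. Doing the same work on the backend is what lets it scale beyond what fits in a dataframe.

```python
from collections import Counter


def unique_with_counts(events, field):
    """Count distinct values of `field`, most common first."""
    counts = Counter(e[field] for e in events if field in e)
    return [{"key": k, "doc_count": n} for k, n in counts.most_common()]


# Made-up events for illustration only.
events = [
    {"dest_ip": "192.0.2.1", "http.url": "/"},
    {"dest_ip": "192.0.2.1", "http.url": "/"},
    {"dest_ip": "192.0.2.2", "http.url": "/login"},
    {"src_ip": "198.51.100.7"},
]
print(unique_with_counts(events, "dest_ip"))
```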

## Graph mining

In [None]:
G = CONNECTOR.get_eve_fields_graph_nx(qfilter="http.url: \\/", col_src="dest_ip", col_dest="http.http_user_agent", size_src=10, size_dest=10)

In [None]:
G.edges

In [None]:
from surianalytics.viz import draw_nx_graph

In [None]:
draw_nx_graph(G)

## Interactive widgets

* Coding every query and filter becomes tedious fast;
* Complex code is easy to break;
* End up debugging instead of exploring data;
* Only way is to write new exploration cells;
    * End up with too many cells;
* Enter widgets;

In [None]:
import ipywidgets as widgets

In [None]:
def show(limit: int, columns: list):
    pd.set_option('display.max_rows', limit)
    pd.set_option('display.min_rows', limit)
    df = (
        DF
        [list(columns)]
        .dropna(how="all", axis=1)
    )
    return df

In [None]:
widgets.interact(show, 
                 limit=widgets.IntSlider(min=10, max=200), 
                 columns=widgets.SelectMultiple(options=["timestamp", "src_ip", "src_port", "dest_ip", "dest_port"] + [c for c in list(DF.columns.values) if c.startswith("http")], 
                                                value=["timestamp", "src_ip", "src_port", "dest_ip", "dest_port"], 
                                                rows=20))

### More useful widget

* Combine data dump and unique listing to explore interesting values;
* Remember the unique listing of servers with counts from before;

In [None]:
INTERESTING_IP = CONNECTOR.get_eve_unique_values(qfilter="http.url: \\/", field="dest_ip", counts="yes")

In [None]:
INTERESTING_IP

* Convert the values;
* Widget options can be tuples with contextual labels;
* Maybe should implement this conversion as importable function...

In [None]:
INTERESTING_IP_FMT = [(i["key"] + ": " + str(i["doc_count"]), i["key"]) for i in INTERESTING_IP]

In [None]:
INTERESTING_IP_FMT
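
One possible shape for such an importable function, as a sketch: the hypothetical `to_widget_options` helper below takes records with `key` and `doc_count` fields, as in the list comprehension above, and returns `(label, value)` tuples. The sample input is made up.

```python
def to_widget_options(buckets):
    """Convert unique-value records into (label, value) tuples for a widget."""
    return [(f"{b['key']}: {b['doc_count']}", b["key"]) for b in buckets]


# Made-up records for illustration only.
sample = [
    {"key": "192.0.2.1", "doc_count": 42},
    {"key": "192.0.2.2", "doc_count": 7},
]
print(to_widget_options(sample))
```

The label carries the count for context, while the value passed to the handler stays the bare key.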

Then implement the interactive widget handler.

In [None]:
def show(limit: int, server: str):
    pd.set_option('display.max_rows', limit)
    pd.set_option('display.min_rows', limit)
    return (
        DF
        .loc[DF.dest_ip == server]
        [
            ["timestamp", "src_ip", "src_port", "dest_ip", "dest_port"] + 
            [c for c in list(DF.columns.values) if c.startswith("http")]
        ]
        .dropna(how="all", axis=1)
    )

Note how the dropdown options are the processed selection of interesting values.

In [None]:
widgets.interact(show,
                limit=widgets.IntSlider(min=10, max=1000),
                server=widgets.Dropdown(options=INTERESTING_IP_FMT)
                )

## Extreme widgeting

* Adhoc widgets are cool;
* Importable ones are even cooler;

### Time picker

* Problem - too many notebooks;
* Syncing the query time range between them became a hassle;

In [None]:
from surianalytics.widgets.config import Timepicker

In [None]:
Timepicker().display()

In [None]:
from surianalytics.widgets.config import update_connector_timeframe

In [None]:
update_connector_timeframe(CONNECTOR, "config.json")

In [None]:
CONNECTOR.hours

### Unique value hunting

* Fully interactive widget;
* Data wrangling streamlined;
* Implements my typical hunting flow;
* A new connector is created by default, but an existing one can be passed;

In [None]:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [None]:
from surianalytics.widgets.uniq import UniqPivot

Connector does not need to be passed. Calling `UniqPivot()` is enough. But we can reuse the object and ensure the timeframe is correct this way.

In [None]:
UP = UniqPivot(c=CONNECTOR)

In [None]:
UP.connector.last_request

In [None]:
UP.display()