# Jupyter Playbooks for Scirius

Author: Markus Kont

# Introduction

Back in 2022, I gave a Suricon presentation titled [Jupyter Playbooks for Suricata](https://youtu.be/hevTFubjlDQ). [This led into a blog series](https://www.stamus-networks.com/jupyter-playbooks-for-suricata), [also available as a notebook](https://github.com/StamusNetworks/suricata-analytics/blob/main/jupyter/Notebooks/JupyterPlaybooksForSuricata.ipynb), meant to expand on the topic and provide more context to those who are interested in notebooks but lack the experience to get started.

This post can be thought of as a continuation of that blog series, though it takes a slightly new direction. Whereas the original Jupyter series focused on core [Suricata](https://github.com/OISF/suricata), we now look toward [Scirius](https://github.com/StamusNetworks/scirius). Note that I do not mean [SELKS](https://www.stamus-networks.com/selks), our open-source IDS/NSM threat hunting system, nor [Stamus Security Platform](https://www.stamus-networks.com/stamus-security-platform), our commercial appliance. Instead, the focus is on a software component that is shared between the two solutions. Our goal is to extend the functionality of these products beyond what is already implemented in the user interface. Scirius, the Django web application that implements our management and backend functionality, was chosen for this task.

Interfacing our Jupyter data connectors with open-source Scirius means we can contribute to the Suricata community while also enhancing the product for our customers. Suricata is still seen as nothing more than a rule-based IDS engine, whereas in reality it produces a ton of useful NSM events. Our goal is to expose this data to users in a meaningful manner, and Jupyter notebooks are the tool we chose for this task.

# The problem

So far we've talked about processing simple EVE log files with pandas and Jupyter notebooks. However, this approach does not scale. Pandas is designed to be simple to use, not for ingesting and transforming vast amounts of data. All processing is done in memory, and pandas is not conservative in using it. On the other hand, Suricata can produce vast amounts of NSM events. Does it mean pandas is not fit for processing Suricata EVE JSON logs at scale? No, not at all!

Pandas is an amazing tool for interacting with data and for gaining quick insights. The problem is filtering and transforming large datasets. Scirius is already able to do the former by relaying queries to the backend Elasticsearch database. Our commercial Stamus Security Platform also has a powerful streaming pipeline to enhance core Suricata events, which addresses the data preparation. But even without it, the core Suricata EVE logs have so much on offer that most users never dig into.

# The REST API

A little-known feature in our products is the ability to query the REST API. REST, which stands for REpresentational State Transfer, is a standard paradigm for building web applications whereby the backend server supplies data to frontend components via API requests. In our case, most frontend components simply fetch and display data from backend URLs. The important part is that we have already implemented a number of API endpoints that return useful data. It's also fairly simple to add new endpoints.

But before we can discuss newly added endpoints or even how anyone could contribute to adding them, we must first explore how API queries work. In short, anyone with a proper *API token* is able to issue authenticated requests to endpoints. To generate that token, we must first navigate to the `Account Settings` section, which is available at the top right corner of the title menu.

![Account Settings](account-settings.png)

Then, on the left-hand side, choose `Edit Token`.

![Edit Token](img/edit-token.png)

Finally, the token will be visible in the `Token` field. If it is empty, simply click the `Regenerate` button to create a new one. Then copy the value to a keychain or password safe of your choice.

![Generate Token](generate-token.png)

Once we have found our token, we can start issuing queries to the Scirius REST API. We can even fetch data from the command line! Simply point your web client to the appliance IP or fully qualified domain name, with the API endpoint in the *URI path*. The API token must be defined within the `Authorization` header.

```bash
curl -XGET "https://$SELKS_OR_SSP/rest/rules/es/alerts_count/" \
    -H "Authorization: Token $TOKEN" \
    -H 'Content-Type: application/json'
```

This very simple endpoint returns the number of alerts that match within a given time period. If the period is left undefined, it defaults to the last 30 days.

```json
{
  "prev_doc_count": 0,
  "doc_count": 810605
}
```

We can pull data directly from any SELKS or SSP instance, straight from the command line. That's pretty cool! But let's look at something more powerful.

## Suricata Analytics project

The most difficult aspect of working with notebooks is data ingestion. Most notebooks become unusable over time since they depend on CSV or JSON files for input. Even worse, those files might be preprocessed, so the notebook assumes the existence of fields that are not actually present in raw data. Existing notebooks often end up as mere references when working on new ones, simply because they cannot be run without the exact data they were originally developed with. This clearly diminishes their usefulness. By using Scirius as our data ingestion point, we're able to mitigate that problem. We can make *assumptions* about what data is present and how it's formatted without shipping it with the notebooks.

This was one of the critical factors that motivated us to start the [Suricata Analytics project](https://github.com/StamusNetworks/suricata-analytics). If the REST API is the *server* component, then Suricata Analytics notebooks are the clients. Those notebooks use Python to interact with the Scirius REST API. The next section explains how it works.

## Scirius REST API with Python

Firstly, we need to point our notebooks to the right host. We also need to store the authentication token along with any parameters that might alter the connection. After all, hard-coding variables like this into each notebook would severely diminish their usability. To make matters worse, committing and pushing API tokens is a security breach. To keep things simple, we decided to use `.env` files. In fact, our SELKS on Docker setup uses the same method, so it was only natural to use it for notebooks as well. It can be set up as described in the [Suricata Analytics main README file](https://github.com/StamusNetworks/suricata-analytics/tree/main#jupyter).

```bash
SCIRIUS_TOKEN=<TOKEN VALUE>
SCIRIUS_HOST=<IP or Hostname>
SCIRIUS_TLS_VERIFY=yes
```

For now we handle a very limited set of options: the token value itself, the server IP or hostname, and an option to disable TLS verification when using self-signed certificates. The latter is the default for most lab setups and out-of-the-box SELKS installations.

Python has the `dotenv` package (published on PyPI as `python-dotenv`) to import variables from this file into a Python session. Once imported, `dotenv_values` allows us to use the variables in the environment file like any other Python dictionary. Note that the Suricata Analytics project includes a reference Docker container which mounts the environment file from the project root directory into the home folder of the container. The following example is written with this in mind.

In [None]:
from dotenv import dotenv_values
import os

In [None]:
CONFIG = dotenv_values(os.path.join(os.path.expanduser("~"), ".env"))
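If a key is missing from the environment file, later dictionary lookups fail with a bare `KeyError`. A minimal sketch of an early sanity check follows. The helper name `check_config` and the choice to treat a missing `SCIRIUS_TLS_VERIFY` as `yes` are assumptions made for illustration, not part of Suricata Analytics.

```python
# Hypothetical helper, not part of Suricata Analytics.
REQUIRED_KEYS = ("SCIRIUS_TOKEN", "SCIRIUS_HOST")


def check_config(config):
    """Return a cleaned copy of the config, or raise if a required key is missing."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing settings in .env: " + ", ".join(missing))
    cleaned = dict(config)
    # Assumption: TLS verification stays enabled unless explicitly set to "no".
    cleaned["SCIRIUS_TLS_VERIFY"] = (config.get("SCIRIUS_TLS_VERIFY") or "yes").lower()
    return cleaned


# A plain dictionary stands in for the value returned by dotenv_values().
example = check_config({"SCIRIUS_TOKEN": "abc123", "SCIRIUS_HOST": "192.0.2.10"})
print(example["SCIRIUS_TLS_VERIFY"])
```

Failing early like this keeps the error message close to the actual cause, rather than surfacing it later as a failed HTTP request.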

We can use the Python `requests` package to interact with the Scirius REST API. But before we do, we need to set up some parameters. Like before, the API token is passed in the `Authorization` header, though this time in a more structured way. We can also use the environment dictionary to dynamically build the URL and the authentication header.

In [None]:
import requests
URL = "https://{host}/rest/rules/es/events_tail".format(host=CONFIG["SCIRIUS_HOST"])
HEADERS = {
    "Authorization": "Token {token}".format(token=CONFIG["SCIRIUS_TOKEN"])
}

Each API endpoint usually defines its own parameters, but some are common to most. The important ones are:
* `qfilter` for passing a KQL style query to the endpoint;
* `from_date` unix epoch that defines the point in time from which we want to retrieve the events;
* `to_date` unix epoch that defines the point in time up to which the data should be retrieved;
* `page_size` how many documents should be fetched.
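To make the two date parameters concrete, here is a minimal sketch that turns a relative window into a pair of epoch values. The helper name `epoch_window` is hypothetical. The text above only says "unix epoch", so whether the server expects seconds or milliseconds is an assumption to verify against your Scirius version; the `millis` switch covers both cases.

```python
from datetime import datetime, timedelta, timezone


def epoch_window(now=None, millis=False, **delta):
    """Return (from_date, to_date) as Unix epochs for a window ending at `now`."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(**delta)
    # Assumption: seconds by default, milliseconds on request.
    scale = 1000 if millis else 1
    return int(start.timestamp() * scale), int(now.timestamp() * scale)


# A fixed "now" keeps the example reproducible.
fixed_now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
from_date, to_date = epoch_window(now=fixed_now, hours=1, minutes=30)
params = {"from_date": from_date, "to_date": to_date, "page_size": 100}
print(params)
```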

Note that we can pass any Kibana style query to the endpoint using the `qfilter` parameter, essentially allowing us to fetch any data we want. We can also modify the query period. The default is to fetch data from the last 30 days. This is something to be careful with, since many queries match more documents than Elasticsearch returns. A wide query over the past 30 days with the default page size would return only a tiny sample of the overall data, and would thus not be very useful.
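One way to notice and work around a partial sample is sketched below. Both helpers are hypothetical and rest on a heuristic: a page that comes back completely full probably means more documents matched than were returned, in which case the period can be cut into smaller windows that are queried one by one.

```python
def looks_truncated(results, page_size):
    """Heuristic: a full page suggests that more documents matched than were returned."""
    return len(results) >= page_size


def split_window(from_date, to_date, parts):
    """Split an epoch range into `parts` consecutive (start, end) sub-windows."""
    step = (to_date - from_date) / parts
    edges = [int(from_date + i * step) for i in range(parts)] + [to_date]
    return list(zip(edges[:-1], edges[1:]))


page = [{"event_type": "http"}] * 100
if looks_truncated(page, page_size=100):
    # Query each smaller window separately instead of the whole hour at once.
    windows = split_window(0, 3600, parts=4)
    print(windows)
```

This is not a substitute for a more specific `qfilter`, but it helps when the query itself cannot be narrowed any further.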

Ideally, we would fetch something specific. For example, we might be interested in `http` events where the HTTP URI contains a command injection.

In [None]:
GET_PARAMS = {
    "qfilter": "event_type: http AND http.url: *wget*",
    "page_size": 100
}
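It can be useful to know what such a parameter dictionary looks like on the wire, for example when reproducing a notebook query with `curl`. The sketch below uses only the standard library to show roughly how the values are URL-encoded into a query string; the exact escaping applied by an HTTP client may differ in minor details.

```python
from urllib.parse import parse_qs, urlencode

params = {
    "qfilter": "event_type: http AND http.url: *wget*",
    "page_size": 100,
}

# Spaces, colons and asterisks are escaped so the filter survives inside a URL.
query_string = urlencode(params)
print(query_string)

# Decoding returns the original values, each wrapped in a list of strings.
decoded = parse_qs(query_string)
print(decoded["qfilter"][0])
```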

Most data can simply be fetched with HTTP GET requests. A very powerful API endpoint to get started with is `events_tail`, which allows the user to query raw EVE events.

In [None]:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [None]:
resp = requests.get(URL,
                    headers=HEADERS,
                    # Verify TLS certificates unless explicitly disabled in .env
                    verify=CONFIG["SCIRIUS_TLS_VERIFY"] != "no",
                    params=GET_PARAMS)
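A request can fail, for instance because of an expired token, and the body then carries an error message instead of events. With `requests`, calling `resp.raise_for_status()` covers the HTTP status part. The sketch below shows the same idea as a small pure function so it can be tried without a server; the helper name `extract_results` and the canned response body are illustrative assumptions.

```python
import json


def extract_results(status_code, body):
    """Return the list stored under `results`, raising a readable error otherwise."""
    if status_code != 200:
        raise RuntimeError(f"request failed with HTTP {status_code}: {body[:200]}")
    payload = json.loads(body)
    if "results" not in payload:
        raise KeyError("response has no 'results' key")
    return payload["results"]


# Canned values stand in for resp.status_code and resp.text.
events = extract_results(200, '{"results": [{"event_type": "http"}]}')
print(len(events))
```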

Once the data is retrieved, we can simply load the values from the `results` JSON key and pass them to the pandas `json_normalize` helper to build a flat dataframe of EVE events. Once done, we can interact with the data as described in previous posts.

In [None]:
import pandas as pd

In [None]:
# resp.json() parses the response body, so a separate json import is not needed
DF = pd.json_normalize(resp.json()["results"])
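Column names such as `http.hostname` come from the flattening step: nested JSON objects are collapsed into a single level, with parent and child keys joined by a dot. The following standard-library sketch is a rough approximation of that behaviour for nested dictionaries, meant only to show where the dotted names come from. It is not how pandas implements it, and the sample event is made up.

```python
def flatten(event, parent="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            # Scalars and lists are kept as they are.
            flat[name] = value
    return flat


eve = {
    "event_type": "http",
    "src_ip": "192.0.2.1",
    "http": {"hostname": "example.com", "url": "/index.html"},
}
row = flatten(eve)
print(sorted(row))  # ['event_type', 'http.hostname', 'http.url', 'src_ip']
```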

We can simply measure how many events were fetched.

In [None]:
len(DF)

Or we could subset the data frame for a quick glance.

In [None]:
(
    DF
    [["timestamp", "src_ip", "dest_ip", "event_type", "flow_id", "http.hostname"]]
    .head(3)
)

Naturally, a more useful interaction would be some kind of aggregate report. For example, we could see which URLs were accessed and which user agents were used for individual HTTP hosts.

In [None]:
DF.groupby(["http.hostname"]).agg({"src_ip": "nunique",
                                   "dest_ip": "nunique",
                                   "http.hostname": "unique",
                                   "http.url": "unique",
                                   "http.http_user_agent": "unique"})

This is really powerful but involves some boilerplate. In the next section we'll see how Suricata Analytics improves on this.

## Suricata Analytics data connector

Boilerplate refers to code that repeats in many places with little variation, yet must be there to set up some other functionality. In our case, the user would need to import the API token and Scirius server address in every notebook using `dotenv`. If we ever changed how they are stored, then every notebook would break. Secondly, we would need to import `requests` and set up HTTP query parameters all the time.

Notebooks can become really complex, especially when weighed down with code that's not actually relevant for exploring data. Having discarded many notebooks for that reason, we decided to write a Python *data connector* to move this complexity from notebooks to an importable library. This connector is also part of the Suricata Analytics project and can simply be installed with `pip install .` while in the project root directory. This idea was very much inspired by [MSTIC Jupyter and Python Security Tools](https://msticpy.readthedocs.io/en/latest/), developed by [Microsoft Threat Intelligence team (MSTIC)](https://www.microsoft.com/en-us/security/blog/topic/threat-intelligence/?sort-by=newest-oldest&date=any). Like our project, it provides data connectors to quickly import security data into Jupyter notebooks and analyze it.

Once installed, the connector can be imported into any notebook.

In [None]:
from surianalytics.connectors import RESTSciriusConnector

Then we create a new connector object. The environment file is automatically detected on object initialization, though the user can override the parameters with object arguments as well.

In [None]:
CONNECTOR = RESTSciriusConnector()

The object maintains persistent state so the user only needs to set certain parameters once. The page size parameter is one that could easily be overlooked. A user might execute one query with a modified page size yet forget to pass that argument in the next. That could skew the results, since the second data fetch might be partial when more documents match the query than Elasticsearch returns.

The object allows the user to simply set the parameter once. All subsequent queries then use that value until it's updated again.

In [None]:
CONNECTOR.set_page_size(100)
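The idea of persistent state can be illustrated with a small stand-in class. This is a hypothetical sketch of the pattern only, not the actual `RESTSciriusConnector` implementation; the class name, the `build_params` method and the initial page size of 10 are all made up for the example.

```python
class QueryDefaults:
    """Hypothetical holder of defaults that are merged into every query."""

    def __init__(self, page_size=10):
        self.page_size = page_size

    def set_page_size(self, size):
        if size <= 0:
            raise ValueError("page size must be positive")
        self.page_size = size

    def build_params(self, qfilter, **overrides):
        # Stored defaults come first, then any per-call overrides.
        params = {"qfilter": qfilter, "page_size": self.page_size}
        params.update(overrides)
        return params


defaults = QueryDefaults()
defaults.set_page_size(100)

first = defaults.build_params("event_type: http")
second = defaults.build_params("event_type: dns")
one_off = defaults.build_params("event_type: tls", page_size=5)
print(first["page_size"], second["page_size"], one_off["page_size"])
```

Both regular queries pick up the stored value, while the per-call override affects a single query and leaves the stored state untouched.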

The same is true for defining the query time period. Relative time queries are very common when working with NSM data. Most users simply need to know what happened within the last X amount of time, and might not care about setting exact timestamps.

We provide a helper method that handles this calculation automatically. Likewise, once set, the time frame applies to all subsequent queries.

In [None]:
CONNECTOR.set_query_delta(hours=1, minutes=30)

Naturally, the user could also explicitly set the from and to timestamps as RFC3339 formatted strings, Unix epochs, or parsed Python timestamp objects. Our library handles basic validation, such as ensuring that the timestamps are not in reverse order.
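To illustrate what accepting several timestamp formats involves, here is a minimal sketch that normalizes the three input types and rejects a reversed window. The function names are hypothetical and this is not the connector's actual code. It also assumes that epochs are given in seconds and that timestamps without a timezone are in UTC.

```python
from datetime import datetime, timezone


def to_datetime(value):
    """Accept an RFC 3339 string, a Unix epoch in seconds, or a datetime."""
    if isinstance(value, datetime):
        parsed = value
    elif isinstance(value, (int, float)):
        parsed = datetime.fromtimestamp(value, tz=timezone.utc)
    elif isinstance(value, str):
        # Older Python versions do not parse a trailing "Z", so spell out the offset.
        text = value[:-1] + "+00:00" if value.endswith("Z") else value
        parsed = datetime.fromisoformat(text)
    else:
        raise TypeError(f"unsupported timestamp type: {type(value).__name__}")
    # Assumption: naive timestamps are treated as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def validate_window(start, end):
    """Normalize both ends and make sure the window is not reversed."""
    start, end = to_datetime(start), to_datetime(end)
    if start >= end:
        raise ValueError("from_date must be earlier than to_date")
    return start, end


# Mixed input types: an RFC 3339 string and an epoch in seconds.
start, end = validate_window("2023-01-01T10:30:00Z", 1672574400)
print(end - start)
```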

These are just some of the ways we can prepare the following method call. That call is functionally identical to the `requests` example shown in the prior section, albeit with fewer lines of code. We also do not need to worry about parsing the results. Our library automatically converts the resulting JSON into a normalized pandas data frame, further reducing redundant code.

In [None]:
DF = CONNECTOR.get_events_df(qfilter="event_type: http AND http.url: *wget*")

In [None]:
(
    DF
    [["timestamp", "src_ip", "dest_ip", "event_type", "flow_id", "http.hostname"]]
    .head(3)
)