In [None]:
!pip install wheel idstools pandas numpy

In [None]:
import pandas as pd
import numpy as np

from idstools import rule

import glob
import json
import os
import re

# ET Open

Download the ET Open ruleset.
```
wget https://rules.emergingthreats.net/open/suricata-6.0.1/emerging.rules.tar.gz
```

And unpack.

```
mkdir /tmp/etopen
tar -xzf emerging.rules.tar.gz -C /tmp/etopen
```

In [None]:
!wget -q -O /tmp/etopen.tgz https://rules.emergingthreats.net/open/suricata-6.0.1/emerging.rules.tar.gz

In [None]:
!mkdir -p /tmp/etopen
!tar -xzf /tmp/etopen.tgz -C /tmp/etopen

Note that the extraction folder under `/tmp` must match the path used in the following `glob` call, which builds a Python list of all rule files.

In [None]:
RULES_LIST_ET_OPEN = glob.glob("/tmp/etopen/rules/*.rules")

Then sort that list to get an organized view of the rule files.

In [None]:
sorted(RULES_LIST_ET_OPEN)

Next, parse each rule file with `idstools` and build a Python dictionary where each key is a rule file and each value is the list of rules parsed from it.

In [None]:
%time PARSED_ET_OPEN = {k: rule.parse_file(k) for k in RULES_LIST_ET_OPEN}

Consider the following parsed rule. Notice how much information can be extracted from it. The reader should already be familiar with the sequential option list.

In [None]:
print(
    json.dumps(
        PARSED_ET_OPEN["/tmp/etopen/rules/emerging-malware.rules"][0], 
        indent=2
    )
)

## High level view

Traditional data structures can be difficult for human eyes to grasp. On a small scale they are fine, but things become complex once you consider that ET Open contains over 31 **thousand** rules. Aggregations presented in row-column format can help us out here.

For that, we can use the `pandas` package, which implements **data frames** in Python and is great for data wrangling and exploration. The following block creates a new pandas data frame and initializes columns of counters per rule file. For now, we're just interested in the `total number of rules`, the `number of enabled rules` and the `number of disabled rules`.

In [None]:
DF_HIGH_LEVEL = pd.DataFrame()
DF_HIGH_LEVEL["file"] = list(PARSED_ET_OPEN.keys())
DF_HIGH_LEVEL["rules_total_count"] = [len(v) for v in PARSED_ET_OPEN.values()]
DF_HIGH_LEVEL["rules_disabled_count"] = [len([item for item in v if not item.enabled]) for v in PARSED_ET_OPEN.values()]
DF_HIGH_LEVEL["rules_enabled_count"] = [len([item for item in v if item.enabled]) for v in PARSED_ET_OPEN.values()]

Then present the dataframe sorted by the number of active rules per file.

In [None]:
DF_HIGH_LEVEL.sort_values(by=["rules_enabled_count"], ascending=False)

Each column of counters is a vector that can be summed up for total counts.

In [None]:
print("Enabled: {} Disabled: {} Total: {}".format(
    DF_HIGH_LEVEL.rules_enabled_count.sum(),
    DF_HIGH_LEVEL.rules_disabled_count.sum(),
    DF_HIGH_LEVEL.rules_total_count.sum(),
))

## Dig into specific rule files and threats

Okay, now let's try to get information about the rules themselves.

Before getting started, note that `idstools` parses some information that is not terribly useful (like `action` and `direction`) while leaving other, more useful pieces unparsed. Specifically, we want `protocol`, `src_net` and `dest_net` from the `header` field. The following helper function extracts that information.

In [None]:
def extract_header(header: str) -> dict:
    split = header.split()
    return {
        "proto": split[1],
        "src_net": split[2],
        "src_port": split[3],
        "dest_net": split[5],
        "dest_port": split[6]
    }
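
To make the positional indexing above easier to follow, here is a minimal standalone sketch of the same idea. It assumes a header always has exactly seven whitespace-separated fields (`action proto src_net src_port direction dest_net dest_port`); the function name `split_header` and the sample header are hypothetical and only serve as illustration.

```python
def split_header(header: str) -> dict:
    """Split a rule header into named fields.

    Assumption: the header has exactly seven whitespace-separated fields.
    Headers with spaces inside an address or port list are not handled.
    """
    fields = header.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    action, proto, src_net, src_port, direction, dest_net, dest_port = fields
    return {
        "proto": proto,
        "src_net": src_net,
        "src_port": src_port,
        "dest_net": dest_net,
        "dest_port": dest_port,
    }


# Hypothetical header, written in the usual Suricata rule layout.
example = split_header("alert tcp $HOME_NET any -> $EXTERNAL_NET 443")
print(example)
```

Unpacking into named variables fails loudly on an unexpected layout, whereas plain indexing would either raise an `IndexError` or silently pick the wrong field.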

Then build a list of all rules, adding the cleaned-up filename and that `header` information to each rule dictionary.

In [None]:
ALL_ET_OPEN_RULES = []

for filename, rules in PARSED_ET_OPEN.items():
    for r in rules:
        r["file"] = os.path.basename(filename)
        r = {**r, **extract_header(r.get("header"))}
        ALL_ET_OPEN_RULES.append(r)

Rather than attempting to inspect a list of 31k elements, we'll turn the whole thing into a dataframe.

In [None]:
DF_ET_OPEN_ALL = pd.DataFrame(ALL_ET_OPEN_RULES)

Filter for enabled rules only. Rules are always commented out for a reason:
* false positives;
* bad performance;
* out of date or irrelevant.

In [None]:
DF_ET_OPEN_ALL = DF_ET_OPEN_ALL.loc[DF_ET_OPEN_ALL.enabled == True]

Now take a quick peek at the ruleset, just to see what we can work with. Clearly we need more filtering and a proper selection of columns. All those *sticky buffer* and *content modifier* columns are useless here. These keywords always apply to the `content` keyword and carry no values of their own, so all of those vectors are empty.

In [None]:
DF_ET_OPEN_ALL.head(5)
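
One way to get rid of such empty columns in bulk is `dropna` along the column axis. The following is only a sketch on a made-up toy frame; the column names and values are placeholders, and it assumes the valueless keywords show up as missing values (`None` or `NaN`) in every row.

```python
import pandas as pd

# Toy frame standing in for the parsed ruleset; all values are made up.
toy = pd.DataFrame({
    "sid": [1000001, 1000002],
    "msg": ["EXAMPLE rule one", "EXAMPLE rule two"],
    "http_uri": [None, None],  # keyword without a value of its own
    "nocase": [None, None],    # keyword without a value of its own
})

# Drop every column in which all values are missing.
trimmed = toy.dropna(axis="columns", how="all")
print(list(trimmed.columns))
```

This keeps any column that has at least one real value, so it is a blunt first pass rather than a replacement for picking columns by hand.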

So, we'll build a more concise dataframe with only the columns we care about. The list is not exhaustive and is just my selection. **Decide what is relevant to you!**

In [None]:
CORE_COLS = ["proto", "src_net", "dest_net", "sid", "rev", "msg", "file", "flowbits", "metadata", "references", "flow", "raw"]

In [None]:
DF_ET_OPEN_CONSISE = DF_ET_OPEN_ALL.loc[:,  CORE_COLS]

Notice that our dataframe peek was truncated. This is to avoid exploding your browser, as dataframes can be very big. The following options can disable that to reveal more information. **But use them with care, and make sure you don't print 31k rows into your browser!**

In [None]:
pd.set_option('display.max_colwidth', None)
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
#pd.set_option('display.min_rows', None)
pd.set_option('display.width', None)
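
If changing display options globally feels too risky, pandas also offers `pd.option_context`, which applies options only inside a `with` block and restores the previous values afterwards. A small sketch on a throwaway frame (the frame and the chosen limits are arbitrary):

```python
import pandas as pd

# Throwaway frame, only here to have something to print.
df = pd.DataFrame({"n": range(200)})

before = pd.get_option("display.max_rows")

# Options are changed only inside the with-block.
with pd.option_context("display.max_rows", 20, "display.max_colwidth", None):
    inside = pd.get_option("display.max_rows")
    print(df.head(20))

# Outside the block, the previous setting is back in effect.
after = pd.get_option("display.max_rows")
print(before, inside, after)
```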

Some rule categories are small and can be shown as-is. Rather than creating separate data structures, we'll go the data science way and keep everything in one dataframe. Remember, we are exploring, so we never know where that exploration will lead. Better to keep everything within arm's reach and just filter when needed. Rely on intermediate data until you reach your goal.

So, to look into the `emerging-worm` category, we can simply filter for that file name. Furthermore, we can sort values to make the information easier to grasp. Sorting by rule directionality is a good trick for visually grouping rules.

In [None]:
DF_ET_OPEN_CONSISE \
    .loc[DF_ET_OPEN_CONSISE.file.str.contains("emerging-worm.rules", regex=False)] \
    .sort_values(by=["src_net", "dest_net"])

**PS! Jupyter is a data science tool and caters to that audience. This can lead to silly things like a rule header being formatted as a mathematical formula**.

However, the really good stuff is in the `malware` and `mobile_malware` categories. And those are big. Too big to explore with full dumps. So, let's limit the scope to a *recent hotness*.

In [None]:
RULES_SUNBURST = DF_ET_OPEN_CONSISE \
    .loc[DF_ET_OPEN_CONSISE.msg.str.contains("SUNBURST", flags=re.IGNORECASE)] \
    .sort_values(by=["proto", "src_net", "dest_net", "msg"]) \
    .drop(columns=["flowbits", "raw", "metadata", "flow"]) \
    .explode("references")

This is a bit more involved, but in many ways it is similar to a database query.
* First, we locate all rules containing the `SUNBURST` keyword in `msg`. Sometimes this information is in `tag` or `metadata`, but don't count on it, as it's not very consistent.
* Then we sort values to make the frame visually easier to explore. Pandas even lets us sort by multiple columns. That's why I wanted to parse `proto`, `src_net` and `dest_net` from the rule header! With those fields, we get a much better organized view.
* Then we drop some columns (from view) that are just noise:
  * `flowbits` are not really relevant for the current exploration, and rule content should be listed separately anyway
  * likewise, the `raw` rule just makes the dataframe as a whole more difficult to assess, but it can always be added back if we need to check the content!
  * `metadata` does not hold much useful information and is a list, which again makes the frame messy
  * `flow` is a bit redundant with the sorted `src_net` and `dest_net` view. Good info, but we only have limited screen real estate
* Finally, `references` holds lists, but we can use the `explode()` method to unpack each reference into a separate row. **This duplicates the other elements of the rule row!** But that's not a big deal in this case.
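
The same chain of steps can be tried on a toy frame to see what each one does, in particular how `explode()` repeats the other columns. Everything in this sketch (sids, messages, references) is made up for illustration and does not come from the ruleset.

```python
import re

import pandas as pd

# Made-up rules; only the shape mirrors the real dataframe.
toy = pd.DataFrame({
    "sid": [1, 2, 3],
    "proto": ["tcp", "tcp", "dns"],
    "msg": ["EXAMPLE Sunburst beacon", "EXAMPLE other threat", "EXAMPLE SUNBURST lookup"],
    "raw": ["raw rule text 1", "raw rule text 2", "raw rule text 3"],
    "references": [
        ["url,example.com/a", "url,example.com/b"],
        [],
        ["url,example.com/c"],
    ],
})

result = (
    toy
    # 1. keep rows whose msg matches, ignoring case
    .loc[toy.msg.str.contains("sunburst", flags=re.IGNORECASE)]
    # 2. sort by several columns at once
    .sort_values(by=["proto", "msg"])
    # 3. drop a noisy column from the view
    .drop(columns=["raw"])
    # 4. one row per reference; sid, proto and msg are repeated
    .explode("references")
)
print(result)
```

Rule 1 has two references, so it appears twice in the result, each time with the same `sid`, `proto` and `msg`.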

In [None]:
RULES_SUNBURST

The same exploration can be repeated for other relevant threats. For example, I bet many students are interested in `Cobalt Strike` rules.

In [None]:
RULES_COBALT_STRIKE = DF_ET_OPEN_CONSISE \
    .loc[DF_ET_OPEN_CONSISE \
    .msg.str.contains("Cobalt Strike|CobaltStrike", flags=re.IGNORECASE)] \
    .drop(columns=["metadata", "flowbits"]) \
    .explode("references") \
    .sort_values(by=["msg"]) \
    .drop(columns=["raw"])

In [None]:
RULES_COBALT_STRIKE.head()

Here we can see that many rules have multiple references. On that note, rules can point to a lot of interesting reading material! How about we build a reading list?

In [None]:
TEXT = "\n".join(sorted(
    list(
        RULES_COBALT_STRIKE \
            .loc[RULES_COBALT_STRIKE.fillna("NA") \
                                    .references.str.contains("^url")] \
            .references.unique()
    )
))

In [None]:
print(TEXT)
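
The entries in that list are reference strings of the form `url,<address>`, not links you can click. A small helper could turn them into full URLs. This is only a sketch: the function name `reference_to_link` is hypothetical, the sample reference is made up, and the default `https` scheme is an assumption, since the reference itself usually does not carry one.

```python
def reference_to_link(reference: str, scheme: str = "https") -> str:
    """Turn a 'url,<address>' reference string into a full URL.

    Assumption: the address has no scheme, so one is prepended.
    """
    kind, _, value = reference.partition(",")
    value = value.strip()
    if kind.strip().lower() != "url" or not value:
        raise ValueError(f"not a url reference: {reference!r}")
    # Leave addresses alone if they already carry a scheme.
    if value.startswith(("http://", "https://")):
        return value
    return f"{scheme}://{value}"


# Made-up reference in the same 'type,value' shape.
link = reference_to_link("url,example.com/report")
print(link)
```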

But note that many links might be dead.

In [None]:
RULES_PURPLE_FOX = DF_ET_OPEN_CONSISE \
    .loc[DF_ET_OPEN_CONSISE \
    .msg.str.contains("PurpleFox", flags=re.IGNORECASE)] \
    .drop(columns=["metadata", "flowbits"]) \
    .explode("references") \
    .sort_values(by=["msg"]) \
    .drop(columns=["raw"])

In [None]:
RULES_PURPLE_FOX

In [None]:
RULES_EMOTET = DF_ET_OPEN_CONSISE \
    .loc[DF_ET_OPEN_CONSISE \
    .msg.str.contains("Emotet", flags=re.IGNORECASE)] \
    .drop(columns=["metadata", "flowbits"]) \
    .explode("references") \
    .sort_values(by=["msg"]) \
    .drop(columns=["raw"])

In [None]:
RULES_EMOTET

## Interactive widgets

* https://ipywidgets.readthedocs.io/en/latest/

Not all data exploration needs to be done in pure code anymore. Widgets are a great way to expose whatever data the user is interested in.

In [None]:
! pip install ipywidgets

In [None]:
import re

In [None]:
def show_rules(limit: int, msg: str, columns: tuple, sort: tuple):
    pd.set_option('display.max_rows', limit)
    pd.set_option('display.min_rows', limit)
    return (
        DF_ET_OPEN_ALL[list(columns)]
        .loc[DF_ET_OPEN_ALL.msg.str.contains(msg, flags=re.IGNORECASE)]
        .sort_values(by=[c for c in list(sort) if c in list(columns)])
    )

In [None]:
import ipywidgets as widgets

In [None]:
widgets.interact(
    show_rules, 
    msg="", 
    limit=widgets.IntSlider(min=10, max=100),
    columns=widgets.SelectMultiple(
        options=list(DF_ET_OPEN_ALL.columns.values),
        value=CORE_COLS
    ),
    sort=widgets.SelectMultiple(
        options=list(DF_ET_OPEN_ALL.columns.values),
    )
)