# Extensible Event Format (EVE)

## Intro

* Suricata is not just an IDS;
* It has a lot of protocol parsers;
* A protocol must be parsed in order to implement alerting on it;
* Logging the parsed data is a no-brainer;
* EVE stands for Extensible Event Format;

## Starting to learn

* Full JSON logging of protocol metadata;
* More in [official documentation](https://docs.suricata.io/en/latest/output/eve/eve-json-output.html);
* A good learning resource is [Malware Traffic Analysis (dot net)](https://www.malware-traffic-analysis.net/);
* Let's investigate a [PCAP containing web scanner samples](https://www.malware-traffic-analysis.net/2024/06/08/index.html);

In [None]:
import pandas as pd
import json

In [None]:
with open("./data/01/eve.json", "r") as handle:
    # EVE is newline-delimited JSON: one event per line
    DF = pd.json_normalize([json.loads(line) for line in handle])

In [None]:
DF

## Where to begin???

* Lot to inspect, where to begin???
    * Firstly, let's see how much data we actually have;
    * Secondly, let's just pick a subset;
    * Thirdly, let's inspect interesting values;
* Move on from there after initial familiarization;
* Hunting is based on:
    * Experience & intuition;
    * Familiarity with data;
* First time around is always confusing, so start small;

In [None]:
len(DF)

## Event Type

* To reiterate: **a protocol log is produced regardless of whether the flow alerted or not**
* The `event_type` field tells us what kind of event we're dealing with
* `alert` is emitted ASAP
* `flow` is emitted at the end of the flow
    * `alert` has a `flow` section, but it's partial
* Individual protocol events are emitted as they are observed in the flow

In [None]:
list(DF.event_type.unique())

In [None]:
(
    DF
    .groupby("event_type")
    .agg({"event_type": ["count"]})
    .sort_values(by=[("event_type", "count")], ascending=False)
)

### Investigating the first event type

* Notice - more HTTP events than alerts
* Even more flows!
* Many connections with no corresponding alerts
* Not a comprehensive list of fields!
* Some fields more relevant to hunting than others

In [None]:
(
    DF
    .loc[DF.event_type=="http"][
    [c for c in list(DF.columns.values) if c.startswith("http.")]]
    .dropna(how="all", axis=1)
)

## Let's pick some fields to investigate

### HTTP Hostname

* Start with HTTP - main event type for scanning traffic;
* Pick a medium cardinality field which should always be there;
* Know the field - it should be an FQDN;
* Therefore, an IP address as the HTTP hostname usually indicates scripted tooling;
* Notice anomalies!
* Hackers and their tools often do not behave like regular users;
* Sometimes they also mess it up!

In [None]:
(
    DF
    .loc[DF.event_type=="http"][
    [c for c in list(DF.columns.values) if c.startswith("http.")]]
    .dropna(how="all")
    ["http.hostname"]
    .unique()
)

### Hostname - takeaways

* Humans use domain names, IP addresses are for scripts and malware;
    * Only legit use-case is internal services;
    * Anything inbound / outbound is usually suspicious;
* Missing hostname is low-hanging fruit;
    * HTTP server needs a hostname in order to serve anything;
    * Could be mistake, could be hail mary against wrong service;
* `%!s(missing)` is a format string error pattern;
    * Meaning attacker messed up their script;
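
The takeaways above can be turned into a rough classifier. This is only a sketch: the function name `classify_hostname` and its labels are made up for illustration, and the `%!` check assumes that format string errors in this data always contain that marker.

```python
import ipaddress


def classify_hostname(value):
    """Return a rough label for an http.hostname value (illustrative only)."""
    # After json_normalize, missing values are None or NaN, not strings
    if not isinstance(value, str) or value == "":
        return "missing"
    # Assumption: broken format strings leave a "%!" marker behind
    if "%!" in value:
        return "format-error"
    # IPv6 literals may be wrapped in brackets
    host = value.strip("[]")
    try:
        ipaddress.ip_address(host)
        return "ip"
    except ValueError:
        return "name"
```

With the `DF` loaded earlier, something like `DF.loc[DF.event_type == "http", "http.hostname"].map(classify_hostname).value_counts()` should give a quick breakdown per label.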

### HTTP User-Agent

* Very high cardinality field;
* Inspecting all values unrealistic;
* But clear patterns emerge;
* Simply listing unique values reveals a lot;

In [None]:
DATA_UA = (
    DF
    .loc[DF.event_type=="http"][
    [c for c in list(DF.columns.values) if c.startswith("http.")]]
    .dropna(how="all")
    ["http.http_user_agent"]
    .dropna()
    .unique()
)
DATA_UA = list(DATA_UA)
DATA_UA = sorted(DATA_UA)
DATA_UA

### UA takeaways

* UA field is under full control of the client;
* Real ones are created by the browser;
* Specifies browser versions - thus high cardinality;
* Scanners have default user-agents that attackers do not bother changing;
* Dropping frequent patterns reveals a lot;

In [None]:
[d for d in DATA_UA if not d.startswith("Mozilla/5.0")]

### URL

* Again, patterns emerge;
* So do anomalies;
    * Path traversal attempts;
    * Command injections;
* Typically, `http.url` from Suricata is normalized without hostname;
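
Anomalies like these can be tagged with a couple of regular expressions. The patterns below are a deliberately naive sketch, not a detection ruleset: `SUSPICIOUS_URL_PATTERNS` and `tag_url` are hypothetical names, and characters such as `;` also appear in legitimate URLs, so expect false positives.

```python
import re
from urllib.parse import unquote

# Naive illustrative patterns; real traffic needs more careful rules
SUSPICIOUS_URL_PATTERNS = {
    "path-traversal": re.compile(r"\.\./|\.\.\\"),
    "command-injection": re.compile(r"[;|`]|\$\("),
}


def tag_url(url):
    """Return sorted names of the patterns that match the URL-decoded value."""
    # Decode percent-encoding first so that %2e%2e/ is seen as ../
    decoded = unquote(url)
    return sorted(
        name
        for name, pattern in SUSPICIOUS_URL_PATTERNS.items()
        if pattern.search(decoded)
    )
```

Applied to the unique `http.url` values, URLs that return an empty list are the ones no pattern matched.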

In [None]:
sorted(list((
    DF
    .loc[DF.event_type=="http"][
    [c for c in list(DF.columns.values) if c.startswith("http.")]]
    .dropna(how="all")
    ["http.url"]
    .dropna()
    .unique()
)))

## Anomaly example - HTTP Method

* Know your data!
* HTTP Methods are well defined;
* Anything outside of those is an indicator;
* Now investigate distinct values;
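
Since the set of well-defined methods is small, the check can be sketched as a simple set difference. `STANDARD_METHODS` below lists the methods from RFC 9110 plus `PATCH`; it is an assumption that nothing else (for example WebDAV methods such as `PROPFIND`) is expected in this traffic.

```python
# Methods defined in RFC 9110, plus PATCH (RFC 5789)
STANDARD_METHODS = frozenset(
    {"GET", "HEAD", "POST", "PUT", "DELETE", "CONNECT", "OPTIONS", "TRACE", "PATCH"}
)


def unexpected_methods(methods):
    """Return sorted distinct values that are not standard HTTP methods."""
    # Method names are case-sensitive, so "get" counts as unexpected
    return sorted(m for m in set(methods) if m not in STANDARD_METHODS)
```

The input is meant to be the non-null `http.http_method` values, for example `DF["http.http_method"].dropna()`.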

In [None]:
sorted(list((
    DF
    .loc[DF.event_type=="http"][
    [c for c in list(DF.columns.values) if c.startswith("http.")]]
    .dropna(how="all")
    ["http.http_method"]
    .dropna()
    .unique()
)))

### What is happening?

* We have to *guesstimate*, many things are possible;
* Could be exploit attempt by fuzzing HTTP header fields;
* Could be malformed request where other fields are fuzzed and overflow into method;
* We can perform forensics from those values;

In [None]:
import binascii
V = binascii.hexlify("\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00".encode())
V

## Fileinfo

* `http` event triggers off HTTP *request*;
* Maybe no data was exchanged;
* If the server returned a file, then Suricata presents the response in a `fileinfo` event;
* Arguably more significant, since a client could request anything with no impact;

In [None]:
sorted(list((
    DF
    .loc[DF.event_type == "fileinfo"]
    .dropna(how="all", axis=1)
    ["fileinfo.filename"]
    .unique()
)))

### Fileinfo - takeaways

* Emitted if client downloads a file;
* HTML document returned to a browser *is a file*;
* Could be a 404 page as well...;
* Has additional metadata;
    * File hash;
    * Magic and mime type;
    * Size metadata;

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
(
    DF
    .loc[DF.event_type == "fileinfo"]
    [[c for c in list(DF.columns.values) if c.startswith("fileinfo.") or c.startswith("http.")]]
    .dropna(how="all", axis=1)
)

## How to tie events together - Flow ID

* Multiple events are emitted per flow;
* Each flow has unique `flow_id`;
* Some protocols emit more than one event;
* Unknown protocols will still emit `flow`;
* Single flow with a lot of events can be an indicator;

In [None]:
DF_FLOW_ID = (
    DF
    .groupby("flow_id")
    .agg({"flow_id": ["count"]})
    .sort_values(by=[("flow_id", "count")], ascending=False)
)
DF_FLOW_ID

## Flow correlation

* Flow correlation is the technique of tying together multiple events with the same flow ID;
* Pivot hunting:
    * find interesting value;
    * extract `flow_id`;
    * query all events with that `flow_id`;
    * then rinse and repeat;
* Sometimes we can *kinda* get away without doing that;
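
The pivot steps can also be sketched without pandas, directly on parsed EVE records. The helper name `pivot_on_flow_id` and the `SAMPLE` records are made up for illustration; real events carry many more fields.

```python
import json


def pivot_on_flow_id(events, predicate):
    """Return every event that shares a flow_id with an event matching predicate."""
    # Steps 1 and 2: find interesting events and extract their flow_id values
    flow_ids = {e["flow_id"] for e in events if "flow_id" in e and predicate(e)}
    # Step 3: query all events with those flow_id values
    return [e for e in events if e.get("flow_id") in flow_ids]


# Synthetic records, shaped like EVE JSON lines but not taken from the PCAP
SAMPLE = [json.loads(line) for line in (
    '{"event_type": "http", "flow_id": 1, "http": {"url": "/admin"}}',
    '{"event_type": "fileinfo", "flow_id": 1, "fileinfo": {"filename": "/admin"}}',
    '{"event_type": "flow", "flow_id": 1}',
    '{"event_type": "http", "flow_id": 2, "http": {"url": "/index.html"}}',
    '{"event_type": "stats"}',
)]

related = pivot_on_flow_id(SAMPLE, lambda e: e.get("http", {}).get("url") == "/admin")
```

Here `related` holds the `http`, `fileinfo` and `flow` events of the matching flow, which is the extra context a filter on `http.url` alone would miss.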

In [None]:
(
    DF
    .loc[DF["http.url"] == "mstshash=Administr"]
    .dropna(how="all", axis=1)
)

* Mostly we need to do it ourselves;
* We might miss context without pivoting on `flow_id`;

In [None]:
FLOW_ID = (
    DF
    .loc[DF["http.url"] == "mstshash=Administr"]
    .flow_id
    .unique()
    [0]
)
(
    DF
    .loc[DF.flow_id == FLOW_ID]
    .dropna(how="all", axis=1)
)

## Investigating flows by event count

* Likewise, the sheer event count can be an indicator;
* Consider the background context;

In [None]:
import ipywidgets as widgets
SELECTIONS = []
for row, values in DF_FLOW_ID.iterrows():
    events = values[("flow_id", "count")]
    if events > 2:
        SELECTIONS.append((f"{row}: events: {events}", row))
    
def show(limit: int, fid: int):
    pd.set_option('display.max_rows', limit)
    pd.set_option('display.min_rows', limit)
    pd.set_option('display.max_columns', None)
            
    return (
        DF
        .loc[DF.flow_id == fid]
        .sort_values(by=["tx_id"])
        .dropna(how="all", axis=1)
    )

widgets.interact(show, 
                 limit=widgets.IntSlider(min=5, max=300),
                 fid=widgets.Dropdown(options=SELECTIONS))

## Pivoting with flow_id

### Select field to investigate

In [None]:
COMBO_COL = widgets.Combobox(options=list(DF.columns.values))
COMBO_COL

### Select interesting values to pivot on

In [None]:
VALUES = sorted(list(DF[COMBO_COL.value].fillna("").unique()))
SELECT_MULTIPLE_VALUE = widgets.SelectMultiple(options=VALUES, rows=10)
SELECT_MULTIPLE_VALUE

### Query flows that hold those values

In [None]:
FLOW_ID = (
    DF
    .loc[DF[COMBO_COL.value]
    .isin(list(SELECT_MULTIPLE_VALUE.value))]
    .flow_id
    .unique()
)
DROPDOWN_FLOW_ID = widgets.SelectMultiple(options=FLOW_ID)
DROPDOWN_FLOW_ID

### Generate the report

* Likely too many irrelevant fields;
* Select ones you need;

In [None]:
DF_INSPECT = (
    DF
    .loc[DF.flow_id.isin(list(DROPDOWN_FLOW_ID.value))]
    .sort_values(by=["flow_id", "tx_id"])
    .dropna(how="all", axis=1)
)

In [None]:
COLUMNS = ["timestamp", "flow_id", "src_ip", "dest_ip", "alert.signature"]
COLUMNS = COLUMNS + [c for c in list(DF.columns.values) if c.startswith("http.") or c.startswith("fileinfo.")]

In [None]:
SELECT_MULTIPLE_COLS = widgets.SelectMultiple(
    options=list(DF_INSPECT.columns.values), 
    rows=10,
    value=[c for c in COLUMNS if c in list(DF_INSPECT.columns.values)]
)
SELECT_MULTIPLE_COLS

In [None]:
pd.set_option('display.max_rows', 300)
pd.set_option('display.min_rows', 300)
DF_INSPECT[list(SELECT_MULTIPLE_COLS.value)]

## Aggregating the data

* Investigating every flow manually is a pain;
* Aggregations help;
* Select a *relatively low cardinality field*;
* Look into unique values and counts of relevant fields;

In [None]:
AGGS = {
    "timestamp": ["min", "max", "count"],
    "http.hostname": ["unique", "nunique"],
    "http.url": ["unique", "nunique"],
    "http.http_user_agent": ["unique", "nunique"],
    "alert.signature": ["unique", "nunique"],
    "fileinfo.magic": ["unique", "nunique"],
    "http.http_method": ["unique", "nunique"],
    "src_ip": ["unique", "nunique"],
    "dest_ip": ["unique", "nunique"],
}

In [None]:
pd.set_option('display.max_rows', 50)
pd.set_option('display.min_rows', 50)

In [None]:
DF.groupby("http.hostname").agg(AGGS)

### Use multiple aggregation points

* Look at the data from multiple angles!
* Missing values can throw you off;
* Notice how the prior example did not show weird `http.http_method` values;
* Because `http.hostname` was missing for those events;
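
The effect is easy to reproduce on a toy frame. By default pandas `groupby` drops rows whose group key is missing; passing `dropna=False` keeps them as their own group. The `TOY` data below is synthetic and only mimics the column names used in this notebook.

```python
import pandas as pd

# Synthetic rows: the odd method value has no hostname
TOY = pd.DataFrame({
    "http.hostname": ["example.com", "example.com", None, None],
    "http.http_method": ["GET", "POST", "GET", "\x00\x00"],
})

# Default behaviour: rows with a missing hostname silently disappear
default_groups = TOY.groupby("http.hostname")["http.http_method"].nunique()

# dropna=False keeps a separate group for the missing hostname
with_missing = TOY.groupby("http.hostname", dropna=False)["http.http_method"].nunique()
```

Comparing `default_groups` with `with_missing` shows the extra group. The same `dropna=False` argument could be tried on the hostname aggregation above, as an alternative to grouping by a second field.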

In [None]:
DF.groupby("http.http_method").agg(AGGS)