# Ridiculously Rich SMB Data Provided by Suricata

* Markus Kont
* Stamus Networks
* markus@stamus-networks.com
* github.com/markuskont

## Agenda

* Intro to SMB from Suricata
* Hunting samples
* A few words on analytics

## /whoami

* Started as a server administrator
* Pivoted to cyber security during my Master's studies
* Then tech researcher at NATO CCDCOE
* Contributed to PhD dropout statistics
* Currently the resident hacker / hunter / coder at Stamus Networks
* **NOT** a Windows guy

<img src="me.jpg" width="600">

## What to expect

* No slides! (kind of)
* No rule based detection either!
* Data highlights with an analyst's interpretation
    * Anonymized samples from exercises
    * Packaged SMB logs from [Malware Traffic Analysis](https://www.malware-traffic-analysis.net/)
    * Some suspension of disbelief needed
* Interactive presentation will be available in [Suricata Analytics](https://github.com/StamusNetworks/suricata-analytics)
* Sorry for scrolling...

## What is SMB

* *Server Message Block*
* Not *Samba*
* Interact with remote systems over network
* File sharing, mapping network drives, printing, etc
* Remote administration via RPC calls

### SMB versions

* SMB1 - deprecated as of 2013
    * Dangerous to use (think EternalBlue)
    * Not installed by default as of Windows Server version 1709 / Windows 10 1709
* SMB2.0 - Windows Vista
* SMB2.1 - Windows 7
* SMB3.0 - Windows 8
* SMB3.1.1 - Windows 10
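
As a rough sketch, the version list above can be turned into a lookup for triaging the negotiated dialect of a session. The dialect strings used as keys are assumptions about how a sensor reports them (Suricata's EVE output carries a dialect in its `smb` object, but check the exact values in your own logs), and `describe_dialect` is a hypothetical helper.

```python
# Keys are assumed dialect strings; labels come from the version list above.
DIALECTS = {
    "NT LM 0.12": "SMB1 (deprecated)",
    "2.02": "SMB2.0 (Windows Vista)",
    "2.10": "SMB2.1 (Windows 7)",
    "3.00": "SMB3.0 (Windows 8)",
    "3.11": "SMB3.1.1 (Windows 10)",
}


def describe_dialect(dialect):
    """Return (label, is_legacy) for a negotiated dialect string."""
    label = DIALECTS.get(dialect, "unrecognized dialect")
    return label, label.startswith("SMB1")


for seen in ["3.11", "NT LM 0.12", "9.99"]:
    print(seen, describe_dialect(seen))
```

Anything flagged as legacy or unrecognized is worth a second look before moving on.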

### What is Lateral Movement?

* Initial compromise
* Explore the network
* Enumerate resources
* Spread, solidify foothold
* Compromise higher priority systems

## Why SMB

### Consider a normal protocol like HTTP

* Rule engine triggers alerts ASAP
* Protocol event is sent when header is parsed
* Finally a flow event when flow is terminated or times out
* Can be correlated with `flow_id` or `community_id`
* The norm is one protocol event per flow

In [None]:
import json
with open("./http-alert-sample.json", "r") as handle:
    DATA_ALERT_HTTP = [json.loads(l) for l in handle]
from IPython.display import JSON
JSON(DATA_ALERT_HTTP)

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 50)
pd.set_option('display.min_rows', 50)
pd.set_option('display.max_columns', None)
DF = pd.json_normalize(DATA_ALERT_HTTP)
DF

In [None]:
(
    DF
    .groupby("flow_id")
    .agg({
        "timestamp": ["min", "max", "count"],
        "alert.signature": ["unique"],
        "event_type": ["unique"],
        "http.hostname": ["unique"],
    })
)

### Now consider SMB sessions

* SMB is a transaction stream
* Client issues commands, server responds
* Many SMB events for each flow

In [None]:
import gzip
with gzip.open("./smb-mta.json.gz", "r") as handle:
    DF_SMB = pd.json_normalize([json.loads(l) for l in handle])
len(DF_SMB)

In [None]:
pd.set_option('display.max_rows', 15)
pd.set_option('display.min_rows', 15)

In [None]:
DF_SMB_AGG = (
    DF_SMB
    .groupby("flow_id")
    .agg({
        "timestamp": ["min", "max", "count"],
        "path": ["unique"],
    })
    .sort_values(by=[('timestamp', 'count')], ascending=False)
    .explode(('path', 'unique'))
)
DF_SMB_AGG

In [None]:
DF_SMB_AGG.describe()

### What is happening?

* **It's a transaction stream**
* Participants agree on dialect;
* Establish security context;
    * Kerberos is the standard;
    * NTLM password authentication is the alternative;
    * Can happen multiple times!
* Connect to a *tree* - shared resource;
* Perform operations *within* that *tree*
    * Create or modify remote files;
    * Remote administration via RPC calls;
* Disconnect from tree, *perhaps rinse and repeat*
* Finally log off
    * ...or simply time out...
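
The steps above can be sketched as a small phase summary for one flow. This is only an illustration: the command names in the sample are assumed to resemble what Suricata logs in `smb.command`, and `phase_of` / `summarize` are hypothetical helpers that match on substrings rather than exact names.

```python
from itertools import groupby

# Substring markers mapped to the session phases listed above.
PHASES = [
    ("NEGOTIATE", "negotiate dialect"),
    ("SESSION_SETUP", "establish security context"),
    ("TREE_CONNECT", "connect to tree"),
    ("TREE_DISCONNECT", "disconnect from tree"),
    ("LOGOFF", "log off"),
]


def phase_of(command):
    """Map a command name to a phase; anything else is work inside the tree."""
    for marker, phase in PHASES:
        if marker in command:
            return phase
    return "operate within tree"


def summarize(events):
    """Order events by tx_id and collapse consecutive identical phases."""
    ordered = sorted(events, key=lambda e: e["tx_id"])
    phases = [phase_of(e["smb.command"]) for e in ordered]
    return [(phase, len(list(run))) for phase, run in groupby(phases)]


# Illustrative command names, not taken from the dataset.
COMMANDS = [
    "SMB2_COMMAND_NEGOTIATE_PROTOCOL", "SMB2_COMMAND_SESSION_SETUP",
    "SMB2_COMMAND_SESSION_SETUP", "SMB2_COMMAND_TREE_CONNECT",
    "SMB2_COMMAND_CREATE", "SMB2_COMMAND_READ", "SMB2_COMMAND_CLOSE",
    "SMB2_COMMAND_TREE_DISCONNECT", "SMB2_COMMAND_SESSION_LOGOFF",
]
SAMPLE_FLOW = [{"tx_id": i, "smb.command": c} for i, c in enumerate(COMMANDS)]
SUMMARY = summarize(SAMPLE_FLOW)
for phase, count in SUMMARY:
    print(f"{phase}: {count}")
```

A flow whose summary skips the negotiation or never reaches a tree connect is incomplete, which may simply mean the capture started mid-session.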

In [None]:
import ipywidgets as widgets
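SELECTIONS = []
# DF_SMB_AGG was exploded on path, so a flow_id can span several rows; list each flow once
for row, values in DF_SMB_AGG[~DF_SMB_AGG.index.duplicated()].iterrows():
    events = values[("timestamp", "count")]
    SELECTIONS.append((f"{row}: events: {events}", row))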
    
def show(limit: int, fid: int):
    pd.set_option('display.max_rows', limit)
    pd.set_option('display.min_rows', limit)
    pd.set_option('display.max_columns', None)
    
    df = DF_SMB
            
    return (
        df
        .loc[df.flow_id == fid]
        .sort_values(by=["tx_id"])
        .dropna(how="all", axis=1)
    )

widgets.interact(show, 
                 limit=widgets.IntSlider(min=5, max=300),
                 fid=widgets.Dropdown(options=SELECTIONS))

## Hunting

* What do we even have to work with?
* Not a full representation
* Fields are fragmented between events
* Full stream tells the whole story
* Unique values are useful pivot points
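
One way to see how fragmented the fields are is to measure how often each one is populated. A minimal sketch on made-up flattened events follows; `field_coverage` is a hypothetical helper and the sample values are invented for illustration.

```python
from collections import Counter


def field_coverage(events, prefix="smb."):
    """Fraction of events in which each prefixed field holds a value."""
    if not events:
        return {}
    counts = Counter()
    for event in events:
        for key, value in event.items():
            if key.startswith(prefix) and value is not None:
                counts[key] += 1
    return {key: count / len(events) for key, count in counts.most_common()}


# Invented sample: command and status everywhere, the rest only sometimes.
EVENTS = [
    {"smb.command": "A", "smb.status": "STATUS_SUCCESS", "smb.dialect": "3.11"},
    {"smb.command": "B", "smb.status": "STATUS_SUCCESS", "smb.ntlmssp.user": "bob"},
    {"smb.command": "C", "smb.status": "STATUS_SUCCESS", "smb.filename": "x.txt"},
    {"smb.command": "D", "smb.status": "STATUS_SUCCESS", "smb.filename": None},
]
COVERAGE = field_coverage(EVENTS)
for field, share in COVERAGE.items():
    print(f"{field}: {share:.0%}")
```

Fields with low coverage are usually the ones that only make sense when read together with the rest of the flow.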

In [None]:
FIELDS_SMB = [c for c in list(DF_SMB.columns.values) if c.startswith("smb.")]
FIELDS_SMB

### Command

* Present in every event
* Commands can be useful for profiling
    * Newly seen commands can be interesting
    * Seeing SMB1 is not okay
* Status hunting is low-hanging fruit
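
Profiling by command can be as simple as a set difference against what was seen before. The sketch below assumes command names carry an `SMB1_` or `SMB2_` prefix, which should be checked against your own logs; `newly_seen` and `smb1_commands` are hypothetical helpers and the sample names are illustrative.

```python
def newly_seen(baseline, observed):
    """Commands present in the observed window but absent from the baseline."""
    return sorted(set(observed) - set(baseline))


def smb1_commands(observed):
    """Commands whose name carries the (assumed) SMB1 prefix."""
    return sorted({c for c in observed if c.startswith("SMB1_")})


# Illustrative command names, not taken from the dataset.
BASELINE = {
    "SMB2_COMMAND_NEGOTIATE_PROTOCOL",
    "SMB2_COMMAND_SESSION_SETUP",
    "SMB2_COMMAND_TREE_CONNECT",
    "SMB2_COMMAND_CREATE",
}
TODAY = [
    "SMB2_COMMAND_CREATE",
    "SMB2_COMMAND_IOCTL",
    "SMB1_COMMAND_NEGOTIATE_PROTOCOL",
    "SMB2_COMMAND_SESSION_SETUP",
]
print(newly_seen(BASELINE, TODAY))
print(smb1_commands(TODAY))
```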

In [None]:
pd.set_option('display.max_rows', 30)
pd.set_option('display.min_rows', 30)
pd.DataFrame(sorted(list(DF_SMB["smb.command"].unique())))

### Status

* A lot more interesting than the command
* Errors and rejections are usually very interesting
* Sometimes status parsing itself fails - even more interesting
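
A first pass over statuses could bucket them before looking at individual values. This is a sketch under assumptions: it treats purely numeric values and values starting with `UNKNOWN_` as statuses the parser could not name, and `status_bucket` is a hypothetical helper.

```python
from collections import Counter


def status_bucket(status):
    """Coarse triage bucket for a status string."""
    if not status:
        return "missing"
    if status.isdigit() or status.startswith("UNKNOWN_"):
        return "unparsed"  # assumed marker of a status the parser could not name
    if status == "STATUS_SUCCESS":
        return "success"
    return "non-success"


STATUSES = [
    "STATUS_SUCCESS", "STATUS_SUCCESS", "STATUS_ACCESS_DENIED",
    "STATUS_BUFFER_OVERFLOW", "2148074242", "UNKNOWN_00_0000", None,
]
BUCKETS = Counter(status_bucket(s) for s in STATUSES)
print(BUCKETS)
```

The `unparsed` and `non-success` buckets are the natural starting points for a hunt.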

In [None]:
pd.set_option('display.max_rows', 30)
pd.set_option('display.min_rows', 30)
pd.DataFrame(list(DF_SMB["smb.status"].unique()))

### But wait, there's more!

<img src="WaitTheresMore.jpg" width="600">

In [None]:
import json
with open("./smb-status-funky.json", "r") as handle:
    FUNKY = sorted(json.loads(handle.read()))
print("\n".join(FUNKY))

### Status 2148074242

* A numeric code means Suricata failed to parse the status code;
* Not really a bad thing - makes stuff pop out;
* We do things by the book, hackers don't;
* SMB1 - old and deprecated, already suspicious;
* Likely a CLI hacking tool connecting to IPC;
* Successful responses for SMB1, NT LM 0.12 from Server 2022...
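
A numeric status is easier to look up in hexadecimal. The sketch below converts the value and splits off the facility and code, assuming the usual 32-bit layout with a 16-bit code in the low half; `describe_status` is a hypothetical helper. The resulting `0x80090302` looks like the SSPI error `SEC_E_UNSUPPORTED_FUNCTION` rather than a regular NTSTATUS, but treat that reading as an assumption to verify against Microsoft's error code references.

```python
def describe_status(value):
    """Split a 32-bit status value into hex form, facility and code."""
    return {
        "hex": f"0x{value:08X}",
        "facility": (value >> 16) & 0x0FFF,  # assumed 12-bit facility field
        "code": value & 0xFFFF,              # low 16 bits
    }


DECODED = describe_status(2148074242)
print(DECODED)
```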

In [None]:
pd.read_csv("sample-1.csv").head(10)

### Status UNKNOWN_00_0000

* Like before, obscure status makes it pop;
* A lot is actually wrong with this picture;
* SMB1, and the request reports its OS as Windows 2000...
* Actually, this is more likely to be a malicious tool;
* Again, hackers don't do things by the book;

In [None]:
pd.read_csv("sample-2.csv").head(10)

### STATUS_BUFFER_OVERFLOW

* Local Session Manager establishes a desktop session whenever a user logs in.
* The FSCTL_PIPE_TRANSCEIVE request is used to send and receive data from an open pipe. Any bytes in the FSCTL input buffer are written as a binary large object (BLOB) to the input buffer of the pipe server. [Microsoft](https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-fscc/6598edc5-374c-4be2-a650-01a8189ca0d1)
* {Buffer Overflow} The data was too large to fit into the specified buffer. [Microsoft](https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55)

In [None]:
pd.read_csv("sample-3.csv")

### Lateral movement to Domain Controller

* Large example, bear with me on this one
* Named pipe connection to Domain Controller
* Remote Procedure Calls to enumerate remote resources
* `smb.dcerpc.endpoint` is a custom field but can also be mapped with our SMB lateral ruleset
* *Endpoint* = DCERPC UUID + Opcode
* `Server Service Remote Protocol` pipe connection likely to enumerate remote services
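
The *UUID + Opcode* idea can be sketched as a lookup keyed on both values. This is not how the custom `smb.dcerpc.endpoint` field is actually built; `KNOWN_ENDPOINTS`, `endpoint_key` and `endpoint_name` are hypothetical, and the single table entry is believed to be `NetrShareEnum` in the Server Service Remote Protocol (verify against MS-SRVS before relying on it).

```python
# Hypothetical lookup table; the entry is believed to be srvsvc NetrShareEnum.
KNOWN_ENDPOINTS = {
    ("4b324fc8-1670-01d3-1278-5a47bf6ee188", 15): "srvsvc NetrShareEnum",
}


def endpoint_key(uuid, opnum):
    """Build a stable 'uuid:opnum' key for an RPC call."""
    return f"{uuid.lower()}:{opnum}"


def endpoint_name(uuid, opnum):
    """Return a friendly name when known, else fall back to the raw key."""
    return KNOWN_ENDPOINTS.get((uuid.lower(), opnum), endpoint_key(uuid, opnum))


print(endpoint_name("4B324FC8-1670-01D3-1278-5A47BF6EE188", 15))
print(endpoint_name("00000000-0000-0000-0000-000000000000", 1))
```

Unknown pairs fall through as raw keys, so they still group and count correctly while the table is being filled in.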

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.read_csv("sample-5.csv")

In [None]:
pd.set_option('display.max_rows', 30)
pd.set_option('display.min_rows', 30)

### Multiple users in single stream

* A bit more modern SMB version;
* More normal looking command stream;
* Found by simply filtering for more than one user;
* How often does an office worker need to log in to a file server?
* In fact, NTLM password authentication is already non-standard...
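
The filter itself is short. Here is a minimal sketch on made-up events, using the `smb.ntlmssp.user` field from the dataset; `flows_with_multiple_users` is a hypothetical helper and the user names are invented.

```python
from collections import defaultdict


def flows_with_multiple_users(events):
    """Return {flow_id: sorted users} for flows with more than one user."""
    users = defaultdict(set)
    for event in events:
        user = event.get("smb.ntlmssp.user")
        if user:
            users[event["flow_id"]].add(user)
    return {fid: sorted(names) for fid, names in users.items() if len(names) > 1}


# Invented sample events.
EVENTS = [
    {"flow_id": 1, "smb.ntlmssp.user": "alice"},
    {"flow_id": 1, "smb.ntlmssp.user": "alice"},
    {"flow_id": 2, "smb.ntlmssp.user": "bob"},
    {"flow_id": 2, "smb.ntlmssp.user": "administrator"},
    {"flow_id": 3},
]
SUSPECTS = flows_with_multiple_users(EVENTS)
print(SUSPECTS)
```

On the notebook's dataframe the same idea is `DF_SMB.groupby("flow_id")["smb.ntlmssp.user"].nunique()` filtered for values above one.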

In [None]:
pd.read_csv("sample-4.csv")

### Pivot hunting

* List unique fields to inspect;
* List unique values per selected field;
* Extract flow stream(s) for selected flow_id;
* Drill down on interesting fields;
* And / or aggregate for more comprehensive view;

In [None]:
COMBO_COL = widgets.Combobox(options=list(DF_SMB.columns.values))
COMBO_COL

In [None]:
VALUES = sorted(list(DF_SMB[COMBO_COL.value].fillna("").unique()))
SELECT_MULTIPLE_VALUE = widgets.SelectMultiple(options=VALUES, rows=10)
SELECT_MULTIPLE_VALUE

In [None]:
FLOW_ID = (
    DF_SMB
    .loc[DF_SMB[COMBO_COL.value]
    .isin(list(SELECT_MULTIPLE_VALUE.value))]
    .flow_id
    .unique()
)
DROPDOWN_FLOW_ID = widgets.SelectMultiple(options=FLOW_ID)
DROPDOWN_FLOW_ID

In [None]:
DF_INSPECT = (
    DF_SMB
    .loc[DF_SMB.flow_id.isin(list(DROPDOWN_FLOW_ID.value))]
    .sort_values(by=["flow_id", "tx_id"])
    .dropna(how="all", axis=1)
)

In [None]:
SELECT_MULTIPLE_COLS = widgets.SelectMultiple(
    options=list(DF_INSPECT.columns.values), 
    rows=10,
    value=["flow_id", "src_ip", "dest_ip", "smb.command", "smb.status"]
)
SELECT_MULTIPLE_COLS

In [None]:
pd.set_option('display.max_rows', 300)
pd.set_option('display.min_rows', 300)
DF_INSPECT[list(SELECT_MULTIPLE_COLS.value)]

In [None]:
DF_INSPECT.groupby("flow_id").agg({k: ["unique", "nunique"] for k in list(SELECT_MULTIPLE_COLS.value)})

## Analytics

* Manual hunting does not scale well;
* It's reactive;
* Baselining happens in the analyst's head;
* Suricata SMB EVE provides a ton of context for profiling;

### K-means

* Entry level data mining algorithm;
* Widely used - because it works;
* Simple by ML standards;
* Partition coordinates closest to each other into K groups;
    * K - number of groups, up for user to decide;
    * Initial groups are (semi)random;
    * Assign each coordinate to the closest group;
    * Recalculate *centroids* for each group;
    * Reassign points and recalculate until no change happens;
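
The loop described above fits in a few lines of plain Python. This is a teaching sketch, not what scikit-learn does internally: it picks the initial centroids by random sampling, uses squared Euclidean distance, and leaves an empty group's centroid where it was. `kmeans` and its parameters are hypothetical names.

```python
import random


def kmeans(points, k, max_iter=50, seed=None):
    """Plain k-means: returns (labels, centroids) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # (semi)random starting groups
    labels = None
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so we are done
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
LABELS, CENTROIDS = kmeans(POINTS, k=2, seed=0)
print(LABELS)
print(CENTROIDS)
```

Which group ends up labeled 0 or 1 depends on the random start, so only the grouping is meaningful, not the label numbers.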

### Feature selection

* So far we've worked with raw values;
* But we can transform these values into new data;
* We also don't need everything;
* *Feature selection*

In [None]:
COLUMNS = ["smb.command", "smb.status", "smb.filename", "smb.named_pipe", "smb.ntlmssp.user", "smb.function"]
FEATURES = ["event_count", "commands", "responses", "filenames", "named_pipes", "users", "functions"]

### Data preparation

* Data mining algorithms work on numeric counters;
* We mostly have string values, some categorical fields, very few counters;
* Especially in protocol logs;
* Simple conversion - distinct value counts;

In [None]:
AGGS = {"timestamp": ["min", "max", "count"],}
for col in COLUMNS:
    if col in list(DF_SMB.columns.values):
        AGGS[col] = ["nunique"]

In [None]:
DF_AGG = (
    DF_SMB
    .groupby(["flow_id"])
    .agg(AGGS)
    .sort_values(by=[("timestamp", "min"), ("timestamp", "max")])
)
DF_AGG.columns = ["first_seen", "last_seen"] + FEATURES

### Let's inspect it!

Notice how we've transformed the discrete values.

In [None]:
pd.set_option('display.max_rows', 10)
pd.set_option('display.min_rows', 10)
DF_AGG

### Statistical overview

* 4.4k rows;
* Still too much to inspect;
* And that's a **tiny** dataset;
* How to get a *sense* of our data???
* Simple statistics to the rescue!

In [None]:
DF_AGG.describe()

### Scale it

* Feature scales are different;
* High cardinality fields like `smb.filename` or `event_count` can overshadow others;
* Simple trick is to scale values from 0 to 1;
* It's the exact same data!
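
Min-max scaling is just `(x - min) / (max - min)` applied per feature. A small sketch with invented counts shows that the ordering of the values is untouched; `min_max_scale` is a hypothetical helper, and mapping a constant column to zeros is a choice made here for the example.

```python
def min_max_scale(values):
    """Scale a list of numbers into the 0..1 range."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - low) / (high - low) for v in values]


# Invented per-flow event counts with very different magnitudes.
EVENT_COUNT = [12, 3, 4000, 250]
SCALED = min_max_scale(EVENT_COUNT)
print(SCALED)


def rank_order(values):
    """Indexes of the values, from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)


print(rank_order(EVENT_COUNT) == rank_order(SCALED))
```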

In [None]:
from sklearn.preprocessing import MinMaxScaler
SCALER = MinMaxScaler()
DF_SCALED = pd.DataFrame(
    SCALER.fit_transform(DF_AGG[FEATURES].values),
    columns=FEATURES,
    index=DF_AGG.index,
)
DF_SCALED.describe()

### Make it understandable for humans

* numeric data is easiest to understand on graph;
* 7 dimensions is a bit difficult for humans to comprehend;
* *machine says it's bad* does not really fly on an incident report;
* but methods exist to reduce this dimensionality;

In [None]:
from sklearn.decomposition import PCA
PRINCIPAL_COMPONENT_ANALYSIS = PCA(n_components=2)
PRINCIPAL_COMPONENTS = PRINCIPAL_COMPONENT_ANALYSIS.fit_transform(DF_SCALED.values)
PRINCIPAL_DATAFRAME = pd.DataFrame(data=PRINCIPAL_COMPONENTS, columns=['PC1', 'PC2'], index=DF_AGG.index)

In [None]:
import matplotlib.pyplot as plt
plt.scatter(PRINCIPAL_DATAFRAME['PC1'], PRINCIPAL_DATAFRAME['PC2'], s=50, alpha=0.5)
plt.show()

### Separate the datapoints

* make the computer work for us
* generate labels, then attach them to our dataset
* same picture, but with colors!!!
* non-deterministic result

In [None]:
from sklearn.cluster import KMeans
KMEANS = KMeans(n_clusters=6, max_iter=50, init='k-means++')
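# Fit on the two components only, so re-running the cell does not feed the cluster column back in
KMEANS.fit(PRINCIPAL_DATAFRAME[["PC1", "PC2"]])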
PRINCIPAL_DATAFRAME["cluster"] = KMEANS.labels_
DF_SCALED["cluster"] = KMEANS.labels_

In [None]:
plt.scatter(PRINCIPAL_DATAFRAME['PC1'], PRINCIPAL_DATAFRAME['PC2'], c=KMEANS.labels_.astype(float), s=50, alpha=0.5)
plt.show()

### Hunting supported by analytics

* now map the clusters to `flow_id`;
* aggregate for quick overview;
* inspect interesting clusters;
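
Which clusters count as interesting is a judgment call. One common heuristic, assumed here rather than taken from the talk, is to start with the smallest ones, since rare behavior tends to end up in small groups. The sketch works on a flow-to-cluster dictionary shaped like `CLUSTER_MAPPING`; `clusters_by_size` is a hypothetical helper and the sample mapping is invented.

```python
from collections import Counter


def clusters_by_size(flow_to_cluster):
    """Return (cluster, flow_count) pairs, smallest clusters first."""
    sizes = Counter(flow_to_cluster.values())
    return sorted(sizes.items(), key=lambda item: (item[1], item[0]))


# Invented mapping of flow_id to cluster label.
SAMPLE_MAPPING = {101: 0, 102: 0, 103: 0, 104: 1, 105: 2, 106: 2}
ORDERED = clusters_by_size(SAMPLE_MAPPING)
for cluster, flows in ORDERED:
    print(f"cluster {cluster}: {flows} flows")
```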

In [None]:
CLUSTER_MAPPING = pd.DataFrame()
CLUSTER_MAPPING["flow_id"] = pd.Series(DF_SCALED.index).astype(int)
CLUSTER_MAPPING["cluster"] = KMEANS.labels_
CLUSTER_MAPPING = CLUSTER_MAPPING.to_dict(orient="records")
CLUSTER_MAPPING = {m["flow_id"]: m["cluster"] for m in CLUSTER_MAPPING}

In [None]:
DF_SMB["cluster"] = DF_SMB.flow_id.map(CLUSTER_MAPPING)

In [None]:
for col in COLUMNS:
    if col in list(DF_SMB.columns.values):
        AGGS[col] = ["unique", "nunique"]

In [None]:
pd.set_option('display.html.use_mathjax', False)

In [None]:
(
    DF_SMB
    .groupby(["cluster"])
    .agg(AGGS)
    .sort_values(by=[("timestamp", "min"), ("timestamp", "max")])
)

In [None]:
DF_SMB.loc[DF_SMB.cluster == 5]["smb.filename"].unique()

### Graph mining

In [None]:
import networkx as nx
import hvplot.networkx as hvnx
import holoviews as hv
hv.extension('bokeh')

In [None]:
width = 800
height = 800

In [None]:
k_src = "smb.status"
k_dest = "src_ip"

In [None]:
G = nx.from_pandas_edgelist(DF_SMB, k_src, k_dest)

In [None]:
pos = nx.layout.spring_layout(G)

In [None]:
edges = (
    hvnx
    .draw_networkx_edges(G, pos, width=1, edge_color="#cccccc")
    .opts(width=width, height=height)
)

In [None]:
uniq_src = list(DF_SMB[k_src].unique())
nodes = [i for i, n in enumerate(G.nodes()) if n in uniq_src]
nodes = hvnx.draw_networkx_nodes(G, pos, nodelist=nodes, node_color='#A0CBE2').opts(width=width, height=height)

In [None]:
labels = [i for i, n in enumerate(G.nodes()) if n in uniq_src and G.degree(n) < 3]
labels = hvnx.draw_networkx_labels(G, pos, nodelist=labels)

In [None]:
edges * nodes * labels

## In conclusion

* Don't think only about alerts and atomic indicators;
* Event stream tells a story;
* Data can be transformed;
* SMB and the way Suricata logs it is insane (in a good way);

## Thank you!