
# msticpy - Event Clustering

Large sets of events often contain many repetitive and uninteresting system processes. However, these frequently have values (e.g. commandline or path content) that vary on each execution. This makes it difficult to find outlying events using standard sorting and grouping techniques.
We process the data to extract patterns and use clustering to group these repetitive events into a single row (with an execution count). This makes it easier to find unusual events.

You must have msticpy installed with the "ml" components to run this notebook:
```
%pip install --upgrade msticpy[ml]
```

In [0]:
%pip install --upgrade msticpy[ml]

In [0]:
%pip install seaborn

In [0]:
# Imports
import sys
import warnings
import msticpy
msticpy.init_notebook(globals())
#from msticpy.vis import timeline_duration
#from msticpy.vis import nbdisplay
from msticpy.common.utility import check_py_version
MIN_REQ_PYTHON = (3,6)
check_py_version(MIN_REQ_PYTHON)

from IPython import get_ipython
from IPython.display import display
import ipywidgets as widgets

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import networkx as nx

import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)

# Some of our dependencies (networkx) still use deprecated Matplotlib
# APIs - we can't do anything about it so suppress them from view
from matplotlib import MatplotlibDeprecationWarning
warnings.simplefilter("ignore", category=MatplotlibDeprecationWarning)


In [0]:
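# Mount the blob storage container that holds the event data (Databricks only).
# Replace the <placeholders> with your own values.
dbutils.fs.mount(
    source="wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={"<conf-key>": dbutils.secrets.get(scope="<scope-name>", key="<key-name>")},
)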


In [0]:
#dbutils.fs.unmount("/mnt/<mount-name>")

<a id='process_clustering'></a>[Contents](#toc)
# Processes on Host - Clustering
Sometimes you don't have a source process to work with. Other times it's just useful to see what else is happening on the host. This section retrieves all processes on the host within the time bounds set in the query times widget.

You can display the raw output of this by looking at the *processes_on_host* dataframe. Just copy this into a new cell and hit Ctrl-Enter.

Usually, though, the results contain many repetitive and uninteresting system processes, so we attempt to cluster these to make the view easier to navigate.
To do this we process the raw event list output to extract a few features that render strings (such as commandline) into numerical values. The default below uses the following features:
- commandlineTokensFull - a count of common delimiters in the commandline
  (given by the regex `r'[\s\-\\/\.,"\'|&:;%$()]'`). The aim is to capture the commandline structure while ignoring variations on what is essentially the same pattern (e.g. temporary path GUIDs, target IP or host names, etc.).
- pathScore - the sum of the ordinal (character) value of each character in the path (so /bin/bash and /bin/bosh would have similar scores).
- isSystemSession - 1 if this is a root/system session, 0 otherwise.
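
The first two features can be illustrated with a minimal sketch. The helper names `count_delimiters` and `ordinal_sum` and the sample command lines are hypothetical and used only for illustration; they are not the msticpy functions, which may differ in detail.

```python
import re

# Delimiter character class quoted in the feature list above
DELIM_PATTERN = re.compile(r'[\s\-\\/\.,"\'|&:;%$()]')


def count_delimiters(command_line: str) -> int:
    """Return the number of delimiter characters in a command line."""
    return len(DELIM_PATTERN.findall(command_line))


def ordinal_sum(path: str) -> int:
    """Return the sum of the code points of every character in a path."""
    return sum(ord(char) for char in path)


# Two hypothetical command lines with the same structure but different
# temporary file and host names
cmd_a = r'cmd.exe /c copy C:\temp\a1b2c3.tmp \\host1\share'
cmd_b = r'cmd.exe /c copy C:\temp\ffee99.tmp \\host2\share'

# Same delimiter count, so both rows get the same structural feature value
print(count_delimiters(cmd_a), count_delimiters(cmd_b))

# Paths that differ by a single character produce nearby scores
print(ordinal_sum("/bin/bash"), ordinal_sum("/bin/bosh"))
```

Because only the delimiters are counted, the changing file and host names do not affect the first feature, while the path score still separates executables with different names.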

Then we run a clustering algorithm (DBSCAN in this case) on the process list. The result groups similar (noisy) processes together and leaves unique process patterns as single-member clusters.
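
The grouping behavior can be sketched in plain Python. This is not the library's implementation: it assumes a minimum cluster size of 1, in which case density-based clustering reduces to grouping rows that are linked by neighbors within the distance threshold. The `cluster_rows` function and the feature values are hypothetical.

```python
from collections import Counter
from math import dist


def cluster_rows(points, max_distance):
    """Label points so that neighbors within max_distance share a label.

    With a minimum cluster size of 1 every point is a core point, so the
    clusters are the connected groups of the neighborhood graph.
    """
    labels = [None] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue  # already assigned to an earlier cluster
        labels[start] = next_label
        stack = [start]
        while stack:
            current = stack.pop()
            for other in range(len(points)):
                if labels[other] is None and (
                    dist(points[current], points[other]) <= max_distance
                ):
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels


# Hypothetical, already-scaled feature rows:
# (commandlineTokensFull, pathScore, isSystemSession)
features = [
    (0.10, 0.50, 1.0),
    (0.10, 0.50, 1.0),
    (0.10, 0.50, 1.0),
    (0.40, 0.90, 0.0),
]

labels = cluster_rows(features, max_distance=0.0001)
cluster_sizes = Counter(labels)
print(labels)         # [0, 0, 0, 1]
print(cluster_sizes)  # Counter({0: 3, 1: 1})
```

The three identical rows collapse into one cluster of size 3, while the distinct row remains a single-member cluster, which is the kind of row worth inspecting.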

#### Clustered Processes (i.e. processes that have a cluster size > 1)

In [0]:
from msticpy.analysis.eventcluster import dbcluster_events, add_process_features




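# Load the process creation events (Windows event ID 4688) from the mounted storage.
# infer_datetime_format is deprecated in pandas 2.x, so it is omitted here.
processes_on_host = pd.read_csv(
    "/dbfs/mnt/<mount-name>/Event4688.csv",
    parse_dates=["TimeGenerated"],
)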
feature_procs = add_process_features(input_frame=processes_on_host)


# You may need to tune the max_cluster_distance parameter:
# decreasing it produces more clusters.
(clus_events, dbcluster, x_data) = dbcluster_events(
    data=feature_procs,
    cluster_columns=["commandlineTokensFull", "pathScore", "isSystemSession"],
    time_column="TimeGenerated",
    max_cluster_distance=0.0001,
)
print("Number of input events:", len(feature_procs))
print("Number of clustered events:", len(clus_events))
clus_events[["ClusterSize", "processName"]][clus_events["ClusterSize"] > 1].plot.bar(
    x="processName", title="Process names with Cluster > 1", figsize=(12, 3)
)



In [0]:
# Looking at the variability of commandlines and process image paths
import seaborn as sns

sns.set(style="darkgrid")

proc_plot = sns.catplot(
    y="processName",
    x="commandlineTokensFull",
    data=feature_procs.sort_values("processName"),
    kind="box",
    height=10,
)
proc_plot.fig.suptitle("Variability of Commandline Tokens", x=1, y=1)

proc_plot = sns.catplot(
    y="processName",
    x="pathLogScore",
    data=feature_procs.sort_values("processName"),
    kind="box",
    height=10,
    hue="isSystemSession",
)
proc_plot.fig.suptitle("Variability of Path", x=1, y=1)



The top graph shows that some processes have wide variability in their command line content, while the majority have little or none. Looking at a few examples, such as cmd.exe, powershell.exe, reg.exe and net.exe, we can recognize several common command line tools.

The second graph shows processes by full process path content. We wouldn't normally expect to see variation here, and that is the case for most processes. There is also quite a lot of variance in the score between processes, which makes it a useful proxy feature for a unique path name (this means that proc1.exe and proc2.exe with the same commandline score won't be collapsed into the same cluster).

Any process with a spread of values here indicates that the same process name (but not necessarily the same file) is being run from different locations.
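
One way to list such processes directly is to count the distinct full paths per process name. The sketch below uses a small hypothetical DataFrame; the column names mirror those used in this notebook, and the same `groupby` could be applied to `feature_procs`, assuming it contains both columns.

```python
import pandas as pd

# Hypothetical sample data for illustration only
sample = pd.DataFrame(
    {
        "processName": ["svchost.exe", "svchost.exe", "cmd.exe", "cmd.exe"],
        "NewProcessName": [
            r"C:\Windows\System32\svchost.exe",
            r"C:\Users\Public\svchost.exe",
            r"C:\Windows\System32\cmd.exe",
            r"C:\Windows\System32\cmd.exe",
        ],
    }
)

# Count the distinct full paths seen for each process name
path_counts = sample.groupby("processName")["NewProcessName"].nunique()

# Keep only the names that were launched from more than one location
multi_path = path_counts[path_counts > 1]
print(multi_path)
```

Here only svchost.exe is reported, because it appears under two different directories.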

In [0]:
display(
    clus_events.sort_values("ClusterSize")[
        [
            "TimeGenerated",
            "LastEventTime",
            "NewProcessName",
            "CommandLine",
            "ClusterSize",
            "commandlineTokensFull",
            "pathScore",
            "isSystemSession",
        ]
    ]
)



In [0]:
# Look at clusters for individual process names
def view_cluster(exe_name):
    display(
        clus_events[["ClusterSize", "processName", "CommandLine", "ClusterId"]][
            clus_events["processName"] == exe_name
        ]
    )


view_cluster("powershell.exe")



In [0]:
# Show all clustered processes
from msticpy.analysis.eventcluster import plot_cluster

# Create label with unqualified path
labelled_df = processes_on_host.copy()
labelled_df["label"] = labelled_df.apply(
    lambda x: x.NewProcessName.split("\\")[-1], axis=1
)

%matplotlib inline
#%matplotlib notebook
plt.rcParams["figure.figsize"] = (15, 10)
plot_cluster(
    dbcluster,
    labelled_df,
    x_data,
    plot_label="label",
    plot_features=[0, 1],
    verbose=False,
    cut_off=3,
    xlabel="CmdLine Tokens",
    ylabel="Path Score",
)
