**Commonplace detection in categorical telemetry data**

# Interactive visualization

This notebook puts up an experimental visualization dashboard to enable the study of the behavior of the software processes for which [notebook 01](01%20Data%20Engineering.ipynb) computed a representation
(canned to cloud storage for the impatient).
This representation emphasizes the feature correlations between process instances: two process vectors are similar when they are characterized by many common categorical features; otherwise they are not similar, and their mutual distance is larger.
Thus, groups of similar process instances make up clusters that show up clearly in scatter plots.
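
To make the notion of "similar when many categorical features are shared" concrete, here is a minimal sketch using the Jaccard distance between sets of features. This is only an illustration: the representation computed in notebook 01 is not reproduced here, and the feature tuples and the `jaccard_distance` helper are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |intersection| / |union|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical processes, each described by a set of (kind, value) features.
ping_1 = {("FILE", "ping.exe"), ("FLOW", "10.0.0.7"), ("MODULE", "ws2_32.dll")}
ping_2 = {("FILE", "ping.exe"), ("FLOW", "10.0.0.9"), ("MODULE", "ws2_32.dll")}
tasklist = {("FILE", "tasklist.exe"), ("MODULE", "kernel32.dll")}

d_close = jaccard_distance(ping_1, ping_2)  # 2 shared features out of 4 distinct
d_far = jaccard_distance(ping_1, tasklist)  # no shared feature at all
print(d_close, d_far)
```

The two `ping` instances land closer to each other than either does to `tasklist`, which is the kind of structure that shows up as clusters in the scatter plots.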

We use here the [ThisNotThat](https://thisnotthat.readthedocs.io/en/latest/) data mapping framework,
which leverages the [Panel](https://panel.holoviz.org/) dashboard toolkit,
to display and annotate such scatter plots interactively.
These annotations of clusters augment the initial labeling of process instances using [command lines](01%20Data%20Engineering.ipynb#command-line),
yielding novel sets of categorical labels that document observed software behavior.
These labels are stored in a Pandas data frame,
enabling their further use in the development of ad hoc analytics.
In addition to examining the correlations between processes,
selection of clusters by lassoing performs on-the-fly feature importance analysis experiments, showing how a clump of points is characterized compared to the rest.
Such experiments provide an upper bound on the performance of a classifier one may want to develop to detect the phenomenon expressed through the selected points.

## Preliminaries

In [None]:
import bokeh.io
import bokeh.plotting as bpl
import cloudpickle as cpkl
import fsspec
import gzip
import itertools as it
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import panel as pn
from pathlib import Path
import scipy.sparse as ss
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from tqdm.auto import tqdm
import thisnotthat as tnt
from vectorizers.transformers import CategoricalColumnTransformer, InformationWeightTransformer

In [None]:
bokeh.io.output_notebook()
pn.extension()

## Gather the data map and its metadata

If one has computed [their own vector representation](01%20Data%20Engineering.ipynb) of the data,
local files are used.
Otherwise, we grab a vector representation that I computed and stored in an Azure Blob Container
(accessible without authentication).
If one would rather use my vectors than those they computed,
they can rename local file `manifest.json` to, say, `manifest.bak`.
Or delete any/all of the files listed in `files_vectors` below.
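
For those who prefer to do the renaming from Python, a minimal sketch follows. The `stash_manifest` helper is hypothetical (it is not part of this notebook), and the demonstration runs in a scratch directory so that no real file is touched.

```python
from pathlib import Path
import tempfile


def stash_manifest(directory=".") -> bool:
    """Rename manifest.json to manifest.bak; return True if a rename happened."""
    manifest = Path(directory) / "manifest.json"
    if not manifest.is_file():
        return False
    # replace() overwrites an existing manifest.bak on every platform.
    manifest.replace(manifest.with_suffix(".bak"))
    return True


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as scratch:
    (Path(scratch) / "manifest.json").write_text("{}", encoding="utf-8")
    first = stash_manifest(scratch)   # the manifest is there: renamed
    second = stash_manifest(scratch)  # nothing left to rename
    backup_exists = (Path(scratch) / "manifest.bak").is_file()
print(first, second, backup_exists)
```

Calling `stash_manifest()` in the notebook's directory before running the next cell would make the `all(...)` test fail, and thus select the canned data.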

In [None]:
files_vectors = ["manifest.json", "features.npz", "map2d.npz", "metadata.csv.gz", "labels.csv.gz", "col2token.pkl.gz"]
if all([Path(f).is_file() for f in files_vectors]):
    print("Using process map and metadata stored LOCALLY.")
    FS = fsspec.filesystem("file")
    ROOT = "."
else:
    print("Using CANNED process map and metadata (from Azure container).")
    FS = fsspec.filesystem("abfs", account_name="scipy2023")
    ROOT = "optc/map"

In [None]:
with FS.open(f"{ROOT}/manifest.json", "rt", encoding="utf-8") as file:
    manifest = json.load(file)
HOST = manifest["host"]
DAYS = manifest["days"]
print(f"Context: host {HOST}; days {', '.join(DAYS)}")

In [None]:
with FS.open(f"{ROOT}/features.npz", "rb") as file_features:
    features = ss.load_npz(file_features)
features

In [None]:
with (
    FS.open(f"{ROOT}/col2token.pkl.gz", "rb") as file_compressed,
    gzip.open(file_compressed, "rb") as file_pkl
):
    col2token = dict(enumerate(cpkl.load(file_pkl)))

assert len(col2token) == features.shape[1]
for i, (k, v) in enumerate(col2token.items()):
    if i >= 25:
        break
    print(k, ":", v)

In [None]:
with FS.open(f"{ROOT}/metadata.csv.gz", "rb") as file_metadata:
    metadata = pd.read_csv(file_metadata, parse_dates=["timestamp"], compression="gzip")
assert metadata.shape[0] == features.shape[0]
metadata["timestamp"] = metadata["timestamp"].apply(pd.Timestamp)
metadata

In [None]:
with FS.open(f"{ROOT}/labels.csv.gz", "rb") as file_labels:
    labels = pd.read_csv(file_labels, compression="gzip")
assert labels.shape[0] == features.shape[0]
labels

In [None]:
with FS.open(f"{ROOT}/map2d.npz", "rb") as file_vectors:
    process_map = np.load(file_vectors)["process_map"]
assert process_map.shape == (features.shape[0], 2)

## Restricting the study

If one cares to look,
one discovers that most processes in the dataset belong to a handful of dominating classes.
Let's scope this experiment to the top-15 process classes with the most associated instances.
The reader is welcome to look at other classes as curiosity strikes them.

First, let's capture that top-15 of categories:

In [None]:
top15_labels = labels.groupby("label", as_index=False).agg({"process_id": "count"}).sort_values("process_id", ascending=False).head(15)
top15_labels
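
As a cross-check of the selection above, the same kind of ranking can be sketched with `value_counts` on a toy frame. The data below is made up, and `top_classes` is a hypothetical helper. Note one small difference: `value_counts` counts rows with a non-null label, whereas the aggregation above counts non-null `process_id` values per label.

```python
import pandas as pd

# Hypothetical stand-in for the `labels` frame: one row per process instance.
toy_labels = pd.DataFrame({
    "process_id": range(8),
    "label": ["svchost", "ping", "svchost", "cmd", "ping", "svchost", "whoami", "svchost"],
})


def top_classes(frame: pd.DataFrame, n: int) -> pd.Series:
    """Return the n most frequent labels along with their instance counts."""
    return frame["label"].value_counts().head(n)


top2 = top_classes(toy_labels, 2)
# Boolean mask of the instances that belong to the retained classes.
mask = toy_labels["label"].isin(top2.index)
print(top2)
print(int(mask.sum()), "instances kept")
```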

Now, let's restrict the vectors and metadata to instances belonging to these 15 classes.

In [None]:
labels_top15 = labels.loc[labels["label"].isin(set(top15_labels["label"]))].copy()
indices_top15 = labels_top15.index.copy()
labels_top15.reset_index(drop=True, inplace=True)
labels_top15

In [None]:
processes_top15 = process_map[indices_top15, :]
features_top15 = features[indices_top15, :]
metadata_top15 = metadata.loc[indices_top15].copy()
processes_top15.shape, features_top15.shape, metadata_top15.shape

## See you lazer, summarizer

Key tools brought forward by [ThisNotThat](https://thisnotthat.readthedocs.io/en/latest/) are *interactive summarizers*: data frames or plots rendered as some kind of summary of data points selected in an associated scatter plot,
on the fly.
TNT offers a good few summarizers out of the box,
but we need two more here to support the work of understanding cyber telemetry.
The first bespoke summarizer computes the features with the best joint support for the selected points.

In [None]:
class SparseSupportSummarizer:
    """
    Summarizer for a DataSummaryPane.
    This takes a sparse matrix of counts or importances.  Then for any selection of data it computes the
    column marginals of that matrix and finds the columns with the largest marginals.

    It returns a DataFrame with the top max_features features along with their column marginals and support.

    Parameters
    ----------

    matrix: a sparse matrix
        This is the matrix which we will use for computing the marginals
    column_index_dictionary: dict
         A dictionary mapping from column indices to column names
    max_features: int <default: 10>
        The number of features to return
    proportional_support: bool <default: True>
        Whether to report support as a proportion of the selected points (True) or as a raw count (False)
    """
    def __init__(
        self,
        matrix,
        column_index_dictionary,
        max_features= 10,
        proportional_support = True
    ):
        self.matrix = matrix
        self.column_index_dictionary = column_index_dictionary
        self.max_features = max_features
        self.proportional_support = proportional_support

    def summarize(self, selected):
        # Use the indices handed to the summarizer rather than the global `plot` object.
        data = self.matrix[selected, :]
        column_marginal = np.array(data.sum(axis=0)).squeeze()
        largest_indices = np.argsort(column_marginal)[::-1][:self.max_features]
        features = [self.column_index_dictionary[x] for x in largest_indices]
        kinds, values = zip(*features)
        importance = column_marginal[largest_indices]
        # Read support at the same columns as the weights, so each row describes a single feature.
        support = np.array((data > 0).sum(axis=0)).squeeze()[largest_indices]
        if self.proportional_support:
            support = support / data.shape[0]
        return pd.DataFrame({'Kind': kinds, 'Value': values, 'Total weight':importance, 'support':support})
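
To see what this summary computes, here is a minimal standalone sketch on a tiny made-up matrix. The `top_features_by_weight` function is hypothetical and only mirrors the idea of the class above: sum the weights per column over the selected rows, then report, for the heaviest columns, the share of selected rows in which the feature is present.

```python
import numpy as np
import scipy.sparse as ss


def top_features_by_weight(matrix, selected, max_features=3):
    """Return (column indices, total weights, proportional support) for a selection."""
    data = matrix[selected, :]
    weight = np.asarray(data.sum(axis=0)).ravel()
    support = np.asarray((data > 0).sum(axis=0)).ravel()
    order = np.argsort(weight)[::-1][:max_features]
    # Weight and support are both read at the same column positions.
    return order, weight[order], support[order] / data.shape[0]


# Four made-up processes (rows) described by four features (columns).
toy = ss.csr_matrix(np.array([
    [2.0, 0.0, 1.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 4.0, 0.0],
]))
columns, weights, support = top_features_by_weight(toy, [0, 1, 3], max_features=2)
print(columns, weights, support)
```

In this toy selection, column 0 carries the most weight and is present in all three selected rows, while column 2 comes second and is present in two of the three.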

The second summarizer encoded here is a copy of one offered by TNT for computing feature importance by training a one-vs-all classifier of the selected data against the rest.
TNT's version does not handle sparse feature matrices yet, so we had to reimplement it.

In [None]:
class SparseFeatureImportanceSummarizer:
    """
    Summarizer for the PlotSummaryPane that constructs a class balanced, L1 penalized,
    logistic regression between the selected points and the remaining data.

    This version takes a sparse feature matrix and column_index_dictionary which maps from the
    indices of the matrix to the set of feature names.

    Then it displays that feature importance in a bar plot.
    The title is colour coded by model accuracy in order to give a rough approximation of
    how much trust you should put in the model.

    All of the standard caveats with using the coefficients of a linear model as a feature
    importance measure should be included here.

    It might be worth reading the sklearn documentation on the
    common pitfalls in the interpretation of coefficients of linear models
    (https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html)

    Parameters
    ----------

    data: sparse_matrix
        A sparse_matrix corresponding to the plot points.
    column_index_dictionary: dict
        A dictionary mapping from column indices to column names
    max_features: int <default: 15>
        The maximum number of features to display the importance for.
    tol_importance_relative: float <default: 0.01>
        The minimum coefficient magnitude, relative to the largest one, for a feature to be considered important.

    """

    def __init__(
        self,
        data,
        column_index_dictionary,
        max_features: int = 15,
        tol_importance_relative: float = 0.01,
    ):

        self.data = data  # Indexed 0 to length.
        self.max_features = max_features
        self.tol_importance_relative = tol_importance_relative
        self._features = column_index_dictionary
        self._classifier = None
        self._classes = None

    def summarize(self, selected, width: int = 600, height: int = 600):
        classes = np.zeros((self.data.shape[0],), dtype="int32")
        classes[selected] = True
        classifier = LogisticRegression(
            penalty="l1",
            solver="liblinear",
            class_weight="balanced",
            tol=1e-3,
            max_iter=20
        ).fit(self.data, classes)
        self._classifier = classifier
        self._classes = classes
        assert classifier.coef_.shape[0] == 1 or classifier.coef_.ndim == 1
        importance = np.squeeze(classifier.coef_)
        index_importance = np.argsort(-np.abs(importance))[: self.max_features]
        importance_abs = np.abs(importance)[index_importance]
        importance_relative = importance_abs / np.max(importance_abs)
        importance_restricted = importance[
            np.where(importance_relative > self.tol_importance_relative)
        ]

        selected_columns_tuples = [self._features[x] for x in index_importance[: len(importance_restricted)] ]
        selected_columns = [f"{kind}: {value}" for kind, value in selected_columns_tuples]

        model_acc = classifier.score(self.data, classes)
        fig = bpl.figure(
            y_range=selected_columns,
            width=width,
            height=height,
        )
        if model_acc > 0.9:
            fig.title = f"Estimated Feature Importance\nTrustworthiness high ({model_acc:.4} mean accuracy)"
            fig.title.text_color = "green"
        elif model_acc > 0.8:
            fig.title = f"Estimated Feature Importance\nTrustworthiness medium ({model_acc:.4} mean accuracy)"
            fig.title.text_color = "yellow"
        elif model_acc > 0.5:
            fig.title = f"Estimated Feature Importance\nTrustworthiness low ({model_acc:.4} mean accuracy)"
            fig.title.text_color = "orange"
        else:
            fig.title = f"Estimated Feature Importance\nTrustworthiness low ({model_acc:.4} mean accuracy)"
            fig.title.text_color = "red"

        fig.hbar(
            y=selected_columns,
            right=importance[index_importance[: len(importance_restricted)]],
            height=0.8,
        )
        # Label the Bokeh figure itself; a matplotlib call would not affect it.
        fig.xaxis.axis_label = "Coefficient values"
        return fig

Expect these two summarizers to be baked into a new version of TNT coming soon to a repository near you.
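
To see the idea behind the second summarizer in isolation, here is a minimal sketch on synthetic data. We plant one feature (column 3, an arbitrary choice) that is present exactly in the "selected" rows, fit the same kind of class-balanced, L1-penalized logistic regression, and check which coefficient dominates. The sizes, the random noise and the seed are all assumptions made for this illustration.

```python
import numpy as np
import scipy.sparse as ss
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_points, n_features, planted = 200, 6, 3

# Random binary noise features, stored sparse like the notebook's feature matrix.
dense = (rng.random((n_points, n_features)) < 0.3).astype(np.float64)
selected = np.arange(40)  # stand-in for a lasso selection
dense[:, planted] = 0.0
dense[selected, planted] = 1.0  # the planted feature marks the selection exactly
toy_features = ss.csr_matrix(dense)

# One-vs-rest target: 1 for selected points, 0 for everything else.
classes = np.zeros(n_points, dtype=int)
classes[selected] = 1

model = LogisticRegression(
    penalty="l1", solver="liblinear", class_weight="balanced", random_state=0
).fit(toy_features, classes)

coefficients = model.coef_.ravel()
ranking = np.argsort(-np.abs(coefficients))  # largest magnitude first
top_feature = int(ranking[0])
accuracy = model.score(toy_features, classes)
print(top_feature, accuracy)
```

The accuracy is measured on the very data used for training, which is why it reads as an optimistic indication of how separable the selection is rather than as an estimate of generalization.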

## Preparing metadata for prime time

Let's supplement the metadata dataframe we have for our top-15 processes with data better suited to visualization.

In [None]:
%%time
metadata_summary_top15 = pd.merge(
    metadata_top15,
    CategoricalColumnTransformer(
        object_column_name='process_id',
        descriptor_column_name=list(metadata_top15.columns[2:]),
        include_column_name=True
    ).fit_transform(metadata_top15.astype('str')).rename("event_summary").reset_index(),
    on="process_id",
    how="left"
).merge(labels_top15, on="process_id", how="left")
metadata_summary_top15["event_summary_string"] = metadata_summary_top15["event_summary"].apply(lambda x: "<br>".join(x))
metadata_summary_top15["freq"] = 1
metadata_summary_top15

Let's also create an alternative feature dictionary: features that correspond to filesystem paths tend to be obnoxiously long strings.
It is useful to clip these to the file name at the end.

In [None]:
col2token_viz = {i: (kind, value.split("\\")[-1]) for i, (kind, value) in col2token.items()}
assert len(col2token_viz) == features.shape[1]
for i, (k, v) in enumerate(col2token_viz.items()):
    if i >= 25:
        break
    print(k, ":", v)
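
The clipping rule used above is simple enough to try on a few made-up values. The `clip_to_file_name` helper below is hypothetical and merely wraps the same `split` expression.

```python
def clip_to_file_name(value: str) -> str:
    """Keep what follows the last backslash; values without one pass through."""
    return value.split("\\")[-1]


# Made-up feature values: a Windows path, a non-path value, a path ending in a backslash.
clipped_path = clip_to_file_name("C:\\Windows\\System32\\ping.exe")
clipped_address = clip_to_file_name("10.0.0.7")
clipped_directory = clip_to_file_name("C:\\Temp\\")
print(repr(clipped_path), repr(clipped_address), repr(clipped_directory))
```

Two things follow from this rule: a value that ends with a backslash is clipped to an empty string, and two distinct paths that end with the same file name end up with the same display token. That is fine for display purposes, which is why the original `col2token` dictionary is kept alongside.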

## Hierarchical map annotations

While a naked map might make sense to somebody who has explored the territory it describes,
it tends to be much easier to use when it is *annotated* with the names of locations,
roads and other landmarks.
The same goes for data maps.
TNT provides various tools to automate the production of a hierarchical set of annotations from the features of the dataset:
feature values displayed over clusters at various magnification scales.
It is tricky to describe but joyful to use.
The following code produces such a hierarchy of annotations for the top-15 data.
**Be patient**: producing these annotations takes up to 3 minutes on an M1-generation MacBook Pro;
it may take from several minutes up to half an hour on computers with fewer resources.
If one is using the canned data, we skip the calculations altogether and get the object out of the Azure Blob container.

In [None]:
%%time
if FS.isfile(f"{ROOT}/layers.pkl.gz"):
    with (
        FS.open(f"{ROOT}/layers.pkl.gz", "rb") as file_gzip,
        gzip.open(file_gzip, "rb") as file_hier
    ):
        hier_annotations = cpkl.load(file_hier)
else:
    infoweight = InformationWeightTransformer().fit_transform(features_top15).astype(np.float32)
    infoweight_compressed = TruncatedSVD(n_components=1024).fit_transform(infoweight)
    # Bind the result to the name used by the dashboard cell further down.
    hier_annotations = tnt.SparseMetadataLabelLayers(
        infoweight_compressed,
        processes_top15,
        features_top15,
        {i: value for i, (_, value) in col2token_viz.items()},
        cluster_map_representation=False,
        random_state=42
    )

## Bringing it all together

We finally have all the ingredients to put together our interactive dashboard.

In [None]:
# The scatter plot, hierarchically annotated.
plot = tnt.BokehPlotPane(
    processes_top15,
    labels=labels_top15["label"],
    width=600,
    height=600,
    show_legend=False,
    tools="pan,wheel_zoom,lasso_select,tap,reset"
)
plot.add_cluster_labels(hier_annotations, max_text_size=24)

# This widget enables editing the categorical labels associated with the various points (one label per point).
editor = tnt.LabelEditorWidget(plot.labels, selectable_legend=True)
editor.link_to_plot(plot)

# This is one of TNT's simplest search widgets.  See the ThisNotThat documentation for more powerful and flexible search options.
search = tnt.KeywordSearchWidget(
    pd.Series([
        " ".join(col2token[rc[1]][1] for rc in rcs)
        for _, rcs in it.groupby(
            np.stack(np.nonzero(features_top15)).T,
            key=lambda rc: rc[0]
        )
    ])
)
search.link_to_plot(plot)

# A widget to tweak plot properties in order to study the distribution of various features.
control_df = metadata_top15["THREAD,FLOW,PROCESS,FILE,REGISTRY,TASK,MODULE,USER_SESSION,SERVICE,SHELL,HOST".split(',')]
control = tnt.PlotControlWidget(raw_dataframe=control_df)
control.link_to_plot(plot)

# A pane to display detailed information about a single selected process.
info_pane = tnt.InformationPane(
    metadata_summary_top15,
    markdown_template="""\
# {label}

## {process_id}

---

{event_summary_string}
""",
    width=600)
info_pane.link_to_plot(plot)

# Class counts among selected data.
value_summarizer = tnt.summary.dataframe.ValueCountsSummarizer(labels_top15["label"])
value_summary_plot = tnt.summary.dataframe.DataSummaryPane(value_summarizer)
value_summary_plot.link_to_plot(plot)

# Time series summary of the occurrence of selected processes.
time_summarizer = tnt.summary.plot.TimeSeriesSummarizer(
    metadata_summary_top15,
    time_column='timestamp',
    count_column='freq'
)
time_summary_plot = tnt.summary.plot.PlotSummaryPane(time_summarizer)
time_summary_plot.link_to_plot(plot)

# Characterization of selected data by common feature support.
support_summarizer = SparseSupportSummarizer(features_top15, col2token_viz, max_features=16)
support_summary_df = tnt.summary.dataframe.DataSummaryPane(support_summarizer, width=600, sizing_mode=None)
support_summary_df.link_to_plot(plot)

# Feature importance summary for selected data.
feature_summarizer = SparseFeatureImportanceSummarizer(features_top15, col2token_viz, max_features=8)
feature_summary_plot = tnt.summary.plot.PlotSummaryPane(feature_summarizer, width=800, sizing_mode="stretch_both")
feature_summary_plot.link_to_plot(plot)

# Lay out the widgets that you are interested in using via Panel's excellent Row, Column and Tab functions.
pn.Column(
    pn.Row(plot, pn.Column(pn.Row(editor, pn.Column(search, control)))),
    pn.Tabs(
        ("Chronology", pn.Row(time_summary_plot, value_summary_plot)),
        ("Feature importance", pn.Row(feature_summary_plot, support_summary_df)),
        ("Details", info_pane))
)
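
The search widget in the dashboard cell is fed one text "document" per process, built by grouping the nonzero coordinates of the feature matrix by row. A minimal sketch of that construction on a tiny dense array follows. The tokens are made up, and we assume the sparse matrix yields its nonzero coordinates in the same row-major order as NumPy does here.

```python
import itertools as it
import numpy as np

# Made-up feature dictionary and a 3-process, 3-feature indicator matrix.
toy_tokens = {0: ("FILE", "ping.exe"), 1: ("FLOW", "10.0.0.7"), 2: ("MODULE", "ws2_32.dll")}
toy_matrix = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
])

# One (row, column) pair per nonzero entry, sorted by row.
coordinates = np.stack(np.nonzero(toy_matrix)).T

# Group consecutive pairs sharing a row; join the feature values of each group.
documents = [
    " ".join(toy_tokens[col][1] for _, col in pairs)
    for _, pairs in it.groupby(coordinates, key=lambda rc: rc[0])
]
print(documents)
```

One caveat of this construction: a row without any nonzero entry produces no group, so the documents that follow it would shift up by one position relative to the plotted points. Whether such rows exist in the actual data is not something this sketch can tell.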

Fun and/or useful things to do with this dashboard:

1. [ ] Pan and zoom around the scatter plot, watching feature tokens that characterize clusters come in and out of focus.
1. [ ] Select all processes involving a feature with the word `ping`.
    - Click on the **Search** button: pressing Enter on the keyboard does not always work.
1. [ ] Select a group of points with the lasso.
1. [ ] Count the number of processes from each class in a selected clump.
1. [ ] Appreciate when a clump of processes occurred through the dataset's timeline.
1. [ ] Look up how various features support the similarity between selected processes.
1. [ ] Look up the features of most importance with respect to differentiating some selected points from others.
    - **Remark**: the model training computed on the fly sometimes takes a few seconds to complete, so be patient when this plot fails to update rapidly.
1. [ ] Look up the number of events of each type that were generated by a single selected process instance.
1. [ ] Grab the indices of selected processes through `plot.selected`.
1. [ ] Appreciate, using marker sizes, the processes that generated the most events targeting `FLOW` objects; contrast with those that generated the most events targeting `FILE` objects.
1. [ ] Change the process class `tasklist.exe` to `tasklist` to merge the former into the latter.
    - The old color difference remains; it's a known bug.
1. [ ] Select a cluster of points and create a new label for them; name them according to a feature or two that separate them cleanly from others.
1. [ ] Grab the labels generated from mergings and new label creations through `plot.labels`.
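
Once labels have been edited, they can feed further analysis. The sketch below works on stand-ins: we assume that `plot.labels` can be read as a pandas Series aligned with the plotted points, and `plot.selected` as a list of row positions. The series, the selection and the `relabeled` helper are made up for illustration.

```python
import pandas as pd

# Stand-ins for the labels before editing, after editing, and a lasso selection.
original = pd.Series(["tasklist.exe", "tasklist", "ping", "ping", "cmd"])
edited = pd.Series(["tasklist", "tasklist", "ping sweep", "ping", "cmd"])
selected = [2, 3]


def relabeled(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """List the instances whose label changed, with old and new labels side by side."""
    changed = before != after
    return pd.DataFrame({"before": before[changed], "after": after[changed]})


changes = relabeled(original, edited)
# Class counts among the selected points, using the edited labels.
selected_counts = edited.iloc[selected].value_counts()
print(changes)
print(selected_counts)
```

In the notebook, one would substitute `labels_top15["label"]` for `original` and the dashboard's current labels for `edited`, under the assumptions stated above.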