**Commonplace detection in categorical telemetry data**

# Data engineering

This notebook shows how to encode telemetry events as a sequence of *bags of categorical labels*, and how to create a vector representation of this data that is suitable for visualization.

**Warning**: the implementation of this data engineering pipeline is memory- and CPU-intensive.
It should need at least 8 GB of RAM and 2 CPU threads to run to completion (this minimal configuration has not been tested; any confirmation of success with it would be appreciated);
16 GB with 8 CPU threads is recommended.
Larger resource budgets will be put to good use, up to about 64 CPUs and 256 GB of RAM.

## Preliminaries

In [None]:
import cloudpickle as cpkl
from collections import defaultdict
from dask import delayed
from dask.distributed import LocalCluster, Client, as_completed
import fsspec
import gzip
from hashlib import md5
import json
import numpy as np
import pandas as pd
import scipy.sparse as ss
from sklearn.preprocessing import Normalizer
import struct
import sys
from tqdm.auto import tqdm
import umap
import vectorizers as vz
import zstandard as zstd

This work focuses on the [Operationally Transparent Cyber Data Release](https://github.com/FiveDirections/OpTC-data), abbreviated to the **OpTC dataset**.
The stored form of this dataset minimizes storage cost rather than facilitating processing.
Thus, I publish here a prefiltering of the host-based sensor telemetry
(encoded in GZIP-compressed JSON monoliths according to the [extended CAR](https://github.com/FiveDirections/OpTC-data/blob/master/ecar.md) data model)
into smaller, more numerous chunks of ZSTD-compressed JSON for 3 of the more interesting hosts
(out of the 625 hosts that compose the whole dataset).
This data can be accessed through the following Azure Blob,
whose hosting is a courtesy of the Tutte Institute for Mathematics and Computing.

In [None]:
fs = fsspec.filesystem("abfs", account_name="scipy2023")
fs.ls("optc/")

Let's use the local machine as an ad hoc Dask cluster.
If you can access a better cluster, change the next cell to instantiate the Dask `Client` against it.
Note that it is critical for the memory limit of each worker to be at least 6 GB.
Only one Dask worker will come close to this limit,
so in principle this should not freeze the OS even if we allow the set of workers to manage more than the total amount of RAM available.

In [None]:
cluster = LocalCluster(threads_per_worker=1, memory_limit="6GB")
client = Client(cluster)
client

We will work on host 501, processing all days.
One may also choose host 201 or 351 if curious.
The timespan of the following analysis covers the whole data capture span, from September 17 to 25 (2019).
In a compute pinch, one may select a subset of these days.

In [None]:
HOST = 501
HOSTNAME = f"SysClient{HOST:04d}.systemia.com"
DAYS = ["*"]

Here are the *chunk* files: the data engineered so it can be consumed more easily using our Dask cluster.

In [None]:
ROOT_HOSTNAME = f"optc/{HOSTNAME}"
CHUNKS = sorted(sum(
    [fs.glob(f"optc/{HOSTNAME}/{day}/chunk.*.json.zstd") for day in DAYS],
    []
))
len(CHUNKS)

In [None]:
CHUNKS[:10]

The JSON files of host-based telemetry are composed of events whose exact set of fields is determined by the type of **object** changed by the event
(and, to a lesser degree, by the **action** that changes it).
The following filters the fields for each object type to retain only those that can be usefully understood as characterized by *categorical* values without any modification
(such as number binning or token lemmatization based on cyber expertise).

This simplest of approaches lends itself to much faster encoding of the categorical values:
indeed, value transformations often involve string operations to such a volume as to incur at least twice the computational cost of the following feature extraction code.
However, the cost of this more expensive work can be tamed and, most importantly, amortized by saving the non-explicit values (or their intermediate representation).
Having experimented with various degrees of sophistication to this task, I believe that what follows gives a pretty good gist of the correlations between events and their aggregates:
more sophisticated tokenization approaches appear to provide value, but the difference is not so remarkable as to warrant obfuscating this presentation.

The following data structure is slightly hard to understand,
as it provides layers of simplification to avoid tedious boilerplate.
In principle, it consists in a hierarchical dictionary whose leaves are sequences of categorical field descriptors.
The full hierarchy follows the convention:

```
object type
    action type
        computation identifier name
            sequence of categorical field descriptors
```

Starting from the bottom, a *categorical field descriptor* consists of the name of the field in the JSON object that describes the event, as well as its _kind_ -- the nature of the value, which makes it comparable to other values of the same nature.
For instance, two fields named respectively `src_ip` and `dest_ip` would both have IP addresses as values: they are thus of the same kind, and a token of nominal value `10.100.23.34` yields a correlation between two events even if one carries it in `src_ip` and the other in `dest_ip`.
However, if a hypothetical process were loaded from a file named `10.100.23.34`, and we did not set its kind as a `path`, it might be confused with an IP address token with the same value, yielding an undue correlation.
Such is thus the purpose of kinds.
Finally, many fields carry tokens whose kind is eponymous, so there is no need for a repetitive tuple -- these are described with just the field name.
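
To make the role of kinds concrete, here is a minimal sketch with made-up events and a made-up descriptor list (the helper names `normalize` and `tokens_of` are illustrative, not part of the pipeline). It shows that two flows sharing an address correlate through an `ip` token, while a file that merely has the same name does not.

```python
# Hypothetical descriptor list mixing bare names and (field, kind) pairs.
descriptors = ["action", ("src_ip", "ip"), ("dest_ip", "ip"), ("image_path", "path")]

def normalize(descriptor):
    """Expand a bare field name into a (field, kind) pair with an eponymous kind."""
    return (descriptor, descriptor) if isinstance(descriptor, str) else tuple(descriptor)

def tokens_of(event, descriptors):
    """Collect (kind, value) tokens for the fields that are present and non-empty."""
    tokens = []
    for field, kind in map(normalize, descriptors):
        value = event.get(field, "")
        if value:
            tokens.append((kind, value))
    return tokens

# Two made-up events: the same address appears under different fields...
flow_out = {"action": "START", "src_ip": "10.100.23.34"}
flow_in = {"action": "START", "dest_ip": "10.100.23.34"}
# ...and a made-up process whose image file happens to be named like that address.
odd_process = {"action": "CREATE", "image_path": "10.100.23.34"}

# Tokens shared by the two flows: the action and the IP address.
shared = set(tokens_of(flow_out, descriptors)) & set(tokens_of(flow_in, descriptors))
# Tokens shared by a flow and the odd process: none, thanks to the "path" kind.
confused = set(tokens_of(flow_in, descriptors)) & set(tokens_of(odd_process, descriptors))
print(shared)
print(confused)
```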

We assume that each event happened as part of a single *computation*: a bit of code running for a certain purpose.
As we will describe <a id="event_grouping"></a>[below](#events_grouped_by_processes), we construe such computations as *process instances*.
For most events, we can thus identify each computation by the process that was responsible for the occurrence of the event.
This process is uniquely identified by a UUID generated during data capture, and stored as field `actorID`.
There is, however, one special case: when a process is created (object type `PROCESS`, action `CREATE`),
we don't merely consider this as an action taken by the parent process.
We also make this event the first that occurs in the new process:
this induces useful correlations between processes that spawn the same children,
as well as between processes born from similar parents.
Thus, we represent this as if two events had occurred:
the creation of the child process,
under the responsibility of its parent;
and the start of the new process,
under its own responsibility.
In the latter case, the UUID for the child object is stored as the `objectID` field.
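
The double attribution of process creations can be sketched as follows. This is a simplified illustration with made-up identifiers in place of the dataset's UUIDs; the `attributions` helper is hypothetical and only mirrors the rule described above.

```python
def attributions(event):
    """
    Yield (computation_id, role) pairs for one event.
    A PROCESS/CREATE event is attributed twice: once to the parent (actorID)
    and once to the newborn child (objectID). Anything else goes to actorID only.
    """
    yield event["actorID"], "actor"
    if event["object"] == "PROCESS" and event["action"] == "CREATE":
        yield event["objectID"], "newborn"

# Made-up identifiers standing in for the UUIDs of the dataset.
events = [
    {"object": "PROCESS", "action": "CREATE", "actorID": "parent-1", "objectID": "child-7"},
    {"object": "FILE", "action": "READ", "actorID": "child-7", "objectID": "file-3"},
]
records = [pair for event in events for pair in attributions(event)]
print(records)
```

The creation event thus shows up both in the history of `parent-1` and as the first entry in the history of `child-7`.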

Finally, for most object types, we harvest values from the same subset of fields, regardless of action type.
The exception of process creations is present here again,
because the characterization of the spawning of a child process
and that of the start of a new process in itself are different,
and thus leverage distinct fields.
These are irrelevant to the other actions targeting a `PROCESS` object.
Thus, in most cases, the hierarchy is elided to

```
object type
    sequence of field descriptors (with eponymous kinds omitted)
```

The full hierarchy is spelled out only for `PROCESS` objects.

In [None]:
categoricals_easy = {
    "FLOW": ["object", "action", ("src_ip", "ip"), ("dest_ip", "ip"), ("src_port", "port"), ("dest_port", "port"), "l4protocol", "direction"],
    "FILE": ["object", "action", ("file_path", "path"), "info_class", ("new_path", "path")],
    "HOST": ["object", "action"],
    "MODULE": ["object", "action", ("module_path", "path")],
    "REGISTRY": ["object", "action", ("key", "registry-key"), ("value", "registry-value"), ("type", "registry-type")],
    "SERVICE": ["object", "action", ("name", "service-name")],
    "SHELL": ["object", "action"],
    "TASK": ["object", "action", "path", ("task_name", "task-name")],
    "THREAD": ["object", "action"],
    "USER_SESSION": ["object", "action", ("user", "user-domain"), ("requesting_domain", "domain"), ("requesting_user", "user"), ("src_ip", "ip"), ("src_port", "port")],
    "PROCESS": {
        "CREATE": {
            "actorID": ["object", "action", ("image_path", "child"), ("image_path", "path")],
            "objectID": [("parent_image_path", "parent"), ("image_path", "process"), ("user", "user-domain")]
        },
        "OPEN": {
            "actorID": ["object", "action"]
        },
        "TERMINATE": {
            "actorID": ["object", "action"]
        }
    }
}

In [None]:
def iter_events(path_chunk):
    """
    Generates the sequence of events (as JSON objects, hence dictionaries) decoded out of a *chunk* --
    a ZSTD-compressed JSON Lines file.
    """
    with fs.open(path_chunk, "rb") as file_raw, zstd.open(file_raw, mode="rt", encoding="utf-8") as file_decompressed:
        for line in file_decompressed:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Skip ill-formed records.
                pass

In [None]:
def extract_features(path_chunk):
    """
    Filters the events generated out of a *data chunk* so that each is reported as a
    3-tuple composed of timestamp, computation identifier and its characterization as a
    sequence of categorical tokens.
    """
    for event in iter_events(path_chunk):
        obj = event["object"]
        if obj not in categoricals_easy:
            continue
        categoricals = (
            categoricals_easy[obj].get(event["action"], {})
            if isinstance(categoricals_easy[obj], dict)
            else {"actorID": categoricals_easy[obj]}
        )
        for identifier, features in categoricals.items():
            tokens = []
            for feature in features:
                field, kind = (feature, feature) if isinstance(feature, str) else feature
                if value := event.get(field, ""):
                    tokens.append((kind, value))
            yield (pd.Timestamp(event["timestamp"]), event[identifier], tokens)

In [None]:
def tabulate_features(path_chunk):
    """
    Encodes the sequence of events and their extracted features into a Pandas dataframe.
    """
    return pd.DataFrame(
        data=extract_features(path_chunk),
        columns=["timestamp", "process_id", "tokens"]
    ).astype({"process_id": "category"})

The feature engineering plan to enable visualization on the basis of *similarity of computations* is as follows:

1. Encode each event into a vector using *one-hot encoding*.
1. Group events by computation identifier, producing a vector representation for each computation (process) as the **sum** of its events.
1. Drop process vectors that are not described by enough events for their representation to mean much.
1. Clean up the feature space, getting rid of non-informative features such as [spurious values](https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47) and *orphan values* (which describe too few events).
1. Compress the vector representation by [manifold learning](https://scikit-learn.org/stable/modules/manifold.html) down to two dimensions.

## Step 1: computing the event feature matrix

[One-hot encoding](https://en.wikipedia.org/wiki/One-hot) is a process for converting [multisets](https://en.wikipedia.org/wiki/Multiset) of categorical *tokens* into vectors of value counts.
One associates each multiset to one row of a matrix, and each token of the union of all multisets (the *vocabulary* of the dataset) to a column of that matrix.
Each matrix entry $(i, j)$ then corresponds to the number of occurrences of token $j$ in multiset $i$.
Such numbers of occurrences are often referred to as token *weights*.
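
As a minimal, pure-Python sketch of this encoding (the `count_encode` helper and the sample tokens are made up for illustration; the notebook itself relies on a library class for this), one can build the vocabulary and the count matrix like so:

```python
from collections import Counter

def count_encode(multisets):
    """Return (vocabulary, rows), where rows[i][j] counts token vocabulary[j] in multiset i."""
    vocabulary = sorted({token for multiset in multisets for token in multiset})
    column = {token: j for j, token in enumerate(vocabulary)}
    rows = []
    for multiset in multisets:
        row = [0] * len(vocabulary)
        for token, count in Counter(multiset).items():
            row[column[token]] = count
        rows.append(row)
    return vocabulary, rows

# Two made-up multisets of (kind, value) tokens; the second repeats one token.
multisets = [
    [("object", "FILE"), ("action", "READ"), ("path", "a.txt")],
    [("object", "FILE"), ("action", "READ"), ("path", "a.txt"), ("path", "a.txt")],
]
vocabulary, rows = count_encode(multisets)
print(vocabulary)
print(rows)
```

A dense list of lists is only workable for toy data; with a vocabulary as large as the one at hand, a sparse matrix is the practical choice.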

At the Tutte Institute, we have put together the [Vectorizers](https://vectorizers.readthedocs.io/en/latest/) library as a repository of our tricks for turning variable-length data objects into fixed-length vectors.
In particular, class `NgramVectorizer`, instantiated to handle a vocabulary of 1-grams (i.e. normal tokens), not only performs one-hot encoding,
but also provides,
through its `+` operator,
a way to merge the encodings of a segmented dataset.
It is thus ideal for handling our set of telemetry chunk files.
Once each chunk is encoded, we sum the encodings hierarchically.
(I wanted to do that with a [Dask bag](https://docs.dask.org/en/stable/bag.html),
but a bug frustrated me, so I got on with my own sum implementation.)
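
The hierarchical sum is a pairwise tree reduction. Here is a sketch of the idea on plain Python lists, whose `+` concatenates in order; it assumes only that the merge operator is associative and that the left-to-right order of summands matters. The `tree_sum` name is illustrative.

```python
def tree_sum(items):
    """Sum a non-empty list pairwise, level by level, preserving left-to-right order."""
    level = list(items)
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # the odd one out is carried over to the next level
            paired.append(level[-1])
        level = paired
    return level[0]

# Lists stand in for per-chunk encodings: `+` concatenates them in order.
chunks = [[1], [2], [3], [4], [5]]
total = tree_sum(chunks)
print(total)
```

Compared to a left-to-right running sum, the tree shape lets the sums of one level run in parallel, and keeps the number of merges any one summand goes through logarithmic in the number of chunks.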

This operation is memory-intensive, as *a whole lot* of data is summarized into the sparse `event_matrix`.
Please tolerate Dask's nervous chatter regarding memory usage and garbage collection.

In [None]:
def vectorize_features(path_chunk):
    return vz.NgramVectorizer().fit(tabulate_features(path_chunk)["tokens"])

In [None]:
%%time
summands = [[delayed(vectorize_features)(chunk) for chunk in CHUNKS]]
while len(summands[-1]) > 1:
    to_sum = summands[-1]
    sums = []
    for i in range(0, len(to_sum), 2):
        if i + 1 < len(to_sum):
            sums.append(to_sum[i] + to_sum[i + 1])
        else:
            sums.append(to_sum[i])
    summands.append(sums)

futs = client.compute(sum(summands, []))
for fut in tqdm(as_completed(futs), total=sum(len(level) for level in summands)):
    pass

vzr_all = futs[-1].result()
event_matrix = vzr_all._train_matrix
client.cancel(futs)
del futs
event_matrix

## Steps 2 and 3: grouping events into processes, and eliminating ill-represented processes

<a id="events_grouped_by_processes"></a>
We discussed [earlier](#event_grouping) that subsequences of events form coherent *computations*.
While there are many ways to cut the event sequence into subsequences so as to delineate their purpose,
the approach immediately at hand is to focus on *process instances*.
Modern general-purpose operating systems, when asked to run a certain program, box this computation into a *process*:
it serves as a unit to allocate resources and protect memory structure from other processes.
The OpTC dataset does a great job of associating events to the process instances that generated them,
and most processes have good enough unity of purpose for us to latch onto them as the context around which to group events.

One very simple but hacky way to combine the vector representations of events into one for each process is to **sum** (or **average**) these vectors together.
This effectively characterizes each process by the number of times each categorical value occurred through the events that it generated.
This representation discards a lot of information: event timing, event order, and even the boundaries between events.
However, it makes the vector distance between two processes that generate the same events low, so that such processes come out as highly similar.
It also induces undue similarity between processes whose events differ in timing, order, or even nature, but we go in knowing that.
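
A toy illustration of what the sum keeps and what it loses, using made-up events and a hypothetical `process_vector` helper built on `collections.Counter` as a stand-in for sparse vectors:

```python
from collections import Counter

def process_vector(events):
    """Sum the token counts of all events of a process into one sparse vector."""
    total = Counter()
    for tokens in events:
        total.update(tokens)
    return total

# Made-up events, each a list of (kind, value) tokens.
read_a = [("object", "FILE"), ("action", "READ"), ("path", "a.txt")]
write_b = [("object", "FILE"), ("action", "WRITE"), ("path", "b.txt")]
read_b = [("object", "FILE"), ("action", "READ"), ("path", "b.txt")]
write_a = [("object", "FILE"), ("action", "WRITE"), ("path", "a.txt")]

process_1 = process_vector([read_a, write_b])
process_2 = process_vector([write_b, read_a])   # same events, opposite order
process_3 = process_vector([read_b, write_a])   # different events altogether
print(process_1 == process_2, process_1 == process_3)
```

All three processes end up with the same vector: the second differs only in event order, while the third reads and writes different files, which is the undue similarity mentioned above.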

An issue that arises here is that the number of events generated by each process varies a lot.
It makes little sense to analyze the regularity of the behavior of processes that generate very few events.
Let's thus first analyze the distribution of total process weights as a proxy for the notion of *proper* process characterization.

In [None]:
def summarize_processes(metadata):
    return metadata.groupby("process_id", as_index=False).agg({"timestamp": "min", **{col: "sum" for col in metadata.columns if col not in {"timestamp", "process_id"}}})

In [None]:
def events_by_process(path_chunk):
    features = tabulate_features(path_chunk)
    metadata = features[["timestamp", "process_id"]].join(
        pd.DataFrame(
            data=iter(features["tokens"].apply(lambda tokens: {value: 1.0 for kind, value in tokens if kind == "object"}))
        ),
        how="inner"
    )
    metadata["event_index"] = pd.Series(metadata.index).apply(lambda x: [x])
    return summarize_processes(metadata)

To stay abreast of what's going on, let's eyeball an example of the summary of a process.

In [None]:
%%time
events_by_process(CHUNKS[0])

The timestamp of the process corresponds to that of the first event found associated to it over the chunk's sequence.
We also count the number of each type of events, denoted by `object` type.
Finally, we keep a list of the event indices associated to each process.
Let's merge these index lists over all chunks,
building a dictionary that maps process identifiers to lists of event indices, expressed not relative to each chunk,
but rather relative to the `event_matrix` built above.
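
The translation from chunk-local to global row indices boils down to a running offset. A minimal sketch, with made-up chunk summaries and an illustrative `merge_indices` helper, assuming the chunks are visited in the same order in which their rows were stacked:

```python
def merge_indices(chunks):
    """
    `chunks` is a list of (n_events, {process_id: [local row indices]}) pairs,
    in the same order as the chunks were stacked into the global matrix.
    """
    merged = {}
    offset = 0
    for n_events, local in chunks:
        for process_id, indices in local.items():
            merged.setdefault(process_id, []).extend(i + offset for i in indices)
        offset += n_events
    return merged

# First chunk holds 3 events, second chunk holds 2.
chunks = [
    (3, {"p1": [0, 2], "p2": [1]}),
    (2, {"p2": [0], "p3": [1]}),
]
process2rows = merge_indices(chunks)
print(process2rows)
```

Process `p2` spans both chunks: its second event is local row 0 of the second chunk, hence global row 3.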

In [None]:
%%time
process2ievent = {}
total_events = 0
metadata_processes = pd.DataFrame()
for fut in tqdm(client.map(events_by_process, CHUNKS), total=len(CHUNKS)):
    processes = fut.result()
    metadata_processes = summarize_processes(pd.concat([metadata_processes, processes.drop(columns=["event_index"])], ignore_index=True).fillna(0.0))
    for process_id, indices in processes[["process_id", "event_index"]].itertuples(index=False):
        process2ievent.setdefault(process_id, [])
        for index_row_chunk in indices:
            process2ievent[process_id].append(index_row_chunk + total_events)
    total_events += processes["event_index"].apply(len).sum()

len(process2ievent)

In [None]:
metadata_processes

That's *very many* processes.
To determine which of these don't carry enough features to be reliably characterized compared to others,
let's look at the empirical distribution of weight totals (over logarithmic bins).

In [None]:
features_per_event = np.array(event_matrix.sum(axis=1)).squeeze()
features_per_process = pd.Series({process_id: sum([features_per_event[i] for i in indices]) for process_id, indices in tqdm(process2ievent.items())})
features_per_process

In [None]:
features_per_process.apply(np.log10).hist(bins=range(-1, 6))

It does not make much sense to me to keep processes described by a total number of categorical features less than 10.
Let's drop the process instances that make up the first column.

<a id="pruning"></a>
Since we compute process vectors as sums of event vectors,
each process corresponds to a linear combination of the rows of `event_matrix` associated to it
(which we stored in `process2ievent`).
Let's thus build a *projection matrix* that will be multiplied on the left of `event_matrix` to compute the process vectors.
We will restrict the rows of this projection matrix so as to drop the process instances with insufficient features.
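
Here is a small dense sketch of that idea with NumPy. The toy matrix, the process-to-rows mapping and the threshold `min_weight` are all made up for illustration, and a dense array is used only for readability; a sparse matrix is what makes this tractable at scale.

```python
import numpy as np

# Toy event matrix: 5 events (rows) over 3 features (columns).
events = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])
# Made-up mapping from each process to the rows of its events.
process_rows = {"p1": [0, 2], "p2": [1, 4], "p3": [3]}
min_weight = 2  # hypothetical threshold on the total weight of a process

weights = {pid: events[rows].sum() for pid, rows in process_rows.items()}
kept = [pid for pid in process_rows if weights[pid] >= min_weight]

# One row per kept process, one column per event; a 1 selects an event.
projection = np.zeros((len(kept), events.shape[0]), dtype=int)
for irow, pid in enumerate(kept):
    projection[irow, process_rows[pid]] = 1

# Left-multiplying sums the selected event rows for each kept process.
processes = projection @ events
print(kept)
print(processes)
```

Process `p3` carries a total weight of 1 and so gets no row in the projection; its event simply ends up selected by nobody.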

In [None]:
%%time
irows = []
icols = []
process2irow = {}
irow2process = {}
irow_next = 0
for process_id, indices in tqdm(process2ievent.items()):
    if features_per_process.loc[process_id] >= 10:
        irow = irow_next
        irow_next += 1
        irows += [irow] * len(indices)
        icols += indices
        process2irow[process_id] = irow
        irow2process[irow] = process_id

projection = ss.coo_matrix((np.ones((len(irows),), dtype=np.int32), (irows, icols)), shape=(len(process2irow), event_matrix.shape[0])).tocsr()
assert set(np.array(projection.sum(axis=0)).squeeze()) <= {0, 1}
projection

In [None]:
process_matrix = (projection @ event_matrix).astype(np.float32)
process_matrix

Let's prune the process metadata frame to mirror the rows of `process_matrix`.

In [None]:
pruned = sorted(list(irow2process.items()))
metadata_pruned = metadata_processes.set_index("process_id").loc[[process_id for _, process_id in pruned]].copy().reset_index()
assert pd.Series(metadata_pruned.index).equals(pd.Series([i for i, _ in pruned]))
metadata_pruned

## Step 4: feature space clean-up

The vector representation we create means to visualize the *similarity structure* present between process instances.
As such, we have to focus on features that embody this structure.
There are two categories of features that fail to do this; being uninformative, they are best discarded.
On the one hand, *spurious features* are shared at roughly the same weight by too many processes, so they hamper the differentiation between processes.
On the other hand, *orphan features* occur in only a very small group of processes, so they distort the actual similarity between processes.
There is also a special class of orphans:
features that occur in no process at all,
as the [pruning](#pruning) we just did discarded all the process instances that they characterized.
A look at the distribution of the total weight associated to each feature helps determine how to cull the uninformative.
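
As a toy sketch of this triage (the `classify_features` helper, the matrix and the `orphan_max` threshold are made up; spurious features are left out, since spotting them takes more than a total weight), one can sort the columns of a count matrix by their total weight:

```python
def classify_features(matrix, orphan_max):
    """
    Split the column indices of a dense count matrix (a list of rows) into
    unused (total weight 0), orphans (0 < weight <= orphan_max) and kept.
    """
    totals = [sum(column) for column in zip(*matrix)]
    unused = [j for j, t in enumerate(totals) if t == 0]
    orphans = [j for j, t in enumerate(totals) if 0 < t <= orphan_max]
    kept = [j for j, t in enumerate(totals) if t > orphan_max]
    return unused, orphans, kept

# Three processes over four features.
matrix = [
    [2, 0, 1, 0],
    [3, 0, 0, 0],
    [4, 0, 0, 1],
]
unused, orphans, kept = classify_features(matrix, orphan_max=3)
print(unused, orphans, kept)
```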

In [None]:
feature_importance = pd.Series(np.array(process_matrix.sum(axis=0)).squeeze())

How many features no longer characterize any process?

In [None]:
sum(feature_importance == 0)

Let's plot a histogram of the remaining features over logarithmic bins.

In [None]:
feature_importance.loc[feature_importance > 0].apply(np.log10).hist(bins=[-1,0,1,2,3,4,5])

Most features, by a large factor, are orphans.
We seem not to have any spurious feature, as none is associated to more than 10000 process instances.
Let's take a more detailed look at the first column of the previous histogram.

In [None]:
feature_importance.loc[feature_importance < 10].hist(bins=np.arange(11) - 0.5)

Again, most of these rarely used features are literal orphans: associated to one or two processes.
Let's cull any feature whose total weight does not exceed 3.

In [None]:
%%time
col2token = []
token2col = {}
indices_keep = []
for i, count in enumerate(feature_importance):
    if count > 3:
        indices_keep.append(i)
        token = vzr_all.column_index_dictionary_[i]
        index_new = len(col2token)
        col2token.append(token)
        token2col[token] = index_new

culled_matrix = process_matrix[:, indices_keep].copy()
culled_matrix

The same way that pruning off processes may reduce the usage of a feature to nil,
the culling of features may in turn discard the proper representation of processes!
Let's check that each process is still characterized by a total feature weight greater than 5.

In [None]:
features_per_process_redux = np.array(culled_matrix.sum(axis=1)).squeeze()
assert np.min(features_per_process_redux) > 5.0

## Step 5: dimension reduction by manifold learning

While t-SNE is a good default choice for compressing vectors to a two-dimensional representation for visualization,
I rely instead on [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html), from my friends at the Tutte Institute.
Its implementation is highly efficient and scalable,
and it handles the computation of the $k$ nearest neighbors graph over a very large set of metrics and pseudo-metrics.
These include the [Hellinger distance](https://en.wikipedia.org/wiki/Hellinger_distance),
the metric best supported by probability theory to assess the similarity between multinomial distributions
(which count vectors *are* once $l_1$-normalized).
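
For reference, the Hellinger distance between two discrete distributions $p$ and $q$ is $\frac{1}{\sqrt{2}}\sqrt{\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$. A pure-Python sketch on made-up count vectors (the helper names are illustrative, and this is not how UMAP computes it internally):

```python
from math import sqrt

def l1_normalize(counts):
    """Scale a vector of non-negative counts so that its entries sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same support."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)

# Same proportions at different scales: the distance vanishes.
same_shape = hellinger(l1_normalize([2, 2, 0]), l1_normalize([10, 10, 0]))
# No feature in common: the distance reaches its maximum of 1.
disjoint = hellinger(l1_normalize([3, 0, 0]), l1_normalize([0, 0, 7]))
print(same_shape, disjoint)
```

The first comparison shows why normalization matters here: a process that repeats the same behavior five times as often lands at distance 0 from the shorter one.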

Now, it's easier in practice to run manifold learning on matrices of unique rows.
We thus first deduplicate the rows of the culled matrix,
keeping an inverse (reduplicating) index so we can still obtain a 2D vector representation for all processes.
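
The deduplicate-then-reduplicate pattern can be sketched on a small dense array. Note that this toy version calls `np.unique` directly on the rows, which is fine for a small dense matrix but is not a drop-in for a large sparse one; the stand-in embedding is made up.

```python
import numpy as np

matrix = np.array([
    [1, 0, 2],
    [0, 3, 0],
    [1, 0, 2],
    [0, 3, 0],
    [5, 0, 0],
])
_, index_u, inverse_u = np.unique(matrix, axis=0, return_index=True, return_inverse=True)
inverse_u = inverse_u.ravel()  # guard against NumPy versions returning a 2D inverse

# index_u picks one representative row per group of duplicates.
unique_rows = matrix[index_u]
# Stand-in for a 2D embedding computed on the unique rows only.
embedding_unique = np.arange(2 * len(unique_rows), dtype=float).reshape(-1, 2)
# inverse_u maps every original row back to its representative.
embedding_all = embedding_unique[inverse_u]
print(unique_rows.shape, embedding_all.shape)
```

Duplicate rows receive the exact same 2D coordinates, so they will overlap on the map.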

In [None]:
%%time

def md5_list(it):
    return struct.unpack("<QQ", md5(memoryview(np.array(it))).digest())

culled_lil = culled_matrix.tolil()
hh = np.zeros(shape=(culled_matrix.shape[0], 4), dtype=np.uint64)
for i, indices_values in enumerate(zip(culled_lil.rows, culled_lil.data)):
    hh[i, :] = sum((md5_list(it) for it in indices_values), ())
_, index_u, inverse_u, counts_u = np.unique(hh, axis=0, return_index=True, return_inverse=True, return_counts=True)
index_u.shape, inverse_u.shape

In [None]:
unique_matrix = culled_matrix[index_u, :]
unique_matrix

Now, perform the $l_1$ normalization to leverage the Hellinger distance between our process representations.

In [None]:
normalized_matrix = Normalizer(norm="l1").fit_transform(unique_matrix)
normalized_matrix

We run a variant of UMAP called [densMAP](https://umap-learn.readthedocs.io/en/latest/densmap_demo.html),
which strikes a trade-off between agglomeration of similar data vectors
and the preservation of density differences between distinct vector neighborhoods in high-dimensional space.
This is critical for the detection of the [large dense clusters](https://drive.google.com/file/d/1ZijF656jj7x8AobIypdyQErn0hXOpIFU/view?usp=sharing)
of 2D vectors that characterize *commonplace behaviors*.

One thing to note regarding UMAP is that its outcome, as configured here, is not reproducible from one run to the next.
Indeed, it runs stochastic gradient descent in parallel, and no random seed is set; fixing the `random_state` parameter would make runs repeatable on a given machine, at the cost of disabling this parallelism.
Therefore, expect that your UMAP computations will be different from mine, but still reflect the same local structures between objects.

In [None]:
%%time
process_protomap = umap.UMAP(
    n_components=2,
    metric="hellinger",
    densmap=True,
    dens_lambda=4,
    n_epochs=800,
    verbose=True
).fit_transform(normalized_matrix)
process_protomap

We reduplicate this protomap using the inverse index computed during deduplication.

In [None]:
process_map = process_protomap[inverse_u, :]
process_map.shape

## One last thing: process labels

While it is possible to look at a map of processes without any other information,
it is much easier to navigate this map if we can rely on certain categorizations of processes, by which we may color them.
A natural, if very fine-grain, categorization of processes comes from their **origin**:
the artifact that started the software that they contain.
The most precise such artifact consists in the **command line** of a process.
If the process was started through a terminal or script,
the command line carries a lot of information regarding the intent of the user or script author,
which is highly useful as far as categories go.
The unfortunate aspect of using command lines as categorical labels is that in most settings,
they are awfully variable
(the cardinality of the vocabulary formed by all command lines making up a dataset can be on the same order of magnitude as the number of processes),
and their semantic similarities get lost.
An example of this second issue is that the following two invocations of `ping` serve the exact same purpose:

```sh
ping -w 1000 localhost
ping -w 1000 127.0.0.1
```

So process instances associated to either of these two command lines would be considered to belong to distinct categories,
while they actually make up a single one.
Another example comes from the usage of full paths to programs, as opposed to lexical elision enabled by such mechanisms as the `PATH` environment variable.
Indeed, the following three command lines are semantically the same:

```sh
tasklist
tasklist.exe
C:\Windows\System32\tasklist.exe
```
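
To give a feel for what even a shallow normalization would involve, here is a naive sketch, not used anywhere in this notebook, that reduces the program part of a command line to a bare name. The `program_key` helper is hypothetical; it would not reconcile the two `ping` invocations above, and it does not handle quoted paths containing spaces.

```python
import ntpath

def program_key(command_line):
    """
    Reduce the first word of a command line to a bare, lowercase program name:
    the directory and a trailing `.exe` extension are dropped.
    Arguments are kept as they are.
    """
    program, _, arguments = command_line.strip().partition(" ")
    name = ntpath.basename(program).lower()
    stem, extension = ntpath.splitext(name)
    if extension == ".exe":
        name = stem
    return (name, arguments.strip())

command_lines = ["tasklist", "tasklist.exe", r"C:\Windows\System32\tasklist.exe"]
keys = {program_key(c) for c in command_lines}
print(keys)
```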

A full modeling of the categorical space spanned by command lines is outside the scope of this project.
Besides, the cardinality of the set of command lines in the OpTC dataset is peculiarly small,
reflecting the fact that most of the activity captured in this dataset was performed by automated scripts.

Unfortunately, not all processes in the dataset map to a PROCESS-CREATE record to document their associated command line.
While all processes must effectively start by generating such an event,
there is always data loss to host-based sensor systems,
no thanks to icky engineering difficulties.
So when no command line can be found for a process instance,
we fall back on its `image_path`,
the path to the executable file whose execution started the process
(and which is present as a field to every associated event).

The capture of a unique label for each process is performed by crunching through all events, enumerating *label proposals* for each process. If such a proposal is a command line, it will take priority over any other.
Eventually, we are able to keep none but the top-priority proposal for each process instance.
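
The selection of the top-priority proposal can be sketched in plain Python as follows. The `best_labels` helper and the sample proposals are made up; a lower `importance` number means a higher priority, and this sketch breaks ties by keeping the first proposal seen, which is an assumption of the illustration rather than a documented property of the pipeline.

```python
def best_labels(proposals):
    """
    `proposals` is an iterable of (process_id, importance, label) triples,
    where a lower importance means a higher priority.
    Keep the first top-priority proposal seen for each process.
    """
    best = {}
    for process_id, importance, label in proposals:
        if process_id not in best or importance < best[process_id][0]:
            best[process_id] = (importance, label)
    return {process_id: label for process_id, (_, label) in best.items()}

proposals = [
    ("p1", 10, r"C:\Windows\System32\ping.exe"),   # image path seen first
    ("p1", 0, "ping -w 1000 localhost"),           # command line wins
    ("p2", 10, r"C:\Windows\explorer.exe"),        # only an image path is known
]
process_labels = best_labels(proposals)
print(process_labels)
```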

In [None]:
def filter_labels(proposals):
    return proposals.sort_values("importance", ascending=True).drop_duplicates(subset=["process_id"], keep="first", ignore_index=True)

In [None]:
def label_processes(path_chunk):
    data = []
    for event in iter_events(path_chunk):
        if event["object"] == "PROCESS" and event["action"] == "CREATE":
            if command_line := event.get("command_line", ""):
                data.append((event["objectID"], 0, command_line))
            elif image_path := event.get("image_path", ""):
                data.append((event["objectID"], 10, image_path))
            if parent_image_path := event.get("parent_image_path", ""):
                data.append((event["actorID"], 10, parent_image_path))
        else:
            if image_path := event.get("image_path", ""):
                data.append((event["actorID"], 10, image_path))

    return filter_labels(pd.DataFrame(data=data, columns=["process_id", "importance", "label"]))

In [None]:
labels_known = pd.DataFrame()
for fut in tqdm(client.map(label_processes, CHUNKS), total=len(CHUNKS)):
    labels_known = filter_labels(pd.concat([labels_known, fut.result()], ignore_index=True))
labels_known

After all this, if any process is still without label, we give up and label it as `(unknown)`.

In [None]:
labels = pd.Series(irow2process, name="process_id").to_frame().merge(labels_known[["process_id", "label"]], on="process_id", how="left").fillna("(unknown)")
labels

In [None]:
labels.loc[labels["label"] == "(unknown)"].count()

## Save for later

[Notebook 02](02%20Interactive%20visualization.ipynb) in this repository will cover the interactive visualization
(was it a spoiler at this point?) of all the vector data we generated.
We write it all to disk so that notebook can read it back.

In [None]:
%%time
with open("manifest.json", "wt", encoding="utf-8") as file:
    json.dump(
        {
            "host": str(HOST),
            "days": [str(d) for d in DAYS]
        },
        file
    )
ss.save_npz("features.npz", culled_matrix, compressed=True)
np.savez_compressed("map2d.npz", process_map=process_map)
metadata_pruned.to_csv("metadata.csv.gz", index=False, compression="gzip")
labels.to_csv("labels.csv.gz", index=False, compression="gzip")
with gzip.open("col2token.pkl.gz", "wb") as file:
    cpkl.dump(col2token, file)

If one would rather use the data I canned into the Azure bucket instead of the data they just generated,
they can delete or rename any of these files:

1. `manifest.json`
1. `vectors.npz`
1. `labels.csv.gz`
1. `col2token.pkl.gz`

Even altering just one will get the visualization notebook to fall back onto canned data.