# Embedding command lines as distributions on point clouds of token vectors

The data map produced in [notebook 1](Command%20lines%20-%20Bags%20of%20words.ipynb) is nice, but flawed.
Bags of words tend to yield _fragmented_ vector sets,
where similar objects spread across a lot of space, over a large number of vectors.
This makes their joint analysis trickier than it should be.
The bag-of-words model also assumes that tokens are independent and order-free, two properties that command line tokens definitely don't satisfy.
Finally, it doesn't learn any of the latent _synonymy semantics_ of the data.
For instance, when used to embed text, bags of words see words such as _red_, _carmine_ and _scarlet_ as strictly distinct, which induces dissimilarity between the corresponding sentences.
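
As a minimal sketch of that last point, the snippet below builds bag-of-words vectors for three toy sentences and compares them with cosine similarity. The sentences and the helper names (`bag_of_words`, `cosine_similarity`) are made up for illustration and are not part of notebook 1.

```python
from collections import Counter
import math


def bag_of_words(tokens, vocabulary):
    """Count how many times each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Hypothetical toy sentences, already tokenized.
red_dress = ["red", "dress"]
scarlet_dress = ["scarlet", "dress"]
blue_dress = ["blue", "dress"]

vocabulary = sorted(set(red_dress + scarlet_dress + blue_dress))
vec_red, vec_scarlet, vec_blue = (
    bag_of_words(sentence, vocabulary)
    for sentence in (red_dress, scarlet_dress, blue_dress)
)

# Both similarities come only from the shared word "dress":
# the bag of words cannot tell that scarlet is closer to red than blue is.
sim_scarlet = cosine_similarity(vec_red, vec_scarlet)
sim_blue = cosine_similarity(vec_red, vec_blue)
print(sim_scarlet, sim_blue)
```

The two printed values are identical: as far as this representation is concerned, _scarlet_ is exactly as far from _red_ as _blue_ is.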

We can analyze the token vocabulary by considering how tokens co-occur with one another,
and where.
Co-occurrence helps with finding synonymous tokens,
and with bridging the similarity gaps that fragment bag-of-words vectors.
Let's apply this insight to the analysis of command lines.
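
To see why co-occurrence can surface synonyms, here is a toy sketch. It counts, for each token, which other tokens share a command line with it, then compares those count vectors. The command lines are hypothetical, and counting over whole command lines is a simplifying assumption, not a description of what the `vectorizers` library does internally.

```python
from collections import defaultdict
from itertools import permutations
import math


def cooccurrence_counts(sequences):
    """Map each token to a sparse vector counting the tokens it shares a sequence with."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sequences:
        for token, neighbor in permutations(tokens, 2):
            counts[token][neighbor] += 1
    return counts


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


# Hypothetical command lines: `--silent` and `-s` never appear together,
# but they always appear in the same company.
toy_cmdlines = [
    ["curl.exe", "--silent", "http://a"],
    ["curl.exe", "-s", "http://a"],
    ["curl.exe", "--silent", "http://b"],
    ["curl.exe", "-s", "http://b"],
    ["git.exe", "commit", "-m"],
]

vectors = cooccurrence_counts(toy_cmdlines)
sim_synonyms = cosine_similarity(vectors["--silent"], vectors["-s"])
sim_unrelated = cosine_similarity(vectors["--silent"], vectors["-m"])
print(sim_synonyms, sim_unrelated)
```

In a bag of words, `--silent` and `-s` are orthogonal dimensions. Their co-occurrence vectors, on the other hand, point in the same direction, while `-m` stays unrelated.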

---

**Goals of this notebook**:

1. Crunch a vector representation of tokens that uses their co-occurrences as a notion of similarity.
2. Use these vectors as a _point cloud_ geometric vocabulary on which to express command lines, and derive a vector space embedding of command lines from that -- a technique named **Wasserstein embedding**.
3. Appraise the visible differences between a bag-of-words data map and a Wasserstein data map.

---

**Attention**: you must run [notebook 1](Command%20lines%20-%20Bags%20of%20words.ipynb) before running this one, as it produces the files loaded below.

In [1]:
import datamapplot as dmp
from fast_hdbscan import HDBSCAN
import json
import numpy as np
import os
import pandas as pd
from pathlib import Path
import re
import scipy.sparse
from tqdm.auto import trange
import umap
import vectorizers as vz
import vectorizers.transformers as vzt

Gather our tokenized command lines.

In [2]:
cmdlines_tokenized = pd.read_parquet("cmdlines_tokenized.parquet")["cmdline"]
cmdlines_tokenized

pid_hash
45EBAFA4120EBD83795F91B021C2C987                  ["%systemroot%\system32\csrss.exe"]
117409566B3CCA5974A355559048FA20    ["%systemroot%\system32\csrss.exe", objectdire...
71A014969332AF0C1E0FBFC883BEB8FF    ["%systemroot%\system32\musnotificationux.exe"...
83FB4E68D3AC8DA9B36A19FF158F9410    ["%systemroot%\system32\musnotificationux.exe"...
A5F0F3DB7F64C157C48A7CCBFDCCB9DD    ["%systemroot%\system32\musnotificationux.exe"...
                                                          ...                        
2B69DF1A23610FC8D0E5F50A03883E05    ["usr\bin\bash.exe", --norc, -c, "export path=...
E8574562C96D3A223B0A52389B4EFD3F    ["usr\bin\mintty.exe", --nodaemon, -o, appid=g...
4925BFC824A2B984C2AF2C4EBB7F705F                                 ["vboxheadless.exe"]
73323197ECBC776B25F7554996E55495    ["vswhere.exe", -property, catalog_productsema...
A54992061BAC1C526BEE5923F4D5E506                                       ["wlrmdr.exe"]
Name: cmdline, Length: 30994, dtype: object

We can derive a vector representation of each token from its co-occurrences in command lines by embedding the tokens with `vectorizers.TokenCooccurrenceVectorizer`.

In [3]:
%%time
vz_cooc = vz.TokenCooccurrenceVectorizer(n_threads=os.cpu_count(), n_iter=3)\
    .fit(cmdlines_tokenized.tolist())
cooc_vec = vz_cooc.reduce_dimension(512)
cooc_vec.shape



CPU times: user 2min 24s, sys: 12.9 s, total: 2min 37s
Wall time: 35.1 s


(20834, 512)

Let's take a minute to look at the similarity structure that co-occurrence induces between tokens.

In [4]:
%%time
cooc_dmap = umap.UMAP(metric="cosine", verbose=True).fit_transform(cooc_vec)

UMAP(angular_rp_forest=True, metric='cosine', verbose=True)
Tue Jul  9 11:21:32 2024 Construct fuzzy simplicial set
Tue Jul  9 11:21:32 2024 Finding Nearest Neighbors
Tue Jul  9 11:21:32 2024 Building RP forest with 12 trees
Tue Jul  9 11:21:33 2024 NN descent for 14 iterations
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
	 6  /  14
	Stopping threshold met -- exiting after 6 iterations
Tue Jul  9 11:21:38 2024 Finished Nearest Neighbor Search
Tue Jul  9 11:21:39 2024 Construct embedding


	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Tue Jul  9 11:21:48 2024 Finished embedding
CPU times: user 55.9 s, sys: 3.82 s, total: 59.7 s
Wall time: 16.6 s


We can retrieve the mapping between vectors and tokens through the `vz_cooc.token_index_dictionary_` attribute.

In [5]:
dmp.create_interactive_plot(
    cooc_dmap,
    hover_text=[token for _, token in sorted(vz_cooc.token_index_dictionary_.items())],
    title="Tokens",
    sub_title="from cooccurrence over command lines",
    darkmode=True
)
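
The hover text above depends on sorting that dictionary by index, so that the tokens come out in the same order as the rows of the vector matrix. The sketch below replays this on toy data and adds a nearest-token lookup, which is a handy way to sanity-check the similarity structure without a plot. The dictionary and vectors are hypothetical stand-ins, assuming (as the cell above does) that the dictionary maps row indices to tokens. `nearest_tokens` is an illustrative helper, not a library function.

```python
import math

# Hypothetical stand-ins for `vz_cooc.token_index_dictionary_` and `cooc_vec`.
token_index_dictionary = {2: "-s", 0: "curl.exe", 1: "--silent"}
token_vectors = [
    [1.0, 0.0],  # row 0
    [0.1, 1.0],  # row 1
    [0.0, 1.0],  # row 2
]

# Sorting the (index, token) pairs lines the tokens up with the matrix rows.
tokens_in_row_order = [token for _, token in sorted(token_index_dictionary.items())]


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def nearest_tokens(query, k=1):
    """Return the k tokens whose vectors are closest to the query token's vector."""
    row = tokens_in_row_order.index(query)
    ranked = sorted(
        (i for i in range(len(token_vectors)) if i != row),
        key=lambda i: cosine_distance(token_vectors[row], token_vectors[i]),
    )
    return [tokens_in_row_order[i] for i in ranked[:k]]


print(tokens_in_row_order)
print(nearest_tokens("-s"))
```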

Clearly, we're not wasting our time:
there are many groups of tokens that co-occur only with one another
(to the exclusion of all other tokens).
Leveraging this structure should yield a better data map.

Just as we can represent a command line as a set of weights on a categorical vocabulary,
we can represent it as a set of weights on a set of vectors, a.k.a. a _point cloud_.
In this case, we can understand the representation as a _distribution_ on these vectors.
Fortunately, the structure we computed in notebook 1 to represent the bag of words can be reused _as is_.

In [6]:
cmdlines_iwt = scipy.sparse.load_npz("cmdlines_iwt.npz")
cmdlines_iwt

<30994x20834 sparse matrix of type '<class 'numpy.float64'>'
	with 236529 stored elements in Compressed Sparse Row format>
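
To make the "distribution" reading concrete, here is a small sketch that turns one row of token weights into probability masses sitting on token vectors. All names and values are hypothetical. Whether and how the library normalizes rows internally is not shown here, so treat the explicit normalization as an assumption made for illustration.

```python
def as_distribution(weights):
    """Rescale non-negative token weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("a distribution needs some positive mass")
    return {token: weight / total for token, weight in weights.items()}


# Hypothetical token vectors (the point cloud) and one row of weights.
token_vectors = {
    "curl.exe": (1.0, 0.0),
    "--silent": (0.1, 1.0),
    "http://a": (0.5, 0.5),
}
row_weights = {"curl.exe": 0.5, "--silent": 1.5, "http://a": 2.0}

masses = as_distribution(row_weights)

# The command line is now a list of (point, mass) pairs: a distribution on the cloud.
weighted_cloud = [(token_vectors[token], mass) for token, mass in masses.items()]
print(weighted_cloud)
```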

The notion of similarity that is useful between distributions on point clouds is called the [Wasserstein distance](https://en.wikipedia.org/wiki/Wasserstein_metric).
Computing this distance between many distributions and deriving a vector space embedding for each distribution is called **Wasserstein embedding**,
and it involves solving an optimal transport problem.
We have code for doing that in `vectorizers.WassersteinVectorizer`.
The following cell can take from 20 seconds to many minutes, depending on your computer.
Time to refill your mug?

In [7]:
%%time
cmdlines_wass = vz.WassersteinVectorizer().fit_transform(cmdlines_iwt, vectors=cooc_vec)
cmdlines_wass.shape

CPU times: user 1min 32s, sys: 6.7 s, total: 1min 38s
Wall time: 20.9 s


(30994, 128)
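
For intuition about the distance behind this embedding, here is a brute-force sketch of the Wasserstein distance in a very restricted setting: two point clouds of the same size, each carrying uniform mass. In that case an optimal transport plan can be found among the one-to-one pairings of points, so trying every permutation is enough. This is only an illustration on made-up points. It is not how `WassersteinVectorizer` works, and it scales factorially with the number of points.

```python
from itertools import permutations
import math


def wasserstein_uniform(cloud_a, cloud_b):
    """1-Wasserstein distance between uniform distributions on two equally sized clouds."""
    if len(cloud_a) != len(cloud_b):
        raise ValueError("this sketch only handles clouds of equal size")
    # Cheapest total Euclidean displacement over all one-to-one pairings.
    cheapest_total = min(
        sum(math.dist(a, b) for a, b in zip(cloud_a, pairing))
        for pairing in permutations(cloud_b)
    )
    # Each point carries mass 1/n, so the cost is the mean displacement.
    return cheapest_total / len(cloud_a)


# Hypothetical 2-D point clouds.
cloud_a = [(0.0, 0.0), (1.0, 0.0)]
cloud_b = [(0.0, 1.0), (1.0, 1.0)]  # cloud_a shifted up by 1
cloud_c = [(3.0, 0.0), (4.0, 0.0)]  # cloud_a shifted right by 3

print(wasserstein_uniform(cloud_a, cloud_b))
print(wasserstein_uniform(cloud_a, cloud_c))
```

The distance grows with how far mass has to travel, which is what makes it sensitive to the geometry of the token vectors rather than only to which tokens are shared.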

Now crunch this down into a data map.
The Wasserstein-embedded vectors are best appraised for similarity through cosine distance.
This takes time once again.
Fling some angry birds at green pigs.

In [8]:
%%time
wass_dmap = umap.UMAP(metric="cosine", init="pca", verbose=True).fit_transform(cmdlines_wass)
wass_dmap.shape

UMAP(angular_rp_forest=True, init='pca', metric='cosine', verbose=True)
Tue Jul  9 11:22:09 2024 Construct fuzzy simplicial set
Tue Jul  9 11:22:09 2024 Finding Nearest Neighbors
Tue Jul  9 11:22:09 2024 Building RP forest with 14 trees
Tue Jul  9 11:22:09 2024 NN descent for 15 iterations
	 1  /  15
	 2  /  15
	 3  /  15
	 4  /  15
	 5  /  15
	Stopping threshold met -- exiting after 5 iterations
Tue Jul  9 11:22:10 2024 Finished Nearest Neighbor Search
Tue Jul  9 11:22:10 2024 Construct embedding


	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Tue Jul  9 11:22:19 2024 Finished embedding
CPU times: user 39.4 s, sys: 1.77 s, total: 41.2 s
Wall time: 9.39 s


(30994, 2)

Let's fetch the same annotations we used for the [data map of notebook 1](Command%20lines%20-%20Bags%20of%20words.ipynb#datamap) and take a look.

In [9]:
annotations = pd.read_parquet("cmdlines_labels_hovertext.parquet")
label_color_map = json.loads(Path("color_map.json").read_text(encoding="utf-8"))

In [10]:
plot_interactive = dmp.create_interactive_plot(
    wass_dmap,
    annotations["labels"],
    hover_text=annotations["hover_text"],
    darkmode=True,
    label_color_map=label_color_map,
    title="Process instances",
    sub_title="as distributions on a point cloud of token cooccurrence vectors"
)
plot_interactive

This data map addresses many of the issues with that of notebook 1.
The most immediate evidence is that
most of the `msedge.exe` processes no longer form a world-spanning structure, but are instead lined up in two nearby clusters.
These clusters also accrete `chrome.exe` process instances,
which share many parameters with `msedge.exe` instances
(since both browsers are based on the same rendering engine).
`mergehelper.exe` process instances have also been pulled much closer to each other.

A difficulty that hampers further comparison of the data maps is that UMAP computations have an inherent random component.
Indeed, UMAP relies on a multi-threaded stochastic gradient descent to solve a low-dimensional projection problem,
and a multi-threaded run cannot be made deterministic merely by setting a random seed.

We will revisit the problem of comparing data map making methods in [notebook 5](5%20Comparing%20data%20maps.ipynb).
To that end, let's save the high-dimensional vectors.

In [11]:
np.savez_compressed("cmdlines_wasserstein.npz", cmdlines=cmdlines_wass)