# Embedding process instances as bags of code images

Observing patterns of similarity among complex telemetry datasets involves peering at them through _multiple lenses_.
The work of embedding command lines in [notebooks 1](1%20Command%20lines%20-%20-Bags%20of%20words.ipynb) [and 2](2%20Command%20lines%20-%20Wasserstein%20embedding.ipynb) yielded one such lens.
Here is another one.

Modern software is far from standalone:
most of its functionality is actually leveraged from operating system services and other routines lifted from _shared code libraries_.
On Windows, any file containing code that gets loaded by a process (its own executable, as well as the libraries it relies on) is called a _code image_, or _image file_.
When a program loads a new code image, the auditing system raises an event that gets captured by the host-based sensor,
and thus transferred to telemetry.

When trying to fathom the behaviour of a program,
an analyst will look at the code images it loads:
like Chekhov's gun,
a program has got to leverage certain APIs if it cares to bloat its memory space with all this code.
In many cases, this provides a rough idea of the purpose of the program.
Therefore, we can surmise that programs that depend on similar sets of code images might serve similar purposes.

---

**Goal of this notebook**:
look at a vector space embedding of process instances as sets of code images.

---

In [1]:
import datamapplot as dmp
from fast_hdbscan import HDBSCAN
import glasbey
import numpy as np
import pandas as pd
import scipy.sparse
import shlex
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from tqdm.auto import trange
import umap
import vectorizers as vz
import vectorizers.transformers as vzt

In [notebook 0](0%20Gather%20and%20engineer%20dataset.ipynb#imageloads),
we filtered the image load events to the set of filtered processes.
We will identify each code image by the path to the file that contains it on disk.

In [2]:
image_loads = pd.read_parquet("image_loads.parquet")
image_loads

|  | pid_hash | filename | timestamp |
|---|---|---|---|
| 0 | 3F8784056EB55BB295DCAB8C9344254A | c:\windows\system32\wsqmcons.exe | 2023-11-05 00:00:00.314910-07:00 |
| 1 | 3F8784056EB55BB295DCAB8C9344254A | c:\windows\system32\ntdll.dll | 2023-11-05 00:00:00.314924-07:00 |
| 2 | 3F8784056EB55BB295DCAB8C9344254A | c:\windows\system32\kernel32.dll | 2023-11-05 00:00:00.315711-07:00 |
| 3 | 3F8784056EB55BB295DCAB8C9344254A | c:\windows\system32\kernelbase.dll | 2023-11-05 00:00:00.315864-07:00 |
| 4 | 3F8784056EB55BB295DCAB8C9344254A | c:\windows\system32\msvcrt.dll | 2023-11-05 00:00:00.317407-07:00 |
| ... | ... | ... | ... |
| 1606258 | D2211D916C3414FDFBD2663BB9051C84 | c:\windows\system32\npmproxy.dll | 2023-11-20 15:55:23.410633-08:00 |
| 1606259 | D2211D916C3414FDFBD2663BB9051C84 | c:\windows\system32\wintypes.dll | 2023-11-20 15:55:23.413362-08:00 |
| 1606260 | D2211D916C3414FDFBD2663BB9051C84 | c:\windows\system32\taskschd.dll | 2023-11-20 15:55:23.418445-08:00 |
| 1606261 | D2211D916C3414FDFBD2663BB9051C84 | c:\windows\system32\sspicli.dll | 2023-11-20 15:55:23.418901-08:00 |


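We must now express each process instance as a sequence of code image paths.
Since our information is tabular, with one row per image load event,
we can use `vectorizers.transformers.CategoricalColumnTransformer` for this purpose:
it gathers the values of one column (`filename`) for each distinct value of another (`pid_hash`).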

In [3]:
%%time
images_x_process = vzt.CategoricalColumnTransformer("pid_hash", "filename").fit_transform(image_loads)
images_x_process

CPU times: user 917 ms, sys: 15.9 ms, total: 933 ms
Wall time: 935 ms


pid_hash
0001A7D9420F65F0E8B07E78E1D26B45    [c:\program files (x86)\microsoft\edgeupdate\m...
000442B02967B85605AA0EAEC09B3C65    [c:\windows\system32\svchost.exe, c:\windows\s...
0004B89274C488D14B82AC4B82FCE8C1    [c:\windows\system32\umpdc.dll, c:\windows\sys...
000631F56921E5408C209847F5C08982    [c:\windows\system32\umpdc.dll, c:\windows\sys...
000633B71BF1319C229720F1407B06AB    [c:\windows\system32\windows.networking.connec...
                                                          ...                        
FFF37EA8B2562643513B43C1AFE6631B    [c:\windows\system32\mousocoreworker.exe, c:\w...
FFF4C0540420379CED9EF8FBBF629BBD    [c:\windows\system32\taskhostw.exe, c:\windows...
FFFBEC447AE450B63B77772055BCC046    [c:\program files (x86)\microsoft\edgeupdate\m...
FFFE89C9D856A2CA183895879FD210E0    [c:\windows\system32\svchost.exe, c:\windows\s...
FFFF1E20E3041AA26D33FB1C563C1AEF    [c:\users\user6\desktop\extend_bincfg\.venv\sc...
Length: 40982, dtype: object
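
For intuition, the grouping above resembles what a plain pandas `groupby` would yield. The sketch below runs on a tiny made-up table (the hashes and file names are illustrative, not taken from the dataset); it is an approximation for reading purposes, not the implementation of `CategoricalColumnTransformer`.

```python
import pandas as pd

# Toy stand-in for image_loads: one row per image load event (made-up values).
toy_loads = pd.DataFrame(
    {
        "pid_hash": ["AAA", "AAA", "BBB", "AAA", "BBB"],
        "filename": ["a.exe", "ntdll.dll", "b.exe", "kernel32.dll", "ntdll.dll"],
    }
)

# Collect, for each process instance, the list of images it loaded,
# in the order the events appear in the table.
toy_images_x_process = toy_loads.groupby("pid_hash")["filename"].agg(list)
print(toy_images_x_process)
```

Each entry of the resulting series is a Python list, which is the shape of input the next vectorization step consumes.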

Interestingly, we seem to have many _fewer_ process instances represented in this manner than through command lines.
This is because the Windows auditing systems work on a best-effort policy:
under load, such systems will not report all the events being generated,
for fear of using too much CPU time and storage space for auditing rather than for user computations.
Thus, we only get exposed to a subset of image load events.
We can expect our representation to be noisy because of that:
many of our process instances will be characterized only by a subset of their actual loaded images.
To wit: we have literally _zero_ image loads for nearly 100,000 process instances!
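
One way such a gap could be counted is to check which process identifiers never show up among the image load events. The sketch below uses made-up tables; the names `toy_processes` and `toy_loads` are hypothetical stand-ins, and the column layout is an assumption modelled on the `pid_hash` column seen above.

```python
import pandas as pd

# Made-up stand-ins for the process table and the image load events.
toy_processes = pd.DataFrame({"pid_hash": ["AAA", "BBB", "CCC", "DDD"]})
toy_loads = pd.DataFrame(
    {"pid_hash": ["AAA", "AAA", "CCC"], "filename": ["a.exe", "ntdll.dll", "c.exe"]}
)

# Flag process instances that appear at least once in the image load events.
has_loads = toy_processes["pid_hash"].isin(toy_loads["pid_hash"])

# Count the process instances for which no image load was ever reported.
num_without_loads = int((~has_loads).sum())
print(num_without_loads)
```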

Now that we have each process instance described as a sequence of tokens,
we build the counts matrix, and reweight it on the basis of information contributions.

In [4]:
%%time
vz_tc = vz.NgramVectorizer().fit(images_x_process.tolist())
processes_tc = vz_tc._train_matrix.tocsr()
processes_tc

CPU times: user 5 s, sys: 105 ms, total: 5.1 s
Wall time: 5.13 s


<40982x8298 sparse matrix of type '<class 'numpy.float32'>'
	with 1606263 stored elements in Compressed Sparse Row format>

In [5]:
%%time
processes_iwt = vzt.InformationWeightTransformer().fit_transform(processes_tc)
processes_iwt

CPU times: user 758 ms, sys: 14.4 ms, total: 772 ms
Wall time: 774 ms


<40982x8298 sparse matrix of type '<class 'numpy.float64'>'
	with 1606263 stored elements in Compressed Sparse Row format>
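
To make the two steps concrete, here is a minimal sketch on toy data: it builds a process-by-image count matrix by hand, then reweights columns with an IDF-style weight. The IDF weighting is a simpler stand-in chosen for illustration; it is an assumption of this sketch and not what `InformationWeightTransformer` computes. All image names are made up.

```python
import numpy as np
import scipy.sparse

# Toy bags of code images (made-up names), one list per process instance.
bags = [
    ["ntdll.dll", "kernel32.dll", "a.dll"],
    ["ntdll.dll", "kernel32.dll", "b.dll"],
    ["ntdll.dll", "a.dll", "a.dll"],
]

# Map each distinct image to a column index.
vocab = {tok: j for j, tok in enumerate(sorted({t for bag in bags for t in bag}))}

# Build the sparse count matrix: rows are processes, columns are images.
rows = [i for i, bag in enumerate(bags) for _ in bag]
cols = [vocab[t] for bag in bags for t in bag]
counts = scipy.sparse.coo_matrix(
    (np.ones(len(rows)), (rows, cols)), shape=(len(bags), len(vocab))
).tocsr()  # duplicate (row, column) entries are summed on conversion

# IDF-style column weights: an image loaded by every process gets weight 0.
doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()
idf = np.log(len(bags) / doc_freq)
weighted = counts @ scipy.sparse.diags(idf)
```

The point of any such reweighting is the same: ubiquitous images like `ntdll.dll` say little about what a process does, so they should contribute little to similarity.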

In the present case, we could use the `vectorizers.TokenCooccurrenceVectorizer` to crunch the cooccurrence representation for code images.
However, that would be overkill, as the relative position of code images in the sequence is irrelevant
(unlike tokens in a command line).
Instead, we can simply get the orderless cooccurrence matrix by folding the count matrix onto itself.

In [6]:
%%time
sum_rows_iwt = np.array(np.sum(processes_iwt, axis=0)).squeeze()
images_cooc = scipy.sparse.diags(sum_rows_iwt) @ processes_iwt.T @ processes_iwt
images_cooc

CPU times: user 224 ms, sys: 8.05 ms, total: 232 ms
Wall time: 231 ms


<8298x8298 sparse matrix of type '<class 'numpy.float64'>'
	with 2985318 stored elements in Compressed Sparse Row format>
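
The folding operation can be checked by hand on a toy matrix. The sketch below shows only the `X.T @ X` product; the cell above additionally scales each row of the result by the corresponding image's total weight, a step left out here for readability.

```python
import numpy as np
import scipy.sparse

# Toy process-by-image matrix: 3 processes (rows), 3 images (columns).
X = scipy.sparse.csr_matrix(
    np.array(
        [
            [1.0, 1.0, 0.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 0.0],
        ]
    )
)

# Entry (i, j) accumulates, over all processes, the product of the weights
# of images i and j: a cooccurrence score that ignores loading order.
cooc = (X.T @ X).toarray()
print(cooc)
```

Image 0 is loaded by all three processes, so it cooccurs with everything; images 1 and 2 are never loaded together, so their mutual entry is zero.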

With some 8,000 dimensions, these vectors are large enough to make the Wasserstein embedding computationally heavy.
We can get by perfectly fine with smaller vectors that preserve the _global_ similarity structure.

In [7]:
%%time
images_svd = TruncatedSVD(n_components=512).fit_transform(images_cooc)
images_svd.shape

CPU times: user 33.7 s, sys: 8.03 s, total: 41.7 s
Wall time: 9.33 s


(8298, 512)
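
As a small illustration of why truncation can preserve similarity structure, the sketch below builds a toy matrix of rank 2 and reduces its rows to 2 dimensions with a plain NumPy SVD. The data is random and the rank is chosen for the demonstration; real cooccurrence matrices are only approximately low-rank, so the preservation is approximate there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric "cooccurrence" matrix of rank 2, built from 2 latent factors.
factors = rng.random((6, 2))
toy_cooc = factors @ factors.T  # shape (6, 6)

# Keep the k leading singular directions: rows become k-dimensional vectors.
k = 2
U, S, Vt = np.linalg.svd(toy_cooc)
reduced = U[:, :k] * S[:k]  # same quantity as toy_cooc @ Vt[:k].T

# Because the toy matrix has rank 2, the reduced vectors reproduce all
# pairwise inner products between rows of the original matrix.
preserved = np.allclose(reduced @ reduced.T, toy_cooc @ toy_cooc.T)
print(reduced.shape, preserved)
```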

Wasserstein embedding takes its usual few minutes.
Do crush some candy.

In [8]:
%%time
processes_wass = vz.WassersteinVectorizer().fit_transform(processes_iwt, vectors=images_svd)

CPU times: user 12min 35s, sys: 26.1 s, total: 13min 1s
Wall time: 1min 57s
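
To give a feel for the notion of distance involved, the sketch below computes an exact optimal transport (Wasserstein) cost between two small weighted bags of vectors, using `scipy.optimize.linprog`. The function name `wasserstein_cost` and the toy vectors are made up for this illustration. This is not how `WassersteinVectorizer` is implemented: it produces an embedding rather than solving one transport problem per pair, which would not scale to tens of thousands of process instances.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein_cost(weights_a, vectors_a, weights_b, vectors_b):
    """Exact optimal transport cost between two small weighted point sets."""
    a = np.asarray(weights_a, dtype=float)
    b = np.asarray(weights_b, dtype=float)
    a, b = a / a.sum(), b / b.sum()  # normalize the bags to distributions
    # Ground cost: Euclidean distance between every pair of vectors.
    cost = np.linalg.norm(vectors_a[:, None, :] - vectors_b[None, :, :], axis=-1)
    n, m = cost.shape
    # Transport plan P (n x m, flattened row-major): rows sum to a, columns to b.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    result = linprog(
        cost.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
    )
    return result.fun


# Three toy "code image" vectors and three toy processes weighting them.
image_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
p = [1.0, 0.0, 0.0]  # loads image 0 only
q = [0.0, 1.0, 0.0]  # loads image 1 only
r = [0.5, 0.5, 0.0]  # splits its weight between images 0 and 1

d_pq = wasserstein_cost(p, image_vectors, q, image_vectors)
d_pr = wasserstein_cost(p, image_vectors, r, image_vectors)
print(d_pq, d_pr)
```

Process `r` shares half of its weight with `p`, so it ends up closer to `p` than `q` does: the distance reflects both which images are loaded and how similar those images are to one another.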


Now produce the 2D data map.

In [9]:
%%time
processes_dmap = umap.UMAP(metric="cosine", unique=True, verbose=True).fit_transform(processes_wass)

UMAP(angular_rp_forest=True, metric='cosine', unique=True, verbose=True)
Unique=True -> Number of data points reduced from  40982  to  4635
Most common duplicate is 5  with a count of  2827
Tue Jul  9 15:27:10 2024 Construct fuzzy simplicial set
Tue Jul  9 15:27:10 2024 Finding Nearest Neighbors
Tue Jul  9 15:27:10 2024 Building RP forest with 8 trees
Tue Jul  9 15:27:11 2024 NN descent for 12 iterations
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	Stopping threshold met -- exiting after 4 iterations
Tue Jul  9 15:27:16 2024 Finished Nearest Neighbor Search
Tue Jul  9 15:27:17 2024 Construct embedding


	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Tue Jul  9 15:27:20 2024 Finished embedding
CPU times: user 17.8 s, sys: 1.3 s, total: 19.1 s
Wall time: 10.3 s


Let's identify each point by the command line (hover text).

In [10]:
metadata = images_x_process.rename("images").to_frame().merge(
    pd.read_parquet("process_filtered.parquet"),
    left_index=True,
    right_on="pid_hash",
    how="left"
).set_index("pid_hash")
assert (metadata.index == images_x_process.index).all()
hover_text = metadata["name"] + metadata["cmdline"].map(lambda x: '"'.join(x.split('"')[2:]))
hover_text

pid_hash
0001A7D9420F65F0E8B07E78E1D26B45    microsoftedgeupdate.exe /ua /installsource sch...
000442B02967B85605AA0EAEC09B3C65             svchost.exe -k netsvcs -p -s netsetupsvc
0004B89274C488D14B82AC4B82FCE8C1             svchost.exe -k netsvcs -p -s netsetupsvc
000631F56921E5408C209847F5C08982             svchost.exe -k netsvcs -p -s netsetupsvc
000633B71BF1319C229720F1407B06AB                       mousocoreworker.exe -embedding
                                                          ...                        
FFF37EA8B2562643513B43C1AFE6631B                       mousocoreworker.exe -embedding
FFF4C0540420379CED9EF8FBBF629BBD                                        taskhostw.exe
FFFBEC447AE450B63B77772055BCC046    microsoftedgeupdate.exe /ua /installsource sch...
FFFE89C9D856A2CA183895879FD210E0             svchost.exe -k netsvcs -p -s netsetupsvc
FFFF1E20E3041AA26D33FB1C563C1AEF    python.exe "c:\users\user6\desktop\extend_binc...
Length: 40982, dtype: object
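
The string manipulation in the `hover_text` expression is terse, so here it is in isolation. The sketch assumes that a command line starts with the executable path in double quotes, as the outputs above suggest; the helper name `strip_quoted_executable` and the sample command lines are made up.

```python
def strip_quoted_executable(cmdline: str) -> str:
    """Drop a leading double-quoted executable path, keep the arguments."""
    return '"'.join(cmdline.split('"')[2:])


# Made-up command lines, assuming the executable path comes first, in quotes.
simple = '"c:\\windows\\system32\\svchost.exe" -k netsvcs -p'
nested = '"c:\\python\\python.exe" "c:\\work\\script.py" --flag'

# Prefix the short process name, as done for the hover text.
hover_simple = "svchost.exe" + strip_quoted_executable(simple)
hover_nested = "python.exe" + strip_quoted_executable(nested)
print(hover_simple)
print(hover_nested)
```

Splitting on the double quote and dropping the first two pieces removes the quoted path; joining the rest with double quotes restores any quoting among the arguments. A command line without any quote yields an empty string, leaving only the process name.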

How to annotate the map?
We are characterizing each process instance by the code images it loads,
so a cluster of points is likely composed of process instances that load very similar sets of code images.
For each cluster, let's identify which 3 code images are most _distinguishing_ for the cluster with respect to all the rest.
To do so, we compute the distribution of code image usage over all process instances:
it will stand as our _null distribution_.
Then for the cluster, we calculate the distribution of its **local** code image usage;
we then identify the 3 code images whose local usage departs most from the null distribution,
whether they are used more frequently (marked `>`) or less frequently (marked `<`) than it accounts for.
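
A worked toy example of this comparison, with made-up image names and cluster labels, and with dense arrays instead of the sparse matrices used in this notebook:

```python
import numpy as np

# Toy weighted matrix: 6 process instances (rows) x 4 code images (columns).
X = np.array(
    [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 1.0],
    ]
)
image_names = ["ntdll.dll", "a.dll", "b.dll", "c.dll"]  # made-up names
labels = np.array([0, 0, 0, 1, 1, 1])  # made-up cluster labels


def usage_distribution(M):
    """Column totals normalized to sum to 1."""
    totals = M.sum(axis=0)
    return totals / totals.sum()


null = usage_distribution(X)                 # usage over all processes
local = usage_distribution(X[labels == 0])   # usage within cluster 0
gap = local - null

# Rank images by how far the cluster departs from the null, in either direction.
top = np.argsort(-np.abs(gap))[:2]
summary = [(">" if gap[i] > 0 else "<") + image_names[i] for i in top]
print(summary)
```

Note that `ntdll.dll`, although loaded by every process in the cluster, is not distinguishing: its share of usage inside the cluster matches its share overall.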

In [11]:
clusters = HDBSCAN(min_cluster_size=10).fit_predict(processes_dmap)
num_clusters = int(clusters.max()) + 1  # labels run from 0 to max; -1 marks noise
num_clusters

475

In [12]:
def normalized_distribution(M):
    S = M.mean(axis=0)
    return np.array(S).squeeze() / np.sum(S)


null_distrib = normalized_distribution(processes_iwt)
gt_lt = {True: ">", False: "<"}

K = 3
topics = {-1: "Unlabelled"}
for cluster in trange(num_clusters):
    distrib_cluster = normalized_distribution(processes_iwt[clusters == cluster])
    diff = np.abs(distrib_cluster - null_distrib)
    i_most_important = np.argsort(-diff)[:K]
    topic = "\n".join(gt_lt[distrib_cluster[i] > null_distrib[i]] + vz_tc.column_index_dictionary_[i].split('\\')[-1] for i in i_most_important)
    topics[cluster] = topic + "\u202D" * cluster if topic else "Unlabelled"

The last line of the previous cell is a dirty trick to force each cluster to have its very own label,
even if the label string collides with another cluster's.
We use the left-to-right override Unicode code point (U+202D),
a glyph-less writing direction control,
to make the string strictly unique without changing its appearance in the plot.
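
The trick is easy to verify in isolation. In this small sketch, the labels are made up; only the padding scheme mirrors the cell above.

```python
# Two clusters that happen to receive the same human-readable label.
raw_labels = {0: ">edge.dll", 1: ">edge.dll", 2: ">python3.dll"}

LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE: a code point with no glyph of its own

# Append a cluster-specific number of invisible code points to each label.
unique_labels = {c: label + LRO * c for c, label in raw_labels.items()}

print(len(set(raw_labels.values())), len(set(unique_labels.values())))
```

Since cluster `c` gets exactly `c` copies of the code point, two clusters can never end up with the same padded string, whatever their visible text.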

In [14]:
dmp.create_interactive_plot(
    processes_dmap,
    [topics[c] for c in clusters],
    hover_text=list(hover_text),
    darkmode=False,
    enable_search=True,
    title="Process instances",
    sub_title="as distributions over a cloud of code image cooccurrence vectors",
)

This data map shows a very different type of variability than command line embeddings do.
Whereas the latter showed a large spread of embedding points for browsers,
under this lens,
they now pile up in a tiny cluster
(with respect to the 2D area),
thanks to browser processes always loading the same code images.
In contrast, general purpose programs such as `python.exe` and `mscorsvw.exe` show up all over the plot:
the APIs they source depend on the exact computation they are meant to support.