# Comparing data maps

In [this talk](https://docs.google.com/presentation/d/1wIPd4KIlEngB43lmyM-W2GoChWyHBf-cDlk4sdsYgD4/edit#slide=id.g2e9451582b7_0_918),
I displayed a side-by-side view of the data maps of command line embeddings as a [bag of tokens](1%20Command%20lines%20-%20Bags%20of%20words.ipynb) and as [distributions over a cloud of cooccurrence vectors](2%20Command%20lines%20-%20Wasserstein%20embedding.ipynb).
This is surprisingly tricky to produce,
because UMAP yields an intrinsically random projection:
it relies on multi-threaded stochastic gradient descent.
There are ways to minimize the random differences induced by having to crunch the two vector sets through distinct UMAP computations.

---

**Goal of this notebook**: demonstrate how to produce a pair of visually comparable data maps.

---

In [1]:
import datamapplot as dmp
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import scipy.sparse
import umap
import vectorizers as vz
import vectorizers.transformers as vzt

Get the bag-of-words data map.
It is the reference that the other data map will be aligned to.
Also get the index that discards the [invalid points](1%20Command%20lines%20-%20Bags%20of%20words.ipynb#invalid) from the bag of words: we can only compare actual valid embeddings.
Thus, our side-by-side plot will involve a restriction of the Wasserstein embedding data map to this set of valid points.

In [2]:
with np.load("cmdlines_bagofwords.npz") as store_bagofwords:
    index_valid = store_bagofwords["index_valid"]
    bagofwords_dmap_valid = store_bagofwords["datamap"][index_valid]
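The restriction to valid points only works if the same index is applied to every array that takes part in the comparison. Here is a minimal sketch on toy data; the array names are hypothetical, and it assumes the index holds integer row positions (a boolean mask would be used the same way).

```python
import numpy as np

# Toy stand-ins (hypothetical): 6 points, two of which have no valid embedding.
datamap_all = np.arange(12, dtype=np.float32).reshape(6, 2)
other_embedding_all = np.arange(18, dtype=np.float32).reshape(6, 3)

# Assumption: the index stores integer row positions of the valid points.
index_valid_toy = np.array([0, 2, 3, 5])

# Apply the same index to both arrays.
datamap_valid = datamap_all[index_valid_toy]
other_valid = other_embedding_all[index_valid_toy]

# Row i of both restricted arrays now describes the same original point.
assert datamap_valid.shape[0] == other_valid.shape[0] == len(index_valid_toy)
```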

Fetch the Wasserstein embedding.

In [3]:
with np.load("cmdlines_wasserstein.npz") as store_wasserstein:
    cmdlines_wasserstein = store_wasserstein["cmdlines"]

We now compute the UMAP projection of the Wasserstein embedding, but with a twist:
we initialize the layout with the data map we want to compare against,
in the hope that the only differences that emerge from this starting point stem from differences between the embedding methods.

In [4]:
%%time
wasserstein_dmap = umap.UMAP(
    metric="cosine", init=bagofwords_dmap_valid, verbose=True
).fit_transform(cmdlines_wasserstein[index_valid])

UMAP(angular_rp_forest=True, init=array([[17.662422  , -8.094102  ],
       [17.662432  , -8.094071  ],
       [17.662428  , -8.094079  ],
       ...,
       [-4.3010616 , 13.551044  ],
       [ 3.6941457 , 0.49649546],
       [ 3.0759356 , 0.6824077 ]], dtype=float32), metric='cosine', verbose=True)
Wed Jul 10 07:56:01 2024 Construct fuzzy simplicial set
Wed Jul 10 07:56:01 2024 Finding Nearest Neighbors
Wed Jul 10 07:56:01 2024 Building RP forest with 14 trees
Wed Jul 10 07:56:03 2024 NN descent for 15 iterations
	 1  /  15
	 2  /  15
	 3  /  15
	 4  /  15
	 5  /  15
	Stopping threshold met -- exiting after 5 iterations
Wed Jul 10 07:56:07 2024 Finished Nearest Neighbor Search
Wed Jul 10 07:56:08 2024 Construct embedding



	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Wed Jul 10 07:56:15 2024 Finished embedding
CPU times: user 34 s, sys: 415 ms, total: 34.4 s
Wall time: 13.3 s
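An array passed as `init` is expected to have one row per sample and one column per output dimension, which is why the bag-of-words map and the Wasserstein embedding are both restricted with `index_valid` before the call. The following is a small sanity check one could run beforehand; `check_init_layout` is a hypothetical helper, not part of UMAP.

```python
import numpy as np

def check_init_layout(init, data, n_components=2):
    """Raise ValueError if `init` cannot serve as a starting layout for `data`.

    Hypothetical helper: it only checks shape and finiteness.
    """
    init = np.asarray(init)
    n_samples = data.shape[0]
    if init.shape != (n_samples, n_components):
        raise ValueError(
            f"init has shape {init.shape}, expected {(n_samples, n_components)}"
        )
    if not np.isfinite(init).all():
        raise ValueError("init contains NaN or infinite coordinates")
    return init

# Toy usage: 5 samples in 8 dimensions, with a matching 2-D starting layout.
toy_data = np.zeros((5, 8))
toy_init = np.ones((5, 2), dtype=np.float32)
check_init_layout(toy_init, toy_data)
```

In this notebook, the equivalent call would be `check_init_layout(bagofwords_dmap_valid, cmdlines_wasserstein[index_valid])`.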


Finally, we calculate a [Procrustes alignment](https://en.wikipedia.org/wiki/Procrustes_analysis) between the two maps.

In [5]:
bagofwords_dmap_a, wasserstein_dmap_a = vz.utils.procrustes_align(bagofwords_dmap_valid, wasserstein_dmap, scale_to="first")
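To give an idea of what a Procrustes alignment computes, here is a minimal NumPy sketch that fits a translation, a rotation or reflection, and a uniform scale mapping one point set onto another. It is an illustration under these assumptions, not the implementation behind `vz.utils.procrustes_align`, whose scaling convention (`scale_to="first"`) may differ; `procrustes_sketch` is a hypothetical name.

```python
import numpy as np

def procrustes_sketch(reference, other):
    """Return `other` translated, rotated/reflected and scaled onto `reference`."""
    ref_mean = reference.mean(axis=0)
    ref_c = reference - ref_mean
    oth_c = other - other.mean(axis=0)
    # Orthogonal map R minimizing ||oth_c @ R - ref_c|| (Frobenius norm),
    # obtained from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(oth_c.T @ ref_c)
    rotation = u @ vt
    # Least-squares optimal uniform scale given that rotation.
    scale = s.sum() / (oth_c ** 2).sum()
    return scale * (oth_c @ rotation) + ref_mean

# Toy check: a rotated, scaled and shifted copy is mapped back onto the original.
rng = np.random.default_rng(0)
reference_toy = rng.normal(size=(50, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
other_toy = 3.0 * (reference_toy @ rot) + np.array([5.0, -2.0])
aligned_toy = procrustes_sketch(reference_toy, other_toy)
```

The alignment cannot bend one map into the other: it only removes the differences in position, orientation and scale, which carry no meaning in a UMAP layout.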

Get the plotting metadata generated in notebook 1,
and restrict it to valid points.

In [6]:
metadata_cmdlines_valid = pd.read_parquet("cmdlines_labels_hovertext.parquet").iloc[index_valid]
metadata_cmdlines_valid

Unnamed: 0,labels,hover_text
2,Unlabelled,"""%systemroot%\system32\musnotificationux.exe"" ..."
3,Unlabelled,"""%systemroot%\system32\musnotificationux.exe"" ..."
4,Unlabelled,"""%systemroot%\system32\musnotificationux.exe"" ..."
5,Unlabelled,"""%systemroot%\system32\musnotificationux.exe"" ..."
6,Unlabelled,"""%systemroot%\system32\musnotificationux.exe"" ..."
...,...,...
30986,Unlabelled,"""systemroot\system32\smss.exe"" 00000170 0000008c"
30987,Unlabelled,"""systemroot\system32\smss.exe"" 00000174 0000008c"
30988,Unlabelled,"""systemroot\system32\smss.exe"" 00000180 0000008c"
30989,bash.exe,"(80x) ""usr\bin\bash.exe"" --norc -c ""export pat..."


Also get the color map generated in notebook 1.

In [7]:
label_color_map = json.loads(Path("color_map.json").read_text(encoding="utf-8"))
label_color_map

{'svchost.exe': '#0c71ff',
 'conhost.exe': '#ca2800',
 'taskhostw.exe': '#ff28ba',
 'mscorsvw.exe': '#000096',
 'microsoftedgeupdate.exe': '#86e300',
 'mousocoreworker.exe': '#1c5951',
 'msedge.exe': '#20d2ff',
 'sppsvc.exe': '#20ae86',
 'mergehelper.exe': '#590000',
 'git.exe': '#65008e',
 'backgroundtaskhost.exe': '#b6005d',
 'bash.exe': '#ffaa96',
 'ngen.exe': '#ba10c2',
 'cmd.exe': '#510039',
 'wmiprvse.exe': '#00650c',
 'compattelrunner.exe': '#0096a6',
 'wermgr.exe': '#20aa00',
 'ls.exe': '#ffaeeb',
 'googleupdate.exe': '#ff316d',
 'reg.exe': '#0431ff',
 'rundll32.exe': '#31e7ce',
 'ngentask.exe': '#eb65ff',
 'wsqmcons.exe': '#ff6d2d',
 'powershell.exe': '#8a2071',
 'python.exe': '#24ffa6',
 'Unlabelled': '#dddddd'}

Now, we produce a static figure of each data map,
using matching parameters.
To not spoil the surprise, we delay showing the figures produced by `datamapplot.create_plot`:
we save them to PNGs that can be displayed afterwards.

In [8]:
%%time
fig, _ = dmp.create_plot(
    bagofwords_dmap_a,
    labels=metadata_cmdlines_valid["labels"],
    title="Process instances",
    sub_title="as bags of information-reweighted parsed command line tokens",
    font_family="Roboto",
    figsize=(8, 8),
    label_font_size=9.,
    use_medoids=True,
    darkmode=False,
    label_color_map=label_color_map,
)
fig.savefig("bagofwords.png", bbox_inches="tight")
plt.close(fig)

CPU times: user 28.9 s, sys: 1.3 s, total: 30.2 s
Wall time: 29 s


In [9]:
%%time
fig, _ = dmp.create_plot(
    wasserstein_dmap_a,
    labels=metadata_cmdlines_valid["labels"],
    title="Process instances",
    sub_title="as distributions over a cloud of command line token cooccurrence vectors",
    font_family="Roboto",
    figsize=(8, 8),
    label_font_size=9.,
    use_medoids=True,
    darkmode=False,
    label_color_map=label_color_map,
)
fig.savefig("wasserstein.png", bbox_inches="tight")
plt.close(fig)

CPU times: user 23.1 s, sys: 1.28 s, total: 24.4 s
Wall time: 24.9 s


Now behold! This side-by-side view makes it much easier to appreciate the improvements brought by the Wasserstein embedding over the bag-of-words approach.
In particular, observe how each cluster label sits _much_ closer to the same-color cluster of points:
this is because the labeled points accrete in a smaller region of the map with the Wasserstein embedding,
whereas the bag of words tends to disperse related things all over.

In [10]:
%%html
<style>
    .container {
        display: flex;
    }
    .figure {
        flex: 1;
    }
    .figure > img {
        width: 100%;
    }
</style>
<div class="container">
    <div class="figure"><img src="bagofwords.png"></div>
    <div class="figure"><img src="wasserstein.png"></div>
</div>
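The visual impression that labeled points sit closer together could be backed by a rough number. As a sketch, the hypothetical helper below computes, for each label, the mean distance of its points to their centroid. Since the two maps were Procrustes-aligned, such values should be roughly comparable across them, though distances in a UMAP layout are only loosely meaningful, so this is an indication rather than a measurement.

```python
import numpy as np

def label_dispersion(datamap, labels, skip=("Unlabelled",)):
    """Mean distance to the label centroid, for each label not in `skip`."""
    datamap = np.asarray(datamap, dtype=float)
    labels = np.asarray(labels)
    dispersion = {}
    for label in np.unique(labels):
        if label in skip:
            continue
        points = datamap[labels == label]
        centroid = points.mean(axis=0)
        dispersion[str(label)] = float(
            np.linalg.norm(points - centroid, axis=1).mean()
        )
    return dispersion

# Toy data: label "a" is tighter than label "b"; unlabelled points are ignored.
toy_map = np.array([[0, 0], [2, 0], [0, 0], [0, 4], [9, 9]])
toy_labels = np.array(["a", "a", "b", "b", "Unlabelled"])
toy_dispersion = label_dispersion(toy_map, toy_labels)
```

On the notebook's data, one would compare `label_dispersion(bagofwords_dmap_a, metadata_cmdlines_valid["labels"])` with the same call on `wasserstein_dmap_a`.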