# Gather and engineer the ACME3 dataset

The ACME3 dataset is composed of [host-based](https://en.wikipedia.org/wiki/Host-based_intrusion_detection_system) telemetry gathered from a laboratory experiment.
It was captured over the month of November 2023 by a research team at the Lawrence Livermore National Laboratory.
The data was collected on a small general-purpose Windows network built on AWS for this purpose,
and equipped with a [Microsoft Domain Controller](https://en.wikipedia.org/wiki/Domain_controller_(Windows)) to make it look and behave closer to an enterprise network.
Collaborators of the data collection team were invited to do as much of their regular work as they could on this network,
so as to generate natural user activity
(as opposed to the simulated user activity often used for such experiments).
The team used the open-source [Wintap](https://github.com/LLNL/Wintap) to generate and collect the telemetry.
The researchers also deployed a bespoke [PuTTY SSH client](https://www.putty.org/) modified to behave like [backdoored](https://en.wikipedia.org/wiki/Backdoor_(computing)) software,
cannily named `PuttyX.exe`,
to capture its behaviour as _malicious activity_.

In this series of notebooks, we will embed subsets of this telemetry into vector spaces and draw _data maps_ of these embeddings.
The goal is to build multiple perspectives on the interplay between the processes whose behaviour was captured.
We will not focus on the detection of PuttyX activity.

---

**Goals of this notebook**

1. Download the main ACME3 summary dataset.
2. Engineer the data subset in support of these experiments.

---

In [1]:
import duckdb
import numpy as np
import os
import pandas as pd
import re
import requests as rq
import tarfile
from tqdm.auto import tqdm

The ACME3 dataset has been released as open data, under [this license](https://www.llnl.gov/disclaimer).
One can gather the whole raw data,
but the data curators have put together a _summary dataset_ that assembles all the information we will focus on today.
The whole data archive weighs in at a not unreasonable 14 GB.

In [2]:
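MB = 1 << 20  # one mebibyte, the unit used to size download chunks
# Stream the response so that the archive is never held in memory as a whole.
with rq.get("https://gdo168.llnl.gov/data/ACME-2023/stdview-20231105-20231120.tar", stream=True) as r:
    # The size check and the progress bar both need a numeric Content-Length header.
    assert r.ok and re.match(r"^[0-9]+$", r.headers.get("Content-Length", "n/a"))
    size_acme3 = int(r.headers["Content-Length"])
    # Skip the download when a local copy of the expected size is already present.
    if os.path.isfile("acme3.tar") and os.path.getsize("acme3.tar") == size_acme3:
        print("ACME3 dataset in place")
    else:
        with open("acme3.tar", "wb") as file, tqdm(desc="Download", total=size_acme3, unit_scale=True, unit="") as progress:
            for chunk in r.iter_content(chunk_size=4 * MB):
                file.write(chunk)
                progress.update(len(chunk))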

ACME3 dataset in place
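
Before extracting anything, it can be useful to check which tables an archive actually holds. The following is a minimal sketch using only the standard library. The helper name `list_members` and the in-memory demo archive (with its `demo/...` member names) are hypothetical stand-ins, not part of the ACME3 release.

```python
import io
import tarfile


def list_members(fileobj, suffix=".parquet"):
    """Return the sorted names of regular files in a tar archive that end with `suffix`."""
    with tarfile.open(fileobj=fileobj, mode="r") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(suffix)
        )


# Build a tiny in-memory archive with placeholder content (hypothetical names).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name in ("demo/process.parquet", "demo/README.txt"):
        payload = b"placeholder"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

parquet_members = list_members(buffer)
print(parquet_members)
```

On the real download, one would presumably open `acme3.tar` in binary mode and pass the file object to the same helper.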


Host-based telemetry is actually a collection of many telemetry streams,
each detailing how the processes running on the hosts of the network interact with one kind of available IT resource.
We will focus on two such streams:

1. The main stream of process instances, of which we mean to analyze _command lines_.
2. The stream of _code image load events_, by which processes establish their capabilities by sourcing various subsets of system APIs.

An important aspect of host-based data analysis is that the host-based agent is also a user-mode process.
As such, it interacts with all the processes of a system to achieve its purpose,
and will store the telemetry it generates either in local files or to a central database accessed over the network.
It thus makes sense to try and exclude the telemetry events generated by the agent itself.
For the two telemetry streams we are interested in,
this entails tracking the _child processes_ of the agent processes,
so as to discard their related telemetry events.
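
To illustrate what tracking child processes involves, here is a minimal sketch on a toy process table. The function `descendants_of` and the process identifiers are hypothetical. The notebook's own query relies on the precomputed `process_path` table rather than on such a traversal.

```python
from collections import defaultdict, deque


def descendants_of(roots, parent_of):
    """Return the roots plus every process transitively spawned from them."""
    # Invert the child -> parent mapping into parent -> children.
    children = defaultdict(list)
    for pid, parent in parent_of.items():
        children[parent].append(pid)

    # Breadth-first walk down the process tree from each root.
    seen = set(roots)
    queue = deque(roots)
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical toy process table: identifier -> parent identifier.
parent_of = {
    "services": None,
    "wintap": "services",
    "wintap-helper": "wintap",
    "cmd": "wintap-helper",
    "explorer": "services",
    "notepad": "explorer",
}

agent_related = descendants_of({"wintap"}, parent_of)
kept = sorted(set(parent_of) - agent_related)
print(kept)
```

Everything spawned under the agent, however deep, lands in `agent_related`, so its telemetry events can be discarded by identifier.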

In [3]:
db = duckdb.connect(":memory:")
with tarfile.open("acme3.tar", mode="r") as archive:
    for table in tqdm(["process", "process_image_load", "process_path"]):
        path = f"stdview-20231105-20231120/{table}.parquet"
        # Extract the Parquet file under the working directory; the pass-through filter keeps the member unchanged.
        archive.extract(path, set_attrs=False, filter=lambda x, _: x)
        # Expose the file as a DuckDB view, so that queries read it lazily.
        db.execute(f"create or replace view {table} as select * from read_parquet('{path}')")


We use [DuckDB](https://duckdb.org/) for easy peasy data engineering.

In [4]:
%load_ext magic_duckdb
%dql -co db

I am not going to describe the full schema and semantics of the necessary data tables,
but here are a few notes in service of our purposes:

1. The process instances are summarized in a table named `process`.
    1. Each instance is uniquely identified by the `pid_hash` field.
    2. The original process command lines are not stored: only the command line tails are, under the `args` field. We can reconstitute a normalized approximation by concatenating the `process_path` and `args` fields, separated with a space.
1. The `process_path` table enumerates the paths through the process tree of each host, each running from the node of a process instance up to the root of its tree.
    1. We want to discard events related to the activity of two telemetry-generating agents: Wintap and Amazon's own **SSM** agent.
    2. We can look up any Wintap-related process in string representations of the process paths (as it would carry the `wintap` substring). Same for SSM, identified with the `amazon-ssm` substring.
    3. The paths marking unwanted telemetry tie back to `pid_hash` identifiers.
1. The data capture laboratory was stood up a few weeks before the data collection period started, so as to be set up for the experiment. We discard all the information collected outside of the nominal data capture period.
1. Code image loading events are collected in table `process_image_load`. We restrict to the events not generated by the telemetry agents' own activity.
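
The notes above can be sketched in plain Python on a few toy rows. The rows, the helper names `reconstruct_cmdline` and `keep`, and the use of naive timestamps are all assumptions made for illustration. The actual data carries timezone-aware timestamps and is filtered with SQL.

```python
from datetime import datetime

CAPTURE_START = datetime(2023, 11, 5)  # start of the nominal capture period
AGENT_MARKERS = ("wintap", "amazon-ssm")  # substrings marking agent process trees


def reconstruct_cmdline(process_path, args):
    """Quote the executable path and append the argument tail, if any."""
    return f'"{process_path}" {args or ""}'.strip()


def keep(row):
    """True for rows inside the capture period and outside agent process trees."""
    in_agent_tree = any(marker in row["ptree"] for marker in AGENT_MARKERS)
    return not in_agent_tree and row["process_started"] >= CAPTURE_START


# Hypothetical rows mimicking the join of `process` and `process_path`.
rows = [
    {
        "process_path": r"c:\windows\system32\usoclient.exe",
        "args": "startscan",
        "ptree": "=usoclient.exe->svchost.exe->services.exe",
        "process_started": datetime(2023, 11, 20, 15, 54),
    },
    {   # spawned under the Wintap agent: discarded
        "process_path": r"c:\windows\system32\cmd.exe",
        "args": None,
        "ptree": "=cmd.exe->wintap.exe->services.exe",
        "process_started": datetime(2023, 11, 6, 9, 0),
    },
    {   # started before the capture period: discarded
        "process_path": r"c:\windows\system32\svchost.exe",
        "args": "-k netsvcs",
        "ptree": "=svchost.exe->services.exe",
        "process_started": datetime(2023, 10, 28, 12, 0),
    },
]

cmdlines = [reconstruct_cmdline(r["process_path"], r["args"]) for r in rows if keep(r)]
print(cmdlines)
```

Only the first row survives both filters, and its command line is rebuilt as the quoted path followed by the argument tail.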

<a id="sql"></a>

In [5]:
%%dql -o process_filtered
select
    p.pid_hash,
    p.parent_pid_hash,
    p.process_started as timestamp,
    p.hostname,
    p.process_name as name,
    trim('"' || p.process_path || '" ' || coalesce(p.args, '')) as cmdline,
    pp.ptree
from process as p
inner join process_path as pp using (pid_hash)
where     pp.ptree not like '%wintap%'
      and pp.ptree not like '%amazon-ssm%'
      and p.process_started >= '2023-11-05'::timestamp
      and p.process_started = p.first_seen
order by p.process_started, p.hostname

Unnamed: 0,pid_hash,parent_pid_hash,timestamp,hostname,name,cmdline,ptree
0,3F8784056EB55BB295DCAB8C9344254A,BA7EC59E059F8E7B9F75D3BD0D3E85FE,2023-11-05 00:00:00.314092-07:00,ACME-HH-YIM,wsqmcons.exe,"""c:\windows\system32\wsqmcons.exe""",=wsqmcons.exe->svchost.exe->services.exe->wini...
1,B64CB859912B447E83D8C8E619ABCF93,50F07B711DB04C89FF3922CD15C2647B,2023-11-05 00:00:00.377725-07:00,ACME-HH-UWI,wsqmcons.exe,"""c:\windows\system32\wsqmcons.exe""",=wsqmcons.exe->svchost.exe->services.exe->wini...
2,921D6E35B58C5CA571A55D96C16A5D6A,981581F0C573975A7D576645C5B731F0,2023-11-05 00:00:00.436069-07:00,ACME-HH-HGC,wsqmcons.exe,"""c:\windows\system32\wsqmcons.exe""",=wsqmcons.exe->svchost.exe->services.exe->wini...
3,22E2B0D56B1F636B09CBDC86CCDBBD97,483B551ED4FD3BF742CBBD9B343E8117,2023-11-05 00:00:00.586099-07:00,ACME-WS-AZU,wsqmcons.exe,"""c:\windows\system32\wsqmcons.exe""",=wsqmcons.exe->svchost.exe->services.exe->wini...
4,F1F1B598005B90EB45FBF77E74DDFCB8,803022DFF3341FF443B9D49E08D82558,2023-11-05 00:00:00.731733-07:00,ACME-HH-AKA,wsqmcons.exe,"""c:\windows\system32\wsqmcons.exe""",=wsqmcons.exe->svchost.exe->services.exe->wini...
...,...,...,...,...,...,...,...
136701,11D2905584ADEBDB004263B08D3706A1,5B1453B37F079EC6AA910E820C5B6BF1,2023-11-20 15:54:14.616846-08:00,ACME-HH-IKA,usoclient.exe,"""c:\windows\system32\usoclient.exe"" startscan",=usoclient.exe->svchost.exe->services.exe->win...
136702,D8DC51881EDB5C6F7D6239E3DE1A5D93,E283B625D417A1C9584B2D6C9D0C364F,2023-11-20 15:54:14.638486-08:00,ACME-HH-IKA,mousocoreworker.exe,"""c:\windows\system32\mousocoreworker.exe"" -emb...",=mousocoreworker.exe->svchost.exe->services.ex...
136703,521490D4B45E5BDC4BD2CA7081BC06A7,6A37E544F4D9E91016071993B3DC4103,2023-11-20 15:55:23.333011-08:00,ACME-HH-AZH,usoclient.exe,"""c:\windows\system32\usoclient.exe"" startscan",=usoclient.exe->svchost.exe->services.exe->win...
136704,D2211D916C3414FDFBD2663BB9051C84,2C2DB3363D7539099C7869C6BFEEED6B,2023-11-20 15:55:23.349693-08:00,ACME-HH-AZH,mousocoreworker.exe,"""c:\windows\system32\mousocoreworker.exe"" -emb...",=mousocoreworker.exe->svchost.exe->services.ex...


<a id="imageloads"></a>

In [6]:
%%dql -o image_loads
select
    pid_hash,
    filename,
    first_seen as timestamp
from process_image_load
inner join process_filtered using (pid_hash)
order by timestamp, pid_hash

Unnamed: 0,pid_hash,filename,timestamp
0,3F8784056EB55BB295DCAB8C9344254A,c:\windows\system32\wsqmcons.exe,2023-11-05 00:00:00.314910-07:00
1,3F8784056EB55BB295DCAB8C9344254A,c:\windows\system32\ntdll.dll,2023-11-05 00:00:00.314924-07:00
2,3F8784056EB55BB295DCAB8C9344254A,c:\windows\system32\kernel32.dll,2023-11-05 00:00:00.315711-07:00
3,3F8784056EB55BB295DCAB8C9344254A,c:\windows\system32\kernelbase.dll,2023-11-05 00:00:00.315864-07:00
4,3F8784056EB55BB295DCAB8C9344254A,c:\windows\system32\msvcrt.dll,2023-11-05 00:00:00.317407-07:00
...,...,...,...
1606258,D2211D916C3414FDFBD2663BB9051C84,c:\windows\system32\npmproxy.dll,2023-11-20 15:55:23.410633-08:00
1606259,D2211D916C3414FDFBD2663BB9051C84,c:\windows\system32\wintypes.dll,2023-11-20 15:55:23.413362-08:00
1606260,D2211D916C3414FDFBD2663BB9051C84,c:\windows\system32\taskschd.dll,2023-11-20 15:55:23.418445-08:00
1606261,D2211D916C3414FDFBD2663BB9051C84,c:\windows\system32\sspicli.dll,2023-11-20 15:55:23.418901-08:00


We save the tables we just gathered as local Parquet files.

In [7]:
process_filtered.to_parquet('process_filtered.parquet', compression="zstd")
image_loads.to_parquet('image_loads.parquet', compression="zstd")

Q: How much storage space does the data we require take up?

In [8]:
pd.DataFrame(
    [
        (name, os.path.getsize(name))
        for name in os.listdir(".")
        if name in {"image_loads.parquet", "process_filtered.parquet"}
    ],
    columns=["name", "size"]
)

Unnamed: 0,name,size
0,image_loads.parquet,6169628
1,process_filtered.parquet,5315641


A: Not much at all: roughly 11 MB in total.
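
Raw byte counts are hard to read at a glance. As a small convenience, the sketch below formats them with binary units, in line with the `MB = 1 << 20` convention of the download cell. The helper `human_size` is hypothetical, and the sizes are copied from the output above.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that keeps the value under 1024, or at the largest unit.
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


sizes = {"image_loads.parquet": 6169628, "process_filtered.parquet": 5315641}
for name, size in sizes.items():
    print(name, human_size(size))
```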

Unless you care about exploring the dataset at greater length,
you can now remove the Tar file we gathered and the Parquet files we extracted from it.
To do so, you may change the next cell to a code cell and run it.