# Process metadata and lineage anomalies

The ACME4 dataset attributes each captured event to a _process instance_,
uniquely identified by a **PID hash** (`pid_hash`).
A comprehensive analysis of process instances would therefore be expected to cover all of these identifiers.
However, a cursory look reveals that some of these PID hashes are _duds_:
fundamental metadata about the corresponding processes is missing.
This notebook labels these process instances so that they can be excluded from further analysis.

In [1]:
%load_ext autoreload
%load_ext dotenv
#%load_ext quak
%load_ext sql

In [2]:
%autoreload 1
%aimport acme4_explore

In [3]:
%dotenv

In [4]:
import acme4_explore
import logging as lg
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
import re
from tqdm.auto import tqdm, trange

In [5]:
lg.basicConfig(**acme4_explore.logging_config())
LOG = lg.getLogger("notebook")

In [6]:
db = acme4_explore.connect_db()
%sql db --alias duckdb
%config SqlMagic.displaycon=False
%config SqlMagic.autopandas=True

_Duds_ are PID hashes for which no parent PID hash is recorded.

In [7]:
%%sql duds <<
select pid_hash, process_name, args
from process
where parent_pid_hash is null

In [8]:
duds

Unnamed: 0,pid_hash,process_name,args
0,1F2CD6387409AC52197E229DF5E01480,,
1,5E5F77D9A0BB72CA82F910F94CE9083A,,
2,E86668F7CF2E2895FA9D785777F93FD6,,
3,72BEE33B015916E6F555DB4B4908070B,,
4,5717F94252DDD6E0D5D4984FEB6B9171,,
...,...,...,...
12121,F237FB4EFBFCC26E5BABA610EE4AAD49,,
12122,504B38D65A18EA48476EFAEFB83EEED7,,
12123,1694025B7EF0C319334CDDE0A8D71ABD,wmic.exe,
12124,4E35E4E205286EF99147B17F627E2989,conhost.exe,


We can observe that, while some of these dud processes have a name, none have any non-trivial command line.

In [9]:
duds["process_name"].value_counts(dropna=False)

process_name
None                7022
conhost.exe         3808
wmic.exe             902
svchost.exe           55
taskhostw.exe         42
                    ... 
inno_updater.exe       1
atbroker.exe           1
disksnapshot.exe       1
wlrmdr.exe             1
dwm.exe                1
Name: count, Length: 71, dtype: int64

In [10]:
duds["args"].value_counts(dropna=False)

args
None    12126
Name: count, dtype: int64

Are these duds covered by the `process_path` table,
which stores _lineage_ data
(the path from a process node to the root of a host's process tree)?

In [11]:
%%sql
select count(process_path.pid_hash) as num
from process_path
inner join duds using (pid_hash)

Unnamed: 0,num
0,12070


Answer: not all of them.
Most of them (12070 out of 12126) do have a record in `process_path`, however.
What does the parent list of a process without a parent PID hash look like?

In [12]:
%%sql duds_lineage <<
select duds.pid_hash, ptree_list
from duds
inner join process_path using (pid_hash)

In [13]:
duds_lineage

Unnamed: 0,pid_hash,ptree_list
0,ABAC7CCEABC7E0663758005EDD2CBDA9,[ABAC7CCEABC7E0663758005EDD2CBDA9]
1,ADB470E1F049A92DC034DF38D8BD2465,[ADB470E1F049A92DC034DF38D8BD2465]
2,BA7687D6B369D679639564DDF4F4CBCE,[BA7687D6B369D679639564DDF4F4CBCE]
3,BAF2A75920C33AE1217BE8CBAF1AF1B3,[BAF2A75920C33AE1217BE8CBAF1AF1B3]
4,D1FB4A8D17E2460CDF30E0FF560EB7DF,[D1FB4A8D17E2460CDF30E0FF560EB7DF]
...,...,...
12065,84A67F281CE5AB4B6902DAC1BE1712AD,[84A67F281CE5AB4B6902DAC1BE1712AD]
12066,861E656C2A983DCF97C67E14968CDEC3,[861E656C2A983DCF97C67E14968CDEC3]
12067,8806F08D16A4013CD910B8040C322AE3,[8806F08D16A4013CD910B8040C322AE3]
12068,8934B041C08F6646F855C3674A2225AD,[8934B041C08F6646F855C3674A2225AD]


The `ptree_list` column of `process_path` stores the lineage in process-to-root order,
so for a process without a parent
it holds only the process's own PID hash.

In [14]:
duds_lineage["ptree_list"].map(len).value_counts()

ptree_list
1    12070
Name: count, dtype: int64
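
To make the process-to-root convention concrete, here is a minimal sketch that rebuilds such a list from a child-to-parent mapping. The `lineage` helper and the toy PID names are hypothetical, and this is only an assumption about how a column like `ptree_list` could be derived, not a description of how Wintap actually computes it.

```python
def lineage(pid_hash, parent_of):
    """Return the path from `pid_hash` up to the root, process first."""
    path = [pid_hash]
    seen = {pid_hash}  # guards against cycles in malformed data
    parent = parent_of.get(pid_hash)
    while parent is not None and parent not in seen:
        path.append(parent)
        seen.add(parent)
        parent = parent_of.get(parent)
    return path


# Hypothetical toy process tree: child -> parent (None means no recorded parent).
parent_of = {"INIT": None, "SHELL": "INIT", "TOOL": "SHELL", "DUD": None}

lineages = {pid: lineage(pid, parent_of) for pid in parent_of}
for pid, path in lineages.items():
    print(pid, path, len(path))
```

In this toy tree, both the root (`INIT`) and the parentless `DUD` end up with a lineage of length 1, which is why the length alone cannot tell a legitimate root from a process whose parent was never recorded.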

I would expect few other processes to have a lineage of length 1;
the only ones I can think of are those at the root of the process tree,
and the only such root that comes to mind is the host's initial user mode process.
Let's check.

In [15]:
%%sql no_lineage <<
select pid_hash, process_name, length_lineage
from (
    select pid_hash, process_name, length(ptree_list) as length_lineage
    from process_path
)
where length_lineage < 2

In [16]:
no_lineage

Unnamed: 0,pid_hash,process_name,length_lineage
0,00076962A860658B5C74574915AAD94C,conhost.exe,1
1,000C6DD334EAC0C94D4AB0147CB66823,conhost.exe,1
2,001999BCC533CC1B0FE072CA9BB692B9,conhost.exe,1
3,001FFD625FDE57A058ABD9EA10E208C5,conhost.exe,1
4,00398AF501769695839D8C0FDA9FFB36,conhost.exe,1
...,...,...,...
1123521,FFBFEB2EDDB89FDFADCF99BEB9F8D574,conhost.exe,1
1123522,FFC2AFA0D7A7C9AC523D1CF26B404401,conhost.exe,1
1123523,FFC2D3BA7F974A4293429C3291FAAA1C,conhost.exe,1
1123524,FFE06FDB5FEE7A569F9D787A4621B0D3,conhost.exe,1


This is surprising: most processes have a lineage composed only of themselves!

In [17]:
no_lineage.length_lineage.value_counts()

length_lineage
1    1123526
Name: count, dtype: int64

What portion of the whole set of processes is that?!

In [18]:
num_total, = db.sql("select count(pid_hash) from process").fetchone()
no_lineage.shape[0] / num_total

0.6335416895745755

63% of the processes in the dataset have no parent process information!
Surely these are not all init processes.

In [19]:
no_lineage["process_name"].value_counts(dropna=False)

process_name
conhost.exe                        1114979
None                                  7022
wmic.exe                               910
svchost.exe                             57
ntoskrnl.exe                            53
                                    ...   
msiexec.exe                              1
securityhealthsystray.exe                1
wsmprovhost.exe                          1
apphostregistrationverifier.exe          1
am_delta.exe                             1
Name: count, Length: 90, dtype: int64

More than 99% of these no-lineage processes are `conhost.exe` processes:
console hosts,
processes that provide basic terminal services to command-line software.
The initial process, `ntoskrnl.exe`,
has a count (53) that makes sense given the number of hosts and the short duration of the data acquisitions.
The rest presumably stem from sensor or data aggregation issues.
The high prevalence of parentless `conhost.exe` processes might also have to do with sensor or hypervisor behaviour,
such as processes being injected into the hosts' userspace without going through a notional user-mode service or common kernel system call.
Further investigation by the Wintap development team is certainly warranted.
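
As a quick check on the breakdown of no-lineage processes, the shares can be recomputed from the counts printed above. This is a small arithmetic sketch; the variable names are mine, and only the numbers come from the notebook outputs.

```python
# Counts copied from the `value_counts` outputs above.
no_lineage_total = 1_123_526
conhost = 1_114_979
unnamed = 7_022   # process_name is None
ntoskrnl = 53

# Share of no-lineage processes that are console hosts.
share_conhost = conhost / no_lineage_total

# Named processes that are neither conhost.exe nor ntoskrnl.exe.
remainder = no_lineage_total - conhost - unnamed - ntoskrnl

print(f"conhost.exe share: {share_conhost:.2%}")
print(f"other named processes: {remainder}")
```

Once `conhost.exe`, the unnamed processes and `ntoskrnl.exe` are set aside, fewer than 1500 named processes remain, spread over the other 87 names in the listing.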

In any case,
we have identified our duds,
and we have also flagged a large number of no-lineage processes.
Let's save the non-dud processes and the processes with a proper lineage to Parquet files for downstream analysis.

In [20]:
%%sql
copy (
    select pid_hash, process_name, args, filename, hostname, exit_code
    from process
    anti join duds using (pid_hash)
)
to '{{acme4_explore.dir_work()}}/process_nondud.parquet'
(format parquet, compression 'zstd')

Unnamed: 0,Count
0,1761279
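
For readers more at home in pandas than in SQL, the anti join used above keeps the rows of `process` whose `pid_hash` does not appear in `duds`. A minimal sketch of the same idea on toy data follows; the frames below are made-up illustrations, not rows from ACME4.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the `process` table and the `duds` frame.
process = pd.DataFrame(
    {
        "pid_hash": ["A", "B", "C", "D"],
        "process_name": ["conhost.exe", None, "wmic.exe", "svchost.exe"],
    }
)
duds = pd.DataFrame({"pid_hash": ["B", "C"]})

# Left-merge with an indicator column, then keep rows found only on the left:
# this is the pandas equivalent of an anti join on `pid_hash`.
merged = process.merge(
    duds[["pid_hash"]].drop_duplicates(),
    on="pid_hash",
    how="left",
    indicator=True,
)
nonduds = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(nonduds)
```

Doing the filtering in DuckDB, as the notebook does, avoids pulling the full `process` table into memory; the pandas version is only meant to spell out the semantics.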


<a id="proper-lineage"></a>The processes with a proper lineage:

In [21]:
%%sql
copy (
    select P.pid_hash, P.process_name, args, filename, P.hostname, exit_code, ptree, ptree_list, ptree_list_tuples
    from process as P
    inner join process_path using (pid_hash)
    where length(ptree_list) > 1
)
to '{{acme4_explore.dir_work()}}/process_with_lineage.parquet'
(format parquet, compression 'zstd')

Unnamed: 0,Count
0,649421
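
The row counts of the two exports can be cross-checked against the figures obtained earlier. The sketch below recovers the total number of processes from the ratio printed by cell 18, and assumes that each `pid_hash` appears at most once in both `process` and `process_path`; that uniqueness is my assumption, not something the notebook verifies.

```python
# Figures copied from the notebook outputs above.
num_no_lineage = 1_123_526
ratio_no_lineage = 0.6335416895745755
num_duds = 12_126
num_nondud_saved = 1_761_279
num_with_lineage_saved = 649_421

# Total number of rows in `process`, recovered from the printed ratio.
num_total = round(num_no_lineage / ratio_no_lineage)

# The anti join should drop exactly the duds.
nondud_expected = num_total - num_duds

# Processes that are neither exported with a lineage nor counted as no-lineage.
# Under the uniqueness assumption, these have no row in `process_path` at all.
unaccounted = num_total - num_no_lineage - num_with_lineage_saved

print(num_total, nondud_expected == num_nondud_saved, unaccounted)
```

The non-dud count matches exactly, while a few hundred processes fall outside both lineage groups; the duds lacking a `process_path` record (12126 minus 12070) would be among them.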
