# Simple Process Tree Builder

This notebook opens one or more Wintap process Parquet files with DuckDB and builds a NetworkX process tree graph from them.

Workflow is:

1. Map process parquet into a duckdb view
2. Create an iterator over all rows in the view using the `wg.add_all()` function.
    Note: the current example function is deliberately simple, but it could be extended with more complex filtering, joining, etc.
3. Pass the iterator to `wg.build_process_tree_graph()`, which adds nodes with properties and parent->child relationships
4. Display a few simple metrics about the graph
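
As a rough illustration of steps 2 and 3, the sketch below builds an undirected NetworkX graph from an iterator of process rows. It is not the `wintapgraph` implementation: the `build_tree_graph` helper, the sample rows, and the `parent_pid_hash` and `process_name` fields are assumptions made for the example.

```python
import networkx as nx

# Hypothetical sample rows: (pid_hash, parent_pid_hash, process_name).
# The real column names in process_summary.parquet may differ.
rows = [
    ("A", None, "system"),
    ("B", "A", "services.exe"),
    ("C", "B", "svchost.exe"),
    ("D", "Z", "orphan.exe"),  # parent "Z" has no row of its own
]


def build_tree_graph(row_iter):
    """Add one node per process, then one edge per known parent->child pair."""
    materialized = list(row_iter)  # the rows are walked twice, so keep a copy
    g = nx.Graph()
    for pid, _parent, name in materialized:
        g.add_node(pid, process_name=name)
    for pid, parent, _name in materialized:
        # Skip edges whose parent was never observed as a process row.
        if parent is not None and parent in g:
            g.add_edge(parent, pid)
    return g


g = build_tree_graph(iter(rows))
print(g.number_of_nodes(), g.number_of_edges())
```

Processes whose parent is missing from the data stay as separate components in this sketch, which is one plausible reason a real data set yields many small components.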

## Setup
Download data from https://gdo-wintap.llnl.gov

A good starting set is: https://gdo168.llnl.gov/data/ACME-2023/stdview-20231109-20231111/process_summary.parquet

If you'd like more data, look into the longer date ranges.

Modify the path in the "create view" statement to point to where you have downloaded the process_summary.parquet file. 

In [2]:
# Import packages used in notebooks
import duckdb
import networkx as nx
import wintapgraph as wg
%load_ext magic_duckdb

In [3]:
# Initialize an in-memory database and keep a reference to the connection, then pass that connection to magic-duckdb.
# This lets Python code and the %dql/%%dql magics share the same DB instance.
con = duckdb.connect()
%dql -co con
# This notebook only uses the process table
%dql create view process as from '~/data/wintapv6/ACME-Redo/stdview-20231109-20231111/process_summary.parquet'
# Display a simple summary of the process view
%dql summarize process


Unnamed: 0,column_name,column_type,min,max,approx_unique,avg,std,q25,q50,q75,count,null_percentage
0,pid_hash,VARCHAR,00000AD4E601CD6811A249DE5566D701,FFFFE16FB71AB7F667A3B1FDAF417F00,998853,,,,,,1019637,0.0
1,os_family,VARCHAR,windows,windows,1,,,,,,1019637,0.0
2,agent_id,VARCHAR,,,0,,,,,,1019637,100.0
3,num_agent_id,BIGINT,0,0,1,0.0,0.0,0,0,0,1019637,0.0
4,hostname,VARCHAR,ACME-DC1,ACME-WS-UVF,26,,,,,,1019637,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
102,dll_first_seen,TIMESTAMP WITH TIME ZONE,2023-11-08 16:54:43.772235-07,2023-11-11 16:56:54.604024-07,314066,,,,,,1019637,69.9
103,dll_last_seen,TIMESTAMP WITH TIME ZONE,2023-11-08 16:55:12.90114-07,2023-11-11 16:59:08.241085-07,307771,,,,,,1019637,69.9
104,os,VARCHAR,Windows Server 2022 Datacenter,Windows Server 2022 Datacenter,1,,,,,,1019637,0.0
105,os_version,VARCHAR,Microsoft Windows NT 6.2.9200.0,Microsoft Windows NT 6.2.9200.0,1,,,,,,1019637,0.0


In [4]:
processes = wg.add_all(con)
netg = wg.build_process_tree_graph(con, processes)
netg

Adding 1019637 process nodes


<networkx.classes.graph.Graph at 0x114fe54d0>

In [5]:
[len(c) for c in sorted(nx.connected_components(netg), key=len, reverse=True)]

[107035,
 101736,
 95046,
 35069,
 32875,
 32850,
 32653,
 32260,
 32137,
 31337,
 31163,
 30768,
 30756,
 30734,
 30609,
 30412,
 30400,
 30149,
 29231,
 29073,
 29002,
 28941,
 26122,
 22291,
 21898,
 21225,
 8516,
 4217,
 351,
 142,
 135,
 120,
 92,
 85,
 69,
 21,
 13,
 9,
 7,
 6,
 5,
 5,
 4,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,

In [6]:
nx.number_connected_components(netg)

19961
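
With this many components, a histogram of component sizes can be easier to read than the raw list. The sketch below shows one way to compute it on a small hand-made graph; the node names are placeholders, and the same two lines should apply to `netg` if it is a regular undirected NetworkX graph.

```python
from collections import Counter

import networkx as nx

# Small stand-in graph: one chain of three, two pairs, and one isolated node.
g = nx.Graph()
g.add_edges_from([("A", "B"), ("B", "C"), ("D", "E"), ("F", "G")])
g.add_node("H")  # isolated process with no known parent or child

sizes = [len(c) for c in nx.connected_components(g)]
size_histogram = Counter(sizes)  # component size -> number of components
largest = max(sizes)

for size, count in sorted(size_histogram.items(), reverse=True):
    print(f"{count} component(s) of size {size}")
```

The component sizes always sum to the number of nodes, which gives a simple consistency check against the row count of the `process` view.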

In [7]:
%dql select count(*) from process

Unnamed: 0,count_star()
0,1019637
