# Helpers for using parquet via DuckDB
* Initialize an in-memory database with views for all event types at an aggregation level.
    * Views are essentially pointers to the parquet files, so creating them loads no data into memory.
* Generate an event timeline chart (using SQL+Pandas)
* Bonus: highlight events of interest (puttyx, notepad++, etc.)

In [None]:
# Define imports and helper functions.
# dataset_chooser() reads a .env file in the top level of this project. The file must define DATAPATH as the top-level directory that contains your datasets.
# You can optionally define DEFAULT_PATH to point to a specific dataset, so you don't have to select the dataset again each time the notebook restarts.
# See .env-default for an example.
# If there is no .env file or the paths are invalid, dataset_chooser() defaults to the user's home directory.

# To enable logging output in Jupyter, uncomment the following 3 lines:
#import logging
#logger = logging.getLogger()
#logger.setLevel(logging.DEBUG)
%run notebookutil.py

w_datasets=dataset_chooser()
display(w_datasets)
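
The lookup order described in the comments above can be sketched in plain Python. This is not the code in notebookutil.py: `parse_env` and `resolve_start_dir` are hypothetical helpers, and the preference order (DEFAULT_PATH first, then DATAPATH, then the home directory) is an assumption based on the comments.

```python
from pathlib import Path
import tempfile


def parse_env(text):
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_start_dir(env_text, home):
    """Pick a starting directory (hypothetical helper, not the notebook's own code).

    Assumed preference order: DEFAULT_PATH, then DATAPATH, then the home directory.
    A configured path is used only if it exists and is a directory.
    """
    env = parse_env(env_text)
    for key in ("DEFAULT_PATH", "DATAPATH"):
        candidate = env.get(key)
        if candidate and Path(candidate).is_dir():
            return Path(candidate)
    return Path(home)


# Usage with a throwaway directory standing in for a real data root.
demo_root = tempfile.mkdtemp()
start = resolve_start_dir(f"DATAPATH={demo_root}\n", home=Path.home())
fallback = resolve_start_dir("DATAPATH=/no/such/dir\n", home=Path.home())
```

Checking that the directory exists before using it is what makes an invalid path fall through to the home directory.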

In [None]:
# Initialize an in-memory DB and keep a reference to it in a variable, then point magic-duckdb at that connection.
# This lets Python code and the %dql/%%dql magics share the same DB instance.
con = ru.init_db(w_datasets.selected)  # optional: ru.init_db(w_datasets.selected, agg_level='rolling')
%dql -co con
# Create views for every top-level type found in the current dataset.
svd.init_db(con)
# Display the list of tables/views
%dql show tables

## Summarize event data and chart it to show how events are distributed over time

In [None]:
# Tabular summary
display(svd.table_summary(con,w_datasets.selected))

In [None]:
# Events over time.
# The tb() macro groups events into time buckets. Choose the bucket size based on the dataset duration for the best balance of resolution and performance (1 day here).
%dql create or replace macro tb(wts) as time_bucket(interval '1 day', to_timestamp_micros(win32_to_epoch(wts)))
eventdf=svd.fetch_summary_data(con)
svd.display_event_chart(eventdf)
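
The arithmetic behind the `tb()` macro can be sketched in plain Python. This assumes `wts` is a Windows FILETIME value (100-nanosecond ticks since 1601-01-01 UTC) and that `win32_to_epoch` returns microseconds since the Unix epoch. Neither is confirmed by this notebook, and the helper names below are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: wts is a Windows FILETIME (100 ns ticks since 1601-01-01 UTC).
# 11,644,473,600 seconds separate 1601-01-01 from 1970-01-01.
EPOCH_DIFF_MICROS = 11_644_473_600 * 1_000_000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def win32_to_epoch_micros(wts):
    """Convert FILETIME ticks to microseconds since the Unix epoch."""
    return wts // 10 - EPOCH_DIFF_MICROS


def day_bucket(wts):
    """Truncate a FILETIME value to the start of its UTC day."""
    ts = UNIX_EPOCH + timedelta(microseconds=win32_to_epoch_micros(wts))
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


# FILETIME value of the Unix epoch, plus five hours (in 100 ns ticks).
epoch_filetime = 116_444_736_000_000_000
five_hours_later = epoch_filetime + 5 * 3600 * 10_000_000
bucket = day_bucket(five_hours_later)
```

Events that share a bucket value fall on the same day of the chart; a wider interval means fewer, coarser buckets.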

# Complex example: window functions (lag/lead) over connection sizes

In [None]:
%%dql
-- For each connection, show total_size next to the two preceding and two following values (ordered by first_seen); -1 marks a missing neighbor.
SELECT conn_id,
    lag(total_size, 2, '-1') OVER win AS "total_size-0",
    lag(total_size, 1, '-1') OVER win AS "total_size-1",
    total_size AS "total_size-2",
    lead(total_size, 1, '-1') OVER win AS "total_size-3",
    lead(total_size, 2, '-1') OVER win AS "total_size-4"
FROM process_conn_incr
WINDOW win AS (PARTITION BY conn_id ORDER BY first_seen)
ORDER BY conn_id, first_seen
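
To see what `lag()` and `lead()` return over a named window without loading the dataset, here is a small sketch. It uses Python's built-in `sqlite3` instead of DuckDB so that it runs anywhere; it needs SQLite 3.25 or later for window functions. The rows are made-up toy values, not data from `process_conn_incr`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE process_conn_incr (conn_id TEXT, first_seen INTEGER, total_size INTEGER)"
)
# Toy rows: connection A has three increments, connection B has one.
db.executemany(
    "INSERT INTO process_conn_incr VALUES (?, ?, ?)",
    [("A", 1, 10), ("A", 2, 20), ("A", 3, 30), ("B", 1, 5)],
)

rows = db.execute(
    """
    SELECT conn_id,
           lag(total_size, 1, -1) OVER win AS prev_size,
           total_size,
           lead(total_size, 1, -1) OVER win AS next_size
    FROM process_conn_incr
    WINDOW win AS (PARTITION BY conn_id ORDER BY first_seen)
    ORDER BY conn_id, first_seen
    """
).fetchall()

for row in rows:
    print(row)
```

The default value -1 appears wherever a row has no neighbor inside its own partition, so values never leak from one `conn_id` to another.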

In [None]:
%dql select conn_id, first_seen, total_size from process_conn_incr where conn_id='FFFFCC7D22BD0496D4404602FBD90636' order by first_seen