# Hatchet Query Language


This notebook explores [Hatchet](https://github.com/LLNL/hatchet) queries from its [**Object-based Dialect**](https://hatchet.readthedocs.io/en/latest/query_lang.html) within **Category 1: Quantifier Capabilities**. Below, Fig. 1 provides an overview of all the categories within the Hatchet Query Language.

|Category ID|Category Description|
|:---------:|:-------------------|
|1          |Quantifier Capabilities|
|2          |String Equivalence and Regex Matching|
|3          |String Containment (contains, starts with, ends with)|
|4          |Basic Numeric Comparison (==, >, >=, etc.)|
|5          |Comparison with Special Values (NaN, Inf, None)|

**Figure 1**: A table of the Hatchet Query Language capabilities, divided into categories with their corresponding category IDs.

The notebook covers different query types that can match varying numbers of nodes from [GraphFrame objects](https://hatchet.readthedocs.io/en/latest/user_guide.html). Hatchet offers multiple interfaces for defining queries, with different trade-offs between verbosity and expressiveness. The entire capability table with information about all the query use cases, categories and capabilities can be found [here](https://docs.google.com/spreadsheets/d/1fKNlHmDJdDbnE4jyMcaFqdnw6ZSaexgm33rOcVAj0do/edit#gid=0).


A Hatchet query always finds all **matching paths** in a provided GraphFrame. For this reason, a Hatchet query takes the shape of a sequence of **predicates** to match, with the added convenience of allowing an accompanying **quantifier** for each predicate. This concept is reflected in Fig. 2, where we interpret queries in terms of filtering or selecting parts of the GraphFrame.

![Graph frames and queries](../common/images/hatchet_query_graphframe.png)

**Figure 2**: A diagram to provide an overview of queries and an example of how queries filter GraphFrames.

***

## Category 1: Quantifier Capabilities

A valid Hatchet query requires a **quantifier**. The accepted values for query node quantifiers in the **Object-based Dialect** are:

1. `"."`: Match a single node
2. `"*"`: Match 0 or more nodes
3. `"+"`: Match 1 or more nodes
4. `Integer`: Match an exact number of nodes
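
As a rough illustration of the list above, each quantifier can be read as a lower and an upper bound on how many nodes a query node may match. The helper below is a hypothetical sketch written for this notebook, not part of Hatchet's API, and it assumes that any non-negative integer is an acceptable exact count.

```python
def quantifier_bounds(quantifier):
    """Return (min_nodes, max_nodes) for a quantifier; max is None if unbounded.

    Hypothetical helper for illustration only; not a Hatchet function.
    """
    if quantifier == ".":
        return (1, 1)       # exactly one node
    if quantifier == "*":
        return (0, None)    # zero or more nodes
    if quantifier == "+":
        return (1, None)    # one or more nodes
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantifier, int) and not isinstance(quantifier, bool) and quantifier >= 0:
        return (quantifier, quantifier)  # exact number of nodes
    raise ValueError(f"unsupported quantifier: {quantifier!r}")


for q in (".", "*", "+", 3):
    print(repr(q), "->", quantifier_bounds(q))
```

Reading the quantifiers this way makes it easier to see that `"."` and the integer `1` describe the same constraint.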

### Loading profile data as Hatchet GraphFrame

Hatchet queries are only defined on Hatchet GraphFrames.
Obtaining a Hatchet GraphFrame is straightforward:

1. Import hatchet
2. Use the appropriate reader for the profile/trace at hand

This notebook loads a profile in JSON format generated by [Caliper](https://github.com/LLNL/Caliper), a performance profiling library.

The dataset contains [LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics)](https://asc.llnl.gov/codes/proxy-apps/lulesh) performance report data generated by Caliper. LULESH is a highly simplified application designed to solve the Sedov Blast problem, which is a standard hydrodynamics test problem. It performs a hydrodynamics stencil calculation using both MPI and OpenMP to achieve parallelism.

This is an interesting profile because it covers a relatively large number of nodes and also spends considerable time in MPI communication routines.

In [None]:
import hatchet as ht
gf = ht.GraphFrame.from_caliper("../../data/lulesh-16nodes/lulesh-annotation-profile-512cores.json")

### Displaying a Hatchet GraphFrame
A compact overview of a Hatchet GraphFrame can be obtained using the `gf.tree()` method. We use this throughout the notebook to display the differences between an original GraphFrame and the resulting GraphFrame after applying a query.

In [None]:
print(gf.tree())

### Displaying a DataFrame
A more detailed perspective can be obtained by viewing the underlying data in a **DataFrame**. Hatchet uses pandas **DataFrames** to store the data on each node of the hierarchy and keeps the graph relationships between the nodes in a different data structure that is kept consistent with the **DataFrame**. The **DataFrame** holds all the numerical and categorical data associated with each node.

In [None]:
gf.dataframe

### Dropping index levels

As a precursor to defining queries, we drop the extra index levels of the GraphFrame using Hatchet's `drop_index_levels()` function. Hatchet's hierarchical indexing can be of two types, depending on whether there is a single metric per node or multiple sets of metrics per node.

If a node contains a single metric, the DataFrame uses an `Index` object containing the node column. If a node has additional levels of information, Hatchet creates a `MultiIndex` to store the multiple sets of metrics per node. The `MultiIndex` stores the node column as the "top" level of the index, followed by the additional information on the levels below.

Depending on the type of indexing (`Index` or `MultiIndex`), retrieving the data for a particular node returns either a single row or multiple rows. This difference can cause issues when applying query node predicates.
Therefore, it is necessary to remove all index levels besides the node column through an aggregation operation on the GraphFrame. Then, a query node predicate can be applied to the GraphFrame.

In [None]:
gf.drop_index_levels()
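
To illustrate why the extra index levels matter, the sketch below uses a plain Python dictionary as a stand-in for a two-level index. The node names, ranks, and timings are hypothetical, and averaging is used only as an example aggregation; the sketch does not call Hatchet or pandas.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-rank timings: the key is (node, rank), like a two-level index.
per_rank_time = {
    ("main", 0): 10.0,
    ("main", 1): 12.0,
    ("MPI_Barrier", 0): 1.0,
    ("MPI_Barrier", 1): 3.0,
}

# Looking up a single node returns several rows, one per rank.
rows_for_main = [t for (node, _rank), t in per_rank_time.items() if node == "main"]
print(rows_for_main)  # [10.0, 12.0]

# Aggregate over the rank level so that each node maps to exactly one row.
grouped = defaultdict(list)
for (node, _rank), t in per_rank_time.items():
    grouped[node].append(t)
per_node_time = {node: mean(times) for node, times in grouped.items()}
print(per_node_time)  # {'main': 11.0, 'MPI_Barrier': 2.0}
```

After the aggregation, a predicate can be evaluated once per node instead of once per (node, rank) pair.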

### Query type 1: Match a single node

In the simplest case, one can use this query type to match all single nodes that belong to function calls of a particular library (e.g., MPI).

The following query matches all single nodes where the predicate that the `name` metric `starts with "MPI_Barrier"` is satisfied:

In [None]:
# single node with names starting with MPI_Barrier
query_1 = [
    (1, {"name": "MPI_Barrier.*"}),
]

The above query is equivalent to the following query, which uses the quantifier `"."` to match a single node that satisfies the predicate:

In [None]:
# single node with names starting with MPI_Barrier
query_1 = [
    (".", {"name": "MPI_Barrier.*"}),
]

The above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame. The `filter()` function takes a user-supplied function or query object and applies it to all rows in the DataFrame. The resulting Series or DataFrame is used to filter the DataFrame so that only the rows that evaluate to true are returned.

In [None]:
gf_filt = gf.filter(query_1)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

Then, we use `drop_index_levels()` to get rid of all index levels besides the node column in preparation for defining more queries below.

In [None]:
gf.drop_index_levels()
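
The `name` predicate in Query type 1 is written as the regular expression `"MPI_Barrier.*"`. The sketch below uses Python's standard `re` module to show why such a pattern behaves like a "starts with" test. It assumes the pattern is matched from the beginning of the node name, and the node names are hypothetical examples rather than values read from the profile.

```python
import re

pattern = "MPI_Barrier.*"  # the name predicate used in query_1

# Hypothetical node names, chosen only for illustration.
names = ["MPI_Barrier", "MPI_Barrier_init", "MPI_Allreduce", "my_MPI_Barrier"]

# re.match anchors the pattern at the start of the string, and the trailing
# ".*" accepts any (possibly empty) remainder of the name.
matching = [name for name in names if re.match(pattern, name)]
print(matching)  # ['MPI_Barrier', 'MPI_Barrier_init']
```

Because of the trailing `.*`, matching from the start of the string and matching the whole string select the same names for this pattern.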

### Query type 2: Match zero or more nodes

In many cases, one may not know how many nodes to match. For this reason, Hatchet provides the `"*"` and `"+"` literals as quantifiers to `match zero or more nodes` and `match one or more nodes`, respectively.

This query type filters a GraphFrame with the object syntax to find all query paths with zero or more nodes that meet a query predicate. This notebook contains two examples for this query use case. The purpose of the second example is to illustrate the difference between the query use cases that `match zero or more nodes` and `match one or more nodes`.

Note: For matching zero or more nodes, we use the string literal `"*"`. 

**Example 1:**

For the first example, the following query matches paths of zero or more nodes in which every node satisfies the predicate that the `name` metric `starts with "Calc"`.

In [None]:
# zero or more nodes with name starting with Calc
query_2_1 = [
    ("*", {"name": "Calc.*"}),
]

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_2_1)

Here, instead of matching only single nodes, entire call stacks can be matched. The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

Then, we use `drop_index_levels()` to get rid of all index levels besides the node column in preparation for defining more queries below.

In [None]:
gf.drop_index_levels()

**Example 2:**

In some cases, it is necessary to constrain which nodes to match. In other cases, it may be unknown which functions are called before a particular routine.

For this second example, the first query node (quantifier `"."`) matches a single node whose `name` metric `starts with lulesh`. The second query node (quantifier `"."`, no predicate) matches any single node. The third query node (quantifier `"*"`) then matches `zero or more nodes` whose `name` metric `starts with Calc`.

In [None]:
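# a lulesh node, then any single node, then zero or more Calc nodes
query_2_2 = [
    (".", {"name": "lulesh.*"}),  # exactly one node whose name starts with lulesh
    ".",                          # any single node (no predicate)
    ("*", {"name": "Calc.*"}),    # zero or more nodes whose names start with Calc
]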

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_2_2)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The above tree shows that when the query to `match zero or more nodes` is executed with the constraints mentioned above, the node with `name "TimeIncrement"` is included: because the third query node may match zero nodes, a path that ends at this node already satisfies the query. This specific node is omitted when the third quantifier in this example is changed to match one or more nodes instead.

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

Then, we use `drop_index_levels()` to get rid of all index levels besides the node column in preparation for defining more queries below.

In [None]:
gf.drop_index_levels()

### Query type 3: Match one or more nodes

This query type filters a GraphFrame with the object syntax to find all query paths with one or more nodes that meet a query predicate. 

The notebook contains two examples for this query type. The purpose of the second example is to illustrate the difference between the query types that `match zero or more nodes` and `match one or more nodes`.


Note: For matching one or more nodes, we use the string literal `"+"`.

**Example 1:**

For the first example, the metric used for the query is `name` and the predicate is that the `name` metric `starts with "CalcMonotonic"`.

In [None]:
# one or more nodes with name starting with CalcMonotonic
query_3_1 = [
    ("+", {"name": "CalcMonotonic.*"}),
]

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_3_1)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

Then, we use `drop_index_levels()` to get rid of all index levels besides the node column in preparation for defining more queries below.

In [None]:
gf.drop_index_levels()

**Example 2:**

For the second example, we repeat the second example from the previous section but replace the final query node to `match one or more nodes`. The first query node (quantifier `"."`) matches a single node whose `name` metric `starts with lulesh`. The second query node (quantifier `"."`, no predicate) matches any single node. The third query node (quantifier `"+"`) then matches `one or more nodes` whose `name` metric `starts with Calc`.

In [None]:
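# a lulesh node, then any single node, then one or more Calc nodes
query_3_2 = [
    (".", {"name": "lulesh.*"}),  # exactly one node whose name starts with lulesh
    ".",                          # any single node (no predicate)
    ("+", {"name": "Calc.*"}),    # one or more nodes whose names start with Calc
]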

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_3_2)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

Executing the second examples for query type 2 (match zero or more nodes) and query type 3 (match one or more nodes) demonstrates the difference between the two query types when used in combination with other quantifiers and predicates.

With the predicate that the metric `name` `starts with Calc`, this dataset shows that when we `match one or more nodes`, the filter omits the node with the `name "TimeIncrement"`, because the path through it contains no node that satisfies the third query node's predicate.

This specific node is included when the third quantifier in this example is changed to match zero or more nodes instead. The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

Then, we use  `drop_index_levels()`  to get rid of all index levels besides the node column in preparation for the final query use case with **Category 1: Quantifier Capabilities**.

In [None]:
gf.drop_index_levels()
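
The difference between `"*"` and `"+"` described in this section can be reproduced with a small pure-Python matcher. This is a simplified sketch under stated assumptions, not Hatchet's implementation: it checks whether one whole path (a list of node names) satisfies a query, represents each query node as a hypothetical `(quantifier, regex)` pair where `None` means "any name", and uses hypothetical paths that only borrow names mentioned in this notebook.

```python
import re


def bounds(quantifier):
    """Map a quantifier to (min_count, max_count); None means unbounded."""
    if quantifier == ".":
        return 1, 1
    if quantifier == "*":
        return 0, None
    if quantifier == "+":
        return 1, None
    return quantifier, quantifier  # integer: exact count


def path_matches(query, path):
    """Return True if the whole path (a list of names) satisfies the query."""
    if not query:
        return not path  # an empty query only matches an empty remainder
    (quantifier, regex), rest = query[0], query[1:]
    low, high = bounds(quantifier)
    if high is None:
        high = len(path)
    # Try every allowed number of leading nodes for this query node.
    for count in range(low, min(high, len(path)) + 1):
        head = path[:count]
        if all(regex is None or re.match(regex, name) for name in head):
            if path_matches(rest, path[count:]):
                return True
    return False


zero_or_more = [(".", "lulesh.*"), (".", None), ("*", "Calc.*")]
one_or_more = [(".", "lulesh.*"), (".", None), ("+", "Calc.*")]

# Hypothetical paths for illustration.
short_path = ["lulesh.cycle", "TimeIncrement"]
long_path = ["lulesh.cycle", "LagrangeLeapFrog", "CalcTimeConstraintsForElems"]

print(path_matches(zero_or_more, short_path))  # True: "*" may match zero nodes
print(path_matches(one_or_more, short_path))   # False: "+" needs at least one Calc node
print(path_matches(zero_or_more, long_path))   # True
print(path_matches(one_or_more, long_path))    # True
```

In this sketch, the path ending at `TimeIncrement` is accepted only by the `"*"` variant, which mirrors the behaviour observed in the two filtered trees.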

### Query type 4: Match exact number of nodes


This query type filters a GraphFrame with the object syntax to find all query paths with an exact number of nodes, provided as an integer, that meet a query predicate.


Note: For matching an exact number of nodes, we use an integer value to define the number of nodes to match.

The metric used for the query is `name` and the predicate is that the `name` metric `starts with "Calc"`. We previously applied a query to match zero or more nodes whose names start with Calc. However, one can use the following query to concisely match only those paths that consist of `exactly three nodes` whose names `start with Calc`. The resulting GraphFrame should be smaller than both the original GraphFrame and the result of the previous example.

In [None]:
# exactly three nodes with names starting with Calc
query_4 = [
    (3, {"name": "Calc.*"}),
]

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_4)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe
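
As a rough picture of what "exactly three nodes" means, the sketch below lists every window of three consecutive names along a single path in which all names start with `Calc`. The path and the helper `exact_runs` are hypothetical, and reporting overlapping windows is a choice made for this sketch; it has not been checked against how Hatchet reports such matches.

```python
import re


def exact_runs(path, regex, n):
    """Return every window of n consecutive names in path that all match regex."""
    return [
        path[i:i + n]
        for i in range(len(path) - n + 1)
        if all(re.match(regex, name) for name in path[i:i + n])
    ]


# Hypothetical root-to-leaf path, used only for illustration.
path = ["main", "CalcA", "CalcB", "CalcC", "CalcD", "MPI_Allreduce"]

print(exact_runs(path, "Calc.*", 3))
# [['CalcA', 'CalcB', 'CalcC'], ['CalcB', 'CalcC', 'CalcD']]
```

A path with fewer than three consecutive `Calc` nodes yields no window at all, which is why an exact-count query tends to keep fewer nodes than the `"*"` query on the same predicate.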