# Hatchet Query Language

This notebook explores [Hatchet](https://github.com/LLNL/hatchet) queries from its [**Call Path Query Language**](https://hatchet.readthedocs.io/en/latest/query_lang.html), specifically **Category 2: String Equivalence and Regex Matching**. The notebook covers the query types that find string metrics satisfying a pattern match in [GraphFrame](https://hatchet.readthedocs.io/en/latest/user_guide.html) objects.

In [1]:
# display documentation for object-based dialect
from IPython.display import Markdown, display
display(Markdown("../common/documentation/base-query-language-02.md"))

The **Query Language** finds all paths in a call graph that match properties described by the query applied to profiling data. It enables Hatchet’s Jupyter notebook-based interactive visualization to give users a simple and intuitive way to massively reduce the profiling data interactively. The **Query Language** has two dialects (Object-based Dialect and String-based Dialect) that simplify its use under diverse circumstances. Hatchet also supports five different categories for the **Query Language**, as shown in Fig. 1.

|Category ID|Category Description|
|:---------:|:-------------------|
|1          |Quantifier Capabilities|
|2          |String Equivalence and Regex Matching|
|3          |String Containment (contains, starts with, ends with)|
|4          |Basic Numeric Comparison (==, >, >=, etc.)|
|5          |Comparison with Special Values (NaN, Inf, None)|

**Figure 1**: A table of the Hatchet Query Language capabilities, divided into categories with their corresponding category IDs.

Hatchet offers multiple interfaces to define queries, with different trade-offs between verbosity and expressiveness. An entire catalog of queries, use cases, categories and capabilities can be found [here](https://docs.google.com/spreadsheets/d/1fKNlHmDJdDbnE4jyMcaFqdnw6ZSaexgm33rOcVAj0do/edit#gid=0).

The Hatchet query language consumes a GraphFrame and a sequence of queries. Each query can comprise a **predicate** and a **quantifier**. The query language finds all **matching paths** in the provided GraphFrame. For example, in Fig. 2, for the query (any with A or B), the output consists of 2 paths, [1, 2, 4] and [1, 3, 4].

![Graph frames and queries](../common/images/hatchet_query_graphframe.png)

**Figure 2**: A diagram to provide an overview of queries and an example of how queries filter GraphFrames.

***

## Category 2: String Equivalence and Regex Matching

Category 2 expands on query conditions by exploring string equivalence and regex matching. The Object-based Dialect of the Hatchet Query Language allows us to:

1. Check for string metric equivalence in query node predicates
2. Check for regex match on string metric in query node predicates

In [2]:
# display dataset information 
display(Markdown("../common/documentation/dataset-information-with-regex.md"))

### Loading profile data as Hatchet GraphFrame

Hatchet queries are only defined on Hatchet GraphFrames.
Obtaining a Hatchet GraphFrame is straightforward:

1. Import hatchet
2. Import Python's [regular expression operations](https://docs.python.org/3/library/re.html) module
3. Use the appropriate reader for the profile/trace at hand

We first load a [Caliper](https://github.com/LLNL/Caliper) profile in JSON format. Caliper is a performance profiling library.

This example profile was collected with Caliper from [LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics)](https://asc.llnl.gov/codes/proxy-apps/lulesh), a highly simplified application designed to solve the Sedov Blast problem, which is a standard hydrodynamics test problem. LULESH performs a hydrodynamics stencil calculation using both MPI and OpenMP to achieve parallelism.

This is an interesting profile because it covers a relatively large number of nodes (45 nodes) and also spends considerable time in MPI communication routines.

In [None]:
import hatchet as ht
import re
gf = ht.GraphFrame.from_caliper("../../data/lulesh-16nodes/lulesh-annotation-profile-512cores.json")

In [3]:
# display GraphFrame information 
display(Markdown("../common/documentation/graph-tree-information.md"))

### Displaying a Hatchet GraphFrame
A compact overview of a hatchet GraphFrame can be obtained using the `gf.tree()` function. We use this throughout the notebook to display the differences between an original GraphFrame and the resulting GraphFrame after applying a query.

In [None]:
print(gf.tree())

In [5]:
# display DataFrame information 
display(Markdown("../common/documentation/dataframe-information.md"))

### Displaying a DataFrame
A more detailed perspective can be obtained by viewing the underlying data in a **DataFrame**. The Hatchet **DataFrame** holds all the numerical and categorical data associated with each node.

In [None]:
gf.dataframe

In [7]:
# why use drop index levels
display(Markdown("../common/documentation/drop-index-information.md"))

### Dropping index levels

As a precursor to defining queries, we drop the index levels of the GraphFrame using the `drop_index_levels()` Hatchet function. Hatchet's hierarchical indexing can be of two types, depending on whether there is a single set of metrics per node or multiple sets of metrics per node.

If a node contains a single metric, the DataFrame will use an `Index` object containing the node column. If a node has an additional level of information, Hatchet creates a `MultiIndex` to store the information pertaining to multiple sets of metrics per node. `MultiIndex` stores the node column as the "top" level of the index, followed by additional information on the levels below. 

Depending on the type of indexing (`Index` or `MultiIndex`), retrieving the data for a particular node from a DataFrame returns either a single row or multiple rows. This difference can cause issues when applying query node predicates.
Therefore, it is necessary to remove all index levels besides the node column through an aggregation operation on the GraphFrame. Then, a query node predicate can be applied to the GraphFrame.
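The difference between the two index types can be sketched without Hatchet or pandas. The block below is only a plain-Python analogy under stated assumptions: the table, the `lookup` and `drop_levels` helpers, the rank level, and the timing values are all hypothetical, and averaging is chosen purely for illustration. It shows why a predicate written for one row per node becomes ambiguous when a node maps to several rows.

```python
from statistics import mean

# Hypothetical table with two index levels: (node, rank).
# Two ranks each report a time for every node.
rows = {
    ("main", 0): {"name": "main", "time": 10.0},
    ("main", 1): {"name": "main", "time": 14.0},
    ("MPI_Finalize", 0): {"name": "MPI_Finalize", "time": 1.0},
    ("MPI_Finalize", 1): {"name": "MPI_Finalize", "time": 3.0},
}


def lookup(table, node):
    """Return every row whose first index level equals `node`."""
    return [row for (n, _rank), row in table.items() if n == node]


def drop_levels(table):
    """Collapse to one row per node by averaging the numeric column."""
    groups = {}
    for (node, _rank), row in table.items():
        groups.setdefault(node, []).append(row)
    return {
        node: {"name": group[0]["name"], "time": mean(r["time"] for r in group)}
        for node, group in groups.items()
    }


# Before aggregation, one node yields several rows.
per_rank = lookup(rows, "main")

# After aggregation, each node yields exactly one row.
flat = drop_levels(rows)

print(len(per_rank))         # number of rows found for "main"
print(flat["main"]["time"])  # single aggregated value for "main"
```

With one row per node, a predicate such as `row["name"] == "main"` evaluates to a single boolean instead of one value per rank.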

In [8]:
gf.drop_index_levels()

In [9]:
# display query type 1 documentation
display(Markdown("../common/documentation/string-equivalence-regex-matching-01.md"))

### Query type 1: Check for string metric equivalence in query node conditions


This query type covers cases where comparing strings for equivalence is helpful for filtering a graph tree. The Object-based Dialect is used to find nodes whose string metric is equal to a provided string. The following query matches all single nodes where the metric used for the query predicate is `name` and the string to find is `MPI_Finalize`.

Note: The query condition to find string metric equivalence can be written as `{"metric_name": "string_to_check"}`.
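String equivalence is stricter than an unanchored regex match. As a minimal sketch using only Python's standard `re` module (the list of names is illustrative and not taken from the profile), the block below shows that `re.match` anchors only at the start of the string, so it also accepts longer names that begin with the target, while `==` and `re.fullmatch` accept only the exact string.

```python
import re

# Illustrative names. "MPI_Finalized" is a distinct MPI routine whose
# name merely begins with "MPI_Finalize".
names = ["MPI_Finalize", "MPI_Finalized", "PMPI_Finalize"]

# re.match anchors at the start only, so it behaves like a prefix test.
prefix_hits = [n for n in names if re.match("MPI_Finalize", n) is not None]

# Plain equality is a true equivalence test.
exact_hits = [n for n in names if n == "MPI_Finalize"]

# re.fullmatch requires the whole string to match the pattern.
full_hits = [n for n in names if re.fullmatch("MPI_Finalize", n) is not None]

print(prefix_hits)
print(exact_hits)
print(full_hits)
```

For an equivalence check inside a predicate, `==` or `re.fullmatch` therefore expresses the intent more precisely than `re.match`.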

In [10]:
# find all nodes where the name metric is exactly equal to MPI_Finalize
query_1 = ht.QueryMatcher().match(".", lambda row: row["name"] == "MPI_Finalize")

The above query is passed to Hatchet’s `filter()` function to filter the input graph frame. The `filter()` function takes a user-supplied function or query object and applies that to all rows in the DataFrame. The resulting Series or DataFrame is used to filter the DataFrame to only return rows that are true.

In [11]:
gf_filt = gf.filter(query_1)

The resulting graph frame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the dataframe:

In [None]:
gf_filt.dataframe

In [21]:
# display query type 2 documentation
display(Markdown("../common/documentation/string-equivalence-regex-matching-02-01.md"))

### Query type 2: Check for regex match on string metric in query node predicates

This query type covers cases where matching strings with a regular expression helps identify function calls whose string metric follows a certain pattern. The Object-based Dialect is used to find nodes that match a provided regular expression.

The notebook contains two examples for this query type. The second example illustrates a more complex regular expression than the first, again matched against the `name` metric.

**Example 1:**

The following query matches all single nodes where the metric `name` matches the regular expression `MPI_.*`. The expression matches nodes whose `name` metric starts with `MPI_`.
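The behavior of this pattern can be checked on its own with Python's `re` module. The sketch below uses illustrative function names rather than names read from the profile. It shows that `re.match` already anchors at the start of the string, which is what makes `MPI_.*` a "starts with" test, whereas `re.search` would also accept names that contain `MPI_` elsewhere.

```python
import re

# Illustrative names, not read from the LULESH profile.
names = ["MPI_Allreduce", "MPI_Finalize", "PMPI_Init", "main", "CalcForceForNodes"]

pattern = re.compile("MPI_.*")

# match() anchors at the beginning of the string.
matched = [n for n in names if pattern.match(n) is not None]

# search() scans the whole string, so "PMPI_Init" is accepted as well.
searched = [n for n in names if pattern.search(n) is not None]

# For this pattern, match() is equivalent to a plain prefix test.
prefixed = [n for n in names if n.startswith("MPI_")]

print(matched)
print(searched)
```

Comparing the result against `None` turns the match object into an explicit boolean, which keeps the predicate's return type unambiguous.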

In [22]:
# find all single nodes where the name metric starts with MPI_
query_2_1 = ht.QueryMatcher().match(".", lambda row: re.match("MPI_.*", row["name"]) is not None)

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input graph frame.

In [23]:
gf_filt = gf.filter(query_2_1)

The resulting graph frame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the dataframe:

In [None]:
gf_filt.dataframe

In [13]:
# display query type 2 documentation
display(Markdown("../common/documentation/string-equivalence-regex-matching-02-02.md"))

**Example 2:**

In some cases, one knows that the functions of interest start with and end with particular strings. In this example, the user knows to search for functions whose names end with `Elems`, and then further restricts the query to functions whose names start with `Apply` or `Calc`.

The following query matches all single nodes where the metric `name` matches the regular expression `(Apply|Calc).*Elems$`. The expression matches nodes whose `name` starts with either `Apply` or `Calc` and ends with `Elems`.
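As a quick check of the pattern in isolation, the sketch below applies it to a handful of illustrative LULESH-style function names (assumed for the example, not read from the profile) and compares the result with an equivalent test built from `str.startswith` and `str.endswith`.

```python
import re

# Illustrative LULESH-style names, assumed for this example.
names = [
    "ApplyMaterialPropertiesForElems",
    "CalcKinematicsForElems",
    "CalcLagrangeElements",
    "CalcForceForNodes",
    "IntegrateStressForElems",
    "CalcQForElems",
]

pattern = re.compile("(Apply|Calc).*Elems$")

# match() anchors the alternation at the start; "$" anchors "Elems" at the end.
regex_hits = [n for n in names if pattern.match(n) is not None]

# The same condition written with plain string methods.
plain_hits = [
    n for n in names
    if n.startswith(("Apply", "Calc")) and n.endswith("Elems")
]

print(regex_hits)
```

`CalcLagrangeElements` is rejected because it ends with `Elements` rather than `Elems`, and `IntegrateStressForElems` is rejected because it starts with neither prefix.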

In [26]:
# find all single nodes where the name metric starts with either Apply or Calc and ends with Elems
query_2_2 = ht.QueryMatcher().match(".", lambda row: re.match("(Apply|Calc).*Elems$", row["name"]) is not None)

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input graph frame.

In [27]:
gf_filt = gf.filter(query_2_2)

The resulting graph frame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the dataframe:

In [None]:
gf_filt.dataframe