# Hatchet Query Language

This notebook explores [Hatchet](https://github.com/LLNL/hatchet) queries from its [**Call Path Query Language**](https://hatchet.readthedocs.io/en/latest/query_lang.html), specifically **Category 3: String Containment Predicates**. The notebook covers different query types that check the nodes of [GraphFrame](https://hatchet.readthedocs.io/en/latest/user_guide.html) objects for string metrics that contain a certain substring.

In [1]:
# display documentation for the Hatchet Query Language
from IPython.display import Markdown, display
display(Markdown("../common/documentation/hatchet-query-language.md"))

The Hatchet query language is organized into eight capability categories, listed in Fig. 1.

|Category ID|Category Description|
|:---------:|:-------------------|
|1          |Quantifier Capabilities|
|2          |String Equivalence and Regex Matching Predicates|
|3          |String Containment Predicates (contains, starts with, ends with)|
|4          |Basic Numeric Comparison Predicates (==, >, >=, etc.)|
|5          |Special Value Identification Predicates (NaN, Inf, None)|
|6          |Predicate Combination through Conjunction (AND)|
|7          |Predicate Combination through Disjunction and Complement (OR, NOT)|
|8          |Predicate Combination through Other Operations (e.g., XOR)|

**Figure 1**: The capabilities of the Hatchet Query Language, grouped into categories with their corresponding category IDs.

Hatchet offers multiple interfaces for defining queries, each with a different trade-off between verbosity and expressiveness. A full catalog of queries, use cases, categories, and capabilities can be found [here](https://docs.google.com/spreadsheets/d/1fKNlHmDJdDbnE4jyMcaFqdnw6ZSaexgm33rOcVAj0do/edit#gid=0).

The Hatchet query language consumes a GraphFrame and a sequence of queries, each of which can comprise a **predicate** and a **quantifier**. It finds all **matching paths** in the provided GraphFrame. For example, in Fig. 2, the query (any with A or B) yields two paths, [1, 2, 4] and [1, 3, 4].

![Graph frames and queries](../common/images/hatchet_query_graphframe.PNG)

**Figure 2**: A diagram to provide an overview of queries and an example of how queries filter GraphFrames.

***




In [2]:
# display documentation for the base Query Language
display(Markdown("../common/documentation/base-query-language-03.md"))

The **Query Language** finds all paths in a call graph that match the properties described by a query applied to profiling data. It enables Hatchet’s Jupyter notebook-based interactive visualization to give users a simple and intuitive way to substantially reduce profiling data interactively. The **Query Language** has two dialects, the Object-based Dialect and the String-based Dialect, which simplify its use under diverse circumstances.

## Category 3: String Containment Predicates (contains, starts with, ends with)

Category 3 covers a user-friendly alternative to regex in query node predicates for checking whether a string metric matches a certain substring. This alternative is exclusive to Hatchet's base Query Language and its String-based Dialect. It provides a simpler way to define query node predicates and removes the need for regex knowledge.

The Hatchet base Query Language lets query node predicates check whether a string metric:

1. Starts with a substring
2. Ends with a substring
3. Contains a substring
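
As a minimal sketch in plain Python, the three checks correspond to `str.startswith`, `str.endswith`, and the `in` operator. The function names in `names` are illustrative samples assumed for this sketch, not values read from the profile, and the regex variants are included only for comparison.

```python
import re

# Illustrative function names (assumed for this sketch, not read from the profile).
names = ["LagrangeNodal", "CalcKinematicsForElems", "CalcElemVolume", "main"]

# 1. Starts with: str.startswith vs. re.match, which anchors at the start
starts = [n for n in names if n.startswith("Lagrange")]
starts_re = [n for n in names if re.match(r"Lagrange", n)]

# 2. Ends with: str.endswith vs. a regex anchored at the end of the string
ends = [n for n in names if n.endswith("Elems")]
ends_re = [n for n in names if re.search(r"Elems$", n)]

# 3. Contains: the `in` operator vs. an unanchored regex search
contains = [n for n in names if "Volume" in n]
contains_re = [n for n in names if re.search(r"Volume", n)]

print(starts, ends, contains)
```

On these sample names, each string method selects the same names as its regex counterpart, without requiring any pattern syntax.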


In [3]:
# display dataset information 
display(Markdown("../common/documentation/dataset-information-with-regex.md"))

### Loading profile data as Hatchet GraphFrame

Hatchet queries are defined only on Hatchet GraphFrames.
Obtaining a Hatchet GraphFrame is straightforward:

1. Import Hatchet
2. Import Python's [regular expression operations](https://docs.python.org/3/library/re.html) module
3. Use the appropriate reader for the profile/trace at hand

We first load a [Caliper](https://github.com/LLNL/Caliper) profile in JSON format. Caliper is a performance profiling library developed at Lawrence Livermore National Laboratory (LLNL).

This example profile was generated by Caliper from a run of [LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics)](https://asc.llnl.gov/codes/proxy-apps/lulesh). LULESH is a highly simplified application designed to solve the Sedov Blast problem, which is a standard hydrodynamics test problem. It performs a hydrodynamics stencil calculation using both MPI and OpenMP to achieve parallelism.

This is an interesting profile because it covers a relatively large number of nodes (45 nodes) and spends considerable time in MPI communication routines.


In [None]:
import hatchet as ht
import re
gf = ht.GraphFrame.from_caliper("../../data/lulesh-16nodes/lulesh-annotation-profile-512cores.json")

In [5]:
# display GraphFrame information 
display(Markdown("../common/documentation/graph-tree-information.md"))

### Displaying a Hatchet GraphFrame
A compact overview of a Hatchet GraphFrame can be obtained with the `gf.tree()` method. We use it throughout the notebook to show the differences between the original GraphFrame and the GraphFrame that results from applying a query.

In [None]:
print(gf.tree())

In [7]:
# display DataFrame information 
display(Markdown("../common/documentation/dataframe-information.md"))

### Displaying a DataFrame
A more detailed view can be obtained by inspecting the underlying data in a **DataFrame**. A Hatchet **DataFrame** holds all the numerical and categorical data associated with each node.

In [None]:
gf.dataframe

In [9]:
# why use drop index levels
display(Markdown("../common/documentation/drop-index-information.md"))

### Dropping index levels

As a precursor to defining queries, we drop the extra index levels of the GraphFrame using Hatchet's `drop_index_levels()` function. Hatchet's hierarchical indexing takes one of two forms, depending on whether each node has a single set of metrics or multiple sets of metrics.

If a node has a single set of metrics, the DataFrame uses an `Index` object containing the node column. If a node has an additional level of information, Hatchet creates a `MultiIndex` to store the information pertaining to the multiple sets of metrics per node. A `MultiIndex` stores the node column as the "top" level of the index, followed by the additional information on the levels below.

Depending on the type of index (`Index` or `MultiIndex`), retrieving the data for a particular node from the DataFrame returns either a single row or multiple rows. This difference can cause issues when applying query node predicates.
Therefore, all index levels other than the node column must be removed through an aggregation operation on the GraphFrame before a query node predicate is applied.
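
The effect of the extra index level can be pictured with a plain-Python analogy. This is a conceptual sketch, not Hatchet's implementation: the `rows` dictionary, the rank level, and the use of a mean as the aggregation are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical per-rank rows keyed by (node, rank), mimicking a two-level index.
rows = {
    ("main", 0): {"time": 10.0},
    ("main", 1): {"time": 14.0},
    ("LagrangeNodal", 0): {"time": 4.0},
    ("LagrangeNodal", 1): {"time": 6.0},
}

# Looking up one node returns several rows, so a per-row predicate is ambiguous.
main_rows = [row for (node, _rank), row in rows.items() if node == "main"]
print(len(main_rows))

# Collapse the rank level by averaging (assumed aggregation), one row per node.
times_by_node = {}
for (node, _rank), row in rows.items():
    times_by_node.setdefault(node, []).append(row["time"])
aggregated = {node: {"time": mean(times)} for node, times in times_by_node.items()}
print(aggregated)
```

After the aggregation, each node maps to exactly one row, which is the shape a query node predicate expects.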

In [None]:
gf.drop_index_levels()

In [11]:
# display query type 1 documentation
display(Markdown("../common/documentation/string-containment-01-01.md"))

### Query type 1: Check if string metric starts with substring in query node predicates


This type of query identifies function calls whose names `start with` a user-provided substring. The String-based Dialect and the base Query Language support this query type with the `'STARTS WITH'` keyword and the `startswith` method, respectively.

For comparison, the following query uses regex to match all single nodes whose name metric `starts with Lagrange`:

In [None]:
# re.match anchors at the start of the string; compare against None to return a boolean
query_1 = ht.QueryMatcher().match(".", lambda row: re.match("Lagrange.*", row["name"]) is not None)

In [13]:
# display query type 1 documentation
display(Markdown("../common/documentation/string-containment-01-02.md"))

The query defined below also checks for all single nodes with the name metric that `starts with Lagrange`:

In [None]:
query_1 = ht.QueryMatcher().match(".", lambda row: row["name"].startswith("Lagrange")) 

The above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame. The `filter()` function takes a user-supplied function or query object and applies it to all rows in the DataFrame. The resulting Series or DataFrame is then used to keep only the rows for which the result is true.
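
The row-wise filtering described above can be pictured in plain Python. This is a conceptual sketch only: the `rows` list and the `starts_with_lagrange` helper are hypothetical stand-ins, and Hatchet's actual `filter()` works on the GraphFrame's DataFrame and graph rather than on a list of dictionaries.

```python
# Hypothetical rows standing in for the per-node rows of the DataFrame.
rows = [
    {"name": "main", "time": 12.0},
    {"name": "LagrangeNodal", "time": 5.0},
    {"name": "LagrangeElements", "time": 7.0},
]


def starts_with_lagrange(row):
    """Predicate with the same shape as the lambda passed to the query."""
    return row["name"].startswith("Lagrange")


# Evaluate the predicate on every row, then keep only rows where it is True.
mask = [starts_with_lagrange(row) for row in rows]
kept = [row for row, keep in zip(rows, mask) if keep]
print([row["name"] for row in kept])
```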

In [None]:
gf_filt = gf.filter(query_1)

The resulting GraphFrame now lists only the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [18]:
# display query type 2 documentation
display(Markdown("../common/documentation/string-containment-02-01.md"))

### Query type 2: Check if string metric ends with substring in query node predicates


This type of query identifies function calls whose names `end with` a user-provided substring. The String-based Dialect and the base Query Language support this query type with the `'ENDS WITH'` keyword and the `endswith` method, respectively. This notebook contains two examples of this query use case. The second example illustrates a relatively complex query that is comparable to the notebook example on string regex matching.

**Example 1:**

For the first example, the following query checks for all single nodes with the name metric that `ends with Elems`:

In [None]:
query_2_1 = ht.QueryMatcher().match(".", lambda row: row["name"].endswith("Elems")) 

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_2_1)

The resulting GraphFrame now lists only the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [23]:
# display query type 2 documentation
display(Markdown("../common/documentation/string-containment-02-02.md"))

**Example 2:**

In some cases, the functions of interest are known to both `start with` and `end with` particular strings. In this example, the user first searches for functions that `end with Elems`, then restricts the query further to functions that `start with Apply or Calc`.

The following query matches all single nodes where the name metric `starts with Apply or Calc` and `ends with Elems`.

In [None]:
# str.startswith accepts a tuple of prefixes, which covers "Apply or Calc"
query_2_2 = ht.QueryMatcher().match(
    ".",
    lambda row: row["name"].endswith("Elems")
    and row["name"].startswith(("Apply", "Calc")),
)
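
To make the combined condition concrete, the sketch below applies the same logic to a few sample names in plain Python and compares it with a single regex. The names in `names` are illustrative assumptions rather than output from the profile, and `name_predicate` is a hypothetical helper; it passes a tuple to `str.startswith`, which is equivalent to chaining two `startswith` calls with `or`.

```python
import re

# Illustrative function names (assumed for this sketch, not read from the profile).
names = [
    "ApplyMaterialPropertiesForElems",
    "CalcKinematicsForElems",
    "UpdateVolumesForElems",
    "CalcElemVolume",
]


def name_predicate(name):
    """Ends with 'Elems' and starts with either 'Apply' or 'Calc'."""
    return name.endswith("Elems") and name.startswith(("Apply", "Calc"))


# One possible regex with the same intent, checked against the whole name.
pattern = re.compile(r"(Apply|Calc).*Elems")

matched = [n for n in names if name_predicate(n)]
matched_re = [n for n in names if pattern.fullmatch(n)]
print(matched)
```

On these sample names, the string methods state each condition separately, whereas the regex folds the prefix alternatives and the suffix into one pattern.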

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_2_2)

The resulting GraphFrame now lists only the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [28]:
# display query type 3 documentation
display(Markdown("../common/documentation/string-containment-03.md"))

### Query type 3: Check if string metric contains substring in query node predicates


This type of query identifies function calls whose names `contain` a user-provided substring. The String-based Dialect and the base Query Language support this query type with the `'CONTAINS'` keyword and Python's `in` operator, respectively.

The following query checks for all single nodes with the name metric that `contains Volume`:

In [None]:
query_3 = ht.QueryMatcher().match(".", lambda row: "Volume" in row["name"]) 
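
One detail worth keeping in mind is that Python's `in` operator on strings is case-sensitive. The sketch below shows this in plain Python on a few sample names, which are illustrative assumptions rather than output from the profile. Lowercasing the name first is one possible way to make the check case-insensitive.

```python
# Illustrative function names (assumed for this sketch, not read from the profile).
names = ["CalcElemVolume", "UpdateVolumesForElems", "CalcVolumeForceForElems", "main"]

# Case-sensitive containment, as used in the query above.
exact = [n for n in names if "Volume" in n]

# A lowercase search term does not match the capitalized substring.
lowercase_term = [n for n in names if "volume" in n]

# Normalizing the name first makes the check case-insensitive.
case_insensitive = [n for n in names if "volume" in n.lower()]

print(exact)
print(lowercase_term)
print(case_insensitive)
```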

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_3)

The resulting GraphFrame now lists only the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe