# Hatchet Query Language

This notebook explores [Hatchet](https://github.com/LLNL/hatchet) queries written in its [**Object-based Dialect**](https://hatchet.readthedocs.io/en/latest/query_lang.html), specifically **Category 4: Basic Numeric Comparison Predicates**. The notebook covers query types that compare numeric metrics of [GraphFrame](https://hatchet.readthedocs.io/en/latest/user_guide.html) objects against values given in query node predicates.

In [23]:
# display documentation for the Hatchet query language
from IPython.display import Markdown, display
display(Markdown("../common/documentation/hatchet-query-language.md"))

Hatchet supports eight different categories for the query language, as shown in Fig. 1.  

|Category ID|Category Description|
|:---------:|:-------------------|
|1          |Quantifier Capabilities|
|2          |String Equivalence and Regex Matching Predicates|
|3          |String Containment Predicates (contains, starts with, ends with)|
|4          |Basic Numeric Comparison Predicates (==, >, >=, etc.)|
|5          |Special Value Identification Predicates (NaN, Inf, None)|
|6          |Predicate Combination through Conjunction (AND)|
|7          |Predicate Combination through Disjunction and Complement (OR, NOT)|
|8          |Predicate Combination through Other Operations (e.g., XOR)|

**Figure 1**: A table of the Hatchet Query Language capabilities, distinguished into categories and their corresponding category ID.

Hatchet offers multiple interfaces for defining queries, each with a different trade-off between verbosity and expressiveness. An entire catalog of queries, use cases, categories and capabilities can be found [here](https://docs.google.com/spreadsheets/d/1fKNlHmDJdDbnE4jyMcaFqdnw6ZSaexgm33rOcVAj0do/edit#gid=0).

The Hatchet query language consumes a GraphFrame and a sequence of queries. Each query can comprise a **predicate** and a **quantifier**. The query language finds all **matching paths** in the provided GraphFrame. For example, in Fig. 2, for the query (any with A or B), the output consists of two paths, [1, 2, 4] and [1, 3, 4].

![Graph frames and queries](../common/images/hatchet_query_graphframe.PNG)

**Figure 2**: A diagram to provide an overview of queries and an example of how queries filter GraphFrames.

***




In [24]:
# display documentation for object-based dialect
display(Markdown("../common/documentation/object-based-dialect-04.md"))

The **Object-based Dialect** is a formal language that is built around Python’s built-in objects. Queries in this dialect of the Hatchet query language are composed using Python’s built-in list, tuple, and dict data structures.

## Category 4: Basic Numeric Comparison Predicates (==, >, >=, <, <=)

Category 4 expands on query conditions by exploring predicates that compare numerical metrics of performance data. An example of a numerical metric is "time". The Object-based Dialect of the Hatchet Query Language allows us to:

1. Evaluate numeric metric comparison (e.g., greater than) in query node predicates
2. Check for numeric metric equivalence in query node predicates


In [25]:
# display dataset information 
display(Markdown("../common/documentation/dataset-information.md"))

### Loading profile data as Hatchet GraphFrame

Hatchet queries are only defined on Hatchet GraphFrames. 
Obtaining a Hatchet GraphFrame is straightforward:

1. Import hatchet
2. Use the appropriate reader for the profile/trace at hand

We first load a [Caliper](https://github.com/LLNL/Caliper) profile in JSON format, where Caliper is a performance profiling library developed by the Lawrence Livermore National Lab (LLNL).

This example profile was collected with Caliper from [LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics)](https://asc.llnl.gov/codes/proxy-apps/lulesh). LULESH is a highly simplified application designed to solve the Sedov Blast problem, which is a standard hydrodynamics test problem. It performs a hydrodynamics stencil calculation using both MPI and OpenMP to achieve parallelism.

This is an interesting profile because it covers a relatively large number of nodes (45 nodes) and spends considerable time in MPI communication routines.


In [None]:
import hatchet as ht
gf = ht.GraphFrame.from_caliper("../../data/lulesh-16nodes/lulesh-annotation-profile-512cores.json")

In [27]:
# display GraphFrame information 
display(Markdown("../common/documentation/graph-tree-information.md"))

### Displaying a Hatchet GraphFrame
A compact overview of a hatchet GraphFrame can be obtained using the `gf.tree()` function. We use this throughout the notebook to display the differences between an original GraphFrame and the resulting GraphFrame after applying a query.

In [None]:
print(gf.tree())

In [28]:
# display DataFrame information 
display(Markdown("../common/documentation/dataframe-information.md"))

### Displaying a DataFrame
A more detailed perspective can be obtained by viewing the underlying data in a **DataFrame**. A Hatchet **DataFrame** holds all the numerical and categorical data associated with each node.

In [None]:
gf.dataframe

In [29]:
# display documentation on why index levels are dropped
display(Markdown("../common/documentation/drop-index-information.md"))

### Dropping index levels

As a precursor to defining queries, we drop the index levels of the GraphFrame using Hatchet's `drop_index_levels()` function. Hatchet's hierarchical indexing takes one of two forms, depending on whether there is a single metric per node or multiple sets of metrics per node.

If a node contains a single metric, the DataFrame will use an `Index` object containing the node column. If a node has an additional level of information, Hatchet creates a `MultiIndex` to store the information pertaining to multiple sets of metrics per node. `MultiIndex` stores the node column as the "top" level of the index, followed by additional information on the levels below. 

Depending on the type of indexing (`Index` or `MultiIndex`), retrieving the data for a particular node from a DataFrame returns either a single row or multiple rows. This difference can cause issues when applying query node predicates.
Therefore, all index levels besides the node column must be removed through an aggregation operation on the GraphFrame. A query node predicate can then be applied to the GraphFrame.

In [None]:
gf.drop_index_levels()
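
The aggregation behind this step can be pictured with plain Python. The sketch below is only an illustration, not Hatchet's implementation: the rows, the node names, the `collapse_to_node_level` helper, and the choice of the mean as the aggregation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-rank data: each row is keyed by (node, rank),
# mimicking a two-level index with the node as the top level.
rows = {
    ("main", 0): 10.0,
    ("main", 1): 14.0,
    ("solve", 0): 100.0,
    ("solve", 1): 300.0,
}

def collapse_to_node_level(rows, aggregate=mean):
    """Combine all rows that share a node into one value per node."""
    grouped = defaultdict(list)
    for (node, _rank), time in rows.items():
        grouped[node].append(time)
    return {node: aggregate(times) for node, times in grouped.items()}

per_node = collapse_to_node_level(rows)
print(per_node)  # {'main': 12.0, 'solve': 200.0}
```

After the collapse, every node maps to exactly one value, so a predicate such as "time greater than some threshold" has a single, unambiguous answer per node.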

In [30]:
# display query type 1 documentation
display(Markdown("../common/documentation/basic-numeric-comparison-01.md"))

### Query type 1: Evaluate numeric metric comparison (e.g., greater than) in query node predicates


In the simplest case, this query type finds GraphFrame nodes whose time metric satisfies an inequality (<, >, <=, >=) against a numerical value. A practical use case is to first constrain which nodes to match and then apply a numeric metric comparison. The example below follows this pattern.
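
In the Object-based Dialect, a numeric predicate is written as a string that holds a comparison operator followed by a value, for example `"> 100000"`. As a rough mental model, such a string can be turned into a function of one number. The sketch below is a hypothetical helper written for this notebook (`make_numeric_predicate` and `OPERATORS` are not part of Hatchet) and does not claim to mirror how Hatchet parses predicates internally.

```python
import operator

# Comparison symbols mapped to the functions that implement them.
# Two-character symbols come first so ">=" is not mistaken for ">".
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

def make_numeric_predicate(spec):
    """Turn a string such as "> 100000" into a function of one number."""
    spec = spec.strip()
    for symbol, compare in OPERATORS.items():
        if spec.startswith(symbol):
            threshold = float(spec[len(symbol):])
            return lambda value: compare(value, threshold)
    raise ValueError(f"unsupported comparison: {spec!r}")

is_slow = make_numeric_predicate("> 100000")
print(is_slow(250000.0))  # True
print(is_slow(100000.0))  # False: the comparison is strict
```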

The following query has three query nodes. The first matches a single node whose `name` is `lulesh.cycle`. The second matches zero or more nodes below the `lulesh.cycle` node in the GraphFrame. The third matches a single node whose `time` metric is greater than 100000:

In [None]:
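query_1 = [
    (".", {"name": "lulesh.cycle"}),  # exactly one node named lulesh.cycle
    ("*"),                            # zero or more nodes; ("*") is simply the string "*"
    (".", {"time": "> 100000"})       # exactly one node with time greater than 100000
]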

`query_1` is passed to Hatchet’s `filter()` function to filter the input GraphFrame. The `filter()` function takes a user-supplied function or query object and applies it to all rows in the DataFrame. The resulting Series or DataFrame is then used to keep only the rows for which the result is true.

In [None]:
gf_filt = gf.filter(query_1)

The resulting GraphFrame now lists only the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe
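
To make the matching semantics of a query shaped like `query_1` concrete, the sketch below runs a simplified path matcher over a toy call tree. It is a hypothetical illustration and not Hatchet's implementation: the tree, the time values, and the helpers `can_match_nothing` and `match_paths` are invented for this example, and only the `.` and `*` quantifiers are handled.

```python
# Toy call tree (hypothetical names and times, not the LULESH profile).
CHILDREN = {
    "main": ["lulesh.cycle"],
    "lulesh.cycle": ["TimeIncrement", "LagrangeLeapFrog"],
    "LagrangeLeapFrog": ["LagrangeNodal", "LagrangeElements"],
}
TIME = {
    "main": 500.0,
    "lulesh.cycle": 900.0,
    "TimeIncrement": 150000.0,
    "LagrangeLeapFrog": 2000.0,
    "LagrangeNodal": 320000.0,
    "LagrangeElements": 80000.0,
}

# Same shape as query_1, with predicates written as functions of a node name.
toy_query = [
    (".", lambda node: node == "lulesh.cycle"),
    ("*", lambda node: True),
    (".", lambda node: TIME[node] > 100000),
]

def can_match_nothing(query):
    """True if every remaining query node is a "*", which may match zero nodes."""
    return all(quantifier == "*" for quantifier, _ in query)

def match_paths(node, query):
    """Yield each non-empty downward path from `node` that satisfies `query`."""
    if not query:
        return
    quantifier, predicate = query[0]
    if quantifier == "*":
        # "*" may match zero nodes, so also try the rest of the query here.
        yield from match_paths(node, query[1:])
    if not predicate(node):
        return
    # `node` is matched. "." is now used up; "*" may go on to match more nodes.
    rest = query if quantifier == "*" else query[1:]
    if can_match_nothing(rest):
        yield [node]
    for child in CHILDREN.get(node, []):
        for tail in match_paths(child, rest):
            yield [node] + tail

# Try every node as a possible start of a matching path.
paths = [path for start in TIME for path in match_paths(start, toy_query)]
kept = {node for path in paths for node in path}
print(paths)
# [['lulesh.cycle', 'TimeIncrement'], ['lulesh.cycle', 'LagrangeLeapFrog', 'LagrangeNodal']]
```

In this toy tree, `LagrangeLeapFrog` is kept even though its own time is below the threshold, because the `*` query node matches it on the way to `LagrangeNodal`. `LagrangeElements` is dropped because no matching path ends at or passes through it.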

In [31]:
# display query type 2 documentation
display(Markdown("../common/documentation/basic-numeric-comparison-02-01.md"))

### Query type 2: Check for numeric metric equivalence in query node predicates

This query type covers cases where the user checks whether a numeric metric of a node equals a given numerical value. Since query node predicates match a numeric metric only when it equals the given value exactly, the metric is first set to its floor value in every row of the DataFrame. This allows the query to succeed even when the metric is stored with higher precision than the value provided in the query node predicate.

For the example below, we check for time metric equivalence. Hence we:

1. Import the numpy library for Python
2. Use the numpy library to set the time metric of all rows of the DataFrame to their floor values




In [None]:
import numpy as np
# round every time value down to a whole number so that "==" can match it
gf.dataframe["time"] = gf.dataframe["time"].apply(np.floor)
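
To see why the floor step matters for an exact comparison, consider the small standalone sketch below. The node labels and time values are made up for illustration, and `math.floor` stands in for `np.floor` so the example needs nothing beyond the standard library.

```python
import math

# Hypothetical time values with fractional parts.
times = {"a": 4686.73, "b": 4686.0, "c": 4687.2}

# Exact comparison on the raw values only finds the node that is already whole.
exact = [node for node, t in times.items() if t == 4686]

# After flooring, every value in [4686, 4687) compares equal to 4686.
floored = {node: math.floor(t) for node, t in times.items()}
after_floor = [node for node, t in floored.items() if t == 4686]

print(exact)        # ['b']
print(after_floor)  # ['a', 'b']
```

The trade-off is that flooring discards the fractional part, so the equivalence check effectively matches a whole range of original values.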

In [32]:
# display query type 2 documentation
display(Markdown("../common/documentation/basic-numeric-comparison-02-02.md"))

The following query matches zero or more nodes whose `time` metric is equal to 4686:


In [None]:
query_2 = [
    ("*", {"time": "== 4686"})
]

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_2)

The resulting GraphFrame now lists only the node(s) that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe