# Hatchet Query Language

This notebook explores [Hatchet](https://github.com/LLNL/hatchet) queries from its [**String-based Dialect**](https://hatchet.readthedocs.io/en/latest/query_lang.html), specifically **Category 5: Special Value Identification Predicates**. The notebook covers different query types that are capable of identifying nodes from [GraphFrame](https://hatchet.readthedocs.io/en/latest/user_guide.html) objects containing special values (NaN, Inf, None) using query node predicates.

In [1]:
# display documentation for the Hatchet Query Language
from IPython.display import Markdown, display
display(Markdown("../common/documentation/hatchet-query-language.md"))

Hatchet supports eight different categories for the query language, as shown in Fig. 1.  

|Category ID|Category Description|
|:---------:|:-------------------|
|1          |Quantifier Capabilities|
|2          |String Equivalence and Regex Matching Predicates|
|3          |String Containment Predicates (contains, starts with, ends with)|
|4          |Basic Numeric Comparison Predicates (==, >, >=, etc.)|
|5          |Special Value Identification Predicates (NaN, Inf, None)|
|6          |Predicate Combination through Conjunction (AND)|
|7          |Predicate Combination through Disjunction and Complement (OR, NOT)|
|8          |Predicate Combination through Other Operations (e.g., XOR)|

**Figure 1**: A table of the Hatchet Query Language capabilities, distinguished into categories and their corresponding category ID.

Hatchet offers multiple interfaces to define queries with different trade-offs to verbosity and expressiveness. An entire catalog of queries, use cases, categories and capabilities can be found [here](https://docs.google.com/spreadsheets/d/1fKNlHmDJdDbnE4jyMcaFqdnw6ZSaexgm33rOcVAj0do/edit#gid=0).

The Hatchet query language consumes a GraphFrame and a sequence of queries. Each query can comprise a **predicate** and a **quantifier**. The query language finds all **matching paths** in the provided GraphFrame. For example, in Fig. 2, for the query (any with A or B), the output would comprise two paths, [1, 2, 4] and [1, 3, 4].

![Graph frames and queries](../common/images/hatchet_query_graphframe.PNG)

**Figure 2**: A diagram to provide an overview of queries and an example of how queries filter GraphFrames.

***




In [2]:
# display documentation for string-based dialect
display(Markdown("../common/documentation/string-based-dialect-05.md"))

The **String-based Dialect** is a formal language that can be used to create queries using a syntax derived from [Cypher](https://dl.acm.org/doi/10.1145/3183713.3190657). Queries generated using the **String-based Dialect** contain two main syntactic pieces: a *MATCH* statement and a *WHERE* statement. The *MATCH* statement starts with the *MATCH* keyword and defines the quantifiers and variable names used to refer to query nodes in the predicates. The *WHERE* statement starts with the *WHERE* keyword and defines one or more predicates. 

## Category 5: Special Value Identification Predicates (NaN, Inf, None)

Category 5 expands on using query node predicates to identify and filter GraphFrame nodes with special values such as NaN, Inf, or None.

The String-based Dialect of the Hatchet Query Language supports the following checks:

1. Check if numeric metric is NaN in query node predicates
2. Check if numeric metric is not NaN in query node predicates
3. Check if numeric metric is infinity in query node predicates
4. Check if numeric metric is not infinity in query node predicates
5. Check if metric is None (i.e., Python keyword None) in query node predicates
6. Check if metric is not None (i.e., Python keyword None) in query node predicates
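
The six checks above differ only in the keyword that follows the metric name. As a minimal sketch, the snippet below builds one query string per check using the `MATCH`/`WHERE` syntax shown later in this notebook. The helper `make_query` and the name `SPECIAL_VALUE_CHECKS` are hypothetical and exist only for illustration; they are not part of Hatchet.

```python
# Keywords exactly as they appear in this notebook's queries.
SPECIAL_VALUE_CHECKS = [
    "IS NAN", "IS NOT NAN",
    "IS INF", "IS NOT INF",
    "IS NONE", "IS NOT NONE",
]


def make_query(metric, check):
    """Hypothetical helper: build a string-dialect query for one metric."""
    # "*" is the quantifier and p is the variable naming the query node.
    return f'MATCH ("*", p)\nWHERE p."{metric}" {check}'


# One query per check, all applied to the "time" metric.
queries = {check: make_query("time", check) for check in SPECIAL_VALUE_CHECKS}
print(queries["IS NAN"])
```

Each resulting string has the same shape as `query_1` through `query_6` used in the rest of the notebook.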


In [3]:
# display dataset information 
display(Markdown("../common/documentation/dataset-information.md"))

### Loading profile data as Hatchet GraphFrame

Hatchet queries are only defined on Hatchet GraphFrames. 
Obtaining a Hatchet GraphFrame is straightforward:

1. Import hatchet
2. Use the appropriate reader for the profile/trace at hand

We first load a [Caliper](https://github.com/LLNL/Caliper) profile in JSON format, where Caliper is a performance profiling library developed by the Lawrence Livermore National Lab (LLNL).

This example profile was collected by Caliper from [LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics)](https://asc.llnl.gov/codes/proxy-apps/lulesh). LULESH is a highly simplified application designed to solve the Sedov Blast problem, which is a standard hydrodynamics test problem. It performs a hydrodynamics stencil calculation using both MPI and OpenMP to achieve parallelism.

This is an interesting profile because it covers a relatively large number of nodes (45 nodes) and spends considerable time in MPI communication routines.


In this notebook, we also import an additional profile dataset because some of the use cases perform operations between two GraphFrames. The two profiles are similar, but one does not contain any MPI function call nodes.

In [None]:
import hatchet as ht
gf = ht.GraphFrame.from_caliper("../../data/lulesh-16nodes/lulesh-annotation-profile-512cores.json")
gf1 = ht.GraphFrame.from_caliper("../../data/lulesh-16nodes/lulesh-annotation-profile-512cores-nompi.json")

In [5]:
# display GraphFrame information 
display(Markdown("../common/documentation/graph-tree-information.md"))

### Displaying a Hatchet GraphFrame
A compact overview of a hatchet GraphFrame can be obtained using the `gf.tree()` function. We use this throughout the notebook to display the differences between an original GraphFrame and the resulting GraphFrame after applying a query.

In [None]:
print(gf.tree())

In [7]:
# display DataFrame information 
display(Markdown("../common/documentation/dataframe-information.md"))

### Displaying a DataFrame
A more detailed view can be obtained by inspecting the underlying data in a **DataFrame**. A Hatchet **DataFrame** holds all the numerical and categorical data associated with each node.

In [None]:
gf.dataframe

In [9]:
# why use drop index levels
display(Markdown("../common/documentation/drop-index-information.md"))

### Dropping index levels

As a precursor to defining queries, we drop the index levels of the GraphFrame using the `drop_index_levels()` Hatchet function. Hatchet's hierarchical indexing can be of two types, depending on whether there is a single set of metrics per node or multiple sets of metrics per node.

If a node contains a single metric, the DataFrame will use an `Index` object containing the node column. If a node has an additional level of information, Hatchet creates a `MultiIndex` to store the information pertaining to multiple sets of metrics per node. `MultiIndex` stores the node column as the "top" level of the index, followed by additional information on the levels below. 

Depending on the type of indexing (`Index` or `MultiIndex`), retrieving the data for a particular node from a DataFrame returns either a single row or multiple rows. This difference can cause issues when applying query node predicates.
Therefore, it is necessary to get rid of all index levels besides the node column through an aggregation operation on the GraphFrame. Then, a query node predicate can be applied to the GraphFrame. 
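
The difference between the two index types can be sketched with plain pandas. The toy data, the level names `node` and `rank`, and the use of a mean as the aggregation are assumptions made for illustration; they stand in for what `drop_index_levels()` does on a real GraphFrame.

```python
import pandas as pd

# Toy data: two nodes, each measured on two ranks (names are illustrative).
multi = pd.DataFrame(
    {"time": [1.0, 3.0, 2.0, 6.0]},
    index=pd.MultiIndex.from_tuples(
        [("main", 0), ("main", 1), ("solve", 0), ("solve", 1)],
        names=["node", "rank"],
    ),
)

# With a MultiIndex, selecting one node returns several rows (one per rank).
rows_for_main = multi.loc["main"]
print(len(rows_for_main))  # 2

# Aggregating over the extra level leaves exactly one row per node,
# which is the shape a per-node predicate expects.
single = multi.groupby(level="node").mean()
print(single.loc["main", "time"])  # 2.0
```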

In [None]:
gf.drop_index_levels()
gf1.drop_index_levels()

In [49]:
# display query type 1 documentation
display(Markdown("../common/documentation/special-value-identification-01-01.md"))

### Query type 1: Check if numeric metric is NaN in query node predicates

This query type allows a user to identify GraphFrame nodes with a NaN value. When a user subtracts or divides two GraphFrames that don't contain the same nodes, the nodes that are not present in both GraphFrames are assigned **NaN** values. An example of this can be seen in Fig. 3 below.

![Production of NaN nodes](../common/images/nan_production.PNG)

**Figure 3**: Visual example of GraphFrame operations that produce nodes with NaN value.


This query type covers cases where an operation on GraphFrames produces nodes with NaN as their value. The ability to filter the resulting GraphFrame allows users to focus on nodes that aren't shared between the GraphFrames.

Here, we use the second profile dataset, stored as `gf1`, for a subtraction of GraphFrames. Given that the only difference between the node names of the two profiles is that one GraphFrame does not contain any MPI nodes, the resulting GraphFrame must contain MPI nodes with a NaN value for their time metric.






In [None]:
print(gf1.tree())

In [None]:
# subtraction of two GraphFrames
gf1 -= gf

In [None]:
# print resulting GraphFrame to view NaN nodes
print(gf1.tree())

In [15]:
# display query type 1 documentation
display(Markdown("../common/documentation/special-value-identification-01-02.md"))

The following query identifies all GraphFrame nodes where the time metric `is NaN`, using the query node predicate. 



In [None]:
query_1 = """
MATCH ("*", p)
WHERE p."time" IS NAN
"""

The above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame. The `filter()` function takes a user-supplied function or query object and applies that to all rows in the DataFrame. The resulting Series or DataFrame is used to filter the DataFrame to only return rows that are true.
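
The row-wise filtering described above can be illustrated with pandas alone. This is a minimal sketch on a made-up DataFrame: the predicate `time_is_nan` is a hypothetical stand-in for the `IS NAN` check, and the snippet does not reproduce how Hatchet evaluates queries internally.

```python
import math

import pandas as pd

# Toy DataFrame with one NaN time value (data is illustrative).
df = pd.DataFrame(
    {
        "name": ["main", "MPI_Barrier", "solve"],
        "time": [1.5, float("nan"), 0.7],
    }
)


def time_is_nan(row):
    """Predicate applied to each row: True when the time metric is NaN."""
    return math.isnan(row["time"])


# Applying the predicate to every row yields a boolean Series ...
mask = df.apply(time_is_nan, axis=1)

# ... which is then used to keep only the rows that evaluated to True.
filtered = df[mask]
print(filtered["name"].tolist())  # ['MPI_Barrier']
```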

In [None]:
gf_filt = gf1.filter(query_1)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [20]:
# display query type 2 documentation
display(Markdown("../common/documentation/special-value-identification-02.md"))

### Query type 2: Check if numeric metric is not NaN in query node predicates

This query type allows a user to identify GraphFrame nodes that `do not contain a NaN value`. Such queries allow users to focus on nodes that are shared between the GraphFrames after an operation is carried out between two GraphFrames.



In [None]:
query_2 = """
MATCH ("*", p)
WHERE p."time" IS NOT NAN
"""

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf1.filter(query_2)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [25]:
# display query type 3 documentation
display(Markdown("../common/documentation/special-value-identification-03-01.md"))

### Query type 3: Check if numeric metric is infinity in query node predicates

Similar to query type 1, where certain operations on GraphFrames create nodes with NaN values, some operations can produce GraphFrame nodes with INF values. This query type covers cases where an operation on GraphFrames produces nodes with INF.
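
One plausible way for INF values to arise is dividing by a metric whose value is zero. The sketch below shows this with NumPy arrays; the numbers are made up, and treating division as the source of INF is an assumption made for illustration rather than a statement about any particular Hatchet operation.

```python
import numpy as np

# Toy metric values (illustrative): the first two denominators are zero.
numerator = np.array([1.0, 0.0, 2.0])
denominator = np.array([0.0, 0.0, 4.0])

# Silence the floating-point warnings that NumPy emits for x/0 and 0/0.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = numerator / denominator

print(np.isinf(ratio).tolist())  # [True, False, False]  (1/0 gives inf)
print(np.isnan(ratio).tolist())  # [False, True, False]  (0/0 gives nan)
```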

From the graph tree and DataFrame of the dataset for this notebook, we determine that none of the nodes contain an INF. Therefore, we manually add INF to specific rows of the DataFrame below to successfully demonstrate this query type.

1. Import the Python numpy library
2. Set the time metric of all MPI functions to INF




In [None]:
import numpy as np

# MPI functions whose time metric is overwritten with INF for this demonstration
mpi_functions = [
    "MPI_Barrier", "MPI_Finalize", "MPI_Irecv", "MPI_Isend",
    "MPI_Reduce", "MPI_Wait", "MPI_Waitall", "MPI_Allreduce",
]
gf.dataframe.loc[gf.dataframe["name"].isin(mpi_functions), "time"] = np.inf

In [27]:
# display query type 3 documentation
display(Markdown("../common/documentation/special-value-identification-01-02.md"))

The following query identifies all GraphFrame nodes where the time metric `is INF`, using the query node predicate.



In [None]:
query_3 = """
MATCH ("*", p)
WHERE p."time" IS INF
"""

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_3)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [32]:
# display query type 4 documentation
display(Markdown("../common/documentation/special-value-identification-04.md"))

### Query type 4: Check if numeric metric is not infinity in query node predicates

This query type allows a user to identify GraphFrame nodes that `do not contain an INF value`.



In [None]:
query_4 = """
MATCH ("*", p)
WHERE p."time" IS NOT INF
"""

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_4)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [37]:
# display query type 5 documentation
display(Markdown("../common/documentation/special-value-identification-05-01.md"))

### Query type 5: Check if metric is None (i.e., Python keyword None) in query node predicates

Similar to query types 1 and 3, where certain operations on GraphFrames create nodes with NaN and INF values, respectively, some operations can produce GraphFrame nodes with None values. This query type covers cases where an operation on GraphFrames produces nodes with None.

From the graph tree and DataFrame of the dataset for this notebook, we determine that none of the nodes contain a None value. Therefore, we manually add None to specific rows of the DataFrame below. Before we carry on with this step, note that pandas only allows object-typed columns to hold a None value; in a numeric column, None is automatically converted to NaN. To align our demonstration with this behavior, we apply our query to a copy of the name metric, so that we do not interfere with any reference to the name metric made by the Hatchet Query Language.

1. Create a column in the DataFrame called "name_copy" that is identical to the name column  
2. Set the name_copy metric of all MPI functions to None
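
The reason for working on a string-valued column rather than on `time` can be seen in a short pandas sketch. The two Series below are toy stand-ins for a numeric metric column and an object-typed column; the data is made up, and the object dtype is requested explicitly as an assumption so that the behavior does not depend on how pandas infers string columns.

```python
import math

import pandas as pd

# A numeric (float64) column and an object column, as toy stand-ins.
times = pd.Series([1.0, 2.0])
names = pd.Series(["main", "MPI_Barrier"], dtype=object)

# Assigning None to a float column stores NaN instead ...
times.loc[1] = None
# ... while an object column keeps the Python None as given.
names.loc[1] = None

print(math.isnan(times.loc[1]))
print(names.loc[1] is None)
```

This is why an `IS NONE` check is demonstrated on `name_copy`, while a None written into `time` would have to be found with `IS NAN` instead.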



In [None]:
gf.dataframe['name_copy'] = gf.dataframe['name']

# MPI functions whose name_copy entry is overwritten with None for this demonstration
mpi_functions = [
    "MPI_Barrier", "MPI_Finalize", "MPI_Irecv", "MPI_Isend",
    "MPI_Reduce", "MPI_Wait", "MPI_Waitall", "MPI_Allreduce",
]
gf.dataframe.loc[gf.dataframe["name"].isin(mpi_functions), "name_copy"] = None

In [39]:
# display query type 5 documentation
display(Markdown("../common/documentation/special-value-identification-05-02.md"))

The following query identifies all GraphFrame nodes where the name_copy metric `is None`, using the query node predicate.



In [None]:
query_5 = """
MATCH ("*", p)
WHERE p."name_copy" IS NONE
"""

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_5)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe

In [44]:
# display query type 6 documentation
display(Markdown("../common/documentation/special-value-identification-06.md"))

### Query type 6: Check if metric is not None (i.e., Python keyword None) in query node predicates

This query type allows a user to identify GraphFrame nodes that `do not contain a None value`.



In [None]:
query_6 = """
MATCH ("*", p)
WHERE p."name_copy" IS NOT NONE
"""

Just as before, the above query is passed to Hatchet’s `filter()` function to filter the input GraphFrame.

In [None]:
gf_filt = gf.filter(query_6)

The resulting GraphFrame now only lists the nodes that matched the query:

In [None]:
print(gf_filt.tree())

The query match is also reflected in the DataFrame:

In [None]:
gf_filt.dataframe