# Query Language: Thicket Tutorial

Thicket is a Python-based toolkit for Exploratory Data Analysis (EDA) of parallel performance data. It helps users understand and optimize application performance on supercomputers. It bridges the gap between tools that consider only a single instance of a simulation run (e.g., single platform, single measurement tool, or single scale) and the need to find actionable insights in multi-dimensional, multi-scale, multi-architecture, and multi-tool performance datasets.

**NOTE: An interactive version of this notebook is available in the Binder environment.**

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/llnl/thicket-tutorial/develop)

***

## 1. Import Necessary Packages

To explore the structure and capabilities of thicket components, we begin by importing the necessary packages. These include Python extensions and thicket's statistical functions.

In [None]:
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from IPython.display import HTML
import hatchet as ht

import thicket as tt

display(HTML("<style>.container { width:80% !important; }</style>"))

## 2. Read in Performance Profiles

For this notebook, we select profiles generated on lassen, a machine at Lawrence Livermore National Laboratory (LLNL). We create a thicket object from four profiles that cover different problem sizes but share the same block size of 128.

In [None]:
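problem_sizes = ["1048576", "2097152", "4194304", "8388608"]

# four profiles with block size 128, one per problem size
lassen1 = [f"../data/lassen/clang10.0.1_nvcc10.2.89_{x}/Base_CUDA-block_128.cali" for x in problem_sizes]
# a single profile with block size 256 (not used further in this notebook)
lassen2 = ["../data/lassen/clang10.0.1_nvcc10.2.89_1048576/Base_CUDA-block_256.cali"]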

# generate thicket(s)
th_lassen = tt.Thicket.from_caliperreader(lassen1)

## 3. More Information on a Function
***
You can use Python's built-in `help()` function to see the documentation for a given object by calling `help(object)`.
The output shows the arguments of the function and what it returns. An example is below.

In [None]:
help(tt.median)

## 4. Append Statistical Calculation(s)
***

We can calculate statistical aggregations per-node in the performance data and append the values to the aggregated statistics table. In the example below, we calculate the per-node median time across 4 profiles and append the median to the statistics table. The new column is called `Total time_median`. 

Why is this important for this notebook?

When the nodes in the performance data table change, the values in the aggregated statistics table can change as well, depending on the metric. Therefore, the aggregated statistics table is cleared after a query has been applied. In the examples further down, we recompute the median of total time after each query and use that appended column as the metric to print the call trees.

In [None]:
metrics = ["Total time"]
tt.median(th_lassen, columns=metrics)
th_lassen.statsframe.dataframe
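
To make the per-node aggregation concrete, here is a minimal standard-library sketch of the same idea. It does not use thicket; the node names and timing values are hypothetical, and the dictionary layout is only an assumption chosen for illustration.

```python
from statistics import median

# Hypothetical "Total time" values (seconds): one value per profile for each node.
total_time = {
    "Stream": [0.8, 1.0, 1.6, 3.0],
    "Stream_MUL": [0.2, 0.3, 0.5, 0.9],
}

# One median per node, analogous in spirit to the "Total time_median" column.
total_time_median = {node: median(values) for node, values in total_time.items()}

for node, value in total_time_median.items():
    print(f"{node}: {value:.2f}")
```

With an even number of profiles, as here, the median is the mean of the two middle values.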

## 5. Thicket Query Language 

**Use the Query Language**

Thicket's query language provides users the capability to select or `query` specific nodes based on the call tree component in thicket. The nodes in the performance data and statistics table are updated as well to reflect which nodes are remaining in the call tree.

In [None]:
print("Initial call tree: ")
print(th_lassen.statsframe.tree("Total time_median"))

### Example Query 1: Find a Subgraph with a Specific Root

This example shows how to find a subtree starting with a specific root. More specifically, the query in this example finds a subtree rooted at the node with the name "Stream" followed by all nodes down to the leaf nodes.

NOTE: A DeprecationWarning is generated when using “old-style” queries (i.e., queries with QueryMatcher) if you have the newest version of Hatchet installed.

In [None]:
query_ex1 = (
    ht.QueryMatcher()
    .match(
        ".",
        lambda row: row["name"]
        .apply(lambda x: re.match("Stream", x) is not None)
        .all(),
    )
    .rel("*")
)

# applying the first query on the lassen thicket
th_ex1 = th_lassen.query(query_ex1)
tt.median(th_ex1, columns=["Total time"])
print(th_ex1.statsframe.tree("Total time_median"))
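
One detail of the predicate is worth spelling out: `re.match` anchors the pattern only at the start of the string, so `re.match("Stream", x)` accepts any name that begins with "Stream". The sketch below uses only the standard library and a hypothetical list of node names to contrast this with `re.fullmatch`, which requires the whole name to match.

```python
import re

# Hypothetical node names in the style of the call tree used in this notebook.
names = ["Stream", "Stream_MUL", "Basic_DAXPY"]

# re.match anchors only at the start of the string, so it acts as a prefix test.
prefix_hits = [n for n in names if re.match("Stream", n) is not None]

# re.fullmatch requires the entire string to match the pattern.
exact_hits = [n for n in names if re.fullmatch("Stream", n) is not None]

print(prefix_hits)
print(exact_hits)
```

If only the node named exactly "Stream" should match, `re.fullmatch` or a plain equality test are possible alternatives for the predicate.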

### Example Query 2: Find All Paths Ending with a Specific Node

This example shows how to find all paths in the call tree that end with a specific node. More specifically, the query in this example finds paths ending with a node named "Stream".

In [None]:
query_ex2 = (
    ht.QueryMatcher()
    .match("*")
    .rel(
        ".",
        lambda row: row["name"]
        .apply(lambda x: re.match("Stream", x) is not None)
        .all(),
    )
)

# applying the second query on the lassen thicket
th_ex2 = th_lassen.query(query_ex2)
tt.median(th_ex2, columns=["Total time"])
print(th_ex2.statsframe.tree("Total time_median"))

### Example Query 3: Find All Paths with Specific Starting and Ending Nodes

This example shows how to find all call paths starting with and ending with specific nodes. More specifically, the query in this example finds paths starting with a node named "Stream" and ending with a node named "Stream_MUL".

In [None]:
query_ex3 = (
    ht.QueryMatcher()
    .match(
        ".",
        lambda row: row["name"]
        .apply(lambda x: re.match("Stream", x) is not None)
        .all(),
    )
    .rel("*")
    .rel(
        ".",
        lambda row: row["name"]
        .apply(lambda x: re.match("Stream_MUL", x) is not None)
        .all(),
    )
)

# applying the third query on the lassen thicket
th_ex3 = th_lassen.query(query_ex3)
tt.median(th_ex3, columns=["Total time"])
print(th_ex3.statsframe.tree("Total time_median"))
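
The queries in these examples chain two quantifiers: `"."` stands for exactly one node and `"*"` for zero or more nodes. The toy matcher below is a simplified sketch of that idea, not Hatchet's implementation: it checks a single, already-extracted path (a list of hypothetical node names) against a pattern, whereas the real query engine searches the call graph for matching paths. The helpers `path_matches`, `any_node`, and `starts_with` are names invented for this illustration.

```python
def path_matches(path, pattern):
    """Return True if the whole path matches the pattern.

    pattern is a list of (quantifier, predicate) pairs, where the quantifier
    is "." (exactly one node) or "*" (zero or more nodes).
    """
    if not pattern:
        return not path
    quantifier, predicate = pattern[0]
    rest = pattern[1:]
    if quantifier == ".":
        return bool(path) and predicate(path[0]) and path_matches(path[1:], rest)
    if quantifier == "*":
        # Either consume no node here, or consume one matching node and stay on "*".
        if path_matches(path, rest):
            return True
        return bool(path) and predicate(path[0]) and path_matches(path[1:], pattern)
    raise ValueError(f"unsupported quantifier: {quantifier!r}")


def any_node(name):
    """Predicate that accepts every node."""
    return True


def starts_with(prefix):
    """Build a predicate that accepts names beginning with the given prefix."""
    return lambda name: name.startswith(prefix)


# Shape of Example Query 3: a "Stream" node, any nodes, then a "Stream_MUL" node.
pattern_ex3 = [
    (".", starts_with("Stream")),
    ("*", any_node),
    (".", starts_with("Stream_MUL")),
]

print(path_matches(["Stream", "Stream_MUL"], pattern_ex3))
print(path_matches(["Stream", "Stream_ADD"], pattern_ex3))
```

The first path satisfies the pattern because `"*"` may consume zero nodes; the second does not, because its last node fails the final predicate.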

### Example Query 4: Find All Nodes for a Particular Software Library

This example shows how to find all call paths representing a specific software library. This example is simply a variant of finding a subtree with a given root (see Example Query 1). The example query below can be adapted to find the nodes for a subset of the MPI library, for example. In our example, we look for subtrees rooted at Polybench_2MM, Basic_DAXPY, and Apps_ENERGY.

In [None]:
api_entrypoints = [
    "Polybench_2MM",
    "Basic_DAXPY",
    "Apps_ENERGY",
]

query_ex4 = (
    ht.QueryMatcher()
    .match(
        ".",
        lambda row: row["name"].apply(lambda x: x in api_entrypoints).all(),
    )
    .rel("*")
)

# applying the fourth query on the lassen thicket
th_ex4 = th_lassen.query(query_ex4)
tt.median(th_ex4, columns=["Total time"])
print(th_ex4.statsframe.tree("Total time_median"))
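
When adapting this query to a library such as MPI, the name test inside the predicate can be either an exact membership check, as with `api_entrypoints`, or a prefix check. The sketch below shows both on a hypothetical list of node names using only plain Python; the assumption that the library's nodes share an `MPI_` prefix is for illustration only.

```python
# Hypothetical node names; the shared "MPI_" prefix is an assumption.
names = ["main", "MPI_Init", "MPI_Allreduce", "Basic_DAXPY", "MPI_Finalize"]

# Exact membership: only the listed entry points are selected.
entrypoints = {"MPI_Init", "MPI_Finalize"}
exact = [n for n in names if n in entrypoints]

# Prefix test: every node whose name starts with "MPI_" is selected.
by_prefix = [n for n in names if n.startswith("MPI_")]

print(exact)
print(by_prefix)
```

Either test could take the place of `x in api_entrypoints` in the inner lambda of the predicate.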

### Example Query 5: Find All Paths through a Specific Node

This example shows how to find all call paths that pass through a specific node. More specifically, the query below finds all paths that pass through a node named "Stream".

In [None]:
query_ex5 = (
    ht.QueryMatcher()
    .match("*")
    .rel(
        ".",
        lambda row: row["name"]
        .apply(lambda x: re.match("Stream", x) is not None)
        .all(),
    )
    .rel("*")
)

# applying the fifth query on the lassen thicket
th_ex5 = th_lassen.query(query_ex5)
tt.median(th_ex5, columns=["Total time"])
print(th_ex5.statsframe.tree("Total time_median"))