# Welcome to the Qualification Tool for the RAPIDS Accelerator for Apache Spark
To run the tool, enter the DBFS location of your Spark CPU event logs in the `log_path` widget, then select "Run all" to execute the notebook. When the notebook completes, the output tables described below appear under the corresponding headings.

## Summary Output
The report represents the entire app execution, including unsupported operators and non-SQL operations.  By default, the applications and queries are sorted in descending order by the following fields:
- Recommendation;
- Estimated GPU Speed-up;
- Estimated GPU Time Saved; and
- End Time.
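
As a rough illustration of that ordering, the sketch below sorts a few hand-made rows by the four fields, all descending. The rows, the key names, and the use of a plain string comparison for Recommendation are assumptions made for the example; the tool's real CSV headers and its ranking of recommendation labels may differ.

```python
# Hypothetical rows; the keys mirror the fields listed above, not the exact CSV headers.
apps = [
    {"app_id": "app-1", "recommendation": "Recommended",
     "speedup": 1.8, "time_saved": 120, "end_time": 1000},
    {"app_id": "app-2", "recommendation": "Strongly Recommended",
     "speedup": 3.1, "time_saved": 900, "end_time": 2000},
    {"app_id": "app-3", "recommendation": "Recommended",
     "speedup": 2.2, "time_saved": 300, "end_time": 3000},
]

def sort_key(app):
    # Tuple comparison applies the fields in priority order.
    # Assumption: Recommendation is compared as a plain string.
    return (app["recommendation"], app["speedup"],
            app["time_saved"], app["end_time"])

# reverse=True gives descending order on every field.
ranked = sorted(apps, key=sort_key, reverse=True)
ranked_ids = [app["app_id"] for app in ranked]
print(ranked_ids)
```

Rows that tie on an earlier field are ordered by the next one, which is why the two "Recommended" apps end up sorted by their speed-up.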

## Stages Output
For each stage used in SQL operations, the Qualification tool generates the following information:
1. App ID
1. Stage ID
1. Average Speedup Factor: the average estimated speed-up of all the operators in the given stage.
1. Stage Task Duration: amount of time spent in tasks of SQL DataFrame operations for the given stage.
1. Unsupported Task Duration: sum of task durations for the unsupported operators. For more details, see Supported Operators.
1. Stage Estimated: True or False, indicating whether the tool had to estimate the stage duration.
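
One way to read the two duration columns together is as a ratio: the share of a stage's task time that was spent in unsupported operators. The sketch below computes that share for made-up stages. The function name, the key names, and the numbers are hypothetical, and the ratio is only an illustration of how the columns relate, not a formula the tool itself reports.

```python
def unsupported_fraction(stage_task_duration, unsupported_task_duration):
    """Share of a stage's task time spent in unsupported operators."""
    if stage_task_duration <= 0:
        # Avoid dividing by zero for stages with no recorded task time.
        return 0.0
    return unsupported_task_duration / stage_task_duration

# Hypothetical stage rows (durations in milliseconds).
stages = [
    {"stage_id": 0, "stage_task_duration": 4000, "unsupported_task_duration": 0},
    {"stage_id": 1, "stage_task_duration": 2000, "unsupported_task_duration": 500},
]

fractions = {
    s["stage_id"]: unsupported_fraction(
        s["stage_task_duration"], s["unsupported_task_duration"]
    )
    for s in stages
}
print(fractions)
```

A stage with a high fraction spends most of its time in operators that would stay on the CPU.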

## Execs Output
The Qualification tool generates a report of each “Exec” (also called an “Executor Node”) in the “SparkPlan”, along with its estimated acceleration on the GPU. Refer to the Supported Operators guide for details on the limitations on UDFs and unsupported operators.
1. App ID
1. SQL ID
1. Exec Name: for example, Filter or HashAggregate.
1. Expression Name
1. Task Speedup Factor: the average acceleration of the operators, where each operator's acceleration is its original CPU duration divided by its GPU duration. The tool uses historical queries and benchmarks to estimate, at the level of an individual operator, how much that operator would accelerate on the GPU.
1. Exec Duration: wall-clock time measured from when the operator starts until it completes.
1. SQL Node Id
1. Exec Is Supported: whether the Exec is supported by RAPIDS or not. Please refer to the Supported Operators section.
1. Exec Stages: an array of stage IDs
1. Exec Children
1. Exec Children Node Ids
1. Exec Should Remove: whether the Exec is removed from the migrated plan.
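
The Task Speedup Factor arithmetic described above can be sketched in a few lines. The operator names come from the Exec Name examples, but the durations are invented for illustration; the tool's actual factors come from its own historical queries and benchmarks, which are not reproduced here.

```python
def speedup_factor(cpu_duration_ms, gpu_duration_ms):
    """Per-operator speed-up: CPU duration divided by GPU duration."""
    return cpu_duration_ms / gpu_duration_ms

# Hypothetical (CPU ms, GPU ms) benchmark timings per operator.
benchmarks = {
    "Filter": (300, 100),
    "HashAggregate": (800, 200),
}

factors = {
    name: speedup_factor(cpu_ms, gpu_ms)
    for name, (cpu_ms, gpu_ms) in benchmarks.items()
}

# Average the per-operator factors, as the field description states.
average_factor = sum(factors.values()) / len(factors)
print(factors, average_factor)
```

With these made-up timings, Filter has a factor of 3.0, HashAggregate has 4.0, and the average is 3.5.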

In [0]:
import json
import requests
import base64
import shlex
import subprocess
import pandas as pd

TOOL_JAR_URL = 'https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.10.0/rapids-4-spark-tools_2.12-22.10.0.jar'
TOOL_JAR_LOCAL_PATH = '/tmp/rapids-4-spark-tools.jar'

# Qualification tool output directory.
OUTPUT_DIR = '/tmp/'

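# Download the tool jar; stop early if the download fails.
response = requests.get(TOOL_JAR_URL)
response.raise_for_status()
with open(TOOL_JAR_LOCAL_PATH, "wb") as jar_file:
  jar_file.write(response.content)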

In [0]:
dbutils.widgets.text("log_path", "")
eventlog_string=dbutils.widgets.get("log_path")

# Build the command from the jar path defined above instead of repeating the literal path.
q_command_string="java -Xmx10g -cp {}:/databricks/jars/* com.nvidia.spark.rapids.tool.qualification.QualificationMain -o {} ".format(TOOL_JAR_LOCAL_PATH, OUTPUT_DIR) + eventlog_string
args = shlex.split(q_command_string)
cmd_out = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)


if cmd_out.returncode != 0:
  dbutils.notebook.exit("Qualification Tool failed with stderr: " + cmd_out.stderr)
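
The command in the cell above is assembled as one string and then split. A possible variant, sketched below, builds the argument list directly and only applies `shlex.split` to the widget text, so that the jar path and output directory are never re-parsed and a quoted event log path containing spaces stays a single argument. The helper name `build_command` and the example paths are hypothetical; the Java class name and flags are copied from the cell above.

```python
import shlex

def build_command(jar_path, output_dir, eventlog_string):
    """Return the argument list for the Qualification tool (hypothetical helper)."""
    return [
        "java", "-Xmx10g",
        "-cp", "{}:/databricks/jars/*".format(jar_path),
        "com.nvidia.spark.rapids.tool.qualification.QualificationMain",
        "-o", output_dir,
        # Split only the user-supplied text; quotes keep paths with spaces intact.
    ] + shlex.split(eventlog_string)

# Example paths are made up for illustration.
args = build_command("/tmp/tool.jar", "/tmp/", "dbfs:/logs/a 'dbfs:/logs/b c'")
print(args)
```

The resulting list can be passed to `subprocess.run` in the same way as the `args` value in the cell above.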

## Summary Output

In [0]:
summary_output=pd.read_csv(OUTPUT_DIR + "rapids_4_spark_qualification_output/rapids_4_spark_qualification_output.csv")
display(summary_output)

## Stages Output

In [0]:
stages_output=pd.read_csv(OUTPUT_DIR + "rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_stages.csv")
display(stages_output)

## Execs Output

In [0]:
execs_output=pd.read_csv(OUTPUT_DIR + "rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_execs.csv")
display(execs_output)