# Welcome to the Qualification User Tool for the RAPIDS Accelerator for Apache Spark
To run the user tool, enter a log path that points to the DBFS location of your Spark CPU event logs and an output path for the results, then select "Run all" to execute the notebook. When the notebook completes, the output tables appear below. More options for running the qualification user tool can be found here: https://docs.nvidia.com/spark-rapids/user-guide/latest/spark-qualification-tool.html#running-the-qualification-tool-standalone-for-csp-environments-on-spark-event-logs.

## Summary Output
The report represents the entire app execution, including unsupported operators and non-SQL operations.  By default, the applications and queries are sorted in descending order by the following fields:
- Recommendation;
- Estimated GPU Speed-up;
- Estimated GPU Time Saved; and
- End Time.
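
A similar ordering can be approximated in pandas when post-processing the summary CSV. The sketch below is not the tool's implementation. It assumes the column names from the sample summary output in this notebook, sorts only by the fields that are present in the frame (the sample CSV has no `End Time` column), and compares `Recommendation` as a plain string, which may not match the tool's internal ranking. The helper name `sort_summary` and the two-row frame are hypothetical.

```python
import pandas as pd

# Sort fields listed above, named as in the sample summary CSV (assumed).
SORT_FIELDS = [
    "Recommendation",
    "Estimated GPU Speedup",
    "Estimated GPU Time Saved",
    "End Time",
]

def sort_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Sort descending by whichever of the sort fields exist in df."""
    present = [c for c in SORT_FIELDS if c in df.columns]
    if not present:
        return df
    return df.sort_values(by=present, ascending=False).reset_index(drop=True)

# Hypothetical data for illustration only.
demo = pd.DataFrame({
    "App ID": ["app-a", "app-b"],
    "Recommendation": ["Recommended", "Recommended"],
    "Estimated GPU Speedup": [1.4, 2.1],
    "Estimated GPU Time Saved": [9130.41, 12000.0],
})
sorted_demo = sort_summary(demo)
print(sorted_demo["App ID"].tolist())
```

Filtering `SORT_FIELDS` against the frame's columns keeps the helper usable on reports that omit some of the fields.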

## Stages Output
For each stage used in SQL operations, the Qualification tool generates the following information:
1. App ID
1. Stage ID
1. Average Speedup Factor: the average estimated speed-up of all the operators in the given stage.
1. Stage Task Duration: amount of time spent in tasks of SQL Dataframe operations for the given stage.
1. Unsupported Task Duration: sum of task durations for the unsupported operators. For more details, see Supported Operators.
1. Stage Estimated: True or False, indicating whether the stage duration had to be estimated.
1. Number of transitions from or to GPU
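
As a rough illustration of how these columns can be combined, the sketch below computes the share of each stage's task time spent in unsupported operators and a duration-weighted mean of the per-stage speedup. The weighted mean is a summary chosen for this example, not a figure the tool reports. The three rows are copied from the sample stages output in this notebook, and stages with a zero `Stage Task Duration` would need separate handling.

```python
import pandas as pd

# Three rows copied from the sample stages output in this notebook.
stages = pd.DataFrame({
    "Stage ID": [37, 32, 36],
    "Average Speedup Factor": [4.3, 2.67, 3.02],
    "Stage Task Duration": [595309, 969, 1631076],
    "Unsupported Task Duration": [0, 0, 0],
})

# Share of each stage's task time spent in unsupported operators.
stages["Unsupported Fraction"] = (
    stages["Unsupported Task Duration"] / stages["Stage Task Duration"]
)

# Duration-weighted mean speedup, so long stages count more than short ones.
total_duration = stages["Stage Task Duration"].sum()
weighted_speedup = (
    stages["Average Speedup Factor"] * stages["Stage Task Duration"]
).sum() / total_duration

print(stages[["Stage ID", "Unsupported Fraction"]])
print(round(weighted_speedup, 2))
```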

## Execs Output
The Qualification tool reports each “Exec” in the “SparkPlan” or “Executor Nodes”, along with its estimated acceleration on the GPU. Refer to the Supported Operators guide for details on the limitations for UDFs and unsupported operators. The report contains the following columns:
1. App ID
1. SQL ID
1. Exec Name: for example, Filter or HashAggregate.
1. Expression Name
1. Task Speedup Factor: the average acceleration of the operators, computed as the original CPU duration of the operator divided by its GPU duration. The tool uses historical queries and benchmarks to estimate, at the individual operator level, how much a specific operator would accelerate on the GPU.
1. Exec Duration: wall-clock time measured from when the operator starts until it completes.
1. SQL Node Id
1. Exec Is Supported: whether the Exec is supported by RAPIDS or not. Please refer to the Supported Operators section.
1. Exec Stages: an array of stage IDs
1. Exec Children
1. Exec Children Node Ids
1. Exec Should Remove: whether the Op is removed from the migrated plan.
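
The execs report lends itself to simple post-processing, for example listing the unsupported Execs and expanding `Exec Stages` into a list of stage IDs. The sketch below assumes that `Exec Stages` is colon-separated, as it appears in the sample execs output in this notebook (for example `39:40`). The rows are copied from that output, and the helper name `parse_stages` is hypothetical.

```python
import pandas as pd

# Rows copied from the sample execs output in this notebook.
execs = pd.DataFrame({
    "SQL ID": [8, 24, 24],
    "Exec Name": ["Execute CreateViewCommand", "TakeOrderedAndProject", "Exchange"],
    "Task Speedup Factor": [1.0, 2.45, 2.78],
    "Exec Is Supported": [False, True, True],
    "Exec Stages": [None, "39:40", "35:37"],
})

def parse_stages(value):
    """Turn a colon-separated stage string such as '39:40' into [39, 40]."""
    if value is None or pd.isna(value):
        return []
    return [int(s) for s in str(value).split(":") if s]

execs["Stage List"] = execs["Exec Stages"].apply(parse_stages)

# Names of the Execs that the report marks as unsupported.
unsupported = execs.loc[~execs["Exec Is Supported"], "Exec Name"].tolist()

print(unsupported)
print(execs["Stage List"].tolist())
```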

In [0]:
%sh
VENV="databricks_venv"
echo "Setting up the virtual environment '$VENV'."
(apt update && \
apt install -y python3-venv && \
python3 -m venv "$VENV" && \
source "$VENV/bin/activate" && \
echo "Installing Spark Rapids User Tools" && \
pip install spark-rapids-user-tools) > /dev/null 2>&1 || \
{ echo "Error: Failed to install Spark Rapids User Tools"; exit 1; }
echo "Spark Rapids User Tools installed successfully."


Setting up the virtual environment 'databricks_venv'.
Spark Rapids User Tools installed successfully.


In [0]:
import os
import pandas as pd

# Widgets collect the event log location, the output location, and the cloud provider.
dbutils.widgets.text("log_path", "")
eventlog_string = dbutils.widgets.get("log_path")

dbutils.widgets.text("output_path", "")
outputpath_string = dbutils.widgets.get("output_path")

dbutils.widgets.dropdown("csp", "aws", ["aws", "azure"])
csp_string = dbutils.widgets.get("csp")

# Export the settings so the %sh cell that runs the tool can read them.
os.environ["EVENTLOG_PATH"] = eventlog_string
os.environ["OUTPUT_PATH"] = outputpath_string
os.environ["PLATFORM"] = f"databricks-{csp_string}"

In [0]:
%sh
source databricks_venv/bin/activate
spark_rapids_user_tools "$PLATFORM" qualification --eventlogs "$EVENTLOG_PATH" --local_folder "$OUTPUT_PATH" --verbose &> "$OUTPUT_PATH/qual_debug.log"

In [0]:
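log_path = os.path.join(outputpath_string, "qual_debug.log")

# Scan the debug log for the line that reports where the tool wrote its results.
output_folder = None
try:
    with open(log_path, 'r') as file:
        output_folder = next((line.split(":", 1)[1].strip() for line in file if line.startswith("Qualification tool output: ")), None)
        if output_folder is None:
            raise ValueError(f"Cannot find output folder. See logs: {log_path}")
except FileNotFoundError:
    # Without the log there is no output folder to read, so stop here instead of failing in a later cell.
    print(f"File not found: {log_path}")
    raise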
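
The extraction step in the cell above can be tried without running the tool. The sketch below assumes the log contains a line of the form `Qualification tool output: <path>`, which is inferred only from the prefix the cell matches. The helper name `find_output_folder` and the sample log content are hypothetical.

```python
import io

PREFIX = "Qualification tool output: "

def find_output_folder(lines):
    """Return the path after the first PREFIX line, or None if absent."""
    for line in lines:
        if line.startswith(PREFIX):
            # Split on the first colon only, so colons in the path survive.
            return line.split(":", 1)[1].strip()
    return None

# Hypothetical log content for illustration only.
sample_log = io.StringIO(
    "INFO starting\n"
    "Qualification tool output: /tmp/qual:run\n"
)
print(find_output_folder(sample_log))
```

Returning `None` for a missing line leaves the decision about how to fail to the caller, as the cell above does with its `ValueError`.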

## Summary Output

In [0]:
summary_output=pd.read_csv(os.path.join(output_folder, "rapids_4_spark_qualification_output.csv"))
display(summary_output)

App Name,App ID,Recommendation,Estimated GPU Speedup,Estimated GPU Duration,Estimated GPU Time Saved,SQL DF Duration,SQL Dataframe Task Duration,App Duration,GPU Opportunity,Executor CPU Time Percent,SQL Ids with Failures,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Nested Complex Types,Potential Problems,Longest SQL Duration,NONSQL Task Duration Plus Overhead,Unsupported Task Duration,Supported SQL DF Task Duration,Task Speedup Factor,App Duration Estimated,Unsupported Execs,Unsupported Expressions,Estimated Job Frequency (monthly)
TPC-DS Like Bench q1,app-20220209224147-0004,Recommended,1.4,22476.58,9130.41,13417,7598550,31607,13417,37.09,,,,,,,13412,3012015,0,7598550,3.13,False,Execute CreateViewCommand,,30
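
In the sample row above, the duration columns appear to be related in a simple way: `Estimated GPU Duration` plus `Estimated GPU Time Saved` is close to `App Duration`, and `App Duration` divided by `Estimated GPU Duration` is close to `Estimated GPU Speedup`. This is an observation about this one row, not a documented formula of the tool. The check below uses only values copied from that row.

```python
# Values copied from the sample summary row above.
app_duration = 31607
est_gpu_duration = 22476.58
est_gpu_time_saved = 9130.41
est_gpu_speedup = 1.4

# Observed relations (assumptions, not documented formulas of the tool).
reconstructed_duration = est_gpu_duration + est_gpu_time_saved
implied_speedup = app_duration / est_gpu_duration

print(round(reconstructed_duration, 2))
print(round(implied_speedup, 1))
```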


## Stages Output

In [0]:
stages_output=pd.read_csv(os.path.join(output_folder, "rapids_4_spark_qualification_output_stages.csv"))
display(stages_output)

App ID,Stage ID,Average Speedup Factor,Stage Task Duration,Unsupported Task Duration,Stage Estimated,Number of transitions from or to GPU
app-20220209224147-0004,37,4.3,595309,0,False,0
app-20220209224147-0004,32,2.67,969,0,False,0
app-20220209224147-0004,36,3.02,1631076,0,False,0
app-20220209224147-0004,39,4.88,379029,0,False,0
app-20220209224147-0004,33,2.69,1141311,0,False,0
app-20220209224147-0004,40,2.45,218,0,False,0
app-20220209224147-0004,31,2.67,914,0,False,0
app-20220209224147-0004,34,2.69,2625437,0,False,0
app-20220209224147-0004,35,2.93,1166672,0,False,0
app-20220209224147-0004,38,2.99,57615,0,False,0


## Execs Output

In [0]:
execs_output=pd.read_csv(os.path.join(output_folder, "rapids_4_spark_qualification_output_execs.csv"))
display(execs_output)

App ID,SQL ID,Exec Name,Expression Name,Task Speedup Factor,Exec Duration,SQL Node Id,Exec Is Supported,Exec Stages,Exec Children,Exec Children Node Ids,Exec Should Remove
app-20220209224147-0004,8,Execute CreateViewCommand,,1.0,0,0,False,,,,False
app-20220209224147-0004,24,WholeStageCodegen (7),WholeStageCodegen (7),3.23,858100,46,True,35,HashAggregate:HashAggregate,47:48,False
app-20220209224147-0004,24,TakeOrderedAndProject,,2.45,0,1,True,39:40,,,False
app-20220209224147-0004,24,WholeStageCodegen (9),WholeStageCodegen (9),3.1,510,59,True,31,Project:Filter:ColumnarToRow,60:61:62,False
app-20220209224147-0004,24,Execute InsertIntoHadoopFsRelationCommand parquet,,2.45,0,0,True,,,,False
app-20220209224147-0004,24,SortMergeJoin,,20.57,0,12,True,37,,,False
app-20220209224147-0004,5,Execute CreateViewCommand,,1.0,0,0,False,,,,False
app-20220209224147-0004,24,Filter,,3.75,0,38,True,32,,,False
app-20220209224147-0004,24,Scan parquet,,2.45,19429,70,True,38,,,False
app-20220209224147-0004,24,Exchange,,2.78,2423,45,True,35:37,,,False
