# Welcome to the Qualification Tool for the RAPIDS Accelerator for Apache Spark
To run the tool, enter the DBFS location of your Spark CPU event logs in the `log_path` widget and the directory where the tool should write its reports in the `output_path` widget. Then select "Run all" to execute the notebook. When the notebook completes, the output tables appear below. More options for running the Qualification tool can be found here: https://nvidia.github.io/spark-rapids/docs/spark-qualification-tool.html#qualification-tool-options.

## Summary Output
The report represents the entire app execution, including unsupported operators and non-SQL operations.  By default, the applications and queries are sorted in descending order by the following fields:
- Recommendation;
- Estimated GPU Speed-up;
- Estimated GPU Time Saved; and
- End Time.
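
The default ordering can be approximated on the summary CSV with pandas. This is a minimal sketch, not the tool's own implementation: the ranking of labels in `RECOMMENDATION_ORDER` is an assumption (only "Strongly Recommended" and "Recommended" appear in the sample output in this notebook), the helper name `sort_like_report` is hypothetical, and End Time is left out because the summary CSV shown in this notebook has no such column.

```python
import pandas as pd

# Assumed ranking of recommendation labels, best first.
RECOMMENDATION_ORDER = [
    "Strongly Recommended",
    "Recommended",
    "Not Recommended",
    "Not Applicable",
]


def sort_like_report(summary: pd.DataFrame) -> pd.DataFrame:
    """Order rows by recommendation rank, then speed-up, then time saved."""
    rank = summary["Recommendation"].map(
        {label: position for position, label in enumerate(RECOMMENDATION_ORDER)}
    )
    return (
        summary.assign(_rank=rank)  # unknown labels get NaN and sort last
        .sort_values(
            by=["_rank", "Estimated GPU Speedup", "Estimated GPU Time Saved"],
            ascending=[True, False, False],
        )
        .drop(columns="_rank")
        .reset_index(drop=True)
    )


# Four rows copied from the sample summary output, deliberately shuffled.
sample = pd.DataFrame(
    {
        "App Name": [
            "TPC-DS Like Bench q72",
            "TPC-DS Like Bench q4",
            "TPC-DS Like Bench q14a",
            "TPC-DS Like Bench q14b",
        ],
        "Recommendation": [
            "Recommended",
            "Strongly Recommended",
            "Strongly Recommended",
            "Strongly Recommended",
        ],
        "Estimated GPU Speedup": [1.97, 2.66, 2.66, 2.7],
        "Estimated GPU Time Saved": [30054.11, 69720.96, 86381.03, 83400.46],
    }
)
sorted_sample = sort_like_report(sample)
print(sorted_sample["App Name"].tolist())
```

Mapping the labels to an explicit rank avoids relying on alphabetical order, which would break if the label set changes.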

## Stages Output
For each stage used in SQL operations, the Qualification tool generates the following information:
1. App ID
1. Stage ID
1. Average Speedup Factor: the average estimated speed-up of all the operators in the given stage.
1. Stage Task Duration: amount of time spent in tasks of SQL DataFrame operations for the given stage.
1. Unsupported Task Duration: sum of task durations for the unsupported operators. For more details, see Supported Operators.
1. Stage Estimated: True or False, indicating whether the tool had to estimate the stage duration.
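
One way to read these columns together is to compute, per application, the share of stage task time spent in unsupported operators. The sketch below is illustrative only: the helper name `unsupported_share_by_app` and the derived `Unsupported Share` column are hypothetical, and the last input row is made-up data added so the result is not all zeros.

```python
import pandas as pd


def unsupported_share_by_app(stages: pd.DataFrame) -> pd.DataFrame:
    """Sum stage durations per app and derive the unsupported fraction."""
    totals = stages.groupby("App ID")[
        ["Stage Task Duration", "Unsupported Task Duration"]
    ].sum()
    # Replace zero totals with NaN so the division cannot divide by zero.
    denominator = totals["Stage Task Duration"].where(
        totals["Stage Task Duration"] > 0
    )
    totals["Unsupported Share"] = totals["Unsupported Task Duration"] / denominator
    return totals


stages = pd.DataFrame(
    {
        # First two rows come from the sample stages output in this notebook;
        # the third row is hypothetical.
        "App ID": [
            "app-20220209224509-0008",
            "app-20220209224509-0008",
            "app-hypothetical-0001",
        ],
        "Stage ID": [31, 33, 1],
        "Stage Task Duration": [1184, 1086, 1000],
        "Unsupported Task Duration": [0, 0, 250],
    }
)
shares = unsupported_share_by_app(stages)
print(shares)
```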

## Execs Output
The Qualification tool generates a report of the “Exec” in the “SparkPlan” or “Executor Nodes” along with the estimated acceleration on the GPU. Please refer to the Supported Operators guide for more details on limitations on UDFs and unsupported operators.
1. App ID
1. SQL ID
1. Exec Name: for example, Filter or HashAggregate
1. Expression Name
1. Task Speedup Factor: the average acceleration of the operators, computed as the original CPU duration of the operator divided by its estimated GPU duration. The tool uses historical queries and benchmarks to estimate, for each individual operator, how much it would accelerate on the GPU.
1. Exec Duration: wall-clock time measured from when the operator starts until it completes.
1. SQL Node Id
1. Exec Is Supported: whether the Exec is supported by RAPIDS or not. Please refer to the Supported Operators section.
1. Exec Stages: an array of stage IDs
1. Exec Children
1. Exec Children Node Ids
1. Exec Should Remove: whether the Op is removed from the migrated plan.
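
In the sample output shown in this notebook, the array-valued columns (Exec Stages, Exec Children, Exec Children Node Ids) are written as colon-separated strings such as `36:39`. The following sketch turns them into Python lists; the helper name `split_colon_list` is hypothetical, and the colon delimiter is inferred from the sample rows rather than taken from the tool's documentation. Reading with `dtype=str` keeps single-value cells such as `37` from being parsed as floats.

```python
import io

import pandas as pd


def split_colon_list(value, cast=str):
    """Split a colon-separated cell into a list; empty cells become []."""
    if pd.isna(value) or value == "":
        return []
    return [cast(part) for part in str(value).split(":")]


# A few rows and columns copied from the sample execs output.
SAMPLE = """App ID,SQL ID,Exec Name,Exec Stages,Exec Children,Exec Children Node Ids
app-20220209224509-0008,24,Exchange,36:39,,
app-20220209224509-0008,24,WholeStageCodegen (8),37,Project:Filter:ColumnarToRow,66:67:68
app-20220209224509-0008,24,Union,,,
"""

execs = pd.read_csv(io.StringIO(SAMPLE), dtype=str)
execs["Exec Stages"] = execs["Exec Stages"].map(lambda v: split_colon_list(v, int))
execs["Exec Children"] = execs["Exec Children"].map(split_colon_list)
execs["Exec Children Node Ids"] = execs["Exec Children Node Ids"].map(
    lambda v: split_colon_list(v, int)
)
print(execs[["Exec Name", "Exec Stages", "Exec Children"]])
```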

In [0]:
import json
import requests
import base64
import shlex
import subprocess
import pandas as pd

def download(url, local_path):
    """Download url to local_path, raising an error on a failed HTTP request."""
    response = requests.get(url, timeout=300)
    response.raise_for_status()
    with open(local_path, "wb") as f:
        f.write(response.content)

# Download Spark RAPIDS tool jar
TOOL_JAR_URL = 'https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/23.04.0/rapids-4-spark-tools_2.12-23.04.0.jar'
TOOL_JAR_LOCAL_PATH = '/tmp/rapids-4-spark-tools.jar'
download(TOOL_JAR_URL, TOOL_JAR_LOCAL_PATH)

# Download S3 jars
HADOOP_AWS_URL = 'https://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar'
HADOOP_AWS_LOCAL_PATH = '/tmp/hadoop-aws-2.7.4.jar'
download(HADOOP_AWS_URL, HADOOP_AWS_LOCAL_PATH)

AWS_JAVA_URL = 'https://repo.maven.apache.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar'
AWS_JAVA_LOCAL_PATH = '/tmp/aws-java-sdk-1.7.4.jar'
download(AWS_JAVA_URL, AWS_JAVA_LOCAL_PATH)



In [0]:
dbutils.widgets.text("log_path", "")
eventlog_string=dbutils.widgets.get("log_path")

dbutils.widgets.text("output_path", "")
outputpath_string=dbutils.widgets.get("output_path")

In [0]:
# Run the Qualification tool; stdout and stderr are captured in /tmp/qual_debug.log
!java -Xmx10g -cp /tmp/rapids-4-spark-tools.jar:/tmp/hadoop-aws-2.7.4.jar:/tmp/aws-java-sdk-1.7.4.jar:/databricks/jars/* com.nvidia.spark.rapids.tool.qualification.QualificationMain -o $outputpath_string $eventlog_string > /tmp/qual_debug.log 2>&1
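
Because the tool's console output is redirected to `/tmp/qual_debug.log`, a failed run can go unnoticed until a later cell cannot find its CSV. The sketch below checks for the three report files read later in this notebook and returns the last lines of the debug log. The helper names `missing_reports` and `tail` are hypothetical, and the check assumes the output path is reachable as a local filesystem path, as the `pd.read_csv` calls in this notebook imply.

```python
from pathlib import Path

# Directory and file names as used by the read_csv cells in this notebook.
REPORT_DIR = "rapids_4_spark_qualification_output"
REPORT_FILES = [
    "rapids_4_spark_qualification_output.csv",
    "rapids_4_spark_qualification_output_stages.csv",
    "rapids_4_spark_qualification_output_execs.csv",
]


def missing_reports(output_path):
    """Return the expected report files that do not exist under output_path."""
    base = Path(output_path) / REPORT_DIR
    return [name for name in REPORT_FILES if not (base / name).is_file()]


def tail(path, max_lines=20):
    """Return the last max_lines lines of a text file, or [] if it is absent."""
    log = Path(path)
    if not log.is_file():
        return []
    return log.read_text(errors="replace").splitlines()[-max_lines:]
```

In the notebook, this could be used as `missing_reports(outputpath_string)` followed by `print("\n".join(tail("/tmp/qual_debug.log")))` when the returned list is not empty.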

## Summary Output

In [0]:
summary_output=pd.read_csv(outputpath_string + "/rapids_4_spark_qualification_output/rapids_4_spark_qualification_output.csv")
display(summary_output)

App Name,App ID,Recommendation,Estimated GPU Speedup,Estimated GPU Duration,Estimated GPU Time Saved,SQL DF Duration,SQL Dataframe Task Duration,App Duration,GPU Opportunity,Executor CPU Time Percent,SQL Ids with Failures,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Nested Complex Types,Potential Problems,Longest SQL Duration,NONSQL Task Duration Plus Overhead,Unsupported Task Duration,Supported SQL DF Task Duration,Task Speedup Factor,App Duration Estimated,Unsupported Execs,Unsupported Expressions
TPC-DS Like Bench q14b,app-20220209225233-0018,Strongly Recommended,2.7,48971.53,83400.46,112708,27066964,132372,109793,65.66,,,,,,,112701,2939275,700037,26366927,4.16,False,HashAggregate;Subquery;Execute CreateViewCommand;Filter;ReusedExchange;ColumnarToRow,decimal;DecimalType
TPC-DS Like Bench q14a,app-20220209225013-0017,Strongly Recommended,2.66,51728.96,86381.03,117178,28746000,138110,112636,64.69,,,,,,,117172,2978987,1114076,27631924,4.29,False,HashAggregate;Subquery;Execute CreateViewCommand;Filter;ReusedExchange;ColumnarToRow,decimal;DecimalType
TPC-DS Like Bench q4,app-20220209224316-0007,Strongly Recommended,2.66,41856.03,69720.96,90167,85324700,111577,90167,65.09,,,,,,,90163,3086516,0,85324700,4.41,False,Execute CreateViewCommand;ReusedExchange;ColumnarToRow,
TPC-DS Like Bench q24b,app-20220209230348-0030,Recommended,2.34,32581.07,43720.92,56748,27574691,76302,56694,60.89,,,,,,,56741,2950634,25998,27548693,4.37,False,AdaptiveSparkPlan;Subquery;Execute CreateViewCommand;Filter;ColumnarToRow,decimal
TPC-DS Like Bench q67,app-20220209232913-0074,Recommended,2.27,144720.72,184416.27,308185,15437192,329137,265300,73.64,,,,,,,308174,3028469,2148108,13289084,3.28,False,Execute CreateViewCommand;ReusedExchange;HashAggregate;ColumnarToRow,DecimalType
TPC-DS Like Bench q24a,app-20220209230232-0029,Recommended,2.25,33029.22,41483.77,53469,27911960,74513,53202,60.14,,,,,,,53455,3305140,139193,27772767,4.54,False,AdaptiveSparkPlan;Subquery;Execute CreateViewCommand;Filter;ColumnarToRow,decimal
TPC-DS Like Bench q93,app-20220209235239-0100,Recommended,2.18,31203.42,36894.57,50166,29213056,68098,50166,64.73,,,,,,,50161,2973035,0,29213056,3.78,False,Execute CreateViewCommand;AdaptiveSparkPlan;ColumnarToRow,
TPC-DS Like Bench q23a,app-20220209225917-0027,Recommended,2.17,43196.97,50617.02,74068,45913641,93814,66840,64.81,,,,,,,74064,2975228,4480262,41433379,4.12,False,HashAggregate;Subquery;Execute CreateViewCommand;Filter;ReusedExchange;ColumnarToRow,decimal;DecimalType
TPC-DS Like Bench q23b,app-20220209230053-0028,Recommended,2.08,46588.14,50761.85,77422,59799857,97350,66526,66.43,,,,,,,77417,3011951,8415627,51384230,4.22,False,HashAggregate;Subquery;Execute CreateViewCommand;Filter;ReusedExchange;ColumnarToRow,DecimalType;decimal
TPC-DS Like Bench q72,app-20220209233659-0079,Recommended,1.97,30920.88,30054.11,41703,25434438,60975,41703,74.77,,,,,,,41694,2971887,0,25434438,3.58,False,Execute CreateViewCommand;ReusedExchange;ColumnarToRow,
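
In the sample rows above, the estimate columns appear to be related: Estimated GPU Duration matches App Duration minus Estimated GPU Time Saved, and Estimated GPU Speedup matches App Duration divided by Estimated GPU Duration. This relationship is inferred from the sample values, not from the tool's documentation, so treat the sketch below as a sanity check on this data rather than a definition. The variable names `duration_gap` and `speedup_gap` are hypothetical.

```python
import pandas as pd

# Three rows copied from the sample summary output above.
sample = pd.DataFrame(
    {
        "App Name": [
            "TPC-DS Like Bench q14b",
            "TPC-DS Like Bench q14a",
            "TPC-DS Like Bench q72",
        ],
        "Estimated GPU Speedup": [2.7, 2.66, 1.97],
        "Estimated GPU Duration": [48971.53, 51728.96, 30920.88],
        "Estimated GPU Time Saved": [83400.46, 86381.03, 30054.11],
        "App Duration": [132372, 138110, 60975],
    }
)

# Difference between (App Duration - Time Saved) and the reported GPU duration.
duration_gap = (
    sample["App Duration"]
    - sample["Estimated GPU Time Saved"]
    - sample["Estimated GPU Duration"]
).abs()

# Difference between (App Duration / GPU Duration) and the reported speed-up.
speedup_gap = (
    sample["App Duration"] / sample["Estimated GPU Duration"]
    - sample["Estimated GPU Speedup"]
).abs()

print(pd.DataFrame({"duration_gap": duration_gap, "speedup_gap": speedup_gap}))
```

For these rows both gaps stay within the two-decimal precision of the report.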


## Stages Output

In [0]:
stages_output=pd.read_csv(outputpath_string + "/rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_stages.csv")
display(stages_output)

App ID,Stage ID,Average Speedup Factor,Stage Task Duration,Unsupported Task Duration,Stage Estimated
app-20220209224509-0008,31,2.95,1184,0,False
app-20220209224509-0008,33,2.9,1086,0,False
app-20220209224509-0008,35,3.37,2321523,0,False
app-20220209224509-0008,41,3.9,172648,0,False
app-20220209224509-0008,39,5.72,4586824,0,False
app-20220209224509-0008,38,3.36,694337,0,False
app-20220209224509-0008,40,4.26,656835,0,False
app-20220209224509-0008,37,3.36,524190,0,False
app-20220209224509-0008,32,2.9,641,0,False
app-20220209224509-0008,36,2.73,7065,0,False


## Execs Output

In [0]:
execs_output=pd.read_csv(outputpath_string + "/rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_execs.csv")
display(execs_output)

App ID,SQL ID,Exec Name,Expression Name,Task Speedup Factor,Exec Duration,SQL Node Id,Exec Is Supported,Exec Stages,Exec Children,Exec Children Node Ids,Exec Should Remove
app-20220209224509-0008,24,ColumnarToRow,,1.0,0,30,False,31,,,True
app-20220209224509-0008,24,ColumnarToRow,,1.0,0,89,False,39,,,True
app-20220209224509-0008,24,WholeStageCodegen (15),WholeStageCodegen (15),8.0,4234050,95,True,39,Sort,96,False
app-20220209224509-0008,24,Exchange,,4.2,584,97,True,36:39,,,False
app-20220209224509-0008,0,Execute CreateViewCommand,,1.0,0,0,False,,,,False
app-20220209224509-0008,24,WholeStageCodegen (8),WholeStageCodegen (8),2.9,23059,65,True,37,Project:Filter:ColumnarToRow,66:67:68,False
app-20220209224509-0008,24,Union,,3.0,0,85,True,,,,False
app-20220209224509-0008,24,WholeStageCodegen (22),WholeStageCodegen (22),4.5,18839,76,True,40,HashAggregate,77,False
app-20220209224509-0008,24,Project,,3.0,0,60,True,,,,False
app-20220209224509-0008,24,Exchange,,4.2,80029,104,True,35:39,,,False
