# Welcome to the Spark RAPIDS NDS Demo Lab
This notebook walks you through running [NDS queries](https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds) on Spark using CPU resources only.


## Start the Spark history server

To begin, launch the Spark history server, which serves the Spark UI for completed applications. Inspecting completed runs there will help you understand how Spark plans queries.

In [1]:
%%bash
/opt/spark/sbin/start-history-server.sh

starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/spark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-2813d72834c0.out


## Access the Spark history server

In [2]:
%%js
// Build the history server URL from the host serving this notebook (default port 18080).
var hostname = window.location.hostname;
var href = "http://" + hostname + ":18080";
element.innerHTML = '<a style="color:blue;" target="_blank" href="' + href + '">Open Spark History Server</a>';

<IPython.core.display.Javascript object>
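
The history server can also be queried without a browser. The sketch below lists completed applications through Spark's monitoring REST API (`/api/v1/applications`). It assumes the notebook kernel runs on the same host as the history server, so that `localhost` and the default port 18080 reach it. The helper names `applications_url`, `completed_apps`, and `fetch_apps` are illustrative and not part of the lab.

```python
import json
from urllib.request import urlopen


def applications_url(host="localhost", port=18080):
    """URL of the history server's application list (host and port are assumptions)."""
    return f"http://{host}:{port}/api/v1/applications"


def completed_apps(apps):
    """Return (id, name) pairs for applications whose attempts have all completed."""
    result = []
    for app in apps:
        attempts = app.get("attempts", [])
        if attempts and all(attempt.get("completed") for attempt in attempts):
            result.append((app["id"], app["name"]))
    return result


def fetch_apps(host="localhost", port=18080, timeout=5):
    """Fetch and decode the application list; needs a running history server."""
    with urlopen(applications_url(host, port), timeout=timeout) as response:
        return json.load(response)
```

Calling `completed_apps(fetch_apps())` after a run finishes should include the application you just submitted, provided its event log is visible to the history server.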

## View NDS query
View the SQL text of the query that will run as a Spark application using only CPU resources.

In [3]:
query="4"

In [4]:
%%script env query="$query" bash
cat ./query_files/query$query.sql

-- start query 1 in stream 0 using template query4.tpl
with year_total as (
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_preferred_cust_flag
       ,c_birth_country customer_birth_country
       ,c_login customer_login
       ,c_email_address customer_email_address
       ,d_year dyear
       ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2) year_total
       ,'s' sale_type
 from customer
     ,store_sales
     ,date_dim
 where c_customer_sk = ss_customer_sk
   and ss_sold_date_sk = d_date_sk
 group by c_customer_id
         ,c_first_name
         ,c_last_name
         ,c_preferred_cust_flag
         ,c_birth_country
         ,c_login
         ,c_email_address
         ,d_year
 union all
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_last_name
       ,c_preferred_cust_flag customer_p

## Run NDS query (CPU)
The SQL query runs as a Spark job on your instance. It reads Parquet data from S3 at scale factor 100 (nominally 100 GB). When the run completes, it reports the total duration, broken down into the setup time for each table and the execution time of the query itself.
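
The run prints one `Time taken` line per table and one for the query, as in the output below. Here is a minimal sketch for separating setup time from query time, assuming you have captured that output as text. `split_timings` is a hypothetical helper, and the sample lines are copied from this notebook's output.

```python
import re

# Matches both "Time taken: 5321 millis for table customer_address"
# and "Time taken: [347214] millis for query4".
TIME_TAKEN = re.compile(r"Time taken: \[?(\d+)\]? millis for (table )?(\S+)")


def split_timings(log_text):
    """Return (table_ms, query_ms) dicts mapping a name to its time in milliseconds."""
    table_ms = {}
    query_ms = {}
    for line in log_text.splitlines():
        match = TIME_TAKEN.search(line)
        if match is None:
            continue  # skip lines that are not timing reports
        millis = int(match.group(1))
        name = match.group(3)
        if match.group(2):  # the "table " prefix marks a setup step
            table_ms[name] = millis
        else:
            query_ms[name] = millis
    return table_ms, query_ms


sample = """\
Time taken: 5321 millis for table customer_address
Time taken: 1052 millis for table date_dim
TaskFailureListener is registered.
Time taken: [347214] millis for query4
"""
table_ms, query_ms = split_timings(sample)
print(sum(table_ms.values()), query_ms)  # prints: 6373 {'query4': 347214}
```

Summing `table_ms` gives the setup cost, which is paid once per run, while `query_ms` holds the figure to compare across runs of the same query.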

In [5]:
%%script env query="$query" bash
cd ./spark-rapids-benchmarks/nds
./spark-submit-template power_run_cpu.template \
    nds_power.py \
    s3a://dli-public-datasets-us-west-2/production/x-ds-04-v1/parquet_sf100 \
    ../../query_files/query$query.sql \
    cpu-time-$query.csv 

+ source power_run_cpu.template
++ source base.template
+++ export SPARK_HOME=/opt/spark
+++ SPARK_HOME=/opt/spark
+++ export SPARK_MASTER=yarn
+++ SPARK_MASTER=yarn
+++ export DRIVER_MEMORY=10G
+++ DRIVER_MEMORY=10G
+++ export EXECUTOR_CORES=12
+++ EXECUTOR_CORES=12
+++ export NUM_EXECUTORS=8
+++ NUM_EXECUTORS=8
+++ export EXECUTOR_MEMORY=16G
+++ EXECUTOR_MEMORY=16G
+++ export NDS_LISTENER_JAR=./jvm_listener/target/nds-benchmark-listener-1.0-SNAPSHOT.jar
+++ NDS_LISTENER_JAR=./jvm_listener/target/nds-benchmark-listener-1.0-SNAPSHOT.jar
+++ export SPARK_RAPIDS_PLUGIN_JAR=rapids-4-spark_2.12-22.06.0.jar
+++ SPARK_RAPIDS_PLUGIN_JAR=rapids-4-spark_2.12-22.06.0.jar
++++ echo /opt/spark/python/lib/py4j-0.10.9.5-src.zip
+++ export PYTHONPATH=/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.5-src.zip
+++ PYTHONPATH=/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.5-src.zip
++ export SHUFFLE_PARTITIONS=200
++ SHUFFLE_PARTITIONS=200
++ SPARK_CONF=("--conf" "spark.driver.memory=${DRIVER_ME

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9b59ea31-20f8-4678-8f92-b1ecde6e273f;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.2.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.180 in central
downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar ...
	[SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.2.2!hadoop-aws.jar (20ms)
downloading https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.180/aws-java-sdk-bundle-1.12.180.jar ...
	[SUCCESSFUL ] com.amazonaws#aws-java-sdk-bundle;1.12.180!aws-java-sdk-bundle.jar (1621ms)
:: resolution report :: resolve 1334ms :: artifacts dl 1643ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.12.180 from central in [default]
	org.apache.had

Time taken: 5321 millis for table customer_address
Time taken: 1106 millis for table customer_demographics


24/09/20 05:40:37 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Time taken: 1052 millis for table date_dim
Time taken: 1000 millis for table warehouse
Time taken: 987 millis for table ship_mode
Time taken: 1062 millis for table time_dim
Time taken: 997 millis for table reason
Time taken: 1005 millis for table income_band
Time taken: 1274 millis for table item
Time taken: 1010 millis for table store
Time taken: 987 millis for table call_center
Time taken: 1115 millis for table customer
Time taken: 997 millis for table web_site
Time taken: 5199 millis for table inventory
Time taken: 32239 millis for table catalog_returns
Time taken: 33582 millis for table web_returns
Time taken: 28333 millis for table web_sales
Time taken: 29274 millis for table catalog_sales
Time taken: 28399 millis for table store_sales
TaskFailureListener is registered.
Time taken: [347214] millis for query4
['application_id', 'query', 'time/milliseconds']
('local-1726810823666', 'CreateTempView customer_address', 5321)
('local-1726810823666', 'CreateTempView customer_demographics