refine cuspatial demo to make it clearer for customers (#187)
* update cuda pub key to avoid GPG error

Signed-off-by: liyuan <yuali@nvidia.com>

* address review comments

Signed-off-by: liyuan <yuali@nvidia.com>

* fit img size

Signed-off-by: liyuan <yuali@nvidia.com>

* fix some typos

Signed-off-by: liyuan <yuali@nvidia.com>

* try resize img

Signed-off-by: liyuan <yuali@nvidia.com>

* fix docker build error

Signed-off-by: liyuan <yuali@nvidia.com>

* update cuspatial version in build-in-local step

Signed-off-by: liyuan <yuali@nvidia.com>

* add cpu notebook

Signed-off-by: liyuan <yuali@nvidia.com>

* Update examples/UDF-Examples/Spark-cuSpatial/README.md

Co-authored-by: Liangcai Li <firestarmanllc@gmail.com>

* Update the path to make it consistent

Signed-off-by: liyuan <yuali@nvidia.com>

* Update examples/UDF-Examples/Spark-cuSpatial/README.md

Co-authored-by: Hao Zhu <9665750+viadea@users.noreply.github.com>

* remove Run CPU Demo step 4

Signed-off-by: liyuan <yuali@nvidia.com>

* break the long SQL into multiple lines

Signed-off-by: liyuan <yuali@nvidia.com>

* try to work around broken link error

Signed-off-by: liyuan <yuali@nvidia.com>

* revert markdown link checker conf

Signed-off-by: liyuan <yuali@nvidia.com>

Co-authored-by: Liangcai Li <firestarmanllc@gmail.com>
Co-authored-by: Hao Zhu <9665750+viadea@users.noreply.github.com>
3 people committed Jul 19, 2022
1 parent 5346c03 commit 8c4809c
Showing 10 changed files with 793 additions and 42 deletions.
Binary file added docs/img/guides/cuspatial/Nyct2000.png
Binary file added docs/img/guides/cuspatial/install-jar.png
Binary file added docs/img/guides/cuspatial/sample-polygon.png
Binary file added docs/img/guides/cuspatial/taxi-zones.png
4 changes: 2 additions & 2 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile
@@ -39,11 +39,11 @@ RUN conda --version
RUN conda install -c conda-forge openjdk=8 maven=3.8.1 -y

# install cuDF dependency.
RUN conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y
RUN conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y

RUN wget --quiet \
https://github.com/Kitware/CMake/releases/download/v3.21.3/cmake-3.21.3-linux-x86_64.tar.gz \
&& tar -xzf cmake-3.21.3-linux-x86_64.tar.gz \
&& rm -rf cmake-3.21.3-linux-x86_64.tar.gz

ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
3 changes: 3 additions & 0 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile.awsdb
@@ -18,6 +18,9 @@ FROM nvidia/cuda:11.2.2-devel-ubuntu18.04

ENV DEBIAN_FRONTEND=noninteractive

# update cuda pub key to avoid GPG error
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# See https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile
RUN apt-get update && \
apt-get install --yes --no-install-recommends \
135 changes: 96 additions & 39 deletions examples/UDF-Examples/Spark-cuSpatial/README.md
@@ -5,93 +5,117 @@
It implements a [RapidsUDF](https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html) interface to call the cuSpatial functions through JNI, and it can run at scale on a distributed Spark cluster.
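For context, here is a minimal PySpark sketch of how a JNI-backed UDF of this kind could be registered and invoked from Spark SQL; the class name, path, and column names are illustrative assumptions, not the demo's exact API:

```Python
# Hypothetical sketch: register a Java/Scala RapidsUDF class (class name is
# assumed) and call it from Spark SQL on a table of pickup points.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cuspatial-udf-sketch").getOrCreate()

# registerJavaFunction exposes a JVM UDF class to Spark SQL under a given name.
spark.udf.registerJavaFunction(
    "point_in_polygon",                            # SQL-visible function name
    "com.nvidia.spark.rapids.udf.PointInPolygon",  # assumed fully-qualified class
)
spark.read.parquet("/data/cuspatial_data/points").createOrReplaceTempView("points")
result = spark.sql("SELECT x, y, point_in_polygon(x, y) AS zones FROM points")
result.show()
```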

## Performance
We got the end-2-end time as below table when running with 2009 NYC Taxi trip pickup location,
which includes 168,898,952 points, and 3 sets of polygons(taxi_zone, nyct2000, nycd).
The data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
and [NYC Open data](https://www1.nyc.gov/site/planning/data-maps/open-data.page#district_political).
| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd (71 Complex Polygons)|
We got the end-to-end hot-run times shown in the table below when running with the 2009 NYC Taxi trip pickup locations,
which include 170,896,055 points, and 3 sets of polygons (taxi_zone, nyct2000, nycd Community Districts).
The point data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
The polygon data can be downloaded from the [taxi_zone dataset](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc),
[nyct2000 dataset](https://data.cityofnewyork.us/City-Government/2000-Census-Tracts/ysjj-vb9j) and
[nycd Community Districts dataset](https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4).

| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd Community Districts (71 Complex Polygons) |
| ----------- | :---------: | :---------: | :---------: |
| 4-core CPU | 1122.9 seconds | 5525.4 seconds| 6642.7 seconds |
| 1 GPU(Titan V) on local | 4.5 seconds | 5.7 seconds | 6.6 seconds|
| 2 GPU(T4) on Databricks | 9.1 seconds | 10.0 seconds | 12.1 seconds |
| 4-core CPU | 3.9 minutes | 4.0 minutes | 4.1 minutes |
| 1 GPU (T4) on Databricks | 25 seconds | 27 seconds | 28 seconds |
| 2 GPU (T4) on Databricks | 15 seconds | 14 seconds | 17 seconds |
| 4 GPU (T4) on Databricks | 11 seconds | 11 seconds | 12 seconds |

Note: Please update the `x,y` column names to `Start_Lon,Start_Lat` in
the [notebook](./notebooks/cuspatial_sample_db.ipynb) if you test with the downloaded points.
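Equivalently, you can rename the data columns instead of editing the notebook; a minimal PySpark sketch, assuming an active `spark` session and an illustrative path:

```Python
# The downloaded 2009 TLC data names the pickup columns Start_Lon/Start_Lat,
# while the sample notebook expects x/y; renaming the data is equivalent to
# editing the notebook. The path below is illustrative.
df = spark.read.parquet("/data/cuspatial_data/points")
df = (df.withColumnRenamed("Start_Lon", "x")
        .withColumnRenamed("Start_Lat", "y"))
df.printSchema()
```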

taxi-zones map:

<img src="../../../docs/img/guides/cuspatial/taxi-zones.png" width="600">

nyct2000 map:

<img src="../../../docs/img/guides/cuspatial/Nyct2000.png" width="600">

nycd Community Districts map:

<img src="../../../docs/img/guides/cuspatial/Nycd-Community-Districts.png" width="600">

## Build
You can build the jar file [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile)
or you can build it [in local](#build-in-local) machine after some prerequisites.
First, build the UDF JAR from source before running this demo.
You can build the JAR [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile),
or [in a local machine](#build-in-local-machine) after installing the prerequisites.

### Build in Docker
1. Build the Docker image from the [Dockerfile](Dockerfile), then run the container.
```Bash
docker build -f Dockerfile . -t build-spark-cuspatial
docker run -it build-spark-cuspatial bash
```
2. Get the code, then run "mvn package".
2. Get the code, then run `mvn package`.
```Bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/UDF-Examples/Spark-cuSpatial/
mvn package
```
3. You'll get the jar named like "spark-cuspatial-<version>.jar" in the target folder.
3. You'll get the jar named `spark-cuspatial-<version>.jar` in the target folder.

Note: The Docker environment is only for building the JAR, not for running the application.

### Build in Local:
1. essential build tools:
### Build in local machine:
1. Essential build tools:
- [cmake(>=3.20)](https://cmake.org/download/),
- [ninja(>=1.10)](https://github.com/ninja-build/ninja/releases),
- [gcc(>=9.3)](https://gcc.gnu.org/releases.html)
2. [CUDA Toolkit(>=11.0)](https://developer.nvidia.com/cuda-toolkit)
3. conda: use [miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage header files and CMake dependencies
4. [cuspatial](https://github.com/rapidsai/cuspatial): install libcuspatial
```Bash
# get libcuspatial from conda
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.04
# Install libcuspatial from conda
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06
# or the command below for the nightly (aka SNAPSHOT) version.
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.08
```
5. Get the code, then run "mvn package".
5. Build the JAR using `mvn package`.
```Bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/UDF-Examples/Spark-cuSpatial/
mvn package
```
6. You'll get "spark-cuspatial-<version>.jar" in the target folder.

6. `spark-cuspatial-<version>.jar` will be generated in the target folder.

## Run
### Run on-premises clusters: standalone
### GPU Demo on Spark Standalone on-premises cluster
1. Install the necessary libraries. Besides `cudf` and `cuspatial`, the `gdal` library that is compatible with the installed `cuspatial` may also be needed.
Install it by running the command below.
```
conda install -c conda-forge libgdal=3.3.1
```
2. Set up a [standalone Spark cluster](/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md). Make sure the conda `lib` directory is included in `LD_LIBRARY_PATH`, so that Spark executors can load `libcuspatial.so`. You can verify this with the sketch below.
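A minimal check, assuming Python runs in the same environment the executors will use:
```Python
# Minimal check that libcuspatial.so is resolvable on a worker node through
# LD_LIBRARY_PATH; run it in the environment the executors will use.
import ctypes
import os

print(os.environ.get("LD_LIBRARY_PATH", "<unset>"))
ctypes.CDLL("libcuspatial.so")  # raises OSError if the library cannot be found
print("libcuspatial.so loaded")
```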

3. Download spark-rapids jars
* [spark-rapids v22.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar) or above
4. Prepare the dataset & jars. Copy the sample dataset from [cuspatial_data](../../../datasets/cuspatial_data.tar.gz) to "/data/cuspatial_data".
Copy spark-rapids & spark-cuspatial-22.08.0-SNAPSHOT.jar to "/data/cuspatial_data/jars".
You can use your own path, but remember to update the paths in "gpu-run.sh" accordingly.
5. Run "gpu-run.sh"
3. Download the Spark RAPIDS JAR
* [Spark RAPIDS JAR v22.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar) or above
4. Prepare the sample dataset and JARs. Copy the [sample dataset](../../../datasets/cuspatial_data.tar.gz) to `/data/cuspatial_data/`.
Copy the Spark RAPIDS JAR and `spark-cuspatial-<version>.jar` to `/data/cuspatial_data/jars/`.
If you built `spark-cuspatial-<version>.jar` in Docker, copy the JAR from the container to your local machine:
```
docker cp YOUR_DOCKER_CONTAINER:/PATH/TO/spark-cuspatial-<version>.jar ./YOUR_LOCAL_PATH
```
Note: update the paths in `gpu-run.sh` accordingly.
5. Run `gpu-run.sh`
```Bash
./gpu-run.sh
```
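For orientation, below is a hedged PySpark sketch of the kind of session configuration a launcher script like `gpu-run.sh` is expected to set up; the JAR paths and versions are assumptions, so consult the script itself for the authoritative values:

```Python
# Sketch only: enable the RAPIDS Accelerator plugin and put both JARs on the
# classpath. The paths and versions below are illustrative, not taken from
# gpu-run.sh itself.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-cuspatial-demo")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config(
        "spark.jars",
        "/data/cuspatial_data/jars/rapids-4-spark_2.12-22.06.0.jar,"
        "/data/cuspatial_data/jars/spark-cuspatial-<version>.jar",
    )
    .getOrCreate()
)
```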
### Run on AWS Databricks
1. Build the customized docker image [Dockerfile.awsdb](Dockerfile.awsdb) and push to dockerhub so that it can be accessible by AWS Databricks.
### GPU Demo on AWS Databricks
1. Build a customized Docker image using [Dockerfile.awsdb](Dockerfile.awsdb) and push it to a Docker registry, such as [Docker Hub](https://hub.docker.com/), that AWS Databricks can access.
```Bash
# replace with your Docker Hub repo and tag, or any other registry AWS Databricks can access
docker build -f Dockerfile.awsdb . -t <your-dockerhub-repo>:<your-tag>
docker push <your-dockerhub-repo>:<your-tag>
```

2. Follow the [Spark-rapids get-started document](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html#start-a-databricks-cluster) to create a GPU cluster on AWS Databricks.
Something different from the document.
Below are some different steps since a custom docker image is used with Databricks:
* Databricks Runtime Version
You should choose a Standard version of the Runtime version like "Runtime: 9.1 LTS(Scala 2.12, Spark 3.1.2)" and
choose GPU instance type like "g4dn.xlarge". Note that ML runtime does not support customized docker container.
If you choose a ML version, it says "Support for Databricks container services requires runtime version 5.3+"
and the "Confirm" button is disabled.
Choose a non-ML Databricks Runtime such as `Runtime: 9.1 LTS (Scala 2.12, Spark 3.1.2)` and
a GPU AWS instance type such as `g4dn.xlarge`. Note that the ML runtime does not support customized Docker containers; it shows the message
`Support for Databricks container services requires runtime version 5.3+`
and disables the `Confirm` button.
* Use your own Docker container
Input "Docker Image URL" as "your-dockerhub-repo:your-tag"
* For the other configurations, you can follow the get-started document.
Input `Docker Image URL` as `your-dockerhub-repo:your-tag`
* Follow the Databricks get-started document for other steps.

3. Copy the sample [cuspatial_data.tar.gz](../../../datasets/cuspatial_data.tar.gz) or your own data to DBFS using the Databricks CLI.
```Bash
@@ -103,5 +127,38 @@
points
polygons
```
4. Import the Library "spark-cuspatial-22.08.0-SNAPSHOT.jar" to the Databricks, then install it to your cluster.
5. Import [cuspatial_sample.ipynb](notebooks/cuspatial_sample_db.ipynb) to your workspace in Databricks. Attach to your cluster, then run it.
The sample points and polygons are randomly generated.

Sample polygons:

<img src="../../../docs/img/guides/cuspatial/sample-polygon.png" width="600">

4. Upload `spark-cuspatial-<version>.jar` to DBFS and then install it on the Databricks cluster.

<img src="../../../docs/img/guides/cuspatial/install-jar.png" width="600">

5. Import [cuspatial_sample.ipynb](notebooks/cuspatial_sample_db.ipynb) into your Databricks workspace, attach it to your cluster, and run it.

### CPU Demo on AWS Databricks
1. Create a Databricks cluster. For example, Databricks Runtime 10.3.

2. Install the Sedona JARs and Sedona Python libraries on Databricks using the web UI.
The Sedona version should be 1.1.1-incubating or higher.
* Install the JARs below from Maven coordinates in the Libraries tab:
```Bash
org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.0-incubating
org.datasyslab:geotools-wrapper:1.1.0-25.2
```
* To enable Python support, install the Python library below from PyPI in the Libraries tab:
```Bash
apache-sedona
```
3. From your cluster configuration (Cluster -> Edit -> Configuration -> Advanced options -> Spark), activate the
Sedona functions and the Kryo serializer by adding the lines below to the Spark config:
```Bash
spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```

4. Upload the sample data files to DBFS, start the cluster, attach the [notebook](notebooks/spacial-cpu-apache-sedona_db.ipynb) to the cluster, and run it.
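For reference, here is a minimal PySpark sketch of the kind of Sedona point-in-polygon query the CPU notebook runs; the table layout, column names, and paths are assumptions, and `spark` is the session Databricks provides:

```Python
# Sketch of a Sedona point-in-polygon join; the WKT column and zone_id are
# assumed, and SedonaRegistrator is only needed if the SQL extensions were
# not already activated through the cluster's Spark config.
from sedona.register import SedonaRegistrator

SedonaRegistrator.registerAll(spark)  # registers the ST_* SQL functions

spark.read.parquet("dbfs:/data/cuspatial_data/points").createOrReplaceTempView("points")
spark.read.parquet("dbfs:/data/cuspatial_data/polygons").createOrReplaceTempView("polygons")

result = spark.sql("""
    SELECT p.x, p.y, z.zone_id
    FROM points p JOIN polygons z
    ON ST_Contains(ST_GeomFromWKT(z.wkt), ST_Point(p.x, p.y))
""")
result.show()
```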