refine cuspatial demo to make it clearer for customers (#187)
* update cuda pub key to avoid GPG error

Signed-off-by: liyuan <yuali@nvidia.com>

* address review comments

Signed-off-by: liyuan <yuali@nvidia.com>

* fit img size

Signed-off-by: liyuan <yuali@nvidia.com>

* fix some typos

Signed-off-by: liyuan <yuali@nvidia.com>

* try resize img

Signed-off-by: liyuan <yuali@nvidia.com>

* fix docker build error

Signed-off-by: liyuan <yuali@nvidia.com>

* update cuspatial version in build-in-local step

Signed-off-by: liyuan <yuali@nvidia.com>

* add cpu notebook

Signed-off-by: liyuan <yuali@nvidia.com>

* Update examples/UDF-Examples/Spark-cuSpatial/README.md

Co-authored-by: Liangcai Li <firestarmanllc@gmail.com>

* Update the path to make it consistent

Signed-off-by: liyuan <yuali@nvidia.com>

* Update examples/UDF-Examples/Spark-cuSpatial/README.md

Co-authored-by: Hao Zhu <9665750+viadea@users.noreply.github.com>

* remove Run CPU Demo step 4

Signed-off-by: liyuan <yuali@nvidia.com>

* break the long SQL into multiple lines

Signed-off-by: liyuan <yuali@nvidia.com>

* try to work around broken link error

Signed-off-by: liyuan <yuali@nvidia.com>

* revert markdown link checker conf

Signed-off-by: liyuan <yuali@nvidia.com>

Co-authored-by: Liangcai Li <firestarmanllc@gmail.com>
Co-authored-by: Hao Zhu <9665750+viadea@users.noreply.github.com>
3 people committed Jul 19, 2022
1 parent 5346c03 commit 8c4809c
Showing 10 changed files with 793 additions and 42 deletions.
Binary file added docs/img/guides/cuspatial/Nyct2000.png
Binary file added docs/img/guides/cuspatial/install-jar.png
Binary file added docs/img/guides/cuspatial/sample-polygon.png
Binary file added docs/img/guides/cuspatial/taxi-zones.png
4 changes: 2 additions & 2 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile
@@ -39,11 +39,11 @@ RUN conda --version
RUN conda install -c conda-forge openjdk=8 maven=3.8.1 -y

# install cuDF dependency.
RUN conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y
RUN conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06 python=3.8 -y

RUN wget --quiet \
https://github.com/Kitware/CMake/releases/download/v3.21.3/cmake-3.21.3-linux-x86_64.tar.gz \
&& tar -xzf cmake-3.21.3-linux-x86_64.tar.gz \
&& rm -rf cmake-3.21.3-linux-x86_64.tar.gz

ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
ENV PATH="/cmake-3.21.3-linux-x86_64/bin:${PATH}"
3 changes: 3 additions & 0 deletions examples/UDF-Examples/Spark-cuSpatial/Dockerfile.awsdb
@@ -18,6 +18,9 @@ FROM nvidia/cuda:11.2.2-devel-ubuntu18.04

ENV DEBIAN_FRONTEND=noninteractive

# update cuda pub key to avoid GPG error
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# See https://github.com/databricks/containers/blob/master/ubuntu/minimal/Dockerfile
RUN apt-get update && \
apt-get install --yes --no-install-recommends \
135 changes: 96 additions & 39 deletions examples/UDF-Examples/Spark-cuSpatial/README.md
@@ -5,93 +5,117 @@
It implements a [RapidsUDF](https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html) interface to call the cuSpatial functions through JNI, and it can run at scale on a distributed Spark cluster.
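For context, here is a minimal PySpark sketch of how a JNI-backed UDF of this kind could be registered and invoked from Spark SQL; the class name, path, and column names are illustrative assumptions, not the demo's exact API:

```Python
# Hypothetical sketch: register a Java/Scala RapidsUDF class (class name is
# assumed) and call it from Spark SQL on a table of pickup points.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cuspatial-udf-sketch").getOrCreate()

# registerJavaFunction exposes a JVM UDF class to Spark SQL under a given name.
spark.udf.registerJavaFunction(
    "point_in_polygon",                            # SQL-visible function name
    "com.nvidia.spark.rapids.udf.PointInPolygon",  # assumed fully-qualified class
)
spark.read.parquet("/data/cuspatial_data/points").createOrReplaceTempView("points")
result = spark.sql("SELECT x, y, point_in_polygon(x, y) AS zones FROM points")
result.show()
```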

## Performance
We got the end-2-end time as below table when running with 2009 NYC Taxi trip pickup location,
which includes 168,898,952 points, and 3 sets of polygons(taxi_zone, nyct2000, nycd).
The data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
and [NYC Open data](https://www1.nyc.gov/site/planning/data-maps/open-data.page#district_political).
| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd (71 Complex Polygons)|
We got the end-to-end hot-run times shown in the table below when running with the 2009 NYC Taxi trip pickup locations,
which include 170,896,055 points, and 3 sets of polygons (taxi_zone, nyct2000, nycd Community Districts).
The point data can be downloaded from [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
The polygon data can be downloaded from the [taxi_zone dataset](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc),
[nyct2000 dataset](https://data.cityofnewyork.us/City-Government/2000-Census-Tracts/ysjj-vb9j) and
[nycd Community Districts dataset](https://data.cityofnewyork.us/City-Government/Community-Districts/yfnk-k7r4).

| Environment | Taxi_zones (263 Polygons) | Nyct2000 (2216 Polygons) | Nycd Community Districts (71 Complex Polygons) |
| ----------- | :---------: | :---------: | :---------: |
| 4-core CPU | 1122.9 seconds | 5525.4 seconds| 6642.7 seconds |
| 1 GPU(Titan V) on local | 4.5 seconds | 5.7 seconds | 6.6 seconds|
| 2 GPU(T4) on Databricks | 9.1 seconds | 10.0 seconds | 12.1 seconds |
| 4-core CPU | 3.9 minutes | 4.0 minutes | 4.1 minutes |
| 1 GPU (T4) on Databricks | 25 seconds | 27 seconds | 28 seconds |
| 2 GPU (T4) on Databricks | 15 seconds | 14 seconds | 17 seconds |
| 4 GPU (T4) on Databricks | 11 seconds | 11 seconds | 12 seconds |

Note: Please update the `x,y` column names to `Start_Lon,Start_Lat` in
the [notebook](./notebooks/cuspatial_sample_db.ipynb) if you test with the downloaded points.
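Equivalently, you can rename the data columns instead of editing the notebook; a minimal PySpark sketch, assuming an active `spark` session and an illustrative path:

```Python
# The downloaded 2009 TLC data names the pickup columns Start_Lon/Start_Lat,
# while the sample notebook expects x/y; renaming the data is equivalent to
# editing the notebook. The path below is illustrative.
df = spark.read.parquet("/data/cuspatial_data/points")
df = (df.withColumnRenamed("Start_Lon", "x")
        .withColumnRenamed("Start_Lat", "y"))
df.printSchema()
```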

taxi-zones map:

<img src="../../../docs/img/guides/cuspatial/taxi-zones.png" width="600">

nyct2000 map:

<img src="../../../docs/img/guides/cuspatial/Nyct2000.png" width="600">

nycd Community Districts map:

<img src="../../../docs/img/guides/cuspatial/Nycd-Community-Districts.png" width="600">

## Build
You can build the jar file [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile)
or you can build it [in local](#build-in-local) machine after some prerequisites.
First, build the UDF JAR from source before running this demo.
You can build the JAR [in Docker](#build-in-docker) with the provided [Dockerfile](Dockerfile),
or [in a local machine](#build-in-local-machine) after installing the prerequisites.

### Build in Docker
1. Build the Docker image from the [Dockerfile](Dockerfile), then run the container.
```Bash
docker build -f Dockerfile . -t build-spark-cuspatial
docker run -it build-spark-cuspatial bash
```
2. Get the code, then run "mvn package".
2. Get the code, then run `mvn package`.
```Bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/UDF-Examples/Spark-cuSpatial/
mvn package
```
3. You'll get the jar named like "spark-cuspatial-<version>.jar" in the target folder.
3. You'll get the jar named `spark-cuspatial-<version>.jar` in the target folder.

Note: The Docker environment is only for building the JAR, not for running the application.

### Build in Local:
1. essential build tools:
### Build in local machine:
1. Essential build tools:
- [cmake(>=3.20)](https://cmake.org/download/),
- [ninja(>=1.10)](https://github.com/ninja-build/ninja/releases),
- [gcc(>=9.3)](https://gcc.gnu.org/releases.html)
2. [CUDA Toolkit(>=11.0)](https://developer.nvidia.com/cuda-toolkit)
3. conda: use [miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage header files and CMake dependencies
4. [cuspatial](https://github.com/rapidsai/cuspatial): install libcuspatial
```Bash
# get libcuspatial from conda
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.04
# Install libcuspatial from conda
conda install -c rapidsai -c nvidia -c conda-forge -c defaults libcuspatial=22.06
# or the command below for the nightly (aka SNAPSHOT) version.
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.06
conda install -c rapidsai-nightly -c nvidia -c conda-forge -c defaults libcuspatial=22.08
```
5. Get the code, then run "mvn package".
5. Build the JAR using `mvn package`.
```Bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/UDF-Examples/Spark-cuSpatial/
mvn package
```
6. You'll get "spark-cuspatial-<version>.jar" in the target folder.

6. `spark-cuspatial-<version>.jar` will be generated in the target folder.

## Run
### Run on-premises clusters: standalone
### GPU Demo on Spark Standalone on-premises cluster
1. Install the necessary libraries. Besides `cudf` and `cuspatial`, the `gdal` library that is compatible with the installed `cuspatial` may also be needed.
Install it by running the command below.
```
conda install -c conda-forge libgdal=3.3.1
```
2. Set up a [standalone Spark cluster](/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md). Make sure the conda `lib` directory is included in `LD_LIBRARY_PATH`, so that Spark executors can load `libcuspatial.so`. You can verify this with the sketch below.
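A minimal check, assuming Python runs in the same environment the executors will use:
```Python
# Minimal check that libcuspatial.so is resolvable on a worker node through
# LD_LIBRARY_PATH; run it in the environment the executors will use.
import ctypes
import os

print(os.environ.get("LD_LIBRARY_PATH", "<unset>"))
ctypes.CDLL("libcuspatial.so")  # raises OSError if the library cannot be found
print("libcuspatial.so loaded")
```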

3. Download spark-rapids jars
* [spark-rapids v22.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar) or above
4. Prepare the dataset & jars. Copy the sample dataset from [cuspatial_data](../../../datasets/cuspatial_data.tar.gz) to "/data/cuspatial_data".
Copy spark-rapids & spark-cuspatial-22.08.0-SNAPSHOT.jar to "/data/cuspatial_data/jars".
You can use your own path, but remember to update the paths in "gpu-run.sh" accordingly.
5. Run "gpu-run.sh"
3. Download the Spark RAPIDS JAR
* [Spark RAPIDS JAR v22.06.0](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar) or above
4. Prepare the sample dataset and JARs. Copy the [sample dataset](../../../datasets/cuspatial_data.tar.gz) to `/data/cuspatial_data/`.
Copy the Spark RAPIDS JAR and `spark-cuspatial-<version>.jar` to `/data/cuspatial_data/jars/`.
If you built `spark-cuspatial-<version>.jar` in Docker, copy the JAR from the container to your local machine:
```
docker cp YOUR_DOCKER_CONTAINER:/PATH/TO/spark-cuspatial-<version>.jar ./YOUR_LOCAL_PATH
```
Note: update the paths in `gpu-run.sh` accordingly.
5. Run `gpu-run.sh`
```Bash
./gpu-run.sh
```
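For orientation, below is a hedged PySpark sketch of the kind of session configuration a launcher script like `gpu-run.sh` is expected to set up; the JAR paths and versions are assumptions, so consult the script itself for the authoritative values:

```Python
# Sketch only: enable the RAPIDS Accelerator plugin and put both JARs on the
# classpath. The paths and versions below are illustrative, not taken from
# gpu-run.sh itself.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-cuspatial-demo")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config(
        "spark.jars",
        "/data/cuspatial_data/jars/rapids-4-spark_2.12-22.06.0.jar,"
        "/data/cuspatial_data/jars/spark-cuspatial-<version>.jar",
    )
    .getOrCreate()
)
```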
### Run on AWS Databricks
1. Build the customized docker image [Dockerfile.awsdb](Dockerfile.awsdb) and push to dockerhub so that it can be accessible by AWS Databricks.
### GPU Demo on AWS Databricks
1. Build a customized Docker image using [Dockerfile.awsdb](Dockerfile.awsdb) and push it to a Docker registry, such as [Docker Hub](https://hub.docker.com/), that AWS Databricks can access.
```Bash
# replace with your Docker Hub repo and tag, or any other registry AWS Databricks can access
docker build -f Dockerfile.awsdb . -t <your-dockerhub-repo>:<your-tag>
docker push <your-dockerhub-repo>:<your-tag>
```

2. Follow the [Spark-rapids get-started document](https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-databricks.html#start-a-databricks-cluster) to create a GPU cluster on AWS Databricks.
Something different from the document.
Below are some different steps since a custom docker image is used with Databricks:
* Databricks Runtime Version
You should choose a Standard version of the Runtime version like "Runtime: 9.1 LTS(Scala 2.12, Spark 3.1.2)" and
choose GPU instance type like "g4dn.xlarge". Note that ML runtime does not support customized docker container.
If you choose a ML version, it says "Support for Databricks container services requires runtime version 5.3+"
and the "Confirm" button is disabled.
Choose a non-ML Databricks Runtime such as `Runtime: 9.1 LTS (Scala 2.12, Spark 3.1.2)` and
a GPU AWS instance type such as `g4dn.xlarge`. Note that the ML runtime does not support customized Docker containers; it shows the message
`Support for Databricks container services requires runtime version 5.3+`
and disables the `Confirm` button.
* Use your own Docker container
Input "Docker Image URL" as "your-dockerhub-repo:your-tag"
* For the other configurations, you can follow the get-started document.
Input `Docker Image URL` as `your-dockerhub-repo:your-tag`
* Follow the Databricks get-started document for other steps.

3. Copy the sample [cuspatial_data.tar.gz](../../../datasets/cuspatial_data.tar.gz) or your own data to DBFS using the Databricks CLI.
```Bash
@@ -103,5 +127,38 @@
points
polygons
```
4. Import the Library "spark-cuspatial-22.08.0-SNAPSHOT.jar" to the Databricks, then install it to your cluster.
5. Import [cuspatial_sample.ipynb](notebooks/cuspatial_sample_db.ipynb) to your workspace in Databricks. Attach to your cluster, then run it.
The sample points and polygons are randomly generated.

Sample polygons:

<img src="../../../docs/img/guides/cuspatial/sample-polygon.png" width="600">

4. Upload `spark-cuspatial-<version>.jar` to DBFS and then install it on the Databricks cluster.

<img src="../../../docs/img/guides/cuspatial/install-jar.png" width="600">

5. Import [cuspatial_sample.ipynb](notebooks/cuspatial_sample_db.ipynb) into your Databricks workspace, attach it to your cluster, and run it.

### CPU Demo on AWS Databricks
1. Create a Databricks cluster. For example, Databricks Runtime 10.3.

2. Install the Sedona JARs and Sedona Python libraries on Databricks using the web UI.
The Sedona version should be 1.1.1-incubating or higher.
* Install the JARs below from Maven coordinates in the Libraries tab:
```Bash
org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.0-incubating
org.datasyslab:geotools-wrapper:1.1.0-25.2
```
* To enable Python support, install the Python library below from PyPI in the Libraries tab:
```Bash
apache-sedona
```
3. From your cluster configuration (Cluster -> Edit -> Configuration -> Advanced options -> Spark), activate the
Sedona functions and the Kryo serializer by adding the lines below to the Spark config:
```Bash
spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```

4. Upload the sample data files to DBFS, start the cluster, attach the [notebook](notebooks/spacial-cpu-apache-sedona_db.ipynb) to the cluster, and run it.
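For reference, here is a minimal PySpark sketch of the kind of Sedona point-in-polygon query the CPU notebook runs; the table layout, column names, and paths are assumptions, and `spark` is the session Databricks provides:

```Python
# Sketch of a Sedona point-in-polygon join; the WKT column and zone_id are
# assumed, and SedonaRegistrator is only needed if the SQL extensions were
# not already activated through the cluster's Spark config.
from sedona.register import SedonaRegistrator

SedonaRegistrator.registerAll(spark)  # registers the ST_* SQL functions

spark.read.parquet("dbfs:/data/cuspatial_data/points").createOrReplaceTempView("points")
spark.read.parquet("dbfs:/data/cuspatial_data/polygons").createOrReplaceTempView("polygons")

result = spark.sql("""
    SELECT p.x, p.y, z.zone_id
    FROM points p JOIN polygons z
    ON ST_Contains(ST_GeomFromWKT(z.wkt), ST_Point(p.x, p.y))
""")
result.show()
```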