initial commit: modified scala etls to accept Fannie Mae data (#191)
* initial commit: modified scala etls to accept Fannie Mae data

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* updated pyspark etls to consume raw mortgage data

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* updated pyspark application docs

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* updated scala spark application docs

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* updated MortgageETL.ipynb notebook

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fixed accuracy issue in scala and python ETLs

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* tested MortgageETL

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* added scala notebook etl

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* updated MortgageETL+XGBoost.ipynb

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fix bugs in docs

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* update docs to reflect fannie mae data

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* updated ipynb files with future links

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* link updated

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* remove maxPartitionBytes

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* added code to save train and test datasets

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* removed incompatibleOps

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* resolved spark/rapids configs

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* added instructions to download dataset

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* modified readme files to reflect config changes

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fixed a bug in utility Mortgage.scala

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fixed scala application bugs

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fixed python spark application bugs

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* added cpu etl section in README

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fixed scala notebooks

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fixed python notebooks

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* added step to run on CPU in scala notebook etl

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* fixed cv scala notebook bug

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* improve documentation

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>

* read data from disk before random split

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
SurajAralihalli authored Jul 25, 2022
1 parent 8c4809c commit ac355c0
Showing 24 changed files with 2,228 additions and 1,531 deletions.
Binary file removed datasets/mortgage-small.tar.gz
@@ -17,7 +17,8 @@ Two files are required by PySpark:

+ *samples.zip*

the package including all example code.
Executing the above build commands generates the samples.zip file in the 'spark-rapids-examples/examples/XGBoost-Examples' folder.

+ *main.py*

22 changes: 22 additions & 0 deletions docs/get-started/xgboost-examples/dataset/mortgage.md
@@ -0,0 +1,22 @@
# How to download the Mortgage dataset



## Steps to download the data

1. Go to the [Fannie Mae](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) website
2. Click on [Single-Family Loan Performance Data](https://datadynamics.fanniemae.com/data-dynamics/?&_ga=2.181456292.2043790680.1657122341-289272350.1655822609#/reportMenu;category=HP)
* Register as a new user if you are using the website for the first time
* Use the credentials to log in
3. Select [HP](https://datadynamics.fanniemae.com/data-dynamics/#/reportMenu;category=HP)
4. Click on **Download Data** and choose *Single-Family Loan Performance Data*
5. You will find a tabular list of 'Acquisition and Performance' files sorted by year and quarter. Click a file to download it (e.g. `2017Q1.zip`)
6. Unzip the downloaded file to extract the CSV file (e.g. `2017Q1.csv`)
7. Copy only the CSV files to a new folder for the ETL to read, as sketched below
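
For example, a minimal shell sketch of steps 6 and 7; the `~/Downloads` and `/data/mortgage/input` paths are placeholders, so adjust them to your environment:

``` bash
# Placeholder paths: quarterly zips in ~/Downloads, CSVs collected
# into a single input folder for the Mortgage ETL to read.
mkdir -p /data/mortgage/input /tmp/mortgage-extract
for f in ~/Downloads/20??Q?.zip; do
    unzip -o "$f" -d /tmp/mortgage-extract
done
# Copy only the CSV files into the ETL input folder
cp /tmp/mortgage-extract/*.csv /data/mortgage/input/
```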

## Notes
1. Refer to the [Loan Performance Data Tutorial](https://capitalmarkets.fanniemae.com/media/9066/display) for more details.
2. Note that *Single-Family Loan Performance Data* has two components. However, the Mortgage ETL requires only the first one (the primary dataset)
* Primary Dataset: Acquisition and Performance Files
* HARP Dataset
3. Use the [Resources](https://datadynamics.fanniemae.com/data-dynamics/#/resources/HP) section to learn more about the dataset
2 changes: 2 additions & 0 deletions docs/get-started/xgboost-examples/notebook/python-notebook.md
@@ -30,6 +30,8 @@ and the home directory for Apache Spark respectively.
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
@@ -60,9 +60,10 @@ on cluster filesystems like HDFS, or in [object stores like S3 and GCS](https://
Note that using [application dependencies](https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management) from
the submission client’s local file system is not yet supported.

#### Note:
1. Mortgage and Taxi jobs have ETLs to generate the processed data.
2. For convenience, a subset of the [Taxi](/datasets/) dataset is available in this repo and can be used directly to launch the XGBoost job. Use the [ETL](#etl) to generate larger datasets for training and testing.
3. Agaricus has no ETL process; it is combined with XGBoost since there is only a filter operation.

Save Kubernetes Template Resources
----------------------------------
@@ -89,35 +90,36 @@ to execute using a GPU which is already in use -- causing undefined behavior and

<span id="etl">Launch Mortgage or Taxi ETL Part</span>
---------------------------
Use the ETL app to process the raw Mortgage data. You can either split the ETL output into training and evaluation sets, or run the ETL on different subsets of the raw data to produce separate training and evaluation datasets; a staging sketch for the second option follows.
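
One way to stage the second option, as a hedged sketch: the quarter file names are placeholders, and it assumes `${SPARK_XGBOOST_DIR}` is on a locally mounted filesystem (use the corresponding `hdfs dfs` or object-store commands otherwise):

``` bash
# Stage different quarters as separate ETL inputs for train and eval.
mkdir -p ${SPARK_XGBOOST_DIR}/mortgage/input ${SPARK_XGBOOST_DIR}/mortgage/input-eval
cp /data/mortgage/csv/2017Q1.csv ${SPARK_XGBOOST_DIR}/mortgage/input/       # training quarters
cp /data/mortgage/csv/2017Q2.csv ${SPARK_XGBOOST_DIR}/mortgage/input-eval/  # evaluation quarters
# Then run the spark-submit below once per input folder, adjusting the
# data:: and out:: -dataPath values accordingly.
```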

Run spark-submit

``` bash
${SPARK_HOME}/bin/spark-submit \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
--jars ${RAPIDS_JAR} \
--master <k8s://ip:port or k8s://URL> \
--deploy-mode ${SPARK_DEPLOY_MODE} \
--num-executors ${SPARK_NUM_EXECUTORS} \
--driver-memory ${SPARK_DRIVER_MEMORY} \
--executor-memory ${SPARK_EXECUTOR_MEMORY} \
--class com.nvidia.spark.examples.mortgage.ETLMain \
$SAMPLE_JAR \
-format=csv \
-dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-train/" \
-dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-train/" \
-dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/train/"

# if generating eval data, change the data path to eval as well as the corresponding perf-eval and acq-eval data
# -dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-eval"
# -dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-eval"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/eval/"
-dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
-dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/"

# if generating eval data, change the data path to eval
# -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# if running Taxi ETL benchmark, change the class and data path params to
# -class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
@@ -163,9 +165,9 @@ export SPARK_DRIVER_MEMORY=4g
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
# or change to com.nvidia.spark.examples.taxi.Main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.Main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist
@@ -192,9 +194,9 @@ ${SPARK_HOME}/bin/spark-submit
--conf spark.kubernetes.executor.podTemplateFile=${TEMPLATE_PATH} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
${SAMPLE_JAR} \
-dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/ \
-dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/ \
-format=parquet \
-numWorkers=${SPARK_NUM_EXECUTORS} \
-treeMethod=${TREE_METHOD} \
-numRound=100 \
@@ -53,6 +53,13 @@ Get Application Files, Jar and Dataset

Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md)


#### Note:
1. Mortgage and Taxi jobs have ETLs to generate the processed data.
2. For convenience, a subset of the [Taxi](/datasets/) dataset is available in this repo and can be used directly to launch the XGBoost job. Use the [ETL](#etl) to generate larger datasets for training and testing.
3. Agaricus has no ETL process; it is combined with XGBoost since there is only a filter operation.


Launch a Standalone Spark Cluster
---------------------------------

@@ -83,9 +90,8 @@

Launch Mortgage or Taxi ETL Part
---------------------------

Use the ETL app to process the raw Mortgage data. You can either split the ETL output into training and evaluation sets, or run the ETL on different subsets of the raw data to produce separate training and evaluation datasets; a split sketch follows the GPU example below.
### ETL on GPU
``` bash
${SPARK_HOME}/bin/spark-submit \
--master spark://$HOSTNAME:7077 \
@@ -95,18 +101,39 @@ ${SPARK_HOME}/bin/spark-submit \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--py-files ${SAMPLE_ZIP} \
main.py \
--mainClass='com.nvidia.spark.examples.mortgage.etl_main' \
--format=csv \
--dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-train/" \
--dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-train/" \
--dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/train/"
# if generating eval data, change the data path to eval as well as the corresponding perf-eval and acq-eval data
# --dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-eval"
# --dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-eval"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/eval/"
--dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
--dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/"
# if generating eval data, change the data path to eval
# --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# if running Taxi ETL benchmark, change the class and data path params to
# -class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# -dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```
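
If you instead split one ETL output into train and eval sets (the first option above), here is a minimal sketch via the PySpark shell; the paths, 80/20 ratio, and seed are illustrative, not part of the example apps:

``` bash
pyspark <<'EOF'
# Read the ETL output and split it; paths and ratio are placeholders.
df = spark.read.parquet("/data/mortgage/output/train/")
train, test = df.randomSplit([0.8, 0.2], seed=42)
train.write.mode("overwrite").parquet("/data/mortgage/output/train-split/")
test.write.mode("overwrite").parquet("/data/mortgage/output/eval/")
EOF
```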
### ETL on CPU
```bash
${SPARK_HOME}/bin/spark-submit \
--master spark://$HOSTNAME:7077 \
--executor-memory 32G \
--conf spark.executor.instances=1 \
--py-files ${SAMPLE_ZIP} \
main.py \
--mainClass='com.nvidia.spark.examples.mortgage.etl_main' \
--format=csv \
--dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
--dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/"
# if generating eval data, change the data path to eval
# --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# if running Taxi ETL benchmark, change the class and data path params to
# -class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
@@ -166,8 +193,8 @@ ${SPARK_HOME}/bin/spark-submit
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
${MAIN_PY} \
--mainClass=${EXAMPLE_CLASS} \
--dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/ \
--dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/ \
--format=parquet \
--numWorkers=${SPARK_NUM_EXECUTORS} \
--treeMethod=${TREE_METHOD} \
@@ -240,8 +267,8 @@ ${SPARK_HOME}/bin/spark-submit
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
${SPARK_PYTHON_ENTRYPOINT} \
--mainClass=${EXAMPLE_CLASS} \
--dataPath=train::${DATA_PATH}/mortgage/output/train/ \
--dataPath=trans::${DATA_PATH}/mortgage/output/eval/ \
--format=parquet \
--numWorkers=${SPARK_NUM_EXECUTORS} \
--treeMethod=${TREE_METHOD} \