initial commit: modified scala etls to accept Fannie Mae data #191

Merged · 31 commits · Jul 25, 2022

Commits:
- `e4bb22d` initial commit: modified scala etls to accept Fannie Mae data (SurajAralihalli, Jun 24, 2022)
- `e69e157` updated pyspark etls to consume raw mortgage data (SurajAralihalli, Jun 27, 2022)
- `5d6f5b3` updated pyspark application docs (SurajAralihalli, Jun 27, 2022)
- `d0cb491` updated scala spark application docs (SurajAralihalli, Jun 27, 2022)
- `84633d2` updated MortgageETL.ipynb notebook (SurajAralihalli, Jun 27, 2022)
- `836e601` Merge remote-tracking branch 'upstream/branch-22.08' into fannieMaeET… (SurajAralihalli, Jun 27, 2022)
- `9e74d79` fixed accuracy issue in scala and python ETLs (SurajAralihalli, Jul 1, 2022)
- `4bee25e` tested MortgageETL (SurajAralihalli, Jul 1, 2022)
- `6f5e882` added scala notebook etl (SurajAralihalli, Jul 1, 2022)
- `911c7aa` updated MortgageETL+XGBoost.ipynb (SurajAralihalli, Jul 5, 2022)
- `67d68cd` fix bugs in docs (SurajAralihalli, Jul 5, 2022)
- `674d318` Merge remote-tracking branch 'upstream/branch-22.08' into fannieMaeET… (SurajAralihalli, Jul 5, 2022)
- `1cbbc30` update docs to reflect fannie mae data (SurajAralihalli, Jul 6, 2022)
- `ac9c08b` updated ipynb files with future links (SurajAralihalli, Jul 6, 2022)
- `0458ccc` link updated (SurajAralihalli, Jul 7, 2022)
- `07e661d` remove maxPartitionBytes (SurajAralihalli, Jul 12, 2022)
- `3d8f44e` added code to save train and test datasets (SurajAralihalli, Jul 14, 2022)
- `6810a5f` removed incompatibleOps (SurajAralihalli, Jul 14, 2022)
- `2f115b0` resolved spark/rapids configs (SurajAralihalli, Jul 14, 2022)
- `97c1c46` added instructions to download dataset (SurajAralihalli, Jul 14, 2022)
- `4d3d30b` modified readme files to reflect config changes (SurajAralihalli, Jul 14, 2022)
- `a6f3fdb` fixed a bug in utility Mortgage.scala (SurajAralihalli, Jul 18, 2022)
- `4dfff4b` fixed scala application bugs (SurajAralihalli, Jul 20, 2022)
- `834c71c` fixed python spark application bugs (SurajAralihalli, Jul 20, 2022)
- `aed50c8` added cpu etl section in readMe (SurajAralihalli, Jul 20, 2022)
- `af6e966` fixed scala notebooks (SurajAralihalli, Jul 20, 2022)
- `1c04ffb` fixed python notebooks (SurajAralihalli, Jul 20, 2022)
- `3d4c487` added step to run on CPU in scala notebook etl (SurajAralihalli, Jul 20, 2022)
- `18629e4` fixed cv scala notebook bug (SurajAralihalli, Jul 20, 2022)
- `173ad5d` improve documentation (SurajAralihalli, Jul 20, 2022)
- `788dc20` read data from disk before random split (SurajAralihalli, Jul 21, 2022)
Binary file removed datasets/mortgage-small.tar.gz
@@ -17,7 +17,8 @@ Two files are required by PySpark:

+ *samples.zip*

the package including all example code
the package including all example code.
Executing the above build commands generates the samples.zip file in the 'spark-rapids-examples/examples/XGBoost-Examples' folder.

+ *main.py*

22 changes: 22 additions & 0 deletions docs/get-started/xgboost-examples/dataset/mortgage.md
@@ -0,0 +1,22 @@
# How to download the Mortgage dataset



## Steps to download the data

1. Go to the [Fannie Mae](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) website
2. Click on [Single-Family Loan Performance Data](https://datadynamics.fanniemae.com/data-dynamics/?&_ga=2.181456292.2043790680.1657122341-289272350.1655822609#/reportMenu;category=HP)
   * Register as a new user if you are using the website for the first time
   * Use the credentials to log in
3. Select [HP](https://datadynamics.fanniemae.com/data-dynamics/#/reportMenu;category=HP)
4. Click on **Download Data** and choose *Single-Family Loan Performance Data*
5. You will find a tabular list of 'Acquisition and Performance' files sorted by year and quarter. Click on a file to download it (e.g. `2017Q1.zip`)
6. Unzip the downloaded file to extract the CSV file (e.g. `2017Q1.csv`)
7. Copy only the CSV files into a new folder for the ETL to read, as sketched below
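
For steps 6 and 7, a minimal Python sketch (the folder names are hypothetical; adjust them to wherever you saved the downloads):

```python
import shutil
import zipfile
from pathlib import Path

downloads = Path("downloads")        # hypothetical: folder holding 2017Q1.zip, 2017Q2.zip, ...
input_dir = Path("mortgage/input")   # hypothetical: folder the Mortgage ETL will read
input_dir.mkdir(parents=True, exist_ok=True)

# Step 6: extract each quarterly archive next to the zip files.
for archive in downloads.glob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(downloads)

# Step 7: copy only the extracted CSV files into the ETL input folder.
for csv_file in downloads.glob("*.csv"):
    shutil.copy(csv_file, input_dir / csv_file.name)
```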

## Notes
1. Refer to the [Loan Performance Data Tutorial](https://capitalmarkets.fanniemae.com/media/9066/display) for more details.
2. Note that the *Single-Family Loan Performance Data* has two components; the Mortgage ETL requires only the first one (the primary dataset):
* Primary Dataset: Acquisition and Performance Files
* HARP Dataset
3. See the [Resources](https://datadynamics.fanniemae.com/data-dynamics/#/resources/HP) section to learn more about the dataset
2 changes: 2 additions & 0 deletions docs/get-started/xgboost-examples/notebook/python-notebook.md
@@ -30,6 +30,8 @@ and the home directory for Apache Spark respectively.
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
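
The two added lines configure the RAPIDS plugin: `spark.sql.cache.serializer` lets cached DataFrames be stored via the plugin's Parquet-based serializer, and `spark.rapids.sql.hasNans=false` declares that the data contains no NaNs so more operations can stay on the GPU. If you prefer to build the session inside the notebook rather than passing `--conf` flags, a minimal sketch (the app name is hypothetical, and it assumes the RAPIDS jar is already on the classpath):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mortgage-gpu-notebook")  # hypothetical app name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.sql.cache.serializer",
            "com.nvidia.spark.ParquetCachedBatchSerializer")
    .config("spark.rapids.sql.hasNans", "false")
    .getOrCreate()
)
```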
@@ -60,9 +60,10 @@ on cluster filesystems like HDFS, or in [object stores like S3 and GCS](https://
Note that using [application dependencies](https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management) from
the submission client’s local file system is not yet supported.

Note: the `mortgage_eval_merged.csv` and `mortgage_train_merged.csv` are not Mortgage raw data,
they are the data produced by Mortgage ETL job. If user wants to use a larger size Mortgage data, please refer to [Launch ETL job](#etl).
Taxi ETL job is the same. But Agaricus does not have ETL process, it is combined with XGBoost as there is just a filter operation.
#### Note:
1. Mortgage and Taxi jobs have ETLs to generate the processed data.
2. For convenience, a subset of the [Taxi](/datasets/) dataset is included in this repo and can be used directly to launch the XGBoost job. Use the [ETL](#etl) to generate larger datasets for training and testing.
3. Agaricus does not have an ETL process; it is combined with the XGBoost job since only a filter operation is needed.

Save Kubernetes Template Resources
----------------------------------
@@ -89,35 +90,36 @@ to execute using a GPU which is already in use -- causing undefined behavior and

<span id="etl">Launch Mortgage or Taxi ETL Part</span>
---------------------------
Use the ETL app to process the raw Mortgage data. You can either split the ETL output into training and evaluation sets, or run the ETL on different subsets of the raw data to produce training and evaluation datasets; a split sketch follows below.
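
For the first option, a minimal PySpark sketch of the split (paths are hypothetical; the Scala ETL's Parquet output can be split the same way). Per one of this PR's commits, the data is read back from disk before the random split:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-mortgage-output").getOrCreate()

# Read the ETL output back from disk before splitting; splitting the still-lazy
# transformed DataFrame can recompute its lineage and make the split unstable.
df = spark.read.parquet("/data/mortgage/output/full/")  # hypothetical path

train, evaluation = df.randomSplit([0.8, 0.2], seed=42)  # illustrative 80/20 split
train.write.mode("overwrite").parquet("/data/mortgage/output/train/")
evaluation.write.mode("overwrite").parquet("/data/mortgage/output/eval/")
```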

Run spark-submit

``` bash
${SPARK_HOME}/bin/spark-submit \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
--jars ${RAPIDS_JAR} \
--master <k8s://ip:port or k8s://URL> \
--deploy-mode ${SPARK_DEPLOY_MODE} \
--num-executors ${SPARK_NUM_EXECUTORS} \
--driver-memory ${SPARK_DRIVER_MEMORY} \
--executor-memory ${SPARK_EXECUTOR_MEMORY} \
--class ${EXAMPLE_CLASS} \
--class com.nvidia.spark.examples.mortgage.ETLMain \
$SAMPLE_JAR \
-format=csv \
-dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-train/" \
-dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-train/" \
-dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/train/"

# if generating eval data, change the data path to eval as well as the corresponding perf-eval and acq-eval data
# -dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-eval"
# -dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-eval"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/eval/"
-dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
-dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/"

# if generating eval data, change the data path to eval
# -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# if running Taxi ETL benchmark, change the class and data path params to
# --class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
@@ -163,9 +165,9 @@ export SPARK_DRIVER_MEMORY=4g
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.GPUMain
# or change to com.nvidia.spark.examples.taxi.GPUMain to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.GPUMain to run Agaricus Xgboost benchmark
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
# or change to com.nvidia.spark.examples.taxi.Main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.Main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist
@@ -192,9 +194,9 @@ ${SPARK_HOME}/bin/spark-submit
--conf spark.kubernetes.executor.podTemplateFile=${TEMPLATE_PATH} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
${SAMPLE_JAR} \
-dataPath=train::${DATA_PATH}/mortgage/csv/train/mortgage_train_merged.csv \
-dataPath=trans::${DATA_PATH}/mortgage/csv/test/mortgage_eval_merged.csv \
-format=csv \
-dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/ \
-dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/ \
-format=parquet \
-numWorkers=${SPARK_NUM_EXECUTORS} \
-treeMethod=${TREE_METHOD} \
-numRound=100 \
@@ -53,6 +53,13 @@ Get Application Files, Jar and Dataset

Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md)


#### Note:
1. Mortgage and Taxi jobs have ETLs to generate the processed data.
2. For convenience, a subset of the [Taxi](/datasets/) dataset is included in this repo and can be used directly to launch the XGBoost job. Use the [ETL](#etl) to generate larger datasets for training and testing.
3. Agaricus does not have an ETL process; it is combined with the XGBoost job since only a filter operation is needed.


Launch a Standalone Spark Cluster
---------------------------------

@@ -83,9 +90,8 @@

Launch Mortgage or Taxi ETL Part
---------------------------

Run spark-submit

Use the ETL app to process the raw Mortgage data. You can either split the ETL output into training and evaluation sets, or run the ETL on different subsets of the raw data to produce training and evaluation datasets. A sanity-check sketch for the ETL output follows the GPU command below.
### ETL on GPU
``` bash
${SPARK_HOME}/bin/spark-submit \
--master spark://$HOSTNAME:7077 \
@@ -95,18 +101,39 @@ ${SPARK_HOME}/bin/spark-submit \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--py-files ${SAMPLE_ZIP} \
main.py \
--mainClass='com.nvidia.spark.examples.mortgage.etl_main' \
--format=csv \
--dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-train/" \
--dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-train/" \
--dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/train/"

# if generating eval data, change the data path to eval as well as the corresponding perf-eval and acq-eval data
# --dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-eval"
# --dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-eval"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/eval/"
--dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
--dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/"

# if generating eval data, change the data path to eval
# --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# if running Taxi ETL benchmark, change the class and data path params to
# --mainClass='com.nvidia.spark.examples.taxi.etl_main'
# --dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# --dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```
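
After either ETL run completes, the output can be sanity-checked before launching XGBoost. A quick PySpark sketch (it reuses the `SPARK_XGBOOST_DIR` variable from the commands above; the app name is hypothetical):

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-etl-output").getOrCreate()
base = os.environ["SPARK_XGBOOST_DIR"]  # same variable as the commands above

# Confirm the training output exists and has rows.
train = spark.read.parquet(f"{base}/mortgage/output/train/")
train.printSchema()                      # expect numeric feature columns plus the label
print(f"training rows: {train.count()}")
```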
### ETL on CPU
```bash
${SPARK_HOME}/bin/spark-submit \
--master spark://$HOSTNAME:7077 \
--executor-memory 32G \
--conf spark.executor.instances=1 \
--py-files ${SAMPLE_ZIP} \
main.py \
--mainClass='com.nvidia.spark.examples.mortgage.etl_main' \
--format=csv \
--dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
--dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/"

# if generating eval data, change the data path to eval
# --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# if running Taxi ETL benchmark, change the class and data path params to
# --mainClass='com.nvidia.spark.examples.taxi.etl_main'
# --dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
@@ -166,8 +193,8 @@ ${SPARK_HOME}/bin/spark-submit
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
${MAIN_PY} \
--mainClass=${EXAMPLE_CLASS} \
--dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/out/train/ \
--dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/out/eval/ \
--dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/ \
--dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/ \
--format=parquet \
--numWorkers=${SPARK_NUM_EXECUTORS} \
--treeMethod=${TREE_METHOD} \
@@ -240,8 +267,8 @@ ${SPARK_HOME}/bin/spark-submit
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
${SPARK_PYTHON_ENTRYPOINT} \
--mainClass=${EXAMPLE_CLASS} \
--dataPath=train::${DATA_PATH}/mortgage/out/train/ \
--dataPath=trans::${DATA_PATH}/mortgage/out/eval/ \
--dataPath=train::${DATA_PATH}/mortgage/output/train/ \
--dataPath=trans::${DATA_PATH}/mortgage/output/eval/ \
--format=parquet \
--numWorkers=${SPARK_NUM_EXECUTORS} \
--treeMethod=${TREE_METHOD} \