Avoid nullable pandas dtypes in CPUParquetEngine #60

Merged
karlhigley merged 2 commits into NVIDIA-Merlin:main from rjzamora:parquet-cast-nullable
Apr 4, 2022

@rjzamora (Contributor) commented Apr 4, 2022

This is (hopefully) a temporary workaround for NVTabular's lack of support for nullable dtypes. With this change, reading Parquet data with cpu=True no longer produces nullable pandas dtypes.
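To illustrate the kind of conversion described above, here is a minimal sketch of casting nullable pandas dtypes to their NumPy equivalents. It is not the code from this PR: the helper name `to_non_nullable` is hypothetical, and the choice to fall back to `float64` (so that missing values become NaN) when a column actually contains nulls is an assumption made for the example.

```python
import pandas as pd

# Names of the pandas nullable (masked) dtypes handled by this sketch.
_NULLABLE_NAMES = {
    "Int8", "Int16", "Int32", "Int64",
    "UInt8", "UInt16", "UInt32", "UInt64",
    "Float32", "Float64", "boolean",
}


def to_non_nullable(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with nullable dtypes cast to NumPy dtypes.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    out = df.copy()
    for col, dtype in df.dtypes.items():
        name = str(dtype)
        if name not in _NULLABLE_NAMES:
            continue  # already a NumPy (or other non-masked) dtype
        if df[col].isna().any():
            # Assumption: NumPy int/bool cannot hold NA, so use float64 + NaN.
            target = "float64"
        elif name == "boolean":
            target = "bool"
        else:
            # "Int64" -> "int64", "UInt8" -> "uint8", "Float32" -> "float32"
            target = name.lower()
        out[col] = df[col].astype(target)
    return out


example = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="Int64"),
        "b": pd.array([1, None, 3], dtype="Int64"),
        "c": pd.array([True, False, True], dtype="boolean"),
    }
)
print(to_non_nullable(example).dtypes)
```

Falling back to `float64` loses integer precision for very large values, so a real implementation might prefer a different policy for columns that contain nulls.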

cc @albert17

@nvidia-merlin-bot

CI Results
GitHub pull request #60 of commit 5f326867755930c84d37b8dcf8ccfd463d9c759c, no merge conflicts.
Running as SYSTEM
Setting status of 5f326867755930c84d37b8dcf8ccfd463d9c759c to PENDING with url https://10.20.13.93:8080/job/merlin_core/11/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/60/*:refs/remotes/origin/pr/60/* # timeout=10
 > git rev-parse 5f326867755930c84d37b8dcf8ccfd463d9c759c^{commit} # timeout=10
Checking out Revision 5f326867755930c84d37b8dcf8ccfd463d9c759c (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5f326867755930c84d37b8dcf8ccfd463d9c759c # timeout=10
Commit message: "convert all pandas dtypes to non-nullable in CPUParquetEngine"
 > git rev-list --no-walk abc37714f84ddca49a34b883569e514fb25f8bc2 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins810222065464690935.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (61.3.1)
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 337 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ............................................... [ 26%]
................................................................ [ 45%]
tests/unit/schema/test_column_schemas.py ............................... [ 54%]
........................................................................ [ 75%]
........................................................................ [ 97%]
[ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 72 warnings
/usr/lib/python3.8/site-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:549: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34063 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44019 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 39131 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44011 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 35161 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34515 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 337 passed, 1 skipped, 83 warnings in 52.74s =================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins6866539576500878086.sh

@rjzamora (Contributor, Author) commented Apr 4, 2022

Note that I am getting the pre-commit failure below. The failing hook is interrogate, which checks docstring coverage, so it is probably related to missing docstrings in merlin/io/parquet.py. I will need to come back later to address it:

(rapids_21.12) rzamora@dgx14:~/workspace/cudf-22.02/core$ git commit -m "convert all pandas dtypes to non-nullable in CPUParquetEngine"
[WARNING] The 'exclude' field in hook 'isort' is a regex, not a glob -- matching '/*' probably isn't what you want here
isort....................................................................Passed
black....................................................................Passed
flake8...................................................................Passed
pylint...................................................................Passed
interrogate..............................................................Failed
- hook id: interrogate
- exit code: 1

= Coverage for /home/nfs/rzamora/workspace/cudf-22.02/core/merlin/io/ =
- Summary -
| Name       | Total | Miss | Cover | Cover% |
|------------|-------|------|-------|--------|
| parquet.py |    23 |   17 |     6 |    26% |
|------------|-------|------|-------|--------|
| TOTAL      |    23 |   17 |     6 |  26.1% |
- RESULT: FAILED (minimum: 60.0%, actual: 26.1%) -


codespell................................................................Passed
bandit...................................................................Passed
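For readers unfamiliar with the failing hook: interrogate reports docstring coverage, which is why the table above lists Total, Miss, and Cover counts against a 60% minimum. The following is a rough sketch of how such a count can be computed with the standard library. The function name `docstring_coverage` is hypothetical, and interrogate's own counting rules and configuration options are richer than this.

```python
import ast


def docstring_coverage(source: str):
    """Return (total, missing, covered) docstring counts for Python source.

    Counts the module itself plus every class and function definition,
    which only approximates what a tool like interrogate checks.
    """
    tree = ast.parse(source)
    nodes = [tree] + [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    covered = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    total = len(nodes)
    return total, total - covered, covered


sample = '''
def documented():
    """Has a docstring."""


class Undocumented:
    def method(self):
        pass
'''

total, missing, covered = docstring_coverage(sample)
print(f"total={total} miss={missing} cover={covered} ({100 * covered / total:.1f}%)")
```

Under this reading, the failure is addressed by adding docstrings to the undocumented functions and classes rather than by adding tests.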

@github-actions bot commented Apr 4, 2022

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-60

@karlhigley added this to the Merlin 22.04 milestone Apr 4, 2022
@karlhigley added the bug (Something isn't working) label Apr 4, 2022

@nvidia-merlin-bot

CI Results
GitHub pull request #60 of commit 1eb0dc91dc1e0930acf4fc74d0139951cc21f4bd, no merge conflicts.
Running as SYSTEM
Setting status of 1eb0dc91dc1e0930acf4fc74d0139951cc21f4bd to PENDING with url https://10.20.13.93:8080/job/merlin_core/12/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/60/*:refs/remotes/origin/pr/60/* # timeout=10
 > git rev-parse 1eb0dc91dc1e0930acf4fc74d0139951cc21f4bd^{commit} # timeout=10
Checking out Revision 1eb0dc91dc1e0930acf4fc74d0139951cc21f4bd (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 1eb0dc91dc1e0930acf4fc74d0139951cc21f4bd # timeout=10
Commit message: "Merge branch 'main' into parquet-cast-nullable"
 > git rev-list --no-walk 5f326867755930c84d37b8dcf8ccfd463d9c759c # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins3679070402294277217.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (61.3.1)
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 337 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ............................................... [ 26%]
................................................................ [ 45%]
tests/unit/schema/test_column_schemas.py ............................... [ 54%]
........................................................................ [ 75%]
........................................................................ [ 97%]
[ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 72 warnings
/usr/lib/python3.8/site-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:549: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36763 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 38759 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 39657 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 46181 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44947 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/var/jenkins_home/.local/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 45453 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 337 passed, 1 skipped, 83 warnings in 52.33s =================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins400913421926742573.sh

@karlhigley merged commit 7b1e1ac into NVIDIA-Merlin:main Apr 4, 2022
@rjzamora deleted the parquet-cast-nullable branch April 5, 2022 14:52