Remove use of `kind` property in `ColumnSchema` #65
This was intended to normalize Pandas nullable types to their closest corresponding Numpy dtypes, but it turns out to break with other Numpy dtypes that also define the `kind` property, so it must be reverted.
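As a rough illustration of why rebuilding a dtype from `kind` is fragile (not necessarily the exact failure hit here): `kind` is a one-character category code, so passing it back to `np.dtype` does not round-trip plain Numpy dtypes. The helper name `via_kind` below is hypothetical and only mirrors the shape of the removed logic.

```python
import numpy as np


def via_kind(dtype):
    """Hypothetical helper: rebuild a dtype from its one-character `kind` code."""
    return np.dtype(dtype.kind)


original = np.dtype("float64")
rebuilt = via_kind(original)  # kind is "f", and np.dtype("f") is float32

# The width is lost: float64 comes back as float32.
print(original, "->", rebuilt)

# The category can change too: bool has kind "b", but the type code "b" means int8.
print(np.dtype("bool"), "->", via_kind(np.dtype("bool")))
```

Any dtype object that exposes `kind` takes this path under a `hasattr(..., "kind")` check, which is why the check catches more than the Pandas nullable types it was aimed at.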
GitHub pull request #65 of commit a6a5f3f06a20db8ac79c77616740e4232fad8ef5, no merge conflicts.
Running as SYSTEM
Setting status of a6a5f3f06a20db8ac79c77616740e4232fad8ef5 to PENDING with url https://10.20.13.93:8080/job/merlin_core/17/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/65/*:refs/remotes/origin/pr/65/* # timeout=10
> git rev-parse a6a5f3f06a20db8ac79c77616740e4232fad8ef5^{commit} # timeout=10
Checking out Revision a6a5f3f06a20db8ac79c77616740e4232fad8ef5 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f a6a5f3f06a20db8ac79c77616740e4232fad8ef5 # timeout=10
Commit message: "Remove use of `kind` property in `ColumnSchema`"
> git rev-list --no-walk 693e6776671d703478fe3f25b76732836b7936a8 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins1559533346675254086.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (62.0.0)
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 337 items / 1 skipped
    elif hasattr(self.dtype, "kind"):
        dtype = np.dtype(self.dtype.kind)
I think we still need to handle string dtypes (the parquet change only handles int/float nullable dtypes):
    elif isinstance(self.dtype, pd.StringDtype):
        dtype = np.dtype("O")
To clarify - We could add a check for pandas string types in the parquet reader, but I believe they are created in other places anyway...
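For reference, a minimal standalone sketch of the explicit check suggested above, assuming it is factored into a free function. The name `to_numpy_dtype` is hypothetical and is not part of `ColumnSchema`; the fall-through simply defers to `np.dtype`.

```python
import numpy as np
import pandas as pd


def to_numpy_dtype(dtype):
    """Hypothetical helper: map a Pandas string dtype to Numpy's object dtype."""
    if isinstance(dtype, pd.StringDtype):
        # Pandas' nullable string type has no fixed-width Numpy equivalent,
        # so treat it as a generic Python object column.
        return np.dtype("O")
    # Anything else is handed to Numpy unchanged (assumes Numpy can interpret it).
    return np.dtype(dtype)


print(to_numpy_dtype(pd.StringDtype()))
print(to_numpy_dtype(np.dtype("int64")))
```

An `isinstance` check on the specific Pandas type avoids matching unrelated dtypes the way a `hasattr(..., "kind")` check does.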
Wait, I thought your change avoided using them?
We avoid creating nullable integer and float dtypes in parquet now, but we weren't creating StringDtype there anyway (I think - I need to check that).
Okay - I think you are right. I missed that some categorical columns are producing StringDtype - I'll try another tweak to the parquet reader and see if we can just drop the block as you are suggesting here.
Okay - It does seem that your original change here works fine if #66 is also merged :)
LGTM, but it looks like you will need someone with higher merlin privileges to approve as well :)
GitHub pull request #65 of commit 2e7b4c48a1ae2367c3983e605eda7b4d8868a3f9, no merge conflicts.
Running as SYSTEM
Setting status of 2e7b4c48a1ae2367c3983e605eda7b4d8868a3f9 to PENDING with url https://10.20.13.93:8080/job/merlin_core/22/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/65/*:refs/remotes/origin/pr/65/* # timeout=10
> git rev-parse 2e7b4c48a1ae2367c3983e605eda7b4d8868a3f9^{commit} # timeout=10
Checking out Revision 2e7b4c48a1ae2367c3983e605eda7b4d8868a3f9 (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f 2e7b4c48a1ae2367c3983e605eda7b4d8868a3f9 # timeout=10
Commit message: "Merge branch 'main' into fix/schema-dtypes"
> git rev-list --no-walk f08c57e6dc4972e54b6d77e0e90ab4b1566ac089 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins8711348676361416056.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (62.0.0)
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 337 items / 1 skipped
GitHub pull request #65 of commit bb55e2348eb0285e65ec3f59c53c5ab0927f548f, no merge conflicts.
Running as SYSTEM
Setting status of bb55e2348eb0285e65ec3f59c53c5ab0927f548f to PENDING with url https://10.20.13.93:8080/job/merlin_core/23/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
> git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
> git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/65/*:refs/remotes/origin/pr/65/* # timeout=10
> git rev-parse bb55e2348eb0285e65ec3f59c53c5ab0927f548f^{commit} # timeout=10
Checking out Revision bb55e2348eb0285e65ec3f59c53c5ab0927f548f (detached)
> git config core.sparsecheckout # timeout=10
> git checkout -f bb55e2348eb0285e65ec3f59c53c5ab0927f548f # timeout=10
Commit message: "Merge branch 'main' into fix/schema-dtypes"
> git rev-list --no-walk 2e7b4c48a1ae2367c3983e605eda7b4d8868a3f9 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins3826142596931303025.sh
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (62.0.0)
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 337 items / 1 skipped