
[BUG] Reading data from GCS creates issue #1155

Closed
bschifferer opened this issue Oct 1, 2021 · 7 comments · Fixed by #1213
Labels
bug Something isn't working P0

Comments

@bschifferer
Contributor

bschifferer commented Oct 1, 2021

Describe the bug
Reading parquet file from Google Cloud Storage does not work.

Steps/Code to reproduce bug

dataset = nvt.Dataset("gs://bucket/file.parquet")
dataset.to_ddf().head()

Error:

cuDF failure at: ../src/table/table.cpp:42: Column size mismatch:

If the data is copied to the local disk, the code works.
cuDF / dask_cudf can read from GCS directly.
This is with the latest NVTabular.
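As a rough sketch of the local-disk workaround described above (the helper name and fsspec-based flow here are illustrative assumptions, not part of NVTabular):

```python
import fsspec


def copy_to_local(remote_url, local_path):
    # Hypothetical workaround helper: download the remote file
    # (e.g. "gs://bucket/file.parquet") to local disk first, then
    # point nvt.Dataset at the local copy, which reads correctly.
    with fsspec.open(remote_url, "rb") as src, open(local_path, "wb") as dst:
        dst.write(src.read())


# dataset = nvt.Dataset(local_path)  # works once the file is local
```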

@bschifferer bschifferer added the bug Something isn't working label Oct 1, 2021
@benfred benfred added this to To Do in v0.7.1 (21.10) via automation Oct 4, 2021
@rjzamora
Collaborator

rjzamora commented Oct 4, 2021

@bschifferer - I am having trouble reproducing this. However, I did run into another error related to the local buffer needing to be wrapped in a BytesIO object in order to interact with pyarrow correctly. Perhaps the errors are related?

When you say "cuDF / dask_cudf can read from GCS", are you referring to the latest version of RAPIDS, or 21.08?

@benfred benfred added the P0 label Oct 6, 2021
@benfred benfred moved this from To Do to In progress in v0.7.1 (21.10) Oct 6, 2021
@pchandar

pchandar commented Oct 8, 2021

@rjzamora

As @bschifferer said, this fails

dataset = nvt.Dataset("gs://bucket/file.parquet")
dataset.to_ddf().head()

but this seems to work fine (feels a lot slower):

ddf = dask_cudf.read_parquet("/path-to-data/*.parquet")
nvt.Dataset(ddf)

@rjzamora
Collaborator

rjzamora commented Oct 8, 2021

So this still doesn't work with the latest development branch (main)? Is the problematic file in a public bucket? If not, can you share a toy DataFrame example that cannot be read back like this after being written to gcs?

@pchandar

pchandar commented Oct 8, 2021

I haven't tried with the latest main branch. I'll check if #1158 fixed the issue and provide an update.

@rjzamora
Collaborator

rjzamora commented Oct 8, 2021

I haven't tried with the latest main branch. I'll check if #1158 fixed the issue and provide an update.

Thanks @pchandar! Note that I am not very confident that #1158 was related, but since I cannot reproduce with an arbitrary parquet file myself, this issue is a bit difficult to debug.

@pchandar

pchandar commented Oct 14, 2021

@rjzamora sorry it took a while. It was a bit tricky to reproduce this on a test dataset. But if you copy the transformed parquet from this (Cell 18) example to a GCS bucket and then

ds = nvt.Dataset("gs://bucket/movielens.parquet")
ds.head()

will give the following error

RuntimeError: cuDF failure at: ../src/table/table.cpp:42: Column size mismatch: 76 != 20000076

A couple of observations: (1) this seems to happen only when list columns exist; and (2) only for sufficiently large datasets (when I sliced the problematic dataset, it seemed to work fine). Hope this helps reproduce the error at your end.
Thanks

@rjzamora
Collaborator

rjzamora commented Oct 26, 2021

I'm sorry for the delay here. This is indeed a bug in the optimized data-transfer logic for read_parquet from remote storage. It turns out that the list column name is modified from "genres" to "genres.list.element" in the parquet metadata, and so we fail to transfer the data for that column. In the near future, all this logic will live directly in fsspec (and will be removed from NVTabular), but I will submit a temporary fix for NVT asap.

v0.7.1 (21.10) automation moved this from In progress to Done Oct 27, 2021