[BUG] Reading data from GCS creates issue #1155
Comments
@bschifferer - I am having trouble reproducing this. However, I did run into another error related to the local buffer needing to be wrapped in a BytesIO object in order to interact with pyarrow correctly. Perhaps the errors are related? When you say "cuDF / dask_cudf can read from GCS", are you referring to the latest version of RAPIDS, or 21.08?
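The local-buffer issue mentioned above can be illustrated with a small stdlib sketch (a generic illustration, not the actual NVTabular code): pyarrow expects a seekable file-like object with `read()`/`seek()`, which raw `bytes` fetched from remote storage do not provide, while `io.BytesIO` does.

```python
import io

# Raw bytes downloaded from remote storage have no file interface,
# so a reader that calls .seek()/.read() (as pyarrow does) fails on them.
raw = b"PAR1...parquet bytes..."  # placeholder payload, not a real parquet file

assert not hasattr(raw, "read")  # bytes objects are not file-like

# Wrapping the buffer in BytesIO provides the seekable file interface
# pyarrow expects, e.g. pq.read_table(io.BytesIO(raw)) instead of
# pq.read_table(raw).
buf = io.BytesIO(raw)
buf.seek(0)
assert buf.read(4) == b"PAR1"
```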
As @bschifferer said, this fails
but this seems to work fine (though it feels a lot slower)
So this still doesn't work with the latest development branch?
I haven't tried with the latest development branch yet.
@rjzamora sorry it took a while. It was a bit tricky to reproduce this on a test dataset. But if you copy the transformed parquet from this (Cell 18) example to a GCS bucket and then read it back, it
will give the following error
A couple of observations: (1) this seems to happen only when the list column exists; and (2) only for sufficiently large datasets (when I sliced the problematic dataset it seemed to work fine). Hope this helps reproduce the error at your end.
I'm sorry for the delay here. This is indeed a bug in the optimized data-transfer logic for read_parquet from remote storage. It turns out that the list column name is modified from "genres" to "genres.list.element" in the parquet metadata, and so we fail to transfer the data for that column. In the near future, all this logic will live directly in fsspec (and will be removed from NVTabular), but I will submit a temporary fix asap for NVT.
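The naming mismatch described above can be sketched in a few lines: in the flattened parquet schema, a list column such as `genres` is recorded under a path like `genres.list.element`, so code that matches columns by their top-level name skips it unless the path is mapped back to its root. The `root_column` helper below is hypothetical, for illustration only, and is not the actual NVTabular/fsspec fix.

```python
def root_column(parquet_path: str) -> str:
    """Map a flattened parquet schema path back to its top-level column name.

    List columns appear in the metadata under paths like 'genres.list.element',
    so everything after the first '.' is dropped.
    """
    return parquet_path.split(".", 1)[0]

# Example flattened schema paths as reported by the parquet metadata
# (hypothetical values, matching the column named in the comment above):
schema_paths = ["userId", "movieId", "genres.list.element"]

# Mapping each path to its root recovers the logical column names,
# so 'genres' is no longer missed when selecting data to transfer.
roots = [root_column(p) for p in schema_paths]
assert roots == ["userId", "movieId", "genres"]
```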
Describe the bug
Reading parquet file from Google Cloud Storage does not work.
Steps/Code to reproduce bug
Error:
If the data is copied to the local disk, the code will work.
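The copy-to-local workaround can be sketched generically: download the remote parquet parts first, then point the reader at the local copies. The helper below stands in for the download step with plain local copies; for real GCS sources one would fetch the files first (e.g. with gcsfs/fsspec), and the reader call mentioned in the comment is an assumption.

```python
import pathlib
import shutil
import tempfile

def localize(src_paths, local_dir):
    """Copy parquet files into a local directory and return the new paths.

    For gs:// sources the copy would instead be a download, e.g. via
    gcsfs/fsspec; here plain local copies stand in for that step.
    """
    local_dir = pathlib.Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    out = []
    for src in src_paths:
        dst = local_dir / pathlib.Path(src).name
        shutil.copy(src, dst)
        out.append(str(dst))
    return out

# Demonstrate with throwaway files standing in for remote parquet parts.
with tempfile.TemporaryDirectory() as remote, tempfile.TemporaryDirectory() as local:
    srcs = []
    for name in ("part.0.parquet", "part.1.parquet"):
        p = pathlib.Path(remote) / name
        p.write_bytes(b"PAR1")  # placeholder content, not a real parquet file
        srcs.append(str(p))
    local_paths = localize(srcs, local)
    # A reader pointed at local_paths now operates on local files,
    # avoiding the remote-transfer code path entirely.
    names = sorted(pathlib.Path(p).name for p in local_paths)
    assert names == ["part.0.parquet", "part.1.parquet"]
```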
cuDF / dask_cudf can read from GCS.
This is with the latest NVTabular.