
Parquet AssertionErrors on long running jobs #407

Open
NeroCorleone opened this issue Feb 10, 2021 · 1 comment
NeroCorleone commented Feb 10, 2021

Problem description

We are seeing different kinds of errors when creating a ktk dataset, and it is unclear where these errors come from. Initially they were AssertionErrors from somewhere in the Parquet stack. More recently we have seen the following on a dask worker node:
Exception: OSError('IOError: ZSTD decompression failed: Corrupted block detected',)

Example code (ideally copy-pastable)

Unfortunately this is not easy to reproduce: essentially we trigger a long-running (> 3 h) ktk job with kartothek.io.dask.dataframe.update_dataset_from_ddf. During this long-running job we sometimes (not always) see the following stack trace:

  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_generic.py", line 120, in restore_dataframe
    date_as_object=date_as_object,
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_parquet.py", line 128, in restore_dataframe
    parquet_file, columns_to_io, predicates_for_pushdown
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_parquet.py", line 237, in _read_row_groups_into_tables
    row_group = parquet_file.read_row_group(row, columns=columns)
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 271, in read_row_group
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1079, in pyarrow._parquet.ParquetReader.read_row_group
    return self.read_row_groups([i], column_indices, use_threads)
  File "pyarrow/_parquet.pyx", line 1098, in pyarrow._parquet.ParquetReader.read_row_groups
    check_status(self.reader.get()

  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
    raise IOError(message)

Used versions

# pip freeze
attrs==20.2.0
azure-common==1.1.25
azure-storage-blob==2.1.0
azure-storage-common==2.1.0
blinker==1.4
bokeh==2.2.1
cffi==1.14.3
chardet==3.0.4
click==7.1.2
cloudpickle==1.5.0
contextvars==2.4
cytoolz==0.10.1
dask==2.30.0
dataclasses==0.7
decorator==4.4.2
distributed==2.30.1+by.1
Flask==1.1.2
fsspec==0.8.4
gunicorn==20.0.4
HeapDict==1.0.1
idna==2.10
immutables==0.14
itsdangerous==1.1.0
Jinja2==2.11.2
kartothek==3.17.0
locket==0.2.0
lz4==3.1.0
MarkupSafe==1.1.1
milksnake==0.1.5
msgpack==1.0.0
numpy==1.19.1
packaging==20.4
pandas==1.1.4
partd==1.1.0
Pillow==7.2.0
pip==19.2.3
prometheus-client==0.8.0
prompt-toolkit==3.0.5
psutil==5.7.3
pyarrow==1.0.1
pycparser==2.20
pydantic==1.7.2
pygelf==0.3.4
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
requests==2.24.0
retail-interface==0.21.0
sentry-sdk==0.16.2
setuptools==41.2.0
simplejson==3.17.2
simplekv==0.14.1
six==1.15.0
sortedcontainers==2.2.2
storefact==0.10.0
structlog==20.1.0
tblib==1.7.0
terminaltables==3.1.0
toolz==0.10.0
tornado==6.1
typing-extensions==3.7.4.3
uritools==3.0.0
urllib3==1.25.10
urlquote==1.1.4
voluptuous==0.11.7
wcwidth==0.2.5
Werkzeug==1.0.1
wheel==0.33.6
zict==2.0.0
zstandard==0.14.0

Debugging the issue hints towards an improper fetch in our IO buffer, but the root cause is unknown. The issue might be triggered by a non-thread-safe reader in pyarrow, a bug in our Azure storage backend, or the buffer itself; see also #402
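To make the suspected race concrete, here is a hypothetical, deterministic sketch (plain Python, not kartothek or pyarrow code) of how a non-thread-safe file-like object can hand back corrupted blocks: seek() and read() are separate calls, so an interleaved seek from a second reader redirects the first reader's read.

```python
import io

# Hypothetical sketch: one file-like object shared by two logical readers.
# The interleaving below is written out explicitly, but with real threads
# it can happen whenever a context switch lands between seek() and read().
shared = io.BytesIO(b"A" * 16 + b"B" * 16)

shared.seek(0)           # reader 1 positions at block "A"
shared.seek(16)          # reader 2 seeks concurrently (interleaved)
block = shared.read(16)  # reader 1 now receives reader 2's block

print(block)             # b'BBBBBBBBBBBBBBBB' -- corrupted from reader 1's view
```

If the shared buffer behaves like this, either a lock around each seek/read pair or positioned (pread-style) reads would avoid this class of corruption.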

fjetter commented Feb 17, 2021

Very likely caused by Azure/azure-sdk-for-python#16723
