Potential regression in version 23.4.* while reading Parquet files leads to ParquetInvalidOrCorruptedFileException. The issue is not present in version 23.3.*. #49547

Closed
shawel opened this issue May 5, 2023 · 7 comments
Labels
potential bug To be reviewed by developers and confirmed/rejected.

Comments

@shawel

shawel commented May 5, 2023

I get the following error when reading a Parquet file. The same file works in Databricks and in version 23.3.*.

Code: 1001. DB::Exception: Received from localhost:9000. DB::Exception: parquet::ParquetInvalidOrCorruptedFileException: Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. (STD_EXCEPTION)

SELECT count()
FROM s3('https://s3.us-east-1.amazonaws.com/.../.snappy.parquet', '***', '***', 'Parquet', 'timestamp Nullable(DateTime), dist Nullable(Float64), event_name Nullable(String), gh Nullable(String), ipAddress Nullable(String), latitude Float64, longitude Float64, source_id Nullable(String)')
@shawel shawel added the potential bug To be reviewed by developers and confirmed/rejected. label May 5, 2023
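
The error above says the reader did not find the Parquet magic bytes in the footer. Parquet files begin and end with the 4-byte marker PAR1, so a quick local check can help tell a damaged file apart from a problem on the read path. A minimal sketch, assuming the object has been downloaded to a local path and is not using an encrypted footer; the function name has_parquet_magic is hypothetical and not part of ClickHouse or any Parquet library.

```python
import os

# 4-byte marker found at both the start and the end of a Parquet file
# (assumption: plain, non-encrypted footer).
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # Anything shorter than header magic + 4-byte footer length + footer magic
    # cannot be a valid Parquet file.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If this returns True for the downloaded file, the file itself is probably intact and the failure is more likely in how the bytes were fetched.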
@qoega
Member

qoega commented May 5, 2023

Can you provide an example of such a file? It would make this much easier to fix.

@shawel
Author

shawel commented May 5, 2023

@Quid37 OK, I will get back to you on this. Possibly via DM?

@al13n321
Member

al13n321 commented May 5, 2023

Likely #49525. Try adding SETTINGS remote_filesystem_read_method='read' to the end of the query and see if it helps.

If that doesn't work, please let me know, and I'll investigate (and will ask you for the file or some metadata from it, etc).
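
For anyone applying this workaround from client code, it amounts to appending a SETTINGS clause to the failing query. A minimal sketch that does this to a query string before it is sent; the helper name with_read_method is hypothetical, and it assumes the query does not already carry a SETTINGS clause.

```python
def with_read_method(query: str, method: str = "read") -> str:
    """Append SETTINGS remote_filesystem_read_method='<method>' to a query string.

    Assumption: the query has no SETTINGS clause yet; this helper does not parse SQL.
    """
    # Drop surrounding whitespace and a trailing semicolon before appending.
    base = query.strip().rstrip(";").rstrip()
    return f"{base}\nSETTINGS remote_filesystem_read_method='{method}'"


# The URL and credentials are elided exactly as in the report above.
query = (
    "SELECT count() "
    "FROM s3('https://s3.us-east-1.amazonaws.com/.../.snappy.parquet', "
    "'***', '***', 'Parquet')"
)
print(with_read_method(query))
```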

@shawel
Author

shawel commented May 7, 2023

@al13n321 SETTINGS remote_filesystem_read_method='read' does fix the issue

@warleysa

This is also happening for all of our S3 implementations after updating to 23.4.*. I have added SETTINGS to the queries, but this seems to be a larger issue.

@anton-zelenskiy

The remote_filesystem_read_method='read' setting also fixes this error: "Orig exception: Code: 32. DB::Exception: Attempt to read after eof: While executing ParquetBlockInputFormat: While executing S3. (ATTEMPT_TO_READ_AFTER_EOF) (version 23.4.2.11 (official build))"

@den-crane
Contributor

outdated


6 participants