Unable to read Parquet file written by databricks #4515

@jaychia

Description

Describe the bug

Attached is an example Parquet file that was written by Databricks.

It seems other Rust-based engines also fail to read it:

Polars, pqrs

However, I was able to read it successfully via DuckDB and PyArrow.
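
For reference, a minimal sketch of how the cross-engine check might look (the filename stands in for the attached file; the pass/fail comments reflect the behavior reported above):

import duckdb
import polars as pl
import pyarrow.parquet as papq

pl.read_parquet("file.parquet")                        # fails (Rust-based Parquet reader)
duckdb.sql("SELECT * FROM 'file.parquet'").fetchall()  # reads fine
papq.read_table("file.parquet")                        # reads fine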

File:
part-00000-68793c08-5b0e-480e-bd6b-0a7568d49906.c000.snappy.parquet.zip

To Reproduce

import daft

df = daft.read_parquet("file.parquet")

This fails with File out of specification: Invalid thrift: bad data
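
Because the error points at the thrift-encoded footer, one quick sanity check is to read the trailing magic and footer length directly (a minimal sketch; the filename is a placeholder for the attached file):

import struct

with open("file.parquet", "rb") as f:
    f.seek(-8, 2)                                   # last 8 bytes: footer length + "PAR1" magic
    footer_len = struct.unpack("<I", f.read(4))[0]  # little-endian size of the thrift footer
    magic = f.read(4)                               # should be b"PAR1" for a valid Parquet file
    print(magic, footer_len)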

import pyarrow.parquet as papq

f = papq.ParquetFile("myfile.parquet")
print(f.metadata)

This prints

<pyarrow._parquet.FileMetaData object at 0x145c60720>
  created_by: parquet-mr compatible Photon version 0.2 (build 16.4)
  num_columns: 10
  num_rows: 76
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 4432
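
Since PyArrow parses the footer without complaint, a possible interim workaround is to load the file through PyArrow and hand the resulting Arrow table to Daft (an untested sketch, assuming daft.from_arrow accepts a pyarrow.Table):

import daft
import pyarrow.parquet as papq

table = papq.read_table("myfile.parquet")  # PyArrow reads the Photon-written file
df = daft.from_arrow(table)                # construct the Daft DataFrame from the Arrow table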

Expected behavior

No response

Component(s)

Parquet

Additional context

No response

Metadata

Labels

bug (Something isn't working), p1 (Important to tackle soon, but preemptable by p0)
