Stats computation not working #269

Open
SrTangente opened this issue Feb 13, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@SrTangente

Stats computation like this snippet:

val statsDF = DeltaLog.forTable(spark, path).update().allFiles.map(_.stats)

does not work with qbeast-spark 1.0.0. Maybe an alternative way to extract this information can be found.
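
For reference, a minimal self-contained sketch of the snippet above. It assumes the table at `path` is stored in Delta format and that the Delta version on the classpath matches the one used to write it. The object and method names (`FileStatsSketch`, `rawStats`, `parsedStats`) are hypothetical and not part of qbeast-spark or Delta.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.delta.DeltaLog

// Hypothetical helper, for illustration only.
object FileStatsSketch {

  /** One row per data file: the file path and its raw stats JSON string. */
  def rawStats(spark: SparkSession, path: String): DataFrame =
    DeltaLog.forTable(spark, path).update().allFiles.select("path", "stats")

  /** The stats JSON parsed into columns (numRecords, minValues, maxValues, nullCount). */
  def parsedStats(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._ // provides the Encoder[String] that map(_.stats) needs
    val statsJson = DeltaLog.forTable(spark, path).update().allFiles
      .filter("stats IS NOT NULL") // skip files written without stats
      .map(_.stats)
    spark.read.json(statsJson) // schema is inferred from the JSON strings
  }
}
```

Calling `FileStatsSketch.parsedStats(spark, path).show(false)` should print one row per file, which makes it easy to see which columns actually have min/max values.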

SrTangente added the 1.0.0 and bug (Something isn't working) labels on Feb 13, 2024
SrTangente self-assigned this on Feb 13, 2024
@osopardo1
Member

osopardo1 commented Feb 14, 2024

I cannot reproduce the error that you are experiencing.

I've tried:

  • Version 1.0.0-SNAPSHOT with Spark 3.4.1 and Delta 2.4.0: works fine.
  • Version 1.0.0-6a780ea1-SNAPSHOT with Spark 3.5.0 and Delta 3.0.0: works fine too.

Please check that you are using the right Delta version when reading the DeltaLog. You should use the same version to read and write.

@Jiaweihu08
Member

Closed for inactivity.

@SrTangente
Author

SrTangente commented Feb 20, 2024

The stats are extracted for only 32 seemingly random columns, and for only one of the cubes rather than all of them.

@osopardo1
Member

osopardo1 commented Feb 22, 2024

  1. The number of columns for which stats are computed can be set with a Delta table property: delta.dataSkippingNumIndexedCols. Since it is a table property, you should create the table beforehand with the custom property and then write your DataFrame with saveAsTable or insertInto for the change to take effect.
    This is not a Qbeast feature.
  2. The stats are always collected per file, not per cube. :( We rely on Delta's implementation of stats collection, and that structure is not designed to hold multiple min and max values for each column in a single file.
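
A minimal sketch of point 1, assuming a plain Delta table in a local Spark session. The table name, schema, and object name are hypothetical; for a Qbeast table the USING clause and indexing options would differ. In Delta, a value of -1 for delta.dataSkippingNumIndexedCols collects stats for all columns.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object, for illustration only.
object StatsColumnsSketch {

  // DDL for an empty table that carries the Delta stats property.
  def createTableSql(table: String, numIndexedCols: Int): String =
    s"""CREATE TABLE IF NOT EXISTS $table (id BIGINT, name STRING, price DOUBLE)
       |USING delta
       |TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '$numIndexedCols')""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stats-columns-sketch")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // 1. Create the table first, so the property exists before any data is written.
    spark.sql(createTableSql("wide_table", -1))

    // 2. Write through the catalog so the table property is honoured.
    val df = spark.range(100).selectExpr(
      "id",
      "concat('item_', cast(id AS string)) AS name",
      "cast(id AS double) * 1.5 AS price")
    df.write.mode("append").insertInto("wide_table") // insertInto matches columns by position

    spark.stop()
  }
}
```

Files written before the property was set keep whatever stats they were written with, so the setting only affects new writes.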

osopardo1 removed the 1.0.0 label on Mar 27, 2024
osopardo1 changed the title from "Stats computation not working for 1.0.0" to "Stats computation not working" on Mar 27, 2024

3 participants