Spark Action to Analyze table #10288
base: main
Conversation
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Computes the statistics of the given columns and stores them as Puffin files. */
AnalyzeTableSparkAction is a generic name, and I see that in the future we want to compute partition stats too, which may not be written as Puffin files.
Either we can rename this to computeNDVSketches,
or make it generic enough that any kind of stats can be computed from it.
Thinking more on this, I think we should just call it computeNDVSketches
and not mix it with partition stats.
I tried to follow the model of RDBMSs and engines like Trino, which use ANALYZE TABLE <tblName>
to collect all table-level stats.
With a procedure-per-stat model, the user has to invoke a procedure/action for every stat, and
with any new stat addition, the user needs to update their code to call the new procedure/action.

> not mix it with partition stats.

I think we could have partition stats as a separate action since it is per partition, whereas this procedure can collect top-level table stats.
spark(), table, columnsToBeAnalyzed.toArray(new String[0]));
table
    .updateStatistics()
    .setStatistics(table.currentSnapshot().snapshotId(), statisticsFile)
What if the table's current snapshot was modified concurrently by another client between line 117 and line 120?
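The race being pointed out is a classic check-then-act window: the snapshot id is read at one point and the statistics are committed against it later. A minimal, self-contained sketch of the problem and one common mitigation (re-validate and retry on conflict) is below; this is toy code using an `AtomicReference` as a stand-in for the table metadata pointer, not the Iceberg API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the race: stats are pinned to a snapshot id observed earlier,
// and another client may have committed in between. A compare-and-swap retry
// loop closes the window by re-reading state whenever the commit loses.
public class StatsCommitRace {
  record TableState(long snapshotId, String statsFile) {}

  static final AtomicReference<TableState> TABLE =
      new AtomicReference<>(new TableState(1L, null));

  // Retry until the state observed at the start is still current at commit time.
  static TableState commitStats(String statsFile) {
    while (true) {
      TableState observed = TABLE.get();            // read current snapshot
      TableState updated = new TableState(observed.snapshotId(), statsFile);
      if (TABLE.compareAndSet(observed, updated)) { // commit only if unchanged
        return updated;
      }
      // Another client changed the table in between; re-read and retry.
    }
  }
}
```

In Iceberg itself the metadata commit performs an analogous validation, but whether the stats should be re-pinned to the *new* current snapshot after a conflict is exactly the design question raised here.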
public static Iterator<Tuple2<String, ThetaSketchJavaSerializable>> computeNDVSketches(
    SparkSession spark, String tableName, String... columns) {
  String sql = String.format("select %s from %s", String.join(",", columns), tableName);
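For readers unfamiliar with what the theta sketch buys here: an NDV sketch keeps a small, fixed-size summary of the hash values seen, from which distinct counts can be estimated without holding all values. A simplified KMV-style stand-in (an assumption for illustration, not the DataSketches `ThetaSketch` implementation used by the PR) looks like this:

```java
import java.util.TreeSet;

// Toy KMV (k-minimum-values) distinct-count estimator: keep the k smallest
// hashes mapped into [0, 1); the k-th smallest estimates the fraction of
// hash space covered, so NDV is approximately (k - 1) / kthMinValue.
public class KmvEstimator {
  private final int k;
  private final TreeSet<Double> minHashes = new TreeSet<>();

  public KmvEstimator(int k) { this.k = k; }

  public void add(String value) {
    // Map the hash into [0, 1); a production sketch uses a 64-bit hash.
    double h = (value.hashCode() & 0x7fffffffL) / (double) (1L << 31);
    minHashes.add(h);  // TreeSet dedupes identical hashes
    if (minHashes.size() > k) {
      minHashes.pollLast();  // retain only the k smallest
    }
  }

  public double estimate() {
    if (minHashes.size() < k) {
      return minHashes.size();  // fewer than k distinct hashes: exact count
    }
    return (k - 1) / minHashes.last();
  }
}
```

The per-column sketches produced by the SQL scan above serve the same role: bounded memory per column, with an estimate good enough for planner statistics.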
I think we should also think about incremental updates, i.e. updating sketches from a previous checkpoint. Querying the whole table may not be efficient.
Yes, incremental updates need to be wired into the ends of the write paths.
This procedure could exist in parallel, to get stats for the whole table on demand.
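What makes the incremental path feasible is that theta-style sketches are mergeable: sketch only the newly written data, then union it with the previously checkpointed sketch instead of rescanning the table. A toy illustration of that union property (again a simplified stand-in, not the DataSketches API):

```java
import java.util.TreeSet;

// Toy mergeable sketch: each sketch retains at most k of the smallest
// hashes; union-ing two sketches merges the retained hashes and
// re-truncates, giving the same summary as sketching the combined data.
public class MergeableSketch {
  final int k;
  final TreeSet<Long> minHashes = new TreeSet<>();

  MergeableSketch(int k) { this.k = k; }

  void add(String value) {
    minHashes.add(value.hashCode() & 0x7fffffffL);  // toy 31-bit hash
    while (minHashes.size() > k) {
      minHashes.pollLast();
    }
  }

  // Union two sketches: merge retained hashes, then re-truncate to k.
  static MergeableSketch union(MergeableSketch a, MergeableSketch b) {
    MergeableSketch out = new MergeableSketch(a.k);
    out.minHashes.addAll(a.minHashes);
    out.minHashes.addAll(b.minHashes);
    while (out.minHashes.size() > out.k) {
      out.minHashes.pollLast();
    }
    return out;
  }
}
```

With this property, the write path only needs to produce a sketch for the files it adds, and the on-demand action proposed here remains useful for the full-table recompute case.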
assumeTrue(catalogName.equals("spark_catalog"));
sql(
    "CREATE TABLE %s (id int, data string) USING iceberg TBLPROPERTIES"
        + "('format-version'='2')",
The default format version is v2 now, so specifying it again is redundant.
String path = operations.metadataFileLocation(String.format("%s.stats", UUID.randomUUID()));
OutputFile outputFile = fileIO.newOutputFile(path);
try (PuffinWriter writer =
    Puffin.write(outputFile).createdBy("Spark DistinctCountProcedure").build()) {
I like this name better than "analyze table procedure".
There was an old PR on the same topic: #6582
I don't have time to work on this, so karuppayya will take over. Thanks a lot @karuppayya for continuing the work.
This change adds a Spark action to analyze tables.
As part of the analysis, the action generates an Apache DataSketches sketch for NDV stats and writes it as Puffin files.