feat(dataflow): Dataflow to Apache Iceberg with dynamic destinations #9645
Conversation
Here is the summary of changes. You are about to add 1 region tag.
This comment is generated by snippet-bot.
@code-review-assist review
ghost left a comment
This pull request adds a snippet demonstrating dynamic destinations with Apache Iceberg in Dataflow and refactors the integration tests. The changes are well-structured and address a relevant documentation bug. Here are a few suggestions based on the Google Java Style Guide:
- Use diamond operators for type inference.
- Declare Map and ImmutableMap fields using their interfaces.
- Consider using List.of() for creating immutable lists when possible.
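The three style suggestions above can be illustrated with a small, self-contained sketch (the class and field names here are hypothetical, not from the PR):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StyleExamples {

  // Diamond operator: "new ArrayList<>()" lets the compiler infer the type
  // arguments from the declared type on the left-hand side.
  static List<String> makeNames() {
    List<String> names = new ArrayList<>();
    names.add("SFO");
    return names;
  }

  // Declare variables with the interface type (Map), not the implementation
  // (HashMap or ImmutableMap), so callers depend only on the contract.
  static Map<String, Object> makeConfig() {
    Map<String, Object> config = new HashMap<>();
    config.put("table", "flights");
    return config;
  }

  // List.of() is the concise way to build a small immutable list.
  static List<String> makeRows() {
    return List.of("row1", "row2");
  }

  public static void main(String[] args) {
    System.out.println(
        makeNames().size() + " " + makeConfig().size() + " " + makeRows().size());
  }
}
```

Note that List.of() and Map returned by it reject nulls and are unmodifiable, which is usually what sample code wants.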
@code-review-assist walkthrough
Dataflow to Apache Iceberg Samples Walkthrough

This repository contains Java code samples demonstrating how to interact with Apache Iceberg tables using Apache Beam and Google Cloud Dataflow. The primary goal is to provide clear, concise examples for developers to integrate Dataflow pipelines with Iceberg for data ingestion and processing.

Repository Structure

The repository is structured as follows: […]

Code Walkthrough

Let's trace the execution flow of the […]

Integration Tests

The integration tests in […]

Pull Request Impact

This pull request introduces a significant enhancement by adding support for dynamic destinations in Apache Iceberg. The updated […]

Potential Improvements

While the PR is well-structured, adding more comprehensive error handling and logging could further improve its robustness. Consider adding checks for null or empty values in the input data, and handle potential exceptions during table creation or data writing. More detailed logging would aid in debugging and troubleshooting.
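To make the "dynamic destinations" idea concrete: with Beam's Managed Iceberg connector, the destination table name can be a template that is resolved per record. The sketch below only builds the configuration map that would be passed to Managed.write(Managed.ICEBERG).withConfig(config); the key names and the {airport} template syntax are assumptions based on the Beam Managed Iceberg connector and should be checked against the Beam documentation for your SDK version:

```java
import java.util.Map;

public class IcebergDynamicConfig {

  // Builds the configuration map for a Managed Iceberg write. The "table"
  // value is a template: the {airport} placeholder is filled in from each
  // record's "airport" field, which is what routes rows to different
  // destination tables dynamically.
  static Map<String, Object> buildConfig(String warehouse) {
    return Map.of(
        "table", "flights-{airport}",
        "catalog_name", "local",
        "catalog_properties", Map.of(
            "type", "hadoop",
            "warehouse", warehouse));
  }

  public static void main(String[] args) {
    System.out.println(buildConfig("file:///tmp/warehouse").get("table"));
  }
}
```

A record with airport "SYD" would then land in a table such as flights-SYD, matching the per-airport tables the integration test checks.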
@iennae @kweinmeister Should I address anything from the code review bot?
dataflow/snippets/pom.xml (outdated)

      <maven.compiler.target>11</maven.compiler.target>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    - <apache_beam.version>2.58.0</apache_beam.version>
    + <apache_beam.version>2.60.0</apache_beam.version>
This is a slightly old version. cc @ahmedabu98
+1 but noting that Dataflow will be upgrading Iceberg to the latest version anyways.
I bumped the version.
chamikaramj left a comment
This is great! Thanks for adding this example.
LGTM
telpirion left a comment
Hello @VeronicaWasson! This sample looks really good. I have just a handful of comments for you. Please take a look at your earliest convenience.
Tip: I recommend formatting this sample similarly to other Java samples. Here's a recently merged sample that follows all style guidance.
      void setCatalogName(String value);
    }

    public static PipelineResult.State main(String[] args) {
issue: return void instead of a result. Move the call to pipeline.run().waitUntilFinish() into the createPipeline() method.
See:
https://googlecloudplatform.github.io/samples-style-guide/#result
Fixed
(fwiw, I've gotten some conflicting guidance on this in previous Dataflow code snippets)
        .apply(JsonToRow.withSchema(SCHEMA))
        .apply(Managed.write(Managed.ICEBERG).withConfig(config));

    return pipeline;
issue: process the response from the pipeline in some manner.
See:
https://googlecloudplatform.github.io/samples-style-guide/#pattern
Since I'm no longer returning the result, there's nothing to check here. (If the pipeline fails, the IT will fail.)
        "{\"id\":2, \"name\":\"Charles\", \"airport\": \"ORD\" }");

    // [END dataflow_apache_iceberg_dynamic_destinations]
issue: don't omit the interface definition here from the rest of the sample. If the code requires that the developer extend an interface, then we should show it in the sample.
Fixed
    // Parse the pipeline options passed into the application. Example:
    //   --runner=DirectRunner --warehouseLocation=$LOCATION --catalogName=$CATALOG
    // For more information, see
    // https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options
    var options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
issue: declare variables with types, rather than var. As illustrative/didactic code, we want to communicate to the reader what sort of type they need to work with (especially for strongly-typed languages).
See:
https://github.com/GoogleCloudPlatform/java-docs-samples/blob/main/SAMPLE_FORMAT.md#java-11-features
Fixed
        .build();

    // The data to write to table, formatted as JSON strings.
    static final List<String> TABLE_ROWS = Arrays.asList(
nit: I would use List.of() instead here.
Fixed
      void setCatalogName(String value);
    }

    public static PipelineResult.State main(String[] args) {
issue: remove the arg parsing from the sample.
See:
https://googlecloudplatform.github.io/samples-style-guide/#no-cli
This pattern is idiomatic for Dataflow / Apache Beam. To run the pipeline, the user passes in command line arguments that get parsed via the PipelineOptions class.
e.g.:
https://beam.apache.org/documentation/programming-guide/#pipeline-options-cli
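The idiomatic Beam pattern being defended here can be sketched as follows (the class name is hypothetical; the Options interface mirrors the one in the PR):

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OptionsParsingExample {

  // Each getter/setter pair maps to a command-line flag of the same name;
  // Beam generates the implementation of this interface at runtime.
  public interface Options extends PipelineOptions {
    @Description("The Iceberg catalog name")
    String getCatalogName();

    void setCatalogName(String value);
  }

  public static void main(String[] args) {
    // Parse flags such as --catalogName=local --runner=DirectRunner into a
    // typed options object; withValidation() enforces required options.
    Options options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    System.out.println(options.getCatalogName());
  }
}
```

Because PipelineOptionsFactory owns the flag parsing, the sample never handles raw argv itself, which is what distinguishes this from ad-hoc CLI parsing.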
    }

    // [START dataflow_apache_iceberg_dynamic_destinations]
    public static Pipeline createPipeline(Options options) {
nit: I would provide a comment for this method that describes what the code sample does. It's hard for me to understand, just from a casual glance, what effect this code has.
Thanks for pointing that out, I added some more comments.
    }

    @Before
    public void setUp() throws IOException {
comment: consider keeping the pipe for stdout to bout. I think that the sample should attempt to process the result by printing messages to the console.
The sample writes records to an Iceberg catalog, so the IT tests whether the records were added successfully.
Previously I was doing this in a roundabout way (printing the records to stdout first) but this version seems more direct.
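The verification approach described above, reading the written records back out of the Iceberg catalog, might look roughly like this sketch using the Iceberg Java API. The warehouse path, namespace, and table name are placeholders, not values from the PR:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;

public class CountIcebergRows {

  // Loads a table from a Hadoop catalog and counts its rows. A real
  // integration test would point this at the warehouse the pipeline wrote to
  // and assert on the expected row count per destination table.
  static long countRows(String warehouse, String namespace, String tableName)
      throws Exception {
    try (HadoopCatalog catalog = new HadoopCatalog(new Configuration(), warehouse)) {
      Table table = catalog.loadTable(TableIdentifier.of(namespace, tableName));
      long count = 0;
      try (CloseableIterable<Record> records = IcebergGenerics.read(table).build()) {
        for (Record ignored : records) {
          count++;
        }
      }
      return count;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countRows("file:///tmp/warehouse", "default", "flights-SYD"));
  }
}
```

Asserting directly on table contents like this is more direct than capturing stdout, which matches the reasoning in the reply above.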
        ImmutableMap.of(CatalogProperties.WAREHOUSE_LOCATION, warehouseLocation),
        hadoopConf);
    createIcebergTable(catalog, TABLE_IDENTIFIER);
nit: remove extraneous line if unneeded.
Fixed
    final Table tableSYD = createIcebergTable("flights-SYD");

    // Run the Dataflow pipeline.
    PipelineResult.State state = ApacheIcebergDynamicDestinations.main(
See previous comments about processing results in the sample (and not returning types).
Done
Description

Add snippet for Iceberg dynamic destinations:
- Add a snippet that shows the use of dynamic destinations (apache/beam#32365, "[Task]: Add utilities to easily implement portable dynamic destinations") when writing to Apache Iceberg from Dataflow. This is a new feature in Beam 2.60.
- Refactor the integration tests to remove the assumption of exactly one destination table.

Relevant doc bug: b/371047621
Checklist
- [ ] pom.xml parent set to latest shared-configuration
- [ ] mvn clean verify (required)
- [ ] mvn -P lint checkstyle:check (required)
- [ ] mvn -P lint clean compile pmd:cpd-check spotbugs:check (advisory only)