Proof of concept of a typed BigQuery reader #41
alexvanboxel wants to merge 1 commit into GoogleCloudPlatform:master from alexvanboxel:feature/bq
Conversation
An additional remark on why I implemented this as a core feature:
Hi Alex! Thanks for the pull request! Sorry we didn't get back to you sooner. This is a feature we'd like to have as well. I need to look at the code and have a few conversations with people here before I can comment on the approach, but I'm hoping I can respond with more details and next steps by the end of the week.
No problem. Note that it's a proof of concept, and I welcome any feedback. I'll be happy to incorporate it into my next iteration.
Hi Alex! As I said earlier, we definitely want to support a typed BigQuery API. We have a few suggestions for the general direction and some comments on the details, but we'd like to help you move forward with this and get it checked in. Our biggest concern is that this hard-codes a single way of converting a TableRow into the user's type; the conversion should instead be pluggable. There would be a default parser, and the specific parser would be serialized as part of the Bound instance and would specify how to do the type conversion. Some additional suggestions:
There will likely be more comments on the actual implementation, but hopefully the above can get you moving forward.
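To make the pluggable-parser suggestion above concrete, here is a rough sketch of one way such a hook could be shaped, assuming it is modeled as a serializable TableRow-to-T converter with a reflection-based default. The interface and class names are invented for illustration and are not the API proposed in the review.

```java
import com.google.api.services.bigquery.model.TableRow;
import java.io.Serializable;
import java.lang.reflect.Field;

// Hypothetical pluggable parser: converts the TableRow produced by the read
// into the user's type T. Being Serializable lets it be carried on the Bound
// instance and shipped to the workers along with the rest of the transform.
interface RowParser<T> extends Serializable {
  T parse(TableRow row);
}

// Hypothetical default parser: maps column names onto same-named fields of a
// POJO by reflection, converting the string values returned by the JSON API.
class ReflectiveRowParser<T> implements RowParser<T> {
  private final Class<T> type;

  ReflectiveRowParser(Class<T> type) {
    this.type = type;
  }

  @Override
  public T parse(TableRow row) {
    try {
      T instance = type.newInstance();
      for (Field field : type.getDeclaredFields()) {
        Object value = row.get(field.getName());
        if (value != null) {
          field.setAccessible(true);
          field.set(instance, convert(value, field.getType()));
        }
      }
      return instance;
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Cannot map row to " + type, e);
    }
  }

  // Only a few scalar types shown; a real type registry would cover the full set.
  private static Object convert(Object value, Class<?> target) {
    String s = String.valueOf(value);
    if (target == String.class) return s;
    if (target == Long.class || target == long.class) return Long.valueOf(s);
    if (target == Double.class || target == double.class) return Double.valueOf(s);
    if (target == Boolean.class || target == boolean.class) return Boolean.valueOf(s);
    return value;
  }
}
```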
Thanks, I will certainly try to incorporate the feedback, and I needed the feedback to continue working on the implementation. A question though: with every new round of feedback, do you want me to squash the commits into one, or keep the original commit and work on top of it?
Hi Alex. As far as the squashing or not goes, it's partly what's easiest for you. From a review perspective, it may make sense for you to incorporate the feedback and squash all of that into a single commit that we can review, and then for each round of review to be a separate commit. This would allow us to review the incremental changes between each round.
Hi Alex! Any new thoughts or updates on the revised implementation? Thanks!
Not yet; as I'm writing this in my free time, I currently have no time. I'm focusing on preparing for the Devoxx conference (second week of November), as I'm part of the steering committee. After that I'll pick this up again. I tried rebasing my branch, though, and the merge conflicts were so extensive that it's probably better to start from scratch, incorporating the comments.
Great, no problem. Take your time!
Note: because the SDK has deviated so far and there were a lot of changes requested in the review, I started a new branch. I'm currently far along with implementing the feature, but I stumbled upon what I think is kind of a blocker. I implemented the conversion as a separate input parser, with the types inferred by the default registry (all as suggested by @bjchambers). But when the time came to do a test run on the Dataflow service, I got an error. (Indeed, I added a new property, input_parser, to provide a hint to the service about which parser is used; in the BigQueryReaderFactory the input_parser property is picked up again to deserialize the parser.) But I assume that it's not allowed to create new properties... am I right?
Indeed, the Dataflow Service performs a strict check on the properties passed in; it is not allowed to create new properties on the fly. Normally this is not an issue, because such properties can be passed through as part of the serialized object. In this specific case, however, since BigQueryIO is currently implemented as a "native source", it doesn't have a corresponding serialized object to piggyback on. A way to solve this problem is to generate two steps under the hood. That said, adding new translation fields is not out of the question when there's a specific need; it is just a process that takes a while to complete.
I was afraid of that. I will implement the workaround with the ParDo step so I can continue and test it on the service, but I'll keep the implementation with the parser pushed right down to the BigQueryReader (for reviewing). Thanks.
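For reference, a minimal sketch of the two-step workaround under the SDK 1.x API: an unchanged native BigQueryIO read followed by a ParDo that does the typed conversion. The WeatherRecord POJO, the field names, and the sample table are illustrative choices, not code from the branch.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import java.io.Serializable;

public class TypedReadWorkaround {
  // Hypothetical POJO standing in for the user's target type.
  static class WeatherRecord implements Serializable {
    String station;
    double meanTemp;
    WeatherRecord(String station, double meanTemp) {
      this.station = station;
      this.meanTemp = meanTemp;
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Step 1: the unchanged native BigQuery read, producing loosely typed TableRows.
    PCollection<TableRow> rows = p.apply(
        BigQueryIO.Read.named("ReadWeather")
            .from("clouddataflow-readonly:samples.weather_stations"));

    // Step 2: a separate ParDo that applies the TableRow -> POJO conversion,
    // since the parser cannot ride along with the native source's translation.
    // (Null handling omitted for brevity.)
    PCollection<WeatherRecord> records = rows.apply(
        ParDo.of(new DoFn<TableRow, WeatherRecord>() {
          @Override
          public void processElement(ProcessContext c) {
            TableRow row = c.element();
            c.output(new WeatherRecord(
                String.valueOf(row.get("station_number")),
                Double.parseDouble(String.valueOf(row.get("mean_temp")))));
          }
        }));

    p.run();
  }
}
```

This keeps the service-visible translation unchanged (only known properties are emitted), while the conversion logic lives in an ordinary serialized DoFn.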
This is a partial revert of commits f5e3b8e and 18c82ad.

When running a batch Dataflow job on the Cloud Dataflow service, the data are produced by running a BigQuery export job and then reading all the files in parallel. When run in the DirectPipelineRunner, BigQuery's JSON API is used directly. These data come back in different formats. To compensate, we use BigQueryTableRowIterator to normalize the behavior in DirectPipelineRunner to the behavior seen when running on the service. (We cannot change this decision without a major breaking change.)

This patch fixes some discrepancies in the way that BigQueryTableRowIterator is implemented. Specifically:

*) In commit 18c82ad (response to issue #20) we updated the format of timestamps to be printed as strings. However, we did not correctly match the behavior of BigQuery export. Here is a sample set of times from the export job vs the JSON API:

2016-01-06 06:38:00 UTC          1.45206228E9
2016-01-06 06:38:11 UTC          1.452062291E9
2016-01-06 06:38:11.1 UTC        1.4520622911E9
2016-01-06 06:38:11.12 UTC       1.45206229112E9
2016-01-06 06:38:11.123 UTC      1.452062291123E9   *
2016-01-06 06:38:11.1234 UTC     1.4520622911234E9
2016-01-06 06:38:11.12345 UTC    1.45206229112345E9
2016-01-06 06:38:11.123456 UTC   1.452062291123456E9

Before, only the * test would have passed.

*) In commit f5e3b8e we updated TableRow iterator to preserve the usual TableRow field `f` corresponding to getF(), which returns a list of fields in Schema order. This was my mistaken attempt to better support users who have prior experience with BigQuery's API and expect to use getF()/getV(). However, there were two issues:

1. This change did not affect the behavior in the DataflowPipelineRunner.
2. This was actually a breaking backwards-incompatible change, because common downstream DoFns may iterate over the keys of the TableRow, and it added the field "f".

So we should not propagate the change to DataflowPipelineRunner, but instead we should revert the change to BigQueryTableRowIterator. (Note this is also a slightly-backwards-incompatible change, but it's reverting to old behavior and users are more likely to be depending on DataflowPipelineRunner rather than DirectPipelineRunner.)

Fix both these issues and add tests. This is still ugly for now. The long-term fix here is to support a parser that lets users skip TableRow altogether and goes straight to POJOs of their choosing (see #41). That would also eliminate our performance and typing issues using TableRow as an inner type in pipelines (see e.g. http://stackoverflow.com/questions/33622227/dataflow-mixing-integer-long-types).

----Release Notes----
[]
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=111746236
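As a hedged illustration of the timestamp normalization described above (not the SDK's actual implementation), converting the JSON API's epoch-seconds value into the export job's string format could look roughly like this; formatTimestamp is a hypothetical helper name.

```java
import java.math.BigDecimal;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class TimestampFormatSketch {
  // Converts a JSON API value such as "1.452062291123E9" into the export-job
  // style "2016-01-06 06:38:11.123 UTC", dropping the fraction entirely when
  // it is zero and printing no trailing zeros otherwise.
  static String formatTimestamp(String jsonApiValue) {
    BigDecimal seconds = new BigDecimal(jsonApiValue);   // exact, avoids double rounding
    long wholeSeconds = seconds.longValue();
    BigDecimal fraction =
        seconds.subtract(BigDecimal.valueOf(wholeSeconds)).stripTrailingZeros();

    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    String base = format.format(new Date(wholeSeconds * 1000L));

    if (fraction.compareTo(BigDecimal.ZERO) == 0) {
      return base + " UTC";                              // e.g. "2016-01-06 06:38:11 UTC"
    }
    // fraction.toPlainString() looks like "0.123"; keep only the digits after the dot.
    return base + "." + fraction.toPlainString().substring(2) + " UTC";
  }
}
```

Run over the sample values above, this should reproduce the export-style strings, but it is only meant to illustrate the discrepancy, not to stand in for the patch.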
Closing as obsolete. A cool idea that I hope will make it into Dataflow Java SDK 2.0, based on Apache Beam.
This is not a pull request for merge, but a request for comment on a feature I'm developing. I'd like to know whether this is a feature that would be considered for merging. It needs some more development, but if I know it won't be accepted anyway, I will abandon this feature as a "core" merge.
What is the feature: an object mapper for BigQuery. My experience with Dataflow tells me to stop using loosely typed objects as soon as possible, so it's best to go typed at the border: the reader. Having a built-in mapper therefore helps Dataflow users (this will come in handy especially for ETL-type flows). Example usage:
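The original example snippet is not preserved in this thread; as a stand-in, here is a hypothetical sketch of what a typed read could look like. Everything up to from(...) is the ordinary SDK 1.x API, while the trailing as(WeatherRecord.class) step, the POJO, and the sample table are invented here to illustrate the idea and are not the exact API of this proof of concept.

```java
// Hypothetical user POJO whose fields mirror columns of the table being read.
class WeatherRecord implements java.io.Serializable {
  Long stationNumber;
  Double meanTemp;
}

// Hypothetical typed read: the mapper would convert each row straight into a
// WeatherRecord instead of handing the pipeline a loosely typed TableRow.
PCollection<WeatherRecord> records = pipeline.apply(
    BigQueryIO.Read.named("ReadWeatherTyped")
        .from("clouddataflow-readonly:samples.weather_stations")
        .as(WeatherRecord.class));
```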
Note: this is a proof of concept and needs more development. What needs development: