Skip to content

BigQuery: Allow loading data from a list of URLs#2000

Merged
quartzmo merged 2 commits into
googleapis:masterfrom
keylimetoolbox:bigquery_load_uri_list
Mar 15, 2018
Merged

BigQuery: Allow loading data from a list of URLs#2000
quartzmo merged 2 commits into
googleapis:masterfrom
keylimetoolbox:bigquery_load_uri_list

Conversation

@jeremywadsack
Copy link
Copy Markdown
Contributor

In the BigQuery API documentation, a load job can be given a list or source URIs. This PR updates the client to allow you to send an array of Storage objects, or URLs (and String or URI), when loading data into a table or dataset.

The initial commit adds some table-reference acceptance tests that were missing, but make it complete with the table acceptance tests.

I noticed that there are very few load/load_job acceptance tests for the dataset. Would you like me to also fill that in to cover all the various cases we have for table and table-reference?

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Mar 8, 2018
@blowmage
Copy link
Copy Markdown
Contributor

blowmage commented Mar 8, 2018

Can you mix files and URLs?

@quartzmo quartzmo requested review from blowmage and quartzmo March 8, 2018 23:35
@quartzmo quartzmo self-assigned this Mar 8, 2018
@quartzmo quartzmo added the api: bigquery Issues related to the BigQuery API. label Mar 8, 2018
@jeremywadsack
Copy link
Copy Markdown
Contributor Author

@blowmage Local files? Not as far as I could tell. To upload local files you send a media-upload request to the /upload URI. This can be a multipart upload (which I believe is what the library does), but the documentation says that you cannot upload multiple files:

Wildcards and comma separated lists are not supported when you load files from a local data source. Files must be loaded individually.

Additionally:

The body of the request is formatted as a multipart/related content type [RFC2387] and contains exactly two parts. The parts are identified by a boundary string, and the final boundary string is followed by two hyphens.

Each part of the multipart request needs an additional Content-Type header:

Metadata part: Must come first, and Content-Type must match one of the accepted metadata formats.
Media part: Must come second, and Content-Type must match one the method's accepted media MIME types.

That said, I haven't tried to see what the API does if you use a media-upload request and set configuration.load.sourceUris[].


On the CI failure, I could use some guidance. I don't know how to address a doc test failure, "This code example is not yet mocked."

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@blowmage
Copy link
Copy Markdown
Contributor

blowmage commented Mar 9, 2018

@jeremywadsack The doctest failure is because you added a new code example labeled "Pass a list of google-cloud-storage files:", but didn't configure the documentation mocks for it in support/doctest_helper.rb. You can probably copy an existing configuration for Google::Cloud::Bigquery::Table#load_job in that file. Be sure to run bundle exec rake ci in the directory to see what other failures you may have before updating the PR.

@jeremywadsack
Copy link
Copy Markdown
Contributor Author

jeremywadsack commented Mar 9, 2018

@blowmage ba48571 adds doctest mocks (thanks for the pointer). That also reminded me I should document the example for the other three methods.

I ran rake ci locally and it passed before and after these changes, so that commit is a bit of a shot-in-the-dark.

@jeremywadsack
Copy link
Copy Markdown
Contributor Author

@blowmage @quartzmo Let me know if there's anything else you want me to update. If this is good, I can squash/rebase.

@quartzmo
Copy link
Copy Markdown
Member

I pulled this branch and ran rake ci:a (includes acceptance tests), passes for me locally.

Copy link
Copy Markdown
Member

@quartzmo quartzmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jeremywadsack jeremywadsack force-pushed the bigquery_load_uri_list branch from bd03b16 to 5e95c59 Compare March 15, 2018 17:47
@jeremywadsack
Copy link
Copy Markdown
Contributor Author

Sqashed and rebased off master. I left 58c5960 as a separate commit because it was covering specs that were missing, unrelated to adding this support.

@quartzmo
Copy link
Copy Markdown
Member

I'll merge when green (just a formality...)

@blowmage
Copy link
Copy Markdown
Contributor

Thank you for the rebase/squash!

@quartzmo quartzmo merged commit f3308d7 into googleapis:master Mar 15, 2018
@quartzmo
Copy link
Copy Markdown
Member

Thank you @jeremywadsack, really appreciate your contributions!

@jeremywadsack
Copy link
Copy Markdown
Contributor Author

I really appreciate both of your responsiveness to issues and pull requests. I realize this is a really huge repository covering most of the GCP products. It's nice to have things moved through so quickly to releases.

@jeremywadsack jeremywadsack deleted the bigquery_load_uri_list branch March 15, 2018 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: bigquery Issues related to the BigQuery API. cla: yes This human has signed the Contributor License Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants