
Create a random sample set for the latest of each BigQuery dataset #121

Open
3 tasks
rviscomi opened this issue Jan 14, 2019 · 0 comments
The BQ tables are getting unwieldy and expensive to query. We already use scheduled queries to generate the tables in the latest dataset. Similarly, we should generate randomly sampled subsets of these tables, limited to some number of rows, to ensure that queries over them stay inexpensive. For the requests dataset we should sample by page so that every page in the sample retains all of its respective requests.

- [ ] calculate the average row size in bytes for each dataset
- [ ] pick a sample size in rows corresponding to about 1 GB per dataset (this can change)
- [ ] schedule a query for each dataset (requests, pages, etc.) and each client (desktop, mobile) to materialize sample tables
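A rough sketch of the first two steps and the page-level sampling query, in Python. The table and column names (`page`, the `httparchive.latest.*` layout) are assumptions for illustration, not the final schema:

```python
# Sketch: size the sample from an average row size, and build a query
# that samples whole pages so each sampled page keeps all of its requests.

TARGET_BYTES = 1 * 1024 ** 3  # ~1 GB budget per sample table (tunable)


def sample_row_count(avg_row_bytes: float, target_bytes: int = TARGET_BYTES) -> int:
    """Number of rows such that rows * avg_row_bytes is roughly target_bytes."""
    return max(1, int(target_bytes // avg_row_bytes))


def requests_sample_query(table: str, page_count: int) -> str:
    """Sample random pages (not individual requests), then join back so
    every request belonging to a sampled page is included.
    The `page` column name is an assumption."""
    return f"""
        SELECT r.*
        FROM `{table}` AS r
        JOIN (
          SELECT page
          FROM `{table}`
          GROUP BY page
          ORDER BY RAND()
          LIMIT {page_count}
        ) USING (page)
    """.strip()


# e.g. with an average row size of 2 KB, ~1 GB fits 524288 rows
print(sample_row_count(avg_row_bytes=2048))  # 524288
```

The pages and other non-requests datasets can use a plain `ORDER BY RAND() LIMIT n` instead, since they have one row per page and need no grouping.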
@rviscomi rviscomi self-assigned this Jan 14, 2019