
Create a random sample set for the latest of each BigQuery dataset #121

Open
3 tasks
rviscomi opened this issue Jan 14, 2019 · 0 comments
The BQ tables are getting unwieldy and expensive to query. We already use scheduled queries to generate the tables in the latest dataset. Similarly, we should generate randomly sampled subsets of these tables, limited to some number of rows, to ensure that queries over them stay inexpensive. For the requests dataset we should sample by page so that every page in the sample retains all of its respective requests.

- [ ] calculate the average row size in bytes for each dataset
- [ ] pick a sample size in rows corresponding to about 1 GB per dataset (this can change)
- [ ] schedule a query for each dataset (requests, pages, etc.) and each client (desktop, mobile) to materialize sample tables
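A rough sketch of the first two steps and the page-level sampling query, in Python. The table and column names (`page`, the `httparchive.latest.*` layout) are assumptions for illustration, not the final schema:

```python
# Sketch: size the sample from an average row size, and build a query
# that samples whole pages so each sampled page keeps all of its requests.

TARGET_BYTES = 1 * 1024 ** 3  # ~1 GB budget per sample table (tunable)


def sample_row_count(avg_row_bytes: float, target_bytes: int = TARGET_BYTES) -> int:
    """Number of rows such that rows * avg_row_bytes is roughly target_bytes."""
    return max(1, int(target_bytes // avg_row_bytes))


def requests_sample_query(table: str, page_count: int) -> str:
    """Sample random pages (not individual requests), then join back so
    every request belonging to a sampled page is included.
    The `page` column name is an assumption."""
    return f"""
        SELECT r.*
        FROM `{table}` AS r
        JOIN (
          SELECT page
          FROM `{table}`
          GROUP BY page
          ORDER BY RAND()
          LIMIT {page_count}
        ) USING (page)
    """.strip()


# e.g. with an average row size of 2 KB, ~1 GB fits 524288 rows
print(sample_row_count(avg_row_bytes=2048))  # 524288
```

The pages and other non-requests datasets can use a plain `ORDER BY RAND() LIMIT n` instead, since they have one row per page and need no grouping.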
@rviscomi rviscomi self-assigned this Jan 14, 2019