Create and maintain a 10k-row subset table #27

Suggested in the HTTP Archive Slack channel:

Just need to figure out which tables to subset, how to organize them, and how to keep them updated with the latest release.

Comments
How do we define ~first? Also, what's the benefit of this vs "LIMIT 10000"?
LIMIT 10000 can still cause a full scan of all rows with some queries, and thus cost substantially more. Having a smaller table reduces that cost, allowing for developing/iterating on the query.
It’s not clear to me whether you can use a view with LIMIT 10000 to achieve this and have the billing only report for the 10k rows.
I tested out a VIEW with a
As for how to define first, I've been using the top N based on the Alexa rank. For example, my pages_mobile sample query reduced the 342MB table to 765KB:
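The original query isn't preserved above; a rough sketch of that kind of Alexa-rank subset might look like this (destination names are illustrative, and it assumes the rank column in the summary_pages table):

```sql
-- Materialize a pages subset limited to the Alexa top 1000.
CREATE TABLE `my_project.sample.pages_mobile_1k` AS
SELECT pages.*
FROM `httparchive.pages.2018_03_15_mobile` AS pages
JOIN `httparchive.summary_pages.2018_03_15_mobile` AS summary
  ON pages.url = summary.url
WHERE summary.rank > 0
  AND summary.rank <= 1000;
```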
Then create other tables based on the same set of pages. This request sample table uses the following query, which processes 43.3GB and results in a 132MB table that I can use to test JOINs with the pages table later on:
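Hedged in the same way (and assuming the requests table's page column holds the parent page URL), the request sample can be keyed off the pages sample so the two stay JOIN-compatible:

```sql
-- Keep only requests whose parent page is already in the pages sample,
-- so JOINs between the sample tables mirror the full tables.
CREATE TABLE `my_project.sample.requests_mobile_1k` AS
SELECT requests.*
FROM `httparchive.requests.2018_03_15_mobile` AS requests
JOIN `my_project.sample.pages_mobile_1k` AS pages
  ON requests.page = pages.url;
```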
And this sample table is a Lighthouse version of it, reducing the 115GB table down to 280MB.
Creating the sample tables for the HAR and Lighthouse datasets will reduce the processing overhead of developing queries more than the pages/requests samples will, but we should still include those so that we can test JOINs across all of these tables.
WDYT of just having a random sample? This is future-proof if we ever move away from Alexa ranks, and possibly more representative of the full dataset than just the head URLs.
I think this is a great idea. I have had a lot of hesitation digging into the Lighthouse data, knowing each query would eat 13% of my free 1TB. Letting me refine my queries on a 280MB (or even a couple GB) table would lower the barrier to entry.
I created a few sample tables here. These use the queries I mentioned above for the Alexa top 1000. A random sample makes sense too, but we would need to make sure that the URLs are consistent across sample tables so that JOINs can be tested properly. I didn't create them for every table yet, but this is a start. I'm happy to share this data set more broadly, but I think it's better off living under the HTTP Archive. Let me know if there is anything I can do to help with this!
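One hedged way to get a random sample that stays consistent across tables is to hash the page URL deterministically rather than using RAND(); the names and the 1% rate below are illustrative:

```sql
-- Keep roughly 1% of pages, chosen by a deterministic hash of the page URL,
-- so the same pages appear in every sample table and JOINs remain valid.
CREATE TABLE `my_project.sample.pages_mobile_random` AS
SELECT *
FROM `httparchive.pages.2018_03_15_mobile`
WHERE MOD(ABS(FARM_FINGERPRINT(url)), 100) = 0;

CREATE TABLE `my_project.sample.requests_mobile_random` AS
SELECT *
FROM `httparchive.requests.2018_03_15_mobile`
WHERE MOD(ABS(FARM_FINGERPRINT(page)), 100) = 0;
```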
@paulcalvano In the short term, any chance you can recreate these with the new structure tables? Longer term, if we want people to query the raw data then I think maintaining these in the public set is a must. As it stands it's just too expensive to query the main tables, and it's only going to get more expensive as the HTTP Archive grows. For example, this seemingly innocuous query processes 708GB!
(Using SQL as the query language makes BigQuery accessible, but it also 'tricked' me into thinking about processing in a relational-DB way, which of course isn't how it works; for a while I was confused as to why the WHERE clause didn't limit the bytes processed.)
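A hedged illustration of why the WHERE clause doesn't change the bill: BigQuery is columnar and charges for the columns it scans, not the rows that survive the filter (table and column names are illustrative):

```sql
-- Scans the entire payload column even though the filter matches a single page;
-- bytes billed depend on which columns are read, not on the WHERE clause.
SELECT payload
FROM `httparchive.pages.2018_03_15_mobile`
WHERE url = 'https://www.example.com/';

-- Reading only a narrow column (or querying a small subset table)
-- is what actually reduces the bytes processed.
SELECT url
FROM `httparchive.pages.2018_03_15_mobile`
WHERE STARTS_WITH(url, 'https://www.example');
```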
It seems that anything that wants to query the response bodies with a regexp would benefit a lot from this, as regexps are frequently a little tricky and you'd like to be able to test and make mistakes on a much smaller dataset than even the oldest/smallest ones I can find.
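A hedged sketch of the kind of regexp iteration a subset enables, assuming a hypothetical sample of the response_bodies table with page/url/body columns:

```sql
-- Iterate on a tricky regexp against a small sample of response bodies
-- before paying to run it over the full table.
SELECT page, url
FROM `my_project.sample.response_bodies_mobile_1k`
WHERE REGEXP_CONTAINS(body, r'jquery[.-]?[0-9.]*(\.min)?\.js');
```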
1k and 10k tables created for the Web Almanac -