Create and maintain a 10k-row subset table #27

Suggested in the HTTP Archive Slack channel:

Just need to figure out which tables to subset, how to organize them, and how to keep them updated with the latest release.

Comments
How do we define ~first? Also, what's the benefit of this vs "LIMIT 10000"?
LIMIT 10000 can still cause a full scan of all rows with some queries, and thus cost substantially more. Having a smaller table reduces that cost, allowing for developing/iterating on the query.
It’s not clear to me whether you can use a view with LIMIT 10000 to achieve this and have the billing only report for the 10k rows.
I tested out a VIEW with a
As for how to define first, I've been using the top N based on the Alexa rank. For example, my pages_mobile sample query reduced the 342MB table to 765KB:
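The original query isn't preserved above; a rough sketch of that kind of Alexa-rank subset might look like this (destination names are illustrative, and it assumes the rank column in the summary_pages table):

```sql
-- Materialize a pages subset limited to the Alexa top 1000.
CREATE TABLE `my_project.sample.pages_mobile_1k` AS
SELECT pages.*
FROM `httparchive.pages.2018_03_15_mobile` AS pages
JOIN `httparchive.summary_pages.2018_03_15_mobile` AS summary
  ON pages.url = summary.url
WHERE summary.rank > 0
  AND summary.rank <= 1000;
```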
Then create other tables based on the same set of pages. This request sample table uses the following query, which processes 43.3GB and results in a 132MB table that I can use to test JOINs with the pages table later on:
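Hedged in the same way (and assuming the requests table's page column holds the parent page URL), the request sample can be keyed off the pages sample so the two stay JOIN-compatible:

```sql
-- Keep only requests whose parent page is already in the pages sample,
-- so JOINs between the sample tables mirror the full tables.
CREATE TABLE `my_project.sample.requests_mobile_1k` AS
SELECT requests.*
FROM `httparchive.requests.2018_03_15_mobile` AS requests
JOIN `my_project.sample.pages_mobile_1k` AS pages
  ON requests.page = pages.url;
```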
And this sample table is a Lighthouse version of it, reducing the 115GB table down to 280MB.
Creating the sample tables for the HAR and Lighthouse datasets will reduce the processing overhead of developing queries more than the pages/requests samples will, but we should still include those so that we can test JOINs across all of these tables.
WDYT of just having a random sample? This is future-proof if we ever move away from Alexa ranks, and possibly more representative of the full dataset than just the head URLs.
I think this is a great idea. I have had a lot of hesitation digging into the Lighthouse data, knowing each query would eat 13% of my free 1TB. Letting me refine my queries on a 280MB (or even a couple GB) table would lower the barrier to entry.
I created a few sample tables here. These use the queries I mentioned above for the Alexa top 1000. A random sample makes sense too, but we would need to make sure that the URLs are consistent across sample tables so that JOINs can be tested properly. I didn't create them for every table yet, but this is a start. I'm happy to share this data set more broadly, but I think it's better off living under the HTTP Archive. Let me know if there is anything I can do to help with this!
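One hedged way to get a random sample that stays consistent across tables is to hash the page URL deterministically rather than using RAND(); the names and the 1% rate below are illustrative:

```sql
-- Keep roughly 1% of pages, chosen by a deterministic hash of the page URL,
-- so the same pages appear in every sample table and JOINs remain valid.
CREATE TABLE `my_project.sample.pages_mobile_random` AS
SELECT *
FROM `httparchive.pages.2018_03_15_mobile`
WHERE MOD(ABS(FARM_FINGERPRINT(url)), 100) = 0;

CREATE TABLE `my_project.sample.requests_mobile_random` AS
SELECT *
FROM `httparchive.requests.2018_03_15_mobile`
WHERE MOD(ABS(FARM_FINGERPRINT(page)), 100) = 0;
```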
@paulcalvano In the short term, any chance you can recreate these with the new structure tables? Longer term, if we want people to query the raw data then I think maintaining these in the public set is a must. As it stands it's just too expensive to query the main tables, and it's only going to get more expensive as the HTTP Archive grows. For example, this seemingly innocuous query processes 708GB!
(Using SQL as the query language makes BigQuery accessible, but it also 'tricked' me into thinking about processing in a relational-DB way, which of course isn't how it works; for a while I was confused as to why the WHERE clause didn't limit the bytes processed.)
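A hedged illustration of why the WHERE clause doesn't change the bill: BigQuery is columnar and charges for the columns it scans, not the rows that survive the filter (table and column names are illustrative):

```sql
-- Scans the entire payload column even though the filter matches a single page;
-- bytes billed depend on which columns are read, not on the WHERE clause.
SELECT payload
FROM `httparchive.pages.2018_03_15_mobile`
WHERE url = 'https://www.example.com/';

-- Reading only a narrow column (or querying a small subset table)
-- is what actually reduces the bytes processed.
SELECT url
FROM `httparchive.pages.2018_03_15_mobile`
WHERE STARTS_WITH(url, 'https://www.example');
```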
It seems that anything that wants to query the response bodies with a regexp would benefit a lot from this, as regexps are frequently a little tricky and you'd like to be able to test and make mistakes on a much smaller dataset than even the oldest/smallest ones I can find.
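A hedged sketch of the kind of regexp iteration a subset enables, assuming a hypothetical sample of the response_bodies table with page/url/body columns:

```sql
-- Iterate on a tricky regexp against a small sample of response bodies
-- before paying to run it over the full table.
SELECT page, url
FROM `my_project.sample.response_bodies_mobile_1k`
WHERE REGEXP_CONTAINS(body, r'jquery[.-]?[0-9.]*(\.min)?\.js');
```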
1k and 10k tables created for the Web Almanac -