
Reimplement summary tables from raw HAR data #23

rviscomi opened this issue Mar 8, 2018 · 2 comments



commented Mar 8, 2018

The legacy website uses intermediate crawl data from MySQL tables to generate CSVs containing summary data about pages and requests. As part of the beta migration, we would like to deprecate this preprocessing step and depend directly on the raw HAR data.

In BigQuery, this data is represented in the runs dataset, which has recently been split into summary_pages and summary_requests. These datasets will continue to exist, but will be generated in a BigQuery post-processing step instead, using the HAR tables as input.

A secondary goal of this process is to modernize the summary data. For example, the videoRequests field may not count modern video formats like WebM.

  • Write new queries to replicate summary data
  • Hook queries into post-processing pipeline
  • Unplug CSV -> BigQuery pipeline
  • Remove pre-processing logic
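As a sketch of the modernization point about videoRequests, the classification could be driven by response MIME type so that formats like WebM are counted. This is a hypothetical illustration in Python, not the actual pipeline code; the field paths follow the HAR spec, and the format list is an assumption:

```python
# Hypothetical sketch: classify a request as a video request by its
# response MIME type, including modern formats like WebM.
# The MIME type list is an assumption, not the official definition.

VIDEO_MIME_TYPES = {
    "video/mp4",
    "video/webm",             # modern format the legacy pipeline may miss
    "video/ogg",
    "video/mpeg",
    "application/x-mpegURL",  # HLS playlists
}

def is_video_request(har_entry: dict) -> bool:
    """Return True if a HAR entry's response looks like a video."""
    mime = har_entry.get("response", {}).get("content", {}).get("mimeType", "")
    # Strip parameters such as "; codecs=vp9" before matching.
    return mime.split(";")[0].strip().lower() in VIDEO_MIME_TYPES

def count_video_requests(entries: list) -> int:
    """Recompute a videoRequests-style summary field from raw HAR entries."""
    return sum(is_video_request(e) for e in entries)
```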

rviscomi (Member, Author) commented Mar 28, 2019

requests.2019_03_01_desktop would be a good place to start because we lost a lot of data in the summary_requests.2019_03_01_desktop table.

Help wanted: someone to write an SQL query that converts the HAR-based requests data into summary_requests data. Since the dataset is very large, I created this table of 100 sample requests to practice on. The output should match the schema of the summary_requests tables. HTTP Archive-specific metadata like requestid and pageid may be null, as they're not included in the HAR.
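To illustrate which HAR fields would feed which summary columns, here is a rough sketch of the mapping (in Python rather than SQL, purely for illustration; the column names are assumptions, not the actual summary_requests schema):

```python
# Hypothetical sketch: map one HAR entry to a summary_requests-style row.
# Column names are assumptions based on typical summary fields; the real
# target schema is whatever the existing summary_requests tables define.

def har_entry_to_summary_row(entry: dict) -> dict:
    request = entry.get("request", {})
    response = entry.get("response", {})
    headers = {h["name"].lower(): h["value"] for h in response.get("headers", [])}
    return {
        "requestid": None,  # crawl metadata, not present in the HAR
        "pageid": None,     # crawl metadata, not present in the HAR
        "url": request.get("url"),
        "method": request.get("method"),
        "status": response.get("status"),
        "mimeType": response.get("content", {}).get("mimeType"),
        "respSize": response.get("bodySize"),
        "respContentType": headers.get("content-type"),
        "time": entry.get("time"),
    }
```

In BigQuery the equivalent query would JSON-parse the HAR payload column and project these fields, leaving requestid and pageid NULL.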

Also filed a post on the forum asking for help.



commented Mar 28, 2019

I might take a look at this. Note that the requests table is required for some of the statistical analysis for each run, but that is essentially a one-off task that can be done differently; I've never been a fan of the request tables myself. CSVs can be useful, so a HAR-to-CSV converter would be worthwhile. I already have Python code for this in my fork of httparchive.
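A minimal HAR-to-CSV sketch along those lines (this is not the code from the fork; the column names are illustrative, and `har_to_csv` is a hypothetical helper):

```python
import csv
import json

# Hypothetical sketch: flatten each entry of a HAR file into a flat row
# and write the rows out with csv.DictWriter. The three columns here are
# illustrative, not the official summary_requests schema.

def har_to_csv(har_path: str, csv_path: str) -> None:
    with open(har_path) as f:
        entries = json.load(f)["log"]["entries"]
    rows = [
        {
            "url": e.get("request", {}).get("url"),
            "status": e.get("response", {}).get("status"),
            "bodySize": e.get("response", {}).get("bodySize"),
        }
        for e in entries
    ]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "status", "bodySize"])
        writer.writeheader()
        writer.writerows(rows)
```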
