
Response bodies tables too big to load into BigQuery in one go (15 TB) #225

Closed
rviscomi opened this issue Oct 2, 2020 · 13 comments

@rviscomi
Member

rviscomi commented Oct 2, 2020

The September 2020 crawl completed successfully except for the response_bodies table for mobile. Inspecting the Dataflow logs shows this as the error:

Error while reading table: 2020_09_01_mobile_5b8f67b4_219e_41c5_9877_94ee0290cc64_source, error message: Total JSON data size exceeds max allowed size. Total size is at least: 16556745870240. Max allowed size is: 16492674416640.

16,492,674,416,640 bytes is exactly 15 TiB (15 × 2^40 bytes), the maximum size per BigQuery load job. The corresponding September 2020 desktop table weighs in at 10.82 TB and the August 2020 mobile table at 14.47 TB, so it's plausible that the mobile table finally exceeded 15 TB this month. The underlying CrUX dataset continues to grow, so this is another strain on the data pipeline's capacity.

| Table | Rows | Size (TB) |
| --- | ---: | ---: |
| 2020_08_01_desktop | 215,621,667 | 10.99 |
| 2020_08_01_mobile | 270,249,686 | 14.47 |
| 2020_09_01_desktop | 216,083,365 | 10.82 |
| 2020_09_01_mobile | 291,589,220 | 15.06 ❌ |

Here's a look at how the response body sizes were distributed in 2020_08_01_mobile:

| Body size (MB) | Requests | Cumulative weight (GB) | Cumulative requests |
| ---: | ---: | ---: | ---: |
| 90 | 1 | 0.09 | 1 |
| 87 | 1 | 0.17 | 2 |
| 86 | 1 | 0.26 | 3 |
| 84 | 1 | 0.34 | 4 |
| 83 | 1 | 0.42 | 5 |
| 78 | 1 | 0.50 | 6 |
| 75 | 1 | 0.57 | 7 |
| 74 | 1 | 0.64 | 8 |
| 68 | 2 | 0.77 | 10 |
| 62 | 1 | 0.83 | 11 |
| 61 | 2 | 0.95 | 13 |
| 59 | 1 | 1.01 | 14 |
| 56 | 1 | 1.07 | 15 |
| 55 | 2 | 1.17 | 17 |
| 54 | 1 | 1.23 | 18 |
| 50 | 2 | 1.32 | 20 |
| 48 | 3 | 1.46 | 23 |
| 47 | 3 | 1.60 | 26 |
| 45 | 1 | 1.65 | 27 |
| 44 | 5 | 1.86 | 32 |
| 43 | 1 | 1.90 | 33 |
| 41 | 2 | 1.98 | 35 |
| 40 | 2 | 2.06 | 37 |
| 38 | 2 | 2.14 | 39 |
| 37 | 4 | 2.28 | 43 |
| 36 | 5 | 2.46 | 48 |
| 35 | 6 | 2.66 | 54 |
| 34 | 1 | 2.69 | 55 |
| 32 | 2 | 2.76 | 57 |
| 31 | 2 | 2.82 | 59 |
| 30 | 3 | 2.91 | 62 |
| 29 | 3 | 2.99 | 65 |
| 28 | 6 | 3.15 | 71 |
| 27 | 2 | 3.21 | 73 |
| 26 | 7 | 3.38 | 80 |
| 25 | 4 | 3.48 | 84 |
| 24 | 6 | 3.62 | 90 |
| 23 | 25 | 4.18 | 115 |
| 22 | 15 | 4.51 | 130 |
| 21 | 32 | 5.16 | 162 |
| 20 | 136 | 7.82 | 298 |
| 19 | 35 | 8.47 | 333 |
| 18 | 44 | 9.24 | 377 |
| 17 | 50 | 10.07 | 427 |
| 16 | 62 | 11.04 | 489 |
| 15 | 79 | 12.20 | 568 |
| 14 | 140 | 14.11 | 708 |
| 13 | 166 | 16.22 | 874 |
| 12 | 287 | 19.58 | 1,161 |
| 11 | 276 | 22.55 | 1,437 |
| 10 | 536 | 27.78 | 1,973 |
| 9 | 687 | 33.82 | 2,660 |
| 8 | 1,387 | 44.66 | 4,047 |
| 7 | 1,876 | 57.48 | 5,923 |
| 6 | 2,537 | 72.35 | 8,460 |
| 5 | 6,896 | 106.02 | 15,356 |
| 4 | 9,898 | 144.68 | 25,254 |
| 3 | 30,386 | 233.70 | 55,640 |
| 2 | 103,328 | 435.52 | 158,968 |
| 1 | 1,438,555 | 1840.35 | 1,597,523 |
| 0 | 268,652,163 | 1840.35 | 270,249,686 |

Some hand-wavey math later, I think what this is telling me is that if we reduce the row limit from 100 MB back down to 2 MB, we can save up to 606 GB, assuming each row also carries an average of 100 bytes of bookkeeping (page, url, truncated, requestid). That should be enough headroom to offset the dataset growth and get us back under the limit.

I'll rerun the Dataflow job with a limit of 2 MB and report back if it works.
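For context, here's a minimal sketch of what a per-row body limit could look like; this is not the actual bigquery_import.py code, and the limit and UTF-8 handling are illustrative:

```python
# Illustrative only: cap a response body at a byte limit before the row is
# written to BigQuery, and record whether it was cut short.
MAX_BODY_BYTES = 2 * 1024 * 1024  # the 2 MB per-row limit being tested

def truncate_body(body, limit=MAX_BODY_BYTES):
    """Return (body, truncated), cutting the body to at most `limit` bytes."""
    if body is None:
        return None, False
    encoded = body.encode('utf-8')
    if len(encoded) <= limit:
        return body, False
    # Cut at the byte limit, then drop any partial UTF-8 sequence at the end.
    return encoded[:limit].decode('utf-8', errors='ignore'), True
```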

@rviscomi rviscomi added the bug label Oct 2, 2020
@rviscomi
Member Author

rviscomi commented Oct 2, 2020

The Dataflow job was started and should take 8.5 hours to complete. I also want to note that this will result in ~300k response bodies (0.11%) being truncated — not omitted, I think that was a mistake in my math above.

One workaround to the 15 TB limit may be to write to two tables and merge them in a post-load job. But all this work is making me wonder if it's worth the effort. I actively discourage everyone from querying the response_bodies dataset due to its size. At 15 TB and $5/TB it's $75 just to run one query on the mobile table or $125 for both desktop and mobile. Nobody wants to spend that kind of money.
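For reference, if we did go the two-table route, the post-load merge could be a single copy job; a rough sketch using the BigQuery Python client, with hypothetical shard names:

```python
from google.cloud import bigquery

client = bigquery.Client()
# The "_part1"/"_part2" shard names are hypothetical. copy_table accepts
# multiple sources and combines them into the destination table in one job.
copy_job = client.copy_table(
    ["httparchive.response_bodies.2020_09_01_mobile_part1",
     "httparchive.response_bodies.2020_09_01_mobile_part2"],
    "httparchive.response_bodies.2020_09_01_mobile",
)
copy_job.result()  # wait for the copy to finish
```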

With the recent support for $WPT_BODIES in WebPageTest custom metrics, we can analyze the response bodies for any request and emit stats in the form of custom metrics in the pages dataset. For the most part, this was our analytical approach in the Web Almanac for metrics that were dependent on response bodies.

So for researchers interested in querying over all response bodies on BigQuery, we could encourage them to write a new custom metric, which also has the benefit of enriching the cheaper-to-query pages dataset for everyone. Any given pages table is still under the 1 TB free monthly quota. The downside is that this isn't backwards compatible and they wouldn't be able to easily do this for older crawls. We would still have the response bodies in the HAR files on GCS, so if absolutely necessary we could find a way to load a crawl's worth into BQ, maybe using the sharded table approach above.

@paulcalvano @pmeenan WDYT?

@rviscomi
Member Author

rviscomi commented Oct 3, 2020

The job with the 2 MB per-row limit failed as well. It truncated 217,352 response bodies, but the table still exceeded the 15 TB limit.

I'm rerunning it now with a 1.5 MB per row limit.

@rviscomi
Member Author

rviscomi commented Oct 4, 2020

1.5 MB still failing. Experimenting with 2 MB again but omitting the new requestId field.

@rviscomi
Member Author

rviscomi commented Oct 7, 2020

Still failing. Trying a different approach. There seem to be many response bodies for images:

SELECT
  format,
  COUNT(0) AS freq
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  client = 'mobile' AND
  type = 'image' AND
  body IS NOT NULL
GROUP BY
  format
ORDER BY
  freq DESC
| format | freq |
| --- | ---: |
| svg | 7505138 |
| gif | 957900 |
| ico | 489250 |
| png | 101415 |
| jpg | 93750 |
| webp | 938 |
| *(empty)* | 56 |

Let's try omitting the bodies of any non-SVG images, combined with the 2 MB truncation.
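Roughly, the filter being described (field names and byte counting are simplified, not the actual pipeline code):

```python
# Illustrative sketch: keep SVG image bodies, omit other image formats, and
# truncate whatever is kept to the 2 MB per-row limit.
MAX_BODY_BYTES = 2 * 1024 * 1024

def filter_body(content_type, body, limit=MAX_BODY_BYTES):
    if body is None:
        return None
    if content_type.startswith('image/') and content_type != 'image/svg+xml':
        return None  # omit non-SVG image bodies entirely
    encoded = body.encode('utf-8')
    return encoded[:limit].decode('utf-8', errors='ignore')
```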

@housseindjirdeh

housseindjirdeh commented Mar 17, 2021

It also looks like the response_bodies tables stop after October 2020 for both mobile and desktop. I assume they're not completing successfully for this same reason?

@philipwalton

Also (FYI), a number of people internally at Google have suddenly started to notice this (many related to perf questions) because they've been unable to query the data.

I understand the argument that most people probably don't want to be paying for queries against the response bodies table, but for folks who aren't paying for it, there may still be value in being able to perform ad-hoc queries.

@paulcalvano
Contributor

Last year the response bodies processing was commented out, since the dataset size exceeded 15 TB and was causing issues with the Dataflow pipeline. You can see where it's commented out here: https://github.com/HTTPArchive/bigquery/blob/master/dataflow/python/bigquery_import.py#L331

The HAR files still exist in GCS, so once this is resolved we could backfill the data.

@rviscomi
Member Author

rviscomi commented Mar 31, 2021

+1 to @paulcalvano

I'll add that if the metrics are useful for everyone, we should consider baking them into the pages dataset either via custom WPT metrics or calculating them in the Dataflow pipeline. Both approaches have access to the raw response bodies upstream from BigQuery. If the metrics are less universally useful, we could either try one of the previous approaches on a one-off basis or I can suggest how to write your own Dataflow pipeline to ingest the response bodies from GCS and generate the metrics as part of your own GCP project.
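To make the "write your own Dataflow pipeline" option a little more concrete, here's a very rough sketch of reading HAR files straight from GCS and emitting a simple per-file metric, without touching BigQuery at all. The bucket path, file layout, and metric are all made up for illustration:

```python
import json
import apache_beam as beam
from apache_beam.io import fileio

def body_stats(har_text):
    """Toy metric: request count and total response body bytes per HAR file."""
    har = json.loads(har_text)
    entries = har.get('log', {}).get('entries', [])
    total = sum(len(e.get('response', {}).get('content', {}).get('text') or '')
                for e in entries)
    return json.dumps({'requests': len(entries), 'body_bytes': total})

with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles('gs://your-bucket/crawls/2020_10_01_mobile/*.har')
     | fileio.ReadMatches()
     | beam.Map(lambda f: f.read_utf8())
     | beam.Map(body_stats)
     | beam.io.WriteToText('gs://your-bucket/metrics/2020_10_01_mobile'))
```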

Sorry for the inconvenience @housseindjirdeh @philipwalton!

@rviscomi rviscomi changed the title response_bodies.2020_09_01_mobile missing from BigQuery Response bodies tables too big to load into BigQuery in one go (15 TB) Apr 21, 2021
@rviscomi
Member Author

rviscomi commented Apr 21, 2021

I'd like to repurpose this issue from triage to designing a solution.

One issue with the current pipeline is that we wait until the crawl is done before generating the BQ tables. It may be better to insert test data one row at a time as soon as it's ready. That avoids huge 15+ TB load jobs. It also makes partial data available sooner, though on the other hand I can see how it would be confusing for a table to appear incomplete. This would require a significant change to the data pipeline.

Another option is to shard the data before loading. There are a couple of ways we could do this: naively bisecting the tables so that each half is roughly 8 TB and combining them in a post-processing step, or staging a sharded, serialized copy of the table on GCS so that BQ can reassemble the shards in a single load job.
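For the GCS option, the load side might look something like this (the bucket and shard layout are made up, and we'd need to confirm the per-load-job size limits still allow a single job this big):

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
# One load job picks up every staged shard matching the wildcard.
load_job = client.load_table_from_uri(
    'gs://httparchive-staging/response_bodies/2020_09_01_mobile/shard-*.json.gz',
    'httparchive.response_bodies.2020_09_01_mobile',
    job_config=job_config,
)
load_job.result()
```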

I'm most optimistic about the GCS sharding option, but if a total GCP rewrite of the test pipeline is in our future, maybe streaming inserts are the way to go. @pmeenan @paulcalvano @tunetheweb any thoughts/suggestions on the path forward?

@pmeenan
Member

pmeenan commented Apr 21, 2021

Is there a table size limit as well, or is it just the load job? If you stream individual files as they arrive, then you need to remember which files you have uploaded. Two possible alternatives:

1 - Keep the processing at the end of the crawl, but break it up (shard) and process one day at a time. The test IDs are all prefixed by the day they were run, so that should give ~15-20 shards, all well under the limit.

2 - Process each "chunk" of tests as they are uploaded. WPT for HA (because of IA legacy reasons) batches the tests into groups of 5-10k (YYMMDD_GROUP_xxxxx). They are all uploaded as a group, though that was something I was hoping to move away from. The upload process can create a .done file for each group, and the load job would just need to keep track of the groups, not individual tests.

Option 1 feels the cleanest, with the least amount of additional tracking.

Could we have an in-process set of data tables that we load data into and then rename them to the crawl table name when all of the load jobs are done?
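A rough sketch of how option 1 plus the in-process table idea could fit together with the BigQuery client (the scratch dataset, day range, and GCS URIs are all illustrative): append one load job per day into a working table, then copy it to the final crawl table name once every load succeeds.

```python
from google.cloud import bigquery

client = bigquery.Client()
working = 'httparchive.scratch.response_bodies_2020_09_01_mobile'  # in-process table
final = 'httparchive.response_bodies.2020_09_01_mobile'

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
for day in range(1, 16):  # ~15 day-prefixed shards, each well under the limit
    uri = f'gs://httparchive-staging/2020_09_01_mobile/2009{day:02d}_*.json.gz'
    client.load_table_from_uri(uri, working, job_config=job_config).result()

# "Rename" by copying the working table to the final crawl table name.
client.copy_table(working, final).result()
```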

@jeffposnick

If you do go with an approach that streams chunks/shards/etc. of data into a table, then one request is to ensure that there's some cheap-to-query flag that gets set once you're reasonably confident that the data for a given month is complete.

I currently have some automated BigQuery logic that reads from this table, and I'd like to ensure that I could delay executing any expensive queries until a month's results are finalized.
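As a sketch of what that could look like from the consumer side (the status table name and schema here are hypothetical; nothing like it exists today):

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("""
    SELECT complete
    FROM `httparchive.crawl_status.response_bodies`  -- hypothetical status table
    WHERE date = '2021-05-01' AND client = 'mobile'
""").result()

if not any(row.complete for row in rows):
    raise SystemExit('Crawl not finalized yet; skipping expensive queries.')
```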

@rviscomi rviscomi pinned this issue Apr 22, 2021
@rviscomi
Member Author

rviscomi commented May 14, 2021

We were able to remediate @jeffposnick's Workbox use case by creating a pwa.js custom metric in HTTPArchive/legacy.httparchive.org#198. @housseindjirdeh could you tell me more about your use case to query the response_bodies dataset? I'm hoping we can find a similar workaround. @philipwalton do you know about any other use cases?

@rviscomi
Member Author

Fixed by HTTPArchive/bigquery#123

@rviscomi rviscomi unpinned this issue Aug 18, 2021