
Response bodies tables too big to load into BigQuery in one go (15 TB) #225

Closed
rviscomi opened this issue Oct 2, 2020 · 13 comments

@rviscomi
Member

rviscomi commented Oct 2, 2020

The September 2020 crawl completed successfully except for the response_bodies table for mobile. Inspecting the Dataflow logs shows this as the error:

Error while reading table: 2020_09_01_mobile_5b8f67b4_219e_41c5_9877_94ee0290cc64_source, error message: Total JSON data size exceeds max allowed size. Total size is at least: 16556745870240. Max allowed size is: 16492674416640.

16,492,674,416,640 bytes is exactly 15 TiB (15 × 2^40 bytes), the maximum size per BigQuery load job. The corresponding September 2020 desktop table weighs in at 10.82 TB and the August 2020 mobile table at 14.47 TB, so it's plausible that the mobile table finally exceeded 15 TB this month. The underlying CrUX dataset continues to grow, so this is another strain on the data pipeline's capacity.

| Table | Rows | Size (TB) |
| --- | ---: | ---: |
| 2020_08_01_desktop | 215,621,667 | 10.99 |
| 2020_08_01_mobile | 270,249,686 | 14.47 |
| 2020_09_01_desktop | 216,083,365 | 10.82 |
| 2020_09_01_mobile | 291,589,220 | 15.06 ❌ |

Here's a look at how the response body sizes were distributed in 2020_08_01_mobile:

| Body size (MB) | Requests | Cumulative weight (GB) | Cumulative requests |
| ---: | ---: | ---: | ---: |
| 90 | 1 | 0.09 | 1 |
| 87 | 1 | 0.17 | 2 |
| 86 | 1 | 0.26 | 3 |
| 84 | 1 | 0.34 | 4 |
| 83 | 1 | 0.42 | 5 |
| 78 | 1 | 0.50 | 6 |
| 75 | 1 | 0.57 | 7 |
| 74 | 1 | 0.64 | 8 |
| 68 | 2 | 0.77 | 10 |
| 62 | 1 | 0.83 | 11 |
| 61 | 2 | 0.95 | 13 |
| 59 | 1 | 1.01 | 14 |
| 56 | 1 | 1.07 | 15 |
| 55 | 2 | 1.17 | 17 |
| 54 | 1 | 1.23 | 18 |
| 50 | 2 | 1.32 | 20 |
| 48 | 3 | 1.46 | 23 |
| 47 | 3 | 1.60 | 26 |
| 45 | 1 | 1.65 | 27 |
| 44 | 5 | 1.86 | 32 |
| 43 | 1 | 1.90 | 33 |
| 41 | 2 | 1.98 | 35 |
| 40 | 2 | 2.06 | 37 |
| 38 | 2 | 2.14 | 39 |
| 37 | 4 | 2.28 | 43 |
| 36 | 5 | 2.46 | 48 |
| 35 | 6 | 2.66 | 54 |
| 34 | 1 | 2.69 | 55 |
| 32 | 2 | 2.76 | 57 |
| 31 | 2 | 2.82 | 59 |
| 30 | 3 | 2.91 | 62 |
| 29 | 3 | 2.99 | 65 |
| 28 | 6 | 3.15 | 71 |
| 27 | 2 | 3.21 | 73 |
| 26 | 7 | 3.38 | 80 |
| 25 | 4 | 3.48 | 84 |
| 24 | 6 | 3.62 | 90 |
| 23 | 25 | 4.18 | 115 |
| 22 | 15 | 4.51 | 130 |
| 21 | 32 | 5.16 | 162 |
| 20 | 136 | 7.82 | 298 |
| 19 | 35 | 8.47 | 333 |
| 18 | 44 | 9.24 | 377 |
| 17 | 50 | 10.07 | 427 |
| 16 | 62 | 11.04 | 489 |
| 15 | 79 | 12.20 | 568 |
| 14 | 140 | 14.11 | 708 |
| 13 | 166 | 16.22 | 874 |
| 12 | 287 | 19.58 | 1,161 |
| 11 | 276 | 22.55 | 1,437 |
| 10 | 536 | 27.78 | 1,973 |
| 9 | 687 | 33.82 | 2,660 |
| 8 | 1,387 | 44.66 | 4,047 |
| 7 | 1,876 | 57.48 | 5,923 |
| 6 | 2,537 | 72.35 | 8,460 |
| 5 | 6,896 | 106.02 | 15,356 |
| 4 | 9,898 | 144.68 | 25,254 |
| 3 | 30,386 | 233.70 | 55,640 |
| 2 | 103,328 | 435.52 | 158,968 |
| 1 | 1,438,555 | 1840.35 | 1,597,523 |
| 0 | 268,652,163 | 1840.35 | 270,249,686 |

Some hand-wavey math later, I think what this is telling me is that if we reduce the row limit from 100 MB back down to 2 MB, we can save up to 606 GB, assuming each row also carries an average of 100 bytes of bookkeeping (page, url, truncated, requestid). That should be enough headroom to offset the dataset growth and get us back under the limit.

I'll rerun the Dataflow job with a limit of 2 MB and report back if it works.
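For context, here's a minimal sketch of what a per-row body limit could look like; this is not the actual bigquery_import.py code, and the limit and UTF-8 handling are illustrative:

```python
# Illustrative only: cap a response body at a byte limit before the row is
# written to BigQuery, and record whether it was cut short.
MAX_BODY_BYTES = 2 * 1024 * 1024  # the 2 MB per-row limit being tested

def truncate_body(body, limit=MAX_BODY_BYTES):
    """Return (body, truncated), cutting the body to at most `limit` bytes."""
    if body is None:
        return None, False
    encoded = body.encode('utf-8')
    if len(encoded) <= limit:
        return body, False
    # Cut at the byte limit, then drop any partial UTF-8 sequence at the end.
    return encoded[:limit].decode('utf-8', errors='ignore'), True
```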

@rviscomi rviscomi added the bug label Oct 2, 2020
@rviscomi
Member Author

rviscomi commented Oct 2, 2020

The Dataflow job was started and should take 8.5 hours to complete. I also want to note that this will result in ~300k response bodies (0.11%) being truncated — not omitted, I think that was a mistake in my math above.

One workaround to the 15 TB limit may be to write to two tables and merge them in a post-load job. But all this work is making me wonder if it's worth the effort. I actively discourage everyone from querying the response_bodies dataset due to its size. At 15 TB and $5/TB it's $75 just to run one query on the mobile table or $125 for both desktop and mobile. Nobody wants to spend that kind of money.
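For reference, if we did go the two-table route, the post-load merge could be a single copy job; a rough sketch using the BigQuery Python client, with hypothetical shard names:

```python
from google.cloud import bigquery

client = bigquery.Client()
# The "_part1"/"_part2" shard names are hypothetical. copy_table accepts
# multiple sources and combines them into the destination table in one job.
copy_job = client.copy_table(
    ["httparchive.response_bodies.2020_09_01_mobile_part1",
     "httparchive.response_bodies.2020_09_01_mobile_part2"],
    "httparchive.response_bodies.2020_09_01_mobile",
)
copy_job.result()  # wait for the copy to finish
```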

With the recent support for $WPT_BODIES in WebPageTest custom metrics, we can analyze the response bodies for any request and emit stats in the form of custom metrics in the pages dataset. For the most part, this was our analytical approach in the Web Almanac for metrics that were dependent on response bodies.

So for researchers interested in querying over all response bodies on BigQuery, we could encourage them to write a new custom metric, which also has the benefit of enriching the cheaper-to-query pages dataset for everyone. Any given pages table is still under the 1 TB free monthly quota. The downside is that this isn't backwards compatible and they wouldn't be able to easily do this for older crawls. We would still have the response bodies in the HAR files on GCS, so if absolutely necessary we could find a way to load a crawl's worth into BQ, maybe using the sharded table approach above.

@paulcalvano @pmeenan WDYT?

@rviscomi
Member Author

rviscomi commented Oct 3, 2020

The job with the 2 MB per-row limit failed as well. It truncated 217,352 response bodies, but the table still exceeded the 15 TB limit.

I'm rerunning it now with a 1.5 MB per row limit.

@rviscomi
Member Author

rviscomi commented Oct 4, 2020

1.5 MB still failing. Experimenting with 2 MB again but omitting the new requestId field.

@rviscomi
Member Author

rviscomi commented Oct 7, 2020

Still failing. Trying a different approach. There seem to be many response bodies for images:

SELECT
  format,
  COUNT(0) AS freq
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  client = 'mobile' AND
  type = 'image' AND
  body IS NOT NULL
GROUP BY
  format
ORDER BY
  freq DESC
| format | freq |
| --- | ---: |
| svg | 7505138 |
| gif | 957900 |
| ico | 489250 |
| png | 101415 |
| jpg | 93750 |
| webp | 938 |
| *(empty)* | 56 |

Let's try omitting the bodies of any non-SVG images, combined with the 2 MB truncation.
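Roughly, the filter being described (field names and byte counting are simplified, not the actual pipeline code):

```python
# Illustrative sketch: keep SVG image bodies, omit other image formats, and
# truncate whatever is kept to the 2 MB per-row limit.
MAX_BODY_BYTES = 2 * 1024 * 1024

def filter_body(content_type, body, limit=MAX_BODY_BYTES):
    if body is None:
        return None
    if content_type.startswith('image/') and content_type != 'image/svg+xml':
        return None  # omit non-SVG image bodies entirely
    encoded = body.encode('utf-8')
    return encoded[:limit].decode('utf-8', errors='ignore')
```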

@housseindjirdeh

housseindjirdeh commented Mar 17, 2021

It also looks like the response_bodies tables stop after October 2020 for both mobile and desktop. I assume they're not completing successfully for this same reason?

@philipwalton

Also (FYI), a number of people internally at Google have suddenly started to notice this (many related to perf questions) because they've been unable to query the data.

I understand the argument that most people probably don't want to be paying for queries against the response bodies table, but for folks who aren't paying for it, there may still be value in being able to perform ad-hoc queries.

@paulcalvano
Contributor

Last year the response bodies processing was commented out, since the dataset size exceeded 15 TB and was causing issues with the Dataflow pipeline. You can see where it's commented out here: https://github.com/HTTPArchive/bigquery/blob/master/dataflow/python/bigquery_import.py#L331

The HAR files still exist in GCS, so once this is resolved we could backfill the data.

@rviscomi
Member Author

rviscomi commented Mar 31, 2021

+1 to @paulcalvano

I'll add that if the metrics are useful for everyone, we should consider baking them into the pages dataset either via custom WPT metrics or calculating them in the Dataflow pipeline. Both approaches have access to the raw response bodies upstream from BigQuery. If the metrics are less universally useful, we could either try one of the previous approaches on a one-off basis or I can suggest how to write your own Dataflow pipeline to ingest the response bodies from GCS and generate the metrics as part of your own GCP project.
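To make the "write your own Dataflow pipeline" option a little more concrete, here's a very rough sketch of reading HAR files straight from GCS and emitting a simple per-file metric, without touching BigQuery at all. The bucket path, file layout, and metric are all made up for illustration:

```python
import json
import apache_beam as beam
from apache_beam.io import fileio

def body_stats(har_text):
    """Toy metric: request count and total response body bytes per HAR file."""
    har = json.loads(har_text)
    entries = har.get('log', {}).get('entries', [])
    total = sum(len(e.get('response', {}).get('content', {}).get('text') or '')
                for e in entries)
    return json.dumps({'requests': len(entries), 'body_bytes': total})

with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles('gs://your-bucket/crawls/2020_10_01_mobile/*.har')
     | fileio.ReadMatches()
     | beam.Map(lambda f: f.read_utf8())
     | beam.Map(body_stats)
     | beam.io.WriteToText('gs://your-bucket/metrics/2020_10_01_mobile'))
```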

Sorry for the inconvenience @housseindjirdeh @philipwalton!

@rviscomi rviscomi changed the title response_bodies.2020_09_01_mobile missing from BigQuery Response bodies tables too big to load into BigQuery in one go (15 TB) Apr 21, 2021
@rviscomi
Member Author

rviscomi commented Apr 21, 2021

I'd like to repurpose this issue from triage to designing a solution.

One issue with the current pipeline is that we wait until the crawl is done before generating the BQ tables. It may be better to insert test data one row at a time as soon as it's ready. That avoids huge 15+ TB load jobs. It also makes partial data available sooner, though on the other hand I can see how it would be confusing for a table to appear incomplete. This would require a significant change to the data pipeline.

Another option is to shard the data before loading. There are a couple of ways we could do this: naively bisecting the tables so that each half is roughly 8 TB and combining them in a post-processing step, or staging a sharded, serialized copy of the table on GCS so that BQ can reassemble the shards in a single load job.
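For the GCS option, the load side might look something like this (the bucket and shard layout are made up, and we'd need to confirm the per-load-job size limits still allow a single job this big):

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
# One load job picks up every staged shard matching the wildcard.
load_job = client.load_table_from_uri(
    'gs://httparchive-staging/response_bodies/2020_09_01_mobile/shard-*.json.gz',
    'httparchive.response_bodies.2020_09_01_mobile',
    job_config=job_config,
)
load_job.result()
```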

I'm most optimistic about the GCS sharding option, but if a total GCP rewrite of the test pipeline is in our future, maybe streaming inserts are the way to go. @pmeenan @paulcalvano @tunetheweb any thoughts/suggestions on the path forward?

@pmeenan
Member

pmeenan commented Apr 21, 2021

Is there a table size limit as well, or is it just the load job? If you stream individual files as they arrive, then you need to remember which files you have uploaded. Two possible alternatives:

1 - Keep the processing at the end of the crawl, but break it up (shard) and process one day at a time. The test IDs are all prefixed by the day they were run, so that should give ~15-20 shards, all well under the limit.

2 - Process each "chunk" of tests as they are uploaded. WPT for HA (because of IA legacy reasons) batches the tests into groups of 5-10k (YYMMDD_GROUP_xxxxx). They are all uploaded as a group, though that was something I was hoping to move away from. The upload process can create a .done file for each group, and the load job would just need to keep track of the groups, not individual tests.

Option 1 feels the cleanest, with the least amount of additional tracking.

Could we have an in-process set of data tables that we load data into and then rename them to the crawl table name when all of the load jobs are done?
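A rough sketch of how option 1 plus the in-process table idea could fit together with the BigQuery client (the scratch dataset, day range, and GCS URIs are all illustrative): append one load job per day into a working table, then copy it to the final crawl table name once every load succeeds.

```python
from google.cloud import bigquery

client = bigquery.Client()
working = 'httparchive.scratch.response_bodies_2020_09_01_mobile'  # in-process table
final = 'httparchive.response_bodies.2020_09_01_mobile'

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
for day in range(1, 16):  # ~15 day-prefixed shards, each well under the limit
    uri = f'gs://httparchive-staging/2020_09_01_mobile/2009{day:02d}_*.json.gz'
    client.load_table_from_uri(uri, working, job_config=job_config).result()

# "Rename" by copying the working table to the final crawl table name.
client.copy_table(working, final).result()
```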

@jeffposnick

If you do go with an approach that streams chunks/shards/etc. of data into a table, then one request is to ensure that there's some cheap-to-query flag that gets set once you're reasonably confident that the data for a given month is complete.

I currently have some automated BigQuery logic that reads from this table, and I'd like to ensure that I could delay executing any expensive queries until a month's results are finalized.
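As a sketch of what that could look like from the consumer side (the status table name and schema here are hypothetical; nothing like it exists today):

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("""
    SELECT complete
    FROM `httparchive.crawl_status.response_bodies`  -- hypothetical status table
    WHERE date = '2021-05-01' AND client = 'mobile'
""").result()

if not any(row.complete for row in rows):
    raise SystemExit('Crawl not finalized yet; skipping expensive queries.')
```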

@rviscomi rviscomi pinned this issue Apr 22, 2021
@rviscomi
Member Author

rviscomi commented May 14, 2021

We were able to remediate @jeffposnick's Workbox use case by creating a pwa.js custom metric in HTTPArchive/legacy.httparchive.org#198. @housseindjirdeh could you tell me more about your use case to query the response_bodies dataset? I'm hoping we can find a similar workaround. @philipwalton do you know about any other use cases?

@rviscomi
Member Author

Fixed by HTTPArchive/bigquery#123

@rviscomi rviscomi unpinned this issue Aug 18, 2021