
Desktop Summary Requests incomplete #135

Open · dougsillars opened this issue Mar 26, 2019 · 8 comments

@dougsillars commented Mar 26, 2019

February 2019 Summary Requests table: 272M rows, 240 GB

March 2019 Summary Requests table: 5M rows, 5 GB

It appears a large amount of data is missing from the March crawl. The raw data files at https://legacy.httparchive.org/downloads.php also differ in size.
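
For reference, the discrepancy can be reproduced directly in BigQuery. A minimal sketch, assuming the public httparchive.summary_requests dataset's usual YYYY_MM_DD_client table naming:

-- Illustrative row-count comparison of the two crawls:
SELECT
  (SELECT COUNT(0) FROM `httparchive.summary_requests.2019_02_01_desktop`) AS feb_rows,
  (SELECT COUNT(0) FROM `httparchive.summary_requests.2019_03_01_desktop`) AS mar_rows;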

@rviscomi rviscomi added the bug label Mar 26, 2019

@rviscomi (Member) commented Mar 26, 2019

Confirmed that the downloads are serving a file that is far too small.

Next step is to try rerunning the mysqldump:

https://github.com/HTTPArchive/legacy.httparchive.org/blob/9ef583089600d05093c4992a0c92e77f00c26ae8/bulktest/update.php#L214

@rviscomi rviscomi self-assigned this Mar 26, 2019

@rviscomi (Member) commented Mar 28, 2019

The local mysql tables seem to have been cleared out with the exception of the requests table:

mysql> select count(0) from requests;
+----------+
| count(0) |
+----------+
|  5119678 |
+----------+
1 row in set (0.00 sec)

mysql> select count(0) from requestsdev;
+----------+
| count(0) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> select count(0) from requestsmobile;
+----------+
| count(0) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> select count(0) from requestsmobiledev;
+----------+
| count(0) |
+----------+
|        0 |
+----------+
1 row in set (0.01 sec)

The requests that are in that table are only from tests on March 1:

mysql> select min(startedDateTime), max(startedDateTime) from requests;
+----------------------+----------------------+
| min(startedDateTime) | max(startedDateTime) |
+----------------------+----------------------+
|           1551418211 |           1551432347 |
+----------------------+----------------------+
1 row in set (0.00 sec)
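
As a quick sanity check (mine, not from the original thread), converting those epoch bounds with FROM_UNIXTIME confirms that both fall on March 1 when the session time zone is UTC:

-- 1551418211 -> 2019-03-01 05:30:11 UTC
-- 1551432347 -> 2019-03-01 09:25:47 UTC
SELECT FROM_UNIXTIME(1551418211) AS first_request,
       FROM_UNIXTIME(1551432347) AS last_request;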

So this is why the mysqldump of the requests table is only yielding 647 MB of data.

Not sure what happened to the requests table to cut it short, or why only desktop was affected. Also not sure if we have any other backups available. The good news is that we do have the HAR files for all of these requests, so it's not a total loss of data, but we would still need to convert the HAR data to the schema of the CSV-based summary tables. This is doable but would require some time. It's also been on our todo list as part of the mysql deprecation. See #23
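
For illustration, a rough sketch of that HAR-to-summary conversion in BigQuery, assuming the httparchive.requests tables where the payload column holds each HAR entry as a JSON string (output names loosely mirror the summary_requests schema; the real conversion would need many more fields):

-- Hypothetical sketch; JSON paths follow the HAR spec.
SELECT
  page,
  url,
  JSON_EXTRACT_SCALAR(payload, '$.request.method')                   AS method,
  CAST(JSON_EXTRACT_SCALAR(payload, '$.response.status') AS INT64)   AS respStatus,
  CAST(JSON_EXTRACT_SCALAR(payload, '$.response.bodySize') AS INT64) AS respBodySize,
  JSON_EXTRACT_SCALAR(payload, '$.response.content.mimeType')        AS mimeType
FROM `httparchive.requests.2019_03_01_desktop`;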

I'm still mildly concerned that this is a problem that might happen again, so it's best to keep an eye on the April crawl, especially around the 15th of the month when @pmeenan noticed a suspicious drop in disk space.

@pmeenan commented Mar 28, 2019

FWIW, the requests tables get dropped after the mysqldump completes, so it's not unusual for them to be empty after the crawl, but it looks like something triggered the drop mid-crawl for the desktop data :(

@rviscomi (Member) commented Mar 28, 2019

Yeah, it seems something nuked the table before we could do our backups. That said, I'm curious how we ended up with a partial requests table if it's supposed to be dropped only after each mysqldump.

Here's how it should work. There's a cron job to run batch_process.php every 30 minutes. batch_process will kick off the mysqldump when the crawl is complete:

https://github.com/HTTPArchive/legacy.httparchive.org/blob/7a5710dc83dd4ca7bb204573fd3fa58c5ea2c1f0/bulktest/batch_process.php#L41-L82

https://github.com/HTTPArchive/legacy.httparchive.org/blob/7a5710dc83dd4ca7bb204573fd3fa58c5ea2c1f0/bulktest/copy.php#L96-L100

https://github.com/HTTPArchive/legacy.httparchive.org/blob/6d1a872a3270360a14eb018871544a0c9c8adf28/crawls.inc#L285-L332

@rviscomi rviscomi assigned paulcalvano and unassigned rviscomi Apr 27, 2019

@rviscomi (Member) commented Apr 27, 2019

Reassigning to Paul; he's got a conversion sheet going to recreate the summary requests data.

@rviscomi (Member) commented May 4, 2019

Paul and I made lots of progress on this. Here's a table with the summary_requests schema generated from the HARs: https://bigquery.cloud.google.com/table/httparchive:scratchspace.requests_2019_04_01_desktop?tab=preview

Would appreciate another set of eyes to make sure the results look good.
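
One cheap check (a suggestion, not from the thread): the regenerated table's row count should be the same order of magnitude as a known-good month (~272M rows in February):

SELECT
  (SELECT COUNT(0) FROM `httparchive.scratchspace.requests_2019_04_01_desktop`) AS regenerated,
  (SELECT COUNT(0) FROM `httparchive.summary_requests.2019_02_01_desktop`)      AS known_good;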

@rviscomi (Member) commented May 10, 2019

Noticed today that the summary_pages tables are off as well. Metrics like total font size are calculated from the underlying requests, so in their absence the summary page stats come out as 0.

We'll need to write a query that aggregates requests for each page and computes the summary stats.
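
A minimal sketch of that rollup, assuming summary_requests-style columns (pageid, type, and respBodySize are illustrative; the real summary_pages schema has many more stats):

SELECT
  pageid,
  COUNT(0)                                AS reqTotal,
  SUM(respBodySize)                       AS bytesTotal,
  COUNTIF(type = 'font')                  AS reqFont,
  SUM(IF(type = 'font', respBodySize, 0)) AS bytesFont
FROM `httparchive.scratchspace.requests_2019_04_01_desktop`
GROUP BY pageid;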
