Triage all proposed metrics (396 of 396 done) #33

Closed
rviscomi opened this issue Jun 4, 2019 · 24 comments

@rviscomi
Member

rviscomi commented Jun 4, 2019

Assigned: @HTTPArchive/data-analysts team

Due date: No later than July 1

Any metrics that require augmenting the test infrastructure (eg custom metrics) must be ready to go when the July crawl starts. This ensures that when the crawl completes at the end of July, we can query the dataset and pass it off to authors for interpretation in August.

As of now there are 350+ metrics spread over 20 chapters.

Part  Chapter             Able To Query  Not Feasible  Grand Total
I     01. JavaScript      24             1             25
I     02. CSS             39             7             46
I     03. Markup          4              1             5
I     04. Media           20             5             25
I     05. Third Parties   13                           13
I     06. Fonts           40             7             47
II    07. Performance     24                           24
II    08. Security        36             5             41
II    09. Accessibility   32             6             38
II    10. SEO             15                           15
II    11. PWA             6                            6
II    12. Mobile web      19             2             21
III   13. Ecommerce       10             3             13
III   14. CMS             11             1             12
IV    15. Compression     3              1             4
IV    16. Caching         14             1             15
IV    17. CDN             13             3             16
IV    18. Page Weight     3                            3
IV    19. Resource Hints  10                           10
IV    20. HTTP/2          14             3             17
      Grand Total         350            46            396

I've copied all of the metrics for each chapter to this sheet (named "Metrics Triage"). To edit the sheet please give me your email address to add to the editors list. What we need to do is go through the list of metrics for each chapter and assign a status from one of the following:

  • To Be Reviewed
  • Need More Info
  • Not Feasible
  • Able To Query
  • Custom Metric Required
  • Custom Metric Written
  • Query Written

The lifecycle is:

  • All metrics start as TBR
    • Move to NMI if the metric is vaguely worded or it is otherwise unclear what is being asked for. Get in touch with the chapter author(s) and straighten out what the expected data should look like.
    • Move to NF if the metric cannot be queried using the HTTP Archive dataset or other publicly available datasets on BigQuery (eg CrUX). This is the "done" state for metrics which cannot progress any further.
    • Move to ATQ if the metric can be queried from the dataset based on the latest schema.
      • Move to QW if the metric has a corresponding query written. This is the ideal "done" state for all metrics.
    • Move to CMR if the metric can only be queried with the addition of a custom metric
      • Move to CMW if the metric has had a corresponding custom metric written. Metrics in this state must also have a corresponding query written and moved to QW when complete.
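
The lifecycle above can be read as a small state machine. Here is a rough sketch of it, purely for illustration. The `TRANSITIONS` table and the `canMove`/`isDone` helpers are hypothetical names, not part of any HTTP Archive tooling, and the moves out of NMI are an assumption (the list only says how a metric gets into NMI).

```javascript
// Allowed status moves, transcribed from the lifecycle list.
// Abbreviations match the ones used in this issue.
const TRANSITIONS = {
  TBR: ['NMI', 'NF', 'ATQ', 'CMR'],
  // Assumption: once clarified (or abandoned), an NMI metric is re-triaged.
  NMI: ['NF', 'ATQ', 'CMR'],
  ATQ: ['QW'],
  CMR: ['CMW'],
  CMW: ['QW'],
  NF: [], // "done": cannot progress any further
  QW: [], // "done": the ideal end state
};

// Returns true if a metric may move directly from one status to another.
function canMove(from, to) {
  const next = TRANSITIONS[from];
  if (!next) throw new Error(`Unknown status: ${from}`);
  return next.includes(to);
}

// A status is "done" when no further move is defined for it.
function isDone(status) {
  if (!(status in TRANSITIONS)) throw new Error(`Unknown status: ${status}`);
  return TRANSITIONS[status].length === 0;
}
```

For example, `canMove('CMR', 'QW')` is false under this reading, because a custom metric has to be written (CMW) before its query counts as written.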

Custom metrics should only be added as a last resort and must adhere to strict performance requirements. We test on millions of pages so any complex/slow scripts would impede the crawl. Because we anticipate needing many custom metrics, we'll implement everything as individual functions within a single custom metric whose output is a JSON-encoded object with each result as its own sub-property. More on this when we get there.
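
To make the "individual functions within a single custom metric" idea a bit more concrete, here is a minimal sketch of one possible shape. The sub-metric names (`imageCount`, `hasViewportMeta`) and the `runAll` helper are made up for illustration and are not the actual implementation. The only things taken from the plan above are that each result becomes its own sub-property and that the output is JSON-encoded.

```javascript
// Hypothetical sub-metrics. Each one takes a Document-like object and
// should stay cheap, since the script runs on every page in the crawl.
const subMetrics = {
  imageCount: (doc) => doc.querySelectorAll('img').length,
  hasViewportMeta: (doc) =>
    doc.querySelector('meta[name="viewport"]') !== null,
};

// Runs every sub-metric and returns one JSON-encoded object with each
// result as its own property.
function runAll(doc, metrics) {
  const result = {};
  for (const [name, fn] of Object.entries(metrics)) {
    try {
      result[name] = fn(doc);
    } catch (err) {
      // Design choice in this sketch: a failing sub-metric is recorded
      // as null so it does not discard the other results.
      result[name] = null;
    }
  }
  return JSON.stringify(result);
}
```

In a page context this would be called as `runAll(document, subMetrics)`.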

Add your name in the Analyst column to take responsibility for moving it through the metric lifecycle.

Once we're ready to begin writing queries, we will create a thread on https://discuss.httparchive.org for each chapter, listing all queryable metrics. Hopefully we can crowdsource some of the querying by tapping into the power users on the forum.

@tjmonsi
Contributor

tjmonsi commented Jun 4, 2019 via email

rviscomi added the "analysis" label (Querying the dataset) Jun 4, 2019
rviscomi added this to TODO in Web Almanac 2019 via automation Jun 5, 2019
rviscomi moved this from TODO to In Progress in Web Almanac 2019 Jun 5, 2019
@rviscomi
Member Author

rviscomi commented Jun 6, 2019

@HTTPArchive/data-analysts reminder to please go through the Metrics Triage sheet when you have the time.

There was a lot of info in the first post so here's a condensed version:

  1. Request edit access to the sheet. I don't have everyone's email address; otherwise I'd give access now.
  2. Go through the Metrics Triage tab and add your GitHub name to the Analyst column for any metrics you'll be responsible for.
  3. Triage metrics marked To Be Reviewed and change their status depending on their feasibility.

The next step will be to start writing queries and custom metrics using the HTTP Archive forum to discuss solutions.

rviscomi added the "ASAP" label (This issue is blocking progress) Jun 11, 2019
rviscomi changed the title from "Triage all proposed metrics" to "Triage all proposed metrics (42 of 281 done)" Jun 11, 2019
@ymschaap
Contributor

ymschaap commented Jun 14, 2019

I understand we can create custom metrics for the next crawl, which is really cool. I'm just unsure what this enables. For example, for the SEO chapter we would want to count the number of h1, h2, and h3 elements and their string lengths. How would I go about creating a custom metric? Do you have an example of a custom metric (e.g. a piece of code)? Are there docs? Who writes and tests the code?

Once I understand the custom metrics capabilities, I could fill out the Metrics Triage sheet.

@rviscomi
Member Author

rviscomi commented Jun 14, 2019

Good question! Custom metrics are JS snippets you can execute on each page. They are run by our legacy crawl system and the code for existing metrics is here: https://github.com/HTTPArchive/legacy.httparchive.org/tree/master/custom_metrics

For example, see the doctype custom metric. To test it, you can run it directly on webpagetest.org under the "Custom" tab:

image

Note that all WPT custom metrics must have [metricName] at the start of the script. This is excluded in the HTTP Archive code and generated automatically based on the file name.
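
As a rough sketch of what a custom metric for the heading question above might look like (not an existing metric in the repo), something along these lines could work. The metric name `headings` and the function name `summarizeHeadings` are hypothetical. The sketch assumes the WPT script body is run as a function whose return value becomes the metric.

```javascript
// When pasted into the WPT "Custom" tab, the script would be preceded by
// a name line, for example:
//   [headings]
// and the body would end with:
//   return summarizeHeadings(document);
// Both names are placeholders chosen for this sketch.
function summarizeHeadings(doc) {
  const summary = {};
  for (const tag of ['h1', 'h2', 'h3']) {
    const nodes = Array.from(doc.querySelectorAll(tag));
    // Total length of the trimmed text content across all headings of this level.
    const totalLength = nodes.reduce(
      (sum, node) => sum + (node.textContent || '').trim().length,
      0
    );
    summary[tag] = { count: nodes.length, totalLength };
  }
  // JSON-encode so a structured result fits in a single metric value.
  return JSON.stringify(summary);
}
```

Taking the document as a parameter is only there so the function can be tried outside a browser with a stub object.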

You'll see the output in the WPT results:

image

For complex metrics like almanac.js you will need to inspect the JSON results directly to see the output. The test ID for the results is in the URL. Simply append ?f=json to see the JSON results. For example: http://webpagetest.org/result/190624_6W_f5211bdf38d897fb4cb5a4f0872eb1f6/?f=json

Then you can find the custom metric by going to data.median.firstView.almanac:

image
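
If you'd rather pull the value out programmatically than click through the JSON, a small helper like the following might do. `getCustomMetric` is a hypothetical name, and whether the stored value arrives as a JSON string or as an already-parsed object is an assumption, so the sketch handles both.

```javascript
// Extracts a custom metric from parsed WPT JSON results, following the
// data.median.firstView.<name> path described above.
function getCustomMetric(wptJson, name) {
  const firstView =
    wptJson && wptJson.data && wptJson.data.median &&
    wptJson.data.median.firstView;
  if (!firstView || !(name in firstView)) return undefined;

  const value = firstView[name];
  if (typeof value === 'string') {
    // Assumption: a JSON-encoded metric may be stored as a string.
    try {
      return JSON.parse(value);
    } catch (err) {
      return value; // plain string metric, return as-is
    }
  }
  return value;
}

// Usage, in an environment that provides fetch (browsers, recent Node):
//   fetch(resultUrl + '?f=json')
//     .then((res) => res.json())
//     .then((json) => console.log(getCustomMetric(json, 'almanac')));
// resultUrl stands for the result page URL of your own test.
```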

Let me know if you have any other questions!

rviscomi changed the title from "Triage all proposed metrics (42 of 281 done)" to "Triage all proposed metrics (64 of 318 done)" Jun 17, 2019
@patrickhulce
Contributor

Sorry if I missed this somewhere, but do we need to do something extra to get the right permissions to query the sample datasets created in #34 and/or have our test queries not billed to us individually? :)

@rviscomi
Member Author

I've updated the permissions of the sample_data dataset so anyone can query it.

The goal for that dataset is to allow @HTTPArchive/data-analysts to explore the schema and validate their queries. The table sizes should be small enough so any queries fit comfortably within the free monthly quota. When we run the analysis against the full dataset, I hope to have BQ credits for everyone to cover any expenses.

@rviscomi
Member Author

rviscomi commented Jun 19, 2019

@HTTPArchive/data-analysts we're behind on triaging all of the metrics, so I think we need to take a different approach. There are 350 metrics and 12 analysts, so that's an average of about 30 metrics per analyst (350 / 12 ≈ 29.2). If we divide and conquer that way, we should be able to meet the July 1 deadline. I'll go through the triage sheet and assign each analyst approximately 30 metrics, grouped by chapter. I'll update this issue with a table of the assignments.
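
For anyone who wants to check the per-analyst arithmetic, here is a quick sketch. The variable names are just for illustration, and a perfectly even split is an idealization, since real assignments are grouped by chapter.

```javascript
const totalMetrics = 350;
const analysts = 12;

// Exact average load, roughly 29.17 metrics per analyst.
const exactAverage = totalMetrics / analysts;

// Rounded up, nobody needs more than this many in an even split.
const perAnalystCeiling = Math.ceil(exactAverage);

// Even split: everyone gets `base`, and `remainder` analysts get one extra.
const base = Math.floor(exactAverage);
const remainder = totalMetrics % analysts;

console.log({ exactAverage, perAnalystCeiling, base, remainder });
```

So an even split would be 2 analysts with 30 metrics and 10 analysts with 29.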


I've updated the sheet with Analyst assignments and updated the summary table with each analyst's total metric status.

@khempenius and @patrickhulce since you're both authors and expressed interest only in taking on analyst roles for your respective chapters, I didn't add you to any new chapters. @fhoffa I coaxed you into this so I didn't give you too many metrics to work on. Let me know if any of you are willing to take on more metrics; it'd be a big help.

@beouss you expressed an interest in joining the team but never accepted your invitation. If you're still interested I'll assign you some metrics.

rviscomi changed the title from "Triage all proposed metrics (64 of 318 done)" to "Triage all proposed metrics (88 of 350 done)" Jun 19, 2019
rviscomi changed the title from "Triage all proposed metrics (88 of 350 done)" to "Triage all proposed metrics (85 of 372 done)" Jun 19, 2019
rviscomi changed the title from "Triage all proposed metrics (388 of 396 done)" to "Triage all proposed metrics (390 of 396 done)" Jun 30, 2019
rviscomi changed the title from "Triage all proposed metrics (390 of 396 done)" to "Triage all proposed metrics (396 of 396 done)" Jun 30, 2019
@rviscomi
Member Author

Today's the day! I've marked all 5 remaining Need More Info metrics as Not Feasible. We're finally done with the triage! Thanks again to the entire @HTTPArchive/data-analysts team for your hard work going through these ~400 metrics.

I'll be syncing the custom metrics with the HTTP Archive server today so they're included in tomorrow's July crawl.
