Tech report: category-level aggregations #928

@rviscomi

Use case: when comparing technologies within the same category, it can be useful to know how they all compare to some kind of category-level aggregation over all pages within the category.

Mockup: [image]

The blue line represents an aggregation of all pages within the CMS category, so a user can see how it compares to specific technologies within that category. It could also be possible to compare entire categories.

The technical implementation could look something like this:

  • update the technologies table schema to include a field indicating whether the row pertains to a technology or a category aggregation
    • all dimensions supported: rank, client, geo
    • backfill all historical data
  • provide a param in the API endpoints to distinguish between the two, only returning data for the selected aggregation type (default: technology)
  • add categories to the UI, similar to the special "ALL" technology
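The API parameter described above could behave roughly like the following sketch. The function name `filter_rows`, the parameter name `aggregation_type`, and the sample rows are hypothetical and only illustrate the proposed default of `technology`; the real endpoints may look quite different.

```python
from typing import Literal

AggregationType = Literal["technology", "category"]

# Hypothetical rows; `type` is the proposed new schema field.
ROWS = [
    {"type": "technology", "category": "CMS", "app": "WordPress", "client": "desktop"},
    {"type": "category", "category": "CMS", "app": "All CMSs", "client": "desktop"},
]


def filter_rows(rows: list[dict], aggregation_type: AggregationType = "technology") -> list[dict]:
    """Return only the rows for the selected aggregation type (assumed default: technology)."""
    if aggregation_type not in ("technology", "category"):
        raise ValueError(f"unknown aggregation type: {aggregation_type!r}")
    return [row for row in rows if row["type"] == aggregation_type]


# Existing callers that omit the parameter keep getting technology rows only.
technology_rows = filter_rows(ROWS)
category_rows = filter_rows(ROWS, "category")
```

Defaulting to `technology` would keep existing API consumers unaffected once category rows are added to the same table.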

In terms of the schema changes, we currently have the following fields:

  • date (2024-08-01)
  • geo (ALL)
  • rank (ALL)
  • category (CMS)
  • app (WordPress)
  • client (desktop)
  • [stats] where each field is aggregated over the set of pages that use WordPress for the given dimensions

The updated schema would look something like this for the CMS-level aggregation:

  • date (2024-08-01)
  • type (category)
  • geo (ALL)
  • rank (ALL)
  • category (CMS)
  • app (All CMSs)
  • client (desktop)
  • [stats] where each field is aggregated over the set of pages that use one or more CMSs for the given dimensions
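As a rough illustration of the two row shapes, here is a sketch of the updated schema as a Python dataclass. The class name `TechReportRow` and the validation logic are assumptions made for this example; the field names and sample values come from the lists above, and the stats fields are omitted.

```python
from dataclasses import dataclass

VALID_TYPES = ("technology", "category")


@dataclass(frozen=True)
class TechReportRow:
    """Hypothetical model of one row's dimensions, without the stats fields."""
    date: str
    type: str  # proposed new field: "technology" or "category"
    geo: str
    rank: str
    category: str
    app: str
    client: str

    def __post_init__(self) -> None:
        if self.type not in VALID_TYPES:
            raise ValueError(f"type must be one of {VALID_TYPES}, got {self.type!r}")


# Existing technology-level row, with the new field filled in.
wordpress = TechReportRow("2024-08-01", "technology", "ALL", "ALL", "CMS", "WordPress", "desktop")

# New category-level row covering every page that uses one or more CMSs.
all_cms = TechReportRow("2024-08-01", "category", "ALL", "ALL", "CMS", "All CMSs", "desktop")
```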

Calculating category-level data from the technology-level aggregations won't work because percentiles cannot be accurately combined. At best we'd be able to take a weighted average of the medians, but that still would not deduplicate origins that appear multiple times in a category because they use multiple technologies. For example, jQuery UI is always used with jQuery within the JS libraries category, so those websites would be counted twice. The implementation would therefore need to process the raw origin-level data.
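A small made-up example shows both problems. The origins, metric values, and function names below are invented for illustration and are not HTTP Archive data; the sketch only contrasts a weighted average of per-technology medians with a median taken over deduplicated origins.

```python
from statistics import median

# Hypothetical origin-level data: origin -> (metric value, technologies used).
ORIGINS = {
    "a.example": (100, {"jQuery", "jQuery UI"}),
    "b.example": (200, {"jQuery", "jQuery UI"}),
    "c.example": (300, {"jQuery"}),
    "d.example": (1000, {"React"}),
    "e.example": (1100, {"React"}),
}


def per_technology(origins: dict) -> dict:
    """Group metric values by technology; an origin appears once per technology it uses."""
    groups: dict[str, list[int]] = {}
    for value, techs in origins.values():
        for tech in techs:
            groups.setdefault(tech, []).append(value)
    return groups


def weighted_average_of_medians(origins: dict) -> float:
    """Approximate the category from technology-level medians, weighted by row count."""
    groups = per_technology(origins)
    total = sum(len(values) for values in groups.values())
    return sum(median(values) * len(values) for values in groups.values()) / total


def category_median(origins: dict) -> float:
    """Median over the raw origins, each counted exactly once."""
    return median(value for value, _ in origins.values())


counted = sum(len(values) for values in per_technology(ORIGINS).values())
print(counted, len(ORIGINS))                 # 7 technology-level entries vs 5 distinct origins
print(weighted_average_of_medians(ORIGINS))  # about 428.6
print(category_median(ORIGINS))              # 300
```

In this toy data the two jQuery UI origins are counted twice, and the weighted average lands well away from the true category median, which is why the category row has to be computed from origin-level data.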
