Country-level origin counts should always increase by rank #13

rviscomi · 2022-03-28T21:18:47Z

@tunetheweb and @shappir pointed out that the number of origins for a given technology and country are not necessarily increasing by rank.

SELECT
  *
FROM
  `httparchive.core_web_vitals.technologies`
WHERE
  date = '2022-02-01' AND
  app = 'React' AND
  geo = 'United States of America' AND
  client = 'mobile'
ORDER BY
  rank

rank	origins
1,000	247
10,000	1,434
100,000	10,981
1,000,000	82,988
10,000,000	45,316

According to the results, there are 83k React websites in the US among the top 1M. However, in the top 10M segment, there are only 45k websites. This doesn't make sense because every website in the top 1M should also be in the top 10M.

The way this table is generated should be counting every website in the more popular ranks among the lesser popular ranks:

cwv-tech-report/sql/monthly.sql

Lines 86 to 92 in 3e64d0b

    
           FROM 
        
             geo_summary, 
        
             UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS _rank 
        
           WHERE 
        
             date = (SELECT * FROM RELEASE_DATE) AND 
        
             device IN ('desktop', 'phone') AND 
        
             rank <= _rank

Something is clearly not working.

tunetheweb · 2022-03-28T21:25:28Z

Are you sure this is the SQL used as I can't get this bit to run:

cwv-tech-report/sql/monthly.sql

Lines 50 to 63 in 3e64d0b

    
           geo_summary AS ( 
        
             SELECT 
        
               CAST(REGEXP_REPLACE(CAST(yyyymm AS STRING), r'(\d{4})(\d{2})', r'\1-\2-01') AS DATE) AS date, 
        
               * EXCEPT (country_code), 
        
               `chrome-ux-report`.experimental.GET_COUNTRY(country_code) AS geo 
        
             FROM 
        
               `chrome-ux-report.materialized.country_summary` 
        
           UNION ALL 
        
             SELECT 
        
               * EXCEPT (yyyymmdd, p75_fid_origin, p75_cls_origin, p75_lcp_origin), 
        
               'ALL' AS geo 
        
             FROM 
        
               `chrome-ux-report.materialized.device_summary` 
        
           ), crux AS (

The date is missing from the second UNION clause.

rviscomi · 2022-03-28T21:27:53Z

Yes that query was used. It no longer works because the responsiveness metric has since been added to the materialized CrUX tables. So to get it working you need to add p75_responsiveness_origin to line 59. I was planning to commit that change when I have a fix for this issue or when I rerun it with next month's data.

tunetheweb · 2022-03-28T21:44:34Z

This is interesting:

SELECT device, rank, COUNT(0) AS count
FROM `chrome-ux-report.materialized.country_summary`
WHERE yyyymm = 202202 AND
  country_code = 'us'
GROUP BY device, rank
ORDER BY device, rank

device	rank	count
desktop	1000	958
desktop	10000	8748
desktop	100000	87534
desktop	1000000	835616
desktop	10000000	501673
phone	1000	895
phone	10000	8341
phone	100000	84990
phone	1000000	827078
phone	10000000	493310
tablet	1000	612
tablet	10000	5462
tablet	100000	43822
tablet	1000000	54252
tablet	10000000	1795

Last number looks wrong in all of them

pmeenan · 2022-03-28T22:03:55Z

Unless the materialized tables are already accounting for it, the rank on the raw data is exclusive of other rank groups (each page belongs to a single rank) so the numbers are exclusive of the other ranks.

pmeenan · 2022-03-28T22:04:56Z

Sorry, at least they are in the HA data. I assume the same is true for the CrUX data where each origin belongs to a single rank.

tunetheweb · 2022-03-28T22:19:44Z

Ah yes that's true. Should have used this SQL and that works:

SELECT device, _rank, COUNT(0) AS count, COUNT(DISTINCT origin) as origins
FROM `chrome-ux-report.materialized.country_summary`,
UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS _rank
WHERE yyyymm = 202202 AND
  country_code = 'us' AND
  rank <= _rank
GROUP BY device, _rank
ORDER BY device, _rank

Weird that the 10m bucket (which we know is not complete and doesn't include 1m as you say so is the 1.00001m - 8m bucket approx.) has less US sites that 1m bucket. But not impossible I guess.

rviscomi · 2022-03-28T22:59:56Z

Ok I'm closer to the answer.

WITH tech AS (
SELECT DISTINCT
  url
FROM
  `httparchive.technologies.2022_02_01_mobile`
WHERE
  app = 'React'
), crux AS (
SELECT
  CONCAT(origin, '/') AS url,
  rank
FROM
  `chrome-ux-report.materialized.country_summary`
WHERE
  yyyymm = 202202 AND
  country_code = 'us' AND
  device = 'phone'
)

SELECT
  rank,
  COUNT(0) AS origins
FROM
  tech
JOIN
  crux
USING
  (url)
GROUP BY
  rank
ORDER BY
  origins

rank	origins
1,000	247
10,000	1,434
100,000	10,981
10,000,000	45,316
1,000,000	82,988

Similar to @tunetheweb's query in #13 (comment), this query intentionally does not nest smaller ranks under larger ones (ie 1k is not a subset of 10k). These results align perfectly with the React stats in #13 (comment).

So the bug is that the ranks are not inclusive of smaller ranks. The expected behavior is:

WITH tech AS (
SELECT DISTINCT
  url
FROM
  `httparchive.technologies.2022_02_01_mobile`
WHERE
  app = 'React'
), crux AS (
SELECT
  CONCAT(origin, '/') AS url,
  rank
FROM
  `chrome-ux-report.materialized.country_summary`
WHERE
  yyyymm = 202202 AND
  country_code = 'us' AND
  device = 'phone'
)

SELECT
  _rank,
  COUNT(0) AS origins
FROM
  tech
JOIN
  crux
USING
  (url),
  UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS _rank
WHERE
  rank <= _rank
GROUP BY
  _rank
ORDER BY
  origins

_rank	origins
1,000	247
10,000	1,681
100,000	12,662
1,000,000	95,650
10,000,000	140,966
100,000,000	140,966

I think there's a bug in my query that selects/groups by rank (the exclusive value) instead of _rank (the inclusive value):

cwv-tech-report/sql/monthly.sql

Lines 64 to 92 in 3e64d0b

    
           SELECT 
        
             geo, 
        
             rank, 
        
             CONCAT(origin, '/') AS url, 
        
             IF(device = 'desktop', 'desktop', 'mobile') AS client, 
        
             # CWV 
        
             IS_NON_ZERO(fast_fid, avg_fid, slow_fid) AS any_fid, 
        
             IS_GOOD(fast_fid, avg_fid, slow_fid) AS good_fid, 
        
             IS_NON_ZERO(small_cls, medium_cls, large_cls) AS any_cls, 
        
             IS_GOOD(small_cls, medium_cls, large_cls) AS good_cls, 
        
             IS_NON_ZERO(fast_lcp, avg_lcp, slow_lcp) AS any_lcp, 
        
             IS_GOOD(fast_lcp, avg_lcp, slow_lcp) AS good_lcp, 
        
             (IS_GOOD(fast_fid, avg_fid, slow_fid) OR fast_fid IS NULL) AND 
        
             IS_GOOD(small_cls, medium_cls, large_cls) AND 
        
             IS_GOOD(fast_lcp, avg_lcp, slow_lcp) AS good_cwv, 
        
             # WV 
        
             IS_NON_ZERO(fast_fcp, avg_fcp, slow_fcp) AS any_fcp, 
        
             IS_GOOD(fast_fcp, avg_fcp, slow_fcp) AS good_fcp, 
        
             IS_NON_ZERO(fast_ttfb, avg_ttfb, slow_ttfb) AS any_ttfb, 
        
             IS_GOOD(fast_ttfb, avg_ttfb, slow_ttfb) AS good_ttfb 
        
           FROM 
        
             geo_summary, 
        
             UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS _rank 
        
           WHERE 
        
             date = (SELECT * FROM RELEASE_DATE) AND 
        
             device IN ('desktop', 'phone') AND 
        
             rank <= _rank

I'll update the all/monthly queries and regenerate the data to fix the issue in the dataset/dashboard.

tunetheweb · 2022-03-28T23:03:39Z

FYI, this is why I used rank_grouping (rather than _rank) as the UNEST alias when I used this pattern in Web Almanac queries. Less easy to mix up with the rank column.

tunetheweb · 2022-03-28T23:05:30Z

Misremembered. It was rank_grouping we used in Almanac: https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Analysts'-Guide#rank

shappir · 2022-03-29T09:03:44Z

Thank you for uncovering this!

rviscomi · 2022-03-29T16:21:40Z

Updated the dashboard. I think it's working now.

SELECT
  rank, origins
FROM
  `httparchive.core_web_vitals.technologies`
WHERE
  date = '2022-02-01' AND
  geo = 'United States of America' AND
  app = 'React' AND
  client = 'mobile'
ORDER BY
  origins

rank	origins
Top 1k	247
Top 10k	1,681
Top 100k	12,662
Top 1M	95,650
Top 10M	140,966
ALL	140,966

Note that I also made the ranks more human readable and changed the biggest 100M rank to "ALL" for consistency with other fields. (We're close to exceeding 10M origins, so I figured 100M was more forward-compatible)

One more thing: I was able to recover the data prior to March 2021 when the rank field was added to CrUX. So if you want historical data going as far back as January 2020, you'll need to select the "ALL" rank.

rviscomi mentioned this issue Mar 29, 2022

Rank fix #14

Merged

rviscomi closed this as completed in #14 Mar 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Country-level origin counts should always increase by rank #13

Country-level origin counts should always increase by rank #13

rviscomi commented Mar 28, 2022

tunetheweb commented Mar 28, 2022

rviscomi commented Mar 28, 2022

tunetheweb commented Mar 28, 2022

pmeenan commented Mar 28, 2022

pmeenan commented Mar 28, 2022

tunetheweb commented Mar 28, 2022

rviscomi commented Mar 28, 2022

tunetheweb commented Mar 28, 2022 •

edited

Loading

tunetheweb commented Mar 28, 2022

shappir commented Mar 29, 2022

rviscomi commented Mar 29, 2022

Country-level origin counts should always increase by rank #13

Country-level origin counts should always increase by rank #13

Comments

rviscomi commented Mar 28, 2022

tunetheweb commented Mar 28, 2022

rviscomi commented Mar 28, 2022

tunetheweb commented Mar 28, 2022

pmeenan commented Mar 28, 2022

pmeenan commented Mar 28, 2022

tunetheweb commented Mar 28, 2022

rviscomi commented Mar 28, 2022

tunetheweb commented Mar 28, 2022 • edited Loading

tunetheweb commented Mar 28, 2022

shappir commented Mar 29, 2022

rviscomi commented Mar 29, 2022

tunetheweb commented Mar 28, 2022 •

edited

Loading