Security 2024 Queries #3671

JannisBush · 2024-06-01T18:12:41Z

Queries for the 2024 Security Chapter

General

Maybe add subfolders for the sections or prefix each query with their subsection to improve the structure (50 queries in one folder are quite a mess)?
Maybe remove some unused queries as they can distract?

Existing Queries

Copy all queries from the 2022 chapter
Update all of them to use only httparchive.all
Add annotations to queries missing annotations
Other structure/style improvements (preamble: Section: ? Question: ?)
Fix security.txt query to contain all documented fields (and not to heavily over report! 2022 said ~5% sites use security.txt, however most of these files are HTML pages and it is more like 1% of sites that use it which is similar to https://www.uriports.com/blog/security-txt/)
~~Fix sensitive values in robots.txt query~~ Query is already functional?

New Queries

List all new queries (according to the planning document as of July 16)
Implement all new queries

Attack Preventions

Security Header Adoption

Usage of document.domain setters and getters (Blink Feature Usage + Origin Agent Cluster)
General Usage of Timing-Allow-Origin
Value Distribution of Timing-Allow-Origin

Preventing Attacks using Cross-Origin Policies

~~Value Distribution CORP~~ Already existed
Value Distribution COEP
Value Distribution COOP
Also added queries for XFO (CSP already existed)

HTML Sanitization

General adoption of setHTMLUnsafe and parseHTMLUnsafe (Blink Feature Usage)
~~% Sites using above APIs of all sites that use shadow DOM~~ Usage is too rare: 6x parse, 2x set in total!

Drivers of Security Mechanism Adoption

Website Categories

% Sites use HTTPS per website category
% Sites use security headers per website category
~~% Sites use SRI per website category~~ We currently do not even have queries for SRI related to rank, country, ...; should we add website category here?
~~% Sites use bot protection services per website category~~ We currently do not even have queries for bot protection related to rank, country, ...; should we add website category here?

Security Misconfiguration and Oversights

~~Unsupported policies in <meta> tag~~ Already exists: meta_policies_allowed_vs_disallowed
- CSP, CSP-ro, XFO, HSTS, ...
- Discrepancies for policies definition in both header and <meta> tag
- More advanced query looking into the values (i.e., check for e.g., frame-ancestors, sandbox, report-uri, ...
CSP vs XFO?
Server-Timing header: general usage, dur property, ....

No real security header as it can only worsen security (MDN classifies it as CORS header) thus not included in all queries, e.g., not included in feature_adoption_by_country.sql)

vikvanderlinden · 2024-08-20T20:32:13Z

Thanks for the implementations!
I took a look over the queries, and here's a few of my thoughts (also as response to the notes here and there)

bot detection:
I would say we can just keep HSTS and Really Simple SSL and just select the values we'd like during writing. Same goes for entries like "bugcrowd" etc.
cryptominer usage:
A year or two is usually used in previous editions as far as I know. Two years would be great imo because the last almanac used data up until July 2022, which we can continue from. Usually the usage is very low but sometimes the usage spikes, which might be interesting and if that would be the case we may find a good explanation (like last edition). I would propose to still run this query for 2 years worth of data and to then see what the result is before deciding whether to include it into the chapter or not.
csp allowed host frequency:
The CSP header allows hostnames without scheme (or other schemes such as wss://), we should think whether we want to include those or not. It is my understanding that this currently does not happen and I think we should at least mention that in case we use the current version of the query. Parsing will be more difficult, but there might be hosts that are included much more frequently without scheme.
Example: github.com sends a CSP response header in which none of the domains have a scheme apart from wss://.

It is not unthinkable that certain sites/frameworks/plugins/platforms/... in general do not include schemes. I wonder for how many sites "https?://" is in the CSP header at all. Not having "https?://" could also mean they just don't use allowlists of hosts but better options like strict-dynamic and nonce/hash (which is recommended).
csp number of allowed hosts:
Same remark as previous (sheme-less hosts in CSP).
documentdomain usage:
I'm not sure I understand the note: "the header is very rare and yet this value is very low". Would that not be normal? Rare header == low number of urls == low usage value?
security adoption by category:
We should LOWER the header names so we don't accidentally miss some headers.
security adoption by rank:
Same remark as previous (LOWER header names)
server timing usage values:
It would be interesting to have the fractions of properties besides the 'dur' property to gauge which info may be included in the header.

For all other queries I had no remarks really.
Let me know what you think. (CC @GJFR)

JannisBush · 2024-08-21T06:36:35Z

Bot Detection: yes we can simply filter them out later (no change)
Cryptominer Usage: The usage in June 2024 is a total of 53 pages on mobile and 41 pages on desktop which is nothing (total pages: 15M+) (https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=2083415086#gid=2083415086) (~~I can change the query to start in 2022~~ done)
CSP hosts: Yes, we could come up with another regex/parsing to allow for hosts that do not start with https (Maybe change parsing of CSP hosts)
Documentdomain: The note might be confusing. Here is the data: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=150039899#gid=150039899 DocumentSetDomain has a count of 484310, DocumentDomainSettingWithoutOriginAgentClusterHeader has a count of only 804, this would suggest that most pages that use DocumentSetDomain have an OriginAgentCluster. However, this is not the case, as we can see from other queries that almost 0 pages have an OriginAgentCluster header. I checked the data and it seems like starting from 20230201 they only count DocumentDomainSettingWithoutOriginAgentClusterHeader if the page has an OriginAgentCluster that does not allow for using SetDomain, in all other cases including no-header, they do not count that counter up. (~~Improve the note text~~ done)
Security by category/rank: True, they should be lowercased as was done in the other queries. (~~Lowercase header-names x2~~ done)
Server Timing: Here are the results of the query: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=1339089790#gid=1339089790 This could also be tracked by the privacy team as it no a security header per se? (Maybe extract/count other properties of the server-timing header)

GJFR · 2024-08-22T16:48:23Z

Thanks a lot for writing the queries @JannisBush, going over the results was already very interesting! And especially thanks for updating all existing queries to use the new tables.

Some remarks:

CSP hosts: I just performed a small test query with regex (?i)(wss*://[^\s;]+)[\s;] instead of the https* one. Frequency of ws* hosts is very low compared to https* and imo wouldn't make sense to use in the same query (wss* hosts would almost be truncated out of the resulting table anyway). Highest relative frequency: wss://*.intercom.io and wss://*.hotjar.com with both +/- 0.07%. Might be interesting to still mention separately and would warrant a separate query for wss*, though?
Server Timing: I agree with @vikvanderlinden that it would make more sense to consider the properties alone, without timing values (they undermine the group by). Would it be feasible to extract every property? If not, a cheap improvement would be to just omit all numeric values from the string (remaining caveat is the different order of properties in headers)? What do you both think?
Cookie Age Negative: Results indeed differ quite a lot from 2022. I looked at the query in detail, but I can't find any issues. Other cookie-related queries seem to differ less from 2022 results.
It appears from Iframe Attributes Usage that usage of allow doubled compared to 2022. It would be nice to check the change in usage of time. Could you add such a query similar to Https Request Over Time for the usage of allow and sandbox?

I've added some additional derived tables to the results sheets to provide more insights into the data. When running queries again, could you avoid overwriting the full tabs? It's only a few tabs, though.

I will finish reviewing the last few queries tomorrow.

GJFR · 2024-08-23T10:51:20Z

Went though the remainder of queries:

I think I found a typo: two queries contain date = 2022-06-09 while it should be date = 2022-06-01?
- https://github.com/JannisBush/almanac.httparchive.org/blob/cbcb2a7ea833de7fad881ca7afa689112bf6ce27/sql/2024/security/security_headers_prevalence.sql#L22
- https://github.com/JannisBush/almanac.httparchive.org/blob/cbcb2a7ea833de7fad881ca7afa689112bf6ce27/sql/2024/security/server_information_header_prevalence.sql#L21
There is something off about the results of Tls Ca Issuers Pages: second largest issuer is an empty string (accounts for about 13% of all pages), which is quite a lot (this was 0.03% in 2022). I also found additional empty string issuers on row 151 and 180, which is weird because GROUP BY should have merged all empty string issuers. Maybe something went wrong during transfer of results from bigquery to sheets?
What does count_found mean in the results of Well Known Security? Since there is also a has_security_txt, I assume it doesn't represent all websites with a security.txt endpoint?

JannisBush · 2024-08-23T12:05:39Z

CSP hosts: We can add another query for WSS or non-https hosts
Server Timing: We can split at "," first and then take everything before the first ";", then take anything until the first "=" and that should give us all the metric names. If you also want to extract the descriptions and values, it will get more complex.
Cookie Age Negative: I started to run the new query on the old data to check if there might be an issue with the query
Iframe attribute usage: We can add a new query that tracks usage of allow and sandbox over time
2022-06-09: This is on purpose, the 2022-06-01 entries for all.requests is broken and the correct entries for that year/month are 2022-06-09
TLs Ca Issuers: As written in the notes, the query currently includes HTTP Pages (empty issuers, 13%) as well. We could filter that out (for the comparison to the old data), however, I think for the future including the amount of HTTP Pages in that query directly makes sense. Row 151 and 180 have 0.02% and are probably HTTPS Pages without an issuer (the same as the old 0.03% entries). I guess in bigquery, the HTTP Pages have a NULL issuer and the HTTPS Pages without an issuer have an empty string, so they are not merged together. However, when copying the data to sheets the difference between NULL and the empty string is lost.
count_found: Count found is what the custom metric reports as found which is the final response to ./well-known/security.txt returns a status code of 200, this includes a lot of false positives that return non-security.txt files with code 200 at that URL (probably a similar number as https://almanac.httparchive.org/en/2022/security#detecting-status-code-reliability). has_security_txt has the additional constraint that the content-type header of the response has to start with text/plain (which it MUST according to the spec: https://datatracker.ietf.org/doc/html/rfc9116#section-3). I did quite some experiments (see the outcommented queries in the file: https://github.com/HTTPArchive/almanac.httparchive.org/blob/cbcb2a7ea833de7fad881ca7afa689112bf6ce27/sql/2024/security/well-known_security.sql). My verdict of the results are that requiring text/plain gets rid of the vast majority of FPs while at the same time only introducing a negligible amount of FNs.
I am not automatically overwriting any data in the sheets.

Data for the new cookie query on the old data (2022-06-09):
The numbers for Max-Age are basically the same. The differences is in Expires (which then transfers to Real Age)
I had the note Some of the percentages are quite different to the old query; one of both might be broken (difficult to compare as both cannot operate on a shared dataset) in the query. So yeah, something is going on here.

client	count_negative_max_age	pct_negative_max_age	num_max_age_pages	pct_max_age_pages	num_max_age_hosts	pct_max_age_hosts	count_negative_expires	pct_negative_expires	num_expires_pages	pct_expires_pages	num_expires_hosts	pct_expires_hosts	count_negative_real_age	pct_negative_real_age	num_real_age_pages	pct_real_age_pages	num_real_age_hosts	pct_real_age_hosts
desktop	3027635	0.085079712693773929	391368	0.18646223068797993	164603	0.14661195391159976	5920452	0.06190244069067101	1110544	0.29486378523190565	303714	0.16427975276414852	5761149	0.056308908960075628	1069189	0.28305686557625642	264131	0.13890066271136348
mobile	2752659	0.061759115713496793	481595	0.16319871445790932	200027	0.1244405596580835	6068461	0.049396407189040661	1456445	0.27821213917970833	347559	0.13971620115235245	5938723	0.0449079900197829	1409598	0.26856166505420204	300906	0.11727684617708099

JannisBush · 2024-08-23T12:53:07Z

Maybe CAST(JSON_QUERY(summary, '$.startedDateTime') AS NUMERIC) in all.requests is not the same as startedDateTime in almanac.requests 🤔

Seems like there was some post-processing:

almanac.httparchive.org/sql/util/requests.sql

Line 121 in a6042fe

startedDateTime: Math.round(+startedDateTime / 1000),

I tried to do the same post-processing, but got very different results. I will ask in slack.

vikvanderlinden · 2024-08-25T02:08:39Z

For Server-Timing I think the metric names are most interesting. Additionally it would be interesting to check the count for the dur property.

Sort of like: total # pages, # with header, # with >= 1 dur, # with 1x dur, # with 2x dur, # with 3x dur, # with >3 dur.
And if we can get the property names, a table that shows the number of pages that have at least one of each property. I wouldn't prioritize this, but for instance:

property	# pages with that property at least once
dur	200
desc	150
cache	100

If that makes sense?

JannisBush · 2024-08-26T13:15:03Z

Maybe CAST(JSON_QUERY(summary, '$.startedDateTime') AS NUMERIC) in all.requests is not the same as startedDateTime in almanac.requests 🤔

Seems like there was some post-processing:

almanac.httparchive.org/sql/util/requests.sql

Line 121 in a6042fe

startedDateTime: Math.round(+startedDateTime / 1000),

I tried to do the same post-processing, but got very different results. I will ask in slack.

Tested more and they both seem very similar and have the same format, so that might not be the reason for the drastic differences.
The only difference I could discover is that in all.requests around 0.8% of requests have an extracted startedDateTime that is nullwhereas this is never the case for almanac.requests. However, as this only applies to such a small number of requests, I don't think this can be the (only) reason for the large difference.

Average startedDateTime: almanac: f0_
1654908665.0546837, all: f0_
1654883339.0342844

SELECT
  AVG(startedDateTime),
  AVG(CAST(JSON_QUERY(r2.summary, '$.startedDateTime') AS INT64))
FROM
  `httparchive.almanac.requests` as r1,
  `httparchive.all.requests` as r2
WHERE
  r1.date = '2022-06-01'
  AND r2.date = '2022-06-09'
  AND is_root_page

0.8% of all.requests have a startedDateTime that is NULL

SELECT
    COUNTIF(CAST(JSON_QUERY(summary, '$.startedDateTime') AS INT64) IS NULL) / count(0) as startedDateTime
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2022-06-09' AND
    is_root_page

0% of almanac.requests has a startedDateTime that is NULL

  SELECT
    COUNTIF(startedDateTime IS NULL) / count(0)
  FROM
    `httparchive.almanac.requests` TABLESAMPLE SYSTEM (100 PERCENT)
  WHERE
    date = '2022-06-01'

JannisBush · 2024-08-26T15:31:19Z

I added the CSP WSS host query (and added the results to the sheets)
Iframes attributes: updated the query and added the results to sheets (only for some months as the query is huge), seems like the allow attributes only surged in 2024-06-01 🤔
Server Timing: I have the parsing step implemented. Now, we only need to create the aggregate statistics on it.

JannisBush · 2024-08-27T14:42:17Z

I finished the server-timing query and it now contains the information you wanted. The data is below the original data in the sheet: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=1339089790#gid=1339089790

GJFR · 2024-09-01T10:45:42Z

Thanks a lot for the additional queries and clarifications @JannisBush! Interesting to see the additional data on CSP hosts and iframe allow values.

About the discrepancy for Expires value for cookies: I also don't seem to find a cause for it. I think it's best to leave it as is now. Should we still report on this data in our article, we will also explicitly mention the issue.
I'm not 100% sure, but I think @vikvanderlinden is mainly interested in the Server-Timing attributes (not sure if this is the right term). For example:
- If the header is cfRequestDuration;dur=22.000074, then we should disregard cfRequestDuration and count this header towards the group of headers where only one dur attribute is used. So the value of dur (22.000074) would also be disregarded.
- If the header is cache;desc=hit, varnish;desc=hit_hit, dc;desc=fastly_g, this will be part of the group of headers with three desc attributes.

@vikvanderlinden, please correct met if I'm wrong about your intention for the Server-Timing header.

JannisBush · 2024-09-01T19:02:11Z

The new query counts the number of dur and desc attributes. It reports the average number of these attributes per host, as well as the exact number of hosts with exactly 1, 2, 3 or more dur attributes.

vikvanderlinden · 2024-09-01T19:28:39Z

Yes, that and if there is for instance

db;dur=4
cache;dur=0;desc=hit, varnish;desc=hit_hit, dc;desc=fastly_g;loc=eu-w-1
cache;dur=hit, varnish;desc=hit_hit, dc;desc=fastly_g

on different pages, we could also count:

property	frequency
dur	3
desc	2
loc	1

And truncate to 50 results of so to gauge how many site use certain properties, which may or may not show so often occurring properties besides dur and desc.

tunetheweb · 2024-10-30T10:46:10Z

@GJFR is this good to merge now?

GJFR · 2024-10-30T11:48:36Z

Yes, should be complete 👍

Add 2022 queries

cc95c24

JannisBush marked this pull request as draft June 1, 2024 18:14

JannisBush added 22 commits July 16, 2024 17:54

Add updated bot detection query (issue HSTS technology)

6f990c7

Query update

f868ff5

Update Cryptominer + Crypto query

64d6e85

Port all cookie queries

5c26c35

CSP + CORP porting

e6b1d51

More porting

17b0203

Many updated queries

650c631

Update many queries

a7b50ef

Update all remaining queries

6415896

Remove 2024 prefix

ae21656

Add header value distributions (some headers)

766795b

Add Timing-Allow-Origin Header Usage

5b169ff

No real security header as it can only worsen security (MDN classifies it as CORS header) thus not included in all queries, e.g., not included in feature_adoption_by_country.sql)

Add OAC queries

7e3f991

Add document.domain query

a38ea05

HTML Sanitization query

020ae72

Security Features by category

73eab10

Add server-timing query

90bce64

Update security.txt query

e0018bf

Improve security.text query

643057d

Only consider "real" security.txt files for query

d7a9131

Add some insights into security.txt data (FPs vs FNs)

7c01a20

Fix typo

b48a484

JannisBush marked this pull request as ready for review August 14, 2024 10:31

JannisBush added 3 commits August 19, 2024 09:46

SQLfluff fix

39baba3

Fix typo

d9264f8

Add a limit

1439550

Improve Note document.domain

f91d899

Add missing lowercasing

0d921f7

Update note

7a14bad

tunetheweb added the analysis Querying the dataset label Aug 21, 2024

tunetheweb added this to the 2024 Analysis milestone Aug 21, 2024

JannisBush added 3 commits August 22, 2024 08:39

Remove trailing whitespace

8bb2b4a

Add two more queries

8e08ac8

Fix typo

cbcb2a7

JannisBush added 2 commits August 26, 2024 15:27

Add WSS CSP query

abfaba5

Iframes attributes: change to custom_metrics + add more dates

09b1439

JannisBush added 4 commits August 26, 2024 17:31

Update server-timing parsing

5be0c96

FIx st query + add sampling data

4ec01b6

Add Server-Timing overview query

1e8eacc

Lint

b194e9d

GJFR added 2 commits September 11, 2024 16:40

Add query to measure use of disallowed CSP directives in <meta>

2be88e3

Fix linting issue

4e36ade

tunetheweb merged commit f610a48 into HTTPArchive:main Oct 30, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security 2024 Queries #3671

Security 2024 Queries #3671

JannisBush commented Jun 1, 2024 •

edited

Loading

vikvanderlinden commented Aug 20, 2024

JannisBush commented Aug 21, 2024 •

edited

Loading

GJFR commented Aug 22, 2024

GJFR commented Aug 23, 2024

JannisBush commented Aug 23, 2024 •

edited

Loading

JannisBush commented Aug 23, 2024 •

edited

Loading

vikvanderlinden commented Aug 25, 2024

JannisBush commented Aug 26, 2024 •

edited

Loading

JannisBush commented Aug 26, 2024 •

edited

Loading

JannisBush commented Aug 27, 2024

GJFR commented Sep 1, 2024

JannisBush commented Sep 1, 2024

vikvanderlinden commented Sep 1, 2024

tunetheweb commented Oct 30, 2024

GJFR commented Oct 30, 2024

Security 2024 Queries #3671

Security 2024 Queries #3671

Conversation

JannisBush commented Jun 1, 2024 • edited Loading

General

Existing Queries

New Queries

Attack Preventions

Security Header Adoption

Preventing Attacks using Cross-Origin Policies

HTML Sanitization

Drivers of Security Mechanism Adoption

Website Categories

Security Misconfiguration and Oversights

vikvanderlinden commented Aug 20, 2024

JannisBush commented Aug 21, 2024 • edited Loading

GJFR commented Aug 22, 2024

GJFR commented Aug 23, 2024

JannisBush commented Aug 23, 2024 • edited Loading

JannisBush commented Aug 23, 2024 • edited Loading

vikvanderlinden commented Aug 25, 2024

JannisBush commented Aug 26, 2024 • edited Loading

JannisBush commented Aug 26, 2024 • edited Loading

JannisBush commented Aug 27, 2024

GJFR commented Sep 1, 2024

JannisBush commented Sep 1, 2024

vikvanderlinden commented Sep 1, 2024

tunetheweb commented Oct 30, 2024

GJFR commented Oct 30, 2024

JannisBush commented Jun 1, 2024 •

edited

Loading

JannisBush commented Aug 21, 2024 •

edited

Loading

JannisBush commented Aug 23, 2024 •

edited

Loading

JannisBush commented Aug 23, 2024 •

edited

Loading

JannisBush commented Aug 26, 2024 •

edited

Loading

JannisBush commented Aug 26, 2024 •

edited

Loading