Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security 2024 Queries #3671

Merged
merged 41 commits into from
Oct 30, 2024
Merged

Conversation

JannisBush
Copy link
Contributor

@JannisBush JannisBush commented Jun 1, 2024

Queries for the 2024 Security Chapter

General

  • Maybe add subfolders for the sections or prefix each query with their subsection to improve the structure (50 queries in one folder are quite a mess)?
  • Maybe remove some unused queries as they can distract?

Existing Queries

  • Copy all queries from the 2022 chapter
  • Update all of them to use only httparchive.all
  • Add annotations to queries missing annotations
  • Other structure/style improvements (preamble: Section: ? Question: ?)
  • Fix security.txt query to contain all documented fields (and not to heavily over report! 2022 said ~5% sites use security.txt, however most of these files are HTML pages and it is more like 1% of sites that use it which is similar to https://www.uriports.com/blog/security-txt/)
  • Fix sensitive values in robots.txt query Query is already functional?

New Queries

  • List all new queries (according to the planning document as of July 16)
  • Implement all new queries

Attack Preventions

Security Header Adoption

  • Usage of document.domain setters and getters (Blink Feature Usage + Origin Agent Cluster)
  • General Usage of Timing-Allow-Origin
  • Value Distribution of Timing-Allow-Origin

Preventing Attacks using Cross-Origin Policies

  • Value Distribution CORP Already existed
  • Value Distribution COEP
  • Value Distribution COOP
  • Also added queries for XFO (CSP already existed)

HTML Sanitization

  • General adoption of setHTMLUnsafe and parseHTMLUnsafe (Blink Feature Usage)
  • % Sites using above APIs of all sites that use shadow DOM Usage is too rare: 6x parse, 2x set in total!

Drivers of Security Mechanism Adoption

Website Categories

  • % Sites use HTTPS per website category
  • % Sites use security headers per website category
  • % Sites use SRI per website category We currently do not even have queries for SRI related to rank, country, ...; should we add website category here?
  • % Sites use bot protection services per website category We currently do not even have queries for bot protection related to rank, country, ...; should we add website category here?

Security Misconfiguration and Oversights

  • Unsupported policies in <meta> tag Already exists: meta_policies_allowed_vs_disallowed
    • CSP, CSP-ro, XFO, HSTS, ...
    • Discrepancies for policies definition in both header and <meta> tag
    • More advanced query looking into the values (i.e., check for e.g., frame-ancestors, sandbox, report-uri, ...
  • CSP vs XFO?
  • Server-Timing header: general usage, dur property, ....

@JannisBush JannisBush marked this pull request as draft June 1, 2024 18:14
@JannisBush JannisBush marked this pull request as ready for review August 14, 2024 10:31
@vikvanderlinden
Copy link
Member

Thanks for the implementations!
I took a look over the queries, and here's a few of my thoughts (also as response to the notes here and there)

  • bot detection:
    I would say we can just keep HSTS and Really Simple SSL and just select the values we'd like during writing. Same goes for entries like "bugcrowd" etc.

  • cryptominer usage:
    A year or two is usually used in previous editions as far as I know. Two years would be great imo because the last almanac used data up until July 2022, which we can continue from. Usually the usage is very low but sometimes the usage spikes, which might be interesting and if that would be the case we may find a good explanation (like last edition). I would propose to still run this query for 2 years worth of data and to then see what the result is before deciding whether to include it into the chapter or not.

  • csp allowed host frequency:
    The CSP header allows hostnames without scheme (or other schemes such as wss://), we should think whether we want to include those or not. It is my understanding that this currently does not happen and I think we should at least mention that in case we use the current version of the query. Parsing will be more difficult, but there might be hosts that are included much more frequently without scheme.
    Example: github.com sends a CSP response header in which none of the domains have a scheme apart from wss://.

    It is not unthinkable that certain sites/frameworks/plugins/platforms/... in general do not include schemes. I wonder for how many sites "https?://" is in the CSP header at all. Not having "https?://" could also mean they just don't use allowlists of hosts but better options like strict-dynamic and nonce/hash (which is recommended).

  • csp number of allowed hosts:
    Same remark as previous (sheme-less hosts in CSP).

  • documentdomain usage:
    I'm not sure I understand the note: "the header is very rare and yet this value is very low". Would that not be normal? Rare header == low number of urls == low usage value?

  • security adoption by category:
    We should LOWER the header names so we don't accidentally miss some headers.

  • security adoption by rank:
    Same remark as previous (LOWER header names)

  • server timing usage values:
    It would be interesting to have the fractions of properties besides the 'dur' property to gauge which info may be included in the header.

For all other queries I had no remarks really.
Let me know what you think. (CC @GJFR)

@JannisBush
Copy link
Contributor Author

JannisBush commented Aug 21, 2024

  • Bot Detection: yes we can simply filter them out later (no change)
  • Cryptominer Usage: The usage in June 2024 is a total of 53 pages on mobile and 41 pages on desktop which is nothing (total pages: 15M+) (https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=2083415086#gid=2083415086) (I can change the query to start in 2022 done)
  • CSP hosts: Yes, we could come up with another regex/parsing to allow for hosts that do not start with https (Maybe change parsing of CSP hosts)
  • Documentdomain: The note might be confusing. Here is the data: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=150039899#gid=150039899 DocumentSetDomain has a count of 484310, DocumentDomainSettingWithoutOriginAgentClusterHeader has a count of only 804, this would suggest that most pages that use DocumentSetDomain have an OriginAgentCluster. However, this is not the case, as we can see from other queries that almost 0 pages have an OriginAgentCluster header. I checked the data and it seems like starting from 20230201 they only count DocumentDomainSettingWithoutOriginAgentClusterHeader if the page has an OriginAgentCluster that does not allow for using SetDomain, in all other cases including no-header, they do not count that counter up. (Improve the note text done)
  • Security by category/rank: True, they should be lowercased as was done in the other queries. (Lowercase header-names x2 done)
  • Server Timing: Here are the results of the query: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=1339089790#gid=1339089790 This could also be tracked by the privacy team as it no a security header per se? (Maybe extract/count other properties of the server-timing header)

@tunetheweb tunetheweb added the analysis Querying the dataset label Aug 21, 2024
@tunetheweb tunetheweb added this to the 2024 Analysis milestone Aug 21, 2024
@GJFR
Copy link
Member

GJFR commented Aug 22, 2024

Thanks a lot for writing the queries @JannisBush, going over the results was already very interesting! And especially thanks for updating all existing queries to use the new tables.

Some remarks:

  • CSP hosts: I just performed a small test query with regex (?i)(wss*://[^\s;]+)[\s;] instead of the https* one. Frequency of ws* hosts is very low compared to https* and imo wouldn't make sense to use in the same query (wss* hosts would almost be truncated out of the resulting table anyway). Highest relative frequency: wss://*.intercom.io and wss://*.hotjar.com with both +/- 0.07%. Might be interesting to still mention separately and would warrant a separate query for wss*, though?
  • Server Timing: I agree with @vikvanderlinden that it would make more sense to consider the properties alone, without timing values (they undermine the group by). Would it be feasible to extract every property? If not, a cheap improvement would be to just omit all numeric values from the string (remaining caveat is the different order of properties in headers)? What do you both think?
  • Cookie Age Negative: Results indeed differ quite a lot from 2022. I looked at the query in detail, but I can't find any issues. Other cookie-related queries seem to differ less from 2022 results.
  • It appears from Iframe Attributes Usage that usage of allow doubled compared to 2022. It would be nice to check the change in usage of time. Could you add such a query similar to Https Request Over Time for the usage of allow and sandbox?

I've added some additional derived tables to the results sheets to provide more insights into the data. When running queries again, could you avoid overwriting the full tabs? It's only a few tabs, though.

I will finish reviewing the last few queries tomorrow.

@GJFR
Copy link
Member

GJFR commented Aug 23, 2024

Went though the remainder of queries:

@JannisBush
Copy link
Contributor Author

JannisBush commented Aug 23, 2024

  • CSP hosts: We can add another query for WSS or non-https hosts
  • Server Timing: We can split at "," first and then take everything before the first ";", then take anything until the first "=" and that should give us all the metric names. If you also want to extract the descriptions and values, it will get more complex.
  • Cookie Age Negative: I started to run the new query on the old data to check if there might be an issue with the query
  • Iframe attribute usage: We can add a new query that tracks usage of allow and sandbox over time
  • 2022-06-09: This is on purpose, the 2022-06-01 entries for all.requests is broken and the correct entries for that year/month are 2022-06-09
  • TLs Ca Issuers: As written in the notes, the query currently includes HTTP Pages (empty issuers, 13%) as well. We could filter that out (for the comparison to the old data), however, I think for the future including the amount of HTTP Pages in that query directly makes sense. Row 151 and 180 have 0.02% and are probably HTTPS Pages without an issuer (the same as the old 0.03% entries). I guess in bigquery, the HTTP Pages have a NULL issuer and the HTTPS Pages without an issuer have an empty string, so they are not merged together. However, when copying the data to sheets the difference between NULL and the empty string is lost.
  • count_found: Count found is what the custom metric reports as found which is the final response to ./well-known/security.txt returns a status code of 200, this includes a lot of false positives that return non-security.txt files with code 200 at that URL (probably a similar number as https://almanac.httparchive.org/en/2022/security#detecting-status-code-reliability). has_security_txt has the additional constraint that the content-type header of the response has to start with text/plain (which it MUST according to the spec: https://datatracker.ietf.org/doc/html/rfc9116#section-3). I did quite some experiments (see the outcommented queries in the file: https://github.com/HTTPArchive/almanac.httparchive.org/blob/cbcb2a7ea833de7fad881ca7afa689112bf6ce27/sql/2024/security/well-known_security.sql). My verdict of the results are that requiring text/plain gets rid of the vast majority of FPs while at the same time only introducing a negligible amount of FNs.
  • I am not automatically overwriting any data in the sheets.

  • Data for the new cookie query on the old data (2022-06-09):
  • The numbers for Max-Age are basically the same. The differences is in Expires (which then transfers to Real Age)
  • I had the note Some of the percentages are quite different to the old query; one of both might be broken (difficult to compare as both cannot operate on a shared dataset) in the query. So yeah, something is going on here.
client	count_negative_max_age	pct_negative_max_age	num_max_age_pages	pct_max_age_pages	num_max_age_hosts	pct_max_age_hosts	count_negative_expires	pct_negative_expires	num_expires_pages	pct_expires_pages	num_expires_hosts	pct_expires_hosts	count_negative_real_age	pct_negative_real_age	num_real_age_pages	pct_real_age_pages	num_real_age_hosts	pct_real_age_hosts
desktop	3027635	0.085079712693773929	391368	0.18646223068797993	164603	0.14661195391159976	5920452	0.06190244069067101	1110544	0.29486378523190565	303714	0.16427975276414852	5761149	0.056308908960075628	1069189	0.28305686557625642	264131	0.13890066271136348
mobile	2752659	0.061759115713496793	481595	0.16319871445790932	200027	0.1244405596580835	6068461	0.049396407189040661	1456445	0.27821213917970833	347559	0.13971620115235245	5938723	0.0449079900197829	1409598	0.26856166505420204	300906	0.11727684617708099

@JannisBush
Copy link
Contributor Author

JannisBush commented Aug 23, 2024

Maybe CAST(JSON_QUERY(summary, '$.startedDateTime') AS NUMERIC) in all.requests is not the same as startedDateTime in almanac.requests 🤔

Seems like there was some post-processing:

startedDateTime: Math.round(+startedDateTime / 1000),

I tried to do the same post-processing, but got very different results. I will ask in slack.

@vikvanderlinden
Copy link
Member

For Server-Timing I think the metric names are most interesting. Additionally it would be interesting to check the count for the dur property.

Sort of like: total # pages, # with header, # with >= 1 dur, # with 1x dur, # with 2x dur, # with 3x dur, # with >3 dur.
And if we can get the property names, a table that shows the number of pages that have at least one of each property. I wouldn't prioritize this, but for instance:

property # pages with that property at least once
dur 200
desc 150
cache 100

If that makes sense?

@JannisBush
Copy link
Contributor Author

JannisBush commented Aug 26, 2024

Maybe CAST(JSON_QUERY(summary, '$.startedDateTime') AS NUMERIC) in all.requests is not the same as startedDateTime in almanac.requests 🤔

Seems like there was some post-processing:

startedDateTime: Math.round(+startedDateTime / 1000),

I tried to do the same post-processing, but got very different results. I will ask in slack.

Tested more and they both seem very similar and have the same format, so that might not be the reason for the drastic differences.
The only difference I could discover is that in all.requests around 0.8% of requests have an extracted startedDateTime that is nullwhereas this is never the case for almanac.requests. However, as this only applies to such a small number of requests, I don't think this can be the (only) reason for the large difference.

  • Average startedDateTime: almanac: f0_
    1654908665.0546837, all: f0_
    1654883339.0342844
SELECT
  AVG(startedDateTime),
  AVG(CAST(JSON_QUERY(r2.summary, '$.startedDateTime') AS INT64))
FROM
  `httparchive.almanac.requests` as r1,
  `httparchive.all.requests` as r2
WHERE
  r1.date = '2022-06-01'
  AND r2.date = '2022-06-09'
  AND is_root_page
  • 0.8% of all.requests have a startedDateTime that is NULL
SELECT
    COUNTIF(CAST(JSON_QUERY(summary, '$.startedDateTime') AS INT64) IS NULL) / count(0) as startedDateTime
  FROM
    `httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
  WHERE
    date = '2022-06-09' AND
    is_root_page
  • 0% of almanac.requests has a startedDateTime that is NULL
  SELECT
    COUNTIF(startedDateTime IS NULL) / count(0)
  FROM
    `httparchive.almanac.requests` TABLESAMPLE SYSTEM (100 PERCENT)
  WHERE
    date = '2022-06-01'

@JannisBush
Copy link
Contributor Author

JannisBush commented Aug 26, 2024

  • I added the CSP WSS host query (and added the results to the sheets)
  • Iframes attributes: updated the query and added the results to sheets (only for some months as the query is huge), seems like the allow attributes only surged in 2024-06-01 🤔
  • Server Timing: I have the parsing step implemented. Now, we only need to create the aggregate statistics on it.

@JannisBush
Copy link
Contributor Author

I finished the server-timing query and it now contains the information you wanted. The data is below the original data in the sheet: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=1339089790#gid=1339089790

@GJFR
Copy link
Member

GJFR commented Sep 1, 2024

Thanks a lot for the additional queries and clarifications @JannisBush! Interesting to see the additional data on CSP hosts and iframe allow values.

  • About the discrepancy for Expires value for cookies: I also don't seem to find a cause for it. I think it's best to leave it as is now. Should we still report on this data in our article, we will also explicitly mention the issue.

  • I'm not 100% sure, but I think @vikvanderlinden is mainly interested in the Server-Timing attributes (not sure if this is the right term). For example:

    • If the header is cfRequestDuration;dur=22.000074, then we should disregard cfRequestDuration and count this header towards the group of headers where only one dur attribute is used. So the value of dur (22.000074) would also be disregarded.
    • If the header is cache;desc=hit, varnish;desc=hit_hit, dc;desc=fastly_g, this will be part of the group of headers with three desc attributes.

@vikvanderlinden, please correct met if I'm wrong about your intention for the Server-Timing header.

@JannisBush
Copy link
Contributor Author

The new query counts the number of dur and desc attributes. It reports the average number of these attributes per host, as well as the exact number of hosts with exactly 1, 2, 3 or more dur attributes.

@vikvanderlinden
Copy link
Member

Yes, that and if there is for instance

  • db;dur=4
  • cache;dur=0;desc=hit, varnish;desc=hit_hit, dc;desc=fastly_g;loc=eu-w-1
  • cache;dur=hit, varnish;desc=hit_hit, dc;desc=fastly_g

on different pages, we could also count:

property frequency
dur 3
desc 2
loc 1

And truncate to 50 results of so to gauge how many site use certain properties, which may or may not show so often occurring properties besides dur and desc.

@tunetheweb
Copy link
Member

@GJFR is this good to merge now?

@GJFR
Copy link
Member

GJFR commented Oct 30, 2024

Yes, should be complete 👍

@tunetheweb tunetheweb merged commit f610a48 into HTTPArchive:main Oct 30, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Querying the dataset
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants