Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security 2020: additional queries #1479

Merged
merged 5 commits into from
Nov 13, 2020
Merged

Security 2020: additional queries #1479

merged 5 commits into from
Nov 13, 2020

Conversation

tomvangoethem
Copy link
Contributor

@tomvangoethem tomvangoethem commented Nov 10, 2020

Progress on #906

@tomvangoethem
Copy link
Contributor Author

I believe that @nrllh still has some queries that will be included.

I still have to check if we've covered everything from the outline, and will add anything that's missing ASAP (was busy with work last week, so apologies for the delay)

@rviscomi rviscomi added the analysis Querying the dataset label Nov 10, 2020
@rviscomi rviscomi added this to the 2020 Analysis milestone Nov 10, 2020
Copy link
Member

@tunetheweb tunetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good.

Some comments on CSP-RO and SAFE_DIVIDE.

SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, '(?i)Strict-Transport-Security')), COUNT(0)) AS pct_hsts,
SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, '(?i)X-Content-Type-Options')), COUNT(0)) AS pct_xcto,
SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, '(?i)Expect-CT')), COUNT(0)) AS pct_expectct,
SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy')), COUNT(0)) AS pct_csp,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this will include Content-Security-Policy-Report-Only. Maybe should do this?:

Suggested change
SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy')), COUNT(0)) AS pct_csp,
SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy ')), COUNT(0)) AS pct_csp,

Also would it be interesting to have a column for CSP-RO? Or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the other should also have a space in the end?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No harm I guess but CSP was the only one I thought but as it has another header which starts the same. But yeah should do it for consistency (and who knows what the future will bring!)

COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))) AS total_with_header,
COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))) / COUNT(0) AS pct,
COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND STARTS_WITH(url, 'https')) / COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))) AS pct_header_and_https,
COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy')) / COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))) AS pct_header_and_csp,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here and I actually ran this to see what the difference would be and got this (1st line is CSP, 2nd is CSP-RO):

CSP usage

So not an insignificant difference. It's only 93.8% usage when you're saying it's 100%.

Also interesting that 10% of CSP sites use CSP-RO.
And everyone using CSP-RO is also using CSP - is no one testing any more and everyone has gone all in?

Also think we should do a SAFE_DIVIDE here. Doesn't matter for now, but some of these are very small and could go to zero in near future (I experienced this when trying to run this with a different UNNEST that had a typo!)

Here's what I ended up doing:

#standardSQL
# Analysis of how certain features influence the adoption of other features
SELECT
  _TABLE_SUFFIX AS client,
  headername,
  COUNT(0) AS total_pages,
  COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))) AS total_with_header,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername))), COUNT(0)) AS pct,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND STARTS_WITH(url, 'https')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_https,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_csp,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy ')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_csp_space,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy[^-]')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_csp_not_dash,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Content-Security-Policy-Report-Only')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_csp_report_only,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Cross-Origin-Embedder-Policy')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_coep,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Cross-Origin-Opener-Policy')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_coop,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Cross-Origin-Resource-Policy')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_corp,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Expect-CT')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_expectct,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Feature-Policy')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_featurep,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Permissions-Policy')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_permissionsp,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Referrer-Policy')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_referrerp,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Report-To')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_reportto,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)Strict-Transport-Security')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_hsts,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)X-Content-Type-Options')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_xcto,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)X-Frame-Options')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_xfo,
  SAFE_DIVIDE(COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)) AND REGEXP_CONTAINS(respOtherHeaders, '(?i)X-XSS-Protection')), COUNTIF(REGEXP_CONTAINS(respOtherHeaders, CONCAT('(?i)', headername)))) AS pct_header_and_xss,
FROM
  `httparchive.summary_requests.2020_08_01_*` AS r,
UNNEST(['Content-Security-Policy', 'Content-Security-Policy-Report-Only', 'Cross-Origin-Embedder-Policy', 'Cross-Origin-Opener-Policy', 'Cross-Origin-Resource-Policy', 'Expect-CT', 'Feature-Policy', 'Permissions-Policy', 'Referrer-Policy', 'Report-To', 'Strict-Transport-Security', 'X-Content-Type-Options', 'X-Frame-Options', 'X-XSS-Protection']) AS headername
WHERE
  firstHtml
GROUP BY
  client,
  headername
ORDER BY
  client,
  headername

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, perhaps we should do the same for all the headers (that is, to put a space behind it in the regex)

Here the difference between CSP and CSP-RO doesn't matter so much I think; the 1.0 just shows that for hosts that use CSP or CSP-RO, all of them have CSP or CSP-RO - it does matter for the other queries though

Is there a particular reason to include Content-Security-Policy[^-]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason to include Content-Security-Policy[^-]?

No. Just me testing if space or dash was better separator. Let's use space since by above that's better as can use in other headers.

sql/2020/11_Security/feature_adoption_by_technology.sql Outdated Show resolved Hide resolved
sql/2020/11_Security/feature_adoption_by_technology.sql Outdated Show resolved Hide resolved
@tunetheweb
Copy link
Member

tunetheweb commented Nov 11, 2020

D'Oh - just realised this is in Draft. Could have saved myself some time there! Consider it an early review.

@rviscomi
Copy link
Member

@tomvangoethem are these queries relatively stable? If so we can review and merge immediately and open new PRs for any follow up changes.

@rviscomi rviscomi added the ASAP This issue is blocking progress label Nov 12, 2020
@tomvangoethem
Copy link
Contributor Author

I think they're relatively stable; will do a check tonight and make the proposed changes of @bazzadp (thanks for that!)

Copy link
Member

@tunetheweb tunetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A copy more issues with @nrllh 's additions. Small (but potentially important!) typo.

Rest of the new changes look fine.

sql/2020/11_Security/bot_detection.sql Outdated Show resolved Hide resolved
sql/2020/11_Security/web_cryptography_api.sql Outdated Show resolved Hide resolved
@tomvangoethem tomvangoethem marked this pull request as ready for review November 13, 2020 02:31
@tomvangoethem
Copy link
Contributor Author

Addressed the review comments; might need some more changes - see my responses to the unresolved comments

Copy link
Member

@tunetheweb tunetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed we probably should add space for consistency to the other headers but doesn't really matter for now, and other than that it looks good, so OK approving this.

@tunetheweb
Copy link
Member

@tomvangoethem let me know if you're good to merge?

@tomvangoethem
Copy link
Contributor Author

tomvangoethem commented Nov 13, 2020

There were still some instances where the header needed to have a space behind it, so now I just added it everywhere.

In the future, I guess we would get the headers as a separate column in JSON format, so we wouldn't have to do this "parsing"

I'm good to merge!

@tunetheweb
Copy link
Member

In the future, I guess we would get the headers as a separate column in JSON format, so we wouldn't have to do this "parsing"

+1!

@tunetheweb tunetheweb merged commit 7bea5a8 into main Nov 13, 2020
@tunetheweb tunetheweb deleted the security-sql-2020 branch November 13, 2020 15:24
@nrllh nrllh restored the security-sql-2020 branch November 14, 2020 00:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Querying the dataset ASAP This issue is blocking progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants