This repository has been archived by the owner on Dec 17, 2021. It is now read-only.

First pass at a third party service scanner #107

Merged
merged 13 commits into master on Jun 26, 2017

Conversation

@konklone (Contributor) commented on Jan 17, 2017

This adds a third_parties scanner that uses phantomas to fetch a given domain's homepage (following redirects as necessary).

Right now the scanner returns data on requests to domains that are either External (outside the given domain's base domain) or Internal (inside the given domain's base domain, but not equal to the given domain itself).

So when scanning 18f.gsa.gov, a request to https://www.google-analytics.com is an External domain, but a (theoretical) request to https://www.gsa.gov would be an Internal domain (and requests made to https://18f.gsa.gov are not considered).
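The classification above can be sketched in Python. This is a minimal, hypothetical sketch, not the scanner's actual code: in particular, the naive two-label base-domain heuristic stands in for proper Public Suffix List handling (e.g. via the publicsuffix2 package), which "gsa.gov" happens to survive but "example.co.uk" would not.

```python
from typing import Optional
from urllib.parse import urlparse

def naive_base_domain(hostname: str) -> str:
    # Naive heuristic: take the last two labels. A real implementation
    # should consult the Public Suffix List instead.
    return ".".join(hostname.split(".")[-2:])

def classify(scanned_domain: str, request_url: str) -> Optional[str]:
    """Classify a requested URL relative to the scanned domain.

    Returns "external", "internal", or None (same host, not considered).
    """
    host = urlparse(request_url).hostname
    if host == scanned_domain:
        return None  # requests to the scanned domain itself are ignored
    if naive_base_domain(host) == naive_base_domain(scanned_domain):
        return "internal"
    return "external"

print(classify("18f.gsa.gov", "https://www.google-analytics.com/ga.js"))  # external
print(classify("18f.gsa.gov", "https://www.gsa.gov/"))                    # internal
```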

Some notes:

  • If pshtt scan data exists for a domain, then domains that are not live, or which redirect externally, will be skipped.
  • The scanner keeps a dict of "known" services that map to one or more hostnames. So Google Analytics maps to www.google-analytics.com, and Digital Analytics Program maps to dap.digitalgov.gov.
  • The scanner sets a default timeout of 60 seconds (longer than phantomas' default of 15), but when a timeout occurs, the scan data collected from phantomas up to that point is saved and used to calculate the result. Right now, there is no way to tell from the data whether the script timed out.
  • The timeout can be overridden with --timeout.
  • www is factored out when comparing hostnames to the given domain. So, scanning https://gsa.gov and seeing that it makes requests to https://www.gsa.gov (or vice versa) will not be logged -- they are considered identical.
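The "known services" mapping and the www-insensitive hostname comparison described in the notes above could look roughly like this. All names here are illustrative assumptions, not the scanner's actual identifiers:

```python
from typing import Optional

# Hypothetical "known services" dict: service name -> one or more hostnames.
KNOWN_SERVICES = {
    "Google Analytics": ["www.google-analytics.com"],
    "Digital Analytics Program": ["dap.digitalgov.gov"],
}

def strip_www(hostname: str) -> str:
    # Factor out a leading "www." so gsa.gov and www.gsa.gov compare equal.
    return hostname[4:] if hostname.startswith("www.") else hostname

def same_host(a: str, b: str) -> bool:
    # https://gsa.gov and https://www.gsa.gov are considered identical.
    return strip_www(a) == strip_www(b)

def known_service_for(hostname: str) -> Optional[str]:
    # Return the known service a hostname belongs to, if any.
    for service, hosts in KNOWN_SERVICES.items():
        if hostname in hosts:
            return service
    return None
```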

TODOs:

  • There's a bug in the "don't scan external redirect domains" logic - it scans dod.gov in addition to defense.gov. This causes a further bug in that the scan for dod.gov thinks requests to defense.gov are external, when they are not.
  • Note as a field in the output whether there was a timeout or not. (If there's a timeout, it could point to a problematic website, and/or a need to extend the overall timeout, and/or some undetected third parties we missed by bailing out early.)
  • Greatly expand the library of "known" services.
  • Add support for regex matching for "known" services, to detect classes of them (e.g. [a-z]+.cloudfront.net).
  • Add a priority order for matching known services, so that (for example) a specific cloudfront.net subdomain can be matched before the catch-all fires.
  • Add a category of "unknown service" (with associated fields), just to factor out domains which haven't yet been isolated and described as a "known" service.
  • Add a category of "affiliated service" (with associated fields, and an --affiliated input option), that checks domains against one or more suffixes of domains known to be "affiliated" with the scanned domains. For example, if one were scanning twitter.com, one might want to provide twimg.com as an "affiliated" suffix so that it doesn't get flagged as "external". For scanning .gov domains, one might want to just provide .gov as an "affiliated" suffix so that any references to shared services within .gov are separated from truly "External" domains.
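A few of these TODOs (regex matching, priority ordering, and the "affiliated" suffix check) could be sketched together. This is a speculative illustration of the ideas, not a proposed implementation; the rule names, patterns, and function names are all made up:

```python
import re

# Ordered rules: specific patterns first, catch-alls last, so the first
# matching rule wins. Both entries below are illustrative examples.
KNOWN_SERVICE_RULES = [
    ("Specific CloudFront distribution", re.compile(r"^d111111abcdef8\.cloudfront\.net$")),
    ("Amazon CloudFront", re.compile(r"^[a-z0-9]+\.cloudfront\.net$")),
]

def match_known_service(hostname: str) -> str:
    # Priority order: a specific cloudfront.net subdomain is matched
    # before the catch-all fires.
    for name, pattern in KNOWN_SERVICE_RULES:
        if pattern.match(hostname):
            return name
    return "unknown service"

def is_affiliated(hostname: str, affiliated_suffixes: list) -> bool:
    # e.g. affiliated_suffixes = [".gov"] or ["twimg.com"], as in the
    # --affiliated option described above.
    for suffix in affiliated_suffixes:
        dotted = suffix if suffix.startswith(".") else "." + suffix
        if hostname == suffix.lstrip(".") or hostname.endswith(dotted):
            return True
    return False
```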

@konklone konklone merged commit 221c97e into master Jun 26, 2017
@konklone (Contributor, Author) commented:

Time to merge this in and get to the TODOs in other PRs.
