This repository has been archived by the owner on Dec 17, 2021. It is now read-only.

First pass at a third party service scanner #107

Merged
merged 13 commits into master on Jun 26, 2017

Conversation

@konklone (Contributor) commented on Jan 17, 2017

This adds a third_parties scanner that uses phantomas to fetch a given domain's homepage (following redirects as necessary).

Right now the scanner returns data on requests to domains that are either External (outside the given domain's base domain) or Internal (inside the given domain's base domain, but not equal to the given domain itself).

So when scanning 18f.gsa.gov, a request to https://www.google-analytics.com is an External domain, but a (theoretical) request to https://www.gsa.gov would be an Internal domain (and requests made to https://18f.gsa.gov are not considered).
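The classification above can be sketched in Python. This is a minimal, hypothetical sketch, not the scanner's actual code: in particular, the naive two-label base-domain heuristic stands in for proper Public Suffix List handling (e.g. via the publicsuffix2 package), which "gsa.gov" happens to survive but "example.co.uk" would not.

```python
from typing import Optional
from urllib.parse import urlparse

def naive_base_domain(hostname: str) -> str:
    # Naive heuristic: take the last two labels. A real implementation
    # should consult the Public Suffix List instead.
    return ".".join(hostname.split(".")[-2:])

def classify(scanned_domain: str, request_url: str) -> Optional[str]:
    """Classify a requested URL relative to the scanned domain.

    Returns "external", "internal", or None (same host, not considered).
    """
    host = urlparse(request_url).hostname
    if host == scanned_domain:
        return None  # requests to the scanned domain itself are ignored
    if naive_base_domain(host) == naive_base_domain(scanned_domain):
        return "internal"
    return "external"

print(classify("18f.gsa.gov", "https://www.google-analytics.com/ga.js"))  # external
print(classify("18f.gsa.gov", "https://www.gsa.gov/"))                    # internal
```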

Some notes:

  • If pshtt scan data exists for a domain, then domains that are not live, or which redirect externally, will be skipped.
  • The scanner keeps a dict of "known" services that map to one or more hostnames. So Google Analytics maps to www.google-analytics.com, and Digital Analytics Program maps to dap.digitalgov.gov.
  • The scanner sets a default timeout of 60 seconds (longer than phantomas' default of 15), but when a timeout occurs, the scan data collected from phantomas up to that point is saved and used to calculate the result. Right now, there is no way to tell from the data whether the script timed out.
  • The timeout can be overridden with --timeout.
  • www is factored out when comparing hostnames to the given domain. So, scanning https://gsa.gov and seeing that it makes requests to https://www.gsa.gov (or vice versa) will not be logged -- they are considered identical.
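The "known services" mapping and the www-insensitive hostname comparison described in the notes above could look roughly like this. All names here are illustrative assumptions, not the scanner's actual identifiers:

```python
from typing import Optional

# Hypothetical "known services" dict: service name -> one or more hostnames.
KNOWN_SERVICES = {
    "Google Analytics": ["www.google-analytics.com"],
    "Digital Analytics Program": ["dap.digitalgov.gov"],
}

def strip_www(hostname: str) -> str:
    # Factor out a leading "www." so gsa.gov and www.gsa.gov compare equal.
    return hostname[4:] if hostname.startswith("www.") else hostname

def same_host(a: str, b: str) -> bool:
    # https://gsa.gov and https://www.gsa.gov are considered identical.
    return strip_www(a) == strip_www(b)

def known_service_for(hostname: str) -> Optional[str]:
    # Return the known service a hostname belongs to, if any.
    for service, hosts in KNOWN_SERVICES.items():
        if hostname in hosts:
            return service
    return None
```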

TODOs:

  • There's a bug in the "don't scan external redirect domains" logic - it scans dod.gov in addition to defense.gov. This causes a further bug in that the scan for dod.gov thinks requests to defense.gov are external, when they are not.
  • Note as a field in the output whether there was a timeout or not. (If there's a timeout, it could point to a problematic website, and/or a need to extend the overall timeout, and/or some undetected third parties we missed by bailing out early.)
  • Greatly expand the library of "known" services.
  • Add support for regex matching for "known" services, to detect classes of them (e.g. [a-z]+.cloudfront.net).
  • Add a priority order for matching known services, so that (for example) a specific cloudfront.net subdomain can be matched before the catch-all fires.
  • Add a category of "unknown service" (with associated fields), just to factor out domains which haven't yet been isolated and described as a "known" service.
  • Add a category of "affiliated service" (with associated fields, and an --affiliated input option), that checks domains against one or more suffixes of domains known to be "affiliated" with the scanned domains. For example, if one were scanning twitter.com, one might want to provide twimg.com as an "affiliated" suffix so that it doesn't get flagged as "external". For scanning .gov domains, one might want to just provide .gov as an "affiliated" suffix so that any references to shared services within .gov are separated from truly "External" domains.
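A few of these TODOs (regex matching, priority ordering, and the "affiliated" suffix check) could be sketched together. This is a speculative illustration of the ideas, not a proposed implementation; the rule names, patterns, and function names are all made up:

```python
import re

# Ordered rules: specific patterns first, catch-alls last, so the first
# matching rule wins. Both entries below are illustrative examples.
KNOWN_SERVICE_RULES = [
    ("Specific CloudFront distribution", re.compile(r"^d111111abcdef8\.cloudfront\.net$")),
    ("Amazon CloudFront", re.compile(r"^[a-z0-9]+\.cloudfront\.net$")),
]

def match_known_service(hostname: str) -> str:
    # Priority order: a specific cloudfront.net subdomain is matched
    # before the catch-all fires.
    for name, pattern in KNOWN_SERVICE_RULES:
        if pattern.match(hostname):
            return name
    return "unknown service"

def is_affiliated(hostname: str, affiliated_suffixes: list) -> bool:
    # e.g. affiliated_suffixes = [".gov"] or ["twimg.com"], as in the
    # --affiliated option described above.
    for suffix in affiliated_suffixes:
        dotted = suffix if suffix.startswith(".") else "." + suffix
        if hostname == suffix.lstrip(".") or hostname.endswith(dotted):
            return True
    return False
```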

@konklone konklone merged commit 221c97e into master Jun 26, 2017
@konklone (Contributor, Author) commented:

Time to merge this in and get to the TODOs in other PRs.
