This repository has been archived by the owner on Dec 17, 2021. It is now read-only.
First pass at a third party service scanner #107
Merged
This adds a `third_parties` scanner which uses `phantomas` to fetch a given domain's homepage (following redirects as necessary).

Right now the scanner returns data on requests to domains that can be either `External` (outside of the given domain's base domain) or `Internal` (inside of the given domain's base domain, but not equal to the given domain).

So when scanning `18f.gsa.gov`, a request to `https://www.google-analytics.com` is an External domain, but a (theoretical) request to `https://www.gsa.gov` would be an Internal domain (and requests made to `https://18f.gsa.gov` are not considered).

Some notes:
- If `pshtt` scan data exists for a domain, then domains that are not live, or which redirect externally, will be skipped.
- [...] `Google Analytics` maps to `www.google-analytics.com`, and `Digital Analytics Program` maps to `dap.digitalgov.gov`.
- [...] `60` seconds (longer than `phantomas`' default of `15`), but when a timeout occurs, the scan data gotten from `phantomas` so far is saved and used to calculate the result. Right now, there is no way to tell in the data if the script timed out or not.
- [...] `--timeout`.
- `www` is factored out when comparing hostnames to the given domain. So, scanning `https://gsa.gov` and seeing that it makes requests to `https://www.gsa.gov` (or vice versa) will not be logged; they are considered identical.

TODOs:
- [...] `[a-z]+.cloudfront.net`).
- [...] `cloudfront.net` subdomain can be matched before the catch-all fires.
- [...] `--affiliated` input option), that checks domains against one or more suffixes of domains known to be "affiliated" with the scanned domains. For example, if one were scanning `twitter.com`, one might want to provide `twimg.com` as an "affiliated" suffix so that it doesn't get flagged as "external". For scanning .gov domains, one might want to just provide `.gov` as an "affiliated" suffix so that any references to shared services within `.gov` are separated from truly "External" domains.
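
The External/Internal distinction, the `www` normalization, and the proposed "affiliated" suffixes could be sketched roughly as follows. This is an illustration only, not the scanner's code: the function names, the `Affiliated` label, the explicit `base_domain` argument (instead of deriving it from the hostname), and the order in which the categories are checked are all assumptions.

```python
from urllib.parse import urlsplit


def strip_www(hostname):
    """Drop one leading 'www.' so gsa.gov and www.gsa.gov compare equal."""
    hostname = hostname.lower().rstrip(".")
    return hostname[4:] if hostname.startswith("www.") else hostname


def in_suffix(hostname, suffix):
    """True if hostname equals the suffix or is a subdomain of it."""
    suffix = suffix.lower().lstrip(".")
    return hostname == suffix or hostname.endswith("." + suffix)


def classify(scanned, base_domain, request_url, affiliated=()):
    """Classify one request made while loading the scanned domain's homepage.

    Returns 'Internal', 'Affiliated', 'External', or None when the request
    goes to the scanned domain itself (such requests are not considered).
    """
    host = strip_www(urlsplit(request_url).hostname or "")
    if not host or host == strip_www(scanned):
        return None
    # Assumption: Internal is checked before the hypothetical Affiliated category.
    if in_suffix(host, base_domain):
        return "Internal"
    if any(in_suffix(host, suffix) for suffix in affiliated):
        return "Affiliated"
    return "External"
```

With these assumptions, `classify("18f.gsa.gov", "gsa.gov", "https://www.gsa.gov")` yields `"Internal"`, and passing `affiliated=["twimg.com"]` while scanning `twitter.com` keeps a request to `https://pbs.twimg.com/...` out of the `External` bucket.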