Weigh how to frame total number of .gov sites #151

Closed
gbinal opened this issue Mar 19, 2015 · 10 comments

Comments

@gbinal
Member

gbinal commented Mar 19, 2015

From the About This Site section: "Currently, the Digital Analytics Program collects web traffic from almost 300 executive branch government domains, including every cabinet department, out of about 1,350 domains total."

But many of those redirect elsewhere, or are reserved domains that aren't in use rather than live sites. The actual number of live second-level domains at the federal level is ~700-800. Given what that paragraph is trying to say, it might be more helpful to use that number instead of 1,350.
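
To make the difference concrete, here is a rough sketch of how the apparent participation share moves with the choice of denominator. It treats "almost 300" as exactly 300 for illustration, which is an assumption, and the function name `share` is made up for this example.

```python
def share(participating, total):
    """Fraction of domains participating, given a chosen denominator."""
    return participating / total

# "Almost 300" is rounded to 300 here purely for illustration.
participating = 300

# Denominators discussed in this issue: the published 1,350 figure
# and the ~700-800 estimate of live second-level domains.
for total in (1350, 800, 700):
    print(f"{participating} of {total}: {share(participating, total):.1%}")
```

The same numerator reads as a much larger share against the live-domain estimate than against the 1,350 figure, which is why the framing matters.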

@waldoj
Contributor

waldoj commented Mar 19, 2015

Perhaps "functioning domains" is a phrase that would be both accurate and descriptive.

@gbinal
Member Author

gbinal commented Mar 19, 2015

It will be hard at this stage to represent this programmatically so that it is always current. Instead, I suggest we take the current number and frame it something like this: "out of about 772 active domains (as of March 1, 20)". It will be fine to update that figure periodically.

@tdlowden
Contributor

+1 to that. This needs to be fixed.

@gbinal
Member Author

gbinal commented Mar 20, 2015

Update: we're running a fresh scan and should have updated numbers today.

@konklone
Contributor

I've completed the scan. Instead of using the official executive-only list of just under 1200 domains that OGP publishes on Data.gov, I took the full list of 5000+ domains and whittled it down by removing classes of domains, which gives me 1,210 domains. I did that because I'd prefer to err on the side of too many domains.
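
A minimal sketch of that whittling step, for anyone who wants to reproduce it. The column names (`Domain Name`, `Domain Type`), the sample rows, and the excluded classes are all assumptions for illustration, not the actual file layout or the exact classes removed for this scan.

```python
import csv
import io

def whittle(rows, excluded_types):
    """Drop every row whose 'Domain Type' is in excluded_types (case-insensitive)."""
    excluded = {t.lower() for t in excluded_types}
    return [r for r in rows if r["Domain Type"].strip().lower() not in excluded]

# Hypothetical stand-in for the full .gov domain list.
SAMPLE_CSV = """Domain Name,Domain Type
EXAMPLE-AGENCY.GOV,Federal Agency
EXAMPLE-CITY.GOV,City
EXAMPLE-COUNTY.GOV,County
EXAMPLE-BUREAU.GOV,Federal Agency
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Remove whole classes of domains rather than picking domains one by one.
kept = whittle(rows, excluded_types={"City", "County"})
kept_names = [r["Domain Name"] for r in kept]
print(kept_names)
```

Filtering by exclusion keeps any domain whose class is unrecognized, which matches the stated preference to err on the side of too many domains.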

I'm not sure why we say "1350 domains" -- perhaps we were including legislative and judicial domains in that number when we came up with it? I no longer remember.

I then ran these 1,210 through a tool called site-inspector. I've been using site-inspector for some months now to measure HTTPS adoption.

The scan produced these numbers:

  • Total federal executive domains: 1,210
  • Redirects to other domains: 253
  • Not "live" on the web: 214
  • "Live", non-redirect: 743
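
The breakdown above can be tallied along these lines. This is a sketch only: the `live` and `redirect` field names are assumed for illustration and are not claimed to match site-inspector's actual output format.

```python
from collections import Counter

def classify(result):
    """Assign one scan result to exactly one of the three buckets."""
    if not result["live"]:
        return "not_live"
    if result["redirect"]:
        return "redirect"
    return "live_non_redirect"

def tally(results):
    """Count how many scanned domains fall into each bucket."""
    return Counter(classify(r) for r in results)

# Hypothetical scan results for four domains.
sample = [
    {"domain": "a.gov", "live": True, "redirect": False},
    {"domain": "b.gov", "live": True, "redirect": True},
    {"domain": "c.gov", "live": False, "redirect": False},
    {"domain": "d.gov", "live": True, "redirect": False},
]
counts = tally(sample)
print(dict(counts))

# Sanity check on the figures reported in this comment:
# the three buckets should add up to the total scanned.
reported = {"redirect": 253, "not_live": 214, "live_non_redirect": 743}
assert sum(reported.values()) == 1210
```

Because each domain lands in exactly one bucket, the three counts always sum to the number of domains scanned.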

However, I decided to dig into those 214 "not live" domains a bit more, and found that some of them did in fact work when I typed them into my browser. Some demonstrated strange behavior, or were intelligence community login portals that required a Department of Defense computer to securely access.

Some had misconfigured servers that just barely allowed desktop browsers to load the content, but which broke other non-browser tools, like site-inspector. (Some would likely not work on some mobile browsers, either.) Also, some were just intermittently down, or placeholder pages for shut-down websites.

Altogether, there were something like 40+ domains that probably should be considered "live" but which aren't being properly detected during my automated scans. However, many of these sites are also not appropriate for the DAP. It's also the case that some of the ones detected as DAP-eligible would actually be ineligible, upon manual scrutiny, so there's likely some compensation in the other direction too.

The truth is that coming up with an ironclad, exact number for DAP-eligible domains is just not feasible. There's too much cruft and entropy at the margins of the .gov space. And since this scan doesn't analyze subdomains, the whole enterprise is bound to be very imprecise from the get-go.

I've documented and reported a major source of the entropy, the need for "AIA fetching" to accommodate .gov domains that have incomplete certificate chains, but am not sure exactly where/how to implement that.

And if we're really trying to gauge DAP eligibility, then we probably need to incorporate some idea of whether the domain is functionally appropriate. National security sites, live domains with only placeholder content, intelligence community login portals -- and potentially other kinds of login portals with no public content -- likely aren't going to meet that standard.

In the meantime, defensible descriptions include "around 750", "700-800", or "over 700". I'd pick one of these and leave it until the metrics for gauging DAP eligibility are more clear.

@waldoj
Contributor

waldoj commented Mar 21, 2015

That was a very enjoyable read.

@gbinal
Member Author

gbinal commented Mar 23, 2015

I would suggest this: "Currently, the Digital Analytics Program collects web traffic from almost 300 executive branch government domains, including every cabinet department, out of around 750 active domains."

cc @tdlowden @smarina04

@konklone
Contributor

@gbinal I think instead we should just remove the denominator entirely, and just link to the CSV of participating sites. There's no denominator we can stand behind as an excellent measure of DAP adoption -- even beyond all the weirdness I documented above, the lack of measuring subdomains renders it a blunt instrument that in many cases is only measuring the tip of the iceberg.

@tdlowden
Contributor

+1 The dash has spurred a lot of interest, and we can focus on the continued growth of the participants, rather than using resources to continuously define a denominator.

gbinal closed this as completed Mar 27, 2015
@gbinal
Member Author

gbinal commented Mar 27, 2015

Addressed by b903432
