What do you think of a list like this? #1956
Comments
Hi @jawz101, a quick high-level look (see below). This would add almost 20,000 domains to our base list, increasing its bulk. This is a heavy cost considering our list tries to straddle the middle ground between too small to be much good and too large for some applications like, incidentally, Microsoft Windows. It's tempting though. How is the list curated, do you know?
It's something I just threw together based on familiar ad companies which use that sort of naming convention. I based it on the DNS requests that Cisco reports its Umbrella/OpenDNS customers look up every day, using the public daily Top 1 Million list that the Cisco Umbrella DNS service provides. Most of the source lists in the Unified blocklist are stale, so I use these reports to occasionally clean up the Adaway list. If ad companies go out of business or shut down servers, there's no reason for the list to keep blocking them. It's just an experiment for myself, but I figured I'd mention it. It looks like the Unified list has grown by 40,000 over the past few months, so I understand wanting to keep it smaller. Closing the issue since I really only wanted to chat.
I want to keep this open a bit longer @jawz101 so it stays on my radar. I'm presently writing a tool to assess how hosts sources contribute to the Unified list, because I'm considering abandoning stale sources. But first I want to know, systematically: what do we lose? What's the overlap covered by the other components, net of the removal candidate? I'd also love to know the specific domain gains and losses from release to release, and to track the size of components over time...
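To make the "what do we lose?" question concrete, here is a minimal Python sketch of the set arithmetic involved. It is not the tool mentioned above; the function names (`parse_hosts`, `removal_impact`, `release_diff`) are hypothetical, and it assumes each source is available as hosts-format text.

```python
def parse_hosts(text):
    """Return the set of domains in hosts-format text, skipping comments and blanks."""
    domains = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        parts = line.split()
        # Assumption: a line is "<ip> <domain>"; a bare domain list has one field.
        domain = parts[1] if len(parts) >= 2 else parts[0]
        domains.add(domain.lower())
    return domains


def removal_impact(sources, candidate):
    """Domains lost if `candidate` is dropped, i.e. those no other source covers.

    `sources` maps a source name to its set of domains.
    """
    others = set().union(*(d for name, d in sources.items() if name != candidate))
    return sources[candidate] - others


def release_diff(old_release, new_release):
    """Return (gained, lost) domain sets between two releases."""
    return new_release - old_release, old_release - new_release
```

The size of `removal_impact(sources, name)` relative to the size of `sources[name]` gives a rough sense of how much of a source is already covered by the other components.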
Sidenote: I compared the source lists for the current Steven Black Unified Hosts file in the data folders to the most recent Cisco Umbrella (OpenDNS) Top 1 Million DNS lookups. This is how I evaluate the Adaway list on a routine basis. In other words, 99.84% of the 50k entries on the KADhosts list were not looked up yesterday by the millions of devices that use the Cisco Umbrella DNS product. This doesn't factor in entries appearing on multiple lists; it's just one way to view them. I personally think a list can be under 20,000 entries and be effective.
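A comparison like this one can be sketched in a few lines of Python. This is only an illustration, not the script actually used; it assumes the Top 1 Million file is CSV with `rank,domain` rows, and the function names are hypothetical.

```python
import csv
import io


def load_top_domains(csv_text):
    """Parse 'rank,domain' rows (assumed format) into a set of lowercase domains."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[1].strip().lower() for row in reader if len(row) >= 2}


def share_not_looked_up(blocklist_domains, top_domains):
    """Percentage of blocklist entries that are absent from the top list."""
    if not blocklist_domains:
        return 0.0
    missing = blocklist_domains - top_domains
    return 100.0 * len(missing) / len(blocklist_domains)
```

Both arguments to `share_not_looked_up` are plain sets, so the same function works for any source list in the data folders.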
That's very interesting @jawz101. Admittedly the top 1-million is a US-centric, Cisco-specific thing. It would be interesting to see a TLD breakdown of the top 1-million and compare it to KADHosts, since that's the one you mention.
I do not understand the significance of the TLD thing. How do you interpret it?
@jawz101 the TLD breakdown gives us a sense of global coverage. Let's look at Adaway. That's a much different mix of TLDs. KADHosts provides much more coverage of Europe and Eastern Europe. I like the TLD view because it's a different way to slice things. It's hard to draw definitive conclusions about quality based on just TLDs. I presume most independent malicious actors would not be among the top million, and perhaps may have a propensity for small-country or otherwise exotic TLDs. That's just a guess.
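For anyone who wants to reproduce a TLD breakdown, here is a rough Python sketch. It is not the report generator used in this thread; `tld_breakdown` is a hypothetical name, and it simply takes the last dot-separated label of each domain, so `example.co.uk` counts under `uk`.

```python
from collections import Counter


def tld_breakdown(domains):
    """Return (tld, count, percent) tuples, most common TLD first."""
    counts = Counter(
        d.rstrip(".").rsplit(".", 1)[-1].lower()  # last label only
        for d in domains
        if "." in d.rstrip(".")  # skip single-label names such as "localhost"
    )
    total = sum(counts.values())
    return [(tld, n, 100.0 * n / total) for tld, n in counts.most_common()]
```

Running it over two lists and comparing the leading rows side by side gives the kind of regional-mix view described above.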
@jawz101 here's a ghosts report on the top 1-million against our default amalgamated list. A 1.3% overlap. I would say, based on this, the top 1-million domains list is heavily biased towards clean actors.
@jawz101 the full 1-million TLD breakdown is in this Gist: Here's the top few lines of the report. Yeah this appears very heavily biased to the USA. Scroll to the bottom of that Gist. Some crazy and implausible TLDs in that list, shedding some doubt about its quality. That kinda supports a basic premise: large lists are not curateable, so (in general) they aren't curated.
The USA's top-level domain is .us. .com is for commercial companies, regardless of country. Same with .net, .info, .biz, .io. .org is generally used for non-profits, open source projects, and communities.
@jawz101 lol 😆 And That
https://www.statista.com/statistics/918403/number-of-universities-worldwide-by-country/ If you go to university in the U.S., a large chunk of the students are international. Like a lot. And with the U.S. being the 3rd largest country, I scale America's union of states to Europe's union of countries.
@jawz101 that's just not acceptable. I'm not gonna stand for that. KADHosts is based in Poland. They're really strong on threats in that part of the world. HostsVN is based in Vietnam. They are really strong on threats based in that locality and surrounding area. These are strengths, not weaknesses. You can't gauge what we do here relative to a "Top 1-million" list from Cisco. That's nonsense, and I think the numbers clearly bear this out. I see zero evidence that comparisons to the "Top 1-million" list tell us anything. Let's get real. Population of India: 1.38 billion (2020). Total number of
I have no idea why you're making it into whatever this turned out to be, so I'll bow out. Edit: I will say that it's silly you're acting like I have some American exceptionalism thing. A few years ago the Steven Black list was maybe 65,000 entries and now it's about twice as large. Back to the post, I'm just saying Cisco Umbrella (formerly and still OpenDNS) has peering partners such as Baidu, Alibaba, and British Telecom.
On 6/16/22, Steven Black wrote:
@jawz101 the full 1-million TLD breakdown is in this Gist:
https://gist.github.com/StevenBlack/c08283f99a9c0d2042805e19076b971b
Here's the top few lines of the report. Yeah this appears very heavily
biased to the USA.
Scroll to the bottom of [that
Gist](https://gist.github.com/StevenBlack/c08283f99a9c0d2042805e19076b971b).
Some crazy and implausible TLDs in that list, shedding some doubt about its
quality.
Take another look at the list description:
"The popularity list contains our most queried domains based on
passive DNS usage across our Umbrella global network of more than 100
Billion requests per day with 65 million unique active users, in more
than 165 countries."
People request name lookups on crazy and implausible names, so you get
crazy and implausible names in the list. See, for example,
https://icannwiki.org/.home and then look at how many names ending
with ".home" are in the list.
The same can be said for this hosts file. These entries on the current StevenBlack Unified list have invalid TLDs: 0.0.0.0 fe ... But to me, it says a lot that it was more common for someone to look up a .home name, often enough for it to show up on a top 1 million list, than to request some of the entries on the StevenBlack list that do not show up on a top 1 million list. If that makes sense.
There are several ad/tracking/marketing campaign companies that use businesscustomer.acmeadco.com-style entries, which are difficult to maintain and clutter up lists. While many ad blockers are capable of wildcarding these sorts of domains, a hosts file list cannot.
Instead, this list takes the Cisco Umbrella Top 1 Million daily list and pulls out the most popular lookups for these domains:
https://github.com/jawz101/subdomain_blocklists
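The approach described above can be illustrated with a short Python sketch. This is not the code in the linked repository; `subdomain_hosts_entries` is a hypothetical name, and it assumes you already have the Top 1 Million domains in rank order plus a hand-picked list of ad-company parent domains.

```python
def subdomain_hosts_entries(ranked_domains, parent_domains, sink="0.0.0.0"):
    """Emit hosts lines for subdomains of the given parents, keeping rank order.

    Since a hosts file cannot wildcard *.acmeadco.com, every subdomain that
    actually appears in the popularity list is written out explicitly.
    """
    # A leading dot ensures "notacmeadco.com" does not match "acmeadco.com".
    suffixes = tuple("." + p.lower() for p in parent_domains)
    seen = set()
    lines = []
    for domain in ranked_domains:
        d = domain.lower()
        if d.endswith(suffixes) and d not in seen:
            seen.add(d)
            lines.append(f"{sink} {d}")
    return lines
```

Because only subdomains that appear in the popularity list are emitted, the output stays limited to names that are actually being looked up.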