New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
findspam.py: Move IP, NS, and ASN lists to external YAML files #3580
findspam.py: Move IP, NS, and ASN lists to external YAML files #3580
Conversation
What about using YAML like we're already doing for user privileges and chat rooms? It's both easier to write by hand and more readable by human. |
We are using JSON in a lot of places too; but I guess I should look at YAML as well. It could in fact be somewhat easier to coerce into the line per entry organization of the old CSV files (at least approximately). |
@iBug Thanks for the suggestion; please don't merge this yet -- I'll refactor it to use YAML instead. |
f9b7469
to
8785bd7
Compare
Thanks for the suggestion. This is now refactored to use YAML instead of JSON and rebased + force pushed. |
8785bd7
to
f39ed97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems mostly good to me with room for improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more suggestions and comments
dfaafb3
to
566f43d
Compare
@@ -0,0 +1,24 @@ | |||
Schema: yaml_cidr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to be capitalized? I'd prefer to standardize casing if not, i.e. schema:
and schema_version:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was brought up by iBug as well. I wanted them to be lower case but that caused PyYAML to put them at the end of the file, which wasn't great either. Apparently it is hardcoded to alphabetize the keys on the main level, with no easy workaround. So I grudgingly capitalized them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, can we standardize Schema_version
to SchemaVersion
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that really an improvement though? Ugh.
def _validate(self, item): | ||
host_regex = regex.compile(r'^([a-z0-9][-a-z0-9]*\.){2,}$') | ||
if 'ns' not in item: | ||
raise ValueError('Item must have member field "ns": {0!r}'.format(item)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these ValueErrors caught and logged, or do they crash the instance? If it's the latter, it'd be preferable to change it to the former, even if we have to exit after logging it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, same as above.
c92bfd9
to
b1b0a40
Compare
b1b0a40
to
5512a47
Compare
findspam.py: move IP/NS/ASN lists to global variables managed as above {blacklist,watch}ed_{cidr,nse,asn}s.yml: external YAML files with this data
5512a47
to
4319730
Compare
Rather than keep on committing brittle changes to
findspam.py
for routine updates, move these resources to external files, just like we already have for bad keywords, domain names, and phone numbers.I grappled with the design but eventually landed on
JSONYAML for this. I don't think adding more ad-hoc TSV variants is a good way forward, and envision moving all of our external lists to a unified format eventually (YAML or something similarly flexible and structured).We can't make assumptions about the order of the YAML entries in each file, but PyYAML always seems to sort the keys (which is why the
Schema
keys are proper case). The results look good so far, but it probably requires additional testing. I have only done brief sanity checks after converting from JSON (which is what the original version of this pull request had) to YAML. Unlike JSON, YAML should be reasonably diffable and human-editable.Down the line, I also hope to implement some sort of schema validation. I started on my own ad-hoc validation rules but decided to keep them really simple and maybe look at some proper YAML schema validation module going forward.
I'm hoping to migrate the IP blacklist to proper CIDR netblocks in a future update; this already contains some movement in that direction (though for the time being mainly in the naming).