Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findspam.py: Move IP, NS, and ASN lists to external YAML files #3580

Merged
merged 2 commits into from Dec 30, 2019

Conversation

tripleee
Copy link
Member

@tripleee tripleee commented Dec 11, 2019

Rather than keep on committing brittle changes to findspam.py for routine updates, move these resources to external files, just like we already have for bad keywords, domain names, and phone numbers.

I grappled with the design but eventually landed on JSONYAML for this. I don't think adding more ad-hoc TSV variants is a good way forward, and envision moving all of our external lists to a unified format eventually (YAML or something similarly flexible and structured).

We can't make assumptions about the order of the YAML entries in each file, but PyYAML always seems to sort the keys (which is why the Schema keys are proper case). The results look good so far, but it probably requires additional testing. I have only done brief sanity checks after converting from JSON (which is what the original version of this pull request had) to YAML. Unlike JSON, YAML should be reasonably diffable and human-editable.

Down the line, I also hope to implement some sort of schema validation. I started on my own ad-hoc validation rules but decided to keep them really simple and maybe look at some proper YAML schema validation module going forward.

I'm hoping to migrate the IP blacklist to proper CIDR netblocks in a future update; this already contains some movement in that direction (though for the time being mainly in the naming).

@iBug
Copy link
Member

iBug commented Dec 11, 2019

What about using YAML like we're already doing for user privileges and chat rooms? It's both easier to write by hand and more readable by human.

@tripleee
Copy link
Member Author

We are using JSON in a lot of places too; but I guess I should look at YAML as well. It could in fact be somewhat easier to coerce into the line per entry organization of the old CSV files (at least approximately).

@tripleee
Copy link
Member Author

@iBug Thanks for the suggestion; please don't merge this yet -- I'll refactor it to use YAML instead.

@tripleee tripleee changed the title findspam.py: Move IP, NS, and ASN lists to external JSON files findspam.py: Move IP, NS, and ASN lists to external YAML files Dec 12, 2019
@tripleee
Copy link
Member Author

Thanks for the suggestion. This is now refactored to use YAML instead of JSON and rebased + force pushed.

Copy link
Member

@iBug iBug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems mostly good to me with room for improvement

blacklists.py Outdated Show resolved Hide resolved
blacklists.py Outdated Show resolved Hide resolved
blacklisted_cidrs.yaml Outdated Show resolved Hide resolved
blacklists.py Show resolved Hide resolved
blacklists.py Outdated Show resolved Hide resolved
blacklists.py Outdated Show resolved Hide resolved
blacklists.py Show resolved Hide resolved
Copy link
Member

@iBug iBug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more suggestions and comments

@tripleee tripleee force-pushed the external-ip-ns-asn-lists branch 2 times, most recently from dfaafb3 to 566f43d Compare December 16, 2019 16:47
@@ -0,0 +1,24 @@
Schema: yaml_cidr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to be capitalized? I'd prefer to standardize casing if not, i.e. schema: and schema_version:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was brought up by iBug as well. I wanted them to be lower case but that caused PyYAML to put them at the end of the file, which wasn't great either. Apparently it is hardcoded to alphabetize the keys on the main level, with no easy workaround. So I grudgingly capitalized them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, can we standardize Schema_version to SchemaVersion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really an improvement though? Ugh.

blacklists.py Outdated Show resolved Hide resolved
def _validate(self, item):
host_regex = regex.compile(r'^([a-z0-9][-a-z0-9]*\.){2,}$')
if 'ns' not in item:
raise ValueError('Item must have member field "ns": {0!r}'.format(item))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these ValueErrors caught and logged, or do they crash the instance? If it's the latter, it'd be preferable to change it to the former, even if we have to exit after logging it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, same as above.

@tripleee tripleee force-pushed the external-ip-ns-asn-lists branch 4 times, most recently from c92bfd9 to b1b0a40 Compare December 20, 2019 06:36
@tripleee tripleee removed the request for review from makyen December 20, 2019 06:37
findspam.py: move IP/NS/ASN lists to global variables managed as above

{blacklist,watch}ed_{cidr,nse,asn}s.yml: external YAML files with this data
@tripleee tripleee merged commit 380ef4b into Charcoal-SE:master Dec 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants