Add aggressive mode to CCIAnalyzer #97

Open
MER-C opened this issue Aug 17, 2015 · 3 comments

Comments

MER-C (Owner) commented Aug 17, 2015

CCIAnalyzer could do with a second pass mode that, after looking at the original CCI, removes:

  • Template arguments
  • References
  • Comments

Other things that can be removed should be investigated.
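As a rough illustration of the second pass, the sketch below strips the three kinds of markup listed above from a piece of wikitext. It is written in Python for brevity and is not CCIAnalyzer's code: the function names `strip_templates` and `preprocess` are hypothetical, the sketch drops whole template calls rather than only their arguments, and the regular expressions cover only the common forms of comments and `<ref>` tags.

```python
import re

# Assumed patterns; real wikitext has more edge cases than these handle.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
REF_RE = re.compile(r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>",
                    re.DOTALL | re.IGNORECASE)


def strip_templates(text: str) -> str:
    """Drop everything inside {{ ... }}, tracking nesting depth."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


def preprocess(text: str) -> str:
    """Hypothetical second pass: remove comments, references and templates."""
    text = COMMENT_RE.sub("", text)
    text = REF_RE.sub("", text)
    text = strip_templates(text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


sample = ('Foo<!-- note --> bar<ref name="x">{{cite web|url=http://e.org}}</ref>'
          ' baz {{infobox|a={{b}}}}<ref name="x" />.')
print(preprocess(sample))
```

References are removed before templates so that a citation template inside a `<ref>` disappears together with its enclosing tag.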

MER-C (Owner, Author) commented Sep 6, 2015

Disambiguation pages and list pages should also be culled.
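A title-based cull along these lines could look roughly like the following Python sketch. The function names and the title patterns are assumptions made for illustration; in particular, not every disambiguation page carries the "(disambiguation)" suffix, so a title check alone would miss some.

```python
def is_cullable_title(title: str) -> bool:
    """Hypothetical check: does the title look like a disambiguation or list page?"""
    title = title.strip()
    if title.endswith("(disambiguation)"):
        return True
    # Assumed naming convention for list pages.
    return title.startswith(("List of ", "Lists of "))


def cull_titles(titles):
    """Keep only the titles that still need human review."""
    return [t for t in titles if not is_cullable_title(t)]


pages = ["Mercury (disambiguation)", "List of rivers of Chile", "Mercury (planet)"]
print(cull_titles(pages))
```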

MER-C (Owner, Author) commented Apr 5, 2018

Also consider raising the word threshold to 15 or so.
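To make the effect of the threshold concrete, here is a minimal Python sketch. It assumes that an edit is kept for review when the text it adds contains at least `threshold` whitespace-separated words; the function name, the `>=` comparison and the sample edits are all illustrative assumptions, not CCIAnalyzer's actual rule.

```python
def is_major_addition(added_text: str, threshold: int = 15) -> bool:
    """Hypothetical rule: keep an edit only if it adds at least `threshold` words."""
    return len(added_text.split()) >= threshold


edits = {
    "short": "Fixed the date of birth in the infobox",
    "long": ("The film follows a retired detective who returns to his home town "
             "and reopens an unsolved case from twenty years earlier"),
}

# Show which edits would survive at two different thresholds.
for limit in (5, 15):
    kept = [name for name, text in edits.items() if is_major_addition(text, limit)]
    print(limit, kept)
```

Raising the limit drops the eight-word edit from the review list while the longer addition stays.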

MER-C (Owner, Author) commented Dec 28, 2019

  • 634eea7 implemented a variable word threshold and removal of list items and references.
  • a52526b added a table start culling function and stopped counting "words" consisting only of punctuation characters against the word count limit.
  • a226cc7 added preprocessing functions; so far these cover comment and external link removal.
  • 64feb0f added culling that acts on page titles, with functions for disambiguation and list pages.
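The commits themselves are the authoritative reference. Purely as an illustration of two of the ideas above, the Python sketch below counts words while ignoring punctuation-only tokens and drops added lines that are list items or the start of a table. All names are hypothetical, and treating any line that begins with `*`, `#` or `{|` as cullable is an assumption about how such a filter might work.

```python
def count_real_words(text: str) -> int:
    """Count tokens that contain at least one letter or digit."""
    return sum(1 for tok in text.split() if any(ch.isalnum() for ch in tok))


def is_list_item(line: str) -> bool:
    # Wikitext list items start with * (bulleted) or # (numbered).
    # Note that this assumed rule would also match "#REDIRECT" lines.
    return line.startswith(("*", "#"))


def is_table_start(line: str) -> bool:
    # Wikitext tables open with {| at the start of a line.
    return line.startswith("{|")


def cull_lines(added_lines):
    """Drop added lines that are list items or table openers."""
    return [ln for ln in added_lines
            if not (is_list_item(ln) or is_table_start(ln))]


added = ['{| class="wikitable"', "* First item", "A sentence of real prose."]
print(cull_lines(added))
print(count_real_words("Hello -- world ... !"))
```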

Further ideas, should they become necessary in the future:

  • a blacklist that makes sure that all edits that touch a "plot", "synopsis" or "lyrics" section are flagged as major
  • filters on full diff content (would this need machine-readable diffs?)
  • safe table row culling
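The first idea could be approximated as in the Python sketch below. The issue does not say how the edited section would be detected; this sketch assumes it can be read from the `/* Section name */` prefix that MediaWiki puts in the edit summary of a section edit, which editors are free to delete. The function name and the substring matching are likewise illustrative assumptions.

```python
import re

# MediaWiki prefixes section edits with "/* Section name */" in the summary.
SECTION_RE = re.compile(r"/\*\s*(.*?)\s*\*/")

# Section names taken from the idea above.
BLACKLIST = ("plot", "synopsis", "lyrics")


def touches_blacklisted_section(edit_summary: str) -> bool:
    """Hypothetical check: was the edit made to a plot, synopsis or lyrics section?"""
    match = SECTION_RE.search(edit_summary)
    if match is None:
        return False
    section = match.group(1).lower()
    return any(word in section for word in BLACKLIST)


for summary in ("/* Plot summary */ expanded", "/* Reception */ fix", "typo"):
    print(summary, touches_blacklisted_section(summary))
```

An edit for which this returns `True` would be flagged as major regardless of its word count.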
