Belisarius - Detecting vandalism on Stack Overflow

Background

This bot was developed to help catch possible vandalism by authors editing their own posts. This includes:

  • Removing all code
  • Replacing all content with nonsense
  • Replacing all content with repeated words
  • Adding solutions to their questions instead of posting an answer
  • Removing large amounts of text from their post
  • Using certain keywords or offensive language within the edit summary

Why do we need the bot?

The point of the bot is to help identify bad edits and/or potential vandalism made to posts in real time so that the changes can be quickly rolled back.

Implementation

The bot queries the Stack Exchange API once a minute for a list of the latest posts. It then checks that each post has been edited, and that the edit was made by the post's author.

The post_id of each post is then used to query the Stack Exchange API again for the list of revisions. To limit the number of calls, the bot passes multiple ids in a single request and then picks out the latest revision of each post.
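
The batching and latest-revision steps could look roughly like the sketch below. It is not the bot's actual code: the class, field and method names are hypothetical, the API version in the URL is an assumption, and the sketch assumes each revision carries a post id and a revision number. It relies only on the Stack Exchange API accepting several ids joined with semicolons in `/posts/{ids}/revisions`.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: every class, field and method name here is hypothetical.
final class RevisionBatchSketch {

    /** Minimal stand-in for one revision entry returned by the API. */
    static final class Revision {
        final long postId;
        final int revisionNumber;

        Revision(long postId, int revisionNumber) {
            this.postId = postId;
            this.revisionNumber = revisionNumber;
        }
    }

    /** Builds one request URL covering several posts (ids joined with ';'). */
    static String revisionsUrl(Collection<Long> postIds) {
        String ids = postIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(";"));
        // API version 2.2 is an assumption made for this sketch.
        return "https://api.stackexchange.com/2.2/posts/" + ids
                + "/revisions?site=stackoverflow";
    }

    /** Keeps only the highest-numbered revision seen for each post. */
    static Map<Long, Revision> latestPerPost(List<Revision> revisions) {
        Map<Long, Revision> latest = new HashMap<>();
        for (Revision r : revisions) {
            latest.merge(r.postId, r,
                    (a, b) -> a.revisionNumber >= b.revisionNumber ? a : b);
        }
        return latest;
    }
}
```

One request built by `revisionsUrl` then replaces one call per post, and `latestPerPost` reduces the combined response to a single revision for each post id.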

An edit can consist of a title change, a change to the body of a question, tag changes, or a change to the body of an answer. Tags are currently not checked. The title, question body or answer body, depending on what was edited, is run through the filters, as is the edit summary.

Filtering

Titles are run through the following filters:

  • BlacklistedWords; certain words are appended to titles. The bot reads a file which holds a list of keywords to watch out for within titles
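
As an illustration of a keyword filter of this kind, here is a minimal sketch. The class and method names are hypothetical, and it assumes the keyword file holds one keyword per line and that only keywords newly added by the edit should be reported; the bot's real matching rules may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: names and matching rules are assumptions, not the bot's code.
final class BlacklistSketch {

    /**
     * Returns every keyword that occurs as a whole word in the new text
     * but not in the old text, ignoring case.
     */
    static List<String> addedKeywords(List<String> keywords, String oldText, String newText) {
        List<String> hits = new ArrayList<>();
        for (String raw : keywords) {
            String keyword = raw.trim();
            if (keyword.isEmpty()) {
                continue; // skip blank lines in the keyword file
            }
            Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            if (p.matcher(newText).find() && !p.matcher(oldText).find()) {
                hits.add(keyword);
            }
        }
        return hits;
    }
}
```

The keyword list could be loaded with `Files.readAllLines` from whichever file the filter uses. The word-boundary match keeps a keyword such as "solved" from firing on "unsolved".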

The question/answer body is run through the following filters:

  • TextRemoved; 80% or more of the body must have been removed, and the Jaro-Winkler similarity between the old and new body must be less than 0.6
  • BlacklistedWords; certain words are appended to posts. The bot reads a separate file for questions and answers. Both hold a list of keywords to watch for
  • CodeRemoved; the bot watches for all code being removed
  • FewUniqueCharacters; the body must either be 30 or more characters long with fewer than 7 unique characters, or be 100 or more characters long with fewer than 16 unique characters
  • RepeatedWords; this is when an edit replaces the whole body with repeated words. The bot reports the edit if 5 or fewer unique words are found
  • VeryLongWord; the bot checks the post for a word longer than 50 characters. Code is removed before the check is done
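
The three threshold-based checks (FewUniqueCharacters, RepeatedWords and VeryLongWord) can be sketched as follows. The thresholds come from the list above; everything else is an assumption made for illustration: the names are hypothetical, whitespace counts as a character, words are split on whitespace and compared ignoring case, and code is expected to have been stripped before `veryLongWord` is called.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch only: names, tokenisation and case handling are assumptions.
final class BodyFilterSketch {

    /** 30+ chars with fewer than 7 unique chars, or 100+ chars with fewer than 16. */
    static boolean fewUniqueCharacters(String body) {
        long unique = body.chars().distinct().count();
        int length = body.length();
        return (length >= 30 && unique < 7) || (length >= 100 && unique < 16);
    }

    /** True when a non-empty body has 5 or fewer distinct words. */
    static boolean repeatedWords(String body) {
        String trimmed = body.trim();
        if (trimmed.isEmpty()) {
            return false; // an empty body is left to other filters
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        Set<String> unique = new HashSet<>(Arrays.asList(words));
        return unique.size() <= 5;
    }

    /** True when any whitespace-separated word is longer than 50 characters. */
    static boolean veryLongWord(String bodyWithoutCode) {
        for (String word : bodyWithoutCode.trim().split("\\s+")) {
            if (word.length() > 50) {
                return true;
            }
        }
        return false;
    }
}
```

Each check is a pure function of the post text, so the filters can be run independently and in any order on the latest revision.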

Edit summaries are run through the following filters:

  • BlacklistedWords; certain words are used within the edit summaries. The bot holds a separate file for question edit summaries and answer edit summaries. Both hold a list of keywords to watch for
  • OffensiveWord; the bot checks for offensive language used within the edit summary. This is done via a separate regex file
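
A regex-driven check like OffensiveWord might be structured as in this sketch. It assumes the regex file holds one pattern per line and that matching ignores case; the file format, the class name and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: the one-pattern-per-line format is an assumption.
final class OffensiveSummarySketch {

    /** Compiles one case-insensitive pattern per non-blank line. */
    static List<Pattern> compile(List<String> regexLines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : regexLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
            }
        }
        return patterns;
    }

    /** True when any pattern matches somewhere in the edit summary. */
    static boolean isOffensive(List<Pattern> patterns, String summary) {
        for (Pattern p : patterns) {
            if (p.matcher(summary).find()) {
                return true;
            }
        }
        return false;
    }
}
```

Compiling the patterns once at start-up and reusing them for every edit summary avoids recompiling the regexes on each polling cycle.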

Accounts

The bot runs under the user Belisarius in the SOBotics chat room. A more detailed presentation, including a list of commands, is at http://belisarius.sobotics.org/.

Feedback

Feedback is currently given by replying to the bot's chat message with either tp (true positive) or fp (false positive).

A sample image of a report is:

Sample Image

The source code is available on GitHub and suggestions are welcome. The project is still in the testing phase.