
Plagiarism detection with Python 3 and Django

In 2014-2015 I developed a side project called Plagiarism Guard, a plagiarism detection service. I mainly did this to teach myself Python (kudos to Learn Python the Hard Way) and to implement some (very!) basic NLP. It was developed in Python 3 with the Django framework.

The basic premise is to accept resources in a few different formats (URLs being the most popular, but also text files and Office-type documents). The resources are then scanned periodically by custom management commands triggered from a crontab, and a few fairly distinctive phrases are pulled out of each one. These phrases are searched online using Bing's search engine API, and the results become the 'plagiarism-detected' candidates. Finally, these candidates are scanned to rule out false positives and to work out an approximate duplication score.
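
As a rough illustration of the phrase-selection step described above, here is a minimal sketch. It is not the code used in this project: the function name `pick_candidate_phrases`, the fixed-size word windows and the rarity scoring are all assumptions made for the example. It splits the text into non-overlapping word windows and prefers the windows whose words occur least often in the document.

```python
import re
from collections import Counter


def pick_candidate_phrases(text, num_phrases=3, phrase_len=8):
    """Return up to num_phrases word windows built from the rarest words.

    Hypothetical helper for illustration only; the scoring heuristic
    is an assumption, not the project's actual method.
    """
    words = re.findall(r"[A-Za-z']+", text)
    if len(words) < phrase_len:
        # Too short to window: fall back to the whole text, if any.
        return [" ".join(words)] if words else []

    # How often each word appears in this document (case-insensitive).
    freq = Counter(w.lower() for w in words)

    windows = []
    for start in range(0, len(words) - phrase_len + 1, phrase_len):
        window = words[start:start + phrase_len]
        # Rare words contribute more to the score than common ones.
        score = sum(1.0 / freq[w.lower()] for w in window)
        windows.append((score, start, " ".join(window)))

    # Highest score first; ties are broken by position in the text.
    windows.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, _, phrase in windows[:num_phrases]]
```

Each returned phrase could then be sent to a search API as a quoted query. A real implementation would probably also want to skip boilerplate such as navigation text and copyright notices.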

The files for this project are organised into three main folders:

  • /plag/ - this is the bulk of the Django application, hence it contains the models, forms, routes, services etc.
    • The /plag/templates/ folder contains the HTML pages covering both the public (unauthenticated) website pages such as the order form and legal documents (under static/), and the account (authenticated) pages (under dynamic/)
    • /plag/templatetags/ contains the Django custom template tags used in various parts of the HTML frontend
    • /plag/management/commands contains the custom management commands:
      • One command chooses the ProtectedResource entries which are due to be scanned, and then calls the relevant 'utility' methods in /util/getqueriespertype/ to get a few (hopefully distinct) queries from the document/resource. Bing's search engine API is then called in /util/ to get any potential plagiarism matches for each query. These results are saved back to the DB.
      • A second command then looks at each potential plagiarism match URL, loads the URL and parses its text content to decide whether it is a false positive. If it is a real match, a duplication percentage score is calculated, which then appears on the user's account.
      • A third command parses a blog's RSS feed and saves the latest results to the database, so that the blog results can be shown in a cached/efficient way.
  • /PlagiarismGuard/ - these are the standard Django files used to configure and power the application.
  • /util/ - as covered a little above, these are a set of 'utilities' which perform the bulk of the plagiarism detection work.
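
To make the false-positive check and the duplication score more concrete, here is a minimal sketch of one way such a score could be computed. It is an assumption for illustration, not the algorithm in /util/: it compares overlapping word 5-grams ("shingles"), and the function names and the 5% threshold are hypothetical.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def duplication_percentage(protected_text, candidate_text, size=5):
    """Percentage of the protected text's shingles found in the candidate."""
    protected = shingles(protected_text, size)
    if not protected:
        # Protected text is shorter than one shingle: nothing to compare.
        return 0.0
    shared = protected & shingles(candidate_text, size)
    return 100.0 * len(shared) / len(protected)


def is_false_positive(protected_text, candidate_text, threshold=5.0):
    """Treat a candidate as a false positive below the threshold.

    The threshold value is an arbitrary assumption for this sketch.
    """
    return duplication_percentage(protected_text, candidate_text) < threshold
```

For example, `duplication_percentage("one two three four five six", "one two three four five seven")` gives 50.0, because one of the two 5-word shingles in the protected text also appears in the candidate.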

A further write-up of this project is available on my personal site.
