# Plagiarism detection

**Plagiarism** is the use of another person's work or ideas as if they were one's own. Most institutions strongly discourage it. An educational institution, for example, wants to identify cases of plagiarism so that students can be penalised and thereby deterred. However, because plagiarism lives in the realm of natural language, whose rules keep changing, detecting it is a much harder problem than it first seems. In this report, we therefore look at a method built on a machine learning algorithm, so that the detection techniques evolve as the users' data evolves.

## Feature selection

### String matching

Suppose we have two student submissions, _"today is Monday"_ and _"day"_. After removing the spaces and converting each to lower case, they become `todayismonday` and `day`, and we can find their common substrings of length `k` (_k-grams_). If `k` is 3, the only shared 3-gram is `day`, and its (i, j) start positions in the first and second strings are (2, 0) and (10, 0).

**NOTE:** We do not have to remove the spaces, but to keep the solution straightforward, all strings must be in lower case. If the spaces are kept, the match positions shift accordingly: to (2, 0) and (12, 0) in the example above.
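
The matching described above can be sketched in a few lines of Python. This is only an illustration: `normalize` and `kgram_matches` are hypothetical names introduced here, not part of the project's `helpers` module, and the sketch assumes that all whitespace is removed before matching.

```python
def normalize(text: str) -> str:
    """Lower-case the text and drop all whitespace (an assumed preprocessing step)."""
    return "".join(text.lower().split())


def kgram_matches(first: str, second: str, k: int) -> list:
    """Return every (i, j) such that first[i:i+k] == second[j:j+k]."""
    # Index each k-gram of the second string by its start positions.
    positions = {}
    for j in range(len(second) - k + 1):
        positions.setdefault(second[j:j + k], []).append(j)

    # Slide over the first string and look each k-gram up in the index.
    matches = []
    for i in range(len(first) - k + 1):
        for j in positions.get(first[i:i + k], []):
            matches.append((i, j))
    return matches


first = normalize(" today is Monday")   # "todayismonday"
second = normalize("day")               # "day"
print(kgram_matches(first, second, k=3))  # [(2, 0), (10, 0)]
```

Indexing the k-grams of one string in a dictionary avoids comparing every pair of positions directly.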

This matching forms the basis of the **containment** feature, which compares the occurrences of _k-grams_ in a submitted answer text and a source text, normalised by the number of k-grams in the answer text. With $A$ denoting the answer text and $S$ the source text, the formula for containment is: $$\frac{\sum count(\text{k-gram}_{A}) \cap count(\text{k-gram}_{S}) }{\sum count(\text{k-gram}_{A}) }$$
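
As a rough illustration of the formula, the sketch below computes containment over character k-grams. It makes two assumptions that the text does not spell out: the intersection is taken as the minimum of the two counts for each shared k-gram, and spaces are kept. The names `kgram_counts` and `containment` are hypothetical.

```python
from collections import Counter


def kgram_counts(text: str, k: int) -> Counter:
    """Count every character k-gram in the lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def containment(answer: str, source: str, k: int) -> float:
    """Share of the answer's k-grams that also occur in the source."""
    answer_counts = kgram_counts(answer, k)
    source_counts = kgram_counts(source, k)
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0  # the answer is shorter than k, so there is nothing to compare
    # `&` on Counters keeps the minimum count of each shared k-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total


print(containment("day", "today is monday", k=3))     # 1.0
print(containment("sunday", "today is monday", k=3))  # 0.5
```

In the second call, the answer has four 3-grams (`sun`, `und`, `nda`, `day`), of which only `nda` and `day` occur in the source, giving 2 / 4 = 0.5.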

We will explore two methods for computing such matches:
1. Rolling hashing
2. Regular hash matching

> **Citation:** Thomas H. Cormen. “Introduction to Algorithms”. Apple Books. https://books.apple.com/us/book/introduction-to-algorithms/id570172300

#### Rolling hashing

To use the rolling hashing technique, we will implement the _Rabin-Karp_ algorithm. Its worst-case matching time is $O((n-m+1)m)$, where $n$ is the length of the text being searched (the source text) and $m$ is the length of the pattern being looked for (the answer text).


```python
import helpers

# Example usage of Rabin-Karp
source = "today is monday"
answer = "day"

helpers.rabin_karp_matcher(
    target=source,     # text to search in
    potential=answer,  # pattern to look for
    d=7,
    q=10
)
```

```
[2, 12]
```

The output of the above call lists the positions at which the potential text appears in the target text. As we can see, _"day"_ appears in the source text at indexes 2 and 12 (2 occurrences).
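
The body of `helpers.rabin_karp_matcher` is not shown in this report. For readers who want to see the idea, here is a minimal sketch of a Rabin-Karp matcher in the style of the cited textbook. The function name `rabin_karp_sketch`, the use of `ord` as the character encoding, and the default values of `d` and `q` are assumptions made for illustration and may differ from the actual helper.

```python
def rabin_karp_sketch(target: str, potential: str, d: int = 256, q: int = 101) -> list:
    """Return every index at which `potential` occurs in `target`."""
    n, m = len(target), len(potential)
    if m == 0 or m > n:
        return []

    h = pow(d, m - 1, q)  # weight of the leading character, modulo q
    p = t = 0             # hash of the pattern and of the current window
    for i in range(m):
        p = (d * p + ord(potential[i])) % q
        t = (d * t + ord(target[i])) % q

    matches = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match, so confirm by direct comparison.
        if p == t and target[s:s + m] == potential:
            matches.append(s)
        if s < n - m:
            # Roll the window: drop target[s], append target[s + m].
            t = (d * (t - ord(target[s]) * h) + ord(target[s + m])) % q
    return matches


print(rabin_karp_sketch("today is monday", "day"))  # [2, 12]
```

Because every hash hit is verified by a direct string comparison, the choice of `d` and `q` affects only how many spurious hits must be checked, not the positions that are returned.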

#### Regular hash matching