# Plagiarism detection

**Plagiarism** is the use of another person's work or ideas as if they were one's own. Most institutions strongly discourage it. An educational institution, in particular, wants to identify cases of plagiarism so that students can be penalized and thereby deterred from the practice. However, because plagiarism lives in the realm of natural language, whose rules keep changing, detecting it is a much harder problem than it first appears. In this report, we therefore look at a method based on a machine learning algorithm, so that the detection technique evolves as the users' data evolves.

## Feature selection

### String matching

Suppose we have two student submissions, _"today is Monday"_ and _"day"_. After concatenating all the words of each submission and converting them to lower case (giving `todayismonday` and `day`), we can point out common substrings of length `k` (_k-grams_). In this case, if `k` is 3, the (i, j) positions of the matches in the first and second strings are (2, 0) and (10, 0).

**NOTE:** We do not have to remove the spaces, but to keep our solution straightforward, we do need all the strings to be in lower case.
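The matching described above can be illustrated with a short sketch. It is not the code from `helpers`; the function names `normalize` and `common_kgram_positions` are hypothetical, and the sketch assumes character-level k-grams on the concatenated, lower-cased text.

```python
def normalize(text):
    """Lower-case the text and drop whitespace, joining all words together."""
    return "".join(text.lower().split())


def common_kgram_positions(first, second, k):
    """Return (i, j) pairs where the k-gram at i in `first` equals the one at j in `second`."""
    a, b = normalize(first), normalize(second)
    # Index every k-gram of the second string by its starting positions.
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    # Look up each k-gram of the first string in that index.
    return [(i, j)
            for i in range(len(a) - k + 1)
            for j in index.get(a[i:i + k], [])]


print(common_kgram_positions("today is Monday", "day", 3))  # [(2, 0), (10, 0)]
```

The dictionary lookup avoids comparing every k-gram of one string against every k-gram of the other.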

This k-gram matching forms the basis of the **containment** feature, which counts the _k-grams_ that a submitted (answer) text $A$ shares with a source text $S$, relative to the number of k-grams in the answer text. The formula for containment is:

$$\frac{\sum count(\text{k-gram}_{A}) \cap count(\text{k-gram}_{S})}{\sum count(\text{k-gram}_{A})}$$
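As a rough illustration of the containment formula, the sketch below counts character-level k-grams with `collections.Counter`. It assumes that the intersection of two counts means the smaller of the two counts for each shared k-gram; the names `kgram_counts` and `containment` are hypothetical and are not taken from the project code.

```python
from collections import Counter


def kgram_counts(text, k):
    """Count every substring of length k in the text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def containment(answer, source, k):
    """Share of the answer's k-grams that also occur in the source."""
    a = kgram_counts(answer.lower(), k)
    s = kgram_counts(source.lower(), k)
    total = sum(a.values())
    if total == 0:
        # The answer is shorter than k, so it has no k-grams at all.
        return 0.0
    # Counter & Counter keeps the minimum count per shared k-gram
    # (assumed reading of the intersection in the formula).
    shared = sum((a & s).values())
    return shared / total


print(containment("day", "today is monday", 3))  # 1.0
```

A value of 1.0 means every k-gram of the answer also appears in the source, while 0.0 means none does.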

We will explore two methods that enable us to find such matches:
1. Rolling hashing
2. Regular hash matching

> **Citation:** Thomas H. Cormen. “Introduction to Algorithms”. Apple Books. https://books.apple.com/us/book/introduction-to-algorithms/id570172300

#### Rolling hashing

To use the rolling hashing technique, we will implement the _Rabin-Karp_ algorithm. It has a worst-case matching time of $O((n-m+1)m)$, where $n$ is the length of the text being searched (the target) and $m$ is the length of the pattern being looked for (the potential match).


In [1]:
import helpers

# Example usage of Rabin Karp
source = "today is monday"
answer = "day"

helpers.rabin_karp_matcher(
    target=source,
    potential=answer,
    d=7,
    q=10
)

[2, 12]

The output of the above function is the list of positions at which the potential text appears in the target text. As we can see, _"day"_ appears in the source text at indexes 2 and 12 (2 occurrences).
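For readers who want to see the rolling update itself, here is a minimal sketch of a Rabin-Karp matcher. It is not necessarily how `helpers.rabin_karp_matcher` is written: the function name `rabin_karp`, the default radix `d=256`, and the default prime modulus `q=1_000_003` are assumptions made for this illustration.

```python
def rabin_karp(text, pattern, d=256, q=1_000_003):
    """Return every index at which `pattern` occurs in `text`."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # weight of the leading character in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    matches = []
    for s in range(n - m + 1):
        # Equal hashes can still be a collision, so confirm with a comparison.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Roll the window: drop text[s], shift, and append text[s + m].
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return matches


print(rabin_karp("today is monday", "day"))  # [2, 12]
```

Each window's hash is derived from the previous one in constant time, so the expensive character-by-character comparison only runs when the hashes agree.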

#### Regular hash matching

When implementing the regular hash matching algorithm, we calculate every hash value from scratch instead of reusing previously computed ones. The worst-case time complexity is the same as that of the rolling hashing algorithm, $O((n-m+1)m)$. However, the constant factors are higher in the regular hash matching algorithm, since we compute a full hash for each window.

In [2]:
helpers.non_rolling_matcher(
    target=source,
    potential=answer
)

[2, 12]

We get the same answer as with the rolling hashing algorithm. The regular hash matching function hashes the values using the `hash` function from `mmh3`.
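The idea can be sketched without the rolling update as follows. To keep the example dependency-free, this sketch swaps in Python's built-in `hash` in place of `mmh3`, and the name `full_hash_matcher` is hypothetical; it is not the `helpers.non_rolling_matcher` implementation.

```python
def full_hash_matcher(text, pattern):
    """Return every index at which `pattern` occurs in `text`, rehashing each window."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    pattern_hash = hash(pattern)  # stand-in for an mmh3 hash of the pattern
    matches = []
    for s in range(n - m + 1):
        window = text[s:s + m]
        # The window hash is recomputed from scratch, which costs O(m) each time.
        if hash(window) == pattern_hash and window == pattern:
            matches.append(s)
    return matches


print(full_hash_matcher("today is monday", "day"))  # [2, 12]
```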

### Longest Common Subsequence

The other feature that is crucial in helping us identify plagiarized work is the longest common subsequence (LCS). The LCS of two sequences is a sequence of maximal length that appears in both of them in the same relative order, though not necessarily contiguously.
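A minimal sketch of computing the length of the LCS with the standard dynamic programming recurrence is shown here. The function name `lcs_length` is hypothetical, and the sketch returns a plain integer length, which may differ from what the project's helper returns.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise, carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("today is monday", "day"))  # 3
```

Keeping only two rows of the table uses memory proportional to the length of `y` while the running time stays proportional to the product of the two lengths.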

In [3]:
helpers.longest_common_subsequence(
    x=source,
    y=answer
)

3.0

## Modeling

The data we use to train a plagiarism-detecting machine learning model is a slightly modified version of a dataset created by Paul Clough (Information Studies) and Mark Stevenson (Computer Science) at the University of Sheffield. More information is available on [their university webpage](https://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html).

> **Citation for data:** Clough, P. and Stevenson, M. Developing A Corpus of Plagiarized Short Answers, Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis, In Press.

The data contains _answer_ and _source_ text files. The answer text files are based on different tasks (labeled A-E).

The text files also have different levels of plagiarism:
1. cut: Copy-pasted directly.
2. light: Includes some copying and paraphrasing.
3. heavy: Expressed using different words and structure.
4. non: Not plagiarized.
5. orig: The original (source text).

> So, out of the submitted files, the only category that does not contain any plagiarism is `non`.


### Preprocessing

> **Citation:** Done in the [feature engineering notebook](./feature_engineering.ipynb)

> **Citation:** Since this project is an extension of one of my earlier projects, some of the feature engineering code is similar to the code I used in: https://github.com/Inventrohyder/ML_SageMaker_Studies/tree/master/Project_Plagiarism_Detection

In this stage, we carry out the following tasks:

1. Switch the labels of the different files from strings to numbers, since it is easier for machine learning algorithms to work with numbers.

2. To make it easier, also map the labels to a binary class (0 for not plagiarized; 1 for plagiarized).

3. Convert the text to lower case only.

4. Label the data as 'train', 'test', or 'orig'.

5. Calculate our custom containment value using rolling hashing for different values of `k`.

6. Calculate the longest common subsequence.

7. To avoid redundant, correlated features, only pick the columns whose correlation with one another is lower than 0.95.

![](images/correlation.png)

8. Split the data into train and test data files with the selected features.
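The correlation-based selection step in the list above could look roughly like the following sketch. It assumes a greedy rule (keep a feature only if its absolute Pearson correlation with every feature kept so far is below the threshold), which may differ from the rule used in the feature engineering notebook. The feature names `c_1`, `c_2`, and `lcs`, and the values attached to them, are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (spread_x * spread_y)


def select_uncorrelated(features, threshold=0.95):
    """Greedily keep features that are not too correlated with those already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up feature columns: c_2 is an exact multiple of c_1, so it is redundant.
example = {
    "c_1": [0.1, 0.5, 0.9, 0.3],
    "c_2": [0.2, 1.0, 1.8, 0.6],
    "lcs": [0.9, 0.1, 0.4, 0.8],
}
print(select_uncorrelated(example))  # ['c_1', 'lcs']
```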

### Generate models

> **Citation:** Done in the [train model notebook](train_model.ipynb)

To ensure that we apply design thinking, we fit several different machine learning algorithms to our data.

After evaluation, we get the following results:

In [4]:
import pandas as pd

pd.read_csv('results.csv', index_col=0)

| Model | scores |
|---|---|
| Nearest Neighbors | 0.88 |
| Linear SVM | 0.92 |
| RBF SVM | 0.6 |
| Gaussian Process | 0.56 |
| Decision Tree | 0.92 |
| Random Forest | 0.92 |
| Neural Net | 0.84 |
| AdaBoost | 0.88 |
| Naive Bayes | 0.88 |
| QDA | 0.84 |


As we can see, the three models that do best are _Linear SVM_, _Decision Tree_, and _Random Forest_, each with a score of 0.92.

## User-centered design

Now that we have a model, we can design an application that asks the user for an answer text and a source text. The application then calculates all the features we used to train the model and gives a prediction of how likely it is that the answer is plagiarized.

To check whether the work has been plagiarized from another student's work, we simply use that other student's work as the source text.
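The flow described in this section can be sketched as a small feature-extraction step followed by a call to the trained model. Everything named here is an assumption for illustration: the feature names `c_1`, `c_3`, `c_5`, and `lcs`, the chosen values of `k`, and the `predict_probability` callable, which stands in for whatever prediction interface the trained model exposes.

```python
from collections import Counter


def containment(answer, source, k):
    """Share of the answer's character k-grams that also occur in the source."""
    a = Counter(answer[i:i + k] for i in range(len(answer) - k + 1))
    s = Counter(source[i:i + k] for i in range(len(source) - k + 1))
    total = sum(a.values())
    return sum((a & s).values()) / total if total else 0.0


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if cx == cy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def extract_features(answer, source, ks=(1, 3, 5)):
    """Build the feature dictionary for one (answer, source) pair."""
    answer, source = answer.lower(), source.lower()
    features = {f"c_{k}": containment(answer, source, k) for k in ks}
    features["lcs"] = float(lcs_length(answer, source))
    return features


def check_submission(answer, source, predict_probability):
    """Feed the features to a model-like callable (hypothetical interface)."""
    return predict_probability(list(extract_features(answer, source).values()))


print(extract_features("Day", "today is monday"))
# {'c_1': 1.0, 'c_3': 1.0, 'c_5': 0.0, 'lcs': 3.0}
```

To compare two students' submissions, the second student's text would be passed as `source`, exactly as described above.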

## Appendix

### Appendix A: CS110 Dashboard

!["CS110 Dashboard"](images/my_dashboard_apr29.png "CS110 Dashboard")

### Appendix B: Link to repository

The link to the living repository of this assignment is: https://github.com/Inventrohyder/CS110/tree/master/Assignments/cs110_assignment_6

### Appendix C: HCs

* #designthinking: Building with constant testing and iteration was carried out at each stage to ensure that we build a program that forms the basis of an excellent user experience

* #regression: We build the machine learning models using regression and other techniques

* #correlation: To pick the right features to use, we use correlation as a metric to evaluate the best ones so that they are not redundant