# Plagiarism Detection
### Dataset
The dataset used here is a small sample taken from the [International Competition on Plagiarism Detection PAN 2010](https://www.uni-weimar.de/en/media/chairs/computer-science-department/webis/data/corpus-pan-pc-10/). It consists of two kinds of files: suspicious documents and source documents. The dataset also records whether each suspicious document is plagiarized from the source documents.

### Model overview
We build a classifier that labels each suspicious document as plagiarized or not. This is done in broadly two steps:
1. First we derive four similarity measures for each suspicious document. Three of them are based on the overlap of the **trigrams** of the text: the Jaccard similarity coefficient, the containment measure and the longest common subsequence (LCS). The fourth is based on the closeness of the document vectors in Latent Semantic Analysis (LSA).
2. Next we use these measures as features for the suspicious documents and train a logistic regression classifier on them. The LCS feature is currently commented out in the code, so the classifier below is trained on the remaining three measures.

### Rationale
Plagiarism comes in many forms. The first three similarity measures, derived from trigrams, target verbatim or near-verbatim copying from the source documents, whereas the LSA-based measure attempts to catch restructuring, revising and paraphrasing of the original text.

The model was designed after reviewing the literature and experimenting with various ideas, particularly for text preprocessing. Many ideas, such as the first three similarity measures derived from trigrams, are borrowed from the paper [Using Natural Language Processing for Automatic Detection of Plagiarism](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.458.9440&rep=rep1&type=pdf) by Miranda Chong et al. For text preprocessing, the detailed analysis in the paper [The Influence of Text Pre-processing on Plagiarism Detection](https://pdfs.semanticscholar.org/a47c/1a35e2858da1eb82077b572e538a7b0b7b2d.pdf) was taken into account. Figure 2 of that paper depicts the impact of the most commonly used text preprocessing methods. The following decisions were made for text preprocessing:
1. Stopwords are not removed since they seem to play a role in detecting overlap in the trigrams.
2. Sentence segmentation is avoided since plagiarism seems to span consecutive sentences, so the trigrams that link sentences are useful as well.
3. Digits, colons, plus signs, apostrophes and hyphens are removed, whereas other punctuation is kept.
4. The text is lowercased throughout.
5. Lemmatization is used, but only for deriving the last feature (the LSA-based measure).
6. POS-tagging might prove useful, but the current model does not use it.


### Tools used: re (regular expressions), numpy, pandas, nltk, sklearn

### Ideas/Future work
* Text preprocessing: POS-tagging, synonymy recognition, etc.
* Using bigrams, 4-grams and/or 5-grams along with (or instead of) trigrams
* Deriving more features for the classifier using more similarity measures or other approaches
* Tuning the parameters for the logistic regression classifier
* Modifying/optimizing the code for large-scale data

### Challenges
* Multi-source plagiarism: Each suspicious document is compared with all source documents, and only the highest score for each similarity measure is kept. Thus, multi-source plagiarism is not taken into account. One possible fix is to use the average of the top 3 scores instead.
* Language translation: The model does not cover plagiarism produced by translating the original text into another language.
* Paraphrasing is only partially addressed.
* Properly quoted text: The model has a serious flaw in that it might also flag a document that quotes the original source properly and gives a correct reference. One possible fix is to look for quotations at the very beginning, check that the source of each quotation is properly attributed, and then remove the quote along with the reference text before computing the similarity measures.
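
The "average of the top 3 scores" idea from the first bullet can be sketched as follows. This is only an illustration: `top_k_mean` is a hypothetical helper that does not appear in the notebook, and the scores are made-up numbers.

```python
def top_k_mean(scores, k=3):
    """Mean of the k largest scores (uses fewer if fewer are available)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0

# Made-up similarity scores of one suspicious document against five sources
scores_against_sources = [0.02, 0.31, 0.05, 0.28, 0.01]

single_best = max(scores_against_sources)         # what the notebook keeps today
top3_average = top_k_mean(scores_against_sources)  # proposed alternative

print(single_best, top3_average)
```

One trade-off to keep in mind: averaging also lowers the score of a document copied from a single source, so it may be worth keeping the maximum as a feature alongside the average.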



We start by importing the relevant modules.

In [1]:
import numpy as np 
import pandas as pd 
import re

import os
path = "../input/" # Update path
print("Files:")
print(os.listdir(path))

import nltk
from nltk import trigrams, word_tokenize
from nltk.stem import WordNetLemmatizer 

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression

Files:
['suspicious-document00029.txt', 'suspicious-document00031.txt', 'source-document01651.txt', 'source-document01919.txt', 'source-document02707.txt', 'source-document01718.txt', 'suspicious-document00007.txt', 'suspicious-document00025.txt', 'source-document01328.txt', 'source-document07522.txt', 'source-document01999.txt', 'suspicious-document00010.txt', 'suspicious-document00014.txt', 'source-document03861.txt', 'suspicious-document00003.txt', 'source-document04426.txt', 'suspicious-document00028.txt', 'suspicious-document00021.txt', 'source-document01595.txt', 'suspicious-document00013.txt', 'suspicious-document00004.txt', 'source-document01600.txt', 'suspicious-document00009.txt', 'source-document00893.txt', 'suspicious-document00033.txt', 'suspicious-document00019.txt', 'suspicious-document00011.txt', 'suspicious-document00027.txt', 'suspicious-document00015.txt', 'source-document05310.txt', 'source-document04279.txt', 'source-document03029.txt', 'suspicious-document00005.tx

Below are the functions to clean the text files and combine the files to give dataframes - one each for the source and the suspicious files:

In [2]:
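def clean_file(myfile):
    mf = myfile.read()
    mf = mf.lower()
    # Collapse each newline (and any whitespace that follows it) into one space
    mf = re.sub(r'\n\s*', ' ', mf)
    # Drop apostrophes, colons, plus signs, hyphens and digits
    mf = re.sub(r"[':+-]|\d+", '', mf)
    # Drop empty parentheses
    mf = re.sub(r'\(\)', '', mf)
    # Merge ". ." into a single full stop
    mf = re.sub(r'\.\s+\.', '.', mf)
    mf = mf.strip()
    return mf

def get_dataframe(files):
    data = []
    for f in files:
        # 'utf-8-sig' strips a leading byte-order mark if the file has one
        with open(os.path.join(path, f), mode='r', encoding='utf-8-sig') as myfile:
            data.append(clean_file(myfile))
    df = pd.DataFrame(data, columns=['Text'])
    return df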
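
To see what the cleaning step does, here is a small self-contained sketch that applies the same substitutions to an in-memory string. `clean_text` is a hypothetical stand-in for `clean_file` that takes a string instead of a file object, and the sample sentence is made up.

```python
import re

def clean_text(text):
    """Apply the same cleaning steps as clean_file, but to a plain string."""
    text = text.lower()
    text = re.sub(r'\n\s*', ' ', text)       # newline + following whitespace -> one space
    text = re.sub(r"[':+-]|\d+", '', text)   # drop ' : + - and digits
    text = re.sub(r'\(\)', '', text)         # drop empty parentheses
    text = re.sub(r'\.\s+\.', '.', text)     # ". ." -> "."
    return text.strip()

sample = "The Year 1850:\n   it wasn't well-known. ."
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that removing characters can leave double spaces behind (here between "year" and "it"); `word_tokenize` splits on whitespace runs, so this does not affect the trigrams.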

Next we use the above functions to get a pandas dataframe for the suspicious files. We store the files in a dataframe because it makes it easier to apply the same operations to each file later on.

In [3]:
suspicious_files = sorted([f for f in os.listdir(path) if f.startswith('suspicious-document')])
suspicious = get_dataframe(suspicious_files)
suspicious['File_index'] = [f[19:24] for f in suspicious_files]
suspicious.head()

| | Text | File_index |
|---|---|---|
| 0 | bible studies in the life of paul historical a... | 1 |
| 1 | my impatience to inhabit the hermitage not per... | 2 |
| 2 | morning on the beachthe three letters i... | 3 |
| 3 | this morning it rained so hard (though it was ... | 4 |
| 4 | deadham hard a romance by lucas malet (mary st... | 5 |


Similarly, below is the dataframe for the source files:

In [4]:
source_files = sorted([f for f in os.listdir(path) if f.startswith('source-document')])
source = get_dataframe(source_files)
source['File_index'] = [f[15:20] for f in source_files]
source.head()

| | Text | File_index |
|---|---|---|
| 0 | mrs. ernest f. wurtele. take a piece of frozen... | 893 |
| 1 | after minutely examining every page of the man... | 1328 |
| 2 | the miscellaneous writings and speeches of lor... | 1595 |
| 3 | sister teresa by george moore london t. fisher... | 1600 |
| 4 | i was still wrestling on the pavement with the... | 1651 |


Now, we get trigrams from the corpus of the files so that we can use them to detect plagiarism. 

In [5]:
def get_trigrams(df):
    df['Tokenized_text'] = df['Text'].apply(word_tokenize) 
    df['Trigrams'] = df['Tokenized_text'].apply(lambda x: set(trigrams(x)))
    return df
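
As a quick illustration of what ends up in the `Trigrams` column, the sketch below builds the same kind of set in plain Python. `word_trigrams` is a hypothetical helper intended to mirror `set(nltk.trigrams(tokens))`, and whitespace splitting stands in for `word_tokenize`.

```python
def word_trigrams(tokens):
    """Set of all windows of three consecutive tokens."""
    return set(zip(tokens, tokens[1:], tokens[2:]))

tokens = "the cat sat on the mat".split()
example_trigrams = word_trigrams(tokens)

for trigram in sorted(example_trigrams):
    print(trigram)
```

Because the trigrams are stored as a set, duplicates and the order in which they occur in the text are discarded; only the presence of each trigram matters for the overlap measures.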

Getting trigrams for the suspicious files:

In [6]:
suspicious = get_trigrams(suspicious)
suspicious.head()

| | Text | File_index | Tokenized_text | Trigrams |
|---|---|---|---|---|
| 0 | bible studies in the life of paul historical a... | 1 | [bible, studies, in, the, life, of, paul, hist... | {(still, recognizable, .), (stage, in, the), (... |
| 1 | my impatience to inhabit the hermitage not per... | 2 | [my, impatience, to, inhabit, the, hermitage, ... | {(findley, ., ''), (resolution, of, seeing), (... |
| 2 | morning on the beachthe three letters i... | 3 | [morning, on, the, beachthe, three, letters, i... | {(,, boys, ,), (thinking, i, lay), (,, with, e... |
| 3 | this morning it rained so hard (though it was ... | 4 | [this, morning, it, rained, so, hard, (, thoug... | {(dinner, we, fell), (love, to, my), (and, dow... |
| 4 | deadham hard a romance by lucas malet (mary st... | 5 | [deadham, hard, a, romance, by, lucas, malet, ... | {(religion, as, a), (neither, enjoyed, ,), (at... |


Getting trigrams for the source files:

In [7]:
source = get_trigrams(source)
source.head()

| | Text | File_index | Tokenized_text | Trigrams |
|---|---|---|---|---|
| 0 | mrs. ernest f. wurtele. take a piece of frozen... | 893 | [mrs., ernest, f., wurtele, ., take, a, piece,... | {(harveys, sauce, and), (., miss, fry), (., pu... |
| 1 | after minutely examining every page of the man... | 1328 | [after, minutely, examining, every, page, of, ... | {(them, exactly, in), (in, maintaining, its), ... |
| 2 | the miscellaneous writings and speeches of lor... | 1595 | [the, miscellaneous, writings, and, speeches, ... | {(antiquity, ,, liberty), (find, nothing, anal... |
| 3 | sister teresa by george moore london t. fisher... | 1600 | [sister, teresa, by, george, moore, london, t.... | {(get, about, ,), (to, read, some), (about, it... |
| 4 | i was still wrestling on the pavement with the... | 1651 | [i, was, still, wrestling, on, the, pavement, ... | {(seems, to, recall), (i, actually, heard), (i... |


Next we compare the suspicious files with the source files using three similarity measures:
1. Jaccard similarity coefficient
2. Containment measure
3. Longest common subsequence (LCS)

The formulae and explanation for these measures can be found in this [paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.458.9440&rep=rep1&type=pdf).
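
For quick reference, the two set-based measures can be written as below. The notation is ours rather than the paper's: $T_{src}$ is the set of trigrams of a source document and $T_{sus}$ the set of trigrams of the suspicious document, and the containment is normalized by the suspicious document, which matches how the functions are called in this notebook.

$$
J(T_{src}, T_{sus}) = \frac{|T_{src} \cap T_{sus}|}{|T_{src} \cup T_{sus}|},
\qquad
C(T_{src}, T_{sus}) = \frac{|T_{src} \cap T_{sus}|}{|T_{sus}|}
$$

Both values lie in $[0, 1]$. The Jaccard coefficient is symmetric, whereas the containment measure is not: it reaches 1 whenever every trigram of the suspicious document also occurs in the source, regardless of how long the source is.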

We write the code for the three measures.

In [8]:
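def Jaccard_similarity_coefficient(A, B):
    # Size of the intersection divided by the size of the union of two trigram sets
    J = len(A.intersection(B))/len(A.union(B))
    return J

def containment_measure(A, B):
    # Share of B's trigrams that also occur in A
    C = len(A.intersection(B))/len(B)
    return C

def LCS(A, B):
    # Dynamic programme that tracks the longest run of consecutive matching
    # elements (a contiguous match rather than a general subsequence).
    # Caution: if A and B are sets, list() yields their elements in arbitrary
    # order, so the result is only meaningful for ordered sequences such as
    # token lists.
    m, n = len(A), len(B)
    counter = [[0]*(n+1) for x in range(m+1)]
    A, B = list(A), list(B)
    longest = 0
    for i in range(m):
        for j in range(n):
            if A[i] == B[j]:
                count = counter[i][j] + 1
                counter[i+1][j+1] = count
                if count > longest:
                    longest = count
    return longest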
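
A tiny worked example may help to see how the two set-based measures behave. The trigram sets below are made up, and `jaccard` and `containment` are shortened stand-ins for the functions above.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def containment(a, b):
    # Normalised by the second argument, as in containment_measure above
    return len(a & b) / len(b)

source_trigrams = {
    ("the", "cat", "sat"), ("cat", "sat", "on"),
    ("sat", "on", "the"), ("on", "the", "mat"),
}
suspicious_trigrams = {
    ("cat", "sat", "on"), ("sat", "on", "the"), ("on", "the", "rug"),
}

# 2 shared trigrams, 5 distinct trigrams overall, 3 trigrams in the suspicious set
j = jaccard(source_trigrams, suspicious_trigrams)
c = containment(source_trigrams, suspicious_trigrams)
print(j, c)
```

Here the Jaccard coefficient is 2/5 while the containment is 2/3: the containment is not penalized by source trigrams that the suspicious document never uses, which matters when a short passage is lifted from a long source.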

We write the functions to apply the above three measures to each suspicious file in the dataframe.  For each suspicious file, we compare it with all source files and keep the highest score for the respective measure. 


In [9]:
def check_plagiarism_Jaccard(doc_trigrams):
    Jaccard_similarity_scores = source.Trigrams.apply(lambda s: Jaccard_similarity_coefficient(s, doc_trigrams))
    most_similar = Jaccard_similarity_scores.idxmax()
    return Jaccard_similarity_scores[most_similar]#, source.loc[most_similar, 'File_index']

def check_plagiarism_containment(doc_trigrams):
    containment_measure_scores = source.Trigrams.apply(lambda s: containment_measure(s, doc_trigrams))
    most_similar = containment_measure_scores.idxmax()
    return containment_measure_scores[most_similar]#, source.loc[most_similar, 'File_index']

def check_plagiarism_LCS(doc_trigrams):
    LCS_scores = source.Trigrams.apply(lambda s: LCS(s, doc_trigrams))
    most_similar = LCS_scores.idxmax()
    return LCS_scores[most_similar]#, source.loc[most_similar, 'File_index']

We compute the Jaccard and containment scores for every suspicious file. The LCS feature is left commented out and is not used in the rest of the notebook.

In [10]:
suspicious['Jaccard_similarity_score'] = suspicious.Trigrams.apply(check_plagiarism_Jaccard)
suspicious['Containment_measure_score'] = suspicious.Trigrams.apply(check_plagiarism_containment)
# suspicious['Longest_common_sequence'] = suspicious.Trigrams.apply(check_plagiarism_LCS)
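
If the LCS feature is revisited, one option is to compute it on the ordered token lists (the `Tokenized_text` column) instead of the unordered trigram sets. The following is a minimal sketch under that assumption, not the notebook's method: `lcs_length` and `lcs_score` are hypothetical helpers, and normalizing by the length of the suspicious document is our choice.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_score(source_tokens, suspicious_tokens):
    """LCS length divided by the suspicious document's length (assumed normalisation)."""
    if not suspicious_tokens:
        return 0.0
    return lcs_length(source_tokens, suspicious_tokens) / len(suspicious_tokens)

src = "the quick brown fox jumps over the lazy dog".split()
sus = "a quick fox jumps over a dog".split()
print(lcs_length(src, sus), lcs_score(src, sus))
```

The dynamic programme takes time proportional to the product of the two document lengths, so running it for every suspicious/source pair on full-length books is expensive; restricting it to candidate pairs that already score highly on the trigram measures is one way to keep the cost down.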

The trigram-based measures above target plagiarism where the words are more or less copied from the source file. To catch rewording as well, we now turn to Latent Semantic Analysis.

### Latent Semantic Analysis:
Next we use scikit-learn's ``TfidfVectorizer`` along with lemmatization to get a document-term matrix in which each row corresponds to a file (source and suspicious both) and each column to a term.

In [11]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

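vectorizer = TfidfVectorizer(
    analyzer='word',
    tokenizer=LemmaTokenizer(),  # custom tokenizer; token_pattern is ignored when this is set
    ngram_range=(1, 4),
    max_features=1000,
    )

# Suspicious files come first, so the first len(suspicious) rows of DTM are the suspicious documents.
# pd.concat replaces Series.append, which was removed in pandas 2.0.
DTM = vectorizer.fit_transform(pd.concat([suspicious.Text, source.Text]))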

Now we use scikit-learn's ``TruncatedSVD`` to apply Singular Value Decomposition to the document-term matrix *DTM* obtained above. This gives a lower-dimensional matrix *DTM_LSA* with 40 columns, whose rows we then normalize to unit length.

In [12]:
LSA = TruncatedSVD(40, algorithm = 'arpack')
DTM_LSA = LSA.fit_transform(DTM)
DTM_LSA = Normalizer(copy=False).fit_transform(DTM_LSA)

Since each row of ``DTM_LSA`` has been normalized to unit length, the dot product of the vectors corresponding to two files gives the cosine of the angle between them, which is the similarity measure used here. Hence, we get the similarity matrix by multiplying *DTM_LSA* by its transpose.

In [13]:
similarity_matrix = DTM_LSA @ DTM_LSA.T
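
The claim that unit-length rows turn a matrix product into cosine similarities can be checked on a toy matrix. The three 2-dimensional vectors below are made up and stand in for rows of *DTM_LSA*.

```python
import numpy as np

# Three made-up document vectors (one per row)
X = np.array([[3.0, 4.0],
              [4.0, 3.0],
              [-3.0, -4.0]])

# Scale every row to unit Euclidean length, as Normalizer does by default
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Entry (i, j) is the cosine of the angle between documents i and j
S = X_unit @ X_unit.T
print(np.round(S, 2))
```

The diagonal is 1 (every document is identical to itself), documents 0 and 1 have a cosine of 24/25 = 0.96, and documents 0 and 2 point in opposite directions, giving -1.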

Next we find the highest similarity score for each suspicious document, considering only the columns that correspond to source documents. The first L rows and columns of the matrix belong to the suspicious documents, so we set the diagonal and the top-left L×L block (suspicious vs. suspicious) to zero and then take the maximum of each of the first L rows.

In [14]:
np.fill_diagonal(similarity_matrix, 0)
L = len(suspicious_files)
similarity_matrix[:L, :L] = np.zeros((L, L))
suspicious['LSA_similarity'] = np.max(similarity_matrix, 1)[:L]
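
The block structure is easier to see on a toy matrix. The sketch below uses a made-up 5×5 similarity matrix with L = 2 suspicious documents followed by 3 source documents, and compares the zeroing approach with an alternative that slices out the suspicious-vs-source block directly.

```python
import numpy as np

L = 2  # first L rows/columns: suspicious documents; the rest: source documents
S = np.array([
    [1.0, 0.9, 0.2, 0.7, 0.1],
    [0.9, 1.0, 0.3, 0.4, 0.6],
    [0.2, 0.3, 1.0, 0.5, 0.5],
    [0.7, 0.4, 0.5, 1.0, 0.8],
    [0.1, 0.6, 0.5, 0.8, 1.0],
])

# Approach used in the notebook: zero out what should not count, then take row maxima
masked = S.copy()
np.fill_diagonal(masked, 0)
masked[:L, :L] = 0
best_by_masking = masked.max(axis=1)[:L]

# Alternative: look only at the suspicious-vs-source block
block = S[:L, L:]
best_by_slicing = block.max(axis=1)
best_source_position = block.argmax(axis=1)  # position within the source block

print(best_by_masking, best_by_slicing, best_source_position)
```

Both approaches agree here. They can differ when every cosine in a row is negative, because zeroing puts a floor of 0 under the maximum while slicing returns the true (negative) maximum. Slicing also makes it easy to recover which source document matched best.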

The last step is to use all the similarity measures obtained above as the features for the suspicious documents and train a logistic regression model.  For that we split the suspicious documents into train and test sets and keep only the columns corresponding to the similarity measures.

In [29]:
# suspicious.set_index('File_index', inplace=True)
suspicious.reset_index(inplace=True)
suspicious = suspicious[['LSA_similarity', 'Jaccard_similarity_score', 'Containment_measure_score']]#, 'Longest_common_sequence']]
y = pd.Series(np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1]))
X_train, X_test, y_train, y_test = train_test_split(suspicious, y, test_size=0.20)
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.5714285714285714
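
With 31 labelled documents and `test_size=0.20`, the test set holds only 7 documents, so the score above corresponds to 4 correct predictions out of 7 and will change from run to run because no `random_state` is fixed. A stratified k-fold cross-validation gives a steadier estimate. The sketch below is illustrative only: the labels are the ones used above, but the feature matrix is randomly generated as a placeholder for the real similarity features, so its scores say nothing about the actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Labels copied from the cell above
y = np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
              0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1])

# Placeholder features (made up): three noisy scores that tend to be
# higher for plagiarised documents. Replace with the real feature columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=y[:, None] * 0.5, scale=0.3, size=(len(y), 3))

# Each fold keeps roughly the same class balance as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

print(scores.mean(), scores.std())
```

Reporting the mean and spread over folds makes it clearer how much of a given score is due to the particular split rather than to the features.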


### References:
1. [Using Natural Language Processing for Automatic Detection of Plagiarism](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.458.9440&rep=rep1&type=pdf)
2. [The Role of Natural Language Processing Techniques in Plagiarism Detection](https://prezi.com/yhepkzz-qn76/the-role-of-natural-language-processing-techniques-in-plagiarism-detection/)
3. [The Influence of Text Pre-processing on Plagiarism Detection](https://pdfs.semanticscholar.org/a47c/1a35e2858da1eb82077b572e538a7b0b7b2d.pdf)
4. [Dataset](https://www.uni-weimar.de/en/media/chairs/computer-science-department/webis/data/corpus-pan-pc-10/)