In [1]:
import os
import glob
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91801\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91801\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

os - This is a standard Python library for interacting with the operating system. This code uses it to build file paths (os.path.join) and to extract file names from paths (os.path.basename).

glob - This is another standard library in Python that provides a way to retrieve files that match a certain pattern. It is used in this code to retrieve all the files in a directory that have a certain file extension.

nltk - This is the Natural Language Toolkit, a popular library in Python for working with human language data. It provides a set of tools for tasks such as tokenization, stemming, and stopword removal.

stopwords - This is a corpus of common stopwords that can be used to filter out words that are not useful for analysis. The corpus is downloaded with nltk.download('stopwords') and accessed through the stopwords reader in nltk.corpus.

word_tokenize - This is a function from nltk that splits text into word and punctuation tokens. It relies on the punkt data downloaded above and is used in this code to tokenize the text of the input files.

SnowballStemmer - This is a class from nltk that implements the Snowball stemming algorithms, which reduce words to a base or root form. It is used in this code to normalize words and reduce the dimensionality of the data.

TfidfVectorizer - This is a class from the sklearn library that creates a vector representation of text data based on the term frequency-inverse document frequency (tf-idf) weighting scheme. It is used in this code to create a vector representation of each input file.

cosine_similarity - This is a function from the sklearn library that calculates the cosine similarity between two vectors. It is used in this code to compare the similarity between each pair of input files based on their vector representations.

Overall, these libraries are used to preprocess the input files and convert them into a numerical representation that can be compared to detect plagiarism.

In [2]:
def preprocess(text):
    """
    Preprocesses a given text by tokenizing it into words,
    removing stopwords and punctuation, and stemming the words.
    """
    # Tokenize the text into words
    tokens = word_tokenize(text)
    
    # Remove Arabic stopwords and every token that is not purely alphabetic
    # (punctuation, numbers, mixed tokens)
    stop_words = set(stopwords.words('arabic'))
    words = [word for word in tokens if word.isalpha() and word not in stop_words]
    
    # Stem the words
    stemmer = SnowballStemmer('arabic')
    words = [stemmer.stem(word) for word in words]
    
    # Join the words back into a string
    text = " ".join(words)
    
    return text

The preprocess function prepares a given text for plagiarism detection.

The function starts by tokenizing the text using the word_tokenize function from the nltk.tokenize module. The resulting tokens list contains the words and punctuation marks of the text as separate tokens.

Next, the function removes stopwords and punctuation from the tokens list. Stopwords are common words in a language that are generally not useful for text analysis (such as "the", "a", and "an" in English; this code uses NLTK's Arabic stopword list). The word.isalpha() check keeps only purely alphabetic tokens, so it discards punctuation marks and also numbers and tokens that mix letters with digits or symbols.

After filtering, the function stems the words using the Snowball stemmer from the nltk.stem module, configured here for Arabic. Stemming reduces words to their base or root form, so different inflections of the same word are counted as one term. This reduces the number of unique words in the text and lets texts that use different forms of a word still match.

Finally, the function joins the stemmed words back into a string and returns it. This pre-processed text will be used later in the plagiarism detection algorithm to calculate the similarity between texts.
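
The filtering step can be tried on its own without NLTK. The following is a minimal sketch, not the notebook's code: the helper name filter_tokens and the small English stopword set are assumptions made for illustration, while the isalpha() test is the same one preprocess uses.

```python
# Hypothetical stopword set for illustration only; the notebook uses
# stopwords.words('arabic') from NLTK instead.
STOP_WORDS = {"the", "a", "an", "of"}

def filter_tokens(tokens, stop_words):
    """Keep purely alphabetic tokens that are not stopwords."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]

tokens = ["the", "price", "of", "oil", "rose", "5", "%", "in", "2023", ".", "COVID-19"]
print(filter_tokens(tokens, STOP_WORDS))
# ['price', 'oil', 'rose', 'in']

# Arabic letters pass isalpha(), but a word carrying a diacritic does not,
# because combining marks are not in a Unicode letter category.
plain = "\u0643\u062a\u0628"            # three Arabic letters
vowelled = "\u0643\u064e\u062a\u0628"   # same word with a fatha (U+064E)
print(plain.isalpha(), vowelled.isalpha())
# True False
```

The last two lines suggest that Arabic text written with diacritics would lose words at this step, so such input may need the diacritics stripped before filtering.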

In [3]:
def train(folder):
    """
    Trains the plagiarism checker by reading the contents of all files
    in the given folder, preprocessing them, and vectorizing them using
    the TF-IDF method.
    """
    # Read the contents of all files in the folder
    files = glob.glob(os.path.join(folder, "*.txt"))
    texts = []
    for file in files:
        with open(file, encoding='utf-8') as f:
            text = f.read()
            texts.append(text)
    
    # Preprocess the texts
    preprocessed_texts = [preprocess(text) for text in texts]
    
    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Vectorize the preprocessed texts
    vectors = vectorizer.fit_transform(preprocessed_texts)
    
    return vectors, vectorizer, files

The train function is responsible for training the plagiarism checker by reading the contents of all files in the given folder, preprocessing them, and vectorizing them using the TF-IDF (Term Frequency-Inverse Document Frequency) method.

First, it uses the glob module to find all files in the folder with a .txt extension. Then, it reads the contents of each file and stores them in a list called texts.
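
The file-reading step can be sketched in isolation as follows. The helper name read_txt_files and the file names in the demo are hypothetical. The sketch also sorts the paths, which train does not do: glob.glob returns matches in arbitrary order, so sorting is an optional choice that makes the file order reproducible.

```python
import glob
import os
import tempfile

def read_txt_files(folder):
    """Return (paths, texts) for every .txt file directly inside folder."""
    # sorted() makes the order deterministic; glob.glob itself returns
    # matches in arbitrary order.
    paths = sorted(glob.glob(os.path.join(folder, "*.txt")))
    texts = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            texts.append(f.read())
    return paths, texts

# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as folder:
    for name, body in [("b.txt", "second"), ("a.txt", "first"), ("notes.md", "skipped")]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(body)
    paths, texts = read_txt_files(folder)
    names = [os.path.basename(p) for p in paths]
    print(names, texts)
    # ['a.txt', 'b.txt'] ['first', 'second']
```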

Next, it preprocesses the texts by calling the preprocess function on each text in texts. This involves tokenizing each text into words, removing stop words and punctuation, and stemming the remaining words.

After preprocessing, it creates a TfidfVectorizer object, which converts the preprocessed texts into vectors using the TF-IDF method. Calling the vectorizer's fit_transform method on the preprocessed texts learns the vocabulary and returns a sparse matrix with one row per text and one column per vocabulary term, where each entry is the TF-IDF weight of that term in that text.

Finally, the function returns the TF-IDF vectors, the vectorizer object, and a list of the file paths for each text in the folder.
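
To make the TF-IDF weights less of a black box, here is a minimal pure-Python sketch of the computation. It is an illustration, not the library's implementation: the function name tfidf_rows and the toy documents are made up, tokens are split on whitespace rather than with the vectorizer's own tokenizer, and the formula assumes scikit-learn's documented defaults (raw term counts, smoothed IDF, L2-normalized rows).

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length (guard against an empty document).
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

docs = ["cat sat mat", "cat sat sofa", "dog ran park"]
vocab, rows = tfidf_rows(docs)
print(vocab)
# ['cat', 'dog', 'mat', 'park', 'ran', 'sat', 'sofa']
print(len(rows), len(rows[0]))
# 3 7
```

In the first row, "mat" receives a higher weight than "cat" because "mat" occurs in only one document while "cat" occurs in two. Rare terms therefore contribute more to the similarity score than common ones.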






In [4]:
def compare(file, vectors, vectorizer, files):
    """
    Compares the given file against every file in the trained dataset
    using cosine similarity and returns the closest match.
    """
    # Read the contents of the file
    with open(file, encoding='utf-8') as f:
        text = f.read()
    
    # Preprocess the text
    preprocessed_text = preprocess(text)
    
    # Vectorize the preprocessed text
    query_vector = vectorizer.transform([preprocessed_text])
    
    # Calculate the cosine similarity between the query vector and the vectors of all texts in the folder
    similarities = cosine_similarity(query_vector, vectors)[0]
    
    # Find the index of the text with the highest similarity
    index = similarities.argmax()
    
    # Calculate the percentage similarity rounded to two decimal places
    percentage_similarity = round(similarities[index] * 100, 2)
    
    # Get the filename of the text with the highest similarity
    filename = os.path.basename(files[index])
    
    return filename, percentage_similarity

The compare function measures the similarity between a given file and every file in the trained dataset using cosine similarity. Here's how it works:

1. It reads the contents of the file.
2. It preprocesses the text with the preprocess() function, which tokenizes the text into words, removes stopwords and punctuation, and stems the words.
3. It vectorizes the preprocessed text using the vectorizer object fitted during training, so the query shares the vocabulary and weights of the training texts.
4. It calculates the cosine similarity between the query vector and the vectors of all texts in the folder using the cosine_similarity() function from scikit-learn.
5. It finds the index of the text with the highest similarity.
6. It converts that similarity to a percentage rounded to two decimal places.
7. It gets the filename of the text with the highest similarity.
8. It returns the filename and percentage similarity.

Because the query is compared against every file in the folder, a query file that is itself stored in the training folder will normally be reported as its own best match, with a similarity of about 100%.
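
The similarity and best-match steps can be sketched in plain Python. This is an illustration of the arithmetic rather than the scikit-learn code path: the helper names cosine and best_match, the toy vectors, and the file names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(query, vectors, names):
    """Return (name, percentage) of the vector most similar to query."""
    sims = [cosine(query, v) for v in vectors]
    # Index of the largest similarity, the counterpart of similarities.argmax().
    index = max(range(len(sims)), key=sims.__getitem__)
    return names[index], round(sims[index] * 100, 2)

# Toy three-term vectors standing in for TF-IDF rows.
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
names = ["doc_a.txt", "doc_b.txt", "doc_c.txt"]
query = [2.0, 1.0, 0.0]

name, pct = best_match(query, vectors, names)
print(name, pct)
# doc_c.txt 94.87
```

Cosine similarity depends only on the angle between the vectors, so a long document and a short one that use the same terms in the same proportions still score close to 100%.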

In [5]:
# Define the folder containing the training data and the file to compare
folder = r"C:\Users\91801\Desktop\Trainingdata"
file = r"C:\Users\91801\Documents\sample text.txt"

# Train the plagiarism checker on the training data
vectors, vectorizer, files = train(folder)

# Compare the given file with the trained dataset and print the results
filename, percentage_similarity = compare(file, vectors, vectorizer, files)
print(f"Most similar file: {filename}")
print(f"Percentage similarity: {percentage_similarity}%")

Most similar file: text7.txt
Percentage similarity: 61.56%


This cell sets the folder containing the training data and the file to check, then calls train() to read, preprocess, and vectorize every .txt file in that folder. It passes the result to compare(), which preprocesses the given file, vectorizes it with the same fitted vectorizer, computes its cosine similarity to every training text, and returns the filename and percentage similarity of the closest one.

Finally, the filename and percentage similarity are printed for the user to see. In the run shown above, the closest match is text7.txt at 61.56%. The score reflects how much weighted vocabulary the two texts share after preprocessing, so it helps the user judge whether the given file may be plagiarized and which training file it most resembles.