<a href="https://colab.research.google.com/github/Neelima-Barigela/AI-Based-Plagiarism-Detection-System-Using-NLP/blob/main/AI_Based_Plagiarism_Detection_System_Using_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI-Based Plagiarism Detection System Using NLP

This project focuses on detecting plagiarism in textual documents using Natural Language Processing (NLP) techniques. The system takes multiple text files as input, cleans and processes the text, and converts it into numerical form using TF-IDF vectorization. It then measures the similarity between documents using Cosine Similarity. Based on a predefined threshold, the system determines whether plagiarism exists between any pair of documents. This approach helps in identifying copied or highly similar content in an efficient and automated way.

# Step 1: Import Required Libraries

Import all necessary Python libraries for text processing, vectorization, and similarity computation.

In [17]:
import os
import re
import nltk
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Step 2: Load Text Documents

Read multiple text files from the dataset to be used for plagiarism comparison.

In [1]:
import zipfile

zip_path = "/content/Plagiarism-checker-Python-master.zip"
extract_path = "/content/plagiarism_data"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)


In [7]:
import os

folder = "/content/plagiarism_data/Plagiarism-checker-Python-master"

documents = []
file_names = []

for file in os.listdir(folder):
    if file.endswith(".txt"):
        with open(os.path.join(folder, file), 'r', encoding='utf-8') as f:
            documents.append(f.read())
            file_names.append(file)
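
The loading loop above can be wrapped in a small reusable helper. The following is a minimal sketch rather than the notebook's own code: `load_documents` and its `exclude` parameter are hypothetical names introduced here, and the sorted ordering is an added assumption so that the document order is the same on every file system (`os.listdir` returns entries in arbitrary order).

```python
from pathlib import Path


def load_documents(folder, exclude=()):
    """Read every .txt file in `folder`, skipping names listed in `exclude`.

    Returns (file_names, documents) as two parallel lists.
    """
    file_names, documents = [], []
    # sorted() gives a stable order; directory listing order is otherwise arbitrary.
    for path in sorted(Path(folder).glob("*.txt")):
        if path.name in exclude:
            continue
        # errors="ignore" drops bytes that are not valid UTF-8 instead of raising.
        documents.append(path.read_text(encoding="utf-8", errors="ignore"))
        file_names.append(path.name)
    return file_names, documents
```

A call such as `load_documents(folder, exclude={"requirements.txt"})` would return only the essay files. Note that sorting changes the order in which documents appear in any later similarity matrix.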

# Step 3: Text Preprocessing

Clean the text by converting it to lowercase, removing every character other than letters and spaces, and dropping English stop words.

In [5]:
import re
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

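# Build the stop-word set once; calling stopwords.words() for every word is slow.
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    # Delete everything except lowercase letters and spaces. Newlines and
    # punctuation are removed without a replacement, so adjacent words can fuse.
    text = re.sub(r'[^a-z ]', '', text)
    words = text.split()
    words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)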

cleaned_docs = [clean_text(doc) for doc in documents]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:
print("Files:", file_names)
print("Number of documents:", len(documents))


Files: ['fatma.txt', 'juma.txt', 'requirements.txt', 'john.txt']
Number of documents: 4


In [12]:
documents = []
file_names = []

# requirements.txt is not one of the documents to compare, so skip it.
for file in os.listdir(folder):
    if file.endswith(".txt") and file != "requirements.txt":
        with open(os.path.join(folder, file), 'r', encoding='utf-8', errors='ignore') as f:
            documents.append(f.read())
            file_names.append(file)

print(file_names)
print(len(documents))


['fatma.txt', 'juma.txt', 'john.txt']
3


In [13]:
cleaned_docs = [clean_text(doc) for doc in documents]

for name, doc in zip(file_names, cleaned_docs):
    print(name, "->", doc[:150])


fatma.txt -> life best trying tofind works taking time intrying pursue skills
juma.txt -> life finding money use things makes happycoz life kinda short
john.txt -> life finding money spending luxury stuffscoz life kinda short trust
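
The previews above contain fused tokens such as `tofind`, `intrying`, `happycoz`, and `stuffscoz`. The likely cause is that `re.sub(r'[^a-z ]', '', text)` deletes every character other than lowercase letters and spaces, including newlines and punctuation, so the words on either side are joined. One possible variant, shown here only as a sketch, replaces each run of non-letters with a single space instead. `clean_text_spaced` is a hypothetical name, and the tiny `STOP_WORDS` set is a stand-in so the example runs without NLTK data; in the notebook it would be `set(stopwords.words('english'))`. Switching to this variant would change the vocabulary and therefore the similarity scores reported in the later steps.

```python
import re

# Stand-in stop-word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "to", "in", "a", "of", "and"}


def clean_text_spaced(text, stop_words=STOP_WORDS):
    """Lowercase, turn every run of non-letters into one space, drop stop words."""
    text = text.lower()
    # Replacing (not deleting) non-letters keeps neighbouring words separate.
    text = re.sub(r"[^a-z]+", " ", text)
    return " ".join(w for w in text.split() if w not in stop_words)


print(clean_text_spaced("Trying to\nfind what works,in time."))
# -> trying find what works time
```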


# Step 4: Text Vectorization (TF-IDF)

Convert cleaned text documents into numerical vectors using TF-IDF technique.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(cleaned_docs)

print("TF-IDF shape:", tfidf_matrix.shape)


TF-IDF shape: (3, 22)
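
To make the weighting concrete, the sketch below computes TF-IDF by hand in plain Python. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization). `tfidf_vectors` is a hypothetical helper, and splitting on whitespace is a simplification of the vectorizer's own tokenizer, so the numbers are illustrative rather than a reproduction of `tfidf_matrix`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit length; guard against an empty document.
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


vocab, rows = tfidf_vectors(["life money short", "life luxury short", "skills time"])
print(vocab)
```

Within a single document, a term that occurs in fewer documents (here `money`) receives a larger weight than one shared across documents (here `life`), which is what lets rare shared phrases dominate the similarity score.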


# Step 5: Cosine Similarity Calculation

Measure the similarity between each pair of documents using cosine similarity.

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tfidf_matrix)

print("Cosine Similarity Matrix:")
print(similarity_matrix)


Cosine Similarity Matrix:
[[1.         0.0821799  0.0821799 ]
 [0.0821799  1.         0.48111972]
 [0.0821799  0.48111972 1.        ]]
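
Cosine similarity is the dot product of two vectors divided by the product of their lengths, cos(u, v) = (u · v) / (‖u‖ ‖v‖). The sketch below computes it in plain Python for illustration; `cosine` and `pairwise_cosine` are hypothetical names, and returning 0.0 for an all-zero vector is a choice made here to avoid dividing by zero.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no direction to compare
    return dot / (norm_u * norm_v)


def pairwise_cosine(rows):
    """Square matrix whose entry [i][j] is cosine(rows[i], rows[j])."""
    return [[cosine(u, v) for v in rows] for u in rows]


print(pairwise_cosine([[1, 0], [1, 1], [0, 1]]))
```

This explains the shape of the matrix printed above: every document is identical to itself, so the diagonal is 1, and the measure does not depend on argument order, so the matrix is symmetric.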


# Step 6: Plagiarism Detection

Compare similarity scores with a threshold to identify plagiarism.

In [16]:
threshold = 0.8

print("Plagiarism Detection Results:\n")

for i in range(len(file_names)):
    for j in range(i+1, len(file_names)):
        score = similarity_matrix[i][j]

        if score >= threshold:
            print(f"⚠️ Plagiarism detected between {file_names[i]} and {file_names[j]} "
                  f"(Similarity = {score:.2f})")
        else:
            print(f"✅ No plagiarism between {file_names[i]} and {file_names[j]} "
                  f"(Similarity = {score:.2f})")


Plagiarism Detection Results:

✅ No plagiarism between fatma.txt and juma.txt (Similarity = 0.08)
✅ No plagiarism between fatma.txt and john.txt (Similarity = 0.08)
✅ No plagiarism between juma.txt and john.txt (Similarity = 0.48)
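
Because the verdict depends entirely on the threshold, it can help to rank the pairs by score instead of only printing a yes or no. The following sketch does that; `rank_pairs` is a hypothetical helper, and the matrix values are copied from the Step 5 output so the example runs on its own.

```python
from itertools import combinations


def rank_pairs(file_names, similarity_matrix, threshold=0.8):
    """Return (name_a, name_b, score, flagged) tuples, highest score first."""
    pairs = []
    # combinations() visits each unordered pair once, like the nested loops above.
    for i, j in combinations(range(len(file_names)), 2):
        score = float(similarity_matrix[i][j])
        pairs.append((file_names[i], file_names[j], score, score >= threshold))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Values copied from the similarity matrix printed in Step 5.
names = ["fatma.txt", "juma.txt", "john.txt"]
matrix = [
    [1.0, 0.0821799, 0.0821799],
    [0.0821799, 1.0, 0.48111972],
    [0.0821799, 0.48111972, 1.0],
]
for a, b, score, flagged in rank_pairs(names, matrix, threshold=0.8):
    print(f"{a} vs {b}: {score:.2f} {'FLAGGED' if flagged else 'ok'}")
```

With the notebook's threshold of 0.8 nothing is flagged, but the ranking still puts juma.txt and john.txt first; a threshold of 0.4, for example, would flag that pair.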


# 🏁 Conclusion

The system compared the three documents pairwise and computed a TF-IDF cosine similarity score for each pair. Every score fell below the chosen threshold of 0.8, so no pair was flagged as plagiarism. The closest pair, juma.txt and john.txt, scored 0.48, which reflects the phrases the two documents share, while fatma.txt scored 0.08 against each of the others. The verdict depends on the threshold: a lower cutoff would flag the juma.txt and john.txt pair. On this small dataset, TF-IDF with cosine similarity separates the more similar pair from the less similar ones, which shows the approach can rank documents by textual overlap.