# Paraphrased Plagiarism Detection using Sentence Embeddings

Goal: This notebook implements a system to detect potential paraphrased plagiarism in a student submission by comparing it against known source documents.

Approach:
1. Load Data: Read the submission text and source document texts.
2. Preprocess: Segment texts into sentences.
3. Embed Sentences: Use a pre-trained Sentence-BERT model (all-MiniLM-L6-v2; more powerful BERT-based models may be added later as selectable options) to convert sentences into numerical vectors (embeddings) that capture their meaning.
4. Calculate Similarity:
  * Cosine Similarity: Measure the semantic similarity between submission sentence embeddings and source sentence embeddings. High cosine similarity indicates similar meaning.
  * Jaccard Similarity: Measure the lexical (word) overlap between sentence pairs. Low Jaccard similarity indicates different wording.
5. Filter: Identify pairs with high semantic similarity (Cosine > threshold) but low lexical similarity (Jaccard < threshold) as potential candidates for paraphrased plagiarism.
6. Evaluate: (Manual Step) Compare the flagged candidates against a known ground truth to assess performance.
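
The filtering rule in steps 4 and 5 can be sketched as follows. This is a minimal illustration rather than the notebook's actual implementation: the helper names (`cosine_similarity`, `jaccard_similarity`, `is_paraphrase_candidate`), the threshold values 0.7 and 0.4, and the three-dimensional toy vectors standing in for real Sentence-BERT embeddings are all assumptions made for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def jaccard_similarity(sentence_a, sentence_b):
    """Word-set overlap |A ∩ B| / |A ∪ B| on lowercased, whitespace-split tokens."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def is_paraphrase_candidate(cos_sim, jac_sim, cos_threshold=0.7, jac_threshold=0.4):
    """Flag pairs that mean the same thing but use different words.

    The default thresholds are hypothetical and would need tuning on real data.
    """
    return bool(cos_sim > cos_threshold and jac_sim < jac_threshold)

# Toy vectors standing in for sentence embeddings (assumed values, not model output)
submission_embedding = [0.9, 0.1, 0.3]
source_embedding = [0.8, 0.2, 0.35]

submission_sentence = "Plants turn sunlight into chemical energy"
source_sentence = "Photosynthetic organisms convert light into fuel"

cos = cosine_similarity(submission_embedding, source_embedding)
jac = jaccard_similarity(submission_sentence, source_sentence)
flagged = is_paraphrase_candidate(cos, jac)

print(f"cosine={cos:.3f}, jaccard={jac:.3f}, candidate={flagged}")
```

The two sentences share only the word "into", so their Jaccard score is low, while the toy vectors point in nearly the same direction, so the pair is flagged. Identical sentences would score a Jaccard of 1.0 and fall outside the filter, which is the intended behaviour: verbatim copying is a separate case from paraphrasing.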

## 1. Setup: Install Relevant Libraries and Mount Drive

First, we need to install the necessary libraries (`sentence-transformers` for the embedding model, `nltk` for text processing) and mount Google Drive to access our synthetic data files.

In [12]:
#install required libraries
!pip install sentence-transformers nltk -q

#import libraries
import nltk
from sentence_transformers import SentenceTransformer, util
import numpy as np
import os
import re
from google.colab import drive


# Download necessary NLTK data for sentence tokenisation
nltk.download('punkt', quiet=True)  # quiet=True avoids verbose output
nltk.download('punkt_tab', quiet=True)

# NLTK tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize

#Mount Google Drive
try:
  drive.mount('/content/drive')
except Exception as e:
  print(f"Error mounting drive: {e}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Load Data

Define the paths to the source and submission files on Google Drive (synthetic dataset), then load the text content from these files.

In [10]:
# --- Define File Paths (UPDATE THESE WITH YOUR ACTUAL PATHS) ---
source_photo_path = '/content/drive/MyDrive/NLP/Paraphrasing Detector/A3/Synethetic-Dataset/Sources/Source_Photosynthesis.txt'
source_ml_path = '/content/drive/MyDrive/NLP/Paraphrasing Detector/A3/Synethetic-Dataset/Sources/Source_MachineLearning.txt'
source_ww2_path = '/content/drive/MyDrive/NLP/Paraphrasing Detector/A3/Synethetic-Dataset/Sources/Source_WorldWar2.txt'
submission_path = '/content/drive/MyDrive/NLP/Paraphrasing Detector/A3/Synethetic-Dataset/Submission/Submission1.txt'

# --- Load Data ---

with open(source_photo_path, 'r', encoding='utf-8') as f: # Added encoding='utf-8' for robustness
    source1_photosynthesis = f.read()
with open(source_ml_path, 'r', encoding='utf-8') as f:
    source2_ML = f.read()
with open(source_ww2_path, 'r', encoding='utf-8') as f:
    source3_WW2 = f.read()
with open(submission_path, 'r', encoding='utf-8') as f:
    submission = f.read()

print("Data loaded successfully:")
print(f"- Photosynthesis source length: {len(source1_photosynthesis)} chars")
print(f"- Machine Learning source length: {len(source2_ML)} chars")
print(f"- World War 2 source length: {len(source3_WW2)} chars")
print(f"- Submission length: {len(submission)} chars")

Data loaded successfully:
- Photosynthesis source length: 2339 chars
- Machine Learning source length: 1989 chars
- World War 2 source length: 2817 chars
- Submission length: 3983 chars


## 3. Data Preprocessing

Tokenise each file into individual sentences. This allows us to compare sentences from the submission against sentences from the sources.
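
The whitespace cleanup and length filter applied to every sentence can be illustrated in isolation. This is only a sketch: `clean_sentences` is a hypothetical helper that does not appear in the notebook, and it takes an already-split list of sentences so that it runs without NLTK.

```python
import re

def clean_sentences(raw_sentences, min_length=15):
    """Collapse whitespace runs, trim the ends, and drop sentences
    shorter than min_length characters (hypothetical helper)."""
    cleaned = (re.sub(r"\s+", " ", s).strip() for s in raw_sentences)
    return [s for s in cleaned if len(s) >= min_length]

# Made-up sample input: one real sentence with messy whitespace, two short fragments
sample = ["  Photosynthesis   converts\nlight into energy. ", "Yes.", "Short one."]
result = clean_sentences(sample)
print(result)
```

Only the first sentence survives, with its line break and repeated spaces reduced to single spaces. The two fragments fall below the 15-character minimum and are dropped.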

In [13]:
# Combine Sources and keep track of origin
all_source_text = {
    "Photosynthesis": source1_photosynthesis,
    "Machine Learning": source2_ML,
    "World War 2": source3_WW2
}

source_sentences_info = []
min_sentence_length = 15  # minimum length in characters; very short sentences carry too little meaning to compare

print("Segmenting source documents into sentences")
for source_name, text in all_source_text.items():
  sentences = sent_tokenize(text)

  for i, sentence in enumerate(sentences):
    # collapse runs of whitespace into single spaces and trim the ends
    clean_sentence = re.sub(r'\s+', ' ', sentence).strip()

    if len(clean_sentence) >= min_sentence_length:  # skip very short sentences
      source_sentences_info.append(
          {
              "text": clean_sentence,
              "source": source_name,
              "source_index": i #represents original index position within its source document
          }
      )

print(f"Segmented source document into {len(source_sentences_info)} sentences (min length {min_sentence_length}).")

print("\nSegmenting Submission document into sentences...")
submission_sentences = [
    re.sub(r'\s+', ' ', sentence).strip() for sentence in sent_tokenize(submission)
    if len(re.sub(r'\s+', ' ', sentence).strip()) >= min_sentence_length
]

print(f"Segmented submission document into {len(submission_sentences)} sentences (min length {min_sentence_length}).")

# extract just the text for sentence embedding (needed for modelling)
# note: this rebinds all_source_text from the dict above to a flat list of sentence strings
all_source_text = [sentence['text'] for sentence in source_sentences_info]

# Display first few sentences as a check
print("\nFirst 3 source sentences:")
for i in range(min(3, len(source_sentences_info))):
  print(f"  [{source_sentences_info[i]['source']}]: {source_sentences_info[i]['text']}")

print("\nFirst 3 submission sentences:")
for i in range(min(3, len(submission_sentences))):
    print(f"  {submission_sentences[i]}")

Segmenting source documents into sentences
Segmented source document into 42 sentences (min length 15).

Segmenting Submission document into sentences...
Segmented submission document into 28 sentences (min length 15).

First 3 source sentences:
  [Photosynthesis]: Photosynthesis (/ˌfoʊtəˈsɪnθəsɪs/ FOH-tə-SINTH-ə-sis) is a system of biological processes by which photosynthetic organisms, such as most plants, algae, and cyanobacteria, convert light energy, typically from sunlight, into the chemical energy necessary to fuel their metabolism.
  [Photosynthesis]: Photosynthesis usually refers to oxygenic photosynthesis, a process that produces oxygen.
  [Photosynthesis]: Photosynthetic organisms store the chemical energy so produced within intracellular organic compounds (compounds containing carbon) like sugars, glycogen, cellulose and starches.

First 3 submission sentences:
  Essay: Interconnected Systems - Nature, History, and Technology Our world is shaped by intricate processes, from