<a href="https://colab.research.google.com/github/Arjun15GIT/Plagiarism-checker-Assignment/blob/main/plagChecker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**STEP 1:** Downloading the dataset and mounting Google Drive for easy access

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**STEP 2:** Unzipping the dataset zip file



In [2]:
!unzip "/content/drive/MyDrive/Plagiarism_detection/train_snli.txt.zip"


Archive:  /content/drive/MyDrive/Plagiarism_detection/train_snli.txt.zip
  inflating: train_snli.txt          


**STEP 3: Extract and Read Dataset**:
We start by reading the dataset directly from the ZIP file stored in Google Drive. The script opens the ZIP archive, checks for the required file (train_snli.txt), decodes it as UTF-8, and splits its contents into a list of lines.

In [2]:
import zipfile

file_path = "/content/drive/MyDrive/Plagiarism_detection/train_snli.txt.zip"

# Opening zip file
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    # Get a list of files in the archive
    files_in_zip = zip_ref.namelist()
    if 'train_snli.txt' in files_in_zip:
        with zip_ref.open('train_snli.txt', 'r') as file:
            # Decode the content assuming UTF-8 encoding.
            data = file.read().decode('utf-8').splitlines()
    else:
        # Stop here: without the file, `data` would be undefined in the loop below
        raise FileNotFoundError(
            f"File 'train_snli.txt' not found in the archive. Available files: {files_in_zip}"
        )

for i in range(5):
    print(data[i])

A person on a horse jumps over a broken down airplane.	A person is at a diner, ordering an omelette.	0
A person on a horse jumps over a broken down airplane.	A person is outdoors, on a horse.	1
Children smiling and waving at camera	There are children present	1
Children smiling and waving at camera	The kids are frowning	0
A boy is jumping on skateboard in the middle of a red bridge.	The boy skates down the sidewalk.	0


In [3]:
# Print more lines to understand structure
for i in range(10):
    print(f"Line {i+1}: {data[i]}")


Line 1: A person on a horse jumps over a broken down airplane.	A person is at a diner, ordering an omelette.	0
Line 2: A person on a horse jumps over a broken down airplane.	A person is outdoors, on a horse.	1
Line 3: Children smiling and waving at camera	There are children present	1
Line 4: Children smiling and waving at camera	The kids are frowning	0
Line 5: A boy is jumping on skateboard in the middle of a red bridge.	The boy skates down the sidewalk.	0
Line 6: A boy is jumping on skateboard in the middle of a red bridge.	The boy does a skateboarding trick.	1
Line 7: An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background.	A boy flips a burger.	0
Line 8: Two blond women are hugging one another.	The women are sleeping.	0
Line 9: Two blond women are hugging one another.	There are women showing affection.	1
Line 10: A few people in a restaurant setting, one of them is drinking orange juice.	The people ar

**STEP 4: Convert Raw Data into a Structured Dataset**:
Now that we have extracted the dataset, we convert it into a structured format using Pandas. Each line in the dataset consists of two sentences and a label, separated by a tab (\t).

In [4]:
import pandas as pd

# Converting raw data into structured format
data_list = [line.strip().split("\t") for line in data]

# Create DataFrame
df = pd.DataFrame(data_list, columns=["Sentence1", "Sentence2", "Label"])

# Convert Label to integer
df["Label"] = df["Label"].astype(int)

print(df.head())
print("\nDataset Size:", df.shape)


                                           Sentence1  \
0  A person on a horse jumps over a broken down a...   
1  A person on a horse jumps over a broken down a...   
2              Children smiling and waving at camera   
3              Children smiling and waving at camera   
4  A boy is jumping on skateboard in the middle o...   

                                       Sentence2  Label  
0  A person is at a diner, ordering an omelette.      0  
1              A person is outdoors, on a horse.      1  
2                     There are children present      1  
3                          The kids are frowning      0  
4              The boy skates down the sidewalk.      0  

Dataset Size: (367373, 3)
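The conversion above assumes every line has exactly three tab-separated fields. As a minimal sketch of a more defensive parse, the helper below (`parse_pairs` is a hypothetical name, not part of the notebook) keeps well-formed rows and records the line numbers of anything else. The third sample line is deliberately malformed for illustration.

```python
def parse_pairs(lines):
    """Split tab-separated lines into (sentence1, sentence2, label) rows.

    Returns the valid rows and the 1-based numbers of skipped lines.
    """
    rows, skipped = [], []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Keep only rows with two sentences and a 0/1 label
        if len(fields) == 3 and fields[2] in {"0", "1"}:
            rows.append((fields[0], fields[1], int(fields[2])))
        else:
            skipped.append(line_number)
    return rows, skipped


sample_lines = [
    "Children smiling and waving at camera\tThere are children present\t1",
    "Children smiling and waving at camera\tThe kids are frowning\t0",
    "A line with a missing label\tonly two fields",  # hypothetical bad row
]

rows, skipped = parse_pairs(sample_lines)
print(len(rows), "valid rows; skipped lines:", skipped)
```

The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=["Sentence1", "Sentence2", "Label"])` in the same way as `data_list`.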



**STEP 5: Text Preprocessing: Cleaning and Tokenization**:
Before training our plagiarism detection model, we clean and preprocess the text. This step removes special characters and stopwords and puts every sentence into a uniform format.

***Steps in Preprocessing:***

Convert to lowercase → Ensures uniformity.

Remove special characters → Eliminates punctuation and unwanted symbols.

Tokenization → Splits text into individual words.

Remove stopwords → Removes common but unimportant words (e.g., "the", "is", "and").

Reconstruct sentence → Joins cleaned words back into a sentence.




In [6]:
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")
nltk.download('punkt_tab')

# Load English stopwords
stop_words = set(stopwords.words("english"))

# Cleaning text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Remove special characters
    words = word_tokenize(text)  # Tokenize text
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    return " ".join(words)  # Convert back to sentence

# Apply cleaning function to both sentences
df["Sentence1"] = df["Sentence1"].apply(clean_text)
df["Sentence2"] = df["Sentence2"].apply(clean_text)

print(df.head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


                                  Sentence1                       Sentence2  \
0        person horse jumps broken airplane  person diner ordering omelette   
1        person horse jumps broken airplane           person outdoors horse   
2            children smiling waving camera                children present   
3            children smiling waving camera                   kids frowning   
4  boy jumping skateboard middle red bridge             boy skates sidewalk   

   Label  
0      0  
1      1  
2      1  
3      0  
4      0  


**STEP 6: Feature Extraction using TF-IDF**:
After cleaning the text, we need to convert it into a numerical format that machine learning models can understand. TF-IDF (Term Frequency-Inverse Document Frequency) is a common technique for text representation.

***Steps in TF-IDF:***

Term Frequency (TF) → Measures how often a word appears in a sentence.

Inverse Document Frequency (IDF) → Reduces the importance of common words across all sentences.

Vectorization → Converts text into a numerical matrix based on TF-IDF scores.

In [7]:
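from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from Sentence1 only, then reuse them for
# Sentence2. Words that appear only in Sentence2 are outside the vocabulary and
# contribute nothing to its vector.
vectorizer = TfidfVectorizer()
X1 = vectorizer.fit_transform(df["Sentence1"])
X2 = vectorizer.transform(df["Sentence2"])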
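The TF, IDF, and vectorization steps can be worked through by hand on two cleaned sentences from the dataset. This is a minimal pure-Python sketch, assuming the `TfidfVectorizer` defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The names `doc_freq`, `idf`, and `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

docs = ["children smiling waving camera", "children present"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many sentences does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (scikit-learn's default formula)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in doc_freq.items()}


def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} vector for one sentence."""
    term_freq = Counter(tokens)
    raw = {term: count * idf[term] for term, count in term_freq.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vector in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vector.items()})
```

Because "children" occurs in both sentences, its IDF is lower than that of "present", so it receives the smaller weight in the second vector.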


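**STEP 7: Computing Similarity using Cosine Similarity**:
Now that we have transformed the text into TF-IDF vectors, we compare Sentence1 and Sentence2 row by row using cosine similarity. Because TF-IDF weights are never negative, the score always falls between 0 and 1.

Here:

1 → The two vectors point in the same direction (for example, the cleaned sentences are identical)

0 → The sentences share no vocabulary terms

Between 0 and 1 → Partial overlap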
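The three cases can be checked with a small hand-rolled cosine function. This is a sketch on raw word counts rather than TF-IDF weights, to keep the arithmetic easy to follow. The `cosine` helper and the example vectors are illustrative and not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight ** 2 for weight in u.values()))
    norm_v = math.sqrt(sum(weight ** 2 for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction; treat as no similarity
    return dot / (norm_u * norm_v)


a = {"children": 1.0, "smiling": 1.0, "waving": 1.0, "camera": 1.0}
b = {"children": 1.0, "present": 1.0}
c = {"kids": 1.0, "frowning": 1.0}

print("identical:", cosine(a, a))
print("partial overlap:", round(cosine(a, b), 4))
print("no shared terms:", cosine(a, c))
```

Here `a` and `b` share one term out of four and two, so the score is 1 / (2 · √2), roughly 0.35.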


In [8]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarity scores
similarity_scores = [cosine_similarity(X1[i], X2[i])[0][0] for i in range(X1.shape[0])]

# Add scores to the DataFrame
df["Similarity_Score"] = similarity_scores
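Calling `cosine_similarity` once per row is slow on a few hundred thousand pairs. A possible alternative, sketched below on a toy sample, relies on the fact that `TfidfVectorizer` L2-normalizes each row by default, so the row-wise dot product already equals the cosine similarity. The variable names (`demo_vectorizer`, `A`, `B`) are chosen for this example only, and the sketch assumes the default `norm="l2"` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

first = [
    "person horse jumps broken airplane",
    "children smiling waving camera",
    "boy jumping skateboard middle red bridge",
]
second = [
    "person outdoors horse",
    "children present",
    "boy skates sidewalk",
]

demo_vectorizer = TfidfVectorizer()
A = demo_vectorizer.fit_transform(first)
B = demo_vectorizer.transform(second)

# Element-wise product summed per row = dot product of each aligned pair
fast_scores = np.asarray(A.multiply(B).sum(axis=1)).ravel()

# Reference: the per-row loop used in the cell above
slow_scores = np.array(
    [cosine_similarity(A[i], B[i])[0][0] for i in range(A.shape[0])]
)

print(np.allclose(fast_scores, slow_scores))
```

If the vectorizer were created with `norm=None`, the rows would need to be normalized first for this shortcut to hold.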


**STEP 8: Preparing Data for Model Training**:

X: Feature matrix → Contains only the Similarity Score

y: Target labels → The Label column (0 = No Plagiarism, 1 = Plagiarism)

In [9]:
X = df[["Similarity_Score"]]
y = df["Label"]


**STEP 9: Splitting the Dataset**:

Training Set (80%) → Used to train the model

Testing Set (20%) → Used to evaluate model performance

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


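**STEP 10: Training the Model with Logistic Regression**:
Logistic regression learns the relationship between the similarity score and the plagiarism label (0 for No Plagiarism, 1 for Plagiarism). With a single input feature, this amounts to learning a cutoff on the similarity score.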

In [11]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
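With only one feature, the fitted model is a sigmoid over `w * score + b`, and the predicted class flips where that expression crosses zero. The sketch below illustrates this with hypothetical values for `w` and `b`. They are not the coefficients learned in this notebook.

```python
import math

# Hypothetical coefficient and intercept, chosen only for illustration
w, b = 4.0, -1.2


def predict_probability(score):
    """Probability of class 1 for a given similarity score."""
    return 1.0 / (1.0 + math.exp(-(w * score + b)))


def predict_label(score):
    return 1 if predict_probability(score) >= 0.5 else 0


# The probability equals 0.5 exactly where w * score + b == 0
threshold = -b / w

for score in (0.1, threshold, 0.6):
    print(f"score={score:.2f}  p={predict_probability(score):.3f}  label={predict_label(score)}")
```

For the trained model, the same cutoff could be read from `-model.intercept_[0] / model.coef_[0][0]`.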


**STEP 11: Evaluating the Model**

In [12]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.6682000680503573
              precision    recall  f1-score   support

           0       0.65      0.73      0.69     36795
           1       0.69      0.60      0.64     36680

    accuracy                           0.67     73475
   macro avg       0.67      0.67      0.67     73475
weighted avg       0.67      0.67      0.67     73475
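As a quick sanity check on the report, the F1 scores can be recomputed from the printed precision and recall, and the accuracy can be compared against always predicting the larger class. The sketch uses only the rounded figures shown above, so the results are approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_class_0 = f1_from(0.65, 0.73)
f1_class_1 = f1_from(0.69, 0.60)

# Accuracy of a classifier that always predicts the majority class
support_0, support_1 = 36795, 36680
majority_baseline = max(support_0, support_1) / (support_0 + support_1)

print("F1 class 0:", round(f1_class_0, 2))
print("F1 class 1:", round(f1_class_1, 2))
print("Majority baseline accuracy:", round(majority_baseline, 3))
```

The test set is almost perfectly balanced, so the baseline is close to 0.50 and the reported accuracy of about 0.67 is a real but modest improvement over it.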



**EXAMPLE USAGE**:

Preprocess the text → Convert to lowercase, remove special characters, and stopwords.

Vectorize sentences → Convert the input text to TF-IDF representation.

Compute similarity → Measure the cosine similarity between both vectors.

Predict plagiarism → Use the trained logistic regression model to classify.

In [13]:
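def check_plagiarism(sentence1, sentence2):
    """Classify a sentence pair with the trained similarity model."""
    # Apply the same cleaning used on the training data
    sentence1 = clean_text(sentence1)
    sentence2 = clean_text(sentence2)

    # Reuse the fitted vectorizer so the vocabulary matches training
    vec1 = vectorizer.transform([sentence1])
    vec2 = vectorizer.transform([sentence2])

    similarity = cosine_similarity(vec1, vec2)[0][0]
    # Pass a DataFrame with the training column name to avoid sklearn's
    # feature-name warning
    features = pd.DataFrame({"Similarity_Score": [similarity]})
    prediction = model.predict(features)

    return "Plagiarism Detected" if prediction[0] == 1 else "No Plagiarism"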

# Example Test
print(check_plagiarism("A boy is playing soccer.", "A child is playing football."))


No Plagiarism
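One likely reason for this result is that TF-IDF compares exact words, not meanings. After cleaning, the two sentences share only one token, as the sketch below shows. The token sets are written out by hand, assuming the stopword filter removes "a" and "is".

```python
# Cleaned tokens for "A boy is playing soccer." and "A child is playing football."
tokens_1 = {"boy", "playing", "soccer"}
tokens_2 = {"child", "playing", "football"}

shared = tokens_1 & tokens_2
jaccard = len(shared) / len(tokens_1 | tokens_2)

print("Shared tokens:", shared)
print("Jaccard overlap:", jaccard)
```

Synonym pairs such as boy/child and soccer/football contribute nothing to the score, so paraphrases with little word overlap tend to fall below the learned cutoff.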




In [14]:
# Example Test
print(check_plagiarism("Be careful kids!", "Be careful kids!"))


Plagiarism Detected


