# 🔍 Plagiarism Detection using Jaccard Similarity

In this project, we build a simple text similarity checker using **Jaccard Similarity**.  
You'll learn:
- How to tokenize and clean text
- How to use Python sets to measure word overlap
- How to do basic text preprocessing with scikit-learn
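
For reference, the standard definition of the Jaccard similarity of two sets is sketched below. The symbols $A$ and $B$ are notation introduced for this note; here they stand for the sets of words in each document.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1 \quad \text{when } A \cup B \neq \emptyset
```

A score of 0 means the documents share no words, and a score of 1 means their word sets are identical.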


In [13]:
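# Compute the Jaccard similarity between the word sets of two documents
def jaccard_similarity(doc1, doc2):
    # Lowercase and split on whitespace; punctuation stays attached to words
    words_doc1 = set(doc1.lower().split())
    words_doc2 = set(doc2.lower().split())

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    # Two empty documents give an empty union; return 0.0 (a convention
    # chosen here) instead of dividing by zero
    if not union:
        return 0.0

    similarity = len(intersection) / len(union)
    return similarity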

In [15]:
doc1 = "I am a very good girl, said my old teacher!"
doc2 = "Am I really a good boy, dude?"

score = jaccard_similarity(doc1, doc2)
print(f"Similarity Score {score:.2f}")

# 0.5 is the cutoff used in this demo, not a calibrated threshold
if score > 0.5:
    print("Possible Plagiarism")
else:
    print("Sufficiently different")


Similarity Score 0.31
Sufficiently different
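
The score above comes from splitting on whitespace only, so tokens such as `girl,` and `teacher!` keep their punctuation. Below is a minimal sketch of a cleaning step, assuming a regex-based tokenizer is acceptable. `tokenize` and `jaccard_similarity_clean` are hypothetical helper names, not part of the original notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens, without punctuation."""
    # \w+ matches runs of letters, digits, and underscores, so "girl," becomes "girl"
    return set(re.findall(r"\w+", text.lower()))

def jaccard_similarity_clean(doc1, doc2):
    """Jaccard similarity over punctuation-free word sets (hypothetical helper)."""
    words_doc1 = tokenize(doc1)
    words_doc2 = tokenize(doc2)
    union = words_doc1 | words_doc2
    if not union:
        # Convention assumed here: two empty documents score 0.0
        return 0.0
    return len(words_doc1 & words_doc2) / len(union)

print(sorted(tokenize("Am I really a good boy, dude?")))
# prints ['a', 'am', 'boy', 'dude', 'good', 'i', 'really']
```

For the two example sentences this still gives 4/13 ≈ 0.31, because none of the punctuated words occur in both documents. The cleaning matters when the same word appears as `boy,` in one text and `boy` in the other.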


In [25]:
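from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import jaccard_score

def jaccard_similarity_sklearn(doc1, doc2):
    # binary=True records word presence (0/1) rather than word counts.
    # The default tokenizer lowercases, ignores punctuation, and drops
    # single-character tokens such as "i" and "a".
    vectorizer = CountVectorizer(binary=True)
    vectors = vectorizer.fit_transform([doc1, doc2]).toarray()
    # Jaccard score over the two binary presence vectors
    return jaccard_score(vectors[0], vectors[1])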

# Test case
doc1 = "I am a very good girl, said my old teacher!"
doc2 = "Am I really a good boy, dude?"

score = jaccard_similarity_sklearn(doc1, doc2)
print(f"Similarity Score: {score:.2f}")

if score > 0.5:
    print("Possible Plagiarism")
else:
    print("Sufficiently different")


Similarity Score: 0.18
Sufficiently different
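
The two versions disagree (0.31 vs. 0.18) even though both compare word sets. The likely cause is tokenization: the default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters, so `i` and `a` are dropped. Below is a minimal sketch that reproduces the effect with the standard library only. `jaccard_with_pattern` is a hypothetical helper name.

```python
import re

def jaccard_with_pattern(doc1, doc2, pattern):
    """Jaccard similarity using a regex to extract tokens (hypothetical helper)."""
    words_doc1 = set(re.findall(pattern, doc1.lower()))
    words_doc2 = set(re.findall(pattern, doc2.lower()))
    union = words_doc1 | words_doc2
    if not union:
        return 0.0
    return len(words_doc1 & words_doc2) / len(union)

doc1 = "I am a very good girl, said my old teacher!"
doc2 = "Am I really a good boy, dude?"

# Tokens of one or more characters: "i" and "a" count as shared words
all_tokens = jaccard_with_pattern(doc1, doc2, r"\b\w+\b")
# Tokens of two or more characters, mirroring CountVectorizer's default pattern
long_tokens = jaccard_with_pattern(doc1, doc2, r"\b\w\w+\b")

print(f"{all_tokens:.2f} {long_tokens:.2f}")  # prints "0.31 0.18"
```

With single-character words kept, the overlap is 4 of 13 words. Without them it is 2 of 11. Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should make the scikit-learn version keep single-character words as well.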
