# Jaccard Similarity
In this notebook, we explore Jaccard similarity as a metric for assessing document similarity. Jaccard similarity measures the overlap between two sets as the size of their intersection divided by the size of their union, $J(A, B) = |A \cap B| / |A \cup B|$, so it ranges from 0 (no shared elements) to 1 (identical sets). For text, the sets are the unique words in each document, so the metric reflects the presence or absence of words rather than their frequency.

Because it reduces to a single, interpretable ratio, Jaccard similarity is a common tool in applications such as plagiarism detection, content recommendation, and information retrieval. This notebook computes it for pairs of documents using Python and scikit-learn, and computes TF-IDF cosine similarity for the same pairs as a point of comparison.
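
To make the definition concrete, here is a minimal sketch using plain Python sets. It assumes naive lowercase, whitespace-based tokenization, which is simpler than what scikit-learn's vectorizers do; the function name `jaccard_similarity` and the two example sentences are illustrative and are not part of the notebook's dataset.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the unique lowercase, whitespace-separated words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        # Convention chosen for this sketch: two empty documents score 0.0.
        return 0.0
    return len(words_a & words_b) / len(union)


doc_a = "the cat sat on the mat"
doc_b = "the dog sat on the log"

# Shared words: {the, sat, on} -> 3; distinct words overall -> 7.
print(jaccard_similarity(doc_a, doc_b))  # 3 / 7, roughly 0.4286
```

Repeated words such as "the" count once, which is exactly the presence-or-absence behaviour described above.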

In [1]:
# Standard library
from pathlib import Path

# Third-party
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import jaccard_score, pairwise
from sklearn.metrics.pairwise import cosine_similarity



In [11]:
# The dataset sits two directory levels above the working directory, under introduction/datasets.
p = Path.cwd().resolve().parents[1] / "introduction" / "datasets" / "Text_Similarity_Dataset.csv"
df = pd.read_csv(p)
df

| Unnamed: 0 | Unique_ID | text1 | text2 |
|---|---|---|---|
| 0 | 0 | savvy searchers fail to spot ads internet sear... | newcastle 2-1 bolton kieron dyer smashed home ... |
| 1 | 1 | millions to miss out on the net by 2025 40% o... | nasdaq planning $100m share sale the owner of ... |
| 2 | 2 | young debut cut short by ginepri fifteen-year-... | ruddock backs yapp s credentials wales coach m... |
| 3 | 3 | diageo to buy us wine firm diageo the world s... | mci shares climb on takeover bid shares in us ... |
| 4 | 4 | be careful how you code a new european directi... | media gadgets get moving pocket-sized devices ... |
| ... | ... | ... | ... |
| 4018 | 4018 | labour plans maternity pay rise maternity pay ... | no seasonal lift for house market a swathe of ... |
| 4019 | 4019 | high fuel costs hit us airlines two of the lar... | new media battle for bafta awards the bbc lead... |
| 4020 | 4020 | britons growing digitally obese gadget lover... | film star fox behind theatre bid leading actor... |
| 4021 | 4021 | holmes is hit by hamstring injury kelly holmes... | tsunami to hit sri lanka banks sri lanka s b... |


In [None]:
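jaccard_results = []
tfidf_results = []
for _, row in df.iterrows():
    pair = [row['text1'], row['text2']]

    # Binary bag-of-words: 1 if a word occurs in the document, 0 otherwise.
    # The vocabulary is fitted on this pair only, so every column is a word from the pair's union.
    count_vectorizer = CountVectorizer(binary=True)
    X_count = count_vectorizer.fit_transform(pair).toarray()
    # On binary indicator vectors, jaccard_score equals |intersection| / |union| of the word sets.
    jaccard_sim = jaccard_score(X_count[0], X_count[1])
    jaccard_results.append(jaccard_sim)

    # TF-IDF weights are likewise fitted on the pair alone.
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(pair)
    # cosine_similarity returns a 1x1 matrix for two single-row inputs; take its only entry.
    tfidf_sim = pairwise.cosine_similarity(X_tfidf[0], X_tfidf[1])
    tfidf_results.append(tfidf_sim[0][0])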
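
The loop relies on `jaccard_score` behaving like a set-overlap ratio when it is given binary word-indicator vectors. The sketch below checks that on one small pair; the two texts are made up for illustration, and the manual intersection and union counts are an assumption-free restatement of the set definition rather than part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import jaccard_score

# Illustrative texts, not taken from the dataset.
text_a = "shares climb on takeover bid"
text_b = "shares fall on takeover fears"

# Same setup as the loop: binary indicators over the pair's joint vocabulary.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform([text_a, text_b]).toarray()

sklearn_value = jaccard_score(X[0], X[1])

# Manual set arithmetic on the same indicator rows.
shared = int((X[0] & X[1]).sum())  # words present in both texts
total = int((X[0] | X[1]).sum())   # words present in at least one text
set_value = shared / total

print(shared, total)             # 3 shared words out of 7 distinct words
print(sklearn_value, set_value)  # both equal 3 / 7
```

Because the vocabulary is fitted on the pair itself, every column is non-zero in at least one row, so `total` is simply the vocabulary size.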



In [None]:
# Attach both scores to the DataFrame as new columns, one value per document pair.
df['Jaccard Similarity'] = jaccard_results
df['TF-IDF Cosine Similarity'] = tfidf_results

df.head()


| Unnamed: 0 | Unique_ID | text1 | text2 | Jaccard Similarity | TF-IDF Cosine Similarity |
|---|---|---|---|---|---|
| 0 | 0 | savvy searchers fail to spot ads internet sear... | newcastle 2-1 bolton kieron dyer smashed home ... | 0.091451 | 0.389094 |
| 1 | 1 | millions to miss out on the net by 2025 40% o... | nasdaq planning $100m share sale the owner of ... | 0.097701 | 0.546468 |
| 2 | 2 | young debut cut short by ginepri fifteen-year-... | ruddock backs yapp s credentials wales coach m... | 0.11236 | 0.511436 |
| 3 | 3 | diageo to buy us wine firm diageo the world s... | mci shares climb on takeover bid shares in us ... | 0.110266 | 0.270215 |
| 4 | 4 | be careful how you code a new european directi... | media gadgets get moving pocket-sized devices ... | 0.151685 | 0.673715 |
