In [5]:
pip install datasketch

Collecting datasketch
  Downloading datasketch-1.6.4-py3-none-any.whl (88 kB)
Installing collected packages: datasketch
Successfully installed datasketch-1.6.4


In [6]:
import numpy as np
from datasketch import MinHash, MinHashLSH
from itertools import combinations

In [17]:
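k = 10  # shingle size (words per shingle)
perms = 100   # number of permutations (MinHash signature length)
bands = 15
rows = perms//bands  # 6 rows per band, so 15 * 6 = 90 of the 100 hash values are banded
threshold = 0.9  # Jaccard similarity threshold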

In [19]:
lsh = MinHashLSH(threshold=threshold, num_perm=perms, params=(bands, rows))

def create_shingles(text, k):
    tokens = text.split()
    shingles = set()
    for i in range(len(tokens) - k + 1):
        shingle = " ".join(tokens[i:i + k])
        shingles.add(shingle)
    return shingles

def jaccard_similarity(minhash1, minhash2):
    return minhash1.jaccard(minhash2)

In [20]:
paragraph = {}
current_paragraph = ""

In [21]:
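def index_paragraph(text):
    # MinHash one paragraph and insert it into the LSH index under the next free id
    minhash = MinHash(num_perm = perms)
    for shingle in create_shingles(text.strip(), k):
        minhash.update(shingle.encode('utf-8'))
    doc_id = len(paragraph)
    lsh.insert(doc_id, minhash)
    paragraph[doc_id] = minhash

with open('/content/similarity', 'r', encoding='utf-8') as file:
    for line in file:
        if line.strip():
            current_paragraph += line
        else:
            # a blank line ends the current paragraph
            index_paragraph(current_paragraph)
            current_paragraph = ""

# index the final paragraph if the file does not end with a blank line
if current_paragraph:
    index_paragraph(current_paragraph)
    current_paragraph = ""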



In [22]:
cp = [] #candidate pairs
for i, bucket in enumerate(lsh.hashtables):
    for key in bucket.keys():
        doc_ids = list(bucket.get(key))
        if len(doc_ids) > 1:
            cp.extend(combinations(doc_ids, 2))

cp = list(set(cp))
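
Note that `set(cp)` only removes pairs that are identical as ordered tuples. If the same two ids come out of two buckets in a different order, both `(a, b)` and `(b, a)` may survive. A minimal sketch of one way to avoid this, assuming the bucket contents are plain collections of ids (the `unique_pairs` helper and the example buckets are made up for illustration):

```python
from itertools import combinations


def unique_pairs(buckets):
    """Collect every unordered pair of ids that shares at least one bucket."""
    pairs = set()
    for doc_ids in buckets:
        # Sorting first makes (a, b) and (b, a) collapse to the same tuple.
        pairs.update(combinations(sorted(doc_ids), 2))
    return sorted(pairs)


# Hypothetical bucket contents, as they might come out of different bands.
example_buckets = [{300, 270}, {270, 300}, {164, 229, 7}, {5}]
print(unique_pairs(example_buckets))
```

With this normalization each pair is always reported with the smaller id first, so the order of ids in the printed results would differ from the output shown in this notebook.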


In [23]:
sp = [] #similar pairs
for x in cp:
    i, j = x #pairs
    similarity = jaccard_similarity(paragraph[i], paragraph[j])
    if similarity >= threshold:
        sp.append((i, j, similarity))

In [27]:
sp.sort(key=lambda x: x[2], reverse=True)
most_similar_pairs = sp[:5]
for x in most_similar_pairs:
    doc1, doc2, similarity = x
    print(f"Paras {doc1} and Paras {doc2} have a Jaccard Similarity of: {similarity:.2f}")

Paras 300 and Paras 270 have a Jaccard Similarity of: 1.00
Paras 298 and Paras 309 have a Jaccard Similarity of: 1.00
Paras 266 and Paras 286 have a Jaccard Similarity of: 1.00
Paras 164 and Paras 229 have a Jaccard Similarity of: 1.00
Paras 287 and Paras 318 have a Jaccard Similarity of: 1.00


In [32]:
### This is how I checked the paragraphs - whether they actually matched or not

paragraphs = []

with open('/content/similarity', 'r', encoding='utf-8') as file:
    current_paragraph = ""
    for line in file:
        if line.strip():
            current_paragraph += line
        else:
            paragraphs.append(current_paragraph)
            current_paragraph = ""

if current_paragraph:
    paragraphs.append(current_paragraph)

para_300_content = paragraphs[300]
para_270_content = paragraphs[270]

print(f"Paragraph 300 content:\n{para_300_content}")
print(f"Paragraph 270 content:\n{para_270_content}")

Paragraph 300 content:
    In the heart of a bustling city, a master chef toiled in a renowned restaurant's kitchen, orchestrating a culinary symphony that delighted the senses. Ingredients from the world over danced on the stove, and the clatter of pots and pans merged with the aroma of spices and herbs. Each dish was a canvas, a masterpiece of flavor and presentation that would leave diners with unforgettable memories of a gastronomic journey. The restaurant's dining room buzzed with anticipation, a place where food transcended sustenance and became a work of art.

Paragraph 270 content:
    In the heart of a bustling city, a master chef toiled in a renowned restaurant's kitchen, orchestrating a culinary symphony that delighted the senses. Ingredients from the world over danced on the stove, and the clatter of pots and pans merged with the aroma of spices and herbs. Each dish was a canvas, a masterpiece of flavor and presentation that would leave diners with unforgettable memories of

Question 1: I represented each paragraph as a set of word-level k-shingles, where a shingle is a sequence of k consecutive words from the text. I used a shingle size of 10, so each shingle covers 10 consecutive words. I chose this size to capture longer sequences of words and phrases, including overlaps that span multiple sentences or concepts within a paragraph. When I kept the shingle size low, many pairs of paragraphs had a Jaccard similarity equal to 1, so I took a larger value to make the comparison stricter.
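
To illustrate the effect of the shingle size, here is a small sketch that computes the exact Jaccard similarity of two made-up sentences that differ in a single word. The helper names `word_shingles` and `exact_jaccard` and the example sentences are assumptions for illustration, not part of the notebook above. It works on the raw shingle sets and does not use MinHash.

```python
def word_shingles(text, k):
    """Return the set of k-word shingles of `text` (empty if it has fewer than k words)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets; two empty sets are treated as 0.0 here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two made-up sentences that differ in a single word (dog vs. cat).
s1 = "the quick brown fox jumps over the lazy dog near the old river bank"
s2 = "the quick brown fox jumps over the lazy cat near the old river bank"

for k in (2, 5, 10):
    sim = exact_jaccard(word_shingles(s1, k), word_shingles(s2, k))
    print(f"k={k:2d}  shingles={len(word_shingles(s1, k)):2d}  jaccard={sim:.2f}")

# A text with fewer than k words produces no shingles at all.
print(word_shingles("too short", 10))
```

A single changed word touches every shingle that contains it, so the similarity drops as k grows. The last line also shows an edge case: a paragraph shorter than k words yields an empty shingle set.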

Question 2: If you naively compare every pair of columns in the signature matrix to determine the similar documents, it becomes very time-consuming and expensive, especially with a large dataset like this one, because the number of pairs grows quadratically with the number of documents.
The banding technique divides the signature matrix into bands of rows and hashes each band into buckets. Only documents that land in the same bucket for at least one band become candidate pairs, which reduces the number of document pairs to be compared.
The candidate set can still contain false positives, so each candidate pair is checked against the similarity threshold afterwards. Together, the banding technique and LSH reduce the computational cost while still finding the highly similar pairs, which makes it possible to deal with larger datasets like the one we have worked with.
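
The effect of banding can be sketched with the standard formula for the probability that two documents become a candidate pair: with b bands of r rows each and Jaccard similarity s, it is 1 - (1 - s^r)^b. The snippet below evaluates it for the values used in this notebook (15 bands, 6 rows). The function name `candidate_probability` is made up for illustration, and the formula assumes the hash functions behave independently.

```python
def candidate_probability(s, bands, rows):
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


bands, rows = 15, 6  # the values used above (100 // 15 == 6)

# Rule-of-thumb similarity at which the S-curve rises most steeply.
approx_threshold = (1.0 / bands) ** (1.0 / rows)
print(f"approximate threshold: {approx_threshold:.3f}")

for s in (0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"s={s:.1f}  P(candidate)={candidate_probability(s, bands, rows):.4f}")
```

Pairs well below the steep part of the curve rarely become candidates, while near-duplicates almost always do. This is why the explicit check against the 0.9 threshold is still needed after candidate generation.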

Question 3: The documents/paras I found to be identical are:

**Paras 300 and Paras 270** - In the heart of a bustling city, a master chef toiled in a renowned restaurant's kitchen, orchestrating a culinary symphony that delighted the senses. Ingredients from the world over danced on the stove, and the clatter of pots and pans merged with the aroma of spices and herbs. Each dish was a canvas, a masterpiece of flavor and presentation that would leave diners with unforgettable memories of a gastronomic journey. The restaurant's dining room buzzed with anticipation, a place where food transcended sustenance and became a work of art.

**Paras 298 and Paras 309** -     In the heart of a bustling city, a hidden courtyard garden offered a tranquil refuge from the urban chaos. Ancient stone pathways led to lush flowerbeds, where vibrant blooms flourished in terra cotta pots. The garden was a testament to human ingenuity and the resilience of nature, where greenery thrived amidst the concrete and steel. Here, amidst the beauty of the courtyard, the relentless pace of the city seemed to pause, and visitors found solace in the embrace of nature's splendor.


**Paras 266 and Paras 286** -     On the precipice of twilight, a coastal town's harbor became a symphony of activity, as fishermen hauled in their day's catch, their boats returning with brimming nets. Gulls circled overhead, their calls mingling with the sound of seagulls clanging against masts. As the sun dipped below the horizon, the harbor transformed into a painting of tranquil beauty, a reflection of a community deeply intertwined with the ebb and flow of the ocean.


**Paras 164 and Paras 229** -     In a picturesque coastal village, a centuries-old windmill stood sentinel, its massive sails turning gracefully in the breeze. The windmill's stones had ground grains into flour for generations, providing sustenance to the community. Around it, narrow cobblestone streets wound through a tapestry of charming cottages, their flower-filled gardens a riot of color. The village was a living postcard, a place where history and beauty merged, and where the windmill's enduring presence seemed to transcend time itself.


**Paras 287 and Paras 318** -     A small, sunlit pottery studio was a haven for creativity, where the potter molded clay into intricate shapes, guided by a lifelong passion for the craft. Shelves were adorned with works of art, from delicate teacups to massive sculptures, each one bearing the potter's unique signature. In the studio, the rhythm of creation was a dance of imagination and technique, a testament to the timeless marriage of human skill and the transformative nature of clay.

Each of these pairs gave me a Jaccard similarity of 1.00, and on inspecting the text, the paragraphs are indeed identical.
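
Because the MinHash value is an estimate and is printed rounded to two decimals, a direct text comparison is a useful cross-check. A minimal sketch, assuming a list like the `paragraphs` list built above; the helpers `normalize` and `are_identical` and the demo data are made up for illustration:

```python
def normalize(text):
    """Collapse all runs of whitespace so that indentation and line breaks are ignored."""
    return " ".join(text.split())


def are_identical(paragraphs, i, j):
    """True if paragraphs i and j contain exactly the same words in the same order."""
    return normalize(paragraphs[i]) == normalize(paragraphs[j])


# Made-up stand-in for the `paragraphs` list built above.
demo = ["    A small studio.\n", "A small   studio.", "A large studio."]
print(are_identical(demo, 0, 1))
print(are_identical(demo, 0, 2))
```

On the real data, the same check could be run as `are_identical(paragraphs, 300, 270)` and likewise for the other reported pairs.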


