# Finding similar documents

This notebook builds a pipeline for finding similar documents.

The pipeline is as follows:

graph LR

    A[Documents] --> B[Tokenization]
    B --> C[Shingling]
    C --> D[MinHashing]
    D --> E[LSH]
    E --> F[Similarity Check]

In [202]:
import os
import itertools    
import numpy as np
from tqdm import tqdm

## Data collection
The first step is to find data to check for similarity.
We use a dataset with 10 different types of documents, with 100 documents each.
It was downloaded and placed in a folder called `data` in the current working directory.

```python
import kagglehub

# Download latest version
path = kagglehub.dataset_download("jensenbaxter/10dataset-text-document-classification")

print("Path to dataset files:", path)

```


After downloading the data, the documents have to be read from disk every time we want to process them:

```python
docs = read_documents("data")
```

In [203]:
def read_documents(path):
    documents = []
    for root, _, files in os.walk(path):
        for file in files:
            if file.endswith('.txt'):
                with open(os.path.join(root, file), 'r') as f:
                    documents.append(f.read())
    return documents
    

## Tokenization

This step converts the documents into tokens.

E.g. "A cat eats" -> [0, 1, 2]

where
vocab = {
    0: a,
    1: cat,
    2: eats
}

We could use a more sophisticated tokenizer, e.g. BPE. This could speed up processing of larger quantities of data, since a single number can then represent a frequently occurring chunk of text.

Nevertheless, a word-level tokenizer is fine for the purpose of this exercise. Our main goal is to convert the documents into integer representations, which allows for efficient and more general computations. E.g. by using only numbers we can limit memory usage with compact dtypes such as uint16, which allows for faster computation.

This tokenizer compresses the vocabulary slightly further by ignoring capitalization.
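
To illustrate the memory argument above, here is a minimal sketch that stores token ids in the smallest unsigned integer dtype that fits the vocabulary. The helper name `to_compact_array` is an assumption made for illustration and is not part of the pipeline below.

```python
import numpy as np

def to_compact_array(token_ids, vocab_size):
    """Store token ids in the smallest unsigned dtype that can hold vocab_size - 1."""
    # np.min_scalar_type picks uint8, uint16, uint32, ... based on the largest id
    dtype = np.min_scalar_type(max(vocab_size - 1, 0))
    return np.asarray(token_ids, dtype=dtype)

small = to_compact_array([0, 1, 2], vocab_size=3)
large = to_compact_array([0, 1, 2], vocab_size=50_000)
print(small.dtype, large.dtype)
```

Note that uint16 only works while the vocabulary has at most 65,536 entries; beyond that, the sketch falls back to a wider dtype.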


In [204]:
class Tokenizer:
    def __init__(self):
        self.vocabulary = {}
    
    def fit(self, documents):
        for doc in documents:
            doc = doc.lower()
            for word in doc.split():
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        
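    def encode(self, text):
        # Raises KeyError for words that were not seen during fit
        return [self.vocabulary[word.lower()] for word in text.split()]

    def decode(self, tokens):
        # The vocabulary maps word -> id, so invert it to map id -> word
        inverse_vocabulary = {index: word for word, index in self.vocabulary.items()}
        return ' '.join(inverse_vocabulary[token] for token in tokens)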

## Shingling

With a shingle size of 2:

abcab -> [ab, bc, ca, ab]

unique set -> {ab, bc, ca}


TODO:
To further optimize the shingling process, we should use less overlap between consecutive shingles.

Shingle size 3, with an overlap of 1:

abcdefghij -> [abc, cde, efg, ghi, ij]
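
The reduced-overlap idea in the TODO could be sketched as follows. The function name `shingles_with_step` and its `step` parameter are assumptions for illustration; the `Shingling` class below always advances one position at a time.

```python
def shingles_with_step(text, k, step=1):
    """Return the k-shingles of `text`, advancing `step` positions at a time."""
    if not 1 <= step <= k:
        raise ValueError("step must be between 1 and k")
    overlap = k - step
    # Stop once only already-covered (overlapping) characters remain
    return [text[i:i + k] for i in range(0, len(text) - overlap, step)]

print(shingles_with_step("abcdefghij", k=3))          # full overlap (step 1)
print(shingles_with_step("abcdefghij", k=3, step=2))  # overlap of 1
```

With a step larger than 1 the last shingle may be shorter than `k`, and the shingles become sensitive to alignment: inserting a single character shifts every shingle after it, so two near-identical documents can end up sharing few shingles.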



In [213]:

class Shingling:
    def __init__(self):
        self.shingles = set()
        self.hashed_shingles = set()
    
    def create_shingles(self, text, k):
        # Create k-shingles as tuples
        self.shingles = {tuple(text[i:i+k]) for i in range(len(text) - k + 1)}
        
        # Hash each shingle so it is represented by a single integer (the set is unordered)
        self.hashed_shingles = {hash(shingle) for shingle in self.shingles}
        
        return self.hashed_shingles


In [207]:
from typing import List

class MinHashing:
    def __init__(self):
        self.unique_shingles = set()
        
    def fit(self, sets:List[set]):
        for set_ in sets:
            self.unique_shingles.update(set_)

    def characteristic_vector(self, shingles: set):
        shingles_array = np.array(list(self.unique_shingles))
        return np.isin(shingles_array, list(shingles)).astype(np.int8)
        

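    def signature(self, n, shingles:set):
        # Fixed seed so that every document is hashed with the same n permutations
        np.random.seed(n)
        
        # Ranks start at 1 so that 0 can unambiguously mean "shingle absent"
        permutations = np.array([np.random.permutation(len(self.unique_shingles)) + 1 for _ in range(n)])
        characteristic_vector = self.characteristic_vector(shingles)
        permuted_shingles = permutations * characteristic_vector
        # Smallest rank among the shingles that are present; 0 only for a document without shingles
        minhash = np.array([np.min(row[row != 0]) if np.any(row != 0) else 0 for row in permuted_shingles])
        return minhash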
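
Building explicit permutations over every unique shingle for each document is costly, and MinHashing accounts for most of the runtime in the timings reported in this notebook. A common alternative is to simulate the permutations with random hash functions of the form `h(x) = (a*x + b) mod p`. The following is a minimal sketch of that swapped-in technique, not the method used in this notebook; the names `make_hash_params`, `minhash_signature` and `PRIME` are assumptions for illustration.

```python
import numpy as np

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1 (assumed modulus for this sketch)

def make_hash_params(n, seed=0):
    """Draw coefficients (a, b) for n hash functions h(x) = (a*x + b) mod PRIME."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, PRIME, size=n, dtype=np.int64)
    b = rng.integers(0, PRIME, size=n, dtype=np.int64)
    return a, b

def minhash_signature(shingles, a, b):
    """Return, per hash function, the minimum hash value over a set of integer shingles."""
    if not shingles:
        return np.full(len(a), PRIME, dtype=np.int64)  # sentinel for empty documents
    # Map arbitrary (possibly negative) integers into [0, PRIME)
    x = np.array(list(shingles), dtype=np.int64) % PRIME
    # Shape (n, len(x)): every hash function applied to every shingle
    hashed = (a[:, None] * x[None, :] + b[:, None]) % PRIME
    return hashed.min(axis=1)

a, b = make_hash_params(n=100)
doc1 = {1, 2, 3, 4}
doc2 = {1, 2, 3, 5}
sig1, sig2 = minhash_signature(doc1, a, b), minhash_signature(doc2, a, b)
print(np.mean(sig1 == sig2))  # estimates the Jaccard similarity (true value: 3/5)
```

The coefficients are drawn once and reused for every document, so signatures stay comparable and no array of size "number of unique shingles" is needed.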
    

## LSH

Locality-sensitive hashing (LSH) means that we only compare small portions (bands) of the signatures: if at least one band is identical for two signatures, we consider the two documents a candidate pair. The more rows each band has, the more signature values need to agree for a match, which makes us pickier.
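
How picky a given choice of bands is can be quantified. If two signatures agree at any single position with probability `s`, then with `b` bands of `r` rows each they become a candidate pair with probability `1 - (1 - s^r)^b`. The sketch below evaluates this for `b=5` and `r=3`, the values used in the pipeline further down; the function name `candidate_probability` is an assumption for illustration.

```python
def candidate_probability(s, b, r):
    """Probability that two signatures with per-position agreement s share at least one band."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.5, 0.8, 1.0):
    print(f"s={s:.1f} -> P(candidate)={candidate_probability(s, b=5, r=3):.3f}")
```

Increasing `r` makes the filter stricter, while increasing `b` makes it more permissive.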

In [209]:
class LSH:
    def candidate_pairs(self, signatures:np.array, b, r):
        
        # Use a set to store unique pairs
        unique_pairs = set()
        
        # Add progress bar for band processing
        for i in tqdm(range(b), desc="Processing bands"):
            band = signatures[:, i*r:(i+1)*r]
            for j in range(len(band)):
                for k in range(j+1, len(band)):
                    if np.array_equal(band[j], band[k]):
                        unique_pairs.add((j, k))
        
        yield from unique_pairs
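
The nested loops above compare every pair of documents within every band, which is quadratic in the number of documents. A common way to avoid this is to hash each band into buckets and only pair up documents that share a bucket. The following is a minimal sketch under that assumption; `candidate_pairs_bucketed` is a hypothetical helper and not part of the `LSH` class.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

def candidate_pairs_bucketed(signatures, b, r):
    """Return pairs of row indices whose signatures agree on all r rows of at least one band."""
    pairs = set()
    for i in range(b):
        buckets = defaultdict(list)
        band = signatures[:, i * r:(i + 1) * r]
        for doc_id, rows in enumerate(band):
            # Documents with identical band values land in the same bucket
            buckets[rows.tobytes()].append(doc_id)
        for doc_ids in buckets.values():
            # doc_ids is in increasing order, so each pair is (smaller, larger)
            pairs.update(combinations(doc_ids, 2))
    return pairs

sigs_demo = np.array([[1, 2, 3, 4], [1, 2, 9, 9], [7, 8, 3, 4], [5, 5, 5, 5]])
print(candidate_pairs_bucketed(sigs_demo, b=2, r=2))
```

In the demo, documents 0 and 1 share the first band and documents 0 and 2 share the second, so both pairs are returned.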
        
    

## Similarity Check

The similarity measure used here is the Jaccard similarity: the size of the intersection of two sets divided by the size of their union.
How it is calculated depends on the representation of the sets.
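
As a small worked example of the two representations: on shingle sets the Jaccard similarity is computed directly, while on MinHash signatures it is estimated as the fraction of positions where the signatures agree. The function names below are assumptions for illustration, and treating two empty sets as similarity 0.0 is a choice made for this sketch.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets are treated as 0.0 here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def signature_similarity(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    return float(np.mean(np.asarray(sig_a) == np.asarray(sig_b)))

print(jaccard({"ab", "bc", "cd"}, {"bc", "cd", "de"}))      # 2 shared out of 4 distinct shingles
print(signature_similarity([3, 7, 1, 9], [3, 7, 2, 9]))     # 3 of 4 positions agree
```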

In [215]:
class CompareSets:
    def jaccard(self, a, b):
        union = a.union(b)
        # Avoid division by zero when both sets are empty
        return len(a.intersection(b)) / len(union) if union else 0.0

In [216]:

class CompareSignatures:
    def jaccard(self, a:np.array, b:np.array):
        # Only consider positions where at least one signature has a non-zero value
        non_zero_positions = (a != 0) | (b != 0)
        if not np.any(non_zero_positions):
            return 0.0
        return np.mean(a[non_zero_positions] == b[non_zero_positions])
    
    def mean(self, a:np.array, b:np.array):
        return np.mean(a == b)


In [214]:
import time

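N = 15  # signature length (number of permutations), N = B * R
B = 5   # number of LSH bands
R = 3   # rows per band
K = 5   # shingle size in tokens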

# Read documents
print("Starting pipeline...")
start = time.time()
docs = read_documents("data")
read_time = time.time() - start
print(f"✓ Reading documents: {read_time:.2f}s")

# Tokenization
start = time.time()
t = Tokenizer()
t.fit(docs)
tokenization_time = time.time() - start
print(f"✓ Tokenization: {tokenization_time:.2f}s")

# Shingling
start = time.time()
s = Shingling()
shingled_docs = [s.create_shingles(t.encode(doc), K) for doc in tqdm(docs, desc="Shingling")]
shingling_time = time.time() - start
print(f"✓ Shingling: {shingling_time:.2f}s")

# MinHashing
start = time.time()
mh = MinHashing()
mh.fit(shingled_docs)
signatures = [mh.signature(N, doc) for doc in tqdm(shingled_docs, desc="MinHashing")]
minhashing_time = time.time() - start
print(f"✓ MinHashing: {minhashing_time:.2f}s")

# LSH
start = time.time()
c = CompareSignatures()
sigs = np.array(signatures)
candidate_pairs = list(LSH().candidate_pairs(sigs, b=B, r=R))
lsh_time = time.time() - start
print(f"✓ LSH: {lsh_time:.2f}s")

total_time = read_time + tokenization_time + shingling_time + minhashing_time + lsh_time
print(f"\nTotal pipeline time: {total_time:.2f}s")

# Print similarity results
print("\nSimilarity Results:")
print(f"Number of candidate pairs: {len(candidate_pairs)}")
for x, y in candidate_pairs:
    print(f"\nCandidate pair: {x}, {y}")
    print(f"MinHash Jaccard similarity: {c.jaccard(signatures[x], signatures[y])}")
    print(f"Shingle Jaccard similarity: {CompareSets().jaccard(shingled_docs[x], shingled_docs[y])}")
    print(f"MinHash mean similarity: {c.mean(signatures[x], signatures[y])}")

Starting pipeline...
✓ Reading documents: 0.07s
✓ Tokenization: 0.05s


Shingling: 100%|██████████| 1000/1000 [00:00<00:00, 5374.00it/s]


✓ Shingling: 0.19s


MinHashing: 100%|██████████| 1000/1000 [02:37<00:00,  6.34it/s]


✓ MinHashing: 157.79s


Processing bands: 100%|██████████| 5/5 [00:03<00:00,  1.46it/s]

✓ LSH: 3.46s

Total pipeline time: 161.55s

Similarity Results:
Number of candidate pairs: 24

Candidate pair: 527, 574
MinHash Jaccard similarity: 1.0
Shingle Jaccard similarity: 0.9931972789115646

Candidate pair: 432, 486
MinHash Jaccard similarity: 1.0
Shingle Jaccard similarity: 1.0

Candidate pair: 111, 199
MinHash Jaccard similarity: 1.0
Shingle Jaccard similarity: 0.8365384615384616

Candidate pair: 277, 298
MinHash Jaccard similarity: 1.0
Shingle Jaccard similarity: 1.0

Candidate pair: 669, 697
MinHash Jaccard similarity: 0.6
Shingle Jaccard similarity: 0.3023255813953488

Candidate pair: 133, 162
MinHash Jaccard similarity: 1.0
Shingle Jaccard similarity: 1.0

Candidate pair: 724, 798
MinHash Jaccard similarity: 0.6666666666666666
Shingle Jaccard similarity: 0.6682242990654206

Candidate pair: 439, 479
MinHash Jaccard similarity: 0.2
Shingle Jaccard similarity: 0.35120753172329106

Candidate pair: 362, 366
MinHash Jaccard similarity: 1.0
Shingle Jaccard similarity: 1.0





In [194]:
def view_documents(docs, x, y, max_width=70):
    def wrap_text(text, width):
        # Split text into chunks of max_width characters
        return [text[i:i+width] for i in range(0, len(text), width)]
    
    doc1_lines = docs[x].split("\n")
    doc2_lines = docs[y].split("\n")
    
    # Wrap long lines
    doc1_wrapped = [line for text in doc1_lines for line in wrap_text(text, max_width)]
    doc2_wrapped = [line for text in doc2_lines for line in wrap_text(text, max_width)]
    
    print(f"\n{'='*100}\nComparing documents {x} and {y}\n{'='*100}")
    print(f"{'Document ' + str(x):<{max_width+5}} | {'Document ' + str(y)}")
    print(f"{'-'*(max_width+5)}-+-{'-'*max_width}")
    
    for line1, line2 in itertools.zip_longest(doc1_wrapped, doc2_wrapped, fillvalue=""):
        print(f"{line1:<{max_width+5}} | {line2}")
    print()

for x, y in candidate_pairs:
    view_documents(docs, x, y, max_width=50)



Comparing documents 527 and 574
Document 527                                            | Document 574
--------------------------------------------------------+---------------------------------------------------
The ‘Secret War’ in Laos                                | 3. The ‘Secret War’ in Laos
Laos is the most heavily-bombed country per capita      | Laos is the most heavily-bombed country per capita
 in the world. The U.S. bombing of Laos (1964-1973      |  in the world. The U.S. bombing of Laos (1964-1973
) was part of a clandestine attempt by the CIA to       | ) was part of a clandestine attempt by the CIA to 
wrest power from the Pathet Lao, a communist group      | wrest power from the Pathet Lao, a communist group
 allied with North Vietnam and the Soviet Union du      |  allied with North Vietnam and the Soviet Union du
ring the Vietnam War. Laos was critical to Dwight       | ring the Vietnam War. Laos was critical to Dwight 
D. Eisenhower’s Domino Theory of keeping commun

Let's investigate the MinHash similarities.

In [150]:
# Compute the pairwise signature similarities
pairs = list(itertools.combinations(range(len(signatures)), 2))
pair_similarities = [((x,y), c.jaccard(signatures[x], signatures[y])) for x, y in pairs]

pair_similarities.sort(key=lambda x: x[1], reverse=True)

pair_similarities[:10]



[((64, 77), np.float64(0.56)),
 ((46, 77), np.float64(0.52)),
 ((46, 64), np.float64(0.46)),
 ((69, 97), np.float64(0.38)),
 ((40, 49), np.float64(0.34)),
 ((76, 98), np.float64(0.34)),
 ((3, 12), np.float64(0.32)),
 ((56, 73), np.float64(0.32)),
 ((6, 67), np.float64(0.24)),
 ((61, 85), np.float64(0.24))]

Apparently, the LSH step had already found the most similar documents earlier: those with signature similarities higher than 0.5.