Dataset = set of documents
My focus is on showing the benefit of using LSH to sort the documents into buckets, i.e. how much faster it is than a naive approach (= compare all pairs of documents). The steps:
Documents -> shingle representation as high-dimensional vectors -> use these vectors to compute the Jaccard similarity -> go from high to low dimension by computing the MinHash (similarity by signature) -> use LSH from the Snapy library and compare efficiency.
I measure the running time of each step.

# Finding similar items

In this notebook we will focus on finding similar documents. We will start by computing the shingles of each document, then we will compute the MinHash signatures. For the last step, LSH, we will use a library.

We assume that the input is a file that contains, for each line, a document. The first word of each line is the identifier of the document. An example is:

```text
t980 A man was shot dead and fifteen others injured...
t1088 Russian Prime Minister Viktor Chernomyrdin on Thursday proposed...
t1233 Michael Johnson, who improved his own indoor 400m world record...
...
```

The other input will be the similarity threshold used to separate similar from non-similar items.

## Loading the data

The dataset is already pre-processed: each line is a whole document (an easier format to work with). The first word is the identifier of the doc, and the rest is the sequence of words of the doc on a single line.

In this specific case, we are going to read the line and separate the first word (identifier) from the rest of the line. We then return two lists, one with the identifiers, the other with the documents.

In [15]:
# Function to load the data
def load_data(filename):
    # Define two empty lists
    doc_names = [] # identifier of the doc -> needed at the end to display the results
    actual_doc = [] # text of the doc
    # Read the whole file and close it when done
    with open(filename, 'r') as f:
        raw_lines = f.read().splitlines()
    for line in raw_lines:
        # Split the line into words on single spaces
        words = line.split(" ")
  
        # The identifier of the doc is the first word 
        doc_ID = words[0]
        # Add doc ID to the list of document IDs
        doc_names.append(doc_ID)
        
        # Remove the first word, and build back the line to add to actual_doc
        del words[0]  
        filtered_line = " ".join(words)
        actual_doc.append(filtered_line)
        # Since append adds at the end, the order of the docs is maintained
    
    return doc_names, actual_doc

We provide two datasets:
- "2_articles_100.txt": a small dataset with 100 docs;
- "2_articles_1000.txt": a larger dataset with 1000 docs.

On Colab, remember to mount your Drive
```python
from google.colab import drive
drive.mount('/content/drive')
input_file = "/content/drive/My Drive/..."
```

Otherwise, simply load your chosen file (we start with the small one):

In [16]:
input_file = "./2_articles_100.txt" # small dataset of 100 docs (fast)

doc_names, docs_to_analyze = load_data(input_file)

## Computing the shingles

For the shingles, we treat each character as a token, and the size of the shingle is given as an input parameter. Each shingle is hashed into a 32-bit integer using the `hashlib` library, which must be imported.

I now focus on a single document (= a string containing several words).
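
To make the idea concrete, here is a minimal sketch of character shingling on a short string. The helpers `char_shingles` and `hash32` are hypothetical names used only for this illustration; the hashing mirrors the SHA-256 reduction modulo 2**32 used in this notebook, and the sketch assumes that every window of `k` characters, including the last one, is kept.

```python
import hashlib

def char_shingles(text, k):
    """Return the set of all k-character substrings of text (hypothetical helper)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hash32(shingle):
    """Map a shingle to a 32-bit integer: SHA-256 digest reduced modulo 2**32."""
    digest = hashlib.sha256(shingle.encode("utf-8")).hexdigest()
    return int(digest, 16) % 2**32

example = "abcdabd"
shingles = char_shingles(example, 3)
print(sorted(shingles))  # ['abc', 'abd', 'bcd', 'cda', 'dab']

# Each shingle becomes a number; the document is then represented by a set of numbers
hashed = {hash32(s) for s in shingles}
print(hashed)
```

A string of 7 characters has 5 windows of length 3, and the set keeps each distinct shingle only once.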


In [17]:
import time
import hashlib

# Define the number of characters per shingle (9 by default, it can be changed to tune the results) -> compute the hash of each shingle and get a number
# -> put the number in a set (a set keeps only distinct elements, that's why I use it)
# I also count the total number of shingles
def compute_shingles(doc_names, docs_to_analyze, shingle_lenght = 9): # usually 8/9 for long docs, 4/5 for shorter docs (in English 9 characters typically span 3 words: the end of one, a whole one, the start of the next)
    docs_shingle_sets = {} # dictionary: doc ID -> set of the numbers obtained by the hash function
    total_shingles = 0 # number of shingles in the whole dataset
    num_docs = len(docs_to_analyze) # number of documents
    for i in range(num_docs):
        # Consider one doc at a time
        doc = docs_to_analyze[i]
        
        # Set for all of the unique shingle IDs present in the current doc
        shingles_in_doc = set()

        # I create shingles by sliding a window over the doc
        # The start index runs over range(len(doc) - shingle_lenght) (ex. len = 100 and shingle_lenght = 9: first shingle from 0 to 8, second from 1 to 9, etc., last start index 90)
        # Note: the final window (start index 91 in the example) is skipped; range(len(doc) - shingle_lenght + 1) would include it
        for index in range(len(doc) - shingle_lenght):
            # Slicing works because the docs are strings
            shingle = doc[index: index + shingle_lenght] # select the substring

            # Hash the shingle to a 32-bit integer: translates the sequence of characters into a 32-bit integer
            crc = int(hashlib.sha256(shingle.encode('utf-8')).hexdigest(), 16) % 2**32
            # Add the hash to the set of shingles 
            shingles_in_doc.add(crc)
  
        # Store the completed set of shingles for this document in the dictionary
        doc_ID = doc_names[i]
        docs_shingle_sets[doc_ID] = shingles_in_doc
  
        # Count the number of shingle windows across all documents (duplicates included).
        total_shingles = total_shingles + (len(doc) - shingle_lenght)

    return docs_shingle_sets, total_shingles

We are now ready to compute the shingles, using a shingle length of 9 characters. This transforms each document from a string into a set of numbers that represents the document itself.

We also check how much time it takes.

In [18]:
shingle_lenght = 9
t0 = time.time()
docs_shingle_sets, total_shingles = compute_shingles(doc_names, docs_to_analyze, shingle_lenght)

print('Shingling ' + str(len(docs_to_analyze)) + ' docs took %.2f sec.\n' % (time.time() - t0))
 
print('Average shingles per doc: %.2f' % (total_shingles / len(docs_to_analyze))) # average shingles in a set

Shingling 100 docs took 0.29 sec.

Average shingles per doc: 1544.70


## Jaccard similarities using shingles

We use a naive computation where we compare each pair.
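
As a reminder, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal worked example on two small sets (the function name `jaccard` and the empty-set convention are assumptions made for this sketch only):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

s1 = {1, 2, 3, 4}
s2 = {3, 4, 5, 6}
# 2 shared elements (3 and 4) out of 6 distinct elements overall
print(jaccard(s1, s2))  # 0.3333333333333333
```

With a threshold of 0.6 these two sets would not be reported as similar.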

In [19]:
# Threshold = 2 documents are considered similar if their Jaccard similarity is at least x -> the shared shingles are at least x% of all the shingles of the two docs
def Jaccard_sim_naive(doc_names, docs_shingle_sets, j_threshold):
    similar_pairs = {}
    num_docs = len(doc_names)
    # I compare each doc with the others -> the first for loop goes through all documents
    for i in range(num_docs):
        # Retrieve the shingles for document i -> set
        s_i = docs_shingle_sets[doc_names[i]]
        # The second for loop goes through the documents after i (from i+1 to the end); the earlier pairs have already been compared
        for j in range(i+1, num_docs):
            # Retrieve the shingles for document j -> set
            s_j = docs_shingle_sets[doc_names[j]]
        
            # Compute the Jaccard similarity: since the shingles are sets, I use the methods .intersection and .union
            jaccard_sim = (len(s_i.intersection(s_j)) / len(s_i.union(s_j))) # size of the intersection / size of the union
            # if sim >= threshold, then the docs are similar -> I record the pair and the corresponding similarity
            if jaccard_sim >= j_threshold:
                similar_pairs[(doc_names[i], doc_names[j])] = jaccard_sim
    
    return similar_pairs

We now compute the similar pairs and see how much time it takes.

In [20]:
j_threshold = 0.6 #60%
t0 = time.time()

similar_pairs = Jaccard_sim_naive(doc_names, docs_shingle_sets, j_threshold) # I consider the shingles, not the whole doc!

print("Calculating all Jaccard Similarities took %.2f sec\n"% (time.time() - t0))

# print the similar documents
for pair in similar_pairs.keys():
    print(pair[0], "and", pair[1], "has similarity", similar_pairs[pair])

Calculating all Jaccard Similarities took 0.90 sec

t980 and t2023 has similarity 0.9840166782487839
t1088 and t5015 has similarity 0.9869366427171783
t1297 and t4638 has similarity 0.9849869451697127
t1768 and t5248 has similarity 0.9857050032488629
t1952 and t3495 has similarity 0.9825418994413407


This naive approach computes the similarity between ALL pairs of documents. It works, and it is not too slow only because the dataset is small (100 docs). With a bigger dataset, I have to change strategy!
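
To see why the naive approach does not scale, we can count the comparisons it performs: with n documents there are n(n-1)/2 pairs. A small sketch (`num_pairs` is a hypothetical helper; the 10000-document case is only an illustration, not one of the provided datasets):

```python
from math import comb

def num_pairs(n):
    """Number of unordered pairs among n documents: n * (n - 1) / 2."""
    return comb(n, 2)

for n in (100, 1000, 10000):
    print(n, "docs ->", num_pairs(n), "pairs")
# 100 docs -> 4950 pairs
# 1000 docs -> 499500 pairs
# 10000 docs -> 49995000 pairs
```

Multiplying the number of documents by 10 multiplies the number of comparisons by roughly 100.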

## Computing the MinHashes

We now transform each set of shingles into the corresponding MinHash (= create a signature) and compare the efficiency with the approach above. The signature is much smaller than the set of shingles (100 elements vs about 1500 elements).

The method below assumes that the shingle IDs are coded as 32-bit integers, so there are some hard-coded constants. A general method would take these constants as input.

At the end I check whether the transformation preserved the similarities, by comparing the new results with the ones above.
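
Before the full implementation, here is a minimal sketch of the idea on two toy sets whose Jaccard similarity is known to be 0.6. It is an illustration under stated assumptions, not the notebook's function: the helper names, the seed 0, the 200 hash functions and the toy sets are all arbitrary choices, and the linear hash functions only approximate random permutations.

```python
import random

def minhash_signature(items, coeffs, prime):
    """One minimum per hash function h(x) = (a*x + b) % prime."""
    return [min((a * x + b) % prime for x in items) for a, b in coeffs]

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

prime = 4294967311        # a prime above 2**32 - 1, as in this notebook
rng = random.Random(0)    # arbitrary seed, only for reproducibility

# 100 distinct random 32-bit IDs: 20 only in A, 60 shared, 20 only in B
universe = rng.sample(range(2**32), 100)
set_a = set(universe[:80])
set_b = set(universe[20:])
true_sim = len(set_a & set_b) / len(set_a | set_b)  # 60 / 100 = 0.6

# 200 hash functions, each defined by a random pair (a, b) with a != 0
coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(200)]
sig_a = minhash_signature(set_a, coeffs, prime)
sig_b = minhash_signature(set_b, coeffs, prime)

print("true Jaccard:", true_sim)
print("MinHash estimate:", estimate_similarity(sig_a, sig_b))
```

The estimate should land close to 0.6, although each set is now summarized by 200 numbers regardless of its size.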

In [21]:
import random
# Start from a document as a string -> create shingles from groups of characters (a hash function translates each shingle into a number) -> the
# set of numbers is seen as a binary vector -> apply a permutation to the binary vector -> the MinHash of a permutation is the index
# of the first 1 (the position of the zeros does not matter) -> repeat for the other permutations, 100 in total -> the signature is the sequence
# of these indexes, which is much smaller :)
# Applying a permutation is equivalent to applying a hash function h(x) to every shingle and keeping the smallest value
# for the signature -> repeat the process with a new hash function until the signature is full (100 times)

# In more detail: I translate the set of numbers (unordered) into a binary vector (the values of the set are the indexes of the 1s) -> apply a random
# permutation -> find the index of the first 1 -> record it in the signature (= MinHash).
def compute_MinHashes(doc_names, docs_shingle_sets, num_hashes, seed = 289386372):
    # Hard-coded constants
    max_shingle_ID = 2**32-1
    next_prime = 4294967311 # smallest prime number above 'max_shingle_ID'

    # Set the seed of the random number generator
    random.seed(seed)
    
    # PERMUTATION COMPUTED AS A HASH FUNCTION 
    # A permutation is easily simulated by a hash function with coefficients 'a' and 'b'
    # The random hash function takes the form:
    #   h(x) = (a*x + b) % c
    # where 'x' is the input value, 'a' and 'b' are random coefficients, and 'c' is a prime number greater than max_shingle_ID.
    # Here 'a' and 'b' are lists with num_hashes values each (100 in this notebook), one pair per hash function

    # We draw the coefficients: "random.sample(population, k)" returns k distinct elements
    # chosen at random from the population
    coeffA = random.sample(range(max_shingle_ID), num_hashes) # num_hashes distinct random values for 'a'
    coeffB = random.sample(range(max_shingle_ID), num_hashes) # num_hashes distinct random values for 'b'
    
    # Rather than generating a random permutation of all possible shingles, 
    # we'll just hash the IDs of the shingles that are *actually in the document*,
    # then take the lowest resulting hash code value. This corresponds to the index 
    # of the first shingle that you would have encountered in the random order.

    all_signatures = {}
    
    # For loop over the documents -> focus on one document at a time
    for doc_ID in doc_names:
        # Get the shingle set for this document
        shingle_set = docs_shingle_sets[doc_ID]
  
        # The resulting minhash signature for this document. 
        signature = []

        # For each hash function 'i' (the parameters 'a' and 'b' change at every iteration) I hash all the shingles of the set
        for i in range(num_hashes):
            # For each of the shingles actually in the document, calculate its hash code
            # using hash function 'i'. 
    
            # Track the lowest hash code seen. Initialize 'min_hash_code' to be greater than
            # the maximum possible value output by the hash.
            min_hash_code = next_prime + 1
    
            # Hashing changes the order of the shingles (= permutation): I track the minimum hash code, which goes into the signature
            # shingle_ID is a number
            for shingle_ID in shingle_set: # for each shingles in the shingle set
                hash_code = (coeffA[i] * shingle_ID + coeffB[i]) % next_prime # shingle_ID is a number
                if hash_code < min_hash_code:
                    # I keep track of the min value of hash code
                    min_hash_code = hash_code

            # At the end, add the smallest hash code value as component number 'i' of the signature.
            signature.append(min_hash_code)
  
        # Store the MinHash signature for this document.
        all_signatures[doc_ID] = signature
        
    return all_signatures

Let's compute the MinHashes and see how much time it takes.

In [22]:
num_hashes = 100

t0 = time.time()
docs_minhash = compute_MinHashes(doc_names, docs_shingle_sets, num_hashes)

# It's a bit slow because, for every hash function, I go through all the shingles of all documents
print('Generating MinHash signatures took %.2f sec\n' % (time.time() - t0))

Generating MinHash signatures took 6.58 sec



## Jaccard similarities using MinHash

We compute the Jaccard similarities between pairs with the naive method, but using the signatures instead of the whole sets of shingles.
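
The fraction of matching signature components is only an estimate of the Jaccard similarity. Under the idealized assumption that each component matches independently with probability equal to the true similarity J, the number of matches is binomial and the standard error of the estimate is sqrt(J(1-J)/k) for k hash functions. The sketch below (`minhash_std_error` is a hypothetical helper) gives a rough idea of the precision to expect with 100 hash functions; the linear hash functions used here only approximate this idealized model.

```python
from math import sqrt

def minhash_std_error(jaccard, num_hashes):
    """Standard error of the MinHash estimate under an idealized binomial model."""
    return sqrt(jaccard * (1.0 - jaccard) / num_hashes)

for j in (0.6, 0.98):
    print(j, round(minhash_std_error(j, 100), 3))
# 0.6 0.049
# 0.98 0.014
```

So for very similar pairs, a deviation of one or two hundredths from the exact value is within the expected range.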

In [23]:
# I estimate the Jaccard sim using the MinHash: I compare only the signatures and not the whole shingle sets
def Jaccard_sim_minhash_naive(doc_names, docs_minhash, j_threshold):
    similar_pairs = {}
    num_docs = len(doc_names)
    for i in range(num_docs):
        # Signature of document i
        s_i = docs_minhash[doc_names[i]]
        num_hashes = len(s_i)
        for j in range(i+1, num_docs):
            # Signature of document j
            s_j = docs_minhash[doc_names[j]]

            # Count the positions where the two signatures agree
            count = 0
            for k in range(num_hashes):
                count = count + (s_i[k] == s_j[k])

            # The fraction of matching positions estimates the Jaccard similarity
            jaccard_sim = (count / num_hashes)
            if jaccard_sim >= j_threshold:
                similar_pairs[(doc_names[i],doc_names[j])] = jaccard_sim
    
    return similar_pairs

We now compute the similar pairs with the MinHash and see how much time it takes.

In [33]:
j_threshold = 0.6
t0 = time.time()

similar_pairs_minhash = Jaccard_sim_minhash_naive(doc_names, docs_minhash, j_threshold)

print("Calculating all Jaccard Similarities with MinHash took %.2f sec\n"% (time.time() - t0))

# The results are close to the ones above (small differences are expected, since the signatures only estimate the Jaccard similarity)
for pair in similar_pairs_minhash.keys():
    print(pair[0], "and", pair[1], "has similarity", similar_pairs_minhash[pair])
# It's much faster than comparing the shingle sets (0.13 sec vs 0.90 sec)!

Calculating all Jaccard Similarities with MinHash took 0.13 sec

t980 and t2023 has similarity 0.99
t1088 and t5015 has similarity 1.0
t1297 and t4638 has similarity 0.97
t1768 and t5248 has similarity 0.99
t1952 and t3495 has similarity 0.98


The similarity is now estimated from the signatures -> I get the same pairs with nearly the same values, so the similarity properties are maintained.

The computation of the signatures is slow (6.58 sec), but it's done only once! If I introduce a new doc, I compute just its signature, and the comparison with the other docs is fast.
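
Adding a new document this way only works if its signature is built with the same hash functions as the existing ones. Since `compute_MinHashes` draws its coefficients internally, one option is to draw them once and keep them. The sketch below illustrates this under stated assumptions: `make_coefficients` and `sign_document` are hypothetical helpers, the seed 42 and the toy shingle sets are arbitrary, the coefficients are not the ones drawn by `compute_MinHashes`, and shingle sets are assumed to be non-empty.

```python
import random

NEXT_PRIME = 4294967311  # same prime as in this notebook

def make_coefficients(num_hashes, seed):
    """Draw the (a, b) coefficient lists once so that they can be reused later."""
    rng = random.Random(seed)
    max_shingle_id = 2**32 - 1
    coeff_a = rng.sample(range(1, max_shingle_id), num_hashes)
    coeff_b = rng.sample(range(max_shingle_id), num_hashes)
    return coeff_a, coeff_b

def sign_document(shingle_set, coeff_a, coeff_b):
    """MinHash signature of one non-empty set of integer shingle IDs."""
    return [min((a * s + b) % NEXT_PRIME for s in shingle_set)
            for a, b in zip(coeff_a, coeff_b)]

coeff_a, coeff_b = make_coefficients(num_hashes=100, seed=42)

# Signatures of the documents already in the collection (toy shingle sets)
signatures = {"doc1": sign_document({10, 20, 30}, coeff_a, coeff_b)}

# A new document arrives: only its own signature has to be computed
signatures["doc2"] = sign_document({10, 20, 40}, coeff_a, coeff_b)
print(len(signatures["doc2"]))  # 100
```

The existing signatures are left untouched, so the cost of adding a document is one pass over its own shingles per hash function.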

## Using a LSH library

Since LSH is a useful tool, there are libraries that implement it, so that one can simply import the library and call its functions.

One example is the MinHash and LSH implementation from the SNAPY library:  
https://pypi.org/project/snapy/

In order to use it, we need to install the library, if not already done:

```python
pip install snapy
```

### Question  Q1
<div class="alert alert-info">
Using the MinHash and LSH implementations from the SNAPY library (see the documentation at the link provided above), compute the MinHash and LSH of the dataset used so far (2_articles_100.txt).
</div>

Documents are grouped together into buckets by LSH -> only docs in the same bucket are compared
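
To illustrate the bucketing idea independently of any library, here is a minimal sketch of the standard banding technique: each signature is split into bands of `rows` values, every band is used as a bucket key, and two docs become a candidate pair if they fall into the same bucket for at least one band. This is not the SNAPY implementation; the function names, the toy signatures and the choice of 20 bands of 5 rows are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows):
    """Return the pairs of doc IDs whose signatures agree on at least one full band.

    Assumes every signature has at least bands * rows components.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        start = band * rows
        for doc_id, sig in signatures.items():
            # The slice of the signature for this band is the bucket key
            key = tuple(sig[start:start + rows])
            buckets[key].append(doc_id)
        # Docs sharing a bucket in this band become candidate pairs
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates

def candidate_probability(similarity, bands, rows):
    """Probability that two docs with the given similarity share at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands

toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 9, 9, 9],  # agrees with "a" on the first band only
    "c": [7, 7, 7, 8, 8, 8],  # agrees with nobody
}
print(lsh_candidate_pairs(toy, bands=2, rows=3))  # {('a', 'b')}

# Chance of becoming a candidate pair at similarity 0.6 with 20 bands of 5 rows
print(candidate_probability(0.6, bands=20, rows=5))
```

Only the candidate pairs need an explicit similarity check afterwards, which is where the saving over the all-pairs comparison comes from. Note that candidates may include pairs below the threshold, so a final check on the signatures or on the shingle sets is still useful.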

In [27]:
import snapy
print(dir(snapy))
# permutations = size of the MinHash signature

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [28]:
help(snapy)

Help on package snapy:

NAME
    snapy

PACKAGE CONTENTS
    planet_gravity

FILE
    c:\users\330s-15ikb-w3ix\appdata\local\programs\python\python310\lib\site-packages\snapy\__init__.py




In [32]:
import snapy

content = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass '
    'being helium',
    'Jupiter moving out of the inner Solar System would have allowed the '
    'formation of inner planets.',
    'A helium atom has about four times as much mass as a hydrogen atom, so '
    'the composition changes when described as the proportion of mass '
    'contributed by different atoms.',
    'Jupiter is primarily composed of hydrogen and a quarter of its mass '
    'being helium',
    'A helium atom has about four times as much mass as a hydrogen atom and '
    'the composition changes when described as a proportion of mass '
    'contributed by different atoms.',
    'Theoretical models indicate that if Jupiter had much more mass than it '
    'does at present, it would shrink.',
    'This process causes Jupiter to shrink by about 2 cm each year.',
    'Jupiter is mostly composed of hydrogen with a quarter of its mass '
    'being helium',
    'The Great Red Spot is large enough to accommodate Earth within its '
    'boundaries.'
]

labels = [1, 2, 3, 4, 5, 6, 7, 8, 9]
seed = 3


# Create MinHash object.
minhash = snapy.MinHash(content, n_gram=9, permutations=100, hash_bits=64, seed=3)

AttributeError: module 'snapy' has no attribute 'MinHash'

The `help(snapy)` output above lists `planet_gravity` as the only package content, so the installed `snapy` package does not expose a `MinHash` class, which is why this call fails. The installed package is apparently not the one described in the documentation linked above.

### Question  Q2
<div class="alert alert-info">
Compute the running time for the operations in Q1.
</div>

In [None]:
# your answer here

### Question  Q3
<div class="alert alert-info">
Print the similar pairs with Jaccard similarity at least 0.6 and compare the result with the one obtained before. Why is the list longer?
</div>

In [None]:
# your answer here

### Question  Q4
<div class="alert alert-info">
Repeat the computation with the larger dataset (2_articles_1000.txt).
</div>

In [None]:
# your answer here

---
This notebook has been inspired by:  
- McCormick, C. (2015, June 12). MinHash Tutorial with Python Code. Retrieved from http://www.mccormickml.com