# Finding similar items

In this notebook we focus on finding similar documents. We start by computing the shingles of each document, then derive the MinHash signatures. For the last step, LSH, we use a library.

We assume that the input is a file that contains, for each line, a document. The first word of each line is the identifier of the document. An example is:

```text
t980 A man was shot dead and fifteen others injured...
t1088 Russian Prime Minister Viktor Chernomyrdin on Thursday proposed...
t1233 Michael Johnson, who improved his own indoor 400m world record...
...
```

The other input is the similarity threshold used to distinguish similar from non-similar items.

## Loading the data

In this specific case, we are going to read the line and separate the first word (identifier) from the rest of the line. We then return two lists, one with the identifiers, the other with the documents.
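
As a minimal sketch of the split described above, applied to a single line held in memory: the sample line is abbreviated from the example input, and `str.partition` is an assumption of this sketch rather than the approach used in the cell that follows.

```python
# Hypothetical one-line input in the "<id> <text>" format described above.
line = "t980 A man was shot dead and fifteen others injured"

# partition splits on the first space only: (before, separator, after).
doc_id, _, text = line.partition(" ")

print(doc_id)  # t980
print(text)    # A man was shot dead and fifteen others injured
```

Splitting only on the first space leaves the rest of the document untouched, so no re-joining step is needed.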

In [1]:
def load_data(filename):
    doc_names = []
    actual_doc = []
    # Use a context manager so the file is closed after reading
    with open(filename, 'r') as f:
        raw_lines = f.read().splitlines()
    for line in raw_lines:
        # Split the line into words
        words = line.split(" ") 
  
        # The doc ID is the first word 
        doc_ID = words[0]
  
        # Add doc ID to the list of document IDs
        doc_names.append(doc_ID)
        
        # Remove the first word, and build back the line
        del words[0]  
        filtered_line = " ".join(words)
        actual_doc.append(filtered_line)
    
    return doc_names, actual_doc

We provide two datasets:
- "2-articles_100.txt": a small dataset with 100 docs;
- "2-articles_1000.txt": a larger dataset with 1000 docs.

On Colab, remember to mount your Drive:
```python
from google.colab import drive
drive.mount('/content/drive')
input_file = "/content/drive/My Drive/..."
```

Otherwise, simply load your chosen file (we start with the small one):

In [2]:
input_file = "./2-articles_100.txt"

doc_names, docs_to_analyze = load_data(input_file)

## Computing the shingles

For the shingles, we treat each character as a token, and the shingle size is given as an input parameter. Each shingle is hashed to a 32-bit integer: we compute its SHA-256 digest with Python's `hashlib` module, which must be imported, and reduce the result modulo 2**32.
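
To make the idea concrete, here is a small sketch on a toy string. The helper names `char_shingles` and `hash32` are hypothetical and exist only for this illustration. Note that this sketch enumerates all `len(text) - k + 1` windows, which is one more than a loop over `range(len(text) - k)` would visit.

```python
import hashlib

def char_shingles(text, k):
    """Return the set of all length-k character windows of `text`."""
    # The last window starts at index len(text) - k.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hash32(shingle):
    """Map a shingle to an integer in [0, 2**32) via SHA-256."""
    digest = hashlib.sha256(shingle.encode("utf-8")).hexdigest()
    return int(digest, 16) % 2**32

shingles = char_shingles("abcdab", 3)
print(sorted(shingles))  # ['abc', 'bcd', 'cda', 'dab']

# Each shingle becomes one 32-bit integer ID.
hashed = {hash32(s) for s in shingles}
```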


In [4]:
import time
import hashlib

def compute_shingles(doc_names, docs_to_analyze, shingle_lenght=9):
    docs_shingle_sets = {}
    total_shingles = 0
    num_docs = len(docs_to_analyze)
    for i in range(num_docs):
        # Consider one doc at a time
        doc = docs_to_analyze[i]
        
        # Set for all of the unique shingle IDs present in the current doc
        shingles_in_doc = set()

        for index in range(len(doc) - shingle_lenght):
            shingle = doc[index:index+shingle_lenght]
            # Hash the shingle with SHA-256 and keep the value modulo 2**32 (a 32-bit integer).
            crc = int(hashlib.sha256(shingle.encode('utf-8')).hexdigest(), 16) % 2**32

            # Add the hash to the set of shingles 
            shingles_in_doc.add(crc)

        # Store the completed set of shingles for this document in the dictionary
        doc_ID = doc_names[i]
        docs_shingle_sets[doc_ID] = shingles_in_doc
  
        # Count the number of shingles across all documents.
        total_shingles = total_shingles + (len(doc) - shingle_lenght)

    return docs_shingle_sets, total_shingles

We are now ready to compute the shingles. We use a shingle length of 9 characters and also measure how long the computation takes.

In [5]:
shingle_lenght = 9
t0 = time.time()
docs_shingle_sets, total_shingles = compute_shingles(doc_names, 
                                                     docs_to_analyze, 
                                                     shingle_lenght)

print('Shingling ' + str(len(docs_to_analyze)) + ' docs took %.2f sec.\n' % (time.time() - t0))
 
print('Average shingles per doc: %.2f' % (total_shingles / len(docs_to_analyze)))

Shingling 100 docs took 0.76 sec.

Average shingles per doc: 1544.70


## Jaccard similarities using shingles

We use a naive computation where we compare each pair.
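
As a reminder of what is being computed, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch on two toy sets follows; the function name `jaccard` and the convention for two empty sets are assumptions of this example.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    # Two empty sets have nothing to compare; return 0.0 by convention here.
    if not union:
        return 0.0
    return len(a & b) / len(union)

s1 = {1, 2, 3, 4}
s2 = {3, 4, 5, 6}
print(jaccard(s1, s2))  # 2 shared elements out of 6 distinct ones

# Number of pairs the naive all-pairs comparison visits for n documents.
n = 100
print(n * (n - 1) // 2)  # 4950
```

The pair count grows quadratically: 1000 documents already give 499500 pairs.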

In [6]:
def Jaccard_sim_naive(doc_names, docs_shingle_sets, j_threshold):
    similar_pairs = {}
    num_docs = len(doc_names)
    for i in range(num_docs):
        # Shingles for document i
        s_i = docs_shingle_sets[doc_names[i]]
        for j in range(i+1, num_docs):
            # Shingles for document j
            s_j = docs_shingle_sets[doc_names[j]]
        
            jaccard_sim = (len(s_i.intersection(s_j)) / len(s_i.union(s_j)))
            if jaccard_sim >= j_threshold:
                similar_pairs[(doc_names[i],doc_names[j])] = jaccard_sim
    
    return similar_pairs

We now compute the similar pairs and see how much time it takes.

In [7]:
j_threshold = 0.6
t0 = time.time()

similar_pairs = Jaccard_sim_naive(doc_names, docs_shingle_sets, j_threshold)

print("Calculating all Jaccard Similarities took %.2f sec\n"% (time.time() - t0))

for pair in similar_pairs.keys():
    print(pair[0], "and", pair[1], "has similarity", similar_pairs[pair])

Calculating all Jaccard Similarities took 1.73 sec

t980 and t2023 has similarity 0.9840166782487839
t1088 and t5015 has similarity 0.9869366427171783
t1297 and t4638 has similarity 0.9849869451697127
t1768 and t5248 has similarity 0.9857050032488629
t1952 and t3495 has similarity 0.9825418994413407


## Computing the MinHashes

We now transform each document's set of shingles into the corresponding MinHash signature.

The method below assumes that the shingle IDs are coded as 32-bit integers, so it contains some hard-coded constants. A more general method would take these constants as input.
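
Before the full implementation, a compact sketch of the core idea on a toy set may help: each hash function has the form h(x) = (a*x + b) % c, and one signature component is the minimum of h over the set. The function name `minhash_signature`, the seed, and the choice to draw `a` from a range that excludes zero are assumptions of this sketch.

```python
import random

def minhash_signature(shingle_ids, coeffs, prime):
    """One component per (a, b) pair: the minimum of (a*x + b) % prime over the set."""
    # Assumes a non-empty set; min() raises ValueError on an empty one.
    return [min((a * x + b) % prime for x in shingle_ids) for a, b in coeffs]

PRIME = 4294967311  # the prime used in this notebook, just above 2**32
rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility

# 'a' is drawn from [1, PRIME) so that no hash function is constant.
coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(5)]

sig = minhash_signature({10, 20, 30}, coeffs, PRIME)
print(len(sig))  # 5, one component per hash function
```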

In [8]:
import random

def compute_MinHases(doc_names, docs_shingle_sets, num_hashes, seed = 289386372):
    # Hard-coded constants
    max_shingle_ID = 2**32-1
    next_prime = 4294967311 # smallest prime number above 'max_shingle_ID'

    # Set the seed in the random number generator
    random.seed(seed)
    
    # The random hash function will take the form of:
    #   h(x) = (a*x + b) % c
    # Where 'x' is the input value, 'a' and 'b' are random coefficients, 
    # and 'c' is a prime number greater than max_shingle_ID.

    # We compute the coefficients: "random.sample(population, k)" returns k distinct
    # elements chosen at random from the population, so no coefficient is repeated.
    coeffA = random.sample(range(max_shingle_ID), num_hashes)
    coeffB = random.sample(range(max_shingle_ID), num_hashes)
    
    # Rather than generating a random permutation of all possible shingles, 
    # we'll just hash the IDs of the shingles that are *actually in the document*,
    # then take the lowest resulting hash code value. This corresponds to the index 
    # of the first shingle that you would have encountered in the random order.

    all_signatures = {}
    
    for doc_ID in doc_names:
        # Get the shingle set for this document.
        shingle_set = docs_shingle_sets[doc_ID]
  
        # The resulting minhash signature for this document. 
        signature = []
  
        for i in range(num_hashes):
            # For each of the shingles actually in the document, calculate its hash code
            # using hash function 'i'. 
    
            # Track the lowest hash code seen. Initialize 'min_hash_code' to be greater than
            # the maximum possible value output by the hash.
            min_hash_code = next_prime + 1
    
            for shingle_ID in shingle_set:
                hash_code = (coeffA[i] * shingle_ID + coeffB[i]) % next_prime 
                if hash_code < min_hash_code:
                    min_hash_code = hash_code

            # Add the smallest hash code value as component number 'i' of the signature.
            signature.append(min_hash_code)
  
        # Store the MinHash signature for this document.
        all_signatures[doc_ID] = signature
        
    return all_signatures

Let's compute the MinHashes and see how much time it takes.

In [9]:
num_hashes = 100

t0 = time.time()
docs_minhash = compute_MinHases(doc_names, docs_shingle_sets, num_hashes)

print('Generating MinHash signatures took %.2f sec\n' % (time.time() - t0))

Generating MinHash signatures took 8.89 sec



## Jaccard similarities using MinHash

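We estimate the Jaccard similarity of each pair with the same naive all-pairs method, but using the signatures instead of the full sets of shingles: the estimate is the fraction of signature components on which the two documents agree.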
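
The comparison of two signatures can be sketched in a few lines. The function name `estimate_jaccard` and the toy signatures are hypothetical and only illustrate the counting step.

```python
def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    # Assumes non-empty signatures; an empty one would divide by zero.
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Hypothetical 4-component signatures that agree in 3 positions.
print(estimate_jaccard([7, 2, 9, 4], [7, 2, 9, 5]))  # 0.75
```

With 100 hash functions the estimate can only take values that are multiples of 0.01, so it will generally differ slightly from the exact Jaccard similarity.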

In [10]:
def Jaccard_sim_minhash_naive(doc_names, docs_minhash, j_threshold):
    similar_pairs = {}
    num_docs = len(doc_names)
    for i in range(num_docs):
        # MinHash signature for document i
        s_i = docs_minhash[doc_names[i]]
        num_hashes = len(s_i)
        for j in range(i+1, num_docs):
            # MinHash signature for document j
            s_j = docs_minhash[doc_names[j]]

            count = 0
            for k in range(num_hashes):
                count = count + (s_i[k] == s_j[k])

            jaccard_sim = (count / num_hashes)
            if jaccard_sim >= j_threshold:
                similar_pairs[(doc_names[i],doc_names[j])] = jaccard_sim
    
    return similar_pairs

We now compute the similar pairs with the MinHash signatures and see how much time it takes.

In [11]:
j_threshold = 0.6
t0 = time.time()

similar_pairs_minhash = Jaccard_sim_minhash_naive(doc_names, docs_minhash, j_threshold)

print("Calculating all Jaccard Similarities with MinHash took %.2f sec\n"% (time.time() - t0))

for pair in similar_pairs_minhash.keys():
    print(pair[0], "and", pair[1], "has similarity", similar_pairs_minhash[pair])

Calculating all Jaccard Similarities with MinHash took 0.13 sec

t980 and t2023 has similarity 0.99
t1088 and t5015 has similarity 1.0
t1297 and t4638 has similarity 0.97
t1768 and t5248 has similarity 0.99
t1952 and t3495 has similarity 0.98


## Using an LSH library

Since LSH is a useful tool, there are libraries that implement it, so one can simply import the library and call its functions.

One example is the MinHash and the LSH implementation from the SNAPY library:  
https://pypi.org/project/snapy/

To use it, we first need to install the library, if it is not already installed:

```python
pip install snapy
```
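
To give an idea of what such a library does internally, here is a from-scratch sketch of the banding technique commonly used for LSH over MinHash signatures. It is not the SNAPY API: the function `lsh_candidates`, its parameters, and the toy signatures are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands, rows):
    """Return pairs of doc IDs whose signatures agree on at least one full band.

    signatures: dict mapping doc ID -> list of bands * rows integers.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        start = band * rows
        for doc_id, sig in signatures.items():
            # Docs with identical values in this band land in the same bucket.
            buckets[tuple(sig[start:start + rows])].append(doc_id)
        for doc_ids in buckets.values():
            for pair in combinations(sorted(doc_ids), 2):
                candidates.add(pair)
    return candidates

# Hypothetical 4-component signatures, split into 2 bands of 2 rows.
toy = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 9, 9],  # shares band 0 with d1
    "d3": [5, 6, 7, 8],  # shares no band with any other doc
}
print(lsh_candidates(toy, bands=2, rows=2))  # {('d1', 'd2')}
```

Only documents that collide in some bucket are ever compared, which avoids the all-pairs loop. The candidate pairs would then typically be checked against the similarity threshold.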

### Question  Q1
<div class="alert alert-info">
Using the MinHash and the LSH implementation from the SNAPY library (see the documentation on the link provided above) compute the minhash and LSH of the dataset used so far (2-articles_100.txt).
</div>

In [None]:
# your answer here

### Question  Q2
<div class="alert alert-info">
Compute the running time for the operations in Q1.
</div>

In [None]:
# your answer here

### Question  Q3
<div class="alert alert-info">
Print the similar pairs with Jaccard similarity of at least 0.6 and compare the result with the one obtained before. Why is the list longer?
</div>

In [None]:
# your answer here

### Question  Q4
<div class="alert alert-info">
Repeat the computation with the larger dataset (2-articles_1000.txt).
</div>

In [None]:
# your answer here

---
This notebook has been inspired by:  
- McCormick, C. (2015, June 12). MinHash Tutorial with Python Code. Retrieved from http://www.mccormickml.com