# ID2222 - Homework 1

You are to implement the stages of finding textually similar documents based on Jaccard similarity, using the shingling, minhashing, and locality-sensitive hashing (LSH) techniques and their corresponding algorithms. The implementation can use any big data processing framework, such as Apache Spark or Apache Flink, or no framework at all, e.g., plain Java or Python. To test and evaluate your implementation, write a program that uses it to find similar documents in a corpus of 5-10 or more documents, such as web pages or emails.

The stages should be implemented as a collection of classes, modules, functions, or procedures, depending on the framework and the language of your choice. Below, we describe sample classes that implement the different stages of finding textually similar documents. You do not have to develop the exact same classes and data types as described below. Feel free to use the data structures that suit you best.

1. A class Shingling that constructs k-shingles of a given length k (e.g., 10) from a given document, computes a hash value for each unique shingle, and represents the document as an ordered set of its hashed k-shingles.
2. A class CompareSets that computes the Jaccard similarity of two sets of integers – two sets of hashed shingles.
3. A class MinHashing that builds a minhash signature (in the form of a vector or a set) of a given length n from a given set of integers (a set of hashed shingles).
4. A class CompareSignatures that estimates the similarity of two integer vectors – minhash signatures – as the fraction of components in which they agree.
5. (Optional task for 2 extra bonus points) A class LSH that implements the LSH technique: given a collection of minhash signatures (integer vectors) and a similarity threshold t, the LSH class (using banding and hashing) finds all candidate pairs of signatures that agree on at least a fraction t of their components.
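
As a minimal illustration of stages 1 and 2, the sketch below builds character k-shingles for two short strings and computes their Jaccard similarity. The helper names `char_shingles` and `jaccard` are hypothetical, and the shingles are left unhashed for readability.

```python
def char_shingles(text, k):
    """Return the set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s1 = char_shingles("abcdab", 2)   # {'ab', 'bc', 'cd', 'da'}
s2 = char_shingles("abcde", 2)    # {'ab', 'bc', 'cd', 'de'}
print(jaccard(s1, s2))            # 3 shared shingles out of 5 distinct ones: 0.6
```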

## Scalability
To test and evaluate the scalability of your implementation (the execution time versus the size of the input dataset), write a program that uses your classes to find similar documents in a corpus of 5-10 documents. Choose a similarity threshold s (e.g., 0.8): two documents are considered similar if the Jaccard similarity of their shingle sets is at least s.
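
One way to measure execution time against input size is to time the same pipeline on growing prefixes of the corpus. The sketch below is only an assumption about how such a harness might look: `time_pipeline` and `toy_pipeline` are hypothetical names, the corpus is synthetic, and the toy pipeline stands in for the real shingling and comparison stages.

```python
import time
from itertools import combinations


def toy_pipeline(docs, k=5):
    """Placeholder pipeline: character shingling plus all-pairs Jaccard similarity."""
    shingles = {name: {text[i:i + k] for i in range(len(text) - k + 1)}
                for name, text in docs.items()}
    sims = {}
    for a, b in combinations(shingles, 2):
        union = shingles[a] | shingles[b]
        sims[(a, b)] = len(shingles[a] & shingles[b]) / len(union) if union else 0.0
    return sims


def time_pipeline(docs, pipeline, sizes):
    """Run pipeline on the first n documents for each n in sizes; return (n, seconds) pairs."""
    names = list(docs)
    timings = []
    for n in sizes:
        subset = {name: docs[name] for name in names[:n]}
        t0 = time.perf_counter()
        pipeline(subset)
        timings.append((n, time.perf_counter() - t0))
    return timings


# Synthetic corpus, used only to make the sketch runnable
corpus = {"doc%d" % i: "the quick brown fox jumps over the lazy dog " * (50 + i)
          for i in range(10)}
for n, seconds in time_pipeline(corpus, toy_pipeline, sizes=[2, 5, 10]):
    print("%2d docs: %.4f sec" % (n, seconds))
```

Plotting the resulting (n, seconds) pairs shows how the running time grows with the number of documents.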


In [1]:
import json

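def load_and_prepare_data(path):
    """
    Load the JSON file at `path` and map each entry's title to the
    context of its first paragraph.
    """
    # Open the file passed in by the caller rather than a hard-coded name
    with open(path) as f:
        data = json.load(f)

    characters = {}
    for article in data['data']:
        characters[article['title']] = article['paragraphs'][0]['context']

    return characters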

In [2]:
import os
import string

def load_business():
    path = "bbc-full-text-document-classification/bbc/business"
    file_list = os.listdir(path)
    
    data = {}
    
    for file in file_list:
        with open(os.path.join(path, file)) as f:
            data[file.split(".")[0]] = f.read().replace('\n', ' ').lower()
    return data

def load_conrad():
    path = "conradbooks"
    file_list = os.listdir(path)
    
    data = {}
    
    for file in file_list:
        with open(os.path.join(path, file)) as f:
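            # Strip every punctuation character; str.replace(string.punctuation, "") would only
            # match the whole punctuation string as one literal substring.
            text = f.read().replace('\n', ' ').lower()
            data[file.split(".")[0]] = text.translate(str.maketrans("", "", string.punctuation))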
            
    return data

In [3]:
dataset = load_conrad()

In [4]:
#dataset = load_and_prepare_data("train-v2.0.json")

In [5]:
import binascii
import time

class Shingling:
    def __init__(self, k):
        self.k = k
        self.docs_shingles = {}
        self.doc_names = []
        
    def _clean(self, doc):
        """
        Some rules for cleaning the text:
        https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
        """
        doc = doc.lower()
        doc = doc.replace(" ", "_")
        return doc
    
    def _tokenize(self, doc):
        """
        Construct the shingles based on k-characters
        """
        # Slide a window of length k over the text; documents shorter than k yield no shingles
        return {self._hash(doc[idx:idx + self.k]) for idx in range(len(doc) - self.k + 1)}

    def _hash(self, shingle):
        """
        Compute hash values for the shingle
        """
        return binascii.crc32(shingle.encode("utf-8")) & 0xffffffff
    
    def generate_shingles(self, doc):
        doc = self._clean(doc)
        shingles = self._tokenize(doc)
        
        return shingles  
    
    def generate_shingles_for_docs(self, docs):
        """
        Takes in docs in the form of a dict of {"docID": "doc string"}
        """
        print("Shingling {} articles...".format(len(docs)))

        t0 = time.time()
        for k, v in docs.items():
            self.doc_names.append(k)
            d = self._clean(v)
            d = self._tokenize(d)
    
            self.docs_shingles[k] = d
    
        print ('\nShingling took %.2f sec.' % (time.time() - t0))

    @staticmethod
    def compare_sets(s1, s2):
        """
        Compute Jaccard Similarity
        n(intersection) / n(union)
        """
        # An empty set has no shingles to compare, so report zero similarity
        if not s1 or not s2:
            print("Warning: at least one of the two sets is empty\n")
            return 0
        else:
            jacc_sim = (len(s1.intersection(s2)) / float(len(s1.union(s2))))
            return jacc_sim

In [6]:
shing = Shingling(7)

In [7]:
shing.generate_shingles_for_docs(dataset)

Shingling 10 articles...

Shingling took 3.32 sec.


In [8]:
shing.compare_sets(shing.docs_shingles[shing.doc_names[1]],
                   shing.docs_shingles[shing.doc_names[2]])

0.1739170626190168

In [9]:
import numpy as np

def generate_jaccard_sim(docs_shingles, doc_names):
    """
    Compute the symmetric matrix of pairwise Jaccard similarities,
    with rows and columns ordered as in doc_names.
    """
    print("Calculating the Jaccard Similarity for all documents")
    dataset_size = len(doc_names)

    jaccSimMatrix = np.zeros((dataset_size, dataset_size))
    t0 = time.time()

    for j in range(dataset_size):
        s1 = docs_shingles[doc_names[j]]
        # Only the upper triangle is computed; the result is mirrored below
        for k in range(j, dataset_size):
            s2 = docs_shingles[doc_names[k]]
            if not s1 or not s2:
                print("Warning: at least one of the two sets is empty\n")
                jacc_sim = 0
            else:
                jacc_sim = (len(s1.intersection(s2)) / float(len(s1.union(s2))))
            jaccSimMatrix[j, k] = jacc_sim
            jaccSimMatrix[k, j] = jacc_sim

    print ('\nJaccard Similarity for ' + str(len(doc_names)) + ' docs took %.2f sec.' % (time.time() - t0))
    print ('\nJaccard Similarity Matrix\n ' + str(jaccSimMatrix))

    # Return the matrix so that the caller's assignment receives it
    return jaccSimMatrix

In [10]:
jaccSimMatrix  = generate_jaccard_sim(shing.docs_shingles, shing.doc_names)

Calculating the Jaccard Similarity for all documents

Jaccard Similarity for 10 docs took 2.38 sec.

Jaccard Similarity Matrix
 [[1.         0.16113816 0.21036779 0.22820762 0.21872387 0.20052262
  0.16820954 0.22315445 0.2077853  0.22019119]
 [0.16113816 1.         0.17391706 0.12242617 0.17244878 0.201923
  0.22293174 0.13569971 0.18379233 0.1543728 ]
 [0.21036779 0.17391706 1.         0.21260154 0.22143697 0.2152056
  0.18705407 0.22008889 0.21683682 0.22518434]
 [0.22820762 0.12242617 0.21260154 1.         0.21161193 0.18434147
  0.13867327 0.26457318 0.19503701 0.22545309]
 [0.21872387 0.17244878 0.22143697 0.21161193 1.         0.22098586
  0.18859731 0.21821277 0.21574418 0.22023137]
 [0.20052262 0.201923   0.2152056  0.18434147 0.22098586 1.
  0.21503294 0.19661474 0.21942763 0.20335983]
 [0.16820954 0.22293174 0.18705407 0.13867327 0.18859731 0.21503294
  1.         0.15455349 0.19693328 0.1690633 ]
 [0.22315445 0.13569971 0.22008889 0.26457318 0.21821277 0.19661474
  0.154553

### MinHashing

In [11]:
import random 

class MinHashing:
    
    def __init__(self, n, max_shingle_ID = 2**32-1):
        self.n = n # number of hashes
        self.max_shingle_ID = max_shingle_ID # the max number
        self.next_prime = 4294967311 # the next prime number after max shingle ID
        # A must be non-zero: with A == 0 the hash (A*x + B) % prime is the constant B for every x
        self.coeffs_A = self.generate_coeffs(low=1)
        self.coeffs_B = self.generate_coeffs()
        self.docs_minhash_signatures = {}
    
    def generate_coeffs(self, low=0):
        """
        Create a list of 'n' unique random values in [low, max_shingle_ID].
        """
        # random.sample draws without replacement, so the values are unique
        return random.sample(range(low, self.max_shingle_ID + 1), self.n)

    def _minHash_function(self, pos, x):
        """
        Return a hash in the form of (ax+b) % prime
        """
        return (self.coeffs_A[pos] * x + self.coeffs_B[pos]) % self.next_prime
        
    def generate_signature(self, shingle_set):
        """
        Given a shingle set of IDs, generate the hashes and compute the minimum hash
        """
        signature = []
        
        for i in range(self.n):
            signature.append(min(map(lambda x: self._minHash_function(i,x), shingle_set)))

        return signature
    
    def generate_doc_signatures(self, shingles):
        print("Generating MinHash signatures for documents..")
        t0 = time.time()

        for k, v in shingles.items():
            self.docs_minhash_signatures[k] = self.generate_signature(v)
       
        print ('\n Generating Signatures for ' + str(len(shingles)) + ' docs took %.2f sec.' % (time.time() - t0))

    @staticmethod
    def compare_signatures(s1, s2):
        """
        Estimate the similarity as the fraction of signature components that agree
        """
        # Signatures of different lengths cannot be compared component-wise
        if len(s1) != len(s2):
            raise ValueError("Unequal length of Signature")

        equality = sum(1 for x, y in zip(s1, s2) if x == y)
        return equality / float(len(s1))

In [12]:
minhash = MinHashing(200)

In [13]:
#minhash.generate_signature(shing.docs_shingles[shing.doc_names[1]])

In [14]:
minhash.compare_signatures(minhash.generate_signature(shing.docs_shingles[shing.doc_names[1]]),
                   minhash.generate_signature(shing.docs_shingles[shing.doc_names[2]]))

0.145
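
The estimate 0.145 can be compared with the exact Jaccard similarity of 0.1739 computed earlier for the same pair. Assuming the hash functions behave like independent random permutations (an idealisation), each signature component agrees with probability equal to the Jaccard similarity s, so an estimate built from n components has standard error sqrt(s(1 - s) / n). A small sketch of that arithmetic, with the hypothetical helper name `minhash_std_error`:

```python
import math


def minhash_std_error(s, n):
    """Standard error of the minhash estimate for true similarity s and n hash functions."""
    return math.sqrt(s * (1.0 - s) / n)


exact = 0.1739170626190168   # Jaccard similarity reported above for this pair
estimate = 0.145             # signature agreement reported above
se = minhash_std_error(exact, n=200)
print("standard error: %.4f" % se)
print("deviation in standard errors: %.2f" % (abs(estimate - exact) / se))
```

Under that assumption the standard error is roughly 0.027, so the observed gap of about 0.029 is close to one standard error.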

In [15]:
minhash.generate_doc_signatures(shing.docs_shingles)

Generating MinHash signatures for documents..

 Generating Signatures for 10 docs took 170.74 sec.
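
Much of the 170.74 sec reported above is likely spent evaluating the hash function once per shingle and per hash function in pure Python. A possible speed-up, sketched here under the assumption that NumPy is available and that shingle IDs and coefficients stay below 2**32 (so that a*x + b fits in an unsigned 64-bit integer), is to evaluate all hash functions on all shingles with one array expression. The function name `minhash_signature_np` is hypothetical, and no timing is claimed for it.

```python
import random

import numpy as np

PRIME = 4294967311  # same modulus as MinHashing.next_prime


def minhash_signature_np(shingle_set, coeffs_a, coeffs_b, prime=PRIME):
    """Return the minhash signature of shingle_set under h_i(x) = (a_i * x + b_i) % prime."""
    x = np.fromiter(shingle_set, dtype=np.uint64, count=len(shingle_set))
    a = np.asarray(coeffs_a, dtype=np.uint64)[:, None]   # shape (n, 1)
    b = np.asarray(coeffs_b, dtype=np.uint64)[:, None]   # shape (n, 1)
    # Broadcasting gives an (n, len(x)) table of hash values; all values stay below 2**64
    hashed = (a * x[None, :] + b) % np.uint64(prime)
    return hashed.min(axis=1).tolist()


random.seed(0)
n = 8
coeffs_a = random.sample(range(1, 2**32), n)
coeffs_b = random.sample(range(0, 2**32), n)
shingles = {3, 17, 4294967295, 123456789}

fast = minhash_signature_np(shingles, coeffs_a, coeffs_b)
# Reference result computed with the same formula in pure Python
slow = [min((a * x + b) % PRIME for x in shingles) for a, b in zip(coeffs_a, coeffs_b)]
print(fast == slow)
```

The intermediate table has n times the number of shingles entries, so large documents may need to be processed in chunks.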


### LSH
Partition into Bands
- Divide the signature matrix M into b bands of r rows each.
- For each band, hash its portion of each column to a hash table with k buckets.
- Make k as large as possible, so that accidental collisions are rare.
- Candidate column pairs are those that hash to the same bucket in more bands than the chosen threshold allows, i.e. more than t * b bands in the implementation below.
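
The banding steps above can be sketched with one dictionary of buckets per band, so that only documents that collide in some band are ever paired. This is an illustrative variant of the textbook scheme (a pair becomes a candidate on its first shared bucket) and not necessarily the procedure used by the `LSH` class in this notebook; the name `lsh_candidates` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, bands, rows):
    """Return pairs of document names whose signatures agree on all rows of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # The tuple of this band's rows is used directly as the bucket key
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        # Every pair of documents sharing a bucket in this band is a candidate
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 5, 7],   # shares band 0 with "a"
    "c": [8, 8, 8, 8, 8, 8],   # shares no band with the others
}
print(lsh_candidates(signatures, bands=3, rows=2))   # {('a', 'b')}
```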

In [16]:
from collections import defaultdict

class LSH:
    
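    def __init__(self, band_size, row_size, threshold):
        self.band_size = band_size  # number of bands (b)
        self.threshold = threshold  # stored only; generate_candidate_pairs takes its own t
        self.row_size = row_size    # rows per band (r)
        self.docs_lsh = {}
        self.candidate_pairs = defaultdict(set)

    def get_lsh(self, signature):
        """
        Hash each band (row_size consecutive signature components) to one bucket ID
        """
        lsh = []
        for i in range(self.band_size):
            start = i * self.row_size
            lsh.append(hash(tuple(signature[start:start + self.row_size])) % 4294967311)
        return lsh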

    def get_lsh_for_docs(self, signatures):
        for k, v in signatures.items():
            self.docs_lsh[k] = self.get_lsh(v)
        
    def generate_candidate_pairs(self, t=0.0002):
        """
        t: the fraction of bands on which a pair of documents must agree
           (a pair becomes a candidate when it matches in more than t * band_size bands)
        """
        all_docs = list(self.docs_lsh.values())
        all_names = list(self.docs_lsh.keys())
        
        # Minimum number of bands whose hashes must match
        threshold = t * self.band_size

        for idx, s1 in enumerate(all_docs):
            s1_name = all_names[idx]
            
            # Compare each document only with the documents after it
            for curr_iter, s2 in enumerate(all_docs[idx + 1:]):
                s2_name = all_names[curr_iter + idx + 1]
                
                # Number of bands in which both documents hash to the same value
                matching_bands = sum(1 for x, y in zip(s1, s2) if x == y)

                # Store pairs that are above the threshold as candidate pairs
                if matching_bands > threshold:
                    self.candidate_pairs[s1_name].add(s2_name)

In [17]:
lshh = LSH(50, 4, 0.1)
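
For the configuration above (b = 50 bands of r = 4 rows), the standard banding analysis gives the probability that two documents share at least one band as 1 - (1 - s^r)^b, where s is their Jaccard similarity, under the idealised assumption that each signature row agrees independently with probability s. The sketch below evaluates that formula; `candidate_probability` is a hypothetical helper name.

```python
def candidate_probability(s, bands, rows):
    """Probability that a pair with similarity s agrees on every row of at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


bands, rows = 50, 4
for s in (0.1, 0.2, 0.3, 0.5, 0.8):
    print("s = %.1f -> P(candidate) = %.4f" % (s, candidate_probability(s, bands, rows)))

# Common rule of thumb for where the curve rises most steeply: (1/b)^(1/r)
print("approximate threshold: %.3f" % ((1.0 / bands) ** (1.0 / rows)))
```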

In [18]:
lshh.get_lsh(minhash.docs_minhash_signatures[shing.doc_names[1]])

[3330187200,
 2127868941,
 519083102,
 745080478,
 3135366119,
 2006144478,
 50045948,
 1332400039,
 3980626399,
 3801580741,
 2497146791,
 262966903,
 1824525263,
 1255471848,
 3142506904,
 3963954758,
 27889379,
 2144500702,
 7864650,
 1656721948,
 1176055604,
 3371242827,
 311765484,
 703792230,
 1542726733,
 1268895722,
 1925046707,
 996801477,
 3631345277,
 2767722317,
 1170095390,
 3564658967,
 1729763392,
 3002260798,
 2684752910,
 2806297386,
 3015818423,
 3010009412,
 3644051409,
 1800828627,
 2017540586,
 4197045970,
 2533592844,
 409855283,
 2853379971,
 1987280662,
 876834819,
 3333429166,
 4092622166,
 3740663004]

In [19]:
lshh.get_lsh_for_docs(minhash.docs_minhash_signatures)

In [20]:
lshh.docs_lsh[shing.doc_names[1]]

[3330187200,
 2127868941,
 519083102,
 745080478,
 3135366119,
 2006144478,
 50045948,
 1332400039,
 3980626399,
 3801580741,
 2497146791,
 262966903,
 1824525263,
 1255471848,
 3142506904,
 3963954758,
 27889379,
 2144500702,
 7864650,
 1656721948,
 1176055604,
 3371242827,
 311765484,
 703792230,
 1542726733,
 1268895722,
 1925046707,
 996801477,
 3631345277,
 2767722317,
 1170095390,
 3564658967,
 1729763392,
 3002260798,
 2684752910,
 2806297386,
 3015818423,
 3010009412,
 3644051409,
 1800828627,
 2017540586,
 4197045970,
 2533592844,
 409855283,
 2853379971,
 1987280662,
 876834819,
 3333429166,
 4092622166,
 3740663004]

In [21]:
lshh.generate_candidate_pairs()

In [22]:
lshh.candidate_pairs

defaultdict(set,
            {"Almayer'sFolly": {'ChanceATaleInTwoParts'},
             'AmyFoster': {'The Secret Sharer'},
             'Falk': {'TheArrowofGold'}})

In [23]:
#print("This article:\n\n", dataset['229'])
#print("\n\nis similar to:\n ", dataset['209'])