# Assignment 1: Finding Similar Items: Textually Similar Documents

You are to implement the stages of finding textually similar documents based on Jaccard similarity using the shingling, minhashing, and locality-sensitive hashing (LSH) techniques and corresponding algorithms. The implementation can be done using any big data processing framework, such as Apache Spark or Apache Flink, or without a framework, e.g., in plain Java or Python. To test and evaluate your implementation, write a program that uses your implementation to find similar documents in a corpus of 5-10 or more documents such as web pages or emails.

The stages should be implemented as a collection of classes, modules, functions or procedures, depending on the framework and the language of your choice. Below, we give a description of sample classes that implement different stages of finding textually similar documents. You do not have to develop the exact same classes and data types as described below. Feel free to use data structures that suit you best.

1. A class Shingling that constructs k-shingles of a given length k (e.g., 10) from a given document, computes a hash value for each unique shingle, and represents the document in the form of an ordered set of its hashed k-shingles.
2. A class CompareSets that computes the Jaccard similarity of two sets of integers – two sets of hashed shingles.
3. A class MinHashing that builds a minHash signature (in the form of a vector or a set) of a given length n from a given set of integers (a set of hashed shingles).
4. A class CompareSignatures that estimates the similarity of two integer vectors – minhash signatures – as the fraction of components in which they agree.
5. (Optional task for 2 extra bonus points) A class LSH that implements the LSH technique: given a collection of minhash signatures (integer vectors) and a similarity threshold t, the LSH class (using banding and hashing) finds all candidate pairs of signatures that agree on at least a fraction t of their components.
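
The banding step of item 5 is not implemented in this notebook. A minimal sketch of how it could look is given here, assuming the signature length n is split into `bands` bands of `rows` rows each (so `bands * rows == n`). The function names `lsh_candidate_pairs` and `filter_by_threshold` and their parameters are hypothetical, and the band tuple itself is used as the bucket key instead of a separate hash function. With b bands of r rows, a pair with signature similarity s becomes a candidate with probability $1 - (1 - s^r)^b$, so b and r are usually chosen so that $(1/b)^{1/r}$ is close to the threshold t.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """Return index pairs whose signatures are identical in at least one band.

    signatures: list of equal-length integer lists (minhash signatures).
    bands, rows: hypothetical parameters with bands * rows == signature length.
    """
    if not signatures:
        return set()
    if bands * rows != len(signatures[0]):
        raise ValueError("bands * rows must equal the signature length")

    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            # The tuple of values in this band is used directly as the bucket key.
            buckets[tuple(sig[start:start + rows])].append(doc_id)
        for doc_ids in buckets.values():
            # Every pair of documents sharing a bucket is a candidate pair.
            candidates.update(combinations(doc_ids, 2))
    return candidates


def filter_by_threshold(signatures, candidates, t):
    """Keep candidate pairs whose signatures agree on at least a fraction t."""
    result = set()
    for i, j in candidates:
        agree = sum(a == b for a, b in zip(signatures[i], signatures[j]))
        if agree / len(signatures[i]) >= t:
            result.add((i, j))
    return result
```

Only the candidate pairs are compared component by component in the second step, which is what avoids comparing all pairs of signatures.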

To test and evaluate the scalability (execution time versus input dataset size) of your implementation, write a program that uses your classes to find similar documents in a corpus of 5-10 documents. Choose a similarity threshold s (e.g., 0.8) that states that two documents are similar if the Jaccard similarity of their shingle sets is at least s.
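
One possible way to measure execution time against corpus size is sketched here. It is an illustration only: the helper names `shingle_set`, `jaccard`, and `time_all_pairs` are hypothetical, the corpus is synthetic random text rather than the BBC dataset, and shingles are left unhashed for brevity.

```python
import random
import string
import time
from itertools import combinations


def shingle_set(text, k=5):
    """Set of character k-shingles (unhashed, for brevity)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def time_all_pairs(docs, s=0.8):
    """Return (elapsed seconds, similar pairs) for an all-pairs comparison."""
    start = time.perf_counter()
    sets = [shingle_set(d) for d in docs]
    similar = [(i, j) for i, j in combinations(range(len(sets)), 2)
               if jaccard(sets[i], sets[j]) >= s]
    return time.perf_counter() - start, similar


rng = random.Random(0)  # fixed seed so the synthetic corpus is repeatable
corpus = ["".join(rng.choices(string.ascii_lowercase + " ", k=500))
          for _ in range(40)]
for size in (10, 20, 40):
    elapsed, pairs = time_all_pairs(corpus[:size])
    print(f"{size} documents: {elapsed:.4f} s, {len(pairs)} similar pairs")
```

Because every pair of documents is compared, the number of comparisons grows quadratically with the corpus size, which is the cost that minhashing and LSH are meant to reduce.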

Datasets
For documents, see the datasets in the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php), or find other documents such as web pages or emails.
To find more datasets, follow this link (https://github.com/awesomedata/awesome-public-datasets).

The dataset used for this assignment is "BBC Full Text Document Classification", which is available on Kaggle (https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification/download).
This dataset contains 2225 .txt documents belonging to five different categories:
- business
- entertainment
- politics
- sport
- tech

In [1]:
from pathlib import Path

def retrieve_file(data_folder, file_name):
    # Read the whole file as a single line of text, closing it afterwards
    file_to_open = data_folder / file_name
    with open(file_to_open) as f:
        return f.read().replace('\n', ' ')

business_folder = Path("bbc-full-text-document-classification/bbc/business")
entertainment_folder = Path("bbc-full-text-document-classification/bbc/entertainment")
politics_folder = Path("bbc-full-text-document-classification/bbc/politics")
sport_folder = Path("bbc-full-text-document-classification/bbc/sport")
tech_folder = Path("bbc-full-text-document-classification/bbc/tech")

bus1 = retrieve_file(business_folder, "001.txt")
bus2 = retrieve_file(business_folder, "002.txt")
pol1 = retrieve_file(politics_folder, "001.txt")

## Shingling Class
In the following Shingling class, the hash function simply maps each input shingle to a 32-bit integer using the CRC32 checksum from the binascii library.

In [2]:
import binascii

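class Shingling():
    def __init__(self, k, document):
        self.k = k                # shingle length in characters
        self.document = document
        self.set = set()          # hashed shingles collected so far

    def hash_function(self, shingle):
        # CRC32 of the UTF-8 bytes, masked to an unsigned 32-bit integer
        return binascii.crc32(shingle.encode("utf-8")) & 0xffffffff

    def create_shingle_set(self):
        # Slide a window of k characters over the document
        for i in range(len(self.document)-(self.k-1)):
            shingle = self.document[i:(i+self.k)]
            hashed_shingle = self.hash_function(shingle)
            self.set.add(hashed_shingle)
        # Return the unique hashed shingles in ascending order
        return sorted(self.set)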
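
To make the shingling step concrete, here is a small worked example on a toy string. It is a sketch that mirrors the class above with two standalone helper functions, `char_shingles` and `hashed_shingles`, which are hypothetical names and not part of the notebook.

```python
import binascii


def char_shingles(text, k):
    """Return the set of distinct character k-shingles of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def hashed_shingles(text, k):
    """Hash each distinct shingle to a 32-bit integer with CRC32, then sort."""
    return sorted(binascii.crc32(s.encode("utf-8")) for s in char_shingles(text, k))


# "abcabc" has four windows of length 3 (abc, bca, cab, abc) but only three
# distinct shingles, because the set removes the repeated "abc".
print(sorted(char_shingles("abcabc", 3)))   # ['abc', 'bca', 'cab']
print(len(hashed_shingles("abcabc", 3)))    # 3
```

A document shorter than k characters yields an empty set, which is why the comparison class below checks for empty sets.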

## CompareSets Class
The similarity function implemented by the class is the Jaccard similarity, computed as the number of elements the two sets have in common divided by the number of elements in their union, $J(A, B) = |A \cap B| / |A \cup B|$:

In [3]:
class CompareSets():
    def __init__(self, set1, set2):
        self.set1 = set(set1)
        self.set2 = set(set2)
    
    def compute_jacc_sim(self):
        if not self.set1 or not self.set2:
            print("Warning: at least one of the two sets is empty\n")
            return 0
        else:
            return len(self.set1.intersection(self.set2)) / len(self.set1.union(self.set2))
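
As a quick sanity check of the formula, the following sketch computes the Jaccard similarity of two small sets by hand. The function name `jaccard` is hypothetical and stands in for the class above.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Intersection {2, 3} has 2 elements, union {1, 2, 3, 4} has 4 elements.
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```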

In [4]:
shing1 = Shingling(5,bus1)
set1 = shing1.create_shingle_set()

shing2 = Shingling(5,bus2)
set2 = shing2.create_shingle_set()

shing3 = Shingling(5,pol1)
set3 = shing3.create_shingle_set()

comparison = CompareSets(set1, set2)
similarity = comparison.compute_jacc_sim()
print("The Jaccard similarity between two business articles is ",similarity)

comparison = CompareSets(set2, set3)
similarity = comparison.compute_jacc_sim()
print("The Jaccard similarity between a business and a politics article is ",similarity)

The Jaccard similarity between two business articles is  0.07042253521126761
The Jaccard similarity between a business and a politics article is  0.06374147643047733


## MinHashing Class
The MinHash signature is computed with hash functions of the form $h(x) = (ax + b) \bmod c$, where $a$ and $b$ are randomly generated coefficients and $c$ is the smallest prime larger than $2^{32} - 1$ (4294967311), since the Shingling class above produces sets of 32-bit integers.

In [8]:
import random

class MinHashing():    
    def __init__(self, n=50):
        self.n = n
        self.max_shingle_hash = 2**32 - 1
        self.next_prime = 4294967311
        self.coef_a = self.generate_coef()
        self.coef_b = self.generate_coef()
    
    def generate_coef(self):
        coefficients = []
        
        for _ in range(self.n):
            new_coef = random.randint(1, self.max_shingle_hash)
            while new_coef in coefficients:
                new_coef = random.randint(1, self.max_shingle_hash)
            coefficients.append(new_coef)
        
        return coefficients
    
    def hash_function(self, position, value):
        a = self.coef_a[position]
        b = self.coef_b[position]
        return (a * value + b) % self.next_prime
    
    def create_signature(self, inputSet):
        sign = []
        for i in range(self.n):
            sign.append(min(self.hash_function(i, x) for x in inputSet))
        return sign

## CompareSignatures Class
The CompareSignatures class receives two signatures and counts the components in which they agree, divided by the signature length. The most important check is that both signatures have the same length; otherwise they cannot be compared.

In [9]:
class CompareSignatures():
    def __init__(self, sig1, sig2):
        self.sig1 = sig1
        self.sig2 = sig2
    
    def compute_sig_sim(self):
        if(len(self.sig1) != len(self.sig2)):
            print("Impossible to compare those signatures since they are of different lengths\n")
            return None
        elif(len(self.sig1) == 0):
            print("Warning: both signatures are empty")
            return 0
        else:
            # Count the positions at which the two signatures agree
            equal_sign = 0
            all_sign = len(self.sig1)
            for a, b in zip(self.sig1, self.sig2):
                if(a == b):
                    equal_sign += 1
            return equal_sign / all_sign
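
A small worked example of the signature comparison, written as a standalone function for illustration. The name `signature_similarity` is hypothetical, and unlike the class above this sketch raises an error on a length mismatch instead of printing a message.

```python
def signature_similarity(sig1, sig2):
    """Fraction of positions at which two equal-length signatures agree."""
    if len(sig1) != len(sig2):
        raise ValueError("signatures must have the same length")
    if not sig1:
        return 0.0
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)


# The signatures agree in positions 0, 1 and 3, so the estimate is 3 / 4.
print(signature_similarity([3, 8, 1, 5], [3, 8, 2, 5]))  # 0.75
```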

In [15]:
minhash = MinHashing(100)
sig1 = minhash.create_signature(set1)

sig2 = minhash.create_signature(set2)

sig3 = minhash.create_signature(set3)

comparison = CompareSignatures(sig1, sig2)
similarity = comparison.compute_sig_sim()
print("The signatures similarity between two business articles is ", similarity)

comparison = CompareSignatures(sig2, sig3)
similarity = comparison.compute_sig_sim()
print("The signatures similarity between a business and a politics article is ", similarity)

The signatures similarity between two business articles is  0.04
The signatures similarity between a business and a politics article is  0.06
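
The signature similarities differ slightly from the exact Jaccard similarities computed earlier (about 0.070 and 0.064), which is expected for an estimate based on 100 hash functions. The sketch below quantifies this under the simplifying assumption that each signature component agrees independently with probability equal to the true Jaccard similarity s, so that the estimate has standard error $\sqrt{s(1 - s)/n}$. The function name `minhash_standard_error` is hypothetical.

```python
import math


def minhash_standard_error(s, n):
    """Standard error of a minhash estimate for true Jaccard s and signature length n."""
    return math.sqrt(s * (1 - s) / n)


# Exact Jaccard values and signature estimates taken from the outputs above.
for label, jaccard, estimate in [
    ("business vs business", 0.07042253521126761, 0.04),
    ("business vs politics", 0.06374147643047733, 0.06),
]:
    se = minhash_standard_error(jaccard, 100)
    print(f"{label}: |estimate - jaccard| = {abs(estimate - jaccard):.3f}, "
          f"standard error = {se:.3f}")
```

Under this assumption, both estimates lie within roughly one to two standard errors of the exact values, and a longer signature would narrow the gap at the cost of more hash computations.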
