# EE226 - Coding 2
## Streaming algorithm & Locality Sensitive Hashing

### Streaming: DGIM

DGIM is an efficient algorithm for processing large streams. When storing the whole binary stream is infeasible, DGIM can estimate the number of 1-bits in the most recent window. In this coding assignment, you are given *stream_data.txt* (a binary stream) and need to implement the DGIM algorithm to count the number of 1-bits. Write code and answer the problems below.
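
To make the bucket mechanics concrete, here is a minimal, self-contained sketch of DGIM on a synthetic stream. The function name `dgim_estimate`, the `[timestamp, size]` bucket layout, and the random test stream are assumptions made for illustration; they are not part of the assignment files.

```python
import random


def dgim_estimate(bits, window_size):
    """Estimate the number of 1-bits among the last `window_size` bits."""
    buckets = []  # oldest first; each bucket is [timestamp_of_newest_1, size]
    for t, bit in enumerate(bits, start=1):
        # Drop the oldest bucket once its newest 1-bit has left the window.
        if buckets and buckets[0][0] <= t - window_size:
            buckets.pop(0)
        if bit != 1:
            continue
        buckets.append([t, 1])
        # Whenever three buckets share a size, merge the two oldest of them.
        i = len(buckets) - 1
        while i >= 2:
            if buckets[i][1] == buckets[i - 2][1]:
                buckets[i - 2] = [buckets[i - 1][0], 2 * buckets[i][1]]
                del buckets[i - 1]
            i -= 1
    if not buckets:
        return 0.0
    # Count every bucket fully except the oldest, which counts for half.
    return sum(size for _, size in buckets) - buckets[0][1] / 2


random.seed(0)
stream = [random.randint(0, 1) for _ in range(1000)]  # synthetic stream
print("estimate:", dgim_estimate(stream, 100))
print("exact   :", sum(stream[-100:]))
```

Because only the oldest bucket can straddle the window boundary, the estimate should differ from the exact count by at most half of that bucket's size, which is at most 50% of the true count.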

### Your task

1. Set the window size to 1000, and count the number of 1-bits in the current window.

In [None]:
import time
 
buckets=[] # the list of bucket
window_size=1000
time_now=2000 # Current moment

fileName = '../input/coding2/stream_data.txt'
f = open(fileName, 'r')
data = f.read().split('\t')  # Load data
f.close()

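def update(buckets=buckets):
    # buckets are ordered oldest first; scan from the newest end
    for i in range(len(buckets)-1,1,-1):
        # three buckets of the same size: merge the two oldest of them
        if buckets[i]['bit_sum']==buckets[i-2]['bit_sum']:
            buckets[i-2]['bit_sum']*=2
            # the merged bucket keeps the timestamp of its most recent 1-bit
            buckets[i-2]['timestamp']=buckets[i-1]['timestamp']
            del buckets[i-1]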

def DGIM(data=data, buckets=buckets, window_size=window_size, time_now=time_now):
    bit_sum=0
    start_time=time.time()
    for i in range(time_now):
        # timestamps are 1-based: the bit read at index i has timestamp i+1
        if len(buckets)>0 and buckets[0]['timestamp']<=i+1-window_size:
            del buckets[0] # delete the bucket that has left the window
        tmp=data[i]
        if int(tmp)==1:
            bucket={"timestamp":i+1,"bit_sum":1} # add a new bucket
            buckets.append(bucket)
            update(buckets) # check and merge
    for i in range(len(buckets)):
        bit_sum+=buckets[i]['bit_sum']
    if buckets:
        # the oldest bucket may extend past the window, so count only half of it
        bit_sum-=buckets[0]['bit_sum']/2
    return bit_sum,time.time()-start_time


bit_sum,bit_time=DGIM(data,buckets,window_size,time_now)
print("the number of 1-bits in the current window is {}, and DGIM total running time is {}.".format(bit_sum,bit_time))

2. Write a function that counts the number of 1-bits in the current window exactly, and compare its running time and space usage with those of the DGIM algorithm.

In [None]:
def preciseCount(data=data, window_size=window_size, time_now=time_now):
    bit_sum=0
    start_time=time.time()
    offset = max(time_now - window_size, 0)
    for i in range(min(time_now,window_size)):
        tmp=data[i+offset]
        if int(tmp)==1:
            bit_sum+=1
 
    return bit_sum,time.time()-start_time

precise_bit_sum, precise_bit_time=preciseCount(data,window_size,time_now)
print("the number of 1-bits in the current window is {}, and precise count total running time is {}.".format(precise_bit_sum,precise_bit_time))

error=abs(precise_bit_sum-bit_sum)
error_rate=100*float(error)/precise_bit_sum
print("Error rate of DGIM and precise count is {}%.".format(error_rate))
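
For the space comparison, it helps to see what exact counting costs when the data really arrives as a stream rather than as a fully loaded list. The sketch below uses a hypothetical `ExactWindowCounter` class (not part of the assignment code) that keeps every bit of the window in a `collections.deque`, so its memory grows linearly with the window size. DGIM, by contrast, keeps O(log N) buckets of O(log N) bits each for a window of N bits.

```python
from collections import deque


class ExactWindowCounter:
    """Exact count of 1-bits over the last `window_size` bits of a stream."""

    def __init__(self, window_size):
        # Every bit of the window is stored, so memory is O(window_size).
        self.window = deque(maxlen=window_size)
        self.ones = 0

    def add(self, bit):
        # The bit about to fall out of a full window no longer counts.
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]
        self.window.append(bit)  # deque drops the oldest bit automatically
        self.ones += bit
        return self.ones


counter = ExactWindowCounter(window_size=3)
counts = [counter.add(bit) for bit in [1, 1, 0, 1, 1]]
print(counts)
```

Each update is O(1) in time, so the trade-off against DGIM is mainly memory: exact counting needs the whole window, while DGIM accepts a bounded error to store far less.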

### Locality Sensitive Hashing

The locality sensitive hashing (LSH) algorithm is efficient for near-duplicate document detection. In this coding assignment, you are given *docs_for_lsh.csv*, in which the documents have been processed into sets of k-shingles (k = 8, 9, 10). *docs_for_lsh.csv* contains 201 columns: column 'doc_id' holds the unique id of each document, and each of the columns '0' to '199' represents a unique shingle. If a document contains the shingle with index **i**, its row has value 1 in column **'i'**; otherwise the value is 0. You need to implement the LSH algorithm and answer the problems below.
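
As a rough illustration of how such a 0/1 matrix can arise, the sketch below builds character k-shingles for two toy strings and maps them to indicator rows. The helper names, the toy strings, the use of character (rather than word) shingles, and k = 2 are all assumptions; the actual preprocessing behind *docs_for_lsh.csv* is not described here.

```python
def shingles(text, k):
    """Return the set of all length-k character substrings of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def to_indicator_row(doc_shingles, vocabulary):
    """Map a shingle set to a 0/1 row, one entry per shingle in `vocabulary`."""
    return [1 if s in doc_shingles else 0 for s in vocabulary]


doc_a = shingles("abcdab", k=2)
doc_b = shingles("abcxab", k=2)

# One column per distinct shingle, in a fixed (here: sorted) order.
vocabulary = sorted(doc_a | doc_b)
row_a = to_indicator_row(doc_a, vocabulary)
row_b = to_indicator_row(doc_b, vocabulary)

print(vocabulary)
print(row_a)
print(row_b)
```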

### Your task

Use the minhash algorithm to create a signature for each document, and find 'the most similar' documents under Jaccard similarity. 
Parameters you need to determine:
1) Length of the signature (number of distinct minhash functions) *n*. Recommended value: n > 20.

2) Number of bands *b* that divide the signature matrix. Recommended value: b > n // 10.
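
The choice of *n* matters because the fraction of minhash values on which two signatures agree is an estimate of the documents' Jaccard similarity, and the estimate tightens as *n* grows. The sketch below illustrates this on two made-up indicator rows; the data, the helper names, and the fixed seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Two toy documents as 0/1 indicator rows over 10 shingles (made-up data).
doc_a = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])
doc_b = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])


def jaccard(x, y):
    """Exact Jaccard similarity of two 0/1 rows."""
    return np.logical_and(x, y).sum() / np.logical_or(x, y).sum()


def minhash_signature(row, permutations):
    """For each permutation, the smallest permuted index among the row's shingles."""
    return np.array([perm[row == 1].min() for perm in permutations])


def estimate_similarity(n):
    """Fraction of n minhash functions on which the two documents agree."""
    permutations = [rng.permutation(len(doc_a)) for _ in range(n)]
    sig_a = minhash_signature(doc_a, permutations)
    sig_b = minhash_signature(doc_b, permutations)
    return float(np.mean(sig_a == sig_b))


print("true Jaccard:", jaccard(doc_a, doc_b))
for n in (20, 200, 2000):
    print(n, estimate_similarity(n))
```

The two toy rows share 4 of the 6 shingles in their union, so the estimates should scatter around 2/3, more tightly for larger *n*.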

In [None]:
import pandas as pd
import numpy as np
import time
from random import shuffle
from tqdm import trange
from sklearn.metrics import jaccard_score

hash_num=50
band_num=10
signatures = []

filename = "../input/coding2/docs_for_lsh.csv"
dataset = pd.read_csv(filename,usecols=list(range(1,201)))

In [None]:
def generate_sig_mat(data=dataset, hash_num=hash_num):
    shingle_num = data.shape[1]
    
    signatures = []
    for i in trange(hash_num):
        # random permutation π of the shingle indices
        permutation = np.arange(shingle_num)
        shuffle(permutation)
        
        # a copy of the data, with columns relabeled by π and reordered
        df = data.copy()
        df.columns=permutation
        df = df.sort_index(axis=1).values
        
        # get one signature: the first permuted index whose shingle is present
        # (a row without any shingle gets 0, since argmax of all-False is 0)
        signature = (df!=0).argmax(axis=1)
        
        signatures.append(signature)
        del df
    return signatures
                     
signatures = generate_sig_mat(dataset, hash_num)

Problem: For document 0 (the one with id '0'), list the ids of the **30** most similar documents (excluding document 0 itself). You can validate your results with the [sklearn.metrics.jaccard_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) function.

Tips: You can adjust your parameters to hash the documents with similarity *s > 0.8* into the same bucket.
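
When tuning the parameters, a useful guide is the standard banding formula: with *b* bands of *r = n / b* rows each, two documents with Jaccard similarity *s* share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below evaluates this for the values used in this notebook (n = 50, b = 10); the function names are hypothetical.

```python
def candidate_probability(s, rows_per_band, bands):
    """Probability that two documents with similarity s share at least one bucket."""
    return 1.0 - (1.0 - s ** rows_per_band) ** bands


def approx_threshold(rows_per_band, bands):
    """Similarity near which the S-curve rises most steeply, (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows_per_band)


n, b = 50, 10  # hash_num and band_num as set in this notebook
r = n // b     # rows per band

for s in (0.2, 0.4, 0.6, 0.8, 0.9):
    print("s = {:.1f} -> P(candidate) = {:.3f}".format(s, candidate_probability(s, r, b)))
print("approximate threshold: {:.3f}".format(approx_threshold(r, b)))
```

With these settings, pairs at s = 0.8 become candidates with probability of about 0.98, while the threshold sits near 0.63. Raising *r* (fewer bands) pushes the threshold up, which reduces false candidates at the cost of missing more true pairs.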

In [None]:
import hashlib
def LSH(signatures=signatures, band_num=band_num, hash_num=hash_num):
    buckets = {} # a dictionary, buckets as key, docs as value
    signatures = np.array(signatures)
    s_len = signatures.shape[1]
    step_size = hash_num//band_num
    for i in trange(s_len):
        for j in range(band_num):
            hash_md5 = hashlib.md5()
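            band = signatures[int(j*step_size):int((j+1)*step_size),i]
            band = [str(v) for v in band]
            # join with a separator so that e.g. (1, 23) and (12, 3) hash differently;
            # the band index j keeps equal rows from different bands in separate buckets
            band = ",".join(band)+"|"+str(j)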
            hash_md5.update(band.encode('utf-8'))
            bucket = hash_md5.hexdigest()
            if bucket not in buckets:
                buckets[bucket]=[i]
            elif i not in buckets[bucket]: # add element to an existing bucket
                buckets[bucket].append(i)
    return buckets

buckets = LSH(signatures, band_num, hash_num)

In [None]:
def find_most_similiar_30(buckets, signatures=signatures, data=dataset.values):
    signatures = np.array(signatures)
    scores=np.zeros(signatures.shape[1])
    for key,value in buckets.items():
        if 0 in value:
            for idx in value:
                scores[idx] += 1
    scores[0]=-1 # rank document 0 itself last so it never appears in the list
    rank=np.argsort(-scores)
    print("The 30 most similar document ids and their Jaccard scores are as follows:")
    print("id    : score")
    for i in rank[:30]:
        print("{:5d} : {:.5f}".format(i, jaccard_score(data[0,:],data[i,:])))

find_most_similiar_30(buckets, signatures, dataset.values)
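
One way to validate the LSH result is to compare it against a brute-force ranking that computes the exact Jaccard similarity between the query row and every other row. The sketch below is such a check on a toy matrix; `top_k_by_jaccard` is a hypothetical helper, and it assumes, as the code above does, that a document's row index equals its id.

```python
import numpy as np


def top_k_by_jaccard(matrix, query=0, k=30):
    """Return the ids and Jaccard scores of the k rows most similar to row `query`."""
    matrix = np.asarray(matrix).astype(bool)
    q = matrix[query]
    inter = (matrix & q).sum(axis=1)
    union = (matrix | q).sum(axis=1)
    # Rows with an empty union (both rows all-zero) get similarity 0.
    scores = np.divide(inter, union, out=np.zeros(len(matrix)), where=union > 0)
    scores[query] = -1.0  # exclude the query document itself
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]


toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
ids, sims = top_k_by_jaccard(toy, query=0, k=2)
print(ids, sims)
```

On the assignment data this could be called as `top_k_by_jaccard(dataset.values)`. Ids that appear in the brute-force list but not in the LSH output are false negatives of the chosen *n* and *b*.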