# 1. Hashing task!
For this task we are working with hashing algorithms. In particular, we are going to implement hash functions and a data structure called a Bloom filter.

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.

The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
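
To make the "definitely not / may be" behaviour concrete, here is a minimal sketch of a Bloom filter. It is not the implementation used in the rest of this notebook: the class name `TinyBloomFilter`, the small table size, and the salted SHA-256 hashing are assumptions chosen only for illustration.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; names and sizes are hypothetical."""

    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash number.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "may be in the set".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = TinyBloomFilter(size=1024, num_hashes=4)
bloom.add("hunter2")
print(bloom.might_contain("hunter2"))  # True: an added item is never reported absent
```

An item that was added always answers `True`. An item that was never added usually answers `False`, but can answer `True` if other items happen to have set all of its positions.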

# Importing libraries we need.

In [1]:
import numpy as np,pandas as pd,math,time

Opening password files and returning the DataFrame.

In [2]:
def load_df(file_number):
    # Read passwords<file_number>.txt and return one password per row (column 0).
    with open(f"passwords{file_number}.txt", "r") as f:
        data = f.read()
    return pd.DataFrame(data.split("\n"))

# Global variables
We set the size of the Bloom filter, m, to 15 positions per password (m / n = 15). A table that is large relative to the number of passwords reduces false positives.

In [4]:
n     = 100_000_000                        #number of passwords
k     = 4                                  #number of hash_functions used.
m     = 1_500_000_000                      # size of Bloom filter
p     = pow(1 - math.exp(-k / (m / n)), k) #probability of false positive.
df    = load_df(1) #loading password 1 file.
df2   = load_df(2) #loading password 2 file.
prime = 1_500_000_001
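
The expression for `p` is the usual approximation $p \approx (1 - e^{-kn/m})^k$. The sketch below wraps it in a function so the effect of `k` can be explored for the same `n` and `m`. The function name `false_positive_rate` is hypothetical and not part of the notebook.

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


n_items, table_size = 100_000_000, 1_500_000_000
for num_hashes in (2, 4, 8):
    # Print the approximate false-positive rate for each number of hash functions.
    print(num_hashes, false_positive_rate(n_items, table_size, num_hashes))
```

For these sizes the approximation keeps falling as `k` goes from 2 to 8, so `k = 4` trades some accuracy for fewer hash computations per password.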

# Defining hash table

In [5]:
hash_table = np.zeros(m, dtype=np.uint8)  # m entries of 0 or 1; one byte each (1.5 GB) instead of a 64-bit integer each (12 GB)
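
The table only ever stores 0 or 1, so each position needs a single bit. With m = 1,500,000,000, one 64-bit integer per position needs 12 GB and one byte per position needs 1.5 GB, while one bit per position needs about 190 MB. The sketch below shows one way to pack positions into bits with a plain `bytearray`. The class `BitArray` is hypothetical and is not used elsewhere in this notebook.

```python
class BitArray:
    """Fixed-size array of bits backed by a bytearray (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray((size + 7) // 8)  # 8 positions per byte, rounded up

    def set(self, index):
        # index >> 3 selects the byte, index & 7 selects the bit inside it.
        self.data[index >> 3] |= 1 << (index & 7)

    def get(self, index):
        return (self.data[index >> 3] >> (index & 7)) & 1


bits = BitArray(20)
bits.set(3)
bits.set(17)
print(bits.get(3), bits.get(4), bits.get(17))  # 1 0 1
print(len(bits.data))                          # 3 bytes hold 20 positions
```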

Code to get next prime number after a given number.

In [6]:
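def next_prime(number):
    number = max(number, 1)  # the smallest prime is 2
    while True:
        number += 1
        # Trial division up to the square root is enough to find a divisor.
        for i in range(2, int(number ** 0.5) + 1):
            if number % i == 0:
                break
        else:
            # The for/else branch runs only when no divisor was found.
            return number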

In [7]:
prime = next_prime(m)

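DJB hash function
Here I have modified the DJB hash function to turn strings into large numbers. The step `((h << 5) + h)` multiplies the running hash by 33 before the next character code is added.
5381 is just a number that, in testing, resulted in fewer collisions and better avalanching.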

In [8]:
def hash_function1(s):
    h = 5381
    for x in s:
        h = ((h << 5) + h) + ord(x)
    return h % prime
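
As a quick check of the shift-and-add step, the sketch below writes the same hash twice: once with an explicit multiplication by 33 and once with `(h << 5) + h`. The function names `djb2` and `djb2_shift` and the `modulus` parameter are assumptions for illustration, so the example does not depend on the notebook's globals.

```python
def djb2(s, modulus):
    """DJB-style hash written with an explicit multiplication by 33."""
    h = 5381
    for ch in s:
        h = h * 33 + ord(ch)
    return h % modulus


def djb2_shift(s, modulus):
    """The same hash written with the shift-and-add form."""
    h = 5381
    for ch in s:
        h = ((h << 5) + h) + ord(ch)
    return h % modulus


for word in ("password", "123456", "qwerty"):
    # Both forms give the same value for every input.
    assert djb2(word, 1_500_000_001) == djb2_shift(word, 1_500_000_001)

print(djb2("a", 1_500_000_001))  # 5381 * 33 + 97 = 177670
```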

Polynomial hashing. We can choose a nonzero constant, $a \neq 1$, and calculate $(x_0 a^{k-1} + x_1 a^{k-2} + \dots + x_{k-2} a + x_{k-1})$.
The value of $a$ is usually a prime number.

In [9]:
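def hash_function2(string):
    hash_value = 0
    # Polynomial hash with a = 19; the character at index i is weighted by 19**i.
    for i, char in enumerate(string):
        # Three-argument pow reduces modulo prime, which keeps the numbers small.
        hash_value += ord(char) * pow(19, i, prime)
    return hash_value % prime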
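
The polynomial $x_0 a^{k-1} + x_1 a^{k-2} + \dots + x_{k-1}$ can also be evaluated without computing any powers, using Horner's rule. The sketch below is an alternative formulation, not the function used in this notebook (`hash_function2` weights the character at index i by 19**i, so its coefficients run in the opposite order). The name `polynomial_hash` and the defaults `a=31` and `modulus=1_000_000_007` are assumptions.

```python
def polynomial_hash(string, a=31, modulus=1_000_000_007):
    """Evaluate x0*a^(k-1) + x1*a^(k-2) + ... + x_(k-1) modulo `modulus`."""
    h = 0
    for ch in string:
        # Horner's rule: multiply the running value by a, then add the next coefficient.
        h = (h * a + ord(ch)) % modulus
    return h


print(polynomial_hash("ab"))  # 97 * 31 + 98 = 3105
```

Reducing modulo `modulus` at every step keeps the intermediate values bounded, whatever the length of the string.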

A tweak on the previous hash function

In [10]:
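def hash_function3(string):
    hash_val = 0
    for position, char in enumerate(string):
        # Raise each character code to the power of its position; the
        # three-argument pow reduces modulo prime so the numbers stay small.
        hash_val += pow(ord(char), position, prime)
    return hash_val % prime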

Hash function based on XOR

In [11]:
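def hash_function4(string):
    hash_val = 0  # avoid shadowing the built-in hash()
    for position, char in enumerate(string):
        # XOR the running hash with the character code, weighted by position.
        # The first character (position 0) contributes nothing.
        hash_val += (hash_val ^ ord(char)) * position
    return hash_val % prime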

For every password in df we call the four hash functions.
This gives four indices, and we set the entries
at those indices in the hash table to 1.

In [12]:
def update_hash_function():
    for i in range(len(df)):
        try:
            password = df[0][i]
            # Set the four positions selected by the hash functions.
            hash_table[hash_function1(password)] = 1
            hash_table[hash_function2(password)] = 1
            hash_table[hash_function3(password)] = 1
            hash_table[hash_function4(password)] = 1
        except IndexError:
            # Hashes are reduced modulo prime, which is larger than m, so an
            # index can occasionally fall outside the table; skip those.
            pass

We pass the passwords from the second file through the same hash functions
and look up the resulting indices in hash_table.
If all four entries are 1, we count the password as a possible duplicate.

In [13]:
def check_function():
    count = 0  # passwords from df2 that the filter reports as possibly present
    for i in range(len(df2)):
        password = df2[0][i]
        # A password counts only if all four of its positions are set.
        if (hash_table[hash_function1(password)] == 1
                and hash_table[hash_function2(password)] == 1
                and hash_table[hash_function3(password)] == 1
                and hash_table[hash_function4(password)] == 1):
            count += 1
    return count

Here we call update and check functions and print the results.

In [14]:
def BloomFilter():
    start = time.time()
    update_hash_function()
    duplicates = check_function()  # separate name so the global n is not shadowed
    end = time.time()
    print('Number of hash function used: ', k)
    print('Number of duplicates detected: ', duplicates)
    print('Probability of false positives: ', p)
    print('Execution time: ', end-start)

In [15]:
BloomFilter()

Number of hash function used:  4
Number of duplicates detected:  14075668
Probability of false positives:  0.0030018939981219244
Execution time:  20898.570882081985


These results were obtained on a MacBook Air, which is not designed for heavy computation. The execution time is in seconds, so the run took about 5.8 hours.