# First Exercise

I calculated the posterior probability as follows. First, a false positive occurs only when all **k** hashing functions return a position that is already set to true in the Bloom filter. Second, I assume that the events H_i = {the i-th hashing function returns a position that is set to true in the Bloom filter} are independent. With A = {false positive} and B = {the password is not contained in the first set}, where B is treated as the whole sample space omega, we get P(A|B) = P(A and B)/P(B) = P(A) = P(H_1 and ... and H_k) = P(H_1) * ... * P(H_k) = p^k, where each H_i is a Ber(p) event and **p = |positions set to true| / length of the Bloom filter**.
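As a minimal sketch of the formula above, the snippet below computes p and p^k on a toy filter. The function name `posterior_false_positive` and the toy values are illustrative assumptions and are not part of the notebook's own code.

```python
def posterior_false_positive(bloom_bits, k):
    """Estimate P(false positive) as p**k, where p is the fraction of positions set to True."""
    p = sum(bloom_bits) / len(bloom_bits)
    return p ** k

# toy filter (hypothetical): 10 positions, 5 of them set, so p = 0.5
toy_filter = [True, False] * 5
toy_estimate = posterior_false_positive(toy_filter, k=4)  # 0.5 ** 4
print(toy_estimate)
```

The estimate relies on the independence assumption stated above, so it is only as good as the hashing functions are at spreading values over the filter.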

In [1]:
import time
import numpy as np

def hashFun1(s, bloomLen):
    res = 1
    for i in range(len(s)):
        res *= int(ord(s[i]) / (i+2))
    return res % bloomLen

def hashFun2(s, bloomLen):
    res = 1
    for i in range(int(len(s)/2)):
        res *= int(ord(s[i]) / (i+1))
    return res % bloomLen

def hashFun3(s, bloomLen):
    res = 1
    for i in range(int(len(s)/3)):
        res *= int(ord(s[i]) / (i+1))
    return res % bloomLen

def hashFun4(s, bloomLen):
    res = 1
    for i in range(int(len(s)/4)):
        res *= int(ord(s[i]) / (i+1))
    return res % bloomLen

def initBloomFilter(bloomLen, path):
    # boolean array with every position initially set to False
    bloomFil = np.zeros(bloomLen, dtype=bool)
    # the with-statement closes the file even if an error occurs
    with open(path, "r", encoding="utf-8") as fPass1:
        for i in range(100000000):
            password = fPass1.readline()[:20]
            h1 = hashFun1(password, bloomLen); h2 = hashFun2(password, bloomLen)
            h3 = hashFun3(password, bloomLen); h4 = hashFun4(password, bloomLen)

            # set to True the positions returned by the hashing functions
            bloomFil[h1] = True; bloomFil[h2] = True; bloomFil[h3] = True; bloomFil[h4] = True
    return bloomFil
    

def passCheck(path, bloomLen, bloomFil):
    # counts the passwords for which all four positions are set in the filter
    n = 0
    with open(path, "r", encoding="utf-8") as fPass2:
        for i in range(3900000):
            password = fPass2.readline()[:20]
            h1 = hashFun1(password, bloomLen); h2 = hashFun2(password, bloomLen)
            h3 = hashFun3(password, bloomLen); h4 = hashFun4(password, bloomLen)
            if bloomFil[h1] and bloomFil[h2] and bloomFil[h3] and bloomFil[h4]:
                n += 1
    return n

# initialisation of variables: I chose 700000000 as the length of the Bloom filter because, with
# 4 hashing functions, a vector of that size gives a prior false positive probability of about 3.5%
k = 4; bloomLen = 700000000

start = time.time()
# initialises the Bloom filter array of boolean values
blFilter = initBloomFilter(bloomLen, "C:\\Users\\asus\\Desktop\\Algoritmic methods for data science\\ADM-HM4\\passwords1.txt")

# check whether each password of passwords2 is present in the filter
n = passCheck("C:\\Users\\asus\\Desktop\\Algoritmic methods for data science\\ADM-HM4\\passwords2.txt", bloomLen, blFilter)

# calculating the prior and posterior probability of a false positive
# (1/2.71 approximates 1/e in the formula (1 - e^(-k*n/m))^k)
pPr = ((1 - (1/2.71) ** ((k * 100000000) / bloomLen)) ** k)
# number of positions set to True in the filter
numTrue = np.count_nonzero(blFilter)
pPos = (numTrue/bloomLen) ** k

end = time.time()

print('Number of hash function used: ', k)
print('Number of duplicates detected: ', n)
print('Probability of false positives previous the creation of the bloom filter: ' + str(round(pPr * 100, 7)) + "%")
print('Probability of false positives after the creation of the bloom filter: ' + str(round(pPos * 100, 7)) + "%")
print('Execution time: ', end-start)

Number of hash function used:  4
Number of duplicates detected:  3858137
Probability of false positives previous the creation of the bloom filter: 3.5575%
Probability of false positives after the creation of the bloom filter: 0.0%
Execution time:  2032.940449476242
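
The prior estimate printed above comes from the classical approximation (1 - e^(-k*n/m))^k, which assumes hashing functions that spread their values uniformly over the filter. A minimal sketch of the same computation with `math.exp` follows; the function name `prior_false_positive` is an illustrative assumption. Because the cell above approximates e with 2.71, the value printed here should be close to, but not identical to, the 3.5575% reported above.

```python
import math

def prior_false_positive(n_items, bloom_len, k):
    """Approximate false positive rate (1 - e^(-k*n/m))^k for n items, m positions, k hashes."""
    return (1 - math.exp(-k * n_items / bloom_len)) ** k

# same parameters as the notebook: 100000000 passwords, 700000000 positions, 4 hashing functions
prior = prior_false_positive(100_000_000, 700_000_000, 4)
print(f"{prior * 100:.4f}%")
```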


### Bonus Part

To calculate the actual number of false positives, we have to find the passwords of the second dataset that are really contained in the first one, and then subtract that count from the number of duplicates detected by the Bloom filter in the previous step.
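
The subtraction described above can be sketched on toy data as follows. This sketch swaps in a Python `set` for the membership test instead of a file scan, and it assumes the reference passwords fit in memory, which may not hold for a file of 100000000 passwords. The function name `count_false_positives` and the sample lists are hypothetical.

```python
def count_false_positives(reference_passwords, candidate_passwords, n_flagged):
    """Return n_flagged minus the number of candidates truly present in the reference set."""
    reference = set(reference_passwords)  # average O(1) membership test
    n_present = sum(1 for p in candidate_passwords if p in reference)
    return n_flagged - n_present

# toy data (hypothetical): the filter flagged 3 candidates, but only 2 are really present
reference = ["aaa", "bbb", "ccc"]
candidates = ["bbb", "zzz", "ccc", "yyy"]
false_positives = count_false_positives(reference, candidates, n_flagged=3)
print(false_positives)
```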

In [None]:
# initialise variables and open files
numEff = 0
fPass1 = open("C:\\Users\\asus\\Desktop\\Algoritmic methods for data science\\ADM-HM4\\passwords1.txt", "r", encoding="utf-8")
fPass2 = open("C:\\Users\\asus\\Desktop\\Algoritmic methods for data science\\ADM-HM4\\passwords2.txt", "r", encoding="utf-8")

start = time.time()
# check whether each password of passwords2 is present in passwords1
for i in range(3900000):
    passwordCheck = fPass2.readline()[:20]

    # linear scan of passwords1 for the current password
    for j in range(100000000):
        password = fPass1.readline()[:20]
        if passwordCheck == password:
            numEff += 1
            break
    # rewind passwords1 so that the next scan starts again from its first line
    fPass1.seek(0)
end = time.time()

fPass1.close()
fPass2.close()

print("Number of false positives: ", n-numEff)
print('Execution time: ', end-start)