# Bloom Filter

![](https://upload.wikimedia.org/wikipedia/commons/a/ac/Bloom_filter.svg)

(image stolen from [Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter) :)

## Find the False Positive Probability

Define
$$m:=\text{hash table size}$$
$$n:=\text{number of inserted items}$$
$$k:=\text{number of hash functions used}$$

Assume the hash functions are independent of each other and map keys uniformly over the $m$ slots (a simplification for now; real hash functions only approximate this).

Then the probability that a certain bit (slot of the hash table) is not set (to 1) by a certain hash function is
$$1-\frac{1}{m}.$$
The probability that it is not set by any of the $k$ hash functions while inserting one item is
$$\left(1-\frac{1}{m}\right)^k.$$

Note that
$$\left(1-\frac{1}{m}\right)^m = \left(1-\frac{1}{m}\right)^{(-m)\cdot(-1)} \to e^{-1}\quad (m \to \infty),$$
thus
$$\left(1-\frac{1}{m}\right)^k = \left(1-\frac{1}{m}\right)^{m\cdot \frac{k}{m}} \approx e^{-\frac{k}{m}}$$
for large $m$.
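
As a quick sanity check of this approximation, here is a minimal Python sketch that compares the exact value with $e^{-k/m}$ for a few table sizes. The function names `exact_unset` and `approx_unset` and the sample values of $m$ and $k$ are made up for illustration only.

```python
import math

def exact_unset(m: int, k: int) -> float:
    """Probability that one bit stays unset after k independent uniform hashes."""
    return (1 - 1 / m) ** k

def approx_unset(m: int, k: int) -> float:
    """Large-m approximation e^{-k/m} of the same probability."""
    return math.exp(-k / m)

k = 7  # arbitrary example value
for m in (10, 100, 10_000):
    # The gap between the two columns shrinks as m grows.
    print(m, exact_unset(m, k), approx_unset(m, k))
```

The exact value is always slightly below the approximation, because $\ln(1-x) < -x$ for $0 < x < 1$.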

Therefore, after we have inserted $n$ items (that is, after $kn$ hash evaluations in total), the probability that a certain bit is still unset
is $\left(1-\frac{1}{m}\right)^{kn}$, which is approximately $e^{-kn/m}.$

Hence, the probability that it is set to 1 is

$$1-\left(1-\frac{1}{m}\right)^{kn} \approx 1-e^{-kn/m},$$

and the probability that all $k$ hash functions map a new (not-in-table) key to $k$ already-set bits (i.e. the false positive probability), treating those $k$ bits as independent, is

$$\left[1-\left(1-\frac{1}{m}\right)^{kn}\right]^k \approx \left(1-e^{-kn/m}\right)^k.$$
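
To see how well this formula matches reality, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: the class `BloomFilter`, the helpers `predicted_fp_rate` and `measured_fp_rate`, the use of salted SHA-256 as a stand-in for $k$ independent hash functions, and the sample parameters.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit: clear, not compact

    def _positions(self, key: str):
        # Assumption of this sketch: salting SHA-256 with the index i
        # behaves like k independent uniform hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

def predicted_fp_rate(m: int, n: int, k: int) -> float:
    """The approximation (1 - e^{-kn/m})^k derived above."""
    return (1 - math.exp(-k * n / m)) ** k

def measured_fp_rate(m: int, n: int, k: int, trials: int = 20_000) -> float:
    """Insert n keys, then query keys that were never inserted."""
    bf = BloomFilter(m, k)
    for i in range(n):
        bf.add(f"in-{i}")
    hits = sum(f"out-{j}" in bf for j in range(trials))
    return hits / trials

m, n, k = 10_000, 1_000, 7  # example parameters, m/n = 10
print(predicted_fp_rate(m, n, k), measured_fp_rate(m, n, k))
```

The two printed numbers should be close but not identical, since the measured rate is a random sample and the formula itself is an approximation.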


## Find the Optimal Number of Hash Functions 

<img src="img/plot1_m_over_n=5.svg" width="500"> <img src="img/plot2_m_over_n=10.svg" width="500"> 

```octave
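t = .2; k = 1:.5:10;                                  % t = n/m (here m/n = 5); k = candidate numbers of hash functions
p = (1-exp(-t*k)).^k; plot(k,p,'-o','LineWidth',1)    % approximate false positive probability as a function of k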
```

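Minimizing the false positive probability is the same as minimizing its logarithm, so let
$$ L = \ln\left[\left(1-e^{-kn/m}\right)^k\right] = k \ln\left(1-e^{-kn/m}\right),$$
and set its derivative to 0:
$$ 0 = \frac{dL}{dk} = \ln\left(1-e^{-kn/m}\right) + k\cdot \frac{n}{m}\cdot \frac{e^{-kn/m}}{1-e^{-kn/m}}. $$
Let $ u = e^{-kn/m} $, so that $kn/m = -\ln u$. Substituting and multiplying through by $1-u$ gives
$$u\ln u = (1-u)\ln(1-u).$$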

Plot it on [GeoGebra](https://www.geogebra.org/graphing?lang=en)

<img src="img/geogebra_ulnu=(1-u)ln(1-u).svg" width="500">

Ha! $u\ln u = (1-u)\ln(1-u)$ is a symmetric equation (if we let $t=1-u$, the equation becomes $(1-t)\ln(1-t)=t\ln t$, which is the same equation in $t$),
so we can guess $u = 1-u$ (obviously this is one solution, but is it the only valid one? How can we prove it?),
and therefore $u=\frac{1}{2}$ (we can also read it off the plot above).
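
One possible way to settle the uniqueness question is sketched below. The helper function $g$ is introduced here only for this argument. Let

$$g(u) = u\ln u - (1-u)\ln(1-u), \qquad 0<u<1.$$

Its derivative is

$$g'(u) = (\ln u + 1) + (\ln(1-u) + 1) = \ln\big(u(1-u)\big) + 2,$$

which vanishes exactly when $u(1-u) = e^{-2}$, i.e. at

$$u_\pm = \frac{1 \pm \sqrt{1-4e^{-2}}}{2} \approx 0.16,\ 0.84.$$

So $g$ decreases on $(0, u_-)$, increases on $(u_-, u_+)$, and decreases again on $(u_+, 1)$. Since $g(u)\to 0$ as $u\to 0^+$ and as $u\to 1^-$, we get $g<0$ on $(0,u_-]$ and $g>0$ on $[u_+,1)$, and on the increasing middle piece $g$ crosses zero exactly once, at $u=\frac{1}{2}$. Hence $u=\frac{1}{2}$ is the only solution in $(0,1)$. The endpoints $u\to 1$ and $u\to 0$ correspond to $k\to 0$ and $k\to\infty$, where $L\to 0$ (false positive probability tending to 1), while $L<0$ in between, so this single critical point is indeed the minimum.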

This means 
$$e^{-kn/m} = \frac{1}{2},$$

$$k = \frac{m}{n}\, \ln 2,$$

which is our optimal number of hash functions (not exact, since $k$ must be an integer and the derivation relies on the large-$m$ approximation, but a close approximation :-)
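
Because $k$ has to be a whole number in practice, one reasonable approach is to compare the two integers around $\frac{m}{n}\ln 2$ and keep the better one. The Python sketch below does that for the two ratios plotted earlier ($m/n = 5$ and $m/n = 10$). The names `fp_rate` and `best_integer_k` are hypothetical helpers, and the false positive rate is the approximate formula from above, not an exact value.

```python
import math

def fp_rate(k: int, bits_per_item: float) -> float:
    """Approximate false positive probability (1 - e^{-kn/m})^k, with bits_per_item = m/n."""
    return (1 - math.exp(-k / bits_per_item)) ** k

def best_integer_k(bits_per_item: float) -> int:
    """Pick the better of floor and ceil of the real-valued optimum (m/n) ln 2."""
    k_real = bits_per_item * math.log(2)
    # A filter needs at least one hash function, hence the max(1, ...).
    candidates = {max(1, math.floor(k_real)), max(1, math.ceil(k_real))}
    return min(candidates, key=lambda k: fp_rate(k, bits_per_item))

for bits_per_item in (5, 10):
    k = best_integer_k(bits_per_item)
    print(bits_per_item, k, fp_rate(k, bits_per_item))
```

Checking only the two neighbouring integers is enough here because the approximate false positive probability has a single minimum in $k$. At the real-valued optimum $e^{-kn/m} = \frac{1}{2}$, so the approximate false positive probability there is $\left(\frac{1}{2}\right)^{k} = 2^{-\frac{m}{n}\ln 2} \approx 0.6185^{m/n}$.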