# Bloom Filters

- A lightweight alternative to a hash table for set-membership checks
- Efficient insertions and lookups
- More space efficient than a hash table, at the cost of occasional "false positives" on lookup
- Never produces a false negative
- A probabilistic data structure

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It is useful when we only need to know whether an element belongs to the set, not to retrieve it. A Bloom filter uses k hash functions and an array of m bits, all initially 0. Adding an element sets the k bits that its hashes map to. On lookup, if any of those k bits is 0 the element is definitely not in the set; if all of them are 1 the element is probably in the set.

## Why use Bloom Filters?
- You need fast lookups and insertions
- You care about how much space the data structure uses
- You don't mind if the data structure sometimes reports an item as present when in fact it is not

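Examples:

- Keeping track of blocked IP addresses: a blocked address is never let through (no false negatives), but an address that is not on the blocklist may occasionally be refused (a false positive). A Bloom filter fits only if that occasional refusal is acceptable.
- Password validators (strong/weak): checking a candidate password against a list of weak or forbidden passwords.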

| Hash Tables     | Bloom Filters |
|-----------------|---------------|
| In a hash table, the object is stored in the bucket (index position in the table) that the hash function maps it to.| A Bloom filter doesn't store the object itself. It only tells whether the object may be in the filter or not.|
| Hash tables are less space efficient.| Bloom filters are more space efficient. Their size can be even smaller than the objects they represent.|
| Support deletions.| Elements cannot be deleted from a Bloom filter.|
| Hash tables give accurate results.| Bloom filters have a small false positive probability. (A false positive means the filter reports an element as present when it was never added.)|
| A hash table needs either multiple hash functions or one strong hash function to minimize collisions.| A Bloom filter uses several hash functions. There is no need to handle collisions.|
| Hash tables are used in compilers, programming languages (hash-table-based data structures), password verification, etc.| Bloom filters are used in network routers, in web browsers (to detect malicious URLs), in password checkers (to reject weak, guessable, or forbidden passwords), etc.|

### Collision meaning
Suppose the hash function is key % table_size with table_size = 17. For key = 256, its hash is 256 % 17 = 1. <br>
For key = 37599, its hash is 37599 % 17 = 12 <br>
But for key = 573, its hash is also 573 % 17 = 12 <br>
So with this hash function, different keys can have the same hash. This is called a collision.
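The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `division_hash` is chosen here for illustration and is not part of the notes.

```python
from collections import defaultdict

def division_hash(key: int, table_size: int) -> int:
    """Division-method hash: the remainder of key divided by table_size."""
    return key % table_size

TABLE_SIZE = 17
keys = [256, 37599, 573]

# Group the keys by the bucket index they hash to.
buckets = defaultdict(list)
for key in keys:
    buckets[division_hash(key, TABLE_SIZE)].append(key)

# A collision is any bucket that received more than one key.
collisions = {index: ks for index, ks in buckets.items() if len(ks) > 1}

print(dict(buckets))  # {1: [256], 12: [37599, 573]}
print(collisions)     # {12: [37599, 573]}
```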

Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.

Adding an element never fails. However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a positive result.
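The saturation effect can be illustrated with a small simulation. This sketch does not hash real items: as an assumption, it models k ideal hash functions by drawing k random bit positions per inserted item from a seeded generator. The function name `fill_fractions` and the parameter values are made up for the example.

```python
import random

def fill_fractions(m, k, checkpoints, seed=0):
    """Insert simulated items and record the fraction of set bits at each checkpoint."""
    rng = random.Random(seed)  # fixed seed so repeated runs behave the same
    bits = [False] * m
    result = {}
    for n in range(1, max(checkpoints) + 1):
        # Each simulated item sets k positions chosen uniformly at random.
        for _ in range(k):
            bits[rng.randrange(m)] = True
        if n in checkpoints:
            result[n] = sum(bits) / m
    return result

K = 3
fractions = fill_fractions(m=32, k=K, checkpoints=[5, 20, 500])
for n, fill in fractions.items():
    # A query for an absent item passes all k checks with probability of about fill**k.
    print(f"n={n:3d}  bits set={fill:.2f}  approx. false positive rate={fill ** K:.3f}")
```

Once every bit is set, the fraction is 1.0 and so is the estimated false positive rate: every query answers "present".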

Bloom filters never generate a false negative, i.e., they never report that an element (say, a username) doesn't exist when it has actually been added.

Deleting elements from a filter is not possible because clearing the bits at the indices generated by the k hash functions for one element may also clear bits shared with other elements, which would effectively delete those elements too.
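A tiny deterministic example makes the deletion problem concrete. The two "hash functions" below (last digit and tens digit of an integer key) are toy functions chosen only so that the overlap is easy to see; `naive_delete` is a hypothetical operation shown to expose the problem, not something a Bloom filter supports.

```python
M = 10  # number of bits in the toy filter

def positions(key):
    """Two toy hash functions for illustration: last digit and tens digit."""
    return {key % M, (key // 10) % M}

def add(bits, key):
    for p in positions(key):
        bits[p] = True

def contains(bits, key):
    return all(bits[p] for p in positions(key))

def naive_delete(bits, key):
    # Clearing bits is NOT a valid Bloom filter operation; it may erase
    # bits that other keys rely on.
    for p in positions(key):
        bits[p] = False

bits = [False] * M
add(bits, 12)           # sets bits 2 and 1
add(bits, 32)           # sets bits 2 and 3 (bit 2 is shared with 12)
naive_delete(bits, 12)  # clears bits 2 and 1

print(contains(bits, 32))  # False: 32 was added but is now reported absent
```

The result is a false negative, which a Bloom filter must never produce.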

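In [17]:
class BloomFilter:
    """A minimal Bloom filter backed by a list of booleans."""

    def __init__(self, size, num_hashes):
        self.size = size              # m: number of bits
        self.num_hashes = num_hashes  # k: number of probes per item
        self.bit_array = [False] * size

    def _indexes(self, item):
        # Division Method: offset the built-in hash by i, then reduce modulo size.
        # Note: this probes consecutive positions rather than using independent
        # hash functions, and hash() of a str varies between Python processes.
        return [(hash(item) + i) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for index in self._indexes(item):
            self.bit_array[index] = True

    def contains(self, item):
        # False means "definitely not added"; True means "possibly added".
        return all(self.bit_array[index] for index in self._indexes(item))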


We initialize a bit array of length size with all bits set to False.

When an item is added to the Bloom filter, we use the Division Method to compute num_hashes indices for the item, (hash(item) + i) % size for i from 0 to num_hashes - 1, and set the corresponding bits in the bit array to True. These indices are consecutive positions derived from one hash value, which keeps the example simple but is weaker than using independent hash functions.

When we want to check if an item is in the Bloom filter, we compute the same num_hashes indices for the item and check whether all of the corresponding bits in the bit array are True.

If any of the bits is False, we can be sure that the item is not in the Bloom filter.

If all of the bits are True, the item may or may not be in the filter (there is a chance of a false positive, because other items may have set those bits).
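The class above relies on Python's built-in hash(), whose value for strings changes from one process to the next. As a hedged alternative, the sketch below derives each of the k indices from a SHA-256 digest of the item prefixed with a different salt, so the same item always maps to the same bits. The class name `SaltedBloomFilter` and the salting scheme are assumptions made for this illustration, not the method used elsewhere in these notes.

```python
import hashlib

class SaltedBloomFilter:
    """Bloom filter sketch whose k indices come from salted SHA-256 digests."""

    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = [False] * size

    def _indexes(self, item):
        indexes = []
        for salt in range(self.num_hashes):
            # A different salt per probe acts like a different hash function.
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            indexes.append(int.from_bytes(digest[:8], "big") % self.size)
        return indexes

    def add(self, item):
        for index in self._indexes(item):
            self.bit_array[index] = True

    def contains(self, item):
        return all(self.bit_array[index] for index in self._indexes(item))

bf = SaltedBloomFilter(size=64, num_hashes=3)
for fruit in ("apple", "banana", "cherry"):
    bf.add(fruit)

# Added items are always found: there are no false negatives.
print(all(bf.contains(f) for f in ("apple", "banana", "cherry")))  # True
# An item that was never added is usually rejected, but a false positive is possible.
print(bf.contains("orange"))
```

A cryptographic hash is slower than needed here; it is used only because it ships with the standard library and is stable across runs.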

In [18]:
# Initialize the Bloom filter with size=10 and num_hashes=3
bloom_filter = BloomFilter(size=10, num_hashes=3)

# Add some items to the Bloom filter
bloom_filter.add("apple")
bloom_filter.add("banana")
bloom_filter.add("cherry")

# Check if some items are in the Bloom filter
print(bloom_filter.contains("apple"))   # True
print(bloom_filter.contains("banana"))  # True
print(bloom_filter.contains("cherry"))  # True
print(bloom_filter.contains("orange"))  # expected False, unless a false positive occurs
print(bloom_filter.contains("grape"))   # expected False, unless a false positive occurs

True
True
True
True
False


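"orange" was never added, yet contains() returned True. This is a false positive.

It happens because contains() may return True for an item that was not actually added to the Bloom filter, when other items have already set all of the bits it checks. The probability of a false positive depends on the size of the bit array, the number of hash functions used, and the number of items added. Because Python randomizes string hashes per process, which item (if any) produces a false positive can differ from run to run.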

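Assuming k independent, uniformly distributed hash functions, the probability of a false positive can be estimated using the following formula: ```p = (1 - e^(-kn/m))^k```. The example implementation above derives all of its indices from a single hash value, so for it the formula is only a rough guide.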

where p is the probability of a false positive, k is the number of hash functions used, n is the number of items added to the Bloom filter, and m is the size of the bit array. e is the mathematical constant 2.71828... (also known as Euler's number).

In our example, we used a bit array of size 10 and 3 hash functions. Let's say we added 3 items to the Bloom filter, which means n = 3. Plugging in these values into the formula, we get:

```p = (1 - e^(-3*3/10))^3```

```p ≈ 0.209```

This means that there's roughly a 20.9% chance of a false positive, which is high and is consistent with the false positive for "orange" observed above.

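As we add more items to the Bloom filter, the probability of false positives increases further. To reduce it, we can increase the size of the bit array and choose the number of hash functions to match: for given m and n, the false positive rate is lowest at about k = (m / n) * ln(2), and adding more hash functions beyond that point makes it worse. The required m and k for a target probability p can be computed as follows.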
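The formula is easy to evaluate directly. The helper below is a small sketch (the name `false_positive_rate` is chosen here for illustration) that computes the estimate for the 10-bit, 3-hash example and shows how it grows with the number of items.

```python
import math

def false_positive_rate(m, k, n):
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# The worked example: m = 10 bits, k = 3 hash functions, n = 3 items.
print(round(false_positive_rate(m=10, k=3, n=3), 3))  # 0.209

# The estimate rises as more items are added to the same filter.
for n in (1, 3, 5, 10):
    print(n, round(false_positive_rate(m=10, k=3, n=n), 3))
```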

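For example, to target a false positive probability of p = 0.001 with n = 3 items:

p = 0.001 <br>
n = 3

m = -n * ln(p) / (ln(2))^2 <br>
m = -3 * ln(0.001) / (ln(2))^2 <br>
m ≈ 43.13

k = (m / n) * ln(2) <br>
k = (43.13 / 3) * ln(2) <br>
k ≈ 9.97

So to achieve a false positive probability below 0.1%, we would need a bit array of at least 44 bits and about 10 hash functions. The next cell uses a smaller configuration, 24 bits and 5 hash functions, for which the formula above gives p ≈ 0.022 (about 2.2%), which is well short of the 0.1% target.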
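The sizing formulas can be wrapped in a helper so the arithmetic does not have to be done by hand. This is a sketch under the same independence assumption as the formulas; the function name `optimal_parameters` and the choice to round m up and k to the nearest integer are assumptions made here.

```python
import math

def optimal_parameters(n, p):
    """Return (m, k): bits and hash functions for n items at target rate p."""
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest integer and at least 1.
    k = max(1, round((m / n) * math.log(2)))
    return m, k

print(optimal_parameters(n=3, p=0.001))  # (44, 10)
```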

In [19]:
# Initialize the Bloom filter with size=24 and num_hashes=5
bloom_filter = BloomFilter(size=24, num_hashes=5)

# Add some items to the Bloom filter
bloom_filter.add("apple")
bloom_filter.add("banana")
bloom_filter.add("cherry")

# Check if some items are in the Bloom filter
print(bloom_filter.contains("apple"))   # True
print(bloom_filter.contains("banana"))  # True
print(bloom_filter.contains("cherry"))  # True
print(bloom_filter.contains("orange"))  # False
print(bloom_filter.contains("grape"))   # False

True
True
True
False
False
