# RA3 Bloom Filters

## 145 Hashing Outline

In this lecture we're going to talk about hashing, and we're going to describe a popular technique called Bloom filters. Before we get into hashing, we're going to talk about a toy problem: balls into bins. We are going to throw balls randomly into a set of bins. We'll do some simple probabilistic analysis of this problem, and this will give us some intuition about the design of hashing schemes. One of the neat ideas we're going to see with the balls-into-bins problem is the power of two choices. This is going to motivate some of our more sophisticated hashing schemes. After we look at the toy problem, we'll look at the traditional hashing scheme, chain hashing. And then we'll look at our more sophisticated scheme, Bloom filters. So let's dive into balls into bins.

## 146 Balls Into Bins

For this toy problem, we have n balls, which are identical to each other, and we have n bins, which we'll label B1, B2, up through Bn. Each ball is thrown into a random bin, and this is done independently of what happens to the other balls. We want to look at the number of balls that are assigned to each bin. Therefore, we look at the random variable load(i), which is the number of balls assigned to bin Bi. What we're interested in is the maximum load. This is the maximum number of balls in any particular bin. In other words, the max load is the max over i of load(i). How large can the max load get? Well, in the worst case, all of the balls might get assigned to the same bin. But what's the chance of that? It's quite small. What's the chance of all n balls being assigned to bin B1, let's say? The probability of one particular ball being assigned to bin B1 is 1/n, so the chance that all n balls get assigned to this particular bin is (1/n)^n. This is really tiny, so this is very unlikely. In the worst case, the max load could be n because of such a scenario. But what is the typical scenario? How large is the max load typically? That's what we want to analyze now. We want to make a statement such as: with high probability, the max load is at most some quantity, such as square root of n, log n, or order one. So, let's dive in and see what the max load is in a typical scenario.
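To get a feel for the typical max load before doing any analysis, the experiment can be simulated. The following is a minimal sketch, not part of the lecture: the function name `throw_balls`, the seed, and the choice n = 1000 are all assumptions made for illustration.

```python
import random

def throw_balls(n, rng):
    """Throw n balls into n bins, each ball independently and uniformly at random.

    Returns a list where entry i is the load (number of balls) of bin i.
    """
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return loads

rng = random.Random(0)   # fixed seed (arbitrary choice) so runs are repeatable
n = 1000                 # illustrative size, not from the lecture
loads = throw_balls(n, rng)
print("max load in this run:", max(loads))
print("worst possible max load:", n)
```

Running this for a few values of n suggests that the max load stays far below n, which is what the analysis in the next sections makes precise.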

## 147 Probability Quiz Question

Let's take a short probability quiz to give a quick refresher on some basic probability. What is the probability that the first log n balls are all assigned to bin Bi, for a particular i? Go ahead and write the quantity here.

## 148 Probability Quiz Solution

Now, the solution is (1/n)^(log n). Why is that? Let's look at it more closely. Take a particular ball j. What is the probability that this ball is assigned to this particular bin Bi? Well, the ball is going to a randomly chosen bin, so the chance of going to any particular bin is 1/n. Therefore, the first ball has probability 1/n of being assigned to bin Bi. The second ball, same, has probability 1/n of being assigned to bin Bi, and so on for all of these log n balls. The balls are independent, so the probability that all of them are assigned to bin Bi is the product of 1/n for each. So the total probability is 1/n raised to the power log n.

## 149 Analysis Setup

Now let's dive into the analysis of the maximum load. What we've seen so far is that for a particular set of log n balls, the probability that these log n balls are all assigned to a particular bin Bi is (1/n)^(log n). We're going to try to show that the max load is typically at most log n. Now, what does typically mean? We're going to have to detail exactly what we mean by that. In order to prove that the max load is at most log n, we want to upper bound the probability that a particular bin Bi gets load at least log n, and show that this probability is small. In order for bin Bi to get load at least log n, some set of log n balls has to get assigned to bin Bi. Maybe more than log n balls get assigned to Bi, but we know that there are at least log n balls assigned to Bi. So we're going to get an upper bound on this probability. First, we have to choose the particular set of log n balls that are going to be assigned to bin Bi. How many ways are there of choosing the log n balls? There are n choose log n. Now, for this particular set of log n balls, what's the chance that they are all assigned to bin Bi? Well, that's what we found before: (1/n)^(log n). Now what happens for the other n - log n balls? Well, some of them might get assigned to bin Bi as well, in which case we may be counting these events multiple times. So we're getting an upper bound on this probability. Notice what would happen if we had an extra term here, (1 - 1/n)^(n - log n). This term says that all of the other balls, besides the log n balls that we chose, are assigned to other bins: the probability that a particular ball is not assigned to bin Bi is 1 - 1/n, and there are n - log n such balls. Then what would the resulting quantity be?
With that extra term, this is actually equal to the probability that bin Bi gets load exactly log n. But that's not what we want to bound. We want to bound the probability that the bin gets load at least log n, so we want an upper bound. We ignore where the other n - log n balls are assigned, and then we get an upper bound on the probability because we allow these n - log n balls to be assigned to any bin, maybe Bi or maybe a different bin. Now let's try to get a handle on this term, n choose log n. Let's look at it more generally: n choose k. Recall that n choose k is n! / ((n - k)! k!). If we expand this out, we have n factorial in the numerator, but all the terms from n - k downwards cancel with the (n - k) factorial in the denominator. So in the numerator we get n times n - 1, down to n - k + 1. And what are we left with in the denominator? We're left with k factorial, which is k times k - 1 times k - 2, and so on down to 1. Let's try to get a handle on this quantity. Notice the first ratio is n/k. The second is similar: (n - 1)/(k - 1). If n is big, that's pretty similar to n/k, and so on. So we have n/k, (n - 1)/(k - 1), and so on down to (n - k + 1)/1. So there are k ratios there. Each one, let's say, is approximately n/k, so the product is approximately (n/k)^k. Actually, this approximation is not too bad. What one can show using Stirling's formula is that n choose k is at most (ne/k)^k. So if we put an extra factor of e in the numerator, then we get a rigorous upper bound on n choose k. And that's what we're aiming for: an upper bound on the chance that bin Bi gets a large load.
So we can upper bound n choose log n by using this formula. Plugging in this bound for our case, we have k = log n. So we get the upper bound (ne/log n)^(log n) for the n choose log n term, and then we still have the term (1/n)^(log n). These are both raised to the power log n, so we can cancel this n with that n. What are we left with? We're left with (e/log n)^(log n). Now notice the denominator is growing with n, whereas the numerator is a fixed constant. So as n grows, e/log n becomes smaller and smaller. We're looking at this asymptotically as a function of n, so we can bound this inner term by any fixed constant we want. Let's bound it by the constant one quarter. So we're going to say e/log n is at most 1/4, and so this whole quantity is bounded by (1/4)^(log n). Now what did we assume about n here? We used the fact that log n is at least 4e. When is that true? That's true when n is big enough. In particular, 4e is a bit less than 11, so if n is at least 2^11, then log n is bigger than 4e and we can replace e/log n by 1/4. Now what is the nice thing about using one fourth here? Well, assuming that the log is base two, (1/4)^(log n) is equal to 1/n^2. So in summary, we've shown that the probability that bin Bi gets load at least log n is at most 1/n^2, which is tiny as n grows.
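The chain of bounds from this section can be written in one display. This is only a restatement of the steps above, under the assumptions that logs are base 2 and that log n is treated as an integer (the lecture leaves the rounding implicit):

```latex
\begin{align*}
\Pr[\mathrm{load}(i) \ge \log n]
  &\le \binom{n}{\log n}\left(\frac{1}{n}\right)^{\log n} \\
  &\le \left(\frac{ne}{\log n}\right)^{\log n}\left(\frac{1}{n}\right)^{\log n}
   = \left(\frac{e}{\log n}\right)^{\log n} \\
  &\le \left(\frac{1}{4}\right)^{\log n} = \frac{1}{n^{2}}
  \qquad \text{for } \log_2 n \ge 4e .
\end{align*}
```

The first inequality is the counting step (choose which log n balls land in bin Bi and ignore the rest), the second is the Stirling-type bound on n choose k, and the last uses log n at least 4e.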

## 150 Max Load Quiz Question

Now what we saw in the last slide was that the probability that a particular bin Bi gets load at least log n is at most 1/n^2. Now, can we use that to bound the max load? Can we upper bound the probability that the max load is at least log n? Why don't you go ahead and write this quantity here. If you have trouble doing it, don't worry, we'll go through it in a minute.

## 151 Max Load Quiz Solution

The solution is 1/n: the probability that the max load is at least log n is at most 1/n. Let's go through it in more detail to see why this is the case.

## 152 Max Load Analysis

Now we want to bound the probability that the max load is at least log n. In order for the maximum load to be at least log n, at least one bin, maybe several, has to have load at least log n. In order to upper bound this probability, we can use our earlier analysis. We know the probability that a particular bin gets load at least log n is at most 1/n^2. So now, in order for at least one bin to get load at least log n, let's look at all the choices for the bin. We can sum over the bins the probability that the load in bin Bi is at least log n, which is what we bounded before; this is the union bound. There are n choices for the bin, and for each particular bin the probability that the load is at least log n is at most 1/n^2, so the sum is at most 1/n. This proves the result: the max load is at least log n with probability at most 1/n. Now, we want to look at the complementary event, that the max load is strictly less than log n. What is the probability of this? Well, the original event is unlikely; it happens with small probability. So the complementary event happens with large probability. In particular, it happens with probability at least 1 - 1/n. In fact, one can do a similar analysis a little more carefully and show that the max load is Theta(log n / log log n). So we can get an upper bound and a lower bound which are within constant factors of each other. Instead of an upper bound of log n on the max load, we get the slightly better bound of log n over log log n. And this error probability can be improved from 1/n to 1/n^2 or 1/n^10, any inverse polynomial, by changing the constant up front, that is, by bounding the max load by c log n for a larger constant c. When this error probability is at most 1 over a polynomial in n, we say that the event happens with high probability.
In particular, if an event happens with high probability, that means it happens with probability at least 1 minus 1 over some polynomial in n, and we can make this error term as small as we want by increasing the constant up front.
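The bounds from the analysis can be checked numerically for one concrete size. The sketch below is an illustration only: the helper name `tail_at_least` and the choice n = 2^12 (so log2 n = 12, which exceeds 4e) are assumptions, and the exact tail is computed from the binomial distribution of a single bin's load.

```python
import math

def tail_at_least(n, k):
    """Exact P(load of one fixed bin >= k) for n balls thrown into n bins."""
    p = 1.0 / n
    # Start from P(load == k), then step to P(load == j + 1) using the ratio of
    # consecutive binomial terms, which avoids huge intermediate numbers.
    term = math.comb(n, k) * p**k * (1 - p) ** (n - k)
    total = 0.0
    for j in range(k, n + 1):
        total += term
        term *= (n - j) / (j + 1) * p / (1 - p)
    return total

n = 2**12                              # example size; log2(n) = 12 > 4e
k = 12                                 # k = log2(n)
exact = tail_at_least(n, k)            # true probability for one bin
counting = math.comb(n, k) / n**k      # (n choose k) * (1/n)^k
stirling = (math.e / k) ** k           # (e / log n)^(log n)
per_bin = 0.25 ** k                    # (1/4)^(log n), which equals 1/n^2
max_load_bound = n * per_bin           # union bound over the n bins, equals 1/n

print(exact, counting, stirling, per_bin, max_load_bound)
```

Each quantity printed is at most the next one up to `per_bin`, mirroring the order of the steps in the proof, and the final value is the 1/n bound on the probability that the max load reaches log n.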

## 153 Best of Two Scheme

Now, let's look at the following twist on the balls and bins problem. This is going to be a better approach, better in the sense that it's going to reduce the max load. In the previous balls and bins problem, each ball went to a random bin, independently of the assignment for the other balls. Therefore, we could have assigned all the balls simultaneously, in parallel, or we could have assigned them one by one, sequentially: take ball 1 and assign it to a random bin, take ball 2 and assign it to a random bin, and so on, up to ball n. That's what we're going to do here. We're going to assign the balls sequentially, and we're going to look at a slight twist on the other scheme. So, let's go through the balls from 1 through n; the index i will correspond to the current ball that we're considering. In the old scheme, we chose one random bin, say j, and we assigned the ith ball to bin Bj. In the new scheme, we're going to choose two random bins, say j and k. Now, to which bin are we going to assign the ith ball? We're going to assign it to the better of these two bins. What exactly do we mean? We mean the one which has the smaller load. So, let load(j) and load(k) denote the current loads of bins Bj and Bk. If load(j) is strictly smaller than load(k), then we assign the ith ball to bin Bj. Otherwise, we assign the ith ball to bin Bk. So the ith ball goes to the less loaded of these two bins.
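The sequential best-of-two rule can be sketched in a few lines. This is an illustrative sketch, not the lecture's code: the function name `best_of_two`, the seed, and n = 1000 are assumptions.

```python
import random

def best_of_two(n, rng):
    """Assign n balls to n bins one at a time.

    Each ball picks two random bins j and k and goes to the one whose
    current load is smaller.
    """
    loads = [0] * n
    for _ in range(n):
        j = rng.randrange(n)
        k = rng.randrange(n)
        # Strictly smaller load at j -> use j; otherwise (ties included) use k.
        if loads[j] < loads[k]:
            loads[j] += 1
        else:
            loads[k] += 1
    return loads

n = 1000                                   # illustrative size
loads = best_of_two(n, random.Random(1))   # arbitrary fixed seed
print("max load with two choices:", max(loads))
```

Note that, unlike the one-choice scheme, the balls here have to be processed in order, because each decision depends on the loads left behind by earlier balls.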

## 154 Power of Two Choices

This new best-of-two approach is a simple twist on the original approach. What are we doing now? Well, instead of choosing a random bin, we choose two random bins, and we assign the ith ball to the less loaded of these two random bins. It turns out that with this scheme, the maximum load is order log log n, with high probability. Recall that the previous approach had max load on the order of roughly log n, and we've reduced it from log n to log log n just by choosing two random bins instead of one and sending the ball to the better of the two. This is a substantial gain because log log n is quite small, even for very large n. So this is almost like order one; it's a very small quantity. After seeing this result, you might say, "Well, why choose two random bins? Let's choose three random bins and maybe we'll get log log log n." Well, it turns out that the big gain is from one to two, and after that there's not much gain. In particular, if you choose d random bins, for d at least two, and you assign the ith ball to the least loaded of these d bins, then the max load is going to be on the order of log log n over log d. So the improvement with d is very small. Now, keep in mind this idea of choosing the best of two choices. We're going to use this intuition later to get better hashing schemes and then to drive the intuition behind Bloom filters. So finally, let's dive into hashing.
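A quick experiment can compare d = 1, 2, and 3 choices side by side. This is a rough sketch under assumed parameters (the function name `max_load_d_choices`, n = 20000, and the seeds are all illustrative); a single run only hints at the asymptotic behavior claimed above.

```python
import random

def max_load_d_choices(n, d, rng):
    """Throw n balls into n bins; each ball goes to the least loaded of d random bins."""
    loads = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda b: loads[b])
        loads[target] += 1
    return max(loads)

n = 20_000  # illustrative size
results = {d: max_load_d_choices(n, d, random.Random(d)) for d in (1, 2, 3)}
for d, m in results.items():
    print(f"d = {d}: max load = {m}")
```

In runs like this the drop from d = 1 to d = 2 is typically the noticeable one, while d = 3 changes little, which matches the log log n over log d behavior described in the lecture.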

## 155 Hashing Setup

Now, let's turn our attention back to hashing, which is our main focus. It will be useful to keep a running example in mind to motivate our various hashing schemes. The example we'll use is unacceptable passwords. We want to maintain a database of unacceptable passwords. For example, these might be words that are in the dictionary. The setting is that a user will enter a proposed password, and our system should quickly respond whether that proposed password is acceptable or not. So we need to quickly check whether the proposed password is in the database of unacceptable passwords. Now let's formalize our setting a little more precisely. We have a huge set U, which is the universe of possible passwords. This is an enormous set. For example, if we simply look at passwords as strings of length A, then this set has size 52^A (52 choices per character, for example the upper- and lowercase letters). Hence this set U is too large to maintain. Instead, we're going to maintain a subset of this universe, which we'll denote by S. S will contain the set of unacceptable passwords. The main operations our data structure needs to perform are queries: for an element x in our universe, so x is a proposed password in this setting, is x in the subset S? In other words, is x an unacceptable password? We want to build a data structure, or hashing scheme, which answers these queries quickly. Let's first look at how the traditional hashing scheme, known as chain hashing, works in this setting. In order to maintain this set S, we're going to use a hash table H of size little n. In chain hashing, this table H is an array of linked lists. We're going to use a hash function little h which maps elements of U to entries of H. So each of the possible passwords is mapped to one of these n bins by little h. Now, to insert an element into the subset S, we simply find its hash value, go to that bin, and add the element onto the linked list at that location.
And then to do a query, we simply go to the hash value and look through the linked list there to check whether the element is in it or not. For each element x of the universe, h(x) maps to one of these n bins. Now we're going to analyze random hash functions. So we're going to assume that h(x) maps to a random bin. Moreover, we'll assume this random choice is independent of all other hash values, so where h(x) maps to is independent of where any other element of the universe maps to. So this little h is a completely random hash function. Now if you think of the hash table as bins and the elements of S as balls, then what this hash function is doing is assigning the balls to random bins. So it's reminiscent of the balls into bins problem that we analyzed before. It will be useful to have a little bit of notation before we move on. The set U is huge, and we'll denote its size by capital N. The hash table's size we'll denote by little n. Capital N, as in RSA, will be exponential in little n, and we'll use little m to denote the size of the database S that we're maintaining. Once again, capital N is much bigger than little n, and typically our hash table size is at least the size of the database we're maintaining, so little n is at least m. Our goal, of course, is to keep the data structure not much larger than little m.

## 156 Chain Hashing

Once again, in the traditional hashing scheme, chain hashing, our hash table H is an array of linked lists. H is an array of size n, and H[i], the ith entry of H, is a linked list of the elements of our subset S which map to i. In other words, H[i] is a linked list of those unacceptable passwords whose hash value is exactly i. Let's look at the query time. How long does it take us to answer a query of the form: is x in the subset S that we're maintaining? In order to answer this query, we have to look at the hash table at the index i = h(x), and then we have to go through that entire linked list and check whether x is in it or not. So the time it takes us is proportional to the size of this linked list. What's the size of this linked list? It's the load at this bin. If we think of the elements of S as balls, these are getting assigned to bins, which are their hash values. The time it takes us to answer a query is proportional to the load at the hash value. Let's recall our notation. Little m is the size of our dictionary of unacceptable passwords, and little n is the size of our hash table. So, in our balls and bins analogy, little m is the number of balls that we're throwing, and little n is the number of bins. Now, if m equals n, the number of balls is the same as the number of bins. This is the toy problem that we analyzed before, and what we saw is that the max load is order log n, with high probability. Of course, in the worst case the max load might be n, but that's an unlikely event. With high probability, the max load is going to be order log n, which means the worst-case query time is order log n with high probability. Now, when n is huge, order log n might be too slow for us. So how can we achieve faster query time? Well, one way is to try to increase the size of our hash table.
In order to decrease this max load from order log n to order one, so that we get constant time queries, we're going to have to increase the size of the hash table from order m to order m squared. Now, that's quite a large price to pay. So let's see if there are simpler ways to reduce the query time. Our intuition for the following scheme comes from the balls and bins example from before. The simple scheme that we're using right now sends n balls into n bins, and each ball goes into a random bin. What did we use to improve the max load in that balls and bins scheme? Well, we used the two-choice scheme, and what we saw is that the max load goes down to order log log n when we allow each ball to go to the less loaded of two random bins. So, let's try to use a similar scheme now for hashing.
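The chain hashing scheme described in this section can be sketched as follows. This is a minimal illustration, not the lecture's implementation: Python lists stand in for the linked lists, and SHA-256 reduced mod n is an assumed stand-in for the random hash function little h. The names `h` and `ChainHashTable` are hypothetical.

```python
import hashlib

def h(x, n):
    """Stand-in for a random hash function: map string x to a bin in range(n)."""
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n

class ChainHashTable:
    def __init__(self, n):
        self.n = n
        self.table = [[] for _ in range(n)]  # H[i] is the chain for hash value i

    def insert(self, x):
        chain = self.table[h(x, self.n)]
        if x not in chain:       # avoid storing the same password twice
            chain.append(x)

    def query(self, x):
        # Scans one chain, so the cost is proportional to the load of bin h(x).
        return x in self.table[h(x, self.n)]

    def max_load(self):
        return max(len(chain) for chain in self.table)

bad = ChainHashTable(n=8)
for word in ["password", "letmein", "qwerty", "dragon"]:
    bad.insert(word)
print(bad.query("qwerty"))        # an inserted word is always found
print(bad.query("x9!kP2#vLq"))    # a word never inserted is never found
print("max load:", bad.max_load())
```

Because the chains store the actual elements, queries here are always exact; the price is that a query can take time proportional to the max load.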

## 157 Power of Two Hashing Question

What we're going to do now is, instead of using a single hash function, choose a pair of hash functions, h1 and h2. Each of these hash functions maps elements of U, our possible passwords, into our hash table of size n. We'll assume that these hash functions are random, so each element x in the universe of possible passwords maps to a random entry of the hash table. Both h1(x) and h2(x) are random, and they are independent of each other and independent of the other hash values.

The first question is: how do we insert an element, a possible password, into our dictionary of unacceptable passwords? In the traditional scheme this is quite straightforward. We compute h(x), which tells us the hash value, then we look at the linked list at h(x) and add the element x onto that linked list. But now we have two hash functions, h1 and h2. So how do we do an insertion into our dictionary in this scenario? This is a bit of an open-ended question, but why don't you think about how you would insert an element into our dictionary using two hash functions.

## 158 Power of Two Hashing Solution

Let's go ahead and look at how we do this insertion. We want to insert this element x into our dictionary of unacceptable passwords. The first thing we do is compute the two hash values, h1(x) and h2(x). Think of our balls and bins analogy: we have this ball x, and what we've done is choose two random bins, h1(x) and h2(x). Which bin do we add the ball x into? We add it into the less loaded of these two bins. What is the load of a bin? It's the size of its linked list. We can maintain the size of each of these linked lists so that we can quickly determine which of the two is less loaded; then we add x to that linked list and increment its size. So an insertion can be done in order one time.

The next question is how we do a query. How do we check whether a proposed password y is in our dictionary of unacceptable passwords? We start off the same as for an insertion: we compute the two hash values h1(y) and h2(y). These are the two possible locations for y. We have no way of determining which of these two locations it might be in, if at all, because we have no way of determining what the loads looked like at the time that we inserted y, if we did insert y. So what do we do? We check both bins. We check the linked list at H[h1(y)] and the linked list at H[h2(y)], and we look for y in both. If it's in either of these linked lists, then we know that y is in the dictionary. If it's in neither of them, then we know that y was never inserted into our dictionary of unacceptable passwords. So how long does it take to do a query? The query time now depends on the loads at these two locations. So if we have an upper bound on the maximum load, then the query time is at most twice the maximum load.

Now if m equals n, so the size of our dictionary of unacceptable passwords and the size of our hash table are the same, then what we know from our balls and bins analogy is that the max load, and hence the query time, is going to be order log log n. So just by changing from one hash function to a pair of hash functions and using this scheme, our query time goes down dramatically, from order log n to order log log n, and there's no extra cost in terms of space.

There is, though, a question about how we maintain the hash function h, especially if it's a truly random hash function. In practice we can't store a truly random hash function. Instead we use pseudorandom hash functions, which we obtain from a library, such as rand or drand48. But for the purposes of analysis it's convenient to consider a truly random hash function, so that we can do a nice analysis such as how we obtained the order log n max load for the m equals n case with one hash function. We skipped the analysis for the balls and bins example with two choices, where each ball goes to the better of two random bins. In that case we claimed that the max load is order log log n. That analysis is reasonable to do, but it's much more complicated, so we skipped it in this lecture.
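The insertion and query procedures with two hash functions can be sketched as below. As before, this is only an illustration under assumptions: the names `hash_i` and `TwoChoiceHashTable` are hypothetical, Python lists stand in for linked lists, and the two "random" hash functions are simulated by SHA-256 with a different prefix for each index.

```python
import hashlib

def hash_i(i, x, n):
    """Stand-in for the i-th random hash function h_i: map string x into range(n)."""
    data = f"{i}:{x}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

class TwoChoiceHashTable:
    def __init__(self, n):
        self.n = n
        self.table = [[] for _ in range(n)]

    def insert(self, x):
        a = hash_i(1, x, self.n)
        b = hash_i(2, x, self.n)
        # len() on a Python list is O(1), so it plays the role of the
        # per-list size counter mentioned in the lecture.
        target = a if len(self.table[a]) <= len(self.table[b]) else b
        self.table[target].append(x)

    def query(self, y):
        # y could be in either candidate bin, so both chains are checked.
        a = hash_i(1, y, self.n)
        b = hash_i(2, y, self.n)
        return y in self.table[a] or y in self.table[b]

words = ["password", "letmein", "qwerty", "dragon", "monkey"]
t = TwoChoiceHashTable(n=8)
for w in words:
    t.insert(w)
print(all(t.query(w) for w in words))   # every inserted word is found
print(t.query("x9!kP2#vLq"))            # a word never inserted is not found
```

This sketch does not check for duplicates on insert, which keeps insertion at constant time as described; a query costs at most the sum of the two chain lengths.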

## 159 Lecture Outline

This completes our description of the traditional hashing approach, chain hashing. Now we can move on to Bloom filters.

## 160 Bloom Filters Motivation

Now we can finally describe Bloom filters. Let's keep in mind the running example from before, the case of unacceptable passwords. We're going to describe a new data structure that has faster queries. Recall that in the traditional hashing scheme that we previously described, the query time was order log n for the simple scheme, or order log log n for the more advanced scheme which used the power of two choices idea. Here we're going to achieve query time order one. So constant query time, and this is guaranteed. Recall that the other query times were probabilistic statements: with high probability the query time was order log n or order log log n, but in the worst case it was order n. Here it's guaranteed to always be constant query time. This data structure will be very simple and it will use less space than before. There are no linked lists or anything like that. It will just be a simple binary array. Now, there are a lot of benefits: it's simpler, it uses less space, and it has faster queries. There must be some cost for this simplicity and this faster time. So what is the cost, what is the tradeoff for this scheme? Well, this scheme is not always correct. Occasionally there are false positives, and this happens with some probability that we'll analyze. What exactly do we mean by a false positive? We have an element x which is not in our dictionary of unacceptable passwords. So this is an acceptable password, but our algorithm occasionally says yes, this x is in the dictionary. In this setting false positives are acceptable. Why? Because we have an acceptable password, but we falsely say that it is in our dictionary of unacceptable passwords. So somebody types in a password and we say, no, that's not allowed. Ideally it should have been allowed, but we said no, it's not allowed.
So then the user has to enter a new password. But in exchange for these false positives we have guaranteed query time, so we quickly answer the question of whether it was an acceptable password or not. And in this setting false positives are reasonable. False negatives would have been a big cost; they would have been unacceptable in this setting. When we have an unacceptable password, we definitely want to say it's unacceptable. If we have an acceptable password and we occasionally say that it's unacceptable, that's okay. So in this setting it's reasonable to have false positives with some small probability that we'll try to bound. In other settings it may be unacceptable to have false positives, in which case Bloom filters might be a bad idea. So this is not a universal scheme. You have to look at your setting and determine whether a simpler and faster scheme is worth the cost of having false positives. Is it acceptable to have false positives with some small probability?

## 161 Operations

What are the basic operations that our data structure is going to support? The first operation is insert x. Given a possible password x, we want to add this password to our dictionary of unacceptable passwords. The second operation is a query on x: is this proposed password in our dictionary of unacceptable passwords? If this proposed password is in our dictionary of unacceptable passwords, then we're always going to output YES, so we're always correct in this case. The problem is when this proposed password is not in our dictionary, so it is an acceptable password. Then we usually output NO, and we have to bound what we mean by usually. But occasionally we're going to output YES. So when this proposed password is acceptable, occasionally we're going to have a false positive and say YES, it is in the dictionary of unacceptable passwords, so this password is not allowed. So we have false positives at some small rate, and we have to bound that rate and see what it looks like.

## 162 Bloom Filters

Finally we can describe our Bloom filter data structure. The basic data structure is simply a binary array, a 0-1 array H of size little n. We don't have any linked lists hanging off it at all. It's just a binary array of size n; that's the whole data structure. We're going to start off by setting H to all zeros, so all of the n bits are set to zero. As before, we're going to use a random hash function h which maps elements of the universe of possible passwords into our hash table of size little n. How do we insert an element x, a possible password, into our dictionary S of unacceptable passwords? First we compute its hash value, then we set the bit in the array at that hash value to one. So we compute h(x) and we set capital H at h(x) to be one. Now, it might already be one, in which case we're not changing anything. So the bits only change from zeros to ones. We never change them back from ones to zeros. That's one of the limitations of this data structure: there is no easy way to implement deletions, because we never change bits from ones to zeros. Now how do we do a query? How do we check whether an element x is in our dictionary S? Well, we compute x's hash value, and we check the bit of the array at that hash value to see whether it's one or zero. If the bit at this hash value is one, then we output yes; we believe x is in the dictionary S. If it's zero, then we're guaranteed that no, it's not in the dictionary, because if it's zero that means we definitely did not insert it. If it's one, then we might have inserted it, but we're not sure. Some other element might have been inserted at that hash value, and we have no way of checking whether it was x or some other element that was inserted there, because we're not maintaining a linked list at this location. Let me repeat this point about how false positives can arise.
We have some element x which we do a query on. It's not in our dictionary of unacceptable passwords, but there is some other element y which is in our dictionary of unacceptable passwords. And these two elements, x and y, have the same hash value: h_of_x equals h_of_y. So when we inserted y into our dictionary, we set the bit at this position to one. So then when we do the query on x, this bit is one. So as far as we know, x might be in our dictionary. So we have to output yes because it might be there. But in fact the correct answer is no, because x was not inserted, some other element was inserted with the same hash value. That's how false positives arise. Now this scheme is not going to perform very well. How can we improve it? Well, we can try to use our power of two choices idea that we used before in our traditional hashing scheme. So what are we going to do? Well, instead of using one hash function, we're going to use two hash functions. Now in the traditional balls and bins example, there was a big gain in going from one hash function to two hash functions, but then going from two to three or three to four was not much of a gain. But here, this is a slightly different setting: there'll possibly be a big gain going from one to two, but even from two to three there might be a gain. And it's not clear how many hash functions to use, so we're going to try to optimize that choice of the number of hash functions. So instead of two hash functions, we're going to allow k hash functions. So we want to generalize this scheme to allow k hash functions and then we're going to go back and figure out what is the optimal choice of k, the number of hash functions. So let's look at the more robust setting where we allow k hash functions, and how we modify this data structure to accommodate k hash functions.
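The single-hash scheme described above can be sketched in a few lines of Python. This is only a minimal sketch: the class name `SingleHashFilter` is made up for illustration, and SHA-256 reduced mod n is an assumed stand-in for the random hash function h, which the lecture does not specify.

```python
import hashlib


class SingleHashFilter:
    """Bit array H of size n with one hash function (illustrative sketch)."""

    def __init__(self, n):
        self.n = n
        self.bits = [0] * n  # H starts as all zeros

    def _h(self, x):
        # Assumption: SHA-256 mod n plays the role of the random hash function h.
        digest = hashlib.sha256(x.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.n

    def insert(self, x):
        # Bits only ever change from 0 to 1.
        self.bits[self._h(x)] = 1

    def query(self, x):
        # True means "x might be in S"; False means "x is definitely not in S".
        return self.bits[self._h(x)] == 1


tiny = SingleHashFilter(1)   # a table with a single bit forces a collision
tiny.insert("y")
print(tiny.query("x"))       # True, although "x" was never inserted: a false positive
```

The one-bit table is an extreme case chosen so that the collision between x and y is guaranteed rather than left to chance.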

## 163 Robust Scheme

So now we have k hash functions instead of just one: h1, h2, up to hk. We're going to initialize our hash table H, capital H, to all zeros. So all of the bits, the n bits of H, are set to zero. How do we do an insertion? How do we add an element x, a possible password, into our dictionary S of unacceptable passwords? Previously we computed the hash value h of x and we set that bit to 1. What are we going to do now? Now we compute the k hash values and we set all of those k bits to 1. So we go through the hash functions 1 through k, and for each i we compute x's i'th hash value, h_i of x, and we set that bit of H to 1. It might already be 1, but just like before we only ever change bits from 0 to 1. How do we do a query? How do we check whether an element x was inserted into our dictionary S? Well, we compute its k hash values and we check whether all of those k bits are set to 1 or not. If all of those k bits are set to 1, then our best guess is that x was inserted into the database S. If any of those are still 0, then we're guaranteed that x was not inserted into the database. So let's write that out more precisely. If for all i from 1 to k the bit of H at position h_i of x is set to 1, then we output yes. We believe that this element x is in the database S. If any of these k bits is still 0, then we're guaranteed that this element x is not in the database. So we output no.
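Here is a minimal sketch of the k-hash-function scheme just described. The class name `BloomFilter` and the way the k hash functions are built (SHA-256 of the element prefixed by the index i, reduced mod n) are assumptions made for illustration. The lecture only asks for k random hash functions.

```python
import hashlib


class BloomFilter:
    """Bit array H of size n with k hash functions (illustrative sketch)."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = [0] * n  # H starts as all zeros

    def _positions(self, x):
        # Assumption: h_i(x) is simulated by hashing the string "i:x" with SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.n

    def insert(self, x):
        # Set all k bits h_1(x), ..., h_k(x) to 1.
        for pos in self._positions(x):
            self.bits[pos] = 1

    def query(self, x):
        # Yes only if all k bits are 1; any 0 bit proves x was never inserted.
        return all(self.bits[pos] == 1 for pos in self._positions(x))


bf = BloomFilter(n=1000, k=7)
bf.insert("password123")
print(bf.query("password123"))  # True: inserted elements are always reported
```

A query on an element that was never inserted usually returns False, but it can return True when other insertions happen to have set all k of its bits.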

## 164 Correctness

Let's take a look at the correctness of this algorithm for our queries. Suppose x was inserted into our database S, and we do a query on x. What do we output? Well, when we inserted x into the database, we set all of these k bits to one. So when we do a query, we're guaranteed that all of these bits are set to one, and so we're going to output yes, because none of the bits ever change from ones to zeros. Bits only change from zeros to ones. It's a one directional process. So if x was inserted into the database, when we do a query on x, we always output yes. It is in the database. Now, suppose x was not inserted into the database and we do a query on x. Sometimes, we might say yes, we believe it's in the database. In which case, we get a false positive. We falsely say that, yes, it's in the database. How can this occur? This can occur if all of the k bits were set to one by other insertions. So for each of the k bits of x, take the ith bit. So this is h_i of x. There is some element z, which was inserted into the database S, and one of the k bits for z exactly matches the ith bit of x. Which of the k bits for z? Let's say the jth bit for z. So the jth bit for z matches the ith bit for x. In other words, h_i of x, the ith bit for x, equals h_j of z, the jth bit for z. This means that when we did the insert of z into the database, we set this bit, which matches the ith bit of x, to one. And if this is true for every bit of x, so all the k bits of x are set to one by some other insertions, then we're going to get a false positive on x. So this scheme has this extra robustness or redundancy. In order to get a false positive, we need all of these k bits to be set to one by some other insertions, whereas the previous scheme only had one bit which we're checking. Now we have k bits which need to get set to one in order to get a false positive. So it seems like things improve, that the false positive rate goes down as k increases.
But in fact there's an optimal choice of k. If k gets too large, the false positive rate starts to shoot up again. Why is that? Well, if k is huge, then for every insertion you're setting k bits to one. So you're setting many bits to one if k is huge. So that means that for each of these m insertions, each of these elements in S, there are many bits, many choices of j, which are set to one. So it's more likely, if k is big, that one of these k bits is going to match up with one of the bits of x. So if k is too large, every insertion is setting too many bits to one. If k is too small, then when we're doing the query on x, we're checking too few bits. So there's some optimal choice of k, not too large and not too small. What we want to do now is more precisely analyze these false positives. What's the probability of a false positive? We want to look at it as a function of k and then we can figure out the optimal choice of k in order to minimize the false positive rate. And then we can see what that false positive rate looks like, to decide whether this is a good data structure to use.
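To get a feel for this tradeoff before doing the analysis, here is a small simulation sketch. It assumes idealized hash functions, modeled by drawing every bit position uniformly at random, and the parameter values (1000 insertions, a table 10 times larger, k equal to 1, 7, and 50) are arbitrary choices for illustration.

```python
import random


def simulate_false_positive_rate(m, c, k, queries=2000, seed=0):
    """Estimate the false positive rate with idealized random hash functions."""
    rng = random.Random(seed)
    n = c * m                      # table size is c times the database size
    bits = [0] * n
    for _ in range(m):             # m insertions, each setting k random bits
        for _ in range(k):
            bits[rng.randrange(n)] = 1
    false_positives = 0
    for _ in range(queries):       # each query is a fresh, never-inserted element
        if all(bits[rng.randrange(n)] for _ in range(k)):
            false_positives += 1
    return false_positives / queries


for k in (1, 7, 50):
    print(k, simulate_false_positive_rate(m=1000, c=10, k=k))
```

With these settings the middle value of k should give the lowest estimated rate, while k equals 50 fills most of the table with ones and does worst.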

## 165 Analysis Setup

Let's start analyzing the false positive rate. As before, let m denote the size of our database or dictionary that we're maintaining. And let little n denote the size of our hash table. Now, presumably, n, the size of our hash table, is going to be at least the size of the database that we're maintaining. So, the number of bins is at least the number of balls that we're inserting. So, the important parameter is going to be the ratio of these sizes. So, let c denote this ratio, the size of the hash table compared to the size of the database. So, c is going to be at least one. And our goal is to try to get the smallest c possible. So once again, the size of our database, our dictionary of unacceptable passwords, is m. And the size of our hash table is c times m. There's a constant c which is at least one, and our hash table is this constant c times bigger than the database that we're maintaining. Now, take an element x which is not in our database. So, we didn't do an insertion on x. Let's look at the probability of a false positive for this x. So x was not inserted into the database. And what is the probability of a false positive? So, we incorrectly say that x was inserted into the database. In order for this to occur, we need that all the k bits for x, so h_1 of x, h_2 of x, up to h_k of x, were all set to one. If all of these bits are one but x was never inserted into the database, then we'll get a false positive. We'll incorrectly say yes, it is in the database because all of the k bits are one, but it was never in fact inserted into the database. So, let's try to analyze this probability that all these k bits are set to one. Let's first look at a simpler problem for a specific bit, b. What's the probability that this specific bit is set to one? So, take a specific bit b, where b ranges over 0, 1, up to n minus one. What is the probability that this specific bit b is set to one?
It will be slightly easier to look at the complementary event, that this specific bit is set to zero. So, the probability that this specific bit is one is one minus the probability that this bit is still zero. Now, to analyze the probability that it's still set to zero, what we have to do is check that all of the insertions miss this one bit. Now, let's go back and think about our balls and bins analogy in order to analyze this probability. Now, we have m insertions. In our simple hashing scheme, each insertion corresponds to throwing a ball into a bin. So, this corresponds to throwing m balls into bins. But notice for each insertion, we have k hash values that we look at. And we set k of these bits to one. So, each insertion corresponds to k balls. So, actually we're throwing m times k balls into bins. So, we're throwing these m times k balls and we're throwing them into n bins. Now, what does this specific bit being set to zero correspond to in this balls and bins example? In order for this bit to still be zero we need that all these m times k balls miss this specific bin, b. So the probability that this bit is zero is equal to the probability that all m times k balls miss this specific bin. For one ball, what's the probability that it misses a specific bin? The chance that it hits the specific bin is one over n, so the chance it misses this bin is one minus one over n. And we're doing this for m times k balls, which are thrown independently, so the probability that all of them miss is one minus one over n raised to the power m times k. Now, this expression is not very complicated but it will be much more convenient for us to have a slightly simpler expression.
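As a rough sanity check on this balls and bins argument, the sketch below compares the expression one minus one over n, raised to the power m times k, against the fraction of bits that are actually still zero after throwing m times k balls uniformly at random. The function names and the parameter values are illustrative assumptions.

```python
import random


def prob_bit_zero(n, m, k):
    """Probability that a specific bit is still 0 after m*k independent throws."""
    return (1 - 1 / n) ** (m * k)


def empirical_zero_fraction(n, m, k, seed=0):
    """Throw m*k balls into n bins and report the fraction of empty bins."""
    rng = random.Random(seed)
    bits = [0] * n
    for _ in range(m * k):
        bits[rng.randrange(n)] = 1
    return bits.count(0) / n


n, m, k = 10_000, 1_000, 7
print(prob_bit_zero(n, m, k), empirical_zero_fraction(n, m, k))
```

The two printed numbers should agree to within sampling noise, since each bin is empty with exactly the probability computed by `prob_bit_zero`.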

## 166 False Positive Probability

So what have we done so far? We've shown that the probability that a specific bit is zero is equal to one minus one over n, raised to the power m times k. Let's try to manipulate this to get a slightly more convenient expression. I want to replace this by an exponential. Suppose I look at e to the minus a for a number a. Let's take a look at the Taylor series for the exponential function. Let me remind you of that expression. So I have the exponential of minus a, where a is a number. It starts off with one minus a, plus a squared over two factorial, minus a cubed over three factorial. Notice the signs alternate, so the next term is plus a to the fourth over four factorial, and so on. You have this infinite series. Now for small a the higher order terms shrink very quickly, and as a goes to zero, this series is well approximated by one minus a. So this is a good approximation when a is sufficiently small. That's going to correspond in our case to n being sufficiently large, since we're looking at a equals one over n as n grows to infinity. So let's use this approximation to simplify our analysis of the false positive rate. So what can I do here? Here I have a equals one over n, and as n grows, one over n goes to zero. So it's a reasonable approximation, and I can replace one minus one over n by e to the minus one over n. Raising that to the power m times k, I have e to the minus mk over n. Recall that c is the ratio of the size of the hash table to the size of the database, so n equals c times m. Therefore, this expression can be simplified to e to the minus k over c. So now I have a very simple expression for the probability that a specific bit is zero. Recall our original problem. We have an element x which was not inserted into our database. We want to look at the probability of a false positive for this element. So what's the chance we output yes, even though x was not inserted into the database? So what's the probability that all of these k bits corresponding to x were set to one by some other insertions?
Well, the probability of a specific bit being set to zero we've just analyzed, and shown that it's approximately e to the minus k over c. So what's the probability one of these specific bits is set to one? It's one minus the probability that it's set to zero. The probability it's set to zero is e to the minus k over c, so the probability it's set to one is one minus e to the minus k over c. And we want k specific bits all set to one. So the probability of that, if we treat these k bits as independent, is this raised to the power k. This expression is the false positive probability. It's not very nice right now because we have this k. Can we simplify this by eliminating k? Can we figure out what is the optimal k in order to minimize this false positive probability? Recall our intuition from before: we wanted to have k not too small, because if k is small, then when we do a query, we're checking too few bits. But if k is too large, then when we do an insertion we're setting too many bits to one for each insertion. So there's some middle ground and we want to figure out the optimal choice of k in order to minimize this false positive probability. So what are we going to do? We're going to take this function. We're going to take its derivative, set it equal to zero, and find the optimal choice of k in order to minimize this expression. So a bit of calculus, we're going to skip it. But I'll tell you the punchline.
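To see how much the exponential approximation changes the answer, here is a small sketch that evaluates the false positive expression both ways: with one minus one over n raised to the power mk, and with e to the minus k over c. The function names are made up for illustration, and both forms treat the k bits as independent, as the analysis above does.

```python
import math


def fp_power_form(n, m, k):
    """False positive estimate using (1 - 1/n)^(m*k) for a zero bit."""
    return (1 - (1 - 1 / n) ** (m * k)) ** k


def fp_exp_form(c, k):
    """False positive estimate using the approximation e^(-k/c) for a zero bit."""
    return (1 - math.exp(-k / c)) ** k


m, c = 1_000, 10
for k in (1, 3, 7):
    print(k, fp_power_form(c * m, m, k), fp_exp_form(c, k))
```

For a table of this size the two columns should be nearly identical, which is the sense in which replacing one minus one over n by e to the minus one over n is harmless when n is large.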

## 167 Optimal k

Recall the expression that we just obtained for the false positive probability. The expression was one minus e to the minus k over c, raised to the power k. Let's look at this as a function of k. So let f of k denote this expression. This f of k is the false positive probability for a specific choice of k, the number of hash functions. Our goal is to figure out what's the optimal choice for the number of hash functions, in order to minimize this false positive probability. So what do we do? We are going to minimize this function f of k. How do we find the optimal choice of k? Well, we take its derivative, set it equal to zero, find the choice of k which sets the derivative equal to zero, and check that it's a global minimum. Where does that optimum happen? It happens at k equals c times ln two. That's the natural log of two, the log base e of two. I'm going to skip this calculus, so that I don't embarrass myself. I've forgotten my calculus too, so don't worry. But let's look at this choice of k, which is the optimal choice in order to minimize the false positive probability. It's quite interesting. Actually, first off, let's plug this choice of k back into this expression. And then we can simplify it. Well, first off, when you plug in k equals c ln two on the inside, the c's cancel, and you're left with e to the minus ln two. And then in the outer exponent we have c ln two. What is e to the minus ln two? That's exactly one half. And what's one minus a half? It's a half. So the whole thing is one half raised to the power c ln two. Let's separate the c from ln two. So we have one half to the ln two, all raised to the power c. Now what is this inside? This is just a fixed constant. It's one half raised to the power ln two. You can plug it in on your calculator. Turns out that inner expression is .6185 approximately. So we have that raised to the power c, where c is the ratio of the size of our hash table compared to our database.
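For anyone who wants to see the skipped calculus, here is one way it can be sketched, working with the approximate expression for f of k from above. The substitution p equals e to the minus k over c is a convenience introduced here, not something from the lecture.

```latex
\begin{aligned}
\ln f(k) &= k \,\ln\!\left(1 - e^{-k/c}\right). \\
\text{Let } p &= e^{-k/c}, \text{ so that } k = -c \ln p \text{ and } \ln f = -c \,\ln p \,\ln(1-p). \\
\frac{d}{dp}\Bigl[\ln p \,\ln(1-p)\Bigr] &= \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
\;\Longleftrightarrow\; (1-p)\ln(1-p) = p \ln p. \\
p &= \tfrac{1}{2} \text{ satisfies this, which gives } k = c \ln 2
\text{ and } f(k) = \left(\tfrac{1}{2}\right)^{c \ln 2}.
\end{aligned}
```

Minimizing f is the same as maximizing the product ln p times ln of one minus p, which is symmetric in p and one minus p, and one can check that p equals one half is the only critical point strictly between zero and one.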
Now we have a simple expression for the false positive probability. It's .6185 raised to the power c. So if you tell me how large a hash table you're willing to use, I can tell you what the false positive probability is. Before we look at that, let's just take a look at this expression right here. I want to point out something interesting about this optimal choice of k. What's the probability a specific bit is zero, or one? The chance it's one is this inner expression, one minus e to the minus k over c. The chance it's zero is e to the minus k over c. Both of those are one half. So the chance a bit ends up zero is one half. The chance it ends up one is a half. So what does this binary string look like after we've done m insertions? So it's a binary string of length n, and we've done m insertions into it. What does it look like? Well, each bit afterwards is set to zero with probability one half. So each bit is randomly set to zero or one. So this string H corresponds to a random string, where each bit is independently flipped to zero or one. So what's interesting is that with the optimal choice of k, if we just look at what this string looks like, it's going to correspond to a random binary string, where each bit is independently set to zero or one. So the optimal choice of k corresponds to balancing out the zeros and ones in H, so that H looks like a random string.
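The punchline can be checked numerically without any calculus. The sketch below scans integer values of k, picks the one with the smallest value of f of k, and compares it with c times ln two. The function names and the search bound `k_max` are assumptions made for illustration.

```python
import math


def false_positive(c, k):
    """f(k) = (1 - e^(-k/c))^k, the approximate false positive probability."""
    return (1 - math.exp(-k / c)) ** k


def best_integer_k(c, k_max=200):
    """Integer k in 1..k_max with the smallest false positive probability."""
    return min(range(1, k_max + 1), key=lambda k: false_positive(c, k))


c = 10
print(best_integer_k(c), c * math.log(2))   # best integer k next to c * ln 2
print(0.5 ** math.log(2))                   # the constant (1/2)^(ln 2)
print(math.exp(-math.log(2)))               # chance a bit is zero at k = c * ln 2
```

Since k has to be a whole number in practice, the best integer k is a neighbor of c times ln two rather than exactly equal to it.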

## 168 Looking at False Positive Rate

Recall our setting. We have a database of size m and we have a hash table of size c times m for some c at least one. What we just showed is that, with the optimal choice of k, the false positive probability is approximately .6185 raised to the power c. This .6185 corresponds to one half raised to the power ln two. Let's now look at some specific examples to see how this performs. Let's suppose we did the naive scheme, where k equals one. So, we didn't do the optimal choice of k. We just used one hash function. And let's look at the cases where the hash table is 10 times larger or 100 times larger than the database. Now, this expression for the false positive probability was assuming the optimal choice of k. In order to analyze this case where k equals one, we have to go back to our expression for f of k. If you look back at that expression and you plug in k equals one and c equals 10, or c equals 100, you get the following. In the first case, the false positive probability is .095. And in the second case it's .00995. Now, suppose we do the optimal choice of k. So, then our false positive probability is going to be .6185 raised to the power c. Let's look at c equals 10. What do we get? We get .0082. A reasonable gain. But not that much better than c equals 100 with the simple k equals one case. Let's try c equals 100. So the hash table is 100 times bigger than the database we're trying to store. But this is just a binary string, right? So, it's very reasonable to consider a hash table which is 100 times bigger. Now, the false positive probability is about 1.4 times ten to the minus 21. The key thing is that this is exponential in c. So, taking c equal to 100, it's tiny. This is really a minuscule probability. And if this is not small enough for you, you can go to c equals 200, or 300, and you're going to get a really, really tiny probability of a false positive.
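The figures quoted in this section can be reproduced with a few lines of Python. This is a sketch under the same approximations as the analysis: `false_positive` is the expression f of k, and `optimal_false_positive` plugs in k equals c times ln two without rounding k to a whole number. Both function names are made up for illustration.

```python
import math


def false_positive(c, k):
    """f(k) = (1 - e^(-k/c))^k for a table c times larger than the database."""
    return (1 - math.exp(-k / c)) ** k


def optimal_false_positive(c):
    """Value of f at k = c * ln 2, which simplifies to (1/2)^(c * ln 2)."""
    return 0.5 ** (c * math.log(2))


for c in (10, 100):
    print(f"c={c}: k=1 gives {false_positive(c, 1):.5f}, "
          f"optimal k gives {optimal_false_positive(c):.3e}")
```

Going from c equals 10 to c equals 100 only improves the k equals one scheme by about a factor of ten, while with the optimal k the probability drops by many orders of magnitude, because it is exponential in c.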
So, if you're willing to tolerate a very small probability of a false positive, then you have this very simple data structure which just corresponds to having a binary string. It's very simple to maintain, it has very fast query times, and the false positive probability is very small. The downside of this data structure is that occasionally you might have some false positives, and also it doesn't easily allow for deletions from the database. Though, there are some heuristics for allowing deletions. These are modifications which are called Counting Bloom Filters. Well, that completes our description of Bloom Filters. I look forward to seeing your projects where you're going to implement Bloom Filters and you're going to explore whether these approximations that we did in our analysis were reasonable or not.