# Week3B (Basic)

## Question 1

Suppose we hash the elements of a set S having 20 members, to a bit array of length 99. The array is initially all-0's, and we set a bit to 1 whenever a member of S hashes to it. The hash function is random and uniform in its distribution. What is the expected fraction of 0's in the array after hashing? What is the expected fraction of 1's? You may assume that 99 is large enough that asymptotic limits are reached.

<ol>
<li>The fraction of 1's is $79/99$ (0.798)
<li>The fraction of 1's is $e^{-79/99}$ (0.450)
<li>The fraction of 1's is 1 - $e^{-20/99}$ (0.182)
<li>The fraction of 1's is $e^{-20/99}$ (0.817)
</ol>

### Solution:
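Throwing-darts approximation: with $d$ darts (the 20 elements) thrown at $t$ targets (the 99 bits), the probability that a given bit is still 0 is $(1-\frac{1}{t})^{d} = (1-\frac{1}{t})^{t(d/t)} \approx e^{-d/t}$. The expected fraction of 0's is therefore $e^{-20/99}$ (about 0.82), and the expected fraction of 1's is $1 - e^{-20/99}$ (about 0.18), which is option 3.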

In [38]:
# Video: Week 3: Bloom Filter
# TODO: Investigate how to write out the equation to get the exact answer

import math

def find_result(approx, test):
    """
    Return the candidate in test that is closest to approx without exceeding it.

    The fraction of 1's is approximately (number of elements * number of hash
    functions) / array length; with one hash function that is 20/99.
    Collisions can only lower the true fraction, so candidates above approx are discarded.
    """
    temp = [x for x in test if x <= approx]
    return min(temp, key=lambda x: abs(x - approx))

approx = 20/99
test = [79/99, math.e ** (-79/99), 1 - math.e ** (-20/99), math.e ** (-20/99)]

m = find_result(approx, test)

print("1:", m == test[0])
print("2:", m == test[1])
print("3:", m == test[2])
print("4:", m == test[3])

1: False
2: False
3: True
4: False
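
Regarding the TODO about the exact answer: by linearity of expectation, each bit stays 0 with probability $(1-\frac{1}{t})^{d}$, so that expression is the exact expected fraction of 0's and no limit is needed. The sketch below compares it with the $e^{-d/t}$ approximation for $d = 20$, $t = 99$. The function names are illustrative and not part of the original notebook.

```python
import math

def zero_fraction_exact(d, t):
    """Exact expected fraction of bits still 0 after d uniform hashes into t bits."""
    return (1 - 1 / t) ** d

def zero_fraction_approx(d, t):
    """Asymptotic approximation e^(-d/t) of the same quantity."""
    return math.exp(-d / t)

d, t = 20, 99
exact_zeros = zero_fraction_exact(d, t)
approx_zeros = zero_fraction_approx(d, t)

print(f"0's: exact {exact_zeros:.4f}, approx {approx_zeros:.4f}")
print(f"1's: exact {1 - exact_zeros:.4f}, approx {1 - approx_zeros:.4f}")
```

The two values differ only in the third decimal place, which is why the quiz allows the asymptotic form even for an array as short as 99 bits.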


## Question 2

A certain Web mail service (e.g., Gmail) has $10^{8}$ users, and wishes to create a sample of data about these users, occupying $10^{10}$ bytes. Activity at the service can be viewed as a stream of elements, each of which is an email. The element contains the ID of the sender, which must be one of the $10^{8}$ users of the service, and other information, e.g., the recipient(s), and contents of the message. The plan is to pick a subset of the users and collect in the $10^{10}$ bytes records of length 100 bytes about every email sent by the users in the selected set (and nothing about other users).
The method of Section 4.2.4 will be used. User IDs will be hashed to a bucket number, from 0 to 999,999. At all times, there will be a threshold t such that the 100-byte records for all the users whose IDs hash to t or less will be retained, and other users' records will not be retained. You may assume that each user generates emails at exactly the same rate as other users. As a function of n, the number of emails in the stream so far, what should the threshold t be in order that the selected records will not exceed the $10^{10}$ bytes available to store records? From the list below, identify the true statement about a value of n and its value of t.


<ol>
<li>n = $10^{9}$; t = 999
<li>n = $10^{12}$; t = 999
<li>n = $10^{10}$; t = 10,000
<li>n = $10^{13}$; t = 9
</ol>

In [70]:
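# Video - Sampling a Stream

BYTE_LIMIT = 10**10     # storage budget in bytes
RECORD_BYTES = 100      # size of one email record
NUM_BUCKETS = 10**6     # user IDs hash to buckets 0..999,999

test = [
    {"n": 10**9, "t": 999},
    {"n": 10**12, "t": 999},
    {"n": 10**10, "t": 10000},
    {"n": 10**13, "t": 9}]

# Buckets 0..t are kept, so a fraction (t + 1) / NUM_BUCKETS of the n emails
# is stored. The right threshold is the one that exactly fills the budget.
for case in test:
    stored_bytes_scaled = RECORD_BYTES * case["n"] * (case["t"] + 1)
    print(stored_bytes_scaled == BYTE_LIMIT * NUM_BUCKETS)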

False
False
False
True
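
The reasoning behind the check can be made explicit. Keeping buckets 0 through t retains a fraction $(t+1)/10^{6}$ of the users, hence (at equal email rates) the same fraction of the n emails, so the storage used is $100 \cdot n \cdot (t+1)/10^{6}$ bytes. Reading the budget as $10^{10}$ bytes, this gives $n(t+1) \le 10^{14}$. The following is a minimal sketch that solves for the largest admissible t; `max_threshold` and its parameter names are hypothetical helpers, not something from the course material.

```python
def max_threshold(n, byte_limit=10**10, record_bytes=100, num_buckets=10**6):
    """Largest threshold t whose retained records fit in byte_limit.

    Buckets 0..t are kept, i.e. a fraction (t + 1) / num_buckets of the
    n emails, each stored as a record of record_bytes bytes.
    Returns -1 if not even bucket 0 fits.
    """
    # Largest number of buckets (t + 1) that keeps storage within the limit.
    max_buckets = (byte_limit * num_buckets) // (record_bytes * n)
    return min(max_buckets, num_buckets) - 1

for n in (10**9, 10**10, 10**12, 10**13):
    print(f"n = {n:.0e}: t = {max_threshold(n)}")
```

Under these assumptions only the pair n = $10^{13}$, t = 9 from the option list matches the computed threshold. The option t = 10,000 for n = $10^{10}$ is off by one, since bucket 0 is counted as well.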
