# Hash Tables

The goal is to maintain a (possibly evolving) set of objects. We aim to implement insert, delete and lookup (via a key). Using a properly implemented hash table, all of these operations can be done in $O(1)$ time (for "non-pathological" data).

Let $U$ be the universe of all possible elements. We aim to maintain an evolving set $S \subseteq U$.

1. Pick $n = O(\lvert S \rvert)$, 
2. choose a hash function
$$
h: U \mapsto \{0, 1, 2, \cdots, n-1 \}
$$
3. use an array $A$ of length $n$ and store $x$ in $A[h(x)]$

In general, however, collisions are hard to avoid: with only about $\sqrt{n}$ elements there is already a roughly $50\%$ chance of two elements mapping to the same bucket (the birthday paradox). To resolve collisions we may opt for

1. Separate chaining: 

    Create a linked list in each bucket
2. Open addressing:

    Maintain only one object per bucket, using the hash function to specify a sequence of slots to try until we find an open one.
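As a rough illustration of separate chaining, here is a minimal sketch. The class name `ChainedTable`, the use of Python's built-in `hash()` in place of $h$, and the plain Python lists standing in for linked lists are all assumptions made for the example.

```python
class ChainedTable:
    """Toy hash table using separate chaining (a sketch, not tuned)."""

    def __init__(self, n_buckets: int = 8):
        # One Python list per bucket stands in for the linked list.
        self.buckets = [[] for _ in range(n_buckets)]

    def _chain(self, key):
        # Python's built-in hash() plays the role of h here (an assumption).
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        chain = self._chain(key)
        if key not in chain:
            chain.append(key)

    def lookup(self, key) -> bool:
        return key in self._chain(key)

    def delete(self, key):
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)


table = ChainedTable(n_buckets=4)
for k in (3, 7, 11):   # all three keys land in bucket 3 when n_buckets = 4
    table.insert(k)
```

All three keys collide, yet each stays retrievable because the bucket holds a list rather than a single object.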

    

A good hash function achieves two things
1. Spreads out the data evenly
2. Easy to compute / remember the hash

The load of a hash table is
$$
\alpha = \frac{\text{\# objects in hash table}}{\text{\# of buckets}}
$$
In order for the hash table's operations to run in constant time, we need
$$
\alpha = O(1)
$$
Furthermore, for open addressing we need $\alpha \ll 1$.

To maintain this, we can grow the number of buckets with the number of objects in the hash table.
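One possible way to keep the load bounded is to double the number of buckets whenever $\alpha$ passes a threshold. The sketch below assumes chaining with list buckets. The helper names `load` and `grow_if_needed`, and the threshold of `1.0`, are illustrative choices rather than anything prescribed above.

```python
def load(buckets):
    """alpha = number of stored objects / number of buckets."""
    return sum(len(chain) for chain in buckets) / len(buckets)


def grow_if_needed(buckets, max_load=1.0):
    """Return a table with twice as many buckets when alpha exceeds max_load."""
    if load(buckets) <= max_load:
        return buckets
    bigger = [[] for _ in range(2 * len(buckets))]
    for chain in buckets:
        for key in chain:
            # Every key must be rehashed, since the bucket count changed.
            bigger[hash(key) % len(bigger)].append(key)
    return bigger


buckets = [[0, 4, 8], [1, 5], [2], [3]]   # 7 keys in 4 buckets, alpha = 1.75
buckets = grow_if_needed(buckets)
```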

Furthermore, for every hash function there exists a data set, which we can call the pathological data set, on which the hash function performs poorly.

Solutions
1. Use a cryptographic hash function such as [SHA-2](https://en.wikipedia.org/wiki/SHA-2)
2. Use randomization

    Design a family $H$ of hash functions and pick one at random when initialising the hash table. This is also known as "Universal Hashing". The hash function can also be changed when rehashing the entire table.

## Universal Hashing

Let $H$ be a set of hash functions that map
$$
h: U \mapsto \{ 0, 1, 2, \cdots, n-1 \}
$$

$H$ is universal iff 
$$
\forall x, y \in U, \quad x \neq y: \\[5pt]
p[h(x) = h (y)] \leq \frac{1}{n}
$$

when $h$ is chosen uniformly at random from $H$.


### Example: Hashing IP Addresses

Let $U = $ IP addresses of the form,
$$
(x_1, x_2, x_3, x_4) \; \text{for} \; x_i \in \{0, 1, 2, \cdots, 255\}
$$

Let $n$ be a prime number.

Define one hash function $h_a$ for each 4-tuple
$$
a = (a_1, a_2, a_3, a_4) \; \text{for} \; a_i \in \{0, 1, 2, \cdots, n-1\}
$$

This produces $n^4$ such functions.

For an IP address $i$
$$
h_a: i \mapsto a \cdot i\mod{n} \\[10pt]
h_{(a_1, a_2, a_3, a_4)}{(x_1, x_2, x_3, x_4)} = a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 \mod{n}
$$

Finally, the universal family $H$ is the set of all $h_a$
$$
H = \{ h_a \mid a \in \{0, 1, 2, \cdots, n-1\}^4 \}
$$
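A small sketch of how one $h_a$ might be drawn from this family in practice. The prime `257`, the function name `random_hash` and the seeded generator are assumptions made for the example.

```python
import random

N_BUCKETS = 257  # a prime larger than 255 (an illustrative choice)


def random_hash(n: int = N_BUCKETS, rng=None):
    """Pick h_a uniformly at random: draw each a_i from {0, ..., n-1}."""
    rng = rng or random.Random()
    a = tuple(rng.randrange(n) for _ in range(4))

    def h(ip: tuple) -> int:
        # Dot product of a with the four octets, reduced modulo n.
        return sum(a_i * x_i for a_i, x_i in zip(a, ip)) % n

    return h


h = random_hash(rng=random.Random(0))   # seeded only to make the demo repeatable
bucket = h((192, 168, 0, 1))
```

Once drawn, `h` is an ordinary deterministic function: the randomness lies only in the choice of `a`.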

Proof:

Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$ and $(y_1, y_2, y_3, y_4)$ having a collision.
$$
h_{a}{(x_1, x_2, x_3, x_4)} = h_{a}{(y_1, y_2, y_3, y_4)} \\[10pt]
a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 \equiv a_1y_1 + a_2y_2 + a_3y_3 + a_4y_4 \mod{n} \\
a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3}{a_i(y_i - x_i)} \mod{n} 
$$

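Since the two addresses are distinct, they differ in at least one coordinate; without loss of generality, say $x_4 \neq y_4$. Consider $a_4$ as a random variable, given that $a_1, a_2, a_3$ are already fixed.

Further, assume that $n > 255$, so that
$$
x_4 - y_4 \not\equiv 0 \mod{n}
$$
Write $g = (x_4 - y_4)$ and let $m = \sum_{i=1}^{3}{a_i(y_i - x_i)}$ denote the right-hand side, which is now a fixed number.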

Therefore the probability of such a collision reduces to the number of solutions of the equation
$$
a_4 \times g \equiv m \mod{n}
$$

Assume that there exist two solutions $k_1$, $k_2$ to this equation.
$$
k_1 \times g \equiv k_2 \times g \equiv m \mod{n} \\
k_1 \times g - k_2 \times g \equiv 0 \mod{n} \\
(k_1 - k_2) \times g \equiv 0 \mod{n}
$$

Since $n$ is prime and $g$ is not congruent to $0$ modulo $n$, this implies that
$$
k_1 - k_2 \equiv 0 \mod{n} \\
k_1 \equiv k_2 \mod{n}
$$

Therefore, since $a_4$ can only take values from $\{0, 1, 2, \cdots, n-1\}$, there is at most one value of $a_4$ that satisfies the congruence.

Since at most $1$ out of $n$ choices for $a_4$ satisfies the congruence and $a_4$ is chosen uniformly at random,
$$
p( \text{collision} ) \leq \frac{1}{n}
$$
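The bound can be checked exhaustively on a scaled-down version of the family. The sketch below assumes 2-part keys and the small prime $n = 5$ (instead of 4 octets and a prime above 255), purely so that every key pair and every $h_a$ can be enumerated.

```python
from itertools import product

n = 5                                        # small prime so the search stays tiny
keys = list(product(range(n), repeat=2))     # 2-part keys instead of 4 octets
family = list(product(range(n), repeat=2))   # one h_a per coefficient pair a


def h(a, x):
    return (a[0] * x[0] + a[1] * x[1]) % n


worst = 0
for i, x in enumerate(keys):
    for y in keys[i + 1:]:
        # Count the hash functions in the family under which x and y collide.
        collisions = sum(1 for a in family if h(a, x) == h(a, y))
        worst = max(worst, collisions)

worst_probability = worst / len(family)
```

No pair of distinct keys collides under more than $n$ of the $n^2$ functions, so the worst collision probability is $\frac{1}{n}$.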

## Chaining: Constant-Time Guarantee


Consider a hash table implemented with chaining, where the hash function $h$ is chosen uniformly at random from a universal family $H$.

We can guarantee that lookup, insert and delete operations run in $O(1)$ time, with the following caveats:
- Given as an expectation over random choice of hash function $h$
- Load of the hash table $\alpha = O(1)$
- Hash function takes $O(1)$ time to evaluate

We will analyse the running time of an unsuccessful lookup, which is an upper bound on the running time of the other operations.

In general, the running time of a lookup in a hash table containing the set $S \subset U$ is
$$
O(1) + O(\text{list length in} \; A[h(x)])
$$
where $x \not\in S$ and the list length depends on the random choice of $h$.

Let $L =$ list length in $A[h(x)]$

For $y \in S$, define
$$
Z_y = 
\begin{cases}
1 & \text{if} \; h(y) = h(x) \\
0 & \text{otherwise}
\end{cases}
$$
Then,
$$
L = \sum_{y \in S}{Z_y}
$$
Therefore,
$$
E(L) = \sum_{y \in S}{E(Z_y)} \\[5pt]
= \sum_{y \in S}{p(h(y) = h(x))} \\[5pt]
\leq \sum_{y \in S}{\frac{1}{n}} = \frac{\lvert S \rvert}{n} \\[5pt]
= \alpha = O(1)
$$
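To make the expectation concrete, the sketch below averages $L$ over every function of a small universal family. It reuses the dot-product construction from the IP example, scaled down to 2-part keys and $n = 5$; the particular keys in `S` and the lookup key `x` are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import product

n = 5
family = list(product(range(n), repeat=2))   # every h_a for 2-part keys


def h(a, key):
    return (a[0] * key[0] + a[1] * key[1]) % n


S = [(0, 1), (1, 0), (2, 3), (4, 4)]   # stored keys
x = (3, 1)                             # looked-up key, not in S

# L = number of stored keys sharing x's bucket; average it over every h in H.
total = sum(sum(1 for y in S if h(a, y) == h(a, x)) for a in family)
expected_L = Fraction(total, len(family))
alpha = Fraction(len(S), n)
```

For this family the average list length matches the load exactly, which is consistent with $E(L) \leq \alpha$.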

## Open Addressing

In open addressing, we enforce that there is only one object per slot. To do so, the hash function will produce a probe sequence for each possible key $x$.

Heuristic Analysis:

We assume that each of the $n!$ probe sequences is equally likely. Then we can show that the expected insertion time is
$$
\approx \frac{1}{1-\alpha}
$$

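Proof: As a crude estimate, each probe lands on an empty slot with probability
$$
p(\text{finding an empty slot at random}) = 1 - \alpha 
$$

Let $N$ be the number of probes needed to find an empty slot. The first probe is always made and, with probability $\alpha$, it hits an occupied slot and the search starts over, so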
$$
E(N) = 1 + \alpha \times E(N) \\
\therefore \quad E(N) = \frac{1}{1-\alpha}
$$

This tends to be the case for double hashing.
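For reference, a minimal sketch of open addressing with double hashing. The class name `DoubleHashTable`, the table size of 11 and both hash formulas are assumptions for the example; deletion and resizing are left out.

```python
class DoubleHashTable:
    """Open addressing with double hashing; no deletion or resizing (sketch)."""

    def __init__(self, n_slots: int = 11):   # prime, so every step visits all slots
        self.slots = [None] * n_slots

    def _probe(self, key):
        n = len(self.slots)
        start = hash(key) % n
        step = 1 + hash(key) % (n - 1)   # second hash, never 0
        for i in range(n):
            yield (start + i * step) % n

    def insert(self, key) -> int:
        """Store key and return the number of probes used."""
        for probes, idx in enumerate(self._probe(key), start=1):
            if self.slots[idx] is None or self.slots[idx] == key:
                self.slots[idx] = key
                return probes
        raise RuntimeError("table is full")

    def lookup(self, key) -> bool:
        for idx in self._probe(key):
            if self.slots[idx] is None:
                return False          # an empty slot ends the probe sequence
            if self.slots[idx] == key:
                return True
        return False


table = DoubleHashTable()
probes = [table.insert(k) for k in (5, 16, 27)]   # all three start at slot 5
```

The three keys share a first slot but have different step sizes, so their probe sequences diverge after the first collision.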

For linear probing the analysis is different, since the heuristic assumption is false.

Instead we assume that the initial probe is uniformly random and independent for different keys.

The expected insertion time is then (Knuth 1962)
$$
\approx \frac{1}{(1-\alpha)^{2}}
$$
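The gap between the two estimates grows quickly with the load. The snippet below simply evaluates both formulas for a few values of $\alpha$; the function names are made up for the example.

```python
def probes_random(alpha: float) -> float:
    """Heuristic estimate 1 / (1 - alpha), e.g. for double hashing."""
    return 1 / (1 - alpha)


def probes_linear(alpha: float) -> float:
    """Linear probing estimate 1 / (1 - alpha)^2."""
    return 1 / (1 - alpha) ** 2


for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={alpha:.2f}  random={probes_random(alpha):6.2f}  "
          f"linear={probes_linear(alpha):7.2f}")
```

At $\alpha = 0.5$ the estimates are 2 and 4 probes; at $\alpha = 0.9$ they are roughly 10 and 100.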

# Bloom Filters

Supports fast inserts and lookups, like a hash table. However,
1. \+ More space efficient than a hash table
2. \- Can't store an associated object
3. \- No deletions
4. \- Small false positive probability

Most suited to remembering whether you've seen something before or not.

Implementing a Bloom Filter.

1. We define an array $A$ of $n$ bits, such that 
$$
\frac{n}{\lvert S \rvert} = \text{\# of bits per object in} \;S
$$

2. We choose $k$ hash functions $h_1, \cdots, h_k$, where $k$ is a small constant

```
Insert(x):
    for i=1, 2, ... ,k
        set A[h_i(x)] = 1

Lookup(x):
    for i=1, 2, ... ,k
        if A[h_i(x)] != 1
            return False

    return True
```
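The pseudocode above could be turned into runnable Python along the following lines. This is a sketch under assumptions: the class name `BloomFilter` is made up, the $k$ hash functions are derived from SHA-256 by prefixing the index $i$ (one convenient choice, not the only one), and each bit is stored in a whole byte for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits: int, k: int):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray(n_bits)   # one byte per bit, for readability

    def _positions(self, item: str):
        # h_i(x): SHA-256 of "i:x" reduced modulo n (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item: str) -> bool:
        # True only if every one of the k probed bits is set.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter(n_bits=1024, k=5)
for word in ("alpha", "beta", "gamma"):
    seen.insert(word)
```

Every inserted word is reported as present; a word that was never inserted is usually, but not always, reported as absent.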

Since we never set bits back to 0, there will be no false negatives. Furthermore, once the bloom filter becomes "full" (every bit is set to 1), every lookup returns True, so every element not in $S$ becomes a false positive.

### Heuristic Analysis

For this analysis, we will assume that each value of $h_i(x)$ is uniformly random and independent across all $i$ and $x$.

Given $n$ bits, $k$ hash functions and a data set $S$.
$$
p(\text{a given bit is set to 1}) = 1 - \left(1 - \frac{1}{n}\right)^{k\lvert S \rvert}
$$

For any given bit, there will be $k \times \lvert S \rvert$ attempts where the bit may be set to 1. Further, since the hash functions are uniformly random and independent,
$$
p(\text{the bit remains 0}) = \left(1 - \frac{1}{n}\right)^{k\lvert S \rvert}
$$

Therefore, the probability that the bit is set to 1 at least once, is as stated above.

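Applying the bound $1 - x \leq e^{-x}$, which is nearly tight for small $x$ (here $x = \frac{1}{n}$),
$$
\left(1 - \frac{1}{n}\right)^{k\lvert S \rvert} \approx e^{-\frac{k\lvert S \rvert}{n}} \\
p(\text{a given bit is set to 1}) \approx 1 - e^{-\frac{ k\lvert S \rvert}{n}}
$$

Let $b$ be the number of bits per object, 
$$
b = \frac{n}{\lvert S \rvert}
$$
Then,
$$
p(\text{a given bit is set to 1}) \approx 1 - e^{\frac{-k}{b}}
$$

For a given object $x \not\in S$, a false positive requires all $k$ probed bits to be set, so
$$
p(\text{false positive for } x) \approx \left(1 - e^{\frac{-k}{b}}\right)^{k}
$$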
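As a quick numerical check, the exact expression and its exponential form can be compared directly. The values of `n`, `k` and `size` below are illustrative numbers picked so that $b = 8$.

```python
import math


def bit_set_exact(n: int, k: int, size: int) -> float:
    """1 - (1 - 1/n)^(k|S|): probability that a given bit is 1."""
    return 1 - (1 - 1 / n) ** (k * size)


def bit_set_approx(n: int, k: int, size: int) -> float:
    """Exponential form 1 - e^(-k|S|/n), i.e. 1 - e^(-k/b) with b = n/|S|."""
    return 1 - math.exp(-k * size / n)


n, k, size = 8000, 5, 1000     # b = 8 bits per object (illustrative numbers)
exact = bit_set_exact(n, k, size)
approx = bit_set_approx(n, k, size)
false_positive = approx ** k   # all k probed bits happen to be 1
```

For these numbers the two expressions differ by less than $10^{-4}$, and the estimated false positive probability is a little over 2%.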

To optimise the bloom filter, we fix a value of $b$ that suits our application, then we minimise the false positive probability $\epsilon$. 
$$
\epsilon = \left(1 - e^{\frac{-k}{b}}\right)^{k} \\[10pt]
$$
To find a minimum, we solve for 
$$
\frac{\partial \epsilon}{\partial k} = 0
$$

$$
\frac{\partial \epsilon}{\partial k} = \frac{\partial}{\partial k}\left(e^{k\ln{\left(1 - e^{\frac{-k}{b}}\right)}}\right) \\[10pt]
= \left(1 - e^{\frac{-k}{b}}\right)^{k} \times \frac{\partial}{\partial k}\left(k\ln{\left(1 - e^{\frac{-k}{b}}\right)}\right) \\[10pt]
\implies \frac{\partial}{\partial k}\left(k\ln{\left(1 - e^{\frac{-k}{b}}\right)}\right) = 0 \\[10pt]
\frac{\partial}{\partial k}\left(k\ln{\left(1 - e^{\frac{-k}{b}}\right)}\right) = \ln{\left(1 - e^{\frac{-k}{b}}\right)} + \frac{k}{1 - e^{\frac{-k}{b}}} \times \frac{e^{\frac{-k}{b}}}{b} \\[10pt]
=\ln{\left(1 - e^{\frac{-k}{b}}\right)} + \frac{k}{b}\frac{1}{e^{\frac{k}{b}} -1}
$$

At this point let's introduce $g = e^{\frac{k}{b}}$, so that $\frac{k}{b} = \ln{g}$. Then,
$$
\ln{\left(1 - \frac{1}{g}\right)} + \frac{\ln{g}}{g -1} = 0 \\[10pt]
\ln{(g -1)} - \ln{g} + \frac{\ln{g}}{g -1} = 0 \\[10pt]
(g -1)\ln{(g -1)} - \ln{g}(g -1) + \ln{g} = 0 \\[10pt]
(g -1)\ln{(g -1)} - \ln{g}(g -2) = 0 \\[10pt]
$$

Notice that $g=2$ will be a root of both $\ln{(g -1)}$ and $(g -2)$. Therefore a valid solution is
$$
e^{\frac{k}{b}} = 2 
\implies \frac{k}{b} = \ln{2}
\implies k = b\ln{2}
$$

This gives an overall false positive error
$$
\epsilon = \left(\frac{1}{2}\right)^{\ln{2} \times b}
$$
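Plugging in numbers gives a feel for the result. The sketch below assumes $b = 8$ bits per object, which is an arbitrary example value.

```python
import math


def false_positive(k: float, b: float) -> float:
    """epsilon = (1 - e^(-k/b))^k under the heuristic analysis."""
    return (1 - math.exp(-k / b)) ** k


b = 8                          # bits per object (illustrative)
k_opt = b * math.log(2)        # about 5.55; in practice round to an integer
eps_opt = false_positive(k_opt, b)
```

With 8 bits per object the false positive probability comes out at roughly 2%, and moving $k$ by one in either direction makes it slightly worse.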

# Optional Theory Problems

## 1

Recall that a set $H$ of hash functions (mapping the elements of a universe $U$ to the buckets $\{0,1,2,\cdots,n-1\}$ ) is universal if for every distinct $x, y\in U$, the probability $p[h(x)=h(y)]$ that $x$ and $y$ collide, assuming that the hash function $h$ is chosen uniformly at random from $H$, is at most $1/n$. 

In this problem you will prove that a collision probability of $1/n$ is essentially the best possible. 

Precisely, suppose that $H$ is a family of hash functions mapping $U$ to $\{0,1,2,\cdots,n-1\}$, as above. Show that there must be a pair $x, y\in U$ of distinct elements such that, if $h$ is chosen uniformly at random from $H$, then $p[h(x)=h(y)] \geq \frac{1}{n} - \frac{1}{\lvert U \rvert}$.

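To find a lower bound, we count colliding pairs for a single hash function and then average. Assume $\lvert U \rvert \geq n$, since otherwise the bound is negative and there is nothing to prove.

Fix any $h \in H$ and let $s_i$ be the number of keys of $U$ that $h$ maps to bucket $i$, so that $\sum_{i}{s_i} = \lvert U \rvert$. The number of unordered pairs of distinct keys that collide under $h$ is
$$
\sum_{i=0}^{n-1}{\binom{s_i}{2}} = \frac{1}{2}\left(\sum_{i=0}^{n-1}{s_i^2} - \lvert U \rvert\right) \geq \frac{1}{2}\left(\frac{\lvert U \rvert^2}{n} - \lvert U \rvert\right)
$$
because the sum of squares is smallest when all buckets are equally full.

Dividing by the total number of pairs $\binom{\lvert U \rvert}{2}$, the fraction of pairs that collide under $h$ is at least
$$
\frac{\frac{\lvert U \rvert}{n} - 1}{\lvert U \rvert - 1} \geq \frac{\frac{\lvert U \rvert}{n} - 1}{\lvert U \rvert} = \frac{1}{n} - \frac{1}{\lvert U \rvert}
$$

This holds for every $h \in H$, so it also holds on average when $h$ is chosen uniformly at random. That average is exactly the mean of $p[h(x) = h(y)]$ over all pairs $x \neq y$, so at least one pair must satisfy
$$
p[h(x) = h(y)] \geq \frac{1}{n} - \frac{1}{\lvert U \rvert}
$$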

# Programming Assignment

The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.

The file contains 1 million integers, both positive and negative (there might be some repetitions!). This is your array of integers, with the $i^{th}$ row of the file specifying the $i^{th}$ entry of the array.

Your task is to compute the number of target values $t$ in the interval [-10000,10000] (inclusive) such that there are distinct numbers $x$, $y$ in the input file that satisfy $x+y=t$ . (NOTE: ensuring distinctness requires a one-line addition to the algorithm from lecture.)

Write your numeric answer (an integer between 0 and 20001) in the space provided.

OPTIONAL CHALLENGE: If this problem is too easy for you, try implementing your own hash table for it. For example, you could compare performance under the chaining and open addressing approaches to resolving collisions.

In [18]:
from typing import Protocol, Generic, TypeVar

T = TypeVar('T')

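class Hash(Protocol, Generic[T]):
    # two_sum uses the `in` operator, so membership tests are part of the interface.
    def __contains__(self, item: object) -> bool:
        ...

    def __iter__(self):
        ...
    
    def add(self, item: T):
        ...
    
    def discard(self, item: T):
        ...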

def two_sum(arr: list[int], hash: Hash[int], targets: set[int]):
    """Count the targets t for which distinct x, y in arr satisfy x + y == t."""
    found = set()

    for iters, x in enumerate(arr, start=1):
        # Report progress every 1000 numbers; '\r' overwrites the same line.
        if iters % 1000 == 0:
            completion = (iters / len(arr)) * 100
            print(f'progress: {completion:.2f}%, numbers: {iters}', end='\r')

        for t in targets:
            y = t - x
            # y != x enforces the distinctness requirement of the problem.
            if y != x and y in hash:
                found.add(t)

        # Targets that were already found need not be checked again.
        targets = targets.difference(found)

    print(f'progress: 100.00%, numbers: {len(arr)}')

    return len(found)

def populate_hash(arr: list[int], hash: Hash[int]):
    for x in arr:
        hash.add(x)
    return hash

In [19]:
def data():
    with open('Week 4 2-SUM.txt') as f:
        for line in f:
            yield int(line)

In [None]:
arr = [x for x in data()]
hash = populate_hash(arr, set())
targets = {x for x in range(-10_000, 10_000 +1)}

count = two_sum(arr, hash, targets)
print(count)

Hashing is taking so long!

In [127]:
def sorted_two_sum(sorted_arr: list[int]):
    """Count the sums in [-10000, 10000] formed by two entries of a sorted array.

    Entries at different positions that hold equal values are not filtered out.
    """
    targets = set()

    front_idx = 0
    end_idx = len(sorted_arr) -1

    while front_idx < end_idx:
        testsum = sorted_arr[front_idx] + sorted_arr[end_idx]

        if testsum < -10_000: # sum is below the range, make it larger
            front_idx += 1
        elif testsum > 10_000: # sum is above the range, make it smaller
            end_idx -= 1
        else: # sum is within the range: record every in-range sum for this front element
            for r in range(end_idx, front_idx, -1):
                testsum = sorted_arr[front_idx] + sorted_arr[r]
        
                if testsum < -10_000:
                    break
                targets.add(testsum)
            
            front_idx += 1
    
    return len(targets)

In [128]:
arr = [x for x in data()]
arr.sort()

sorted_two_sum(arr)

427

I must give credit to [En Lin](https://www.coursera.org/learn/algorithms-graphs-data-structures/discussions/forums/wuxqB3b0EeamjgocByS1BQ/threads/peidUXM7EeiD9g6PBVjxjg). Brilliant. 

Sorting is faster here, since sorting runs in $O(n\log{n})$ while the hashing approach runs in $O(n) \times T$, where $T$ is the number of hash lookups made per number. Comparing the two per-number factors,
$$
\log_{10}{n} = 6 \quad (\log_{2}{n} \approx 20) \\
T \approx 20,000
$$
since in the worst case we have to check all 20,001 targets for each number in the array.