# Hash Tables

The goal is to maintain a (possibly evolving) set of objects. We aim to implement insert, delete and lookup (via a key). Using a properly implemented hash table, all of these operations can be done in $O(1)$ time (for "non-pathological" data).

Let $U$ be the universe of all possible elements. We aim to maintain an evolving set $S \subseteq U$.

1. Pick the number of buckets $n = O(\lvert S \rvert)$.
2. Choose a hash function
$$
h: U \to \{0, 1, 2, \cdots, n-1 \}
$$
3. Use an array $A$ of length $n$ and store $x$ in $A[h(x)]$.

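In general, however, collisions are hard to avoid: by the birthday paradox, with only about $\sqrt{n}$ elements (more precisely, roughly $1.18\sqrt{n}$) hashed uniformly into $n$ buckets, there is already a $50\%$ chance of two elements mapping to the same bucket. To resolve collisions we may opt for one of the following: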
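The collision estimate above can be checked numerically. The following is a minimal sketch, assuming each key lands in one of the $n$ buckets uniformly and independently; the function name `collision_probability` is hypothetical.

```python
def collision_probability(k: int, n: int) -> float:
    """Probability that k keys, hashed uniformly at random into n buckets,
    produce at least one collision."""
    p_all_distinct = 1.0
    for i in range(k):
        # The (i+1)-th key must avoid the i buckets that are already occupied.
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct


n = 10_000
print(collision_probability(100, n))  # k = sqrt(n)
print(collision_probability(118, n))  # k is about 1.18 * sqrt(n)
```

With $n = 10{,}000$ the second call should land close to one half, while the first stays somewhat below it.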

1. Separate chaining: 

    Keep a linked list in each bucket; every element that hashes to that bucket is stored in its list.
2. Open addressing:

    Maintain only one object per bucket. The hash function specifies a sequence of buckets to try (a probe sequence), which is followed until an open slot is found.
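Both strategies can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: Python's built-in `hash()` stands in for $h$, Python lists stand in for the linked lists, the open-addressing variant uses linear probing (one possible probe sequence) and omits deletion, and all class and method names are hypothetical.

```python
class ChainedHashTable:
    """Separate chaining: each bucket holds a list of (key, value) pairs."""

    def __init__(self, n_buckets: int = 8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() stands in for the hash function h.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
        raise KeyError(key)


class LinearProbingTable:
    """Open addressing with linear probing; insert and lookup only."""

    def __init__(self, n_slots: int = 8):
        self.slots = [None] * n_slots  # each slot is None or one (key, value)

    def _probe(self, key):
        # Probe sequence: h(key), h(key) + 1, ... wrapping around the array.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                break  # an empty slot ends the probe sequence
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)
```

Deletion is left out of the open-addressing sketch on purpose: simply emptying a slot would cut the probe sequence of any key stored after it, so a real implementation needs extra bookkeeping.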

    

A good hash function achieves two things:
1. It spreads out the data evenly.
2. It is easy to store and fast to evaluate.

The load of a hash table is
$$
\alpha = \frac{\text{\# of objects in hash table}}{\text{\# of buckets}}
$$
In order for the hash table's operations to run in constant time, we need
$$
\alpha = O(1)
$$
Furthermore, for open addressing we need $\alpha \ll 1$.

To maintain this, we can grow the number of buckets with the number of objects in the hash table.
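A minimal sketch of this idea follows, assuming a chained hash set that doubles its bucket count whenever the load exceeds a threshold. The threshold `max_load = 0.75`, the doubling policy, and the names `GrowingSet` and `_resize` are illustrative assumptions, not part of the notes.

```python
class GrowingSet:
    """Chained hash set that doubles its bucket count when load > max_load."""

    def __init__(self, n_buckets: int = 4, max_load: float = 0.75):
        self.buckets = [[] for _ in range(n_buckets)]
        self.size = 0
        self.max_load = max_load  # assumed threshold, chosen for illustration

    def load(self) -> float:
        # alpha = (# of objects) / (# of buckets)
        return self.size / len(self.buckets)

    def add(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        if key in bucket:
            return
        bucket.append(key)
        self.size += 1
        if self.load() > self.max_load:
            self._resize(2 * len(self.buckets))

    def _resize(self, n_buckets: int):
        old_keys = [k for bucket in self.buckets for k in bucket]
        self.buckets = [[] for _ in range(n_buckets)]
        for k in old_keys:
            # Every key is rehashed because its bucket depends on n_buckets.
            self.buckets[hash(k) % n_buckets].append(k)

    def __contains__(self, key):
        return key in self.buckets[hash(key) % len(self.buckets)]


s = GrowingSet()
for i in range(1000):
    s.add(i)
print(len(s.buckets), s.load())
```

Each resize costs time linear in the number of stored keys, but because the bucket count doubles, resizes become rarer as the table grows.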

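Furthermore, for every hash function there exists a data set, which we can call the pathological data set, on which the hash function performs poorly. Since $U$ is much larger than $n$, some bucket receives at least $\lvert U \rvert / n$ elements of $U$, and a data set drawn entirely from that bucket collides on every insert.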

Solutions:
1. Use a cryptographic hash function such as [SHA-2](https://en.wikipedia.org/wiki/SHA-2), for which constructing a pathological data set is believed to be infeasible.
2. Use randomization:

    Design a family $H$ of hash functions and pick one of them at random at runtime. This is also known as "Universal Hashing".




## Universal Hashing

Let $H$ be a set of hash functions of the form
$$
h: U \to \{ 0, 1, 2, \cdots, n-1 \}
$$

$H$ is universal iff, for all $x, y \in U$ with $x \neq y$,
$$
\Pr[h(x) = h(y)] \leq \frac{1}{n}
$$

when $h$ is chosen uniformly at random from $H$.


### Example: Hashing IP Addresses

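Let $U$ be the set of IP addresses of the form
$$
(x_1, x_2, x_3, x_4) \; \text{for} \; x_i \in \{0, 1, 2, \cdots, 255\}
$$

Let $n$ be a prime. The proof below additionally needs $n > 255$, so that distinct address components remain distinct modulo $n$.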

Define one hash function $h_a$ for each 4-tuple
$$
a = (a_1, a_2, a_3, a_4) \; \text{for} \; a_i \in \{0, 1, 2, \cdots, n-1\}
$$

This produces $n^4$ such functions.

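For an IP address $x = (x_1, x_2, x_3, x_4)$, the function $h_a$ is the dot product of $a$ and $x$, reduced modulo $n$:
$$
h_a(x) = (a \cdot x) \bmod n \\[10pt]
h_{(a_1, a_2, a_3, a_4)}{(x_1, x_2, x_3, x_4)} = (a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4) \bmod n
$$

Finally, the universal family $H$ is the set of all $h_a$:
$$
H = \{ h_a \mid a \in \{0, 1, 2, \cdots, n-1\}^4 \}
$$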

Proof:

Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$ and $(y_1, y_2, y_3, y_4)$. Since they are distinct, they differ in at least one component; assume it is the fourth, so $x_4 \neq y_4$ (the other cases are symmetric). A collision means
$$
h_{a}{(x_1, x_2, x_3, x_4)} = h_{a}{(y_1, y_2, y_3, y_4)} \\[10pt]
a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 \equiv a_1y_1 + a_2y_2 + a_3y_3 + a_4y_4 \mod{n} \\
a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3}{a_i(y_i - x_i)} \mod{n} 
$$

Consider $a_4$ as a random variable, with $a_1, a_2, a_3$ already fixed, and write
$$
g = x_4 - y_4 \quad, \quad m = \sum_{i=1}^{3}{a_i(y_i - x_i)}
$$

Further, assume that $n$ is larger than every possible value of $x_4$ and $y_4$ (that is, $n > 255$), so that $0 < \lvert g \rvert < n$ and therefore
$$
g \not\equiv 0 \mod{n}
$$

The probability of a collision then reduces to counting the values of $a_4$ that solve the equation
$$
a_4 \times g \equiv m \mod{n}
$$

Assume that there exist two solutions $k_1, k_2 \in \{0, 1, 2, \cdots, n-1\}$ to this equation.
$$
k_1 \times g \equiv k_2 \times g \equiv m \mod{n} \\
k_1 \times g - k_2 \times g \equiv 0 \mod{n} \\
(k_1 - k_2) \times g \equiv 0 \mod{n}
$$

Since $n$ is prime and $g$ is not congruent to $0$ modulo $n$, this implies that
$$
k_1 - k_2 \equiv 0 \mod{n}
$$

Therefore, since $a_4$ can only take values from $\{0, 1, 2, \cdots, n-1\}$, we must have $k_1 = k_2$: at most one value of $a_4$ satisfies the congruence. Exactly one does, because multiplication by $g$ is injective on the $n$ residues modulo $n$ and is therefore a bijection.

Since only $1$ out of the $n$ choices for $a_4$ satisfies the congruence and $a_4$ is chosen uniformly at random,
$$
\Pr[\text{collision}] = \frac{1}{n}
$$
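The counting step of the proof can be checked by brute force. The sketch below fixes $a_1, a_2, a_3$ at random and enumerates every $a_4$; the function name `colliding_a4`, the sample addresses, and $n = 257$ are assumptions chosen for illustration.

```python
import random


def colliding_a4(x, y, a123, n):
    """All a_4 in {0, ..., n-1} with h_a(x) == h_a(y), for fixed a_1..a_3."""

    def h(a, v):
        return sum(a_i * v_i for a_i, v_i in zip(a, v)) % n

    return [a4 for a4 in range(n)
            if h((*a123, a4), x) == h((*a123, a4), y)]


n = 257                # a prime larger than 255
x = (192, 168, 0, 1)
y = (10, 0, 0, 7)      # differs from x in the fourth component
rng = random.Random(0)
for _ in range(5):
    a123 = tuple(rng.randrange(n) for _ in range(3))
    # The proof predicts exactly one colliding a_4 per choice of a_1..a_3.
    print(a123, colliding_a4(x, y, a123, n))
```

Primality matters here: with a composite modulus such as $n = 256$, addresses whose fourth components differ by $2$ can collide for more than one value of $a_4$.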