In [1]:
import math
import logging
FORMAT = '[%(name)s:%(levelname)s]  %(message)s'
logging.basicConfig(level=logging.DEBUG, format=FORMAT)
logger = logging.getLogger('dbg')

def dprint(s):
    logger.debug(s)

def iprint(s):
    logger.info(s)

logger.setLevel(logging.INFO)

## Hash Tables

**Hash Tables** are data structures designed for fast `search`, `insert` and `delete`, which makes them well suited to implementing abstract data types such as sets and dictionaries. Storage complexity is $\Theta(n)$.

|            | Search | Insert | Delete |
| -          | - | - | - |
| Average    | $O(1)$ | $O(1)$ | $O(1)$ |
| Worst Case | $\Theta(n)$ | $\Theta(n)$  | $\Theta(n)$  |

The idea is $O(1)$ **Search-By-Index** (SBI): replace searching with direct array addressing.

### Direct Addressing SBI

Suppose the $n$ objects have unique integer keys drawn from a universe $U$ of $m$ possible keys, $U = \{0, \dots, m-1\}$, so each $k_i \in U$. We *could* allocate an $m$-slot array and use the keys directly as indexes, giving $O(1)$ operations, but this wastes a huge amount of space whenever $m \gg n$.

e.g. IPv6 addresses: $|U| = 2^{128} \approx 3.4 \cdot 10^{38}$ slots, storage that would cost $> 28 \cdot 10^{27}$ GBP!
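As a minimal sketch of direct addressing, assuming small non-negative integer keys, the table below simply uses the key as the array index. The class name `DirectAddressTable` and its method names are illustrative, not taken from any library.

```python
class DirectAddressTable:
    """Direct addressing: the slot index is the key itself, keys in {0, ..., m-1}."""

    def __init__(self, m):
        # One slot per *possible* key, regardless of how many keys are stored.
        self.slots = [None] * m

    def insert(self, key, value):
        self.slots[key] = value      # O(1)

    def search(self, key):
        return self.slots[key]       # O(1), None if the key is absent

    def delete(self, key):
        self.slots[key] = None       # O(1)


table = DirectAddressTable(m=10)
table.insert(3, "three")
```

Every operation is a single array access, but the array has $m$ slots even if only one key is ever stored, which is what makes the approach impractical for large universes.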

**Hash Tables** use a hash function $h$ to compute slots. The **Goal** is to design $h$ so that the required storage array shrinks to a size of roughly $m = \Theta(n)$.



### Hash Functions

A hash function, $h:U\rightarrow \{0,\dots, m-1\}$, maps every key in the universe $U$ to one of the $m$ slots $\{0,\dots, m-1\}$ of the table. Typically $|U| \gg m$, so distinct keys may share a slot.

An example function for table size $m$ would be: $h(k) = k \bmod m$.

A hash function must:
1. Be fast to compute
2. Minimise collisions, i.e. cases where $k_i \neq k_j$ but $h(k_i) = h(k_j)$

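An **Ideal $h(k)$** would roll a fair $m$-sided die to select a slot for each $k$ (and remember the result, so that the same key always maps to the same slot): an independent uniform random hash function. However, input data is rarely random, so a good hash function must spread even highly patterned keys across the slots as if they were random.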

#### Static Forms

A single fixed function, so vulnerable to unfavourable key distributions that produce many collisions.

**Division Method** - $h(k) = k\mod m$ 

Works better if $m$ is a prime number, since regular patterns in the keys are then less likely to line up with $m$.

**Multiplication Method** - $h(k) = \lfloor m \cdot (Ak\mod 1) \rfloor$  *for*  $A \in (0, 1)$ 
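Both static forms can be sketched in a few lines. The function names are hypothetical, and the default $A = (\sqrt{5}-1)/2$ is an assumption (a commonly suggested constant), not something fixed by the notes above. `(A * k) % 1.0` takes the fractional part of $Ak$.

```python
import math


def h_division(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


def h_multiplication(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: h(k) = floor(m * (A*k mod 1)) for A in (0, 1)."""
    fractional_part = (A * k) % 1.0
    return math.floor(m * fractional_part)


# Patterned keys defeat a badly chosen m: every multiple of 8 lands in slot 0.
patterned_slots = {h_division(k, 8) for k in range(0, 80, 8)}

print(h_division(123456, 701))          # 80
print(h_multiplication(123456, 2**14))  # 67
```

The `patterned_slots` set contains a single slot, which illustrates why a fixed static function is vulnerable to unlucky key distributions.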

#### Random Forms

*Universal* families: a family $H$ of hash functions is universal if

$P_{h \in H} \left[ h(k_i) = h(k_j)\right] \leq \frac{1}{m} \; \forall \; k_i \neq k_j$

In other words, any two different keys of the universe collide with probability at most $\frac{1}{m}$ when hash function $h$ is drawn uniformly at random from $H$. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. 

Pick a prime number $p$ greater than the universe size, $p > |U|$. The random parameters $a$ and $b$ act as 'salts': because they are chosen at random, no fixed set of keys is bad for every function in the family.

$h_{a,b}(k) \; \hat{=} \; ((ak+b) \bmod p) \bmod m$ - The hash function.

The universal family $H_{p,m} = \{h_{a,b} | a \in \mathbb{Z}_p^*, \; b \in \mathbb{Z}_p\}$
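A minimal sketch of drawing a function from $H_{p,m}$, assuming integer keys in $\{0, \dots, p-1\}$. The helper names `make_universal_hash` and `collision_fraction` are hypothetical. The second helper enumerates the whole family for a small $p$ to check the $\frac{1}{m}$ collision bound for one pair of keys.

```python
import random


def make_universal_hash(p, m, rng=random):
    """Draw h_{a,b} uniformly at random from the family H_{p,m}."""
    a = rng.randrange(1, p)   # a in Z_p^* = {1, ..., p-1}
    b = rng.randrange(0, p)   # b in Z_p   = {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m


def collision_fraction(k1, k2, p, m):
    """Fraction of all h_{a,b} in H_{p,m} under which k1 and k2 collide."""
    collisions = sum(
        ((a * k1 + b) % p) % m == ((a * k2 + b) % p) % m
        for a in range(1, p)
        for b in range(p)
    )
    return collisions / (p * (p - 1))


p, m = 101, 10                      # p is a prime larger than the universe {0, ..., 99}
h = make_universal_hash(p, m, random.Random(0))
slots = [h(k) for k in range(100)]
fraction = collision_fraction(3, 7, p, m)   # should not exceed 1/m
```

The exhaustive check is only feasible because $p$ is tiny here; in practice one function is drawn at random when the table is created.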

### Chaining

Chaining is one way to manage collisions: each slot holds a doubly linked list of all the elements that hash to it.

<img src="media/hashDLL.png" alt="drawing" width="300"/>

**Worst Case:**

* All $n$ keys collide $\Rightarrow$ All objects are placed in the same slot 
* Search is then $\Theta(n)$ with linked list linear search
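* Could be $\Theta(\log n)$ if each chain is stored as a balanced search tree (or sorted array) instead of a linked list, since a linked list does not support binary search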

**Average Case:** (Cost of search)

* Define a table *Load Factor* - $\alpha \; \hat{=} \; \frac{n}{m}$ for $n$ items stored amongst $m$ slots in the table.
* Assume the hash function is *Universal*, so the collision probability is $\leq \frac{1}{m}$
* $\mathbb{E}[$ Chain Length $] = \frac{n}{m} = \alpha$
* Average cost = $\Theta(1 + \alpha)$ - Hash + chain search
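The chaining scheme above can be sketched as follows. This is an illustrative implementation, not the notes' own: it assumes Python lists as chains in place of doubly linked lists, and it uses the built-in `hash` with the division method rather than a universal family. The class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, m=8):
        self.m = m                              # number of slots
        self.n = 0                              # number of stored items
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        # Assumption: built-in hash + division method, for illustration only.
        return hash(key) % self.m

    def search(self, key):
        for k, v in self.slots[self._h(key)]:   # linear scan of one chain
            if k == key:
                return v
        return None

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                        # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.n -= 1
                return True
        return False

    def load_factor(self):
        return self.n / self.m                  # alpha = n / m


table = ChainedHashTable(m=4)
for k in range(10):
    table.insert(k, k * k)
```

With 10 items in 4 slots the load factor is $\alpha = 2.5$, so a search scans a chain of about that length on average.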
  
### Open Addressing 

Open addressing is a form of chain-free collision handling. The simplest variant is *linear probing* where slots are sequentially searched until an empty one is found.

The probe sequence for each key should be a permutation of $(0, 1, \dots, m-1)$, so that every slot is eventually examined.

An alternative form of probing is **Double Hashing:**  $h(k, i) = (h_1(k) + i \, h_2(k)) \bmod m$.  $h_2(k)$ and $m$ must be coprime, otherwise the probe sequence does not visit every slot.
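A minimal sketch of open addressing with double hashing, assuming non-negative integer keys and a prime table size $m$ (so any $h_2(k) \in \{1, \dots, m-1\}$ is coprime with $m$). The choices $h_1(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod (m-1))$ are assumptions made for illustration, and the class name is hypothetical. Deletion is left out because it needs extra care (e.g. tombstone markers) in open addressing.

```python
class DoubleHashTable:
    """Open addressing with double hashing; stores non-negative integer keys."""

    def __init__(self, m=11):
        self.m = m                   # assumed prime
        self.slots = [None] * m

    def _probe(self, key, i):
        h1 = key % self.m
        h2 = 1 + (key % (self.m - 1))        # never 0, always coprime with prime m
        return (h1 + i * h2) % self.m

    def insert(self, key):
        """Insert key and return the number of probes used."""
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is None or self.slots[j] == key:
                self.slots[j] = key
                return i + 1
        raise OverflowError("hash table is full")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is None:        # empty slot: key cannot be further along
                return False
            if self.slots[j] == key:
                return True
        return False


table = DoubleHashTable(m=11)
probes = [table.insert(k) for k in (0, 11, 22)]        # all three share h1 = 0
probe_sequence = [table._probe(14, i) for i in range(11)]
```

The keys 0, 11 and 22 collide on their first probe but diverge afterwards because their $h_2$ values differ, and `probe_sequence` is a permutation of the 11 slots.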

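Number of probes in an unsuccessful search, assuming independent uniform permutation hashing and $\alpha < 1$:

Expected probes $\leq \frac{1}{1-\alpha} = 1 + \alpha + \alpha^2 + \alpha^3 + \dots$ Each term $\alpha^{i}$ bounds the probability that more than $i$ probes are needed: the first probe is always made, a second is needed with probability at most $\alpha$, a third with probability at most $\alpha^2$, and so on.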
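The $\frac{1}{1-\alpha}$ figure can be checked empirically with a small simulation. This is a sketch under the stated assumption of uniform permutation hashing: the occupied slots and the probe order are both drawn uniformly at random. The function name and the parameters `m`, `n`, `trials` are illustrative.

```python
import random


def probes_for_unsuccessful_search(m, n, rng):
    """Count probes until the first empty slot, with n of m slots occupied."""
    occupied = set(rng.sample(range(m), n))            # which slots are full
    probe_order = rng.sample(range(m), m)              # a uniform random permutation
    for probes, slot in enumerate(probe_order, start=1):
        if slot not in occupied:
            return probes
    return m                                           # only if the table is full


rng = random.Random(0)
m, n, trials = 100, 50, 5000
average = sum(probes_for_unsuccessful_search(m, n, rng) for _ in range(trials)) / trials
bound = 1 / (1 - n / m)                                # 1 / (1 - alpha) with alpha = 0.5
```

For $\alpha = 0.5$ the measured `average` should sit close to, and not meaningfully above, the bound of 2.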

**Estimate the number of table accesses (probes) needed when inserting a new entry into the table.**

Let $\alpha=\frac{n}{m}$ be the load factor of the hash table, and assume each probe independently lands on an occupied slot with probability $\alpha$. The probability of a successful insert in one probe is $1-\alpha$. The probability of a successful insert in exactly $t$ probes is $P(\text{success in }t)=(1-\alpha) \alpha^{t-1}$.

Hence, the expected number of probes is $E[t]= (1-\alpha)\sum_{t=1}^{\infty} t \alpha^{t-1}=\frac{1-\alpha}{(1-\alpha)^2}=\frac{1}{1-\alpha}$.
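As a quick numerical sanity check of the closed form, the partial sums of $(1-\alpha)\sum_{t \geq 1} t\,\alpha^{t-1}$ can be compared with $\frac{1}{1-\alpha}$. The function name and the `terms` cut-off are illustrative choices.

```python
def expected_probes_partial_sum(alpha, terms=10_000):
    """Partial sum of (1 - alpha) * sum_{t=1}^{terms} t * alpha**(t - 1)."""
    return (1 - alpha) * sum(t * alpha ** (t - 1) for t in range(1, terms + 1))


for alpha in (0.25, 0.5, 0.9):
    approx = expected_probes_partial_sum(alpha)
    exact = 1 / (1 - alpha)
    print(f"alpha={alpha}: partial sum {approx:.6f}, closed form {exact:.6f}")
```

The two columns agree to many decimal places, and the values grow quickly as $\alpha$ approaches 1, which is why open-addressed tables are usually kept well below full.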