In [1]:
import numpy as np
import cProfile

## Algorithm Design 2019-20 @ Computer Science - Università di Pisa

### Scribes: Chiara Boni, Eleonora Di Gregorio 
### Lecturer: Roberto Grossi 

# Hashing
## Universal Hash Family

### Definitions and goals

The motivation for constructing a _universal hash family_ is that any fixed hash function admits a worst-case input in which all the keys are stored in the same slot of the table, so that retrieval takes time linear in the number of keys.<br>
In universal hashing, the hash function is selected _at random_ from a class of functions, _independently_ of the keys that are going to be stored. <br>
The strength of this approach lies in the randomized selection: since the algorithm can behave differently on each execution, no single input always evokes the worst-case scenario. <p>
    
Let $H$ be a finite set of hash functions that map a universe $U$ of keys into the range {$0, 1, \ldots, m - 1$}. $H$ is called _universal_ if, for each pair of distinct keys $k,l \in U$, the number of hash functions $h \in H$ for which $h(k) = h(l)$ is at most $|H| / m$. <br>
Hence, for a function $h$ chosen uniformly at random from $H$, the probability of a collision between two distinct keys is at most $1/m$: <br>
$P[h(k) = h(l)] \le 1/m$ <p>
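
The definition can be checked by brute force on a small example. The sketch below assumes one particular family that is not defined in the text above: the classic construction $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$, with $p$ prime, $1 \le a < p$ and $0 \le b < p$, over the universe {$0, \ldots, p - 1$}. The names `make_family`, `h` and `max_collisions` and the values of `p` and `m` are chosen only for illustration.

```python
from itertools import combinations


def make_family(p):
    """Parameters (a, b) of every h_{a,b} in the assumed family."""
    return [(a, b) for a in range(1, p) for b in range(p)]


def h(a, b, k, p, m):
    """h_{a,b}(k) = ((a*k + b) mod p) mod m."""
    return ((a * k + b) % p) % m


def max_collisions(p, m):
    """Largest number of functions in the family that collide on a pair of distinct keys."""
    family = make_family(p)
    worst = 0
    for k, l in combinations(range(p), 2):  # every pair of distinct keys in U
        count = sum(1 for a, b in family if h(a, b, k, p, m) == h(a, b, l, p, m))
        worst = max(worst, count)
    return worst, len(family)


p, m = 17, 6  # assumed toy sizes: p prime, table with m slots
worst, size = max_collisions(p, m)
print(worst, "<=", size / m)  # universality requires worst <= |H| / m
```

The check enumerates the whole family, so it is only practical for very small $p$; it illustrates the definition rather than proving universality in general.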
    
_Load factor_<br>
Suppose that a hash function $h$ is chosen at random from a universal collection and is used to store $n$ keys in a table $T$ of size $m$, with collisions resolved by chaining. If key $k$ is not in the table, then the expected length $E[n_{h(k)}]$ of the list that $k$ hashes to is at most the _load factor_ $\alpha = n / m$. <br>
If $k$ is in the table, then the expected length $E[n_{h(k)}]$ of the list containing $k$ is at most $1 + \alpha$. <p>
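
To make the quantities $n$, $m$, $\alpha$ and $n_{h(k)}$ concrete, here is a minimal sketch of a table with chaining whose hash function is drawn at random when the table is created. The class `ChainedHashTable`, its method names, and the family $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$ with the prime $p = 2^{31} - 1$ are assumptions made for illustration; keys are assumed to be integers in $[0, p)$.

```python
import random


class ChainedHashTable:
    """Hash table with chaining; h is drawn at random at construction time."""

    def __init__(self, m, p=2_147_483_647, rng=None):
        rng = rng or random.Random()
        self.m, self.p = m, p
        self.a = rng.randrange(1, p)  # a in {1, ..., p-1}
        self.b = rng.randrange(p)     # b in {0, ..., p-1}
        self.slots = [[] for _ in range(m)]
        self.n = 0                    # number of stored keys

    def _h(self, k):
        return ((self.a * k + self.b) % self.p) % self.m

    def insert(self, k):
        chain = self.slots[self._h(k)]
        if k not in chain:            # keep keys distinct
            chain.append(k)
            self.n += 1

    def contains(self, k):
        return k in self.slots[self._h(k)]

    def chain_length(self, k):
        """n_{h(k)}: length of the list that key k hashes to."""
        return len(self.slots[self._h(k)])

    @property
    def load_factor(self):
        return self.n / self.m        # alpha = n / m


table = ChainedHashTable(m=8, rng=random.Random(0))
for key in [3, 14, 15, 92, 65, 35]:
    table.insert(key)
print(table.load_factor)       # alpha = 6 / 8
print(table.chain_length(89))  # list scanned when searching for the absent key 89
```

A single run shows only one draw of $h$; the bound concerns the expectation over the random choice of $h$, so individual chains may be longer than $\alpha$.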

_Proof_<br>
For each pair $k$ and $l$ of distinct keys, define the indicator random variable<br>
$X_{kl}$ = $I${$h(k) = h(l)$}. <br>
By the definition of _universal hashing_, a pair of distinct keys collides with probability at most $1 / m$, that is, $Pr${$h(k) = h(l)$} $\le 1 / m$. <br>
Since the expectation of an indicator variable equals the probability of its event, $E$[$X_{kl}$] = $Pr${$h(k) = h(l)$} $\le 1 / m$. <p>
 
Next, define for each key $k$ a random variable $Y_{k}$ equal to the number of keys other than $k$ that hash to the same slot as $k$:<br>
$Y_{k}$ = $\sum_{l \in T, l \ne k} X_{kl}$. <br>

Then, <br>
$E$[$Y_{k}$] = $E \biggl[ \sum_{l \in T, l \ne k} X_{kl} \biggr]$ <br><br>
= $\sum_{l \in T, l \ne k} E[X_{kl}]$ (by linearity of expectation) <br><br>
$\le \sum_{l \in T, l \ne k} \frac1m$. <p>
    
The remainder of the proof depends on whether key $k$ is in the table $T$. <br>

- if $k$ $\not\in$ $T$, then $n _{h(k)}$ = $Y_{k}$, and $|${$l : l \in T$ and $l \not = k$}$| = n$.<br>
Therefore, $E$[$n_{h(k)}$] = $E$[$Y_{k}$] $\le n / m = \alpha$.<br>
    
- if $k \in T$, then key $k$ appears in the list $T[h(k)]$, but the count $Y_{k}$ does not include $k$. <br>
 Therefore, $n_{h(k)}$ = $Y_{k} + 1$ and $|${$l : l \in T$ and $l \ne k$}$| = n - 1$.<br>
 Hence, $E$[$n_{h(k)}$] = $E$[$Y_{k}$] $+ 1 \le (n - 1)/m + 1 = 1 + \alpha - 1/m < 1 + \alpha$. <p>
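
Both cases of the proof can be verified exactly on a toy instance by averaging $n_{h(k)}$ over every function of a small family. This sketch again assumes the family $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$ with $p$ prime, which is not defined in the text above; the function `expected_chain_length`, the key set `T` and the parameter values are hypothetical.

```python
from fractions import Fraction


def expected_chain_length(k, T, p, m):
    """Exact E[n_{h(k)}] when h is uniform over the assumed family h_{a,b}."""
    total = 0  # sum over all h of the number of stored keys in slot h(k)
    size = 0   # |H|
    for a in range(1, p):
        for b in range(p):
            hk = ((a * k + b) % p) % m
            total += sum(1 for l in T if ((a * l + b) % p) % m == hk)
            size += 1
    return Fraction(total, size)


p, m = 17, 6
T = [2, 3, 5, 7, 11]  # stored keys, n = 5
alpha = Fraction(len(T), m)

absent = expected_chain_length(4, T, p, m)   # case k not in T
present = expected_chain_length(5, T, p, m)  # case k in T
print(absent, "<=", alpha)
print(present, "<", 1 + alpha)
```

Using `Fraction` keeps the averages exact, so the comparisons with $\alpha$ and $1 + \alpha$ are not affected by floating-point rounding.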

### Designing a universal class of hash functions

### Code

### Animation

### References