import numpy as np
import cProfile

## Algorithm Design 2019-20 @ Computer Science - Università di Pisa

### Scribes: Eleonora Di Gregorio, Chiara Boni 
### Lecturer: Roberto Grossi 

# Hashing
## Universal Hash Family

### Definitions and goals

A **Universal Hash Family** is motivated by the fact that any fixed hash function has a worst-case input in which all the _keys_ are stored in the same slot of the table, so that retrieval degrades to time linear in the number of keys.<br>
In _universal hashing_, the function is selected _randomly_, and _independently_ of the keys, from a **class of functions**. <br>
The strength of this approach lies in the **randomized selection**: because the algorithm behaves differently on each execution, no single input can evoke the worst-case behavior every time. <p>
    
Let _H_ be a finite set of hash functions that map a universe _U_ of keys into the range {0, 1,..., m - 1}. _H_ is called **universal** if, for each pair of distinct keys _k,l_ $\in$ _U_, the number of hash functions _h_ $\in$ _H_ for which _h(k) = h(l)_ is at most |_H_| / _m_. <br>
Hence, for a function _h_ chosen uniformly at random from _H_, the _probability_ of a collision between two distinct keys is at most 1/m: <br>
**_P[h(k) = h(l)]_ $\le$ _1/m_** <p>
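
The definition can be checked by brute force on a small family. The sketch below counts, for every pair of distinct keys, how many functions of the family collide on that pair, and compares the worst count with |_H_| / _m_. The family _h_<sub>a,b</sub>(_k_) = ((_a k_ + _b_) mod _p_) mod _m_ (the classic Carter-Wegman construction) is not defined in these notes and is assumed here only for illustration, as are the values `P = 17` and `M = 5` and the helper names `h_ab` and `max_collision_count`.

```python
from itertools import combinations

P = 17  # assumed prime; keys are taken from {0, ..., P - 1}
M = 5   # assumed number of table slots


def h_ab(a, b, k, p=P, m=M):
    """Hypothetical family member h_{a,b}(k) = ((a*k + b) mod p) mod m."""
    return ((a * k + b) % p) % m


def max_collision_count(p=P, m=M):
    """Return (worst, size): the largest number of colliding functions
    over all pairs of distinct keys, and the size of the family."""
    family = [(a, b) for a in range(1, p) for b in range(p)]
    worst = 0
    for k, l in combinations(range(p), 2):
        count = sum(1 for a, b in family if h_ab(a, b, k, p, m) == h_ab(a, b, l, p, m))
        worst = max(worst, count)
    return worst, len(family)


worst, size = max_collision_count()
print("worst collision count:", worst)
print("|H| / m             :", size / M)
```

If the family is universal, the first printed value cannot exceed the second. Exhaustive enumeration is only feasible for tiny parameters like these; it is meant as a sanity check, not as a proof.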
    
**Load factor**<br>
Let _h_ be a hash function chosen randomly from a universal collection and used to store _n_ keys in a table _T_ with _m_ slots, where colliding keys are kept in a list. If key _k_ is not in the table, then the expected length $\textrm{E}$[_n_<sub>_h_(_k_)</sub>] of the list that _k_ hashes to is at most the _load factor_ **$\alpha$ = _n / m_**. <br>
If _k_ is in the table, then the expected length $\textrm{E}$[_n_<sub>_h_(_k_)</sub>] of the list containing _k_ is at most 1 + $\alpha$. <p>

_Proof_ <br>
For each pair _k_ and _l_ of distinct keys, define the indicator random variable <br>
    _X_ <sub>kl</sub> = $\textrm{I}$ { _h_(_k_) = _h_(_l_) }. <br>
By definition of _universal hashing_, a pair of distinct keys collides with probability at most 1 / _m_, that is, Pr { _h_(_k_) = _h_(_l_) } $\le$ 1 / _m_. <br>
    Therefore $\textrm{E}$[_X_ <sub>kl</sub>] $\le$ 1 / _m_. <p>
 
Next, for each key _k_, define the random variable _$\textrm{Y}$_<sub>k</sub> as the number of keys other than _k_ that hash to the same slot as _k_: <br>
        _$\textrm{Y}$_ <sub>k</sub> = $\sum$ <sub>l $\in$ T, l $\ne$ k</sub> _X_ <sub>kl</sub> . <br>
   
Then, <br>
$\textrm{E}[\textrm{Y}_k] = \textrm{E}\biggl[\sum_{l \in T,\, l \ne k} X_{kl}\biggr]$ <br><br>
$= \sum_{l \in T,\, l \ne k} \textrm{E}[X_{kl}]$ (by linearity of expectation) <br><br>
$\le \sum_{l \in T,\, l \ne k} \frac{1}{m}$. <p>
    
The remainder of the proof depends on whether key _k_ is in the table. <br>

- if _k_ $\not \in$ _T_, then _n_<sub>_h_(_k_)</sub> = _$\textrm{Y}$_<sub>k</sub>, and |{_l : l $\in$ T_ and _l_ $\not =$ _k_}| = _n_. <br>
Therefore, $\textrm{E}$[_n_<sub>_h_(_k_)</sub>] = $\textrm{E}$[_$\textrm{Y}$_<sub>k</sub>] $\le$ _n / m_ = $\alpha$.<br>
    
- if _k_ $\in$ _T_, then key _k_ appears in the list _T_[_h_(_k_)], but the count _$\textrm{Y}$_<sub>k</sub> does not include _k_. <br>
 Therefore _n_<sub>_h_(_k_)</sub> = _$\textrm{Y}$_<sub>k</sub> + 1 and |{_l : l $\in$ T_ and _l_ $\not =$ _k_}| = _n_ - 1.<br>
 Thus $\textrm{E}$[_n_<sub>_h_(_k_)</sub>] = $\textrm{E}$[_$\textrm{Y}$_<sub>k</sub>] + 1 $\le$ (_n_ - 1)/_m_ + 1 = 1 + $\alpha$ - 1/_m_ $<$ 1 + $\alpha$. <p>
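
Both cases of the proof can be verified exactly on a small instance by averaging the list length over every function of a family, rather than sampling. The sketch below again assumes the family _h_<sub>a,b</sub>(_k_) = ((_a k_ + _b_) mod _p_) mod _m_ with `P = 17` and `M = 5`; the family, the stored keys in `T`, and the helper name `expected_chain_length` are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction

P = 17  # assumed prime; keys are taken from {0, ..., P - 1}
M = 5   # assumed number of table slots
FAMILY = [(a, b) for a in range(1, P) for b in range(P)]


def h_ab(a, b, k, p=P, m=M):
    """Hypothetical family member h_{a,b}(k) = ((a*k + b) mod p) mod m."""
    return ((a * k + b) % p) % m


def expected_chain_length(table_keys, k):
    """Exact E[n_{h(k)}]: the number of stored keys in slot h(k),
    averaged over every h in FAMILY (h chosen uniformly at random)."""
    total = 0
    for a, b in FAMILY:
        slot = h_ab(a, b, k)
        total += sum(1 for l in table_keys if h_ab(a, b, l) == slot)
    return Fraction(total, len(FAMILY))


T = [1, 4, 6, 9, 12, 15]       # n = 6 stored keys, chosen arbitrarily
alpha = Fraction(len(T), M)    # load factor n / m

absent = expected_chain_length(T, 3)   # key 3 is not in T
present = expected_chain_length(T, 4)  # key 4 is in T

print("k not in T:", float(absent), "<=", float(alpha))
print("k in T    :", float(present), "<", float(1 + alpha))
```

Using `Fraction` keeps the averages exact, so the comparison with $\alpha$ and 1 + $\alpha$ is not affected by floating-point rounding.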

### Designing a universal class of hash functions

### Code

### Animation

### References