## Algorithm Design 2019-20 @ Computer Science - Università di Pisa

### Scribes: Chiara Boni, Eleonora Di Gregorio 
### Lecturer: Roberto Grossi 

# Hashing
## Universal Hash Family

### Definitions and goals

The motivation for constructing a $universal$ $hash$ $family$ is that any fixed hash function has a worst-case input: an adversary can choose keys that all land in the same slot of the table, so that retrieval takes time linear in the number of stored keys.<br>
In universal hashing, the function is selected $randomly$, and $independently$ of the keys to be stored, from a class of functions. <br>
The strength of this approach lies in the randomized selection: no single input evokes the worst-case behaviour on every execution, because the algorithm behaves differently on each run. <p>
    
A finite set $H$ of hash functions that map a universe $U$ of keys into the range $\{0, 1, \ldots, m - 1\}$ is called universal if, for each pair of distinct keys $k,l \in U$, the number of hash functions $h \in H$ for which $h(k) = h(l)$ is at most $|H| / m$. <br>
Hence, for a function $h$ chosen uniformly at random from $H$, the probability that two distinct keys collide is at most $1/m$: <br>
$P[h(k) = h(l)] \le 1/m$ <p>
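
The counting condition in the definition can be checked by brute force on a tiny example. The sketch below is only an illustration: the toy universe `U`, the value of `m`, and the choice of taking $H$ to be the set of all functions from `U` to the slots are assumptions made here, not part of the lecture.

```python
from itertools import combinations, product

U = [0, 1, 2]   # toy universe of keys (hypothetical)
m = 2           # number of slots (hypothetical)

# Each function h: U -> {0, ..., m-1} is encoded as the tuple (h(0), h(1), h(2)).
H = list(product(range(m), repeat=len(U)))


def colliding_functions(k, l):
    """Count the functions in H that send keys k and l to the same slot."""
    return sum(1 for h in H if h[k] == h[l])


for k, l in combinations(U, 2):
    count = colliding_functions(k, l)
    print(f"keys ({k}, {l}): {count} of {len(H)} functions collide")
    # Universality: at most |H| / m functions may collide on any pair.
    assert count <= len(H) / m
```

Here every pair of distinct keys collides under exactly 4 of the 8 functions, which meets the bound $|H| / m = 4$ with equality.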
    
$Load$ $factor$<br>
Let $h$ be a hash function chosen at random from a universal collection and used to store $n$ keys in a table $T$ with $m$ slots, resolving collisions by chaining. If key $k$ is not in the table, then the expected length $E[n_{h(k)}]$ of the list that $k$ hashes to is at most the $load$ $factor$ $\alpha = n / m$. <br>
If $k$ is in the table, then the expected length $E[n_{h(k)}]$ of the list containing $k$ is at most $1 + \alpha$. <p>

$Proof$<br>
For each pair $k$ and $l$ of distinct keys, define the indicator random variable<br>
    $X _{kl}$ = $I${$h(k) = h(l)$}. <br>
By the definition of _universal hashing_, a pair of distinct keys collides with probability at most $1 / m$: $Pr$ {$ h(k) = h(l)$} $\le 1 / m$. <br>
    Therefore $E$[$X _{kl}$] $\le 1 / m$. <p>
 
Next, for each key $k$, define the random variable $Y_{k}$ that equals the number of keys other than $k$ that hash to the same slot as $k$:<br>
$Y_{k}$ = $\sum_{l \in T, l \ne k}$ $X_{kl}$. <br>
   
Then, <br>
$E$[$Y_{k}$] = $E \biggl[ \sum _{l \in T, l \ne k}$ $X_{kl}$ $\biggr]$ <br><br>
= $\sum _{l \in T, l \ne k}$ $E$[$X_{kl}$] (by linearity of expectation) <br><br>
$\le$ $\sum _{l \in T, l \ne k} \frac1m$. <p>
    
The last thing to show depends on whether key $k$ is in the table. <br>

- if $k \notin T$, then $n _{h(k)}$ = $Y_{k}$, and $|${$l : l \in T$ and $l \ne k$}$| = n$.<br>
Therefore, $E$[$n_{h(k)}$] = $E$[$Y_{k}$] $\le n / m = \alpha$.<br>
    
- if $k \in T$, then key $k$ itself appears in the list $T[h(k)]$ but is not counted by $Y_{k}$. <br>
 Thus $n_{h(k)}$ = $Y_{k} + 1$ and $|${$l : l \in T$ and $l \ne k$}$| = n - 1$.<br>
 Therefore, $E$[$n_{h(k)}$] = $E$[$Y_{k}$] $+ 1 \le (n - 1)/m + 1 = 1 + \alpha - 1/m < 1 + \alpha$. <p>
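
Both bounds can be verified exactly on a small instance by averaging over every function of a universal family instead of sampling. The sketch below assumes the family $h_{ab}(k) = ((ak+b) \bmod p) \bmod m$ constructed in the next section; the parameters `p`, `m` and the stored keys `T` are hypothetical toy values.

```python
from fractions import Fraction

p, m = 17, 5                  # hypothetical toy parameters: p prime, p > m
T = [1, 4, 6, 9, 12, 15]      # hypothetical stored keys, all smaller than p
n = len(T)
alpha = Fraction(n, m)        # load factor


def expected_chain_length(k):
    """Average, over every h_ab, of the number of keys of T in slot h_ab(k)."""
    total = 0
    for a in range(1, p):
        for b in range(p):
            slot = ((a * k + b) % p) % m
            total += sum(1 for l in T if ((a * l + b) % p) % m == slot)
    return Fraction(total, p * (p - 1))


absent = expected_chain_length(3)    # 3 is not in T
present = expected_chain_length(4)   # 4 is in T, so it counts itself once
print(float(absent), "<=", float(alpha))
print(float(present), "<", float(1 + alpha))
```

Using `Fraction` keeps the averages exact, so the comparison with $\alpha$ and $1 + \alpha$ is not affected by floating-point rounding.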

### Designing a universal class of hash functions

In order to design a universal class of hash functions, we need: <br>
- a prime number $p$ large enough that every possible key $k$ lies in the range $0$ to $p-1$ <br>
- $\mathbb{Z}_{p}$, which denotes the set {0,1,...,$p-1$} <br>
- $\mathbb{Z}^{*}_{p}$, which denotes the set {1,2,...,$p-1$} <p>
    
Since the size of the universe of keys is greater than the number of slots in the hash table, we have $p>m$. <br>
It is now possible to define a hash function $h_{ab}$ for any $a \in \mathbb{Z}^{*}_{p}$ and any $b \in \mathbb{Z}_{p}$, using a linear transformation followed by reductions modulo $p$ and then modulo $m$: <br>
$h_{ab}(k)$ = (($ak+b$) mod $p$) mod $m$. <br>
The family of such functions is $H_{pm}$ = {$h_{ab}: a \in \mathbb{Z}^{*}_{p}$ and $b \in \mathbb{Z}_{p}$}. <p>
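
As a small worked instance of the definition, take $p = 17$, $m = 6$, $a = 3$, $b = 4$ and the key $k = 8$. The helper name `make_hash` below is hypothetical and only serves this illustration.

```python
def make_hash(a, b, p, m):
    """Return h_ab(k) = ((a*k + b) mod p) mod m.

    Assumes p is prime, 1 <= a <= p-1 and 0 <= b <= p-1.
    """
    def h(k):
        return ((a * k + b) % p) % m
    return h


h = make_hash(a=3, b=4, p=17, m=6)
# 3*8 + 4 = 28, 28 mod 17 = 11, 11 mod 6 = 5
print(h(8))  # prints 5
```

With these parameters the family $H_{pm}$ contains $p(p-1) = 17 \cdot 16 = 272$ functions, one for each admissible pair $(a, b)$.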
    
$Theorem$ <br>
The class of hash functions $H_{pm}$ = {$h_{ab}: a \in \mathbb{Z}^{*}_{p}$ and $b \in \mathbb{Z}_{p}$} is universal. <p>
    
$Proof$ <br>
Take two distinct keys $k$ and $l$ from $\mathbb{Z}_{p}$ and, for a given function $h_{ab}$, let: <br>
$r$ = ($ak+b$) mod $p$ <br>
$s$ = ($al+b$) mod $p$. <p>

The first thing to note is that $r \ne s$. Indeed, $r - s \equiv a(k-l)$ (mod $p$), and since $p$ is prime and both $a$ and ($k-l$) are nonzero modulo $p$, their product is also nonzero modulo $p$. <br>
This implies that, when computing any $h_{ab} \in H_{pm}$, distinct inputs $k$ and $l$ map to distinct values $r$ and $s$ modulo $p$, so there are no collisions at the "mod $p$ level". <p>
    
Moreover, each of the $p$($p-1$) possible choices for the pair ($a,b$), with $a \ne 0$, yields a different resulting pair ($r,s$), with $r \ne s$, because ($a,b$) can be recovered from ($r,s$): $a$ = (($r-s$)(($k-l$)$^{-1}$ mod $p$)) mod $p$ and $b$ = ($r-ak$) mod $p$. Since there are only $p$($p-1$) pairs ($r,s$) with $r \ne s$, there is a one-to-one correspondence between pairs ($a,b$), with $a \ne 0$, and pairs ($r,s$), with $r \ne s$. <br>
Therefore, for any given pair of inputs $k, l$, if ($a, b$) is picked uniformly at random from $\mathbb{Z}^{*}_{p} \times \mathbb{Z}_{p}$, the resulting pair ($r, s$) is equally likely to be any pair of distinct values modulo $p$. <br>
Thus, the probability that distinct keys $k, l$ collide is equal to the probability that $r \equiv s$ (mod $m$), with $r, s$ randomly chosen as distinct values modulo $p$. <p>
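
The one-to-one correspondence can be checked exhaustively for a small prime. The sketch below fixes hypothetical values `p = 7`, `k = 2`, `l = 5`, maps every admissible pair $(a, b)$ to $(r, s)$, and then recovers $(a, b)$ by solving the two congruences for $a$ and $b$.

```python
p = 7          # hypothetical small prime
k, l = 2, 5    # any two distinct keys in Z_p


def to_rs(a, b):
    """Forward map: (a, b) -> (r, s) at the 'mod p level'."""
    return (a * k + b) % p, (a * l + b) % p


def to_ab(r, s):
    """Inverse map: recover (a, b) from a pair (r, s) with r != s."""
    inv = pow((k - l) % p, -1, p)   # modular inverse of (k - l), Python 3.8+
    a = ((r - s) * inv) % p
    b = (r - a * k) % p
    return a, b


pairs = {(a, b): to_rs(a, b) for a in range(1, p) for b in range(p)}
images = set(pairs.values())

# p(p-1) distinct images, none with r == s, and the inverse map round-trips.
print(len(pairs), len(images))  # prints 42 42
assert all(r != s for r, s in images)
assert all(to_ab(*rs) == ab for ab, rs in pairs.items())
```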
    
For a given value of $r$, of the $p-1$ remaining possible values for $s$, the number of values $s$ such that $s \ne r$ and $s \equiv r$ (mod $m$) is at most: <br>
$\bigl\lceil p/m \bigr\rceil - 1 \le ((p+m-1)/m) - 1$ <br>
    $= (p-1)/m$. <br>
The probability that $s$ collides with $r$ when reduced modulo $m$ is therefore at most (($p-1$)$/m$)$/$($p-1$) = $1/m$.<br>
Therefore, for any pair of distinct values $k, l \in \mathbb{Z}_{p}$, Pr{$h_{ab}(k)$ = $h_{ab}(l)$} $\le 1/m$, so $H_{pm}$ is universal by definition.
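
The theorem can also be confirmed by exhaustive enumeration on a small instance. The sketch below uses hypothetical toy parameters `p = 11` and `m = 4`; it checks the counting bound on $s$ used in the proof and then the universality condition for every pair of distinct keys in $\mathbb{Z}_{p}$.

```python
from itertools import combinations
from math import ceil

p, m = 11, 4                 # hypothetical toy parameters: p prime, p > m
family_size = p * (p - 1)    # number of pairs (a, b) with a != 0


def collisions(k, l):
    """Number of pairs (a, b) for which h_ab(k) == h_ab(l)."""
    return sum(
        1
        for a in range(1, p)
        for b in range(p)
        if ((a * k + b) % p) % m == ((a * l + b) % p) % m
    )


# Counting step of the proof: for each r, how many s != r satisfy s = r (mod m)?
worst_s = max(
    sum(1 for s in range(p) if s != r and s % m == r % m) for r in range(p)
)
assert worst_s <= ceil(p / m) - 1 <= (p - 1) / m

# Universality: no pair of distinct keys collides under more than |H| / m functions.
worst = max(collisions(k, l) for k, l in combinations(range(p), 2))
print(worst, "<=", family_size / m)
assert worst <= family_size / m
```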

### Code

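import math
import random


def getPrime(m):
    """Return the smallest prime p with m < p <= 2*m."""
    def isPrime(x):
        # Trial division up to and including floor(sqrt(x)).
        for i in range(2, int(math.sqrt(x)) + 1):
            if x % i == 0:
                return False
        return True

    for p in range(m + 1, 2 * m + 1):
        if isPrime(p):
            return p


class UniversalHashFamily(object):
    def __init__(self, rangeSize):
        self.m = rangeSize
        # NOTE: the universality proof assumes every key is smaller than p;
        # here p is only guaranteed to be larger than m.
        self.p = getPrime(rangeSize)
        self.a = 0
        self.b = 0

    def randomChoose(self):
        # a is drawn from Z*_p = {1, ..., p-1}, b from Z_p = {0, ..., p-1}.
        self.a = a = random.randint(1, self.p - 1)
        self.b = b = random.randint(0, self.p - 1)
        return lambda x: ((a * x + b) % self.p) % self.m

    def __str__(self):
        return "h(x) = ((%d*x + %d) %% %d) %% %d" % (self.a, self.b, self.p, self.m)


def buildUniversalHash(S):
    n = len(S)
    m = 2 * n  # table size: twice the number of keys

    H = UniversalHashFamily(m)
    h = H.randomChoose()
    print(H)
    return h

# test the universal hash
S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
print("S =", S)
buildUniversalHash(S)

Sample output ($a$ and $b$ are drawn at random, so they change from run to run):

S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
h(x) = ((10*x + 15) % 19) % 18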
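
To see a randomly drawn function at work, the keys of `S` can be distributed into a table with chaining. This is a minimal sketch rather than part of the lecture code: the helper `build_chained_table` is hypothetical, the seed is fixed only to make the run repeatable, and `p = 97` is chosen here (instead of the prime just above $m$) so that every key of `S` lies in $\mathbb{Z}_{p}$, as the theorem assumes.

```python
import random


def build_chained_table(keys, m, p, rng):
    """Draw h_ab at random and distribute the keys into m chains."""
    a = rng.randint(1, p - 1)   # a in Z*_p
    b = rng.randint(0, p - 1)   # b in Z_p

    def h(x):
        return ((a * x + b) % p) % m

    table = [[] for _ in range(m)]
    for key in keys:
        table[h(key)].append(key)
    return table, h


S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
m = 2 * len(S)              # same table size as in the test above
p = 97                      # assumption: smallest prime above max(S) = 95
rng = random.Random(2020)   # fixed seed, only for repeatability

table, h = build_chained_table(S, m, p, rng)
for slot, chain in enumerate(table):
    if chain:
        print(slot, chain)
print("longest chain:", max(len(chain) for chain in table))
```

Since $n = 9$ and $m = 18$, the load factor is $\alpha = 0.5$, so the expected length of the chain containing a stored key is below $1.5$; an individual run may still produce a longer chain.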


### References

"Introduction to Algorithms" - Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. 