## Algorithm Design 2019-20 @ Computer Science - Università di Pisa

### Scribes: Chiara Boni, Eleonora Di Gregorio 
### Lecturer: Roberto Grossi 

# Hashing

## Universal Hash Family

### Definitions and goals



The motivation for constructing a $universal$ $hash$ $family$ is that any fixed hash function has a worst case in which all the keys are stored in the same slot of the hash table, which increases the average retrieval time.<br>
This scenario can occur if a malicious adversary, knowing the fixed hash function, chooses the keys to be hashed. <br>
In universal hashing, the function is selected $randomly$, and $independently$ of the keys, from a class of functions. <br>
The strength of this approach lies in the randomized selection: it guarantees that no single input always evokes the worst-case behavior, because the algorithm behaves differently on each execution. <p>
    
A finite set $H$ of hash functions that map a universe $U$ of keys into the range {0, 1, ..., $m - 1$} is called $universal$ if, for each pair of distinct keys $k,l \in U$, the number of hash functions $h \in H$ for which $h(k) = h(l)$ is at most $|H| / m$. <br>
Hence, for a function $h$ chosen uniformly at random from $H$, the probability of a collision between two distinct keys is at most $1/m$: <br>
$P[h(k) = h(l)] \le 1/m$ <p>
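
The definition can be checked by brute force on a toy instance. The sketch below is only an illustration: the helper `is_universal`, the four-key universe and the two families are assumptions made for this example, and each function is stored as a tuple of slot values so that counting collisions is a direct comparison.

```python
from itertools import combinations, product


def is_universal(family, universe, m):
    """Return True if every pair of distinct keys collides under at most |H|/m functions."""
    limit = len(family) / m
    for k, l in combinations(universe, 2):
        # count the functions h in the family with h(k) == h(l)
        collisions = sum(1 for h in family if h[k] == h[l])
        if collisions > limit:
            return False
    return True


universe = range(4)  # toy universe U = {0, 1, 2, 3} (assumption)
m = 2                # two slots (assumption)

# Every function U -> {0, ..., m-1}; h[k] is the slot assigned to key k.
all_functions = list(product(range(m), repeat=len(universe)))

# A family made of one fixed function, h(k) = k mod m.
fixed_only = [tuple(k % m for k in universe)]

print(is_universal(all_functions, universe, m))  # True
print(is_universal(fixed_only, universe, m))     # False
```

The single fixed function fails because keys 0 and 2 always collide, which is exactly the situation an adversary can exploit.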
    
$Load$ $factor$<br>
Let $h$ be a hash function chosen at random from a universal collection and used to store $n$ keys in a table $T$ with $m$ slots, resolving collisions by chaining. If key $k$ is not in the table, then the expected length $E[n_{h(k)}]$ of the list that $k$ hashes to is at most the $load$ $factor$ $\alpha = n / m$. <br>
If $k$ is in the table, then the expected length $E[n_{h(k)}]$ of the list containing $k$ is at most $1 + \alpha$. <p>

$Proof$<br>
For each pair $k$ and $l$ of distinct keys, define the indicator random variable<br>
    $X _{kl}$ = $I${$h(k) = h(l)$}. <br>
By definition of $universal$ $hashing$, a single pair of distinct keys collides with probability at most $1 / m$, that is, $Pr$ {$ h(k) = h(l)$} $\le 1 / m$. <br>
    Therefore $E$[$X _{kl}$] $\le 1 / m$. <p>
 
Next, for each key $k$, define the random variable $Y_{k}$ as the number of keys other than $k$ that hash to the same slot as $k$:<br>
$Y_{k}$ = $\sum_{l \in T, l \ne k}$ $X_{kl}$. <br>
   
Then, <br>
$E$[$Y_{k}$] = $E \biggl[ \sum _{l \in T, l \ne k}$ $X_{kl}$ $\biggr]$ <br><br>
 = $\sum _{l \in T, l \ne k}$  $E$[$X_{kl}$] $\; \; \; \; \; \; (by\;linearity\;of\;expectation) $<br><br>
        $\le $ $\sum _{l \in T, l \ne k} \frac1m$. <p>
    
The last thing to show depends on whether key $k$ is in the table. <br>

- if $k \notin T$, then $n _{h(k)}$ = $Y_{k}$, and $|${$l : l \in T$ and $l \ne k$}$| = n$.<br>
Therefore, $E$[$n_{h(k)}$] = $E$[$Y_{k}$] $\le n / m = \alpha$.<br>
    
- if $k \in T$, then the count $Y_{k}$ does not include $k$ itself, even though $k$ appears in the list $T[h(k)]$. <br>
 So $n_{h(k)}$ = $Y_{k} + 1$ and $|${$l : l \in T$ and $l \ne k$}$| = n - 1$.<br>
 Thus $E$[$n_{h(k)}$] = $E$[$Y_{k}$] $+ 1 \le (n - 1)/m + 1 = 1 + \alpha - 1/m < 1 + \alpha$. <p>
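
Both bounds can be verified exactly on a small instance by averaging over an entire family instead of sampling. This is a hedged sketch: the universe, the table contents and the choice of the family of all functions $U \to$ {0, ..., $m-1$} (which is universal) are assumptions made for the illustration, and `expected_chain_length` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

universe = range(5)  # toy universe U = {0, ..., 4} (assumption)
m = 3                # number of slots (assumption)
T = [0, 1, 2]        # keys stored in the table, so n = 3
alpha = Fraction(len(T), m)

# All functions U -> {0, ..., m-1}; h[k] is the slot of key k.
family = list(product(range(m), repeat=len(universe)))


def expected_chain_length(k):
    """Exact E[n_{h(k)}] when h is drawn uniformly from the family."""
    total = sum(sum(1 for l in T if h[l] == h[k]) for h in family)
    return Fraction(total, len(family))


absent = expected_chain_length(4)   # key 4 is not in T
present = expected_chain_length(0)  # key 0 is in T

print(absent, "<=", alpha)        # 1 <= 1
print(present, "<", 1 + alpha)    # 5/3 < 2
```

For this family every pair collides with probability exactly $1/m$, so the values match $n/m$ and $1 + (n-1)/m$ from the proof.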

### Designing a universal class of hash functions


In order to design a universal class of hash functions, it is necessary to have: <br>
- a prime number $p$, large enough so that every possible key $k$ is in the range $0$ to $p - 1$ <br>
- $\mathbb{Z}_{p}$, which denotes the set {0, 1, ..., $p-1$} <br>
- $\mathbb{Z}^{*}_{p}$, which denotes the set {1, 2, ..., $p-1$} <p>
    
Since the size of the universe of keys is greater than the number of slots in the hash table, we have $p>m$. <br>
It is now possible to define a hash function $h_{ab}$ for any $a \in \mathbb{Z}^{*}_{p}$ and any $b \in \mathbb{Z}_{p}$, using a linear transformation followed by reductions modulo $p$ and then modulo $m$: <br>
$h_{ab}(k)$ = (($ak+b$) mod $p$) mod $m$. <br>
The family of such functions is $H_{pm}$ = {$h_{ab}: a \in \mathbb{Z}^{*}_{p}$ and $b \in \mathbb{Z}_{p}$}. <p>
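
As a small worked example, the sketch below evaluates one member of the family step by step. The helper `make_hash` is a hypothetical name introduced here. The parameters $a = 50$, $b = 88$, $p = 97$, $m = 18$ are the ones that appear in the sample run of the Code section.

```python
def make_hash(a, b, p, m):
    """Return the function h_ab(k) = ((a*k + b) mod p) mod m."""
    if not (1 <= a < p and 0 <= b < p):
        raise ValueError("need a in Z*_p and b in Z_p")
    return lambda k: ((a * k + b) % p) % m


h = make_hash(a=50, b=88, p=97, m=18)

r = (50 * 11 + 88) % 97  # first reduction, modulo p
print(r)                 # 56
print(r % 18, h(11))     # 2 2
```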
    
$Theorem$ <br>
The class of hash functions $H_{pm}$ = {$h_{ab}: a \in \mathbb{Z}^{*}_{p}$ and $b \in \mathbb{Z}_{p}$} is universal. <p>
    
$Proof$ <br>
Take two distinct keys $k$ and $l$ from $\mathbb{Z}_{p}$, so that $k \ne l$, and for a given function $h_{ab}$ let: <br>
$r$ = ($ak+b$) mod $p$ <br>
$s$ = ($al+b$) mod $p$. <p>

The first thing to note is that $r \ne s$. Indeed, $r - s \equiv a(k-l)$ (mod $p$), and because $p$ is prime and both $a$ and ($k-l$) are nonzero modulo $p$, their product is also nonzero modulo $p$. <br>
This implies that, when computing any $h_{ab} \in H_{pm}$, distinct inputs $k$ and $l$ map to distinct values $r$ and $s$ modulo $p$, so there are no collisions at the "mod $p$ level". <p>
    
Moreover, each of the $p(p-1)$ possible choices for the pair ($a,b$), with $a \ne 0$, yields a different resulting pair ($r,s$), with $r \ne s$, because $a$ and $b$ can be recovered from $r$ and $s$: $a = ((r-s)(k-l)^{-1})$ mod $p$ and $b = (r - ak)$ mod $p$. Since there are exactly $p(p-1)$ pairs ($r,s$) with $r \ne s$, there is a one-to-one correspondence between pairs ($a,b$), with $a \ne 0$, and pairs ($r,s$), with $r \ne s$. <br>
Therefore, for any given pair of inputs $k, l$, picking ($a, b$) uniformly at random from $\mathbb{Z}^{*}_{p} \times \mathbb{Z}_{p}$, the resulting pair ($r, s$) is equally likely to be any pair of distinct values modulo $p$. <br>
Thus, the probability that distinct keys $k, l$ collide is equal to the probability that $r \equiv s$ (mod $m$), with $r, s$ randomly chosen as distinct values modulo $p$. <p>
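
The one-to-one correspondence can be observed directly for a small prime. In this sketch the values $p = 7$, $k = 2$ and $l = 5$ are arbitrary choices made for the illustration: it enumerates every ($a,b$) and collects the resulting ($r,s$).

```python
p = 7        # small prime (assumption)
k, l = 2, 5  # any two distinct keys in Z_p (assumption)

pairs = set()
for a in range(1, p):   # a ranges over Z*_p
    for b in range(p):  # b ranges over Z_p
        r = (a * k + b) % p
        s = (a * l + b) % p
        pairs.add((r, s))

# all ordered pairs of distinct values modulo p
expected = {(r, s) for r in range(p) for s in range(p) if r != s}

print(len(pairs), pairs == expected)  # 42 True
```

The $p(p-1) = 42$ choices of ($a,b$) produce 42 different pairs, and together they cover every pair with $r \ne s$.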
    
For a given value of $r$, of the $p-1$ possible remaining values for $s$, the number of values $s$ such that $s \ne r$ and $s \equiv r$ (mod $m$) is at most: <br>
$\lceil p/m \rceil - 1 \le ((p+m-1)/m) - 1 = (p-1)/m$. <br>
The probability that $s$ collides with $r$ when reduced modulo $m$ is therefore at most (($p-1$)$/m)/$($p-1$) = $1/m$.<br>
Therefore, for any pair of distinct values $k, l \in \mathbb{Z}_{p}$, Pr{$h_{ab}$(k) = $h_{ab}$(l)} $\le 1/m$, so $H_{pm}$ is universal by definition.
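
The theorem can also be confirmed exhaustively for small parameters. The following is a sketch under the assumption $p = 11$, $m = 4$ (values chosen only to keep the enumeration tiny), and `max_collisions` is a hypothetical helper. For every pair of distinct keys it counts the functions $h_{ab}$ that make them collide and compares the worst case with $|H_{pm}|/m$.

```python
from itertools import combinations


def max_collisions(p, m):
    """Largest number of (a, b) pairs under which two distinct keys of Z_p collide."""
    worst = 0
    for k, l in combinations(range(p), 2):
        count = sum(
            1
            for a in range(1, p)
            for b in range(p)
            if ((a * k + b) % p) % m == ((a * l + b) % p) % m
        )
        worst = max(worst, count)
    return worst


p, m = 11, 4                 # small prime and table size (assumption)
family_size = p * (p - 1)    # |H_pm| = p(p-1)
worst = max_collisions(p, m)

print(worst, family_size / m)  # 20 27.5
```

Every pair of keys collides under 20 of the 110 functions, below the limit of 27.5 required by the definition.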

### Code

In [1]:
import math
import random


def getPrime( m ):   
    def isPrime (x):
        # include floor(sqrt(x)) itself, otherwise squares of primes (e.g. 25) pass as prime
        for i in range(2, int(math.sqrt(x)) + 1):
            if x % i == 0:
                return False
        return True

    for p in range(m+1, 2*m+1, 1):
        if isPrime(p):
            return p
        
    
class UniversalHashFamily(object):
    def __init__(self, m, rangeSize):
        self.m = m
        self.p = getPrime( rangeSize )
        self.a = 0
        self.b = 0
      
    def randomChoose(self):
        self.a = a = random.randint(1, self.p-1)
        self.b = b = random.randint(0, self.p-1)
        return lambda x: ((a * x + b) % self.p) % self.m

    def __str__(self):
        return "h(x) = (%d*x + %d %% %d) %% %d" % (self.a,self.b,self.p,self.m)


def buildUniversalHash(S):
    n = len(S)
    max_key = max(S)
    m = 2*n
    
    H = UniversalHashFamily(m, max_key)
    h = H.randomChoose()
    print (H)
    T = [None] * m
    for elem in S:
        slot = h(elem)  # avoid shadowing the built-in hash()
        print("h(", elem, ")= ", slot)


# test the universal hash
S = [ 11, 25, 36, 41, 57, 66, 73, 89, 95 ]
print ("S =", S)
buildUniversalHash(S)

S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
h(x) = (50*x + 88 % 97) % 18
h( 11 )=  2
h( 25 )=  5
h( 36 )=  9
h( 41 )=  4
h( 57 )=  10
h( 66 )=  0
h( 73 )=  16
h( 89 )=  4
h( 95 )=  13
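
The run above only prints the slot of each key. A minimal sketch of how those slots could be used to fill a table with chaining is given below. The helpers `build_table` and `lookup` are hypothetical names, not part of the code above, and the parameters are copied from the printed function so that the result is reproducible.

```python
def build_table(keys, h, m):
    """Append each key to the chain of slot h(key) and return the list of chains."""
    table = [[] for _ in range(m)]
    for key in keys:
        table[h(key)].append(key)
    return table


def lookup(table, h, key):
    """Search for key only in the chain of its own slot."""
    return key in table[h(key)]


# parameters copied from the sample output: h(x) = ((50*x + 88) % 97) % 18
a, b, p, m = 50, 88, 97, 18
h = lambda x: ((a * x + b) % p) % m

S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
table = build_table(S, h, m)

print(table[4])              # [41, 89]
print(lookup(table, h, 57))  # True
print(lookup(table, h, 12))  # False
```

Keys 41 and 89 share slot 4, as in the output above, so their chain has length two while every other non-empty chain has length one.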


### Animation
In the previous paragraph we showed the code for obtaining a universal hash family and, in particular, a
universal hash function that maps all the keys of the set S. This animation illustrates how a hash table is actually built.
To run the example and create a new animation, first run the [code in the appendix](#Appendix).

In [None]:
hv = UHashingVisualizer(m=13,p=101)
hv.insert( [1, 3, 4, 40, 61, 12, 36, 99])

h(x) = (36*x+27 % 101) % 13
11
8
5
1
1
3
10
4


<img src="gifs/universal_hashing_0.gif" />
<br>
<img src="https://github.com/Claire-gip/Hashing/blob/master/gifs/universal_hashing_0.gif" />
<br>

### Test

### References


"Introduction to Algorithms" - Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. 

### Appendix

In [5]:
from typing import Any, Union

import imageio
from matplotlib import colors
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt

import seaborn as sns

sns.set()
import numpy as np
import random
import os


class UniversalHashFamily(object):
    def __init__(self, m=13, p=101):
        self.m = m
        self.p = p
        self.a = 0
        self.b = 0

    def randomChoose(self):
        self.a = a = random.randint(1, self.p - 1)
        self.b = b = random.randint(0, self.p - 1)
        return lambda x: ((a * x + b) % self.p) % self.m

    def __str__(self):
        return "h(x) = (%d*x+%d %% %d) %% %d" % (self.a, self.b, self.p, self.m)


class UHashingVisualizer:
    def __init__(self, m=13, p=101):
        self.H = UniversalHashFamily(m,p)
        self.hash_function = self.H.randomChoose()
        print(self.H)

        self.buckets = [[] for i in range(self.H.m)]
        self.minimumCol = -(self.H.p +2)
        self.valueCol = 10
        self.bucketsColor = [[self.minimumCol] for i in range(self.H.m)]

        self._ims = []
        self.fps = 0.5
        self.input_shape = (self.H.m, 1)
        self.title = "Universal Hashing \n" + str(self.H) +"\n"
        self.comparisons = 0
        self.track_operations = True

        self.img_width = self.H.m*2.5
        self.img_height = self.H.m*2.5*2/3

        self.show_title = True
        self.title_fontsize = 40

        self.show_xlable = True
        self.xlable_fontsize = 40

        self.rectangle_color_1 = 'black'
        self.rectangle_color_2 = 'gold'
        self.rectangle_linewidth = 4

        self.colormap = 'Blues'
        self.numbers_color = 'dynamic'
        self.numbers_fontsize= 40
        self.dpi = 150

        self.save_dir = 'gifs/'
        self.custom_save_name = False
        self.save_name = ''

    def visulize_algorithm(self, i_1, i_2, i_3, ec1, ec2, ec3, message=''):

        # reshape the array into its original shape
        # array2 = np.array(self.buckets).reshape(self.input_shape[0], self.input_shape[1])

        # find coordinates for the rectangles
        dx = 0.05
        dy = dx
        index = [(i_1 % self.input_shape[1] + dx, i_1 // self.input_shape[1] + dy),
                 (i_2 % self.input_shape[1] + dx, i_2 // self.input_shape[1] + dy),
                 (i_3 % self.input_shape[1] + dx, i_3 // self.input_shape[1] + dy)]

        if self.numbers_color == 'dynamic':
            annot_kws = {'fontsize': self.numbers_fontsize}
        else:
            annot_kws = {'fontsize': self.numbers_fontsize, 'color': self.numbers_color}

        # set plot figsize
        scale = 0.8
        figsize = (self.img_width,self.img_height)
        fig, ax = plt.subplots(figsize=figsize)

        #set the labels for each cell
        notation = []
        previous = 0
        for i in range(self.H.m):
            for j in range(len(self.bucketsColor[i])):
                if self.bucketsColor[i][j] > 0:
                    notation.append(str(int(self.buckets[i][int(j / 2)])))
                    previous = 1
                else:
                    if j > 0 and previous == 1 :
                        notation.append('-->')
                    else:
                        notation.append('')
                    previous = 0

        notation = np.array(notation).reshape(self.input_shape[0], self.input_shape[1])

        cmap = colors.ListedColormap(['white', 'blue'])

        # create heatmap
        ax = sns.heatmap(self.bucketsColor,
                         annot=notation,
                         annot_kws=annot_kws,
                         fmt="",
                         linewidths=2,
                         cbar=False,
                         #cmap=self.colormap,
                         cmap=cmap,
                         yticklabels=True,
                         xticklabels=False,
                         square=True,
                         vmin=self.minimumCol)

        ax.tick_params(axis="y", labelsize=20, labelrotation=90)

        # plot title
        if self.show_title:
            ax.set_title(self.title, fontsize=self.title_fontsize)


        # plot the operation performed
        if self.show_xlable:
            plt.xlabel(f'\n'+message, fontsize=self.xlable_fontsize)

        plt.setp(ax.get_yticklabels(), rotation=90, ha="right",
                 rotation_mode="anchor")

        if self.track_operations:
            # draw rectangle around tracked cells
            if i_1 != -1:
                ax.add_patch(Rectangle(index[0], 0.9, 0.9, fill=False, edgecolor=ec1, lw=self.rectangle_linewidth))
            if i_2 != -1:
                ax.add_patch(Rectangle(index[1], 0.9, 0.9, fill=False, edgecolor=ec2, lw=self.rectangle_linewidth))
            if i_3 != -1:
                ax.add_patch(Rectangle(index[2], 0.9, 0.9, fill=False, edgecolor=ec3, lw=self.rectangle_linewidth))

        # create and save gifs

        if not os.path.exists('temp/'):
            os.mkdir('temp/')

        img_loc = 'temp/' + 'temp_image_{:d}'.format(self.comparisons + 1) + '.png'
        #pad_inches = 1 if self.input_shape[1] == 1 else 1 / (self.input_shape[1] - 1)
        #plt.savefig(img_loc, bbox_inches='tight', pad_inches=pad_inches, dpi=self.dpi)
        plt.savefig(img_loc,dpi=self.dpi)
        self._ims.append(imageio.imread(img_loc))
        os.remove(img_loc)
        plt.close()

    def insert(self, to_insert):

        for elem in to_insert:
            self.visulize_algorithm(-1, -1, -1,
                                    ec1=self.rectangle_color_2,
                                    ec2=self.rectangle_color_2,
                                    ec3=self.rectangle_color_2,
                                    message='Insertion of '+str(elem))
            hash = self.hash_function(elem)
            print(hash)
            row = 0
            self.buckets[hash].append(elem)
            self.visulize_algorithm(hash * self.input_shape[1], -1, -1,
                                    ec1=self.rectangle_color_1,
                                    ec2=self.rectangle_color_2,
                                    ec3=self.rectangle_color_2,
                                    message='Insertion of ' + str(elem) + '\n h(' + str(elem) + ')=' + str(hash))

            index = (len(self.buckets[hash]) - 1) * 2
            if index < len(self.buckets[hash]):
                self.bucketsColor[hash][index] = self.valueCol
            else:
                for j in range(self.H.m):
                    self.bucketsColor[j].append(self.minimumCol)
                self.input_shape = (self.input_shape[0], self.input_shape[1] + 1)
                self.visulize_algorithm(hash * self.input_shape[1] + index-1, -1, -1,
                                        ec1=self.rectangle_color_1,
                                        ec2=self.rectangle_color_2,
                                        ec3=self.rectangle_color_2,
                                        message= 'Insertion of '+str(elem)+'\n h('+str(elem)+')='+str(hash))
                for j in range(self.H.m):
                    self.bucketsColor[j].append(self.minimumCol)
                self.input_shape = (self.input_shape[0], self.input_shape[1] + 1)
                self.bucketsColor[hash][index] = self.valueCol
            self.visulize_algorithm(hash * self.input_shape[1] + index, -1, -1,
                                    ec1=self.rectangle_color_1,
                                    ec2=self.rectangle_color_2,
                                    ec3=self.rectangle_color_2,
                                    message= 'Insertion of '+str(elem)+'\n h('+str(elem)+')='+str(hash))

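        # make sure the output directory exists before writing the GIF
        os.makedirs(self.save_dir, exist_ok=True)
        if self.custom_save_name: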
            imageio.mimsave(f'{self.save_dir}{self.save_name}.gif', self._ims, fps=self.fps)
        else:
            imageio.mimsave(f'{self.save_dir}universal_hashing_{self.comparisons}.gif', self._ims,
                            fps=self.fps)

        self._ims = []

