## Algorithm Design 2019-20 @ Computer Science - Università di Pisa

### Scribes: Chiara Boni, Eleonora Di Gregorio
### Lecturer: Roberto Grossi

# Hashing

## Perfect Hashing

### Definitions and goals



Hashing is often a good choice for its excellent average-case performance, but what if the set of keys we want to store
is static? Having a $static$ $set$ $of$ $keys$ means that once the keys are stored in a table, the set never changes. This
happens in several real-world applications: the set of reserved words in a programming language, or the set of file names on
a CD-ROM. In this case hashing can also provide excellent worst-case performance. <br>

A hashing technique is called $perfect$ $hashing$ if $O(1)$ memory accesses are required to perform a search in the worst case. <br>
This means that given a subset $S$ of the universe of keys $U$, hashing is called perfect for $S$ if and only if no collisions occur.



$$ S \subseteq U, \forall	k_1,k_2 \in S, k_1 \ne k_2 \iff h(k_1) \ne h(k_2) $$
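
The definition can be checked directly on a small example. The sketch below is not part of the lecture code: `is_perfect` is a hypothetical helper that tests whether a function `h` produces no collisions on `S`, and the two modular hash functions are arbitrary illustrative choices.

```python
def is_perfect(h, S):
    """Return True if h maps the distinct keys of S to distinct slots."""
    slots = [h(k) for k in set(S)]
    return len(slots) == len(set(slots))


S = [11, 25, 36, 41]
# x % 13 sends the keys to slots 11, 12, 10, 2: no collisions
print(is_perfect(lambda x: x % 13, S))
# x % 7 sends both 11 and 25 to slot 4: a collision
print(is_perfect(lambda x: x % 7, S))
```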

### Designing a Perfect Hashing Scheme


To create a perfect hashing scheme, two levels are needed, with universal hashing at each level. The levels are built as follows:
- Create a table with $m$ slots, where $m$ is equal to $n$ or linear in $n$, the cardinality of $S$. <br>
    Hash the $n$ keys into the $m$ slots using a hash function $h$ selected from a family of universal hash functions $H$.
- Instead of making a linked list of the keys hashing to slot $j$, create a small secondary hash table of $m_j = n_j^2$ buckets, where $n_j$ is the cardinality
    of $S_j =\{x \in S | h(x) = j \}$. The elements of $S_j$ are inserted according to a universal hash function $h_j$ selected for that slot.<br>
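
The two levels also determine how a search works: one evaluation of $h$, one evaluation of $h_j$, and one comparison. The following is a minimal sketch of that idea, not the lecture's implementation. It assumes distinct integer keys smaller than a fixed prime `P = 101`, and the names `random_hash`, `build` and `lookup` are hypothetical.

```python
import random

P = 101  # a prime larger than every key used below (assumption of this sketch)


def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m at random from the family."""
    a = random.randint(1, P - 1)
    b = random.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m


def build(S):
    """First level with m = n slots, second level with n_j^2 slots per bucket."""
    n = len(S)
    h = random_hash(n)
    buckets = [[] for _ in range(n)]
    for key in S:
        buckets[h(key)].append(key)

    second = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:  # redraw h_j until it has no collisions on this bucket
            h_j = random_hash(size) if size else None
            table = [None] * size
            ok = True
            for key in bucket:
                pos = h_j(key)
                if table[pos] is not None:
                    ok = False
                    break
                table[pos] = key
            if ok:
                break
        second.append((h_j, table))
    return h, second


def lookup(h, second, key):
    """Two hash evaluations and one comparison, whatever the key."""
    h_j, table = second[h(key)]
    return bool(table) and table[h_j(key)] == key


S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
h, second = build(S)
print(all(lookup(h, second, k) for k in S))  # every stored key is found
print(lookup(h, second, 12))                 # 12 was never inserted
```

The retry loop relies on the keys being distinct; with duplicate keys no collision-free $h_j$ exists and the loop would not terminate.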

To prove that the hashing is perfect we have to ensure that:
 - each secondary table has no collisions.
 - the amount of memory used overall is linear in the number of elements to be stored.

##### First Claim
Suppose that we store $n_j$ keys in a hash table of size $m_j = n_j^2$ using a hash function $h_j$ randomly chosen from a universal class of hash functions.
Then the probability that there are any collisions is less than $1/2$. <br>

$$ Pr(h_j\: is\: perfect\: for\:  S_j) \ge \frac{1}{2}$$

$Proof$ <br>
Considering that $h_j$ is perfect for $S_j$ if no collisions occur, define a random variable $X$ that counts the number of collisions as follows:

$$
X_{kl} =\begin{cases}
 1& \text{if } k \ne l \text{ and } h_j(k)=h_j(l)\\
 0& \text{otherwise}
\end{cases}
$$
<br>
$$
X =  \sum_{k < l} X_{kl} = \text{number of collisions}
$$

Recall that $h_j$ is drawn from a universal family, so the probability that any given pair of distinct keys collides is at most $\frac{1}{m_j}$.

Since $Pr[X = 0] = 1 - Pr[X \ge 1]$, it suffices to show that $Pr[X \ge 1] < \frac{1}{2}$:
<br>

$$
\begin{aligned}
Pr[X \ge 1] &\le E[X] && \text{(Markov's inequality)}\\
&= \sum_{k < l} E[X_{kl}] && \text{(linearity of expectation)}\\
&= \sum_{k < l} Pr[X_{kl} = 1]\\
&\le \sum_{k < l} \frac{1}{m_j} && \text{(universality, for each pair in the bucket)}\\
&= \binom{n_j}{2}\frac{1}{m_j}\\
&= \frac{n_j(n_j-1)}{2}\cdot\frac{1}{n_j^2} && (m_j = n_j^2)\\
&< \frac{1}{2}
\end{aligned}
$$
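
The bound can also be observed empirically. The sketch below is an illustration under stated assumptions, not part of the lecture code: it fixes a prime `P = 10007`, a bucket of `n_j = 10` randomly sampled keys, and the family $h(x) = ((ax+b) \bmod P) \bmod m_j$, then counts how often a random draw is collision-free on a table of size $n_j^2$.

```python
import random

random.seed(0)  # fixed seed so that the run is reproducible
P = 10007       # a prime larger than every key (assumption of this sketch)


def collision_free(keys, m):
    """Draw one h from the family and report whether it is injective on keys."""
    a = random.randint(1, P - 1)
    b = random.randint(0, P - 1)
    slots = {((a * k + b) % P) % m for k in keys}
    return len(slots) == len(keys)


n_j = 10
keys = random.sample(range(P), n_j)  # distinct keys of one bucket
trials = 2000
successes = sum(collision_free(keys, n_j ** 2) for _ in range(trials))
print(f"fraction of perfect draws: {successes / trials:.3f}")
```

Since more than half of the draws are expected to be perfect, retrying until a collision-free $h_j$ is found takes fewer than two attempts on average.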


##### Second Claim
Suppose that we store $n$ keys in a hash table of size $m = n$ using a hash function $h$ randomly chosen from a universal class of hash functions. Then the
probability that the total number of second-level buckets, $\sum_{j=0}^{m-1}n_j^2$, is at most $4n$ is greater than $1/2$.<br>

$$Pr_{h \in H}(\sum_{j=0}^{m-1}n_j^2\le 4n) \gt \frac{1}{2}$$

$Proof$<br>

$$
\begin{aligned}
E\left[\sum_{j=0}^{m-1} n_j^2\right] &= E\left[\sum_{j=0}^{m-1} \left(n_j +2\binom{n_j}{2}\right)\right] && (a^2=a+2\binom{a}{2})\\
&= E\left[\sum_{j=0}^{m-1}n_j\right] +2E\left[\sum_{j=0}^{m-1}\binom{n_j}{2}\right] && \text{(linearity of expectation)}\\
&= n +2E\left[\sum_{j=0}^{m-1}\binom{n_j}{2}\right] && (\textstyle\sum_{j} n_j = n)\\
&\le n + 2\binom{n}{2} \frac{1}{m} && \text{(the sum counts the colliding pairs; each pair collides with probability at most } 1/m)\\
&= n+ 2\frac{n(n-1)}{2m}\\
&= n +n-1 && (m=n)\\
&< 2n
\end{aligned}
$$

Using Markov's Inequality:<br>
$Pr[\sum_{j=0}^{m-1}n_j^2 \ge 4n] \le \frac{E[\sum_{j=0}^{m-1}n_j^2]}{4n}<\frac{2n}{4n} = \frac{1}{2}$

Taking the complement, $Pr[\sum_{j=0}^{m-1}n_j^2 < 4n] > \frac{1}{2}$, which implies the claim.
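
The second claim can be checked with a similar experiment. The sketch below again assumes a prime `P = 10007` and `n = 50` randomly sampled keys (both illustrative choices, not taken from the lecture), hashes them into $m = n$ slots many times, and records the second-level space $\sum_j n_j^2$ of each draw.

```python
import random

random.seed(1)  # fixed seed so that the run is reproducible
P = 10007       # a prime larger than every key (assumption of this sketch)


def second_level_space(keys):
    """Draw one first-level h with m = n and return the sum of n_j^2."""
    n = len(keys)
    a = random.randint(1, P - 1)
    b = random.randint(0, P - 1)
    counts = [0] * n
    for k in keys:
        counts[((a * k + b) % P) % n] += 1
    return sum(c * c for c in counts)


n = 50
keys = random.sample(range(P), n)
trials = 2000
spaces = [second_level_space(keys) for _ in range(trials)]

mean_space = sum(spaces) / trials
frac_small = sum(s < 4 * n for s in spaces) / trials
print(f"average space: {mean_space:.1f} (analysis: below 2n = {2 * n})")
print(f"fraction of draws with space < 4n: {frac_small:.3f}")
```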

### Markov's Inequality


To prove the previous claims we used $Markov's\; inequality$, which gives an upper bound on the probability that a non-negative random variable is greater than or equal to some positive constant.
Markov's inequality relates probabilities to expectations, and provides bounds for the cumulative distribution function of a random variable.

##### Statement
If $X$ is a non-negative random variable and $a > 0$, then the probability that $X$ is at least $a$ is at most the expectation of $X$ divided by $a$: <br>
$$ Pr(X \ge a) \le \frac{E(X)}{a}$$

$Proof$<br>
Consider the random indicator variable $I$:<br>
$$
I =\begin{cases}
 1& \text{if } X \ge a\\
 0& \text{otherwise}
\end{cases}
$$

Then, given $a > 0$, we can observe that:<br>

$a\cdot I \le X $.<br>

Indeed, if $X < a$ then $I = 0$ and $a\cdot I=0 \le X$ because $X$ is non-negative; otherwise $X \ge a$ and $I = 1$, so $a\cdot I = a \le X$. Since expectation is monotone, taking the expectation of both sides of an inequality cannot reverse it. Therefore,<br>

$E(a\cdot I)\le E(X)$.<br>

Now, using linearity of expectations, the left side of this inequality is the same as<br>

$aE(I)= a(1\cdot Pr(X \ge a)+0\cdot Pr(X < a)) = aPr(X\ge a)$.<br>

Thus we have<br>

$aPr(X \ge a) \le E(X)$

and since $a > 0$, we can divide both sides by $a$.
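
As a small sanity check, the inequality can be verified exactly on a concrete distribution. The example below is an illustrative choice, not part of the lecture: $X$ is the outcome of a fair six-sided die, and exact rational arithmetic is used so that no rounding is involved.

```python
from fractions import Fraction

# X is the outcome of a fair six-sided die (illustrative choice)
outcomes = range(1, 7)
prob = Fraction(1, 6)
expectation = sum(x * prob for x in outcomes)  # E(X) = 7/2

checks = []
for a in range(1, 8):
    tail = sum(prob for x in outcomes if x >= a)  # Pr(X >= a)
    bound = expectation / a                       # E(X) / a
    checks.append(tail <= bound)
    print(f"a={a}: Pr(X>=a) = {tail}, E(X)/a = {bound}")

print(all(checks))  # Markov's inequality holds for every a tested
```

The bound is loose for small $a$ (for $a = 1$ it exceeds 1) and becomes informative only when $a$ is larger than $E(X)$.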

### Code

```python
import math
import random


def getPrime(m):
    """Naive search for a prime in [m+1, 2m]."""
    def isPrime(x):
        # the range must include floor(sqrt(x)), otherwise 9, 25, 49, ... pass as primes
        for i in range(2, int(math.sqrt(x)) + 1):
            if x % i == 0:
                return False
        return True

    for p in range(m + 1, 2 * m + 1):
        if isPrime(p):
            return p


class UniversalHashFamily(object):
    """Family h(x) = ((a*x + b) mod p) mod m, with p a prime larger than every key."""

    def __init__(self, m, rangeSize):
        self.m = m
        self.p = getPrime(rangeSize)
        self.a = 0
        self.b = 0

    def randomChoose(self):
        self.a = a = random.randint(1, self.p - 1)
        self.b = b = random.randint(0, self.p - 1)
        return lambda x: ((a * x + b) % self.p) % self.m

    def __str__(self):
        return "h(x) = (%d*x+%d %% %d) %% %d" % (self.a, self.b, self.p, self.m)


def buildPerfectHash(S):
    n = len(S)
    max_key = max(S)
    m = 2 * n  # first-level size, linear in n

    H = UniversalHashFamily(m, max_key)
    h = H.randomChoose()
    print(H)

    buckets = [[] for i in range(m)]
    bucketSize = [0] * m
    bucketHash = [None] * m
    bucketTable = [[] for i in range(m)]

    # first level: distribute the keys into the m buckets
    for i in range(n):
        buckets[h(S[i])] += [S[i]]
        bucketSize[h(S[i])] += 1
    print("buckets =", buckets)
    print("bucket sizes = ", bucketSize)

    # second level: a table of size n_j^2 for each non-empty bucket
    for i in range(m):
        rehashing = True
        while rehashing:
            rehashing = False
            if bucketSize[i] > 0:
                max_key = max(buckets[i])
                F = UniversalHashFamily(bucketSize[i] * bucketSize[i], max_key)
                g = bucketHash[i] = F.randomChoose()
                t = bucketTable[i] = [None] * (bucketSize[i] * bucketSize[i])
                for j in range(bucketSize[i]):
                    key = buckets[i][j]
                    if t[g(key)] is not None:  # collision: a new g is drawn for this bucket
                        rehashing = True
                        print("Collision detected: rerun!")
                    t[g(key)] = key
                print("bucket table =", t, "where", F)
    print("total table space", sum([bucketSize[i] * bucketSize[i] for i in range(m)]))


# test the perfect hash
S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
print("S =", S)
buildPerfectHash(S)
```

```text
S = [11, 25, 36, 41, 57, 66, 73, 89, 95]
h(x) = (93*x+31 % 97) % 18
buckets = [[], [], [], [95], [57, 66], [], [], [41], [], [36, 89], [25], [], [11, 73], [], [], [], [], []]
bucket sizes =  [0, 0, 0, 1, 2, 0, 0, 1, 0, 2, 1, 0, 2, 0, 0, 0, 0, 0]
bucket table = [95] where h(x) = (54*x+5 % 97) % 1
Collision detected: rerun!
bucket table = [None, None, 66, None] where h(x) = (35*x+65 % 67) % 4
bucket table = [None, 66, None, 57] where h(x) = (26*x+55 % 67) % 4
bucket table = [41] where h(x) = (5*x+21 % 43) % 1
bucket table = [None, 89, None, 36] where h(x) = (13*x+48 % 97) % 4
bucket table = [25] where h(x) = (6*x+20 % 29) % 1
bucket table = [None, 73, 11, None] where h(x) = (73*x+68 % 79) % 4
total table space 15
```


### Animation
In the previous section we showed the code to obtain a perfect hashing scheme; the animation below illustrates the process.
If you want to run the example and create a new animation, remember that the [code in the appendix](#Appendix) must be run first.

```python
hv = PHashingVisualizer([1, 3, 4, 40, 61, 12, 36, 99])
hv.build()
```

```text
h(x) = (23*x+44 % 101) % 8
```


![](https://github.com/Claire-gip/Hashing/blob/master/perfect_hashing_0.gif?raw=true)


### References


Cormen T.H., Leiserson C.E., Rivest R.L., Stein C. (2009). $Hash$ $Tables$ (Chapter 11). $Introduction$ $to$ $Algorithms$ (3rd ed.). MIT Press.

### Appendix

In [10]:
import math
import imageio
from matplotlib import colors
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt

import seaborn as sns

sns.set()
import numpy as np
import random
import os


def getPrime(m):  # naive method to find a prime in [m+1, 2m]
    def isPrime(x):
        for i in range(2, int(math.sqrt(x)) + 1):  # include floor(sqrt(x)), else 9, 25, ... pass as primes
            if x % i == 0:
                return False
        return True

    for p in range(m + 1, 2 * m + 1):
        if isPrime(p):
            return p


class UniversalHashFamily(object):
    def __init__(self, m=13, p=101):
        self.m = m
        self.p = p
        self.a = 0
        self.b = 0

    def randomChoose(self):
        self.a = a = random.randint(1, self.p - 1)
        self.b = b = random.randint(0, self.p - 1)
        return lambda x: ((a * x + b) % self.p) % self.m

    def __str__(self):
        return "h(x) = (%d*x+%d %% %d) %% %d" % (self.a, self.b, self.p, self.m)


class PHashingVisualizer:
    def __init__(self, S):
        self.S = S
        self.H = UniversalHashFamily(m=len(S), p=getPrime(max(S)))
        self.hash_function = self.H.randomChoose()
        print(self.H)

        self.buckets = [[] for i in range(self.H.m)]
        self.bucketSize = [0] * self.H.m
        self.bucketHash = [None] * self.H.m
        self.bucketTable = [[] for i in range(self.H.m)]
        self.minimumCol = -(self.H.p + 2)
        self.valueCol = 10
        self.mediumCol = int((self.minimumCol + self.valueCol) / 2)
        self.bucketsColor = [[self.minimumCol] for i in range(self.H.m)]

        self._ims = []
        self.fps = 0.5
        self.input_shape = (self.H.m, 1)
        self.title = "Perfect Hashing of\n" + str(self.S) + "\n"
        self.comparisons = 0
        self.track_operations = True
        self.second_level = False

        self.img_width = self.H.m * 3
        self.img_height = self.H.m * 3 * 2 / 3

        self.show_title = True
        self.title_fontsize = 40

        self.show_xlable = True
        self.xlable_fontsize = 40

        self.rectangle_color_1 = 'black'
        self.rectangle_color_2 = 'gold'
        self.rectangle_linewidth = 4

        self.colormap = colors.ListedColormap(['white', 'cyan', 'blue'])
        self.numbers_color = 'dynamic'
        self.numbers_fontsize = 40
        self.dpi = 150

        self.save_dir = 'gifs/'
        self.custom_save_name = False
        self.save_name = ''

    def visulize_algorithm(self, i_1, i_2, i_3, ec1, ec2, ec3, message=''):

        # find coordinates for the rectangles
        dx = 0.05
        dy = dx
        index = [(i_1 % self.input_shape[1] + dx, i_1 // self.input_shape[1] + dy),
                 (i_2 % self.input_shape[1] + dx, i_2 // self.input_shape[1] + dy),
                 (i_3 % self.input_shape[1] + dx, i_3 // self.input_shape[1] + dy)]

        if self.numbers_color == 'dynamic':
            annot_kws = {'fontsize': self.numbers_fontsize}
        else:
            annot_kws = {'fontsize': self.numbers_fontsize, 'color': self.numbers_color}

        # set plot figsize
        scale = 0.8
        figsize = (self.img_width, self.img_height)
        fig, ax = plt.subplots(figsize=figsize)

        # set the labels for each cell
        notation = []
        for i in range(self.H.m):
            for j in range(len(self.bucketsColor[i])):
                # TODO: the threshold should be a chosen value rather than zero
                if self.bucketsColor[i][j] >= self.valueCol:
                    notation.append(str(int(self.buckets[i][j])))
                else:
                    if self.bucketsColor[i][j] >= self.mediumCol:
                        if j == 0 or self.buckets[i][j]<0:
                            notation.append('')
                        else:
                            notation.append(str(int(self.buckets[i][j])))
                    else:
                        notation.append('')

        notation = np.array(notation).reshape(self.input_shape[0], self.input_shape[1])

        xtick = []
        if self.second_level:
            xtick.append('')
            xtick.append('a')
            xtick.append('b')
            for i in range(3, self.input_shape[1], 1):
                xtick.append(str(i-3))

        # create heatmap
        ax = sns.heatmap(self.bucketsColor,
                         annot=notation,
                         annot_kws=annot_kws,
                         fmt="",
                         linewidths=2,
                         cbar=False,
                         cmap=self.colormap,
                         yticklabels=True,
                         xticklabels=xtick,
                         square=True,
                         vmin=self.minimumCol)

        ax.tick_params(axis="y", labelsize=20)
        ax.tick_params(axis="x", labelsize=20)

        # plot title
        if self.show_title:
            ax.set_title(self.title, fontsize=self.title_fontsize)

        # plot the operation performed
        if self.show_xlable:
            plt.xlabel(f'\n' + message, fontsize=self.xlable_fontsize)

        plt.setp(ax.get_yticklabels(), rotation=90, ha="right",
                 rotation_mode="anchor")

        if self.track_operations:
            # draw rectangle around tracked cells
            if i_1 != -1:
                ax.add_patch(Rectangle(index[0], 0.9, 0.9, fill=False, edgecolor=ec1, lw=self.rectangle_linewidth))
            if i_2 != -1:
                ax.add_patch(Rectangle(index[1], 0.9, 0.9, fill=False, edgecolor=ec2, lw=self.rectangle_linewidth))
            if i_3 != -1:
                ax.add_patch(Rectangle(index[2], 0.9, 0.9, fill=False, edgecolor=ec3, lw=self.rectangle_linewidth))

        # create and save gifs

        if not os.path.exists('temp/'):
            os.mkdir('temp/')

        img_loc = 'temp/' + 'temp_image_{:d}'.format(self.comparisons + 1) + '.png'
        # pad_inches = 1 if self.input_shape[1] == 1 else 1 / (self.input_shape[1] - 1)
        # plt.savefig(img_loc, bbox_inches='tight', pad_inches=pad_inches, dpi=self.dpi)
        plt.savefig(img_loc, dpi=self.dpi)
        self._ims.append(imageio.imread(img_loc))
        os.remove(img_loc)
        plt.close()

    def build(self):
        n = len(self.S)
        for i in range(n):
            hash = self.hash_function(self.S[i])
            self.buckets[hash] += [self.S[i]]
            self.bucketSize[hash] += 1
            if self.bucketSize[hash] <= len(self.bucketsColor[hash]):
                j = 0
                while self.bucketsColor[hash][j] == self.valueCol:
                    j += 1
                self.bucketsColor[hash][j] = self.valueCol
            else:
                for elem in self.bucketsColor:
                    elem += [self.minimumCol]
                self.input_shape = (self.input_shape[0], self.input_shape[1] + 1)
                self.bucketsColor[hash][-1] = self.valueCol
            # self.input_shape[1]
            self.visulize_algorithm(hash * self.input_shape[1] + self.bucketSize[hash] - 1, -1, -1,
                                    ec1=self.rectangle_color_1,
                                    ec2=self.rectangle_color_2,
                                    ec3=self.rectangle_color_2,
                                    message=str(self.H) + ' \n Insertion of ' + str(self.S[i]) + ': h(' + str(
                                        self.S[i]) + ')=' + str(hash))
        index = 0
        for elem in self.buckets:
            if len(elem) == 0:
                self.bucketsColor[index][0] = self.mediumCol
            self.visulize_algorithm(index * self.input_shape[1], -1, -1,
                                    ec1=self.rectangle_color_1,
                                    ec2=self.rectangle_color_2,
                                    ec3=self.rectangle_color_2,
                                    message='Checking whether to build a second level \n')
            if len(elem) > 1:
                self.second_level = True;
                copy = elem.copy()
                po = pow(len(elem), 2)
                if po + 3 > self.input_shape[1]:
                    while self.input_shape[1] < po + 3:
                        for e in self.bucketsColor:
                            e += [self.minimumCol]
                        self.input_shape = (self.input_shape[0], self.input_shape[1] + 1)
                        self.bucketsColor[index][-1] = self.valueCol
                while len(self.buckets[index]) < po+3:
                    self.buckets[index].append(-1);

                for e in range(len(self.bucketsColor[index])):
                    self.bucketsColor[index][e] = self.mediumCol

                found = False
                while not found:
                    found = True
                    p = getPrime(max(elem));
                    h = UniversalHashFamily(po,p)
                    l = h.randomChoose()
                    self.buckets[index][0] = -1
                    self.buckets[index][1] = h.a
                    self.buckets[index][2] = h.b
                    for e in copy:
                        pos = l(e)+3
                        if self.buckets[index][pos] == -1:
                            self.buckets[index][pos] = e
                            self.bucketsColor[index][pos] = self.valueCol
                        else:
                            found = False
                            for i in range(3,len(self.buckets[index]), 1):
                                self.buckets[index][i] = -1
                                self.bucketsColor[index][i] = self.mediumCol
                            break
                self.visulize_algorithm(index * self.input_shape[1], -1, -1,
                                        ec1=self.rectangle_color_1,
                                        ec2=self.rectangle_color_2,
                                        ec3=self.rectangle_color_2,
                                        message='Second level of hashing built\n')
            index += 1



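        # make sure the output directory exists before saving the gif
        os.makedirs(self.save_dir, exist_ok=True)
        if self.custom_save_name: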
            imageio.mimsave(f'{self.save_dir}{self.save_name}.gif', self._ims, fps=self.fps)
        else:
            imageio.mimsave(f'{self.save_dir}perfect_hashing_{self.comparisons}.gif', self._ims,
                            fps=self.fps)

        self._ims = []


