# 10.3 Hash Tables

If we put merge sort together with binary search, we have a nice way to search lists.

We can still ask: is logarithmic time **the best** that we can do for search when we are willing to do some <font color="blue">preprocessing</font>?

### The type <font color="blue">dict</font>, introduced in Chapter 5

* Dictionaries use a technique called <b>hashing</b> to do <b>the lookup in time</b> that is <b>nearly independent of the size of the dictionary</b>.

In [None]:
import timeit

query_lst = [-600,-60,-6,-70,-6,0,6,70,6,60,600]

lst = []
dic = {}

# search space :lst,dic
size=10000
for i in range(size):
    lst.append(i)
    dic[i] = i 

ls="""
for v in query_lst:
    if v in lst:
        continue
"""
   
t_querylst=timeit.Timer(stmt=ls,globals=globals())
lst_time=t_querylst.timeit(10)
print('Linear search time : ',lst_time/10)

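# NOTE: search(L, e) is not defined in this cell; it is assumed to be the
# binary search function from the earlier part of this chapter.
bls="""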
for v in query_lst:
    if search(lst, v):
       continue
"""
t_querylst=timeit.Timer(stmt=bls,globals=globals())
bst_time=t_querylst.timeit(10)
print('Binary search time : ',bst_time/10)

ds="""
for v in query_lst:
    if v in dic:
        continue
"""
t_querydic=timeit.Timer(stmt=ds,globals=globals())
dict_time=t_querydic.timeit(10)
print('dict search time : ',dict_time/10)

### A hash table 

The basic idea behind **a hash table** is

* **convert the key to an integer, and then use that integer to index into a list**

  which can be done in constant time.

  In principle, values of any immutable type can be easily converted to an integer through **Hash functions**.
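
As a minimal sketch of this idea, the snippet below maps a key to a bucket index using Python's built-in `hash` and the remainder operator. The helper name `bucket_index` and the table size of 8 are assumptions chosen for illustration; they are not part of the `intDict` class developed later in this section.

```python
def bucket_index(key, num_buckets):
    """Map a hashable key to an index in range(num_buckets)."""
    # hash() turns the key into an integer; % folds it into the table size.
    return hash(key) % num_buckets


num_buckets = 8
table = [None] * num_buckets        # one slot per index, ignoring collisions for now

for key in (42, 'spam', (1, 2)):
    i = bucket_index(key, num_buckets)
    table[i] = key                  # constant-time list indexing
    print(repr(key), '->', i)
```

String hashes are randomized per interpreter run by default, so the index printed for `'spam'` can change between runs. In this simplified table a collision would also overwrite an earlier key, which is the problem the hash buckets below address.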


### Hash functions

**Hash functions** can be used to convert 

* <b>a large space of keys</b>  to  <b>a smaller space of integer indices</b>.

Since the space of possible outputs is **smaller** than the space of possible inputs, a hash function is a <b>many-to-one mapping</b>: multiple different inputs may be mapped to the same output. 

When two inputs are mapped to the same output, it is called a <b>collision</b>. 

A good hash function produces <b>a uniform distribution</b>: every output in the range is equally probable, which <b>minimizes the probability of collisions</b>.

**Designing good hash functions** is surprisingly challenging. The problem is that one wants the outputs to be uniformly distributed given the expected distribution of inputs.
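
To see why the input distribution matters, the sketch below counts how many keys land in each bucket when the hash function is simply `key % num_buckets`. The helper `bucket_loads` and the key set (multiples of 10) are assumptions made for this illustration.

```python
from collections import Counter


def bucket_loads(keys, num_buckets):
    """Count how many keys fall into each bucket under key % num_buckets."""
    return Counter(key % num_buckets for key in keys)


keys = range(0, 200, 10)            # 20 keys, all multiples of 10

# With 10 buckets every key collides in bucket 0.
print(sorted(bucket_loads(keys, 10).items()))

# With 7 buckets the same keys spread over all 7 buckets.
print(sorted(bucket_loads(keys, 7).items()))
```

The same hash function is poor for one set of keys and reasonable for another, which is why a good choice depends on the expected inputs.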

## Hash buckets

**class intDict(object)**  uses a simple hash function 

```python
# i%j returns the remainder when the integer i is divided by the integer j 

hashBucket = self.buckets[dictKey%self.numBuckets]
``` 

to implement a dictionary with integers as keys.

The basic idea is to represent **an instance of class intDict** by a list of <b>hash buckets</b>, where **each bucket** is a list of **key/value** pairs. 

<b>By making each bucket a list, we handle collisions by storing all of the values that hash to the same bucket in the list</b>. 

The <b>hash table</b> works as follows: 

*  **1 def __init__(self, numBuckets):**

   The instance variable <b>buckets</b> is initialized to 

   <b>a list</b> of  <b>numBuckets</b> <b>empty lists</b>.

```python
        self.buckets = []
        self.numBuckets = numBuckets
        for i in range(numBuckets):
            self.buckets.append([]) 
```
   
* **2 def addEntry(self, dictKey, dictVal):**

   To store or look up an entry with key **dictKey**, 

```python   
        hashBucket = self.buckets[dictKey%self.numBuckets]
        for i in range(len(hashBucket)):
            if hashBucket[i][0] == dictKey:
                hashBucket[i] = (dictKey, dictVal) #if one was found,replace
                return
        hashBucket.append((dictKey, dictVal)) # append a new entry to the bucket if none was found.
```      
   
we use <b>the hash function i%j to convert dictKey into an integer</b>, 
```python  
    dictKey%self.numBuckets
```    
and use that integer to index into buckets 
```python
    hashBucket = self.buckets[dictKey%self.numBuckets]
```
to find the hash bucket associated with **dictKey**.

If <b>a value is to be stored</b>, we search the bucket for an entry with that key:

* if one is found, <b>replace</b> the value in the existing entry;

* if none is found, <b>append</b> a new entry to the bucket.

### Separate chaining

In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the same index. 

![](./img/ds/Hashcollisionbyseparatechaining.jpg)
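
A minimal sketch of separate chaining with linked entries is shown below, as an alternative to the list-based buckets used by `intDict`. The names `Node` and `ChainedIntDict` are hypothetical and exist only for this illustration; new entries are pushed onto the head of a bucket's chain.

```python
class Node:
    """One entry in a bucket's chain (hypothetical helper for this sketch)."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedIntDict:
    """Integer-keyed table whose buckets are singly linked chains."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.heads = [None] * num_buckets   # head node of each chain

    def _find(self, key):
        # Walk the chain of the bucket that key hashes to.
        node = self.heads[key % self.num_buckets]
        while node is not None and node.key != key:
            node = node.next
        return node

    def add_entry(self, key, value):
        node = self._find(key)
        if node is not None:
            node.value = value              # key present: replace the value
            return
        index = key % self.num_buckets
        # Key absent: push a new node onto the head of the chain.
        self.heads[index] = Node(key, value, self.heads[index])

    def get_value(self, key):
        node = self._find(key)
        return None if node is None else node.value


d = ChainedIntDict(5)
d.add_entry(3, 'a')
d.add_entry(8, 'b')     # 8 % 5 == 3, so it collides with key 3
d.add_entry(3, 'c')     # replaces the value stored for key 3
print(d.get_value(3), d.get_value(8), d.get_value(13))
```

Each bucket stays independent: a collision only lengthens the chain of the bucket involved.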

* **3 def getValue(self, dictKey):**

We then search that bucket (which is a list) linearly to see if there is an entry with the key dictKey.

```python 
        hashBucket = self.buckets[dictKey%self.numBuckets]
        for e in hashBucket:
            if e[0] == dictKey: # key
                return e[1]     # value
        return None
```

If we are doing <b>a lookup</b> and there is an entry with the key, we simply return the value stored with that key. 

If there is no entry with that key, we return None. 




In [None]:
#Page 139, Figure 10.6

class intDict(object):
    """A dictionary with integer keys"""
    
    def __init__(self, numBuckets):
        """Create an empty dictionary"""
        
        ## buckets is initialized to a list of numBuckets empty lists.
        
        self.buckets = []
        self.numBuckets = numBuckets
        for i in range(numBuckets):
            self.buckets.append([]) # empty list
            
    def addEntry(self, dictKey, dictVal):
        """Assumes dictKey an int.  Adds an entry."""
        hashBucket = self.buckets[dictKey%self.numBuckets]
        for i in range(len(hashBucket)):
            if hashBucket[i][0] == dictKey:
                hashBucket[i] = (dictKey, dictVal) #if one was found,replace
                return
        hashBucket.append((dictKey, dictVal)) # append a new entry to the bucket if none was found.
        
    def getValue(self, dictKey):
        """Assumes dictKey an int.  Returns entry associated
           with the key dictKey"""
        hashBucket = self.buckets[dictKey%self.numBuckets]
        for e in hashBucket:
            if e[0] == dictKey: # key
                return e[1]     # value 
        return None
    
    def __str__(self):
        result = '{'
        for b in self.buckets:
            for e in b:
                result = result + str(e[0]) + ':' + str(e[1]) + ','
        return result[:-1] + '}' #result[:-1] omits the last comma



The following code first constructs twenty entries, which are later put into an **intDict**. 

The values of the entries are the integers 0 to 19.

The **keys** are chosen at random from <b>integers in the range 0 to 10^5</b> (`random.randint` includes both endpoints).

## Twenty entries

In [None]:
import random #a standard library module
Entrys=[]
NumEntry=20

random.seed(1)

for i in range(NumEntry):
    #choose a random int between 0 and 10**5
    key = random.randint(0, 10**5)
    Entrys.append((key,i))

print('The Entrys (key,i)is:')
for entry in Entrys:
    print(entry)

### Search key->value in <font color="blue">list</font> of Entrys

In [None]:
def searchintListbyKey(Entrys,key):
    value=None
    for item in Entrys:
        if (key==item[0]):
            value=item[1] 
            break
    return value

In [None]:
value=searchintListbyKey(Entrys,Entrys[10][0])
print(Entrys[10][0],value)

In [None]:
%timeit searchintListbyKey(Entrys,Entrys[10][0])

## Put entries into <font color="blue">intDict</font>

* numBuckets =29

In [None]:
numBuckets =29
# numBuckets (29) > number of entries (20); both are far smaller than the key range (0 to 100000)
D = intDict(numBuckets)
for item in Entrys:
    D.addEntry(item[0],item[1])

print('The intDict is:' )
print(D)

print('\n', 'The hash buckets are:')
i=0
for hashBucket in D.buckets:
    print('BucketID',i,'  ', hashBucket)
    i=i+1

We see that many of the hash buckets are **empty**. 

Others contain **one, two, or three tuples**, depending upon <b>the number of collisions</b> that occurred.

In [None]:
def getIntDictByKey(key):
    return D.getValue(key)

In [None]:
value=getIntDictByKey(Entrys[10][0])   
print(Entrys[10][0],value)

In [None]:
%timeit getIntDictByKey(Entrys[10][0])   

### What is the complexity of **getValue**? 

* If there were <b>no collisions</b> it would be <b>O(1)</b>, because each hash bucket would be of length 0 or 1.


* There might, however, be <b>collisions</b>. If everything hashed to **the same bucket**, it would be <b>O(n)</b>, where n is the number of entries in the dictionary, because the code would perform a linear search on that hash bucket.

### By making the <b>hash table large enough</b>, 

we can <b>reduce the number of collisions</b> sufficiently to allow us to treat <b>the complexity as O(1)</b>.
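
The effect of table size on lookup cost can be sketched by counting key comparisons directly. The helper `count_probes` and the key set below are assumptions made for this illustration; the keys are chosen so that a table of 5 buckets hits the worst case.

```python
def count_probes(keys, num_buckets, target):
    """Return the number of key comparisons needed to find target."""
    # Build the buckets as a chained table using key % num_buckets would.
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[key % num_buckets].append(key)

    # Linear search inside the one bucket that target hashes to.
    probes = 0
    for key in buckets[target % num_buckets]:
        probes += 1
        if key == target:
            break
    return probes


keys = list(range(0, 100, 5))       # 20 keys: 0, 5, ..., 95

print(count_probes(keys, 5, 95))    # every key shares bucket 0: 20 comparisons
print(count_probes(keys, 29, 95))   # 95 is alone in its bucket: 1 comparison
```

With 5 buckets the lookup degenerates into a linear search over all 20 entries; with 29 buckets the same key is found after a single comparison.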

### Small hash table, more collisions

* numBuckets=5

In [None]:
import random #a standard library module

# hash table small , more collisions
numBuckets=5
D = intDict(numBuckets) # numBuckets < entries

for item in Entrys:
    D.addEntry(item[0],item[1])

print('The value of the intDict is:')
print(D)

print('\n', 'The buckets are:')
for hashBucket in D.buckets: #violates abstraction barrier
    print('  ', hashBucket)


In [None]:
%timeit getIntDictByKey(Entrys[10][0])

### Large hash table, fewer collisions

* numBuckets=50

In [None]:
import random #a standard library module

# hash table large, fewer collisions
numBucket=50

D = intDict(numBucket) # numBuckets >> entries

for item in Entrys:
    D.addEntry(item[0],item[1])
    
print('The value of the intDict is:')
print(D)

print('\n', 'The buckets are:')
for hashBucket in D.buckets: #violates abstraction barrier
    print('  ', hashBucket)

In [None]:
%timeit getIntDictByKey(Entrys[10][0])

##  Note: 

### hash table: a series of buckets

* **bucket**: a list of key/value pairs, storing all of the entries that hash to the same bucket

* **hashing each element to a bucket**: the bucket address is derived from the key by applying the hash function
    
*  **search:**

    * key -> the hash function -> the address of a bucket in the hash table -> get the value in the bucket

## intDict in C

* intDict.h/c

* mainintDict.c

In [None]:
%%file ./code/ds/intDict.h

#ifndef INTDICTH
#define INTDICTH

typedef struct _node
{
	int key;
	int value;
	struct _node *next;
} Node;

typedef struct _hashtable
{ 
	int   numBuckets;
	Node **buckets; //the linked list stack 
} Hashtable;

// Create hash table
Hashtable *createHash(int numBuckets);

// free hash table
void *freeHash(Hashtable *hTable);

// hash function for int keys
int inthash(int key,int  numBuckets);

// Add Entry to table - keyed by int
void addEntry(Hashtable *hTable, int key,int value);

// Lookup  by int key
Node *searchEntry(Hashtable *hTable, int key);

// Get by int key
int getValue(Hashtable *hTable, int key);

#endif

In [None]:
%%file ./code/ds/intDict.c

#include <stdio.h>
#include <stdlib.h>
#include "intDict.h"


// Create hash table
Hashtable *createHash(int numBuckets)
{
    Hashtable *table=(Hashtable*)malloc(sizeof(Hashtable)); // size of the struct, not of a pointer to it
    if(!table) {
		return NULL;
	}
    
    table->buckets=(Node**)malloc(sizeof(Node*)*numBuckets); // one head pointer per bucket
    if(!table->buckets) {
		free(table);
		return NULL;
	}
	
    table->numBuckets=numBuckets;
	// initialize the head pointer of the bucket stack to NULL
	for(int i=0;i<table->numBuckets;i++)
		table->buckets[i] = NULL;
   
    return table;
}

void *freeHash(Hashtable *hTable)
{
   	Node *b,*p;
	for(int i=0;i<hTable->numBuckets;i++)
	{
       b = hTable->buckets[i];
	   while(b!=NULL)
	   { 
         p=b->next;
	     free(b);
		 b=p;
	   }	 
	}
	free(hTable->buckets);
	free(hTable);
	return NULL; // the function is declared as void *, so return a value explicitly
}

// hash function for int keys
int inthash(int key,int  numBuckets)
{
    return key % numBuckets;
}

// Lookup  by int key
Node *searchEntry(Hashtable *hTable, int key)
{
	Node *p;
	int addr = inthash(key, hTable->numBuckets);
	p = hTable->buckets[addr];

	while(p && p->key !=key)
		p = p->next;

	return p;
}

// Add Entry to table - keyed by int
void addEntry(Hashtable *hTable, int key,int value)
{
	int addr;
	Node *p,*entry;
	p = searchEntry(hTable,key);
	if(p)
	{
		p->value = value; // key already present: replace the value, as the Python addEntry does
		return;
	}
	else
	{   /*
          add a new item on the top of the linked list stack 
          and a pointer to the top element.  
       */
		addr =  inthash(key, hTable->numBuckets);
		entry=(Node*)malloc(sizeof(Node));
		if(!entry) return; // allocation failed: leave the table unchanged
	  	entry->key = key;
		entry->value=value;
		entry->next =hTable->buckets[addr];
		hTable->buckets[addr] = entry;
	}
}

// Get by int
int getValue(Hashtable *hTable, int key)
{   
	Node *p;
	p = searchEntry(hTable,key);
	if (p)
	{
      return p->value;
	}
	return -1; // key not found; -1 is an assumed sentinel (the values stored here are non-negative)
}


In [None]:
%%file ./code/ds/mainintDict.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "intDict.h"

int main()
{
	int numBuckets=5;
    int numEntries=20;
    Hashtable *hTable;
	int *key;
	int *value;
	
    hTable=createHash(numBuckets);
   
	printf("The value of the intDict is:\n");
    printf("(key value)\n");
    
    key=(int*)malloc(sizeof(int)*numEntries);
	value=(int*)malloc(sizeof(int)*numEntries);
	srand(time(NULL));
	for(int i=0;i<numEntries;i++)
	{
	  	key[i] = rand() % 100000;
		value[i]=i;

		addEntry(hTable,key[i],value[i]);
		printf("(%d %d)\n",key[i],value[i]);
	}

    printf("The buckets(the linked list stack) are: \n");
	for(int i=0;i<hTable->numBuckets;i++)
	{
       Node *b,*p;
	   b = hTable->buckets[i];
	   printf("bucket %d :",i);
	   if (b)
	   {
	     for(p=b; p!=NULL; p=p->next)
	        printf(" (%d %d) ",p->key,p->value);
	     printf("\n"); 		
	   }
	   else
	    	printf("\n"); 	   
	}

    printf("Hash search(even):\n");
    printf("(key value) : key -> value:\n");
	for(int i=0;i< numEntries;i++)
	{
      if (i%2==0)
      {
        int val=getValue(hTable,key[i]);
        printf("(%d  %d): -> %d \n",key[i],value[i],val);
      } 
	}
  
    free(key);
    free(value);
    
    freeHash(hTable);
    
  	return 0;
}

In [None]:
!gcc -o ./code/ds/mainintDict ./code/ds/mainintDict.c ./code/ds/intDict.c

In [None]:
!.\code\ds\mainintDict 

## Unordered Map(C++11)

Unordered maps are associative containers that store elements formed by the combination of a key value and a mapped value, and that allow fast retrieval of individual elements based on their keys.

In an unordered_map, the key value is generally used to uniquely identify the element, while the mapped value is an object with the content associated to this key. Types of key and mapped value may differ.

Internally, the elements in the unordered_map are not sorted in any particular order with respect to either their key or mapped values, but are organized into buckets depending on their hash values to allow for fast access to individual elements directly by their key values (with constant time complexity on average).

In [1]:
%%file ./code/ds/demo1_unordered_map.cpp

#include <iostream>
#include <string>
#include <unordered_map>
 
int main()
{
    std::unordered_map<std::string, int> months;
    months["january"] = 31;
    months["february"] = 28;
    months["march"] = 31;
    months["april"] = 30;
    months["may"] = 31;
    months["june"] = 30;
    months["july"] = 31;
    months["august"] = 31;
    months["september"] = 30;
    months["october"] = 31;
    months["november"] = 30;
    months["december"] = 31;
    std::cout << "september -> " << months["september"] << std::endl;
    std::cout << "april     -> " << months["april"] << std::endl;
    std::cout << "december  -> " << months["december"] << std::endl;
    std::cout << "february  -> " << months["february"] << std::endl;
    return 0;
}

Writing ./code/ds/demo1_unordered_map.cpp


In [2]:
!g++ -o ./code/ds/demo1_unordered_map.exe ./code/ds/demo1_unordered_map.cpp

In [3]:
!.\code\ds\demo1_unordered_map 

september -> 30
april     -> 30
december  -> 31
february  -> 28
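
For comparison, roughly the same lookups can be written with the Python `dict` introduced at the start of this section. This is only an illustrative counterpart to the C++ demo; the variable name `months` is reused from it.

```python
# Map month names to their number of days (non-leap year), as in the C++ demo.
months = {
    'january': 31, 'february': 28, 'march': 31, 'april': 30,
    'may': 31, 'june': 30, 'july': 31, 'august': 31,
    'september': 30, 'october': 31, 'november': 30, 'december': 31,
}

# Each lookup hashes the string key to locate its entry.
for name in ('september', 'april', 'december', 'february'):
    print(f'{name:<9} -> {months[name]}')
```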


## Further Reading

* Yan Weimin, Li Dongmei, Wu Weimin. Data Structures (C Language Edition) (数据结构（C语言版）), 2nd ed. Posts & Telecom Press (人民邮电出版社), February 2015

* Mark Allen Weiss. Data Structures and Algorithm Analysis in C

* Hash table https://en.wikipedia.org/wiki/Hash_table