# Hash Tables

## 1 The simple example
The table stores student information. Every student record has a unique StudentID.

In [36]:
%%file ./data/StudentIDInfo.csv
学号,姓名,电话
3118603,张馨尹,188
3118601,付嘉宁,170
3118616,乐辰前,163
3118604,张秋维,190 
3118614,迟洪均,171
3118605,周雪,153
3118623,魏翰泽,155
3118606,温舒馨,133
3118609,马婧仪,130
3118613,汪伊婧,189

Overwriting ./data/StudentIDInfo.csv


```python
StudentIDInfoList=[(id,(name,phone)),...]
```

In [37]:
import  csv
filename="./data/StudentIDInfo.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)
StudentIDInfoList=[]
for line in csvdata:
    id = int(line['学号'])
    name=line['姓名']
    phone=line['电话']
    StudentIDInfoList.append((id,(name,phone)))
csvfile.close()  

In [38]:
for item in   StudentIDInfoList:
    print(item)

(3118603, ('张馨尹', '188'))
(3118601, ('付嘉宁', '170'))
(3118616, ('乐辰前', '163'))
(3118604, ('张秋维', '190 '))
(3118614, ('迟洪均', '171'))
(3118605, ('周雪', '153'))
(3118623, ('魏翰泽', '155'))
(3118606, ('温舒馨', '133'))
(3118609, ('马婧仪', '130'))
(3118613, ('汪伊婧', '189'))


Get the student info for a given StudentID:

In [39]:
CurStudentID=3118616
for item in  StudentIDInfoList:
    if CurStudentID==item[0]:
        print(item[1])       

('乐辰前', '163')


Linear search takes $O(n)$ time, where $n$ is the number of records.

If we put merge sort together with binary search, we have a nice way to search lists. We use merge sort to preprocess the list in $O(n \log n)$ time, and then we use binary search to test whether elements are in the list in $O(\log n)$ time. If we search the list $k$ times, the overall time complexity is $O(n \log n + k \log n)$.
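As a rough sketch of the preprocess-then-search idea above, the snippet below sorts a few records once and then looks keys up with binary search. It uses Python's built-in `sort` (Timsort rather than merge sort, but also $O(n \log n)$) and the standard `bisect` module; the helper name `binary_search` and the three-record sample are chosen here only for illustration.

```python
from bisect import bisect_left

# A few (id, (name, phone)) records taken from the CSV above.
records = [
    (3118603, ("张馨尹", "188")),
    (3118601, ("付嘉宁", "170")),
    (3118616, ("乐辰前", "163")),
]

# Preprocess once: sort by StudentID, O(n log n).
# Note: list.sort is Timsort, used here in place of merge sort.
records.sort(key=lambda rec: rec[0])
ids = [rec[0] for rec in records]

def binary_search(student_id):
    """Return the info for student_id, or None if absent; O(log n) per call."""
    pos = bisect_left(ids, student_id)
    if pos < len(ids) and ids[pos] == student_id:
        return records[pos][1]
    return None

print(binary_search(3118616))  # a stored key
print(binary_search(3118600))  # a key that is not in the list
```

Each lookup halves the remaining range, which is where the $O(\log n)$ cost per search comes from.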

This is good, but we can still **ask**: is logarithmic the best that we can do for search when we are willing to do some preprocessing?

When we introduced the type <font color="blue">dict</font>, we said that dictionaries use a technique called <b>hashing</b> to do <b>the lookup in time</b>

* that is nearly `independent` of the `size` of the dictionary

The basic idea behind hashing is

* **convert the key to an integer, and then use that integer to index into a list**

which can be done in `constant` time.

For example, we can use the remainder `key%ListSize` as the index of the record in the list.

In [40]:
import  csv
filename="./data/StudentIDInfo.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)

# set the size of the store list
ListSize=30;
# the store table 
StudentIDInfoList=[None for i in range(ListSize)]
for line in csvdata:
    id = int(line['学号'])
    name=line['姓名']
    phone=line['电话']
    # convert the key to an integer: index of the list
    Index_StudentIDInfoList= id%ListSize
    # put the record in the index of the list
    StudentIDInfoList[Index_StudentIDInfoList]=(name,phone)
csvfile.close() 

In [41]:
for item in  StudentIDInfoList:
    print(item)

None
None
None
('魏翰泽', '155')
None
None
None
None
None
None
None
('付嘉宁', '170')
None
('张馨尹', '188')
('张秋维', '190 ')
('周雪', '153')
('温舒馨', '133')
None
None
('马婧仪', '130')
None
None
None
('汪伊婧', '189')
('迟洪均', '171')
None
('乐辰前', '163')
None
None
None


Get the student info for a StudentID by computing Index_StudentIDInfoList:

The lookup is done in **constant** time that is nearly `independent` of the `size` of StudentIDInfoList.

The complexity is $O(1)$, provided that no two StudentIDs map to the same index.

In [42]:
CurStudentID=3118616
Index_StudentIDInfoList=CurStudentID%ListSize
StudentIDInfoList[Index_StudentIDInfoList]

('乐辰前', '163')

In [43]:
CurStudentID=3118605
Index_StudentIDInfoList=CurStudentID%ListSize
StudentIDInfoList[Index_StudentIDInfoList]

('周雪', '153')

**Hash function**: any function that can be used to map data of `arbitrary` size to `fixed-size` values.

* `CurStudentID%ListSize`

**Hash value**: the value returned by a hash function.

* `Index_StudentIDInfoList=CurStudentID%ListSize`

**Hash table**: the fixed-size table that the hash values are used to index.

* `StudentIDInfoList=[None for i in range(ListSize)]`

**The process of search:**

* key $\rightarrow$ the hash function $\rightarrow$ an address in the hash table $\rightarrow$ get the value in the hash table
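The search steps listed above can be sketched as a small function. This is only an illustration: unlike the table built earlier, each slot here stores the key together with the value (a change made for this sketch) so that a lookup can tell when a different key happens to map to the same slot. The names `hash_index` and `lookup` are hypothetical.

```python
LIST_SIZE = 30  # same table size as in the example above

def hash_index(key, size=LIST_SIZE):
    """Hash function: map an integer key to a slot index in [0, size)."""
    return key % size

# Hash table: each slot holds None or a (key, value) pair.
table = [None] * LIST_SIZE
for key, value in [(3118616, ("乐辰前", "163")), (3118605, ("周雪", "153"))]:
    table[hash_index(key)] = (key, value)

def lookup(key):
    """key -> hash function -> slot -> value (None if the slot is empty or holds another key)."""
    slot = table[hash_index(key)]
    if slot is not None and slot[0] == key:
        return slot[1]
    return None

print(lookup(3118616))  # a stored key
print(lookup(3118646))  # maps to the same slot as 3118616, but is a different key
```

Without the stored key, the second lookup would silently return the record of another student.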

## 2 Collision 

**Collision**: a situation that occurs when two distinct pieces of data have the same hash value.

For a hash function, if the space of possible outputs is **smaller** than the space of possible inputs,

* the hash function is a `many`-to-`one` mapping.

When different keys are mapped to the same hash value, it is called a <b>collision</b>.

For example, with the simple hash function

* `id%ListSize`

the hash value of a key is the remainder `key%ListSize`.

If

* the number of keys is 10

* the number of possible hash values (ListSize) is 5

then at least two keys must share a hash value, so you will see many collisions!

In [44]:
import  csv
filename="./data/StudentIDInfo.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)

# set the size of the store list
ListSize=5;
# the store table 
for line in csvdata:
    id = int(line['学号'])
    # convert the key to an integer: index of the list
    Index_StudentIDInfoList= id%ListSize
    print(id, Index_StudentIDInfoList)
csvfile.close()  

3118603 3
3118601 1
3118616 1
3118604 4
3118614 4
3118605 0
3118623 3
3118606 1
3118609 4
3118613 3
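To make the collisions easier to see, the keys can be grouped by their hash value. The following is a minimal sketch that hard-codes the ten StudentIDs from the CSV instead of reading the file; the variable names are chosen for illustration.

```python
from collections import defaultdict

student_ids = [3118603, 3118601, 3118616, 3118604, 3118614,
               3118605, 3118623, 3118606, 3118609, 3118613]
LIST_SIZE = 5

# Group the keys by hash value; a group with more than one key is a collision.
groups = defaultdict(list)
for sid in student_ids:
    groups[sid % LIST_SIZE].append(sid)

for index in range(LIST_SIZE):
    keys = groups.get(index, [])
    note = "collision" if len(keys) > 1 else ""
    print(index, keys, note)
```

With 10 keys and only 5 possible hash values, some group must contain at least two keys.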


**Ways to handle collisions in a hash table**

1. Minimize collisions: a good hash function produces a **uniform distribution**, in which every output in the range is equally probable, which `minimizes` the probability of `collisions`.

2. Collision resolution: Separate Chaining, Open Addressing
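Open Addressing is only named in this document, so here is a brief sketch of one common variant, linear probing, for comparison with Separate Chaining. The class `LinearProbingTable` and its method names are hypothetical, and the sketch leaves out deletion and resizing.

```python
class LinearProbingTable:
    """Open addressing with linear probing (illustrative; no deletion or resizing)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # each slot: None or (key, value)

    def _probe(self, key):
        """Yield slot indices starting at key % size, wrapping around once."""
        start = key % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def add(self, key, value):
        for i in self._probe(key):
            # Use the first empty slot, or overwrite the slot that already holds this key.
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:  # empty slot: the key was never stored
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None

t = LinearProbingTable(5)
t.add(3118601, "a")  # 3118601 % 5 == 1
t.add(3118616, "b")  # also hashes to 1, so it moves on to slot 2
print(t.slots)
print(t.get(3118616))
```

All entries live in the table itself, so a collision is resolved by moving to the next free slot instead of growing a per-bucket list.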


## 3 Handling collisions: Separate Chaining


There are different ways through which a collision can be resolved. We will look at a method called **Separate Chaining**, which creates an independent chain, called a `hash bucket`, for all items that have the same hash index.




### 3.1 hash bucket

In this method, `each bucket is independent` and holds a list of the entries with the same index.

![](./img/ds/Hashcollisionbyseparatechaining.jpg)


The basic idea is to represent 

1. **an instance of class intDict** by `a list` of `hash buckets`, where **each bucket** is a list of **key/value** pairs. 

The layout of buckets (schematic):
```python
buckets = [
    [(key, value), ...],  # bucket for hash value 0
    [(key, value), ...],  # bucket for hash value 1
    ...
]
```

```python
def __init__(self, numBuckets):
    """
    The instance variable buckets is initialized to a list of numBuckets empty lists
    """
    self.buckets = []
    self.numBuckets = numBuckets
    for i in range(numBuckets):
        self.buckets.append([])
```

2. By making each bucket a list, we handle collisions by storing all of the values that hash to the same bucket in the list.

```python
def addEntry(self, dictKey, dictVal):
    """
    Store an entry with key dictKey, replacing the value if the key already exists
    """
    hashBucket = self.buckets[dictKey%self.numBuckets] # hash dictKey to select its bucket in self.buckets
    for i in range(len(hashBucket)):
        if hashBucket[i][0] == dictKey: # each item in a bucket is a tuple: (dictKey, dictVal)
            hashBucket[i] = (dictKey, dictVal) # if one was found, replace
            return
    hashBucket.append((dictKey, dictVal)) # append a new entry (dictKey, dictVal) to the bucket if none was found
```      
   
We use <b>the hash function `dictKey%self.numBuckets` to convert dictKey into an integer</b>, and use that integer to index into buckets
```python
    hashBucket = self.buckets[dictKey%self.numBuckets]
```
to find the hash bucket associated with **dictKey**. We then scan the entries of that bucket,
```python
   hashBucket[i]
```
and if <b>a value is to be stored</b>, then

* if one was found: <b>replace</b> the value in the existing entry,

* if none was found: <b>append</b> a new entry to the bucket


3. **getValue(self, dictKey)**

To look up a key, we hash it to find its bucket, then search that bucket (which is a list) linearly to see if there is an entry with the key dictKey.

```python
for e in hashBucket:
    if e[0] == dictKey: # key
        return e[1]     # value
```

If we are doing <b>a lookup</b> and there is an entry with the key, we simply return the value stored with that key. 

If there is no entry with that key, we return None. 




In [4]:
class intDict(object):
    """A dictionary with integer keys"""
    
    def __init__(self, numBuckets):
        """Create an empty dictionary
           buckets is initialized to a list of numBuckets empty lists.
        """
        self.buckets = []
        self.numBuckets = numBuckets
        for i in range(numBuckets):
            self.buckets.append([]) # empty list
            
    def addEntry(self, dictKey, dictVal):
        """Assumes dictKey an int.  Adds an entry."""
        hashBucket = self.buckets[dictKey%self.numBuckets]
        for i in range(len(hashBucket)):
            if hashBucket[i][0] == dictKey:
                hashBucket[i] = (dictKey, dictVal) #if one was found,replace
                return
        hashBucket.append((dictKey, dictVal)) # append a new entry to the bucket if none was found.
        
    def getValue(self, dictKey):
        """Assumes dictKey an int.  Returns entry associated
           with the key dictKey"""
        hashBucket = self.buckets[dictKey%self.numBuckets]
        for e in hashBucket:
            if e[0] == dictKey: # key
                return e[1]     # value 
        return None
    
    def __str__(self):
        entries = []
        for b in self.buckets:
            for e in b:
                entries.append(str(e[0]) + ':' + str(e[1]))
        return '{' + ','.join(entries) + '}' # join also handles an empty dictionary



### 3.2 Example:

The following code constructs an **intDict** with StudentID Info entries. 


StudentIDInfo:10 entries

In [47]:
import  csv
filename="./data/StudentIDInfo.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)
Entrys=[]
for line in csvdata:
    id = int(line['学号'])
    name=line['姓名']
    phone=line['电话']
    Entrys.append((id,(name,phone))) 
csvfile.close()  

Put the entries into <font color="blue">intDict</font>

**A larger hash table: no collisions**

* numBuckets=29

In [48]:
numBuckets =29
# numBuckets 29  >entries 10
D = intDict(numBuckets)
for item in Entrys:
    D.addEntry(item[0],item[1])

print('The intDict is:')
print(D)

print('\n', 'The hash buckets are:')
i=0
for hashBucket in D.buckets:
    print('BucketID',i,'  ', hashBucket)
    i=i+1

The intDict is:
{3118603:('张馨尹', '188'),3118604:('张秋维', '190 '),3118605:('周雪', '153'),3118606:('温舒馨', '133'),3118609:('马婧仪', '130'),3118613:('汪伊婧', '189'),3118614:('迟洪均', '171'),3118616:('乐辰前', '163'),3118623:('魏翰泽', '155'),3118601:('付嘉宁', '170')}

 The hash buckets are:
BucketID 0    []
BucketID 1    [(3118603, ('张馨尹', '188'))]
BucketID 2    [(3118604, ('张秋维', '190 '))]
BucketID 3    [(3118605, ('周雪', '153'))]
BucketID 4    [(3118606, ('温舒馨', '133'))]
BucketID 5    []
BucketID 6    []
BucketID 7    [(3118609, ('马婧仪', '130'))]
BucketID 8    []
BucketID 9    []
BucketID 10    []
BucketID 11    [(3118613, ('汪伊婧', '189'))]
BucketID 12    [(3118614, ('迟洪均', '171'))]
BucketID 13    []
BucketID 14    [(3118616, ('乐辰前', '163'))]
BucketID 15    []
BucketID 16    []
BucketID 17    []
BucketID 18    []
BucketID 19    []
BucketID 20    []
BucketID 21    [(3118623, ('魏翰泽', '155'))]
BucketID 22    []
BucketID 23    []
BucketID 24    []
BucketID 25    []
BucketID 26    []
BucketID 27    []
BucketID 28    [(3118601, ('付嘉宁', '170'))]

We see that many of the hash buckets are **empty**.


In [49]:
def getIntDictByKey(key):
    return D.getValue(key)

In [50]:
key=3118606
thebucket=getIntDictByKey(key)   
print(key,thebucket)

3118606 ('温舒馨', '133')


**A smaller hash table: collisions**

* numBuckets=5

In [51]:
numBuckets=5
# numBuckets 5 <entries 10
D = intDict(numBuckets)
for item in Entrys:
    D.addEntry(item[0],item[1])

print('The intDict is:')
print(D)

print('\n', 'The hash buckets are:')
i=0
for hashBucket in D.buckets:
    print('BucketID',i,'  ', hashBucket)
    i=i+1

The intDict is:
{3118605:('周雪', '153'),3118601:('付嘉宁', '170'),3118616:('乐辰前', '163'),3118606:('温舒馨', '133'),3118603:('张馨尹', '188'),3118623:('魏翰泽', '155'),3118613:('汪伊婧', '189'),3118604:('张秋维', '190 '),3118614:('迟洪均', '171'),3118609:('马婧仪', '130')}

 The hash buckets are:
BucketID 0    [(3118605, ('周雪', '153'))]
BucketID 1    [(3118601, ('付嘉宁', '170')), (3118616, ('乐辰前', '163')), (3118606, ('温舒馨', '133'))]
BucketID 2    []
BucketID 3    [(3118603, ('张馨尹', '188')), (3118623, ('魏翰泽', '155')), (3118613, ('汪伊婧', '189'))]
BucketID 4    [(3118604, ('张秋维', '190 ')), (3118614, ('迟洪均', '171')), (3118609, ('马婧仪', '130'))]


Each hash bucket now holds **zero, one, or three tuples**, depending upon <b>the number of collisions</b> that occurred.

In [52]:
key=3118606
thebucket=getIntDictByKey(key)   
print(key,thebucket)

3118606 ('温舒馨', '133')


### 3.3 The complexity of **getValue**


If there were <b>no collisions</b> it would be <b>O(1)</b>, because each hash bucket would be of length 0 or 1.

There might be <b>collisions</b>. If everything hashed to **the same bucket**,

* it would be <b>O(n)</b>, where n is the number of entries in the dictionary, because the code would perform a linear search on that hash bucket.



By making the <b>hash table large enough</b>, we can <b>reduce the number of collisions</b> sufficiently to allow us to treat <b>the complexity as O(1)</b>.
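The effect of the table size can be checked with a small experiment. The sketch below counts how many of the ten StudentIDs land in each bucket for three bucket counts and reports the longest chain, which bounds the work done by a lookup. The helper `chain_lengths` is hypothetical, and "load factor" is used in its usual sense of entries divided by buckets.

```python
student_ids = [3118603, 3118601, 3118616, 3118604, 3118614,
               3118605, 3118623, 3118606, 3118609, 3118613]

def chain_lengths(keys, num_buckets):
    """Return how many keys land in each bucket under key % num_buckets."""
    lengths = [0] * num_buckets
    for k in keys:
        lengths[k % num_buckets] += 1
    return lengths

for num_buckets in (1, 5, 29):
    lengths = chain_lengths(student_ids, num_buckets)
    load = len(student_ids) / num_buckets
    print(num_buckets, "buckets: load factor", round(load, 2),
          "longest chain", max(lengths))
```

With a single bucket every lookup degenerates into a linear search over all 10 entries, while with 29 buckets no chain is longer than one entry.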

### 3.4 Hash table in Separate Chaining

**a series of buckets**

* **bucket**: a list of key/value pairs that stores all of the entries that hash to the same bucket

* **hashing each element to a bucket**: the bucket address is derived from the key by applying the hash function

* **search:**

    * key -> the hash function -> the address of a bucket in the hash table -> get the value in the bucket
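The same steps also locate an entry that should be removed, an operation that the Python `intDict` above does not provide. The following is a minimal function-based sketch under that assumption; `make_buckets`, `add_entry`, and `del_entry` are hypothetical names and not part of `intDict`.

```python
def make_buckets(num_buckets):
    """Create a list of num_buckets empty buckets."""
    return [[] for _ in range(num_buckets)]

def add_entry(buckets, key, value):
    """Replace the value if key is present, otherwise append a new (key, value) pair."""
    bucket = buckets[key % len(buckets)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)
            return
    bucket.append((key, value))

def del_entry(buckets, key):
    """Remove the entry for key if present; return True if something was removed."""
    bucket = buckets[key % len(buckets)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            del bucket[i]
            return True
    return False

buckets = make_buckets(5)
add_entry(buckets, 3118601, "a")
add_entry(buckets, 3118616, "b")    # collides with 3118601 in bucket 1
print(del_entry(buckets, 3118601))  # the key is present
print(buckets[1])                   # only the other colliding entry remains
print(del_entry(buckets, 3118601))  # the key is already gone
```

Because each bucket is independent, removing one entry leaves the other entries in the same chain untouched.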

## 4 Hash: intDict in C/C++

### 4.1 intDict in C

* intDict.h/c

* mainintDict.c

In [6]:
%%file ./demo/include/intDict.h

#ifndef INTDICTH
#define INTDICTH

typedef struct _node
{
	int key;
	int value;
	struct _node *next;
} Node;

typedef struct _hashtable
{ 
	int   numBuckets;
	Node **buckets; // array of bucket heads; each bucket is a linked list used as a stack
} Hashtable;

// Create hash table
Hashtable *createHash(int numBuckets);

// free hash table
void *freeHash(Hashtable *hTable);

// hash function for int keys
int inthash(int key,int  numBuckets);

// Add Entry to table - keyed by int
void addEntry(Hashtable *hTable, int key,int value);

// Lookup  by int key
Node *searchEntry(Hashtable *hTable, int key);

// Get by int key
int getValue(Hashtable *hTable, int key);

#endif

Overwriting ./demo/include/intDict.h


In [19]:
%%file ./demo/src/intDict.c

#include <stdio.h>
#include <stdlib.h>
#include "intDict.h"


// Create hash table
Hashtable *createHash(int numBuckets)
{
    Hashtable *table=(Hashtable*)malloc(sizeof(Hashtable)); // size of the struct, not of a pointer to it
    if(!table) {
		return NULL;
	}
    
    table->buckets=(Node**)malloc(sizeof(Node*)*numBuckets); // one head pointer per bucket
    if(!table->buckets) {
		free(table);
		return NULL;
	}
	
    table->numBuckets=numBuckets;
	// initialize the head pointer of the bucket stack to NULL
	for(int i=0;i<table->numBuckets;i++)
		table->buckets[i] = NULL;
   
    return table;
}

void *freeHash(Hashtable *hTable)
{
   	Node *b,*p;
	for(int i=0;i<hTable->numBuckets;i++)
	{
       b = hTable->buckets[i];
	   while(b!=NULL)
	   { 
         p=b->next;
	     free(b);
		 b=p;
	   }	 
	}
	free(hTable->buckets);
	free(hTable);
	return NULL; // nothing meaningful to return; matches the declared void* return type
}

// hash function for int keys
int inthash(int key,int  numBuckets)
{
    return key % numBuckets;
}

// Lookup  by int key
Node *searchEntry(Hashtable *hTable, int key)
{
	Node *p;
	int addr = inthash(key, hTable->numBuckets);
	p = hTable->buckets[addr];

	while(p && p->key !=key)
		p = p->next;

	return p;
}

// Add Entry to table - keyed by int
void addEntry(Hashtable *hTable, int key,int value)
{
	int addr;
	Node *p,*entry;
	p = searchEntry(hTable,key);
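	if(p)
	{
		// key already present: replace its value, as the Python intDict does
		p->value = value;
		return;
	}
	else
	{   /*
          push a new node onto the front of the bucket's linked list,
          which acts as a stack, and make the bucket head point to it
       */
		addr =  inthash(key, hTable->numBuckets);
		entry=(Node*)malloc(sizeof(Node));
		if(!entry)
			return; // allocation failed: leave the table unchanged
	  	entry->key = key;
		entry->value=value;
		entry->next =hTable->buckets[addr];
		hTable->buckets[addr] = entry;
	}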
}

// Get by int
int getValue(Hashtable *hTable, int key)
{   
	Node *p;
	p = searchEntry(hTable,key);
	if (p)
	{
      return p->value;
	}
	return -1; // key not found; assumes -1 is never stored as a value
}


Writing ./demo/src/intDict.c


In [7]:
%%file ./demo/src/mainintDict.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "intDict.h"

int main()
{
	int numBuckets=5;
    int numEntries=20;
    Hashtable *hTable;
	int *key;
	int *value;
	
    hTable=createHash(numBuckets);
   
	printf("The value of the intDict is:\n");
    printf("(key value)\n");
    
    key=(int*)malloc(sizeof(int)*numEntries);
	value=(int*)malloc(sizeof(int)*numEntries);
	srand(time(NULL));
	for(int i=0;i<numEntries;i++)
	{
	  	key[i] = rand() % 100000;
		value[i]=i;

		addEntry(hTable,key[i],value[i]);
		printf("(%d %d)\n",key[i],value[i]);
	}

    printf("The buckets(the linked list stack) are: \n");
	for(int i=0;i<hTable->numBuckets;i++)
	{
       Node *b,*p;
	   b = hTable->buckets[i];
	   printf("bucket %d :",i);
	   if (b)
	   {
	     for(p=b; p!=NULL; p=p->next)
	        printf(" (%d %d) ",p->key,p->value);
	     printf("\n"); 		
	   }
	   else
	    	printf("\n"); 	   
	}

    printf("Hash search(even):\n");
    printf("(key value) : key -> value:\n");
	for(int i=0;i< numEntries;i++)
	{
      if (i%2==0)
      {
        int val=getValue(hTable,key[i]);
        printf("(%d  %d): -> %d \n",key[i],value[i],val);
      } 
	}
  
    free(key);
    free(value);
    
    freeHash(hTable);
    
  	return 0;
}

Overwriting ./demo/src/mainintDict.c


In [22]:
!gcc -o ./demo/bin/mainintDict ./demo/src/mainintDict.c ./demo/src/intDict.c -I./demo/include

In [23]:
!.\demo\bin\mainintDict 

The value of the intDict is:
(key value)
(17208 0)
(27832 1)
(4960 2)
(10653 3)
(11576 4)
(24738 5)
(6186 6)
(5867 7)
(6791 8)
(73 9)
(27826 10)
(16405 11)
(26372 12)
(3621 13)
(1897 14)
(18521 15)
(21826 16)
(32137 17)
(7453 18)
(9433 19)
The buckets(the linked list stack) are: 
bucket 0 : (16405 11)  (4960 2) 
bucket 1 : (21826 16)  (18521 15)  (3621 13)  (27826 10)  (6791 8)  (6186 6)  (11576 4) 
bucket 2 : (32137 17)  (1897 14)  (26372 12)  (5867 7)  (27832 1) 
bucket 3 : (9433 19)  (7453 18)  (73 9)  (24738 5)  (10653 3)  (17208 0) 
bucket 4 :
Hash search(even):
(key value) : key -> value:
(17208  0): -> 0 
(4960  2): -> 2 
(11576  4): -> 4 
(6186  6): -> 6 
(6791  8): -> 8 
(27826  10): -> 10 
(26372  12): -> 12 
(1897  14): -> 14 
(21826  16): -> 16 
(7453  18): -> 18 


### 4.2 intDict in C++

In [5]:
%%file ./demo/src/Test_intDict.cpp
#include <iostream>
#include <list>

using namespace std;

class HashTable{
private:
  list<int> *table; // array of buckets; each bucket is a list of the keys that hash to it
  int total_elements;

  // Hash function to calculate hash for a value:
  int getHash(int key){
    return key % total_elements;
  }

public:
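  // Constructor to create a hash table with 'n' indices:
  HashTable(int n){
    total_elements = n;
    table = new list<int>[total_elements];
  }

  // Destructor: release the array of lists allocated in the constructor
  ~HashTable(){
    delete[] table;
  }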

  // Insert data in the hash table:
  void insertElement(int key){
    table[getHash(key)].push_back(key);
  }

  // Remove data from the hash table:
  void removeElement(int key){
    int x = getHash(key);

    list<int>::iterator i; 
    for (i = table[x].begin(); i != table[x].end(); i++) { 
      // Check if the iterator points to the required item:
      if (*i == key) 
        break;
    }

    // If the item was found in the list, then delete it:
    if (i != table[x].end()) 
      table[x].erase(i);
  }

  void printAll(){
    // Traverse each index:
    for(int i = 0; i < total_elements; i++){
      cout << "Index " << i << ": ";
      // Traverse the list at current index:
      for(int j : table[i])
        cout << j << " => ";

      cout << endl;
    }
  }
};

int main() {
  // Create a hash table with 3 indices:
  HashTable ht(3);

  // Declare the data to be stored in the hash table:
  int arr[] = {2, 4, 6, 8, 10};

  // Insert the whole data into the hash table:
  for(int i = 0; i < 5; i++)
    ht.insertElement(arr[i]);

  cout << "..:: Hash Table ::.." << endl;
  ht.printAll();
  
  ht.removeElement(4);
  cout << endl << "..:: After deleting 4 ::.." << endl;
  ht.printAll();

  return 0;
}

Writing ./demo/src/Test_intDict.cpp


In [9]:
!g++ -o ./demo/bin/Test_intDict ./demo/src/Test_intDict.cpp

In [10]:
!.\demo\bin\Test_intDict

..:: Hash Table ::..
Index 0: 6 => 
Index 1: 4 => 10 => 
Index 2: 2 => 8 => 

..:: After deleting 4 ::..
Index 0: 6 => 
Index 1: 10 => 
Index 2: 2 => 8 => 


## 5 Unordered Map(C++11)

Unordered maps are associative containers that store elements formed by the combination of a key value and a mapped value, and which allow for fast retrieval of individual elements based on their keys.

In an unordered_map, the key value is generally used to uniquely identify the element, while the mapped value is an object with the content associated with this key. The types of the key and the mapped value may differ.

Internally, the elements in the unordered_map are not sorted in any particular order with respect to either their key or mapped values, but are organized into buckets depending on their hash values to allow for fast access to individual elements directly by their key values (with constant time complexity on average).

In [1]:
%%file ./demo/src/demo1_unordered_map.cpp

#include <iostream>
#include <string>
#include <unordered_map>
 
int main()
{
    std::unordered_map<std::string, int> months;
    months["january"] = 31;
    months["february"] = 28;
    months["march"] = 31;
    std::cout << "february  -> " << months["february"] << std::endl;
    return 0;
}

Writing ./demo/src/demo1_unordered_map.cpp


In [2]:
!g++ -o ./demo/bin/demo1_unordered_map.exe ./demo/src/demo1_unordered_map.cpp

In [3]:
!.\demo\bin\demo1_unordered_map 

february  -> 28
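The example above uses string keys, while `intDict` only accepts integers. As a rough Python sketch of how a string could first be converted to an integer and then to a bucket index, the code below uses a simple polynomial hash with base 31. This is an assumption made for illustration and not the hash that `std::unordered_map` uses, which is implementation-defined; `string_hash` and `get` are hypothetical names.

```python
def string_hash(key, num_buckets):
    """Polynomial hash: read the characters as digits in base 31, reduced mod num_buckets."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % num_buckets
    return h

NUM_BUCKETS = 7  # arbitrary small table size for illustration
buckets = [[] for _ in range(NUM_BUCKETS)]
for name, days in [("january", 31), ("february", 28), ("march", 31)]:
    buckets[string_hash(name, NUM_BUCKETS)].append((name, days))

def get(key):
    """Hash the string to a bucket, then search that bucket linearly."""
    for k, v in buckets[string_hash(key, NUM_BUCKETS)]:
        if k == key:
            return v
    return None

print("february ->", get("february"))
```

Once the key has been reduced to a bucket index, storage and lookup work exactly as in Separate Chaining with integer keys.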


## Further Reading

* Yan Weimin, Li Dongmei, Wu Weimin. 数据结构（C语言版）(Data Structures, C Language Edition), 2nd ed. Posts & Telecom Press, February 2015.

* Mark Allen Weiss. Data Structures and Algorithm Analysis in C.

* Hash table. https://en.wikipedia.org/wiki/Hash_table