# Hash Tables

Example: measurement tags of a vapor-compression refrigeration cycle (VCC)



## 1 Measurement Tags of VCC
The table stores the measurement tags of the VCC. Each tag record has a unique TagID.

Refrigerant 134a is the working fluid in an ideal vapor-compression refrigeration cycle that
communicates thermally with a cold region at 0°C and a warm region at 26°C. 

Saturated vapor enters the compressor at 0°C and saturated liquid leaves the condenser at 26°C.

The mass flow rate of the refrigerant is 0.08 kg/s.



![](./img/vcr/ivcr-ts.jpg)


In [None]:
%%file ./data/VCC1_Tag.csv
TagID,Tag,Desc,Unit,Value
108600,CompressorIPortM,压缩机入口流量,kg/s,0.08
108616,CompressorOPortP,压缩机出口压力,MPa,0.6854
108614,CompressorOPortT,压缩机出口温度,°C,29.27
108714,CondenserOPortT,冷凝器出口温度,°C,26
108708,CondenserOPortX,冷凝器出口干度,-,0
108814,ExpansionValveOPortT,膨胀阀出口温度,°C,26
108808,ExpansionValveOPortX,膨胀阀出口干度,°C,0
108914,EvaporatorValveOPortT,蒸发器出口温度,°C,0
108908,EvaporatorValveOPortX,蒸发器出口干度,-,1

**Data Structure of Tags**

Each record is a tuple `(id,(tag,desc,unit,value))`, and the table is a list of such records:

```python
VCC1_TagTable=[(id,(tag,desc,unit,value)),...]
```

In [None]:
import csv

filename="./data/VCC1_Tag.csv"
VCC1_TagTable=[]
# the with statement closes the file even if a row fails to parse
with open(filename, 'r', encoding="utf-8") as csvfile:
    csvdata = csv.DictReader(csvfile)
    for line in csvdata:
        id = int(line['TagID'])  # convert to int
        tag=line['Tag']
        desc=line['Desc']
        unit=line['Unit']
        value=float(line['Value'])
        VCC1_TagTable.append((id,(tag,desc,unit,value)))

In [None]:
for item in  VCC1_TagTable:
    print(item)

Look up the compressor tags by TagID using linear search

In [None]:
CompressorTagIDList=[108600,108616,108614,108914,108908]
for tagid in CompressorTagIDList:
    for item in VCC1_TagTable:
        if tagid==item[0]:
            print(item[1])       

A linear search takes $O(n)$ time for each lookup, where $n$ is the number of records in the table.

Combining merge sort with binary search gives a good way to search lists. We use merge sort to preprocess the list in $O(n\log n)$ time, and then use binary search to test whether an element is in the list in $O(\log n)$ time. If we search the list $k$ times, the overall time complexity is $O(n\log n + k\log n)$.
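The preprocess-then-search idea can be sketched with the standard library. This is only an illustration: it uses Python's built-in `sorted` (Timsort rather than merge sort, but also $O(n\log n)$) and `bisect.bisect_left` for the binary search. The helper name `binary_lookup` and the shortened `(TagID, Tag)` records are assumptions made for this example.

```python
from bisect import bisect_left

# a few (TagID, Tag) records taken from VCC1_Tag.csv, deliberately unsorted
records = [(108616, "CompressorOPortP"), (108600, "CompressorIPortM"),
           (108714, "CondenserOPortT"), (108614, "CompressorOPortT")]

# preprocessing: sort once by TagID, O(n*log(n))
records_sorted = sorted(records)
sorted_ids = [tag_id for tag_id, _ in records_sorted]

def binary_lookup(tag_id):
    """Return the tag name for tag_id, or None if absent; O(log(n)) per call."""
    pos = bisect_left(sorted_ids, tag_id)
    if pos < len(sorted_ids) and sorted_ids[pos] == tag_id:
        return records_sorted[pos][1]
    return None

print(binary_lookup(108614))   # a TagID that is present
print(binary_lookup(999999))   # a TagID that is absent
```

The sort is paid for once, and every later lookup only needs the binary search.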

This is good, but we can still **ask**: `is logarithmic the best` that we can do for search when we are willing to do some preprocessing?

When we introduced the type <font color="blue">dict</font>, we said that dictionaries use a technique called <b>hashing</b> to do <b>the lookup in time</b>

* that is nearly `independent` of the `size` of the dictionary

The basic idea behind hashing is

* **convert the key to an integer, and then use that integer to index into a list**

which can be done in `constant` time. 
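As a quick illustration of the built-in `dict`, the sketch below stores three of the tag records under their TagID and looks them up by key. The variable name `tag_dict` is an assumption for this example, and the description field of each record is left out for brevity.

```python
# TagID -> (Tag, Unit, Value), copied from the first rows of VCC1_Tag.csv
tag_dict = {
    108600: ("CompressorIPortM", "kg/s", 0.08),
    108616: ("CompressorOPortP", "MPa", 0.6854),
    108614: ("CompressorOPortT", "°C", 29.27),
}

print(tag_dict[108616])        # lookup by key
print(tag_dict.get(100000))    # .get returns None for a missing key
print(108614 in tag_dict)      # membership test also uses hashing
```

No sorting or scanning is written by hand here: the dictionary does the hashing internally.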

**Hash function**: any function that can be used to map data of `arbitrary` size to `fixed-size` values.

* `CurTagID%ListSize` (the division method, `k mod m`: the remainder of key `k` divided by the table length `m`)
![](./img/ds/hash1.png)

**Hash value**: the value returned by a hash function

* `Index_TagID=CurTagID%ListSize`

**Hash table**: the data structure that maps keys to values with hashing

* `VCC1_TagTable=[None for i in range(ListSize)]`

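>A hash table accesses a record by mapping its key to a position in the table, which speeds up lookup. The mapping function is called the hash function, and the array that stores the records is called the hash table.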
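The division method works directly on integer keys. To show how data of arbitrary size can be reduced to a fixed-size index, the sketch below hashes tag-name strings by combining their character codes before taking the remainder. The function name `string_hash` and the multiplier 31 are assumptions chosen for illustration; they are not part of the original example.

```python
def string_hash(text, list_size):
    """Map a string of any length to an index in range(list_size)."""
    h = 0
    for ch in text:
        # fold in one character at a time and keep the value inside the table range
        h = (h * 31 + ord(ch)) % list_size
    return h

list_size = 29
for name in ["CompressorIPortM", "CondenserOPortT", "EvaporatorValveOPortX"]:
    print(name, string_hash(name, list_size))
```

Whatever the length of the name, the result is always one of `list_size` possible indices.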


For example, we use the remainder `key%ListSize` as the index into the list:

In [None]:
import csv

filename="./data/VCC1_Tag.csv"

# set the size of the store list
# 29 gives every TagID in this file its own index; with 30, the keys
# 108614 and 108914 would both map to index 14 and the later record
# would silently overwrite the earlier one
ListSize=29
# the store table
VCC1_TagTable=[None for i in range(ListSize)]
with open(filename, 'r', encoding="utf-8") as csvfile:
    csvdata = csv.DictReader(csvfile)
    for line in csvdata:
        id = int(line['TagID'])
        tag=line['Tag']
        desc=line['Desc']
        unit=line['Unit']
        value=float(line['Value'])
        # convert the key to an integer: index of the list
        Index_TagID= id%ListSize
        # put the record in the index of the list
        VCC1_TagTable[Index_TagID]=(tag,desc,unit,value)

In [None]:
for item in  VCC1_TagTable:
    print(item)

Get the information for one tag from its TagID by computing `Index_TagID`

The lookup is done in **constant** time, nearly `independent` of the `size` of VCC1_TagTable, provided that no two TagIDs map to the same index

The complexity is $O(1)$

In [None]:
CompressorTagIDList=[108600,108616,108614,108914,108908]
for tagid in CompressorTagIDList:
    Index_TagID=tagid%ListSize
    print(VCC1_TagTable[Index_TagID])   
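A direct-index table like this one only returns the right record if no two TagIDs share an index, because a later record silently overwrites an earlier one stored in the same slot. A quick check, sketched below with a hypothetical helper `find_clashes`, lists the indices that more than one key maps to for a given table size.

```python
from collections import defaultdict

# the nine TagIDs in VCC1_Tag.csv
tag_ids = [108600, 108616, 108614, 108714, 108708,
           108814, 108808, 108914, 108908]

def find_clashes(keys, list_size):
    """Return {index: [keys]} for every index shared by more than one key."""
    slots = defaultdict(list)
    for key in keys:
        slots[key % list_size].append(key)
    return {index: ks for index, ks in slots.items() if len(ks) > 1}

print(find_clashes(tag_ids, 29))   # → {}
print(find_clashes(tag_ids, 30))   # → {14: [108614, 108914]}
```

A table size that looks generous can still produce a clash, so the check is worth running before trusting a plain `key%ListSize` table.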

## 2 Collision 

**Collision**: a situation that occurs when two distinct pieces of data have the same hash value

* Collision: in a hash table, different keys are mapped to the same storage position


If the space of possible outputs of a hash function is **smaller** than the space of possible inputs,

* the hash function is a `many`-to-`one` mapping,

so different keys can be mapped to the same hash value. This is called a <b>collision</b>.

For example, take the simple hash function

* `id%ListSize`

where the hash value of a key is the remainder `key%ListSize`.

If

* the number of keys is 9, and

* the number of possible hash values, `ListSize`, is 5,

then at least two keys must share a hash value, and you will see many collisions!

In [None]:
import csv

filename="./data/VCC1_Tag.csv"
# set the size of the store list
ListSize=5
with open(filename, 'r', encoding="utf-8") as csvfile:
    csvdata = csv.DictReader(csvfile)
    for line in csvdata:
        id = int(line['TagID'])
        # convert the key to an integer: index of the list
        Index_TagID= id%ListSize
        print(id, Index_TagID)

**Ways to handle collisions in a hash table**

1. `Minimize collisions`: a good hash function produces a **uniform distribution**, in which every output in the range is equally probable, which `minimizes` the probability of `collisions`

2. `Collision resolution`: Separate Chaining, Open Addressing
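Open addressing is named in the list but not developed in this notebook. As a rough sketch only, the code below shows one common variant, linear probing: when the home slot `key % size` is already taken by another key, the following slots are tried in turn. The function names `probe_insert` and `probe_lookup` are hypothetical, and deletion (which needs extra bookkeeping) is left out.

```python
def probe_insert(table, key, value):
    """Store (key, value) using linear probing; return the slot that was used."""
    size = len(table)
    start = key % size
    for step in range(size):
        slot = (start + step) % size
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("hash table is full")

def probe_lookup(table, key):
    """Return the value stored for key, or None if the key is absent."""
    size = len(table)
    start = key % size
    for step in range(size):
        slot = (start + step) % size
        if table[slot] is None:
            return None            # an empty slot ends the probe sequence
        if table[slot][0] == key:
            return table[slot][1]
    return None

table = [None] * 8
for key in [18, 9, 68, 55, 79]:
    print(key, "stored in slot", probe_insert(table, key, str(key)))

print(probe_lookup(table, 79))
print(probe_lookup(table, 63))
```

Keys 55 and 79 both have home slot 7, so 79 wraps around to slot 0. Unlike separate chaining, every record lives in the table itself, so the table can never hold more entries than it has slots.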


## 3 Handle collisions: Separate Chaining

There are different ways in which a collision can be resolved. We will look at a method called **Separate Chaining**.

* **Separate chaining** does not avoid collisions; it resolves them. The idea is to make each cell of the hash table point to a linked list of the records that have the same hash value, so that all elements that hash to the same value are kept in one list.

The hash table is a list of `hash buckets`. 

* **bucket**: a list of `key/value` pairs with the same hash value

![](./img/ds/Hashcollisionbyseparatechaining.jpg)



**Hash Table in Python**

The basic idea is to represent the hash table by a list where **each item** is a list of **key/value** pairs that have the `same` hash index

```python
[
    [bucket for hash value 1],
    [bucket for hash value 2],
    ...
]
```


For example:





In [31]:
keys=[18,9,68,55,79]
num_buckets=8
buckets=[[] for i in range(num_buckets)]

print("Key","The address in buckets","\n"+20*"-")
for key in keys:
    #hash function: key % num_buckets
    address=key % num_buckets
    buckets[address].append(key)
    print(key,address)

print("No.","Bucket","\n"+20*"-")   
for  i,bucket in  enumerate(buckets):
    print(i,bucket)   

Key The address in buckets 
--------------------
18 2
9 1
68 4
55 7
79 7
No. Bucket 
--------------------
0 []
1 [9]
2 [18]
3 []
4 [68]
5 []
6 []
7 [55, 79]



### 3.1 Separate Chaining in Python


#### 3.1.1  Init the hash table

```python
def __init__(self, numBuckets):
    """
    The instance variable buckets is initialized to a list of numBuckets empty lists
    """
    self.buckets = []
    self.numBuckets = numBuckets
    for i in range(numBuckets):
        self.buckets.append([])
```

#### 3.1.2  hash function

```python
def getHashValue(self, dictKey):
    return dictKey%self.numBuckets
```


#### 3.1.3 addEntry

By making each bucket a list, we handle collisions by storing all of the values that hash to the same bucket in that list.

```python
def addEntry(self, dictKey, dictVal):
    """
    Store an entry with key dictKey, replacing the value if the key already exists
    """
    hashBucket = self.buckets[self.getHashValue(dictKey)] # hash to locate the `hashBucket` list in self.buckets
    for i in range(len(hashBucket)):
        if hashBucket[i][0] == dictKey: # each item in a bucket is a tuple: (dictKey, dictVal)
            hashBucket[i] = (dictKey, dictVal) # if one was found, replace
            return
    hashBucket.append((dictKey, dictVal)) # append a new entry (dictKey, dictVal) to the bucket if none was found
```
   
We use the hash function `dictKey % numBuckets` to convert dictKey into an integer,
```python
    hashBucket = self.buckets[dictKey%self.numBuckets]
```
and use that integer to index into `buckets` to find the hash bucket associated with **dictKey**. We then scan the entries of that bucket,
```python
    hashBucket[i]
```
and if <b>a value is to be stored</b>, then

* if one was found: <b>replace</b> the value in the existing entry,

* if none was found: <b>append</b> a new entry to the bucket


#### 3.1.4 getValue

```python
def getValue(self, dictKey):
    hashBucket = self.buckets[self.getHashValue(dictKey)]
```
We then search that bucket (which is a list) linearly to see if there is an entry with the key dictKey.

```python
    for e in hashBucket:
        if e[0] == dictKey:  # key
            return e[1]      # value
    return None
```

If we are doing <b>a lookup</b> and there is an entry with the key, we simply return the value stored with that key. 

If there is no entry with that key, we return None. 




In [11]:
class intDict(object):
    """A dictionary with integer keys"""
    
    def __init__(self, numBuckets):
        """Create an empty dictionary
           buckets is initialized to a list of numBuckets empty lists.
        """
        self.buckets = []
        self.numBuckets = numBuckets
        for i in range(numBuckets):
            self.buckets.append([]) # empty list
            
    def getHashValue(self, dictKey):
        return dictKey%self.numBuckets
    
    def addEntry(self, dictKey, dictVal):
        """Assumes dictKey an int.  Adds an entry."""
        hashBucket = self.buckets[self.getHashValue(dictKey)]
        for i in range(len(hashBucket)):
            if hashBucket[i][0] == dictKey:
                hashBucket[i] = (dictKey, dictVal) #if one was found,replace
                return
        hashBucket.append((dictKey, dictVal)) # append a new entry to the bucket if none was found.
        
    def getValue(self, dictKey):
        """Assumes dictKey an int.  Returns entry associated
           with the key dictKey"""
        hashBucket = self.buckets[self.getHashValue(dictKey)]
        for e in hashBucket:
            if e[0] == dictKey: # key
                return e[1]     # value 
        return None
    
    def __str__(self):
        entries = []
        for b in self.buckets:
            for e in b:
                entries.append(str(e[0]) + ':' + str(e[1]))
        # join adds commas only between entries, so an empty dictionary prints as {}
        return '{' + ','.join(entries) + '}'

### 3.2 Example: Measurement Tags of VCC

The following code constructs an **intDict** with TagID of VCC. 


In [12]:
import  csv
filename="./data/VCC1_Tag.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)
Entrys=[]
for line in csvdata:
    id = int(line['TagID'])
    tag=line['Tag']
    desc=line['Desc']
    unit=line['Unit']
    value=float(line['Value'])
    Entrys.append((id,(tag,desc,unit,value))) 
csvfile.close()  

Put the entries into <font color="blue">intDict</font>

**A larger hash table: no collisions**

* numBuckets = 29

In [13]:
numBuckets =29
# numBuckets 29 > entries 9
D = intDict(numBuckets)
for item in Entrys:
    D.addEntry(item[0],item[1])

print('The intDict is:')
print(D)

print('\n', 'The hash buckets are:')
for i, hashBucket in enumerate(D.buckets):
    print('BucketID',i,'  ', hashBucket)

The intDict is:
{108808:('ExpansionValveOPortX', '膨胀阀出口干度', '°C', 0.0),108814:('ExpansionValveOPortT', '膨胀阀出口温度', '°C', 26.0),108614:('CompressorOPortT', '压缩机出口温度', '°C', 29.27),108616:('CompressorOPortP', '压缩机出口压力', 'MPa', 0.6854),108908:('EvaporatorValveOPortX', '蒸发器出口干度', '-', 1.0),108708:('CondenserOPortX', '冷凝器出口干度', '-', 0.0),108914:('EvaporatorValveOPortT', '蒸发器出口温度', '°C', 0.0),108714:('CondenserOPortT', '冷凝器出口温度', '°C', 26.0),108600:('CompressorIPortM', '压缩机入口流量', 'kg/s', 0.08)}

 The hash buckets are:
BucketID 0    [(108808, ('ExpansionValveOPortX', '膨胀阀出口干度', '°C', 0.0))]
BucketID 1    []
BucketID 2    []
BucketID 3    []
BucketID 4    []
BucketID 5    []
BucketID 6    [(108814, ('ExpansionValveOPortT', '膨胀阀出口温度', '°C', 26.0))]
BucketID 7    []
BucketID 8    []
BucketID 9    [(108614, ('CompressorOPortT', '压缩机出口温度', '°C', 29.27))]
BucketID 10    []
BucketID 11    [(108616, ('CompressorOPortP', '压缩机出口压力', 'MPa', 0.6854))]
BucketID 12    []
BucketID 13    [(108908, ('Evaporato

We see that many of the hash buckets are **empty**. 


In [14]:
CompressorTagIDList=[108600,108616,108614,108914,108908]
for tag in CompressorTagIDList:
    thebucket=D.getValue(tag)   
    print(tag,thebucket)

108600 ('CompressorIPortM', '压缩机入口流量', 'kg/s', 0.08)
108616 ('CompressorOPortP', '压缩机出口压力', 'MPa', 0.6854)
108614 ('CompressorOPortT', '压缩机出口温度', '°C', 29.27)
108914 ('EvaporatorValveOPortT', '蒸发器出口温度', '°C', 0.0)
108908 ('EvaporatorValveOPortX', '蒸发器出口干度', '-', 1.0)


**A smaller hash table: collisions**

* numBuckets = 5

In [None]:
numBuckets=5
# numBuckets 5 < entries 9
D = intDict(numBuckets)
for item in Entrys:
    D.addEntry(item[0],item[1])

print('The intDict is:')
print(D)

print('\n', 'The hash buckets are:')
for i, hashBucket in enumerate(D.buckets):
    print('BucketID',i,'  ', hashBucket)

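Each bucket now holds **zero, one, or several tuples**, depending upon <b>the number of collisions</b> that occurred: with five buckets, buckets 0 and 1 hold one tuple each, bucket 2 stays empty, bucket 3 holds three, and bucket 4 holds four.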

In [None]:
CompressorTagIDList=[108600,108616,108614,108914,108908]
for tagid in CompressorTagIDList:
    thebucket=D.getValue(tagid)   
    print(tagid,thebucket)  # print the loop variable tagid, not a leftover name from an earlier cell

### 3.3 The complexity of **getValue**


If there were <b>no collisions</b>, it would be <b>O(1)</b>,

* because each hash bucket would be of length 0 or 1.

But there might be <b>collisions</b>:

* If everything hashed to **the same bucket**, it would be <b>O(n)</b>, where n is the number of entries in the dictionary, because the code would perform a linear search on that hash bucket.


In [None]:
tagid=108808
thebucket=D.getValue(tagid)  
print(tagid,thebucket)

By making the hash table large enough,

we can reduce the number of collisions sufficiently to allow us to treat the complexity as O(1).

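* If a sufficiently large array can be provided, with one position reserved for every possible key, the **direct addressing** technique can be used, and the time complexity is O(1).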
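To make the trade-off between table size and chain length concrete, the sketch below counts how many TagIDs land in each bucket for two table sizes. It does not use `intDict`; the helper name `bucket_lengths` and the use of the load factor (entries divided by buckets) as a summary are assumptions added for this illustration.

```python
# the nine TagIDs in VCC1_Tag.csv
tag_ids = [108600, 108616, 108614, 108714, 108708,
           108814, 108808, 108914, 108908]

def bucket_lengths(keys, num_buckets):
    """Count how many keys hash to each bucket under key % num_buckets."""
    lengths = [0] * num_buckets
    for key in keys:
        lengths[key % num_buckets] += 1
    return lengths

for num_buckets in (5, 29):
    lengths = bucket_lengths(tag_ids, num_buckets)
    load_factor = len(tag_ids) / num_buckets
    # the longest chain bounds the work of a single getValue call
    print(num_buckets, "buckets: load factor", round(load_factor, 2),
          "longest chain", max(lengths))
```

With 5 buckets the longest chain has four entries, while with 29 buckets no chain is longer than one, which is why the lookup can then be treated as O(1).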

## Further Reading: Hash Tables in C and C++

### 1 Hash: intDict in C

* intDict.h/c

* mainintDict.c

In [7]:
%%file ./demo/include/intDict.h
#ifndef INTDICTH
#define INTDICTH

typedef struct _node
{
	int key;
	int value;
	struct _node *next;
} Node;

typedef struct _hashtable
{
	int numBuckets;
	Node **buckets; //the linked list stack
} Hashtable;

// Create hash table
Hashtable *createHash(int numBuckets);

// free hash table
void *freeHash(Hashtable *hTable);

// hash function for int keys
int inthash(int key, int numBuckets);

// Add Entry to table - keyed by int
void addEntry(Hashtable *hTable, int key, int value);

// Lookup  by int key
Node *searchEntry(Hashtable *hTable, int key);

// Get by int key
int getValue(Hashtable *hTable, int key);

#endif


Overwriting ./demo/include/intDict.h


In [6]:
%%file ./demo/src/intDict.c

#include <stdio.h>
#include <stdlib.h>
#include "intDict.h"

// Create hash table
Hashtable *createHash(int numBuckets)
{
	Hashtable *table = (Hashtable *)malloc(sizeof(Hashtable)); // size of the struct, not of a pointer to it
	if (!table)
	{
		return NULL;
	}

	table->buckets = (Node **)malloc(sizeof(Node *) * numBuckets); // an array of bucket head pointers
	if (!table->buckets)
	{
		free(table);
		return NULL;
	}

	table->numBuckets = numBuckets;
	// initialize the head pointer of the bucket stack to NULL
	for (int i = 0; i < table->numBuckets; i++)
		table->buckets[i] = NULL;

	return table;
}

void *freeHash(Hashtable *hTable)
{
	Node *b, *p;
	for (int i = 0; i < hTable->numBuckets; i++)
	{
		b = hTable->buckets[i];
		while (b != NULL)
		{
			p = b->next;
			free(b);
			b = p;
		}
	}
	free(hTable->buckets);
	free(hTable);
	return NULL; // the prototype returns void *, so return NULL once the table is freed
}

// hash function for int keys
int inthash(int key, int numBuckets)
{
	return key % numBuckets;
}

// Lookup  by int key
Node *searchEntry(Hashtable *hTable, int key)
{
	Node *p;
	int addr = inthash(key, hTable->numBuckets);
	p = hTable->buckets[addr];

	while (p && p->key != key)
		p = p->next;

	return p;
}

// Add Entry to table - keyed by int
void addEntry(Hashtable *hTable, int key, int value)
{
	int addr;
	Node *p, *entry;
	p = searchEntry(hTable, key);
	if (p)
	{
		// the key already exists: replace its value, as the Python addEntry does
		p->value = value;
		return;
	}
	/*
	   add a new item on the top of the linked list stack
	   and make the bucket head point to it
	*/
	addr = inthash(key, hTable->numBuckets);
	entry = (Node *)malloc(sizeof(Node));
	if (!entry)
	{
		return; // out of memory: leave the table unchanged
	}
	entry->key = key;
	entry->value = value;
	entry->next = hTable->buckets[addr];
	hTable->buckets[addr] = entry;
}

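// Get by int
int getValue(Hashtable *hTable, int key)
{
	Node *p;
	p = searchEntry(hTable, key);
	if (p)
	{
		return p->value;
	}
	// key not found: -1 is used here as a sentinel, which assumes that
	// stored values are non-negative (as they are in mainintDict.c)
	return -1;
}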


Overwriting ./demo/src/intDict.c


In [18]:
%%file ./demo/src/mainintDict.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "intDict.h"

int main()
{
	int numBuckets = 5;
	int numEntries = 20;
	Hashtable *hTable;
	int *key;
	int *value;

	hTable = createHash(numBuckets);
	key = (int *)malloc(sizeof(int) * numEntries);
	value = (int *)malloc(sizeof(int) * numEntries);
    
	printf("The value of the intDict is:\n");
	printf("(key value)\n");
    srand(time(NULL));
	for (int i = 0; i < numEntries; i++)
	{
		key[i] = rand() % 100000;
		value[i] = i;

		addEntry(hTable, key[i], value[i]);
		printf("(%d %d)\n", key[i], value[i]);
	}

	printf("\nThe buckets(the linked list stack) are: \n");
	for (int i = 0; i < hTable->numBuckets; i++)
	{
		Node *b, *p;
		b = hTable->buckets[i];
		printf("bucket %d :", i);
		if (b)
		{
			for (p = b; p != NULL; p = p->next)
				printf(" (%d %d) ", p->key, p->value);
			printf("\n");
		}
		else
			printf("\n");
	}

	printf("\nHash search(even):\n");
	printf("(key value) : key -> value:\n");
	for (int i = 0; i < numEntries; i++)
	{
		if (i % 2 == 0)
		{
			int val = getValue(hTable, key[i]);
			printf("(%d  %d): -> %d \n", key[i], value[i], val);
		}
	}

	free(key);
	free(value);

	freeHash(hTable);

	return 0;
}


Overwriting ./demo/src/mainintDict.c


In [19]:
!gcc -o ./demo/bin/mainintDict ./demo/src/mainintDict.c ./demo/src/intDict.c -I./demo/include

In [20]:
!.\demo\bin\mainintDict 

The value of the intDict is:
(key value)
(25728 0)
(20974 1)
(15875 2)
(14998 3)
(10386 4)
(240 5)
(26769 6)
(6258 7)
(27060 8)
(21420 9)
(15157 10)
(6202 11)
(31358 12)
(2859 13)
(2152 14)
(16855 15)
(1348 16)
(8767 17)
(12863 18)
(14169 19)

The buckets(the linked list stack) are: 
bucket 0 : (16855 15)  (21420 9)  (27060 8)  (240 5)  (15875 2) 
bucket 1 : (10386 4) 
bucket 2 : (8767 17)  (2152 14)  (6202 11)  (15157 10) 
bucket 3 : (12863 18)  (1348 16)  (31358 12)  (6258 7)  (14998 3)  (25728 0) 
bucket 4 : (14169 19)  (2859 13)  (26769 6)  (20974 1) 

Hash search(even):
(key value) : key -> value:
(25728  0): -> 0 
(15875  2): -> 2 
(10386  4): -> 4 
(26769  6): -> 6 
(27060  8): -> 8 
(15157  10): -> 10 
(31358  12): -> 12 
(2152  14): -> 14 
(1348  16): -> 16 
(12863  18): -> 18 


### 2 Unordered Map(C++11)

Unordered maps are associative containers that store elements formed by the combination of a key value and a mapped value, and which allow for fast retrieval of individual elements based on their keys.

In an unordered_map, the key value is generally used to uniquely identify the element, while the mapped value is an object with the content associated with this key. The types of the key and the mapped value may differ.

Internally, the elements in the unordered_map are not sorted in any particular order with respect to either their key or mapped values, but are organized into buckets depending on their hash values to allow for fast access to individual elements directly by their key values (with constant time complexity on average).

In [None]:
%%file ./demo/src/demo1_unordered_map.cpp

#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>
 
using namespace std;
typedef tuple<string,string,string,float> tupTag;
 
int main()
{  
    unordered_map<int, tupTag> tags;
    // brace-initialize the tuple directly; the f suffix makes the literal a float
    tags[108600] = tupTag{"CompressorIPortM","压缩机入口质量流量","kg/s",0.08f};
    cout << "Tag 108600:  " <<get<0>(tags[108600]) <<"\t"<< get<1>(tags[108600])
         << "\t"<<get<2>(tags[108600])<< "\t"<<get<3>(tags[108600])<<endl;
    return 0;
}

In [None]:
!g++ -fexec-charset=GBK -o ./demo/bin/demo1_unordered_map.exe ./demo/src/demo1_unordered_map.cpp 

In [None]:
!.\demo\bin\demo1_unordered_map 

## Further Reading

* Yan Weimin, Li Dongmei, Wu Weimin. Data Structures (C Language Edition) (数据结构（C语言版）), 2nd ed. Posts & Telecom Press, February 2015

* Mark Allen Weiss. Data Structures and Algorithm Analysis in C

* Hash table https://en.wikipedia.org/wiki/Hash_table