# Hashing or Hash Tables

A `hash table` is a data structure in which elements are accessed by a key rather than by an index number, as they are in lists and arrays.

In this data structure, the data items are stored as `key-value` pairs, as in Python dictionaries.

### Advantages:
- Search, insert and delete operations take O(1) time on average.

### Not useful when:

1. Finding the closest value.
2. Keeping data sorted: a hash table does not keep its keys in sorted order.
3. Prefix searching.
    
**Note:** For `1` and `2`, we can use a self-balancing binary search tree such as an AVL tree or a Red-Black tree. For `3`, a trie is the usual choice.
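
To illustrate why ordered data helps with closest-value queries, here is a minimal sketch. Python's standard library has no balanced BST, so a sorted list with the `bisect` module is used as a stand-in; the function name `closest_value` and the sample keys are assumptions made for this example.

```python
import bisect


def closest_value(sorted_keys, target):
    """Return the key nearest to target from an ascending list of keys."""
    if not sorted_keys:
        raise ValueError("sorted_keys must not be empty")
    # Index of the first key that is >= target
    pos = bisect.bisect_left(sorted_keys, target)
    # Only the neighbours on either side of that index can be the closest
    candidates = sorted_keys[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda k: abs(k - target))


sorted_keys = [15, 17, 21, 22, 23, 25, 48, 49, 50, 56]
print(closest_value(sorted_keys, 30))
```

A hash table cannot answer this query without scanning every key, because neighbouring keys land in unrelated slots.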
      


### Applications Of Hashing:

1. Dictionaries.
2. Database Indexing.
3. Cryptography.
4. Caches.
5. Symbol Tables in Compilers / Interpreters.
6. Routers.

### How Hash Functions Work?

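- Should always map a given large key to the same small key, i.e. the function must be deterministic. The small keys are not guaranteed to be unique: two different large keys can map to the same small key (a collision).
- Should generate values from 0 to m-1, where m is the number of slots in the hash table.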
- Should be fast, O(1) for integers and O(len) for strings of length 'len'.
- Should uniformly distribute large keys into Hash Table Slots.
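
As a rough illustration of the properties above, the sketch below hashes integers with a modulo and strings with a polynomial hash. The names `hash_int` and `hash_str` and the multiplier 31 are assumptions chosen for this example, not something these notes prescribe.

```python
def hash_int(key, m):
    """O(1): map an integer key to a slot in the range 0 .. m-1."""
    return key % m


def hash_str(key, m, base=31):
    """O(len): map a string key to a slot in the range 0 .. m-1."""
    h = 0
    for ch in key:
        # Weight each character by its position so that reordered
        # characters usually give a different value
        h = (h * base + ord(ch)) % m
    return h


print(hash_int(50, 7))
print(hash_str("hello", 7))
```

Both functions are deterministic and always return a value between 0 and m-1; how evenly the values spread depends on the keys and on the choice of m.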

#### `Example`: 

For `phone numbers`, which are generally large keys, we convert them to smaller keys and use those smaller keys for storing and searching.

1. The `Hash Table size` for this application is proportional to the number of keys you are going to insert. 

In [6]:
keys = [50, 21, 48, 17, 15, 49, 56, 22, 23, 25]
print(f'Number of keys = {len(keys)}')

# Hash Function
# hash(key) = key % 7
# We take 7 because it is a prime number close to the number of keys.

hashed_keys = [key % 7 for key in keys]

hashed_keys

Number of keys = 10


[1, 0, 6, 3, 1, 0, 0, 1, 2, 4]

In [7]:
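# Hash collision: two different phone numbers get the same hash value
# when the hash is simply the sum of their character codes.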

sum(map(ord, '9141288090'))

522

In [5]:
sum(map(ord, '9131288190'))

522

In [2]:
import random

def generate_phone():
    # Random 10-digit number
    return random.randint(1000000000, 9999999999)

In [5]:
for i in range(1, 10):
    x = generate_phone()
    # 'm' ----> prime number close to the number of keys we want to insert.
    m = 7 
    print(f'Phone Number {i}: {x} is hashed as {x % m}')

Phone Number 1: 3710298904 is hashed as 4
Phone Number 2: 9544129883 is hashed as 1
Phone Number 3: 5624004151 is hashed as 3
Phone Number 4: 9063042873 is hashed as 3
Phone Number 5: 6826149043 is hashed as 0
Phone Number 6: 9908273419 is hashed as 2
Phone Number 7: 4420613629 is hashed as 5
Phone Number 8: 9676711475 is hashed as 4
Phone Number 9: 4515234763 is hashed as 4


## Birthday Paradox:

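With 365 possible birthdays:

1. 23 people ---> about 50 % chance that at least two of them share the same birthday.
2. 70 people ---> more than 99 % chance that at least two of them share the same birthday.

The same effect makes hash collisions likely even when the table has far more slots than keys.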
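
These figures can be checked with a short calculation. The sketch below assumes 365 equally likely birthdays; the function name `birthday_collision_probability` is made up for this example.

```python
def birthday_collision_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


for people in (23, 70):
    print(people, round(birthday_collision_probability(people), 4))
```

In hashing terms, `days` plays the role of the number of slots and `people` the number of keys.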

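## How do we address Collisions?
* If we know the keys in advance, we can use `Perfect Hashing`, which avoids collisions altogether.
    * For example, building a dictionary of known words.
* If we do not know the keys in advance, collisions are bound to occur.
    * Chaining - each slot holds a chain of the items that collide there.
    * Open Addressing - every item is stored in the table itself, and a collision is resolved by probing for another free slot.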
        * Linear Probing.
        * Quadratic Probing.
        * Double Hashing.

### Chaining Performance 

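- m = No. of slots in the Hash Table.
- n = No. of keys to be inserted.

Load Factor: 

* alpha = n/m
* Expected chain length = alpha (assuming the keys are uniformly distributed), so the expected cost of a search is 1 + alpha.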
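
A small sketch of what the load factor measures, reusing the ten keys and the `key % 7` hash from the earlier example. The helper name `chain_lengths` is an assumption made for this illustration.

```python
def chain_lengths(keys, m):
    """Return the length of the chain in each of the m slots."""
    buckets = [[] for _ in range(m)]
    for key in keys:
        buckets[key % m].append(key)
    return [len(bucket) for bucket in buckets]


keys = [50, 21, 48, 17, 15, 49, 56, 22, 23, 25]
m = 7
lengths = chain_lengths(keys, m)
alpha = len(keys) / m

print(f'Chain lengths = {lengths}')
print(f'Load factor   = {round(alpha, 3)}')
print(f'Average chain = {round(sum(lengths) / m, 3)}')
```

The average chain length always equals alpha; individual chains can be longer or shorter depending on how evenly the hash function spreads the keys.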

### Linked Lists

- Whenever a collision happens, we insert the item at the end of the list for that slot.
- Not cache friendly because the nodes are stored at different memory locations.
- Search, insert and delete are all O(l), where l is the length of the linked list.

### Dynamic Sized Arrays (list in Python)
- Cache friendly.
- `vector` in C++.
- `ArrayList` in Java.

### Self-Balancing BST
- AVL Trees.
- Red-Black Trees.

In [1]:
7 % 2

1

In [33]:
class custom_hash:
    '''
    Hash table that resolves collisions by chaining.

    The constructor creates a list of `b` empty lists (buckets),
    e.g. [ [], [], [], [], [], [], [] ] for b = 7.
    '''
    
    def __init__(self, b):
        self.bucket = b
        self.table = [[] for x in range(b)]
        
    def insert(self, x):
        i = x % self.bucket
        self.table[i].append(x)
        
    def remove(self, x):
        i = x % self.bucket
        if x in self.table[i]:
            self.table[i].remove(x)
        else:
            print('Element is absent')
    
    def search(self, x):
        i = x % self.bucket
        return x in self.table[i]

In [34]:
h = custom_hash(7)

h.insert(70)
h.insert(71)
h.insert(9)
h.insert(56)
h.insert(72)

In [35]:
print(h.search(56))

True


In [36]:
h.remove(56)
print(h.search(56))

False


In [37]:
h.remove(56)

Element is absent


## Open Addressing 

Multiple ways of implementing Open Addressing:

1. Linear Probing.
2. Quadratic Probing.
3. Double Hashing.

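`Formula (double hashing):` hash(key, i) = (h1(key) + (i * h2(key)) ) % m

m = 7 (number of slots in the hash table)

i = probe number: 0 for the first attempt, then one more each time a collision occurs.

h1(key) = key % 7

h2(key) = x - (key % x), where x is the length of the array. h2(key) must never be a multiple of m, otherwise every probe lands on the same slot; this is why x is usually chosen smaller than m.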
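
A minimal sketch of how this formula can drive insertion into a table. It assumes m = 7 and, unlike the x described above, a value x = 5 that is smaller than m, so the step is never a multiple of m. The name `double_hash_insert` and the sample keys are made up for this example.

```python
def double_hash_insert(table, key, x=5):
    """Insert key using double hashing and return the slot it ends up in."""
    m = len(table)
    h1 = key % m
    h2 = x - (key % x)  # step size, always in the range 1 .. x
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found")


table = [None] * 7
for key in [49, 63, 56, 52, 54, 48]:
    double_hash_insert(table, key)

print(table)
```

The keys 49, 63 and 56 all have h1 = 0, but their different step sizes send them to different slots instead of one contiguous run.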

### Performance Analysis of Search

alpha = n/m (should be <= 1)

`Assumption:` Every probe sequence looks at a random location.

`(1-alpha)` Fraction of the Table is empty.

Expected No. of Probes required = $ {{1}\over{1 - \alpha}} $
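
One way to see where this figure comes from, under the assumption above: each probe hits an occupied slot with probability alpha, so the probability that at least k+1 probes are needed is alpha^k, and summing these over all k gives 1/(1 - alpha). The sketch below checks that numerically; both function names are hypothetical.

```python
def expected_probes_series(alpha, terms=1000):
    """Sum alpha**k for k = 0 .. terms-1 (a truncated geometric series)."""
    return sum(alpha ** k for k in range(terms))


def expected_probes_closed_form(alpha):
    """Closed form 1 / (1 - alpha) for a load factor 0 <= alpha < 1."""
    return 1 / (1 - alpha)


for alpha in (0.1, 0.5, 0.9):
    print(alpha,
          round(expected_probes_series(alpha), 5),
          round(expected_probes_closed_form(alpha), 5))
```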

In [3]:
import random

# Generate n random integers in the range [0, 100]
def generate_nos(n):
    return [random.randint(0, 100) for _ in range(n)]

# Keys to hash
array = generate_nos(7)

m1 = 7  # number of slots in the hash table

# Use the array length for x, or one less if the length is even (so x is odd)
if len(array) % 2 == 0:
    x = len(array) - 1
else:
    x = len(array)

print("Formula: Final_hash_key = (h1 + (i * h2) ) % m")
print(f'x = {x} \n')
# Note: i is the position of the key in the array, used here as a stand-in
# for the probe number; no table is filled and no collisions are checked.
for i in range(0, len(array)):
    key = array[i]
    h1 = key % 7        # primary hash
    h2 = x - (key % x)  # secondary hash (step size), in the range 1..x
    hash_key = (h1 + i * h2) % m1
    print(f'For {key}, h1 = {h1}, h2 = {h2} and Final Hash Key = {hash_key}')

Formula: Final_hash_key = (h1 + (i * h2) ) % m
x = 7 

For 46, h1 = 4, h2 = 3 and Final Hash Key = 4
For 7, h1 = 0, h2 = 7 and Final Hash Key = 0
For 94, h1 = 3, h2 = 4 and Final Hash Key = 4
For 50, h1 = 1, h2 = 6 and Final Hash Key = 5
For 61, h1 = 5, h2 = 2 and Final Hash Key = 6
For 62, h1 = 6, h2 = 1 and Final Hash Key = 4
For 51, h1 = 2, h2 = 5 and Final Hash Key = 4


## Clustering Problem

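Clustering is a problem with Linear Probing.

Colliding keys are placed in the next free slots, so occupied slots build up into long contiguous runs (clusters). Any key that hashes into a cluster has to probe past the rest of the run, so operations such as Search, Add and Delete become costly.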
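
The effect can be seen in a small sketch of insertion with linear probing. The function name `linear_probe_insert`, the table size of 7 and the sample keys are assumptions made for this illustration.

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing; return the number of slots examined."""
    m = len(table)
    start = key % m
    for i in range(m):
        slot = (start + i) % m
        if table[slot] is None:
            table[slot] = key
            return i + 1
    raise RuntimeError("hash table is full")


table = [None] * 7
probes = {}
for key in [50, 51, 49, 16, 56, 15, 19]:
    probes[key] = linear_probe_insert(table, key)

print(table)
print(probes)
```

Once slots 0 to 3 form a cluster, the keys 56 and 15 each have to examine five slots before finding a free one.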

### Solution to Clustering Problem:

1. Quadratic Probing: the i-th probe looks at an offset of $ i^2 $ slots from the home slot instead of an offset of i.

Load Factor = $ {{n}\over{m}} $

* n = Number of Keys
* m = Number of slots in Hash Table.
* h1 = key % 7
* h2 = 

### Comparing Chaining and Open Addressing

**Chaining:**
- Not Cache Friendly.

**Open Addressing:**
- Cache Friendly.

In [1]:
def time_complexity(alpha):
    x = 1 + alpha        # Chaining: expected cost of a search
    y = 1 / (1 - alpha)  # Open Addressing: expected number of probes

    print('Time Complexity\n')
    print(f'Chaining = {x} \nOpen Addressing = {round(y, 9)}')
    
# Time Complexity
time_complexity(0.09)

Time Complexity

Chaining = 1.09 
Open Addressing = 1.098901099


In [2]:
# Expected cost for Chaining as the load factor goes from 0 % to 90 %
# (90 % of the Hash Table occupied means alpha = 0.9)

for i in range(0, 100, 10):
    x = 1 + (i*0.01)
    print(f'For Chaining if {i} % of Hash Table is occupied = {round(x, 5)}')

For Chaining if 0 % of Hash Table is occupied = 1.0
For Chaining if 10 % of Hash Table is occupied = 1.1
For Chaining if 20 % of Hash Table is occupied = 1.2
For Chaining if 30 % of Hash Table is occupied = 1.3
For Chaining if 40 % of Hash Table is occupied = 1.4
For Chaining if 50 % of Hash Table is occupied = 1.5
For Chaining if 60 % of Hash Table is occupied = 1.6
For Chaining if 70 % of Hash Table is occupied = 1.7
For Chaining if 80 % of Hash Table is occupied = 1.8
For Chaining if 90 % of Hash Table is occupied = 1.9


In [4]:
for i in range(0, 100, 10):
    y = 1/ ( 1- (i*0.01) )
    print(f'For Open Addressing if {i} % of Hash Table is occupied = {round(y, 5)}')

For Open Addressing if 0 % of Hash Table is occupied = 1.0
For Open Addressing if 10 % of Hash Table is occupied = 1.11111
For Open Addressing if 20 % of Hash Table is occupied = 1.25
For Open Addressing if 30 % of Hash Table is occupied = 1.42857
For Open Addressing if 40 % of Hash Table is occupied = 1.66667
For Open Addressing if 50 % of Hash Table is occupied = 2.0
For Open Addressing if 60 % of Hash Table is occupied = 2.5
For Open Addressing if 70 % of Hash Table is occupied = 3.33333
For Open Addressing if 80 % of Hash Table is occupied = 5.0
For Open Addressing if 90 % of Hash Table is occupied = 10.0
