# Hashing
* Instead of using searching algorithms that are O(n) or O(log(n)), we can store data in a data structure that can be searched in O(1) time on average
* This data structure is a **hash table**
* a **hash function** maps items to a specific **slot** in the hash table
    * a **perfect hash function** maps every item to a unique slot
* **load factor**: ratio of the number of items to the table size, i.e. num_items / table_size
* **collisions**: when multiple items get assigned to the same slot by the hash function

## Hash Functions
* Unfortunately, there's no systematic way to construct a perfect hash function for an arbitrary collection of items (though perfect hashing is typically not needed)
* One way to always have a perfect hash function is to make the hash table large enough to accommodate every possible value in the item range
    * Not practical when the range of possible values is huge, e.g. storing every possible nine-digit social security number requires $size=10^9$ slots
* **Goal**: create a hash function that minimizes the number of collisions, is easy to compute, and evenly distributes items in the hash table
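
To make the trade-off concrete, here is a minimal sketch of the "one slot per possible value" idea, assuming the items are non-negative integers below a known bound. The names `build_direct_table` and `max_value` are hypothetical and only serve the illustration.

```python
def build_direct_table(items, max_value):
    """Store each item at the index equal to its own value (hypothetical helper)."""
    # One slot for every value in 0..max_value, so no two items can share a slot
    table = [None] * (max_value + 1)
    for item in items:
        table[item] = item
    return table


items = [54, 26, 93, 17, 77, 31]
table = build_direct_table(items, max_value=99)

print(len(table))               # 100 slots for only 6 items
print(len(items) / len(table))  # load factor: 0.06
```

The hash function here is simply the identity, which is perfect but wasteful: the table grows with the range of possible values rather than with the number of items stored.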

## Common Hash Functions

### Remainder Function
* one of the simplest and most common hash functions
* simply assign the slot to be the remainder of dividing the value by the table size
$$
\begin{aligned}
h(item) &= item \ \% \ c \\
c &= tableSize \text{ (typically)}
\end{aligned}
$$
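
As a quick illustration of the remainder function, the sketch below hashes a handful of integers into a table of 11 slots. The function name `remainder_hash` and the sample items are assumptions made for this example.

```python
def remainder_hash(item, table_size):
    """Slot index is the remainder of item divided by table_size."""
    return item % table_size


table_size = 11
items = [54, 26, 93, 17, 77, 31]

# Map each item to its slot
slots = {item: remainder_hash(item, table_size) for item in items}
print(slots)  # {54: 10, 26: 4, 93: 5, 17: 6, 77: 0, 31: 9}

# Load factor: number of items divided by table size
load_factor = len(items) / table_size

# 44 also leaves remainder 0, so it would collide with 77
print(remainder_hash(44, table_size) == slots[77])  # True
```

These six items happen to land in distinct slots, but any other multiple of 11 would collide with 77, which is why a collision strategy is still needed.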


### Folding Method
* expands on the remainder function
* Steps:
    1. split the value into equal-sized segments (last piece may be smaller), e.g. 123-456-7890 -> [12,34,56,78,90], segment_size = 2
    2. Add up the segments to get the hash value, e.g. 12+34+56+78+90 = 270
    3. Take the remainder of the sum after dividing by the table size, e.g. 270 % 11 = 6


```python
# Hash Folding

def foldingHash(item, table_size):
    """
    Comes up with a hash value for social security numbers in a table of certain size
    Args:
        item: (str) social security number
        table_size: (int) number of slots
    Returns:
        (int): index in table to store item
    """
    item = "".join(item.split('-'))
    segment_len = 2
    segments = [int(item[i:i+segment_len]) for i in range(0,len(item),segment_len)]
    segment_sum = sum(segments)
    return segment_sum % table_size

print(foldingHash('123-45-6789', 11))  # prints 2
```


### Mid-Square Method
* another method that expands on the remainder function

* Steps:
    1. Square the value, e.g. for value 44, $44^2 = 1936$
    2. Extract the middle digits of the squared value, e.g. 1936 -> 93
    3. Take the remainder after dividing by the table size, e.g. 93 % 11 = 5

```python
# Hash Midsquare

def midSquareHash(item, table_size):
    """
    Create hash value for item using the mid-square method
    Args:
        item: (int)
        table_size: (int) size of table
    Return:
        (int): hash value
    """
    square = str(int(item)**2)
    mid_point = len(square) // 2
    mid_square = square[mid_point-1:mid_point+1] if len(square)%2 == 0 else square[mid_point]
    hash_value = int(mid_square)%table_size
    return hash_value

# each check prints True
print(midSquareHash(54,11) == 3)
print(midSquareHash(26,11) == 7)
print(midSquareHash(93,11) == 9)
print(midSquareHash(17,11) == 8)
print(midSquareHash(77,11) == 4)
print(midSquareHash(31,11) == 6)
```


## Using Ordinal Values To Hash Strings
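* Steps
    1. take the ordinal value of each character in the string
    2. sum them up
    3. return the remainder after dividing the sum by the table size
* Extra step to reduce collisions between anagrams: weight each character by its position, multiplying its ordinal value by its 1-based index before summing. Merely adding the index does not help, because the added indices total the same amount for every string of a given length.

```python
# Hashing Strings with Ordinal Values

def hashStrings(item, table_size):
    """
    Creates a hash value for a string
    Args:
        item: (str) item to hash
        table_size: (int) size of table
    Return:
        (int): hash value
    """
    # multiply each ordinal by its 1-based position so anagrams produce different sums
    ord_list = [ord(ch) * (i + 1) for i, ch in enumerate(item)]
    return sum(ord_list) % table_size

# each check prints True
print(hashStrings("cat", 11) == 3)  # 99*1 + 97*2 + 116*3 = 641, 641 % 11 = 3
print(hashStrings("cat", 11) != hashStrings("act", 11))  # anagrams no longer collide here
```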


## Collision Resolution (Rehashing)
* A systematic way of placing an item when the slot given by its hash value is already occupied

### Methods for Collision Resolution
* Open Addressing: try to find the next open slot in the hash table
    * Linear probing: increment the hash value by 1 until there's an empty slot
        * rehash(pos) = (pos+1)%table_size
        * drawback: tends to cause clustering; if many collisions happen at one hash value, the neighbouring slots fill up while large sections of the table stay empty
    * Linear probing with skips: increment the hash value by a constant greater than 1
        * "plus 3 probe": rehash(pos) = (pos+3)%table_size
        * reduces the clustering seen with plain linear probing by spreading colliding items more evenly
        * table size should be a prime number, so that the skip and the table size share no common factor and every slot in the table is eventually visited
    * Quadratic probing: offset the original hash value by successive perfect squares, i.e. (1,4,9,16...)
        * $rehash(pos, i) = (pos+i^2) \% tableSize$, where pos is the original hash value and $i = 1, 2, 3, ...$ counts the probes
* Chaining: make each slot a linked list, and simply append each item to the linked list at the assigned slot
    * drawback: if many items hash to the same slot, searching that slot degrades to a linear scan of its list, defeating the purpose of hashing
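
The two families above can be sketched as follows. This is a minimal illustration rather than a complete hash table: it assumes integer items hashed with the remainder function, has no deletion or resizing, and uses Python lists in place of linked lists for chaining. The class names `ProbingTable` and `ChainingTable` are hypothetical.

```python
class ProbingTable:
    """Open addressing with a fixed skip (skip=1 is plain linear probing)."""

    def __init__(self, table_size=11, skip=1):
        self.table_size = table_size
        self.skip = skip
        self.slots = [None] * table_size

    def rehash(self, pos):
        return (pos + self.skip) % self.table_size

    def put(self, item):
        """Insert item and return the slot it ended up in."""
        pos = item % self.table_size
        # Give up after table_size probes; with a prime table size this covers every slot
        for _ in range(self.table_size):
            if self.slots[pos] is None or self.slots[pos] == item:
                self.slots[pos] = item
                return pos
            pos = self.rehash(pos)
        raise RuntimeError("no free slot found")

    def contains(self, item):
        pos = item % self.table_size
        for _ in range(self.table_size):
            if self.slots[pos] is None:  # an empty slot ends the probe sequence
                return False
            if self.slots[pos] == item:
                return True
            pos = self.rehash(pos)
        return False


class ChainingTable:
    """Chaining: every slot holds a list of all items hashed to it."""

    def __init__(self, table_size=11):
        self.table_size = table_size
        self.slots = [[] for _ in range(table_size)]

    def put(self, item):
        chain = self.slots[item % self.table_size]
        if item not in chain:
            chain.append(item)

    def contains(self, item):
        return item in self.slots[item % self.table_size]


items = [77, 44, 55, 20]  # 77, 44 and 55 all hash to slot 0
linear, plus3, chained = ProbingTable(skip=1), ProbingTable(skip=3), ChainingTable()

linear_positions = [linear.put(item) for item in items]
plus3_positions = [plus3.put(item) for item in items]
for item in items:
    chained.put(item)

print(linear_positions)  # [0, 1, 2, 9]
print(plus3_positions)   # [0, 3, 6, 9]
print(chained.slots[0])  # [77, 44, 55]
```

With a skip of 1 the three colliding items pile up in adjacent slots, while the "plus 3" probe spreads them out; chaining keeps them all in slot 0 at the cost of a longer list to scan.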

### Comparing Collision Resolution Methods
* depends on the load factor $\lambda = \frac{numItems}{tableSize}$
* table below shows average number of comparisons needed for a search
* In general, probing methods are more efficient when the load factor is lower. Chaining is better when load factors are higher

|Method | Successful Search | Unsuccessful Search |
|-|-|-|
|Linear Probing | $\frac{1}{2}(1+\frac{1}{1-\lambda})$ | $\frac{1}{2}(1+(\frac{1}{1-\lambda})^2)$ |
|Chaining | $1+\frac{\lambda}{2}$ | $\lambda$|
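
To get a feel for how these averages grow with the load factor, the sketch below evaluates them for a few values of $\lambda$. It assumes the standard expressions: $\frac{1}{2}(1+\frac{1}{1-\lambda})$ and $\frac{1}{2}(1+(\frac{1}{1-\lambda})^2)$ for linear probing, and $1+\frac{\lambda}{2}$ and $\lambda$ for chaining. The function names are hypothetical.

```python
def linear_probing_comparisons(load_factor):
    """Average comparisons (successful, unsuccessful) for linear probing."""
    ratio = 1 / (1 - load_factor)  # only defined for load_factor < 1
    successful = 0.5 * (1 + ratio)
    unsuccessful = 0.5 * (1 + ratio ** 2)
    return successful, unsuccessful


def chaining_comparisons(load_factor):
    """Average comparisons (successful, unsuccessful) for chaining."""
    return 1 + load_factor / 2, load_factor


for lam in (0.25, 0.5, 0.75, 0.9):
    lp_hit, lp_miss = linear_probing_comparisons(lam)
    ch_hit, ch_miss = chaining_comparisons(lam)
    print(f"lambda={lam}: probing {lp_hit:.2f}/{lp_miss:.2f}, chaining {ch_hit:.2f}/{ch_miss:.2f}")
```

At $\lambda = 0.5$ this gives 1.5 and 2.5 comparisons for linear probing against 1.25 and 0.5 for chaining. The figures count comparisons only and leave out other costs, such as the extra memory chaining needs for its lists.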
    


## Using Hash Map To Solve Problems

[Finding Intersection of Two Arrays](6-Search&Sort/x.Exercises.easy/ArrayIntersection.ipynb)
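
As a rough sketch of how a hash-based structure helps with this kind of problem, the example below finds the values common to two lists. The linked notebook is not reproduced here, so the function name `array_intersection` and the choice to report each common value once are assumptions. It uses a Python `set`, which is hash-based like `dict`, so each membership test takes O(1) time on average.

```python
def array_intersection(first, second):
    """Return the values that appear in both lists, each reported once."""
    seen = set(first)  # hash-based lookup structure built from the first list
    result = []
    for value in second:
        if value in seen:
            result.append(value)
            seen.discard(value)  # avoid reporting the same value twice
    return result


print(array_intersection([1, 2, 2, 3], [2, 3, 4, 2]))  # [2, 3]
```

Building the set takes one pass over the first list and the lookup loop takes one pass over the second, instead of comparing every pair of elements.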