# Maps and Hashing
#### Last modified on: July 14, 2019
#### Author: Emma Teng

## Sets and Maps

- A set has no order, while a list is ordered.
- A map is a set-based data structure, like a dictionary.
    
    Map = <Key, Value>
- The keys of a map form a set, so every key is unique.
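
As a small illustration of these points, the sketch below uses Python's built-in `set` and `dict`. The variable names are made up for this example.

```python
# A list keeps duplicates; a set keeps only unique items.
cities_list = ['Cairo', 'Atlanta', 'Cairo']
cities_set = set(cities_list)
print(len(cities_list))  # 3
print(len(cities_set))   # 2

# A dictionary maps each unique key to a value.
city_by_country = {'Egypt': 'Cairo'}
city_by_country['Egypt'] = 'Alexandria'  # same key: the value is overwritten
print(len(city_by_country))              # 1
print(city_by_country['Egypt'])          # Alexandria
```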

In [1]:
locations = {'North America': {'USA': ['Mountain View']}}
locations['North America']['USA'].append('Atlanta')
locations['Asia'] = {'India': ['Bangalore']}
locations['Asia']['China'] = ['Shanghai']
locations['Africa'] = {'Egypt': ['Cairo']}

# Part 1: print the cities in the USA in alphabetical order
usa_sorted = sorted(locations['North America']['USA'])
print(1)
for city in usa_sorted:
    print(city)

# Part 2: print "city - country" for every Asian country, in alphabetical order
asia_cities = []
print(2)
for country, cities in locations['Asia'].items():
    city_country = cities[0] + " - " + country
    asia_cities.append(city_country)
asia_sorted = sorted(asia_cities)
for city in asia_sorted:
    print(city)

1
Atlanta
Mountain View
2
Bangalore - India
Shanghai - China


## Hashing

Using a hash function allows us to do lookups in constant time.

Hash functions:
> Value ----Hash Function---> Hash value

We pass a value to a hash function, which produces a hash code that we convert into an array index. We can then go to that index and retrieve the original value in constant time, because indexing into an array is a constant-time operation.

For example, we can take the last two digits of a big number, divide them by 10, and use the remainder as the hash value:

> '12345' -> 45%10 -> 5

> '12346' -> 46%10 -> 6

> '12347' -> 47%10 -> 7
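
The example above can be sketched in a few lines of Python. The function name `simple_hash` and the `slots` array are assumptions made for illustration; the hash function itself is the "last two digits modulo 10" rule from the text.

```python
def simple_hash(number_string):
    """Take the last two digits and return the remainder after dividing by 10."""
    last_two_digits = int(number_string[-2:])
    return last_two_digits % 10


# One slot per possible hash value (0 through 9).
slots = [None] * 10

# Store each value at the index given by its hash value.
for value in ['12345', '12346', '12347']:
    slots[simple_hash(value)] = value

print(simple_hash('12345'))         # 5
print(slots[simple_hash('12346')])  # 12346
```

The lookup on the last line computes the index directly, without scanning the array.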


## Collisions

There are times when a hash function will produce the same hash value for two different inputs, such as '0123456' -> 56%10 = 6 and '6543216' -> 16%10 = 6. This situation is called a **collision**.

There are two main ways to fix collisions:
- Change the value used in the hash function, or change the hash function completely, so that we have more than enough slots to store all of our potential values. For example, use 12 instead of 10 as the divisor.

- Keep the original hash function but change the structure of the array. For example, instead of storing one value in each slot, we can store a list or bucket that contains all the values hashed to that slot.

As a result:
- Lookups are still $O(1)$, but if we change the hash function every time there is a collision, we have to move all of our data to a new array, which costs extra time and space.

- With the bucket approach, we still need to iterate through a collection, though a shorter one. A hash table therefore has constant lookup time in the average case, but in the worst case lookup is $O(m)$, where $m$ is the number of values in the bucket.

Which of the two methods to use depends on the situation. We can also apply a second hash function inside a large bucket to split it up further.
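
A minimal sketch of both approaches, reusing the two colliding inputs from this section. The helper names `last_two_mod` and `contains` are hypothetical and only serve the illustration.

```python
def last_two_mod(number_string, divisor):
    """Hash a numeric string: last two digits modulo the given divisor."""
    return int(number_string[-2:]) % divisor


a, b = '0123456', '6543216'

# With divisor 10, both inputs land in slot 6: a collision.
print(last_two_mod(a, 10), last_two_mod(b, 10))  # 6 6

# Approach 1: change the divisor (12 instead of 10), which needs 12 slots.
print(last_two_mod(a, 12), last_two_mod(b, 12))  # 8 4

# Approach 2: keep divisor 10, but let every slot hold a bucket (a list).
buckets = [[] for _ in range(10)]
for value in (a, b):
    buckets[last_two_mod(value, 10)].append(value)


def contains(value):
    """Look up a value by scanning only the bucket it hashes to."""
    return value in buckets[last_two_mod(value, 10)]


print(buckets[6])           # ['0123456', '6543216']
print(contains('6543216'))  # True
```

Changing the divisor only removes this particular collision; other inputs can still collide under the new function.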

## Hash Maps

> <Key, Value> ---- Hash Function on Key ------> Hash Value: <K, V>

`If only we had a function that could give us arrays indices for any key value that we gave it!`

**Hash functions for strings**

We can treat `abcd` as $$a * p^0 + b * p^1 + c * p^2 + d * p^3$$
where each letter stands for its character code and $p$ is a prime number. We use prime numbers because they provide a good distribution. The most common prime numbers used for this function are 31 and 37.

In [16]:
class HashMap:
    
    def __init__(self, initial_size=10):
        self.bucket_array = [None for _ in range(initial_size)]
        self.p = 37
        self.num_entries = 0
        
    def put(self, key, value):
        pass
    
    def get(self, key):
        pass
    
    def get_bucket_index(self, key):
        return self.get_hash_code(key)
    
    def get_hash_code(self, key):
        key = str(key)
        num_buckets = len(self.bucket_array)
        current_coefficient = 1
        hash_code = 0
        for character in key:
            hash_code += ord(character) * current_coefficient
            current_coefficient *= self.p

        return hash_code

In [17]:
hash_map = HashMap()

bucket_index = hash_map.get_bucket_index("abcd")
print(bucket_index)

5204554


**Compression Function**

But these values are huge, and we cannot create arrays that large. A simple and effective compression function is `mod len(array)`. The modulo operator `%` returns the remainder when one number is divided by another.

In [19]:
class HashMap:
    
    def __init__(self, initial_size = 10):
        self.bucket_array = [None for _ in range(initial_size)]
        self.p = 31
        self.num_entries = 0
        
    def put(self, key, value):
        pass
    
    def get(self, key):
        pass
        
    def get_bucket_index(self, key):
        bucket_index = self.get_hash_code(key)
        return bucket_index
    
    def get_hash_code(self, key):
        key = str(key)
        num_buckets = len(self.bucket_array)
        current_coefficient = 1
        hash_code = 0
        for character in key: 
            # 3 compressions
            hash_code += ord(character) * current_coefficient
            hash_code = hash_code % num_buckets                       # compress hash_code
            current_coefficient *= self.p
            current_coefficient = current_coefficient % num_buckets   # compress coefficient

        return hash_code % num_buckets                                # one last compression before returning
    
    
    def size(self):
        return self.num_entries
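
The three compressions in `get_hash_code` work because taking the remainder at every step gives the same result as taking it once at the end, while keeping the intermediate numbers small. The sketch below checks this with two standalone functions; `polynomial_hash` and `compressed_hash` are names chosen for this example, and `p = 31` matches the class above.

```python
def polynomial_hash(key, p=31):
    """Uncompressed hash: sum of ord(character) * p**position."""
    hash_code = 0
    coefficient = 1
    for character in key:
        hash_code += ord(character) * coefficient
        coefficient *= p
    return hash_code


def compressed_hash(key, num_buckets, p=31):
    """Same hash, reduced modulo num_buckets after every step."""
    hash_code = 0
    coefficient = 1
    for character in key:
        hash_code = (hash_code + ord(character) * coefficient) % num_buckets
        coefficient = (coefficient * p) % num_buckets
    return hash_code


full = polynomial_hash("abcd")
print(full)                         # 3077374
print(full % 10)                    # 4
print(compressed_hash("abcd", 10))  # 4
```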

**Collision Handling**

There are two popular ways in which we handle collisions.

1. Closed addressing, or separate chaining. Closed addressing is a technique where we use the same bucket to store multiple objects. The bucket in this case stores a linked list of key-value pairs. Every bucket has its own separate chain of linked list nodes.
2. Open addressing. We do the following:
    * If, after getting the bucket index, the bucket is empty, we store the object in that particular bucket.
    
    * If the bucket is not empty, we find an alternate bucket index by using another function which modifies the current hash code to give a new code.
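
As a rough sketch of open addressing, the class below uses linear probing: when a bucket is taken, it tries the next index. Linear probing is only one common choice for the "alternate bucket" function and is an assumption here, as are the class name `OpenAddressingMap`, the use of Python's built-in `hash`, and the lack of resizing or deletion.

```python
class OpenAddressingMap:
    """Hypothetical fixed-size map using linear probing."""

    def __init__(self, size=10):
        self.slots = [None] * size  # each slot holds None or a (key, value) tuple

    def _probe(self, key):
        """Yield candidate indices, starting at the key's home bucket."""
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None or slot[0] == key:  # empty bucket, or same key: store here
                self.slots[index] = (key, value)
                return
        raise RuntimeError("map is full")

    def get(self, key):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None:  # reached an empty bucket: key is absent
                return None
            if slot[0] == key:
                return slot[1]
        return None


probing_map = OpenAddressingMap()
probing_map.put(6, 'first')    # home bucket 6
probing_map.put(16, 'second')  # also hashes to 6, so it moves on to bucket 7
print(probing_map.get(16))     # second
```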
   
Separate chaining is a simple and effective technique to handle collisions and that is what we discuss here.
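
One possible way to fill in `put` and `get` with separate chaining is sketched below. This is an illustration under stated assumptions, not a definitive implementation: the `LinkedListNode` and `ChainedHashMap` names are hypothetical, new nodes are added at the head of the chain, and the array is never resized.

```python
class LinkedListNode:
    """One key-value pair in a bucket's chain (hypothetical helper)."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None


class ChainedHashMap:
    def __init__(self, initial_size=10):
        self.bucket_array = [None for _ in range(initial_size)]
        self.p = 31
        self.num_entries = 0

    def get_bucket_index(self, key):
        # Polynomial string hash, compressed at every step.
        num_buckets = len(self.bucket_array)
        current_coefficient = 1
        hash_code = 0
        for character in str(key):
            hash_code = (hash_code + ord(character) * current_coefficient) % num_buckets
            current_coefficient = (current_coefficient * self.p) % num_buckets
        return hash_code

    def put(self, key, value):
        bucket_index = self.get_bucket_index(key)
        node = self.bucket_array[bucket_index]
        while node is not None:  # key already in the chain: update its value
            if node.key == key:
                node.value = value
                return
            node = node.next
        new_node = LinkedListNode(key, value)  # otherwise add a node at the head
        new_node.next = self.bucket_array[bucket_index]
        self.bucket_array[bucket_index] = new_node
        self.num_entries += 1

    def get(self, key):
        node = self.bucket_array[self.get_bucket_index(key)]
        while node is not None:  # walk the chain for this bucket only
            if node.key == key:
                return node.value
            node = node.next
        return None

    def size(self):
        return self.num_entries


chained_map = ChainedHashMap()
chained_map.put("ab", 1)
chained_map.put("ba", 2)  # with 10 buckets and p = 31, "ab" and "ba" share bucket 5
print(chained_map.get("ab"), chained_map.get("ba"))  # 1 2
print(chained_map.size())                            # 2
```

Both keys end up in the same bucket, yet each lookup returns the right value because `get` compares keys while walking the chain.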