# Data Structures and Algorithms - Hash Tables

* Built-in hash table - in Python this is the dictionary (a dictionary is made up of key-value pairs)
  * We need a hash function (or hash method), and we perform a hash on the key. We take the key and run it through the hash; in addition to getting our key-value pair back, we get an address. That address is where we store the key-value pair.

* Two characteristics of a hash function are important:
  1. It is one-way.
      * If we run nails through the hash and get the number two, we cannot take the two, put it through the hash, and have it produce nails.
  1. It is deterministic (for a particular hash function, the same input always produces the same output).
      * Every time we put nails in, we expect to get the number two.
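
Both properties can be checked with a small stand-alone function. This is only a sketch: `simple_hash` is a hypothetical name, and its body borrows the character-sum approach used by the `__hash` method later in these notes, so the address it produces for nails is not necessarily the 2 used in the illustration.

```python
def simple_hash(key, size=7):
    """Map a string key to an address in the range 0..size-1."""
    address = 0
    for letter in key:
        address = (address + ord(letter) * 23) % size
    return address


first = simple_hash("nails")
second = simple_hash("nails")
print(first == second)        # True: same key, same address (deterministic)
print(0 <= first < 7)         # True: the address always fits the table

# One-way: different keys can share an address, so the address alone
# cannot tell us which key produced it.
print(simple_hash("bolts") == simple_hash("washers"))   # True
```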

* Build a hash table - We create our own address space by creating a list, and then we create methods to work with it. To set an item, for example, we pass in a key and a value.
  ![hash table store - 1](./NotesImages/HashTable_store1.png)
  ![hash table store - 2](./NotesImages/HashTable_store2.png)
  ![hash table store - 3](./NotesImages/HashTable_store3.png)
  ![hash table store - 4](./NotesImages/HashTable_store4.png)
  ![hash table store - 5](./NotesImages/HashTable_store5.png)

* Collision
  * A collision happens when you put a key-value pair at an address where there is already a key-value pair.
  ![hash table collision](./NotesImages/HashTable_collision.png)
  * This technique of dealing with collisions, where you just put them at the same address, is called **separate chaining**.
    * For a key with a given number of letters, calculating the hash takes the same number of operations no matter how many items are in the table. That means the hash method itself is O(1). The worst possible scenario is that all of the items end up at the same address and we have to iterate through all of them, which is O(n).
    * In practice collisions are fairly rare, so we treat hash tables (which is how Python implements dictionaries) as O(1): it is O(1) to place a key-value pair or to look up something by key.
    * Two ways to store the pairs that share an address:
    1. nested list
  ![hash table separate chaining way 1](./NotesImages/HashTable_separatechaining1.png)
    1. linked list
  ![hash table separate chaining way 2](./NotesImages/HashTable_separatechaining2.png)
  * Another popular way of dealing with collisions: if there is already a key-value pair at the address a key maps to, you move down the table until you find an empty address and store the pair there. If another pair collides, you again keep going until you find an empty spot. This is called **linear probing**. It is a form of open addressing; there are many ways to do open addressing and linear probing is just one of them. With open addressing you never have more than one key-value pair at any address.
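
Separate chaining with a linked list (the second of the two storage options pictured above) could look roughly like this. It is a sketch under assumptions, not the course's implementation: `Node` and `ChainedHashTable` are hypothetical names, the hash is the same character-sum idea used elsewhere in these notes, and setting an existing key updates it in place.

```python
class Node:
    """One key-value pair in the chain stored at an address."""
    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashTable:
    def __init__(self, size=7):
        # Each slot holds the head node of a chain, or None if empty.
        self.buckets = [None] * size

    def _hash(self, key):
        total = 0
        for letter in key:
            total = (total + ord(letter) * 23) % len(self.buckets)
        return total

    def set_item(self, key, value):
        index = self._hash(key)
        node = self.buckets[index]
        while node is not None:
            if node.key == key:        # key already present: update in place
                node.value = value
                return
            node = node.next
        # New key: push a node onto the front of the chain.
        self.buckets[index] = Node(key, value, self.buckets[index])

    def get_item(self, key):
        node = self.buckets[self._hash(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return None


table = ChainedHashTable()
table.set_item("bolts", 1400)
table.set_item("washers", 50)     # hashes to the same address as "bolts"
print(table.get_item("bolts"))    # 1400
print(table.get_item("washers"))  # 50
print(table.get_item("nails"))    # None
```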
    
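
Linear probing as described above can be sketched as follows. The class name `ProbingHashTable` and its attribute names are assumptions made for illustration, and the sketch leaves out deletion, which needs extra care in an open-addressing table.

```python
class ProbingHashTable:
    def __init__(self, size=7):
        # Each slot holds exactly one (key, value) pair, or None if empty.
        self.slots = [None] * size

    def _hash(self, key):
        total = 0
        for letter in key:
            total = (total + ord(letter) * 23) % len(self.slots)
        return total

    def set_item(self, key, value):
        start = self._hash(key)
        for step in range(len(self.slots)):
            index = (start + step) % len(self.slots)   # wrap around at the end
            if self.slots[index] is None or self.slots[index][0] == key:
                self.slots[index] = (key, value)
                return
        raise RuntimeError("hash table is full")

    def get_item(self, key):
        start = self._hash(key)
        for step in range(len(self.slots)):
            index = (start + step) % len(self.slots)
            if self.slots[index] is None:      # empty slot: key was never stored
                return None
            if self.slots[index][0] == key:
                return self.slots[index][1]
        return None


table = ProbingHashTable()
table.set_item("bolts", 1400)     # hashes to address 4
table.set_item("washers", 50)     # also hashes to 4, so it moves down to 5
print(table.slots[4])             # ('bolts', 1400)
print(table.slots[5])             # ('washers', 50)
print(table.get_item("washers"))  # 50
```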
* One more point about hash tables: **you should always have a prime number of addresses**. With a modulo-based hash like the one in these notes, a prime table size spreads the key-value pairs more evenly across the addresses, because patterns in the keys are less likely to share a factor with the table size. That reduces collisions.
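
The effect of the table size can be seen with a small experiment. This is an illustration only: it assumes integer keys reduced with a plain modulo, the keys are deliberately chosen to share a factor with 8, and `buckets_used` is a hypothetical helper name.

```python
def buckets_used(keys, size):
    """Return the set of addresses hit when each key is reduced modulo size."""
    return {key % size for key in keys}


keys = range(0, 56, 4)                 # 14 keys, all multiples of 4

print(sorted(buckets_used(keys, 8)))   # [0, 4]
print(sorted(buckets_used(keys, 7)))   # [0, 1, 2, 3, 4, 5, 6]
```

With 8 addresses, every key lands on one of only two addresses; with 7 addresses, the same keys reach all seven.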

* Both insert and lookup by key in a hash table are O(1) (on average, assuming collisions are rare).
* Even though a hash table is O(1) for insert and lookup, it is **not** always better than a Binary Search Tree. (Binary Search Trees are sorted, which makes them better at searching for all values that fall within a range.)
* Lookup by key is O(1), but lookup by value is not: finding a value means checking every pair, which is O(n).
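
A short sketch of the last two points, using a plain dictionary. `find_key_by_value` is a hypothetical helper, and the sorted list with `bisect` only stands in for a Binary Search Tree to show why sorted data suits range queries.

```python
import bisect

inventory = {"bolts": 1400, "washers": 50, "lumber": 70}

# Lookup by key: the dictionary hashes the key and goes straight to it.
print(inventory["washers"])               # 50


def find_key_by_value(table, target):
    """Lookup by value: no hash to help, so scan every pair (O(n))."""
    for key, value in table.items():
        if value == target:
            return key
    return None


print(find_key_by_value(inventory, 70))   # lumber

# Range query on sorted data: binary search finds both boundaries
# without checking every item.
sorted_values = sorted(inventory.values())             # [50, 70, 1400]
low = bisect.bisect_left(sorted_values, 60)
high = bisect.bisect_right(sorted_values, 1500)
print(sorted_values[low:high])            # [70, 1400]
```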
  

In [2]:
## build our hash table constructor - nested lists handle collisions
class HashTable:
    def __init__(self, size=7):
        # data_map is the list that serves as our address space.
        # It starts as a list of `size` items (seven by default), all None.
        self.data_map = [None] * size

    def __hash(self, key):
        # The hash takes the key and determines the address where we store that key-value pair.
        my_hash = 0
        for letter in key:
            # ord(letter): ord is short for ordinal. It returns the Unicode code point of
            # the character, which equals its ASCII number for ASCII letters.
            # * 23: multiplying by a prime number helps with the randomness of the hash.
            # Any prime number could be plugged in here.
            # % len(self.data_map): modulo gives the remainder of a division, so the result
            # never goes out of bounds. Dividing any number by seven leaves a remainder
            # from 0 to 6, which is exactly our address space.
            my_hash = (my_hash + ord(letter) * 23) % len(self.data_map)
        return my_hash
    
    def print_table(self):
        for i, val in enumerate(self.data_map):
            print(i, ": ", val)

    # set_item uses the hash method on the key to compute the address.
    def set_item(self, key, value):
        # First, figure out the address where the key-value pair will be stored.
        index = self.__hash(key)
        # Initialize an empty list at that address if nothing is stored there yet.
        if self.data_map[index] is None:
            self.data_map[index] = []
        # Append [key, value] to the list at that address; the list is what handles collisions.
        # Note: setting a key that already exists appends a second pair rather than updating it.
        self.data_map[index].append([key, value])

    def get_item(self, key):
        index = self.__hash(key)
        if self.data_map[index] is not None:
            # Walk the pairs stored at this address and return the value whose key matches.
            for stored_key, stored_value in self.data_map[index]:
                if stored_key == key:
                    return stored_value
        return None

    # Collect every key in the hash table into a list and return that list.
    def keys(self):
        all_keys = []
        for bucket in self.data_map:
            if bucket is not None:
                for key, _value in bucket:
                    all_keys.append(key)
        return all_keys


my_hash_table = HashTable()
my_hash_table.print_table()

## set some items
my_hash_table.set_item('bolts', 1400)
my_hash_table.set_item('washers', 50)
my_hash_table.set_item('lumber', 70)
my_hash_table.print_table()

## get some items
print(my_hash_table.get_item('bolts'))
print(my_hash_table.get_item('washers'))
print(my_hash_table.get_item('lumber'))

## get all keys
print(my_hash_table.keys())


0 :  None
1 :  None
2 :  None
3 :  None
4 :  None
5 :  None
6 :  None
0 :  None
1 :  None
2 :  None
3 :  None
4 :  [['bolts', 1400], ['washers', 50]]
5 :  None
6 :  [['lumber', 70]]
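
One thing the `set_item` method above does not do is update a key that is already in the table; it appends a second pair instead. A possible variation that overwrites the existing value is sketched below. It is written as stand-alone functions so it can run on its own; `address_for` and `set_or_update` are hypothetical names, not part of the class above.

```python
def address_for(key, size):
    """Same character-sum hash idea as the class above, as a plain function."""
    total = 0
    for letter in key:
        total = (total + ord(letter) * 23) % size
    return total


def set_or_update(data_map, key, value):
    index = address_for(key, len(data_map))
    if data_map[index] is None:
        data_map[index] = []
    for pair in data_map[index]:
        if pair[0] == key:          # key already stored: overwrite its value
            pair[1] = value
            return
    data_map[index].append([key, value])


data_map = [None] * 7
set_or_update(data_map, "bolts", 1400)
set_or_update(data_map, "washers", 50)
set_or_update(data_map, "bolts", 200)   # updates instead of appending
print(data_map[4])                      # [['bolts', 200], ['washers', 50]]
```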


* Interview question
  * We start with two lists and want to determine whether they have an item in common.
  ![Interview question - 2 list compare](./NotesImages/HashTable_interviewquestion.png)
    1. First approach (the inefficient way) - The obvious approach, what we would call the naive approach, is to use nested for loops.
        * One for loop goes through the first list, and a second one compares every item in the second list to the current item. We take the first item of the first list and check it against each item of the second list; if there is no match, we move the outer loop over one and go through the second list again. We repeat this until we finally find a match. Because these are nested for loops, this is O(n^2).
    1. Second approach (recommended) - use a dictionary
    ![Interview question - 2 list compare - hash tables](./NotesImages/HashTable_interviewquestion2.png)
        * We use a dictionary, since this is the hash table section. First we loop through the first list: we put the 1 into the dictionary as a key with the value True, then do the same for the 3 and the 5, which gives us three key-value pairs. Looping through that first list is O(n). Next we loop through the second list and check each item against the dictionary: is the 2 in the dictionary? Looking up an item in a dictionary by key is O(1). We do the same for the 4 and then for the 5, which gives us a match. We went through each list once, which you could call O(2n), but we drop the constants, so it is O(n), and O(n) is far more efficient than O(n^2).
      ![Interview question - 2 list compare - Big O](./NotesImages/HashTable_interviewquestion3.png)


In [None]:
# Compare 2 lists - use nested for loops, O(n^2)
def item_in_common(list1, list2):
    for i in list1:
        for j in list2:
            if i == j:
                return True
    return False

list1 = [1,2,3,4,5]
list2 = [6,7,8,9,10,3]

print(item_in_common(list1, list2))  # True

In [4]:
# Compare 2 lists - use a hash table (dictionary), O(n)
def item_in_common2(list1, list2):
    my_dict = {}
    for i in list1:
        my_dict[i] = True
    print(my_dict)

    for j in list2:
        if j in my_dict:
            return True
    return False


list1 = [1,2,3,4,5]
list2 = [6,7,8,9,10,3]
print(item_in_common2(list1, list2))  # True


{1: True, 2: True, 3: True, 4: True, 5: True}
True
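
Since the dictionary values are never read, the same idea can be written with a set, which is also hash-based in Python. This is a variation on the approach above, not a different algorithm; `item_in_common3` is a hypothetical name, and it assumes the list items are hashable.

```python
def item_in_common3(list1, list2):
    """Return True if the two lists share at least one item."""
    seen = set(list1)              # O(n) to build
    for item in list2:
        if item in seen:           # O(1) average membership test
            return True
    return False


print(item_in_common3([1, 3, 5], [2, 4, 5]))   # True
print(item_in_common3([1, 3, 5], [2, 4, 6]))   # False
```

The built-in `set.isdisjoint` expresses the same check in one line: `not set(list1).isdisjoint(list2)`.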
