# Hashing

## TODO - Updates

In [1]:
#include <string>
#include <sstream>
#include <iostream>
#include <vector>
#include <list>
#include <algorithm>  // find
using namespace std;

## The Table

In [2]:
string table[10];

In [3]:
void reset_table() {
    for (int i = 0; i < 10; i++) {
        table[i] = "";
    }    
}

In [4]:
void print_table() {
    for (int i = 0; i < 10; i++) {
        cout << i << ": " << table[i] << endl;
    }
}

In [5]:
reset_table();
print_table()

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


## The Hash Function

In [6]:
int string_hash(string const& value) {
    int result = 0;
    for (auto c : value) {
        result += (int)c;
    }
    return result;
}

In [7]:
int another_hash(string const& item) {
    return int(item[0]);
}

In [8]:
string_hash("foobar")

633

In [9]:
another_hash("foobar")

102

In [10]:
string_hash("I love cs235!")

976

In [11]:
another_hash("I love cs235!")

73

## Table + Hash Function

In [12]:
void add_item(string const& item, int (*hash)(string const&)) {
    int pos = hash(item) % 10;
    table[pos] = item;
}

In [13]:
reset_table();
add_item("foobar", &another_hash);
add_item("bazquux", &another_hash);
add_item("win!", &another_hash);
print_table()

0: 
1: 
2: foobar
3: 
4: 
5: 
6: 
7: 
8: bazquux
9: win!


In [14]:
reset_table();
add_item("foobar", &string_hash);
add_item("foobar", &string_hash);
add_item("bazquux", &string_hash);
add_item("win!", &string_hash);
print_table()

0: 
1: 
2: 
3: foobar
4: bazquux
5: 
6: 
7: win!
8: 
9: 


In [15]:
bool has_item(string const& item, int (*hash)(string const&)) {
    return table[hash(item) % 10] == item;
}

In [16]:
has_item("foobar", &string_hash)

true

In [17]:
has_item("frobnicate", &string_hash)

false

In [18]:
void remove_item(string const& item, int (*hash)(string const&)) {
    table[hash(item) % 10] = "";
}

In [19]:
remove_item("foobar", &string_hash);
print_table()

0: 
1: 
2: 
3: 
4: bazquux
5: 
6: 
7: win!
8: 
9: 


## Introducing: The HashTable

- A **hash function** converts a value into an integer
- A **hash table** uses a hash function to determine the location in which to store the value

What is the big-O complexity to add, remove, or look up a value?

- The time it takes to convert a value into an index is $O(1)$
- Adding, removing, or looking up a value then takes only a constant number of additional operations.

$O(1)$!

<div style='font-size: 200pt'> 💪🏻 </div>

## Hash Functions: *Revisited*

In [20]:
int hash_7(string const& value) {
    return 7;
}

In [21]:
hash_7("foo")

7

In [22]:
hash_7("bar")

7

In [23]:
int rand_hash(string const& value) {
    return rand();
}

In [24]:
rand_hash("bar")

1721727823

In [30]:
rand_hash("bar")

231436237

### Hash Function Qualities

The choice of hash function matters. What kind of function do we want?

- **Determinism**: the same value will ALWAYS yield the same hashcode
  - No `rand` in the hash function!

- **Efficiency**: the hashcode can be computed quickly.
  - If it takes longer to compute the hashcode than to insert into a BST, that's no good.

- **Defined range**: the hashcodes cover the full range of indices the table needs
  - If my array is 1000 slots long, but my hash function only produces values in 0..10, that's no good.

- **Uniformity**: the hashcodes are uniformly distributed across the full possible space
  - If my hash function tends to output even numbers but not odd numbers, that's no good.
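
To make the uniformity point concrete, here is a minimal sketch of a polynomial string hash. The name `poly_hash` and the multiplier 31 are assumptions chosen for illustration; they are not part of the notes above. Unlike `string_hash`, which sums character codes and so gives every anagram the same hashcode, this version makes the position of each character matter.

```cpp
#include <string>

// Hypothetical example: polynomial hash with an assumed multiplier of 31.
// Unsigned arithmetic is used so that overflow wraps around instead of
// being undefined behavior.
unsigned int poly_hash(std::string const& value) {
    unsigned int result = 0;
    for (unsigned char c : value) {
        result = result * 31u + c;  // earlier characters get multiplied more often
    }
    return result;
}
```

With this sketch, `poly_hash("foo")` is 101574 and `poly_hash("oof")` is 110214. Distinct hashcodes can still share a slot once they are reduced modulo the table size (these two both end in 4), so a better hash function lowers the chance of two values sharing a slot but does not rule it out.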

## Hash Tables: *Revisited*

In [31]:
reset_table();
print_table();

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


In [32]:
add_item("foo", &string_hash);
add_item("oof", &string_hash);
print_table();

0: 
1: 
2: 
3: 
4: oof
5: 
6: 
7: 
8: 
9: 


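Both strings hash to slot 4, so `oof` overwrote `foo`. This is a **collision**.

Is it possible to build a hash function that will never produce collisions?

Not in general: there are far more possible values than slots, so some values must share a slot. We will always need to handle collisions.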

How should we do it?

## HashTable Collisions

One strategy is to use "probing".

If the slot an item is assigned is occupied, you follow a deterministic algorithm to find another (hopefully empty) slot. 
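
As a rough illustration, the simplest such algorithm is linear probing: try the next slot, then the next, wrapping around at the end. The sketch below assumes a 10-slot table where the empty string marks an empty slot, as in the earlier examples. The names `probe_hash`, `probe_add`, and `PROBE_SLOTS` are made up for this example.

```cpp
#include <string>

const int PROBE_SLOTS = 10;  // assumed table size, matching the earlier examples

// Same character-sum hash as string_hash above, repeated so this block stands alone.
int probe_hash(std::string const& value) {
    int result = 0;
    for (unsigned char c : value) { result += c; }
    return result;
}

// Linear probing: start at the hashed slot and step forward one slot at a time
// (wrapping around) until an empty slot or the item itself is found.
// Returns the slot used, or -1 if every slot is taken by other items.
int probe_add(std::string slots[], std::string const& item) {
    int start = probe_hash(item) % PROBE_SLOTS;
    for (int step = 0; step < PROBE_SLOTS; step++) {
        int pos = (start + step) % PROBE_SLOTS;
        if (slots[pos].empty() || slots[pos] == item) {
            slots[pos] = item;
            return pos;
        }
    }
    return -1;
}
```

Adding `"foo"` and then `"oof"` to an empty table places them in slots 4 and 5. Lookup has to walk the same sequence of slots, and removal needs extra care: simply emptying a slot would make later lookups stop too early.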

This gets complicated. Don't use probing.

### Chaining

Instead of storing the items directly, each slot stores a list of items. 

First determine the slot an item should go in, then search the list in that slot. 

As long as the number of items assigned to the same slot stays small, the performance doesn't degrade.

When the number of items gets closer to the capacity of the array, it's time to grow the array.

```
0: foo, quux
1:
2: bar
3: baz, zip
4: 
5:
6: win
7: win!
8: cs235
9: abc
```
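
A minimal sketch of a chained table of strings might look like the following. The class name `ChainedSet`, the fixed size of 10 buckets, and the character-sum hash are assumptions carried over from the earlier examples; growing is left out here.

```cpp
#include <algorithm>  // find
#include <list>
#include <string>
#include <vector>

// Hypothetical class for illustration: each slot holds a list of items.
class ChainedSet {
    std::vector<std::list<std::string>> _buckets;

    // Same character-sum hash as string_hash above.
    static int _hash(std::string const& value) {
        int result = 0;
        for (unsigned char c : value) { result += c; }
        return result;
    }

    // First determine the slot, then hand back the list stored there.
    std::list<std::string>& _get_bucket(std::string const& item) {
        return _buckets[_hash(item) % _buckets.size()];
    }

public:
    ChainedSet() : _buckets(10) {}

    bool has(std::string const& item) {
        auto& bucket = _get_bucket(item);
        return std::find(bucket.begin(), bucket.end(), item) != bucket.end();
    }

    // Returns false if the item was already present.
    bool add(std::string const& item) {
        if (has(item)) { return false; }
        _get_bucket(item).push_back(item);
        return true;
    }

    // Returns false if the item was not present.
    bool remove(std::string const& item) {
        if (!has(item)) { return false; }
        _get_bucket(item).remove(item);
        return true;
    }
};
```

With this sketch, `"foo"` and `"oof"` both live in the list at slot 4, and removing one leaves the other in place.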

### Growing

- Create a new array
- Re-add each item to the table

Why not simply copy the lists over to the new array? Why do we need to re-add each item individually?

Assume an array size of 10. The hashcodes `1812` and `7502` will end up in the same slot:

In [33]:
1812 % 10

2

In [34]:
7502 % 10

2

But when I increase the array size to 20, these same hashcodes now fall in different slots:

In [35]:
1812 % 20

12

In [36]:
7502 % 20

2
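
The re-adding step could be sketched as follows. This assumes a chained table stored as a vector of lists and a growth factor of two. The names `Buckets`, `sum_hash`, and `grow` are hypothetical.

```cpp
#include <list>
#include <string>
#include <vector>

using Buckets = std::vector<std::list<std::string>>;

// Same character-sum hash as string_hash above, repeated so this block stands alone.
int sum_hash(std::string const& value) {
    int result = 0;
    for (unsigned char c : value) { result += c; }
    return result;
}

// Builds a table with twice as many buckets and re-adds every item,
// recomputing each slot against the new size.
Buckets grow(Buckets const& old_table) {
    Buckets new_table(old_table.size() * 2);
    for (auto const& bucket : old_table) {
        for (auto const& item : bucket) {
            new_table[sum_hash(item) % new_table.size()].push_back(item);
        }
    }
    return new_table;
}
```

For example, `"a"` (hashcode 97) and `"k"` (hashcode 107) share slot 7 in a table of 10 buckets, but after `grow` they sit in slots 17 and 7 of the 20-bucket table.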

## Big O

What is the big-O for add, remove, and contains?

- Computing the position is $O(1)$
- Finding the bucket is $O(1)$
- Assuming the hash function uniformly distributes the data, then the probability that there is a collision will be small
  - You can tune the grow parameter to improve performance
- Growing re-adds all $n$ items, but if the capacity doubles each time, it only happens about once every $n$ insertions, so the amortized complexity is $O(1)$
- All together: $O(1)$

What are the pathological cases for a hashtable?

- All the items end up in the same bucket: $O(n)$
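
To see the pathological case in numbers, the sketch below counts the longest chain that a given hash function produces. The helper `longest_chain` is hypothetical; `constant_hash` behaves like `hash_7` above, and `char_sum_hash` like `string_hash`.

```cpp
#include <algorithm>  // max
#include <cstddef>
#include <string>
#include <vector>

// Behaves like hash_7 above: every value gets the same hashcode.
int constant_hash(std::string const&) { return 7; }

// Behaves like string_hash above: sum of the character codes.
int char_sum_hash(std::string const& value) {
    int result = 0;
    for (unsigned char c : value) { result += c; }
    return result;
}

// Returns the length of the longest chain after assigning every item
// to one of bucket_count buckets with the given hash function.
std::size_t longest_chain(std::vector<std::string> const& items,
                          int (*hash)(std::string const&),
                          std::size_t bucket_count) {
    std::vector<std::size_t> chain_lengths(bucket_count, 0);
    std::size_t longest = 0;
    for (auto const& item : items) {
        std::size_t length = ++chain_lengths[hash(item) % bucket_count];
        longest = std::max(longest, length);
    }
    return longest;
}
```

For the items `"foobar"`, `"bazquux"`, and `"win!"` in 10 buckets, the character-sum hash gives a longest chain of 1, while the constant hash puts all three in one chain, so every lookup has to search the whole list.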

## Iteration order

When you iterate through the values of a hash table, what order do they come out?
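
One way to explore this question is to add a few items and then walk the slots. The sketch below assumes the simple 10-slot table and character-sum hash from the earlier examples. The names `order_hash` and `iterate_after_adding` are made up for illustration.

```cpp
#include <string>
#include <vector>

// Same character-sum hash as string_hash above, repeated so this block stands alone.
int order_hash(std::string const& value) {
    int result = 0;
    for (unsigned char c : value) { result += c; }
    return result;
}

// Adds the items to a 10-slot table (no collision handling, as in the first
// examples), then returns the occupied slots in the order iteration visits them.
std::vector<std::string> iterate_after_adding(std::vector<std::string> const& items) {
    std::string slots[10];
    for (auto const& item : items) {
        slots[order_hash(item) % 10] = item;
    }
    std::vector<std::string> seen;
    for (auto const& slot : slots) {
        if (!slot.empty()) { seen.push_back(slot); }
    }
    return seen;
}
```

Adding `"win!"`, `"foobar"`, `"bazquux"` in that order yields `foobar`, `bazquux`, `win!` (slots 3, 4, 7). In a table like this one, the order follows the hashcodes and the table size, not the order of insertion.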

## Hash Maps

To turn a set into a map, you store key-value pairs instead of just values.

```c++
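#include <functional>  // hash
#include <list>
#include <utility>     // pair
using namespace std;

template<class K, class V>
class HashMap {
    // Assumed for this sketch: fixed capacity, std::hash, no copying or growing
    static constexpr size_t CAPACITY = 10;
    list<pair<K,V>>* _table;

    list<pair<K,V>>& _get_bucket(K const& key) {
        return _table[hash<K>{}(key) % CAPACITY];
    }

    public:
    HashMap() : _table(new list<pair<K,V>>[CAPACITY]) {}
    ~HashMap() { delete[] _table; }

    // Returns true if the pair was added, false if the key was already present
    bool insert(K const& key, V const& value) {
        // Use only the key to find the bucket
        auto& bucket = _get_bucket(key);

        // If a pair with this key is already in the bucket, don't add another
        for (auto& pair : bucket) {
            if (pair.first == key) { return false; }
        }
        bucket.push_back(pair<K,V>(key, value));
        return true;
    }

    V& operator[](K const& key) {
        auto& bucket = _get_bucket(key);
        for (auto& pair : bucket) {
            if (pair.first == key) { return pair.second; }
        }
        // Didn't find it. Make a new pair with a default value.
        bucket.push_back(pair<K,V>(key, V()));
        return bucket.back().second;
    }
};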
```

- Are all the references necessary?
  - What happens if the bucket is not a reference?
  - What happens if the pair is not a reference?
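
A small experiment can help answer the bucket question. The two hypothetical functions below differ only in whether `bucket` is declared with `auto` or `auto&`; the table is assumed to be a vector of lists.

```cpp
#include <list>
#include <string>
#include <vector>

using Table = std::vector<std::list<std::string>>;

// `auto` makes a copy of the list, so the push_back changes only the copy.
void add_via_copy(Table& table, std::string const& item) {
    auto bucket = table[0];
    bucket.push_back(item);
}

// `auto&` refers to the list stored in the table, so the push_back is kept.
void add_via_reference(Table& table, std::string const& item) {
    auto& bucket = table[0];
    bucket.push_back(item);
}
```

After `add_via_copy`, the list in the table is still empty; after `add_via_reference`, it holds the item. The same reasoning applies to the loop variable in `operator[]`: without the reference, the loop would work on a copy of each pair.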

## How to Hash Anything

- https://en.cppreference.com/w/cpp/utility/hash

In [1]:
#include <functional>

In [3]:
std::hash<std::string>{}("hello world!")

6594337730806245023

In [4]:
std::hash<std::string>{}("hello world?")

11656130126939175289

In [5]:
std::hash<int>{}(1234567)

1234567
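
The exact numbers printed above depend on the standard library implementation. For a type of your own, one common approach is to combine the `std::hash` results of its fields. The sketch below uses a hypothetical `Student` type, and the combining formula (multiply by 31, then XOR) is an assumption; other ways of mixing the two values work too.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical type used only for this illustration.
struct Student {
    std::string name;
    int id;

    // Hash containers also need equality to tell colliding items apart.
    bool operator==(Student const& other) const {
        return name == other.name && id == other.id;
    }
};

// Hash functor that combines the std::hash of each field.
struct StudentHash {
    std::size_t operator()(Student const& s) const {
        std::size_t h1 = std::hash<std::string>{}(s.name);
        std::size_t h2 = std::hash<int>{}(s.id);
        return (h1 * 31u) ^ h2;  // assumed mixing step
    }
};

// The functor is passed as the second template argument.
using StudentSet = std::unordered_set<Student, StudentHash>;
```

A `StudentSet` then behaves like any other `std::unordered_set`: inserting the same student twice keeps a single copy.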

## Python Dictionaries

Python uses the term *dictionary* to mean *hashmap*.

However, Python dictionaries have a few additional qualities to note:
- they iterate in **insertion** order
- they use open-addressing (with pseudo-random probing) instead of chaining

How is the insertion order preserved?
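
One way to get this behavior is to keep the entries in a dense array in insertion order and let the hash table store only positions into that array. CPython's dictionary is generally described as using a layout along these lines. The C++ sketch below illustrates the idea only; it is not CPython's implementation. The class name `OrderedMap` is hypothetical, `std::unordered_map` stands in for the index table, and deletion is left out.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical insertion-ordered map: entries live in a vector in insertion
// order, and a separate hash index maps each key to its position.
class OrderedMap {
    std::vector<std::pair<std::string, int>> _entries;
    std::unordered_map<std::string, std::size_t> _index;

public:
    // Inserts or overwrites; an existing key keeps its original position.
    void set(std::string const& key, int value) {
        auto found = _index.find(key);
        if (found != _index.end()) {
            _entries[found->second].second = value;
            return;
        }
        _index[key] = _entries.size();
        _entries.push_back({key, value});
    }

    // Keys in insertion order.
    std::vector<std::string> keys() const {
        std::vector<std::string> result;
        for (auto const& entry : _entries) { result.push_back(entry.first); }
        return result;
    }
};
```

Lookups still go through the hash index in $O(1)$, while iteration simply walks the vector from front to back.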

## Key Ideas

- Hash functions convert a value into an integer
- Hash tables use hash functions to store values in $O(1)$ time
- Hash maps use hash tables to store key-value pairs. 