# Hashing

In [1]:
#include <string>
using std::string;

#include <sstream>
using std::stringstream;

#include <iostream>
using std::cout, std::endl;
cout << std::boolalpha;  // print booleans as "true" and "false" instead of 1 and 0

#include <vector>
using std::vector;

#include <list>
using std::list;

#include <algorithm>
using std::find;

#include <unordered_set>
using std::unordered_set;

#include <functional>
// using std::hash;

## The Table

In [2]:
string table[10];

In [3]:
void reset_table() {
    for (size_t i = 0; i < 10; i++) {
        table[i] = "";
    }    
}

In [4]:
void print_table() {
    for (size_t i = 0; i < 10; i++) {
        cout << i << ": " << table[i] << endl;
    }
}

In [5]:
reset_table();
print_table();

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


## The Hash Function

A **hash function** is any function that turns an input item into an unsigned integer (in C++, a `size_t`).

In [6]:
size_t string_hash(string const& item) {
    return item[0];    
}

In [7]:
cout << string_hash("foobar") << endl;

102


In [8]:
cout << string_hash("I love cs235!") << endl;

73


In [9]:
cout << string_hash("I love BYU!") << endl;

73


## Table + Hash Function

In [None]:
void add_item(string const& item, size_t (*hash)(string const&)) {
    int pos = hash(item) % 10;
    table[pos] = item;
}

<div style='font-size: 150px'>🤔 🤪 🤨 🤓 🫣 😶‍🌫️ 😵‍💫 </div>

In [10]:
typedef size_t (*hasher)(string const&);

In [11]:
void add_item(string const& item, hasher hash) {
    size_t pos = hash(item) % 10;
    table[pos] = item;
}

In [12]:
cout << "hash(foobar): " 
     <<  string_hash("foobar") << endl;

cout << "hash(bazquux): " 
     << string_hash("bazquux") << endl;

cout << "hash(win!): " 
     << string_hash("win!") << endl;


hash(foobar): 102
hash(bazquux): 98
hash(win!): 119


In [13]:
reset_table();
add_item("foobar", string_hash);
add_item("bazquux", string_hash);
add_item("win!", string_hash);

print_table();

0: 
1: 
2: foobar
3: 
4: 
5: 
6: 
7: 
8: bazquux
9: win!


In [14]:
bool has_item(string const& item, hasher hash) {
    return table[hash(item) % 10] == item;
}

In [15]:
cout << has_item("foobar", string_hash) << endl;

true


In [16]:
void remove_item(string const& item, hasher hash) {
    size_t pos = hash(item) % 10;
    if (table[pos] == item) {  // only clear the slot if it really holds this item
        table[pos] = "";
    }
}

In [17]:
remove_item("foobar", string_hash);
print_table();

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: bazquux
9: win!


## Introducing: The HashTable

- A **hash function** converts a value into an integer
- A **hash table** uses a hash function to determine the location in which to store the value

What is the big-O complexity to add, remove, or look up a value?

- The time it takes to convert a value into an index is $O(1)$
- Adding, removing, or looking up a value then takes only a constant number of additional operations.

$O(1)$!

<div style='font-size: 200pt'> 💪🏻 </div>

## Hash Tables: *Revisited*

In [18]:
reset_table();
print_table();

0: 
1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 


In [19]:
add_item("foo", string_hash);
print_table();

0: 
1: 
2: foo
3: 
4: 
5: 
6: 
7: 
8: 
9: 


In [20]:
add_item("foobar", string_hash);
print_table();

0: 
1: 
2: foobar
3: 
4: 
5: 
6: 
7: 
8: 
9: 


A **collision** is when two different values have the same hash code (or have different hash codes that map to the same slot).

In [21]:
size_t a_better_hash(string const& item) {
    size_t result = 0;
    for (auto c : item) {
        result += c;
    }
    return result;
}

In [22]:
reset_table();
add_item("foo", a_better_hash);
print_table();

0: 
1: 
2: 
3: 
4: foo
5: 
6: 
7: 
8: 
9: 


In [23]:
add_item("foobar", a_better_hash);
print_table();

0: 
1: 
2: 
3: foobar
4: foo
5: 
6: 
7: 
8: 
9: 


In [24]:
add_item("oof", a_better_hash);
print_table();

0: 
1: 
2: 
3: foobar
4: oof
5: 
6: 
7: 
8: 
9: 


<div style="font-size: 200px"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 🤦🏻‍♂️ </div>

Can you create a hash function that is guaranteed not to create collisions?

No: there are far more possible values than there are slots, so some values must share a slot. You'll always have to deal with collisions, but the fewer collisions a hash function creates, the better.
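
One way to reduce collisions is to make the hashcode depend on the order of the characters, so that anagrams like `"foo"` and `"oof"` no longer share a hashcode. The sketch below is a common polynomial-style hash; the name `positional_hash` and the multiplier 33 are illustrative choices, not something defined earlier in these notes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: multiply the running total before adding each
// character, so the same letters in a different order give a different result.
std::size_t positional_hash(std::string const& item) {
    std::size_t result = 0;
    for (unsigned char c : item) {
        result = result * 33 + c;  // 33 is an arbitrary small odd multiplier
    }
    return result;
}
```

With this function, `"foo"` hashes to 114852 and `"oof"` to 124644, so in a 10-slot table they land in slots 2 and 4 instead of colliding. Other pairs of strings will still collide.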

## HashTable Collisions

There are two approaches to handling collisions: **probing** and **chaining**.

We'll discuss *chaining* today.

### Chaining

Instead of storing the items directly, each slot stores a **list** of items. 

First determine the slot an item should go in, then search the list in that slot. 

As long as the number of items assigned to the same slot stays small, the performance doesn't degrade.

When the number of items approaches the capacity of the array, it's time to grow the array.

```
0: foo, quux
1:
2: bar
3: baz, zip
4: 
5:
6: win
7: win!
8: cs235
9: abc
```
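
A minimal sketch of chaining, assuming a fixed number of slots (at least one) and string items. The `ChainedTable` struct and `sum_hash` (a copy of the character-sum idea from `a_better_hash`) are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Sum of the character codes, same idea as a_better_hash.
std::size_t sum_hash(std::string const& item) {
    std::size_t result = 0;
    for (unsigned char c : item) {
        result += c;
    }
    return result;
}

// Hypothetical chained table: each slot holds a list of items.
// Assumes num_slots > 0.
struct ChainedTable {
    std::vector<std::list<std::string>> slots;
    std::size_t (*hash_fn)(std::string const&);

    ChainedTable(std::size_t num_slots, std::size_t (*fn)(std::string const&))
        : slots(num_slots), hash_fn(fn) {}

    // First determine the slot the item belongs in...
    std::list<std::string>& chain_for(std::string const& item) {
        return slots[hash_fn(item) % slots.size()];
    }

    // ...then search only the list in that slot.
    bool has(std::string const& item) {
        auto& chain = chain_for(item);
        return std::find(chain.begin(), chain.end(), item) != chain.end();
    }

    void add(std::string const& item) {
        if (!has(item)) {
            chain_for(item).push_back(item);
        }
    }

    void remove(std::string const& item) {
        chain_for(item).remove(item);  // erases every element equal to item
    }
};
```

For example, after `ChainedTable t(10, sum_hash); t.add("foo"); t.add("oof");` both items sit in the list at slot 4, and neither overwrites the other.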

### Growing

- Create a new array
- **Re-add** each item to the table

Why not simply copy the lists over to the new array? Why do we need to re-add each item individually?

Assume an array size of 10. The hashcodes `1812` and `7502` will end up in the **same** slot:

In [25]:
cout << 1812 % 10 << endl;

2


In [26]:
cout << 7502 % 10 << endl;

2


But when I increase the array size to 20, these same hashcodes now fall in **different** slots:

In [27]:
cout << 1812 % 20 << endl;

12


In [28]:
cout << 7502 % 20 << endl;

2
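
The same effect shows up with real items. Below is a rough sketch of growing a chained table by re-adding every item; `Buckets`, `sum_hash`, and `regrow` are hypothetical names, and `new_size` is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

using Buckets = std::vector<std::list<std::string>>;

// Sum of the character codes, same idea as a_better_hash.
std::size_t sum_hash(std::string const& item) {
    std::size_t result = 0;
    for (unsigned char c : item) {
        result += c;
    }
    return result;
}

// Build a table with new_size slots and re-add every item individually,
// because hash % new_size may differ from hash % old_size.
Buckets regrow(Buckets const& old_buckets, std::size_t new_size) {
    Buckets bigger(new_size);
    for (auto const& chain : old_buckets) {
        for (auto const& item : chain) {
            bigger[sum_hash(item) % new_size].push_back(item);
        }
    }
    return bigger;
}
```

With 10 slots, `"foo"` (hashcode 324) and `"foe"` (hashcode 314) share slot 4. After `regrow(table, 20)`, `"foo"` stays in slot 4 but `"foe"` moves to slot 14, which is why copying the old lists across unchanged would leave items where lookups can no longer find them.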


## Big O

What is the big-O for add, remove, and contains?

- Computing the position is $O(1)$
- Finding the bucket is $O(1)$
- Assuming the hash function distributes the data uniformly, collisions are rare and each bucket holds only a few items, so searching a bucket is $O(1)$ on average
  - You can tune the grow parameter to improve performance
- Growing re-adds all $n$ items, but if the array doubles each time, a grow happens only about once every $n$ insertions, so the amortized cost per insertion is $O(1)$
- All together: $O(1)$
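
The amortized claim can be checked by counting. The sketch below simulates $n$ insertions and totals how many items get re-added during grows. It assumes the table doubles its capacity whenever it is full; the doubling policy and the name `count_readds` are assumptions for illustration.

```cpp
#include <cstddef>

// Hypothetical simulation: insert n items into a table that starts with the
// given capacity and doubles whenever it is full. Returns the total number
// of items re-added across all grows.
std::size_t count_readds(std::size_t n, std::size_t capacity) {
    std::size_t size = 0;
    std::size_t readds = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (size == capacity) {
            readds += size;  // every existing item is re-added
            capacity *= 2;
        }
        size++;  // the new item itself
    }
    return readds;
}
```

For example, `count_readds(1000, 10)` returns 1270 (that is, 10 + 20 + 40 + ... + 640). Under doubling the total stays below $2n$, so the extra work averages out to a constant per insertion.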

What are the pathological cases for a hashtable?

- All the items end up in the same bucket: $O(n)$

## Iteration order

When you iterate through the values of a hash table, what order do they come out in?

https://stackoverflow.com/a/78240361/2288986
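
A quick way to try this with `std::unordered_set`: the helper below (a hypothetical name, `iteration_order`) inserts some strings and returns them in whatever order the set iterates. The C++ standard leaves that order unspecified, so no expected output is shown; it depends on the hash function, the number of buckets, and the library implementation.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: put the items in an unordered_set, then read them
// back in iteration order. Duplicates are dropped by the set.
std::vector<std::string> iteration_order(std::vector<std::string> const& items) {
    std::unordered_set<std::string> table(items.begin(), items.end());
    return std::vector<std::string>(table.begin(), table.end());
}
```

Calling `iteration_order({"foo", "bar", "baz"})` returns the same three strings, but not necessarily in insertion order or sorted order.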

## Hash Functions: *Revisited*

In [32]:
size_t hash_7(string const& value) {
    return 2;
}

In [33]:
cout << hash_7("foo") << endl;

2


In [34]:
cout << hash_7("bar") << endl;

2


In [35]:
size_t rand_hash(string const& value) {
    return rand();
}

In [36]:
reset_table();

In [70]:
add_item("bar", rand_hash);
print_table();

0: 
1: bar
2: 
3: bar
4: bar
5: 
6: bar
7: bar
8: bar
9: 


In [77]:
cout << has_item("bar", rand_hash) << endl;

false


### Hash Function Qualities

The choice of hash function matters. The goal is to make collisions as rare as possible, but still be able to find the items.

What kind of function do we want?

- **Determinism**: the same value will ALWAYS yield the same hashcode
  - No `rand` in the hash function!

- **Efficiency**: the hashcode can be computed quickly.
  - If it takes longer to compute the hashcode than to insert into a BST, that's no good.

- **Defined range**: the hashcodes span the full range of slots available
  - If my array is 1000 slots long, but my hash function only produces values between 0..10, that's no good.

- **Uniformity**: the hashcodes are uniformly distributed across the full possible space
  - If my hash function tends to output even numbers but not odd numbers, that's no good.
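
One rough way to eyeball range and uniformity is to count how many items land in each slot. In the sketch below, `bucket_counts`, `string_hasher`, `constant_hash`, and `sum_hash` are hypothetical names, and `num_slots` is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using string_hasher = std::size_t (*)(std::string const&);

// Deterministic but useless: every item gets the same hashcode.
std::size_t constant_hash(std::string const&) {
    return 2;
}

// Sum of the character codes, same idea as a_better_hash.
std::size_t sum_hash(std::string const& item) {
    std::size_t result = 0;
    for (unsigned char c : item) {
        result += c;
    }
    return result;
}

// Count how many of the items fall into each of num_slots slots.
std::vector<std::size_t> bucket_counts(std::vector<std::string> const& items,
                                       string_hasher hash,
                                       std::size_t num_slots) {
    std::vector<std::size_t> counts(num_slots, 0);
    for (auto const& item : items) {
        counts[hash(item) % num_slots]++;
    }
    return counts;
}
```

For the items `foo`, `foobar`, `bar`, and `baz` with 10 slots, `constant_hash` puts all four in slot 2, while `sum_hash` spreads them over slots 4, 3, 9, and 7. A lopsided count on realistic data is a sign the hash function is a poor fit.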

## How to Hash Anything

- https://en.cppreference.com/w/cpp/utility/hash

In [78]:
// hash comes from <functional>
std::hash<string> hashfun;

In [79]:
cout << hashfun("hello world!") << endl;

6594337730806245023


In [80]:
cout << hashfun("hello world?") << endl;

11656130126939175289


In [81]:
std::hash<int> inthash;
cout << inthash(1234567) << endl;

1234567
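
`std::hash` covers the built-in and standard library types, and the exact values it returns are implementation-specific. For your own types, one option is to write a small hasher that combines the hashes of the fields and pass it to `std::unordered_set`. The `Point` and `PointHash` types below are hypothetical, and the combining formula is just one simple choice.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

// Hypothetical user-defined type.
struct Point {
    int x;
    int y;
    bool operator==(Point const& other) const {
        return x == other.x && y == other.y;
    }
};

// Hypothetical hasher: combine the std::hash of each field.
struct PointHash {
    std::size_t operator()(Point const& p) const {
        std::size_t hx = std::hash<int>{}(p.x);
        std::size_t hy = std::hash<int>{}(p.y);
        return hx * 31 + hy;  // simple combine; other formulas work too
    }
};

// The second template argument tells the set how to hash a Point.
using PointSet = std::unordered_set<Point, PointHash>;
```

The set needs both pieces: `PointHash` to choose a bucket and `operator==` to tell items in the same bucket apart.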


## Python Dictionaries

Python uses the term *dictionary* to mean *hashmap*.

However, Python dictionaries have a few additional qualities to note:
- they iterate in **insertion** order
- they use open addressing (that is, probing, with a pseudo-random probe sequence) instead of chaining

How is the insertion order preserved?
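
One possible answer, sketched in C++ to match the rest of these notes: keep the items in a plain array in the order they were added, and let the hash table store only positions into that array. Recent versions of CPython use a layout along these lines, though the real implementation differs in many details. Everything here is a simplifying assumption: the `OrderedStringSet` name, the fixed capacity, no removal, and linear probing in place of Python's pseudo-random probing.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Marker for a slot that holds no index yet.
constexpr std::size_t kEmptySlot = static_cast<std::size_t>(-1);

// Hypothetical insertion-ordered set with a fixed number of slots.
class OrderedStringSet {
public:
    explicit OrderedStringSet(std::size_t num_slots)
        : slots_(num_slots, kEmptySlot) {}

    // Returns false if the item is already present or the table is full.
    bool add(std::string const& item) {
        if (entries_.size() == slots_.size()) {
            return false;  // full (a real table would grow here)
        }
        std::size_t pos = std::hash<std::string>{}(item) % slots_.size();
        while (slots_[pos] != kEmptySlot) {
            if (entries_[slots_[pos]] == item) {
                return false;  // already stored
            }
            pos = (pos + 1) % slots_.size();  // linear probing
        }
        slots_[pos] = entries_.size();  // the slot remembers the position
        entries_.push_back(item);       // the item goes at the end
        return true;
    }

    bool has(std::string const& item) const {
        if (slots_.empty()) {
            return false;
        }
        std::size_t pos = std::hash<std::string>{}(item) % slots_.size();
        for (std::size_t probes = 0; probes < slots_.size(); probes++) {
            if (slots_[pos] == kEmptySlot) {
                return false;
            }
            if (entries_[slots_[pos]] == item) {
                return true;
            }
            pos = (pos + 1) % slots_.size();
        }
        return false;
    }

    // Iterating over this vector gives insertion order.
    std::vector<std::string> const& items() const {
        return entries_;
    }

private:
    std::vector<std::size_t> slots_;    // hash table of indices into entries_
    std::vector<std::string> entries_;  // items in insertion order
};
```

Lookups still go through the hash table, so they stay fast, while iteration simply walks `entries_` from front to back.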

## Key Ideas

- Hash functions convert a value into an integer
- Hash tables use hash functions to store and find values in $O(1)$ time on average
- Hash maps use hash tables to store key-value pairs