# Probing

- explain linear probing
- what about default values?
  - what if I want to store a 0, or an ""?
  - store a bit for "occupied"
- what about deleting an item from a chain?
  - store a bit for "deleted"
  - this gets cleared on grow
- explain quadratic probing (briefly)
- explain pseudorandom probing

- Implement ordered dictionary
  - Use `unordered_map` and `vector`
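
One way the ordered-dictionary item could look, assuming "ordered" means insertion-ordered. This is only a sketch: the class name `OrderedDict`, its methods, and the `int` value type are all hypothetical. The `vector` remembers the order, and the `unordered_map` maps each key to its position in the `vector`.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical insertion-ordered dictionary (string keys, int values).
class OrderedDict {
public:
    // Inserts a new key at the end, or updates the value of an existing key in place.
    void set(const std::string& key, int value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            entries_[it->second].second = value;  // keep the original position
            return;
        }
        index_[key] = entries_.size();
        entries_.emplace_back(key, value);
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* get(const std::string& key) const {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &entries_[it->second].second;
    }

    // Items in the order they were first inserted.
    const std::vector<std::pair<std::string, int>>& items() const { return entries_; }

private:
    std::vector<std::pair<std::string, int>> entries_;     // insertion order
    std::unordered_map<std::string, std::size_t> index_;   // key -> position in entries_
};
```

Removal is left out on purpose: erasing from the middle of the `vector` would shift every later position stored in the map.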

- bar: 5
- foobar: 5
- foo: 6
- quux: 7
- quux2: 7
- abc: 3
- abcde: 3

**Outline**

- Collisions: probing
  - look for another empty spot
- Store in array of items instead of lists
  - Now we need a default value for our item -> store a struct
- How do we find an empty slot?
  - linear probing - find the next empty slot
- Problem: what happens when you delete an item?
  - now the chain is broken and later items no longer "exist" in the table
  - store an `is_deleted` flag
  - new items can go in an `is_deleted` slot (which clears the flag), but lookups keep walking past `is_deleted` slots even though they hold no item
- Discussion on performance
  - As the array gets full, the performance degrades
  - How long is each chain?
- Grow
  - usually grow once the array is about 0.5 to 0.8 full
  - re-add the items to the new array (sets all the `is_deleted` flags back to empty)
- Problem: Linear probing builds large contiguous blocks -> pushes lookups toward the worst-case performance
  - quadratic probing helps
  - pseudorandom probing is best

In [None]:
#include <iostream>  // std::cout, std::endl
#include <string>
using std::cout;
using std::endl;
using std::string;

#include <functional>  // std::hash

In [None]:
std::hash<string> sh;

In [None]:
cout << "foo: " <<  sh("foo") % 8 << endl;
cout << "foobar: " <<  sh("foobar") % 8 << endl;
cout << "bar: " << sh("bar") % 8 << endl;
cout << "baz: " << sh("baz") % 8 << endl;
cout << "quux: " << sh("quux") % 8 << endl;
cout << "abc: " << sh("abc") % 8 << endl;
cout << "xyz: " << sh("xyz") % 8 << endl;
cout << "123: " << sh("123") % 8 << endl;
cout << "bazquux: " << sh("bazquux") % 8 << endl;
cout << "321: " << sh("321") % 8 << endl;
cout << "?: " << sh("?") % 8 << endl;
cout << "!: " << sh("!") % 8 << endl;
cout << ":): " << sh(":)") % 8 << endl;
cout << ":-): " << sh(":-)") % 8 << endl;
cout << ":P: " << sh(":P") % 8 << endl;
cout << ":(: " << sh(":(") % 8 << endl;


In [66]:
#include "table1.h"

In [67]:
Table1 table;
table.print();

0:  
1:  
2:  
3:  
4:  
5:  
6:  
7:  


In [68]:
table.insert("foo");
table.insert("bar");
table.print();

0:  
1:  
2:  foo
3:  bar
4:  
5:  
6:  
7:  


In [69]:
cout << table.contains("") << endl;

1
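
`Table1` reports that it contains `""` because an empty slot and a stored empty string look identical. A minimal sketch of the fix, assuming a hypothetical `Slot` struct and `slot_holds` helper (the real layout inside `table2.h` is not shown in these notes):

```cpp
#include <string>

// Hypothetical slot layout: the flag, not the value, says whether the slot is in use.
struct Slot {
    std::string value;      // default-constructed to ""
    bool occupied = false;  // set to true only when an insert writes this slot
};

// An unoccupied slot never matches, even when the key is the default value "".
bool slot_holds(const Slot& slot, const std::string& key) {
    return slot.occupied && slot.value == key;
}
```

With the flag in place, `""` (or `0` in a table of integers) becomes an ordinary value that can be stored like any other.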


In [70]:
#include "table2.h"

In [71]:
Table2 table;
table.print();

0(0):  
1(0):  
2(0):  
3(0):  
4(0):  
5(0):  
6(0):  
7(0):  


In [72]:
cout << table.contains("") << endl;

0


In [73]:
table.insert("");
table.insert("foo");
table.print();
cout << table.contains("") << endl;


0(0):  
1(0):  
2(1):  foo
3(0):  
4(0):  
5(0):  
6(1):  
7(0):  
1


In [74]:
table.insert("bar");
table.print();

0(0):  
1(0):  
2(1):  foo
3(1):  bar
4(0):  
5(0):  
6(1):  
7(0):  


In [75]:
table.insert("foobar");
table.print();

0(0):  
1(0):  
2(1):  foo
3(1):  foobar
4(0):  
5(0):  
6(1):  
7(0):  
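
Here `foobar` overwrote `bar` because both prefer slot 3 and nothing looks for another spot. A rough sketch of linear probing follows. `ProbingSet` is a hypothetical name, not the contents of `table3.h`, and it assumes a fixed capacity greater than zero with no growing or deleting.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Slot {
    std::string value;
    bool occupied = false;
};

// Hypothetical fixed-size set of strings that resolves collisions by linear probing.
class ProbingSet {
public:
    explicit ProbingSet(std::size_t capacity) : slots_(capacity) {}

    // Returns false if the key is already present or every slot is taken.
    bool insert(const std::string& key) {
        const std::size_t home = std::hash<std::string>{}(key) % slots_.size();
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            Slot& slot = slots_[(home + step) % slots_.size()];  // wrap around the end
            if (!slot.occupied) {
                slot.value = key;
                slot.occupied = true;
                return true;
            }
            if (slot.value == key) return false;  // already stored
        }
        return false;  // table is full
    }

    bool contains(const std::string& key) const {
        const std::size_t home = std::hash<std::string>{}(key) % slots_.size();
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            const Slot& slot = slots_[(home + step) % slots_.size()];
            if (!slot.occupied) return false;  // the chain ends at the first empty slot
            if (slot.value == key) return true;
        }
        return false;
    }

private:
    std::vector<Slot> slots_;
};
```

Note that `contains` has to walk the same sequence of slots as `insert`, otherwise an item that was pushed off its preferred slot can never be found again.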


In [76]:
#include "table3.h"

In [77]:
Table3 table;

In [78]:
table.insert("foo");
table.insert("bar");
table.print();

0(0):  
1(0):  
2(1):  foo
3(1):  bar
4(0):  
5(0):  
6(0):  
7(0):  


In [79]:
table.insert("foobar");
table.print();

0(0):  
1(0):  
2(1):  foo
3(1):  bar
4(1):  foobar
5(0):  
6(0):  
7(0):  


In [80]:
cout << table.contains("bar") << endl;
cout << table.contains("foobar") << endl;

1
1


In [81]:
table.reset();
table.insert(":)");  // takes foo's preferred spot
table.print();

0(0):  
1(0):  
2(1):  :)
3(0):  
4(0):  
5(0):  
6(0):  
7(0):  


In [82]:
table.insert("bar");  // right after foo's spot
table.insert("foo");  // take a hike!
table.print();

0(0):  
1(0):  
2(1):  :)
3(1):  bar
4(1):  foo
5(0):  
6(0):  
7(0):  


In [83]:
cout << table.remove("bar") << endl;
table.print();

1
0(0):  
1(0):  
2(1):  :)
3(0):  bar
4(1):  foo
5(0):  
6(0):  
7(0):  


In [84]:
cout << table.contains("foo") << endl;

0
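
Removing `bar` cleared its occupied flag, so the lookup for `foo` stops at slot 3 and never reaches slot 4. A sketch of the usual fix, a "deleted" marker (tombstone) that lookups walk past. `TombstoneSet` and its injectable hash parameter are hypothetical; the hash is injectable only so that collisions can be forced in a demo.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical slot carrying both flags, like the `(occupied)[deleted]` columns in the printout.
struct Slot {
    std::string value;
    bool occupied = false;
    bool deleted = false;
};

class TombstoneSet {
public:
    using Hash = std::function<std::size_t(const std::string&)>;

    // capacity must be greater than zero.
    explicit TombstoneSet(std::size_t capacity, Hash hash = std::hash<std::string>{})
        : slots_(capacity), hash_(std::move(hash)) {}

    bool insert(const std::string& key) {
        if (contains(key)) return false;  // check the whole chain before reusing a tombstone
        const std::size_t home = hash_(key) % slots_.size();
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            Slot& slot = slots_[(home + step) % slots_.size()];
            if (!slot.occupied) {  // empty or tombstoned: both are free to reuse
                slot.value = key;
                slot.occupied = true;
                slot.deleted = false;
                return true;
            }
        }
        return false;  // table is full
    }

    bool contains(const std::string& key) const { return find(key) != slots_.size(); }

    bool remove(const std::string& key) {
        const std::size_t i = find(key);
        if (i == slots_.size()) return false;
        slots_[i].occupied = false;
        slots_[i].deleted = true;  // leave a tombstone so the chain stays connected
        return true;
    }

private:
    // Returns the index holding `key`, or slots_.size() if it is absent.
    std::size_t find(const std::string& key) const {
        const std::size_t home = hash_(key) % slots_.size();
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            const std::size_t i = (home + step) % slots_.size();
            const Slot& slot = slots_[i];
            if (slot.occupied && slot.value == key) return i;
            if (!slot.occupied && !slot.deleted) break;  // truly empty: the chain ends here
        }
        return slots_.size();
    }

    std::vector<Slot> slots_;
    Hash hash_;
};
```

Passing a hash that always returns 2 puts `:)`, `bar`, and `foo` in one chain, which reproduces the situation above: after `remove("bar")`, `contains("foo")` still succeeds.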


In [85]:
#include "table4.h"

In [86]:
Table4 table;

In [87]:
table.insert(":)");
table.insert("bar");
table.print();

0(0)[0]:  
1(0)[0]:  
2(1)[0]:  :)
3(1)[0]:  bar
4(0)[0]:  
5(0)[0]:  
6(0)[0]:  
7(0)[0]:  


In [88]:
table.insert("foo");
table.print();

0(0)[0]:  
1(0)[0]:  
2(1)[0]:  :)
3(1)[0]:  bar
4(1)[0]:  foo
5(0)[0]:  
6(0)[0]:  
7(0)[0]:  


In [89]:
cout << table.remove("bar") << endl;
table.print();

1
0(0)[0]:  
1(0)[0]:  
2(1)[0]:  :)
3(0)[1]:  bar
4(1)[0]:  foo
5(0)[0]:  
6(0)[0]:  
7(0)[0]:  


In [90]:
cout << table.contains("foo") << endl;

1


## Big O?

- As long as the chains stay short, the performance stays near $O(1)$
- If the density is capped at 2/3, then at least 1/3 of the slots are empty, so each probe has at least a 1/3 chance of finding an empty slot (assuming probes land uniformly)
  - The expected number of probes, each succeeding with probability 1/3, is 1 / (1/3) = 3.
  - So even at the maximum density, a chain averages about 3 steps.
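
Where the 1 / (1/3) comes from: treat each probe as an independent trial that finds an empty slot with probability $p$ (an idealization; linear probing in particular violates the independence). The number of probes $N$ until the first success is then geometric:

$$
P(N = k) = (1-p)^{k-1}\,p, \qquad
E[N] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p = \frac{1}{p}, \qquad
p = \tfrac{1}{3} \;\Rightarrow\; E[N] = 3 .
$$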

In [92]:
table.reset();
table.insert(":)");
table.insert("foo");
table.insert("bar");
table.insert("123");
table.print();

0(0)[0]:  
1(0)[0]:  
2(1)[0]:  :)
3(1)[0]:  foo
4(1)[0]:  bar
5(1)[0]:  123
6(0)[0]:  
7(0)[0]:  


- What is the probability the next item ends in slot 6?

<div style="font-size: 100px">$\frac{5}{8}$ 😱</div>

The big problem with *linear probing* is that a chain increases the likelihood that new items get added to it, which in turn increases the likelihood that the chain gets even longer...
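
The 5/8 above can be checked by brute force: try every possible preferred slot and see where linear probing would put the item. The helper name `homes_landing_at` is made up for this sketch, and it assumes the hash is equally likely to pick any of the slots.

```cpp
#include <cstddef>
#include <vector>

// Counts how many preferred ("home") slots would send a new item to `target`
// under linear probing, given which slots are already occupied.
std::size_t homes_landing_at(const std::vector<bool>& occupied, std::size_t target) {
    const std::size_t n = occupied.size();
    std::size_t count = 0;
    for (std::size_t home = 0; home < n; ++home) {
        for (std::size_t step = 0; step < n; ++step) {
            const std::size_t i = (home + step) % n;
            if (!occupied[i]) {        // first empty slot on this probe path
                if (i == target) ++count;
                break;
            }
        }
    }
    return count;
}
```

For the table above (slots 2 through 5 occupied), five of the eight preferred slots (2, 3, 4, 5, and 6 itself) lead to slot 6, while slots 7, 0, and 1 each get only one.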

**Quadratic** probing tries offsets of 1, 4, 9, 16, etc. from the preferred slot.

This helps, but what you really want is for each probe to be equally likely to land on any slot in the table...
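
A sketch of both probe sequences. `quadratic_probe` assumes the offsets are measured from the preferred slot. `perturbed_probes` is modeled on the recurrence CPython's dict uses (`perturb >>= 5; i = (i*5 + perturb + 1) & mask`); whether `table5.h` does the same thing is not shown in these notes, so treat both function names as hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Slot visited on the k-th retry of quadratic probing: home + k^2, wrapped around.
std::size_t quadratic_probe(std::size_t home, std::size_t k, std::size_t capacity) {
    return (home + k * k) % capacity;
}

// First `count` slots visited by a CPython-style perturbed probe.
// `capacity` must be a power of two so that `mask` always yields a valid index.
std::vector<std::size_t> perturbed_probes(std::uint64_t hash, std::size_t capacity,
                                          std::size_t count) {
    const std::uint64_t mask = capacity - 1;
    std::uint64_t perturb = hash;
    std::uint64_t i = hash & mask;  // preferred slot
    std::vector<std::size_t> visited;
    for (std::size_t n = 0; n < count; ++n) {
        visited.push_back(static_cast<std::size_t>(i));
        perturb >>= 5;                     // mix in higher bits of the hash each step
        i = (i * 5 + perturb + 1) & mask;  // once perturb is 0 this cycles through every slot
    }
    return visited;
}
```

For hash 2 in an 8-slot table the perturbed sequence is 2, 3, 0, 1, 6, 7, 4, 5, so every slot is reached. Two keys that share a preferred slot but differ in their higher hash bits follow different paths. By contrast, with 8 slots `home + k*k` only ever reaches three distinct slots, so quadratic probing on its own can miss free space.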

In [93]:
#include "table5.h"

In [94]:
Table5 table;

In [95]:
table.reset();
table.insert(":)");
table.insert("bar");
table.insert("foo");
table.print();

0(0)[0]:  
1(0)[0]:  
2(1)[0]:  :)
3(1)[0]:  bar
4(0)[0]:  
5(1)[0]:  foo
6(0)[0]:  
7(0)[0]:  


In [96]:
table.insert("123");
table.insert("xyz");
table.insert("foobar");
table.print();

0(0)[0]:  
1(1)[0]:  xyz
2(1)[0]:  :)
3(1)[0]:  bar
4(1)[0]:  foobar
5(1)[0]:  foo
6(0)[0]:  
7(1)[0]:  123


## Other details

- When you reach some portion of capacity (e.g. 66%) then grow the array
  - deleted slots still count towards capacity, because chains keep running through them
  - re-insert all items into new array
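
A sketch of the grow step under some assumptions not stated above: the new array is twice the size, the threshold is 2/3, and collisions in the new array are resolved with linear probing. `should_grow` and `grow` are hypothetical helper names.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Slot {
    std::string value;
    bool occupied = false;
    bool deleted = false;
};

// True once live items plus deleted slots exceed 2/3 of the array.
bool should_grow(std::size_t live, std::size_t tombstones, std::size_t capacity) {
    return 3 * (live + tombstones) > 2 * capacity;
}

// Builds an array twice the size and re-inserts only the live items.
// Deleted slots are simply not copied, so every slot in the result is
// either occupied or truly empty. `old_slots` must not be empty.
std::vector<Slot> grow(const std::vector<Slot>& old_slots) {
    std::vector<Slot> bigger(old_slots.size() * 2);
    std::hash<std::string> hasher;
    for (const Slot& old : old_slots) {
        if (!old.occupied) continue;  // skip empty and deleted slots
        std::size_t i = hasher(old.value) % bigger.size();
        // Terminates because the new array ends up at most half full.
        while (bigger[i].occupied) i = (i + 1) % bigger.size();
        bigger[i].value = old.value;
        bigger[i].occupied = true;
    }
    return bigger;
}
```

Items must be re-inserted rather than copied slot for slot, because each preferred slot is the hash modulo the array size, and that size has just changed.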

### Python's Implementation

- https://github.com/python/cpython/blob/main/Objects/dictobject.c#L261
- https://github.com/python/cpython/blob/main/Objects/dictobject.c#L995


## Key Ideas

- Store everything in a single array
  - wrap the items in a struct with `occupied` and `deleted` flags
  - grow before the array fills up (e.g. at 2/3 full)
- use probing (also called *open addressing*) to find an alternate slot when the preferred slot is already taken
  - linear probing tends to create islands that degrade performance
  - pseudorandom probing keeps the chains shorter