## Binary Search Trees

BST Rules, for every node `u`:
- every node `l` in the _left subtree_ of `u` satisfies l.key $\leq$ u.key
- every node `r` in the _right subtree_ of `u` satisfies r.key $\geq$ u.key

**i.e. go left get smaller, go right get bigger**

| Operation  | Average Case       | Worst Case |
| ---------- | ------------       | ---------- |
| Search     | $O(\log n) = O(h)$ | $\Theta(n)$ |
| Insert     | $O(\log n) = O(h)$ | $\Theta(n)$ |
| Delete     | $O(\log n) = O(h)$ | $\Theta(n)$ |

For a tree of $n$ nodes with height $h$ counted in edges: *Chain:* $h = n - 1$, *Perfect Tree:* $h = \log_2(n + 1) - 1$
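The two height formulas can be checked with a small sketch. This is a minimal, unbalanced BST under stated assumptions: the names `Node`, `insert`, `search`, `height` and `build` are illustrative and not from these notes, duplicate keys are sent to the right, and height is counted in edges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Insert key without rebalancing and return the root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:          # go left, get smaller
            if cur.left is None:
                cur.left = new
                return root
            cur = cur.left
        else:                      # go right, get bigger (or equal)
            if cur.right is None:
                cur.right = new
                return root
            cur = cur.right


def search(root, key):
    """Return the node holding key, or None. Cost is O(h)."""
    cur = root
    while cur is not None and cur.key != key:
        cur = cur.left if key < cur.key else cur.right
    return cur


def height(root):
    """Height in edges; an empty tree has height -1."""
    if root is None:
        return -1
    return 1 + max(height(root.left), height(root.right))


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


chain = build([1, 2, 3, 4, 5, 6, 7])      # sorted input degenerates into a chain
perfect = build([4, 2, 6, 1, 3, 5, 7])    # this order fills every level
print(height(chain), height(perfect))
```

With $n = 7$ the sorted insertion order gives the chain height $n - 1$, while the second order gives the perfect-tree height $\log_2(n + 1) - 1$.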

<img src="../media/BSTheight.png" alt="drawing" width="650"/>



## Red-Black Trees

**Red-Black Trees** (RBTs) are modified BSTs that add rules ensuring insertions and deletions leave the tree approximately balanced, hence they are known as 'self-balancing'. As the core BST operations are $O(h)$, this guarantees fast _search, insertion and deletion_ operations.


| Operation  | Worst Case |
| ---------- | ---------- |
| Search     | $O(\log n)$ |
| Insert     | $O(\log n)$ |
| Delete     | $O(\log n)$ |

**Rules:**

| Number    | Property |
| --------- | -------- |
| 1         | Every node is *red* or *black* |
| 2         | The root node is *black* |
| 3         | Every leaf ('Nil') node is *black* |
| 4         | If a node is *red* then both children are *black* |
| 5         | From any node, all paths down to leaf ('Nil') nodes have equal numbers of *black* nodes |

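Therefore, all root-leaf paths have the same number of *black* nodes, and no path has two consecutive *red* nodes.
The shortest possible path is all *black* and the longest alternates *red* and *black*, which places a limit on the maximum size of the longest path (height):

$ h_{max} \leq 2 \cdot h_{min} $

Note that a perfectly balanced tree of height $h$ has $ n = 2^{h+1} -1 $ nodes. Every level above the end of the shortest root-leaf path is completely filled, so the tree contains a perfect tree of height $h_{min} - 1$ and $ n \geq 2^{h_{min}} - 1 $. Hence $ h_{min} \leq \log_2 (n + 1) \therefore h_{min} = O( \log_2 n)$.

$ h_{min} \leq h_{max} \leq 2 \cdot h_{min} \Rightarrow h_{max} = O( \log_2 n)$, which is the self-balancing result! (It is also $\Theta( \log_2 n)$, since no binary tree on $n$ nodes has height below $\log_2(n + 1) - 1$.)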
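The rules and the resulting path bound can be sketched as a validator. This is a hedged illustration, not a full red-black tree: it only checks rules 2, 4 and 5 on a hand-built tree (no insertion, no rotations, no BST-order check). The names `RBNode`, `black_height`, `is_red_black` and `path_lengths` are assumptions, and `None` stands in for the black Nil node.

```python
from dataclasses import dataclass
from typing import Optional

RED, BLACK = "red", "black"


@dataclass
class RBNode:
    key: int
    color: str
    left: Optional["RBNode"] = None    # None plays the role of the black Nil node
    right: Optional["RBNode"] = None


def black_height(node):
    """Black nodes on any path from node down to Nil (Nil included).

    Raises ValueError if rule 4 or rule 5 is violated below node.
    """
    if node is None:
        return 1                       # rule 3: Nil is black
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                raise ValueError("rule 4 violated: red node with a red child")
    lh = black_height(node.left)
    rh = black_height(node.right)
    if lh != rh:
        raise ValueError("rule 5 violated: unequal black counts")
    return lh + (1 if node.color == BLACK else 0)


def is_red_black(root):
    if root is not None and root.color != BLACK:
        return False                   # rule 2: the root is black
    try:
        black_height(root)
    except ValueError:
        return False
    return True


def path_lengths(node):
    """(shortest, longest) path from node to a Nil, counted in edges."""
    if node is None:
        return 0, 0
    lmin, lmax = path_lengths(node.left)
    rmin, rmax = path_lengths(node.right)
    return 1 + min(lmin, rmin), 1 + max(lmax, rmax)


good = RBNode(10, BLACK,
              RBNode(5, BLACK),
              RBNode(20, RED, RBNode(15, BLACK), RBNode(30, BLACK)))
bad = RBNode(10, BLACK, None, RBNode(20, RED, None, RBNode(30, RED)))

h_min, h_max = path_lengths(good)
print(is_red_black(good), h_min, h_max, is_red_black(bad))
```

On the valid tree the longest path stays within twice the shortest, as the inequality requires; the second tree breaks rule 4 and is rejected.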

## B-Trees

B-Trees are self-balancing search trees with fast search, insertion and deletion, where nodes can have many children.

**Minimum degree `t` implies `t` to `2t` children per internal node**

When a B-tree has $t=2$, each node holds 1 to 3 keys and each internal node has 2 to 4 children; it is called a **2-4** or **2-3-4** Tree.

**Worst case Search/Insert/Delete** $\Rightarrow O( \log n )$, storage is $\Theta (n)$

A B-tree with minimum degree `t`:
- Nodes have attributes `keys` [list] and `is_leaf`.
- Internal nodes have `len(keys) + 1` children.
- The keys of a node are sorted and **separate its children**.
- $ t - 1\leq $ node `keys` $ \leq 2t -1$ (the lower bound does not apply to the root).
- $ t \leq $ `children` $ \leq 2t$ for internal nodes (the lower bound does not apply to the root).
- All leaves have the same depth, so the tree is always balanced.
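The node rules above can be sketched as a small structure with a search routine and a rule checker. This is an illustrative sketch only: `BTreeNode`, `search` and `check` are assumed names, the `children` list is an added attribute, and insertion and deletion (with node splitting and merging) are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BTreeNode:
    keys: List[int] = field(default_factory=list)
    is_leaf: bool = True
    children: List["BTreeNode"] = field(default_factory=list)


def search(node, key):
    """Return (node, index) for the slot holding key, or None."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == key:
        return node, i
    if node.is_leaf:
        return None
    return search(node.children[i], key)   # keys[i-1] < key < keys[i]


def check(node, t, lo=float("-inf"), hi=float("inf"), is_root=True):
    """Return the depth of the leaves below node, or raise ValueError."""
    n = len(node.keys)
    if n > 2 * t - 1 or (not is_root and n < t - 1):
        raise ValueError("key count out of range")
    bounds = [lo] + node.keys + [hi]
    if any(a > b for a, b in zip(bounds, bounds[1:])):
        raise ValueError("keys unsorted or outside the separator range")
    if node.is_leaf:
        return 0
    if len(node.children) != n + 1:
        raise ValueError("internal node needs len(keys) + 1 children")
    depths = {
        check(child, t, bounds[i], bounds[i + 1], is_root=False)
        for i, child in enumerate(node.children)
    }
    if len(depths) != 1:
        raise ValueError("leaves at different depths")
    return depths.pop() + 1


# A 2-3-4 tree (t = 2): the root's keys 10 and 20 separate three leaves.
leaves = [BTreeNode([1, 5]), BTreeNode([12, 15, 17]), BTreeNode([25])]
root = BTreeNode([10, 20], is_leaf=False, children=leaves)
print(check(root, t=2), search(root, 15) is not None, search(root, 11))
```

`check` threads the separator keys down as `lo` and `hi` bounds, which is one simple way to test that a node's keys really separate its children.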

### Tree Size and Shape

A B-tree with height $h$, min degree $t$, key count $n$:

- All leaves have the same depth
- All nodes except the root have $t - 1\leq $ node `keys` $ \leq 2t -1$ (the root has between $1$ and $2t - 1$ keys)

The maximum number of nodes $N$ in a degree t _fat_ tree, where every node (root included) has $2t$ children, grows (root + level 1 + level 2...) according to:

$ 1 + 2t + (2t)^2 + ... + (2t)^h = 1 + r + r^2 + ... + r^h $

$ N \leq \frac{(2t)^{h+1} - 1}{2t - 1} $


The minimum number of nodes $N$ in a degree t _skinny_ tree, where the root has 2 children and every other internal node has $t$, grows (root + level 1 + level 2...) according to:

$ 1 + 2 + 2 \cdot t + 2 \cdot t^2 + ... + 2 \cdot t^{h-1} = 1 + a + ar + ar^2 + ... + ar^{h-1} $

$ N \geq 1 + 2 \left( \frac{t^h - 1}{t - 1} \right) $

The number of keys $k$ held by the nodes of a _skinny_ tree (each node holds $t - 1$ keys, bar the root, which holds 1):

$ k \geq 1 + (t-1) \cdot 2 \left( \frac{t^h - 1}{t - 1} \right) = 1 + 2(t^h - 1) = 2t^h - 1$

Therefore, where $n$ is the number of keys:

$ n \geq 2t^h - 1 \Rightarrow \frac{1}{2}(n + 1) \geq t^h $

$ h \leq \log_t \left( \frac{1}{2}(n + 1) \right) $

$ h_{max} = \lfloor \log_t \left( \frac{1}{2}(n + 1) \right) \rfloor$
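As a quick numeric check of the height bound, the sketch below evaluates the skinny-tree key count and the resulting maximum height. The function names `min_keys` and `max_height` are assumptions for illustration; integer arithmetic is used instead of a floating-point logarithm to avoid rounding at exact powers of $t$.

```python
def min_keys(t, h):
    """Fewest keys a B-tree of minimum degree t and height h can hold."""
    return 2 * t**h - 1


def max_height(n, t):
    """Largest h with 2*t**h - 1 <= n, i.e. floor(log_t((n + 1) / 2)), for n >= 1."""
    h = 0
    while min_keys(t, h + 1) <= n:
        h += 1
    return h


for n in (14, 15, 10**6):
    print(n, max_height(n, t=2), max_height(n, t=100))
```

With $t = 2$, 15 keys are just enough for height 3, while 14 keys cap the height at 2. A larger $t$ flattens the tree sharply, which is the reason B-trees suit wide nodes.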



## Hash Tables

### Hash Functions

A hash function, $h:U\rightarrow \{0,\cdots, m-1\}$, maps every element of a universe of keys $U$ (usually far larger than $m$) to one element in a set of $m$ slots $\{0,\cdots, m-1\}$. e.g. $h(k) = k\mod m$.

An **Ideal $h(k)$** would roll a fair, random, $m$-sided die at boot to select a slot for each $k$ and return that same slot for $k$ from then on - An independent uniform random hash function.

A practical hash function should be:

1. Fast to compute
2. Good at minimising collisions: $k_i \neq k_j \; \& \; h(k_i) = h(k_j)$

#### Static Forms

Static forms use one fixed function, so they are vulnerable to unfavourable key distributions that give many collisions.

**Division Method** - $h(k) = k\mod m$ 

**Multiplication Method** - $h(k) = \lfloor m \cdot (Ak\mod 1) \rfloor$  *for*  $A \in (0, 1)$ 
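Both static forms are short enough to sketch directly. The function names are illustrative, and the default $A = (\sqrt{5} - 1)/2$ is an assumption here (a commonly quoted choice attributed to Knuth), not something these notes prescribe.

```python
import math


def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: h(k) = floor(m * (A*k mod 1)) for A in (0, 1)."""
    frac = (k * A) % 1.0        # fractional part of A*k
    return int(m * frac)


m = 8
keys = [8, 16, 24, 32]          # an unlucky pattern: every key is a multiple of m
print([division_hash(k, m) for k in keys])
print([multiplication_hash(k, m) for k in keys])
```

The multiples of $m$ all land in slot 0 under the division method, which is the kind of key distribution that static forms are vulnerable to.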

#### Random Forms

*Universal* Families, e.g. Family H:

$P_{h \in H} \left[ h(k_i) = h(k_j)\right] \leq \frac{1}{m} \; \forall \; i \neq j$

In other words, any two different keys of the universe collide with probability at most $\frac{1}{m}$ when hash function $h$ is drawn uniformly at random from $H$. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. 
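One standard universal family, not named in these notes and used here as an assumption, is $h_{a,b}(k) = ((ak + b) \mod p) \mod m$ with $p$ a prime larger than every key, $a \in \{1, ..., p-1\}$ and $b \in \{0, ..., p-1\}$. The sketch below draws one function at random and, for a small $p$, counts collisions over the whole family. `make_hash` and `collision_fraction` are illustrative names.

```python
import random


def make_hash(p, m, rng=random):
    """Draw h_{a,b}(k) = ((a*k + b) mod p) mod m uniformly from the family."""
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m


def collision_fraction(k1, k2, p, m):
    """Exact fraction of (a, b) choices under which k1 and k2 collide."""
    collisions = sum(
        ((a * k1 + b) % p) % m == ((a * k2 + b) % p) % m
        for a in range(1, p)
        for b in range(p)
    )
    return collisions / (p * (p - 1))


p, m = 17, 5                     # small values so the family can be enumerated
h = make_hash(p, m, random.Random(0))
print(h(3), h(11))
print(collision_fraction(3, 11, p, m), "<=", 1 / m)
```

The probability is taken over the random choice of $h$, not over the keys, so no fixed pair of keys is bad for more than a $\frac{1}{m}$ fraction of the family.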

### Chaining

Chaining manages collisions by giving each slot its own doubly linked list, which holds every element that hashes to that slot:

<img src="../media/hashDLL.png" alt="drawing" width="250"/>

**Worst Case:**

* All $n$ keys collide $\Rightarrow$ All objects are placed in the same slot 
* Search is then $\Theta(n)$ with linked list linear search
* Could be $\Theta( \log n)$ if each chain is kept as a balanced search tree or sorted array instead (a linked list cannot be binary searched in $\log n$ time)

**Average Case:** (Cost of search)

* Define a table *Load Factor* - $ \alpha \; \hat{=} \; \frac{n}{m}$ for $n$ items stored amongst $m$ entries in the table.
* Assume hash function is *Universal*, Collision probability $ \leq \frac{1}{m}$
* $\mathbb{E}[$ Chain Length $] = \frac{n}{m} = \alpha$
* Av Cost = $\Theta(1 + \alpha)$ - Hash + chain search
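A minimal sketch of chaining follows, with some loudly stated assumptions: the class name `ChainedHashTable` is illustrative, Python lists stand in for the doubly linked lists, and Python's built-in `hash` followed by `% m` stands in for a universal hash function.

```python
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m                                # number of slots
        self.n = 0                                # number of stored items
        self.slots = [[] for _ in range(m)]       # one chain per slot

    def _slot(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                          # overwrite an existing key
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1

    def search(self, key):
        """Hash once, then scan one chain: Theta(1 + alpha) on average."""
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None

    @property
    def load_factor(self):
        return self.n / self.m


table = ChainedHashTable(m=8)
for k in range(16):
    table.insert(k, f"v{k}")
print(table.load_factor, [len(chain) for chain in table.slots])

worst = ChainedHashTable(m=8)
for k in (0, 8, 16, 24):                          # every key collides in slot 0
    worst.insert(k, k)
print([len(chain) for chain in worst.slots])
```

The first table spreads 16 keys over 8 slots, so every chain has length $\alpha$. The second shows the worst case, where one chain holds all $n$ keys and search degrades to a linear scan.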
  
### Open Addressing 

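Open addressing is a form of chain-free collision handling: every element lives in the table itself, and each key $k$ has a probe sequence $h(k, 0), h(k, 1), ..., h(k, m-1)$ that is a permutation of $(0, 1,..., m-1)$.

**Linear Probing:** $h(k, i) = (h_1(k) + i) \mod m$. Slots are searched sequentially until an empty one is found.

**Double Hashing:**  $h(k, i) = (h_1(k) + i h_2(k)) \mod m$.  $h_2(k)$ and $m$ must be coprime so that the probe sequence visits every slot.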
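Double hashing can be sketched as below, under assumptions that are not from these notes: the class name `DoubleHashTable`, integer keys, a prime table size $m$, $h_1(k) = k \mod m$ and $h_2(k) = 1 + (k \mod (m - 1))$. With $m$ prime, every $h_2(k) \in \{1, ..., m-1\}$ is coprime with $m$. Deletion is omitted, since it would need tombstone markers.

```python
class DoubleHashTable:
    def __init__(self, m=11):
        self.m = m                      # assumed prime
        self.slots = [None] * m

    def _probe(self, k):
        """Yield the probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m."""
        h1 = k % self.m
        h2 = 1 + (k % (self.m - 1))     # never 0, always coprime with prime m
        for i in range(self.m):
            yield (h1 + i * h2) % self.m

    def insert(self, k):
        """Store k in the first free slot of its probe sequence; return the slot."""
        for slot in self._probe(k):
            if self.slots[slot] is None or self.slots[slot] == k:
                self.slots[slot] = k
                return slot
        raise OverflowError("table is full")

    def search(self, k):
        """Return the slot holding k, or None. An empty slot ends the search."""
        for slot in self._probe(k):
            if self.slots[slot] is None:
                return None
            if self.slots[slot] == k:
                return slot
        return None


t = DoubleHashTable(m=11)
print([t.insert(k) for k in (5, 16, 27)])   # all three keys share h1 = 5
print(t.search(16), t.search(38))
```

The three keys collide on their first probe but separate on the second, because their step sizes $h_2(k)$ differ. Linear probing would instead walk them through the same neighbouring slots.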

Number of probes in an unsuccessful search, assuming independent uniform permutation hashing and $\alpha < 1$:

Expected Probes $\leq \frac{1}{1-\alpha} = 1 + \alpha + \alpha^2 + \alpha^3 + ...$ : the terms bound the probability of making at least 1 probe, more than 1, more than 2 and so on...

**Estimate the number of table accesses (probes) needed when inserting a new entry into the table.**

Let $\alpha=\frac{n}{m}$ be the load factor of the hash table, and assume each probe independently hits an occupied slot with probability $\alpha$. The probability of a successful insert in one probe is $1-\alpha$. Probability of a successful insert in exactly $t$ probes is $P(\text{success in }t)=(1-\alpha) \alpha^{t-1}$.

Hence, the expected number of probes is $E[t]= (1-\alpha)\sum_{t=1}^{\infty} t \alpha^{t-1}=\frac{1}{1-\alpha}$, since:

$E[t]= (1-\alpha)\sum_{t=1}^{\infty} t \alpha^{t-1}= (1-\alpha)(1 + 2 \alpha + 3 \alpha^2 + 4 \alpha^3...)$

$E[t]= (1 + 2 \alpha + 3 \alpha^2 + 4 \alpha^3...) - (\alpha + 2 \alpha^2 + 3 \alpha^3...)$

$E[t]= (1 + \alpha + \alpha^2 + \alpha^3...) = \frac{1}{1-\alpha}$
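The closed form can be sanity-checked numerically by truncating the series. This is only a sketch: the function name `expected_probes` and the cut-off of 2000 terms are arbitrary choices for illustration.

```python
def expected_probes(alpha, terms=2000):
    """Partial sum of sum_{t>=1} t * (1 - alpha) * alpha**(t - 1)."""
    total = 0.0
    p = 1 - alpha                 # P(success in exactly t probes), starting at t = 1
    for t in range(1, terms + 1):
        total += t * p
        p *= alpha                # each extra probe costs another factor of alpha
    return total


for alpha in (0.25, 0.5, 0.9):
    print(alpha, expected_probes(alpha), 1 / (1 - alpha))
```

The two printed columns should agree closely for each load factor, and the steep growth as $\alpha$ approaches 1 is why open-addressed tables are resized well before they fill up.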
