# Elementary Symbol Tables and Binary Search Trees

A **symbol table** is a key-value pair abstraction where you can *insert* a value with a specified key, and given a key, *search* for the corresponding value.

One example is a domain name server (DNS) lookup, where the domain name is the key and the IP address is the value. There are many other examples in computing. In genomics, you can use a symbol table to find markers in a genome, where the key is a DNA string and the value is its known position.

The basic symbol table API sets up an **associative array abstraction**, which associates one value with each key. One simple implementation is an array where the index is the key (the drawback is that the key must be a small non-negative integer). The two basic operations are `put(key, value)` to insert a new key-value pair, and `get(key)` to retrieve the value associated with the given key.

Other operations you'll probably want include `delete(key)` to remove a key-value pair, `contains(key)` to return a Boolean indicating whether a key is present, `isEmpty()` to check whether the table is empty, `size()` to get the number of key-value pairs, and an iterable `keys()` to return all the keys in the table.

Some general conventions are:
- Values are not `null`
- The `get()` method returns `null` if a key isn't present
- The `put()` method overwrites the old value with the new value if the key is already in the table

Implementation of `contains()`:

```py
def contains(key):
    return get(key) is not None
```

Lazy implementation of `delete()`:

```py
def delete(key):
    put(key, None)
```

## Keys and Values

Values can be any generic type, but keys need one of the following assumptions, depending on the implementation:
- Keys are comparable (is one less than another?)
- Keys are any generic type, and you can test them for equality
- Keys are any generic type, and you can hash them (scramble them to produce an array index)

A best practice is to use immutable types for symbol table keys.
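
As a small illustration of why mutable keys are risky, the sketch below uses Python's built-in `dict` as the symbol table. `MutablePoint` is a hypothetical class invented for this example; it is not part of the notes above.

```py
class MutablePoint:
    """Hypothetical key type whose hash depends on mutable fields."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, MutablePoint) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


table = {}
p = MutablePoint(1, 2)
table[p] = "marker"

# Before mutation, an equal key finds the entry
found_before = MutablePoint(1, 2) in table

# Mutate the key while it is stored in the table
p.x = 99

# The entry still sits where the old hash put it, but the stored key
# no longer equals (1, 2), so a lookup by the original value misses
found_by_value = MutablePoint(1, 2) in table

# A tuple is immutable, so it cannot drift away from its stored hash
safe_table = {(1, 2): "marker"}
found_tuple = (1, 2) in safe_table
```

After the mutation the entry is still in the table, but it can no longer be reached through a key equal to the one that was inserted.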


## Equality Test (Java-specific)

Equality tests in Java should meet the following criteria (for any non-null references `x`, `y`, and `z`):
- Reflexive: `x.equals(x)` is `true`
- Symmetric: `x.equals(y)` iff `y.equals(x)`
- Transitive: if `x.equals(y)` and `y.equals(z)`, then `x.equals(z)`
- Non-null: `x.equals(null)` is `false`

The default implementation in Java is `(x == y)`: do they refer to the same object? The "standard" recipe to check equality for user-defined types is:
- Optimize for reference equality (do they refer to the same object?)
- Check against `null`
- Check that the two objects are of the same type, then cast
- Compare each significant field (for primitives use `==`, for objects use `equals()`, and if a field is an array, apply the comparison to each entry)
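
The recipe is stated for Java, but since the code in these notes is Python, here is a rough Python analog using `__eq__`. The `Date` class and its fields are hypothetical, chosen only to have a few significant fields to compare; Python needs no cast, so that step has no counterpart.

```py
class Date:
    """Hypothetical user-defined key type with three significant fields."""

    def __init__(self, month, day, year):
        self.month = month
        self.day = day
        self.year = year

    def __eq__(self, other):
        # Optimization for reference equality
        if self is other:
            return True
        # Check against None (the analog of Java's null)
        if other is None:
            return False
        # Check that both objects are exactly the same type
        if type(other) is not type(self):
            return False
        # Compare each significant field
        return (self.year, self.month, self.day) == (other.year, other.month, other.day)

    def __hash__(self):
        # Objects that compare equal must also hash equally
        return hash((self.year, self.month, self.day))
```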

## Elementary Implementations

One elementary implementation is to maintain a linked list (ordered or unordered) whose nodes hold key-value pairs. Searching for a key requires scanning the list until you find a match. Inserting also requires a scan: if the key is already present you overwrite its value, and if there's no match you add a new node to the front. The challenge lies in making both search and insert efficient, which is where the previously covered binary search method comes in.

A binary search implementation keeps two parallel arrays: one holding the keys in sorted order and one holding the corresponding values.

**Sequential Search (unordered list)**
- Worst-case (after N inserts): search is $N$, insert is $N$
- Average case: search is $N/2$, insert is $N$
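
A minimal sketch of the unordered linked-list approach described above. The class and attribute names (`SequentialSearchST`, `_first`, `_n`) are assumptions made for illustration.

```py
class SequentialSearchST:
    """Symbol table backed by an unordered singly linked list (sketch)."""

    class _Node:
        def __init__(self, key, val, next_node):
            self.key = key
            self.val = val
            self.next = next_node

    def __init__(self):
        self._first = None
        self._n = 0

    def get(self, key):
        # Scan every node until the key matches
        node = self._first
        while node is not None:
            if node.key == key:
                return node.val
            node = node.next
        return None

    def put(self, key, val):
        # Scan for an existing key first, so insert also costs a full scan
        node = self._first
        while node is not None:
            if node.key == key:
                node.val = val
                return
            node = node.next
        # No match: add the new node to the front of the list
        self._first = self._Node(key, val, self._first)
        self._n += 1

    def __len__(self):
        return self._n
```

Only equality on keys is used here, which is why this implementation cannot support ordered operations.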

**Binary Search (ordered array)**
- Worst-case (after N inserts): search is $\log N$, insert is $N$
- Average case: search is $\log N$, insert is $N/2$
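
The parallel-array approach could look roughly like the sketch below. The names `BinarySearchST`, `_keys`, and `_vals` are assumptions for this example, and Python lists stand in for the arrays.

```py
class BinarySearchST:
    """Symbol table backed by two parallel lists, keys kept sorted (sketch)."""

    def __init__(self):
        self._keys = []
        self._vals = []

    def rank(self, key):
        # Binary search: returns the number of keys strictly less than key
        lo, hi = 0, len(self._keys) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if key < self._keys[mid]:
                hi = mid - 1
            elif key > self._keys[mid]:
                lo = mid + 1
            else:
                return mid
        return lo

    def get(self, key):
        i = self.rank(key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None

    def put(self, key, val):
        i = self.rank(key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = val
            return
        # list.insert shifts every later entry one slot to the right,
        # which is the linear-time step behind the cost of insert
        self._keys.insert(i, key)
        self._vals.insert(i, val)
```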

Keys need to be a type that can be compared in order to be ordered. You can also implement `min()` and `max()` functions to find those keys, or have a priority queue-type setup with `deleteMinKey()` or `deleteMaxKey()` operations.

One major flaw with binary search, though, is that it can't maintain a dynamic table efficiently: every insert has to shift the larger keys over to keep the array sorted.

## Binary Search Trees (BST) Overview

**Binary search trees** are an efficient way to implement symbol tables. A BST is a binary tree in symmetric order. A **binary tree** is made of nodes that contain information, and every node has two links to binary trees that are disjoint from one another: a left subtree and a right subtree. Either link, or both, can be null. Each node is the root of a subtree (the node plus everything below it), and the nodes its links point to directly are that node's children.

In a **binary search tree**, each node has a key, and the keys are in **symmetric order**: a node's key is larger than all the keys in its left subtree and smaller than all the keys in its right subtree. (Note that this is different from a heap, where every node's key is at least as large as the keys of both its children.) The value is stored in the node along with its key.

So a node has four fields: a key, a value, a reference to the left subtree (to smaller keys), and a reference to the right subtree (to larger keys).

The shape of a BST depends on the order in which the keys are inserted. In the best case it's perfectly balanced; in the worst case the tree has depth equal to the number of keys (they arrived in sorted order, so they're all stacked along the left or right side).
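
To make the effect of insertion order concrete, the sketch below builds two trees from the same seven keys and measures their heights. The helper names (`insert`, `height`, `build`) are assumptions for this example, and height is counted here as the number of nodes on the longest root-to-leaf path.

```py
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(node, key):
    # Standard recursive BST insert; returns the root of the updated subtree
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node


def height(node):
    # Number of nodes on the longest root-to-leaf path (0 for an empty tree)
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


sorted_tree = build([1, 2, 3, 4, 5, 6, 7])    # keys arrive in sorted order
balanced_tree = build([4, 2, 6, 1, 3, 5, 7])  # median first, then medians of each half

print(height(sorted_tree))    # 7: every node hangs off the right side
print(height(balanced_tree))  # 3: perfectly balanced
```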

The number of compares for search and insert is equal to 1 + depth of node.

**Proposition** If $N$ distinct keys are inserted into a BST in random order, the expected number of compares for a search/insert is ~$2 \ln N$ (the average path). **Proof** 1-1 correspondence with quicksort partitioning.
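
To connect this with the $1.39 \lg N$ entries in the summary table at the end of these notes, convert the natural logarithm to base 2:

$$2 \ln N = 2 \ln 2 \cdot \lg N \approx 1.39 \lg N$$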

**Proposition** If $N$ distinct keys are inserted in random order, the expected height of the tree is ~$4.311 \ln N$.

However, in the worst case the tree has height $N$. There's an exponentially small chance of this happening when the keys are inserted in random order, but the problem is that the client provides the keys, in whatever order they want.

## Ordered Operations in a BST

You can create ordered symbol table operations that use the BST data structure as the underlying implementation.

The `min` and `max` functions find the smallest and largest keys, respectively, in the table. Just navigate to the left-most or right-most key in the tree.
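
A minimal sketch of that navigation, written as standalone functions over a bare-bones node type. The names `min_key` and `max_key` and the simplified `Node` are assumptions for this example.

```py
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node


def min_key(root):
    # Follow left links until there is no smaller key
    if root is None:
        return None
    node = root
    while node.left is not None:
        node = node.left
    return node.key


def max_key(root):
    # Follow right links until there is no larger key
    if root is None:
        return None
    node = root
    while node.right is not None:
        node = node.right
    return node.key
```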

The `floor` function finds the largest key $\leq$ the given key and the `ceiling` function finds the smallest key $\geq$ the given key. To implement a `floor` function, for example, you generally need to check 3 cases: 1) the key $k$ equals the key at the root (the floor of $k$ is $k$), 2) $k$ is less than the key at the root (the floor of $k$ is in the left subtree), and 3) $k$ is greater than the key at the root (the floor of $k$ is in the right subtree if it contains any key $\leq k$, otherwise it is the key at the root).

A `rank` function would find how many keys are $\lt k$ - you can implement a recursive function to check 3 cases. If $k$ equals the key at the root, the rank (number of keys smaller) is the size of the left subtree. If $k$ is less than the root, you recursively get the `rank` of $k$, but within the left subtree (you initially start with the entire tree at the root). If $k$ is greater than the root, you add 1 (for the root), plus the size of the left subtree (your $k$ is greater than all of those keys), then add the recursively returned `rank` of the key in the right subtree.
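
The three cases can be sketched as follows. This assumes each node stores a `count` field holding the size of its subtree, which is maintained on insert; the field and helper names are assumptions for this example.

```py
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.count = 1  # number of nodes in the subtree rooted here


def size(node):
    return 0 if node is None else node.count


def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    # Recompute the subtree count on the way back up
    node.count = 1 + size(node.left) + size(node.right)
    return node


def rank(node, key):
    # Number of keys in the subtree rooted at node that are < key
    if node is None:
        return 0
    if key < node.key:
        return rank(node.left, key)
    if key > node.key:
        # The root and its whole left subtree are smaller than key
        return 1 + size(node.left) + rank(node.right, key)
    return size(node.left)
```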

To iterate over the keys in order, traverse the left subtree, enqueue the key at the node, then traverse the right subtree (an inorder traversal).
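
A short sketch of that traversal, using a Python list as the queue. The function name `collect_keys` and the simplified `Node` are assumptions for this example.

```py
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node


def collect_keys(node, queue):
    # Left subtree, then this node's key, then right subtree
    if node is None:
        return
    collect_keys(node.left, queue)
    queue.append(node.key)
    collect_keys(node.right, queue)


root = None
for k in [5, 3, 8, 1, 4, 6, 10]:
    root = insert(root, k)

ordered = []
collect_keys(root, ordered)
print(ordered)  # [1, 3, 4, 5, 6, 8, 10]
```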

All operations have an order of growth that's proportional to the height of the BST (proportional to $\log N$ if you insert the keys in random order). Ordered iteration is order $N$.

## Delete Key-Value Pairs in a BST

Truly dynamic symbol tables need the functionality to delete key-value pairs. All implementations have issues with this operation.

There are a few approaches in a BST. The lazy one is to mark the key with a "tombstone" (set its value to `null`) but leave the key in the tree to guide searches. The cost is ~$2 \ln N$ per insert, search, and delete (if keys are inserted in random order), where $N$ is the number of key-value pairs ever inserted into the BST. The downside is that tombstones are never reclaimed, so memory use grows with every key ever inserted rather than with the number of live keys.

To delete the minimum value, you can follow the left subtrees until you hit a node with a `null` left link, replace that node with its right link, then update the subtree counts.

The **Hibbard deletion** (to remove a node $t$ containing key $k$) covers a few cases:
- Case 0 (0 children): delete $t$ and set the parent link to `null`, then update the node counts
- Case 1 (1 child, similar to deleting the min): delete $t$ by replacing the parent link to the node with the link to the child node, then update the node counts
- Case 2 (2 children): find the minimum key in $t$'s right subtree (left-most node with a `null` left link) which becomes the successor key, delete that key and update the right subtree (process in case 1), then remove node $t$ by putting the successor key in its place
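
The delete-min step and the three Hibbard cases can be sketched recursively as below. This is one possible arrangement, not necessarily the one intended for the `delete` stub later in these notes: it relinks the successor node into the deleted node's position rather than copying the key, and it assumes a `count` field for subtree sizes. All names are assumptions for this example.

```py
class Node:
    def __init__(self, key, val):
        self.key = key
        self.val = val
        self.left = None
        self.right = None
        self.count = 1  # number of nodes in the subtree rooted here


def size(node):
    return 0 if node is None else node.count


def put(node, key, val):
    if node is None:
        return Node(key, val)
    if key < node.key:
        node.left = put(node.left, key, val)
    elif key > node.key:
        node.right = put(node.right, key, val)
    else:
        node.val = val
    node.count = 1 + size(node.left) + size(node.right)
    return node


def min_node(node):
    while node.left is not None:
        node = node.left
    return node


def delete_min(node):
    # The minimum has no left link: replace it with its right link
    if node.left is None:
        return node.right
    node.left = delete_min(node.left)
    node.count = 1 + size(node.left) + size(node.right)
    return node


def delete(node, key):
    if node is None:
        return None
    if key < node.key:
        node.left = delete(node.left, key)
    elif key > node.key:
        node.right = delete(node.right, key)
    else:
        # Cases 0 and 1: at most one child, so return the other link
        if node.right is None:
            return node.left
        if node.left is None:
            return node.right
        # Case 2: the successor (min of the right subtree) takes this spot
        t = node
        node = min_node(t.right)
        node.right = delete_min(t.right)
        node.left = t.left
    node.count = 1 + size(node.left) + size(node.right)
    return node


def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)
```

Returning the updated subtree root from each call avoids the need for parent links: the caller reassigns its own left or right link with whatever comes back.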

One issue with Hibbard deletion is that it leads to a lack of tree symmetry: always replacing a deleted node with its successor makes the tree less balanced over time. Researchers found that after a long sequence of random insertions and deletions, the height of the tree grew to roughly $\sqrt{N}$ instead of its original $\log N$. If you switch randomly between using the successor and the predecessor of a node, theoretically you'll get back to $\log N$ tree height, but no one has proved this mathematically. This is a longstanding open problem, similar to merging in place, where no good solution has emerged after 50 years.

So the average cost to delete a node in a BST is $\sqrt{N}$, and if you allow deletes, the order of growth of every operation degrades to $\sqrt{N}$.

```py
class Node:
    def __init__(self, key, val, left=None, right=None, parent=None):
        self.key = key
        self.val = val
        self.left = left
        self.right = right
        self.parent = parent

    def is_root(self):
        return self.parent is None

    def has_left_child(self):
        return self.left is not None

    def has_right_child(self):
        return self.right is not None

    def is_leaf(self):
        return self.left is None and self.right is None

    def is_left_child(self):
        return self.parent is not None and self.parent.left is self

    def is_right_child(self):
        return self.parent is not None and self.parent.right is self

    def has_any_children(self):
        return self.left is not None or self.right is not None

    def has_two_children(self):
        return self.left is not None and self.right is not None

    def update_val(self, key, val, lc, rc):
        # Overwrite this node's contents and re-point the children's parent links
        self.key = key
        self.val = val
        self.left = lc
        self.right = rc
        if self.has_left_child():
            self.left.parent = self
        if self.has_right_child():
            self.right.parent = self
```

```py
class Binary_Search_Tree:
    def __init__(self):
        self.root = None
        # Number of key-value pairs; read it as bst.size or len(bst)
        self.size = 0

    def __len__(self):
        return self.size

    def put(self, key, val):
        # Public method to insert a key-value pair, or overwrite the value
        # if the key is already in the BST
        self.root = self._put(self.root, key, val)

    def _put(self, node_x, key, val):
        # Private method that recursively searches where to place
        # the new key in the BST. Returns the root of the updated subtree
        if node_x is None:
            # Count only brand-new keys, not overwrites
            self.size += 1
            return Node(key, val)
        if key < node_x.key:
            node_x.left = self._put(node_x.left, key, val)
            node_x.left.parent = node_x
        elif key > node_x.key:
            node_x.right = self._put(node_x.right, key, val)
            node_x.right.parent = node_x
        else:
            node_x.val = val
        return node_x

    def get(self, key):
        # Returns the value associated with given key if in table, None otherwise
        node_x = self.root
        while node_x is not None:
            if key < node_x.key:
                node_x = node_x.left
            elif key > node_x.key:
                node_x = node_x.right
            else:
                return node_x.val
        return None

    def floor(self, key):
        # Return the largest key in the BST <= a given key
        node_x = self._floor(self.root, key)
        if node_x is None:
            return None
        return node_x.key

    def _floor(self, node_x, key):
        if node_x is None:
            return None
        if key == node_x.key:
            return node_x
        elif key < node_x.key:
            # Floor is in left subtree
            return self._floor(node_x.left, key)

        # Floor may be in right subtree or is root of that subtree
        node_t = self._floor(node_x.right, key)
        if node_t is not None:
            return node_t
        else:
            return node_x

    def ceiling(self, key):
        # Return the smallest key in the BST >= a given key
        node_x = self._ceiling(self.root, key)
        if node_x is None:
            return None
        return node_x.key

    def _ceiling(self, node_x, key):
        if node_x is None:
            return None
        if key == node_x.key:
            return node_x
        elif key > node_x.key:
            # Ceiling is in right subtree
            return self._ceiling(node_x.right, key)

        # Ceiling may be in the left subtree or is root of that subtree
        node_t = self._ceiling(node_x.left, key)
        if node_t is not None:
            return node_t
        else:
            return node_x

    # TODO: Implement max, min, and delete functionality
    def get_max(self):
        # Return the largest key in the tree
        raise NotImplementedError

    def get_min(self):
        # Return the smallest key in the tree
        raise NotImplementedError

    def delete(self, key):
        # Remove the Node for a given key
        raise NotImplementedError

    def __setitem__(self, key, val):
        self.put(key, val)

    def __getitem__(self, key):
        return self.get(key)

    def __contains__(self, key):
        # Values are never None by convention, so None means "not present"
        return self.get(key) is not None
```

```py
# Test Node and BST functionality
bst = Binary_Search_Tree()

# Add nodes to the BST, value equal key. Check put operation
keys = [5, 3, 8, 1, 4, 6, 10]
for k, v in zip(keys, keys):
    bst.put(k, v)

# Check get operation
for k in keys:
    print('Key: {}, Val: {}'.format(k, bst.get(k)))

print('Tree size: {}'.format(bst.size))

for k in [7, 2, 11]:
    print('Key: {}\nCeiling: {}\nFloor: {}'.format(k, bst.ceiling(k), bst.floor(k)))
```

```
Key: 5, Val: 5
Key: 3, Val: 3
Key: 8, Val: 8
Key: 1, Val: 1
Key: 4, Val: 4
Key: 6, Val: 6
Key: 10, Val: 10
Tree size: 7
Key: 7
Ceiling: 8
Floor: 6
Key: 2
Ceiling: 3
Floor: 1
Key: 11
Ceiling: None
Floor: 10
```


## Summary

The worst case (WC) is after $N$ inserts, and the average case (AC) is after $N$ random inserts.

| Implementation | WC Search | WC Insert | WC Delete | AC Search | AC Insert | AC Delete | Ordered Iteration? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sequential Search (unordered list) | $N$ | $N$ | $N$ | $N/2$ | $N$ | $N/2$ | No |
| Binary Search (ordered array) | $\lg N$ | $N$ | $N$ | $\lg N$ | $N/2$ | $N/2$ | Yes |
| Binary Search Tree (BST) | $N$ | $N$ | $N$ | $1.39 \lg N$ | $1.39 \lg N$ | $\sqrt{N}$ (with Hibbard deletion) | Yes |
| Goal | $\lg N$ | $\lg N$ | $\lg N$ | $\lg N$ | $\lg N$ | $\lg N$ | Yes |
