# Elementary Symbol Tables and Binary Search Trees

A **symbol table** is a key-value pair abstraction where you can *insert* a value with a specified key, and given a key, *search* for the corresponding value.

One example is a Domain Name System (DNS) lookup, where the domain name is the key and the IP address is the value. There are many other examples in computing. In genomics, you can use a symbol table to find markers in a genome, where the key is a DNA string and the value is its known position.

The basic symbol table API provides an **associative array abstraction**, which associates one value with each key. One simple implementation uses an array, where the index is the key (the drawback is that the key must be an integer that is a valid array index). The two basic operations are `put(key, value)` to insert a new key-value pair, and `get(key)` to retrieve the value associated with the given key.
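As a minimal sketch of the array-indexed idea, assuming keys are integers in a fixed range known up front (the class name `IntKeyArrayST` and the `capacity` parameter are hypothetical, chosen for illustration):

```python
class IntKeyArrayST:
    """Hypothetical array-backed table; keys must be ints in range(capacity)."""

    def __init__(self, capacity):
        # One slot per possible key; None marks "no value stored".
        self._values = [None] * capacity

    def _check(self, key):
        # Reject keys that are not valid indices (negative keys would
        # otherwise silently index from the end of a Python list).
        if not 0 <= key < len(self._values):
            raise IndexError(f"key {key} is outside 0..{len(self._values) - 1}")

    def put(self, key, value):
        self._check(key)
        self._values[key] = value

    def get(self, key):
        self._check(key)
        return self._values[key]


ports = IntKeyArrayST(1024)
ports.put(22, "ssh")
ports.put(80, "http")
```

Both operations take constant time, but the table needs one slot for every possible key whether or not it is used.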

Other operations you'll probably want include `delete(key)` to remove a key-value pair, `contains(key)` to report whether a key is in the table, `isEmpty()` to check whether the table is empty, `size()` to get the number of key-value pairs, and an iterable `keys()` to return all the keys in the table.

Some general conventions are:
- Values are not `null`
- The `get()` method returns `null` if a key isn't present
- The `put()` method overwrites the old value with the new value if the key is already in the table

Implementation of `contains()`:

```py
def contains(key):
    # A key counts as present only if it maps to a non-None value.
    return get(key) is not None
```

Lazy implementation of `delete()`:

```py
def delete(key):
    # Lazy deletion: keep the key but mark its value as None.
    put(key, None)
```
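To see how the conventions and the two snippets above fit together, here is a minimal sketch that backs the table with a Python `dict`. The class name `DictST` is hypothetical, and the address is taken from a range reserved for documentation rather than from a real lookup.

```python
class DictST:
    """Hypothetical dict-backed symbol table illustrating the conventions."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # Overwrites the old value if the key is already present.
        self._table[key] = value

    def get(self, key):
        # dict.get returns None when the key is missing.
        return self._table.get(key)

    def contains(self, key):
        return self.get(key) is not None

    def delete(self, key):
        # Lazy deletion: the key stays in the dict with a None value.
        self.put(key, None)


st = DictST()
st.put("www.example.com", "192.0.2.1")
st.put("www.example.com", "192.0.2.2")  # same key: value is replaced
```

Because `None` doubles as the "absent" marker, a lazily deleted key is indistinguishable from one that was never inserted, which is why values themselves must never be `None`.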

## Keys and Values

Values can be any generic type, but each implementation makes one of the following assumptions about keys:
- Assume keys are comparable (is one less than another?)
- Assume keys are any generic type, and you can test them for equality
- Assume keys are any generic type, and you can hash them (scramble a key into an array index)

A best practice is to use immutable types for symbol table keys.
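A small illustration of why immutability matters, using Python's built-in `dict` as the symbol table. The `Coordinate` type is hypothetical; a frozen dataclass is one way to get an immutable, hashable key.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Coordinate:
    """Hypothetical immutable key type."""
    lat: int
    lon: int


table = {Coordinate(1, 2): "depot"}

# An equal key built later finds the same entry.
found = table.get(Coordinate(1, 2))

# A frozen key cannot change after it has been inserted.
try:
    next(iter(table)).lat = 0
    mutated = True
except FrozenInstanceError:
    mutated = False

# A mutable list is rejected as a dict key outright.
try:
    {[1, 2]: "depot"}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```

If a key could change after insertion, its position in an ordered or hashed structure would no longer match its contents, and later searches could miss it.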


## Equality Test (Java-specific)

Equality tests in Java should meet the following criteria (for any non-null references `x`, `y`, and `z`):
- Reflexive: `x.equals(x)` is `true`
- Symmetric: `x.equals(y)` iff `y.equals(x)`
- Transitive: if `x.equals(y)` and `y.equals(z)`, then `x.equals(z)`
- Non-null: `x.equals(null)` is `false`

The default implementation in Java is `(x == y)`, which asks whether both references point to the same object. The "standard" recipe for checking equality of user-defined types is:
- Optimize for reference equality (do they refer to the same object?)
- Check against `null`
- Check that the two objects are of the same type, then cast
- Compare each significant field (for primitives use `==`, for objects use `equals()`, and for an array field apply the comparison to each entry)
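The recipe above is Java-specific, but since the other code in these notes is Python, here is a rough Python analogue using `__eq__`. The `Date` class and its fields are hypothetical, and the cast step has no Python counterpart.

```python
class Date:
    """Hypothetical value type with three significant fields."""

    def __init__(self, month, day, year):
        self.month = month
        self.day = day
        self.year = year

    def __eq__(self, other):
        if other is self:                  # 1. shortcut for reference equality
            return True
        if other is None:                  # 2. nothing is equal to None
            return False
        if type(other) is not type(self):  # 3. must be exactly the same type
            return False
        # 4. compare every significant field
        return (self.month, self.day, self.year) == (
            other.month, other.day, other.year)

    def __hash__(self):
        # Equal objects must hash equally, so hash the same fields.
        return hash((self.month, self.day, self.year))


a = Date(4, 1, 2024)
b = Date(4, 1, 2024)
c = Date(4, 1, 2024)
```

Defining `__hash__` alongside `__eq__` keeps the type usable as a key in hash-based tables, which mirrors the Java rule that `equals()` and `hashCode()` must agree.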

## Elementary Implementations

One elementary implementation is to maintain a linked list (ordered or unordered) whose nodes hold key-value pairs. Searching for a key requires scanning through the keys until you find a match. Inserting a key also requires scanning the list: if the key is found, you overwrite its value; if not, you add a new node to the front. The challenge lies in making both search and insert efficient. Binary search, covered previously, addresses the search half.
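The linked-list approach described above can be sketched as follows. This is a minimal illustration, not a tuned implementation; the names `SequentialSearchST` and `_Node` are assumptions made for the example.

```python
class _Node:
    """One linked-list node holding a key-value pair."""

    def __init__(self, key, value, next_node):
        self.key = key
        self.value = value
        self.next = next_node


class SequentialSearchST:
    """Hypothetical unordered linked-list symbol table."""

    def __init__(self):
        self._first = None
        self._n = 0

    def get(self, key):
        # Scan every node until the key matches.
        node = self._first
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return None  # search miss

    def put(self, key, value):
        # Scan first: an existing key has its value overwritten.
        node = self._first
        while node is not None:
            if node.key == key:
                node.value = value
                return
            node = node.next
        # No match: add a new node at the front of the list.
        self._first = _Node(key, value, self._first)
        self._n += 1

    def size(self):
        return self._n


st = SequentialSearchST()
for i, letter in enumerate("SEARCH"):
    st.put(letter, i)
st.put("S", 99)  # overwrite, list length unchanged
```

Only equality is used on keys here, so they do not need to be comparable.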

Binary search in this setting uses two parallel arrays, one with the keys (kept in sorted order) and one with the corresponding values.
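A minimal sketch of the parallel-array approach, assuming keys support `<` and `>`. The class name `BinarySearchST` and the helper `_rank` are illustrative names, and Python lists stand in for arrays.

```python
class BinarySearchST:
    """Hypothetical ordered-array symbol table using binary search."""

    def __init__(self):
        self._keys = []  # kept in sorted order
        self._vals = []  # _vals[i] belongs to _keys[i]

    def _rank(self, key):
        # Number of keys strictly smaller than `key`, found by binary search.
        lo, hi = 0, len(self._keys) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if key < self._keys[mid]:
                hi = mid - 1
            elif key > self._keys[mid]:
                lo = mid + 1
            else:
                return mid
        return lo

    def get(self, key):
        i = self._rank(key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None

    def put(self, key, value):
        i = self._rank(key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value  # key exists: overwrite
            return
        # list.insert shifts every larger entry one slot to the right,
        # which is where the linear insert cost comes from.
        self._keys.insert(i, key)
        self._vals.insert(i, value)

    def min(self):
        return self._keys[0] if self._keys else None

    def max(self):
        return self._keys[-1] if self._keys else None


st = BinarySearchST()
for i, letter in enumerate("SEARCH"):
    st.put(letter, i)
```

Search needs only a logarithmic number of compares, but each insert of a new key may move a linear number of entries.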

**Sequential Search (unordered list)**
- Worst case (after $N$ inserts): search is $N$, insert is $N$
- Average case (after $N$ random inserts): search hit is $N/2$, insert is $N$

**Binary Search (ordered array)**
- Worst case (after $N$ inserts): search is $\log N$, insert is $N$
- Average case (after $N$ random inserts): search hit is $\log N$, insert is $N/2$

Keeping keys in order requires a key type that can be compared. With ordered keys you can also implement `min()` and `max()` functions to find those keys, or have a priority queue-type setup with `deleteMinKey()` or `deleteMaxKey()` operations.

One major flaw with binary search, though, is that it can't maintain a dynamic table efficiently: inserting a new key may shift up to $N$ entries to keep the arrays sorted.

## Binary Search Trees (BST)

**Binary search trees** are an efficient way to implement symbol tables. A BST is a binary tree in symmetric order. A **binary tree** has nodes that contain information, and every node has two links to disjoint binary trees: a left subtree and a right subtree. Either link, or both, can be null. Each node in the tree is the root of a subtree (the node together with all the nodes below it), and the nodes directly below a given node are that node's children.

A **binary search tree** is in **symmetric order**: each node has a key that is larger than all the keys in its left subtree and smaller than all the keys in its right subtree. (Note that this is different from a heap, where every node's key is at least as large as the keys of both its children.) The value is stored in the node along with its key.

So a node has four fields: a key, a value, a reference to the left subtree (to smaller keys), and a reference to the right subtree (to larger keys).
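As a sketch of that node layout, together with a check of the symmetric-order property, consider the following. The names `Node` and `is_symmetric_order` are hypothetical, and the small trees are built by hand only to exercise the check.

```python
class Node:
    """BST node with the four fields described above."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None   # subtree of smaller keys
        self.right = None  # subtree of larger keys


def is_symmetric_order(node, lo=None, hi=None):
    """True if every key in the subtree lies strictly between lo and hi."""
    if node is None:
        return True
    if lo is not None and node.key <= lo:
        return False
    if hi is not None and node.key >= hi:
        return False
    # Keys on the left must stay below node.key, keys on the right above it.
    return (is_symmetric_order(node.left, lo, node.key)
            and is_symmetric_order(node.right, node.key, hi))


# A valid tree: E < S < X, and R sits between E and S.
good = Node("S", 1)
good.left = Node("E", 2)
good.right = Node("X", 3)
good.left.right = Node("R", 4)

# An invalid tree: T is larger than S but lives in S's left subtree.
bad = Node("S", 1)
bad.left = Node("E", 2)
bad.left.right = Node("T", 3)
```

Note that comparing a node only with its immediate children is not enough; the bounds passed down the recursion enforce the order against every ancestor.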

The shape of a BST depends on the order in which the keys are inserted. In the best case the tree is perfectly balanced. In the worst case the height equals the number of keys: the keys arrive in sorted order, so the nodes form a single chain down the left or right side.

The number of compares for search and insert is equal to 1 + depth of node.
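The following sketch shows search and insert on an unbalanced BST and counts compares, so the "1 + depth" claim can be checked directly. The class name `BST` and the method `get_with_compares` are assumptions made for illustration, and one three-way key comparison per visited node is counted as one compare.

```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None
        self.right = None


class BST:
    """Hypothetical unbalanced binary search tree."""

    def __init__(self):
        self.root = None

    def put(self, key, value):
        self.root = self._put(self.root, key, value)

    def _put(self, node, key, value):
        if node is None:
            return Node(key, value)  # reached a null link: attach here
        if key < node.key:
            node.left = self._put(node.left, key, value)
        elif key > node.key:
            node.right = self._put(node.right, key, value)
        else:
            node.value = value  # key already present: overwrite
        return node

    def get_with_compares(self, key):
        """Return (value, number of nodes whose key was compared)."""
        node, compares = self.root, 0
        while node is not None:
            compares += 1
            if key < node.key:
                node = node.left
            elif key > node.key:
                node = node.right
            else:
                return node.value, compares
        return None, compares  # search miss


st = BST()
for i, letter in enumerate("SEXARCHM"):
    st.put(letter, i)
```

With this insertion order, `S` is the root (depth 0) and `M` ends up at depth 4 along the path S, E, R, H, M, so finding `M` takes 5 compares.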

**Proposition** If $N$ distinct keys are inserted into a BST in random order, the expected number of compares for a search/insert is ~$2 \ln N$ (the average path). **Proof** There is a 1-1 correspondence with quicksort partitioning: the root plays the role of the partitioning element.
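The correspondence can be made concrete with a recurrence. As a sketch, let $I_N$ denote the expected internal path length (the sum of all node depths) of a BST built from $N$ keys in random order, and let $H_N$ be the $N$-th harmonic number; both symbols are introduced here for illustration. The root is equally likely to have any rank $k$, and each of the other $N - 1$ nodes sits one level deeper than it does within its own subtree:

$$
I_N = (N - 1) + \frac{1}{N} \sum_{k=1}^{N} \left( I_{k-1} + I_{N-k} \right), \qquad I_0 = I_1 = 0
$$

This has the same form as the quicksort recurrence, and it solves to

$$
I_N = 2(N + 1) H_N - 4N \sim 2N \ln N
$$

so a search hit costs $1 + I_N / N \sim 2 \ln N$ compares on average.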

**Proposition** If $N$ distinct keys are inserted in random order, the expected height of the tree is ~$4.311 \ln N$.

However, in the worst case the tree has height $N$. The chance of this happening is exponentially small when the keys are inserted in random order, but the client provides the keys in whatever order it chooses, so random order cannot be assumed.
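A small experiment makes the gap visible. This is a sketch under stated assumptions: the helper names `insert`, `build`, and `height` are hypothetical, height is counted in nodes along the longest path, and insertion is iterative so a degenerate tree cannot hit Python's recursion limit.

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key iteratively and return the (possibly new) root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # duplicate key: nothing to add


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


def height(root):
    """Number of nodes on the longest root-to-leaf path, level by level."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [child for node in frontier
                    for child in (node.left, node.right) if child is not None]
    return levels


n = 1000
sorted_height = height(build(range(n)))  # keys arrive in increasing order

shuffled = list(range(n))
random.seed(0)  # fixed seed so the run is repeatable
random.shuffle(shuffled)
random_height = height(build(shuffled))
```

Sorted input produces a chain whose height equals `n`, while the shuffled input yields a tree that is far shallower.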