# Hash Tables

A **hash table** is a data structure that implements symbol tables with constant-time performance for core operations, provided that search keys are standard data types or simply defined. It accesses data differently from BSTs and doesn't support ordered operations.

## Hashing and Hash Functions

The basic idea of **hashing** is to save items in a *key-indexed table* where the index is a function of the key. A **hash function** is a method for computing the array index from the given key.

Issues:

- Computing the hash function
- Equality test: method for checking whether two keys are equal
- Collision resolution: algorithm and data structure to handle two keys that hash to the same array index

Hashing is a classic space-time tradeoff. With no space limitation, you use a trivial hash function with the key as the index. With no time limitation, you hash everything to the same place (use a trivial collision resolution) and do sequential search. In the real world with space and time limitations, you use hashing.

**Computing the hash function**

The ideal is to scramble the keys uniformly to produce a table index: the function should be efficient to compute, and each table index should be equally likely for each key. Throughout, the array has size $M$ and there are $N$ keys to insert.

There are usually built-in methods in your language of choice for doing this with different data types.

**Modular hashing:**

- Hash code: an integer between $-2^{31}$ and $2^{31} - 1$
- Hash function: you want it to produce an integer between $0$ and $M - 1$ (for use as the array index). $M$ is typically a prime or a power of 2

The math that makes this work is to hash the key, clear the sign bit with a bitwise `and` against $2^{31} - 1$ (`0x7fffffff` in hexadecimal), then take the result mod $M$: `(hash(key) & 0x7fffffff) % M`.
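A minimal sketch of modular hashing in Python, assuming the built-in `hash()` as the hash code. The helper name `bucket_index` and the table size 97 are illustrative choices, not part of the notes above.

```python
def bucket_index(key, M):
    """Map any hashable key to an array index in [0, M - 1]."""
    # hash() may return a negative integer. Masking with 0x7fffffff
    # (2**31 - 1) keeps the low 31 bits, which are always non-negative.
    return (hash(key) & 0x7fffffff) % M


M = 97  # table size; a prime picked only for illustration
for key in ["apple", 42, (3, 4), -7]:
    print(repr(key), "->", bucket_index(key, M))
```

String hashes vary between Python processes, so the index printed for `"apple"` changes from run to run, but every index stays within `0` to `M - 1`.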

The **uniform hashing assumption** is that each key is equally likely to hash to any integer between $0$ and $M - 1$. Think of throwing balls uniformly at random into $M$ bins. The **birthday problem** (how many people must enter a room before two of them share a birthday) says you expect two balls in the same bin after ~$\sqrt{\pi M / 2}$ tosses. The **coupon collector** problem says you expect every bin to have $\geq 1$ ball after ~$M \ln M$ tosses. And the **load balancing** result says that after $M$ tosses, you expect the most loaded bin to have $\Theta (\log M / \log \log M)$ balls.
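The balls-and-bins estimates can be checked empirically. The sketch below is a small simulation, assuming Python's `random` module as the source of uniform bins; the function names, the seed, and the choice of $M = 365$ with 1000 trials are arbitrary. Because the formulas are leading-order approximations, the simulated averages should land near them rather than match exactly.

```python
import math
import random


def tosses_until_collision(M, rng):
    """Toss balls into M bins; return the toss on which a bin is hit twice."""
    seen = set()
    tosses = 0
    while True:
        tosses += 1
        b = rng.randrange(M)
        if b in seen:
            return tosses
        seen.add(b)


def tosses_until_all_filled(M, rng):
    """Toss balls into M bins; return the toss on which the last empty bin fills."""
    seen = set()
    tosses = 0
    while len(seen) < M:
        tosses += 1
        seen.add(rng.randrange(M))
    return tosses


rng = random.Random(0)  # fixed seed so repeated runs agree
M, trials = 365, 1000
avg_collision = sum(tosses_until_collision(M, rng) for _ in range(trials)) / trials
avg_filled = sum(tosses_until_all_filled(M, rng) for _ in range(trials)) / trials

print("first collision:", avg_collision, "vs", math.sqrt(math.pi * M / 2))
print("all bins filled:", avg_filled, "vs", M * math.log(M))
```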

## Separate Chaining

**Separate chaining** is a collision resolution strategy that makes use of the linked list data structure. A collision happens when two distinct keys hash to the same index.

The **birthday problem** shows that avoiding collisions altogether would take a table quadratic in the number of keys ($M \approx N^2$), which is too much memory in practice. The **coupon collector** and **load balancing** results show that collisions are spread fairly evenly across the table.

So, how do you handle collisions efficiently?

Separate chaining is one solution that uses an array of $M \lt N$ linked lists:

- Hash: map key to integer $i$ between $0$ and $M - 1$
- Insert: put at front of $i^{th}$ chain (if not already there)
- Search: need to search only $i^{th}$ chain

**Proposition:** under the uniform hashing assumption, the probability that the number of keys in a list is within a constant factor of $N/M$ is extremely close to $1$. The proof shows that the list size follows a binomial distribution.

The consequence is that the number of probes for search/insert is proportional to $N/M$:

- $M$ is too large $\implies$ too many empty chains
- $M$ is too small $\implies$ chains are too long
- Typical choice: $M \sim N/5$ for constant-time operations
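The hash, insert, and search steps for separate chaining can be sketched as follows. This is one possible implementation, not a reference one: the class and method names are invented here, the default $M = 97$ is arbitrary, and there is no resizing or delete.

```python
class _Node:
    """One link in a chain: a key-value pair plus a pointer to the next link."""

    def __init__(self, key, value, next_node):
        self.key = key
        self.value = value
        self.next = next_node


class SeparateChainingHashST:
    def __init__(self, M=97):
        self.M = M
        self.chains = [None] * M  # chains[i] is the head of the i-th linked list
        self.count = 0

    def _hash(self, key):
        return (hash(key) & 0x7fffffff) % self.M

    def put(self, key, value):
        i = self._hash(key)
        node = self.chains[i]
        while node is not None:  # search only the i-th chain
            if node.key == key:
                node.value = value  # key already present: overwrite
                return
            node = node.next
        # Key not found: insert at the front of the i-th chain
        self.chains[i] = _Node(key, value, self.chains[i])
        self.count += 1

    def get(self, key):
        node = self.chains[self._hash(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return None  # search miss

    def __len__(self):
        return self.count


st = SeparateChainingHashST(M=5)
for k in range(20):  # N = 20 keys in M = 5 chains, so chains average 4 nodes
    st.put(k, k * k)
print(len(st), st.get(7), st.get(99))
```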

## Linear Probing

**Linear probing** is another popular collision resolution method. It uses **open addressing**, so when a key collides, it finds the next empty slot and puts the data there. Instead of keeping linked lists in positions along an array, you allocate a single array with more slots than there are keys.

The idea is to first hash the key to map it to an integer $i$ between $0$ and $M-1$. Then insert the key at index $i$ in the array if that slot is free; if not, try $i+1$, $i+2$, etc., wrapping around at the end of the array. To search, hash the key and check index $i$; if that slot is occupied but there's no match, try $i+1$, $i+2$, etc. until you find the key or hit an empty slot. A group of keys in consecutive slots in the array is called a cluster.

The size of the array $M$ *must be* greater than the number of key-value pairs $N$. Implementations usually include array re-sizing code to adjust $M$ for the size of the data.

In the earlier days of computing, when memory was at a premium, a lot of care went into avoiding space wasted on empty slots or on the links of linked lists. One observation was that new keys were likely to hash into the middle of big clusters, and big clusters were likely to get longer.

**Knuth's parking problem** frames it as cars parking on a one-way street with $M$ spaces: if every car starts looking for a space at a random position, how far does it have to go to find one? Knuth showed that when half the spaces are occupied ($M/2$ cars), the mean displacement is ~$3/2$. But when the street is full ($M$ cars), the mean displacement is ~$\sqrt{\pi M/8}$.

**Proposition:** under the uniform hashing assumption, the average number of probes in a linear probing hash table of size $M$ that contains $N = \alpha M$ keys is:

**Search hit:**
$$
\sim \frac{1}{2} \bigg( 1 + \frac{1}{1-\alpha} \bigg)
$$

**Search miss or Insert:**
$$
\sim \frac{1}{2} \bigg( 1 + \frac{1}{(1-\alpha)^2} \bigg)
$$

**Parameters:**

- $M$ too large $\implies$ too many empty array entries
- $M$ too small $\implies$ search time blows up
- Typical choice: $\alpha = N/M \sim \frac{1}{2}$ so the number of probes for a hit is ~3/2 and for search miss/insert is ~5/2
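To see how quickly the cost grows with the load factor, the two formulas from the proposition can be evaluated directly. The function names below are made up for this illustration.

```python
def probes_hit(alpha):
    """Approximate average probes for a search hit at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha))


def probes_miss(alpha):
    """Approximate average probes for a search miss or insert at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)


for alpha in (0.25, 0.5, 0.75, 0.9):
    print(f"alpha={alpha}: hit ~{probes_hit(alpha):.2f}, miss ~{probes_miss(alpha):.2f}")
```

At $\alpha = 1/2$ this gives 3/2 and 5/2 probes. At $\alpha = 0.9$ the estimates rise to about 5.5 and 50.5, which is why implementations resize well before the table fills.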

```python
class LinearProbingHashST:
    def __init__(self, M):
        self.M = M                  # number of slots in the table
        self.key_arr = [None] * M   # parallel arrays of keys and values
        self.vals_arr = [None] * M
        self.count = 0              # number of distinct keys stored

    def hash_function(self, key):
        return (hash(key) & 0x7fffffff) % self.M

    def put(self, key, value):
        # Insert at the hashed index, or at the next free slot after it
        i = self.hash_function(key)
        while self.key_arr[i] is not None and self.key_arr[i] != key:
            i = (i + 1) % self.M
        if self.key_arr[i] is None:
            # New key: keep at least one slot empty so every probe loop ends
            if self.count == self.M - 1:
                raise RuntimeError("table is full; resize before inserting")
            self.key_arr[i] = key
            self.count += 1  # count only new keys, not overwrites
        self.vals_arr[i] = value

    def get(self, key):
        i = self.hash_function(key)
        # Terminates because put() always leaves one empty slot
        while self.key_arr[i] is not None:
            if key == self.key_arr[i]:
                return self.vals_arr[i]
            i = (i + 1) % self.M
        return None

    def __contains__(self, key):
        # Note: a key stored with value None is reported as absent
        return self.get(key) is not None

    def __str__(self):
        return str(self.key_arr)

    def __len__(self):
        return self.count
```

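```python
lin_probe = LinearProbingHashST(11)
letters = ['E', 'S', 'T', 'Q', 'A', 'B', 'Z', 'H', 'P']
for letter in letters:
    # Store each key's home slot as its value so displacement is visible
    lin_probe.put(letter, lin_probe.hash_function(letter))
print(lin_probe)
for letter in letters:
    print(lin_probe.get(letter), end=", ")
```

Output from one run (Python randomizes string hashes per process unless `PYTHONHASHSEED` is set, so the slots differ between runs):

```
['B', 'P', None, 'E', 'S', 'T', 'H', 'A', 'Z', 'Q', None]
3, 4, 4, 9, 7, 0, 8, 3, 0, 
```

Here `'H'` hashes to slot 3 but lands in slot 6, because slots 3 through 5 already hold the cluster `'E'`, `'S'`, `'T'`.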

## Hash Table Context

Hashing is widely used and appears in a lot of different contexts.

In Java, computing the hash function for long strings took time, so an early version (1.1) only examined 8-9 evenly spaced characters. The benefit was that it saved time on the arithmetic of the hash function. This highlights a consideration with hashing: does the cost of computing the hash function for a complicated key exceed the cost of searching with a simpler structure (like a binary search tree)? The downside in this example was great potential for bad collision patterns in typical data, like URLs that differ only in a few characters.
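The collision risk of sampling characters can be illustrated with a small Python sketch. This is not the historical Java source: both functions, the multiplier 31, the 32-bit mask, and the example URLs are assumptions chosen to show the idea.

```python
def full_hash(s):
    """Horner's-method hash over every character, kept to 32 bits."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def sampled_hash(s):
    """Same arithmetic, but only over roughly 8 evenly spaced characters."""
    skip = max(1, len(s) // 8)
    h = 0
    for i in range(0, len(s), skip):
        h = (31 * h + ord(s[i])) & 0xFFFFFFFF
    return h


a = "http://www.example.com/page/0001"
b = "http://www.example.com/page/0002"
# The strings are 32 characters long, so sampling reads indices 0, 4, ..., 28
# and never looks at the final character, the only place they differ.
print("sampled hashes collide:", sampled_hash(a) == sampled_hash(b))
print("full hashes collide:   ", full_hash(a) == full_hash(b))
```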

When you need guaranteed performance in real world scenarios (aircraft control or controlling someone's pacemaker), you can't always rely on the uniform hashing assumption holding in practice. In these situations, you may want the performance guarantee of a red-black search tree.

Denial of Service (DoS) attacks on the web can exploit hashing: an attacker who knows the hash function can craft many keys that all hash to the same slot, which degrades the table to sequential search.
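A small sketch of how such colliding keys might be produced against modular hashing. It assumes CPython, where `hash(n) == n` for small non-negative integers, and a hypothetical table size `M` that the attacker knows.

```python
M = 1024  # hypothetical table size known to the attacker


def bucket_index(key, M):
    return (hash(key) & 0x7fffffff) % M


# Every multiple of M is congruent to 0 mod M, so all of these keys collide.
crafted_keys = [k * M for k in range(1000)]
slots = {bucket_index(key, M) for key in crafted_keys}
print(len(crafted_keys), "keys land in", len(slots), "slot(s)")
```

With every key in one chain or one cluster, each operation costs time proportional to the number of keys instead of a constant.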

Hashing is extremely important in e-commerce - they use **one-way hash functions**, where it's hard to find a key that will hash to a desired value (or two keys that hash to the same value). Examples include `SHA-2`, `WHIRLPOOL`, and `RIPEMD-160`. They're used as a digital fingerprint, message digest, or for storing passwords. Unfortunately, they're also too expensive to practically use in symbol table implementations.

### Separate Chaining vs. Linear Probing

**Separate Chaining**
- Easier to implement delete functionality
- Performance degrades gracefully
- Clustering is less sensitive to a poorly designed hash function

**Linear Probing**
- Less wasted space
- Better cache performance (for huge tables)

Classic considerations are how to implement delete and how to resize.

Some improved variations:

**Two-probe Hashing (separate chaining variation)**
- Hash to two positions, insert key in shorter of the two chains
- Reduces expected length of the longest chain to $\log \log N$

**Double Hashing (linear-probing variation)**
- Use linear probing, but skip colliding slots by a variable amount, not just 1 each time
- Effectively eliminates clustering
- Can allow table to become nearly full
- More difficult to implement delete

**Cuckoo Hashing (linear-probing variation)**
- Hash key to two positions; insert key into either position; if occupied, reinsert displaced key into its alternative position (and recur)
- Constant worst-case time for search

### Hash Tables vs. Balanced Search Trees

**Hash Tables**
- Simpler to code (if you don't have to create the hash function)
- No effective alternative for unordered keys (if you don't have order in the keys at all, you have to use hashing because you don't have the necessary `compareTo()` function to use a BST)
- Faster for simple keys (a few arithmetic operations versus $\log N$ compares)
- Can have better system support for strings (e.g. in Java, with cached hash code)

**Balanced Search Trees**
- Stronger performance guarantee
- Support for ordered symbol table operations
- Easier to implement `compareTo()` function correctly than it is to do `equals()` and `hashCode()`

Most systems include both, so programmers can decide which works better for their specific application. Generally, the main reason to use a hash table over a red-black BST is that you get better performance in practice on typical inputs.

## Symbol Table Applications

### Sets

A mathematical **set** is a collection of distinct keys. The data structure is even simpler than symbol tables because there's no associated value, just keys. The implementation is just to remove any reference to "value" in any of the symbol table structures.

Another application is an **exception filter**, where you read in a list of words from one file, then print out all the words from standard input that are {in, not in} the given list. Useful in spell checkers (identify misspelled words), in a browser (mark visited pages, block sites), or with credit cards (check for stolen cards).
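A minimal sketch of an exception filter using Python's built-in `set`, which is itself hash-based. The function name, the `keep_listed` flag, and the sample words are invented for this example; a real client would read the word list from a file and the words from standard input.

```python
def exception_filter(word_list, words, keep_listed=True):
    """Return the words that are in word_list (or not in it, if keep_listed is False)."""
    listed = set(word_list)  # keys only, no associated values
    return [w for w in words if (w in listed) == keep_listed]


dictionary = ["hash", "table", "key", "value"]
text = ["hash", "tabel", "key", "valeu"]

print(exception_filter(dictionary, text, keep_listed=True))   # words found in the list
print(exception_filter(dictionary, text, keep_listed=False))  # candidate misspellings
```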

### Dictionary Clients

An example of a dictionary client is to create an application that builds a symbol table off a CSV file, which takes three arguments: the file name, an integer for which column/field to use as the key, and an integer for which one to use as the value.

For example, take a CSV file with website domain names in one column and IP addresses in the other. You'd run the program, named "LookupCSV", from the command line with: `python LookupCSV ip.csv 0 1`.

Since ordered operations (such as rank and ordered iteration) are not needed, a hash table implementation (such as linear probing) is most suitable for this example.
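The core of such a client might look like the sketch below, which uses Python's `csv` module and a `dict` (a hash table) as the symbol table. The function name `build_lookup` and the sample rows are made up; an in-memory `io.StringIO` stands in for the CSV file so the example runs on its own.

```python
import csv
import io


def build_lookup(lines, key_field, val_field):
    """Build a symbol table from CSV rows, using the given columns as key and value."""
    table = {}
    for row in csv.reader(lines):
        table[row[key_field]] = row[val_field]
    return table


# Made-up data standing in for a file of domain names and IP addresses
sample = io.StringIO("www.example.com,192.0.2.10\nwww.example.org,192.0.2.20\n")
by_domain = build_lookup(sample, 0, 1)
print(by_domain.get("www.example.com"))

sample.seek(0)  # rewind and swap the columns for a reverse lookup
by_ip = build_lookup(sample, 1, 0)
print(by_ip.get("192.0.2.20"))
```

In a command-line version, the file name and the two column numbers would come from `sys.argv`, and `lines` would be an open file.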

### Indexing Clients

Another common function easily handled by symbol tables is indexing.

For file indexing, the goal is to create an index for a list of specified files so that you can efficiently find all the files that contain a given search query string. Spotlight or Find programs on your computer do this.

You can implement this with a symbol table where the key is the query string and the value is the set of files containing that string.
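A minimal sketch of file indexing, assuming whitespace-separated words as the query strings. File contents are passed in as a dictionary of made-up names and text so the example does not depend on the filesystem; `build_index` is a name invented here.

```python
from collections import defaultdict


def build_index(files):
    """Map each word to the set of file names whose text contains it.

    files: dict of file name -> text content.
    """
    index = defaultdict(set)
    for name, text in files.items():
        for word in text.split():
            index[word].add(name)
    return index


files = {
    "a.txt": "hash tables use a hash function",
    "b.txt": "red black trees keep keys in order",
    "c.txt": "a hash code is an integer",
}
index = build_index(files)
print(sorted(index["hash"]))
print(sorted(index["keys"]))
```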

### Sparse Vectors

Sparse vectors in math present another way to apply symbol tables. When there are a lot of zero entries in a matrix ("sparse"), the standard dot product implementation (taking the sum product of a matrix row with a vector column) isn't as efficient as it could be.

For a vector, the classic representation is a 1D array, where every element is saved in the array. You have constant time access to elements but the space is proportional to $N$.

The symbol table representation for a vector (a **sparse vector**) uses the index as the key and the entry as the value for only non-zero entries. It supports efficient iteration, and the space used is proportional to the number of non-zero entries. You can use a hash table because the order in which you process the entries isn't important; you just want the non-zero values.

For a matrix, the classic representation is a 2D array, where each row of the matrix is saved as an array, and the space is proportional to $N^2$. Instead, you can use a sparse matrix representation where each row of the matrix is a sparse vector. You have efficient access to the elements and the space is proportional to the number of non-zero entries (plus $N$).
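The sparse representation can be sketched with a `dict` per vector. The class below is an illustrative assumption rather than a reference implementation: the names `SparseVector`, `put`, `get`, and `dot` are chosen here, and the matrix is simply a list of sparse rows.

```python
class SparseVector:
    def __init__(self, n):
        self.n = n          # logical length of the vector
        self.entries = {}   # index -> value, non-zero entries only

    def put(self, i, value):
        if value == 0:
            self.entries.pop(i, None)  # never store zeros
        else:
            self.entries[i] = value

    def get(self, i):
        return self.entries.get(i, 0)

    def dot(self, dense):
        # Work is proportional to the number of non-zero entries, not n
        return sum(value * dense[i] for i, value in self.entries.items())


# Sparse matrix as a list of sparse rows:
# [[0, 2, 0],
#  [0, 0, 0],
#  [1, 0, 4]]
rows = [SparseVector(3) for _ in range(3)]
rows[0].put(1, 2)
rows[2].put(0, 1)
rows[2].put(2, 4)

x = [1, 2, 3]
product = [row.dot(x) for row in rows]  # matrix-vector multiplication
print(product)
```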

## Search Tree Summary

The worst case (WC) is after $N$ inserts, and the average case (AC) is after $N$ random inserts.

The height of any red-black BST on $n$ keys (regardless of the order of insertion) is guaranteed to be between $\log_2 n$ and $2 \log_2 n$.

| Implementation | WC Search | WC Insert | WC Delete | AC Search | AC Insert | AC Delete | Ordered Iteration? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sequential Search (unordered list) | $N$ | $N$ | $N$ | $N/2$ | $N$ | $N/2$ | No |
| Binary Search (ordered array) | $\lg N$ | $N$ | $N$ | $\lg N$ | $N/2$ | $N/2$ | Yes |
| Binary Search Tree (BST) | $N$ | $N$ | $N$ | $1.39 \lg N$ | $1.39 \lg N$ | ? | Yes |
| 2-3 Tree | $c \lg N$ | $c \lg N$ | $c \lg N$ | $c \lg N$ | $c \lg N$ | $c \lg N$ | Yes |
| Red-Black BST | $2 \lg N$ | $2 \lg N$ | $2 \lg N$ | $1.00 \lg N$\* | $1.00 \lg N$\* | $1.00 \lg N$ | Yes |
| Hash: Separate Chaining | $\lg N$\*\* | $\lg N$\*\* | $\lg N$\*\* | $3$-$5$\*\* | $3$-$5$\*\* | $3$-$5$\*\* | No |
| Hash: Linear Probing | $\lg N$\*\* | $\lg N$\*\* | $\lg N$\*\* | $3$-$5$\*\* | $3$-$5$\*\* | $3$-$5$\*\* | No |

\* Exact coefficient unknown but extremely close to 1  
\*\* Under uniform hashing assumption