# Hash Tables

A **hash table** is a data structure that implements a symbol table with constant-time average performance for the core operations (search and insert), provided that the search keys are standard data types or have a simply defined hash function. Unlike BSTs, it reaches the data by computing an array index from the key rather than by comparing keys, so it doesn't support ordered operations.

## Hashing and Hash Functions

The basic idea of **hashing** is to store items in a *key-indexed table*, where the index is a function of the key. A **hash function** is a method for computing the array index from a given key.

Issues:

- Computing the hash function
- Equality test: method for checking whether two keys are equal
- Collision resolution: algorithm and data structure to handle two keys that hash to the same array index

Hashing is a classic space-time tradeoff. With no limit on space, you can use a trivial hash function with the key itself as the index. With no limit on time, you can hash everything to the same place (trivial collision resolution) and do a sequential search. With limits on both space and time, you use hashing to balance the two.

**Computing the hash function**

The ideal goal is to scramble the keys uniformly to produce a table index: the function should be efficient to compute, and each table index should be equally likely for each key.

There are usually built-in methods in your language of choice for doing this with different data types.

**Modular hashing:**

- Hash code: an integer between $-2^{31}$ and $2^{31} - 1$
- Hash function: you want it to produce an integer between $0$ and $M - 1$ (for use as the array index). $M$ is typically a prime or a power of 2
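
The two steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function names `string_hash_code` and `table_index` are hypothetical, and the multiplier-31 polynomial hash is just one common choice (it is the scheme Java's `String.hashCode` uses), wrapped by hand to a signed 32-bit integer so that it matches the hash-code range given above.

```python
def string_hash_code(s: str) -> int:
    """Hash code: polynomial hash (multiplier 31) wrapped to a signed 32-bit int."""
    h = 0
    for ch in s:
        # Keep only the low 32 bits to imitate fixed-width integer overflow.
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed: range -2^31 .. 2^31 - 1.
    return h - (1 << 32) if h >= (1 << 31) else h


def table_index(hash_code: int, m: int) -> int:
    """Hash function: map a signed 32-bit hash code to an index in 0 .. m - 1."""
    # Masking off the sign bit makes the value non-negative before the modulus.
    return (hash_code & 0x7FFFFFFF) % m


if __name__ == "__main__":
    M = 97  # assumed table size (a prime)
    for key in ["apple", "banana", "cherry"]:
        code = string_hash_code(key)
        print(key, code, table_index(code, M))
```

Python's `%` already returns a non-negative result for a positive modulus, so the mask is not strictly required here; it is kept because in languages where the remainder can be negative, skipping it would produce an invalid array index.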

The **uniform hashing assumption** is that each key is equally likely to hash to any integer between $0$ and $M - 1$. Think of throwing balls uniformly at random into $M$ bins. The **birthday problem** asks when to expect the first collision (the first time two people in a room share a birthday): you expect two balls in the same bin after ~$\sqrt{\pi M / 2}$ tosses. The **coupon collector** problem asks when every bin has been hit: you expect every bin to have $\geq 1$ ball after ~$M \ln M$ tosses.
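
The bins-and-balls estimates can be checked with a small simulation. The sketch below assumes $M = 365$ and uses Python's `random` module as a stand-in for a uniform hash function; the helper names are hypothetical. Both formulas are leading-order estimates, so the simulated averages should land near them rather than on them. In particular, the exact coupon collector expectation is $M H_M$ (with $H_M$ the $M$-th harmonic number), which runs somewhat above $M \ln M$ for moderate $M$.

```python
import math
import random


def tosses_until_first_collision(m: int, rng: random.Random) -> int:
    """Toss balls into m bins; return the toss count at the first repeated bin."""
    seen = set()
    tosses = 0
    while True:
        b = rng.randrange(m)
        tosses += 1
        if b in seen:
            return tosses
        seen.add(b)


def tosses_until_all_bins_hit(m: int, rng: random.Random) -> int:
    """Toss balls into m bins; return the toss count once every bin is non-empty."""
    seen = set()
    tosses = 0
    while len(seen) < m:
        seen.add(rng.randrange(m))
        tosses += 1
    return tosses


def average(experiment, m: int, trials: int, seed: int = 0) -> float:
    """Average an experiment over several trials with a seeded generator."""
    rng = random.Random(seed)
    return sum(experiment(m, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    M = 365
    print("birthday estimate:", math.sqrt(math.pi * M / 2))
    print("birthday simulated:", average(tosses_until_first_collision, M, 2000))
    print("coupon estimate:", M * math.log(M))
    print("coupon simulated:", average(tosses_until_all_bins_hit, M, 200))
```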

## Separate Chaining

**Separate chaining** is a collision resolution strategy that makes use of the elementary linked list data structure. A collision occurs when two distinct keys hash to the same index.

The **birthday problem** says that avoiding collisions altogether takes a huge amount of memory (a table size quadratic in the number of keys), which is too much in practice. The **coupon collector** and **load balancing** problems say that collisions will be spread fairly evenly across the table. So, how do you handle collisions efficiently?

Separate chaining is one solution that uses an array of $M \lt N$ linked lists:

- Hash: map key to integer $i$ between $0$ and $M - 1$
- Insert: put at front of $i^{th}$ chain (if not already there)
- Search: need to search only $i^{th}$ chain
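
The three operations above can be sketched as a small Python class. This is an illustrative sketch under stated assumptions, not a reference implementation: the class and method names are hypothetical, the table size `m` is fixed (no resizing), Python's built-in `hash()` stands in for the hash code, and inserting an existing key is treated as an update of its value.

```python
class _Node:
    """One link in a chain: a key-value pair plus a reference to the next node."""

    def __init__(self, key, value, next_node):
        self.key = key
        self.value = value
        self.next = next_node


class SeparateChainingHashTable:
    def __init__(self, m: int = 97):
        self.m = m                   # number of chains (assumed fixed)
        self.chains = [None] * m     # chains[i] is the head node of chain i

    def _hash(self, key) -> int:
        # Hash: map key to an integer i between 0 and m - 1.
        # Python's % is non-negative for a positive modulus.
        return hash(key) % self.m

    def get(self, key):
        # Search: only the i-th chain needs to be scanned.
        node = self.chains[self._hash(key)]
        while node is not None:
            if node.key == key:      # equality test on keys
                return node.value
            node = node.next
        return None                  # key not present

    def put(self, key, value) -> None:
        i = self._hash(key)
        node = self.chains[i]
        while node is not None:
            if node.key == key:      # already there: update in place
                node.value = value
                return
            node = node.next
        # Insert: put the new node at the front of the i-th chain.
        self.chains[i] = _Node(key, value, self.chains[i])


if __name__ == "__main__":
    table = SeparateChainingHashTable(m=5)
    for word in ["S", "E", "A", "R", "C", "H"]:
        table.put(word, len(word))
    print(table.get("A"), table.get("Z"))
```

Inserting at the front keeps `put` simple once the chain has been scanned for a duplicate; with $N$ keys spread over $M$ chains, each scan touches about $N / M$ nodes on average under the uniform hashing assumption.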