# **Chapter 7: Hash Tables**

> *"Hash tables are the most useful data structure ever invented. If you can only learn one data structure, learn hash tables."* — Unknown

---

## **7.1 Introduction**

A **hash table** (or hash map) is a data structure that implements an associative array, mapping keys to values. With a good hash function and a bounded load factor, it provides **O(1)** average-case insertions, deletions, and lookups, which makes it one of the most practically important data structures in computer science.

Unlike arrays that use integer indices or trees that use comparisons, hash tables use a **hash function** to compute an index from the key, allowing direct access to the storage location.

```
┌─────────────────────────────────────────────────────────────────────┐
│                    HASH TABLE CONCEPT                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Key Concept: Direct Addressing via Hash Function                   │
│                                                                      │
│  ┌──────────┐      Hash Function       ┌──────────┐                 │
│  │   Key    │ ───────────────────────► │  Index   │                 │
│  │ "Alice"  │                          │    3     │                 │
│  └──────────┘                          └────┬─────┘                 │
│                                             │                        │
│                                             ▼                        │
│                                      ┌──────────┐                   │
│                                      │  Bucket  │                   │
│                                      │  Index 3 │                   │
│                                      │  (Data)  │                   │
│                                      └──────────┘                   │
│                                                                      │
│  Without hash collisions: O(1) lookup time                          │
│  With collisions: Depends on resolution strategy                    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
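
As a minimal sketch of the idea in the diagram, the snippet below maps keys to bucket indices with Python's built-in `hash()` and a modulo. The table size `m = 8`, the helper name `bucket_index`, and the sample names are illustrative assumptions rather than part of any library.

```python
def bucket_index(key, m: int) -> int:
    """Map a hashable key to a bucket index in [0, m-1]."""
    # Python's % always returns a non-negative result for positive m,
    # even when hash(key) is negative.
    return hash(key) % m


m = 8  # illustrative table size
buckets = [[] for _ in range(m)]

for name, age in [("Alice", 30), ("Bob", 25), ("Carol", 41)]:
    buckets[bucket_index(name, m)].append((name, age))

# Lookup recomputes the same index and scans only that one bucket.
idx = bucket_index("Alice", m)
found = [value for key, value in buckets[idx] if key == "Alice"]
print(found)  # [30]
```

The bucket that "Alice" lands in changes from run to run (Python randomizes string hashes per process), but the lookup always recomputes the same index within a run.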

---

## **7.2 Hash Functions: Design Principles and Universal Hashing**

### **7.2.1 Properties of Good Hash Functions**

A hash function $h(k)$ maps a key $k$ to an integer index in the range $[0, m-1]$, where $m$ is the table size.

```python
def hash_function_principles():
    """
    Explain the properties and design of hash functions.
    """
    
    print("Hash Function Design Principles")
    print("=" * 70)
    
    print("""
    Desirable Properties of Hash Functions:
    ─────────────────────────────────────────────────────────────────────
    
    1. DETERMINISTIC
       Same key must always produce same hash value
       h(k) = h(k) always
    
    2. UNIFORM DISTRIBUTION
       Keys should be distributed uniformly across the hash table
       Minimizes collisions
    
    3. FAST COMPUTATION
       O(1) time to compute hash
       Should use only basic arithmetic/bit operations
    
    4. MINIMAL COLLISIONS
       Different keys should ideally map to different indices
       (Impossible to avoid completely due to pigeonhole principle)
    
    5. AVALANCHE EFFECT
       Small changes in key should cause large changes in hash
       Prevents clustering of similar keys
    
    ─────────────────────────────────────────────────────────────────────
    
    Common Hash Function Techniques:
    ─────────────────────────────────────────────────────────────────────
    
    Division Method:
       h(k) = k mod m
       • Simple and fast
       • Avoid m = power of 2 (use prime numbers instead)
       • Example: h(123) = 123 % 97 = 26
    
    Multiplication Method:
       h(k) = floor(m × (kA mod 1))
       where A is a constant (0 < A < 1), often (√5 - 1)/2 ≈ 0.618
       • Less sensitive to choice of m
       • Works well with power-of-2 table sizes
    
    Folding Method:
       • Break key into parts
       • Combine parts using addition/XOR
       • Example: k = 123456, parts = 123 + 456 = 579
    
    Polynomial Rolling Hash (for strings):
       hash(s) = Σ(s[i] × p^i) mod m
       where p is a prime base (often 31 or 131)
       • Good for strings
       • Allows O(1) re-computation for sliding window
    """)

hash_function_principles()
```

**Output:**
```
Hash Function Design Principles
======================================================================

Desirable Properties of Hash Functions:
──────────────────────────────────────────────────────────────────────

1. DETERMINISTIC
   Same key must always produce same hash value
   h(k) = h(k) always

2. UNIFORM DISTRIBUTION
   Keys should be distributed uniformly across the hash table
   Minimizes collisions

3. FAST COMPUTATION
   O(1) time to compute hash
   Should use only basic arithmetic/bit operations

4. MINIMAL COLLISIONS
   Different keys should ideally map to different indices
   (Impossible to avoid completely due to pigeonhole principle)

5. AVALANCHE EFFECT
   Small changes in key should cause large changes in hash
   Prevents clustering of similar keys

──────────────────────────────────────────────────────────────────────

Common Hash Function Techniques:
──────────────────────────────────────────────────────────────────────

Division Method:
   h(k) = k mod m
   • Simple and fast
   • Avoid m = power of 2 (use prime numbers instead)
   • Example: h(123) = 123 % 97 = 26

Multiplication Method:
   h(k) = floor(m × (kA mod 1))
   where A is a constant (0 < A < 1), often (√5 - 1)/2 ≈ 0.618
   • Less sensitive to choice of m
   • Works well with power-of-2 table sizes

Folding Method:
   • Break key into parts
   • Combine parts using addition/XOR
   • Example: k = 123456, parts = 123 + 456 = 579

Polynomial Rolling Hash (for strings):
   hash(s) = Σ(s[i] × p^i) mod m
   where p is a prime base (often 31 or 131)
   • Good for strings
   • Allows O(1) re-computation for sliding window
```
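
The techniques listed above can be sketched as small functions. The function names and default parameters here are illustrative assumptions; the multiplication method uses floating-point arithmetic, so this version is only suitable for moderately sized integer keys.

```python
import math


def division_hash(k: int, m: int) -> int:
    """Division method: h(k) = k mod m."""
    return k % m


def multiplication_hash(k: int, m: int, A: float = (math.sqrt(5) - 1) / 2) -> int:
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    return math.floor(m * ((k * A) % 1))


def folding_hash(k: int, m: int, part_digits: int = 3) -> int:
    """Folding: split the decimal digits into parts, add them, reduce mod m."""
    digits = str(k)
    parts = [int(digits[i:i + part_digits])
             for i in range(0, len(digits), part_digits)]
    return sum(parts) % m


def polynomial_hash(s: str, p: int = 31, m: int = 10**9 + 9) -> int:
    """Polynomial hash: sum of ord(s[i]) * p^i, reduced mod m."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % m
        power = (power * p) % m
    return h


print(division_hash(123, 97))       # 26
print(folding_hash(123456, 1000))   # 579  (123 + 456)
print(polynomial_hash("ab"))        # 3135 (97 + 98 * 31)
print(0 <= multiplication_hash(123, 16) < 16)  # True
```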

---

### **7.2.2 String Hashing Implementation**

```python
class StringHasher:
    """
    Implementation of polynomial rolling hash for strings.
    """
    
    def __init__(self, base=31, modulus=10**9 + 9):
        self.base = base
        self.mod = modulus
        # Precompute powers of base for efficiency
        self.powers = [1]
    
    def hash(self, s: str) -> int:
        """
        Compute hash of string using polynomial rolling hash.
        
        h(s) = Σ(ord(s[i]) × base^i) mod modulus
        
        Time: O(n) where n is length of string
        """
        hash_value = 0
        for i, char in enumerate(s):
            # Ensure we have enough precomputed powers
            while len(self.powers) <= i:
                self.powers.append((self.powers[-1] * self.base) % self.mod)
            
            hash_value = (hash_value + ord(char) * self.powers[i]) % self.mod
        
        return hash_value
    
    def rolling_hash_update(self, old_hash: int, old_char: str, 
                           new_char: str, length: int) -> int:
        """
        Update hash when sliding window by one character.
        
        Time: O(1)
        
        Remove contribution of old_char and add new_char.
        """
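        # Remove the outgoing character (its coefficient is base^0 = 1)
        old_hash = (old_hash - ord(old_char)) % self.mod
        # Shift every remaining term down one power by multiplying with the
        # modular inverse of base. pow(base, -1, mod) needs Python 3.8+ and
        # requires base and mod to be coprime (true for a prime mod > base).
        old_hash = (old_hash * pow(self.base, -1, self.mod)) % self.mod
        # Make sure base^(length-1) has been precomputed
        while len(self.powers) < length:
            self.powers.append((self.powers[-1] * self.base) % self.mod)
        # Add the incoming character as the highest-order term
        new_hash = (old_hash + ord(new_char) * self.powers[length - 1]) % self.mod
        
        return new_hash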


def demonstrate_string_hashing():
    """
    Demonstrate string hashing.
    """
    print("String Hashing Demonstration")
    print("=" * 70)
    
    hasher = StringHasher(base=31)
    
    strings = ["hello", "world", "hello", "abc", "abd"]
    print("String hashes:")
    for s in strings:
        h = hasher.hash(s)
        print(f"  '{s}' -> {h}")
    
    print("\nNote: 'hello' appears twice with same hash (deterministic)")


demonstrate_string_hashing()
```
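
To illustrate the O(1) sliding-window update, here is a standalone sketch that hashes every length-`k` window of a text and checks each rolled value against a hash computed from scratch. It uses the same low-order-first polynomial as above; the names `poly_hash` and `window_hashes` and the sample text are assumptions made for this example.

```python
MOD = 10**9 + 9
BASE = 31
BASE_INV = pow(BASE, -1, MOD)  # modular inverse of BASE (Python 3.8+)


def poly_hash(s: str) -> int:
    """h(s) = sum(ord(s[i]) * BASE^i) mod MOD, computed in O(len(s))."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % MOD
        power = (power * BASE) % MOD
    return h


def window_hashes(text: str, k: int) -> list:
    """Hash of every length-k window; each slide costs O(1)."""
    if k <= 0 or k > len(text):
        return []
    top = pow(BASE, k - 1, MOD)      # weight of the last character in a window
    h = poly_hash(text[:k])
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k])) % MOD   # drop the outgoing character
        h = (h * BASE_INV) % MOD           # shift remaining terms down a power
        h = (h + ord(text[i]) * top) % MOD # append the incoming character
        hashes.append(h)
    return hashes


text = "abracadabra"
rolled = window_hashes(text, 4)
direct = [poly_hash(text[i:i + 4]) for i in range(len(text) - 3)]
print(rolled == direct)  # True
```

Windows with equal content (for example the two occurrences of "abra") receive equal hashes, which is what substring-search algorithms such as Rabin-Karp rely on.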

---

### **7.2.3 Universal Hashing**

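**Universal hashing** selects a hash function at random from a carefully designed family of hash functions. Because the choice is random and independent of the keys, the expected performance is good for any fixed set of inputs, including inputs chosen by an adversary who does not know which function was picked.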

```python
def universal_hashing_concept():
    """
    Explain universal hashing.
    """
    
    print("Universal Hashing")
    print("=" * 70)
    
    print("""
    The Problem:
    ─────────────────────────────────────────────────────────────────────
    
    Adversarial Input: An attacker could craft keys that all hash to 
    the same index, causing O(n) lookup time (denial of service).
    
    Example: If using h(k) = k mod m, attacker sends keys that are 
    all congruent modulo m.
    
    ─────────────────────────────────────────────────────────────────────
    
    Solution: Universal Hashing
    
    Definition: A family H of hash functions is universal if for any 
    two distinct keys x and y, the probability of collision is at most 
    1/m (where m is table size).
    
    Pr[h(x) = h(y)] ≤ 1/m for randomly chosen h ∈ H
    
    ─────────────────────────────────────────────────────────────────────
    
    Implementation (Universal Hash Family):
    
    Choose a prime p > max key value.
    
    For random integers a ∈ [1, p-1] and b ∈ [0, p-1]:
    
    h_{a,b}(k) = ((a×k + b) mod p) mod m
    
    This family is universal.
    
    ─────────────────────────────────────────────────────────────────────
    
    Benefits:
      • Protection against adversarial attacks
      • Expected chain length is O(1 + α) where α is load factor
      • Provable expected O(1) operations
    
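    Related Defenses in Practice:
      • Python randomizes str/bytes hashes per process with a keyed hash
        (enabled by default since Python 3.3)
      • Java's HashMap converts long collision chains into balanced trees,
        bounding the per-bucket worst case at O(log n)
      • Keyed hash functions such as SipHash (e.g., Rust's default HashMap hasher)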
    """)

universal_hashing_concept()
```
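
The family $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$ can be checked exhaustively for small parameters. The sketch below is a minimal illustration; the values `p = 17`, `m = 5`, the keys 3 and 8, and the function names are arbitrary assumptions chosen to keep the enumeration tiny.

```python
import random


def make_universal_hash(p: int, m: int, rng: random.Random):
    """Draw one function h_{a,b} at random from the family."""
    a = rng.randrange(1, p)   # a in [1, p-1]
    b = rng.randrange(0, p)   # b in [0, p-1]
    return lambda k: ((a * k + b) % p) % m


def collision_count(x: int, y: int, p: int, m: int) -> int:
    """Count the (a, b) pairs for which keys x and y land in the same slot."""
    return sum(
        1
        for a in range(1, p)
        for b in range(p)
        if ((a * x + b) % p) % m == ((a * y + b) % p) % m
    )


p, m = 17, 5
family_size = p * (p - 1)            # number of (a, b) pairs
count = collision_count(3, 8, p, m)
print(count / family_size <= 1 / m)  # True: collision probability is at most 1/m

h = make_universal_hash(p, m, random.Random(0))
print(all(0 <= h(k) < m for k in range(p)))  # True
```

In a real table, `a` and `b` would be drawn once when the table is created (and redrawn on rehash), so an attacker cannot precompute colliding keys.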

---

## **7.3 Collision Resolution: Chaining vs Open Addressing**

When two keys hash to the same index, we have a **collision**. There are two main strategies to handle this:

### **7.3.1 Separate Chaining**

Each bucket contains a linked list (or other structure) of all keys that hash to that index.

```python
from typing import TypeVar, Generic, Optional, Iterator, List

K = TypeVar('K')
V = TypeVar('V')

class HashTableChaining(Generic[K, V]):
    """
    Hash table using separate chaining for collision resolution.
    
    Each bucket is a linked list of entries.
    
    Time Complexities (average case):
        insert: O(1 + α) where α = n/m (load factor)
        search: O(1 + α)
        delete: O(1 + α)
    
    Worst case: O(n) when all keys collide
    """
    
    class _Entry(Generic[K, V]):
        __slots__ = ['key', 'value', 'next']
        
        def __init__(self, key: K, value: V):
            self.key = key
            self.value = value
            self.next: Optional['HashTableChaining._Entry[K, V]'] = None
    
    def __init__(self, capacity: int = 16, load_factor_threshold: float = 0.75):
        self._capacity = capacity
        self._size = 0
        self._load_factor_threshold = load_factor_threshold
        # Array of buckets (each bucket is head of linked list)
        self._buckets: List[Optional[HashTableChaining._Entry[K, V]]] = [None] * capacity
    
    def _hash(self, key: K) -> int:
        """Compute hash index for key."""
        return hash(key) % self._capacity
    
    def insert(self, key: K, value: V) -> None:
        """
        Insert or update key-value pair.
        
        Time: O(1 + α) average
        """
        index = self._hash(key)
        
        # Check if key already exists
        current = self._buckets[index]
        while current is not None:
            if current.key == key:
                current.value = value  # Update
                return
            current = current.next
        
        # Insert at head of list
        new_entry = self._Entry(key, value)
        new_entry.next = self._buckets[index]
        self._buckets[index] = new_entry
        self._size += 1
        
        # Check load factor and resize if needed
        if self._size / self._capacity > self._load_factor_threshold:
            self._resize(2 * self._capacity)
    
    def get(self, key: K) -> Optional[V]:
        """
        Retrieve value by key.
        
        Time: O(1 + α) average
        """
        index = self._hash(key)
        current = self._buckets[index]
        
        while current is not None:
            if current.key == key:
                return current.value
            current = current.next
        
        return None
    
    def delete(self, key: K) -> bool:
        """
        Delete key-value pair.
        
        Time: O(1 + α) average
        """
        index = self._hash(key)
        current = self._buckets[index]
        
        if current is None:
            return False
        
        # Check head
        if current.key == key:
            self._buckets[index] = current.next
            self._size -= 1
            return True
        
        # Search in list
        while current.next is not None:
            if current.next.key == key:
                current.next = current.next.next
                self._size -= 1
                return True
            current = current.next
        
        return False
    
    def _resize(self, new_capacity: int) -> None:
        """
        Resize hash table to new capacity.
        
        Time: O(n) - must rehash all entries
        """
        old_buckets = self._buckets
        self._capacity = new_capacity
        self._size = 0
        self._buckets = [None] * new_capacity
        
        # Rehash all entries
        for head in old_buckets:
            current = head
            while current is not None:
                self.insert(current.key, current.value)
                current = current.next
    
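    def __contains__(self, key: K) -> bool:
        # Walk the chain directly so keys stored with value None still count
        current = self._buckets[self._hash(key)]
        while current is not None:
            if current.key == key:
                return True
            current = current.next
        return False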
    
    def __len__(self) -> int:
        return self._size
    
    def display_structure(self):
        """Display the internal structure of the hash table."""
        print(f"Hash Table (size={self._size}, capacity={self._capacity})")
        print("-" * 50)
        for i, bucket in enumerate(self._buckets):
            if bucket is not None:
                entries = []
                current = bucket
                while current is not None:
                    entries.append(f"({current.key}:{current.value})")
                    current = current.next
                print(f"Bucket {i}: {' -> '.join(entries)}")
            else:
                print(f"Bucket {i}: empty")


def demonstrate_chaining():
    """
    Demonstrate separate chaining hash table.
    """
    print("Separate Chaining Hash Table")
    print("=" * 70)
    
    ht = HashTableChaining[str, int](capacity=5)
    
    # Insert some keys that might collide
    pairs = [("apple", 1), ("banana", 2), ("cherry", 3), 
             ("date", 4), ("elderberry", 5), ("fig", 6)]
    
    print("Inserting key-value pairs:")
    for key, value in pairs:
        ht.insert(key, value)
        print(f"  insert('{key}', {value})")
    
    print(f"\nCurrent load factor: {len(ht) / ht._capacity:.2f}")
    ht.display_structure()
    
    print("\nRetrieving values:")
    for key in ["apple", "cherry", "grape"]:
        val = ht.get(key)
        print(f"  get('{key}') = {val}")
    
    print("\nDeleting 'banana'...")
    ht.delete("banana")
    print(f"  get('banana') = {ht.get('banana')}")


demonstrate_chaining()
```

**Output** (sample; the fourth insert pushes the load factor to 0.8, so the table grows from 5 to 10 buckets, and bucket positions vary between runs because Python randomizes string hashes):
```
Separate Chaining Hash Table
======================================================================
Inserting key-value pairs:
  insert('apple', 1)
  insert('banana', 2)
  insert('cherry', 3)
  insert('date', 4)
  insert('elderberry', 5)
  insert('fig', 6)

Current load factor: 0.60
Hash Table (size=6, capacity=10)
--------------------------------------------------
Bucket 0: (cherry:3)
Bucket 1: empty
Bucket 2: (date:4) -> (banana:2)
Bucket 3: empty
Bucket 4: (fig:6)
Bucket 5: empty
Bucket 6: (apple:1)
Bucket 7: empty
Bucket 8: (elderberry:5)
Bucket 9: empty

Retrieving values:
  get('apple') = 1
  get('cherry') = 3
  get('grape') = None

Deleting 'banana'...
  get('banana') = None
```

---

### **7.3.2 Open Addressing**

In open addressing, all elements are stored in the table itself. When a collision occurs, we probe (search) for the next empty slot.

```python
class HashTableOpenAddressing(Generic[K, V]):
    """
    Hash table using open addressing with linear probing.
    
    All entries stored directly in the array.
    On collision, probe sequentially for next empty slot.
    
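    Time Complexities (expected, α = load factor, must be < 1):
        insert: O(1/(1-α)) under idealized uniform hashing
        search: O(1/(1-α))
        delete: O(1/(1-α))
    With linear probing, clustering makes unsuccessful searches cost
    about (1 + 1/(1-α)²) / 2 probes (Knuth), which grows faster.
    
    Note: Table must never be completely full (α < 1 required)
    """
    
    # Sentinel values for empty and deleted slots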
    _EMPTY = object()
    _DELETED = object()
    
    def __init__(self, capacity: int = 16):
        self._capacity = capacity
        self._size = 0
        self._keys: List[Optional[K]] = [self._EMPTY] * capacity
        self._values: List[Optional[V]] = [None] * capacity
    
    def _hash(self, key: K, probe: int = 0) -> int:
        """
        Compute hash with probing.
        
        Linear probing: h(k, i) = (h(k) + i) mod m
        """
        return (hash(key) + probe) % self._capacity
    
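    def insert(self, key: K, value: V) -> None:
        """
        Insert key-value pair using linear probing.
        
        Time: O(1/(1-α)) average, O(n) worst case
        """
        if self._size >= self._capacity // 2:  # Keep load factor at or below 0.5
            self._resize(2 * self._capacity)
        
        first_deleted = None  # First tombstone seen; reused if the key is absent
        index = self._hash(key)
        
        # Probe until an EMPTY slot proves the key is absent. Stopping at the
        # first tombstone instead could insert a duplicate of a key that is
        # stored further along the same probe chain.
        for probe in range(self._capacity):
            index = self._hash(key, probe)
            slot = self._keys[index]
            if slot is self._EMPTY:
                break
            if slot is self._DELETED:
                if first_deleted is None:
                    first_deleted = index
            elif slot == key:
                self._values[index] = value  # Update existing
                return
        
        # Prefer the earliest tombstone to keep probe chains short
        if first_deleted is not None:
            index = first_deleted
        self._keys[index] = key
        self._values[index] = value
        self._size += 1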
    
    def get(self, key: K) -> Optional[V]:
        """
        Search for key using linear probing.
        
        Time: O(1/(1-α)) average
        """
        probe = 0
        index = self._hash(key, probe)
        
        # Probe until empty slot (not deleted) or key found
        while self._keys[index] is not self._EMPTY:
            if self._keys[index] == key:
                return self._values[index]
            probe += 1
            index = self._hash(key, probe)
            
            # Prevent infinite loop if table full (shouldn't happen with resize)
            if probe >= self._capacity:
                return None
        
        return None
    
    def delete(self, key: K) -> bool:
        """
        Delete key using lazy deletion.
        
        We mark as _DELETED rather than _EMPTY to maintain probe chains.
        
        Time: O(1/(1-α))
        """
        probe = 0
        index = self._hash(key, probe)
        
        while self._keys[index] is not self._EMPTY:
            if self._keys[index] == key:
                self._keys[index] = self._DELETED
                self._values[index] = None
                self._size -= 1
                
                # Optional: resize down if load factor too low
                if self._size > 0 and self._size < self._capacity // 8:
                    self._resize(self._capacity // 2)
                return True
            
            probe += 1
            index = self._hash(key, probe)
            
            if probe >= self._capacity:
                return False
        
        return False
    
    def _resize(self, new_capacity: int) -> None:
        """Resize and rehash all entries."""
        old_keys = self._keys
        old_values = self._values
        
        self._capacity = new_capacity
        self._size = 0
        self._keys = [self._EMPTY] * new_capacity
        self._values = [None] * new_capacity
        
        for i in range(len(old_keys)):
            if old_keys[i] not in (self._EMPTY, self._DELETED):
                self.insert(old_keys[i], old_values[i])
    
    def display_structure(self):
        """Display the table structure."""
        print(f"Open Addressing Hash Table (size={self._size}, capacity={self._capacity})")
        print("-" * 50)
        for i in range(self._capacity):
            key = self._keys[i]
            if key is self._EMPTY:
                status = "EMPTY"
            elif key is self._DELETED:
                status = "DELETED"
            else:
                status = f"({key}:{self._values[i]})"
            print(f"Slot {i:2d}: {status}")


def demonstrate_open_addressing():
    """
    Demonstrate open addressing with linear probing.
    """
    print("\nOpen Addressing (Linear Probing) Hash Table")
    print("=" * 70)
    
    ht = HashTableOpenAddressing[str, int](capacity=8)
    
    # Insert keys
    keys = ["apple", "banana", "cherry", "date", "elderberry"]
    print("Inserting keys:", keys)
    for i, key in enumerate(keys):
        ht.insert(key, i + 1)
    
    ht.display_structure()
    
    print("\nSearching:")
    print(f"  get('apple') = {ht.get('apple')}")
    print(f"  get('fig') = {ht.get('fig')}")
    
    print("\nDeleting 'banana' (marked as DELETED):")
    ht.delete("banana")
    ht.display_structure()
    
    print("""
    
    Linear Probing Analysis:
    ─────────────────────────────────────────────────────────────────────
    
    Primary Clustering:
      • Linear probing suffers from primary clustering
      • Contiguous blocks of occupied slots form
      • Probe sequence length grows with cluster size
      • Search time degrades to O(n) in worst case
    
    Load Factor Guidelines:
      • α < 0.5: Good performance
      • α > 0.7: Performance degrades significantly
      • α = 1: Table full, insertion impossible
    """)


demonstrate_open_addressing()
```

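**Output** (sample; the fifth insert finds the table half full and doubles it from 8 to 16 slots, and slot positions vary between runs because Python randomizes string hashes):
```
Open Addressing (Linear Probing) Hash Table
======================================================================
Inserting keys: ['apple', 'banana', 'cherry', 'date', 'elderberry']
Open Addressing Hash Table (size=5, capacity=16)
--------------------------------------------------
Slot  0: EMPTY
Slot  1: (apple:1)
Slot  2: EMPTY
Slot  3: (elderberry:5)
Slot  4: (date:4)
Slot  5: EMPTY
Slot  6: EMPTY
Slot  7: EMPTY
Slot  8: (banana:2)
Slot  9: EMPTY
Slot 10: EMPTY
Slot 11: EMPTY
Slot 12: EMPTY
Slot 13: (cherry:3)
Slot 14: EMPTY
Slot 15: EMPTY

Searching:
  get('apple') = 1
  get('fig') = None

Deleting 'banana' (marked as DELETED):
Open Addressing Hash Table (size=4, capacity=16)
--------------------------------------------------
Slot  0: EMPTY
Slot  1: (apple:1)
Slot  2: EMPTY
Slot  3: (elderberry:5)
Slot  4: (date:4)
Slot  5: EMPTY
Slot  6: EMPTY
Slot  7: EMPTY
Slot  8: DELETED
Slot  9: EMPTY
Slot 10: EMPTY
Slot 11: EMPTY
Slot 12: EMPTY
Slot 13: (cherry:3)
Slot 14: EMPTY
Slot 15: EMPTY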


Linear Probing Analysis:
──────────────────────────────────────────────────────────────────────

Primary Clustering:
  • Linear probing suffers from primary clustering
  • Contiguous blocks of occupied slots form
  • Probe sequence length grows with cluster size
  • Search time degrades to O(n) in worst case

Load Factor Guidelines:
  • α < 0.5: Good performance
  • α > 0.7: Performance degrades significantly
  • α = 1: Table full, insertion impossible
```
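
The reason deletion uses a `DELETED` marker rather than resetting the slot to empty can be shown with a stripped-down probe table. This is a sketch under simplifying assumptions: integer keys hash to themselves, the table has 8 slots and never resizes, and the helper names `probe_insert`, `probe_find`, and `build` are invented for this example.

```python
EMPTY, DELETED = object(), object()


def probe_insert(slots: list, key: int) -> int:
    """Linear probing insert; returns the slot used."""
    m = len(slots)
    for i in range(m):
        idx = (key + i) % m
        if slots[idx] is EMPTY or slots[idx] is DELETED:
            slots[idx] = key
            return idx
    raise RuntimeError("table full")


def probe_find(slots: list, key: int):
    """Linear probing search; stops at the first EMPTY slot."""
    m = len(slots)
    for i in range(m):
        idx = (key + i) % m
        if slots[idx] is EMPTY:
            return None
        if slots[idx] is not DELETED and slots[idx] == key:
            return idx
    return None


def build() -> list:
    slots = [EMPTY] * 8
    for k in (1, 9, 17):          # all three hash to slot 1
        probe_insert(slots, k)    # they occupy slots 1, 2, 3
    return slots


naive = build()
naive[probe_find(naive, 9)] = EMPTY     # wrong: breaks the probe chain
lazy = build()
lazy[probe_find(lazy, 9)] = DELETED     # tombstone keeps the chain intact

print(probe_find(naive, 17))  # None (search stops at the hole left by 9)
print(probe_find(lazy, 17))   # 3
```

With the naive approach, key 17 is still in the table but has become unreachable; the tombstone tells the search to keep probing.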

---

### **7.3.3 Quadratic Probing and Double Hashing**

```python
def advanced_probing():
    """
    Explain quadratic probing and double hashing.
    """
    
    print("Advanced Probing Techniques")
    print("=" * 70)
    
    print("""
    Quadratic Probing:
    ─────────────────────────────────────────────────────────────────────
    
    Probe sequence: h(k, i) = (h(k) + c1×i + c2×i²) mod m
    
    Common simplification: h(k, i) = (h(k) + i²) mod m
    
    Pros:
      • Eliminates primary clustering (no long contiguous sequences)
    
    Cons:
      • Secondary clustering: keys with same initial hash have same probe seq
      • May not probe all slots (can fail to find empty slot even if exists)
      • Requires careful choice of table size (prime numbers help)
    
    Example:
      Initial hash = 5
      Probe sequence before mod m: 5, 6, 9, 14, 21, 30... (adding 1, 3, 5, 7, 9...)
    
    ─────────────────────────────────────────────────────────────────────
    
    Double Hashing:
    ─────────────────────────────────────────────────────────────────────
    
    Probe sequence: h(k, i) = (h1(k) + i × h2(k)) mod m
    
    where h2(k) is a second hash function.
    
    Requirements for h2:
      • Must never return 0 (infinite loop)
      • Should be relatively prime to table size m
    
    Common choice:
      h1(k) = k mod m
      h2(k) = 1 + (k mod (m-1))
    
    Pros:
      • Eliminates both primary and secondary clustering
      • Distributes keys more uniformly
      • Can probe all slots if h2 and m are coprime
    
    Cons:
      • More computation (two hash functions)
      • Slightly slower than linear probing
    
    ─────────────────────────────────────────────────────────────────────
    
    Comparison Table:
    
    Method              │ Primary Clustering │ Secondary │ Computation
    ────────────────────┼────────────────────┼───────────┼────────────
    Linear Probing      │ Yes                │ No        │ O(1)
    Quadratic Probing   │ No                 │ Yes       │ O(1)
    Double Hashing      │ No                 │ No        │ O(1) × 2
    Chaining            │ N/A                │ N/A       │ O(1) + traverse
    
    Recommendation:
      • General purpose: Chaining (simpler, handles high load factors)
      • Memory constrained: Open addressing (no per-entry pointers);
        linear probing has the best cache locality, double hashing the least
      • Avoid: Linear probing for high load factors (> 0.7)
    """)

advanced_probing()
```
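
The probe formulas above translate directly into code. The following sketch generates probe sequences so their coverage can be compared; the function names and the sample parameters (`m = 16` for quadratic probing, prime `m = 13` and key 14 for double hashing) are assumptions chosen for illustration.

```python
def linear_probe(h: int, i: int, m: int) -> int:
    """h(k, i) = (h(k) + i) mod m"""
    return (h + i) % m


def quadratic_probe(h: int, i: int, m: int) -> int:
    """h(k, i) = (h(k) + i^2) mod m (the simplified form with c1 = 0, c2 = 1)"""
    return (h + i * i) % m


def double_hash_probe(k: int, i: int, m: int) -> int:
    """h(k, i) = (h1(k) + i * h2(k)) mod m with the common h1/h2 choice."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))   # never 0, so the probe always advances
    return (h1 + i * h2) % m


# Quadratic probing with a power-of-2 table reaches only a few slots.
quad_slots = sorted({quadratic_probe(5, i, 16) for i in range(16)})
print(quad_slots)  # [5, 6, 9, 14]

# Double hashing with a prime table size visits every slot exactly once.
dh_sequence = [double_hash_probe(14, i, 13) for i in range(13)]
print(dh_sequence[:5])          # [1, 4, 7, 10, 0]
print(len(set(dh_sequence)))    # 13
```

The first result shows why quadratic probing "may not probe all slots": with `m = 16` and offsets of `i²`, a key starting at slot 5 can only ever reach four distinct slots.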

---

## **7.4 Load Factor and Rehashing Strategies**

### **7.4.1 Load Factor Analysis**

The **load factor** $\alpha = \frac{n}{m}$ where $n$ is number of elements and $m$ is table size. It critically affects performance.

```python
def load_factor_analysis():
    """
    Analyze the impact of load factor on hash table performance.
    """
    
    print("Load Factor (α) Analysis")
    print("=" * 70)
    
    print("""
    Definition: α = n/m (number of elements / table size)
    
    ─────────────────────────────────────────────────────────────────────
    
    Separate Chaining:
    
    Average chain length = α
    
    Time Complexities:
      • Successful search: about 1 + α/2 comparisons, i.e. O(1 + α)
      • Unsuccessful search: O(1 + α)
      • Insertion: O(1 + α)
    
    Strategy:
      • Keep α at or below about 0.75 (Java HashMap's default threshold)
      • When α > 0.75, double table size and rehash
    
    ─────────────────────────────────────────────────────────────────────
    
    Open Addressing:
    
    Unsuccessful search probe count ≈ 1/(1-α)
    (idealized uniform hashing; linear probing is closer to (1 + 1/(1-α)²)/2)
    
    α = 0.5  →  2 probes average
    α = 0.7  →  3.3 probes average  
    α = 0.9  →  10 probes average
    α = 0.95 →  20 probes average
    
    Time degrades rapidly as α approaches 1!
    
    Strategy:
      • Keep α < 0.5 for linear probing
      • Keep α < 0.7 for quadratic/double hashing
      • Resize when threshold exceeded
    
    ─────────────────────────────────────────────────────────────────────
    
    Rehashing Strategy:
    
    When load factor exceeds threshold:
      1. Allocate new array (typically 2× size)
      2. For each entry in old table:
         - Compute new hash index (mod new size)
         - Insert into new table
      3. Free old array
    
    Cost: O(n) for rehashing
    Amortized: O(1) per insertion over time
    
    Alternative: Incremental rehashing (gradual migration)
      • Used in real-time systems
      • Insert new entries in new table
      • Move old entries gradually
      • Lookup checks both tables
    """)

load_factor_analysis()
```
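
Two of the claims above are easy to check numerically: the 1/(1-α) probe estimate and the amortized O(1) cost of doubling. The sketch below assumes the idealized uniform-hashing formula and a table that starts at 16 slots and doubles whenever α exceeds 0.75; the function names are illustrative.

```python
def expected_probes_uniform(alpha: float) -> float:
    """Expected probes for an unsuccessful search under uniform hashing."""
    if not 0 <= alpha < 1:
        raise ValueError("load factor must be in [0, 1)")
    return 1 / (1 - alpha)


def rehash_moves(n: int, capacity: int = 16, threshold: float = 0.75) -> int:
    """Count entries moved by resizes while inserting n keys with doubling."""
    size = moves = 0
    for _ in range(n):
        size += 1
        if size / capacity > threshold:
            moves += size      # every stored entry is rehashed once
            capacity *= 2
    return moves


for alpha in (0.5, 0.7, 0.9, 0.95):
    print(f"alpha={alpha}: {expected_probes_uniform(alpha):.1f} probes")

n = 1000
print(rehash_moves(n), "moves for", n, "inserts")  # 1531 moves for 1000 inserts
```

Because each resize doubles the capacity, the move counts form a geometric series, and the total stays below 2n. Spread across n insertions, that is a constant extra cost per insert.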

---

## **7.5 Advanced Hashing Techniques**

### **7.5.1 Cuckoo Hashing**

**Cuckoo hashing** uses two hash tables with two different hash functions. Each key is in one of two possible locations.

```python
class CuckooHashing(Generic[K, V]):
    """
    Cuckoo Hashing implementation.
    
    Uses two tables with two hash functions.
    Each key can be in one of two locations.
    
    On insertion, if both locations occupied, evict existing and reinsert.
    Like cuckoo bird pushing eggs out of nest!
    
    Guaranteed O(1) worst-case lookup and deletion.
    Insertion is O(1) amortized, but can be O(n) worst case.
    """
    
    def __init__(self, capacity: int = 16):
        self._capacity = capacity
        self._size = 0
        self._max_loop = capacity  # Prevent infinite loops
        
        # Two tables
        self._keys1: List[Optional[K]] = [None] * capacity
        self._vals1: List[Optional[V]] = [None] * capacity
        self._keys2: List[Optional[K]] = [None] * capacity
        self._vals2: List[Optional[V]] = [None] * capacity
    
    def _hash1(self, key: K) -> int:
        return hash(key) % self._capacity
    
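    def _hash2(self, key: K) -> int:
        # Salt the key so the second index is not determined by the first.
        # (~hash(key)) % m would equal m - 1 - (hash(key) % m), so keys that
        # collide in table 1 would always collide in table 2 as well.
        return hash((key, "cuckoo-table-2")) % self._capacity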
    
    def get(self, key: K) -> Optional[V]:
        """
        O(1) worst-case lookup.
        """
        idx1 = self._hash1(key)
        if self._keys1[idx1] == key:
            return self._vals1[idx1]
        
        idx2 = self._hash2(key)
        if self._keys2[idx2] == key:
            return self._vals2[idx2]
        
        return None
    
    def insert(self, key: K, value: V) -> bool:
        """
        Insert or update with cuckoo eviction.
        
        If an eviction chain runs longer than _max_loop (a likely cycle),
        the table is grown and rehashed, then the displaced entry is retried.
        """
        # Update in place if the key is already stored
        idx1 = self._hash1(key)
        if self._keys1[idx1] is not None and self._keys1[idx1] == key:
            self._vals1[idx1] = value
            return True
        idx2 = self._hash2(key)
        if self._keys2[idx2] is not None and self._keys2[idx2] == key:
            self._vals2[idx2] = value
            return True
        
        if self._size >= self._capacity // 2:
            self._resize(2 * self._capacity)
        
        pending = self._insert_helper(key, value, 0)
        while pending is not None:
            # Eviction chain too long: grow, rehash, retry the displaced entry
            self._resize(2 * self._capacity)
            pending = self._insert_helper(pending[0], pending[1], 0)
        self._size += 1
        return True
    
    def _insert_helper(self, key: K, value: V, count: int) -> Optional[tuple]:
        """
        Iterative eviction loop. Returns None once every entry has a slot,
        or the (key, value) pair left without a slot after _max_loop rounds.
        """
        while count <= self._max_loop:
            # Place into table 1, picking up whatever was stored there
            idx1 = self._hash1(key)
            key, self._keys1[idx1] = self._keys1[idx1], key
            value, self._vals1[idx1] = self._vals1[idx1], value
            if key is None:
                return None
            
            # The evicted entry moves to its alternate slot in table 2
            idx2 = self._hash2(key)
            key, self._keys2[idx2] = self._keys2[idx2], key
            value, self._vals2[idx2] = self._vals2[idx2], value
            if key is None:
                return None
            
            count += 1
        
        return key, value
    
    def delete(self, key: K) -> bool:
        """O(1) worst-case deletion."""
        idx1 = self._hash1(key)
        if self._keys1[idx1] == key:
            self._keys1[idx1] = None
            self._vals1[idx1] = None
            self._size -= 1
            return True
        
        idx2 = self._hash2(key)
        if self._keys2[idx2] == key:
            self._keys2[idx2] = None
            self._vals2[idx2] = None
            self._size -= 1
            return True
        
        return False
    
    def _resize(self, new_capacity: int) -> None:
        """Rehash everything with new size."""
        old_keys1, old_vals1 = self._keys1, self._vals1
        old_keys2, old_vals2 = self._keys2, self._vals2
        
        self._capacity = new_capacity
        self._size = 0
        self._keys1 = [None] * new_capacity
        self._vals1 = [None] * new_capacity
        self._keys2 = [None] * new_capacity
        self._vals2 = [None] * new_capacity
        
        # Reinsert all
        for i in range(len(old_keys1)):
            if old_keys1[i] is not None:
                self.insert(old_keys1[i], old_vals1[i])
        for i in range(len(old_keys2)):
            if old_keys2[i] is not None:
                self.insert(old_keys2[i], old_vals2[i])


def demonstrate_cuckoo():
    print("Cuckoo Hashing")
    print("=" * 70)
    print("""
    Cuckoo hashing provides O(1) worst-case lookup and deletion.
    Insertion may trigger a chain of evictions (like cuckoo birds).
    
    Each key has exactly two possible locations:
      Location 1: hash1(key)
      Location 2: hash2(key)
    
    Lookup: Check both locations (2 memory accesses max)
    Insert: Place in empty slot, or evict occupant to its alternate location
    """)
    
    ch = CuckooHashing[str, int](capacity=4)
    keys = ["a", "b", "c", "d"]
    for i, k in enumerate(keys):
        ok = ch.insert(k, i)
        print(f"Inserted '{k}'" if ok else f"Failed to insert '{k}' (rehash needed)")
    
    print(f"\nLookup 'b': {ch.get('b')}")
    print(f"Lookup 'z': {ch.get('z')}")


demonstrate_cuckoo()
```
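
The `insert` docstring above mentions rehashing with new hash functions when a cycle is detected, but the class only reports failure. The following is a minimal standalone sketch of that recovery step, not the chapter's `CuckooHashing` class. The name `SeededCuckoo`, the `max_kicks` parameter, and the use of random salts mixed into Python's built-in `hash` to stand in for "new hash functions" are all assumptions made for illustration.

```python
import random


class SeededCuckoo:
    """Hypothetical cuckoo table that rehashes when an eviction chain fails."""

    def __init__(self, capacity=8, max_kicks=16, seed=0):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * capacity, [None] * capacity]
        self._new_seeds()

    def _new_seeds(self):
        # Fresh random salts play the role of a new pair of hash functions
        self.seeds = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _index(self, which, key):
        return hash((self.seeds[which], key)) % self.capacity

    def get(self, key):
        # At most two probes: one slot per table
        for which in (0, 1):
            slot = self.tables[which][self._index(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value):
        # Overwrite in place when the key is already stored
        for which in (0, 1):
            idx = self._index(which, key)
            slot = self.tables[which][idx]
            if slot is not None and slot[0] == key:
                self.tables[which][idx] = (key, value)
                return
        leftover = self._place((key, value))
        if leftover is not None:
            self._rehash(leftover)

    def _place(self, item):
        """Return None on success, or the item left without a slot."""
        which = 0
        for _ in range(self.max_kicks):
            idx = self._index(which, item[0])
            # Put the item down and pick up whatever was in the slot
            item, self.tables[which][idx] = self.tables[which][idx], item
            if item is None:
                return None
            which = 1 - which  # the evicted item goes to its other table
        return item

    def _rehash(self, pending):
        """Double the capacity, draw new seeds, and reinsert everything."""
        items = [s for table in self.tables for s in table if s is not None]
        items.append(pending)
        while True:
            self.capacity *= 2
            self.tables = [[None] * self.capacity, [None] * self.capacity]
            self._new_seeds()
            if all(self._place(item) is None for item in items):
                return


table = SeededCuckoo(capacity=4)
for i in range(100):
    table.insert(i, i * i)
print(table.get(9), table.get(1000))
```

Collecting every stored pair before rebuilding means no key is lost when a chain fails, at the cost of an O(n) rebuild. Doubling on every failure is a simplification; a tuned implementation might retry with new seeds at the same size first.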

---

### **7.5.2 Robin Hood Hashing**

```python
def robin_hood_explained():
    """
    Explain Robin Hood hashing.
    """
    
    print("Robin Hood Hashing")
    print("=" * 70)
    
    print("""
    Concept: Steal from the rich (keys close to their ideal position) and
    give to the poor (keys far from their ideal position)
    
    In open addressing, each key has an "ideal" position (initial hash).
    The distance from ideal position is called the "displacement" or DIB
    (Distance to Initial Bucket).
    
    Robin Hood Hashing Rule:
      When inserting, if the new key is farther from its ideal position
      than the current occupant, swap them and continue inserting the
      evicted key.
    
    Effect:
      • Reduces variance in probe lengths
      • Keeps keys within O(log n) of ideal position with high probability
      • Improves cache locality
      • Makes unsuccessful lookups faster through early termination
    
    Insertion Algorithm:
      1. Compute initial position (displacement 0)
      2. If the slot is empty, place the key and stop
      3. If the occupant has a smaller displacement than the key being
         inserted, swap them and continue with the evicted key
      4. Move to the next slot, add 1 to the displacement, repeat from step 2
    
    Lookup Algorithm:
      Same as linear probing, but can stop early if DIB of current
      key is less than the distance we've traveled (sorted by DIB property).
    
    Backward Shift Deletion:
      Instead of marking deleted, shift subsequent keys back to fill gap
      (maintains the sorted-by-DIB invariant).
    
    Performance:
      • Mean successful probe count: same as linear probing
        (about 1.5 probes at α = 0.5); only the variance shrinks
      • Max probe length: O(log n) with high probability
      • Much shorter worst-case probes than linear probing at high load factors
    """)

robin_hood_explained()
```
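
The insertion rule, early-exit lookup, and backward-shift deletion described above can be sketched in a few dozen lines. This is an illustrative, fixed-capacity version under stated assumptions: the class name `RobinHoodTable`, the `(key, value, dib)` slot layout, and the use of Python's built-in `hash` are choices made here, and resizing is left out.

```python
class RobinHoodTable:
    """Hypothetical fixed-capacity Robin Hood table over linear probing."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: (key, value, dib)
        self.size = 0

    def _home(self, key):
        return hash(key) % self.capacity

    def _find(self, key):
        """Return the slot index holding key, or None."""
        idx, dist = self._home(key), 0
        while dist < self.capacity:
            slot = self.slots[idx]
            # An empty slot, or an occupant closer to home than we are now,
            # means the key cannot be stored any further along.
            if slot is None or slot[2] < dist:
                return None
            if slot[0] == key:
                return idx
            idx, dist = (idx + 1) % self.capacity, dist + 1
        return None

    def get(self, key):
        idx = self._find(key)
        return None if idx is None else self.slots[idx][1]

    def insert(self, key, value):
        idx = self._find(key)
        if idx is not None:  # existing key: update in place
            self.slots[idx] = (key, value, self.slots[idx][2])
            return
        if self.size == self.capacity:
            raise OverflowError("table is full")
        idx, dib = self._home(key), 0
        while self.slots[idx] is not None:
            okey, oval, odib = self.slots[idx]
            if odib < dib:
                # Occupant is "richer" (closer to home): take its slot
                # and keep probing with the evicted entry
                self.slots[idx] = (key, value, dib)
                key, value, dib = okey, oval, odib
            idx, dib = (idx + 1) % self.capacity, dib + 1
        self.slots[idx] = (key, value, dib)
        self.size += 1

    def delete(self, key):
        idx = self._find(key)
        if idx is None:
            return False
        # Backward shift: pull each follower one slot closer to its home
        nxt = (idx + 1) % self.capacity
        while self.slots[nxt] is not None and self.slots[nxt][2] > 0:
            k, v, d = self.slots[nxt]
            self.slots[idx] = (k, v, d - 1)
            idx, nxt = nxt, (nxt + 1) % self.capacity
        self.slots[idx] = None
        self.size -= 1
        return True


rh = RobinHoodTable(capacity=16)
for k in [1, 0, 16]:  # 0 and 16 share home slot 0; 16 displaces 1
    rh.insert(k, f"value-{k}")
print(rh.slots[:3])
```

In this run, key 16 arrives at slot 1 with displacement 1 and finds key 1 sitting at its home with displacement 0, so the two swap and key 1 moves on to slot 2.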

---

## **7.6 Consistent Hashing for Distributed Systems**

### **7.6.1 The Distributed Hash Table Problem**

When data is distributed across multiple servers, traditional modulo hashing (`server = hash(key) % N`) remaps most keys whenever a server is added or removed, because changing N changes the result for almost every key.
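
To see the scale of the problem, the short experiment below counts how many keys change servers under modulo hashing when a cluster grows from 3 to 4 servers. The helper names `stable_hash` and `moved_fraction` are made up for this sketch, and MD5 is used only as a convenient deterministic hash. For uniformly distributed hashes, a key stays put only when `h % 3 == h % 4`, which holds for about one quarter of values, so roughly three quarters of the keys should move.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash of str varies per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose modulo-assigned server index changes."""
    moved = sum(
        1 for k in keys if stable_hash(k) % old_n != stable_hash(k) % new_n
    )
    return moved / len(keys)


sample_keys = [f"user_{i}" for i in range(10_000)]
print(f"3 -> 4 servers: {moved_fraction(sample_keys, 3, 4):.1%} of keys move")
```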

```python
import hashlib
import bisect

class ConsistentHashing:
    """
    Consistent Hashing implementation for distributed systems.
    
    Maps both servers and keys to a circular hash ring.
    Each key is assigned to the next server on the ring (clockwise).
    
    When a server is added or removed, only about 1/N of the keys are remapped.
    """
    
    def __init__(self, replicas: int = 150):
        """
        Args:
            replicas: Virtual nodes per physical server (higher = better distribution)
        """
        self.replicas = replicas  # Virtual nodes per server
        self.ring = []  # Sorted list of hash values
        self.nodes = {}  # hash -> server name
        self.servers = set()
    
    def _hash(self, key: str) -> int:
        """Compute MD5 hash of key."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    
    def add_server(self, server: str) -> None:
        """
        Add a server to the ring.
        
        Creates multiple virtual nodes for better distribution.
        """
        if server in self.servers:
            return  # Already on the ring; avoid duplicate virtual nodes
        self.servers.add(server)
        for i in range(self.replicas):
            virtual_key = f"{server}:{i}"
            h = self._hash(virtual_key)
            self.nodes[h] = server
            bisect.insort(self.ring, h)
    
    def remove_server(self, server: str) -> None:
        """Remove a server and its virtual nodes."""
        if server not in self.servers:
            return  # Unknown server; nothing to remove
        self.servers.discard(server)
        for i in range(self.replicas):
            virtual_key = f"{server}:{i}"
            h = self._hash(virtual_key)
            del self.nodes[h]
            idx = bisect.bisect_left(self.ring, h)
            self.ring.pop(idx)
    
    def get_server(self, key: str) -> str:
        """
        Find server for given key.
        
        Returns the next server clockwise on the ring, or None if the
        ring is empty.
        """
        if not self.ring:
            return None
        
        h = self._hash(key)
        
        # Find first virtual node whose hash is strictly greater than the key hash
        idx = bisect.bisect_right(self.ring, h)
        
        if idx == len(self.ring):
            # Wrap around to first server
            idx = 0
        
        return self.nodes[self.ring[idx]]
    
    def get_distribution(self, keys: list) -> dict:
        """Show how keys are distributed across servers."""
        distribution = {server: 0 for server in self.servers}
        for key in keys:
            server = self.get_server(key)
            distribution[server] += 1
        return distribution


def demonstrate_consistent_hashing():
    """
    Demonstrate consistent hashing benefits.
    """
    print("Consistent Hashing for Distributed Systems")
    print("=" * 70)
    
    ch = ConsistentHashing(replicas=100)
    
    # Add initial servers
    servers = ["Server-A", "Server-B", "Server-C"]
    for s in servers:
        ch.add_server(s)
    
    # Generate sample keys
    keys = [f"user_{i}" for i in range(1000)]
    
    print("Initial distribution (3 servers):")
    dist = ch.get_distribution(keys)
    for server, count in sorted(dist.items()):
        print(f"  {server}: {count} keys ({count / len(keys):.1%})")
    
    # Record each key's owner before the topology changes
    old_assignment = {key: ch.get_server(key) for key in keys}
    
    # Add new server
    print("\nAdding Server-D...")
    ch.add_server("Server-D")
    
    print("New distribution (4 servers):")
    new_dist = ch.get_distribution(keys)
    for server, count in sorted(new_dist.items()):
        print(f"  {server}: {count} keys ({count / len(keys):.1%})")
    
    # Count the keys whose owner changed; each of them moved to Server-D
    moved = sum(1 for key in keys if ch.get_server(key) != old_assignment[key])
    print(f"\nKeys moved: {moved} of {len(keys)} ({moved / len(keys):.1%})")
    
    print("""
    
    Benefits of Consistent Hashing:
    ─────────────────────────────────────────────────────────────────────
    
    Traditional Hashing (hash % N):
      • Add server: about N/(N+1) of all keys remap
      • Remove server: almost all keys remap
      • Cache miss storm when topology changes
    
    Consistent Hashing:
      • Add server: only about 1/(N+1) of keys move, all to the new server
      • Remove server: only the removed server's keys (about 1/N) redistribute
      • Minimal disruption to cache
    
    Virtual Nodes:
      • Without: uneven distribution if few servers
      • With: better load balancing, handles heterogeneous servers
      • Common practice: 100-200 virtual nodes per physical server
    
    Real-world Usage:
      • Amazon Dynamo
      • Apache Cassandra
      • Memcached client libraries (ketama)
      • Content Delivery Networks (CDNs)
    """)


demonstrate_consistent_hashing()
```
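
The summary above notes that virtual nodes help with heterogeneous servers. One common way to do this, sketched here as an assumption rather than as part of the `ConsistentHashing` class, is to give each server a number of virtual nodes proportional to its capacity. The function names `ring_hash`, `build_ring`, and `lookup`, and the integer `weights` mapping, are hypothetical.

```python
import bisect
import hashlib


def ring_hash(label: str) -> int:
    return int(hashlib.md5(label.encode()).hexdigest(), 16)


def build_ring(weights: dict, base_replicas: int = 100):
    """Build a ring where a server of weight w gets w * base_replicas points.

    Returns two parallel lists: sorted point hashes and their owning servers.
    """
    points = sorted(
        (ring_hash(f"{server}:{i}"), server)
        for server, weight in weights.items()
        for i in range(weight * base_replicas)
    )
    hashes = [h for h, _ in points]
    owners = [s for _, s in points]
    return hashes, owners


def lookup(ring, key: str) -> str:
    """Return the owner of the first point clockwise from the key's hash."""
    hashes, owners = ring
    idx = bisect.bisect_right(hashes, ring_hash(key)) % len(hashes)
    return owners[idx]


# "big" is assumed to have twice the capacity of "small"
ring = build_ring({"small": 1, "big": 2})
sample = [f"user_{i}" for i in range(10_000)]
big_share = sum(lookup(ring, k) == "big" for k in sample) / len(sample)
print(f"Share of keys on 'big': {big_share:.1%}")
```

With twice as many points on the ring, the larger server should receive roughly two thirds of the keys; the exact share depends on where the points happen to land.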

---

## **7.7 Perfect Hashing and Static Hashing**

### **7.7.1 Perfect Hashing**

**Perfect hashing** guarantees O(1) worst-case lookup with no collisions, but requires the set of keys to be known in advance (static).

```python
def perfect_hashing_concept():
    """
    Explain perfect hashing.
    """
    
    print("Perfect Hashing")
    print("=" * 70)
    
    print("""
    Definition: A hash function is perfect for a key set S if it maps
    every key in S to a distinct index (no collisions).
    
    Types:
    
    1. Minimal Perfect Hashing:
       • Maps n keys to n consecutive integers [0, n-1]
       • No wasted space
       • O(1) worst-case lookup
    
    2. Order-Preserving Minimal Perfect Hashing:
       • Hash values follow a chosen key order (e.g., lexicographic):
         if k1 < k2 then h(k1) < h(k2)
       • Slot i holds the i-th key, so the table can be read in sorted order
    
    ─────────────────────────────────────────────────────────────────────
    
    Construction (FKS Scheme - Fredman, Komlós, Szemerédi):
    
    Level 1: Universal hash function h
             Maps n keys to table of size n
    
    Level 2: For each bucket with collisions, use secondary hash table
             Size proportional to square of bucket size
    
    Expected total space: O(n)
    Lookup: O(1) worst-case (two hash computations)
    
    ─────────────────────────────────────────────────────────────────────
    
    Applications:
      • Compiler keyword tables (static set of reserved words)
      • Router tables (static IP prefixes)
      • Dictionary implementations (read-only)
      • CD-ROM filesystems (static data)
    
    Limitations:
      • Keys must be known in advance
      • Expensive to construct (O(n) expected, but high constant)
      • No dynamic insertion/deletion (or requires reconstruction)
    
    Libraries:
      • cmph (C Minimal Perfect Hashing Library)
      • gperf (GNU Perfect Hash Function Generator)
    """)

perfect_hashing_concept()
```
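
The two-level FKS construction described above can be sketched as follows. This is an illustrative version under stated assumptions: keys are distinct non-negative integers smaller than the prime `P`, the key set is non-empty, and the universal family is the textbook `((a*x + b) mod P) mod m`. The names `random_hash`, `build_fks`, and `fks_contains`, and the `4 * n` space threshold, are choices made for this sketch.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to lie in [0, P)


def random_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m from a universal family."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m


def build_fks(keys, seed=0):
    rng = random.Random(seed)
    n = len(keys)
    # Level 1: redraw h1 until the squared bucket sizes sum to O(n)
    while True:
        h1 = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Level 2: one collision-free table of size len(bucket)^2 per bucket
    level2 = []
    for bucket in buckets:
        m = len(bucket) ** 2
        while True:
            h2 = random_hash(max(m, 1), rng)
            table = [None] * m
            for k in bucket:
                if table[h2(k)] is not None:
                    break  # collision: redraw h2 for this bucket
                table[h2(k)] = k
            else:
                break  # every key placed without collision
        level2.append((h2, table))
    return h1, level2


def fks_contains(fks, key):
    """Two hash evaluations and one comparison, regardless of n."""
    h1, level2 = fks
    h2, table = level2[h1(key)]
    return bool(table) and table[h2(key)] == key


static_keys = [3, 17, 42, 99, 1024, 65537]
fks = build_fks(static_keys)
print(all(fks_contains(fks, k) for k in static_keys), fks_contains(fks, 5))
```

Each retry loop succeeds with probability at least one half per draw, which is where the expected O(n) construction time comes from. Storing values alongside keys, or handling string keys, would need an extra step to map keys to integers.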

---

## **7.8 Summary and Key Takeaways**

```
┌─────────────────────────────────────────────────────────────────────┐
│                    HASH TABLE SUMMARY                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Collision Resolution Comparison:                                    │
│                                                                      │
│  Method              │ Pros                    │ Cons                │
│  ────────────────────┼─────────────────────────┼─────────────────────│
│  Chaining            │ Simple, handles α>1     │ Extra memory        │
│  Linear Probing      │ Cache friendly          │ Primary clustering  │
│  Quadratic Probing   │ Less clustering         │ Secondary clustering│
│  Double Hashing      │ Best distribution       │ Slower computation  │
│  Cuckoo Hashing      │ O(1) lookup guaranteed  │ Insertion can fail  │
│  Robin Hood          │ Low variance probes     │ Complex deletion    │
│                                                                      │
│  Load Factor Guidelines:                                             │
│    • Chaining: α ≤ 0.75                                              │
│    • Open Addressing: α ≤ 0.5 (linear), α ≤ 0.7 (double)            │
│                                                                      │
│  Advanced Techniques:                                                │
│    • Consistent Hashing: Distributed systems, minimal remapping     │
│    • Perfect Hashing: Static sets, guaranteed O(1), no collisions   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## **7.9 Practice Problems**

### **Problem 1: Design HashMap**
Design a HashMap without using any built-in hash table libraries. Implement `put`, `get`, and `remove` operations.

### **Problem 2: First Unique Character**
Given a string, find the first non-repeating character and return its index.

### **Problem 3: Group Anagrams**
Given an array of strings, group anagrams together using hashing strategies.

### **Problem 4: LRU Cache**
Design and implement an LRU (Least Recently Used) cache using a hash map and doubly linked list (already covered in Ch 5, but reinforces hash table usage).

### **Problem 5: Single Number**
Given an array of integers where every integer appears exactly twice except for one, find the single number. Solve it once with a hash set and once with bit manipulation (XOR), then compare the space usage of the two approaches.

---

## **7.10 Further Reading**

1. **Introduction to Algorithms (CLRS)** Chapter 11 - Hash Tables
2. **The Art of Computer Programming, Vol 3** by Knuth - Searching and hashing
3. **"Consistent Hashing and Random Trees"** by Karger et al. (Original paper)
4. **"Cuckoo Hashing"** by Pagh and Rodler (Original paper)
5. **"Robin Hood Hashing"** by Celis et al.

---

> **Coming in Chapter 8**: We'll explore **Trees**, starting with binary trees and binary search trees. You'll learn about tree traversals, self-balancing BSTs (AVL and Red-Black trees), and their applications in database indexing and search algorithms.

---

**End of Chapter 7**