**In-depth notes** on the lecture *"Indexing and Hashing/2: Indexing/2"* from the IIT Madras DBMS course, covering each concept and learning outcome.

---

# 📘 **DBMS - Module 42: Indexing and Hashing/2: Indexing/2**

## 🎯 **Learning Outcomes**

1. Recap **Balanced Binary Search Trees (BSTs)** as optimal in-memory search data structures.
2. Understand challenges with **external search data structures** for persistent data.
3. Study **2-3-4 Trees** as a precursor to **B-Trees** and **B+ Trees**—the foundation for efficient external indexing in databases.

---

## 🔁 **1. Recap: Balanced Binary Search Trees (BSTs)**

### 📌 **Search Time Complexity**

* Binary Search Tree (BST) search time: **O(h)** where `h = height of the tree`.

  * Best case: **O(log n)** (balanced tree).
  * Worst case: **O(n)** (skewed/unbalanced tree).

### 📌 **Problem**

* Repeated inserts/deletes without balancing may increase `h` up to `n`.
* A skewed tree degrades search to **linear time**, defeating the purpose of a BST.
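
To make the degradation concrete, here is a minimal sketch of a plain (non-balancing) BST. The names `Node`, `insert`, `height`, and `build` are illustrative, not from the lecture. Inserting keys in sorted order yields a chain, while a favorable order yields a balanced tree.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Plain BST insert with no rebalancing (iterative, so deep chains are fine)."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                break
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key)
                break
            cur = cur.right
    return root


def height(root):
    """Number of levels in the tree, counted level by level."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


sorted_keys = list(range(1, 16))  # 1..15 in ascending order
balanced_order = [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]

print(height(build(sorted_keys)))     # 15: every node has only a right child
print(height(build(balanced_order)))  # 4: a perfectly balanced tree
```

The same 15 keys give a height of either `n` or about `log2 n`, depending only on insertion order.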

---

## 🌳 **2. Ensuring Balanced BSTs**

### ✅ **Self-Balancing BSTs**

* Guarantee: Height always **O(log n)**.

#### ⚙️ **AVL Trees**

* Ensures: Height difference between left and right subtrees at any node ≤ 1.
* If violated during insert/delete → **rebalancing via rotations**.
* Oldest form of self-balancing BST.
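* **Cons**: Every update must maintain per-node height information and may trigger rotations; with one key per node, the structure does not carry over well to large, disk-resident data.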
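
The rebalancing rule above can be sketched as follows. This is a minimal illustration of AVL insertion (no deletion), with hypothetical names `AVLNode`, `rotate_left`, `rotate_right`, and `insert`; it is not taken from the lecture.

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # levels in the subtree rooted at this node


def h(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def rotate_left(x):
    """Lift x's right child above x."""
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def rotate_right(y):
    """Lift y's left child above y."""
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def insert(node, key):
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    update(node)
    balance = h(node.left) - h(node.right)
    if balance > 1:  # left subtree too tall
        if h(node.left.left) < h(node.left.right):  # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:  # right subtree too tall
        if h(node.right.right) < h(node.right.left):  # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


root = None
for k in range(1, 16):  # sorted input, the worst case for a plain BST
    root = insert(root, k)
print(root.key, root.height)  # 8 4
```

Even with sorted input, the rotations keep the 15-key tree at 4 levels.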

---

### 🎲 **Randomized BSTs**

* **Assumption**: If keys arrive in random order, subtree sizes tend to be balanced.
* Real-world data is often sorted or skewed → a plain BST becomes unbalanced.

#### 💡 **Randomization Strategy**

* Make structural decisions with random choices that are **independent of the data values** and their arrival order.
* The guarantee is probabilistic, not worst-case: **height remains O(log n) with high probability**.
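
One common way to realize this idea is a treap, where each key gets a random priority and rotations keep the priorities in heap order. The notes do not say which randomized BST the lecture uses, so the treap here is an assumption, and all names (`TreapNode`, `insert`, `height`) are illustrative.

```python
import random


class TreapNode:
    def __init__(self, key, priority):
        self.key = key
        self.priority = priority  # random, independent of the key
        self.left = None
        self.right = None


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y


def insert(node, key, rng):
    """BST insert by key, then rotate up while the child has higher priority."""
    if node is None:
        return TreapNode(key, rng.random())
    if key < node.key:
        node.left = insert(node.left, key, rng)
        if node.left.priority > node.priority:
            node = rotate_right(node)
    else:
        node.right = insert(node.right, key, rng)
        if node.right.priority > node.priority:
            node = rotate_left(node)
    return node


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


rng = random.Random(42)  # seeded only to make the run repeatable
root = None
for k in range(1, 201):  # sorted input
    root = insert(root, k, rng)
print(height(root))  # far below 200 with overwhelming probability
```

The shape of the tree depends only on the random priorities, so sorted input is no worse than any other order.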

---

### 🪜 **Skip Lists (Multilevel Linked Lists)**

* Hierarchical ordered linked lists.
* Bottom level: All data.
* Each upper level: Randomly selected subset (½, ¼, ⅛...).
* Expected search time: **O(log n)**.
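
A minimal skip list sketch, assuming each node is promoted to the next level with probability ½ and a hypothetical cap of 16 levels. The class and method names are illustrative.

```python
import random


class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] = next node on level i


class SkipList:
    MAX_LEVEL = 16  # assumed cap on the number of levels

    def __init__(self, seed=None):
        self.head = SkipNode(None, self.MAX_LEVEL)  # sentinel, holds no key
        self.level = 1  # number of levels currently in use
        self.rng = random.Random(seed)

    def _random_level(self):
        """Coin flips: level 1 always, level 2 with prob 1/2, level 3 with 1/4, ..."""
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL  # last node before key, per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        level = self._random_level()
        self.level = max(self.level, level)
        new = SkipNode(key, level)
        for i in range(level):  # splice the new node into each of its levels
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):  # start at the sparsest level
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key


sl = SkipList(seed=1)
for k in [30, 10, 50, 20, 40]:
    sl.insert(k)
print(sl.contains(20), sl.contains(25))  # True False
```

Search starts on the sparsest level and drops down one level whenever the next key would overshoot, which is what gives the expected O(log n) cost.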

---

### 📉 **Splay Trees (Amortized Strategy)**

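* Idea: Each accessed item is moved to the root, so recently or frequently used items stay near the top.
* No strict height control (a single operation can take O(n)), but:

  * Over any sequence of operations, the **amortized** cost per operation is **O(log n)**.
  * The bound comes from **amortized analysis**, not from a per-operation guarantee.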

---

### ⚠️ **Limitations of BSTs for Database Use**

| Property                   | BSTs (AVL, etc.)         | Limitation for DB                                  |
| -------------------------- | ------------------------ | -------------------------------------------------- |
| In-memory use              | ✅ Good                   | ❌ Designed for RAM, not for disk-resident data     |
| Rotations & rebalancing    | Needed on updates        | ❌ Each touched node may cost a separate disk access |
| Scale                      | Small to medium datasets | ❌ Height grows as log₂ n, too deep for disk I/O    |
| Mapping to external memory | ❌ One key per node       | ❌ A node does not fill a disk page                 |

---

## 💾 **3. External Search Data Structures**

### ❗ **Need for Disk-Friendly Structures**

* Database systems deal with **large volumes** of data → disk storage essential.
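* In-memory structures like AVL and splay trees are a poor fit because of:

  * Frequent, costly updates: rebalancing touches several nodes, each possibly on a different disk page.
  * Poor scaling to millions of records: with one key per node, a lookup follows about log₂ n pointers (roughly 20 for a million keys), and each hop may cost a disk access.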
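
A back-of-the-envelope sketch of why node width matters. It assumes a completely full tree and one disk read per node visited. The fanout of 256 is a hypothetical value chosen for illustration, not a figure from the lecture.

```python
def levels_needed(n_keys, fanout):
    """Smallest h such that a full tree with `fanout` children per node
    (fanout - 1 keys per node) holds n_keys keys, i.e. fanout**h - 1 >= n_keys."""
    h = 0
    while fanout ** h - 1 < n_keys:
        h += 1
    return h


n = 1_000_000
print(levels_needed(n, 2))    # 20 levels for a perfectly balanced binary tree
print(levels_needed(n, 256))  # 3 levels with 256-way nodes
```

Under these assumptions, a lookup drops from about 20 node reads to 3, which is the motivation for multi-key nodes.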

---

## 🌲 **4. 2-3-4 Trees – Precursor to B/B+ Trees**

### 🧠 **Key Idea**

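* Ensure **all leaves are at the same depth**.
* Guarantees tree height **h = O(log n)**: with `n` keys, the number of levels lies between log₄(n + 1) (all 4-nodes) and log₂(n + 1) (all 2-nodes).
* Achieved by allowing **multi-key nodes** that absorb insertions without changing the depth of any leaf.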

---

### 📐 **Node Types in 2-3-4 Tree**

| Node Type | Keys | Children |
| --------- | ---- | -------- |
| 2-node    | 1    | 2        |
| 3-node    | 2    | 3        |
| 4-node    | 3    | 4        |

---

### 🔍 **Search Operation**

* Like BST but extended:

  * In 3-node: check if key is < first, between the two, or > last (up to 2 comparisons, 3 branches).
  * In 4-node: up to 3 comparisons (one per key) and 4 branches.
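
The search rule above can be sketched as follows. `Node234` and `search` are illustrative names, and the small tree is built by hand rather than by insertion.

```python
class Node234:
    def __init__(self, keys, children=None):
        self.keys = keys                # 1 to 3 keys, kept sorted
        self.children = children or []  # empty for a leaf, else len(keys) + 1


def search(node, key):
    """Return True if key is in the tree rooted at node."""
    while node is not None:
        i = 0
        # Skip every key smaller than the target: at most 3 comparisons per node.
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and key == node.keys[i]:
            return True
        if not node.children:  # reached a leaf without finding the key
            return False
        node = node.children[i]  # branch i covers keys between keys[i-1] and keys[i]
    return False


# Hand-built tree: a 3-node root over a 3-node, a 2-node, and a 4-node leaf.
root = Node234([30, 50], [
    Node234([10, 20]),
    Node234([40]),
    Node234([60, 70, 80]),
])

print(search(root, 40), search(root, 80), search(root, 45))  # True True False
```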

---

### ➕ **Insert Operation**

#### Steps:

1. **Search** for correct leaf node.
2. Three scenarios:

   * Insert into a 2-node → becomes 3-node.
   * Insert into a 3-node → becomes 4-node.
   * Insert into a 4-node → **split needed**.

#### 💥 **Splitting a 4-Node**

* Middle key promoted to parent.
* 2 new child nodes created:

  * Left node with left key.
  * Right node with right key.
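* The promoted key goes into the parent, which may:

  * Be a 2-node → becomes a 3-node.
  * Be a 3-node → becomes a 4-node (which will itself need a split the next time a key is pushed into it).
  * Be a 4-node → it must be split first, so splits can cascade upward (this cannot happen with early splitting).
  * Not exist, because the node being split is the root → a new root is created → **height increases by 1**.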

✔ Only place where **height increases**: Splitting a 4-node **at the root**.
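
A minimal sketch of the split step, assuming the parent has room for the promoted key. `Node234`, `split_child`, and `split_root` are illustrative names, not from the lecture.

```python
class Node234:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []  # empty for a leaf


def split_child(parent, i):
    """Split the 4-node parent.children[i] and promote its middle key."""
    full = parent.children[i]
    assert len(full.keys) == 3, "only 4-nodes are split"
    assert len(parent.keys) < 3, "parent must have room for the promoted key"
    left = Node234(full.keys[:1], full.children[:2])   # left key, first two subtrees
    right = Node234(full.keys[2:], full.children[2:])  # right key, last two subtrees
    parent.keys.insert(i, full.keys[1])                # middle key moves up
    parent.children[i:i + 1] = [left, right]           # two nodes replace the old one


def split_root(root):
    """Split a 4-node root by hanging it under a new empty root: height + 1."""
    new_root = Node234([], [root])
    split_child(new_root, 0)
    return new_root


# Case 1: the 4-node is the root.
root = split_root(Node234([10, 30, 60]))
print(root.keys, [c.keys for c in root.children])  # [30] [[10], [60]]

# Case 2: the 4-node is a child of a 2-node parent.
parent = Node234([30], [Node234([10, 20]), Node234([40, 50, 60])])
split_child(parent, 1)
print(parent.keys, [c.keys for c in parent.children])  # [30, 50] [[10, 20], [40], [60]]
```

Treating the root split as "add an empty parent, then split the child" keeps a single code path for both cases.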

---

### ⛓ **Early vs Late Split Strategy**

| Strategy    | When to Split a 4-node | Benefit                      |
| ----------- | ---------------------- | ---------------------------- |
| Early Split | On the way **down**    | Parent always has room, so insertion finishes in one downward pass |
| Late Split  | On the way **back up** | Splits only when a leaf actually overflows; splits may cascade upward |

* Both strategies yield same **asymptotic performance** (O(log n)).

---

### ✏️ **Example**

Insert sequence: `10, 30, 60, 20, 50, 40, 70, 80, 15, 90, 100`

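* Start with 10 → 2-node `[10]`.
* 30 → 3-node `[10, 30]`.
* 60 → 4-node `[10, 30, 60]`.
* Insert 20 → split the 4-node root, promote 30 → new root `[30]` with children `[10]` and `[60]`; 20 joins the left leaf → `[10, 20]`.
* 50, 40 → the right leaf grows to `[40, 50, 60]`.
* 70 → the right leaf is a 4-node: split it early, promote 50 → root `[30, 50]`; 70 joins `[60]` → `[60, 70]`.
* 80, 15 → leaves become `[60, 70, 80]` and `[10, 15, 20]`.
* 90 → split `[60, 70, 80]`, promote 70 → root `[30, 50, 70]`; 90 joins `[80]` → `[80, 90]`.
* 100 → the root is now a 4-node: split it early, 50 becomes the new root with children `[30]` and `[70]`; 100 joins `[80, 90]` → `[80, 90, 100]`.
* Height increases **only** when 4-node at **root** is split (here, on inserting 20 and 100).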
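
The insertion sequence above can be replayed with a short sketch of top-down (early split) insertion. All names (`Node234`, `split_child`, `insert`, `levels`) are illustrative, and the code handles insertion only.

```python
class Node234:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty for a leaf


def split_child(parent, i):
    """Split the 4-node parent.children[i]; the middle key moves into parent."""
    full = parent.children[i]
    left = Node234(full.keys[:1], full.children[:2])
    right = Node234(full.keys[2:], full.children[2:])
    parent.keys.insert(i, full.keys[1])
    parent.children[i:i + 1] = [left, right]


def insert(root, key):
    """Insert key, splitting every 4-node met on the way down. Returns the root."""
    if root is None:
        return Node234([key])
    if len(root.keys) == 3:  # full root: the only place where height grows
        new_root = Node234([], [root])
        split_child(new_root, 0)
        root = new_root
    node = root
    while node.children:  # descend until a leaf is reached
        i = sum(1 for k in node.keys if key > k)  # index of the branch to follow
        if len(node.children[i].keys) == 3:       # early split before stepping in
            split_child(node, i)
            if key > node.keys[i]:                # promoted key now sits at index i
                i += 1
        node = node.children[i]
    node.keys.append(key)  # the leaf is guaranteed to have room
    node.keys.sort()
    return root


def levels(root):
    """Keys of every node, grouped level by level from the root down."""
    out, frontier = [], [root]
    while frontier:
        out.append([n.keys for n in frontier])
        frontier = [c for n in frontier for c in n.children]
    return out


root = None
for k in [10, 30, 60, 20, 50, 40, 70, 80, 15, 90, 100]:
    root = insert(root, k)

for level in levels(root):
    print(level)
# [[50]]
# [[30], [70]]
# [[10, 15, 20], [40], [60], [80, 90, 100]]
```

All four leaves end up on the same level, and the tree gained a level only on the two inserts that met a full root.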

---

### ➖ **Delete Operation (Concept Only)**

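* Mirrors insertion: instead of splitting nodes that are too full, deletion repairs nodes that would become empty.
* A deficient node either **borrows a key** from a sibling (through the parent) or is **merged** with a sibling; height decreases only when a merge empties the root.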

---

## ✅ **Advantages of 2-3-4 Tree**

* All leaf nodes at same depth → balanced structure.
* Search complexity: **O(log n)**.
* Sorted data storage.
* **Easily generalizable** to external memory structures (→ B/B+ trees).
* Can be adapted to **wider nodes** (like pages on disk).

---

## ❌ **Drawbacks**

* **Multiple node types**: 2-node, 3-node, 4-node.

  * Adds complexity in maintaining structure.
* Frequent **node construction and destruction** → overhead.

---

### 💡 **Optimization Strategy**

* Use a **uniform node layout** (always allocate a 4-node).

  * A node may hold fewer keys than its capacity.
  * Some space may be wasted → acceptable trade-off for performance.
* Constraint: Every node (except root) should be **at least half full**.
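
A minimal sketch of such a uniform layout. It assumes "at least half full" means the usual B-tree rule of at least ⌈m/2⌉ − 1 keys for a node with at most `m` children; the notes do not spell out the exact threshold. `ORDER` and `FixedNode` are hypothetical names.

```python
import math

ORDER = 4  # maximum children per node; capacity is ORDER - 1 = 3 keys


class FixedNode:
    """Every node reserves the same slots; unused slots stay None."""

    def __init__(self):
        self.keys = [None] * (ORDER - 1)
        self.children = [None] * ORDER
        self.n_keys = 0  # how many key slots are actually in use

    def add_key(self, key):
        """Place key in sorted position; a full node must be split first."""
        if self.n_keys == ORDER - 1:
            raise ValueError("node is full; split before inserting")
        i = self.n_keys
        while i > 0 and self.keys[i - 1] > key:
            self.keys[i] = self.keys[i - 1]  # shift larger keys one slot right
            i -= 1
        self.keys[i] = key
        self.n_keys += 1

    def meets_min_fill(self, is_root=False):
        """Assumed rule: non-root nodes keep at least ceil(ORDER / 2) - 1 keys."""
        return is_root or self.n_keys >= math.ceil(ORDER / 2) - 1


node = FixedNode()
print(node.meets_min_fill())  # False: an empty non-root node is under-full
node.add_key(30)
node.add_key(10)
print(node.keys, node.n_keys, node.meets_min_fill())  # [10, 30, None] 2 True
```

One node type replaces three, at the cost of an unused slot here and there.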

---

## 🧱 **Foundation for B/B+ Trees**

* A 2-3-4 Tree is the **in-memory equivalent** of a **B-tree**: it is a B-tree in which every node has at most 4 children.
* B-tree nodes = disk pages.
* Splitting/merging logic from 2-3-4 Tree helps manage **persistent index structures**.

---

## 📝 **Summary**

| Concept                     | Notes                                                        |
| --------------------------- | ------------------------------------------------------------ |
| BST                         | Fast search; degrades to O(n) when unbalanced                |
| Balanced BSTs (AVL etc.)    | Maintain O(log n) height; not disk-friendly                  |
| Randomized BST, Splay Trees | O(log n) expected or amortized in memory; not disk-friendly  |
| External Structure Need     | For large datasets and persistent indexing                   |
| 2-3-4 Tree                  | Balanced, O(log n) height, multi-key nodes                   |
| Insert/Deletion             | Split/merge nodes; height changes only at the root           |
| Precursor to B/B+ Trees     | Forms basis for database external index structures           |
