


# 📘 DBMS Notes  
## Module: Indexing and Hashing/3: Indexing/3  
### Subject: **Database Management Systems (DBMS)**  
### Learning Outcomes:
- Understand the **design of B+ Tree Index Files** as a generalization of 2-3-4 Tree  
- Understand the **fundamentals of B-Tree Index Files**

---

## 🔶 1. Recap: 2-3-4 Trees – The Foundation

### What is a 2-3-4 Tree?
- A **balanced multi-way search tree** with **3 types of nodes**:
  - **2-node**: 1 key, 2 children
  - **3-node**: 2 keys, 3 children
  - **4-node**: 3 keys, 4 children
- **All leaf nodes are at the same depth** – ensures balanced structure.
- Acts as a **precursor** to B+ Trees.

---

## 🔷 2. B+ Tree: Structure & Properties

### Definition
A **B+ Tree** is a **balanced multi-level index tree** where:
- All paths from the root to leaves have the **same length**
- The **leaf nodes contain the actual data** (or pointers to records)
- The **internal nodes only contain keys and child pointers**
- **Leaf nodes are linked** sequentially (for range queries)
- **Sorted order is preserved** in the leaf nodes

### Key Features
| Feature | Description |
|--------|-------------|
| **Balanced** | Height grows logarithmically with the number of keys, so lookups stay fast |
| **Multi-level Indexing** | Uses internal nodes for routing, leaves for data |
| **Linked Leaves** | Leaf nodes are linked via pointers for sequential access |
| **Sorted Leaf Keys** | Allows both binary search and range scan |
| **Single Structure for Index & Records** | Can be used for index files or full records |
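
The layout described above can be sketched as two node types. This is only an illustration: the class names `LeafNode` and `InternalNode`, the field names, and the helper `scan_leaves` are assumptions made for this example, not names from the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class LeafNode:
    # Sorted search keys; values[i] is the record (or record pointer) for keys[i].
    keys: List[int] = field(default_factory=list)
    values: List[str] = field(default_factory=list)
    # Link to the next leaf in key order, used for sequential and range access.
    next_leaf: Optional["LeafNode"] = None


@dataclass
class InternalNode:
    # Routing keys only; an internal node holds no records.
    keys: List[int] = field(default_factory=list)
    # Invariant: len(children) == len(keys) + 1
    children: List[Union["InternalNode", LeafNode]] = field(default_factory=list)


def scan_leaves(first: LeafNode) -> List[int]:
    """Walk the leaf chain and collect every key in sorted order."""
    out = []
    leaf = first
    while leaf is not None:
        out.extend(leaf.keys)
        leaf = leaf.next_leaf
    return out


right = LeafNode(keys=[30, 40], values=["r30", "r40"])
left = LeafNode(keys=[10, 20], values=["r10", "r20"], next_leaf=right)
root = InternalNode(keys=[30], children=[left, right])
print(scan_leaves(left))  # [10, 20, 30, 40]
```

Following `next_leaf` is what makes a range scan cheap: once the first leaf is found, no further descent from the root is needed.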

---

## 🧩 3. Trade-Offs in Node Size (Fan-out)

### Trade-off in `n` (max pointers per node; a node then holds at most `n − 1` keys):
- **Larger `n`**:
  - Fewer tree levels ⇒ **smaller height**
  - **More comparisons** per node (linear scan within node)
  - A node **may no longer fit in a single disk block**
- **Smaller `n`**:
  - More tree levels ⇒ **greater height**
  - **Fewer comparisons**, but **more disk accesses**

### Balance Rule:
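- Taking `n` as the maximum number of pointers per node, every **internal node** must be **at least half full**:
  - At least `⌈n/2⌉` pointers, and therefore at least `⌈n/2⌉ − 1` keys
- **Root node** is an exception: if it is not a leaf, it needs only 2 children
- **Leaf nodes** must also be at least half full: between `⌈(n−1)/2⌉` and `n − 1` keys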
- Leaf nodes also carry a **pointer to next leaf**
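
As a quick check of the occupancy limits, here is a small sketch that assumes the common textbook convention in which `n` is the maximum number of pointers per node (other conventions shift these numbers by one). The function name `occupancy_bounds` is hypothetical.

```python
import math


def occupancy_bounds(n: int) -> dict:
    """(min, max) occupancy for a B+ tree whose nodes hold at most n pointers.

    Assumes n counts pointers, so a node holds at most n - 1 keys.
    The root is exempt from the minimums.
    """
    return {
        "internal_pointers": (math.ceil(n / 2), n),
        "internal_keys": (math.ceil(n / 2) - 1, n - 1),
        "leaf_keys": (math.ceil((n - 1) / 2), n - 1),
    }


print(occupancy_bounds(4))
# {'internal_pointers': (2, 4), 'internal_keys': (1, 3), 'leaf_keys': (2, 3)}
```

With `n = 4` the internal nodes have 2 to 4 children, which is exactly the 2-node / 3-node / 4-node shape of a 2-3-4 tree.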

---

## 🔎 4. Searching in a B+ Tree

### Procedure:
1. Start from **root**
2. At each internal node:
   - Use **linear search** on keys to find correct child pointer
3. Traverse down to **leaf node**
4. **Search within leaf** to find key (or determine absence)

### Example:
Searching for key 55:
- Root: Compare against keys → route down
- Internal node: Compare → go to leaf
- Leaf node: Linear search → find 55
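
The procedure can be sketched in a few lines. The `Node` class and the small example tree are made up for illustration; the sketch also assumes that a key equal to a separator is routed to the right child, which matches a tree where the separator is the first key of the right subtree.

```python
class Node:
    """Hypothetical node: internal nodes set children, leaves set values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children  # None for a leaf
        self.values = values      # None for an internal node

    @property
    def is_leaf(self):
        return self.children is None


def search(root, key):
    """Return the value stored under key, or None if the key is absent."""
    node = root
    while not node.is_leaf:
        i = 0
        # Linear scan: skip every separator that is <= key.
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    # Linear search inside the leaf.
    for k, v in zip(node.keys, node.values):
        if k == key:
            return v
    return None


leaves = [
    Node([10, 20], values=["r10", "r20"]),
    Node([30, 40], values=["r30", "r40"]),
    Node([55, 60], values=["r55", "r60"]),
    Node([70, 80], values=["r70", "r80"]),
]
left = Node([30], children=leaves[:2])
right = Node([70], children=leaves[2:])
root = Node([55], children=[left, right])

print(search(root, 55))  # r55
print(search(root, 45))  # None
```

Every lookup, successful or not, visits exactly one node per level, so the cost is proportional to the tree height.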

---

## ✍️ 5. Insertion in B+ Tree

### Steps:
1. **Search** for the location in the appropriate **leaf node**
2. If **space exists**, insert in sorted order
3. If **no space**:
   - **Split** the node into two (half-half split)
   - **Promote** a separator key to the parent: a leaf split *copies* the first key of the new right node up, while an internal-node split *moves* its middle key up
4. If parent is full, **repeat split recursively**
5. May cause **tree height to increase**

### Properties Maintained:
- Nodes are **at least half full**
- **Sorted order** preserved
- All **leaf nodes stay at same level**
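
The split step can be illustrated on plain sorted lists. This is a sketch under stated assumptions, not a full insertion routine: `insert_into_leaf` and `split_internal` are hypothetical helpers, and the choice of split point follows the usual textbook formulation in which a leaf split copies the separator up while an internal split moves the middle key up.

```python
import bisect


def insert_into_leaf(keys, key, max_keys):
    """Insert key into a sorted leaf and split it if it overflows.

    Returns (left, right, separator). Without a split, right and
    separator are None.
    """
    keys = list(keys)
    bisect.insort(keys, key)
    if len(keys) <= max_keys:
        return keys, None, None
    mid = (len(keys) + 1) // 2  # left node keeps the first ceil(len/2) keys
    left, right = keys[:mid], keys[mid:]
    # The separator is copied up: it also stays in the right leaf.
    return left, right, right[0]


def split_internal(keys, children):
    """Split an overfull internal node; the middle key moves up."""
    mid = len(keys) // 2
    up = keys[mid]
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, up, right


print(insert_into_leaf([10, 20], 15, max_keys=3))
# ([10, 15, 20], None, None)
print(insert_into_leaf([10, 20, 30], 25, max_keys=3))
# ([10, 20], [25, 30], 25)
print(split_internal([10, 20, 30, 40], ["a", "b", "c", "d", "e"]))
# (([10, 20], ['a', 'b', 'c']), 30, ([40], ['d', 'e']))
```

When the separator returned by a split does not fit in the parent, the parent is split the same way; if the root itself splits, a new root is created and the height grows by one.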

---

## ❌ 6. Deletion in B+ Tree

### Steps:
1. **Search** and delete the key from leaf
2. If **leaf becomes under-full**:
   - **Borrow** from sibling (if possible)
   - Else, **merge with sibling** and **adjust parent**
3. May **propagate upwards** (if parent becomes under-full)
4. **Tree height may decrease**

### Properties Maintained:
- Minimum occupancy maintained
- All leaves remain at same depth
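
The borrow-or-merge decision for a leaf can be sketched as follows. The helper `delete_from_leaf` is hypothetical and simplified: leaves are plain sorted lists, only the right sibling is considered, and the parent update is described in comments rather than performed.

```python
def delete_from_leaf(leaf, right_sibling, key, min_keys):
    """Delete key from leaf and repair underflow using the right sibling.

    Returns (leaf, right_sibling, action) where action is "none",
    "borrow" or "merge". After a merge, right_sibling is None and the
    parent must drop the separator between the two nodes.
    """
    leaf = [k for k in leaf if k != key]
    right_sibling = list(right_sibling)
    if len(leaf) >= min_keys:
        return leaf, right_sibling, "none"
    if len(right_sibling) > min_keys:
        # Borrow the sibling's smallest key; the parent's separator
        # must then be replaced by the new right_sibling[0].
        leaf.append(right_sibling.pop(0))
        return leaf, right_sibling, "borrow"
    # The sibling has no key to spare: combine both into one node.
    return leaf + right_sibling, None, "merge"


print(delete_from_leaf([10, 20, 25], [30, 40], 20, min_keys=2))
# ([10, 25], [30, 40], 'none')
print(delete_from_leaf([10, 20], [30, 40, 50], 20, min_keys=2))
# ([10, 30], [40, 50], 'borrow')
print(delete_from_leaf([10, 20], [30, 40], 20, min_keys=2))
# ([10, 30, 40], None, 'merge')
```

A merge removes one pointer from the parent, which is why underflow can propagate upward and, if the root is left with a single child, reduce the height.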

---

## 🌳 7. Performance Analysis

### Tree Height:
- For `K` keys and at most `n` pointers per node, the height is at most `⌈log₍⌈n/2⌉₎(K)⌉`
- Example:
  - `K = 10⁶`, `n = 100` ⇒ height ≤ ⌈log₅₀(10⁶)⌉ = 4
  - A balanced binary tree would need about log₂(10⁶) ≈ 20 levels
- **B+ Tree is roughly 5x shallower** than a balanced binary tree, and each level typically costs one disk access
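
The numbers in this example can be reproduced with integer arithmetic. The helper `levels_needed` is a hypothetical name; it counts how many levels a tree with a given minimum fan-out needs before it can reach `num_keys` entries, which is the same as taking the ceiling of the logarithm.

```python
def levels_needed(num_keys: int, fanout: int) -> int:
    """Smallest h such that fanout ** h >= num_keys."""
    levels, reach = 0, 1
    while reach < num_keys:
        reach *= fanout
        levels += 1
    return levels


n = 100                  # max pointers per node (from the example)
min_fanout = (n + 1) // 2  # ceil(n / 2) = 50 for a half-full node

print(levels_needed(10**6, min_fanout))  # 4
print(levels_needed(10**6, 2))           # 20
```

The bound uses the half-full fan-out because that is the worst case; a tree whose nodes are fuller than half can only be shallower.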

---

## 📁 8. Index File vs Record File

### Use of B+ Tree:
- Can be used for:
  - **Index files**
  - **Record files (actual data)**
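- It **reorganizes itself** with small, local changes on every insertion and deletion
  - Overflow is handled by node splits, so performance does not degrade and no periodic reorganization of the whole file is needed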

---

## 🔄 9. Handling Duplicates in B+ Tree

### Approaches:
- **Relax strict inequality** in search to allow duplicates
- Keep **list of record pointers** for same key
- Or make key **unique** by appending a **record ID**
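
The last approach, appending a record ID to make every key unique, can be sketched with a sorted list standing in for the leaf level. The names `entries`, `insert`, and `lookup_all` are assumptions for this example.

```python
import bisect

# Sorted list of (search_key, record_id) pairs; a stand-in for the leaf level.
entries = []


def insert(search_key, record_id):
    """The composite (search_key, record_id) is unique even if search_key repeats."""
    bisect.insort(entries, (search_key, record_id))


def lookup_all(search_key):
    """Return the record IDs of every entry with the given search key."""
    # (search_key,) sorts before every (search_key, record_id) pair.
    i = bisect.bisect_left(entries, (search_key,))
    out = []
    while i < len(entries) and entries[i][0] == search_key:
        out.append(entries[i][1])
        i += 1
    return out


insert("Kim", 7)
insert("Lee", 1)
insert("Kim", 3)
print(lookup_all("Kim"))  # [3, 7]
print(lookup_all("Zed"))  # []
```

Because the composite keys are unique, the search, insert, and delete logic needs no special cases for duplicates; a lookup on the original key becomes a short range scan.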

---

## 🧷 10. Storage Optimization

### Record Pointer Overhead:
- Leaf entries of an index store pointers to the actual records
- If records **relocate** (for example, when a leaf of a B+ Tree file organization splits), every secondary index pointing at them must be updated ⇒ costly

### Solutions:
- Use **primary index key** in secondary indexes instead of record pointer
  - Saves cost of update during relocation
  - Queries through the secondary index cost more, since each lookup needs an extra traversal of the primary index to reach the record
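
A toy illustration of this idea, with Python dicts standing in for the two B+ trees. All names and data here (`primary_index`, `secondary_by_dept`, the sample rows, the `block` field) are invented for the example.

```python
# Primary index: primary key -> record, including its physical location.
primary_index = {
    101: {"name": "Kim", "dept": "CS", "block": 4},
    102: {"name": "Lee", "dept": "EE", "block": 4},
    103: {"name": "Roy", "dept": "CS", "block": 9},
}

# Secondary index on dept stores primary keys, not record pointers.
secondary_by_dept = {"CS": [101, 103], "EE": [102]}


def find_by_dept(dept):
    """Two-step lookup: secondary index first, then the primary index."""
    names = []
    for pk in secondary_by_dept.get(dept, []):
        names.append(primary_index[pk]["name"])
    return names


def relocate(pk, new_block):
    """Moving a record only touches the primary index."""
    primary_index[pk]["block"] = new_block


before = find_by_dept("CS")
relocate(101, 12)
after = find_by_dept("CS")
print(before, after)  # ['Kim', 'Roy'] ['Kim', 'Roy']
```

The secondary index is untouched by `relocate`; the price is the second lookup inside `find_by_dept`.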

---

## 🔠 11. Indexing with Strings

### Issues:
- **Variable-length keys** ⇒ **variable fan-out**
- Makes balancing harder

### Solution:
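- Use **prefix compression**
  - Internal nodes keep only enough leading characters of a key to separate the entries in the subtrees below
  - e.g., “Silas” and “Silberschatz” can be separated by “Silb”
  - Saves space without affecting how searches are routed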
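
One way to compute such a separator is to take the shortest prefix of the right-hand key that still sorts after the left-hand key. The function `shortest_separator` below is a hypothetical helper written for this example.

```python
def shortest_separator(left: str, right: str) -> str:
    """Shortest prefix p of right such that left < p <= right."""
    if not left < right:
        raise ValueError("left must sort strictly before right")
    for i in range(1, len(right) + 1):
        prefix = right[:i]
        if prefix > left:
            return prefix
    return right  # not reached: right itself always qualifies


print(shortest_separator("Silas", "Silberschatz"))  # Silb
print(shortest_separator("Adams", "Brown"))         # B
```

Any search key that is at most "Silas" compares below "Silb", and any key from "Silberschatz" onward compares at or above it, so routing decisions are unchanged while the internal node stores 4 characters instead of 12.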

---

## 🔁 12. B-Tree vs B+ Tree

| Feature | B+ Tree | B Tree |
|--------|---------|--------|
| **Key Occurrence** | Every key appears in a leaf; some are repeated in internal nodes | Each key appears only once |
| **Data Location** | Only in **leaf** nodes | In **both internal & leaf** nodes |
| **Pointers** | Leaves carry an extra next-leaf pointer for sequential access | Non-leaf nodes carry an extra record pointer per key; no next-leaf pointer |
| **Search Path** | Always goes to **leaf node** | May stop early at internal node |
| **Fan-out** | Higher (as internal nodes are compact) | Lower (internal nodes store data too) |
| **Popular Use** | Very common in databases | Less common |

### Why B+ Tree is preferred:
- Higher **fan-out** ⇒ smaller height
- **Easier range queries** via leaf links
- **Uniform structure** for index and record files
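
The fan-out difference can be made concrete with a back-of-the-envelope calculation. The block, key, and pointer sizes below are assumptions chosen for illustration, and the function names are hypothetical.

```python
def bplus_fanout(block: int, key: int, ptr: int) -> int:
    """Max children m of a B+ tree internal node.

    The node stores m child pointers and m - 1 keys:
    m * ptr + (m - 1) * key <= block
    """
    return (block + key) // (key + ptr)


def btree_fanout(block: int, key: int, ptr: int) -> int:
    """Max children m of a B-tree non-leaf node.

    The node stores m child pointers plus m - 1 keys, each with
    its own record pointer:
    m * ptr + (m - 1) * (key + ptr) <= block
    """
    return (block + key + ptr) // (key + 2 * ptr)


# Assumed sizes: 4096-byte block, 8-byte key, 8-byte pointer.
print(bplus_fanout(4096, 8, 8))  # 256
print(btree_fanout(4096, 8, 8))  # 171
```

Under these assumed sizes the B+ tree node routes to about 50% more children, and the gap widens if a B-tree stores whole records rather than record pointers in its non-leaf nodes.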

---

## ✅ Summary of Key Properties

- B+ Tree is a **generalization of 2-3-4 tree**
- Efficient for **search, insert, delete**
- Keeps tree **balanced**
- Supports both **random** and **sequential** access
- Preferred structure in **database indexing**

---

## 📌 Final Points

- All nodes (except root) must be **at least half full**
- **Internal nodes** contain keys and child pointers
- **Leaf nodes** contain keys, record pointers, and **next leaf pointer**
- **B+ Trees** avoid file degradation issues that affect index-sequential files
- **B-Trees** store each key only once, which can save some space, but they complicate insertion and deletion and reduce fan-out