# **DBMS – Query Processing and Optimization/1: Processing**

## **Learning Outcomes**

1. **Understand the overall flow for Query Processing**
2. **Define the Measures of Query Cost**

---

## **1. Overview of Query Processing**

### **1.1 Context from Previous Modules**

* In earlier modules, we studied **Backup and Recovery**, **Logging**, **Hot Backups**, **Recovery Algorithms**, and **RAID**.
* Now shifting focus to **Query Processing**:

  * How an SQL query flows through the DBMS.
  * Cost measures for query execution.
  * Basic algorithms for operations like **selection**, **sorting**, and **join**.
  * How these algorithms affect query performance.

---

### **1.2 Query Processing Flow**

**Main Stages:**

1. **Parsing & Translation**

   * SQL query is parsed (like any programming language).
   * Translated into **Relational Algebra (RA)** expressions.
   * This works because SQL, relational calculus, and relational algebra can express the same core queries, so such an SQL query always has an RA equivalent.
2. **Optimization**

   * Relational algebra expression may have multiple equivalent forms.
   * Not all are equally efficient → choose the one with lowest cost.
   * Primary cost measure = **time**; sometimes **space**.
   * Uses **statistics from the system catalog** (e.g., tuple counts, index info) to choose an efficient plan.
3. **Execution Plan**

   * Specifies the exact operations (e.g., selection order, join method, index usage).
4. **Evaluation Engine**

   * Executes the plan step-by-step using access methods (e.g., B+ tree index, sequential scan).
   * Produces the **query output**.

---

**Example:**
SQL Query:

```sql
SELECT salary 
FROM instructor
WHERE salary < 75000;
```

* **Option 1:** Selection → Projection.
* **Option 2:** Projection (salary only) → Selection.
* Both are equivalent RA expressions.
* The optimizer chooses the cheaper one.
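
In relational algebra notation the two options can be written as follows. This is a sketch using the standard symbols σ (selection) and Π (projection), which the notes above do not spell out. Option 2 is valid only because the predicate refers to `salary` alone, the one attribute that survives the projection.

$$
\text{Option 1: } \Pi_{salary}\big(\sigma_{salary < 75000}(instructor)\big)
\qquad
\text{Option 2: } \sigma_{salary < 75000}\big(\Pi_{salary}(instructor)\big)
$$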

---

## **2. Measures of Query Cost**

### **2.1 Primary Cost Metric**

* **Total elapsed time** to answer a query.
* Depends on:

  * **Disk access time** (dominant factor).
  * CPU time (often small).
  * Network communication time (ignored here for simplicity).
* We **focus on disk access** cost.

---

### **2.2 Disk Access Cost Components**

1. **Seek Time (tS)**

   * Time to position the disk head over the desired track (cost models usually fold rotational latency into this term as well).
2. **Block Transfer Time (tT)**

   * Time to read/write a block after the head is positioned.
3. **Write Cost**

   * Higher than read cost (includes verification step).

---

### **2.3 Cost Formula**

Let:

* `B` = Number of blocks transferred.
* `S` = Number of seeks.
* `tT` = Time to transfer one block.
* `tS` = Time for one seek.

**Total Cost**:

$$
\text{Cost} = (B \times tT) + (S \times tS)
$$

* Ignore output writing cost (unless explicitly required).
* Focus only on **block transfers** & **seeks**.
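
As a quick illustration, the formula can be evaluated for a sequential scan versus fully random access. This is a minimal sketch: the function name `disk_cost` and the timings (0.1 ms per transfer, 4 ms per seek) are assumptions chosen for the example, not values from the lecture.

```python
def disk_cost(block_transfers: int, seeks: int, t_transfer: float, t_seek: float) -> float:
    """Estimated disk time B * tT + S * tS, in the same time unit as the inputs."""
    return block_transfers * t_transfer + seeks * t_seek


# Assumed illustrative timings in milliseconds (not from the lecture).
T_TRANSFER_MS = 0.1
T_SEEK_MS = 4.0

# Reading a 1000-block file sequentially: 1000 transfers, 1 seek.
sequential_ms = disk_cost(1000, 1, T_TRANSFER_MS, T_SEEK_MS)

# Reading the same 1000 blocks at scattered positions: 1 seek per block.
random_ms = disk_cost(1000, 1000, T_TRANSFER_MS, T_SEEK_MS)

print(f"sequential: {sequential_ms:.1f} ms, random: {random_ms:.1f} ms")
```

Under these assumed timings the number of seeks, not the number of transfers, decides the total, which is why access patterns matter as much as data volume.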

---

### **2.4 Buffer Effect**

* More **buffer memory** ⇒ fewer disk I/O operations ⇒ lower cost.
* Difficult to estimate exactly → use **worst-case estimates** (like algorithm complexity analysis).

---

## **3. Common Relational Operations & Their Costs**

### **3.1 Selection**

#### **Case A – Single Condition**

* **A1**: Linear Search (no index, arbitrary condition)

  * Cost = 1 seek + `br` block transfers (where `br` = #blocks in file).
* **A2**: Linear Search on Equality of Key

  * On average the scan stops halfway, so only half the blocks are read: cost = 1 seek + `br / 2` block transfers (worst case `br`).
* **A3**: Primary Index on Key

  * Cost = `(hi + 1) * (tS + tT)` where `hi` = height of B+ tree.
* **A4**: Primary Index on Non-Key

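  * Traverse the index to the first match; because the file is ordered on the search key, the matching tuples sit in consecutive blocks.
  * Cost = `hi * (tS + tT) + tS + b * tT` where `b` = #blocks holding matching tuples.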
* **A5**: Secondary Index on Key

  * Similar to **A3** in cost behavior.
* **A6**: Secondary Index on Non-Key

  * Each of the `n` matching records may sit in a different block, so cost = `(hi + n) * (tS + tT)`; for large `n` this can exceed a linear scan.
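
The six cases can be compared numerically. The sketch below uses the standard textbook estimates for each case; the function names and the sample numbers (1000 blocks, index height 3, 4 ms seek, 0.1 ms transfer) are assumptions for illustration only.

```python
def a1_linear_scan(b_r, t_s, t_t):
    """A1: one seek, then read every block of the file."""
    return t_s + b_r * t_t


def a2_linear_scan_key_equality(b_r, t_s, t_t):
    """A2: average case, the scan stops at the match after about half the blocks."""
    return t_s + (b_r / 2) * t_t


def a3_primary_index_key(h_i, t_s, t_t):
    """A3: one seek + transfer per index level, plus one to fetch the record."""
    return (h_i + 1) * (t_s + t_t)


def a4_primary_index_nonkey(h_i, b, t_s, t_t):
    """A4: index traversal, then one seek and b consecutive block transfers."""
    return h_i * (t_s + t_t) + t_s + b * t_t


def a5_secondary_index_key(h_i, t_s, t_t):
    """A5: same cost shape as A3."""
    return a3_primary_index_key(h_i, t_s, t_t)


def a6_secondary_index_nonkey(h_i, n, t_s, t_t):
    """A6: each of the n matching records may need its own seek + transfer."""
    return (h_i + n) * (t_s + t_t)


# Assumed sample values (milliseconds), not taken from the lecture.
B_R, H_I, T_S, T_T = 1000, 3, 4.0, 0.1

print("A1", a1_linear_scan(B_R, T_S, T_T))
print("A2", a2_linear_scan_key_equality(B_R, T_S, T_T))
print("A3", a3_primary_index_key(H_I, T_S, T_T))
print("A4", a4_primary_index_nonkey(H_I, 5, T_S, T_T))
print("A6", a6_secondary_index_nonkey(H_I, 50, T_S, T_T))
```

With these assumed numbers, A6 with 50 matching records costs more than the plain linear scan of A1, so an index is not automatically the cheaper access path.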

---

#### **Case B – Conjunctions (`AND`)**

* Use the **most selective condition** (the one expected to return the fewest tuples) first, through whichever access path is cheapest for it.
* Apply other conditions in **main memory** on filtered tuples.
* If **composite index** exists, use it directly.
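
A small in-memory sketch of this strategy follows. The toy `instructor` rows, the dictionary standing in for an index on `dept_name`, and the assumption that the equality on `dept_name` is the more selective condition are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy relation: (id, dept_name, salary).
instructor = [
    (1, "Physics", 95000),
    (2, "Comp. Sci.", 65000),
    (3, "Physics", 72000),
    (4, "History", 60000),
    (5, "Comp. Sci.", 92000),
]

# A dictionary standing in for an index on dept_name: value -> row positions.
dept_index = defaultdict(list)
for pos, row in enumerate(instructor):
    dept_index[row[1]].append(pos)


def select_conjunction(dept, max_salary):
    """Evaluate dept_name = dept AND salary < max_salary."""
    # Step 1: fetch candidates through the index on the (assumed) most selective condition.
    candidates = [instructor[pos] for pos in dept_index.get(dept, [])]
    # Step 2: test the remaining condition in memory on the fetched tuples only.
    return [row for row in candidates if row[2] < max_salary]


result = select_conjunction("Physics", 75000)
print(result)
```

Only the tuples reached through the index are ever tested against the second condition, which is the source of the saving.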

---

#### **Case C – Disjunctions (`OR`)**

* If indexes exist for all conditions → use them and take **union** of results.
* Otherwise → perform a **linear scan**.
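
The index-based strategy can be sketched as a union of record positions. The toy data, the dictionary used as an index on `dept_name`, and the sorted list used as an ordered index on `salary` are assumptions for the example.

```python
import bisect

# Hypothetical toy relation: (id, dept_name, salary).
instructor = [
    (1, "Physics", 95000),
    (2, "Comp. Sci.", 65000),
    (3, "Physics", 72000),
    (4, "History", 60000),
    (5, "Comp. Sci.", 92000),
]

# Stand-in for an index on dept_name: value -> set of row positions.
dept_index = {}
for pos, row in enumerate(instructor):
    dept_index.setdefault(row[1], set()).add(pos)

# Stand-in for an ordered index on salary: sorted (salary, position) pairs.
salary_index = sorted((row[2], pos) for pos, row in enumerate(instructor))
salary_keys = [salary for salary, _ in salary_index]


def select_disjunction(dept, max_salary):
    """Evaluate dept_name = dept OR salary < max_salary using both indexes."""
    by_dept = dept_index.get(dept, set())
    cut = bisect.bisect_left(salary_keys, max_salary)  # entries before cut have salary < max_salary
    by_salary = {pos for _, pos in salary_index[:cut]}
    matching = by_dept | by_salary  # union removes positions found by both indexes
    return [instructor[pos] for pos in sorted(matching)]


result = select_disjunction("History", 70000)
print(result)
```

If either condition lacked an index, every tuple would have to be inspected for that condition anyway, so a single linear scan would be the simpler choice.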

---

#### **Case D – Negation (`NOT`)**

* Usually requires a **linear scan**.

---

### **3.2 Sorting**

**Methods:**

1. **Index-Based Sort**

   * Read tuples in index order (may cause poor block access pattern).
2. **In-Memory Sort**

   * If data fits in memory → use QuickSort or similar.
3. **External Sort-Merge** (most common for large data)

   * Phase 1: Create sorted runs of `M` blocks each, where `M` is the number of blocks that fit in the memory buffer.
   * Phase 2: Merge runs iteratively until one sorted file remains.
   * Optimization: If the number of runs is less than `M` (one buffer block per run plus one for output), the merge finishes in **one pass**.
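
The two phases can be sketched in memory, with Python lists standing in for disk blocks and run files. The function name and the choice to merge `m - 1` runs per step (one buffer block reserved for output) are assumptions of this sketch; a real implementation would read and write blocks on disk.

```python
import heapq


def external_sort_merge(blocks, m):
    """Sort a 'file' given as a list of blocks (lists of keys) using m buffer blocks.

    Returns (sorted_keys, number_of_merge_passes).
    """
    if m < 3:
        raise ValueError("need at least 3 buffer blocks")

    # Phase 1: load m blocks at a time, sort them in memory, emit one sorted run.
    runs = []
    for i in range(0, len(blocks), m):
        chunk = [key for block in blocks[i:i + m] for key in block]
        runs.append(sorted(chunk))

    # Phase 2: merge up to m - 1 runs per step until a single run remains.
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + m - 1]))
                for i in range(0, len(runs), m - 1)]
        passes += 1

    return (runs[0] if runs else []), passes


# 24 keys in descending order, stored 2 per block: 12 blocks in total.
keys = list(range(24, 0, -1))
blocks = [keys[i:i + 2] for i in range(0, len(keys), 2)]

sorted_small, passes_small = external_sort_merge(blocks, 3)
sorted_large, passes_large = external_sort_merge(blocks, 5)
print(passes_small, passes_large)
```

With 3 buffer blocks the 12 input blocks form 4 runs that need two merge passes; with 5 buffer blocks they form 3 runs that merge in a single pass.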

---

### **3.3 Join**

#### **Nested Loop Join**

* For each tuple in outer relation, scan entire inner relation.
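* Very costly → worst case `nr * bs + br` block transfers (`nr` = tuples in outer, `br` = blocks in outer, `bs` = blocks in inner).
* **Optimization**: If the smaller relation fits entirely in memory, use it as the **inner** relation; each relation is then read only once (`br + bs` block transfers).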
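
A minimal sketch of the tuple-at-a-time algorithm and its standard worst-case estimate (memory holds only one block of each relation) follows. In that estimate the inner relation is read once per outer tuple, giving `n_r * b_s + b_r` block transfers and `n_r + b_r` seeks, where `n_r` is the number of tuples in the outer relation. The toy relations and the sizes in the cost comparison are assumed for illustration.

```python
def nested_loop_join(outer, inner, theta):
    """Join two lists of tuples; theta(r, s) decides whether a pair matches."""
    result = []
    for r in outer:          # every outer tuple ...
        for s in inner:      # ... is compared against every inner tuple
            if theta(r, s):
                result.append(r + s)
    return result


def nested_loop_worst_case(n_r, b_r, b_s):
    """Worst-case (block_transfers, seeks) when only one block per relation fits in memory."""
    return n_r * b_s + b_r, n_r + b_r


# Hypothetical toy relations: student(id, name), takes(id, course).
student = [(1, "Ana"), (2, "Raj"), (3, "Li")]
takes = [(1, "CS-101"), (1, "CS-315"), (3, "PHY-101")]
joined = nested_loop_join(student, takes, lambda r, s: r[0] == s[0])
print(joined)

# Assumed sizes: student has 5000 tuples in 100 blocks, takes has 10000 tuples in 400 blocks.
print(nested_loop_worst_case(5000, 100, 400))   # student as outer
print(nested_loop_worst_case(10000, 400, 100))  # takes as outer
```

Under this estimate the tuple count of the outer relation multiplies the block count of the inner one, so the cheaper ordering depends on both figures rather than on block counts alone.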

#### **Block Nested Loop Join**

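* Pair up blocks instead of tuples: the inner relation is scanned once per outer **block** rather than once per outer tuple → fewer disk I/Os.
* Worst case: `br * bs + br` block transfers, so put the **smaller relation** as outer when neither fits in memory.
* Further optimization: Load **M - 2** blocks of the outer relation at a time (`M` = memory size in blocks), leaving one buffer block for the inner relation and one for output.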
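
The block-at-a-time variant and the effect of extra buffer blocks can be sketched as below. Relations are modelled as lists of blocks, each block a list of tuples; the function names and sample sizes are assumptions of this sketch.

```python
import math


def block_nested_loop_join(outer_blocks, inner_blocks, theta):
    """Join block by block: the inner relation is scanned once per outer block."""
    result = []
    for outer_block in outer_blocks:
        for inner_block in inner_blocks:
            for r in outer_block:
                for s in inner_block:
                    if theta(r, s):
                        result.append(r + s)
    return result


def block_nested_loop_transfers(b_r, b_s, m=3):
    """Worst-case block transfers with m buffer blocks, m - 2 of them holding outer blocks."""
    return math.ceil(b_r / (m - 2)) * b_s + b_r


# Hypothetical toy relations split into blocks of at most two tuples.
student_blocks = [[(1, "Ana"), (2, "Raj")], [(3, "Li")]]
takes_blocks = [[(1, "CS-101"), (1, "CS-315")], [(3, "PHY-101")]]
joined = block_nested_loop_join(student_blocks, takes_blocks, lambda r, s: r[0] == s[0])
print(joined)

print(block_nested_loop_transfers(100, 400))        # smaller relation as outer
print(block_nested_loop_transfers(400, 100))        # larger relation as outer
print(block_nested_loop_transfers(100, 400, m=52))  # 50 outer blocks per chunk
```

With the default of 3 buffer blocks the estimate reduces to `b_r * b_s + b_r`; giving the outer relation 50 blocks per chunk cuts the number of inner scans from 100 to 2.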

#### **Index Nested Loop Join**

* Use index on join attribute of inner relation.
* For each outer tuple, use index to directly fetch matching inner tuples.
* Drastically reduces cost for equality/natural joins.
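
The idea can be sketched with a dictionary standing in for the index on the inner relation's join attribute. In a real system that index (for example a B+ tree) would already exist on disk; building it inside the function is only a convenience of this sketch. The cost function follows the usual estimate `b_r * (tT + tS) + n_r * c`, where `c` is the assumed cost of one index lookup plus fetching the matches.

```python
from collections import defaultdict


def index_nested_loop_join(outer, inner, outer_key, inner_key):
    """Equi-join using a lookup structure on the inner relation's join attribute."""
    # Stand-in for a pre-existing index: join value -> matching inner tuples.
    index = defaultdict(list)
    for s in inner:
        index[inner_key(s)].append(s)
    # One index probe per outer tuple replaces a full scan of the inner relation.
    return [r + s for r in outer for s in index.get(outer_key(r), [])]


def index_nested_loop_cost(b_r, n_r, c, t_s, t_t):
    """Estimated time: read each outer block once, plus one index lookup of cost c per outer tuple."""
    return b_r * (t_t + t_s) + n_r * c


# Hypothetical toy relations: student(id, name), takes(id, course).
student = [(1, "Ana"), (2, "Raj"), (3, "Li")]
takes = [(1, "CS-101"), (1, "CS-315"), (3, "PHY-101")]
joined = index_nested_loop_join(student, takes, lambda r: r[0], lambda s: s[0])
print(joined)
```

Because the probe relies on looking up a specific value, this approach fits equality and natural joins rather than arbitrary join conditions.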

---

### **3.4 Duplicate Elimination**

* **Sorting-based**: Sort the tuples so that duplicates become adjacent, then keep one copy of each.
* **Hashing-based**: Hash tuples into buckets → remove duplicates in each bucket.
* Optimization: Remove duplicates **during merge phase** of external sort.
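
Both approaches can be sketched in memory. The function names and the bucket count are assumptions of this sketch, and a Python `set` stands in for the per-bucket duplicate check.

```python
def dedup_by_sorting(tuples):
    """Sort, then keep a tuple only if it differs from the previously kept one."""
    result = []
    for t in sorted(tuples):
        if not result or result[-1] != t:
            result.append(t)
    return result


def dedup_by_hashing(tuples, n_buckets=4):
    """Partition tuples into buckets by hash value, then drop duplicates inside each bucket."""
    buckets = [set() for _ in range(n_buckets)]
    for t in tuples:
        # Equal tuples hash alike, so all copies of a tuple land in the same bucket.
        buckets[hash(t) % n_buckets].add(t)
    return [t for bucket in buckets for t in bucket]


data = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
by_sorting = dedup_by_sorting(data)
by_hashing = dedup_by_hashing(data)
print(by_sorting)
print(sorted(by_hashing))
```

The sorting version returns its result in sorted order, while the hashing version returns it in no particular order.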

---

### **3.5 Projection**

* Apply projection → eliminate duplicates (same methods as above).
* Often performed **together** with duplicate elimination.

---

### **3.6 Aggregation**

* Group tuples by aggregation key.
* Maintain running aggregates in memory:

  * **Count**, **Min**, **Max**, **Sum** computed on the fly.
  * **Average** computed as `sum / count` after processing all tuples.
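
A single pass with running aggregates per group might look like this. The function name, the dictionary layout, and the toy rows are assumptions for the example.

```python
def aggregate(rows, key_of, value_of):
    """Single pass over rows, keeping running count/sum/min/max per group."""
    groups = {}
    for row in rows:
        key, value = key_of(row), value_of(row)
        g = groups.get(key)
        if g is None:
            groups[key] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            g["count"] += 1
            g["sum"] += value
            g["min"] = min(g["min"], value)
            g["max"] = max(g["max"], value)
    # Average is derived only after all tuples have been seen.
    for g in groups.values():
        g["avg"] = g["sum"] / g["count"]
    return groups


# Hypothetical toy relation: (id, dept_name, salary).
instructor = [
    (1, "Physics", 95000),
    (2, "Comp. Sci.", 65000),
    (3, "Physics", 72000),
    (4, "History", 60000),
    (5, "Comp. Sci.", 92000),
]

by_dept = aggregate(instructor, key_of=lambda r: r[1], value_of=lambda r: r[2])
print(by_dept["Physics"])
```

Only one small record per group stays in memory, so the input tuples themselves never need to be stored.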

---

## **4. Summary Table: Factors Affecting Query Cost**

| Factor                         | Impact                                       |
| ------------------------------ | -------------------------------------------- |
| Number of disk block transfers | Directly proportional to cost                |
| Number of seeks                | Each seek costs far more than one block transfer |
| Index availability             | Can drastically reduce cost                  |
| Buffer size                    | Larger buffers reduce disk I/O               |
| Order of operations            | Changes total cost significantly             |
| Data size and selectivity      | Fewer qualifying tuples mean less I/O        |

---

## **5. Key Takeaways**

* Query processing transforms SQL → Relational Algebra → Execution Plan.
* **Disk I/O dominates query cost** in most cases.
* Cost formula:

  $$
  \text{Cost} = B \times tT + S \times tS
  $$
* Choosing the right **operation order** and **access method** is critical for performance.
* Operations like **selection**, **sorting**, **join**, **projection**, and **aggregation** have multiple possible algorithms, each with different cost behaviors.