## **Query Processing and Optimization – Part 2: Optimization**

### **1. Introduction**

* Query processing and optimization aim to find the **most efficient way** to execute a given query.
* The goal is to minimize **query cost**, primarily in terms of:

  * **Disk transfers**
  * **Disk seeks**
* In the previous module:

  * Discussed selection, sorting, join algorithms, duplicate removal, projection, and aggregation.
* In this module:

  * Focus on **query optimization basics**.
  * Learn how to:

    1. Transform relational expressions to create equivalent alternatives.
    2. Choose the **lowest-cost** query evaluation plan from these alternatives.

---

### **2. Query Optimization: Basic Steps**

Once the query is parsed and translated into a relational algebra expression:

1. **Generate equivalent expressions**

   * Apply transformation (equivalence) rules to create alternatives.
2. **Choose algorithms for each operation**

   * e.g., Which join method to use.
3. **Evaluate cost**

   * Select the best evaluation plan based on estimated cost.

**Representation:**

* Queries are represented as **expression trees** in optimization.
* Nodes: relational operators (selection, join, projection, etc.)
* Leaves: base relations (tables).
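
A minimal sketch of such an expression tree, assuming a hypothetical three-relation query. The class names (`Relation`, `Select`, `Project`, `Join`) and the relation names are illustrative choices, not part of the lecture:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:          # leaf: a base relation (table)
    name: str


@dataclass(frozen=True)
class Select:            # internal node: selection
    predicate: str
    child: object


@dataclass(frozen=True)
class Project:           # internal node: projection
    attrs: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Join:              # internal node: join of two subtrees
    left: object
    right: object


def leaves(node):
    """Return the base relation names at the leaves, left to right."""
    if isinstance(node, Relation):
        return [node.name]
    if isinstance(node, Join):
        return leaves(node.left) + leaves(node.right)
    return leaves(node.child)


# Projection on top, selection below it, joins below that, relations at the leaves.
tree = Project(
    ("name", "title"),
    Select(
        "dept = 'Music'",
        Join(Join(Relation("teaches"), Relation("course")), Relation("instructor")),
    ),
)

print(leaves(tree))
```

Because the nodes are immutable values, a rewritten tree can reuse unchanged subtrees instead of copying them.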

---

### **3. Example – Moving Selection Down**

#### Initial Query:

* Retrieve names of instructors and courses they teach in the **Music department**.

#### Naive Plan:

1. Join Teaches and Course.
2. Join with Instructor.
3. Apply **selection** (`department_name = 'Music'`).
4. Apply **projection** (name, title).

#### Optimization Idea:

* The selection `department_name = 'Music'` applies **only** to the Instructor relation.
* Move the selection **before** the join (push down selection).
* Benefits:

  * Reduces the number of tuples early.
  * Later joins operate on smaller relations.
  * Saves processing time.
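
The effect of pushing the selection down can be sketched with a toy nested-loop join that counts tuple comparisons. The tuples below are made-up sample data, and `join` is a hypothetical helper written for this illustration:

```python
def join(left, right, key):
    """Nested-loop equi-join on `key`; returns (result rows, comparisons made)."""
    out, comparisons = [], 0
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out, comparisons


instructor = [
    {"id": 1, "name": "Mozart", "dept": "Music"},
    {"id": 2, "name": "Einstein", "dept": "Physics"},
    {"id": 3, "name": "Wu", "dept": "Finance"},
    {"id": 4, "name": "Katz", "dept": "Comp. Sci."},
]
teaches = [
    {"id": 1, "course_id": "MU-199"},
    {"id": 2, "course_id": "PHY-101"},
    {"id": 3, "course_id": "FIN-201"},
    {"id": 4, "course_id": "CS-101"},
    {"id": 4, "course_id": "CS-315"},
]

# Selection applied after the join: every instructor is compared with every teaches row.
joined, late_cmp = join(instructor, teaches, "id")
late = [t for t in joined if t["dept"] == "Music"]

# Selection pushed below the join: only Music instructors take part in the join.
music = [t for t in instructor if t["dept"] == "Music"]
early, early_cmp = join(music, teaches, "id")

print(late == early, late_cmp, early_cmp)
```

Both plans return the same tuples, but the pushed-down version joins one instructor row instead of four.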

---

### **4. Evaluation Plan (Annotated Query Tree)**

* An **evaluation plan** = query tree + **specific algorithms** for each operation.
* Example: Retrieve instructors in Music dept. teaching in year 2019.

#### Optimized Steps:

1. **Selection on Instructor** (`dept = 'Music'`):

   * Use an **index scan** (instructor data rarely changes, so an index is cheap to maintain).
2. **Selection on Teaches** (`year = 2019`):

   * No index is available, so use a linear scan.
3. **Join results** of the above selections using **merge join**.
4. **Join with Course** relation using **hash join** (efficient due to unique course IDs).
5. **Projection** (name, title) with **duplicate removal** via sorting.
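
The plan above can be written down as an annotated tree, where each node carries both the operation and the chosen algorithm. This is only a sketch: the `node` helper and the dictionary layout are assumptions for illustration, and the access method for Course is not stated in the notes, so it is marked as assumed:

```python
def node(op, algo, *inputs):
    """One plan node: the operation, the algorithm chosen for it, and its inputs."""
    return {"op": op, "algo": algo, "inputs": list(inputs)}


plan = node(
    "project(name, title)", "sort-based duplicate removal",
    node(
        "join with course", "hash join",
        node(
            "join instructor and teaches", "merge join",
            node("select dept = 'Music' from instructor", "index scan"),
            node("select year = 2019 from teaches", "linear scan"),
        ),
        node("read course", "file scan (assumed; not stated in the notes)"),
    ),
)


def execution_order(n):
    """Post-order traversal: inputs are evaluated before the operator that uses them."""
    steps = []
    for child in n["inputs"]:
        steps.extend(execution_order(child))
    steps.append((n["op"], n["algo"]))
    return steps


for op, algo in execution_order(plan):
    print(f"{op:40s} -> {algo}")
```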

---

### **5. Key Points on Query Optimization**

* Heuristic example:

  * Apply filters **as early as possible**.
  * Use indexes where possible.
* Real optimizer:

  * Must explore **alternatives systematically**.
  * Cost estimation based on **statistics** (relation size, index availability, selectivity).

---

### **6. Relational Expression Equivalence**

**Definition:**

* Two relational algebra expressions are **equivalent** if they produce the **same set of tuples** for all **legal database instances**.
* Notes:

  * Tuple **order** does not matter (set semantics).
  * For SQL (multiset semantics):

    * Tuples may be duplicated.
    * Equivalence requires **same multiplicity** for each tuple.
  * Differences in results for **illegal instances** (violating constraints) are ignored.
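
The difference between set and multiset equivalence can be sketched with Python's `set` and `collections.Counter`. The two result lists are made-up examples:

```python
from collections import Counter

# Two hypothetical query results over a single attribute.
result_a = [("Music",), ("Music",), ("Physics",)]
result_b = [("Physics",), ("Music",)]

# Set semantics: order and duplicates are ignored.
same_as_sets = set(result_a) == set(result_b)

# Multiset (SQL) semantics: each tuple must appear the same number of times.
same_as_bags = Counter(result_a) == Counter(result_b)

print(same_as_sets, same_as_bags)
```

The two results are equivalent as sets but not as multisets, because `("Music",)` appears twice in one and once in the other.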

---

### **7. Equivalence Rules in Relational Algebra**

The optimizer uses **equivalence rules** to transform queries.

#### **Selection Rules**

1. **Conjunctive decomposition**:

   ```
   σθ1 AND θ2 (E) = σθ1 ( σθ2 (E) )
   ```
2. **Commutativity**:

   ```
   σθ1 ( σθ2 (E) ) = σθ2 ( σθ1 (E) )
   ```
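
Both selection rules can be checked on sample data. The sketch below uses made-up tuples and a hypothetical `select` helper; it demonstrates the rules on one instance and is not a proof:

```python
def select(pred, rows):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rows if pred(t)]


rows = [
    {"dept": "Music", "salary": 40000},
    {"dept": "Music", "salary": 90000},
    {"dept": "Physics", "salary": 95000},
    {"dept": "Physics", "salary": 60000},
]


def is_music(t):
    return t["dept"] == "Music"


def high_paid(t):
    return t["salary"] > 80000


# Rule 1: a conjunctive selection equals a cascade of single selections.
combined = select(lambda t: is_music(t) and high_paid(t), rows)
cascade_a = select(is_music, select(high_paid, rows))

# Rule 2: the order of the cascaded selections does not matter.
cascade_b = select(high_paid, select(is_music, rows))

print(combined == cascade_a == cascade_b)
```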

#### **Projection Rules**

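* In a cascade of projections, only the **outermost** (last applied) projection is needed: `ΠL1 ( ΠL2 ( … ΠLn (E) ) ) = ΠL1 (E)`, where L1 ⊆ L2 ⊆ … ⊆ Ln.
* The inner projections can be dropped because the outermost one already keeps the smallest attribute set.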

#### **Selection and Join**

3. Selection after Cartesian product = Theta join:

   ```
   σθ ( E1 × E2 ) = E1 ⋈θ E2
   ```
4. Selection after theta join (merge conditions):

   ```
   σθ2 ( E1 ⋈θ1 E2 ) = E1 ⋈(θ1 AND θ2) E2
   ```

#### **Join Rules**

5. **Commutativity**:

   ```
   E1 ⋈ E2 = E2 ⋈ E1
   ```
6. **Associativity** (natural join):

   ```
   (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
   ```
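7. **Theta join associativity** holds only under a condition on the predicates: `(E1 ⋈θ1 E2) ⋈(θ2 AND θ3) E3 = E1 ⋈(θ1 AND θ3) (E2 ⋈θ2 E3)`, provided θ2 involves attributes of E2 and E3 only.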
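
Commutativity and associativity of the natural join can be checked on a small instance. In this sketch the relations are lists of dictionaries with made-up tuples, and `natural_join` and `as_bag` are hypothetical helpers:

```python
from collections import Counter


def natural_join(r, s):
    """Natural join: combine tuples that agree on all attributes they share."""
    out = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):
                out.append({**a, **b})
    return out


def as_bag(rows):
    """Order-insensitive view of a result, for comparing two results."""
    return Counter(frozenset(t.items()) for t in rows)


instructor = [{"id": 1, "name": "Mozart"}, {"id": 2, "name": "Einstein"}]
teaches = [{"id": 1, "course_id": "MU-199"}, {"id": 2, "course_id": "PHY-101"}]
course = [
    {"course_id": "MU-199", "title": "Music Video Production"},
    {"course_id": "PHY-101", "title": "Physical Principles"},
]

# Commutativity: E1 join E2 versus E2 join E1.
commutes = as_bag(natural_join(instructor, teaches)) == as_bag(natural_join(teaches, instructor))

# Associativity: (E1 join E2) join E3 versus E1 join (E2 join E3).
left_deep = natural_join(natural_join(instructor, teaches), course)
right_deep = natural_join(instructor, natural_join(teaches, course))
associates = as_bag(left_deep) == as_bag(right_deep)

print(commutes, associates)
```

The results are compared as bags of attribute-value pairs because the rules guarantee the same tuples, not the same tuple order or attribute order.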

#### **Selection Distributes over Join**

* If θ1 involves only attributes of E1 and θ2 involves only attributes of E2:

  ```
  σθ1 AND θ2 (E1 ⋈θ E2) = (σθ1(E1)) ⋈θ (σθ2(E2))
  ```
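
A small check of this rule, assuming made-up tuples and a hypothetical `theta_join` helper. Here θ is the join condition on `id`, θ1 touches only the first relation, and θ2 touches only the second:

```python
def theta_join(r, s, theta):
    """Theta join: pair up tuples for which the join condition holds."""
    return [{**a, **b} for a in r for b in s if theta(a, b)]


instructor = [
    {"id": 1, "name": "Mozart", "dept": "Music"},
    {"id": 2, "name": "Einstein", "dept": "Physics"},
]
teaches = [
    {"id": 1, "course_id": "MU-199", "year": 2019},
    {"id": 1, "course_id": "MU-101", "year": 2018},
    {"id": 2, "course_id": "PHY-101", "year": 2019},
]


def same_id(a, b):       # theta: the join condition
    return a["id"] == b["id"]


def theta1(t):           # uses only attributes of instructor
    return t["dept"] == "Music"


def theta2(t):           # uses only attributes of teaches
    return t["year"] == 2019


# Left-hand side: select after the join.
lhs = [t for t in theta_join(instructor, teaches, same_id) if theta1(t) and theta2(t)]

# Right-hand side: select on each input, then join the smaller inputs.
rhs = theta_join(
    [a for a in instructor if theta1(a)],
    [b for b in teaches if theta2(b)],
    same_id,
)

print(lhs == rhs)
```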

#### **Projection Distributes over Join**

* If the result needs only some attributes, projection can be applied before the join:

  * Project each relation to **only necessary attributes** for:

    * Output list
    * Join condition
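
A sketch of early projection on made-up tuples: each input keeps only the output attributes plus the join attribute `id`. The helper names `project` and `join_on` are assumptions for this illustration:

```python
def project(rows, attrs):
    """Projection without duplicate removal (bag semantics)."""
    return [{a: t[a] for a in attrs} for t in rows]


def join_on(r, s, key):
    """Equi-join on a single shared attribute."""
    return [{**a, **b} for a in r for b in s if a[key] == b[key]]


instructor = [
    {"id": 1, "name": "Mozart", "dept": "Music", "salary": 40000},
    {"id": 2, "name": "Einstein", "dept": "Physics", "salary": 95000},
]
teaches = [
    {"id": 1, "course_id": "MU-199", "sec_id": 1, "semester": "Spring", "year": 2019},
    {"id": 2, "course_id": "PHY-101", "sec_id": 1, "semester": "Fall", "year": 2019},
]

# Projection only at the end: the join carries every attribute.
wide = join_on(instructor, teaches, "id")
late = project(wide, ["name", "course_id"])

# Projection pushed down: keep the output attributes plus the join attribute.
narrow = join_on(
    project(instructor, ["id", "name"]),
    project(teaches, ["id", "course_id"]),
    "id",
)
early = project(narrow, ["name", "course_id"])

print(late == early, len(wide[0]), len(narrow[0]))
```

The final result is the same, while the intermediate join result shrinks from eight attributes per tuple to three.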

#### **Set Operation Rules**

* Union, intersection → commutative and associative (set difference is neither).
* Selection distributes over union, intersection, difference.
* Projection distributes over union.
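
These rules can be checked with Python sets, which follow set semantics directly. The relations below are made-up, and the last line adds a contrast that is a standard fact rather than something stated in the notes: projection does not, in general, distribute over set difference.

```python
r = {("Mozart", "Music"), ("Einstein", "Physics"), ("Gold", "Physics")}
s = {("Einstein", "Physics"), ("Katz", "Comp. Sci.")}


def select(pred, rel):
    return {t for t in rel if pred(t)}


def project(rel, positions):
    """Projection with duplicate removal (set semantics)."""
    return {tuple(t[i] for i in positions) for t in rel}


def is_physics(t):
    return t[1] == "Physics"


# Selection distributes over union, intersection, and difference.
sel_union = select(is_physics, r | s) == select(is_physics, r) | select(is_physics, s)
sel_inter = select(is_physics, r & s) == select(is_physics, r) & select(is_physics, s)
sel_diff = select(is_physics, r - s) == select(is_physics, r) - select(is_physics, s)

# Projection distributes over union.
proj_union = project(r | s, [1]) == project(r, [1]) | project(s, [1])

# Contrast: projection over difference gives a different result on this instance.
proj_diff = project(r - s, [1]) == project(r, [1]) - project(s, [1])

print(sel_union, sel_inter, sel_diff, proj_union, proj_diff)
```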

---

### **8. Applying Transformation Rules – Examples**

#### Example 1 – Push Selection

* Move `dept = 'Music'` down to Instructor before join.

#### Example 2 – Two Conditions

* `(dept = 'Music') AND (year = 2009)`:

  1. Change join associativity to join smaller intermediate results first.
  2. Push each selection to its respective relation (Instructor, Teaches).
  3. Reduce size before final join.

#### Example 3 – Push Projection

* Perform projection early to reduce attributes carried through joins.

---

### **9. Generating Alternative Plans**

**Approaches:**

* **Brute force enumeration**:

  * Apply all equivalence rules to all subexpressions until no new expressions are produced.
  * The number of alternatives grows exponentially with query size → impractical for all but small queries.
* **Dynamic programming**:

  * Generate and keep **best plan so far** for each subexpression.
  * **Prune** plans that cannot lead to better results.
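
A minimal sketch of the dynamic-programming idea applied to join ordering. It keeps only the cheapest plan found for each subset of relations and reuses it when building larger subsets. The cost model (sum of estimated intermediate result sizes), the cardinalities, and the selectivities are all assumptions chosen for illustration, not figures from the lecture:

```python
from fractions import Fraction
from itertools import combinations


def best_join_order(card, selectivity):
    """card: {relation: row count}; selectivity: {frozenset({a, b}): fraction}.
    Returns (cost, plan), where cost is the total estimated size of all join results."""
    rels = sorted(card)

    def size(subset):
        # Estimated rows: product of cardinalities times applicable selectivities.
        n = 1
        for rel in subset:
            n *= card[rel]
        for pair in combinations(sorted(subset), 2):
            n *= selectivity.get(frozenset(pair), 1)
        return n

    # Best plan for a single relation is the relation itself, at cost 0.
    best = {frozenset([rel]): (0, rel) for rel in rels}
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            whole = frozenset(combo)
            candidates = []
            for i in range(1, k):
                for left in combinations(combo, i):
                    lset = frozenset(left)
                    rset = whole - lset
                    # Reuse the best plans already stored for both halves.
                    cost = best[lset][0] + best[rset][0] + size(whole)
                    candidates.append((cost, (best[lset][1], best[rset][1])))
            # Keep only the cheapest plan for this subset; the rest are pruned.
            best[whole] = min(candidates, key=lambda c: c[0])
    return best[frozenset(rels)]


card = {"instructor": 5, "teaches": 1000, "course": 200}
selectivity = {
    frozenset({"instructor", "teaches"}): Fraction(1, 100),
    frozenset({"teaches", "course"}): Fraction(1, 200),
}
cost, plan = best_join_order(card, selectivity)
print(cost, plan)
```

With these assumed statistics, the cheapest plan joins the small instructor input with teaches first, because that intermediate result is far smaller than joining teaches with course or forming a cross product.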

**Space optimization:**

* Avoid duplicating identical subexpressions.
* Let multiple expression trees **point to the same shared subexpression** instead of each storing its own copy.

---

### **10. Cost-Based Optimization**

* Cost depends on:

  * Statistics of base tables.
  * Estimated size of intermediate results.
  * Algorithmic costs for each operator.
* **Same query** may have different optimal plans for different datasets.
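
One way statistics feed into cost is through estimated result sizes. The sketch below uses a common textbook estimate for an equi-join on attribute A, which assumes values are uniformly distributed: rows(r) × rows(s) / max(V(A, r), V(A, s)), where V(A, r) is the number of distinct A values in r. The function name and the statistics are made-up for illustration:

```python
def estimate_join_size(rows_r, rows_s, distinct_r, distinct_s):
    """Estimated rows of an equi-join on A, assuming uniformly distributed values."""
    return rows_r * rows_s / max(distinct_r, distinct_s)


# Same query, two hypothetical datasets with different statistics.
large = estimate_join_size(rows_r=5000, rows_s=10000, distinct_r=5000, distinct_s=5000)
small = estimate_join_size(rows_r=50, rows_s=100, distinct_r=50, distinct_s=50)

print(large, small)
```

Because the estimate changes with the statistics, the plan that looks cheapest on one dataset need not be cheapest on another.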

---

### **11. Summary – Optimization Process**

1. Parse SQL → relational algebra expression.
2. Apply **equivalence rules** to generate **alternative expressions**.
3. Annotate with algorithm choices → evaluation plans.
4. Estimate costs using statistics.
5. Choose **lowest-cost** plan.
6. Pass evaluation plan to execution engine.

---

✅ **Learning Outcomes Achieved:**

* **Basic issues for optimizing queries**:

  * Need to reduce relation sizes early.
  * Importance of indexes, join ordering, pushing selection/projection.
  * Use of statistics in cost estimation.
* **Transformation of relational expressions**:

  * Equivalence rules create multiple alternatives.
  * Early filtering and attribute reduction significantly improve performance.