---

# 🏗️ The Foundation: How Snowflake Executes Queries

Before we talk about optimization, let's remember **how Snowflake works internally**:

* Data is stored in **Remote Storage** (S3, Azure Blob, GCS).
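* Data is organized into **Micro-partitions** (each holds roughly 50 to 500 MB of uncompressed data, commonly around 16MB once compressed).
* Queries don't have to scan whole tables → they **prune micro-partitions** (skip the ones whose metadata shows they can't contain matching rows).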
* Compute is done in **Virtual Warehouses (VWHs)**.
* Optimizations = "How can we help Snowflake prune smarter and compute less?"

Keep this mental picture in mind. Now let's go deeper.

---

# 🔍 Optimizing Queries (Primary Checks)

When your query feels slow, don't immediately jump to scaling warehouses. That's like hiring 100 workers when 5 smart ones could finish the job if instructions were clearer.

Instead, check these areas **step by step**:

---

### 2.1 Complexity of Query

👉 Ask: *"Is this query doing more than it needs to?"*

* Example: A BI analyst writes:

```sql
SELECT DISTINCT c.customer_id, c.customer_name, c.region, COUNT(*)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN payments p ON o.order_id = p.order_id
WHERE c.region = 'APAC'
GROUP BY c.customer_id, c.customer_name, c.region;
```

Problem:

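* `DISTINCT` on top of `GROUP BY` = duplication of effort. Grouping already returns one row per key combination.
* Multiple joins → not all needed. `payments` contributes no columns here. Note that dropping it changes `COUNT(*)` if an order can have zero or several payments, so confirm that first.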

Fix: Remove redundancy:

```sql
SELECT c.customer_id, c.customer_name, c.region, COUNT(*) 
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.region = 'APAC'
GROUP BY c.customer_id, c.customer_name, c.region;
```

---

### 2.2 Unnecessary Joins or Aggregations

Case: A data scientist joins 6 tables but only uses 3 of them in SELECT, and the other 3 aren't needed to filter rows either.
👉 Teach your brain: *If you don't need it, don't join it.*

---

### 2.3 Query Profile (Where's the time spent?)

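Snowflake provides a **Query Profile graph** for each query. Every operator node shows the share of execution time it took, so look at the most expensive ones first:

* `TableScan` = Scanning.
* `Join` = Joining.
* `Aggregate` = Aggregating.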

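👉 Suppose 80% of time is in the **JOIN step** → check for duplicate or mismatched join keys, or a join that outputs far more rows than it takes in.
👉 Suppose 70% is in the **SCAN step** → maybe poor partition pruning (compare "Partitions scanned" with "Partitions total").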

Always look here before guessing.
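The same per-operator numbers can also be read with SQL. A minimal sketch, assuming the slow query was the last statement run in the current session; `GET_QUERY_OPERATOR_STATS` and `LAST_QUERY_ID` are Snowflake functions, while the choice of fields and the ordering are just one way to read the output:

```sql
-- Per-operator statistics for the most recent query in this session.
-- overall_percentage is the operator's share of total execution time.
SELECT operator_id,
       operator_type,
       execution_time_breakdown:overall_percentage::FLOAT AS pct_of_time,
       operator_statistics:input_rows  AS input_rows,
       operator_statistics:output_rows AS output_rows
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()))
ORDER BY pct_of_time DESC;
```

If a `TableScan` row sits at the top, suspect pruning. If a `Join` row sits at the top and its `output_rows` is far larger than its `input_rows`, suspect the join keys.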

---

### 2.4 Join Keys with Duplicates

Case:

* `orders` table has duplicate `order_id` rows.
* You join with `order_items`.
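  👉 Each duplicate `order_id` multiplies its matching `order_items` rows, so the result can explode into millions of rows unexpectedly (and any sums get inflated).

Solution: Deduplicate first, for example with `ROW_NUMBER()` filtered by `QUALIFY`.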
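A sketch of how that check and fix might look. The columns `updated_at`, `item_id`, and `quantity` are hypothetical and only there for illustration; swap in whatever decides which copy of an order is the one to keep:

```sql
-- 1) Check whether the join key is really unique.
SELECT order_id, COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- 2) Keep one row per order_id, then join.
--    updated_at is a hypothetical column used to pick the latest copy.
WITH orders_dedup AS (
    SELECT *
    FROM orders
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1
)
SELECT o.order_id, i.item_id, i.quantity
FROM orders_dedup o
JOIN order_items i ON o.order_id = i.order_id;
```

If the first query returns no rows, the key is unique and the deduplication step is unnecessary.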

---

### 2.5 Cartesian Joins

A Cartesian join happens when no join condition exists, so every row on one side is paired with every row on the other. A wrong or too-loose key gives a similar blow-up.

* Example: 1M rows × 1M rows → 1 trillion rows.
  👉 Always double-check join conditions.
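One cheap way to catch this before paying for it is to look at the plan. A minimal sketch using `EXPLAIN`, which compiles the query without running it; `item_id` is a hypothetical column:

```sql
-- Missing join condition: every order is paired with every item.
EXPLAIN USING TEXT
SELECT o.order_id, i.item_id
FROM orders o, order_items i;

-- Intended join: the condition ties each item to its order.
EXPLAIN USING TEXT
SELECT o.order_id, i.item_id
FROM orders o
JOIN order_items i ON o.order_id = i.order_id;
```

The first plan should contain a Cartesian-style join operator and the second an ordinary join on `order_id`. The exact operator names in the output are best confirmed against your own account.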

---

### 2.6 Other Issues (Must-know)

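* **Functions in WHERE** → avoid wrapping filtered columns inside functions (`DATE(created_at)`), since that can keep Snowflake from pruning on the column's min/max metadata. Instead rewrite as a range on the raw column (`created_at >= '2024-01-01' AND created_at < '2024-02-01'`).
* **Select only needed columns** → storage is columnar, so fewer columns means fewer bytes scanned.
* **Avoid `SELECT *`** in production.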
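To make the first point concrete, here is a sketch of the same January filter written both ways. It assumes `created_at` is a timestamp column on `orders`, which the text above does not state:

```sql
-- Harder to prune: the filter is on a computed value, not the stored column.
SELECT order_id, created_at
FROM orders
WHERE DATE(created_at) BETWEEN '2024-01-01' AND '2024-01-31';

-- Prune-friendly: half-open range on the raw column.
-- It also includes timestamps late on January 31, which
-- "BETWEEN '2024-01-01' AND '2024-01-31'" on a timestamp would miss.
SELECT order_id, created_at
FROM orders
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-02-01';
```

The half-open form (`>=` start, `<` next period) is a design choice that works the same for dates and timestamps.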

---

### 🔑 Teaching Question (for self-check)

👉 If a query is slow, how would you **differentiate** between a join problem and a scan problem in Snowflake?
*(Hint: Look at Query Profile)*

---

# 🗃️ Chapter 3: Storage Considerations

This is where your question about **Automatic Clustering, Search Optimization Service, and Materialized Views** fits in.

But before we jump:
👉 You asked: *"Where does data get ordered/changed: in Remote Storage or in the Micro-partition layer?"*

Answer:

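* Micro-partitions are the files Snowflake writes to **Remote Storage**, and each one is immutable.
* When data is ingested or changed, Snowflake writes new **Micro-partitions** instead of editing existing ones in place.
* Optimizations (clustering, search optimization, materialized views) **never edit existing files in place**. Clustering rewrites data into new, better-ordered micro-partitions; search optimization and materialized views store extra structures alongside the table. All of them improve **how metadata is used to prune/locate micro-partitions**.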

So think: **Remote storage = frozen lake. Metadata + micro-partitions = map that tells Snowflake where to fish.**

---

### 3.1 Automatic Clustering

* Purpose: Keep data **physically organized** in micro-partitions along chosen columns (like `region`, `date`).
* Without clustering → queries may scan many partitions.
* With clustering → a background service rewrites poorly ordered micro-partitions so that rows with similar clustering-key values sit together.

**Scenario:**

* `sales` table = 10TB.
* Queries mostly filter `WHERE order_date BETWEEN ...`.
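* Without clustering → if dates are scattered across micro-partitions, a date filter can end up scanning most of the 10TB.
* With clustering → Micro-partitions neatly follow date ranges → pruning could cut the scan to a small slice (say ~500GB; the figure is illustrative).

👉 But clustering costs compute (credits for background reclustering). Only enable if pruning benefit >> cost.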
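A minimal sketch of setting this up for the `sales` scenario and checking whether it is paying off. The statements are standard Snowflake commands; the table and column come from the scenario above:

```sql
-- Define the clustering key; Automatic Clustering then maintains it in the background.
ALTER TABLE sales CLUSTER BY (order_date);

-- Inspect how well the table is clustered on that key. The result is a JSON
-- string with fields such as average_depth and a partition depth histogram.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(order_date)');

-- Pause or resume background reclustering to control cost.
ALTER TABLE sales SUSPEND RECLUSTER;
ALTER TABLE sales RESUME RECLUSTER;
```

A lower average depth generally means better pruning. If the table is already well ordered by load date, a clustering key may add cost without much benefit.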

---

### 3.2 Search Optimization Service (SOS)

* Purpose: Speed up **point lookups** or highly selective filters.
* Example: Query = `WHERE customer_email = 'abc@gmail.com'`.
* Normally Snowflake may scan many micro-partitions, because min/max metadata rarely prunes well on a random-looking value → slow.
* SOS builds an extra index-like structure (a search access path) → lets Snowflake skip straight to the micro-partitions that can contain the value.

👉 Key: Great for **high-cardinality columns** (emails, IDs), not for low-cardinality (gender = 'M').
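A sketch of enabling SOS for that lookup. It assumes `customer_email` lives on the `customers` table, and that the account is on an edition that includes search optimization (Enterprise or higher):

```sql
-- Estimate build and maintenance cost before enabling.
SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('customers');

-- Enable search optimization only for equality lookups on one column.
ALTER TABLE customers ADD SEARCH OPTIMIZATION ON EQUALITY(customer_email);

-- The kind of query that benefits.
SELECT customer_id, customer_name
FROM customers
WHERE customer_email = 'abc@gmail.com';
```

Limiting the `ON` clause to the columns you actually look up keeps the extra storage and maintenance cost down.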

---

### 3.3 Materialized Views

* Purpose: Store **pre-computed results** of a query (unlike normal views).
* Example:

```sql
CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT region,
       MONTH(order_date) AS order_month,  -- month number only: the same month of different years is combined
       SUM(amount) AS total_amount
FROM sales
GROUP BY region, MONTH(order_date);
```

* When user queries:

```sql
SELECT * FROM mv_monthly_sales WHERE region='APAC';
```

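→ Served from the pre-computed result instead of re-aggregating `sales`.

* Snowflake keeps it fresh automatically when the base table changes (a background service does the maintenance).

👉 Downside: Extra storage + compute to maintain, and materialized views require Enterprise Edition or higher. Use for **frequently queried aggregations**.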

---

# 🛠️ Chapter 4: Missing but Important Concepts

You didn't mention these, but they're **critical in Snowflake optimization**:

### 4.1 Result Caching

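* If the exact same query runs again and the underlying data hasn't changed → results served from the result cache (no warehouse compute). Cached results are kept for 24 hours.
* Best practice: Ensure queries are written consistently, and avoid functions evaluated at run time (like `CURRENT_TIMESTAMP`) when you don't need them, since they prevent cache reuse.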
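A small sketch of the difference, reusing the `sales` table from earlier. The date literal is arbitrary:

```sql
-- Not reusable: CURRENT_TIMESTAMP is evaluated at run time.
SELECT region, SUM(amount) AS total_amount
FROM sales
WHERE order_date >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY region;

-- Reusable: identical text on every run, and the boundary is a literal.
SELECT region, SUM(amount) AS total_amount
FROM sales
WHERE order_date >= '2024-01-01'
GROUP BY region;

-- When benchmarking, turn off result reuse so you time the real work.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```

The second query is only served from cache while the data in `sales` is unchanged. Remember to set `USE_CACHED_RESULT` back to `TRUE` after benchmarking.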

---

### 4.2 Metadata & Partition Pruning

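* Filter directly on columns whose values follow how the data was loaded or clustered (typically dates or sequential IDs), so min/max metadata can rule out micro-partitions.
* Bad: `WHERE YEAR(order_date)=2024` (the function can hide the column from pruning).
* Good: `WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'`.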
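To find out which queries prune badly, one option is the account-level query history. A sketch, assuming your role can read the `SNOWFLAKE.ACCOUNT_USAGE` share (this view lags behind real time); the one-day window and the limit of 20 are arbitrary choices:

```sql
-- Recent queries that scanned the largest fraction of their tables' micro-partitions.
SELECT query_id,
       total_elapsed_time / 1000 AS elapsed_seconds,
       partitions_scanned,
       partitions_total,
       bytes_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
  AND partitions_total > 0
ORDER BY partitions_scanned / partitions_total DESC, bytes_scanned DESC
LIMIT 20;
```

A query that scans nearly all partitions of a large table despite having a selective `WHERE` clause is a good candidate for a filter rewrite or a clustering key.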

---

### 4.3 Warehouse Sizing: Scale Up vs Scale Out

* **Scale Up**: Increase warehouse size (more CPU/memory) → good for single large queries.
* **Scale Out**: Add more clusters (a multi-cluster warehouse) → good for many concurrent users.

👉 Story: If you're building one giant building, hire stronger workers (scale up). If you're building 100 houses at the same time, hire more teams (scale out).
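In SQL the two options look roughly like this. `analytics_wh` is a hypothetical warehouse name, the sizes and cluster counts are placeholders, and multi-cluster warehouses require Enterprise Edition or higher:

```sql
-- Scale up: a bigger warehouse for one heavy query.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: let the warehouse add clusters under concurrent load.
ALTER WAREHOUSE analytics_wh SET
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'STANDARD';
```

Each size step roughly doubles the credits consumed per hour, so scale back down once the heavy work is done.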

---

### 4.4 Data Modeling Choices

* Analytics queries on Snowflake usually run best on a **star schema** or another lightly denormalized model.
* Over-joining normalized schemas = slow queries.

---

# 🧭 Chapter 5: Practice Scenarios

Let me give you 2 practice thought experiments:

1. Query is slow. Query Profile shows **90% scan time**.
   👉 What's your fix? (*Clustering or partition pruning optimization*)

2. Query is slow. Query Profile shows **80% join time**.
   👉 What's your fix? (*Fix duplicate keys, check join strategy, remove unnecessary joins*).

---

# ✅ Wrap-up

So in summary:

* **Query optimization** starts with simplifying SQL and analyzing Query Profile.
* **Storage optimization** works at the micro-partition and metadata level. Techniques:

  * Automatic Clustering → better partition pruning.
  * SOS → faster point lookups.
  * Materialized Views → pre-computed results.
* Don't forget **caching, warehouse scaling, and pruning strategies**.
* Always think: "Am I reducing **data scanned** or **data processed**?"

---
