---

# Snowflake Performance Optimization — A Comprehensive Guide

---

## 1. What is Performance Tuning in General?

### Story to set the context:

Imagine you run a huge online store — "MegaShop." Millions of customers visit daily, and they want to see product details, place orders, and get reports instantly. But your data system is slow; queries take minutes. Your job is to **tune the system** to make everything faster and smoother.

**Performance tuning** is the art and science of optimizing how queries and data operations are executed to reduce latency and improve throughput. It means identifying bottlenecks and applying the right techniques to get results faster, more cheaply, and more efficiently.

---

### Traditional Database Performance Tuning (for context):

In classic databases such as Oracle, MySQL, and SQL Server, common tuning techniques include:

* **Indexes** to speed up data lookups.
* **Primary keys** to enforce uniqueness and sometimes help query planners.
* **Partitions** to divide large tables so queries touch only relevant parts.
* **Execution plan analysis** to understand what the database engine does internally.
* **Removing unnecessary full table scans**, especially on huge tables.
* **Caching small tables** to avoid repeated disk I/O.
* **Query hints** to force query planners to choose specific indexes or join methods.
* **Join ordering** to reduce data shuffling and improve join performance.

---

### But — this is **NOT** exactly how Snowflake works!

---

## 2. How Is Snowflake Different? What Does Snowflake Performance Tuning Really Mean?

### Story continuation:

In MegaShop, you decide to migrate to Snowflake, a cloud-native data platform. After migration, you realize many old tuning tricks don’t apply — but Snowflake has its own unique ways of optimizing performance.

### Key differences:

* **No traditional indexes, and no enforcement of primary keys** (on standard tables, primary and foreign key constraints are informational metadata; only `NOT NULL` is enforced).
* **Automatic micro-partitioning** by Snowflake behind the scenes (you don’t create partitions explicitly).
* **Massively parallel processing (MPP) architecture with compute clusters (virtual warehouses).**
* **Automatic query optimization engine and caching at multiple layers.**

---

## 3. Snowflake’s Fundamental Performance Optimization Concepts

Let's break down what really matters:

### 3.1 Micro-partitions

* Snowflake automatically splits tables into **micro-partitions**, contiguous units of storage that contain metadata like min/max column values.
* This metadata lets Snowflake **prune irrelevant micro-partitions** during query time, avoiding scanning unnecessary data.
* **You cannot create partitions manually**, but you can influence clustering.

### 3.2 Clustering Keys (Manual Optimization)

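* On a very large table, rows with similar values can end up scattered across many micro-partitions (for example, when data is not loaded in the order you filter by), so Snowflake cannot prune effectively.
* You can define a **clustering key** so that Snowflake co-locates rows with similar key values in the same micro-partitions.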
* Example: If MegaShop often filters orders by `order_date`, defining clustering on `order_date` helps queries skip irrelevant micro-partitions.
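
A minimal sketch of the `order_date` example above, assuming a table named `orders` already exists. `SYSTEM$CLUSTERING_INFORMATION` returns a JSON summary (average depth, overlaps) that gives a rough idea of how well the table is clustered on the given column; it is worth checking before and after defining the key.

```sql
-- Define (or change) the clustering key on an existing table.
ALTER TABLE orders CLUSTER BY (order_date);

-- Inspect how well the table is clustered on that column.
-- Lower average depth generally means better pruning.
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
```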

### 3.3 Virtual Warehouses (Compute Resources)

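* You can **scale compute up** (resize a warehouse) or **out** (add clusters to a multi-cluster warehouse).
* Larger warehouses = more nodes = more parallelism, which often speeds up large queries; each size step roughly doubles the credits billed per hour.
* Auto-suspend and auto-resume avoid paying for idle compute.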
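
As an illustration of these settings, here is a small sketch. The warehouse name `reporting_wh` and the chosen sizes are assumptions for the example, not recommendations.

```sql
-- Create a warehouse that suspends after 60 idle seconds
-- and wakes up automatically when a query arrives.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Scale up for a heavy workload, then scale back down afterwards.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```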

### 3.4 Query Caching

* Snowflake caches data at several levels:

  * **Result Cache:** If the exact query ran before and underlying data didn't change, Snowflake returns cached results instantly.
  * **Local Disk Cache:** Each compute node caches data it reads to speed up repeated scans.
  * **Metadata Cache:** Cached micro-partition metadata speeds pruning.
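
When you benchmark a query, the result cache can hide its real cost. A minimal sketch, assuming you want repeatable timings: the `USE_CACHED_RESULT` session parameter turns result reuse off for the current session. The query in the middle is only a placeholder against the `orders` table used earlier.

```sql
-- Disable result-cache reuse while measuring.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Placeholder for the query being tuned.
SELECT COUNT(*) FROM orders WHERE order_date = '2025-01-01';

-- Restore the default behavior.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```

The warehouse's local disk cache may still be warm on a second run, so the first run after a warehouse resume is usually the slowest.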

### 3.5 Data Pruning (Skipping irrelevant data)

* Snowflake uses min/max statistics in micro-partitions to skip irrelevant partitions, greatly speeding up queries.

### 3.6 Query Profiling & Query Plan Analysis

* Snowflake provides **Query Profile** in the UI to visualize query steps, identify slow operations like big scans, spills, or joins.
* This helps identify bottlenecks and optimize.

---

## 4. Mapping Traditional Techniques to Snowflake: What Applies, What Doesn't, and What to Do Instead

---

### 4.1 "We add indexes, primary keys"

* **Traditional databases:** Indexes speed up lookup.
* **Snowflake:** There are **no user-managed indexes**. Primary keys are only for metadata, not enforced, and do NOT speed up queries.
* **Optimization Instead:** Use **clustering keys** to optimize micro-partition pruning for large tables.

**Example:**
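MegaShop has a huge `orders` table with billions of rows. Queries filter on `customer_id`. Defining clustering on `customer_id` can improve pruning and query speed, though a very high-cardinality key makes reclustering more expensive, so measure the benefit against the cost.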

---

### 4.2 "We create table partitions"

* **Traditional:** You manually partition big tables (like by date).
* **Snowflake:** Snowflake automatically **micro-partitions** data; no manual partitioning allowed.
* **Optimization Instead:** Use clustering keys if automatic pruning isn’t sufficient.

---

### 4.3 "We analyze query execution plan"

* This is **very important**.
* Snowflake provides detailed **Query Profile** which visualizes:

  * Time spent in scanning, filtering, joins, data shuffles, etc.
  * Bottlenecks like large data scans or compute spillover.
* Use Query Profile to identify if your queries are doing large scans, improper joins, or spilling to disk.
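
The same operator-level numbers shown in the Query Profile can also be read with SQL. A sketch, assuming the query you want to inspect was the previous statement in the session (the first statement is a placeholder):

```sql
-- Placeholder for the query under investigation.
SELECT region, SUM(amount) FROM sales_data GROUP BY region;

-- Per-operator statistics for that query: look for table scans with many
-- partitions scanned, and for operators reporting spilled bytes.
SELECT operator_id,
       operator_type,
       operator_statistics,
       execution_time_breakdown
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()))
ORDER BY operator_id;
```

The table and column names in the first statement are hypothetical; only `GET_QUERY_OPERATOR_STATS` and `LAST_QUERY_ID` are Snowflake functions.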

---

### 4.4 "Remove unnecessary large-table full-table scans"

* In Snowflake, **a query scans the whole table when its filters cannot prune any micro-partitions**.
* Filter on columns whose values are well clustered, so min/max metadata can exclude partitions.
* Use **clustering keys** or rewrite queries to leverage pruning.
* Avoid `SELECT *` if you only need a few columns; storage is columnar, so unselected columns are not read.
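
One common rewrite is to keep the filtered column "bare" so its min/max metadata can be compared directly. The sketch below assumes an `orders` table with an `order_date` column of type `DATE`; the other column names are hypothetical.

```sql
-- Harder to prune: the filter wraps the column in a function.
SELECT order_id, customer_id, order_amount
FROM orders
WHERE TO_CHAR(order_date, 'YYYY-MM') = '2025-01';

-- Easier to prune: a plain range on the column itself,
-- and only the columns that are actually needed.
SELECT order_id, customer_id, order_amount
FROM orders
WHERE order_date >= '2025-01-01'
  AND order_date <  '2025-02-01';
```

Compare "Partitions scanned" against "Partitions total" in the Query Profile for both versions to see whether the rewrite helps on your data.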

---

### 4.5 "Cache small-table full-table scans"

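* Snowflake caches data internally, and for joins against small tables its optimizer decides automatically whether to **broadcast** the small table to every node or to **shuffle** (repartition) both sides by the join key.
* Snowflake does not document a broadcast join hint. The `/*+ BROADCAST(table) */` syntax comes from Spark SQL; Snowflake parses it as an ordinary comment, so it does not change the plan.
* To steer the optimizer toward a broadcast, keep the small side genuinely small (filter or pre-aggregate it before the join) and confirm the join strategy in the Query Profile.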

---

Let’s unpack **“Cache small-table full-table scans”** and how Snowflake chooses between **broadcast and shuffle joins**, with a story so it’s crystal clear.

---

### 1️⃣ First, the Big Picture

Snowflake already **caches data** in multiple layers (the local disk cache on each Virtual Warehouse, the metadata cache, and the Result Cache).
When you join **a huge table** with **a tiny table**, the **query optimizer** decides the join strategy:

* **Broadcast Join** → Copy the small table to every compute node so each node can join locally. (No shuffling of large table.)
* **Shuffle Join** → Split both tables into chunks based on the join key, shuffle those chunks across nodes so matching rows end up together. (Costly for large datasets.)

---

### 2️⃣ Why “Cache small-table full-table scans” helps

If you have a very small table (say a lookup table of country codes, product categories, etc.), Snowflake can **broadcast it** instead of shuffling both sides of the join.

This:

* **Avoids large network movement** of the big table’s rows.
* Makes join faster.
* Reduces compute cost.

---

### 3️⃣ Story Example — “The Airport Immigration Desk”

Imagine:

* **Big Table** → `PASSENGERS` (50 million rows — all passengers arriving at an airport).
* **Small Table** → `COUNTRY_CODES` (200 rows — country code to country name mapping).

You want to find all passengers from countries starting with **'B'**.

---

### ❌ Without Broadcast Join (Shuffle Join)

Snowflake might decide:

1. Split **PASSENGERS** into multiple chunks (based on join key `country_code`).
2. Split **COUNTRY\_CODES** into chunks.
3. Shuffle both so that all matching `country_code` values end up on the same node.
4. Join the chunks on each node.

Problem:

* Even though `COUNTRY_CODES` is tiny, it still gets shuffled unnecessarily.
* And `PASSENGERS` — a massive dataset — might also get shuffled, which is expensive.

---

### ✅ With Broadcast Join

When the optimizer chooses a broadcast join, which it normally does for a table this small, the query needs no hint:

```sql
SELECT
    p.passenger_id,
    p.name,
    cc.country_name
FROM PASSENGERS p
JOIN COUNTRY_CODES cc
  ON p.country_code = cc.country_code
WHERE cc.country_name LIKE 'B%';
```

Then:

* Snowflake sends **COUNTRY\_CODES** (tiny table) to **all nodes**, which is cheap.
* Each node already has its own slice of `PASSENGERS` and can join locally with `COUNTRY_CODES`, so there is **no need to move `PASSENGERS` at all**.
* This is much faster and cheaper.

Note that Snowflake does not provide a `/*+ BROADCAST(cc) */` hint; that syntax belongs to Spark SQL, and Snowflake would read it as a plain comment.

---

### 4️⃣ Visualization

```
Without Broadcast Join:
PASSENGERS chunks  → shuffle across network → join
COUNTRY_CODES chunks → shuffle across network → join

With Broadcast Join:
COUNTRY_CODES → copied to each node (tiny, fast)
PASSENGERS chunks → joined locally with COUNTRY_CODES (no shuffle)
```

---

### 5️⃣ When to Use This

You rarely need to intervene: Snowflake’s optimizer detects small tables and broadcasts them automatically, and it offers no hint to force the choice.

You **should** investigate the join if:

* The Query Profile shows a table you know is small being repartitioned instead of broadcast.
* The large table is being shuffled even though the other side is tiny.
* The query plan shows expensive repartitioning steps.

In those cases, filter or pre-aggregate the small side before the join and compare the Query Profile again.
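
To look at the plan without running the query, `EXPLAIN` can be used. This is a sketch on the passenger example; the exact operator names in the output vary, so treat it as a starting point and confirm with the Query Profile of a real run.

```sql
-- Show the logical plan as text; look at which side of the join
-- is scanned first and how many partitions each scan expects to read.
EXPLAIN USING TEXT
SELECT p.passenger_id, p.name, cc.country_name
FROM PASSENGERS p
JOIN COUNTRY_CODES cc
  ON p.country_code = cc.country_code
WHERE cc.country_name LIKE 'B%';
```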

---



### 4.6 "Verify optimal index usage"

* No indexes in Snowflake, so this does not apply.
* Instead, verify **clustering** and **data pruning** effectiveness.

---

### 4.7 "Using hints to tune Oracle SQL"

* Snowflake does not support Oracle-style optimizer hints; a `/*+ ... */` block is parsed as a comment and ignored.
* The tuning levers are instead query structure, clustering, warehouse sizing, and materialized views.
* Always compare Query Profiles before and after a change.

---

### 4.8 "Self-order the table joins"

* In traditional DBs, join order can matter a lot.
* Snowflake’s optimizer chooses join order automatically.
* However, writing well-structured queries, filtering early, and **clustering large tables on commonly filtered or joined columns** can improve join efficiency.

---

## 5. Additional Important Topics for Snowflake Performance

### 5.1 Warehouse Sizing & Concurrency

* Bigger warehouse = faster query but higher cost.
* If many users run queries, scaling out (multi-cluster warehouses) helps concurrency without queuing.
* Scenario: MegaShop has 50 analysts running reports at once; a multi-cluster warehouse auto-scales to serve everyone smoothly.
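
A sketch of the scenario above, assuming a warehouse named `reporting_wh` (hypothetical) and an edition that supports multi-cluster warehouses. The cluster counts are illustrative only.

```sql
-- Let the warehouse add clusters under concurrent load
-- and remove them again when the load drops.
ALTER WAREHOUSE reporting_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';
```

Scaling out adds clusters of the same size, so it helps with queuing but does not make a single slow query faster.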

### 5.2 Data Types and Compression

* Snowflake uses automatic compression.
* Choosing native data types (e.g., storing numbers as `INTEGER` and dates as `DATE` rather than as strings) generally gives better compression and more useful min/max metadata for pruning.

### 5.3 Avoid Data Skew

* When joining tables, skew (a few key values holding a disproportionate share of the rows) can overload some compute nodes.
* To fix it, pre-aggregate the data or split the hot keys into buckets.

Let’s break down **Avoid Data Skew** in Snowflake with an easy story and then go deeper into what it means for performance.

---

### 1️⃣ What is Data Skew?

In a distributed system like Snowflake, data is split into **partitions** and spread across compute nodes.
When joining two tables, Snowflake **shuffles** data by the join key so matching rows end up on the same node.

If one **join key value** has way more rows than others, that key’s partition becomes **huge** — and the node that gets it will have to process way more work.
This is **data skew** → imbalance in workload across nodes.

---

### 2️⃣ Story Example — “Restaurant Orders”

Imagine:

* **BIG Table** → `ORDERS` (1 billion rows — orders from multiple restaurants)
* **SMALL Table** → `RESTAURANTS` (10,000 rows — restaurant info)
* Join key → `restaurant_id`

If:

* Most restaurants have **10,000 orders** (normal load)
* But **McDonald’s** has **200 million orders** (way bigger)

then, when Snowflake shuffles data by `restaurant_id` to join, the partition for McDonald’s will be **huge**, and one compute node will get stuck with this monster partition while others finish early and wait.

This slows down the whole query because **the query finishes only when the slowest node finishes**.
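
Before changing anything, it helps to measure how skewed the join key really is. A minimal sketch on the `ORDERS` table from the story:

```sql
-- Top 10 restaurants by row count, with each one's share of all orders.
-- One key holding a large share of the table is a sign of skew.
SELECT restaurant_id,
       COUNT(*) AS order_count,
       RATIO_TO_REPORT(COUNT(*)) OVER () AS share_of_rows
FROM ORDERS
GROUP BY restaurant_id
ORDER BY order_count DESC
LIMIT 10;
```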

---

### 3️⃣ How it looks in Snowflake query execution

In the **Query Profile**:

* Some partitions (nodes) show processing **gigabytes of data**
* Others show processing only **megabytes of data**
* The big ones are the skewed keys

---

### 4️⃣ How to Fix It

### **Option 1 — Pre-Aggregate**

If possible, aggregate skewed data before the join so the monster partition shrinks.

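```sql
-- Aggregate first, so each restaurant contributes a single row to the join
WITH sales_by_restaurant AS (
    SELECT restaurant_id, SUM(order_amount) AS total_sales
    FROM ORDERS
    GROUP BY restaurant_id
)
SELECT r.*, s.total_sales
FROM sales_by_restaurant s
JOIN RESTAURANTS r
  ON r.restaurant_id = s.restaurant_id;
```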

This reduces the amount of data per key before joining.

---

### **Option 2 — Repartition / Add Distribution Key**

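If you can, **spread the skewed data more evenly** by:

* Adding another column to the join key (e.g., `restaurant_id` + `order_date`), when both tables carry that column
* Assigning each row a bucket number (salting) to break large groups apart, and replicating the small table once per bucket

Example:

```sql
SELECT *
FROM (
    -- ABS() keeps the bucket in 0..9, because HASH() can return negative values
    SELECT o.*, ABS(HASH(order_id)) % 10 AS bucket
    FROM ORDERS o
) o2
JOIN (
    -- Replicate every restaurant row once per bucket (0..9)
    SELECT r.*, b.bucket
    FROM RESTAURANTS r
    CROSS JOIN (
        -- ROW_NUMBER() guarantees gap-free values; SEQ4() alone does not
        SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS bucket
        FROM TABLE(GENERATOR(ROWCOUNT => 10))
    ) b
) r2
ON o2.restaurant_id = r2.restaurant_id AND o2.bucket = r2.bucket;
```

This trick splits each large group into 10 smaller ones, at the cost of making the small table 10 times larger.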

---

### **Option 3 — Clustering**

If skew is due to **storage layout**, clustering can help Snowflake prune data better before join.

```sql
ALTER TABLE ORDERS CLUSTER BY (restaurant_id);
```

This doesn’t remove skew but can reduce how much data is read.

---

### 5️⃣ Summary Table

| Problem        | Cause                              | Fix                                  |
| -------------- | ---------------------------------- | ------------------------------------ |
| Data skew      | One/few keys have most of the data | Pre-aggregate, split key, bucketize  |
| Node imbalance | One node gets too much work        | Add extra key to distribute evenly   |
| Slow joins     | Shuffling huge partitions          | Broadcast small tables, bucket joins |

---


### 5.4 Materialized Views

* Use materialized views for expensive repeated queries.
* Snowflake maintains these automatically and speeds up query time.


Let’s make **Materialized Views** (MVs) in Snowflake simple and memorable.

---

### 1️⃣ What is a Materialized View?

A **Materialized View** is like taking a *snapshot* of your query’s results and storing it physically (like a pre-computed table).

* You write a query once
* Snowflake runs it and **stores the result**
* Snowflake automatically keeps it **up-to-date** when the underlying tables change
* When you query the MV, it’s **way faster** because it’s already pre-processed

---

### 2️⃣ Story Example — “The Coffee Shop Sales Report”

Imagine:

* You have a giant **`ORDERS`** table with **1 billion rows**
* Every day, you run:

```sql
SELECT store_id, SUM(sales_amount) AS total_sales
FROM ORDERS
WHERE order_date >= CURRENT_DATE - INTERVAL '7 DAYS'
GROUP BY store_id;
```

This is expensive — scanning 1B rows every time is slow and costs credits.

---

### 3️⃣ How Materialized View Helps

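Instead of recalculating **from scratch** each time, precompute daily totals per store. A materialized view definition must be deterministic, so the `CURRENT_DATE` filter moves out of the view and into the query that reads it:

```sql
CREATE MATERIALIZED VIEW mv_daily_store_sales AS
SELECT store_id, order_date, SUM(sales_amount) AS total_sales
FROM ORDERS
GROUP BY store_id, order_date;
```

* Snowflake stores **`mv_daily_store_sales`** like a table
* When new orders come in, Snowflake **incrementally updates** the MV in the background
* Querying:

```sql
SELECT store_id, SUM(total_sales) AS total_sales
FROM mv_daily_store_sales
WHERE order_date >= CURRENT_DATE - INTERVAL '7 DAYS'
GROUP BY store_id;
```

is **much faster** because it reads a small pre-aggregated result instead of 1 billion rows.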

---

### 4️⃣ Why it’s Faster

Without MV:

* Query engine scans huge table → filters → aggregates → returns

With MV:

* Just reads small, pre-aggregated result
* Less I/O + less CPU = faster & cheaper queries

---

### 5️⃣ Things to Keep in Mind

* **Best for repeated, expensive queries**: the more you reuse it, the more you save.
* Snowflake updates MVs **automatically**, but this has **storage cost** and **compute cost** for refreshing.
* Works best when **base table changes slowly** compared to how often you query.
* **Cannot** use non-deterministic functions (`CURRENT_TIMESTAMP`, `CURRENT_DATE`, `RANDOM()`).
* Can query **only a single table**: joins are not supported, and only a subset of aggregate functions is allowed. Good with stable **aggregations** and **filters**.
* Requires Enterprise Edition or higher.
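
To see what the automatic refreshes cost, the refresh history can be queried. A sketch, where `MY_MATERIALIZED_VIEW` is a placeholder for your own view name and the 7-day window is an arbitrary choice:

```sql
-- Credits consumed by background refreshes of one materialized view
-- over the last 7 days.
SELECT materialized_view_name,
       start_time,
       end_time,
       credits_used
FROM TABLE(INFORMATION_SCHEMA.MATERIALIZED_VIEW_REFRESH_HISTORY(
         DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
         MATERIALIZED_VIEW_NAME => 'MY_MATERIALIZED_VIEW'))
ORDER BY start_time DESC;
```

If the refresh credits approach what the view saves on reads, the base table is probably changing too often for a materialized view to pay off.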

---

### 6️⃣ Real Example in Analytics

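**Scenario**: Marketing team asks for daily “active users” count by country.
Instead of calculating daily from billions of log records, precompute one row per country and day. Snowflake materialized views do not allow `COUNT(DISTINCT ...)` or `CURRENT_DATE`, so this version uses `APPROX_COUNT_DISTINCT` (an estimate, not an exact count) and leaves the date filter to the dashboard query:

```sql
CREATE MATERIALIZED VIEW mv_active_users AS
SELECT country, event_date, APPROX_COUNT_DISTINCT(user_id) AS daily_active_users
FROM USER_EVENTS
GROUP BY country, event_date;
```

Now, dashboard queries:

```sql
SELECT country, daily_active_users
FROM mv_active_users
WHERE event_date = CURRENT_DATE - 1;
```

load quickly, because they read a handful of pre-aggregated rows.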

---

✅ **Bottom line:**
Materialized Views in Snowflake = *“Set it once, Snowflake keeps it fresh, you get results instantly.”*

---


### 5.5 Using Streams & Tasks

* For incremental data loads and transformations, using Snowflake Streams and Tasks can optimize performance by processing only changed data.

Let’s break down **Streams & Tasks in Snowflake** in a way that clicks instantly.

---

### 1️⃣ The Problem They Solve

Normally, when new data arrives in a table, you have two options:

* **Reprocess everything** (wasteful)
* Or somehow process **only the new/changed rows**

That’s where **Streams** + **Tasks** come in.

---

### 2️⃣ What is a Snowflake Stream?

A **Stream** is like a **“change log”** or **“to-do list”** for a table.

* It tracks **only the rows that have been inserted, updated, or deleted** since you last checked it.
* It doesn’t store full copies of data — just metadata about the changes.

Think: *“Hey, since your last run, these are the rows that changed — process just these!”*

---

### Example: Incremental ETL with Streams

Let’s say you have a raw landing table:

```sql
CREATE OR REPLACE TABLE raw_orders (
    order_id INT,
    product_id INT,
    quantity INT,
    order_date DATE
);
```

You create a Stream on it:

```sql
CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders;
```

Now:

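* If **10,000 new rows** land today, the stream will list only those 10,000 rows (not the 50 million in the table).
* A plain `SELECT` does not consume the stream. Its offset advances only when a DML statement (such as `INSERT ... SELECT` or `MERGE`) reads from it and the transaction commits; after that, those rows no longer appear.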
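
A quick way to see what a stream currently holds, using the `raw_orders_stream` defined above. Both statements are read-only, so they do not advance the stream.

```sql
-- TRUE if the stream has unconsumed changes.
SELECT SYSTEM$STREAM_HAS_DATA('raw_orders_stream');

-- The changed rows plus the stream's metadata columns:
-- METADATA$ACTION is 'INSERT' or 'DELETE'; an update appears as a
-- DELETE/INSERT pair with METADATA$ISUPDATE = TRUE.
SELECT order_id, product_id, quantity, order_date,
       METADATA$ACTION, METADATA$ISUPDATE
FROM raw_orders_stream;
```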

---

### 3️⃣ What is a Snowflake Task?

A **Task** is like a **scheduled job** or **cron job** inside Snowflake.

* You define a SQL statement to run at a set frequency or trigger.
* It can run every minute, every hour, daily, or even in a dependency chain.

---

### Example: Automating the Transformation

You want to process new orders every hour:

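```sql
CREATE OR REPLACE TASK process_new_orders
WAREHOUSE = my_etl_wh
SCHEDULE = 'USING CRON 0 * * * * UTC'  -- every hour, on the hour
WHEN SYSTEM$STREAM_HAS_DATA('raw_orders_stream')  -- skip runs with nothing to do
AS
INSERT INTO processed_orders (order_id, product_id, quantity, order_date)
SELECT order_id, product_id, quantity, order_date  -- leave out the stream's METADATA$ columns
FROM raw_orders_stream
WHERE METADATA$ACTION = 'INSERT';
```

Now, Snowflake automatically:

1. Checks the stream for new rows every hour and skips the run when there are none.
2. Inserts just those rows into `processed_orders`, which advances the stream so they are not processed twice.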
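
A task is created in a suspended state, so it has to be resumed before the schedule takes effect. The sketch below also checks recent runs; it assumes the task name from the example above.

```sql
-- Start the schedule.
ALTER TASK process_new_orders RESUME;

-- Review the most recent runs and any errors.
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'PROCESS_NEW_ORDERS'))
ORDER BY scheduled_time DESC
LIMIT 10;
```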

---

### 4️⃣ Why Streams + Tasks Boost Performance

* **No full-table scans** → Only changed rows are processed.
* **Automation** → No need for external schedulers or manual runs.
* **Faster ETL pipelines** → Less data = less compute cost.
* **Reliable** → Snowflake handles tracking and scheduling.

---

### 5️⃣ Real-World Scenario

**Data warehouse with daily sales feeds**:

* Sales table gets millions of rows daily.
* Instead of aggregating all sales each night:

  * **Stream** captures new rows.
  * **Task** runs hourly to update the aggregated “sales by store” table with just the new data.
* Your dashboards stay up to date **without reprocessing history**.
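
The statement such a task might run can be sketched as a `MERGE` from the stream into the aggregate table. All names here (`sales_stream`, `sales_by_store`, `store_id`, `sales_amount`, `total_sales`) are hypothetical, and the sketch assumes the sales table is append-only, so only inserted rows need handling.

```sql
-- Fold only the newly arrived sales into the running totals.
MERGE INTO sales_by_store t
USING (
    SELECT store_id, SUM(sales_amount) AS new_sales
    FROM sales_stream
    WHERE METADATA$ACTION = 'INSERT'
    GROUP BY store_id
) s
ON t.store_id = s.store_id
WHEN MATCHED THEN
    UPDATE SET t.total_sales = t.total_sales + s.new_sales
WHEN NOT MATCHED THEN
    INSERT (store_id, total_sales) VALUES (s.store_id, s.new_sales);
```

If rows in the source table can be updated or deleted, the delta logic needs to subtract the `DELETE` side of each change as well.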

---

✅ **Bottom line**:
**Streams** = change tracking
**Tasks** = scheduling
Together → *“Process just the new stuff, automatically, inside Snowflake.”*

---


## 6. Real Case Scenario — Putting It All Together

---

### Scenario: MegaShop Reporting

MegaShop has a massive sales dataset with 5 billion rows in `sales_data`. Analysts often query total sales by `region` and `order_date`.

**Problem:** Queries are slow (minutes), and even very large warehouses do not make them fast enough.

---

### Step 1: Analyze Query Profile

* Find full scans of `sales_data`.
* Filters on `region` and `order_date` are not pruning many micro-partitions.

---

### Step 2: Add Clustering Key

* Define clustering on `(region, order_date)`.
* Snowflake reclusters data in background, organizing micro-partitions to group same region and date.
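
As a sketch, Step 2 comes down to one statement, followed by a check of the clustering depth. The depth value is only a rough indicator and is best compared before and after reclustering.

```sql
-- Define the clustering key described in Step 2.
ALTER TABLE sales_data CLUSTER BY (region, order_date);

-- Average clustering depth for those columns; smaller is better.
SELECT SYSTEM$CLUSTERING_DEPTH('sales_data', '(region, order_date)');
```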

---

### Step 3: Rerun Queries

* Query Profile shows micro-partition pruning increased significantly.
* Data scanned dropped from 5B rows to 50M rows.
* Query runtime dropped from 10 minutes to 1 minute.

---

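### Step 4: Check the Join with the Small Table

* Join with small `region_metadata` table.
* Confirm in the Query Profile that the optimizer broadcasts `region_metadata` rather than repartitioning `sales_data`; Snowflake has no broadcast hint to force this.
* If the large table is still being shuffled, filter or pre-aggregate before the join so less data moves.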

---

### Step 5: Warehouse Resize and Multi-cluster

* During peak hours, analysts run many reports.
* Enable multi-cluster warehouse with auto-scale.
* Queries don’t queue anymore.

---

## 7. Must-Know Questions (for deep understanding)

* Why doesn’t Snowflake use traditional indexes, and how does it optimize query performance instead?
* What are micro-partitions and how do they affect query speed?
* How does clustering key improve pruning and what are its costs?
* When should you consider resizing a virtual warehouse versus rewriting queries?
* How does Snowflake handle join operations internally, and how can you influence the join strategy?
* What is the role of query profiling in optimizing Snowflake queries?
* How do result caching and data caching work in Snowflake?
* What strategies help avoid data skew in joins?
* How do materialized views help in Snowflake performance?

---

## Summary — Key Takeaways

| Traditional DB Tuning | Snowflake Approach                                |
| --------------------- | ------------------------------------------------- |
| Indexes               | No indexes; relies on micro-partition pruning     |
| Primary Keys          | Metadata only, no enforcement                     |
| Manual Partitioning   | Automatic micro-partitions, manual clustering     |
| Query Hints           | No optimizer hints; optimizer is automatic        |
| Join Order            | Optimizer chooses; influence via query structure  |
| Caching               | Result, local disk, metadata cache auto-managed   |
| Warehouse Size        | Adjustable compute size and multi-cluster scaling |

---

## Final Thought

Snowflake performance tuning is about **understanding how Snowflake stores data internally (micro-partitions)**, leveraging **clustering for large datasets**, using **compute resources wisely**, and **reading query profiles to pinpoint inefficiencies**.

---



---

# Must-Know Questions and Answers on Snowflake Performance Optimization

---

### 1. Why doesn’t Snowflake use traditional indexes, and how does it optimize query performance instead?

**Answer:**
Unlike traditional databases that rely heavily on user-created indexes (like B-trees) to speed up data lookups, Snowflake **does not use indexes** in the classic sense.

**Why?** Because Snowflake uses a **columnar, cloud-optimized architecture with automatic micro-partitioning.**

* **Micro-partitions:** When data is loaded, Snowflake automatically divides it into small contiguous units (micro-partitions, each holding roughly 50 MB to 500 MB of uncompressed data, often cited as around 16 MB once compressed).
* Each micro-partition stores metadata such as **min/max values for each column** and other statistics.

**How does this help?**
When you run a query with filters, Snowflake uses this metadata to **prune irrelevant micro-partitions** (skip scanning those that cannot contain matching data). This drastically reduces the amount of data scanned, leading to faster queries.

**Example:**
If your query filters on `order_date = '2025-01-01'`, Snowflake only scans micro-partitions whose `order_date` range includes this date, skipping all others automatically — no need for indexes.

---

### 2. What are micro-partitions and how do they affect query speed?

**Answer:**
Micro-partitions are the foundational storage unit in Snowflake:

* Automatic data files (50 MB to 500 MB of uncompressed data each, often cited as \~16MB compressed) that Snowflake creates internally.
* Stored column-wise.
* Contain metadata for each column like min/max values, number of distinct values, null counts, etc.

**Impact on speed:**
During query execution, Snowflake leverages micro-partition metadata for **pruning**, i.e., skipping scanning micro-partitions that cannot satisfy the query filter predicates.

This avoids full table scans and reduces IO and CPU usage.

**Example:**
A 1TB table might be broken into 60,000 micro-partitions. For a query filtering on a date range of 1 day, only a small fraction of these partitions need scanning.

---

### 3. How does clustering key improve pruning and what are its costs?

**Answer:**
If data is inserted in a random order, micro-partitions might not be organized well, reducing pruning effectiveness.

**Clustering Key:**
A user-defined column or set of columns to **organize data physically** in micro-partitions (similar to sorting). Snowflake reclusters data in the background accordingly.

**Benefits:**

* Better micro-partition pruning on clustered columns.
* Faster queries filtering on clustering keys.

**Costs:**

* Automatic Clustering consumes serverless compute credits, and rewriting micro-partitions adds storage churn.
* Reclustering runs in the background and can take time on very large tables.
* Clustering on very high-cardinality columns, or on tables that change constantly, can cost more than it saves.

**Example:**
If MegaShop clusters the `orders` table by `(region, order_date)`, queries filtering on these columns scan fewer micro-partitions.

---

### 4. When should you consider resizing a virtual warehouse versus rewriting queries?

**Answer:**

* **Resize warehouse (scale up):**
  When queries are CPU or memory-bound, and rewriting queries doesn’t reduce the data scanned, increasing warehouse size (more nodes) gives more parallelism and faster execution.

* **Rewrite queries:**
  When queries scan too much data unnecessarily or do inefficient joins. Optimize filters, projections (select only needed columns), join strategies.

**Rule of thumb:**

* First, try to optimize query logic and data design.
* If still slow due to volume, scale warehouse size.
* For many concurrent users, consider multi-cluster warehouses instead of a single large one.
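
To decide between these options with data, recent queries can be ranked by elapsed time along with the signals that point to each fix. This sketch reads the `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` view, which requires appropriate privileges and can lag behind real time; the one-day window and the limit are arbitrary.

```sql
-- Poor pruning (partitions_scanned close to partitions_total) suggests a rewrite
-- or clustering; spilled bytes suggest a larger warehouse; queued time suggests
-- scaling out with a multi-cluster warehouse.
SELECT query_id,
       warehouse_size,
       total_elapsed_time / 1000 AS elapsed_s,
       partitions_scanned,
       partitions_total,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       queued_overload_time / 1000 AS queued_s
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20;
```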

---

### 5. How does Snowflake handle join operations internally, and how can you influence the join strategy?

**Answer:**

Snowflake supports several join strategies:

* **Broadcast join:** Small table is copied to all compute nodes to avoid shuffling large tables.
* **Shuffle join:** Large tables are redistributed by join key among nodes to perform join.

**Snowflake's optimizer automatically picks the join strategy** based on estimated table sizes.

Snowflake has no broadcast join hint (the `/*+ BROADCAST(...) */` syntax is Spark SQL and is ignored as a comment). **To influence the join instead:**

* Filter or pre-aggregate the small side before the join so it stays small enough to broadcast.
* Check the Query Profile for expensive repartitioning, and rewrite the query if the large table is being shuffled unnecessarily.

**Example:**
`SELECT ... FROM large_table JOIN (SELECT ... FROM small_table WHERE ...) s ON ...`

---

### 6. What is the role of query profiling in optimizing Snowflake queries?

**Answer:**

**Query Profile** is an interactive visual tool in Snowflake UI showing:

* Detailed step-by-step execution plan.
* Time spent on scanning, filtering, joins, shuffles, and data spill.
* Data scanned vs. returned rows.
* Helps identify bottlenecks like full scans, large data shuffles, or skewed processing.

Using it, you can:

* Detect inefficient operations.
* Pinpoint if clustering, filtering, or warehouse size needs adjustment.
* Understand which part of query takes most time.

---

### 7. How do result caching and data caching work in Snowflake?

**Answer:**

* **Result Cache:**
  If the exact same query was run within the last 24 hours and the underlying data is unchanged, Snowflake returns the cached result immediately, without using warehouse compute.

* **Local Disk Cache:**
  Each compute node caches data files it reads, so repeated scans of same data by that node are faster.

* **Metadata Cache:**
  Micro-partition metadata cached for pruning without reloading.

**This layered caching significantly speeds repeated or similar queries.**

---

### 8. What strategies help avoid data skew in joins?

**Answer:**

**Data skew** happens when one or few join key values have disproportionately large data, causing uneven work distribution among compute nodes.

**Avoid skew by:**

* Pre-aggregating or filtering large keys.
* Choosing join keys with uniform data distribution.
* Splitting skewed keys into buckets so their rows spread across nodes.
* Using clustering keys to reduce how much data is read before the join (this does not remove skew by itself).

---

### 9. How do materialized views help in Snowflake performance?

**Answer:**

Materialized Views:

* Store precomputed query results physically.
* Automatically updated incrementally when base data changes.
* Speed up expensive repeated queries by avoiding re-computation.

**Use cases:**

* Aggregations or filters on a single large table that are queried often (Snowflake materialized views cannot contain joins).
* Heavy transformation queries.

---

# Summary Table of Questions and Short Answers

| Question                           | Short Summary                                                   |
| ---------------------------------- | --------------------------------------------------------------- |
| Why no traditional indexes?        | Uses micro-partitions & pruning instead.                        |
| What are micro-partitions?         | Small data files with metadata for pruning.                     |
| Clustering key?                    | Organizes data for better pruning, costs extra compute.         |
| Resize warehouse or rewrite query? | Rewrite first; resize if compute-bound.                         |
| Join handling?                     | Automatic broadcast/shuffle; no hints, shape the query instead. |
| Query profiling role?              | Visualize query steps & bottlenecks.                            |
| Caching?                           | Result, local disk, metadata caches speed repeated queries.     |
| Avoid skew?                        | Uniform key distribution, key splitting, pre-aggregation.       |
| Materialized views?                | Precomputed, incrementally maintained results for fast queries. |

---
