

---

# 1) Why sampling? (the real-world story)

Imagine you lead the data platform for an ecommerce company, **ShopFast**. The events table `events` stores clickstream and purchase events — it’s 3 TB and grows fast. Your product manager wants an exploratory dashboard showing daily conversion rate trends for the last 90 days while the ML team needs a quick training sample for experimentation.

Problems:

* Running full scans on the 3 TB table for quick EDA or dashboard prototypes is slow and expensive.
* You need **fast, "good enough" answers** for visualization and experimentation — not always 100% exact.

**Solution idea**: take a carefully chosen subset (sample) of the large table and run queries on that subset. Sampling gives:

* Faster query response time (smaller data scanned).
* Lower warehouse credits used.
* A way to bootstrap models and visualizations quickly.

But sampling trades exactness for speed — so we must understand *how* Snowflake samples and which sampling algorithm fits which use-case. ([Snowflake Docs][1])

---

# 2) Approximate Query Processing (AQP) — concept + Snowflake affordances

**AQP** is the family of techniques that return *approximate* answers far faster and cheaper than exact computation. Two common patterns in Snowflake:

1. Use **sampling** (TABLESAMPLE / SAMPLE) to run queries on a subset and extrapolate results.
2. Use **built-in approximate functions** (e.g., `APPROX_COUNT_DISTINCT`, `APPROX_PERCENTILE`) which implement probabilistic algorithms (HyperLogLog, t-Digest, etc.) and are optimized for speed/low memory.

When to prefer built-in approximate functions:

* You need an approximate aggregate (distinct counts, percentiles) — use `APPROX_*` functions (they’re robust and often preferable to manual sampling). ([Snowflake Docs][2])

When to prefer sampling:

* You need to run arbitrary complex queries (GROUP BY, ML feature extraction, quick visualizations) and are willing to accept small error bounds in exchange for speed.
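
The two patterns can be sketched side by side. This is a minimal illustration, assuming the `events` table has `event_type`, `user_id`, and `amount` columns (as in the demo schema later in this guide) and that `'purchase'` is one of the `event_type` values, which is a hypothetical label.

```sql
-- Pattern 1: sample, then extrapolate.
-- In a 1% row sample each sampled row stands in for about 100 rows,
-- so multiply the sampled count by 100 to estimate the full count.
SELECT COUNT(*) * 100 AS est_purchase_events
FROM events SAMPLE BERNOULLI (1)
WHERE event_type = 'purchase';  -- 'purchase' is an assumed value

-- Pattern 2: approximate aggregates over the full table, no sampling.
SELECT
    APPROX_COUNT_DISTINCT(user_id) AS est_distinct_users,
    APPROX_PERCENTILE(amount, 0.5) AS est_median_amount
FROM events;
```

Only totals and counts need scaling back up; ratios such as averages or shares can be read from the sample directly.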

---

# 3) Snowflake sampling in practice — syntax & demo SQL

Snowflake supports `SAMPLE` and `TABLESAMPLE` (synonymous). There are two sampling methods: `BERNOULLI | ROW` (row-based) and `SYSTEM | BLOCK` (block-based). You can also request a fixed-size sample (`N ROWS`) or supply a `SEED` to make a fraction-based sample deterministic; a seed cannot be combined with fixed-size sampling. ([Snowflake Docs][1])

### Basic examples

Fraction-based Bernoulli (default):

```sql
-- ~10% of rows (row-based / Bernoulli)
SELECT * FROM events SAMPLE (10);
-- equivalent
SELECT * FROM events TABLESAMPLE BERNOULLI (10);
```

Fraction-based block/system with seed:

```sql
-- ~3% of blocks, repeatable sample with seed 82
SELECT * FROM events SAMPLE SYSTEM (3) SEED (82);
-- same form with the synonym keywords (BLOCK, REPEATABLE): ~0.012% of blocks, seed 99992
SELECT * FROM events SAMPLE BLOCK (0.012) REPEATABLE (99992);
```

Fixed-size:

```sql
-- exactly 100 rows (fewer only if the table has fewer)
SELECT * FROM events SAMPLE (100 ROWS);
```

Repeatable/deterministic notes:

* `SEED(...)` / `REPEATABLE(...)` makes a fraction-based sample deterministic **for the same unchanged table**. Seeds are not supported with fixed-size sampling, on views or subqueries, or when sampling the result of a join. ([Snowflake Docs][1])
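
A quick way to confirm that a seeded sample is repeatable is to fingerprint it. The sketch below uses `HASH_AGG(*)`, which returns an order-independent hash over all sampled rows; it assumes the `events` table has not changed between runs.

```sql
-- Run this statement twice. On an unchanged table both runs should
-- return the same row count and the same fingerprint.
SELECT COUNT(*)    AS sampled_rows,
       HASH_AGG(*) AS sample_fingerprint
FROM events SAMPLE SYSTEM (3) SEED (82);
```

If the fingerprint changes between runs, the table was modified in the meantime, or you are sampling a copy of it.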

---

# 4) How the two sampling methods *work* — intuition + story

## A) BERNOULLI / ROW (row-based sampling)

**Mechanics (simple)**: imagine flipping a weighted coin for **every row**. Each row is independently included with probability `p/100`. So expected sample size ≈ `p/100 * n`. Because it's per-row randomness, this method yields an unbiased random subset of rows (statistically closest to true random sampling).

**When it matters**:

* Good for statistical sampling where each row should have an independent inclusion chance.
* Useful when data distribution within blocks matters (BERNOULLI won't bias toward specific micro-partitions).

**Drawbacks**:

* More work per query: Snowflake has to read every micro-partition and decide inclusion row by row, so `BERNOULLI` is usually slower than `SYSTEM`, which can skip unselected micro-partitions entirely. On very large tables the overhead is often acceptable, but it is still higher than with `SYSTEM`. ([Snowflake Docs][1])

**Story**: You want a truly random sample of `events` so that your conversion-rate estimator is unbiased. Use `SAMPLE (5)` (BERNOULLI default) — each event has independent 5% chance to be picked.
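
The coin-flip picture can be written out by hand. The sketch below is only a conceptual equivalent of `SAMPLE BERNOULLI (5)`, not a description of Snowflake's internal implementation; it uses the `UNIFORM` and `RANDOM` functions to draw one number per row.

```sql
-- Conceptual equivalent of SAMPLE BERNOULLI (5): keep a row when its own
-- uniform draw between 0 and 1 falls below 0.05. Illustration only.
SELECT *
FROM events
WHERE UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < 0.05;

-- The sample size is itself random; compare it with the expected n * p.
SELECT
    (SELECT COUNT(*) FROM events) * 0.05     AS expected_sample_rows,
    (SELECT COUNT(*) FROM events SAMPLE (5)) AS actual_sample_rows;
```

The two numbers in the second query should be close but rarely identical, because every row is decided independently.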

## B) SYSTEM / BLOCK (block-based sampling)

**Mechanics (simple)**: Snowflake flips the coin per **block / micro-partition** (think: choose whole micro-partitions with probability `p/100`). Micro-partitions contain contiguous rows and column statistics.

**When it matters**:

* SYSTEM is **often much faster** because it works at micro-partition granularity and can avoid decoding lots of rows; great when you need speed and are okay with slight block-level bias. Snowflake documentation specifically notes SYSTEM/BLOCK is often faster than BERNOULLI. ([Snowflake Docs][1])

**Drawbacks**:

* Potential bias: if your data is ordered (e.g., time-sorted) or micro-partitions cluster similar values, SYSTEM sampling can under- or over-represent certain values (biased sample), especially for **small tables** or when sampling percentages are tiny.
* For tiny tables, block-level granularity makes the sample less representative.

**Story**: You’re building a dashboard where speed trumps tiny bias. For a humongous `events` table, do `SAMPLE SYSTEM (2)` to get an approximate view fast — you’ll get quick results, but verify if micro-partition layout could bias results (e.g., if all failed payments are in a small subset of micro-partitions).
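
One hedged way to spot this kind of block-level bias is to look at how a `SYSTEM` sample spreads over time. The sketch assumes `events` has an `event_ts` timestamp column, as in the demo schema later in this guide.

```sql
-- Rows per day in a 2% block sample. Missing days or very uneven counts
-- suggest that whole time ranges live in micro-partitions that were
-- (or were not) selected.
SELECT DATE_TRUNC('day', event_ts) AS event_day,
       COUNT(*)                    AS sampled_rows
FROM events SAMPLE SYSTEM (2)
GROUP BY event_day
ORDER BY event_day;
```

Running the same query with `SAMPLE BERNOULLI (2)` gives a baseline for what an even spread looks like on your data.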

---

# 5) When to use which? Practical guidance

* **Exploratory analysis / dashboards** where speed is critical and slight bias is acceptable → use `SYSTEM` (block) sampling. Optionally add `SEED` for reproducibility of dashboard preview data.
* **Statistical experiments, sampling for model training, or when unbiasedness is required** → use `BERNOULLI`/`ROW` sampling.
* **If you need a fixed number of rows** (exact N) → use `SAMPLE (N ROWS)`, but note `SYSTEM` + fixed-size is **not supported**. Fixed-size sampling may prevent some optimizations and can be slower. ([Snowflake Docs][1])
* **If consistent sample between runs is needed for reproducible experiments** → add `SEED(...)` to a fraction-based sample. (Note: seeds are not supported with fixed-size sampling, views, or subqueries, and sampling a copy of the table may return different rows even with the same seed.) ([Snowflake Docs][1])

---

# 6) Advantages & disadvantages (quick summary)

* BERNOULLI / ROW

  * Advantage: unbiased per-row sampling; the only method allowed when sampling the result of a join (and then only without a seed).
  * Disadvantage: slower than SYSTEM; more expensive for huge tables.

* SYSTEM / BLOCK

  * Advantage: faster, often cheaper because it selects entire micro-partitions.
  * Disadvantage: possible sampling bias due to micro-partitioning; cannot be combined with fixed-size (`N ROWS`) sampling.

* General trade-offs:

  * Sampling reduces scanned data (cost) but introduces sampling variance. Choose method by error tolerance + performance need. ([Snowflake Docs][1])

---

# 7) Important semantics & gotchas (from the docs + experience)

* **Sampling and JOINs**: a `SAMPLE` clause applies only to the table it directly follows, not to the whole `FROM` expression. Sampling a table operand therefore shrinks that input before the join and does reduce join work. To sample the *result* of a join, wrap the join in a subquery and sample that; this is allowed only when the sampling is row-based (BERNOULLI) and **no seed** is used. In that case the join is fully processed first, so sampling does not reduce the cost of the join. Always test. ([Snowflake Docs][1])
* **Determinism**: If you specify the same `SEED` and the table hasn't changed, `SYSTEM` samples are repeatable. But a copy/clone of the table might produce different sample even with same seed because micro-partitions/state may differ. ([Snowflake Docs][1])
* **Fixed-size sampling**: returns exact requested rows (if table larger), but **SYSTEM** and `SEED` aren’t supported with fixed-size sampling. Fixed-size can prevent optimizations and be slower. ([Snowflake Docs][1])

---

# 8) Demo: full practical walkthrough (examples you can run)

Assume a table `prod.events` with columns `(event_ts, user_id, event_type, amount)`.

### 1) Quick dashboard preview (fast)

```sql
-- Fast approximate preview, ~1% of micro-partitions
SELECT event_type, COUNT(*) as cnt
FROM prod.events
SAMPLE SYSTEM (1)
GROUP BY event_type
ORDER BY cnt DESC;
```

Use this to get a rough distribution within seconds. Use `SEED(42)` if you want the preview to be repeatable across refreshes. ([Snowflake Docs][1])

### 2) Bias-aware check: compare SYSTEM vs BERNOULLI

```sql
-- BERNOULLI (row-based)
SELECT event_type, COUNT(*) as cnt_bernoulli
FROM prod.events
SAMPLE (1)  -- default is ROW / BERNOULLI
GROUP BY event_type;

-- SYSTEM (block-based)
SELECT event_type, COUNT(*) as cnt_system
FROM prod.events
SAMPLE SYSTEM (1)
GROUP BY event_type;
```

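Compare `cnt_bernoulli` vs `cnt_system`, ideally as each event type's share of its sample total (raw counts fluctuate with the sample size), to see if SYSTEM introduces visible bias for your partitioning. If the shares diverge significantly, prefer BERNOULLI.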
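
One way to make the comparison less sensitive to sample size is to put the shares next to each other in a single result. A sketch, reusing the `prod.events` columns assumed above; the CTE names `bern` and `blk` are arbitrary.

```sql
WITH bern AS (
    SELECT event_type,
           COUNT(*) / SUM(COUNT(*)) OVER () AS share
    FROM prod.events SAMPLE BERNOULLI (1)
    GROUP BY event_type
),
blk AS (
    SELECT event_type,
           COUNT(*) / SUM(COUNT(*)) OVER () AS share
    FROM prod.events SAMPLE SYSTEM (1)
    GROUP BY event_type
)
SELECT COALESCE(bern.event_type, blk.event_type)              AS event_type,
       bern.share                                             AS share_bernoulli,
       blk.share                                              AS share_system,
       ABS(COALESCE(bern.share, 0) - COALESCE(blk.share, 0))  AS abs_diff
FROM bern
FULL OUTER JOIN blk ON bern.event_type = blk.event_type
ORDER BY abs_diff DESC;  -- largest disagreements first
```

Event types that appear in only one sample show up with a `NULL` share on the other side, which is itself a hint that the category is rare or unevenly distributed.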

### 3) Reproducible development sample for testing

```sql
-- Reproducible sample using SYSTEM + SEED
CREATE OR REPLACE TABLE dev.events_sample AS
SELECT *
FROM prod.events
SAMPLE SYSTEM (2) SEED (12345);
```

This quickly creates a dev table holding roughly 2% of the source micro-partitions, which you can share with developers.

### 4) Fixed-size sample for exact N rows

```sql
SELECT *
FROM prod.events
SAMPLE (1000 ROWS);
```

Useful when you need a small dataset of exact size for UI demos or tests.

---

# 9) Sampling vs CLONE — what's the difference and when to use which?

**Zero-copy CLONE** creates a metadata-only copy of a table/schema/database — the clone points to the same underlying micro-partitions until you change data (copy-on-write). A clone is effectively an instant, full copy (no immediate storage cost until changes occur). `CREATE ... CLONE` is ideal when you need the *entire* dataset (exact), e.g., for full-scale integration testing or backups. ([Snowflake Docs][3])

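**Sampling** selects a subset of rows at query time. Materializing the result with `CREATE TABLE ... AS SELECT` gives you a physically smaller table that is quicker to scan, but unlike a clone it holds only part of the data and occupies its own storage.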

### When sampling is *more efficient* than clone:

* You only need a **subset** for rapid prototyping, EDA, or ML experimentation: queries against a sample scan far less data, and a materialized sample is a small physical table rather than a full clone that references the whole dataset.
* Cloning is great for an *exact* copy that preserves all rows and metadata, but it doesn't reduce the dataset size. Queries against a clone scan the same micro-partitions as the source, so a clone is no cheaper to query when you really need a smaller working set.

### Example decision:

* Need **exact** production snapshot for debugging a data issue → use `CREATE TABLE foo_clone CLONE foo;`.
* Need **small dataset** for rapid model training or dashboard prototype → `CREATE TABLE foo_sample AS SELECT * FROM foo SAMPLE (1);` — cheaper to scan and process.

**Key takeaway:** clone: instant full logical copy (useful when you need full fidelity). Sampling: create a reduced, faster-to-scan dataset (useful for speed/cost). ([Snowflake Docs][3])

---

# 10) Statistical correctness & validation: how to be safe

* **Always validate** samples by comparing a few aggregate metrics (mean, median, counts per category) against full-table results for a few checkpoints.
* For estimates derived from samples, compute confidence intervals if possible (e.g., standard error for proportions) — helps convey expected error to stakeholders.
* For cardinality/percentile needs, prefer Snowflake’s `APPROX_*` functions — they use formal algorithms (HLL, t-Digest) and will often give better accuracy/perf than naive sampling for those aggregates. ([Snowflake Docs][2])
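
The confidence-interval advice can be sketched directly in SQL. This is a minimal example under stated assumptions: it estimates the share of events that are purchases, `'purchase'` is a hypothetical `event_type` value, and it uses the normal approximation for a proportion (standard error = sqrt(p(1 - p)/n)). That approximation assumes independently sampled rows, so the sketch uses `BERNOULLI`; with `SYSTEM` the interval is likely to be too narrow.

```sql
WITH s AS (
    SELECT NULLIF(COUNT(*), 0)               AS n,  -- NULL for an empty sample
           COUNT_IF(event_type = 'purchase') AS k   -- assumed event_type value
    FROM prod.events SAMPLE BERNOULLI (1)
),
p AS (
    SELECT n, k, k::FLOAT / n AS p_hat
    FROM s
)
SELECT n,
       k,
       p_hat,
       SQRT(p_hat * (1 - p_hat) / n)                AS std_err,
       p_hat - 1.96 * SQRT(p_hat * (1 - p_hat) / n) AS ci95_low,
       p_hat + 1.96 * SQRT(p_hat * (1 - p_hat) / n) AS ci95_high
FROM p;
```

Reporting `ci95_low` and `ci95_high` alongside the point estimate gives stakeholders a concrete sense of how much the sampled figure could move.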

---

# 11)  Questions (must-know / quick checks)

Use these as quick self-assessment or flashcards:

1. Explain the difference between `SAMPLE (10)` and `SAMPLE SYSTEM (10)`. Which is default? What are trade-offs?
2. How does `SEED`/`REPEATABLE` work in Snowflake sampling? For which method(s) is it supported?
3. If I sample before or after a JOIN, how does it affect cost and accuracy? What are the constraints?
4. When would you use `SAMPLE (N ROWS)` vs `SAMPLE (p)`? What optimizations might be affected?
5. How does Snowflake’s zero-copy `CLONE` work under the hood and when is cloning better than sampling for development environments?
6. Name two Snowflake approximate functions and the algorithms they use (e.g., HyperLogLog, t-Digest).
7. Describe a validation procedure to check whether sampling bias affects your metric (list concrete SQL queries or checks).
8. What are micro-partitions and why do they matter for `SYSTEM` sampling (explain potential bias)?
9. How would you make a sampling strategy reproducible across multiple developers?
10. If you get a wildly different result between `SAMPLE SYSTEM (1)` and `SAMPLE (1)`, what are the possible causes and how would you investigate?


---

# 12) Practical checklist / best practices (actionable)

* Start with `SYSTEM` for quick dashboards; validate vs `BERNOULLI` occasionally.
* Use `SEED(...)` on `SYSTEM` when you need reproducible dev datasets.
* Use `APPROX_COUNT_DISTINCT` / `APPROX_PERCENTILE` for cardinality/percentile problems instead of sampling + exact function when possible. ([Snowflake Docs][2])
* If sampling for ML, ensure class balance (stratified sampling) if necessary — sampling uniformly may under-sample rare classes.
* Log sample parameters (method, percent, seed) alongside results — makes analytics reproducible and auditable.
* For joins: to reduce work, sample the table operands (or a subquery over one of them) before joining. Sampling the join result only trims the output, because the join is fully processed first.
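
Logging sample parameters can be as simple as a small bookkeeping table. The sketch below is one possible layout; the `dev.sample_log` table and its column names are hypothetical, and the inserted values match the reproducible dev sample created earlier.

```sql
CREATE TABLE IF NOT EXISTS dev.sample_log (
    sample_table     STRING,
    source_table     STRING,
    sampling_method  STRING,   -- 'SYSTEM' or 'BERNOULLI'
    sample_pct       FLOAT,
    seed             INTEGER,  -- NULL when no seed was used
    created_at       TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Record how dev.events_sample was produced.
INSERT INTO dev.sample_log
    (sample_table, source_table, sampling_method, sample_pct, seed)
VALUES
    ('dev.events_sample', 'prod.events', 'SYSTEM', 2, 12345);
```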

---

# 13) Extra: stratified sampling & deterministic reproducibility (practical trick)

Snowflake's sample is uniform. For **stratified sampling** (e.g., preserve proportions of `country`), do:

```sql
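CREATE OR REPLACE TABLE events_stratified AS
SELECT * EXCLUDE (rn, cnt)  -- drop the helper columns from the output
FROM (
  SELECT *,
         -- hash-based ordering within each country: pseudo-random but repeatable
         ROW_NUMBER() OVER (PARTITION BY country ORDER BY HASH(user_id)) AS rn,
         COUNT(*) OVER (PARTITION BY country) AS cnt
  FROM prod.events
) t
WHERE rn <= GREATEST(1, ROUND(cnt * 0.01)); -- ~1% per country, at least 1 row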
```

This preserves representation per stratum. Ordering by `HASH()` (or by `RANDOM()` with a fixed seed) rather than an unseeded `RANDOM()` keeps the selection repeatable.
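
To confirm that every stratum landed near the intended rate, the per-country counts can be compared between the source and the stratified table. A sketch, assuming the `events_stratified` table from the statement above:

```sql
SELECT f.country,
       f.full_rows,
       COALESCE(s.sample_rows, 0)               AS sample_rows,
       COALESCE(s.sample_rows, 0) / f.full_rows AS sampling_rate
FROM (
    SELECT country, COUNT(*) AS full_rows
    FROM prod.events
    GROUP BY country
) f
LEFT JOIN (
    SELECT country, COUNT(*) AS sample_rows
    FROM events_stratified
    GROUP BY country
) s
  ON f.country IS NOT DISTINCT FROM s.country  -- also matches NULL countries
ORDER BY sampling_rate;
```

Countries with very few events will show a rate well above 1% because of the one-row minimum; the rest should sit close to 0.01.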

---



[1]: https://docs.snowflake.com/en/sql-reference/constructs/sample "SAMPLE / TABLESAMPLE | Snowflake Documentation"
[2]: https://docs.snowflake.com/en/sql-reference/functions/approx_count_distinct "APPROX_COUNT_DISTINCT"
[3]: https://docs.snowflake.com/en/sql-reference/sql/create-clone "CREATE <object> … CLONE"
[4]: https://docs.snowflake.com/en/sql-reference/functions/approx_percentile "APPROX_PERCENTILE"
[5]: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.Table.sample "snowflake.snowpark.Table.sample"




---

## **1️⃣ What’s the difference between `SAMPLE (10)` and `SAMPLE SYSTEM (10)`? Which is default? What are trade-offs?**

### ✅ **Concept**

* `SAMPLE (10)` → uses **BERNOULLI (row-based)** sampling.
* `SAMPLE SYSTEM (10)` → uses **SYSTEM (block-based)** sampling.

**Default** → `BERNOULLI`.

### 🧠 **Mechanism**

* **BERNOULLI** → decides *per row* whether to include it (independent random coin toss for each row).
* **SYSTEM** → decides *per micro-partition (block)* whether to include it (if a block is chosen, all its rows come together).

### ⚖️ **Trade-offs**

| Method        | Pros                                         | Cons                                                    |
| ------------- | -------------------------------------------- | ------------------------------------------------------- |
| **BERNOULLI** | True random, unbiased                        | Slower (checks every row), more compute                 |
| **SYSTEM**    | Much faster (works at micro-partition level) | May introduce bias if data is clustered (e.g., by date) |

### 💬 **How to answer**

> “By default, Snowflake uses Bernoulli sampling, which randomly selects individual rows. For large tables, I often switch to `SYSTEM` because it’s faster — it samples entire micro-partitions. The trade-off is that `SYSTEM` can be biased if the table is clustered, so I validate it using summary checks.”

---

## **2️⃣ How does `SEED` work in Snowflake sampling? For which method is it supported?**

### ✅ **Concept**

* `SEED()` (or `REPEATABLE()`) makes a sample **deterministic** — same table + same seed = same sample every time.

### ⚠️ **Supported for:**

* Fraction-based sampling applied directly to a table.

**Not supported for**:

* Fixed-size sampling (`N ROWS`)
* Sampling the result of a **JOIN**, or sampling **views** or **subqueries**

### 🧩 **Example**

```sql
SELECT * FROM sales SAMPLE SYSTEM (5) SEED (100);
```

→ Every time you run this query (on the same table), you’ll get the **same 5% of data**.

### 💬 **How to answer**

> “The `SEED` makes the sample repeatable, which is great for reproducible dashboards or ML experiments. It only holds while the table is unchanged, and it can't be combined with fixed-size sampling, views, subqueries, or join results.”

---

## **3️⃣ If I sample before or after a JOIN, how does it affect cost and accuracy?**

### ✅ **Concept**

* **Sampling before JOIN** reduces data scanned → lower cost, faster execution.
* **Sampling after JOIN** → full data processed → only final output is reduced → cost stays high.

### ⚠️ **Important restriction**

Snowflake only allows sampling **after a join** when:

* It’s **row-based** (BERNOULLI)
* **No SEED** is used.

### 🧩 **Example**

```sql
-- Sampling BEFORE join → cheaper
SELECT * FROM (
    SELECT * FROM customers SAMPLE (10)
) c
JOIN orders o ON c.id = o.cust_id;

-- Sampling AFTER join → full cost: wrap the join in a subquery, then sample its result
SELECT * FROM (
    SELECT * FROM customers c
    JOIN orders o ON c.id = o.cust_id
) SAMPLE (10);
```

### 💬 **How to answer**

> “If I apply sampling before the join, it reduces both compute and IO. But if I sample after, Snowflake still joins all rows first — it’s only reducing the output size, not cost. So, I always sample subqueries before joining when cost or performance matters.”

---

## **4️⃣ When would you use `SAMPLE (N ROWS)` vs `SAMPLE (PERCENT)`?**

### ✅ **Concept**

* `SAMPLE (N ROWS)` → fixed-size sample.
* `SAMPLE (PERCENT)` → percentage-based sample.

### ⚖️ **Trade-offs**

| Type      | Use-case                                                 | Notes                                                         |
| --------- | -------------------------------------------------------- | ------------------------------------------------------------- |
| `N ROWS`  | When you need exact count (e.g., 1,000 rows for testing) | Works with **BERNOULLI** only, not `SYSTEM` or `SEED`.        |
| `PERCENT` | When dataset size changes                                | Scales dynamically; works with both `BERNOULLI` and `SYSTEM`. |

### 💬 **How to answer**

> “I use `N ROWS` when I need a fixed-sized dataset for debugging or demo environments. For general EDA or analytics, I prefer a percentage-based sample, because it adjusts as table size changes.”

---

## **5️⃣ How does Snowflake’s zero-copy CLONE work, and when is CLONE better than SAMPLE?**

### ✅ **Concept**

* `CREATE ... CLONE` → instant metadata-only copy of a table/schema/database.
* Data isn’t duplicated until modified (copy-on-write).
* `SAMPLE` → creates a physically smaller subset of rows.

### ⚖️ **When to use**

| Operation  | Use for                                       | Data size             |
| ---------- | --------------------------------------------- | --------------------- |
| **CLONE**  | Exact copy for debugging, backups, sandboxing | Full (same as source) |
| **SAMPLE** | Small subset for analysis/ML                  | Reduced               |

### 💬 **How to answer**

> “`CLONE` is great for making a full environment snapshot instantly — it’s metadata-only. But if I just need a smaller dataset to prototype, I use `SAMPLE`. Sampling saves both time and compute, while clone preserves the entire dataset structure.”

---

## **6️⃣ Name two Snowflake approximate functions and the algorithms they use.**

### ✅ **Functions**

| Function                  | What it estimates          | Algorithm       |
| ------------------------- | -------------------------- | --------------- |
| `APPROX_COUNT_DISTINCT()` | Cardinality (unique count) | **HyperLogLog** |
| `APPROX_PERCENTILE()`     | Percentiles, medians       | **t-Digest**    |

### 💬 **How to answer**

> “For approximate aggregations, Snowflake offers `APPROX_COUNT_DISTINCT`, which uses HyperLogLog to estimate unique counts, and `APPROX_PERCENTILE`, which uses t-Digest for percentile estimation. These functions are faster and memory-efficient — perfect for large datasets.”

---

## **7️⃣ How can you validate whether sampling bias affects your metrics?**

### ✅ **Concept**

Compare **aggregates** from the full dataset vs the sample:

* Count
* Mean
* Category distribution
* Percentiles

### 🧩 **Example**

```sql
-- Full dataset
SELECT event_type, COUNT(*), AVG(amount) FROM events GROUP BY event_type;

-- Sampled dataset
SELECT event_type, COUNT(*), AVG(amount)
FROM events SAMPLE SYSTEM (2)
GROUP BY event_type;
```

→ Compare ratios and ensure differences are small.

### 💬 **How to answer**

> “I validate sampling quality by comparing key metrics — like averages, counts, and category proportions — between the full and sampled datasets. If the deltas are small and consistent across runs, the sample is representative.”

---

## **8️⃣ What are micro-partitions and why do they matter for SYSTEM sampling?**

### ✅ **Concept**

* Snowflake physically stores data in **micro-partitions** (contiguous storage units, each holding roughly 50 to 500 MB of uncompressed data, stored in compressed form).
* `SYSTEM` sampling works by randomly choosing **micro-partitions**, not individual rows.

### ⚠️ **Impact**

If a table is **clustered by time or category**, entire micro-partitions might contain similar data → SYSTEM sample may over/under-represent certain values.
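
To gauge how strongly micro-partitions are organized around a column, Snowflake provides the `SYSTEM$CLUSTERING_INFORMATION` function. The sketch below assumes the `prod.events` table and `event_ts` column from the demo schema; the interpretation in the comments is a rule of thumb, not a documented threshold.

```sql
WITH info AS (
    SELECT PARSE_JSON(
               SYSTEM$CLUSTERING_INFORMATION('prod.events', '(event_ts)')
           ) AS j
)
SELECT j:total_partition_count::INTEGER AS total_partitions,
       j:average_depth::FLOAT           AS average_depth
FROM info;
-- An average_depth close to 1 means micro-partitions barely overlap on
-- event_ts, so each one covers a narrow time slice and a small SYSTEM
-- sample can miss whole periods. A low total_partitions value means
-- there are few blocks to choose from in the first place.
```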

### 💬 **How to answer**

> “SYSTEM sampling selects entire micro-partitions. So, if my table is clustered, say by date, a SYSTEM sample might only pull from certain days. That’s why I validate it against a Bernoulli sample before trusting the numbers.”

---

## **9️⃣ How would you make sampling reproducible across multiple developers?**

### ✅ **Techniques**

1. Use `SAMPLE SYSTEM (...) SEED (<value>)` — ensures same blocks chosen.
2. Store seed and percentage in metadata.
3. Materialize the sampled dataset in a shared schema.

### 🧩 **Example**

```sql
CREATE OR REPLACE TABLE dev.events_sample AS
SELECT * FROM prod.events SAMPLE SYSTEM (2) SEED (100);
```

### 💬 **How to answer**

> “To make sampling consistent across environments, I fix the `SEED` value and method in SQL. For collaborative projects, I store the sample as a materialized table — this ensures everyone works with the same deterministic subset.”

---

## **🔟 If results differ wildly between `SAMPLE SYSTEM (1)` and `SAMPLE (1)`, what could cause that?**

### ✅ **Possible causes**

1. **Micro-partition bias** — SYSTEM chose blocks that are not representative.
2. **Table is clustered** (e.g., time-ordered → SYSTEM picks recent data only).
3. **Small sample size** → high variance in results.
4. **Different inclusion logic** — SYSTEM operates on blocks, BERNOULLI on rows.

### 💬 **How to answer**

> “Huge discrepancies usually come from micro-partition bias. SYSTEM might have selected partitions containing skewed data — like only recent sales. In that case, I either use Bernoulli or stratified sampling for fair representation.”

---

# ✅ **Summary Table for Revision**

| Topic                  | Key Points                                 |
| ---------------------- | ------------------------------------------ |
| Default sampling       | `BERNOULLI` (row-based)                    |
| Fastest method         | `SYSTEM` (block-based)                     |
| Deterministic sampling | `SEED` on fraction-based samples, not `N ROWS` |
| Pre-join sampling      | Reduces cost                               |
| Post-join sampling     | Reduces output only                        |
| Fixed vs percent       | Fixed = exact N rows; Percent = scalable   |
| CLONE vs SAMPLE        | CLONE = full copy; SAMPLE = subset         |
| Approximate functions  | HyperLogLog, t-Digest                      |
| Micro-partition        | Physical unit; used by SYSTEM sampling     |
| Validation             | Compare aggregates between full and sample |

---
