# Snowflake → S3 Unloading (with stories, gotchas & rock-solid validation)

Imagine you’re the data engineering lead for “SkyCart,” an e-commerce company. Every morning by 06:00, Marketing expects a fresh, partitioned export of yesterday’s orders in S3 for downstream tools (Athena, Glue jobs, a Python notebook). Your job: make the unload **correct, repeatable, and easy to audit**—and be able to prove it.

Below I’ll teach you the fundamentals, then go step-by-step (with copy-pasteable SQL), then deep-dive into validation so you can **prove** the S3 files are complete and correct. I’ll also cover topics that are easy to overlook (formats, partitioning, encryption, performance, costs, and failure modes) and finish with must-know practice questions.

---

## 1) The 3 core ideas (fundamentals)

1. **UNLOAD = `COPY INTO <external location>`**
   In Snowflake, unloading means running `COPY INTO '<s3://...>' FROM (<query>)` (or `COPY INTO @my_external_stage/... FROM (<query>)`). You can use a named external stage (recommended for security) or a direct S3 URL with a storage integration. Options control file format (CSV/Parquet/JSON), partitioning, compression, naming, and overwrite behavior. ([Snowflake Documentation][1])

2. **Consistency & repeatability**
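   Every `SELECT` (and your unload query is a `SELECT`) reads a **single snapshot** of data as of the moment the statement starts, but two separate statements can see different data if rows change in between. To make validation bullet-proof across multiple statements, capture a timestamp and use the **Time Travel `AT (TIMESTAMP => …)`** clause in both your `COUNT(*)` validation and the unload query so they read **the same snapshot**. This only works while the timestamp falls inside the table's Time Travel retention period (`DATA_RETENTION_TIME_IN_DAYS`), so run the count and the unload close together.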

3. **Validation is not optional**
   You want to prove:

   * the **row count** in S3 == the **row count** from Snowflake at the same snapshot;
   * the **file set** in S3 is what you expect (names/partitions);
   * the **data shape** matches (columns, types, null handling, delimiters).
     We’ll use: the `COPY` output (with `DETAILED_OUTPUT=TRUE`), a **read-back query** directly from S3 via the stage, and optional spot checks/hashes. ([Snowflake Documentation][1])

---

## 2) One-time setup (the secure way)

### a) Create an AWS IAM Role and Storage Integration

You’ll let Snowflake **assume** an IAM role to write into your bucket/prefix.

**In Snowflake:**

```sql
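use role ACCOUNTADMIN;

-- 1) Create storage integration (replace ARNs and bucket path).
--    Caution: re-running CREATE OR REPLACE generates a new external ID,
--    so the AWS trust policy has to be updated again afterwards.
create or replace storage integration SKY_OUT_INT
  type = external_stage
  storage_provider = s3
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::123456789012:role/skycart-snowflake-writer'
  storage_allowed_locations = ('s3://skycart-analytics/exports/');

-- 2) Get values to finish AWS trust (external id, user arn)
describe integration SKY_OUT_INT;

-- 3) Let the role that creates the stage use the integration
grant usage on integration SKY_OUT_INT to role SYSADMIN;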
```

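*In AWS IAM, update the role's trust policy to allow Snowflake’s **AWS IAM user ARN** (`STORAGE_AWS_IAM_USER_ARN`) with the **external ID** (`STORAGE_AWS_EXTERNAL_ID`) returned by `DESCRIBE INTEGRATION`. Then attach an S3 policy that allows `s3:PutObject` on the allowed prefix, `s3:GetObject` (needed to read the files back through the stage during validation), `s3:ListBucket` and `s3:GetBucketLocation` on the bucket, and optionally `s3:DeleteObject` if you’ll use `OVERWRITE=TRUE` or remove files through the stage.* ([Snowflake Documentation][2])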
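
To pull out just the two values the trust policy needs, one option is to filter the `DESCRIBE INTEGRATION` result set. This is a small sketch; it assumes the output columns are named `property` and `property_value` in lowercase (hence the double quotes), as they appear in the command's result grid.

```sql
describe integration SKY_OUT_INT;

-- Filter the DESCRIBE output down to the values AWS needs.
-- Run this right after the DESCRIBE so last_query_id() refers to it.
select "property", "property_value"
from table(result_scan(last_query_id()))
where "property" in ('STORAGE_AWS_IAM_USER_ARN', 'STORAGE_AWS_EXTERNAL_ID');
```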

### b) (Recommended) Create a named external stage

```sql
use role SYSADMIN;
use database PROD;
use schema SHARED;

create or replace stage SKY_S3_STAGE
  url = 's3://skycart-analytics/exports/'
  storage_integration = SKY_OUT_INT;
```

Named stages centralize credentials and let you query/list files easily later. ([Snowflake Documentation][3])

### c) Define reusable file formats

We’ll demo **Parquet** (great for downstream analytics) and **CSV** (for tools that need delimited text).

```sql
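-- Parquet format (Snappy-compressed by default; real column names are only
-- kept when the COPY also sets HEADER = TRUE)
create or replace file format PROD.SHARED.FF_PARQUET
  type = parquet;

-- CSV format. The header row is controlled by HEADER = TRUE on the COPY;
-- SKIP_HEADER only affects reads, not unloading.
create or replace file format PROD.SHARED.FF_CSV
  type = csv
  field_delimiter = ','
  record_delimiter = '\n'
  skip_header = 0
  null_if = ('\\N','NULL')   -- on unload, SQL NULL is written as the first value (\N)
  empty_field_as_null = true
  field_optionally_enclosed_by = '"'
  compression = auto;  -- on unload, AUTO means gzip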
```

(You can override these inline on the COPY command if needed.) ([Snowflake Documentation][1])
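
The unload steps later run as the `ANALYST` role from the `SALES` schema, so that role needs access to the stage and file formats. A minimal sketch of the grants, assuming `SYSADMIN` owns these objects and the file formats were created in `PROD.SHARED`:

```sql
use role SYSADMIN;  -- assumed owner of the stage and file formats

grant usage on database PROD to role ANALYST;
grant usage on schema PROD.SHARED to role ANALYST;
grant usage on stage PROD.SHARED.SKY_S3_STAGE to role ANALYST;
grant usage on file format PROD.SHARED.FF_PARQUET to role ANALYST;
grant usage on file format PROD.SHARED.FF_CSV to role ANALYST;

-- Only needed if ANALYST unloads to a raw S3 URL with STORAGE_INTEGRATION
use role ACCOUNTADMIN;
grant usage on integration SKY_OUT_INT to role ANALYST;
```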

---

## 3) The unload—**step-by-step with a real scenario**

### Goal

Export “yesterday’s” completed orders from `PROD.SALES.ORDERS` into S3, **partitioned by `order_date`** (one folder per day), in **Parquet**, and make it **idempotent**.

### a) Capture a consistent snapshot time

```sql
use role ANALYST;
use warehouse ETL_XL;
use database PROD;
use schema SALES;

set SNAP_TS = current_timestamp();
```
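
Time Travel can only serve this snapshot if the table retains history. As a quick check (a sketch that assumes the `retention_time` column of the `SHOW TABLES` output, quoted because it is lowercase):

```sql
show tables like 'ORDERS' in schema PROD.SALES;

-- retention_time is reported in days; 0 means Time Travel is disabled for the table
select "name", "retention_time"
from table(result_scan(last_query_id()));
```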

### b) (Optional preview) Validate the query itself before exporting

```sql
-- Prove the query returns the expected rows (no unload yet).
-- AT must follow the table reference, not the enclosing subquery.
select order_id, customer_id, total_amount, order_date, updated_at
from ORDERS
  at (timestamp => $SNAP_TS)
where status = 'COMPLETED'
  and order_date = dateadd(day, -1, current_date())
limit 5;
```

For quick “does my query work?” checks, Snowflake also supports `VALIDATION_MODE = RETURN_ROWS` on `COPY INTO <location>` to run the query instead of unloading. ([Snowflake Documentation][1])
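
As a sketch of that option, the statement below returns the rows the unload would write, without creating any files. The file format is referenced by its fully qualified name on the assumption that it lives in `PROD.SHARED`.

```sql
copy into @PROD.SHARED.SKY_S3_STAGE/orders/parquet/
from (
  select order_id, customer_id, total_amount, order_date, updated_at
  from ORDERS
    at (timestamp => $SNAP_TS)
  where status = 'COMPLETED'
    and order_date = dateadd(day, -1, current_date())
  limit 5
)
file_format = (format_name = 'PROD.SHARED.FF_PARQUET')
validation_mode = return_rows;  -- returns the rows; nothing is written to S3
```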

### c) Count rows at the same snapshot (for later comparison)

```sql
set ROWS_EXPECTED = (
  select count(*) from ORDERS
    at (timestamp => $SNAP_TS)
   where status = 'COMPLETED'
     and order_date = dateadd(day, -1, current_date())
);
select $ROWS_EXPECTED as rows_expected;
```

### d) Unload to S3 (Parquet, partitioned)

```sql
-- Folder: s3://skycart-analytics/exports/orders/parquet/
copy into @PROD.SHARED.SKY_S3_STAGE/orders/parquet/
from (
  select order_id, customer_id, total_amount, order_date, updated_at
  from ORDERS
    at (timestamp => $SNAP_TS)
  where status = 'COMPLETED'
    and order_date = dateadd(day, -1, current_date())
)
-- Write files under a predictable partition folder structure:
partition by (to_varchar(order_date, 'YYYY-MM-DD'))
-- The file formats live in PROD.SHARED, so qualify the name from the SALES schema:
file_format = (format_name = 'PROD.SHARED.FF_PARQUET')
-- Keep real column names in the Parquet files (otherwise they are generic):
header = true
-- Helpful for uniqueness/auditing of filenames:
include_query_id = true
-- Return per-file stats so we can sum/verify:
detailed_output = true;
```

> **Notes you should understand**
>
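> * `PARTITION BY` uses the **value of the expression** as the folder path, so this statement writes to `…/orders/parquet/2025-08-29/…`. Snowflake does not add a `key=` prefix on its own; for Hive-style folders that Athena/Glue can discover, build the name yourself, e.g. `'dt=' || to_varchar(order_date, 'YYYY-MM-DD')`. Rows where the expression is NULL land under `_NULL_`.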
> * `PARTITION BY` **cannot** be combined with `SINGLE=TRUE` or `OVERWRITE=TRUE`. If you need to overwrite, target a **new** dated prefix each day (e.g., `…/dt=2025-08-29/`) and manage retention with lifecycle rules. ([Snowflake Documentation][1])
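
One way to control the folder names explicitly is to build them inside the partition expression. The sketch below writes Hive-style `dt=…` folders; the `parquet_hive/` prefix is a hypothetical path chosen for illustration.

```sql
copy into @PROD.SHARED.SKY_S3_STAGE/orders/parquet_hive/
from (
  select order_id, customer_id, total_amount, order_date, updated_at
  from ORDERS
    at (timestamp => $SNAP_TS)
  where status = 'COMPLETED'
    and order_date = dateadd(day, -1, current_date())
)
-- The string produced here becomes the folder name, e.g. dt=2025-08-29
partition by ('dt=' || to_varchar(order_date, 'YYYY-MM-DD'))
file_format = (format_name = 'PROD.SHARED.FF_PARQUET')
header = true
detailed_output = true;
```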

### e) (Alternative) Unload to S3 in CSV with headers

```sql
copy into 's3://skycart-analytics/exports/orders/csv/'
  from ( select * from ORDERS
           at (timestamp => $SNAP_TS)
         where status='COMPLETED'
           and order_date = dateadd(day, -1, current_date()) )
  storage_integration = SKY_OUT_INT
  file_format = (format_name = 'PROD.SHARED.FF_CSV')
  header = true
  max_file_size = 50000000         -- ~50 MB target chunks
  include_query_id = true
  detailed_output = true;
```

> **CSV gotchas**: Think through **NULL vs empty string**, quotes and escapes. Set `NULL_IF` and `FIELD_OPTIONALLY_ENCLOSED_BY` consciously to avoid downstream surprises. ([Snowflake Documentation][1])

---

## 4) Validation: **prove nothing is missing and the data matches**

### A) Use the `COPY` result set as your first audit

`COPY INTO <location>` returns a result set you can immediately capture:

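```sql
-- Capture the unload report for auditing.
-- Run this immediately after the COPY: last_query_id() must still point at it.
set UNLOAD_QID = last_query_id();
create or replace temporary table TMP_UNLOAD_AUDIT as
select * from table(result_scan($UNLOAD_QID));
-- With DETAILED_OUTPUT = TRUE the per-file row count column is ROW_COUNT
-- (quote it as "row_count" if your result set reports lowercase names)
select sum(row_count) as rows_in_files from TMP_UNLOAD_AUDIT;
```

* With `DETAILED_OUTPUT=TRUE`, you get **one row per file** with `FILE_NAME`, `FILE_SIZE`, and `ROW_COUNT`. Summing `ROW_COUNT` gives total exported rows. (Without it, `COPY` returns a single summary row with `rows_unloaded`, `input_bytes`, and `output_bytes`.)
* `INCLUDE_QUERY_ID=TRUE` stamps filenames with the query id, which is gold for traceability/deduping.
  (These behaviors are documented in the `COPY INTO <location>` options.) ([Snowflake Documentation][1])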

**Compare totals**:

```sql
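-- ROW_COUNT is the per-file row count returned with DETAILED_OUTPUT = TRUE
select $ROWS_EXPECTED as rows_expected,
       coalesce(sum(row_count), 0) as rows_in_files,
       $ROWS_EXPECTED = coalesce(sum(row_count), 0) as counts_match
from TMP_UNLOAD_AUDIT;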
```

These must match exactly. If not, investigate (see “Common pitfalls & fixes” below).

### B) List and verify file/partition structure

```sql
-- Show what landed in S3 (via the external stage)
list @PROD.SHARED.SKY_S3_STAGE/orders/parquet/;
```

`LIST` returns each file's `name`, `size`, `md5`, and `last_modified`, so you can turn it into a quick directory report and diff it over time. ([Snowflake Documentation][4])
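
A directory report of this kind could be sketched as follows. It assumes the `LIST` output columns `name` and `size` (lowercase, so quoted) and groups files by their folder.

```sql
list @PROD.SHARED.SKY_S3_STAGE/orders/parquet/;

-- Summarize the listing per folder (run right after LIST)
select regexp_substr("name", '.*/') as folder,  -- everything up to the last slash
       count(*) as files,
       sum("size") as total_bytes
from table(result_scan(last_query_id()))
group by 1
order by 1;
```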

### C) **Read back from S3** and recount in Snowflake

You can query staged files directly—no table needed:

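```sql
-- Assumes the COPY's query ID was captured right after the unload:
--   set UNLOAD_QID = last_query_id();
-- A stage path matches every file under it, so filter to this run's files.

-- Recount Parquet files directly in S3
select count(*) as rows_read_back
from @PROD.SHARED.SKY_S3_STAGE/orders/parquet/
  ( file_format => 'PROD.SHARED.FF_PARQUET' )
where metadata$filename like '%' || $UNLOAD_QID || '%';

-- Recount CSV (use the CSV COPY's query ID; HEADER = TRUE wrote one header line per file)
select count(*) as rows_read_back
from @PROD.SHARED.SKY_S3_STAGE/orders/csv/
  ( file_format => 'PROD.SHARED.FF_CSV' )
where metadata$filename like '%' || $UNLOAD_QID || '%'
  and metadata$file_row_number > 1;
```

This is the cleanest way to prove “what’s in S3 equals the source snapshot.” It’s officially supported for both internal and external stages. Without the filename filter the count would also include earlier days and reruns under the same prefix, and without the row-number filter each CSV header line would be counted as a data row. ([Snowflake Documentation][4])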

### D) Optional content checks

Do light spot checks to catch delimiter/encoding issues:

```sql
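-- Spot check row counts and min/max amount per file.
-- $3 assumes total_amount is the third column of the unloaded CSV; adjust to your column order.
select metadata$filename as file,
       count(*) as cnt,
       min(try_to_number($3, 12, 2)) as min_amt,
       max(try_to_number($3, 12, 2)) as max_amt
from @PROD.SHARED.SKY_S3_STAGE/orders/csv/ (file_format => 'PROD.SHARED.FF_CSV')
where metadata$file_row_number > 1   -- skip the header line written by HEADER = TRUE
group by 1
order by 1
limit 20;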
```

(`METADATA$FILENAME` and friends are accessible when querying staged files.) ([Snowflake Documentation][4])
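
For a stronger content check, you could compare aggregates from the source snapshot with the same aggregates computed over the files. The sketch below makes several assumptions: the Parquet files were unloaded with `HEADER = TRUE` (so fields are named `ORDER_ID` and `TOTAL_AMOUNT`), `order_id` is numeric, and a session variable `UNLOAD_QID` holds the COPY statement's query ID.

```sql
-- Source side, at the same snapshot
select count(*) as row_cnt,
       sum(total_amount::number(12,2)) as amount_sum,
       hash_agg(order_id::number) as order_id_hash
from ORDERS
  at (timestamp => $SNAP_TS)
where status = 'COMPLETED'
  and order_date = dateadd(day, -1, current_date());

-- S3 side: each Parquet row arrives as one VARIANT value in $1
select count(*) as row_cnt,
       sum($1:TOTAL_AMOUNT::number(12,2)) as amount_sum,
       hash_agg($1:ORDER_ID::number) as order_id_hash
from @PROD.SHARED.SKY_S3_STAGE/orders/parquet/
  (file_format => 'PROD.SHARED.FF_PARQUET')
where metadata$filename like '%' || $UNLOAD_QID || '%';
```

If the counts and sums agree but the hashes differ, check that both sides cast `order_id` to the same type before suspecting the data.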

---

## 5) Formats, encryption, performance & costs (things people forget)

### Choosing a file format

* **Parquet** (default Snappy compression) → smaller files, typed columns, faster scans in Athena/Glue/Spark.
* **CSV** → human-friendly, ubiquitous, but bigger and needs careful null/quote handling.
  Snowflake can unload to delimited text (CSV/TSV), JSON, and Parquet; Avro, ORC, and XML are load-only formats. See file-format options in `COPY INTO <location>`. ([Snowflake Documentation][1])

### Partitioning strategy

Use `PARTITION BY` on fields you’ll filter downstream (e.g., `order_date`, `region`). It creates hierarchical folders and can drastically cut costs in engines like Athena. Remember: not compatible with `SINGLE=TRUE` or `OVERWRITE=TRUE`. ([Snowflake Documentation][1])

### Encryption

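If your bucket uses SSE-S3 by default, you’re fine. If you need **KMS**, note that `ENCRYPTION` belongs to the external location definition: set it on a `COPY` that targets an S3 URL directly, or on the stage itself when you use a named stage.

```sql
copy into 's3://skycart-analytics/exports/secure/orders/'
from ( select ... )
storage_integration = SKY_OUT_INT
file_format = (format_name = 'PROD.SHARED.FF_PARQUET')
encryption = ( type = 'AWS_SSE_KMS' kms_key_id = 'arn:aws:kms:us-east-1:123456789012:key/abcd-...' );
```

Supported server-side values are `AWS_SSE_S3`, `AWS_SSE_KMS`, or `NONE`; client-side encryption uses `AWS_CSE` with a `MASTER_KEY`. ([Snowflake Documentation][1])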
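
For a named stage, the encryption settings can live on the stage definition, so every unload through it is encrypted the same way. A minimal sketch, in which the stage name `SKY_S3_STAGE_KMS` is hypothetical and the key ID is a placeholder for your own key:

```sql
-- Hypothetical stage dedicated to KMS-encrypted exports
create or replace stage PROD.SHARED.SKY_S3_STAGE_KMS
  url = 's3://skycart-analytics/exports/secure/'
  storage_integration = SKY_OUT_INT
  encryption = ( type = 'AWS_SSE_KMS' kms_key_id = '<your-kms-key-id>' );
```

The IAM role behind the integration will typically also need permission to use that key (for example `kms:GenerateDataKey`), otherwise writes fail with an access error.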

### Performance tips

* **Don’t `ORDER BY`** in unload queries unless you need it; it forces global sort and slows things down.
* Tune **`MAX_FILE_SIZE`** (default 16 MB, which tends to yield many small files) to produce 50–250 MB compressed files for balanced parallelism.
* For huge exports, prefer **Parquet** and **partitioning**.
* Size the warehouse (e.g., `ETL_XL` vs `ETL_2XL`) and consider multi-cluster if concurrency is needed.
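
To see whether a run produced sensibly sized files, you could summarize the per-file sizes from the captured COPY output. This sketch assumes `TMP_UNLOAD_AUDIT` holds the `DETAILED_OUTPUT` result and that its size column is named `FILE_SIZE` (in bytes).

```sql
-- File count and size spread for one unload
select count(*) as files,
       round(avg(file_size) / 1000000, 1) as avg_mb,
       round(min(file_size) / 1000000, 1) as min_mb,
       round(max(file_size) / 1000000, 1) as max_mb
from TMP_UNLOAD_AUDIT;
```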

### Costs

* You pay for **warehouse time** during unload.
* **Cloud egress** can apply if your Snowflake account region and S3 bucket region differ—keep them co-located.
* S3 storage and request costs apply as usual.

---

## 6) Common pitfalls & how to fix them

1. **Counts don’t match.**

   * You didn’t read the same snapshot: ensure both `COUNT(*)` and `COPY` use `AT (TIMESTAMP => $SNAP_TS)`.
   * You applied filters differently: copy the exact `WHERE` clause into both.
   * Your CSV file format is swallowing rows (e.g., multiline fields or bad quoting). Revisit `FIELD_OPTIONALLY_ENCLOSED_BY`, `ESCAPE_UNENCLOSED_FIELD`, `RECORD_DELIMITER`. Try a small sample and read back from S3 to spot the pattern.

2. **“AccessDenied” or nothing lands in S3.**

   * Storage integration lacks permission to the exact prefix, or the bucket policy/trust policy isn’t set with Snowflake’s **IAM user ARN** + **external ID**. Recheck `DESCRIBE INTEGRATION` and bucket/role policies. ([Snowflake Documentation][2])

3. **`PARTITION BY` with `OVERWRITE`/`SINGLE` errors.**
   Limitations are by design; use a new dated prefix instead of overwrite, and avoid `SINGLE=TRUE` when partitioning. ([Snowflake Documentation][1])

4. **Parquet + timezone types.**
   Certain timestamp flavors (e.g., `TIMESTAMP_TZ`/`TIMESTAMP_LTZ`) can error when unloading to Parquet; cast to `TIMESTAMP_NTZ` first (best practice in many pipelines). ([Snowflake Documentation][1])

5. **Zero rows exported.**
   Snowflake won’t create a data file if the query returns 0 rows. If your downstream expects the path to exist, create a small marker file separately or design consumers to tolerate missing partitions. ([Snowflake Documentation][1])
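
One possible shape for such a marker file is a tiny single-file unload whose query always returns exactly one row. The `_markers/` folder and file name are hypothetical, and the statement assumes the `ROWS_EXPECTED` and `SNAP_TS` session variables from the earlier steps.

```sql
-- Writes one small CSV per run, even when the export itself has zero rows
copy into @PROD.SHARED.SKY_S3_STAGE/orders/_markers/latest_export.csv
from (
  select dateadd(day, -1, current_date()) as export_date,
         $ROWS_EXPECTED as rows_expected,
         $SNAP_TS as snapshot_ts
)
file_format = (type = csv compression = none field_optionally_enclosed_by = '"')
header = true
single = true      -- exactly one file with the name given above
overwrite = true;  -- replace the previous run's marker
```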

---

## 7) A tidy, reusable “daily export” pattern (put it on a TASK)

1. Wrap the steps into a stored procedure (capture snapshot → count → unload → verify → write audit row).
2. Schedule with a **TASK** at 05:45 so files are ready by 06:00.
3. Use a **date-stamped prefix**: `.../orders/dt=YYYY-MM-DD/` to avoid overwrites and make lineage clean.
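
The scheduling part could look like the sketch below. `EXPORT_ORDERS_DAILY` is a hypothetical stored procedure wrapping the steps from item 1, and the `UTC` time zone is an assumption to replace with your own.

```sql
create or replace task PROD.SALES.EXPORT_ORDERS_TASK
  warehouse = ETL_XL
  schedule = 'USING CRON 45 5 * * * UTC'   -- 05:45 every day
as
  call PROD.SALES.EXPORT_ORDERS_DAILY();

-- Tasks are created suspended; resume to start the schedule
alter task PROD.SALES.EXPORT_ORDERS_TASK resume;
```

Inside an owner's-rights procedure, session variables such as `$SNAP_TS` are not available, so hold the snapshot timestamp in a local variable there.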

---

## 8) Full walk-through (copy/paste block)

```sql
-- 0) Context
use role SYSADMIN;
use warehouse ETL_XL;
use database PROD;
use schema SALES;

-- 1) Snapshot & expected count
set SNAP_TS = current_timestamp();

set ROWS_EXPECTED = (
  select count(*) from ORDERS
    at (timestamp => $SNAP_TS)
   where status='COMPLETED'
     and order_date = dateadd(day, -1, current_date())
);

-- 2) Unload to S3 (Parquet, partitioned, auditable names)
copy into @PROD.SHARED.SKY_S3_STAGE/orders/parquet/
from (
  select
    order_id, customer_id, total_amount::number(12,2) as total_amount,
    order_date, updated_at::timestamp_ntz as updated_at
  from ORDERS at (timestamp => $SNAP_TS)
  where status='COMPLETED'
    and order_date = dateadd(day, -1, current_date())
)
file_format = (format_name = FF_PARQUET)
partition by (to_varchar(order_date, 'YYYY-MM-DD'))
include_query_id = true
detailed_output = true;

-- 3) Audit the COPY output
create or replace temporary table TMP_UNLOAD_AUDIT as
select * from table(result_scan(last_query_id()));

select $ROWS_EXPECTED as rows_expected;
select sum(rows_unloaded) as rows_in_files from TMP_UNLOAD_AUDIT;

-- 4) Read back from S3 and compare again
select count(*) as rows_read_back
from @PROD.SHARED.SKY_S3_STAGE/orders/parquet/
  (file_format => 'FF_PARQUET');

-- 5) Optional: list files (eyeball partitions & sizes)
list @PROD.SHARED.SKY_S3_STAGE/orders/parquet/;
```

(Options like `ENCRYPTION = (TYPE='AWS_SSE_KMS' KMS_KEY_ID='…')` on the stage or S3 URL, or switching to CSV with `FILE_FORMAT = (FORMAT_NAME = 'PROD.SHARED.FF_CSV')`, are easy tweaks.) ([Snowflake Documentation][1])

---

## 9) Your validation checklist (print-worthy)

* [ ] **Same snapshot** used in both count and unload (`AT (TIMESTAMP => $SNAP_TS)`).
* [ ] `COPY` **result set captured**; sum of `rows_unloaded` equals expected count. ([Snowflake Documentation][1])
* [ ] **Read-back count** from S3 via stage equals expected count. ([Snowflake Documentation][4])
* [ ] **Partitions present** as designed (e.g., `YYYY-MM-DD/`, or `dt=YYYY-MM-DD/` if you build Hive-style names in the `PARTITION BY` expression).
* [ ] **File sizes reasonable** (not tons of tiny files, not a single huge file).
* [ ] **CSV**: nulls/quotes/escapes validated; **Parquet**: timestamp types cast as needed. ([Snowflake Documentation][1])
* [ ] (If required) **Encryption** validated (KMS key id). ([Snowflake Documentation][1])

---

## 10) Extra patterns you’ll use in the wild

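* **Idempotency**: `INCLUDE_QUERY_ID=TRUE` prevents one run from overwriting another's files, but a rerun then adds a second full set next to the first. Either emit into a **dated prefix** and treat each run as immutable output, or have consumers read only the files of the latest verified query ID. ([Snowflake Documentation][1])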
* **Schema evolution**: Prefer Parquet; CSV + headers is fragile for evolving schemas.
* **Downstream friendliness**: Pick partition columns your consumers filter by; avoid too many tiny files.
* **Auditing**: Insert a row into an `EXPORT_AUDIT` table with `query_id`, `rows_expected`, `rows_in_files`, `prefix`, and a `verification_status`.
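
A minimal sketch of that audit table and the insert, under a few assumptions: the table lives in `PROD.SHARED`, `recorded_at` is an extra column added here for convenience, `TMP_UNLOAD_AUDIT` holds the `DETAILED_OUTPUT` result with a `ROW_COUNT` column, and `UNLOAD_QID` was set with `set UNLOAD_QID = last_query_id();` right after the COPY.

```sql
create table if not exists PROD.SHARED.EXPORT_AUDIT (
  query_id            varchar,
  rows_expected       number,
  rows_in_files       number,
  prefix              varchar,
  verification_status varchar,
  recorded_at         timestamp_ltz default current_timestamp()
);

-- One audit row per unload; PASSED only when the counts agree
insert into PROD.SHARED.EXPORT_AUDIT
  (query_id, rows_expected, rows_in_files, prefix, verification_status)
select $UNLOAD_QID,
       $ROWS_EXPECTED,
       coalesce(sum(row_count), 0),
       'orders/parquet/',
       iff($ROWS_EXPECTED = coalesce(sum(row_count), 0), 'PASSED', 'FAILED')
from TMP_UNLOAD_AUDIT;
```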

---

## Must-know questions (to test yourself)

1. What are the pros/cons of unloading to a **named external stage** vs a direct **S3 URL** with `STORAGE_INTEGRATION`?
2. How does Snowflake guarantee **read consistency**, and how do you use `AT (TIMESTAMP => …)` for validation across multiple statements?
3. Why use **`INCLUDE_QUERY_ID`** and **`DETAILED_OUTPUT`** in `COPY INTO <location>`? What do you get back and how do you use it? ([Snowflake Documentation][1])
4. Explain why **`PARTITION BY`** can’t be combined with **`SINGLE=TRUE`** or **`OVERWRITE=TRUE`**, and how you design around it. ([Snowflake Documentation][1])
5. When would you pick **Parquet** over **CSV**, and what **CSV file format** options prevent data corruption (nulls, quotes, newlines)? ([Snowflake Documentation][1])
6. How do you **read back** your S3 files in Snowflake to validate counts and content without loading into a table? Show the exact SQL. ([Snowflake Documentation][4])
7. What permissions/policies are required for a **storage integration** to write to S3, and how do **external ID** and **IAM trust** fit in? ([Snowflake Documentation][2])
8. What are the **encryption** options for unloading to S3 and when do you need `AWS_SSE_KMS`? Show the syntax. ([Snowflake Documentation][1])
9. What happens if your query returns **zero rows** and your downstream expects a file? How do you design for this? ([Snowflake Documentation][1])

---
