# Snowflake Staging (Internal Stages) — a practical, story-driven deep dive

*Mentor hat on.* We’ll build your mental model first, then get hands-on with a realistic “big file” load using SnowSQL, and finish with a crisp checklist and must-know questions. I’ll also gently fix any common misconceptions along the way.

---

## 1) The mental model: what is a **stage**?

Think of a **stage** as Snowflake’s “loading dock.” Files arrive here first, then forklifts (**COPY INTO**) move rows into your tables. Stages live **inside Snowflake** (internal) or **outside** (external: S3/Azure/GCS).
Your focus today is **internal** stages: Snowflake stores the files for you, secures them, and bills you for the storage.

### Three flavors you’ll see (user and table stages are always internal; named stages can be internal or external)

* **User stage**: `@~` — private scratchpad per user (great for ad-hoc).
* **Table stage**: `@%table_name` — automatically created for each table; perfect when files are **only for that table**.
* **Named stage**: `@mystage` — reusable object you `CREATE STAGE`. Add defaults (file format, copy options), organize subfolders/prefixes, and share with roles/teams.

> **Misconception to fix**: “Table stage is always best for a single table.”
> *Usually true*, but if you need broader access control, reusable defaults, or structured subfolders, a **named stage** can still be the better choice—even for a single table.

---

## 2) Core commands you’ll actually use (cheat-sheet)

**Staging file management**

```sql
-- See files
LIST @%orders;                      -- table stage
LIST @mystage/incoming/;            -- named stage with a "subfolder" (prefix)

-- Delete files in a stage
REMOVE @%orders PATTERN='.*\.bad$'; -- delete only *.bad
```

**Move data between local ↔ stage (SnowSQL or a Snowflake driver; `PUT`/`GET` do not run in a web worksheet)**

```sql
-- Upload from your laptop/server to a stage
PUT file://C:\loads\orders_2025_08_24.csv @%orders AUTO_COMPRESS=TRUE PARALLEL=8 OVERWRITE=FALSE;

-- Download from a stage to your machine
GET @%orders file://C:\downloads\orders_backup\ PARALLEL=8 PATTERN='.*2025_08_24.*';
```

**Load from stage → table**

```sql
COPY INTO ORDERS
  FROM @%orders
  FILE_FORMAT=(FORMAT_NAME = ff_orders_csv)
  PATTERN='.*orders_2025_08_24.*'
  ON_ERROR='ABORT_STATEMENT';
```

**(Optional) Unload from table → stage**

```sql
COPY INTO @mystage/exports/orders_2025_08_24/
  FROM (SELECT * FROM ORDERS WHERE order_date = '2025-08-24')
  FILE_FORMAT=(TYPE=CSV COMPRESSION=GZIP);
```

> Tip: After any `COPY INTO`, inspect the result instantly:

```sql
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```

---

## 3) File formats (the “decoder ring”)

Create **file format objects** once; reuse everywhere. They make complex CSVs, JSON, or Parquet easy and consistent.

**CSV example**

```sql
CREATE OR REPLACE FILE FORMAT ff_orders_csv
  TYPE = CSV
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  NULL_IF = ('\\N','NULL','')
  EMPTY_FIELD_AS_NULL = TRUE
  TRIM_SPACE = TRUE;
```

**JSON example (newline-delimited JSON, or one array per file)**

```sql
CREATE OR REPLACE FILE FORMAT ff_json_lines
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE; -- needed when a file is one big array; newline-delimited JSON loads without it
```

**Parquet example**

```sql
CREATE OR REPLACE FILE FORMAT ff_parquet TYPE = PARQUET;
```

> **Gotchas**
> • If your local file is **already gzipped** (`.gz`), don’t double-compress: `PUT` normally detects this (`SOURCE_COMPRESSION=AUTO_DETECT` is the default), but `AUTO_COMPRESS=FALSE` plus `SOURCE_COMPRESSION=GZIP` makes the intent explicit.
> • For CSVs with commas/newlines **inside quotes**, make sure `FIELD_OPTIONALLY_ENCLOSED_BY='"'` is set, or you’ll get column-count errors.

---

## 4) Table stage vs Named stage (when to use what)

| Use case                                                      | Table stage `@%table`               | Named stage `@mystage`              |
| ------------------------------------------------------------- | ----------------------------------- | ----------------------------------- |
| Files only for one table                                      | ✅ Best fit                          | Possible but not necessary          |
| Reuse defaults (file format / copy options) across many loads | Meh                                 | ✅ Strong                            |
| Team collaboration / fine-grained privileges                  | Limited (no grants; table owner only) | ✅ Grant READ/WRITE explicitly       |
| Clean subfolder conventions (`incoming/processed/failed/`)    | Basic                               | ✅ Clean and scalable                |
| Long-lived pipelines                                          | OK                                  | ✅ Prefer                            |

---

## 5) Best practices (internal stages)

1. **Right-size your files**: target **100–250 MB compressed** for fast parallel loads. Avoid one giant monolith; split if possible.
2. **Create and reuse file formats**; don’t inline all options in every `COPY`.
3. **Test the load** before committing: use `VALIDATION_MODE`.
4. **Decide your error policy up front**: `ON_ERROR` = `'ABORT_STATEMENT' | 'CONTINUE' | 'SKIP_FILE' | 'SKIP_FILE_<n>' | 'SKIP_FILE_<n>%'`.
5. **Idempotency**: keep original filenames stable; Snowflake keeps load metadata per table for 64 days and **skips files it has already loaded** unless you set `FORCE=TRUE`.
6. **Cleanup**: use `PURGE=TRUE` in `COPY` or `REMOVE` after successful loads to reduce storage cost.
7. **Security**: grant the **least** privileges (READ/WRITE on named stages only to loaders), and prefer named stages for shared pipelines.
8. **Observability**: capture `LAST_QUERY_ID()` from `COPY` and store the load stats; keep a lightweight load history table.
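
Practice 1 is the one that usually needs tooling, so here is a minimal Python sketch of splitting a large CSV before `PUT`. The function name `split_csv`, the part-naming scheme, and the byte threshold are illustrative assumptions, not anything Snowflake prescribes. The threshold counts uncompressed characters, so pick a value several times larger than the compressed size you are aiming for. Because rows are re-serialized with the `csv` module, quoting in the parts may differ from the source even though the field values are the same.

```python
import csv
import gzip
from pathlib import Path


def split_csv(src: Path, out_dir: Path, max_bytes: int) -> list[Path]:
    """Split src into gzip parts of roughly max_bytes (uncompressed) each.

    Every part repeats the header row, so SKIP_HEADER = 1 still applies.
    The csv module keeps quoted fields with embedded newlines intact.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    out = None
    writer = None
    written = 0
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:          # empty source file: nothing to split
            return parts
        for row in reader:
            if writer is None or written >= max_bytes:
                if out is not None:
                    out.close()
                part = out_dir / f"{src.stem}_part{len(parts):04d}.csv.gz"
                parts.append(part)
                out = gzip.open(part, "wt", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                written = 0
            writer.writerow(row)
            # Rough size estimate: field characters plus one separator per field.
            written += sum(len(cell) for cell in row) + len(row)
    if out is not None:
        out.close()
    return parts
```

Each resulting `*.csv.gz` part can then be uploaded with `PUT ... AUTO_COMPRESS=FALSE` and loaded with a single `COPY` that uses `PATTERN`.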

---

## 6) The scenario: “Upload a **large** file for one table using SnowSQL + table stage”

### The story

You’re the data engineer for **RiverRetail**. Finance sent you a chunky CSV: `orders_2025_08_24.csv` (15 GB). It belongs to a single table `ORDERS`. You’ll use the **table stage** (`@%ORDERS`) and **SnowSQL** (CLI) because the web UI is meant for smaller files and lacks the advanced knobs we need.

#### a) Prepare the target table

```sql
CREATE OR REPLACE TABLE ORDERS (
  order_id           NUMBER,
  order_ts           TIMESTAMP_NTZ,
  customer_id        NUMBER,
  amount             NUMBER(12,2),
  currency           VARCHAR(3),
  notes              STRING
);
```

#### b) Create a reusable file format (CSV with header, quoted fields)

```sql
CREATE OR REPLACE FILE FORMAT ff_orders_csv
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  NULL_IF = ('','NULL')
  EMPTY_FIELD_AS_NULL = TRUE
  TRIM_SPACE = TRUE;
```

#### c) Upload with **SnowSQL** `PUT`

From your terminal (connected with the role that owns `ORDERS`, since table stages have no separate grants):

```bash
snowsql -a <account> -u <user>
-- inside snowsql (a statement may span lines; it runs at the semicolon):
PUT file://C:\RiverRetail\loads\orders_2025_08_24.csv @%ORDERS
    AUTO_COMPRESS=TRUE PARALLEL=8 OVERWRITE=FALSE;
```

**Why these options?**

* `AUTO_COMPRESS=TRUE` → client compresses before upload (smaller, faster).
* `PARALLEL=8` → concurrent parts (tune based on local CPU/network).
* `OVERWRITE=FALSE` → don’t clobber if a file with the same name already exists.

> **If file is already `.gz`**:
> Use `AUTO_COMPRESS=FALSE` and (optionally) `SOURCE_COMPRESSION=GZIP`:
>
> ```sql
> PUT file://.../orders_2025_08_24.csv.gz @%ORDERS AUTO_COMPRESS=FALSE SOURCE_COMPRESSION=GZIP;
> ```

Check what landed:

```sql
LIST @%ORDERS;
```

#### d) **Dry-run** the load (validate, don’t insert)

```sql
COPY INTO ORDERS
  FROM @%ORDERS
  FILES = ('orders_2025_08_24.csv.gz')       -- use exact name returned by LIST
  FILE_FORMAT = (FORMAT_NAME = ff_orders_csv)
  VALIDATION_MODE = 'RETURN_ERRORS';
```

* If you want to preview rows Snowflake *would* load:

```sql
COPY INTO ORDERS
  FROM @%ORDERS
  FILES = ('orders_2025_08_24.csv.gz')
  FILE_FORMAT = (FORMAT_NAME = ff_orders_csv)
  VALIDATION_MODE = RETURN_1000_ROWS;
```

Review validation output:

```sql
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```

**Fix common errors** (examples)

* Column count mismatch → check delimiter/enclosure options.
* Bad timestamp → set `TIMESTAMP_FORMAT` in file format (`AUTO`, or a pattern).
* Corrupt lines → pick a policy: `ON_ERROR='SKIP_FILE_5'` (skip the file once it has 5 or more error rows), or `'CONTINUE'` (load good rows, skip bad ones and report them in the `COPY` output).
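
Column-count mismatches can often be caught before the file ever leaves your machine. The following is a minimal sketch, assuming a UTF-8, comma-delimited file with double-quote enclosures (the same shape `ff_orders_csv` describes); `find_bad_rows` is a hypothetical helper name. It is a local pre-check only and does not replace `VALIDATION_MODE`.

```python
import csv
from pathlib import Path


def find_bad_rows(path: Path, expected_cols: int, limit: int = 10) -> list[tuple[int, int]]:
    """Return up to `limit` (record_number, field_count) pairs for records
    whose field count differs from expected_cols. The header is record 1."""
    bad: list[tuple[int, int]] = []
    with path.open(newline="", encoding="utf-8") as f:
        for recno, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected_cols:
                bad.append((recno, len(row)))
                if len(bad) >= limit:
                    break
    return bad
```

For the `ORDERS` table above you would call it with `expected_cols=6`; an empty result means every record has the expected number of fields once quotes are honored.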

#### e) **Load for real**

```sql
COPY INTO ORDERS
  FROM @%ORDERS
  FILES = ('orders_2025_08_24.csv.gz')
  FILE_FORMAT = (FORMAT_NAME = ff_orders_csv)
  ON_ERROR = 'ABORT_STATEMENT'   -- choose the strictness you want
  PURGE = TRUE                   -- remove file from stage if load succeeds
  FORCE = FALSE;                 -- don't reload a file Snowflake has already loaded
-- MATCH_BY_COLUMN_NAME is left out on purpose: for CSV it requires PARSE_HEADER = TRUE
-- in the file format, which cannot be combined with the SKIP_HEADER = 1 used here.
```

Confirm outcome:

```sql
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));  -- rows_loaded, errors_seen, etc.
SELECT COUNT(*) FROM ORDERS WHERE CAST(order_ts AS DATE) = '2025-08-24';
```

#### f) **If you must reload** the same filename

* Set `FORCE=TRUE` in `COPY` (tells Snowflake to ignore the duplicate-file guard), **or**
* `PUT` the file again under a new unique name before loading (safer for lineage; there is no in-place rename for staged files).

#### g) **Housekeeping**

* If you didn’t use `PURGE=TRUE`, clean up:

```sql
REMOVE @%ORDERS PATTERN='.*orders_2025_08_24.*';
```

---

## 7) Extra depth knobs you’ll appreciate

* **Parallelism & warehouse size**: A larger warehouse loads more files at once, so it only helps when there are enough files to keep it busy. If your files are split well (100–250 MB), scaling up can reduce wall-clock time significantly; a single huge file gains little.
* **Patterns vs files**:

  ```sql
  COPY INTO ORDERS FROM @mystage/incoming/ PATTERN='.*orders_2025_08_.*\.csv\.gz';
  ```

  Patterns are great for daily partitions (prefixes like `incoming/2025/08/24/`).
* **Skips and truncation**: Prefer fixing the source or file format to using `TRUNCATECOLUMNS=TRUE` (last resort).
* **Auditing**: Save `LAST_QUERY_ID()` and `CURRENT_TIMESTAMP()` to a control table with the row counts you expect. This becomes your SLA dashboard.
* **Permissions (internal stages)**:

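  * **Named stage**: grant `READ` (`GET`, `LIST`, loading with `COPY`) and `WRITE` (`PUT`, `REMOVE`, unloading with `COPY`). `USAGE` on a stage applies to **external** stages only; the role still needs `USAGE` on the parent database and schema.
  * **Table stage**: there are no separate grants. Staging, listing, or removing files requires the role that **owns** the table; `INSERT` alone is not enough to `PUT` to `@%table`.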

---

## 8) Quick cookbook

**Create a named internal stage with defaults**

```sql
CREATE OR REPLACE STAGE stg_finance_in
  FILE_FORMAT = (FORMAT_NAME = ff_orders_csv);
-- (internal by default since no external URL/credentials specified)
```

**Use a named stage in COPY without restating format**

```sql
COPY INTO ORDERS
  FROM @stg_finance_in
  PATTERN='.*2025_08_24.*'
  ON_ERROR='CONTINUE';
```

**List only “.bad” reject files**

```sql
LIST @stg_finance_in PATTERN='.*\.bad$';
```
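
Before running `LIST`, `REMOVE`, or `COPY` with a `PATTERN`, it can help to preview the regex against a few sample paths. This sketch assumes the pattern has to match the whole staged path (which is why the patterns in this guide start with `.*`) and uses Python's `re` module, whose dialect may differ from Snowflake's in edge cases. `preview_pattern` and the sample paths are hypothetical.

```python
import re


def preview_pattern(pattern: str, staged_paths: list[str]) -> list[str]:
    """Return the paths that the regex matches in full."""
    rx = re.compile(pattern)
    return [p for p in staged_paths if rx.fullmatch(p)]


paths = [
    "incoming/orders_2025_08_24.csv.gz",
    "incoming/orders_2025_08_24.csv.gz.bad",
    "incoming/customers_2025_08_24.csv.gz",
]

only_bad = preview_pattern(r".*\.bad", paths)      # just the .bad file
no_prefix = preview_pattern(r"orders_.*", paths)   # empty: the "incoming/" prefix is not covered
```

The second call shows the common mistake: without a leading `.*`, the pattern never reaches past the folder prefix.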

**Download a sample back to your laptop**

```sql
GET @stg_finance_in file://C:\samples\ PARALLEL=4 PATTERN='.*sample.*';
```

---

## 9) What most people miss (but you won’t)

* **Duplicate protection**: Snowflake tracks which files were loaded into a table and **skips them** on subsequent runs. Use `FORCE=TRUE` judiciously and keep filenames stable for idempotent pipelines.
* **Compression awareness**: Don’t recompress an already compressed file unless you intend to; it wastes time and can slow loading.
* **Validation first**: `VALIDATION_MODE` saves you from half-loaded tables and messy rollbacks.
* **Storage costs**: Internal stages are convenient but not free—`PURGE` or time-based `REMOVE` is part of a healthy pipeline.

---

## 10) Must-know questions to test yourself

1. Explain **user**, **table**, and **named** stages. When would you choose each for an internal stage?
2. Walk through the **end-to-end steps** to load a large CSV into a single table using a **table stage** and **SnowSQL**. Include commands.
3. What does `AUTO_COMPRESS` do in `PUT`? When should you set it to `FALSE`?
4. How does Snowflake prevent **duplicate loads**? When would you use `FORCE=TRUE`?
5. Compare `ON_ERROR='ABORT_STATEMENT'` vs `'CONTINUE'` vs `'SKIP_FILE_n'`. When is each appropriate?
6. Show how to **validate** a load without inserting data, and how to **inspect results** of a `COPY` run.
7. Why are **100–250 MB compressed** files recommended? What happens if you load a single 100 GB file?
8. What privileges do you need to **PUT/GET/LIST/REMOVE** on a **named internal stage** vs on a **table stage**?
9. Which **file format** settings fix comma-in-quotes and header row issues for CSV?
10. How do you clean up stage files after a successful load, and why might you choose **`PURGE=TRUE`**?

---

## 11) TL;DR blueprint

* **Internal staging area**: prefer **table stage** for a single-table drop; consider a **named stage** if you need richer access control and defaults.
* **Commands**:

  * Upload: `PUT file://... @%table AUTO_COMPRESS=TRUE PARALLEL=8`
  * Download: `GET @%table file://...`
  * Use table stage: `@%table` ; named stage: create via `CREATE STAGE` then reference `@name`
  * Best practices: right-size files, reusable file formats, validate first, decide error policy, idempotency, purge/cleanup, minimal privileges.
* **Scenario** (large file, single table, SnowSQL):

  1. Create **file format** (`ff_orders_csv`).
  2. `PUT` the file to `@%ORDERS` (no double compression).
  3. `LIST` to verify.
  4. `COPY INTO ORDERS` with `VALIDATION_MODE` first.
  5. Real `COPY` with chosen `ON_ERROR`, `PURGE`, and `FORCE`.
  6. Inspect results via `RESULT_SCAN(LAST_QUERY_ID())`.
  7. Cleanup (`PURGE` or `REMOVE`).



---

## **1. Explain user, table, and named stages. When would you choose each for an internal stage?**

* **User stage (`@~`)**

  * Every user automatically gets one.
  * Files here are private to the user.
  * Best for **ad-hoc loads/tests** where only you need the data.
  * Example: Quickly testing a 5MB CSV you just downloaded.

* **Table stage (`@%table_name`)**

  * Every table automatically gets one.
  * Files are **meant for loading into that specific table**.
  * Great for **one-off large loads** where the file belongs *only* to that table.
  * Example: Finance sends a big CSV just for `ORDERS`; you `PUT` into `@%ORDERS` and `COPY` directly.

* **Named stage (`@mystage`)**

  * You explicitly `CREATE STAGE`.
  * Can hold defaults like file format, copy options.
  * Sharable between users/roles.
  * Best for **reusable pipelines** or **shared ingestion** with structured subfolders.
  * Example: Daily batch loads of sales data into multiple tables using `@sales_stage/incoming/`.

---

## **2. Walk through the end-to-end steps to load a large CSV into a single table using a table stage and SnowSQL.**

1. **Create the target table**:

   ```sql
   CREATE OR REPLACE TABLE ORDERS (...columns...);
   ```

2. **Create file format**:

   ```sql
   CREATE OR REPLACE FILE FORMAT ff_orders_csv TYPE=CSV SKIP_HEADER=1 FIELD_OPTIONALLY_ENCLOSED_BY='"';
   ```

3. **Upload the file with SnowSQL**:

   ```sql
   PUT file://C:\loads\orders.csv @%ORDERS AUTO_COMPRESS=TRUE PARALLEL=8;
   ```

4. **Check file landed**:

   ```sql
   LIST @%ORDERS;
   ```

5. **Validate load**:

   ```sql
   COPY INTO ORDERS
   FROM @%ORDERS
   FILE_FORMAT=(FORMAT_NAME=ff_orders_csv)
   VALIDATION_MODE='RETURN_ERRORS';
   ```

6. **Load for real**:

   ```sql
   COPY INTO ORDERS
   FROM @%ORDERS
   FILE_FORMAT=(FORMAT_NAME=ff_orders_csv)
   ON_ERROR='ABORT_STATEMENT'
   PURGE=TRUE;
   ```

---

## **3. What does `AUTO_COMPRESS` do in PUT? When should you set it to FALSE?**

* `AUTO_COMPRESS=TRUE` → SnowSQL compresses your file (gzip) **before uploading**. Faster transfer, smaller storage.
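* Set `AUTO_COMPRESS=FALSE` if your file is **already compressed** (`.gz`, `.bz2`) or if you want it staged uncompressed. `PUT` normally detects existing compression and skips recompression, but being explicit avoids surprises. Note that `.zip` archives are not a compression format Snowflake can load; convert them to gzip first.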

---

## **4. How does Snowflake prevent duplicate loads? When would you use `FORCE=TRUE`?**

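* Snowflake keeps **load metadata** for each table (file name plus size and a checksum/ETag), retained for 64 days.
* If you run `COPY` again on a file that is **unchanged**, Snowflake **skips it** by default.
* Use `FORCE=TRUE` to reload files even though they were already loaded and have not changed. This duplicates rows unless you remove the old ones first.
* Example: you removed a bad batch with `DELETE` and want to reload the same unchanged file. A file whose content was fixed under the same name has a new checksum, so `COPY` treats it as a new file and loads it without `FORCE`.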
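
To make the duplicate guard concrete, here is a toy Python model of load metadata. It is an illustration only, not Snowflake's implementation: the class `LoadHistory`, the use of SHA-256, and the in-memory set are all assumptions chosen to show the behaviour of "same name and same content is skipped unless forced".

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LoadHistory:
    """Toy stand-in for one table's load metadata (hypothetical)."""
    seen: set[tuple[str, str]] = field(default_factory=set)

    def should_load(self, name: str, content: bytes, force: bool = False) -> bool:
        """Decide whether a COPY-like step would load this file."""
        key = (name, hashlib.sha256(content).hexdigest())
        if key in self.seen and not force:
            return False          # unchanged file, already loaded: skip
        self.seen.add(key)
        return True               # new file, changed content, or forced reload


history = LoadHistory()
first = history.should_load("orders.csv.gz", b"v1")                # loaded
again = history.should_load("orders.csv.gz", b"v1")                # skipped
forced = history.should_load("orders.csv.gz", b"v1", force=True)   # reloaded
changed = history.should_load("orders.csv.gz", b"v2")              # loaded: content differs
```

In a real pipeline the last case is the dangerous one: the corrected file loads, but rows from the earlier version are still in the table unless you delete them.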

---

## **5. Compare `ON_ERROR='ABORT_STATEMENT'` vs `'CONTINUE'` vs `'SKIP_FILE_n'`. When is each appropriate?**

* `ABORT_STATEMENT` → Stop everything on first error.

  * Best for **strict financial or critical data**.

* `CONTINUE` → Load good rows, skip bad ones (errors are reported in the `COPY` output).

  * Best for **event data / logs** where some rows can be sacrificed.

* `SKIP_FILE_n` → Skip the file once it contains `n` or more error rows.

  * Best for **multi-file loads**: don’t waste time if a file is very corrupt.
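
The difference between the policies is easiest to see across several files. Below is a toy Python model, not Snowflake's implementation: `rows_loaded` is a hypothetical function, each file is represented as a list of booleans (`True` for a row that parses cleanly), and the percentage form `SKIP_FILE_<n>%` is deliberately not modelled.

```python
def rows_loaded(files: dict[str, list[bool]], policy: str) -> dict[str, int]:
    """Return, per file, how many rows a COPY-like step would load."""
    if policy == "ABORT_STATEMENT":
        # One bad row anywhere aborts the whole statement.
        clean = all(all(rows) for rows in files.values())
        return {name: (len(rows) if clean else 0) for name, rows in files.items()}

    result: dict[str, int] = {}
    for name, rows in files.items():
        errors = rows.count(False)
        good = len(rows) - errors
        if policy == "CONTINUE":
            result[name] = good
        elif policy == "SKIP_FILE":
            result[name] = good if errors == 0 else 0
        elif policy.startswith("SKIP_FILE_"):
            threshold = int(policy[len("SKIP_FILE_"):])
            # The file is skipped once errors reach the threshold.
            result[name] = good if errors < threshold else 0
        else:
            raise ValueError(f"unsupported policy: {policy}")
    return result


batch = {
    "a.csv": [True] * 10,
    "b.csv": [True] * 7 + [False] * 3,
}
```

With this batch, `ABORT_STATEMENT` loads nothing, `CONTINUE` loads 10 and 7 rows, `SKIP_FILE_3` drops `b.csv` entirely, and `SKIP_FILE_5` keeps its 7 good rows.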

---

## **6. Show how to validate a load without inserting data, and how to inspect results of a COPY run.**

* **Validate only**:

  ```sql
  COPY INTO ORDERS
  FROM @%ORDERS
  FILE_FORMAT=(FORMAT_NAME=ff_orders_csv)
  VALIDATION_MODE='RETURN_ERRORS';
  ```

* **Inspect results after COPY**:

  ```sql
  SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
  ```

This returns row counts, errors seen, and file status.

---

## **7. Why are 100–250 MB compressed files recommended? What happens if you load a single 100 GB file?**

* Snowflake loads files **in parallel**. Each file is a unit of parallelism.
* 100–250 MB compressed = “sweet spot” for balancing throughput and parallelism.
* If you load a **single 100 GB file**, only **one thread** works on it → load is painfully slow.

---

## **8. What privileges do you need to PUT/GET/LIST/REMOVE on a named internal stage vs a table stage?**

* **Named stage**:

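  * `READ` → `LIST` / `GET`
  * `WRITE` → `PUT` / `REMOVE`
  * `USAGE` on a stage applies to external stages only; for an internal stage the role needs `USAGE` on the parent database and schema instead.

* **Table stage**:

  * No separate grants exist. You must use the role that **owns** the table to stage, list, or remove files; `INSERT` on the table is not sufficient.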

---

## **9. Which file format settings fix comma-in-quotes and header row issues for CSV?**

* For commas inside quotes:

  ```sql
  FIELD_OPTIONALLY_ENCLOSED_BY='"'
  ```
* For header row:

  ```sql
  SKIP_HEADER=1
  ```

---

## **10. How do you clean up stage files after a successful load, and why might you choose PURGE=TRUE?**

* **Option 1: Let Snowflake auto-remove** with:

  ```sql
  COPY INTO ORDERS ... PURGE=TRUE;
  ```
* **Option 2: Manually**:

  ```sql
  REMOVE @%ORDERS PATTERN='.*2025_08_24.*';
  ```

Why PURGE?

* Saves **storage cost**.
* Prevents accidental **reloading** of old files, for example by a later `COPY` that uses `FORCE=TRUE`.

---




## **1) What if I recompress a file that’s already compressed?**

👉 Imagine you already have a file: `orders.csv.gz` (gzip-compressed).

Now in your `PUT` command you mistakenly write:

```sql
PUT file://orders.csv.gz @%ORDERS AUTO_COMPRESS=TRUE;
```

### What happens?

* By default, `PUT` inspects the source file (`SOURCE_COMPRESSION=AUTO_DETECT`). When it recognizes the file as already compressed, `AUTO_COMPRESS=TRUE` does **not** compress it again, so this command normally stages `orders.csv.gz` unchanged.
* Double compression (e.g., `.csv.gz.gz`) becomes a real risk when detection is bypassed, for example if you set `SOURCE_COMPRESSION=NONE`, or when an upstream tool gzipped the file twice before it reached you.
* A double-compressed file is a problem at load time because Snowflake decompresses **once**. What remains is still gzip bytes, so `COPY` either fails or parses binary junk instead of CSV rows.

### Analogy

It’s like putting a suitcase inside another suitcase and giving it to someone who only knows how to unzip **one layer**. They’ll open the first zipper and still see a packed bag inside that they don’t know how to open.

✅ **Best practice**:

* If the file is **already compressed** (`.gz`, `.bz2`), say so explicitly instead of relying on detection (`.zip` archives are not loadable; convert them to gzip first):

  ```sql
  PUT file://orders.csv.gz @%ORDERS AUTO_COMPRESS=FALSE;
  ```
* And if needed, tell Snowflake how it’s compressed:

  ```sql
  FILE_FORMAT=(TYPE=CSV COMPRESSION=GZIP)
  ```
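
One way to check, before `PUT`, whether a payload is already gzip-compressed (or was compressed twice upstream) is to look at the gzip magic bytes. The sketch below is a local illustration in Python; `gzip_layers` is a hypothetical helper, and it works on bytes held in memory, so for a multi-gigabyte file you would only read the first two bytes and stream the rest.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gzip_layers(data: bytes, max_layers: int = 5) -> int:
    """Count how many nested gzip layers wrap `data` (0 means plain data)."""
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return layers


raw = b"order_id,amount\n1,10.00\n"
once = gzip.compress(raw)
twice = gzip.compress(once)
```

Here `gzip_layers(raw)` is 0, `gzip_layers(once)` is 1, and `gzip_layers(twice)` is 2; anything above 1 means a loader that decompresses once would still be left with gzip bytes.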

---

## **2) What if I reinsert (upload) a file and `OVERWRITE=FALSE`?**

👉 Let’s say yesterday you uploaded `orders_2025_08_24.csv.gz` into your table stage:

```sql
PUT file://orders_2025_08_24.csv.gz @%ORDERS AUTO_COMPRESS=TRUE OVERWRITE=FALSE;
```

Now today you try again with the same command (same filename).

### What happens?

* Snowflake checks if the **exact same filename already exists** in the stage.
* Since `OVERWRITE=FALSE`, Snowflake will **skip the upload**.
* It will **not replace** the old file, even if the contents are different.
* The `PUT` result set reports the file with a `status` of `SKIPPED` rather than `UPLOADED`.

### Why does this matter?

* If finance sent you an updated version of the file (with the same name), you’ll accidentally still be working with the **old version**.
* To truly replace it, you must either:

  * Use `OVERWRITE=TRUE` in `PUT`, **or**
  * Rename the new file before uploading (e.g., `orders_2025_08_24_v2.csv.gz`).

### And then in COPY INTO?

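* Snowflake’s load metadata records the file name together with a checksum. A plain `COPY` skips a file that is unchanged; a file whose content changed under the same name counts as new and is loaded again, which can duplicate rows from the old version.
* If you really want to reload the same, unchanged file into the table, you’ll also need: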

  ```sql
  COPY INTO ORDERS FROM @%ORDERS FORCE=TRUE;
  ```

---

✅ **Summary**:

* **Recompress** → `PUT` usually detects compressed input and leaves it alone; a genuinely double-compressed file leads to unreadable data or load errors.
* **Reinsert with `OVERWRITE=FALSE`** → Snowflake will skip uploading; you’ll keep the old file in stage.

---
