
---

# 🚀 Snowflake COPY Command Deep Dive

## 1. What is COPY Command in Snowflake?

Think of Snowflake as a house (your **database**), and you want to move your furniture (data) from a truck (files in **S3, Azure Blob Storage, GCS, or an internal Snowflake stage**) into the house (a **table**).

The **`COPY INTO <table>`** command is the “movers team” that takes care of this. It knows how to:

* Pick which boxes (files) to load
* Unpack them according to format (CSV, JSON, Parquet, Avro, ORC, XML)
* Handle errors if some items don’t fit well (bad records)
* Keep track of what’s already loaded (no double moving unless you ask for it)

So the COPY command is **the workhorse of bulk loading in Snowflake**, and most ETL/ELT pipelines start with it.

---

## 2. File Level Options

### 🔹 The `FILES` Option

This is when you want to **load specific files** instead of all files in a stage.

**Scenario:**
Your upstream system drops daily sales data into an S3 bucket like this:

```
s3://company-data/sales/
 ├── sales_20250901.csv
 ├── sales_20250902.csv
 ├── sales_20250903.csv
 ├── sales_20250904.csv
```

Normally, if you run:

```sql
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV);
```

👉 Snowflake will try to load **all four files**.

But let’s say you only want `sales_20250903.csv`. Then:

```sql
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILES = ('sales_20250903.csv')
FILE_FORMAT = (TYPE = CSV);
```

That’s the **`FILES` option in action.**

⚠️ **Important restriction**: You **can’t use negation (like `NOT 'sales_20250903.csv'`)** with `FILES`.
Snowflake doesn’t support exclusion here; it only allows you to **pick explicitly** (up to 1,000 file names per statement).

👉 If you want exclusion, you must use `PATTERN` instead (we’ll cover that below).
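
Before hard-coding names into `FILES`, it can help to look at what the stage path actually contains. A minimal sketch, assuming the `@my_s3_stage` stage from the examples above exists and your role can read it:

```sql
-- Show every staged file under the sales/ prefix,
-- along with its size, checksum and last-modified time.
LIST @my_s3_stage/sales/;
```

The names you then pass to `FILES` are relative to the path in the `FROM` clause, so `'sales_20250903.csv'` here means `sales/sales_20250903.csv` on the stage.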

---

### 🔹 `ON_ERROR = CONTINUE` vs `ON_ERROR = ABORT_STATEMENT`

Errors happen when:

* A value doesn’t match the column’s data type
* A CSV row has missing or extra delimiters
* A JSON document has an unexpected structure

**Scenario Example:**
Your file has 1000 rows. Out of them, 5 rows are malformed.

* `ON_ERROR = ABORT_STATEMENT` (the default for bulk loads):
  Snowflake **aborts the whole statement** at the first error. Nothing gets loaded.
  It’s like movers finding a broken box and refusing to unload the entire truck.

  ```sql
  COPY INTO sales_table
  FROM @my_s3_stage/sales/
  FILE_FORMAT = (TYPE = CSV)
  ON_ERROR = 'ABORT_STATEMENT';
  ```

* `ON_ERROR = CONTINUE`:
  Snowflake **loads all valid rows**, skips the bad ones.
  So 995 rows get loaded, 5 rejected.

  ```sql
  COPY INTO sales_table
  FROM @my_s3_stage/sales/
  FILE_FORMAT = (TYPE = CSV)
  ON_ERROR = 'CONTINUE';
  ```

👉 Which one to use?

* Use `ABORT_STATEMENT` when data quality must be 100% strict (like financial transactions).
* Use `CONTINUE` when you’d rather not block the pipeline (like logs or clickstream).
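
There is also a middle ground between the two: skipping whole files that contain errors. A short sketch of the `SKIP_FILE` variants, reusing the same table and stage names as above (the threshold of 10 is just an illustrative value):

```sql
-- Skip any file that contains at least one bad row;
-- files without errors are loaded in full.
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV)
ON_ERROR = 'SKIP_FILE';

-- Skip a file only once it reaches 10 bad rows;
-- below that threshold, load its valid rows and drop the bad ones.
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV)
ON_ERROR = 'SKIP_FILE_10';
```

This is handy when a partially loaded file would be harder to reason about than a missing one.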

---

### 🔹 The `PATTERN` Option

This is super powerful because often data is **partitioned by date or category** in S3.

**Example: Logs in S3**

```
s3://company-data/logs/
 ├── date=2025-09-01/part-000.csv
 ├── date=2025-09-02/part-001.csv
 ├── date=2025-09-03/part-002.csv
```

Now you want to only load **Sept 2nd logs**. You can use:

```sql
COPY INTO logs_table
FROM @my_s3_stage/logs/
PATTERN = '.*date=2025-09-02/.*[.]csv'
FILE_FORMAT = (TYPE = CSV);
```

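👉 PATTERN uses **regex**, and the expression is matched against the file path, not just the file name, which is why the examples start with `.*`. Common examples, assuming compact dates in the names like `sales_20250901.csv`:

* `'.*202509.*'` → load all September 2025 files
* `'.*2025090[1-5].*'` → load files from Sept 1–5
* `'.*[.]json'` → load only JSON files (write a literal dot as `[.]` or `\\.`; inside a single-quoted string a single backslash is treated as a string escape, so `\.` ends up matching any character)

📌 Remember: PATTERN is where you can “exclude” indirectly by choosing regex that ignores certain files.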
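
A regex that is slightly off can silently load nothing or too much, so it is worth a dry run first. One way to do that, assuming the same `@my_s3_stage` stage and the date-partitioned layout shown above, is to give the same pattern to `LIST`:

```sql
-- Preview which staged files the regex selects before running COPY.
-- Nothing is loaded; the output is just the matching file list.
LIST @my_s3_stage/logs/ PATTERN = '.*date=2025-09-02/.*[.]csv';
```

If the list looks right, the same `PATTERN` string can be pasted into the `COPY INTO` statement.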

---

## 3. Deeper into `ON_ERROR`

We already saw `CONTINUE` vs `ABORT_STATEMENT`. But let’s go deeper into **reject handling**.

Snowflake doesn’t keep rejected rows in a table for you, but it does let you **retrieve them after the load** for analysis.

---

### 🔹 How to capture rejected records

After a COPY command runs, you can query the **load history**:

```sql
SELECT *
FROM INFORMATION_SCHEMA.LOAD_HISTORY
ORDER BY LAST_LOAD_TIME DESC;
```

Each COPY operation gets a **query ID**.

👉 You can fetch the query ID of your last COPY like this, as long as you run it in the same session right after the COPY and before any other statement:

```sql
SELECT LAST_QUERY_ID();
```
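
For a per-file view of recent loads into one table, the `COPY_HISTORY` table function is an alternative to the view above. A sketch, assuming the table is named `SALES_TABLE` in the current schema and that a 24-hour window is enough (both are illustrative choices):

```sql
-- One row per loaded file: status, row counts and the first error, if any.
SELECT file_name,
       status,
       row_count,
       row_parsed,
       error_count,
       first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'SALES_TABLE',
  START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())
))
ORDER BY last_load_time DESC;
```

Files with a partial-load status and a non-zero `error_count` are the ones worth inspecting with `VALIDATE`.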

---

### 🔹 Query rejected records

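Snowflake gives you two tools here. The `VALIDATION_MODE` copy option checks files for errors **before** loading, without loading anything. The `VALIDATE` table function returns the errors from a COPY that **already ran**.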

Example:

```sql
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV)
ON_ERROR = 'CONTINUE';
```

Now, to check rejected rows:

```sql
SELECT *
FROM TABLE(VALIDATE(
  sales_table,
  JOB_ID => '<query_id_of_copy_command>'
));
```

This gives you details like:

* File name
* Line number
* Column parsing error
* Row content
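
To catch the same problems before any row lands in the table, `VALIDATION_MODE` can be used as a dry run. A minimal sketch against the same stage and table as above:

```sql
-- Dry run: parse the staged files and return the errors found,
-- without loading a single row into sales_table.
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV)
VALIDATION_MODE = RETURN_ERRORS;
```

Keep in mind that `VALIDATION_MODE` does not work with COPY statements that transform data with a `SELECT` in the `FROM` clause.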

---

### 🔹 Create a table of rejected rows

You can persist rejected records for debugging:

```sql
CREATE OR REPLACE TABLE rejected_sales AS
SELECT *
FROM TABLE(VALIDATE(
  sales_table,
  JOB_ID => '<query_id_of_copy_command>'
));
```

Now you can analyze why they failed: maybe a wrong delimiter, a missing column, or the wrong encoding.
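
Once the rejects are in a table, a quick aggregate usually shows whether you have one systematic problem or many unrelated ones. A sketch, assuming the `rejected_sales` table created above and the `FILE`, `COLUMN_NAME` and `ERROR` columns that `VALIDATE` returns (check the column names in your own output first):

```sql
-- Group rejected rows by file, column and error message
-- to see which failure is the most common.
SELECT file,
       column_name,
       error,
       COUNT(*) AS rejected_rows
FROM rejected_sales
GROUP BY file, column_name, error
ORDER BY rejected_rows DESC;
```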

---

## 4. Must-Know Questions to be Ready

Here are the questions I would grill you on if you joined my team:

1. What’s the difference between `FILES` and `PATTERN` in COPY command?
2. Why can’t you use negation with `FILES`? How do you handle exclusion instead?
3. When would you choose `ON_ERROR = CONTINUE` vs `ON_ERROR = ABORT_STATEMENT` in a pipeline?
4. How do you capture rejected rows after a COPY command?
5. Show me how you’d get the query ID of a COPY operation.
6. How would you load only “last 3 days of data” from a partitioned S3 folder using PATTERN?
7. If you want to load zipped files, how would you configure COPY command?
8. What happens if you run COPY command multiple times on the same file? How does Snowflake handle duplicates?

---

# 📌 Must-Know COPY Command Questions (with Detailed Answers)

---

### 1. **What’s the difference between `FILES` and `PATTERN` in COPY command?**

* **FILES**

  * Lets you load **specific files by name**.
  * Example:

    ```sql
    COPY INTO sales_table
    FROM @my_s3_stage/sales/
    FILES = ('sales_20250901.csv', 'sales_20250902.csv')
    FILE_FORMAT = (TYPE = CSV);
    ```
  * Works well if you know exact filenames.

* **PATTERN**

  * Uses **regex** to load files that match a naming pattern.
  * Example:

    ```sql
    COPY INTO sales_table
    FROM @my_s3_stage/sales/
    PATTERN = '.*2025090[1-5].*'
    FILE_FORMAT = (TYPE = CSV);
    ```
  * Works best for **partitioned datasets** (like date folders or daily dumps).

👉 **Key difference**:

* `FILES` = Explicit filenames (like picking items by hand).
* `PATTERN` = Regex-based bulk selection (like setting rules).

---

### 2. **Why can’t you use negation with `FILES`? How do you handle exclusion instead?**

* `FILES` only accepts **explicit file lists**. It doesn’t support `NOT` or `EXCEPT`.
* Example ❌ (invalid):

  ```sql
  FILES != ('sales_20250901.csv')
  ```

👉 **Solution**: Use `PATTERN`.
If you want everything **except Sept 1st**, you can do:

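```sql
COPY INTO sales_table
FROM @my_s3_stage/sales/
PATTERN = '^(?!.*20250901).*[.]csv'
FILE_FORMAT = (TYPE = CSV);
```

This regex means: “Load all `.csv` files whose path **does not contain 20250901**.” It relies on a negative lookahead, so confirm that it selects the files you expect before trusting it in a pipeline.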
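
If you would rather not depend on lookahead, the same exclusion can be written as a positive pattern that lists what to keep. A sketch, assuming the files follow the `sales_YYYYMMDD.csv` naming from the earlier example and that only September 2025 is in scope:

```sql
-- Keep Sept 2-30 and thereby leave out Sept 1:
--   0[2-9]      -> days 02..09
--   [12][0-9]   -> days 10..29
--   30          -> day 30
COPY INTO sales_table
FROM @my_s3_stage/sales/
PATTERN = '.*sales_202509(0[2-9]|[12][0-9]|30)[.]csv'
FILE_FORMAT = (TYPE = CSV);
```

It is more verbose, but it uses only basic regex features and makes the selected range explicit.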

---

### 3. **When would you choose `ON_ERROR = CONTINUE` vs `ON_ERROR = ABORT_STATEMENT` in a pipeline?**

* **`ON_ERROR = ABORT_STATEMENT`**

  * Pipeline **fails immediately** if even one bad record is found.
  * Use when:

    * Data must be **100% correct**.
    * Example: Bank transactions → you cannot afford to skip rows.

* **`ON_ERROR = CONTINUE`**

  * Pipeline **loads valid rows**, skips bad ones.
  * Use when:

    * You prefer data flow to continue even if some rows are bad.
    * Example: Web logs → one malformed row shouldn’t block millions of valid rows.

👉 Rule of thumb:

* **Financial/critical systems** → `ABORT_STATEMENT`
* **High-volume semi-structured data** → `CONTINUE`

---

### 4. **How do you capture rejected rows after a COPY command?**

After running COPY:

1. Get the **query ID**:

   ```sql
   SELECT LAST_QUERY_ID();
   ```

2. Check rejected rows:

   ```sql
   SELECT *
   FROM TABLE(VALIDATE(
     sales_table,
     JOB_ID => '<query_id>'
   ));
   ```

👉 This returns: file name, line number, error reason, and bad row content.
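
When the COPY was the most recent load in your session, you can skip copying the query ID around. A short sketch using the `'_last'` shortcut that `VALIDATE` accepts for `JOB_ID`, with the same table and stage as above:

```sql
-- Load with CONTINUE so bad rows are skipped rather than aborting the load.
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV)
ON_ERROR = 'CONTINUE';

-- '_last' refers to the last load job executed in the current session.
SELECT *
FROM TABLE(VALIDATE(
  sales_table,
  JOB_ID => '_last'
));
```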

---

### 5. **Show me how you’d get the query ID of a COPY operation.**

Simply run:

```sql
SELECT LAST_QUERY_ID();
```

This gives the ID of the **last query in your session**, so run it immediately after the COPY; any statement in between replaces the value.

👉 You can also fetch query history from:

```sql
SELECT *
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER(RESULT_LIMIT => 5));
```
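
If other statements have run since the COPY, filtering the history by query text narrows things down. A sketch, assuming the COPY ran in the current session and its text starts with `COPY INTO` (the limit of 100 is arbitrary):

```sql
-- Find recent COPY statements in this session, newest first.
SELECT query_id,
       query_text,
       start_time,
       execution_status
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION(RESULT_LIMIT => 100))
WHERE query_text ILIKE 'COPY INTO%'
ORDER BY start_time DESC;
```

The `query_id` from the matching row is what you pass as `JOB_ID` to `VALIDATE`.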

---

### 6. **How would you load only “last 3 days of data” from a partitioned S3 folder using PATTERN?**

Assume your S3 data is partitioned like this:

```
s3://company-data/logs/
 ├── date=20250910/file1.csv
 ├── date=20250911/file2.csv
 ├── date=20250912/file3.csv
 ├── date=20250913/file4.csv
```

👉 To load **Sept 11–13 only**:

```sql
COPY INTO logs_table
FROM @my_s3_stage/logs/
PATTERN = '.*date=202509(11|12|13)/.*[.]csv'
FILE_FORMAT = (TYPE = CSV);
```

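Here `(11|12|13)` is a regex alternation that matches day 11, 12, or 13. The dates are hard-coded, so “last 3 days” only holds on the day you write the statement.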
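
`PATTERN` takes a string literal, so a truly rolling window means building the statement dynamically. One possible sketch using a Snowflake Scripting block, assuming the `date=YYYYMMDD` folder layout above and that “last 3 days” means today plus the two days before it (the variable names are made up for illustration):

```sql
EXECUTE IMMEDIATE $$
DECLARE
  -- The three folder dates, formatted like the date=YYYYMMDD partitions.
  d1 STRING DEFAULT TO_CHAR(DATEADD(day, -2, CURRENT_DATE()), 'YYYYMMDD');
  d2 STRING DEFAULT TO_CHAR(DATEADD(day, -1, CURRENT_DATE()), 'YYYYMMDD');
  d3 STRING DEFAULT TO_CHAR(CURRENT_DATE(), 'YYYYMMDD');
  stmt STRING;
BEGIN
  -- Build the COPY statement with the dates spliced into the regex.
  stmt := 'COPY INTO logs_table FROM @my_s3_stage/logs/ ' ||
          'PATTERN = ''.*date=(' || d1 || '|' || d2 || '|' || d3 || ')/.*[.]csv'' ' ||
          'FILE_FORMAT = (TYPE = CSV)';
  EXECUTE IMMEDIATE :stmt;
  -- Return the generated statement so it can be inspected afterwards.
  RETURN stmt;
END;
$$;
```

Because COPY skips files it has already loaded, running a block like this daily loads only the newly arrived partition plus any late files from the previous two days.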

---

### 7. **If you want to load zipped files, how would you configure COPY command?**

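Snowflake can natively **decompress files** when loading. This covers compression formats such as gzip, not ZIP archives (`.zip`), which need to be unpacked or re-compressed before staging.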

Example for a GZIP file:

```sql
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);
```

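Supported compression types for CSV: `AUTO, GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE`

👉 `COMPRESSION = AUTO` is the default and detects the algorithm automatically, except for Brotli, which has to be named explicitly with `COMPRESSION = BROTLI`.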
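
When several pipelines load the same kind of file, a named file format keeps the compression and parsing options in one place. A sketch, where the format name `csv_gzip_format` and the header and quoting options are assumptions about the files rather than anything stated above:

```sql
-- Reusable description of gzip-compressed CSV files.
CREATE OR REPLACE FILE FORMAT csv_gzip_format
  TYPE = CSV
  COMPRESSION = GZIP
  SKIP_HEADER = 1                       -- assumes each file has one header row
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';   -- assumes fields may be double-quoted

-- Reference the format by name instead of repeating the options inline.
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (FORMAT_NAME = 'csv_gzip_format');
```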

---

### 8. **What happens if you run COPY command multiple times on the same file? How does Snowflake handle duplicates?**

By default, **Snowflake won’t load the same unchanged file twice** into the same table.

It tracks this using **load metadata** kept for each table, which records the name and checksum of every loaded file and expires after 64 days.

* If you try to load again:

  * Snowflake **skips already loaded files**.
  * Unless you use `FORCE = TRUE`.

Example:

```sql
COPY INTO sales_table
FROM @my_s3_stage/sales/
FILE_FORMAT = (TYPE = CSV)
FORCE = TRUE;
```

👉 `FORCE = TRUE` ignores history and reloads files (can cause duplicates).

So best practice:

* Keep `FORCE = FALSE` for production loads.
* Use staging tables if you need to reload for debugging.
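
The staging-table approach can be sketched like this. The table name `sales_table_debug` is hypothetical, and the example assumes you want to re-inspect a single file:

```sql
-- Empty copy of the target table's structure.
-- It has its own load metadata, so no file counts as "already loaded".
CREATE OR REPLACE TABLE sales_table_debug LIKE sales_table;

-- Reload just the file under investigation; FORCE is not needed here.
COPY INTO sales_table_debug
FROM @my_s3_stage/sales/
FILES = ('sales_20250903.csv')
FILE_FORMAT = (TYPE = CSV);

-- Compare against what production received from the same file.
SELECT COUNT(*) AS reloaded_rows
FROM sales_table_debug;
```

This keeps the production table free of duplicates while still letting you replay a load as often as you like.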

---

# ✅ Quick Recap

1. `FILES` = explicit list, `PATTERN` = regex
2. Negation ❌ with `FILES`, ✅ with `PATTERN`
3. `ON_ERROR = ABORT_STATEMENT` → strict data, `CONTINUE` → flexible pipelines
4. Rejected rows captured with `VALIDATE` and `LAST_QUERY_ID()`
5. Query ID via `LAST_QUERY_ID()` or `QUERY_HISTORY`
6. Load recent partitions via regex in `PATTERN`
7. Compressed files handled by `FILE_FORMAT (COMPRESSION = …)`
8. COPY avoids duplicates by default unless `FORCE = TRUE`

---
