
---

## ⚡ Frequent Real-World Snowflake Staging Issues (with detailed solutions)

---

### 1. 🚫 **Table Stages are not reusable across multiple tables**

* **Issue:** Developers often assume that a *table stage* (`@%table_name`) is a general-purpose location. But in Snowflake, a **table stage is tied only to that specific table**. You cannot use it to load data into other tables.
* **Why it happens:** Table stages are designed for temporary, table-specific loads. They’re automatically created per table and are not shared resources.
* **Real case:** A dev loads a CSV into the `@%orders` stage and later tries to load the same file into `customers`. The `COPY INTO` fails because `@%orders` can only load into the `orders` table.
* **Solution:**

  * If you want to use data across multiple tables → use a **named stage** (`CREATE STAGE my_stage`), either internal or external (S3, Azure Blob, GCS).
  * Rule of thumb: *Table stages = quick one-off loads. Named stages = reusable pipelines.*
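
The named-stage approach can be sketched as follows. This is a minimal example, not a definitive setup: `my_stage`, `my_csv_format`, and the local path `/tmp/shared.csv` are hypothetical names chosen for illustration.

```sql
-- Hypothetical names throughout; adjust to your own schema.
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';

-- A named internal stage is not tied to any single table.
CREATE OR REPLACE STAGE my_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

-- PUT is typically run from a client such as SnowSQL.
-- By default the file is gzip-compressed and stored as shared.csv.gz.
PUT file:///tmp/shared.csv @my_stage;

-- The same staged file can now feed more than one table.
COPY INTO orders    FROM @my_stage FILES = ('shared.csv.gz');
COPY INTO customers FROM @my_stage FILES = ('shared.csv.gz');
```

In practice the two tables rarely share a layout, so each `COPY INTO` would usually select its own columns from the file.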

---

### 2. 🎯 **Loading specific columns from staging files**

* **Issue:** CSV files often contain more columns than needed. Developers mistakenly think they must match table schema exactly or load all columns.
* **Why it happens:** `COPY INTO` works by positional mapping of file columns → table columns. If you don’t align them correctly, data mismatches or load failures happen.
* **Solution:**

  * Use **file formats** (`CREATE FILE FORMAT`) with options like `SKIP_HEADER`, `FIELD_OPTIONALLY_ENCLOSED_BY` to clean up raw files.
  * In `COPY INTO`, you can **select only the columns you need**:

    ```sql
    -- Load file columns 1, 3 and 5; the other file columns are ignored
    COPY INTO my_table(col1, col2, col3)
    FROM (SELECT t.$1, t.$3, t.$5
          FROM @my_stage/file.csv t)
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
    ```
  * This way you can ignore unnecessary columns and load only the required ones.

---

### 3. 📝 **Tracking file names & rows for auditing**

* **Issue:** In regulated environments, devs need to track *which file a record came from* and sometimes even the row number.
* **Why it happens:** By default, Snowflake just loads the data without adding file lineage unless explicitly told.
* **Solution:** Use Snowflake’s **metadata columns**:

  * `METADATA$FILENAME` → gives you the source file name, including its path within the stage
  * `METADATA$FILE_ROW_NUMBER` → gives you the line number

  ```sql
  COPY INTO my_table(col1, col2, file_name, row_number)
  FROM (
      SELECT t.$1, t.$2, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
      FROM @my_stage/file.csv t
  )
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
  ```

  ✅ Best practice: Always store these in audit tables for traceability.

---

### 4. 🔍 **Debugging mismatched values or column counts**

* **Issue:** Sometimes devs notice wrong/missing values after loading — often because **number of table columns ≠ number of file columns**.
* **Why it happens:** COPY INTO does a positional mapping — `$1 → col1`, `$2 → col2`. If the file has more/fewer columns, you’ll either get errors or garbage data in wrong columns.
* **Solution:**

  * ✅ Before loading, **query staged files directly**:

    ```sql
    SELECT $1, $2, $3
    FROM @my_stage/file.csv (FILE_FORMAT => 'my_csv_format');
    ```

    This avoids having to download and check manually.
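  * Use `VALIDATION_MODE = RETURN_ERRORS` in COPY INTO to simulate the load and see mismatches before actually loading. Note that `VALIDATION_MODE` is not supported for COPY statements that transform data with `FROM (SELECT ...)`.
  * If the file schema evolves frequently, store data in a **raw landing table** (a single `VARIANT` column for JSON, or wide `VARCHAR` columns for CSV), then transform with SQL into structured tables.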
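
A minimal sketch of the dry-run check described above, assuming the hypothetical names `my_table`, `my_stage`, and `my_csv_format` used elsewhere in this answer:

```sql
-- Dry run: reports parse errors without loading any rows.
-- Uses a plain stage reference, because VALIDATION_MODE cannot be
-- combined with a transforming FROM (SELECT ...) clause.
COPY INTO my_table
FROM @my_stage/file.csv
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
VALIDATION_MODE = RETURN_ERRORS;

-- Alternative dry run: return the first 10 rows as they would be loaded.
COPY INTO my_table
FROM @my_stage/file.csv
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
VALIDATION_MODE = RETURN_10_ROWS;

-- After a real load, list the rows rejected by the most recent COPY.
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));
```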

---

### 5. ✂️ **Using SPLIT\_PART during file loading**

* **Issue:** Developers sometimes wonder: *Why use `SPLIT_PART` if the file already has delimiters?*
* **Why it happens:** Sometimes raw files are not cleanly delimited (e.g., entire row is dumped as one string with `|` inside). Or files have extra JSON/XML data inside a column.
* **Solution:** Use `SPLIT_PART` in a staging query:

  ```sql
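  COPY INTO my_table(col1, col2)
  FROM (
    SELECT SPLIT_PART($1, '|', 1),
           SPLIT_PART($1, '|', 2)
    FROM @my_stage/file.csv
  )
  -- Read each line as a single field so that $1 holds the whole row
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = NONE);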
  ```

  ✅ This helps when files don’t follow consistent formatting rules.

---

### 6. 🗂️ **Recompression confusion**

* **Issue:** Developers wonder: *What happens if I re-compress a file (gzip again) and re-upload?*
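* **Reality:** Snowflake auto-detects file compression (gzip, bz2, etc.). If you re-compress and upload as a new file:

  * Snowflake can still read it fine.
  * But if the **file name changes**, Snowflake sees it as a new file → risk of duplicates.
  * If the **file name is the same** and you `PUT` with `OVERWRITE = TRUE`, the staged file is replaced. Re-compressing usually changes the file's checksum, so `COPY INTO` treats it as a modified file and loads it again → duplicates are still possible.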

---

### 7. 🔄 **Re-uploading same files → duplicates**

* **Your Question:** *How can duplicates happen if `OVERWRITE=FALSE` makes Snowflake ignore same-named files?*
* **Explanation:**

  * Snowflake's load metadata identifies a file by its **name (path) and checksum** and is kept for 64 days. It never compares the *content* of differently named files.
  * If you re-upload the same file but with a **different name** (e.g., `orders.csv` → `orders_v2.csv`), Snowflake treats it as new → loads again → duplicates.
  * If pipeline renames files automatically (common in ETL jobs, like `file_20250101.csv`, `file_20250101_copy.csv`) → duplicates slip in.
* **Solution:**

  * Use **deduplication strategy**:

    * Add a unique file tracking table (insert filename + checksum after load).
    * On subsequent loads, check if file already processed.
  * Or apply deduplication logic at query level (`ROW_NUMBER()` partitioned by business keys, keep only latest). Note that `ON_ERROR = SKIP_FILE` only skips files that contain errors; it does not prevent duplicates.
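
Both strategies can be sketched in SQL. The table and column names here (`load_log`, `file_md5`, `orders_raw`, `order_id`, `loaded_at`) are hypothetical, and the `md5` value reported by `LIST` is only a rough content fingerprint (for some external stages it is not a true MD5).

```sql
-- 1) Spot staged files whose checksum was already logged under another name.
LIST @my_stage;

SELECT l."name", l."md5"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())) AS l   -- output of the LIST above
JOIN load_log AS g
  ON g.file_md5 = l."md5";

-- 2) Query-level deduplication: keep one row per business key,
--    preferring the most recently loaded version.
CREATE OR REPLACE TABLE orders_dedup AS
SELECT *
FROM orders_raw
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY order_id
          ORDER BY loaded_at DESC
        ) = 1;
```

The `RESULT_SCAN` query has to run immediately after `LIST` in the same session, since it reads the result of the last query.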

---

### 8. ⚠️ **File format mismatch issues**

* **Issue:** Common when delimiter/encoding/quotes are different than expected → load either fails or scrambles data.
* **Example:** CSV has `;` delimiter but file format defined as `,`. → Snowflake loads entire row as one column.
* **Solution:**

  * Always test files by querying stage before loading.
  * Maintain **centralized reusable file formats** (not per dev).
  * Keep strict contracts with data providers.

---

### 9. 🕐 **Slow loads due to large numbers of small files**

* **Issue:** Many small files → Snowflake spends time on per-file overhead (listing, opening, tracking load metadata), not actual data loading.
* **Solution:**

  * Bundle small files before staging (roughly 100–250MB compressed per file is the usual recommendation).
  * Use external stages with Snowpipe auto-ingestion.

---

### 10. 🧩 **Semi-structured data confusion**

* **Issue:** JSON/Parquet files staged, but devs try to load directly into relational tables and fail.
* **Solution:**

  * Load into a **VARIANT column** first.
  * Then use Snowflake's semi-structured features (`:key` path notation, `::` casts, `LATERAL FLATTEN`) to transform into structured tables.
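
A minimal sketch of the two-step pattern, assuming a hypothetical file `events.json` whose objects carry `id`, `customer.name`, and an `items` array:

```sql
-- Step 1: land each JSON document in a single VARIANT column.
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

COPY INTO raw_events
FROM @my_stage/events.json
FILE_FORMAT = (TYPE = JSON);

-- Step 2: project typed columns; FLATTEN yields one row per array element.
SELECT
    payload:id::NUMBER            AS event_id,
    payload:customer.name::STRING AS customer_name,
    item.value:sku::STRING        AS sku
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) AS item;
```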

---

✅ **Summary**:

* Table stages are isolated.
* Column mismatches → query staged files first.
* Use metadata columns for traceability.
* SPLIT\_PART helps with messy delimiters.
* Re-compressed files are still readable, but a changed name or checksum makes Snowflake treat the file as new → duplicates.
* Deduplication strategy is a must in production.

---

### **1. ❌ File not visible in stage after PUT command**

**Problem:**
You uploaded a file using `PUT` but when you query the stage (`LIST @mystage;`), the file doesn’t show up.
**Why it happens:**

* Wrong path used in the stage.
* File uploaded to the wrong stage (user stage instead of table stage).
* Local file path was incorrect in the `PUT`.

**Solution:**

* Always `LIST` immediately after `PUT` to confirm file presence.
* Use fully qualified stage paths (`@~`, `@%table`, `@schema.stage`).
* If using automation, log the `PUT` output: it reports the target file name (gzip-compressed by default, so `data.csv` is stored as `data.csv.gz`) and a status such as `UPLOADED` or `SKIPPED`.
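
A quick way to check each candidate location, sketched with the hypothetical names `mystage` and `my_table` and a placeholder local path:

```sql
-- Upload from a client such as SnowSQL; the result row shows source,
-- target (data.csv.gz with default compression) and status.
PUT file:///tmp/data.csv @mystage AUTO_COMPRESS = TRUE OVERWRITE = FALSE;

-- Then look in every stage the file could have landed in.
LIST @mystage;      -- named stage
LIST @~;            -- current user's stage
LIST @%my_table;    -- table stage of my_table
```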

---

### **2. 🕒 Load history confusion: why is the old data still there?**

**Problem:**
You re-uploaded a file with the same name and set `OVERWRITE=TRUE`, but somehow the old data is still loaded.

**Why it happens:**

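* Snowflake keeps load metadata for each table (for 64 days) to prevent accidental re-ingestion.
* The **COPY INTO** command checks file metadata (name, size, checksum). If the file is unchanged, it is skipped. If the content changed, it is loaded again, but the rows from the first load stay in the table because `COPY INTO` only appends.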

**Solution:**

* Use `FORCE=TRUE` in `COPY INTO` to reload regardless of history; delete the previously loaded rows first, otherwise they are duplicated.
* Or version your files (e.g., `data_20250824_v1.csv`) to avoid confusion.
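
A hedged sketch of inspecting the load history and forcing a reload. It assumes a hypothetical audit column `file_name` on `my_table` (populated from `METADATA$FILENAME`) so the old rows can be removed first.

```sql
-- What has COPY recorded for this table in the last 24 hours?
SELECT file_name, last_load_time, row_count, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'MY_TABLE',
       START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));

-- Remove the rows that came from the old version of the file...
DELETE FROM my_table WHERE file_name LIKE '%data.csv.gz';

-- ...then reload it even though the load metadata says it was already loaded.
COPY INTO my_table(col1, col2, file_name)
FROM (SELECT t.$1, t.$2, METADATA$FILENAME FROM @my_stage/data.csv.gz t)
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
FORCE = TRUE;
```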

---

### **3. 📂 Folder structures in external stages (S3, Azure, GCS)**

**Problem:**
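External stage has nested folders (`/2025/08/24/data.csv`). Developer runs `COPY INTO` and no data is loaded.

**Why it happens:**

* `COPY INTO` has no `RECURSIVE` option. The path after the stage name is treated as a prefix, and every file under that prefix is considered, nested folders included.
* When nothing loads, the prefix or the `PATTERN` regex usually doesn't match the actual file paths, or the files were already loaded earlier.

**Solution:**

* Point the path at the folder you want: `COPY INTO ... FROM @ext_stage/2025/08/24/`
* Or filter with a regex on the file path: `COPY INTO ... PATTERN='.*2025/08/24/.*'`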
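
Two ways of targeting nested folders, sketched with the hypothetical stage `ext_stage` and file format `my_csv_format`. In both cases `LIST` with the same path or pattern is a cheap way to see which files would be picked up.

```sql
-- Preview which files match before loading anything.
LIST @ext_stage/2025/08/24/;

-- Option 1: a path prefix; everything below it is considered.
COPY INTO my_table
FROM @ext_stage/2025/08/24/
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

-- Option 2: a regular expression over the file paths,
-- here every .csv file for August 2025.
COPY INTO my_table
FROM @ext_stage
PATTERN = '.*2025/08/.*[.]csv'
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
```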

---

### **4. 🏷 Wrong file format used during COPY**

**Problem:**
File loads with garbage values, NULLs, or unexpected column splits.

**Why it happens:**

* Used wrong delimiter (`,` vs `|`)
* The CSV has a header row but the file format doesn’t set `SKIP_HEADER=1`.
* Compression or file type mismatch (wrong `COMPRESSION` setting, or a Parquet/JSON file read with a CSV format).

**Solution:**

* Always define `FILE_FORMAT` explicitly (never rely on defaults).
* Test with a small sample file before large loads.
* Use `VALIDATION_MODE=RETURN_ERRORS` to preview issues without loading data.

---

### **5. 🧹 “Phantom files” causing reprocessing**

**Problem:**
Even after processing, files keep showing up in the stage or reload unexpectedly.

**Why it happens:**

* `COPY INTO` doesn’t delete files from stage.
* Files must be manually removed (internal stage) or lifecycle-managed (external stage).

**Solution:**

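* For internal stages: run `REMOVE @mystage/file.csv;` after load, or add `PURGE=TRUE` to `COPY INTO` so successfully loaded files are deleted automatically.
* For external stages: configure storage lifecycle rules (S3 lifecycle configuration, GCS object lifecycle management).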
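
Both cleanup options for an internal stage can be sketched like this, using the hypothetical names `mystage`, `my_table`, and `my_csv_format`:

```sql
-- Option 1: let COPY delete each file once it has loaded successfully.
COPY INTO my_table
FROM @mystage
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
PURGE = TRUE;

-- Option 2: remove files explicitly, one at a time or by pattern.
REMOVE @mystage/file.csv.gz;
REMOVE @mystage PATTERN = '.*[.]csv[.]gz';

-- Confirm the stage is empty.
LIST @mystage;
```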

---

### **6. 🔑 Permissions issue on external stage**

**Problem:**
`COPY INTO` fails with *“Access denied”* when trying to read from S3/Azure/GCS.

**Why it happens:**

* Wrong IAM role or service principal permissions.
* Snowflake storage integration not correctly configured.

**Solution:**

* Use **storage integration objects** instead of embedding keys.
* Verify that the bucket policy or IAM trust relationship allows the identity Snowflake uses for the integration (shown by `DESC INTEGRATION`).
* Test by running `LIST @ext_stage;` to check access.
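
For S3, the storage-integration setup looks roughly like the sketch below. The integration name, role ARN, account ID, and bucket are placeholders, not real values, and the AWS-side trust policy still has to be configured separately.

```sql
-- Placeholder ARN and bucket; creating integrations needs elevated privileges.
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/landing/');

-- Shows STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID, which go
-- into the trust policy of the IAM role on the AWS side.
DESC INTEGRATION s3_int;

CREATE STAGE ext_stage
  URL = 's3://my-bucket/landing/'
  STORAGE_INTEGRATION = s3_int;

-- Smoke test: fails with an access error if permissions are wrong.
LIST @ext_stage;
```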

---

### **7. 🧩 Schema evolution issue (new columns in file)**

**Problem:**
Your files now have 12 columns, but the table has 10. The load fails.

**Why it happens:**

* By default (`ERROR_ON_COLUMN_COUNT_MISMATCH=TRUE`), Snowflake expects the number of CSV file columns to match the table columns, mapped by position.

**Solution:**

* Create a staging table with flexible schema (`VARIANT` column for JSON, or wide VARCHARs for CSV).
* Load raw → then transform into target table.
* If CSV, use `FILE_FORMAT = (TYPE = CSV ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE)` to load partial columns: extra file columns are ignored and missing ones load as NULL.
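
A minimal sketch of the relaxed CSV load, assuming the hypothetical 10-column table `my_table` and a staged 12-column file:

```sql
-- With the mismatch check off, the first 10 fields of each record are
-- loaded by position and the 2 extra fields are dropped silently.
COPY INTO my_table
FROM @my_stage/file.csv
FILE_FORMAT = (
  TYPE = CSV
  SKIP_HEADER = 1
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
);
```

This only helps when new columns are appended at the end of the file. If a column is inserted in the middle, positions shift and values land in the wrong table columns.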

---

### **8. ⏳ Large file splitting issues**

**Problem:**
Loading a single huge 200GB file takes forever.

**Why it happens:**

* Snowflake loads files in parallel, so it performs best with **many moderately sized files** (ideal: 100–250MB compressed) rather than one huge file.

**Solution:**

* Pre-split large files before loading.
* For external stages (S3), use tools like AWS Glue or Spark to split.
* For internal stages, don’t count on `COPY INTO` to parallelize a single huge file; split it before `PUT` (for example with the Linux `split` utility).

---

### **9. 🧮 Duplicate rows after re-uploads (your last question)**

**Problem:**
Even with `OVERWRITE=FALSE`, duplicate rows appear. Why?

**Why it happens:**

* `OVERWRITE=FALSE` prevents *stage file overwrite*, not duplicate ingestion.
* If you re-upload with a **different filename** (`data.csv` vs `data_v2.csv`), Snowflake sees it as new.
* `COPY INTO` ingests both → duplicates.

**Solution:**

* Maintain a **file tracking table** that records which files have been loaded (for example, store `METADATA$FILENAME` with each row during the load, then log the distinct file names).
* Run `COPY INTO ... FILES=('new_file.csv')` only for *new files*.
* Use deduplication strategy in target (e.g., MERGE with unique keys).
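
The `MERGE` approach can be sketched as follows. The tables `orders` and `orders_staging` and the columns `order_id`, `amount`, and `updated_at` are hypothetical.

```sql
-- Upsert from staging: a row loaded twice updates instead of duplicating.
MERGE INTO orders AS tgt
USING (
    -- Collapse duplicates inside the staging batch first, otherwise
    -- MERGE can hit more than one source row per target row.
    SELECT *
    FROM orders_staging
    QUALIFY ROW_NUMBER() OVER (
              PARTITION BY order_id
              ORDER BY updated_at DESC
            ) = 1
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    tgt.amount     = src.amount,
    tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, updated_at)
    VALUES (src.order_id, src.amount, src.updated_at);
```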

---

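### **10. ⚡ “Half-loaded” files after a failed or interrupted COPY**

**Problem:**
After a `COPY INTO` that hit errors or was stopped, only part of a file’s rows exist in the table.

**Why it happens:**

* A single `COPY INTO` statement is atomic: if it is canceled or aborts, the rows it was loading are rolled back.
* Partial files usually come from `ON_ERROR='CONTINUE'`, which skips the bad rows and loads the rest, or from a multi-file load where some files were loaded by an earlier run.

**Solution:**

* Use `COPY INTO ... ON_ERROR='CONTINUE'` only when safe; the default for bulk loads, `ABORT_STATEMENT`, loads nothing if any row fails.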
* Better: load into a staging table, then run atomic `INSERT INTO target SELECT ...` to ensure consistency.
* If issue occurs, truncate staging and reload.
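
The staging-table pattern can be sketched like this, assuming hypothetical tables `orders_staging` and `orders` with identical columns and a hypothetical `order_date` column that identifies a batch:

```sql
-- Land the files; the default ON_ERROR rolls back the whole load on any error.
COPY INTO orders_staging
FROM @my_stage/orders/
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
ON_ERROR = ABORT_STATEMENT;

-- Publish in one transaction so readers see the whole batch or none of it.
BEGIN;
DELETE FROM orders
 WHERE order_date IN (SELECT DISTINCT order_date FROM orders_staging);
INSERT INTO orders SELECT * FROM orders_staging;
COMMIT;

-- Reset staging for the next run. TRUNCATE also clears the table's load
-- metadata, so remove or purge the staged files to avoid reloading them.
TRUNCATE TABLE orders_staging;
```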

---
