
---

# 1. **Why do we even need history for COPY INTO?**

Think of this: You’re running a nightly pipeline that ingests sales files from an S3 bucket into your Snowflake `SALES_TRANSACTIONS` table. One day, your business analyst comes running to you:

> “Hey! Yesterday’s numbers look smaller than expected. Are you sure all files got loaded?”

This is where **COPY history views** become your superhero. They help you:

* Confirm if all files were loaded.
* Spot if some files were skipped or errored.
* Audit past loads (what was loaded, when, and how much).
* Troubleshoot issues like duplicates or partially loaded files.

So, they’re not just logs—they are **your forensic toolkit for data pipelines**.

---

# 2. **The LOAD\_HISTORY View**

This view lives in **`INFORMATION_SCHEMA`** and gives a lightweight summary of files loaded by past COPY INTO statements.
Key things about it:

* It **retains only 14 days** of history.
* It shows **one row per file** loaded into a table.
* It covers bulk COPY INTO loads only; files ingested by **Snowpipe do not appear** here.
* **Limit:** At most **10,000 rows** are returned.
* Perfect for **quick checks** when troubleshooting.

👉 Example query:

```sql
SELECT *
FROM INFORMATION_SCHEMA.LOAD_HISTORY
WHERE TABLE_NAME = 'SALES_TRANSACTIONS'
  AND LAST_LOAD_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP());
```

This would show you the past 7 days of file loads into the `SALES_TRANSACTIONS` table.

### Important Columns in `LOAD_HISTORY`

Let’s highlight the columns that actually matter in real-life troubleshooting:

* **FILE\_NAME** → The file’s name (super handy to track missing or duplicate files).
* **LAST\_LOAD\_TIME** → When the file was loaded.
* **ROW\_COUNT / ROW\_PARSED** → Rows loaded from the file vs. rows parsed.
* **STATUS** → `LOADED`, `LOAD FAILED`, or `PARTIALLY LOADED`.
* **FIRST\_ERROR\_MESSAGE / FIRST\_ERROR\_LINE\_NUMBER** → If something broke, you’ll see the first error and its line number here.
* **ERROR\_COUNT** → Number of rows that errored.
* **TABLE\_NAME & SCHEMA\_NAME** → To confirm the target location.

Think of LOAD\_HISTORY as your **flight departure board at the airport**: you see which flights (files) took off (loaded), which got delayed (errors), and when.
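
To put those columns to work, here is a minimal sketch of a query that surfaces only the problem files. The schema name `PUBLIC` is an assumption for illustration, and `STATUS` is compared case-insensitively because the exact casing of the status text is worth verifying in your own account.

```sql
-- Files that failed or only partially loaded in the last 7 days.
-- 'PUBLIC' is a placeholder schema name; adjust to your environment.
SELECT FILE_NAME,
       LAST_LOAD_TIME,
       STATUS,
       ROW_PARSED,               -- rows read from the file
       ROW_COUNT,                -- rows actually loaded
       ERROR_COUNT,
       FIRST_ERROR_MESSAGE,
       FIRST_ERROR_LINE_NUMBER
FROM INFORMATION_SCHEMA.LOAD_HISTORY
WHERE SCHEMA_NAME = 'PUBLIC'
  AND TABLE_NAME  = 'SALES_TRANSACTIONS'
  AND LAST_LOAD_TIME > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND UPPER(STATUS) IN ('LOAD FAILED', 'PARTIALLY LOADED')
ORDER BY LAST_LOAD_TIME DESC;
```

An empty result means every bulk-loaded file that LOAD\_HISTORY knows about in that window loaded cleanly.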

---

# 3. **The COPY\_HISTORY View**

Now here’s where things get **deeper**.
`COPY_HISTORY` is an `INFORMATION_SCHEMA` **table function**, called with named arguments: `COPY_HISTORY(TABLE_NAME => ..., START_TIME => ... [, END_TIME => ...])`.

It is **richer than LOAD\_HISTORY**: it covers both bulk COPY INTO loads and Snowpipe loads, adds stage and pipe metadata for each file, and is not subject to the 10,000-row limit.

👉 Example query:

```sql
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'SALES_TRANSACTIONS',
    START_TIME => DATEADD(day, -7, CURRENT_TIMESTAMP()),
    END_TIME   => CURRENT_TIMESTAMP()
));
```

### Important Columns in `COPY_HISTORY`

This is where you get that extra context:

* **FILE\_NAME** → Same as load history, file-level tracking.
* **LAST\_LOAD\_TIME** → Time of file load.
* **ROW\_COUNT / ROW\_PARSED / ERROR\_COUNT** → Rows loaded, rows parsed, and rows errored.
* **STATUS** → Loaded, load failed, or partially loaded.
* **FIRST\_ERROR\_MESSAGE / FIRST\_ERROR\_LINE\_NUMBER / FIRST\_ERROR\_CHARACTER\_POS / FIRST\_ERROR\_COLUMN\_NAME** → Pinpoint the first failure.
* **TABLE\_NAME** (plus **TABLE\_SCHEMA\_NAME / TABLE\_CATALOG\_NAME**) → Where the file landed.
* **PIPE\_NAME** → If you’re using Snowpipe, this tells you which pipe ingested it; it is NULL for bulk COPY INTO loads.
* **PIPE\_RECEIVED\_TIME** → When Snowpipe received the file.
* **STAGE\_LOCATION** → The stage the file was loaded from.
* **FILE\_SIZE** → Size of the file in bytes.

Note that COPY\_HISTORY does **not** store the text of the COPY INTO statement; to see the command itself, look it up in the query history.

So think of COPY\_HISTORY as your **black box recorder in an airplane**: it not only tells you that the plane flew but also how it flew, which pilot (pipe) flew it, and which gate (stage) it departed from.
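
As a quick illustration of the pipe metadata, the sketch below summarizes the last 7 days of loads by who did the loading. It relies on `PIPE_NAME` being NULL for bulk COPY INTO loads; the label text and column aliases are made up for this example.

```sql
-- Per-loader summary: Snowpipe pipes vs. bulk COPY INTO.
SELECT COALESCE(PIPE_NAME, 'bulk COPY INTO') AS LOADED_BY,
       UPPER(STATUS)                         AS LOAD_STATUS,
       COUNT(*)                              AS FILE_COUNT,
       SUM(ROW_COUNT)                        AS ROWS_LOADED,
       SUM(ERROR_COUNT)                      AS ROWS_ERRORED
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'SALES_TRANSACTIONS',
    START_TIME => DATEADD(day, -7, CURRENT_TIMESTAMP())
))
GROUP BY 1, 2
ORDER BY 1, 2;
```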

---

# 4. **LOAD\_HISTORY vs COPY\_HISTORY (Scenario-Based)**

Let’s imagine a **real case scenario**:

You work in an e-commerce company. Your data pipeline loads **daily order files** from S3 into Snowflake:

* Files: `orders_2025-09-10.csv`, `orders_2025-09-11.csv` …
* Target: `ORDERS` table.

### Case 1: Analyst says “Data for 11th is missing”

You run:

```sql
SELECT FILE_NAME, LAST_LOAD_TIME, STATUS
FROM INFORMATION_SCHEMA.LOAD_HISTORY
WHERE TABLE_NAME = 'ORDERS'
  AND LAST_LOAD_TIME > '2025-09-10';
```

And you see:

* `orders_2025-09-10.csv` → SUCCESS
* **`orders_2025-09-11.csv` → Not listed**

Boom! The file was never loaded. You quickly identify missing data.
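
Instead of eyeballing the list, you can let SQL find the gap. The sketch below hard-codes the expected file names (in a real pipeline they would come from a calendar or manifest table, which is an assumption here) and uses `ENDSWITH` so the match works whether or not `FILE_NAME` carries a stage path prefix.

```sql
-- Expected files that have no successful (or partial) load on record.
WITH expected AS (
    SELECT column1 AS expected_file
    FROM VALUES ('orders_2025-09-10.csv'),
                ('orders_2025-09-11.csv')
),
loaded AS (
    SELECT FILE_NAME
    FROM INFORMATION_SCHEMA.LOAD_HISTORY
    WHERE TABLE_NAME = 'ORDERS'
      AND LAST_LOAD_TIME >= '2025-09-10'
      AND UPPER(STATUS) IN ('LOADED', 'PARTIALLY LOADED')
)
SELECT e.expected_file AS missing_file
FROM expected e
LEFT JOIN loaded l
       ON ENDSWITH(l.FILE_NAME, e.expected_file)
WHERE l.FILE_NAME IS NULL;
```

Keep in mind that LOAD\_HISTORY only sees bulk COPY INTO loads, so a file ingested by Snowpipe would show up as "missing" here.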

### Case 2: Analyst says “Row count looks less than expected”

You check `COPY_HISTORY`:

```sql
SELECT FILE_NAME, ROW_PARSED, ROW_COUNT, ERROR_COUNT, FIRST_ERROR_MESSAGE
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'ORDERS',
    START_TIME => DATEADD(day, -2, CURRENT_TIMESTAMP()),
    END_TIME   => CURRENT_TIMESTAMP()
));
```

And you find:

* `orders_2025-09-11.csv` → Loaded 900 rows, 100 rows errored, and the first error message points to a file format mismatch.

Now you know **why** the numbers don’t match.
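
If you also want the exact COPY INTO command and the user who ran it, the load history itself will not give you that; as far as I know, the query history is the place to look. A minimal sketch using the `INFORMATION_SCHEMA.QUERY_HISTORY` table function follows; the `ILIKE` pattern is a rough text match and an assumption about how your COPY statements are written.

```sql
-- Recent COPY INTO statements that mention the ORDERS table.
SELECT QUERY_ID,
       USER_NAME,
       START_TIME,
       EXECUTION_STATUS,
       QUERY_TEXT
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(
    END_TIME_RANGE_START => DATEADD(day, -2, CURRENT_TIMESTAMP()),
    RESULT_LIMIT         => 10000
))
WHERE QUERY_TEXT ILIKE 'COPY INTO%ORDERS%'
ORDER BY START_TIME DESC;
```

The `QUERY_TEXT` column shows which file format and options were actually used, which is often enough to spot a wrong `FILE_FORMAT`.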

👉 **Moral of the story**:

* **LOAD\_HISTORY** = “Did the file arrive and load?”
* **COPY\_HISTORY** = “How exactly did the load happen, through which stage or pipe, and were there any issues inside the file?”

---

# 5. **Constraints & Retention**

* **Both LOAD\_HISTORY & COPY\_HISTORY (the `INFORMATION_SCHEMA` versions covered here) keep only 14 days of history.**
* LOAD\_HISTORY has a **row return limit of 10,000**. (So if you’re loading millions of small files daily, you’ll hit this limit fast.)
* COPY\_HISTORY is **not subject to that row limit**, also covers Snowpipe loads, and is **richer in metadata**.

In real projects, teams often **export history logs into separate audit tables** to preserve history beyond 14 days.
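
Before building your own audit layer, it is worth checking whether the account-level views cover your needs. To my knowledge, Snowflake also exposes `COPY_HISTORY` and `LOAD_HISTORY` as views in the shared `SNOWFLAKE.ACCOUNT_USAGE` schema, documented with a much longer retention (365 days) but with some latency and extra privilege requirements. Treat the sketch below as something to verify against the current documentation rather than a guarantee.

```sql
-- Load history for ORDERS going back 90 days (beyond the 14-day window).
-- Requires access to the SNOWFLAKE database; results can lag behind real time.
SELECT FILE_NAME,
       LAST_LOAD_TIME,
       STATUS,
       ROW_COUNT,
       ERROR_COUNT
FROM SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY
WHERE TABLE_NAME = 'ORDERS'
  AND LAST_LOAD_TIME >= DATEADD(day, -90, CURRENT_TIMESTAMP())
ORDER BY LAST_LOAD_TIME DESC;
```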

---

# 6. **Extra Must-Know Topics (Not in Your List but Critical)**

To make you truly **pipeline-ready**, you also need to know:

* **COPY Options** like `ON_ERROR`, `FORCE`, `VALIDATION_MODE` (because these settings affect what you’ll see in history).
* **Snowpipe’s role** with COPY\_HISTORY (Snowpipe loads show up in COPY\_HISTORY but not in LOAD\_HISTORY, so this is where you see what a pipe did).
* **Automation best practices**: Build monitoring dashboards (e.g., in Tableau/Looker/Streamlit) and alerts on top of LOAD\_HISTORY and COPY\_HISTORY so your team hears about problems proactively.
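
To make the COPY options concrete, here is a sketch of three variations of the same load. The stage `@ORDERS_STAGE` and the file format `ORDERS_CSV_FORMAT` are hypothetical names invented for this example.

```sql
-- 1) Dry run: report errors in the staged files without loading anything.
COPY INTO ORDERS
FROM @ORDERS_STAGE
FILE_FORMAT = (FORMAT_NAME = 'ORDERS_CSV_FORMAT')
VALIDATION_MODE = RETURN_ERRORS;

-- 2) Load good rows and skip bad ones.
--    Files with skipped rows end up as partially loaded, with ERROR_COUNT > 0.
COPY INTO ORDERS
FROM @ORDERS_STAGE
FILE_FORMAT = (FORMAT_NAME = 'ORDERS_CSV_FORMAT')
ON_ERROR = CONTINUE;

-- 3) Reload one file even though load metadata says it was already loaded.
--    Use with care: this can insert duplicate rows.
COPY INTO ORDERS
FROM @ORDERS_STAGE
FILES = ('orders_2025-09-11.csv')
FILE_FORMAT = (FORMAT_NAME = 'ORDERS_CSV_FORMAT')
FORCE = TRUE;
```

With the default `ON_ERROR = ABORT_STATEMENT` for bulk loads, a single bad row fails the whole statement, so the same file can look very different in history depending on which option was in effect.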

---

# 7. **Must-Answer Questions to Test Yourself**

(Think of these as checkpoints to see if you really understood.)

1. What’s the difference between LOAD\_HISTORY and COPY\_HISTORY?
2. Why is the retention limited to 14 days, and how do you keep longer history?
3. How would you troubleshoot if a file was missing from the target table?
4. If a file partially loads, which view will give you more insight and why?
5. What kind of pipeline alerting would you build using these views?

---

✅ So, if I sum it up:

* **LOAD\_HISTORY** = quick, high-level view of bulk loads (Was the file loaded? When? How many rows?).
* **COPY\_HISTORY** = deep dive (How was it loaded? Through which stage or pipe? Any errors inside?).
* Both are **short-term windows into the past**, so smart teams build audit layers on top.

---




## **1. What’s the difference between LOAD\_HISTORY and COPY\_HISTORY?**

👉 **LOAD\_HISTORY**

* A view that lives in `INFORMATION_SCHEMA`.
* One row = one file loaded.
* Covers bulk COPY INTO loads only (no Snowpipe) and returns at most 10,000 rows.
* Great for **quick checks**: Did this file load successfully?

👉 **COPY\_HISTORY**

* Accessed with the `COPY_HISTORY` table function, which needs a table name and a start time.
* Covers both bulk loads and Snowpipe, and adds detail such as stage location, file size, pipe name, and pipe received time on top of row counts and first-error information.
* Great for **deep troubleshooting**: Why did the load fail or partially succeed? Which pipe or stage did the file come through?

🔑 Think of it like:

* **LOAD\_HISTORY = flight departure board** (Did my flight leave?).
* **COPY\_HISTORY = black box recorder** (How did my flight actually fly, and what went wrong inside?).

---

## **2. Why is the retention limited to 14 days, and how do you keep longer history?**

* The Information Schema versions of both **keep 14 days** of history. Snowflake does not document the reason; a plausible one is keeping this metadata lightweight and fast, since storing years of history for every account would bloat it.
* The **10,000 row limit** in LOAD\_HISTORY is a similar safeguard against oversized result sets.

👉 **How to keep longer history?**

* Build a **custom audit logging process**:

  * Schedule a job (via Snowflake Task or Airflow/dbt) to query `COPY_HISTORY` every day.
  * Insert the results into a permanent **AUDIT\_COPY\_HISTORY** table in your database.
  * This way, you own the history and can keep months/years of data for compliance, SLA monitoring, or debugging.
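
The steps above can be sketched with a Snowflake Task and a `MERGE`. Everything named here (`MY_DB`, `PUBLIC`, `AUDIT_WH`, `CAPTURE_COPY_HISTORY`, and the exact column subset kept in the audit table) is a hypothetical choice for illustration, not a prescribed layout.

```sql
-- Permanent audit table holding a subset of COPY_HISTORY columns.
CREATE TABLE IF NOT EXISTS MY_DB.PUBLIC.AUDIT_COPY_HISTORY (
    FILE_NAME           STRING,
    STAGE_LOCATION      STRING,
    LAST_LOAD_TIME      TIMESTAMP_LTZ,
    ROW_COUNT           NUMBER,
    ROW_PARSED          NUMBER,
    ERROR_COUNT         NUMBER,
    STATUS              STRING,
    FIRST_ERROR_MESSAGE STRING,
    TABLE_NAME          STRING,
    PIPE_NAME           STRING,
    CAPTURED_AT         TIMESTAMP_LTZ
);

-- Daily task: copy the last 2 days of history, skipping rows already captured.
CREATE OR REPLACE TASK MY_DB.PUBLIC.CAPTURE_COPY_HISTORY
    WAREHOUSE = AUDIT_WH
    SCHEDULE  = 'USING CRON 0 6 * * * UTC'   -- every day at 06:00 UTC
AS
MERGE INTO MY_DB.PUBLIC.AUDIT_COPY_HISTORY a
USING (
    SELECT FILE_NAME, STAGE_LOCATION, LAST_LOAD_TIME, ROW_COUNT, ROW_PARSED,
           ERROR_COUNT, STATUS, FIRST_ERROR_MESSAGE, TABLE_NAME, PIPE_NAME
    FROM TABLE(MY_DB.INFORMATION_SCHEMA.COPY_HISTORY(
        TABLE_NAME => 'MY_DB.PUBLIC.ORDERS',
        START_TIME => DATEADD(day, -2, CURRENT_TIMESTAMP())
    ))
) h
ON  a.FILE_NAME      = h.FILE_NAME
AND a.LAST_LOAD_TIME = h.LAST_LOAD_TIME
AND a.TABLE_NAME     = h.TABLE_NAME
WHEN NOT MATCHED THEN INSERT
    (FILE_NAME, STAGE_LOCATION, LAST_LOAD_TIME, ROW_COUNT, ROW_PARSED,
     ERROR_COUNT, STATUS, FIRST_ERROR_MESSAGE, TABLE_NAME, PIPE_NAME, CAPTURED_AT)
VALUES
    (h.FILE_NAME, h.STAGE_LOCATION, h.LAST_LOAD_TIME, h.ROW_COUNT, h.ROW_PARSED,
     h.ERROR_COUNT, h.STATUS, h.FIRST_ERROR_MESSAGE, h.TABLE_NAME, h.PIPE_NAME,
     CURRENT_TIMESTAMP());

-- Tasks are created suspended, so resume it to start the schedule.
ALTER TASK MY_DB.PUBLIC.CAPTURE_COPY_HISTORY RESUME;
```

The 2-day lookback overlaps on purpose: if one run is missed, the next run still picks up the gap, and the `MERGE` keeps the overlap from creating duplicates.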

---

## **3. How would you troubleshoot if a file was missing from the target table?**

Scenario: An analyst says, *“The data for 2025-09-11 is missing from the ORDERS table.”*

Steps:

1. **Check LOAD\_HISTORY**:

   ```sql
   SELECT FILE_NAME, STATUS, LAST_LOAD_TIME
   FROM INFORMATION_SCHEMA.LOAD_HISTORY
   WHERE TABLE_NAME = 'ORDERS'
     AND LAST_LOAD_TIME > DATEADD(day, -3, CURRENT_TIMESTAMP());
   ```

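   * If the file is not listed → it was **never bulk-loaded** (maybe it never reached the stage, the COPY command’s path or pattern didn’t match it, or it came in through Snowpipe, which LOAD\_HISTORY doesn’t show).
   * If it’s listed but status = `LOAD FAILED` → the load was attempted but failed.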

2. **Check COPY\_HISTORY** for deeper info:

   ```sql
   SELECT FILE_NAME, ROW_COUNT, ERROR_COUNT, FIRST_ERROR_MESSAGE
   FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'ORDERS',
       START_TIME => DATEADD(day, -3, CURRENT_TIMESTAMP()),
       END_TIME   => CURRENT_TIMESTAMP()
   ));
   ```

   * If `ERROR_COUNT` > 0 → the file failed or was only partially loaded (check `STATUS`).
   * If `FIRST_ERROR_MESSAGE` mentions column counts or parsing problems → maybe a file format mismatch.

✅ This two-step approach (LOAD\_HISTORY for existence, COPY\_HISTORY for details) covers most file-missing cases.
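
When neither history source mentions the file at all, a natural follow-up is to check whether it ever reached the stage. A minimal sketch, assuming a hypothetical stage named `@ORDERS_STAGE`:

```sql
-- List staged files whose path matches the expected name (PATTERN is a regex).
LIST @ORDERS_STAGE PATTERN = '.*orders_2025-09-11[.]csv';
```

No rows back means the problem is upstream of Snowflake: the file was never delivered, or it landed under a different path or name.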

---

## **4. If a file partially loads, which view will give you more insight and why?**

👉 Answer: **COPY\_HISTORY**

* LOAD\_HISTORY gives you the file name, row count, status, and first error, but only for bulk COPY INTO loads and only within its 10,000-row cap.
* COPY\_HISTORY covers Snowpipe loads as well and puts everything you need in one place:

  * Exact **error message** (e.g., invalid UTF-8 encoding at line 102).
  * **Error position** (line, character, and column name).
  * **Rows parsed vs. rows loaded**, plus the **error count** (so you know how many rows failed).
  * **Stage location and pipe name** (so you know which path the file took).

That extra detail is critical for fixing the root cause of partial loads.
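
A sketch of how that looks in practice: the first query lists files with rejected rows, and the second uses the `VALIDATE` table function to return every error from a COPY INTO run, not just the first one. `'_last'` refers to the most recent COPY INTO in the current session; you can pass a specific query ID instead. As far as I know, `VALIDATE` does not work for COPY statements that transform data during the load.

```sql
-- Files with rejected rows in the last 3 days, with first-error details.
SELECT FILE_NAME,
       STATUS,
       ROW_PARSED,
       ROW_COUNT,
       ERROR_COUNT,
       FIRST_ERROR_MESSAGE,
       FIRST_ERROR_LINE_NUMBER,
       FIRST_ERROR_CHARACTER_POS,
       FIRST_ERROR_COLUMN_NAME
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'ORDERS',
    START_TIME => DATEADD(day, -3, CURRENT_TIMESTAMP())
))
WHERE ERROR_COUNT > 0
ORDER BY LAST_LOAD_TIME DESC;

-- All errors (not only the first) from the last COPY INTO run in this session.
SELECT *
FROM TABLE(VALIDATE(ORDERS, JOB_ID => '_last'));
```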

---

## **5. What kind of pipeline alerting would you build using these views?**

Great question, because this is where a **pro data engineer** shines.

You can build alerts like:

1. **Missing File Alert**

   * Query LOAD\_HISTORY to check if today’s expected file was loaded.
   * If missing by a certain cutoff time → send Slack/email alert.

2. **Row Count Mismatch Alert**

   * Use COPY\_HISTORY to compare file row counts vs. expected counts (based on file metadata or upstream system).
   * If rows are fewer or error count > 0 → raise an alert.

3. **Load Failure Alert**

   * Monitor COPY\_HISTORY for rows whose `STATUS` is `Load failed`.
   * Alert ops immediately with the error message.

4. **Snowpipe Monitoring**

   * Since COPY\_HISTORY also tracks Snowpipe, you can alert if a pipe hasn’t loaded any file in X hours.

👉 Typically, teams schedule these checks with **Snowflake Tasks**, or feed the results into **external monitoring tools** (e.g., CloudWatch, Prometheus, Datadog).
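
Two of these checks can be sketched as plain queries that return rows only when something is wrong, which makes them easy to wrap in a scheduled job. The pipe name `ORDERS_PIPE`, the 1-hour window, and the 6-hour inactivity threshold are all assumptions for illustration.

```sql
-- Load failure check: any failed or partially loaded file in the last hour.
SELECT FILE_NAME,
       LAST_LOAD_TIME,
       STATUS,
       ERROR_COUNT,
       FIRST_ERROR_MESSAGE
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'ORDERS',
    START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())
))
WHERE UPPER(STATUS) IN ('LOAD FAILED', 'PARTIALLY LOADED');

-- Snowpipe inactivity check: returns one row if the pipe has loaded nothing
-- in the last 6 hours (LAST_PIPE_LOAD is NULL if nothing loaded in 24 hours).
SELECT LAST_PIPE_LOAD
FROM (
    SELECT MAX(LAST_LOAD_TIME) AS LAST_PIPE_LOAD
    FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
        TABLE_NAME => 'ORDERS',
        START_TIME => DATEADD(day, -1, CURRENT_TIMESTAMP())
    ))
    WHERE PIPE_NAME = 'ORDERS_PIPE'
)
WHERE LAST_PIPE_LOAD IS NULL
   OR LAST_PIPE_LOAD < DATEADD(hour, -6, CURRENT_TIMESTAMP());
```

Filtering on the two explicit failure statuses, rather than "anything not loaded", avoids false alarms from files that are still in progress.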

---

✅ **Quick Recap of Answers:**

1. LOAD\_HISTORY = quick file-level view of bulk loads; COPY\_HISTORY = detailed view that also covers Snowpipe.
2. Retention = 14 days (lightweight system); keep longer history by persisting into audit tables.
3. Troubleshoot missing files: check LOAD\_HISTORY for existence, COPY\_HISTORY for details/errors.
4. Partial load? Use COPY\_HISTORY for error insights.
5. Pipeline alerting: missing files, row count mismatch, load failures, Snowpipe inactivity.

---
