
---

# 🌐 External Tables in Snowflake

---

## 1. Let’s Start with a Story 📖

Imagine you are a Data Engineer in a healthcare company. You have **patient medical records** stored as **Parquet/CSV files in AWS S3**.
Now your analysts want to run **SQL queries** on those files directly without waiting for you to load them into Snowflake internal tables every time.

At first, you think: “Okay, let me just create a stage pointing to S3 and then create a **view** on top of those staged files.”

👉 Problem: Every time someone queries the view:

* Snowflake has to **scan the raw files in S3** again and again.
* For large data, multiple joins, or frequent queries → **super slow and costly**.
* Worst of all, Snowflake **does not know the metadata of those files** (like which files are already processed, which are new).

💡 This is where **External Tables** come in like superheroes.
They act as a **bridge** between **S3 files** and **Snowflake queries**, keeping metadata cached inside Snowflake to optimize queries.

---

## 2. Why Do We Need External Tables?

Let’s break it down:

* ✅ **Metadata Awareness**:
  External tables maintain a metadata layer inside Snowflake that keeps track of:

  * Which files exist in S3
  * Which new files appeared
  * Which files were updated or removed

* ✅ **Avoid Full Scans**:
  Instead of re-reading all S3 files every time, Snowflake queries only the **relevant files** based on metadata.

* ✅ **Performance Boost**:
  With metadata + partition pruning, queries run much faster.

* ✅ **Cost Saving**:
  Since fewer files are scanned → less data processed → less credit usage.

So external tables give us a **database-like experience** on top of raw cloud storage.

---

## 3. Syntax to Create External Table from S3

Let’s build step by step 👇

### Step 1: Create a Stage (pointing to S3 bucket)

```sql
CREATE OR REPLACE STAGE my_s3_stage
  URL='s3://mybucket/raw-data/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = PARQUET);
```

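* `URL` → the S3 bucket and path prefix that hold your files.
* `STORAGE_INTEGRATION` → a Snowflake object that delegates authentication to an AWS IAM role, so no access keys are stored in the stage.
* `FILE_FORMAT` → the format of the staged files (Parquet, CSV, JSON, etc.).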
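
The stage above assumes that a storage integration named `my_s3_integration` already exists. A minimal sketch of how one might be created is shown here; the IAM role ARN is a placeholder (not a real value), and the role's trust policy still has to be configured on the AWS side.

```sql
-- Sketch only: replace the placeholder role ARN with your own IAM role.
CREATE STORAGE INTEGRATION my_s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://mybucket/raw-data/');

-- Lists the integration's properties, including the IAM user ARN and
-- external ID that go into the role's trust policy.
DESC INTEGRATION my_s3_integration;
```

Restricting `STORAGE_ALLOWED_LOCATIONS` to the prefix the stage actually uses keeps the integration from granting access to the rest of the bucket.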

---

### Step 2: Create External Table

```sql
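CREATE OR REPLACE EXTERNAL TABLE patient_records_ext
  (
    -- Parquet fields are addressed by name inside the VALUE variant
    patient_id STRING AS (value:patient_id::string),
    name STRING AS (value:name::string),
    age INT AS (value:age::int),
    diagnosis STRING AS (value:diagnosis::string)
  )
  WITH LOCATION = @my_s3_stage
  AUTO_REFRESH = TRUE
  PATTERN = '.*[.]parquet'
  FILE_FORMAT = (TYPE = PARQUET);
```

🔎 Breakdown:

* **Column definitions**: every external table exposes each row as a single `VARIANT` column named `VALUE`. `AS (value:patient_id::string)` extracts the `patient_id` field from it and casts it to a string. Parquet, JSON, Avro, and ORC fields are referenced by name; CSV columns are referenced by position (`value:c1`, `value:c2`, and so on).
* **WITH LOCATION**: tells Snowflake where data resides (`@stage`).
* **FILE\_FORMAT**: defines the file type (Parquet here).
* **AUTO\_REFRESH = TRUE**: Snowflake automatically tracks new files added in S3 (via event notifications).
* **PATTERN**: regex to include/exclude files (e.g., only `.parquet` files). `[.]` matches a literal dot; a bare `.` would match any character.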
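
To see what the column expressions are reading from, it can help to look at the raw `VALUE` column next to the file metadata. A small sketch, assuming the `patient_records_ext` table above has been created and refreshed:

```sql
SELECT
    value,                     -- the whole row as a VARIANT
    METADATA$FILENAME,         -- path of the file the row came from
    METADATA$FILE_ROW_NUMBER   -- row number within that file
FROM patient_records_ext
LIMIT 10;
```

If a virtual column comes back as NULL everywhere, comparing its expression with the field names visible in `VALUE` is usually the quickest check.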

---

### Step 3: Query the External Table

```sql
SELECT * FROM patient_records_ext WHERE age > 50;
```

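This looks like a normal table, but behind the scenes, Snowflake:

* Uses **metadata** first (to know which files are registered for the table).
* Reads those files from S3. A filter on a regular column such as `age` is applied while reading; whole files are skipped mainly when the filter targets a partition column (see Partitioning in section 6).
* Returns the result without you having to load anything into Snowflake first.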

---

## 4. Internal vs External Table

| Feature      | Internal Table                                         | External Table                                           |
| ------------ | ------------------------------------------------------ | -------------------------------------------------------- |
| Data Storage | Inside Snowflake                                       | Outside (S3, Azure, GCP)                                 |
| Performance  | Best (data is optimized in Snowflake micro-partitions) | Slower than internal, but faster than raw staged queries |
| Cost         | Higher (storage cost inside Snowflake)                 | Lower (storage in S3 is cheaper)                         |
| Metadata     | Built-in                                               | Maintained by external table object                      |
| Use Case     | Frequently queried, curated data                       | Semi-structured or raw data lake exploration             |

---

## 5. Real Case Scenario 🎯

Let’s say your company receives **daily patient files** in S3:

* `patients_2025-09-01.parquet`
* `patients_2025-09-02.parquet`
* `patients_2025-09-03.parquet`

If you use a **view on stage** → every query scans **all files**.

But with an **external table**, metadata records:

* On Sept 2 → new file detected `patients_2025-09-02.parquet`.
* On Sept 3 → only `patients_2025-09-03.parquet` added.

When analysts run a query:

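```sql
-- METADATA$FILENAME is the pseudocolumn that holds each row's source file path
SELECT COUNT(*)
FROM patient_records_ext
WHERE METADATA$FILENAME LIKE '%patients_2025-09-03.parquet';
```

Snowflake returns only the rows that came from the Sept 3 file. 🚀 To be sure the other files are skipped rather than read and filtered out, expose the file date as a partition column (see Partitioning in the next section).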
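
A quick way to confirm which daily files actually contribute rows is to group by the source file. This is only a sketch against the `patient_records_ext` table from this guide:

```sql
SELECT
    METADATA$FILENAME AS source_file,
    COUNT(*)          AS row_count
FROM patient_records_ext
GROUP BY METADATA$FILENAME
ORDER BY source_file;
```

A file that exists in S3 but is missing from this output has most likely not been registered yet, which points to a pending refresh.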

---

## 6. Key Features You Must Know

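* **Partitioning**: You can partition external tables by folder structure (e.g., `/year=2025/month=09/day=01/`) for better pruning. The partition columns must be declared explicitly as expressions over `METADATA$FILENAME`; Snowflake does not infer them from folder names.
* **AUTO\_REFRESH**: Needs cloud storage event notifications configured. Without this, you must run `ALTER EXTERNAL TABLE <table_name> REFRESH`.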
* **Limitations**:

  * External tables are **read-only** (no insert/update/delete).
  * Query performance is good, but not as optimized as internal tables.
* **Use Case Fit**: Best for **data lakes**, **semi-structured JSON/Parquet data**, **rarely queried raw data**.
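
When event notifications are not an option, one common workaround is to schedule the manual refresh. The sketch below assumes a warehouse called `my_wh` (a hypothetical name) and an arbitrary 60-minute interval; neither comes from the setup above.

```sql
-- Sketch: periodically re-sync the external table's file metadata.
CREATE OR REPLACE TASK refresh_patient_records_ext
  WAREHOUSE = my_wh            -- hypothetical warehouse name
  SCHEDULE = '60 MINUTE'       -- arbitrary interval; tune to how often files land
AS
  ALTER EXTERNAL TABLE patient_records_ext REFRESH;

-- Tasks are created suspended, so resume it to start the schedule.
ALTER TASK refresh_patient_records_ext RESUME;
```

The trade-off is freshness: new files become visible only after the next scheduled run.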

---

## 7. Must-Know Questions for Readiness 🤔

1. Why should we use external tables instead of just querying staged files?
2. What’s the difference between external and internal tables in Snowflake?
3. How does AUTO\_REFRESH work in external tables?
4. How does partition pruning work in external tables?
5. Can you insert/update/delete data in external tables?
6. How do you handle schema evolution (e.g., new column in Parquet file)?
7. What happens when a file in S3 is deleted? Will Snowflake external table reflect it?

---

✅ That’s the **end-to-end foundation of External Tables in Snowflake**.
Now you not only know the **syntax**, but also **why they exist, how they improve performance, and their real-world role in data engineering**.

---


## ❓ 1. Why should we use external tables instead of just querying staged files?

👉 If you just query files directly from a stage (e.g., with `SELECT $1 FROM @my_s3_stage` or a view on top of it):

* Snowflake scans **all files in the stage every single time**.
* No metadata → no idea which files are old, which are new.
* Performance goes down drastically as the number of files grows.

✅ External tables solve this problem:

* Maintain **metadata catalog** (list of files, partition info, changes).
* Allow Snowflake to only scan **relevant files**, boosting performance.
* Give you a “table-like” SQL experience on raw S3 files.

💡 Example:
If you query patient data for **Sept 2025**,

* A stage-based query scans 3 years of data.
* An external table partitioned by year and month only scans the `year=2025/month=09` folder (partition pruning).

---

## ❓ 2. What’s the difference between external and internal tables in Snowflake?

| Feature     | **Internal Table**                                        | **External Table**                                |
| ----------- | --------------------------------------------------------- | ------------------------------------------------- |
| Storage     | Data stored inside Snowflake (optimized micro-partitions) | Data stored outside (S3, GCS, Azure)              |
| Metadata    | Fully managed by Snowflake                                | Metadata managed by external table object         |
| Performance | High (compressed, clustered, partitioned)                 | Moderate (depends on S3 + partition pruning)      |
| Cost        | Snowflake storage cost                                    | Cloud storage cost (cheaper)                      |
| Flexibility | DML allowed (insert, update, delete)                      | Read-only (select only)                           |
| Use case    | Curated, frequently queried datasets                      | Raw/semi-structured data, rarely queried datasets |

---

## ❓ 3. How does AUTO\_REFRESH work in external tables?

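* `AUTO_REFRESH = TRUE` enables Snowflake to **automatically register new, changed, and removed files in S3**.
* Works only if **S3 event notifications** on your bucket are delivered to the SQS queue Snowflake assigns to the table (shown in the `notification_channel` column of `SHOW EXTERNAL TABLES`). The storage integration covers data access only, not notifications.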
* Without notifications → you must manually run:

  ```sql
  ALTER EXTERNAL TABLE my_ext_table REFRESH;
  ```
* Refresh updates Snowflake’s metadata catalog with **new, deleted, or updated files**.

💡 Example:
If you drop a new file `patients_2025-09-06.parquet` in S3:

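* With AUTO\_REFRESH → the external table registers it shortly after the S3 event notification arrives (expect a short delay rather than an instant update).
* Without AUTO\_REFRESH → you won’t see it until you run `ALTER EXTERNAL TABLE … REFRESH`.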
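
To wire up the notifications, you need the queue ARN that Snowflake assigned to the table. A sketch of how it can be looked up, and how the state of the auto-refresh channel can be checked, assuming the `patient_records_ext` table from earlier:

```sql
-- The notification_channel column holds the SQS queue ARN to use as the
-- destination of the bucket's event notifications.
SHOW EXTERNAL TABLES LIKE 'PATIENT_RECORDS_EXT';

SELECT "name", "notification_channel"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- Returns a JSON document describing the auto-refresh channel's current state.
SELECT SYSTEM$EXTERNAL_TABLE_PIPE_STATUS('patient_records_ext');
```

The `RESULT_SCAN` query must run right after `SHOW EXTERNAL TABLES` in the same session, because it reads the output of the last statement.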

---

## ❓ 4. How does partition pruning work in external tables?

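Partitioning = organizing files in folder structures that represent filters like `year`, `month`, `day`, and declaring matching partition columns on the external table (expressions over `METADATA$FILENAME` plus a `PARTITION BY` clause).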

Example folder in S3:

```
s3://mybucket/patient_data/year=2025/month=09/day=06/file1.parquet
```

When you query:

```sql
SELECT * FROM patient_records_ext WHERE year = 2025 AND month = 9;
```

👉 As long as `year` and `month` are defined as partition columns, Snowflake **only scans that folder’s files** instead of reading everything.
This drastically reduces cost and improves speed.

💡 Best practice: Always partition large external datasets by frequently filtered fields (date, region, etc.).
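
A minimal sketch of a partitioned definition for the folder layout above. It assumes a stage named `my_patient_stage` (hypothetical) that points at `s3://mybucket/patient_data/`, and it parses the `year=`, `month=`, and `day=` folder names out of `METADATA$FILENAME`:

```sql
CREATE OR REPLACE EXTERNAL TABLE patient_records_ext (
    -- Partition columns: take the text after 'year=' (etc.) up to the next '/'
    year  INT AS (SPLIT_PART(SPLIT_PART(METADATA$FILENAME, 'year=', 2), '/', 1)::INT),
    month INT AS (SPLIT_PART(SPLIT_PART(METADATA$FILENAME, 'month=', 2), '/', 1)::INT),
    day   INT AS (SPLIT_PART(SPLIT_PART(METADATA$FILENAME, 'day=', 2), '/', 1)::INT),
    -- Regular virtual columns read from the Parquet fields
    patient_id STRING AS (value:patient_id::string),
    age INT AS (value:age::int)
)
PARTITION BY (year, month, day)
LOCATION = @my_patient_stage   -- hypothetical stage for s3://mybucket/patient_data/
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = PARQUET);
```

Splitting on the `year=` style markers instead of on a fixed path position keeps the expressions valid if the stage prefix gets deeper or shallower later.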

---

## ❓ 5. Can you insert/update/delete data in external tables?

❌ No. External tables are **read-only**.

* You can only **query** them.
* To transform or modify → you must **copy the data into an internal table**.

Example:

```sql
CREATE OR REPLACE TABLE patient_records_int AS
SELECT patient_id, name, age, diagnosis  -- named columns only; leaves out the raw VALUE variant
FROM patient_records_ext
WHERE age > 50;
```

Now, you can update/delete inside the **internal table**, but not in the external one.

---

## ❓ 6. How do you handle schema evolution (e.g., new column in Parquet file)?

This is a **real-world challenge**.

Suppose your Parquet files initially have:
`patient_id, name, age`

Later new files include:
`patient_id, name, age, diagnosis`

Options:

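1. **Create columns with expressions** (extracted from the `VALUE` variant):

   ```sql
   CREATE OR REPLACE EXTERNAL TABLE patient_records_ext (
       patient_id STRING AS (value:patient_id::string),
       name STRING AS (value:name::string),
       age INT AS (value:age::int),
       diagnosis STRING AS (value:diagnosis::string)
   )
   LOCATION = @my_s3_stage
   FILE_FORMAT = (TYPE = PARQUET);
   ```

   → For old files, `diagnosis` will be NULL.

2. **Add the column or re-create the external table**: a new virtual column can be added with `ALTER EXTERNAL TABLE … ADD COLUMN`; re-create the table if the schema changes significantly.

💡 Best practice: Every external table already exposes the raw row as the `VALUE` column of type `VARIANT`, so fields that have no dedicated column yet can still be queried as `value:<field>`, and files that lack a field return NULL instead of breaking queries.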
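
As a sketch of the "alter instead of re-create" route, assume the table was originally created with only `patient_id`, `name`, and `age`, and the `diagnosis` field started appearing in newer files:

```sql
-- Add a virtual column for the new field; older files simply yield NULL.
ALTER EXTERNAL TABLE patient_records_ext
  ADD COLUMN diagnosis STRING AS (value:diagnosis::string);

-- Even without the new column, the field is reachable through VALUE.
SELECT value:diagnosis::string AS diagnosis
FROM patient_records_ext
LIMIT 10;
```

Because the column is just an expression over `VALUE`, adding it does not rewrite or reload any data.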

---

## ❓ 7. What happens when a file in S3 is deleted? Will Snowflake external table reflect it?

* If **AUTO\_REFRESH = TRUE** with notifications → metadata updates and deleted file disappears.
* If no auto refresh → deleted file will still appear in metadata until you run:

  ```sql
  ALTER EXTERNAL TABLE my_ext_table REFRESH;
  ```

⚠️ But: Snowflake does not delete the file itself (since storage is external).
It only updates its **metadata catalog**.

---

# 🎯 Quick Recap (Cheat Sheet)

* External tables = table-like objects pointing to S3/GCS/Azure files.
* Provide metadata → faster queries, partition pruning, cost savings.
* Read-only.
* AUTO\_REFRESH = automatic metadata sync (needs bucket notifications).
* Partitioning = must for large data.
* Schema evolution handled via VARIANT or altering table.

---



# 🔎 1. Syntax to Check External Table File Registration History

External tables maintain a **metadata catalog** inside Snowflake. This catalog tells you **which S3 files are registered, when they were registered, and their size and last-modified time**.

You can check this metadata using:

```sql
-- View files tracked by an external table.
-- TABLE_NAME is the only argument; qualify it with database and schema if needed.
SELECT *
FROM TABLE(INFORMATION_SCHEMA.EXTERNAL_TABLE_FILES(
    TABLE_NAME => 'MY_DB.RAW.PATIENT_RECORDS_EXT'
));
```

🔎 Columns you’ll typically see:

* `FILE_NAME` → path of the file in the stage.
* `REGISTERED_ON` → when the file was registered in the table’s metadata.
* `FILE_SIZE` → size of file in bytes.
* `LAST_MODIFIED` → when the file was last updated in storage.
* `ETAG` / `MD5` → checksums of the file.

Files that have been removed from the metadata simply stop appearing here. The per-file registration status (registered, unregistered, failed) comes from a separate function, `INFORMATION_SCHEMA.EXTERNAL_TABLE_FILE_REGISTRATION_HISTORY`.

👉 This tells you **exactly what Snowflake has registered** for that external table, instead of blindly scanning the bucket.

⚠️ Important: If `AUTO_REFRESH` is **disabled**, you’ll need to manually run:

```sql
ALTER EXTERNAL TABLE PATIENT_RECORDS_EXT REFRESH;
```

before the metadata table reflects new or removed files.
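
For the history side of the picture (which files were registered or unregistered, and whether that succeeded), a sketch using the registration history table function follows. The 7-day window is an arbitrary choice, and the database and schema names are the same assumed ones used above:

```sql
SELECT
    file_name,
    last_modified,
    operation_status,   -- e.g. REGISTERED_NEW, UNREGISTERED, REGISTER_FAILED
    message             -- details when an operation failed or was skipped
FROM TABLE(INFORMATION_SCHEMA.EXTERNAL_TABLE_FILE_REGISTRATION_HISTORY(
    TABLE_NAME => 'MY_DB.RAW.PATIENT_RECORDS_EXT',
    START_TIME => DATEADD('day', -7, CURRENT_TIMESTAMP())
))
ORDER BY last_modified DESC;
```

This is the place to look when a file sits in S3 but never shows up in query results.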

---

# 🔎 2. Comparing: View on Stage vs External Table

Let’s carefully break down what happens in each approach.

---

### ✅ (A) Creating a View to Query S3 Data Directly

Example:

```sql
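CREATE OR REPLACE VIEW patient_records_vw AS
SELECT
    $1:patient_id::string AS patient_id,  -- each staged Parquet row arrives as one variant, $1
    $1:age::int AS age
FROM @my_s3_stage (FILE_FORMAT => 'my_parquet_format');  -- must be an existing named file format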
```

👉 What happens when you query it:

* Snowflake goes to S3 **every single time** and reads all files.
* No metadata is stored in Snowflake about which files exist.
* Performance: Every query is like starting from scratch.
* Scalability: Bad — as files grow, query time explodes.

**What you get:**

* Simplicity (easy to set up).
* Always up-to-date with raw S3 bucket (no metadata sync required).

**What you miss:**

* No file-level metadata (can’t track new vs old files).
* No partition pruning (every file gets scanned).
* Costlier queries since all files are read each time.
* No visibility into file registration history.

---

### ✅ (B) Creating an External Table to Query S3 Data Directly

Example:

```sql
CREATE OR REPLACE EXTERNAL TABLE patient_records_ext
(
    patient_id STRING AS (value:patient_id::string),
    age INT AS (value:age::int)
)
WITH LOCATION=@my_s3_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = TRUE;
```

👉 What happens when you query it:

* Snowflake **first looks at metadata** (which files are available).
* Only scans relevant files.
* Metadata catalog (via `EXTERNAL_TABLE_FILES`) gives full visibility into registered files.

**What you get:**

* File-level metadata (list of files, size, last modified, registration time).
* Partition pruning (only scan relevant data folders).
* AUTO\_REFRESH keeps metadata in sync with S3 (if notifications enabled).
* Faster, cheaper queries for large datasets.

**What you miss:**

* External tables are **read-only** (no insert/update/delete).
* Metadata might lag behind if AUTO\_REFRESH is off (need manual refresh).
* Performance is better than views, but not as fast as Snowflake internal tables (since data still lives in S3).

---

# 📊 Comparison Table

| Feature           | **View on Stage**                   | **External Table**                             |
| ----------------- | ----------------------------------- | ---------------------------------------------- |
| Metadata on Files | ❌ None                              | ✅ Maintained (EXTERNAL\_TABLE\_FILES)          |
| Query Performance | ❌ Full scan of all files every time | ✅ Scans only relevant files using metadata     |
| Partition Pruning | ❌ Not possible                      | ✅ Supported (if partition columns are defined) |
| Cost              | ❌ High (all files read)             | ✅ Lower (less data scanned)                    |
| Auto Refresh      | Not applicable                      | ✅ Can auto-detect new files                    |
| DML Support       | N/A                                 | Read-only                                      |
| Simplicity        | ✅ Very simple                       | Slightly more setup (integration + definition) |
| Best Use Case     | Quick exploration, small datasets   | Production-ready queries, large datasets       |

---

# 🏥 Real-World Scenario Example

Let’s go back to our **healthcare company** example.

* Your analyst wants to query **3 years of patient files** in S3.
* With a **view on stage** → Snowflake reads all 3 years of files even if the query is just for the **last 1 month**. Such a query might take, say, 20 minutes.
* With an **external table** partitioned by `year/month/day`, querying last month’s data only scans those files. The same query might run in about 1 minute (illustrative numbers, not benchmarks).

Additionally:

* With a **view** → you can’t check which files are being read.
* With an **external table** → you can run:

  ```sql
  SELECT * FROM TABLE(INFORMATION_SCHEMA.EXTERNAL_TABLE_FILES(TABLE_NAME=>'PATIENT_RECORDS_EXT'));
  ```

  to see **exactly which files Snowflake knows about**.

---

✅ So to summarize:

* **View on stage = Simple but dumb** (always scans everything, no metadata).
* **External table = Smart layer** (keeps metadata, optimizes scans, tracks files, supports partition pruning).

---
