
---

# **30 Data Transformation Examples (Stage → Table)**

We assume:

* Stage: `@mystage` (internal or external)
* File formats: named file formats `my_csv_format` (CSV) and `my_json_format` (JSON) already exist
* Target tables exist (`customer_clean`, `sales_clean`, etc.)
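
For readers who want to reproduce these assumptions, the setup might look roughly like the sketch below. The option values (`SKIP_HEADER`, `FIELD_OPTIONALLY_ENCLOSED_BY`, `STRIP_OUTER_ARRAY`) are guesses about the files, not something the examples require, so adjust them to your data.

```sql
-- Hypothetical setup: an internal named stage plus the two named file formats.
CREATE STAGE IF NOT EXISTS mystage;

CREATE FILE FORMAT IF NOT EXISTS my_csv_format
    TYPE = CSV
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1                        -- assumes each file has one header row
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    EMPTY_FIELD_AS_NULL = TRUE;            -- empty fields arrive as NULL

CREATE FILE FORMAT IF NOT EXISTS my_json_format
    TYPE = JSON
    STRIP_OUTER_ARRAY = TRUE;              -- assumes each file is one top-level array
```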

---

## **1. Trim Spaces**

```sql
INSERT INTO customer_clean (customer_id, customer_name)
SELECT 
    $1::INT,
    TRIM($2)
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **2. Convert to Upper/Lower Case**

```sql
INSERT INTO customer_clean (customer_id, email)
SELECT 
    $1::INT,
    LOWER($2)
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **3. Substring Extraction**

```sql
INSERT INTO orders_clean (short_order_id, order_id)
SELECT 
    SUBSTR($1, 1, 5),
    $1::INT
FROM @mystage/orders.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **4. Replace Special Characters**

```sql
INSERT INTO customer_clean (customer_id, phone)
SELECT 
    $1::INT,
    REGEXP_REPLACE($2, '[^0-9]', '')
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **5. Concatenate Columns**

```sql
INSERT INTO customer_clean (customer_id, full_name)
SELECT 
    $1::INT,
    CONCAT($2, ' ', $3)
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **6. Rounding Numbers**

```sql
INSERT INTO sales_clean (order_id, amount)
SELECT 
    $1::INT,
    -- Plain NUMBER means NUMBER(38, 0), which would drop the decimals before ROUND runs
    ROUND($2::NUMBER(18, 4), 2)
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **7. Convert String → Number**

```sql
INSERT INTO sales_clean (order_id, amount)
SELECT 
    $1::INT,
    -- Returns NULL instead of failing on non-numeric text; scale 2 keeps the cents
    TRY_TO_NUMBER($2, 18, 2) AS amount
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **8. Handling Nulls with Default Values**

```sql
INSERT INTO sales_clean (order_id, discount)
SELECT 
    $1::INT,
    COALESCE($2::NUMBER(18, 2), 0)
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **9. Capping Values**

```sql
INSERT INTO sales_clean (order_id, amount)
SELECT 
    $1::INT,
    -- Explicit scale so amounts are capped, not rounded to whole numbers
    LEAST($2::NUMBER(18, 2), 10000)
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **10. Categorizing Numeric Ranges**

```sql
INSERT INTO sales_clean (order_id, amount_category)
SELECT 
    $1::INT,
    CASE 
        WHEN $2::NUMBER(18, 2) < 100 THEN 'SMALL'
        WHEN $2::NUMBER(18, 2) BETWEEN 100 AND 1000 THEN 'MEDIUM'
        ELSE 'LARGE'
    END
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **11. Convert String to Date**

```sql
INSERT INTO sales_clean (order_id, order_date)
SELECT 
    $1::INT,
    TO_DATE($2, 'YYYY-MM-DD')
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **12. Extract Year, Month, Day**

```sql
INSERT INTO sales_clean (order_id, year, month, day)
SELECT 
    $1::INT,
    YEAR(TO_DATE($2, 'YYYY-MM-DD')),
    MONTH(TO_DATE($2, 'YYYY-MM-DD')),
    DAY(TO_DATE($2, 'YYYY-MM-DD'))
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **13. Date Difference Calculation**

```sql
INSERT INTO sales_clean (order_id, days_to_ship)
SELECT 
    $1::INT,
    DATEDIFF(DAY, TO_DATE($2, 'YYYY-MM-DD'), TO_DATE($3, 'YYYY-MM-DD'))
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **14. Add/Subtract Days**

```sql
INSERT INTO sales_clean (order_id, expected_delivery)
SELECT 
    $1::INT,
    DATEADD(DAY, 7, TO_DATE($2, 'YYYY-MM-DD'))
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **15. Truncate to First of Month**

```sql
INSERT INTO sales_clean (order_id, month_start)
SELECT 
    $1::INT,
    DATE_TRUNC('MONTH', TO_DATE($2, 'YYYY-MM-DD'))
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **16. Standardize Country Codes**

```sql
INSERT INTO customer_clean (customer_id, country)
SELECT 
    $1::INT,
    IFF($2 IN ('USA','US','United States'), 'US', $2)
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **17. Remove Duplicates (ROW\_NUMBER)**

```sql
INSERT INTO customer_deduped
SELECT customer_id, name, email
FROM (
    SELECT 
        $1::INT AS customer_id,
        $2 AS name,
        $3 AS email,
        ROW_NUMBER() OVER(PARTITION BY $1 ORDER BY $4 DESC) AS rn
    FROM @mystage/customers.csv
    (FILE_FORMAT => 'my_csv_format')
)
WHERE rn = 1;
```
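
The same deduplication can usually be written without the wrapping subquery, because Snowflake supports `QUALIFY` for filtering on window-function results. The sketch below assumes, as the example above does, that `customer_deduped` has exactly three columns in this order and that `$4` is a "last updated" value.

```sql
-- Sketch: keep the most recent row per customer using QUALIFY.
INSERT INTO customer_deduped
SELECT
    $1::INT AS customer_id,
    $2 AS name,
    $3 AS email
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format')
-- QUALIFY filters after the window function is evaluated, like HAVING does for aggregates
QUALIFY ROW_NUMBER() OVER (PARTITION BY $1 ORDER BY $4 DESC) = 1;
```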

---

## **18. Filter Invalid Records**

```sql
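INSERT INTO sales_clean (order_id, amount)
SELECT $1::INT, TRY_TO_NUMBER($2, 18, 2)
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format')
-- TRY_TO_NUMBER yields NULL for non-numeric text, so those rows are filtered out
-- instead of failing the whole statement
WHERE TRY_TO_NUMBER($2, 18, 2) > 0;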
```

---

## **19. Conditional Null Handling**

```sql
INSERT INTO customer_clean (customer_id, phone)
SELECT 
    $1::INT,
    NULLIF($2, '')
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **20. Flag Bad Records**

```sql
INSERT INTO customer_bad
SELECT 
    $1::INT, 
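    -- Treat NULL and blank fields alike; `$2 = ''` alone is never true for NULL
    IFF(NULLIF(TRIM($2), '') IS NULL OR NULLIF(TRIM($3), '') IS NULL, 'INVALID', 'VALID') AS record_status
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');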
```

---

## **21. Join with Dimension Table**

```sql
INSERT INTO sales_enriched (order_id, customer_key, amount)
SELECT 
    s.$1::INT AS order_id,
    c.customer_key,
    s.$2::NUMBER AS amount
FROM @mystage/sales.csv (FILE_FORMAT => 'my_csv_format') s
JOIN dim_customers c ON s.$3::INT = c.customer_id;
```

---

## **22. Surrogate Key Generation**

```sql
INSERT INTO customer_dim (surrogate_key, customer_id)
SELECT 
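    -- The delimiter keeps ('1', '23') and ('12', '3') from hashing to the same key;
    -- COALESCE keeps a NULL $2 from turning the whole key into NULL
    MD5($1 || '|' || COALESCE($2, '')) AS surrogate_key,
    $1::INT AS customer_id
FROM @mystage/customers.csv
(FILE_FORMAT => 'my_csv_format');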
```

---

## **23. Map Codes to Descriptions**

```sql
INSERT INTO sales_clean (order_id, status, status_desc)
SELECT 
    $1::INT,
    $2 AS status,
    DECODE($2, 'P', 'Pending', 'C', 'Completed', 'F', 'Failed') AS status_desc
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format');
```

---

## **24. Slowly Changing Dimension (Type 2)**

```sql
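MERGE INTO dim_customers t
USING (
    SELECT $1::INT AS customer_id, $2 AS email
    FROM @mystage/customers.csv
    (FILE_FORMAT => 'my_csv_format')
) s
-- Compare against the current version only, not expired history rows
ON t.customer_id = s.customer_id AND t.is_active
-- Expire the current row when the email changed (NULL-safe comparison)
WHEN MATCHED AND t.email IS DISTINCT FROM s.email THEN
    UPDATE SET t.is_active = FALSE, t.end_date = CURRENT_DATE
-- Brand-new customers get their first version
WHEN NOT MATCHED THEN
    INSERT (customer_id, email, start_date, is_active)
    VALUES (s.customer_id, s.email, CURRENT_DATE, TRUE);
-- Note: this pass only expires a changed row; the new version of a changed
-- customer still has to be inserted separately.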
```
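
A single `MERGE` pass like the one above closes out the old version of a changed customer but does not add the replacement row, so a Type 2 load typically needs a second step. The following is a minimal sketch of that step, assuming the same `dim_customers` columns as the `MERGE` and that it runs right after it (ideally in the same transaction).

```sql
-- Sketch: insert a fresh active version for every staged customer
-- that no longer has an active row in the dimension.
INSERT INTO dim_customers (customer_id, email, start_date, is_active)
SELECT s.customer_id, s.email, CURRENT_DATE, TRUE
FROM (
    SELECT $1::INT AS customer_id, $2 AS email
    FROM @mystage/customers.csv
    (FILE_FORMAT => 'my_csv_format')
) s
WHERE NOT EXISTS (
    SELECT 1
    FROM dim_customers t
    WHERE t.customer_id = s.customer_id
      AND t.is_active
);
```

Customers that were new or unchanged still have an active row after the `MERGE`, so only the changed ones are picked up here.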

---

## **25. Check Referential Integrity**

```sql
SELECT s.$1::INT AS order_id
FROM @mystage/orders.csv (FILE_FORMAT => 'my_csv_format') s
LEFT JOIN dim_customers c ON s.$2::INT = c.customer_id
WHERE c.customer_id IS NULL;
```

---

## **26. Extract from JSON**

```sql
INSERT INTO orders_flat (order_id, customer, total)
SELECT 
    -- A staged JSON file exposes each record as a single VARIANT column, $1
    $1:order_id::INT,
    $1:customer::STRING,
    $1:total::NUMBER(18, 2)
FROM @mystage/orders.json
(FILE_FORMAT => 'my_json_format');
```

---

## **27. Flatten Nested JSON Arrays**

```sql
INSERT INTO orders_items (order_id, item_id)
SELECT 
    j.$1:order_id::INT,
    item.value:item_id::STRING
-- The file format belongs to the stage reference, before the alias and the FLATTEN
FROM @mystage/orders.json (FILE_FORMAT => 'my_json_format') j,
LATERAL FLATTEN(input => j.$1:items) item;
```

---

## **28. Dynamic JSON Key Parsing**

```sql
SELECT OBJECT_KEYS($1) AS keys
FROM @mystage/orders.json
(FILE_FORMAT => 'my_json_format');
```

---

## **29. Variant → String Conversion**

```sql
INSERT INTO customer_flat (customer_name)
SELECT $1:customer_name::STRING
FROM @mystage/customers.json
(FILE_FORMAT => 'my_json_format');
```

---

## **30. Combine Semi-structured + Structured Data**

```sql
INSERT INTO orders_final (order_id, order_date, promo_code)
SELECT 
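    o.$1::INT,
    TO_DATE(o.$2, 'YYYY-MM-DD'),
    j.$1:promo_code::STRING
-- Each staged file gets its own file format, placed right after the stage path
FROM @mystage/orders.csv (FILE_FORMAT => 'my_csv_format') o
JOIN @mystage/orders.json (FILE_FORMAT => 'my_json_format') j
ON o.$1::INT = j.$1:order_id::INT;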
```

---

✅ **Key Takeaways**:

1. Always start from `@stage_name` (internal, table, or external).
2. Apply transformations **inside the SELECT** from stage.
3. Define proper **file formats** (`FILE_FORMAT`) for CSV, JSON, Parquet.
4. Combine **casting, deduplication, cleansing, SCD handling, JSON parsing, joins** to make data analytics-ready.

---

## **1. Staging Files Are Queryable**

* Files uploaded to a **stage** (`@stage_name`) are **not tables yet**.
* You can still **query them with `SELECT`**, given a suitable `FILE_FORMAT`.

Example:

```sql
SELECT $1::INT AS order_id, $2::STRING AS customer
FROM @mystage/orders.csv
(FILE_FORMAT => 'my_csv_format');
```

* `$1`, `$2` → positional references to the first and second column of the staged file
* You can cast, clean, trim, convert dates, and apply regex: **all the usual row-level transformations**

---

## **2. Almost All Table Transformations Work**

You can do things like:

* `CAST` / `::TYPE` conversions
* `TRIM`, `UPPER`, `LOWER`, `REGEXP_REPLACE`
* `COALESCE`, `NULLIF`
* `DATE`/`TIME` functions (`TO_DATE`, `DATEADD`, `DATEDIFF`)
* Aggregations (`SUM`, `COUNT`, `AVG`)
* Window functions (`ROW_NUMBER`, `RANK`)
* Conditional transformations (`CASE`, `IFF`)
* JSON/VARIANT processing (`data:field::STRING`)
* Flatten arrays (`LATERAL FLATTEN`)

✅ In short: **most `SELECT`-level operations that work on a table also work on staged data before it is loaded**. A transforming `COPY INTO` is more restrictive and supports only a subset of these.
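
As an illustration of the aggregation and window-function items in the list, the sketch below summarizes staged rows per customer without loading anything. The column positions (`$2` = amount, `$3` = customer id in `sales.csv`) are assumptions carried over from the earlier join example.

```sql
-- Sketch: aggregate staged rows and rank customers by total spend.
SELECT
    $3::INT AS customer_id,
    COUNT(*) AS order_count,
    SUM(TRY_TO_NUMBER($2, 18, 2)) AS total_amount,
    -- A window function can be applied on top of the grouped result
    RANK() OVER (ORDER BY SUM(TRY_TO_NUMBER($2, 18, 2)) DESC) AS spend_rank
FROM @mystage/sales.csv
(FILE_FORMAT => 'my_csv_format')
GROUP BY 1;
```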

---

## **3. The Only Differences**

| Feature            | Staged files                                                          | Table                                                                  |
| ------------------ | --------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| Indexes            | ❌ No                                                                  | ❌ Not on standard tables (Snowflake prunes micro-partitions instead)   |
| Constraints        | ❌ No                                                                  | ⚠️ Can be declared, but standard tables enforce only `NOT NULL`        |
| Updates / Deletes  | ❌ Not directly (load into a table first)                              | ✅ Yes                                                                  |
| Metadata           | ✅ Can use `METADATA$FILENAME`, `METADATA$FILE_ROW_NUMBER`             | ✅ Table metadata available                                             |
| Query performance  | ⚡ Reads the staged file on every query; may be slower on huge files   | ⚡ Snowflake tables use columnar storage + micro-partitions             |
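
The file-level metadata mentioned in the table is exposed through the `METADATA$FILENAME` and `METADATA$FILE_ROW_NUMBER` pseudo-columns. A short sketch of how they might be used to keep lineage alongside the data (the column aliases are illustrative):

```sql
-- Sketch: record which file and row each staged record came from.
SELECT
    METADATA$FILENAME AS source_file,
    METADATA$FILE_ROW_NUMBER AS source_row,
    $1::INT AS order_id
FROM @mystage/orders.csv
(FILE_FORMAT => 'my_csv_format');
```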

---

## **4. Practical Workflow**

1. **Stage the raw files** (for an internal stage: `PUT file://<local_path>/file.csv @mystage`)
2. **Transform while reading**:

```sql
SELECT 
    CAST($1 AS INT) AS order_id,
    TO_DATE($2, 'YYYY-MM-DD') AS order_date,
    UPPER($3) AS country
FROM @mystage/file.csv
(FILE_FORMAT => 'my_csv_format');
```

3. **Insert into the target table** (optionally with dedupe, SCD, or enrichment logic)
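
When the transformation is limited to simple per-row functions, steps 2 and 3 can often be combined in a single `COPY INTO` with a transforming subquery. The sketch below assumes a hypothetical target table `orders_clean` with columns `order_id`, `order_date`, and `country`. Joins, aggregates, and window functions are not available in this form, so those cases still need `INSERT ... SELECT`.

```sql
-- Sketch: transform while loading with COPY INTO.
COPY INTO orders_clean (order_id, order_date, country)
FROM (
    SELECT
        t.$1::INT,
        TO_DATE(t.$2, 'YYYY-MM-DD'),
        UPPER(t.$3)
    FROM @mystage/file.csv t
)
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
```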

---

### ✅ Key Takeaway

* **Transformation on staging = preprocessing before the table load.**
* This can save a separate load-then-clean pass, avoids loading **bad/dirty data**, and lets you work through **large file sets in batches**.
* Once data is loaded into a **Snowflake table**, you gain updates and deletes, micro-partition pruning, optional clustering, and better query performance.
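
One way to act on the "avoid loading bad data" point is to profile the staged file before loading it. The sketch below counts rows whose fields would not convert cleanly. The column positions follow the workflow example above, and the choice of checks is only illustrative.

```sql
-- Sketch: count conversion failures in a staged file before loading it.
SELECT
    COUNT(*) AS total_rows,
    COUNT_IF(TRY_TO_NUMBER($1) IS NULL) AS bad_order_ids,
    COUNT_IF(TRY_TO_DATE($2, 'YYYY-MM-DD') IS NULL) AS bad_order_dates
FROM @mystage/file.csv
(FILE_FORMAT => 'my_csv_format');
```

Rows where the source field is NULL are counted as bad too, which is usually what a pre-load check wants.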

---
