
---

## 🔷 **1. Initial Load**

### 🔹 **What is Initial Load?**

The **initial load** is the first load of data from source systems into the data warehouse (DWH). It brings historical data into the DWH so that it is ready for production use.

### ✅ Characteristics:

* Done **only once** (before the system goes live).
* Brings a **full snapshot** of all historical data.
* Used for **backfilling** dimensions and facts.
* Often happens in **batch** mode with bulk insert operations.
* Can take hours or days, depending on volume.
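
The batched, bulk-insert nature of an initial load can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` as a stand-in for the warehouse; the `initial_load` helper, the `sales_fact` columns, and the batch size are assumptions made for the example, not part of any specific platform.

```python
import sqlite3
from itertools import islice

def initial_load(conn, rows, batch_size=10_000):
    """Bulk-insert historical rows in batches; returns the number of rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "order_date TEXT, amount REAL)"
    )
    source = iter(rows)
    loaded = 0
    while True:
        batch = list(islice(source, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch: a failure loses only that batch
            conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", batch)
        loaded += len(batch)
    return loaded
```

Committing per batch keeps memory use bounded and makes it possible to resume a multi-hour load from the last committed batch instead of starting over.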

### 📌 Real Scenario:

For a retail company launching a new warehouse, we loaded:

* 5 years of customer transactions
* 10 years of product catalog history
* Complete store hierarchy

The volume was 2 TB, and the initial load was done over a weekend.

---

## 🔷 **2. Incremental Load**

### 🔹 **What is Incremental Load?**

Once the DWH goes live, **new or changed data** needs to be loaded periodically (hourly, daily, etc.). This is handled using **incremental loading** techniques.

### ✅ Categories of data changes:

1. **New Records**: e.g., new customer signups, new orders.
2. **Modified Records**: e.g., user updated their email or address.
3. **Deleted Records**: e.g., GDPR requests for deletion, deactivated products.
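
To make the three categories concrete, here is a small sketch that classifies changes by comparing two snapshots keyed by primary key. The function name `classify_changes` and the dictionary-based snapshot format are assumptions chosen for illustration.

```python
def classify_changes(previous, current):
    """Compare two snapshots (dicts keyed by primary key).

    Returns (new_keys, modified_keys, deleted_keys), each as a sorted list.
    """
    new = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    modified = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return new, modified, deleted
```

For example, if customer 2 changes their email, customer 3 disappears, and customer 4 signs up, the function reports key 4 as new, key 2 as modified, and key 3 as deleted.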

---

## 🔷 **3. Types of Incremental Load**

Let’s break them down with examples:

---

### 🔸 **Append (Insert Only)**

* Adds only new records into the warehouse.
* No updates or deletes performed.

📌 Example:

```sql
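INSERT INTO sales_fact (order_id, customer_id, order_date, amount)
-- List columns explicitly so the load does not depend on staging column order.
SELECT order_id, customer_id, order_date, amount
FROM staging_sales
WHERE order_date = CURRENT_DATE - 1;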
```

📍 Use When:

* Data is immutable (e.g., logs, transactions).
* New records are easy to identify (e.g., via a `created_at` timestamp).
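
A common way to identify new records is a high-water mark on `created_at`. The sketch below assumes both tables carry a `created_at` column stored as ISO-8601 text (so string comparison matches time order) and uses `sqlite3` as a stand-in warehouse; `append_new_orders` is a hypothetical helper name.

```python
import sqlite3

def append_new_orders(conn):
    """Append staging rows created after the newest row already in the fact table."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(created_at), '') FROM sales_fact"
    ).fetchone()
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO sales_fact "
            "(order_id, customer_id, order_date, amount, created_at) "
            "SELECT order_id, customer_id, order_date, amount, created_at "
            "FROM staging_sales WHERE created_at > ?",
            (watermark,),
        )
    return cur.rowcount  # number of rows appended
```

Because the watermark is read from the target itself, re-running the function loads nothing new. The strict `>` comparison can miss late rows that share the watermark timestamp, so real pipelines often add a small overlap plus de-duplication.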

---

### 🔸 **In-Place Update**

* Performs `INSERT` for new data.
* Performs `UPDATE` for modified records (usually matched by primary key or natural key).
* May include `DELETE` for removed data if needed.

📌 Example (Upsert in Snowflake):

```sql
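-- Column names below are illustrative; substitute the dimension's real attributes.
MERGE INTO customer_dim AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    email = source.email,
    address = source.address
WHEN NOT MATCHED THEN INSERT (customer_id, email, address)
    VALUES (source.customer_id, source.email, source.address);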
```

📍 Use When:

* Data is mutable (e.g., customer address changes).
* You need to keep the warehouse in sync with the source.
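
For a runnable version of the same idea, the sketch below uses SQLite, which has no `MERGE` statement; its `INSERT ... ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or later) is swapped in as the equivalent upsert. The `email` and `address` columns and the `upsert_customers` helper are assumptions for illustration.

```python
import sqlite3

# "WHERE true" is required by SQLite's parser when an upsert reads from a SELECT.
UPSERT_SQL = """
INSERT INTO customer_dim (customer_id, email, address)
SELECT customer_id, email, address FROM staging_customer WHERE true
ON CONFLICT(customer_id) DO UPDATE SET
    email = excluded.email,
    address = excluded.address
"""

def upsert_customers(conn):
    """Insert new customers and overwrite attributes of existing ones."""
    with conn:  # single transaction for the whole upsert
        conn.execute(UPSERT_SQL)
```

The conflict target must be a primary key or unique constraint on `customer_dim`, which plays the role of the `ON` clause in `MERGE`.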

---

### 🔸 **Complete Replacement (Truncate and Reload)**

* Deletes all data in the target table.
* Loads the latest full snapshot from the source.

📌 Example:

```sql
TRUNCATE TABLE product_dim;

INSERT INTO product_dim
SELECT * FROM staging_product;
```

📍 Use When:

* The table is small.
* There are no reliable incremental identifiers (such as a `last_updated` column).
* A large share of rows changes between loads, so computing a diff saves little (e.g., frequently republished reference datasets).

❗Caution: Can cause **report downtime** and **integrity issues** if not managed with care.
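
One way to reduce that risk is to run the delete and the reload in a single transaction and roll back if the new snapshot fails a sanity check. The sketch below assumes `sqlite3` as a stand-in warehouse and a simple minimum-row-count check; `replace_product_dim` and `min_rows` are hypothetical names.

```python
import sqlite3

def replace_product_dim(conn, min_rows=1):
    """Replace product_dim with staging_product inside one transaction.

    Returns False and keeps the old data if the new snapshot is too small.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM product_dim")
            conn.execute("INSERT INTO product_dim SELECT * FROM staging_product")
            (count,) = conn.execute("SELECT COUNT(*) FROM product_dim").fetchone()
            if count < min_rows:
                raise ValueError(f"snapshot has only {count} rows")
    except ValueError:
        return False
    return True
```

Whether a truncate can be rolled back depends on the database engine, so this sketch uses `DELETE`, which is transactional in SQLite.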

---

### 🔸 **Rolling Append**

* A variation of append load.
* Only a **rolling window of recent data** is reloaded (e.g., the last 30 days): rows inside the window are deleted and re-inserted from staging.
* Data older than the window remains untouched.

📌 Example:

```sql
-- Run both statements in one transaction so readers never see the window missing.
BEGIN;

DELETE FROM order_fact
WHERE order_date >= CURRENT_DATE - 30;

INSERT INTO order_fact
SELECT * FROM staging_orders
WHERE order_date >= CURRENT_DATE - 30;

COMMIT;
```

📍 Use When:

* Recent data receives a high volume of updates.
* Reprocessing the full history on every run would be too slow or costly.
* Ideal for clickstream or IoT data.
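
The window reload is naturally idempotent, which the following sketch demonstrates. It assumes `order_date` is stored as ISO-8601 text and uses `sqlite3` as a stand-in warehouse; the `reload_window` helper and its `today` parameter (passed in so the run is reproducible) are assumptions for illustration.

```python
import sqlite3
from datetime import date, timedelta

def reload_window(conn, today, days=30):
    """Delete and re-insert the last `days` days of orders in one transaction."""
    cutoff = (today - timedelta(days=days)).isoformat()
    with conn:
        conn.execute("DELETE FROM order_fact WHERE order_date >= ?", (cutoff,))
        conn.execute(
            "INSERT INTO order_fact "
            "SELECT * FROM staging_orders WHERE order_date >= ?",
            (cutoff,),
        )
```

Running the function twice for the same `today` leaves the table in the same state, so a failed job can simply be retried.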

---

## 🔷 **What if Loading Is Not Done Properly?**

| Failure                  | Impact                                                     |
| ------------------------ | ---------------------------------------------------------- |
| Skipped Incremental Load | Reports show incomplete or outdated data                   |
| Duplicate Loading        | Double counting in reports, misleading KPIs                |
| Missing Delete Handling  | Obsolete data retained, e.g., deleted users still reported |
| Truncate & Reload Misuse | Downtime, broken joins, snapshot inconsistencies           |

---

## 🔷 Real-Life Example (Scenario Based)

**Company**: E-commerce platform with millions of daily orders.

**Data Sources**: Order DB (PostgreSQL), Customer DB (MongoDB), Shipping API (JSON dumps)

### Loading Strategy:

| Layer         | Strategy                               |
| ------------- | -------------------------------------- |
| Orders Fact   | Append Load (new orders daily)         |
| Customer Dim  | In-Place Update (email, phone updates) |
| Product Dim   | Complete Replacement (weekly snapshot) |
| Shipping Logs | Rolling Append (last 7 days window)    |

**Why?**

* Orders are immutable → append is safe.
* Customers update frequently → in-place update ensures consistency.
* Products table changes structure and description → safer to reload completely.
* Shipping logs are heavy and recent-data focused → rolling load is efficient.

---

## ✅ Important Questions

1. **How do you handle incremental loads in your warehouse?**
2. **What strategy would you choose for mutable vs. immutable data?**
3. **Can you explain the difference between append and in-place update loads?**
4. **Have you ever dealt with failed loads? How did you recover data integrity?**
5. **What are common risks of truncate-reload strategy and how do you mitigate them?**
6. **How do you track data change timestamps in source systems?**
7. **Which type of loading would you prefer for a product catalog that changes weekly? Why?**

---

### ✅ 1. **How do you handle incremental loads in your warehouse?**

**Answer:**

We use **metadata-driven ELT/ETL pipelines** to handle incremental loads. Each table has a **change marker** (a watermark column such as `last_updated_timestamp`, or a monotonically increasing key) that the pipeline uses to detect changes. The strategy depends on the nature of the data:

* **Transactional fact tables**: we use an **append-only** strategy.
* **Slowly changing dimensions (SCDs)**: we use **in-place upserts** for Type 1 and **versioned row inserts** for Type 2, which preserves history.
* For some high-volume logs, we use a **rolling window** approach (e.g., reload only the last 30 days).

These loads are orchestrated through **Apache Airflow** with proper **auditing, error handling**, and **idempotency** to ensure safe re-runs.
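
Idempotency is the part that makes re-runs safe, and it can be sketched independently of the orchestrator. The example below is not Airflow code; it is a minimal illustration using `sqlite3`, where `run_once`, the `completed_batches` table, and the `batch_id` convention are all assumptions.

```python
import sqlite3

def run_once(conn, batch_id, load_fn):
    """Run load_fn(conn) unless batch_id already succeeded.

    The load and its completion record commit together, so a crash midway
    leaves no trace and the batch can be retried safely.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS completed_batches ("
        "batch_id TEXT PRIMARY KEY, loaded_at TEXT)"
    )
    done = conn.execute(
        "SELECT 1 FROM completed_batches WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    if done:
        return False  # already loaded: skip instead of double-loading
    with conn:
        load_fn(conn)  # load_fn must not commit on its own
        conn.execute(
            "INSERT INTO completed_batches VALUES (?, datetime('now'))",
            (batch_id,),
        )
    return True
```

A scheduler can then call `run_once(conn, "2024-01-01", load_orders)` (with `load_orders` being any hypothetical load function) as many times as needed without creating duplicates.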

---

### ✅ 2. **What strategy would you choose for mutable vs. immutable data?**

**Answer:**

* For **immutable data** (e.g., logs, events), I choose **append** since records never change after creation.
* For **mutable data** (e.g., customer profiles, product info), I use **in-place update** or **SCD handling** because values can be corrected or enriched over time.

🔁 If a source doesn’t provide CDC flags, I rely on **hash-diff comparison** between staging and target.
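
A hash-diff comparison can be sketched as below. The helper names `row_hash` and `changed_keys`, the list-of-dicts row format, and the choice of SHA-256 are assumptions; any stable hash over the tracked columns works the same way.

```python
import hashlib

def row_hash(row, columns):
    """Hash the tracked columns of a row (a dict) into a hex digest."""
    # repr() keeps None, '' and numeric/text values distinct; the unit
    # separator avoids collisions between adjacent column values.
    payload = "\x1f".join(repr(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_keys(staging, target, key, columns):
    """Return keys of staging rows that are new or differ from the target."""
    target_hashes = {r[key]: row_hash(r, columns) for r in target}
    return [
        r[key]
        for r in staging
        if target_hashes.get(r[key]) != row_hash(r, columns)
    ]
```

In practice the target hash is usually stored as a column on the warehouse table, so only the staging side needs to be hashed on each run.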

---

### ✅ 3. **Can you explain the difference between append and in-place update loads?**

**Answer:**

* **Append**: Adds only **new records**. Existing data remains untouched. Ideal for transaction logs, clickstream data.
* **In-place update**: Updates existing rows and inserts new ones. Used for data that can **mutate**, like customer contact info.

📌 Example:
For `customer_dim`, if email changes:

* Append would not pick up the change, because the customer row already exists.
* In-place update would reflect the latest email.

---

### ✅ 4. **Have you ever dealt with failed loads? How did you recover data integrity?**

**Answer:**

Yes, multiple times. For example, a staging load was once corrupted by a timezone mismatch in timestamp parsing.

To recover:

1. We immediately **quarantined** the corrupted batch.
2. Rolled back downstream tables using **time-partitioned backups**.
3. Re-ran the job with corrected logic.
4. Updated our **data validation rules** to include timezone consistency checks.

We also maintain a **load audit table** that tracks load status (`start_time`, `end_time`, row count, checksum) to identify anomalies early.
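
A load audit record of this kind could look like the following sketch. The `load_audit` schema, the `fingerprint` and `record_load` helpers, and the SHA-256 checksum are assumptions made for illustration, not a description of a specific production setup.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

def fingerprint(rows):
    """Return (row_count, checksum); the checksum ignores row order."""
    digest = hashlib.sha256()
    count = 0
    for text in sorted(repr(tuple(r)) for r in rows):
        digest.update(text.encode("utf-8"))
        count += 1
    return count, digest.hexdigest()

def record_load(conn, table_name, start_time, rows):
    """Write one audit row describing a finished load."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_audit ("
        "table_name TEXT, start_time TEXT, end_time TEXT, "
        "row_count INTEGER, checksum TEXT)"
    )
    count, checksum = fingerprint(rows)
    end_time = datetime.now(timezone.utc).isoformat()
    with conn:
        conn.execute(
            "INSERT INTO load_audit VALUES (?, ?, ?, ?, ?)",
            (table_name, start_time, end_time, count, checksum),
        )
    return count, checksum
```

Comparing the fingerprint of the source extract with the fingerprint of the loaded rows is a cheap way to notice dropped or altered records.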

---

### ✅ 5. **What are common risks of truncate-reload strategy and how do you mitigate them?**

**Answer:**

**Risks:**

* **Downtime**: Between the truncate and the end of the reload, the table is empty or only partially loaded, so reports are affected.
* **Foreign key breaks**: Fact rows point to dimension keys that are missing until the reload completes.
* **Accidental data loss**: If the source snapshot is bad, you overwrite good data with bad.

**Mitigation:**

* Use a **staging table + swap strategy**:

  * Load into a separate staging (shadow) table.
  * Validate row counts and checksums.
  * Then swap it with the live table atomically.

* Perform **row-level diff checks** before truncating.
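
The validate-then-swap idea can be sketched with table renames inside one transaction. This example uses `sqlite3`, where DDL is transactional; the table names `product_dim_new` and `product_dim_old`, the `swap_in_snapshot` helper, and the row-count-only validation are assumptions, and other engines offer their own swap or rename mechanisms.

```python
import sqlite3

def swap_in_snapshot(conn, expected_rows):
    """Promote product_dim_new to product_dim if its row count matches.

    Call with no transaction open on `conn`.
    """
    (actual,) = conn.execute("SELECT COUNT(*) FROM product_dim_new").fetchone()
    if actual != expected_rows:
        return False  # validation failed: leave the live table untouched
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE product_dim RENAME TO product_dim_old")
        conn.execute("ALTER TABLE product_dim_new RENAME TO product_dim")
        conn.execute("DROP TABLE product_dim_old")
        conn.commit()
    except Exception:
        conn.rollback()  # both renames are undone together
        raise
    return True
```

The live table is replaced only after the new snapshot passes validation, so a bad source extract never reaches reports.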

---

### ✅ 6. **How do you track data change timestamps in source systems?**

**Answer:**

Best practice is to request source systems to include **audit fields**:

* `created_at`, `updated_at`, `deleted_at`
* Or a **monotonically increasing key** or **CDC log position**

If not available:

* I implement a **hashing strategy** (e.g., MD5 across important columns) to detect changes between the previous and current versions of each record.
* For APIs, I use a `since` parameter if supported; otherwise I fall back to snapshot + diff.
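
When the audit fields are available, extraction reduces to two filtered queries. The sketch below assumes a `source_customer` table with `updated_at` and `deleted_at` stored as ISO-8601 text and uses `sqlite3` as a stand-in source; `extract_changes` is a hypothetical helper.

```python
import sqlite3

def extract_changes(conn, since):
    """Split source rows touched after `since` into upserts and deletes."""
    upserts = conn.execute(
        "SELECT customer_id, email FROM source_customer "
        "WHERE updated_at > ? AND deleted_at IS NULL "
        "ORDER BY customer_id",
        (since,),
    ).fetchall()
    deletes = [
        row[0]
        for row in conn.execute(
            "SELECT customer_id FROM source_customer "
            "WHERE deleted_at > ? ORDER BY customer_id",
            (since,),
        )
    ]
    return upserts, deletes
```

The `since` value is typically the high-water mark saved by the previous successful run. Soft deletes via `deleted_at` are what make the delete list possible; hard deletes in the source leave nothing to query.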

---

### ✅ 7. **Which type of loading would you prefer for a product catalog that changes weekly? Why?**

**Answer:**

I’d go with a **complete replacement (truncate-reload)** strategy.

**Why?**

* Product catalogs are **relatively small** (a few thousand to a few million rows).
* Changes are frequent and not always flagged.
* It is easier and faster to load the entire clean snapshot weekly.

If performance is a concern, I’d explore **MERGE-based upserts** using `product_id` + `hashdiff`.
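
A hashdiff-guarded upsert touches only rows whose content actually changed. The sketch below uses SQLite's `ON CONFLICT ... DO UPDATE ... WHERE` (SQLite 3.24 or later) in place of `MERGE`; the `hashdiff` and `version` columns and the `upsert_products` helper are assumptions for illustration.

```python
import sqlite3

# "WHERE true" is required by SQLite's parser when an upsert reads from a SELECT.
HASHDIFF_UPSERT = """
INSERT INTO product_dim (product_id, name, hashdiff, version)
SELECT product_id, name, hashdiff, 1 FROM staging_product WHERE true
ON CONFLICT(product_id) DO UPDATE SET
    name = excluded.name,
    hashdiff = excluded.hashdiff,
    version = product_dim.version + 1
WHERE excluded.hashdiff <> product_dim.hashdiff
"""

def upsert_products(conn):
    """Insert new products; update existing ones only when the hash differs."""
    with conn:
        conn.execute(HASHDIFF_UPSERT)
```

Unchanged rows keep their `version`, which makes it easy to see how many products a weekly load really modified.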

---
