
---

## 🔷 What is SCD (Slowly Changing Dimension)?

A **Slowly Changing Dimension (SCD)** is a **dimension table** in a data warehouse whose **attributes change slowly and irregularly over time** (unlike transaction data, which arrives continuously).

### 📌 Purpose:

To manage **historical changes** in dimensional data while supporting accurate **trend** and **point-in-time** analysis.

---

## 📌 Real-Life Example:

Let’s say you are tracking customers in a **retail company**. A customer may:

* Change their **address**
* Get **married** (change of last name)
* Switch **loyalty tiers**

You need a mechanism to either:

* Keep the **latest** value only
* Store a **limited** history
* Store the **complete** history

👉 That’s where **SCD types** come in.

---

## 🔢 Types of SCD — With Real-World Use Cases

---

### 🟩 **SCD Type 0 – Fixed Attributes**

#### ✅ Definition:

* **No changes** allowed to the attribute once it's inserted.
* Any update to this field is **ignored**.

#### 🧠 Purpose:

* To preserve **original values** that should **never change**.

#### 📍 Use Case:

* Tracking **original joining date** of an employee.
* Tracking **birthplace** of a customer.

#### 🧪 Interview Tip:

> “We use SCD Type 0 when the business rules require the attribute to remain **unchanged forever** regardless of any real-world updates.”
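
The "updates are ignored" rule can be sketched as a load that inserts unseen keys and skips existing ones. This is a minimal illustration using Python's built-in `sqlite3` with an in-memory database; the `employee_dim` table, its columns, and the `load_type0` helper are hypothetical names, not part of any particular warehouse.

```python
import sqlite3

def load_type0(conn, rows):
    """Insert new employees; skip any employee_id that already exists (Type 0)."""
    conn.executemany(
        "INSERT OR IGNORE INTO employee_dim (employee_id, joining_date) VALUES (?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_dim (employee_id INTEGER PRIMARY KEY, joining_date TEXT NOT NULL)"
)
load_type0(conn, [(1, "2020-03-01")])
# A later feed carries a different joining date for employee 1: it is ignored.
load_type0(conn, [(1, "2021-07-15"), (2, "2022-01-10")])
result = conn.execute(
    "SELECT employee_id, joining_date FROM employee_dim ORDER BY employee_id"
).fetchall()
print(result)
```

Employee 1 keeps the originally loaded joining date, while the previously unseen employee 2 is inserted as usual.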

---

### 🟨 **SCD Type 1 – Overwrite the Data**

#### ✅ Definition:

* **Overwrite** the old value with the **new value**.
* No history is maintained.

#### 🧠 Purpose:

* When **only the current value** matters.
* Historical changes are **irrelevant**.

#### 📍 Real-Life Example:

* Customer updates their **email address**.
* Fixing a **data quality** issue (wrong spelling of a name).

#### 🔧 ETL Strategy:

* Use `UPDATE` statement to overwrite the value in place.

```sql
UPDATE customer_dim
SET email = 'new_email@example.com'
WHERE customer_id = 123;
```
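
In a real load the customer may not exist yet, so the overwrite is often paired with an insert. A minimal sketch of that update-or-insert pattern, assuming an in-memory SQLite database and a hypothetical `apply_type1` helper:

```python
import sqlite3

def apply_type1(conn, customer_id, email):
    """Overwrite the email in place, or insert the customer if it is new."""
    cur = conn.execute(
        "UPDATE customer_dim SET email = ? WHERE customer_id = ?",
        (email, customer_id),
    )
    if cur.rowcount == 0:  # no existing row matched, so this is a new customer
        conn.execute(
            "INSERT INTO customer_dim (customer_id, email) VALUES (?, ?)",
            (customer_id, email),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, email TEXT)")
apply_type1(conn, 123, "old_email@example.com")
apply_type1(conn, 123, "new_email@example.com")  # overwrites, keeps no history
result = conn.execute("SELECT customer_id, email FROM customer_dim").fetchall()
print(result)
```

After both calls the table holds a single row for customer 123, and the old email is gone.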

#### 🧪 Interview Tip:

> "SCD Type 1 fits attributes whose history has no analytical value, such as contact information, and corrections of data-entry errors."

---

### 🟦 **SCD Type 2 – Preserve Full History**

#### ✅ Definition:

* Maintain **multiple records** for the same entity, each representing a valid time range.
* Full **history** is preserved.

#### 🧠 Purpose:

* To enable **point-in-time analysis**.
* Understand how values evolved over time.

#### 📍 Real-Life Example:

* A customer changes their **address**.
* Employee gets **promoted** to a new designation.

#### 📅 Fields in Table:

* `effective_from_date`
* `effective_to_date`
* `current_flag` (optional)
* `surrogate_key` (very important)

#### ❗ Do we need a surrogate key?

👉 **YES.**

* Because the **natural key** (e.g., `customer_id`) is repeated across multiple historical rows.
* The surrogate key uniquely identifies each version.

#### 🛠️ ETL Strategy:

1. Check whether a tracked attribute has changed.
2. If yes:

   * Set the old record’s `effective_to_date = CURRENT_DATE - 1`.
   * Insert a new record with `effective_from_date = CURRENT_DATE` and `effective_to_date = NULL`.

```sql
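-- Close the currently open version (the row whose end date is still NULL)
UPDATE customer_dim
SET effective_to_date = CURRENT_DATE - 1
WHERE customer_id = 123 AND effective_to_date IS NULL;

-- Insert the new version with a fresh surrogate key.
-- nextval('customer_dim_seq') is PostgreSQL syntax and the sequence name is assumed;
-- other engines use NEXT VALUE FOR customer_dim_seq or an identity column.
INSERT INTO customer_dim
(surrogate_key, customer_id, address, effective_from_date, effective_to_date)
VALUES (nextval('customer_dim_seq'), 123, 'New Address', CURRENT_DATE, NULL);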
```
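
The two steps, including the change check, can be combined into one routine. The following is a sketch only, run against an in-memory SQLite database; `apply_type2` is a hypothetical helper, dates are stored as ISO strings, and the surrogate key comes from SQLite's `AUTOINCREMENT` rather than a sequence.

```python
import sqlite3
from datetime import date, timedelta

def apply_type2(conn, customer_id, new_address, today):
    """Close the open version if the address changed, then insert a new one."""
    current = conn.execute(
        "SELECT surrogate_key, address FROM customer_dim "
        "WHERE customer_id = ? AND effective_to_date IS NULL",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == new_address:
        return  # nothing changed, keep the open version as is
    with conn:  # both statements commit together or not at all
        if current is not None:
            conn.execute(
                "UPDATE customer_dim SET effective_to_date = ? WHERE surrogate_key = ?",
                ((today - timedelta(days=1)).isoformat(), current[0]),
            )
        conn.execute(
            "INSERT INTO customer_dim "
            "(customer_id, address, effective_from_date, effective_to_date) "
            "VALUES (?, ?, ?, NULL)",
            (customer_id, new_address, today.isoformat()),
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    " surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,"
    " customer_id INTEGER, address TEXT,"
    " effective_from_date TEXT, effective_to_date TEXT)"
)
apply_type2(conn, 123, "Old Address", date(2023, 1, 1))
apply_type2(conn, 123, "New Address", date(2023, 6, 1))
apply_type2(conn, 123, "New Address", date(2023, 7, 1))  # no change, no new row
history = conn.execute(
    "SELECT address, effective_from_date, effective_to_date "
    "FROM customer_dim ORDER BY surrogate_key"
).fetchall()
print(history)
```

Wrapping the update and the insert in one transaction avoids a window in which the customer has no open version.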

#### 🧪 Interview Tip:

> "SCD Type 2 is a must for audit trails or time-travel reporting — like tracking customer location when analyzing sales by region over time."

---

### 🟧 **SCD Type 3 – Store Limited History**

#### ✅ Definition:

* Store **limited historical changes**, typically the **previous value** and **current value**.

#### 🧠 Purpose:

* When only **one previous state** is enough for reporting.

#### 📍 Real-Life Example:

* Tracking **previous and current job title** of an employee.
* Keeping **previous and current address** of a customer.

#### 🔧 ETL Strategy:

* Add two columns:

  * `current_address`
  * `previous_address`
* When the value changes:

  * Move `current_address` → `previous_address`
  * Update `current_address` with new value

```sql
UPDATE customer_dim
SET previous_address = current_address,
    current_address = 'New Address',
    last_update_date = CURRENT_DATE
WHERE customer_id = 123;
```
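
To make the "one level back" limitation concrete, the sketch below applies two successive changes and shows that the oldest value is lost. It assumes an in-memory SQLite table with only the two address columns; `apply_type3` is a hypothetical helper.

```python
import sqlite3

def apply_type3(conn, customer_id, new_address):
    """Shift current_address into previous_address, then overwrite current_address."""
    with conn:
        conn.execute(
            "UPDATE customer_dim "
            "SET previous_address = current_address, current_address = ? "
            # Skip the shift when the incoming value equals the current one
            "WHERE customer_id = ? AND current_address <> ?",
            (new_address, customer_id, new_address),
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    " customer_id INTEGER PRIMARY KEY,"
    " current_address TEXT NOT NULL, previous_address TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (123, 'Address A', NULL)")
apply_type3(conn, 123, "Address B")
apply_type3(conn, 123, "Address C")
result = conn.execute(
    "SELECT current_address, previous_address FROM customer_dim WHERE customer_id = 123"
).fetchone()
print(result)
```

After the second change only `Address C` and `Address B` remain; `Address A` is no longer recoverable from the dimension.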

#### 🧪 Interview Tip:

> "Use Type 3 when the business wants to compare only **one-level-back** historical changes, like previous vs current state."

---

## 🟫 Other SCD Types (Less Common But Good to Know)

### 🔶 SCD Type 4 – History Table (Hybrid Approach)

#### ✅ Definition:

* Main dimension table stores only current data.
* A **separate history table** stores all changes.

#### 📍 Use Case:

* You want a **simple, fast** dimension table but still need full history for **audits**.

#### 🧪 Tip:

> "This is like combining Type 1 for the main table and Type 2 for the audit table."
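
One possible shape for this split is sketched below with an in-memory SQLite database. The `customer_dim_history` table, the `valid_from` and `replaced_on` columns, and the `apply_type4` helper are all assumptions made for illustration.

```python
import sqlite3

def apply_type4(conn, customer_id, new_address, changed_on):
    """Copy the outgoing row to the history table, then overwrite the main row."""
    with conn:  # archive and overwrite commit together
        conn.execute(
            "INSERT INTO customer_dim_history "
            "(customer_id, address, valid_from, replaced_on) "
            "SELECT customer_id, address, valid_from, ? FROM customer_dim "
            "WHERE customer_id = ? AND address <> ?",
            (changed_on, customer_id, new_address),
        )
        conn.execute(
            "UPDATE customer_dim SET address = ?, valid_from = ? "
            "WHERE customer_id = ? AND address <> ?",
            (new_address, changed_on, customer_id, new_address),
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim "
    "(customer_id INTEGER PRIMARY KEY, address TEXT, valid_from TEXT)"
)
conn.execute(
    "CREATE TABLE customer_dim_history "
    "(customer_id INTEGER, address TEXT, valid_from TEXT, replaced_on TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (123, 'Old Address', '2023-01-01')")
apply_type4(conn, 123, "New Address", "2023-06-01")
current = conn.execute("SELECT * FROM customer_dim").fetchall()
history = conn.execute("SELECT * FROM customer_dim_history").fetchall()
print(current, history)
```

Dashboards read the small `customer_dim` table, while audits query `customer_dim_history`.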

---

### 🔷 SCD Type 6 – Hybrid (1 + 2 + 3)

#### ✅ Definition:

* Combines **Type 1**, **Type 2**, and **Type 3**.
* Stores:

  * Current value (Type 1)
  * Full history rows (Type 2)
  * Limited history attributes (Type 3)

#### 📍 Use Case:

* You want to support all forms of reporting from a **single table**.
* Large enterprise DWHs sometimes use this.

#### 🧪 Tip:

> "Type 6 tends to show up in large enterprise warehouses (SAP BW and Oracle-based DWs are often cited) where mixed behavior is required."
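
A rough sketch of one way to lay this out: each version row keeps the address that was valid for that version (`address`, the Type 2 part) next to a `current_address` column that is overwritten on every row (the Type 1 part); having both side by side is the Type 3 flavour. The column names and the `apply_type6` helper are assumptions, and change detection is left out to keep the example short.

```python
import sqlite3
from datetime import date, timedelta

def apply_type6(conn, customer_id, new_address, today):
    """Apply one address change using Type 2 rows plus a Type 1 current column."""
    closing_date = (today - timedelta(days=1)).isoformat()
    with conn:  # all three statements commit together
        # Type 2: close the open version
        conn.execute(
            "UPDATE customer_dim SET effective_to_date = ?, current_flag = 0 "
            "WHERE customer_id = ? AND current_flag = 1",
            (closing_date, customer_id),
        )
        # Type 2: add a new version row holding the address valid from today
        conn.execute(
            "INSERT INTO customer_dim (customer_id, address, current_address, "
            "effective_from_date, effective_to_date, current_flag) "
            "VALUES (?, ?, ?, ?, NULL, 1)",
            (customer_id, new_address, new_address, today.isoformat()),
        )
        # Type 1: overwrite current_address on every version of this customer
        conn.execute(
            "UPDATE customer_dim SET current_address = ? WHERE customer_id = ?",
            (new_address, customer_id),
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    " surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT, customer_id INTEGER,"
    " address TEXT, current_address TEXT,"
    " effective_from_date TEXT, effective_to_date TEXT, current_flag INTEGER)"
)
apply_type6(conn, 123, "Old Address", date(2023, 1, 1))
apply_type6(conn, 123, "New Address", date(2023, 6, 1))
versions = conn.execute(
    "SELECT address, current_address, current_flag "
    "FROM customer_dim ORDER BY surrogate_key"
).fetchall()
print(versions)
```

Reports can then group historical facts either by the address as it was (`address`) or by the address as it is now (`current_address`) without a second join.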

---

## 📋 Summary Table

| Type | History Maintained? | New Record? | Real-World Use Case             |
| ---- | ------------------- | ----------- | ------------------------------- |
| 0    | ❌ No                | ❌ No        | Birthplace                      |
| 1    | ❌ No                | ❌ No        | Email updates                   |
| 2    | ✅ Full              | ✅ Yes       | Address changes, salary history |
| 3    | ✅ Partial           | ❌ No        | Previous job title              |
| 4    | ✅ In another table  | ✅ Yes       | Hybrid audit trail              |
| 6    | ✅ Full + Partial    | ✅ Yes       | Advanced use cases              |

---

## 🧠 Important Questions

1. When do you use SCD Type 1 vs Type 2?
2. How do you implement SCD Type 2 in your ETL pipeline?
3. What role does a surrogate key play in SCD Type 2?
4. Can SCD Type 3 and Type 2 be combined?
5. How would you manage large volumes of historical data in dimension tables?
6. If your business user says they want to see the “previous region of a customer,” what SCD would you choose?

---





### ✅ **1. When do you use SCD Type 1 vs Type 2?**

**Answer:**

> I use **SCD Type 1** when historical data is **not important**, and I only care about the **latest value**. This is suitable for correcting errors or for attributes that change infrequently and do not affect analysis — like fixing a spelling mistake or updating an email address.
>
> I use **SCD Type 2** when it’s critical to **track historical changes** over time. This is necessary when we want to do **point-in-time reporting** or trend analysis. For example, if a customer moves from Delhi to Mumbai, I want to know **when** the change happened, and be able to track **sales by location** accordingly.

---

### ✅ **2. How do you implement SCD Type 2 in your ETL pipeline?**

**Answer:**

> In SCD Type 2, I preserve history by keeping **multiple versions** of the same record, each with a **time window** of validity.
>
> The ETL logic includes:
>
> 1. **Look up** the current record from the dimension table using a natural key (e.g., `customer_id`).
> 2. **Compare** incoming values with existing values.
> 3. If a change is detected:
>
>    * **Update** the existing record’s `effective_to_date` to current date minus one day.
>    * **Insert** a new record with:
>
>      * A **new surrogate key**
>      * Updated value
>      * `effective_from_date = current date`
>      * `effective_to_date = NULL`
>
> This allows me to query history like:
>
> ```sql
> SELECT * FROM customer_dim
> WHERE customer_id = 123
>   AND effective_from_date <= '2023-01-15'
>   AND (effective_to_date IS NULL OR effective_to_date >= '2023-01-15');
> ```
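
A point-in-time lookup also has to match the open-ended current row, whose end date is `NULL`. The sketch below shows one way to handle that, assuming ISO date strings, the `effective_from_date` / `effective_to_date` column names, and a hypothetical `address_as_of` helper on an in-memory SQLite table.

```python
import sqlite3

def address_as_of(conn, customer_id, as_of):
    """Return the address that was valid on the ISO date `as_of`, or None."""
    row = conn.execute(
        "SELECT address FROM customer_dim "
        "WHERE customer_id = ? AND effective_from_date <= ? "
        "AND (effective_to_date IS NULL OR effective_to_date >= ?)",
        (customer_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (surrogate_key INTEGER PRIMARY KEY, customer_id INTEGER,"
    " address TEXT, effective_from_date TEXT, effective_to_date TEXT)"
)
conn.executemany(
    "INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?)",
    [
        (1001, 123, "Old Address", "2023-01-01", "2023-05-31"),
        (1002, 123, "New Address", "2023-06-01", None),  # open-ended current row
    ],
)
print(address_as_of(conn, 123, "2023-01-15"))
print(address_as_of(conn, 123, "2023-08-01"))
```

A plain `BETWEEN` against a `NULL` end date evaluates to unknown, so the current row would be silently dropped without the `IS NULL` branch.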

---

### ✅ **3. What role does a surrogate key play in SCD Type 2?**

**Answer:**

> In SCD Type 2, **surrogate keys** are essential because:
>
> * A single **natural key** like `customer_id` will have **multiple versions** (rows) over time.
> * Without a surrogate key, each historical version could only be identified by a composite such as the natural key plus `effective_from_date`, which is clumsy to join on.
> * Fact rows store the surrogate key of the version that was valid at transaction time, which keeps **referential integrity** between the fact table and the dimension table.
>
> For example, `customer_id = 123` might appear 4 times in the dimension table with different addresses, but each row will have a **different surrogate key**, such as 1001, 1002, 1003, and 1004.
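
The link between fact rows and dimension versions can be illustrated with a small join. This sketch assumes a hypothetical `sales_fact` table whose `customer_sk` column holds the dimension's surrogate key, plus made-up sample rows in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_dim (
        surrogate_key INTEGER PRIMARY KEY,
        customer_id   INTEGER,
        city          TEXT
    );
    CREATE TABLE sales_fact (
        sale_id     INTEGER PRIMARY KEY,
        customer_sk INTEGER REFERENCES customer_dim (surrogate_key),
        amount      INTEGER
    );
    -- Two versions of the same customer, each with its own surrogate key
    INSERT INTO customer_dim VALUES (1001, 123, 'Delhi'), (1002, 123, 'Mumbai');
    -- Each sale points at the version that was current when it happened
    INSERT INTO sales_fact VALUES (1, 1001, 100), (2, 1001, 200), (3, 1002, 50);
    """
)
sales_by_city = conn.execute(
    "SELECT d.city, SUM(f.amount) FROM sales_fact f "
    "JOIN customer_dim d ON d.surrogate_key = f.customer_sk "
    "GROUP BY d.city ORDER BY d.city"
).fetchall()
print(sales_by_city)
```

Because the join uses the surrogate key, sales made while the customer lived in Delhi stay attributed to Delhi even after the move to Mumbai.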

---

### ✅ **4. Can SCD Type 3 and Type 2 be combined?**

**Answer:**

> Yes, they can be combined. Together with a Type 1 overwrite of the current-value column, this is typically referred to as **SCD Type 6 (Hybrid)**.
>
> In SCD Type 6:
>
> * We use **Type 2** to track all historical changes (with rows and dates).
> * We also maintain **Type 3-style columns**, like `previous_department`, to allow easy access to the most recent changes.
>
> This gives flexibility: **quick comparison** (Type 3) and **deep history** (Type 2) in the same table.
>
> However, it comes with increased **complexity** in ETL and data modeling, so I use this only when the business explicitly requires **both behaviors**.

---

### ✅ **5. How would you manage large volumes of historical data in dimension tables?**

**Answer:**

> For high-volume SCD Type 2 tables, I follow these best practices:
>
> 1. **Partitioning** the dimension table by `effective_from_date` to speed up queries.
> 2. **Indexing** on the natural key and `current_flag` (or `effective_to_date IS NULL`) to improve lookup performance.
> 3. Exposing only **current rows through a separate view** (or a current-only table) for fast access in dashboards.
> 4. Using **data archival strategies** (move older records to cheaper storage like S3 or cold layer).
> 5. Using **hash-based change detection** in ETL to avoid unnecessary updates.
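
Hash-based change detection (point 5) can be sketched as follows: hash the tracked attributes once, store the digest with the dimension row, and compare digests instead of comparing every column. The tracked column list and the helper names are hypothetical.

```python
import hashlib

TRACKED_COLUMNS = ("address", "loyalty_tier")  # hypothetical tracked attributes

def row_hash(row, columns=TRACKED_COLUMNS):
    """Hash the tracked attributes of a row into one comparable hex digest."""
    # The unit-separator delimiter keeps ("ab", "c") distinct from ("a", "bc").
    # Note: None and "" hash identically in this simple version.
    payload = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def has_changed(incoming, stored_hash):
    """True when the incoming row differs from the version behind stored_hash."""
    return row_hash(incoming) != stored_hash

stored = {"address": "Old Address", "loyalty_tier": "Silver"}
stored_hash = row_hash(stored)  # would normally be kept in a column on the dimension row

same = {"address": "Old Address", "loyalty_tier": "Silver"}
upgraded = {"address": "Old Address", "loyalty_tier": "Gold"}
print(has_changed(same, stored_hash), has_changed(upgraded, stored_hash))
```

Only rows for which `has_changed` returns `True` need to go through the close-and-insert logic, which keeps the Type 2 table from filling with identical versions.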

---

### ✅ **6. If your business user says they want to see the “previous region of a customer,” what SCD would you choose?**

**Answer:**

> In this case, we only care about the **previous and current** values — not the entire history.
> So I would choose **SCD Type 3**.
>
> I would add:
>
> * A column for `current_region`
> * A column for `previous_region`
>
> The ETL would:
>
> * Move the value of `current_region` into `previous_region`
> * Update `current_region` with the new value
>
> If the business later asks for a **full region history**, I can switch to **SCD Type 2** or implement a hybrid (SCD Type 6).

---

