# Slowly Changing Dimensions (SCDs)

## Introduction to Slowly Changing Dimensions

**Slowly Changing Dimensions (SCDs)** are dimension tables whose attribute values change infrequently over time. Each SCD type handles these changes differently, trading historical accuracy against simplicity and storage.

## Slowly Changing Dimensions (SCDs) and Data Warehouse History

### Why SCDs Matter:

In operational systems, when a customer's address changes, the old address is typically overwritten. In data warehouses, we often need to:
- Preserve historical accuracy
- Track changes over time
- Support time-based analysis
- Maintain referential integrity with historical facts

### The Challenge:

When a dimension attribute changes:
- Should we overwrite the old value? (Lose history)
- Should we create a new row? (Preserve history, but how to link facts?)
- Should we add a column? (Track limited history)

### SCD Types:

- **Type 1**: Overwrite (no history)
- **Type 2**: Add new row (full history)
- **Type 3**: Add new column (limited history)
- **Type 4**: Historical table (separate history)
- **Type 6**: Hybrid (combines 1, 2, 3)

## Design a Type 1 SCD

**Type 1 SCD**: Overwrite the old value with the new value. No history is preserved.

### Characteristics:
- Simplest approach
- No history maintained
- Current value only
- Updates existing row

### When to Use:
- Corrections to data errors
- Attributes that don't need history
- Current state is sufficient
- Performance is critical

### Structure:
```sql
CREATE TABLE customer_dimension (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(100),
    address VARCHAR(200),
    city VARCHAR(50),
    state VARCHAR(50),
    zip_code VARCHAR(10),
    email VARCHAR(100),
    phone VARCHAR(20)
);
```

### Update Process:
```sql
-- When customer address changes, simply update
UPDATE customer_dimension
SET address = 'New Address',
    city = 'New City',
    state = 'New State',
    zip_code = '12345'
WHERE customer_id = 'CUST-001';
```
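
In practice the new values usually arrive in a staging table rather than as literals. The following is a minimal sketch, assuming a hypothetical `customer_staging` table with at most one row per `customer_id` and the same address columns as the dimension. It uses correlated subqueries because they are broadly portable; many engines also offer `MERGE` or `UPDATE ... FROM`, which may be more concise.

```sql
-- Assumption: customer_staging is a hypothetical table holding the latest
-- source values, with at most one row per customer_id.
UPDATE customer_dimension
SET address  = (SELECT s.address  FROM customer_staging s
                WHERE s.customer_id = customer_dimension.customer_id),
    city     = (SELECT s.city     FROM customer_staging s
                WHERE s.customer_id = customer_dimension.customer_id),
    state    = (SELECT s.state    FROM customer_staging s
                WHERE s.customer_id = customer_dimension.customer_id),
    zip_code = (SELECT s.zip_code FROM customer_staging s
                WHERE s.customer_id = customer_dimension.customer_id)
-- Only touch customers that appear in staging; without this filter,
-- unmatched rows would have their address columns set to NULL.
WHERE EXISTS (
    SELECT 1 FROM customer_staging s
    WHERE s.customer_id = customer_dimension.customer_id
);
```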

### Pros and Cons:

**Advantages:**
- Simple implementation
- No additional storage
- Fast queries (no history to navigate)
- Easy to understand

**Disadvantages:**
- No historical accuracy
- Cannot track changes over time
- Historical reports are restated: past facts are reported with the current attribute values, not the values in effect at the time
- Loses audit trail
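
To make the reporting drawback concrete, here is a small sketch that assumes a hypothetical `sales_fact` table with `customer_key`, `sale_date`, and `amount` columns. Because a Type 1 dimension holds only one row per customer, every past sale joins to whatever city is stored today.

```sql
-- Assumption: sales_fact is a hypothetical fact table with
-- customer_key, sale_date, and amount columns.
SELECT
    c.city,
    SUM(f.amount) AS total_sales
FROM sales_fact f
JOIN customer_dimension c
    ON f.customer_key = c.customer_key
WHERE f.sale_date >= '2020-01-01'
    AND f.sale_date < '2021-01-01'
GROUP BY c.city;
-- After a Type 1 update changes a customer's city, that customer's 2020
-- sales are reported under the new city, even though they were made
-- while the customer lived in the old one.
```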

## Design a Type 2 SCD

**Type 2 SCD**: Add a new row when an attribute changes. Full history is preserved.

### Characteristics:
- Most common SCD type
- Preserves complete history
- Each version gets a new surrogate key
- Tracks effective and expiry dates

### When to Use:
- Need historical accuracy
- Track changes over time
- Support time-based analysis
- Most dimension attributes

### Structure:
```sql
CREATE TABLE customer_dimension (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50),  -- Natural key (same for all versions)
    customer_name VARCHAR(100),
    address VARCHAR(200),
    city VARCHAR(50),
    state VARCHAR(50),
    zip_code VARCHAR(10),
    email VARCHAR(100),
    phone VARCHAR(20),
    effective_date DATE,
    expiry_date DATE,
    is_current BOOLEAN
);
```

### Example Data:

| customer_key | customer_id | address | city | effective_date | expiry_date | is_current |
|--------------|-------------|---------|------|----------------|-------------|------------|
| 1 | CUST-001 | 123 Main St | NYC | 2020-01-01 | 2023-06-15 | FALSE |
| 2 | CUST-001 | 456 Oak Ave | Boston | 2023-06-16 | 9999-12-31 | TRUE |

### Update Process:

```sql
-- Run both steps in one transaction so readers never see zero or two current rows
START TRANSACTION;

-- Step 1: Expire current row (inclusive expiry date = yesterday)
UPDATE customer_dimension
SET expiry_date = CURRENT_DATE - INTERVAL '1' DAY,
    is_current = FALSE
WHERE customer_id = 'CUST-001'
    AND is_current = TRUE;

-- Step 2: Insert new row
INSERT INTO customer_dimension (
    customer_key,
    customer_id,
    customer_name,
    address,
    city,
    state,
    zip_code,
    email,
    phone,
    effective_date,
    expiry_date,
    is_current
)
VALUES (
    NEXT VALUE FOR customer_key_seq,  -- SQL-standard syntax; PostgreSQL uses nextval('customer_key_seq')
    'CUST-001',
    'John Doe',
    '456 Oak Ave',
    'Boston',
    'MA',
    '02101',
    'john@example.com',
    '555-1234',
    CURRENT_DATE,
    '9999-12-31',
    TRUE
);

COMMIT;
```
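
The same two steps can be applied to a whole batch. The sketch below assumes a hypothetical `customer_staging` table with one row per `customer_id` and the same attribute columns, a sequence named `customer_key_seq`, and the SQL-standard `NEXT VALUE FOR` syntax (substitute your engine's equivalent). It tracks only the address columns as Type 2 attributes, and the `<>` comparisons do not detect changes to or from NULL, so treat it as a starting point rather than a complete loader.

```sql
START TRANSACTION;

-- Step 1: expire current rows whose tracked attributes differ from staging
UPDATE customer_dimension
SET expiry_date = CURRENT_DATE - INTERVAL '1' DAY,
    is_current = FALSE
WHERE is_current = TRUE
    AND EXISTS (
        SELECT 1 FROM customer_staging s
        WHERE s.customer_id = customer_dimension.customer_id
            AND (s.address <> customer_dimension.address
              OR s.city <> customer_dimension.city
              OR s.state <> customer_dimension.state
              OR s.zip_code <> customer_dimension.zip_code)
    );

-- Step 2: insert a new current row for every staged customer that now has
-- no current row (customers expired in step 1, plus brand-new customers)
INSERT INTO customer_dimension (
    customer_key, customer_id, customer_name, address, city, state,
    zip_code, email, phone, effective_date, expiry_date, is_current
)
SELECT
    NEXT VALUE FOR customer_key_seq,
    s.customer_id, s.customer_name, s.address, s.city, s.state,
    s.zip_code, s.email, s.phone,
    CURRENT_DATE, '9999-12-31', TRUE
FROM customer_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM customer_dimension d
    WHERE d.customer_id = s.customer_id
        AND d.is_current = TRUE
);

COMMIT;
```

Running this more than once per day for the same customer would produce a version whose `expiry_date` precedes its `effective_date`, so a day-grain design like this one assumes at most one load per day.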

### Querying Type 2 SCDs:

```sql
-- Get current version
SELECT * FROM customer_dimension
WHERE customer_id = 'CUST-001'
    AND is_current = TRUE;

-- Get version at specific date
SELECT * FROM customer_dimension
WHERE customer_id = 'CUST-001'
    AND '2023-01-01' BETWEEN effective_date AND expiry_date;

-- Get all versions
SELECT * FROM customer_dimension
WHERE customer_id = 'CUST-001'
ORDER BY effective_date;
```
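
Type 2 also answers the earlier question of how facts link to versions: each fact row stores the surrogate key of the version that was current when the fact occurred. A minimal sketch, assuming a hypothetical `sales_fact` table with `customer_key` and `amount` columns, shows the two reporting styles this enables.

```sql
-- Assumption: sales_fact is a hypothetical fact table that stores the
-- customer_key that was current when each sale was loaded.

-- "As-was" reporting: each sale is attributed to the city at the time of sale
SELECT d.city, SUM(f.amount) AS total_sales
FROM sales_fact f
JOIN customer_dimension d
    ON f.customer_key = d.customer_key
GROUP BY d.city;

-- "As-is" reporting: every sale is attributed to the customer's current city
SELECT cur.city, SUM(f.amount) AS total_sales
FROM sales_fact f
JOIN customer_dimension d
    ON f.customer_key = d.customer_key
JOIN customer_dimension cur
    ON cur.customer_id = d.customer_id
    AND cur.is_current = TRUE
GROUP BY cur.city;
```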

### Pros and Cons:

**Advantages:**
- Complete historical accuracy
- Track all changes
- Support time-based analysis
- Maintain referential integrity

**Disadvantages:**
- More storage required
- More complex queries
- Need to manage effective/expiry dates
- ETL complexity increases

## Maintain Correct Data Order with Type 2 SCDs

### Critical Rules for Type 2 SCDs:

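1. **No Overlapping Dates**
   - Each version of a natural key must cover a date range that does not overlap any other version
   - A version must expire before the next one begins
   - With inclusive expiry dates (as in the examples here), the next version's `effective_date` is the day after the previous `expiry_date`, so every day is covered exactly once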

2. **One Current Row**
   - Only one row per natural key should have `is_current = TRUE`
   - Only one row should have `expiry_date = '9999-12-31'`

3. **New Surrogate Key per Version**
   - Each new version gets a new, unique surrogate key
   - The natural key stays the same across versions
   - Surrogate keys carry no meaning; order versions by `effective_date`, not by key value
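
Some of these rules can be pushed into the schema. The sketch below is one possible set of constraints, with hypothetical constraint names; it assumes `is_current`, `effective_date`, and `expiry_date` are never NULL and that the engine enforces `CHECK` constraints. It does not catch every overlap, so the validation queries later in this section are still needed.

```sql
-- A version cannot end before it starts
ALTER TABLE customer_dimension
    ADD CONSTRAINT chk_customer_dim_date_order
    CHECK (effective_date <= expiry_date);

-- is_current and the open-ended expiry date must agree
ALTER TABLE customer_dimension
    ADD CONSTRAINT chk_customer_dim_current_flag
    CHECK (
        (is_current = TRUE AND expiry_date = '9999-12-31')
        OR (is_current = FALSE AND expiry_date < '9999-12-31')
    );

-- At most one open-ended row per natural key; two versions of the same
-- customer also cannot end on the same day
ALTER TABLE customer_dimension
    ADD CONSTRAINT uq_customer_dim_nk_expiry
    UNIQUE (customer_id, expiry_date);
```

Together, the second and third constraints imply at most one current row per `customer_id`, because every current row must carry the same `'9999-12-31'` expiry date.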

### Common Issues:

**Issue 1: Overlapping Dates**
```text
-- WRONG: Overlapping dates
customer_key=1, effective_date='2020-01-01', expiry_date='2023-12-31'
customer_key=2, effective_date='2023-06-01', expiry_date='9999-12-31'
-- Dates overlap!

-- CORRECT: No overlap
customer_key=1, effective_date='2020-01-01', expiry_date='2023-05-31'
customer_key=2, effective_date='2023-06-01', expiry_date='9999-12-31'
```

**Issue 2: Multiple Current Rows**
```text
-- WRONG: Multiple current rows
customer_key=1, is_current=TRUE, expiry_date='9999-12-31'
customer_key=2, is_current=TRUE, expiry_date='9999-12-31'

-- CORRECT: Only one current row
customer_key=1, is_current=FALSE, expiry_date='2023-05-31'
customer_key=2, is_current=TRUE, expiry_date='9999-12-31'
```

### Validation Queries:

```sql
-- Check for overlapping dates
SELECT 
    customer_id,
    COUNT(*) as overlap_count
FROM customer_dimension c1
WHERE EXISTS (
    SELECT 1 FROM customer_dimension c2
    WHERE c1.customer_id = c2.customer_id
        AND c1.customer_key != c2.customer_key
        AND c1.effective_date <= c2.expiry_date
        AND c1.expiry_date >= c2.effective_date
)
GROUP BY customer_id;

-- Check for multiple current rows
SELECT 
    customer_id,
    COUNT(*) as current_count
FROM customer_dimension
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) > 1;

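-- Check for gaps in history: compare each version only with the next one
-- (joining to every later version would report false gaps)
SELECT customer_id, end_date, next_start_date
FROM (
    SELECT
        customer_id,
        expiry_date AS end_date,
        LEAD(effective_date) OVER (
            PARTITION BY customer_id
            ORDER BY effective_date
        ) AS next_start_date
    FROM customer_dimension
) v
-- DATEDIFF(later, earlier) is MySQL syntax; adjust for other dialects
WHERE DATEDIFF(next_start_date, end_date) > 1;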
```
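
A few further checks follow the same pattern. These are a sketch under the conventions used in this document (`'9999-12-31'` marks the open-ended row, dates are inclusive on both ends). A natural key with no current row may be legitimate if your design expires deleted customers, so treat the second query as a prompt for review rather than a hard failure.

```sql
-- Rows where is_current disagrees with the open-ended expiry date
SELECT customer_key, customer_id, expiry_date, is_current
FROM customer_dimension
WHERE (is_current = TRUE AND expiry_date <> '9999-12-31')
    OR (is_current = FALSE AND expiry_date = '9999-12-31');

-- Natural keys with no current row at all
SELECT customer_id
FROM customer_dimension
GROUP BY customer_id
HAVING SUM(CASE WHEN is_current = TRUE THEN 1 ELSE 0 END) = 0;

-- Rows whose date range is inverted
SELECT customer_key, customer_id, effective_date, expiry_date
FROM customer_dimension
WHERE effective_date > expiry_date;
```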

### Best Practices:

- **Atomic Updates**: Expire old row and insert new row in same transaction
- **Date Management**: Pick one date convention and use it everywhere: inclusive start with exclusive end, or inclusive on both ends (the examples here use inclusive on both ends)
- **Validation**: Regularly validate date ranges
- **Indexing**: Index on (customer_id, effective_date, expiry_date)
- **Documentation**: Document date range logic clearly
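
As an illustration of the indexing and date-management points, the sketch below creates the suggested index (the index name is hypothetical) and shows how a point-in-time lookup differs between the two date conventions. The dates refer to the CUST-001 example data shown earlier.

```sql
-- Composite index supporting lookups by natural key and date
CREATE INDEX idx_customer_dim_nk_dates
    ON customer_dimension (customer_id, effective_date, expiry_date);

-- Inclusive on both ends: the old row stores expiry_date = '2023-06-15'
SELECT * FROM customer_dimension
WHERE customer_id = 'CUST-001'
    AND '2023-06-15' BETWEEN effective_date AND expiry_date;

-- Half-open [effective_date, expiry_date): the old row would instead store
-- expiry_date = '2023-06-16', equal to the next row's effective_date
SELECT * FROM customer_dimension
WHERE customer_id = 'CUST-001'
    AND effective_date <= '2023-06-15'
    AND '2023-06-15' < expiry_date;
```

Mixing the two forms against the same table is a common source of off-by-one errors, which is why the convention should be documented alongside the table.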

## Design a Type 3 SCD

**Type 3 SCD**: Add a column that stores the previous value. History is limited to that one prior value.

### Characteristics:
- Stores current and previous value
- Limited history (only one previous value)
- Updates existing row
- Adds columns for previous values

### When to Use:
- Need limited history (just previous value)
- Changes are infrequent
- Don't need full change history
- Performance is important

### Structure:
```sql
CREATE TABLE customer_dimension (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(100),
    address VARCHAR(200),
    address_previous VARCHAR(200),  -- Previous value
    city VARCHAR(50),
    city_previous VARCHAR(50),      -- Previous value
    state VARCHAR(50),
    state_previous VARCHAR(50),    -- Previous value
    zip_code VARCHAR(10),
    zip_code_previous VARCHAR(10),  -- Previous value
    email VARCHAR(100),
    phone VARCHAR(20),
    change_date DATE                 -- When current value became effective
);
```

### Example Data:

| customer_key | customer_id | address | address_previous | city | city_previous | change_date |
|--------------|-------------|---------|------------------|------|---------------|-------------|
| 1 | CUST-001 | 456 Oak Ave | 123 Main St | Boston | NYC | 2023-06-16 |

### Update Process:

```sql
-- When address changes
UPDATE customer_dimension
SET address_previous = address,
    city_previous = city,
    state_previous = state,
    zip_code_previous = zip_code,
    address = '456 Oak Ave',
    city = 'Boston',
    state = 'MA',
    zip_code = '02101',
    change_date = CURRENT_DATE
WHERE customer_id = 'CUST-001';
```
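
If the same update is run twice, the second run copies the new values into the `*_previous` columns and the real previous values are lost. One way to guard against that, sketched here, is to skip the update when nothing has changed. Note that the `<>` comparisons evaluate to unknown when a current value is NULL, so NULL-able columns would need extra handling.

```sql
UPDATE customer_dimension
SET address_previous = address,
    city_previous = city,
    state_previous = state,
    zip_code_previous = zip_code,
    address = '456 Oak Ave',
    city = 'Boston',
    state = 'MA',
    zip_code = '02101',
    change_date = CURRENT_DATE
WHERE customer_id = 'CUST-001'
    -- Skip the update when nothing changed, so a re-run does not
    -- overwrite the *_previous columns with the current values
    AND (address <> '456 Oak Ave'
      OR city <> 'Boston'
      OR state <> 'MA'
      OR zip_code <> '02101');
```

The `*_previous` assignments are listed before the new values on purpose: standard SQL evaluates every right-hand side against the old row, but MySQL evaluates single-table assignments left to right, and this ordering is correct under both behaviors.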

### Querying Type 3 SCDs:

```sql
-- Get current and previous values
SELECT 
    customer_id,
    address as current_address,
    address_previous as previous_address,
    city as current_city,
    city_previous as previous_city,
    change_date
FROM customer_dimension
WHERE customer_id = 'CUST-001';
```
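
Type 3 is often used to report the same data under both the current and the previous assignment. The sketch below assumes that `city_previous` is NULL for customers whose city has never changed.

```sql
-- Customers per city under the current assignment
SELECT city, COUNT(*) AS customer_count
FROM customer_dimension
GROUP BY city;

-- Customers per city under the previous assignment; customers that have
-- never changed (city_previous IS NULL) are counted under their current city
SELECT COALESCE(city_previous, city) AS city_before_change,
       COUNT(*) AS customer_count
FROM customer_dimension
GROUP BY COALESCE(city_previous, city);

-- Customers who moved between cities, with the date of the move
SELECT customer_id, city_previous, city, change_date
FROM customer_dimension
WHERE city_previous IS NOT NULL
    AND city_previous <> city;
```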

### Pros and Cons:

**Advantages:**
- Simple structure
- Fast queries (single row)
- Tracks previous value
- No complex date logic

**Disadvantages:**
- Limited history (only previous value)
- More columns needed
- Cannot track multiple changes
- Not suitable for frequent changes

### Variations:

**Type 3 with Multiple Previous Values:**
```sql
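-- Store several previous values per attribute (most recent first)
CREATE TABLE customer_dimension (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50),
    address_current VARCHAR(200),
    address_previous_1 VARCHAR(200),
    address_previous_2 VARCHAR(200),
    address_previous_3 VARCHAR(200)
);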
```
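
With several previous-value columns, each change shifts the values down by one slot and the oldest value falls off the end. A minimal sketch using the columns shown above; the new address is a made-up value for illustration.

```sql
-- Shift oldest-first so the statement is also correct on engines that
-- evaluate SET assignments left to right (for example MySQL)
UPDATE customer_dimension
SET address_previous_3 = address_previous_2,  -- oldest value is dropped
    address_previous_2 = address_previous_1,
    address_previous_1 = address_current,
    address_current    = '789 Pine Rd'        -- hypothetical new address
WHERE customer_id = 'CUST-001';
```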

## Summarize SCD Concepts and Implementations

### SCD Type Comparison:

| Type | History | Storage | Complexity | Use Case |
|------|---------|---------|------------|----------|
| **Type 1** | None | Low | Low | Corrections, non-critical attributes |
| **Type 2** | Full | High | High | Most dimensions, need history |
| **Type 3** | Limited (previous only) | Medium | Medium | Infrequent changes, need previous value |

### Decision Matrix:

**Use Type 1 when:**
- No historical accuracy needed
- Corrections to errors
- Performance critical
- Simple implementation required

**Use Type 2 when:**
- Full historical accuracy needed
- Track all changes over time
- Support time-based analysis
- Most common choice

**Use Type 3 when:**
- Need only previous value
- Changes are infrequent
- Don't need full history
- Performance important

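### Hybrid Approach (Type 6):

Type 6 applies Types 1, 2, and 3 to the same attribute (1 + 2 + 3 = 6):
- A new row is added for each change (Type 2)
- Each row also carries a "current value" column alongside the historical value (Type 3)
- The current-value column is overwritten on all rows for that natural key whenever the attribute changes (Type 1)

Separately, a single dimension can mix SCD types by attribute, for example Type 1 for `email` and Type 2 for `address`.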
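
One common way to combine the three techniques in a single table is sketched below. The table name `customer_dimension_type6`, the `current_city` column, and the literal surrogate key are hypothetical and chosen only for illustration.

```sql
-- Type 2 columns plus a current_city column that is overwritten
-- on every version of the customer
CREATE TABLE customer_dimension_type6 (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50),
    city VARCHAR(50),            -- value valid during this version (Type 2)
    current_city VARCHAR(50),    -- latest value, same on all versions (Type 1 + 3)
    effective_date DATE,
    expiry_date DATE,
    is_current BOOLEAN
);

-- When CUST-001 moves to Boston:
START TRANSACTION;

-- 1. Expire the current version (Type 2)
UPDATE customer_dimension_type6
SET expiry_date = CURRENT_DATE - INTERVAL '1' DAY,
    is_current = FALSE
WHERE customer_id = 'CUST-001'
    AND is_current = TRUE;

-- 2. Add the new version (Type 2); the key value 3 is illustrative
INSERT INTO customer_dimension_type6
    (customer_key, customer_id, city, current_city,
     effective_date, expiry_date, is_current)
VALUES (3, 'CUST-001', 'Boston', 'Boston', CURRENT_DATE, '9999-12-31', TRUE);

-- 3. Overwrite current_city on every version (Type 1)
UPDATE customer_dimension_type6
SET current_city = 'Boston'
WHERE customer_id = 'CUST-001';

COMMIT;
```

Grouping facts by `city` gives as-was results and grouping by `current_city` gives as-is results, without the self-join that a plain Type 2 table needs.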

### Key Takeaways:

1. **Type 1**: Overwrite, no history, simplest
2. **Type 2**: New row, full history, most common
3. **Type 3**: Previous column, limited history
4. **Choose based on**: Business requirements, change frequency, history needs
5. **Type 2 requires**: Careful date management, validation, proper ETL

### Best Practices:

- Document SCD type for each attribute
- Validate date ranges for Type 2
- Index appropriately (natural key, effective_date, is_current)
- Handle NULLs and unknown values
- Test ETL processes thoroughly
