# ETL Design

## Introduction to ETL Design

ETL design specifies how data is extracted from source systems, transformed, and loaded into the data warehouse. A good design protects data quality, keeps load performance predictable, and makes the pipeline maintainable.

## Build your ETL Design from your ETL Architecture

### ETL Architecture Components:

1. **Source Systems**
   - Identify all source systems
   - Understand data formats
   - Document data quality issues
   - Plan extraction methods

2. **Staging Area**
   - Design staging tables
   - Plan data validation
   - Design error handling
   - Plan audit logging

3. **Data Warehouse**
   - Design target tables
   - Plan transformation rules
   - Design loading strategies
   - Plan indexing and optimization

4. **Data Marts**
   - Design aggregation processes
   - Plan data distribution
   - Design refresh schedules

### ETL Design Process:

```
1. Analyze Source Systems
   ↓
2. Design Staging Layer
   ↓
3. Design Transformation Rules
   ↓
4. Design Target Tables
   ↓
5. Design Load Processes
   ↓
6. Design Error Handling
   ↓
7. Design Monitoring and Logging
```
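As a rough illustration of how the last two steps (error handling, monitoring and logging) can wrap the earlier ones, here is a minimal sketch of a runner that executes steps in order, records an audit entry for each, and stops at the first failure. The function name `run_pipeline` and the audit-entry fields are assumptions made for this example, not part of any particular ETL tool.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


def run_pipeline(steps: list[tuple[str, Callable[[], int]]]) -> list[dict]:
    """Run steps in order; record one audit entry per step and stop at the first failure."""
    audit = []
    for name, step in steps:
        try:
            rows = step()  # each step returns the number of rows it processed
            audit.append({"step": name, "status": "ok", "rows": rows})
            log.info("%s finished, %d rows", name, rows)
        except Exception as exc:
            audit.append({"step": name, "status": "failed", "error": str(exc)})
            log.error("%s failed: %s", name, exc)
            break  # later steps depend on earlier ones, so do not continue
    return audit
```

Each step is passed as a `(name, callable)` pair. The returned list can be written to an audit table so that a failed run shows which step stopped and why.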

### Design Considerations:

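- **Performance**: Size batches and load windows to meet throughput and latency targets
- **Data Quality**: Validate and cleanse data in staging before it reaches the warehouse
- **Reliability**: Log failures, isolate bad rows, and make loads safe to rerun
- **Maintainability**: Document transformation rules and keep jobs modular
- **Scalability**: Design for growth in data volume and in the number of sources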

## Dimension Table ETL

### Dimension ETL Process:

1. **Extract**
   - Read from source systems
   - Handle different source formats
   - Capture all dimension attributes

2. **Transform**
   - Standardize formats
   - Handle missing values
   - Apply business rules
   - Generate surrogate keys

3. **Load**
   - Look up existing dimension rows by natural key
   - Handle SCD changes
   - Insert new dimension rows
   - Update existing dimension rows
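The transform step above can be sketched as a small function that standardizes formats and fills in missing values for one extracted record. The specific rules here (upper-cased IDs, title-cased categories, the default strings) are assumptions chosen for illustration; real rules come from the business requirements. Surrogate key generation is left to the database in this sketch.

```python
def transform_product(row: dict) -> dict:
    """Standardize one extracted product record before it is loaded."""
    # Assumed rule: natural keys are trimmed and compared case-insensitively.
    product_id = str(row["product_id"]).strip().upper()
    # Treat None and whitespace-only strings as missing.
    name = (row.get("product_name") or "").strip()
    category = (row.get("category") or "").strip().title()
    return {
        "product_id": product_id,
        "product_name": name or "Unknown",        # assumed default for a missing name
        "category": category or "Uncategorized",  # assumed default for a missing category
    }
```

Keeping the rules in one pure function makes them easy to unit test and to reuse across initial and incremental loads.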

### Dimension ETL Patterns:

**New Dimension Load:**
```sql
-- Extract and transform (aliases map source columns to the staging names used below)
INSERT INTO staging.dimension_staging
SELECT 
    source_dimension_id   AS product_id,
    source_dimension_name AS product_name,
    source_category       AS category,
    ...
FROM source_system.dimension_table;

-- Load new rows to the dimension
INSERT INTO dw.product_dimension (
    product_key,
    product_id,
    product_name,
    category,
    ...
)
SELECT 
    nextval('product_key_seq'),  -- PostgreSQL syntax; standard SQL is NEXT VALUE FOR product_key_seq
    s.product_id,
    s.product_name,
    s.category,
    ...
FROM staging.dimension_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM dw.product_dimension p
    WHERE p.product_id = s.product_id
);
```

**Dimension Update (Type 1):**
```sql
-- Update existing dimension
UPDATE dw.product_dimension p
SET 
    product_name = s.product_name,
    category = s.category,
    ...
FROM staging.dimension_staging s
WHERE p.product_id = s.product_id;
```

### Dimension ETL Best Practices:

- **Surrogate Key Generation**: Use sequences or identity columns
- **Natural Key Lookup**: Index on natural keys for performance
- **SCD Handling**: Implement appropriate SCD type logic
- **Data Quality**: Validate before loading
- **Incremental Processing**: Only process changed dimensions
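To make the first two practices concrete, the following sketch uses SQLite (through Python's standard `sqlite3` module) as a stand-in database: an auto-generated integer column serves as the surrogate key, and a unique index on the natural key supports fast lookups. The table layout and the helper `insert_product` are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dimension (
        product_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        product_id   TEXT NOT NULL,                      -- natural key from the source
        product_name TEXT
    );
    -- Index the natural key so lookups during loads do not scan the table.
    CREATE UNIQUE INDEX ux_product_natural_key ON product_dimension (product_id);
""")


def insert_product(product_id: str, product_name: str) -> int:
    """Insert a product and return the surrogate key the database assigned."""
    cur = conn.execute(
        "INSERT INTO product_dimension (product_id, product_name) VALUES (?, ?)",
        (product_id, product_name),
    )
    conn.commit()
    return cur.lastrowid
```

The unique index suits a Type 1 dimension, where each natural key appears once. In a Type 2 dimension the natural key repeats across versions, so the index would need to be non-unique or include the current-row flag.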

## Process SCD Type 1 Changes to a Dimension Table

### Type 1 SCD ETL Process:

**Characteristics:**
- Overwrite old values
- No history maintained
- Simple update logic

### ETL Steps:

1. **Extract**: Get dimension data from source
2. **Transform**: Standardize and cleanse
3. **Lookup**: Find existing dimension rows
4. **Update**: Overwrite with new values

### Implementation:

```sql
-- Step 1: Load to staging
INSERT INTO staging.customer_staging
SELECT 
    customer_id,
    customer_name,
    address,
    city,
    state,
    ...
FROM source.customer_table
WHERE last_modified_date > @last_load_date;  -- placeholder: timestamp of the last successful load

-- Step 2: Update existing customers (Type 1)
UPDATE dw.customer_dimension c
SET 
    customer_name = s.customer_name,
    address = s.address,
    city = s.city,
    state = s.state,
    ...
FROM staging.customer_staging s
WHERE c.customer_id = s.customer_id;

-- Step 3: Insert new customers
INSERT INTO dw.customer_dimension (
    customer_key,
    customer_id,
    customer_name,
    address,
    ...
)
SELECT 
    nextval('customer_key_seq'),  -- PostgreSQL syntax; standard SQL is NEXT VALUE FOR customer_key_seq
    customer_id,
    customer_name,
    address,
    ...
FROM staging.customer_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM dw.customer_dimension c
    WHERE c.customer_id = s.customer_id
);
```
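For readers who want to run the Type 1 pattern end to end, here is a minimal sketch against an in-memory SQLite database. It uses correlated subqueries instead of `UPDATE ... FROM` so that it works on older SQLite versions, and it trims the table to three attributes. The function name `apply_type1` and the sample data are assumptions for this example.

```python
import sqlite3


def apply_type1(conn: sqlite3.Connection) -> None:
    """Overwrite attributes of staged customers, then insert customers not yet in the dimension."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("""
            UPDATE customer_dimension
            SET customer_name = (SELECT s.customer_name FROM customer_staging s
                                 WHERE s.customer_id = customer_dimension.customer_id),
                city = (SELECT s.city FROM customer_staging s
                        WHERE s.customer_id = customer_dimension.customer_id)
            WHERE customer_id IN (SELECT customer_id FROM customer_staging)
        """)
        conn.execute("""
            INSERT INTO customer_dimension (customer_id, customer_name, city)
            SELECT s.customer_id, s.customer_name, s.city
            FROM customer_staging s
            WHERE NOT EXISTS (SELECT 1 FROM customer_dimension c
                              WHERE c.customer_id = s.customer_id)
        """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dimension (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key assigned by SQLite
        customer_id TEXT UNIQUE, customer_name TEXT, city TEXT);
    CREATE TABLE customer_staging (customer_id TEXT, customer_name TEXT, city TEXT);
    INSERT INTO customer_dimension (customer_id, customer_name, city)
        VALUES ('C1', 'Alice', 'Boston');
    INSERT INTO customer_staging VALUES ('C1', 'Alice', 'Denver'), ('C2', 'Bob', 'Austin');
""")
apply_type1(conn)
rows = conn.execute(
    "SELECT customer_key, customer_id, customer_name, city "
    "FROM customer_dimension ORDER BY customer_key").fetchall()
# C1 keeps surrogate key 1 with the new city; C2 is inserted with key 2.
```

Because the surrogate key of `C1` does not change, any fact rows that reference it continue to resolve to the updated customer.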

### Type 1 ETL Considerations:

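- **Performance**: Simple in-place updates with no versioning, so processing is fast
- **Data Loss**: Previous attribute values are overwritten and cannot be recovered from the warehouse
- **Referential Integrity**: Surrogate keys do not change, so existing fact rows keep pointing at the updated dimension row
- **Use Cases**: Corrections and attributes whose history is not needed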

## Process SCD Type 2 Changes to a Dimension Table

### Type 2 SCD ETL Process:

**Characteristics:**
- Create new row for changes
- Preserve full history
- Manage effective/expiry dates
- More complex logic

### ETL Steps:

1. **Extract**: Get dimension data from source
2. **Transform**: Standardize and cleanse
3. **Compare**: Identify changed attributes
4. **Expire**: Set expiry date on current row
5. **Insert**: Create new row with new values

### Implementation:

```sql
-- Step 1: Load to staging
INSERT INTO staging.customer_staging
SELECT 
    customer_id,
    customer_name,
    address,
    city,
    state,
    ...
FROM source.customer_table
WHERE last_modified_date > @last_load_date;

-- Step 2: Identify changed rows
CREATE TEMPORARY TABLE changed_customers AS
SELECT 
    s.customer_id,
    s.customer_name,
    s.address,
    s.city,
    s.state,
    ...
FROM staging.customer_staging s
JOIN dw.customer_dimension c 
    ON s.customer_id = c.customer_id
    AND c.is_current = TRUE
-- IS DISTINCT FROM is NULL-safe; with != a change to or from NULL would go undetected
WHERE s.address IS DISTINCT FROM c.address
    OR s.city IS DISTINCT FROM c.city
    OR s.state IS DISTINCT FROM c.state
    -- Compare all relevant attributes
    ...;

-- Steps 3 and 4 run in one transaction so a failure cannot leave
-- a customer without a current row
BEGIN;

-- Step 3: Expire current rows
UPDATE dw.customer_dimension c
SET 
    expiry_date = CURRENT_DATE - 1,
    is_current = FALSE
FROM changed_customers ch
WHERE c.customer_id = ch.customer_id
    AND c.is_current = TRUE;

-- Step 4: Insert new rows
INSERT INTO dw.customer_dimension (
    customer_key,
    customer_id,
    customer_name,
    address,
    city,
    state,
    effective_date,
    expiry_date,
    is_current
)
SELECT 
    nextval('customer_key_seq'),  -- PostgreSQL syntax; standard SQL is NEXT VALUE FOR customer_key_seq
    customer_id,
    customer_name,
    address,
    city,
    state,
    CURRENT_DATE,
    '9999-12-31',
    TRUE
FROM changed_customers;

COMMIT;

-- Step 5: Insert new customers (not just changed)
INSERT INTO dw.customer_dimension (...)
SELECT ...
FROM staging.customer_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM dw.customer_dimension c
    WHERE c.customer_id = s.customer_id
);
```
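The Type 2 flow can be tried out with the following minimal sketch, which runs against an in-memory SQLite database. It differs from the SQL above in two labeled ways: it folds the "insert new versions" and "insert new customers" steps into one statement (any staged customer without a current row is either just expired or brand new), and it takes the load date as a parameter instead of using `CURRENT_DATE`. The function name `apply_type2` and the sample data are assumptions for this example.

```python
import sqlite3


def apply_type2(conn: sqlite3.Connection, load_date: str) -> None:
    """Expire changed current rows, then insert new versions and new customers."""
    params = {"load_date": load_date}
    with conn:  # expire and insert commit together or not at all
        conn.execute("""
            UPDATE customer_dimension
            SET expiry_date = date(:load_date, '-1 day'), is_current = 0
            WHERE is_current = 1
              AND EXISTS (
                  SELECT 1 FROM customer_staging s
                  WHERE s.customer_id = customer_dimension.customer_id
                    -- IS NOT is the NULL-safe inequality in SQLite
                    AND (s.customer_name IS NOT customer_dimension.customer_name
                         OR s.city IS NOT customer_dimension.city))
        """, params)
        # A staged customer with no current row was either just expired or is new.
        conn.execute("""
            INSERT INTO customer_dimension
                (customer_id, customer_name, city, effective_date, expiry_date, is_current)
            SELECT s.customer_id, s.customer_name, s.city, :load_date, '9999-12-31', 1
            FROM customer_staging s
            WHERE NOT EXISTS (SELECT 1 FROM customer_dimension c
                              WHERE c.customer_id = s.customer_id AND c.is_current = 1)
        """, params)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dimension (
        customer_key INTEGER PRIMARY KEY, customer_id TEXT, customer_name TEXT, city TEXT,
        effective_date TEXT, expiry_date TEXT, is_current INTEGER);
    CREATE TABLE customer_staging (customer_id TEXT, customer_name TEXT, city TEXT);
    INSERT INTO customer_dimension
        (customer_id, customer_name, city, effective_date, expiry_date, is_current)
    VALUES ('C1', 'Alice', 'Boston', '2024-01-01', '9999-12-31', 1),
           ('C2', 'Bob', 'Austin', '2024-01-01', '9999-12-31', 1);
    INSERT INTO customer_staging VALUES
        ('C1', 'Alice', 'Denver'),   -- changed city
        ('C2', 'Bob', 'Austin'),     -- unchanged
        ('C3', 'Cara', 'Miami');     -- new customer
""")
apply_type2(conn, "2024-06-01")
history = conn.execute("""
    SELECT customer_id, city, effective_date, expiry_date, is_current
    FROM customer_dimension ORDER BY customer_id, effective_date""").fetchall()
```

After the run, `C1` has an expired Boston row and a current Denver row, `C2` is untouched, and `C3` has a single current row. The sketch assumes at most one load per customer per day; a second change on the same load date would produce an expiry date earlier than the effective date and needs separate handling.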

### Type 2 ETL Considerations:

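- **Change Detection**: Compare all tracked attributes, using NULL-safe comparisons
- **Date Management**: Set the new row's effective date to the day after the previous row's expiry date so the ranges neither gap nor overlap
- **Atomicity**: Expire the old row and insert the new row in the same transaction
- **Performance**: Index on natural key and is_current
- **Validation**: Ensure no overlapping date ranges and exactly one current row per natural key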
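The validation point can be checked with two queries, sketched here with Python's `sqlite3` module. The function name `find_scd2_violations` and the assumption that dates are stored as ISO `YYYY-MM-DD` strings (so that string comparison matches date order) are specific to this example.

```python
import sqlite3


def find_scd2_violations(conn: sqlite3.Connection) -> dict:
    """Return natural keys with overlapping date ranges or without exactly one current row."""
    # Two versions of the same customer overlap when each starts before the other ends.
    overlaps = [r[0] for r in conn.execute("""
        SELECT DISTINCT a.customer_id
        FROM customer_dimension a
        JOIN customer_dimension b
          ON a.customer_id = b.customer_id
         AND a.customer_key < b.customer_key
         AND a.effective_date <= b.expiry_date
         AND b.effective_date <= a.expiry_date
        ORDER BY a.customer_id""")]
    # is_current is stored as 0/1, so its sum is the number of current rows.
    bad_current = [r[0] for r in conn.execute("""
        SELECT customer_id
        FROM customer_dimension
        GROUP BY customer_id
        HAVING SUM(is_current) <> 1
        ORDER BY customer_id""")]
    return {"overlapping_ranges": overlaps, "not_exactly_one_current": bad_current}
```

Running a check like this after each dimension load, and failing the job when either list is non-empty, catches date-management bugs before fact lookups depend on the affected rows.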

### Change Detection Methods:

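1. **Column Comparison**: Compare each tracked attribute between staging and the current dimension row
2. **Hash Comparison**: Calculate a hash of the tracked attributes and compare hashes instead of individual columns
3. **CDC**: Use change data capture to read inserts, updates, and deletes recorded by the source
4. **Timestamp**: Filter on a last-modified timestamp, which depends on the source maintaining it reliably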
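Hash comparison is the least obvious of these methods, so here is a minimal sketch using Python's standard `hashlib`. The column list, the choice of SHA-256, and the helper names `row_hash` and `has_changed` are assumptions for this example; it also assumes attribute values never contain the `\x1f` separator character.

```python
import hashlib

TRACKED_COLUMNS = ["customer_name", "address", "city", "state"]  # assumed tracked attributes


def row_hash(row: dict, columns=TRACKED_COLUMNS) -> str:
    """Return a SHA-256 hex digest over the tracked attributes of one row."""
    parts = []
    for col in columns:
        value = row.get(col)
        # Prefix marks NULL vs. value so that None and "" hash differently.
        parts.append("N" if value is None else "V" + str(value))
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def has_changed(staged: dict, current: dict) -> bool:
    """True when any tracked attribute differs between the staged and current rows."""
    return row_hash(staged) != row_hash(current)
```

In practice the hash of the current row is usually stored as a column in the dimension, so change detection compares one value per row instead of every attribute.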

## Design ETL for Fact Tables

### Fact Table ETL Process:

1. **Extract**: Get transaction/event data from sources
2. **Transform**: 
   - Look up dimension keys
   - Calculate derived facts
   - Apply business rules
   - Handle nulls and defaults
3. **Load**: Insert into fact table

### Fact Table ETL Pattern:

```sql
-- Step 1: Extract to staging
INSERT INTO staging.sales_staging
SELECT 
    transaction_date,
    product_id,
    customer_id,
    store_id,
    sales_amount,
    quantity,
    ...
FROM source.sales_transactions
WHERE transaction_date > @last_load_date;

-- Step 2: Transform and look up dimension keys
-- Inner joins drop staged rows whose dimension member is missing;
-- see "Handling Missing Dimensions" for a LEFT JOIN alternative
INSERT INTO staging.sales_fact_staging
SELECT 
    d.date_key,
    p.product_key,
    c.customer_key,
    s.store_key,
    st.sales_amount,
    st.quantity,
    st.sales_amount - st.cost_amount AS profit_amount,  -- cost_amount must be among the staged columns
    st.transaction_number
FROM staging.sales_staging st
JOIN dw.date_dimension d 
    ON st.transaction_date = d.date_value
JOIN dw.product_dimension p 
    ON st.product_id = p.product_id
    AND p.is_current = TRUE  -- Current version of a Type 2 SCD; late-arriving facts need an effective/expiry date match instead
JOIN dw.customer_dimension c 
    ON st.customer_id = c.customer_id
    AND c.is_current = TRUE
JOIN dw.store_dimension s 
    ON st.store_id = s.store_id
    AND s.is_current = TRUE;

-- Step 3: Load to fact table
INSERT INTO dw.sales_fact (
    date_key,
    product_key,
    customer_key,
    store_key,
    sales_amount,
    quantity,
    profit_amount,
    transaction_number
)
SELECT 
    date_key,
    product_key,
    customer_key,
    store_key,
    sales_amount,
    quantity,
    profit_amount,
    transaction_number
FROM staging.sales_fact_staging;
```
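Joining on `is_current = TRUE` gives the right key only when facts are loaded while that dimension version is still in effect. For late-arriving facts, a common alternative is to match the transaction date against each version's effective and expiry dates. The sketch below shows that lookup in SQLite; the table layout, the sample rows, and the function name `lookup_product_keys` are assumptions for this example, and dates are assumed to be ISO `YYYY-MM-DD` strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dimension (
        product_key INTEGER PRIMARY KEY, product_id TEXT, category TEXT,
        effective_date TEXT, expiry_date TEXT, is_current INTEGER);
    CREATE TABLE sales_staging (
        transaction_number TEXT, transaction_date TEXT, product_id TEXT);
    INSERT INTO product_dimension VALUES
        (1, 'P1', 'Toys',  '2024-01-01', '2024-05-31', 0),
        (2, 'P1', 'Games', '2024-06-01', '9999-12-31', 1);
    INSERT INTO sales_staging VALUES
        ('T1', '2024-03-15', 'P1'),   -- late-arriving sale from before the change
        ('T2', '2024-07-01', 'P1');   -- sale after the change
""")


def lookup_product_keys(conn: sqlite3.Connection) -> list:
    """Resolve each staged sale to the product version in effect on its transaction date."""
    return conn.execute("""
        SELECT st.transaction_number, p.product_key
        FROM sales_staging st
        JOIN product_dimension p
          ON st.product_id = p.product_id
         AND st.transaction_date BETWEEN p.effective_date AND p.expiry_date
        ORDER BY st.transaction_number""").fetchall()


keys = lookup_product_keys(conn)
```

Here `T1` resolves to the expired "Toys" version and `T2` to the current "Games" version, whereas an `is_current` join would assign both to the current row.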

### Fact Table ETL Considerations:

1. **Dimension Lookups**
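   - Use the current version for Type 2 SCDs when facts load as they occur; for late-arriving facts, use the version in effect on the transaction date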
   - Handle missing dimensions (use "Unknown" row)
   - Index dimension natural keys for performance

2. **Grain Validation**
   - Ensure one row per grain
   - Check for duplicates
   - Validate business rules

3. **Fact Calculations**
   - Calculate derived facts
   - Apply business rules
   - Handle nulls appropriately

4. **Performance**
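   - Use bulk insert operations instead of row-by-row inserts
   - Drop or disable nonessential indexes before large loads and rebuild them afterward
   - Partition large fact tables, typically by date
   - Load independent partitions or tables in parallel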
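Grain validation is easy to automate: group the staged fact rows by the columns that define the grain and report any group with more than one row. The sketch below assumes the staging table is named `sales_fact_staging` as in the pattern above; the function name `find_grain_violations` and the choice of grain columns are assumptions for this example.

```python
import sqlite3


def find_grain_violations(conn: sqlite3.Connection, grain_columns: list) -> list:
    """Return grain-column values that occur more than once, with their row counts."""
    # Column names are interpolated into the SQL, so they must come from
    # trusted configuration and never from user input.
    cols = ", ".join(grain_columns)
    sql = f"""
        SELECT {cols}, COUNT(*) AS row_count
        FROM sales_fact_staging
        GROUP BY {cols}
        HAVING COUNT(*) > 1
        ORDER BY {cols}"""
    return conn.execute(sql).fetchall()
```

A call such as `find_grain_violations(conn, ["transaction_number", "product_key"])` returns an empty list when the grain holds. Any returned rows should block the load or be routed to an error table.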

### Handling Missing Dimensions:

```sql
-- Create "Unknown" dimension rows
INSERT INTO dw.product_dimension (
    product_key,
    product_id,
    product_name,
    ...
)
VALUES (
    -1,  -- Special key for unknown
    'UNKNOWN',
    'Unknown Product',
    ...
);

-- Use unknown key when dimension not found
SELECT 
    COALESCE(p.product_key, -1) as product_key,
    ...
FROM staging.sales_staging st
LEFT JOIN dw.product_dimension p 
    ON st.product_id = p.product_id
    AND p.is_current = TRUE;
```
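The following sketch runs the same "Unknown" row technique in SQLite and also collects the natural keys that failed to match, so they can be logged and fixed at the source. The table layout and sample rows are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dimension (
        product_key INTEGER PRIMARY KEY, product_id TEXT,
        product_name TEXT, is_current INTEGER);
    CREATE TABLE sales_staging (
        transaction_number TEXT, product_id TEXT, sales_amount REAL);
    INSERT INTO product_dimension VALUES
        (-1, 'UNKNOWN', 'Unknown Product', 1),
        (1, 'P1', 'Widget', 1);
    -- P9 has no matching dimension row
    INSERT INTO sales_staging VALUES ('T1', 'P1', 10.0), ('T2', 'P9', 5.0);
""")

# Resolve keys, falling back to the Unknown row's key when the lookup fails.
resolved = conn.execute("""
    SELECT st.transaction_number, COALESCE(p.product_key, -1) AS product_key
    FROM sales_staging st
    LEFT JOIN product_dimension p
      ON st.product_id = p.product_id AND p.is_current = 1
    ORDER BY st.transaction_number""").fetchall()

# Collect the natural keys that did not match, for logging and follow-up.
unmatched_ids = [r[0] for r in conn.execute("""
    SELECT DISTINCT st.product_id
    FROM sales_staging st
    LEFT JOIN product_dimension p
      ON st.product_id = p.product_id AND p.is_current = 1
    WHERE p.product_key IS NULL
    ORDER BY st.product_id""")]
```

Keeping the fact row with key `-1` preserves the measure totals, while the list of unmatched IDs gives the team something actionable to investigate.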

### Incremental Fact Loading:

```sql
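-- Identify new transactions
INSERT INTO staging.sales_staging
SELECT ...
FROM source.sales_transactions t
WHERE t.transaction_date > @last_load_date
    -- NOT EXISTS instead of NOT IN: NOT IN matches nothing if any transaction_number is NULL
    AND NOT EXISTS (
        SELECT 1
        FROM dw.sales_fact f
        WHERE f.transaction_number = t.transaction_id
    );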
```
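The `@last_load_date` value has to come from somewhere. One common approach is a small control table that stores a watermark per job and is advanced in the same transaction as the load. The sketch below illustrates that idea in SQLite. The table `etl_watermark`, the column `transaction_ts`, and the function `load_increment` are assumptions for this example, and the three tables it reads and writes are expected to exist already.

```python
import sqlite3


def load_increment(conn: sqlite3.Connection, job_name: str = "sales") -> int:
    """Stage transactions newer than the stored watermark and advance it; return rows staged."""
    row = conn.execute(
        "SELECT last_load_ts FROM etl_watermark WHERE job_name = ?", (job_name,)
    ).fetchone()
    # No watermark yet means this is the initial load.
    last_load_ts = row[0] if row else "0001-01-01 00:00:00"

    with conn:  # staging insert and watermark update commit together
        cur = conn.execute("""
            INSERT INTO sales_staging (transaction_id, transaction_ts, sales_amount)
            SELECT transaction_id, transaction_ts, sales_amount
            FROM sales_transactions
            WHERE transaction_ts > ?""", (last_load_ts,))
        loaded = cur.rowcount
        if loaded:
            new_ts = conn.execute(
                "SELECT MAX(transaction_ts) FROM sales_staging").fetchone()[0]
            conn.execute(
                "INSERT OR REPLACE INTO etl_watermark (job_name, last_load_ts) VALUES (?, ?)",
                (job_name, new_ts))
    return loaded
```

Advancing the watermark to the largest timestamp actually loaded, instead of to the wall-clock time of the run, avoids skipping rows that were committed in the source while the job was running. Rows that arrive later with an older timestamp would still be missed, which is why the anti-join against the fact table remains useful as a second check.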

### Fact Table ETL Best Practices:

- **Grain Validation**: Ensure correct grain
- **Dimension Lookups**: Use current versions, handle missing
- **Bulk Operations**: Use bulk insert for performance
- **Error Handling**: Log and handle errors gracefully
- **Incremental Loading**: Only process new/changed data
- **Partitioning**: Partition by date for performance
- **Indexing**: Create indexes after load

## Summarize ETL Design

### Key Components:

1. **Architecture**: Source → Staging → Warehouse → Marts
2. **Dimension ETL**: Extract, transform, handle SCDs, load
3. **Fact ETL**: Extract, lookup dimensions, transform, load
4. **Error Handling**: Validation, logging, recovery
5. **Performance**: Optimization, parallel processing, indexing

### ETL Design Principles:

- **Modularity**: Separate extract, transform, load phases
- **Reusability**: Reusable transformation components
- **Maintainability**: Clear documentation and structure
- **Reliability**: Error handling and recovery
- **Performance**: Optimize for throughput
- **Scalability**: Design for growth

### Best Practices:

- **Staging Layer**: Use for validation and transformation
- **SCD Handling**: Implement appropriate SCD type logic
- **Dimension Lookups**: Efficient lookup strategies
- **Grain Validation**: Ensure correct fact table grain
- **Error Handling**: Comprehensive error handling and logging
- **Testing**: Thorough testing of ETL processes
- **Documentation**: Document all transformation rules

### Common Patterns:

- **Initial Load**: Full historical data load
- **Incremental Load**: Process only new/changed data
- **Full Refresh**: Reload entire table
- **Upsert**: Update existing, insert new
- **SCD Processing**: Type 1, Type 2, Type 3 handling

### Next Steps:
- Understand data warehousing environments
- Learn about cloud vs. on-premises considerations
