# Module 04 - ETL Concepts and Data Transformation

## Overview

This module covers fundamental ETL (Extract, Transform, Load) concepts, data mapping, data profiling, and transformation techniques that form the foundation of data engineering.

## Learning Objectives

By the end of this module, you will understand:
- ETL fundamentals and concepts
- ETL vs ELT patterns
- Data mapping techniques
- Data profiling and quality assessment
- Common data transformation operations
- Best practices for ETL design


## What is ETL?

**ETL** stands for **Extract, Transform, Load** - a three-phase process for moving and transforming data.

### The Three Phases

#### 1. Extract
- **Purpose**: Read data from source systems
- **Activities**: 
  - Connect to source systems (databases, files, APIs)
  - Query or read data
  - Handle different data formats
  - Manage connection errors

#### 2. Transform
- **Purpose**: Clean, validate, and transform data
- **Activities**:
  - Data cleaning (remove duplicates, handle nulls)
  - Data validation (check data types, ranges)
  - Data enrichment (add calculated fields)
  - Data aggregation (summarize data)
  - Data formatting (standardize formats)

#### 3. Load
- **Purpose**: Write transformed data to the destination
- **Activities**:
  - Connect to target system
  - Write data in appropriate format
  - Handle loading errors
  - Maintain data integrity

### ETL Process Flow

```
Source Systems → Extract → Transform → Load → Target System
     ↓             ↓          ↓          ↓          ↓
  Databases    Read Data   Clean      Write    Data Warehouse
  Files        Query       Validate   Insert   Data Lake
  APIs         Fetch       Enrich     Update   Database
```
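The three phases can be sketched end to end with only the Python standard library. This is a minimal illustration, not a production pipeline: the inline CSV text, the `customers` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file, database, or API.
SOURCE_CSV = """customer_id,name,amount
1, Alice ,120.50
2,Bob,
2,Bob,
3,Carol,80.00
"""

def extract(text):
    """Extract: read raw rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names, default missing amounts to 0, drop duplicate ids."""
    seen, clean = set(), []
    for row in rows:
        customer_id = int(row["customer_id"])
        if customer_id in seen:
            continue  # keep only the first occurrence of each id
        seen.add(customer_id)
        amount = float(row["amount"]) if row["amount"].strip() else 0.0
        clean.append((customer_id, row["name"].strip(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real target system
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT * FROM customers ORDER BY customer_id").fetchall()
print(result)
```

Keeping each phase in its own function makes it easier to test the transformation logic without touching the source or the target.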

### Why ETL?

- **Data Integration**: Combine data from multiple sources
- **Data Quality**: Clean and validate data before use
- **Data Consistency**: Standardize formats and values
- **Performance**: Optimize data for analytics
- **Compliance**: Ensure data meets business rules


## ETL vs ELT

### ETL (Extract, Transform, Load)

**Traditional Approach**: Transform data before loading to target.

```
Source → Extract → Transform (in ETL tool) → Load → Target
```

**Characteristics:**
- Transformations happen in ETL tool/memory
- Uses ETL server processing power
- Data is cleaned before loading
- Smaller data volumes during transformation

**Advantages:**
- Data is clean when loaded
- Reduced storage in target
- Better for complex transformations
- Works with limited target resources

**Disadvantages:**
- Requires powerful ETL servers
- Slower for large data volumes
- Limited scalability
- Higher infrastructure costs

**Use Cases:**
- Small to medium data volumes
- Complex transformations needed
- Target system has limited resources
- On-premises data warehouses

### ELT (Extract, Load, Transform)

**Modern Approach**: Load raw data first, then transform in target.

```
Source → Extract → Load → Transform (in target) → Analytics
```

**Characteristics:**
- Transformations happen in target system
- Uses target system processing power (cloud scale)
- Raw data loaded first
- Leverages cloud/data lake capabilities

**Advantages:**
- Leverages target system power (cloud)
- Faster for large data volumes
- More scalable
- Lower infrastructure costs
- Preserves raw data for reprocessing

**Disadvantages:**
- Requires powerful target system
- Raw data takes more storage
- Transformations may be slower if target is busy

**Use Cases:**
- Large data volumes (big data)
- Cloud data warehouses/lakes
- Need to preserve raw data
- Target system has powerful compute (Synapse, Databricks)
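To make the ordering difference concrete, here is a small ELT-style sketch. It assumes an in-memory SQLite database as a stand-in for a cloud warehouse, and the table names `raw_orders` and `orders_clean` are hypothetical. The raw rows are loaded untouched, and the cleanup runs afterwards as SQL inside the target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target warehouse

# Load: land the raw records as-is (all text, no cleaning yet).
conn.execute("CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)")
raw_rows = [("1", " east ", "100.5"), ("2", "WEST", "20"), ("3", "East", "")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: performed by the target's own SQL engine after loading.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER)        AS order_id,
           UPPER(TRIM(region))              AS region,
           CAST(NULLIF(amount, '') AS REAL) AS amount
    FROM raw_orders
""")

clean = conn.execute("SELECT * FROM orders_clean ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(clean, raw_count)
```

The raw table is still available after the transformation, which is what allows reprocessing with different rules later.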

### When to Choose Which?

**Choose ETL when:**
- Small to medium data volumes
- Complex transformations needed
- Target system has limited resources
- On-premises environment

**Choose ELT when:**
- Large data volumes (big data)
- Cloud-based target systems
- Need to preserve raw data
- Target has powerful compute (Synapse, Databricks, Spark)


## Data Mapping

**Data Mapping** is the process of defining how data from source systems maps to target systems, including field mappings, transformations, and business rules.

### Types of Data Mapping

#### 1. Direct Mapping
- Source field maps directly to target field
- No transformation needed
- Example: `CustomerID → Customer_ID`

#### 2. Calculated Mapping
- Target field is calculated from source fields
- Uses formulas or expressions
- Example: `FullName = FirstName + ' ' + LastName`

#### 3. Lookup Mapping
- Target field value comes from lookup table
- Reference data mapping
- Example: `CountryCode → CountryName` (from lookup table)

#### 4. Conditional Mapping
- Target field value depends on conditions
- Uses IF-THEN-ELSE logic
- Example: `Status = IF(Amount > 1000, 'High', 'Low')`

#### 5. Aggregation Mapping
- Target field is aggregated from multiple source rows
- SUM, COUNT, AVG, MIN, MAX
- Example: `TotalSales = SUM(SalesAmount)`
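The five mapping types above could be expressed in pandas roughly as follows. The `source` and `countries` frames and all of their values are made up for illustration; only the column names echo the examples in this section.

```python
import pandas as pd

# Hypothetical source rows and lookup table.
source = pd.DataFrame({
    "CustomerID": [1, 2, 2],
    "FirstName": ["Ada", "Grace", "Grace"],
    "LastName": ["Lovelace", "Hopper", "Hopper"],
    "CountryCode": ["GB", "US", "US"],
    "Amount": [1500.0, 200.0, 900.0],
})
countries = pd.DataFrame({
    "CountryCode": ["GB", "US"],
    "CountryName": ["United Kingdom", "United States"],
})
country_lookup = countries.set_index("CountryCode")["CountryName"]

target = pd.DataFrame({
    # 1. Direct mapping: copy the value under the target name
    "Customer_ID": source["CustomerID"],
    # 2. Calculated mapping: derive from several source fields
    "FullName": source["FirstName"] + " " + source["LastName"],
    # 3. Lookup mapping: translate a code through reference data
    "CountryName": source["CountryCode"].map(country_lookup),
    # 4. Conditional mapping: IF Amount > 1000 THEN 'High' ELSE 'Low'
    "Status": source["Amount"].gt(1000).map({True: "High", False: "Low"}),
})

# 5. Aggregation mapping: many source rows collapse to one target row
total_sales = (
    source.groupby("CustomerID")["Amount"].sum().rename("TotalSales").reset_index()
)
print(target)
print(total_sales)
```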

### Data Mapping Document

A data mapping document typically includes:

| Source Field | Source Data Type | Target Field | Target Data Type | Transformation Rule | Notes |
|-------------|------------------|--------------|------------------|---------------------|-------|
| CustID | INT | Customer_ID | INT | Direct | Primary key |
| FName | VARCHAR(50) | First_Name | VARCHAR(100) | Direct | Target is wider; no truncation needed |
| LName | VARCHAR(50) | Last_Name | VARCHAR(100) | Direct | Target is wider; no truncation needed |
| FullName | - | Full_Name | VARCHAR(200) | FName + ' ' + LName | Calculated |
| DOB | DATE | Birth_Date | DATE | Direct | Format: YYYY-MM-DD |
| Salary | DECIMAL(10,2) | Annual_Salary | DECIMAL(12,2) | Direct | Precision widened (10 to 12); scale unchanged |

### Best Practices for Data Mapping

- ✅ **Document Everything**: Maintain detailed mapping documentation
- ✅ **Validate Mappings**: Test mappings with sample data
- ✅ **Handle Nulls**: Define how null values are handled
- ✅ **Data Type Compatibility**: Ensure compatible data types
- ✅ **Business Rules**: Document all business logic
- ✅ **Version Control**: Track changes to mappings
- ✅ **Review with Stakeholders**: Get business approval


## Data Profiling

**Data Profiling** is the process of examining data to understand its structure, quality, and characteristics before designing ETL processes.

### Why Data Profiling?

- **Understand Data**: Know what you're working with
- **Identify Issues**: Find data quality problems early
- **Design ETL**: Make informed decisions about transformations
- **Estimate Effort**: Understand complexity and time needed
- **Validate Assumptions**: Verify data matches expectations

### Data Profiling Activities

#### 1. Structure Analysis
- **Column Count**: Number of columns/fields
- **Row Count**: Number of records
- **Data Types**: Types of each field
- **Schema**: Overall structure

#### 2. Content Analysis
- **Sample Data**: View sample records
- **Value Patterns**: Common patterns in data
- **Uniqueness**: Identify unique values
- **Distributions**: Value distributions

#### 3. Quality Analysis
- **Null Values**: Count and percentage of nulls
- **Duplicates**: Identify duplicate records
- **Data Ranges**: Min, max, average values
- **Outliers**: Unusual values
- **Format Consistency**: Consistent formats?

#### 4. Relationship Analysis
- **Foreign Keys**: Relationships between tables
- **Referential Integrity**: Valid references?
- **Dependencies**: Data dependencies

### Data Profiling Metrics

| Metric | Description | Example |
|--------|-------------|---------|
| **Completeness** | Percentage of non-null values | 95% complete |
| **Uniqueness** | Percentage of unique values | 80% unique |
| **Validity** | Percentage of valid values | 90% valid |
| **Consistency** | Percentage of consistent values | 85% consistent |
| **Accuracy** | Percentage of accurate values | 92% accurate |
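Three of these metrics can be computed directly from a column. The sketch below assumes a hypothetical `Email` column, defines uniqueness as distinct values divided by non-null values, and uses a deliberately simple regular expression as the validity rule. Consistency and accuracy are omitted because they need a second source or a trusted reference to compare against.

```python
import pandas as pd

# Hypothetical column to profile.
df = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", None, "not-an-email", "a@example.com"],
})

col = df["Email"]
non_null = col.dropna()

# Completeness: share of rows that have a value
completeness = round(float(col.notna().mean() * 100), 1)

# Uniqueness: distinct values among the non-null values (assumed definition)
uniqueness = round(float(non_null.nunique() / len(non_null) * 100), 1)

# Validity: share of non-null values matching a simple, illustrative email pattern
is_valid = non_null.str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
validity = round(float(is_valid.mean() * 100), 1)

print(completeness, uniqueness, validity)
```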

### Data Profiling Tools

- **Azure Data Factory**: Column statistics in the mapping data flow data preview (debug mode)
- **SQL Queries**: Manual profiling with SQL
- **Python/Pandas**: Programmatic profiling
- **Azure Databricks**: Spark-based profiling
- **Third-party Tools**: Talend, Informatica, etc.

### Example: Data Profiling SQL

```sql
-- Row count
SELECT COUNT(*) as TotalRows FROM SourceTable;

-- Null analysis
-- 0.0/1.0 force decimal math; AVG over integers truncates in some engines (e.g. SQL Server)
SELECT 
    COUNT(*) as TotalRows,
    SUM(CASE WHEN Column1 IS NULL THEN 1 ELSE 0 END) as NullCount,
    AVG(CASE WHEN Column1 IS NULL THEN 0.0 ELSE 1.0 END) * 100 as Completeness
FROM SourceTable;

-- Value distribution
SELECT Column1, COUNT(*) as Frequency
FROM SourceTable
GROUP BY Column1
ORDER BY Frequency DESC;

-- Data range
SELECT 
    MIN(Amount) as MinAmount,
    MAX(Amount) as MaxAmount,
    AVG(Amount) as AvgAmount
FROM SourceTable;
```


## Common Data Transformations

### 1. Data Cleaning

**Purpose**: Remove or fix data quality issues.

**Operations:**
- **Remove Duplicates**: Eliminate duplicate records
- **Handle Nulls**: Fill nulls with defaults or remove
- **Trim Whitespace**: Remove leading/trailing spaces
- **Standardize Formats**: Consistent date, phone formats
- **Remove Invalid Characters**: Clean special characters

**Example:**
```python
# Remove duplicates
df = df.drop_duplicates()

# Fill nulls (assign the result back; in-place fillna on a selected column is unreliable in recent pandas)
df['Column1'] = df['Column1'].fillna('Unknown')

# Trim whitespace
df['Name'] = df['Name'].str.strip()
```

### 2. Data Type Conversion

**Purpose**: Convert data to appropriate types.

**Operations:**
- **String to Number**: Convert text to numeric
- **String to Date**: Parse date strings
- **Number to String**: Convert numbers to text
- **Type Casting**: Change data types

**Example:**
```python
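import pandas as pd

# String to number; values that cannot be parsed become NaN
df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')

# String to date; raises ValueError if a value does not match the format
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')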
```

### 3. Data Enrichment

**Purpose**: Add calculated or derived fields.

**Operations:**
- **Calculated Fields**: Add computed columns
- **Lookups**: Join with reference data
- **Concatenation**: Combine fields
- **Derived Values**: Calculate from existing fields

**Example:**
```python
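import pandas as pd

# Calculated field
df['FullName'] = df['FirstName'] + ' ' + df['LastName']

# Derived value: approximate age in whole years (assumes BirthDate is datetime64; ignores leap days)
df['Age'] = (pd.Timestamp.now() - df['BirthDate']).dt.days // 365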
```

### 4. Data Filtering

**Purpose**: Select a subset of data based on conditions.

**Operations:**
- **Row Filtering**: Filter rows by conditions
- **Column Selection**: Select specific columns
- **Conditional Logic**: IF-THEN-ELSE filtering

**Example:**
```python
# Filter rows
df_filtered = df[df['Amount'] > 1000]

# Select columns
df_selected = df[['CustomerID', 'Name', 'Amount']]
```

### 5. Data Aggregation

**Purpose**: Summarize data by groups.

**Operations:**
- **Group By**: Group data by fields
- **Aggregate Functions**: SUM, COUNT, AVG, MIN, MAX
- **Pivoting**: Reshape data

**Example:**
```python
# Aggregation
df_agg = df.groupby('Region').agg({
    'Sales': 'sum',
    'Orders': 'count',
    'Amount': 'mean'
})
```

### 6. Data Joining

**Purpose**: Combine data from multiple sources.

**Operations:**
- **Inner Join**: Matching records only
- **Left Join**: All left records + matching right
- **Right Join**: All right records + matching left
- **Full Outer Join**: All records from both

**Example:**
```python
# Join
df_joined = df1.merge(df2, on='CustomerID', how='left')
```
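When joining, it often helps to check which rows failed to match and to guard against accidental row multiplication. The sketch below uses the `indicator` and `validate` parameters of `DataFrame.merge`; the `orders` and `customers` frames are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: order 3 references a customer that does not exist.
orders = pd.DataFrame({"CustomerID": [1, 2, 4], "Amount": [10.0, 20.0, 30.0]})
customers = pd.DataFrame({"CustomerID": [1, 2, 3], "Name": ["Ada", "Grace", "Linus"]})

joined = orders.merge(
    customers,
    on="CustomerID",
    how="left",
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate="many_to_one",  # raises MergeError if customers has duplicate keys
)

# Orders with no matching customer
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["CustomerID", "Amount"]])
```

A left join keeps the row count of the left side only when the right side has unique keys, which is what `validate="many_to_one"` enforces.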

### 7. Data Splitting

**Purpose**: Split data into multiple outputs.

**Operations:**
- **Split by Condition**: Route data based on conditions
- **Split by Value**: Separate by field values
- **Multiple Outputs**: Send to different destinations

**Example:**
```python
# Split by condition
df_high = df[df['Amount'] > 1000]
df_low = df[df['Amount'] <= 1000]
```


## ETL Design Best Practices

### 1. Incremental Loading

**Load only new or changed data**, not full datasets.

**Benefits:**
- Faster execution
- Lower resource usage
- Reduced network traffic
- Lower costs

**Techniques:**
- **Timestamp-based**: Load records modified after last run
- **Change Data Capture (CDC)**: Track changes at source
- **Watermarking**: Track last processed record
- **Hash Comparison**: Compare data hashes
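A timestamp-based watermark can be sketched as follows. The `source_rows` list and the `modified_at` field are hypothetical, and a real pipeline would persist the watermark (for example in a control table) between runs rather than keep it in a variable.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp.
source_rows = [
    {"id": 1, "modified_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "modified_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "modified_at": datetime(2024, 1, 3, 9, 0)},
]

def extract_incremental(rows, watermark):
    """Return rows modified after the watermark, plus the advanced watermark."""
    changed = [r for r in rows if r["modified_at"] > watermark]
    # If nothing changed, keep the old watermark.
    new_watermark = max((r["modified_at"] for r in changed), default=watermark)
    return changed, new_watermark

# First run: the last successful load finished at noon on 1 January.
watermark = datetime(2024, 1, 1, 12, 0)
batch, watermark = extract_incremental(source_rows, watermark)

# Second run with no new source changes: nothing is extracted.
batch2, watermark2 = extract_incremental(source_rows, watermark)
print([r["id"] for r in batch], watermark, len(batch2))
```

Advancing the watermark only after the load succeeds keeps a failed run from silently skipping records.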

### 2. Error Handling

**Handle errors gracefully** to ensure pipeline reliability.

**Strategies:**
- **Try-Catch Blocks**: Catch and handle exceptions
- **Error Logging**: Log all errors for debugging
- **Retry Logic**: Retry failed operations
- **Dead Letter Queue**: Store failed records
- **Notifications**: Alert on critical errors
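Several of these strategies can be combined in a few lines. In this sketch, `with_retry`, `process_records`, and `flaky_extract` are hypothetical helpers, and a plain Python list plays the role of the dead letter queue; a real pipeline would write rejected records to durable storage.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def with_retry(func, attempts=3, delay_seconds=0.0):
    """Call func, retrying on any exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the caller alert
            time.sleep(delay_seconds)

def process_records(records, transform):
    """Transform each record; route failures to a dead-letter list."""
    good, dead_letter = [], []
    for record in records:
        try:
            good.append(transform(record))
        except (ValueError, TypeError) as exc:
            log.error("record rejected: %r (%s)", record, exc)
            dead_letter.append({"record": record, "error": str(exc)})
    return good, dead_letter

# Simulated source that fails twice before succeeding.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return ["10", "abc", "30"]

records = with_retry(flaky_extract)
good, dead_letter = process_records(records, int)
print(good, dead_letter)
```

Retries suit transient failures such as dropped connections, while the dead-letter path suits bad records that would fail again no matter how often they are retried.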

### 3. Idempotency

**Design ETL to be idempotent**: running the same load multiple times produces the same result as running it once.

**Techniques:**
- **Upsert Operations**: Insert or update
- **Delete Before Insert**: Clear the affected target rows or partition before loading
- **Unique Constraints**: Prevent duplicates
- **Transaction Management**: Atomic operations
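An upsert inside a transaction is one way to get this behavior. The sketch below uses SQLite's `ON CONFLICT ... DO UPDATE` clause, which assumes SQLite 3.24 or later; other databases use different syntax, such as `MERGE`. The `customers` table and the `load_batch` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")

def load_batch(conn, rows):
    """Upsert rows atomically: insert new keys, update existing ones."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            """
            INSERT INTO customers (customer_id, name) VALUES (?, ?)
            ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name
            """,
            rows,
        )

batch = [(1, "Ada"), (2, "Grace")]
load_batch(conn, batch)
load_batch(conn, batch)  # rerunning the same batch must not create duplicates

rows = conn.execute(
    "SELECT customer_id, name FROM customers ORDER BY customer_id"
).fetchall()
print(rows)
```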

### 4. Performance Optimization

**Optimize ETL for speed and efficiency.**

**Techniques:**
- **Parallel Processing**: Process multiple files/partitions
- **Partitioning**: Partition data for parallel processing
- **Indexing**: Create indexes on key columns
- **Batch Processing**: Process in batches
- **Resource Scaling**: Scale resources as needed

### 5. Data Validation

**Validate data at each stage** of the ETL process.

**Validations:**
- **Row Count Checks**: Verify expected row counts
- **Data Type Validation**: Check data types
- **Range Checks**: Validate value ranges
- **Referential Integrity**: Check foreign keys
- **Business Rules**: Validate business logic
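These checks can be collected into a single function that returns a named result per rule. The sketch assumes hypothetical `orders` and `customers` frames and an arbitrary allowed range of 0 to 1,000,000 for `Amount`; real thresholds would come from the business rules.

```python
import pandas as pd

def validate(orders, customers, expected_rows):
    """Return a dict mapping each check name to True (passed) or False (failed)."""
    return {
        # Row count check
        "row_count": len(orders) == expected_rows,
        # Data type validation
        "amount_is_numeric": pd.api.types.is_numeric_dtype(orders["Amount"]),
        # Range check (bounds are illustrative)
        "amount_in_range": bool(orders["Amount"].between(0, 1_000_000).all()),
        # Referential integrity: every order must point at a known customer
        "customer_fk_valid": bool(
            orders["CustomerID"].isin(customers["CustomerID"]).all()
        ),
    }

customers = pd.DataFrame({"CustomerID": [1, 2]})
orders = pd.DataFrame({"CustomerID": [1, 2, 3], "Amount": [50.0, -5.0, 20.0]})

results = validate(orders, customers, expected_rows=3)
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

Returning all results instead of raising on the first failure lets the pipeline log every problem in one run.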

### 6. Monitoring and Logging

**Monitor ETL execution** for visibility and troubleshooting.

**Monitor:**
- **Execution Status**: Success/failure
- **Execution Time**: Duration of each step
- **Data Volumes**: Records processed
- **Error Rates**: Number of errors
- **Resource Usage**: CPU, memory, network
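Status, duration, and row counts can be captured with a small context manager around each step. This is a sketch using only the standard library: `monitored_step` and the in-memory `metrics` list are hypothetical, and a real pipeline would send these records to its monitoring system.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl.monitor")
metrics = []  # collected step records; stand-in for a metrics store

@contextmanager
def monitored_step(name):
    """Record status, duration, and row count for one pipeline step."""
    record = {"step": name, "status": "running", "rows": 0}
    start = time.perf_counter()
    try:
        yield record  # the step body fills in record["rows"]
        record["status"] = "success"
    except Exception:
        record["status"] = "failed"
        raise
    finally:
        record["seconds"] = time.perf_counter() - start
        metrics.append(record)
        log.info(
            "step=%s status=%s rows=%d seconds=%.3f",
            record["step"], record["status"], record["rows"], record["seconds"],
        )

with monitored_step("transform") as step:
    rows = [value * 2 for value in range(5)]
    step["rows"] = len(rows)
```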

### 7. Documentation

**Document ETL processes** for maintainability.

**Document:**
- **Data Mapping**: Source to target mappings
- **Transformation Logic**: Business rules
- **Dependencies**: Data and process dependencies
- **Schedule**: When ETL runs
- **Contacts**: Who to contact for issues


## ETL Patterns

### Pattern 1: Full Load

**Load entire dataset every time.**

```
Source → Extract All → Transform → Load (Replace) → Target
```

**Use When:**
- Small datasets
- Most of the data changes between runs
- Simple requirements

- **Pros:** Simple, always current
- **Cons:** Slow for large data, high resource usage

### Pattern 2: Incremental Load

**Load only new or changed data.**

```
Source → Extract Changed → Transform → Load (Append/Update) → Target
```

**Use When:**
- Large datasets
- Only new/changed data needed
- Performance is important

- **Pros:** Fast, efficient, lower resource usage
- **Cons:** More complex, need change tracking

### Pattern 3: Change Data Capture (CDC)

**Capture and process only changed data.**

```
Source → CDC → Changed Records → Transform → Load → Target
```

**Use When:**
- Real-time or near-real-time needs
- Source supports CDC
- Need to track all changes

- **Pros:** Real-time, efficient, tracks history
- **Cons:** Complex, requires CDC support

### Pattern 4: Slowly Changing Dimensions (SCD)

**Handle dimension data that changes over time.**

- **SCD Type 1**: Overwrite old values
- **SCD Type 2**: Keep history with versioning
- **SCD Type 3**: Keep limited history

**Use When:**
- Dimension data changes
- Need historical tracking
- Data warehousing scenarios
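SCD Type 2 is the variant that usually needs the most explanation, so here is a minimal sketch in plain Python. The column names `valid_from`, `valid_to`, and `is_current` are a common convention but are assumptions here, as are the `apply_scd2` helper and the single tracked attribute `city`. The function updates the `dimension` list in place.

```python
from datetime import date

def apply_scd2(dimension, incoming, today):
    """Close the current version of each changed row and append a new version."""
    current = {row["customer_id"]: row for row in dimension if row["is_current"]}
    for new in incoming:
        old = current.get(new["customer_id"])
        if old is not None and old["city"] == new["city"]:
            continue  # attribute unchanged: keep the existing version
        if old is not None:
            old["valid_to"] = today   # expire the previous version
            old["is_current"] = False
        version = {
            "customer_id": new["customer_id"],
            "city": new["city"],
            "valid_from": today,
            "valid_to": None,
            "is_current": True,
        }
        dimension.append(version)
        current[new["customer_id"]] = version
    return dimension

dim = [{
    "customer_id": 1, "city": "Paris",
    "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True,
}]
changes = [{"customer_id": 1, "city": "Berlin"}, {"customer_id": 2, "city": "Oslo"}]
dim = apply_scd2(dim, changes, date(2024, 6, 1))

for row in dim:
    print(row)
```

Type 1 would simply overwrite `city` on the existing row, and Type 3 would keep the old value in an extra column such as a hypothetical `previous_city`.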

### Pattern 5: Staging Area

**Load to staging first, then to final target.**

```
Source → Extract → Load to Staging → Transform → Load to Target
```

**Use When:**
- Complex transformations
- Need to validate before final load
- Multiple target systems

- **Pros:** Validation, rollback capability, flexibility
- **Cons:** Extra storage, additional step


## Summary

In this module, we've covered:

- ✅ ETL fundamentals (Extract, Transform, Load)
- ✅ ETL vs ELT patterns and when to use each
- ✅ Data mapping techniques and documentation
- ✅ Data profiling and quality assessment
- ✅ Common data transformation operations
- ✅ ETL design best practices
- ✅ ETL patterns (Full Load, Incremental, CDC, SCD, Staging)

### Key Takeaways

1. **ETL** is Extract, Transform, Load: a fundamental data engineering process
2. **ETL vs ELT**: Choose based on data volume and target system capabilities
3. **Data Mapping** defines how source data maps to target
4. **Data Profiling** helps understand data before designing ETL
5. **Transformations** clean, enrich, and prepare data
6. **Best Practices** ensure reliable, maintainable ETL processes
7. **ETL Patterns** provide reusable solutions for common scenarios

### Next Steps

Proceed to **Module 05: Azure Data Factory Basics** to learn about:
- Azure Data Factory components
- Linked Services and Datasets
- Pipelines and Activities
- Creating ETL pipelines in ADF
