# Module 03 - Data Ingestion: Batch and Streaming

## Overview

Data ingestion is the process of moving data from various sources into Azure storage and processing systems. This module covers the two main ingestion styles, batch and streaming, and the Azure services used for each.

## Learning Objectives

By the end of this module, you will understand:
- What data ingestion is and why it's important
- Types of data (structured, semi-structured, unstructured)
- Batch data ingestion patterns and use cases
- Streaming data ingestion patterns and use cases
- Azure services for data ingestion
- Best practices for data ingestion


## What is Data Ingestion?

**Data Ingestion** is the process of importing, transferring, loading, and processing data for immediate use or storage in a database or data warehouse.

### Why is Data Ingestion Important?

- **Data Sources**: Data lives in many places (databases, files, APIs, IoT devices)
- **Centralized Storage**: Analysis requires bringing that data together in one place
- **Real-time Needs**: Some data must be processed as soon as it arrives
- **Scalability**: Large volumes of data must be handled efficiently
- **Reliability**: Data must arrive correctly and on time

### Data Ingestion Pipeline

```
Source Systems → Ingestion Layer → Storage → Processing
     ↓              ↓                ↓          ↓
  Databases    Azure Data      Azure      Spark/Synapse
  Files        Factory         Storage    Analytics
  APIs         Event Hubs      Data Lake
  IoT Devices  Stream Analytics
```


## Types of Data

Understanding data types helps in choosing the right ingestion method and storage.

### 1. Structured Data

**Definition**: Data with a fixed schema and well-defined format.

**Characteristics:**
- Organized in rows and columns
- Follows a predefined schema
- Easy to query and analyze

**Examples:**
- Relational databases (SQL Server, Oracle, MySQL)
- CSV files with consistent columns
- Excel spreadsheets
- Parquet files

**Storage**: Azure SQL Database, Synapse SQL Pools, Tables

### 2. Semi-Structured Data

**Definition**: Data with some structure but flexible schema.

**Characteristics:**
- Has tags or markers to separate elements
- Schema can vary
- Self-describing format

**Examples:**
- JSON files
- XML files
- Avro files
- NoSQL databases (Cosmos DB)

**Storage**: Azure Storage, Data Lake, Cosmos DB

### 3. Unstructured Data

**Definition**: Data without a predefined structure or schema.

**Characteristics:**
- No fixed format
- Difficult to query directly
- Requires processing to extract insights

**Examples:**
- Text documents
- Images
- Videos
- Audio files
- Log files

**Storage**: Azure Blob Storage, Data Lake Storage
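
To make the difference between structured and semi-structured data concrete, here is a minimal sketch using only Python's standard library. The sample records and field names (`order_id`, `amount`, `coupon`) are made up for illustration.

```python
import csv
import io
import json

# Structured: every CSV row has the same columns (hypothetical sample data).
csv_text = "order_id,amount\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
columns_per_row = {tuple(row.keys()) for row in rows}

# Semi-structured: each JSON record is self-describing and fields may vary.
json_lines = [
    '{"order_id": 1, "amount": 19.99}',
    '{"order_id": 2, "amount": 5.0, "coupon": {"code": "SPRING"}}',
]
records = [json.loads(line) for line in json_lines]
fields_per_record = {tuple(sorted(record)) for record in records}

print(len(columns_per_row))    # number of distinct column layouts in the CSV
print(len(fields_per_record))  # number of distinct field layouts in the JSON
```

The CSV rows share one layout, while the JSON records do not, which is why semi-structured data usually needs schema handling at read time.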


## Batch Data Ingestion

**Batch ingestion** processes data in large chunks at scheduled intervals or when triggered.

### Characteristics

- **Volume**: Large amounts of data processed together
- **Frequency**: Scheduled (hourly, daily, weekly) or on-demand
- **Latency**: Higher latency (minutes to hours)
- **Processing**: Bulk operations on entire datasets

### Use Cases

- ✅ **ETL Processes**: Extract data from source, transform, load to destination
- ✅ **Historical Data Loading**: Loading large historical datasets
- ✅ **Scheduled Reports**: Daily/weekly data refreshes
- ✅ **Data Warehousing**: Loading data into data warehouses
- ✅ **File Processing**: Processing files uploaded to storage

### Example Scenarios

1. **Daily Sales Data**
   - Source: On-premises SQL Server
   - Schedule: Every night at 2 AM
   - Destination: Azure Data Lake
   - Process: Extract all sales from previous day

2. **Monthly Financial Reports**
   - Source: Multiple Excel files
   - Schedule: First day of each month
   - Destination: Azure Synapse Analytics
   - Process: Aggregate and consolidate data

3. **Customer Data Migration**
   - Source: Legacy database
   - Schedule: One-time migration
   - Destination: Azure SQL Database
   - Process: Full data extract and load
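
The "Daily Sales Data" scenario can be sketched as a small batch extract. This is only an illustration: an in-memory SQLite database stands in for the on-premises SQL Server, and the `sales` table, its columns, and the function name `extract_previous_day` are assumptions, not part of any Azure service.

```python
import sqlite3
from datetime import date, timedelta

def extract_previous_day(conn, run_date):
    """Return all sales rows whose sale_date is the day before run_date."""
    previous_day = (run_date - timedelta(days=1)).isoformat()
    cursor = conn.execute(
        "SELECT id, sale_date, amount FROM sales WHERE sale_date = ? ORDER BY id",
        (previous_day,),
    )
    return cursor.fetchall()

# In-memory stand-in for the source system, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2024-03-09", 10.0), (2, "2024-03-10", 25.5), (3, "2024-03-10", 4.5)],
)

# A scheduler would call this once per night; here the run date is fixed.
batch = extract_previous_day(conn, run_date=date(2024, 3, 11))
print(batch)  # [(2, '2024-03-10', 25.5), (3, '2024-03-10', 4.5)]
```

In a real pipeline the schedule and the copy to Azure Data Lake would typically be handled by a service such as Azure Data Factory rather than by hand-written code.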


## Streaming Data Ingestion

**Streaming ingestion** processes data continuously as it arrives, in real-time or near real-time.

### Characteristics

- **Volume**: Continuous flow of data
- **Frequency**: Real-time or near real-time
- **Latency**: Low latency (seconds to milliseconds)
- **Processing**: Event-by-event or micro-batch processing

### Use Cases

- ✅ **IoT Data**: Sensor data from devices
- ✅ **Real-time Analytics**: Live dashboards and monitoring
- ✅ **Event Processing**: User clicks, transactions, logs
- ✅ **Fraud Detection**: Real-time transaction monitoring
- ✅ **Live Recommendations**: Real-time personalization

### Example Scenarios

1. **IoT Sensor Data**
   - Source: Temperature sensors
   - Frequency: Every second
   - Destination: Azure Event Hubs → Stream Analytics
   - Process: Real-time temperature monitoring and alerts

2. **E-commerce Clickstream**
   - Source: Website user clicks
   - Frequency: Continuous
   - Destination: Event Hubs → Data Lake
   - Process: Real-time user behavior analysis

3. **Financial Trading**
   - Source: Stock market feeds
   - Frequency: Millisecond-level
   - Destination: Event Hubs → Stream Analytics
   - Process: Real-time trading decisions
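
As a rough illustration of event-by-event processing, the sketch below mimics the "IoT Sensor Data" scenario in plain Python. The generator `sensor_stream`, the field names, and the 30 °C threshold are assumptions chosen for the example; a real deployment would read from a service such as Event Hubs instead.

```python
from typing import Iterable, Iterator

THRESHOLD_C = 30.0  # hypothetical alert threshold

def alerts(readings: Iterable[dict]) -> Iterator[str]:
    """Handle readings one at a time and yield an alert as soon as one exceeds the threshold."""
    for reading in readings:
        if reading["temp_c"] > THRESHOLD_C:
            yield f"ALERT {reading['sensor']}: {reading['temp_c']} C"

def sensor_stream() -> Iterator[dict]:
    # Stand-in for an unbounded event source; a real stream would not end.
    yield {"sensor": "s1", "temp_c": 21.5}
    yield {"sensor": "s2", "temp_c": 31.2}
    yield {"sensor": "s1", "temp_c": 29.9}

triggered = list(alerts(sensor_stream()))
print(triggered)  # ['ALERT s2: 31.2 C']
```

Because `alerts` is a generator, each reading is evaluated as it arrives rather than after the whole dataset has been collected, which is the key contrast with batch processing.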


## Batch vs Streaming: Comparison

| Aspect | Batch Ingestion | Streaming Ingestion |
|--------|----------------|---------------------|
| **Data Volume** | Large chunks | Continuous flow |
| **Processing Time** | Scheduled intervals | Real-time |
| **Latency** | Minutes to hours | Seconds to milliseconds |
| **Use Cases** | ETL, reports, analytics | Real-time monitoring, alerts |
| **Complexity** | Typically lower | Typically higher |
| **Cost** | Typically lower (runs on a schedule) | Typically higher (always-on resources) |
| **Tools** | Azure Data Factory | Event Hubs, Stream Analytics |

### When to Use Batch

- Data doesn't need immediate processing
- Large volumes of historical data
- Scheduled reporting and analytics
- Cost optimization is important
- Data quality checks are needed before processing

### When to Use Streaming

- Real-time decision making required
- Immediate alerts and notifications
- Live dashboards and monitoring
- Event-driven applications
- Low latency is critical


## Azure Services for Data Ingestion

### Azure Data Factory (ADF)

**Purpose**: Cloud-based ETL/ELT service for batch data movement and transformation.

**Key Features:**
- Visual pipeline designer
- 90+ built-in connectors
- Schedule-based or event-driven triggers
- Data transformation capabilities
- Monitoring and alerting

**Use Cases:**
- Moving data from on-premises to cloud
- Scheduled batch data loads
- ETL workflows
- Data integration between systems

**Example Flow:**
```
SQL Server → ADF Pipeline → Azure Data Lake
```

### Azure Event Hubs

**Purpose**: Big data streaming platform and event ingestion service.

**Key Features:**
- High throughput (millions of events per second)
- Low latency
- Multiple consumer groups
- Event Hubs Capture (automatically saves streamed events to Azure Blob Storage or Data Lake Storage)

**Use Cases:**
- IoT data ingestion
- Real-time event streaming
- Clickstream analytics
- Log aggregation

**Example Flow:**
```
IoT Devices → Event Hubs → Stream Analytics → Power BI
```

### Azure IoT Hub

**Purpose**: Managed service for IoT device connectivity and management.

**Key Features:**
- Device-to-cloud and cloud-to-device messaging
- Device management
- Security and authentication
- Protocol support (MQTT, AMQP, HTTPS)

**Use Cases:**
- IoT device data collection
- Device management
- Command and control

### Azure Stream Analytics

**Purpose**: Real-time analytics on streaming data.

**Key Features:**
- SQL-like query language
- Real-time processing
- Multiple input/output sources
- Windowing functions

**Use Cases:**
- Real-time dashboards
- Anomaly detection
- Real-time aggregations
- Event filtering and routing
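
Windowing is easiest to see with a small example. The sketch below computes a tumbling-window average (fixed, non-overlapping windows) in plain Python. It is not Stream Analytics query syntax, and the function name and sample events are assumptions for illustration only.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Average event values in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, value) pairs.
    Returns {window_start: average}.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        # Every timestamp belongs to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

events = [(0, 20.0), (4, 22.0), (10, 30.0), (19, 34.0), (25, 18.0)]
averages = tumbling_window_avg(events, window_seconds=10)
print(averages)  # {0: 21.0, 10: 32.0, 20: 18.0}
```

A streaming engine computes the same kind of result incrementally and emits each window when it closes, instead of holding all events in memory as this sketch does.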


## Data Ingestion Patterns

### Pattern 1: Extract and Load (EL)

**Simple data movement without transformation.**

```
Source → Ingestion Service → Destination Storage
```

**Example:**
- Copy files from on-premises to Azure Storage
- No transformation needed
- Fast and simple
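
A minimal sketch of the EL pattern, assuming local folders stand in for the on-premises file share and the Azure Storage landing area. The function name `copy_files` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def copy_files(source_dir: Path, dest_dir: Path, pattern: str = "*.csv") -> list:
    """Copy matching files byte-for-byte, with no transformation."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(source_dir.glob(pattern)):
        shutil.copy2(src, dest_dir / src.name)
        copied.append(src.name)
    return copied

# Temporary directories stand in for the source share and the storage account.
with tempfile.TemporaryDirectory() as tmp:
    source, dest = Path(tmp) / "onprem", Path(tmp) / "landing"
    source.mkdir()
    (source / "sales.csv").write_text("id,amount\n1,10\n")
    (source / "notes.txt").write_text("not copied")
    copied = copy_files(source, dest)
    landed_content = (dest / "sales.csv").read_text()

print(copied)  # ['sales.csv']
```

The content that lands is identical to the source, which is what distinguishes EL from the ETL and ELT patterns.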

### Pattern 2: Extract, Transform, Load (ETL)

**Transform data during ingestion, before it is loaded into the destination.**

```
Source → Extract → Transform → Load → Destination
```

**Example:**
- Extract from SQL Server
- Transform: Clean, filter, aggregate
- Load to Data Lake
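
The clean, filter, and aggregate steps can be sketched in a few lines of Python. The row layout (`region`, `amount`) and the cleaning rules are assumptions made for this example, not a prescribed transformation.

```python
from collections import defaultdict

def transform(rows):
    """Clean, filter, and aggregate: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        region = (row.get("region") or "").strip().upper()  # clean
        amount = row.get("amount")
        if not region or amount is None or amount <= 0:     # filter
            continue
        totals[region] += amount                            # aggregate
    return dict(totals)

extracted = [  # stand-in for rows extracted from the source
    {"region": " east ", "amount": 10.0},
    {"region": "EAST", "amount": 5.0},
    {"region": "west", "amount": None},
    {"region": "", "amount": 7.0},
    {"region": "west", "amount": 2.5},
]
# In a real pipeline this result would then be written to the destination.
loaded = transform(extracted)
print(loaded)  # {'EAST': 15.0, 'WEST': 2.5}
```

Only the transformed result reaches the destination, so rows dropped by the filter cannot be recovered later without going back to the source.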

### Pattern 3: Extract, Load, Transform (ELT)

**Load raw data first, then transform in destination.**

```
Source → Extract → Load → Transform (in destination) → Analytics
```

**Example:**
- Extract raw data to Data Lake
- Load to Synapse Analytics
- Transform using SQL/Spark in Synapse
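
For contrast with ETL, here is a hedged sketch of ELT in which raw values are loaded first and the transformation runs as SQL inside the destination. An in-memory SQLite database stands in for the destination engine, and the table names `raw_sales` and `sales_by_region` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the destination engine

# Load: raw values land as-is, including messy text and a bad row.
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [(" east ", "10.0"), ("EAST", "5.0"), ("west", "n/a"), ("west", "2.5")],
)

# Transform: done afterwards, with SQL, inside the destination.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT UPPER(TRIM(region)) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    WHERE amount GLOB '[0-9]*'
    GROUP BY UPPER(TRIM(region))
    """
)
result = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(result)  # [('EAST', 15.0), ('WEST', 2.5)]
```

The raw table is kept untouched, so the transformation can be changed and re-run later without re-extracting from the source.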

### Pattern 4: Change Data Capture (CDC)

**Capture only the data that has changed since the last ingestion.**

```
Source → CDC → Changed Data Only → Destination
```

**Example:**
- Track changes in source database
- Ingest only new/modified records
- Efficient for large tables
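
The idea of ingesting only new or modified records can be approximated with a high-watermark on a modification timestamp, as sketched below. This is a simplification: database CDC features usually read the transaction log, and a timestamp watermark does not detect deleted rows. The `modified_at` column and the function name are assumptions.

```python
def changed_since(rows, watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    changed = [row for row in rows if row["modified_at"] > watermark]
    new_watermark = max((row["modified_at"] for row in changed), default=watermark)
    return changed, new_watermark

source_table = [
    {"id": 1, "name": "Ada", "modified_at": "2024-03-01T08:00:00"},
    {"id": 2, "name": "Lin", "modified_at": "2024-03-02T09:30:00"},
]

# First run picks up everything after the initial watermark.
first, watermark = changed_since(source_table, "1970-01-01T00:00:00")

# One record changes in the source; the next run picks up only that record.
source_table[0] = {"id": 1, "name": "Ada L.", "modified_at": "2024-03-03T10:00:00"}
second, watermark = changed_since(source_table, watermark)
print([row["id"] for row in second])  # [1]
```

ISO 8601 timestamps in the same format compare correctly as strings, which keeps the sketch free of date parsing.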

### Pattern 5: Lambda Architecture

**Combines batch and streaming for comprehensive analytics.**

```
Streaming Path: Real-time data → Event Hubs → Stream Analytics → Real-time views
Batch Path: Historical data → Data Factory → Data Lake → Batch processing → Batch views
Merge: Combine real-time and batch views for complete picture
```
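
The merge step can be illustrated with a small sketch that combines a batch view with a real-time view at query time. The page-view counts and the function name `merge_views` are made up, and the sketch assumes the real-time view only contains events that arrived after the last batch run.

```python
def merge_views(batch_view, realtime_view):
    """Serving layer: add real-time counts on top of the batch counts."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Batch view: page views computed by the last batch run (made-up numbers).
batch_view = {"/home": 1000, "/cart": 250}
# Real-time view: page views that arrived after that batch run.
realtime_view = {"/home": 12, "/checkout": 3}

combined = merge_views(batch_view, realtime_view)
print(combined)  # {'/home': 1012, '/cart': 250, '/checkout': 3}
```

If the two views overlapped in time, events would be double counted, so a real implementation needs a clear cutover point between the batch and streaming paths.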


## Data Sources and Destinations

### Common Data Sources

#### On-Premises Sources
- **SQL Server**: Relational databases
- **File Servers**: CSV, Excel, JSON files
- **Oracle/MySQL**: Other relational databases
- **SAP**: ERP systems
- **Mainframes**: Legacy systems

#### Cloud Sources
- **Azure SQL Database**: Managed SQL database
- **Azure Storage**: Blob, Data Lake
- **Azure Cosmos DB**: NoSQL database
- **Salesforce**: CRM data
- **Dynamics 365**: Business applications
- **REST APIs**: Web services

#### Streaming Sources
- **IoT Devices**: Sensors, devices
- **Applications**: Logs, events
- **Social Media**: Twitter, Facebook feeds
- **Web Clickstream**: User interactions

### Common Destinations

#### Storage Destinations
- **Azure Blob Storage**: Object storage
- **Azure Data Lake Storage Gen2**: Analytics storage
- **Azure Files**: File shares

#### Database Destinations
- **Azure SQL Database**: Managed SQL
- **Azure Synapse Analytics**: Data warehouse
- **Azure Cosmos DB**: NoSQL
- **Azure Database for PostgreSQL/MySQL**: Open-source databases

#### Analytics Destinations
- **Azure Synapse Analytics**: Data warehousing
- **Azure Databricks**: Spark analytics
- **Power BI**: Business intelligence
- **Azure Analysis Services**: Analytics engine


## Best Practices for Data Ingestion

### 1. Data Validation

- ✅ **Validate at Source**: Check data quality before ingestion
- ✅ **Schema Validation**: Ensure data matches expected schema
- ✅ **Data Type Checks**: Verify data types are correct
- ✅ **Null Handling**: Handle missing values appropriately
- ✅ **Error Handling**: Log and handle errors gracefully
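
A minimal sketch of schema, type, and null checks is shown below. The schema `EXPECTED_SCHEMA` and the sample records are hypothetical, and production pipelines often rely on a dedicated validation library or the ingestion service's own checks instead.

```python
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}  # hypothetical schema

def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing or null")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems

records = [
    {"order_id": 1, "amount": 9.5, "currency": "USD"},
    {"order_id": "2", "amount": 3.0, "currency": None},
]
valid, rejected = [], []
for record in records:
    problems = validate(record)
    if problems:
        # Keep bad rows together with the reason instead of dropping them silently.
        rejected.append((record, problems))
    else:
        valid.append(record)

print(len(valid), len(rejected))  # 1 1
```

Routing rejected records to a separate location with the failure reason makes errors visible and lets them be corrected and replayed.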

### 2. Incremental Loading

- ✅ **Use Timestamps**: Track last ingestion time
- ✅ **Change Data Capture**: Only ingest changed data
- ✅ **Partitioning**: Partition by date/time for efficiency
- ✅ **Idempotency**: Ensure re-running doesn't create duplicates
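
Idempotency is often achieved by writing with a key-based upsert rather than a plain insert. The sketch below uses an in-memory SQLite table as a stand-in destination. The `customers` table and `load_batch` function are assumptions for illustration.

```python
import sqlite3

def load_batch(conn, rows):
    """Upsert by primary key so re-running the same batch changes nothing."""
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

batch = [(1, "Ada"), (2, "Lin")]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry of the same run

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_count)  # 2
```

With a plain `INSERT` the retry would either fail on the primary key or, without a key, create duplicate rows.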

### 3. Performance Optimization

- ✅ **Parallel Processing**: Process multiple files/partitions in parallel
- ✅ **Compression**: Compress data during transfer
- ✅ **Batch Sizes**: Optimize batch sizes for throughput
- ✅ **Network Optimization**: Use ExpressRoute for on-premises connectivity
- ✅ **Resource Scaling**: Scale resources based on workload
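
Parallel processing and compression can be combined, as in this rough sketch that gzips several date partitions concurrently. The partition names, payloads, and `max_workers` value are assumptions. Actual gains depend on the workload and on how the target service handles parallel transfers.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def compress_partition(item):
    """Gzip one partition's payload; returns (name, compressed bytes)."""
    name, payload = item
    return name, gzip.compress(payload)

# Made-up partitions; repetitive data compresses well.
partitions = {
    "date=2024-03-01": b"id,amount\n" + b"1,10.00\n" * 500,
    "date=2024-03-02": b"id,amount\n" + b"2,20.00\n" * 500,
}

# Threads suit I/O-bound work such as uploads; max_workers is a tuning assumption.
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = dict(pool.map(compress_partition, partitions.items()))

for name, blob in compressed.items():
    print(name, len(partitions[name]), "->", len(blob))
```

Each partition is handled independently, which is what makes date-based partitioning a natural unit of parallelism.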

### 4. Monitoring and Alerting

- ✅ **Pipeline Monitoring**: Track pipeline execution status
- ✅ **Data Quality Metrics**: Monitor data quality
- ✅ **Latency Tracking**: Monitor ingestion latency
- ✅ **Error Alerts**: Set up alerts for failures
- ✅ **Cost Monitoring**: Track ingestion costs

### 5. Security

- ✅ **Encryption**: Encrypt data in transit and at rest
- ✅ **Authentication**: Use managed identities or service principals
- ✅ **Network Security**: Use private endpoints when possible
- ✅ **Access Control**: Implement least privilege access
- ✅ **Audit Logging**: Log all data access and changes


## Data Ingestion Architecture Example

### Hybrid Architecture (Batch + Streaming)

```
┌─────────────────┐   ┌─────────────────┐       ┌─────────────────┐
│  On-Premises    │   │  File Server    │       │  IoT Devices    │
│  SQL Server     │   │                 │       │                 │
└────────┬────────┘   └────────┬────────┘       └────────┬────────┘
         │                     │                         │
         └──────────┬──────────┘                         │
                    │                        ┌───────────▼───────────┐
                    │                        │  Azure Event Hubs     │
                    │                        └───────────┬───────────┘
                    │                                    │
         ┌──────────▼──────────┐             ┌───────────▼───────────┐
         │ Azure Data Factory  │             │ Stream Analytics      │
         │ (Batch Processing)  │             │ (Real-time Processing)│
         └──────────┬──────────┘             └───────────┬───────────┘
                    │                                    │
         ┌──────────▼──────────┐             ┌───────────▼───────────┐
         │ Azure Data Lake     │             │ Azure Data Lake       │
         │ (Raw/Batch Data)    │             │ (Streaming Data)      │
         └──────────┬──────────┘             └───────────┬───────────┘
                    │                                    │
                    └──────────────────┬─────────────────┘
                          ┌────────────▼────────────┐
                          │ Azure Synapse Analytics │
                          │ (Unified Analytics)     │
                          └─────────────────────────┘
```

### Key Components

1. **Batch Path**: SQL Server, File Server → Data Factory → Data Lake
2. **Streaming Path**: IoT Devices → Event Hubs → Stream Analytics → Data Lake
3. **Unified Analytics**: Both paths feed into Synapse Analytics for comprehensive analysis


## Summary

In this module, we've covered:

- ✅ What data ingestion is and why it's important
- ✅ Types of data (structured, semi-structured, unstructured)
- ✅ Batch data ingestion patterns and use cases
- ✅ Streaming data ingestion patterns and use cases
- ✅ Comparison between batch and streaming
- ✅ Azure services for data ingestion (ADF, Event Hubs, IoT Hub, Stream Analytics)
- ✅ Data ingestion patterns (EL, ETL, ELT, CDC, Lambda)
- ✅ Common data sources and destinations
- ✅ Best practices for data ingestion
- ✅ Example data ingestion architecture

### Key Takeaways

1. **Data Ingestion** is the first step in the data engineering pipeline
2. **Batch ingestion** is for scheduled, large-volume data processing
3. **Streaming ingestion** is for real-time, continuous data processing
4. **Choose the right service** based on your latency and volume requirements
5. **Azure Data Factory** is the primary service for batch data movement
6. **Event Hubs** is the primary service for streaming data ingestion
7. **Consider data types** when designing ingestion pipelines
8. **Follow best practices** for validation, performance, and security

### Next Steps

Proceed to **Module 04: ETL Concepts and Data Transformation** to learn about:
- ETL fundamentals
- Data mapping and transformation
- Data profiling
- Transformation techniques
