# Module 05 - Azure Data Factory Basics

## Overview

Azure Data Factory (ADF) is a cloud-based ETL/ELT service for creating data-driven workflows. This module covers the fundamental components of ADF and how to build data pipelines.

## Learning Objectives

By the end of this module, you will understand:
- What Azure Data Factory is and what it is used for
- Linked Services - connecting to data sources
- Datasets - representing data structures
- Pipelines - orchestrating data workflows
- Activities - individual tasks in pipelines
- Source and Sink concepts
- Creating basic data pipelines


## What is Azure Data Factory?

**Azure Data Factory (ADF)** is a cloud-based data integration service that allows you to create, schedule, and orchestrate data-driven workflows (pipelines) to move and transform data.

### Key Features

- **Visual Pipeline Designer**: Drag-and-drop interface for building pipelines
- **90+ Built-in Connectors**: Connect to various data sources
- **Code-Free ETL**: Build pipelines without writing code
- **Scheduling**: Schedule pipelines to run automatically
- **Monitoring**: Track pipeline execution and performance
- **Hybrid Data Movement**: Move data between on-premises and cloud stores (on-premises access goes through a self-hosted integration runtime)
- **Data Transformation**: Transform data using various activities

### Use Cases

✅ **Data Migration**: Move data from on-premises to cloud
✅ **ETL/ELT Workflows**: Extract, transform, and load data
✅ **Data Integration**: Integrate data from multiple sources
✅ **Scheduled Data Loads**: Automate daily/weekly data refreshes
✅ **Data Orchestration**: Coordinate multiple data processes

### ADF Architecture

```
Data Factory
├── Linked Services (Connections)
├── Datasets (Data Structures)
├── Pipelines (Workflows)
│   └── Activities (Tasks)
└── Triggers (Scheduling)
```


## Linked Services

**Linked Services** define connection information to external data sources or compute services. Think of them as connection strings or connection configurations.

### Purpose

- **Store Connection Details**: Connection strings, credentials, endpoints
- **Reusability**: Use same connection across multiple pipelines
- **Security**: Store credentials securely (Azure Key Vault)
- **Abstraction**: Hide connection details from pipelines

### Common Linked Service Types

#### Storage Linked Services
- **Azure Blob Storage**: Connect to blob storage
- **Azure Data Lake Storage Gen2**: Connect to ADLS Gen2
- **Azure Files**: Connect to file shares
- **Amazon S3**: Connect to AWS S3

#### Database Linked Services
- **Azure SQL Database**: Connect to SQL Database
- **Azure Synapse Analytics**: Connect to Synapse
- **SQL Server**: Connect to on-premises SQL Server
- **Oracle/MySQL/PostgreSQL**: Connect to various databases

#### Compute Linked Services
- **Azure Databricks**: Connect to Databricks clusters
- **Azure HDInsight**: Connect to HDInsight clusters
- **Azure Batch**: Connect to Batch compute

### Linked Service Example

**Azure Blob Storage Linked Service:**
```json
{
  "name": "AzureBlobStorage1",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=...",
      "accountKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault1",
          "type": "LinkedServiceReference"
        },
        "secretName": "storageAccountKey"
      }
    }
  }
}
```

### Best Practices

✅ **Use Azure Key Vault**: Store sensitive credentials in Key Vault
✅ **Naming Convention**: Use descriptive names (e.g., `LS_SQLServer_Production`)
✅ **Parameterize**: Use parameters for different environments
✅ **Reuse**: Create linked services that can be reused across pipelines
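
The Key Vault, naming, and per-environment practices above can be sketched in Python. This is a minimal illustration that only builds the JSON definitions locally. The environment names, storage account names, secret names, and the `LS_KeyVault` linked service are all hypothetical, and generating one definition per environment is just one approach (ADF also supports parameters on linked services).

```python
import json


def blob_linked_service(name, account_name, key_vault_ls, secret_name):
    """Build an Azure Blob Storage linked service whose key lives in Key Vault."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                # Only the non-secret part of the connection string is inline.
                "connectionString": (
                    f"DefaultEndpointsProtocol=https;AccountName={account_name};"
                ),
                # The account key is a reference to a Key Vault secret.
                "accountKey": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": key_vault_ls,
                        "type": "LinkedServiceReference",
                    },
                    "secretName": secret_name,
                },
            },
        },
    }


# Hypothetical environments; account and secret names are placeholders.
ENVIRONMENTS = {
    "Dev": {"account": "devsalesstore", "secret": "dev-storage-key"},
    "Production": {"account": "prodsalesstore", "secret": "prod-storage-key"},
}

linked_services = [
    blob_linked_service(
        f"LS_BlobStorage_{env}", cfg["account"], "LS_KeyVault", cfg["secret"]
    )
    for env, cfg in ENVIRONMENTS.items()
]

print(json.dumps(linked_services[0], indent=2))
```

No secret value ever appears in the generated JSON; each definition only names the secret to fetch.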


## Datasets

**Datasets** represent data structures within data stores. They point to the data you want to use in your activities as inputs or outputs.

### Purpose

- **Define Data Structure**: Specify schema, format, location
- **Reference Data**: Point to specific data in linked services
- **Reusability**: Use same dataset definition across activities
- **Schema Definition**: Describe column names and data types (datasets describe data; they do not enforce constraints)

### Dataset Components

1. **Linked Service Reference**: Which data store to connect to
2. **Structure/Schema**: Column names and data types
3. **Location/Path**: Where the data is located
4. **Format**: File format (CSV, JSON, Parquet, etc.)
5. **Properties**: Additional settings (compression, encoding)

### Common Dataset Types

#### File-Based Datasets
- **DelimitedText**: CSV, TSV files
- **Json**: JSON files
- **Parquet**: Parquet files
- **Avro**: Avro files
- **Excel**: Excel files

#### Database Datasets
- **AzureSqlTable**: SQL Database tables
- **SqlServerTable**: SQL Server tables
- **OracleTable**: Oracle tables

### Dataset Example

**CSV Dataset:**
```json
{
  "name": "SalesDataCSV",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorage1",
      "type": "LinkedServiceReference"
    },
    "schema": [
      {
        "name": "CustomerID",
        "type": "Int32"
      },
      {
        "name": "SalesAmount",
        "type": "Decimal"
      },
      {
        "name": "SaleDate",
        "type": "DateTime"
      }
    ],
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "raw-data",
        "folderPath": "sales/2024"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

### Best Practices

✅ **Parameterize Paths**: Use parameters for dynamic paths (dates, partitions)
✅ **Define Schema**: Explicitly define schema when known
✅ **Use Descriptive Names**: Clear, meaningful dataset names
✅ **Reuse**: Create reusable dataset definitions
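
To illustrate the "Parameterize Paths" practice, the sketch below builds a dataset whose folder is supplied at run time through a dataset parameter. The dataset name `DS_SalesCSV_ByDate`, the `folderPath` parameter, and the `sales/YYYY/MM/DD` layout are assumptions made for this example, not names from a real factory.

```python
import json
from datetime import date


def sales_csv_dataset(linked_service):
    """DelimitedText dataset whose folder is passed in as a dataset parameter."""
    return {
        "name": "DS_SalesCSV_ByDate",  # hypothetical name
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
            # Declared parameter; the caller supplies its value.
            "parameters": {"folderPath": {"type": "String"}},
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "raw-data",
                    # Expression resolved by ADF when the dataset is used.
                    "folderPath": {
                        "value": "@dataset().folderPath",
                        "type": "Expression",
                    },
                },
                "columnDelimiter": ",",
                "firstRowAsHeader": True,
            },
        },
    }


def partition_folder(day):
    """Folder for one day's partition under the assumed sales/YYYY/MM/DD layout."""
    return f"sales/{day:%Y/%m/%d}"


dataset = sales_csv_dataset("AzureBlobStorage1")
print(json.dumps(dataset["properties"]["parameters"]))
print(partition_folder(date(2024, 1, 15)))
```

One dataset definition can then serve every date partition, instead of one dataset per folder.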


## Pipelines

**Pipelines** are logical groupings of activities that together perform a task. A pipeline is a workflow that orchestrates data movement and transformation.

### Purpose

- **Orchestrate Workflows**: Coordinate multiple activities
- **Define Dependencies**: Set activity execution order
- **Parameterize**: Accept parameters for flexibility
- **Schedule**: Can be triggered on schedule or event

### Pipeline Components

1. **Activities**: Individual tasks (copy, transform, etc.)
2. **Parameters**: Input parameters for flexibility
3. **Variables**: Internal variables for pipeline logic
4. **Dependencies**: Activity execution order
5. **Error Handling**: How to handle failures

### Pipeline Example Flow

```
Pipeline: Load Sales Data
├── Activity 1: Copy from SQL Server to Blob Storage
├── Activity 2: Transform data (Data Flow)
└── Activity 3: Copy from Blob Storage to Synapse
```
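
The ordering in this flow is expressed in pipeline JSON through each activity's `dependsOn` list. The sketch below builds such a chain in Python; the activity names are hypothetical and `typeProperties` are left out for brevity, so this is an outline of the dependency wiring rather than a deployable pipeline.

```python
def chain(activities):
    """Make each activity depend on the success of the one before it."""
    chained = []
    previous = None
    for activity in activities:
        step = dict(activity)  # copy so the input list is not mutated
        step["dependsOn"] = (
            [{"activity": previous, "dependencyConditions": ["Succeeded"]}]
            if previous
            else []
        )
        chained.append(step)
        previous = step["name"]
    return chained


# Hypothetical activity names; typeProperties omitted for brevity.
steps = chain([
    {"name": "CopySqlToBlob", "type": "Copy"},
    {"name": "TransformSales", "type": "ExecuteDataFlow"},
    {"name": "CopyBlobToSynapse", "type": "Copy"},
])

pipeline = {"name": "LoadSalesData", "properties": {"activities": steps}}

for step in steps:
    print(step["name"], "<-", [d["activity"] for d in step["dependsOn"]])
```

Activities with an empty `dependsOn` list start as soon as the pipeline runs; the others wait for the named activity to reach the listed condition.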

### Pipeline Parameters

Pipelines can accept parameters for dynamic behavior:

```json
{
  "name": "LoadSalesData",
  "properties": {
    "parameters": {
      "sourceTable": {
        "type": "String"
      },
      "targetFolder": {
        "type": "String"
      },
      "loadDate": {
        "type": "String"
      }
    }
  }
}
```
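
Before starting a run, it can help to check the values you intend to pass against the parameters a pipeline declares. The helper below is a local pre-flight check written for this module, not part of ADF; the `defaultValue` on `targetFolder` is an assumption added for illustration.

```python
def resolve_parameters(declared, supplied):
    """Merge supplied values over declared defaults; reject unknown or missing names."""
    unknown = set(supplied) - set(declared)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in declared.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif "defaultValue" in spec:
            resolved[name] = spec["defaultValue"]
        else:
            raise ValueError(f"Missing required parameter: {name}")
    return resolved


# Same parameter names as the pipeline above; the default is hypothetical.
declared = {
    "sourceTable": {"type": "String"},
    "targetFolder": {"type": "String", "defaultValue": "raw/sales"},
    "loadDate": {"type": "String"},
}

run_parameters = resolve_parameters(
    declared, {"sourceTable": "dbo.Sales", "loadDate": "2024-01-15"}
)
print(run_parameters)
```

Inside the pipeline, activities read these values with expressions such as `@pipeline().parameters.loadDate`.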

### Best Practices

✅ **Single Responsibility**: Each pipeline should do one thing well
✅ **Parameterize**: Use parameters for flexibility
✅ **Error Handling**: Implement proper error handling
✅ **Logging**: Add logging for debugging
✅ **Naming**: Use descriptive pipeline names


## Activities

**Activities** are individual tasks within a pipeline. Each activity performs a specific operation on data.

### Activity Types

#### 1. Data Movement Activities

**Copy Activity**: Copy data from source to sink
- Most common activity
- Supports 90+ data sources
- Handles schema mapping
- Supports transformations during copy

#### 2. Data Transformation Activities

**Data Flow Activity**: Transform data using visual data flows
- Code-free transformations
- Spark-based execution
- Supports complex transformations

**Stored Procedure Activity**: Execute stored procedures
- Run SQL stored procedures
- Pass parameters
- Does not return result sets to the pipeline; use a Lookup or Script activity when you need output

#### 3. Control Flow Activities

**Lookup Activity**: Look up values from datasets
- Return the first row only, or a set of rows (up to 5,000)
- Use the output in conditional logic or as input to a ForEach
- Reference data lookups

**If Condition Activity**: Conditional branching
- Execute activities based on conditions
- IF-THEN-ELSE logic

**ForEach Activity**: Loop through items
- Iterate over arrays
- Execute activities for each item
- Parallel or sequential execution

**Wait Activity**: Pause pipeline execution
- Wait for a specified duration (in seconds)
- Does not wait for external events; use a Webhook or Validation activity for that

**Until Activity**: Loop until condition is met
- Retry logic
- Polling scenarios
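
As a sketch of the ForEach Activity described above, the Python below builds a loop that copies one table per item. The pipeline parameter `tables`, the dataset names `DS_SqlTable` and `DS_LakeParquet`, and their `tableName` parameter are hypothetical; the batch size of 4 is an arbitrary choice.

```python
import json


def item_expression():
    """Expression that resolves to the current ForEach element."""
    return {"value": "@item()", "type": "Expression"}


for_each = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        # Array to iterate over, taken from a (hypothetical) pipeline parameter.
        "items": {"value": "@pipeline().parameters.tables", "type": "Expression"},
        # Run iterations in parallel, at most 4 at a time.
        "isSequential": False,
        "batchCount": 4,
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "DS_SqlTable",
                        "type": "DatasetReference",
                        "parameters": {"tableName": item_expression()},
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "DS_LakeParquet",
                        "type": "DatasetReference",
                        "parameters": {"tableName": item_expression()},
                    }
                ],
                "typeProperties": {
                    "source": {"type": "SqlServerSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ],
    },
}

print(json.dumps(for_each["typeProperties"]["items"]))
```

Setting `isSequential` to `True` runs the iterations one at a time, which is the safer choice when iterations write to the same target.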

### Copy Activity Example

```json
{
  "name": "CopySalesData",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "SourceDataset",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "SinkDataset",
      "type": "DatasetReference"
    }
  ],
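  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "formatSettings": {
        "type": "DelimitedTextReadSettings",
        "skipLineCount": 1
      }
    },
    "sink": {
      "type": "DelimitedTextSink",
      "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv"
      }
    }
  }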
}
```
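
The Copy Activity's schema mapping is configured through a `translator` entry inside `typeProperties`. The sketch below generates one from a plain column map; the sink column names (`customer_id` and so on) are invented for illustration.

```python
import json


def tabular_translator(column_map):
    """Build an explicit source-to-sink column mapping for a Copy activity."""
    return {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": src}, "sink": {"name": dst}}
            for src, dst in column_map.items()
        ],
    }


# Source names come from the CSV dataset schema; sink names are hypothetical.
translator = tabular_translator({
    "CustomerID": "customer_id",
    "SalesAmount": "sales_amount",
    "SaleDate": "sale_date",
})

# The translator sits next to "source" and "sink" in the activity.
type_properties = {
    "source": {"type": "DelimitedTextSource"},
    "sink": {"type": "AzureSqlSink"},
    "translator": translator,
}

print(json.dumps(translator, indent=2))
```

Without a translator, the Copy Activity falls back to mapping columns by name.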


## Source and Sink

### Source

**Source** is where data comes from in a data movement activity (typically Copy Activity).

**Source Properties:**
- **Dataset Reference**: Points to source dataset
- **Query**: SQL query for databases
- **File Path**: Path to files in storage
- **Filter**: Filter data at source

**Common Source Types:**
- **DelimitedTextSource**: CSV, TSV files
- **JsonSource**: JSON files
- **AzureSqlSource** / **SqlServerSource**: Azure SQL Database and SQL Server
- **BlobSource**: Blob storage
- **ParquetSource**: Parquet files

### Sink

**Sink** is where data goes to in a data movement activity.

**Sink Properties:**
- **Dataset Reference**: Points to sink dataset
- **Write Behavior**: Insert, upsert, or stored procedure for SQL sinks (available options vary by connector)
- **Pre-copy Script**: SQL script to run before copy
- **Table Option**: Auto-create table if not exists

**Common Sink Types:**
- **DelimitedTextSink**: CSV, TSV files
- **JsonSink**: JSON files
- **AzureSqlSink** / **SqlServerSink**: Azure SQL Database and SQL Server
- **BlobSink**: Blob storage
- **ParquetSink**: Parquet files

### Source and Sink Example

```
Source (SQL Server)
├── Dataset: SQLServerTable
├── Query: SELECT * FROM Sales WHERE SaleDate >= @StartDate
└── Connection: Linked Service to SQL Server

    ↓ Copy Activity ↓

Sink (Azure Data Lake)
├── Dataset: DelimitedText
├── Path: /raw/sales/2024/01/
├── Format: CSV
└── Connection: Linked Service to ADLS Gen2
```
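
The diagram above could translate into a Copy activity along the following lines. This is a sketch under assumptions: the dataset names and the `startDate` pipeline parameter are hypothetical, and the query injects the parameter with ADF string interpolation (`@{...}`) rather than a bound SQL variable.

```python
import json


def sql_to_lake_copy(source_dataset, sink_dataset):
    """Copy activity that filters rows in the source query before moving them."""
    query = (
        "SELECT * FROM Sales "
        "WHERE SaleDate >= '@{pipeline().parameters.startDate}'"
    )
    return {
        "name": "CopySalesToLake",  # hypothetical activity name
        "type": "Copy",
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            # Source: SQL Server, filtered at the source by the query.
            "source": {
                "type": "SqlServerSource",
                "sqlReaderQuery": {"value": query, "type": "Expression"},
            },
            # Sink: CSV files written to ADLS Gen2.
            "sink": {
                "type": "DelimitedTextSink",
                "storeSettings": {"type": "AzureBlobFSWriteSettings"},
                "formatSettings": {
                    "type": "DelimitedTextWriteSettings",
                    "fileExtension": ".csv",
                },
            },
        },
    }


activity = sql_to_lake_copy("DS_SalesSqlTable", "DS_SalesLakeCsv")
print(json.dumps(activity["typeProperties"]["source"], indent=2))
```

The target folder (for example `/raw/sales/2024/01/`) belongs to the sink dataset's location, not to the activity itself.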

### Best Practices

✅ **Filter at Source**: Use source queries to filter data early
✅ **Partitioning**: Use partitioned sinks for large data
✅ **Compression**: Compress data during transfer
✅ **Parallel Copy**: Enable parallel copy for performance
✅ **Error Handling**: Handle source/sink errors gracefully


## Building a Simple Pipeline

### Step-by-Step Process

#### Step 1: Create Linked Services
1. Create linked service for source (e.g., Azure Blob Storage)
2. Create linked service for sink (e.g., Azure SQL Database)
3. Configure connection details and credentials

#### Step 2: Create Datasets
1. Create source dataset (points to source data)
2. Create sink dataset (points to destination)
3. Define schema and format

#### Step 3: Create Pipeline
1. Create new pipeline
2. Add Copy Activity
3. Configure source and sink
4. Set up dependencies

#### Step 4: Configure Triggers
1. Create schedule trigger (e.g., daily at 2 AM)
2. Attach trigger to pipeline
3. Set parameters if needed

#### Step 5: Publish and Monitor
1. Publish pipeline to Data Factory
2. Trigger pipeline manually or wait for schedule
3. Monitor execution in Monitor hub

### Example: Simple Copy Pipeline

**Scenario**: Copy CSV file from Blob Storage to SQL Database

```
1. Linked Service: LS_BlobStorage
   └── Connection to Azure Blob Storage

2. Linked Service: LS_SQLDatabase
   └── Connection to Azure SQL Database

3. Dataset: DS_SalesCSV (Source)
   └── Points to: container/sales/data.csv
   └── Format: CSV

4. Dataset: DS_SalesTable (Sink)
   └── Points to: dbo.Sales table
   └── Format: SQL Table

5. Pipeline: PL_CopySalesData
   └── Activity: Copy Activity
       ├── Source: DS_SalesCSV
       └── Sink: DS_SalesTable

6. Trigger: TR_DailyAt2AM
   └── Schedule: Daily at 2:00 AM
   └── Pipeline: PL_CopySalesData
```
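
Because each component refers to another by name, a misspelled reference is an easy mistake to make. The following sketch models the six components above as a simplified inventory and checks that every reference resolves. The inventory structure is invented for this example and is not an ADF format.

```python
# Simplified inventory of the scenario above (not an ADF file format).
factory = {
    "linked_services": {"LS_BlobStorage", "LS_SQLDatabase"},
    "datasets": {
        "DS_SalesCSV": "LS_BlobStorage",
        "DS_SalesTable": "LS_SQLDatabase",
    },
    "pipelines": {
        "PL_CopySalesData": {"source": "DS_SalesCSV", "sink": "DS_SalesTable"},
    },
    "triggers": {"TR_DailyAt2AM": "PL_CopySalesData"},
}


def find_broken_references(factory):
    """Return a message for every reference that points at a missing component."""
    problems = []
    for dataset, linked_service in factory["datasets"].items():
        if linked_service not in factory["linked_services"]:
            problems.append(
                f"dataset {dataset} -> missing linked service {linked_service}"
            )
    for pipeline, endpoints in factory["pipelines"].items():
        for role, dataset in endpoints.items():
            if dataset not in factory["datasets"]:
                problems.append(
                    f"pipeline {pipeline} {role} -> missing dataset {dataset}"
                )
    for trigger, pipeline in factory["triggers"].items():
        if pipeline not in factory["pipelines"]:
            problems.append(f"trigger {trigger} -> missing pipeline {pipeline}")
    return problems


print(find_broken_references(factory) or "all references resolve")
```

The check follows the same direction as the build order: datasets need linked services, pipelines need datasets, and triggers need pipelines.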


## Triggers

**Triggers** determine when a pipeline execution should be kicked off. They can be scheduled or event-based.

### Trigger Types

#### 1. Schedule Trigger
- **Purpose**: Run pipeline on a schedule
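- **Examples**: Daily, weekly, or monthly recurrences, optionally pinned to specific hours, minutes, or weekdays (ADF uses a recurrence definition rather than cron expressions)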
- **Use Cases**: Scheduled data loads, regular ETL jobs

#### 2. Tumbling Window Trigger
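- **Purpose**: Run pipeline over fixed-size, non-overlapping, contiguous time windows that begin at a set start time
- **Examples**: Every hour, every 15 minutes
- **Use Cases**: Periodic processing of time slices, including backfilling past windows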

#### 3. Event-Based Trigger
- **Purpose**: Trigger on events such as blob creation or deletion
- **Examples**: File added to or removed from a storage container, custom event published to an Event Grid topic
- **Use Cases**: Process files as they arrive

#### 4. Manual Trigger
- **Purpose**: Trigger pipeline manually
- **Examples**: On-demand execution
- **Use Cases**: Testing, ad-hoc processing
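
To make the Tumbling Window Trigger's windows concrete, the sketch below computes them in plain Python. It only illustrates the windowing idea and does not call ADF; the start time, end time, and one-hour size are arbitrary.

```python
from datetime import datetime, timedelta, timezone


def tumbling_windows(start, end, size):
    """Return contiguous, non-overlapping (start, end) windows that finish by `end`."""
    windows = []
    window_start = start
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 3, 30, tzinfo=timezone.utc)

# The 03:00-04:00 window is not listed because it has not finished yet.
for window_start, window_end in tumbling_windows(start, now, timedelta(hours=1)):
    print(window_start.isoformat(), "->", window_end.isoformat())
```

In a real tumbling window trigger, each run receives its window boundaries through `@trigger().outputs.windowStartTime` and `@trigger().outputs.windowEndTime`, so the pipeline can process exactly that time slice.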

### Trigger Example

**Schedule Trigger:**
```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "LoadSalesData"
        },
        "parameters": {
          "loadDate": "@trigger().scheduledTime"
        }
      }
    ]
  }
}
```
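
A recurrence can also pin runs to specific hours, minutes, or weekdays through a `schedule` block. The Python sketch below builds two such triggers; the trigger names and the weekday example are hypothetical, while `LoadSalesData` and `loadDate` match the pipeline used in this module.

```python
import json


def schedule_trigger(name, pipeline, recurrence, parameters=None):
    """Wrap a recurrence and a pipeline reference in a schedule trigger definition."""
    return {
        "name": name,
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {"recurrence": recurrence},
            "pipelines": [
                {
                    "pipelineReference": {
                        "type": "PipelineReference",
                        "referenceName": pipeline,
                    },
                    "parameters": parameters or {},
                }
            ],
        },
    }


# Every day at 02:00 UTC.
daily_at_2am = {
    "frequency": "Day",
    "interval": 1,
    "startTime": "2024-01-01T00:00:00Z",
    "timeZone": "UTC",
    "schedule": {"hours": [2], "minutes": [0]},
}

# Monday to Friday at 06:30 UTC.
weekdays_at_6_30 = {
    "frequency": "Week",
    "interval": 1,
    "startTime": "2024-01-01T00:00:00Z",
    "timeZone": "UTC",
    "schedule": {
        "weekDays": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
        "hours": [6],
        "minutes": [30],
    },
}

triggers = [
    schedule_trigger(
        "TR_DailyAt2AM",
        "LoadSalesData",
        daily_at_2am,
        {"loadDate": "@trigger().scheduledTime"},
    ),
    schedule_trigger("TR_WeekdaysAt0630", "LoadSalesData", weekdays_at_6_30),
]

print(json.dumps(triggers[0], indent=2))
```

Keeping the recurrence separate from the trigger wrapper makes it easy to reuse one schedule for several pipelines.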

### Best Practices

✅ **Use Parameters**: Pass dynamic values via trigger parameters
✅ **Time Zones**: Be aware of time zone settings
✅ **Error Handling**: Configure retry policies on activities and on tumbling window triggers
✅ **Monitoring**: Monitor trigger executions
✅ **Naming**: Use descriptive trigger names


## Summary

In this module, we've covered:

✅ What Azure Data Factory is and what it is used for
✅ Linked Services - connection configurations
✅ Datasets - data structure definitions
✅ Pipelines - workflow orchestration
✅ Activities - individual tasks
✅ Source and Sink concepts
✅ Building simple pipelines
✅ Triggers - scheduling and event-based execution

### Key Takeaways

1. **Linked Services** define connections to data sources and compute
2. **Datasets** represent data structures and locations
3. **Pipelines** orchestrate workflows of activities
4. **Activities** perform individual tasks (copy, transform, etc.)
5. **Source** is where data comes from, **Sink** is where it goes
6. **Triggers** determine when pipelines run
7. **ADF** provides visual, code-free ETL capabilities

### Component Hierarchy

```
Data Factory
├── Linked Services (Connections)
├── Datasets (Data Definitions)
├── Pipelines (Workflows)
│   ├── Activities (Tasks)
│   │   ├── Source (Input)
│   │   └── Sink (Output)
│   └── Parameters & Variables
└── Triggers (Scheduling)
```

### Next Steps

Proceed to **Module 06: Spark Basics in Azure** to learn about:
- Apache Spark in Azure context
- Azure Databricks
- Processing data with Spark
- Spark transformations and actions
