# Azure Data Factory - Comprehensive Guide

## Overview

This comprehensive guide covers Azure Data Factory (ADF) features, components, and best practices. Use this as a reference before working in ADF Studio to understand the concepts and architecture.

## Learning Objectives

By the end of this guide, you will understand:
- Azure Data Factory architecture and components
- Linked Services and connection management
- Datasets and data structure definitions
- Pipelines and workflow orchestration
- Various activity types and their use cases
- Data Flows for transformations
- Integration Runtimes and compute options
- Triggers and scheduling
- Parameters, variables, and expressions
- Error handling and monitoring
- Best practices and common patterns


## What is Azure Data Factory?

**Azure Data Factory (ADF)** is a cloud-based data integration service that enables you to create data-driven workflows for orchestrating and automating data movement and data transformation.

### Key Characteristics

- **Serverless**: No infrastructure to manage
- **Visual Interface**: Drag-and-drop pipeline designer
- **90+ Connectors**: Built-in connectors for various data sources
- **Code-Free ETL**: Build pipelines without writing code
- **Hybrid Integration**: Connect to on-premises and cloud data sources
- **Scalable**: Automatically scales based on workload
- **Cost-Effective**: Pay only for what you use

### ADF vs Traditional ETL Tools

| Feature | Traditional ETL | Azure Data Factory |
|---------|----------------|-------------------|
| **Infrastructure** | Requires servers | Serverless |
| **Scalability** | Manual scaling | Auto-scaling |
| **Cost Model** | Fixed costs | Pay-per-use |
| **Maintenance** | High maintenance | Low maintenance |
| **Cloud Integration** | Limited | Native Azure integration |
| **Visual Design** | Limited | Full visual designer |

### Use Cases

✅ **Data Migration**: Move data from on-premises to cloud

✅ **ETL/ELT Pipelines**: Extract, transform, and load data

✅ **Data Integration**: Combine data from multiple sources

✅ **Scheduled Data Loads**: Automate daily/weekly/monthly refreshes

✅ **Data Orchestration**: Coordinate complex data workflows

✅ **Near-Real-Time Processing**: React to events such as file arrival within minutes (ADF is batch-oriented; continuous stream processing belongs in a dedicated streaming service)

✅ **Data Warehousing**: Load data into data warehouses

✅ **Big Data Processing**: Process large volumes of data


## ADF Architecture and Components

Azure Data Factory consists of several key components that work together:

```
Azure Data Factory Instance
│
├── Linked Services (Connections)
│   ├── Data Store Linked Services
│   └── Compute Linked Services
│
├── Datasets (Data Definitions)
│   ├── Source Datasets
│   └── Sink Datasets
│
├── Pipelines (Workflows)
│   ├── Activities
│   │   ├── Data Movement Activities
│   │   ├── Data Transformation Activities
│   │   └── Control Flow Activities
│   ├── Parameters
│   └── Variables
│
├── Data Flows (Transformations)
│   ├── Source Transformations
│   ├── Transform Steps
│   └── Sink Transformations
│
├── Integration Runtimes (Compute)
│   ├── Azure Integration Runtime
│   ├── Self-Hosted Integration Runtime
│   └── Azure-SSIS Integration Runtime
│
└── Triggers (Scheduling)
    ├── Schedule Triggers
    ├── Tumbling Window Triggers
    └── Event-Based Triggers
```

### Component Relationships

```
Linked Service → Dataset → Pipeline Activity
     ↓              ↓            ↓
  Connection    Data Structure  Task Execution
```
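The reference chain above can be sketched as a small consistency check. This is an illustrative Python sketch, not an ADF API: the dictionaries are hypothetical, heavily simplified stand-ins for real definitions, and `find_broken_references` is a made-up helper name.

```python
# Hypothetical, simplified definitions; real ADF JSON carries many more fields.
linked_services = {"LS_AzureBlobStorage": {"type": "AzureBlobStorage"}}
datasets = {
    "DS_SalesCSV": {"type": "DelimitedText", "linkedServiceName": "LS_AzureBlobStorage"},
}
activities = [
    {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": ["DS_SalesCSV"],
        "outputs": ["DS_SalesTable"],  # deliberately undefined to show a failure
    },
]


def find_broken_references(linked_services, datasets, activities):
    """Return messages for every reference that points at a missing component."""
    problems = []
    # Each dataset must name an existing linked service.
    for ds_name, dataset in datasets.items():
        ls_name = dataset["linkedServiceName"]
        if ls_name not in linked_services:
            problems.append(f"dataset {ds_name} -> missing linked service {ls_name}")
    # Each activity input/output must name an existing dataset.
    for activity in activities:
        for ds_name in activity.get("inputs", []) + activity.get("outputs", []):
            if ds_name not in datasets:
                problems.append(f"activity {activity['name']} -> missing dataset {ds_name}")
    return problems


print(find_broken_references(linked_services, datasets, activities))
```

ADF Studio performs its own validation when you publish; the sketch only shows which direction the references point.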

### ADF Studio Interface

When you open ADF Studio, you'll see:

- **Author Tab**: Design and create pipelines, datasets, and data flows
- **Monitor Tab**: View pipeline runs, activity executions, debug sessions
- **Manage Tab**: Manage linked services, integration runtimes, triggers
- **Learning Center Tab**: Browse templates and samples


## Linked Services

**Linked Services** are connection definitions that contain the connection information needed for Data Factory to connect to external resources.

### Purpose

- Store connection information (connection strings, credentials, endpoints)
- Enable reusability across multiple pipelines
- Secure credential management (Azure Key Vault integration)
- Abstract connection details from pipelines

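### Linked Service Structure

The skeleton below is simplified: in the JSON that ADF Studio shows, `type`, `typeProperties`, and `connectVia` sit inside a top-level `properties` object, and the `//` comments are placeholders rather than valid JSON. The linked service, dataset, and trigger examples in this guide use the same flattened shape for brevity.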

```json
{
  "name": "LinkedServiceName",
  "type": "LinkedServiceType",
  "typeProperties": {
    // Connection-specific properties
  },
  "connectVia": {
    // Integration Runtime reference (optional)
  }
}
```

### Types of Linked Services

#### 1. Data Store Linked Services

Connect to data storage systems:

**Azure Storage:**
- `AzureBlobStorage` - Azure Blob Storage
- `AzureBlobFS` - Azure Data Lake Storage Gen2 (ADLS Gen2)
- `AzureFileStorage` - Azure Files
- `AzureTableStorage` - Azure Table Storage

**Databases:**
- `AzureSqlDatabase` - Azure SQL Database
- `AzureSqlMI` - Azure SQL Managed Instance
- `SqlServer` - SQL Server (on-premises or Azure VM)
- `AzureSqlDW` - Azure Synapse Analytics (dedicated SQL pool)
- `Oracle`, `MySql`, `PostgreSql` - Other relational databases

**NoSQL:**
- `CosmosDb` - Azure Cosmos DB
- `MongoDb` - MongoDB

**File Systems:**
- `FileServer` - On-premises file system
- `FtpServer` - FTP server
- `Sftp` - SFTP server

**Cloud Storage:**
- `AmazonS3` - Amazon S3
- `GoogleCloudStorage` - Google Cloud Storage

**Other:**
- `HttpServer` - Generic HTTP endpoints (use `RestService` for REST APIs)
- `OData` - OData services
- `Salesforce`, `Dynamics` - CRM systems (Salesforce, Dynamics 365)

#### 2. Compute Linked Services

Connect to compute services for data transformation:

- `AzureDatabricks` - Azure Databricks
- `HDInsight` - Azure HDInsight
- `AzureBatch` - Azure Batch
- `AzureMLService` - Azure Machine Learning

### Linked Service Examples

#### Azure Blob Storage Linked Service

```json
{
  "name": "LS_AzureBlobStorage",
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=***;EndpointSuffix=core.windows.net"
  }
}
```

#### Azure SQL Database Linked Service (with Key Vault)

```json
{
  "name": "LS_AzureSQLDatabase",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydatabase;User ID=myuser;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "LS_AzureKeyVault",
        "type": "LinkedServiceReference"
      },
      "secretName": "sqlPassword"
    }
  }
}
```

#### On-Premises SQL Server Linked Service (via Self-Hosted Integration Runtime)

```json
{
  "name": "LS_SqlServerOnPrem",
  "type": "SqlServer",
  "typeProperties": {
    "connectionString": "Integrated Security=False;Data Source=myserver;Initial Catalog=mydatabase;User ID=myuser;",
    "password": {
      "type": "SecureString",
      "value": "***"
    }
  },
  "connectVia": {
    "referenceName": "IR_SelfHosted",
    "type": "IntegrationRuntimeReference"
  }
}
```

### Best Practices for Linked Services

✅ **Use Azure Key Vault**: Store sensitive credentials in Key Vault

✅ **Naming Convention**: Use prefixes like `LS_` (e.g., `LS_AzureBlobStorage_Prod`)

✅ **Parameterize**: Use parameters for different environments (dev, test, prod)

✅ **Reuse**: Create linked services that can be reused across pipelines

✅ **Documentation**: Add descriptions to linked services

✅ **Test Connections**: Always test connections after creating linked services

✅ **Use Integration Runtimes**: Specify Integration Runtime for on-premises connections


## Datasets

**Datasets** represent data structures within data stores. They define the structure, location, and format of data that you want to use as input or output in activities.

### Purpose

- Define data structure (schema, columns, data types)
- Specify data location (path, table name, query)
- Define data format (CSV, JSON, Parquet, etc.)
- Enable reusability across activities
- Support parameterization for dynamic paths

### Dataset Structure

```json
{
  "name": "DatasetName",
  "type": "DatasetType",
  "linkedServiceName": {
    "referenceName": "LinkedServiceName",
    "type": "LinkedServiceReference"
  },
  "schema": [
    // Schema definition (optional)
  ],
  "typeProperties": {
    // Type-specific properties
  },
  "parameters": {
    // Parameters for dynamic paths
  }
}
```

### Common Dataset Types

#### File-Based Datasets

**DelimitedText (CSV, TSV):**
```json
{
  "name": "DS_SalesCSV",
  "type": "DelimitedText",
  "linkedServiceName": {
    "referenceName": "LS_AzureBlobStorage",
    "type": "LinkedServiceReference"
  },
  "schema": [
    { "name": "CustomerID", "type": "Int32" },
    { "name": "SalesAmount", "type": "Decimal" },
    { "name": "SaleDate", "type": "DateTime" }
  ],
  "typeProperties": {
    "location": {
      "type": "AzureBlobStorageLocation",
      "container": "raw-data",
      "folderPath": "sales/2024"
    },
    "columnDelimiter": ",",
    "firstRowAsHeader": true,
    "compressionCodec": "gzip"
  }
}
```

**Json:**
```json
{
  "name": "DS_ProductsJSON",
  "type": "Json",
  "typeProperties": {
    "location": {
      "type": "AzureBlobStorageLocation",
      "container": "data",
      "folderPath": "products"
    }
  }
}
```

**Parquet:**
```json
{
  "name": "DS_SalesParquet",
  "type": "Parquet",
  "typeProperties": {
    "location": {
      "type": "AzureBlobStorageLocation",
      "container": "processed-data",
      "folderPath": "sales"
    }
  }
}
```

#### Database Datasets

**AzureSqlTable:**
```json
{
  "name": "DS_SalesTable",
  "type": "AzureSqlTable",
  "linkedServiceName": {
    "referenceName": "LS_AzureSQLDatabase",
    "type": "LinkedServiceReference"
  },
  "schema": [
    { "name": "CustomerID", "type": "Int32" },
    { "name": "SalesAmount", "type": "Decimal" }
  ],
  "typeProperties": {
    "schema": "dbo",
    "table": "Sales"
  }
}
```

### Parameterized Datasets

Use parameters for dynamic paths and table names:

```json
{
  "name": "DS_SalesCSV_Param",
  "type": "DelimitedText",
  "parameters": {
    "folderPath": {
      "type": "String"
    },
    "fileName": {
      "type": "String"
    }
  },
  "typeProperties": {
    "location": {
      "type": "AzureBlobStorageLocation",
      "container": "raw-data",
      "folderPath": {
        "value": "@dataset().folderPath",
        "type": "Expression"
      },
      "fileName": {
        "value": "@dataset().fileName",
        "type": "Expression"
      }
    }
  }
}
```
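To make the substitution concrete, the following Python sketch mimics how a `@dataset().<name>` reference is replaced by the value supplied at run time. It is a toy resolver written for this guide, not ADF's expression engine: `resolve_dataset_expression` is a hypothetical helper that only understands a bare parameter reference, and the sample parameter values are made up.

```python
import re


def resolve_dataset_expression(expression, parameters):
    """Resolve a bare '@dataset().<name>' reference; return other strings unchanged."""
    match = re.fullmatch(r"@dataset\(\)\.(\w+)", expression)
    if match is None:
        return expression
    return parameters[match.group(1)]


# Location block from the dataset, with expressions flattened to plain strings.
location = {
    "container": "raw-data",
    "folderPath": "@dataset().folderPath",
    "fileName": "@dataset().fileName",
}
# Values an activity would pass when it references the dataset.
params = {"folderPath": "sales/2024/01", "fileName": "sales_20240101.csv"}

resolved = {key: resolve_dataset_expression(value, params) for key, value in location.items()}
blob_path = "/".join([resolved["container"], resolved["folderPath"], resolved["fileName"]])
print(blob_path)  # raw-data/sales/2024/01/sales_20240101.csv
```

One dataset definition can therefore serve every folder and file that shares the same format, which is the main reason to parameterize.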

### Dataset Schema

**Explicit Schema:**
- Define columns and data types explicitly
- Use when schema is known and stable
- Better for validation and error detection

**Implicit Schema:**
- Let ADF infer schema from data
- Use when schema is unknown or changes frequently
- Less control but more flexible

### Best Practices for Datasets

✅ **Parameterize Paths**: Use parameters for dynamic paths (dates, partitions)

✅ **Define Schema**: Explicitly define schema when known for better validation

✅ **Use Descriptive Names**: Clear, meaningful dataset names (e.g., `DS_SalesCSV_Source`)

✅ **Reuse**: Create reusable dataset definitions

✅ **Compression**: Use compression for large files (gzip, snappy)

✅ **Partitioning**: Use partitioned datasets for large data volumes

✅ **Naming Convention**: Use prefixes like `DS_` for datasets


## Pipelines

**Pipelines** are logical groupings of activities that together perform a task. A pipeline defines a workflow that orchestrates data movement and transformation.

### Purpose

- Orchestrate multiple activities in a workflow
- Define activity dependencies and execution order
- Accept parameters for flexibility
- Support variables for internal logic
- Enable error handling and retry policies
- Can be triggered on schedule or events

### Pipeline Structure

```json
{
  "name": "PipelineName",
  "properties": {
    "activities": [
      // Array of activities
    ],
    "parameters": {
      // Pipeline parameters
    },
    "variables": {
      // Pipeline variables
    },
    "annotations": []
  }
}
```

### Pipeline Components

#### 1. Activities
Individual tasks that perform operations (copy, transform, etc.)

#### 2. Parameters
Input parameters passed to the pipeline (from triggers or manual execution)

#### 3. Variables
Internal variables used within the pipeline for logic and calculations

#### 4. Dependencies
Define the order of activity execution (activity B depends on activity A)

#### 5. Error Handling
Configure how to handle failures (retry, fail, continue)

### Pipeline Example

```json
{
  "name": "PL_LoadSalesData",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "dependsOn": [],
        "inputs": [{"referenceName": "DS_SalesCSV", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "DS_SalesTable", "type": "DatasetReference"}]
      },
      {
        "name": "TransformData",
        "type": "ExecuteDataFlow",
        "dependsOn": [{"activity": "CopyFromBlobToSQL", "dependencyConditions": ["Succeeded"]}],
        "typeProperties": {
          "dataflow": {
            "referenceName": "DF_TransformSales",
            "type": "DataFlowReference"
          }
        }
      }
    ],
    "parameters": {
      "sourcePath": {"type": "String"},
      "targetTable": {"type": "String"}
    }
  }
}
```

### Pipeline Parameters

Parameters make pipelines flexible and reusable:

```json
{
  "parameters": {
    "sourceContainer": {
      "type": "String",
      "defaultValue": "raw-data"
    },
    "targetSchema": {
      "type": "String",
      "defaultValue": "dbo"
    },
    "loadDate": {
      "type": "String"
    }
  }
}
```

**Accessing Parameters:**
- In expressions: `@pipeline().parameters.sourceContainer`
- In activities: Reference parameters in activity properties

### Pipeline Variables

Variables store intermediate values:

```json
{
  "variables": {
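    "rowCount": {
      "type": "Integer",
      "defaultValue": 0
    },
    "processDate": {
      "type": "String",
      "defaultValue": ""
    }
  }
}
```

**Setting Variables:**
- Use the `Set Variable` activity (or `Append Variable` for array variables)
- There is no `setVariable()` expression function. A variable's `defaultValue` is stored as a literal and is not evaluated, so assign computed values such as `@formatDateTime(utcnow(), 'yyyy-MM-dd')` in the `Set Variable` activity's value field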

**Accessing Variables:**
- In expressions: `@variables('rowCount')`

### Activity Dependencies

Control execution order:

```json
{
  "name": "ActivityB",
  "dependsOn": [
    {
      "activity": "ActivityA",
      "dependencyConditions": ["Succeeded"]
    }
  ]
}
```

**Dependency Conditions:**
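- `Succeeded` - Run only if the upstream activity succeeded
- `Failed` - Run only if the upstream activity failed
- `Completed` - Run once the upstream activity finishes, whether it succeeded or failed
- `Skipped` - Run only if the upstream activity was skipped because its own dependencies were not met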
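
The conditions listed above can be read as a small decision rule. The Python sketch below is an illustration, not ADF code. It assumes that several conditions on one dependency are alternatives (any one is enough) and that all `dependsOn` entries must be satisfied; `condition_met` and `should_run` are hypothetical helper names.

```python
def condition_met(upstream_status, conditions):
    """True if any listed condition matches the upstream activity's final status."""
    for condition in conditions:
        if condition == "Completed":
            # Completed covers both finished outcomes, but not Skipped.
            if upstream_status in ("Succeeded", "Failed"):
                return True
        elif condition == upstream_status:
            return True
    return False


def should_run(depends_on, statuses):
    """An activity runs only when every dependsOn entry is satisfied."""
    return all(
        condition_met(statuses[dep["activity"]], dep["dependencyConditions"])
        for dep in depends_on
    )


activity_b = [{"activity": "ActivityA", "dependencyConditions": ["Succeeded"]}]
cleanup = [{"activity": "ActivityA", "dependencyConditions": ["Completed"]}]

print(should_run(activity_b, {"ActivityA": "Failed"}))  # False
print(should_run(cleanup, {"ActivityA": "Failed"}))     # True
```

A `Completed` dependency is the usual choice for cleanup or logging steps that must run regardless of the upstream outcome.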

### Best Practices for Pipelines

✅ **Single Responsibility**: Each pipeline should have one clear purpose

✅ **Parameterize**: Use parameters for flexibility and reusability

✅ **Error Handling**: Implement proper error handling and retry policies

✅ **Logging**: Add logging activities for debugging

✅ **Naming**: Use descriptive names (e.g., `PL_LoadSalesData_Daily`)

✅ **Documentation**: Add descriptions and annotations

✅ **Modularity**: Break complex pipelines into smaller, reusable pipelines

✅ **Version Control**: Use Git integration for version control


## Activities

**Activities** are individual tasks within a pipeline. Each activity performs a specific operation on data or controls pipeline flow.

### Activity Categories

#### 1. Data Movement Activities

##### Copy Activity
The most common activity for copying data from source to sink.

**Key Features:**
- Supports 90+ data sources
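- Maps source columns to sink columns automatically by name, or through an explicit mapping
- Converts file format and compression during copy (for example, CSV to Parquet); row-level transformations belong in Data Flows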
- Parallel copy for performance
- Data type conversion

**Example:**
```json
{
  "name": "CopySalesData",
  "type": "Copy",
  "inputs": [{"referenceName": "DS_Source", "type": "DatasetReference"}],
  "outputs": [{"referenceName": "DS_Sink", "type": "DatasetReference"}],
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "formatSettings": {
        "type": "DelimitedTextReadSettings",
        "skipLineCount": 1
      }
    },
    "sink": {
      "type": "DelimitedTextSink"
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "LS_AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    }
  }
}
```

#### 2. Data Transformation Activities

##### Data Flow Activity
Transform data using visual data flows (Spark-based).

**Key Features:**
- Code-free transformations
- Spark-based execution
- Supports complex transformations
- Visual designer interface

**Example:**
```json
{
  "name": "TransformData",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataflow": {
      "referenceName": "DF_TransformSales",
      "type": "DataFlowReference"
    },
    "compute": {
      "computeType": "General",
      "coreCount": 8
    }
  }
}
```

##### Stored Procedure Activity
Execute SQL stored procedures.

**Example:**
```json
{
  "name": "ExecuteSP",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": {
    "referenceName": "LS_AzureSQLDatabase",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "storedProcedureName": "sp_ProcessSales",
    "storedProcedureParameters": {
      "LoadDate": {
        "value": "@pipeline().parameters.loadDate",
        "type": "String"
      }
    }
  }
}
```

##### Lookup Activity
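Read a single row or a small result set from a dataset or query. The activity returns at most 5,000 rows or 4 MB, so it suits configuration and reference data rather than bulk reads.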

**Use Cases:**
- Get configuration values
- Reference data lookups
- Conditional logic based on lookup results

**Example:**
```json
{
  "name": "LookupConfig",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT ConfigValue FROM Config WHERE ConfigKey = 'MaxRows'"
    },
    "dataset": {
      "referenceName": "DS_ConfigTable",
      "type": "DatasetReference"
    },
    "firstRowOnly": true
  }
}
```

#### 3. Control Flow Activities

##### If Condition Activity
Conditional branching (IF-THEN-ELSE logic).

**Example:**
```json
{
  "name": "CheckDataExists",
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "value": "@greater(activity('LookupRowCount').output.firstRow.count, 0)",
      "type": "Expression"
    },
    "ifTrueActivities": [
      {
        "name": "ProcessData",
        "type": "Copy"
      }
    ],
    "ifFalseActivities": [
      {
        "name": "LogNoData",
        "type": "WebActivity"
      }
    ]
  }
}
```

##### ForEach Activity
Loop through items (arrays).

**Example:**
```json
{
  "name": "ProcessFiles",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.fileList",
      "type": "Expression"
    },
    "isSequential": false,
    "batchCount": 5,
    "activities": [
      {
        "name": "CopyFile",
        "type": "Copy"
      }
    ]
  }
}
```
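The effect of `isSequential` and `batchCount` can be approximated with a thread pool. This is a local Python analogy rather than how ADF schedules iterations: `for_each` and `copy_file` are hypothetical names, and `copy_file` merely stands in for the inner Copy activity.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_file(file_name):
    """Placeholder for the inner Copy activity."""
    return f"copied {file_name}"


def for_each(items, activity, is_sequential=False, batch_count=5):
    """Run activity once per item, one at a time or at most batch_count at a time."""
    if is_sequential:
        return [activity(item) for item in items]
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        # pool.map returns results in input order even though calls overlap.
        return list(pool.map(activity, items))


file_list = ["a.csv", "b.csv", "c.csv"]
print(for_each(file_list, copy_file, is_sequential=False, batch_count=5))
```

Because parallel iterations overlap, anything they share (such as a pipeline variable written inside the loop) can be overwritten unpredictably; switch to sequential execution when iterations depend on each other.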

##### Wait Activity
Pause pipeline execution.

**Example:**
```json
{
  "name": "WaitForProcessing",
  "type": "Wait",
  "typeProperties": {
    "waitTimeInSeconds": 300
  }
}
```

##### Until Activity
Loop until condition is met (retry logic).

**Example:**
```json
{
  "name": "WaitForFile",
  "type": "Until",
  "typeProperties": {
    "expression": {
      "value": "@equals(activity('CheckFile').output.exists, true)",
      "type": "Expression"
    },
    "timeout": "00:10:00",
    "activities": [
      {
        "name": "CheckFile",
        "type": "GetMetadata"
      },
      {
        "name": "Wait",
        "type": "Wait",
        "typeProperties": {
          "waitTimeInSeconds": 30
        }
      }
    ]
  }
}
```

#### 4. Other Activities

##### Web Activity
Call REST APIs.

##### Get Metadata Activity
Get metadata about data (file existence, schema, etc.).

##### Set Variable Activity
Set pipeline variable values.

##### Filter Activity
Filter arrays based on conditions.

##### Validation Activity
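Hold the pipeline until a referenced dataset (file or folder) exists or meets the specified criteria, and fail if the timeout is reached first.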

### Activity Retry and Timeout

Configure retry and timeout policies:

```json
{
  "name": "CopyData",
  "type": "Copy",
  "policy": {
    "timeout": "01:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 30
  }
}
```
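The retry settings behave roughly like the loop below. This Python sketch is an approximation for illustration: it assumes `retry` counts extra attempts after the first one, it ignores `timeout`, and `run_with_retry` and `make_flaky` are hypothetical helpers. The `sleep` argument is injectable so the example finishes instantly.

```python
import time


def run_with_retry(action, retry=3, retry_interval_seconds=30, sleep=time.sleep):
    """Call action(); on failure wait and retry, up to `retry` extra attempts.

    Returns (result, attempts_used). Re-raises the last error when retries run out.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return action(), attempts
        except Exception:
            if attempts > retry:
                raise
            sleep(retry_interval_seconds)


def make_flaky(failures):
    """Build a callable that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def action():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("transient failure")
        return "copied"

    return action


waits = []  # records each wait instead of actually sleeping
result, attempts = run_with_retry(make_flaky(2), sleep=waits.append)
print(result, attempts, waits)  # copied 3 [30, 30]
```

Retries help only with transient faults such as throttling or brief network loss; a misconfigured dataset fails the same way on every attempt.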

### Best Practices for Activities

✅ **Error Handling**: Configure retry policies for transient failures

✅ **Timeout**: Set appropriate timeouts

✅ **Dependencies**: Clearly define activity dependencies

✅ **Naming**: Use descriptive activity names

✅ **Logging**: Add logging for debugging

✅ **Performance**: Use parallel copy and staging for large data


## Data Flows

**Data Flows** are visual data transformation pipelines that run on Spark clusters. They provide a code-free way to transform data at scale.

### Purpose

- Transform data without writing code
- Handle complex transformations visually
- Scale automatically with Spark
- Support data quality and profiling
- Reusable across multiple pipelines

### Data Flow Architecture

```
Data Flow
├── Source (Read data)
├── Transformations (Transform data)
│   ├── Select
│   ├── Filter
│   ├── Aggregate
│   ├── Join
│   ├── Derived Column
│   ├── Sort
│   └── ... (many more)
└── Sink (Write data)
```

### Key Transformations

#### 1. Source Transformation
- Read data from datasets
- Define schema
- Configure data sampling
- Set up partitioning

#### 2. Select Transformation
- Select, rename, or drop columns
- Reorder columns
- Change data types

#### 3. Filter Transformation
- Filter rows based on conditions
- Use expressions for complex filters

#### 4. Derived Column Transformation
- Create new columns
- Modify existing columns
- Use expressions and functions

**Example Expressions:**
- `upper(columnName)` - Convert to uppercase (`toUpper()` is the pipeline expression equivalent, not a data flow function)
- `concat(firstName, ' ', lastName)` - Concatenate strings
- `year(currentDate())` - Get year from date
- `iif(amount > 1000, 'High', 'Low')` - Conditional logic
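
For readers more familiar with general-purpose code, the four example expressions correspond roughly to the Python below. The column names (`firstName`, `lastName`, `amount`) come from the expressions above; the output column names and the sample row are assumptions made for this sketch, and the current date is passed in so the result is reproducible.

```python
from datetime import date


def derive_columns(row, today):
    """Return a copy of row with four derived columns added."""
    return {
        **row,
        "lastNameUpper": row["lastName"].upper(),               # uppercase conversion
        "fullName": row["firstName"] + " " + row["lastName"],   # concat(...)
        "currentYear": today.year,                              # year(currentDate())
        "amountBand": "High" if row["amount"] > 1000 else "Low",  # iif(...)
    }


row = {"firstName": "Ada", "lastName": "Lovelace", "amount": 1500}
print(derive_columns(row, date(2024, 1, 1)))
```

In a data flow each expression is evaluated per row across the Spark cluster, which is what the dictionary-per-row model here imitates on a single machine.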

#### 5. Aggregate Transformation
- Group by columns
- Calculate aggregations (sum, avg, count, etc.)
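- Returns one row per group; use the Window transformation when you need per-row results such as rankings or running totals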

#### 6. Join Transformation
- Inner join, left join, right join, full outer join
- Join on multiple columns
- Handle nulls

#### 7. Sort Transformation
- Sort by one or more columns
- Ascending or descending
- Null handling

#### 8. Lookup Transformation
- Look up values from another source stream in the same data flow
- Reference data enrichment

#### 9. Pivot/Unpivot Transformations
- Pivot: Convert rows to columns
- Unpivot: Convert columns to rows

#### 10. Window Transformation
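- Window functions such as `rowNumber()`, `rank()`, `denseRank()`, `lag()`, and `lead()` (the data flow counterparts of SQL's ROW_NUMBER, RANK, and so on)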
- Partitioning and ordering

#### 11. Sink Transformation
- Write data to destination
- Configure output settings
- Handle partitioning

### Data Flow Example

**Scenario**: Transform sales data

```
Source (Sales CSV)
    ↓
Select (Choose columns)
    ↓
Filter (SalesAmount > 0)
    ↓
Derived Column (Calculate Total = Quantity * Price)
    ↓
Aggregate (Group by CustomerID, Sum Total)
    ↓
Join (with Customer table)
    ↓
Select (Final columns)
    ↓
Sink (Write to SQL Database)
```
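The same sequence of steps can be traced on a few in-memory rows. The Python sketch below only illustrates what each transformation does to the data; it is not how a data flow executes (that happens on Spark), and the sample rows, the `Note` column, and the customer names are invented for the example.

```python
from collections import defaultdict

# Hypothetical source rows and customer lookup table.
sales = [
    {"CustomerID": 1, "Quantity": 2, "Price": 10.0, "SalesAmount": 20.0, "Note": "x"},
    {"CustomerID": 1, "Quantity": 1, "Price": 5.0, "SalesAmount": 5.0, "Note": "y"},
    {"CustomerID": 2, "Quantity": 3, "Price": 4.0, "SalesAmount": 0.0, "Note": "z"},
]
customers = {1: "Contoso", 2: "Fabrikam"}

# Select: keep only the columns needed downstream.
keep = ("CustomerID", "Quantity", "Price", "SalesAmount")
selected = [{column: r[column] for column in keep} for r in sales]

# Filter: SalesAmount > 0.
filtered = [r for r in selected if r["SalesAmount"] > 0]

# Derived Column: Total = Quantity * Price.
derived = [{**r, "Total": r["Quantity"] * r["Price"]} for r in filtered]

# Aggregate: group by CustomerID, sum Total.
totals = defaultdict(float)
for r in derived:
    totals[r["CustomerID"]] += r["Total"]

# Join (inner) with the customer table, then Select the final columns.
result = [
    {"CustomerID": customer_id, "CustomerName": customers[customer_id], "Total": total}
    for customer_id, total in sorted(totals.items())
    if customer_id in customers
]
print(result)  # [{'CustomerID': 1, 'CustomerName': 'Contoso', 'Total': 25.0}]
```

Customer 2 disappears at the Filter step, which shows why filtering before the aggregate and join reduces the work done by later transformations.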

### Data Flow Debug Mode

- Test data flows interactively
- See sample data at each transformation
- Validate transformations before publishing
- Use sample data or full data

### Data Flow Performance

**Optimization Techniques:**
- **Partitioning**: Configure partitioning strategy
- **Caching**: Cache intermediate results
- **Sampling**: Use sampling for development
- **Cluster Size**: Configure appropriate cluster size
- **Broadcast Joins**: For small lookup tables

### Best Practices for Data Flows

✅ **Start Simple**: Begin with basic transformations

✅ **Use Debug Mode**: Test transformations before publishing

✅ **Optimize Joins**: Use broadcast joins for small tables

✅ **Partitioning**: Configure appropriate partitioning

✅ **Documentation**: Add descriptions to transformations

✅ **Reusability**: Create reusable data flows

✅ **Error Handling**: Handle nulls and data quality issues

✅ **Performance**: Monitor and optimize performance


## Integration Runtimes

**Integration Runtime (IR)** is the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments.

### Purpose

- Execute data movement activities
- Dispatch activities to compute services
- Connect to data sources in different networks
- Provide transformation capabilities

### Types of Integration Runtimes

#### 1. Azure Integration Runtime

**Purpose**: Cloud-based, fully managed IR for cloud-to-cloud data movement

**Characteristics:**
- Serverless and fully managed
- Automatically scales
- No infrastructure to manage
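- Not free: there is nothing to provision, but data movement, activity runs, and data flow execution on it are billed per use
- Reaches publicly accessible endpoints by default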

**Use Cases:**
- Copy data between cloud services
- Execute cloud-based transformations
- Connect to Azure services

**Limitations:**
- Cannot connect directly to on-premises data sources
- Cannot reach data stores inside private networks unless it runs in a managed virtual network with private endpoints

#### 2. Self-Hosted Integration Runtime

**Purpose**: IR installed on on-premises machines or VMs for hybrid connectivity

**Characteristics:**
- Installed on your infrastructure
- Connects to on-premises data sources
- Can connect to cloud services
- Requires maintenance and updates
- Supports high-availability setup

**Use Cases:**
- Connect to on-premises SQL Server
- Connect to on-premises file systems
- Connect to private networks
- Hybrid data movement scenarios

**Installation:**
- Download and install on Windows machine or VM
- Register with Data Factory using authentication key
- Can install multiple nodes (up to four) for high availability

**High Availability:**
- Install on multiple machines
- Automatic failover
- Load balancing

#### 3. Azure-SSIS Integration Runtime

**Purpose**: Lift and shift SQL Server Integration Services (SSIS) packages

**Characteristics:**
- Runs SSIS packages in Azure
- Stores the SSIS catalog (SSISDB) in Azure SQL Database or SQL Managed Instance
- Supports the project deployment model through that catalog
- Can join Azure Virtual Network

**Use Cases:**
- Migrate existing SSIS packages
- Run SSIS packages in cloud
- Leverage existing SSIS investments

### Integration Runtime Selection

**When to use Azure IR:**
- Cloud-to-cloud data movement
- Azure services only
- No on-premises connectivity needed

**When to use Self-Hosted IR:**
- On-premises data sources
- Private network connectivity
- Hybrid scenarios
- Network security requirements

**When to use Azure-SSIS IR:**
- Existing SSIS packages
- SSIS-specific features needed
- Complex SSIS transformations
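
The selection guidance above reduces to a short decision rule. The Python sketch below is a simplification for illustration only: `choose_integration_runtime` is a hypothetical helper, and it collapses the criteria into two yes/no questions, ignoring cost, managed virtual networks, and mixed scenarios.

```python
def choose_integration_runtime(runs_ssis_packages, source_is_private_or_on_prem):
    """Pick an IR type from two coarse questions about the workload."""
    if runs_ssis_packages:
        return "Azure-SSIS IR"
    if source_is_private_or_on_prem:
        return "Self-Hosted IR"
    return "Azure IR"


print(choose_integration_runtime(False, False))  # Azure IR
print(choose_integration_runtime(False, True))   # Self-Hosted IR
print(choose_integration_runtime(True, False))   # Azure-SSIS IR
```

A single data factory often uses more than one runtime, for example a Self-Hosted IR for the on-premises source and the Azure IR for the cloud-side steps of the same pipeline.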

### Integration Runtime Configuration

**Self-Hosted IR Setup:**
1. Create Self-Hosted IR in ADF
2. Download and install IR software
3. Register IR with authentication key
4. Configure network settings
5. Test connectivity

**Performance Tuning:**
- Scale up machine resources
- Use multiple nodes for parallel processing
- Optimize network connectivity
- Monitor IR health and performance

### Best Practices for Integration Runtimes

✅ **Right IR Type**: Choose appropriate IR for your scenario

✅ **High Availability**: Use multiple nodes for Self-Hosted IR

✅ **Monitoring**: Monitor IR health and performance

✅ **Security**: Secure IR machines and network connections

✅ **Updates**: Keep Self-Hosted IR updated

✅ **Network**: Optimize network connectivity for performance

✅ **Documentation**: Document IR configurations and purposes


## Triggers

**Triggers** determine when a pipeline execution should be started. They can be scheduled, event-based, or manual.

### Purpose

- Automate pipeline execution
- Schedule regular data loads
- Respond to events (file arrival, etc.)
- Coordinate multiple pipelines
- Pass parameters to pipelines

### Trigger Types

#### 1. Schedule Trigger

**Purpose**: Run pipeline on a recurring schedule

**Characteristics:**
- Based on calendar schedule
- Supports time zones
- Can pass parameters
- Supports recurrence patterns

**Schedule Patterns:**
- Daily, weekly, monthly
- Specific days of week
- Specific dates
- Custom intervals

**Example:**
```json
{
  "name": "TR_DailyAt2AM",
  "type": "ScheduleTrigger",
  "typeProperties": {
    "recurrence": {
      "frequency": "Day",
      "interval": 1,
      "startTime": "2024-01-01T02:00:00Z",
      "timeZone": "UTC",
      "schedule": {
        "hours": [2],
        "minutes": [0]
      }
    }
  },
  "pipelines": [
    {
      "pipelineReference": {
        "referenceName": "PL_LoadSalesData"
      },
      "parameters": {
        "loadDate": "@formatDateTime(trigger().scheduledTime, 'yyyy-MM-dd')"
      }
    }
  ]
}
```
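To see what the trigger above hands to the pipeline, the Python sketch below lists the first few scheduled times for a daily 02:00 UTC recurrence and the `loadDate` string each one would produce. It is a simplified model for illustration: `daily_runs` is a hypothetical helper that handles only a once-a-day schedule in UTC, not the full recurrence options.

```python
from datetime import datetime, timedelta, timezone


def daily_runs(start_time, hour, minute, count):
    """Return the first `count` daily run times at hour:minute on or after start_time."""
    first = start_time.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if first < start_time:
        first += timedelta(days=1)  # today's slot already passed; start tomorrow
    return [first + timedelta(days=offset) for offset in range(count)]


start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
for scheduled_time in daily_runs(start, hour=2, minute=0, count=3):
    # Mirrors formatDateTime(trigger().scheduledTime, 'yyyy-MM-dd').
    load_date = scheduled_time.strftime("%Y-%m-%d")
    print(scheduled_time.isoformat(), "loadDate =", load_date)
```

Deriving `loadDate` from the scheduled time rather than the actual start time keeps the value stable when a run starts late or is rerun.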

#### 2. Tumbling Window Trigger

**Purpose**: Run pipeline at regular intervals with fixed-size, non-overlapping time windows

**Characteristics:**
- Fixed-size time windows
- Non-overlapping intervals
- Supports retry on failure
- Can pass window start/end times

**Use Cases:**
- Hourly data processing
- Every 15 minutes processing
- Fixed-interval batch processing

**Example:**
```json
{
  "name": "TR_Hourly",
  "type": "TumblingWindowTrigger",
  "typeProperties": {
    "frequency": "Hour",
    "interval": 1,
    "startTime": "2024-01-01T00:00:00Z",
    "maxConcurrency": 1,
    "retryPolicy": {
      "count": 3,
      "intervalInSeconds": 30
    }
  },
  "pipeline": {
    "pipelineReference": {
      "referenceName": "PL_ProcessHourlyData"
    },
    "parameters": {
      "windowStart": "@trigger().outputs.windowStartTime",
      "windowEnd": "@trigger().outputs.windowEndTime"
    }
  }
}
```
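The windows such a trigger produces can be enumerated in a few lines. This Python sketch is a model of the idea, not the trigger's implementation: it assumes a window becomes eligible to run once it has fully elapsed, and `tumbling_windows` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def tumbling_windows(start_time, interval, now):
    """List (window_start, window_end) pairs for every window fully elapsed by `now`."""
    windows = []
    window_start = start_time
    while window_start + interval <= now:
        windows.append((window_start, window_start + interval))
        window_start += interval  # next window begins exactly where this one ends
    return windows


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 3, 30, tzinfo=timezone.utc)
for window_start, window_end in tumbling_windows(start, timedelta(hours=1), now):
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window's end equals the next window's start, so a pipeline that filters on `windowStart <= timestamp < windowEnd` processes every record exactly once.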

#### 3. Event-Based Trigger

**Purpose**: Trigger pipeline when events occur (file arrival, blob creation, etc.)

**Characteristics:**
- Responds to storage events
- Near real-time processing
- Event-driven architecture
- Supports filtering

**Supported Events:**
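- Blob created (`Microsoft.Storage.BlobCreated`)
- Blob deleted (`Microsoft.Storage.BlobDeleted`)
- Files in ADLS Gen2 raise the same two blob events; there are no separate file-created or file-deleted events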

**Example:**
```json
{
  "name": "TR_FileArrival",
  "type": "BlobEventsTrigger",
  "typeProperties": {
    "blobPathBeginsWith": "/raw-data/blobs/sales/",
    "blobPathEndsWith": ".csv",
    "scope": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Storage/storageAccounts/mystorage",
    "events": ["Microsoft.Storage.BlobCreated"]
  },
  "pipelines": [
    {
      "pipelineReference": {
        "referenceName": "PL_ProcessFile"
      },
      "parameters": {
        "fileName": "@triggerBody().fileName"
      }
    }
  ]
}
```
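The path filters amount to a prefix and suffix test on the blob path. The Python sketch below illustrates that test and how the path might split into the two trigger body values. It rests on assumptions: paths are written as `/<container>/blobs/<folder>/<file>`, `folderPath` is taken to include the container name, and `blob_event_matches` and `trigger_body` are hypothetical helpers, not ADF functions.

```python
def blob_event_matches(blob_path, begins_with, ends_with):
    """Apply the blobPathBeginsWith / blobPathEndsWith filters to one blob path."""
    return blob_path.startswith(begins_with) and blob_path.endswith(ends_with)


def trigger_body(blob_path):
    """Split '/<container>/blobs/<folder>/<file>' into folderPath and fileName."""
    container, _, blob_name = blob_path.lstrip("/").partition("/blobs/")
    folder, _, file_name = blob_name.rpartition("/")
    folder_path = f"{container}/{folder}" if folder else container
    return {"folderPath": folder_path, "fileName": file_name}


path = "/raw-data/blobs/sales/2024/sales_20240101.csv"
if blob_event_matches(path, "/raw-data/blobs/sales/", ".csv"):
    print(trigger_body(path))
```

Narrow filters matter because every matching blob event starts a pipeline run; a filter on the container alone would also fire for temporary or partial files written there.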

#### 4. Custom Event Trigger

**Purpose**: Trigger pipeline based on custom events from Azure Event Grid

**Use Cases:**
- Custom application events
- Integration with other Azure services
- Complex event scenarios

#### 5. Manual Trigger

**Purpose**: Trigger pipeline manually (on-demand)

**Characteristics:**
- No schedule or event
- User-initiated
- Useful for testing
- Can pass parameters

### Trigger Parameters

Pass parameters from triggers to pipelines:

```json
{
  "pipelines": [
    {
      "pipelineReference": {
        "referenceName": "PL_LoadData"
      },
      "parameters": {
        "sourcePath": "@concat('sales/', formatDateTime(trigger().scheduledTime, 'yyyy/MM/dd'))",
        "targetTable": "Sales"
      }
    }
  ]
}
```

### Trigger System Variables

**Schedule Trigger:**
- `@trigger().scheduledTime` - Scheduled execution time
- `@trigger().startTime` - Actual start time

**Tumbling Window Trigger:**
- `@trigger().outputs.windowStartTime` - Window start
- `@trigger().outputs.windowEndTime` - Window end

**Event Trigger:**
- `@triggerBody().fileName` - File name that triggered
- `@triggerBody().folderPath` - Folder path

### Best Practices for Triggers

✅ **Naming Convention**: Use descriptive names (e.g., `TR_DailySalesLoad`)

✅ **Time Zones**: Be aware of time zone settings

✅ **Parameters**: Pass dynamic values via trigger parameters

✅ **Concurrency**: Configure max concurrency appropriately

✅ **Retry Policy**: Configure retry policies for transient failures

✅ **Monitoring**: Monitor trigger executions

✅ **Error Handling**: Handle trigger failures gracefully

✅ **Documentation**: Document trigger schedules and purposes


## Parameters and Variables

**Parameters** and **Variables** make pipelines flexible, reusable, and dynamic.

### Parameters

**Parameters** are inputs passed to pipelines, datasets, or linked services from outside (triggers, manual execution, or parent pipelines).

#### Pipeline Parameters

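**Define Parameters:**
```json
{
  "parameters": {
    "sourceContainer": {
      "type": "String",
      "defaultValue": "raw-data"
    },
    "targetTable": {
      "type": "String"
    },
    "loadDate": {
      "type": "String"
    },
    "rowCount": {
      "type": "Int",
      "defaultValue": 1000
    },
    "isProduction": {
      "type": "Bool",
      "defaultValue": false
    }
  }
}
```

Parameter types are `String`, `Int`, `Float`, `Bool`, `Array`, `Object`, and `SecureString`. Default values are literals: an expression such as `@formatDateTime(utcnow(), 'yyyy-MM-dd')` placed in `defaultValue` is stored as plain text rather than evaluated, so computed values like `loadDate` should be passed in by the trigger or the calling pipeline.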

**Access Parameters:**
- In expressions: `@pipeline().parameters.sourceContainer`
- In activities: Reference in activity properties
- In datasets: Pass as dataset parameters

**Pass Parameters from Trigger:**
```json
{
  "pipelines": [
    {
      "pipelineReference": {
        "referenceName": "PL_LoadData"
      },
      "parameters": {
        "loadDate": "@formatDateTime(trigger().scheduledTime, 'yyyy-MM-dd')",
        "sourceContainer": "raw-data"
      }
    }
  ]
}
```

#### Dataset Parameters

**Define Dataset Parameters:**
```json
{
  "parameters": {
    "folderPath": {
      "type": "String"
    },
    "fileName": {
      "type": "String"
    }
  },
  "typeProperties": {
    "location": {
      "folderPath": {
        "value": "@dataset().folderPath",
        "type": "Expression"
      },
      "fileName": {
        "value": "@dataset().fileName",
        "type": "Expression"
      }
    }
  }
}
```

**Pass Parameters to Dataset:**
```json
{
  "inputs": [
    {
      "referenceName": "DS_SalesCSV",
      "type": "DatasetReference",
      "parameters": {
        "folderPath": "sales/2024/01",
        "fileName": "sales_20240101.csv"
      }
    }
  ]
}
```

### Variables

**Variables** are internal values used within pipelines for logic and calculations.

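**Define Variables:**
```json
{
  "variables": {
    "rowCount": {
      "type": "String",
      "defaultValue": "0"
    },
    "processDate": {
      "type": "String"
    },
    "fileList": {
      "type": "Array",
      "defaultValue": []
    }
  }
}
```

This example uses the long-standing `String`, `Boolean`, and `Array` variable types and stores the count as a string; convert it with `int(variables('rowCount'))` for arithmetic. Variable default values are literals, so initialize `processDate` with a Set Variable activity (for example `@formatDateTime(utcnow(), 'yyyy-MM-dd')`) rather than an expression in `defaultValue`.

**Set Variables:**
Use the `Set Variable` activity. The activity must depend on the activity whose output it reads:
```json
{
  "name": "SetRowCount",
  "type": "SetVariable",
  "dependsOn": [
    {
      "activity": "CopyData",
      "dependencyConditions": ["Succeeded"]
    }
  ],
  "typeProperties": {
    "variableName": "rowCount",
    "value": {
      "value": "@string(activity('CopyData').output.rowsCopied)",
      "type": "Expression"
    }
  }
}
```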

**Access Variables:**
- In expressions: `@variables('rowCount')`
- In activities: Reference in activity properties
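
Array variables such as `fileList` are typically filled with the Append Variable activity. The sketch below assumes it runs inside a ForEach loop over the `childItems` returned by a Get Metadata activity, so `@item().name` is the current file name; the activity name is hypothetical.

```json
{
  "name": "AppendFileName",
  "description": "Sketch only: assumes this activity sits inside a ForEach over Get Metadata childItems.",
  "type": "AppendVariable",
  "typeProperties": {
    "variableName": "fileList",
    "value": {
      "value": "@item().name",
      "type": "Expression"
    }
  }
}
```

After the loop completes, `@variables('fileList')` returns the collected names as an array.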

### Expressions

**Expressions** are used to create dynamic values using functions and operators.

#### Common Expression Functions

**String Functions:**
- `concat(str1, str2, ...)` - Concatenate strings
- `substring(str, start, length)` - Extract substring
- `toUpper(str)` - Convert to uppercase
- `toLower(str)` - Convert to lowercase
- `replace(str, old, new)` - Replace text

**Date/Time Functions:**
- `utcnow()` - Current UTC time
- `formatDateTime(timestamp, format)` - Format date
- `addDays(timestamp, days)` - Add days
- `addHours(timestamp, hours)` - Add hours
- `dayOfMonth(timestamp)` - Get day of the month
- `dayOfWeek(timestamp)` - Get day of the week
- `formatDateTime(timestamp, 'yyyy')` - Get year (use `'MM'` for month)

**Numeric Functions:**
- `add(value1, value2)` - Addition
- `sub(value1, value2)` - Subtraction
- `mul(value1, value2)` - Multiplication
- `div(value1, value2)` - Division
- `mod(value1, value2)` - Modulo

**Logical Functions:**
- `equals(value1, value2)` - Equality check
- `greater(value1, value2)` - Greater than
- `less(value1, value2)` - Less than
- `and(condition1, condition2)` - Logical AND
- `or(condition1, condition2)` - Logical OR
- `not(condition)` - Logical NOT
- `if(condition, trueValue, falseValue)` - Conditional

**Array Functions:**
- `length(array)` - Array length
- `first(array)` - First element
- `last(array)` - Last element
- `contains(array, value)` - Check if contains

**Activity Functions:**
- `activity('ActivityName').output` - Activity output
- `activity('ActivityName').error` - Activity error
- `activity('ActivityName').status` - Activity status

### Expression Examples

```json
// Dynamic file path with date
"folderPath": {
  "value": "@concat('sales/', formatDateTime(utcnow(), 'yyyy/MM/dd'))",
  "type": "Expression"
}

// Conditional value
"container": {
  "value": "@if(equals(pipeline().parameters.environment, 'prod'), 'prod-data', 'dev-data')",
  "type": "Expression"
}

// Calculate date range
"startDate": {
  "value": "@formatDateTime(addDays(utcnow(), -7), 'yyyy-MM-dd')",
  "type": "Expression"
}

// Activity output (rows written by a Copy activity)
"rowCount": {
  "value": "@activity('CopyData').output.rowsCopied",
  "type": "Expression"
}
```

### Best Practices for Parameters and Variables

✅ **Parameterize Everything**: Make paths, table names, and configurations parameterized

✅ **Default Values**: Provide default values for parameters when possible

✅ **Naming**: Use clear, descriptive names

✅ **Type Safety**: Use appropriate data types

✅ **Documentation**: Document parameter purposes and expected values

✅ **Validation**: Validate parameter values when possible

✅ **Reusability**: Design for reuse across environments

✅ **Expressions**: Use expressions for dynamic values


## Error Handling and Monitoring

Proper error handling and monitoring are essential for reliable data pipelines.

### Error Handling Strategies

#### 1. Activity Retry Policy

Configure retry for transient failures:

```json
{
  "name": "CopyData",
  "type": "Copy",
  "policy": {
    "timeout": "01:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 30
  }
}
```

**Retry Policy Properties:**
- `retry` - Number of retry attempts
- `retryIntervalInSeconds` - Wait time between retries
- `timeout` - Maximum execution time

#### 2. Activity Dependencies

Control flow based on success/failure:

```json
{
  "name": "ActivityB",
  "dependsOn": [
    {
      "activity": "ActivityA",
      "dependencyConditions": ["Succeeded"]
    }
  ]
}
```

**Dependency Conditions:**
- `Succeeded` - Continue only if succeeded
- `Failed` - Continue only if failed
- `Completed` - Continue regardless of status
- `Skipped` - Continue only if skipped

#### 3. If Condition Activity

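Implement conditional error handling. The If Condition must depend on `CopyData` with the `Completed` condition so that it runs after the copy finishes, whether it succeeded or failed:

```json
{
  "name": "HandleError",
  "type": "IfCondition",
  "dependsOn": [
    {
      "activity": "CopyData",
      "dependencyConditions": ["Completed"]
    }
  ],
  "typeProperties": {
    "expression": {
      "value": "@equals(activity('CopyData').status, 'Failed')",
      "type": "Expression"
    },
    "ifTrueActivities": [
      {
        "name": "SendAlert",
        "type": "WebActivity"
      }
    ]
  }
}
```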

#### 4. Try-Catch Pattern

Use activity dependencies to implement try-catch:

```
Try Activity
    ↓ (Succeeded)
Success Activity
    ↓
Continue Pipeline

Try Activity
    ↓ (Failed)
Catch Activity (Error Handling)
    ↓
Continue or Fail Pipeline
```
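
The two paths above could be wired up roughly as follows. This is a sketch under stated assumptions: the activity names, the `loadStatus` variable, and the alert URL are hypothetical, and the Copy activity's `inputs`, `outputs`, and `typeProperties` are omitted for brevity.

```json
{
  "activities": [
    {
      "name": "CopyData",
      "description": "Try step; inputs, outputs, and typeProperties omitted in this sketch.",
      "type": "Copy"
    },
    {
      "name": "OnSuccess",
      "description": "Success path. Assumes a String variable named loadStatus.",
      "type": "SetVariable",
      "dependsOn": [
        { "activity": "CopyData", "dependencyConditions": ["Succeeded"] }
      ],
      "typeProperties": {
        "variableName": "loadStatus",
        "value": "Succeeded"
      }
    },
    {
      "name": "OnFailure",
      "description": "Catch path. The URL is a placeholder for a hypothetical alert endpoint.",
      "type": "WebActivity",
      "dependsOn": [
        { "activity": "CopyData", "dependencyConditions": ["Failed"] }
      ],
      "typeProperties": {
        "url": "https://example.com/alerts",
        "method": "POST",
        "body": {
          "value": "@concat('CopyData failed: ', activity('CopyData').error.message)",
          "type": "Expression"
        }
      }
    },
    {
      "name": "FailPipeline",
      "description": "Optional rethrow so the run is still reported as failed.",
      "type": "Fail",
      "dependsOn": [
        { "activity": "OnFailure", "dependencyConditions": ["Completed"] }
      ],
      "typeProperties": {
        "message": "CopyData failed; see the alert for details.",
        "errorCode": "COPY_FAILED"
      }
    }
  ]
}
```

Without the final Fail activity, a pipeline whose catch path succeeds is reported as succeeded, which can hide the original failure from monitoring.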

#### 5. Set Variable for Error Tracking

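Track errors in variables. The activity depends on `CopyData` with the `Failed` condition so that it runs only when there is an error to record:

```json
{
  "name": "SetErrorVariable",
  "type": "SetVariable",
  "dependsOn": [
    {
      "activity": "CopyData",
      "dependencyConditions": ["Failed"]
    }
  ],
  "typeProperties": {
    "variableName": "errorMessage",
    "value": {
      "value": "@activity('CopyData').error.message",
      "type": "Expression"
    }
  }
}
```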

### Monitoring

#### Monitor Hub in ADF Studio

**Pipeline Runs:**
- View all pipeline executions
- Filter by status, time range, pipeline name
- See execution details and duration
- View activity-level details

**Activity Runs:**
- View individual activity executions
- See input/output data
- View error messages and stack traces
- Monitor performance metrics

**Trigger Runs:**
- View trigger executions
- See trigger status and timing
- Monitor trigger failures

#### Key Metrics to Monitor

**Pipeline Metrics:**
- Success rate
- Execution duration
- Failure rate
- Average execution time

**Activity Metrics:**
- Data volume processed
- Rows copied/transformed
- Execution time
- Throughput (rows/second)

**Integration Runtime Metrics:**
- CPU usage
- Memory usage
- Network throughput
- Queue length

#### Alerts and Notifications

**Azure Monitor Integration:**
- Create alerts for pipeline failures
- Set up email notifications
- Configure webhook notifications
- Monitor costs

**Alert Conditions:**
- Pipeline failure
- Activity failure
- Execution time threshold
- Data volume threshold

#### Logging

**Activity Logging:**
- Enable logging for debugging
- Log input/output data
- Log variable values
- Log custom messages

**Diagnostic Settings:**
- Enable diagnostic logs
- Send logs to Log Analytics
- Archive logs to storage
- Stream logs to Event Hub

### Best Practices for Error Handling

✅ **Retry Policy**: Configure retry for transient failures

✅ **Timeout**: Set appropriate timeouts

✅ **Error Handling**: Implement comprehensive error handling

✅ **Logging**: Enable logging for debugging

✅ **Alerts**: Set up alerts for failures

✅ **Monitoring**: Regularly monitor pipeline health

✅ **Documentation**: Document error handling strategies

✅ **Testing**: Test error scenarios

✅ **Notifications**: Configure notifications for critical failures

✅ **Recovery**: Plan for data recovery after failures


## Common Patterns and Use Cases

Understanding common patterns helps you design effective data pipelines.

### Pattern 1: Incremental Load

**Scenario**: Load only new or changed data since last run

**Approach:**
1. Use watermark column (last modified date, ID)
2. Store last watermark value
3. Query source for records > watermark
4. Update watermark after successful load

**Components:**
- Lookup activity to get watermark
- Copy activity with filtered query
- Stored procedure to update watermark

**Example:**
```
Lookup (Get LastLoadDate)
    ↓
Copy (SELECT * FROM Sales WHERE SaleDate > @LastLoadDate)
    ↓
Stored Procedure (Update LastLoadDate)
```
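
A minimal sketch of this flow in pipeline JSON, assuming an Azure SQL source. The watermark table, stored procedure, dataset, and linked service names are hypothetical. The sketch also adds an upper bound (`pipeline().TriggerTime`) to the query, which is an assumption beyond the diagram: it keeps the copied range and the stored watermark consistent.

```json
{
  "activities": [
    {
      "name": "LookupWatermark",
      "description": "Sketch only: dbo.WatermarkTable and DS_Watermark are hypothetical.",
      "type": "Lookup",
      "typeProperties": {
        "source": {
          "type": "AzureSqlSource",
          "sqlReaderQuery": "SELECT LastLoadDate FROM dbo.WatermarkTable WHERE TableName = 'Sales'"
        },
        "dataset": { "referenceName": "DS_Watermark", "type": "DatasetReference" },
        "firstRowOnly": true
      }
    },
    {
      "name": "CopyNewSales",
      "type": "Copy",
      "dependsOn": [
        { "activity": "LookupWatermark", "dependencyConditions": ["Succeeded"] }
      ],
      "inputs": [ { "referenceName": "DS_SalesSource", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "DS_SalesTarget", "type": "DatasetReference" } ],
      "typeProperties": {
        "source": {
          "type": "AzureSqlSource",
          "sqlReaderQuery": {
            "value": "SELECT * FROM Sales WHERE SaleDate > '@{activity('LookupWatermark').output.firstRow.LastLoadDate}' AND SaleDate <= '@{pipeline().TriggerTime}'",
            "type": "Expression"
          }
        },
        "sink": { "type": "AzureSqlSink" }
      }
    },
    {
      "name": "UpdateWatermark",
      "description": "Sketch only: dbo.usp_UpdateWatermark and LS_AzureSql are hypothetical.",
      "type": "SqlServerStoredProcedure",
      "dependsOn": [
        { "activity": "CopyNewSales", "dependencyConditions": ["Succeeded"] }
      ],
      "linkedServiceName": { "referenceName": "LS_AzureSql", "type": "LinkedServiceReference" },
      "typeProperties": {
        "storedProcedureName": "dbo.usp_UpdateWatermark",
        "storedProcedureParameters": {
          "LastLoadDate": {
            "value": { "value": "@pipeline().TriggerTime", "type": "Expression" },
            "type": "DateTime"
          }
        }
      }
    }
  ]
}
```

Because the watermark is updated only after the copy succeeds, a failed run leaves the old watermark in place and the next run retries the same range.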

### Pattern 2: File Processing Loop

**Scenario**: Process multiple files in a folder

**Approach:**
1. Get list of files using Get Metadata
2. Use ForEach activity to loop through files
3. Process each file in parallel or sequentially

**Components:**
- Get Metadata activity
- ForEach activity
- Copy or Data Flow activity

**Example:**
```
Get Metadata (List Files)
    ↓
ForEach (Loop through files)
    ├── Copy File
    └── Archive File
```
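
The loop could be sketched as below. `DS_InputFolder` and `DS_ArchiveCSV` are hypothetical datasets, `DS_SalesCSV` is assumed to be the parameterized dataset shown earlier in this guide, and the folder path is a placeholder; only the copy step is shown.

```json
{
  "activities": [
    {
      "name": "ListFiles",
      "description": "Sketch only: DS_InputFolder is a hypothetical dataset pointing at the folder.",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "DS_InputFolder", "type": "DatasetReference" },
        "fieldList": ["childItems"]
      }
    },
    {
      "name": "ForEachFile",
      "type": "ForEach",
      "dependsOn": [
        { "activity": "ListFiles", "dependencyConditions": ["Succeeded"] }
      ],
      "typeProperties": {
        "items": {
          "value": "@activity('ListFiles').output.childItems",
          "type": "Expression"
        },
        "isSequential": false,
        "batchCount": 10,
        "activities": [
          {
            "name": "CopyFile",
            "type": "Copy",
            "inputs": [
              {
                "referenceName": "DS_SalesCSV",
                "type": "DatasetReference",
                "parameters": {
                  "folderPath": "sales/incoming",
                  "fileName": { "value": "@item().name", "type": "Expression" }
                }
              }
            ],
            "outputs": [
              { "referenceName": "DS_ArchiveCSV", "type": "DatasetReference" }
            ],
            "typeProperties": {
              "source": { "type": "DelimitedTextSource" },
              "sink": { "type": "DelimitedTextSink" }
            }
          }
        ]
      }
    }
  ]
}
```

Setting `isSequential` to `true` processes files one at a time; with `false`, `batchCount` caps how many iterations run concurrently.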

### Pattern 3: Data Validation

**Scenario**: Validate data before loading

**Approach:**
1. Load data to staging
2. Validate data quality
3. Load to target only if validation passes

**Components:**
- Copy to staging
- Validation activities
- If Condition for validation result
- Copy to target or error handling

**Example:**
```
Copy to Staging
    ↓
Validation (Check row count, nulls, etc.)
    ↓
If Condition (Validation passed?)
    ├── Yes → Copy to Target
    └── No → Send Alert & Log Error
```
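
One way to express the validation gate, sketched with a single null check. The staging table, column, dataset, and child pipeline names are hypothetical, and the load to target is delegated to a child pipeline to keep the sketch short.

```json
{
  "activities": [
    {
      "name": "CountInvalidRows",
      "description": "Sketch only: stg.Sales, CustomerId, and DS_Staging are hypothetical.",
      "type": "Lookup",
      "typeProperties": {
        "source": {
          "type": "AzureSqlSource",
          "sqlReaderQuery": "SELECT COUNT(*) AS InvalidRows FROM stg.Sales WHERE CustomerId IS NULL"
        },
        "dataset": { "referenceName": "DS_Staging", "type": "DatasetReference" },
        "firstRowOnly": true
      }
    },
    {
      "name": "IfValidationPassed",
      "type": "IfCondition",
      "dependsOn": [
        { "activity": "CountInvalidRows", "dependencyConditions": ["Succeeded"] }
      ],
      "typeProperties": {
        "expression": {
          "value": "@equals(activity('CountInvalidRows').output.firstRow.InvalidRows, 0)",
          "type": "Expression"
        },
        "ifTrueActivities": [
          {
            "name": "LoadTarget",
            "description": "PL_LoadTarget is a hypothetical child pipeline.",
            "type": "ExecutePipeline",
            "typeProperties": {
              "pipeline": { "referenceName": "PL_LoadTarget", "type": "PipelineReference" },
              "waitOnCompletion": true
            }
          }
        ],
        "ifFalseActivities": [
          {
            "name": "FailValidation",
            "type": "Fail",
            "typeProperties": {
              "message": "Staging data failed validation: null CustomerId values found.",
              "errorCode": "VALIDATION_FAILED"
            }
          }
        ]
      }
    }
  ]
}
```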

### Pattern 4: Slowly Changing Dimension (SCD)

**Scenario**: Handle dimension table updates (Type 2 SCD)

**Approach:**
1. Compare source with target
2. Identify new, changed, and unchanged records
3. Insert new records
4. Expire changed records (set the end date on the old row), then insert the new version

**Components:**
- Data Flow for comparison
- Stored procedures for SCD logic
- Multiple copy activities

### Pattern 5: Data Lake to Data Warehouse

**Scenario**: Load data from Data Lake to Data Warehouse

**Approach:**
1. Read from Data Lake (Parquet/CSV)
2. Transform data (Data Flow)
3. Load to staging table
4. Merge to final table

**Components:**
- Copy or Data Flow from Data Lake
- Data Flow for transformation
- Copy to staging
- Stored procedure for merge

**Example:**
```
Data Flow (Read from Data Lake & Transform)
    ↓
Copy to Staging Table
    ↓
Stored Procedure (Merge to Final Table)
```

### Pattern 6: Parallel Processing

**Scenario**: Process multiple data sources in parallel

**Approach:**
1. Create multiple parallel branches
2. No dependencies between branches
3. All branches execute simultaneously

**Components:**
- Multiple activities with no dependencies
- Or use ForEach with parallel execution

**Example:**
```
Pipeline Start
    ├── Copy Source1 → Target1
    ├── Copy Source2 → Target2
    └── Copy Source3 → Target3
    ↓
All Complete → Final Activity
```

### Pattern 7: Conditional Execution

**Scenario**: Execute activities based on conditions

**Approach:**
1. Use Lookup to get configuration
2. Use If Condition based on lookup result
3. Execute different paths

**Components:**
- Lookup activity
- If Condition activity
- Conditional activities

**Example:**
```
Lookup (Get Config: ProcessType)
    ↓
If Condition (ProcessType == 'Full')
    ├── Yes → Full Load
    └── No → Incremental Load
```

### Pattern 8: Wait for File

**Scenario**: Wait for file to arrive before processing

**Approach:**
1. Use Until activity
2. Check for file existence
3. Wait if file doesn't exist
4. Process when file arrives

**Components:**
- Until activity
- Get Metadata activity
- Wait activity

**Example:**
```
Until (File Exists)
    ├── Get Metadata (Check File)
    └── Wait (30 seconds)
    ↓
File Found → Process File
```
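
A sketch of the polling loop in JSON. `DS_ExpectedFile` is a hypothetical dataset pointing at the file being waited for, and the one-hour `timeout` is an arbitrary choice for illustration.

```json
{
  "name": "WaitForFile",
  "description": "Sketch only: DS_ExpectedFile is hypothetical; the timeout value is illustrative.",
  "type": "Until",
  "typeProperties": {
    "expression": {
      "value": "@equals(activity('CheckFile').output.exists, true)",
      "type": "Expression"
    },
    "timeout": "0.01:00:00",
    "activities": [
      {
        "name": "CheckFile",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "DS_ExpectedFile", "type": "DatasetReference" },
          "fieldList": ["exists"]
        }
      },
      {
        "name": "Wait30Seconds",
        "type": "Wait",
        "dependsOn": [
          { "activity": "CheckFile", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {
          "waitTimeInSeconds": 30
        }
      }
    ]
  }
}
```

The Until expression is evaluated after each pass through the inner activities, so this simple version waits one final 30 seconds after the file is found. Setting a `timeout` prevents the loop from polling indefinitely if the file never arrives.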

### Pattern 9: Data Quality Checks

**Scenario**: Ensure data quality before processing

**Approach:**
1. Load to staging
2. Run data quality checks
3. Generate quality report
4. Proceed or fail based on results

**Components:**
- Data Flow for quality checks
- Validation activities
- Reporting activities

### Pattern 10: Master Pipeline

**Scenario**: Orchestrate multiple pipelines

**Approach:**
1. Create master pipeline
2. Execute child pipelines in sequence or parallel
3. Handle errors from child pipelines

**Components:**
- Execute Pipeline activity
- Activity dependencies
- Error handling

**Example:**
```
Master Pipeline
    ├── Execute Pipeline (Load Customers)
    ├── Execute Pipeline (Load Products)
    └── Execute Pipeline (Load Sales)
    ↓
All Complete → Final Processing
```
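
The diagram could translate to a master pipeline along these lines. All pipeline names are hypothetical; the three loads have no dependencies on each other, so they start in parallel, and the final step waits for all of them.

```json
{
  "name": "PL_Master",
  "properties": {
    "description": "Sketch only: all pipeline names are hypothetical.",
    "activities": [
      {
        "name": "LoadCustomers",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "PL_LoadCustomers", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "LoadProducts",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "PL_LoadProducts", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "LoadSales",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "PL_LoadSales", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "FinalProcessing",
        "type": "ExecutePipeline",
        "dependsOn": [
          { "activity": "LoadCustomers", "dependencyConditions": ["Succeeded"] },
          { "activity": "LoadProducts", "dependencyConditions": ["Succeeded"] },
          { "activity": "LoadSales", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {
          "pipeline": { "referenceName": "PL_FinalProcessing", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      }
    ]
  }
}
```

With `waitOnCompletion` set to `true`, each Execute Pipeline activity reflects the child run's outcome, so a failed child stops `FinalProcessing` from starting.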

### Best Practices for Patterns

✅ **Reusability**: Create reusable patterns as templates

✅ **Documentation**: Document pattern purposes and usage

✅ **Error Handling**: Include error handling in patterns

✅ **Parameterization**: Make patterns parameterized

✅ **Testing**: Test patterns thoroughly

✅ **Performance**: Optimize patterns for performance

✅ **Monitoring**: Add monitoring to patterns


## Best Practices Summary

Following best practices ensures reliable, maintainable, and performant data pipelines.

### Design Best Practices

✅ **Single Responsibility**: Each pipeline should have one clear purpose

✅ **Modularity**: Break complex pipelines into smaller, reusable pipelines

✅ **Parameterization**: Parameterize everything (paths, table names, configurations)

✅ **Naming Conventions**: Use consistent naming (prefixes: LS_, DS_, PL_, TR_)

✅ **Documentation**: Add descriptions and annotations to all components

✅ **Version Control**: Use Git integration for version control

✅ **Environment Separation**: Separate dev, test, and prod environments

### Performance Best Practices

✅ **Parallel Copy**: Enable parallel copy for large data volumes

✅ **Staging**: Use staging for better performance in database copies

✅ **Partitioning**: Use partitioned datasets for large files

✅ **Filter at Source**: Filter data at source to reduce data volume

✅ **Compression**: Use compression for file transfers

✅ **Data Flow Optimization**: Optimize Data Flows (partitioning, caching)

✅ **Integration Runtime**: Choose appropriate IR and scale appropriately

✅ **Batch Size**: Configure appropriate batch sizes

### Security Best Practices

✅ **Azure Key Vault**: Store all secrets in Azure Key Vault

✅ **Managed Identity**: Use Managed Identity when possible

✅ **Least Privilege**: Grant minimum required permissions

✅ **Network Security**: Use private endpoints and VNet integration

✅ **Audit Logging**: Enable audit logs and monitoring

✅ **Credential Rotation**: Regularly rotate credentials

✅ **Access Control**: Use RBAC for access control

### Error Handling Best Practices

✅ **Retry Policy**: Configure retry for transient failures

✅ **Timeout**: Set appropriate timeouts

✅ **Error Handling**: Implement comprehensive error handling

✅ **Alerts**: Set up alerts for failures

✅ **Logging**: Enable logging for debugging

✅ **Notifications**: Configure notifications for critical failures

✅ **Recovery**: Plan for data recovery after failures

### Monitoring Best Practices

✅ **Regular Monitoring**: Monitor pipeline health regularly

✅ **Metrics**: Track key metrics (success rate, duration, throughput)

✅ **Alerts**: Set up proactive alerts

✅ **Dashboards**: Create monitoring dashboards

✅ **Cost Monitoring**: Monitor and optimize costs

✅ **Performance Monitoring**: Track and optimize performance

### Development Best Practices

✅ **Testing**: Test pipelines thoroughly before production

✅ **Debug Mode**: Use debug mode for Data Flows

✅ **Sample Data**: Use sample data during development

✅ **Incremental Development**: Build pipelines incrementally

✅ **Code Review**: Review pipeline designs and code

✅ **Documentation**: Maintain up-to-date documentation

### Cost Optimization Best Practices

✅ **Right-Sizing**: Use appropriate compute sizes

✅ **Scheduling**: Schedule pipelines during off-peak hours when possible

✅ **Data Volume**: Minimize unnecessary data movement

✅ **Caching**: Use caching in Data Flows

✅ **Auto-Pause**: Configure auto-pause for compute resources

✅ **Monitoring**: Monitor and optimize costs regularly

### Maintenance Best Practices

✅ **Regular Updates**: Keep components updated

✅ **Cleanup**: Remove unused pipelines, datasets, and linked services

✅ **Documentation**: Keep documentation updated

✅ **Review**: Regularly review and optimize pipelines

✅ **Backup**: Backup pipeline definitions

✅ **Disaster Recovery**: Plan for disaster recovery


## ADF Studio Navigation Guide

When you open ADF Studio, here's what you'll see and how to navigate:

### Main Tabs

#### 1. Author Tab
**Purpose**: Design and create pipelines, datasets, linked services, and data flows

**Left Pane - Factory Resources:**
- **Pipelines**: Create and manage pipelines
- **Datasets**: Create and manage datasets
- **Data flows**: Create and manage data flows
- **Power Query**: Create Power Query data flows

In current versions of ADF Studio, linked services, integration runtimes, and triggers are created and managed from the Manage tab rather than this pane.

**Canvas Area:**
- Visual pipeline designer
- Drag-and-drop activities
- Configure activity properties
- Set up dependencies

**Properties Pane:**
- Configure component properties
- Set parameters
- Add annotations

#### 2. Monitor Tab
**Purpose**: Monitor pipeline runs, activity executions, and trigger runs

**Views:**
- **Pipeline runs**: All pipeline executions
- **Trigger runs**: All trigger executions
- **Integration runtime**: IR status and metrics
- **Data flow debug sessions**: Active debug sessions

**Filters:**
- Filter by status (Succeeded, Failed, In Progress)
- Filter by time range
- Filter by pipeline/trigger name

**Details:**
- View execution details
- See activity-level information
- View input/output data
- Check error messages

#### 3. Manage Tab
**Purpose**: Manage factory settings, Git configuration, and factory resources

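**Sections:**
- **Linked services**: Create and manage linked services
- **Integration runtimes**: Manage integration runtimes
- **Triggers**: Create and manage triggers
- **Git configuration**: Connect to Git repository
- **Global parameters**: Define factory-level parameters
- **Managed private endpoints**: Manage private endpoints
- **Customer-managed keys**: Configure encryption keys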

#### 4. Gallery Tab
**Purpose**: Browse templates and samples

**Content:**
- Pipeline templates
- Data flow templates
- Sample pipelines
- Quick start guides

### Key Actions in ADF Studio

#### Creating a Pipeline
1. Go to Author tab
2. Click "+" next to Pipelines
3. Name your pipeline
4. Drag activities to canvas
5. Configure activities
6. Set dependencies
7. Publish

#### Creating a Linked Service
1. Go to Manage tab
2. Select Linked services, then click "+ New"
3. Choose connector type
4. Configure connection details
5. Test connection
6. Create

#### Creating a Dataset
1. Go to Author tab
2. Click "+" next to Datasets
3. Choose data store type
4. Select linked service
5. Configure data structure
6. Create

#### Creating a Data Flow
1. Go to Author tab
2. Click "+" next to Data flows
3. Add source transformation
4. Add transformation steps
5. Add sink transformation
6. Configure transformations

#### Creating a Trigger
1. Go to Manage tab
2. Select Triggers, then click "+ New"
3. Choose trigger type
4. Configure schedule/event
5. Attach pipelines
6. Start trigger

#### Monitoring Pipeline Runs
1. Go to Monitor tab
2. Select Pipeline runs
3. Filter as needed
4. Click on run to see details
5. View activity runs
6. Check logs and errors

### Tips for Using ADF Studio

💡 **Use Search**: Search for pipelines, datasets, or linked services

💡 **Use Templates**: Start with templates from Gallery

💡 **Debug Mode**: Use debug mode for Data Flows

💡 **Validate**: Always validate before publishing

💡 **Test Connections**: Test linked service connections

💡 **Use Expressions**: Use expression builder for dynamic values

💡 **Keyboard Shortcuts**: Learn keyboard shortcuts for efficiency

💡 **Auto-save**: Enable auto-save for drafts

💡 **Version History**: Use Git for version history

💡 **Export/Import**: Export pipelines for backup or sharing


## Summary

This comprehensive guide has covered:

✅ **Azure Data Factory Overview**: What ADF is and its key features

✅ **Architecture**: Components and their relationships

✅ **Linked Services**: Connection definitions to data sources and compute

✅ **Datasets**: Data structure definitions and locations

✅ **Pipelines**: Workflow orchestration and activity coordination

✅ **Activities**: Various activity types (Copy, Data Flow, Control Flow, etc.)

✅ **Data Flows**: Visual data transformations on Spark

✅ **Integration Runtimes**: Compute infrastructure (Azure, Self-Hosted, SSIS)

✅ **Triggers**: Scheduling and event-based execution

✅ **Parameters & Variables**: Making pipelines dynamic and reusable

✅ **Error Handling**: Retry policies, dependencies, and error management

✅ **Monitoring**: Tracking pipeline health and performance

✅ **Common Patterns**: Real-world use cases and solutions

✅ **Best Practices**: Design, performance, security, and maintenance guidelines

✅ **ADF Studio Navigation**: How to use the ADF Studio interface

### Key Takeaways

1. **ADF is Serverless**: No infrastructure to manage, scales automatically
2. **Visual Design**: Build pipelines using drag-and-drop interface
3. **90+ Connectors**: Connect to various data sources out of the box
4. **Code-Free ETL**: Build data pipelines without writing code
5. **Hybrid Integration**: Connect to both cloud and on-premises data sources
6. **Parameterization**: Make pipelines flexible and reusable
7. **Monitoring**: Built-in monitoring and alerting capabilities
8. **Best Practices**: Follow best practices for reliable pipelines

### Component Hierarchy Recap

```
Azure Data Factory
│
├── Linked Services (Connections)
│   ├── Data Store Linked Services
│   └── Compute Linked Services
│
├── Datasets (Data Definitions)
│   ├── Source Datasets
│   └── Sink Datasets
│
├── Pipelines (Workflows)
│   ├── Activities
│   │   ├── Data Movement (Copy)
│   │   ├── Data Transformation (Data Flow, Stored Procedure)
│   │   └── Control Flow (If, ForEach, Wait, Until)
│   ├── Parameters
│   └── Variables
│
├── Data Flows (Transformations)
│   ├── Source Transformations
│   ├── Transform Steps
│   └── Sink Transformations
│
├── Integration Runtimes (Compute)
│   ├── Azure IR
│   ├── Self-Hosted IR
│   └── Azure-SSIS IR
│
└── Triggers (Scheduling)
    ├── Schedule Triggers
    ├── Tumbling Window Triggers
    └── Event-Based Triggers
```

### Next Steps

Now that you understand the concepts:

1. **Open ADF Studio**: Navigate to your Azure Data Factory instance
2. **Explore the Interface**: Familiarize yourself with Author, Monitor, and Manage tabs

### Additional Resources

- **Azure Data Factory Documentation**: Official Microsoft documentation
- **ADF Templates**: Browse templates in the Gallery tab
- **Azure Data Factory Blog**: Latest updates and best practices
- **Community Forums**: Get help from the community
- **Training Modules**: Microsoft Learn modules on ADF

---

The best way to learn Azure Data Factory is by doing. Use this guide as a reference, then practice in ADF Studio to build real pipelines!
