DataverseToSql

Note

This solution is in maintenance mode. Please consider alternatives, such as Copy Dataverse data into Azure SQL.

DataverseToSql is a tool for the incremental transfer of data between Azure Synapse Link for Dataverse and Azure SQL Database.

  • Data is transferred incrementally between the container populated by Synapse Link for Dataverse and an Azure SQL Database; this minimizes the amount of data to process while updating the target database.
  • DataverseToSql reads the near real-time data produced by Azure Synapse Link to minimize latency.
  • Schema changes are propagated automatically.

Why?

The goals of DataverseToSql are:

  • Reducing the replication latency between the Dataverse container and Azure SQL Database, compared to other solutions.
  • Automating data ingestion pipeline and schema evolution.

How it works

The core functionality of DataverseToSql is implemented by an Azure Function that extracts data incrementally from the tail of append-only tables using the Azure Blob Storage API; the copy is lightweight and is handled entirely by the storage account. DataverseToSql keeps track of the reading position within each blob to identify newly appended data to be copied.

DataverseToSql relies on the near real-time data produced by Synapse Link for Dataverse, rather than the hourly snapshots used by other solutions, in order to reduce replication latency.

The mechanism implemented by DataverseToSql is an alternative to the native incremental update feature that is currently in preview. Once that feature becomes generally available, it should be considered the preferred solution to extract data incrementally.

The Azure Function copies new data to blobs that can then be consumed by any tool capable of reading CSV files. DataverseToSql provides an Azure Synapse pipeline with a Copy activity that performs an upsert to Azure SQL Database. The pipeline relies on Serverless SQL pool to read and deduplicate data from the incremental blobs.

To better support the core functionality, DataverseToSql automates the deployment of the required Synapse artifacts (pipeline, datasets, linked services) and updates the schema of the target database when the source schema in Dataverse changes (e.g., when a new column is added).

Architecture

Definitions

  • DataverseToSql environment - (or simply environment) the collection of services and metadata that support the incremental copy of data between the Dataverse container and Azure SQL Database.
  • Dataverse container - the Azure Blob Storage container, created and maintained by Synapse Link for Dataverse, where Dataverse data and metadata first land. It is the source of data for DataverseToSql.
  • Incremental blobs - the blobs created by DataverseToSql that contain the data extracted incrementally from the blobs in the Dataverse container.

Prerequisites

  • A source Dataverse environment.
  • An Azure subscription in the same Azure AD tenant as the Dataverse environment.
  • An Azure AD user ("dv2sql user" from now on) in the same Azure AD tenant, with the following permissions:
    • Create and manage Synapse Link for Dataverse in the Dataverse environment.
    • Deploy and own resources in the Azure subscription.
    • Create artifacts in Azure Synapse.
  • An environment that supports .NET 6 to:
    • Build and deploy the code.
    • Configure and deploy the DataverseToSql environment.

Setup

Azure Region

All the Azure resources must be deployed to the same region as the Dataverse environment.

Azure storage account

Provision an Azure storage account with Azure Data Lake Storage Gen2.

You can opt to create a storage account as part of the creation of the Azure Synapse workspace.
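For example, the storage account can be provisioned with the Azure CLI; this is a minimal sketch with placeholder names, not part of the official setup scripts.

# Create an ADLS Gen2 account (hierarchical namespace enabled); names, SKU and region are placeholders
az storage account create --name <storage_account_name> --resource-group <resource_group> --location <region> --sku Standard_LRS --kind StorageV2 --hns true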

Assign storage permissions to the dv2sql user

Assign the Storage Blob Data Contributor role on the storage account to the dv2sql user. See Assign an Azure role for access to blob data.

Configure the storage firewall to allow network access from all networks

To enable access to the storage account by Synapse Link for Dataverse and the Azure Function App, configure the firewall of the storage account to allow network access from all networks. See Configure Azure Storage firewalls and virtual networks.
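For example, with the Azure CLI (placeholder names; review the setting against your own network requirements):

# Allow network access to the storage account from all networks
az storage account update --name <storage_account_name> --resource-group <resource_group> --default-action Allow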

Storage containers

Create two containers inside the storage account; the naming is not important.

The access level of the containers must be private.

The two containers are respectively for storing:

  • Incremental blobs.
  • DataverseToSql configuration files.
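For example, the two containers can be created with the Azure CLI; the container names below are placeholders.

# Container for incremental blobs
az storage container create --account-name <storage_account_name> --name incremental --auth-mode login
# Container for DataverseToSql configuration files
az storage container create --account-name <storage_account_name> --name dataverse-to-sql --auth-mode login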

Azure Synapse Analytics workspace

Provision an Azure Synapse workspace.
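As an example, the workspace can be provisioned with the Azure CLI; this is a sketch with placeholder values.

# Create the Synapse workspace; the storage account and file system are the workspace's primary ADLS Gen2 account and container
az synapse workspace create --name <workspace_name> --resource-group <resource_group> --location <region> --storage-account <storage_account_name> --file-system <workspace_filesystem> --sql-admin-login-user <admin_user> --sql-admin-login-password <admin_password>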

Azure Synapse Apache Spark pool

Create a Spark pool. See Quickstart: Create a serverless Apache Spark pool using Synapse Studio.

A recommended starting configuration is Memory Optimized, Large node size with autoscale. Make sure to enable idle timeout to automatically pause the pool.

Spark version must be 3.3.
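For example, a pool matching the recommended configuration can be created with the Azure CLI; this is a sketch with placeholder names, and sizing should be adjusted to your workload.

# Memory Optimized, Large nodes, autoscale and auto-pause, Spark 3.3
az synapse spark pool create --name <spark_pool_name> --workspace-name <workspace_name> --resource-group <resource_group> --spark-version 3.3 --node-size-family MemoryOptimized --node-size Large --node-count 3 --enable-auto-scale true --min-node-count 3 --max-node-count 10 --enable-auto-pause true --delay 15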

Install required packages

The Spark notebooks require two additional libraries. They can be installed as workspace packages as described in Workspace packages.

Install the Apache Spark connector for SQL Server (Apache Spark connector: SQL Server & Azure SQL). Make sure to install version 1.3.0, which is compatible with Spark 3.3.

Install the Microsoft JDBC Driver for SQL Server, version 8.4. See the available downloads in Release notes for the Microsoft JDBC Driver for SQL Server.

Assign storage permissions to the Synapse workspace

Assign the "Storage Blob Data Contributor" on the storage account to the Synapse workspace. See Assign an Azure role for access to blob data.

Synapse Link for Dataverse

Set up Synapse Link for Dataverse to replicate tables to the Azure storage account and, optionally, connect it to the Azure Synapse workspace. While connecting Synapse Link for Dataverse to the Azure Synapse workspace is not strictly necessary for DataverseToSql, it is recommended because it enables additional analytics capabilities that you are encouraged to explore.

Note The Azure Synapse workspace itself is required, whether or not you decide to connect it to Synapse Link for Dataverse.

Important: The tables must be configured in append-only mode. See Advanced Configuration Options in Azure Synapse Link for more details.

NOTE: The storage account must be the same one that contains the container for incremental data.

To set up Synapse Link for Dataverse connected to Azure Synapse, follow Create an Azure Synapse Link for Dataverse with your Azure Synapse Workspace.

To set up Synapse Link for Dataverse connected to the storage account only, follow Create an Azure Synapse Link for Dataverse with Azure Data Lake.

Azure SQL Database

Deploy an Azure SQL Database (single database).
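For example, with the Azure CLI; this is a minimal sketch in which names, credentials, and service objective are placeholders.

# Create the logical server and a single database
az sql server create --name <sql_server_name> --resource-group <resource_group> --location <region> --admin-user <admin_login> --admin-password <admin_password>
az sql db create --name <database_name> --server <sql_server_name> --resource-group <resource_group> --service-objective S1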

Configure Azure SQL for Azure AD authentication

Configure the Azure SQL logical server for Azure AD authentication. See Configure and manage Azure AD authentication with Azure SQL.

Configure the Azure SQL firewall

Configure the Azure SQL firewall to allow access by

  • Azure services
  • The environment where the DataverseToSql CLI (dv2sql) is executed

See Azure SQL Database and Azure Synapse IP firewall rules.
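For example, with the Azure CLI (placeholder names; the 0.0.0.0 rule allows access from Azure services):

# Allow access by Azure services
az sql server firewall-rule create --resource-group <resource_group> --server <sql_server_name> --name AllowAzureServices --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
# Allow the machine that runs the DataverseToSql CLI (replace with its public IP)
az sql server firewall-rule create --resource-group <resource_group> --server <sql_server_name> --name AllowDv2sqlClient --start-ip-address <client_public_ip> --end-ip-address <client_public_ip>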

Assign db_owner role to the dv2sql user inside the Azure SQL Database

Create a user as described in Create contained users mapped to Azure AD identities.

The CREATE USER command must be executed by the Azure AD administrator.

CREATE USER [<dv2sql_user_name>] FROM EXTERNAL PROVIDER;

Assign the db_owner role using the ALTER ROLE statement.

ALTER ROLE db_owner ADD MEMBER [<dv2sql_user_name>];

Assign db_owner role to the Synapse Workspace identity inside the Azure SQL Database

Create a user as described in Create contained users mapped to Azure AD identities.

The CREATE USER command must be executed by the Azure AD administrator.

CREATE USER [<synapse_workspace_name>] FROM EXTERNAL PROVIDER;

Assign the db_owner role using the ALTER ROLE statement.

ALTER ROLE db_owner ADD MEMBER [<synapse_workspace_name>];

Azure Function App

Deploy an Azure Function App.

The Function App must be configured to use the .NET 6 runtime.

Configure the function to use Application Insights for monitoring.
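For example, with the Azure CLI; this is a sketch with placeholder names, using a consumption plan and an existing Application Insights resource (Functions runtime version 4 hosts the .NET 6 runtime).

# Create the Function App on a consumption plan with Application Insights monitoring
az functionapp create --name <function_app_name> --resource-group <resource_group> --storage-account <functions_storage_account> --consumption-plan-location <region> --functions-version 4 --runtime dotnet --app-insights <app_insights_name>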

Azure Function App managed identity

Assign a managed identity (either system-assigned or user-assigned) to the Azure Function App. See How to use managed identities for App Service and Azure Functions.
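For example, a system-assigned identity can be enabled with the Azure CLI (placeholder names):

# Enable a system-assigned managed identity on the Function App
az functionapp identity assign --name <function_app_name> --resource-group <resource_group>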

Assign storage permissions to the Azure Function App managed identity

Assign the "Storage Blob Data Contributor" on the storage account to the Azure Function App managed identity. See Assign an Azure role for access to blob data.

Assign db_owner role to the Azure Function App managed identity inside the Azure SQL Database

Create a user as described in Create contained users mapped to Azure AD identities.

The CREATE USER command must be executed by the Azure AD administrator.

CREATE USER [<name_of_the_managed_identity>] FROM EXTERNAL PROVIDER;

Assign the db_owner role using the ALTER ROLE statement.

ALTER ROLE db_owner ADD MEMBER [<name_of_the_managed_identity>];

Assign Synapse Administrator permissions to the Azure Function App managed identity

Assign the Synapse Administrator RBAC role to the Azure Function App managed identity.
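For example, with the Azure CLI; this is a sketch with placeholder values, where the assignee is the object ID of the Function App managed identity.

# Grant the Synapse Administrator role to the Function App managed identity
az synapse role assignment create --workspace-name <workspace_name> --role "Synapse Administrator" --assignee <function_app_identity_object_id>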

Configure the Azure Function App settings

You must configure the following Function App settings.

  • TIMER_SCHEDULE - Schedule of the function timer in NCRONTAB format. Example: 0 */5 * * * *
  • CONFIGURATION_CONTAINER - URI of the container with the DataverseToSql configuration (as specified in the ConfigurationStorage section of DataverseToSql.json under Configure the environment below). Example: https://myaccount.blob.core.windows.net/dataverse-to-sql
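For example, the settings can be applied with the Azure CLI; the names are placeholders and the values match the examples above.

az functionapp config appsettings set --name <function_app_name> --resource-group <resource_group> --settings "TIMER_SCHEDULE=0 */5 * * * *" "CONFIGURATION_CONTAINER=https://myaccount.blob.core.windows.net/dataverse-to-sql"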

Install the required software on the local environment

Install the following components where you plan to build the code and use the DataverseToSql CLI (dv2sql).

For building and deploying the Azure Function, you can install any of the following:

  • Visual Studio 2022.
  • Visual Studio Code.

Additional tools are mentioned in Deployment technologies in Azure Functions.

Login to Azure CLI

Login to Azure CLI with the dv2sql user (see Prerequisites).

dv2sql uses DefaultAzureCredential to authenticate to Azure SQL Database, the storage account and Azure Synapse. DefaultAzureCredential tries different authentication methods, including the current Azure CLI credentials. You can override the authentication process to specify other credentials using the environment variables documented in EnvironmentCredential.
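For example (a sketch; the tenant ID is a placeholder, and the environment variables are the ones read by EnvironmentCredential if you choose to override the Azure CLI credentials):

# Log in to Azure CLI with the dv2sql user
az login --tenant <tenant_id>
# Optional: override DefaultAzureCredential with an explicit service principal
export AZURE_TENANT_ID=<tenant_id>
export AZURE_CLIENT_ID=<client_id>
export AZURE_CLIENT_SECRET=<client_secret>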

Setup checklist

  • You have identified an Azure subscription in the same Azure AD tenant as the Dataverse environment.
  • The user that performs the setup has administrative permissions in Dataverse and in the Azure subscription.
  • All Azure resources are deployed to the same region as the Dataverse environment.
  • An Azure storage account with Azure Data Lake Storage Gen2 has been deployed.
  • The dv2sql user has been assigned Storage Blob Data Contributor role on the storage account.
  • The storage account firewall is configured to enable network access from all networks.
  • Two containers have been created inside the storage account to store incremental blobs and DataverseToSql configuration files, respectively.
  • An Azure Synapse workspace has been deployed.
  • A Spark pool has been created.
  • The required additional packages have been installed.
  • The Synapse workspace has been assigned Storage Blob Data Contributor role on the storage account.
  • A link is established between Dataverse and the Azure storage account using Synapse Link for Dataverse. Note: the storage account must be the same one that contains the container used to store incremental blobs.
  • (Optional) Synapse Link for Dataverse is configured to connect to the Azure Synapse workspace.
  • Dataverse tables are replicated to Azure Storage in append-only mode.
  • An Azure SQL Database (single database) has been deployed.
  • Azure SQL is configured to use Azure AD authentication.
  • Azure SQL firewall is configured to allow access by Azure services.
  • Azure SQL firewall is configured to allow access by the environment where the DataverseToSql CLI is executed.
  • The dv2sql user has been assigned the db_owner role inside the database.
  • The Synapse Workspace has been assigned the db_owner role inside the database.
  • An Azure Function App has been deployed.
  • A managed identity has been assigned to the Azure Function App.
  • The managed identity of the Azure Function App has been assigned the "Storage Blob Data Contributor" role inside the storage account.
  • The managed identity of the Azure Function App has been assigned the db_owner role inside the database.
  • The managed identity of the Azure Function App has been assigned the Synapse Administrator role.
  • Azure Function App setting TIMER_SCHEDULE has been configured.
  • Azure Function App setting CONFIGURATION_CONTAINER has been configured.
  • The required software has been installed on the local environment.
  • The dv2sql user is logged in to Azure CLI locally.

Build and deployment

Build DataverseToSql CLI (dv2sql)

To build dv2sql run the following command (customize <output_folder>).

dotnet publish --use-current-runtime src/DataverseToSql/DataverseToSql.Cli/DataverseToSql.Cli.csproj -c Release -o <output_folder>

Example:

dotnet publish --use-current-runtime src/DataverseToSql/DataverseToSql.Cli/DataverseToSql.Cli.csproj -c Release -o bin

Build and deploy the Azure Function

For build and deployment steps specific to the tool of your choice, refer to Deployment technologies in Azure Functions.

Usage

Initialize the environment

DataverseToSql requires a local folder for its configuration files.

Run the following command to create the folder and the necessary configuration files.

dv2sql init -p <path_to_environment>

Example:

dv2sql init -p ./dataverse-to-sql

Configure the environment

Open the DataverseToSql.json file from the folder created above and replace the placeholder values with the details of the environment.

Default template file:

{
  "DataverseStorage": {
    "StorageAccount": "<Storage account Blob URI e.g https://accountname.blob.core.windows.net>",
    "Container": "<Container name>"
  },
  "IncrementalStorage": {
    "StorageAccount": "<Storage account Blob URI e.g https://accountname.blob.core.windows.net>",
    "Container": "<Container name>"
  },
  "ConfigurationStorage": {
    "StorageAccount": "<Storage account Blob URI e.g https://accountname.blob.core.windows.net>",
    "Container": "<Container name>"
  },
  "Database": {
    "Server": "<Azure SQL Server FQDN>",
    "Database": "<Database name>",
    "Schema": "<Default schema>"
  },
  "SynapseWorkspace": {
    "SubscriptionId": "<Subscription ID of the Synapse workspace>",
    "ResourceGroup": "<Resource Group of the Synapse workspace>",
    "Workspace": "<Name of the Synapse workspace>"
  },
  "Ingestion": {
    "Parallelism": 1
  },
  "SchemaHandling": {
    "EnableSchemaUpgradeForExistingTables": true
  }
}
The settings are described below (defaults in parentheses where applicable).

  • DataverseStorage / StorageAccount - FQDN of the storage account containing Dataverse data.
  • DataverseStorage / Container - Storage container containing Dataverse data (created by Synapse Link for Dataverse).
  • IncrementalStorage / StorageAccount - FQDN of the storage account containing incremental data.
  • IncrementalStorage / Container - Storage container containing incremental data.
  • ConfigurationStorage / StorageAccount - FQDN of the storage account containing DataverseToSql configuration data.
  • ConfigurationStorage / Container - Storage container containing DataverseToSql configuration data.
  • Database / Server - FQDN of the Azure SQL logical server.
  • Database / Database - Name of the Azure SQL Database.
  • Database / Schema - SQL schema to be used for Dataverse tables.
  • SynapseWorkspace / SubscriptionId - Subscription ID of the Synapse workspace.
  • SynapseWorkspace / ResourceGroup - Resource group of the Synapse workspace.
  • SynapseWorkspace / Workspace - Name of the Synapse workspace.
  • Ingestion / Parallelism (default: 1) - Number of concurrent copy activities performed by the ingestion pipeline.
  • SchemaHandling / EnableSchemaUpgradeForExistingTables (default: true) - Enable the propagation of schema changes for existing tables.
  • SchemaHandling / OptionSetInt32 (default: false) - Use int instead of long for integer OptionSet fields. Note: once set to true, the option cannot be set back to false; all tables must be removed and redeployed.
  • SchemaHandling / SkipIsDeleteColumn (default: false) - Skip the IsDelete column when generating the target tables.
  • Spark / SparkPool - Name of the Spark pool.
  • Spark / EntityConcurrency (default: 4) - Number of entities to process in parallel during full load via Spark.

Customize column data types

The data type of columns of tables replicated from Dataverse is determined automatically based on the metadata from the source storage account.

The default mapping is as follows.

Source Dataverse data type -> Destination SQL Server data type
binary -> varbinary(max)
boolean -> bit
byte -> tinyint
char -> nchar
date -> date
datetime -> datetime2
datetimeoffset -> datetimeoffset
decimal -> decimal
double -> float
float -> float
guid -> uniqueidentifier
int16 -> smallint
int32 -> int
int64 -> bigint
integer -> int
json -> nvarchar(max)
long -> bigint
short -> smallint
string -> nvarchar
time -> time
timestamp -> datetime2

The data type of each column can be customized using a mapping file.

The file must be in TSV (tab-separated values) format, with the columns below, without headers.

  • Table name - The name of the table to be customized.
  • Column name - The name of the column to be customized.
  • Data type - The desired SQL Server data type.

Table and column names are case insensitive.

The data type must be a valid built-in SQL data type. XML, geometry, geography, hierarchyid, and custom data types are not supported.

DataverseToSql verifies the data type is valid, but does not attempt to verify that it is compatible with source data. It is your responsibility to verify that the desired data type can contain existing data. Failure to do so will likely result in failures of the ingestion pipeline.

The file must be named CustomDatatypeMap.tsv. Note: the name is case sensitive.

The file must be placed in the configuration folder, along with DataverseToSql.json.

The file is uploaded to the configuration container by the deploy command (see below). Later changes to the file can be uploaded by running deploy again or by replacing the file in the storage account with a new version.
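For example, a new version of the file can be uploaded with the Azure CLI; the account and container names are placeholders, and the container is the configuration container.

az storage blob upload --account-name <storage_account_name> --container-name <configuration_container> --name CustomDatatypeMap.tsv --file ./CustomDatatypeMap.tsv --auth-mode login --overwrite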

Sample content:

account	accountcategorycode	int
account	accountclassificationcode	int
account	accountid	uniqueidentifier
account	accountnumber	nvarchar(20)
account	accountratingcode	int
account	address1_addressid	uniqueidentifier
account	address1_addresstypecode	int
account	address1_city	nvarchar(50)

This feature is useful if you want to match the schema of an existing database (e.g., the one generated by Data Export Service). You can extract the column data type information from an existing database using a query like the following.

SELECT
    t.name [table],
    c.name [column],
    dt.name + CASE 
        WHEN dt.name in (N'decimal', N'numeric') 
            THEN N'(' + CAST(c.precision AS varchar(2)) + N',' + CAST(c.scale AS varchar(2)) + N')'
        WHEN dt.name in (N'time', N'datetime2', N'datetimeoffset') 
            THEN N'(' + CAST(c.scale AS varchar(2)) + N')'
        WHEN dt.name in (N'float')
            THEN CASE WHEN c.precision = 53 THEN N'' ELSE N'(' + CAST(c.precision AS varchar(2)) + N')' END
        WHEN dt.name in (N'binary', N'char')
            THEN N'(' + CAST(c.max_length AS nvarchar(4))+ N')'
        WHEN dt.name = N'nchar'
            THEN N'(' + CAST(c.max_length/2 AS nvarchar(4))+ N')'
        WHEN dt.name in (N'varbinary', N'varchar')
            THEN N'(' + CASE c.max_length WHEN -1 THEN N'max' ELSE CAST(c.max_length AS nvarchar(4)) END + N')'
        WHEN dt.name = N'nvarchar'
            THEN N'(' + CASE c.max_length WHEN -1 THEN N'max' ELSE CAST(c.max_length/2 AS nvarchar(4)) END + N')'
        ELSE N''
    END [datatype]
FROM
    sys.tables t
    INNER JOIN sys.columns c
        ON t.object_id = c.object_id
    INNER JOIN sys.types dt
        ON c.system_type_id = dt.system_type_id
        AND dt.system_type_id = dt.user_type_id
WHERE
    dt.is_assembly_type = 0
    and dt.name <> 'xml'
ORDER BY
    t.name,
    c.name;

Provide scripts for custom SQL objects

If you plan to introduce custom SQL objects into the database, the preferred way to do so is to create files under the CustomSqlObjects folder of the environment.

Using this method is mandatory when the custom SQL objects refer to entity tables or other objects generated by DataverseToSql. For example, if you plan to create a custom index on an entity table, the index must be defined in a file in CustomSqlObjects. Doing otherwise, by creating objects directly on the database, may cause the schema deployment to fail and, as a consequence, block the ingestion process.

Any file placed under CustomSqlObjects is treated as a SQL script file and included in the database schema. Files can be placed directly under CustomSqlObjects or organized in folders.

Each file must contain the definition of one or more objects, separated by GO commands. Only top-level DDL commands are considered.

Example:

CREATE TABLE mytable(
  field1 int,
  field2 int
)
GO

CREATE INDEX ix_mytable ON mytable(field1)
GO

When the ingestion process runs, it detects any change in the content of the CustomSqlObjects folder in the container and applies the necessary changes to the database. The files are applied in no guaranteed order and dependencies are handled automatically, for example when a file depends on objects defined in another file.

The ingestion process applies custom scripts on a best-effort basis; if a script contains syntax errors or introduces inconsistencies (e.g., it refers to non-existent objects), it is skipped and the Azure Function produces a warning.

Deploy the environment

Environment deployment creates the following:

  • The database schema (DataverseToSql metadata tables and OptionsetMetadata tables).
  • Synapse linked services for Azure SQL Database and Synapse Serverless SQL pool.
  • Synapse datasets.
  • Synapse ingestion pipeline.
  • DataverseToSql configuration files in the container specified under the ConfigurationStorage section of DataverseToSql.json.
  • Custom SQL objects script under the CustomSqlObjects folder of the container specified under the ConfigurationStorage section of DataverseToSql.json.

To perform the deployment, run the command

dv2sql deploy -p <path_to_environment>

Example:

dv2sql deploy -p ./dataverse-to-sql

Add entities

To add entities (tables) to the environment for ingestion, use one of the following commands.

To add one entity, run the command

dv2sql -p <path_to_environment> add --name <entity_name>

Example:

dv2sql -p ./dataverse-to-sql add --name account

To add all entities, run the command

dv2sql -p <path_to_environment> add --all

Example:

dv2sql -p ./dataverse-to-sql add --all

Upgrade

To upgrade the tool to a new version, do the following.

  1. Build the CLI and function.
  2. Disable the function (see the example after this list).
  3. Run dv2sql deploy. See Deploy the environment.
  4. Deploy the function.
  5. Enable the function.
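The function can be disabled and re-enabled, for example, by stopping and starting the Function App with the Azure CLI; this is a sketch with placeholder names.

# Step 2: stop the Function App before redeploying
az functionapp stop --name <function_app_name> --resource-group <resource_group>
# Step 5: start it again after the new version is deployed
az functionapp start --name <function_app_name> --resource-group <resource_group>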

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
