SDMT - Sliced data migration toolbox

Overview

When data is migrated from an on-premises system to a modern data warehouse or an Azure Data Explorer (ADX) database, historical data must often be migrated to the new data services in Azure.
The sliced data migration toolbox (SDMT) provides framework components to simplify the data migration.

SDMT is also compatible with Microsoft Fabric and can be used to migrate data from your existing data stores to Microsoft Fabric if they are not accessible via Shortcuts.



(Figure: Transfer scenarios)




Main Benefits of the toolbox

The toolbox provides the following main benefits:

  • Proven framework to migrate data in an efficient and governed way.
  • Flexible architecture to support different scenarios. It is possible to integrate the SDMT pipelines in a broader ETL/data migration scenario.
  • It is possible to load huge datasets in well-defined, small slices.
    • Sequential and parallel loading of the slices is supported (the pipeline defines how many parallel loads are executed).
    • The data transfer workload can scale out over different integration runtimes to optimize performance.
    • If a slice fails, it can be restarted without data duplication.
  • The ADX property "creationTime" is set correctly, which is essential for the ADX caching policy.
  • Each data transfer is logged in the metadata database (duration, number of rows transferred), and ADX tags provide additional information for each extent (see the query sketch after this list):
    • "LoadedAt:"
    • "SlicedImportObject_Id:"
    • "PipelineRun_Id:"
    • "ExtentFingerprint:<identification of the slice; used to clean up a slice in the case of a reload>"
    • "SourceFunction:"
  • Simple pipelines without any complex logic; the required values are provided by the stored procedures of the toolbox.
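
The tags make each extent traceable. As a minimal sketch (the table name MyTargetTable is a placeholder), the tags on the extents of a target table could be inspected like this:

```kusto
// Inspect the SDMT tags on the extents of the target table after a load.
// 'MyTargetTable' is a hypothetical table name.
.show table MyTargetTable extents
| project ExtentId, MinCreatedOn, MaxCreatedOn, RowCount, Tags
```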

Common transfer stages

Depending on the shape of the pipeline, you can choose one of the following options:

  • SQL Source¹ -> Database destination (e.g. ADX, SQL, PostgreSQL, ...)²
  • SQL Source¹ -> Data lake
  • SQL Source¹ -> Data lake -> Database destination (e.g. ADX, SQL, PostgreSQL, ...)²
  • ADX Source -> Data lake (via external table)

(Figure: Transfer scenarios)
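
As an illustration of the "ADX Source -> Data lake (via external table)" stage, a slice of an ADX table can be exported to the lake through an external table. This is only a sketch; the external table name, schema, source table, and storage URI are assumptions:

```kusto
// Hypothetical external table over the data lake (Parquet files in a blob container).
.create external table SlicesExternal (Timestamp: datetime, Value: real)
kind = storage
dataformat = parquet
(
    h@'https://mystorage.blob.core.windows.net/slices;impersonate'
)

// Export one monthly slice of the (hypothetical) source table to the data lake.
.export to table SlicesExternal <|
    MySourceTable
    | where Timestamp between (datetime(2020-01-01) .. datetime(2020-02-01))
```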

If ADX is the target, it is possible to choose one of the following paths:

  • SQL Source¹ -> ADX via Copy activity
  • SQL Source -> ADX using sql_request
  • SQL Source¹ -> Data lake via Copy activity -> ADX via Copy activity
  • SQL Source¹ -> Data lake via Copy activity -> ADX using an external table
  • Data lake -> ADX using an external table

(Figure: Supported ADX paths)
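
For the paths that ingest from the data lake via an external table, the load can be done with an ingest-from-query command. The sketch below reuses the hypothetical SlicesExternal table from above; the target table, fingerprint, run id, and dates are also placeholders. Note how creationTime and the tags are set at ingestion time:

```kusto
// Ingest one slice from the data lake into ADX, stamping creationTime and tags
// so that the caching policy and a later cleanup/reload work as described below.
.append MyTargetTable with (
    creationTime = '2020-01-15T00:00:00Z',
    tags = '["ExtentFingerprint:dbo.Telemetry_2020-01", "PipelineRun_Id:00000000-0000-0000-0000-000000000000"]'
) <|
    external_table('SlicesExternal')
    | where Timestamp between (datetime(2020-01-01) .. datetime(2020-02-01))
```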





Metadata

The whole data transfer is driven by metadata. Depending on the scenario, different metadata is required. Common metadata includes:

  • Data transfer application name
  • Source object
  • Destination object
  • Start date
  • End date
  • Slice size (day or month)
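
To make the shape of this metadata concrete, the following sketch models a few monthly slices as a Kusto datatable. In the toolbox the slice metadata actually lives in the SQL metadata database; all names and values here are purely illustrative:

```kusto
// Illustrative only: two monthly slices for one source/destination pair.
datatable(TransferApp: string, SourceObject: string, DestinationObject: string,
          SliceStart: datetime, SliceEnd: datetime)
[
    'HistoricalLoad', 'dbo.Telemetry', 'TelemetryRaw', datetime(2020-01-01), datetime(2020-02-01),
    'HistoricalLoad', 'dbo.Telemetry', 'TelemetryRaw', datetime(2020-02-01), datetime(2020-03-01),
]
```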

(Figure: Overview)



Big picture

To transfer data from a source to a destination, the following steps are required:

  • Create the required objects in the metadata database and the destination database.
    • Create the metadata database (deploy the database project).
    • Create the necessary ADX artifacts (tables, functions, ...); see the sketch after this list.
  • Create the slice metadata. The most efficient way is to use the stored procedure [Helper].[GenerateSliceMetaData].
  • Pick the corresponding ADF/Synapse pipeline, specify the required start parameters (@Mode=REGULAR) and execute it.
    • If some slices failed to load, the pipeline can be restarted (@Mode=RESTART). The pipeline will only reload the missing slices.
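
As a minimal illustration of the "create the necessary ADX artifacts" step (the table name and schema are assumptions), the target table could be created like this:

```kusto
// Hypothetical target table for the migrated slices.
.create table MyTargetTable (Timestamp: datetime, Value: real)
```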

High level view on direct transfer pipeline


For a direct transfer of the data slices, only a few activities are required, and their properties can be filled using the output of the stored procedures provided by the framework. This makes building the pipelines very efficient, and maintenance can be done in a central place.

Example: Direct transfer to an ADX database

(Figure: Overview)



Reload of data slices

Failed slice loads are reported in the metadata database.

(Figure: Overview)



A single slice can then be reloaded in a controlled fashion.

(Figure: Overview)
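
Before a slice is reloaded, the extents it produced can be removed via the ExtentFingerprint tag, which is what prevents data duplication. A sketch with hypothetical table and fingerprint values:

```kusto
// Drop all extents of the failed slice, identified by its ExtentFingerprint tag,
// so the slice can be ingested again without duplicates.
.drop extents <|
    .show table MyTargetTable extents where tags has 'ExtentFingerprint:dbo.Telemetry_2020-01'
```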



Handling of creationTime and tags

One important feature of ADX is the ability to define caching policies and keep selected data in the hot cache. To decide whether data should be available in the hot cache, ADX keeps track of the age of the data via metadata in the extent definition. If historical data is loaded, it is therefore essential to provide the age of the records via a "hint". The toolbox provides this information directly as a property.

(Figure: Overview)
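
For example, with a hot cache window of 31 days (the policy value and table name are assumptions), extents whose creationTime is older than 31 days land directly in cold storage instead of evicting current data from the hot cache:

```kusto
// Keep only data younger than 31 days (relative to creationTime) in the hot cache.
.alter table MyTargetTable policy caching hot = 31d
```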



By combining the log entries in Azure Data Factory/Synapse pipelines, the SDMT tables, and the ADX tags, it is possible to get a full end-to-end view and relate pipeline runs to extents in ADX.

(Figure: End-to-end telemetry)
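
As a sketch of this correlation (the run id and table name are placeholders), the extents produced by one pipeline run can be located via the PipelineRun_Id tag:

```kusto
// Relate one ADF/Synapse pipeline run to the extents it created in ADX.
.show table MyTargetTable extents where tags contains 'PipelineRun_Id:00000000-0000-0000-0000-000000000000'
```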

Setup and next steps

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Footnotes

  1. Any SQL source that is supported by ADF/Synapse pipelines.

  2. Any pipeline sink that supports insert/append. Samples are provided for Azure SQL and Azure Data Explorer.
