When data is migrated from an on-premises system to a modern data warehouse or an Azure Data Explorer (ADX) database, historical data often has to be moved to the new data services in Azure as well.
The Sliced Data Migration Toolbox (SDMT) provides framework components to simplify this data migration.
SDMT is also compatible with Microsoft Fabric and can be used to migrate data from your existing data stores to Microsoft Fabric, if they are not accessible via Shortcuts.
The toolbox provides the following main benefits:
- Proven framework to migrate data in an efficient and governed way.
- Flexible architecture to support different scenarios. It is possible to integrate the SDMT pipelines in a broader ETL/data migration scenario.
- It is possible to load huge datasets in well-defined, small slices.
- Sequential and parallel load of the slices is supported (the pipeline defines how many parallel loads are executed).
- Data transfer workload can scale out over different integration runtimes to optimize performance.
- If a slice fails, it can be restarted without data duplication.
- The ADX property "creationTime" is set correctly. The correct setting of "creationTime" is essential for the ADX caching policy.
- Each data transfer is logged in the metadata database (duration, number of rows transferred) and ADX tags provide additional information for each extent:
  - "LoadedAt:"
  - "SlicedImportObject_Id:"
  - "PipelineRun_Id:"
  - "ExtentFingerprint:" (identification of the slice; used to clean up a slice in the case of a reload)
  - "SourceFunction:"
- Simple pipelines without any complex logic; the required values are provided by the stored procedures of the toolbox.
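To illustrate the tag set listed above, the following sketch builds the tag strings a pipeline could attach to an ingested extent. The function name, parameters, and example values are hypothetical; in the toolbox the actual values come from its stored procedures.

```python
from datetime import datetime, timezone

def build_extent_tags(sliced_import_object_id: int, pipeline_run_id: str,
                      extent_fingerprint: str, source_function: str) -> list[str]:
    """Hypothetical sketch of the tag set SDMT attaches to each ADX extent.
    The tag names mirror the list above; the values are illustrative."""
    return [
        f"LoadedAt: {datetime.now(timezone.utc).isoformat()}",
        f"SlicedImportObject_Id: {sliced_import_object_id}",
        f"PipelineRun_Id: {pipeline_run_id}",
        f"ExtentFingerprint: {extent_fingerprint}",
        f"SourceFunction: {source_function}",
    ]

tags = build_extent_tags(42, "run-0001", "2021-03_orders", "fn_orders")
```

Because "ExtentFingerprint" identifies the slice, all extents carrying the same fingerprint can be dropped together before a slice is reloaded, which is what makes a restart free of duplicates.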
Depending on the shape of the pipeline you can choose one of the following options:
- SQL Source -> Database destination (e.g. ADX, SQL, PostgreSQL, ...)
- SQL Source -> Data lake
- SQL Source -> Data lake -> Database destination (e.g. ADX, SQL, PostgreSQL, ...)
- ADX Source -> Data lake (via external table)
If ADX is the target, then it is possible to choose one of the following paths:
- SQL Source -> ADX via Copy activity
- SQL Source -> ADX using sql_request
- SQL Source -> Data lake via Copy activity -> ADX via Copy activity
- SQL Source -> Data lake via Copy activity -> ADX using an external table
- Data lake -> ADX using an external table
The whole data transfer is driven by metadata. Depending on the scenario, different metadata is required. Common metadata is:
- Data transfer application name
- Source object
- Destination object
- Start date
- End date
- Slice size (day or month)
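The common metadata above is enough to derive the individual slices. The following sketch shows the idea behind slice generation, splitting the [start, end) range into day or month ranges. It is a hypothetical re-implementation for illustration, not the logic of [Helper].[GenerateSliceMetaData] itself.

```python
from datetime import date, timedelta

def generate_slices(start: date, end: date, slice_size: str) -> list[tuple[date, date]]:
    """Split [start, end) into half-open (from, to) ranges of one day or
    one calendar month each. Illustrative only; the toolbox generates its
    slice metadata via a stored procedure."""
    slices = []
    current = start
    while current < end:
        if slice_size == "day":
            nxt = current + timedelta(days=1)
        elif slice_size == "month":
            # Jump into the next month, then snap back to its first day.
            nxt = (current.replace(day=1) + timedelta(days=32)).replace(day=1)
        else:
            raise ValueError("slice_size must be 'day' or 'month'")
        slices.append((current, min(nxt, end)))
        current = nxt
    return slices

monthly = generate_slices(date(2020, 1, 1), date(2020, 4, 1), "month")
```

Each generated range becomes one row of slice metadata, and each row in turn becomes one unit of work (one copy run, one ADX extent fingerprint) that can be loaded, logged, and restarted independently.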
To transfer data from a source to a destination, the following steps are required:
- Create the required objects in the metadata database and the destination database.
- Create the metadata database (deploy the database project).
- Create the necessary ADX artifacts (tables, functions, ...)
- Create slice metadata. The most efficient way is to use the stored procedure [Helper].[GenerateSliceMetaData].
- Pick the corresponding ADF/Synapse pipeline, specify the required start parameters (@Mode=REGULAR) and execute it.
- If some slices failed to load, the pipeline can be restarted (@Mode=RESTART). The pipeline will only re-load the missing slices.
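The REGULAR/RESTART semantics can be sketched as a simple filter over the slice metadata. The dictionary keys and status values below are illustrative assumptions, not the toolbox's actual schema.

```python
def slices_to_load(slices: list[dict], mode: str) -> list[dict]:
    """Sketch of the @Mode parameter's behavior: REGULAR loads every slice,
    RESTART only the slices not yet marked as succeeded. Field names and
    status values are hypothetical."""
    if mode == "REGULAR":
        return slices
    if mode == "RESTART":
        return [s for s in slices if s["status"] != "Succeeded"]
    raise ValueError("mode must be REGULAR or RESTART")

meta = [
    {"slice": "2020-01", "status": "Succeeded"},
    {"slice": "2020-02", "status": "Failed"},
    {"slice": "2020-03", "status": "Pending"},
]
retry = slices_to_load(meta, "RESTART")
```

A restart therefore never touches slices that already completed, which together with the per-slice extent cleanup avoids duplicated data.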
For a direct transfer of the data slices just a few activities are required, and the properties can be filled by using the output of the stored procedures provided by the framework. This makes building the pipelines very efficient and the maintenance can be done at a central place.
Failed slice loads are reported in the metadata database, and a single slice can be reloaded in a controlled fashion.
One important feature of ADX is the ability to define caching policies and keep selected data in the hot cache. To decide whether data should be available in the hot cache, ADX keeps track of the age of the data via metadata in the extent definition. When historical data is loaded, it is therefore essential to provide the age of the records via a "hint". The toolbox provides this information directly as a property.
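The following minimal sketch shows why the "creationTime" hint matters: hot-cache residency is decided from the extent's creationTime, not from the wall-clock time of ingestion. The function and values are illustrative assumptions, not ADX internals.

```python
from datetime import datetime, timedelta, timezone

def in_hot_cache(creation_time: datetime, hot_cache_days: int,
                 now: datetime) -> bool:
    """Illustrative check of a caching policy with a hot period of N days:
    an extent stays hot while its creationTime is within the window."""
    return now - creation_time <= timedelta(days=hot_cache_days)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
# A historical slice ingested today: because creationTime reflects the
# age of the records, it stays out of a 31-day hot cache despite being
# freshly written.
historical = in_hot_cache(datetime(2020, 1, 1, tzinfo=timezone.utc), 31, now)
recent = in_hot_cache(datetime(2024, 5, 20, tzinfo=timezone.utc), 31, now)
```

Without the hint, every migrated extent would carry the ingestion timestamp as its creationTime and years of historical data would land in the hot cache at once.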
By combining the log entries in Azure Data Factory/Synapse Pipelines, the SDMT tables and the ADX tags, it is possible to get a full end-to-end view and relate pipeline runs to extents in ADX.
- Load data to ADX
- Load data to Microsoft Fabric
  - To Real-Time Analytics (ADX)
- Export data from ADX to the data lake
- Documentation
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.