---
title: SQL Server Big Data Clusters Delta Lake
titleSuffix: SQL Server Big Data Clusters
description: This guide covers how to use Delta Lake on SQL Server Big Data Clusters.
author: HugoMSFT
ms.author: hudequei
ms.reviewer: wiassaf
ms.date: 10/06/2021
ms.service: sql
ms.subservice: big-data-cluster
ms.topic: conceptual
---

# Delta Lake on SQL Server Big Data Clusters

[!INCLUDESQL Server 2019]

[!INCLUDEbig-data-clusters-banner-retirement]

In this guide, you'll learn:

> [!div class="checklist"]
> * The requirements and capabilities of Delta Lake on [!INCLUDEbig-data-clusters-nover].
> * How to load Delta Lake libraries on CU12 clusters for use with Spark 2.4 sessions and jobs.

## Introduction

Linux Foundation Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads. To learn more about Delta Lake, see the Delta Lake documentation at https://delta.io.

## Delta Lake on [!INCLUDEbig-data-clusters-nover] CU13 and above (Spark 3)

Delta Lake is installed and configured by default on [!INCLUDEbig-data-clusters-nover] CU13 and above. No further action is required.
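
For example, a Spark 3 notebook or PySpark session on CU13 or later can write and read Delta tables without any additional setup. The following sketch is illustrative only; the HDFS path is a hypothetical example location.

```python
# Minimal PySpark sketch for CU13 and above (Spark 3), where Delta Lake is preconfigured.
# The path below is a hypothetical example location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small DataFrame in Delta format.
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta-sample")

# Read the table back to confirm Delta Lake is available out of the box.
spark.read.format("delta").load("/tmp/delta-sample").show()
```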

This article covers configuration of Delta Lake on [!INCLUDEbig-data-clusters-nover] CU12 and below.

## Configure Delta Lake on SQL Server Big Data Clusters CU12 and below (Spark 2.4)

On [!INCLUDEbig-data-clusters-nover] CU12 or below, you can load the Delta Lake libraries by using the Spark library management feature.

> [!NOTE]
> As a general rule, use the most recent compatible library. The code in this guide was tested by using Delta Lake 0.6.1 on [!INCLUDEbig-data-clusters-nover] CU12. Delta Lake 0.6.1 is compatible with Apache Spark 2.4.x; later versions are not. The examples are provided as-is, not as a supportability statement.

### Configure Delta Lake library and Spark configuration options

Set up your Delta Lake libraries with your application before you submit the jobs. The following library is required:

* **delta-core**: This core library enables Delta Lake support.

The library must target Scala 2.11 and Spark 2.4.7. This [!INCLUDEbig-data-clusters-nover] requirement is for SQL Server 2019 Cumulative Update 9 (CU9) or later.

You must also configure Spark to enable the Delta Lake-specific Spark SQL commands and the metastore integration. The following example shows how an Azure Data Studio notebook configures Delta Lake support:

```python
%%configure -f \
{
    "conf": {
        "spark.jars.packages": "io.delta:delta-core_2.11:0.6.1",
        "spark.sql.extensions":"io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":"org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
}
```
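
After the session starts with this configuration, a quick way to confirm that Delta Lake is active is to write and read a small Delta table from a subsequent notebook cell. This is a minimal sketch; the HDFS path is a hypothetical example location.

```python
# Run in a notebook cell after the session has been configured as shown above.
# The path is a hypothetical example location on HDFS.
df = spark.range(0, 10)
df.write.format("delta").mode("overwrite").save("/tmp/delta-test")

# Reading the table back confirms that the Delta Lake library was loaded correctly.
spark.read.format("delta").load("/tmp/delta-test").show()
```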

### Share library locations for jobs on HDFS

If multiple applications will use the Delta Lake library, copy the appropriate library JAR files to a shared location on HDFS. Then all jobs should reference the same library files.

Copy the libraries to the common location:

```bash
azdata bdc hdfs cp --from-path delta-core_2.11-0.6.1.jar --to-path "hdfs:/apps/jars/delta-core_2.11-0.6.1.jar"
```
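
A notebook session can then reference the shared JAR file directly instead of downloading the package from Maven. The following sketch is an assumption-based variant of the earlier configuration: it uses the standard `spark.jars` Spark property to point at the HDFS location used above, and behavior may vary depending on how your environment resolves `hdfs:` paths.

```python
%%configure -f \
{
    "conf": {
        "spark.jars": "hdfs:/apps/jars/delta-core_2.11-0.6.1.jar",
        "spark.sql.extensions":"io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":"org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
}
```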

### Dynamically install the libraries

You can install packages dynamically when you submit a job by using the package management features of Big Data Clusters. There's a job startup time penalty because the library files are downloaded on each job submission.

### Submit the Spark job by using azdata

The following example uses the shared library JAR files on HDFS:

```bash
azdata bdc spark batch create -f hdfs:/apps/ETL-Pipelines/my-delta-lake-python-job.py \
-j '["/apps/jars/delta-core_2.11-0.6.1.jar"]' \
--config '{"spark.sql.extensions":"io.delta.sql.DeltaSparkSessionExtension","spark.sql.catalog.spark_catalog":"org.apache.spark.sql.delta.catalog.DeltaCatalog"}' \
-n MyETLPipelinePySpark --executor-count 2 --executor-cores 2 --executor-memory 1664m
```
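
For reference, a job file such as `my-delta-lake-python-job.py` might look like the following. This is only an illustrative sketch, not the contents of an actual sample; the input and output paths are hypothetical example locations.

```python
# Hypothetical contents of my-delta-lake-python-job.py, shown only as an illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyETLPipelinePySpark").getOrCreate()

# Read source data from a hypothetical HDFS location.
source_df = spark.read.parquet("/apps/ETL-Pipelines/data/input")

# Write the result as a Delta table so downstream jobs get ACID guarantees.
source_df.write.format("delta").mode("overwrite").save("/apps/ETL-Pipelines/data/delta-output")

spark.stop()
```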

This example uses dynamic package management to install the dependencies:

```bash
azdata bdc spark batch create -f hdfs:/apps/ETL-Pipelines/my-delta-lake-python-job.py \
--config '{"spark.jars.packages":"io.delta:delta-core_2.11:0.6.1","spark.sql.extensions":"io.delta.sql.DeltaSparkSessionExtension","spark.sql.catalog.spark_catalog":"org.apache.spark.sql.delta.catalog.DeltaCatalog"}' \
-n MyETLPipelinePySpark --executor-count 2 --executor-cores 2 --executor-memory 1664m
```

## Next steps

To learn how to use Delta Lake effectively, see the following articles.

To submit Spark jobs to [!INCLUDEbig-data-clusters-nover] by using azdata or Livy endpoints, see Submit Spark jobs by using command-line tools.

For more information about [!INCLUDEbig-data-clusters-nover] and related scenarios, see [Introducing [!INCLUDEbig-data-clusters-nover]](big-data-cluster-overview.md).