## Part 2: Data Lake Interop

Pre-reqs:
* Storage account (ADLS Gen 2)
* Synapse Serverless Pool
* Databricks Service Principal
* Secret scope with service principal secret, SQL password, and storage account key


Overview:

1. Setup
   * Access secrets
   * Configure variables
   * Test storage account access
2. Write to Data Lake in Delta
   * Create a dataframe
   * Use the Delta Spark connector to write to ADLS
3. Synapse Read from Delta
4. Synapse Write to Delta
5. Databricks Read from Delta

## Why Delta?

Delta Lake is quickly becoming a core format for data lake storage in Azure. It combines the Parquet file format with ACID transactions, time travel, automated compaction, and many other features to give you robust, efficient, and cost-effective storage for your organization's data. It is also increasingly supported by native Azure services such as Power BI, Azure Analysis Services, and (as we will see) Azure Synapse. In most cases, what is shown below can also be done with other supported formats such as Parquet, ORC, JSON, and CSV.

In [0]:
from types import SimpleNamespace

secret_scope = 'dbw-syn-lab'
secrets = SimpleNamespace(
  sp_secret = dbutils.secrets.get(secret_scope, 'dbw-syn-lab-sp-secret'),
  sa_secret = dbutils.secrets.get(secret_scope, 'dbw-syn-lab-sa-secret'),
  sql_pw = dbutils.secrets.get(secret_scope, 'dbw-syn-lab-sql-pw')
)

The Azure Synapse connector uses three types of network connections:

* Spark driver to Azure Synapse
* Spark driver and executors to Azure storage account
* Azure Synapse to Azure storage account

```
                                 ┌─────────┐
      ┌─────────────────────────>│ STORAGE │<────────────────────────┐
      │   Storage acc key /      │ ACCOUNT │  Storage acc key /      │
      │   Managed Service ID /   └─────────┘  OAuth 2.0 /            │
      │                               │                              │
      │                               │                              │
      │                               │ Storage acc key /            │
      │                               │ OAuth 2.0 /                  │
      │                               │                              │
      v                               v                       ┌──────v────┐
┌──────────┐                      ┌──────────┐                │┌──────────┴┐
│ Synapse  │                      │  Spark   │                ││ Spark     │
│ Analytics│<────────────────────>│  Driver  │<───────────────>│ Executors │
└──────────┘  JDBC with           └──────────┘    Configured   └───────────┘
              username & password /                in Spark
```

Note that Blob storage can only use the storage account key, whereas ADLS Gen 2 can optionally use OAuth 2.0 instead.
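
The OAuth option for ADLS Gen 2 is driven by a handful of `fs.azure.account.*` Spark settings. These can also be scoped to a single storage account by appending the account's DFS host name to each key, which lets one cluster hold different credentials for different accounts. The following is a minimal sketch, assuming a hypothetical helper `build_oauth_conf` and placeholder credential values; it only builds the settings as a dictionary and does not touch Spark.

```python
def build_oauth_conf(storage_account, client_id, client_secret, tenant_id):
    """Return ABFS OAuth settings scoped to one storage account (hypothetical helper)."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Placeholder credentials; in a notebook these would come from the secret scope.
conf = build_oauth_conf("strdbwsynworkshop", "<app-id>", "<sp-secret>", "<tenant-id>")
for key in sorted(conf):
    print(key)
```

In a Databricks notebook, each pair would then be applied with `spark.conf.set(key, value)`.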

In [0]:
# Application ID corresponds to the App Registration / Service Principal used by Databricks
app_id = '4b309858-a987-4d5a-9a11-a84116790317'

# Directory ID is the tenant this Databricks workspace belongs to
directory_id = '6871727a-5747-424a-b9d4-39a621930267'

# Defining the service principal credentials for the Azure storage account
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", app_id)
spark.conf.set("fs.azure.account.oauth2.client.secret", secrets.sp_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint", f"https://login.microsoftonline.com/{directory_id}/oauth2/token")

# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the connector will use the Azure storage account credentials)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", app_id)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", secrets.sp_secret)

# Set up the storage account key
storage_account_name = 'strdbwsynworkshop'
container_name = 'synstorage'
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", secrets.sa_secret)

In [0]:
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": app_id,
  "fs.azure.account.oauth2.client.secret": secrets.sp_secret,
  "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{directory_id}/oauth2/token"
}

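# Unmount only if the mount point already exists; unmounting a missing mount raises an error
mount_point = f'/mnt/{container_name}'
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
  dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
  source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = mount_point,
  extra_configs = configs
)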

In [0]:
df = spark.read.csv(f"/mnt/{container_name}/test.csv", header='true')
display(df)

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

schema = StructType([
  StructField("id", IntegerType()),
  StructField("a", IntegerType()),
  StructField("b", StringType()),
  StructField("c", FloatType()),
])

test_df = spark.createDataFrame([[0, 1, "2", 3.14], [1, 4, "5", 6.28], [2, 7, "8", 9.42]], schema=schema)
display(test_df)

In [0]:
(
  test_df.write
  .format("delta")
  .save(f"/mnt/{container_name}/delta_test")
)

We now have an untracked Delta table in our ADLS Gen2 container.
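
"Untracked" here means the table exists only as files in storage and is not registered in the metastore. If a named table is wanted, the path can be registered with a `CREATE TABLE ... USING DELTA LOCATION` statement. The following is a minimal sketch, assuming a hypothetical helper `register_delta_sql` and a hypothetical table name `delta_test`; it only builds the SQL text, which would then be passed to `spark.sql` in a notebook.

```python
def register_delta_sql(table_name, location):
    """Build a statement that registers an existing Delta path as a table (hypothetical helper)."""
    # Reject characters that would break the quoted identifier or the string literal.
    if "`" in table_name or "'" in location:
        raise ValueError("unsupported character in table name or location")
    return f"CREATE TABLE IF NOT EXISTS `{table_name}` USING DELTA LOCATION '{location}'"


# The path matches the mount point and folder used in this notebook.
stmt = register_delta_sql("delta_test", "/mnt/synstorage/delta_test")
print(stmt)
```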

## Querying from Synapse Serverless / on-demand

We can query Delta directly from Synapse (public preview as of April 2021).

```sql
SELECT TOP(10) *
FROM OPENROWSET(
    BULK 'https://strdbwsynworkshop.blob.core.windows.net/synstorage/delta_test',
    FORMAT = 'DELTA'
) 
WITH (
    id INT,
    a INT,
    b VARCHAR(6),
    c FLOAT
) 
AS rows
```

## Explanation

We can select data directly from our Delta table using Synapse on-demand. The query uses the `OPENROWSET` function to identify the location and format of our data. Optionally, a schema can be specified in a `WITH` clause, which can significantly improve performance because it lets us minimize type sizes in the query (for example, `VARCHAR(6)` instead of the pessimistic `VARCHAR(1000)`).
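
When several Delta folders need to be queried this way, the statement can be generated from a column list so that the explicit types stay in one place. The following is a minimal sketch, assuming a hypothetical helper `openrowset_query`; it only builds the T-SQL text and does not connect to Synapse.

```python
def openrowset_query(url, columns, top=10):
    """Build a serverless SQL query over a Delta folder with an explicit schema (hypothetical helper)."""
    # One "name TYPE" entry per line, mirroring the WITH clause layout.
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"SELECT TOP({top}) *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'DELTA'\n"
        ")\n"
        "WITH (\n"
        f"{cols}\n"
        ")\n"
        "AS rows"
    )


# Columns and URL are taken from the query shown in this section.
query = openrowset_query(
    "https://strdbwsynworkshop.blob.core.windows.net/synstorage/delta_test",
    [("id", "INT"), ("a", "INT"), ("b", "VARCHAR(6)"), ("c", "FLOAT")],
)
print(query)
```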

## Notable limitations for Serverless queries:

* Serverless SQL pools do not support time travel queries or updating Delta Lake files.
* Delta Lake support is not available in dedicated SQL pools.
* External tables do not support partitioning.
* Delta Lake tables created in the Apache Spark pools are not synchronized in serverless SQL pool.
* You cannot use schema inference in the `OPENROWSET` function if you have nested/complex types in the files. Make sure that you explicitly specify the schema in the `WITH` clause.

## Reference

* https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format#explicitly-specify-schema
* https://techcommunity.microsoft.com/t5/azure-synapse-analytics/query-delta-lake-files-using-t-sql-language-in-azure-synapse/ba-p/2388398
* Synapse Delta Known Issues: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand#delta-lake

In [0]:
# Use the following to clean-up the table
# %sh rm -rf /dbfs/mnt/synstorage/delta_test_partitioned_stream && rm -rf /dbfs/delta_test_partitioned_stream_checkpoint

from pyspark.sql.functions import expr

# spark.sql(f'DROP TABLE IF EXISTS events')
streaming_delta_location = f'/mnt/{container_name}/delta_test_partitioned_stream/'
# spark.sql(f"CREATE TABLE events (timestamp timestamp, value long, partition long) USING DELTA LOCATION '{streaming_delta_location}' PARTITIONED BY (partition)")

# Prepare streaming source; this could be Kafka or a simple rate stream.
streaming_df = (
  spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .option("numPartitions", "16")
  .load()
)

streaming_df = streaming_df.withColumn("partition", expr("value % 5"))

# Use the Structured Streaming API to continuously append the partitioned data
# to the Delta table `events`.
(
  streaming_df.writeStream
  .partitionBy("partition")
  .outputMode("append")
  .option("checkpointLocation", "/delta_test_partitioned_stream_checkpoint")
  .table("events")
)
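
The `partitionBy("partition")` call in the cell above makes the writer place rows in Hive-style folders named `partition=<value>` under the table's storage location, one folder per distinct value of `value % 5`. The following is a minimal plain-Python sketch of that distribution, assuming a hypothetical run time of five seconds; it simulates the `value` column of the rate source (which counts up from 0) rather than running Spark.

```python
from collections import Counter

NUM_BUCKETS = 5       # matches expr("value % 5")
ROWS_PER_SECOND = 10  # matches the rowsPerSecond option
SECONDS = 5           # hypothetical run time for this illustration

# The rate source numbers rows 0, 1, 2, ...; simulate its `value` column.
values = range(ROWS_PER_SECOND * SECONDS)

# Count how many rows would land in each partition folder.
rows_per_folder = Counter(f"partition={v % NUM_BUCKETS}" for v in values)

for folder in sorted(rows_per_folder):
    print(folder, rows_per_folder[folder])
```

Because the source values are consecutive integers, the rows spread evenly across the five folders.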

In [0]:
df = spark.read.parquet(f"/mnt/{container_name}/parquet_test")
display(df)

In [0]:
# Once the Synapse Spark stream has been started, you can run this to see the transformed data in Databricks.
streaming_df = (
  spark.readStream
  .format("delta")
  .load(f'/mnt/{container_name}/delta_test_partitioned_stream_synapse/')
)
display(streaming_df)