# Part 4 - Synapse Spark

In this last section, we'll show how you can use Databricks and Synapse's Spark functionality together.

This is what we'll be building:

```
Databricks Write Stream ─────> ADLS ─────> Synapse Read Stream ──────┐
                                                                     │
                                                                  Transform
                                                                     │
Databricks Read Stream  <────── ADLS <──── Synapse Write Stream ─────┘
```

We'll see that Structured Streaming jobs can be set up to build a single workflow that spans both Databricks and Synapse.

In [0]:
from types import SimpleNamespace

secret_scope = 'dbw-syn-lab'
secrets = SimpleNamespace(
  sp_secret = dbutils.secrets.get(secret_scope, 'dbw-syn-lab-sp-secret'),
  sa_secret = dbutils.secrets.get(secret_scope, 'dbw-syn-lab-sa-secret'),
  sql_pw = dbutils.secrets.get(secret_scope, 'dbw-syn-lab-sql-pw')
)

The Azure Synapse connector uses three types of network connections:

* Spark driver to Azure Synapse
* Spark driver and executors to Azure storage account
* Azure Synapse to Azure storage account

```
                                 ┌─────────┐
      ┌─────────────────────────>│ STORAGE │<────────────────────────┐
      │   Storage acc key /      │ ACCOUNT │  Storage acc key /      │
      │   Managed Service ID /   └─────────┘  OAuth 2.0 /            │
      │                               │                              │
      │                               │                              │
      │                               │ Storage acc key /            │
      │                               │ OAuth 2.0 /                  │
      │                               │                              │
      v                               v                       ┌──────v────┐
┌──────────┐                      ┌──────────┐                │┌──────────┴┐
│ Synapse  │                      │  Spark   │                ││ Spark     │
│ Analytics│<────────────────────>│  Driver  │<───────────────>│ Executors │
└──────────┘  JDBC with           └──────────┘    Configured   └───────────┘
              username & password /                in Spark
```
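
As a rough sketch of how the three connections map onto connector settings, the helper below assembles the options for a write through the Azure Synapse connector (format `com.databricks.spark.sqldw`). The function name and the server, database, user, table and staging folder values are hypothetical placeholders, not values from this lab, and the host name pattern assumes a dedicated SQL pool in a Synapse workspace.

```python
def synapse_connector_options(server, database, user, password,
                              storage_account, container, temp_dir, table):
    """Build options for the Azure Synapse connector (hypothetical helper)."""
    # Connection 1: Spark driver -> Synapse over JDBC with username and password.
    jdbc_url = (
        f"jdbc:sqlserver://{server}.sql.azuresynapse.net:1433;"
        f"database={database};user={user};password={password};"
        "encrypt=true;trustServerCertificate=false;loginTimeout=30;"
    )
    # Connection 2: driver and executors stage data in this ADLS Gen2 folder.
    staging = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{temp_dir}"
    return {
        "url": jdbc_url,
        "tempDir": staging,
        # Connection 3: let Synapse reuse the storage account key set in the Spark session.
        "forwardSparkAzureStorageCredentials": "true",
        "dbTable": table,
    }


# Placeholder values for illustration only.
options = synapse_connector_options(
    server="my-synapse-workspace", database="my_pool", user="sqladmin",
    password="<sql password>", storage_account="strdbwsynworkshop",
    container="synstorage", temp_dir="tmp", table="events_copy",
)
print(options["tempDir"])
```

In this notebook the password would come from `secrets.sql_pw`, and the dictionary could be passed to a writer with `df.write.format("com.databricks.spark.sqldw").options(**options).mode("append").save()`.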

Note that Blob storage can only be accessed with the storage account key, whereas ADLS Gen2 can optionally use OAuth 2.0 instead.
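
A minimal sketch of that rule, assuming a hypothetical helper `storage_auth_conf` that returns the Spark configuration entries for one storage account. The helper and its parameter names are illustrative only; the configuration keys are the same ones set in the next cell.

```python
def storage_auth_conf(account, kind, account_key=None, oauth=None):
    """Return Spark conf entries for a storage account (hypothetical helper).

    kind  -- "blob" for Blob storage or "adls2" for ADLS Gen2
    oauth -- dict with "client_id", "client_secret" and "directory_id"
    """
    if kind == "blob":
        # Blob storage: the storage account key is the only option.
        if account_key is None:
            raise ValueError("Blob storage requires the storage account key")
        return {f"fs.azure.account.key.{account}.blob.core.windows.net": account_key}
    if kind != "adls2":
        raise ValueError(f"unknown storage kind: {kind!r}")
    if oauth is None:
        # ADLS Gen2 with the storage account key.
        if account_key is None:
            raise ValueError("provide either account_key or oauth")
        return {f"fs.azure.account.key.{account}.dfs.core.windows.net": account_key}
    # ADLS Gen2 with OAuth 2.0 through a service principal.
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": oauth["client_id"],
        "fs.azure.account.oauth2.client.secret": oauth["client_secret"],
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{oauth['directory_id']}/oauth2/token",
    }


# Placeholder credentials for illustration only.
key_conf = storage_auth_conf("strdbwsynworkshop", "adls2", account_key="<key>")
oauth_conf = storage_auth_conf(
    "strdbwsynworkshop", "adls2",
    oauth={"client_id": "<app id>", "client_secret": "<secret>", "directory_id": "<tenant>"},
)
print(sorted(key_conf))
```

Each returned entry could then be applied with `spark.conf.set(key, value)`.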

In [0]:
# Application ID corresponds to the App Registration / Service Principal used by Databricks
app_id = '4b309858-a987-4d5a-9a11-a84116790317'

# Directory ID is the tenant this Databricks workspace belongs to
directory_id = '6871727a-5747-424a-b9d4-39a621930267'

# Defining the service principal credentials for the Azure storage account
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", app_id)
spark.conf.set("fs.azure.account.oauth2.client.secret", secrets.sp_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint", f"https://login.microsoftonline.com/{directory_id}/oauth2/token")

# Defining a separate set of service principal credentials for Azure Synapse Analytics (If not defined, the connector will use the Azure storage account credentials)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", app_id)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", secrets.sp_secret)

# Set up the storage account key
storage_account_name = 'strdbwsynworkshop'
container_name = 'synstorage'
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", secrets.sa_secret)

In [0]:
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": app_id,
  "fs.azure.account.oauth2.client.secret": secrets.sp_secret,
  "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{directory_id}/oauth2/token"
}

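mount_point = f'/mnt/{container_name}'

# dbutils.fs.unmount raises if nothing is mounted, so only unmount an existing mount
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
  dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
  source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point = mount_point,
  extra_configs = configs
)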

In [0]:
df = spark.read.csv(f"/mnt/{container_name}/test.csv", header=True)
display(df)

In [0]:
# Use the following to clean up the table data and its streaming checkpoint
# %sh rm -rf /dbfs/mnt/synstorage/delta_test_partitioned_stream && rm -rf /dbfs/delta_test_partitioned_stream_checkpoint

from pyspark.sql.functions import expr

# spark.sql(f'DROP TABLE IF EXISTS events')
streaming_delta_location = f'/mnt/{container_name}/delta_test_partitioned_stream/'
# spark.sql(f"CREATE TABLE events (timestamp timestamp, value long, partition long) USING DELTA LOCATION '{streaming_delta_location}' PARTITIONED BY (partition)")

# Prepare streaming source; this could be Kafka or a simple rate stream.
streaming_df = (
   spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .option("numPartitions", "16")
  .load()
)

streaming_df = streaming_df.withColumn("partition", expr("value % 5"))

# The partition column was added above; now use the Structured Streaming API
# to continuously append the data to the `events` Delta table.
(
  streaming_df.writeStream
  .partitionBy("partition")
  .outputMode("append")
  .option("checkpointLocation", "/delta_test_partitioned_stream_checkpoint")
  .table("events")
)
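
The Synapse half of the diagram is not shown in this notebook. As a hedged sketch of what a Synapse Spark notebook could run, the function below reads the Delta stream from ADLS, applies a transformation and writes the result to the folder that Databricks reads back in the next cell. It assumes the `events` table is stored at `delta_test_partitioned_stream/` in the `synstorage` container; the function name, the `value_doubled` column and the checkpoint folder are hypothetical.

```python
def start_synapse_transform_stream(spark, source_path, sink_path, checkpoint_path):
    """Read the Delta stream written by Databricks, transform it, write it back to ADLS."""
    source_df = spark.readStream.format("delta").load(source_path)

    # Placeholder transformation (an assumption): add a column derived from `value`.
    transformed_df = source_df.selectExpr(
        "timestamp", "value", "partition", "value * 2 AS value_doubled"
    )

    # Start the streaming write and return the query handle.
    return (
        transformed_df.writeStream
        .format("delta")
        .partitionBy("partition")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(sink_path)
    )


base = "abfss://synstorage@strdbwsynworkshop.dfs.core.windows.net"
source_path = f"{base}/delta_test_partitioned_stream/"
sink_path = f"{base}/delta_test_partitioned_stream_synapse/"
# Hypothetical checkpoint folder; any location writable from Synapse would do.
checkpoint_path = f"{base}/delta_test_partitioned_stream_synapse_checkpoint/"

# In a Synapse notebook, where `spark` is predefined:
# query = start_synapse_transform_stream(spark, source_path, sink_path, checkpoint_path)
```

Passing `spark` in as an argument keeps the function free of notebook globals, so the same code can be reused or checked outside a Synapse session.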

In [0]:
# Once the Synapse Spark stream has been started, run this cell to see the transformed data in Databricks.
streaming_df = (
   spark.readStream
  .format("delta")
  .load(f'/mnt/{container_name}/delta_test_partitioned_stream_synapse/')
)
display(streaming_df)

In [0]:
# Example 1: Load the same Delta dataset as a static (batch) DataFrame and display it
container_name = 'synstorage'
df = spark.read.format('delta').load(f'/mnt/{container_name}/delta_test_partitioned_stream_synapse/')
display(df)