# 03b - Parallel Switch-In Load Into Partitioned Table - Single Partition Load

This notebook bulk loads data into exactly one Azure SQL partition. It accepts a Partition Key as a parameter and loads all the data that belongs to that partition. In this sample, the column used to partition the data is `L_PARTITION_KEY`, which is an integer, so the provided partition key *must be* an integer too.

Data is not loaded directly into the selected partition. Instead, a staging table is created, loaded, and then switched into the target table, where it becomes the desired partition.

More info on this switch-in technique can be found in the related notebook: `03a-parallel-switch-in-load-into-partitioned-table-many`

## Notes on terminology

The term "row-store" is used to identify an index that does not use the [column-store layout](https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview) to store its data.

## Sample

This notebook loads exactly one partition of a partitioned table by loading a staging table and then switching it into the target table. The process is the following:

- Create a staging table
- Load staging table
- Create indexes
- Create check constraints
- Execute switch-in operation

More details on this pattern can be found in [this post](https://www.cathrinewilhelmsen.net/2015/04/19/table-partitioning-in-sql-server-partition-switching/) written by the Data Platform MVP Cathrine Wilhelmsen. 
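
The steps above translate into a short sequence of T-SQL statements around the bulk load. The following is a minimal sketch of that sequence, not the notebook's own code: the helper names `stagingTable`, `prepareStatements` and `switchStatements` are hypothetical, and index and check constraint creation (which sit between the load and the switch) are left out.

```scala
// Hypothetical helper: name of the staging table for one partition key.
def stagingTable(table: String, partitionKey: Int): String =
  s"${table}_STG_$partitionKey"

// T-SQL run before the bulk load: create an empty copy of the target table.
def prepareStatements(table: String, partitionKey: Int): Seq[String] = {
  val stg = stagingTable(table, partitionKey)
  Seq(
    s"DROP TABLE IF EXISTS $stg",
    s"SELECT TOP (0) * INTO $stg FROM $table"
  )
}

// T-SQL run after the bulk load, once indexes and the check constraint exist:
// empty the target partition, switch the staging table in, drop the staging table.
def switchStatements(table: String, partitionKey: Int, partitionId: Int): Seq[String] = {
  val stg = stagingTable(table, partitionKey)
  Seq(
    s"TRUNCATE TABLE $table WITH (PARTITIONS ($partitionId))",
    s"ALTER TABLE $stg SWITCH TO $table PARTITION $partitionId",
    s"DROP TABLE $stg"
  )
}
```

Keeping the statements as plain strings makes the order of operations easy to review before anything is executed against the database.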

## Supported Azure Databricks Versions

Databricks supported versions: Spark 2.4.5 and Scala 2.11

## Setup

Define notebook parameter:

In [4]:
dbutils.widgets.text("partitionKey", "0", "Partition Key")

Define the variables used throughout the script. Azure Key Vault is used to securely store sensitive data. More info here: [Create an Azure Key Vault-backed secret scope](https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope)

In [6]:
val partitionKey = dbutils.widgets.get("partitionKey").toInt
val prevPartitionKey = partitionKey

val scope = "key-vault-secrets"

val storageAccount = "dmstore2";
val storageKey = dbutils.secrets.get(scope, "dmstore2-2");

val server = dbutils.secrets.get(scope, "srv001").concat(".database.windows.net");
val database = dbutils.secrets.get(scope, "db001");
val user = dbutils.secrets.get(scope, "dbuser001");
val password = dbutils.secrets.get(scope, "dbpwd001");
val table = "dbo.LINEITEM_LOADTEST"

val url = s"jdbc:sqlserver://$server;databaseName=$database;"
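
The `.toInt` call above throws a `NumberFormatException` when the widget value is not an integer. If a clearer failure message is preferred, the parsing could be wrapped as in the sketch below. The helper `parsePartitionKey` is hypothetical and not part of the original notebook.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper: parse the raw widget text into an integer partition key,
// returning a readable error message instead of throwing.
def parsePartitionKey(raw: String): Either[String, Int] =
  Try(raw.trim.toInt) match {
    case Success(key) => Right(key)
    case Failure(_)   => Left(s"Partition key must be an integer, got: '$raw'")
  }

parsePartitionKey("199810")  // Right(199810)
parsePartitionKey("1998-10") // Left(Partition key must be an integer, got: '1998-10')
```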

Configure Spark to access Azure Blob Store

In [8]:
spark.conf.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", storageKey);

Load the Parquet file generated in the `00-create-parquet-file` notebook, which contains LINEITEM data partitioned by year and month. Make sure only the specified partition is loaded:

In [10]:
val li = spark
  .read
  .parquet(s"wasbs://tpch@$storageAccount.blob.core.windows.net/10GB/parquet/lineitem")
  .filter($"L_PARTITION_KEY" === partitionKey)
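
Since the data is partitioned by year and month, an integer key can encode both. The sketch below assumes a `yyyyMM` encoding. This is an assumption made for illustration: the actual derivation lives in the `00-create-parquet-file` notebook, which is not shown here, and the helper `partitionKeyFor` is hypothetical.

```scala
import java.time.LocalDate

// Assumption: the key packs year and month as yyyyMM, e.g. October 1998 -> 199810.
def partitionKeyFor(date: LocalDate): Int =
  date.getYear * 100 + date.getMonthValue

partitionKeyFor(LocalDate.of(1998, 10, 23)) // 199810
```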

All columns are loaded as nullable, even if they were originally set to NOT NULL, so this needs to be fixed to make sure the data can be loaded correctly.

The schema needs to be defined explicitly because the connector is very sensitive to nullability, as described in the issue [Nullable column mismatch between Spark DataFrame & SQL Table Error](https://github.com/microsoft/sql-spark-connector/issues/5). We therefore create the schema explicitly and apply it to the loaded data:

In [12]:
import org.apache.spark.sql.types._

val schema = StructType(
    StructField("L_ORDERKEY", IntegerType, false) ::
    StructField("L_PARTKEY", IntegerType, false) ::
    StructField("L_SUPPKEY", IntegerType, false) ::  
    StructField("L_LINENUMBER", IntegerType, false) ::
    StructField("L_QUANTITY", DecimalType(15,2), false) ::
    StructField("L_EXTENDEDPRICE", DecimalType(15,2), false) ::
    StructField("L_DISCOUNT", DecimalType(15,2), false) ::
    StructField("L_TAX", DecimalType(15,2), false) ::
    StructField("L_RETURNFLAG", StringType, false) ::
    StructField("L_LINESTATUS", StringType, false) ::
    StructField("L_SHIPDATE", DateType, false) ::
    StructField("L_COMMITDATE", DateType, false) ::
    StructField("L_RECEIPTDATE", DateType, false) ::
    StructField("L_SHIPINSTRUCT", StringType, false) ::  
    StructField("L_SHIPMODE", StringType, false) ::  
    StructField("L_COMMENT", StringType, false) ::  
    StructField("L_PARTITION_KEY", IntegerType, false) ::  
    Nil)
    
val li2 = spark.createDataFrame(li.rdd, schema)
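
When only the nullability flags differ and the column types read from Parquet already match the target table, the schema could also be derived from the loaded DataFrame rather than typed by hand. The following is a sketch under that assumption; `asNonNullable` is a hypothetical helper, not part of the connector or of the original notebook.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: copy a schema, marking every top-level field as NOT NULL.
// Names and data types are left untouched.
def asNonNullable(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = false)))
```

In the notebook it would be used as `spark.createDataFrame(li.rdd, asNonNullable(li.schema))`. The explicit schema above remains the safer choice when the types themselves also need to be pinned, for example `DecimalType(15,2)`.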

Create the T-SQL script needed to extract information on the partition that will be loaded into Azure SQL:

In [14]:
val sqlPartitionValueInfo = 
s"""
SELECT
	*
FROM
(
	SELECT
		prv.[boundary_id] AS partitionId,
		CAST(prv.[value] AS INT) AS [value],
		CAST(LAG(prv.[value]) OVER (ORDER BY prv.[boundary_id]) AS INT) AS [prevValue],
		CAST(LEAD(prv.[value]) OVER (ORDER BY prv.[boundary_id]) AS INT) AS [nextValue]
	FROM
		sys.[indexes] i
	INNER JOIN
		sys.[data_spaces] dp ON i.[data_space_id] = dp.[data_space_id]
	INNER JOIN
		sys.[partition_schemes] ps ON dp.[data_space_id] = ps.[data_space_id]
	INNER JOIN
		sys.[partition_range_values] prv ON [prv].[function_id] = [ps].[function_id]
	WHERE
		i.[object_id] = OBJECT_ID('${table}')
	AND
		i.[index_id] IN (0,1)
) AS [pi]
WHERE
	[value] = ${partitionKey}
"""
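
To make the `LAG`/`LEAD` logic of the query easier to follow, here is a minimal in-memory equivalent. It assumes the partition function's boundary values are available as a plain sequence; the names `BoundaryInfo` and `boundaryInfo` and the sample boundary values are hypothetical.

```scala
// Mirrors the columns returned by the metadata query.
case class BoundaryInfo(partitionId: Int, value: Int, prevValue: Option[Int], nextValue: Option[Int])

// boundary_id is the 1-based position of each boundary value in ascending order.
// Like the query's final WHERE clause, a result is returned only when the key
// is exactly equal to one of the boundary values.
def boundaryInfo(boundaries: Seq[Int], partitionKey: Int): Option[BoundaryInfo] = {
  val sorted = boundaries.sorted
  val idx = sorted.indexOf(partitionKey)
  if (idx < 0) None
  else Some(BoundaryInfo(
    partitionId = idx + 1,
    value = sorted(idx),
    prevValue = sorted.lift(idx - 1),  // LAG: None for the first boundary
    nextValue = sorted.lift(idx + 1))) // LEAD: None for the last boundary
}

boundaryInfo(Seq(199808, 199809, 199810, 199811), 199810)
// Some(BoundaryInfo(3,199810,Some(199809),Some(199811)))
```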

Set up a JDBC connection, needed to execute ad-hoc T-SQL statements on Azure SQL:

In [16]:
val connectionProperties = new java.util.Properties()
connectionProperties.put("user", user)
connectionProperties.put("password", password)
connectionProperties.setProperty("Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
val conn = java.sql.DriverManager.getConnection(url, connectionProperties)
val st = conn.createStatement()

Load Azure SQL partition metadata

In [18]:
case class PartitionInfo(partitionId: Int, value: Int, prevValue: Option[Int], nextValue: Option[Int]);
val piDF = spark.read.jdbc(url, s"($sqlPartitionValueInfo) AS t", connectionProperties)
val pi = piDF.as[PartitionInfo].collect()(0)

Create a staging table on Azure SQL, into which the data will be bulk loaded:

In [20]:
st.execute(s"DROP TABLE IF EXISTS ${table}_STG_${partitionKey}")
st.execute(s"SELECT TOP (0) * INTO ${table}_STG_${partitionKey} FROM ${table}")

Create the same indexes that the target table has, in order to allow switch-in

In [22]:
st.execute(s"CREATE CLUSTERED INDEX IXC ON ${table}_STG_${partitionKey} ([L_COMMITDATE], [L_PARTITION_KEY])")
st.execute(s"CREATE UNIQUE NONCLUSTERED INDEX IX1 ON ${table}_STG_${partitionKey} ([L_ORDERKEY], [L_LINENUMBER], [L_PARTITION_KEY])")
st.execute(s"CREATE NONCLUSTERED INDEX IX2 ON ${table}_STG_${partitionKey} ([L_PARTKEY], [L_PARTITION_KEY])")

Load the staging table

In [24]:
li2.write 
  .format("com.microsoft.sqlserver.jdbc.spark") 
  .mode("overwrite")   
  .option("truncate", "true") 
  .option("url", url) 
  .option("dbtable", s"${table}_STG_${partitionKey}") 
  .option("user", user) 
  .option("password", password) 
  .option("reliabilityLevel", "BEST_EFFORT") 
  .option("tableLock", "false") 
  .option("batchsize", "100000")   
  .save()

Add a check constraint on the table to allow switch-in

In [26]:
if (pi.prevValue.isEmpty) {
  st.execute(s"ALTER TABLE ${table}_STG_${partitionKey} ADD CONSTRAINT ck_partition_${partitionKey} CHECK (L_PARTITION_KEY <= ${pi.value})")
} else {
  st.execute(s"ALTER TABLE ${table}_STG_${partitionKey} ADD CONSTRAINT ck_partition_${partitionKey} CHECK (L_PARTITION_KEY > ${pi.prevValue.get} AND L_PARTITION_KEY <= ${pi.value})")
}
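
The constraint restricts the staging table to the range `(prevValue, value]`, which is what the switch-in requires for the rows to fit the target partition. As a sketch, the predicate could be built with pattern matching on the `Option` instead of `.get`. The helper `checkPredicate` is hypothetical.

```scala
// Hypothetical helper: CHECK predicate for the key range (prevValue, value].
// The first partition has no lower bound, so only the upper bound is checked.
def checkPredicate(prevValue: Option[Int], value: Int): String = prevValue match {
  case Some(prev) => s"L_PARTITION_KEY > $prev AND L_PARTITION_KEY <= $value"
  case None       => s"L_PARTITION_KEY <= $value"
}

checkPredicate(Some(199809), 199810)
// L_PARTITION_KEY > 199809 AND L_PARTITION_KEY <= 199810
```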

Delete the data in the existing partition of the target table, execute the switch-in, and drop the staging table:

In [28]:
st.execute(s"TRUNCATE TABLE ${table} WITH (PARTITIONS (${pi.partitionId}))")
st.execute(s"ALTER TABLE ${table}_STG_${partitionKey} SWITCH TO ${table} PARTITION ${pi.partitionId}")
st.execute(s"DROP TABLE ${table}_STG_${partitionKey}")
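
As written, the three statements run in auto-commit mode, so a failure of the switch would leave the target partition already truncated. One possible hardening, sketched here as an assumption rather than as part of the original sample, is to run them in a single JDBC transaction. The helper `inTransaction` is hypothetical.

```scala
import java.sql.Connection

// Hypothetical helper: run `body` in one transaction on `conn`.
// Commits on success, rolls back on any failure, and restores the
// connection's previous auto-commit setting in both cases.
def inTransaction[T](conn: Connection)(body: => T): T = {
  val previous = conn.getAutoCommit
  conn.setAutoCommit(false)
  try {
    val result = body
    conn.commit()
    result
  } catch {
    case e: Throwable =>
      conn.rollback()
      throw e
  } finally {
    conn.setAutoCommit(previous)
  }
}

// Usage with the notebook's `conn` and `st` (not executed here):
// inTransaction(conn) {
//   st.execute(s"TRUNCATE TABLE ${table} WITH (PARTITIONS (${pi.partitionId}))")
//   st.execute(s"ALTER TABLE ${table}_STG_${partitionKey} SWITCH TO ${table} PARTITION ${pi.partitionId}")
// }
```

Dropping the staging table can stay outside the transaction, since it is empty once the switch has succeeded.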

Done!

In [30]:
dbutils.notebook.exit(partitionKey.toString)

199810