# 03a - Parallel Switch-In Load Into Partitioned Table

If you have to load data into a table that users are actively querying, you cannot simply run a bulk copy operation against it. With the `tableLock` option, users cannot access the data for the whole duration of the bulk load; even without `tableLock`, the bulk load still interferes with concurrent operations running on the same table partition.

To get more details on partitioning, take a look at the `02-load-into-partitioned-table` notebook.

The solution that lets you bulk load data while keeping the table usable by applications and users is simple: load another table instead, and then "switch-in" that table into the target one. More details on this pattern can be found in [this post](https://www.cathrinewilhelmsen.net/2015/04/19/table-partitioning-in-sql-server-partition-switching/) written by the Data Platform MVP Cathrine Wilhelmsen.
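
As a rough illustration of the pattern, the sketch below builds the two T-SQL statements that typically wrap a switch-in: a check constraint proving that the staging rows fit the target partition, and the `alter table ... switch` statement itself. The `SwitchInSketch` object, the constraint name, and any staging table name passed to it are hypothetical and are not taken from the sample notebooks. The sketch assumes a `range left` partition function and a staging table whose columns, nullability, and indexes already match the target.

```scala
object SwitchInSketch {
  // Hypothetical helper: returns the T-SQL statements to run after the staging
  // table has been loaded. `lowerExclusive` and `upperInclusive` are the
  // boundaries of the target partition under a RANGE LEFT partition function.
  def statements(
      target: String,
      staging: String,
      partitionFunction: String,
      column: String,
      lowerExclusive: Int,
      upperInclusive: Int): Seq[String] = Seq(
    // 1. Prove to the engine that every staging row belongs to the target partition.
    s"alter table $staging add constraint ck_switch_in_$upperInclusive " +
      s"check ($column > $lowerExclusive and $column <= $upperInclusive);",
    // 2. Switch the staging table into the partition that holds `upperInclusive`.
    //    `$$` emits a literal `$`, producing the T-SQL $partition function call.
    s"alter table $staging switch to $target " +
      s"partition $$partition.$partitionFunction($upperInclusive);"
  )
}
```

For example, `SwitchInSketch.statements("dbo.[LINEITEM_LOADTEST]", "dbo.[LINEITEM_STAGING_199702]", "pf_LINEITEM", "[L_PARTITION_KEY]", 199701, 199702)` would produce the statements for partition key 199702, where `LINEITEM_STAGING_199702` is an illustrative name only.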

Besides improving concurrency during the bulk load, this approach has another useful benefit. Without switch-in, it is usually better to load the table with its indexes already in place because, on very big tables, creating an index afterwards can drain all the resources available to your Azure SQL database. With this technique you are taking a "divide-et-impera" (divide and conquer) approach: you load data into a staging table with no indexes, where load performance is at its best, and then create the needed indexes, with a much lower impact on resources. The impact is lower because you are only loading the data that goes into a single partition, not the whole table, so the data set is smaller and much more manageable. By repeating this process for every partition you need to load, you can load data without weighing too heavily on Azure SQL resources and, therefore, on query performance.

Because Apache Spark RDD partitions and Azure SQL partitions are in a 1:N relationship, it is not possible for the Azure SQL Connector to easily determine which staging table should be used and how to do the switch-in. Luckily, we can do this operation manually, using a [well documented technique](https://docs.databricks.com/notebooks/notebook-workflows.html) that helps Apache Spark maximize parallelism when loading Azure SQL partitions.

The sample is using the new sql-spark-connector (https://github.com/microsoft/sql-spark-connector). Make sure you have installed it before running this notebook.

## Notes on terminology

The term "row-store" is used to identify an index that is not using the [column-store layout](https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview) to store its data.

## Sample

This notebook parallelizes the work done by another notebook (`03b-parallel-switch-in-load-into-partitioned-table-single.ipynb`), which is the one that actually loads the data into a staging table via bulk copy and then performs the switch-in operation.

## Supported Azure Databricks Versions

Databricks supported versions: Spark 2.4.5 and Scala 2.11

## Create Target Table

Create the table and its indexes.

Make sure you create the following `LINEITEM_LOADTEST` table, partitioned by `L_PARTITION_KEY`, in your Azure SQL database:

```sql
create partition function pf_LINEITEM(int)
as range left for values 
(
	199201,199202,199203,199204,199205,199206,199207,199208,199209,199210,199211,199212,
	199301,199302,199303,199304,199305,199306,199307,199308,199309,199310,199311,199312,
	199401,199402,199403,199404,199405,199406,199407,199408,199409,199410,199411,199412,
	199501,199502,199503,199504,199505,199506,199507,199508,199509,199510,199511,199512,
	199601,199602,199603,199604,199605,199606,199607,199608,199609,199610,199611,199612,
	199701,199702,199703,199704,199705,199706,199707,199708,199709,199710,199711,199712,
	199801,199802,199803,199804,199805,199806,199807,199808,199809,199810
);

create partition scheme ps_LINEITEM
as partition pf_LINEITEM
all to ([Primary])
;

create table [dbo].[LINEITEM_LOADTEST]
(
	[L_ORDERKEY] [int] not null,
	[L_PARTKEY] [int] not null,
	[L_SUPPKEY] [int] not null,
	[L_LINENUMBER] [int] not null,
	[L_QUANTITY] [decimal](15, 2) not null,
	[L_EXTENDEDPRICE] [decimal](15, 2) not null,
	[L_DISCOUNT] [decimal](15, 2) not null,
	[L_TAX] [decimal](15, 2) not null,
	[L_RETURNFLAG] [char](1) not null,
	[L_LINESTATUS] [char](1) not null,
	[L_SHIPDATE] [date] not null,
	[L_COMMITDATE] [date] not null,
	[L_RECEIPTDATE] [date] not null,
	[L_SHIPINSTRUCT] [char](25) not null,
	[L_SHIPMODE] [char](10) not null,
	[L_COMMENT] [varchar](44) not null,
	[L_PARTITION_KEY] [int] not null
) on ps_LINEITEM([L_PARTITION_KEY])
;

create clustered index IXC on dbo.[LINEITEM_LOADTEST] ([L_COMMITDATE]) 
on ps_LINEITEM([L_PARTITION_KEY]);

create unique nonclustered index IX1 on dbo.[LINEITEM_LOADTEST] ([L_ORDERKEY], [L_LINENUMBER], [L_PARTITION_KEY]) 
on ps_LINEITEM([L_PARTITION_KEY]);

create nonclustered index IX2 on dbo.[LINEITEM_LOADTEST] ([L_PARTKEY], [L_PARTITION_KEY]) 
on ps_LINEITEM([L_PARTITION_KEY]);
```
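
To make the effect of `range left` concrete, here is a minimal sketch that mirrors the boundary list of `pf_LINEITEM` and computes which partition number a key falls into. The `RangeLeftSketch` object is hypothetical and exists only for illustration; inside the database you would normally ask the engine directly through the `$partition` function.

```scala
object RangeLeftSketch {
  // Boundary values of pf_LINEITEM: one per month, from 199201 through 199810.
  val boundaries: Vector[Int] =
    (for (year <- 1992 to 1998; month <- 1 to 12) yield year * 100 + month)
      .filter(_ <= 199810)
      .toVector

  // Under RANGE LEFT each boundary is the inclusive upper limit of its partition,
  // so the 1-based partition number is 1 + the number of boundaries strictly
  // below the key. Keys above the last boundary land in the final partition.
  def partitionNumber(key: Int): Int = boundaries.count(_ < key) + 1
}
```

With 82 boundary values the function defines 83 partitions; `RangeLeftSketch.partitionNumber(199702)` evaluates to 62, and any key above 199810 maps to partition 83.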

## Create support function
To execute a switch-in load, the parallel load must be managed manually, because T-SQL code must be executed before and after each Azure SQL partition has been loaded via the bulk load operation. Note that this means an Azure SQL partition, not a DataFrame partition: a DataFrame partition can target multiple Azure SQL partitions. By using the [technique explained in the official Databricks documentation](https://docs.databricks.com/notebooks/notebook-workflows.html#api), it is possible to execute a notebook in parallel by implementing the following function.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

// Describes one notebook run: its path, its timeout in seconds, and its parameters.
case class NotebookData(path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String])

def parallelNotebooks(notebooks: Seq[NotebookData]): Future[Seq[String]] = {
  val numNotebooksInParallel = 4
  // If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once.
  // This code limits the number of parallel notebooks.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
  // Capture the notebook context here so that it can be propagated to the worker threads.
  val ctx = dbutils.notebook.getContext()

  Future.sequence(
    notebooks.map { notebook =>
      Future {
        dbutils.notebook.setContext(ctx)
        if (notebook.parameters.nonEmpty)
          dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
        else
          dbutils.notebook.run(notebook.path, notebook.timeout)
      }
      .recover {
        // Turn a failed run into a result string so the other runs can still complete.
        case NonFatal(e) => s"ERROR: ${e.getMessage}"
      }
    }
  )
}
```
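
The same bounded-parallelism idea can be tried outside Databricks. The following is a minimal sketch, not the notebook's implementation: it replaces `dbutils.notebook.run` with plain functions, and the `BoundedParallelSketch` object, its `runAll` method, and the one-minute timeout are assumptions made for illustration. It shows that results come back in input order and that a failing task is reported as an `ERROR:` string instead of failing the whole batch.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object BoundedParallelSketch {
  // Runs every task on a pool of at most `parallelism` threads and returns
  // one string per task, in input order. Failures become "ERROR: ..." strings.
  def runAll(tasks: Seq[() => String], parallelism: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val all = Future.sequence(tasks.map { task =>
        Future(task()).recover { case NonFatal(e) => s"ERROR: ${e.getMessage}" }
      })
      // Block until every task has finished (or the timeout expires).
      Await.result(all, 1.minute)
    } finally {
      // Release the worker threads once the batch is done.
      pool.shutdown()
    }
  }
}
```

Unlike `parallelNotebooks`, this sketch blocks internally and shuts the thread pool down when the batch completes, which is a design choice of the example and not something the original function does.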

## Run Parallel Load

Create a sequence that stores the Azure SQL partitions to be loaded.

```scala
import spark.implicits._
import org.apache.spark.sql._

case class partitionToProcess(partitionKey:Int)

val ptp = Seq(
    partitionToProcess(199702),
    partitionToProcess(199703),
    partitionToProcess(199704),
    partitionToProcess(199706),
    partitionToProcess(199707),
    partitionToProcess(199708)
)
```

Execute in parallel several instances of the notebook that loads a specific partition, using a different partition key for each instance.

```scala
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.language.postfixOps

val timeOut = 600 // seconds
val ipynb = "./03b-parallel-switch-in-load-into-partitioned-table-single"

// One notebook run per partition, each receiving its own partitionKey parameter.
val notebooks = ptp.map(p => NotebookData(ipynb, timeOut, Map("partitionKey" -> p.partitionKey.toString)))
```

```scala
val res = parallelNotebooks(notebooks)

Await.result(res, (timeOut * ptp.size seconds)) // this is a blocking call.

res.value
```
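
Because `parallelNotebooks` turns failures into strings that start with `ERROR:` and `Future.sequence` keeps the input order, the results can be paired back with the partition keys to see which loads need a retry. The sketch below is one possible way to do that; the `LoadReportSketch` object is hypothetical, and it assumes the loading notebook never returns a legitimate value that begins with `ERROR:`.

```scala
object LoadReportSketch {
  // Pairs each partition key with the result of its notebook run and keeps
  // only the runs that were reported as failures.
  def failed(keys: Seq[Int], results: Seq[String]): Seq[(Int, String)] =
    keys.zip(results).filter { case (_, result) => result.startsWith("ERROR:") }
}
```

In this notebook it could be called as `LoadReportSketch.failed(ptp.map(_.partitionKey), Await.result(res, timeOut * ptp.size seconds))`.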