# 03a - Parallel Switch-In Load Into Partitioned Table

If you need to load data into a table that users are actively querying, you cannot simply run a bulk copy operation against it. If you use the `tableLock` option, users will not be able to access the data for the whole duration of the bulk load. Even without `tableLock`, the load will still interfere with concurrent operations running on the table partition.

The solution is simple: load another table instead, and then "switch-in" that table into the target one. More details on this pattern can be found in [this post](https://www.cathrinewilhelmsen.net/2015/04/19/table-partitioning-in-sql-server-partition-switching/) written by the Data Platform MVP Cathrine Wilhelmsen. 

Besides improving concurrency during the bulk load operation, this pattern has another very useful benefit. Without it, it is usually better to load the table with indexes already created, because for a very big table, creating an index afterwards can completely drain all the resources available to your Azure SQL database. With this technique you are taking a "divide et impera" (divide and conquer) approach: you load data into a staging table with no indexes, where you get the best possible load performance, and then create the needed indexes on the staging table, which holds only the data for one partition, without the problem of resource exhaustion.
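
The steps described above can be sketched as the T-SQL that has to run around each bulk load. This is a minimal illustration, not the code used by the companion notebook: the table, column, index, constraint, and partition function names are all hypothetical, and it assumes one Azure SQL partition per key value. SQL Server also requires the staging table to match the target table's schema and indexes before a switch is accepted.

```scala
// Hypothetical names throughout (dbo.Orders, dbo.Orders_Staging, pf_Orders, OrderMonth).
object SwitchInSketch {
  // T-SQL to run BEFORE the bulk load: start from an empty staging table.
  def preLoad(staging: String): Seq[String] =
    Seq(s"TRUNCATE TABLE $staging;")

  // T-SQL to run AFTER the bulk load, in order:
  //   1. build the index on the (small) staging table,
  //   2. add a check constraint proving every row belongs to the target partition,
  //   3. switch the staging table into the target partition (a metadata-only operation).
  def postLoad(target: String, staging: String, partitionFunction: String,
               column: String, key: Int): Seq[String] = Seq(
    s"CREATE CLUSTERED INDEX ixc_staging ON $staging ($column);",
    s"ALTER TABLE $staging WITH CHECK ADD CONSTRAINT ck_staging_$key " +
      s"CHECK ($column IS NOT NULL AND $column = $key);",
    // `$$` escapes the dollar sign so the output contains T-SQL's $PARTITION function.
    s"ALTER TABLE $staging SWITCH TO $target PARTITION $$PARTITION.$partitionFunction($key);"
  )
}
```

Calling `SwitchInSketch.postLoad("dbo.Orders", "dbo.Orders_Staging", "pf_Orders", "OrderMonth", 199702)` returns three statements, the last of which switches the staging table into the partition that holds key `199702`.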

- https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms345599(v=sql.105)?redirectedfrom=MSDN

Partitions in Databricks vs. partitions in Azure SQL:

- https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

## Create support function
To execute a switch-in load, the parallel load must be managed manually, because T-SQL code must be executed before and after each Azure SQL partition has been loaded via the bulk load operation. Note that this means an Azure SQL partition, not a Dataframe partition: a single Dataframe partition can target multiple Azure SQL partitions. Using the [technique explained in the official Databricks documentation](https://docs.databricks.com/notebooks/notebook-workflows.html#api), it is possible to execute a notebook in parallel by implementing the following function.

In [4]:
import scala.concurrent.{Future, Await}
import scala.concurrent.duration._
import scala.util.control.NonFatal

case class NotebookData(path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String])

def parallelNotebooks(notebooks: Seq[NotebookData]): Future[Seq[String]] = {
  import java.util.concurrent.Executors
  import scala.concurrent.ExecutionContext

  // If you run too many notebooks in parallel, the driver may crash when all of the jobs are submitted at once.
  // A fixed-size thread pool limits the number of notebooks running concurrently.
  val numNotebooksInParallel = 4
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
  val ctx = dbutils.notebook.getContext()
  
  Future.sequence(
    notebooks.map { notebook => 
      Future {
        dbutils.notebook.setContext(ctx)
        if (notebook.parameters.nonEmpty)
          dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
        else
          dbutils.notebook.run(notebook.path, notebook.timeout)
      }
      .recover {
        case NonFatal(e) => s"ERROR: ${e.getMessage}"
      }
    }
  )
}
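
The same pattern (a fixed-size thread pool, `Future.sequence`, and `recover` turning failures into `ERROR: ...` strings) can be tried outside Databricks. The sketch below is a plain-Scala analogue in which a hypothetical `task` function stands in for `dbutils.notebook.run`; the `runAll` name and the 30-second wait are assumptions made for illustration. Unlike the function above, it also shuts the pool down once the work is finished.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object BoundedParallelSketch {
  // Runs `task` once per input, with at most `parallelism` tasks in flight.
  // A failing task does not fail the whole batch: its slot holds "ERROR: <message>".
  def runAll(inputs: Seq[Int], parallelism: Int)(task: Int => String): Seq[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val all = Future.sequence(
        inputs.map { i =>
          Future(task(i)).recover { case NonFatal(e) => s"ERROR: ${e.getMessage}" }
        }
      )
      // Results come back in the same order as `inputs`.
      Await.result(all, 30.seconds)
    } finally pool.shutdown() // release the worker threads
  }
}
```

Because every failure is recovered into a string, the combined future always completes successfully; the caller has to inspect the returned values to find out whether any task failed.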

## Run Parallel Load

Create a sequence that stores the Azure SQL partitions to be loaded.

In [7]:
import spark.implicits._
import org.apache.spark.sql._

// One entry per Azure SQL partition to load; the key is a year-month in YYYYMM form.
case class PartitionToProcess(partitionKey: Int)

val ptp = Seq(
    PartitionToProcess(199702),
    PartitionToProcess(199703),
    PartitionToProcess(199704),
    PartitionToProcess(199706),
    PartitionToProcess(199707),
    PartitionToProcess(199708)
)
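
For longer ranges, the keys could be generated instead of typed by hand. The helper below is a hypothetical sketch that assumes the keys are year-month integers in `YYYYMM` form, as the values above suggest. It produces a contiguous range, so any month that should be skipped has to be filtered out afterwards.

```scala
import java.time.YearMonth

object PartitionKeys {
  // Inclusive range of YYYYMM integer keys between two months.
  def monthly(from: YearMonth, to: YearMonth): Seq[Int] =
    Iterator.iterate(from)(_.plusMonths(1))
      .takeWhile(ym => !ym.isAfter(to))
      .map(ym => ym.getYear * 100 + ym.getMonthValue)
      .toList
}
```

For example, `PartitionKeys.monthly(YearMonth.of(1997, 11), YearMonth.of(1998, 2))` yields `199711, 199712, 199801, 199802`.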

Execute several instances of the notebook that loads a single partition in parallel, passing a different partition key to each instance.

In [9]:
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.language.postfixOps

val timeOut = 600 // seconds

val notebooks = ptp.map { 
  p => NotebookData("./03b-parallel-switch-in-load-into-partitioned-table-single", 
                    timeOut, 
                    Map("partitionKey" -> p.partitionKey.toString)
                   )
}

val res = parallelNotebooks(notebooks)

Await.result(res, (timeOut * ptp.size).seconds) // blocking call: waits until every notebook has finished or the overall timeout expires

res.value
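
Since `parallelNotebooks` converts every failure into a string starting with `ERROR: `, the awaited result never throws for a failed notebook, and the returned values have to be checked. A minimal sketch of such a check follows. The `failedKeys` helper is hypothetical, and it assumes that a successful notebook never returns an exit value starting with `ERROR: `. It relies on `Future.sequence` returning results in the same order as the submitted notebooks.

```scala
object LoadResults {
  // Pairs each partition key with its notebook result and keeps the keys whose load failed.
  def failedKeys(keys: Seq[Int], results: Seq[String]): Seq[Int] =
    keys.zip(results).collect {
      case (key, result) if result.startsWith("ERROR: ") => key
    }
}
```

In the notebook this could be called as `LoadResults.failedKeys(ptp.map(_.partitionKey), Await.result(res, (timeOut * ptp.size).seconds))`, and the returned keys resubmitted or reported.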