# 01 - Load data into an Azure SQL non-partitioned table

This sample uses the new [sql-spark-connector](https://github.com/microsoft/sql-spark-connector). The connector must be installed manually by importing its .jar file (available from the releases page of the GitHub repository) into the cluster.

## Notes on terminology

The term "row-store" is used to identify an index that does not use the [column-store layout](https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview) to store its data.

## Samples

This notebook contains three samples:

- Load data into a table without indexes
- Load data into a table with row-store indexes
- Load data into a table with column-store indexes

## Supported Azure Databricks Versions

Databricks supported versions: Spark 2.4.5 and Scala 2.11

## Setup

Define the variables used throughout the script. Azure Key Vault is used to securely store sensitive data. More info here: [Create an Azure Key Vault-backed secret scope](https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope)

```scala
// Name of the Key Vault-backed secret scope
val scope = "key-vault-secrets"

// Storage account holding the Parquet files
val storageAccount = "dmstore2"
val storageKey = dbutils.secrets.get(scope, "dmstore2-2")

// Azure SQL connection details, all read from the secret scope
val server = dbutils.secrets.get(scope, "srv001").concat(".database.windows.net")
val database = dbutils.secrets.get(scope, "db001")
val user = dbutils.secrets.get(scope, "dbuser001")
val password = dbutils.secrets.get(scope, "dbpwd001")
val table = "dbo.LINEITEM_LOADTEST"
```


Configure Spark to access Azure Blob Storage:

```scala
spark.conf.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", storageKey)
```

Load the Parquet file generated in the `00-create-parquet-file` notebook, which contains LINEITEM data partitioned by year and month:

```scala
val li = spark.read.parquet(s"wasbs://tpch@$storageAccount.blob.core.windows.net/10GB/parquet/lineitem")
```

The loaded data is split into 20 DataFrame partitions:

```scala
li.rdd.getNumPartitions
```

Show the schema of the loaded data:

```scala
li.printSchema
```

All columns are shown as nullable, even if they were originally set to NOT NULL. The connector is very sensitive to nullability, as described in the issue [Nullable column mismatch between Spark DataFrame & SQL Table Error](https://github.com/microsoft/sql-spark-connector/issues/5), so we need to create the schema explicitly and apply it to the loaded data to make sure it can be loaded correctly.

```scala
import org.apache.spark.sql.types._

// Explicit schema: every column is declared non-nullable (third argument = false)
// so that it matches the NOT NULL columns of the target table.
val schema = StructType(
    StructField("L_ORDERKEY", IntegerType, false) ::
    StructField("L_PARTKEY", IntegerType, false) ::
    StructField("L_SUPPKEY", IntegerType, false) ::
    StructField("L_LINENUMBER", IntegerType, false) ::
    StructField("L_QUANTITY", DecimalType(15,2), false) ::
    StructField("L_EXTENDEDPRICE", DecimalType(15,2), false) ::
    StructField("L_DISCOUNT", DecimalType(15,2), false) ::
    StructField("L_TAX", DecimalType(15,2), false) ::
    StructField("L_RETURNFLAG", StringType, false) ::
    StructField("L_LINESTATUS", StringType, false) ::
    StructField("L_SHIPDATE", DateType, false) ::
    StructField("L_COMMITDATE", DateType, false) ::
    StructField("L_RECEIPTDATE", DateType, false) ::
    StructField("L_SHIPINSTRUCT", StringType, false) ::
    StructField("L_SHIPMODE", StringType, false) ::
    StructField("L_COMMENT", StringType, false) ::
    StructField("L_PARTITION_KEY", IntegerType, false) ::
    Nil)

// Re-create the DataFrame from the same rows, using the non-nullable schema
val li2 = spark.createDataFrame(li.rdd, schema)
```
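As an alternative to spelling out every field, the non-nullable schema could be derived from the schema Spark already inferred. The following is a minimal sketch, not part of the original notebook: `asNonNullable` is a hypothetical helper name, and it assumes the column types read from Parquet already match the target table and that the data really contains no nulls.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: copies a schema, marking every field as non-nullable.
// Column names, types and order are left unchanged.
def asNonNullable(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = false)))

// Small illustrative schema (two of the LINEITEM columns, as Spark reports them after loading)
val example = StructType(
    StructField("L_ORDERKEY", IntegerType, true) ::
    StructField("L_COMMENT", StringType, true) ::
    Nil)

val strict = asNonNullable(example)

// In the notebook this could be used as:
//   val li2 = spark.createDataFrame(li.rdd, asNonNullable(li.schema))
```

Writing the schema by hand, as done above, has the advantage of documenting the expected types next to the table definition, so the derived version is mostly useful for wide tables.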

Now make sure you create the following LINEITEM table in your Azure SQL database:
```sql
create table [dbo].[LINEITEM_LOADTEST]
(
	[L_ORDERKEY] [int] not null,
	[L_PARTKEY] [int] not null,
	[L_SUPPKEY] [int] not null,
	[L_LINENUMBER] [int] not null,
	[L_QUANTITY] [decimal](15, 2) not null,
	[L_EXTENDEDPRICE] [decimal](15, 2) not null,
	[L_DISCOUNT] [decimal](15, 2) not null,
	[L_TAX] [decimal](15, 2) not null,
	[L_RETURNFLAG] [char](1) not null,
	[L_LINESTATUS] [char](1) not null,
	[L_SHIPDATE] [date] not null,
	[L_COMMITDATE] [date] not null,
	[L_RECEIPTDATE] [date] not null,
	[L_SHIPINSTRUCT] [char](25) not null,
	[L_SHIPMODE] [char](10) not null,
	[L_COMMENT] [varchar](44) not null,
	[L_PARTITION_KEY] [int] not null
) 
```

## Load data into a table with no indexes

In Azure SQL terminology, a heap is a table with no clustered index. In this sample we'll load data into a table that has no indexes (clustered or non-clustered) and is not partitioned. This is the simplest possible scenario and allows data to be loaded in parallel.

### Note:
Parallel load *cannot* happen if you have row-store indexes on the table. If you want to bulk load data in parallel into a table that has row-store indexes, you must use partitioning. If you are planning to add indexes to your table and the data to be loaded is in the terabyte range, use partitioning and create the indexes before bulk loading data into Azure SQL; otherwise, creating the indexes once the table is already loaded will use a significant amount of resources.

To enable parallel load, the `tableLock` option must be set to `true`. This prevents any access to the table other than that performed by the bulk load operations.

```scala
val url = s"jdbc:sqlserver://$server;databaseName=$database;"

li2.write
  .format("com.microsoft.sqlserver.jdbc.spark")
  .mode("overwrite")
  .option("truncate", "true")    // empty the existing table instead of dropping and re-creating it
  .option("url", url)
  .option("dbtable", table)      // dbo.LINEITEM_LOADTEST, defined in the setup cell
  .option("user", user)
  .option("password", password)
  .option("reliabilityLevel", "BEST_EFFORT")
  .option("tableLock", "true")   // required for parallel load into a heap
  .option("batchsize", "100000")
  .save()
```
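The write calls in this notebook differ only in `tableLock`, `batchsize`, and whether the DataFrame is coalesced first. A minimal sketch of how the shared options could be collected in one place is shown below. `bulkLoadOptions` is a hypothetical helper, not part of the connector, and the URL and credentials in the example are placeholders.

```scala
// Hypothetical helper: builds the connector options used by the loads in this notebook.
// Only tableLock and batchSize vary between scenarios.
def bulkLoadOptions(
    url: String,
    table: String,
    user: String,
    password: String,
    tableLock: Boolean,
    batchSize: Int): Map[String, String] =
  Map(
    "url" -> url,
    "dbtable" -> table,
    "user" -> user,
    "password" -> password,
    "truncate" -> "true",
    "reliabilityLevel" -> "BEST_EFFORT",
    "tableLock" -> tableLock.toString,
    "batchsize" -> batchSize.toString)

// Options for the heap scenario (placeholder connection values)
val heapOptions = bulkLoadOptions(
  url = "jdbc:sqlserver://example.database.windows.net;databaseName=exampledb;",
  table = "dbo.LINEITEM_LOADTEST",
  user = "exampleUser",
  password = "examplePassword",
  tableLock = true,
  batchSize = 100000)

// In the notebook the map could be passed to the writer in one call:
//   li2.write.format("com.microsoft.sqlserver.jdbc.spark")
//     .mode("overwrite").options(heapOptions).save()
```

Keeping the options in a plain `Map` makes it easy to see at a glance which settings change from one scenario to the next.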

## Load data into a table with row-store indexes

If the table is not partitioned, there is no way to bulk load data into it in parallel. The only way to avoid locking and deadlocks is to serialize the bulk load operations. As you can expect, performance will not be optimal.

Create the following indexes on the table:
```sql
create clustered index IXC on dbo.[LINEITEM_LOADTEST] ([L_COMMITDATE]);

create unique nonclustered index IX1 on dbo.[LINEITEM_LOADTEST] ([L_ORDERKEY], [L_LINENUMBER]);

create nonclustered index IX2 on dbo.[LINEITEM_LOADTEST] ([L_PARTKEY]); 
```

Load the data by coalescing all DataFrame partitions into just one, so that a single bulk load operation runs at a time:

```scala
val url = s"jdbc:sqlserver://$server;databaseName=$database;"

// coalesce(1) collapses the DataFrame into a single partition, serializing the load
li2.coalesce(1).write
  .format("com.microsoft.sqlserver.jdbc.spark")
  .mode("overwrite")
  .option("truncate", "true")
  .option("url", url)
  .option("dbtable", table)      // dbo.LINEITEM_LOADTEST, defined in the setup cell
  .option("user", user)
  .option("password", password)
  .option("reliabilityLevel", "BEST_EFFORT")
  .option("tableLock", "false")
  .option("batchsize", "100000")
  .save()
```

## Load data into a table with (only) column-store indexes

If a table has only column-store indexes, data can be loaded in parallel, as no sorting is needed.

Empty the table if needed, to speed up index deletion:

```sql
truncate table dbo.[LINEITEM_LOADTEST];
```

Drop the previously created indexes if needed:
```sql
drop index IXC on dbo.[LINEITEM_LOADTEST];
drop index IX1 on dbo.[LINEITEM_LOADTEST];
drop index IX2 on dbo.[LINEITEM_LOADTEST];
```

And then create a clustered columnstore index:

```sql
create clustered columnstore index IXCCS on dbo.[LINEITEM_LOADTEST];
```

Load data following the [columnstore data loading best practices](https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-data-loading-guidance), loading 1048576 rows at a time so that each batch lands directly in a compressed rowgroup. The `tableLock` option must be set to `false` to avoid a table lock that would prevent parallel load. Data will be loaded in parallel, using as many Apache Spark workers as are available.

```scala
val url = s"jdbc:sqlserver://$server;databaseName=$database;"

li2.write
  .format("com.microsoft.sqlserver.jdbc.spark")
  .mode("overwrite")
  .option("truncate", "true")
  .option("url", url)
  .option("dbtable", table)        // dbo.LINEITEM_LOADTEST, defined in the setup cell
  .option("user", user)
  .option("password", password)
  .option("reliabilityLevel", "BEST_EFFORT")
  .option("tableLock", "false")    // a table lock would prevent parallel load
  .option("batchsize", "1048576")  // maximum number of rows in a columnstore rowgroup
  .save()
```
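To see why the batch size matters, the arithmetic can be sketched as follows. The columnstore loading guidance states that a bulk-loaded batch of at least 102,400 rows is compressed directly into the columnstore, while a smaller batch goes to a delta rowgroup first, and that a rowgroup holds at most 1,048,576 rows. The helper below is hypothetical and assumes each Spark partition is written in consecutive batches of `batchsize` rows. The 3,000,000-row partition is an illustrative figure, not a measurement from this dataset.

```scala
// Limits taken from the columnstore data loading guidance
val MaxRowGroupRows = 1048576L   // maximum rows in one rowgroup
val MinCompressedRows = 102400L  // smaller bulk-load batches land in a delta rowgroup

// Hypothetical summary of how one partition splits into batches
final case class BatchPlan(fullBatches: Long, trailingRows: Long, trailingGoesToDeltaStore: Boolean)

def planBatches(rowsInPartition: Long, batchSize: Long = MaxRowGroupRows): BatchPlan = {
  require(rowsInPartition >= 0 && batchSize > 0, "row count must be >= 0 and batch size > 0")
  val trailing = rowsInPartition % batchSize
  BatchPlan(
    fullBatches = rowsInPartition / batchSize,
    trailingRows = trailing,
    // only a non-empty trailing batch below the threshold misses direct compression
    trailingGoesToDeltaStore = trailing > 0 && trailing < MinCompressedRows)
}

// Illustrative partition of 3,000,000 rows: two full batches plus 902,848 trailing rows
val plan = planBatches(3000000L)
```

Under these assumptions, only the last batch of each partition can fall below the threshold, so with 20 partitions at most 20 small batches would end up in delta rowgroups.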