# 01 - Load data into an Azure SQL non-partitioned, non-indexed heap table

In Azure SQL terminology, a heap is a table with no clustered index. In this sample we'll load data into a table that has no index (clustered or non-clustered) and is not partitioned. This is the simplest possible scenario and allows data to be loaded in parallel.

This sample uses both the new connector, sql-spark-connector (https://github.com/microsoft/sql-spark-connector), and the previous one, azure-sqldb-spark (https://github.com/Azure/azure-sqldb-spark). To install the _new connector_, manually import the .jar file (available from the GitHub repo's releases) into the cluster. To install the previous one, import the library directly from the Databricks portal using the coordinates "com.microsoft.azure:azure-sqldb-spark:1.0.2".

Define the variables used throughout the script. Azure Key Vault is used to securely store sensitive data. More info here: [Create an Azure Key Vault-backed secret scope](https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope)

In [3]:
// Databricks secret scope backed by Azure Key Vault
val scope = "key-vault-secrets"

// Storage account that holds the Parquet files
val storageAccount = "dmstore2"
val storageKey = dbutils.secrets.get(scope, "dmstore2-2")

// Azure SQL connection details
val server = dbutils.secrets.get(scope, "srv001").concat(".database.windows.net")
val database = dbutils.secrets.get(scope, "db001")
val user = dbutils.secrets.get(scope, "dbuser001")
val password = dbutils.secrets.get(scope, "dbpwd001")
val table = "dbo.LINEITEM_LOADTEST"


Configure Spark to access Azure Blob Storage

In [5]:
spark.conf.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", storageKey);

Load the Parquet file generated in the `00-create-parquet-file` notebook, which contains LINEITEM data partitioned by year and month

In [7]:
val li = spark.read.parquet(s"wasbs://tpch@$storageAccount.blob.core.windows.net/10GB/parquet/lineitem")

The loaded data is split into 20 partitions

In [9]:
li.rdd.getNumPartitions

Show data distribution across partitions

In [11]:
display(li.groupBy($"L_PARTITION_KEY").count.orderBy($"L_PARTITION_KEY"))

L_PARTITION_KEY,count
199201,412
199202,190252
199203,582150
199204,748645
199205,770266
199206,746006
199207,772006
199208,770213
199209,748997
199210,771256
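
The query above groups by the `L_PARTITION_KEY` column, which is a value stored in the data, not by the Spark partitions counted by `getNumPartitions`. To look at how rows are spread across the Spark partitions themselves, a minimal sketch could use Spark's built-in `spark_partition_id()` function. The helper name `rowsPerSparkPartition` and the `spark_partition` column alias are hypothetical and not part of the original notebook.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical helper: counts the rows held by each Spark partition of `df`.
def rowsPerSparkPartition(df: DataFrame): DataFrame =
  df.groupBy(spark_partition_id().as("spark_partition"))
    .count()
    .orderBy("spark_partition")
```

In the notebook this could be called as `display(rowsPerSparkPartition(li))`. A heavily skewed result would suggest that some write tasks will carry far more rows than others.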


Show schema of loaded data

In [13]:
li.printSchema

All columns are shown as nullable, even if they were originally set to NOT NULL, so we will need to keep this in mind later.

Make sure you create the following LINEITEM table in your Azure SQL database:
```sql
create table [dbo].[LINEITEM_LOADTEST]
(
	[L_ORDERKEY] [int] not null,
	[L_PARTKEY] [int] not null,
	[L_SUPPKEY] [int] not null,
	[L_LINENUMBER] [int] not null,
	[L_QUANTITY] [decimal](15, 2) not null,
	[L_EXTENDEDPRICE] [decimal](15, 2) not null,
	[L_DISCOUNT] [decimal](15, 2) not null,
	[L_TAX] [decimal](15, 2) not null,
	[L_RETURNFLAG] [char](1) not null,
	[L_LINESTATUS] [char](1) not null,
	[L_SHIPDATE] [date] not null,
	[L_COMMITDATE] [date] not null,
	[L_RECEIPTDATE] [date] not null,
	[L_SHIPINSTRUCT] [char](25) not null,
	[L_SHIPMODE] [char](10) not null,
	[L_COMMENT] [varchar](44) not null,
	[L_PARTITION_KEY] [int] not null
) 
```

In [16]:
display(li.filter($"L_PARTITION_KEY" === 199202))

L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_LINENUMBER,L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT,L_TAX,L_RETURNFLAG,L_LINESTATUS,L_SHIPDATE,L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT,L_PARTITION_KEY
1248,1502741,27757,2,37.0,64515.79,0.06,0.06,R,F,1992-01-26,1992-02-05,1992-02-13,COLLECT COD,TRUCK,. final requests integrate quickly. blit,199202
1248,1551688,51689,4,49.0,85240.89,0.02,0.01,A,F,1992-04-24,1992-02-18,1992-05-03,TAKE BACK RETURN,AIR,beans run quickly according to the carefu,199202
2983,480390,55403,2,11.0,15074.07,0.09,0.06,A,F,1992-04-29,1992-02-27,1992-05-26,NONE,MAIL,aids integrate s,199202
3011,1975917,937,1,5.0,9964.1,0.02,0.04,R,F,1992-04-21,1992-02-23,1992-05-15,NONE,TRUCK,nusual sentiments. carefully bold idea,199202
3271,948691,48692,3,14.0,24355.1,0.05,0.01,A,F,1992-02-24,1992-02-14,1992-03-23,NONE,AIR,"ending, even packa",199202
3271,633683,83696,4,29.0,46882.85,0.07,0.04,A,F,1992-03-10,1992-02-05,1992-03-14,COLLECT COD,MAIL,lar instructions. carefully regular,199202
3685,574127,99133,2,7.0,8407.7,0.05,0.0,R,F,1992-05-16,1992-02-23,1992-05-17,DELIVER IN PERSON,FOB,sits. special asymptotes about the r,199202
3712,1400122,50151,1,27.0,27595.35,0.01,0.05,R,F,1992-02-01,1992-02-26,1992-03-02,TAKE BACK RETURN,SHIP,ctions. even accounts haggle alongside,199202
3712,1849149,24204,2,13.0,14274.65,0.03,0.03,R,F,1992-04-30,1992-02-11,1992-05-30,DELIVER IN PERSON,FOB,s around the furiously ironic account,199202
3712,639226,89239,3,44.0,51268.36,0.01,0.01,A,F,1992-03-26,1992-02-19,1992-04-18,TAKE BACK RETURN,FOB,ously permanently regular req,199202
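
In the sample rows above, every row with `L_PARTITION_KEY` 199202 has an `L_COMMITDATE` in February 1992, which suggests the key is the commit date's year and month packed as `yyyyMM`. That is an inference from this output only; the actual derivation lives in the `00-create-parquet-file` notebook and may differ. Under that assumption, a minimal sketch of the encoding (the `partitionKey` helper is hypothetical) would be:

```scala
import java.time.LocalDate

// Hypothetical helper: packs a date's year and month into a yyyyMM integer.
// Assumes the key is derived this way; not confirmed by the notebook.
def partitionKey(date: LocalDate): Int =
  date.getYear * 100 + date.getMonthValue

// Commit date of the first sample row shown above.
val key = partitionKey(LocalDate.parse("1992-02-05"))
```

An integer key of this form sorts chronologically, which is consistent with the ordered output of the distribution query.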


## Using the new connector

The schema needs to be defined explicitly because the new connector is very sensitive to nullability, as described in the issue [Nullable column mismatch between Spark DataFrame & SQL Table Error](https://github.com/microsoft/sql-spark-connector/issues/5). We therefore create the schema explicitly and apply it to the loaded data.

In [19]:
import org.apache.spark.sql.types._

val schema = StructType(
    StructField("L_ORDERKEY", IntegerType, false) ::
    StructField("L_PARTKEY", IntegerType, false) ::
    StructField("L_SUPPKEY", IntegerType, false) ::  
    StructField("L_LINENUMBER", IntegerType, false) ::
    StructField("L_QUANTITY", DecimalType(15,2), false) ::
    StructField("L_EXTENDEDPRICE", DecimalType(15,2), false) ::
    StructField("L_DISCOUNT", DecimalType(15,2), false) ::
    StructField("L_TAX", DecimalType(15,2), false) ::
    StructField("L_RETURNFLAG", StringType, false) ::
    StructField("L_LINESTATUS", StringType, false) ::
    StructField("L_SHIPDATE", DateType, false) ::
    StructField("L_COMMITDATE", DateType, false) ::
    StructField("L_RECEIPTDATE", DateType, false) ::
    StructField("L_SHIPINSTRUCT", StringType, false) ::  
    StructField("L_SHIPMODE", StringType, false) ::  
    StructField("L_COMMENT", StringType, false) ::  
    StructField("L_PARTITION_KEY", IntegerType, false) ::  
    Nil)
    
// Rebuild the DataFrame from the loaded rows using the non-nullable schema
val li2 = spark.createDataFrame(li.rdd, schema)
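
As an alternative to listing every field by hand, the non-nullable schema could be derived from the schema Spark already inferred. This is only a sketch: it assumes the column types read from Parquet already match the target table's column types and that the data really contains no nulls. The helper name `asNonNullable` is hypothetical.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper: copies a schema, marking every top-level field as NOT NULL.
// Field names, types and metadata are left unchanged.
def asNonNullable(schema: StructType): StructType =
  StructType(schema.fields.map((f: StructField) => f.copy(nullable = false)))
```

In the notebook this would be used as `spark.createDataFrame(li.rdd, asNonNullable(li.schema))`. If the Parquet types differ from the table definition (for example a floating-point column where the table expects `decimal(15, 2)`), the explicit schema shown above remains the safer choice.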

In [20]:
val url = s"jdbc:sqlserver://$server;databaseName=$database;"

li2.write 
  .format("com.microsoft.sqlserver.jdbc.spark") 
  .mode("overwrite")   
  .option("truncate", "true") 
  .option("url", url) 
  .option("dbtable", table) 
  .option("user", user) 
  .option("password", password) 
  .option("reliabilityLevel", "BEST_EFFORT") 
  .option("tableLock", "true") 
  .option("batchsize", "100000") 
  .save()

## Using the old connector

This connector is more permissive about the schema, so we can use the schema coming from the Parquet file as is

In [23]:
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url" -> server,
  "databaseName" -> database,
  "dbTable" -> table,
  "user" -> user,
  "password" -> password,
  "bulkCopyBatchSize" -> "100000",
  "bulkCopyTableLock" -> "true",  
  "bulkCopyTimeout" -> "600" //seconds  
))

li.bulkCopyToSqlDB(config)