# Secrets

## In Databricks
The secrets used below, such as the Cosmos account key, are retrieved from a secret scope. If you have not yet defined a secret scope for the Cosmos account you want to use with this sample, the following links describe how to create one:

- Create a new secret scope for the current Databricks workspace:
  - See how you can create an [Azure Key Vault backed secret scope](https://docs.microsoft.com/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope)
  - See how you can create a [Databricks backed secret scope](https://docs.microsoft.com/azure/databricks/security/secrets/secret-scopes#create-a-databricks-backed-secret-scope)
- Then see how to [add secrets to your Spark configuration](https://docs.microsoft.com/azure/databricks/security/secrets/secrets#read-a-secret).

## In Synapse
- To define a Cosmos DB account as a linked service (with safe handling of secrets), follow [these instructions](https://docs.microsoft.com/en-us/azure/synapse-analytics/synapse-link/how-to-connect-synapse-link-cosmos-db#connect-an-azure-cosmos-db-database-to-an-azure-synapse-workspace).

If you don't want to use secrets at all, you can also assign the values in clear text below, but we recommend using secrets so that the account key is not exposed in the notebook.
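
Whichever way the values are supplied, a misconfigured secret is easier to diagnose if it is detected before the first call to Cosmos DB. The following is a minimal sketch of such a check. `RequireSetting` is a hypothetical helper, not part of the Cosmos DB Spark connector, and the endpoint shown is a placeholder.

```csharp
using System;

// Hypothetical helper, not part of the Cosmos DB Spark connector:
// returns the value unchanged, or throws if it is null, empty or whitespace.
string RequireSetting(string name, string value)
{
    if (string.IsNullOrWhiteSpace(value))
    {
        throw new InvalidOperationException(
            $"Spark configuration '{name}' is empty. Check the secret scope or linked service.");
    }
    return value;
}

// Placeholder value for illustration only; in the notebook this would be
// the string read from the Spark configuration.
string endpoint = RequireSetting(
    "spark.cosmos.accountEndpoint",
    "https://example.documents.azure.com:443/");
Console.WriteLine($"Validated endpoint: {endpoint}");
```

The account key can be checked the same way, but it should not be written to the output.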

In [None]:
string cosmosEndpoint = spark.Conf().Get("spark.cosmos.accountEndpoint");
string cosmosMasterKey = spark.Conf().Get("spark.cosmos.accountKey");

Console.WriteLine($"Cosmos Account endpoint: {cosmosEndpoint}");

**Preparation - creating the Cosmos DB container to ingest the data into**

Configure the Catalog API to be used
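
The catalog settings all share the prefix `spark.sql.catalog.<catalog name>`: the prefix itself names the catalog implementation, and the connector settings are appended to it. As a rough illustration of that naming pattern, the sketch below builds the keys for a given catalog name. `BuildCatalogSettings` is a hypothetical helper and the endpoint and key values are placeholders.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the Spark configuration keys for a Cosmos catalog
// by prefixing the connector settings with "spark.sql.catalog.<catalogName>".
Dictionary<string, string> BuildCatalogSettings(string catalogName, string endpoint, string key)
{
    string prefix = $"spark.sql.catalog.{catalogName}";
    return new Dictionary<string, string>
    {
        { prefix, "com.azure.cosmos.spark.CosmosCatalog" },
        { $"{prefix}.spark.cosmos.accountEndpoint", endpoint },
        { $"{prefix}.spark.cosmos.accountKey", key },
    };
}

// Placeholder values; only the keys are printed so that no secret is shown.
var settings = BuildCatalogSettings("cosmosCatalog", "<endpoint>", "<key>");
foreach (var setting in settings)
{
    Console.WriteLine(setting.Key);
}
```

Each resulting pair could then be passed to `spark.Conf().Set(key, value)`, as the cell below does line by line.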

In [None]:
spark.Conf().Set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog");
spark.Conf().Set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint);
spark.Conf().Set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey);
spark.Conf().Set("spark.sql.catalog.cosmosCatalog.spark.cosmos.views.repositoryPath", $"/viewDefinitions/{Guid.NewGuid().ToString()}");

Execute the command to create the new container with a throughput of up to 100,000 RU (autoscale, so 10,000 - 100,000 RU depending on load) and with only system properties (like /id) indexed. A second container is also created to store the metadata for the global throughput control.
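
The autoscale range follows from the configured maximum: the lower bound is 10% of `autoScaleMaxThroughput`. A small sketch of that arithmetic for the two containers created below follows. `AutoscaleRange` is a hypothetical helper used only for illustration.

```csharp
using System;

// Hypothetical helper: autoscale throughput ranges from 10% of the configured
// maximum up to the maximum itself.
(int MinRu, int MaxRu) AutoscaleRange(int autoScaleMaxThroughput)
{
    if (autoScaleMaxThroughput <= 0)
    {
        throw new ArgumentOutOfRangeException(nameof(autoScaleMaxThroughput));
    }
    return (autoScaleMaxThroughput / 10, autoScaleMaxThroughput);
}

var taxi = AutoscaleRange(100_000);
Console.WriteLine($"GreenTaxiRecords: {taxi.MinRu} - {taxi.MaxRu} RU");        // 10000 - 100000

var control = AutoscaleRange(4_000);
Console.WriteLine($"ThroughputControl: {control.MinRu} - {control.MaxRu} RU"); // 400 - 4000
```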

In [None]:
%%sql
CREATE DATABASE IF NOT EXISTS cosmosCatalog.SampleDatabase;

CREATE TABLE IF NOT EXISTS cosmosCatalog.SampleDatabase.GreenTaxiRecords
USING cosmos.oltp
TBLPROPERTIES(partitionKeyPath = '/id', autoScaleMaxThroughput = '100000', indexingPolicy = 'OnlySystemProperties');

/* NOTE: TTL must be enabled on the throughput control container (it can be off by default). A default TTL of -1 enables TTL without expiring items unless an item sets its own ttl. */
CREATE TABLE IF NOT EXISTS cosmosCatalog.SampleDatabase.ThroughputControl
USING cosmos.oltp
OPTIONS(spark.cosmos.database = 'SampleDatabase')
TBLPROPERTIES(partitionKeyPath = '/groupId', autoScaleMaxThroughput = '4000', indexingPolicy = 'AllProperties', defaultTtlInSeconds = '-1');

**Preparation - loading data source "[NYC Taxi & Limousine Commission - green taxi trip records](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/)"**

The green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data set contains over 80 million records (>8 GB) and is available via a publicly accessible Azure Blob Storage account located in the East US Azure region.

In [None]:
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

Console.WriteLine($"Starting preparation: {DateTimeOffset.UtcNow.ToString("o")}");

// Azure storage access info
string blob_account_name = "azureopendatastorage";
string blob_container_name = "nyctlc";
string blob_relative_path = "green";
string blob_sas_token = String.Empty;

// Allow SPARK to read from Blob remotely
string wasbs_path = $"wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/{blob_relative_path}";
spark.Conf().Set(
  $"fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net",
  blob_sas_token);
Console.WriteLine($"Remote blob path: {wasbs_path}");

// Spark reads the parquet files lazily, so no data is loaded at this point.
// NOTE - to experiment with larger dataset sizes, either switch to Option B (comment out the
// code for Option A and uncomment the code for Option B) or increase the value passed into
// the Limit function below.

// ------------------------------------------------------------------------------------
//  Option A - with limited dataset size
// ------------------------------------------------------------------------------------
DataFrame df_rawInputWithoutLimit = spark.Read().Parquet(wasbs_path);
DataFrame df_rawInput = df_rawInputWithoutLimit.Limit(1_000_000);

// ------------------------------------------------------------------------------------
//  Option B - entire dataset
// ------------------------------------------------------------------------------------
// DataFrame df_rawInput = spark.Read().Parquet(wasbs_path);

// Adding an id column with unique values
Func<Column> uuidUdf = Udf<string>(() => Guid.NewGuid().ToString());
DataFrame df_input_withId = df_rawInput.WithColumn("id", uuidUdf()).Persist();

Console.WriteLine("Register the DataFrame as a SQL temporary view: source");
df_input_withId.CreateOrReplaceTempView("source");

Console.WriteLine($"Finished preparation: {DateTimeOffset.UtcNow.ToString("o")}");

**Sample - ingesting the NYC Green Taxi data into Cosmos DB**

By setting the target throughput threshold to 0.95 (95%) we reduce throttling but still allow the ingestion to consume most of the provisioned throughput. For scenarios where ingestion should only take a smaller subset of the available throughput this threshold can be reduced accordingly.
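
To make the threshold concrete, the sketch below converts a threshold into an approximate RU budget. It assumes that the threshold is applied to the container's autoscale maximum (100,000 RU for the container created above); that interpretation is an assumption of this example, and `TargetRu` is a hypothetical helper.

```csharp
using System;

// Hypothetical helper: RU budget for a throughput control group, ASSUMING the
// threshold is a fraction (0, 1] of the container's maximum throughput.
int TargetRu(int maxThroughput, double threshold)
{
    if (threshold <= 0 || threshold > 1)
    {
        throw new ArgumentOutOfRangeException(nameof(threshold), "Expected a value in (0, 1].");
    }
    return (int)Math.Round(maxThroughput * threshold);
}

Console.WriteLine(TargetRu(100_000, 0.95)); // 95000
Console.WriteLine(TargetRu(100_000, 0.5));  // 50000
```

Under that assumption, a threshold of 0.5 would leave roughly half of the provisioned throughput to other workloads using the same container.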

In [None]:
Console.WriteLine($"Starting ingestion: {DateTimeOffset.UtcNow.ToString("o")}");

var writeCfg = new Dictionary<string, string>() {
  { "spark.cosmos.accountEndpoint",  cosmosEndpoint },
  { "spark.cosmos.accountKey", cosmosMasterKey },
  { "spark.cosmos.database", "SampleDatabase" },
  { "spark.cosmos.container", "GreenTaxiRecords" },
  { "spark.cosmos.write.strategy", "ItemOverwrite" },
  { "spark.cosmos.write.bulk.enabled", "true" },
  { "spark.cosmos.throughputControl.enabled", "true" },
  { "spark.cosmos.throughputControl.name", "NYCGreenTaxiDataIngestion" },
  { "spark.cosmos.throughputControl.targetThroughputThreshold", "0.95" },
  { "spark.cosmos.throughputControl.globalControl.database", "SampleDatabase" },
  { "spark.cosmos.throughputControl.globalControl.container", "ThroughputControl" },
};

DataFrame df_NYCGreenTaxi_Input = spark.Sql("SELECT * FROM source");

df_NYCGreenTaxi_Input
  .Write()
  .Format("cosmos.oltp")
  .Mode("Append")
  .Options(writeCfg)
  .Save();

Console.WriteLine($"Finished ingestion: {DateTimeOffset.UtcNow.ToString("o")}");

**Getting the reference record count**

In [None]:
long count_source = spark.Sql("SELECT * FROM source").Count();

Console.WriteLine($"Number of records in source:  {count_source}");

**Sample - validating the record count via query**

In [None]:
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

Console.WriteLine($"Starting validation via query: {DateTimeOffset.UtcNow.ToString("o")}");

var readCfg = new Dictionary<string, string>() {
  { "spark.cosmos.accountEndpoint", cosmosEndpoint },
  { "spark.cosmos.accountKey", cosmosMasterKey },
  { "spark.cosmos.database", "SampleDatabase" },
  { "spark.cosmos.container", "GreenTaxiRecords" },

  // IMPORTANT - with any other partitioning strategy the index is not used
  // for the count, so latency and RU consumption would spike
  { "spark.cosmos.read.partitioning.strategy", "Restrictive" }, 
  
  { "spark.cosmos.read.inferSchema.enabled", "false" },
  { "spark.cosmos.read.customQuery", "SELECT COUNT(0) AS Count FROM c" }
};

var count_query_schema = new StructType(new[] { new StructField("Count", new LongType(), true) });
DataFrame query_df = spark
    .Read()
    .Format("cosmos.oltp")
    .Schema(count_query_schema)
    .Options(readCfg)
    .Load();

// Summing a LongType column yields a long, which also matches the type of count_source
long count_query = query_df.Select(Functions.Sum("Count").Alias("TotalCount")).First().GetAs<long>("TotalCount");
Console.WriteLine($"Number of records retrieved via query: {count_query}");
Console.WriteLine($"Finished validation via query: {DateTimeOffset.UtcNow.ToString("o")}");

**Sample - validating the record count via change feed**

In [None]:
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

Console.WriteLine($"Starting validation via change feed: {DateTimeOffset.UtcNow.ToString("o")}");

var changeFeedCfg  = new Dictionary<string, string>() {
  { "spark.cosmos.accountEndpoint", cosmosEndpoint },
  { "spark.cosmos.accountKey", cosmosMasterKey },
  { "spark.cosmos.database", "SampleDatabase" },
  { "spark.cosmos.container", "GreenTaxiRecords" },
  { "spark.cosmos.read.partitioning.strategy", "Default" }, 
  { "spark.cosmos.read.inferSchema.enabled", "false" },
  { "spark.cosmos.changeFeed.startFrom", "Beginning" },
  { "spark.cosmos.changeFeed.mode", "Incremental" }
};

DataFrame changeFeed_df = spark
    .Read()
    .Format("cosmos.oltp.changeFeed")
    .Options(changeFeedCfg)
    .Load();
long count_changeFeed  = changeFeed_df.Count();
Console.WriteLine($"Number of records retrieved via change feed: {count_changeFeed}");
Console.WriteLine($"Finished validation via change feed: {DateTimeOffset.UtcNow.ToString("o")}");

**Sample - bulk deleting documents and validating document count afterwards**

In [None]:
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

Console.WriteLine($"Starting to identify documents to be deleted: {DateTimeOffset.UtcNow.ToString("o")}");

var readCfg = new Dictionary<string, string>() {
  { "spark.cosmos.accountEndpoint", cosmosEndpoint },
  { "spark.cosmos.accountKey", cosmosMasterKey },
  { "spark.cosmos.database", "SampleDatabase" },
  { "spark.cosmos.container", "GreenTaxiRecords" },
  { "spark.cosmos.read.partitioning.strategy", "Default" }, 
  { "spark.cosmos.read.inferSchema.enabled", "false" }
};

DataFrame toBeDeleted_df = spark
    .Read()
    .Format("cosmos.oltp")
    .Options(readCfg)
    .Load()
    .Limit(100_000);
Console.WriteLine($"Number of records to be deleted: {toBeDeleted_df.Count()}");

Console.WriteLine($"Starting to bulk delete documents: {DateTimeOffset.UtcNow.ToString("o")}");
var deleteCfg = new Dictionary<string, string>(writeCfg);
deleteCfg["spark.cosmos.write.strategy"] = "ItemDelete";

toBeDeleted_df
  .Write()
  .Format("cosmos.oltp")
  .Mode("Append")
  .Options(deleteCfg)
  .Save();

Console.WriteLine($"Finished deleting documents: {DateTimeOffset.UtcNow.ToString("o")}");

Console.WriteLine($"Starting count validation via query: {DateTimeOffset.UtcNow.ToString("o")}");

var count_query_schema = new StructType(new[] { new StructField("Count", new LongType(), true) });
readCfg["spark.cosmos.read.customQuery"] = "SELECT COUNT(0) AS Count FROM c";
DataFrame query_df = spark
    .Read()
    .Format("cosmos.oltp")
    .Schema(count_query_schema)
    .Options(readCfg)
    .Load();

long count_query = query_df.Select(Functions.Sum("Count").Alias("TotalCount")).First().GetAs<long>("TotalCount");
Console.WriteLine($"Number of records retrieved via query: {count_query}");
Console.WriteLine($"Finished validation via query: {DateTimeOffset.UtcNow.ToString("o")}");


**Sample - showing the existing Containers**

In [None]:
%%sql
SHOW TABLES FROM cosmosCatalog.SampleDatabase

In [None]:
using System.Diagnostics;

var df_Tables = spark.Sql("SHOW TABLES FROM cosmosCatalog.SampleDatabase");
Trace.Assert(df_Tables.Count() == 3);

**Sample - querying a Cosmos Container via Spark Catalog**

In [None]:
%%sql
SELECT * FROM cosmosCatalog.SampleDatabase.GreenTaxiRecords LIMIT 10

**Sample - querying a Cosmos Container with custom settings via Spark Catalog**

Creating the view with custom settings (in this case adding a projection, disabling schema inference and switching to aggressive partitioning strategy)

In [None]:
%%sql
CREATE TABLE cosmosCatalog.SampleDatabase.GreenTaxiRecordsView 
  (id STRING, _ts TIMESTAMP, vendorID INT, totalAmount DOUBLE)
USING cosmos.oltp
TBLPROPERTIES(isCosmosView = 'True')
OPTIONS (
  spark.cosmos.database = 'SampleDatabase',
  spark.cosmos.container = 'GreenTaxiRecords',
  spark.cosmos.read.inferSchema.enabled = 'False',
  spark.cosmos.read.inferSchema.includeSystemProperties = 'True',
  spark.cosmos.read.partitioning.strategy = 'Aggressive');

SELECT * FROM cosmosCatalog.SampleDatabase.GreenTaxiRecordsView LIMIT 10

Creating another view with custom settings (in this case enabling schema inference and switching to restrictive partitioning strategy)

In [None]:
%%sql
CREATE TABLE cosmosCatalog.SampleDatabase.GreenTaxiRecordsAnotherView 
USING cosmos.oltp
TBLPROPERTIES(isCosmosView = 'True')
OPTIONS (
  spark.cosmos.database = 'SampleDatabase',
  spark.cosmos.container = 'GreenTaxiRecords',
  spark.cosmos.read.inferSchema.enabled = 'True',
  spark.cosmos.read.inferSchema.includeSystemProperties = 'False',
  spark.cosmos.read.partitioning.strategy = 'Restrictive');

SELECT * FROM cosmosCatalog.SampleDatabase.GreenTaxiRecordsAnotherView LIMIT 10

Show all tables in the Cosmos catalog to confirm that both the "real" containers and the views show up.

In [None]:
%%sql
SHOW TABLES FROM cosmosCatalog.SampleDatabase

In [None]:
using System.Diagnostics;

var df_Tables = spark.Sql("SHOW TABLES FROM cosmosCatalog.SampleDatabase");
Trace.Assert(df_Tables.Count() == 5);

**Clean up the views again**

In [None]:
%%sql
DROP TABLE IF EXISTS cosmosCatalog.SampleDatabase.GreenTaxiRecordsView;
DROP TABLE IF EXISTS cosmosCatalog.SampleDatabase.GreenTaxiRecordsAnotherView;
SHOW TABLES FROM cosmosCatalog.SampleDatabase

In [None]:
using System.Diagnostics;

var df_Tables = spark.Sql("SHOW TABLES FROM cosmosCatalog.SampleDatabase");
Trace.Assert(df_Tables.Count() == 3);