# What's in this exercise
Basics of working with the Azure Cosmos DB Cassandra API from Spark on HDInsight <B>in batch</B>.<BR>
Section 05: Upsert operation (crUd)<BR>

### Prerequisites
The DataStax connector for Cassandra requires the Azure Cosmos DB Cassandra API connection details to be initialized as part of the Spark context. When you launch a Jupyter notebook, the Spark session and context are already initialized, and it is not advisable to stop and reinitialize the Spark context unless you also restore every configuration that the HDInsight default Jupyter notebook start-up sets. One workaround is to add the Cassandra instance details directly to the Spark2 service configuration in Ambari. This is a one-time activity that requires a Spark2 service restart.<BR>

1.  In Ambari, open the Spark2 service and click Configs.
2.  Go to Custom spark2-defaults, add the following properties, and restart the Spark2 service:
spark.cassandra.connection.host=YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmosdb.azure.com<br>
spark.cassandra.connection.port=10350<br>
spark.cassandra.connection.ssl.enabled=true<br>
spark.cassandra.auth.username=YOUR_COSMOSDB_ACCOUNT_NAME<br>
spark.cassandra.auth.password=YOUR_COSMOSDB_KEY<br>
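
The five properties above differ only in the account name and key, so they can be generated rather than typed by hand. The sketch below is illustrative only: `CosmosCassandraDefaults` and its methods are hypothetical names, not part of any Azure or DataStax library, and the result is simply the text to paste into Custom spark2-defaults.

```scala
// Hypothetical helper (not part of any library): builds the spark2-defaults
// entries listed above from a Cosmos DB account name and key.
object CosmosCassandraDefaults {
  def properties(accountName: String, accountKey: String): Seq[(String, String)] = Seq(
    "spark.cassandra.connection.host" -> s"$accountName.cassandra.cosmosdb.azure.com",
    "spark.cassandra.connection.port" -> "10350",
    "spark.cassandra.connection.ssl.enabled" -> "true",
    "spark.cassandra.auth.username" -> accountName,
    "spark.cassandra.auth.password" -> accountKey
  )

  // One key=value line per property, in the order shown in step 2.
  def render(accountName: String, accountKey: String): String =
    properties(accountName, accountKey).map { case (k, v) => s"$k=$v" }.mkString("\n")
}
```

For example, `println(CosmosCassandraDefaults.render("YOUR_COSMOSDB_ACCOUNT_NAME", "YOUR_COSMOSDB_KEY"))` prints the five lines with the placeholders filled in.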

---------
## 1.0. Cassandra API connection

### 1.0.1. Configure dependencies

In [1]:
%%configure -f
{ "conf": {"spark.jars.packages": "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0,com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.0.0" }}

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
13,application_1536862968089_0017,spark,dead,Link,,
17,application_1536862968089_0021,spark,idle,Link,Link,
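
The `_2.11` suffix in `spark-cassandra-connector_2.11` is the Scala binary version the artifact was built for, and it needs to match the Scala version of the cluster's Spark build. A minimal sketch of that check follows; `ScalaSuffixCheck` and its methods are hypothetical names used only for illustration.

```scala
// Hypothetical helper for illustration: compares the Scala suffix of a Maven
// coordinate (group:artifact_<scalaBinaryVersion>:version) with a Scala version.
object ScalaSuffixCheck {
  // "2.11.12" -> "2.11"
  def binaryVersion(fullVersion: String): String =
    fullVersion.split('.').take(2).mkString(".")

  // Returns the text after the last underscore of the artifact name, if any.
  def artifactSuffix(coordinate: String): Option[String] = {
    val parts = coordinate.split(':')
    if (parts.length < 2) None
    else {
      val artifact = parts(1)
      val idx = artifact.lastIndexOf('_')
      if (idx < 0) None else Some(artifact.substring(idx + 1))
    }
  }

  // Artifacts without a Scala suffix (plain Java libraries) always pass.
  def matches(coordinate: String, scalaVersion: String): Boolean =
    artifactSuffix(coordinate).forall(_ == binaryVersion(scalaVersion))
}
```

In a notebook cell, `ScalaSuffixCheck.matches("com.datastax.spark:spark-cassandra-connector_2.11:2.3.0", scala.util.Properties.versionNumberString)` compares the connector coordinate with the Scala version the session is running.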


### 1.0.2. Cassandra API configuration

In [2]:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType,LongType,FloatType,DoubleType, TimestampType}
import org.apache.spark.sql.cassandra._

//datastax Spark connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.driver.core.{ConsistencyLevel, DataType}
import com.datastax.spark.connector.writer.WriteConf

// Azure Cosmos DB helper library (multiple-retry support)
import com.microsoft.azure.cosmosdb.cassandra

// Specify connection factory for Cassandra
spark.conf.set("spark.cassandra.connection.factory", "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory")

// Parallelism and throughput configs
spark.conf.set("spark.cassandra.output.batch.size.rows", "1")
spark.conf.set("spark.cassandra.connection.connections_per_executor_max", "10")
spark.conf.set("spark.cassandra.output.concurrent.writes", "100")
spark.conf.set("spark.cassandra.concurrent.reads", "512")
spark.conf.set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
spark.conf.set("spark.cassandra.connection.keep_alive_ms", "60000000") //Increase this number as needed
spark.conf.set("spark.cassandra.output.ignoreNulls","true")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
20,application_1536862968089_0024,spark,idle,Link,Link,✔


SparkSession available as 'spark'.


---
## 2.0. Dataframe API

### Reset data

In [10]:
//Delete data from prior runs
val cdbConnector = CassandraConnector(sc)
cdbConnector.withSessionDo(session => session.execute("delete from books_ks.books where book_id in ('b00300','b00001','b00023','b00501','b09999','b01001','b00999','b03999','b02999','b000009');"))

//Create 5 records and persist
val booksDF = Seq(
   ("b00001", "Arthur Conan Doyle", "A study in scarlet", 1887),
   ("b00023", "Arthur Conan Doyle", "A sign of four", 1890),
   ("b01001", "Arthur Conan Doyle", "The adventures of Sherlock Holmes", 1892),
   ("b00501", "Arthur Conan Doyle", "The memoirs of Sherlock Holmes", 1893),
   ("b00300", "Arthur Conan Doyle", "The hounds of Baskerville", 1901)
).toDF("book_id", "book_author", "book_name", "book_pub_year")

booksDF.write.mode("append").format("org.apache.spark.sql.cassandra").options(Map( "table" -> "books", "keyspace" -> "books_ks", "output.consistency.level" -> "ALL", "ttl" -> "10000000")).save()

res36: com.datastax.driver.core.ResultSet = ResultSet[ exhausted: true, Columns[]]
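
The write in the cell above passes its settings as a string map. A small sketch of assembling that map, so that optional settings such as the consistency level and TTL are only included when supplied, is shown below. `CassandraWriteOptions` is a hypothetical helper name; the option keys (`table`, `keyspace`, `output.consistency.level`, `ttl`) are the ones used in the cell above.

```scala
// Hypothetical helper for illustration; the keys mirror the options map above.
object CassandraWriteOptions {
  def build(keyspace: String,
            table: String,
            consistencyLevel: Option[String] = None,
            ttlSeconds: Option[Long] = None): Map[String, String] = {
    val required = Map("table" -> table, "keyspace" -> keyspace)
    // Only settings that were actually supplied end up in the map.
    val optional = Seq(
      consistencyLevel.map("output.consistency.level" -> _),
      ttlSeconds.map(t => "ttl" -> t.toString)
    ).flatten
    required ++ optional
  }

  // Converts a TTL in seconds to whole days, to sanity-check a value.
  def ttlDays(ttlSeconds: Long): Long = ttlSeconds / 86400L
}
```

The map from `CassandraWriteOptions.build("books_ks", "books", Some("ALL"), Some(10000000L))` can be passed to `.options(...)` in place of the literal one. `ttlDays(10000000L)` gives 115, so the rows written above expire after roughly 115 days.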

### 2.0.1. Upsert

In [12]:
//Before
spark.read.cassandraFormat("books", "books_ks", "").load().show()

+-------+------------------+--------------------+----------+-------------+
|book_id|       book_author|           book_name|book_price|book_pub_year|
+-------+------------------+--------------------+----------+-------------+
| b00300|Arthur Conan Doyle|The hounds of Bas...|      null|         1901|
| b00001|Arthur Conan Doyle|  A study in scarlet|      null|         1887|
| b00023|Arthur Conan Doyle|      A sign of four|      null|         1890|
| b00501|Arthur Conan Doyle|The memoirs of Sh...|      null|         1893|
| b01001|Arthur Conan Doyle|The adventures of...|      null|         1892|
+-------+------------------+--------------------+----------+-------------+

In [13]:
// Create a dataframe with changes you want to make
//(1) Update: Changing author name to include prefix of "Sir", and (2) Insert: adding a new book
val booksUpsertDF = Seq(
                         ("b00001", "Sir Arthur Conan Doyle", "A study in scarlet", 1887),
                         ("b00023", "Sir Arthur Conan Doyle", "A sign of four", 1890),
                         ("b01001", "Sir Arthur Conan Doyle", "The adventures of Sherlock Holmes", 1892),
                         ("b00501", "Sir Arthur Conan Doyle", "The memoirs of Sherlock Holmes", 1893),
                         ("b00300", "Sir Arthur Conan Doyle", "The hounds of Baskerville", 1901),
                         ("b09999", "Sir Arthur Conan Doyle", "The return of Sherlock Holmes", 1905)
                        ).toDF("book_id", "book_author", "book_name", "book_pub_year")
booksUpsertDF.show()

+-------+--------------------+--------------------+-------------+
|book_id|         book_author|           book_name|book_pub_year|
+-------+--------------------+--------------------+-------------+
| b00001|Sir Arthur Conan ...|  A study in scarlet|         1887|
| b00023|Sir Arthur Conan ...|      A sign of four|         1890|
| b01001|Sir Arthur Conan ...|The adventures of...|         1892|
| b00501|Sir Arthur Conan ...|The memoirs of Sh...|         1893|
| b00300|Sir Arthur Conan ...|The hounds of Bas...|         1901|
| b09999|Sir Arthur Conan ...|The return of She...|         1905|
+-------+--------------------+--------------------+-------------+

In [14]:
// Upsert
booksUpsertDF.write.mode("append").format("org.apache.spark.sql.cassandra").options(Map( "table" -> "books", "keyspace" -> "books_ks")).save()

In [15]:
//After
spark.read.cassandraFormat("books", "books_ks", "").load().show()

+-------+--------------------+--------------------+----------+-------------+
|book_id|         book_author|           book_name|book_price|book_pub_year|
+-------+--------------------+--------------------+----------+-------------+
| b00300|Sir Arthur Conan ...|The hounds of Bas...|      null|         1901|
| b00001|Sir Arthur Conan ...|  A study in scarlet|      null|         1887|
| b00023|Sir Arthur Conan ...|      A sign of four|      null|         1890|
| b00501|Sir Arthur Conan ...|The memoirs of Sh...|      null|         1893|
| b09999|Sir Arthur Conan ...|The return of She...|      null|         1905|
| b01001|Sir Arthur Conan ...|The adventures of...|      null|         1892|
+-------+--------------------+--------------------+----------+-------------+
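
The append in In [14] acts as an upsert because Cassandra writes are keyed on the primary key: a row whose key already exists is overwritten, and a row with a new key is inserted. The plain-Scala sketch below models that behaviour with an in-memory map. It is an illustration only, does not touch Spark or Cosmos DB, and assumes `book_id` is the table's primary key; `Book` and `UpsertModel` are hypothetical names.

```scala
// Illustrative in-memory model of upsert semantics; no Spark or Cassandra involved.
// Assumes book_id is the table's primary key.
final case class Book(bookId: String, author: String, name: String, pubYear: Int)

object UpsertModel {
  // Existing keys are overwritten, new keys are added.
  def upsert(table: Map[String, Book], changes: Seq[Book]): Map[String, Book] =
    changes.foldLeft(table)((acc, b) => acc + (b.bookId -> b))
}
```

Applied to the data above, upserting six rows into a table that already holds five of those keys leaves six rows, which is what the After output shows.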

### 2.0.2. Update

In [16]:
//Before
spark.read.cassandraFormat("books", "books_ks", "").load().show()

+-------+--------------------+--------------------+----------+-------------+
|book_id|         book_author|           book_name|book_price|book_pub_year|
+-------+--------------------+--------------------+----------+-------------+
| b00300|Sir Arthur Conan ...|The hounds of Bas...|      null|         1901|
| b00001|Sir Arthur Conan ...|  A study in scarlet|      null|         1887|
| b00023|Sir Arthur Conan ...|      A sign of four|      null|         1890|
| b00501|Sir Arthur Conan ...|The memoirs of Sh...|      null|         1893|
| b09999|Sir Arthur Conan ...|The return of She...|      null|         1905|
| b01001|Sir Arthur Conan ...|The adventures of...|      null|         1892|
+-------+--------------------+--------------------+----------+-------------+

In [17]:
//Update
val booksUpdateDF = Seq(
                         ("b00001", 5.99),
                         ("b00023", 7.50),
                         ("b01001", 12.25),
                         ("b00501", 12.00),
                         ("b00300", 18.00),
                         ("b09999", 23.99)
                        ).toDF("book_id", "book_price")
booksUpdateDF.write.mode("append").format("org.apache.spark.sql.cassandra").options(Map( "table" -> "books", "keyspace" -> "books_ks")).save()

In [18]:
//After
spark.read.cassandraFormat("books", "books_ks", "").load().show()

+-------+--------------------+--------------------+----------+-------------+
|book_id|         book_author|           book_name|book_price|book_pub_year|
+-------+--------------------+--------------------+----------+-------------+
| b00300|Sir Arthur Conan ...|The hounds of Bas...|      18.0|         1901|
| b00001|Sir Arthur Conan ...|  A study in scarlet|      5.99|         1887|
| b00023|Sir Arthur Conan ...|      A sign of four|       7.5|         1890|
| b00501|Sir Arthur Conan ...|The memoirs of Sh...|      12.0|         1893|
| b09999|Sir Arthur Conan ...|The return of She...|     23.99|         1905|
| b01001|Sir Arthur Conan ...|The adventures of...|     12.25|         1892|
+-------+--------------------+--------------------+----------+-------------+
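
The update in In [17] writes only `book_id` and `book_price`, and the After output shows the other columns unchanged: a write sets the columns it supplies and leaves the rest of the row alone. A plain-Scala sketch of that column-level merge follows, with each row modelled as a map from column name to value. `ColumnMerge` is a hypothetical name and the model is illustrative only, again assuming `book_id` is the primary key.

```scala
// Illustrative model of a partial-column write; no Spark or Cassandra involved.
object ColumnMerge {
  type Row = Map[String, Any]

  // Supplied columns overwrite existing values; columns not supplied are kept.
  def write(table: Map[String, Row], key: String, columns: Row): Map[String, Row] = {
    val existing = table.getOrElse(key, Map.empty[String, Any])
    table + (key -> (existing ++ columns))
  }
}
```

This is why the same `append` write serves as both insert and update: the DataFrame only needs the key plus whichever columns should change.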

---
## 3.0. RDD API
Upserts through the RDD API are no different from inserts/creates.
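
A minimal sketch of such a write uses the connector's `saveToCassandra`, with one row for an existing key (overwritten) and one for a new key (inserted). It assumes the notebook session from section 1.0 (`sc` and the connector on the classpath) and the `books_ks.books` table used above. The second row is illustrative (`b00999` is one of the keys cleared by the reset cell), and the cell has not been run against a live account.

```scala
import com.datastax.spark.connector._

// Assumes `sc` from the notebook session and the books_ks.books table used above.
val booksRDD = sc.parallelize(Seq(
  ("b00300", "Sir Arthur Conan Doyle", "The hounds of Baskerville", 1901), // existing key: overwritten
  ("b00999", "Sir Arthur Conan Doyle", "The valley of fear", 1915)         // new key: inserted (illustrative row)
))

// Tuple fields map to the listed columns in order.
booksRDD.saveToCassandra("books_ks", "books",
  SomeColumns("book_id", "book_author", "book_name", "book_pub_year"))
```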

---
## 4.0. Upserts with CQL

In [20]:
// Runs on the driver only (not distributed across executors); use sparingly
cdbConnector.withSessionDo(session => session.execute("update books_ks.books set book_price=99.33 where book_id ='b00300';"))

res58: com.datastax.driver.core.ResultSet = ResultSet[ exhausted: true, Columns[]]
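
In Cassandra, a CQL `update` is itself an upsert: if no row with the given key exists, one is created. When several such statements are needed, building them from data avoids hand-editing string literals. The sketch below is a hypothetical helper (`PriceUpdateCql` is not part of the connector) that produces a statement of the same shape as the one above, doubling any single quote in the id so it stays a valid CQL string literal.

```scala
// Hypothetical helper for illustration: builds the CQL text only, it does not execute it.
object PriceUpdateCql {
  // CQL string literals escape a single quote by doubling it.
  private def quote(s: String): String = "'" + s.replace("'", "''") + "'"

  def update(bookId: String, price: BigDecimal): String =
    s"update books_ks.books set book_price=${price.bigDecimal.toPlainString} where book_id=${quote(bookId)};"
}
```

The result can be passed to the session as before, for example `cdbConnector.withSessionDo(session => session.execute(PriceUpdateCql.update("b00300", BigDecimal("99.33"))))`. Each statement still runs on the driver, so this suits a handful of rows rather than bulk changes.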