# What's in this exercise
Basics of how to work with Azure Cosmos DB Cassandra API from Zeppelin <B>in batch</B>.<br>
Section 01: Cassandra API connection<br>
Section 02: Keyspace DDL<br>
Section 03: Table DDL<br>
  
**Reference:** 
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md

### Prerequisites
The DataStax connector for Cassandra requires the Azure Cosmos DB Cassandra API connection details to be set when the Spark context is initialized.  When you launch a Jupyter notebook, the Spark session and context are already initialized, and it is not advisable to stop and reinitialize the Spark context unless you also re-apply every configuration that the HDInsight default Jupyter notebook start-up sets.  One workaround is to add the Cassandra instance details directly to the Spark2 service configuration in Ambari.  This is a one-time activity that requires a Spark2 service restart.<BR>

1.  In Ambari, open the Spark2 service and click Configs.
2.  Under Custom spark2-defaults, add the following properties, then restart the Spark2 service:
spark.cassandra.connection.host=YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmosdb.azure.com<br>
spark.cassandra.connection.port=10350<br>
spark.cassandra.connection.ssl.enabled=true<br>
spark.cassandra.auth.username=YOUR_COSMOSDB_ACCOUNT_NAME<br>
spark.cassandra.auth.password=YOUR_COSMOSDB_KEY<br>
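
The five properties above differ only in the account name and key, so they can be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper named `cosmosCassandraProps` (not part of any library) and the same placeholder values used in the list:

```scala
// Hypothetical helper: builds the five spark2-defaults properties listed above
// from a Cosmos DB account name and key. The arguments used below are
// placeholders, not real credentials.
def cosmosCassandraProps(accountName: String, accountKey: String): Seq[(String, String)] = Seq(
  "spark.cassandra.connection.host" -> s"$accountName.cassandra.cosmosdb.azure.com",
  "spark.cassandra.connection.port" -> "10350",
  "spark.cassandra.connection.ssl.enabled" -> "true",
  "spark.cassandra.auth.username" -> accountName,
  "spark.cassandra.auth.password" -> accountKey
)

// Render each pair as key=value, the form shown in the list above.
val propLines: Seq[String] =
  cosmosCassandraProps("YOUR_COSMOSDB_ACCOUNT_NAME", "YOUR_COSMOSDB_KEY")
    .map { case (k, v) => s"$k=$v" }

propLines.foreach(println)
```

Keeping the key as a function argument, rather than a literal in the notebook, makes it easier to avoid committing the real value.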


---------
## 1.0. Cassandra API connection

### 1.0.1. Configure dependencies

In [None]:
%%configure -f
{ "conf": {"spark.jars.packages": "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0,com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.0.0" }}
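
The `spark.jars.packages` value in the `%%configure` cell above is a comma-separated list of Maven coordinates in `groupId:artifactId:version` form; the `_2.11` suffix on the connector artifact is the Scala binary version it was built for. As a rough sketch of how that string is put together (the `MavenCoord` type is a hypothetical name used only for illustration):

```scala
// Hypothetical helper type: one Maven coordinate (groupId:artifactId:version).
case class MavenCoord(group: String, artifact: String, version: String) {
  def render: String = s"$group:$artifact:$version"
}

// The two packages named in the %%configure cell above.
val packages = Seq(
  MavenCoord("com.datastax.spark", "spark-cassandra-connector_2.11", "2.3.0"),
  MavenCoord("com.microsoft.azure.cosmosdb", "azure-cosmos-cassandra-spark-helper", "1.0.0")
)

// Join with commas to get the value expected by spark.jars.packages.
val sparkJarsPackages: String = packages.map(_.render).mkString(",")
println(sparkJarsPackages)
```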

### 1.0.2. Cassandra API configuration

In [2]:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

import org.apache.spark.sql.cassandra._

//datastax Spark connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

//Cosmos DB helper library for multiple retries
import com.microsoft.azure.cosmosdb.cassandra

// Specify connection factory for Cassandra
spark.conf.set("spark.cassandra.connection.factory", "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory")

// Parallelism and throughput configs - increase as needed to tune your jobs
spark.conf.set("spark.cassandra.output.batch.size.rows", "1")//Do not modify this
spark.conf.set("spark.cassandra.connection.connections_per_executor_max", "10")
spark.conf.set("spark.cassandra.output.concurrent.writes", "100")
spark.conf.set("spark.cassandra.concurrent.reads", "512")
spark.conf.set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
spark.conf.set("spark.cassandra.connection.keep_alive_ms", "60000000") 
spark.conf.set("spark.cassandra.output.ignoreNulls","true")



Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
13,application_1536862968089_0017,spark,idle,Link,Link,✔


SparkSession available as 'spark'.
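
The `spark.cassandra.connection.keep_alive_ms` value set in the configuration cell is expressed in milliseconds, which makes its magnitude hard to read at a glance. A small sketch of the conversion, using only the JDK (the variable names are illustrative):

```scala
import java.util.concurrent.TimeUnit

// Value passed to spark.cassandra.connection.keep_alive_ms in the cell above.
val keepAliveMs: Long = 60000000L

// Whole minutes, via the JDK's TimeUnit conversion (truncates any remainder).
val keepAliveMinutes: Long = TimeUnit.MILLISECONDS.toMinutes(keepAliveMs)

// Fractional hours: 3,600,000 ms per hour.
val keepAliveHours: Double = keepAliveMs / 3600000.0

println(f"keep_alive_ms=$keepAliveMs is $keepAliveMinutes minutes (about $keepAliveHours%.1f hours)")
```

In other words, the setting keeps idle connections open for 1000 minutes, a little under 17 hours.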


-----
## 2.0. Cassandra Keyspace DDL operations

### 2.0.a. Create keyspace - from Spark

In [None]:
val cdbConnector = CassandraConnector(sc)
// Create keyspace
cdbConnector.withSessionDo(session => session.execute("CREATE KEYSPACE IF NOT EXISTS books_ks WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 } "))
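
If several keyspaces need to be created, the CQL string can be built by a small function instead of being repeated inline. This is a sketch only: `createKeyspaceCql` is a hypothetical helper, and the identifier check is an assumption about which names you want to allow, not a rule imposed by Cosmos DB.

```scala
// Hypothetical helper: builds the CREATE KEYSPACE statement used in the cell above.
def createKeyspaceCql(keyspace: String, replicationFactor: Int = 1): String = {
  // Accept only simple unquoted identifiers so that arbitrary text
  // cannot be spliced into the statement.
  require(keyspace.matches("[A-Za-z][A-Za-z0-9_]*"), s"invalid keyspace name: $keyspace")
  s"CREATE KEYSPACE IF NOT EXISTS $keyspace " +
    s"WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': $replicationFactor}"
}

val keyspaceCql: String = createKeyspaceCql("books_ks")
println(keyspaceCql)
```

In the notebook, the result could then be passed to the connector in the same way as the literal string: `cdbConnector.withSessionDo(session => session.execute(keyspaceCql))`.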

### 2.0.b. Alter keyspace - from Spark
Not supported currently.

### 2.0.c. Delete keyspace - from Spark

In [None]:
val cdbConnector = CassandraConnector(sc)
//cdbConnector.withSessionDo(session => session.execute("DROP KEYSPACE books_ks"))

-----
## 3.0. Cassandra Table DDL operations

**Considerations:**<br>
&nbsp;&nbsp;-Throughput can be provisioned at the table level as part of the CREATE TABLE statement.<br>
&nbsp;&nbsp;-One partition key can store 10 GB of data.<br>
&nbsp;&nbsp;-One record can be at most 2 MB in size.<br>
&nbsp;&nbsp;-One partition key range can store multiple partition keys.<br>
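
The two size limits above give a quick way to estimate how many records a single partition key can hold. The arithmetic below is a back-of-the-envelope sketch that assumes binary units (1 GB = 1024 MB) and a hypothetical average record size of 1 KB; substitute your own figures.

```scala
// Limits quoted in the considerations above, assuming binary units.
val partitionKeyLimitBytes: Long = 10L * 1024 * 1024 * 1024 // 10 GB per partition key
val maxRecordBytes: Long = 2L * 1024 * 1024                 // 2 MB per record

// Worst case: every record is at the maximum size.
val recordsAtMaxSize: Long = partitionKeyLimitBytes / maxRecordBytes

// Hypothetical average record size, for illustration only.
val avgRecordBytes: Long = 1024L
val recordsAtAvgSize: Long = partitionKeyLimitBytes / avgRecordBytes

println(s"records per partition key at max record size: $recordsAtMaxSize")
println(s"records per partition key at 1 KB average:    $recordsAtAvgSize")
```

If the expected record count for one key value exceeds such an estimate, that is a sign the partition key is too coarse.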

### 3.0.a. Create table - from Spark

In [None]:
val cdbConnector = CassandraConnector(sc)
cdbConnector.withSessionDo(session => session.execute("CREATE TABLE IF NOT EXISTS books_ks.books(book_id TEXT PRIMARY KEY,book_author TEXT, book_name TEXT,book_pub_year INT,book_price FLOAT) WITH cosmosdb_provisioned_throughput=4000 , WITH default_time_to_live=630720000;"))
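
The `default_time_to_live` value in the statement above is given in seconds. A short worked conversion, using only the JDK and assuming 365-day years, shows what 630720000 amounts to:

```scala
import java.util.concurrent.TimeUnit

// TTL from the CREATE TABLE statement above, in seconds.
val ttlSeconds: Long = 630720000L

// 86,400 seconds per day.
val ttlDays: Long = TimeUnit.SECONDS.toDays(ttlSeconds)

// Assumes 365-day years (leap days ignored).
val ttlYears: Long = ttlDays / 365

println(s"default_time_to_live=$ttlSeconds s = $ttlDays days = $ttlYears years")
```

So rows written to this table expire after roughly 20 years unless a shorter TTL is set on the write itself.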

### 3.0.b. Alter table 
**Considerations:**<BR>
(1) Alter table - add/change columns - on the roadmap<BR>
(2) Alter provisioned throughput from Spark - on the roadmap <BR>
(3) Alter table TTL - on the roadmap <BR>

In [None]:
//val cdbConnector = CassandraConnector(sc)
//cdbConnector.withSessionDo(session => session.execute("ALTER TABLE books_ks.books WITH cosmosdb_provisioned_throughput=8000, WITH default_time_to_live=0;"))

### 3.0.c. Drop table 

In [None]:
val cdbConnector = CassandraConnector(sc)
//cdbConnector.withSessionDo(session => session.execute("DROP TABLE IF EXISTS books_ks.books;"))