# Spark on YARN integration with Azure Cosmos DB Cassandra API

The Cassandra API lets you interact with Azure Cosmos DB using Apache Cassandra constructs and codebase. Because Azure Cosmos DB is a managed service, not all functionality available in Cassandra is applicable. For example, Azure Cosmos DB has its own replication and indexing, which cannot be overridden.<br>

This notebook covers:<br>
1) Provisioning Azure Cosmos DB - Cassandra API<br>
2) Azure Cosmos DB instance related configuration needed for Spark integration<br>
3) Attaching the Datastax Spark-Cassandra connector library<br>
4) Attaching the Azure Cosmos DB - Cassandra API specific library<br>
5) Using cqlsh with Azure Cosmos DB Cassandra API

----------
### 1.0. Provision Azure Cosmos DB Cassandra API instance
Details can be found here:<br>
https://docs.microsoft.com/en-us/azure/cosmos-db/create-cassandra-dotnet#create-a-database-account

----------
### 2.0. Azure Cosmos DB instance related configuration 
The Datastax connector for Cassandra requires the Azure Cosmos DB Cassandra API connection details to be initialized as part of the Spark context. When you launch a Jupyter notebook, the Spark session and context are already initialized, and it is not advisable to stop and reinitialize the Spark context unless you reapply every configuration that the HDInsight default Jupyter notebook start-up sets. One workaround is to add the Cassandra instance details directly to the Spark2 service configuration in Ambari. This is a one-time activity that requires a Spark2 service restart.<br>

1.  Go to Ambari, open the Spark2 service, and click Configs.
2.  Go to Custom spark2-defaults, add the following properties, and restart the Spark2 service:<br>
spark.cassandra.connection.host=YOUR_COSMOSDB_ACCOUNT_NAME.cassandra.cosmosdb.azure.com<br>
spark.cassandra.connection.port=10350<br>
spark.cassandra.connection.ssl.enabled=true<br>
spark.cassandra.auth.username=YOUR_COSMOSDB_ACCOUNT_NAME<br>
spark.cassandra.auth.password=YOUR_COSMOSDB_KEY<br>
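
All five properties derive from just two values, the account name and the account key, so the lines to paste into Ambari can be generated. The following is a minimal sketch in plain Scala that runs as a notebook cell or Scala script. `cosmosCassandraProps` is a hypothetical helper name, not part of any library, and the host suffix and port are simply the values listed above.

```scala
// Hypothetical helper (not part of any library): builds the spark2-defaults
// properties listed above from the account name and account key.
def cosmosCassandraProps(accountName: String, accountKey: String): Seq[(String, String)] =
  Seq(
    "spark.cassandra.connection.host" -> s"$accountName.cassandra.cosmosdb.azure.com",
    "spark.cassandra.connection.port" -> "10350",
    "spark.cassandra.connection.ssl.enabled" -> "true",
    "spark.cassandra.auth.username" -> accountName,
    "spark.cassandra.auth.password" -> accountKey
  )

// Print key=value lines that can be pasted into Custom spark2-defaults.
cosmosCassandraProps("YOUR_COSMOSDB_ACCOUNT_NAME", "YOUR_COSMOSDB_KEY")
  .foreach { case (key, value) => println(s"$key=$value") }
```

After the Spark2 restart, calling `spark.sparkContext.getConf.get("spark.cassandra.connection.host")` in a notebook cell is one way to check that the setting was picked up; it throws `NoSuchElementException` if the property is missing.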

----------
### 3.0. Attaching the Datastax Spark-Cassandra connector library
To work with the Azure Cosmos DB Cassandra API, we have to use the Datastax Spark-Cassandra connector.<br>
1.  Check the Spark version on your HDInsight cluster and find the Maven coordinates of the compatible Datastax Spark-Cassandra connector at:<br>
https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector<br>
2.  Configure your notebook to use the external package you identified.<br>
An example for Spark 2.3.0 and Scala 2.11:<br>
```scala
%%configure
{ "conf": {"spark.jars.packages": "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0" }}
```
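
The coordinate follows the pattern `group:artifact_<Scala binary version>:<connector version>`, so it can be assembled from the versions the cluster reports. The following is a minimal sketch in plain Scala; `connectorCoordinate` is a hypothetical helper, and the connector version itself still has to be picked from the Maven repository page for your Spark release.

```scala
// Hypothetical helper: assembles the connector's Maven coordinate.
// scalaVersion is a full version such as "2.11.12"; only major.minor is kept.
def connectorCoordinate(scalaVersion: String, connectorVersion: String): String = {
  val scalaBinary = scalaVersion.split('.').take(2).mkString(".")
  s"com.datastax.spark:spark-cassandra-connector_$scalaBinary:$connectorVersion"
}

// Scala version of the running interpreter, for example the notebook kernel.
val runningScala = scala.util.Properties.versionNumberString

// "2.3.0" is the connector version used in the example above.
println(connectorCoordinate(runningScala, "2.3.0"))
```

In a notebook cell, `spark.version` reports the Spark version to match against the connector's compatibility information.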

------------
### 4.0. Attaching the Azure Cosmos DB - Cassandra API specific library
We need a custom connection factory because this is the only way to configure a retry policy on the connector.<br>
As in section 3.0, add the following Maven coordinates to attach the library to the cluster:<br>
Download: https://search.maven.org/artifact/com.microsoft.azure.cosmosdb/azure-cosmos-cassandra-spark-helper/1.0.0/jar<br>
Coordinates: com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.0.0<br>
The example below adds both Maven dependencies:<br>
```scala
%%configure
{ "conf": {"spark.jars.packages": "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0,com.microsoft.azure.cosmosdb:azure-cosmos-cassandra-spark-helper:1.0.0" }}
```
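
Attaching the packages does not by itself select the custom connection factory; the connector picks it up through its `spark.cassandra.connection.factory` setting. The sketch below shows how this might look in a notebook cell. It assumes that the helper library exposes the factory as `com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory` (verify the class name against the library version you attached), that `spark` is the session pre-created by the notebook, and that a keyspace `books_ks` with a table `books` already exists; both names are placeholders.

```scala
import org.apache.spark.sql.DataFrame

// Assumption: fully qualified class name of the factory in the helper library.
spark.conf.set(
  "spark.cassandra.connection.factory",
  "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory")

// Read a table through the connector's data source.
// "books_ks" and "books" are placeholder names, not objects created here.
val booksDf: DataFrame = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "books_ks", "table" -> "books"))
  .load()

// Display a few rows to confirm that the connection works.
booksDf.show(5)
```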

----------------
### 5.0. Using cqlsh with Azure Cosmos DB Cassandra API
If you prefer to validate with cqlsh, you can install a pseudo-distributed Cassandra instance on your machine and launch cqlsh against your Azure Cosmos DB Cassandra API instance.<br><br>
**1) Installing cqlsh:**
- Installation on Mac<br>
`brew install cassandra` <br>
- Start Cassandra<br>
`bin/cassandra` <br>
- Check status<br>
`bin/nodetool status` <br>
- Launch cqlsh<br>
`bin/cqlsh` <br>

<br>
**2) Connecting from cqlsh to Azure Cosmos DB Cassandra API:**
```bash
cd bin
# Use TLS 1.2 for the connection
export SSL_VERSION=TLSv1_2
# Skip server certificate validation
export SSL_VALIDATE=false
python cqlsh.py YOUR-COSMOSDB-ACCOUNT-NAME.cassandra.cosmosdb.azure.com 10350 -u YOUR-COSMOSDB-ACCOUNT-NAME -p YOUR-COSMOSDB-ACCOUNT-KEY --ssl
```