Summary: This document provides guidance on getting your hands dirty using Azure Cosmos DB API for Apache Cassandra Database.
What you learn from this Sample
Core Differences between Apache Cassandra and Azure Cosmos DB API for Cassandra
What you need for this Sample?
A few things to do before deep-dive
Azure Cosmos DB Cassandra API can be used as the data store for apps written for Apache Cassandra. This means that by using existing Apache drivers compliant with CQLv4, your existing Cassandra application can now communicate with the Azure Cosmos DB Cassandra API. In many cases, you can switch from using Apache Cassandra to using Azure Cosmos DB's Cassandra API, by just changing a connection string.
The Cassandra API enables you to interact with data stored in Azure Cosmos DB using the Cassandra Query Language (CQL), Cassandra-based tools (like cqlsh), and Cassandra client drivers that you're already familiar with. For more information, visit the official Microsoft documentation for Azure Cosmos DB API for Cassandra Database.
Azure Cosmos DB is a fully managed NoSQL database for modern app development, with SLA-backed speed and availability, automatic and instant scalability, and open-source APIs for MongoDB, Cassandra, and other NoSQL engines. For a more in-depth coverage of Azure Cosmos DB, you should visit the official site here > https://docs.microsoft.com/en-us/azure/cosmos-db/introduction
If you're looking to get started quickly, you can find a range of SDK Support and sample Tutorials using .NET, .NET Core, Java, Python etc. here.
Key learnings include:
- Creating an Apache Cassandra Keyspace in Azure Cosmos DB using API for Cassandra leveraging C#.
- Providing provisioned throughput (RU) at Keyspace level.
- Creating an Apache Cassandra Table in Azure Cosmos DB using API for Cassandra.
- Providing provisioned throughput (RU) at table level.
- Best practices for creating a Primary Key in Cassandra (which includes 1 partitionKey + 0 or more Clustering Columns). An in-depth technical discussion on Apache Cassandra Data Model is beyond the scope of this document, but if you're seriously interested, then I highly recommend going through the links in Further Study.
- Creating a table with a single Primary Key.
- Creating a table with a Compound Primary Key for a use-case wherein a single Primary Key will not work.
- Inserting data into both tables: uprofile.user and weather.data.
- Query operations using a simple filter across a Single Primary Key.
- Query operations using a simple filter across a Compound Primary Key.
- Query operations using a complex filter across a Compound Primary Key.
- Query operation that filters a table on a non-primary key. This flags an error, in line with Cassandra Database guidelines.
- Calculating Request Unit (throughput) in Azure Cosmos DB API for Cassandra via .NET SDK for different operations.
- HOMEWORK - Explore 2 possible solutions to the non-primary-key filtering error above.
- HOMEWORK - Explore increasing the cardinality of the weather.data table by replacing 'identity_id' with a column of the Cassandra timestamp data type.
There are some inherent differences, 'Architecturally', 'Conceptually' and 'Realistically', that you must be aware of when using the Azure Cosmos DB API for Cassandra. The core differences are outlined in the Comments section of the Program.cs file in the Visual Studio Solution in this repo. They are repeated here for convenience.
- The Azure Cosmos DB Cassandra API is compatible with CQL v3.11 API (backward-compatible with version 2.x). Read more > https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra/cassandra-support#cassandra-protocol
- Size limits:
  - Total size of data stored in a table on Cosmos: no limit. The rule is: add TBs/PBs of data as long as the 'partitionKey' size limits are respected.
  - Total data size of an entity (row) should not exceed 2MB.
  - Total data size of a single partitionKey cannot exceed 20GB.
- In OSS/DataStax, at Keyspace creation, you can choose these options: replica placement strategy (SimpleStrategy, NetworkTopologyStrategy), replication factor, and durable writes setting.
CREATE KEYSPACE uprofile WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 }
In Cosmos, all options are ignored currently (class, replicationstrategy, replicationfactor, datacenter). What Cosmos does is:
- Cosmos uses the underlying Global distribution replication method to add the regions.
- If you need cross-region replication, configure it at the account level with PowerShell, CLI, or the Azure portal.
- Durable_writes can't be disabled because Azure Cosmos DB ensures every write is durable.
- In every region, Cosmos replicates the data across the replica set that is made up of four replicas and this replica set configuration can't be modified.
- In Cosmos, throughput (RU) can be set both at Keyspace and Table level.
CREATE KEYSPACE sampleks WITH REPLICATION = { 'class' : 'SimpleStrategy'} AND cosmosdb_provisioned_throughput=2000;
CREATE TABLE sampleks.t1 (user_id int PRIMARY KEY, lastname text) WITH cosmosdb_provisioned_throughput=2000;
- In OSS/DataStax, the recommended size of a single partition (one partitionKey) is < 100MB. The best-practice recommendation is to store up to 100,000 rows in 1 partition in OSS Cassandra. In Cosmos, the size limit is 20GB per logical partition (a single partitionKey) and 30GB per physical partition. Each physical partition supports up to 10,000 RUs.
- In OSS/DataStax Cassandra, a replication factor is specified at keyspace creation time. In Cosmos, there is (by default) a replication factor of 4 (quorum of 3). Microsoft manages the replica sets, so you can sleep soundly at night.
- In OSS/DataStax, Cassandra has an important concept of tokens (a hash of the partitionKey). The token ring uses a murmur3 64-bit hash, with values ranging from -2^63 to 2^63 - 1. In Cosmos, we use a similar concept, but with a different token hash; the token ring range is different internally (larger), but externally the same.
- Differences in CQL functions:
  - Cosmos supports token as a projection/selector, and only allows token(pk) on the left-hand side of a where clause.
    WHERE token(pk) > 1024 is OK.
    WHERE token(pk) > token(100) is **not** supported.
  - The cast() function is not nestable in Cassandra API.
    SELECT cast(count as double) FROM myTable is supported.
    SELECT avg(cast(count as double)) FROM myTable is **not** supported.
  - Custom timestamps and TTL specified with the USING option are applied at a row level (and not per cell).
  - Aggregate functions work on regular columns, but aggregates on clustering columns are not supported. Read more > https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra/cassandra-support#cql-functions
- Specifics on the differences between OSS and Cosmos DB API CQL commands > https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra/cassandra-support#cql-commands
- In Cosmos, all attributes are indexed by default in other APIs (e.g. Core SQL API). The Cassandra API does not work in the same manner: it does not index all attributes by default. Instead, it supports 'Secondary Indexing'. Read more > https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra/secondary-indexing
- In Cosmos, filtering a query on a non-PrimaryKey column is not allowed by default (as per Cassandra best practices). See the code to fix it, either by creating a 'Secondary Index' or by using 'ALLOW FILTERING'.
- Cassandra API on Azure Cosmos DB supports only TLSv1.2.
- Consistency levels in OSS/DataStax Cassandra map differently to the consistency levels of the Cosmos DB Cassandra API. Read more for an in-depth comparison > https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra/apache-cassandra-consistency-mapping#mapping-consistency-levels
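The keyspace-level and table-level throughput options from the list above can be assembled in C# before they are sent to the service. This is only a minimal sketch and not the sample's actual code: the `CosmosCql` class and its method names are hypothetical, and the generated statements simply mirror the `sampleks` examples. Running them would need an open `ISession` from CassandraCSharpDriver, which is shown only as a comment.

```csharp
using System;

// Hypothetical helper (not part of the sample repo): it only builds CQL text.
// Identifiers are interpolated directly, so pass trusted names only.
public static class CosmosCql
{
    // Keyspace-level (shared) throughput, mirroring the sampleks example.
    public static string CreateKeyspace(string keyspace, int throughputRu)
    {
        if (throughputRu <= 0) throw new ArgumentOutOfRangeException(nameof(throughputRu));
        return $"CREATE KEYSPACE {keyspace} WITH REPLICATION = {{ 'class' : 'SimpleStrategy' }} " +
               $"AND cosmosdb_provisioned_throughput={throughputRu};";
    }

    // Table-level (dedicated) throughput; columns is the column and primary key clause.
    public static string CreateTable(string keyspace, string table, string columns, int throughputRu)
    {
        if (throughputRu <= 0) throw new ArgumentOutOfRangeException(nameof(throughputRu));
        return $"CREATE TABLE {keyspace}.{table} ({columns}) " +
               $"WITH cosmosdb_provisioned_throughput={throughputRu};";
    }
}

// Usage, assuming `session` is an open Cassandra ISession:
// session.Execute(CosmosCql.CreateKeyspace("sampleks", 2000));
// session.Execute(CosmosCql.CreateTable("sampleks", "t1", "user_id int PRIMARY KEY, lastname text", 2000));
```

Keeping the statement text in one place makes it easy to see which objects share keyspace-level RUs and which carry their own table-level RUs.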
This sample is in .NET. To run it, download the Visual Studio Solution file and make the changes described below. You can also leverage this GitHub repo to get up and running quickly > https://github.com/Azure-Samples/azure-cosmos-db-cassandra-dotnet-core-getting-started.
You need the following:
- An Azure subscription. If you do not have one, you can get a free one here with USD 200 Credit.
- A working Azure Cosmos DB Account with Cassandra API. Learn how to create one in the Azure portal using this Tutorial.
- Visual Studio Code / Visual Studio 2019 or similar IDE. You can download your VS here.
- Working knowledge of Apache Cassandra constructs, queries & limitations.
- Working knowledge of programming in C#.
It is assumed that you possess all of these, so that you can enjoy and do further R&D on this sample. Simply clone this git repo (or download it as a Zip).
- Open the Visual Studio Solution file and ensure your NuGet packages are up to date. Specifically, ensure that 'CassandraCSharpDriver' is installed. Your packages.config file should list this package.
- In the Program.cs file, edit the section below. You will find these values in the Azure portal, under your Cosmos DB account's Settings > Connection String:
// Cassandra Cluster configs section.
private const string UserName = "<< ENTER YOUR USERNAME >>";
private const string Password = "<< ENTER YOUR PRIMARY PASSWORD >>";
private const string CassandraContactPoint = "<< ENTER YOUR CONTACT POINT >>"; // DnsName
private static int CassandraPort = 10350; // Leave this as it is
- Once run successfully, the program creates 2 Keyspaces, with 1 Table in each Keyspace.
- Next, it will also load data into the corresponding tables with the Keys that have been created.
- Keyspace 'uprofile' has table user with a single PrimaryKey; keyspace 'weather' has table data with a Compound PrimaryKey.
- At this stage, you can pause to take a look at your resources in the Azure portal.
- When you perform operations against the Azure Cosmos DB Cassandra API, the RU charge is returned in the incoming payload as a field named RequestCharge. In the .NET SDK, you can retrieve the incoming payload under the Info property of a RowSet object. You can perform operations and check their RU charge on the VS console.
- We then proceed in VS to test basic query operations on the Keyspaces and Tables created in the earlier steps. Finally, we perform a filter operation against the weather.data table using a simple filter on the Compound Primary Key, which returns a result. You can note the RU consumption of each operation as you go along.
- Finally, we execute a filter against a Non-Primary Key and it throws an error. This is because Non-Primary Keys cannot be used in a query filter by default in Apache Cassandra. Possible solutions are given below, which you can tinker with to solve this issue.
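The RequestCharge lookup described in the steps above can be sketched as follows. This is a hedged illustration and not the sample's own code: `RequestChargeReader` and `Decode` are hypothetical names, and the payload layout (an 8-byte big-endian double) is an assumption that should be checked against the official documentation. The driver calls appear only as comments, so the decoding logic stands on its own.

```csharp
using System;

// Hypothetical helper (not part of the sample repo).
public static class RequestChargeReader
{
    // Assumption: the RequestCharge entry is an 8-byte big-endian IEEE 754 double.
    public static double Decode(byte[] payload)
    {
        if (payload == null || payload.Length != 8)
            throw new ArgumentException("Expected an 8-byte payload.", nameof(payload));

        // Copy first so the caller's array is not modified.
        byte[] bytes = (byte[])payload.Clone();

        // BitConverter uses the machine's byte order, so flip on little-endian hosts.
        if (BitConverter.IsLittleEndian) Array.Reverse(bytes);
        return BitConverter.ToDouble(bytes, 0);
    }
}

// Usage with CassandraCSharpDriver, assuming `session` is an open ISession:
// RowSet rowSet = session.Execute("SELECT * FROM uprofile.user");
// double charge = RequestChargeReader.Decode(rowSet.Info.IncomingPayload["RequestCharge"]);
// Console.WriteLine($"RU charge: {charge}");
```

Logging the decoded value after each statement gives a simple per-operation RU comparison on the console.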
In the Azure portal, you should find screens similar to these and do further R&D in Data Explorer.
- 2 Keyspaces and 2 Tables created: one with shared Keyspace-level RU, and one with provisioned Table-level RU.
- Use the CQL Query Builder & CQL Query Text in Data Explorer to query table uprofile.user with a simple filter (e.g. user_id = 7).
- Use the CQL Query Builder & CQL Query Text in Data Explorer to query table weather.data with a simple filter. Please note that this table has a Compound Primary Key (station_id, identity_id). First, we filter against 'station_id' = station_13. The result is as expected and the row is extracted from the database.
- Next, we use the CQL Query Builder & CQL Query Text in Data Explorer to query table weather.data with a complex filter. Please note that this table has a Compound Primary Key (station_id, identity_id). We now filter against 'station_id' = station_4 & 'identity_id' = 20210901, which represents our 'Noida' Weather Station in our dataset. The result is as expected and the row is extracted from the database.
- Next, we use the CQL Query Builder & CQL Query Text in Data Explorer to query table weather.data with a Non-Primary Key filter. Please note that this table has a Compound Primary Key (station_id, identity_id). We now filter against 'temp' = 74. The result is as expected: an error is thrown which says:
{"readyState":4,"responseText":"\"{\\\"message\\\":\\\"Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING\\\",\\\"activityId\\\":\\\"3b810dcd-ce7d-4483-9d43-51040f7103b5\\\"}\"","responseJSON":"{\"message\":\"Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING\",\"activityId\":\"3b810dcd-ce7d-4483-9d43-51040f7103b5\"}","status":500,"statusText":"error"}
In short, you cannot filter and execute a query against a Non-Primary Key in Apache Cassandra and the same holds true even if you're using Azure Cosmos DB's API for Cassandra Database.
- You can capture the same error when the query is executed in Visual Studio. A similar query executed from the .NET SDK exits with an error that reads the same.
If you have reached thus far, give yourself a well deserved applause and a coffee break! You can extend the sample and try and find out how you could solve the above mentioned error.
- Solving Error: HINT: There are 2 possible solutions to the above:
- Option #1: Add "ALLOW FILTERING" to the query, which solves the problem. Note: this is not a recommended option, since it can cause poor performance on very large datasets. In fact, for fast read/write query performance on Cassandra, it is highly recommended to build a data model that fits your use-case, with 1 highly optimized keyspace.table per query. For good guidance from the DataStax team, read this document: https://www.datastax.com/blog/allow-filtering-explained
- Option #2: In Azure Cosmos DB Cassandra API, you can choose which specific attributes you want to index; this is the concept of 'Secondary Indexes'. You can create a Secondary Index, which allows the query to run. Code to use:
CREATE INDEX ON weather.data (state);
describe table weather.data;
- Increasing Cardinality of weather.data table. During creation of the data table, we use:
session.Execute("CREATE TABLE IF NOT EXISTS weather.data (station_id int, identity_id int, temp int, state text, PRIMARY KEY (station_id, identity_id)) WITH cosmosdb_provisioned_throughput = 4000 AND CLUSTERING ORDER BY (identity_id DESC)");
You can replace identity_id with a column of the timestamp data type, e.g. ts. For real-life large dataset projects, it is recommended to use Cassandra UUID & timeuuid functions.
session.Execute("CREATE TABLE IF NOT EXISTS weather.data (station_id int, ts timestamp, temp int, state text, PRIMARY KEY (station_id, ts)) WITH cosmosdb_provisioned_throughput = 4000 AND CLUSTERING ORDER BY (ts DESC)");
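To see why a timestamp clustering column raises cardinality, consider two readings from the same station on the same day. The sketch below is illustrative only: `ClusteringKeyDemo` and its methods are hypothetical, and it assumes that identity_id holds a yyyyMMdd integer, as the value 20210901 used earlier suggests. Because a Cassandra INSERT on an existing primary key overwrites the row, a day-level key would keep only one reading per station per day, while a millisecond timestamp keeps both.

```csharp
using System;

// Hypothetical helper (not part of the sample repo).
public static class ClusteringKeyDemo
{
    // Assumption: identity_id is a day-granular integer such as 20210901.
    public static int ToDayKey(DateTimeOffset reading)
    {
        DateTime utc = reading.UtcDateTime;
        return utc.Year * 10000 + utc.Month * 100 + utc.Day;
    }

    // A CQL timestamp stores milliseconds since the Unix epoch.
    public static long ToTimestampKey(DateTimeOffset reading)
    {
        return reading.ToUnixTimeMilliseconds();
    }
}

// Usage:
// var morning = new DateTimeOffset(2021, 9, 1, 8, 0, 0, TimeSpan.Zero);
// var afternoon = new DateTimeOffset(2021, 9, 1, 14, 0, 0, TimeSpan.Zero);
// ToDayKey gives the same value for both readings, so they would share one row.
// ToTimestampKey gives two distinct values, so both rows are kept.
```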
- How to Create a Cassandra Data Model by DataStax (Patrick McFadin and Jeff Carpenter will develop a Cassandra data model for a real application — step by step): https://www.youtube.com/watch?v=4D39wJu5Too
- Official documentation of Microsoft for Azure Cosmos DB API for Cassandra: https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra/cassandra-introduction
- 2018 video of Azure Friday | Cassandra API for Azure Cosmos DB (Join Kirill Gavrylyuk and Scott Hanselman to learn about native support for Apache Cassandra API in Azure Cosmos DB with wire protocol level compatibility): https://www.youtube.com/watch?v=gFxJnegGG0o
- 2018 Azure Cosmos DB Cassandra API Overview by Govind Kanshi: https://www.youtube.com/watch?v=p3jSVi3ERFg
- 2020 Introduction to Cassandra API in Azure Cosmos DB by Theo van Kraay: https://www.youtube.com/watch?v=3WOFJjU126s
You can share any feedback at: sugh AT microsoft dot com
This is a free white paper released into the public domain. Anyone is free to use or distribute this white paper, for any purpose, commercial or non-commercial, and by any means. Same applies to the code in the repo.
THE WHITE PAPER IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE WHITE PAPER.
Have fun & happy coding!