# Apache Cassandra

## What is Cassandra?

> Cassandra is a free, open-source, distributed NoSQL wide-column data store.

<p align="center">
  <img src="./images/cassandra-logo.png" width=400>
</p>

It can handle petabytes of information and thousands of concurrent operations per second with no single point of failure.

Cassandra was originally developed by Facebook and released in 2008. It was initially designed to power Facebook's inbox search feature, helping users find conversations quickly. In 2010, it became a top-level project of the Apache Software Foundation, and it has become more widely used since.


## Architecture of Cassandra

> In contrast to many other storage services, Cassandra does not need a master node to control the others. This means that it has _no single point of failure_. Instead, Cassandra is composed of a peer-to-peer cluster of nodes arranged in a ring layout, with data replicated across nodes.

Here's Cassandra's architecture:

<p align="center">
  <img src="./images/cassandra-architecture2.jpg" width=400>
</p>

As a reminder:
- A _node_ is a single machine that stores data and runs the programs that coordinate the system
- A _cluster_ is a group of multiple nodes networked together. A cluster can contain nodes across multiple data centers.

Key architectural features of Cassandra to note:
- Cassandra can be installed on a single machine, in a Docker container or across a distributed system
- Each node is connected to _all other nodes_
- Each node in the cluster _plays the same role_
  - This means every node is independent and, at the same time, is interconnected to all other nodes, in a ring format
  - All nodes can accept read and write requests, regardless of where the data is stored
  - If a node fails, read/write requests can be served from any other node in the network
  - Each node is responsible for a portion of the data, determined by the partition key
- Data replication strategies are easily configurable, and can easily scale to multiple data centers
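
The ring layout described above can be illustrated with a small sketch. This is a toy model under stated assumptions, not Cassandra's implementation: the `Ring` class, the `token` helper and the node names are hypothetical, MD5 is used only because it is in the standard library, and a real cluster uses its own partitioner and typically many virtual nodes per machine.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: every node plays the same role."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Each node owns one position on the ring, derived from its name.
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str):
        """Return the nodes that hold a copy of this partition key."""
        start = bisect.bisect_right(self._tokens, token(partition_key))
        # Walk clockwise from the key's position, wrapping around the ring.
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]

    def read_from(self, partition_key: str, down=()):
        """Pick the first replica that is still up, or None if all are down."""
        for node in self.replicas(partition_key):
            if node not in down:
                return node
        return None


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
# If the first replica fails, the request is served by the next one.
fallback = ring.read_from("user:42", down={owners[0]})
```

Because no node is special, losing one replica only changes which of the remaining replicas answers the request.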

### Data Model

> Cassandra uses a wide-column data model which is composed of a _keyspace_ (the equivalent of a SQL database) and a _column family_ (equivalent to a SQL table).

Here is how we can visualise how the keyspace and column family fit together:
<p align="center">
  <img src="./images/cassandra-keyspace.jpg" width=400>
</p>

Cassandra's data is organised in a cluster using the following hierarchy:
- Keyspace
  - This is the outermost grouping for data in Cassandra
<p>

- Column Family (also called a Table)
  - The grouping for an ordered collection of rows, which can have varying lengths (due to the flexible schema model used)
<p>

- Row/Primary Key
  - Each row in a column family/table has a _row key_ (primary key) which is used to identify that row
  - It is divided into a _partition key_, which is required and determines the partition the data will reside on, and optional _clustering columns_, which determine the default sort order of rows within that partition
<p>

- Row
  - Is a collection of columns
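
To make the hierarchy concrete, here is a minimal in-memory sketch of a keyspace holding one table, with rows grouped by partition key and ordered by a clustering key. The `Table` class and the `messages_by_user` table name are hypothetical, chosen only for illustration; this models the shape of the data, not how Cassandra stores it.

```python
from collections import defaultdict


class Table:
    """Toy table: rows grouped by partition key, sorted by clustering key."""

    def __init__(self):
        # partition key -> {clustering key -> {column name -> value}}
        self._partitions = defaultdict(dict)

    def upsert(self, partition_key, clustering_key, **columns):
        row = self._partitions[partition_key].setdefault(clustering_key, {})
        row.update(columns)  # rows may carry different sets of columns

    def read_partition(self, partition_key):
        rows = self._partitions.get(partition_key, {})
        # Rows come back in clustering-key order within the partition.
        return [(ck, rows[ck]) for ck in sorted(rows)]


# A keyspace is the outermost grouping: here, a mapping of table names to tables.
keyspace = {"messages_by_user": Table()}

inbox = keyspace["messages_by_user"]
inbox.upsert("alice", 2, body="second")
inbox.upsert("alice", 1, body="first", starred=True)
inbox.upsert("bob", 1, body="hi")
alice_rows = inbox.read_partition("alice")
```

All of a user's messages live in one partition, so reading them back is a single lookup followed by an ordered scan.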


## Strengths of Cassandra

> Cassandra's main strength is that it has no single point of failure. This sets it apart from many SQL databases and from NoSQL data stores that rely on a master node.

The other strengths of Cassandra include:

<p align="center">
  <img src="./images/cassandra-strengths3.png" width=600>
</p>

- Schema-free data model 
    - Flexible data model can support both structured and unstructured data
<p>

- Open Source code 
    - With an active contributing community
<p>

- User-friendly Interface
    - The tool ships with `cqlsh`, an interactive command-line shell for running queries; graphical clients are available as separate third-party tools
<p>

- High Performance 
    - Very low latency write operations due to its design
    - This is due to Cassandra's storage engine, which is based on log-structured merge trees (LSM trees). A write is appended to a commit log on disk and applied to an in-memory table (the memtable), so Cassandra does not have to read the existing data before writing. Memtables are later flushed to disk in bulk as immutable files (SSTables).
    - For more details, check out this [article](https://www.linkedin.com/pulse/why-cassandra-writes-faster-than-traditional-rdbms-vishal-kharjul/)
<p>

- Ability to handle data coming in at high velocities
    - Cassandra supports streaming data easily
<p>

- Can handle high volumes of data
    - It can easily scale horizontally by increasing the number of nodes in a cluster
<p>

- No Downtime
    - As the tool does not have a single point of failure
<p>

- Highly Scalable 
    - Nodes can be easily added to the cluster at any time
<p>

- Fault Tolerant
    - Data is automatically replicated to multiple nodes based on a replication factor
    - Failed nodes can be easily replaced
<p>

- Hadoop Compatibility
    - Can easily integrate with Hadoop and provides MapReduce support
<p>

- Provides its own query language (CQL)
    - Easy to use
    - Can be considered a simpler version of SQL 
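
The low-latency write path mentioned under High Performance can be sketched as follows. This is a simplified illustration of the log-structured idea, assuming a hypothetical `LsmStore` class with a tiny `memtable_limit`; plain Python lists stand in for files on disk, and the real engine is considerably more involved.

```python
class LsmStore:
    """Toy log-structured store: append to a log, buffer in memory, flush in bulk."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable, sorted segments (newest last)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # No read-before-write: append to the log and overwrite in memory.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Persist the buffer as one sorted, immutable segment.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # the flushed writes no longer need the log

    def read(self, key):
        # Reads may have to check the memtable and then several segments.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.sstables):
            for k, v in segment:
                if k == key:
                    return v
        return None


store = LsmStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # second key fills the memtable and triggers a flush
store.write("a", 3)   # newer value for "a" sits in the memtable
```

The write never looks at existing data, while the read may need to consult several places. That asymmetry is one reason writes tend to be cheaper than reads in this design.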


## Limitations of Cassandra

Some of the main limitations include:

<p align="center">
  <img src="./images/cassandra-limitations2.png" width=600>
</p>

- Data deletion is slow:
    - Instead of deleting data immediately, Cassandra marks it with a "delete marker" (known as a _tombstone_)
    - The delete is treated as an insert/upsert of that marker, so the data is only physically removed later, when Cassandra compacts its data files
<p>    

- Slower Reads compared to writes
    - Although Cassandra excels at write operations, reads get progressively slower as the data increases
    - Moving data across the network is one reason latency increases during read operations. Also, the way indexes work is not as efficient as in other data stores or relational database systems.
    - The use of data tombstones (delete markers) also negatively impacts the read performance
    - For more details, check out this [article](https://medium.com/analytics-vidhya/15-reasons-of-read-latency-in-cassandra-8d965f18f85c)
<p>

- Cannot do JOINs
    - This is not supported by the data model
<p>

- Does not support sub-queries (nested querying)
    - As the data stored is usually de-normalised
<p>

- Cross-partition transactions are slow
    - Cross-partition refers to data that is spread out among multiple partitions
    - Such operations have to coordinate several nodes over the network, which uses more bandwidth and decreases performance
<p>

- Does not support Foreign Keys between tables
    - As it does not follow the relational database model
    - Also, keys might not be unique
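
The delete-marker behaviour described at the top of this list can be sketched as follows. The `TombstoneStore` class and its `grace_seconds` parameter are hypothetical names for this illustration; the sketch only shows why a delete is really a write, and why reads keep paying for markers until a later clean-up pass removes them.

```python
import time

TOMBSTONE = object()  # sentinel standing in for a delete marker


class TombstoneStore:
    def __init__(self, grace_seconds=0.0):
        self.cells = {}  # key -> (value or TOMBSTONE, write timestamp)
        self.grace_seconds = grace_seconds

    def write(self, key, value):
        self.cells[key] = (value, time.monotonic())

    def delete(self, key):
        # A delete is just another write: it records a marker, not a removal.
        self.write(key, TOMBSTONE)

    def read(self, key):
        value, _ = self.cells.get(key, (None, 0.0))
        # Reads must still step over markers that have not been purged.
        return None if value is TOMBSTONE else value

    def compact(self):
        # Markers older than the grace period are physically dropped here.
        now = time.monotonic()
        self.cells = {
            k: (v, ts)
            for k, (v, ts) in self.cells.items()
            if not (v is TOMBSTONE and now - ts >= self.grace_seconds)
        }


cells = TombstoneStore(grace_seconds=0.0)
cells.write("k", "v")
cells.write("x", 1)
cells.delete("k")
still_stored = "k" in cells.cells   # the marker occupies space until compaction
cells.compact()
```

Until `compact()` runs, the deleted key still takes up space and still has to be checked on every read of that key.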


## Use Cases for Cassandra

> According to some studies, Cassandra is being used by over 80% of Fortune 500 companies

It is also currently the top wide-column data store worldwide, and is overall the 11th most popular data store according to [DB-Engines](https://db-engines.com/en/ranking).

Some notable companies using the tool include:

#### Instagram
- Instagram uses Cassandra as a general purpose data storage tool
- It is leveraged to support the user photo feed, direct messaging and for fraud detection
- Read more about the use case [here](https://thenewstack.io/instagram-supercharges-cassandra-pluggable-rocksdb-storage-engine/)

#### Uber
- Uber uses Cassandra as a data dump for its real-time location data
- Location data is sent every 30 seconds by both the driver app and the rider app
- Read more about this use case [here](http://highscalability.com/blog/2016/9/28/how-uber-manages-a-million-writes-per-second-using-mesos-and.html)


## Key Takeaways
- Apache Cassandra is one of the most popular NoSQL wide-column data stores used by global companies today
- A Cassandra cluster is composed of one or more nodes arranged in a ring. Every node plays the same role, and accordingly there is no single point of failure.
- The data model used is hierarchical, with the topmost level being the keyspace, followed by a column family which contains columns and rows. Each row has a row key to identify it.
- Cassandra has many strengths, including: no single point of failure, a flexible data model, open-source code, a user-friendly interface, high scalability, strong fault tolerance, Hadoop compatibility and a simple-to-use query language (CQL)
- The main limitations of the tool include: much slower read operations (compared to writes), slow data deletion, lack of support for JOINs and no support for sub-queries
- Many global companies use Cassandra. The main use case is to leverage the tool as a general purpose data store/dump for big data.