## Review - Databases

A database is a central system in which data can be stored and queried.

We usually desire databases to support the following characteristics:

* store massive amounts of data
* allow access to stored data via a query language
* allow durable data storage even when there are power failures
* allow multiple users to read and write the same data

Consider how these features are different from a file system. Consider how ignoring these desired features could lead to [catastrophe](https://www.bbc.com/news/technology-54423988). 

## Review - Relational Databases

Used to store and query structured data. Data is stored in tables, each of which represents some kind of entity. Each row (one instance of the entity) is assigned a *primary key* that uniquely identifies it.

The use of keys to reference data enables a database to be normalized so that details are only stored once. Tables are managed and queried using SQL, which is based on an [ANSI](https://blog.ansi.org/sql-standard-iso-iec-9075-2023-ansi-x3-135/#gref) (American National Standards Institute) standard, so it's similar across database systems.
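As a minimal sketch of primary keys and normalization, the following uses Python's built-in `sqlite3` module with an in-memory database. The `customer` and `orders` tables and their columns are hypothetical, chosen only for illustration. The customer's name is stored once and referenced by key from each order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign-key checks off by default

# Hypothetical schema: each customer is stored once and identified by a primary key.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customer(id),
           item TEXT NOT NULL
       )"""
)

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Bob')")
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    [(1, "apple"), (1, "peach")],
)

# The join follows the key to recover the customer's details for each order.
rows = conn.execute(
    """SELECT customer.name, orders.item
       FROM orders JOIN customer ON customer.id = orders.customer_id
       ORDER BY orders.id"""
).fetchall()
print(rows)
```

Because orders refer to the customer only by `customer_id`, renaming the customer would touch a single row in `customer`.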

## Review - Non-relational Databases

Used to store and query unstructured data, or data that does not fit a relational schema. Often referred to as NoSQL databases, even though some non-relational databases support a variant of SQL.

There are four common types of non-relational database:

**Key-value databases**

Each record consists of a unique key and an associated value.

| key     | value |
|---------|-------|
| "apple" | 1.99  |
| "peach" | 2.99  |
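A Python `dict` gives a rough in-memory feel for the key-value model. This is only an analogy: a real key-value database adds persistence, concurrency control, and usually distribution. The `"pear"` and `"kiwi"` keys are made up for the example.

```python
# The table above as an in-memory key-value mapping.
prices = {"apple": 1.99, "peach": 2.99}

prices["pear"] = 0.99              # put: store a value under a new key
apple_price = prices.get("apple")  # get: look a value up by its key
missing = prices.get("kiwi")       # an absent key yields None rather than a row
del prices["peach"]                # delete: remove a key and its value

print(apple_price, missing, sorted(prices))
```

Note that the only way to find a record is by its key; there is no query over the values themselves.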


**Document databases**

A specific form of key-value database in which the value is a JSON document.

| key     | Document                           |
|---------|------------------------------------|
| 1       | {"name": "Eve"}                    |
| 2       | {"name": "Adam", "banished": 1}    |
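The same two documents can be sketched in Python with the standard `json` module. The `documents` mapping stands in for a document store and is an assumption of this example, not any particular product's API. Unlike a plain key-value store, the contents of each value can be inspected, and documents need not share the same fields.

```python
import json

# Each value is a parsed JSON document; the two documents have different fields.
documents = {
    1: json.loads('{"name": "Eve"}'),
    2: json.loads('{"name": "Adam", "banished": 1}'),
}

# Filter on a field inside the document, tolerating documents that lack it.
banished = [key for key, doc in documents.items() if doc.get("banished") == 1]
print(banished)
```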

**Column family databases**

Store tabular data comprising rows and columns, but you can divide the columns into groups known as column families.

| key     | Customer: Name | Customer: Address |
|---------|----------------|-------------------|
| 1       | Bob            | 123 Main St.      |
| 2       | Adam           | 125 Main St.      |
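One way to picture the layout is as nested mappings: row key, then column family, then column. This is a rough sketch, not how any specific column family database stores data. The `Billing` family and the `read_family` helper are hypothetical additions to show that a family can be read on its own and may be absent for some rows.

```python
# row key -> column family -> column -> value
rows = {
    1: {"Customer": {"Name": "Bob", "Address": "123 Main St."}},
    2: {"Customer": {"Name": "Adam", "Address": "125 Main St."}},
}

# Hypothetical second family, present only for row 1.
rows[1]["Billing"] = {"Plan": "monthly"}


def read_family(row_key, family):
    """Return the columns of one family for one row, or {} if the row lacks it."""
    return rows[row_key].get(family, {})


print(read_family(1, "Customer"))
print(read_family(2, "Billing"))
```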

**Graph databases**

Store entities as nodes, with links that define the relationships between them.
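A minimal sketch of the node-and-link idea, using an adjacency list in plain Python. The node names, relationship labels, and the `neighbors` helper are all hypothetical; real graph databases provide their own query languages for this kind of traversal.

```python
from collections import defaultdict

# Hypothetical links, each written as (source node, relationship, target node).
edges = [
    ("Eve", "KNOWS", "Adam"),
    ("Adam", "KNOWS", "Bob"),
    ("Bob", "LIVES_AT", "123 Main St."),
]

# Adjacency list: node -> list of (relationship, target).
graph = defaultdict(list)
for source, relation, target in edges:
    graph[source].append((relation, target))


def neighbors(node, relation):
    """Follow links of one relationship type out of a node."""
    return [target for rel, target in graph.get(node, []) if rel == relation]


# Two-hop traversal: who do the people Eve knows know?
friends_of_friends = [
    person for friend in neighbors("Eve", "KNOWS") for person in neighbors(friend, "KNOWS")
]
print(friends_of_friends)
```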

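Source: [https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/data-store-overview](https://learn.microsoft.com/en-us/azure/architecture/guide/technology-choices/data-store-overview)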


## Review - Transactional Data Processing

Considered a primary function of business computing. A transactional system records *transactions* that encapsulate specific events that organizations want to track. Consider a transaction as a single unit of work.

Transactional systems must be able to handle high volumes of data that must be easily accessible to the appropriate parties. This is often referred to as Online Transaction Processing (OLTP).

OLTP solutions rely on database systems in which storage is optimized for read & write operations to support workloads where data is created, retrieved, updated and deleted (CRUD operations).

OLTP enforces **ACID** semantics:  

--------

**Atomic**: Transactions commit all operations or none

A transaction either succeeds completely or fails completely. Ex: A transfer between bank accounts. If money is withdrawn from one account but the deposit to the other fails, the withdrawal must be rolled back; the transaction can never partially complete.
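The bank transfer can be sketched with Python's built-in `sqlite3` module. The `account` table, the `transfer` function, and the simulated failure are assumptions made for illustration. Using the connection as a context manager commits if the block finishes and rolls back if an exception escapes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(amount, fail_before_deposit=False):
    """Move money from account 1 to account 2 as a single unit of work."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
        if fail_before_deposit:
            raise RuntimeError("simulated failure between withdrawal and deposit")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = 2", (amount,))


try:
    transfer(30, fail_before_deposit=True)
except RuntimeError:
    pass  # the withdrawal was rolled back along with the rest of the transaction

balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(balances)
```

After the failed transfer, both balances are unchanged: the withdrawal did not survive on its own.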

--------

**Consistent**: Database must be consistent before and after transactions

A transaction must take an OLTP database from one valid state to another valid state. Ex: After a transfer between bank accounts, all data constraints and key constraints still hold in every table; a transaction that would violate one of them is rejected.
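As one possible illustration, a `CHECK` constraint in SQLite can encode a rule such as "a balance may not go negative". The `account` table and the rule itself are hypothetical. An update that would break the rule raises `sqlite3.IntegrityError`, and the data stays in its previous valid state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE account (
           id INTEGER PRIMARY KEY,
           balance INTEGER NOT NULL CHECK (balance >= 0)
       )"""
)
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()

try:
    with conn:
        # Withdrawing 500 from a balance of 100 would violate the CHECK constraint.
        conn.execute("UPDATE account SET balance = balance - 500 WHERE id = 1")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(rejected, balance)
```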

--------

**Isolated**: Multiple transactions can occur without interference

Transactions cannot interfere with one another. Ex: One transaction transfers money between bank accounts, while another transaction observes balances. The second transaction *cannot* observe the intermediate state of the first transaction, where money is withdrawn from one bank account, but not deposited in another.
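A small sketch of this scenario with two `sqlite3` connections to the same temporary database file: `writer` performs the transfer and `reader` observes balances. The table and connection names are assumptions of the example. SQLite achieves this behavior through file locking; other database systems use different mechanisms and offer several isolation levels.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bank.db")

    writer = sqlite3.connect(path)
    writer.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    writer.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    writer.commit()

    reader = sqlite3.connect(path)

    # The writer starts a transfer but has only performed the withdrawal so far.
    writer.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")

    # The reader still sees the last committed state, not the half-finished transfer.
    seen_during = reader.execute("SELECT balance FROM account ORDER BY id").fetchall()

    writer.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    writer.commit()

    # Once the transfer is committed, the reader sees both changes together.
    seen_after = reader.execute("SELECT balance FROM account ORDER BY id").fetchall()

    reader.close()
    writer.close()

print(seen_during, seen_after)
```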

--------

**Durable**: System failure does not undo transactions

Commits stay in place. If the database is switched off, accidentally or on purpose, transactions that were successfully committed and logged are not lost when it comes back up.
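A rough way to see this with `sqlite3` is to commit a change, close the connection, and reopen the database file. Closing the connection is only a stand-in for a shutdown, not a real power failure, and the `ledger` table is hypothetical. The committed row is still there after reopening, while work that was never committed is gone.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ledger.db")

    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE ledger (id INTEGER PRIMARY KEY, note TEXT)")
    conn.execute("INSERT INTO ledger (note) VALUES ('deposit 30')")
    conn.commit()  # from this point on the deposit must survive a restart

    conn.execute("INSERT INTO ledger (note) VALUES ('never committed')")
    conn.close()  # simulated shutdown: the open transaction is discarded

    conn = sqlite3.connect(path)  # simulated restart
    notes = [row[0] for row in conn.execute("SELECT note FROM ledger ORDER BY id")]
    conn.close()

print(notes)
```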

--------

OLTP systems are typically used to support live applications that process business data - often referred to as line of business (LOB) applications.

## Review - Analytical Data Processing

Analytical data processing typically uses "read-only" systems that store vast volumes of historical data or business metrics. Can be based on a snapshot of the data at a given point in time, or a series of snapshots.

Online analytical processing (OLAP) models vary from business to business, but a common architecture encapsulates the following:

1. Data is stored in a **data lake** for analysis
2. An ETL pipeline copies data from OLTP databases & data lakes into **data warehouses** that are optimized for fast reads. 
3. Data from the warehouse is stored in an **OLAP model**, or "cube." Aggregates are computed from the warehouse data along various dimensions (such as date, customer, and product).
4. Data from the OLAP model is queried to produce data visualizations, reports, or models.

**data lake**

A [data lake](https://azure.microsoft.com/en-us/solutions/data-lake) stores large volumes of file-based and possibly unrelated data. On Azure, it is typically built on Azure Blob Storage with a hierarchical namespace.

**data warehouses**

A [data warehouse](https://cloud.google.com/learn/what-is-a-data-warehouse) is a relational data store that is optimized for read operations (primarily for data reporting & data visualization). May require denormalizing the source data.

**OLAP model**

An [OLAP cube model](https://learn.microsoft.com/en-us/system-center/scsm/olap-cubes-overview?view=sc-sm-2022) is an aggregated type of data storage that is optimized for analytical workloads. Data is taken from an OLTP database and pre-aggregated across multiple dimensions, which enables you to *quickly* query summary statistics.
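The pre-aggregation idea can be sketched in a few lines of Python. The `sales` facts are made up, and the dictionary-based `cube` is only a toy model of what an OLAP engine builds. Every combination of dimensions is summed ahead of time, so a summary query becomes a single lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts: (date, customer, product, amount).
sales = [
    ("2024-01-01", "Bob", "apple", 2),
    ("2024-01-01", "Adam", "apple", 3),
    ("2024-01-02", "Bob", "peach", 5),
]
DIMENSIONS = ("date", "customer", "product")

# Pre-aggregate over every subset of dimensions; None means "all values".
cube = defaultdict(int)
for *dims, amount in sales:
    for size in range(len(DIMENSIONS) + 1):
        for kept in combinations(range(len(DIMENSIONS)), size):
            key = tuple(dims[i] if i in kept else None for i in range(len(DIMENSIONS)))
            cube[key] += amount

# Summary statistics are now lookups rather than scans over the raw facts.
grand_total = cube[(None, None, None)]
bob_total = cube[(None, "Bob", None)]
apples_on_first = cube[("2024-01-01", None, "apple")]
print(grand_total, bob_total, apples_on_first)
```

The number of stored aggregates grows quickly with the number of dimensions and distinct values, which is part of the maintenance cost of a cube.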

However, consider the following problem. Say your original OLTP schema changes. With massive amounts of data, data engineers will have to re-run the ETL pipelines and rebuild the cube, which can be very costly in terms of time.

In addition, OLTP databases had to be organized in a way that made cube creation as easy as possible.

As compute and memory have become cheaper, we've moved away from the complexity of maintaining data cubes. With the spread of columnar data warehouses, we can now run simple OLAP-type workloads on regular SQL databases. This is because columnar databases support:

* higher read efficiency
* better compression
* better sorting & indexing efficiency
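A toy comparison of row and column layouts in Python hints at where the first two advantages come from. The data and the `run_length_encode` helper are assumptions of this sketch; real columnar engines use far more sophisticated storage and encodings.

```python
from itertools import groupby

# Row layout: each record keeps all of its fields together.
rows = [
    {"date": "2024-01-01", "product": "apple", "amount": 2},
    {"date": "2024-01-01", "product": "apple", "amount": 3},
    {"date": "2024-01-02", "product": "peach", "amount": 5},
]

# Column layout: all values of one field are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate reads only the column it needs, not every field of every row.
total = sum(columns["amount"])


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Similar values sit next to each other in a column, so they compress well.
encoded_dates = run_length_encode(columns["date"])
print(total, encoded_dates)
```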

At this point, cubes might be [outdated](https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-cube/).