https://blog.bytebytego.com/p/factors-to-consider-in-database-selection<br>
https://blog.bytebytego.com/p/understanding-database-types


# Data Storage Format

## Documents
A document is the basic unit of data in document-oriented databases. Its structure groups information into key-value pairs.<br>
- Supports various data types and can contain nested structures and arrays.
- Schema-less: No predefined structure. Different documents within the same group can have different fields, enabling more flexible and rapid development.
- Contains unique identifiers, allowing easy retrieval and indexing.
- Fields can be added or removed as needed.
- Supports horizontal scaling across distributed systems.
- Enables rapid application development and evolution.
- Data validation typically happens at the application level rather than at the database level.
- Documents can be nested to represent complex relationships without joins.
- Efficient serialization/deserialization
- Optimized for read/write operations
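
The properties above can be illustrated with plain Python dictionaries. This is a minimal sketch, not the API of any particular database: the field names, the `validate` helper, and the sample values are assumptions made for the example.

```python
import json

# Two hypothetical documents in the same group. They do not share a schema.
tom = {
    "_id": 1,
    "name": "Tom",
    "age": 30,
    "address": {"city": "Lisbon", "zip": "1000-001"},  # nested structure
    "tags": ["student", "admin"],                       # array
}
ana = {"_id": 2, "name": "Ana", "email": "ana@example.com"}  # different fields

# Fields can be added or removed as needed.
ana["age"] = 25
del tom["tags"]


def validate(doc: dict) -> bool:
    """Application-level validation (hypothetical rule: _id and name are required)."""
    return isinstance(doc.get("_id"), int) and isinstance(doc.get("name"), str)


# Serialization round trip: document -> JSON text -> document.
restored = json.loads(json.dumps(tom))
```

Because the database does not enforce a schema in this model, a check such as `validate` is the only thing that stops a malformed document from being stored.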

### Formats
Document formats represent data as key-value pairs.
- JSON is the most common format: a human-readable structure that supports nested data.

      {
          "_id": 1,
          "name": "Tom",
          "age": 30
      }

- BSON (Binary JSON) extends JSON with a binary representation and additional data types, optimizing storage and processing. Documents are stored internally as BSON for more efficient storage and querying, but database tools display them as JSON.
- XML remains relevant in legacy systems. It offers a tag-based structure and schema support, but it is verbose.

      <student>
          <_id>1</_id>
          <name>Tom</name>
          <age>20</age>
      </student>

- YAML is a human-friendly format with minimal punctuation and support for comments.

      _id: 1
      name: Tom
      age: 20
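
To compare the formats concretely, the sketch below serializes the same record as JSON and as XML using only the Python standard library. YAML is left out because it needs a third-party package. The `record` variable and the `student` element name mirror the snippets above; everything else is an assumption made for the example.

```python
import json
import xml.etree.ElementTree as ET

record = {"_id": 1, "name": "Tom", "age": 20}

# JSON keeps the number type of "age".
as_json = json.dumps(record)

# XML stores every value as text, so types must be restored by the reader.
student = ET.Element("student")
for key, value in record.items():
    ET.SubElement(student, key).text = str(value)
as_xml = ET.tostring(student, encoding="unicode")

# Reading the values back shows the difference in type handling.
age_from_json = json.loads(as_json)["age"]          # int
age_from_xml = ET.fromstring(as_xml).find("age").text  # str
```

For this record the XML text is noticeably longer than the JSON text, which is the verbosity mentioned above.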

### Collections
A collection is a grouping of documents that provides organization and management of related documents.<br>
- Documents in a collection can have different fields; they have no predefined structure/schema.
- Can grow to billions of documents.
- Many document databases can automatically partition large collections across servers.
- Fast queries through indexing.
- Supports indexes on any document field, with multiple index types available.
- Each collection exists in a specific database namespace.
- Insert, update, and delete operations work on collections.
- Access control and security policies apply at the collection level.

The flexibility of documents and collections makes them well-suited for applications where data requirements might evolve rapidly, like content management systems, e-commerce sites, or analytics applications.

# Indexing
Indexes speed up queries on the indexed fields.

## Compound index

## Collections
- Individual fields within documents are indexed. The same field across all documents in a collection shares the same index. Not every field needs to be indexed.
- Writing new documents updates all relevant indexes.
- Indexes belong to collections, not individual documents.
- Documents' fields are what get indexed.
- One index can cover multiple fields (compound index).
- Index management occurs at the collection level: all index-related operations and decisions apply to the entire collection, not to individual documents.
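
The points above can be sketched with a toy in-memory collection. This is an illustration under simplifying assumptions, not how a real database engine stores indexes: the `Collection` class and its methods are hypothetical, indexes only support exact-match lookups, and re-inserting an existing `_id` does not clean up old index entries.

```python
from collections import defaultdict


class Collection:
    """Toy in-memory collection (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}     # _id -> document
        self.indexes = {}  # tuple of field names -> {tuple of values -> set of _ids}

    @staticmethod
    def _add_to_index(index, fields, doc):
        # A document that lacks an indexed field is not entered in that index.
        if all(f in doc for f in fields):
            index[tuple(doc[f] for f in fields)].add(doc["_id"])

    def create_index(self, *fields):
        # One index can cover one field or several (compound index).
        index = defaultdict(set)
        for doc in self.docs.values():
            self._add_to_index(index, fields, doc)
        self.indexes[fields] = index

    def insert(self, doc):
        self.docs[doc["_id"]] = doc
        # Writing a document updates every relevant index.
        for fields, index in self.indexes.items():
            self._add_to_index(index, fields, doc)

    def find(self, **query):
        fields = tuple(query)
        if fields in self.indexes:
            # Index lookup: field order must match the order used in create_index.
            ids = self.indexes[fields].get(tuple(query.values()), set())
            return [self.docs[i] for i in sorted(ids)]
        # No matching index: fall back to scanning every document.
        return [d for d in self.docs.values()
                if all(k in d and d[k] == v for k, v in query.items())]


users = Collection()
users.create_index("city")          # single-field index
users.create_index("city", "age")   # compound index
users.insert({"_id": 1, "name": "Tom", "city": "Lisbon", "age": 30})
users.insert({"_id": 2, "name": "Ana", "city": "Lisbon", "age": 25})
users.insert({"_id": 3, "name": "Bo", "age": 30})  # no "city" field
```

Here `users.find(city="Lisbon", age=30)` is answered from the compound index, while `users.find(name="Bo")` scans the whole collection because `name` is not indexed.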

# Scaling
Scaling techniques improve the performance and capacity of large-scale database systems.

## Horizontal Scaling (Sharding)
Sharding is a method of distributing data across multiple servers or nodes to improve performance, handle larger datasets, and increase the database’s fault tolerance.
The database divides collections into chunks using a shard key (a field in the data, such as an id) that is used both to partition data and to route requests. Chunks are distributed across different servers.<br>
When an application queries data, the database routing layer determines which shard(s) contain the relevant data, enabling more efficient data retrieval by accessing only specific shards.<br>
- Allows adding more servers to the database cluster as the data grows, rather than upgrading to a more powerful and expensive server.
- Often cheaper than vertical scaling, since capacity is added with additional servers instead of a single high-end machine.
- Read/write operations can occur in parallel, reducing latency and improving response times.
- Workload spread across multiple machines.
- Capacity can typically be added without downtime.
- Keeps working if some servers fail: other shards can continue to serve the application’s requests, increasing overall availability and reliability.
- An improperly chosen shard key can lead to "hot spots" where some shards handle most of the traffic, leading to performance bottlenecks.
- More network communication.
- More complex to manage and maintain than non-sharded systems. Coordinating data placement, balancing load, and managing failover mechanisms are added responsibilities for database administrators.
- Queries that require data from multiple shards can be slower and more resource-intensive, as the database must retrieve and combine data from various locations.
- Sharding can be combined with replication to provide fault tolerance.

Sharding is a powerful technique, especially in distributed systems where performance, scalability, and fault tolerance are critical. It is often used with NoSQL databases to support applications with high traffic or substantial data storage requirements.
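
The routing step described above can be sketched as follows. This is a minimal illustration that assumes hash-based placement with a fixed number of shards. The shard names, `shard_for`, and `ShardedStore` are hypothetical, and real systems also handle rebalancing and failover, which are omitted here.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical server names


def shard_for(shard_key: str, shards=SHARDS) -> str:
    """Map a shard key to a shard using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route inconsistently across restarts.
    """
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


class ShardedStore:
    """Toy routing layer over in-memory shards."""

    def __init__(self, shards=SHARDS):
        self.shards = list(shards)
        self.data = {name: {} for name in self.shards}

    def put(self, key: str, doc: dict) -> None:
        self.data[shard_for(key, self.shards)][key] = doc

    def get(self, key: str):
        # Targeted query: the shard key identifies the single shard to ask.
        return self.data[shard_for(key, self.shards)].get(key)

    def find_all(self, predicate):
        # Query without the shard key: every shard is asked and results are merged.
        return [doc for shard in self.data.values()
                for doc in shard.values() if predicate(doc)]


store = ShardedStore()
for i in range(100):
    store.put(f"user-{i}", {"id": i})
```

`store.get("user-42")` touches one shard, while `store.find_all(...)` touches all of them, which is why queries spanning multiple shards are more expensive. A shard key with few distinct values would send most documents to the same shard, producing the hot spots mentioned in the list.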

### Horizontal Sharding
Horizontal sharding involves dividing the rows of a table across multiple servers. This is the most common type of sharding, as it allows for the best scalability.

### Vertical Sharding
Vertical sharding involves dividing the columns of a table across multiple servers. This is less common than horizontal sharding, as it can be more difficult to implement.
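
The difference between the two can be shown on a small table held as a list of rows. This is a sketch under simple assumptions: the sample rows, the modulo placement rule, and both function names are made up for the example.

```python
rows = [
    {"id": 1, "name": "Tom", "age": 30, "bio": "Likes chess"},
    {"id": 2, "name": "Ana", "age": 25, "bio": "Plays guitar"},
    {"id": 3, "name": "Bo", "age": 41, "bio": "Runs marathons"},
    {"id": 4, "name": "Kim", "age": 19, "bio": "Writes poetry"},
]


def horizontal_split(rows, n_shards, key="id"):
    """Each shard gets a subset of the rows, with all columns."""
    shards = [[] for _ in range(n_shards)]
    for row in rows:
        shards[row[key] % n_shards].append(row)
    return shards


def vertical_split(rows, column_groups, key="id"):
    """Each shard gets all rows, but only some columns (plus the key)."""
    return [
        [{key: row[key], **{col: row[col] for col in cols}} for row in rows]
        for cols in column_groups
    ]


by_rows = horizontal_split(rows, 2)
by_columns = vertical_split(rows, [["name", "age"], ["bio"]])
```

With vertical sharding, reading a full record means fetching from every column group and joining on the key, which is part of what makes it harder to implement.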

#### Range-Based Sharding
#### Hash-Based Sharding
#### Zone-Based Sharding
Key benefits:
- Linear scalability: add more shards to handle more data/traffic.
- Geographic distribution is possible.


## Vertical Scaling
Vertical scaling increases the power of an existing server by adding more resources to a single machine.
### Hardware Upgrades
- Adding more CPU cores/processing power
- Increasing RAM
- Upgrading to faster/larger storage (SSD/NVMe)
- Expanding network capacity
### Advantages
- Simpler to implement
- No application changes needed
- Better for complex queries/transactions
- Lower latency (all resources on same machine)
### Disadvantages
- Hardware limits
- More expensive
- Single point of failure
- Downtime during upgrades

# Replication
Replication is the process of copying data from one database server (or node) to others.
- Availability: if one server is busy or undergoing maintenance, other replicas can handle requests, reducing downtime.
- Fault tolerance: if one server fails, other replicas can continue to serve requests, ensuring that the system remains operational.
- Improved performance: by distributing read requests across multiple replicas, replication reduces load on individual servers, improves response times, and helps the system handle a higher volume of traffic.
- Scalability: multiple copies of the same data are maintained across various servers.
- Data recovery and backup: up-to-date copies of the data are available in case of primary server issues or disasters.
- Geographic proximity: placing replicas in different geographic locations brings data closer to end users, which minimizes latency and improves performance for users accessing data globally.

## Master-Slave Replication
Replication in which a single server accepts all writes.
- One primary server (master) handles all write operations, while one or more secondary servers (slaves) hold copies of the data.
- Only the master processes write requests, while slaves can serve read requests, thus reducing load on the master.
- Any updates on the master server are copied to the slave servers. Changes are only applied to the master and then propagated.
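
The write path above can be sketched with two small classes. This is a simplified model that assumes asynchronous propagation through an in-memory change log. `Primary` and `Replica` (the master and slave roles) and their methods are hypothetical names, and failure handling is left out.

```python
class Replica:
    """Secondary server (slave): holds a copy of the data and serves reads."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Primary server (master): the only node that accepts writes."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.log = []  # changes not yet sent to the replicas

    def write(self, key, value):
        # The change is applied on the primary first, then queued.
        self.data[key] = value
        self.log.append((key, value))

    def replicate(self):
        # Propagate queued changes, in order, to every replica.
        for key, value in self.log:
            for replica in self.replicas:
                replica.apply(key, value)
        self.log.clear()


r1, r2 = Replica(), Replica()
primary = Primary([r1, r2])
primary.write("name", "Tom")
stale = r1.read("name")   # replica has not received the change yet
primary.replicate()
fresh = r1.read("name")   # replica now matches the primary
```

The gap between `write` and `replicate` is where a replica can return old data. Replicating synchronously inside `write` would close that gap at the cost of slower writes.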

## Master-Master (Multi-Master) Replication
Replication used in systems that require high write availability.
- Multiple servers act as masters and can handle both read and write operations.
- Each master replicates changes to other masters in the network.
- Provides higher availability and flexibility, as any master can process both reads and writes.
- Introduces more complexity due to potential data conflicts when multiple servers make concurrent changes.

## Peer-to-Peer Replication
Replication useful in distributed systems where data is spread across a wide geographic area.
- All nodes are equal peers and can handle both read and write operations.
- Nodes communicate changes to each other without a central master, creating a decentralized system.
- Requires conflict resolution strategies, because there is no single master to determine which data should take precedence.

---

In systems with multi-master or peer-to-peer replication, conflicts can arise when multiple replicas make concurrent updates to the same data. Conflict resolution mechanisms are required to determine which version of the data to keep.

# Conflict resolution
TODO
- Last-write-wins: The most recent timestamp wins
- Version numbers: Higher version wins
- Custom merge: Business rules decide
- Multi-version: Keep all versions
- Locks: Prevent concurrent edits
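
Several of these strategies can be sketched as small functions that pick between two conflicting writes. This is an illustration under stated assumptions: the `Versioned` record, its fields, and the shopping-cart merge rule are hypothetical, timestamps are assumed to be comparable across servers, and ties go to the first argument. Locks are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # time of the write, assumed comparable across servers
    version: int      # counter incremented on each update


def last_write_wins(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the most recent timestamp."""
    return a if a.timestamp >= b.timestamp else b


def highest_version_wins(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the higher version number."""
    return a if a.version >= b.version else b


def keep_all_versions(a: Versioned, b: Versioned) -> list:
    """Multi-version: keep both writes and let a reader decide later."""
    return [a] if a == b else [a, b]


def merge_carts(a: set, b: set) -> set:
    """Custom merge (example business rule): a cart keeps items from both sides."""
    return a | b


x = Versioned("blue", timestamp=100.0, version=3)
y = Versioned("green", timestamp=105.0, version=2)
```

For `x` and `y`, last-write-wins keeps `y` while highest-version-wins keeps `x`, so the choice of strategy changes which update survives.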

# Vocabulary

## Documents
A document is a container of organized data that groups related information in key-value pairs.

## Collections
A collection is a group of related documents stored together in a database.

## Scaling
Scaling increases database capacity to handle growth. Two methods exist: vertical and horizontal.

## Replication
Replication creates copies of databases across multiple servers.

## Availability
Availability ensures that the database is accessible to users when needed.

## Fault Tolerance
Fault tolerance is the system's ability to withstand unexpected failures, such as hardware malfunctions, data corruption, or network outages.

## Consistency
Consistency ensures that all users see the same data state. The database keeps data valid during transactions, failures, and concurrent access.
- Strong: All reads return the latest write, at the cost of slower performance
- Eventual: Reads might temporarily return old data, in exchange for faster performance

## Conflict
Conflicts occur when multiple users modify the same data simultaneously. Databases use conflict resolution to maintain consistency.