## [System Design Interviews](https://blog.algomaster.io/p/how-to-answer-a-system-design-interview-problem)
1. Requirements clarifications
   1. Functional requirement
   2. Non-Functional requirement
   3. Extended Requirement
2. Estimation and Constraints
3. Data model design
4. API design
5. High level component design
6. Detailed design
7. Identify and resolve bottlenecks

## Latency and response time
1. Response time: The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays.
2. Latency: Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.
3. Throughput: Throughput refers to the amount of data that can be processed or transmitted in a given amount of time. 
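As a rough illustration of how these terms relate, the sketch below adds up a timing breakdown for one request and computes throughput over a measurement window. The class name, field names, and numbers are assumptions made for the example, and throughput is counted here in requests rather than bytes.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timing breakdown for one request, in milliseconds (hypothetical fields)."""

    network_ms: float  # time spent on the network
    queue_ms: float    # time spent waiting to be handled (the latency above)
    service_ms: float  # time spent actually processing the request

    @property
    def response_time_ms(self) -> float:
        # What the client sees: service time plus network and queueing delays.
        return self.network_ms + self.queue_ms + self.service_ms


def throughput_per_second(completed_requests: int, window_seconds: float) -> float:
    """Requests completed per second over a measurement window."""
    return completed_requests / window_seconds


timing = RequestTiming(network_ms=20.0, queue_ms=5.0, service_ms=30.0)
print(timing.response_time_ms)            # 55.0
print(throughput_per_second(1200, 60.0))  # 20.0
```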

## Scalability:
Scalability means having strategies for keeping performance good even when load increases (for example, more users or new features).
 ## How can we make a system scalable?
 1. Vertical Scaling (Scale up):
 2. Horizontal Scaling (Scale out):
 3. Load Balancing:
 4. Caching
 5. Content Delivery Networks (CDNs)
 6. Partitioning
 7. Asynchronous communication
 8. Microservices Architecture
 9. Auto Scaling
 10. Multi-region Deployment

### Caching:
A cache's primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer.

***Types of cache:***

1. L1:
   - Closest to the CPU core, or inside it.
   - Small in size, usually ranging from 16 KB to 128 KB.
   - Stores the most frequently accessed data and instructions for immediate use by the CPU.
2. L2:
   - Inside the CPU
   - Larger than L1, typically ranging from 256 KB to 1 MB per core.
3. L3:
   - Inside the CPU.
   - Typically ranging from 2 MB to 50 MB or more.
4. L4:
   - Outside of the CPU.
   - Larger in size.

5. Memory Cache:
   - Involves system RAM or other software-based caching solutions rather than the CPU caches.
   - `File System Cache:` Part of the operating system that caches frequently accessed files in RAM.
   - `Database Cache:` Stores frequently accessed data in RAM to speed up database queries.
   - `Application-Level Cache:` Such as Redis or Memcached, which stores frequently used data in RAM to improve application performance.


#### Cache hit and Cache miss
1. Cache Hit:
    - When the requested data is found in the cache and read from it, it is considered a cache hit.
    - A hot cache is an instance where the data is retrieved from `L1`.
    - A cool cache is an instance where the data is retrieved from `L3 or lower`.
2. Cache Miss: A cache miss is the instance when the cache is searched and the data isn't found. When this happens, the content is fetched from the slower storage layer and written into the cache.
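The hit and miss flow can be sketched with a small in-memory cache. This is a minimal illustration rather than a production cache: the `LRUCache` class, the least-recently-used eviction policy, and the `backing_store` dictionary are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable


class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity: int, load: Callable[[str], str]) -> None:
        self.capacity = capacity
        self.load = load              # fetches a value from the slower backing store
        self.entries = OrderedDict()  # key -> value, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.load(key)                # cache miss: go to the backing store
        self.entries[key] = value             # write the result into the cache
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value


backing_store = {"a": "alpha", "b": "beta", "c": "gamma"}
cache = LRUCache(capacity=2, load=backing_store.__getitem__)

for key in ["a", "b", "a", "c", "b"]:
    cache.get(key)

print(cache.hits, cache.misses)  # 1 4
```

With room for only two entries, the second lookup of `b` is a miss because `b` was evicted when `c` was loaded.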

#### Write data into cache:
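##### Cache aside:
The application is responsible for reading from and writing to both the cache and the database; the cache does not talk to the database directly. On a read, the application checks the cache first, and on a miss it loads the data from the database and adds it to the cache.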
##### Write through:
The application uses the cache as the main data store, reading and writing data to it, while the cache is responsible for reading from and writing to the database. A write is confirmed only after both the cache and the database have been updated.
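A minimal sketch of write-through, assuming a plain dictionary stands in for the database. The `WriteThroughCache` class and its method names are hypothetical and only illustrate the ordering of the writes.

```python
class WriteThroughCache:
    """Cache that sits in front of a database and writes to it synchronously."""

    def __init__(self, database: dict) -> None:
        self.database = database  # stand-in for the real data store
        self.entries: dict = {}

    def write(self, key: str, value: str) -> None:
        self.database[key] = value  # persist to the database first
        self.entries[key] = value   # then update the cache before returning

    def read(self, key: str) -> str:
        if key not in self.entries:
            # The cache, not the application, loads missing data.
            self.entries[key] = self.database[key]
        return self.entries[key]


database = {"user:1": "Ada"}
write_through = WriteThroughCache(database)
write_through.write("user:2", "Grace")

print(database["user:2"])            # Grace
print(write_through.read("user:1"))  # Ada
```

The application only ever talks to `write_through`; by the time `write` returns, the value is in both the cache and the database.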
##### Write behind:
The write is done only to the caching layer and is confirmed as soon as the write to the cache completes. The cache then asynchronously syncs this write to the database.

`pros:`<br>
1. Reduced latency and higher throughput for write-intensive applications.

`cons:`<br>
1. A risk of data loss if the caching layer crashes before the write reaches the database.
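The trade-off can be seen in a small write-behind sketch. It assumes a dictionary as the database and an explicit `flush` call in place of the background worker a real system would use; the class and method names are hypothetical.

```python
from collections import deque


class WriteBehindCache:
    """Cache that acknowledges writes immediately and persists them later."""

    def __init__(self, database: dict) -> None:
        self.database = database  # stand-in for the real data store
        self.entries: dict = {}
        self.pending = deque()    # writes not yet persisted

    def write(self, key: str, value: str) -> None:
        self.entries[key] = value
        self.pending.append((key, value))  # acknowledged before the database sees it

    def flush(self) -> int:
        """Persist queued writes and return how many were written."""
        flushed = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.database[key] = value
            flushed += 1
        return flushed


slow_database: dict = {}
write_behind = WriteBehindCache(slow_database)
write_behind.write("order:1", "paid")

print("order:1" in slow_database)  # False
print(write_behind.flush())        # 1
print(slow_database["order:1"])    # paid
```

Anything still in `pending` when the cache process dies is lost, which is the data-loss risk listed above.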
##### Refresh ahead:

#### Distributed cache:

#### Global cache:
When the requested data is not found in the global cache, it's the responsibility of the cache to find out the missing piece of data from the underlying data store.

### CDN:
A content delivery network (CDN) is a globally distributed network of proxy servers, serving content from locations closer to the user. Generally, static files such as HTML/CSS/JS, photos, and videos are served from CDN, although some CDNs such as Amazon's CloudFront support dynamic content. The site's DNS resolution will tell clients which server to contact.

Advantages:

1. Users receive content from data centers close to them
2. Your servers do not have to serve requests that the CDN fulfills

#### CDN Types:

1. Push CDNs: Push CDNs receive new content whenever changes occur on the server. We take full responsibility for providing content, uploading directly to the CDN, and rewriting URLs to point to the CDN. We can configure when content expires and when it is updated. Content is uploaded only when it is new or changed, minimizing traffic, but maximizing storage. <br>Sites with a small amount of traffic or sites with content that isn't often updated work well with push CDNs. Content is placed on the CDNs once, instead of being re-pulled at regular intervals.

2. Pull CDNs: With a pull CDN, the cache is updated on request. When a client requests a static asset that the CDN does not have, the CDN fetches it from the origin server, stores it in its cache, and then sends the cached asset to the user. <br> Unlike a push CDN, this requires less maintenance because cache updates on CDN nodes are triggered by client requests rather than by uploads. Sites with heavy traffic work well with pull CDNs, as traffic is spread out more evenly with only recently-requested content remaining on the CDN.
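The pull behaviour can be sketched as an edge node that fetches from the origin only on a miss. This is an illustrative model, not how any particular CDN is implemented: the `PullCdnEdge` class, the time-to-live expiry, and the injectable `clock` are assumptions made for the example.

```python
import time


class PullCdnEdge:
    """Edge node that pulls an asset from the origin the first time it is requested."""

    def __init__(self, origin: dict, ttl_seconds: float, clock=time.monotonic) -> None:
        self.origin = origin            # stand-in for the origin server
        self.ttl_seconds = ttl_seconds  # how long a cached asset stays fresh
        self.clock = clock
        self.cached: dict = {}          # path -> (content, expires_at)
        self.origin_fetches = 0

    def get(self, path: str) -> str:
        now = self.clock()
        entry = self.cached.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]             # served from the edge cache
        content = self.origin[path]     # missing or expired: pull from the origin
        self.origin_fetches += 1
        self.cached[path] = (content, now + self.ttl_seconds)
        return content


fake_now = [0.0]  # fake clock so the example does not need to sleep
edge = PullCdnEdge({"/logo.png": "v1"}, ttl_seconds=60, clock=lambda: fake_now[0])

edge.get("/logo.png")
edge.get("/logo.png")
print(edge.origin_fetches)  # 1

fake_now[0] = 61.0          # the cached copy has expired
edge.get("/logo.png")
print(edge.origin_fetches)  # 2
```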

Disadvantages:

Good things come with extra costs, and CDNs have some disadvantages:

1. Extra charges: It can be expensive to use a CDN, especially for high-traffic services.
2. Restrictions: Some organizations and countries have blocked the domains or IP addresses of popular CDNs.
3. Location: If most of our audience is located in a country where the CDN has no servers, the data on our website may have to travel further than without using any CDN.

## Availability
Availability is the time a system remains operational to perform its required function in a specific period. It is a simple measure of the percentage of time that a system, service, or machine remains operational under normal conditions.

Strategies for Improving Availability
1. Redundancy: Redundancy involves having backup components that can take over when primary components fail.
   - Server Redundancy: Deploying multiple servers to handle requests, ensuring that if one server fails, others can continue to provide service.
   - Database Redundancy: Creating a replica database that can take over if the primary database fails.
   - Geographic Redundancy: Distributing resources across multiple geographic locations to mitigate the impact of regional failures.
2. Load Balancing: Load balancing distributes incoming network traffic across multiple servers to ensure no single server becomes a bottleneck, enhancing both performance and availability.
   - Hardware Load Balancers: Physical devices that distribute traffic based on pre-configured rules.
   - Software Load Balancers: Software solutions that manage traffic distribution, such as HAProxy, Nginx, or cloud-based solutions like AWS Elastic Load Balancer.
3. Data Replication: Data replication involves copying data from one location to another to ensure that data is available even if one location fails.
   - Synchronous Replication: Data is replicated in real-time to ensure consistency across locations.
   - Asynchronous Replication: Data is replicated with a delay, which can be more efficient but may result in slight data inconsistencies.

### The Nines of availability

Availability is often quantified by uptime (or downtime) as a percentage of time the service is available. It is generally measured in the number of 9s.

$$
Availability = \frac{Uptime}{Uptime + Downtime}
$$

If a system is 99% available, it is said to have "2 nines" of availability; if it is 99.9% available, it has "3 nines", and so on.

| Availability (Percent)   | Downtime (Year)    | Downtime (Month)  | Downtime (Week)    |
| ------------------------ | ------------------ | ----------------- | ------------------ |
| 90% (one nine)           | 36.53 days         | 72 hours          | 16.8 hours         |
| 99% (two nines)          | 3.65 days          | 7.20 hours        | 1.68 hours         |
| 99.9% (three nines)      | 8.77 hours         | 43.8 minutes      | 10.1 minutes       |
| 99.99% (four nines)      | 52.6 minutes       | 4.32 minutes      | 1.01 minutes       |
| 99.999% (five nines)     | 5.25 minutes       | 25.9 seconds      | 6.05 seconds       |
| 99.9999% (six nines)     | 31.56 seconds      | 2.59 seconds      | 604.8 milliseconds |
| 99.99999% (seven nines)  | 3.15 seconds       | 263 milliseconds  | 60.5 milliseconds  |
| 99.999999% (eight nines) | 315.6 milliseconds | 26.3 milliseconds | 6 milliseconds     |
| 99.9999999% (nine nines) | 31.6 milliseconds  | 2.6 milliseconds  | 0.6 milliseconds   |

### Availability in Sequence vs Parallel

If a service consists of multiple components prone to failure, the service's overall availability depends on whether the components are in sequence or in parallel.

#### Sequence

Overall availability decreases when two components are in sequence.

$$
Availability \space (Total) = Availability \space (Foo) * Availability \space (Bar)
$$

For example, if both `Foo` and `Bar` each had 99.9% availability, their total availability in sequence would be 99.8%.

#### Parallel

Overall availability increases when two components are in parallel.

$$
Availability \space (Total) = 1 - (1 - Availability \space (Foo)) * (1 - Availability \space (Bar))
$$

For example, if both `Foo` and `Bar` each had 99.9% availability, their total availability in parallel would be 99.9999%.
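Both formulas are easy to check numerically. The helpers below are a sketch that extends the two-component formulas to any number of components, which assumes the components fail independently; the function names are hypothetical.

```python
from math import prod


def availability_in_sequence(*components: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(components)


def availability_in_parallel(*components: float) -> float:
    """The service is down only if every component is down at the same time."""
    return 1 - prod(1 - availability for availability in components)


print(f"{availability_in_sequence(0.999, 0.999):.4%}")  # 99.8001%
print(f"{availability_in_parallel(0.999, 0.999):.4%}")  # 99.9999%
```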

### Availability vs Reliability

If a system is reliable, it is available. However, if it is available, it is not necessarily reliable. In other words, high reliability contributes to high availability, but it is possible to achieve high availability even with an unreliable system.

### High availability vs Fault Tolerance

Both high availability and fault tolerance apply to methods for providing high uptime levels. However, they accomplish the objective differently.

A fault-tolerant system has no service interruption but a significantly higher cost, while a highly available system has minimal service interruption. Fault tolerance requires full hardware redundancy: if the main system fails, another system must take over with no loss in uptime.

## Reliability:
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

# Database

## How can we improve database performance?

### Data partitioning:
Data partitioning is a technique to break up a database into many smaller parts. It is the process of splitting up a database or a table across multiple machines to improve the manageability, performance, and availability of a database.

#### Horizontal Partitioning (or Sharding)
In this strategy, we split the table data horizontally based on the range of values defined by the partition key. It is also referred to as database sharding.

#### Vertical Partitioning
In vertical partitioning, we partition the data vertically based on columns. We divide tables into relatively smaller tables with few elements, and each part is present in a separate partition.

### Replication
Replication is a process that involves sharing information to ensure consistency between redundant resources such as multiple databases, to improve reliability, fault-tolerance, or accessibility.

#### Master-slave replication:
The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.
##### Pros:
1. Backups of the entire database can be taken with relatively little impact on the master.
2. Applications can read from the slave(s) without impacting the master.
3. Slaves can be taken offline and synced back to the master without any downtime.
##### Cons:
1. Replication adds more hardware and additional complexity.
2. Downtime and possibly loss of data when a master fails.
3. All writes also have to be made to the master in a master-slave architecture.
4. The more read slaves, the more we have to replicate, which will increase replication lag.

#### Master-master replication
Both masters serve reads/writes and coordinate with each other. If either master goes down, the system can continue to operate with both reads and writes.
##### Pros:
1. Applications can read from both masters.
2. Distributes write load across both master nodes.
3. Simple, automatic, and quick failover.
##### Cons:
1. Not as simple as master-slave to configure and deploy.
2. The system is either loosely consistent or has increased write latency due to synchronization.
3. Conflict resolution comes into play as more write nodes are added and as latency increases.

#### Synchronous replication
In synchronous replication, data is written to primary storage and the replica simultaneously. As such, the primary copy and the replica should always remain synchronized.

#### Asynchronous replication
Asynchronous replication copies the data to the replica after the data has already been written to the primary storage. Although the replication process may occur in near-real-time, it is more common for replication to occur on a scheduled basis, which is more cost-effective. Because the replica can lag behind the primary, this weakens `Consistency`.


### Sharding
Database sharding is a `horizontal scaling` technique used to split a large database into smaller, independent pieces called shards.

Partitioning criteria:
1. Hash-Based
2. List-Based
3. Range Based
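Each criterion boils down to a function from a record's key to a shard number. The sketch below shows one possible routing function per criterion; the function names, the use of SHA-256, and the example regions and bounds are assumptions, not a prescribed scheme.

```python
import hashlib
from bisect import bisect_right


def hash_shard(key: str, shard_count: int) -> int:
    """Hash-based: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def list_shard(region: str, region_to_shard: dict) -> int:
    """List-based: an explicit list of values is assigned to each shard."""
    return region_to_shard[region]


def range_shard(user_id: int, upper_bounds: list) -> int:
    """Range-based: upper_bounds[i] is the exclusive upper bound of shard i.

    Ids at or above the last bound fall into one extra shard at the end.
    """
    return bisect_right(upper_bounds, user_id)


print(list_shard("eu", {"us": 0, "eu": 1}))  # 1
print(range_shard(999, [1000, 2000]))        # 0
print(range_shard(1000, [1000, 2000]))       # 1
print(range_shard(2500, [1000, 2000]))       # 2
```

Hash-based routing tends to spread keys evenly but scatters range scans across shards, while range-based routing keeps neighbouring keys together at the risk of hot shards.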

#### Pros:
1. Availability: Provides logical independence to the partitioned database, ensuring the high availability of our application. Here individual partitions can be managed independently.
2. Scalability: Improves scalability by distributing the data across multiple partitions.
3. Security: Helps improve the system's security by storing sensitive and non-sensitive data in different partitions. This could provide better manageability and security to sensitive data.
4. Query Performance: Improves the performance of the system. Instead of querying the whole database, now the system has to query only a smaller partition.
5. Data Manageability: Divides tables and indexes into smaller and more manageable units.
6. Geographical Distribution: Sharding allows you to strategically place shards closer to your users, reducing latency and improving the user experience.

#### Cons:
1. Complexity: Sharding introduces additional complexity, requiring careful planning and management.

2. Data Consistency: Maintaining data consistency across shards can be challenging, especially when data needs to be joined or aggregated from multiple shards.

3. Cross-shard Joins: Performing joins across multiple shards can be complex and computationally expensive.

4. Data Rebalancing: As data volumes grow, shards may become unevenly distributed, requiring rebalancing to maintain optimal performance.

### Denormalization:
Denormalization attempts to improve read performance at the expense of some write performance. Redundant copies of the data are written in multiple tables to avoid expensive joins. Some RDBMS such as PostgreSQL and Oracle support materialized views which handle the work of storing redundant information and keeping redundant copies consistent.

Pros:
1. Retrieving data is faster.
2. Writing queries is easier.
3. Reduction in number of tables.
4. Convenient to manage.

Cons:
1. Expensive inserts and updates.
2. Increases complexity of database design.
3. Increases data redundancy.
4. More chances of data inconsistency.

## ACID and BASE consistency models

### ACID
1. Atomicity: Atomicity ensures that a transaction is treated as an indivisible unit of work. It means that either all the operations within a transaction are successfully completed, or none of them are applied to the database.
2. Consistency: Consistency ensures that a transaction brings the database from one valid state to another valid state. It means that all the data integrity constraints, such as unique constraints, foreign key constraints, and check constraints, are satisfied before and after the transaction.
3. Isolation: Isolation ensures that transactions are executed in a way that they do not interfere with each other. It means that the intermediate state of a transaction is not visible to other transactions until it is committed.
4. Durability: Once the transaction has been completed and the writes and updates have been written to the disk, it will remain in the system even if a system failure occurs.

While ACID properties are foundational to RDBMS, NoSQL databases like MongoDB often sacrifice some ACID properties for performance and scalability.
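Atomicity and consistency can be observed with Python's built-in `sqlite3` module. In this sketch a transfer that would overdraw an account violates a `CHECK` constraint, so the whole transaction is rolled back, including the credit that had already been applied. The `accounts` table and the `transfer` and `balances` helpers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn: sqlite3.Connection, sender: str, receiver: str, amount: int) -> bool:
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


print(transfer(conn, "alice", "bob", 30))   # True
print(transfer(conn, "alice", "bob", 500))  # False
print(balances(conn))                       # {'alice': 70, 'bob': 80}
```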

#### How does it work?
1. Logging: Detailed records of all transactions are kept, allowing for recovery in case of a failure.

2. Locking: Data is locked during a transaction to prevent concurrent access and ensure isolation.

3. Two-Phase Commit: A protocol used to coordinate the commitment of a transaction across multiple systems.

### BASE
BASE stands for basically available, soft state, and eventually consistent. The acronym highlights that BASE is the opposite of ACID, like their chemical equivalents.
1. Basically available: The database remains accessible to concurrent users at all times. For example, during a sudden surge in traffic on an ecommerce platform, the system may prioritize serving product listings and accepting orders. Even if there is a slight delay in updating inventory quantities, users continue to check out items.
2. Soft state: The state of the system may change over time, so the system may not be in a consistent state at all times.
3. Eventually consistent: The record will reach consistency once all the concurrent updates have completed.

## CAP theorem:
CAP theorem states that a distributed system can deliver only two of the three desired characteristics Consistency, Availability, and Partition tolerance (CAP).

1. Consistency: Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or replicated across all the nodes in the system before the write is deemed "successful".

2. Availability: Availability means that any client making a request for data gets a response, even if one or more nodes are down.
3. Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

### The CAP Trade-Off: Choosing 2 out of 3:
1. CP (Consistency and Partition Tolerance): `Banking systems`
   1. RDBMS: `MySQL and PostgreSQL`
2. AP (Availability and Partition Tolerance): A shopping cart system is designed to always accept items, prioritizing availability.
   1. NoSQL databases
3. CA (Consistency and Availability): In the absence of partitions, a system can be both consistent and available. However, network partitions are inevitable in distributed systems, making this combination impractical there. Example: `Single-node databases`

## Distributed Transactions
A distributed transaction is a set of operations on data that is performed across two or more databases. It is typically coordinated across separate nodes connected by a network, but may also span multiple databases on a single server.

### Two-Phase commit
The two-phase commit (2PC) protocol is a distributed algorithm that coordinates all the processes that participate in a distributed transaction on whether to commit or abort (roll back) the transaction.

This protocol achieves its goal even in many cases of temporary system failure and is thus widely used. However, it is not resilient to all possible failure configurations, and in rare cases, manual intervention is needed to remedy an outcome.

This protocol requires a coordinator node, which basically coordinates and oversees the transaction across different nodes. The coordinator tries to establish the consensus among a set of processes in two phases, hence the name.

***Prepare phase:***
The prepare phase involves the coordinator node collecting consensus from each of the participant nodes. The transaction will be aborted unless each of the nodes responds that they're prepared.

***Commit phase:***
If all participants respond to the coordinator that they are prepared, then the coordinator asks all the nodes to commit the transaction. If a failure occurs, the transaction will be rolled back.
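The two phases can be sketched in a single process. This is a simplified model under strong assumptions: there is no network, no timeout, and no coordinator crash, and the `Participant` class and `two_phase_commit` function are hypothetical names.

```python
class Participant:
    """A node taking part in the transaction (in-memory stand-in)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Vote yes only if this node is able to commit its part of the work.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [node.prepare() for node in participants]  # phase 1: prepare
    if all(votes):
        for node in participants:                      # phase 2: commit
            node.commit()
        return True
    for node in participants:                          # phase 2: roll back
        node.rollback()
    return False


healthy = [Participant("orders"), Participant("payments")]
print(two_phase_commit(healthy))          # True
print([node.state for node in healthy])   # ['committed', 'committed']

mixed = [Participant("orders"), Participant("payments", can_commit=False)]
print(two_phase_commit(mixed))            # False
print([node.state for node in mixed])     # ['aborted', 'aborted']
```

A participant that has voted yes must wait for the coordinator's decision, which is why a coordinator failure between the two phases leaves participants blocked.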

#### Cons:
- What if one of the nodes crashes?
- What if the coordinator itself crashes?
- It is a blocking protocol.

### Three-phase commit:
Three-phase commit (3PC) is an extension of the two-phase commit where the commit phase is split into two phases. This helps with the blocking problem that occurs in the two-phase commit protocol.

1. Prepare phase: This phase is the same as the two-phase commit.

2. Pre-commit phase: Coordinator issues the pre-commit message and all the participating nodes must acknowledge it. If a participant fails to receive this message in time, then the transaction is aborted.

3. Commit phase: This step is also similar to the two-phase commit protocol.

#### How is the pre-commit phase helpful?
1. If the participant nodes are found in this phase, that means that every participant has completed the first phase. The completion of the prepare phase is guaranteed.
2. Every phase can now time out and avoid indefinite waits.

### Saga
A saga is a sequence of local transactions. Each local transaction updates the database and publishes a message or event to trigger the next local transaction in the saga. If a local transaction fails because it violates a business rule then the saga executes a series of compensating transactions that undo the changes that were made by the preceding local transactions.

1. Choreography: Each local transaction publishes domain events that trigger local transactions in other services.
2. Orchestration: An orchestrator tells the participants what local transactions to execute.
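An orchestrated saga can be sketched as a list of steps, each paired with a compensating action. This is a minimal in-process illustration: the `run_saga` function, the step names, and the `inventory` dictionary are assumptions, and a real orchestrator would also persist its progress and talk to services over the network.

```python
from typing import Callable, List, Tuple

# Each step is (name, local transaction, compensating transaction).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo the completed steps in reverse order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True


inventory = {"widget": 1}


def reserve_inventory() -> None:
    inventory["widget"] -= 1


def release_inventory() -> None:
    inventory["widget"] += 1


def charge_payment() -> None:
    raise RuntimeError("card declined")  # simulated business rule failure


def refund_payment() -> None:
    pass  # nothing to refund in this example


saga_log: List[str] = []
ok = run_saga(
    [
        ("reserve_inventory", reserve_inventory, release_inventory),
        ("charge_payment", charge_payment, refund_payment),
    ],
    saga_log,
)

print(ok)         # False
print(saga_log)   # ['done:reserve_inventory', 'failed:charge_payment', 'compensated:reserve_inventory']
print(inventory)  # {'widget': 1}
```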

Problems:
1. The Saga pattern is particularly hard to debug.
2. There's a risk of cyclic dependency between saga participants.
3. Lack of participant data isolation imposes durability challenges.
4. Testing is difficult because all services must be running to simulate a transaction.