These comprehensive notes on system design fundamentals were meticulously crafted with the assistance of ChatGPT. They are a paraphrased and refined version of my original notes, which I took while watching AlgoExpert's SystemsExpert videos.
In this guide, we will explore essential topics in system design, including core principles, architectural concepts, scalability considerations, and much more. Whether you're gearing up for a technical interview or aiming to deepen your knowledge of system design, these notes are a valuable resource. They provide insights and guidance to navigate the intricate landscape of designing robust and efficient software systems.
In the realm of software engineering, we often encounter two types of interviews: coding interviews and systems design interviews. While coding interviews assess problem-solving skills, systems design interviews delve into the engineering knowledge beneath the surface of open-ended design questions.
Coding Interviews vs. Systems Design Interviews
- Coding interviews focus less on theoretical knowledge and more on problem-solving.
- Systems design interviews demand a deep understanding of systems, their robustness, functionality, and scalability.
- Design Fundamentals serve as the foundation for systems design interviews, just as data structures are fundamental for coding interviews.
Why Design Fundamentals Matter
- Design Fundamentals encompass various topics such as system architecture, availability, databases, load balancing, caching, and HTTP.
- They are indispensable for tackling Systems Design interview questions effectively.
Building scalable, production-ready applications is both an art and a science. It requires knowledge across various computer engineering topics and the ability to make smart design choices. Mastery of these disciplines can transform you into a Systems Expert.
How Systems Design Interviews Work
- Systems Design interview questions are intentionally vague and must be explored in-depth.
- Interviewees must ask questions, understand system requirements, and demonstrate fundamental Systems Design knowledge.
- Unlike coding interviews with objective solutions, Systems Design interviews involve subjective solutions that require confident justifications.
Categories of Design Fundamentals
- Design Fundamentals can be categorized into four important groups that build upon each other.
- Foundational Knowledge: Understanding concepts like the client-server model and network protocols.
- Key Characteristics of Systems: Grasping attributes like availability, throughput, redundancy, and consistency.
- Actual Components of a System: Familiarity with components like load balancers, caches, and proxies.
- Real Existing Products or Services: Knowledge of tools such as ZooKeeper, Redis, and cloud services.
Mastering these design fundamentals is essential for addressing vague Systems Design interview questions effectively.
At the core of modern networking is the client-server model, where clients request services from servers. Let's explore this concept by breaking down what happens when you visit a website.
What Happens When You Visit a Website
- Your web browser acts as the client, and the machine hosting the website acts as the server.
- Your browser first sends a DNS query to resolve the website's IP address.
- Once it has the IP address, it establishes a TCP connection with the server and sends an HTTP request.
- The server responds with the website's content, and your browser renders it.
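To make these steps concrete, here's a minimal sketch of the same flow using Python's standard library (example.com is just a placeholder host; a real browser does far more, including caching and TLS):

```python
import socket
import http.client

# Step 1: DNS query to resolve the hostname to an IP address.
host = "example.com"  # placeholder domain
ip_address = socket.gethostbyname(host)
print(f"{host} resolved to {ip_address}")

# Step 2: open a TCP connection and send an HTTP GET request.
conn = http.client.HTTPConnection(host, 80)
conn.request("GET", "/")

# Step 3: the server responds; a real browser would render this content.
response = conn.getresponse()
print(response.status, response.reason)
body = response.read()
print(f"Received {len(body)} bytes of content")
conn.close()
```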
Understanding the Client-Server Model
- The client-server model is foundational for computer communication, driving how machines interact.
- IP addresses and ports play crucial roles in directing data between machines.
- Ports are used to differentiate network services, like HTTP on port 80 or HTTPS on port 443.
- Network protocols, like HTTP, govern the format and rules of data exchange.
Why the Client-Server Model Matters
- This model underpins global computer communication and modern technologies.
- Understanding it is crucial for designing scalable and efficient systems.
Network protocols, such as IP, TCP, and HTTP, form the backbone of machine-to-machine communication. While they might seem complex, grasping their high-level concepts is essential for effective systems design.
What Are Network Protocols?
- Network protocols are agreed-upon rules governing interactions between machines.
- They define the format, structure, and order of messages exchanged.
- Common network protocols include IP, TCP, UDP, and HTTP.
IP (Internet Protocol)
- IP is the foundation of internet communication, providing unique addresses for machines.
- IP packets are the basic units of data transfer between machines.
- Packets consist of an IP header and a data payload.
- Versions like IPv4 and IPv6 exist, with IPv4 being more prevalent.
TCP (Transmission Control Protocol)
- TCP builds on IP and ensures ordered, reliable, and error-free data delivery.
- It establishes connections through handshakes, allowing machines to communicate.
- TCP handles data transfer and retransmission in case of packet loss.
HTTP (HyperText Transfer Protocol)
- HTTP operates atop TCP and introduces the request-response paradigm.
- Machines communicate by sending HTTP requests and receiving responses.
- HTTP simplifies system development by providing a structured way to exchange data.
Understanding these protocols, especially HTTP, is vital for designing robust and efficient systems. A minimal request-response sketch follows.
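To see how HTTP rides on top of TCP, here's a hedged sketch that hand-writes an HTTP request over a raw TCP socket (example.com is a placeholder; real code would normally use an HTTP library instead):

```python
import socket

# Open a TCP connection (the transport HTTP runs on top of).
with socket.create_connection(("example.com", 80)) as sock:  # placeholder host
    # An HTTP request is just structured text sent over that connection.
    request = (
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))

    # Read the server's response: status line, headers, then the body.
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

print(response.split(b"\r\n")[0].decode())  # e.g. "HTTP/1.1 200 OK"
```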
Information storage is a critical aspect of systems design. It's a complex and vital topic that impacts how systems store and retrieve data, affecting their performance, reliability, and scalability.
Let's start by understanding four key terms related to storage:
- Databases
  - Databases are programs that record and query data.
  - They can use memory or disk for data storage.
  - Most databases need disk storage for data persistence, ensuring data remains intact through various system issues.
- Disk
  - Refers to Hard Disk Drives (HDD) or Solid-State Drives (SSD).
  - Data written to disk persists through power failures and crashes.
  - HDDs are cost-effective for long-term data storage, while SSDs offer faster data access at a higher cost.
- Memory
  - Short for Random Access Memory (RAM).
  - Data stored in memory is volatile and is lost when the process ends.
- Persistent Storage
  - Usually refers to disk storage.
  - It includes any storage that remains intact even if the managing process stops.
Databases play a crucial role in most systems, serving two primary purposes:
- Data Storage: They store, record, and save data.
- Data Retrieval: They allow data to be retrieved, read, and queried.
But here's the catch: databases are not magical black boxes floating in the cloud! A database is essentially a server, a computer like any other.
And the most crucial concept in storage is data persistence. Many assume that data stored in a database will survive system outages, but that's not always the case.
When data is written to disk, it persists through system failures, just like saving a file on your computer. However, if data is stored in memory, it's lost when the process ends.
Why would anyone choose memory over disk? Simple: accessing data from memory is blazing fast compared to accessing it from disk.
Storage is a vast field within systems design, with countless database options. Each database comes with its unique features, trade-offs, and suitability for different use cases.
So far, we've discussed storing data on a single machine, but what if you want to prevent your entire system from crashing when that machine fails? This is where distributed storage comes into play. It's a complex challenge involving data distribution, replication, and ensuring consistency across distributed systems.
Consistency refers to the up-to-dateness of data. When you access data from a distributed database, will you get the latest version or potentially stale data? Different databases offer various properties and guarantees, which come with their trade-offs.
The world of storage is diverse and intricate, with numerous database offerings, each tailored to specific needs. Your choice of database can profoundly impact your system's performance and resilience.
If you've ever experienced frustrating lag in a video game, you've encountered the consequences of high latency and low throughput, something no one enjoys!
- Latency
  - The time it takes for a particular operation to finish in a system.
  - Measured in units like milliseconds (ms).
  - Latency varies widely across different system components.
- Throughput
  - A measure of a system's capacity to handle operations in a given time period.
  - Often expressed as a rate of data transfer, like gigabits per second (Gbps).
  - Throughput determines how many operations a system can handle in a fixed time frame.
Latency and throughput are essential concepts in systems design, yet they're often misunderstood. Let's clarify them:
- Latency is the time data takes to traverse a system, from one point to another.
- It applies to various aspects, such as network request latency and data access latency.
- Different system components have different latencies, and optimizing these latencies is a key consideration in system design.
Examples of Latency:
- Reading 1 MB from RAM: 250 μs (0.25 ms)
- Reading 1 MB from SSD: 1,000 μs (1 ms)
- Transferring 1 MB over the network: 10,000 μs (10 ms)
- Reading 1 MB from HDD: 20,000 μs (20 ms)
- Inter-continental round trip: 150,000 μs (150 ms)
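These figures vary by hardware, but measuring a latency is straightforward. A small illustrative sketch (the number you get depends entirely on your machine):

```python
import time

payload = bytes(1_000_000)  # 1 MB held in RAM

start = time.perf_counter()
_ = payload[:]  # copy 1 MB from memory
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"1 MB RAM copy took {elapsed_ms:.3f} ms")
```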
Latency is crucial in designing systems, as it impacts the user experience. Systems with high latency can result in laggy interactions, which is especially problematic in applications like online gaming.
- Throughput measures a system's capacity to handle operations in a given time period.
- It's often expressed as the rate of data transfer, like gigabits per second (Gbps).
- Throughput determines how many operations a system can handle in a fixed time frame.
Boosting Throughput: Increasing throughput can be as simple as investing in better network infrastructure. However, blindly increasing throughput isn't always the best solution.
When designing systems, you'll need to make trade-offs between latency and throughput. Some systems require low latency to ensure near-instantaneous responses, while others prioritize high throughput to handle a large volume of requests.
Remember: latency and throughput are related but not necessarily correlated. A system can have low latency in one respect and low throughput in another, so it's essential to consider both when optimizing system performance.
In systems design, availability is a critical concept that relates to how often a system is up and running. It's essential to consider because it impacts the user experience and the success of a service.
Think of availability as a system's resilience to failures. When a component, like a server or database, fails, does the entire system collapse, or can it continue functioning? This concept is often called fault tolerance.
Another perspective is to view availability as the percentage of time a system remains operational in a specific period, such as a month or a year, while meeting its primary functions. It's vital for users to be able to rely on a system being consistently accessible.
Availability is a crucial consideration when designing systems, as it directly affects both end users and system designers.
Availability is typically measured as the percentage of uptime in a year, and the differences between high percentages matter more than they might seem. Here's a quick breakdown:
- Two Nines (99%): Equates to around 87.7 hours of downtime per year. Generally unacceptable for most services.
- Three Nines (99.9%): Allows for roughly 8.8 hours of downtime per year.
- Four Nines (99.99%): Only permits about 52.6 minutes of downtime per year.
- Five Nines (99.999%): The gold standard, guaranteeing just 5.3 minutes of downtime per year.
The term "nines" signifies the number of nines appearing in the percentage, making it a standard way to discuss availability in the industry. Five nines (99.999%) is considered highly available.
Keep in mind that even 90% availability isn't great: it means about 36 days of downtime in a year, which is unacceptable for most services. The arithmetic behind these figures is sketched below.
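These downtime figures fall straight out of the arithmetic. A small sketch, using a 365-day year (figures computed with leap years differ slightly):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

for availability in [0.90, 0.99, 0.999, 0.9999, 0.99999]:
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} uptime -> {downtime_hours:.2f} hours "
          f"(~{downtime_hours * 60:.1f} minutes) of downtime per year")
```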
Availability matters significantly because it can impact both customer satisfaction and a company's revenue. Service providers often formalize availability through Service-Level Agreements (SLAs), which include guarantees about system availability.
SLAs consist of Service-Level Objectives (SLOs), specific guarantees to customers. They provide peace of mind and set clear expectations.
Major cloud service providers like Google Cloud Platform or Amazon Web Services offer SLAs, emphasizing the importance of availability.
However, achieving high availability involves trade-offs. It can be challenging and costly. Therefore, not all systems need to aim for five nines of availability.
When designing systems, consider what parts are critical and require high availability and which can tolerate some downtime.
To enhance availability, eliminate single points of failure by introducing redundancy. Redundancy involves duplicating components to ensure continued operation if one fails.
Passive redundancy means having multiple components at a given layer. If one fails, the others handle the load until the issue is resolved. Think of it as backup components ready to take over.
An example is twin-engine airplanes: if one engine fails, the other can keep the plane flying safely.
Active redundancy is more complex. Multiple machines work together, with only one or a few handling traffic or tasks at a time. If one fails, the others detect the failure and take over.
Implementing redundancy is a key strategy for making systems highly available.
Remember that some failures may require human intervention. Establish rigorous processes to handle system failures promptly.
Availability is a vital consideration in systems design, impacting both user satisfaction and the success of a service. Designing for high availability requires thoughtful planning, redundancy, and robust processes to ensure a system remains reliable even in the face of failures.
Caching is a fascinating topic in systems design, and it plays a crucial role in optimizing system performance. Let's delve into caching, break down key concepts, explore its various applications, and discuss some real-world examples.
- Cache
  - A cache is a tool, whether hardware or software, that stores data for quicker retrieval.
  - It's often used to store responses to network requests or results of computationally intensive operations.
  - Keep in mind that data in a cache can become "stale" if it isn't updated when the main data source changes.
- Cache Hit
  - When the requested data is found in the cache.
- Cache Miss
  - When the requested data could have been found in the cache but isn't there, often due to system failures or design issues.
- Eviction Policy
  - The policy that defines how data is removed or "evicted" from a cache.
  - Popular policies include LRU (Least Recently Used), FIFO (First In, First Out), and LFU (Least Frequently Used).
- Content Delivery Network (CDN)
  - A third-party service that acts as a cache for your servers, distributed worldwide to reduce latency.
  - CDNs, like Cloudflare and Google Cloud CDN, help deliver content quickly to users in various regions.
Caching is a fundamental technique in systems design, commonly employed to enhance system performance. It's used to speed up operations, reduce latency, and improve the user experience.
- Client-Level Caching: Clients can cache data to reduce the need for frequent server requests.
- Server-Level Caching: Servers can cache data fetched from databases to serve subsequent requests more swiftly.
- Intermediate-Level Caching: Caches can be placed between components within a system, optimizing data flow.
- Hardware-Level Caching: Modern computers employ hardware-level caching, such as CPU caches, for faster data retrieval.
Caching operates at multiple levels in a system, depending on the specific use case.
Caching proves valuable in various scenarios, primarily aimed at speeding up processes and avoiding redundant work:
- Network Requests: Caching helps avoid frequent network requests by storing previous responses.
- Computationally-Intensive Operations: Caching stores results to prevent re-computation of resource-intensive tasks.
- Scaling Data Access: When multiple servers or clients access the same data, caching can reduce the load on the primary data source (e.g., a database).
- On AlgoExpert, the questions list is cached on the client side. This static content, which doesn't change often, is stored locally, reducing the need for repeated server requests.
- When users run code against AlgoExpert's solutions, the results are cached. Subsequent requests for the same code execution can retrieve cached data, significantly improving response times.
- Imagine designing the YouTube comments section. Here, caching becomes complex due to mutable data (comments can be edited).
- Two caching approaches are considered (a sketch of both follows this list):
  - Write-Through Cache: Updates both the cache and the main data source (e.g., the database) simultaneously. Ensures consistency but requires extra database access.
  - Write-Back Cache: Updates the cache immediately and asynchronously updates the main data source. Offers faster responses but introduces potential data staleness.
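Here's a minimal sketch of the two write policies, assuming a `database` object with `get`/`set` methods (an assumption for illustration, not a real client):

```python
class WriteThroughCache:
    """Hedged sketch: every write goes to both the cache and the database."""

    def __init__(self, database):
        self.database = database  # assumed to expose get(key) and set(key, value)
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value
        self.database.set(key, value)  # synchronous write: consistent but slower

    def read(self, key):
        if key in self.cache:           # cache hit
            return self.cache[key]
        value = self.database.get(key)  # cache miss: fall back to the source
        self.cache[key] = value
        return value


class WriteBackCache(WriteThroughCache):
    """Writes land in the cache immediately; the database is updated later."""

    def __init__(self, database):
        super().__init__(database)
        self.dirty = set()  # keys written to the cache but not yet to the database

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)  # fast, but the database is briefly stale

    def flush(self):  # would run asynchronously in a real system
        for key in self.dirty:
            self.database.set(key, self.cache[key])
        self.dirty.clear()
```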
Caching introduces the challenge of dealing with data staleness, where cached data becomes outdated. Solutions depend on the specific use case and system requirements:
- Single Cache: In some cases, centralizing caching in a single cache (e.g., Redis) can mitigate staleness issues.
- Use Case Considerations: Determining when data staleness is acceptable is crucial. Some data may tolerate staleness, while other data requires real-time updates.
As caches have limited space and must handle data updates, eviction policies define how data is removed:
- LRU (Least Recently Used): Evicts the data that hasn't been accessed for the longest time.
- LFU (Least Frequently Used): Removes the data that has been accessed the least.
- FIFO (First In, First Out): Eliminates the oldest data in the cache.
- Random Eviction: Removes data randomly from the cache.
Choosing the right eviction policy depends on your specific system and its caching needs.
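As an illustration, here's a minimal sketch of the LRU policy, the most commonly discussed of the four:

```python
from collections import OrderedDict

class LRUCache:
    """Hedged sketch of an LRU eviction policy using an ordered mapping."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None  # cache miss
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now the most recently used key
cache.put("c", 3)    # evicts "b", the least recently used key
assert cache.get("b") is None
```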
Caching is a powerful tool in systems design, but it comes with challenges like data staleness and eviction policies. Understanding your system's requirements and use cases is essential when implementing caching to ensure optimal performance and user experience.
Demystifying Proxies: Hidden Helpers or Hackers' Tools?
Proxies are often associated with hackers trying to hide their identity, but they serve essential real-world purposes in caching, access control, and bypassing censorship, among other applications. Let's dive into the world of proxies!
Let's begin by understanding three key terms related to proxies:
- Forward Proxy
  - A server acting on behalf of clients, commonly used to conceal the client's identity.
  - Often referred to as just "proxies."
- Reverse Proxy
  - A server that represents other servers to clients, often used for logging, load balancing, or caching.
- Nginx
  - Pronounced "engine X," a popular web server frequently used as a reverse proxy and load balancer.
People often use the term "proxy" loosely, which can lead to confusion. Sometimes it refers to a forward proxy, while other times, it points to a reverse proxy.
- Forward Proxy
  - Imagine a forward proxy as a server standing between a client (or multiple clients) and another set of servers.
  - It acts on behalf of the client, like being on the same team.
  - When a client wants to communicate with a server, its request goes first to the forward proxy.
  - The client essentially says, "Hey, forward proxy, talk to the server for me, please."
  - The forward proxy forwards the request to the server and relays the server's response back to the client.
A forward proxy can help hide a client's identity by replacing the client's source IP address with the proxy's IP address, allowing access to restricted servers or content. A brief client-side sketch follows.
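For instance, here's roughly how a client might opt into a forward proxy using Python's third-party `requests` library (the proxy address below is a placeholder, not a real server):

```python
import requests  # third-party HTTP library

# Route the request through a forward proxy instead of connecting directly.
proxies = {
    "http": "http://proxy.example.com:8080",   # placeholder proxy address
    "https": "http://proxy.example.com:8080",
}

# The destination server sees the proxy's IP address, not the client's.
response = requests.get("http://example.com", proxies=proxies)
print(response.status_code)
```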
- Reverse Proxy
  - Reverse proxies are more complex.
  - They act on behalf of servers in interactions between clients and servers.
  - When a client sends a request, it unknowingly goes to the reverse proxy, not the actual server.
  - To the client, there's no distinction; it believes it's interacting directly with the server.
Reverse proxies serve various purposes, including filtering requests, logging, caching, and load balancing.
One of the best uses of a reverse proxy is as a load balancer: it can distribute incoming requests evenly among multiple servers, enhancing performance and providing resilience against malicious clients attempting to overwhelm a server.
Additionally, reverse proxies can act as shields, protecting servers from potential security threats.
Proxies are versatile tools in systems design, offering solutions to complex challenges. Understanding their roles and capabilities is essential in architecting robust systems.
Load balancers are the unsung heroes of network management, tirelessly orchestrating the flow of network requests across multiple servers. They ensure your system operates at its peak, day and night!
- Load Balancer
  - A type of reverse proxy that spreads incoming traffic across multiple servers. Load balancers can be positioned at various layers of a system, from the DNS layer down to the database layer.
- Server-Selection Strategy
  - The method a load balancer employs to choose servers when distributing traffic. Common strategies include round-robin, random selection, performance-based selection, and IP-based routing.
- Hot Spot
  - When workloads are unevenly distributed among servers. This can occur due to suboptimal sharding keys, poor hashing functions, or naturally skewed workloads.
- Nginx
  - Pronounced "engine X," not "N jinx." Nginx is a widely used web server often deployed as a reverse proxy and load balancer.
Load balancers are indispensable in system design, making them a common topic in interviews. Let's simplify their essence through a straightforward example:
Imagine a single client and server. The client sends requests, and the server responds. Now, picture your system expanding with multiple clients, each sending requests. Some clients might be more prolific, sending numerous requests. The single server has limits and can become overloaded.
To tackle this, you have two options:
- Vertical Scaling: Enhance the server's power, but this has limits.
- Horizontal Scaling: Add more servers. However, ensuring balanced traffic distribution among them is key.
This is where load balancers shine.
A load balancer sits between clients and servers, redistributing requests evenly or according to a predefined strategy. It prevents overloading, boosts throughput, improves response times, and optimizes resource utilization.
Load balancing can occur at various points in a system, including between clients and servers, servers and databases, or even at the DNS layer.
Load balancers utilize diverse server-selection strategies (a round-robin sketch follows the list):
- Random Redirection: Redirects requests randomly among servers. Simple, but can lead to uneven loads.
- Round Robin: Cycles through servers in order, ensuring fair distribution.
- Weighted Round Robin: Assigns weights to servers, directing more traffic to the more powerful ones.
- Performance-Based Selection: Monitors server performance, redirecting traffic to the best-performing servers.
- IP-Based Routing: Hashes client IP addresses to direct traffic, useful for optimizing caching.
- Path-Based Selection: Routes requests based on their paths, ideal for isolating services within a system.
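A hedged sketch of two of these strategies (server names are placeholders; the weighted version here is a simple random variant rather than a strict rotation):

```python
import itertools
import random

servers = ["server-a", "server-b", "server-c"]  # placeholder server names

# Round robin: cycle through the servers in order.
round_robin = itertools.cycle(servers)
for _ in range(5):
    print("round robin ->", next(round_robin))

# Weighted selection: "server-a" is twice as powerful, so it appears
# twice in the pool and receives roughly twice the share of requests.
weights = {"server-a": 2, "server-b": 1, "server-c": 1}
weighted_pool = [name for name, weight in weights.items() for _ in range(weight)]
print("weighted pick ->", random.choice(weighted_pool))
```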
Load balancers come in two flavors:
- Software Load Balancers: Offer more customization and scalability. Ideal for flexibility.
- Hardware Load Balancers: Physical machines with limited customization, but they can be very reliable.
System administrators configure load balancers to recognize servers. Servers can register or deregister with the load balancer to maintain awareness.
In complex systems, you might employ multiple load balancers with different strategies at various system layers. This redundancy enhances reliability and scalability.
In Conclusion: Load balancers are simple yet indispensable tools, ensuring the smooth operation of large-scale systems. Expect to encounter them in systems design interviews!
Hashing, often associated with hash tables, is a critical concept in systems design, but it's not as simple as it might seem. Let's dive in and explore its nuances!
At its core, a hashing function takes specific data, like a string or an identifier, and produces a numerical output. Note that different inputs can produce the same output; a good hashing function aims to minimize these collisions and to distribute its outputs uniformly.
- Consistent Hashing
  - A hashing approach that minimizes the number of keys that must be remapped when a hash table is resized.
  - Commonly used by load balancers to distribute traffic among servers efficiently, especially when servers are added or removed.
- Rendezvous Hashing
  - Also known as "highest random weight" hashing.
  - Minimizes remapping when a server goes down.
  - Ensures smooth and balanced distribution of data.
- SHA (Secure Hash Algorithms)
  - A family of cryptographic hash functions used throughout the industry.
  - SHA-3, for instance, is a popular choice for secure hashing in systems.
Hashing is the process of converting arbitrary data into a fixed-size value, typically an integer. This is crucial in systems design, where data like IP addresses, usernames, or HTTP requests can be transformed into integers for various purposes.
Imagine a scenario with clients (C1, C2, C3, C4), servers (A, B, C, D), and a load balancer. Requests from clients go through the load balancer before reaching servers. Selecting the right server is crucial, especially for expensive computations. Caching can help, but the wrong server-selection strategy can lead to cache misses. This is where hashing becomes vital.
Hashing lets us map clients to servers consistently. By hashing client names (C1, C2, C3, C4), we assign them to servers: for example, C1 hashes to 11, C2 to 12, and so on, and we take each hash modulo the number of servers to pick one. However, adding or removing servers disrupts this strategy, because nearly every hash-modulo mapping changes at once.
Consistent hashing remedies the issues of adding or removing servers. Servers and clients are positioned on a virtual circle based on their hashes, and each client is directed to the first server encountered moving clockwise around the circle. Even when servers change, most mappings remain intact, ensuring better cache utilization. A minimal sketch of such a ring appears below.
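Here's that minimal sketch; production implementations typically also add "virtual nodes" (many ring positions per server) for smoother balance, which is omitted here:

```python
import bisect
import hashlib

def stable_hash(key):
    """Deterministic hash so mappings survive process restarts."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    """Hedged sketch: servers and clients placed on a ring of hashes."""

    def __init__(self, servers):
        self.ring = sorted((stable_hash(s), s) for s in servers)

    def server_for(self, client_key):
        position = stable_hash(client_key)
        hashes = [h for h, _ in self.ring]
        index = bisect.bisect(hashes, position) % len(self.ring)  # walk clockwise
        return self.ring[index][1]

ring = ConsistentHashRing(["A", "B", "C", "D"])
print({c: ring.server_for(c) for c in ["C1", "C2", "C3", "C4"]})

# Removing one server only remaps the clients that pointed at it.
ring = ConsistentHashRing(["A", "B", "D"])
print({c: ring.server_for(c) for c in ["C1", "C2", "C3", "C4"]})
```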
Rendezvous hashing ranks the destinations (servers) for each input (username, request) and chooses the highest-ranking server. When servers change, only the affected mappings are adjusted, maintaining consistency. A sketch follows.
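A correspondingly minimal sketch of rendezvous hashing, scoring every server per key and picking the highest:

```python
import hashlib

def score(server, key):
    """Combined hash of (server, key); the highest score wins."""
    return int(hashlib.sha256(f"{server}:{key}".encode()).hexdigest(), 16)

def rendezvous_pick(servers, key):
    return max(servers, key=lambda server: score(server, key))

servers = ["A", "B", "C", "D"]
print({c: rendezvous_pick(servers, c) for c in ["C1", "C2", "C3", "C4"]})

# If server "B" goes down, only the keys that mapped to "B" move elsewhere.
print({c: rendezvous_pick(["A", "C", "D"], c) for c in ["C1", "C2", "C3", "C4"]})
```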
In large-scale systems, the choice of hashing strategy becomes crucial. Using consistent or rendezvous hashing can make a significant difference in performance, especially when relying on caching mechanisms. These strategies ensure that your system scales smoothly and reliably, even during server changes or high-traffic events.
As you prepare for systems design interviews, remember the power of hashing strategies like consistent and rendezvous hashing. They can be game-changers in ensuring the scalability and reliability of your distributed systems.
Welcome to the world of databases! In this journey, we'll explore two important types: relational databases and key-value stores. Think of them as the Batman and Robin of data storage.
- Relational Database: Picture your data neatly organized in tables, with powerful querying through SQL. That's a relational database for you!
- Non-Relational Database: Unlike relational ones, these databases don't force data into tables. Meet the flexible NoSQL databases.
- SQL: Structured Query Language, the tool used to talk to relational databases like PostgreSQL.
- SQL Database: Any database that understands SQL, often used interchangeably with "Relational Database."
- NoSQL Database: Databases that aren't SQL-compatible; they embrace flexibility over rigid structure.
- ACID Transaction: A rock-solid type of database transaction with Atomicity, Consistency, Isolation, and Durability.
- Database Index: Like a table of contents for your database; it speeds up reads but may slightly slow down writes.
- Strong Consistency: The reliability of ACID transactions, ensuring reads always see the latest committed write.
- Eventual Consistency: A more flexible approach where reads may be a bit stale, but everything catches up eventually.
- Postgres: A popular relational database that speaks SQL and offers ACID transactions. A small transaction sketch follows.
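To make the ACID idea concrete, here's a hedged sketch using Python's built-in sqlite3 module as a stand-in for Postgres; the transactional behavior is the point, not the specific database:

```python
import sqlite3

# sqlite3 keeps the demo self-contained; the transactional idea is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # one atomic transaction: both updates commit, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on any error, the whole transaction rolls back automatically

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 50), ('bob', 50)]
```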
Let's dive into specialized storage paradigms used in systems design, each catering to specific data types and use cases.
Blob storage is a widely used type of storage for unstructured data, often in large-scale systems. It's not quite like traditional databases because it primarily allows you to store and retrieve data based on the name of the blob, akin to a key-value store. Blobs can be large, ranging from megabytes to gigabytes, making them suitable for storing things like large binaries, database snapshots, or images.
In most systems design interviews, you can assume access to cloud-based Blob storage services such as Google Cloud Storage (GCS) or Amazon S3. These services are offered by Google and Amazon and are ideal for handling vast amounts of unstructured data.
Time Series Databases (TSDB) are specialized databases optimized for storing and analyzing time-indexed data. They excel at handling data points that occur at specific moments in time. Common examples include InfluxDB, Prometheus, and Graphite. TSDBs are valuable for use cases like monitoring, IoT, and financial data tracking.
Graph databases are designed for storing data using the graph data model, emphasizing explicit relationships between data entries, similar to nodes and edges in a graph. They excel in handling complex queries on interconnected data, making them ideal for applications like social networks.
One popular query language for graph databases is Cypher, which simplifies querying connected data.
Spatial databases are tailored for storing and querying spatial data, such as geographic locations on a map. They rely on spatial indexes, like quadtrees, to quickly perform spatial queries, such as finding nearby locations.
A quadtree is a tree data structure commonly used to index two-dimensional spatial data. Each node in a quadtree has either zero or four children nodes. Quadtrees efficiently organize spatial data, making them well-suited for fast spatial queries.
- Google Cloud Storage (GCS)
  - Google's blob storage service.
- Amazon S3
  - Amazon's blob storage service, offered through AWS.
- InfluxDB
  - An open-source time series database.
- Prometheus
  - Another open-source time series database, commonly used for monitoring.
- Neo4j
  - A popular graph database featuring nodes, relationships, properties, and labels.
Specialized storage paradigms like blob storage, time series databases, graph databases, and spatial databases cater to diverse data types and use cases. Familiarity with these concepts allows you to design systems that efficiently handle specific data requirements.
Remember, the choice of storage paradigm depends on your system's unique needs and scalability considerations.
- Replication
  - Replication involves copying data from one database server to others. This can enhance system redundancy and resilience, allowing you to withstand regional failures. Replication can also reduce data-access latency by bringing data closer to users.
- Sharding
  - Sharding, also known as data partitioning, is the practice of dividing a database into multiple "shards" to improve database throughput. Popular sharding strategies include region-based, data-type-based, and hash-based sharding for structured data.
- Hot Spot
  - A hot spot occurs when the workload isn't evenly distributed across servers. This can happen due to suboptimal sharding keys, poor hashing functions, or naturally skewed workloads, resulting in some servers receiving significantly more traffic than others.
- Your system's performance heavily relies on the performance of your database.
- If your database is unavailable, your entire system may become inaccessible.
- High database latency or low throughput can lead to sluggish system performance.
- Consider a scenario where your primary database goes offline.
- To prevent this from causing system downtime, you can set up a secondary database, a replica of the primary.
- Replication ensures that the replica is always up to date with the primary database.
- If the primary fails, the replica takes over, maintaining system availability.
- Let's say you're designing a system like LinkedIn, serving users in the US and India.
- To reduce latency, you can have separate databases in both regions.
- Asynchronously, the US database updates its Indian replica.
- This approach minimizes the roundtrip delay for users when posting content.
- Replication can also be useful for deploying software updates globally in a large tech company like Google.
- Asynchronous replication allows gradual deployment across database replicas.
- When your main database becomes overloaded due to a high volume of requests, scaling horizontally becomes necessary.
- Instead of replicating all the data across multiple databases, consider sharding or partitioning the data.
- Sharding involves splitting the main database into smaller shards or data partitions.
- This increases throughput without duplicating data excessively.
- Deciding how to split the data is crucial, as you might encounter hot spots where some shards receive more traffic.
- Hashing can be a practical way to distribute data evenly among shards (a sketch follows this list).
- Consistent hashing minimizes remapping when shards change, but it doesn't address server failures on its own, so having shard replicas is essential.
- Sharding logic can live in your application servers or in a reverse proxy acting on behalf of the database servers.
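A minimal sketch of hash-based sharding (shard names are placeholders); note that this naive modulo scheme remaps most keys when the shard count changes, which is exactly why consistent hashing is often preferred:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # placeholder shard names

def shard_for(key):
    """Hash-based sharding: route each key to a deterministic shard."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Every server that applies this function routes the same user to the same shard.
for user in ["alice", "bob", "carol"]:
    print(user, "->", shard_for(user))
```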
Replication and sharding are powerful tools in system design, especially when aiming to enhance system performance. They help ensure database availability, reduce latency, and improve overall throughput.
In the world of distributed systems, servers need to choose a leader to coordinate their activities. This process, known as "leader election," ensures that one server takes charge while the others stand by, ready to step in if needed. Let's explore this fascinating aspect of distributed systems.
Leader Election
- Leader election is the mechanism by which servers in a distributed system select a "leader" among themselves.
- The leader is responsible for managing critical operations in the system.
- It's essential for maintaining system consistency and fault tolerance.
Consensus Algorithms
- Leader election relies on consensus algorithms, complex methods for entities to agree on a single value.
- Two well-known consensus algorithms are Paxos and Raft. They help ensure that all servers know the current leader.
Ensuring Redundancy
- To prevent a single point of failure, we introduce redundancy.
- Instead of one server, we might have multiple servers (e.g., five) managing the same tasks.
- But how do we avoid duplicated actions?
Imagine you're designing a subscription-based service like Netflix. Your system involves a database and a third-party payment service. To prevent direct access to your database, you introduce an intermediary service.
The Problem
- If this intermediary service has only one server and it fails, your entire payment system collapses.
- The solution is redundancy, having multiple servers handle the task.
- With multiple servers, you must ensure that critical actions, like charging users, happen only once.
- Leader election plays a pivotal role here.
Leader Election in Action
- Servers elect one among themselves as the leader.
- The leader takes charge of actions, ensuring they are executed once.
- If the leader fails, a new leader is elected from the remaining servers.
Leader election is a challenging problem in distributed systems due to several factors:
- Network failures and partitions can disrupt communication.
- Achieving consensus among distributed nodes is complex.
- It's not just about electing a leader but ensuring all nodes agree on who the leader is.
Consensus algorithms like Paxos and Raft tackle the complexity of leader election. While implementing these algorithms is intricate, they are essential for maintaining system reliability and consistency.
In practice, you won't need to implement consensus algorithms yourself. Instead, you can use tools like ZooKeeper and Etcd.
ZooKeeper
- ZooKeeper is a highly available, strongly consistent key-value store.
- It's commonly used for leader election and configuration management.
Etcd
- Etcd is another key-value store with high availability and strong consistency.
- It implements the Raft consensus algorithm.
Etcd is particularly noteworthy:
- It combines high availability and strong consistency through leader election.
- Multiple nodes can read and write to the key-value store, ensuring resilience.
- A single leader coordinates writes, thanks to leader election.
- This makes it a robust choice for implementing leader election in your own systems.
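As a hedged illustration, here's roughly how a service might attempt to claim leadership through Etcd, assuming the third-party python-etcd3 client and a local Etcd endpoint; treat the exact calls as a sketch rather than a drop-in recipe:

```python
import time
import etcd3  # assumes the third-party python-etcd3 client

client = etcd3.client(host="localhost", port=2379)  # placeholder endpoint

# A lease makes the leadership key expire if the leader stops renewing it.
lease = client.lease(ttl=5)

# Atomically claim leadership: the put succeeds only if no leader key exists yet.
became_leader, _ = client.transaction(
    compare=[client.transactions.version("/service/leader") == 0],
    success=[client.transactions.put("/service/leader", "server-1", lease)],
    failure=[],
)

if became_leader:
    print("server-1 is the leader")
    for _ in range(3):       # a real leader would loop until shutdown
        lease.refresh()      # keep renewing; if this process dies, the key
        time.sleep(2)        # expires and another server can win the election
else:
    print("another server is already the leader")
```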
Leader election is a crucial aspect of distributed systems. While it involves complex concepts, having a high-level understanding of it is valuable for systems design interviews. It ensures your systems can maintain consistency and reliability even in the face of server failures.
In this exciting topic of systems design and computing, we delve into the intriguing world of peer-to-peer (P2P) networks. These networks embody principles like equality, sharing, unity, and teamwork, making them a vital subject for systems design interviews. So, let's dive in!
Peer-To-Peer Network: A network of interconnected machines, called peers, that collaborate to efficiently complete tasks, commonly used in file-distribution systems.
Gossip Protocol: A decentralized communication protocol in which machines in a cluster spread information without relying on a central source of data.
Imagine designing a system for a tech giant to distribute large files to thousands of machines simultaneously. You have a powerful data center with a network throughput of 40 gigabits per second (5 gigabytes per second). If you need to transfer a 5-gigabyte file to 1,000 machines one at a time, that's 5,000 GB at 5 GB per second, or 1,000 seconds, roughly 17 minutes, a significant bottleneck.
To improve this, you might consider having multiple machines serve the file simultaneously. However, this approach introduces its own issues, such as data replication and potential bottlenecks.
This is where peer-to-peer networks come into play. Instead of sending the entire file to each machine individually, you divide the file into smaller chunks and distribute them to peers. These peers then communicate with each other to obtain the missing pieces and construct the complete file. It's like a collaborative puzzle-solving approach.
For example, you could split a 5GB file into 1,000 5MB chunks and send one chunk to each of the 1,000 machines. Each machine needs to communicate with others to gather the missing chunks, significantly speeding up the process compared to the initial bottleneck.
This concept is akin to the gossip protocol, where information is shared between peers in an uncoordinated manner, allowing parallelized transfers and efficient data distribution. Additionally, peer-to-peer networks often employ a Distributed Hash Table (DHT) to keep track of which peers hold specific data pieces.
Peer-to-peer networks offer high speed and find applications in systems like Uber's Kraken. The key takeaway is that the power of peer-to-peer lies in the connections established between peers, enabling efficient data sharing.
Peer-to-peer networks have extensive applications, including the popular concept of torrenting. Torrenting involves distributing large files in chunks to peers worldwide, allowing them to collaboratively assemble the complete file. This method minimizes the load on the source and speeds up data distribution.
In summary, peer-to-peer networks are not only fascinating but also incredibly practical, offering solutions to various data distribution challenges.
Think of polling and streaming as two different classroom scenarios: one where students ask questions periodically, and another where they listen attentively to the teacher throughout the lecture.
Polling: Fetching data at regular intervals to keep it up to date.
Streaming: Continuously receiving data from a server by maintaining an open connection, often used for real-time data feeds.
So far, we've explored systems where clients send requests to servers and receive responses. But what if your system involves regularly updated data that clients need to monitor? Consider scenarios like monitoring changing temperatures or real-time chat applications.
Polling is a straightforward approach where clients request data at fixed intervals. This interval could be every second, minute, or as per your use case. Polling works well for cases like monitoring changing temperature data, where updates don't need to be instantaneous.
However, polling has limitations, especially for applications requiring real-time updates. For example, in a chat app, frequent polling might still result in delays between messages.
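A minimal polling sketch for the temperature example (`fetch_temperature` is a stand-in for a real network request):

```python
import time

def fetch_temperature():
    """Stand-in for a network request to the data source."""
    return 21.5

POLL_INTERVAL_SECONDS = 30  # updates can lag by up to one interval

for _ in range(3):  # a real client would poll until stopped
    temperature = fetch_temperature()
    print(f"Current temperature: {temperature} C")
    time.sleep(POLL_INTERVAL_SECONDS)
```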
This is where streaming shines. Instead of clients repeatedly requesting data, they establish long-lived connections with servers, often using sockets. Servers then proactively push data to clients, creating a continuous data stream. In the chat app example, this allows clients to instantly receive messages as they arrive, providing a seamless user experience.
Streaming, however, places more responsibility on the server to push data proactively. It's a powerful solution for real-time applications but can increase server load when dealing with a large number of clients.
The choice between polling and streaming depends on your system's specific needs. If you require instant updates and real-time experiences, streaming is ideal. For less frequent updates, like monitoring stock prices every few minutes, polling may be sufficient.
In conclusion, understanding your system's functionality and requirements will guide your decision to use polling, streaming, or a combination of both.
Configuration is like the DNA of a computer application, defining critical settings. Unlike biological DNA, config files are easily editable. No gene therapy needed!
- JSON (JavaScript Object Notation): A file format widely used in APIs and configuration.
- Key-Value Store: Flexible NoSQL databases used for caching and dynamic configuration, including DynamoDB, Etcd, Redis, and ZooKeeper.
Configuration comprises essential parameters or constants for a system, often in JSON or YAML format. It can be static (bundled with the code) or dynamic (kept external to the code).
Configuration may seem simple, but it's crucial in systems design; large-scale systems rely heavily on it. Think of it as a set of constants for your application.
- Static Configuration: Bundled with the code; changes require a code deployment. Safer but slower to update.
- Dynamic Configuration: Kept separate from the code; changes take effect immediately. Requires database support and careful management.
Dynamic configuration offers flexibility but comes with responsibilities. Tools like review systems and access controls help ensure safe changes. A small JSON example follows.
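As a small illustration, here's what loading a JSON config might look like; the flag names are hypothetical:

```python
import json

# A hypothetical JSON config for a feature flag; the names are illustrative.
raw_config = """
{
    "newSearchEnabled": true,
    "maxResultsPerPage": 25
}
"""

config = json.loads(raw_config)

# With dynamic configuration, flipping "newSearchEnabled" in the config store
# would change behavior without redeploying any code.
if config["newSearchEnabled"]:
    print(f"Serving new search, {config['maxResultsPerPage']} results per page")
```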
Ever been poked repeatedly? Rate limiting prevents that in your system!
- Rate Limiting: Restricting the number of requests sent to a system to prevent denial-of-service attacks. Limits can be based on IP address, user, or region. Important for both security and performance.
- DoS Attack (Denial-of-Service): Flooding a system with traffic to render it unavailable. Rate limiting helps prevent this type of attack.
- DDoS Attack (Distributed Denial-of-Service): A more complex DoS attack involving traffic from many sources, making it challenging to defend against.
- Redis: An in-memory key-value store often used to implement rate limiting.
Rate limiting is about setting thresholds on operations. If the limit is exceeded, errors are returned. It protects against flooding and abuse.
Rate limiting can be tiered, with different limits for different scenarios. For example:
- Limiting a specific operation to once every 0.5 seconds.
- Allowing only three operations in 10 seconds.
- Enforcing stricter limits for malicious activity.
Tiered rate limiting adds complexity but enhances security. It involves tracking time windows and additional logic; a single-tier sketch follows.
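Here's a minimal in-memory sketch of the "three operations in 10 seconds" tier; a production system would typically keep these counters in something like Redis so all servers share them:

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    """Hedged sketch: allow at most `limit` operations per `window` seconds."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(deque)  # client id -> recent timestamps

    def allow(self, client_id):
        now = time.monotonic()
        timestamps = self.requests[client_id]
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()  # drop requests outside the window
        if len(timestamps) >= self.limit:
            return False  # over the limit: the caller should return an error
        timestamps.append(now)
        return True

limiter = SlidingWindowRateLimiter(limit=3, window_seconds=10)
for i in range(5):
    print(f"request {i}: {'allowed' if limiter.allow('user-1') else 'rejected'}")
```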
While rate limiting is effective, it's not foolproof. Complex attacks like DDoS can bypass it. Still, it's a vital tool in protecting systems from abuse and overload.
In the realm of systems design, two indispensable concepts that ensure the smooth operation of your system are Logging and Monitoring. Let's dive into these key terms to better understand their significance.
Logging
- Definition: Logging involves collecting and storing log messages, which are vital pieces of information about events within your system. These log messages are usually generated by your programs, written to STDOUT or STDERR, and often aggregated into a centralized logging solution.
- Why It's Important: Logging is your troubleshooting buddy. It allows you to track and diagnose issues within your system effectively. When something goes wrong, you can rely on logs to provide insights into what happened.
Monitoring
- Definition: Monitoring is the practice of keeping an eye on your system's critical metrics and events. It typically involves collecting important data and presenting it in human-readable charts, making it easier to identify trends and anomalies.
- Why It's Important: Monitoring ensures that you have real-time visibility into your system's health and performance. It helps you detect and address issues promptly, often before they impact users.
Alerting
- Definition: Alerting is the process of receiving notifications when critical system issues occur. Thresholds are set on monitored metrics, and when these thresholds are breached, alerts are sent to communication channels like Slack.
- Why It's Important: Alerting keeps you informed about potential problems in real time, enabling rapid responses to prevent or mitigate disruptions.
- Logging and monitoring are invaluable tools for understanding and maintaining large-scale systems. They provide visibility into the inner workings of your system, which becomes crucial as your system grows and serves more users.
- Imagine a scenario where a user reports a payment issue on your e-commerce website. Without logs, it would be challenging to pinpoint the problem, especially if it's a rare occurrence. Logging helps you track user actions and system responses, aiding in debugging and issue resolution.
- Logging involves inserting log statements into your code to capture essential information, such as errors, user actions, and requests. These logs are then collected and stored in a database for future analysis (see the sketch after this list).
- By investing in robust logging and monitoring practices, you equip your system to handle unexpected issues, maintain high availability, and provide a better experience to your users.
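A brief sketch of such log statements using Python's standard logging module; the "payments" service and the simulated failure are purely illustrative:

```python
import logging

# Send structured log lines to STDOUT, where a log aggregator can collect them.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("payments")  # hypothetical service name

def charge_user(user_id, amount_cents):
    logger.info("charge attempted user=%s amount_cents=%d", user_id, amount_cents)
    try:
        raise ConnectionError("payment provider timed out")  # simulated failure
    except ConnectionError:
        logger.exception("charge failed user=%s", user_id)  # logs the traceback

charge_user("user-123", 499)
```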
In systems design interviews, while logging and monitoring may not be the central focus, they are essential tools in your toolkit. They enhance your system's reliability, scalability, and maintainability, making it more resilient in the face of challenges. So, embrace the power of logging and monitoring to build robust and efficient systems!
The Publish/Subscribe pattern, often referred to as Pub/Sub, is a messaging model that plays a crucial role in systems design. It involves publishers, subscribers, topics (or channels), and messages, all working together to enable efficient communication.
- Publish/Subscribe Pattern (Pub/Sub)
  - A messaging model where publishers send messages to specific topics without knowing who the subscribers are.
  - Subscribers express interest in topics and receive messages from them.
  - Pub/Sub systems often guarantee features like at-least-once delivery, persistent storage, message ordering, and message replayability.
- Idempotent Operation
  - An operation that produces the same result regardless of how many times it's performed.
  - In Pub/Sub systems, operations must often be idempotent because messages can be consumed multiple times.
- Apache Kafka
  - A distributed messaging system created at LinkedIn, especially useful for streaming data.
- Cloud Pub/Sub
  - A highly scalable Pub/Sub messaging service from Google.
  - Offers at-least-once message delivery and supports message "rewinding" for reprocessing.
The Pub/Sub pattern is a powerful concept in systems design, especially when dealing with distributed systems. It solves challenges related to network partitions, message persistence, and reliable message delivery.
In a distributed system, ensuring that data remains accessible even in the face of network disruptions is crucial. Imagine a stockbroker application where clients rely on real-time stock prices to make important trades. Losing access to critical data due to a network partition is not an option.
This is where the need for persistent storage arises. Storing data in a database might seem like an obvious solution, but not all data fits neatly into a traditional database. Some data, like asynchronous operations, doesn't align with the typical database model.
That's where the Pub/Sub pattern shines. It introduces four key entities:
- Publishers: These are servers that publish data/messages to topics.
- Subscribers: Clients that subscribe to topics to receive relevant data.
- Topics: Channels or categories where messages are published.
- Messages: Units of data that subscribers are interested in.
A crucial aspect of the Pub/Sub pattern is that publishers and subscribers don't communicate directly. They interact through topics, acting as intermediaries. Publishers publish messages to topics, and subscribers subscribe to topics to receive messages.
- At-Least-Once Delivery: Messages are stored in persistent storage (topics), ensuring they are not lost even if subscribers disconnect. However, this can lead to messages being delivered more than once (hence the idempotent-consumer sketch after this list).
- Ordering of Messages: Messages within a topic are delivered to subscribers in the order they were published. This ensures the sequential presentation of data, which is vital in applications like chat or financial trading.
- Message Replayability: Some Pub/Sub systems allow subscribers to rewind and reprocess messages. This is valuable for scenarios where historical data needs to be revisited.
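Since at-least-once delivery implies possible duplicates, consumers are often written idempotently. A minimal sketch (in production, the set of processed IDs would live in durable storage):

```python
processed_ids = set()  # in production this would live in durable storage

def handle(message):
    """Idempotent consumer: safe even if Pub/Sub delivers a message twice."""
    if message["id"] in processed_ids:
        return  # duplicate delivery: already handled, do nothing
    processed_ids.add(message["id"])
    print("processing", message["payload"])

# At-least-once delivery means the same message may arrive again.
message = {"id": "msg-42", "payload": "AAPL 187.11"}
handle(message)
handle(message)  # the second delivery is a no-op
```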
In a Pub/Sub system, it's common to have multiple topics. This is because the system often serves various types of data or handles different use cases. For example, a Pub/Sub system might have separate topics for stock prices and news alerts.
Multiple topics offer clear separation of concerns and duties. Each topic can represent a distinct category of data, making management and scalability more manageable.
Pub/Sub systems can be enhanced with additional features like content-based filtering and end-to-end encryption. Many cloud-based Pub/Sub solutions, such as Google Cloud Pub/Sub, offer scalability and robustness out of the box.
These features, combined with the foundational characteristics of Pub/Sub, make it an indispensable tool in the realm of systems design.
The Pub/Sub pattern empowers architects and developers to build resilient, scalable, and efficient distributed systems, making it a valuable addition to your systems design toolkit.
MapReduce is like a magic wand for processing enormous datasets in a parallel and distributed manner on a cluster of machines.
Before we delve into the wonders of MapReduce, let's grasp a fundamental concept: the file system. Think of it as the organizer of your data. Most file systems, like the Unix file system, arrange data in a hierarchical structure of directories and files.
MapReduce is a superstar framework that lets you handle vast datasets efficiently, quickly, and fault-tolerantly in a distributed environment. A MapReduce job is like a recipe with three main steps:
- Map: It's like the prep work. The Map step runs a special map function on chunks of data, transforming them into intermediate key-value pairs.
- Shuffle: Imagine organizing puzzle pieces. The Shuffle step arranges the intermediate key-value pairs, making sure pairs with the same key end up on the same machine.
- Reduce: Time to put it all together! The Reduce step runs a reduce function on the shuffled key-value pairs, turning them into something meaningful.
The classic example? Counting word occurrences in a massive text file.
With MapReduce, you don't need to worry about the nitty-gritty details of parallelization and fault tolerance; it's all abstracted away for you. A toy word-count sketch follows.
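Here's a toy, single-process sketch of that word count; a real MapReduce framework runs the same three steps across many machines:

```python
from collections import defaultdict

def map_function(chunk):
    """Map: emit an intermediate (word, 1) pair for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key (in reality, by machine)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_function(key, values):
    """Reduce: collapse each key's values into a final count."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]  # "distributed" data
intermediate = [pair for chunk in chunks for pair in map_function(chunk)]
result = dict(reduce_function(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```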
A Distributed File System (DFS) is like a super-sized, smarter version of your regular file system. It turns a cluster of machines into one cohesive file system. Google's Google File System (GFS) and the famous Hadoop Distributed File System (HDFS) are two popular implementations.
DFS handles concerns like availability and replication automatically. It splits files into manageable chunks, spreads them across machines, and a central control system keeps everything in check. No more worrying about where your files are hiding!
Hadoop is the rockstar of open-source frameworks, and it's BFFs with MapReduce. At its core is HDFS (Hadoop Distributed File System). But Hadoop's talents go beyond MapReduce; it can handle a variety of data processing tasks.
Let's rewind to the early 2000s, when Google faced the challenge of handling massive datasets. Vertical scaling could only take them so far, so they had to go horizontal, adding hundreds or thousands of machines to their system.
Processing data stored across so many machines was no walk in the park. You had to parallelize tasks, handle network hiccups, and cope with machine failures. Enter MapReduce, the hero of the story.
In 2004, two brilliant Google engineers shared a white paper introducing MapReduce, a framework that made processing massive distributed datasets efficient, fast, and fault-tolerant.
Let's break it down:
- Data is scattered across machines in a distributed file system.
- The Map function transforms that data into key-value pairs.
- Shuffling happens, organizing those key-value pairs.
- The Reduce step crunches the numbers and delivers the final output.
See? Not so complicated! You've got your data, a Map function, a Shuffle step, and a Reduce function. Easy peasy!
Here are some key takeaways:
- A distributed file system is essential for MapReduce to shine. It's like a librarian who knows where every book is on a massive shelf.
- Map functions often travel to the data instead of moving the data, especially when dealing with colossal datasets. It's like bringing the chef to your ingredients!
- Key-value pairs rule the intermediate step. They're like puzzle pieces that fit together when you're reducing data.
- Fault tolerance is a big deal. If something goes wrong, MapReduce simply re-runs the Map or Reduce step where the problem occurred. This relies on the Map and Reduce functions being idempotent, meaning they produce the same result no matter how many times you run them.
- As an engineer or sysadmin, you focus on defining the Map and Reduce functions and specifying their inputs and outputs. The MapReduce framework takes care of the rest. Less headache, more data crunching!
MapReduce is a Swiss Army knife for data processing. You can use it to tackle tasks like:
- Analyzing YouTube video metadata.
- Counting logs from various services.
- Handling a wide range of data processing challenges.
It's a versatile tool you'll want in your toolkit for systems design interviews and real-world data processing adventures.
Now you're armed with MapReduce knowledge! Go out there and conquer those colossal datasets like a pro!
In the tech world, interviews often revolve around either Systems Design or API Design, which are closely related but distinct areas. Today, we'll dive into API Design, its importance, and what to expect in an API Design interview.
APIs (Application Programming Interfaces) are at the heart of many software products and services. Think about Twitter, YouTube, or AlgoExpert; they all rely on APIs to function. In some cases, like Stripe, the API is the actual product! So designing APIs well is vital, because once people and systems start using your API, changing it becomes challenging. Every design decision can have long-lasting consequences.
The start of an API Design interview is similar to Systems Design. You're given a vague problem statement, such as "design the Twitter API." You'll ask clarifying questions to understand the scope, like which part of Twitter the API should support, who the users are, and what functionalities it should provide.
After the initial phase, API and Systems Design interviews take different paths. In Systems Design, you might draw diagrams and discuss system components. In API Design, you'll outline the API: you define the entities (like tweets or payments), parameters, and responses, but you won't write actual code. It's more about having a conversation with your interviewer, explaining your decisions, and being open to feedback, just as you would when developing a real API.
You can outline your API in various formats. If you're comfortable with it, you can use tools like Swagger. However, the key is using a format that suits you and is acceptable to your interviewer.
In an API Design interview, a written outline of the API is usually expected. It's a challenging task, but you'll have 35-45 minutes to create it. While there's no single correct answer, you need to defend your design choices logically. The interview is about having a conversation, explaining your decisions, and being receptive to feedback.
Notes for AlgoExpert's SystemsExpert videos.