## **COURSE OVERVIEW**

## **INTRO**
### 1. Introduction
## **Theory & Algorithms**
### 1. Models of DS
### 2. Time in DS
### 3. Multicast
### 4. Consensus
### 5. Distributed Mutual Exclusion
## **Application**
### 7. Distributed Storage
### 8. Distributed Computing
### 9. Blockchains
### 10. Peer-to-Peer Networking
### 11. Internet of Things and Routing
### 12. Distributed Algorithms

# Lecture 08/09/2025

## What is a distributed system?
A distributed system is one where hardware and software components in/on networked computers communicate and coordinate their activity only by passing messages.

## CONCURRENCY

Concurrency is the ability of a system to execute multiple tasks or processes simultaneously or at overlapping times, improving efficiency.
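However, it can cause problems such as:
- Deadlocks and livelocks: conditions in which processes make no progress. In a deadlock, processes wait on each other forever; in a livelock, they keep changing state in response to each other without getting any work done.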
- Non-determinism: occurs when the output or behavior of a concurrent program differs for the same input, depending on the precise, unpredictable timing of events, such as thread interleaving or resource access.

Other issues arise from the absence of shared state:
- Processes must pass messages to synchronize: for example, if two parties share a resource and both try to access it at the same time, one of them may get an error unless they coordinate by exchanging messages.
- Processes may not agree on time: each node has its own clock, and the clocks can drift apart.

Everything can fail in distributed systems:
- Devices
- Integrity of data
- Network
    - Security
    - Man-in-the-Middle (MITM) attack: occurs when an attacker intercepts and potentially alters communication between two legitimate parties without their knowledge.
    - Byzantine failure: a condition of a system, particularly a distributed computing system, where a fault presents different symptoms to different observers, including imperfect information on whether a system component has failed.

Distributed systems are built for three main reasons: domain, redundancy, and performance.

---
## Domain
A domain is a specific area of knowledge or activity that a distributed system is designed to address.
Some examples of domains are:
- The internet
- Wireless Mesh Networks
- Industrial systems
- Ledgers (Bitcoin, Ethereum)

However, these domains can run into limits, which can be physical or logical.
- Physical limits: constraints imposed by the physical properties of the system's components or environment, such as hardware limitations, network bandwidth, latency, and geographical distribution.
- Logical limits: defined by the domain's bounded context, a core concept from Domain-Driven Design (DDD), a software development approach that focuses on aligning software design with the business domain. A bounded context is an explicit boundary within a distributed system where a specific domain model and its language (ubiquitous language) are consistent and applicable.
---
## Redundancy
A system with redundancy has duplicate components, processes, or data.
As a result, the system is:
- Robust: more resilient to failure
- Available: as in system availability which is measured in uptime.

A system with redundancy can:
- offer 99.9% uptime ("three nines"; "five nines" would be 99.999%): the system is designed to be operational and available 99.9% of the time, with a small amount of planned or unplanned downtime.
- be a backup: in an active-passive configuration, the redundant server acts as a hot or cold backup. The primary server handles all the workload, while the backup server waits on standby. Given the duplication of information and components if one fails the other automatically takes over.
- be a database: In an active-active configuration, all redundant servers are considered active and work together simultaneously. They are not simply waiting as a backup. Incoming requests are distributed among all the servers using a load balancer.
- be used in the banking sector, where any downtime can lead to significant financial losses.

---
## Performance
To ensure a performant system we need:
- Economics
- Scalability

Here we talk about different topics such as:
- Video streaming: requires a lot of processing power
- Cloud computing: Offers on-demand, scalable resources, eliminating the need for upfront hardware investment.
- Supercomputers: Excel at massive, specialized calculations but are very expensive and not scalable for general-purpose use.
- Many inexpensive vs. few expensive specialized machines: Distributing workloads across many inexpensive machines is often more economical and scalable than using a few expensive, specialized ones.

---

# Lecture 09/09/2025

# Models of distributed systems

## Aspects of models

Why do we build distributed systems?

- **Inherent distribution**: By definition, distributed systems span multiple computers, often connected through networks such as telecommunications systems.
- **Reliability**: Even if one node fails, the system as a whole can continue functioning, avoiding single points of failure.
- **Performance**: Workloads can be shared among multiple machines, and data can be accessed from geographically closer nodes to reduce latency.
- **Scalability for large problems**: Some datasets and computations are simply too large to fit into a single machine, requiring distributed processing.

### Modelling the process – API Style

A distributed system can be described in terms of modules that exchange **events** through well-defined interfaces:

- **Event representation**:  
  \{Event\_type | Attributes, …\}  
  or  
  \{Process\_ID | Event\_type | Attributes, …\}

- **Module behavior**:  
  Each module reacts to incoming events and produces outputs according to specified rules:  
  upon event \{condition | Event | attributes\} such that condition holds do  
  perform some action


Multiple modules together (one per process or subsystem) should collectively satisfy desired **global properties** (e.g., safety, liveness).
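The `upon event` rule from the module behavior above can be sketched as a small event-driven module. This is only an illustration: the `Module` class, its `upon` and `trigger` methods, and the `Deliver` event are hypothetical names chosen for this sketch, not part of the course material.

```python
from collections import defaultdict


class Module:
    """A module that reacts to events through registered handlers (hypothetical API)."""

    def __init__(self):
        # event type -> list of (condition, action) pairs
        self._handlers = defaultdict(list)

    def upon(self, event_type, condition=lambda attrs: True):
        """Register an action for events of `event_type` whose attributes satisfy `condition`."""
        def register(action):
            self._handlers[event_type].append((condition, action))
            return action
        return register

    def trigger(self, event_type, **attrs):
        """Run every matching action and collect the outputs."""
        return [action(attrs)
                for condition, action in self._handlers[event_type]
                if condition(attrs)]


echo = Module()


# upon event {Deliver | sender, payload} such that payload is non-empty do ...
@echo.upon("Deliver", condition=lambda attrs: attrs["payload"] != "")
def on_deliver(attrs):
    return f"reply to {attrs['sender']}: {attrs['payload']}"


print(echo.trigger("Deliver", sender="p1", payload="hello"))
```

Keeping the condition separate from the action mirrors the "such that condition holds" part of the rule: an event that fails the condition produces no output at all.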

### What we want/will make

We aim to:
- Design APIs for modules and prove that their composition satisfies global system properties.
- Implement modules that guarantee **local properties**.
- Use pseudocode and mathematics to formally demonstrate when such guarantees are possible—or prove impossibility.

---

## Failures

Failures are inevitable in distributed systems. They can arise due to hardware breakdowns, software bugs, network disruptions, or even human mistakes. Designing robust systems requires understanding different types of failures and strategies to mitigate them.

### Types of failures

1. **Crash-stop**: A process halts and all other processes can reliably detect the failure. *Easiest to handle.*
2. **Crash-silent**: A process halts but failures cannot be detected reliably.
3. **Crash-noisy**: Failures may be detected, but only with eventual accuracy (false positives or delays are possible).
4. **Crash-recovery**: Processes may fail and later recover, rejoining the system. Requires care to avoid state inconsistencies.
5. **Crash-arbitrary (Byzantine failures)**: Processes behave arbitrarily or maliciously, deviating from the protocol. *Hardest to handle.*
6. **Randomized behavior**: Processes make decisions probabilistically. Correctness is argued via probability theory rather than strict guarantees.

---

## Communication

Is communication always required? In distributed systems, yes—but it can be realized in different ways:

- **Message passing**:
  1. Types of links and their potential failures.
  2. Network topology (commonly assumed fully connected).
  3. Routing algorithms for multi-hop communication.
  4. Broadcast and multicast primitives.

- **Shared memory**:
  1. Which process can read or write to which location?
  2. How do we guarantee reading the *freshest* value? (Consistency models)

### On types of links

A **link** is a module implementing send/receive operations with certain properties.

- **TCP/IP**: Enables reliable communication between a pair of nodes (or none).
- **SSH**: Adds protection against corruption, interception, and tampering.

**Network reliability models**:
1. **Perfect links**: Reliable delivery, no duplication, no spurious messages.
2. **Fair-loss links**: Messages may be lost occasionally, but infinitely many attempts guarantee eventual delivery; finite duplication possible.
3. **Stubborn links**: Every sent message is retransmitted indefinitely, so it is eventually delivered (possibly many times), and no messages are created. This model is built on top of fair-loss links.
4. **Logged-perfect links**: Perfect delivery with persistent logs for auditing/recovery.
5. **Authenticated links**: Reliable delivery, no duplication, and sender authenticity.
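The layering of the first three models can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not a network implementation: the class names are hypothetical, loss is modeled with a seeded random number generator, and "retransmit forever" is replaced by a finite `retries` count so the example terminates.

```python
import random


class FairLossLink:
    """Drops each message with probability `loss`; never invents messages."""

    def __init__(self, loss, rng):
        self.loss = loss
        self.rng = rng

    def send(self, message, deliver):
        if self.rng.random() >= self.loss:
            deliver(message)


class StubbornLink:
    """Retransmits over a fair-loss link, so duplicates are expected."""

    def __init__(self, fair_loss, retries):
        self.fair_loss = fair_loss
        self.retries = retries  # assumption: finite stand-in for "retransmit forever"

    def send(self, message, deliver):
        for _ in range(self.retries):
            self.fair_loss.send(message, deliver)


class PerfectLink:
    """Filters duplicates on top of a stubborn link, so each message is delivered once."""

    def __init__(self, stubborn):
        self.stubborn = stubborn
        self.delivered = set()

    def send(self, message, deliver):
        def deliver_once(m):
            if m not in self.delivered:
                self.delivered.add(m)
                deliver(m)
        self.stubborn.send(message, deliver_once)


rng = random.Random(42)
link = PerfectLink(StubbornLink(FairLossLink(loss=0.5, rng=rng), retries=50))
received = []
for i in range(10):
    link.send(("msg", i), received.append)
print(received)
```

Each layer adds exactly one property: retransmission turns occasional loss into eventual delivery, and duplicate filtering turns repeated delivery into delivery exactly once.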

### Can networks fail?

While TCP/IP and lower-level protocols often give us the illusion of **perfect links** and **fail-stop crashes**, failures still happen.

- **Network partitions**: Occur when many links fail simultaneously, dividing the system into disconnected components. This is rare but catastrophic.

### Crashes vs Failures

Having discussed both **network** and **process** failures, it is important to distinguish between the two levels:

- A **process crashes** when it stops following its protocol (e.g., by halting or misbehaving).  
- A **system fails** when the combination of process crashes and communication assumptions no longer allows correct operation.

For the remainder of our discussion, we usually assume **perfect links** (thanks to TCP/IP and lower-level reliability mechanisms). This means that:
- Messages are delivered reliably,
- No duplicates are created,
- No spurious (phantom) messages appear.

Under this assumption, we can define **system failure models** in terms of process behavior:

- **Fail-stop system**: Processes may experience crash-stop failures, but links are perfect.  
- More complex models (e.g., crash-recovery, Byzantine failures) are defined similarly, always considering both the **process failure type** and the **communication assumptions**.

In short, a system failure model = (process failure model) + (assumed link properties).

---

## Timing

Timing plays a central role in distributed systems, especially when considering **synchronization** and **failure detection**.

- Systems may be **synchronous** (bounded delays) or **asynchronous** (no timing guarantees).
- Links are still modeled as modules with send/receive properties.

### Synchronous vs. Asynchronous Systems

Distributed systems can be broadly classified according to their **timing assumptions**:

1. **Asynchronous systems**:
   - No bounds on message transmission delays.
   - No assumptions about process execution speeds (relative speeds may differ arbitrarily).
   - Failure detection is unreliable, since a slow process cannot be distinguished from a failed one.
   - Coordination and ordering rely on **logical clocks** (e.g., Lamport clocks, vector clocks), rather than real time.

2. **Synchronous systems**:
   - Bounds exist on message transmission delays and process execution speeds.
   - **Timed failure detection** is possible: if a message or heartbeat is not received within a known bound, a failure can be suspected reliably.
   - Transit delays can be measured and incorporated into algorithms.
   - Coordination can be based on **real-time clocks** rather than purely logical clocks.
   - Performance is often analyzed in terms of **worst-case bounds**, since timing assumptions provide guarantees.
   - Processes may maintain **synchronized clocks** (to some degree of precision), enabling algorithms such as consensus and coordinated actions.

**Key question**: *Can processes in an asynchronous system with fair-loss links reach agreement (e.g., on coordinated attack time)?*

### Proof via contradiction (Two Generals Problem)

1. Assume a protocol exists in which a finite sequence of messages guarantees agreement, and take one that uses the fewest messages.
2. Consider the last message in this sequence.
3. Links are fair-loss, so this message may be lost. The protocol must still reach agreement in that case, so the receiving general must decide correctly without it.
4. The sender cannot distinguish whether the message was delivered or lost, so it must behave deterministically and decide the same action in both cases.
5. The last message is therefore unnecessary, and dropping it gives a correct protocol with fewer messages. This contradicts the choice made in step 1.  
 $\Rightarrow$ Perfect agreement is impossible under these assumptions.

### Which crash/link/timing assumptions implement distributed systems?

A **failure detector** can be modeled as just another module that provides (possibly imperfect) information about which processes are alive. Different combinations of timing assumptions and failure detectors allow different guarantees in distributed systems.  

### Example

![image](../images/Screenshot%202025-09-09%20at%2009.56.39.png)

#### Explanation:

This algorithm describes a **Perfect Failure Detector** for distributed systems using a heartbeat mechanism.

In short, here's what it does:

1.  **Sends Heartbeats:** Periodically, on a **timeout**, every process sends a `HEARTBEATREQUEST` message to all other processes in the system.
2.  **Waits for Replies:** It assumes no one is alive and waits for `HEARTBEATREPLY` messages. When a process receives a reply, it marks the sender as `alive`.
3.  **Detects Failures:** At the next timeout, any process that has not sent a reply is considered to have **crashed**. The algorithm then triggers a `Crash` event for that process.

Because it assumes **perfect communication links** (messages are never lost) and a timeout longer than the worst-case round trip (a synchronous system), this method guarantees that a non-responsive process has truly failed, making the failure detection "perfect."
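The three steps above can be sketched as a small round-based simulation. This is not the algorithm from the slide, only a sketch consistent with the explanation: the class name, the method names, and the synchronous `send` callback (a live process replies immediately, a crashed one never does) are assumptions made for illustration.

```python
class PerfectFailureDetector:
    """Heartbeat-based detector for a set of monitored peers (hypothetical API)."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.alive = set(peers)   # everyone is presumed alive before the first round
        self.detected = set()     # peers already reported as crashed

    def on_timeout(self, send):
        """Report peers that stayed silent for a whole round, then start a new round."""
        crashed = [p for p in sorted(self.peers)
                   if p not in self.alive and p not in self.detected]
        self.detected.update(crashed)
        self.alive = set()        # assume no one is alive until a reply arrives
        for p in sorted(self.peers - self.detected):
            send(p, "HEARTBEATREQUEST")
        return crashed            # stands in for triggering one Crash event per peer

    def on_reply(self, sender):
        """Called when a HEARTBEATREPLY arrives from `sender`."""
        self.alive.add(sender)


crashed_for_real = set()
fd = PerfectFailureDetector(["p2", "p3"])


def send(dest, message):
    # Assumption: perfect links and synchrony, so a live peer replies before the next timeout.
    if dest not in crashed_for_real:
        fd.on_reply(dest)


round1 = fd.on_timeout(send)      # everyone replies
crashed_for_real.add("p3")
round2 = fd.on_timeout(send)      # p3 answered the previous round, so nothing is reported yet
round3 = fd.on_timeout(send)      # p3 missed a whole round
print(round1, round2, round3)
```

The detector reports `p3` one full round after it crashes, which is the price of never suspecting a process that is merely slow within the timeout.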

### Network latency and bandwidth

When discussing communication performance, two key metrics matter:

- **Latency**: The time it takes for a single message (or bit) to travel from sender to receiver.  
- **Bandwidth**: The rate at which data can be transmitted, usually measured in bits per second (bps) or bytes per second (B/s).

#### Physical Link
Sometimes, surprisingly “low-tech” physical methods can provide high bandwidth, even if latency is poor:
- **Hard drives in a van**  
- Messengers carrying storage devices  
- Smoke signals (extreme latency, minimal bandwidth)  
- Radio signals or laser communication

#### Network Links
More conventional digital communication technologies include:
- DSL (Digital Subscriber Line)  
- Cellular data (e.g., 3G, 4G, 5G)  
- Wi-Fi (various standards)  
- Ethernet/fiber cables  
- Satellite links  

#### Latency examples
1. Hard drives transported by van: $\approx$ 1 day latency  
2. Intra-continent fiber-optic cable: $\approx$ 100 ms latency  

#### Bandwidth examples
1. Hard drives in a van: $\frac{50 \, \text{TB}}{1 \, \text{day}}$ = **very high bandwidth** despite huge latency  
2. 3G cellular network: $\approx 1 \, \text{Mbit/s}$ bandwidth  
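To make the comparison concrete, the van's bandwidth can be converted to bits per second. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and a delivery time of exactly one day; the function name is chosen for this example.

```python
TB = 10**12            # assumption: decimal terabytes, in bytes
DAY = 24 * 60 * 60     # seconds


def bandwidth_bits_per_second(num_bytes, seconds):
    """Average data rate when `num_bytes` arrive after `seconds`."""
    return num_bytes * 8 / seconds


van = bandwidth_bits_per_second(50 * TB, DAY)
three_g = 1e6          # about 1 Mbit/s, as in the list above

print(f"van: {van / 1e9:.2f} Gbit/s")
print(f"van / 3G: {van / three_g:.0f}x")
```

Under these assumptions the van moves data at several gigabits per second, thousands of times faster than the 3G link, even though the first byte takes a full day to arrive.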

---

## Performance

### Performance measures

- **SLI (Service Level Indicator)**: What aspect of the system do we measure?  
Examples: bandwidth, latency, fault tolerance, uptime, failure detection time.
- **SLO (Service Level Objective)**: What target values do we aim for?  
Example: latency < 200ms.
- **SLA (Service Level Agreement)**: An SLO backed with contractual consequences.  
Example: "99% uptime, otherwise partial refund."

Why should we study these?
- Measuring means we can improve.
- We can spend time improving where it is needed.
- Reliability is kind of the point of distributed systems.

### Reading SLAs

When evaluating claims like *“This solution offers 99% uptime”*, consider:

- **Sampling frequency**: How often is system availability checked?
- **Responsibility scope**: Does the SLA cover only server uptime, or also account for client/network failures?
- **Time interval**: Does 99% apply per day, per month, or per year?
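The time interval matters more than it may seem. The following sketch computes the downtime budget implied by a 99% target over different intervals; the function name and the 30-day month and 365-day year are simplifying assumptions.

```python
DAY = 24 * 60 * 60  # seconds


def allowed_downtime_seconds(uptime_percent, interval_seconds):
    """Downtime budget that still meets `uptime_percent` over the interval."""
    return interval_seconds * (100 - uptime_percent) / 100


intervals = [("day", DAY), ("30-day month", 30 * DAY), ("365-day year", 365 * DAY)]
for label, interval in intervals:
    minutes = allowed_downtime_seconds(99, interval) / 60
    print(f"99% per {label}: up to {minutes:.1f} minutes down")
```

A 99% yearly target allows a single outage of more than three days (87.6 hours), while a 99% daily target allows at most 14.4 minutes on any given day.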

---

# Lecture 15/09/2025

## The Challenge of Time

In distributed systems, we often contrast **synchronous** and **asynchronous** computation. A synchronous system has known, bounded delays for message delivery and process execution. An asynchronous system has no such guarantees. Most real-world systems are asynchronous, which makes coordination difficult. Without certain timing guarantees, some problems are impossible to solve deterministically, a classic example being the **Two Generals' Problem**, which illustrates the impossibility of reaching a consensus over an unreliable channel.

### Reasons for Asynchrony
Asynchrony isn't an abstract problem; it arises from concrete issues with the physical components of a system: the network and the nodes themselves.

#### Network unpredictability:
* **Physical failures:** Cables can be damaged (famously by sharks or cut by construction) requiring traffic to be rerouted. 🦈
* **Message loss:** Packets can be dropped, requiring retransmission protocols (like TCP) to resend data.
* **Congestion:** High traffic can lead to queues and variable delays (latency).
* **Re-configuration:** The network topology itself may change, causing temporary disruptions.

#### Node unpredictability:
* **OS scheduling:** The operating system's scheduler can preempt a process at any time to run another one.
* **Garbage collection (GC):** In managed languages (like Java or Go), a "stop-the-world" GC pause can halt an application for milliseconds or even seconds.
* **Hardware faults:** Nodes can crash, reboot, or suffer from other hardware-related issues.

But what if a system were "perfect"? Imagine no network loss and perfectly functioning nodes. Could asynchrony still occur? **Yes**. The non-deterministic nature of process scheduling is a fundamental source of asynchrony. A real-world example is the **2012 Knight Capital Group glitch**, where a software deployment error led to an algorithm running haywire. The system's components were working "correctly," but the timing and interaction between processes led to a catastrophic failure, costing the company $440 million in 45 minutes.

---

## How Do Distributed Systems Use Time?

Systems need to measure time for many fundamental operations. Think about how you would implement these on a single computer; in a distributed system, this becomes much harder.

1.  **Scheduling and Timeouts:** To run a task for a specific duration or to give up on an operation if a response isn't received within a certain window.
2.  **Failure Detection:** Using **heartbeats** (periodic "I'm alive" messages) to detect if a node has crashed. If a heartbeat isn't received within a timeout period, the node is presumed dead.
3.  **Event Timestamping:** Recording the time an event occurred, which is critical in databases for transaction ordering and data versioning (e.g., using Multi-Version Concurrency Control or MVCC).
4.  **Performance Measurement:** Logging and statistics gathering to measure latency, throughput, and other performance metrics.
5.  **Data Expiration:** Caching systems use Time-To-Live (TTL) values to expire old data. DNS records and security certificates also have expiration times.
6.  **Causal Ordering:** Most importantly, to determine the **order of events** across different nodes to maintain consistency and causality.

---

## Types of Clocks

In distributed systems, we primarily talk about two types of clocks. From a practical standpoint, a clock is simply something we can query to get a timestamp.

* **Physical Clocks:** These measure the passage of real-world time in units like seconds. They are based on physical phenomena, like the oscillation of a crystal.
* **Logical Clocks:** These don't track real time. Instead, they count events (e.g., the number of requests processed) to determine the logical order of operations.

---

## Physical Clocks: The Quartz Crystal

Most computers use quartz clocks. Here's how they work:

* A thin slice of quartz crystal is precisely cut to control its oscillation frequency when an electric voltage is applied (the **piezoelectric effect**).
* When you boot your computer, it queries a **Real-Time Clock (RTC)**—a small, battery-powered circuit on the motherboard—which has been continuously counting these oscillations.
* By counting the cycles, the computer can calculate the elapsed time.

However, these clocks aren't perfect:
* **Manufacturing variations:** No two crystals are identical.
* **Temperature sensitivity:** Frequency changes with temperature.
* This imperfection leads to **clock drift**. We measure this in **parts per million (ppm)**. A drift of 1 ppm means the clock is off by one microsecond per second, which adds up to about **32 seconds per year**. A typical computer clock might have a drift of around 50 ppm.
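The drift figures above follow from simple arithmetic, sketched here. The function name is chosen for this example, and a year is taken as 365.25 days.

```python
DAY = 24 * 60 * 60          # seconds
YEAR = 365.25 * DAY         # assumption: average year length


def drift_seconds(ppm, elapsed_seconds):
    """Worst-case clock error accumulated over `elapsed_seconds` at a drift of `ppm`."""
    return ppm * 1e-6 * elapsed_seconds


print(f"1 ppm over a year: {drift_seconds(1, YEAR):.1f} s")
print(f"50 ppm over a day: {drift_seconds(50, DAY):.2f} s")
```

At 50 ppm a clock can be off by several seconds after a single day, which is why periodic resynchronization is needed.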

Better, but more expensive, alternatives include:
* **Atomic clocks:** Extremely precise but very expensive.
* **GPS:** Satellites contain atomic clocks. A GPS receiver can use signals from multiple satellites to calculate a very precise time.
* **Network Time Protocol (NTP):** Ask another, more accurate server for the time.

---

## Time Standards and Representations

To agree on time, we need standards.

* **Solar Time (UT1):** Based on the Earth's rotation. A day is the time between the sun reaching its highest point in the sky on two consecutive days. This is not perfectly stable.
* **International Atomic Time (TAI):** Based on the oscillations of a caesium-133 atom. One second is defined as exactly 9,192,631,770 oscillations. TAI is extremely stable.
* **Coordinated Universal Time (UTC):** The global standard we all use. It's a compromise: it ticks at the same rate as TAI but is kept within 0.9 seconds of Solar Time (UT1) by adding **leap seconds**.

### Leap Seconds
To keep UTC aligned with the Earth's wobbly rotation, a second is occasionally added. This happens on June 30 or December 31.
* **Positive leap second:** The time `23:59:59` is followed by `23:59:60`, and then `00:00:00`.
* **Negative leap second:** `23:59:58` would be followed directly by `00:00:00`. (This has never happened).
Leap seconds are a notorious source of bugs in computer systems.

### Common Representations
* **Unix time:** The number of seconds that have elapsed since `00:00:00 UTC` on 1 January 1970 (the "epoch"). Importantly, it **ignores leap seconds**; a day with a leap second is still counted as having 86,400 seconds.
* **ISO 8601:** A standard format for representing dates and times, e.g., `2025-09-15T14:30:00Z` (where `Z` indicates UTC).
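Both representations are available in Python's standard `datetime` module. The sketch below converts the example timestamp above between the two forms, and checks that a day containing a leap second (31 December 2016) still spans exactly 86,400 Unix seconds.

```python
from datetime import datetime, timezone

moment = datetime(2025, 9, 15, 14, 30, 0, tzinfo=timezone.utc)

unix_seconds = int(moment.timestamp())                # seconds since the epoch
iso = moment.isoformat().replace("+00:00", "Z")       # ISO 8601 with the Z suffix
print(unix_seconds, iso)

# Unix time treats every day as 86,400 seconds, leap second or not.
day_start = datetime(2016, 12, 31, tzinfo=timezone.utc)
next_day = datetime(2017, 1, 1, tzinfo=timezone.utc)
leap_day_length = next_day.timestamp() - day_start.timestamp()
print(leap_day_length)
```

Because the leap second is invisible in Unix time, two distinct UTC instants around it map to the same Unix timestamp, which is one source of the bugs mentioned above.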

---

## Network Time Protocol (NTP)

Since computer clocks drift, they need to be periodically corrected. NTP is the most common protocol for this. A client synchronizes its clock with a more accurate time server.

The main protocols are **NTP** and the more precise **PTP** (Precision Time Protocol).
On Ubuntu/Linux, you can check the time synchronization service with: `systemctl status systemd-timesyncd`.

### NTP Synchronization Logic
Let's analyze the message exchange between a client and a server.

```
--------t1-------------t4------------> NTP CLIENT
          \           /
           \         /
------------t2-----t3----------------> NTP SERVER

```

* $T_1$: Client sends a request.
* $T_2$: Server receives the request.
* $T_3$: Server sends a response.
* $T_4$: Client receives the response.

The client can now calculate two important values:
1.  **Round-trip delay ($\delta$):** This is the total time the messages spent on the network, excluding the server's processing time.
    $$\delta = (T_4 - T_1) - (T_3 - T_2)$$
2.  **Clock offset/skew ($\theta$):** This is the client's best guess of the difference between its clock and the server's clock. Assuming the network delay is symmetric (i.e., the trip to the server takes as long as the trip back), the client calculates its offset as the difference between its local time ($T_4$) and what it thinks the server's time should be ($T_3$ plus half the round-trip delay).
    $$\theta = (T_3 + \frac{\delta}{2}) - T_4$$

Based on the calculated offset $\theta$, the client's clock is adjusted:
* If $|\theta| < 125ms$: **Slew** the clock. The clock is gradually sped up or slowed down until it's correct. This avoids sudden time jumps.
* If $125ms \le |\theta| < 1000s$: **Jump** the clock. The time is set immediately. This can cause issues for applications sensitive to time reversals.
* If $|\theta| \ge 1000s$: **Ignore**. The offset is too large and is likely an error, so the update is ignored.
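The two formulas and the adjustment rule can be checked with a small worked example. The sketch below uses integer milliseconds and made-up timestamps: the client clock is assumed to be 500 ms behind the server, each network trip takes 20 ms, and the server spends 10 ms processing. The function names are chosen for this example.

```python
def ntp_estimate(t1, t2, t3, t4):
    """Return (round-trip delay, clock offset) from the four NTP timestamps."""
    delay = (t4 - t1) - (t3 - t2)
    offset = (t3 + delay / 2) - t4
    return delay, offset


def adjustment(offset_ms):
    """Pick the correction strategy for an offset given in milliseconds."""
    magnitude = abs(offset_ms)
    if magnitude < 125:
        return "slew"
    if magnitude < 1000 * 1000:   # 1000 s expressed in milliseconds
        return "jump"
    return "ignore"


# Hypothetical timestamps in milliseconds (t1, t4 on the client clock; t2, t3 on the server clock).
t1, t2, t3, t4 = 100_000, 100_520, 100_530, 100_050

delay, offset = ntp_estimate(t1, t2, t3, t4)
print(delay, offset, adjustment(offset))
```

Here the estimate recovers the assumed 500 ms offset exactly because the two trips are symmetric; with asymmetric delays the estimate would be off by half the difference between them.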

---

## Clock Types Revisited: Monotonic vs. Time-of-Day

This brings us to two important types of clocks available in most programming environments.

* **Time-of-day Clock:**
    * Measures time since a fixed point in the past (e.g., the Unix epoch).
    * **Not monotonic:** It can jump forwards or backwards due to NTP adjustments or leap seconds.
    * Useful for timestamping events that need to be compared across different nodes.

* **Monotonic Clock:**
    * Measures time since an arbitrary point in the past (e.g., system boot).
    * **Guaranteed to move forward** and is not affected by NTP jumps.
    * Perfect for measuring elapsed time (e.g., timeouts) on a single node. You cannot use its value to compare timestamps between different nodes.
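In Python, the two clocks correspond to `time.time()` (time-of-day) and `time.monotonic()`. A minimal sketch of using each for its intended purpose, with a hypothetical `timed` helper:

```python
import time


def timed(fn):
    """Run `fn` and return (result, elapsed seconds) using the monotonic clock."""
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start   # never negative, even if NTP steps the wall clock
    return result, elapsed


result, elapsed = timed(lambda: sum(range(100_000)))
wall_clock_timestamp = time.time()       # seconds since the Unix epoch, suitable for logging

print(f"result={result}, took {elapsed:.6f} s, logged at {wall_clock_timestamp:.0f}")
```

Measuring the duration with `time.time()` instead could yield a negative or wildly wrong value if the clock is stepped during the call.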

Relying on physical clocks alone is insufficient for ordering events correctly in a distributed system due to clock skew and network latency.

---

## The Happens-Before Relation

To reason about causality without perfect physical clocks, we use a logical concept called the **happens-before** relation, denoted by $\rightarrow$. An event is an atomic operation on a single node.

We say event **a happens before event b** ($a \rightarrow b$) if one of the following is true:
1.  `a` and `b` happen on the same node, and `a` occurs before `b`.
2.  `a` is the sending of a message, and `b` is the receipt of that same message.
3.  There exists some event `c` such that $a \rightarrow c$ and $c \rightarrow b$ (transitivity).

This relation defines a **partial order**. It's possible that neither $a \rightarrow b$ nor $b \rightarrow a$. In this case, we say `a` and `b` are **concurrent**, written as $a \parallel b$. This means we cannot determine their causal order from the information we have.
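The three rules translate directly into a reachability check on a graph of events. The sketch below is one possible encoding, with hypothetical names: each node's events are listed in local order, and each message is a `(send_event, receive_event)` pair.

```python
def happens_before(node_events, messages):
    """Build hb(a, b), which is True exactly when a -> b."""
    successors = {}
    for events in node_events.values():
        for earlier, later in zip(events, events[1:]):
            successors.setdefault(earlier, set()).add(later)    # rule 1: same node
    for send, receive in messages:
        successors.setdefault(send, set()).add(receive)         # rule 2: message

    def hb(a, b):
        # rule 3: transitivity, by following edges from a
        stack, seen = list(successors.get(a, ())), set()
        while stack:
            event = stack.pop()
            if event == b:
                return True
            if event not in seen:
                seen.add(event)
                stack.extend(successors.get(event, ()))
        return False

    return hb


def concurrent(hb, a, b):
    """Two distinct events are concurrent when neither happens before the other."""
    return a != b and not hb(a, b) and not hb(b, a)


# Node A sends a message at a1 that node B receives at b1.
node_events = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
messages = [("a1", "b1")]
hb = happens_before(node_events, messages)

print(hb("a1", "b2"), concurrent(hb, "a2", "b1"))
```

In this execution `a1` happens before `b2` through the chain `a1`, `b1`, `b2`, while `a2` and `b1` are concurrent because no chain of local steps and messages connects them.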

This notion of causality is inspired by physics:
* Information cannot travel faster than the speed of light. If two events in spacetime are too far apart to influence each other, they are not causally related.
* In distributed systems, we replace the speed of light with the speed of messages. If no chain of messages connects event `a` to event `b`, then `a` cannot have caused `b`.

---

## Safety and Liveness

When designing distributed algorithms, we want them to satisfy certain properties across all possible executions. These properties usually fall into two categories:

* **Safety:** *Nothing bad ever happens.*
    * A safety property, once violated, can never be undone. For example, "a database will never return incorrect data." If it does so even once, the property is broken forever.
* **Liveness:** *Something good eventually happens.*
    * A liveness property can always be satisfied in the future. For example, "every request will eventually receive a response." Even if a request is waiting, there's always the possibility it will be answered later.

### Formal Definitions
* **Safety:** A property is a safety property if for any execution where it is violated, there is a finite prefix of that execution after which the violation is guaranteed and unavoidable.
* **Liveness:** A property is a liveness property if for any finite (partial) execution, there is at least one possible continuation of that execution where the property is satisfied.

### Examples
Consider a "perfect link" communication channel:
* **Safety Property:** A process only receives messages that were actually sent. (Prevents the "bad thing" of phantom messages).
* **Liveness Property:** If a correct process sends a message to another correct process, the destination eventually receives it. (Ensures the "good thing" of message delivery eventually happens).

---


# Lecture 16/09/2025

## Multicast

A multicast is a one-to-many communication where a single process sends a message to a specific group, and all members of that group receive it.

### What is it?

**Examples:**
* **Systems needing redundancy:** Algorithms with failover or replication, such as in Databases, DNS, or Banks.
* **One-to-many streaming:** Live TV/Radio broadcasts.
* **Many-to-many collaboration:** Skype, Teams, TikTok, and Massively Multiplayer Online games (MMOs).

**Disclaimer for this lecture:**
* We assume groups are **closed and static** (no members joining or leaving).
* We will not be discussing multiple overlapping groups.
* We will **not assume any special hardware support** for multicast.
* **Good news:** All algorithms shown work in both synchronous and asynchronous networks.


### Requirements

**Assuming:**
* We have **reliable 1-to-1 communication** (like TCP) as a building block.
* The sending process might crash.
* There is no default message ordering.

**Guarantees we want:**
* If a message is sent, it is **delivered exactly once**.
* Messages are eventually delivered to all **non-crashed (correct) processes**.
* The system is **fault-tolerant**; if one node fails, the rest can continue.

### General broadcast structure

We introduce a "broadcast algorithm" layer that sits between the application and the network.

* **Node 1** doesn't send directly to the network; it tells the **broadcast algorithm** to broadcast a message.
* The **broadcast algorithm** handles the logic of sending, re-transmitting, and ordering messages over the network.
* The **broadcast algorithm** on the receiving end then **delivers** the message to Node 2.

---

## Problems - IP Multicast

Standard IP Multicast often uses UDP, which offers no guarantees.

* **No re-transmission:** Lost packets are gone forever.
* **No reception guaranteed:** Messages might never arrive.
* **No ordering:** Messages can be delivered in an arbitrary order.

We need to build smarter algorithms to solve these problems.

### Implementing reliable broadcast algorithms

Different algorithms provide different ordering guarantees.

* **FIFO broadcast:** If a process sends `m1` before `m2`, they are delivered in that order. Preserves order from a single sender.
* **Causal broadcast:** If `broadcast(m1)` *happens-before* `broadcast(m2)`, then `m1` is delivered before `m2` everywhere. Preserves causality across different senders.
* **Total order broadcast:** If one node delivers `m1` before `m2`, then *all* nodes must deliver `m1` before `m2`. Everyone agrees on a single, global delivery order.
* **FIFO-total order broadcast:** A combination of both FIFO and total order guarantees.

---

## Hierarchy

We can think of these broadcast types as layers, each adding stronger guarantees.

* **Best-effort broadcast** is the unreliable base layer. We add re-transmission to get...
* **Reliable broadcast**, which guarantees delivery but not order. From there, we can add...
* **FIFO broadcast**, which doesn't re-order messages from the same sender. Then...
* **Causal broadcast**, which doesn't re-order messages related by the happens-before rule. Finally...
* **Total order broadcast**, which ensures all processes deliver messages in the exact same sequence.

---

## Reliable Multicast

### Properties

A reliable multicast protocol must have these three properties:

* **Integrity:** Messages are delivered at most once (no duplicates).
* **Validity:** If a correct process sends a message, it is eventually delivered.
* **Agreement:** If a correct process delivers a message, all other correct processes also deliver it.

A naive implementation where everyone forwards to everyone else is inefficient ($O(N^2)$ messages). A better approach is **Gossip**, where each node forwards a message to a few random peers. This is far more scalable and works with high probability.
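The quadratic cost of the naive approach can be seen by counting messages in a small simulation. This sketch assumes no process crashes and no message loss, and models the network as a simple queue; the function name is hypothetical.

```python
from collections import deque


def eager_reliable_broadcast(n, sender=0):
    """Simulate 'forward to everyone on first delivery' and count messages sent."""
    delivered = [False] * n
    in_flight = deque()
    sent = 0

    def forward(origin):
        nonlocal sent
        for q in range(n):
            if q != origin:
                in_flight.append(q)
                sent += 1

    delivered[sender] = True
    forward(sender)
    while in_flight:
        receiver = in_flight.popleft()
        if delivered[receiver]:
            continue              # duplicate: Integrity requires at-most-once delivery
        delivered[receiver] = True
        forward(receiver)         # re-forwarding is what gives Agreement if the sender crashes
    return all(delivered), sent


for n in (5, 10, 100):
    everyone, sent = eager_reliable_broadcast(n)
    print(n, everyone, sent)
```

Every process forwards the message once to the other $N - 1$ processes, so a single broadcast costs $N(N-1)$ messages in total.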

---

## Ordered Multicast

### Details and implications

1.  **FIFO and Causal ordering are partial orderings.** They don't specify the order for concurrent multicasts (those not linked by the happens-before relation).
2.  **Reliable totally ordered multicast** is often called **atomic multicast**. It's a powerful tool for building consistent distributed systems.
3.  **Ordering does not imply reliability.** A protocol could guarantee total order but still fail to deliver a message to a correct process, breaking the "Agreement" property.

---

## FIFO Broadcast

Reliable multicast that respects sender order is a FIFO broadcast. This is typically implemented by having the sender add a sequence number to each message. Receivers only deliver messages from a specific sender in the order of their sequence numbers.
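The sequence-number idea can be sketched as a small receiver with a hold-back buffer. This is a minimal illustration, not a full protocol: the class name `FifoReceiver` and its fields are hypothetical, and reliable dissemination of the messages is assumed to happen elsewhere.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers messages from each sender in sequence-number order."""

    def __init__(self):
        self.next_seq = defaultdict(int)  # sender -> next expected sequence number
        self.held = defaultdict(dict)     # sender -> {seq: payload} held back
        self.delivered = []               # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        # Ignore duplicates: already delivered or already buffered.
        if seq < self.next_seq[sender] or seq in self.held[sender]:
            return
        self.held[sender][seq] = payload
        # Deliver as many consecutive messages as possible.
        while self.next_seq[sender] in self.held[sender]:
            s = self.next_seq[sender]
            self.delivered.append((sender, self.held[sender].pop(s)))
            self.next_seq[sender] = s + 1


r = FifoReceiver()
r.receive("A", 1, "second")  # arrives early, so it is held back
r.receive("A", 0, "first")   # fills the gap, both are delivered in order
print(r.delivered)
```

Messages from different senders are tracked independently, which is why FIFO order says nothing about how messages from two senders interleave.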

---

## Totally Ordered Multicast

This is complex because everyone must agree on a single, global message order.

### Totally Ordered Multicast (Sequencer)

We elect a single process to act as a **leader** or **sequencer**.

1.  Processes send their messages to the sequencer.
2.  The sequencer assigns a global, sequential number to each message and broadcasts it to the group.
3.  All processes deliver messages in the order dictated by the sequencer.

* **Problems:** The sequencer is a performance bottleneck and a single point of failure.
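The three steps above can be sketched in a few lines. The names `Sequencer` and `Member` are hypothetical, the network is replaced by direct method calls, and leader election and failure handling are left out.

```python
class Sequencer:
    """Assigns a global, gap-free sequence number to every message."""

    def __init__(self):
        self.next_id = 0

    def order(self, msg):
        stamped = (self.next_id, msg)
        self.next_id += 1
        return stamped


class Member:
    """Delivers stamped messages strictly in sequence-number order."""

    def __init__(self):
        self.expected = 0
        self.pending = {}
        self.delivered = []

    def receive(self, stamped):
        seq, msg = stamped
        self.pending[seq] = msg
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


seq = Sequencer()
m1, m2 = seq.order("x"), seq.order("y")
a, b = Member(), Member()
a.receive(m1); a.receive(m2)
b.receive(m2); b.receive(m1)  # the network reorders the two messages for b
print(a.delivered, b.delivered)
```

Both members end up with the same delivery sequence even though `b` received the messages in the opposite order.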

### Totally Ordered Multicast (ISIS)

A decentralized approach where processes negotiate the order.

1.  Process `p` broadcasts a message `m` with a proposed ID.
2.  Every other process `q` responds to `p` with its own proposed ID (typically the highest it has seen + 1).
3.  Process `p` collects all proposals, picks the largest one as the final ID, and broadcasts this final ID to the group.

* **The Trick:** Each process tracks the "largest proposed ID" and the "largest agreed-upon ID" to ensure it never delivers a message out of order.
* **Tradeoff:** This is more robust than a sequencer but requires more communication rounds (3 rounds vs. the sequencer's 2).

### Is it really ordered?

Let's say process A sends `m` and `n`, and the protocol assigns the final timestamps `1` to `m` and `2` to `n`. A will deliver `m` before `n`. Could process B deliver `n` before `m`?

* No. B receives the same final, agreed-upon timestamps of `1` for `m` and `2` for `n` via multicast. It cannot invent different ones.
* What if B proposed a timestamp of `3` for `m`? Then the final, agreed-upon timestamp for `m` would have to be at least `3`, but we know it's `1`. This is a contradiction.
* What if B wants to deliver `n` (with final timestamp `2`) before it has even heard of `m`? This is not possible, because B would have to participate in the proposal round for `m`. In that round, it would propose a timestamp larger than `2`, leading to `m`'s final timestamp being greater than `2`, which contradicts the fact that it's `1`.

The protocol forces all nodes to converge on the same sequence.

---

## Total order broadcast via Lamport clocks

We can achieve total order by giving each message a logical timestamp.

* **Idea:** Attach a Lamport timestamp to all messages and deliver them in timestamp order.
* **Problem:** If I receive a message with timestamp 5, how do I know a message with timestamp 4 won't arrive later?
* **Solution:** Use FIFO links. A process can only deliver the message with timestamp 5 after it has received a message with a timestamp *greater than 5* from **every other process**. This confirms no earlier messages are still in transit.
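The delivery rule can be sketched as a single check. This is an illustration under stated assumptions: `deliverable`, `pending`, and `latest` are hypothetical names, ties between equal timestamps are broken by process id (not stated in the notes), and each process is assumed to update its own Lamport clock on every receive.

```python
def deliverable(pending, latest, me):
    """Return the (timestamp, sender) key that may be delivered now, or None.

    pending: set of (timestamp, sender) keys received but not yet delivered
    latest:  dict sender -> highest timestamp received from that sender
             (meaningful only because links are FIFO)
    me:      id of the local process
    """
    if not pending:
        return None
    ts, sender = min(pending)  # smallest timestamp, ties broken by sender id
    for q, seen in latest.items():
        # Every process other than the sender and ourselves must already
        # have shown us a timestamp greater than ts.
        if q not in (me, sender) and seen <= ts:
            return None
    return (ts, sender)


# Process 0 holds a message with timestamp 5 from process 1.
print(deliverable({(5, 1)}, {1: 5, 2: 3}, me=0))          # process 2 may still send 4
print(deliverable({(5, 1), (6, 2)}, {1: 5, 2: 6}, me=0))  # now safe to deliver (5, 1)
```

The cost of this scheme is visible in the check: a single silent process blocks delivery for everyone.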

---

## Causal broadcast via Lamport clock

Physical clock timestamps may not respect causality. The solution is **Logical Clocks**. They are designed to capture the happens-before relation (`e1` ⇒ `e2` implies `T(e1) < T(e2)`).

We will look at two types:
1.  Lamport Clocks
2.  Vector Clocks

---

## Lamport clocks Algorithm

Each process maintains a single integer counter.

* Each process initializes a local clock `t` to 0.
* Before any event, a process increments its clock: `t = t + 1`.
* When sending a message `m`, it sends the tuple `(t, m)`.
* When receiving `(t_msg, m)`, a process updates its clock `t = max(t, t_msg)` and then increments it for the receive event.

### Properties

* If `a` happens-before `b` (`a` ⇒ `b`), then `L(a) < L(b)`.
* However, if `L(a) < L(b)`, it does **not** mean `a` ⇒ `b`. They could be concurrent.
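A minimal sketch of the counter rules above. The class and method names are illustrative only.

```python
class LamportClock:
    """A single integer counter, updated as described in the rules above."""

    def __init__(self):
        self.t = 0

    def tick(self):
        """Local event: increment the clock."""
        self.t += 1
        return self.t

    def send(self):
        """Sending is an event; the returned value is attached to the message."""
        return self.tick()

    def receive(self, t_msg):
        """Take the maximum of both clocks, then count the receive event."""
        self.t = max(self.t, t_msg)
        return self.tick()


p, q = LamportClock(), LamportClock()
q.tick(); q.tick()          # q has two local events, so q.t == 2
t_msg = p.send()            # p.t becomes 1
t_recv = q.receive(t_msg)   # max(2, 1) + 1 == 3
print(t_msg, t_recv)
```

The send event (timestamp 1) happens-before the receive event (timestamp 3), matching the first property. The two local events on `q` also have timestamps 1 and 2, yet they are concurrent with the send on `p`, which illustrates the second property.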

---

## Vector Clocks

Each process `pi` maintains a vector `Vi` of size `N` (number of processes).

* Initially, `Vi[j] = 0` for all `j`.
* Before an event at `pi`, it increments its own clock entry: `Vi[i] = Vi[i] + 1`.
* When sending a message, it attaches its entire vector `V`.
* On receiving a message with vector `V'`, the process updates its local vector by taking the element-wise maximum: `Vi[j] = max(Vi[j], V'[j])` for all `j`.

**Comparison Rules:**
* `V = W` if `V[j] = W[j]` for all `j`.
* `V ≤ W` if `V[j] ≤ W[j]` for all `j`.
* `V < W` if `V ≤ W` and `V ≠ W`.
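The comparison rules translate directly into code. The sketch below uses plain Python lists as vectors; the function names are hypothetical.

```python
def leq(v, w):
    """V <= W: every entry of V is at most the matching entry of W."""
    return all(a <= b for a, b in zip(v, w))


def happened_before(v, w):
    """V < W: V <= W and the vectors differ."""
    return leq(v, w) and v != w


def concurrent(v, w):
    """Neither vector is <= the other, so the events are causally unrelated."""
    return not leq(v, w) and not leq(w, v)


def merge(v, w):
    """Element-wise maximum, used when a message is received."""
    return [max(a, b) for a, b in zip(v, w)]


print(happened_before([1, 0, 0], [1, 1, 0]))
print(concurrent([1, 0, 0], [0, 1, 0]))
print(merge([1, 0, 0], [0, 1, 0]))
```

Unlike Lamport timestamps, two vector timestamps can be incomparable, and that is exactly what lets vector clocks detect concurrency.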

### Vector Clocks, as used for CO Multicast

To ensure causal order, when process `Pj` receives a message `m` from `Pi` (with vector `Vm`), it delays delivery of `m` until **both** conditions are met:

1.  `Vm[i] = Vj[i] + 1`
    * This ensures `m` is the very next message `Pj` expected from `Pi`.
2.  `Vm[k] ≤ Vj[k]` for all `k != i`
    * This ensures `Pj` has already delivered all messages that `Pi` had seen before it sent `m`.
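The two conditions can be checked mechanically. In this sketch the local vector counts how many multicasts from each sender have been delivered so far, which is an assumption about how the clock is maintained in this use; `can_deliver` and `CausalReceiver` are hypothetical names.

```python
def can_deliver(vm, vj, i):
    """Check both delivery conditions for a message from sender i."""
    next_from_sender = vm[i] == vj[i] + 1
    seen_dependencies = all(vm[k] <= vj[k] for k in range(len(vm)) if k != i)
    return next_from_sender and seen_dependencies


class CausalReceiver:
    def __init__(self, n):
        self.v = [0] * n      # delivered multicasts per sender
        self.held = []        # (sender, vector, payload) waiting for delivery
        self.delivered = []

    def receive(self, sender, vm, payload):
        self.held.append((sender, vm, payload))
        progress = True
        while progress:       # a delivery may unblock other held messages
            progress = False
            for item in list(self.held):
                s, v, p = item
                if can_deliver(v, self.v, s):
                    self.held.remove(item)
                    self.delivered.append(p)
                    self.v[s] += 1
                    progress = True


r = CausalReceiver(3)
r.receive(1, [1, 1, 0], "reply")     # depends on a message from process 0
r.receive(0, [1, 0, 0], "question")  # unblocks the held message
print(r.delivered)
```

The reply is held back until the message it causally depends on has been delivered.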

![image](../images/Screenshot%202025-09-16%20at%2012.41.19.png)

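As the figure shows, the message with timestamp (1,1,0) causally depends on the message with timestamp (1,0,0). A process that receives (1,1,0) first must therefore hold it back until (1,0,0) has been delivered.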

---

# LECTURE 22/09/2025

## Consensus
In distributed computing, **consensus** is the fundamental challenge of getting a group of independent processes (or nodes) to **agree on a single value**. This agreed-upon value is final. Think of it as a committee that must vote on and finalize one decision, and once made, it cannot be changed.

This is formally equivalent to **total order broadcast**, where processes must agree on the *sequence* of messages to deliver. If they can agree on the first message, then the second, then the third, and so on, they are effectively solving consensus for each message slot in the order.

### Practical Examples
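* **Multicast & Bank Accounts**: Imagine you have $100 in an account replicated across multiple servers. If you deposit $50 and simultaneously withdraw $30, all servers must agree on the order of operations. Do they process the deposit first (balance goes to $150, then $120) or the withdrawal first (balance goes to $70, then $120)? Here the final balance happens to match, but the intermediate states differ, and with a withdrawal of $130 the order would decide whether it succeeds at all. The servers must reach a consensus on one order so that every replica goes through the same states.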
* **Redundancy**:
    * **Space and Aeronautics**: The flight control computers on a spacecraft or modern airplane must agree on sensor readings and control actions. If one computer thinks the plane should pitch up and another thinks it should pitch down, they must reach a consensus to avoid a catastrophic failure.
* **Replication**:
    * **Distributed File Systems**: When you write to a file stored on Google Drive or Dropbox, multiple replicas of that data are updated. Consensus ensures all replicas agree on the latest version of the file.
    * **Ledger Technology (e.g., Blockchain)**: A blockchain is essentially a chain of consensus decisions. Miners or validators around the world must agree on which block of new transactions is the next one to be added to the chain.

### Common algorithms
* **Paxos**: A classic and highly influential algorithm for reaching consensus on a single value in an asynchronous system where nodes can crash. **Multi-Paxos** extends this to agree on a sequence of values, effectively creating a total order broadcast.
* **Raft, Viewstamped Replication, Zab**: These protocols solve total order broadcast directly, often by first electing a stable leader to coordinate decisions. Raft in particular was designed to be more understandable and easier to implement than Paxos.

---

## System model
In distributed systems, we must define the "rules of the game" under which an algorithm operates. This is the **system model**, and it typically specifies:
* **Network Behavior**: Are messages delivered reliably? Can they be delayed indefinitely (**asynchronous**) or is there a known maximum delay (**synchronous**)?
* **Node Behavior**: How can processes fail? Can they simply stop (**crash-fail**) or can they behave maliciously and lie (**Byzantine**)?
* **Timing Assumptions**: Do processes have access to synchronized clocks?

The choice of system model drastically affects what problems are solvable.

---

## Reliable consensus vs failures summary
Achieving consensus becomes progressively harder as the system becomes less reliable.

* **No Failures (Easy Case)**: If no process ever fails, consensus is trivial.
    1.  Every process broadcasts its proposed value to all others.
    2.  Each process waits until it has received a value from every other process.
    3.  Each process applies a simple function (like choosing the minimum value, or the first one received) to its collection of received values. Since everyone has the same set of values, they will all decide on the same outcome.

* **With Crash Failures**: If processes can crash, the simple approach fails. A process might wait forever for a message from a crashed node.
    * **The core problem**: How do you distinguish a process that is just very **slow** from one that is **dead**? This ambiguity is a central challenge in asynchronous systems.
    * **Solution**: You need a **failure detector** mechanism to handle crashed nodes, but these are often imperfect.

* **With Lies (Byzantine Failures)**: This is the hardest case. A faulty process can lie, sending value `A` to one node and value `B` to another.
    * **The trust problem**: If you receive conflicting information, how do you know who is telling the truth?
    * **Impact**: To tolerate these malicious failures, you need more nodes in total. Intuitively, you need enough honest nodes to "outvote" the liars. This significantly decreases the number of faulty nodes a system can withstand compared to simple crash failures.

---

## Requirements for consensus
For a set of processes `p_i` proposing values, we define a set of formal properties that any correct consensus algorithm must satisfy. Each process has a decision variable `d_i`, initially set to `⊥` (undecided).

* **Termination**: Eventually, every **correct** (non-faulty) process must decide on a value (i.e., set its `d_i` to something other than `⊥`). The system cannot get stuck forever.
* **Agreement**: No two correct processes decide on different values. If process `p_i` decides `v_a` and process `p_j` decides `v_b`, then it must be that `v_a = v_b`.
* **Integrity**: If all correct processes propose the same value `v`, then any correct process that decides must decide on that value `v`. This prevents trivial solutions like "always decide 0".
* **Weak Integrity (often used)**: A slightly different version states that the decided value must have been proposed by at least one of the processes. This ensures the outcome is not just made up.

A process `p_i` is in the **Decided State** as soon as its decision variable `d_i` is no longer `⊥`.

---

## Synchronous Consensus Algorithm
In a **synchronous** system, we assume that message delivery and processing happen in lock-step **rounds**. There's a known upper bound on how long a message takes to arrive. This assumption simplifies things greatly but is often unrealistic.

**Goal**: To create an algorithm that is resilient to `f` crash failures and computes the minimum proposed value.

### f-resilient (synchronous) Consensus Algorithm
The algorithm operates in `f + 1` rounds. In each round, every process broadcasts the set of values it knows about so far, and then updates its set with the values it receives from others.

```
v = { value from application (call x) }
B-multicast(v)
for each round i ∈ 1 ... f + 1 do
    v' = v
    for each m received do
        v = v ∪ m
    end
    B-multicast(v \ v')   // only the newly learned values; not needed in round f+1
end
d = min(v)                // pick d as the minimal value of v
return d
```

* **Initialization**: Each process `p_i` starts with a set of values `V_i = {v_i}`, where `v_i` is its own initial proposal.
* **Rounds**: For `k` from 1 to `f + 1`:
    1.  Each process `p_i` broadcasts its current set of values `V_i` to all other processes.
    2.  Each process `p_i` waits to receive messages from all other non-faulty processes. It updates its set `V_i` by taking the union of its current set and all the sets it received in this round.
* **Decision**: After `f + 1` rounds, each process `p_i` decides on the minimum value in its final set `V_i`.
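A small simulation makes the role of the extra round visible. This is a sketch under simplifying assumptions: rounds are simulated in a loop, each process resends its whole set rather than only the new values, and a crash is modelled as a partial send in one round. The function name and the `crashes` format are hypothetical.

```python
def synchronous_consensus(proposals, f, crashes=None):
    """Simulate f + 1 rounds and return the decision of every surviving process.

    proposals: dict pid -> proposed value
    crashes:   dict pid -> (round, recipients), meaning pid crashes in that
               round after sending only to the given recipients
    """
    crashes = crashes or {}
    known = {p: {v} for p, v in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in proposals}
        crashing = {p for p in alive if p in crashes and crashes[p][0] == rnd}
        for p in alive:
            recipients = crashes[p][1] if p in crashing else set(proposals) - {p}
            for q in recipients:
                inbox[q] |= known[p]
        alive -= crashing
        for p in alive:
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}


proposals = {1: 3, 2: 5, 3: 7}
crashes = {1: (1, {2})}  # process 1 crashes in round 1 after reaching only process 2

print(synchronous_consensus(proposals, f=1, crashes=crashes))  # two rounds
print(synchronous_consensus(proposals, f=0, crashes=crashes))  # one round only
```

With two rounds, process 2 relays the value `3` to process 3 and both decide `3`. With a single round, process 3 never hears about `3` and decides `5`, so agreement is violated.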

#### Why does this work? (Proof Sketch)
The proof for **Agreement** works by contradiction.
* **Assume** two correct processes, `p_i` and `p_j`, decide on different minimum values, `x` and `y`, where `x < y`.
* This means that at the end of round `f+1`, `p_j`'s set of values *did not contain* `x`.
* For this to happen, `x` must have been "hidden" from `p_j` in every one of the `f + 1` rounds. A value can only be hidden in a round if a process holding it crashes during that round, after sending it to some processes but not to `p_j`.
* But at most `f` processes can crash, so among the `f + 1` rounds there is at least one round in which nobody crashes. In that round every live process sends everything it knows to everyone, so all live processes hold the same set of values afterwards, and later rounds cannot make those sets diverge.
* Therefore, it's a **contradiction** to think `p_j` never received `x`. At the end of round `f + 1`, `p_i` and `p_j` hold the same set of values and will thus decide on the same minimum.

### Theorem
A well-known lower bound in distributed computing states that any deterministic consensus algorithm that tolerates `f` crash failures in a synchronous system requires **at least `f + 1` rounds** of communication in the worst case.

---

## Byzantine Error
What if processes don't just crash, but behave unpredictably or maliciously? This is a **Byzantine error**. A Byzantine node can lie, send conflicting messages to different peers, or collude with other faulty nodes. This models the most challenging failure scenario. The name comes from the famous paper by Lamport, Shostak, and Pease, "The Byzantine Generals Problem."

### Examples
This isn't just a theoretical or software problem; it can be caused by hardware faults.
* **Single Event Upset (SEU)**: A cosmic ray or high-energy particle strikes a memory cell, flipping a bit from 0 to 1 (or vice-versa). This can corrupt data or instructions, causing the node to behave erratically.
* **Single Event Latchup (SEL)**: A hardware error that can cause a short-circuit, leading to unpredictable behavior or permanent damage.

These issues are critical in:
* Aerospace, where radiation is higher.
* Systems using non-**ECC (Error-Correcting Code) memory**.
* High-reliability systems like **nuclear power plants** or **avionics**.

---

## Byzantine Consensus
To solve consensus with Byzantine failures, we need a stronger integrity property.

* **Byzantine Integrity**: If all **non-faulty** (i.e., correct) processes start with the same value `v`, then all non-faulty processes must decide on `v`. This ensures that a few Byzantine nodes cannot trick the honest majority into deciding on a wrong value when they already agree.

**Goal**: Design an `f`-byzantine-resilient synchronous consensus algorithm.

### The Bad News: Impossibility Result
A groundbreaking result shows that no solution can exist if the number of faulty nodes `f` is too high relative to the total number of nodes `n`. Consensus is **impossible for `f ≥ n/3`**, or `n ≤ 3f`.

### The Good News
If `n > 3f`, solutions are possible. For example, to tolerate 1 Byzantine fault (`f=1`), you need at least 4 nodes in total (`n=4`). To tolerate 2 (`f=2`), you need at least 7 (`n=7`).
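The bound `n > 3f` can be turned into two small helper functions. The names are hypothetical; the arithmetic follows directly from the inequality.

```python
def min_nodes(f):
    """Smallest n with n > 3f."""
    return 3 * f + 1


def max_byzantine_faults(n):
    """Largest f with n > 3f, i.e. f <= (n - 1) / 3 rounded down."""
    return (n - 1) // 3


for f in (1, 2, 3):
    print(f, min_nodes(f))

for n in (3, 4, 6, 7):
    print(n, max_byzantine_faults(n))
```

Going from 4 to 6 nodes buys no extra tolerance, because only the seventh node raises `f` from 1 to 2.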

### f-byzantine resilience?
Let's see why `n=3, f=1` is impossible.
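Imagine a Commander (C) sending an order ("attack" or "retreat") to two Lieutenants (L1, L2). One of the three is a traitor.

* **Scenario A**: The Commander is the traitor. C tells L1 to "attack" and L2 to "retreat". L1 tells L2 "C told me to attack", and L2 tells L1 "C told me to retreat".
* **Scenario B**: L2 is the traitor. C honestly tells both Lieutenants to "attack", but L2 tells L1 "C told me to retreat".
* **Why they cannot agree**: L1 sees exactly the same messages in both scenarios ("attack" from C, "retreat" from L2), so it cannot tell whether C or L2 is the traitor. In scenario B it must obey the honest Commander and attack, so it also attacks in scenario A. By the mirrored argument (L1 as the traitor, C ordering "retreat"), L2 must retreat in scenario A. The two honest Lieutenants then decide differently.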

### Byzantine Non-Consensus larger n simulation
This is a proof technique that extends the `n=3, f=1` impossibility to every `n ≤ 3f`.

* **Proof by Reduction**: Let's **assume** we have a magical algorithm that solves Byzantine consensus for some `n` processes with `f` traitors, where `n ≤ 3f`. We will use it to solve the `n=3, f=1` problem, which we know is impossible, thereby proving our initial assumption was wrong.
* **The Simulation**:
    1.  Split the `n` processes of the magical algorithm into three groups, each containing at most `f` processes (possible because `n ≤ 3f`).
    2.  Each of the three real generals *simulates* all processes of one group.
    3.  Messages between simulated processes in different groups travel over the real links; messages inside a group stay local to the simulating general.
    4.  When the simulated processes of an honest general decide, that general adopts their decision.
* **The Contradiction**: If one real general is a traitor, only the simulated processes in its group misbehave, so the number of simulated traitors is at most `f`. The magical algorithm tolerates that, so all simulated processes of the two honest generals decide the same value, and the honest generals reach consensus.
* But we know consensus is impossible for `n=3, f=1`! Since our "magical" algorithm allowed us to solve an unsolvable problem, the magical algorithm itself cannot exist. Hence `n ≤ 3f` is impossible in general.

---

## Three Equivalent Problems
These three problems are different formulations of the same core challenge and can be transformed into one another.

1.  **Consensus**:
    * **Goal**: All processes propose a value `v_i`; they must agree on a single one.
    * **Properties**: Termination, Agreement, Integrity.

2.  **Byzantine Generals**:
    * **Goal**: A single Commander issues an order to `n-1` Lieutenants. They must all agree on the order received.
    * **Properties**: Termination, Agreement, and a special **Integrity**: If the Commander is correct, all correct Lieutenants must decide on the Commander's proposed order. (Note: if the Commander is faulty, they just need to agree on *some* order).

3.  **Interactive Consistency**:
    * **Goal**: Every process `p_i` proposes a value `v_i`. All correct processes must agree on the *same vector* of values `V = (v_1, v_2, ..., v_n)`.
    * **Properties**: Termination, Agreement (on the whole vector), and **Integrity**: If process `p_i` is correct, then the `i`-th component of the decided vector must be `v_i`.

### Equivalence of the problems
* **Byzantine Generals (BG) to Interactive Consistency (IC)**: To agree on a vector, simply run the BG algorithm `n` times. In the first run, `p_1` acts as Commander. In the second, `p_2` acts as Commander, and so on. The final vector is built from the outcomes of each run.
* **Interactive Consistency (IC) to Consensus (C)**: First, run IC to get an agreed-upon vector of proposals. Then, each process independently applies a deterministic function (e.g., `min()`, `max()`, `majority()`) to that vector to compute a single final value. Since they all start with the same vector and apply the same function, they will arrive at the same consensus value.
* **Consensus (C) to Byzantine Generals (BG)**: The Commander sends its value to all Lieutenants. Then, every process (including the Commander) initiates a Consensus round, proposing the value it received (or its own value, if it's the Commander). Because the Consensus algorithm can tolerate traitors, the honest nodes will agree on a single value, achieving the BG goal.

---

## Byzantine Generals Algorithm (f=1)
This is a simple synchronous algorithm that solves the problem for `n=4, f=1`. It takes two rounds of communication.

```
// Executed by the Commander
def Commander:
  v = value from application        // e.g., "attack"
  B-multicast(v) to all Lieutenants // Round 1: send the order

// Executed by each Lieutenant
def Lieutenant:
  let v = value received from Commander
  let i = my unique process id
  // Round 2: relay the order you received to everyone else
  B-multicast(i : v) to all other Lieutenants
  // Wait to receive the relayed orders from the other n-2 Lieutenants
  // Decide by majority over v and the relayed orders; on a tie, use a default
  let d = majority(v, relayed orders)
  return d
```

**Why it works for `n=4, f=1`**:
* **Case 1: Commander is honest** (so at most one Lieutenant is the traitor). All Lieutenants receive the same correct order from the Commander. After the relay round, each honest Lieutenant has at least 2 votes for the correct order (one from C, one from the other honest L) and at most 1 vote for whatever the traitor says. The majority vote is the correct order.
* **Case 2: Commander is the traitor**. All 3 Lieutenants are honest and relay faithfully whatever they received. Each of them therefore ends up with the same three votes, so the majority (or the default on a tie) is the same everywhere. They agree, which is all that is required when the Commander is faulty.
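Both kinds of traitor can be tried out in a short simulation of the `n=4, f=1` algorithm. This is a sketch: the names `run`, `majority`, and `DEFAULT` are hypothetical, the default order is chosen arbitrarily, and the two rounds are collapsed into direct function calls.

```python
from collections import Counter

DEFAULT = "retreat"  # arbitrary default used when there is no strict majority


def majority(votes):
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) / 2 else DEFAULT


def run(commander_orders, relay):
    """commander_orders: dict lieutenant -> order received in round 1.
    relay(sender, receiver, order): what sender claims in round 2."""
    decisions = {}
    for me in commander_orders:
        votes = [commander_orders[me]]
        votes += [relay(other, me, commander_orders[other])
                  for other in commander_orders if other != me]
        decisions[me] = majority(votes)
    return decisions


def honest(sender, receiver, order):
    return order


def l3_lies(sender, receiver, order):
    return "retreat" if sender == "L3" else order


# Traitor Commander: sends different orders, all Lieutenants relay honestly.
split = run({"L1": "attack", "L2": "retreat", "L3": "attack"}, honest)

# Traitor Lieutenant L3: the Commander is honest, L3 relays the wrong order.
lied = run({"L1": "attack", "L2": "attack", "L3": "attack"}, l3_lies)

print(split)
print(lied["L1"], lied["L2"])
```

In the first run all Lieutenants see the same three votes and agree. In the second run the two honest Lieutenants outvote the lie and follow the Commander.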

---

## Fixing the Async Problem
In a purely asynchronous system, the famous **FLP Impossibility Result** proves that there is no deterministic algorithm that can solve consensus while tolerating even a single crash failure. The core issue is the inability to distinguish a crashed node from a very slow one.

So how do we build real systems?
* **Use randomness**: If we allow algorithms to use random numbers, we can design protocols that terminate with probability 1. Any single run may take arbitrarily many rounds, but the probability that it is still undecided shrinks towards zero as the rounds go on.

---

## Paxos

### What is it?
**Paxos** is a family of protocols for solving consensus in an asynchronous network where processors may fail by crashing (it does not handle Byzantine failures). It was created by Leslie Lamport.

* **Key features**:
    * It does **not** rely on a fixed coordinator/leader.
    * It works in an **asynchronous** system.
    * It is resilient to up to `(n-1)/2` crash failures.
    * It prioritizes **safety (Agreement)** over **liveness (Termination)**. This means it will never allow two nodes to decide differently, but it's not guaranteed to make progress and decide at all.



### The Paxos Nodes
Paxos operates by electing a temporary "leader" (called a **Proposer**) for a specific decision. Any node can try to become a proposer, and **Acceptors** vote on the proposals.
* A node can become a **Proposer** at any time.
* All nodes are **Acceptors**.
* Nodes that learn the final outcome are **Learners**. In practice, all nodes often play all three roles.

### Steps
Paxos works in two phases to decide on a single value.

**Phase 1: Prepare/Promise (Electing a Leader)**
1.  A **Proposer** decides it wants to lead. It picks a proposal number `n` that is unique and higher than any number it has used before. It sends a `Prepare(n)` message to a majority of Acceptors.
2.  An **Acceptor** receives `Prepare(n)`.
    * If `n` is higher than any proposal number it has promised to listen to before, it responds with a `Promise(n)` message. This is a promise to not accept any proposals with a number less than `n`.
    * **Crucially**: If the Acceptor has *already accepted* a value `val_prev` from a previous proposal `n_prev`, it must include `(n_prev, val_prev)` in its `Promise` response.
    * If `n` is not the highest it has seen, it ignores the message.

**Phase 2: Accept/Accepted (Deciding on a Value)**
3.  The **Proposer** waits for `Promise` responses. If it receives them from a **majority** of Acceptors, it is now the leader for proposal `n`. It then chooses a value `val` to propose.
    * **The Rule**: If any of the `Promise` responses it received contained a previously accepted value, the Proposer **must** choose the value `val_prev` associated with the highest proposal number `n_prev` it saw. Otherwise, it is free to propose its own initial value.
    * It then sends an `Accept(n, val)` message to a majority of Acceptors.
4.  An **Acceptor** receives `Accept(n, val)`.
    * If it has not made a newer promise (to a proposal number higher than `n`), it accepts the value and sends an `Accepted(n, val)` message to all nodes (who act as Learners).
    * Once a Learner sees `Accepted` messages from a majority of nodes for the same value, that value is **decided**.
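The acceptor rules and the proposer's value rule can be sketched as follows. This is an illustration for a single decision only: the names are hypothetical, messages are plain method calls, and learners, retries, and persistence are omitted.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (n_prev, val_prev) of the last accepted proposal

    def on_prepare(self, n):
        """Phase 1: promise if n is the highest number seen, else ignore."""
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return None

    def on_accept(self, n, val):
        """Phase 2: accept unless a newer promise has been made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, val)
            return ("accepted", n, val)
        return None


def choose_value(promises, own_value):
    """The Rule: adopt the accepted value with the highest proposal number."""
    prior = [p[2] for p in promises if p is not None and p[2] is not None]
    if prior:
        return max(prior, key=lambda acc: acc[0])[1]
    return own_value


acceptors = [Acceptor() for _ in range(3)]

# Proposer P1 uses number 1 and reaches a majority in phase 1.
promises1 = [a.on_prepare(1) for a in acceptors[:2]]
v1 = choose_value(promises1, "X")
acceptors[0].on_accept(1, v1)  # only one Accept gets through before P1 stalls

# Proposer P2 uses number 2 and wants "Y", but must adopt the earlier value.
promises2 = [a.on_prepare(2) for a in acceptors[:2]]
v2 = choose_value(promises2, "Y")
late = acceptors[1].on_accept(1, v1)  # P1's delayed Accept is now ignored
print(v1, v2, late)
```

P2 proposes `"X"` rather than its own `"Y"`, because one of the promises reported that `"X"` had already been accepted.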

---

## The proof of paxos

We will not go through the formal proof, as it involves a very large and detailed analysis of all possible message orderings and failure scenarios. The key safety property relies on the rule in Step 3: forcing a new leader to continue with a value that might have already been decided ensures that once a value is chosen, it can never be changed.

However, Paxos can fail to terminate. This does not violate its safety guarantee, but it does violate the **Termination** requirement for consensus.

* **Practical Example of Non-Termination (Dueling Proposers)**:
    1.  Proposer P1 sends `Prepare(n=10)` and gets promises from a majority of Acceptors (A, B, C). P1 is now leader.
    2.  Before P1 can send its `Accept` message, another Proposer P2 wakes up, chooses a higher number, and sends `Prepare(n=11)` to the same Acceptors.
    3.  Acceptors A, B, and C see this higher proposal number. They respond to P2 with `Promise(n=11)`, and will now ignore any messages related to `n=10`.
    4.  P1 finally sends its `Accept(n=10, value="X")` message, but it is ignored by the majority because they've promised to listen to `n=11`. P1's proposal fails.
    5.  Now P2 has a majority of promises and is the leader. But before it can send its `Accept` message, P1 realizes it failed, chooses an even higher number, and sends `Prepare(n=12)`.
    6.  This cycle can repeat indefinitely, with the two proposers constantly preempting each other, and no value is ever decided. This is a "livelock" situation. In practice, randomized timeouts are used to make this scenario highly unlikely.

---

## Resources and alternatives
* **Google TechTalk on Paxos**: A good video resource for understanding the algorithm in more detail.
* **Raft Algorithm Illustration**: [https://raft.github.io/](https://raft.github.io/) provides an excellent interactive visualization of Raft, an alternative to Paxos designed for understandability.

---

## Heartbeat for Synchronized Systems
A **heartbeat** is a common mechanism for failure detection in systems that are not fully asynchronous.
* **How it works**:
    1.  You guess a reasonable upper bound for message delay, `D`.
    2.  Each process sends a "beat" message to others every `T` seconds.
    3.  If a process hasn't received a beat from another process in the last `T + D` seconds, it **suspects** that the process has crashed.

* **The Trade-off**:
    * If `D` is **too small**, you get **inaccurate** detections. A slow but perfectly alive process might be declared dead.
    * If `D` is **too large**, your detection is **slow**. A dead process might be considered alive for a long time (a "zombie").

This shows we can only ever **suspect** a crash in a distributed system; we can never be 100% certain.
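A minimal sketch of the `T + D` rule. The class name `HeartbeatDetector` is hypothetical, and time is passed in explicitly so that the example is deterministic; a real detector would read a clock.

```python
class HeartbeatDetector:
    def __init__(self, period_t, delay_d):
        self.timeout = period_t + delay_d
        self.last_beat = {}  # process id -> time of the last beat received

    def beat(self, pid, now):
        self.last_beat[pid] = now

    def suspects(self, now):
        """Processes whose last beat is older than T + D."""
        return {p for p, t in self.last_beat.items() if now - t > self.timeout}


d = HeartbeatDetector(period_t=1.0, delay_d=0.5)
d.beat("a", now=0.0)
d.beat("b", now=0.0)
d.beat("a", now=1.0)

first = d.suspects(now=2.0)   # "b" has been silent for 2.0 > 1.5
d.beat("b", now=2.1)          # "b" was only slow
second = d.suspects(now=2.2)
print(first, second)
```

Process `b` is suspected and then cleared again once its late beat arrives, which is why the result of the detector is only ever a suspicion.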

---

# LECTURE 23/09/2025

## Mutual Exclusion

What is mutual exclusion (mutex)?
A mutual exclusion algorithm ensures that at most one process can access a shared resource at any given time.

### Examples
* Printing
* Using a coffee machine
* Writing a file
* Changing the state of an actuator
    * Arm of a robot
* Wireless communication
* Wired communication

---

## System model
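What is a (computer science) process?
A process $p = (S, s_\iota, M, \rightarrow)$ in a set of processes $P$ (so $p \in P$) has
* a set of states $S$,
* an initial state $s_\iota \in S$,
* a set of messages $M$
    * including the empty message $\epsilon \in M$,
* and a transition function $\rightarrow \;\subseteq\; S \times M \mapsto S \times 2^{P \times M}$, which takes the current state and a received message and yields a new state together with a set of (recipient, message) pairs to send.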

---

## Type of communication
* We can have mutex algorithms in both the message-passing and the shared-memory paradigms.
* Message passing: processes do not share variables, but may access resources/states that are considered common.
* Shared memory: more literal, since processes can edit the same data.
* We will start with message passing.

---

## Assumptions

* Processes have crash failures
    * Stay dead
* Direct communication
    * Transparent routing
    * No forwarding
* Reliable communication
    * Synchronous
        * Delivery within a fixed timeframe
    * Asynchronous
        * Delivery at some point
        * Underlying protocol handles re-transmission etc.
    * Partitions are fixed eventually

---

## Requirements (Mutex Algorithms)

1. Safety
    * At most one process is given access at a time.
2. Liveness
    * Requests for access are (eventually) granted.
3. Ordering/Fairness
    * If request A happened-before request B, then A is granted before B.
    * If two processes request periodically, grant the requests in some "fair" manner.

---

## Properties (Mutex Algorithms)
* Fault tolerance
    * What happens on crashes?
* Performance
    * Message complexity
    * Client delay
        * Time from a request R to a grant of R
    * Synchronization delay
        * Time from a release of R to a grant of the next request Q
        * Related to the throughput (the rate at which processes can access the critical section)
    * Bandwidth: proportional to the number of messages sent in each entry and exit operation

---

## Centralized Algorithm - With a Token
The simplest possible case:
* Assume one external coordinator (a leader).
* The coordinator/leader/server keeps an ordered queue of pending requests.
* The reply constitutes a token signifying permission to enter the critical section.

A process:
* Asks the coordinator for access
* Waits
* Gets the token, computes, returns the token

The coordinator:
* Gets a request and checks whether anyone is using the resource (where is the token?).
* If the resource is free: replies by giving the token.
* If not: does not reply, and queues the request.
* When the token is returned, the coordinator grants it to the next request in the queue.
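The coordinator's behaviour fits in a few lines. This is a sketch with hypothetical names (`Coordinator`, `request`, `release`); messages are modelled as method calls and failures are ignored.

```python
from collections import deque


class Coordinator:
    def __init__(self):
        self.holder = None    # process currently holding the token
        self.queue = deque()  # pending requests in arrival order

    def request(self, pid):
        """Return True if the token is granted immediately, else queue."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """Take the token back and grant it to the next waiting process."""
        if pid != self.holder:
            raise ValueError("only the token holder can release")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


c = Coordinator()
print(c.request("p1"))  # granted at once
print(c.request("p2"))  # queued
print(c.request("p3"))  # queued
print(c.release("p1"))  # token moves to p2
print(c.release("p2"))  # token moves to p3
```

Requests are served in the order in which they reach the coordinator, which need not match the happened-before order of the requests themselves.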

### Properties (Centralized Algorithm)

Requirements
* Safe: Yes
* Liveness: Yes
* Ordering: No

Properties
* Client Delay
    * Entry: 2 (Request + Grant)
    * Exit: 1
* Synchronization Delay
    * 2 (Release + Grant)
* Bandwidth: 3 messages to enter and leave a critical region: a request, a grant to enter, and a release to exit

Fault Tolerance

---

## Token Ring Algorithm (no leader)
Idea
* Send the token around in a ring
* Assumes an ordering of processes
* Forward the token to the "next" process if not using the mutex
* Enter the mutex if the token is acquired

### Properties
Requirements
* Safe: Yes
* Liveness: Yes
* Ordering: No (order by ring)

Properties
* Client Delay
    * Entry: n/2 avg, n − 1 worst case
    * Exit: 1
* Synchronization Delay
    * n/2 avg, n − 1 worst case
* Bandwidth:

Fault Tolerance
* Deadlock if any process fails
* Can be recovered if the crash can be detected reliably

---

## Ricart and Agrawala’s Algorithm
Idea
* Order events!
* Extension of the shared priority queue (Lamport '78)
* Basic algorithm
    * Request access from all other processes
    * Execute the CS when a "reply OK" permission has been received from all other processes.

Secret ingredient: Lamport clocks!

### Lamport Clocks (reminder)
* Count the number of messages/events
* Annotate messages with the clock
* Increment the local clock before sending
* "Correct" the local clock on receive, then increment when sending
    * `max(A, B) + 1`

### Algorithm
A process:
* Has its Lamport clock + links to all other nodes.
* Updates its local Lamport clock on its events.
* Timestamps messages with Lamport clock + ID.
* Has three states: FREE, WANT, USE

### On a request:
* If all other processes are FREE, they all reply immediately, and the requester obtains entry (becomes USE).
* If a process is in USE, it defers its reply until it leaves the critical section.
* On more than one request: respond in order of the request timestamps, breaking ties with the process ID.
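The reply rule can be sketched per process. This is a simplified illustration: the class `Process` and its methods are hypothetical, message transport is replaced by direct calls, and the counting of replies before entering USE is done by hand in the example.

```python
FREE, WANT, USE = "FREE", "WANT", "USE"


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = FREE
        self.clock = 0
        self.my_request = None  # (timestamp, pid) of the pending request
        self.deferred = []      # senders whose reply is postponed

    def request_cs(self):
        """Timestamp a new request; it is then multicast to all others."""
        self.clock += 1
        self.state = WANT
        self.my_request = (self.clock, self.pid)
        return self.my_request

    def on_request(self, ts, sender):
        """Return True to reply now, False if the reply is deferred."""
        self.clock = max(self.clock, ts) + 1
        mine_first = self.state == WANT and self.my_request < (ts, sender)
        if self.state == USE or mine_first:
            self.deferred.append(sender)
            return False
        return True

    def release_cs(self):
        """Leave the critical section and return the replies now due."""
        self.state = FREE
        self.my_request = None
        due, self.deferred = self.deferred, []
        return due


p1, p2 = Process(1), Process(2)
r1, r2 = p1.request_cs(), p2.request_cs()  # concurrent requests, both timestamp 1

p2_replies = p2.on_request(*r1)  # (1, 1) beats (1, 2), so p2 replies
p1_replies = p1.on_request(*r2)  # p1 has priority and defers its reply
p1.state = USE                   # p1 has all replies and enters
due = p1.release_cs()            # on exit, p1 answers the deferred request
print(p2_replies, p1_replies, due)
```

The tie between the two equal timestamps is broken by the process ID, so exactly one of the two concurrent requesters proceeds first.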

### Properties (Ricart and Agrawala)
Requirements
* Safe: Yes
* Liveness: Yes
* Ordering: Yes

Properties
* Client Delay
    * Entry: 1 (multicast) + 1
    * Exit: 1 (multicast)
* Synchronization Delay
    * 1
* Bandwidth:
    * Entry: (n − 1) + (n − 1) = 2(n − 1) messages
    * Exit: up to n − 1, included in Entry

Fault Tolerance
* Deadlock if any process fails
---



