## **COURSE OVERVIEW**

## **INTRO**
### 1. Introduction
## **Theory & Algorithms**
### 1. Models of DS
### 2. Time in DS
### 3. Multicast
### 4. Consensus
### 5. Distributed Mutual Exclusion
## **Application**
### 7. Distributed Storage
### 8. Distributed Computing
### 9. Blockchains
### 10. Peer-to-Peer Networking
### 11. Internet of Things and Routing
### 12. Distributed Algorithms

# Lecture 08/09/2025

## What is a distributed system?
A distributed system is one in which hardware and software components located on networked computers communicate and coordinate their actions only by passing messages.

## CONCURRENCY

Concurrency is the ability of a system to execute multiple tasks or processes simultaneously or at overlapping times, improving efficiency.
However, it can cause problems such as:
- Deadlocks and livelocks: conditions in which processes make no progress. In a deadlock they block forever waiting on each other; in a livelock they keep reacting to each other without doing useful work.
- Non-determinism: occurs when the output or behavior of a concurrent program differs for the same input, depending on the precise, unpredictable timing of events, such as thread interleaving or resource access.

Other issues arise from the absence of shared state:
- Processes must pass messages to synchronize: for example, if two people share a resource and both try to access it at the same time, one of them may get an error unless they coordinate first.
- Processes may not agree on time: each node has its own clock, so two nodes can disagree about the current time and about the order of events.

Everything can fail in distributed systems:
- Devices
- Integrity of data
- Network
    - Security
    - Man-in-the-Middle (MITM): an attack in which an attacker intercepts and potentially alters communication between two legitimate parties without their knowledge.
    - Byzantine failure: a condition of a system, particularly a distributed computing system, in which a fault presents different symptoms to different observers, so that they have imperfect information on whether a component has failed.

Distributed systems are built for three main reasons: domain, redundancy, and performance.

---
## Domain
A domain is a specific area of knowledge or activity that a distributed system is designed to address.
Some examples of domains are:
- The internet
- Wireless Mesh Networks
- Industrial systems
- Ledgers (Bitcoin, Ethereum)

However, these domains can run into limits, which can be physical or logical:
- Physical limits: constraints imposed by the physical properties of the system's components or environment, such as hardware limitations, network bandwidth, latency, and geographical distribution.
- Logical limits: the logical limit of a domain in a distributed system is defined by its bounded context. This is a core concept from Domain-Driven Design (DDD), a software development approach that focuses on aligning software design with the business domain. A bounded context is an explicit boundary within a distributed system where a specific domain model and its language (ubiquitous language) are consistent and applicable.

---
## Redundancy
A system with redundancy has duplicate components, processes, or data.
As a result, the system is:
- Robust: more resilient to failure
- Available: as in system availability which is measured in uptime.

A system with redundancy can:
- offer 99.9% uptime ("three nines"): the system is designed to be operational and available 99.9% of the time, leaving only a small budget for planned or unplanned downtime. ("Five nines" means 99.999%.)
- act as a backup: in an active-passive configuration, the redundant server is a hot or cold standby. The primary server handles the whole workload while the backup waits. Because information and components are duplicated, the backup automatically takes over if the primary fails.
- act as a replicated database: in an active-active configuration, all redundant servers are active and work simultaneously rather than waiting as backups. A load balancer distributes incoming requests among them.
- serve the banking sector: any downtime there can lead to significant financial losses.
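
To make the "nines" concrete, the following sketch converts an availability percentage into a yearly downtime budget. The function name `allowed_downtime_seconds` and the 365-day year are assumptions made for illustration only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget (in seconds) for a given availability."""
    return period_seconds * (1 - availability_percent / 100)


# Compare a few common availability targets.
for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]:
    hours = allowed_downtime_seconds(pct) / 3600
    print(f"{label} ({pct}%): about {hours:.2f} hours of downtime per year")
```

Three nines leave roughly 8.76 hours of downtime per year, while five nines leave only a little over 5 minutes.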

---
## Performance
A performant system has to balance:
- Economics
- Scalability

Topics that come up here include:
- Video streaming: requires a lot of processing power.
- Cloud computing: offers on-demand, scalable resources, eliminating the need for upfront hardware investment.
- Supercomputers: excel at massive, specialized calculations but are very expensive and not scalable for general-purpose use.
- Many inexpensive vs. few expensive specialized machines: distributing workloads across many inexpensive machines is often more economical and scalable than using a few expensive, specialized ones.

---

# Lecture 09/09/2025

# Models of distributed systems

## Aspects of models

Why do we build distributed systems?

- **Inherent distribution**: By definition, distributed systems span multiple computers, often connected through networks such as telecommunications systems.
- **Reliability**: Even if one node fails, the system as a whole can continue functioning, avoiding single points of failure.
- **Performance**: Workloads can be shared among multiple machines, and data can be accessed from geographically closer nodes to reduce latency.
- **Scalability for large problems**: Some datasets and computations are simply too large to fit into a single machine, requiring distributed processing.

### Modelling the process – API Style

A distributed system can be described in terms of modules that exchange **events** through well-defined interfaces:

- **Event representation**:  
  \{Event\_type | Attributes, …\}  
  or  
  \{Process\_ID | Event\_type | Attributes, …\}

- **Module behavior**:  
  Each module reacts to incoming events and produces outputs according to specified rules:  
  `upon event {condition | Event | attributes} such that condition holds do`  
  `perform some action`


Multiple modules together (one per process or subsystem) should collectively satisfy desired **global properties** (e.g., safety, liveness).
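
The `upon event ... do` style can be imitated in ordinary code. The following is a minimal sketch, not the course's notation: the `Module` class, its `upon` and `trigger` methods, and the `Print` event are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable


class Module:
    """Hypothetical event-driven module: actions fire when an event arrives
    and the associated condition holds."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)  # event type -> [(condition, action)]

    def upon(self, event_type: str,
             condition: Callable[..., bool] = lambda **attrs: True):
        """Register an action for `event_type`, guarded by `condition`."""
        def register(action):
            self._handlers[event_type].append((condition, action))
            return action
        return register

    def trigger(self, event_type: str, **attrs: Any) -> None:
        """Deliver an event: run every registered action whose condition holds."""
        for condition, action in self._handlers[event_type]:
            if condition(**attrs):
                action(**attrs)


log = []
printer = Module()


@printer.upon("Print", condition=lambda text: len(text) > 0)
def on_print(text):
    log.append(f"printing: {text}")


printer.trigger("Print", text="hello")  # condition holds, action runs
printer.trigger("Print", text="")       # condition fails, event is ignored
print(log)
```

Each handler corresponds to one `upon event` rule; composing several such modules is what the global properties are then stated over.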

### What we want/will make

We aim to:
- Design APIs for modules and prove that their composition satisfies global system properties.
- Implement modules that guarantee **local properties**.
- Use pseudocode and mathematics to formally demonstrate when such guarantees are possible—or prove impossibility.

---

## Failures

Failures are inevitable in distributed systems. They can arise due to hardware breakdowns, software bugs, network disruptions, or even human mistakes. Designing robust systems requires understanding different types of failures and strategies to mitigate them.

### Types of failures

1. **Crash-stop**: A process halts and all other processes can reliably detect the failure. *Easiest to handle.*
2. **Crash-silent**: A process halts but failures cannot be detected reliably.
3. **Crash-noisy**: Failures may be detected, but only with eventual accuracy (false positives or delays are possible).
4. **Crash-recovery**: Processes may fail and later recover, rejoining the system. Requires care to avoid state inconsistencies.
5. **Crash-arbitrary (Byzantine failures)**: Processes behave arbitrarily or maliciously, deviating from the protocol. *Hardest to handle.*
6. **Randomized behavior**: Processes make decisions probabilistically. Correctness is argued via probability theory rather than strict guarantees.

---

## Communication

Is communication always required? In distributed systems, yes—but it can be realized in different ways:

- **Message passing**:
  1. Types of links and their potential failures.
  2. Network topology (commonly assumed fully connected).
  3. Routing algorithms for multi-hop communication.
  4. Broadcast and multicast primitives.

- **Shared memory**:
  1. Which process can read or write to which location?
  2. How do we guarantee reading the *freshest* value? (Consistency models)

### On types of links

A **link** is a module implementing send/receive operations with certain properties.

- **TCP/IP**: Enables reliable communication between a pair of nodes (or, if the connection breaks, no communication at all).
- **SSH**: Adds protection against corruption, interception, and tampering.

**Network reliability models**:
1. **Perfect links**: Reliable delivery, no duplication, no spurious messages.
2. **Fair-loss links**: Messages may be lost occasionally, but infinitely many attempts guarantee eventual delivery; finite duplication possible.
3. **Stubborn links**: Built on top of fair-loss links. Each message is retransmitted indefinitely, so it is eventually delivered (possibly many times), and still no messages are created out of thin air.
4. **Logged-perfect links**: Perfect delivery with persistent logs for auditing/recovery.
5. **Authenticated links**: Reliable delivery, no duplication, and sender authenticity.
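
The way these models build on one another can be sketched in a small simulation. This is an illustration under stated assumptions, not a production protocol: the class names are hypothetical, the loss probability and seed are arbitrary, and the retry count is capped (a true stubborn link retransmits forever).

```python
import random


class FairLossLink:
    """Drops each message independently with probability `loss`;
    never creates messages on its own."""

    def __init__(self, deliver, loss=0.5, seed=0):
        self.deliver = deliver
        self.loss = loss
        self.rng = random.Random(seed)  # fixed seed so the demo is repeatable

    def send(self, msg):
        if self.rng.random() >= self.loss:
            self.deliver(msg)


class StubbornLink:
    """Retransmits every message over a fair-loss link.
    Assumption: retries are capped here so the demo terminates."""

    def __init__(self, fair_loss, retries=50):
        self.fair_loss = fair_loss
        self.retries = retries

    def send(self, msg):
        for _ in range(self.retries):
            self.fair_loss.send(msg)


class PerfectLinkReceiver:
    """Filters duplicates so each message is delivered at most once."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def on_receive(self, msg):
        msg_id, payload = msg
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.delivered.append(payload)


receiver = PerfectLinkReceiver()
link = StubbornLink(FairLossLink(receiver.on_receive))
for i, payload in enumerate(["a", "b", "c"]):
    link.send((i, payload))
print(receiver.delivered)
```

Retransmission turns "may be lost" into "eventually delivered", and duplicate filtering at the receiver turns "possibly many times" into "at most once".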

### Can networks fail?

While TCP/IP and lower-level protocols often give us the illusion of **perfect links** and **fail-stop crashes**, failures still happen.

- **Network partitions**: Occur when many links fail simultaneously, dividing the system into disconnected components. This is rare but catastrophic.

### Crashes vs Failures

Having discussed both **network** and **process** failures, it is important to distinguish between the two levels:

- A **process can crash** (e.g., by halting, becoming unresponsive, or otherwise deviating from its protocol).  
- A **system fails** when the combination of process crashes and communication assumptions no longer allows correct operation.

For the remainder of our discussion, we usually assume **perfect links** (thanks to TCP/IP and lower-level reliability mechanisms). This means that:
- Messages are delivered reliably,
- No duplicates are created,
- No spurious (phantom) messages appear.

Under this assumption, we can define **system failure models** in terms of process behavior:

- **Fail-stop system**: Processes may experience crash-stop failures, but links are perfect.  
- More complex models (e.g., crash-recovery, Byzantine failures) are defined similarly, always considering both the **process failure type** and the **communication assumptions**.

In short, a system failure model = (process failure model) + (assumed link properties).

---

## Timing

Timing plays a central role in distributed systems, especially when considering **synchronization** and **failure detection**.

- Systems may be **synchronous** (bounded delays) or **asynchronous** (no timing guarantees).
- Links are still modeled as modules with send/receive properties.

### Synchronous vs. Asynchronous Systems

Distributed systems can be broadly classified according to their **timing assumptions**:

1. **Asynchronous systems**:
   - No bounds on message transmission delays.
   - No assumptions about process execution speeds (relative speeds may differ arbitrarily).
   - Failure detection is unreliable, since a slow process cannot be distinguished from a failed one.
   - Coordination and ordering rely on **logical clocks** (e.g., Lamport clocks, vector clocks), rather than real time.

2. **Synchronous systems**:
   - Bounds exist on message transmission delays and process execution speeds.
   - **Timed failure detection** is possible: if a message or heartbeat is not received within a known bound, a failure can be suspected reliably.
   - Transit delays can be measured and incorporated into algorithms.
   - Coordination can be based on **real-time clocks** rather than purely logical clocks.
   - Performance is often analyzed in terms of **worst-case bounds**, since timing assumptions provide guarantees.
   - Processes may maintain **synchronized clocks** (to some degree of precision), enabling algorithms such as consensus and coordinated actions.

**Key question**: *Can processes in an asynchronous system with fair-loss links reach agreement (e.g., on coordinated attack time)?*

### Proof via contradiction (Two Generals Problem)

1. Assume a protocol exists in which a fixed, finite sequence of messages guarantees agreement, and pick the one that uses the fewest messages.
2. Consider the last message in this sequence.
3. The sender of that message cannot distinguish whether it was delivered or lost, so it must behave deterministically and decide the same action in both cases.
4. Since the protocol guarantees agreement, the receiver must also reach that same decision whether or not the last message arrives.
5. The last message is therefore unnecessary, so a protocol with one message fewer would also work. This contradicts the choice of the shortest protocol.  
 $\Rightarrow$ Perfect agreement is impossible under these assumptions.

### Which crash/link/timing assumptions implement distributed systems?

A **failure detector** can be modeled as just another module that provides (possibly imperfect) information about which processes are alive. Different combinations of timing assumptions and failure detectors allow different guarantees in distributed systems.  

### Example

![image](../images/Screenshot%202025-09-09%20at%2009.56.39.png)

#### Explanation:

This algorithm describes a **Perfect Failure Detector** for distributed systems using a heartbeat mechanism.

In short, here's what it does:

1.  **Sends Heartbeats:** Periodically, on a **timeout**, every process sends a `HEARTBEATREQUEST` message to all other processes in the system.
2.  **Waits for Replies:** It assumes no one is alive and waits for `HEARTBEATREPLY` messages. When a process receives a reply, it marks the sender as `alive`.
3.  **Detects Failures:** At the next timeout, any process that has not sent a reply is considered to have **crashed**. The algorithm then triggers a `Crash` event for that process.

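Because it assumes **perfect communication links** (messages are never lost) and a **synchronous system** (the timeout is longer than the worst-case round trip of a request and its reply), this method guarantees that a non-responsive process has truly failed, making the failure detection "perfect."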
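
The three steps above can be sketched as a round-based simulation. This follows the prose explanation, not the pictured pseudocode line by line; `PerfectFailureDetector`, `send_request`, and `crashed_for_real` are hypothetical names, and replies are assumed to arrive instantly, which stands in for the synchrony assumption.

```python
class PerfectFailureDetector:
    """Round-based sketch of a heartbeat failure detector."""

    def __init__(self, processes, on_crash):
        self.processes = set(processes)
        self.alive = set(processes)  # everyone is presumed alive at the start
        self.detected = set()        # processes already reported as crashed
        self.on_crash = on_crash

    def on_heartbeat_reply(self, sender):
        self.alive.add(sender)

    def on_timeout(self, send_request):
        # Anyone who has not replied since the previous timeout has crashed.
        for p in sorted(self.processes):
            if p not in self.alive and p not in self.detected:
                self.detected.add(p)
                self.on_crash(p)
        # Assume nobody is alive, then ask everyone again.
        self.alive = set()
        for p in sorted(self.processes):
            send_request(p)


crashed_for_real = set()
reported = []
detector = PerfectFailureDetector(["p1", "p2", "p3"], reported.append)


def send_request(p):
    # Simulated perfect link and synchrony: a live process always replies in time.
    if p not in crashed_for_real:
        detector.on_heartbeat_reply(p)


detector.on_timeout(send_request)  # round 1: everyone replies
crashed_for_real.add("p2")         # p2 crashes after replying
detector.on_timeout(send_request)  # round 2: p2 replied last round, not yet suspected
detector.on_timeout(send_request)  # round 3: p2 missed a full round, reported
detector.on_timeout(send_request)  # round 4: p2 is not reported a second time
print(reported)
```

Detection takes up to two timeout periods after the crash, which is the price of never raising a false alarm.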

### Network latency and bandwidth

When discussing communication performance, two key metrics matter:

- **Latency**: The time it takes for a single message (or bit) to travel from sender to receiver.  
- **Bandwidth**: The rate at which data can be transmitted, usually measured in bits per second (bps) or bytes per second (B/s).

#### Physical Link
Sometimes, surprisingly “low-tech” physical methods can provide high bandwidth, even if latency is poor:
- **Hard drives in a van**  
- Messengers carrying storage devices  
- Smoke signals (extreme latency, minimal bandwidth)  
- Radio signals or laser communication

#### Network Links
More conventional digital communication technologies include:
- DSL (Digital Subscriber Line)  
- Cellular data (e.g., 3G, 4G, 5G)  
- Wi-Fi (various standards)  
- Ethernet/fiber cables  
- Satellite links  

#### Latency examples
1. Hard drives transported by van: $\approx$ 1 day latency  
2. Intercontinental fiber-optic cable: $\approx$ 100 ms latency  

#### Bandwidth examples
1. Hard drives in a van: $\frac{50 \, \text{TB}}{1 \, \text{day}}$ = **very high bandwidth** despite huge latency  
2. 3G cellular network: $\approx 1 \, \text{Mbit/s}$ bandwidth  
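
To see why the van counts as "very high bandwidth", the figure can be converted into bits per second. This is a back-of-the-envelope sketch; it assumes decimal terabytes ($10^{12}$ bytes), and the function name is made up for illustration.

```python
def bandwidth_bits_per_second(num_bytes: float, seconds: float) -> float:
    """Average bandwidth when `num_bytes` arrive after `seconds`."""
    return num_bytes * 8 / seconds


TB = 10**12          # assumption: decimal terabytes
DAY = 24 * 60 * 60   # seconds in a day

van = bandwidth_bits_per_second(50 * TB, DAY)
print(f"Van: {van / 1e9:.2f} Gbit/s")
print(f"That is about {van / 1e6:.0f} times a 1 Mbit/s 3G link")
```

The van moves data at roughly 4.6 Gbit/s, several thousand times faster than the 3G link, even though the first byte only arrives after a day.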

---

## Performance

### Performance measures

- **SLI (Service Level Indicator)**: What aspect of the system do we measure?  
Examples: bandwidth, latency, fault tolerance, uptime, failure detection time.
- **SLO (Service Level Objective)**: What target values do we aim for?  
Example: latency < 200ms.
- **SLA (Service Level Agreement)**: An SLO backed with contractual consequences.  
Example: "99% uptime, otherwise partial refund."

Why should we study these?
- Measuring means we can improve
- Spend time improving when it is needed.
- Reliability is kind of the point with distributed systems.

### Reading SLAs

When evaluating claims like *“This solution offers 99% uptime”*, consider:

- **Sampling frequency**: How often is system availability checked?
- **Responsibility scope**: Does the SLA cover only server uptime, or also account for client/network failures?
- **Time interval**: Does 99% apply per day, per month, or per year?

---

# Lecture 15/09/2025

## The Challenge of Time

In distributed systems, we often contrast **synchronous** and **asynchronous** computation. A synchronous system has known, bounded delays for message delivery and process execution. An asynchronous system has no such guarantees. Most real-world systems are asynchronous, which makes coordination difficult. Without certain timing guarantees, some problems are impossible to solve deterministically, a classic example being the **Two Generals' Problem**, which illustrates the impossibility of reaching a consensus over an unreliable channel.

### Reasons for Asynchrony
Asynchrony isn't an abstract problem; it arises from concrete issues with the physical components of a system: the network and the nodes themselves.

#### Network unpredictability:
* **Physical failures:** Cables can be damaged (famously by sharks or cut by construction) requiring traffic to be rerouted. 🦈
* **Message loss:** Packets can be dropped, requiring retransmission protocols (like TCP) to resend data.
* **Congestion:** High traffic can lead to queues and variable delays (latency).
* **Re-configuration:** The network topology itself may change, causing temporary disruptions.

#### Node unpredictability:
* **OS scheduling:** The operating system's scheduler can preempt a process at any time to run another one.
* **Garbage collection (GC):** In managed languages (like Java or Go), a "stop-the-world" GC pause can halt an application for milliseconds or even seconds.
* **Hardware faults:** Nodes can crash, reboot, or suffer from other hardware-related issues.

But what if a system were "perfect"? Imagine no network loss and perfectly functioning nodes. Could asynchrony still occur? **Yes**. The non-deterministic nature of process scheduling is a fundamental source of asynchrony. A real-world example is the **2012 Knight Capital Group glitch**, where a software deployment error led to an algorithm running haywire. The system's components were working "correctly," but the timing and interaction between processes led to a catastrophic failure, costing the company $440 million in 45 minutes.

---

## How Do Distributed Systems Use Time?

Systems need to measure time for many fundamental operations. Think about how you would implement these on a single computer; in a distributed system, this becomes much harder.

1.  **Scheduling and Timeouts:** To run a task for a specific duration or to give up on an operation if a response isn't received within a certain window.
2.  **Failure Detection:** Using **heartbeats** (periodic "I'm alive" messages) to detect if a node has crashed. If a heartbeat isn't received within a timeout period, the node is presumed dead.
3.  **Event Timestamping:** Recording the time an event occurred, which is critical in databases for transaction ordering and data versioning (e.g., using Multi-Version Concurrency Control or MVCC).
4.  **Performance Measurement:** Logging and statistics gathering to measure latency, throughput, and other performance metrics.
5.  **Data Expiration:** Caching systems use Time-To-Live (TTL) values to expire old data. DNS records and security certificates also have expiration times.
6.  **Causal Ordering:** Most importantly, to determine the **order of events** across different nodes to maintain consistency and causality.

---

## Types of Clocks

In distributed systems, we primarily talk about two types of clocks. From a practical standpoint, a clock is simply something we can query to get a timestamp.

* **Physical Clocks:** These measure the passage of real-world time in units like seconds. They are based on physical phenomena, like the oscillation of a crystal.
* **Logical Clocks:** These don't track real time. Instead, they count events (e.g., the number of requests processed) to determine the logical order of operations.

---

## Physical Clocks: The Quartz Crystal

Most computers use quartz clocks. Here's how they work:

* A thin slice of quartz crystal is precisely cut to control its oscillation frequency when an electric voltage is applied (the **piezoelectric effect**).
* When you boot your computer, it queries a **Real-Time Clock (RTC)**—a small, battery-powered circuit on the motherboard—which has been continuously counting these oscillations.
* By counting the cycles, the computer can calculate the elapsed time.

However, these clocks aren't perfect:
* **Manufacturing variations:** No two crystals are identical.
* **Temperature sensitivity:** Frequency changes with temperature.
* This imperfection leads to **clock drift**. We measure this in **parts per million (ppm)**. A drift of 1 ppm means the clock is off by one microsecond per second, which adds up to about **32 seconds per year**. A typical computer clock might have a drift of around 50 ppm.
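
The drift figures follow directly from the definition of ppm. A small sketch of the arithmetic, where `drift_seconds` is a hypothetical helper name:

```python
def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    """Clock error accumulated over `elapsed_seconds` at a drift of `ppm`."""
    return elapsed_seconds * ppm / 1_000_000


DAY = 86_400
YEAR = 365 * DAY

print(drift_seconds(1, YEAR))   # error after one year at 1 ppm, in seconds
print(drift_seconds(50, DAY))   # error after one day at 50 ppm, in seconds
```

At 1 ppm the error is about 31.5 seconds per year; at 50 ppm a clock can already be off by more than 4 seconds after a single day, which is why periodic correction is needed.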

Better, but more expensive, alternatives include:
* **Atomic clocks:** Extremely precise but very expensive.
* **GPS:** Satellites contain atomic clocks. A GPS receiver can use signals from multiple satellites to calculate a very precise time.
* **Network Time Protocol (NTP):** Ask another, more accurate server for the time.

---

## Time Standards and Representations

To agree on time, we need standards.

* **Solar Time (UT1):** Based on the Earth's rotation. A day is the time between the sun reaching its highest point in the sky on two consecutive days. This is not perfectly stable.
* **International Atomic Time (TAI):** Based on the oscillations of a caesium-133 atom. One second is defined as exactly 9,192,631,770 oscillations. TAI is extremely stable.
* **Coordinated Universal Time (UTC):** The global standard we all use. It's a compromise: it ticks at the same rate as TAI but is kept within 0.9 seconds of Solar Time (UT1) by adding **leap seconds**.

### Leap Seconds
To keep UTC aligned with the Earth's wobbly rotation, a second is occasionally added (or, in principle, removed). This happens on June 30 or December 31.
* **Positive leap second:** The time `23:59:59` is followed by `23:59:60`, and then `00:00:00`.
* **Negative leap second:** `23:59:58` would be followed directly by `00:00:00`. (This has never happened).
Leap seconds are a notorious source of bugs in computer systems.

### Common Representations
* **Unix time:** The number of seconds that have elapsed since `00:00:00 UTC` on 1 January 1970 (the "epoch"). Importantly, it **ignores leap seconds**; a day with a leap second is still counted as having 86,400 seconds.
* **ISO 8601:** A standard format for representing dates and times, e.g., `2025-09-15T14:30:00Z` (where `Z` indicates UTC).
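
Both representations, and the way Unix time skips leap seconds, can be checked with Python's standard `datetime` module. The helper `to_unix` is a name introduced here for illustration; the second part uses the leap second that was inserted at the end of 31 December 2016.

```python
from datetime import datetime, timezone


def to_unix(dt: datetime) -> int:
    """Seconds since the Unix epoch for a timezone-aware datetime."""
    return int(dt.timestamp())


t = datetime(2025, 9, 15, 14, 30, 0, tzinfo=timezone.utc)
print(to_unix(t))      # Unix time
print(t.isoformat())   # ISO 8601, with +00:00 instead of Z

# A leap second (23:59:60) was inserted between these two instants,
# so two real seconds passed, but Unix time only advances by one.
before = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(to_unix(after) - to_unix(before))
```

This is why subtracting two Unix timestamps that straddle a leap second underestimates the real elapsed time by one second.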

---

## Network Time Protocol (NTP)

Since computer clocks drift, they need to be periodically corrected. NTP is the most common protocol for this. A client synchronizes its clock with a more accurate time server.

The main protocols are **NTP** and the more precise **PTP** (Precision Time Protocol).
On Ubuntu/Linux, you can check the time synchronization service with: `systemctl status systemd-timesyncd`.

### NTP Synchronization Logic
Let's analyze the message exchange between a client and a server.

```
--------t1-------------t4------------> NTP CLIENT
          \           /
           \         /
------------t2-----t3----------------> NTP SERVER

```

* $T_1$: Client sends a request.
* $T_2$: Server receives the request.
* $T_3$: Server sends a response.
* $T_4$: Client receives the response.

The client can now calculate two important values:
1.  **Round-trip delay ($\delta$):** This is the total time the messages spent on the network, excluding the server's processing time.
    $$\delta = (T_4 - T_1) - (T_3 - T_2)$$
2.  **Clock offset/skew ($\theta$):** This is the client's best guess of the difference between its clock and the server's clock. Assuming the network delay is symmetric (i.e., the trip to the server takes as long as the trip back), the client calculates its offset as the difference between its local time ($T_4$) and what it thinks the server's time should be ($T_3$ plus half the round-trip delay).
    $$\theta = (T_3 + \frac{\delta}{2}) - T_4$$

Based on the calculated offset $\theta$, the client's clock is adjusted:
* If $|\theta| < 125\,\text{ms}$: **Slew** the clock. The clock is gradually sped up or slowed down until it's correct. This avoids sudden time jumps.
* If $125\,\text{ms} \le |\theta| < 1000\,\text{s}$: **Jump** the clock. The time is set immediately. This can cause issues for applications sensitive to time reversals.
* If $|\theta| \ge 1000\,\text{s}$: **Ignore**. The offset is too large and is likely an error, so the update is ignored.
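
The two formulas and the adjustment rule can be tried out on concrete numbers. The sketch below is not an NTP implementation; `ntp_estimate` and `adjustment` are hypothetical helpers, and the timestamps (in milliseconds) are made up: the client clock is 50 ms behind the server, each network trip takes 10 ms, and the server needs 1 ms to respond.

```python
def ntp_estimate(t1, t2, t3, t4):
    """Return (round-trip delay, estimated clock offset) from the four timestamps."""
    delta = (t4 - t1) - (t3 - t2)
    theta = t3 + delta / 2 - t4
    return delta, theta


def adjustment(theta_seconds):
    """Choose the correction strategy for an offset given in seconds."""
    magnitude = abs(theta_seconds)
    if magnitude < 0.125:
        return "slew"
    if magnitude < 1000:
        return "jump"
    return "ignore"


# t1 and t4 are read from the client clock, t2 and t3 from the server clock.
delta_ms, theta_ms = ntp_estimate(t1=1000, t2=1060, t3=1061, t4=1021)
print(delta_ms, theta_ms, adjustment(theta_ms / 1000))
```

The estimate recovers the 20 ms round trip and the 50 ms offset exactly here because the simulated delays are symmetric; with asymmetric delays the offset estimate would be off by up to half the round-trip delay.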

---

## Clock Types Revisited: Monotonic vs. Time-of-Day

This brings us to two important types of clocks available in most programming environments.

* **Time-of-day Clock:**
    * Measures time since a fixed point in the past (e.g., the Unix epoch).
    * **Not monotonic:** It can jump forwards or backwards due to NTP adjustments or leap seconds.
    * Useful for timestamping events that need to be compared across different nodes.

* **Monotonic Clock:**
    * Measures time since an arbitrary point in the past (e.g., system boot).
    * **Guaranteed to move forward** and is not affected by NTP jumps.
    * Perfect for measuring elapsed time (e.g., timeouts) on a single node. You cannot use its value to compare timestamps between different nodes.
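
In Python, the two clock types correspond to `time.time()` (time-of-day) and `time.monotonic()` from the standard library. A minimal sketch of measuring elapsed time safely; the helper `timed` is a name chosen for illustration.

```python
import time


def timed(fn):
    """Run `fn` and measure its duration with the monotonic clock."""
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start  # never negative, even if NTP steps the clock
    return result, elapsed


result, elapsed = timed(lambda: sum(range(100_000)))
print(f"result={result}, took {elapsed:.6f} s")

# The time-of-day clock is what you would attach to a log entry or a record.
print(f"wall-clock timestamp: {time.time():.3f}")
```

Computing a duration as the difference of two `time.time()` readings would usually work too, but it can yield a wrong or even negative value if the clock is stepped in between.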

Relying on physical clocks alone is insufficient for ordering events correctly in a distributed system due to clock skew and network latency.

---

## The Happens-Before Relation

To reason about causality without perfect physical clocks, we use a logical concept called the **happens-before** relation, denoted by $\rightarrow$. An event is an atomic operation on a single node.

We say event **a happens before event b** ($a \rightarrow b$) if one of the following is true:
1.  `a` and `b` happen on the same node, and `a` occurs before `b`.
2.  `a` is the sending of a message, and `b` is the receipt of that same message.
3.  There exists some event `c` such that $a \rightarrow c$ and $c \rightarrow b$ (transitivity).

This relation defines a **partial order**. It's possible that neither $a \rightarrow b$ nor $b \rightarrow a$. In this case, we say `a` and `b` are **concurrent**, written as $a \parallel b$. This means we cannot determine their causal order from the information we have.
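
The three rules translate directly into a graph-reachability check. The following sketch computes the relation for a tiny, made-up execution; the event names (`a1`, `b2`, ...) and the function `happens_before_pairs` are illustrative assumptions.

```python
def happens_before_pairs(node_events, messages):
    """node_events: {node: [events in local order]};
    messages: [(send_event, receive_event)].
    Returns the set of pairs (a, b) with a -> b."""
    successors = {e: set() for events in node_events.values() for e in events}

    # Rule 1: order of events on the same node.
    for events in node_events.values():
        for earlier, later in zip(events, events[1:]):
            successors[earlier].add(later)

    # Rule 2: a send happens before the matching receive.
    for send, receive in messages:
        successors[send].add(receive)

    # Rule 3: transitivity, computed as reachability in the graph.
    def reachable(start):
        seen, stack = set(), list(successors[start])
        while stack:
            event = stack.pop()
            if event not in seen:
                seen.add(event)
                stack.extend(successors[event])
        return seen

    return {(a, b) for a in successors for b in reachable(a)}


# Node A does a1 (sends a message) then a2; node B does b1, then b2 (receives it).
hb = happens_before_pairs(
    node_events={"A": ["a1", "a2"], "B": ["b1", "b2"]},
    messages=[("a1", "b2")],
)


def concurrent(x, y):
    return (x, y) not in hb and (y, x) not in hb


print(("a1", "b2") in hb)      # True: connected by the message
print(concurrent("a2", "b2"))  # True: no chain of messages links them
```

Logical clocks such as Lamport or vector clocks exist to capture this relation without building the whole graph.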

This notion of causality is inspired by physics:
* Information cannot travel faster than the speed of light. If two events in spacetime are too far apart to influence each other, they are not causally related.
* In distributed systems, we replace the speed of light with the speed of messages. If no chain of messages connects event `a` to event `b`, then `a` cannot have caused `b`.

---

## Safety and Liveness

When designing distributed algorithms, we want them to satisfy certain properties across all possible executions. These properties usually fall into two categories:

* **Safety:** *Nothing bad ever happens.*
    * A safety property, once violated, can never be undone. For example, "a database will never return incorrect data." If it does so even once, the property is broken forever.
* **Liveness:** *Something good eventually happens.*
    * A liveness property can always be satisfied in the future. For example, "every request will eventually receive a response." Even if a request is waiting, there's always the possibility it will be answered later.

### Formal Definitions
* **Safety:** A property is a safety property if for any execution where it is violated, there is a finite prefix of that execution after which the violation is guaranteed and unavoidable.
* **Liveness:** A property is a liveness property if for any finite (partial) execution, there is at least one possible continuation of that execution where the property is satisfied.

### Examples
Consider a "perfect link" communication channel:
* **Safety Property:** A process only receives messages that were actually sent. (Prevents the "bad thing" of phantom messages).
* **Liveness Property:** If a correct process sends a message to another correct process, the destination eventually receives it. (Ensures the "good thing" of message delivery eventually happens).

---


# Lecture 16/09/2025

## Multicast

A multicast is a one-to-many communication where a single process sends a message to a specific group, and all members of that group receive it.

### What is it?

**Examples:**
* **Systems needing redundancy:** Algorithms with failover or replication, such as in Databases, DNS, or Banks.
* **One-to-many streaming:** Live TV/Radio broadcasts.
* **Many-to-many collaboration:** Skype, Teams, TikTok, and Massively Multiplayer Online games (MMOs).

**Disclaimer for this lecture:**
* We assume groups are **closed and static** (no members joining or leaving).
* We will not be discussing multiple overlapping groups.
* We will **not assume any special hardware support** for multicast.
* **Good news:** All algorithms shown work in both synchronous and asynchronous networks.


### Requirements

**Assuming:**
* We have **reliable 1-to-1 communication** (like TCP) as a building block.
* The sending process might crash.
* There is no default message ordering.

**Guarantees we want:**
* If a message is sent, it is **delivered exactly once**.
* Messages are eventually delivered to all **non-crashed (correct) processes**.
* The system is **fault-tolerant**; if one node fails, the rest can continue.

### General broadcast structure

We introduce a "broadcast algorithm" layer that sits between the application and the network.

* **Node 1** doesn't send directly to the network; it tells the **broadcast algorithm** to broadcast a message.
* The **broadcast algorithm** handles the logic of sending, re-transmitting, and ordering messages over the network.
* The **broadcast algorithm** on the receiving end then **delivers** the message to Node 2.

---

## Problems - IP Multicast

Standard IP Multicast often uses UDP, which offers no guarantees.

* **No re-transmission:** Lost packets are gone forever.
* **No reception guaranteed:** Messages might never arrive.
* **No ordering:** Messages can be delivered in an arbitrary order.

We need to build smarter algorithms to solve these problems.

### Implementing reliable broadcast algorithms

Different algorithms provide different ordering guarantees.

* **FIFO broadcast:** If a process sends `m1` before `m2`, they are delivered in that order. Preserves order from a single sender.
* **Causal broadcast:** If `broadcast(m1)` *happens-before* `broadcast(m2)`, then `m1` is delivered before `m2` everywhere. Preserves causality across different senders.
* **Total order broadcast:** If one node delivers `m1` before `m2`, then *all* nodes must deliver `m1` before `m2`. Everyone agrees on a single, global delivery order.
* **FIFO-total order broadcast:** A combination of both FIFO and total order guarantees.

---

## Hierarchy

We can think of these broadcast types as layers, each adding stronger guarantees.

* **Best-effort broadcast** is the unreliable base layer. We add re-transmission to get...
* **Reliable broadcast**, which guarantees delivery but not order. From there, we can add...
* **FIFO broadcast**, which doesn't re-order messages from the same sender. Then...
* **Causal broadcast**, which doesn't re-order messages related by the happens-before rule. Finally...
* **Total order broadcast**, which ensures all processes deliver messages in the exact same sequence.

---

## Reliable Multicast

### Properties

A reliable multicast protocol must have these three properties:

* **Integrity:** Messages are delivered at most once (no duplicates).
* **Validity:** If a correct process sends a message, it is eventually delivered.
* **Agreement:** If a correct process delivers a message, all other correct processes also deliver it.

A naive implementation where everyone forwards to everyone else is inefficient ($O(N^2)$ messages). A better approach is **Gossip**, where each node forwards a message to a few random peers. This is far more scalable and works with high probability.
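
The quadratic cost of the naive approach is easy to confirm in a simulation. This is a sketch of eager forwarding with duplicate suppression over an idealized network (no loss, no crashes); the function name and the queue-based network are assumptions made for illustration.

```python
from collections import deque


def eager_reliable_broadcast(n, sender):
    """Every node forwards a message to all other nodes the first time it
    receives it. Returns (delivery count per node, total messages sent)."""
    delivered = {i: 0 for i in range(n)}
    messages_sent = 0
    network = deque()  # destinations of in-flight copies

    def forward(src):
        nonlocal messages_sent
        for dst in range(n):
            if dst != src:
                network.append(dst)
                messages_sent += 1

    delivered[sender] = 1
    forward(sender)
    while network:
        node = network.popleft()
        if delivered[node] == 0:  # Integrity: deliver only the first copy
            delivered[node] = 1
            forward(node)
    return delivered, messages_sent


delivered, sent = eager_reliable_broadcast(n=5, sender=0)
print(delivered, sent)
```

With 5 nodes, every node delivers once but 20 messages cross the network, i.e. $N(N-1)$. The redundancy is what provides Agreement if the sender crashes midway, and gossip reduces the cost by forwarding to only a few peers.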

---

## Ordered Multicast

### Details and implications

1.  **FIFO and Causal ordering are partial orderings.** They don't specify the order for concurrent multicasts (those not linked by the happens-before relation).
2.  **Reliable totally ordered multicast** is often called **atomic multicast**. It's a powerful tool for building consistent distributed systems.
3.  **Ordering does not imply reliability.** A protocol could guarantee total order but still fail to deliver a message to a correct process, breaking the "Agreement" property.

---

## FIFO Broadcast

Reliable multicast that respects sender order is a FIFO broadcast. This is typically implemented by having the sender add a sequence number to each message. Receivers only deliver messages from a specific sender in the order of their sequence numbers.
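The receiver side can be sketched with a hold-back buffer. This is an illustrative sketch, not a full protocol: the class name `FifoReceiver` is made up, sequence numbers are assumed to start at 0 per sender, and retransmission is out of scope.

```python
class FifoReceiver:
    """Delivers each sender's messages in sequence-number order."""

    def __init__(self):
        self.next_seq = {}    # sender -> next sequence number expected
        self.held = {}        # (sender, seq) -> message waiting for a gap to close
        self.delivered = []   # messages handed to the application, in order

    def receive(self, sender, seq, msg):
        expected = self.next_seq.get(sender, 0)
        if seq < expected or (sender, seq) in self.held:
            return            # duplicate: already delivered or already buffered
        self.held[(sender, seq)] = msg
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, self.next_seq.get(sender, 0)) in self.held:
            n = self.next_seq.get(sender, 0)
            self.delivered.append(self.held.pop((sender, n)))
            self.next_seq[sender] = n + 1
```

A message that arrives early (for example sequence number 1 before 0) stays in `held` until the gap is filled. Messages from different senders never wait for each other.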

---

## Totally Ordered Multicast

This is complex because everyone must agree on a single, global message order.

### Totally Ordered Multicast (Sequencer)

We elect a single process to act as a **leader** or **sequencer**.

1.  Processes send their messages to the sequencer.
2.  The sequencer assigns a global, sequential number to each message and broadcasts it to the group.
3.  All processes deliver messages in the order dictated by the sequencer.

* **Problems:** The sequencer is a performance bottleneck and a single point of failure.
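The three steps above can be sketched as follows. The classes `Sequencer` and `Member` are hypothetical names, the network is replaced by direct method calls, and leader election and failure handling are left out.

```python
import itertools


class Sequencer:
    """Stamps every message with the next global sequence number."""

    def __init__(self):
        self._counter = itertools.count()   # 0, 1, 2, ...

    def order(self, msg):
        return next(self._counter), msg


class Member:
    """Group member that delivers strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.held = {}        # seq -> message that arrived too early
        self.delivered = []

    def receive(self, seq, msg):
        self.held[seq] = msg
        while self.next_seq in self.held:
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
```

Even if the network reorders the stamped messages, every member ends up with the same delivery sequence, because the order was fixed at a single point.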

### Totally Ordered Multicast (ISIS)

A decentralized approach where processes negotiate the order.

1.  Process `p` broadcasts a message `m` with a proposed ID.
2.  Every other process `q` responds to `p` with its own proposed ID (typically the highest it has seen + 1).
3.  Process `p` collects all proposals, picks the largest one as the final ID, and broadcasts this final ID to the group.

* **The Trick:** Each process tracks the "largest proposed ID" and the "largest agreed-upon ID" to ensure it never delivers a message out of order.
* **Tradeoff:** This is more robust than a sequencer but requires more communication rounds (3 rounds vs. the sequencer's 2).
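A simplified sketch of the per-process bookkeeping is shown below. Several things here are assumptions for illustration: the class name `IsisProcess`, the use of `(number, process id)` pairs so that ties between equal numbers are broken deterministically, and the fact that the sender also makes a proposal as a group member. Networking and failures are not modelled; the caller collects the proposals and takes their maximum as the final ID.

```python
class IsisProcess:
    def __init__(self, pid):
        self.pid = pid
        self.largest_agreed = 0     # largest final ID seen so far
        self.largest_proposed = 0   # largest ID this process has proposed
        self.queue = {}             # msg_id -> [priority, deliverable, payload]
        self.delivered = []

    def propose(self, msg_id, payload):
        """Round 2: answer a new message with a proposed priority."""
        self.largest_proposed = max(self.largest_agreed, self.largest_proposed) + 1
        priority = (self.largest_proposed, self.pid)
        self.queue[msg_id] = [priority, False, payload]
        return priority

    def finalize(self, msg_id, final_priority):
        """Round 3: record the agreed priority, then deliver what is safe."""
        self.largest_agreed = max(self.largest_agreed, final_priority[0])
        self.queue[msg_id][0] = final_priority
        self.queue[msg_id][1] = True
        while self.queue:
            head = min(self.queue, key=lambda k: self.queue[k][0])
            priority, deliverable, payload = self.queue[head]
            if not deliverable:
                break   # a message with a smaller priority is still undecided
            self.delivered.append(payload)
            del self.queue[head]
```

The key point is the `break`: a message whose final ID is known still waits while any message with a smaller (possibly still provisional) priority sits ahead of it in the queue. Since a final ID is never smaller than any proposal for that message, nothing can later be ordered before an already delivered message.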

### Is it really ordered?

Let's say process A sends `m` and `n`, and the protocol assigns the final timestamps `1` to `m` and `2` to `n`. A will deliver `m` before `n`. Could process B deliver `n` before `m`?

* No. B receives the same final, agreed-upon timestamps of `1` for `m` and `2` for `n` via multicast. It cannot invent different ones.
* What if B proposed a timestamp of `3` for `m`? Then the final, agreed-upon timestamp for `m` would have to be at least `3`, but we know it's `1`. This is a contradiction.
* What if B wants to deliver `n` (with final timestamp `2`) before it has even heard of `m`? This is not possible, because B would have to participate in the proposal round for `m`. In that round, it would propose a timestamp larger than `2`, leading to `m`'s final timestamp being greater than `2`, which contradicts the fact that it's `1`.

The protocol forces all nodes to converge on the same sequence.

---

## Total order broadcast via Lamport clocks

We can achieve total order by giving each message a logical timestamp.

* **Idea:** Attach a Lamport timestamp to all messages and deliver them in timestamp order.
* **Problem:** If I receive a message with timestamp 5, how do I know a message with timestamp 4 won't arrive later?
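* **Solution:** Use FIFO links. A process can only deliver the message with timestamp 5 after it has received a message with a timestamp *greater than 5* from **every other process**. This confirms no earlier messages are still in transit. Because two processes can produce the same Lamport timestamp, ties are broken by process ID so that the order is total.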
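The delivery rule can be sketched with a priority queue. This is a minimal sketch under assumptions: the class name `LamportOrderedReceiver` is hypothetical, links are FIFO (so timestamps from one sender only grow), the set of senders is fixed and known, and "every other process" is read as every process other than the message's sender.

```python
import heapq


class LamportOrderedReceiver:
    def __init__(self, senders):
        self.latest = {s: 0 for s in senders}   # highest timestamp seen per sender
        self.queue = []                         # heap of (timestamp, sender, msg)
        self.delivered = []

    def receive(self, timestamp, sender, msg):
        self.latest[sender] = timestamp
        # Sorting by (timestamp, sender) breaks timestamp ties by sender ID.
        heapq.heappush(self.queue, (timestamp, sender, msg))
        self._deliver_ready()

    def _deliver_ready(self):
        while self.queue:
            ts, sender, msg = self.queue[0]
            others = [t for s, t in self.latest.items() if s != sender]
            if not all(t > ts for t in others):
                break   # some process could still send something earlier
            heapq.heappop(self.queue)
            self.delivered.append(msg)
```

Note the cost of this scheme: a silent process blocks delivery for everyone, so in practice every process has to keep sending messages (or heartbeats) for the others to make progress.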

---

## Causal broadcast via Lamport clocks

Physical clock timestamps may not respect causality. The solution is **Logical Clocks**. They are designed to capture the happens-before relation (`e1` ⇒ `e2` implies `T(e1) < T(e2)`).

We will look at two types:
1.  Lamport Clocks
2.  Vector Clocks

---

## Lamport clocks Algorithm

Each process maintains a single integer counter.

* Each process initializes a local clock `t` to 0.
* Before any event, a process increments its clock: `t = t + 1`.
* When sending a message `m`, it sends the tuple `(t, m)`.
* When receiving `(t_msg, m)`, a process sets `t = max(t, t_msg) + 1`. The `+ 1` is the increment for the receive event itself, so the clock is not incremented a second time.
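The rules above fit in a few lines. This is a minimal sketch; the class and method names (`LamportClock`, `tick`, `send`, `receive`) are chosen for illustration only.

```python
class LamportClock:
    def __init__(self):
        self.t = 0                      # local logical clock

    def tick(self):
        """Local event: advance the clock by one."""
        self.t += 1
        return self.t

    def send(self, payload):
        """Sending is an event; the new timestamp travels with the message."""
        return self.tick(), payload

    def receive(self, message):
        """Jump past the sender's timestamp, counting the receive as an event."""
        t_msg, payload = message
        self.t = max(self.t, t_msg) + 1
        return payload
```

If process `p` has had two local events and then sends, the message carries timestamp 3. A receiver whose clock is at 1 moves to 4, so the receive event is timestamped after the send event.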

### Properties

* If `a` happens-before `b` (`a` ⇒ `b`), then `L(a) < L(b)`.
* However, if `L(a) < L(b)`, it does **not** mean `a` ⇒ `b`. They could be concurrent.

---

## Vector Clocks

Each process `pi` maintains a vector `Vi` of size `N` (number of processes).

* Initially, `Vi[j] = 0` for all `j`.
* Before an event at `pi`, it increments its own clock entry: `Vi[i] = Vi[i] + 1`.
* When sending a message, it attaches its entire vector `V`.
* On receiving a message with vector `V'`, the process updates its local vector by taking the element-wise maximum: `Vi[j] = max(Vi[j], V'[j])` for all `j`. Since the receive is itself an event at `pi`, it also increments its own entry `Vi[i]`.

**Comparison Rules:**
* `V = W` if `V[j] = W[j]` for all `j`.
* `V ≤ W` if `V[j] ≤ W[j]` for all `j`.
* `V < W` if `V ≤ W` and `V ≠ W`.
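The update and comparison rules can be sketched as below. The names (`VectorClock`, `leq`, `less`, `concurrent`) are illustrative, and the sketch assumes that a receive counts as an event, so the receiver increments its own entry after merging. Two vectors are treated as concurrent when neither is `≤` the other.

```python
class VectorClock:
    def __init__(self, n, i):
        self.i = i            # index of the owning process
        self.v = [0] * n      # one entry per process

    def tick(self):
        """Event at this process: increment its own entry."""
        self.v[self.i] += 1

    def send(self):
        """Sending is an event; a copy of the vector is attached to the message."""
        self.tick()
        return list(self.v)

    def receive(self, other):
        """Element-wise maximum, then count the receive as an event."""
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.tick()


def leq(v, w):
    return all(a <= b for a, b in zip(v, w))


def less(v, w):
    return leq(v, w) and v != w


def concurrent(v, w):
    return not leq(v, w) and not leq(w, v)
```

Unlike Lamport timestamps, the comparison works in both directions: `less(V(a), V(b))` holds exactly when `a` happens-before `b`, and unrelated events show up as concurrent.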

### Vector Clocks, as used for Causally Ordered (CO) Multicast

For causal ordering, each entry `Vj[k]` counts the multicasts from `Pk` that `Pj` has delivered so far. When process `Pj` receives a message `m` from `Pi` (with vector `Vm`), it delays delivery of `m` until **both** conditions are met:

1.  `Vm[i] = Vj[i] + 1`
    * This ensures `m` is the very next message `Pj` expected from `Pi`.
2.  `Vm[k] ≤ Vj[k]` for all `k != i`
    * This ensures `Pj` has already delivered all messages that `Pi` had seen before it sent `m`.

![image](../images/Screenshot%202025-09-16%20at%2012.41.19.png)

In the figure, the message stamped `(1,1,0)` causally depends on the one stamped `(1,0,0)`. A process that receives `(1,1,0)` first must therefore hold it back and deliver it only after `(1,0,0)`.
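The two delivery conditions can be sketched as a hold-back queue. This is a sketch under assumptions: processes are indexed from 0, the class name `CausalReceiver` is made up, and the vector counts delivered multicasts per sender (it is incremented on delivery, not on other local events).

```python
class CausalReceiver:
    def __init__(self, n, j):
        self.j = j
        self.v = [0] * n      # v[k] = multicasts from process k delivered so far
        self.held = []        # (sender, vector, msg) not yet deliverable
        self.delivered = []

    def _deliverable(self, i, vm):
        next_from_sender = vm[i] == self.v[i] + 1              # condition 1
        deps_delivered = all(vm[k] <= self.v[k]                # condition 2
                             for k in range(len(vm)) if k != i)
        return next_from_sender and deps_delivered

    def receive(self, i, vm, msg):
        self.held.append((i, vm, msg))
        progress = True
        while progress:       # one delivery may unblock other held messages
            progress = False
            for entry in list(self.held):
                sender, vec, payload = entry
                if self._deliverable(sender, vec):
                    self.held.remove(entry)
                    self.delivered.append(payload)
                    self.v[sender] += 1
                    progress = True
```

With three processes, suppose the third one receives a reply stamped `[1, 1, 0]` from the second process before the original message stamped `[1, 0, 0]` from the first. The reply fails condition 2 and is held; once the original arrives and is delivered, the reply becomes deliverable and follows it.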