## **COURSE OVERVIEW**

## **INTRO**
### 1. Introduction
## **Theory & Algorithms**
### 1. Models of DS
### 2. Time in DS
### 3. Multicast
### 4. Consensus
### 5. Distributed Mutual Exclusion
## **Application**
### 7. Distributed Storage
### 8. Distributed Computing
### 9. Blockchains
### 10. Peer-to-Peer Networking
### 11. Internet of Things and Routing
### 12. Distributed Algorithms

# Lecture 08/09/2025

## What is a distributed system?
A distributed system is one where hardware and software components in/on networked computers communicate and coordinate their activity only by passing messages.

## CONCURRENCY

Concurrency is the ability of a system to execute multiple tasks or processes simultaneously or at overlapping times, improving efficiency.
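However, it can cause problems such as:
- Deadlocks and livelocks: conditions in which processes make no progress. In a deadlock, processes block forever waiting for one another; in a livelock, they keep changing state in response to each other without doing any useful work.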
- Non-determinism: occurs when the output or behavior of a concurrent program differs for the same input, depending on the precise, unpredictable timing of events, such as thread interleaving or resource access.

Other issues arise from the absence of shared state:
- Processes must pass messages to synchronize: for example, if two people share a resource and both try to access it at the same time, one of them may get an error.
- Processes may not agree on time: each node has its own clock, so there is no single global notion of the current time.

Everything can fail in distributed systems:
- Devices
- Integrity of data
- Network
    - Security
    - Man-in-the-Middle (MITM): an attack in which an attacker intercepts and potentially alters communication between two legitimate parties without their knowledge.
    - Byzantine failure: a condition of a system, particularly a distributed computing system, in which a fault presents different symptoms to different observers, including imperfect information on whether a system component has failed.

Distributed systems are built for three main reasons: domain, redundancy, and performance.

---
## Domain
A domain is a specific area of knowledge or activity that a distributed system is designed to address.
Some examples of domains are:
- The internet
- Wireless Mesh Networks
- Industrial systems
- Ledgers (Bitcoin, Ethereum)

However, these domains run into limits, which can be physical or logical:
- Physical limits: constraints imposed by the physical properties of the system's components or environment, such as hardware limitations, network bandwidth, latency, and geographical distribution.
- Logical limits: defined by the domain's bounded context, a core concept from Domain-Driven Design (DDD), a software development approach that focuses on aligning software design with the business domain. A bounded context is an explicit boundary within a distributed system where a specific domain model and its language (ubiquitous language) are consistent and applicable.
---
## Redundancy
A system with redundancy has duplicate components, processes, or data.
As a result, the system is:
- Robust: more resilient to failure.
- Available: availability is measured as uptime.

A system with redundancy can:
- offer high availability, such as 99.9% uptime ("three nines") or 99.999% uptime ("five nines"): the system is designed to be operational for that fraction of the time, leaving only a small budget for planned or unplanned downtime.
- be a backup: in an active-passive configuration, the redundant server acts as a hot or cold backup. The primary server handles all the workload, while the backup server waits on standby. Because information and components are duplicated, if the primary fails the backup automatically takes over.
- be a database: in an active-active configuration, all redundant servers are active and work simultaneously rather than waiting as backups. Incoming requests are distributed among the servers by a load balancer.
- be used in the banking sector, where any downtime can lead to significant financial losses.
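
As a rough sanity check on availability figures, the sketch below converts an uptime percentage into a yearly downtime budget. The helper name `allowed_downtime_minutes` is made up for this illustration, and a 365-day year is assumed.

```python
# Illustrative helper (not from the lecture): downtime allowed per year
# for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget in minutes."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime -> {minutes:.1f} min/year")
```

Three nines leaves roughly 8.8 hours of downtime per year, while five nines leaves only about five minutes.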

---
## Performance
To ensure a performant system we need:
- Economics
- Scalability
Here we talk about different topics such as:
- Video streaming: requires a lot of processing power.
- Cloud computing: Offers on-demand, scalable resources, eliminating the need for upfront hardware investment.
- Supercomputers: Excel at massive, specialized calculations but are very expensive and not scalable for general-purpose use.
- Many inexpensive vs many expensive specialized: Distributing workloads across many inexpensive machines is often more economical and scalable than using a few expensive, specialized ones.

---

# Lecture 09/09/2025

# Models of distributed systems

## Aspects of models

Why do we build distributed systems?

- **Inherent distribution**: By definition, distributed systems span multiple computers, often connected through networks such as telecommunications systems.
- **Reliability**: Even if one node fails, the system as a whole can continue functioning, avoiding single points of failure.
- **Performance**: Workloads can be shared among multiple machines, and data can be accessed from geographically closer nodes to reduce latency.
- **Scalability for large problems**: Some datasets and computations are simply too large to fit into a single machine, requiring distributed processing.

### Modelling the process - API Style

A distributed system can be described in terms of modules that exchange **events** through well-defined interfaces:

- **Event representation**:  
  \{Event\_type | Attributes, ...\}  
  or  
  \{Process\_ID | Event\_type | Attributes, ...\}

- **Module behavior**:  
  Each module reacts to incoming events and produces outputs according to specified rules:
  `upon event {condition | Event | attributes} such that condition holds do`  
  `    perform some action`
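
The "upon event ... do" style can be mimicked in ordinary code. The following is only a sketch: the `Module` class, its `upon` and `trigger` methods, and the `Deliver` event are hypothetical names chosen for illustration, not an API from the course.

```python
from collections import defaultdict
from typing import Callable


class Module:
    """Hypothetical event-driven module: handlers are registered per event type."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def upon(self, event_type: str,
             condition: Callable[[dict], bool] = lambda attrs: True):
        """Register a handler: 'upon event <type> such that condition holds do'."""
        def register(action):
            self._handlers[event_type].append((condition, action))
            return action
        return register

    def trigger(self, event_type: str, **attributes) -> None:
        """Deliver an event and run every handler whose condition holds."""
        for condition, action in self._handlers[event_type]:
            if condition(attributes):
                action(attributes)


log = []
link = Module()


@link.upon("Deliver", condition=lambda attrs: attrs["sender"] != "self")
def on_deliver(attrs):
    log.append((attrs["sender"], attrs["message"]))


link.trigger("Deliver", sender="p2", message="hello")
link.trigger("Deliver", sender="self", message="ignored")
print(log)  # [('p2', 'hello')]
```

Keeping the condition separate from the action mirrors the pseudocode, where the guard is part of the event pattern rather than the handler body.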


Multiple modules together (one per process or subsystem) should collectively satisfy desired **global properties** (e.g., safety, liveness).

### What we want/will make

We aim to:
- Design APIs for modules and prove that their composition satisfies global system properties.
- Implement modules that guarantee **local properties**.
- Use pseudocode and mathematics to formally demonstrate when such guarantees are possible, or to prove that they are impossible.

---

## Failures

Failures are inevitable in distributed systems. They can arise due to hardware breakdowns, software bugs, network disruptions, or even human mistakes. Designing robust systems requires understanding different types of failures and strategies to mitigate them.

### Types of failures

1. **Crash-stop**: A process halts and all other processes can reliably detect the failure. *Easiest to handle.*
2. **Crash-silent**: A process halts but failures cannot be detected reliably.
3. **Crash-noisy**: Failures may be detected, but only with eventual accuracy (false positives or delays are possible).
4. **Crash-recovery**: Processes may fail and later recover, rejoining the system. Requires care to avoid state inconsistencies.
5. **Crash-arbitrary (Byzantine failures)**: Processes behave arbitrarily or maliciously, deviating from the protocol. *Hardest to handle.*
6. **Randomized behavior**: Processes make decisions probabilistically. Correctness is argued via probability theory rather than strict guarantees.

---

## Communication

Is communication always required? In distributed systems, yes, but it can be realized in different ways:

- **Message passing**:
  1. Types of links and their potential failures.
  2. Network topology (commonly assumed fully connected).
  3. Routing algorithms for multi-hop communication.
  4. Broadcast and multicast primitives.

- **Shared memory**:
  1. Which process can read or write to which location?
  2. How do we guarantee reading the *freshest* value? (Consistency models)

### On types of links

A **link** is a module implementing send/receive operations with certain properties.

- **TCP/IP**: Enables reliable communication between a pair of nodes: either the messages are delivered, or the connection fails and none get through.
- **SSH**: Adds protection against corruption, interception, and tampering.

**Network reliability models**:
1. **Perfect links**: Reliable delivery, no duplication, no spurious messages.
2. **Fair-loss links**: Messages may be lost occasionally, but infinitely many attempts guarantee eventual delivery; finite duplication possible.
3. **Stubborn links**: Built on top of fair-loss links. Each message is retransmitted indefinitely, so it is eventually delivered (possibly many times); messages are still never created out of nothing.
4. **Logged-perfect links**: Perfect delivery with persistent logs for auditing/recovery.
5. **Authenticated links**: Reliable delivery, no duplication, and sender authenticity.

### Can networks fail?

While TCP/IP and lower-level protocols often give us the illusion of **perfect links** and **fail-stop crashes**, failures still happen.

- **Network partitions**: Occur when many links fail simultaneously, dividing the system into disconnected components. This is rare but catastrophic.

### Crashes vs Failures

Having discussed both **network** and **process** failures, it is important to distinguish between the two levels:

- A **process can crash** (e.g., by halting, or in broader failure models by misbehaving).  
- A **system fails** when the combination of process crashes and communication assumptions no longer allows correct operation.

For the remainder of our discussion, we usually assume **perfect links** (thanks to TCP/IP and lower-level reliability mechanisms). This means that:
- Messages are delivered reliably,
- No duplicates are created,
- No spurious (phantom) messages appear.

Under this assumption, we can define **system failure models** in terms of process behavior:

- **Fail-stop system**: Processes may experience crash-stop failures, but links are perfect.  
- More complex models (e.g., crash-recovery, Byzantine failures) are defined similarly, always considering both the **process failure type** and the **communication assumptions**.

In short, a system failure model = (process failure model) + (assumed link properties).

---

## Timing

Timing plays a central role in distributed systems, especially when considering **synchronization** and **failure detection**.

- Systems may be **synchronous** (bounded delays) or **asynchronous** (no timing guarantees).
- Links are still modeled as modules with send/receive properties.

### Synchronous vs. Asynchronous Systems

Distributed systems can be broadly classified according to their **timing assumptions**:

1. **Asynchronous systems**:
   - No bounds on message transmission delays.
   - No assumptions about process execution speeds (relative speeds may differ arbitrarily).
   - Failure detection is unreliable, since a slow process cannot be distinguished from a failed one.
   - Coordination and ordering rely on **logical clocks** (e.g., Lamport clocks, vector clocks), rather than real time.

2. **Synchronous systems**:
   - Bounds exist on message transmission delays and process execution speeds.
   - **Timed failure detection** is possible: if a message or heartbeat is not received within a known bound, a failure can be suspected reliably.
   - Transit delays can be measured and incorporated into algorithms.
   - Coordination can be based on **real-time clocks** rather than purely logical clocks.
   - Performance is often analyzed in terms of **worst-case bounds**, since timing assumptions provide guarantees.
   - Processes may maintain **synchronized clocks** (to some degree of precision), enabling algorithms such as consensus and coordinated actions.

**Key question**: *Can processes in an asynchronous system with fair-loss links reach agreement (e.g., on coordinated attack time)?*

### Proof via contradiction (Two Generals Problem)

1. Assume a protocol exists in which a fixed, finite sequence of messages guarantees agreement.
2. Consider the last message in this sequence.
3. Suppose the receiving general depends on it: if this message is lost, the receiver decides **not** to attack.
4. The sender cannot distinguish whether the message was delivered or lost, so it must behave deterministically and take the same action in both cases, namely the one it takes when the message arrives: attack.
5. If the message is lost, one general attacks and the other does not, which contradicts the assumed agreement. The last message therefore cannot matter, and repeating the argument eliminates every message in the sequence.  
 $\Rightarrow$ Perfect agreement is impossible under these assumptions.

### Which crash/link/timing assumptions implement distributed systems?

A **failure detector** can be modeled as just another module that provides (possibly imperfect) information about which processes are alive. Different combinations of timing assumptions and failure detectors allow different guarantees in distributed systems.  

### Example

![image](../images/Screenshot%202025-09-09%20at%2009.56.39.png)

#### Explanation:

This algorithm describes a **Perfect Failure Detector** for distributed systems using a heartbeat mechanism.

In short, here's what it does:

1.  **Sends Heartbeats:** Periodically, on a **timeout**, every process sends a `HEARTBEATREQUEST` message to all other processes in the system.
2.  **Waits for Replies:** It assumes no one is alive and waits for `HEARTBEATREPLY` messages. When a process receives a reply, it marks the sender as `alive`.
3.  **Detects Failures:** At the next timeout, any process that has not sent a reply is considered to have **crashed**. The algorithm then triggers a `Crash` event for that process.

Because it assumes **perfect communication links** (messages are never lost) and a timeout longer than the worst-case round-trip time (a synchronous system), a process that has not replied by the next timeout has truly failed, which is what makes the failure detection "perfect."
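
The three steps can be sketched as a round-based simulation. This is an assumption-laden illustration rather than the algorithm from the figure: the class name, the method names, and the `running` set that stands in for the network are all invented here, and every reply is assumed to arrive before the next timeout.

```python
class PerfectFailureDetector:
    """Round-based simulation; assumes perfect links and that every reply
    arrives before the next timeout (a synchronous system)."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.alive = set(peers)   # everyone is presumed alive initially
        self.detected = set()     # processes already reported as crashed

    def on_timeout(self, send_request):
        """Report silent peers as crashed, then start a new heartbeat round."""
        crashed_now = []
        for p in sorted(self.peers):
            if p not in self.alive and p not in self.detected:
                self.detected.add(p)
                crashed_now.append(p)   # corresponds to triggering a Crash event
        self.alive = set()              # assume nobody is alive until they reply
        for p in sorted(self.peers):
            send_request(p)
        return crashed_now

    def on_heartbeat_reply(self, sender):
        self.alive.add(sender)


# Simulated network: only running processes answer a heartbeat request.
running = {"p1", "p2", "p3"}
detector = PerfectFailureDetector(peers=["p1", "p2", "p3"])


def send_request(p):
    if p in running:
        detector.on_heartbeat_reply(p)   # reply delivered within the round


print(detector.on_timeout(send_request))  # round 1: []
running.discard("p2")                     # p2 crashes
print(detector.on_timeout(send_request))  # round 2: [] (p2 replied in round 1)
print(detector.on_timeout(send_request))  # round 3: ['p2']
```

Note that the crash is only reported one full timeout period after the last missed reply, which is the price of never suspecting a live process.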

### Network latency and bandwidth

When discussing communication performance, two key metrics matter:

- **Latency**: The time it takes for a single message (or bit) to travel from sender to receiver.  
- **Bandwidth**: The rate at which data can be transmitted, usually measured in bits per second (bps) or bytes per second (B/s).

#### Physical Link
Sometimes, surprisingly "low-tech" physical methods can provide high bandwidth, even if latency is poor:
- **Hard drives in a van**  
- Messengers carrying storage devices  
- Smoke signals (extreme latency, minimal bandwidth)  
- Radio signals or laser communication

#### Network Links
More conventional digital communication technologies include:
- DSL (Digital Subscriber Line)  
- Cellular data (e.g., 3G, 4G, 5G)  
- Wi-Fi (various standards)  
- Ethernet/fiber cables  
- Satellite links  

#### Latency examples
1. Hard drives transported by van: $\approx$ 1 day latency  
2. Intra-continent fiber-optic cable: $\approx$ 100 ms latency  

#### Bandwidth examples
1. Hard drives in a van: $\frac{50 \, \text{TB}}{1 \, \text{day}}$ = **very high bandwidth** despite huge latency  
2. 3G cellular network: $\approx 1 \, \text{Mbit/s}$ bandwidth  
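
To make the comparison concrete, the van's bandwidth can be converted to bits per second. This worked calculation assumes decimal terabytes ($1\,\text{TB} = 10^{12}$ bytes) and a 24-hour trip:

$$\frac{50 \times 10^{12}\,\text{B} \times 8\,\text{bit/B}}{86\,400\,\text{s}} \approx 4.6 \times 10^{9}\,\text{bit/s} \approx 4.6\,\text{Gbit/s}$$

Under these assumptions the van moves data several thousand times faster than the 1 Mbit/s 3G link, even though the first bit takes a full day to arrive.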

---

## Performance

### Performance measures

- **SLI (Service Level Indicator)**: What aspect of the system do we measure?  
Examples: bandwidth, latency, fault tolerance, uptime, failure detection time.
- **SLO (Service Level Objective)**: What target values do we aim for?  
Example: latency < 200ms.
- **SLA (Service Level Agreement)**: An SLO backed with contractual consequences.  
Example: "99% uptime, otherwise partial refund."

Why should we study these?
- Measuring means we can improve
- Spend time improving when it is needed.
- Reliability is kind of the point with distributed systems.

### Reading SLAs

When evaluating claims like *"This solution offers 99% uptime"*, consider:

- **Sampling frequency**: How often is system availability checked?
- **Responsibility scope**: Does the SLA cover only server uptime, or also account for client/network failures?
- **Time interval**: Does 99% apply per day, per month, or per year?

---

# Lecture 15/09/2025

## The Challenge of Time

In distributed systems, we often contrast **synchronous** and **asynchronous** computation. A synchronous system has known, bounded delays for message delivery and process execution. An asynchronous system has no such guarantees. Most real-world systems are asynchronous, which makes coordination difficult. Without certain timing guarantees, some problems are impossible to solve deterministically, a classic example being the **Two Generals' Problem**, which illustrates the impossibility of reaching a consensus over an unreliable channel.

### Reasons for Asynchrony
Asynchrony isn't an abstract problem; it arises from concrete issues with the physical components of a system: the network and the nodes themselves.

#### Network unpredictability:
* **Physical failures:** Cables can be damaged (famously by sharks, or cut during construction), requiring traffic to be rerouted. 🦈
* **Message loss:** Packets can be dropped, requiring retransmission protocols (like TCP) to resend data.
* **Congestion:** High traffic can lead to queues and variable delays (latency).
* **Re-configuration:** The network topology itself may change, causing temporary disruptions.

#### Node unpredictability:
* **OS scheduling:** The operating system's scheduler can preempt a process at any time to run another one.
* **Garbage collection (GC):** In managed languages (like Java or Go), a "stop-the-world" GC pause can halt an application for milliseconds or even seconds.
* **Hardware faults:** Nodes can crash, reboot, or suffer from other hardware-related issues.

But what if a system were "perfect"? Imagine no network loss and perfectly functioning nodes. Could asynchrony still occur? **Yes**. The non-deterministic nature of process scheduling is a fundamental source of asynchrony. A real-world example is the **2012 Knight Capital Group glitch**, where a software deployment error led to an algorithm running haywire. The system's components were working "correctly," but the timing and interaction between processes led to a catastrophic failure, costing the company $440 million in 45 minutes.

---

## How Do Distributed Systems Use Time?

Systems need to measure time for many fundamental operations. Think about how you would implement these on a single computer; in a distributed system, this becomes much harder.

1.  **Scheduling and Timeouts:** To run a task for a specific duration or to give up on an operation if a response isn't received within a certain window.
2.  **Failure Detection:** Using **heartbeats** (periodic "I'm alive" messages) to detect if a node has crashed. If a heartbeat isn't received within a timeout period, the node is presumed dead.
3.  **Event Timestamping:** Recording the time an event occurred, which is critical in databases for transaction ordering and data versioning (e.g., using Multi-Version Concurrency Control or MVCC).
4.  **Performance Measurement:** Logging and statistics gathering to measure latency, throughput, and other performance metrics.
5.  **Data Expiration:** Caching systems use Time-To-Live (TTL) values to expire old data. DNS records and security certificates also have expiration times.
6.  **Causal Ordering:** Most importantly, to determine the **order of events** across different nodes to maintain consistency and causality.

---

## Types of Clocks

In distributed systems, we primarily talk about two types of clocks. From a practical standpoint, a clock is simply something we can query to get a timestamp.

* **Physical Clocks:** These measure the passage of real-world time in units like seconds. They are based on physical phenomena, like the oscillation of a crystal.
* **Logical Clocks:** These don't track real time. Instead, they count events (e.g., the number of requests processed) to determine the logical order of operations.

---

## Physical Clocks: The Quartz Crystal

Most computers use quartz clocks. Here's how they work:

* A thin slice of quartz crystal is precisely cut to control its oscillation frequency when an electric voltage is applied (the **piezoelectric effect**).
* When you boot your computer, it queries a **Real-Time Clock (RTC)**, a small, battery-powered circuit on the motherboard that has been continuously counting these oscillations.
* By counting the cycles, the computer can calculate the elapsed time.

However, these clocks aren't perfect:
* **Manufacturing variations:** No two crystals are identical.
* **Temperature sensitivity:** Frequency changes with temperature.
* This imperfection leads to **clock drift**. We measure this in **parts per million (ppm)**. A drift of 1 ppm means the clock is off by one microsecond per second, which adds up to about **32 seconds per year**. A typical computer clock might have a drift of around 50 ppm.
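
The drift figures are easy to check with a few lines of arithmetic. The function name `drift_seconds` is made up for this sketch, and a 365-day year is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    """Worst-case accumulated error for a clock drifting at `ppm` parts per million."""
    return ppm * 1e-6 * elapsed_seconds


print(drift_seconds(1, SECONDS_PER_YEAR))   # about 31.5 s per year
print(drift_seconds(50, SECONDS_PER_DAY))   # about 4.3 s per day
```

At 50 ppm a clock can be off by several seconds after a single day, which is why periodic resynchronization is needed.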

Better, but more expensive, alternatives include:
* **Atomic clocks:** Extremely precise but very expensive.
* **GPS:** Satellites contain atomic clocks. A GPS receiver can use signals from multiple satellites to calculate a very precise time.
* **Network Time Protocol (NTP):** Ask another, more accurate server for the time.

---

## Time Standards and Representations

To agree on time, we need standards.

* **Solar Time (UT1):** Based on the Earth's rotation. A day is the time between the sun reaching its highest point in the sky on two consecutive days. This is not perfectly stable.
* **International Atomic Time (TAI):** Based on the oscillations of a caesium-133 atom. One second is defined as exactly 9,192,631,770 oscillations. TAI is extremely stable.
* **Coordinated Universal Time (UTC):** The global standard we all use. It's a compromise: it ticks at the same rate as TAI but is kept within 0.9 seconds of Solar Time (UT1) by adding **leap seconds**.

### Leap Seconds
To keep UTC aligned with the Earth's wobbly rotation, a second is occasionally added. This happens on June 30 or December 31.
* **Positive leap second:** The time `23:59:59` is followed by `23:59:60`, and then `00:00:00`.
* **Negative leap second:** `23:59:58` would be followed directly by `00:00:00`. (This has never happened).
Leap seconds are a notorious source of bugs in computer systems.

### Common Representations
* **Unix time:** The number of seconds that have elapsed since `00:00:00 UTC` on 1 January 1970 (the "epoch"). Importantly, it **ignores leap seconds**; a day with a leap second is still counted as having 86,400 seconds.
* **ISO 8601:** A standard format for representing dates and times, e.g., `2025-09-15T14:30:00Z` (where `Z` indicates UTC).
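
A short sketch using Python's standard `datetime` module illustrates both representations. It relies on the fact that a leap second was inserted at the end of 31 December 2016; the variable names are arbitrary.

```python
from datetime import datetime, timezone

# A leap second was inserted at the end of 31 December 2016 (23:59:60 UTC).
start = datetime(2016, 12, 31, tzinfo=timezone.utc)
end = datetime(2017, 1, 1, tzinfo=timezone.utc)

# Unix time still counts exactly 86,400 seconds for that day.
unix_day_length = end.timestamp() - start.timestamp()
print(unix_day_length)

# The example timestamp from the text, as ISO 8601 and as Unix time.
moment = datetime(2025, 9, 15, 14, 30, tzinfo=timezone.utc)
iso_text = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
print(iso_text, int(moment.timestamp()))
```

The real day lasted 86,401 seconds, so Unix timestamps around a leap second do not map one-to-one onto elapsed physical time.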

---

## Network Time Protocol (NTP)

Since computer clocks drift, they need to be periodically corrected. NTP is the most common protocol for this. A client synchronizes its clock with a more accurate time server.

The main protocols are **NTP** and the more precise **PTP** (Precision Time Protocol).
On Ubuntu/Linux, you can check the time synchronization service with: `systemctl status systemd-timesyncd`.

### NTP Synchronization Logic
Let's analyze the message exchange between a client and a server.

```
--------t1---------------t4------------> NTP CLIENT
          \             /
           \           /
------------t2-------t3----------------> NTP SERVER

```

* $T_1$: Client sends a request.
* $T_2$: Server receives the request.
* $T_3$: Server sends a response.
* $T_4$: Client receives the response.

The client can now calculate two important values:
1.  **Round-trip delay ($\delta$):** This is the total time the messages spent on the network, excluding the server's processing time.
    $$\delta = (T_4 - T_1) - (T_3 - T_2)$$
2.  **Clock offset/skew ($\theta$):** This is the client's best guess of the difference between its clock and the server's clock. Assuming the network delay is symmetric (i.e., the trip to the server takes as long as the trip back), the client calculates its offset as the difference between its local time ($T_4$) and what it thinks the server's time should be ($T_3$ plus half the round-trip delay).
    $$\theta = (T_3 + \frac{\delta}{2}) - T_4$$

Based on the calculated offset $\theta$, the client's clock is adjusted:
* If $|\theta| < 125\,\text{ms}$: **Slew** the clock. The clock is gradually sped up or slowed down until it's correct. This avoids sudden time jumps.
* If $125\,\text{ms} \le |\theta| < 1000\,\text{s}$: **Jump** the clock. The time is set immediately. This can cause issues for applications sensitive to time reversals.
* If $|\theta| \ge 1000\,\text{s}$: **Ignore**. The offset is too large and is likely an error, so the update is ignored.
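
The two formulas and the adjustment rule translate directly into code. This is a minimal sketch, not an NTP implementation: the function names are invented, and the timestamps describe a hypothetical exchange in which the client clock is 2.0 s behind the server.

```python
def ntp_estimate(t1: float, t2: float, t3: float, t4: float):
    """Return (round-trip delay, clock offset) in seconds."""
    delta = (t4 - t1) - (t3 - t2)
    theta = (t3 + delta / 2) - t4
    return delta, theta


def correction_policy(theta: float) -> str:
    """Thresholds as stated in the notes."""
    if abs(theta) < 0.125:
        return "slew"
    if abs(theta) < 1000:
        return "jump"
    return "ignore"


# Hypothetical exchange: each network leg takes 0.05 s and the server
# needs 0.01 s to respond; the client clock is 2.0 s behind the server.
t1 = 100.00   # client clock when the request is sent
t2 = 102.05   # server clock when the request arrives
t3 = 102.06   # server clock when the response is sent
t4 = 100.11   # client clock when the response arrives

delta, theta = ntp_estimate(t1, t2, t3, t4)
print(round(delta, 3), round(theta, 3), correction_policy(theta))
```

Here the estimate recovers the true 2.0 s offset exactly because the two network legs are symmetric; with asymmetric delays the error can be as large as half the round-trip delay.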

---

## Clock Types Revisited: Monotonic vs. Time-of-Day

This brings us to two important types of clocks available in most programming environments.

* **Time-of-day Clock:**
    * Measures time since a fixed point in the past (e.g., the Unix epoch).
    * **Not monotonic:** It can jump forwards or backwards due to NTP adjustments or leap seconds.
    * Useful for timestamping events that need to be compared across different nodes.

* **Monotonic Clock:**
    * Measures time since an arbitrary point in the past (e.g., system boot).
    * **Guaranteed to move forward** and is not affected by NTP jumps.
    * Perfect for measuring elapsed time (e.g., timeouts) on a single node. You cannot use its value to compare timestamps between different nodes.
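
In Python, the two clock types correspond to `time.time()` and `time.monotonic()` from the standard library. The sketch below times a short sleep with both; the 0.05 s sleep is just a stand-in for real work.

```python
import time

start_wall = time.time()        # time-of-day clock: seconds since the Unix epoch
start_mono = time.monotonic()   # monotonic clock: arbitrary reference point

time.sleep(0.05)                # stand-in for the operation being timed

elapsed_wall = time.time() - start_wall       # may be wrong if the clock is stepped
elapsed_mono = time.monotonic() - start_mono  # never negative

print(f"time-of-day: {elapsed_wall:.3f} s, monotonic: {elapsed_mono:.3f} s")
```

Both values normally agree, but only the monotonic measurement stays correct if NTP steps the system clock in the middle of the operation.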

Relying on physical clocks alone is insufficient for ordering events correctly in a distributed system due to clock skew and network latency.

---

## The Happens-Before Relation

To reason about causality without perfect physical clocks, we use a logical concept called the **happens-before** relation, denoted by $\rightarrow$. An event is an atomic operation on a single node.

We say event **a happens before event b** ($a \rightarrow b$) if one of the following is true:
1.  `a` and `b` happen on the same node, and `a` occurs before `b`.
2.  `a` is the sending of a message, and `b` is the receipt of that same message.
3.  There exists some event `c` such that $a \rightarrow c$ and $c \rightarrow b$ (transitivity).

This relation defines a **partial order**. It's possible that neither $a \rightarrow b$ nor $b \rightarrow a$. In this case, we say `a` and `b` are **concurrent**, written as $a \parallel b$. This means we cannot determine their causal order from the information we have.
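
One way to make the definition concrete is to treat rules 1 and 2 as edges of a graph and rule 3 as reachability. The execution below (nodes `A`, `B`, `C` and their events) is a made-up example, and the helper names are illustrative only.

```python
# Hypothetical execution: events listed per node in local order,
# plus (send_event, receive_event) pairs for messages.
local_order = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}
messages = [("a1", "b2")]   # a1 sends a message that b2 receives

# Rules 1 and 2: direct edges.
edges = set(messages)
for events in local_order.values():
    edges.update(zip(events, events[1:]))


def happens_before(a: str, b: str) -> bool:
    """Rule 3: a -> b iff b is reachable from a through direct edges."""
    frontier, seen = [a], set()
    while frontier:
        current = frontier.pop()
        for src, dst in edges:
            if src == current and dst not in seen:
                if dst == b:
                    return True
                seen.add(dst)
                frontier.append(dst)
    return False


def concurrent(a: str, b: str) -> bool:
    return a != b and not happens_before(a, b) and not happens_before(b, a)


print(happens_before("a1", "b2"))  # True: send before receive
print(concurrent("a2", "b1"))      # True: no chain of messages links them
```

A real system cannot inspect the whole execution like this, which is why logical clocks are used to capture the same information locally.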

This notion of causality is inspired by physics:
* Information cannot travel faster than the speed of light. If two events in spacetime are too far apart to influence each other, they are not causally related.
* In distributed systems, we replace the speed of light with the speed of messages. If no chain of messages connects event `a` to event `b`, then `a` cannot have caused `b`.

---

## Safety and Liveness

When designing distributed algorithms, we want them to satisfy certain properties across all possible executions. These properties usually fall into two categories:

* **Safety:** *Nothing bad ever happens.*
    * A safety property, once violated, can never be undone. For example, "a database will never return incorrect data." If it does so even once, the property is broken forever.
* **Liveness:** *Something good eventually happens.*
    * A liveness property can always be satisfied in the future. For example, "every request will eventually receive a response." Even if a request is waiting, there's always the possibility it will be answered later.

### Formal Definitions
* **Safety:** A property is a safety property if for any execution where it is violated, there is a finite prefix of that execution after which the violation is guaranteed and unavoidable.
* **Liveness:** A property is a liveness property if for any finite (partial) execution, there is at least one possible continuation of that execution where the property is satisfied.

### Examples
Consider a "perfect link" communication channel:
* **Safety Property:** A process only receives messages that were actually sent. (Prevents the "bad thing" of phantom messages).
* **Liveness Property:** If a correct process sends a message to another correct process, the destination eventually receives it. (Ensures the "good thing" of message delivery eventually happens).

---


# Lecture 16/09/2025

## Multicast

A multicast is a one-to-many communication where a single process sends a message to a specific group, and all members of that group receive it.

### What is it?

**Examples:**
* **Systems needing redundancy:** Algorithms with failover or replication, such as in Databases, DNS, or Banks.
* **One-to-many streaming:** Live TV/Radio broadcasts.
* **Many-to-many collaboration:** Skype, Teams, TikTok, and Massively Multiplayer Online games (MMOs).

**Disclaimer for this lecture:**
* We assume groups are **closed and static** (no members joining or leaving).
* We will not be discussing multiple overlapping groups.
* We will **not assume any special hardware support** for multicast.
* **Good news:** All algorithms shown work in both synchronous and asynchronous networks.


### Requirements

**Assuming:**
* We have **reliable 1-to-1 communication** (like TCP) as a building block.
* The sending process might crash.
* There is no default message ordering.

**Guarantees we want:**
* If a message is sent, it is **delivered exactly once**.
* Messages are eventually delivered to all **non-crashed (correct) processes**.
* The system is **fault-tolerant**; if one node fails, the rest can continue.

### General broadcast structure

We introduce a "broadcast algorithm" layer that sits between the application and the network.

* **Node 1** doesn't send directly to the network; it tells the **broadcast algorithm** to broadcast a message.
* The **broadcast algorithm** handles the logic of sending, re-transmitting, and ordering messages over the network.
* The **broadcast algorithm** on the receiving end then **delivers** the message to Node 2.

---

## Problems - IP Multicast

Standard IP Multicast often uses UDP, which offers no guarantees.

* **No re-transmission:** Lost packets are gone forever.
* **No reception guaranteed:** Messages might never arrive.
* **No ordering:** Messages can be delivered in an arbitrary order.

We need to build smarter algorithms to solve these problems.

### Implementing reliable broadcast algorithms

Different algorithms provide different ordering guarantees.

* **FIFO broadcast:** If a process sends `m1` before `m2`, they are delivered in that order. Preserves order from a single sender.
* **Causal broadcast:** If `broadcast(m1)` *happens-before* `broadcast(m2)`, then `m1` is delivered before `m2` everywhere. Preserves causality across different senders.
* **Total order broadcast:** If one node delivers `m1` before `m2`, then *all* nodes must deliver `m1` before `m2`. Everyone agrees on a single, global delivery order.
* **FIFO-total order broadcast:** A combination of both FIFO and total order guarantees.

---

## Hierarchy

We can think of these broadcast types as layers, each adding stronger guarantees.

* **Best-effort broadcast** is the unreliable base layer. We add re-transmission to get...
* **Reliable broadcast**, which guarantees delivery but not order. From there, we can add...
* **FIFO broadcast**, which doesn't re-order messages from the same sender. Then...
* **Causal broadcast**, which doesn't re-order messages related by the happens-before rule. Finally...
* **Total order broadcast**, which ensures all processes deliver messages in the exact same sequence.

---

## Reliable Multicast

### Properties

A reliable multicast protocol must have these three properties:

* **Integrity:** Messages are delivered at most once (no duplicates).
* **Validity:** If a correct process sends a message, it is eventually delivered.
* **Agreement:** If a correct process delivers a message, all other correct processes also deliver it.

A naive implementation where everyone forwards to everyone else is inefficient ($O(N^2)$ messages). A better approach is **Gossip**, where each node forwards a message to a few random peers. This is far more scalable and works with high probability.
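
A small simulation gives a feel for the message cost of gossip. It is a sketch under simplifying assumptions: no message loss, no crashes, each node forwards exactly once on first receipt, and the function name, fanout of 3, and seed are arbitrary choices.

```python
import random


def gossip(n_nodes: int, fanout: int, seed: int = 0):
    """Simulate push gossip; return (nodes reached, messages sent)."""
    rng = random.Random(seed)
    informed = {0}              # node 0 originates the message
    frontier = [0]
    messages_sent = 0
    while frontier:
        next_frontier = []
        for node in frontier:
            others = [p for p in range(n_nodes) if p != node]
            for peer in rng.sample(others, fanout):
                messages_sent += 1
                if peer not in informed:   # first receipt: forward once
                    informed.add(peer)
                    next_frontier.append(peer)
        frontier = next_frontier
    return len(informed), messages_sent


reached, sent = gossip(n_nodes=100, fanout=3)
print(f"reached {reached}/100 nodes with {sent} messages "
      f"(all-to-all forwarding: {100 * 99})")
```

Each informed node sends only `fanout` messages, so the total is at most `fanout * N` instead of `N * (N - 1)`; the trade-off is that some nodes may be missed, so delivery holds only with high probability.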

---

## Ordered Multicast

### Details and implications

1.  **FIFO and Causal ordering are partial orderings.** They don't specify the order for concurrent multicasts (those not linked by the happens-before relation).
2.  **Reliable totally ordered multicast** is often called **atomic multicast**. It's a powerful tool for building consistent distributed systems.
3.  **Ordering does not imply reliability.** A protocol could guarantee total order but still fail to deliver a message to a correct process, breaking the "Agreement" property.

---

## FIFO Broadcast

Reliable multicast that respects sender order is a FIFO broadcast. This is typically implemented by having the sender add a sequence number to each message. Receivers only deliver messages from a specific sender in the order of their sequence numbers.
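
A minimal sketch of the receiver side, assuming each sender numbers its messages 0, 1, 2, ... The class name `FifoReceiver` and its fields are hypothetical; messages that arrive early are held back until the gap is filled.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in sequence-number order."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected sequence number
        self.held = defaultdict(dict)      # sender -> {seq: payload} not yet deliverable
        self.delivered = []                # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        if seq < self.next_seq[sender] or seq in self.held[sender]:
            return  # duplicate, ignore
        self.held[sender][seq] = payload
        # Deliver as many consecutive messages as are now available.
        while self.next_seq[sender] in self.held[sender]:
            msg = self.held[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, msg))
            self.next_seq[sender] += 1


receiver = FifoReceiver()
receiver.receive("A", 1, "second")   # arrives early, held back
receiver.receive("A", 0, "first")    # fills the gap, both are delivered
print(receiver.delivered)
```

Note that messages from different senders are not ordered relative to each other; that is exactly the gap causal and total order close.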

---

## Totally Ordered Multicast

This is complex because everyone must agree on a single, global message order.

### Totally Ordered Multicast (Sequencer)

We elect a single process to act as a **leader** or **sequencer**.

1.  Processes send their messages to the sequencer.
2.  The sequencer assigns a global, sequential number to each message and broadcasts it to the group.
3.  All processes deliver messages in the order dictated by the sequencer.

* **Problems:** The sequencer is a performance bottleneck and a single point of failure.
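
The three steps can be sketched as follows. This is a single-process simulation in which method calls stand in for network messages; `Sequencer` and `Member` are hypothetical names.

```python
class Sequencer:
    """Assigns consecutive global sequence numbers to submitted messages."""

    def __init__(self):
        self.next_number = 0

    def order(self, message):
        stamped = (self.next_number, message)
        self.next_number += 1
        return stamped  # in a real system this pair would be broadcast to the group


class Member:
    """Delivers messages strictly in sequencer order, holding back gaps."""

    def __init__(self):
        self.expected = 0
        self.held = {}
        self.delivered = []

    def receive(self, number, message):
        self.held[number] = message
        while self.expected in self.held:
            self.delivered.append(self.held.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
stamped = [sequencer.order(m) for m in ["a", "b", "c"]]
p, q = Member(), Member()
for number, message in stamped:            # p receives in order
    p.receive(number, message)
for number, message in reversed(stamped):  # q receives in reverse order
    q.receive(number, message)
print(p.delivered, q.delivered)
```

Both members deliver the same sequence even though the network delivered the stamped messages in different orders.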

### Totally Ordered Multicast (ISIS)

A decentralized approach where processes negotiate the order.

1.  Process `p` broadcasts a message `m` with a proposed ID.
2.  Every other process `q` responds to `p` with its own proposed ID (typically the highest it has seen + 1).
3.  Process `p` collects all proposals, picks the largest one as the final ID, and broadcasts this final ID to the group.

* **The Trick:** Each process tracks the "largest proposed ID" and the "largest agreed-upon ID" to ensure it never delivers a message out of order.
* **Tradeoff:** This is more robust than a sequencer but requires more communication rounds (3 rounds vs. the sequencer's 2).
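
A rough sketch of the bookkeeping each process does. It assumes IDs are pairs `(number, process_id)` so that ties are broken by process ID, which is a common convention but is not stated above; the class `IsisProcess` and its method names are hypothetical, and method calls stand in for messages.

```python
class IsisProcess:
    def __init__(self, pid):
        self.pid = pid
        self.largest_proposed = 0
        self.largest_agreed = 0
        self.pending = {}      # msg_id -> [priority, is_final]
        self.delivered = []

    def propose(self, msg_id):
        """Round 2: answer a new message with a proposed ID."""
        self.largest_proposed = max(self.largest_proposed, self.largest_agreed) + 1
        priority = (self.largest_proposed, self.pid)
        self.pending[msg_id] = [priority, False]
        return priority

    def finalize(self, msg_id, priority):
        """Round 3: record the final ID and deliver whatever is now safe."""
        self.largest_agreed = max(self.largest_agreed, priority[0])
        self.pending[msg_id] = [priority, True]
        # Deliver from the front of the queue while the smallest entry is final.
        while self.pending:
            head = min(self.pending, key=lambda m: self.pending[m][0])
            if not self.pending[head][1]:
                break
            del self.pending[head]
            self.delivered.append(head)


procs = [IsisProcess(pid) for pid in range(3)]
# Messages "m" and "n" are multicast concurrently and arrive in different orders.
proposals_m = [procs[0].propose("m"), procs[1].propose("m")]
proposals_n = [procs[2].propose("n"), procs[0].propose("n"), procs[1].propose("n")]
proposals_m.append(procs[2].propose("m"))
final_m, final_n = max(proposals_m), max(proposals_n)
# The final IDs also arrive in different orders at different processes.
procs[0].finalize("m", final_m); procs[0].finalize("n", final_n)
procs[1].finalize("m", final_m); procs[1].finalize("n", final_n)
procs[2].finalize("n", final_n); procs[2].finalize("m", final_m)
print([p.delivered for p in procs])
```

A message whose ID is final is still held back while a message with a smaller, not yet final ID sits ahead of it in the queue; that hold-back is what keeps the delivery order identical everywhere.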

### Is it really ordered?

Let's say process A sends `m` and `n`, and the protocol assigns the final timestamps `1` to `m` and `2` to `n`. A will deliver `m` before `n`. Could process B deliver `n` before `m`?

* No. B receives the same final, agreed-upon timestamps of `1` for `m` and `2` for `n` via multicast. It cannot invent different ones.
* What if B proposed a timestamp of `3` for `m`? Then the final, agreed-upon timestamp for `m` would have to be at least `3`, but we know it's `1`. This is a contradiction.
* What if B wants to deliver `n` (with final timestamp `2`) before it has even heard of `m`? This is not possible, because B would have to participate in the proposal round for `m`. In that round, it would propose a timestamp larger than `2`, leading to `m`'s final timestamp being greater than `2`, which contradicts the fact that it's `1`.

The protocol forces all nodes to converge on the same sequence.

---

## Total order broadcast via Lamport clocks

We can achieve total order by giving each message a logical timestamp.

* **Idea:** Attach a Lamport timestamp to all messages and deliver them in timestamp order.
* **Problem:** If I receive a message with timestamp 5, how do I know a message with timestamp 4 won't arrive later?
* **Solution:** Use FIFO links. A process can only deliver the message with timestamp 5 after it has received a message with a timestamp *greater than 5* from **every other process**. This confirms no earlier messages are still in transit.
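
The delivery rule can be sketched as a small function. It assumes FIFO links, timestamps that increase per sender, and ties broken by sender name when sorting; the function name `deliverable` and the data layout are illustrative.

```python
def deliverable(queues):
    """queues: sender -> list of (timestamp, payload) received so far, in FIFO order.

    Returns the messages that are safe to deliver, sorted by (timestamp, sender).
    """
    # Latest timestamp heard from each sender; -1 means nothing heard yet.
    latest = {s: (q[-1][0] if q else -1) for s, q in queues.items()}
    ready = []
    for sender, queue in queues.items():
        for ts, payload in queue:
            # Safe only if every other sender has already moved past ts.
            if all(latest[other] > ts for other in queues if other != sender):
                ready.append((ts, sender, payload))
    return sorted(ready)


queues = {
    "A": [(1, "a1"), (5, "a5")],
    "B": [(2, "b2")],
    "C": [(4, "c4")],
}
print(deliverable(queues))
```

Here `a1` and `b2` are safe, but `c4` and `a5` must wait because `B` has only been heard up to timestamp 2. A process that stays silent blocks delivery for everyone, which is the practical weakness of this scheme.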

---

## Causal broadcast via Lamport clocks

Physical clock timestamps may not respect causality. The solution is **Logical Clocks**. They are designed to capture the happens-before relation (`e1` ⇒ `e2` implies `T(e1) < T(e2)`).

We will look at two types:
1.  Lamport Clocks
2.  Vector Clocks

---

## Lamport clocks Algorithm

Each process maintains a single integer counter.

* Each process initializes a local clock `t` to 0.
* Before any event, a process increments its clock: `t = t + 1`.
* When sending a message `m`, it sends the tuple `(t, m)`.
* When receiving `(t_msg, m)`, a process updates its clock `t = max(t, t_msg)` and then increments it for the receive event.

### Properties

* If `a` happens-before `b` (`a` ⇒ `b`), then `L(a) < L(b)`.
* However, if `L(a) < L(b)`, it does **not** mean `a` ⇒ `b`. They could be concurrent.
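
A minimal sketch of the rules above; the class name `LamportClock` and its method names are illustrative.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: increment the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Send event: increment, then return the timestamp to attach."""
        return self.tick()

    def receive(self, msg_time):
        """Receive event: jump past the sender's timestamp, then increment."""
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b, c = LamportClock(), LamportClock(), LamportClock()
a.tick()                 # a: local event, time 1
stamp = a.send()         # a: send event, time 2
b.receive(stamp)         # b: receive event, time 3 (after the send, as required)
c.tick()                 # c: unrelated event, time 1
print(a.time, b.time, c.time)
```

The event at `c` has a smaller timestamp than the send at `a`, yet the two are concurrent: a smaller Lamport timestamp alone says nothing about causality.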

---

## Vector Clocks

Each process `pi` maintains a vector `Vi` of size `N` (number of processes).

* Initially, `Vi[j] = 0` for all `j`.
* Before an event at `pi`, it increments its own clock entry: `Vi[i] = Vi[i] + 1`.
* When sending a message, it attaches its entire vector `V`.
* On receiving a message with vector `V'`, the process updates its local vector by taking the element-wise maximum: `Vi[j] = max(Vi[j], V'[j])` for all `j`.

**Comparison Rules:**
* `V = W` if `V[j] = W[j]` for all `j`.
* `V ≤ W` if `V[j] ≤ W[j]` for all `j`.
* `V < W` if `V ≤ W` and `V ≠ W`.
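
The comparison rules translate almost directly into code. In this sketch vectors are plain Python lists of equal length, and the helper names (`leq`, `less`, `concurrent`, `merge`) are made up for illustration.

```python
def leq(v, w):
    """V <= W: every entry of V is at most the matching entry of W."""
    return all(a <= b for a, b in zip(v, w))


def less(v, w):
    """V < W: V <= W and the vectors differ."""
    return leq(v, w) and v != w


def concurrent(v, w):
    """Neither vector is <= the other: the events are causally unrelated."""
    return not leq(v, w) and not leq(w, v)


def merge(v, w):
    """Element-wise maximum, as applied on message receipt."""
    return [max(a, b) for a, b in zip(v, w)]


print(less([1, 0, 0], [1, 1, 0]))        # True: the first happened before the second
print(concurrent([2, 0, 0], [0, 1, 0]))  # True: neither dominates
print(merge([2, 0, 0], [0, 1, 0]))       # [2, 1, 0]
```

Unlike Lamport clocks, `V < W` holds exactly when the first event happened before the second, so concurrency can be detected.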

### Vector Clocks, as used for CO Multicast

To ensure causal order, when process `Pj` receives a message `m` from `Pi` (with vector `Vm`), it delays delivery of `m` until **both** conditions are met:

1.  `Vm[i] = Vj[i] + 1`
    * This ensures `m` is the very next message `Pj` expected from `Pi`.
2.  `Vm[k] ≤ Vj[k]` for all `k != i`
    * This ensures `Pj` has already delivered all messages that `Pi` had seen before it sent `m`.

![image](../images/Screenshot%202025-09-16%20at%2012.41.19.png)

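As the figure shows, the message stamped `(1,1,0)` causally depends on the one stamped `(1,0,0)`. A process that receives `(1,1,0)` first must therefore hold it back and deliver it only after `(1,0,0)` has been delivered.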
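
The two delivery conditions can be sketched as a hold-back queue. This assumes processes are numbered from 0 and that `Vj[i]` counts the messages from `Pi` delivered so far; `can_deliver` and `CausalReceiver` are hypothetical names.

```python
def can_deliver(vm, vj, i):
    """Check both causal-order conditions for a message from process i."""
    next_from_sender = vm[i] == vj[i] + 1
    seen_everything_else = all(vm[k] <= vj[k] for k in range(len(vm)) if k != i)
    return next_from_sender and seen_everything_else


class CausalReceiver:
    def __init__(self, n):
        self.v = [0] * n       # messages delivered so far, per sender
        self.held = []         # (sender, vector, payload) waiting for delivery
        self.delivered = []

    def receive(self, sender, vm, payload):
        self.held.append((sender, vm, payload))
        progress = True
        while progress:        # one delivery may unblock other held messages
            progress = False
            for item in list(self.held):
                s, vec, msg = item
                if can_deliver(vec, self.v, s):
                    self.held.remove(item)
                    self.delivered.append(msg)
                    self.v[s] += 1
                    progress = True


p2 = CausalReceiver(3)
p2.receive(1, [1, 1, 0], "reply")      # depends on P0's message, held back
p2.receive(0, [1, 0, 0], "question")   # deliverable, and unblocks the reply
print(p2.delivered)
```

The reply arrives first but is delivered second, because its vector shows that its sender had already seen one message from process 0.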

---

# LECTURE 22/09/2025

## Consensus
In distributed computing, **consensus** is the fundamental challenge of getting a group of independent processes (or nodes) to **agree on a single value**. This agreed-upon value is final. Think of it as a committee that must vote on and finalize one decision, and once made, it cannot be changed.

This is formally equivalent to **total order broadcast**, where processes must agree on the *sequence* of messages to deliver. If they can agree on the first message, then the second, then the third, and so on, they are effectively solving consensus for each message slot in the order.

### Practical Examples
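* **Multicast & Bank Accounts**: Imagine you have $100 in an account replicated across multiple servers. If you deposit $50 and simultaneously withdraw $30, all servers must agree on the order of operations. Do they process the deposit first (balance goes to $150, then $120) or the withdrawal first (balance goes to $70, then $120)? Here both orders end at $120, but the intermediate balances differ, and for operations that do not commute (for example, a $120 withdrawal that is rejected when funds are insufficient) replicas applying different orders would end in different states. They must reach a consensus on the order so that every replica stays consistent.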
* **Redundancy**:
    * **Space and Aeronautics**: The flight control computers on a spacecraft or modern airplane must agree on sensor readings and control actions. If one computer thinks the plane should pitch up and another thinks it should pitch down, they must reach a consensus to avoid a catastrophic failure.
* **Replication**:
    * **Distributed File Systems**: When you write to a file stored on Google Drive or Dropbox, multiple replicas of that data are updated. Consensus ensures all replicas agree on the latest version of the file.
    * **Ledger Technology (e.g., Blockchain)**: A blockchain is essentially a chain of consensus decisions. Miners or validators around the world must agree on which block of new transactions is the next one to be added to the chain.

### Common algorithms
* **Paxos**: A classic and highly influential algorithm for reaching consensus on a single value in an asynchronous system where nodes can crash. **Multi-Paxos** extends this to agree on a sequence of values, effectively creating a total order broadcast.
* **Raft, Viewstamped Replication, Zab**: These algorithms solve total order broadcast by default, typically by first electing a stable leader to coordinate decisions. Raft in particular was designed to be more understandable and easier to implement than Paxos.

---

## System model
In distributed systems, we must define the "rules of the game" under which an algorithm operates. This is the **system model**, and it typically specifies:
* **Network Behavior**: Are messages delivered reliably? Can they be delayed indefinitely (**asynchronous**) or is there a known maximum delay (**synchronous**)?
* **Node Behavior**: How can processes fail? Can they simply stop (**crash-fail**) or can they behave maliciously and lie (**Byzantine**)?
* **Timing Assumptions**: Do processes have access to synchronized clocks?

The choice of system model drastically affects what problems are solvable.

---

## Reliable consensus vs failures summary
Achieving consensus becomes progressively harder as the system becomes less reliable.

* **No Failures (Easy Case)**: If no process ever fails, consensus is trivial.
    1.  Every process broadcasts its proposed value to all others.
    2.  Each process waits until it has received a value from every other process.
    3.  Each process applies a simple function (like choosing the minimum value, or the first one received) to its collection of received values. Since everyone has the same set of values, they will all decide on the same outcome.

* **With Crash Failures**: If processes can crash, the simple approach fails. A process might wait forever for a message from a crashed node.
    * **The core problem**: How do you distinguish a process that is just very **slow** from one that is **dead**? This ambiguity is a central challenge in asynchronous systems.
    * **Solution**: You need a **failure detector** mechanism to handle crashed nodes, but these are often imperfect.

* **With Lies (Byzantine Failures)**: This is the hardest case. A faulty process can lie, sending value `A` to one node and value `B` to another.
    * **The trust problem**: If you receive conflicting information, how do you know who is telling the truth?
    * **Impact**: To tolerate these malicious failures, you need more nodes in total. Intuitively, you need enough honest nodes to "outvote" the liars. This significantly decreases the number of faulty nodes a system can withstand compared to simple crash failures.

---

## Requirements for consensus
For a set of processes `p_i` proposing values, we define a set of formal properties that any correct consensus algorithm must satisfy. Each process has a decision variable `d_i`, initially set to `⊥` (undecided).

* **Termination**: Eventually, every **correct** (non-faulty) process must decide on a value (i.e., set its `d_i` to something other than `⊥`). The system cannot get stuck forever.
* **Agreement**: No two correct processes decide on different values. If process `p_i` decides `v_a` and process `p_j` decides `v_b`, then it must be that `v_a = v_b`.
* **Integrity**: If all correct processes propose the same value `v`, then any correct process that decides must decide on that value `v`. This prevents trivial solutions like "always decide 0".
* **Weak Integrity (often used)**: A slightly different version states that the decided value must have been proposed by at least one of the processes. This ensures the outcome is not just made up.

A process `p_i` is in the **Decided State** as soon as its decision variable `d_i` is no longer `⊥`.

---

## Synchronous Consensus Algorithm
In a **synchronous** system, we assume that message delivery and processing happen in lock-step **rounds**. There's a known upper bound on how long a message takes to arrive. This assumption simplifies things greatly but is often unrealistic.

**Goal**: To create an algorithm that is resilient to `f` crash failures and computes the minimum proposed value.

### f-resilient (synchronous) Consensus Algorithm
The algorithm operates in `f + 1` rounds. In each round, every process broadcasts the set of values it knows about so far, and then updates its set with the values it receives from others.

```
v = { x }                       // x: value proposed by the application
B-multicast(v)
for each round i ∈ 1 ... f + 1 do
    v' = v                      // remember what was known before this round
    for each m received do
        v = v ∪ m
    end
    B-multicast(v \ v')         // send only the new values; not needed in round f+1
end
d = min(v)                      // pick the minimal value
return d
```

* **Initialization**: Each process `p_i` starts with a set of values `V_i = {v_i}`, where `v_i` is its own initial proposal.
* **Rounds**: For `k` from 1 to `f + 1`:
    1.  Each process `p_i` broadcasts its current set of values `V_i` to all other processes.
    2.  Each process `p_i` waits until the end of the round (the known delay bound) and collects the messages that arrived; messages from crashed processes may be missing. It updates its set `V_i` by taking the union of its current set and all the sets it received in this round.
* **Decision**: After `f + 1` rounds, each process `p_i` decides on the minimum value in its final set `V_i`.
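
The rounds can be simulated in a few lines. This sketch makes several assumptions: processes send their full set each round (rather than only new values), and a crash is described by a hypothetical schedule `crashes = {pid: (round, recipients_reached)}` meaning the process reaches only those recipients in that round and is dead afterwards. The function name `sync_consensus` is illustrative.

```python
def sync_consensus(proposals, f, crashes=None):
    """proposals: {pid: value}. Returns {pid: decision} for surviving processes."""
    crashes = crashes or {}
    known = {p: {v} for p, v in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, f + 2):                      # rounds 1 .. f+1
        inbox = {p: set() for p in proposals}
        for sender in sorted(alive):
            if sender in crashes and crashes[sender][0] == rnd:
                recipients = crashes[sender][1]      # partial send, then crash
            else:
                recipients = alive - {sender}
            for r in recipients:
                inbox[r] |= known[sender]            # values known at round start
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}


proposals = {1: 3, 2: 5, 3: 7, 4: 1}
crashes = {4: (1, {1})}     # process 4 reaches only process 1, then crashes
print(sync_consensus(proposals, f=1, crashes=crashes))   # two rounds: agreement
print(sync_consensus(proposals, f=0, crashes=crashes))   # one round: disagreement
```

With `f=1` the algorithm runs two rounds and everyone decides `1`. Running only one round despite the crash leaves process 1 with the value `1` and the others without it, which is exactly the situation the extra round repairs.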

#### Why does this work? (Proof Sketch)
The proof for **Agreement** works by contradiction.
* **Assume** two correct processes, `p_i` and `p_j`, decide on different minimum values, `x` and `y`, where `x < y`.
* This means that at the end of round `f+1`, `p_j`'s set of values *did not contain* `x`.
* For this to happen, the value `x` must have been "hidden" from `p_j` for all `f + 1` rounds. The only way to hide a value is if a process holding it crashes before sending it.
* But we assume at most `f` processes crash. Over `f + 1` rounds, at least one round must therefore be crash-free. In that round every process that is still alive broadcasts its set to everyone, so afterwards all surviving processes hold the same set of values, including `x` if any of them knew it.
* Therefore, it's a **contradiction** to think `p_j` never received `x`. Both `p_i` and `p_j` end round `f+1` with the same set of values and will thus decide on the same minimum.

### Theorem
A famous result in distributed computing states that any deterministic consensus algorithm that tolerates `f` crash failures in a synchronous system requires **at least `f + 1` rounds** of communication in the worst case.

---

## Byzantine Error
What if processes don't just crash, but behave unpredictably or maliciously? This is a **Byzantine error**. A Byzantine node can lie, send conflicting messages to different peers, or collude with other faulty nodes. This models the most challenging failure scenario. The name comes from Lamport's famous paper, "The Byzantine Generals Problem."

### Examples
This isn't just a theoretical or software problem; it can be caused by hardware faults.
* **Single Event Upset (SEU)**: A cosmic ray or high-energy particle strikes a memory cell, flipping a bit from 0 to 1 (or vice-versa). This can corrupt data or instructions, causing the node to behave erratically.
* **Single Event Latchup (SEL)**: A hardware error that can cause a short-circuit, leading to unpredictable behavior or permanent damage.

These issues are critical in:
* Aerospace, where radiation is higher.
* Systems using non-**ECC (Error-Correcting Code) memory**.
* High-reliability systems like **nuclear power plants** or **avionics**.

---

## Byzantine Consensus
To solve consensus with Byzantine failures, we need a stronger integrity property.

* **Byzantine Integrity**: If all **non-faulty** (i.e., correct) processes start with the same value `v`, then all non-faulty processes must decide on `v`. This ensures that a few Byzantine nodes cannot trick the honest majority into deciding on a wrong value when they already agree.

**Goal**: Design an `f`-byzantine-resilient synchronous consensus algorithm.

### The Bad News: Impossibility Result
A groundbreaking result shows that no solution can exist if the number of faulty nodes `f` is too high relative to the total number of nodes `n`. Consensus is **impossible for `f ≥ n/3`**, that is, for `n ≤ 3f`.

### The Good News
If `n > 3f`, solutions are possible. For example, to tolerate 1 Byzantine fault (`f=1`), you need at least 4 nodes in total (`n=4`). To tolerate 2 (`f=2`), you need at least 7 (`n=7`).
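
The bound is easy to turn into arithmetic. The helper names below are made up for illustration; they just restate `n > 3f`.

```python
def min_nodes_for_byzantine(f):
    """Smallest n with n > 3f."""
    return 3 * f + 1


def max_byzantine_faults(n):
    """Largest f with n > 3f."""
    return (n - 1) // 3


for f in (1, 2, 3):
    print(f, min_nodes_for_byzantine(f))
```

Adding nodes only helps in steps of three: going from 4 to 6 nodes still tolerates just one Byzantine fault.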

### f-byzantine resilience?
Let's see why `n=3, f=1` is impossible.
Imagine a Commander (C) sending an order ("attack" or "retreat") to two Lieutenants (L1, L2). One of them is a traitor.

* **Scenario**: The Commander is the traitor. C tells L1 to "attack" and L2 to "retreat". Now L1 and L2 have conflicting information. L1 tells L2 "C told me to attack", and L2 tells L1 "C told me to retreat". L1 knows one of them is a traitor, but it could be C or L2. L2 faces the same dilemma. They cannot agree.

### Byzantine Non-Consensus larger n simulation
This is a proof technique to show that if a solution existed for some larger `n ≤ 3f`, it would lead to a contradiction.

* **Proof by Reduction**: Let's **assume** we have a magical algorithm that solves Byzantine consensus for `n` generals with `f` traitors, where `n ≤ 3f`. We will use it to solve the `n=3, f=1` problem, which we just saw is impossible, thereby proving that the assumption was wrong.
* **The Simulation**:
    1.  Take three real generals and split the `n` generals of the magical algorithm into three groups of at most `f` each (possible because `n ≤ 3f`).
    2.  Each real general *simulates* all the generals in one group, running their code and relaying their messages.
    3.  They run the magical algorithm on these simulated roles, and each real general decides what its simulated generals decide.
* **The Contradiction**: If one real general is a traitor, only the generals it simulates misbehave, which is at most `f` simulated traitors. The magical algorithm tolerates `f` traitors, so the simulated generals hosted by the two honest real generals all reach consensus, and therefore so do the two honest real generals.
* But we know consensus is impossible for `n=3, f=1`! Since our "magical" algorithm allowed us to solve an unsolvable problem, the magical algorithm itself cannot exist. This is why `n ≤ 3f` is impossible in general.

---

## Three Equivalent Problems
These three problems are different formulations of the same core challenge and can be transformed into one another.

1.  **Consensus**:
    * **Goal**: All processes propose a value `v_i`; they must agree on a single one.
    * **Properties**: Termination, Agreement, Integrity.

2.  **Byzantine Generals**:
    * **Goal**: A single Commander issues an order to `n-1` Lieutenants. They must all agree on the order received.
    * **Properties**: Termination, Agreement, and a special **Integrity**: If the Commander is correct, all correct Lieutenants must decide on the Commander's proposed order. (Note: if the Commander is faulty, they just need to agree on *some* order).

3.  **Interactive Consistency**:
    * **Goal**: Every process `p_i` proposes a value `v_i`. All correct processes must agree on the *same vector* of values `V = (v_1, v_2, ..., v_n)`.
    * **Properties**: Termination, Agreement (on the whole vector), and **Integrity**: If process `p_i` is correct, then the `i`-th component of the decided vector must be `v_i`.

### Equivalence of the problems
* **Byzantine Generals (BG) to Interactive Consistency (IC)**: To agree on a vector, simply run the BG algorithm `n` times. In the first run, `p_1` acts as Commander. In the second, `p_2` acts as Commander, and so on. The final vector is built from the outcomes of each run.
* **Interactive Consistency (IC) to Consensus (C)**: First, run IC to get an agreed-upon vector of proposals. Then, each process independently applies a deterministic function (e.g., `min()`, `max()`, `majority()`) to that vector to compute a single final value. Since they all start with the same vector and apply the same function, they will arrive at the same consensus value.
* **Consensus (C) to Byzantine Generals (BG)**: The Commander sends its value to all Lieutenants. Then, every process (including the Commander) initiates a Consensus round, proposing the value it received (or its own value, if it's the Commander). Because the Consensus algorithm can tolerate traitors, the honest nodes will agree on a single value, achieving the BG goal.

---

## Byzantine Generals Algorithm (f=1)
This is a simple synchronous algorithm that solves the problem for `n=4, f=1`. It takes two rounds of communication.

```
// Executed by the Commander
def Commander:
  let v = value from application         // e.g., "attack"
  B-multicast(v) to all Lieutenants      // Round 1: send the order

// Executed by each Lieutenant
def Lieutenant:
  let v = value received from Commander
  let i = my unique process id
  // Round 2: relay the order you received to everyone else
  B-multicast(i : v) to all other Lieutenants
  // Wait for the relayed values from the other n-2 Lieutenants
  // Decide by majority over v and the relayed values
  let d = majority(v, relayed values)    // if there is no majority, use a default
  return d
```

**Why it works for `n=4, f=1`**:
* **Case 1: The Commander is honest** (so at most one Lieutenant is a traitor). Every honest Lieutenant receives the correct order from the Commander. After the exchange, each honest Lieutenant has at least 2 votes for the correct order (one from the Commander, one from the other honest Lieutenant) and at most 1 vote for whatever the traitor says. The majority vote is the correct order.
* **Case 2: The Commander is the traitor**. Then all 3 Lieutenants are honest and relay faithfully whatever they received. Each of them ends up with the same three values, so the majority vote (or the default, if there is no majority) is the same everywhere. They agree, which is all that is required when the Commander is faulty.
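
Both cases can be checked with a small simulation. This is a sketch under assumptions: the default order is `"retreat"`, a traitor is modelled by a hypothetical `relay(src, dst, value)` function that may return something other than `value`, and a traitorous Commander is modelled by giving different Lieutenants different orders.

```python
from collections import Counter

DEFAULT = "retreat"   # assumed default when there is no strict majority


def majority(votes):
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(votes) / 2 else DEFAULT


def byzantine_generals(commander_orders, relay):
    """commander_orders: {lieutenant: order it received from the Commander}.

    relay(src, dst, value) is what src tells dst it received.
    Returns {lieutenant: decision}.
    """
    decisions = {}
    for me in commander_orders:
        votes = [commander_orders[me]]                 # round 1: from the Commander
        for other in commander_orders:
            if other != me:                            # round 2: relayed values
                votes.append(relay(other, me, commander_orders[other]))
        decisions[me] = majority(votes)
    return decisions


honest = lambda src, dst, value: value
lying_l3 = lambda src, dst, value: "retreat" if src == "L3" else value

# Honest Commander, traitor L3: the honest L1 and L2 still decide "attack".
print(byzantine_generals({"L1": "attack", "L2": "attack", "L3": "attack"}, lying_l3))
# Traitor Commander sends mixed orders: all Lieutenants still agree.
print(byzantine_generals({"L1": "attack", "L2": "retreat", "L3": "attack"}, honest))
```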

---

## Fixing the Async Problem
In a purely asynchronous system, the famous **FLP Impossibility Result** proves that there is no deterministic algorithm that can solve consensus while tolerating even a single crash failure. The core issue is the inability to distinguish a crashed node from a very slow one.

So how do we build real systems?
* **Use randomness**: If we allow algorithms to use random numbers, we can design protocols that terminate with probability 1. Any single execution may take arbitrarily long, but the probability that it runs forever without deciding is zero.

---

## Paxos

### What is it?
**Paxos** is a family of protocols for solving consensus in an asynchronous network where processors may fail by crashing (it does not handle Byzantine failures). It was created by Leslie Lamport.

* **Key features**:
    * It does **not** rely on a fixed coordinator/leader.
    * It works in an **asynchronous** system.
    * It is resilient to up to `⌊(n-1)/2⌋` crash failures, i.e. it needs a majority of nodes to stay alive.
    * It prioritizes **safety (Agreement)** over **liveness (Termination)**. This means it will never allow two nodes to decide differently, but it's not guaranteed to make progress and decide at all.



### The Paxos Nodes
Paxos operates by electing a temporary "leader" (called a **Proposer**) for a specific decision. Any node can try to become a proposer, and every node also acts as an **Acceptor**, voting on proposals.
* A node can become a **Proposer** at any time.
* All nodes are **Acceptors**.
* Nodes that learn the final outcome are **Learners**. In practice, all nodes often play all three roles.

### Steps
Paxos works in two phases to decide on a single value.

**Phase 1: Prepare/Promise (Electing a Leader)**
1.  A **Proposer** decides it wants to lead. It picks a proposal number `n` that is unique and higher than any number it has used before. It sends a `Prepare(n)` message to a majority of Acceptors.
2.  An **Acceptor** receives `Prepare(n)`.
    * If `n` is higher than any proposal number it has promised to listen to before, it responds with a `Promise(n)` message. This is a promise to not accept any proposals with a number less than `n`.
    * **Crucially**: If the Acceptor has *already accepted* a value `val_prev` from a previous proposal `n_prev`, it must include `(n_prev, val_prev)` in its `Promise` response.
    * If `n` is not the highest it has seen, it ignores the message.

**Phase 2: Accept/Accepted (Deciding on a Value)**
3.  The **Proposer** waits for `Promise` responses. If it receives them from a **majority** of Acceptors, it is now the leader for proposal `n`. It then chooses a value `val` to propose.
    * **The Rule**: If any of the `Promise` responses it received contained a previously accepted value, the Proposer **must** choose the value `val_prev` associated with the highest proposal number `n_prev` it saw. Otherwise, it is free to propose its own initial value.
    * It then sends an `Accept(n, val)` message to a majority of Acceptors.
4.  An **Acceptor** receives `Accept(n, val)`.
    * If it has not made a newer promise (to a proposal number higher than `n`), it accepts the value and sends an `Accepted(n, val)` message to all nodes (who act as Learners).
    * Once a Learner sees `Accepted` messages from a majority of nodes for the same value, that value is **decided**.
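
The two phases can be sketched for a single decision. This is a simplified, single-process illustration and not a full implementation: method calls stand in for messages, the proposer contacts all acceptors rather than just a majority, Learners are omitted, and the names `Acceptor` and `propose` are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted = None    # (number, value) last accepted, or None

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return None             # stale proposal number: ignore

    def accept(self, n, value):
        """Phase 2: accept unless a newer promise has been made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, my_value):
    """Run both phases; returns the decided value or None on failure."""
    majority = len(acceptors) // 2 + 1
    promises = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
    if len(promises) < majority:
        return None
    previous = [acc for _, acc in promises if acc is not None]
    # The Rule: adopt the value of the highest-numbered accepted proposal.
    value = max(previous, key=lambda acc: acc[0])[1] if previous else my_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "X"))   # X is decided
print(propose(acceptors, 2, "Y"))   # still X: the new proposer must adopt it
print(propose(acceptors, 1, "Z"))   # None: proposal number 1 is stale
```

The second call shows the safety rule at work: a later proposer with its own value `"Y"` ends up re-proposing `"X"`.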

---

## The proof of paxos

We will not go through the formal proof, as it involves a very large and detailed analysis of all possible message orderings and failure scenarios. The key safety property relies on the rule in Step 3: forcing a new leader to continue with a value that might have already been decided ensures that once a value is chosen, it can never be changed.

However, Paxos can fail to terminate. This does not violate its safety guarantee, but it does violate the **Termination** requirement for consensus.

* **Practical Example of Non-Termination (Dueling Proposers)**:
    1.  Proposer P1 sends `Prepare(n=10)` and gets promises from a majority of Acceptors (A, B, C). P1 is now leader.
    2.  Before P1 can send its `Accept` message, another Proposer P2 wakes up, chooses a higher number, and sends `Prepare(n=11)` to the same Acceptors.
    3.  Acceptors A, B, and C see this higher proposal number. They respond to P2 with `Promise(n=11)`, and will now ignore any messages related to `n=10`.
    4.  P1 finally sends its `Accept(n=10, value="X")` message, but it is ignored by the majority because they've promised to listen to `n=11`. P1's proposal fails.
    5.  Now P2 has a majority of promises and is the leader. But before it can send its `Accept` message, P1 realizes it failed, chooses an even higher number, and sends `Prepare(n=12)`.
    6.  This cycle can repeat indefinitely, with the two proposers constantly preempting each other, and no value is ever decided. This is a "livelock" situation. In practice, randomized timeouts are used to make this scenario highly unlikely.

---

## Resources and alternatives
* **Google TechTalk on Paxos**: A good video resource for understanding the algorithm in more detail.
* **Raft Algorithm Illustration**: [https://raft.github.io/](https://raft.github.io/) provides an excellent interactive visualization of Raft, an alternative to Paxos designed for understandability.

---

## Heartbeat for Synchronized Systems
A **heartbeat** is a common mechanism for failure detection in systems that are not fully asynchronous.
* **How it works**:
    1.  You guess a reasonable upper bound for message delay, `D`.
    2.  Each process sends a "beat" message to others every `T` seconds.
    3.  If a process hasn't received a beat from another process in the last `T + D` seconds, it **suspects** that the process has crashed.

* **The Trade-off**:
    * If `D` is **too small**, you get **inaccurate** detections (false suspicions). A slow but perfectly alive process might be declared dead.
    * If `D` is **too large**, your detection is **slow**. A dead process might be considered alive for a long time (a "zombie").

This shows we can only ever **suspect** a crash in a distributed system; we can never be 100% certain.
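
A minimal sketch of such a detector. To keep it deterministic, the current time is passed in explicitly instead of being read from a real clock; the class name `HeartbeatDetector` and its methods are hypothetical.

```python
class HeartbeatDetector:
    def __init__(self, period, max_delay):
        self.timeout = period + max_delay   # T + D
        self.last_beat = {}                 # process -> time of last beat received

    def beat(self, process, now):
        self.last_beat[process] = now

    def suspects(self, now):
        """Processes whose last beat is older than T + D."""
        return {p for p, t in self.last_beat.items() if now - t > self.timeout}


detector = HeartbeatDetector(period=1.0, max_delay=0.5)
detector.beat("p", 0.0)
detector.beat("q", 0.0)
detector.beat("p", 1.0)
print(detector.suspects(2.0))   # {'q'}: no beat from q for more than 1.5 s
detector.beat("q", 2.1)         # q was only slow, the suspicion was wrong
print(detector.suspects(2.2))   # set()
```

The late beat from `q` illustrates the point above: the detector can only suspect, and a suspicion may later turn out to be false.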

---

# LECTURE 23/09/2025

## Mutual Exclusion

What is mutual exclusion (mutex)?
A mutex (mutual exclusion lock) is a synchronization primitive that prevents multiple processes from concurrently accessing a **critical section** or a **shared resource**. It ensures that when one process is executing code within the critical section, no other process can enter it, guaranteeing exclusive access.

### Examples
* **Printing:** A print spooler must grant access to the printer to only one document at a time to prevent the pages of different documents from being interleaved.
* **Using Coffee Machine:** Only one person can brew a coffee at a time. The machine is a shared resource, and the process of making coffee is the critical section.
* **Writing a file:** If two processes write to the same file simultaneously, the data can become corrupted. A mutex ensures one process finishes its write operation before another can begin.
* **Changing the state of an actuator:** An actuator, like a robotic arm, can only receive one command at a time to move to a specific position. Conflicting commands could cause damage or unpredictable behavior.
* **Wireless/Wired Communication:** Processes need exclusive access to the communication channel (e.g., a specific frequency or Ethernet cable) to transmit a packet of data without causing collisions and data corruption.

---

## System model
What is a (computer science) process?
A process is an instance of a running computer program. In distributed systems, we can formally model a process `p` as a state machine. The tuple `p = (S, s₀, M, →)` consists of:

* a set of states **S**, representing all possible conditions the process can be in.
* an initial state **s₀ ∈ S**, where the process begins.
* a set of messages **M** that it can send or receive, including the empty message `ϵ` for internal state changes.
* and a transition function **→**, which defines how a process changes its state `s ∈ S` upon receiving a message `m ∈ M`, and what messages it sends to other processes as a result.
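
As an illustration of this model, here is a small sketch in which a process is a current state plus a transition function. The empty message is represented by `None`, and the example process (a client that requests, holds, and releases a lock) as well as the names `Process` and `client` are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass
class Process:
    state: Hashable            # current state, starts at the initial state
    transition: Callable       # (state, message) -> (new_state, outgoing messages)

    def step(self, message=None):
        """Apply one transition; None plays the role of the empty message."""
        self.state, outgoing = self.transition(self.state, message)
        return outgoing


def client(state, message):
    if state == "idle" and message is None:
        return "waiting", ["REQUEST"]      # internal step: ask for access
    if state == "waiting" and message == "GRANT":
        return "in_cs", []                 # access granted
    if state == "in_cs" and message is None:
        return "idle", ["RELEASE"]         # internal step: leave and release
    return state, []                       # anything else is ignored


p = Process("idle", client)
print(p.step(), p.state)
print(p.step("GRANT"), p.state)
print(p.step(), p.state)
```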

---

## Type of communication
Mutex algorithms can be designed for different underlying system architectures.
* **Message Passing:** Processes are independent and communicate by sending and receiving messages over a network. They do not share memory. Exclusive access is coordinated through explicit communication.
    * **Example:** A client-server application where clients send requests to a server to access a database.
* **Shared Memory:** Processes have access to a common area of memory. They can communicate and synchronize by reading and writing to shared variables. This is more common in multi-threaded applications on a single machine.
    * **Example:** Two threads updating a shared counter variable. They need a mutex to ensure the update operation (`read-increment-write`) is atomic.

We will start with message passing.

---

## Assumptions

* **Process Failures:** Processes can fail by **crashing**. Once a process crashes, it stops executing and does not recover or send any more messages (it stays dead).
* **Direct Communication:** Processes can send messages directly to any other process. We don't need to worry about routing or message forwarding.
* **Reliable Communication:** The communication channels are reliable, meaning messages are not lost, corrupted, or duplicated.
    * **Synchronous:** There's a known upper bound on the time it takes for a message to be delivered. This makes it easier to detect failures.
    * **Asynchronous:** A message will eventually be delivered, but there's no guarantee on how long it will take. The underlying protocol handles re-transmissions for reliability.
* **Network Partitions:** We assume that if the network splits (partitions), it will eventually heal, and processes will be able to communicate again.

---

## Requirements (Mutex Algorithms)

1.  **Safety (Correctness):** At most one process can be in its critical section at any given time. This is the fundamental "mutual exclusion" property. Violating this leads to race conditions and corrupted data.
2.  **Liveness (Progress):** Every request to enter the critical section is eventually granted. This ensures the system does not grind to a halt. It prevents **deadlock** (where processes are stuck waiting for each other) and **starvation** (where a process is indefinitely denied access).
3.  **Ordering/Fairness:** If one request to enter the critical section happens-before another, access should be granted in that same order. This is typically based on logical clocks (like Lamport timestamps) and ensures fairness, preventing a process that requested first from being overtaken by later requests.

---

## Properties (Mutex Algorithms)
When evaluating different mutex algorithms, we consider these key metrics:
▶ **Fault tolerance:** How does the algorithm behave if one or more processes crash? Can the system continue to function, or does it deadlock?
▶ **Performance:**
    ▶ **Message Complexity:** The number of messages required per entry into the critical section. This is a primary measure of network overhead.
    ▶ **Client Delay:** The time a process has to wait from the moment it requests entry into the critical section until it is granted access.
    ▶ **Synchronization Delay:** The time between one process exiting the critical section and the next process being granted entry. This is a good measure of system throughput.
    ▶ **Bandwidth:** The total amount of data transmitted, which is proportional to the number and size of messages sent for each critical section entry and exit.

---

## Centralized Algorithm - With a Token
This is the most straightforward approach, mimicking a real-world queue manager. 
▶ A single, designated **coordinator** process manages access to the resource.
▶ The coordinator holds a **token**. A process can only enter the critical section if it possesses this token.
▶ The coordinator maintains a FIFO (First-In, First-Out) queue of requests.

A process that wants to enter the critical section:
1.  Sends a `REQUEST` message to the coordinator.
2.  Waits until it receives a `GRANT` message (the token) from the coordinator.
3.  Enters the critical section, does its work, and exits.
4.  Sends a `RELEASE` message (returns the token) to the coordinator.

The coordinator's logic:
1.  When it receives a `REQUEST`, it checks if the token is available.
2.  If the token is available, it sends `GRANT` to the requesting process.
3.  If the token is in use, it adds the request to its queue.
4.  When it receives a `RELEASE`, it takes the next process from the queue (if any) and sends it the `GRANT` message.
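
The coordinator's four rules can be sketched as a small in-memory class. This is only an illustration: message passing is modelled as plain method calls, and the names `Coordinator`, `on_request`, `on_release` and the `sent` log are assumptions made for this sketch, not part of the lecture.

```python
from collections import deque


class Coordinator:
    """Hypothetical coordinator; messages are modelled as method calls."""

    def __init__(self):
        self.holder = None      # process currently holding the token, or None
        self.queue = deque()    # FIFO queue of waiting process ids
        self.sent = []          # log of (message, destination) pairs

    def on_request(self, pid):
        if self.holder is None:
            self._grant(pid)            # token available: grant at once
        else:
            self.queue.append(pid)      # token in use: queue the request

    def on_release(self, pid):
        assert pid == self.holder, "only the token holder may release"
        self.holder = None
        if self.queue:
            self._grant(self.queue.popleft())   # serve the oldest request

    def _grant(self, pid):
        self.holder = pid
        self.sent.append(("GRANT", pid))


coord = Coordinator()
coord.on_request("p1")   # token free, granted immediately
coord.on_request("p2")   # token busy, queued
coord.on_request("p3")   # queued behind p2
coord.on_release("p1")   # p2 is next in FIFO order
coord.on_release("p2")   # then p3
print(coord.sent)
```

The grants come out in request order, which is the FIFO behaviour the queue is meant to provide.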

### Properties (Centralized Algorithm)

**Requirements**
▶ **Safety:** Yes. The coordinator only gives the token to one process at a time.
▶ **Liveness:** Yes. As long as the coordinator doesn't crash, every request is queued and will eventually be served.
▶ **Ordering:** Yes, FIFO ordering is provided by the coordinator's queue.

**Properties**
▶ **Message Complexity:** 3 messages per CS entry (`REQUEST`, `GRANT`, `RELEASE`).
▶ **Client Delay:** 1 round trip time (`REQUEST` + `GRANT`).
▶ **Synchronization Delay:** 2 message delays (`RELEASE` to the coordinator + `GRANT` to the next process).
▶ **Fault Tolerance:** Poor. If the **coordinator crashes**, the entire system halts. This is a single point of failure. If a process crashes while holding the token, all waiting processes also starve unless the coordinator has a timeout mechanism.

---

## Token Ring Algorithm (no leader)
This algorithm organizes processes into a logical ring, where each process knows its successor. 💍
▶ A single **token** circulates continuously around the ring.
▶ A process `pi` waits for the token to arrive from its predecessor.
▶ When `pi` receives the token:
    * If `pi` needs to enter the critical section, it holds the token, enters the CS, and upon exit, passes the token to its successor.
    * If `pi` does not need to enter the critical section, it immediately passes the token to its successor.
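
A minimal simulation of the circulating token, assuming processes are numbered `0..n-1` around the ring and each process in `wants` needs the critical section exactly once. The function name `token_ring` and its parameters are made up for this sketch.

```python
def token_ring(n, wants, start=0):
    """Circulate the token from `start` until every request is served.

    Returns (order in which processes entered the CS, number of token hops).
    """
    pending = set(wants)
    order = []
    holder = start
    hops = 0
    while pending:
        if holder in pending:
            order.append(holder)        # holder enters and leaves the CS
            pending.discard(holder)
        if pending:
            holder = (holder + 1) % n   # pass the token to the successor
            hops += 1
    return order, hops


order, hops = token_ring(n=5, wants={3, 1}, start=2)
print(order, hops)
```

With the token starting at process 2, process 3 is served before process 1 regardless of which of them asked first: the order depends only on ring position.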

### Properties
**Requirements**
▶ **Safety:** Yes. Only the process holding the token can enter the critical section.
▶ **Liveness:** Yes. The token continuously circulates, so every process will eventually get a chance to enter.
▶ **Ordering:** No. Access order is determined by the process's position in the ring, not by the time of the request.

**Properties**
▶ **Client Delay:** Varies. Best case is 0 (token arrives just as it's needed). Worst case is the time it takes for the token to make a full circle, which involves `n` messages.
▶ **Message Complexity:** Between 1 (if every process wants to enter) and infinity (if no process wants to enter, the token just circulates endlessly).
▶ **Synchronization Delay:** Between 1 and `n` message hops.
▶ **Fault Tolerance:** Poor. If a process crashes, the ring is broken, and the token can be lost. If the token itself is lost (e.g., due to a message drop in an unreliable network), the system deadlocks. Detecting these failures is complex.

---

## Ricart and Agrawala's Algorithm
This is a fully distributed algorithm where processes achieve consensus to grant access. It uses timestamps to create a total ordering of requests. 🕒
▶ **Core Idea:** A process that wants to enter the critical section must get permission from every other process. "He who asks first, gets served first."
▶ **Secret Ingredient:** **Lamport Clocks** are used to assign a unique, ordered timestamp (`ts`, `pid`) to every `REQUEST` message. This breaks ties and ensures all processes agree on the request order.

### Lamport Clocks (reminder)
▶ Each process `pi` maintains a local logical clock `Ci`.
▶ Before sending a message, `pi` increments `Ci`. The message is timestamped with `Ci`.
▶ When a process `pj` receives a message with timestamp `T`, it updates its own clock: `Cj = max(Cj, T) + 1`. This ensures causality is captured.
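
The two clock rules fit in a few lines. The class and method names below are assumptions made for this sketch:

```python
class LamportClock:
    """Logical clock following the send/receive rules above."""

    def __init__(self):
        self.time = 0

    def send(self):
        # Increment before sending; the message carries the new value.
        self.time += 1
        return self.time

    def receive(self, msg_ts):
        # Jump past both the local clock and the message timestamp.
        self.time = max(self.time, msg_ts) + 1
        return self.time


p1, p2 = LamportClock(), LamportClock()
t1 = p1.send()          # p1's clock becomes 1
t2 = p1.send()          # p1's clock becomes 2
r = p2.receive(t2)      # p2's clock becomes max(0, 2) + 1 = 3
t3 = p2.send()          # p2's clock becomes 4
print(t1, t2, r, t3)
```

Every receive is stamped later than the matching send, which is the property Ricart and Agrawala rely on when comparing requests.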

### Algorithm
When a process `pi` wants to enter the critical section:
1.  It sets its state to `WANT`.
2.  It creates a request with the current Lamport timestamp `(ts, i)` and multicasts a `REQUEST` message to all other processes.
3.  It waits for a `REPLY` from every other process. Once all replies are received, it enters the critical section (state = `USE`).

When a process `pj` receives a `REQUEST(ts, i)` from `pi`:
1.  If `pj` is in state `USE` (in the CS), it defers its reply.
2.  If `pj` is in state `WANT`, it compares the timestamp of its own request `(ts', j)` with the incoming request `(ts, i)`. If `(ts, i)` is smaller (happened earlier), it sends a `REPLY`. Otherwise, it defers the reply.
3.  If `pj` is in state `FREE`, it sends a `REPLY` immediately.

After exiting the critical section, `pi` changes its state to `FREE` and sends a `REPLY` to all deferred requests.
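
The reply-or-defer decision is the heart of the algorithm, so here is a hedged sketch of one process's local logic. Network delivery is left out, and `RAProcess` and its method names are hypothetical. Requests are compared as `(ts, pid)` tuples, so ties on `ts` are broken by process id.

```python
class RAProcess:
    """Local state machine of one process (no real networking)."""

    def __init__(self, pid):
        self.pid = pid
        self.state = "FREE"
        self.my_request = None   # (ts, pid) of the pending own request
        self.deferred = []       # ids of processes whose REPLY is deferred

    def request(self, ts):
        self.state = "WANT"
        self.my_request = (ts, self.pid)
        return self.my_request

    def on_request(self, req):
        """Return True if a REPLY is sent now, False if it is deferred."""
        in_cs = self.state == "USE"
        i_am_earlier = self.state == "WANT" and self.my_request < req
        if in_cs or i_am_earlier:
            self.deferred.append(req[1])
            return False
        return True

    def enter(self):
        self.state = "USE"

    def exit(self):
        """Leave the CS and return the ids that now receive a REPLY."""
        self.state = "FREE"
        self.my_request = None
        released, self.deferred = self.deferred, []
        return released


p1, p2 = RAProcess(1), RAProcess(2)
r1 = p1.request(ts=5)             # (5, 1)
r2 = p2.request(ts=5)             # (5, 2): same ts, larger pid
p2_replies = p2.on_request(r1)    # (5, 1) is earlier, so p2 replies
p1_replies = p1.on_request(r2)    # p1's own request is earlier, so it defers
p1.enter()
released = p1.exit()              # p1 now answers the deferred request
print(p2_replies, p1_replies, released)
```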

### Properties (Ricart and Agrawala)
**Requirements**
▶ **Safety:** Yes. A process `pi` can only enter if it has received a `REPLY` from all others. Another process `pj` will not send its `REPLY` to `pi` if `pj` has an earlier request or is already in the CS.
▶ **Liveness:** Yes. Request timestamps `(ts, pid)` are totally ordered, so the pending request with the smallest timestamp is never deferred by another waiting process. A cycle of processes waiting for each other cannot form (no deadlock).
▶ **Ordering:** Yes, by Lamport timestamp.

**Properties**
▶ **Message Complexity:** `2(n-1)` messages per CS entry. This consists of `n-1` `REQUEST` messages and `n-1` `REPLY` messages.
▶ **Synchronization Delay:** One message propagation time.
▶ **Fault Tolerance:** Poor. If any process crashes, it will not send its `REPLY`, causing all other requesting processes to block forever.

---

## Maekawa's Algorithm
This algorithm optimizes Ricart & Agrawala by requiring a process to get permission from only a subset of other processes, called a **voting set** or **quorum**. 🗳️
▶ **Core Idea:** Instead of asking everyone, ask a cleverly chosen subset of processes. The subsets are designed to overlap, so two processes can never both collect every vote of their voting sets at the same time.

### Voting Set
A voting set `Vi` for a process `pi` must satisfy two conditions:
1.  `pi ∈ Vi` (A process is always in its own voting set).
2.  `∀ i, j : Vi ∩ Vj ≠ ∅` (Any two voting sets must have at least one common member).

This intersection property is key to safety: the common member acts as an arbiter who will only grant permission to one of the two competing processes at a time. To minimize message complexity, we want the size of the voting sets, `K = |Vi|`, to be as small as possible. The smallest sets that still intersect pairwise have size `K ≈ √n`.

A common way to construct these sets is to arrange the `n` processes in a `√n × √n` grid. The voting set for a process `pi` then consists of all processes in the same row and column as `pi`. This simple construction gives sets of size `2√n - 1`, which is not minimal but still `O(√n)`.
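
The grid construction can be checked mechanically. The sketch below assumes `n` is a perfect square and numbers processes `0..n-1` row by row. The helper name `grid_voting_sets` is made up for illustration.

```python
import math
from itertools import combinations


def grid_voting_sets(n):
    """Voting set of p = all processes in p's row and column of the grid."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("this sketch assumes n is a perfect square")
    sets = []
    for p in range(n):
        row, col = divmod(p, k)
        row_members = {row * k + c for c in range(k)}
        col_members = {r * k + col for r in range(k)}
        sets.append(row_members | col_members)
    return sets


V = grid_voting_sets(9)
print(V[4])   # centre of the 3 x 3 grid: its row {3, 4, 5} and column {1, 4, 7}

# Both voting-set conditions hold for every process and every pair.
own_membership = all(p in V[p] for p in range(9))
pairwise_overlap = all(V[i] & V[j] for i, j in combinations(range(9), 2))
print(own_membership, pairwise_overlap)
```

Any row crosses any column, which is why two grid-based voting sets always share at least one process.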



### Algorithm
Similar to Ricart & Agrawala, but communication is limited to the voting set `Vi`.
1.  **Request:** To enter the CS, `pi` sends `REQUEST` to every process in its voting set `Vi`.
2.  **Wait:** `pi` waits for a `GRANT` message from every process in `Vi`.
3.  **Enter CS:** Once all grants are received, `pi` enters the CS.
4.  **Release:** After exiting, `pi` sends a `RELEASE` to every process in `Vi`.

Each process `pj` will only send a `GRANT` to one request at a time. It keeps other requests queued.

### Properties (Maekawa's Algorithm)

**Requirements**
▶ **Safety:** Yes, guaranteed by the non-empty intersection of voting sets.
▶ **Liveness:** No. This algorithm is prone to deadlock. For example, `p1` might be waiting for a grant from `p2`, while `p2` is waiting for a grant from `p3`, who is waiting for a grant from `p1`.
▶ **Ordering:** No, not inherently.

**Properties**
▶ **Message Complexity:** `3√n` (`√n` REQUEST, `√n` GRANT, `√n` RELEASE). This is a significant improvement over `2(n-1)`.
▶ **Synchronization Delay:** 2 message propagation times (`RELEASE` then `GRANT`).
▶ **Fault Tolerance:** Poor. If a process in a voting set crashes, any process that needs its vote will starve.
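
The deadlock scenario mentioned under Liveness can be reproduced with three processes. The voting sets below are a hypothetical choice that satisfies both voting-set conditions. `Voter` models only the rule that each process grants one request at a time.

```python
from collections import deque


class Voter:
    """Grants one vote at a time; later requests wait in a queue."""

    def __init__(self):
        self.voted_for = None
        self.waiting = deque()

    def on_request(self, pid):
        if self.voted_for is None:
            self.voted_for = pid
            return True          # GRANT sent
        self.waiting.append(pid)
        return False             # request queued


# Hypothetical voting sets: every pair shares exactly one member.
voting_sets = {1: [1, 2], 2: [2, 3], 3: [3, 1]}
voters = {p: Voter() for p in voting_sets}
grants = {p: 0 for p in voting_sets}

# Step 1: each requester reaches the voter inside its own process first.
for p in voting_sets:
    grants[p] += int(voters[p].on_request(p))

# Step 2: the requests to the other set members arrive afterwards.
for p, members in voting_sets.items():
    for v in members:
        if v != p:
            grants[p] += int(voters[v].on_request(p))

missing = {p: len(voting_sets[p]) - grants[p] for p in voting_sets}
print(grants, missing)
```

Every process holds its own vote and waits for one more that is held by another waiting process, so nobody ever enters. Avoiding this requires extra machinery on top of the basic protocol, such as ordering requests by timestamp.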

---

## Overview of Mutex Algorithms
| Algorithm | Messages per Entry | Synchronization Delay | Key Problem(s) |
| :--- | :---: | :---: | :--- |
| **Centralized** | 3 | 2 message delays | Coordinator crash (single point of failure) |
| **Token Ring** | 1 to ∞ | 1 to `n` delays | Lost token, process crash breaks the ring |
| **Ricart & Agrawala** | `2(n-1)` | 1 message delay | Crash of any process blocks the system |
| **Maekawa** | `3√n` | 2 message delays | Prone to deadlock, crash in voting set |
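
To get a feel for the "Messages per Entry" column, the formulas can be evaluated for a few system sizes. The function below is only a calculator for the table's expressions. It assumes `n` is a perfect square so that `√n` is an integer, and it leaves out the token ring because its cost is unbounded.

```python
import math


def messages_per_entry(n):
    """Evaluate the table's message-count formulas for n processes."""
    root = math.isqrt(n)
    if root * root != n:
        raise ValueError("this sketch assumes n is a perfect square")
    return {
        "centralized": 3,
        "ricart_agrawala": 2 * (n - 1),
        "maekawa": 3 * root,
    }


for n in (4, 16, 100):
    print(n, messages_per_entry(n))
```

For `n = 4` the two distributed algorithms cost the same (6 messages). The gap only opens up as `n` grows.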

---

## Mutual exclusion on shared memory
In this model, processes run on the same machine (e.g., as threads) and can read/write to common memory locations. Synchronization is achieved by manipulating shared variables. Operations like `read` and `write` to a single memory location are assumed to be **atomic**.

A classic example for two processes is **Peterson's Algorithm**.
▶ **Shared Variables:**
    ▶ `flags: array[2] of boolean` (initialized to `false`)
    ▶ `turn: integer`
▶ **Entry Protocol for Process `i` (where `j` is the other process):**
    ```
    flags[i] = true;         // I want to enter
    turn = j;                // I yield to the other process
    while (flags[j] && turn == j) {
      // busy-wait
    }
    ```
▶ **Exit Protocol for Process `i`:**
    ```
    flags[i] = false;        // I am done
    ```
**How it works:** Process `i` signals its intent by raising its flag. It then gives the other process `j` priority by setting `turn = j`. It enters the critical section as soon as either `j` does not want to enter (`flags[j]` is false) or `j` has meanwhile yielded by setting `turn = i` (so `turn != j`).
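
Because the state space is tiny, the safety claim can be checked by brute force. The sketch below explores every interleaving of two processes that run the entry and exit protocol in a loop, assuming each single read or write is one atomic step (sequential consistency). The program-counter encoding and the names `step` and `explore` are assumptions of this sketch.

```python
def step(state, i):
    """Successor state after process i executes one atomic step.

    Program counter values:
      0: flags[i] = true      1: turn = j
      2: read flags[j]        3: read turn
      4: in the critical section (next step: flags[i] = false)
    """
    pcs, flags, turn = state
    pcs, flags = list(pcs), list(flags)
    j = 1 - i
    pc = pcs[i]
    if pc == 0:
        flags[i] = True
        pcs[i] = 1
    elif pc == 1:
        turn = j
        pcs[i] = 2
    elif pc == 2:
        pcs[i] = 3 if flags[j] else 4    # other not interested: enter
    elif pc == 3:
        pcs[i] = 2 if turn == j else 4   # keep spinning or enter
    else:
        flags[i] = False                 # exit protocol
        pcs[i] = 0
    return (tuple(pcs), tuple(flags), turn)


def explore():
    """Return every state reachable from the initial state."""
    init = ((0, 0), (False, False), 0)
    seen = {init}
    stack = [init]
    while stack:
        s = stack.pop()
        for i in (0, 1):
            t = step(s, i)
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


states = explore()
violations = [s for s in states if s[0] == (4, 4)]
print(len(states), "reachable states,", len(violations), "with both in the CS")
```

No reachable state has both program counters in the critical section, which matches the safety argument. Note that the check holds only for this atomic-step model.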

### Requirements
▶ **Safe:** Yes.
▶ **Liveness:** Yes. A process waits only while the other process wants to enter and has priority, so the two can never block each other. No deadlock.
▶ **Ordering:** Not strictly, but it is starvation-free (fair).

### Fault Tolerance
Performance is measured in memory accesses, not messages. However, these algorithms depend heavily on the memory consistency model provided by the hardware. They work correctly under **Sequential Consistency**.

---

## Dekker's algorithm
Dekker's algorithm solves mutual exclusion for **two processes** using shared memory. It uses flags to show intent and a turn variable to break ties, ensuring one process eventually gets priority.

### How It Works

* **Variables (Shared):**
    * `flags[2]`: A boolean array, initially `false`. `flags[i] = true` means process `i` wants to enter.
    * `turn`: An integer (`0` or `1`) indicating whose turn it is if both want to enter.

* **Protocol (Process `i`):**
    1.  **Enter:** Set `flags[i] = true`. While the other process (`j`) also has its flag up: if it is `j`'s `turn`, lower `flags[i]`, wait until `turn == i`, then raise `flags[i]` again. Otherwise keep waiting for `j` to lower its flag.
    2.  **Critical Section:** Once inside, perform the critical work.
    3.  **Exit:** Give the `turn` to process `j` and set `flags[i] = false`.
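
The protocol can be exercised with a small randomized simulation. In this sketch each process is a Python generator, and every `yield` marks a point where a (hypothetical) scheduler may switch to the other process. The `in_cs` and `max_in_cs` counters are instrumentation added for the experiment, not part of the algorithm.

```python
import random


def dekker(i, s, rounds):
    """Process i runs the enter / critical section / exit cycle `rounds` times."""
    j = 1 - i
    for _ in range(rounds):
        s["flags"][i] = True
        yield
        while True:
            other_wants = s["flags"][j]
            yield
            if not other_wants:
                break
            my_turn = s["turn"] == i
            yield
            if not my_turn:
                s["flags"][i] = False          # back off
                yield
                while True:
                    my_turn = s["turn"] == i   # wait for our turn
                    yield
                    if my_turn:
                        break
                s["flags"][i] = True           # try again
                yield
        s["in_cs"] += 1                        # critical section begins
        s["max_in_cs"] = max(s["max_in_cs"], s["in_cs"])
        yield
        s["in_cs"] -= 1                        # critical section ends
        s["turn"] = j
        yield
        s["flags"][i] = False
        yield


def run(seed, rounds=50):
    """Interleave both processes at random; return the peak CS occupancy."""
    rng = random.Random(seed)
    s = {"flags": [False, False], "turn": 0, "in_cs": 0, "max_in_cs": 0}
    procs = [dekker(0, s, rounds), dekker(1, s, rounds)]
    alive = [0, 1]
    while alive:
        i = rng.choice(alive)
        try:
            next(procs[i])
        except StopIteration:
            alive.remove(i)
    return s["max_in_cs"]


peaks = [run(seed) for seed in range(20)]
print(peaks)
```

Random testing like this can only fail to find a violation. It is a sanity check, not a proof of mutual exclusion.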

### Properties

* **Safety:** Yes (guarantees mutual exclusion).
* **Liveness:** Yes (no deadlock or starvation).
* **Ordering:** No (but it is fair).
* **Fault Tolerance:** None. If a process fails in the critical section, the other process will be blocked forever.

---

## Sequential consistency: implementation
This model assumes:
▶ **Atomic writes:** A write operation appears to happen instantaneously to all processes.
▶ **Read-from-memory:** A read operation gets the value from the last completed write.

![img](../images/Screenshot%202025-09-23%20at%2011.07.48.png)

Sequential consistency is a model that defines how these concurrent operations behave. It ensures that all processes see the same single, interleaved order of all read and write operations, as if the operations were executed one after another on a single processor.

---

## Sequential Consistency: formally
**Sequential Consistency (SC)** is a memory model where the result of any execution is the same as if all processes' operations were executed in some single sequential order, and the operations of each individual process appear in this sequence in the order specified by its program.

Essentially, you can imagine all operations from all processes being put into a single timeline (interleaved), but the local order for each process is preserved.



---

## Weak Memory Models
Modern CPUs often reorder memory operations for performance, leading to memory models that are "weaker" than SC.
**Rule of thumb:** Any behavior not allowed by Sequential Consistency is the result of a weak memory model.


![img](../images/Screenshot%202025-09-23%20at%2011.07.03.png)
The key feature is the addition of a private store buffer for each process. When a process like P1 executes a write instruction (e.g., x = 1), the data is not sent directly to the main shared memory. Instead, it is first placed in the local store buffer, so other processes see the write only after a delay.

### An example: total store order (TSO)
TSO is a common weak model where each process has a **write buffer**.
▶ **Non-atomic writes:** A write is first placed in a local buffer and committed to main memory later.
▶ **Read locally or from memory:** A process can read its own writes from its buffer before they are visible to others.

This means a process `p1` might execute a write, but another process `p2` won't see that new value until it's flushed from `p1`'s buffer to main memory.

![img](../images/Screenshot%202025-09-23%20at%2011.07.03.png)

Process P1's writes (x:=1, x:=2) are queued in its private store buffer, while main memory still holds the old value x=0.

Due to store forwarding ("read-your-own-writes"), P1 reads its own latest update for x (which is 2) directly from this buffer. In contrast, it reads other variables like y from main memory. This mechanism can produce results that would be impossible under a stricter Sequential Consistency model.


### Peterson's Algorithm under TSO

**Main issue:** Under TSO a read can take effect before an earlier write to a *different* location, because that write may still be sitting in the store buffer. In Peterson's algorithm:
`flags[i] = true;`
`turn = j;`
TSO keeps these two writes in order relative to each other (the store buffer is drained in FIFO order, hence "total store order"). The problem is the read of `flags[j]` in the `while` loop: it can be served from main memory while the write to `flags[i]` is still buffered. If both processes read the old `false` value for the other's flag before their own `true` value becomes visible to the other, they could both enter the critical section, violating safety!
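
The flag race described above is an instance of the classic "store buffering" pattern: each process writes one variable and then reads the other. The sketch below enumerates every execution of that pattern, once with writes going straight to memory (sequential consistency) and once with a FIFO store buffer per process (a simplified TSO model). Think of `x` as `flags[0]` and `y` as `flags[1]`. The function name and the encoding of instructions are assumptions of this sketch.

```python
def outcomes(use_store_buffers):
    """All possible (r0, r1) results of:  P0: x = 1; r0 = y   P1: y = 1; r1 = x"""
    programs = [
        [("write", "x", 1), ("read", "y")],   # P0
        [("write", "y", 1), ("read", "x")],   # P1
    ]
    results = set()

    def explore(pcs, mem, bufs, regs):
        finished = True
        for p in (0, 1):
            # Choice 1: process p executes its next instruction.
            if pcs[p] < len(programs[p]):
                finished = False
                ins = programs[p][pcs[p]]
                new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
                if ins[0] == "write":
                    _, var, val = ins
                    if use_store_buffers:
                        grown = bufs[p] + ((var, val),)
                        new_bufs = bufs[:p] + (grown,) + bufs[p + 1:]
                        explore(new_pcs, mem, new_bufs, regs)
                    else:
                        explore(new_pcs, {**mem, var: val}, bufs, regs)
                else:
                    var = ins[1]
                    val = mem[var]
                    for bvar, bval in bufs[p]:   # store forwarding:
                        if bvar == var:          # newest own buffered write wins
                            val = bval
                    new_regs = regs[:p] + (val,) + regs[p + 1:]
                    explore(new_pcs, mem, bufs, new_regs)
            # Choice 2: the oldest buffered write of p reaches main memory.
            if bufs[p]:
                finished = False
                (var, val), rest = bufs[p][0], bufs[p][1:]
                new_bufs = bufs[:p] + (rest,) + bufs[p + 1:]
                explore(pcs, {**mem, var: val}, new_bufs, regs)
        if finished:
            results.add(regs)

    explore((0, 0), {"x": 0, "y": 0}, ((), ()), (None, None))
    return results


sc = outcomes(use_store_buffers=False)
tso = outcomes(use_store_buffers=True)
print(sorted(sc))
print(sorted(tso))
```

The result `(0, 0)` appears only with store buffers: both reads ran while both writes were still buffered. In Peterson's algorithm this corresponds to both processes seeing the other's flag as `false`.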

**How to fix this?**
We need to insert **memory fences** (or barriers). A fence is an instruction that forces the CPU to commit all pending writes to memory and/or wait for all reads to complete before proceeding. This enforces the intended order at critical points in the algorithm, restoring correctness at a slight performance cost.
```markdown

---

# LECTURE 06/10/2025

## Replication and Election

Replication involves maintaining copies of data on multiple machines (replicas) to achieve specific system goals. This lecture also covers how a group of processes can agree on a single leader.

---

## Goals of replication

Replication is a fundamental technique in distributed systems for creating robust and scalable services.

▶ **Fault Tolerance**: The system continues to operate correctly even if some of its replicas fail. This is ideally **transparent** to the user, who doesn't notice the failure. It can tolerate both node and network failures.

▶ **High Availability**: By having multiple copies, the service can remain accessible and responsive most of the time. If one replica is down, requests can be served by another, minimizing downtime.

▶ **Performance**: Replication can improve performance by placing data closer to users (reducing latency) and by distributing the workload across multiple machines, overcoming the scaling limits of a single server (**vertical scaling**).

**Caching** is a common form of replication. Examples include:
▶ Your browser caching website assets locally.

▶ Netflix prefetching video segments to a server near you.

▶ The DNS system replicating domain name records across the world.

---

## Problems

While powerful, replication introduces significant challenges:

▶ **Consistency**: How do we ensure that all replicas have the same data, or at least a consistent view of it? If a client writes to one replica, how and when do other replicas see that change? This is the core problem of consensus.

▶ **Overhead**: Keeping replicas in sync requires communication. This adds network traffic and processing overhead.

▶ **Failure Handling**: How does the system detect that a replica has crashed? Once detected, how does it manage failover to another replica and reintegrate the failed replica once it recovers?

---

## CAP theorem

The CAP theorem, also known as Brewer's theorem, presents a fundamental trade-off in distributed system design.

▶ **Consistency**: Every read operation receives the most recent write or an error. All nodes have the same data at the same time. Think of it like a bank account balance that is identical no matter which ATM you use.

▶ **Availability**: Every request receives a (non-error) response, without the guarantee that it contains the most recent write. Your bank account is always accessible, even if the balance shown is slightly out of date.

▶ **Partition Tolerance**: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. A network failure between two data centers will not cause the entire service to fail.

### Theorem

It is **impossible** for a distributed system to simultaneously provide all three guarantees: **Consistency**, **Availability**, and **Partition Tolerance**.

In the presence of a network partition (a realistic scenario in any distributed system), you must choose between consistency and availability.

* To maintain **Consistency**, you must sacrifice Availability (e.g., stop accepting writes or reads until the partition heals).
* To maintain **Availability**, you must sacrifice Consistency (e.g., allow operations on both sides of the partition, which may lead to conflicting data that needs to be reconciled later).



### Examples

The choice between C, A, and P depends entirely on the application's requirements.

▶ **CP Systems (Consistency & Partition Tolerance)**: These systems choose consistency over availability during a partition. They are essential where data accuracy is non-negotiable.
    * **Financial Sector**: Banking systems cannot tolerate inconsistent account balances.
    * **Scientific Computing**: Large-scale simulations (e.g., weather forecasting) require a consistent state.

▶ **AP Systems (Availability & Partition Tolerance)**: These systems choose availability, accepting that data might be temporarily inconsistent across replicas.
    * **Social Networks**: It's more important for users to be able to post content than for every other user to see it instantly. Eventual consistency is acceptable.
    * **Search Engines**: An index might be slightly out of date, but the search service must always be available.

▶ **CA Systems (Consistency & Availability)**: These systems cannot tolerate partitions. This model is effectively limited to single-node systems (like a traditional single-server database), as any system with a network is subject to partitioning.

---

## Assumptions

For the following replication models, we assume:

▶ **Asynchronous System**: There are no bounds on message delivery time or process execution speed.

▶ **Reliable Communication**: Messages are eventually delivered, but not necessarily in order.

▶ **Crash-fail Model**: Processes fail by crashing (stopping completely) and do not send malicious or incorrect data (i.e., no Byzantine failures unless stated otherwise).

▶ **Atomic Operations**: Operations are all-or-nothing.

▶ **Deterministic Objects**: Objects behave as "state machines." Their output depends solely on the sequence of operations applied, not on random chance, timers, or external events.
    * *Notation*: `o.m(v)` means applying the modifier `m` with value `v` to object `o`. Example: `myAccount.deposit(1000)`.

---

## Requirements

The ideal replicated system should meet these criteria:

▶ **Transparency**: The user interacts with the service as if it were a single, non-replicated entity.

▶ **Consistency**: The state of replicated objects remains consistent across all replicas.

The ultimate goal is a system that is **indistinguishable from a single, correct, and highly available copy**.

---

## Operations

A generalized workflow for a replicated operation involves five phases:

1.  **Request**: A client sends a request to a frontend or directly to a replica.
2.  **Coordination**: The replicas decide whether the operation can be applied immediately or must wait. They agree on a definitive order for operations.
3.  **Execution**: Each replica executes the operation.
4.  **Agreement**: The replicas communicate to reach a consensus on the outcome of the operation.
5.  **Response**: One or more replicas send a response back to the client.

---

## Fault tolerance

The primary goal here is to build an ***f*-resilient** system, meaning it can tolerate the failure of up to *f* replicas without any service interruption (**downtime**) or impact on the user (**transparency**).

---

## Consistency models

Consistency models are rules that define what guarantees a system provides regarding the order and visibility of operations.

▶ **Strong Consistency**: Guarantees that once a write completes, any subsequent read from any replica will see that new value (or a newer one). This is the most intuitive model but often the hardest to implement efficiently.
    * *Inconsistency Example*: A client writes `x=1` to replica A. Replica A crashes before it can inform replica B. Another client then reads from replica B and gets the old value `x=0`.

▶ **Weak Consistency**: Offers fewer guarantees. It doesn't ensure that subsequent reads will see the latest write. Systems prioritize availability and performance, accepting that replicas may be out of sync for a period.

▶ **Eventual Consistency**: A specific form of weak consistency. It guarantees that if no new updates are made, all replicas will *eventually* converge to the same value. It's a common model for highly available systems, but its drawback is that clients may temporarily read stale data. Conflicting concurrent writes are often resolved with a "last write wins" policy.

### Desired Temporal Consistencies

These are common client-centric consistency guarantees:

▶ **Read-your-writes**: If a process writes a value, a subsequent read by that *same process* will always see that value or a newer one.

▶ **Monotonic reads**: If a process reads a value, any subsequent read by that *same process* will see the same value or a newer one. It never sees an older value.

▶ **Causal consistency**: If operation A *happened-before* operation B (e.g., a user posts an answer to a question), then every process sees A before it sees B. Unrelated operations can be seen in different orders.

---

## Linearizability (Herlihy and Wing)

Linearizability is a **strong consistency** model. It provides the illusion that there is only a single copy of the data and that all operations on it are **atomic**.

An execution history is **linearizable** if:
1.  All operations can be reordered into a sequential history that is correct according to the object's specification (e.g., a queue's operations).
2.  This sequential order **respects the real-time order** of non-overlapping operations. If operation A finishes before operation B begins, then A must appear before B in the sequential history.

### Implementation & Drawbacks

A common theoretical implementation approach highlights its difficulty:
▶ Synchronize hardware clocks across all machines.

▶ Guess a maximal network delay, `D`.

▶ When a request arrives, place it in a hold-back queue and wait for time `D` to pass to ensure no earlier message is still in transit.

▶ Process operations from the sorted queue.

This is impractical because:
▶ **No perfect clock synchronization** algorithm exists for distributed systems.

▶ In an asynchronous system, there is **no guaranteed upper bound** on network delay `D`.

---

## Sequential Consistency (Lamport)

Sequential consistency is a slightly weaker, but more practical, strong consistency model than linearizability.

An execution history is **sequentially consistent** if:
1.  The result of any execution is the same as if all operations were executed in *some* sequential order.
2.  The operations of each individual process appear in this sequence in the order specified by its program (**program order**).

The key difference from linearizability is that sequential consistency **does not have to respect real-time order** across different processes. An operation that finished earlier in real-time might appear later in the sequential history, as long as the program order for each process is maintained.



---

## Replication Architectures for Fault Tolerance

▶ **Read-only replication**: Used for static or immutable content like files on a CDN. Simple to manage as data doesn't change.

▶ **Passive replication (Primary-Secondary)**: A primary replica handles all writes and propagates them to passive secondary replicas. Provides high consistency. Used in systems where data integrity is critical, like traditional databases.

▶ **Active Replication (State Machine Replication)**: All replicas are peers and process every request. Requires a total order for all incoming requests to ensure replicas remain in sync. Offers fast failover and can handle Byzantine failures.

---

## Passive Replication

Also known as the Primary-Backup model.

1.  **Request**: Client sends a write request to the **primary** replica.
2.  **Coordination**: The primary decides the order of operations.
3.  **Execution**: The primary executes the operation and updates its state.
4.  **Agreement**: The primary sends the update to all **backup** replicas. It waits for acknowledgements (ACKs) from them.
5.  **Response**: Once acknowledged by backups, the primary replies to the client.

* **Pros**: Conceptually simple and provides **linearizability** (if reads also go through the primary).
* **Cons**: The primary is a performance bottleneck and a single point of failure. **Failover** (electing a new primary) can be slow and complex. It does not tolerate a Byzantine primary.
* **Optimization**: To improve read performance, reads can be offloaded to backups, but this sacrifices strong consistency for eventual consistency.

---

## Active Replication

All replicas are equal and process requests concurrently.

1.  **Request**: A frontend multicasts the client's request to all replicas using a **totally-ordered, reliable multicast**.
2.  **Coordination**: The total-order protocol delivers requests to all replicas in the same, deterministic sequence.
3.  **Execution**: Each replica executes the request as it is delivered. Since they all start in the same state and execute the same operations in the same order, they remain in sync.
4.  **Agreement**: Implicit in the ordered multicast; no separate agreement phase is needed.
5.  **Response**: Any replica can respond to the client. For Byzantine fault tolerance, the client may wait for `f+1` or `(n/2)+1` identical responses.

* **Pros**: **Fast failover** (just remove the crashed replica from the group). Excellent for load distribution on reads. Can handle Byzantine failures. Provides **sequential consistency**.
* **Cons**: The required **total-order multicast** is complex and expensive to implement. It is equivalent to consensus, which no deterministic algorithm can guarantee in an asynchronous system if even one process may crash (the FLP impossibility result).

---

## Availability

This section shifts focus from consistency-first fault tolerance to systems where **availability is the primary goal**. We are willing to relax consistency guarantees for higher uptime and faster responses. This is the philosophy behind most large-scale web services (e.g., social media, e-commerce).

---

## Gossip Architecture

Gossip (or Epidemic) protocols are a popular method for building highly available systems with **eventual consistency**. Replicas periodically exchange information with random peers to spread updates throughout the system, much like a rumor spreads through a crowd.

### Operations

* **Reads**: Can be served by any replica, but might return stale data.
* **Writes (Updates)**: Sent to one or more replicas and then propagated via gossip.

### Relaxed Consistency

* Clients may read outdated data.
* However, guarantees like **causal consistency** are often provided to ensure a sensible user experience (e.g., you always see your own updates). Vector clocks are the key mechanism for this.

### The Idea: Vector Clocks Everywhere

* Each replica manager `R_i` maintains a **vector clock** that tracks the latest update it has seen from every other replica.
* When a frontend `F_j` sends an update, it tags it with its own vector clock (`prev`), representing the state of the world it last knew.
* A replica manager uses these timestamps to order incoming updates, delay updates from the "future" (whose prerequisites haven't been seen yet), and avoid applying duplicate updates.

### Phases

1.  **Request**: A frontend sends the client's request (tagged with its timestamp) to one or more replica managers.
2.  **Coordination**: The replica manager queues the request until its timestamp's dependencies are met (i.e., all causally preceding updates have been applied).
3.  **Execution**: The operation is applied in the correct causal order.
4.  **Agreement**: Updates are spread lazily to other replicas through periodic **gossip** messages. If a replica detects a gap in its history, it can request the missing data.
5.  **Response**:
    * **Writes** can be acknowledged immediately for low latency.
    * **Reads** may need to wait until the replica's state is at least as new as the timestamp provided by the client's frontend.

### Frontend View

The frontend `F` maintains a vector timestamp `prev` summarizing the latest updates it is aware of.
1.  On a client operation `o`, it sends the pair `(o, prev)` to a replica `R_i`.
2.  It waits for a response from `R_i`, which will include an updated timestamp.
3.  It merges the new timestamp with its own `prev`. This `prev` state can also be updated via gossip from other frontends.

### Replica Manager View

#### Read Logic
A replica manager has a current value `v` with its associated timestamp `vts`.
1.  It receives a read request `(o, prev)` from a frontend.
2.  If `prev <= vts` (the replica's data is at least as new as what the frontend has seen), it can **return `(v, vts)` instantly**.
3.  Otherwise, the replica must wait for gossip to deliver the necessary updates until the condition `prev <= vts` is met.

#### Write Logic
1.  Receives a write request `(v, id, prev)` from a frontend.
2.  Assigns a new, unique timestamp to the update based on the `prev` timestamp and its own local clock.
3.  Stores the update in a stable log and acknowledges the write to the frontend immediately.
4.  The update is later applied locally once all its causal dependencies are met.
5.  The update is propagated to other replicas during gossip rounds.

---

## Leader election

Leader election is a procedure to dynamically choose one process to act as a coordinator or leader. This is critical in many algorithms, such as passive replication, where a new primary must be chosen if the old one fails.

### Requirements

Let `P` be the set of processes. The leader for process `p_i` is `L(p_i)`.

1.  **Safety**: For any process `p_i`, either it has no leader (`L(p_i) = ⊥`) or its leader `L(p_i)` is the non-crashed process with the single largest identifier.
2.  **Liveness**: Eventually, every non-crashed process `p_i` has a leader (`L(p_i) != ⊥`).

### Assumptions

* Processes have unique identifiers (IDs).
* Processes can crash, but they stay dead.
* The system can reliably detect crashes (often via timeout, which requires synchrony assumptions).

---

## Chang and Roberts Algorithm

A simple and efficient election algorithm for systems where processes are arranged in a logical **ring**.

### Idea
* A token containing a process ID is passed around the ring.
* When a process notices the leader has failed, it starts an election by sending an `election` message with its own ID to its neighbor.
* When a process receives an `election` message:
    * If the ID in the message is **higher** than its own, it forwards the message unchanged.
    * If the ID is **lower** and the process isn't already participating, it replaces the ID with its own and forwards it.
    * If the ID is **lower** and the process is already participating, it discards the message.
    * If it receives its **own ID** back, it declares itself the winner and sends an `elected` message around the ring to announce the new leader.
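
The rules above can be simulated for a single election. This is a minimal sketch, assuming a unidirectional ring given as a list of unique IDs, exactly one initiator, and no crashes during the election; the function name `chang_roberts` and the message counting are illustrative choices.

```python
def chang_roberts(ring_ids, starter):
    """Simulate one election; return (leader_id, total_messages)."""
    n = len(ring_ids)
    msg_id = ring_ids[starter]        # the starter puts its own ID in the message
    pos = (starter + 1) % n           # index of the process receiving the message
    messages = 1                      # starter -> neighbour
    while ring_ids[pos] != msg_id:    # stop when a process sees its own ID come back
        own = ring_ids[pos]
        if msg_id < own:
            # Lower ID: substitute our own. (A process that is already
            # participating would discard instead; that cannot happen with
            # the single initiator assumed in this sketch.)
            msg_id = own
        pos = (pos + 1) % n           # forward to the next neighbour
        messages += 1
    leader = ring_ids[pos]
    messages += n                     # the `elected` message travels once around the ring
    return leader, messages

# Worst case: the starter sits right after the highest ID in ring order.
worst = chang_roberts([4, 1, 2, 3], starter=1)
# Cheapest case: the highest ID itself starts the election.
cheap = chang_roberts([4, 1, 2, 3], starter=0)
```

With `N = 4`, the worst-case run uses `3N - 1 = 11` messages: `N - 1` to reach the highest ID, `N` for that ID to circulate, and `N` for the `elected` announcement.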

### Properties
* **Safe and Live** (if no failures occur during the election).
* Message Complexity: `3N - 1` messages in the worst case for one election.
* **Vulnerable to crashes**: If a process in the ring fails during the election, the token can be lost. Requires a reliable failure detection and ring-management layer to handle this.

---

## Bully Algorithm

An election algorithm for synchronous systems where any process can communicate with any other.

### Idea
* The process with the **highest ID** always wins. It "bullies" lower-ID processes into submission.
* When a process `P` detects the leader has failed, it starts an election:
    1.  `P` sends an `ELECTION` message to all processes with a **higher ID**.
    2.  `P` waits for a response.
        * If it gets **no response** within a timeout, it assumes all higher-ID processes are down. `P` declares itself the leader and sends a `COORDINATOR` message to all other processes.
        * If it **receives an `ANSWER`** from a higher-ID process, it gives up. Its job is done, as it knows a higher-ID process is taking over the election.
* When a process receives an `ELECTION` message from a lower-ID process, it responds with an `ANSWER` and starts its own election (if it's not already running one).
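
The cascade of elections can be made visible with a small simulation. This is a sketch under simplifying assumptions: timeouts are perfect (a crashed process never answers, a live one always does), the initiator is alive, and the winner sends `COORDINATOR` to every other process. The function name `bully_election` and the per-type counters are hypothetical.

```python
from collections import Counter, deque

def bully_election(all_ids, alive, initiator):
    """Simulate a Bully election; return (leader, message counts by type)."""
    counts = Counter()
    started = set()
    pending = deque([initiator])
    leader = None
    while pending:
        p = pending.popleft()
        if p in started:
            continue                       # already running its own election
        started.add(p)
        higher = [q for q in all_ids if q > p]
        counts["ELECTION"] += len(higher)  # sent to every higher ID, alive or not
        answering = [q for q in higher if q in alive]
        counts["ANSWER"] += len(answering)
        if answering:
            pending.extend(answering)      # each live higher process starts its own election
        else:
            leader = p                     # timeout: nobody higher is alive
            counts["COORDINATOR"] += len(all_ids) - 1
    return leader, counts

ids = [1, 2, 3, 4, 5]
leader, counts = bully_election(ids, alive={1, 2, 3, 4}, initiator=1)
```

Starting from the lowest ID triggers one election per live process, which is where the quadratic worst-case message count comes from.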

### Properties
* **Safe & Live**, but relies heavily on:
    * A **synchronous system** where timeouts can reliably detect crashes.
    * Reliable failure detection.
* **Message Complexity**:
    * Best-case: `N-2` messages (the second-highest ID process starts the election).
    * Worst-case: `O(N^2)` messages (the lowest-ID process starts the election, triggering a cascade).
* **Safety is broken** if the timeout is too short (a slow process is wrongly assumed dead) or if the system is actually asynchronous.

# LECTURE 07/10/2025

# LECTURE: Clustered Storage

This lecture explores Google's clustered storage infrastructure, focusing on the Google File System (GFS), the Chubby lock service, and the Bigtable database system. It details their architecture, design choices, and performance characteristics, which are tailored for Google's large-scale data processing workloads.

***

## Google's Infrastructure and Workload

Google's services, like its search engine, Gmail, and YouTube, run on a massive, distributed infrastructure designed to handle two main types of workloads:

* **Offline Batch Jobs**: These involve processing petabytes of data with large, sequential reads and writes. Short outages are generally acceptable for tasks like web indexing and log processing.
* **Online Applications**: These manage smaller datasets (terabytes) with frequent, small reads and writes. Low latency and high availability are critical, as any downtime directly impacts users of services like web search and Google Docs.

This infrastructure is built on clusters of commodity hardware, where component failures are considered normal events. Google's philosophy is to achieve fault tolerance and high throughput through software techniques like replication and parallelism rather than relying on expensive, highly reliable hardware.

### Key Infrastructure Components

Google's infrastructure consists of several core components for data storage, coordination, and computation:

* **GFS (Google File System)**: A distributed file system for storing massive datasets.
* **Chubby**: A distributed lock service for coordination and master election.
* **Bigtable**: A sparse, distributed, multi-dimensional sorted map built on top of GFS.
* **MapReduce**: A framework for large-scale parallel data processing.

***

## Google File System (GFS)

GFS is a distributed filesystem designed specifically for Google's application workload and infrastructure.

### GFS Design Assumptions

GFS is built on several key assumptions about its use case:

* **Component failures are normal**: The system is built from thousands of commodity machines and must constantly monitor itself and recover from failures transparently.
* **Files are huge**: Multi-gigabyte files are common, so I/O operations and block sizes are optimized for large datasets.
* **Workloads are specific**: The primary operations are **large sequential reads** and **record appends** (concurrent appends to the same file). High sustained throughput is more important than low latency.
* **Non-POSIX API**: GFS provides a specialized API that includes operations like `snapshot` and `record append` to better serve its applications.

### GFS Architecture

GFS uses a single-master architecture to simplify design and allow for optimal data placement.

* **GFS Master**: A single server that stores all file system metadata, including the namespace, access control information, and the mapping from files to chunks. It also manages chunk replication and placement. The master does not handle file data directly; it only provides metadata to clients.
* **Chunkservers**: These servers store data on their local disks as fixed-size **chunks** (64 MB). They read or write chunk data based on instructions from clients or the master.
* **GFS Client**: A library linked into applications. It communicates with the master for metadata and then interacts directly with chunkservers for data I/O, minimizing the master's involvement.
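
Because chunks have a fixed size, a client can work out which chunks a byte range touches before it asks the master for their locations. The sketch below assumes the 64 MB chunk size mentioned above; the helper names `chunk_index` and `chunks_for_range` are hypothetical and not GFS API names.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the notes above

def chunk_index(offset):
    """Index of the chunk that contains the given byte offset."""
    return offset // CHUNK_SIZE

def chunks_for_range(offset, length):
    """Indices of all chunks touched by `length` bytes starting at `offset`."""
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return list(range(first, last + 1))

MB = 1024 * 1024
spanning = chunks_for_range(60 * MB, 10 * MB)  # a read that crosses a chunk boundary
```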

### Write Operation and Replication

GFS uses a **passive replication** model to ensure data consistency and fault tolerance. For each chunk, the master designates one replica as the **primary** and others as **secondaries**.

The write process follows these steps:

1. The client asks the master for the primary and secondary replica locations for a specific chunk.
2. The master provides this information to the client.
3. The client pushes the data to all replicas, typically in a chained fashion for network efficiency.
4. Once all replicas acknowledge receiving the data, the client sends a write request to the primary replica.
5. The primary determines a serial order for mutations and forwards the write request to all secondaries.
6. The secondaries reply to the primary after completing the operation.
7. Finally, the primary replies to the client. If errors occur, the file region may be left in an inconsistent state.

### GFS Consistency Model

GFS has a **relaxed consistency model** that prioritizes performance for its specific workload.

* A file region is "**defined**" if all clients see the same data and the writes in their entirety. A region is "**consistent**" if all clients see the same data, but it might be a mix of writes from different clients.
* **Successful serial writes** leave a region in a defined state.
* **Successful concurrent writes** leave the region consistent but undefined.
* **Record appends** are guaranteed to be written atomically at least once.

### Master Fault Tolerance

To mitigate the single-point-of-failure risk, the master's state is replicated.

* All metadata changes are written to an **operations log** stored on multiple machines.
* **Shadow masters** provide read-only access to the filesystem, even if the primary master is down.
* If the master fails, an external mechanism selects a new master to take over.

***

## Chubby: A Distributed Lock Service

Chubby is a coarse-grained lock service designed to provide coordination and reliable storage for loosely coupled distributed systems.

### Purpose and Use Cases

Chubby's primary role is to allow clients to synchronize their activities, often for tasks that take hours or days. Key uses include:

* **Master Election**: GFS and Bigtable use Chubby to elect a single active master from a pool of potential servers.
* **Name Service**: It provides a well-known location for clients to discover the location of other services.
* **Reliable Storage**: It offers a reliable, though low-volume, file system for storing small amounts of critical metadata.

### Chubby Architecture

A Chubby deployment, known as a **cell**, typically consists of five replicated servers (replicas) per data center.

* The replicas use the **Paxos consensus algorithm** to elect a **master**.
* All read and write requests are handled by the master, which ensures strict consistency. Updates are propagated to a quorum of replicas before being acknowledged.
* Clients interact with Chubby through a library that caches file data and handles communication with the master.
* **Sessions** are maintained between the client and the master using leases. If a client fails to renew its lease, the session expires, and any locks it holds are released.

***

## Bigtable: A High-Performance Storage System

Bigtable is a distributed storage system for managing structured data at a very large scale. It is not a relational database but rather a sparse, persistent, multi-dimensional sorted map.

### Bigtable Data Model

The fundamental mapping in Bigtable is:
`(row:string, column:string, time:int64) -> string`

* **Rows**: Data is indexed by a **row key**. Rows are sorted lexicographically, allowing for efficient scans over consecutive rows. This is often exploited by structuring row keys to group related data together (e.g., reversing domain names). All operations on a single row are **atomic**.
* **Columns**: Columns are grouped into **column families**, which are the basic unit of access control and storage. A column is identified by its family and a **qualifier** (e.g., `anchor:cnnsi.com`).
* **Timestamps**: Each cell can contain multiple versions of data, indexed by a 64-bit timestamp.
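
The mapping can be imitated with an ordinary dictionary to see how the three key parts interact. This is a toy sketch, not the Bigtable API: the names `row_key`, `put`, `get_latest` and `scan_rows` are made up, and the table lives entirely in memory.

```python
def row_key(domain):
    """Reverse the domain components so pages of one site sort together."""
    return ".".join(reversed(domain.split(".")))

table = {}  # (row, column, timestamp) -> value

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the value with the highest timestamp for this cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

def scan_rows(prefix):
    """Row keys starting with `prefix`, in lexicographic order."""
    return sorted({r for (r, _, _) in table if r.startswith(prefix)})

put(row_key("maps.google.com"), "contents:", 1, "<html>v1")
put(row_key("maps.google.com"), "contents:", 2, "<html>v2")
put(row_key("mail.google.com"), "contents:", 1, "<html>mail")
```

Reversing the domain turns `maps.google.com` and `mail.google.com` into neighbouring keys, so a scan over the prefix `com.google` touches consecutive rows.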

### Bigtable Architecture

Bigtable is built on top of GFS and Chubby and consists of three main components:

* **Client Library**: Linked into every client application, it handles communication with the rest of the system.
* **Master Server**: Responsible for metadata tasks like assigning tablets to tablet servers, load balancing, and garbage collection of GFS files. The master is largely stateless, storing all its metadata in Chubby, which simplifies recovery.
* **Tablet Servers**: Each tablet server manages a set of **tablets** (10–1000 per server). A tablet is a contiguous range of rows from a table. Tablet servers handle all read/write requests for the tablets they serve and split tablets that grow too large.

### Tablet Location and Serving

* To find the server responsible for a particular row, clients traverse a three-level **B+-tree** hierarchy stored in Bigtable itself. The location of the root tablet is stored in a Chubby file.
* When a tablet server receives a write, it writes the update to a commit log in GFS and then stores it in an in-memory sorted map called a **memtable**.
* Reads are served from a merged view of the memtable and a set of immutable files on GFS called **SSTables**.
* Periodically, the memtable is converted into a new SSTable (**minor compaction**), and existing SSTables are merged to reduce clutter and reclaim space (**major compaction**).
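
The write path and the two compactions can be sketched with in-memory structures. This is a simplified model under stated assumptions: the commit log and SSTables are Python lists and tuples rather than GFS files, a minor compaction is triggered by a hypothetical `memtable_limit`, and deletions are not modelled. The class name `Tablet` is illustrative.

```python
class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []        # stands in for the commit log in GFS
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # immutable sorted tables, oldest first
        self.limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))      # log first, then apply to the memtable
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the memtable into a new immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def major_compaction(self):
        """Merge all SSTables into one; newer values overwrite older ones."""
        merged = {}
        for sst in self.sstables:
            merged.update(sst)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        """Merged view: memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            found = dict(sst)
            if key in found:
                return found[key]
        return None

t = Tablet(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    t.write(k, v)
```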

### Performance

Bigtable's performance scales well with the number of tablet servers.

* **Random reads** are relatively slow because they can't take advantage of locality on GFS. However, reads from in-memory locality groups are much faster.
* **Sequential reads and scans** are very fast because they read entire data blocks from GFS, maximizing data utilization.

---

# LECTURE 20/10/2025

# PEER TO PEER NETWORKS

## Overlay Networks and Peer to Peer Networks

**Overlay Networks** are virtual networks built over an underlying network. In overlay networks, nodes are connected through logical links rather than physical links. These networks allow flexible routing, improved scalability, and support for new services without modifying the underlying physical network.

Overlay networks add a layer to the stack to provide something that the underlying network lacks, without changing it (e.g. services such as multimedia content distribution or routing protocols).

Is it possible to build everything on the application layer? No, while an Application Layer Overlay Network provides valuable application-specific intelligence and routing, it still fundamentally relies on the lower layers (especially IP and TCP/UDP) for the actual transmission, addressing, and physical delivery of data across the network.

### Types of overlay:

| For what | Type | Description |
|:------------:|:------------:|:-------------:|
| Application needs | Distributed hash tables | Decentralized key mapping service in a large network |
|| P2P file sharing | Constructing addressing and routing mechanism to support cooperative discovery and use of files |
|| Content distribution networks | Replication, caching and placement strategies |
| Network style | Wireless ad hoc networks | Provides customized routing protocols |
|| Disruption-tolerant networks | Networks designed to operate in hostile environments |
| Additional features | Multicast | Provides multicast services where multicast routers are not available |
|| Resilience | Improved robustness and availability |
|| Security | Enhanced security over the underlying IP network, including VPN |

### Limitations of client/server paradigm
Simple architecture:
client -- request --> [Server]

However, this architecture has some issues, such as scalability (more users require more computing power) and reliability (the network depends on the server).

### Limitations of P2P

A P2P system is based on direct communication between peers: it is decentralized, survives extreme network changes, scales well, and benefits from consumer technology.

However, it has some issues: on/off behaviour of peers, the need to join and to discover other peers, peers implemented with differing interpretations of the communication rules, and the need to prevent free riding and to incentivise participation and reciprocation.

These networks need protocols for: finding peers, finding what services a peer provides, obtaining status from a peer, invoking a service on a peer, creating/joining/leaving peer groups, creating data connections, and relaying messages. In short, any kind of communication between peers and the network.

A P2P system consists of autonomous entities (peers) able to self-organize and share a set of distributed resources in a computer network.

A peer is a node with a unique ID; it belongs to one or more groups and can communicate with other peers.

### Types of P2P applications categories
| Type | Description |
|-----|-----|
| Distributed computing | Decomposition of a larger problem into smaller parallel problems |
| File sharing | Efficient search across WAN |
| Collaborative applications | Update mechanism to provide consistency in multi-user requirement |

### Common primitives in a P2P file sharing system

- join: participate in the network, i.e. discover at least one existing peer and register with the network.
- publish: advertise a file. In Napster, a centralized server stores the mappings; in Gnutella, there is no explicit publish step (a peer simply answers matching queries with a query hit); in a DHT-based system, the peer inserts a record into the DHT (distributed hash table).
- search: find a file or service, via flooding (broadcast queries to all neighbors), an index or DHT lookup, or a hybrid approach.
- fetch: retrieve the file or use the service, i.e. establish a connection with one or more peers and download the file.

---

## Centralized Peer to Peer Network

How did it start?
It started with Napster which was an American proprietary peer-to-peer file sharing application primarily associated with digital audio file distribution.
The key idea was to share the content, storage and bandwidth of individual users.

### Challenges of Napster
Main:
- Find where a file is stored

Other:
- How to scale it up in order to support countless machines
- How to make the system dynamic in which each machine can come and go

### Solutions

A centralized index system maintains a mapping between files (e.g., songs) and the machines currently storing them. To locate a file, a user queries the index system, which returns the machine hosting the requested file (preferably the nearest or least loaded one). The file can then be retrieved via FTP.

Advantages:
- Simple and easy to implement.
- Supports sophisticated search functionalities.

Disadvantages:
- Single point of failure reduces robustness.
- Centralized bottleneck limits scalability.

### Search operation
In this system, the client sends search keywords to the server, which looks up matching files in its index and returns a list of hosts represented as `<ip_address, portnum>` pairs. The client then pings each host to measure transfer rates and selects the optimal one to download the file from.

### Issues with search operation
This approach faces several issues: the centralized server can become a source of congestion and represents a single point of failure. Additionally, communication lacks security since messages and passwords are sent in plaintext. Napster was also held legally responsible for users' copyright violations, leading to claims of indirect infringement.

---

## Unstructured Peer to Peer Network

The first ever unstructured and decentralized P2P network was Gnutella, founded in 2000. In this system, peers form an overlay network. When a client joins, it connects to a few known nodes, which become its neighbors. Unlike centralized systems, there is no need to publish files. For searching, a node sends a query to its neighbors, who forward it recursively until a match is found, at which point a reply is sent back to the requester. Once the desired file is located, it is fetched directly from the peer hosting it.

### Search operation in Gnutella

**Advantages:**
- Fully decentralized with no central point of failure.  
- Highly robust against node failures.  

**Disadvantages:**
- Poor scalability, as deterministic searches may require contacting many peers.  
- Network can become flooded with excessive query traffic.  
- Each request must include a Time-To-Live (TTL) limit to prevent uncontrolled flooding.  

### Avoiding excessive traffic

In Gnutella, queries are forwarded to all neighbors except the one they were received from, and each query, identified by a DescriptorID, is forwarded only once. Peers maintain a list of recently seen messages to avoid duplicates, dropping any repeated queries with the same DescriptorID and payload type. Responses (QueryHits) are routed back only along the path from which the original query arrived, and any QueryHit without a corresponding query is discarded.
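
The forwarding rules and the TTL limit can be sketched as a round-by-round flood. This is an illustrative model, not the Gnutella wire protocol: the overlay is a hypothetical adjacency dictionary, one round equals one TTL decrement, and the `seen` set stands in for each peer's list of recently seen DescriptorIDs.

```python
def flood(graph, source, ttl, has_file):
    """Flood one query; return (peers that answered with a hit, messages sent)."""
    seen = {source}               # peers that already saw this DescriptorID
    frontier = [(source, None)]   # (peer, neighbour the query arrived from)
    messages = 0
    hits = []
    while frontier and ttl > 0:
        next_frontier = []
        for peer, came_from in frontier:
            for neighbour in graph[peer]:
                if neighbour == came_from:
                    continue      # never send the query back where it came from
                messages += 1
                if neighbour in seen:
                    continue      # duplicate DescriptorID: dropped by the receiver
                seen.add(neighbour)
                if neighbour in has_file:
                    hits.append(neighbour)
                next_frontier.append((neighbour, peer))
        frontier = next_frontier
        ttl -= 1
    return hits, messages

overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
too_short = flood(overlay, "A", ttl=2, has_file={"E"})
far_enough = flood(overlay, "A", ttl=3, has_file={"E"})
```

With a TTL of 2 the query never reaches `E`, which shows how the TTL trades search scope against traffic.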

### Download operation in Gnutella

The requester selects the "best" responder from the QueryHits and fetches the file directly using HTTP:
```
GET /get/<File Index>/<File Name>/ HTTP/1.0\r\n
Connection: Keep-Alive\r\n
Range: bytes=0-\r\n
User-Agent: Gnutella\r\n
\r\n
```

HTTP is used because it is widely supported and accepted by firewalls, and the `Range` field allows partial file transfers.

### Comparing Napster and Gnutella

|                | Napster                          | Gnutella                              |
|----------------|---------------------------------|----------------------------------------|
| **Pros**       | - Simple                        | - Simple                               |
|                | - Search scope is O(1)          | - Fully decentralized                  |
|                |                                 | - Search cost distributed              |
| **Cons**       | - Server maintains O(N) state   | - Search scope is O(N)                 |
|                | - Server performance bottleneck | - Large number of freeloaders          |
|                | - Single point of failure       |                                        |


### New Gnutella protocol

The FastTrack protocol, initially implemented in Kazaa, KazaaLite, and Grokster, and later adapted in Gnutella, improves on traditional P2P networks by designating some peers as supernodes (ultrapeers). These supernodes leverage healthier, more reliable participants and maintain a Napster-like directory of files.  

**Smart Query Flooding:**
- **Join:** Client contacts a supernode on startup; may become a supernode itself.  
- **Publish:** Client sends its list of shared files to its supernode.  
- **Search:** Queries are sent to the supernode, which floods them only among other supernodes.  
- **Fetch:** Files are downloaded directly from peers and can be fetched simultaneously from multiple sources.  


### FastTrack
In FastTrack, each supernode maintains a directory of nearby peers, storing entries like `<filename, peer pointer>`, similar to Napster servers. Supernode membership is dynamic, and any peer can become a supernode if it earns enough reputation, which depends on factors like connection uptime and total uploads. More advanced reputation schemes, sometimes based on economic incentives, have also been developed to manage supernode selection and reliability.

**Pros:**
- Balances search overhead with storage requirements.  
- Accounts for node heterogeneity, including bandwidth and computational resources.  

**Cons:**
- No strict guarantees on search scope or search time.  


---

## Structured Peer to Peer Network

Structured P2P networks organize peers and resources using a predefined structure, often based on distributed hash tables (DHTs), to enable efficient and deterministic data location. Unlike unstructured networks, where searches rely on flooding, structured networks provide guaranteed lookup performance. Prominent examples include **Chord**, **Pastry**, and **Tapestry**, each using different routing and identifier schemes to map keys to nodes and facilitate scalable, fault-tolerant file retrieval.

### API based on unique GUID associated to data

Structured P2P networks often provide an API based on a unique **GUID** associated with each data object. The basic operations include storing, retrieving, and deleting data, with replicas maintained at all nodes responsible for the GUID. An alternate approach treats the GUID as a reference to an object, supporting messaging and access control.  

**API:**

```text
put(GUID, data)         # Store data; replicated at responsible nodes
remove(GUID)            # Delete all references and data associated with GUID
value = get(GUID)       # Retrieve data from one of the responsible nodes

publish(GUID)           # Make object accessible via its GUID
unpublish(GUID)         # Make object inaccessible
sendToObj(msg, GUID, n) # Send a message to n replicas of the object (e.g., request to download)
```

### Robustness in this system

To maintain stability in a Chord-style ring, each node periodically sends a `stabilize()` message to its successor. When a node receives this message, it checks whether the sender is a better predecessor than its current one and updates its predecessor if so, then responds with its own predecessor via `notify()`. Upon receiving `notify()`, the sender updates its successor if the notified node falls between itself and its current successor; otherwise, no change is made. Repeated periodically, this process keeps the ring consistent and resilient to node joins or failures.
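
The exchange described above can be sketched with direct method calls standing in for messages. This is a minimal illustration that follows the wording of these notes rather than any particular implementation; the class `Node`, the helper `between`, and the method names `on_stabilize` and `on_notify` are hypothetical, and failures are not modelled.

```python
def between(x, a, b):
    """True if x lies strictly inside the ring interval (a, b), with wrap-around."""
    if a < b:
        return a < x < b
    return x > a or x < b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def stabilize(self):
        """Periodic task: contact the successor and process its reply."""
        candidate = self.successor.on_stabilize(self)
        self.on_notify(candidate)

    def on_stabilize(self, sender):
        # Adopt the sender as predecessor if it is a better (closer) one.
        if self.predecessor is None or between(sender.id, self.predecessor.id, self.id):
            self.predecessor = sender
        return self.predecessor          # the reply carried by notify()

    def on_notify(self, candidate):
        # Adopt the candidate as successor if it sits between us and our successor.
        if candidate is not None and between(candidate.id, self.id, self.successor.id):
            self.successor = candidate

# Existing ring 10 -> 50 -> 10; node 30 joins knowing only its successor 50.
n10, n30, n50 = Node(10), Node(30), Node(50)
n10.successor, n10.predecessor = n50, n50
n50.successor, n50.predecessor = n10, n10
n30.successor = n50
for node in (n30, n10, n10):
    node.stabilize()
```

After a few rounds the ring reads 10 -> 30 -> 50 -> 10, with predecessors updated to match.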

### Achieving Efficiency: finger tables

Each node maintains a finger table to speed up lookups. The i-th entry of a node with ID `n` points to the first node whose ID is ≥ `(n + 2^(i-1)) mod 2^m`, where `m` is the number of bits in the ID space. These fingers allow queries to "jump" exponentially closer to the target GUID rather than traversing one successor at a time, reducing lookup time from O(N) to O(log N) hops.
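
The finger-table rule can be checked on a tiny ring. The sketch below assumes a global view of all node IDs, which a real node would not have; it is only meant to show which node each finger points to. The function names `successor` and `finger_table` are illustrative.

```python
def successor(ident, nodes, m):
    """First node ID >= ident on a ring of size 2^m, wrapping around to the smallest ID."""
    ident %= 2 ** m
    ordered = sorted(nodes)
    for node in ordered:
        if node >= ident:
            return node
    return ordered[0]

def finger_table(n, nodes, m):
    """Entries i = 1..m, each pointing at successor((n + 2^(i-1)) mod 2^m)."""
    return [successor((n + 2 ** (i - 1)) % 2 ** m, nodes, m) for i in range(1, m + 1)]

# A 3-bit ID space (IDs 0..7) with three nodes.
nodes = [0, 1, 3]
fingers_of_0 = finger_table(0, nodes, m=3)   # targets 1, 2, 4
fingers_of_3 = finger_table(3, nodes, m=3)   # targets 4, 5, 7
```

For node 0 the targets 1, 2 and 4 resolve to nodes 1, 3 and 0, so the last finger wraps around the ring.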

---

## Comparison between Napster, Gnutella and Chord
| Network   | Memory              | Lookup Latency | #Messages for a Lookup |
|-----------|---------------------|----------------|------------------------|
| Napster   | O(1) (O(N) @server) | O(1)       | O(1)                 |
| Gnutella  | O(N)             | O(N)        | O(N)                 |
| Chord     | O(log N)         | O(log N)    | O(log N)             |

---

## Pastry Overview

Pastry assigns IDs to nodes using a virtual ring, similar to Chord. Each node maintains a **leaf set**, which includes its immediate successors and predecessors, and a **routing table** based on prefix matching. For example, a node with ID `01110100101` keeps neighbor peers for each prefix `*`, `0*`, `01*`, `011*`, … up to `0111010010*`. When routing to a target ID, such as `01110111001`, the node forwards the message to the neighbor whose ID shares the **longest prefix match** with the target, progressively moving the message closer to its destination.

### Pastry Locality

Pastry incorporates network locality into its routing. For each prefix (e.g., `011*`), the node selects the neighbor with the **shortest round-trip time** among all peers matching that prefix. Shorter prefixes have many candidates spread across the network, so early routing hops tend to be physically shorter, while later hops toward longer prefixes generally cover larger distances. Nodes can also monitor network traffic to discover better neighbors, ensuring more efficient routing based on proximity.

#### Pastry Prefix Routing Example

Node IDs (hex, simplified for illustration):

- Source: 65a1fc
- Target: d46a1c

Routing table of `65a1fc` (first four rows; row *r* holds nodes that share the first *r* hex digits with `65a1fc` and differ in the next one):

| Row | Shared prefix | Example entry |
|-----|---------------|---------------|
| 0   | (none)        | `dxxxxx`      |
| 1   | `6`           | `6axxxx`      |
| 2   | `65`          | `65bxxx`      |
| 3   | `65a`         | `65a2xx`      |

**Routing Steps:**

1. Current node: `65a1fc`
   - Longest shared prefix with target `d46a1c`: 0 digits (no match)
   - Forward to the **row 0** entry whose first hex digit is `d` → `dxxxxx`

2. Next node: `dxxxxx`
   - Prefix match with target: 1 hex digit (`d`)
   - Forward to the entry in its **row 1** for next digit `4` → `d4xxxx`

3. Next node: `d4xxxx`
   - Prefix match: 2 hex digits (`d4`)
   - Forward to the entry in its **row 2** for next digit `6` → `d46xxx`

4. Next node: `d46xxx`
   - Prefix match: 3 hex digits (`d46`)
   - Forward to the entry in its **row 3** for next digit `a`, and so on until the message reaches the target `d46a1c`

**Key Idea:**  
- At each hop, the node forwards to a neighbor with the **longest matching prefix** with the target.  
- This reduces the number of hops to roughly **log_base_16(N)** for N nodes.
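
The longest-prefix rule can be sketched in a few lines. This is a simplified model: each node only knows a short, hypothetical list of peers, and the leaf set and numeric-closeness fallback of real Pastry are left out. The intermediate node IDs and the names `shared_prefix_len`, `next_hop` and `route` are assumptions made for illustration.

```python
def shared_prefix_len(a, b):
    """Number of leading hex digits that two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

def next_hop(current, target, known):
    """Known peer with the longest prefix match, if it improves on the current node."""
    if not known:
        return None
    best = max(known, key=lambda peer: shared_prefix_len(peer, target))
    if shared_prefix_len(best, target) > shared_prefix_len(current, target):
        return best
    return None

def route(source, target, network):
    """Follow next_hop until the target is reached or no better peer is known."""
    path = [source]
    node = source
    while node != target:
        hop = next_hop(node, target, network[node])
        if hop is None:
            break
        path.append(hop)
        node = hop
    return path

# Hypothetical view of who knows whom.
network = {
    "65a1fc": ["d13da3", "65b222"],
    "d13da3": ["d4213f", "65a1fc"],
    "d4213f": ["d462ba", "d13da3"],
    "d462ba": ["d46a1c", "d4213f"],
    "d46a1c": [],
}
path = route("65a1fc", "d46a1c", network)
```

Each hop on the resulting path shares at least one more hex digit with the target than the previous one.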

---

## Tapestry

Tapestry uses the same **prefix-based routing** as Pastry but adds more flexibility through its **DOLR (Distributed Object Location and Routing) interface**, which includes operations like `publish(GUID)`, `unpublish(GUID)`, and `sendToObj(msg, GUID, [n])`. The key advantage is that applications can place replicas of objects **closer to frequent users**, reducing latency, minimizing network load, and improving tolerance to network and host failures.

---

# LECTURE 21/10/2025

# Internet of Things

Before IoT there were WSNs (Wireless Sensor Networks), a key topic around 2005. The field evolved as follows:

Embedded Systems -> Wireless Sensor Networks -> Cyber Physical Systems -> Internet of Things

Nowadays the industry is focused on IoT, which is similar to CPS, WSN and ES.

These technologies are more or less the same, but there are some differences:

- WSN is ES with a wireless interface
- CPS is WSN with actuators
- IoT is CPS with IP

## Challenges: resource constraints

- Power (energy)
- Bandwidth
- Memory, CPU

To reduce power consumption, Dynamic Power Management (DPM) is used.

Example: StrongARM SA-1100

There are three states: RUN, IDLE (a software routine may stop the CPU when it is not in use while still monitoring interrupts), and SLEEP (shutdown of on-chip activity).

Each state consumes a different amount of power.

### Sample Applications of IoT and Dynamic Power Management

#### 1. Flower Auction
- IoT sensors manage **temperature, humidity, and lighting** in flower storage areas.  
- **Dynamic Power Management (DPM)** optimizes energy use based on auction activity and sensor data.  
- Results in **improved product quality**, **lower operational costs**, and **energy efficiency**.

#### 2. PermaSense Case
- Long-term **environmental monitoring** system deployed in the Swiss Alps.  
- Uses **low-power sensors** and **adaptive duty-cycling** to study permafrost and climate change.  
- Integrates **solar energy harvesting** and **event-driven wake-ups** for year-round operation.  
- Demonstrates **reliable, autonomous IoT systems** under extreme conditions.

#### 3. Basic System Architecture
##### Core Components
- **Sensor Nodes** – sense data and manage local power states.
- **Dynamic Power Manager (DPM)** – controls active/sleep modes based on workload.
- **Communication Network** – low-power wireless (LoRa, ZigBee, BLE).
- **Edge/Cloud Layer** – performs analytics and optimizes energy use.
- **User Interface** – provides monitoring and control dashboards.

##### Key Idea
A **distributed, hierarchical IoT architecture** where each node intelligently manages its power while cooperating for system-wide energy optimization.


### Low Duty Cycle (50ms/30min)

This slide illustrates a communication protocol designed for extreme power saving, operating with a very low duty cycle of 50 milliseconds of activity every 30 minutes. To maintain timing despite long periods in sleep mode, the system relies on internal clock synchronization rather than direct network alignment. The protocol manages time accuracy by calculating node wake-up times based on other clocks and by compensating locally for crystal drift and temperature effects. Finally, the duty cycle is adjustable, allowing the system to schedule more frequent activity bursts during critical times, such as day/night transitions.
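
As a quick check of what 50 ms every 30 minutes means, the duty cycle and the resulting average power can be computed directly. The power figures used in the usage lines are made-up placeholders, not measurements from the lecture; only the 50 ms and 30 min values come from the notes.

```python
ACTIVE_S = 0.050          # 50 ms of activity ...
PERIOD_S = 30 * 60        # ... every 30 minutes

duty_cycle = ACTIVE_S / PERIOD_S   # fraction of time the node is awake

def average_power(p_active_mw, p_sleep_mw, duty):
    """Time-weighted average of active and sleep power, in mW."""
    return duty * p_active_mw + (1 - duty) * p_sleep_mw

# Hypothetical figures: 60 mW while active, 0.03 mW while asleep.
avg_mw = average_power(60.0, 0.03, duty_cycle)
```

At such a low duty cycle the average is dominated by the sleep-state consumption, which is why the sleep current matters more than the active current in this regime.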

---

## Mainstream protocols / MAC layers

### Standardized WSN protocols

The traditional WiFi + IP stack is too resource intensive for WSN.

802.15.1 (Bluetooth: cable replacement):

In classic Bluetooth there is one PAN coordinator that can have up to 7 slaves, with a bandwidth of 1 Mb/s. With BLE (Bluetooth Low Energy) the number of connections is unlimited, but the maximum speed is 125 kb/s.

BLE is also the basis for Bluetooth mesh networks.

### Personal area networks

802.15.4 is a protocol designed for wireless personal area networks (e.g. smart homes, cars, remote engineering) and is used for monitoring and control. It is easy to install and low power, but it lacks mobility support and has low transmission rates and short range.

The 802.15.4 protocol defines the physical and MAC layers.

### IEEE 802.15.4 MAC Overview

An 802.15.4 network is composed of nodes with different functionalities/roles:

- FFD (full function device): works in any topology, can act as network coordinator, talks to any device.
- RFD (reduced function device): star topology only, cannot act as network coordinator, talks only to the coordinator, very simple implementation.

One of the FFDs becomes the coordinator (master) while the other nodes act as slaves. If the network takes any form other than a star, every node must be an FFD. More complex networks can be organized as clustered stars, i.e. multiple stars inside the same network, and RFDs appear only inside those stars.

**CSMA-CA**: stands for carrier-sense multiple access with collision avoidance. It handles the case where two nodes try to transmit to the same node at the same time: before sending, a node senses the carrier, and if the channel is occupied it backs off and retries later. WiFi can complement this with an RTS/CTS (Request to Send / Clear to Send) handshake: the sender asks for the channel, the receiver answers with a clear-to-send, and any other node that hears the CTS stays silent for the duration of the exchange.

**IEEE 802.15.4 Optional Superframe Structure**: a period of 15 ms × 2^n, where n is a configurable integer between 0 and 14 inclusive (the standard calls it the beacon order; it is not the number of nodes). The period is delimited by network beacons, one at its start and the next at its end, and is divided as follows:

- **Network beacon**: a short window in which the coordinator transmits network information, the frame structure, and notifications of pending node messages.
- **Beacon extension period**: space right after the beacon, reserved in case the beacon needs more time.
- **Contention access period**: can be accessed by any node using CSMA-CA.
- **Contention free period**: time reserved for nodes requiring guaranteed bandwidth (annotated with n = 0 in the slide).
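To get a feel for the range of periods that the expression 15 ms × 2^n allows, here is a small sketch. The function name is hypothetical and the 15 ms base is the rounded value used in these notes.

```python
BASE_MS = 15  # base period as quoted in the notes


def superframe_period_ms(n: int) -> int:
    """Period in milliseconds for exponent n, with 0 <= n <= 14."""
    if not 0 <= n <= 14:
        raise ValueError("n must be between 0 and 14 inclusive")
    return BASE_MS * 2 ** n


for n in (0, 1, 6, 14):
    print(f"n={n:2d}: {superframe_period_ms(n)} ms")
```

The period therefore spans from 15 ms (n = 0) up to roughly four minutes (n = 14), which lets a coordinator trade latency against how long nodes can sleep between beacons.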

### MAC Modes

Several MAC modes branch off from the base 802.15.4 MAC protocol, each implementing different features.


### Zigbee
This is a low power wireless mesh network standard targeted at battery powered devices in wireless control and monitoring applications.

- Built on top of the IEEE 802.15.4 standard, adding a Network layer.
- Created and maintained by the Zigbee Alliance.
- Supports dynamic node joining with assigned 16-bit addresses.
- Enables tree, star, and mesh network topologies.
- Uses a distance-vector algorithm for route discovery.
- Provides 128-bit AES encryption for security.
- Includes an application layer framework for building applications.

### 6LoWPAN

6LoWPAN (IPv6 over Low-Power Wireless Personal Area Networks) is an IETF standard that enables efficient transmission of IPv6 packets over IEEE 802.15.4 low-power radio networks. Although IPv6 already exists, its default header size and packet structure are too large for constrained IoT devices, so 6LoWPAN introduces an adaptation layer to make IPv6 practical in low-power environments.

### Key Points

- **Why not just use standard IP?**
  - Developers are already familiar with IP and its tools.
  - But IPv6 headers (40 bytes) are too large for low-power radios with very small frame sizes.

- **Addressing limitations**
  - Traditional ZigBee uses 16-bit addresses, which is insufficient for large-scale IoT deployments.
  - 6LoWPAN uses IPv6 addresses, enabling global addressing and interoperability.

- **IETF Standard**
  - Developed by the IETF as: **6LoWPAN = IPv6 over IEEE 802.15.4**.

- **Header Compression**
  - IEEE 802.15.4 devices have **64-bit MAC addresses**.
  - IPv6 interface identifiers also use **64 bits**, so these can be derived directly from the MAC.
  - This eliminates the need to repeat the address in the IPv6 header.
  - Result: a compressed 6LoWPAN header as small as **4 bytes** for simple point-to-point or star networks.

- **Optional Headers**
  - **Fragmentation Header** → Used when packets exceed the tiny 802.15.4 frame size.
  - **Mesh Header** → Adds routing information for multi-hop mesh networks.

**In Short**
6LoWPAN makes IPv6 feasible on extremely constrained IoT devices by compressing headers, supporting fragmentation, and enabling mesh networking, all while preserving the familiar IPv6 protocol model.
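The header-compression idea (the IPv6 interface identifier can be derived from the 64-bit MAC address, so it need not be transmitted) can be sketched as follows. This assumes the usual modified EUI-64 rule, in which the universal/local bit of the first byte is flipped, and a link-local `fe80::/64` prefix; the MAC value and the function name are made up for the example.

```python
import ipaddress

LINK_LOCAL_PREFIX = bytes.fromhex("fe80000000000000")  # fe80::/64


def link_local_from_eui64(eui64: bytes) -> ipaddress.IPv6Address:
    """Build a link-local IPv6 address whose lower 64 bits come from the MAC address."""
    if len(eui64) != 8:
        raise ValueError("expected a 64-bit (8-byte) MAC address")
    iid = bytearray(eui64)
    iid[0] ^= 0x02  # flip the universal/local bit (modified EUI-64 rule, assumed here)
    return ipaddress.IPv6Address(LINK_LOCAL_PREFIX + bytes(iid))


mac = bytes.fromhex("00124b0001020304")  # made-up example MAC
addr = link_local_from_eui64(mac)
print(addr)
```

Because both ends can recompute this address from the link-layer frame, the compressed header can simply omit it.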

### LoRa (Long Range) Summary

LoRa is a long-range, low-power, and low-throughput wireless communication technology designed for IoT devices that need extended range and long battery life. It separates responsibilities between the physical layer (LoRa) and the MAC layer (LoRaWAN), enabling flexible deployments and adaptive performance.

### Key Points

- **Characteristics**
  - Designed for **long-range**, **low-power**, and **low-data-rate** communication.
  - Ideal for battery-powered IoT sensors that transmit small, infrequent messages.

- **Layer Structure**
  - **LoRa (PHY Layer)** → A proprietary modulation technique developed by *Semtech*.
  - **LoRaWAN (MAC Layer)** → An open specification defined by the **LoRa Alliance** for networking, security, device classes, and gateway operation.

- **Range**
  - Typically **2–5 km** in urban/obstructed environments.
  - Can exceed **15 km** in rural or open areas.

- **Data Throughput**
  - Ranges from **0.3 kbps to 50 kbps**, depending on configuration.

- **Adaptive Spreading Factor (SF)**
  - LoRaWAN dynamically adjusts the **Spreading Factor** of the LoRa PHY.
  - Higher SF → **longer range** but **lower data rate**.
  - Lower SF → **higher data rate** but **reduced range**.
  - This trade-off helps optimize **network capacity**, **device lifetime**, and **reliability**.

**Summary**
LoRa and LoRaWAN enable ultra-long-range, energy-efficient IoT communication by combining Semtech's PHY technology with an open, adaptive MAC layer that balances range, throughput, and power consumption.
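The spreading-factor trade-off can be made concrete with the commonly cited LoRa bit-rate expression $R_b = SF \cdot \frac{BW}{2^{SF}} \cdot CR$. The formula is not given in the lecture, and the 125 kHz bandwidth and 4/5 coding rate below are assumed example settings, so treat this as a sketch.

```python
def lora_bit_rate(sf: int, bandwidth_hz: float = 125_000, coding_rate: float = 4 / 5) -> float:
    """Approximate LoRa bit rate in bit/s for spreading factor `sf`."""
    return sf * (bandwidth_hz / 2 ** sf) * coding_rate


for sf in range(7, 13):
    print(f"SF{sf}: {lora_bit_rate(sf):7.1f} bit/s")
```

Under these assumptions the rate drops from about 5.5 kbit/s at SF7 to about 0.3 kbit/s at SF12, in line with the lower end of the throughput range quoted above.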

### SigFox

SigFox is an ultra-narrowband, long-range, low-power communication technology designed for extremely small and infrequent data transmissions. It operates on a global network deployed and maintained by SigFox through partnerships with local telecom operators, enabling wide coverage without requiring users to install or manage their own infrastructure.

### Key Points

- **Range**
  - Typically **a few kilometers** in dense urban environments.
  - Up to **40 km** in rural areas when using directional antennas.
  - Signals can even **penetrate underground**, making it suitable for buried sensors.

- **Network Deployment**
  - SigFox builds and manages the antenna network globally.
  - Works with **local telecom partners** for deployment and maintenance.
  - Device owners simply **subscribe** to the service (around **1 euro per device per year**).

### Uplink Capabilities (Device → Network)
- Up to **140 messages per day**.
- Maximum **6 messages per hour**.
- Each uplink message can contain **up to 12 bytes** of payload.

### Downlink Capabilities (Network → Device)
- Up to **4 downlink messages per day**.
- Each downlink message carries **up to 8 bytes** of payload.

### Best Use Cases
SigFox is ideal for applications that:
- Send **very small** and **infrequent** bursts of data.
- Require **minimal downlink communication**.
- Need **wide coverage** and **very long battery life**.

Examples include:
- Utility meters  
- Environmental sensors  
- Alarms and alert systems  

### Narrowband Internet of Things (NB-IoT)

Narrowband Internet of Things (NB-IoT) is a secure, reliable, and efficient Low-Power Wide-Area (LPWA) technology standardized by **3GPP**. It operates on **licensed cellular spectrum**, ensuring high reliability and predictable network performance. NB-IoT is designed to support massive numbers of low-power devices with excellent coverage, long battery life, and low deployment cost.

### Key Characteristics

- **Low Power Consumption**
  - Optimized for long battery life (often exceeding 10 years) through power-saving modes and efficient signaling.

- **Low Device & Connectivity Cost**
  - Minimal hardware complexity and lightweight communication protocols reduce both module costs and subscription fees.

- **Massive Scalability**
  - Supports **tens of thousands of devices per base station**, making it ideal for large-scale IoT deployments.

- **Long Range**
  - Approximately **5 km** in dense urban areas.
  - Up to **50 km** in rural or open environments.

- **Excellent Signal Penetration**
  - Strong indoor coverage, capable of reaching:
    - Elevators
    - Basements
    - Underground parking facilities

**Summary**
NB-IoT delivers secure, energy-efficient, and large-scale IoT connectivity using licensed cellular networks, making it suitable for applications like smart meters, tracking devices, industrial sensors, and other deployments requiring deep indoor coverage and high reliability.

---

## IoT Routing

Routing is the process of selecting the best path and sending data packets across networks, from a source to a destination. In IoT, its job is to build a network on top of low-power wireless links, so that a data packet can traverse multiple links to reach its destination.

### Address Assignment

ZigBee uses a **distributed address assignment scheme** to allocate network addresses to devices in a structured and scalable way. Instead of relying on a centralized server, each parent node calculates address ranges for its children based on global network parameters defined by the ZigBee coordinator.

**Coordinator-Defined Network Parameters**

The ZigBee coordinator sets three key parameters that shape the tree-based network:

- **Cm – Maximum number of children per router**
  - Total number of child devices (end devices + routers) a router can support.

- **Rm – Maximum number of child routers per parent**
  - Number of children that can themselves act as routers.

- **Lm – Maximum network depth**
  - Defines how many layers the network tree can have.

**Address Allocation Logic**

Each parent device uses the parameters **Cm**, **Rm**, and **Lm** to compute a value called **Cskip**.

- **Cskip**
  - Determines the size of the address block reserved for each child.
  - Helps ensure that child routers receive non-overlapping address pools.
  - Enables scalable and conflict-free hierarchical addressing.

**Formula**

ZigBee builds a tree of parents and children and uses the following formulas to determine the address assigned to each child:

Let `A_parent` be the parent's network address, `Cskip(d)` the skip value at depth `d`,
`Rm` the maximum number of child routers a parent can have, and `n` the 1-based index
of the child.

### nth child **router** (1 ≤ n ≤ Rm)

    A_parent + (n - 1) * Cskip(d) + 1

### nth child **end device** (1-based index for end devices)

    A_parent + Rm * Cskip(d) + n

### Notes
- `Cskip(d)` depends on the coordinator-defined parameters (`Cm`, `Rm`, `Lm`) and the parent's depth `d`.
- Router children receive full address blocks of size `Cskip(d)`; end devices receive single addresses placed after all router blocks.
- `n` starts at `1` (the first child).

Because addresses are assigned in blocks, a node that receives a packet can check locally whether the destination is itself or one of its children and deliver it accordingly; otherwise it relays the packet onward.
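The two address formulas can be tried out with a short sketch. The notes do not give the expression for `Cskip(d)`; the one used below is the form usually quoted from the ZigBee specification and should be checked against the lecture slides. The parameter values (Cm = 6, Rm = 4, Lm = 3) and the function names are assumptions for illustration.

```python
def cskip(d: int, cm: int, rm: int, lm: int) -> int:
    """Size of the address block given to each router child of a parent at depth d."""
    if not 0 <= d < lm:
        return 0  # a device at maximum depth cannot accept children
    if rm == 1:
        return 1 + cm * (lm - d - 1)
    # Both numerator and denominator are negative; the division is exact.
    return (1 + cm - rm - cm * rm ** (lm - d - 1)) // (1 - rm)


def router_address(parent: int, n: int, d: int, cm: int, rm: int, lm: int) -> int:
    """Address of the n-th child router (1-based) of a parent at depth d."""
    if not 1 <= n <= rm:
        raise ValueError("n must satisfy 1 <= n <= Rm")
    return parent + (n - 1) * cskip(d, cm, rm, lm) + 1


def end_device_address(parent: int, n: int, d: int, cm: int, rm: int, lm: int) -> int:
    """Address of the n-th child end device (1-based) of a parent at depth d."""
    if not 1 <= n <= cm - rm:
        raise ValueError("n must satisfy 1 <= n <= Cm - Rm")
    return parent + rm * cskip(d, cm, rm, lm) + n


CM, RM, LM = 6, 4, 3  # assumed example parameters
routers = [router_address(0, n, 0, CM, RM, LM) for n in range(1, RM + 1)]
end_devices = [end_device_address(0, n, 0, CM, RM, LM) for n in range(1, CM - RM + 1)]
print("Cskip(0) =", cskip(0, CM, RM, LM))
print("router children of the coordinator:", routers)
print("end-device children of the coordinator:", end_devices)
```

With these parameters the coordinator (address 0) hands out router addresses 1, 32, 63, and 94, each heading a block of 31 addresses, and end-device addresses 125 and 126.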

### Flooding route discovery

Flooding is a basic route discovery technique in wireless networks where a source node broadcasts a route-request packet to all its neighbors.

**Directed diffusion**
In normal flooding, every node rebroadcasts every Route Request (RREQ) it receives, creating large overhead. Directed diffusion reduces this cost by forwarding RREQs only in the direction of the destination. A node can estimate that direction from local information such as geographical position or signal strength.

### Directed Diffusion: Short Summary

Directed Diffusion is a **data-centric, distributed routing protocol** for sensor networks where the sink requests data, and sources respond along locally-created gradients. The best path emerges through **reinforcement**, without global network knowledge.

1. **Interest Flooding (Sink → Network)**
   - Sink broadcasts an **Interest** describing the desired data.
   - Each node receiving it creates **gradients** pointing to the neighbor it got the Interest from.
   - Gradients contain: neighbor ID, data rate, lifetime.
   - Duplicate Interests are not forwarded again; Interests are soft state and expire.

2. **Exploratory Data Flow (Source → Sink)**
   - Nodes with matching data send **exploratory data** along all available gradients.
   - Intermediate nodes forward data only along known gradients.
   - Loops are avoided using data caches.

3. **Gradient Reinforcement**
   - Sink selects the **best path** (e.g., lowest latency or highest reliability).
   - Sends a **reinforced Interest** along that path.
   - Nodes strengthen the corresponding gradient, suppressing weaker gradients.

4. **Data Transmission**
   - Sources now send data along the **reinforced path**.
   - Forwarding decisions are **local**; nodes only know their own gradients.

### Key Characteristics
- Distributed and local; no global routing tables.
- Initial flooding is limited; reinforcement ensures efficiency.
- Emergent optimal path similar in effect to shortest-path algorithms.
- Naturally supports multiple sources and sinks.
- Reduces energy consumption and avoids redundant transmissions.


In the comparison presented in the lecture, plain flooding performs poorly because data travels over multiple paths from the origin to its destination, while diffusion does better even than OM (Omniscient Multicast) because duplicates are suppressed. Flooding also has very high latency due to MAC collisions, whereas diffusion finds the lowest-delay path in the network through its gradients.

### Dynamic Source Routing (DSR)

DSR is an **on-demand routing protocol** for ad hoc networks. When a source node (S) wants to send data to a destination (D) and has no known route:

1. **Route Discovery:**  
   - S broadcasts a Route Request (RREQ) to neighbors.  
   - Each node appends its own ID and forwards the RREQ, avoiding revisiting nodes.  
2. **Route Reply:**  
   - D receives the RREQ and sends a Route Reply (RREP) back along the reverse path.  
   - RREP contains the complete route from S to D.  
3. **Data Transmission:**  
   - S caches the route.  
   - Packets include the **source route** in the header.  
   - Intermediate nodes forward packets using this source route.

**Pros:**  
- Routes are maintained only between communicating nodes.  
- A single route discovery can yield multiple routes to the destination.

**Cons:**  
- Packet headers grow with route length.  
- Flooding RREQs may reach all nodes → collisions and high contention.
- Random delays are often inserted to reduce collisions.
- Multiple RREPs can cause a "Route Reply Storm."
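The route-discovery steps can be sketched as a small simulation. This is a simplification under stated assumptions: links are bidirectional, each intermediate node rebroadcasts only the first RREQ copy it hears, and the destination answers every copy that reaches it. The graph and the function name are invented for the example.

```python
from collections import deque


def dsr_route_discovery(graph, source, dest):
    """Flood an RREQ that accumulates node IDs; return the routes reported back to `source`."""
    routes = []
    forwarded = {source}        # nodes that have already rebroadcast this RREQ
    queue = deque([[source]])   # each entry is the route record carried by one RREQ copy
    while queue:
        path = queue.popleft()
        for neighbor in graph[path[-1]]:
            if neighbor == dest:
                routes.append(path + [dest])     # destination replies with an RREP for this route
            elif neighbor not in forwarded:
                forwarded.add(neighbor)
                queue.append(path + [neighbor])  # node appends its own ID and rebroadcasts
    return routes


graph = {
    "S": ["A", "B"],
    "A": ["S", "D"],
    "B": ["S", "C"],
    "C": ["B", "D"],
    "D": ["A", "C"],
}
for route in dsr_route_discovery(graph, "S", "D"):
    print(" -> ".join(route))
```

The source would cache these routes and place the chosen one in each packet header, which is where the header growth listed among the cons comes from.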

---

### Ad Hoc On-Demand Distance Vector (AODV) Routing

- Unlike DSR, **AODV maintains routing tables at each node**, avoiding large source-route headers.  
- Retains DSR's advantage: routes are maintained only as needed.  
- Reduces header overhead, especially for small data packets, improving efficiency over DSR.

### Ad Hoc On-Demand Distance Vector (AODV)

AODV is an **on-demand routing protocol** similar to DSR but uses **routing tables** instead of source routes in packet headers.

**How it works:**

1. **Route Discovery:**  
   - Source broadcasts a Route Request (RREQ).  
   - Each forwarding node sets up a **reverse path** pointing to the source.  
   - Assumes **symmetric (bi-directional) links**.

2. **Route Reply:**  
   - Destination receives the RREQ and sends a Route Reply (RREP).  
   - RREP travels back along the **reverse path** created during RREQ forwarding.  

**Key Features:**  
- Reduces header overhead compared to DSR.  
- Routes are created **only when needed**.  
- Maintains route information in **routing tables** at intermediate nodes.

### AODV: Route Request and Route Reply

- **RREQ:**  
  - Includes the last known **sequence number** for the destination.  
  - Intermediate nodes can reply with a **Route Reply (RREP)** if they know a more recent path than the sender.  
- **RREP forwarding:**  
  - Intermediate nodes record the **next hop** to the destination.  
- **Route expiration:**  
  - Reverse path entries expire after a timeout.  
  - Forward path entries expire if not used within **active_route_timeout**.

### AODV: Key Features

- Routes are **not included in packet headers**.  
- Nodes maintain **routing tables only for active routes**.  
- Each node keeps **at most one next-hop per destination**.  
- Sequence numbers ensure **fresh routes** and prevent **routing loops**.  
- Unused routes expire automatically, even if topology is unchanged.  
- Compared to DSR: DSR may store multiple routes per destination.


### In summary
AODV is like DSR conceptually (on-demand, route discovery) but relies on local routing table entries (predecessor/next-hop pointers) instead of embedding the entire route in the packet.
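The difference from DSR can be illustrated with a sketch in which nodes store next-hop entries instead of placing the route in the packet. It assumes symmetric links and leaves out sequence numbers and timeouts; the graph and function names are made up for the example.

```python
from collections import deque


def aodv_discover(graph, source, dest):
    """Return per-node routing tables {node: {target: next_hop}} after one RREQ/RREP exchange."""
    tables = {node: {} for node in graph}
    seen = {source}
    queue = deque([source])
    while queue:                                 # RREQ flood
        node = queue.popleft()
        if node == dest:
            break
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                tables[neighbor][source] = node  # reverse path entry toward the source
                queue.append(neighbor)
    if dest not in seen:
        return tables                            # destination unreachable, no RREP
    node = dest
    while node != source:                        # RREP travels along the reverse path
        previous_hop = tables[node][source]
        tables[previous_hop][dest] = node        # forward path entry toward the destination
        node = previous_hop
    return tables


def forward(tables, source, dest):
    """Deliver a packet hop by hop using only next-hop entries (no route in the header)."""
    path = [source]
    while path[-1] != dest:
        path.append(tables[path[-1]][dest])
    return path


graph = {
    "S": ["A", "B"],
    "A": ["S", "D"],
    "B": ["S", "C"],
    "C": ["B", "D"],
    "D": ["A", "C"],
}
tables = aodv_discover(graph, "S", "D")
print(forward(tables, "S", "D"))
```

Each node ends up with at most one next hop per destination, which matches the key features listed above.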

---

# LECTURE 27/10/2025

# BLOCKCHAIN

A blockchain is a decentralized, immutable digital ledger that securely records transactions across a network of computers. 

## Cryptographic Tools (Pages 3-25)

### Cryptographic Hash Function (Pages 3-4)
A **Cryptographic Hash Function** is defined as $H: X=\{0,1\}^{*} \rightarrow Y=\{0,1\}^{L}$ with a fixed length $L$, such as $128/160/256/512$ bits. The informal property is that a **small change in the input produces a completely different output**. The key security properties, where collisions exist but are hard to find, are:
* **Pre-image resistance:** It's hard to find an input $x$ for a given output $y$ such that $H(x)=y$.
* **Second pre-image resistance:** Given an input $M$, it's hard to find a different $M'$ such that $H(M')=H(M)$.
* **Collision-resistance:** It's hard to find two different inputs, $x_1 \neq x_2$, that produce the same hash output, $H(x_1) = H(x_2)$.

### Hash Function Application (Pages 5-8)
A case study demonstrates using hash functions to ensure fairness when playing **Rock-paper-scissors** over the internet or email. A robust solution involves the player computing $H(T||M||S)$, where $M$ is the move, $S$ is a random string computed by the player, and $T$ is a string (nonce/timestamp) sent by the opponent.
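A minimal sketch of this commitment scheme, assuming SHA-256 as $H$ and fixed-length 16-byte values for $T$ and $S$ so that plain concatenation stays unambiguous (the function names are hypothetical):

```python
import hashlib
import secrets


def commit(t: bytes, move: bytes, s: bytes) -> str:
    """Commitment H(T || M || S) as a hex string."""
    return hashlib.sha256(t + move + s).hexdigest()


def verify(commitment: str, t: bytes, move: bytes, s: bytes) -> bool:
    """Check a revealed (move, s) pair against an earlier commitment."""
    return secrets.compare_digest(commitment, commit(t, move, s))


t = secrets.token_bytes(16)   # nonce chosen by the opponent
s = secrets.token_bytes(16)   # secret random string chosen by the player
c = commit(t, b"rock", s)     # sent before the opponent reveals their own move

# Later the player reveals the move and s, and the opponent checks them.
print("honest reveal accepted:", verify(c, t, b"rock", s))
print("changed move accepted:", verify(c, t, b"paper", s))
```

The random string $S$ keeps the opponent from brute-forcing the three possible moves, and the opponent-supplied $T$ prevents reusing an old commitment.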

### Secure Hash Algorithm (SHA) (Page 9)
The Secure Hash Algorithm (SHA) is a family of cryptographic hash functions published by the National Institute of Standards and Technology (NIST). Notable variants include:
* **SHA-256:** Output size 256 bits, used in **Bitcoin**.
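* **Keccak-256:** Output size 256 bits, used in **Ethereum**. It is the original Keccak submission on which SHA-3 is based and differs from the standardized SHA3-256 only in its padding.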

### Using Hashes for Data Integrity (Pages 10-12)
To verify a large data structure, one can compute a hash $h$ and send it over a secure/expensive channel, then send the data normally.

For data split into chunks $c_1...c_k$, instead of sending $k$ hashes, a chain of hashes can be used (chained hashes).
$$h_{1}=H(c_{1}),\ h_{2}=H(c_{2}||h_{1}),\ \dots,\ h_{k}=H(c_{k}||h_{k-1})$$
This allows verification of a sequence of chunks by checking a single hash, e.g., $h_{10}$ verifies the first 10 chunks/blocks.
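The chained construction can be sketched in a few lines, assuming SHA-256 as $H$ (the function name is hypothetical):

```python
import hashlib


def chain_hashes(chunks):
    """Return [h_1, ..., h_k] with h_1 = H(c_1) and h_i = H(c_i || h_{i-1})."""
    hashes = []
    previous = b""  # empty for the first chunk, so h_1 = H(c_1)
    for chunk in chunks:
        previous = hashlib.sha256(chunk + previous).digest()
        hashes.append(previous)
    return hashes


chunks = [b"chunk-1", b"chunk-2", b"chunk-3"]
original = chain_hashes(chunks)
tampered = chain_hashes([b"chunk-X", b"chunk-2", b"chunk-3"])
print("last hash changed after tampering:", original[-1] != tampered[-1])
```

Since every hash depends on the previous one, modifying any earlier chunk changes the final hash, so only that final hash needs to travel over the secure channel.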

### Digital Signatures and Asymmetric Crypto (Pages 13-17)
Digital signatures rely on three algorithms: **KeyGen**, **Sign**, and **Verify**. KeyGen produces a **secret signing key ($sk$)** and a **public verification key ($pk$)**.
* Algorithms using private/public keys are very **slow**.
* The signature is usually computed on the **hash** of the message for efficiency.
* The receiver runs **Verify** with the public key: in RSA-style schemes this amounts to decrypting the signature with the public key and comparing the result to the hash of the message.

The algorithm used for Bitcoin transactions is **Elliptic Curve Digital Signature Algorithm (ECDSA)** with curve **SECP256K1** and hashing algorithm **SHA256**.

### Merkle Tree (Hash Tree) (Pages 18-25)
A **Merkle Tree** or Hash Tree is a data structure summarizing information about a large quantity of data to check its content. 
* It combines **hash functions** ($H$) with a **binary tree structure**.
* **Leaves** are $H$ applied to initial symbols.
* **Internal nodes** are $H$ applied to the concatenation of their children's hashes.
* If data is corrupted (e.g., data B), it invalidates the hash of its leaf and all nodes up the branch to the root.
* The root hash can be stored safely. The work to obtain the root is approximately $2N$ hash function evaluations, where $N$ is the number of leaves.
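A minimal sketch of computing a Merkle root, assuming SHA-256 as $H$. Duplicating the last hash on levels with an odd number of nodes is one common convention and is an assumption here, as are the function names.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(items):
    """Root of a binary hash tree whose leaves are H(item)."""
    if not items:
        raise ValueError("need at least one item")
    level = [H(item) for item in items]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: repeat the last hash on odd levels
        # Each parent is H applied to the concatenation of its two children.
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


data = [b"A", b"B", b"C", b"D"]
root = merkle_root(data)
corrupted_root = merkle_root([b"A", b"X", b"C", b"D"])
print("root changes when B is corrupted:", root != corrupted_root)
```

For $N$ leaves this performs $N$ leaf hashes plus about $N$ internal hashes, which is where the estimate of roughly $2N$ evaluations comes from.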

### Blockchain Basics (Pages 27-47)

### Definition and Hash Pointers (Pages 27-28)
A **blockchain** is a digitized, decentralized, public ledger of all cryptocurrency transactions. It is a distributed database that maintains a continuously growing list of records, called **blocks**.

The blocks are linked using a **Hash Pointer (HP)**, which is a **tamper-evident data pointer**. An HP includes a pointer to where the information is stored and a cryptographic hash of that information. This allows retrieval of the data and verification that it hasn't changed. 

### The Chain of Blocks (Pages 29-32)
In the blockchain structure, **each block has a Hash Pointer to the previous block**. 
* If data in block $i$ is tampered with, its hash changes, requiring the hash pointer in block $i+1$, $i+2$, and so on to be tampered with.
* This makes the chain tamper-evident: any modification is detected unless every later hash pointer is rewritten as well.

### Decentralization and Network (Pages 33-34)
The blockchain functions as a decentralized database, providing:
* **Consistency:** Information is a shared, continually reconciled, distributed database.
* **Robustness:** No centralized version exists for a hacker to corrupt.
* **Availability:** Data is stored by millions of computers (nodes) simultaneously and is publicly verifiable. 


A **Node** is a computer connected to the network that validates and relays data (e.g., transactions). Every node gets a copy of the blockchain and is an "administrator," making the network decentralized.

### Nakamoto Consensus (Pages 35-37)
To decide which chain to trust in the event of a fork, the **longest blockchain has consensus**. 
* The system assumes adding a block is **computationally expensive** and that most nodes are honest.
* A shorter blockchain is ignored, as it was likely created later or is out of sync.
* A transaction is considered **"accepted"** (immutable) if it is **buried deep enough** (e.g., several blocks deep).

### Double Spending and Proof of Work (PoW) (Pages 38-45)
**Double spending** is the result of successfully spending digital money more than once. Protection against this involves verifying that the input for a transaction has not previously been spent.

The challenge is preventing the **Sybil attack**, where a malicious actor creates many fake IDs cheaply to distribute a fraudulent fork.

This is solved by making block addition **computationally expensive** using **Proof of Work (PoW)**.
* The puzzle is **hard to solve** when adding a block but **easy to verify**.
* The puzzle is solved by inserting a **nonce** into the header such that $H(\text{header})$ starts with $n$ zeros. A malicious actor cannot redo the work on their fork (adding $N+1$ blocks) faster than all other miners add blocks on top of the accepted block.
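The "hard to solve, easy to verify" asymmetry can be shown with a toy puzzle. This sketch assumes a single SHA-256 over the header bytes followed by an 8-byte nonce and counts leading zero *bits*; real systems differ in the details, and the names are hypothetical.

```python
import hashlib


def check(header: bytes, nonce: int, zero_bits: int) -> bool:
    """Verification: one hash evaluation and one comparison."""
    digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - zero_bits))


def mine(header: bytes, zero_bits: int) -> int:
    """Search: try nonces until the hash starts with `zero_bits` zero bits."""
    nonce = 0
    while not check(header, nonce, zero_bits):
        nonce += 1
    return nonce


header = b"previous-hash|tx-root"
nonce = mine(header, zero_bits=12)  # about 2**12 attempts on average
print("nonce found:", nonce)
print("valid:", check(header, nonce, 12))
```

Each extra zero bit doubles the expected search effort, while verification stays a single hash.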

### Blockchain Details: Users and Mining (Pages 46-59)

### Transactions and Users (Pages 50-54)
A **user** is someone who can transfer money and is represented by a wallet, which is a pair of keys: **(sk: private key, pk: public key)**.

A **transaction** format is `([input transactions], [output identity pk, how much], signature)`.
* The transaction is signed with the user's private key ($sk$).
* It specifies the output recipients by their public keys.
* **All money from input transactions must be used**.
* The output contains a **challenge script** (locking script or scriptPubKey) with the spending conditions.
* The script is encoded in a **non-Turing complete language** to protect miners against DOS attacks (e.g., infinite loops).

### Block and Transaction Organization (Pages 55-58)
Transactions are received continuously and organized into a **Merkle tree**. 
* The **Root Hash (Tx\_Root)** of the Merkle tree is part of the new block header, impacting the next block's hash.
* The Merkle tree is efficient because it can be computed as transactions are received, and there's no need to recompute the hash of all transactions when one is received.

### Miners and Incentives (Page 59)
A **miner's job** is to verify transactions, compute the Merkle root hash, solve the PoW puzzle (by finding the nonce), and broadcast the new header.

Miners are incentivized by:
* **Block Reward:** Bitcoins granted by default for finding the nonce/new block.
* **Fees:** A difference between transaction input and output that is paid to the winning miner.

The downside is that the mining process wastes a vast amount of energy.

## Basics on Ethereum Smart Contracts (Pages 61-73)

### Smart Contracts (Pages 61-63)
**Smart Contracts** are computer protocols that facilitate, verify, or enforce the negotiation or performance of a contract, or that make a contractual clause unnecessary. They automatically enforce obligations, often summarized as "**code is law**".

Smart contracts are the main building blocks of **Ethereum**.
* A contract is a computer program that lives inside the distributed Ethereum network.
* It has its own Ether balance, memory, and code.
* Ethereum uses **Turing complete contracts**.

### Execution and Gas (Page 64)
A contract can be activated and run by sending a transaction that funds it with **Ether (ETH)**.
* The contract runs for a time dependent on how much **gas** is paid for; gas measures the computation performed and is paid in ETH. ETH fees go to the winning miner.
* Each miner runs the smart contract and produces the same output. Other miners validate the result when the winning miner publishes the block.

### Blockchain Use Cases and Characteristics (Pages 71-73)
Blockchain use cases fall into categories like **Smart Contracts** (e.g., Escrow, Wagers, Digital Rights), **Digital Currency** (e.g., Global Payments, Microfinance), **Securities** (e.g., Debt, Equity, Derivatives), and **Record Keeping** (e.g., Healthcare, Title Record, Voting). 


Blockchains are characterized by being a **Public ledger system**, **Distributed**, **Secure & reliable**, and **Immutable**.

A five-point test for using a blockchain includes:
1.  Are there **multiple parties** in the ecosystem?
2.  Is establishing **trust** between all parties an issue?
3.  Is it critical to have a **tamper-proof permanent record**?
4.  Are we securing the **ownership or management of a finite resource**?
5.  Does this ecosystem benefit from improved **transparency**?

---


# LECTURE 28/10/2025

# Distributed Algorithms 

We have seen how to fix the problems introduced by distributed communication, but not really how to implement those solutions.

Here are some of the algorithms we have seen:

**FIFO-broadcast**
```
on initialisation do
    sendSeq:=0; delivered:=<0,0,...,0>; buffer:={}
end on

on request to broadcast m at node Ni do
    send(i,sendSeq,m) via reliable broadcast
    sendSeq:=sendSeq+1
end on

on receiving msg from reliable broadcast at node Ni do
    buffer:=buffer U {msg}
    while exist sender, m. (sender, delivered[sender],m) in buffer do
        deliver m to the application
        delivered[sender]:=delivered[sender]+1
    end while
end on
```
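The pseudocode above can be turned into a small runnable sketch. The reliable broadcast layer is assumed to exist and is not modelled: `broadcast` just returns the tuple to hand to it, and `receive` is called whenever that layer delivers a message. Class and attribute names are hypothetical.

```python
class FifoBroadcastNode:
    def __init__(self, node_id: int, n_nodes: int):
        self.node_id = node_id
        self.send_seq = 0
        self.delivered = [0] * n_nodes  # next sequence number expected from each sender
        self.buffer = set()             # messages received but not yet delivered
        self.app_log = []               # messages delivered to the application, in order

    def broadcast(self, m):
        """Tag m with (sender, sequence number); the caller passes it to reliable broadcast."""
        msg = (self.node_id, self.send_seq, m)
        self.send_seq += 1
        return msg

    def receive(self, msg):
        """Buffer msg, then deliver every message whose sequence number is next in line."""
        self.buffer.add(msg)
        progress = True
        while progress:
            progress = False
            for sender, seq, m in list(self.buffer):
                if seq == self.delivered[sender]:
                    self.buffer.remove((sender, seq, m))
                    self.app_log.append(m)
                    self.delivered[sender] += 1
                    progress = True


a = FifoBroadcastNode(0, 2)
b = FifoBroadcastNode(1, 2)
first, second = a.broadcast("x"), a.broadcast("y")
b.receive(second)  # arrives out of order, so it is held back
b.receive(first)   # now both can be delivered in sending order
print(b.app_log)
```

Unlike the pseudocode, this version removes delivered messages from the buffer, which keeps the buffer small without changing the delivery order.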


All the algorithms we have seen assume that each process has a unique ID.

## The Model
The system is represented as an undirected graph $G=(V,E)$:
* **Nodes ($V$):** Processes with $n=|V|$.
* **Edges ($E$):** Communication channels.
* **Knowledge:** Processes know their neighbors (local ports) but not the full global topology.
* **IDs:** Algorithms typically assume unique IDs, though randomization is used when IDs are absent or symmetry needs breaking.

---

## Synchronous Algorithms
In this model, computation proceeds in global rounds.

### A. Leader Election (Symmetry Breaking)
If processes are indistinguishable (no unique IDs), deterministic leader election is impossible due to symmetry.
* **Solution:** Processes choose random IDs from a large range $\{1, ..., r\}$.
* **Probability:** The probability that two given processes pick the same ID is $1/r$, so collisions become unlikely as $r$ grows.
* **Algorithm:** Processes exchange random IDs. If the maximum ID is unique, that node wins. Otherwise, the process repeats.
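A centralized simulation of this scheme, assuming every process can exchange its ID with every other one (a fully connected network). The function name and the seeded random generator are choices made for the example.

```python
import random


def elect_leader(n_processes: int, r: int, rng: random.Random):
    """Repeat until the maximum random ID is unique; return (leader index, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        ids = [rng.randint(1, r) for _ in range(n_processes)]  # each process draws an ID
        top = max(ids)
        if ids.count(top) == 1:  # unique maximum: that process wins
            return ids.index(top), attempts
        # Otherwise the maximum is shared, symmetry is not broken, so repeat.


leader, attempts = elect_leader(n_processes=10, r=10**6, rng=random.Random(0))
print(f"process {leader} elected after {attempts} attempt(s)")
```

With a large range $r$ a tie at the maximum is rare, so the election almost always finishes in the first attempt.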

### B. Maximal Independent Set (MIS)

**The Problem:** Select a subset of nodes $S$ such that:
1.  **Independent:** No two neighbors are both in $S$.
2.  **Maximal:** No nodes can be added to $S$ without violating independence (distinct from *Maximum* size).

**Luby's Algorithm:**
The algorithm operates in phases consisting of 2 rounds each.

* **Initialization:**
    * Set `status = active`.
    * Set `just_joined = false`.

* **Round $2 \cdot i$ (Comparison Step):**
    * If `status == active`:
        1.  Choose a random value: $val = rand(1, n^5)$.
        2.  Send $val$ to all neighbors in $N$.
        3.  Wait to receive values from active neighbors.
        4.  **Check:** Is $val >$ all received values?

* **Round $2 \cdot i + 1$ (Decision Step):**
    * **If Check Passed (Winner):**
        1.  Trigger output `in`.
        2.  Set `just_joined = true`.
        3.  Set `status = sleep`.
    * **Notification:**
        * If `just_joined == true`, send `joined` message to all neighbors.
        * Reset `just_joined = false`.
    * **Elimination:**
        * If a neighbor sent a `joined` message:
            1.  Trigger output `out`.
            2.  Set `status = inactive`.
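The phases can be simulated centrally as follows. This is a sketch in which each loop iteration stands for one two-round phase, the graph is an adjacency dictionary, and the seeded generator is only there to make the run repeatable; names are hypothetical.

```python
import random


def luby_mis(graph, rng: random.Random):
    """Return a maximal independent set of `graph` (dict: node -> list of neighbors)."""
    n = len(graph)
    active = set(graph)
    mis = set()
    while active:
        # Comparison step: every active node draws a value and compares with active neighbors.
        val = {v: rng.randint(1, n ** 5) for v in active}
        winners = {
            v for v in active
            if all(val[v] > val[u] for u in graph[v] if u in active)
        }
        # Decision step: winners join, and the neighbors they notify drop out.
        mis |= winners
        eliminated = {u for v in winners for u in graph[v] if u in active}
        active -= winners | eliminated
    return mis


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(sorted(luby_mis(graph, random.Random(1))))
```

Two neighbors can never both win a phase because the comparison is strict, which is what guarantees independence.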

### C. Breadth-First Spanning Trees (BFST)

**The Problem:** Construct a tree rooted at a distinguished vertex $V_0$. A node at distance $d$ from the root must appear at depth $d$ in the tree.

**Simple BFS Algorithm:**
* **Initialization:** Only the root $V_0$ is marked. It sends a search message to neighbors.
* **Execution:** If an unmarked node $i$ receives a search message from node $j$:
    1.  Marks itself.
    2.  Sets $j$ as its parent ($parent = j$).
    3.  Sends the message to its own neighbors in the next round.

**Complexity:**
* **Time:** Proportional to the graph diameter (number of rounds).
* **Message Complexity:** Each edge carries a message at most once, so complexity is $O(E)$.
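A round-by-round simulation of the simple BFS algorithm. In this sketch, when several search messages reach a node in the same round, the first one processed wins; that tie-break and the function name are assumptions.

```python
def sync_bfs_tree(graph, root):
    """Return (parent map, number of rounds) for a BFS tree rooted at `root`."""
    parent = {root: None}   # marked nodes and their parents
    frontier = [root]       # nodes that send search messages in the current round
    rounds = 0
    while frontier:
        next_frontier = []
        for j in frontier:
            for i in graph[j]:
                if i not in parent:   # unmarked node receives a search message from j
                    parent[i] = j
                    next_frontier.append(i)
        if next_frontier:
            rounds += 1
        frontier = next_frontier
    return parent, rounds


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
parent, rounds = sync_bfs_tree(graph, root=0)
print(parent, "rounds:", rounds)
```

Because all messages of a round arrive together, a node is first reached in the round equal to its distance from the root, so its depth in the tree matches that distance.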

### D. Graph Coloring
**The Problem:** Assign colors so no two neighbors share the same color.
**Greedy Algorithm:**
* Nodes determine colors based on local information.
* In a round, if a node's ID is smaller than all its active neighbors, it picks the smallest valid color:
    $$color = \min\_allowed(forbidden)$$
* It sends this color to neighbors, who add it to their `forbidden` set.
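A centralized simulation of this greedy rule, assuming integer node IDs and colors numbered from 0 (both are choices made for the sketch, as is the function name):

```python
def greedy_coloring(graph):
    """Color `graph` (dict: node ID -> list of neighbor IDs) round by round."""
    color = {}
    forbidden = {v: set() for v in graph}
    active = set(graph)
    while active:
        # Nodes whose ID is smaller than all their active neighbors decide in this round.
        deciders = [v for v in active if all(v < u for u in graph[v] if u in active)]
        for v in deciders:
            c = 0
            while c in forbidden[v]:  # smallest color not used by an already-colored neighbor
                c += 1
            color[v] = c
        for v in deciders:
            for u in graph[v]:
                forbidden[u].add(color[v])  # neighbors learn the chosen color
            active.discard(v)
    return color


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(greedy_coloring(graph))
```

Nodes that decide in the same round are never adjacent, since two neighbors cannot both have the smaller ID, so their simultaneous choices cannot conflict.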

---

## 3. Asynchronous Setting
In asynchronous systems, there are no global rounds; messages arrive at unpredictable times.

* **BFST Anomaly:** The simple flooding algorithm used in synchronous systems fails. A message traveling a "slow" short path might arrive *after* a message traveling a "fast" long path, resulting in a tree path longer than the shortest path.
* **Correction Strategy:**
    * Nodes track "hop distance" from the root.
    * If a node receives a message offering a shorter path than its current parent, it updates its parent and propagates the new distance to its neighbors.
    * The system eventually stabilizes to a valid BFST.
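The correction strategy can be simulated with random message delays. This is a sketch: delays are drawn uniformly at random, an event queue stands in for the network, and all names are hypothetical.

```python
import heapq
import itertools
import random


def async_bfs(graph, root, rng: random.Random):
    """Return (parent map, hop distances) once no messages remain in flight."""
    dist = {v: float("inf") for v in graph}
    parent = {v: None for v in graph}
    dist[root] = 0
    events = []               # (arrival time, tie-breaker, sender, receiver, offered distance)
    tie = itertools.count()

    def send_to_neighbors(v, now):
        for u in graph[v]:
            delay = rng.random()  # unpredictable delivery time
            heapq.heappush(events, (now + delay, next(tie), v, u, dist[v] + 1))

    send_to_neighbors(root, 0.0)
    while events:
        now, _, sender, receiver, offered = heapq.heappop(events)
        if offered < dist[receiver]:  # shorter path than the current one: switch parent
            dist[receiver] = offered
            parent[receiver] = sender
            send_to_neighbors(receiver, now)  # propagate the improved distance
    return parent, dist


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
parent, dist = async_bfs(graph, root=0, rng=random.Random(42))
print(dist)
```

Whatever the delays, the distances converge to the true hop counts, because a node keeps accepting strictly shorter offers until none remain.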

---
