## **COURSE OVERVIEW**

## **INTRO**
### 1. Introduction
## **Theory & Algorithms**
### 1. Models of DS
### 2. Time in DS
### 3. Multicast
### 4. Consensus
### 5. Distributed Mutual Exclusion
## **Application**
### 7. Distributed Storage
### 8. Distributed Computing
### 9. Blockchains
### 10. Peer-to-Peer Networking
### 11. Internet of Things and Routing
### 12. Distributed Algorithms

# Lecture 08/09/2025

## What is a distributed system?
A distributed system is one where hardware and software components located on networked computers communicate and coordinate their actions only by passing messages.

## CONCURRENCY

Concurrency is the ability of a system to execute multiple tasks or processes simultaneously or at overlapping times, improving efficiency.
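However, it can cause problems such as:
- Deadlocks and livelocks: conditions in which processes make no progress. In a deadlock, processes wait for each other indefinitely; in a livelock, they keep changing state in response to each other without doing useful work.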
- Non-determinism: occurs when the output or behavior of a concurrent program differs for the same input, depending on the precise, unpredictable timing of events, such as thread interleaving or resource access.
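
As a minimal sketch of non-determinism, the Python snippet below enumerates every interleaving of two hypothetical processes, `A` and `B`, that each read a shared counter and then write back the value plus one. The process names and the `run` helper are illustrative assumptions, not part of the lecture.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving; each process does a read step, then a write step."""
    counter = 0
    local = {}                      # value each process read from the counter
    steps_done = {"A": 0, "B": 0}
    for process in schedule:
        if steps_done[process] == 0:
            local[process] = counter          # step 1: read the shared counter
        else:
            counter = local[process] + 1      # step 2: write back read value + 1
        steps_done[process] += 1
    return counter

# Every distinct order in which A's two steps and B's two steps can interleave.
schedules = set(permutations("AABB"))
outcomes = {run(schedule) for schedule in schedules}
print(sorted(outcomes))
```

Depending on the schedule, the final counter is either 2 or 1 (one update is lost), even though the program and its input never change.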

Other issues arise from the absence of shared state:
- Processes must pass messages to synchronize: for example, if two people share a resource and both try to access it at the same time, one of them may get an error unless they coordinate by exchanging messages.
- Processes may not agree on time: each machine has its own clock, and there is no global clock that all processes can read.

Everything can fail in distributed systems:
- Devices
- Integrity of data
- Network
    - Security
    - Man-in-the-Middle (MITM) attack: an attacker intercepts and potentially alters communication between two legitimate parties without their knowledge.
    - Byzantine failure: a condition of a system, particularly a distributed computing system, in which a fault presents different symptoms to different observers, including imperfect information on whether a system component has failed.

Distributed systems are built for three main reasons: domain, redundancy, and performance.

---
## Domain
A domain is a specific area of knowledge or activity that a distributed system is designed to address.
Some examples of domains are:
- The internet
- Wireless Mesh Networks
- Industrial systems
- Ledgers (Bitcoin, Ethereum)

However, these domains run into limits, which can be physical or logical.
- Physical limits: constraints imposed by the physical properties of the system's components or environment, such as hardware limitations, network bandwidth, latency, and geographical distribution.
- Logical limits: the logical limit of a domain in a distributed system is defined by its bounded context. This is a core concept from Domain-Driven Design (DDD), a software development approach that focuses on aligning software design with the business domain. A bounded context is an explicit boundary within a distributed system where a specific domain model and its language (ubiquitous language) are consistent and applicable.

---
## Redundancy
A system with redundancy has duplicate components, processes, or data.
This makes the system:
- Robust: more resilient to failure
- Available: as in system availability which is measured in uptime.

A system with redundancy can:
- offer high uptime, such as 99.9% ("three nines") or 99.999% ("five nines"): the system is designed to be operational and available for that fraction of the time, leaving only a small budget for planned or unplanned downtime.
- act as a backup: in an active-passive configuration, the redundant server is a hot or cold standby. The primary server handles all the workload while the backup waits. Because data and components are duplicated, the backup takes over automatically if the primary fails.
- share the load, as in a replicated database: in an active-active configuration, all redundant servers are active and work simultaneously rather than waiting as backups. Incoming requests are distributed among them by a load balancer.
- be used in the banking sector, where any downtime can lead to significant financial losses.
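
To make uptime percentages concrete, here is a small sketch that converts an availability target into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of downtime permitted over the period at the given availability."""
    period_minutes = period_days * 24 * 60
    return (100 - availability_percent) / 100 * period_minutes

three_nines = allowed_downtime_minutes(99.9)    # roughly 8.8 hours per year
five_nines = allowed_downtime_minutes(99.999)   # a little over 5 minutes per year

print(f"99.9%   -> {three_nines:.1f} min/year")
print(f"99.999% -> {five_nines:.1f} min/year")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why the gap between three and five nines matters so much in practice.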

---
## Performance
A performant system depends on:
- Economics
- Scalability

Relevant examples and trade-offs include:
- Video streaming: requires a lot of processing power
- Cloud computing: Offers on-demand, scalable resources, eliminating the need for upfront hardware investment.
- Supercomputers: Excel at massive, specialized calculations but are very expensive and not scalable for general-purpose use.
- Many inexpensive vs. few expensive specialized machines: Distributing workloads across many inexpensive machines is often more economical and scalable than using a few expensive, specialized ones.

---


# Lecture 09/09/2025

# Models of distributed systems

## Aspects of models

Why do we build distributed systems?

- **Inherent distribution**: By definition, distributed systems span multiple computers, often connected through networks such as telecommunications systems.
- **Reliability**: Even if one node fails, the system as a whole can continue functioning, avoiding single points of failure.
- **Performance**: Workloads can be shared among multiple machines, and data can be accessed from geographically closer nodes to reduce latency.
- **Scalability for large problems**: Some datasets and computations are simply too large to fit into a single machine, requiring distributed processing.

### Modelling the process – API Style

A distributed system can be described in terms of modules that exchange **events** through well-defined interfaces:

- **Event representation**:  
  \{Event\_type | Attributes, …\}  
  or  
  \{Process\_ID | Event\_type | Attributes, …\}

- **Module behavior**:  
  Each module reacts to incoming events and produces outputs according to specified rules:  
  upon event \{condition | Event | Attributes\} such that condition holds do  
  perform some action


Multiple modules together (one per process or subsystem) should collectively satisfy desired **global properties** (e.g., safety, liveness).
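
The "upon event ... such that condition holds do ..." style above can be sketched in Python. This is only an illustration under assumed names: the `Module` class, its `upon` and `trigger` methods, and the `Deliver` event are hypothetical and not an API from the course.

```python
from collections import defaultdict

class Module:
    """Hypothetical event-driven module: rules are registered per event type."""

    def __init__(self, process_id):
        self.process_id = process_id
        self.rules = defaultdict(list)   # event_type -> list of (condition, action)
        self.log = []                    # local state changed by actions

    def upon(self, event_type, action, condition=lambda attrs: True):
        """Register: upon event_type such that condition(attrs) holds, do action(attrs)."""
        self.rules[event_type].append((condition, action))

    def trigger(self, event_type, **attrs):
        """Deliver an event to this module and run every rule whose condition holds."""
        for condition, action in self.rules[event_type]:
            if condition(attrs):
                action(attrs)

module = Module("p1")
module.upon(
    "Deliver",
    action=lambda attrs: module.log.append(attrs["payload"]),
    condition=lambda attrs: attrs["payload"] != "",
)

module.trigger("Deliver", payload="hello")   # condition holds, action runs
module.trigger("Deliver", payload="")        # condition fails, event is ignored
module.trigger("Timeout")                    # no rule registered for this event
```

Keeping conditions separate from actions mirrors the pseudocode and makes each rule easy to check against the properties the module is supposed to guarantee.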

### What we want/will make

We aim to:
- Design APIs for modules and prove that their composition satisfies global system properties.
- Implement modules that guarantee **local properties**.
- Use pseudocode and mathematics to formally demonstrate when such guarantees are possible—or prove impossibility.

---

## Failures

Failures are inevitable in distributed systems. They can arise due to hardware breakdowns, software bugs, network disruptions, or even human mistakes. Designing robust systems requires understanding different types of failures and strategies to mitigate them.

### Types of failures

1. **Crash-stop**: A process halts and all other processes can reliably detect the failure. *Easiest to handle.*
2. **Crash-silent**: A process halts but failures cannot be detected reliably.
3. **Crash-noisy**: Failures may be detected, but only with eventual accuracy (false positives or delays are possible).
4. **Crash-recovery**: Processes may fail and later recover, rejoining the system. Requires care to avoid state inconsistencies.
5. **Crash-arbitrary (Byzantine failures)**: Processes behave arbitrarily or maliciously, deviating from the protocol. *Hardest to handle.*
6. **Randomized behavior**: Processes make decisions probabilistically. Correctness is argued via probability theory rather than strict guarantees.

---

## Communication

Is communication always required? In distributed systems, yes—but it can be realized in different ways:

- **Message passing**:
  1. Types of links and their potential failures.
  2. Network topology (commonly assumed fully connected).
  3. Routing algorithms for multi-hop communication.
  4. Broadcast and multicast primitives.

- **Shared memory**:
  1. Which process can read or write to which location?
  2. How do we guarantee reading the *freshest* value? (Consistency models)

### On types of links

A **link** is a module implementing send/receive operations with certain properties.

- **TCP/IP**: Provides reliable communication between a pair of nodes: either messages are delivered reliably or, if the connection breaks, none are.
- **SSH**: Adds protection against corruption, interception, and tampering.

**Network reliability models**:
1. **Perfect links**: Reliable delivery, no duplication, no spurious messages.
2. **Fair-loss links**: Messages may be lost occasionally, but infinitely many attempts guarantee eventual delivery; finite duplication possible.
3. **Stubborn links**: The sender keeps retransmitting, so every message sent is eventually delivered (possibly many times); messages are still never created out of thin air. This model is built on top of fair-loss links.
4. **Logged-perfect links**: Perfect delivery with persistent logs for auditing/recovery.
5. **Authenticated links**: Reliable delivery, no duplication, and sender authenticity.
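
How these link models build on each other can be sketched with a small simulation. Everything here is an assumption made for illustration: the class names, the fixed loss probability, and the bounded number of retransmissions (a real stubborn link retransmits forever).

```python
import random

class FairLossLink:
    """Simulated lossy channel: each send is dropped with a fixed probability."""

    def __init__(self, loss_probability, seed=0):
        self.loss_probability = loss_probability
        self.rng = random.Random(seed)   # seeded so the run is repeatable
        self.arrived = []                # everything that reached the receiver

    def send(self, message):
        if self.rng.random() >= self.loss_probability:
            self.arrived.append(message)

class StubbornLink:
    """Retransmits each message many times over a fair-loss link."""

    def __init__(self, fair_loss_link, retransmissions):
        self.fair_loss_link = fair_loss_link
        self.retransmissions = retransmissions   # bounded only for the simulation

    def send(self, message):
        for _ in range(self.retransmissions):
            self.fair_loss_link.send(message)

def deliver_once(arrived):
    """Perfect-link behaviour at the receiver: drop duplicates, keep first arrivals."""
    seen, delivered = set(), []
    for message in arrived:
        if message not in seen:
            seen.add(message)
            delivered.append(message)
    return delivered

fair_loss = FairLossLink(loss_probability=0.5, seed=42)
stubborn = StubbornLink(fair_loss, retransmissions=50)
for message in ["m1", "m2", "m3"]:
    stubborn.send(message)

delivered = deliver_once(fair_loss.arrived)
print(len(fair_loss.arrived), "raw arrivals ->", delivered)
```

Retransmission turns occasional loss into eventual delivery at the cost of duplicates, and duplicate filtering at the receiver then yields the at-most-once delivery expected of a perfect link.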

### Can networks fail?

While TCP/IP and lower-level protocols often give us the illusion of **perfect links** and **fail-stop crashes**, failures still happen.

- **Network partitions**: Occur when many links fail simultaneously, dividing the system into disconnected components. This is rare but catastrophic.

### Crashes vs Failures

Having discussed both **network** and **process** failures, it is important to distinguish between the two levels:

- A **process can crash** (e.g., by halting, becoming unresponsive, or misbehaving).  
- A **system fails** when the combination of process crashes and communication assumptions no longer allows correct operation.

For the remainder of our discussion, we usually assume **perfect links** (thanks to TCP/IP and lower-level reliability mechanisms). This means that:
- Messages are delivered reliably,
- No duplicates are created,
- No spurious (phantom) messages appear.

Under this assumption, we can define **system failure models** in terms of process behavior:

- **Fail-stop system**: Processes may experience crash-stop failures, but links are perfect.  
- More complex models (e.g., crash-recovery, Byzantine failures) are defined similarly, always considering both the **process failure type** and the **communication assumptions**.

In short, a system failure model = (process failure model) + (assumed link properties).

---

## Timing

Timing plays a central role in distributed systems, especially when considering **synchronization** and **failure detection**.

- Systems may be **synchronous** (bounded delays) or **asynchronous** (no timing guarantees).
- Links are still modeled as modules with send/receive properties.

### Synchronous vs. Asynchronous Systems

Distributed systems can be broadly classified according to their **timing assumptions**:

1. **Asynchronous systems**:
   - No bounds on message transmission delays.
   - No assumptions about process execution speeds (relative speeds may differ arbitrarily).
   - Failure detection is unreliable, since a slow process cannot be distinguished from a failed one.
   - Coordination and ordering rely on **logical clocks** (e.g., Lamport clocks, vector clocks), rather than real time.

2. **Synchronous systems**:
   - Bounds exist on message transmission delays and process execution speeds.
   - **Timed failure detection** is possible: if a message or heartbeat is not received within a known bound, a failure can be suspected reliably.
   - Transit delays can be measured and incorporated into algorithms.
   - Coordination can be based on **real-time clocks** rather than purely logical clocks.
   - Performance is often analyzed in terms of **worst-case bounds**, since timing assumptions provide guarantees.
   - Processes may maintain **synchronized clocks** (to some degree of precision), enabling algorithms such as consensus and coordinated actions.
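
The logical clocks mentioned for asynchronous systems can be sketched briefly. The following is a minimal Lamport clock, assuming the usual rules (increment on every local event and on every send; on receive, take the maximum of the local and received timestamps, plus one). The class and method names are illustrative.

```python
class LamportClock:
    """Minimal Lamport logical clock for one process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending counts as an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time):
        """Merge the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                      # a: 1
sent = a.send()               # a: 2, message carries timestamp 2
received = b.receive(sent)    # b: max(0, 2) + 1 = 3
reply = b.send()              # b: 4
final_a = a.receive(reply)    # a: max(2, 4) + 1 = 5
```

No real-time clock is involved: the timestamps only guarantee that a message's receive event is numbered after its send event.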

**Key question**: *Can processes in an asynchronous system with fair-loss links reach agreement (e.g., on coordinated attack time)?*

### Proof via contradiction (Two Generals Problem)

1. Assume a protocol exists that guarantees agreement using a fixed, finite sequence of messages, and pick such a protocol with the fewest messages.
2. Consider the last message in this sequence.
3. The link may lose this message. The sender cannot distinguish whether it was delivered or lost, so it must decide the same action in both cases.
4. Since the protocol guarantees agreement, the receiver must also reach that same decision whether or not the message arrives.
5. The last message is therefore unnecessary, and a protocol with one fewer message also guarantees agreement. This contradicts the choice of the shortest protocol; repeating the argument would remove every message, leaving no communication at all.  
 $\Rightarrow$ Perfect agreement is impossible under these assumptions.

### Which crash/link/timing assumptions implement distributed systems?

A **failure detector** can be modeled as just another module that provides (possibly imperfect) information about which processes are alive. Different combinations of timing assumptions and failure detectors allow different guarantees in distributed systems.  

### Example

![image](../images/Screenshot%202025-09-09%20at%2009.56.39.png)

#### Explanation:

This algorithm describes a **Perfect Failure Detector** for distributed systems using a heartbeat mechanism.

In short, here's what it does:

1.  **Sends Heartbeats:** Periodically, on a **timeout**, every process sends a `HEARTBEATREQUEST` message to all other processes in the system.
2.  **Waits for Replies:** After sending the requests, it resets its `alive` set to empty and waits for `HEARTBEATREPLY` messages. When a process receives a reply, it marks the sender as `alive`.
3.  **Detects Failures:** At the next timeout, any process that has not sent a reply is considered to have **crashed**. The algorithm then triggers a `Crash` event for that process.

Because it assumes **perfect communication links** (messages are never lost) and a timeout long enough for every request and reply to arrive (a synchronous system), this method guarantees that a non-responsive process has truly failed, making the failure detection "perfect."
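
The heartbeat steps above can be sketched as a round-based simulation. This is not the algorithm from the screenshot verbatim: the class, the method names, and the `simulate` helper are assumptions, and one loop iteration stands in for one timeout period in which all replies arrive in time.

```python
class PerfectFailureDetector:
    """Heartbeat-based failure detector, simulated in discrete timeout rounds."""

    def __init__(self, processes):
        self.processes = list(processes)
        self.alive = set(processes)     # everyone is assumed alive at the start
        self.detected = set()
        self.crash_events = []          # stands in for triggering Crash events

    def on_timeout(self, send_request):
        for process in self.processes:
            # No reply during the last period: declare the process crashed, once.
            if process not in self.alive and process not in self.detected:
                self.detected.add(process)
                self.crash_events.append(process)
            send_request(process)       # HEARTBEATREQUEST to every process
        self.alive = set()              # forget old replies for the new period

    def on_heartbeat_reply(self, process):
        self.alive.add(process)

def simulate(processes, crash_round, rounds):
    """crash_round maps a process to the first round in which it no longer replies."""
    detector = PerfectFailureDetector(processes)
    for current_round in range(rounds):
        requested = []
        detector.on_timeout(requested.append)
        for process in requested:
            if crash_round.get(process, rounds) > current_round:
                detector.on_heartbeat_reply(process)   # reply arrives within the period
    return detector.crash_events

events = simulate(["p1", "p2", "p3"], crash_round={"p2": 1}, rounds=4)
print(events)
```

In the simulation `p2` stops replying in round 1 and is reported in round 2, so detection lags the crash by at most one timeout period, and no process that keeps replying is ever reported.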

### Network latency and bandwidth

When discussing communication performance, two key metrics matter:

- **Latency**: The time it takes for a single message (or bit) to travel from sender to receiver.  
- **Bandwidth**: The rate at which data can be transmitted, usually measured in bits per second (bps) or bytes per second (B/s).

#### Physical Link
Sometimes, surprisingly “low-tech” physical methods can provide high bandwidth, even if latency is poor:
- **Hard drives in a van**  
- Messengers carrying storage devices  
- Smoke signals (extreme latency, minimal bandwidth)  
- Radio signals or laser communication

#### Network Links
More conventional digital communication technologies include:
- DSL (Digital Subscriber Line)  
- Cellular data (e.g., 3G, 4G, 5G)  
- Wi-Fi (various standards)  
- Ethernet/fiber cables  
- Satellite links  

#### Latency examples
1. Hard drives transported by van: $\approx$ 1 day latency  
2. Intra-continent fiber-optic cable: $\approx$ 100 ms latency  

#### Bandwidth examples
1. Hard drives in a van: $\frac{50 \, \text{TB}}{1 \, \text{day}} \approx 4.6 \, \text{Gbit/s}$, a **very high bandwidth** despite huge latency  
2. 3G cellular network: $\approx 1 \, \text{Mbit/s}$ bandwidth  
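
The two bandwidth examples can be compared with a little arithmetic. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and uses illustrative helper names.

```python
TB = 10**12                      # bytes per terabyte (decimal units, an assumption)
SECONDS_PER_DAY = 24 * 60 * 60

def bandwidth_bps(num_bytes, seconds):
    """Average bandwidth in bits per second for moving num_bytes in the given time."""
    return num_bytes * 8 / seconds

def transfer_time_seconds(num_bytes, bits_per_second):
    """Time needed to push num_bytes through a link of the given bandwidth."""
    return num_bytes * 8 / bits_per_second

van_bps = bandwidth_bps(50 * TB, SECONDS_PER_DAY)
three_g_bps = 1_000_000          # about 1 Mbit/s, as in the example above
three_g_days = transfer_time_seconds(50 * TB, three_g_bps) / SECONDS_PER_DAY

print(f"van: {van_bps / 1e9:.2f} Gbit/s")
print(f"3G:  {three_g_days:.0f} days to move the same 50 TB")
```

The van moves in one day what the 3G link would need several thousand days to transfer, even though a single bit takes far longer to arrive by van.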

---

## Performance

### Performance measures

- **SLI (Service Level Indicator)**: What aspect of the system do we measure?  
Examples: bandwidth, latency, fault tolerance, uptime, failure detection time.
- **SLO (Service Level Objective)**: What target values do we aim for?  
Example: latency < 200ms.
- **SLA (Service Level Agreement)**: An SLO backed with contractual consequences.  
Example: "99% uptime, otherwise partial refund."

Why should we study these?
- What we measure, we can improve.
- Measurements show where improvement is actually needed, so effort is spent there.
- Reliability is one of the main reasons for building distributed systems.

### Reading SLAs

When evaluating claims like *“This solution offers 99% uptime”*, consider:

- **Sampling frequency**: How often is system availability checked?
- **Responsibility scope**: Does the SLA cover only server uptime, or also account for client/network failures?
- **Time interval**: Does 99% apply per day, per month, or per year?
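
The effect of the time interval can be illustrated with a small sketch. It assumes hourly availability probes over a 30-day month and a single 7-hour outage; the sample data and helper name are made up for illustration.

```python
HOURS_PER_DAY = 24
DAYS = 30

def uptime(samples):
    """Fraction of probe samples in which the service was observed to be up."""
    return sum(samples) / len(samples)

# One probe per hour for 30 days; True means the service answered.
samples = [True] * (HOURS_PER_DAY * DAYS)
outage_start = 10 * HOURS_PER_DAY             # outage begins on day 10
for hour in range(outage_start, outage_start + 7):
    samples[hour] = False                     # 7 consecutive failed probes

monthly = uptime(samples)
daily = [
    uptime(samples[day * HOURS_PER_DAY:(day + 1) * HOURS_PER_DAY])
    for day in range(DAYS)
]
worst_day = min(daily)

print(f"monthly uptime: {monthly:.4f}")
print(f"worst day:      {worst_day:.4f}")
```

The same outage satisfies a 99% target measured per month but badly violates it measured per day. With hourly probes, an outage that starts and ends between two probes would not be counted at all, which is why sampling frequency matters too.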

---