# Chapter 85: Distributed Systems

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the need for distributed computing in large‑scale time‑series prediction systems.
- Distinguish between different parallelism strategies: data parallelism, model parallelism, and task parallelism.
- Implement distributed data processing using frameworks like Dask, Ray, or Apache Spark.
- Design distributed training for models that cannot fit on a single machine or require faster training.
- Build distributed inference services that can handle high‑throughput prediction requests (e.g., for thousands of stocks).
- Handle data partitioning (sharding) by time, symbol, or other keys to enable scalability.
- Understand consistency and fault tolerance challenges in distributed ML systems.
- Monitor and debug distributed pipelines.
- Evaluate trade‑offs between different distributed architectures for time‑series workloads.

---

## **85.1 Introduction to Distributed Systems for Time‑Series Prediction**

As our time‑series prediction system grows, we may encounter limitations of a single machine:

- **Data volume**: The NEPSE dataset may be small, but a system covering thousands of stocks with years of tick‑by‑tick data can easily reach terabytes.
- **Model complexity**: Deep learning models with millions of parameters may require multiple GPUs for training.
- **Inference throughput**: Serving predictions for many symbols in real time may exceed the capacity of a single server.
- **Feature engineering**: Computing rolling statistics across many time series can be computationally intensive and benefit from parallelisation.
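
To make the data‑volume point concrete, a back‑of‑envelope estimate can help. Every input below (symbol count, ticks per day, record size) is an illustrative assumption, not a measurement of NEPSE or any real feed.

```python
# Back-of-envelope storage estimate; all inputs are illustrative assumptions.
n_symbols = 5_000            # assumed number of instruments
ticks_per_day = 100_000      # assumed ticks per symbol per trading day
trading_days_per_year = 250  # assumed trading calendar
years = 5                    # assumed history length
bytes_per_tick = 32          # assumed uncompressed record size

total_ticks = n_symbols * ticks_per_day * trading_days_per_year * years
total_bytes = total_ticks * bytes_per_tick
total_terabytes = total_bytes / 1e12

print(f"{total_ticks:,} ticks, about {total_terabytes:.0f} TB uncompressed")
```

Under these assumptions the raw data alone is far larger than the memory of a single machine, which is the situation the rest of the chapter addresses.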

Distributed systems allow us to scale horizontally by adding more machines. In this chapter, we will explore how to distribute the components of a time‑series prediction system: data storage, feature engineering, model training, and inference. We'll use frameworks like Dask (for parallel computing in Python) and Ray (for distributed execution) to illustrate the concepts.

---

## **85.2 Fundamentals of Distributed Computing**

### **85.2.1 Cluster, Nodes, and Tasks**
A **cluster** is a collection of machines (nodes) working together. A **node** can be a physical server or a container. Work is divided into **tasks** that are scheduled across nodes.

### **85.2.2 Shared‑Nothing Architecture**
Most distributed systems use a shared‑nothing architecture: each node has its own CPU, memory, and disk. Nodes communicate via network messages (e.g., using TCP). This design scales well because nodes do not contend for shared memory or disk; the network becomes the main resource they share, so minimising data movement between nodes matters.

### **85.2.3 Data Parallelism vs. Model Parallelism**
- **Data parallelism**: The same model is replicated on multiple nodes, and each node processes a different subset of the data. Gradients are aggregated across nodes (e.g., with an all‑reduce) so that every replica applies the same update. This is the most common approach for training on large datasets.
- **Model parallelism**: Different parts of the model are placed on different nodes. This is used when the model itself is too large to fit on one node (e.g., some deep learning models with billions of parameters).

For time‑series, data parallelism is more common: we can train the same model on different chunks of time series (e.g., different stocks, different time windows) and synchronise updates.
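
The gradient aggregation step can be illustrated without any framework. The following is a minimal sketch in plain Python, assuming a one‑feature linear model, made‑up data, and equally sized shards; `all_reduce_mean` is a hypothetical stand‑in for a real all‑reduce, not a library call.

```python
# Data parallelism in miniature: each "worker" computes a gradient on its own
# shard, and the averaged gradient matches the full-batch gradient.

def mse_gradient(w, b, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x + b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def all_reduce_mean(grads):
    """Average per-worker gradients (stand-in for a real all-reduce)."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w, b = 0.5, 0.0

# Split the data into equally sized shards, one per worker.
n_workers = 4
shard_size = len(xs) // n_workers
shards = [
    (xs[i * shard_size:(i + 1) * shard_size], ys[i * shard_size:(i + 1) * shard_size])
    for i in range(n_workers)
]

worker_grads = [mse_gradient(w, b, sx, sy) for sx, sy in shards]
averaged = all_reduce_mean(worker_grads)
full_batch = mse_gradient(w, b, xs, ys)
print(averaged, full_batch)
```

The two gradients agree (up to floating‑point rounding) only because the shards have equal size; with uneven shards, the average would need to be weighted by shard size.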

### **85.2.4 Task Parallelism**
Many tasks (e.g., feature computation for each stock) are independent and can be executed in parallel. This is often easier to implement than model parallelism.
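
As a small illustration of task parallelism, the sketch below runs one independent task per symbol using only the standard library. The symbols and prices are made up, and a thread pool is used purely to keep the example self‑contained; for CPU‑bound work in Python, a process pool or a framework such as Dask is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical closing prices per symbol (made-up numbers for illustration).
prices_by_symbol = {
    "AAA": [100.0, 101.0, 99.0, 102.0],
    "BBB": [50.0, 50.5, 51.0, 50.0],
    "CCC": [10.0, 10.0, 11.0, 12.1],
}

def mean_return(item):
    """Compute the mean simple return for one symbol, independently of all others."""
    symbol, prices = item
    returns = [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]
    return symbol, sum(returns) / len(returns)

# Each symbol is an independent task; the pool runs them concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(mean_return, prices_by_symbol.items()))

print(results)
```

Because no task reads another task's output, the results are identical to running the same function in a sequential loop.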

---

## **85.3 Distributed Data Processing with Dask**

Dask is a flexible parallel computing library for Python that integrates with pandas, NumPy, and scikit‑learn. It can scale from a single machine to a cluster.

### **85.3.1 Dask DataFrames**
For large time‑series datasets, we can use Dask DataFrames, which partition a pandas‑like DataFrame across multiple workers.

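```python
import dask.dataframe as dd

# Read a large CSV with Dask (split into ~100 MB partitions automatically).
# Assumes 'date', 'symbol', and 'close' columns, as in the later examples.
df = dd.read_csv('nepse_all_stocks.csv', blocksize='100MB', parse_dates=['date'])

def add_daily_return(group):
    # group is a pandas DataFrame holding every row of one symbol
    group = group.sort_values('date')
    group['daily_return'] = group['close'].pct_change()
    return group

# meta describes the output schema so Dask does not have to infer it
meta = {**df.dtypes.to_dict(), 'daily_return': 'f8'}

# Perform operations lazily; groupby-apply shuffles rows so that each
# symbol is processed as a whole, even if it spans several CSV chunks
df = df.groupby('symbol', group_keys=False).apply(add_daily_return, meta=meta)
df = df.dropna()

# Compute result (triggers execution; the result must fit in local memory)
result = df.compute()
```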

**Explanation:**

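- Dask reads the CSV in chunks (partitions). Element‑wise operations run in parallel on each partition. A per‑symbol `groupby` first has to shuffle rows between partitions, because one symbol's rows can be spread over several chunks.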
- `compute()` brings the result into a single pandas DataFrame (if it fits in memory). For larger results, you can write to disk or further process in Dask.

### **85.3.2 Parallel Feature Engineering with Dask**

Feature engineering for many stocks can be parallelised naturally: each stock's time series can be processed independently.

```python
import numpy as np
import pandas as pd
import dask
from dask.distributed import Client

# Start a local Dask client
client = Client()  # uses all cores on local machine

# Function to engineer features for a single stock
def engineer_features(symbol, df_symbol):
    # df_symbol is a pandas DataFrame for one stock
    df = df_symbol.copy()
    df = df.sort_values('date')
    # Compute features (lags, rolling, etc.)
    df['lag_1'] = df['close'].shift(1)
    df['sma_10'] = df['close'].rolling(10).mean()
    df['volatility'] = df['close'].rolling(20).std()
    # Drop NaNs
    df = df.dropna()
    return df

# Read full data (assuming it's a pandas DataFrame)
# For demonstration, we'll simulate data for multiple stocks
symbols = ['AAPL', 'GOOG', 'MSFT', 'NEPSE']
dfs = []
for sym in symbols:
    dates = pd.date_range('2020-01-01', periods=1000, freq='D')
    prices = 100 + np.cumsum(np.random.randn(1000))
    dfs.append(pd.DataFrame({'date': dates, 'symbol': sym, 'close': prices}))
df_all = pd.concat(dfs, ignore_index=True)

# Group by symbol and apply function in parallel using Dask delayed
lazy_results = []
for symbol, group in df_all.groupby('symbol'):
    lazy_result = dask.delayed(engineer_features)(symbol, group)
    lazy_results.append(lazy_result)

# Compute all in parallel
results = dask.compute(*lazy_results)
feature_df = pd.concat(results, ignore_index=True)
print(feature_df.head())
```

**Explanation:**

- We use `dask.delayed` to wrap the function call for each symbol. These calls are scheduled on the Dask cluster and executed in parallel.
- `dask.compute` triggers the computation and returns the results as a tuple of pandas DataFrames, which we concatenate.
- This pattern is ideal for "embarrassingly parallel" workloads like per‑symbol feature engineering.

---

## **85.4 Distributed Training**

When training a model on a large dataset (e.g., many stocks with many years of data), we can use distributed training to reduce wall‑clock time. Dask‑ML and Ray provide integrations with popular ML libraries.

### **85.4.1 Distributed Training with Dask‑XGBoost**

XGBoost supports distributed training via its native interface, and Dask can act as a backend.

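```python
import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # GPU workers; requires NVIDIA GPUs and dask-cuda
from xgboost import dask as dxgb  # explicit import of XGBoost's Dask interface

# Start a Dask cluster with one worker per local GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Load data as Dask DataFrame
df = dd.read_parquet('features.parquet')

# Split into features and target
# (feature names are assumed here; use the columns actually stored in features.parquet)
feature_cols = ['lag_1', 'sma_10', 'volatility']
X = df[feature_cols]
y = df['target']

# Train distributed XGBoost
dtrain = dxgb.DaskDMatrix(client, X, y)
params = {
    'objective': 'reg:squarederror',
    'max_depth': 5,
    'eta': 0.1,
    'tree_method': 'hist',
    'device': 'cuda',  # XGBoost >= 2.0; older releases use tree_method='gpu_hist' instead
}
output = dxgb.train(client, params, dtrain, num_boost_round=100)

# Get the booster
booster = output['booster']

# Predict in parallel; compute() gathers the predictions into local memory
predictions = dxgb.predict(client, booster, X).compute()
```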

**Explanation:**

- XGBoost's Dask interface (the `xgboost.dask` module) uses the Dask cluster to distribute data and computation. The `DaskDMatrix` holds references to the distributed data.
- Training is performed in parallel across workers, with gradient statistics aggregated by XGBoost's built‑in all‑reduce.
- This scales to datasets that do not fit in a single machine's memory.

### **85.4.2 Distributed Deep Learning with Ray**

Ray is a general‑purpose distributed execution framework with libraries for machine learning (Ray Train, Ray Tune). It integrates with PyTorch and TensorFlow.

Example: Distributed PyTorch training with Ray Train.

```python
# pip install "ray[train]" torch
import ray
import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model
from torch.utils.data import DataLoader, TensorDataset

# Define a simple model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])  # use the last time step
        return out

# Training function
def train_func(config):
    # This function runs on each worker
    model = LSTMModel(input_size=config['input_size'], hidden_size=64, num_layers=2)
    model = prepare_model(model)  # wraps the model in DistributedDataParallel
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    # Placeholder data: random sequences stand in for the real training set.
    X = torch.randn(1024, 20, config['input_size'])
    y = torch.randn(1024, 1)
    train_loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)
    train_loader = prepare_data_loader(train_loader)  # each worker gets a different shard

    for epoch in range(config['epochs']):
        for features, target in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), target)
            loss.backward()  # DDP averages gradients across workers here
            optimizer.step()
        # Report metrics to Ray
        train.report({'epoch': epoch, 'loss': loss.item()})

# Initialize Ray
ray.init(address='auto')  # connect to a running cluster; use ray.init() to start locally

# Set up distributed training (Ray 2.x API)
trainer = TorchTrainer(
    train_func,
    train_loop_config={'input_size': 10, 'epochs': 5},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
result = trainer.fit()
```

**Explanation:**

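- Ray Train starts the worker processes and, through its PyTorch utilities, handles data sharding, gradient synchronisation, and checkpointing.
- Each worker runs `train_func` on its shard of data. Gradient synchronisation uses PyTorch's DistributedDataParallel, which Ray Train applies when the model is wrapped with `ray.train.torch.prepare_model`.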
- This enables scaling deep learning to multiple GPUs across nodes.

---

## **85.5 Distributed Inference**

For serving predictions at scale, we need to distribute the inference load across multiple instances. This can be done by:

- **Load balancing**: Requests are routed to any available instance. Suitable when the model is small and stateless.
- **Sharding**: Each instance is responsible for a subset of symbols. This can improve cache locality and reduce model loading overhead.

### **85.5.1 Sharding by Symbol**

We can partition the symbol space (e.g., by hash) and assign each partition to a dedicated inference service. A gateway receives requests and forwards them to the appropriate shard.

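```python
import zlib

import requests

# Simplified gateway
class PredictionGateway:
    def __init__(self, shard_map):
        self.shard_map = shard_map  # e.g., {0: 'http://shard0:8000', 1: 'http://shard1:8000'}

    def predict(self, symbol, date):
        # Use a stable hash: Python's built-in hash() for strings is randomised per
        # process, so it could route the same symbol differently after a restart.
        shard_id = zlib.crc32(symbol.encode('utf-8')) % len(self.shard_map)
        url = self.shard_map[shard_id]
        # Forward request to shard
        response = requests.post(
            f"{url}/predict", json={"symbol": symbol, "date": date}, timeout=5
        )
        response.raise_for_status()
        return response.json()
```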

Each shard loads only the models for its assigned symbols, reducing memory usage and allowing independent scaling.
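
Each shard also needs to work out which symbols it owns. The following is a minimal sketch, assuming the gateway and the shards agree on the same process‑independent hash; CRC32 and the helper names `shard_for` and `symbols_for_shard` are illustrative choices, not part of any library. Python's built‑in `hash()` is unsuitable here because string hashes are randomised per process.

```python
import zlib

def shard_for(symbol: str, n_shards: int) -> int:
    """Map a symbol to a shard with a hash that is stable across processes."""
    return zlib.crc32(symbol.encode("utf-8")) % n_shards

def symbols_for_shard(all_symbols, shard_id: int, n_shards: int):
    """Return the symbols a given shard should load models for."""
    return [s for s in all_symbols if shard_for(s, n_shards) == shard_id]

all_symbols = ["AAPL", "GOOG", "MSFT", "NEPSE"]  # sample symbols used earlier in the chapter
n_shards = 2
assignment = {sid: symbols_for_shard(all_symbols, sid, n_shards) for sid in range(n_shards)}
print(assignment)
```

Note that plain modulo hashing reassigns most symbols whenever the number of shards changes, so resizing the fleet means reloading models on every shard.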

### **85.5.2 Load Balancing with Kubernetes**

If the model is small and stateless, we can deploy multiple replicas behind a load balancer. Kubernetes services provide this natively. Each replica runs the same prediction service and can handle any request.

---

## **85.6 Data Partitioning (Sharding) Strategies**

In distributed systems, data is often partitioned to enable parallel processing. For time‑series, common partitioning keys are:

- **By symbol**: Each partition contains all data for a subset of symbols. This works well for per‑symbol operations (feature engineering, training, inference).
- **By time**: Partition by date ranges (e.g., monthly). This is useful for time‑series joins or when processing windows that span all symbols.
- **Hybrid**: Partition by symbol first, then by time within each symbol.
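
A hybrid scheme is often expressed as a directory layout. The sketch below builds a Hive‑style path with symbol first and time second; the `data` root and the year/month granularity are assumptions chosen for illustration.

```python
from datetime import date

def partition_path(symbol: str, day: date, root: str = "data") -> str:
    """Build a Hive-style directory path: symbol first, then year and month."""
    return f"{root}/symbol={symbol}/year={day.year}/month={day.month:02d}"

print(partition_path("AAPL", date(2020, 1, 15)))
# data/symbol=AAPL/year=2020/month=01
```

With this layout, a per‑symbol job reads one directory subtree, while a job restricted to a date range can skip the months it does not need.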

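In Dask, you can repartition by a key with `set_index`, which sorts the data so that all rows sharing a key value land in the same partition. This requires a full shuffle, so it is best done once, with the result persisted or written back to disk.

```python
# Partition by symbol (assuming symbol column)
df = dd.read_parquet('data.parquet').set_index('symbol')
# All rows for a given symbol now sit in one partition, so per-symbol operations avoid further shuffles.
```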

---

## **85.7 Consistency and Fault Tolerance**

### **85.7.1 Consistency Models**
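In distributed training, gradient updates are either **synchronous** or **asynchronous**:
- **Synchronous**: All workers compute gradients on their own batch, then average them before the update. Every replica sees the same model, but each step is only as fast as the slowest worker (a straggler).
- **Asynchronous**: Workers update the model independently, without waiting for each other. This avoids waiting on stragglers, but gradients can be stale (computed on an older copy of the weights), which can make convergence less stable.

Synchronous all‑reduce is the default in PyTorch's DistributedDataParallel; asynchronous updates are typically found in parameter‑server designs. The choice depends on the application.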
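
The effect of stale gradients can be seen on a toy problem. The sketch below minimises f(w) = w²/2 with the same learning rate twice: once with up‑to‑date gradients and once with gradients computed on weights that are two updates old. It is an illustration under these assumptions, not a model of any particular framework.

```python
def run_sgd(lr: float, steps: int, staleness: int, w0: float = 1.0):
    """Minimise f(w) = w**2 / 2 (gradient = w) using gradients `staleness` steps old."""
    history = [w0] * (staleness + 1)  # pretend the weight was w0 before training began
    for _ in range(steps):
        stale_w = history[-(staleness + 1)]  # weight the gradient was computed on
        history.append(history[-1] - lr * stale_w)
    return history[staleness:]  # trajectory starting at w0

sync_traj = run_sgd(lr=0.9, steps=30, staleness=0)   # gradient always uses the current weight
async_traj = run_sgd(lr=0.9, steps=30, staleness=2)  # gradient is two updates out of date

print(f"synchronous final |w|: {abs(sync_traj[-1]):.2e}")
print(f"stale-gradient max |w|: {max(abs(w) for w in async_traj):.2f}")
```

With fresh gradients the iterate shrinks towards zero, while the stale version overshoots and oscillates with growing amplitude at this learning rate. In practice, asynchronous training usually needs a smaller learning rate or a bound on staleness.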

### **85.7.2 Fault Tolerance**
Distributed systems must handle node failures. Common techniques:

- **Checkpointing**: Periodically save model state to durable storage. If a node fails, restart from the latest checkpoint.
- **Replication**: Run multiple copies of critical components.
- **Task retries**: If a task fails, reschedule it on another node (Dask and Spark do this automatically).

Dask can recover from worker failures by re‑running lost tasks. For long‑running training jobs, use checkpointing with libraries like PyTorch Lightning.
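
The checkpoint‑and‑resume idea can be sketched with the standard library alone. The training loop, the state layout, and the file name below are hypothetical stand‑ins; a real job would save model and optimizer state to durable shared storage.

```python
import json
import os
import tempfile
from typing import Optional

def save_checkpoint(path: str, state: dict) -> None:
    """Write a temp file, then rename it over the target so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on POSIX within one filesystem

def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weight": 0.0}

def train(path: str, total_epochs: int, fail_at: Optional[int] = None) -> dict:
    """Toy training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated node failure")
        state = {"epoch": epoch + 1, "weight": state["weight"] + 0.1}  # stand-in update
        save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_epochs=5, fail_at=3)  # crashes after finishing epochs 0..2
except RuntimeError:
    pass
final_state = train(ckpt, total_epochs=5)   # resumes at epoch 3 instead of starting over
print(final_state["epoch"])
```

Writing to a temporary file and renaming it means a crash during the save leaves the previous checkpoint intact.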

---

## **85.8 Case Study: Scaling NEPSE Prediction with Dask**

Let's design a distributed pipeline for the NEPSE system that processes data for 1000 stocks.

**Components**:
1. **Raw data stored in Parquet**, partitioned by symbol.
2. **Feature engineering** applied to each partition with `map_partitions`.
3. **Model training** using Dask‑XGBoost on the full feature set.
4. **Distributed inference** with sharding by symbol.

```python
import dask.dataframe as dd
from dask.distributed import Client
from xgboost import dask as dxgb  # explicit import of XGBoost's Dask interface

# Start Dask client
client = Client(n_workers=4, threads_per_worker=2, memory_limit='4GB')

# Step 1: Read partitioned data
# Assume a Hive-style layout with one directory per symbol,
# e.g., 'data/partitions/symbol=AAPL/*.parquet', 'data/partitions/symbol=GOOG/*.parquet'.
# Reading the root directory restores 'symbol' as a column.
df = dd.read_parquet('data/partitions/')

# Step 2: Feature engineering per symbol (using map_partitions)
def engineer_features(partition):
    # partition is a pandas DataFrame. Grouping by symbol keeps the series separate
    # even if a partition holds several symbols. This assumes no symbol is split
    # across partitions (e.g., one Parquet file per symbol).
    partition = partition.sort_values(['symbol', 'date'])
    close = partition.groupby('symbol', observed=True)['close']
    partition['lag_1'] = close.shift(1)
    partition['lag_5'] = close.shift(5)
    partition['sma_10'] = close.transform(lambda s: s.rolling(10).mean())
    partition['volatility'] = close.transform(lambda s: s.rolling(20).std())
    partition['target'] = close.shift(-1)  # next day close
    return partition.dropna()

df_feat = df.map_partitions(engineer_features)

# Step 3: Prepare data for training (rows with NaN were already dropped above)
feature_cols = ['lag_1', 'lag_5', 'sma_10', 'volatility']
X = df_feat[feature_cols]
y = df_feat['target']

# Step 4: Distributed training with XGBoost
dtrain = dxgb.DaskDMatrix(client, X, y)
params = {'objective': 'reg:squarederror', 'max_depth': 5, 'eta': 0.1}
output = dxgb.train(client, params, dtrain, num_boost_round=100)
model = output['booster']

# Step 5: Save model
model.save_model('nepse_model.json')

# Step 6: Distributed inference (on new data)
# Assume new_data is a Dask DataFrame for today's features
new_data = ...  # load
X_new = new_data[feature_cols]
predictions = dxgb.predict(client, model, X_new).compute()

print(predictions[:5])
```

**Explanation:**

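- By storing data partitioned by symbol, Dask can process symbols independently and in parallel. Note that `map_partitions` simply applies the function to each partition; it is the storage layout, not `map_partitions`, that determines whether all of a symbol's rows sit together in one partition.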
- XGBoost training is distributed across the Dask workers, leveraging all available cores.
- Inference is also distributed, with predictions computed in parallel.

---

## **85.9 Monitoring and Debugging Distributed Systems**

Distributed systems are harder to debug. Essential tools:

- **Dask dashboard**: Provides real‑time metrics on task execution, memory, and worker status.
- **Ray dashboard**: Similar for Ray.
- **Logging**: Centralised logging (e.g., ELK stack) to aggregate logs from all nodes.
- **Distributed tracing**: Tools like Jaeger can trace requests across services (useful for inference pipelines).

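The Dask dashboard is usually the first place to look: its task stream shows which tasks are running on which worker, and the worker panels show memory usage, which together make bottlenecks and stragglers easier to identify.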
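
For centralised logging, it helps if every worker emits machine‑readable lines that carry its identity. The following is a minimal sketch using the standard `logging` module; the `worker_id` value, the logger name, and the field names are hypothetical and would be set by the deployment.

```python
import io
import json
import logging
import socket

class JsonFormatter(logging.Formatter):
    """Format each record as one JSON line so a log aggregator can parse it."""

    def __init__(self, worker_id: str):
        super().__init__()
        self.worker_id = worker_id

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "host": socket.gethostname(),
            "worker_id": self.worker_id,  # hypothetical identifier chosen by the deployment
            "message": record.getMessage(),
        })

stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(worker_id="worker-0"))

logger = logging.getLogger("pipeline.features")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("engineered features for %s", "AAPL")
print(stream.getvalue().strip())
```

Once logs from all nodes share this shape, they can be filtered by `worker_id` or `host` in whichever aggregation tool is in use.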

---

## **85.10 Best Practices and Trade‑offs**

- **Start with a single machine** and scale only when necessary. Distributed systems add complexity.
- **Choose the right level of parallelism**: For many time‑series tasks, per‑symbol parallelism is natural and easy.
- **Monitor resource usage**: Ensure that workers are not overloaded or underutilised.
- **Handle data skew**: Some symbols may have much more data than others, causing stragglers. Use partitioning strategies to balance load.
- **Use columnar storage (Parquet)** for efficient I/O.
- **For training**, consider whether you need distributed training or if a single GPU suffices. Many time‑series datasets are small enough to fit on one machine.
- **For inference**, consider caching models and precomputing features to reduce latency.
- **Plan for failure**: Use checkpointing and retries.
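
For the data‑skew point above, one simple balancing heuristic is to assign the largest symbols first, each to the currently least‑loaded worker. This is a sketch with made‑up row counts; `balance_symbols` is a hypothetical helper, not a Dask or Ray function.

```python
import heapq

def balance_symbols(row_counts: dict, n_workers: int) -> list:
    """Greedy assignment: place the largest remaining symbol on the least-loaded worker."""
    heap = [(0, worker_id) for worker_id in range(n_workers)]  # (current load, worker)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for symbol, rows in sorted(row_counts.items(), key=lambda kv: kv[1], reverse=True):
        load, worker_id = heapq.heappop(heap)
        assignment[worker_id].append(symbol)
        heapq.heappush(heap, (load + rows, worker_id))
    return assignment

# Hypothetical row counts: one very liquid symbol and several thin ones.
row_counts = {"AAA": 900, "BBB": 300, "CCC": 300, "DDD": 200, "EEE": 100}
assignment = balance_symbols(row_counts, n_workers=2)
loads = [sum(row_counts[s] for s in symbols) for symbols in assignment]
print(assignment, loads)
```

Here the heavy symbol gets a worker to itself while the four smaller ones share the other, so both workers end up with 900 rows. The greedy rule does not always find the optimal split, but it is cheap and usually close.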

---

## **Chapter Summary**

In this chapter, we explored distributed systems for time‑series prediction. We covered fundamental concepts, parallelism strategies, and practical implementations using Dask and Ray. We demonstrated distributed feature engineering, training with XGBoost, and inference sharding. We also discussed data partitioning, consistency, fault tolerance, and monitoring. Distributed systems enable us to scale our NEPSE prediction system to thousands of stocks and high‑throughput requests, but they come with increased complexity. The key is to apply them judiciously where the benefits outweigh the costs.

In the next chapter, we will continue with **Development Best Practices**, focusing on code quality, testing, and documentation.

---

**End of Chapter 85**

