# Chapter 42: Real-Time Prediction Systems

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the architecture of a real‑time prediction system and its components
- Identify the differences between batch and stream processing for time‑series forecasting
- Choose and implement a streaming platform (Apache Kafka) to ingest live market data
- Build stream processing pipelines that compute features and serve predictions with low latency
- Manage state in streaming applications to maintain rolling windows and feature aggregations
- Handle backpressure and guarantee exactly‑once processing semantics
- Monitor and scale real‑time prediction systems for production workloads

---

## Introduction

In the previous chapters, we built batch prediction systems that process historical NEPSE data once per day after the market closes. But in financial markets, opportunities vanish in seconds. A **real‑time prediction system** allows us to react to price movements as they happen—detecting breakout patterns, sudden volatility changes, or circuit‑breaker triggers within milliseconds of a new trade.

For the NEPSE (Nepal Stock Exchange) example, a real‑time system would continuously consume live tick data (or minute‑level aggregates) from a market data feed, compute technical indicators on the fly, and emit predictions (e.g., next‑minute price direction) before the next tick arrives. This chapter covers the design, implementation, and operational considerations of such systems, using NEPSE as a running example.

---

## 42.1 Real‑Time Architecture

A real‑time prediction system is a continuous data pipeline that ingests events as they occur, processes them with minimal delay, and produces predictions or alerts. Unlike batch systems that run on a schedule (e.g., every 24 hours), stream processors operate on **unbounded data** and must handle out‑of‑order events, late arrivals, and stateful computations.

### Core Components

1. **Data Source** – The origin of live events (e.g., stock ticks from a WebSocket, messages from a message queue, or changes in a database).
2. **Ingestion Layer** – A distributed, fault‑tolerant message broker that buffers and distributes the event stream (e.g., Apache Kafka, Apache Pulsar).
3. **Stream Processor** – A continuous computation engine that reads from the ingestion layer, applies transformations (feature engineering, model inference), and writes results (e.g., Apache Flink, Apache Spark Streaming, or custom microservices).
4. **Model Serving** – The component that hosts the trained machine learning model and exposes it for low‑latency scoring. This can be embedded in the stream processor or run as a separate service.
5. **Output Sink** – The destination of the predictions: a database, a dashboard, an alerting system, or another message queue for downstream applications.

![Real‑Time Prediction Architecture](images/real_time_arch.png)

For the NEPSE system, a realistic architecture might look like this:

- **Data Source**: A WebSocket connection to a market data provider that pushes real‑time trade and quote data.
- **Ingestion**: Apache Kafka, which scales horizontally to very high event rates and retains events so that consumers can replay them.
- **Stream Processor**: A Python application using the Faust library, or a Flink job written in Java, that:
  - Parses each tick
  - Updates rolling windows (e.g., 5‑minute moving average)
  - Computes features (RSI, MACD, volume anomalies)
  - Calls a pre‑trained XGBoost model to predict next‑minute direction
- **Output Sink**: A PostgreSQL database for historical logging and a Redis cache for real‑time dashboards.

The following code snippet simulates a simple real‑time ingestion pipeline using Python and `confluent_kafka` to consume NEPSE ticks from a Kafka topic.

```python
# consumer.py
from confluent_kafka import Consumer, KafkaError
import json
import joblib
import numpy as np

# Load pre‑trained model and feature scaler (trained in batch)
model = joblib.load('nepse_xgboost.pkl')
scaler = joblib.load('feature_scaler.pkl')

# Configure Kafka consumer
conf = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'nepse-predictor',
    'auto.offset.reset': 'latest'
}
consumer = Consumer(conf)
consumer.subscribe(['nepse-ticks'])

def extract_features(tick):
    """
    Convert a raw tick dict into a feature vector.
    In a real system, we would maintain state (previous ticks)
    to compute lag features.
    """
    # For demonstration, we use only the latest price and volume
    return np.array([[tick['price'], tick['volume']]])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue
        else:
            print(msg.error())
            break

    # Decode the JSON message
    tick = json.loads(msg.value().decode('utf-8'))
    print(f"Received tick: {tick}")

    # Feature extraction (simplified)
    features = extract_features(tick)
    features_scaled = scaler.transform(features)

    # Predict (e.g., probability of price increase in next minute)
    prob = model.predict_proba(features_scaled)[0, 1]
    print(f"Predicted up probability: {prob:.3f}")

    # Here we would write the prediction to a sink (e.g., Redis, InfluxDB)
```

**Explanation:**  
This consumer demonstrates the core loop of a real-time predictor. It subscribes to the Kafka topic `nepse-ticks`, where each message is a JSON object representing a stock tick (symbol, price, volume, timestamp). After polling for a message, it decodes the JSON, extracts features (a real system would maintain a state store with previous ticks), scales them using a pre-fitted `StandardScaler`, and feeds them into an XGBoost classifier. The resulting probability is printed and could be sent to a dashboard. The setting `auto.offset.reset=latest` applies only when the consumer group has no committed offset: in that case consumption starts with new messages rather than at the beginning of the topic, which is typical for real-time applications that don't need historical replay. A group that has already committed offsets resumes from them after a restart.
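
The docstring of `extract_features` notes that a real system would keep previous ticks to compute lag features. A minimal sketch of that idea follows, using only the standard library. The class name `RollingFeatures`, the feature names, and the sample symbol and prices are illustrative assumptions, not part of the pipeline above, and the state lives in process memory, so it is lost on restart.

```python
from collections import defaultdict, deque

class RollingFeatures:
    """Keeps the last `window` prices per symbol and derives simple lag features."""

    def __init__(self, window=5):
        if window < 2:
            raise ValueError("window must be at least 2 to compute a return")
        self.window = window
        # One bounded buffer per symbol; deque(maxlen=...) evicts the oldest price
        self.prices = defaultdict(lambda: deque(maxlen=window))

    def update(self, tick):
        """Add a tick and return a feature dict, or None until the window is full."""
        buf = self.prices[tick["symbol"]]
        buf.append(tick["price"])
        if len(buf) < self.window:
            return None
        sma = sum(buf) / self.window
        return {
            "price": tick["price"],
            "sma": sma,
            # Relative change between the newest and the previous price
            "ret_1": buf[-1] / buf[-2] - 1.0,
            # Distance of the current price from its moving average
            "price_vs_sma": tick["price"] / sma - 1.0,
        }

extractor = RollingFeatures(window=3)
ticks = [
    {"symbol": "NABIL", "price": 100.0, "volume": 10},
    {"symbol": "NABIL", "price": 102.0, "volume": 12},
    {"symbol": "NABIL", "price": 104.0, "volume": 9},
]
results = [extractor.update(t) for t in ticks]
print(results[-1])
```

In the consumer loop, `extractor.update(tick)` would replace `extract_features(tick)`, with predictions skipped while it returns `None`.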

---

## 42.2 Streaming Platforms

A streaming platform is the backbone of any real‑time system. It decouples data producers from consumers, provides durability, and enables multiple applications to read the same stream independently. For the NEPSE system, we need a platform that can handle high throughput (thousands of ticks per second) and provide at‑least‑once or exactly‑once delivery guarantees.

### 42.2.1 Apache Kafka

Apache Kafka is the de facto standard for event streaming. It is a distributed, partitioned, replicated commit log service. Messages are organized into **topics**, and each topic can be split into **partitions** for parallelism. Producers write to topics, consumers read from them, and Kafka retains messages for a configurable period (even after consumption) allowing replay.

**Key Concepts:**
- **Producer**: Publishes messages to a topic.
- **Consumer**: Subscribes to topics and processes messages.
- **Consumer Group**: A set of consumers that cooperate to consume a topic; each partition is assigned to one consumer in the group.
- **Offset**: A unique identifier for each message within a partition, used to track consumption progress.

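For the NEPSE system, we might have a topic `nepse-ticks` whose messages are keyed by stock symbol, so that all ticks for a symbol land in the same partition. This allows parallel consumption: within a consumer group each partition is read by exactly one consumer, so the partition count sets the upper limit on parallelism, and a single consumer may handle many symbols.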

**Example Producer** (simulating tick data from a CSV file):

```python
# producer.py
import csv
import json
import time
from confluent_kafka import Producer

conf = {'bootstrap.servers': 'localhost:9092'}
producer = Producer(conf)

def delivery_report(err, msg):
    if err is not None:
        print(f"Message delivery failed: {err}")

# Simulate real‑time by reading a NEPSE CSV file row by row
with open('nepse_daily.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Convert to JSON and send
        tick = {
            'symbol': row['Symbol'],
            'price': float(row['Close']),
            'volume': int(row['Vol']),
            'timestamp': row.get('Date', time.strftime('%Y-%m-%d %H:%M:%S'))
        }
        producer.produce(
            'nepse-ticks',
            key=tick['symbol'].encode('utf-8'),
            value=json.dumps(tick).encode('utf-8'),
            callback=delivery_report
        )
        producer.poll(0)  # Trigger delivery reports
        time.sleep(1)     # Simulate one tick per second
    producer.flush()
```

**Explanation:**  
This producer reads a batch CSV file line by line, constructs a JSON tick message, and publishes it to the `nepse-ticks` topic. By using the stock symbol as the message key, we ensure that all ticks for the same symbol go to the same partition, preserving order per symbol. The `time.sleep(1)` simulates a real‑time stream; in production, ticks arrive as they happen, not on a fixed schedule.
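
To see why keying by symbol preserves per-symbol order, it helps to look at how a key is mapped to a partition. The sketch below uses CRC32 purely for illustration; real Kafka clients ship their own hash functions, which differ between clients. The symbols and the partition count of 4 are assumptions.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash (CRC32 here, for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

symbols = ["NABIL", "NTC", "HDL", "NABIL", "NTC", "NABIL"]
assignments = [(s, partition_for(s, 4)) for s in symbols]
for symbol, partition in assignments:
    print(f"{symbol} -> partition {partition}")
```

Every occurrence of a symbol maps to the same partition, and Kafka preserves order within a partition. Because the result depends on `num_partitions`, changing the partition count of a topic can move a symbol to a different partition.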

### 42.2.2 Apache Pulsar

Apache Pulsar is another cloud-native streaming platform that offers native support for multi-tenancy and geo-replication. It separates the serving layer (brokers) from the storage layer (Apache BookKeeper), which lets the two scale independently at the cost of more moving parts than a Kafka cluster. Pulsar also provides several subscription types (exclusive, shared, failover, and key-shared) and supports both queue and stream semantics. While less commonly used than Kafka, it is gaining traction for large-scale deployments.

### 42.2.3 Cloud Streaming Services

Major cloud providers offer managed streaming services that abstract away cluster management:

- **AWS Kinesis**: Fully managed service for real‑time data streaming. Data is stored in shards, and consumers read from shards using the Kinesis Client Library (KCL).
- **Google Cloud Pub/Sub**: Simple, reliable messaging with at‑least‑once delivery and configurable retention.
- **Azure Event Hubs**: Scalable event ingestion service compatible with Kafka protocol.

For a NEPSE prototype, a managed service can be easier to set up, but for learning purposes, running Kafka locally (or in Docker) is sufficient.

---

## 42.3 Stream Processing

Once data is in a streaming platform, we need to process it continuously. Stream processing engines apply transformations—filtering, aggregation, joining, windowing—to the unbounded data stream. They also maintain state and handle late‑arriving data.

### 42.3.1 Apache Flink

Apache Flink is a powerful stream processing framework that provides exactly‑once semantics, event‑time processing, and sophisticated windowing. It can be used with Java/Scala, but also offers a Python API (PyFlink). For the NEPSE system, we might use Flink to compute rolling technical indicators every minute.

**Example Flink job (simplified, in Java):**

```java
// Pseudo‑code for Flink job that computes 5‑minute SMA
DataStream<Tick> ticks = env.addSource(new FlinkKafkaConsumer<>("nepse-ticks", ...));

ticks
    .keyBy(tick -> tick.symbol)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    .aggregate(new AveragePrice())
    .addSink(new RedisSink<>());
```

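Flink handles out-of-order events via watermarks and supports event-time processing, which is crucial for accurate financial analytics: the relevant timestamp is the one embedded in the tick, not the moment the message happens to be processed. Note that the sketch above uses `TumblingProcessingTimeWindows` for brevity; an event-time version would assign timestamps and watermarks with a `WatermarkStrategy` and use `TumblingEventTimeWindows` instead.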
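
The interaction between event-time windows and watermarks can be illustrated without Flink. The following pure-Python sketch is a simplification under stated assumptions: timestamps are integer seconds, the watermark is the largest event time seen so far minus a fixed allowance, and events whose window has already closed are dropped. The constants and sample events are made up, and Flink's own implementation differs in detail.

```python
from collections import defaultdict

WINDOW = 300          # window length in seconds (5 minutes)
MAX_LATENESS = 60     # how far events may arrive out of order, in seconds

def window_start(ts: int) -> int:
    """Tumbling windows are aligned to multiples of the window length."""
    return ts - ts % WINDOW

def run(events):
    """events: iterable of (event_time_seconds, price) in arrival order.
    Returns (closed, dropped); closed maps window start -> average price."""
    open_windows = defaultdict(list)
    closed, dropped = {}, []
    max_seen = None
    for ts, price in events:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - MAX_LATENESS
        start = window_start(ts)
        if start + WINDOW <= watermark:
            dropped.append((ts, price))      # its window has already been emitted
        else:
            open_windows[start].append(price)
        # Emit every window whose end is at or before the watermark
        for s in sorted(w for w in open_windows if w + WINDOW <= watermark):
            prices = open_windows.pop(s)
            closed[s] = sum(prices) / len(prices)
    return closed, dropped

events = [(10, 100.0), (200, 102.0), (290, 104.0), (310, 105.0),
          (250, 101.0),   # out of order, but window [0, 300) is still open
          (370, 106.0),   # watermark reaches 310, so window [0, 300) closes
          (280, 999.0)]   # too late: its window was already emitted
closed, dropped = run(events)
print(closed, dropped)
```

The out-of-order tick at time 250 still counts toward its window because the watermark has not yet passed 300, while the tick at time 280 arrives after the window was emitted and is discarded.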

### 42.3.2 Apache Spark Streaming

Apache Spark Streaming, in its current form called **Structured Streaming** (the older DStream API is legacy), treats a stream as an unbounded table that grows as new rows arrive. It offers a high-level DataFrame API and integrates with Spark MLlib. By default it processes data in micro-batches rather than record by record, which adds latency ranging from roughly a hundred milliseconds to a few seconds; that may be acceptable for minute-level predictions but not for tick-level reaction.

**Example Structured Streaming with Python:**

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, window

spark = SparkSession.builder.appName("NEPSEStreaming").getOrCreate()

# Read from Kafka
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "nepse-ticks") \
    .load()

# Parse JSON and compute 5‑minute average price per symbol
ticks = df.selectExpr("CAST(value AS STRING) as json") \
    .selectExpr("json_tuple(json, 'symbol', 'price', 'timestamp') as (symbol, price, ts)") \
    .withColumn("price", col("price").cast("double")) \
    .withColumn("timestamp", col("ts").cast("timestamp"))

windowedAvg = ticks \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(window("timestamp", "5 minutes"), "symbol") \
    .agg(avg("price").alias("avg_price"))

# Write to console (or sink)
query = windowedAvg \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()
```

**Explanation:**  
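This Spark Structured Streaming application reads from Kafka, parses the JSON, casts fields to appropriate types, and computes the average price per symbol over 5-minute tumbling windows. The watermark tells Spark to accept events that arrive up to 1 minute late and to discard state for older windows. Because the query runs in `append` output mode, each window's average is emitted only once, after the watermark has passed the end of that window. The result is printed to the console, but could be written to a database or Kafka. Running the job requires the Spark Kafka connector package (`spark-sql-kafka-0-10`) to be available to Spark.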

### 42.3.3 Custom Solutions with Python

For lightweight applications or when integrating with custom ML models, a pure Python stream processor can be built using libraries like **Faust** (a stream processing library modelled on Kafka Streams) or **Bytewax** (a Python-native stream engine). Faust allows you to define stateful operators using async/await syntax and integrates tightly with Kafka. The original Faust project is no longer actively developed; a community-maintained fork is published as `faust-streaming` and keeps the same `import faust` interface.

**Example Faust application for NEPSE:**

```python
import faust
import joblib
import numpy as np

app = faust.App('nepse-processor', broker='kafka://localhost:9092')

# Define a Faust record (schema) for ticks
class Tick(faust.Record):
    symbol: str
    price: float
    volume: int
    timestamp: str

topic = app.topic('nepse-ticks', value_type=Tick)

# Load model once at startup (outside agent)
model = joblib.load('nepse_xgboost.pkl')
scaler = joblib.load('feature_scaler.pkl')

@app.agent(topic)
async def process(ticks):
    # Maintain state for last 10 prices per symbol (e.g., using a Table)
    # Here we simply predict on each tick (not realistic, just for demo)
    async for tick in ticks:
        features = np.array([[tick.price, tick.volume]])
        features_scaled = scaler.transform(features)
        prob = model.predict_proba(features_scaled)[0, 1]
        print(f"{tick.symbol}: up prob {prob:.3f}")
        # Optionally send prediction to another topic
```

**Explanation:**  
Faust applications define agents: `async` functions that iterate over the stream of messages from a topic. The agent runs continuously, processing each message as it arrives. Faust also provides tables for stateful operations (e.g., keeping a rolling window). This is a good choice for Python-heavy teams.

---

## 42.4 Low‑Latency Inference

In real‑time prediction, the time from receiving a tick to emitting a prediction must be as short as possible—ideally under a few milliseconds. Achieving low latency requires careful optimization at multiple levels:

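- **Model Choice**: Simple models (linear regression, small trees) are faster than complex neural networks. For NEPSE, an XGBoost model with limited depth (e.g., max_depth=6) and a moderate number of trees can typically score a single row in well under a millisecond on a CPU. Measure it, because per-call overhead in Python often outweighs the tree traversal itself.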
- **Pre‑computation**: Features that depend only on current tick (price, volume) can be computed on‑the‑fly; features that require historical windows need state management.
- **Model Serving**: Instead of loading the model in the stream processor, it can be exposed as a REST/gRPC service for better isolation and scaling. However, network overhead adds latency. For ultra‑low latency, embed the model directly.
- **Hardware**: GPUs or specialised inference accelerators help with deep learning models, often behind an inference server such as NVIDIA Triton Inference Server; for tree-based models, CPUs are sufficient.
- **Batching**: Grouping multiple inference requests can improve throughput but adds latency; trade‑off depends on requirements.

**Example of embedding an XGBoost model in a Faust agent:**

```python
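import joblib
import numpy as np
import xgboost as xgb

# Load the model and the feature scaler once at startup, not per message
model = xgb.Booster(model_file='nepse.model')
scaler = joblib.load('scaler.pkl')

def score(price, volume, sma_5, rsi):
    """Score one tick; call this from inside the agent loop."""
    features = np.array([[price, volume, sma_5, rsi]])
    # Assumes the model was trained on features scaled with this scaler
    dmatrix = xgb.DMatrix(scaler.transform(features))
    # With a binary:logistic objective, predict() returns the up probability
    return float(model.predict(dmatrix)[0])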
```

**Optimization tip:** Pre‑allocate arrays and reuse them to avoid allocation overhead.
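
Any latency optimization should be checked by measurement. The sketch below times repeated single-row scoring calls with `time.perf_counter` and reports nearest-rank percentiles. The `score` function is a stand-in with hypothetical weights, not the XGBoost model; in practice the real scoring call would be timed in its place.

```python
import math
import time

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in (0, 100]."""
    rank = max(1, math.ceil(len(sorted_values) * q / 100))
    return sorted_values[rank - 1]

def score(features):
    """Stand-in for model inference: a fixed linear combination (hypothetical weights)."""
    weights = (0.4, 0.1, 0.3, 0.2)
    return sum(w * x for w, x in zip(weights, features))

def measure(n_calls=1000):
    """Time n_calls single-row scorings and return the latencies in seconds, sorted."""
    features = [101.5, 1200.0, 100.8, 55.0]   # reused across calls, no per-call allocation
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        score(features)
        latencies.append(time.perf_counter() - start)
    return sorted(latencies)

latencies = measure()
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
print(f"p50={p50 * 1e6:.1f} us, p99={p99 * 1e6:.1f} us")
```

Tail percentiles such as p99 matter more than the mean here, because a prediction that arrives after the next tick is of little use.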

---

## 42.5 State Management

Many time-series features require **state**: for example, a 20-period moving average needs the last 20 prices. In stream processing, state must be maintained across events and recovered after failures. Stream processors provide built-in state stores (RocksDB or in-memory) whose contents are checkpointed or logged, so that state can be restored consistently after a crash.

### State in Faust

Faust provides `Table` objects that are partitioned across instances and backed by a changelog topic in Kafka. They can store aggregations per key.

```python
# Table for storing last 5 prices per symbol.
# `partitions` must equal the partition count of the source topic (8 is assumed here).
price_buffer = app.Table('price_buffer', default=list, partitions=8)

@app.agent(topic)
async def process(ticks):
    async for tick in ticks:
        # Append price to buffer for this symbol
        buf = price_buffer[tick.symbol]
        buf.append(tick.price)
        # Keep only last 5
        if len(buf) > 5:
            buf.pop(0)
        price_buffer[tick.symbol] = buf

        # Compute SMA_5 if buffer has enough values
        if len(buf) == 5:
            sma_5 = sum(buf) / 5
            # Use sma_5 as a feature for prediction
            # ... predict ...
```

**Explanation:**  
The `price_buffer` table keeps a list of recent prices per symbol. When a new tick arrives, we append the price and trim to the last 5. This state is fault‑tolerant because Faust persists every change to a Kafka changelog topic. If the instance fails, another can replay the changelog and reconstruct the buffer.
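
The recovery mechanism can be imitated in a few lines. The toy `ChangelogTable` below is a hypothetical stand-in, not Faust's implementation: it appends every write to a list that plays the role of the changelog topic, and rebuilds the in-memory state by replaying that list.

```python
class ChangelogTable:
    """Toy key-value table that records every write in an append-only changelog."""

    def __init__(self, changelog=None):
        self.data = {}
        self.changelog = [] if changelog is None else changelog

    def __setitem__(self, key, value):
        self.changelog.append((key, list(value)))   # store a copy of the written value
        self.data[key] = list(value)

    def get(self, key):
        return list(self.data.get(key, []))

    @classmethod
    def recover(cls, changelog):
        """Rebuild the in-memory state by replaying the changelog in order."""
        table = cls(changelog=list(changelog))
        for key, value in changelog:
            table.data[key] = list(value)            # later entries overwrite earlier ones
        return table

table = ChangelogTable()
for price in [100.0, 101.0, 102.0, 103.0, 104.0, 105.0]:
    buf = table.get("NABIL")
    buf.append(price)
    table["NABIL"] = buf[-5:]        # keep only the last 5 prices

# Simulate a crash: the in-memory dict is lost, the changelog survives
restored = ChangelogTable.recover(table.changelog)
print(restored.get("NABIL"))
```

The sketch also shows why the buffer is written back with an explicit assignment: only assignments reach the changelog, so mutating a list in place would leave no record to replay.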

### State in Flink

Flink offers keyed state (ValueState, ListState, MapState) that is automatically checkpointed. For a rolling window, you can use a `ListState` or implement a custom `AggregateFunction`.

---

## 42.6 Backpressure Handling

Backpressure occurs when the stream processor cannot keep up with the incoming data rate. Without handling, the system may crash or lose data. Modern stream processors provide mechanisms to deal with backpressure:

- **Kafka Consumer Lag**: If the consumer falls behind, Kafka keeps messages; the lag (difference between latest offset and committed offset) grows. Monitoring lag is essential.
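- **Flow Control**: In Faust, the buffer between the Kafka fetcher and an agent is bounded (see the `stream_buffer_maxsize` setting); when the agent cannot keep up, the worker stops fetching until the buffer drains. The upstream producer is not blocked: unread messages simply accumulate in Kafka and show up as consumer lag.
- **Dynamic Scaling**: Adding consumers to the group increases parallelism, up to one consumer per partition; beyond that, the topic needs more partitions.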
- **Drop or Sample**: For non‑critical applications, you may sample events (e.g., keep only 10% of ticks) or drop late events.

In the NEPSE scenario, tick rates are moderate (a few per second per symbol), so backpressure is unlikely. However, if we scale to many symbols or use complex models, we must plan.
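
The flow-control idea can be demonstrated with a bounded `asyncio.Queue`. This is a sketch of the general mechanism, not of Faust internals; the function names, the queue size of 5, and the 1 ms processing time are assumptions.

```python
import asyncio

async def fetcher(queue, n_ticks, waited_on):
    """Simulates the component that pulls ticks from the broker."""
    for i in range(n_ticks):
        if queue.full():
            waited_on.append(i)      # the put() below waits until the agent frees a slot
        await queue.put(i)
    await queue.put(None)            # sentinel: no more ticks

async def agent(queue, processed):
    """Simulates a slow consumer (for example, heavy feature computation)."""
    while True:
        tick = await queue.get()
        if tick is None:
            break
        await asyncio.sleep(0.001)   # pretend each prediction takes 1 ms
        processed.append(tick)

async def main():
    queue = asyncio.Queue(maxsize=5)     # bounded buffer between fetcher and agent
    waited_on, processed = [], []
    await asyncio.gather(fetcher(queue, 50, waited_on), agent(queue, processed))
    return waited_on, processed

waited_on, processed = asyncio.run(main())
print(f"processed {len(processed)} ticks, fetcher had to wait {len(waited_on)} times")
```

No tick is lost and order is preserved; the fetcher is simply slowed to the pace of the agent, which is the behaviour a bounded buffer is meant to provide.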

**Example of monitoring consumer lag with `kafka-consumer-groups` CLI:**

```bash
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group nepse-predictor --describe
```

This shows per‑partition lag.

---

## 42.7 Exactly‑Once Processing

Exactly-once semantics guarantee that the effect of each message is reflected exactly once in state and output, even if a failure forces the message itself to be read again. This is critical for financial applications where duplicate predictions or missed ticks could lead to incorrect trading decisions.

Kafka introduced exactly‑once semantics via **transactions** and idempotent producers. A stream processor can participate in a transaction, committing offsets and output writes atomically.

### Kafka Exactly‑Once in Python

The `confluent_kafka` library supports idempotent producers and transactions, but using them correctly is complex. For simplicity, many applications settle for **at‑least‑once** and deduplicate downstream (idempotent writes to the sink).
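
The at-least-once approach with downstream deduplication can be sketched as follows. The in-memory `IdempotentSink` is a hypothetical stand-in for a database table with a unique key on (topic, partition, offset); the offsets and probabilities are made-up values.

```python
class IdempotentSink:
    """Stores one prediction per (topic, partition, offset); replays overwrite, never duplicate."""

    def __init__(self):
        self.rows = {}

    def write(self, topic, partition, offset, prediction):
        key = (topic, partition, offset)     # identifies one message within a cluster
        is_new = key not in self.rows
        self.rows[key] = prediction          # upsert: same key, same row
        return is_new

sink = IdempotentSink()
# (partition, offset, predicted up probability)
batch = [(0, 41, 0.61), (0, 42, 0.58), (0, 43, 0.47)]

for partition, offset, prob in batch:
    sink.write("nepse-ticks", partition, offset, prob)

# Simulate a crash before the offsets were committed: the consumer restarts
# from offset 42 and reprocesses two messages it had already handled.
replayed = [sink.write("nepse-ticks", p, o, prob) for p, o, prob in batch[1:]]
print(len(sink.rows), replayed)
```

With a relational sink such as PostgreSQL, the same effect is usually achieved with an upsert (`INSERT ... ON CONFLICT`) on that key.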

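Flink can provide end-to-end exactly-once when checkpointing is enabled and the sink is transactional (e.g., its Kafka sink) or idempotent. Spark Structured Streaming offers the same guarantee with replayable sources and idempotent sinks such as its file sink; its built-in Kafka sink is at-least-once, so duplicates are possible after a retry.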

**Example of enabling exactly‑once in Flink Kafka sink:**

```java
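// Fragment: `sinkBuilder` is a KafkaSinkBuilder (Flink 1.15+); checkpointing must also be enabled
sinkBuilder.setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
           .setTransactionalIdPrefix("nepse-predictions");  // prefix name is illustrative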
```

---

## 42.8 Monitoring Real‑Time Systems

A real‑time prediction system must be continuously monitored to detect anomalies, performance degradation, or model drift. Key metrics include:

- **Throughput**: Messages per second processed.
- **Latency**: Time from ingestion to prediction output (p99).
- **Error Rate**: Percentage of failed messages or predictions.
- **Consumer Lag**: How far behind the consumer is from the latest message.
- **Model Performance**: Drift in prediction distribution compared to training.
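
For the model performance metric, one common way to quantify a shift in the prediction distribution is the Population Stability Index (PSI). It is used here as an illustrative choice rather than a method prescribed by this chapter. The sketch assumes predictions are probabilities in [0, 1], uses 10 equal-width bins, and runs on made-up samples.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of probabilities in [0, 1]."""
    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int(v * n_bins), n_bins - 1)   # v == 1.0 falls in the last bin
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Made-up samples: predictions at training time vs. two live windows
train = [i / 100 for i in range(100)]                # spread over the whole range
live_same = [i / 100 for i in range(100)]
live_shifted = [0.5 + i / 200 for i in range(100)]   # mass moved into the upper half

print(f"no drift: {psi(train, live_same):.4f}")
print(f"shifted:  {psi(train, live_shifted):.4f}")
```

A value near zero means the live distribution matches training; what counts as an alarming value is a convention to be calibrated per model.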

### Tools

- **Prometheus + Grafana**: Collect metrics from the application (using a client library like `prometheus_client`) and visualize dashboards.
- **Kafka Monitoring**: Burrow, Kafka Lag Exporter.
- **Logging**: Structured logs (JSON) to Elasticsearch, viewed in Kibana.

**Example of exposing Prometheus metrics in a Faust app:**

```python
from prometheus_client import Counter, Histogram, start_http_server

# Start Prometheus HTTP server on port 8000
start_http_server(8000)

PREDICTIONS = Counter('predictions_total', 'Total predictions', ['symbol'])
LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.agent(topic)
async def process(ticks):
    async for tick in ticks:
        with LATENCY.time():
            # ... predict ...
            PREDICTIONS.labels(symbol=tick.symbol).inc()
```

---

## 42.9 Scaling Strategies

As the number of symbols or the tick rate grows, the system must scale horizontally. Scaling a streaming application involves:

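- **Increasing Partitions**: More partitions in Kafka allow more consumers in a group. However, adding partitions to an existing topic changes the key-to-partition mapping, so ticks for a symbol may move to a different partition (breaking per-symbol ordering across the change and any co-partitioned state), and every change in group membership triggers a rebalance. Choose the partition count with headroom up front.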
- **Parallelising the Stream Processor**: Faust and Flink automatically distribute work by key. Each instance handles a subset of keys.
- **Separating Model Serving**: Offload inference to a dedicated service (e.g., using TensorFlow Serving or a custom gRPC server) that can scale independently.
- **Auto‑Scaling**: Use Kubernetes Horizontal Pod Autoscaler based on CPU/memory or custom metrics (e.g., consumer lag).

**Example of scaling a Faust application with multiple workers:**

```bash
faust -A myapp worker --web-port=6066 -l info &
faust -A myapp worker --web-port=6067 -l info &
```

Faust uses Kafka’s consumer groups to assign partitions among workers automatically.
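
The effect of partition assignment on scaling can be illustrated with a simple round-robin scheme. Kafka's actual assignors (range, round-robin, sticky) are more elaborate, so this is a sketch of the principle only; the worker names are assumptions.

```python
def assign_partitions(num_partitions, workers):
    """Round-robin assignment of partitions to workers; each partition has exactly one owner."""
    assignment = {w: [] for w in workers}
    for p in range(num_partitions):
        assignment[workers[p % len(workers)]].append(p)
    return assignment

print(assign_partitions(8, ["worker-1", "worker-2"]))
print(assign_partitions(8, ["worker-1", "worker-2", "worker-3"]))
# More workers than partitions: the extra worker receives nothing
print(assign_partitions(2, ["worker-1", "worker-2", "worker-3"]))
```

The last case shows the limit of this form of scaling: once there are as many workers as partitions, adding workers adds no throughput.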

---

## Chapter Summary

In this chapter, we explored the architecture and implementation of real‑time prediction systems using the NEPSE stock market as a motivating example. We covered:

- The high‑level components of a real‑time pipeline: data source, ingestion, stream processing, model serving, and output sink.
- Apache Kafka as a distributed streaming platform, with practical code for producing and consuming NEPSE tick data.
- Stream processing with Apache Flink, Spark Structured Streaming, and custom Python frameworks like Faust.
- Techniques for low‑latency inference, including embedding models and optimising feature computation.
- State management to maintain rolling windows and other temporal aggregates.
- Backpressure handling, exactly‑once processing, monitoring, and scaling strategies.

Real‑time prediction systems are complex but essential for applications that require immediate responses. By combining the right tools and architectural patterns, we can build robust systems that deliver timely predictions for financial markets, IoT, or any domain where data flows continuously.

In the next chapter, we will turn to **Model Deployment Strategies**.

---

**End of Chapter 42**

