# Chapter 48: Scalability and Performance Optimisation

## Learning Objectives

By the end of this chapter, you will be able to:

- Identify performance bottlenecks in a time‑series prediction system
- Profile Python code to locate CPU and memory hot spots
- Apply algorithmic optimisations to reduce computational complexity
- Implement parallel processing techniques (multiprocessing, multithreading, async I/O) to speed up data processing and inference
- Leverage distributed computing frameworks (Dask, Ray, Spark) for large‑scale feature engineering and model training
- Design effective caching strategies to avoid redundant computations
- Optimise memory usage when working with large time‑series datasets
- Utilise GPU acceleration for deep learning models and large matrix operations
- Scale prediction services horizontally and vertically in cloud environments
- Balance cost and performance through auto‑scaling and resource right‑sizing

---

## Introduction

As your NEPSE prediction system grows—more stocks, higher frequency data, more complex models, and more users—the initial prototype that worked beautifully on your laptop will eventually buckle under the load. Prediction latency increases, batch processing jobs take too long to finish, and costs skyrocket. **Scalability and performance optimisation** are the disciplines that ensure your system can handle growth gracefully, whether it's a 10x increase in data volume or a 100x increase in prediction requests.

Scalability is not an afterthought; it must be designed into the system from the beginning. However, even if you started with a simple script, there are many techniques to retrofit performance improvements. In this chapter, we will explore the entire stack: from optimising Python code and algorithms, to parallel and distributed computing, to caching and memory management, and finally to cloud‑scale architectures. Using the NEPSE system as our guide, we will identify typical bottlenecks and apply practical solutions.

---

## 48.1 Identifying Performance Bottlenecks

Before optimising anything, you must know where the time is spent. Guessing leads to wasted effort. Use **profiling** tools to measure exactly which functions and lines of code are the slowest.

### 48.1.1 CPU Profiling with cProfile

Python's built‑in `cProfile` module records how many times each function is called and how long it takes.

```python
import cProfile
import pstats
from nepse_pipeline import run_feature_engineering

# Profile the feature engineering function
profiler = cProfile.Profile()
profiler.enable()
run_feature_engineering('nepse_data.csv')
profiler.disable()

# Save stats to a file
with open('profile_results.txt', 'w') as f:
    stats = pstats.Stats(profiler, stream=f)
    stats.sort_stats('cumulative')  # Sort by cumulative time
    stats.print_stats(20)            # Print top 20
```

**Explanation:**  
`cProfile` runs the target function and records every function call. After sorting by cumulative time, you see which functions consume the most time. For example, you might discover that a pandas `rolling` operation is the bottleneck.
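
To try the same workflow without the NEPSE pipeline, the sketch below profiles a deliberately slow toy function and captures the report in a string instead of a file. The function `slow_moving_average` is a hypothetical stand-in for a real pipeline step, not part of the NEPSE code.

```python
import cProfile
import io
import pstats


def slow_moving_average(values, window):
    """Hypothetical stand-in for a pipeline step: recomputes every window from scratch."""
    out = []
    for i in range(window - 1, len(values)):
        out.append(sum(values[i - window + 1:i + 1]) / window)
    return out


profiler = cProfile.Profile()
profiler.enable()
averages = slow_moving_average(list(range(20_000)), 50)
profiler.disable()

# Write the report into an in-memory buffer so it can be inspected or logged
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats('cumulative').print_stats(5)  # Top 5 entries by cumulative time
report = buffer.getvalue()
print(report)
```

In the report, the `cumtime` column includes time spent in callees, while `tottime` counts only the time spent in the function body itself.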

### 48.1.2 Line‑by‑Line Profiling with line_profiler

For more granular insight, `line_profiler` shows time per line of code.

```python
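# Decorate the function you want to profile. kernprof injects `profile` at run time,
# so no import is needed (running the file with plain `python` raises NameError).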
@profile
def compute_rsi(df, period=14):
    delta = df['Close'].diff()
    gain = delta.where(delta > 0, 0.0)
    loss = -delta.where(delta < 0, 0.0)
    avg_gain = gain.rolling(window=period).mean()
    avg_loss = loss.rolling(window=period).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

# Then run: kernprof -l -v my_script.py
```

**Explanation:**  
The `@profile` decorator is used by `kernprof` to record time per line. The output shows, for each line, how many times it was executed and the total time spent. This can reveal that a seemingly innocent line is actually slow.

### 48.1.3 Memory Profiling

Memory leaks or excessive memory usage can also slow down a system. Use `memory_profiler` to track memory consumption.

```python
import pandas as pd
from memory_profiler import profile

@profile
def load_and_process():
    df = pd.read_csv('nepse_all.csv')
    df['SMA_20'] = df['Close'].rolling(20).mean()
    return df

load_and_process()
```

The output shows memory usage per line, helping you identify where data copies or large intermediate objects are created.
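
`memory_profiler` is a third-party package. Where it is not installed, the standard library's `tracemalloc` module gives a coarser but dependency-free view of allocations. The sketch below uses that alternative technique rather than `memory_profiler` itself, and `build_rows` is a hypothetical workload.

```python
import tracemalloc


def build_rows(n):
    """Hypothetical workload: allocate a list of n small lists."""
    return [[float(i), float(i) * 2.0] for i in range(n)]


tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
rows = build_rows(100_000)
after, peak = tracemalloc.get_traced_memory()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

print(f"allocated: {(after - before) / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

# Show the three source lines responsible for the most allocated memory
for stat in snapshot.statistics('lineno')[:3]:
    print(stat)
```

`tracemalloc` only sees memory allocated through Python's allocator, so buffers managed by C extensions may not be fully reflected in these numbers.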

---

## 48.2 Code Optimisation

Once you know where the bottlenecks are, you can optimise the code itself. Often, simple changes yield significant speedups.

### 48.2.1 Use Vectorised Operations

Pandas and NumPy are built on vectorised operations that run in C, which is much faster than Python loops.

**Inefficient (Python loop):**

```python
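import numpy as np

def compute_sma_loop(df, window):
    sma = []
    for i in range(len(df)):
        if i < window - 1:
            sma.append(np.nan)  # Not enough history yet
        else:
            # Trailing window that includes row i, matching rolling(window)
            sma.append(df['Close'].iloc[i-window+1:i+1].mean())
    return sma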
```

**Efficient (vectorised):**

```python
def compute_sma_vectorised(df, window):
    return df['Close'].rolling(window).mean()
```

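On large series the vectorised version is often orders of magnitude faster, although the exact speedup depends on the window size and data volume, so measure it on your own data.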
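
One way to check both correctness and speed is to confirm that the two approaches agree and then time them with `timeit`. The sketch below uses synthetic prices and its own helper names (`sma_loop`, `sma_vectorised`), which are illustrative; the timings it prints will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Close': rng.normal(500.0, 20.0, size=5_000)})  # Synthetic prices
window = 20


def sma_loop(frame, window):
    # Trailing window that includes the current row, matching rolling()
    values = frame['Close'].to_numpy()
    out = np.full(len(values), np.nan)
    for i in range(window - 1, len(values)):
        out[i] = values[i - window + 1:i + 1].mean()
    return out


def sma_vectorised(frame, window):
    return frame['Close'].rolling(window).mean().to_numpy()


# Both approaches should agree before their speed is compared
same = np.allclose(sma_loop(df, window), sma_vectorised(df, window), equal_nan=True)

loop_time = timeit.timeit(lambda: sma_loop(df, window), number=3)
vec_time = timeit.timeit(lambda: sma_vectorised(df, window), number=3)
print(f"results match: {same}, loop: {loop_time:.4f}s, vectorised: {vec_time:.4f}s")
```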

### 48.2.2 Avoid Common Pandas Pitfalls

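- **Use `loc` and `iloc` appropriately**: Chained indexing like `df[df['a'] > 0]['b']` performs two separate indexing steps and may operate on a temporary copy, so assignments made through it can be silently lost. Use `df.loc[df['a'] > 0, 'b']` instead.
- **Minimise `apply` with custom functions**: `apply` is flexible but slow because it calls a Python function once per row or column. Prefer built-in vectorised functions. If you must use `apply`, consider passing `raw=True` so the function receives NumPy arrays rather than Series objects.
- **Set data types**: Loading a CSV with default types may use more memory than necessary. Specify `dtype` to use smaller types (e.g., `float32` instead of `float64`), provided the reduced precision and range are acceptable for your data.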

```python
dtypes = {
    'Open': 'float32',
    'High': 'float32',
    'Low': 'float32',
    'Close': 'float32',
    'Volume': 'int32'
}
df = pd.read_csv('nepse.csv', dtype=dtypes)
```

### 48.2.3 Use Efficient Data Structures

For some operations, converting a DataFrame to a NumPy array can be faster because NumPy has less overhead.

```python
# Instead of operating on the Series (df['Close'] * 2), work on the underlying array
arr = df['Close'].to_numpy()
result = arr * 2
```

---

## 48.3 Algorithmic Optimisation

Sometimes the algorithm itself is the problem. A quadratic algorithm on a large dataset will never be fast enough.

### 48.3.1 Reduce Complexity

For example, computing rolling statistics from scratch for each window is O(n·k) if done naively. Pandas' rolling implementation is optimised, but if you implement your own, use an incremental update.

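**Naive rolling sum:**

```python
import numpy as np

def rolling_sum_naive(arr, window):
    # Trailing window that includes the current element (pandas convention)
    result = []
    for i in range(len(arr)):
        if i < window - 1:
            result.append(np.nan)
        else:
            result.append(np.sum(arr[i-window+1:i+1]))
    return result
```

**Incremental rolling sum:**

```python
def rolling_sum_fast(arr, window):
    # Prepend 0 so that cumsum[j] is the sum of the first j elements
    cumsum = np.concatenate(([0.0], np.cumsum(arr)))
    result = [np.nan] * min(window - 1, len(arr))
    for i in range(window - 1, len(arr)):
        result.append(cumsum[i+1] - cumsum[i+1-window])
    return result
```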

The incremental version is O(n) instead of O(n·k).
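
The cumulative-sum approach needs the whole array up front. When prices arrive one at a time, the same linear cost can be reached by adding the newest value and subtracting the one that leaves the window. A minimal sketch using only the standard library follows; the class name `RollingSum` is illustrative.

```python
from collections import deque


class RollingSum:
    """Maintain the sum of the last `window` values in O(1) per update."""

    def __init__(self, window):
        self.window = window
        self.values = deque()
        self.total = 0.0

    def update(self, value):
        """Add a value; return the window sum, or None until the window is full."""
        self.values.append(value)
        self.total += value
        if len(self.values) > self.window:
            self.total -= self.values.popleft()  # Drop the value that left the window
        return self.total if len(self.values) == self.window else None


roller = RollingSum(window=3)
sums = [roller.update(price) for price in [10.0, 11.0, 12.0, 13.0, 14.0]]
print(sums)  # [None, None, 33.0, 36.0, 39.0]
```

For very long streams, floating-point error can accumulate in `total`. Recomputing the sum from the stored values at intervals is one way to bound it.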

### 48.3.2 Use Appropriate Libraries

For specialised tasks like technical indicators, use libraries that are already optimised.

- **TA‑Lib** (Technical Analysis Library) provides C‑optimised functions for RSI, MACD, etc.
- **NumPy** and **SciPy** for linear algebra and statistical operations.
- **Numba** for just‑in‑time compilation of Python loops.

**Example with Numba:**

```python
from numba import jit
import numpy as np

@jit(nopython=True)
def fast_rsi(prices, period=14):
    deltas = np.diff(prices)
    gain = np.where(deltas > 0, deltas, 0.0)
    loss = np.where(deltas < 0, -deltas, 0.0)
    avg_gain = np.zeros_like(prices)
    avg_loss = np.zeros_like(prices)
    avg_gain[period] = np.mean(gain[:period])
    avg_loss[period] = np.mean(loss[:period])
    for i in range(period+1, len(prices)):
        avg_gain[i] = (avg_gain[i-1] * (period-1) + gain[i-1]) / period
        avg_loss[i] = (avg_loss[i-1] * (period-1) + loss[i-1]) / period
    rs = avg_gain / avg_loss
    rsi = 100 - 100 / (1 + rs)
    return rsi
```

Numba compiles this loop to machine code, often achieving speeds comparable to C.

---

## 48.4 Parallel Processing

Modern CPUs have multiple cores. Parallel processing allows you to use them all.

### 48.4.1 Multiprocessing

The `multiprocessing` module is ideal for CPU‑bound tasks that can run independently, such as computing features for many stocks in parallel.

**Example: Process multiple symbols in parallel**

```python
import multiprocessing as mp
import pandas as pd

def process_symbol(symbol):
    """Load data for a single symbol and compute features."""
    df = pd.read_csv(f'data/{symbol}.csv')
    df['SMA_20'] = df['Close'].rolling(20).mean()
    df['RSI'] = compute_rsi(df)  # compute_rsi from Section 48.1.2 takes the DataFrame
    return df

if __name__ == '__main__':
    symbols = ['NABIL', 'NTC', 'SBI', 'HRL', 'NICA']
    with mp.Pool(processes=4) as pool:
        results = pool.map(process_symbol, symbols)
    # results is a list of DataFrames, one per symbol
```

**Explanation:**  
`Pool.map` distributes the list of symbols across worker processes. Each worker runs `process_symbol` independently. This can speed up batch feature engineering significantly.

### 48.4.2 Multithreading

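Python's Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode simultaneously in the standard CPython interpreter. Multithreading is therefore mainly beneficial for I/O-bound tasks (e.g., waiting for network responses, reading files), where threads spend most of their time waiting and release the GIL while they do.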

**Example: Fetching data from an API concurrently**

```python
import concurrent.futures
import requests

def fetch_symbol_data(symbol):
    url = f"https://api.nepse.com/stock/{symbol}"
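    response = requests.get(url, timeout=10)  # Avoid hanging forever on a stalled connection
    response.raise_for_status()               # Surface HTTP errors instead of parsing an error body
    return response.json()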

symbols = ['NABIL', 'NTC', 'SBI']
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch_symbol_data, symbols))
```

While one thread waits for the network, others can run.
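
The overlap can be observed without any network access by simulating the wait. In the sketch below, `fake_fetch` is a hypothetical stand-in for an API call and the 0.1 second delay is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(symbol):
    """Stand-in for a network call: sleeping releases the GIL, as waiting on a socket does."""
    time.sleep(0.1)
    return symbol, len(symbol)


symbols = ['NABIL', 'NTC', 'SBI', 'HRL', 'NICA']

start = time.perf_counter()
sequential = [fake_fetch(s) for s in symbols]
sequential_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    threaded = list(executor.map(fake_fetch, symbols))  # Results keep the input order
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The sequential run waits for each call in turn, while the threaded run waits for all five at once, so its elapsed time should be close to a single delay.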

### 48.4.3 Asynchronous I/O

For even higher concurrency, use `asyncio` with an HTTP client like `aiohttp`.

```python
import aiohttp
import asyncio

async def fetch_symbol(session, symbol):
    url = f"https://api.nepse.com/stock/{symbol}"
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(symbols):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_symbol(session, sym) for sym in symbols]
        return await asyncio.gather(*tasks)

symbols = ['NABIL', 'NTC', 'SBI']
results = asyncio.run(fetch_all(symbols))  # Creates and closes the event loop for us
```

Asyncio can handle thousands of concurrent connections efficiently.
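
In practice it is usually wise to cap how many requests are in flight so that the upstream API is not overwhelmed. A minimal sketch of that idea with `asyncio.Semaphore` follows. The request is simulated with `asyncio.sleep`, and `fake_fetch`, `tracker`, and the limit of 2 are illustrative assumptions.

```python
import asyncio


async def fake_fetch(symbol, limiter, tracker):
    """Simulated request; `limiter` caps how many run at once."""
    async with limiter:
        tracker['active'] += 1
        tracker['peak'] = max(tracker['peak'], tracker['active'])
        await asyncio.sleep(0.01)  # Stand-in for awaiting a network response
        tracker['active'] -= 1
        return f"{symbol}:ok"


async def fetch_all(symbols, max_concurrent=2):
    limiter = asyncio.Semaphore(max_concurrent)
    tracker = {'active': 0, 'peak': 0}  # Records the highest observed concurrency
    results = await asyncio.gather(*(fake_fetch(s, limiter, tracker) for s in symbols))
    return results, tracker['peak']


symbols = ['NABIL', 'NTC', 'SBI', 'HRL', 'NICA']
results, peak = asyncio.run(fetch_all(symbols))
print(results, peak)
```

`asyncio.gather` returns results in the order the coroutines were passed in, regardless of which one finishes first.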

---

## 48.5 Distributed Computing

When a single machine is not enough—either because the data is too large or the computation too heavy—you need distributed computing.

### 48.5.1 Dask

Dask provides parallel and distributed computing with a familiar pandas/NumPy interface. It can scale from a single machine to a cluster.

**Example: Parallel rolling window computation with Dask**

```python
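import dask.dataframe as dd

def add_sma(pdf):
    # pdf is an ordinary pandas DataFrame holding one partition
    pdf = pdf.sort_values(['Symbol', 'Date'])
    pdf['SMA_20'] = pdf.groupby('Symbol')['Close'].transform(
        lambda s: s.rolling(20).mean()
    )
    return pdf

# blocksize=None keeps each file in a single partition; this assumes that
# all rows for a given symbol live in the same file
df = dd.read_csv('data/nepse_*.csv', parse_dates=['Date'], blocksize=None)

# Compute the rolling mean partition by partition (still lazy at this point)
df = df.map_partitions(add_sma)

# Trigger computation
result = df.compute()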
```

**Explanation:**  
Dask splits the data into partitions and processes them in parallel. Operations are lazy; you build a computation graph, then call `.compute()` to execute. Dask can also scale to a cluster of machines.

### 48.5.2 Ray

Ray is a general‑purpose distributed execution framework. It's great for parallelising Python functions across a cluster.

**Example: Distributed feature engineering with Ray**

```python
import ray
import pandas as pd

ray.init()

@ray.remote
def process_symbol(symbol):
    df = pd.read_csv(f'data/{symbol}.csv')
    df['SMA_20'] = df['Close'].rolling(20).mean()
    df['RSI'] = compute_rsi(df)  # compute_rsi from Section 48.1.2 takes the DataFrame
    return df

symbols = ['NABIL', 'NTC', 'SBI', 'HRL', 'NICA']
futures = [process_symbol.remote(sym) for sym in symbols]
results = ray.get(futures)
```

Ray handles task scheduling, data passing, and fault tolerance.

### 48.5.3 Apache Spark

Spark is a widely adopted engine for large-scale data processing and is particularly suited to batch pipelines.

```python
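from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, window

spark = SparkSession.builder.appName("NEPSE").getOrCreate()

# inferSchema makes Close numeric instead of leaving every column as a string
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("data/nepse_all.csv")
)
# window() expects a timestamp column
df = df.withColumn("Date", col("Date").cast("timestamp"))

# Compute daily average price per symbol
result = df.groupBy("Symbol", window("Date", "1 day")).agg(avg("Close").alias("avg_close"))
result.show()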
```

Spark's Catalyst optimiser and Tungsten execution engine make it very efficient for large datasets.

---

## 48.6 Caching Strategies

Caching avoids recomputing expensive results. In a prediction system, many intermediate values can be cached.

### 48.6.1 In‑Memory Caching with Redis

Redis is an in‑memory key‑value store that can cache feature vectors, model predictions, or even pre‑computed technical indicators.

**Example: Cache rolling averages per symbol**

```python
import redis
import pickle
import pandas as pd

r = redis.Redis(host='localhost', port=6379, db=0)

def get_sma(symbol, window, force_recompute=False):
    cache_key = f"sma:{symbol}:{window}"
    if not force_recompute:
        cached = r.get(cache_key)
        if cached is not None:
            # Only unpickle data that comes from a Redis instance you control
            return pickle.loads(cached)
    # Compute if not cached
    df = pd.read_csv(f'data/{symbol}.csv')
    sma = df['Close'].rolling(window).mean().tolist()
    r.setex(cache_key, 3600, pickle.dumps(sma))  # Cache for 1 hour
    return sma
```

**Explanation:**  
Before computing the SMA, we check Redis. If present, we return the cached result. If not, we compute and store it with an expiration time. This is especially useful for features that are expensive to compute and change infrequently.
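
The same check-then-compute pattern can be tried without a Redis server. The sketch below swaps in a tiny in-process dictionary with expiry times as a stand-in for Redis. `TTLCache`, `expensive_sma`, and the `stats` counter are all illustrative names, not part of any library.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # Expired: behave as a cache miss
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, self._clock() + ttl_seconds)


cache = TTLCache()
stats = {'computed': 0}  # Counts how often the expensive path runs


def expensive_sma(prices, window):
    stats['computed'] += 1
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]


def get_sma(symbol, prices, window):
    key = f"sma:{symbol}:{window}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = expensive_sma(prices, window)
    cache.setex(key, 3600, result)  # Keep for one hour
    return result


first = get_sma('NABIL', [10.0, 12.0, 14.0, 16.0], 2)
second = get_sma('NABIL', [10.0, 12.0, 14.0, 16.0], 2)  # Served from the cache
print(first, stats['computed'])
```

Because the key contains only the symbol and window, a cached value stays in use until it expires even if new prices have arrived. Choosing the expiry time is therefore a trade-off between freshness and saved computation.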

### 48.6.2 Application‑Level Caching with `functools.lru_cache`

For pure functions, Python's built‑in `lru_cache` can memoise results.

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=128)
def compute_rsi(series_tuple, period=14):
    # series_tuple is a tuple of prices (since lists are not hashable)
    prices = np.array(series_tuple)
    # ... RSI computation ...
    return rsi

# Usage
prices_tuple = tuple(df['Close'].values)
rsi = compute_rsi(prices_tuple)
```

**Explanation:**  
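`lru_cache` stores the results of function calls. If the same arguments are passed again, the cached result is returned. This is useful for functions called repeatedly with the same inputs. Note that converting a long series to a tuple and hashing it takes time proportional to its length, so the cache only pays off when the computation itself is considerably more expensive than that.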
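
To see the cache at work, the following self-contained sketch memoises a small pure function and inspects its hit and miss counters. The function `simple_returns` is illustrative and stands in for any deterministic indicator.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def simple_returns(prices, period=1):
    """Fractional change over `period` steps; `prices` must be a hashable tuple."""
    return tuple(
        (prices[i] - prices[i - period]) / prices[i - period]
        for i in range(period, len(prices))
    )


prices = (100.0, 110.0, 99.0)
first = simple_returns(prices)
second = simple_returns(prices)  # Same arguments, so this is served from the cache
info = simple_returns.cache_info()
print(first, info.hits, info.misses)
```

The function returns a tuple rather than a list on purpose: every caller receives the same cached object, so a mutable result could be changed by one caller and corrupt the value seen by the next.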

### 48.6.3 Database Caching with Materialised Views

If you use a database like PostgreSQL, you can create materialised views that store pre‑computed aggregates and refresh them periodically.

```sql
CREATE MATERIALIZED VIEW sma_20 AS
SELECT symbol, date, AVG(close) OVER (PARTITION BY symbol ORDER BY date ROWS 19 PRECEDING) AS sma_20
FROM prices;

REFRESH MATERIALIZED VIEW sma_20;  -- Run daily
```

---

## 48.7 Memory Optimisation

Large time‑series datasets can easily exceed available RAM. Optimising memory usage allows you to work with more data on the same hardware.

### 48.7.1 Downcast Numeric Types

Use the smallest possible data type for each column.

```python
def optimise_floats(df):
    floats = df.select_dtypes(include=['float64']).columns
    for col in floats:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df

def optimise_ints(df):
    ints = df.select_dtypes(include=['int64']).columns
    for col in ints:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df
```

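Downcasting `float64` to `float32` halves the memory of those columns, and integer columns can shrink even further when their values fit in `int16` or `int8`. Keep in mind that `float32` carries only about seven significant decimal digits.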
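
The effect is easy to verify on a small synthetic frame by comparing per-column memory before and after downcasting. The column values below are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Close': np.linspace(100.0, 900.0, 1_000),        # float64 by default
    'Volume': np.arange(1_000, dtype='int64') * 100,  # int64, maximum 99,900
})

before = df.memory_usage(index=False)
df['Close'] = pd.to_numeric(df['Close'], downcast='float')
df['Volume'] = pd.to_numeric(df['Volume'], downcast='integer')
after = df.memory_usage(index=False)

print(df.dtypes)
print(before.to_dict(), after.to_dict())
```

Here `Volume` lands on `int32` rather than `int16` because its largest value does not fit in 16 bits; pandas picks the smallest type that can hold the data.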

### 48.7.2 Use Categoricals for Low‑Cardinality Columns

Columns like `Symbol` have many repeated values. Convert them to categorical.

```python
df['Symbol'] = df['Symbol'].astype('category')
```

This stores the unique symbols once and uses integer indices, saving memory and speeding up groupby operations.
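
A quick measurement on synthetic data shows the size of the saving. The exact byte counts depend on the pandas version and string storage, so the sketch prints them rather than assuming particular values.

```python
import pandas as pd

symbols = ['NABIL', 'NTC', 'SBI', 'HRL', 'NICA']
df = pd.DataFrame({'Symbol': symbols * 20_000})  # 100,000 rows, 5 distinct values

as_text = df['Symbol'].memory_usage(index=False, deep=True)
as_category = df['Symbol'].astype('category').memory_usage(index=False, deep=True)

print(f"text: {as_text:,} bytes, category: {as_category:,} bytes")
```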

### 48.7.3 Chunking

When you cannot fit the entire dataset in memory, process it in chunks.

```python
chunk_size = 10000
reader = pd.read_csv('nepse_all.csv', chunksize=chunk_size)
for chunk in reader:
    process_chunk(chunk)
```

Each chunk is processed independently, and results can be aggregated.
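
Aggregating across chunks needs some care: a mean of per-chunk means is generally wrong when chunks differ in size, so it is safer to carry running sums and counts. The sketch below illustrates this with a small in-memory CSV standing in for a large file; the data values are made up.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a file too large to load at once
csv_text = "Symbol,Close\nNABIL,500\nNTC,900\nNABIL,510\nNTC,910\nNABIL,520\n"

totals = {}  # symbol -> [sum of closes, row count]
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grouped = chunk.groupby('Symbol')['Close'].agg(['sum', 'count'])
    for symbol, row in grouped.iterrows():
        running = totals.setdefault(symbol, [0.0, 0])
        running[0] += float(row['sum'])
        running[1] += int(row['count'])

mean_close = {symbol: s / n for symbol, (s, n) in totals.items()}
print(mean_close)  # {'NABIL': 510.0, 'NTC': 905.0}
```

Rolling features are harder to chunk than simple aggregates, because a window can straddle a chunk boundary. One common remedy is to carry the last `window - 1` rows of each chunk over into the next one.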

### 48.7.4 Out‑of‑Core Computation with Dask or Vaex

Libraries like Dask and Vaex are designed for datasets larger than memory. They operate lazily and only load data as needed.

---

## 48.8 GPU Acceleration

For deep learning models and large matrix operations, GPUs can provide massive speedups. Frameworks like TensorFlow, PyTorch, and RAPIDS (cuDF, cuML) leverage GPUs.

### 48.8.1 Using cuDF for GPU‑Accelerated DataFrames

RAPIDS cuDF provides a pandas‑like interface that runs on NVIDIA GPUs.

```python
import cudf

# Read CSV directly into GPU memory
df = cudf.read_csv('nepse_all.csv')

# Rolling operations run on GPU
df['SMA_20'] = df['Close'].rolling(20).mean()

# Convert back to pandas if needed
pandas_df = df.to_pandas()
```

**Explanation:**  
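cuDF uses the GPU's massive parallelism to speed up operations. For large datasets, this can be 10-100x faster than pandas, depending on the operation and hardware. For small datasets, the cost of moving data to and from GPU memory can outweigh the gain.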

### 48.8.2 GPU‑Accelerated Machine Learning

Use cuML for GPU‑accelerated scikit‑learn style models.

```python
from cuml.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)  # X_train and y_train can be cuDF DataFrames
```

### 48.8.3 Deep Learning with PyTorch/TensorFlow

For neural networks, GPUs are essential. Ensure your data loading pipeline does not become a bottleneck.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

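# Use the GPU when one is available; fall back to the CPU otherwise
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# float32 matches the default dtype of PyTorch layers (use torch.long for class labels);
# this approach assumes the whole dataset fits in device memory
X_tensor = torch.tensor(X.values, dtype=torch.float32, device=device)
y_tensor = torch.tensor(y.values, dtype=torch.float32, device=device)
dataset = TensorDataset(X_tensor, y_tensor)
# Keep the default num_workers=0, since the tensors already live on the device
dataloader = DataLoader(dataset, batch_size=1024, shuffle=True)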
```

---

## 48.9 Scaling Prediction Services

The prediction API itself must scale to handle increasing request rates.

### 48.9.1 Horizontal Scaling

Run multiple copies of your prediction service behind a load balancer.

**With Kubernetes:**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nepse-predictor
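spec:
  replicas: 5  # Run 5 pods
  selector:
    matchLabels:
      app: nepse-predictor  # Must match the pod template labels below
  template:
    metadata:
      labels:
        app: nepse-predictor  # Also matched by the Service selector
    spec:
      containers:
      - name: predictor
        image: nepse-predictor:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: "500m"  # Illustrative value; utilisation-based autoscaling needs a CPU request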
---
apiVersion: v1
kind: Service
metadata:
  name: nepse-predictor
spec:
  selector:
    app: nepse-predictor
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
```

The load balancer distributes requests among the pods.

### 48.9.2 Vertical Scaling

Increase the resources (CPU, memory) of the machine running the service. This is simpler but has limits.

### 48.9.3 Auto‑scaling

In Kubernetes, you can use the Horizontal Pod Autoscaler to automatically adjust the number of replicas based on CPU utilisation or custom metrics.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nepse-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nepse-predictor
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

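When average CPU utilisation across the pods, measured relative to each pod's CPU request, exceeds 70%, Kubernetes adds pods, up to a maximum of 10. When load falls, it scales back down, but never below 2 replicas.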
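
The autoscaler's core rule is a simple ratio: the desired replica count is the current count multiplied by the ratio of the observed metric to the target, rounded up. The Python sketch below reproduces only that arithmetic for the configuration above. It deliberately ignores the tolerance band and stabilisation windows that the real controller also applies, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, current_utilisation, target_utilisation,
                     min_replicas=2, max_replicas=10):
    """Core HPA scaling rule, ignoring tolerance and stabilisation windows."""
    raw = math.ceil(current_replicas * current_utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(5, 90, 70))   # 7
print(desired_replicas(5, 35, 70))   # 3
print(desired_replicas(8, 140, 70))  # 10 (capped by max_replicas)
```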

### 48.9.4 Optimising Model Serving

- **Batch inference**: If your model supports it, process multiple requests together to amortise overhead.
- **Model quantisation**: Use reduced precision (e.g., float16 or int8) to speed up inference and reduce memory, and check that accuracy remains acceptable afterwards.
- **Hardware acceleration**: Use specialised hardware like GPUs or TPUs for deep learning, or Intel's oneDNN for CPU optimisation.

---

## 48.10 Cloud Scaling

Cloud providers offer virtually unlimited scalability, but it comes at a cost. You must balance performance with expenditure.

### 48.10.1 Choosing Instance Types

- **Compute‑optimised** (e.g., AWS C5 family) for CPU‑bound prediction services.
- **Memory‑optimised** (e.g., AWS R5 family) for large feature stores.
- **GPU instances** (e.g., AWS P3, P4) for deep learning training.

### 48.10.2 Spot/Preemptible Instances

For non-critical batch jobs, use spot instances (AWS) or preemptible VMs (GCP) at a fraction of the cost. The provider can reclaim them with little warning, so design your pipeline to checkpoint its progress and resume after an interruption.
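
One simple way to make a batch job resilient is to record which symbols have been processed and skip them on restart. The sketch below is a minimal illustration under stated assumptions: `run_batch`, the JSON checkpoint format, and the `stop_after` parameter (which simulates an interruption) are all hypothetical.

```python
import json
import os
import tempfile


def run_batch(symbols, checkpoint_path, process, stop_after=None):
    """Process symbols, recording progress so an interrupted run can resume."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    processed_now = []
    for symbol in symbols:
        if symbol in done:
            continue  # Finished in an earlier run
        if stop_after is not None and len(processed_now) >= stop_after:
            break     # Simulates the instance being reclaimed
        process(symbol)
        done.append(symbol)
        processed_now.append(symbol)
        # Write to a temp file and rename so the checkpoint is never half-written
        tmp_path = checkpoint_path + '.tmp'
        with open(tmp_path, 'w') as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return processed_now


symbols = ['NABIL', 'NTC', 'SBI', 'HRL', 'NICA']
with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, 'progress.json')
    first_run = run_batch(symbols, checkpoint, process=lambda s: None, stop_after=2)
    second_run = run_batch(symbols, checkpoint, process=lambda s: None)
print(first_run, second_run)
```

On a real spot instance the checkpoint would need to live on storage that survives the instance, such as an object store or a network volume, rather than on the local disk used here.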

### 48.10.3 Serverless Options

For sporadic workloads, serverless (AWS Lambda, Google Cloud Functions) can be cost‑effective. However, cold starts and time limits may be problematic.

### 48.10.4 Managed Services

Consider managed services for parts of your stack:

- **AWS SageMaker** for training and deployment.
- **Google Vertex AI** for end‑to‑end ML.
- **Azure Machine Learning** for similar capabilities.

These services handle scaling, monitoring, and maintenance for you, but they can be more expensive than self‑managed solutions.

---

## Chapter Summary

In this chapter, we explored the vast landscape of scalability and performance optimisation for time‑series prediction systems, using the NEPSE stock predictor as a concrete example. We covered:

- Profiling to identify bottlenecks with `cProfile`, `line_profiler`, and memory profilers.
- Code optimisation techniques such as vectorisation, efficient pandas usage, and data type tuning.
- Algorithmic improvements to reduce complexity and leverage specialised libraries like Numba and TA‑Lib.
- Parallel processing with `multiprocessing` for CPU‑bound tasks and multithreading/async for I/O‑bound tasks.
- Distributed computing frameworks (Dask, Ray, Spark) for scaling beyond a single machine.
- Caching strategies at multiple levels (Redis, `lru_cache`, materialised views) to avoid redundant work.
- Memory optimisation through downcasting, categoricals, and chunking.
- GPU acceleration with RAPIDS, cuML, and deep learning frameworks.
- Scaling prediction services horizontally, vertically, and with auto‑scaling in Kubernetes.
- Cloud considerations, including instance selection, spot instances, and managed services.

By applying these techniques, you can ensure that your NEPSE prediction system remains responsive and cost‑effective as it grows. In the next chapter, we will discuss **Security and Compliance**, ensuring that your system protects sensitive financial data and meets regulatory requirements.

---

**End of Chapter 48**