# Chapter 91: Performance Optimization

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Identify performance bottlenecks in time‑series prediction systems using profiling tools.
- Apply code‑level optimizations (vectorization, data structures, algorithmic improvements) to speed up computation.
- Leverage parallel processing (multiprocessing, multithreading, async I/O) for CPU‑bound and I/O‑bound tasks.
- Use caching strategies to reduce redundant computations.
- Optimize memory usage, especially when working with large time‑series datasets.
- Scale computations using distributed frameworks like Dask and Ray.
- Accelerate model training and inference with GPUs.
- Design for low‑latency predictions in real‑time systems.
- Implement performance monitoring and continuous optimization.

---

## **91.1 Introduction to Performance Optimization**

Performance optimization is the process of making a system faster, more efficient, and more scalable. In the context of the NEPSE stock prediction system, performance considerations arise at multiple levels:

- **Data ingestion**: Loading and validating large CSV files daily.
- **Feature engineering**: Computing rolling statistics and technical indicators for thousands of stocks.
- **Model training**: Training complex models (e.g., XGBoost, LSTM) on years of historical data.
- **Prediction serving**: Responding to API requests with low latency.
- **Batch predictions**: Generating predictions for many symbols efficiently.

Optimization is a trade‑off: it often increases code complexity or reduces readability. Therefore, it should be guided by measurement: **profile first, optimize second**. This chapter will equip you with the tools and techniques to identify and address performance bottlenecks in your time‑series prediction system.

---

## **91.2 Identifying Performance Bottlenecks**

Before optimizing, you must know where the time is spent. Profiling tools measure execution time and resource usage so that you can target the code that actually dominates the runtime.

### **91.2.1 Simple Timing with `time`**

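For quick checks, wrap the code in calls to `time.perf_counter()`. Prefer it over `time.time()`: it is a monotonic, high‑resolution clock intended for measuring intervals, whereas `time.time()` follows the system clock and can jump when that clock is adjusted.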

```python
import time

start = time.perf_counter()
result = expensive_function()
elapsed = time.perf_counter() - start
print(f"Function took {elapsed:.2f} seconds")
```
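
A single measurement can be noisy, since other processes and caches affect the result. As a minimal sketch (the helper `time_call` and the workload `sum_of_squares` are hypothetical names used only for illustration), the standard library's `timeit` can repeat the measurement, and taking the minimum gives a more stable figure:

```python
import timeit

def sum_of_squares(n):
    # Hypothetical stand-in for the code under test
    return sum(i * i for i in range(n))

def time_call(func, *args, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(*args))
    # repeat() returns one total per repeat; each total covers `number` calls
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number

best = time_call(sum_of_squares, 10_000)
print(f"Best of 5 repeats: {best * 1e3:.3f} ms per call")
```

The minimum is used rather than the mean because slower runs usually reflect interference from the rest of the system, not the code itself.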

### **91.2.2 Profiling with `cProfile`**

Python's built‑in `cProfile` provides a detailed breakdown of function calls.

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Run your code
run_feature_engineering()

profiler.disable()
stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats(20)  # print top 20 by cumulative time
```

For a more visual output, use `snakeviz`:

```bash
python -m cProfile -o output.prof my_script.py
snakeviz output.prof
```

### **91.2.3 Line Profiler**

`line_profiler` gives per‑line timing, useful for understanding bottlenecks within a function.

```python
# pip install line_profiler
# `profile` is injected by kernprof at run time; no import is needed
@profile
def slow_function():
    ...  # code to profile line by line
```

Run with `kernprof -l -v my_script.py`.

### **91.2.4 Memory Profiling**

Use `memory_profiler` to track memory usage.

```python
# pip install memory_profiler
from memory_profiler import profile

@profile
def memory_intensive_function():
    ...  # code whose per-line memory usage you want to see
```

Alternatively, record a script's memory usage over time with `mprof run my_script.py` and plot the result with `mprof plot`.
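
If installing a third‑party profiler is not an option, the standard library's `tracemalloc` offers a comparable view. This swaps in a different tool from the `memory_profiler` shown above; the workload `build_price_table` is a hypothetical example:

```python
import tracemalloc

def build_price_table(n_rows):
    # Hypothetical workload: a list of per-row dictionaries (memory-hungry)
    return [{"close": float(i), "volume": i} for i in range(n_rows)]

tracemalloc.start()
table = build_price_table(50_000)
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
snapshot = tracemalloc.take_snapshot()           # must be taken while tracing
tracemalloc.stop()

print(f"Current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
# Show the three source lines responsible for the most allocated memory
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```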

### **91.2.5 Application Performance Monitoring (APM)**

For production systems, use APM tools like **New Relic**, **Datadog**, or **Prometheus** with **Grafana** to monitor latency, throughput, and resource usage over time.

---

## **91.3 Code‑Level Optimizations**

Once you've identified bottlenecks, apply these common optimizations.

### **91.3.1 Use Built‑in Functions and Libraries**

In CPython, built‑in functions such as `sum`, `min`, and `sorted` are implemented in C and usually beat an equivalent hand‑written loop. Use them instead of manual loops.

```python
# Slow
total = 0
for x in data:
    total += x

# Fast
total = sum(data)
```

For numerical computations, use **NumPy** and **pandas**, which are vectorized.

```python
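import numpy as np

prices = np.asarray(prices, dtype=float)  # plain array: slices pair up by position

# Slow (Python loop)
returns = [(prices[i] - prices[i-1]) / prices[i-1] for i in range(1, len(prices))]

# Fast (vectorized); for a pandas Series use prices.pct_change() instead,
# because Series arithmetic aligns on index labels, not positions
returns = (prices[1:] - prices[:-1]) / prices[:-1]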
```
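
As a quick sanity check, the sketch below runs both versions on a small array of illustrative prices and confirms that they agree. It also shows `np.diff`, which expresses the same difference more compactly:

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])  # illustrative values

# Loop version: one Python-level division per element
loop_returns = [(prices[i] - prices[i - 1]) / prices[i - 1]
                for i in range(1, len(prices))]

# Vectorized versions: the arithmetic runs in compiled NumPy code
vec_returns = (prices[1:] - prices[:-1]) / prices[:-1]
diff_returns = np.diff(prices) / prices[:-1]

assert np.allclose(loop_returns, vec_returns)
assert np.allclose(vec_returns, diff_returns)
print(vec_returns)
```

Verifying equivalence on small inputs before switching to the fast version is a cheap way to avoid trading correctness for speed.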

### **91.3.2 Avoid Loops in pandas**

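Iterating over rows in pandas is slow: `iterrows()` builds a new Series for every row, and the loop body runs in the Python interpreter. Prefer vectorized column operations and built‑in aggregations (`agg`). `apply` with `axis=1` is itself a row‑by‑row Python loop, so use it sparingly.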

```python
# Slow
df['new_col'] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'new_col'] = row['A'] + row['B']

# Fast
df['new_col'] = df['A'] + df['B']
```

If you must loop, consider `itertuples()`, which yields lightweight namedtuples and is considerably faster than `iterrows()`.

### **91.3.3 Use Appropriate Data Structures**

- Use `set` for membership tests (average `O(1)` vs `O(n)` for lists).
- Use `deque` for fast appends/pops from both ends.
- Use `heapq` for priority queues.

For time‑series, consider storing data in **numpy arrays** or **pandas Series** rather than lists of dictionaries.
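
The three structures listed above can be sketched with the standard library alone. The symbols `NTC` and `SCB` and all numeric values are made up for illustration:

```python
from collections import deque
import heapq

# set: average O(1) membership test
watchlist = {"NABIL", "HRL", "NTC"}
is_watched = "HRL" in watchlist

# deque with maxlen: a fixed-size sliding window; old items drop off the left
window = deque(maxlen=3)
for price in [100, 101, 103, 102, 105]:
    window.append(price)
window_mean = sum(window) / len(window)  # mean of the last three prices

# heapq: retrieve the largest items without sorting the whole list
moves = [("NABIL", 1.2), ("HRL", -0.4), ("NTC", 2.5), ("SCB", 0.7)]
top_two = heapq.nlargest(2, moves, key=lambda pair: pair[1])

print(is_watched, list(window), top_two)
```

A `deque` with `maxlen` is a natural fit for streaming indicators, because appending a new price discards the oldest one in constant time.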

### **91.3.4 Lazy Evaluation**

Avoid computing things until needed. For example, use generator expressions instead of lists when you don't need all values at once.

```python
# Eager (creates full list)
results = [expensive(x) for x in huge_data]

# Lazy (computes on demand)
results = (expensive(x) for x in huge_data)
```
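
To see that a generator really defers the work, the following sketch records every call to a hypothetical `expensive` function and then consumes only the first five results:

```python
from itertools import islice

computed = []  # records every input that was actually processed

def expensive(x):
    # Hypothetical stand-in for a costly computation
    computed.append(x)
    return x * x

huge_data = range(1_000_000)

lazy_results = (expensive(x) for x in huge_data)  # nothing computed yet
first_five = list(islice(lazy_results, 5))        # computes only five values

print(first_five, len(computed))
```

Keep in mind that a generator can be consumed only once. If the results are needed several times, materialize them in a list or recompute them.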

### **91.3.5 Caching Expensive Computations**

If a function is called repeatedly with the same arguments, cache its results. Use `functools.lru_cache`.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def compute_rolling_features(symbol, date):
    # Arguments must be hashable (e.g. str, tuple), not lists or DataFrames
    result = ...  # expensive computation goes here
    return result
```

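`lru_cache` lives in the memory of a single process and is lost on restart. To share cached results across processes or machines, consider an external cache such as **Redis** (Chapter 81).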
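
To check whether a cache is actually paying off, inspect its hit and miss counters. In this sketch, `rolling_feature` and the date string are hypothetical placeholders for a real feature computation:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def rolling_feature(symbol, date):
    # Hypothetical stand-in for an expensive, deterministic computation
    return sum(ord(ch) for ch in symbol + date)

rolling_feature("NABIL", "2024-01-15")  # miss: computed
rolling_feature("NABIL", "2024-01-15")  # hit: served from the cache
rolling_feature("HRL", "2024-01-15")    # miss: different arguments

info = rolling_feature.cache_info()
print(info)

# Invalidate everything, e.g. after new market data has been ingested
rolling_feature.cache_clear()
```

A low hit rate suggests that the function is rarely called with repeated arguments, in which case the cache only costs memory.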

---

## **91.4 Parallel Processing**

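In CPython, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so threads do not speed up CPU‑bound code. `multiprocessing` sidesteps the GIL by running separate processes, each with its own interpreter.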

### **91.4.1 Multiprocessing**

```python
import multiprocessing as mp

def process_symbol(symbol):
    result = ...  # process one symbol's data
    return result

if __name__ == '__main__':  # guard needed where workers are spawned (Windows, macOS)
    symbols = ['NABIL', 'NEPSE', 'HRL', ...]
    with mp.Pool(processes=4) as pool:
        results = pool.map(process_symbol, symbols)
```

This is ideal for embarrassingly parallel tasks like per‑symbol feature engineering.

### **91.4.2 Multithreading for I/O‑Bound Tasks**

For I/O‑bound tasks (e.g., network requests, file reads), threading can improve performance despite the GIL.

```python
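import concurrent.futures

import requests

def fetch_data(url):
    # Illustrative timeout; without one, a stalled request blocks its worker thread
    return requests.get(url, timeout=10).json()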

urls = [...]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch_data, urls))
```

### **91.4.3 Asynchronous I/O**

For high‑concurrency I/O (e.g., a prediction API), use `asyncio` with an async web framework like FastAPI.

```python
import asyncio
import aiohttp

async def fetch_data(session, url):
    async with session.get(url) as response:
        return await response.json()

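async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_data(session, url) for url in urls]
        return await asyncio.gather(*tasks)  # requests run concurrently

# `urls` is a list of URL strings, as in Section 91.4.2
results = asyncio.run(main(urls))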
```

---

## **91.5 Optimizing Feature Engineering**

Feature engineering is often the most time‑consuming part of the pipeline.

### **91.5.1 Efficient Rolling Windows**

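The built‑in pandas rolling aggregations (`mean`, `sum`, `std`) already run in compiled code. Custom window functions passed to `rolling().apply()` run in the Python interpreter and are much slower; compiling them with `numba` can close most of that gap.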

```python
import numba
import numpy as np

@numba.jit(nopython=True)
def rolling_mean_numba(data, window):
    result = np.empty(len(data))
    for i in range(len(data)):
        start = max(0, i - window + 1)
        result[i] = np.mean(data[start:i+1])
    return result
```
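
Compilation is not the only lever. The loop above recomputes each window from scratch, which costs `O(n · window)`. An algorithmic alternative, sketched here with plain NumPy under the same window semantics (shorter windows at the start of the series), uses a cumulative sum to get each window total in constant time. The function name `rolling_mean_cumsum` is hypothetical:

```python
import numpy as np

def rolling_mean_cumsum(data, window):
    """Rolling mean in O(n); the first window-1 entries use a shorter window."""
    data = np.asarray(data, dtype=np.float64)
    # csum[k] holds the sum of the first k elements, with csum[0] = 0
    csum = np.concatenate(([0.0], np.cumsum(data)))
    end = np.arange(1, len(data) + 1)     # exclusive end of each window
    start = np.maximum(0, end - window)   # inclusive start of each window
    return (csum[end] - csum[start]) / (end - start)

prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(rolling_mean_cumsum(prices, 3))
```

On very long series, the running sum can accumulate floating‑point error, so compare the output against `pandas` rolling results before relying on it.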

### **91.5.2 Pre‑compute and Store Features**

Instead of recomputing features every time, compute them once and store in a feature store (Chapter 74). This is especially important for online predictions.

### **91.5.3 Use Efficient Data Types**

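Downcast numeric columns to save memory and speed up computations. Keep in mind that `float32` carries only about seven significant digits, so check that the reduced precision is acceptable for your features.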

```python
df['Close'] = pd.to_numeric(df['Close'], downcast='float')
df['Volume'] = pd.to_numeric(df['Volume'], downcast='integer')
```
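
To quantify the saving, compare `memory_usage` before and after downcasting. The DataFrame below is synthetic and exists only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Close": np.linspace(100.0, 200.0, 100_000),      # float64 by default
    "Volume": np.arange(100_000, dtype=np.int64),
})
before = df.memory_usage(deep=True).sum()

df["Close"] = pd.to_numeric(df["Close"], downcast="float")
# Values up to 99,999 exceed the int16 range, so int32 is the smallest fit
df["Volume"] = pd.to_numeric(df["Volume"], downcast="integer")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
```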

---

## **91.6 Optimizing Model Training**

### **91.6.1 Hardware Acceleration (GPUs)**

For deep learning models (LSTM, Transformers), use GPUs. TensorFlow places operations on an available GPU automatically; PyTorch requires you to move the model and its input tensors to the device explicitly.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
# Input tensors must live on the same device, e.g. batch = batch.to(device)
```

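XGBoost can also train on a GPU. From version 2.0 onward, select it with the `device` parameter; the older `tree_method='gpu_hist'` and `gpu_id` arguments are deprecated:

```python
import xgboost as xgb

model = xgb.XGBRegressor(tree_method='hist', device='cuda')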
```

### **91.6.2 Distributed Training**

For very large datasets or models, use distributed training with frameworks like Ray or Horovod (Chapter 85).

### **91.6.3 Hyperparameter Tuning Efficiency**

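Use **early stopping** to halt training once the validation metric stops improving; this shortens training and limits overfitting. Tuning libraries such as **Optuna** and **Hyperopt** typically explore the parameter space more efficiently than a grid search, and Optuna can additionally prune unpromising trials.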

```python
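import optuna
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

# X_train, y_train, X_val, y_val are assumed to come from a time-ordered split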

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
    }
    model = xgb.XGBRegressor(**params, n_estimators=1000, early_stopping_rounds=10)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return mean_absolute_error(y_val, model.predict(X_val))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
```

---

## **91.7 Optimizing Prediction Serving**

Low‑latency predictions are critical for real‑time systems.

### **91.7.1 Model Serialization and Loading**

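Use the library's native serialization format where one exists. For XGBoost, the native JSON format (or the binary UBJSON format, `.ubj`) stays loadable across XGBoost versions, whereas a pickled model is tied to the Python and XGBoost versions that produced it.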

```python
import xgboost as xgb

model.save_model('model.json')  # `model` is a trained XGBRegressor
model = xgb.XGBRegressor()
model.load_model('model.json')
```

### **91.7.2 Model Caching**

Deserializing a model on every request adds avoidable latency. In a web service, load the model once at startup and keep it in memory; if you serve multiple model versions, cache each loaded version.
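
One way to get load‑once behaviour is to memoize the loader itself. The sketch below uses a hypothetical `DummyModel` in place of a real deserialized model and counts how often the loader runs:

```python
from functools import lru_cache

load_count = {"n": 0}  # counts how many times a model is actually loaded

class DummyModel:
    # Hypothetical stand-in for a deserialized model
    def __init__(self, version):
        self.version = version

    def predict(self, rows):
        return [0.0 for _ in rows]

@lru_cache(maxsize=4)  # keep at most four model versions in memory
def get_model(version):
    load_count["n"] += 1  # in a real service, read the model file here
    return DummyModel(version)

get_model("v1").predict([[1.0, 2.0]])
get_model("v1").predict([[3.0, 4.0]])  # reuses the cached instance
get_model("v2")

print(load_count["n"])
```

The `maxsize` bound matters for large models: the least recently used version is evicted instead of letting memory grow without limit.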

### **91.7.3 Batching Predictions**

If you receive many prediction requests, batch them to amortize overhead.

```python
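from typing import List
import pandas as pd

# `app`, `model`, `PredictionRequest`, and `get_features` are defined elsewhere in the service
@app.post("/predict_batch")
def predict_batch(batch: List[PredictionRequest]):  # avoids shadowing the `requests` library
    features = [get_features(req) for req in batch]
    X = pd.DataFrame(features)
    predictions = model.predict(X)  # one vectorized call for the whole batch
    # Plain Python floats serialize cleanly to JSON; NumPy scalars may not
    return [{"prediction": float(p)} for p in predictions]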
```

### **91.7.4 Lightweight Alternatives**

If latency is critical, consider:

- Using a simpler model (e.g., linear regression instead of XGBoost).
- Quantizing neural networks (TensorFlow Lite, ONNX).
- Moving inference to the edge (Chapter 78).

### **91.7.5 Asynchronous Processing**

For predictions that do not need an immediate response, publish the requests to a message queue (e.g., Kafka) and let background workers process them asynchronously.
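
The pattern can be illustrated without a broker. The sketch below substitutes the standard library's in‑process `queue.Queue` and a worker thread for Kafka, so it shows only the producer/consumer decoupling, not durability or distribution. The `predict` function and the request IDs are hypothetical:

```python
import queue
import threading

request_queue = queue.Queue()
results = {}

def predict(features):
    # Hypothetical stand-in for model inference
    return sum(features) / len(features)

def worker():
    while True:
        item = request_queue.get()
        if item is None:  # sentinel value: shut the worker down
            request_queue.task_done()
            break
        request_id, features = item
        results[request_id] = predict(features)
        request_queue.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

# The producer returns immediately after enqueueing each request
request_queue.put(("req-1", [1.0, 2.0, 3.0]))
request_queue.put(("req-2", [10.0, 20.0]))
request_queue.put(None)

request_queue.join()  # block until every queued item has been processed
thread.join()
print(results)
```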

---

## **91.8 Memory Optimization**

Large time‑series datasets can consume significant memory.

### **91.8.1 Chunking**

Process data in chunks instead of loading everything at once.

```python
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    process_chunk(chunk)
```
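
Chunking works best when the per‑chunk results can be combined. As a sketch, the following computes a per‑symbol mean close by accumulating partial sums and counts. A small in‑memory CSV with made‑up prices stands in for the large file:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file (illustrative data)
csv_data = io.StringIO(
    "symbol,close\n"
    "NABIL,500\nNABIL,510\nHRL,300\nHRL,320\nNABIL,520\nHRL,310\n"
)

sums = pd.Series(dtype="float64")
counts = pd.Series(dtype="float64")
for chunk in pd.read_csv(csv_data, chunksize=2):
    grouped = chunk.groupby("symbol")["close"]
    # add() aligns on symbol; fill_value=0 handles symbols not seen before
    sums = sums.add(grouped.sum(), fill_value=0)
    counts = counts.add(grouped.count(), fill_value=0)

mean_close = sums / counts
print(mean_close)
```

Statistics that cannot be built from partial results, such as an exact median, need a different approach, for example a second pass or an approximate algorithm.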

### **91.8.2 Use Efficient File Formats**

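Parquet is a compressed, columnar format. Its files are typically much smaller than the equivalent CSV, column types are preserved instead of being re‑inferred on every load, and you can read only the columns you need.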

```python
df = pd.read_parquet('data.parquet')
```

### **91.8.3 Downcasting**

As described in Section 91.5.3, downcast numeric columns to the smallest type that holds their values.

### **91.8.4 Garbage Collection**

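Python's garbage collector normally needs no intervention. In long‑running processes, after releasing a large object (for example with `del df`), an explicit `gc.collect()` can reclaim memory held in reference cycles sooner.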

---

## **91.9 Scaling with Dask and Ray**

When a single machine is insufficient, use distributed frameworks.

### **91.9.1 Dask for DataFrame Operations**

Dask handles larger‑than‑memory datasets by splitting a DataFrame into partitions, each an ordinary pandas DataFrame, and evaluating operations lazily.

```python
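import dask.dataframe as dd

def add_daily_return(pdf):
    # Runs on one pandas partition. Assumes each partition holds the complete,
    # date-sorted history of its symbols (e.g. one file per symbol).
    return pdf.assign(daily_return=pdf.groupby('symbol')['close'].pct_change())

df = dd.read_parquet('data/*.parquet')
df = df.map_partitions(add_daily_return)
df = df.dropna()
result = df.compute()  # triggers computation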
```

### **91.9.2 Ray for Distributed Execution**

Ray is well suited to parallelizing custom Python functions across the cores of one machine or the nodes of a cluster.

```python
import ray

ray.init()

@ray.remote
def process_symbol(symbol):
    result = ...  # per-symbol processing goes here
    return result

# `symbols` is the list of ticker strings from Section 91.4.1
futures = [process_symbol.remote(sym) for sym in symbols]
results = ray.get(futures)
```

---

## **91.10 Monitoring Performance in Production**

Once optimizations are in place, continuously monitor to ensure they remain effective.

- Track **request latency** (p50, p95, p99) over time.
- Monitor **throughput** (requests per second).
- Watch **CPU, memory, and GPU utilization**.
- Set up alerts for performance degradation (Chapter 73).

Use tools like Prometheus, Grafana, and custom dashboards.
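
Percentiles matter because a mean hides tail latency. As a minimal sketch with made‑up latency samples, the standard library can compute the p50, p95, and p99 values listed above:

```python
import statistics

# Hypothetical request latencies in milliseconds; two requests are slow outliers
latencies_ms = [12, 15, 11, 14, 13, 120, 16, 12, 18, 14,
                13, 15, 17, 12, 11, 250, 14, 13, 16, 15]

# quantiles(n=100) returns the 99 cut points between the percentiles
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
print(f"mean={statistics.mean(latencies_ms):.1f} ms")
```

In this sample the median stays close to the typical request while the upper percentiles expose the outliers, which is why alerts are usually defined on p95 or p99 and not on the average.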

---

## **Chapter Summary**

In this chapter, we explored performance optimization for time‑series prediction systems, using the NEPSE example as a guide. We covered:

- Identifying bottlenecks with profiling tools.
- Code‑level optimizations (vectorization, data structures, caching).
- Parallel processing with multiprocessing, threading, and async.
- Optimizing feature engineering and model training.
- Reducing prediction latency through caching, batching, and efficient serialization.
- Managing memory for large datasets.
- Scaling with Dask and Ray.
- Monitoring performance in production.

Remember: measure first, optimize second. Premature optimization can lead to complex, hard‑to‑maintain code. Focus on the bottlenecks that actually matter for your system's performance goals.

In the next chapter, we will discuss **Security and Compliance**, ensuring that our prediction system is protected against threats and meets regulatory requirements.

---

**End of Chapter 91**