# Chapter 92: Troubleshooting and Debugging

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand a systematic methodology for debugging issues in time‑series prediction systems.
- Identify common failure modes in data pipelines, feature engineering, model training, and prediction services.
- Use logging effectively to capture diagnostic information without overwhelming storage.
- Leverage debugging tools (pdb, logging, tracebacks) to pinpoint errors in code.
- Perform root cause analysis for data‑related issues, including missing values, outliers, and concept drift.
- Debug distributed systems using distributed tracing and log aggregation.
- Implement strategies for reproducing and fixing bugs in production.
- Document troubleshooting steps and build a knowledge base for recurring issues.

---

## **92.1 Introduction to Troubleshooting and Debugging**

No matter how well‑designed a system is, things will go wrong. The data feed may fail, a feature calculation may contain a bug, a model may produce nonsensical predictions, or an API may become unresponsive. The ability to systematically diagnose and resolve these issues is a critical skill for any engineer or data scientist working on a time‑series prediction system like the NEPSE stock predictor.

Troubleshooting is the process of identifying the root cause of a problem. Debugging is the act of fixing it. This chapter will provide you with a structured approach to both, covering common pitfalls, tools, and techniques.

We'll follow the lifecycle of a typical issue: from detection (often via monitoring alerts, as in Chapter 73), through investigation, to resolution and documentation.

---

## **92.2 A Systematic Debugging Methodology**

When faced with a problem, it's tempting to jump to conclusions and try random fixes. A systematic methodology saves time and ensures you actually fix the root cause.

### **92.2.1 Step 1: Reproduce the Problem**
Before you can fix a bug, you need to be able to reproduce it consistently. If the issue is intermittent, note the conditions under which it occurs (time, data, load, etc.). Create a minimal test case that isolates the problem.

### **92.2.2 Step 2: Gather Information**
Collect logs, metrics, and traces. Check monitoring dashboards. Talk to users or team members who observed the issue.

### **92.2.3 Step 3: Formulate Hypotheses**
Based on the information, hypothesise what might be wrong. For example: “The data ingestion service failed because the source CSV was in a different format.”

### **92.2.4 Step 4: Test Hypotheses**
Design experiments to test each hypothesis. This could be as simple as checking the source file, or as complex as running a modified version of the code in a staging environment.

### **92.2.5 Step 5: Identify Root Cause**
A confirmed hypothesis often explains only the immediate trigger. Keep asking why it happened until you reach a cause that, once fixed, prevents the problem from recurring.

### **92.2.6 Step 6: Implement a Fix**
Apply a fix, ensuring it doesn't introduce new issues. Write tests to prevent regression.

### **92.2.7 Step 7: Document**
Record the issue, its root cause, and the solution. This helps others and prevents future occurrences.

---

## **92.3 Common Issues in Time‑Series Prediction Systems**

Let's review common failure modes, using the NEPSE system as examples.

### **92.3.1 Data Pipeline Issues**
- **Missing data**: The daily CSV file is not available. Symptoms: stale predictions, alerts from monitoring.
- **Schema changes**: The CSV format changes (e.g., new columns, different order). Symptoms: ingestion fails with column errors.
- **Data quality**: Unexpected values (e.g., negative prices, zero volume). Symptoms: feature engineering produces NaNs, models output nonsense.
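
The schema and data-quality failures listed above can often be caught at ingestion time, before they reach feature engineering. The following is a minimal sketch, assuming a pandas DataFrame with lowercase column names; `EXPECTED_COLUMNS` and `validate_daily_frame` are hypothetical names chosen for illustration, not part of the NEPSE codebase.

```python
import pandas as pd

# Hypothetical schema; the real column list would come from the project's config.
EXPECTED_COLUMNS = ["symbol", "date", "open", "high", "low", "close", "volume"]


def validate_daily_frame(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]

    problems = []
    if df.empty:
        problems.append("file contains no rows")

    price_cols = ["open", "high", "low", "close"]
    non_positive = int((df[price_cols] <= 0).any(axis=1).sum())
    if non_positive:
        problems.append(f"{non_positive} row(s) with non-positive prices")

    zero_volume = int((df["volume"] == 0).sum())
    if zero_volume:
        problems.append(f"{zero_volume} row(s) with zero volume")

    inverted = int((df["high"] < df["low"]).sum())
    if inverted:
        problems.append(f"{inverted} row(s) where high < low")

    return problems
```

Returning a list of problems rather than raising on the first one makes the log entry more useful: a single run reports everything that is wrong with the file.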

### **92.3.2 Feature Engineering Issues**
- **Look‑ahead bias**: Accidentally using future data in features (e.g., using today's high to predict today's close). Symptoms: model performs unrealistically well in backtesting but poorly in production.
- **Incorrect calculations**: Bugs in technical indicator formulas (e.g., RSI, MACD). Symptoms: predictions deviate from expected patterns.
- **Memory/performance**: Rolling window calculations become too slow or consume too much memory as data grows.

### **92.3.3 Model Training Issues**
- **Data leakage**: Training data includes information from the test period (e.g., scaling on full dataset). Symptoms: high validation accuracy but low real‑world performance.
- **Overfitting**: Model performs well on training data but poorly on unseen data.
- **Concept drift**: Model performance degrades over time due to changing market conditions.

### **92.3.4 Prediction Service Issues**
- **Latency spikes**: API responses become slow. Possible causes: model loading overhead, feature computation, database contention.
- **Errors**: 500 Internal Server Error. Possible causes: unhandled exceptions, dependency failures.
- **Incorrect predictions**: Model returns a value that is obviously wrong (e.g., negative price). Could be due to input feature issues or model corruption.
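
Obviously wrong predictions, such as a negative price, can be intercepted with a cheap sanity check before the response is returned. This is only a sketch: `check_prediction` and its `max_move` tolerance are hypothetical, and a suitable threshold depends on the market and the prediction horizon.

```python
import math


def check_prediction(predicted_close: float, last_close: float, max_move: float = 0.25) -> list[str]:
    """Return reasons the prediction looks implausible (empty list if none).

    max_move is a hypothetical tolerance on the relative change versus the last close.
    """
    if not math.isfinite(predicted_close):
        return ["prediction is NaN or infinite"]

    reasons = []
    if predicted_close <= 0:
        reasons.append("prediction is not a positive price")
    if last_close > 0 and abs(predicted_close / last_close - 1) > max_move:
        reasons.append("prediction moves more than max_move from the last close")
    return reasons
```

A non-empty result is a good trigger for logging the input features alongside the prediction, since those are usually the first thing to inspect.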

### **92.3.5 Infrastructure Issues**
- **Resource exhaustion**: CPU, memory, or disk full.
- **Network problems**: Service cannot reach database or other services.
- **Deployment failures**: Wrong model version deployed, configuration error.

---

## **92.4 Logging for Debugging**

Logs are your primary window into the system's behaviour. Effective logging is essential for troubleshooting.

### **92.4.1 What to Log**
- **Errors and exceptions**: Always log the full stack trace.
- **Warnings**: Unexpected but non‑fatal conditions (e.g., missing data, falling back to default).
- **Key events**: Start/end of major processes (ingestion, training, prediction).
- **Performance data**: Time taken for critical operations (can be sampled).
- **Input/output summaries**: For debugging, log the first few rows of data or feature values (but beware of logging sensitive information).
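
One way to capture the performance data mentioned above is a small context manager that logs the elapsed time of a block. This is a sketch using only the standard library; the logger name and the `log_duration` helper are assumptions made for illustration.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("nepse.timing")  # hypothetical logger name


@contextmanager
def log_duration(operation: str, **context):
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "%s finished",
            operation,
            extra={"operation": operation, "elapsed_ms": round(elapsed_ms, 2), **context},
        )


# Usage: wrap a critical step and attach identifying fields.
with log_duration("feature_engineering", symbol="NABIL"):
    sum(range(1000))  # stand-in for real work
```

The keys passed through `extra` become attributes on the log record, so they must not reuse built-in record names such as `message` or `name`.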

### **92.4.2 Log Levels**
Use the standard log levels: DEBUG, INFO, WARNING, ERROR, and CRITICAL. In production, the level is usually set to INFO or WARNING to reduce noise; DEBUG messages are then suppressed.

### **92.4.3 Structured Logging**
Log in JSON format to make it easy to parse and query. Use libraries like `python-json-logger`.

```python
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger()
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter('%(asctime)s %(levelname)s %(name)s %(message)s')
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)

logger.info("Data ingestion started", extra={'symbol': 'NABIL', 'file': 'nepse_20230601.csv'})
```

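This produces logs similar to the following. Because the example configures the root logger, `name` is `root`; the default `asctime` format also includes milliseconds:
```json
{"asctime": "2024-06-01 10:00:00,123", "levelname": "INFO", "name": "root", "message": "Data ingestion started", "symbol": "NABIL", "file": "nepse_20230601.csv"}
```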

### **92.4.4 Centralised Logging**
In a distributed system, logs from different services must be aggregated. Use tools like the **ELK Stack** (Elasticsearch, Logstash, Kibana), **Grafana Loki**, or cloud services (Amazon CloudWatch Logs, Google Cloud Logging). This allows you to search across all logs from one place.

### **92.4.5 Log Rotation**
To prevent disks from filling up, configure log rotation (e.g., using `logrotate` or the logging module's `RotatingFileHandler`).
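
A minimal sketch of size-based rotation with the standard library's `RotatingFileHandler` is shown here. The logger name, the size limit, and the `build_rotating_logger` helper are illustrative assumptions; suitable values depend on log volume and available disk.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path: str, max_bytes: int = 10_000_000, backups: int = 5) -> logging.Logger:
    """Create a logger that rolls over to path.1, path.2, ... once path reaches max_bytes.

    Call this once at startup; calling it again would attach a second handler.
    """
    logger = logging.getLogger("nepse.rotating")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With `backupCount=5`, at most six files exist at any time (the active file plus five backups), which puts an upper bound on the disk space used by this logger.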

---

## **92.5 Debugging Tools**

### **92.5.1 Python Debugger (pdb)**
`pdb` allows you to step through code interactively. Insert `breakpoint()` (Python 3.7+) at the point you want to inspect.

```python
def calculate_rsi(prices, period=14):
    breakpoint()  # start debugger here
    # ...
```

When the breakpoint is hit, you can inspect variables, step through lines, and evaluate expressions.

Common pdb commands:
- `n` (next): execute next line
- `s` (step): step into function call
- `c` (continue): continue until next breakpoint
- `p variable` (print): print variable value
- `q` (quit): exit debugger

### **92.5.2 Logging as a Debugging Tool**
Sometimes you can't attach a debugger (e.g., in production). Use logging strategically to output variable values at key points.

```python
# Pass values as arguments so the string is only formatted when DEBUG is enabled.
logger.debug("Processing symbol %s, close price: %s", symbol, close_price)
```

### **92.5.3 Assertions**
Use `assert` to catch invalid states early during development. Running Python with the `-O` flag strips assertions, so do not rely on them for checks that must also hold in production.

```python
assert len(prices) > 0, "prices must not be empty"
assert close_price > 0, f"Close price {close_price} must be positive"
```

### **92.5.4 Post‑mortem Debugging**
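If a program crashes, you can launch a debugger after the fact. Inside an `except` block, call `pdb.post_mortem()`, which inspects the exception currently being handled. (`pdb.pm()` is intended for the interactive interpreter, where it uses the last unhandled exception.)

```python
import pdb

try:
    run()
except Exception:
    pdb.post_mortem()  # open the debugger at the frame that raised
```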

### **92.5.5 Interactive Debugging with IDEs**
Most IDEs (VS Code, PyCharm) have excellent built‑in debuggers with graphical interfaces. Learn to use breakpoints, watches, and variable inspection.

---

## **92.6 Debugging Data Issues**

Data problems are common in time‑series systems. Here are techniques to diagnose them.

### **92.6.1 Visual Inspection**
Plot the data. A simple line chart can reveal missing periods, outliers, or unexpected patterns.

```python
import matplotlib.pyplot as plt

df['close'].plot()
plt.show()
```

### **92.6.2 Summary Statistics**
Compute basic statistics and compare with expectations.

```python
print(df['close'].describe())
```

If the minimum is negative or the maximum is implausible, you've found an issue.

### **92.6.3 Check for Missing Values**
```python
print(df.isnull().sum())
```

If critical columns have missing values, investigate why.
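
Note that `isnull()` only finds empty cells; a trading day whose row is absent altogether will not show up. The sketch below compares the observed dates with an expected calendar. It assumes a `date` column and a caller-supplied list of expected trading dates; `find_missing_dates` is a hypothetical helper.

```python
import pandas as pd


def find_missing_dates(df: pd.DataFrame, expected_dates) -> pd.DatetimeIndex:
    """Return expected trading dates that have no row in df (df needs a 'date' column)."""
    observed = pd.DatetimeIndex(pd.to_datetime(df["date"])).normalize()
    expected = pd.DatetimeIndex(pd.to_datetime(expected_dates)).normalize()
    return expected.difference(observed)


# Usage: one expected day has no row in the data.
df = pd.DataFrame({"date": ["2024-06-02", "2024-06-04"], "close": [500.0, 505.0]})
print(find_missing_dates(df, ["2024-06-02", "2024-06-03", "2024-06-04"]))
```

The expected calendar is passed in rather than generated, because exchange holidays and trading days have to come from an authoritative source.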

### **92.6.4 Validate Against a Known Source**
If you suspect the data is incorrect, compare a few samples with a trusted source (e.g., another data provider).

### **92.6.5 Data Drift Detection**
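Use statistical tests (Kolmogorov‑Smirnov for continuous values, chi‑square for categorical ones) to compare the current data distribution with a reference period. A significant difference suggests that the data-generating process has changed. For prices, run the test on returns rather than raw levels, because price levels trend and would differ between almost any two periods.

```python
from scipy.stats import ks_2samp

# Compare daily returns, not raw prices, so that a price trend alone is not flagged as drift.
reference_returns = reference_data['close'].pct_change().dropna()
current_returns = current_data['close'].pct_change().dropna()

ks_stat, p_value = ks_2samp(reference_returns, current_returns)
if p_value < 0.05:
    print("Significant drift detected")
```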

---

## **92.7 Debugging Model Issues**

### **92.7.1 Performance Degradation**
If model performance drops, first check if the data has drifted (as above). Then look at:

- **Feature importance**: Has the importance of key features changed? Use SHAP or permutation importance on recent data.
- **Residual analysis**: Plot residuals over time. Are errors increasing? Is there a pattern (e.g., bias on high‑volatility days)?
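
The residual analysis described above can be sketched with rolling statistics: a rising rolling MAE indicates growing errors, while a rolling mean residual that moves away from zero indicates systematic bias. The function name and the default window are illustrative assumptions.

```python
import pandas as pd


def rolling_error_report(actual: pd.Series, predicted: pd.Series, window: int = 20) -> pd.DataFrame:
    """Rolling mean absolute error and rolling mean residual (bias) over `window` observations."""
    residual = actual - predicted
    return pd.DataFrame(
        {
            "residual": residual,
            "rolling_mae": residual.abs().rolling(window).mean(),
            "rolling_bias": residual.rolling(window).mean(),
        }
    )
```

Calling `.plot()` on the `rolling_mae` and `rolling_bias` columns of the result gives a quick view of when the degradation started, which helps narrow down what changed.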

### **92.7.2 Overfitting**
- Compare training and validation errors. If training error is much lower, overfitting is likely.
- Check feature importance: are there features that should not be predictive (e.g., row number)?

### **92.7.3 Data Leakage**
- Review feature engineering code: ensure all transformations use only past information (`shift()` correctly).
- Check that train/test split respects time order.
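
Both checks can be illustrated with a short sketch. It assumes a `close` price Series indexed by date; the feature names and helper functions are hypothetical and stand in for the project's own feature code.

```python
import pandas as pd


def make_lagged_features(close: pd.Series, window: int = 5) -> pd.DataFrame:
    """Build features for day t from data up to day t-1 only."""
    returns = close.pct_change()
    features = pd.DataFrame(
        {
            # shift(1) moves each value one row later, so row t sees only t-1 and earlier.
            "prev_return": returns.shift(1),
            "prev_rolling_mean": close.rolling(window).mean().shift(1),
        }
    )
    features["target"] = close  # value to predict for day t
    return features.dropna()


def chronological_split(frame: pd.DataFrame, train_fraction: float = 0.8):
    """Split without shuffling so every training row precedes every test row."""
    cut = int(len(frame) * train_fraction)
    return frame.iloc[:cut], frame.iloc[cut:]
```

A quick leakage test is to recompute the features for a given day after deleting all later rows: if any feature value changes, the calculation is using future data.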

### **92.7.4 Model Serving Issues**
If the model returns bad predictions in production but worked in testing:

- Verify that the input features are being computed identically. Compare a sample input from production with a sample from training.
- Check that the correct model version is loaded.
- Look for feature scaling mismatches (e.g., scaler fitted on training data but not used in production).
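
Comparing a production input with its training counterpart can be sketched as follows. The example assumes both feature vectors are available as pandas Series indexed by feature name; `diff_feature_rows` and the feature names in the usage example are hypothetical.

```python
import numpy as np
import pandas as pd


def diff_feature_rows(training_row: pd.Series, production_row: pd.Series, rtol: float = 1e-6) -> pd.DataFrame:
    """List features whose values differ, or that exist on only one side."""
    names = sorted(set(training_row.index) | set(production_row.index))
    train = training_row.reindex(names).astype(float)
    prod = production_row.reindex(names).astype(float)
    # equal_nan=True treats NaN on both sides as a match; a feature present on
    # only one side appears as NaN versus a value and is reported as a difference.
    same = np.isclose(train.to_numpy(), prod.to_numpy(), rtol=rtol, equal_nan=True)
    report = pd.DataFrame({"training": train, "production": prod})
    return report[~same]


# Usage: sma_20 looks scaled in production, and each side has a feature the other lacks.
training = pd.Series({"rsi_14": 55.0, "sma_20": 500.0, "volume_z": 0.3})
production = pd.Series({"rsi_14": 55.0, "sma_20": 0.52, "macd": 1.2})
print(diff_feature_rows(training, production))
```

A value that differs by a large constant factor, as `sma_20` does here, often points to a scaler that was applied on one side only.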

---

## **92.8 Debugging Distributed Systems**

When issues span multiple services, traditional debugging becomes harder.

### **92.8.1 Distributed Tracing**
Tools like **Jaeger** or **Zipkin** trace a request as it travels through services. They show the path, timing, and any errors. In FastAPI, you can instrument with OpenTelemetry as shown in Chapter 81.

### **92.8.2 Centralised Log Aggregation**
Use the ELK stack or Loki to search logs from all services in one place. For example, to find all logs related to a specific prediction request, include a request ID in every log entry.

```python
import uuid

request_id = str(uuid.uuid4())
logger.info("Prediction request", extra={'request_id': request_id, 'symbol': symbol})
```

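You can then search for that `request_id` across all services, provided each service forwards the ID to the services it calls (for example, in an HTTP header) instead of generating a new one.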
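
Passing `extra={'request_id': ...}` on every call is easy to forget. One possible alternative, sketched here with the standard library only, stores the ID in a context variable and lets a logging filter attach it to every record. The logger name and the `start_request` helper are assumptions made for illustration.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means no request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def start_request(incoming_id=None) -> str:
    """Reuse the caller's ID when one was forwarded, otherwise create a new one."""
    request_id = incoming_id or str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id


handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(request_id)s %(message)s"))
logger = logging.getLogger("nepse.api")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

In a web service, `start_request` would typically be called from middleware at the start of each request, with `incoming_id` taken from whatever header the services agree on.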

### **92.8.3 Health Checks and Readiness Probes**
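In a containerised environment, implement health check endpoints. Kubernetes restarts a container whose liveness probe fails and stops routing traffic to a pod whose readiness probe fails, so a readiness endpoint should also verify that the model and its dependencies are available.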

```python
@app.get("/health")
def health():
    return {"status": "healthy"}
```

### **92.8.4 Chaos Engineering**
Intentionally inject failures (e.g., kill a service, add latency) to see how the system behaves. This builds confidence in your debugging tools and resilience.
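
At the code level, a small fault-injection wrapper is one way to experiment with this in a staging environment. The decorator below is a hypothetical sketch and not a substitute for dedicated chaos-engineering tooling; `inject_faults` and its parameters are assumptions.

```python
import functools
import random
import time


def inject_faults(failure_rate=0.0, max_delay_s=0.0, rng=None):
    """Decorator that randomly delays or fails calls; intended for staging experiments only."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if max_delay_s > 0:
                time.sleep(rng.uniform(0, max_delay_s))  # simulated network latency
            if rng.random() < failure_rate:
                raise ConnectionError("injected fault")  # simulated dependency failure
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping a database or HTTP client call with, say, `@inject_faults(failure_rate=0.1, max_delay_s=0.5)` shows whether retries, timeouts, and alerts behave as expected. Passing a seeded `random.Random` makes an experiment repeatable.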

---

## **92.9 Reproducing and Fixing Bugs**

Once you've identified the root cause, you need to fix it.

### **92.9.1 Write a Regression Test**
Before fixing, write a test that reproduces the bug. This ensures you understand it and prevents it from recurring.

```python
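import numpy as np
import pandas as pd

# calculate_rsi is the project's function under test; import it from wherever it is defined.

def test_rsi_calculation_edge_case():
    prices = pd.Series([100, 100, 100])  # constant prices: no gains and no losses
    # Assumed convention: RSI is 50 when both average gain and average loss are zero.
    expected = pd.Series([np.nan, np.nan, 50.0])
    result = calculate_rsi(prices, period=2)
    pd.testing.assert_series_equal(result, expected)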

### **92.9.2 Apply the Fix**
Make the minimal change necessary to pass the test.

### **92.9.3 Run All Tests**
Ensure your fix doesn't break anything else.

### **92.9.4 Deploy Carefully**
Deploy to staging first, verify, then deploy to production with a rollout strategy (e.g., canary).

---

## **92.10 Building a Troubleshooting Knowledge Base**

Over time, you will encounter recurring issues. Document them in a knowledge base (e.g., a wiki). For each issue, note:

- **Symptoms**: What alerts or user reports indicate this issue?
- **Possible causes**: List common reasons.
- **Diagnostic steps**: Commands to run, logs to check.
- **Resolution**: How to fix it.
- **Prevention**: How to avoid it in the future.

This becomes an invaluable resource for on‑call engineers and new team members.

**Example entry**:

```
# Issue: Data ingestion fails with "Column mismatch"

## Symptoms
- Alert: "Data ingestion failed"
- Log shows: "Expected columns ['Symbol','Date','Open',...] but got ..."

## Possible Causes
1. Source CSV format changed (new columns, different order).
2. Delimiter changed (e.g., from comma to semicolon).
3. File is corrupted or empty.

## Diagnostic Steps
1. Check the source file: `aws s3 cp s3://nepse-raw/latest.csv - | head`
2. Compare columns with expected schema in `config/schema.yaml`.
3. Check file size: `aws s3 ls s3://nepse-raw/latest.csv`

## Resolution
- If format changed, update the ingestion code or schema.
- If file corrupted, contact data provider for a replacement.

## Prevention
- Implement schema validation at ingestion time.
- Alert on schema changes (e.g., using Great Expectations).
```

---

## **Chapter Summary**

In this chapter, we explored the art and science of troubleshooting and debugging in a time‑series prediction system like NEPSE. We covered:

- A systematic methodology for diagnosing issues.
- Common failure modes in data pipelines, feature engineering, models, and infrastructure.
- Effective logging practices, including structured logging and centralisation.
- Debugging tools: pdb, logging, assertions, and IDE debuggers.
- Techniques for debugging data‑related issues (visualisation, statistics, drift detection).
- Approaches for model performance degradation and data leakage.
- Debugging distributed systems with tracing and log aggregation.
- The importance of regression tests and careful deployment.
- Building a knowledge base to capture recurring issues.

Troubleshooting is a skill that improves with experience. By applying these techniques and documenting your findings, you'll become more effective at keeping your prediction system reliable and trustworthy.

In the next chapter, we will explore **Emerging Technologies and Future Trends** in time‑series prediction, including foundation models and large language models for time series.

---

**End of Chapter 92**

