# Chapter 90: Quality Assurance

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Define a comprehensive quality assurance (QA) strategy for a time‑series prediction system.
- Understand the role of different types of testing: unit, integration, system, and acceptance.
- Implement automated tests for data pipelines, feature engineering, model training, and prediction services.
- Conduct manual testing to catch issues that automated tests might miss.
- Measure and optimise system performance (latency, throughput, resource usage).
- Identify and mitigate security vulnerabilities in ML systems.
- Validate model behaviour using techniques like cross‑validation, backtesting, and shadow deployment.
- Ensure data quality through schema validation, anomaly detection, and drift monitoring.
- Integrate QA into the CI/CD pipeline to catch issues early.
- Apply best practices for maintaining high quality in production ML systems.

---

## **90.1 Introduction to Quality Assurance in ML Systems**

Quality assurance (QA) in machine learning systems extends beyond traditional software testing. While we still need to verify that code behaves as expected, we must also validate data, models, and their interactions. A bug in a traditional software system might cause a crash or incorrect output; in an ML system, it might silently degrade predictions, leading to poor business decisions.

For the NEPSE stock prediction system, quality issues could manifest as:

- Data pipeline failures (missing or incorrect data) causing stale predictions.
- Feature engineering bugs (e.g., wrong lag calculation) leading to systematically biased predictions.
- Model training issues (data leakage, overfitting) resulting in poor generalisation.
- Deployment problems (wrong model version, resource exhaustion) causing service outages.
- Security vulnerabilities (unauthorised access, model theft) compromising the system.

A robust QA strategy addresses all these aspects. This chapter will guide you through the key components, from automated testing to manual validation, performance testing, and security audits.

---

## **90.2 QA Strategy**

A QA strategy defines the overall approach to ensuring quality. It should be tailored to the system's risk profile, complexity, and regulatory requirements. For the NEPSE system, a typical strategy might include:

- **Pre‑commit checks**: Linting, type checking, and unit tests run locally or in CI before code is merged.
- **Continuous integration**: On every push, run a suite of automated tests: unit, integration, and data validation.
- **Staging environment**: Deploy to a staging environment that mirrors production for manual testing and integration tests.
- **Pre‑production validation**: Run model performance tests on hold‑out data, shadow deployments, and canary releases.
- **Production monitoring**: Continuously monitor data drift, model performance, and system health (as in Chapter 73).
- **Regular audits**: Periodically review code, data, and models for quality and security.

This multi‑layered approach catches issues at the earliest possible stage and provides confidence when deploying changes.

---

## **90.3 Testing Frameworks**

Choosing the right tools is essential. For Python‑based systems like NEPSE, common frameworks include:

- **pytest**: The most popular testing framework. Supports fixtures, parameterised tests, and plugins.
- **unittest**: Built‑in, but less flexible than pytest.
- **nose2**: The successor to nose. It has a smaller plugin ecosystem than pytest, which remains the recommended choice.

For data validation, we can use:

- **Great Expectations**: Declarative validation of data quality.
- **Pandera**: Schema validation for pandas DataFrames.
- **Deequ** (if using Spark): Unit tests for data.

For model testing, we might use:

- **MLflow** for tracking experiments and comparing model performance.
- **Custom scripts** to compute metrics on hold‑out data.
- **SHAP** or **LIME** for interpretability checks.

For performance testing:

- **locust** or **k6** for load testing APIs.
- **pytest‑benchmark** for micro‑benchmarks.

---

## **90.4 Automated Testing**

Automated tests are the foundation of QA. They should run frequently and provide fast feedback.

### **90.4.1 Unit Tests**

Unit tests verify individual functions or classes in isolation. For the NEPSE system, examples include:

- Test that `compute_daily_return` correctly calculates percentage change.
- Test that `calculate_rsi` returns expected values for a known input.
- Test that data validation functions catch missing columns.

**Example using pytest:**

```python
# test_features.py
import numpy as np
import pandas as pd
import pytest
from feature_engineering import compute_daily_return, calculate_rsi

def test_compute_daily_return():
    df = pd.DataFrame({'Close': [100, 105, 103]})
    result = compute_daily_return(df)
    expected = [np.nan, 5.0, -1.9047619]
    # rtol replaces check_less_precise, which pandas 2.0 removed
    pd.testing.assert_series_equal(
        result['daily_return'],
        pd.Series(expected, name='daily_return'),
        rtol=1e-6
    )

def test_rsi_edge_cases():
    # Constant prices: no gains or losses, so RSI is 50 by convention
    prices = pd.Series([100]*20)
    rsi = calculate_rsi(prices)
    assert rsi.iloc[-1] == pytest.approx(50.0)

    # Test with insufficient data
    with pytest.raises(ValueError):
        calculate_rsi(pd.Series([100, 101]), period=14)
```
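
The tests above import `compute_daily_return` and `calculate_rsi` without showing them. The following is a minimal sketch of what such functions could look like so that the tests pass. It is an assumption, not the system's actual `feature_engineering` module: it uses a simple-moving-average RSI rather than Wilder's smoothing, and it maps windows with no price movement to 50 by convention.

```python
# feature_engineering.py (illustrative sketch only)
import pandas as pd


def compute_daily_return(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'daily_return' column, in percent."""
    out = df.copy()
    out["daily_return"] = out["Close"].pct_change() * 100
    return out


def calculate_rsi(prices: pd.Series, period: int = 14) -> pd.Series:
    """Simple-moving-average RSI; the first `period` values are NaN."""
    if len(prices) < period + 1:
        raise ValueError(f"need at least {period + 1} prices, got {len(prices)}")
    delta = prices.diff()
    avg_gain = delta.clip(lower=0).rolling(period).mean()
    avg_loss = (-delta).clip(lower=0).rolling(period).mean()
    rsi = 100 - 100 / (1 + avg_gain / avg_loss)
    # 0/0 yields NaN when the window had no price movement at all;
    # treat that case as neutral.
    rsi[(avg_gain == 0) & (avg_loss == 0)] = 50.0
    return rsi
```

Writing the edge cases into the implementation (the `ValueError` and the flat-window rule) is what makes the unit tests meaningful: each test pins down one documented decision.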

### **90.4.2 Integration Tests**

Integration tests verify that components work together correctly. For example, test that the feature engineering pipeline produces the expected output when given a sample raw DataFrame, or that the prediction service returns a valid response when called with a known input.

**Example:**

```python
# test_integration.py
import pandas as pd
from fastapi.testclient import TestClient
from prediction_service import app
from feature_store import FeatureStore

client = TestClient(app)

def test_prediction_endpoint(monkeypatch):
    # Mock the feature store to return known features.
    # Patching the class attribute assumes get_features is an instance method,
    # so the replacement must accept `self` as well.
    def mock_get_features(self, symbol, date):
        return pd.DataFrame([{'feature1': 1.0, 'feature2': 2.0}])

    monkeypatch.setattr(FeatureStore, 'get_features', mock_get_features)

    response = client.post("/predict", json={"symbol": "NABIL", "date": "2023-06-01"})
    assert response.status_code == 200
    data = response.json()
    assert "predicted_close" in data
    assert isinstance(data["predicted_close"], float)
```

### **90.4.3 Data Validation Tests**

Automated data validation ensures that incoming data meets expectations. This can be part of the ingestion pipeline.

**Example with Great Expectations:**

```python
import great_expectations as ge

# Uses the legacy `ge.from_pandas` API, which newer Great Expectations
# releases no longer provide. `df` is expected to be a pytest fixture.
def test_raw_data_quality(df):
    ge_df = ge.from_pandas(df)
    # Expect no nulls in critical columns
    ge_df.expect_column_values_to_not_be_null('Close')
    ge_df.expect_column_values_to_not_be_null('Volume')
    # Expect Close prices to be strictly positive (min_value alone is inclusive)
    ge_df.expect_column_values_to_be_between('Close', min_value=0, strict_min=True)
    # Expect Volume to be non‑negative
    ge_df.expect_column_values_to_be_between('Volume', min_value=0)
    # Expect no duplicate symbol‑date pairs
    ge_df.expect_compound_columns_to_be_unique(['Symbol', 'Date'])

    results = ge_df.validate()
    assert results['success'], results
```

### **90.4.4 Model Validation Tests**

Model validation tests ensure that a newly trained model meets minimum performance criteria before deployment.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# `load_test_data` and `load_model` are project helpers assumed to exist.

def test_new_model_performance():
    # Load hold‑out test data
    X_test, y_test = load_test_data()

    # Load candidate model
    model = load_model('candidate.pkl')

    # Compute metrics
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)

    # Compare to baseline (e.g., previous model)
    baseline_mae = 12.5
    assert mae <= baseline_mae * 1.05, f"MAE {mae:.3f} exceeds baseline by >5%"

    # Also check performance on key segments (e.g., high volatility)
    high_vol = (X_test['volatility'] > X_test['volatility'].quantile(0.9)).to_numpy()
    if high_vol.any():
        # Index with a plain boolean array so y_test and y_pred stay aligned
        mae_high = mean_absolute_error(np.asarray(y_test)[high_vol], np.asarray(y_pred)[high_vol])
        assert mae_high <= baseline_mae * 1.2, "Performance on high vol degraded too much"
```

---

## **90.5 Manual Testing**

Automated tests cannot catch everything. Manual testing by a human (e.g., a data scientist or domain expert) is still valuable.

### **90.5.1 Exploratory Testing**
A tester interacts with the system without a script, trying to find unexpected behaviour. For the NEPSE system, they might:

- Request predictions for unusual symbols or dates.
- Examine feature distributions for anomalies.
- Plot predictions against actuals to spot systematic biases.

### **90.5.2 Usability Testing**
If the system has a user interface (e.g., a dashboard), test that it is intuitive and displays information correctly.

### **90.5.3 Acceptance Testing**
Before a release, stakeholders (e.g., a trader) may test the system against acceptance criteria defined in the user stories.

Manual testing should be guided by test plans and documented. However, it's not scalable for frequent releases; that's why we automate as much as possible.

---

## **90.6 Performance Testing**

Performance testing ensures the system meets its latency, throughput, and resource utilisation requirements.

### **90.6.1 Load Testing**
Simulate many concurrent users to see how the system behaves. For the prediction API, we can use **locust**.

```python
# locustfile.py
from locust import HttpUser, task, between

class PredictionUser(HttpUser):
    wait_time = between(1, 3)
    
    @task
    def predict(self):
        self.client.post("/predict", json={
            "symbol": "NABIL",
            "date": "2023-06-01"
        })
```

Run with `locust -f locustfile.py`, supplying the target with `--host` or in the web UI, and observe response times and error rates as the number of simulated users grows.

### **90.6.2 Endurance Testing**
Run a moderate load over a long period (e.g., hours) to detect memory leaks or gradual performance degradation.
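
A short in-process check can complement a long soak test. The sketch below uses the standard library's `tracemalloc` to measure how much traced Python memory grows across repeated calls. The helper name `memory_growth_bytes` and its default iteration counts are illustrative assumptions; a real endurance test would also watch process-level memory over hours.

```python
import tracemalloc


def memory_growth_bytes(operation, iterations=1000, warmup=100):
    """Return net growth in traced Python allocations across repeated calls."""
    for _ in range(warmup):  # let caches and lazy initialisation settle first
        operation()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        operation()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

Growth that scales with the number of iterations suggests that something (a cache, a list of past requests) is retaining objects between calls.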

### **90.6.3 Stress Testing**
Push the system beyond expected limits to see where it breaks. This helps identify scaling bottlenecks.

### **90.6.4 Benchmarking**
Measure the performance of critical functions (e.g., feature computation) to ensure they are efficient. Use `pytest-benchmark`.

```python
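# `load_large_dataframe` and `compute_all_features` are project helpers assumed to exist.
def test_feature_computation_speed(benchmark):
    df = load_large_dataframe()
    # benchmark() calls the function repeatedly and returns its result
    result = benchmark(compute_all_features, df)
    assert result is not None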
```

---

## **90.7 Security Testing**

Security is often overlooked in ML systems, but it's critical, especially in finance.

### **90.7.1 Authentication and Authorization**
Test that API endpoints require valid authentication and that users cannot access data they shouldn't. Use tools like **OWASP ZAP** or **Burp Suite** to scan for common vulnerabilities.

### **90.7.2 Data Privacy**
Ensure that sensitive data (e.g., API keys, database credentials) is not exposed in logs or error messages. Test that the system respects data minimisation and anonymisation.
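
One way to test for leaked credentials is to capture log output and assert that secrets never appear in it. The sketch below assumes a hypothetical `RedactSecrets` logging filter and a simple pattern for `api_key`, `password`, and `token` values. Both are illustrative, and a real system would need a pattern list matched to its own secret formats.

```python
import io
import logging
import re

SECRET_PATTERN = re.compile(
    r"(api[_-]?key|password|token)\s*[=:]\s*\S+", re.IGNORECASE
)


class RedactSecrets(logging.Filter):
    """Replace secret values in the formatted message before it is emitted."""

    def filter(self, record):
        record.msg = SECRET_PATTERN.sub(r"\1=***", record.getMessage())
        record.args = ()  # message is already formatted
        return True


def capture_log_output(message, *args):
    """Log one message through the redacting filter and return what was written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactSecrets())
    logger = logging.getLogger("nepse.redaction_demo")
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    try:
        logger.info(message, *args)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

A test then logs a message containing a known dummy secret and asserts that the captured output does not contain it.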

### **90.7.3 Model Theft Prevention**
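If the model is a valuable asset, protect it. Test that the model cannot be easily extracted through repeated API calls, for example by enforcing rate limits or adding small noise to outputs. Model watermarking does not prevent extraction, but it can help prove that a stolen copy originated from your model.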
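
Rate limiting is the easiest of these defences to test in isolation. The following is a minimal sketch of a per-client sliding-window limiter. The class name `RateLimiter` and the injectable `clock` parameter are assumptions made for illustration; production systems usually enforce limits at the API gateway or in a shared store.

```python
import time


class RateLimiter:
    """Allow at most max_calls per client within window_seconds."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._calls = {}    # client_id -> timestamps of recent allowed calls

    def allow(self, client_id):
        now = self.clock()
        recent = [
            t for t in self._calls.get(client_id, [])
            if now - t < self.window_seconds
        ]
        allowed = len(recent) < self.max_calls
        if allowed:
            recent.append(now)
        self._calls[client_id] = recent
        return allowed
```

Passing a fake clock lets a test verify both the rejection of excess calls and the recovery after the window expires, without sleeping.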

### **90.7.4 Adversarial Attacks**
Test the model's robustness against adversarial inputs. For example, small perturbations to features should not cause large changes in predictions (if that is a requirement). Libraries like **ART** (Adversarial Robustness Toolbox) can help.

---

## **90.8 Model Testing**

Model testing goes beyond performance metrics to ensure the model behaves as expected in various scenarios.

### **90.8.1 Backtesting**
As we did in Chapter 74, simulate how the model would have performed historically. This is a form of system testing for the model.
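
The core of a backtest is a split that never lets the model see the future. The sketch below generates expanding-window (walk-forward) index splits. The function name and parameters are illustrative assumptions; scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
def walk_forward_splits(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) with an expanding training window."""
    start = initial_train
    while start + test_size <= n_samples:
        # Train on everything before `start`, test on the next block.
        yield range(0, start), range(start, start + test_size)
        start += test_size
```

A useful QA assertion on any such splitter is that every training index precedes every test index in the same fold.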

### **90.8.2 Sensitivity Analysis**
Test how predictions change when input features are slightly varied. This can reveal if the model relies too heavily on a single feature.

```python
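import numpy as np

def test_sensitivity(model, X_sample):
    # `model` and `X_sample` are expected to be supplied as pytest fixtures.
    base_pred = model.predict(X_sample)
    for col in X_sample.columns:
        X_perturbed = X_sample.copy()
        X_perturbed[col] = X_perturbed[col] * 1.01  # 1% increase
        new_pred = model.predict(X_perturbed)
        # Relative change per row; the floor avoids division by zero
        change = np.abs(new_pred - base_pred) / np.maximum(np.abs(base_pred), 1e-9)
        assert change.max() < 0.1, f"Too sensitive to {col}"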
```

### **90.8.3 Fairness Testing**
If the model could impact people (e.g., loan approval), test for bias across demographic groups. For NEPSE, this is less relevant, but still good practice.

### **90.8.4 Shadow Deployment**
Before replacing a production model, run the new model in parallel (shadow mode) and compare its predictions to the current model. This provides real‑world validation without risk.
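
Once actual prices are known, the logged predictions from both models can be compared offline. The following sketch assumes three aligned sequences and a hypothetical 5% tolerance for promotion; the function name, the returned keys, and the threshold are illustrative, not part of the NEPSE system.

```python
from statistics import mean


def compare_shadow(production_preds, shadow_preds, actuals, tolerance=0.05):
    """Summarise how the shadow model's errors compare with production's."""
    if not (len(production_preds) == len(shadow_preds) == len(actuals)):
        raise ValueError("all three sequences must have the same length")
    prod_mae = mean(abs(p - a) for p, a in zip(production_preds, actuals))
    shadow_mae = mean(abs(s - a) for s, a in zip(shadow_preds, actuals))
    disagreement = mean(
        abs(p - s) for p, s in zip(production_preds, shadow_preds)
    )
    return {
        "production_mae": prod_mae,
        "shadow_mae": shadow_mae,
        "mean_abs_disagreement": disagreement,
        # Promote only if the shadow model is not meaningfully worse.
        "promote": shadow_mae <= prod_mae * (1 + tolerance),
    }
```

Tracking disagreement alongside error is useful because two models with similar MAE can still differ sharply on individual symbols or dates.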

---

## **90.9 Data Testing**

Data quality is the foundation of ML. Tests should be applied both to static datasets and streaming data.

### **90.9.1 Schema Validation**
Ensure incoming data matches the expected schema (column names, types). Use Pandera or Great Expectations.

```python
import pandera as pa

schema = pa.DataFrameSchema({
    "Symbol": pa.Column(str),
    "Date": pa.Column(pa.DateTime),
    "Open": pa.Column(float, pa.Check.greater_than(0)),
    "High": pa.Column(float, pa.Check.greater_than(0)),
    "Low": pa.Column(float, pa.Check.greater_than(0)),
    "Close": pa.Column(float, pa.Check.greater_than(0)),
    "Volume": pa.Column(int, pa.Check.greater_than_or_equal_to(0)),
})

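def test_data_schema(df):
    # `df` is expected to be supplied as a pytest fixture.
    # validate() raises pa.errors.SchemaError if the data violates the schema.
    validated_df = schema.validate(df)
    assert validated_df is not None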
```

### **90.9.2 Statistical Tests**
Check that distributions of key features remain stable over time (drift detection). Use statistical tests (e.g., Kolmogorov‑Smirnov) as in Chapter 77.
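
To make the idea concrete, the two-sample Kolmogorov‑Smirnov statistic is the largest gap between the empirical cumulative distributions of a reference sample and a current sample. The sketch below computes it with the standard library only; the function name is an assumption. In practice `scipy.stats.ks_2samp` returns the same statistic together with a p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap
```

A value near 0 means the two samples look alike, while a value near 1 means they barely overlap. The alert threshold is a judgment call that depends on sample size and on how noisy the feature is.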

### **90.9.3 Anomaly Detection**
Monitor for unexpected values (e.g., price = 0, volume negative). This can be integrated into the ingestion pipeline.
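
A rule-based check is often enough to catch the most damaging anomalies. The sketch below assumes rows are dictionaries with `close` and `volume` keys, sorted by date for a single symbol, and uses an illustrative 10% limit on day-over-day moves. The field names and the threshold are assumptions to adapt to the actual feed.

```python
def find_anomalies(rows, max_daily_move=0.10):
    """Return (index, reason) pairs for rows that break basic sanity rules."""
    anomalies = []
    previous_close = None
    for i, row in enumerate(rows):
        close, volume = row["close"], row["volume"]
        if close <= 0:
            anomalies.append((i, "non-positive close"))
        elif previous_close is not None and abs(close / previous_close - 1) > max_daily_move:
            anomalies.append((i, "daily move exceeds limit"))
        if volume < 0:
            anomalies.append((i, "negative volume"))
        if close > 0:
            previous_close = close  # only valid closes serve as a reference
    return anomalies
```

Returning reasons rather than a boolean makes the alert actionable and lets the pipeline decide whether to quarantine a row or halt ingestion.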

### **90.9.4 Freshness**
Ensure data is arriving on time. In the NEPSE system, if the daily CSV is not available by a certain time, an alert should fire.
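
A freshness check can be as simple as comparing the newest file's modification time with a maximum allowed age. The sketch below assumes the daily files land as `*.csv` in a single directory and that 24 hours is the acceptable age; the function names, the pattern, and the threshold are illustrative.

```python
from datetime import datetime, timedelta
from pathlib import Path


def latest_arrival(data_dir, pattern="*.csv"):
    """Return the modification time of the newest matching file, or None."""
    mtimes = [p.stat().st_mtime for p in Path(data_dir).glob(pattern)]
    return datetime.fromtimestamp(max(mtimes)) if mtimes else None


def is_stale(data_dir, now=None, max_age=timedelta(hours=24)):
    """True when no matching file exists or the newest one is too old."""
    arrived = latest_arrival(data_dir)
    now = now or datetime.now()
    return arrived is None or now - arrived > max_age
```

A scheduler can call `is_stale` shortly after the expected delivery time and raise an alert when it returns `True`.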

---

## **90.10 Integration Testing**

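Section 90.4.2 covered integration tests that run inside the test suite with mocked dependencies. The tests in this section exercise the deployed components together in a realistic environment, which is especially important in a microservices architecture (Chapter 81).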

### **90.10.1 End‑to‑End Tests**
Simulate a complete user scenario: from data ingestion to prediction. For example:

1. Trigger ingestion of a sample CSV.
2. Wait for feature computation.
3. Call prediction API and verify the result.
4. Check that logs and metrics are generated.

These tests can be run in a staging environment that mirrors production.
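
Step 2 of the scenario above involves waiting for asynchronous work, and a fixed `sleep` makes tests either slow or flaky. The sketch below is a small polling helper; the name `wait_until` and its defaults are assumptions for illustration.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In an end‑to‑end test, `condition` would typically query the feature store or a status endpoint for the sample symbol and date, and the prediction call would follow once it returns.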

### **90.10.2 Contract Testing**
Ensure that services communicate according to their API contracts. Tools like **Pact** can be used to test consumer‑driven contracts between services.
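
The underlying idea can be shown without Pact: the consumer states which fields and types it relies on, and the provider's response is checked against that statement. The sketch below is a simplified stand-in, not Pact's API. The contract contents are assumptions based on the `/predict` examples earlier in the chapter.

```python
# Fields the consumer relies on in a /predict response (assumed for illustration).
PREDICTION_CONTRACT = {"symbol": str, "date": str, "predicted_close": float}


def contract_violations(payload, contract):
    """List the ways a response payload departs from the expected fields and types."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems
```

Extra fields in the payload are deliberately ignored, so the provider can add fields without breaking consumers, while removing or retyping a field fails the check.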

### **90.10.3 Chaos Engineering**
Intentionally introduce failures (e.g., kill a service, delay network) to see if the system degrades gracefully. This builds confidence in resilience.
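
At small scale, the same idea can be tested in-process by wrapping a dependency so that it fails on demand. The sketch below assumes a hypothetical `predict_with_fallback` function that should return a fallback value when the feature store is unreachable. All names are illustrative, and dedicated chaos tooling would inject faults at the infrastructure level instead.

```python
import random


def flaky(func, failure_rate, rng=None):
    """Wrap func so that a fraction of calls raise ConnectionError."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper


def predict_with_fallback(fetch_features, model_predict, symbol, fallback):
    """Return a model prediction, or the fallback if features are unavailable."""
    try:
        return model_predict(fetch_features(symbol))
    except ConnectionError:
        return fallback
```

Setting `failure_rate` to 1.0 gives a deterministic test of the degraded path, while intermediate rates with a seeded `random.Random` can exercise retry logic.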

---

## **90.11 Best Practices**

1. **Shift left**: Test as early as possible in the development cycle.
2. **Automate ruthlessly**: Any test that can be automated should be.
3. **Maintain a healthy test suite**: Keep tests fast, reliable, and independent.
4. **Use realistic data in tests**: But anonymise it if needed.
5. **Monitor test coverage**: Aim for high coverage, but focus on critical paths.
6. **Treat data as code**: Version data, test its quality, and document it.
7. **Involve domain experts**: They can spot issues that automated tests miss.
8. **Document test plans and results**: Especially for manual and acceptance testing.
9. **Continuously improve**: Review test failures and update tests to catch similar issues in the future.
10. **Balance effort and risk**: Not everything needs 100% coverage; focus on areas where failures would have the highest impact.

---

## **Chapter Summary**

In this chapter, we explored the multifaceted world of quality assurance for a time‑series prediction system like NEPSE. We covered:

- A layered QA strategy from pre‑commit checks to production monitoring.
- Automated testing with unit, integration, data validation, and model tests.
- Manual testing for exploratory and acceptance purposes.
- Performance testing to ensure the system meets SLAs.
- Security testing to protect data and models.
- Model‑specific testing including backtesting and sensitivity analysis.
- Data testing for schema, drift, and freshness.
- Integration testing in a microservices environment.
- Best practices to embed quality into the development process.

Quality assurance is not a one‑time activity but a continuous discipline. By integrating these practices into your CI/CD pipeline and daily work, you can deliver a reliable, trustworthy prediction system that meets user needs and withstands the test of time.

In the next chapter, we will explore **Performance Optimization**, focusing on how to make your system faster and more efficient.

---

**End of Chapter 90**