# Chapter 47: A/B Testing and Model Comparison

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the fundamentals of A/B testing and why it is essential for validating model improvements in production
- Design an A/B test for a time‑series prediction system, including proper traffic splitting and metric selection
- Calculate sample sizes required to achieve statistically significant results
- Apply statistical significance tests (t‑test, chi‑square, Mann–Whitney U) to compare model performance
- Implement multi‑armed bandit algorithms to dynamically allocate traffic while testing
- Conduct multivariate tests when multiple factors change simultaneously
- Build a model comparison framework that tracks multiple models over time
- Measure business impact (e.g., trading returns) rather than just technical metrics
- Interpret test results and avoid common pitfalls like p‑hacking and peeking

---

## Introduction

Imagine you have trained a new model for predicting NEPSE stock movements, perhaps one that uses a more advanced feature set or a deeper neural network. You believe it will outperform the current production model. But before you replace the old model, you must be confident that the new one is genuinely better and did not simply get lucky on a particular test set. **A/B testing** (also known as split testing) provides a rigorous framework for comparing two (or more) versions of a model by exposing them to real traffic and measuring their performance.

In the context of the NEPSE prediction system, an A/B test might compare two models that serve predictions to a downstream trading algorithm. The key challenge is that the true labels (whether the price actually went up the next day) are only available after the fact. Moreover, financial data is noisy, and even a genuinely superior model may appear worse over short periods due to random fluctuations.

This chapter will guide you through the design, execution, and analysis of A/B tests for machine learning models. We will cover sample size calculation, metric selection, statistical significance, and more advanced techniques like multi‑armed bandits that can accelerate learning. Using the NEPSE example, we will also discuss how to measure business impact—such as simulated trading returns—rather than just accuracy metrics.

---

## 47.1 A/B Testing Fundamentals

A/B testing is a controlled experiment with two variants: **A** (the control, usually the current production model) and **B** (the treatment, the candidate model). Users or prediction requests are randomly assigned to one of the variants, and their outcomes are tracked. At the end of the experiment, we compare the performance metrics between the two groups to decide if the difference is statistically significant.

### 47.1.1 Key Concepts

- **Randomisation**: Essential to ensure that any differences are due to the model, not confounding factors.
- **Sample size**: The number of observations needed to detect a meaningful effect with sufficient power.
- **Metric**: The quantitative measure we use to compare models (e.g., accuracy, F1 score, profit).
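- **Statistical significance**: A difference is called statistically significant when its p‑value falls below a pre‑chosen threshold α (usually 0.05). The p‑value is the probability of seeing a difference at least this large if the two models actually performed the same; it is not the probability that the difference is real.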
- **Practical significance**: The effect size is large enough to matter in the business context.

For the NEPSE system, the unit of randomisation could be each prediction request, or it could be each stock symbol if we want to avoid mixing predictions for the same symbol across models (which could confuse downstream logic). We’ll discuss this trade‑off.

---

## 47.2 Experimental Design for Time‑Series Models

Time‑series data introduces unique challenges for A/B testing because observations are not independent. If you assign each request randomly, a single symbol might receive predictions from both models on different days, which is statistically acceptable as long as the assignment does not depend on the symbol or on market conditions. However, if the model is used for trading decisions, consistency may matter. It can seem simpler to randomise by time periods (e.g., odd days use model A, even days use model B), but this is easily confounded by day‑of‑week effects and by market regimes. A cleaner approach is to randomise by symbol: assign each stock symbol to either control or treatment for the entire duration of the test. This ensures that each symbol's predictions always come from the same model. The trade‑off is that the symbol, not the individual prediction, becomes the unit of randomisation, so the number of truly independent units is limited by the number of listed symbols.

### 47.2.1 Randomisation by Symbol

```python
import hashlib

def assign_symbol_to_group(symbol, salt="nepse_test_v1", split=0.5):
    """Deterministic assignment based on symbol hash."""
    hash_val = int(hashlib.md5((symbol + salt).encode()).hexdigest(), 16)
    return 'treatment' if (hash_val % 100) < split * 100 else 'control'

# Example: NEPSE stocks
symbols = ['NABIL', 'NTC', 'SBI', 'HRL', 'NICA']
for sym in symbols:
    group = assign_symbol_to_group(sym, split=0.5)
    print(f"{sym}: {group}")
```

**Explanation:**  
Hashing the symbol together with a fixed salt makes the assignment deterministic, so it is stable across runs and across machines. Changing the salt for a new test reshuffles the groups. Within one experiment, the same symbol always receives the same model.
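As a quick sanity check, the sketch below verifies the two properties the hash‑based scheme relies on: the assignment is repeatable, and it is roughly balanced across many symbols. The symbol names (`SYM0000`, `SYM0001`, ...) are placeholders invented for this check, not real NEPSE tickers, and the assignment function is repeated only so the block runs on its own.

```python
import hashlib

def assign_symbol_to_group(symbol, salt="nepse_test_v1", split=0.5):
    """Deterministic assignment based on symbol hash (same logic as above)."""
    hash_val = int(hashlib.md5((symbol + salt).encode()).hexdigest(), 16)
    return 'treatment' if (hash_val % 100) < split * 100 else 'control'

# Hypothetical symbols, used only to check stability and balance
fake_symbols = [f"SYM{i:04d}" for i in range(1000)]

first_pass = [assign_symbol_to_group(s) for s in fake_symbols]
second_pass = [assign_symbol_to_group(s) for s in fake_symbols]

is_stable = first_pass == second_pass
treatment_share = first_pass.count('treatment') / len(first_pass)

print(f"Stable across runs: {is_stable}")
print(f"Share assigned to treatment: {treatment_share:.3f}")
```

With only a handful of symbols the realised split can deviate noticeably from the target, so it is worth checking the actual group sizes before the experiment starts.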

### 47.2.2 Metric Selection

What should we measure? Common choices:

- **Accuracy / F1 score** – If the task is classification (up/down).
- **Mean Absolute Error (MAE)** – If predicting exact price.
- **Profit from a trading strategy** – The ultimate business metric. For example, simulate a simple strategy: go long if the predicted probability of an up move is above 0.6, go short if it is below 0.4, stay out otherwise, and compute the total return.

Using a trading simulation is more realistic because it accounts for the magnitude of moves, not just direction.

**Example of computing the total strategy return from predictions:**

```python
def simulate_trading(predictions_df):
    """
    predictions_df has columns: date, symbol, predicted_prob, actual_return
    Strategy: if predicted_prob > 0.6, take long position (expect up),
              if predicted_prob < 0.4, take short position (expect down),
              else stay out.
    Returns the sum of per-row strategy returns (no compounding, no costs).
    """
    returns = []
    for _, row in predictions_df.iterrows():
        if row['predicted_prob'] > 0.6:
            returns.append(row['actual_return'])   # long: earn the realised return
        elif row['predicted_prob'] < 0.4:
            returns.append(-row['actual_return'])  # short: earn the negated return
        else:
            returns.append(0)                       # out: no exposure
    return sum(returns)
```
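For larger prediction tables, the same rule can be written without a Python loop. The sketch below is a vectorised variant under the same assumptions (long above 0.6, short below 0.4, no costs, no compounding). The function name `simulate_trading_vectorised`, the threshold parameters, and the four example rows are all made up for illustration.

```python
import numpy as np
import pandas as pd

def simulate_trading_vectorised(predictions_df, long_thr=0.6, short_thr=0.4):
    """Total strategy return: +return when long, -return when short, 0 when out."""
    prob = predictions_df['predicted_prob'].to_numpy()
    ret = predictions_df['actual_return'].to_numpy()
    # Position is +1 (long), -1 (short) or 0 (out) for each row
    position = np.where(prob > long_thr, 1, np.where(prob < short_thr, -1, 0))
    return float(np.sum(position * ret))

# Made-up rows for illustration only (symbols are hypothetical)
example = pd.DataFrame({
    'date': ['2024-01-01'] * 4,
    'symbol': ['AAA', 'BBB', 'CCC', 'DDD'],
    'predicted_prob': [0.70, 0.30, 0.50, 0.65],
    'actual_return': [0.02, -0.01, 0.03, -0.015],
})

total_return = simulate_trading_vectorised(example)
print(f"Total strategy return: {total_return:.4f}")
```

Here the long on `AAA` earns 0.02, the short on `BBB` earns 0.01, `CCC` is skipped, and the long on `DDD` loses 0.015, for a total of 0.015.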

---

## 47.3 Sample Size Calculation

Before running an A/B test, you must determine how many observations you need to detect a meaningful difference. This depends on:

- **Baseline metric** (e.g., current model's accuracy)
- **Minimum detectable effect** (MDE) – the smallest improvement you care about
- **Significance level** (α) – usually 0.05
- **Statistical power** (1‑β) – usually 0.8

For a two‑sample test of proportions (e.g., accuracy), you can use a formula or a library like `statsmodels`.

```python
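import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Current accuracy
p1 = 0.55
# Accuracy at the minimum detectable effect (MDE)
p2 = 0.58
# Cohen's h; the larger proportion goes first so the effect size is positive,
# which is what alternative='larger' expects
effect_size = proportion_effectsize(p2, p1)

power_analysis = NormalIndPower()
sample_size = power_analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    alternative='larger'  # one-sided: treatment better than control
)
# Round up, since a fractional observation still requires a whole one
print(f"Required sample size per group: {math.ceil(sample_size)}")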
```

**Explanation:**  
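`proportion_effectsize` computes Cohen's h for proportions. `solve_power` returns the number of observations needed in each group, assuming equal group sizes by default. Detecting a lift from 55% to 58% accuracy with a one‑sided test at α = 0.05 and 80% power takes roughly 3,400 predictions per group; a two‑sided test needs more. For NEPSE, if you have ~250 trading days per year and many symbols, you might need several months of data to reach the required sample size. Predictions for different symbols on the same day are also correlated, so the effective sample size is smaller than the raw count of predictions.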
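To make the calculation less of a black box, the sketch below applies the standard normal‑approximation formula directly: n per group = 2 × ((z_α + z_β) / h)², where h is Cohen's h. It then converts the result into trading days under a purely hypothetical setup of 100 symbols per group with one prediction per symbol per day. The function name and that setup are assumptions for illustration.

```python
import math
from scipy.stats import norm

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """One-sided sample size per group using Cohen's h and the normal approximation."""
    h = 2 * (math.asin(math.sqrt(p_treatment)) - math.asin(math.sqrt(p_control)))
    z_alpha = norm.ppf(1 - alpha)   # one-sided critical value
    z_beta = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / h) ** 2)

n_per_group = sample_size_per_group(0.55, 0.58)

# Hypothetical setup: 100 symbols per group, one prediction per symbol per trading day
symbols_per_group = 100
trading_days_needed = math.ceil(n_per_group / symbols_per_group)

print(f"Observations per group: {n_per_group}")
print(f"Trading days needed at {symbols_per_group} symbols per group: {trading_days_needed}")
```

This treats every prediction as independent, which is optimistic for stocks that move together, so the day count is best read as a lower bound.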

---

## 47.4 Statistical Significance Testing

After collecting data, you need to test whether the observed difference is statistically significant. The choice of test depends on the metric:

- **Binary outcomes** (e.g., correct/incorrect prediction): use a chi‑square test or a two‑proportion z‑test.
- **Continuous outcomes** (e.g., trading profit, absolute error): use a t‑test (if normally distributed) or Mann‑Whitney U test (non‑parametric).

### 47.4.1 Example: Comparing Accuracies with Chi‑Square

```python
from scipy.stats import chi2_contingency

# Suppose we have:
# Control group: 1000 predictions, 550 correct (55%)
# Treatment group: 1000 predictions, 580 correct (58%)
control_correct = 550
control_total = 1000
treatment_correct = 580
treatment_total = 1000

# Create contingency table
table = [[control_correct, control_total - control_correct],
         [treatment_correct, treatment_total - treatment_correct]]

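# For 2x2 tables, SciPy applies Yates' continuity correction by default
# (pass correction=False to disable it)
chi2, p_value, dof, expected = chi2_contingency(table)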
print(f"Chi-square = {chi2:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference.")
else:
    print("Not significant.")
```
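The two‑proportion z‑test mentioned above can be computed by hand in a few lines, and it has the advantage of yielding a confidence interval for the accuracy difference rather than only a p‑value. The following is a minimal sketch on the same counts; the variable names are illustrative.

```python
import math
from scipy.stats import norm

control_correct, control_total = 550, 1000
treatment_correct, treatment_total = 580, 1000

p_c = control_correct / control_total
p_t = treatment_correct / treatment_total
diff = p_t - p_c

# Pooled standard error under the null hypothesis of equal accuracy
p_pool = (control_correct + treatment_correct) / (control_total + treatment_total)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / control_total + 1 / treatment_total))
z_stat = diff / se_pooled
p_value = 2 * norm.sf(abs(z_stat))  # two-sided p-value

# Unpooled standard error for the 95% confidence interval on the difference
se_unpooled = math.sqrt(p_c * (1 - p_c) / control_total + p_t * (1 - p_t) / treatment_total)
z_crit = norm.ppf(0.975)
ci_low, ci_high = diff - z_crit * se_unpooled, diff + z_crit * se_unpooled

print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}")
print(f"95% CI for accuracy difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

With 1,000 predictions per group the interval spans zero, so a three‑point accuracy gap of this kind cannot yet be distinguished from noise. That is in line with the sample‑size discussion in Section 47.3.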

### 47.4.2 Example: Comparing Trading Returns with t‑test

```python
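import numpy as np
from scipy.stats import ttest_ind

# Placeholder data so the snippet runs; replace with the real daily returns per group
rng = np.random.default_rng(42)
daily_returns_control = rng.normal(0.0005, 0.01, size=120)
daily_returns_treatment = rng.normal(0.0010, 0.01, size=120)

# Welch's t-test (equal_var=False) does not assume equal variances.
# Treatment is passed first, so a positive t-statistic means treatment has the higher mean.
t_stat, p_value = ttest_ind(daily_returns_treatment, daily_returns_control, equal_var=False)
print(f"t-statistic = {t_stat:.4f}, p-value = {p_value:.4f}")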
```
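Daily returns are often heavy‑tailed, in which case the Mann‑Whitney U test is a reasonable non‑parametric alternative. The sketch below runs it on synthetic Student‑t data; the data, the seed, and the small shift added to the treatment group are placeholders, not real results.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
# Synthetic heavy-tailed daily returns (Student-t, 3 degrees of freedom); placeholders only
returns_control = 0.01 * rng.standard_t(df=3, size=200)
returns_treatment = 0.01 * rng.standard_t(df=3, size=200) + 0.001

u_stat, p_value_mw = mannwhitneyu(returns_treatment, returns_control,
                                  alternative='two-sided')
print(f"U = {u_stat:.1f}, p-value = {p_value_mw:.4f}")
```

Note that this test asks whether values from one group tend to be larger than values from the other. It does not compare means, so it can disagree with a t‑test when a few large returns dominate the average.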

### 47.4.3 Multiple Testing and Peeking

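If you check the p‑value every day and stop as soon as it becomes significant, you inflate the false positive rate. This is called **peeking**. To avoid this, you should decide the sample size in advance and not act on the results until the experiment is complete. Alternatively, use sequential testing methods that adjust for multiple looks. The same inflation occurs when you test many metrics or many segments and report whichever one is significant; in that case, apply a multiple‑comparison correction such as Bonferroni or Benjamini–Hochberg.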
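The effect of peeking is easy to demonstrate with an A/A simulation, in which both groups are drawn from the same distribution so that every "significant" result is a false positive. The sketch below is illustrative: the number of simulations, the look schedule (every 50 observations up to 1,000), and the unit‑variance normal data are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, n_max, look_every = 2000, 1000, 50
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

# A/A test: both groups have identical true performance
a = rng.normal(size=(n_sims, n_max))
b = rng.normal(size=(n_sims, n_max))

looks = np.arange(look_every, n_max + 1, look_every)   # 50, 100, ..., 1000
mean_a = np.cumsum(a, axis=1)[:, looks - 1] / looks
mean_b = np.cumsum(b, axis=1)[:, looks - 1] / looks

# z-statistic at each look (variance is known to be 1 in this simulation)
z = (mean_b - mean_a) / np.sqrt(2.0 / looks)
significant = np.abs(z) > z_crit

fixed_rate = significant[:, -1].mean()          # test once, at the planned end
peeking_rate = significant.any(axis=1).mean()   # stop at the first significant look

print(f"False positive rate, single test at the end: {fixed_rate:.3f}")
print(f"False positive rate, stopping at first significant look: {peeking_rate:.3f}")
```

The single final test should land near the nominal 5%, while stopping at the first significant look produces a false positive rate several times higher.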

---

## 47.5 Multi‑Armed Bandit

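Traditional A/B testing is static: you split traffic in fixed proportions and wait until the end. A **multi‑armed bandit** dynamically allocates more traffic to the better‑performing variant as data accumulates, reducing the opportunity cost of serving a worse model during the test. When one variant is clearly better, a bandit can also settle on it sooner. The price is added complexity: because allocation adapts to the data, the fixed‑sample tests from Section 47.4 no longer apply directly.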

### 47.5.1 Epsilon‑Greedy Algorithm

A simple bandit: with probability ε, explore (choose a random model); with probability 1‑ε, exploit (choose the model with the highest average reward so far).

```python
import random
import numpy as np

class EpsilonGreedy:
    def __init__(self, n_models, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_models   # number of times each model was used
        self.values = [0.0] * n_models # average reward for each model

    def select_model(self):
        if random.random() < self.epsilon:
            return random.randint(0, len(self.values)-1)  # explore
        else:
            return np.argmax(self.values)                 # exploit

    def update(self, chosen_model, reward):
        self.counts[chosen_model] += 1
        n = self.counts[chosen_model]
        value = self.values[chosen_model]
        # Incremental average
        self.values[chosen_model] = ((n-1) * value + reward) / n

# Simulate
true_rates = [0.55, 0.58]  # assumed true accuracy of model 0 and model 1
bandit = EpsilonGreedy(2, epsilon=0.1)
for _ in range(1000):
    model = bandit.select_model()
    # Reward is 1 for a correct prediction, 0 otherwise
    reward = 1 if random.random() < true_rates[model] else 0
    bandit.update(model, reward)

print(f"Model 0 average: {bandit.values[0]:.4f}, pulls: {bandit.counts[0]}")
print(f"Model 1 average: {bandit.values[1]:.4f}, pulls: {bandit.counts[1]}")
```

**Explanation:**  
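Given enough pulls, the bandit tends to allocate most traffic to the better model (model 1) while still occasionally exploring. With a gap as small as 0.55 versus 0.58 and only 1,000 pulls, however, an individual run can easily favour the wrong model, so results vary from run to run. The same class can be used in production to serve predictions while learning, provided rewards are fed back once the true outcome is known.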

### 47.5.2 Thompson Sampling

Thompson Sampling is a Bayesian approach that maintains a belief distribution for each model's reward and samples from that distribution to choose the model. It often converges faster than epsilon‑greedy.

```python
import numpy as np

class ThompsonSampling:
    def __init__(self, n_models):
        # Beta distribution parameters (successes, failures)
        self.alphas = [1] * n_models
        self.betas = [1] * n_models

    def select_model(self):
        samples = [np.random.beta(a, b) for a, b in zip(self.alphas, self.betas)]
        return np.argmax(samples)

    def update(self, chosen_model, reward):
        if reward == 1:
            self.alphas[chosen_model] += 1
        else:
            self.betas[chosen_model] += 1
```
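To see Thompson Sampling in action, the sketch below runs the same Beta‑Bernoulli update in a simulation, reusing the assumed true accuracies of 0.55 and 0.58 from the epsilon‑greedy example. The logic is inlined rather than wrapped in the class so the block is self‑contained; the seed and the number of rounds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(123)
true_rates = [0.55, 0.58]   # assumed true accuracies of model 0 and model 1
alphas = [1, 1]             # Beta(1, 1) prior: 1 + number of successes
betas = [1, 1]              # 1 + number of failures
pulls = [0, 0]

n_rounds = 5000
for _ in range(n_rounds):
    # Draw one plausible accuracy per model from its posterior and pick the largest
    samples = [rng.beta(a, b) for a, b in zip(alphas, betas)]
    model = int(np.argmax(samples))
    reward = 1 if rng.random() < true_rates[model] else 0
    if reward == 1:
        alphas[model] += 1
    else:
        betas[model] += 1
    pulls[model] += 1

posterior_means = [a / (a + b) for a, b in zip(alphas, betas)]
for i in range(2):
    print(f"Model {i}: pulls = {pulls[i]}, posterior mean = {posterior_means[i]:.4f}")
```

Because exploration is driven by posterior uncertainty rather than a fixed ε, the share of traffic sent to the weaker model shrinks as evidence accumulates. With a gap this small, it can still take many rounds before the allocation clearly favours one model.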

---

## 47.6 Multivariate Testing

Sometimes you want to test multiple changes simultaneously, e.g., different feature sets and different algorithms. A multivariate test (also called factorial experiment) can evaluate several factors at once. For example:

- Factor 1: Model architecture (LSTM vs. XGBoost)
- Factor 2: Feature set (with vs. without sentiment features)

This gives 2 × 2 = 4 combinations. You can then analyse main effects and interactions.

### 47.6.1 Analysing Multivariate Results

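Use ANOVA (Analysis of Variance) to test whether factors have significant effects. However, multivariate tests require larger sample sizes: traffic is divided among more combinations, interaction effects are usually smaller and harder to detect than main effects, and testing several hypotheses at once calls for multiple‑comparison control.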
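A two‑way ANOVA for the 2 × 2 design above might look like the following sketch, which uses `statsmodels` on synthetic data. The effect sizes baked into the simulation, the column names, and the generic `score` metric are arbitrary assumptions chosen only to make the output readable; they say nothing about how LSTM and XGBoost actually compare.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n_per_cell = 200
rows = []
for arch in ['LSTM', 'XGBoost']:
    for sentiment in ['without', 'with']:
        # Made-up effects: +0.5 for one architecture, +0.3 for sentiment, no interaction
        mean = 0.5 * (arch == 'XGBoost') + 0.3 * (sentiment == 'with')
        scores = rng.normal(loc=mean, scale=1.0, size=n_per_cell)
        rows.append(pd.DataFrame({'arch': arch, 'sentiment': sentiment, 'score': scores}))
df = pd.concat(rows, ignore_index=True)

# '*' expands to both main effects plus their interaction
model = ols('score ~ C(arch) * C(sentiment)', data=df).fit()
anova_table = anova_lm(model, typ=2)
print(anova_table)
```

The rows `C(arch)` and `C(sentiment)` report the main effects, and `C(arch):C(sentiment)` reports the interaction; the `PR(>F)` column holds the p‑values.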

---

## 47.7 Model Comparison Framework

Beyond one‑off A/B tests, you may want to continuously compare multiple models in production. This can be done with a **model registry** that tracks performance metrics over time, and a **champion/challenger** system where one model serves most traffic and a few challengers get a small slice.

### 47.7.1 Building a Champion/Challenger System

```python
import numpy as np

class ChampionChallenger:
    def __init__(self, champion_model, challenger_models, traffic_split):
        self.champion = champion_model
        self.challengers = challenger_models
        self.split = traffic_split  # e.g., [0.8, 0.1, 0.1] for champion + 2 challengers

    def predict(self, features):
        # Choose a model index at random according to the traffic split
        # (index 0 is the champion, index i >= 1 is challenger i-1)
        choice = np.random.choice(len(self.split), p=self.split)
        if choice == 0:
            return self.champion.predict(features)
        else:
            return self.challengers[choice-1].predict(features)
```

Periodically, you evaluate the performance of each challenger against the champion using a statistical test. If a challenger is significantly better, it becomes the new champion.
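One possible promotion rule combines a one‑sided two‑proportion z‑test with a minimum practical lift, so that a challenger must be both statistically and practically better. The sketch below is an assumption about how such a rule could look; the function name, the 1‑point `min_lift`, and the counts are invented for illustration.

```python
import math
from scipy.stats import norm

def challenger_beats_champion(champ_correct, champ_total, chal_correct, chal_total,
                              alpha=0.05, min_lift=0.01):
    """One-sided two-proportion z-test plus a minimum practical lift in accuracy."""
    p_champ = champ_correct / champ_total
    p_chal = chal_correct / chal_total
    p_pool = (champ_correct + chal_correct) / (champ_total + chal_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / champ_total + 1 / chal_total))
    z = (p_chal - p_champ) / se
    p_value = float(norm.sf(z))  # one-sided: challenger better than champion
    promote = bool(p_value < alpha and (p_chal - p_champ) >= min_lift)
    return promote, p_value

# Made-up counts: the champion has served far more traffic than the challenger
promote_strong, p_strong = challenger_beats_champion(4400, 8000, 590, 1000)
promote_weak, p_weak = challenger_beats_champion(4400, 8000, 560, 1000)

print(f"Challenger at 59.0% vs champion at 55.0%: promote={promote_strong}, p={p_strong:.4f}")
print(f"Challenger at 56.0% vs champion at 55.0%: promote={promote_weak}, p={p_weak:.4f}")
```

Running this check repeatedly on a growing data set is itself a form of peeking, so the evaluation points should be fixed in advance or handled with a sequential method.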

---

## 47.8 Business Impact Analysis

Technical metrics (accuracy, AUC) are useful but do not always translate to business value. For the NEPSE system, the ultimate measure might be **profit** from a trading strategy. When analysing an A/B test, you should compute the business impact directly.

### 47.8.1 Example: Profit per Trade

Suppose you have a trading algorithm that uses the model's predictions. You can simulate its performance on the historical data for both control and treatment groups. Then compare the total profit or Sharpe ratio.

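**Caveat:** The trading strategy itself may introduce variance. You might need to run multiple simulations or use the bootstrap to get confidence intervals. Keep in mind that resampling individual rows, as in the sketch that follows, treats observations as independent; for autocorrelated returns, a block bootstrap that resamples whole days or contiguous runs of days is usually more appropriate.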

```python
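import numpy as np

def bootstrap_profit(predictions_df, n_bootstrap=1000):
    """95% percentile CI for total profit; relies on simulate_trading() from Section 47.2.2."""
    profits = []
    for _ in range(n_bootstrap):
        # Resample rows with replacement, keeping the original number of rows
        sample = predictions_df.sample(frac=1.0, replace=True)
        profit = simulate_trading(sample)
        profits.append(profit)
    return np.percentile(profits, [2.5, 97.5])  # 95% CI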
```
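To compare the two groups directly, one option is to bootstrap the difference in mean daily return and to report an annualised Sharpe ratio for each group. The sketch below makes several simplifying assumptions: days are resampled independently, the risk‑free rate is zero, a year has 250 trading days, and the return series are synthetic placeholders.

```python
import numpy as np

def sharpe_ratio(daily_returns, periods_per_year=250):
    """Annualised Sharpe ratio with a zero risk-free rate (a simplifying assumption)."""
    daily_returns = np.asarray(daily_returns)
    return np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std(ddof=1)

def bootstrap_mean_diff(control, treatment, n_bootstrap=2000, seed=0):
    """95% percentile CI for mean(treatment) - mean(control), resampling days i.i.d."""
    rng = np.random.default_rng(seed)
    control, treatment = np.asarray(control), np.asarray(treatment)
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(treatment, size=treatment.size, replace=True)
        diffs[i] = t.mean() - c.mean()
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic daily strategy returns, placeholders for the real per-group series
rng = np.random.default_rng(1)
control_returns = rng.normal(0.0005, 0.01, size=120)
treatment_returns = rng.normal(0.0010, 0.01, size=120)

ci_low, ci_high = bootstrap_mean_diff(control_returns, treatment_returns)
print(f"Sharpe (control):   {sharpe_ratio(control_returns):.2f}")
print(f"Sharpe (treatment): {sharpe_ratio(treatment_returns):.2f}")
print(f"95% CI for difference in mean daily return: [{ci_low:.5f}, {ci_high:.5f}]")
```

If the interval includes zero, the data do not yet support the claim that the treatment model is more profitable, regardless of how the accuracy comparison turned out.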

---

## 47.9 Best Practices

1. **Pre‑register the experiment** – Define your hypothesis, metrics, and sample size before starting.
2. **Randomise properly** – Avoid systematic biases.
3. **Monitor for side effects** – The new model might be better on your primary metric but worse on others (e.g., latency). Track multiple metrics.
4. **Use both statistical and practical significance** – A tiny improvement may not be worth the deployment cost.
5. **Segment analysis** – The new model might work better for some symbols or market conditions. Check for heterogeneity.
6. **Avoid peeking** – Stick to the planned duration, or use sequential testing.
7. **Document results** – Save all experiment data for future reference.

---

## Chapter Summary

In this chapter, we explored how to rigorously compare models in production using A/B testing and related techniques, all within the context of the NEPSE prediction system. We covered:

- The fundamentals of A/B testing and its importance for validating model improvements.
- Experimental design considerations for time‑series data, including randomisation by symbol and metric selection.
- Sample size calculation to ensure experiments are adequately powered.
- Statistical significance tests for different metric types (chi‑square, t‑test, Mann‑Whitney).
- The dangers of peeking and how to avoid inflated false positives.
- Multi‑armed bandit algorithms (epsilon‑greedy, Thompson sampling) that dynamically allocate traffic while testing.
- Multivariate testing for evaluating multiple factors simultaneously.
- Building a champion/challenger framework for continuous model comparison.
- Translating technical metrics into business impact, such as trading profits.

By applying these principles, you can make data‑driven decisions about model updates, ensuring that only genuinely better models reach production. In the next chapter, we will discuss **Scalability and Performance Optimization**, focusing on how to handle increasing data volumes and prediction loads in your NEPSE system.

---

**End of Chapter 47**