# ML Testing Strategies

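This notebook covers testing strategies for ML systems, from unit tests on individual components through load and chaos tests on inference services. These are the practices expected in FAANG-level ML engineering.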

## Topics Covered
1. **Unit Testing ML Components** - Testing transformers, models, and utilities
2. **Integration Testing** - End-to-end pipeline testing
3. **Model Validation** - Performance and regression testing
4. **Data Quality Testing** - Schema validation and data contracts
5. **A/B Testing** - Statistical significance and experiment design
6. **Load & Chaos Testing** - Performance and resilience testing

In [None]:
import torch
import torch.nn as nn
import numpy as np
from typing import Dict, List, Any, Optional, Callable, Tuple
from dataclasses import dataclass, field
from enum import Enum
from abc import ABC, abstractmethod
import json
import time
from datetime import datetime
import hashlib
import logging
from scipy import stats
import unittest
from unittest.mock import Mock, patch, MagicMock

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

## 1. Unit Testing ML Components

Testing individual components: transformers, feature engineering, and model layers.
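
As a minimal sketch of the kind of invariant a unit test can pin down, the block below checks that standardization statistics are computed on the training split only and then reused unchanged on held-out data. The helpers `fit_standardizer` and `apply_standardizer` are hypothetical stand-ins written for this illustration; they are not the notebook's own components.

```python
import numpy as np

def fit_standardizer(X_train):
    """Hypothetical helper: column-wise mean and std from training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns: avoid dividing by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Hypothetical helper: apply previously fitted statistics."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_train[:, 2] = 7.0  # a constant column exercises the zero-std guard
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

mean, std = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, std)
Z_test = apply_standardizer(X_test, mean, std)

# Non-constant training columns end up with mean ~0 and std ~1
np.testing.assert_allclose(Z_train[:, :2].mean(axis=0), 0.0, atol=1e-9)
np.testing.assert_allclose(Z_train[:, :2].std(axis=0), 1.0, atol=1e-9)

# The constant column maps to exactly zero, with no inf or NaN anywhere
assert np.all(Z_train[:, 2] == 0.0)
assert np.all(np.isfinite(Z_test))

# Held-out data is scaled with the training statistics, not its own,
# so its column means are close to zero but not exactly zero
assert not np.allclose(Z_test[:, :2].mean(axis=0), 0.0, atol=1e-9)
```

Each assertion targets one failure mode (wrong statistics, division by zero, leakage from the test split), so a failing test points directly at the broken behavior.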

In [None]:
# Example ML components to test

class FeatureTransformer:
    """Feature transformation pipeline"""
    
    def __init__(self, normalize: bool = True, fill_na: float = 0.0):
        self.normalize = normalize
        self.fill_na = fill_na
        self.mean_ = None
        self.std_ = None
        self.is_fitted = False
    
    def fit(self, X: np.ndarray) -> 'FeatureTransformer':
        """Fit the transformer on training data"""
        # Handle NaN values
        X_clean = np.nan_to_num(X, nan=self.fill_na)
        
        if self.normalize:
            self.mean_ = np.mean(X_clean, axis=0)
            self.std_ = np.std(X_clean, axis=0)
            # Prevent division by zero
            self.std_[self.std_ == 0] = 1.0
        
        self.is_fitted = True
        return self
    
    def transform(self, X: np.ndarray) -> np.ndarray:
        """Transform data using fitted parameters"""
        if not self.is_fitted:
            raise ValueError("Transformer must be fitted before transform")
        
        X_clean = np.nan_to_num(X, nan=self.fill_na)
        
        if self.normalize:
            return (X_clean - self.mean_) / self.std_
        return X_clean
    
    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        """Fit and transform in one step"""
        return self.fit(X).transform(X)


class SimpleClassifier(nn.Module):
    """Simple neural network classifier for testing"""
    
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, num_classes)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

In [None]:
class TestFeatureTransformer(unittest.TestCase):
    """Unit tests for FeatureTransformer"""
    
    def setUp(self):
        """Set up test fixtures"""
        self.transformer = FeatureTransformer(normalize=True)
        self.X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        self.X_test = np.array([[2.0, 3.0], [4.0, 5.0]])
    
    def test_fit_sets_parameters(self):
        """Test that fit() computes mean and std correctly"""
        self.transformer.fit(self.X_train)
        
        expected_mean = np.mean(self.X_train, axis=0)
        expected_std = np.std(self.X_train, axis=0)
        
        np.testing.assert_array_almost_equal(self.transformer.mean_, expected_mean)
        np.testing.assert_array_almost_equal(self.transformer.std_, expected_std)
        self.assertTrue(self.transformer.is_fitted)
    
    def test_transform_normalizes_data(self):
        """Test that transform() normalizes data correctly"""
        self.transformer.fit(self.X_train)
        X_transformed = self.transformer.transform(self.X_train)
        
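        # After normalization, each column should have mean ~0 and std ~1
        np.testing.assert_array_almost_equal(
            np.mean(X_transformed, axis=0),
            np.zeros(2),
            decimal=5
        )
        np.testing.assert_array_almost_equal(
            np.std(X_transformed, axis=0),
            np.ones(2),
            decimal=5
        )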
    
    def test_transform_before_fit_raises_error(self):
        """Test that transform without fit raises ValueError"""
        with self.assertRaises(ValueError):
            self.transformer.transform(self.X_train)
    
    def test_handles_nan_values(self):
        """Test NaN handling"""
        X_with_nan = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
        transformer = FeatureTransformer(normalize=True, fill_na=-999)
        
        # Should not raise
        X_transformed = transformer.fit_transform(X_with_nan)
        
        # No NaN in output
        self.assertFalse(np.any(np.isnan(X_transformed)))
    
    def test_zero_std_handling(self):
        """Test that zero std columns are handled"""
        X_constant = np.array([[1.0, 5.0], [1.0, 5.0], [1.0, 5.0]])
        self.transformer.fit(X_constant)
        
        # Should not raise division by zero
        X_transformed = self.transformer.transform(X_constant)
        self.assertFalse(np.any(np.isinf(X_transformed)))


class TestSimpleClassifier(unittest.TestCase):
    """Unit tests for neural network model"""
    
    def setUp(self):
        self.model = SimpleClassifier(input_dim=10, hidden_dim=32, num_classes=3)
        self.batch_size = 4
    
    def test_output_shape(self):
        """Test that output has correct shape"""
        x = torch.randn(self.batch_size, 10)
        output = self.model(x)
        
        self.assertEqual(output.shape, (self.batch_size, 3))
    
    def test_forward_with_single_sample(self):
        """Test forward pass with single sample"""
        x = torch.randn(1, 10)
        output = self.model(x)
        
        self.assertEqual(output.shape, (1, 3))
    
    def test_gradient_flow(self):
        """Test that gradients flow through all layers"""
        x = torch.randn(self.batch_size, 10, requires_grad=True)
        output = self.model(x)
        loss = output.sum()
        loss.backward()
        
        # Check all parameters have gradients
        for name, param in self.model.named_parameters():
            self.assertIsNotNone(param.grad, f"No gradient for {name}")
            self.assertFalse(
                torch.all(param.grad == 0), 
                f"Zero gradient for {name}"
            )
    
    def test_deterministic_output(self):
        """Test that model produces deterministic output in eval mode"""
        self.model.train(False)  # equivalent to self.model.eval()
        x = torch.randn(self.batch_size, 10)
        
        with torch.no_grad():
            output1 = self.model(x)
            output2 = self.model(x)
        
        torch.testing.assert_close(output1, output2)


# Run tests
suite = unittest.TestLoader().loadTestsFromTestCase(TestFeatureTransformer)
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSimpleClassifier))

runner = unittest.TextTestRunner(verbosity=2)
result = runner.run(suite)

print(f"\nTests run: {result.testsRun}")
print(f"Failures: {len(result.failures)}")
print(f"Errors: {len(result.errors)}")

## 2. Integration Testing

Testing complete ML pipelines end-to-end.
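
One end-to-end property worth checking is that a pipeline gives the same answer whether it scores a batch or one row at a time, since training usually runs in batches while serving often handles single requests. The sketch below illustrates this with `CentroidPipeline`, a hypothetical NumPy-only stand-in (standardize, then nearest-centroid classification) chosen so the example runs without a deep learning framework.

```python
import numpy as np

class CentroidPipeline:
    """Hypothetical stand-in pipeline: standardize, then nearest-centroid classify."""

    def fit(self, X, y):
        self.mean_ = X.mean(axis=0)
        std = X.std(axis=0)
        self.std_ = np.where(std == 0, 1.0, std)
        Z = (X - self.mean_) / self.std_
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([Z[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        Z = (X - self.mean_) / self.std_
        # Pairwise distances, shape (n_samples, n_classes)
        dists = np.linalg.norm(Z[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(dists, axis=1)]

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 4)), rng.normal(2.0, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

pipeline = CentroidPipeline().fit(X, y)

# End-to-end check 1: well-separated classes should be learned almost perfectly
train_accuracy = float(np.mean(pipeline.predict(X) == y))
assert train_accuracy > 0.95

# End-to-end check 2: batch and row-by-row predictions must agree
X_new = rng.normal(0.0, 2.0, (20, 4))
batch_preds = pipeline.predict(X_new)
single_preds = np.array([pipeline.predict(row[None, :])[0] for row in X_new])
assert np.array_equal(batch_preds, single_preds)
```

The same two checks carry over to a neural pipeline, provided the model is in inference mode so layers such as dropout or batch normalization do not make outputs depend on batch composition.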

In [None]:
@dataclass
class PipelineConfig:
    """Configuration for ML pipeline"""
    input_dim: int
    hidden_dim: int
    num_classes: int
    normalize: bool = True
    batch_size: int = 32


class MLPipeline:
    """
    Complete ML pipeline for testing.
    """
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.transformer = FeatureTransformer(normalize=config.normalize)
        self.model = SimpleClassifier(
            config.input_dim, 
            config.hidden_dim, 
            config.num_classes
        )
        self.is_trained = False
    
    def preprocess(self, X: np.ndarray, fit: bool = False) -> torch.Tensor:
        """Preprocess input data"""
        if fit:
            X_transformed = self.transformer.fit_transform(X)
        else:
            X_transformed = self.transformer.transform(X)
        return torch.tensor(X_transformed, dtype=torch.float32)
    
    def train(self, X: np.ndarray, y: np.ndarray, epochs: int = 10) -> Dict[str, List[float]]:
        """Train the model"""
        X_tensor = self.preprocess(X, fit=True)
        y_tensor = torch.tensor(y, dtype=torch.long)
        
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.model.parameters(), lr=0.01)
        
        history = {'loss': [], 'accuracy': []}
        
        self.model.train()
        for epoch in range(epochs):
            optimizer.zero_grad()
            outputs = self.model(X_tensor)
            loss = criterion(outputs, y_tensor)
            loss.backward()
            optimizer.step()
            
            # Calculate accuracy
            predictions = torch.argmax(outputs, dim=1)
            accuracy = (predictions == y_tensor).float().mean().item()
            
            history['loss'].append(loss.item())
            history['accuracy'].append(accuracy)
        
        self.is_trained = True
        return history
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make predictions"""
        if not self.is_trained:
            raise ValueError("Model must be trained before prediction")
        
        self.model.train(False)
        X_tensor = self.preprocess(X, fit=False)
        
        with torch.no_grad():
            outputs = self.model(X_tensor)
            predictions = torch.argmax(outputs, dim=1)
        
        return predictions.numpy()
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Get prediction probabilities"""
        if not self.is_trained:
            raise ValueError("Model must be trained before prediction")
        
        self.model.train(False)
        X_tensor = self.preprocess(X, fit=False)
        
        with torch.no_grad():
            outputs = self.model(X_tensor)
            probabilities = torch.softmax(outputs, dim=1)
        
        return probabilities.numpy()

In [None]:
class TestMLPipelineIntegration(unittest.TestCase):
    """Integration tests for ML pipeline"""
    
    @classmethod
    def setUpClass(cls):
        """Set up test data once for all tests"""
        np.random.seed(42)
        cls.n_samples = 100
        cls.n_features = 10
        cls.n_classes = 3
        
        # Generate synthetic data
        cls.X = np.random.randn(cls.n_samples, cls.n_features)
        cls.y = np.random.randint(0, cls.n_classes, cls.n_samples)
        
        cls.config = PipelineConfig(
            input_dim=cls.n_features,
            hidden_dim=32,
            num_classes=cls.n_classes
        )
    
    def setUp(self):
        """Set up fresh pipeline for each test"""
        self.pipeline = MLPipeline(self.config)
    
    def test_end_to_end_training(self):
        """Test complete training workflow"""
        history = self.pipeline.train(self.X, self.y, epochs=5)
        
        # Check training completed
        self.assertTrue(self.pipeline.is_trained)
        self.assertEqual(len(history['loss']), 5)
        self.assertEqual(len(history['accuracy']), 5)
        
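        # Loose sanity check: the loss must not diverge. A strict decrease is
        # not asserted because the labels are random and only 5 epochs are run.
        self.assertLess(history['loss'][-1], history['loss'][0] * 2)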
    
    def test_prediction_after_training(self):
        """Test that predictions work after training"""
        self.pipeline.train(self.X, self.y, epochs=5)
        
        predictions = self.pipeline.predict(self.X)
        
        self.assertEqual(len(predictions), self.n_samples)
        self.assertTrue(all(0 <= p < self.n_classes for p in predictions))
    
    def test_probability_predictions_sum_to_one(self):
        """Test that probabilities sum to 1"""
        self.pipeline.train(self.X, self.y, epochs=5)
        
        probabilities = self.pipeline.predict_proba(self.X)
        
        self.assertEqual(probabilities.shape, (self.n_samples, self.n_classes))
        np.testing.assert_array_almost_equal(
            probabilities.sum(axis=1), 
            np.ones(self.n_samples)
        )
    
    def test_prediction_before_training_fails(self):
        """Test that prediction before training raises error"""
        with self.assertRaises(ValueError):
            self.pipeline.predict(self.X)
    
    def test_new_data_prediction(self):
        """Test prediction on new unseen data"""
        self.pipeline.train(self.X, self.y, epochs=5)
        
        # New data with same features
        X_new = np.random.randn(10, self.n_features)
        predictions = self.pipeline.predict(X_new)
        
        self.assertEqual(len(predictions), 10)


# Run integration tests
suite = unittest.TestLoader().loadTestsFromTestCase(TestMLPipelineIntegration)
runner = unittest.TextTestRunner(verbosity=2)
result = runner.run(suite)

print(f"\nIntegration Tests - Run: {result.testsRun}, Failures: {len(result.failures)}")

## 3. Model Validation & Regression Testing

Ensuring model quality and detecting performance regressions.
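
A regression check compares each metric against a stored baseline and has to know which direction is "worse": a drop for accuracy, an increase for latency. The following sketch shows that logic in isolation. The function names and the 5% default tolerance are assumptions made for illustration.

```python
def relative_change_pct(baseline: float, current: float) -> float:
    """Signed percentage change of `current` relative to `baseline`."""
    return (current - baseline) / baseline * 100.0

def is_regression(baseline: float, current: float,
                  higher_is_better: bool, tolerance_pct: float = 5.0) -> bool:
    """Flag a regression when the metric moves in the bad direction beyond the tolerance."""
    change = relative_change_pct(baseline, current)
    if higher_is_better:
        return change < -tolerance_pct
    return change > tolerance_pct

# Accuracy 0.90 -> 0.87 is a 3.3% relative drop, inside a 5% tolerance
accuracy_regressed = is_regression(0.90, 0.87, higher_is_better=True)

# Accuracy 0.90 -> 0.84 is a 6.7% relative drop, outside the tolerance
accuracy_regressed_badly = is_regression(0.90, 0.84, higher_is_better=True)

# Latency 40 ms -> 43 ms is a 7.5% increase, a regression for a lower-is-better metric
latency_regressed = is_regression(40.0, 43.0, higher_is_better=False)

print(accuracy_regressed, accuracy_regressed_badly, latency_regressed)
# False True True
```

Note that the tolerance is relative: a 5% budget on a 0.90 baseline accuracy permits a drop to 0.855. For metrics that are already close to their ceiling, an absolute threshold may be the safer choice.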

In [None]:
@dataclass
class ModelMetrics:
    """Metrics for model validation"""
    accuracy: float
    precision: float
    recall: float
    f1_score: float
    latency_p50_ms: float
    latency_p99_ms: float
    memory_mb: float
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


class ModelValidator:
    """
    Validates ML models against quality thresholds.
    """
    
    def __init__(
        self,
        min_accuracy: float = 0.8,
        min_precision: float = 0.75,
        min_recall: float = 0.75,
        max_latency_p99_ms: float = 100,
        max_memory_mb: float = 1000
    ):
        self.thresholds = {
            'accuracy': min_accuracy,
            'precision': min_precision,
            'recall': min_recall,
            'latency_p99_ms': max_latency_p99_ms,
            'memory_mb': max_memory_mb
        }
    
    def compute_metrics(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        latencies: List[float],
        memory_mb: float
    ) -> ModelMetrics:
        """Compute all model metrics"""
        # Basic metrics
        accuracy = np.mean(y_true == y_pred)
        
        # Per-class metrics (simplified for multi-class)
        unique_classes = np.unique(y_true)
        precisions = []
        recalls = []
        
        for cls in unique_classes:
            tp = np.sum((y_pred == cls) & (y_true == cls))
            fp = np.sum((y_pred == cls) & (y_true != cls))
            fn = np.sum((y_pred != cls) & (y_true == cls))
            
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            
            precisions.append(precision)
            recalls.append(recall)
        
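        # Macro-average precision and recall across classes. The F1 below is
        # the harmonic mean of those averages, which is not the same as
        # macro-F1 (the mean of per-class F1 scores).
        avg_precision = np.mean(precisions)
        avg_recall = np.mean(recalls)
        f1 = 2 * avg_precision * avg_recall / (avg_precision + avg_recall + 1e-8)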
        
        return ModelMetrics(
            accuracy=accuracy,
            precision=avg_precision,
            recall=avg_recall,
            f1_score=f1,
            latency_p50_ms=np.percentile(latencies, 50),
            latency_p99_ms=np.percentile(latencies, 99),
            memory_mb=memory_mb
        )
    
    def validate(self, metrics: ModelMetrics) -> Tuple[bool, List[str]]:
        """Validate metrics against thresholds"""
        violations = []
        
        if metrics.accuracy < self.thresholds['accuracy']:
            violations.append(
                f"Accuracy {metrics.accuracy:.3f} < {self.thresholds['accuracy']}"
            )
        
        if metrics.precision < self.thresholds['precision']:
            violations.append(
                f"Precision {metrics.precision:.3f} < {self.thresholds['precision']}"
            )
        
        if metrics.recall < self.thresholds['recall']:
            violations.append(
                f"Recall {metrics.recall:.3f} < {self.thresholds['recall']}"
            )
        
        if metrics.latency_p99_ms > self.thresholds['latency_p99_ms']:
            violations.append(
                f"Latency P99 {metrics.latency_p99_ms:.1f}ms > {self.thresholds['latency_p99_ms']}ms"
            )
        
        if metrics.memory_mb > self.thresholds['memory_mb']:
            violations.append(
                f"Memory {metrics.memory_mb:.1f}MB > {self.thresholds['memory_mb']}MB"
            )
        
        return len(violations) == 0, violations


class RegressionTester:
    """
    Tests for model regressions compared to baseline.
    """
    
    def __init__(self, allowed_regression_pct: float = 5.0):
        self.allowed_regression_pct = allowed_regression_pct
        self.baseline_metrics: Optional[ModelMetrics] = None
    
    def set_baseline(self, metrics: ModelMetrics) -> None:
        """Set baseline metrics for comparison"""
        self.baseline_metrics = metrics
    
    def check_regression(
        self,
        current_metrics: ModelMetrics
    ) -> Tuple[bool, Dict[str, Any]]:
        """Check for regressions against baseline"""
        if self.baseline_metrics is None:
            return True, {"message": "No baseline set"}
        
        results = {
            "regressions": [],
            "improvements": [],
            "details": {}
        }
        
        # Check accuracy (higher is better)
        acc_change_pct = (
            (current_metrics.accuracy - self.baseline_metrics.accuracy) / 
            self.baseline_metrics.accuracy * 100
        )
        results["details"]["accuracy_change_pct"] = acc_change_pct
        
        if acc_change_pct < -self.allowed_regression_pct:
            results["regressions"].append(
                f"Accuracy regressed by {abs(acc_change_pct):.1f}%"
            )
        elif acc_change_pct > self.allowed_regression_pct:
            results["improvements"].append(
                f"Accuracy improved by {acc_change_pct:.1f}%"
            )
        
        # Check latency (lower is better)
        latency_change_pct = (
            (current_metrics.latency_p99_ms - self.baseline_metrics.latency_p99_ms) /
            self.baseline_metrics.latency_p99_ms * 100
        )
        results["details"]["latency_change_pct"] = latency_change_pct
        
        if latency_change_pct > self.allowed_regression_pct:
            results["regressions"].append(
                f"Latency regressed by {latency_change_pct:.1f}%"
            )
        elif latency_change_pct < -self.allowed_regression_pct:
            results["improvements"].append(
                f"Latency improved by {abs(latency_change_pct):.1f}%"
            )
        
        has_regression = len(results["regressions"]) > 0
        return not has_regression, results


# Example: Model Validation
validator = ModelValidator(min_accuracy=0.85, min_precision=0.80)

# Simulate metrics
y_true = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
y_pred = np.array([0, 1, 2, 0, 1, 1, 0, 1, 2, 0])  # One error
latencies = np.random.exponential(10, 100).tolist()  # ms

metrics = validator.compute_metrics(y_true, y_pred, latencies, memory_mb=256)
passed, violations = validator.validate(metrics)

print("Model Metrics:")
print(f"  Accuracy: {metrics.accuracy:.3f}")
print(f"  Precision: {metrics.precision:.3f}")
print(f"  Recall: {metrics.recall:.3f}")
print(f"  F1 Score: {metrics.f1_score:.3f}")
print(f"  Latency P99: {metrics.latency_p99_ms:.1f}ms")
print(f"\nValidation: {'PASSED' if passed else 'FAILED'}")
if violations:
    print(f"Violations: {violations}")

## 4. Data Quality Testing

Testing data quality and schema validation (inspired by Great Expectations).
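
A data contract can be as simple as a mapping from column name to expected type and nullability, checked before any training or inference job runs. The sketch below assumes a hypothetical `SCHEMA` dictionary and `check_contract` function. It is a simplified illustration of the idea and does not use the Great Expectations API.

```python
import numpy as np

# Hypothetical contract: column name -> (expected NumPy dtype family, nulls allowed?)
SCHEMA = {
    "age": (np.floating, False),
    "income": (np.floating, True),
    "label": (np.integer, False),
}

def check_contract(data, schema):
    """Return a list of human-readable violations; an empty list means the data conforms."""
    violations = []
    for column, (dtype_family, nullable) in schema.items():
        if column not in data:
            violations.append(f"{column}: missing column")
            continue
        values = data[column]
        if not np.issubdtype(values.dtype, dtype_family):
            violations.append(f"{column}: dtype {values.dtype} is not {dtype_family.__name__}")
            continue
        # NaN can only occur in floating-point columns
        if not nullable and np.issubdtype(values.dtype, np.floating):
            null_count = int(np.isnan(values).sum())
            if null_count > 0:
                violations.append(f"{column}: {null_count} null value(s)")
    return violations

good = {
    "age": np.array([31.0, 45.0, 27.0]),
    "income": np.array([52_000.0, np.nan, 61_500.0]),
    "label": np.array([0, 1, 1]),
}
bad = {
    "age": np.array([31.0, np.nan, 27.0]),   # null in a non-nullable column
    "label": np.array([0.0, 1.0, 1.0]),      # float where an integer is required
}                                            # "income" is missing entirely

good_violations = check_contract(good, SCHEMA)
bad_violations = check_contract(bad, SCHEMA)

print(good_violations)
for violation in bad_violations:
    print(violation)
```

Returning the full list of violations, rather than failing on the first one, lets a single validation run report every problem in a batch of data.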

In [None]:
@dataclass
class ColumnExpectation:
    """Expectation for a single column"""
    column: str
    expectation_type: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


class DataExpectationSuite:
    """
    Suite of data quality expectations (Great Expectations style).
    """
    
    def __init__(self, name: str):
        self.name = name
        self.expectations: List[ColumnExpectation] = []
    
    def expect_column_to_exist(self, column: str) -> 'DataExpectationSuite':
        """Expect column to exist in dataset"""
        self.expectations.append(ColumnExpectation(
            column=column,
            expectation_type="column_exists"
        ))
        return self
    
    def expect_column_values_to_not_be_null(self, column: str) -> 'DataExpectationSuite':
        """Expect no null values in column"""
        self.expectations.append(ColumnExpectation(
            column=column,
            expectation_type="no_nulls"
        ))
        return self
    
    def expect_column_values_in_range(
        self,
        column: str,
        min_value: float,
        max_value: float
    ) -> 'DataExpectationSuite':
        """Expect values to be within range"""
        self.expectations.append(ColumnExpectation(
            column=column,
            expectation_type="in_range",
            kwargs={"min": min_value, "max": max_value}
        ))
        return self
    
    def expect_column_values_in_set(
        self,
        column: str,
        value_set: List[Any]
    ) -> 'DataExpectationSuite':
        """Expect values to be in predefined set"""
        self.expectations.append(ColumnExpectation(
            column=column,
            expectation_type="in_set",
            kwargs={"values": value_set}
        ))
        return self
    
    def expect_column_mean_in_range(
        self,
        column: str,
        min_value: float,
        max_value: float
    ) -> 'DataExpectationSuite':
        """Expect column mean to be within range"""
        self.expectations.append(ColumnExpectation(
            column=column,
            expectation_type="mean_in_range",
            kwargs={"min": min_value, "max": max_value}
        ))
        return self


class DataValidator:
    """
    Validates data against expectation suite.
    """
    
    def __init__(self, suite: DataExpectationSuite):
        self.suite = suite
    
    def validate(self, data: Dict[str, np.ndarray]) -> Dict[str, Any]:
        """Validate data against all expectations"""
        results = {
            "success": True,
            "expectations": [],
            "failed_count": 0,
            "passed_count": 0
        }
        
        for expectation in self.suite.expectations:
            result = self._check_expectation(expectation, data)
            results["expectations"].append(result)
            
            if result["success"]:
                results["passed_count"] += 1
            else:
                results["failed_count"] += 1
                results["success"] = False
        
        return results
    
    def _check_expectation(
        self,
        exp: ColumnExpectation,
        data: Dict[str, np.ndarray]
    ) -> Dict[str, Any]:
        """Check a single expectation"""
        result = {
            "expectation_type": exp.expectation_type,
            "column": exp.column,
            "success": False,
            "details": {}
        }
        
        if exp.expectation_type == "column_exists":
            result["success"] = exp.column in data
            result["details"]["found"] = exp.column in data
        
        elif exp.expectation_type == "no_nulls":
            if exp.column in data:
                null_count = np.sum(np.isnan(data[exp.column]))
                result["success"] = null_count == 0
                result["details"]["null_count"] = int(null_count)
        
        elif exp.expectation_type == "in_range":
            if exp.column in data:
                values = data[exp.column]
                min_val, max_val = exp.kwargs["min"], exp.kwargs["max"]
                in_range = np.all((values >= min_val) & (values <= max_val))
                result["success"] = bool(in_range)
                result["details"]["min_observed"] = float(np.min(values))
                result["details"]["max_observed"] = float(np.max(values))
        
        elif exp.expectation_type == "in_set":
            if exp.column in data:
                values = data[exp.column]
                valid_set = set(exp.kwargs["values"])
                in_set = all(v in valid_set for v in values)
                result["success"] = in_set
                result["details"]["unexpected_values"] = list(
                    set(values) - valid_set
                )[:5]
        
        elif exp.expectation_type == "mean_in_range":
            if exp.column in data:
                mean = np.mean(data[exp.column])
                min_val, max_val = exp.kwargs["min"], exp.kwargs["max"]
                result["success"] = min_val <= mean <= max_val
                result["details"]["observed_mean"] = float(mean)
        
        return result


# Example: Data Quality Testing
suite = DataExpectationSuite("training_data")
suite.expect_column_to_exist("feature_1")
suite.expect_column_to_exist("feature_2")
suite.expect_column_values_to_not_be_null("feature_1")
suite.expect_column_values_in_range("feature_1", -10, 10)
suite.expect_column_mean_in_range("feature_1", -1, 1)
suite.expect_column_values_in_set("label", [0, 1, 2])

# Test data
test_data = {
    "feature_1": np.random.randn(100),
    "feature_2": np.random.randn(100),
    "label": np.random.randint(0, 3, 100)
}

validator = DataValidator(suite)
results = validator.validate(test_data)

print("Data Validation Results:")
print(f"  Overall Success: {results['success']}")
print(f"  Passed: {results['passed_count']}, Failed: {results['failed_count']}")
print("\nExpectation Details:")
for exp_result in results['expectations']:
    status = "PASS" if exp_result['success'] else "FAIL"
    print(f"  [{status}] {exp_result['column']}: {exp_result['expectation_type']}")

## 5. A/B Testing

Statistical testing for ML experiments.
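
To make the core calculation concrete, here is a standard-library sketch of the two-proportion z-test with a pooled standard error. The counts (500 of 5000 versus 550 of 5000 conversions) are illustrative, and `two_proportion_z_test` is a hypothetical helper name.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = two_proportion_z_test(conv_a=500, n_a=5000, conv_b=550, n_b=5000)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
print("significant at alpha=0.05:", p_value < 0.05)
```

With these counts the z-statistic is roughly 1.63 and the two-sided p-value roughly 0.10, so a 10% relative lift on 5000 users per arm is not significant at the 0.05 level. This is why sample size should be planned before the experiment starts.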

In [None]:
@dataclass
class ExperimentVariant:
    """A/B test variant"""
    name: str
    conversions: int
    total: int
    
    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.total if self.total > 0 else 0


class ABTestAnalyzer:
    """
    Analyzes A/B test results with statistical rigor.
    """
    
    def __init__(
        self,
        alpha: float = 0.05,  # Significance level
        power: float = 0.80,  # Statistical power
        mde: float = 0.02    # Minimum detectable effect
    ):
        self.alpha = alpha
        self.power = power
        self.mde = mde
    
    def calculate_sample_size(
        self,
        baseline_rate: float
    ) -> int:
        """Calculate required sample size per variant"""
        # Using normal approximation
        z_alpha = stats.norm.ppf(1 - self.alpha / 2)
        z_beta = stats.norm.ppf(self.power)
        
        p1 = baseline_rate
        p2 = baseline_rate + self.mde
        
        pooled_p = (p1 + p2) / 2
        
        numerator = (z_alpha * np.sqrt(2 * pooled_p * (1 - pooled_p)) +
                    z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        denominator = (p2 - p1) ** 2
        
        return int(np.ceil(numerator / denominator))
    
    def z_test_proportions(
        self,
        control: ExperimentVariant,
        treatment: ExperimentVariant
    ) -> Dict[str, Any]:
        """Perform z-test for difference in proportions"""
        p1 = control.conversion_rate
        p2 = treatment.conversion_rate
        n1 = control.total
        n2 = treatment.total
        
        # Pooled proportion
        p_pooled = (control.conversions + treatment.conversions) / (n1 + n2)
        
        # Standard error
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
        
        # Z-statistic
        z = (p2 - p1) / se if se > 0 else 0
        
        # Two-tailed p-value
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
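        # Confidence interval for the difference (unpooled standard error),
        # using the critical value for self.alpha rather than a fixed 1.96
        z_crit = stats.norm.ppf(1 - self.alpha / 2)
        se_diff = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
        ci_lower = (p2 - p1) - z_crit * se_diff
        ci_upper = (p2 - p1) + z_crit * se_diff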
        
        return {
            "test": "z_test",
            "z_statistic": z,
            "p_value": p_value,
            "significant": p_value < self.alpha,
            "control_rate": p1,
            "treatment_rate": p2,
            "absolute_lift": p2 - p1,
            "relative_lift": (p2 - p1) / p1 if p1 > 0 else 0,
            "confidence_interval": (ci_lower, ci_upper)
        }


class BayesianABTest:
    """
    Bayesian A/B testing with Beta-Binomial model.
    """
    
    def __init__(self, prior_alpha: float = 1, prior_beta: float = 1):
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
    
    def analyze(
        self,
        control: ExperimentVariant,
        treatment: ExperimentVariant,
        n_samples: int = 10000
    ) -> Dict[str, Any]:
        """Analyze using Bayesian inference"""
        # Posterior distributions (Beta)
        control_posterior = stats.beta(
            self.prior_alpha + control.conversions,
            self.prior_beta + control.total - control.conversions
        )
        
        treatment_posterior = stats.beta(
            self.prior_alpha + treatment.conversions,
            self.prior_beta + treatment.total - treatment.conversions
        )
        
        # Sample from posteriors
        control_samples = control_posterior.rvs(n_samples)
        treatment_samples = treatment_posterior.rvs(n_samples)
        
        # Probability that treatment > control
        prob_treatment_better = np.mean(treatment_samples > control_samples)
        
        # Credible intervals (95%)
        control_ci = control_posterior.ppf([0.025, 0.975])
        treatment_ci = treatment_posterior.ppf([0.025, 0.975])
        
        # Expected lift
        lift_samples = (treatment_samples - control_samples) / control_samples
        expected_lift = np.mean(lift_samples)
        lift_ci = np.percentile(lift_samples, [2.5, 97.5])
        
        return {
            "method": "bayesian",
            "prob_treatment_better": prob_treatment_better,
            "control_mean": control_posterior.mean(),
            "control_ci": tuple(control_ci),
            "treatment_mean": treatment_posterior.mean(),
            "treatment_ci": tuple(treatment_ci),
            "expected_lift": expected_lift,
            "lift_ci": tuple(lift_ci),
            "decision": "Treatment wins" if prob_treatment_better > 0.95 else "No clear winner"
        }


# Example: A/B Test Analysis
control = ExperimentVariant(name="Model_v1", conversions=500, total=5000)
treatment = ExperimentVariant(name="Model_v2", conversions=550, total=5000)

# Frequentist analysis
analyzer = ABTestAnalyzer(alpha=0.05)
z_result = analyzer.z_test_proportions(control, treatment)

print("Frequentist Analysis (Z-test):")
print(f"  Control Rate: {z_result['control_rate']:.3f}")
print(f"  Treatment Rate: {z_result['treatment_rate']:.3f}")
print(f"  Relative Lift: {z_result['relative_lift']:.1%}")
print(f"  P-value: {z_result['p_value']:.4f}")
print(f"  Significant: {z_result['significant']}")
print(f"  95% CI: ({z_result['confidence_interval'][0]:.4f}, {z_result['confidence_interval'][1]:.4f})")

# Bayesian analysis
bayesian = BayesianABTest()
bayes_result = bayesian.analyze(control, treatment)

print("\nBayesian Analysis:")
print(f"  P(Treatment > Control): {bayes_result['prob_treatment_better']:.1%}")
print(f"  Expected Lift: {bayes_result['expected_lift']:.1%}")
print(f"  Decision: {bayes_result['decision']}")

## 6. Load & Chaos Testing

Testing ML service performance and resilience.
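
Chaos testing injects faults on purpose and checks that the system degrades gracefully. As a rough sketch, the block below wraps a handler so that it fails at a fixed rate, then measures how much a simple retry policy reduces the error rate seen by callers. `FlakyService`, `call_with_retry`, the 20% failure rate, and the seeded random generator are all assumptions made for this illustration.

```python
import random

class FlakyService:
    """Hypothetical wrapper that injects failures at a fixed rate."""

    def __init__(self, handler, failure_rate, seed=0):
        self.handler = handler
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so the experiment is repeatable

    def __call__(self, request):
        if self.rng.random() < self.failure_rate:
            raise RuntimeError("injected fault")
        return self.handler(request)

def call_with_retry(service, request, max_attempts=3):
    """Call the service, retrying on failure; re-raise after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return service(request), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise

service = FlakyService(handler=lambda x: x * 2, failure_rate=0.2, seed=123)

successes = 0
failures = 0
for i in range(1000):
    try:
        result, _ = call_with_retry(service, i)
        assert result == i * 2  # responses that do arrive must still be correct
        successes += 1
    except RuntimeError:
        failures += 1

error_rate = failures / 1000
print(f"successes={successes}, failures={failures}, error_rate={error_rate:.3f}")
```

With independent failures at 20% and three attempts, a request fails only if all three attempts fail, which is expected about 0.8% of the time (0.2 cubed). Comparing the measured error rate against that expectation is a quick check that the retry logic behaves as intended.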

In [None]:
@dataclass
class LoadTestResult:
    """Results from load test"""
    total_requests: int
    successful_requests: int
    failed_requests: int
    latency_p50_ms: float
    latency_p95_ms: float
    latency_p99_ms: float
    throughput_rps: float
    error_rate: float
    duration_seconds: float


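class MLServiceLoadTester:
    """
    Load testing for ML inference services.

    Calls the model in-process, one request at a time, so the results measure
    raw inference latency rather than network or serving overhead.
    """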
    
    def __init__(
        self,
        model: nn.Module,
        input_shape: Tuple[int, ...]
    ):
        self.model = model
        self.input_shape = input_shape
        self.model.eval()  # inference mode: disables dropout, uses running batch-norm stats
    
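    def _simulate_request(self) -> Tuple[bool, float]:
        """Simulate a single inference request; returns (success, latency in ms)."""
        # perf_counter is monotonic and higher resolution than time.time()
        start_time = time.perf_counter()
        
        try:
            with torch.no_grad():
                x = torch.randn(1, *self.input_shape)
                _ = self.model(x)
            
            latency = (time.perf_counter() - start_time) * 1000  # ms
            return True, latency
        except Exception:
            # Count any exception as a failed request but still record its latency
            latency = (time.perf_counter() - start_time) * 1000
            return False, latency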
    
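    def run_load_test(
        self,
        num_requests: int = 1000,
        target_rps: Optional[float] = None
    ) -> LoadTestResult:
        """Run a sequential load test, optionally paced at target_rps."""
        if num_requests <= 0:
            raise ValueError("num_requests must be positive")
        
        latencies = []
        successes = 0
        failures = 0
        
        # perf_counter is monotonic, so pacing is unaffected by system clock changes
        start_time = time.perf_counter()
        
        for i in range(num_requests):
            success, latency = self._simulate_request()
            latencies.append(latency)
            
            if success:
                successes += 1
            else:
                failures += 1
            
            # Rate limiting: sleep until request i + 1 is due at target_rps
            if target_rps:
                expected_time = (i + 1) / target_rps
                elapsed = time.perf_counter() - start_time
                if elapsed < expected_time:
                    time.sleep(expected_time - elapsed)
        
        duration = time.perf_counter() - start_time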
        
        return LoadTestResult(
            total_requests=num_requests,
            successful_requests=successes,
            failed_requests=failures,
            latency_p50_ms=np.percentile(latencies, 50),
            latency_p95_ms=np.percentile(latencies, 95),
            latency_p99_ms=np.percentile(latencies, 99),
            throughput_rps=num_requests / duration,
            error_rate=failures / num_requests,
            duration_seconds=duration
        )


class ChaosEngineer:
    """
    Chaos engineering for ML systems.
    """
    
    def __init__(self):
        self.experiments: List[Dict] = []
    
    def inject_latency(
        self,
        func: Callable,
        latency_ms: float,
        probability: float = 1.0
    ) -> Callable:
        """Inject latency into function calls"""
        def wrapper(*args, **kwargs):
            if np.random.random() < probability:
                time.sleep(latency_ms / 1000)
            return func(*args, **kwargs)
        return wrapper
    
    def inject_failure(
        self,
        func: Callable,
        failure_rate: float = 0.1,
        exception_type: type = RuntimeError
    ) -> Callable:
        """Inject random failures"""
        def wrapper(*args, **kwargs):
            if np.random.random() < failure_rate:
                raise exception_type("Injected chaos failure")
            return func(*args, **kwargs)
        return wrapper
    
    def run_experiment(
        self,
        name: str,
        target_func: Callable,
        chaos_func: Callable,
        iterations: int = 100
    ) -> Dict[str, Any]:
        """Run target_func wrapped by chaos_func and record how often it succeeds."""
        results = {
            "name": name,
            "iterations": iterations,
            "successes": 0,
            "failures": 0,
            "errors": []
        }
        
        # Wrap target with chaos
        chaotic_func = chaos_func(target_func)
        
        for _ in range(iterations):
            try:
                chaotic_func()
                results["successes"] += 1
            except Exception as e:
                results["failures"] += 1
                results["errors"].append(str(e))
        
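        # Fraction of iterations that completed without raising (0.0 if none were run)
        results["resilience_score"] = results["successes"] / iterations if iterations else 0.0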
        self.experiments.append(results)
        
        return results


# Example: Load Testing
model = SimpleClassifier(input_dim=10, hidden_dim=32, num_classes=3)
load_tester = MLServiceLoadTester(model, input_shape=(10,))

print("Running load test...")
result = load_tester.run_load_test(num_requests=500)

print("\nLoad Test Results:")
print(f"  Total Requests: {result.total_requests}")
print(f"  Success Rate: {(1 - result.error_rate):.1%}")
print(f"  Throughput: {result.throughput_rps:.1f} RPS")
print(f"  Latency P50: {result.latency_p50_ms:.2f}ms")
print(f"  Latency P95: {result.latency_p95_ms:.2f}ms")
print(f"  Latency P99: {result.latency_p99_ms:.2f}ms")

# Example: Chaos Engineering
chaos = ChaosEngineer()

def inference_function():
    with torch.no_grad():
        x = torch.randn(1, 10)
        return model(x)

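# Test with a 10% injected failure rate. inference_function has no retry or
# fallback, so the resilience score should land near 90%.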
chaos_result = chaos.run_experiment(
    name="failure_injection",
    target_func=inference_function,
    chaos_func=lambda f: chaos.inject_failure(f, failure_rate=0.1),
    iterations=100
)

print("\nChaos Experiment Results:")
print(f"  Experiment: {chaos_result['name']}")
print(f"  Resilience Score: {chaos_result['resilience_score']:.1%}")
print(f"  Failures: {chaos_result['failures']}")
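
The resilience score above mostly mirrors the injected failure rate, because nothing in `inference_function` retries or falls back. The sketch below shows how a simple mitigation changes that number. It assumes a hypothetical `with_retries` helper and a stand-in `make_flaky` function in place of the chaos-wrapped model call (neither is part of this notebook), and it uses only the standard library so it runs on its own.

```python
import random
from typing import Callable


def with_retries(func: Callable[[], str], max_attempts: int = 3) -> Callable[[], str]:
    """Hypothetical helper: call func up to max_attempts times before giving up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def wrapper() -> str:
        for attempt in range(1, max_attempts + 1):
            try:
                return func()
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last failure
        raise AssertionError("unreachable")
    return wrapper


def make_flaky(rng: random.Random, failure_rate: float) -> Callable[[], str]:
    """Stand-in for a chaos-wrapped inference call that fails at failure_rate."""
    def flaky_predict() -> str:
        if rng.random() < failure_rate:
            raise RuntimeError("Injected chaos failure")
        return "ok"
    return flaky_predict


def success_fraction(func: Callable[[], str], iterations: int) -> float:
    """Fraction of calls that return without raising."""
    successes = 0
    for _ in range(iterations):
        try:
            func()
            successes += 1
        except RuntimeError:
            pass
    return successes / iterations


rng = random.Random(0)  # seeded so the comparison is repeatable
flaky = make_flaky(rng, failure_rate=0.1)
baseline_score = success_fraction(flaky, iterations=2000)
retry_score = success_fraction(with_retries(flaky, max_attempts=3), iterations=2000)

print(f"No mitigation:   {baseline_score:.1%}")
print(f"With 3 attempts: {retry_score:.1%}")
```

If failures are independent with rate p, k attempts all fail with probability p^k, so three attempts at a 10% injected rate leave roughly 0.1% of calls failing. Real outages are often correlated, so retries help less there and can add load to a struggling service.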

## FAANG Interview Questions

### Q1: How do you design a comprehensive test suite for an ML model before production?

**Answer:**
I design a multi-layered testing pyramid:

1. **Unit Tests** (fastest, most numerous):
   - Feature transformers: normalization, encoding, handling edge cases
   - Model forward pass: output shapes, gradient flow, determinism
   - Utility functions: data loading, preprocessing

2. **Integration Tests**:
   - End-to-end pipeline: training -> scoring -> prediction
   - API contract testing: input/output schema validation
   - Database/storage integration: model serialization/loading

3. **Model Validation**:
   - Performance thresholds: accuracy, precision, recall
   - Regression tests against baseline
   - Slice-based testing: performance on subgroups

4. **Data Quality Tests**:
   - Schema validation
   - Distribution drift detection
   - Feature completeness
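
The model-validation layer (regression tests against a baseline, plus slice-based checks) could be sketched as follows. The function name `find_slice_regressions`, the slice labels, and the `max_drop` tolerance are illustrative assumptions, not part of the notebook's earlier code.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def find_slice_regressions(
    y_true: Sequence[int],
    baseline_pred: Sequence[int],
    candidate_pred: Sequence[int],
    slices: Sequence[str],
    max_drop: float = 0.02,  # assumed tolerance; tune per metric and slice size
) -> Dict[str, Tuple[float, float]]:
    """Return {slice: (baseline_acc, candidate_acc)} where accuracy dropped by more than max_drop."""
    indices_by_slice: Dict[str, List[int]] = {}
    for idx, name in enumerate(slices):
        indices_by_slice.setdefault(name, []).append(idx)

    regressions: Dict[str, Tuple[float, float]] = {}
    for name, idxs in indices_by_slice.items():
        truth = [y_true[i] for i in idxs]
        base_acc = accuracy(truth, [baseline_pred[i] for i in idxs])
        cand_acc = accuracy(truth, [candidate_pred[i] for i in idxs])
        if base_acc - cand_acc > max_drop:
            regressions[name] = (base_acc, cand_acc)
    return regressions


# Toy data: overall accuracy is identical, but one slice gets worse
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
slices = ["mobile"] * 4 + ["desktop"] * 4
baseline_pred = [1, 0, 1, 0, 0, 1, 0, 1]
candidate_pred = [1, 0, 1, 1, 0, 0, 1, 0]

regressions = find_slice_regressions(y_true, baseline_pred, candidate_pred, slices)
print("Overall baseline: ", accuracy(y_true, baseline_pred))
print("Overall candidate:", accuracy(y_true, candidate_pred))
print("Regressed slices: ", regressions)
```

In this toy data both models score the same overall, yet the candidate loses accuracy on the `desktop` slice. That is the failure mode an aggregate threshold alone would miss.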

### Q2: How would you ensure statistical validity in A/B tests for ML models?

**Answer:**
Key practices:

1. **Pre-registration**: Document the hypothesis, primary metric, and analysis plan before the experiment starts
2. **Sample Size Calculation**: Compute the required n per variant from the minimum detectable effect (MDE), power, and baseline rate
3. **Sequential Testing**: Use alpha spending functions so early stopping does not inflate the false positive rate
4. **Multiple Comparison Correction**: Apply Bonferroni or FDR control when testing multiple metrics
5. **Novelty Effects**: Run the experiment long enough for novelty effects to fade and metrics to stabilize
6. **Guardrail Metrics**: Monitor for negative side effects on metrics the change is not meant to move
7. **Stratified Randomization**: Balance groups across key covariates
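
The sample size step can be sketched with the standard normal-approximation formula for a two-sided two-proportion z-test. The function name `required_sample_size_per_arm` and its defaults (alpha 0.05, power 0.8) are assumptions for illustration. It uses `statistics.NormalDist` from the standard library rather than the `scipy.stats` import used elsewhere in this notebook.

```python
import math
from statistics import NormalDist


def required_sample_size_per_arm(
    baseline_rate: float,
    relative_mde: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Per-arm n for a two-sided two-proportion z-test (normal approximation)."""
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be in (0, 1)")
    if relative_mde <= 0:
        raise ValueError("relative_mde must be positive")

    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate under the minimum detectable effect
    if p2 >= 1:
        raise ValueError("baseline_rate * (1 + relative_mde) must be below 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Same rates as the earlier example: 10% baseline, 10% relative lift (10% -> 11%)
n_per_arm = required_sample_size_per_arm(baseline_rate=0.10, relative_mde=0.10)
print(f"Required sample size per arm: {n_per_arm}")
```

Under these assumptions the formula asks for on the order of 15,000 users per arm. That suggests 5,000 per arm, as in the earlier example, would be underpowered to detect a lift of that size reliably.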

### Q3: What's your strategy for testing ML systems in production?

**Answer:**
Production testing strategy:

1. **Shadow Mode**: Run the new model alongside production on live traffic, logging its predictions without serving them to users
2. **Canary Deployment**: Gradual traffic shift (1% -> 5% -> 25% -> 100%)
3. **Online Assessment**: Real-time metrics (latency, error rate, prediction distribution)
4. **A/B Testing**: Statistically rigorous comparison of business metrics
5. **Chaos Engineering**: Inject failures to test resilience
6. **Load Testing**: Verify performance under peak traffic
7. **Automated Rollback**: Trigger on SLO violations
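
Canary deployment and automated rollback can be combined into one control loop, sketched below. The `CanaryMetrics` and `SLO` shapes, the `run_canary` function, and the simulated `fake_observe` callback are all hypothetical. A real rollout would read these metrics from a monitoring system and act through the deployment platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class CanaryMetrics:
    """Metrics observed for the canary at one traffic stage (assumed shape)."""
    error_rate: float
    latency_p99_ms: float


@dataclass
class SLO:
    """Thresholds that trigger a rollback when exceeded (assumed shape)."""
    max_error_rate: float
    max_latency_p99_ms: float

    def violated_by(self, metrics: CanaryMetrics) -> bool:
        return (
            metrics.error_rate > self.max_error_rate
            or metrics.latency_p99_ms > self.max_latency_p99_ms
        )


def run_canary(
    stages: Sequence[float],
    observe: Callable[[float], CanaryMetrics],
    slo: SLO,
) -> Dict[str, object]:
    """Advance through traffic stages; roll back at the first SLO violation."""
    completed: List[float] = []
    for fraction in stages:
        metrics = observe(fraction)  # shift traffic, then collect canary metrics
        if slo.violated_by(metrics):
            return {"status": "rolled_back", "failed_at": fraction, "completed": completed}
        completed.append(fraction)
    return {"status": "promoted", "failed_at": None, "completed": completed}


def fake_observe(fraction: float) -> CanaryMetrics:
    """Simulated metrics: tail latency degrades as the canary takes more traffic."""
    return CanaryMetrics(error_rate=0.001, latency_p99_ms=80 + 200 * fraction)


slo = SLO(max_error_rate=0.01, max_latency_p99_ms=120.0)
outcome = run_canary([0.01, 0.05, 0.25, 1.0], fake_observe, slo)
print(outcome)
```

With the simulated metrics, the canary passes the 1% and 5% stages and is rolled back at 25%, where p99 latency crosses the 120 ms threshold.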

## Summary

This notebook covered:

1. **Unit Testing**: Testing individual ML components with proper fixtures
2. **Integration Testing**: End-to-end pipeline validation
3. **Model Validation**: Performance thresholds and regression testing
4. **Data Quality**: Schema validation and expectation suites
5. **A/B Testing**: Frequentist and Bayesian statistical analysis
6. **Load & Chaos Testing**: Performance and resilience testing

### Key Takeaways for FAANG Interviews:
- ML systems require comprehensive testing beyond model accuracy
- Data quality testing is as important as code testing
- A/B tests need statistical rigor (power analysis, multiple comparisons)
- Production testing includes shadow mode, canary, and chaos engineering
- Automated testing enables continuous deployment with confidence