# Probability

> Core probability utilities for RBE - normalization, sampling, entropy, and divergence measures

In [None]:
#| default_exp rbe.probability

In [None]:
#| hide
from nbdev.showdoc import *

## Imports and utils

In [None]:
#| export
import numpy as np
from typing import Optional, Union, List
from fastcore.all import *
from scipy.special import entr

In [None]:
from fastcore.test import test_eq, test_close

## Basic Operations

Core probability operations following fast.ai style - short names, clear purpose.

We write the source code first and the tests after it. The tests both confirm that the code works and serve as working examples.

The `normalize` function takes a list or array of numbers and converts them into proper probabilities that sum to 1.

For example, if you have raw scores like `[1, 2, 3]`, it converts them to `[1/6, 2/6, 3/6]` = `[0.167, 0.333, 0.5]`.

This is essential for probability calculations because:
- Probabilities must sum to 1 by definition
- Many algorithms (like sampling) require normalized distributions
- Raw scores from sensors or models often aren't normalized

The function also includes robust error handling for edge cases common in security applications - rejecting negative values, empty arrays, and all-zero inputs that could indicate data corruption or sensor failures.

In [None]:
#| export
def normalize(probs):
    """Normalize probabilities to sum to 1."""
    probs = np.asarray(probs, dtype=np.float64)  # Ensure float64 for precision
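    if probs.size == 0: raise ValueError("Cannot normalize empty array")
    if not np.all(np.isfinite(probs)): raise ValueError("Probabilities must be finite (no NaN or inf)")
    if np.any(probs < 0): raise ValueError("Probabilities must be non-negative")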
    s = np.sum(probs)
    if s == 0: raise ValueError("Cannot normalize zero probabilities")
    return probs / s

In [None]:
normalize([1, 2, 3])

array([0.16666667, 0.33333333, 0.5       ])

In [None]:
# Test normalize function with comprehensive edge cases
# Basic normalization
probs = [1, 2, 3]
normed = normalize(probs)
test_close(np.sum(normed), 1.0)
test_close(normed, [1/6, 2/6, 3/6])

# Already normalized - should remain unchanged
test_close(normalize([0.2, 0.3, 0.5]), [0.2, 0.3, 0.5])

# Single element - critical for RBE edge cases
test_close(normalize([5]), [1.0])

# Uniform distribution
test_close(normalize([1, 1, 1, 1]), [0.25, 0.25, 0.25, 0.25])

# Very small numbers (numerical stability for anomaly scores)
tiny = [1e-10, 2e-10, 3e-10]
normed_tiny = normalize(tiny)
test_close(np.sum(normed_tiny), 1.0)
assert normed_tiny.dtype == np.float64, "Should maintain float64 precision"

# Large numbers (overflow protection)
large = [1e100, 2e100, 3e100]
normed_large = normalize(large) 
test_close(np.sum(normed_large), 1.0)
test_close(normed_large, [1/6, 2/6, 3/6])

# Mixed scales (common in cyber security scores)
mixed = [0.001, 1000, 0.1]
normed_mixed = normalize(mixed)
test_close(np.sum(normed_mixed), 1.0)


In [None]:
# Test error conditions
# Empty array
try:
    normalize([])
    assert False, "Should raise ValueError for empty array"
except ValueError as e:
    assert "empty array" in str(e)

# All zeros
try:
    normalize([0, 0, 0])
    assert False, "Should raise ValueError for zero probabilities"
except ValueError as e:
    assert "zero probabilities" in str(e)

# Negative values (data corruption detection)
try:
    normalize([1, -2, 3])
    assert False, "Should raise ValueError for negative probabilities"
except ValueError as e:
    assert "non-negative" in str(e)

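# NaN values (sensor failure detection)
# NaN input must never yield a valid-looking distribution: normalize should
# either raise ValueError or propagate NaN into its output
try:
    result = normalize([1, np.nan, 3])
except ValueError:
    pass
else:
    assert np.isnan(result).any(), "NaN input must not produce a valid-looking distribution"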

The `sample` function randomly selects indices from a probability distribution. 

Given a list of probabilities (like `[0.1, 0.7, 0.2]`), it returns random indices (0, 1, or 2) where higher probability values are more likely to be chosen. For example, index 1 would be selected about 70% of the time.

Key features:
- Takes any probabilities (automatically normalizes them to sum to 1)
- Returns a single index when `n=1`, or an array of indices for any other `n` (an empty array when `n=0`)
- Uses a controllable random number generator for reproducible results
- Essential for Monte Carlo methods in Recursive Bayesian Estimators

In your cyber security context, this would be useful for simulating network events based on their estimated probabilities or sampling from threat likelihood distributions.
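To make "indices are drawn in proportion to their probability" concrete, here is a minimal sketch of inverse-CDF sampling, one common way to draw indices from a discrete distribution. The helper name `sample_inverse_cdf` is hypothetical and not part of this module, and NumPy's `Generator.choice` is not claimed to work this way internally.

```python
import numpy as np

def sample_inverse_cdf(probs, n, rng):
    """Draw n indices from probs via inverse-CDF sampling (illustrative helper)."""
    probs = np.asarray(probs, dtype=np.float64)
    cdf = np.cumsum(probs / probs.sum())      # cumulative distribution, last entry ~1.0
    u = rng.random(n)                         # uniform draws in [0, 1)
    # Index of the first CDF entry strictly greater than each draw
    idx = np.searchsorted(cdf, u, side="right")
    return np.minimum(idx, len(probs) - 1)    # guard against round-off in the last CDF entry

rng = np.random.default_rng(0)
draws = sample_inverse_cdf([0.1, 0.7, 0.2], n=10_000, rng=rng)
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)  # empirical frequencies, expected to be close to [0.1, 0.7, 0.2]
```

Each uniform draw lands in one slice of the interval [0, 1), and the slice widths are exactly the probabilities, which is why index 1 is chosen about 70% of the time.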

In [None]:
#| export
def sample(probs, # probability distribution
           n=1, # number of samples
           rng=None # random number generator
           ):
    """Sample indices from probability distribution."""
    if rng is None: rng = np.random.default_rng()
    probs = normalize(probs)  # This handles all validation
    if n == 1:
        return rng.choice(len(probs), p=probs)  # Return scalar
    else:
        return rng.choice(len(probs), size=n, p=probs)  # Return array

In [None]:
sample([0.1,0.7,0.2], n=10, rng=np.random.default_rng(42))

array([1, 1, 2, 1, 0, 2, 1, 1, 1, 1])

In [None]:
sample([0.1,0.7,0.2], n=1, rng=np.random.default_rng(42))

1

In [None]:
# Test sample function - critical for RBE Monte Carlo methods

# Basic sampling with fixed seed for reproducibility
rng = np.random.default_rng(42)
samples = sample([0.1, 0.7, 0.2], n=1000, rng=rng)
assert len(samples) == 1000
assert np.all((samples >= 0) & (samples <= 2))

# Check distribution approximates expected probabilities
counts = np.bincount(samples, minlength=3)
freqs = counts / 1000
test_close(freqs, [0.1, 0.7, 0.2], eps=0.05)  # Allow an absolute tolerance of 0.05

# Single sample returns scalar (not array)
rng = np.random.default_rng(123)
single = sample([0.3, 0.7], n=1, rng=rng)
assert isinstance(single, (int, np.integer)), f"Expected scalar, got {type(single)}"
assert 0 <= single <= 1

# Multiple samples return array
multiple = sample([0.3, 0.7], n=5, rng=rng)
assert isinstance(multiple, np.ndarray), "Expected array for n>1"
assert len(multiple) == 5



In [None]:
# Test with unnormalized probabilities (common in cyber security)
unnorm = [10, 70, 20]  # Sums to 100, not 1
rng = np.random.default_rng(456)
samples = sample(unnorm, n=1000, rng=rng)
counts = np.bincount(samples, minlength=3)
freqs = counts / 1000
test_close(freqs, [0.1, 0.7, 0.2], eps=0.05)

# Edge case: single option (deterministic)
certain = sample([1], n=10, rng=rng)
assert np.all(certain == 0), "Single option should always return index 0"

# Extreme probabilities (rare events in anomaly detection)
rare = [0.999, 0.001]  # Very rare anomaly
samples = sample(rare, n=10000, rng=np.random.default_rng(789))
anomaly_count = np.sum(samples == 1)
# Should be around 10 anomalies, allow wide tolerance for randomness
assert 0 <= anomaly_count <= 50, f"Got {anomaly_count} anomalies"



In [None]:
# Test error conditions for robust cyber security applications

# Negative probabilities (corrupted threat scores)
try:
    sample([0.5, -0.3, 0.8], n=1)
    assert False, "Should reject negative probabilities"
except ValueError as e:
    assert "non-negative" in str(e)

# Empty probabilities
try:
    sample([], n=1)
    assert False, "Should reject empty probability array"
except ValueError as e:
    assert "empty array" in str(e)

# Zero sample count
zero_samples = sample([0.5, 0.5], n=0)
assert len(zero_samples) == 0, "n=0 should return empty array"

# Test reproducibility (critical for security audits)
rng1 = np.random.default_rng(999)
rng2 = np.random.default_rng(999)
s1 = sample([0.4, 0.6], n=100, rng=rng1)
s2 = sample([0.4, 0.6], n=100, rng=rng2)
assert np.array_equal(s1, s2), "Same seed should produce identical results"


## Information Measures

**Entropy** and **divergence** measures for quantifying uncertainty and comparing distributions.

### Entropy
Entropy is a fundamental measure of uncertainty or randomness in information theory. Our implementation uses it to quantify how "spread out" or unpredictable a probability distribution is.

Our `entropy` function calculates Shannon entropy using the formula:

$$H(X) = -\sum_{x} p(x) \log p(x)$$


Where:
- $p(x)$ is the probability of each outcome
- The sum is over all possible outcomes
- The logarithm base determines the units (bits for base 2, nats for base e)

Our implementation uses `scipy.special.entr(probs)`, which computes `-x * log(x)` elementwise with proper handling of the edge case `x = 0` (by convention `0 * log(0)` is taken to be 0, its limiting value, rather than left undefined).

#### What Entropy Measures

Entropy quantifies uncertainty:

- **High entropy** = high uncertainty = uniform distribution
  - Example: `[0.25, 0.25, 0.25, 0.25]` has entropy = 2 bits
  - All outcomes equally likely, maximum unpredictability

- **Low entropy** = low uncertainty = skewed distribution  
  - Example: `[0.8, 0.1, 0.1]` has lower entropy
  - One outcome much more likely than others

- **Zero entropy** = no uncertainty = deterministic
  - Example: `[1.0, 0.0, 0.0]` has entropy = 0 bits
  - Outcome is certain
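The three cases above can be checked by hand. The following is a small sketch using plain NumPy; `shannon_bits` is a hypothetical helper written only for this illustration, not the module's `entropy` function.

```python
import numpy as np

def shannon_bits(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=np.float64)
    nz = p[p > 0]                                   # skip zeros: 0 * log(0) is taken as 0
    return float(np.sum(nz * np.log2(1.0 / nz)))    # same as -sum(p * log2(p))

h_uniform = shannon_bits([0.25, 0.25, 0.25, 0.25])  # 2 bits
h_skewed  = shannon_bits([0.8, 0.1, 0.1])           # about 0.92 bits
h_certain = shannon_bits([1.0, 0.0, 0.0])           # 0 bits
print(h_uniform, h_skewed, h_certain)
```

For comparison, the maximum for three outcomes is `log2(3)`, roughly 1.58 bits, so the skewed example sits well below the uniform ceiling.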

#### Applications in Cyber Security

For potential network anomaly detection, entropy is particularly valuable:

1. **Baseline Characterization**: Measure the "normal" randomness of network traffic patterns
2. **Anomaly Detection**: Sudden changes in entropy can indicate attacks or unusual behavior
3. **Feature Engineering**: Use entropy of packet sizes, timing intervals, or connection patterns as input features
4. **Model Uncertainty**: In Recursive Bayesian Estimators, entropy helps quantify how confident your predictions are

For example, if network traffic normally has high entropy (many different packet sizes, destinations, etc.), but suddenly becomes very low entropy (repetitive patterns), this could signal a DDoS attack or malware communication.
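As a rough illustration of that entropy drop, the sketch below compares two time windows of packet counts. The counts and the helper `entropy_bits` are made up for this example and do not come from real traffic.

```python
import numpy as np

def entropy_bits(counts):
    """Entropy (bits) of the empirical distribution given by raw counts."""
    counts = np.asarray(counts, dtype=np.float64)
    p = counts / counts.sum()
    nz = p[p > 0]
    return float(np.sum(nz * np.log2(1.0 / nz)))

# Hypothetical packet counts per size bucket in two time windows
baseline_window = [120, 95, 110, 105, 90, 100, 115, 98]   # varied traffic
flood_window    = [5, 3, 900, 4, 2, 6, 3, 4]              # one bucket dominates

h_base, h_flood = entropy_bits(baseline_window), entropy_bits(flood_window)
drop = h_base - h_flood
print(f"baseline={h_base:.2f} bits, flood={h_flood:.2f} bits, drop={drop:.2f} bits")
```

The baseline window is close to the 3-bit maximum for eight buckets, while the repetitive window falls to a fraction of a bit, which is the kind of shift an entropy-based monitor would flag.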



In [None]:
#| export
def entropy(probs, # probability distribution
            base=2 # base of the logarithm ('e' for nats)
            ):
    """Calculate Shannon entropy using scipy's numerically stable `entr`."""
    probs = normalize(probs)
    h = np.sum(entr(probs))  # entr(x) = -x*log(x) with entr(0) = 0; h is in nats
    if base == 'e': return h
    return h / np.log(base)  # convert from nats to the requested base


In [None]:
entropy([0.5, 0.5])

np.float64(1.0)

In [None]:
# Test known entropy values
# Binary uniform distribution has entropy = 1 bit
test_close(entropy([0.5, 0.5]), 1.0)

# Certain outcome has entropy = 0
test_close(entropy([1.0, 0.0]), 0.0)
test_close(entropy([0.0, 1.0]), 0.0)

# 4-way uniform distribution has entropy = 2 bits
test_close(entropy([0.25, 0.25, 0.25, 0.25]), 2.0)

# Test different bases
uniform_binary = [0.5, 0.5]
test_close(entropy(uniform_binary, base=2), 1.0)
test_close(entropy(uniform_binary, base='e'), np.log(2))
test_close(entropy(uniform_binary, base=10), np.log(2) / np.log(10))


# Test with unnormalized probabilities (common in cyber security)
test_close(entropy([1, 1]), 1.0)  # Should normalize to [0.5, 0.5]
test_close(entropy([2, 2, 2, 2]), 2.0)  # Should normalize to uniform

# Test numerical stability with tiny probabilities
tiny_probs = [1e-15, 0.5, 0.5 - 1e-15]
h_tiny = entropy(tiny_probs)
test_close(h_tiny, 1.0, eps=1e-10)  # Should be close to uniform entropy

# Test with zeros (scipy.special.entr handles this gracefully)
with_zeros = [0.0, 0.3, 0.7]
h_zeros = entropy(with_zeros)
expected = entropy([0.3, 0.7])  # Should equal entropy without the zero
test_close(h_zeros, expected)

# Test entropy ordering (more uniform = higher entropy)
certain = [1.0, 0.0, 0.0]
skewed = [0.8, 0.1, 0.1] 
uniform = [1/3, 1/3, 1/3]

h_certain = entropy(certain)
h_skewed = entropy(skewed)
h_uniform = entropy(uniform)

assert h_certain < h_skewed < h_uniform, "Entropy should increase with uniformity"

# Test large number of outcomes (network anomaly detection scenarios)
many_outcomes = np.ones(100) / 100  # 100 equally likely events
test_close(entropy(many_outcomes), np.log2(100))  # Should be log2(n) for uniform

# Test edge cases for cyber security robustness
single_outcome = [1.0]
test_close(entropy(single_outcome), 0.0)

# Test reproducibility
h1 = entropy([0.4, 0.6])
h2 = entropy([0.4, 0.6])
assert h1 == h2, "Entropy should be deterministic"

### KL Divergence

KL divergence (Kullback-Leibler divergence) measures how different two probability distributions are. Think of it as asking: "If the world actually follows distribution P, how much extra surprise do I incur, on average, by believing distribution Q instead?"

The mathematical formula is:
$$D(P||Q) = \sum p(x) \cdot \log\left(\frac{p(x)}{q(x)}\right)$$

#### Intuitive Understanding

Imagine you're a weather forecaster:
- **P** is what actually happens over many days: "60% rain, 40% sun"
- **Q** is your forecast: "80% chance of rain, 20% chance of sun"

D(P||Q) tells you how "wrong" your forecast was on average, measured as the extra surprise you incur by trusting Q when the weather follows P. If your forecast matches reality perfectly, KL divergence = 0. The more they differ, the higher the value.

#### Key Properties

1. **Always non-negative**: KL divergence ≥ 0, with equality only when P = Q
2. **Asymmetric**: D(P||Q) ≠ D(Q||P) in general - the direction matters!
3. **Can be infinite**: When P assigns probability to something Q says is impossible
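These three properties can be seen numerically with the rain/sun numbers from the weather example. This is a minimal sketch; `kl_nats` is a hypothetical stand-alone helper that assumes its inputs are already normalized.

```python
import numpy as np

def kl_nats(p, q):
    """D(P||Q) in nats for already-normalized distributions (illustrative helper)."""
    p, q = np.asarray(p, dtype=np.float64), np.asarray(q, dtype=np.float64)
    if np.any((p > 0) & (q == 0)):
        return np.inf                       # P puts mass where Q has none
    m = p > 0                               # terms with p = 0 contribute 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

a, b = [0.8, 0.2], [0.6, 0.4]               # rain/sun probabilities
d_same = kl_nats(a, a)                      # zero for identical distributions
d_ab, d_ba = kl_nats(a, b), kl_nats(b, a)   # both positive, but not equal
d_inf = kl_nats([0.5, 0.5], [1.0, 0.0])     # infinite: second outcome is "impossible" under Q
print(d_same, d_ab, d_ba, d_inf)
```

The two directions give roughly 0.092 and 0.105 nats, so swapping the arguments changes the answer even for this simple two-outcome case.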

#### Why It's Important for Cyber Security Applications

##### 1. **Anomaly Detection**
```python
# Normal network traffic pattern
normal_traffic = [0.7, 0.2, 0.1]  # [web, email, other]

# Current traffic pattern  
current_traffic = [0.3, 0.1, 0.6]  # Lots of "other" traffic - suspicious!

# High KL divergence indicates anomaly
anomaly_score = kl_div(current_traffic, normal_traffic)
```

##### 2. **Model Drift Detection**
Your RBE model learns what "normal" looks like. Over time, you can check if new data still matches your model's expectations:

```python
# Your model's learned distribution
model_belief = [0.9, 0.08, 0.02]  # [normal, suspicious, attack]

# New incoming data distribution
recent_data = [0.7, 0.25, 0.05]   # More suspicious activity

# KL divergence tells you if your model needs updating
drift_score = kl_div(recent_data, model_belief)
```

##### 3. **Information-Theoretic Security**
KL divergence quantifies information leakage. If an attacker's observations P differ significantly from the expected distribution Q, it indicates potential information disclosure. (more on this at the end*)

#### Algorithm Walkthrough

Our implementation handles several critical cases:

1. **Normal case**: Computes the sum over the outcomes where P has nonzero probability (terms with `p = 0` contribute nothing)
2. **Impossible events**: Returns infinity when P expects something Q says can't happen
3. **Numerical stability**: Floors `q` at a small epsilon inside the ratio, so a vanishingly small but nonzero `q` cannot blow it up

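#### Why Asymmetry Matters

D(P||Q) asks: "How surprised is an observer who believes Q when the data actually follow P?"  
D(Q||P) asks: "How surprised is an observer who believes P when the data actually follow Q?"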

In cyber security:
- D(current||baseline) = "How anomalous is current traffic compared to normal?"
- D(baseline||current) = "How much would we need to update our normal baseline?"

These are fundamentally different questions! The first is better for anomaly detection, the second for model adaptation.

The infinity case is particularly important - it means your current observations include events that your baseline model considers impossible, which is a strong anomaly signal in security contexts.



In [None]:
#| export
def kl_div(p, # probability distribution
           q, # probability distribution
           eps=1e-15 # small floor applied to q to keep the ratio finite
           ):
    """KL divergence D(P||Q) of P from Q, in nats."""
    p, q = normalize(p), normalize(q)
    if p.shape != q.shape: raise ValueError("Distributions must have same length")
    # Infinite case: P has probability where Q has none
    if np.any((p > 0) & (q == 0)): return np.inf
    # Terms with p == 0 contribute 0 (0*log(0) = 0), so sum only over the support of P
    mask = (p > 0)
    result = np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps)))
    return result



In [None]:
#Anomaly Detection Example
# Normal network traffic pattern
normal_traffic = [0.7, 0.2, 0.1]  # [web, email, other]

# Current traffic pattern  
current_traffic = [0.3, 0.1, 0.6]  # Lots of "other" traffic - suspicious!

# High KL divergence indicates anomaly
anomaly_score = kl_div(current_traffic, normal_traffic)
anomaly_score

np.float64(0.7515516053646771)

In [None]:
#Model Drift Detection Example
# Your model's learned distribution
model_belief = [0.9, 0.08, 0.02]  # [normal, suspicious, attack]

# New incoming data distribution
recent_data = [0.7, 0.25, 0.05]   # More suspicious activity

# KL divergence tells you if your model needs updating
drift_score = kl_div(recent_data, model_belief)
drift_score

np.float64(0.15475300759416463)

In [None]:
# Test cases for KL divergence

# Identical distributions
test_close(kl_div([0.5, 0.5], [0.5, 0.5]), 0.0, eps=1e-12)

# Different distributions  
p1, q1 = [0.8, 0.2], [0.6, 0.4]
kl1 = kl_div(p1, q1)
assert kl1 > 0, "KL divergence should be positive for different distributions"

# Asymmetry: D(P||Q) ≠ D(Q||P)
kl2 = kl_div(q1, p1)
assert kl1 != kl2, "KL divergence should be asymmetric"

# Undefined case: P has support where Q doesn't
p_undefined = [0.5, 0.5, 0.0]
q_undefined = [0.5, 0.0, 0.5]  # q[1]=0 but p[1]>0
assert kl_div(p_undefined, q_undefined) == np.inf

# Edge case: both zero at same positions (should work fine)
p_zeros = [0.6, 0.4, 0.0]
q_zeros = [0.7, 0.3, 0.0]
finite_kl = kl_div(p_zeros, q_zeros)
assert np.isfinite(finite_kl)


### JS Divergence

Jensen-Shannon (JS) divergence also measures how different two probability distributions are, and it fixes several problems that KL divergence has. This section covers how it works and why it is valuable for cyber security applications.

#### How JS Divergence Works

JS divergence uses a clever mathematical trick. Instead of directly comparing distributions P and Q, it:

1. **Creates a mixture**: M = ½(P + Q) - the average of both distributions
2. **Measures how each original distribution differs from this mixture**:
   - KL(P || M) = how much P diverges from the average
   - KL(Q || M) = how much Q diverges from the average  
3. **Takes the average**: $JS(P,Q) = \frac{1}{2}\left[KL(P\|M) + KL(Q\|M)\right]$

Think of it like this: if two people disagree about something, JS divergence asks "how much do they each disagree with a compromise position?" rather than "how much does person A disagree with person B?"
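The three steps can be traced on a pair of distributions for which KL(P || Q) itself would be infinite. This is a sketch with a hypothetical helper `kl_nats`, which assumes the second argument is positive wherever the first one is (always true for the mixture).

```python
import numpy as np

def kl_nats(p, q):
    """D(P||Q) in nats; assumes q > 0 wherever p > 0."""
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

p = np.array([0.6, 0.4, 0.0])
q = np.array([0.5, 0.0, 0.5])     # q is zero where p is not, so KL(P||Q) would be infinite

mix = 0.5 * (p + q)               # step 1: mixture, positive wherever p or q is
kl_pm = kl_nats(p, mix)           # step 2a: P against the mixture
kl_qm = kl_nats(q, mix)           # step 2b: Q against the mixture
js = 0.5 * (kl_pm + kl_qm)        # step 3: average of the two
print(mix, kl_pm, kl_qm, js)
```

Because the mixture is never zero where either input has mass, both KL terms stay finite, and the result here is about 0.31 nats.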

#### Key Advantages Over KL Divergence

**1. Always Finite**: Unlike KL divergence, JS never returns infinity. This is crucial for robust cyber security systems where you can't have your anomaly detection crash on edge cases.

**2. Symmetric**: JS(P,Q) = JS(Q,P). This means you get the same "distance" regardless of which distribution you consider the reference. Perfect for comparing network traffic patterns where neither is inherently the "baseline."

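**3. Bounded**: JS divergence is always between 0 and log(2) ≈ 0.693 when computed with the natural logarithm, as it is here (with base-2 logarithms the upper bound is 1). This gives you a natural scale - you know that 0.693 represents maximum possible difference, making it easier to set thresholds.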

**4. Smooth**: Small changes in probabilities lead to small changes in JS divergence, making it more stable for real-world noisy data.

#### Why It's Useful for Cyber Security Applications

**Robust Anomaly Detection**: Your network traffic analysis won't crash when encountering new, previously unseen traffic types (which would make KL divergence infinite).

**Symmetric Threat Assessment**: When comparing current traffic to historical patterns, you get the same anomaly score regardless of which you treat as the "reference" - important for consistent alerting.

**Natural Thresholds**: Since JS is bounded, you can establish thresholds on a fixed scale, for example "anything above 0.3 is suspicious, above 0.5 is likely an attack" (illustrative values that would need tuning on real data).

**Comparative Analysis**: We can meaningfully compare how different various traffic patterns are from each other, not just from a baseline.

For our Recursive Bayesian Estimator application, JS divergence gives us a stable, interpretable measure of how much our model's beliefs are changing over time - essential for detecting both gradual drift and sudden anomalies in network behavior.

The mathematical elegance is that by using the mixture distribution M as an intermediate step, JS divergence captures the intuitive notion of "distance between distributions" while avoiding the mathematical pitfalls that make KL divergence problematic for practical applications.
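Since the upper bound is log(2) in nats, dividing by log(2) rescales the score to the interval [0, 1], which can make thresholds easier to reason about. The sketch below assumes normalized inputs; `js_nats` and `js_normalized` are hypothetical helpers, not functions exported by this module.

```python
import numpy as np

def js_nats(p, q):
    """Jensen-Shannon divergence in nats for normalized inputs (illustrative helper)."""
    p, q = np.asarray(p, dtype=np.float64), np.asarray(q, dtype=np.float64)
    mix = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0                          # terms with a = 0 contribute 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))
    return float(0.5 * (kl(p, mix) + kl(q, mix)))

def js_normalized(p, q):
    """JS divergence rescaled to [0, 1] by dividing by log(2)."""
    return js_nats(p, q) / np.log(2)

normal = [0.7, 0.2, 0.1]
attack = [0.1, 0.05, 0.85]
score_same     = js_normalized(normal, normal)            # 0 for identical inputs
score_attack   = js_normalized(normal, attack)            # strictly between 0 and 1
score_disjoint = js_normalized([1.0, 0.0], [0.0, 1.0])    # 1 (up to round-off) for disjoint support
print(score_same, score_attack, score_disjoint)
```

The rescaled value is the same quantity expressed in bits, so a score of 1 always means the two distributions share no outcomes.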

In [None]:
#| export
def js_div(p, # probability distribution
           q # probability distribution
           ):
    """Jensen-Shannon divergence between distributions p and q."""
    p, q = normalize(p), normalize(q)
    # Check dimensions match
    if len(p) != len(q): raise ValueError("Distributions must have same length")
    
    m = 0.5 * (p + q)
    # JS divergence is always finite (unlike KL), but check for safety
    kl_pm = kl_div(p, m)
    kl_qm = kl_div(q, m)
    
    if not (np.isfinite(kl_pm) and np.isfinite(kl_qm)):
        # This should never happen with proper implementation
        raise RuntimeError("Unexpected infinite KL divergence in JS calculation")
    
    return 0.5 * (kl_pm + kl_qm)

In [None]:
# Test the JS divergence function

# Basic functionality tests
# Identical distributions should give JS = 0
test_close(js_div([0.5, 0.5], [0.5, 0.5]), 0.0, eps=1e-12)
test_close(js_div([0.3, 0.4, 0.3], [0.3, 0.4, 0.3]), 0.0, eps=1e-12)

# Different distributions should give JS > 0
p1, q1 = [0.8, 0.2], [0.6, 0.4]
js1 = js_div(p1, q1)
assert js1 > 0, "JS divergence should be positive for different distributions"

# Test symmetry: JS(P,Q) = JS(Q,P) - key advantage over KL
js2 = js_div(q1, p1)
test_close(js1, js2, eps=1e-12)

# Test boundedness: JS divergence should be ≤ log(2) ≈ 0.693
# Maximum occurs when distributions have disjoint support
max_divergent = [1.0, 0.0], [0.0, 1.0]
js_max = js_div(*max_divergent)
test_close(js_max, np.log(2), eps=1e-10)
assert js_max <= np.log(2) + 1e-10, "JS divergence should be bounded by log(2)"

# Test with unnormalized inputs (common in cyber security)
unnorm_p = [10, 20, 30]  # Will normalize to [1/6, 2/6, 3/6]
unnorm_q = [15, 15, 30]  # Will normalize to [1/4, 1/4, 1/2]
js_unnorm = js_div(unnorm_p, unnorm_q)
# Should equal normalized version
norm_p = [1/6, 2/6, 3/6]
norm_q = [1/4, 1/4, 1/2]
test_close(js_unnorm, js_div(norm_p, norm_q))

# Test dimension mismatch error handling
try:
    js_div([0.5, 0.5], [0.3, 0.3, 0.4])
    assert False, "Should raise ValueError for mismatched dimensions"
except ValueError as e:
    assert "same length" in str(e)

# Test with zeros (should handle gracefully unlike KL)
with_zeros_p = [0.6, 0.4, 0.0]
with_zeros_q = [0.5, 0.0, 0.5]
js_zeros = js_div(with_zeros_p, with_zeros_q)
assert np.isfinite(js_zeros), "JS divergence should be finite even with zeros"
assert js_zeros > 0, "Different distributions with zeros should have positive JS"

# Test numerical stability with very small probabilities
tiny_p = [1e-10, 0.5, 0.5 - 1e-10]
tiny_q = [1e-10, 0.4, 0.6 - 1e-10]
js_tiny = js_div(tiny_p, tiny_q)
assert np.isfinite(js_tiny), "Should handle tiny probabilities"

# Test single element distributions
single_p = [1.0]
single_q = [1.0]
test_close(js_div(single_p, single_q), 0.0)

# Test reproducibility
js_rep1 = js_div([0.4, 0.6], [0.3, 0.7])
js_rep2 = js_div([0.4, 0.6], [0.3, 0.7])
assert js_rep1 == js_rep2, "JS divergence should be deterministic"

# Cyber security scenario: compare traffic patterns
normal_traffic = [0.7, 0.2, 0.1]      # [web, email, other]
suspicious_traffic = [0.4, 0.1, 0.5]   # More "other" traffic
attack_traffic = [0.1, 0.05, 0.85]     # Mostly "other" - likely attack

js_suspicious = js_div(normal_traffic, suspicious_traffic)
js_attack = js_div(normal_traffic, attack_traffic)

# Attack should be more divergent than suspicious
assert js_attack > js_suspicious, "Attack pattern should be more divergent"
assert js_attack <= np.log(2), "Even extreme patterns should be bounded"

## Effective Sample Size

The `eff_size` function calculates the **effective sample size** of a set of weights, a key diagnostic in particle filtering and Monte Carlo methods. This section explains what it measures and why it matters for our applications.

### What It Measures

The effective sample size tells you **how many particles are meaningfully contributing** to your estimate. It's calculated using the formula:

$$N_{eff} = \frac{1}{\sum_{i=1}^N w_i^2}$$

where $w_i$ are the normalized weights.

### Intuitive Understanding

Think of it this way: if you have 1000 particles but 999 of them have tiny weights and only 1 has a large weight, you're really only getting information from that 1 particle - your effective sample size is close to 1, not 1000.

**Perfect case**: All particles have equal weight (1/N each)
- Each weight = 1/N, so $w_i^2 = 1/N^2$
- Sum of squares = $N \times (1/N^2) = 1/N$
- Effective size = $1/(1/N) = N$ ✓

**Worst case**: One particle has all the weight
- One weight = 1, others = 0
- Sum of squares = $1^2 + 0 + 0 + ... = 1$
- Effective size = $1/1 = 1$ ✓

### Example Application for Cyber Security RBE

#### 1. **Particle Filter Health Monitoring**
```python
# Your RBE is tracking network anomaly probabilities
weights = [0.95, 0.02, 0.02, 0.01]  # Most belief in one hypothesis
eff_size(weights)  # Returns ~1.1 - very low!
```

When effective sample size drops too low (common threshold: < N/2), it means your particle filter has **degeneracy** - most particles are irrelevant.

#### 2. **Resampling Trigger**
```python
if eff_size(particle_weights) < len(particles) / 2:
    # Time to resample! Most particles are useless
    resample_particles()
```

This prevents your RBE from wasting computation on particles that don't contribute meaningful information about network threats.
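The resampling trigger above leaves `resample_particles()` undefined. One possible way to fill it in is multinomial resampling, sketched below under the assumption that particles are stored in a NumPy array. The names `eff_size_sketch` and `resample_if_degenerate` are hypothetical, and other schemes (such as systematic resampling) are equally valid choices.

```python
import numpy as np

def eff_size_sketch(weights):
    """Effective sample size 1 / sum(w_i^2) of the normalized weights."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def resample_if_degenerate(particles, weights, rng, frac=0.5):
    """Multinomial resampling when ESS < frac * N; otherwise return inputs unchanged."""
    particles = np.asarray(particles)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    n = len(particles)
    if eff_size_sketch(w) >= frac * n:
        return particles, w, False
    idx = rng.choice(n, size=n, p=w)                    # draw indices in proportion to weight
    return particles[idx], np.full(n, 1.0 / n), True    # weights reset to uniform

rng = np.random.default_rng(0)
particles = np.array([0.1, 0.4, 0.5, 0.9])      # hypothetical state hypotheses
weights = np.array([0.95, 0.02, 0.02, 0.01])    # ESS is about 1.1, below N/2 = 2
new_particles, new_weights, resampled = resample_if_degenerate(particles, weights, rng)
print(resampled, new_particles, new_weights)
```

After resampling, heavily weighted particles are duplicated, lightly weighted ones tend to disappear, and all weights are reset to 1/N so the effective sample size returns to N.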

#### 3. **Quality Control**
A consistently low effective sample size indicates:
- Your model might be too confident (overconfident predictions)
- You need more diverse particles
- The observation model might be poorly calibrated

#### 4. **Computational Efficiency**
Instead of blindly using all N particles, you know you're really only getting information equivalent to `eff_size(weights)` particles. This helps you:
- Decide when to add more particles
- Understand the true precision of your estimates
- Optimize computational resources

### Example in Network Anomaly Detection

```python
# Scenario: RBE tracking different threat types
threat_weights = [0.7, 0.15, 0.1, 0.05]  # [normal, suspicious, malware, APT]
effective_particles = eff_size(threat_weights)  # ≈ 1.9

# This tells you that despite having 4 categories, you're really only 
# getting information equivalent to ~1.9 independent observations
# The "normal" category dominates, reducing diversity
```

### The Math Behind the Magic

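The formula $1/\sum w_i^2$ is the **weighted harmonic mean** of the reciprocals $1/w_i$, where each reciprocal is weighted by $w_i$ itself. The corresponding weighted arithmetic mean is always $N$ (for strictly positive weights), so the gap between the two naturally penalizes uneven distributions:

- When weights are uniform: harmonic mean = arithmetic mean = N
- When weights are skewed: harmonic mean << arithmetic mean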

This makes it a sensitive detector of particle filter degeneracy, which is exactly what you want for maintaining a healthy RBE system in your cyber security application.

The key insight is that effective sample size gives you a single number that summarizes the "health" of your particle distribution - essential for automated monitoring of your anomaly detection system!
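The harmonic-mean reading of the formula can be checked numerically if it is taken as a weighted mean, with each reciprocal $1/w_i$ weighted by $w_i$ itself. The snippet below is only a sanity check of that identity on one example weight vector.

```python
import numpy as np

w = np.array([0.7, 0.15, 0.1, 0.05])          # normalized weights, all positive
n_eff = 1.0 / np.sum(w ** 2)

# Weighted harmonic mean of x_i = 1/w_i with weights w_i: sum(w) / sum(w / x)
x = 1.0 / w
weighted_harmonic = np.sum(w) / np.sum(w / x)

# Weighted arithmetic mean of the same x_i: sum(w * x) / sum(w), which equals N
weighted_arithmetic = np.sum(w * x) / np.sum(w)

print(n_eff, weighted_harmonic, weighted_arithmetic)
```

For these weights the harmonic value is about 1.9 while the arithmetic value is exactly the particle count 4, and the gap between them reflects how uneven the weights are.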



In [None]:
#| export
def eff_size(weights):
    "Calculate effective sample size of `weights` (normalized internally)"
    weights = normalize(weights)
    return 1.0 / np.sum(weights**2)

In [None]:
# Test effective sample size
uniform_weights = np.ones(100) / 100
skewed_weights = np.zeros(100)
skewed_weights[0] = 1.0

test_close(eff_size(uniform_weights), 100.0)  # All particles contribute
test_close(eff_size(skewed_weights), 1.0)     # Only one particle

# Test edge cases for particle filter robustness
test_close(eff_size([1]), 1.0)  # Single particle
test_close(eff_size([0.9, 0.1]), 1/(0.9**2 + 0.1**2))  # Known value

# Test with unnormalized weights (common in practice)
test_close(eff_size([10, 90]), eff_size([0.1, 0.9]))

# Test numerical precision
many_equal = np.ones(1000)
test_close(eff_size(many_equal), 1000.0)

## Categorical Distribution Utilities

In [None]:
#| export
def categorical(probs, # probability distribution
                labels=None # optional labels for the distribution
                ):
    "Create categorical distribution from `probs` with optional `labels`"
    probs = normalize(probs)
    if labels is None:
        labels = list(range(len(probs)))
    return dict(zip(labels, probs))

def uniform(n):
    "Create uniform distribution over `n` outcomes"
    return np.ones(n) / n

def from_counts(counts):
    "Create probability distribution from `counts`"
    counts = np.asarray(counts)
    if np.any(counts < 0):
        raise ValueError("Counts must be non-negative")
    return normalize(counts)

In [None]:
# Test categorical utilities
cat_dist = categorical([1, 2, 3], ['A', 'B', 'C'])
test_eq(cat_dist['A'], 1/6)
test_eq(cat_dist['B'], 2/6)
test_eq(cat_dist['C'], 3/6)

# Test uniform
u = uniform(4)
test_close(u, [0.25, 0.25, 0.25, 0.25])

# Test from_counts
probs = from_counts([10, 20, 30])
test_close(probs, [1/6, 2/6, 3/6])

## Export

In [None]:
#| export
__all__ = [
    # Basic operations
    'normalize', 'sample',
    
    # Information measures
    'entropy', 'kl_div', 'js_div',
    
    # Effective sample size
    'eff_size',
    
    # Categorical utilities
    'categorical', 'uniform', 'from_counts'
]

Didn't have time to look into this further but this sounds like an interesting application

## Information-Theoretic Security with KL Divergence

Information-theoretic security uses mathematical measures of information to detect and prevent security breaches. KL divergence is particularly powerful here because it quantifies how much information an adversary might be gaining.

## Core Concept: Information Leakage

In a secure system, an attacker's observations should look random or match expected patterns. When they deviate significantly, it suggests information is leaking.

### Example: Timing Attack Detection

```python
# Normal share of operations in each response-time category
normal_timing = [0.6, 0.3, 0.1]  # [fast, medium, slow operations]

# Observed timing pattern during potential attack
observed_timing = [0.2, 0.1, 0.7]  # Unusually many slow operations

# High KL divergence suggests timing-based information leakage
leakage_score = kl_div(observed_timing, normal_timing)
```

If an attacker is probing your system and causing unusual timing patterns (like forcing expensive cryptographic operations), the KL divergence will spike.

## Practical Security Applications

### 1. **Side-Channel Attack Detection**
Monitor patterns in:
- CPU usage during cryptographic operations
- Memory access patterns 
- Network packet timing
- Power consumption (in embedded systems)

```python
# Normal CPU usage distribution during encryption
normal_cpu = [0.4, 0.4, 0.2]  # [low, medium, high usage]

# During suspected side-channel attack
attack_cpu = [0.1, 0.2, 0.7]   # Lots of high CPU usage

# Detect the attack (threshold is an illustrative value in nats, not a tuned setting)
threshold = 0.5
if kl_div(attack_cpu, normal_cpu) > threshold:
    print("Possible side-channel attack detected")
```

### 2. **Covert Channel Detection**
Attackers might hide communication in seemingly normal traffic patterns:

```python
# Normal distribution of packet counts across four size buckets
normal_packets = normalize([100, 200, 150, 50])

# Suspicious pattern - counts too regular across buckets, might encode data
suspicious_packets = normalize([128, 128, 64, 128])

# KL divergence reveals the anomaly
covert_score = kl_div(suspicious_packets, normal_packets)
```

### 3. **Privacy-Preserving Systems**
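In differential privacy, you want to ensure that adding or removing one person's data doesn't significantly change query results. Differential privacy formally bounds the worst-case log-ratio between the two output distributions; KL divergence gives a complementary average-case view of the same gap: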

```python
# Query results with person A's data
with_person_a = [0.3, 0.4, 0.3]

# Query results without person A's data  
without_person_a = [0.32, 0.38, 0.3]

# Low KL divergence means good privacy protection
privacy_leakage = kl_div(with_person_a, without_person_a)
```

## Why KL Divergence is Perfect for This

1. **Sensitive to small changes**: Even subtle information leakage creates measurable divergence
2. **Asymmetric nature**: You can measure leakage in specific directions
3. **Principled threshold setting**: Based on information theory rather than ad-hoc rules
4. **Handles rare events**: The infinity case catches when attackers force impossible states

## Advanced: Mutual Information Estimation

KL divergence is also used to estimate mutual information between variables, which directly measures how much information one variable reveals about another:

```python
def mutual_info_estimate(joint_dist, marginal_x, marginal_y):
    """Estimate mutual information using KL divergence"""
    # Product of marginals: the joint distribution X and Y would have if independent
    independent = np.outer(marginal_x, marginal_y).flatten()
    
    # I(X;Y) = D(joint || product of marginals)
    return kl_div(np.asarray(joint_dist).flatten(), independent)
```

This helps quantify exactly how much information an attacker gains from their observations.
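Two extreme cases show what the estimate means. The sketch below is self-contained and uses a hypothetical helper `mutual_info_nats` that derives the marginals from the joint table; the joint distributions are made-up examples.

```python
import numpy as np

def mutual_info_nats(joint):
    """I(X;Y) = D(joint || product of marginals), in nats (illustrative helper)."""
    joint = np.asarray(joint, dtype=np.float64)
    joint = joint / joint.sum()
    px = joint.sum(axis=1)                    # marginal of X (rows)
    py = joint.sum(axis=0)                    # marginal of Y (columns)
    indep = np.outer(px, py)                  # joint distribution if X and Y were independent
    nz = joint > 0                            # zero cells contribute 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / indep[nz])))

independent = np.outer([0.5, 0.5], [0.3, 0.7])   # observation says nothing about the secret
leaky       = np.array([[0.5, 0.0],
                        [0.0, 0.5]])             # observation reveals the secret exactly

mi_indep = mutual_info_nats(independent)   # approximately 0
mi_leaky = mutual_info_nats(leaky)         # log(2) nats, i.e. 1 bit
print(mi_indep, mi_leaky)
```

An independent observation leaks essentially nothing, while an observation that pins down a fair binary secret leaks the full 1 bit (log(2) nats).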

## Real-World Impact

Information-theoretic security moves beyond "did an attack happen?" to "how much information did the attacker gain?" This quantitative approach enables:

- **Risk quantification**: Measure actual information loss, not just binary breach/no-breach
- **Proactive defense**: Detect information leakage before full compromise
- **Privacy engineering**: Design systems with measurable privacy guarantees
- **Forensic analysis**: Quantify the scope of information disclosure after incidents

The beauty is that it's mathematically principled - you're not just looking for "suspicious" patterns, but measuring actual information flow using fundamental laws of information theory.



In [None]:
#|hide
import nbdev; nbdev.nbdev_export()