# Real World Use Case: A/B Testing (Bandits)

**Scenario**: You have two versions of a website (A and B).
**Goal**: Find out which one gets more clicks.
**Standard A/B Test**: Show A to 1000 people and B to 1000 people, then compare the click rates.
**RL (Q-Learning / Bandits)**: Use an epsilon-greedy strategy.
*   If A seems better early on, show it to most users (Exploit), but occasionally show B to keep checking it (Explore).
*   Result: You stop sending traffic to the worse version much sooner.

In [None]:
import numpy as np

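# True click rates (unknown to the agent)
rates = [0.05, 0.08] # Site B (index 1) is better
q_values = [0.0, 0.0] # Estimated click rate for each site
counts = [0, 0] # Number of users shown each site

epsilon = 0.1 # Probability of exploring (picking a site at random)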

for user in range(10000):
    # 1. Choose
    if np.random.rand() < epsilon:
        choice = np.random.randint(2) # Explore
    else:
        choice = np.argmax(q_values) # Exploit
        
    # 2. Reward (Simulated User Click)
    reward = 1 if np.random.rand() < rates[choice] else 0
    
    # 3. Update Q-Value (Average)
    counts[choice] += 1
    # Incremental Mean Formula
    q_values[choice] += (1/counts[choice]) * (reward - q_values[choice])

print(f"Estimated Rates: {q_values}")
print(f"Traffic Allocation: {counts}")
print("Site B should have received most of the traffic automatically.")
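
The update on the `q_values[choice] += ...` line is a running average written so that no click history has to be stored. As a short derivation, using notation that is not in the notebook itself ($Q_n$ for the estimate after $n$ users have seen a site, $r_i$ for the reward from the $i$-th of those users):

$$
Q_n = \frac{1}{n}\sum_{i=1}^{n} r_i
    = \frac{1}{n}\Big(r_n + (n-1)\,Q_{n-1}\Big)
    = Q_{n-1} + \frac{1}{n}\big(r_n - Q_{n-1}\big)
$$

Here `counts[choice]` plays the role of $n$ and `reward` plays the role of $r_n$, so each estimate equals the plain average of the rewards seen so far for that site.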

## Conclusion
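This is called "Multi-Armed Bandit" optimization. Instead of waiting for the test to end, it shifts traffic toward the better version *during* the test, so fewer clicks (and less revenue) are lost along the way.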
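
One way to check this is to simulate both strategies on the same click rates and compare total clicks. The following is a minimal sketch, not a definitive benchmark: the function names (`run_fixed_split`, `run_epsilon_greedy`, `compare`), the number of runs, and the seed are illustrative assumptions, and the click rates are the same simulated values used above.

```python
import numpy as np

def run_fixed_split(rates, n_users, rng):
    """Classic A/B test: alternate users between the sites, return total clicks."""
    clicks = 0
    for user in range(n_users):
        choice = user % len(rates)  # Even split across sites
        clicks += int(rng.random() < rates[choice])
    return clicks

def run_epsilon_greedy(rates, n_users, rng, epsilon=0.1):
    """Epsilon-greedy bandit: return total clicks and per-site traffic counts."""
    q_values = np.zeros(len(rates))
    counts = np.zeros(len(rates), dtype=int)
    clicks = 0
    for _ in range(n_users):
        if rng.random() < epsilon:
            choice = int(rng.integers(len(rates)))  # Explore
        else:
            choice = int(np.argmax(q_values))       # Exploit
        reward = int(rng.random() < rates[choice])
        counts[choice] += 1
        # Incremental mean update, same formula as in the loop above
        q_values[choice] += (reward - q_values[choice]) / counts[choice]
        clicks += reward
    return clicks, counts

def compare(rates=(0.05, 0.08), n_users=10_000, n_runs=10, seed=0):
    """Average total clicks per strategy over several simulated tests."""
    rng = np.random.default_rng(seed)  # Seed chosen arbitrarily for repeatability
    split = [run_fixed_split(rates, n_users, rng) for _ in range(n_runs)]
    greedy = [run_epsilon_greedy(rates, n_users, rng)[0] for _ in range(n_runs)]
    return float(np.mean(split)), float(np.mean(greedy))

if __name__ == "__main__":
    split_avg, greedy_avg = compare()
    print(f"Average clicks, fixed 50/50 split: {split_avg:.1f}")
    print(f"Average clicks, epsilon-greedy:    {greedy_avg:.1f}")
```

Averaging over several runs matters because a single simulated test is noisy. With these rates, an even split earns about 0.065 clicks per user in expectation, while a bandit that has settled on Site B earns close to 0.08.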