# Online statistics

## Mean

$$
\left\{ 
    \begin{array}{ll}
m_0 = 0  \\
m_{t+1} = m_t + \frac{x_{t+1} - m_t}{t+1}
\end{array}
\right.
$$
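
The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, not the `river` implementation; the class name `RunningMean` and its attributes are hypothetical, and `x` stands for the newly observed value.

```python
class RunningMean:
    """Incremental mean: m_{t+1} = m_t + (x - m_t) / (t + 1)."""

    def __init__(self):
        self.n = 0       # number of values seen so far (t)
        self.mean = 0.0  # m_0 = 0

    def update(self, x):
        # Move the current mean towards x by a step of 1 / (t + 1).
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self


m = RunningMean()
for x in [2, 4, 6, 8]:
    m.update(x)
print(m.mean)  # 5.0
```

Only the count and the current mean are stored, so memory use stays constant regardless of how many values have been seen.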

## Variance
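
One common way to maintain a variance incrementally is Welford's algorithm, which extends the running mean with a running sum of squared deviations. The sketch below is an assumption about what this section could cover; `RunningVariance`, `m2` and `ddof` are hypothetical names.

```python
class RunningVariance:
    """Welford's algorithm for an incremental variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean           # deviation from the old mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the updated mean
        return self

    def get(self, ddof=1):
        # ddof=1 gives the sample variance, ddof=0 the population variance.
        if self.n <= ddof:
            return 0.0
        return self.m2 / (self.n - ddof)


v = RunningVariance()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    v.update(x)
print(v.get(ddof=0))
```

Updating `m2` with the product of the old and new deviations avoids the numerical cancellation that the naive "mean of squares minus square of mean" formula can suffer from.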

## Pearson correlation
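
The same idea can plausibly be extended to a correlation by tracking two running means, two sums of squared deviations and one co-moment. This is a sketch under that assumption; `RunningPearson` and its attribute names are hypothetical.

```python
import math


class RunningPearson:
    """Incremental Pearson correlation between two streams x and y."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # sum of squared deviations of x
        self.m2_y = 0.0  # sum of squared deviations of y
        self.c_xy = 0.0  # co-moment: sum of (x - mean_x) * (y - mean_y)

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x  # deviations from the old means
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)
        return self

    def get(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else 0.0


r = RunningPearson()
for x, y in [(1, 2), (2, 4), (3, 6), (4, 8)]:
    r.update(x, y)
print(r.get())
```

The normalising factors of the covariance and the two variances cancel, so the raw sums are enough to compute the coefficient.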

## Rolling mean
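
A rolling mean only considers the most recent values. A minimal sketch, assuming a fixed-size window held in a `deque`; the class name `RollingMean` and the `window_size` parameter are hypothetical.

```python
from collections import deque


class RollingMean:
    """Mean over the last `window_size` values."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.total = 0.0  # sum of the values currently in the window

    def update(self, x):
        # When the window is full, the oldest value is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(x)
        self.total += x
        return self

    def get(self):
        return self.total / len(self.window) if self.window else 0.0


rm = RollingMean(window_size=3)
for x in [1, 2, 3, 4, 5]:
    rm.update(x)
print(rm.get())  # 4.0
```

Unlike the plain running mean, this variant has to store the window contents, so its memory use grows with the window size.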

## Parallelism
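
Running statistics can be computed on separate partitions of the data and merged afterwards. For the mean, this needs only each partition's count and mean. The helper below is a hypothetical sketch of that merge step.

```python
def merge_means(n_a, mean_a, n_b, mean_b):
    """Combine the (count, mean) summaries of two partitions."""
    n = n_a + n_b
    if n == 0:
        return 0, 0.0
    # Weighted average, written as a correction to mean_a.
    mean = mean_a + (mean_b - mean_a) * n_b / n
    return n, mean


# Partition A holds [1, 2, 3], partition B holds [4, 5].
n, mean = merge_means(3, 2.0, 2, 4.5)
print(n, mean)
```

Because the merge is associative, partitions can be combined in any order, for example pairwise across workers.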

## Aggregations

## A/B testing application

In [3]:
import pandas as pd

url = 'https://raw.githubusercontent.com/alenyeh1014/DataAnalytics-AB_Testing/master/DataFiles/ab_data.csv'
events = pd.read_csv(url)
events = events.sort_values(by='timestamp')
events.head()

| Unnamed: 0 | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 131228 | 922696 | 2017-01-02 13:42:05.378582 | treatment | new_page | 0 |
| 184884 | 781507 | 2017-01-02 13:42:15.234051 | control | old_page | 0 |
| 83878 | 737319 | 2017-01-02 13:42:21.786186 | control | old_page | 0 |
| 102717 | 818377 | 2017-01-02 13:42:26.640581 | treatment | new_page | 0 |
| 158789 | 725857 | 2017-01-02 13:42:27.851110 | treatment | new_page | 0 |


In [28]:
from collections import defaultdict
from river import stats
from river import stream

conversion_rates = defaultdict(stats.Mean)

for event, _ in stream.iter_pandas(events):
    group = event['group']
    converted = event['converted']
    conversion_rates[group].update(converted)

conversion_rates

defaultdict(river.stats.mean.Mean,
            {'treatment': Mean: 0.11892, 'control': Mean: 0.120399})

In [29]:
import statsmodels.api as sm

z_score, p_value = sm.stats.proportions_ztest(
    [conversion_rates['treatment'].get(), conversion_rates['control'].get()],
    [conversion_rates['treatment'].n, conversion_rates['control'].n],
    alternative='larger'
)
z_score, p_value

(-0.003147450060250258, 0.5012556488313168)
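
Note that statsmodels documents the first argument of `proportions_ztest` as the number of successes per group, whereas the cell above passes conversion rates. That may explain why the z-score is so close to 0 and the p-value so close to 0.5. As a cross-check, here is a dependency-free sketch of a one-sided pooled two-proportion z-test that works on counts. The function name and the example counts are hypothetical and are not taken from the dataset.

```python
import math


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """One-sided pooled z-test of H1: rate_a > rate_b. Returns (z, p_value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal
    return z, p_value


# Made-up counts for illustration: 120/1000 conversions versus 100/1000.
z, p = two_proportion_ztest(120, 1000, 100, 1000)
print(z, p)
```

With a running mean that exposes its value and its count, the number of successes for a group can be recovered as `round(rate * n)` before calling such a test.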

**Question 🤔: what's the conclusion of this test?**

We can also run the test at any point in time.

In [33]:
conversion_rates = defaultdict(stats.Mean)
p_values = []

for t, (event, _) in enumerate(stream.iter_pandas(events)):
    group = event['group']
    converted = event['converted']
    conversion_rates[group].update(converted)
    if t and t % 10_000 == 0:
        _, p_value = sm.stats.proportions_ztest(
            [conversion_rates['treatment'].get(), conversion_rates['control'].get()],
            [conversion_rates['treatment'].n, conversion_rates['control'].n],
            alternative='larger'
        )
        print(f'At time {t}, p-value is {p_value}')

At time 10000, p-value is 0.5017003797428943
At time 20000, p-value is 0.5016591526314012
At time 30000, p-value is 0.5036727980355485
At time 40000, p-value is 0.502629111005679
At time 50000, p-value is 0.50306104724114
At time 60000, p-value is 0.5018889329631347
At time 70000, p-value is 0.5020726098111026
At time 80000, p-value is 0.5026496191551849
At time 90000, p-value is 0.5017898097357284
At time 100000, p-value is 0.5018808441436245
At time 110000, p-value is 0.5007958908340566
At time 120000, p-value is 0.5001146001610731
At time 130000, p-value is 0.5004514720905143
At time 140000, p-value is 0.5005564623588821
At time 150000, p-value is 0.5009082052779854
At time 160000, p-value is 0.501131267314793
At time 170000, p-value is 0.5013340807415309
At time 180000, p-value is 0.5014484061459883
At time 190000, p-value is 0.5015544671976391
At time 200000, p-value is 0.5014694245011267
At time 210000, p-value is 0.5009732901851823
At time 220000, p-value is 0.5010833884409136

**Question 🤔: what are the pros/cons of checking A/B test results before the test has ended?**

**Question 🤔: how would we have done this with batch tools?**