# Online statistics

Basic statistics, such as the mean and the variance, are the building blocks of statistical models. In this tutorial, we will look at how to compute them in an online fashion, that is, by updating the result one observation at a time instead of recomputing it over the whole dataset.

We will use stock market data to feed our examples. This requires installing `yfinance`, a Python package that downloads market data from Yahoo Finance:

```sh
pip install yfinance
```

In [41]:
import yfinance as yf

history = yf.download(
    tickers=['GOOGL', 'MSFT', 'AAPL', 'AMZN'],
    start='2020-01-01',
    end='2022-01-01'
)
history.tail()

[*********************100%***********************]  4 of 4 completed


,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,High,High,High,High,Low,Low,Low,Low,Open,Open,Open,Open,Volume,Volume,Volume,Volume
Date,AAPL,AMZN,GOOGL,MSFT,AAPL,AMZN,GOOGL,MSFT,AAPL,AMZN,GOOGL,MSFT,AAPL,AMZN,GOOGL,MSFT,AAPL,AMZN,GOOGL,MSFT,AAPL,AMZN,GOOGL,MSFT
2021-12-27,179.016113,169.669495,147.906494,338.42334,180.330002,169.669495,147.906494,342.450012,180.419998,172.942993,148.343994,342.480011,177.070007,169.2155,147.169495,335.429993,177.089996,171.037003,147.255997,335.459991,74919600,58688000,15976000,19947000
2021-12-28,177.983704,170.660995,146.686996,337.237457,179.289993,170.660995,146.686996,341.25,181.330002,172.175995,148.298996,343.809998,178.529999,169.135498,146.054504,340.320007,180.160004,170.182495,148.235992,343.149994,79144300,54638000,18200000,15661500
2021-12-29,178.073044,169.201004,146.654999,337.929199,179.380005,169.201004,146.654999,341.950012,180.630005,171.212006,147.417007,344.299988,178.139999,168.600494,145.647507,339.679993,179.330002,170.839996,146.644501,341.299988,62348900,35754000,17788000,15042000
2021-12-30,176.901611,168.644501,146.2005,335.330139,178.199997,168.644501,146.2005,339.320007,180.570007,170.888,147.300003,343.130005,178.089996,168.524002,145.994507,338.820007,179.470001,169.699997,146.694,341.910004,59773000,37584000,15688000,15994500
2021-12-31,176.27623,166.716995,144.852005,332.365417,177.570007,166.716995,144.852005,336.320007,179.229996,169.350006,146.698502,339.359985,177.259995,166.558502,144.852005,335.850006,178.089996,168.955994,146.050003,338.51001,64062300,47830000,18136000,18000800


In [134]:
ticks = history['Adj Close'].melt(
    var_name='ticker',
    value_name='price',
    ignore_index=False
)
ticks.sample(5)

Date,ticker,price
2021-10-20,AMZN,170.753006
2020-07-28,MSFT,196.991806
2020-03-18,AAPL,60.465328
2021-01-28,MSFT,234.157791
2021-02-26,GOOGL,101.095497


## Mean

The mean can be updated with each new observation $x_{t+1}$, without storing past values:

$$
\left\{ 
    \begin{array}{ll}
m_0 = 0  \\
m_{t+1} = m_t + \frac{x_{t+1} - m_t}{t+1}
\end{array}
\right.
$$

We will use Python [dataclasses](https://docs.python.org/3/library/dataclasses.html) for the implementation. These are lightweight containers that take away a lot of the boilerplate required with regular classes.

In [53]:
from dataclasses import dataclass

@dataclass
class Mean:
    n: int = 0
    value: float = 0

    def update(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n

Mean()

Mean(n=0, value=0)

In [54]:
mean = Mean()

for x in ticks.query('ticker == "MSFT"').price:
    mean.update(x)
    
mean

Mean(n=505, value=229.89625256226788)

**Question 🤔: how can we test that this implementation is correct?**
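One possible way to approach this question, offered as a sketch rather than the intended answer, is to feed the same values to the online implementation and to a batch reference, and to compare the two with a tolerance after every update. The snippet below repeats the `Mean` class so that it is self-contained, uses `statistics.fmean` from the standard library as the reference, and runs on synthetic data (an assumption made for illustration) rather than on the stock prices.

```python
import math
import random
import statistics
from dataclasses import dataclass


@dataclass
class Mean:
    n: int = 0
    value: float = 0.0

    def update(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n


random.seed(42)  # arbitrary seed, only for reproducibility
xs = [random.gauss(100, 15) for _ in range(1_000)]  # synthetic data

mean = Mean()
for i, x in enumerate(xs, start=1):
    mean.update(x)
    # The online value should match the batch mean of everything seen so far
    assert math.isclose(mean.value, statistics.fmean(xs[:i]), rel_tol=1e-9)
```

Comparing with a tolerance rather than with `==` matters here, because the online and batch computations round their intermediate results differently.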

## Variance

The online variance can be implemented with [Welford's algorithm](https://jonisalonen.com/2013/deriving-welfords-method-for-computing-variance/).

$$
\left\{ 
    \begin{array}{ll}
m_0 = 0  \\
v_0 = 0  \\
s_0 = 0  \\
m_{t+1} = m_t + \frac{x_{t+1} - m_t}{t+1} \\
s_{t+1} = s_t + (x_{t+1} - m_t)(x_{t+1} - m_{t+1}) \\
v_{t+1} = \frac{s_{t+1}}{t+1}
\end{array}
\right.
$$
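A reason often given for preferring this recurrence over the textbook formula $E[x^2] - E[x]^2$ is numerical stability: subtracting two large, nearly equal numbers loses precision. The sketch below illustrates this on made-up values with a small spread around a large offset. The function names are hypothetical and exist only for this comparison.

```python
def naive_variance(xs):
    """Population variance via E[x^2] - E[x]^2 (numerically fragile)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean


def welford_variance(xs):
    """Population variance via the recurrence above."""
    m, s = 0.0, 0.0
    for t, x in enumerate(xs):
        m_new = m + (x - m) / (t + 1)
        s += (x - m) * (x - m_new)
        m = m_new
    return s / len(xs)


# Made-up data: the population variance of (4, 7, 13, 16) is 22.5,
# and adding a constant offset should not change it.
xs = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print(naive_variance(xs))    # suffers from cancellation, far from 22.5
print(welford_variance(xs))  # stays at 22.5
```

The squares in the naive version are around $10^{18}$, which is beyond the range where double precision floats can represent every integer, so the final subtraction cannot recover a value as small as 22.5.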

In [63]:
from dataclasses import field

@dataclass
class Variance:
    mean: Mean = field(default_factory=Mean)
    sos: float = 0
    value: float = 0

    def update(self, x):
        mean_old = self.mean.value
        self.mean.update(x)
        mean_new = self.mean.value
        self.sos += (x - mean_old) * (x - mean_new)
        self.value = self.sos / self.mean.n

var = Variance()

for x in ticks.query('ticker == "MSFT"').price:
    var.update(x)

var

Variance(mean=Mean(n=505, value=229.89625256226788), sos=1355596.386266115, value=2684.349279734881)

**Question 🤔: although the output seems correct, what is the issue with the implementation?**

In [88]:
@dataclass
class Variance(Mean):
    mean: Mean = field(default_factory=Mean)

    def update(self, x):
        mean_old = self.mean.value
        self.mean.update(x)
        mean_new = self.mean.value
        super().update((x - mean_old) * (x - mean_new))

var = Variance()

for x in ticks.query('ticker == "MSFT"').price:
    var.update(x)

var

Variance(n=505, value=2684.3492797348804, mean=Mean(n=505, value=229.89625256226788))

**Question 🤔: why is this implementation preferable?**

## Covariance

In [96]:
@dataclass
class Covariance(Mean):
    mean_x: Mean = field(default_factory=Mean)
    mean_y: Mean = field(default_factory=Mean)

    def update(self, x, y):
        dx = x - self.mean_x.value
        self.mean_x.update(x)
        self.mean_y.update(y)
        super().update(dx * (y - self.mean_y.value))

cov = Covariance()

for _, tick in (
    ticks
    .query('ticker in ("GOOGL", "MSFT")')
    .pivot(values='price', columns='ticker')
    .iterrows()
):
    cov.update(tick.GOOGL, tick.MSFT)

cov

Covariance(n=505, value=1455.8528752664254, mean_x=Mean(n=505, value=99.0374940702231), mean_y=Mean(n=505, value=229.89625256226788))
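The update rule used here accumulates the product of the deviation of `x` from its old mean and the deviation of `y` from its new mean. As a sanity check, the sketch below compares the online result with a two-pass computation that divides by $n$. It repeats the classes so that it runs on its own, and it uses synthetic data, which is an assumption made for illustration.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Mean:
    n: int = 0
    value: float = 0.0

    def update(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n


@dataclass
class Covariance(Mean):
    mean_x: Mean = field(default_factory=Mean)
    mean_y: Mean = field(default_factory=Mean)

    def update(self, x, y):
        dx = x - self.mean_x.value  # deviation from the old mean of x
        self.mean_x.update(x)
        self.mean_y.update(y)
        # Running mean of dx * (y - new mean of y)
        super().update(dx * (y - self.mean_y.value))


random.seed(0)  # arbitrary seed, only for reproducibility
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [2 * x + random.gauss(0, 1) for x in xs]  # correlated with xs

cov = Covariance()
for x, y in zip(xs, ys):
    cov.update(x, y)

# Two-pass reference: compute both means first, then average the products
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
batch_cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

print(cov.value, batch_cov)
```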

## Pearson correlation

In [98]:
@dataclass
class PearsonCorrelation:
    cov: Covariance = field(default_factory=Covariance)
    var_x: Variance = field(default_factory=Variance)
    var_y: Variance = field(default_factory=Variance)

    def update(self, x, y):
        self.cov.update(x, y)
        self.var_x.update(x)
        self.var_y.update(y)

    @property
    def value(self):
        return self.cov.value / (self.var_x.value * self.var_y.value) ** 0.5

corr = PearsonCorrelation()

for _, tick in (
    ticks
    .query('ticker in ("GOOGL", "MSFT")')
    .pivot(values='price', columns='ticker')
    .iterrows()
):
    corr.update(tick.GOOGL, tick.MSFT)

corr 

PearsonCorrelation(cov=Covariance(n=505, value=1455.8528752664254, mean_x=Mean(n=505, value=99.0374940702231), mean_y=Mean(n=505, value=229.89625256226788)), var_x=Variance(n=505, value=837.2064192106675, mean=Mean(n=505, value=99.0374940702231)), var_y=Variance(n=505, value=2684.3492797348804, mean=Mean(n=505, value=229.89625256226788)))

In [106]:
from dataclasses import asdict

asdict(corr)

{'cov': {'n': 505,
  'value': 1455.8528752664254,
  'mean_x': {'n': 505, 'value': 99.0374940702231},
  'mean_y': {'n': 505, 'value': 229.89625256226788}},
 'var_x': {'n': 505,
  'value': 837.2064192106675,
  'mean': {'n': 505, 'value': 99.0374940702231}},
 'var_y': {'n': 505,
  'value': 2684.3492797348804,
  'mean': {'n': 505, 'value': 229.89625256226788}}}
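The state printed above does not include the correlation itself, because `value` is a property rather than a dataclass field. As a quick worked example, it can be recomputed by hand from the three stored numbers:

```python
import math

# Values copied from the state printed above
cov_xy = 1455.8528752664254
var_x = 837.2064192106675   # GOOGL
var_y = 2684.3492797348804  # MSFT

r = cov_xy / math.sqrt(var_x * var_y)
print(round(r, 3))  # roughly 0.971
```

The covariance and both variances divide by the same $n$, so the normalisation cancels out in the ratio.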

## Rolling mean

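Computing a statistic over a window of recent observations is useful when the data changes over time and old values should stop contributing. For the sake of simplicity, we will only implement rolling windows that hold a fixed number of observations.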

In [110]:
from collections import deque
from typing import Protocol

class Rollable(Protocol):

    def update(self, x):
        ...

    def revert(self, x):
        ...

@dataclass
class Rolling:
    statistic: Rollable
    window_size: int

    def __post_init__(self):
        self.window = deque(maxlen=self.window_size)

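    def update(self, x):
        self.statistic.update(x)
        # Once the window is full, remove the oldest value from the statistic
        if self.window.maxlen == len(self.window):
            self.statistic.revert(self.window[0])
        # The deque drops its oldest element automatically when it is full
        self.window.append(x)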

    @property
    def value(self):
        return self.statistic.value

In [112]:
@dataclass
class RollableMean(Mean):

    def revert(self, x):
        self.n -= 1
        self.value -= (x - self.value) / self.n

In [120]:
rmean = Rolling(RollableMean(), 10)

for x in ticks.query('ticker == "MSFT"').price:
    rmean.update(x)
    
rmean

Rolling(statistic=RollableMean(n=10, value=330.0904785156252), window_size=10)

In [121]:
ticks.query('ticker == "MSFT"').tail(10).price.mean()

330.090478515625
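The comparison above only checks the final window, and the two results differ in the last digits, which suggests that repeated updates and reverts accumulate a little floating point error. The revert step follows from removing $x$ from a mean $m$ of $n$ values: $(n m - x) / (n - 1) = m - (x - m) / (n - 1)$. The sketch below checks the rolling mean against a brute-force mean at every step. It inlines the window logic, does not inherit from `Mean` so that it stays self-contained, and uses synthetic data as an assumption.

```python
import math
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class RollableMean:
    n: int = 0
    value: float = 0.0

    def update(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n

    def revert(self, x):
        # Remove x from the mean: m_new = m - (x - m) / (n - 1)
        self.n -= 1
        self.value -= (x - self.value) / self.n


window_size = 10
stat = RollableMean()
window = deque(maxlen=window_size)

random.seed(1)  # arbitrary seed, only for reproducibility
xs = [random.uniform(50, 150) for _ in range(200)]  # synthetic data

for i, x in enumerate(xs):
    stat.update(x)
    if len(window) == window.maxlen:
        stat.revert(window[0])  # the oldest value leaves the window
    window.append(x)
    # Brute-force mean over the values currently in the window
    recent = xs[max(0, i + 1 - window_size): i + 1]
    assert math.isclose(stat.value, sum(recent) / len(recent), rel_tol=1e-9)
```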

## Using River

It's nice to know how these online algorithms work. In practice, however, you'll want to use an existing library. In Python, there is [River](https://github.com/online-ml/river/):

```sh
pip install river
```

In [123]:
from river import stats
from river import utils

rmean = utils.Rolling(stats.Mean(), 10)

for x in ticks.query('ticker == "MSFT"').price:
    rmean.update(x)
    
rmean

Mean: 330.090479

## Divide and conquer

In [132]:
A = stats.Mean()
year = 2020
for x in ticks.query('ticker == "MSFT" and Date.dt.year == @year').price:
    A.update(x)
    
B = stats.Mean()
year = 2021
for x in ticks.query('ticker == "MSFT" and Date.dt.year == @year').price:
    B.update(x)

C = A + B
C

Mean: 229.896253
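Two means can be merged because each one carries its count: the merged mean is the average of the two values, weighted by their counts. The sketch below shows this idea with a hypothetical `MergeableMean` class. It is an illustration of the principle and not necessarily how River implements `+`.

```python
import math
from dataclasses import dataclass


@dataclass
class MergeableMean:  # hypothetical class, for illustration only
    n: int = 0
    value: float = 0.0

    def update(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n

    def __add__(self, other):
        n = self.n + other.n
        if n == 0:
            return MergeableMean()
        # Count-weighted average of the two partial means
        value = (self.n * self.value + other.n * other.value) / n
        return MergeableMean(n=n, value=value)


a, b, full = MergeableMean(), MergeableMean(), MergeableMean()

for x in [1.0, 2.0, 3.0]:  # first chunk of made-up data
    a.update(x)
    full.update(x)

for x in [10.0, 20.0]:     # second chunk of made-up data
    b.update(x)
    full.update(x)

merged = a + b
print(merged, full)  # both describe 5 values with a mean of about 7.2
```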

**Question 🤔: what does this ability of merging statistics enable?**

## Slice and dice

In [138]:
from river import feature_extraction

agg = feature_extraction.Agg(
    on='price',
    by='ticker',
    how=stats.Mean()
)

for tick in ticks.to_dict('records'):
    agg.learn_one(tick)

agg.state

AAPL     116.635108
AMZN     150.585230
GOOGL     99.037494
MSFT     229.896253
Name: price_mean_by_ticker, dtype: float64
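Conceptually, a grouped aggregate keeps one statistic per group and routes each record to the statistic of its group. The following is a minimal sketch of that idea with a `defaultdict`. It is not River's implementation, and the records are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Mean:
    n: int = 0
    value: float = 0.0

    def update(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n


# Made-up records with the same keys as the ticks used above
records = [
    {'ticker': 'AAA', 'price': 10.0},
    {'ticker': 'BBB', 'price': 100.0},
    {'ticker': 'AAA', 'price': 14.0},
    {'ticker': 'BBB', 'price': 110.0},
]

means = defaultdict(Mean)  # a fresh Mean is created for each new ticker
for record in records:
    means[record['ticker']].update(record['price'])

state = {ticker: mean.value for ticker, mean in means.items()}
print(state)  # {'AAA': 12.0, 'BBB': 105.0}
```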

In [145]:
from river import feature_extraction

agg = (
    feature_extraction.Agg(on='price', by='ticker', how=stats.Mean()) +
    feature_extraction.Agg(on='price', by='ticker', how=stats.Var())
)

for tick in ticks.to_dict('records'):
    agg.learn_one(tick)

In [146]:
agg[0].state

AAPL     116.635108
AMZN     150.585230
GOOGL     99.037494
MSFT     229.896253
Name: price_mean_by_ticker, dtype: float64

In [147]:
agg[1].state

AAPL      868.475062
AMZN      679.493151
GOOGL     838.867543
MSFT     2689.675370
Name: price_var_by_ticker, dtype: float64
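The MSFT variance reported here (2689.675370) is slightly larger than the one produced by the hand-written `Variance` class earlier (2684.349280). This is consistent with River's `stats.Var` estimating the sample variance, which divides by $n - 1$, whereas the hand-written class divides by $n$. That default is an assumption worth checking against River's documentation. The arithmetic below shows that rescaling by $n / (n - 1)$ accounts for the gap:

```python
n = 505  # number of MSFT observations, as reported by Mean above
population_var = 2684.3492797348804  # from the hand-written Variance (divides by n)

# Rescale from a division by n to a division by n - 1
sample_var = population_var * n / (n - 1)
print(round(sample_var, 3))  # 2689.675
```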

## A/B testing application

In [3]:
import pandas as pd

url = 'https://raw.githubusercontent.com/alenyeh1014/DataAnalytics-AB_Testing/master/DataFiles/ab_data.csv'
events = pd.read_csv(url)
events = events.sort_values(by='timestamp')
events.head()

,user_id,timestamp,group,landing_page,converted
131228,922696,2017-01-02 13:42:05.378582,treatment,new_page,0
184884,781507,2017-01-02 13:42:15.234051,control,old_page,0
83878,737319,2017-01-02 13:42:21.786186,control,old_page,0
102717,818377,2017-01-02 13:42:26.640581,treatment,new_page,0
158789,725857,2017-01-02 13:42:27.851110,treatment,new_page,0


In [28]:
from collections import defaultdict
from river import stats
from river import stream

conversion_rates = defaultdict(stats.Mean)

for event, _ in stream.iter_pandas(events):
    group = event['group']
    converted = event['converted']
    conversion_rates[group].update(converted)

conversion_rates

defaultdict(river.stats.mean.Mean,
            {'treatment': Mean: 0.11892, 'control': Mean: 0.120399})

In [29]:
import statsmodels.api as sm

z_score, p_value = sm.stats.proportions_ztest(
    [conversion_rates['treatment'].get(), conversion_rates['control'].get()],
    [conversion_rates['treatment'].n, conversion_rates['control'].n],
    alternative='larger'
)
z_score, p_value

(-0.003147450060250258, 0.5012556488313168)
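A point worth checking in the cell above: statsmodels documents the first argument of `proportions_ztest` as the number of successes in each group, whereas `Mean.get()` returns the conversion rate. Since a running mean of 0/1 values times its count gives the number of conversions, the counts can be recovered as `rate * n`, and the figures printed above may change once counts are passed. The sketch below implements a pooled one-sided two-proportion z-test with the standard library only, so that the role of the counts is explicit. The helper name and the numbers are made up for illustration.

```python
import math


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """One-sided pooled z-test of H1: rate_a > rate_b (hypothetical helper)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up running means and counts, standing in for Mean.get() and Mean.n
rate_treatment, n_treatment = 0.125, 8_000
rate_control, n_control = 0.120, 8_000

z, p = two_proportion_ztest(
    round(rate_treatment * n_treatment), n_treatment,  # rate * n recovers the count
    round(rate_control * n_control), n_control,
)
print(z, p)
```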

**Question 🤔: what's the conclusion of this test?**

We can also run the test at any point in time.

In [33]:
conversion_rates = defaultdict(stats.Mean)
p_values = []

for t, (event, _) in enumerate(stream.iter_pandas(events)):
    group = event['group']
    converted = event['converted']
    conversion_rates[group].update(converted)
    if t and t % 10_000 == 0:
        _, p_value = sm.stats.proportions_ztest(
            [conversion_rates['treatment'].get(), conversion_rates['control'].get()],
            [conversion_rates['treatment'].n, conversion_rates['control'].n],
            alternative='larger'
        )
        print(f'At time {t}, p-value is {p_value}')

At time 10000, p-value is 0.5017003797428943
At time 20000, p-value is 0.5016591526314012
At time 30000, p-value is 0.5036727980355485
At time 40000, p-value is 0.502629111005679
At time 50000, p-value is 0.50306104724114
At time 60000, p-value is 0.5018889329631347
At time 70000, p-value is 0.5020726098111026
At time 80000, p-value is 0.5026496191551849
At time 90000, p-value is 0.5017898097357284
At time 100000, p-value is 0.5018808441436245
At time 110000, p-value is 0.5007958908340566
At time 120000, p-value is 0.5001146001610731
At time 130000, p-value is 0.5004514720905143
At time 140000, p-value is 0.5005564623588821
At time 150000, p-value is 0.5009082052779854
At time 160000, p-value is 0.501131267314793
At time 170000, p-value is 0.5013340807415309
At time 180000, p-value is 0.5014484061459883
At time 190000, p-value is 0.5015544671976391
At time 200000, p-value is 0.5014694245011267
At time 210000, p-value is 0.5009732901851823
At time 220000, p-value is 0.5010833884409136

**Question 🤔: what are the pros/cons of checking A/B test results before the test has ended?**

**Question 🤔: how would we have done this with batch tools?**