# HW: Hypothesis Testing with ARPU / Revenue per User

## Context
You are a data analyst on a mobile game team. The team launched a change in onboarding / paywall / checkout and asks you to evaluate impact using **ARPU (Average Revenue Per User)**.

For this homework, ARPU is computed at the **user level over the whole period**:

1) First aggregate revenue per user over the period:
$Revenue_u = \sum_{d \in period} revenue_{u,d}$

2) Then compute ARPU as the mean across users:
$ARPU = \dfrac{1}{N}\sum_{u=1}^{N} Revenue_u$

Why ARPU is commonly used:
- In A/B tests with equal-sized groups, $\Delta ARPU$ is directly proportional to the difference in **total revenue** between groups:
  $TotalRevenue = N \cdot ARPU$
  so comparing ARPU is equivalent to comparing total revenue (up to a constant factor).

Revenue data is usually tricky:
- many users have **zero revenue**
- heavy tails (rare users with very large payments)

Because of that, we want to be careful with “standard” asymptotic tests.

You have two files:
- `history_activity.csv`: pre-period (1 month)
- `current_activity.csv`: current period (1 month) + `bucket` (0/1)


# Task 1: Pre-period visualization (history_activity.csv)

Work only with `history_activity.csv`.

Build daily time series:

1) **DAU**: number of unique users with `revenue > 0` on that day  
2) **Daily ARPU**: mean revenue per user-day:
   $ARPU(d) = mean_u(revenue_{u,d})$
3) **Share of payers** per day:
   $\pi(d) = \dfrac{\#\{u: revenue_{u,d} > 0\}}{\#\{u\}}$

Make 3 plots (one per metric).

Write 3–5 sentences:
- is there seasonality / trend?
- is the payer share stable?
- do you see signs of heavy tails (spikes)?
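The three daily series can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes the file has one row per user per day with fields like `user_id`, `date`, and `revenue` (hypothetical names), uses a few made-up rows instead of `history_activity.csv`, and takes the users with a row on a given day as the denominator for daily ARPU and payer share.

```python
from collections import defaultdict

# Hypothetical toy rows: (user_id, date, revenue).
# In the homework these would be read from history_activity.csv.
rows = [
    ("u1", "2024-01-01", 0.0),
    ("u2", "2024-01-01", 4.0),
    ("u3", "2024-01-01", 0.0),
    ("u1", "2024-01-02", 2.0),
    ("u2", "2024-01-02", 6.0),
]


def daily_metrics(rows):
    """Return {date: {"dau": ..., "arpu": ..., "payer_share": ...}}."""
    users = defaultdict(set)      # users with a row on the day
    payers = defaultdict(set)     # users with revenue > 0 on the day
    revenue = defaultdict(float)  # total revenue on the day
    for user_id, date, rev in rows:
        users[date].add(user_id)
        revenue[date] += rev
        if rev > 0:
            payers[date].add(user_id)

    out = {}
    for date in sorted(users):
        n_users = len(users[date])
        n_payers = len(payers[date])
        out[date] = {
            "dau": n_payers,                       # Task 1 definition: revenue > 0
            "arpu": revenue[date] / n_users,       # mean revenue per user-day
            "payer_share": n_payers / n_users,     # payers / users on that day
        }
    return out


metrics = daily_metrics(rows)
for date, m in metrics.items():
    print(date, m)
```

On the real data the same aggregation is usually a single group-by in pandas, and each of the three resulting series goes on its own plot.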


In [None]:
## code

In [None]:
## code

# Task 2: Is the asymptotic z/t-test calibrated for ARPU?

We want to check if the standard asymptotic test for the difference in means behaves correctly under **AA** (no real effect).

## Setup
1) Aggregate the pre-period into **one row per user**:
   $revenue_u = \sum_d revenue_{u,d}$

2) Repeat many times:
- randomly split users into two equal groups A and B
- run an asymptotic test for difference in mean revenue (asymptotic z-test)
- store the p-value

## What to check
- **Alpha control**: $P(p < 0.05)$ should be close to 0.05
- **Uniformity**: p-values should look approximately Uniform(0,1)
  (histogram + QQ plot)

If calibration is poor, it means naive asymptotic inference may be misleading for heavy-tailed revenue.
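The AA loop described in the setup can be sketched as below. This is only an illustration under stated assumptions: the user-level revenue is synthetic (90% zeros, lognormal payers) rather than taken from the pre-period file, the z-test uses unpooled variances, and the helper names `z_test_pvalue` and `aa_p_values` are made up for this sketch.

```python
import math
import random
from statistics import NormalDist


def z_test_pvalue(a, b):
    """Two-sided asymptotic z-test for a difference in means (unpooled variances)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    if se == 0:
        return 1.0  # no variation at all: nothing to reject
    z = (mean_b - mean_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_p_values(user_revenue, n_sims=500, seed=0):
    """Repeatedly split users into two equal random groups and store p-values."""
    rng = random.Random(seed)
    values = list(user_revenue)
    half = len(values) // 2
    p_values = []
    for _ in range(n_sims):
        rng.shuffle(values)
        p_values.append(z_test_pvalue(values[:half], values[half:2 * half]))
    return p_values


# Synthetic stand-in for one-row-per-user pre-period revenue (an assumption).
data_rng = random.Random(42)
user_revenue = [
    data_rng.lognormvariate(1.0, 1.0) if data_rng.random() < 0.1 else 0.0
    for _ in range(2000)
]

p_values = aa_p_values(user_revenue)
rejection_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"AA rejection rate at alpha=0.05: {rejection_rate:.3f}")
```

The stored `p_values` are what goes into the histogram and the QQ plot against Uniform(0,1).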


In [None]:
## code

In [None]:
## code

If you did everything correctly, you should see that under AA:
- the rejection rate at $\alpha=0.05$ is close to 0.05
- p-values look approximately Uniform(0,1)

This means the asymptotic test for the difference in mean ARPU is reasonably well-calibrated for our data.

Now we can move on to the practical question: **what experiment design can we afford and what is optimal for us?**
Next, we will think about sample size, duration, and variance reduction options (e.g., stratification / CUPED), and choose a design that achieves the required power with minimal cost.


# Task 3: Experiment design: how long do we need to run?

We are ready to run an A/B test, but first we must answer a basic design question:

**Given our traffic and the variability of ARPU, how long should the experiment run to detect meaningful effects?**

To answer this, we will compute the **minimum detectable effect (MDE)** for different experiment durations and test settings.

## Step 1: Estimate weekly traffic
Using `history_activity.csv`:
- compute the average number of weekly unique users

Call this value $T$ = **unique users per week**.

## Step 2: Convert weeks to total sample size
For a duration of $w$ weeks:
$N(w) = T \cdot w$

We will consider:
$w \in \{1,2,3,4,5,6,7,8\}$  
(so $N(w)$ is up to ~800k)

## Step 3: Compute relative MDE for ARPU
Using `history_activity.csv`, estimate baseline ARPU variability on the same window length:
- $Revenue_u = \sum_{d \in window} revenue_{u,d}$
- $ARPU = mean_u(Revenue_u)$

Then compute **relative MDE (%)** for:
- $\alpha \in \{0.01, 0.05, 0.10\}$ (two-sided)
- power $\in \{0.75, 0.80\}$
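One common way to compute the relative MDE for a two-sided z-test with two equal groups is sketched below. The formula $MDE_{abs} = (z_{1-\alpha/2} + z_{power}) \cdot \sqrt{2\sigma^2 / n}$ with $n = N(w)/2$ users per group is an assumption of this sketch (the homework does not prescribe one), and the traffic, mean, and standard deviation values are hypothetical placeholders for the numbers estimated from the pre-period.

```python
import math
from statistics import NormalDist


def relative_mde(mean, std, n_total, alpha=0.05, power=0.80):
    """Relative MDE for a two-sided z-test with two equal groups of n_total / 2 users."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_per_group = n_total / 2
    mde_abs = z_sum * math.sqrt(2 * std ** 2 / n_per_group)
    return mde_abs / mean


# Hypothetical baseline values; replace with estimates from the pre-period data.
weekly_users = 100_000   # T, unique users per week
mean_revenue = 1.0       # baseline ARPU over the window
std_revenue = 10.0       # standard deviation of user-level revenue

for weeks in range(1, 9):
    for alpha in (0.01, 0.05, 0.10):
        for power in (0.75, 0.80):
            mde = relative_mde(mean_revenue, std_revenue, weekly_users * weeks, alpha, power)
            print(f"w={weeks} alpha={alpha:.2f} power={power:.2f} MDE={100 * mde:.1f}%")
```

In this sketch the mean and standard deviation are held fixed across durations for simplicity; in the homework they should be re-estimated on a window of matching length, because user-level revenue accumulates over time.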


In [None]:
## code

### Why does the MDE decrease more slowly as n grows?

In [None]:
## code

It looks like with a standard choice of $\alpha = 0.05$ and power $= 0.80$ we would need to run the experiment for about **4 weeks**, and even then the **relative MDE for ARPU is quite high**.

We discussed this with the product manager. They say we **cannot wait longer than 4 weeks**, so we commit to:
- **two-sided $\alpha = 0.05$**
- **power $= 0.80$ (i.e., $\beta = 0.20$)**
- **duration = 4 weeks**

The product manager has high confidence in the change and believes the expected effect is large, so this MDE is acceptable for a go/no-go decision.

At the same time, we still want to understand *what drives the revenue change*.  
ARPU can increase because:
- more users convert to payment (**payer rate** goes up), and/or
- paying users spend more (**ARPPU** goes up)

So we will **decompose ARPU** and compute MDE not only for ARPU, but also for:
1) **Payer Rate**: $P(Revenue_u > 0)$ over the experiment window  
2) **ARPPU**: $E[Revenue_u \mid Revenue_u > 0]$ over the experiment window

These are not decision metrics, but they help interpret the mechanism behind any ARPU movement.
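
The decomposition can be written out explicitly. A short derivation, assuming revenue is non-negative (so non-payers contribute exactly zero) and writing $N_{pay}$ for the number of users with $Revenue_u > 0$ (a symbol introduced here for illustration only):

$ARPU = \dfrac{1}{N}\sum_{u} Revenue_u = \dfrac{N_{pay}}{N} \cdot \dfrac{1}{N_{pay}}\sum_{u:\,Revenue_u > 0} Revenue_u = PayerRate \cdot ARPPU$

For small changes, the relative lifts therefore add up approximately:

$\dfrac{\Delta ARPU}{ARPU} \approx \dfrac{\Delta PayerRate}{PayerRate} + \dfrac{\Delta ARPPU}{ARPPU}$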

## Questions
1) What are the MDE values for these three metrics?
2) How do you interpret them?  
   Which component (conversion vs spend among payers) is easier to detect and why?


In [None]:
## code

In [None]:
## code

Typically, the MDE for **payer rate** and **ARPPU** is lower than for ARPU.  
That means these metrics are often more sensitive and can help us understand *what exactly changed*:
- did we improve **conversion to payment**?
- or did we increase **revenue per payer** (order size)?

However, our original business goal is **ARPU**.  
So we will treat payer rate and ARPPU as **proxy / diagnostic metrics**: they help interpret the outcome, but they do not replace the primary objective.


# Task 4: Define the OEC (template)

Read about OEC here:
https://www.analytics-toolkit.com/glossary/overall-evaluation-criterion/

Fill in the fields below.

## 1) Product hypothesis
Write one clear hypothesis (come up with any idea that could plausibly increase ARPU in our mobile game):

> “If we ______, then ______ will change because ______.”

(1–2 sentences total.)

---

## 2) OEC (Overall Evaluation Criterion)

### Primary metric (ARPU)
For each user:
$Revenue_u = \sum_{d \in experiment} revenue_{u,d}$

Then:
$ARPU = \dfrac{1}{N}\sum_u Revenue_u$

### Success criterion (fixed for this homework)
We declare success if:
- **ARPU uplift ≥ {insert your relative MDE}**:
  $\dfrac{ARPU_B - ARPU_A}{ARPU_A} \ge  MDE$
- and the result is statistically significant with **two-sided $\alpha = 0.05$**

---

## 3) Proxy / diagnostic metrics (interpretation only)

**Payer rate**:
$payer_u = 1[Revenue_u > 0]$  
$PayerRate = mean_u(payer_u)$

**ARPPU**:
$ARPPU = E[Revenue_u \mid Revenue_u > 0]$

Fill in:
- If ARPU increases mainly via **payer rate**, it means ____________________________.
- If ARPU increases mainly via **ARPPU**, it means ______________________________.


Great, we are ready to launch!

The product manager thanks you for the preparation:
- you validated that the chosen statistical test is well-calibrated under AA (alpha is controlled),
- the experiment design matches the business goal (primary decision is based on ARPU),
- and you defined helpful diagnostic metrics (payer rate and ARPPU) that will let us understand *why* ARPU changes (conversion vs revenue per payer).

Now we can run the experiment and analyze the results using the agreed OEC.


# Experiment analysis - Step 1: Sanity check for the splitter (SRM)

Before looking at business impact, we should verify that randomization worked correctly.

A common issue is **Sample Ratio Mismatch (SRM)**: the observed number of users in A and B differs from the expected split (e.g., 50/50). This can indicate problems in the splitter or data collection.

We test SRM using a **chi-square goodness-of-fit** test:
- $H_0$: group assignment follows the expected split (50/50)
- $H_1$: observed split differs from expected

If the p-value is small (e.g., < 0.05), we suspect SRM and should not trust the experiment results until investigated.
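
A minimal sketch of the SRM check for two groups, using only the standard library. It relies on the fact that for 1 degree of freedom the chi-square survival function equals `erfc(sqrt(chi2 / 2))`; the group counts are hypothetical, and `srm_pvalue` is a name made up for this sketch. With SciPy available, a goodness-of-fit routine such as `scipy.stats.chisquare` would be the more usual choice.

```python
import math


def srm_pvalue(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test for a two-group split (1 degree of freedom)."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # Survival function of chi-square with df=1.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical observed counts for bucket 0 and bucket 1.
chi2, p_value = srm_pvalue(50_210, 49_790)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}")
```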


In [None]:
## code

So far so good: nothing suspicious, so let's move forward.

# Task 5: Build an A/B report table (experiment results)

The experiment is finished and all data is available.

Your goal is to produce a clean A/B report table with the main results for **three metrics**:
1) **ARPU** (28-day revenue per user)
2) **Payer rate**: $P(Revenue_u > 0)$
3) **ARPPU**: $E[Revenue_u \mid Revenue_u > 0]$

## Step 1: Prepare user-level dataset
Work with `current_activity.csv`.

For the first 28 days of the experiment:
- aggregate to **one row per user**:
  $Revenue_u = \sum_d revenue_{u,d}$
- keep the user’s bucket (0/1)

## Step 2: Compute per-metric stats
For each metric, report:

- **Control mean**
- **Treatment mean**
- **Absolute lift**: $\Delta = \bar{X}_B - \bar{X}_A$
- **Relative lift (%)**: $\Delta / \bar{X}_A$
- **Standard error** of $\Delta$
- **95% CI** for $\Delta$
- **p-value**

Use asymptotic tests:
- ARPU: Asymptotic z-test on user-level $Revenue_u$
- Payer rate: Asymptotic z-test on the user-level payer indicator $1[Revenue_u > 0]$
- ARPPU: Asymptotic z-test on payer-only $Revenue_u$ values

## Step 3: Format the report
Create a table (DataFrame) with one row per metric and apply styling:
- highlight rows with **p-value < 0.05** (e.g., green)
- display readable numeric formatting (rounding)

The output should look like a standard experiment report table.
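
The per-metric statistics from Step 2 can be sketched with one helper that is applied to three different user-level vectors. This is an illustration under assumptions: the toy revenue lists are invented, `ab_report_row` is a hypothetical name, and the standard error uses unpooled variances.

```python
import math
from statistics import NormalDist


def ab_report_row(control, treatment, alpha=0.05):
    """Means, lifts, standard error, CI and p-value for a difference in means."""
    n_a, n_b = len(control), len(treatment)
    mean_a, mean_b = sum(control) / n_a, sum(treatment) / n_b
    var_a = sum((x - mean_a) ** 2 for x in control) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in treatment) / (n_b - 1)
    delta = mean_b - mean_a
    se = math.sqrt(var_a / n_a + var_b / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(delta / se)))
    return {
        "control_mean": mean_a,
        "treatment_mean": mean_b,
        "abs_lift": delta,
        "rel_lift_pct": 100 * delta / mean_a,
        "se": se,
        "ci_low": delta - z_crit * se,
        "ci_high": delta + z_crit * se,
        "p_value": p_value,
    }


# Toy user-level revenue per bucket (made up for illustration).
revenue_a = [0.0, 0.0, 2.0, 2.0]
revenue_b = [1.0, 1.0, 3.0, 3.0]

report = {
    "ARPU": ab_report_row(revenue_a, revenue_b),
    "Payer rate": ab_report_row(
        [float(x > 0) for x in revenue_a], [float(x > 0) for x in revenue_b]
    ),
    "ARPPU": ab_report_row(
        [x for x in revenue_a if x > 0], [x for x in revenue_b if x > 0]
    ),
}
for metric, row in report.items():
    print(metric, {k: round(v, 4) for k, v in row.items()})
```

A dictionary of rows like `report` converts directly into a DataFrame with one row per metric, which can then be styled and rounded as Step 3 asks.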


In [None]:
## code

## Interpretation questions (write-up)

Look at the A/B report table for:
- ARPU
- payer rate
- ARPPU

Answer in a few sentences:

1) **What happened in the experiment?**  
   Describe the direction of changes for all three metrics (up / down / no clear change).

2) **What could have happened product-wise?**  
   Give 1–2 plausible product explanations based on the pattern you see.

3) **Do we consider the experiment successful? Why?**  
   Use the OEC and clearly state whether it is a win / loss / inconclusive.


In [None]:
## code

## Note on confidence intervals for relative lift

Be careful when reporting confidence intervals for **relative lift**:
$L = \dfrac{\bar{X}_B - \bar{X}_A}{\bar{X}_A}$

Even if the underlying metric is a simple mean (like ARPU), **relative lift is a ratio** because the denominator $\bar{X}_A$ is random.  
This can cause subtle issues:
- naive transformations of an absolute CI into a relative CI (e.g., “divide CI endpoints by $\bar{X}_A$”) may be inaccurate
- coverage can be distorted, especially with heavy-tailed data or small samples
- for ratio metrics (or near-zero denominators) the problem becomes even more pronounced
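
One common alternative to the naive conversion is a delta-method interval, which accounts for the randomness of $\bar{X}_A$. The sketch below assumes independent groups and uses the first-order approximation $\mathrm{Var}(L) \approx \dfrac{\mathrm{Var}(\bar{X}_B)}{\bar{X}_A^2} + \dfrac{\bar{X}_B^2\,\mathrm{Var}(\bar{X}_A)}{\bar{X}_A^4}$; the function name and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def relative_lift_ci(control, treatment, alpha=0.05):
    """Delta-method CI for L = mean_B / mean_A - 1, assuming independent groups."""
    n_a, n_b = len(control), len(treatment)
    mean_a, mean_b = sum(control) / n_a, sum(treatment) / n_b
    # Variances of the two sample means.
    var_mean_a = sum((x - mean_a) ** 2 for x in control) / (n_a - 1) / n_a
    var_mean_b = sum((x - mean_b) ** 2 for x in treatment) / (n_b - 1) / n_b
    lift = mean_b / mean_a - 1
    var_lift = var_mean_b / mean_a ** 2 + mean_b ** 2 * var_mean_a / mean_a ** 4
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(var_lift)
    return lift, lift - half_width, lift + half_width


# Toy data (made up): the naive interval divides the absolute CI by mean_A.
control = [0.0, 0.0, 2.0, 2.0]
treatment = [1.0, 1.0, 3.0, 3.0]
lift, low, high = relative_lift_ci(control, treatment)
print(f"relative lift={lift:.2f}, CI=({low:.2f}, {high:.2f})")
```

On this toy input the delta-method interval is wider than the naive one, because the second variance term adds the uncertainty of the control mean; with large samples and a well-estimated denominator the two get closer.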

If you want to go deeper, see:
- https://alexdeng.github.io/public/files/kdd2018-dm.pdf
- https://www.landonlehman.com/post/confidence-intervals-for-relative-lift/


# CUPED (variance reduction): can we decide faster using historical data?

If we have a good pre-period signal for the same users, we can often reduce variance and detect effects faster.

Before applying CUPED, we need to check a basic feasibility question:

## Task 6
Using `current_activity.csv` (experiment users) and `history_activity.csv` (pre-period):

1) Define the experiment user set:
- all unique `user_id` present in the first 28 days of `current_activity.csv`

2) For these users, check how many have historical data:
- a user “has history” if they appear at least once in `history_activity.csv`

Compute:
- share of experiment users with history:
  $\text{coverage} = \dfrac{|U_{exp} \cap U_{hist}|}{|U_{exp}|}$

Report this share as a percentage.
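
The coverage formula is a plain set intersection. A minimal sketch, with made-up user ids standing in for the `user_id` columns of the two files:

```python
def history_coverage(exp_user_ids, hist_user_ids):
    """Share of experiment users that also appear in the pre-period."""
    exp_users = set(exp_user_ids)
    hist_users = set(hist_user_ids)
    return len(exp_users & hist_users) / len(exp_users)


# Hypothetical ids; duplicates are fine because sets keep unique users only.
coverage = history_coverage(["a", "b", "c", "d", "a"], ["b", "c", "x"])
print(f"coverage = {100 * coverage:.1f}%")
```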


In [None]:
## code

## CUPED: how much variance reduction do we get, and how much time can we save?

From the lecture, CUPED reduces variance approximately by a factor:
$\mathrm{Var}_{new} \approx (1-\rho^2)\,\mathrm{Var}_{old}$

where $\rho$ is the correlation between:
- the **pre-period** user metric (covariate)
- the **experiment-period** user metric (outcome)

## Task
1) For each user in the experiment, compute:
- $X_u$ = pre-period revenue (sum over the pre-period window)
- $Y_u$ = experiment-period revenue (sum over the experiment window)

2) Compute the correlation $\rho = corr(X, Y)$ using only users with both $X_u$ and $Y_u$.

3) Compute the theoretical variance reduction:
- variance multiplier: $(1-\rho^2)$
- variance reduction (%): $100 \cdot \rho^2$

4) Translate variance reduction into time reduction:
For a fixed MDE, required sample size scales linearly with variance, so required duration scales similarly:
$\text{weeks}_{new} \approx (1-\rho^2)\cdot \text{weeks}_{old}$

Assume $\text{weeks}_{old} = 4$ and estimate $\text{weeks}_{new}$.
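
Steps 2 to 4 can be sketched as below. The paired toy values are invented, and `pearson_corr` and `cuped_summary` are hypothetical helper names; on real data the pairs would be $(X_u, Y_u)$ for users present in both periods.

```python
import math


def pearson_corr(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def cuped_summary(pre, post, weeks_old=4):
    """Theoretical CUPED variance multiplier and the implied duration."""
    rho = pearson_corr(pre, post)
    multiplier = 1 - rho ** 2
    return {
        "rho": rho,
        "variance_multiplier": multiplier,
        "variance_reduction_pct": 100 * rho ** 2,
        "weeks_new": multiplier * weeks_old,
    }


# Toy paired revenue for four users (made up for illustration).
pre_revenue = [1.0, 2.0, 3.0, 4.0]
post_revenue = [1.0, 3.0, 2.0, 4.0]
summary = cuped_summary(pre_revenue, post_revenue)
print(summary)
```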


In [None]:
## code

If you did everything correctly, you should see that **CUPED barely helps in this dataset**.

Even though there is a large overlap between experiment users and pre-period users, the correlation $\rho$ between
pre-period revenue $X_u$ and experiment-period revenue $Y_u$ is very small. As a result:
- $\rho^2$ is close to 0
- the variance multiplier $(1-\rho^2)$ is close to 1
- the estimated time reduction is almost zero

**Explain why this happens in our setting even though overlap is high (both product-wise and from a statistical perspective).**

In [None]:
## code

# Task 7: What happens if we “peek” every day?

So far we assumed a fixed horizon: we run the experiment for a planned duration and decide once at the end.

Now let’s simulate a very common anti-pattern:
> “We stop the experiment on the first day when p-value < α”.

Even if there is **no real effect** (AA / null), repeated daily looks inflate the false positive rate:
- nominal $\alpha=0.05$ no longer means “5% false positives”
- the real probability of a false win becomes much higher

## Goal
Simulate many AA experiments and estimate the **real FPR**:
- **FPR** = share of experiments where p-value becomes < α **at least once** during the timeline

Also plot random p-value trajectories and color **red** those that cross the threshold at least once.


## Important: this simulation must be AA (null)

To estimate the **false positive rate** of “stop when p-value < α”, we must simulate **AA experiments** (no real effect).
That means we should use **pre-period data** (`history_activity.csv`) and randomly split users into A/B.

We should NOT use the finished experiment data (`current_activity.csv`) for this step, because it may contain a real treatment effect (or other changes), and then the result would no longer measure FPR under the null.
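
The structure of such a simulation can be sketched as follows. To stay self-contained, this sketch generates synthetic i.i.d. daily revenue (5% daily payer probability, exponential payments) instead of reading the pre-period file, so the numbers it prints are illustrative only; all names and parameters are assumptions. Because the synthetic users are exchangeable, comparing the first half with the second half plays the role of a random A/B split.

```python
import math
import random
from statistics import NormalDist


def abs_z(a, b):
    """Absolute z-statistic for a difference in means (unpooled variances)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return 0.0 if se == 0 else abs(mean_b - mean_a) / se


def peeking_fpr(n_sims=200, n_users=200, n_days=28, alpha=0.05, seed=0):
    """Return (FPR when stopping at first daily p < alpha, FPR of one final test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half = n_users // 2
    crossed_any = crossed_final = 0
    for _ in range(n_sims):
        cumulative = [0.0] * n_users  # running revenue per user
        ever_significant = False
        significant_today = False
        for _ in range(n_days):
            for u in range(n_users):
                if rng.random() < 0.05:  # synthetic daily payer probability
                    cumulative[u] += rng.expovariate(1.0)
            # Daily "peek" on cumulative user-level revenue.
            significant_today = abs_z(cumulative[:half], cumulative[half:]) > z_crit
            ever_significant = ever_significant or significant_today
        crossed_any += ever_significant
        crossed_final += significant_today  # state on the last day only
    return crossed_any / n_sims, crossed_final / n_sims


fpr_peek, fpr_final = peeking_fpr()
print(f"FPR with daily peeking: {fpr_peek:.2f}, fixed-horizon FPR: {fpr_final:.2f}")
```

For the homework itself, the inner data-generation loop would be replaced by cumulative sums of real daily revenue from the pre-period, with users reshuffled into A/B in every simulated experiment.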


In [None]:
## code

As you can see, making a decision by “peeking” every day and stopping at the first time $p < \alpha$ is very dangerous.

Even under AA (no true effect), repeated looks greatly increase the chance of crossing the threshold at least once.  
So the nominal $\alpha = 0.05$ no longer means “5% false positives”: the real false positive rate becomes much higher.

Conclusion: **without special corrections, you must not make decisions this way**.  
If early stopping is required, you need a sequential design (e.g., alpha-spending / group sequential boundaries) that preserves the overall type I error.


# Task 8: Multiple metrics: “win if at least one is significant”

So far our decision rule used a single primary metric (ARPU).  
Now consider a risky alternative rule:

> “We declare a win if **at least one** of the three metrics is statistically significant at $\alpha=0.05$.”

Metrics:
1) ARPU  
2) payer rate  
3) ARPPU  

Even under AA (no real effect), this rule inflates the false positive rate because we run multiple tests.

## Goal
Using AA simulations on `history_activity.csv`:
- repeatedly split users into two equal groups A/B
- compute p-values for the 3 metrics
- declare a “win” if $\min(p_1,p_2,p_3) < 0.05$

Estimate the **real alpha**:
$\alpha_{real} = P(\text{at least one significant})$
and compare it to 0.05.
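
A minimal sketch of this AA loop is shown below. It uses synthetic user-level revenue (90% zeros, lognormal payers) as a stand-in for the aggregated pre-period data, and the helper names are hypothetical; the point is the shape of the loop, not the specific numbers.

```python
import math
import random
from statistics import NormalDist


def z_test_pvalue(a, b):
    """Two-sided asymptotic z-test for a difference in means (unpooled variances)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    if se == 0:
        return 1.0
    return 2 * (1 - NormalDist().cdf(abs(mean_b - mean_a) / se))


def aa_multiple_metrics(user_revenue, n_sims=400, alpha=0.05, seed=0):
    """Return (rejection rate for ARPU alone, rate for 'any of three metrics')."""
    rng = random.Random(seed)
    values = list(user_revenue)
    half = len(values) // 2
    arpu_hits = any_hits = 0
    for _ in range(n_sims):
        rng.shuffle(values)
        a, b = values[:half], values[half:2 * half]
        p_arpu = z_test_pvalue(a, b)
        p_payer = z_test_pvalue([float(x > 0) for x in a], [float(x > 0) for x in b])
        p_arppu = z_test_pvalue([x for x in a if x > 0], [x for x in b if x > 0])
        arpu_hits += p_arpu < alpha
        any_hits += min(p_arpu, p_payer, p_arppu) < alpha
    return arpu_hits / n_sims, any_hits / n_sims


# Synthetic stand-in for one-row-per-user revenue (an assumption).
data_rng = random.Random(7)
user_revenue = [
    data_rng.lognormvariate(1.0, 1.0) if data_rng.random() < 0.1 else 0.0
    for _ in range(2000)
]
alpha_arpu, alpha_any = aa_multiple_metrics(user_revenue)
print(f"ARPU only: {alpha_arpu:.3f}, at least one of three: {alpha_any:.3f}")
```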


In [None]:
## code

In [None]:
## code

As you can see, even though these three metrics are strongly dependent (they all reflect the same payment behavior), using the rule

> “declare a win if at least one metric is significant”

still inflates the false positive rate.  
This is dangerous: you will report “wins” under AA more often than the nominal $\alpha=0.05$, even without any true effect.

---

## Congrats 🎉

You have finished this homework.

I hope it was interesting and useful. Please leave any feedback below (anything is welcome: clarity, difficulty, what to add/remove).

Thank you!
