# Task 1. Latency Police (20 points)

Story. A company (QuickAPI) advertises:

“During peak hours, the average API latency does not exceed 200 ms.”

You collected n = 25 peak-hour latencies (ms). Assume i.i.d. Exponential waiting times with mean $\mu$.


Your job: decide whether the ad claim looks consistent with the data or is just clickbait.

In [5]:
import numpy as np

x_obs = np.array([
    93.85, 260.75, 172.69, 137.36, 33.92,
    33.92, 11.97, 201.12, 137.44, 163.07,
     4.15, 350.36, 228.45, 47.78, 40.13,
    40.62, 72.53, 117.11, 109.62, 74.02,
    61.28, 134.49, 96.49, 89.88, 108.20
])
n = len(x_obs)
mu0 = 200.0


## Translate the ad claim into hypotheses (2 points)


QuickAPI claims: “During peak hours, the average API latency does not exceed 200 ms.”

Write statistical hypotheses. Decide one-sided vs two-sided. Argue your answer.

What alpha level will you use?

In [6]:
## your code here

## Statistic (1 point)

We will use one statistic for the whole task:

$S=\sum_{i=1}^{n} X_i$

Compute $S_{obs}$.

In [11]:
S_obs = np.nan

## Statistic distribution (2 points)

Under the boundary of the null ($\mu=\mu_0$), find the exact distribution of $S$ and its parameters.

Write 2–5 lines explaining where it comes from and include a non-AI citation (textbook / lecture notes).

Then define the parameters in code.
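
A minimal sketch of how the parameters could be defined with `scipy.stats`. It assumes the standard result that a sum of $n$ i.i.d. exponential variables with mean $\mu$ follows a Gamma distribution with shape $n$ and scale $\mu$ (verify this against the source you cite). The names `n_demo`, `mu0_demo`, and `null_dist` are illustrative and not part of the assignment.

```python
import numpy as np
from scipy import stats

n_demo = 25        # sample size, as in the task
mu0_demo = 200.0   # boundary mean under H0 (ms)

# Assumption: S = sum of n i.i.d. Exp(mean=mu0) ~ Gamma(shape=n, scale=mu0)
null_dist = stats.gamma(a=n_demo, scale=mu0_demo)

null_mean = null_dist.mean()   # equals n * mu0
null_sd = null_dist.std()      # equals sqrt(n) * mu0
print(null_mean, null_sd)
```

The frozen distribution object exposes `pdf`, `cdf`, `sf`, and `ppf`, which is convenient for the plotting and critical-value steps that follow.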

In [13]:
## your code here

## Plot distribution of your statistic under null hypothesis. (2 points)


In [15]:
## your code here

## Add rejection region and observed value (3 points)


mark $S_{obs}$ as a vertical line

mark $c$ as a vertical line

shade the rejection region ($S \ge c$)

Do you reject $H_0$ or not?

In [17]:
## your code here

## Simulation check: does the exact test control $\alpha$? (4 points)

We now verify the test by simulation under the boundary $\mu=\mu_0$.

Simulate many datasets from Exp(mean=$\mu_0$), each of size $n$:

1. compute $S$

2. compute the exact p-value for each run

3. estimate the rejection rate at your chosen $\alpha$

Report the simulated rejection rate and compare it to $\alpha$.
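
The three steps above can be sketched as follows. This is one possible layout, not a reference solution: it assumes the Gamma null distribution for $S$, a right-tail p-value (matching the rejection region $S \ge c$), and $\alpha = 0.05$ as an example choice. All names with the `_sim` suffix are illustrative.

```python
import numpy as np
from scipy import stats

n_sim, mu0_sim, alpha_sim = 25, 200.0, 0.05   # alpha is an assumed choice
R_sim = 20_000                                 # number of simulated datasets
rng_sim = np.random.default_rng(0)

# Each row is one dataset drawn at the boundary mu = mu0
samples = rng_sim.exponential(scale=mu0_sim, size=(R_sim, n_sim))
S_sim = samples.sum(axis=1)

# Exact right-tail p-value under the assumed Gamma(n, mu0) null
p_exact = stats.gamma.sf(S_sim, a=n_sim, scale=mu0_sim)

# Share of runs in which H0 would be rejected
rej_rate = np.mean(p_exact <= alpha_sim)
print(f"simulated rejection rate: {rej_rate:.4f}")
```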

In [18]:
## your code here

## Apply an asymptotic Z-test to the same problem: is the false positive rate as expected? (4 points)
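
A rough sketch of one way to set this up. It assumes the Z statistic is built from the sample mean and the sample standard deviation, $z = (\bar{X} - \mu_0) / (s/\sqrt{n})$, with a right-tail normal p-value. Other variants (for example, plugging in the null standard deviation) are also defensible, so treat the choice here as an assumption.

```python
import numpy as np
from scipy import stats

n_z, mu0_z, alpha_z = 25, 200.0, 0.05   # alpha is an assumed choice
R_z = 20_000
rng_z = np.random.default_rng(1)

# Datasets simulated at the boundary mu = mu0
data = rng_z.exponential(scale=mu0_z, size=(R_z, n_z))
xbar = data.mean(axis=1)
s = data.std(axis=1, ddof=1)            # sample standard deviation per run

# Asymptotic Z statistic and right-tail p-value from N(0, 1)
z = (xbar - mu0_z) / (s / np.sqrt(n_z))
p_z = stats.norm.sf(z)

fpr_z = np.mean(p_z <= alpha_z)
print(f"Z-test rejection rate under H0: {fpr_z:.4f}")
```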


In [19]:
## your code here

## Plot p-value distribution for the exact test and for the asymptotic one. (2 points)

What do you see? Are both tests correct? Explain why we see these results.

In [20]:
## your code here

# Task 2. The T-test myth (20 points)

Story. Your teammate says:

> “In industry everyone uses the classic two-sample $t$-test or Welch's $t$-test. It’s robust and basically assumption-free: you only need i.i.d. samples. Normality is not a real requirement.”

You will stress-test this claim on a **small sample** from a **clearly non-normal** distribution.

**Fixed setup (do not change):**
We run an A/A test: both groups are sampled from the same skewed distribution, so there is **no real effect**.

$X_1,\dots,X_{n_1} \overset{iid}{\sim} \mathrm{LogNormal}(0,1^2), \qquad
Y_1,\dots,Y_{n_2} \overset{iid}{\sim} \mathrm{LogNormal}(0,1^2), \qquad
X \perp Y.$

For simplicity we use the **pooled two-sample $t$-test** because in this simulation the true variances are equal; this avoids varying degrees of freedom.

Fixed parameters:
- $n_1=n_2=20$
- seed = `7`
- simulations $R=50{,}000$
- $\alpha=0.05$


In [45]:
seed = 7
rng = np.random.default_rng(seed)

n1 = 20
n2 = 20
R = 50_000

# A/A simulations: shape (R, n)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(R, n1))
y = rng.lognormal(mean=0.0, sigma=1.0, size=(R, n2))

In [46]:
x.shape

(50000, 20)

## 1) Do pooled t-statistics actually look like Student-$t$ here? (4 points)

- Simulate $R$ experiments under $H_0$ and compute the pooled two-sample $t$ statistic each time.
- Plot the histogram (density) of simulated $t$.
- Overlay a Student-$t$ density with $\nu=n_1+n_2-2$.
- Do the shapes match? If not, describe the mismatch (center vs tails vs asymmetry).
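
A possible vectorized helper for the pooled statistic, shown as a sketch. The function name `pooled_t` and the `_demo` variables are hypothetical. The demo uses a smaller number of runs than the assignment's fixed $R$.

```python
import numpy as np
from scipy import stats

def pooled_t(x, y):
    """Row-wise pooled two-sample t statistics for arrays of shape (R, n1) and (R, n2)."""
    n1, n2 = x.shape[1], y.shape[1]
    nu = n1 + n2 - 2
    # Pooled variance: weighted average of the two unbiased sample variances
    sp2 = ((n1 - 1) * x.var(axis=1, ddof=1) + (n2 - 1) * y.var(axis=1, ddof=1)) / nu
    t = (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, nu

rng_demo = np.random.default_rng(0)
x_demo = rng_demo.lognormal(0.0, 1.0, size=(1000, 20))
y_demo = rng_demo.lognormal(0.0, 1.0, size=(1000, 20))

t_demo, nu_demo = pooled_t(x_demo, y_demo)
# Two-sided p-values from the Student-t reference distribution
p_demo = 2 * stats.t.sf(np.abs(t_demo), df=nu_demo)
```

The result can be cross-checked against `scipy.stats.ttest_ind(x_demo, y_demo, axis=1, equal_var=True)`, which computes the same pooled test.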


In [47]:
# your code here

## 2) Does the pooled t-test control $\alpha=0.05$ in this setting? (4 points)

- Under the same $H_0$, compute pooled **two-sided** $t$-test p-values for all $R$ experiments.
- Estimate empirical false positive rate:
  $\widehat{\mathrm{FPR}}=\Pr(p \le 0.05).$
- Plot the p-value histogram.
- Is it close to  $U(0,1)$? If not, in what direction is it distorted?


In [48]:
# your code here

## 3) PIT check: is the Student-$t$ CDF the right CDF for these pooled $t$’s? (4 points)

Probability integral transform (PIT) fact:

https://en.wikipedia.org/wiki/Probability_integral_transform

If a random variable $T$ truly has CDF $F$, then
$U = F(T)$
must be $U(0,1)$.

- For each run compute
  $U = F_{t,\nu}(t),$
  where $F_{t,\nu}$ is the Student-$t$ CDF and $\nu = n_1+n_2-2$.
- Plot the histogram of $U$.
- Is it flat? If not, interpret the shape (what kind of mismatch does it suggest?).
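
To see what a "flat" PIT histogram looks like when the CDF is correct, here is a small sanity check on values drawn directly from a Student-$t$ distribution. It is a baseline for comparison only and does not use the simulated pooled statistics. The names are illustrative.

```python
import numpy as np
from scipy import stats

nu_pit = 38                                  # n1 + n2 - 2 for n1 = n2 = 20
rng_pit = np.random.default_rng(0)

# Draws that truly follow Student-t with nu degrees of freedom
t_true = rng_pit.standard_t(df=nu_pit, size=20_000)

# PIT: apply the matching CDF
u = stats.t.cdf(t_true, df=nu_pit)

# Bin shares should all be close to 0.1 when the CDF is the right one
counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(counts / u.size)
```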


In [49]:
# your code here

## 4) Check the chi-square piece (4 points)

Textbook (normal-theory) identity for the **pooled two-sample $t$-test**:

$
t = \frac{Z}{\sqrt{\chi^2_\nu/\nu}},
\qquad
Z\sim \mathcal N(0,1),\;\; \chi^2_\nu \text{ independent},\;\;
\nu = n_1+n_2-2.
$

In our A/A simulation the data are **not normal**, so this identity is not guaranteed.
We will **test whether the chi-square part still looks right**.

For each simulated run compute the pooled variance $s_p^2$ and build:

$
Q = \frac{\nu\, s_p^2}{\sigma^2_{\text{true}}}.
$

If the textbook story held, then we would expect:

$
Q \sim \chi^2_\nu.
$

Make:
- a QQ plot of empirical $Q$ vs $\chi^2_\nu$ quantiles
- a PIT check: $V = F_{\chi^2_\nu}(Q)$ should be close to $U(0,1)$ if $Q$ is really chi-square

Describe what breaks and why non-normality can cause it.
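
A sketch of how $Q$ and its PIT values could be computed. It assumes the variance formula for $\mathrm{LogNormal}(\mu, \sigma^2)$, namely $(e^{\sigma^2}-1)\,e^{2\mu+\sigma^2}$, and uses fewer runs than the assignment's $R$. Plotting is left out, and the `_q` names are hypothetical.

```python
import numpy as np
from scipy import stats

n1_q = n2_q = 20
nu_q = n1_q + n2_q - 2

# True variance of LogNormal(0, 1): (exp(1) - 1) * exp(1)
sigma2_true = (np.exp(1.0) - 1.0) * np.exp(1.0)

rng_q = np.random.default_rng(0)
xq = rng_q.lognormal(0.0, 1.0, size=(5000, n1_q))
yq = rng_q.lognormal(0.0, 1.0, size=(5000, n2_q))

# Pooled variance per run, then the scaled quantity Q
sp2_q = ((n1_q - 1) * xq.var(axis=1, ddof=1) + (n2_q - 1) * yq.var(axis=1, ddof=1)) / nu_q
Q = nu_q * sp2_q / sigma2_true

# PIT values under the chi-square reference
V = stats.chi2.cdf(Q, df=nu_q)

# Paired quantiles for a QQ comparison (empirical vs theoretical)
probs = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
q_emp = np.quantile(Q, probs)
q_theo = stats.chi2.ppf(probs, df=nu_q)
print(np.column_stack([probs, q_emp, q_theo]))
```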



In [50]:
# your code here

## 5) One-paragraph conclusion (4 points)

In 5–8 lines, answer:

- In this fixed setup, is the “t-test is assumption-free” claim supported or not?
- If it fails, describe *how* it fails
- What practical rule of thumb would you tell a teammate about using the pooled t-test?


\# your conclusion here

# Task 3. The MW test myth (20 points)

Mann–Whitney U (MWU) is sometimes used in A/B testing as a “robust non-parametric alternative” to the $t$-test when data are non-normal or heavy-tailed.
Your goal is to verify this belief and deliver a verdict:

Can MWU replace a mean test (e.g., $t$-test / asymptotic $z$-test for $\mu_B-\mu_A$) without changing the hypothesis?

Does it produce “more robust” decisions for mean-based business metrics (e.g., ARPU)?

https://stripe.com/en-es/resources/more/what-is-average-revenue-per-user-why-it-matters-and-how-to-calculate-it

## 1) Simulate many datasets under $H_0$ with $n=2000$ per group. (7 points)

For each simulation compute the asymptotic Z statistic

$z = \dfrac{\bar{Y}-\bar{X}}{\sqrt{s_X^2/n + s_Y^2/n}}$

and the two-sided p-value from $z$.

Plot:

- histogram of $z$

- histogram of p-values

Check:

- empirical FPR at $\alpha=0.05$ is close to $0.05$

- $z$ looks close to $\mathcal{N}(0,1)$

- p-values look close to $\text{Uniform}(0,1)$

Question: are the means equal by construction?
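
One way to organize the repeated simulation, as a sketch. It assumes the same two distributions as the starter cell for this task (standard normal vs. a normal mixture with rare outliers) and draws all runs at once. `sample_groups` and the `_h0` names are hypothetical, and the number of runs is kept small.

```python
import numpy as np
from scipy import stats

def sample_groups(rng, R, n, p=0.98, mu1=8.0):
    """Draw R datasets of size n per group: X ~ N(0,1), Y ~ two-component normal mixture."""
    mu0 = -((1 - p) / p) * mu1           # mixture mean is p*mu0 + (1-p)*mu1
    X = rng.normal(0.0, 1.0, size=(R, n))
    main = rng.random((R, n)) < p        # True -> main component, False -> outlier
    Y = np.where(main,
                 rng.normal(mu0, 1.0, size=(R, n)),
                 rng.normal(mu1, 1.0, size=(R, n)))
    return X, Y

R_h0, n_h0, alpha_h0 = 1000, 2000, 0.05
X, Y = sample_groups(np.random.default_rng(0), R_h0, n_h0)

# Asymptotic Z statistic per run and its two-sided p-value
se = np.sqrt(X.var(axis=1, ddof=1) / n_h0 + Y.var(axis=1, ddof=1) / n_h0)
z_h0 = (Y.mean(axis=1) - X.mean(axis=1)) / se
p_h0 = 2 * stats.norm.sf(np.abs(z_h0))

fpr_h0 = np.mean(p_h0 <= alpha_h0)
print(f"empirical FPR: {fpr_h0:.3f}")
```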

In [19]:
n = 2000
rng = np.random.default_rng(2025)

# N(0,1)
A = rng.normal(0.0, 1.0, n)

# mixture with rare outliers
p, mu1 = 0.98, 8.0
mu0 = -((1 - p) / p) * mu1
B = np.where(rng.random(n) < p,
             rng.normal(mu0, 1.0, n),
             rng.normal(mu1, 1.0, n))

In [None]:
# your code here

## 2) Mann–Whitney U on the same data (7 points)

What to do

1. Using the same $X$ and $Y$ as in part 1 (means equal), simulate 2000 samples.

2. Run Mann–Whitney U (two-sided, asymptotic) at $\alpha=0.05$.

3. Report the empirical rejection rate.

Questions:

1. If MWU is used to test equality of means, what is the conceptual mistake?

2. What risks does this create in A/B decisions?
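
A sketch of the simulation loop using `scipy.stats.mannwhitneyu` with `method="asymptotic"`. The distributions mirror the starter cell for this task. The number of repetitions `R_mw` is an arbitrary small value chosen to keep the run time short, and the `_mw` and `_mix` names are illustrative.

```python
import numpy as np
from scipy import stats

rng_mw = np.random.default_rng(0)
n_mw, R_mw, alpha_mw = 2000, 200, 0.05
p_mix, mu1_mix = 0.98, 8.0
mu0_mix = -((1 - p_mix) / p_mix) * mu1_mix

rejections = 0
for _ in range(R_mw):
    # X ~ N(0,1); Y ~ mixture with rare outliers
    X = rng_mw.normal(0.0, 1.0, n_mw)
    Y = np.where(rng_mw.random(n_mw) < p_mix,
                 rng_mw.normal(mu0_mix, 1.0, n_mw),
                 rng_mw.normal(mu1_mix, 1.0, n_mw))
    res = stats.mannwhitneyu(X, Y, alternative="two-sided", method="asymptotic")
    rejections += int(res.pvalue <= alpha_mw)

mwu_rate = rejections / R_mw
print(f"MWU rejection rate: {mwu_rate:.3f}")
```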

In [75]:
# your code here

## 3) ARPU: construct a bad example (6 points)

Build a concrete ARPU setup where the change in the distribution points in the opposite direction to the MWU test result.

Simulate many experiments with $n=2000$ per group.

Run right-sided MWU for “B greater than A”.

Questions:

1. Can MWU reject $H_0$ in favor of “B is better” while the ARPU mean in B is lower?

2. If yes, explain why this can happen for ARPU-like metrics.

Reference (for inspiration): https://blog.analytics-toolkit.com/2024/stop-abusing-the-mann-whitney-u-test-mwu/
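
As a starting point, here is one hypothetical setup (the payment probabilities and amounts are made up for illustration): in A, 10% of users pay 100; in B, 20% of users pay 20. The true ARPU is 10 in A and 4 in B. The sketch runs a right-sided MWU for "B greater than A" and records how often it rejects. You are encouraged to build your own example.

```python
import numpy as np
from scipy import stats

rng_arpu = np.random.default_rng(0)
n_arpu, R_arpu, alpha_arpu = 2000, 200, 0.05

def revenue(rng, n, pay_prob, amount):
    """Revenue per user: `amount` with probability `pay_prob`, otherwise 0."""
    return np.where(rng.random(n) < pay_prob, amount, 0.0)

# Hypothetical parameters: A has ARPU 0.10 * 100 = 10, B has ARPU 0.20 * 20 = 4
rej = 0
for _ in range(R_arpu):
    A = revenue(rng_arpu, n_arpu, 0.10, 100.0)
    B = revenue(rng_arpu, n_arpu, 0.20, 20.0)
    # First argument is the sample claimed to be stochastically greater
    res = stats.mannwhitneyu(B, A, alternative="greater", method="asymptotic")
    rej += int(res.pvalue <= alpha_arpu)

arpu_rate = rej / R_arpu
print(f"share of runs where MWU favours B: {arpu_rate:.3f}")
```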

In [76]:
# your code here

# Task 4. Confidence Intervals comparison (20 points)

You are estimating a Bernoulli probability p with small samples. Different confidence intervals behave very differently, especially near p≈0 or p≈1. Your job is to build and compare them via simulation.

We will use:

- n = 10 and n = 30 (and later vary n)

- confidence level 1 - α = 0.95

Intervals to implement by hand:

1. Wald

2. Wilson

3. Agresti–Coull

4. Clopper–Pearson (exact)

https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

## Implement Wald CI (3 points)


In [31]:
# your code here

## Implement Wilson + Agresti–Coull (3 points)

Write two functions: ci_wilson, ci_agresti_coull.

No helper packages.
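
A sketch of both intervals using only the standard library, with the formulas as given on the linked Wikipedia page. It assumes that `statistics.NormalDist` is acceptable for the normal quantile, and that clipping the Agresti–Coull bounds to $[0, 1]$ is desired. Both are choices you may want to change.

```python
from math import sqrt
from statistics import NormalDist

def ci_wilson(k, n, alpha=0.05):
    """Wilson score interval for k successes out of n trials."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

def ci_agresti_coull(k, n, alpha=0.05):
    """Agresti-Coull interval: Wald-type interval around an adjusted proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_adj = n + z**2
    p_adj = (k + z**2 / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clipping to [0, 1] is an assumption, not part of every definition
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

print(ci_wilson(5, 10))
print(ci_agresti_coull(5, 10))
```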

In [32]:
# your code here

## Implement Clopper–Pearson (exact) (3 points)

Use Beta distribution quantiles (you may use scipy.stats.beta.ppf)
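
A minimal sketch based on the Beta-quantile form of the interval from the linked Wikipedia page. The edge cases $k=0$ and $k=n$ are handled explicitly, since the Beta quantile is not defined when a shape parameter is zero. The function name is illustrative.

```python
from scipy.stats import beta

def ci_clopper_pearson(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    # Lower bound: alpha/2 quantile of Beta(k, n - k + 1); defined as 0 when k == 0
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    # Upper bound: 1 - alpha/2 quantile of Beta(k + 1, n - k); defined as 1 when k == n
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return float(lo), float(hi)

print(ci_clopper_pearson(0, 10))
print(ci_clopper_pearson(5, 10))
```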

In [33]:
# your code here

## Coverage simulation over p-grid (n=10 and n=30) (6 points)

For each method, estimate coverage:

grid: $p \in \{0.01, 0.02, …, 0.99\}$

for each p: simulate $B$ binomials $k \sim Bin(n, p)$

build CI

- check whether true p lies inside

- compute coverage rate

- Plot coverage(p) for each method for:

n=10

n=30
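
The loop described above could be organized roughly like this. The sketch uses a vectorized Wald interval only as a stand-in so that the block runs on its own. `coverage`, `ci_wald_vec`, and the value of `B` are hypothetical, and any interval function that accepts an array of counts can be plugged in.

```python
import numpy as np
from statistics import NormalDist

def ci_wald_vec(k, n, alpha=0.05):
    """Wald interval, vectorized over an array of success counts k."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = k / n
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def coverage(ci_func, n, p_grid, B, rng):
    """Estimate coverage of ci_func at each true p by simulating B binomial counts."""
    cov = np.empty(len(p_grid))
    for i, p in enumerate(p_grid):
        k = rng.binomial(n, p, size=B)
        lo, hi = ci_func(k, n)
        cov[i] = np.mean((lo <= p) & (p <= hi))   # share of intervals containing p
    return cov

p_grid = np.linspace(0.01, 0.99, 99)              # 0.01, 0.02, ..., 0.99
rng_cov = np.random.default_rng(0)
cov_wald_10 = coverage(ci_wald_vec, 10, p_grid, B=5000, rng=rng_cov)
print(cov_wald_10[:5])
```

The same `coverage` call with `n=30`, and with the other interval functions, gives the remaining curves to plot.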

In [28]:
## your code here

## Quick take (write 5–8 lines)

1) Which method under-covers (coverage < 0.95)? Where exactly: near $p\approx 0$, $p\approx 1$, or everywhere?
2) Which method is conservative (coverage > 0.95)? What do you pay for that?
3) Does the picture change a lot from $n=10$ to $n=30$? Describe the main differences.


## Average width vs p (3 points)

For n=10 and n=30, plot average interval width vs p for each method.

In [29]:
# your code here

## Answer the following questions:

1) Which method is usually the narrowest? Is it also well-calibrated by coverage?
2) For small $p$ (e.g. $p<0.05$), which methods get much wider and why might that be reasonable?


## When do they start working? Vary n (2 points)


In [30]:
# your code here

## When does asymptotics become OK? (write 2–4 lines)

1) For each method, roughly at what $n$ does the *minimum* coverage over $p\in\{0.01,\dots,0.99\}$ stop being embarrassing?
2) Is there a clear “threshold n” after which methods behave similarly? Give a number range.
3) Explain in plain words why the edge cases ($p$ near 0 or 1) are hardest for small $n$.


# Task 5. Are you an ML Engineer??? (20 points)

Story. You inherit a production ML model. It predicts well, but has too many features: slow, fragile, hard to explain.  
You are not asked to code. You are asked to **design a feature selection procedure** based on permutation importance + hypothesis testing.

Setup: you have a fixed dataset (train/val/test), a fixed trained model, and a fixed evaluation metric (e.g., RMSE).


## 1) What exactly are we measuring? (4 points)

In 3–5 sentences define permutation importance for feature $j$ as a random variable $\Delta_j$ (what is permuted, what score is computed, what sign means "useful").

Which part of the dataset will you use for feature selection?

In [78]:
# your code here

## 2) Statistical question per feature (4 points)

Write hypotheses for one feature $j$ using $\Delta_j$:

$
H_0: \#\#\# \\
H_1: \#\#\#
$

Explain why it is one/two-sided.

In [79]:
# your code here

## 4) Multiple testing (4 points)

You test $d$ features. Explain why "select all with $p\le 0.05$" is wrong.

What error rate are you controlling and why is it appropriate for feature selection?

In [80]:
# your code here

## 5) Practical significance vs statistical significance (4 points)

How would you define practical significance here? How might it be incorporated into the framework?

In [81]:
# your code here

## 6) Make it work at scale (2 points)

Permutation is expensive. Propose **two** concrete tricks to reduce compute while keeping decisions reliable.

In [82]:
# your code here

# Congrats! You've finished this assignment. How was it? Any feedback is very welcome!


In [77]:
# your thoughts here :) e.g. "useless, I hate it"