# Probability Essentials — FAANG-Level Lab

**Goal:** ML-relevant probability: expectation, variance, Bayes, and simulation checks.

**Outcome:** You can reason about uncertainty, distributions, and Bayes updates (interview-ready).


In [1]:
import numpy as np

def check(name: str, cond: bool):
    if not cond:
        raise AssertionError(f'Failed: {name}')
    print(f'OK: {name}')

rng = np.random.default_rng(0)

## Section 1 — Discrete Random Variables

### Task 1.1: Expectation & variance from a PMF
Given values `x` and probabilities `p` (which sum to 1), implement E[X] and Var(X).

**Hint:**
- E[X] = sum p_i x_i
- Var(X) = E[X^2] - (E[X])^2

**Explain:** Why is variance not linear, but expectation is?

**Answer:** Expectation is a weighted sum, so E[aX + bY] = aE[X] + bE[Y] holds whether or not X and Y are independent. Variance is built from squared deviations from the mean, and squaring a sum produces cross terms: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y), and Var(aX) = a^2 Var(X). Variance therefore adds across a sum only when the covariance is zero, and it scales quadratically rather than linearly.
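The covariance cross term described above can be checked numerically. The following is a minimal sketch, not part of the lab: the arrays `a` and `b` and the generator `demo_rng` are hypothetical names, and `b` is built to be correlated with `a` on purpose.

```python
import numpy as np

# Separate generator (assumption), so the lab's own rng stream is not advanced.
demo_rng = np.random.default_rng(1)

# Two deliberately correlated variables: b depends on a.
a = demo_rng.normal(0.0, 1.0, size=100_000)
b = 0.5 * a + demo_rng.normal(0.0, 1.0, size=100_000)

# Expectation is linear: the mean of the sum equals the sum of the means.
mean_gap = abs((a + b).mean() - (a.mean() + b.mean()))

# Variance is not: the sum picks up a 2*Cov(A, B) cross term.
cov_ab = np.mean((a - a.mean()) * (b - b.mean()))  # population covariance (ddof=0)
var_sum = (a + b).var()
var_naive = a.var() + b.var()
var_with_cov = var_naive + 2.0 * cov_ab

print('mean gap      ', mean_gap)
print('Var(A+B)      ', var_sum)
print('Var(A)+Var(B) ', var_naive)
print('... + 2Cov    ', var_with_cov)
```

With this construction Cov(A, B) is about 0.5, so the naive sum of variances should fall short of Var(A+B) by roughly 1, while adding the covariance term closes the gap up to floating-point error.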

In [2]:
def expectation(x, p):
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * x))

def variance(x, p):
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    ex = np.sum(p * x)
    ex2 = np.sum(p * x * x)
    return float(ex2 - ex*ex)

x = np.array([0, 1, 2])
p = np.array([0.2, 0.5, 0.3])
mu = expectation(x, p)
var = variance(x, p)
print('E[X]=', mu, 'Var=', var)
check('mu', abs(mu - 1.1) < 1e-9)
check('var', abs(var - (0.2*0 + 0.5*1 + 0.3*4 - 1.1**2)) < 1e-9)

E[X]= 1.1 Var= 0.48999999999999977
OK: mu
OK: var


## Section 2 — Conditional Probability & Bayes

### Task 2.1: Bayes theorem (classic interview)
Disease test example:
- prevalence P(D)=0.01
- sensitivity P(+|D)=0.99
- false positive rate P(+|~D)=0.05

Compute P(D|+).

**Hint:**
`P(D|+) = P(+|D)P(D) / (P(+|D)P(D) + P(+|~D)P(~D))`

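**FAANG gotcha:** base-rate fallacy. At 1% prevalence, false positives from the healthy 99% outnumber true positives, so P(D|+) is far below the 99% sensitivity.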
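One way to see the base-rate effect is to hold the test fixed and vary only the prevalence. The following is a minimal sketch, assuming a hypothetical helper `posterior_given_positive` (not part of the lab) that applies the formula from the hint; the prevalence values other than 0.01 are illustrative.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(D|+) from Bayes' rule; hypothetical helper for illustration."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test (99% sensitivity, 5% false positive rate), different base rates.
for prevalence in (0.001, 0.01, 0.1, 0.5):
    post = posterior_given_positive(prevalence, 0.99, 0.05)
    print(f'P(D)={prevalence:<6} -> P(D|+)={post:.3f}')
```

In natural frequencies for the lab's numbers: out of 100,000 people, 1,000 are sick and 990 of them test positive, while 4,950 of the 99,000 healthy people also test positive. Only 990 of the 5,940 positives are true positives.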

In [3]:
P_D = 0.01
P_pos_given_D = 0.99
P_pos_given_notD = 0.05

# Law of total probability for P(+), then Bayes' rule
P_notD = 1 - P_D
P_pos = P_pos_given_D*P_D + P_pos_given_notD*P_notD
P_D_given_pos = (P_pos_given_D*P_D) / P_pos
print('P(D|+)=', P_D_given_pos)
check('range', 0 <= P_D_given_pos <= 1)
# Should be 1/6, about 0.1667. The (1 - 0.01) factor is P(~D), not the sensitivity.
check('approx', abs(P_D_given_pos - (0.99*0.01)/(0.99*0.01 + 0.05*(1 - 0.01))) < 1e-12)

P(D|+)= 0.16666666666666669
OK: range
OK: approx


### Task 2.2: Simulation check (sanity)
Simulate N people and estimate P(D|+) empirically.

**Hint:**
- sample disease ~ Bernoulli(P_D)
- sample test result conditional on disease

**Explain:** Why does simulation converge to the analytic value?

**Answer:** Each simulated person is an independent draw from the same probability model, so the empirical frequency (the fraction with disease among those who test positive) is a ratio of sample averages. By the Law of Large Numbers, these averages concentrate around their expected values as N grows, so the observed proportion approaches the exact P(D|+) given by Bayes' rule.
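The convergence can be observed by repeating the simulation at increasing N. The following is a rough sketch under stated assumptions: `simulate_posterior` and `demo_rng` are hypothetical names, and the parameters are repeated here so the block stands alone.

```python
import numpy as np

def simulate_posterior(n, rng, p_d=0.01, sens=0.99, fpr=0.05):
    """Empirical P(D|+) from n simulated people (hypothetical helper)."""
    disease = rng.random(n) < p_d
    # Each person's positive-test probability depends on disease status.
    p_pos = np.where(disease, sens, fpr)
    test_pos = rng.random(n) < p_pos
    return disease[test_pos].mean()

demo_rng = np.random.default_rng(2)  # separate generator (assumption)
exact = (0.99 * 0.01) / (0.99 * 0.01 + 0.05 * (1 - 0.01))

for n in (1_000, 10_000, 100_000, 1_000_000):
    est = simulate_posterior(n, demo_rng)
    print(f'N={n:>9,}  estimate={est:.4f}  abs error={abs(est - exact):.4f}')
```

The error is random, so it need not shrink at every step, but it typically falls roughly like one over the square root of the number of positive tests. Only about 6% of people test positive here, which is why N has to be fairly large before the estimate settles.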

In [4]:
N = 200000
# Sample disease status first, then the test result conditional on it
disease = rng.random(N) < P_D
test_pos = np.empty(N, dtype=bool)
test_pos[disease] = rng.random(disease.sum()) < P_pos_given_D
test_pos[~disease] = rng.random((~disease).sum()) < P_pos_given_notD

# estimate P(D|+)
est = disease[test_pos].mean()
print('estimate', est)
check('close', abs(est - P_D_given_pos) < 0.01)

estimate 0.17002444575571105
OK: close


## Section 3 — Continuous Distributions (Normal)

### Task 3.1: Standardization (z-score)
Given X ~ Normal(mu, sigma^2), compute the standardized variable Z = (X - mu) / sigma.

**Hint:**
- simulate X and check Z mean ~0, std ~1

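**ML link:** standardization is a common preprocessing step. Putting features on comparable scales tends to improve the conditioning of gradient-based optimizers such as SGD.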
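As a preprocessing illustration, the same z-score can be applied per feature column. The following is a minimal sketch, assuming a made-up two-column feature matrix and a train/test split (the names `train`, `test`, and `demo_rng` are hypothetical). It estimates the statistics on the training split only and reuses them on the test split.

```python
import numpy as np

demo_rng = np.random.default_rng(4)  # separate generator (assumption)

# Hypothetical feature matrix: two columns on very different scales.
train = np.column_stack([
    demo_rng.normal(5.0, 2.0, size=1_000),
    demo_rng.normal(1_000.0, 300.0, size=1_000),
])
test = np.column_stack([
    demo_rng.normal(5.0, 2.0, size=200),
    demo_rng.normal(1_000.0, 300.0, size=200),
])

# Estimate mean and std per column from the training split only.
col_mean = train.mean(axis=0)
col_std = train.std(axis=0)

train_z = (train - col_mean) / col_std
test_z = (test - col_mean) / col_std  # reuse train statistics, do not refit on test

print('train mean', train_z.mean(axis=0), 'train std', train_z.std(axis=0))
print('test mean ', test_z.mean(axis=0), 'test std ', test_z.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 up to floating-point error. The test columns are only approximately standardized, which is expected because their statistics were not used.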

In [5]:
mu, sigma = 5.0, 2.0
X = rng.normal(mu, sigma, size=200000)

Z = (X - mu) / sigma
print('Z mean', Z.mean(), 'Z std', Z.std())
check('mean0', abs(Z.mean()) < 0.02)
check('std1', abs(Z.std() - 1.0) < 0.02)

Z mean 0.00263367611867299 Z std 0.9997503191012033
OK: mean0
OK: std1


## Section 4 — Naive Bayes Thinking (Optional Mini)

### Task 4.1: Compute log-odds for a toy Naive Bayes
Given word likelihoods for spam vs ham, compute the posterior log-odds of spam for a message.

**Hint:**
- work in log space (sum logs)

**FAANG gotcha:** multiplying small probabilities underflows; use logs.
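The underflow is easy to reproduce. The following is a minimal sketch, assuming a hypothetical message of 500 words that each have likelihood 1e-3; these numbers are illustrative and not from the lab.

```python
import numpy as np

# 500 hypothetical word likelihoods of 1e-3 each; the true product is 1e-1500.
probs = np.full(500, 1e-3)

direct_product = np.prod(probs)    # too small for float64, collapses to 0.0
log_sum = np.sum(np.log(probs))    # same quantity in log space, stays finite

print('direct product', direct_product)
print('sum of logs   ', log_sum)
```

Once the direct product has collapsed to zero, two classes can no longer be compared, because both scores are 0.0. The log-space sums remain distinct and can be subtracted to get log-odds.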

In [6]:
# Toy params
prior_spam = 0.2
prior_ham = 0.8
P_word_given_spam = {'free': 0.08, 'win': 0.05, 'meeting': 0.001}
P_word_given_ham  = {'free': 0.002, 'win': 0.001, 'meeting': 0.03}
message = ['free', 'win']

# Log posterior odds: log P(spam|msg) - log P(ham|msg).
# The evidence P(msg) cancels in the ratio, so no normalizing constant is needed.
log_ratio = np.log(prior_spam) - np.log(prior_ham)
for w in message:
    log_ratio += np.log(P_word_given_spam[w]) - np.log(P_word_given_ham[w])
print('log_ratio', log_ratio)
check('finite', np.isfinite(log_ratio))

log_ratio 6.214608098422191
OK: finite
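The log-odds can be mapped back to a posterior probability with the logistic function. The following is a short sketch that repeats the toy parameters from the cell above so it stands alone; the variable `p_spam` is a name introduced here for illustration.

```python
import numpy as np

# Same toy parameters as the cell above.
prior_spam, prior_ham = 0.2, 0.8
P_word_given_spam = {'free': 0.08, 'win': 0.05, 'meeting': 0.001}
P_word_given_ham = {'free': 0.002, 'win': 0.001, 'meeting': 0.03}
message = ['free', 'win']

log_ratio = np.log(prior_spam) - np.log(prior_ham)
for w in message:
    log_ratio += np.log(P_word_given_spam[w]) - np.log(P_word_given_ham[w])

# Posterior from log-odds: P(spam|msg) = 1 / (1 + exp(-log_ratio)).
p_spam = 1.0 / (1.0 + np.exp(-log_ratio))
print('P(spam|msg)', p_spam)
```

Here the posterior odds are 0.25 × 40 × 50 = 500 to 1, so P(spam|msg) = 500/501, about 0.998. A log-odds of zero corresponds to a posterior of 0.5.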


---
## Submission Checklist
- All TODOs completed
- Checks pass
- Explain prompts answered
