# Chapter 8: Statistical Methods for Data Analytics

This chapter gives you the **statistics toolkit** you’ll use constantly as a data analyst: probability, sampling, hypothesis testing, confidence intervals, correlation vs causation, regression, and A/B testing.

**Goal:** understand what statistical results mean, when to use each method, and how to avoid common mistakes.

## Introduction

Statistics is the backbone of data analytics. You can collect and visualize data without statistics, but you cannot draw **reliable conclusions** or make **informed decisions** without it.

Think of statistics as your toolkit for answering questions like:
- "Is this difference real, or just random noise?"
- "How confident can I be in this result?"
- "What's likely to happen next based on past data?"

### What You'll Learn in This Chapter

| Section | Topic | Why It Matters |
|---------|-------|----------------|
| 8.1 | Role of Statistics | Understand why statistics is essential |
| 8.2 | Probability & Distributions | Model uncertainty and randomness |
| 8.3 | Sampling Techniques | Collect data that represents reality |
| 8.4 | Hypothesis Testing | Make data-driven decisions |
| 8.5 | Parametric vs Non-parametric | Choose the right test for your data |
| 8.6 | Confidence Intervals | Quantify uncertainty in estimates |
| 8.7 | Correlation vs Causation | Avoid misleading conclusions |
| 8.8 | Regression Analysis | Model relationships between variables |
| 8.9 | A/B Testing | Run controlled experiments |
| 8.10 | Assumptions & Limitations | Know when statistics can fail |

### Prerequisites
- Basic Python (Chapter 2)
- NumPy and Pandas basics (Chapters 3-4)
- Data visualization (Chapter 5)

Let's begin!

## 8.0 Setup (Imports + Reproducibility)

We’ll use standard data analytics libraries. If a library is missing, install it (in a terminal) with: `pip install scipy statsmodels seaborn`.

We also set a random seed so examples are reproducible.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

np.random.seed(42)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)

def t_confidence_interval(sample, confidence=0.95):
    """Return a t-based confidence interval for the mean.

    The confidence level is passed positionally because SciPy renamed this
    keyword from 'alpha' to 'confidence'; the positional form works with both.
    """
    sample = np.asarray(sample)
    mean = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)
    df = len(sample) - 1     # degrees of freedom for the t distribution
    return stats.t.interval(confidence, df=df, loc=mean, scale=sem)

## 8.1 Role of Statistics in Analytics

Statistics helps you move from **observations** (what happened in your data) to **inference** (what is likely true in the real world).

### Two big ideas
- **Descriptive statistics:** summarize what you observed (mean, median, charts).
- **Inferential statistics:** make a careful guess about a larger population using a sample (confidence intervals, hypothesis tests).

> **Common mistake:** treating a sample result as a guaranteed truth about the population. Statistics always includes uncertainty.

## 8.2 Probability Concepts and Distributions

### Probability (the intuition)
Probability is a number from 0 to 1 that represents how likely an event is.

Key terms:
- **Experiment:** a process that produces an outcome (e.g., a customer visit).
- **Outcome:** one result (e.g., purchase vs no purchase).
- **Event:** a set of outcomes (e.g., purchase).

### Random variables
A **random variable** turns outcomes into numbers.
- **Discrete:** counts (number of purchases)
- **Continuous:** measurements (time on page)

### Probability distributions
A distribution describes how likely different values are.
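- **PMF** (probability mass function, discrete): the probability of each exact value
- **PDF** (probability density function, continuous): relative likelihood; probabilities are areas under the curve
- **CDF** (cumulative distribution function): probability a value is *≤ x*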

> **Tip:** In analytics, you rarely need to memorize formulas. Focus on what the distribution *models* and when it’s a reasonable approximation.
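
To make the PMF/CDF distinction concrete, here is a minimal sketch using a binomial model. The values `n = 50` and `p = 0.10` are illustrative assumptions, not measurements.

```python
from scipy import stats

n, p = 50, 0.10  # assumed: 50 visitors, 10% purchase probability

# PMF: probability of exactly 5 purchases
p_exactly_5 = stats.binom.pmf(5, n=n, p=p)

# CDF: probability of 5 or fewer purchases
p_at_most_5 = stats.binom.cdf(5, n=n, p=p)

# The CDF is the running sum of the PMF over 0..5
pmf_sum = sum(stats.binom.pmf(k, n=n, p=p) for k in range(6))

# Survival function: probability of more than 5 purchases (1 - CDF)
p_more_than_5 = stats.binom.sf(5, n=n, p=p)

print(f"P(X = 5)  = {p_exactly_5:.4f}")
print(f"P(X <= 5) = {p_at_most_5:.4f}  (sum of PMF: {pmf_sum:.4f})")
print(f"P(X > 5)  = {p_more_than_5:.4f}")
```

Most analytics questions ("what is the chance of at most k events?") are CDF questions, so `cdf` and `sf` tend to be used more often than `pmf` or `pdf`.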

In [None]:
# Discrete distribution example: Binomial (purchases out of n visitors)
n = 50      # trials (visitors)
p = 0.10    # probability of purchase

k = np.arange(0, n + 1)
pmf = stats.binom.pmf(k, n=n, p=p)

plt.bar(k, pmf)
plt.title('Binomial: Purchases out of 50 visitors (p=0.10)')
plt.xlabel('Number of purchases')
plt.ylabel('Probability')
plt.show()

In [None]:
# Continuous distribution example: Standard normal
x = np.linspace(-4, 4, 400)
pdf = stats.norm.pdf(x, loc=0, scale=1)

plt.plot(x, pdf)
plt.title('Standard Normal Distribution (mean=0, std=1)')
plt.xlabel('x')
plt.ylabel('Density')
plt.show()

### Common distributions in analytics
- **Bernoulli:** one yes/no trial (purchase or not).
- **Binomial:** number of successes in *n* Bernoulli trials.
- **Poisson:** counts in a fixed period/space (e.g., tickets per hour).
- **Normal:** approximates many natural measurements; sample means tend toward it via the Central Limit Theorem.
- **Exponential:** time between events (e.g., time between arrivals).

> **Warning:** Not everything is normal. Always look at the data (histogram/box plot) before assuming a distribution.

In [None]:
# Poisson example: number of support tickets per hour
lam = 4  # average tickets per hour
k = np.arange(0, 16)
pmf = stats.poisson.pmf(k, mu=lam)

plt.stem(k, pmf)
plt.title('Poisson: Tickets per hour (lambda=4)')
plt.xlabel('Tickets per hour')
plt.ylabel('Probability')
plt.show()
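
The Central Limit Theorem mentioned in the list above can be checked by simulation. The sketch below is one possible illustration: it draws many samples from a right-skewed exponential population (mean 5, an assumed value) and looks at the distribution of the sample means. The `skewness` helper is a hypothetical convenience function written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

scale = 5.0        # assumed population mean of the exponential distribution
sample_size = 40   # observations per sample
n_repeats = 5000   # number of repeated samples

# Each row is one sample; take the mean of every row
sample_means = rng.exponential(scale=scale, size=(n_repeats, sample_size)).mean(axis=1)

# Individual observations from the same population, for comparison
raw_values = rng.exponential(scale=scale, size=n_repeats)

def skewness(x):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    centered = np.asarray(x) - np.mean(x)
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

print(f"Skewness of raw values:   {skewness(raw_values):.2f}")
print(f"Skewness of sample means: {skewness(sample_means):.2f}")
print(f"Std of sample means:      {sample_means.std(ddof=1):.3f}")
print(f"Theory (scale / sqrt(n)): {scale / np.sqrt(sample_size):.3f}")
```

The sample means are much less skewed than the raw values, and their spread is close to the standard error predicted by theory, even though the underlying data is far from normal.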

## 8.3 Sampling Techniques

In real analytics, you rarely measure the whole population. You sample.

### Population vs sample
- **Population:** everyone/everything you care about (all customers).
- **Sample:** the subset you observed.

### Common sampling methods
- **Simple random sampling:** every member of the population has an equal chance of being selected.
- **Stratified sampling:** sample within groups (e.g., regions) to ensure representation.
- **Cluster sampling:** randomly select whole groups (clusters), then include all or some of the members within each selected cluster.
- **Systematic sampling:** take every k-th item (be careful: if the ordering has a periodic pattern, the sample can be biased).

> **Common mistake:** sampling only what’s convenient (convenience sampling). This can produce biased conclusions.
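
Cluster sampling is the method from the list above that is easiest to confuse with stratified sampling, so here is a small sketch of it. The data is hypothetical: 100 stores (the clusters) with 50 customers each, and the column names `store_id` and `spend` are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical population: 100 stores (clusters) x 50 customers each
store_ids = np.repeat(np.arange(100), 50)
customers_df = pd.DataFrame({
    'store_id': store_ids,
    'spend': rng.gamma(shape=2.0, scale=30.0, size=store_ids.size),
})

# Stage 1: randomly choose 10 whole stores
chosen_stores = rng.choice(customers_df['store_id'].unique(), size=10, replace=False)

# Stage 2: keep every customer from the chosen stores
cluster_sample = customers_df[customers_df['store_id'].isin(chosen_stores)]

print(f"Stores sampled:    {cluster_sample['store_id'].nunique()}")
print(f"Customers sampled: {len(cluster_sample)}")
print(f"Sample mean spend: {cluster_sample['spend'].mean():.2f}")
```

In stratified sampling you draw from *every* group; in cluster sampling you draw only *some* groups and then observe them fully or partially.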

In [None]:
# Create a synthetic population dataset
N = 5000
population = pd.DataFrame({
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=N, p=[0.25, 0.35, 0.20, 0.20]),
    'spend': np.random.gamma(shape=2.0, scale=30.0, size=N)  # positive, skewed
})
population.head()

In [None]:
# Simple random sample
sample_srs = population.sample(n=300, random_state=42)

# Stratified sample: equal n from each region
# (equal allocation over-represents smaller regions relative to the population)
sample_strat = population.groupby('region').sample(n=75, random_state=42)

pd.DataFrame({
    'population_mean_spend': [population['spend'].mean()],
    'srs_mean_spend': [sample_srs['spend'].mean()],
    'strat_mean_spend': [sample_strat['spend'].mean()]
})

In [None]:
# Compare region proportions in population vs samples
compare = pd.DataFrame({
    'population': population['region'].value_counts(normalize=True),
    'srs': sample_srs['region'].value_counts(normalize=True),
    'strat': sample_strat['region'].value_counts(normalize=True),
}).fillna(0)

compare

### Why sampling method matters
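If some groups are under- or over-represented relative to the population, estimates such as mean spend can be biased. The equal-allocation stratified sample above draws 75 records per region regardless of region size, so its unweighted mean is not guaranteed to match the population mean.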

> **Tip:** Use stratified sampling when you *know* some groups matter and you want stable comparisons across them.
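
One way to correct for unequal representation is to weight each group's sample mean by that group's share of the population. The sketch below assumes a hypothetical population in which regions differ both in size and in average spend; all numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical regions: different sizes AND different average spend
region_sizes = {'North': 1250, 'South': 1750, 'East': 1000, 'West': 1000}
region_means = {'North': 30, 'South': 90, 'East': 50, 'West': 60}

pop = pd.concat(
    [pd.DataFrame({'region': r, 'spend': rng.normal(region_means[r], 5, size=n)})
     for r, n in region_sizes.items()],
    ignore_index=True,
)

# Equal allocation: 75 records from every region
strat = pop.groupby('region').sample(n=75, random_state=0)

# Naive estimate: treats every sampled record equally
naive_mean = strat['spend'].mean()

# Weighted estimate: each region's mean counts in proportion to its population share
shares = pop['region'].value_counts(normalize=True)
sample_region_means = strat.groupby('region')['spend'].mean()
weighted_mean = (sample_region_means * shares).sum()

print(f"Population mean: {pop['spend'].mean():.2f}")
print(f"Naive mean:      {naive_mean:.2f}")
print(f"Weighted mean:   {weighted_mean:.2f}")
```

The weighted estimate should land much closer to the population mean, because the largest region no longer counts the same as the smallest one.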

## Exercise 8.2 (Sampling)

Using the `population` DataFrame created above:

1. Take a **systematic sample** of 300 records (every k-th record)
2. Calculate the mean spend from your systematic sample
3. Compare it to the population mean
4. Discuss: What could go wrong with systematic sampling if the data has a pattern?

In [None]:
# Your code here
# Step 1: Systematic sampling - take every k-th record
k = len(population) // 300  # Calculate step size
systematic_sample = population.iloc[::k].head(300)  # Take every k-th row

# Step 2: Calculate mean spend
systematic_mean = systematic_sample['spend'].mean()

# Step 3: Compare to population mean
population_mean = population['spend'].mean()

print(f"Population mean spend: ${population_mean:.2f}")
print(f"Systematic sample mean spend: ${systematic_mean:.2f}")
print(f"Difference: ${abs(systematic_mean - population_mean):.2f}")

# Discussion: Systematic sampling works well when data is randomly ordered.
# However, if there's a pattern (e.g., data sorted by region or time), 
# the sample might over-represent or under-represent certain groups.

## 8.4 Hypothesis Testing Framework

A hypothesis test is a structured way to decide whether the data is inconsistent with a default assumption.

### The core steps
1. State hypotheses
   - **Null ($H_0$):** ‘no effect’ / ‘no difference’
   - **Alternative ($H_1$):** the effect/difference you suspect
2. Choose a test statistic
3. Compute a p-value: the probability of seeing data at least as extreme as yours if $H_0$ were true
4. Compare p-value to alpha (common: 0.05)

### Important interpretation
- Small p-value ≠ proof that $H_1$ is true.
- Large p-value ≠ proof that $H_0$ is true.

> **Warning:** Statistical significance is not the same as practical significance. Always look at effect size.
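
To see what steps 2 to 4 involve, here is a minimal sketch of a two-proportion z-test written out by hand with the pooled standard error (the usual textbook form). The function name `two_proportion_ztest` and the click counts are assumptions for illustration; in practice a library routine such as statsmodels' `proportions_ztest` should give closely matching results.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # best estimate of the shared rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se              # test statistic
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts: 80 clicks out of 1000 (control) vs 95 out of 1000 (variant)
z, p = two_proportion_ztest(80, 1000, 95, 1000)
alpha = 0.05

print(f"z = {z:.3f}, p-value = {p:.3f}")
print("Reject H0" if p < alpha else "Fail to reject H0")
```

Note that "fail to reject" is the correct wording when the p-value is above alpha: the test did not find enough evidence against $H_0$, which is different from showing that $H_0$ is true.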

In [None]:
# Example: does a new subject line increase click-through rate (CTR)?
n_control = 1000
n_variant = 1000

p_control = 0.08
p_variant = 0.095

control_clicks = np.random.binomial(1, p_control, size=n_control)
variant_clicks = np.random.binomial(1, p_variant, size=n_variant)

ctr_control = control_clicks.mean()
ctr_variant = variant_clicks.mean()
ctr_control, ctr_variant

In [None]:
from statsmodels.stats.proportion import proportions_ztest

count = np.array([control_clicks.sum(), variant_clicks.sum()])
nobs = np.array([n_control, n_variant])

z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')

summary = pd.DataFrame({
    'group': ['control', 'variant'],
    'n': [n_control, n_variant],
    'clicks': [int(count[0]), int(count[1])],
    'ctr': [ctr_control, ctr_variant],
})

summary, z_stat, p_value

In [None]:
abs_diff = ctr_variant - ctr_control
rel_lift = abs_diff / ctr_control if ctr_control > 0 else np.nan
abs_diff, rel_lift

## Exercise 8.1 (Hypothesis testing)

Simulate another experiment where:
- control CTR = 0.12
- variant CTR = 0.125
- sample size = 500 per group

Tasks:
1. Compute both CTRs
2. Run `proportions_ztest`
3. Report p-value and absolute difference
4. Write one sentence: significant at 0.05? practically meaningful?

In [None]:
# Your code here
n = 500
p_c = 0.12
p_v = 0.125

control = np.random.binomial(1, p_c, size=n)
variant = np.random.binomial(1, p_v, size=n)

ctr_c = control.mean()
ctr_v = variant.mean()

count = np.array([control.sum(), variant.sum()])
nobs = np.array([n, n])
z, p = proportions_ztest(count, nobs, alternative='two-sided')

abs_diff = ctr_v - ctr_c
ctr_c, ctr_v, z, p, abs_diff

## 8.5 Parametric and Non-parametric Tests

### Parametric tests
Parametric tests assume a particular distribution (often normal) and/or assumptions about variance.
- Example: **t-test** compares means

### Non-parametric tests
Non-parametric tests make fewer distribution assumptions (often use ranks).
- Example: **Mann–Whitney U** tests whether values in one group tend to be larger than in the other, without assuming normality

When to consider non-parametric tests:
- data is heavily skewed
- many outliers
- small sample sizes

> **Tip:** Non-parametric does not mean ‘assumption-free’. It means ‘fewer/simpler assumptions’.
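
The phrase "often use ranks" is easier to grasp with a tiny example. The sketch below computes the Mann–Whitney U statistic directly from ranks for two small made-up samples, then replaces one value with an extreme outlier. The helper `mann_whitney_u` is a hypothetical function written for this illustration and computes only the statistic, not a p-value.

```python
import numpy as np
from scipy import stats

def mann_whitney_u(x, y):
    """U statistic for x: number of (x_i, y_j) pairs with x_i > y_j (ties count 0.5)."""
    ranks = stats.rankdata(np.concatenate([x, y]))  # average ranks for ties
    rank_sum_x = ranks[:len(x)].sum()
    return rank_sum_x - len(x) * (len(x) + 1) / 2

# Made-up spending values for two small groups
a = np.array([12.0, 15.0, 18.0, 22.0, 30.0])
b = np.array([14.0, 20.0, 25.0, 28.0, 35.0])

# Same data, but the largest value in b becomes an extreme outlier
b_outlier = b.copy()
b_outlier[-1] = 3500.0

print(f"Mean difference (b - a):         {b.mean() - a.mean():.1f}")
print(f"Mean difference with an outlier: {b_outlier.mean() - a.mean():.1f}")
print(f"U statistic:                     {mann_whitney_u(a, b):.1f}")
print(f"U statistic with an outlier:     {mann_whitney_u(a, b_outlier):.1f}")
```

The outlier changes the difference in means dramatically but leaves the rank-based statistic untouched, because the outlier was already the largest value and its rank does not change.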

In [None]:
# Simulate skewed spending data for two customer segments
n = 200
spend_A = np.random.gamma(shape=2.0, scale=25.0, size=n)
spend_B = np.random.gamma(shape=2.2, scale=25.0, size=n)

df_spend = pd.DataFrame({
    'segment': ['A'] * n + ['B'] * n,
    'spend': np.concatenate([spend_A, spend_B]),
})

sns.boxplot(data=df_spend, x='segment', y='spend')
plt.title('Spending by Segment (skewed data)')
plt.show()

In [None]:
# Parametric test: Welch's t-test (does not assume equal variances)
t_stat, p_t = stats.ttest_ind(spend_A, spend_B, equal_var=False)

# Non-parametric test: Mann-Whitney U (two-sided)
u_stat, p_u = stats.mannwhitneyu(spend_A, spend_B, alternative='two-sided', method='auto')

t_stat, p_t, u_stat, p_u

## 8.6 Confidence Intervals

A **confidence interval (CI)** gives a range of plausible values for an unknown population parameter (like the true mean).

### How to interpret a 95% CI
If we repeated the same sampling method many times, **about 95% of the constructed intervals would contain the true value**.

> **Warning:** It does *not* mean ‘there is a 95% probability the true value is in this specific interval’.
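
The "repeated sampling" interpretation can be checked by simulation. In this sketch we pretend to know the true mean (50, an assumed value), draw many samples, build a 95% t-interval from each one, and count how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

true_mean, sigma = 50.0, 10.0  # assumed population parameters
n = 30                         # observations per sample
n_repeats = 2000               # number of repeated samples

# Critical t value for a two-sided 95% interval
t_crit = stats.t.ppf(0.975, df=n - 1)

samples = rng.normal(true_mean, sigma, size=(n_repeats, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Coverage over {n_repeats} repeated samples: {coverage:.3f}")
```

The coverage should come out close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure, not one particular interval.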

In [None]:
sample = np.random.normal(loc=50, scale=10, size=30)
mean = sample.mean()
ci_low, ci_high = t_confidence_interval(sample, confidence=0.95)
mean, (ci_low, ci_high)

In [None]:
sns.histplot(sample, bins=12, kde=True)
plt.axvline(mean, color='black', linestyle='--', label=f'Mean = {mean:.2f}')
plt.axvspan(ci_low, ci_high, alpha=0.2, label='95% CI for mean')
plt.title('Sample Distribution + 95% CI for the Mean')
plt.legend()
plt.show()

### Bootstrap confidence interval (practical and flexible)
Bootstrapping uses repeated resampling (with replacement) to approximate the sampling distribution. This is useful when the data is skewed or the statistic is complex.

We’ll bootstrap the mean spend from a skewed sample.

In [None]:
skewed_sample = np.random.gamma(shape=2.0, scale=30.0, size=200)

B = 3000
boot_means = np.array([np.random.choice(skewed_sample, size=len(skewed_sample), replace=True).mean() for _ in range(B)])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
skewed_sample.mean(), (ci_low, ci_high)

In [None]:
sns.histplot(boot_means, bins=40, kde=True)
plt.axvline(ci_low, color='red', linestyle='--', label='2.5%')
plt.axvline(ci_high, color='red', linestyle='--', label='97.5%')
plt.title('Bootstrap Sampling Distribution of the Mean')
plt.legend()
plt.show()

## Exercise 8.3 (Confidence Intervals)

A company wants to estimate the average time customers spend on their website.

Tasks:
1. Generate a random sample of 50 session durations (use `np.random.exponential(scale=5, size=50)` for realistic right-skewed data)
2. Calculate a 95% t-based confidence interval using the helper function
3. Calculate a 95% bootstrap confidence interval (use 2000 iterations)
4. Compare the two intervals - which one is wider? Why might that be?

In [None]:
# Your code here
# Step 1: Generate session duration data (exponential is right-skewed)
session_times = np.random.exponential(scale=5, size=50)

print(f"Sample mean: {session_times.mean():.2f} minutes")
print(f"Sample median: {np.median(session_times):.2f} minutes")

# Step 2: T-based confidence interval
t_ci = t_confidence_interval(session_times, confidence=0.95)
print(f"\n95% T-based CI: ({t_ci[0]:.2f}, {t_ci[1]:.2f})")
print(f"T-based CI width: {t_ci[1] - t_ci[0]:.2f}")

# Step 3: Bootstrap confidence interval
B = 2000
boot_means = np.array([
    np.random.choice(session_times, size=len(session_times), replace=True).mean() 
    for _ in range(B)
])
boot_ci = np.percentile(boot_means, [2.5, 97.5])
print(f"\n95% Bootstrap CI: ({boot_ci[0]:.2f}, {boot_ci[1]:.2f})")
print(f"Bootstrap CI width: {boot_ci[1] - boot_ci[0]:.2f}")

# Step 4: Comparison
print("\n--- Comparison ---")
print("For skewed data, bootstrap CI may be more accurate because it doesn't")
print("assume normality. The t-based CI assumes the sampling distribution of")
print("the mean is approximately normal, which may not hold for small, skewed samples.")

## 8.7 Correlation and Causation

### Correlation
Correlation measures how strongly two variables move together.
- **Pearson correlation:** linear relationship
- **Spearman correlation:** rank-based; captures monotonic (not only linear) relationships and is more robust to outliers

### Causation
Causation means changing X *causes* a change in Y. Correlation alone does not prove causation.

Why correlation can be misleading:
- **Confounders:** a third variable affects both X and Y
- **Reverse causality:** Y causes X
- **Coincidence:** especially with small samples

> **Tip:** Strong evidence for causation often comes from randomized experiments (like A/B tests), not observational data.

In [None]:
# Confounder example: temperature influences two variables
n = 800
temperature = np.random.normal(loc=25, scale=5, size=n)
ice_cream_sales = 50 + 5 * temperature + np.random.normal(0, 20, size=n)
drownings = 2 + 0.3 * temperature + np.random.normal(0, 2, size=n)

df = pd.DataFrame({
    'temperature': temperature,
    'ice_cream_sales': ice_cream_sales,
    'drownings': drownings,
})

df.corr(numeric_only=True)

In [None]:
sns.scatterplot(data=df, x='ice_cream_sales', y='drownings', alpha=0.6)
plt.title('Correlation example: Ice cream sales vs drownings')
plt.show()
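
One way to probe a suspected confounder is to remove its linear effect from both variables and correlate what is left (a partial correlation). The sketch below re-simulates the temperature scenario with its own random generator; the helper `residualize` is a hypothetical function written for this example, and the approach assumes the confounder acts linearly.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 800

# Temperature drives both variables; they do not affect each other
temperature = rng.normal(loc=25, scale=5, size=n)
ice_cream = 50 + 5 * temperature + rng.normal(0, 20, size=n)
drownings = 2 + 0.3 * temperature + rng.normal(0, 2, size=n)

def residualize(y, x):
    """Return the part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Raw correlation: looks like a real relationship
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

# Partial correlation: correlate the residuals after removing temperature
partial_r = np.corrcoef(
    residualize(ice_cream, temperature),
    residualize(drownings, temperature),
)[0, 1]

print(f"Raw correlation:                     {raw_r:.3f}")
print(f"After adjusting for temperature:     {partial_r:.3f}")
```

The correlation largely disappears once temperature is accounted for. This only works for confounders you have measured; unmeasured ones remain a risk in observational data.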

## Exercise 8.4 (Correlation vs Causation)

Analyze the relationship between study hours and exam scores:

1. Generate synthetic data where study hours genuinely affects exam scores
2. Calculate both Pearson and Spearman correlations
3. Create a scatter plot with correlation values displayed
4. Think: Can you conclude that studying *causes* better scores from this data alone? What other factors might be involved?

In [None]:
# Your code here
# Step 1: Generate data where study hours affects scores
n = 100
study_hours = np.random.uniform(1, 10, size=n)
# Scores increase with study hours, but with some noise
exam_scores = 40 + 5 * study_hours + np.random.normal(0, 8, size=n)
exam_scores = np.clip(exam_scores, 0, 100)  # Keep scores in valid range

df_study = pd.DataFrame({'study_hours': study_hours, 'exam_scores': exam_scores})

# Step 2: Calculate correlations
pearson_r = df_study['study_hours'].corr(df_study['exam_scores'], method='pearson')
spearman_r = df_study['study_hours'].corr(df_study['exam_scores'], method='spearman')

print(f"Pearson correlation: {pearson_r:.3f}")
print(f"Spearman correlation: {spearman_r:.3f}")

# Step 3: Scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df_study, x='study_hours', y='exam_scores', alpha=0.7)
plt.title(f'Study Hours vs Exam Scores\nPearson r = {pearson_r:.3f}, Spearman ρ = {spearman_r:.3f}')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.show()

# Step 4: Discussion
print("\n--- Causation Discussion ---")
print("Even with strong correlation, we cannot definitively conclude causation because:")
print("• Confounders: Students who study more might also sleep better, attend class more, etc.")
print("• Self-selection: Motivated students both study more AND perform better")
print("• To establish causation, we'd need a randomized experiment")

## 8.8 Simple and Multiple Regression

Regression models how an outcome changes with one or more predictors.

### Simple linear regression
One predictor: $y = b_0 + b_1 x + \text{noise}$

### Multiple regression
Including several predictors lets the model adjust for (‘control for’) confounders that you have measured, and it can improve predictions.

What to look at:
- **Coefficients:** direction and size (within the model)
- **p-values:** evidence against coefficient being 0 (interpret carefully)
- **$R^2$:** share of the outcome's variance explained by the model (higher is not always better)

> **Warning:** Regression shows association; it does not by itself establish causation.
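
To illustrate what "controlling for" a confounder does to a coefficient, here is a minimal sketch using ordinary least squares via NumPy. The variables `x`, `z`, and `y` and the helper `ols` are hypothetical: the true effect of `x` on `y` is set to 2.0, and `z` is a confounder that influences both.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

z = rng.normal(size=n)                             # confounder
x = 0.8 * z + rng.normal(size=n)                   # predictor, correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)   # true slope on x is 2.0

def ols(predictors, y):
    """Least-squares coefficients: [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_simple = ols([x], y)     # leaves out the confounder
b_multi = ols([x, z], y)   # includes the confounder

print(f"Slope on x, simple model:   {b_simple[1]:.2f}")
print(f"Slope on x, multiple model: {b_multi[1]:.2f}  (true value: 2.0)")
```

The simple model attributes part of the confounder's effect to `x`, so its slope is inflated. Adding `z` brings the estimate back toward the true value, which is why coefficients should always be read as "holding the other predictors in the model constant".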

In [None]:
# Simulate: sales depends on ad_spend and price
n = 300
ad_spend = np.random.uniform(0, 100, size=n)
price = np.random.uniform(10, 30, size=n)
sales = 200 + 3.5 * ad_spend - 5.0 * price + np.random.normal(0, 30, size=n)

df_reg = pd.DataFrame({'sales': sales, 'ad_spend': ad_spend, 'price': price})
df_reg.head()

In [None]:
m_simple = smf.ols('sales ~ ad_spend', data=df_reg).fit()
m_simple.summary().tables[1]

In [None]:
m_multi = smf.ols('sales ~ ad_spend + price', data=df_reg).fit()
m_multi.summary().tables[1]

In [None]:
sns.regplot(data=df_reg, x='ad_spend', y='sales', scatter_kws={'alpha': 0.5})
plt.title('Sales vs Ad Spend (simple regression line)')
plt.show()

In [None]:
df_reg = df_reg.copy()
df_reg['residuals'] = m_multi.resid
df_reg['fitted'] = m_multi.fittedvalues

sns.scatterplot(data=df_reg, x='fitted', y='residuals', alpha=0.6)
plt.axhline(0, color='black', linestyle='--')
plt.title('Residuals vs Fitted (look for random scatter around 0)')
plt.show()

## Exercise 8.5 (Regression)

A retail company wants to predict monthly sales based on marketing spend and store size.

Tasks:
1. Create a synthetic dataset with 200 stores including: `marketing_spend` (0-50), `store_size` (500-5000 sq ft), and `monthly_sales`
2. Build a simple regression model with only `marketing_spend`
3. Build a multiple regression model with both predictors
4. Compare the R² values - how much does adding `store_size` improve the model?
5. Interpret the coefficients: what does each one mean in plain English?

In [None]:
# Your code here
# Step 1: Create synthetic retail data
n = 200
marketing_spend = np.random.uniform(0, 50, size=n)
store_size = np.random.uniform(500, 5000, size=n)

# Sales depends on both, with some noise
monthly_sales = (10000 + 
                 200 * marketing_spend + 
                 5 * store_size + 
                 np.random.normal(0, 3000, size=n))

df_retail = pd.DataFrame({
    'marketing_spend': marketing_spend,
    'store_size': store_size,
    'monthly_sales': monthly_sales
})

print("Data preview:")
print(df_retail.head())

# Step 2: Simple regression (marketing only)
model_simple = smf.ols('monthly_sales ~ marketing_spend', data=df_retail).fit()
print(f"\n--- Simple Model (marketing only) ---")
print(f"R² = {model_simple.rsquared:.4f}")

# Step 3: Multiple regression (both predictors)
model_multi = smf.ols('monthly_sales ~ marketing_spend + store_size', data=df_retail).fit()
print(f"\n--- Multiple Model (marketing + store size) ---")
print(f"R² = {model_multi.rsquared:.4f}")

# Step 4: Compare R²
r2_improvement = model_multi.rsquared - model_simple.rsquared
print(f"\nR² improvement by adding store_size: {r2_improvement:.4f} ({r2_improvement*100:.1f} percentage points more variance explained)")

# Step 5: Interpret coefficients
print("\n--- Coefficient Interpretation ---")
print(model_multi.summary().tables[1])
print("\nIn plain English:")
print(f"• For every 1-unit increase in marketing spend, sales increase by ~${model_multi.params['marketing_spend']:.0f}, holding store size constant")
print(f"• For every 1 sq ft increase in store size, sales increase by ~${model_multi.params['store_size']:.0f}, holding marketing spend constant")
print(f"• Base sales (no marketing, zero size) would be ~${model_multi.params['Intercept']:.0f}; this is an extrapolation outside the observed data")

## 8.9 A/B Testing Methodology

A/B testing is a controlled experiment where users are randomly assigned to:
- **A (control):** current version
- **B (variant):** new change

### Typical workflow
1. Define the goal metric (CTR, conversion, revenue per user)
2. Define hypotheses and alpha
3. Randomize and run the experiment
4. Analyze results (p-value + effect size + confidence interval)
5. Decide considering business impact and risks

### Common pitfalls
- Stopping early when p-value looks good (peeking)
- Running too many metrics and picking the best one
- Not checking sample ratio mismatch
- Ignoring practical impact
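
Of these pitfalls, sample ratio mismatch (SRM) is the easiest to check automatically. A common approach is a chi-square goodness-of-fit test comparing observed group sizes with the planned split. The sketch below assumes a planned 50/50 split; the function name `srm_check` and the strict threshold `alpha=0.001` are assumptions chosen for illustration, not a fixed standard.

```python
from scipy import stats

def srm_check(n_a, n_b, expected_share_a=0.5, alpha=0.001):
    """Flag a sample ratio mismatch using a chi-square goodness-of-fit test."""
    total = n_a + n_b
    expected = [total * expected_share_a, total * (1 - expected_share_a)]
    chi2, p_value = stats.chisquare([n_a, n_b], f_exp=expected)
    return chi2, p_value, bool(p_value < alpha)

# Balanced assignment: no reason for concern
chi2_ok, p_ok, flagged_ok = srm_check(5000, 5000)

# 5200 vs 4800 users: larger imbalance than chance would usually produce
chi2_bad, p_bad, flagged_bad = srm_check(5200, 4800)

print(f"5000 vs 5000: chi2={chi2_ok:.2f}, p={p_ok:.4f}, SRM flagged: {flagged_ok}")
print(f"5200 vs 4800: chi2={chi2_bad:.2f}, p={p_bad:.6f}, SRM flagged: {flagged_bad}")
```

If SRM is flagged, investigate the assignment or logging pipeline before interpreting the metric results, because the groups may no longer be comparable.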

In [None]:
# Mini-project: analyze an A/B test end-to-end (simulated)
nA = 3000
nB = 3000
pA = 0.060
pB = 0.066  # small lift

A = np.random.binomial(1, pA, size=nA)
B = np.random.binomial(1, pB, size=nB)

ctrA = A.mean()
ctrB = B.mean()
diff = ctrB - ctrA

count = np.array([A.sum(), B.sum()])
nobs = np.array([nA, nB])
z, p = proportions_ztest(count, nobs, alternative='two-sided')

results = pd.DataFrame({
    'group': ['A', 'B'],
    'n': [nA, nB],
    'clicks': [int(count[0]), int(count[1])],
    'ctr': [ctrA, ctrB],
})

results, diff, p

In [None]:
# Bootstrap CI for the CTR difference (B - A)
B_iter = 2000
boot_diffs = np.array([
    np.random.choice(B, size=len(B), replace=True).mean()
    - np.random.choice(A, size=len(A), replace=True).mean()
    for _ in range(B_iter)
])

ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
diff, (ci_low, ci_high)

In [None]:
sns.histplot(boot_diffs, bins=40, kde=True)
plt.axvline(ci_low, color='red', linestyle='--', label='2.5%')
plt.axvline(ci_high, color='red', linestyle='--', label='97.5%')
plt.axvline(diff, color='black', linestyle='--', label='Observed diff')
plt.title('Bootstrap CI for CTR Difference (B - A)')
plt.legend()
plt.show()

## Exercise 8.6 (A/B Testing - Mini Project)

You're a data analyst at an e-commerce company. The product team wants to test whether a new checkout button color (green instead of blue) increases the conversion rate.

**Scenario:**
- Control (blue button): 5,000 visitors, 3.2% conversion rate
- Variant (green button): 5,000 visitors, 3.6% conversion rate

**Tasks:**
1. Simulate the experiment data
2. Perform a two-proportion z-test
3. Calculate the absolute and relative lift
4. Compute a 95% bootstrap confidence interval for the difference
5. Write a short recommendation: Should the company adopt the green button? Consider both statistical significance and practical impact.

In [None]:
# Your code here - A/B Testing Mini Project

# Step 1: Simulate experiment data
np.random.seed(123)  # For reproducibility
n_control = 5000
n_variant = 5000
p_control = 0.032  # 3.2% conversion
p_variant = 0.036  # 3.6% conversion

control_conversions = np.random.binomial(1, p_control, size=n_control)
variant_conversions = np.random.binomial(1, p_variant, size=n_variant)

observed_control = control_conversions.mean()
observed_variant = variant_conversions.mean()

print("=== A/B Test Results ===")
print(f"\nControl (Blue): {observed_control*100:.2f}% conversion ({control_conversions.sum()} / {n_control})")
print(f"Variant (Green): {observed_variant*100:.2f}% conversion ({variant_conversions.sum()} / {n_variant})")

# Step 2: Two-proportion z-test
count = np.array([control_conversions.sum(), variant_conversions.sum()])
nobs = np.array([n_control, n_variant])
z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')

print(f"\n--- Statistical Test ---")
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at α=0.05? {'Yes ✓' if p_value < 0.05 else 'No ✗'}")

# Step 3: Calculate lift
abs_lift = observed_variant - observed_control
rel_lift = abs_lift / observed_control if observed_control > 0 else np.nan  # undefined when control rate is 0

print(f"\n--- Effect Size ---")
print(f"Absolute lift: {abs_lift*100:.2f} percentage points")
print(f"Relative lift: {rel_lift*100:.1f}%")

# Step 4: Bootstrap CI for difference
B = 2000
boot_diffs = np.array([
    np.random.choice(variant_conversions, size=n_variant, replace=True).mean() -
    np.random.choice(control_conversions, size=n_control, replace=True).mean()
    for _ in range(B)
])
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"\n--- 95% Bootstrap CI for Difference ---")
print(f"CI: ({ci_low*100:.2f}%, {ci_high*100:.2f}%)")
print(f"Does CI include 0? {'Yes (uncertain)' if ci_low <= 0 <= ci_high else 'No (significant)'}")

# Step 5: Recommendation
print("\n" + "="*50)
print("RECOMMENDATION")
print("="*50)
if p_value < 0.05 and ci_low > 0:
    print("✓ The green button shows a statistically significant improvement.")
    print(f"  Expected additional conversions per 10,000 visitors: ~{abs_lift * 10000:.0f}")
    print("  Recommendation: ADOPT the green button.")
else:
    print("✗ Results are not statistically significant at α=0.05.")
    print("  Recommendation: Consider running the test longer or with more traffic.")
    print("  The observed difference could be due to random chance.")

## 8.10 Statistical Assumptions and Limitations

Understanding when statistical methods work (and when they fail) is as important as knowing how to use them. Most methods in this chapter assume random sampling or random assignment, independent observations, a sufficient sample size, and (for parametric tests) a particular distribution. Common limitations include multiple comparisons, selection bias, measurement error, and Simpson’s paradox.

### Common assumptions to check

| Test/Method | Key Assumptions |
|-------------|-----------------|
| T-test | Data is roughly normal (or large sample), independent observations |
| Chi-square test | Expected frequencies ≥ 5, independent observations |
| Linear regression | Linear relationship, normal residuals, homoscedasticity, independent errors |
| Correlation | Linear relationship (for Pearson), no extreme outliers |
| A/B test | Random assignment, independent users, no interference between groups |

### How to check assumptions

1. **Normality:** Histogram, Q-Q plot, Shapiro-Wilk test
2. **Independence:** Study design (was randomization proper?)
3. **Homoscedasticity:** Residual plot (look for constant spread)
4. **Linearity:** Scatter plot, residual plot
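
For the normality check, a Q-Q plot can also be summarized numerically. The sketch below uses `scipy.stats.probplot`, which returns the Q-Q points together with a straight-line fit; the correlation `r` of that fit is close to 1 when the data follows the reference distribution. The helper name `qq_correlation` and the simulated data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

normal_like = rng.normal(loc=50, scale=10, size=200)
skewed = rng.exponential(scale=10, size=200)

def qq_correlation(data):
    """Correlation between sorted data and theoretical normal quantiles."""
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(data, dist='norm')
    return r

r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)

print(f"Q-Q correlation, normal-like data: {r_normal:.4f}")
print(f"Q-Q correlation, skewed data:      {r_skewed:.4f}")
```

To draw the actual plot, you could pass `plot=plt` to `probplot`. Points that bend away from the straight line, especially in the tails, indicate departures from normality.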

### Common limitations and pitfalls

| Problem | What Goes Wrong | How to Avoid |
|---------|-----------------|--------------|
| Multiple comparisons | Running many tests inflates false positives | Use Bonferroni correction or control the false discovery rate (FDR) |
| Selection bias | Sample doesn't represent population | Use proper sampling methods |
| Survivorship bias | Only seeing "survivors" | Think about what's missing from data |
| Simpson's paradox | Aggregate trend reverses in subgroups | Always check grouped data |
| P-hacking | Trying analyses until p < 0.05 | Pre-register hypotheses and report every analysis you ran |
| Overfitting | Model fits noise, not signal | Use holdout validation |

> **Tip:** Always pair statistics with domain knowledge, good data collection, and clear metric definitions. Statistics is a tool, not a replacement for thinking.

> **Warning:** A statistically significant result is not automatically a correct or important result. Context matters!
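
The multiple comparisons problem from the table above is easy to demonstrate by simulation. In the sketch below every null hypothesis is true by construction (both groups come from the same distribution), yet testing 20 metrics at once frequently produces at least one "significant" result. All sizes and counts are assumed values chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_tests = 20         # metrics tested per experiment
n_per_group = 100    # users per group
alpha = 0.05
n_experiments = 500  # simulated experiments with NO real effect

any_fp_raw = 0
any_fp_bonferroni = 0

for _ in range(n_experiments):
    # Both groups come from the same distribution, so H0 is true for every metric
    a = rng.normal(size=(n_tests, n_per_group))
    b = rng.normal(size=(n_tests, n_per_group))
    _, p_values = stats.ttest_ind(a, b, axis=1)

    any_fp_raw += (p_values < alpha).any()
    # Bonferroni: divide alpha by the number of tests
    any_fp_bonferroni += (p_values < alpha / n_tests).any()

raw_rate = any_fp_raw / n_experiments
bonferroni_rate = any_fp_bonferroni / n_experiments

print(f"Experiments with >= 1 false positive (no correction): {raw_rate:.2f}")
print(f"Experiments with >= 1 false positive (Bonferroni):    {bonferroni_rate:.2f}")
```

Without correction, a majority of these no-effect experiments report something "significant". Bonferroni brings the rate back to roughly alpha, at the cost of reduced power to detect real effects.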

In [None]:
# Example: Checking normality assumption
from scipy.stats import shapiro

# Generate two datasets: one normal, one skewed
normal_data = np.random.normal(loc=50, scale=10, size=100)
skewed_data = np.random.exponential(scale=10, size=100)

# Shapiro-Wilk test (null hypothesis: data is normally distributed)
stat_normal, p_normal = shapiro(normal_data)
stat_skewed, p_skewed = shapiro(skewed_data)

print("Shapiro-Wilk Normality Test")
print("="*40)
print(f"Normal data: W={stat_normal:.4f}, p={p_normal:.4f}")
print(f"  → {'No evidence against normality' if p_normal > 0.05 else 'Evidence against normality'}")
print(f"\nSkewed data: W={stat_skewed:.4f}, p={p_skewed:.4f}")
print(f"  → {'No evidence against normality' if p_skewed > 0.05 else 'Evidence against normality'}")

# Visual check with histograms
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(normal_data, bins=15, edgecolor='black', alpha=0.7)
axes[0].set_title(f'Normal Data\n(Shapiro p={p_normal:.3f})')
axes[0].set_xlabel('Value')

axes[1].hist(skewed_data, bins=15, edgecolor='black', alpha=0.7, color='orange')
axes[1].set_title(f'Skewed Data\n(Shapiro p={p_skewed:.3f})')
axes[1].set_xlabel('Value')

plt.tight_layout()
plt.show()

## Exercise 8.7 (Chapter Review - Comprehensive Mini Project)

**Scenario:** You're analyzing customer data for a subscription service.

Given the following synthetic dataset, complete the analysis:

1. **Descriptive Stats:** Calculate mean, median, and std of `monthly_spend`
2. **Confidence Interval:** Calculate a 95% CI for mean monthly spend
3. **Hypothesis Test:** Test if premium users spend more than basic users (use t-test)
4. **Correlation:** Calculate correlation between `tenure_months` and `monthly_spend`
5. **Regression:** Build a model predicting spend from tenure and user type
6. **Interpret:** Write 2-3 sentences summarizing your findings for a non-technical stakeholder

In [None]:
# Generate the customer dataset
np.random.seed(42)
n_customers = 400

user_type = np.random.choice(['basic', 'premium'], size=n_customers, p=[0.7, 0.3])
tenure_months = np.random.exponential(scale=12, size=n_customers) + 1

# Premium users spend more on average
base_spend = np.where(user_type == 'premium', 45, 25)
monthly_spend = base_spend + 0.5 * tenure_months + np.random.normal(0, 8, size=n_customers)
monthly_spend = np.maximum(monthly_spend, 5)  # Minimum spend

customers = pd.DataFrame({
    'user_type': user_type,
    'tenure_months': tenure_months,
    'monthly_spend': monthly_spend
})

print("Customer Data Sample:")
print(customers.head(10))
print(f"\nDataset shape: {customers.shape}")

In [None]:
# Reference solution - try each step yourself before reading on

# 1. Descriptive Statistics
print("=== 1. Descriptive Statistics ===")
print(f"Mean monthly spend: ${customers['monthly_spend'].mean():.2f}")
print(f"Median monthly spend: ${customers['monthly_spend'].median():.2f}")
print(f"Std Dev: ${customers['monthly_spend'].std():.2f}")

# 2. Confidence Interval
print("\n=== 2. 95% Confidence Interval ===")
ci = t_confidence_interval(customers['monthly_spend'], confidence=0.95)
print(f"95% CI for mean spend: (${ci[0]:.2f}, ${ci[1]:.2f})")

# 3. Hypothesis Test (Premium vs Basic)
print("\n=== 3. Hypothesis Test ===")
premium_spend = customers[customers['user_type'] == 'premium']['monthly_spend']
basic_spend = customers[customers['user_type'] == 'basic']['monthly_spend']

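# Welch's t-test (equal_var=False) does not assume the two groups share a variance
t_stat, p_val = stats.ttest_ind(premium_spend, basic_spend, equal_var=False, alternative='greater')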
print(f"Premium mean: ${premium_spend.mean():.2f}")
print(f"Basic mean: ${basic_spend.mean():.2f}")
print(f"T-statistic: {t_stat:.3f}, P-value: {p_val:.6f}")
print(f"Conclusion: Premium users {'DO' if p_val < 0.05 else 'DO NOT'} spend significantly more (α=0.05)")

# 4. Correlation
print("\n=== 4. Correlation Analysis ===")
pearson_r = customers['tenure_months'].corr(customers['monthly_spend'])
print(f"Pearson correlation (tenure vs spend): {pearson_r:.3f}")

# 5. Regression
print("\n=== 5. Regression Model ===")
# Create dummy variable for user type
customers['is_premium'] = (customers['user_type'] == 'premium').astype(int)
model = smf.ols('monthly_spend ~ tenure_months + is_premium', data=customers).fit()
print(model.summary().tables[1])

# 6. Interpretation
print("\n" + "="*50)
print("STAKEHOLDER SUMMARY")
print("="*50)
print("""
Our analysis of 400 customers reveals:

1. The average customer spends about ${:.0f}/month (95% confident it's between 
   ${:.0f} and ${:.0f}).

2. Premium users spend significantly more than basic users - about ${:.0f} more 
   per month on average.

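3. Longer-tenured customers tend to spend more. Among customers on the same plan,
   each additional month of tenure is associated with about ${:.2f} more in
   monthly spend.

Recommendation: These patterns show what goes together, not what causes what.
Premium upgrades and long-term engagement look like promising levers; a controlled
experiment (A/B test) would confirm whether they actually raise spend.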
""".format(
    customers['monthly_spend'].mean(),
    ci[0], ci[1],
    premium_spend.mean() - basic_spend.mean(),
    model.params['tenure_months']
))

## Additional Resources (Optional)

### Documentation
- **SciPy stats reference:** https://docs.scipy.org/doc/scipy/reference/stats.html
- **Statsmodels documentation:** https://www.statsmodels.org/

### Free Learning Resources
- **Khan Academy (Statistics & Probability):** https://www.khanacademy.org/math/statistics-probability
- **OpenIntro Statistics (free textbook):** https://www.openintro.org/book/os/
- **Seeing Theory (visual probability):** https://seeing-theory.brown.edu/

### Recommended Reading
- *Naked Statistics* by Charles Wheelan - Excellent for building intuition
- *Statistics Done Wrong* by Alex Reinhart - Learn from common mistakes
- *The Art of Statistics* by David Spiegelhalter - Modern statistical thinking

### Tools & Calculators
- **Sample size calculators:** https://www.evanmiller.org/ab-testing/sample-size.html
- **Power analysis:** Use `statsmodels.stats.power` module in Python
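
As a minimal sketch of what a power analysis with `statsmodels.stats.power` can look like, the snippet below solves for the per-group sample size of a two-sample t-test. The effect size (Cohen's d = 0.5), α = 0.05, and target power of 0.80 are illustrative assumptions, not values taken from this chapter's datasets.

```python
import math

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the sample size per group, leaving nobs1 unspecified.
n_per_group = analysis.solve_power(
    effect_size=0.5,        # assumed standardized difference (Cohen's d)
    alpha=0.05,             # significance level
    power=0.80,             # desired probability of detecting the effect
    ratio=1.0,              # equal group sizes
    alternative="two-sided",
)
n_required = math.ceil(n_per_group)
print(f"Required sample size per group: {n_required}")

# Sanity check: power actually achieved with the rounded-up sample size.
achieved_power = analysis.power(
    effect_size=0.5,
    nobs1=n_required,
    alpha=0.05,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Achieved power: {achieved_power:.3f}")
```

Smaller assumed effects require much larger samples, so it is worth rerunning the calculation with a few plausible effect sizes before committing to an experiment length.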

## Summary / Key Takeaways

### Core Concepts
- **Statistics** helps you quantify uncertainty when learning from samples.
- **Descriptive statistics** summarize what you observed; **inferential statistics** make careful guesses about populations.
- **Distributions** (binomial, Poisson, normal, etc.) model how data can behave.

### Methods Learned
| Method | Use Case |
|--------|----------|
| Sampling techniques | Collect representative data |
| Hypothesis testing | Make yes/no decisions with controlled error |
| Confidence intervals | Quantify uncertainty in estimates |
| Correlation analysis | Measure relationships between variables |
| Linear regression | Model and predict outcomes |
| A/B testing | Run controlled experiments |

### Critical Thinking Points
1. **Correlation ≠ Causation:** Always consider confounders and alternative explanations
2. **Statistical significance ≠ Practical significance:** A tiny effect can be "significant" with large samples
3. **P-values are not probabilities of truth:** A p-value measures how surprising the data would be if the null hypothesis were true, not the probability that a hypothesis is correct
4. **Check assumptions:** Every statistical method has assumptions; violating them can invalidate results
5. **Context matters:** Domain knowledge should guide statistical analysis, not the other way around
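
The gap between statistical and practical significance (point 2) can be illustrated with a small simulation. This is a sketch on synthetic data: the group means, standard deviation, sample size, and seed are all made-up values chosen so that the true difference is tiny relative to the spread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # illustrative seed
n = 200_000                     # very large sample per group

# True difference of 0.3 on a scale whose standard deviation is 15.
control = rng.normal(loc=100.0, scale=15.0, size=n)
treatment = rng.normal(loc=100.3, scale=15.0, size=n)

# Welch's two-sample t-test (two-sided).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation.
mean_diff = treatment.mean() - control.mean()
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = mean_diff / pooled_sd

print(f"Mean difference: {mean_diff:.3f}")
print(f"p-value:         {p_value:.2e}")
print(f"Cohen's d:       {cohens_d:.3f}")
```

The p-value is far below 0.05 because the sample is huge, yet the standardized effect is a small fraction of one standard deviation. Whether a difference of that size matters is a business question, not a statistical one.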

### Common Mistakes to Avoid
- ❌ Stopping an experiment early because p < 0.05 (peeking problem)
- ❌ Running many tests and picking the best result (multiple comparisons)
- ❌ Confusing correlation with causation
- ❌ Ignoring effect size and only reporting p-values
- ❌ Using parametric tests on heavily skewed data without checking assumptions
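
The multiple-comparisons mistake is easy to see in simulation. The sketch below assumes 2,000 imaginary "studies", each running 20 t-tests on data where no real difference exists; these counts, the group size, and the seed are illustrative. It then compares how often at least one test looks "significant" with and without a Bonferroni correction (dividing α by the number of tests).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # illustrative seed
n_studies, n_tests, n_per_group, alpha = 2000, 20, 30, 0.05

# Both groups come from the same distribution, so every null hypothesis is true.
group_a = rng.normal(size=(n_studies, n_tests, n_per_group))
group_b = rng.normal(size=(n_studies, n_tests, n_per_group))

# One t-test per (study, test) pair, computed along the last axis.
p_values = stats.ttest_ind(group_a, group_b, axis=-1).pvalue

# Share of studies with at least one false positive.
rate_uncorrected = (p_values < alpha).any(axis=1).mean()
rate_bonferroni = (p_values < alpha / n_tests).any(axis=1).mean()

print(f"At least one 'significant' result, uncorrected: {rate_uncorrected:.1%}")
print(f"At least one 'significant' result, Bonferroni:  {rate_bonferroni:.1%}")
```

With 20 independent tests at α = 0.05, the chance of at least one false positive is about 1 − 0.95²⁰ ≈ 64%, and the uncorrected rate should land near that. The Bonferroni-corrected rate should stay close to 5%.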

### What's Next?
In Chapter 9, you'll learn about **Database Systems and SQL**, which will help you extract and query data that you'll analyze using these statistical methods.