# Week 2: Statistics and Probability for Finance

---

## Table of Contents
1. Probability Distributions
2. The Normal Distribution
3. Hypothesis Testing
4. Linear Regression
5. Covariance and Correlation Matrices

---

## 1. Probability Distributions

### What is a Probability Distribution?
A probability distribution describes how likely different outcomes are. In finance, we use distributions to model:
- Future stock returns
- Default probabilities
- Extreme market events

### Key Concepts

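**Probability Density Function (PDF)**: For continuous variables, gives the relative likelihood of different values. A probability is obtained by integrating the PDF over an interval; the density at a single point is not itself a probability.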

**Cumulative Distribution Function (CDF)**: Probability that a random variable is less than or equal to a value.

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) dt$$

**Expected Value (Mean)**: The probability-weighted average outcome, i.e. the long-run average if the experiment were repeated many times.

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx$$


In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Example: What's the probability that tomorrow's return is less than -2%?
# Assume returns are normally distributed with mean=0.05%, std=1.5%

mean_return = 0.0005  # 0.05% daily
std_return = 0.015    # 1.5% daily

# Create normal distribution
return_dist = stats.norm(loc=mean_return, scale=std_return)

# Probability of return < -2%
prob_below_minus2 = return_dist.cdf(-0.02)

print(f"Probability of return < -2%: {prob_below_minus2:.4f} = {prob_below_minus2*100:.2f}%")
print(f"\nThis means: On about {prob_below_minus2*252:.1f} days per year, we expect losses > 2%")

Probability of return < -2%: 0.0859 = 8.59%

This means: On about 21.6 days per year, we expect losses > 2%
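
The CDF and expected-value integrals defined above can be checked numerically. The sketch below is only an illustration: it reuses the mean and standard deviation from the example and integrates over μ ± 10σ instead of the whole real line, a truncation chosen here for numerical convenience (the mass outside that range is negligible).

```python
from scipy import stats
from scipy.integrate import quad

mu, sigma = 0.0005, 0.015            # daily mean and std from the example above
dist = stats.norm(loc=mu, scale=sigma)

# Assumption: truncate the infinite integration range at mu +/- 10 sigma
lower, upper = mu - 10 * sigma, mu + 10 * sigma

# CDF as the integral of the PDF up to x
x = -0.02
cdf_by_integration, _ = quad(dist.pdf, lower, x)

# Expected value as the integral of x * f(x)
mean_by_integration, _ = quad(lambda t: t * dist.pdf(t), lower, upper)

print(f"CDF(-2%) by integration: {cdf_by_integration:.4f}")
print(f"CDF(-2%) from dist.cdf:  {dist.cdf(x):.4f}")
print(f"E[X] by integration:     {mean_by_integration:.6f} (mu = {mu})")
```

Both integrals should agree with `dist.cdf(x)` and `mu` up to the integration tolerance.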


---

## 2. The Normal Distribution

### Why Normal?
The normal (Gaussian) distribution is fundamental because:
1. **Central Limit Theorem**: The sum (or average) of many independent random variables with finite variance is approximately normal
2. **Mathematical tractability**: Easy to work with analytically
3. **Two-parameter simplicity**: Fully described by mean and variance

### Formula

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Where:
- $\mu$ = mean (center of distribution)
- $\sigma$ = standard deviation (spread)

### Standard Normal Distribution

When $\mu = 0$ and $\sigma = 1$, we call it the **standard normal**.

Any normal variable can be standardized:
$$Z = \frac{X - \mu}{\sigma}$$
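
As a small worked example of standardization, the sketch below converts a -2% daily return into a z-score and confirms that the standard normal CDF at $Z$ matches the CDF of the original normal variable at $X$. The values of `mu` and `sigma` are the ones used in this document's examples; the variable names are illustrative.

```python
from scipy import stats

mu, sigma = 0.0005, 0.015   # daily mean and std used in this document's examples
x = -0.02                   # a -2% daily return

# Number of standard deviations between x and the mean
z = (x - mu) / sigma

p_standard = stats.norm.cdf(z)                     # standard normal CDF at z
p_direct = stats.norm.cdf(x, loc=mu, scale=sigma)  # N(mu, sigma) CDF at x

print(f"z-score: {z:.3f}")
print(f"P(Z <= z) = {p_standard:.4f}, P(X <= x) = {p_direct:.4f}")
```

The two probabilities are the same number, which is why tables and code for the standard normal are enough to answer questions about any normal variable.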

### The 68-95-99.7 Rule

For a normal distribution:
- 68% of values fall within 1σ of the mean
- 95% of values fall within 2σ of the mean
- 99.7% of values fall within 3σ of the mean

In [2]:
# Demonstrate the 68-95-99.7 rule with stock returns
daily_mean = 0.0005  # 0.05%
daily_std = 0.015    # 1.5%

# Standard normal for calculations
standard_normal = stats.norm(0, 1)

# Calculate probabilities within 1, 2, 3 standard deviations
prob_1std = standard_normal.cdf(1) - standard_normal.cdf(-1)
prob_2std = standard_normal.cdf(2) - standard_normal.cdf(-2)
prob_3std = standard_normal.cdf(3) - standard_normal.cdf(-3)

print("The 68-95-99.7 Rule:")
print(f"Within 1σ: {prob_1std:.4f} = {prob_1std*100:.2f}%")
print(f"Within 2σ: {prob_2std:.4f} = {prob_2std*100:.2f}%")
print(f"Within 3σ: {prob_3std:.4f} = {prob_3std*100:.2f}%")

print(f"\nFor our stock (μ={daily_mean:.2%}, σ={daily_std:.2%}):")
print(f"68% of days: returns between {daily_mean - daily_std:.2%} and {daily_mean + daily_std:.2%}")
print(f"95% of days: returns between {daily_mean - 2*daily_std:.2%} and {daily_mean + 2*daily_std:.2%}")
print(f"99.7% of days: returns between {daily_mean - 3*daily_std:.2%} and {daily_mean + 3*daily_std:.2%}")

The 68-95-99.7 Rule:
Within 1σ: 0.6827 = 68.27%
Within 2σ: 0.9545 = 95.45%
Within 3σ: 0.9973 = 99.73%

For our stock (μ=0.05%, σ=1.50%):
68% of days: returns between -1.45% and 1.55%
95% of days: returns between -2.95% and 3.05%
99.7% of days: returns between -4.45% and 4.55%


### Reality Check: Fat Tails

**Important**: Real financial returns are NOT perfectly normal!

They exhibit:
- **Fat tails** (leptokurtosis): Extreme events more common than normal predicts
- **Negative skewness**: Large negative returns more common than large positive
- **Volatility clustering**: High volatility tends to follow high volatility

The normal distribution **underestimates** tail risk!

In [3]:
# Compare theoretical normal with real market behavior
# Under a normal distribution, 3-sigma events happen about 0.27% of the time (roughly 0.7 days/year)

# Simulate "real" returns with fat tails using Student's t-distribution
np.random.seed(42)
normal_returns = np.random.normal(0, 0.015, 10000)
fat_tail_returns = stats.t.rvs(df=4, loc=0, scale=0.012, size=10000)  # t-distribution with 4 df; std = 0.012*sqrt(2) ≈ 1.7%, slightly above the normal's 1.5%

# Count extreme events (beyond 3 sigma)
threshold = 3 * 0.015  # 3 sigma
normal_extremes = np.sum(np.abs(normal_returns) > threshold)
fat_tail_extremes = np.sum(np.abs(fat_tail_returns) > threshold)

print("Extreme Events (|return| > 4.5%):")
print(f"Normal distribution: {normal_extremes} events out of 10,000 ({normal_extremes/100:.2f}%)")
print(f"Fat-tail distribution: {fat_tail_extremes} events out of 10,000 ({fat_tail_extremes/100:.2f}%)")
print(f"\nFat tails produce {fat_tail_extremes/max(normal_extremes,1):.1f}x more extreme events!")
print("\n⚠️ Normal-based VaR understates tail risk, one reason many risk models failed in 2008!")

Extreme Events (|return| > 4.5%):
Normal distribution: 28 events out of 10,000 (0.28%)
Fat-tail distribution: 169 events out of 10,000 (1.69%)

Fat tails produce 6.0x more extreme events!

⚠️ Normal-based VaR understates tail risk, one reason many risk models failed in 2008!
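
The simulated counts above can be cross-checked against exact tail probabilities. The sketch below uses the same parameters as the simulation (normal with σ = 1.5%, Student's t with 4 degrees of freedom and scale 0.012) and evaluates the two-sided tail beyond 4.5% with the survival function `sf`. It also reports the standard deviation implied by the t parameters, since the two distributions are not matched exactly in volatility.

```python
import numpy as np
from scipy import stats

sigma = 0.015               # normal std from the simulation above
threshold = 3 * sigma       # the same 4.5% cutoff
t_df, t_scale = 4, 0.012    # Student's t parameters from the simulation above

# Exact two-sided tail probabilities; sf(x) = 1 - cdf(x)
p_normal = 2 * stats.norm.sf(threshold, loc=0, scale=sigma)
p_fat = 2 * stats.t.sf(threshold, df=t_df, loc=0, scale=t_scale)

# Std implied by the t parameters: scale * sqrt(df / (df - 2)), valid for df > 2
t_std = t_scale * np.sqrt(t_df / (t_df - 2))

print(f"Normal  P(|r| > 4.5%): {p_normal:.4%}")
print(f"t(4)    P(|r| > 4.5%): {p_fat:.4%}")
print(f"Ratio: {p_fat / p_normal:.1f}x, t-distribution std: {t_std:.4%}")
```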


---

## 3. Hypothesis Testing

### Why Hypothesis Testing in Finance?
- Is this trading strategy's return statistically significant?
- Does adding a factor improve the model?
- Is this stock's beta different from 1?

### The Framework

1. **Null Hypothesis (H₀)**: The default assumption (e.g., "strategy has no alpha")
2. **Alternative Hypothesis (H₁)**: What we're testing for (e.g., "strategy has positive alpha")
3. **Test Statistic**: A number calculated from data
4. **P-value**: Probability of observing a result at least as extreme as ours if H₀ is true
5. **Decision**: Reject H₀ if p-value < significance level (typically 0.05)

### t-Test for Strategy Returns

**Question**: Is the mean return significantly different from zero?

$$t = \frac{\bar{r} - 0}{s / \sqrt{n}}$$

Where:
- $\bar{r}$ = sample mean return
- $s$ = sample standard deviation
- $n$ = number of observations

In [4]:
# Example: Testing if a trading strategy's mean return differs from zero (two-tailed)
np.random.seed(123)

# Simulate 2 years of daily strategy returns
n_days = 504  # 2 years
true_alpha = 0.0003  # Strategy actually has 0.03% daily alpha
strategy_returns = np.random.normal(true_alpha, 0.01, n_days)

# Calculate t-statistic manually
mean_ret = np.mean(strategy_returns)
std_ret = np.std(strategy_returns, ddof=1)
n = len(strategy_returns)

t_stat = mean_ret / (std_ret / np.sqrt(n))

# Get p-value (two-tailed test); sf(x) = 1 - cdf(x) but is more accurate in the tails
p_value = 2 * stats.t.sf(abs(t_stat), df=n-1)

print("Testing H₀: Mean return = 0 (no alpha)")
print("="*50)
print(f"Sample mean return: {mean_ret:.4%} daily")
print(f"Annualized return: {mean_ret * 252:.2%}")
print(f"Sample std dev: {std_ret:.4%}")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\n✓ Result: REJECT H₀ at 5% significance level")
    print("  The strategy return is statistically significant!")
else:
    print("\n✗ Result: Cannot reject H₀")
    print("  Insufficient evidence that strategy has real alpha")

Testing H₀: Mean return = 0 (no alpha)
Sample mean return: -0.0054% daily
Annualized return: -1.36%
Sample std dev: 1.0018%

t-statistic: -0.121
p-value: 0.9039

✗ Result: Cannot reject H₀
  Insufficient evidence that strategy has real alpha
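
The manual calculation can be cross-checked against `scipy.stats.ttest_1samp`. This is a sketch on freshly simulated data, not the sample used above: the seed and the generator are assumptions made for reproducibility. The `alternative="greater"` argument, which assumes SciPy 1.6 or later, gives the one-sided test that matches an alternative hypothesis of positive alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # assumed seed, for reproducibility only
returns = rng.normal(0.0003, 0.01, 504)   # illustrative daily returns, not the data above

# Manual t-statistic: sample mean divided by its standard error
t_manual = returns.mean() / (returns.std(ddof=1) / np.sqrt(len(returns)))

# Library equivalents: two-sided test, then one-sided test (H1: mean > 0)
two_sided = stats.ttest_1samp(returns, popmean=0.0)
one_sided = stats.ttest_1samp(returns, popmean=0.0, alternative="greater")

print(f"manual t = {t_manual:.3f}, scipy t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}, one-sided p = {one_sided.pvalue:.4f}")
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so the choice of alternative hypothesis should be fixed before looking at the data.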


### Information Ratio and Significance

The **Information Ratio (IR)** is related to the t-statistic:

$$IR = \frac{\text{Excess Return}}{\text{Tracking Error}} = \frac{\bar{r}}{\sigma}$$

$$t = IR \times \sqrt{n}$$

This tells us: A small but consistent alpha can be significant with enough observations!

In [5]:
# How long to detect alpha?
daily_alpha = 0.0002  # 0.02% daily (about 5% annual)
daily_vol = 0.01      # 1% daily tracking error
IR = daily_alpha / daily_vol

print(f"Information Ratio: {IR:.4f} daily = {IR * np.sqrt(252):.2f} annualized")
print("\nHow many observations needed for significance?")
print("(Need t-stat > 1.96 for 5% significance)\n")

for years in [0.5, 1, 2, 3, 5]:
    n = int(years * 252)
    t_stat = IR * np.sqrt(n)
    significant = "✓" if t_stat > 1.96 else "✗"
    print(f"{years} years ({n} days): t = {t_stat:.2f} {significant}")

Information Ratio: 0.0200 daily = 0.32 annualized

How many observations needed for significance?
(Need t-stat > 1.96 for 5% significance)

0.5 years (126 days): t = 0.22 ✗
1 years (252 days): t = 0.32 ✗
2 years (504 days): t = 0.45 ✗
3 years (756 days): t = 0.55 ✗
5 years (1260 days): t = 0.71 ✗
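
None of the horizons above reaches significance, so it is natural to ask how long would be enough. Inverting $t = IR \times \sqrt{n}$ gives $n = (t^* / IR)^2$ for a critical value $t^*$. The sketch below applies this to the same alpha and tracking error; the 1.96 threshold and the 252 trading days per year are the conventions already used in this section.

```python
import numpy as np

daily_alpha = 0.0002   # 0.02% daily alpha, as above
daily_vol = 0.01       # 1% daily tracking error, as above
ir_daily = daily_alpha / daily_vol
t_critical = 1.96      # two-sided 5% threshold (large-sample approximation)

# Solve t_critical = IR * sqrt(n) for n, rounding up to whole days
n_required = int(np.ceil((t_critical / ir_daily) ** 2))
years_required = n_required / 252

print(f"Observations needed: {n_required} trading days (about {years_required:.0f} years)")
```

With these inputs the answer is on the order of 9,600 trading days, or about 38 years, which shows how hard it is to prove a modest alpha from daily returns alone.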


---

## 4. Linear Regression

### Ordinary Least Squares (OLS)

Linear regression finds the best-fit line:

$$Y = \alpha + \beta X + \epsilon$$

Where:
- $Y$ = dependent variable (e.g., stock return)
- $X$ = independent variable (e.g., market return)
- $\alpha$ = intercept (Jensen's alpha when both returns are measured in excess of the risk-free rate)
- $\beta$ = slope coefficient (beta in CAPM)
- $\epsilon$ = error term

### OLS Formulas

**Slope (Beta)**:
$$\beta = \frac{Cov(X, Y)}{Var(X)} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}$$

**Intercept (Alpha)**:
$$\alpha = \bar{Y} - \beta \bar{X}$$

### R-squared

Measures how much of Y's variance is explained by X:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$$

- $R^2 = 1$: Perfect fit (all residuals are zero)
- $R^2 = 0$: Model explains nothing (no better than predicting $\bar{Y}$ for every observation)

In [6]:
# Example: Calculate stock's beta vs market
np.random.seed(42)

# Generate market returns
n_days = 252
market_returns = np.random.normal(0.0004, 0.012, n_days)

# Stock with beta = 1.3 and alpha = 0.02% daily
true_beta = 1.3
true_alpha = 0.0002
noise = np.random.normal(0, 0.008, n_days)
stock_returns = true_alpha + true_beta * market_returns + noise

# Calculate beta manually
covariance = np.cov(market_returns, stock_returns)[0, 1]
market_variance = np.var(market_returns, ddof=1)

beta_calc = covariance / market_variance
alpha_calc = np.mean(stock_returns) - beta_calc * np.mean(market_returns)

# Calculate R-squared
predicted = alpha_calc + beta_calc * market_returns
ss_res = np.sum((stock_returns - predicted)**2)
ss_tot = np.sum((stock_returns - np.mean(stock_returns))**2)
r_squared = 1 - ss_res / ss_tot

print("OLS Regression: Stock Returns vs Market Returns")
print("="*50)
print(f"True parameters: α = {true_alpha:.4%}, β = {true_beta:.2f}")
print(f"Estimated:       α = {alpha_calc:.4%}, β = {beta_calc:.2f}")
print(f"\nR-squared: {r_squared:.4f}")
print(f"→ {r_squared*100:.1f}% of stock's variance explained by market")

OLS Regression: Stock Returns vs Market Returns
True parameters: α = 0.0200%, β = 1.30
Estimated:       α = 0.0397%, β = 1.32

R-squared: 0.7838
→ 78.4% of stock's variance explained by market
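
The manual estimates can be compared with `scipy.stats.linregress`, which also returns the standard error of the slope. That standard error makes it possible to test whether beta differs from 1, one of the questions raised in the hypothesis testing section. The sketch below uses newly simulated data with an assumed seed, so its numbers will differ from the output above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                  # assumed seed, illustrative data only
market = rng.normal(0.0004, 0.012, 252)
stock = 0.0002 + 1.3 * market + rng.normal(0, 0.008, 252)

# Library OLS fit: x (market) first, then y (stock)
fit = stats.linregress(market, stock)

# Manual slope, Cov(X, Y) / Var(X), for comparison
beta_manual = np.cov(market, stock)[0, 1] / np.var(market, ddof=1)

# t-test of H0: beta = 1, with n - 2 degrees of freedom
t_beta = (fit.slope - 1.0) / fit.stderr
p_beta = 2 * stats.t.sf(abs(t_beta), df=len(market) - 2)

print(f"beta: scipy = {fit.slope:.4f}, manual = {beta_manual:.4f}")
print(f"alpha = {fit.intercept:.4%}, R-squared = {fit.rvalue**2:.4f}")
print(f"H0: beta = 1 -> t = {t_beta:.2f}, p = {p_beta:.4f}")
```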


---

## 5. Covariance and Correlation Matrices

### Why Matrices?
With multiple assets, we need to track all pairwise relationships. This is essential for:
- Portfolio optimization
- Risk management
- Factor models

### Covariance Matrix

For assets A, B, C:

$$\Sigma = \begin{bmatrix} \sigma_A^2 & Cov(A,B) & Cov(A,C) \\ Cov(B,A) & \sigma_B^2 & Cov(B,C) \\ Cov(C,A) & Cov(C,B) & \sigma_C^2 \end{bmatrix}$$

**Properties**:
- Symmetric: $Cov(A,B) = Cov(B,A)$
- Diagonal = variances
- Off-diagonal = covariances
- Must be positive semi-definite, so that every portfolio variance $w^T \Sigma w$ is non-negative
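
These properties are easy to verify in code. The sketch below uses a hypothetical 3-asset covariance matrix (the numbers are illustrative, not estimated from data) and checks symmetry and positive semi-definiteness through the eigenvalues, allowing a small tolerance for floating-point error.

```python
import numpy as np

# Hypothetical daily covariance matrix for three assets (illustrative values only)
cov = np.array([
    [4.0e-4, 1.0e-4, 0.0],
    [1.0e-4, 1.0e-4, 0.0],
    [0.0,    0.0,    1.0e-4],
])

is_symmetric = np.allclose(cov, cov.T)

# eigvalsh is intended for symmetric matrices and returns real eigenvalues
eigenvalues = np.linalg.eigvalsh(cov)
is_psd = bool(np.all(eigenvalues >= -1e-12))   # tolerance is an assumption

print(f"Symmetric: {is_symmetric}")
print(f"Eigenvalues: {eigenvalues}")
print(f"Positive semi-definite: {is_psd}")
```

A matrix estimated directly from a complete return history passes this check by construction. Matrices assembled piece by piece, for example from pairwise correlations on different date ranges, may not.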

### Correlation Matrix

Normalized covariance:

$$\rho_{ij} = \frac{Cov(i,j)}{\sigma_i \sigma_j}$$

$$P = \begin{bmatrix} 1 & \rho_{AB} & \rho_{AC} \\ \rho_{BA} & 1 & \rho_{BC} \\ \rho_{CA} & \rho_{CB} & 1 \end{bmatrix}$$

**Property**: Diagonal is always 1 (asset perfectly correlated with itself)

In [7]:
# Create covariance and correlation matrices
np.random.seed(42)

# Simulate 3 correlated assets
n_days = 252
market = np.random.normal(0, 0.012, n_days)

# Different exposures to market
tech_stock = 1.5 * market + np.random.normal(0, 0.008, n_days)  # High beta tech
utility = 0.5 * market + np.random.normal(0, 0.005, n_days)     # Low beta utility
gold = np.random.normal(0, 0.01, n_days)                        # Uncorrelated

# Create DataFrame
returns_df = pd.DataFrame({
    'Tech': tech_stock,
    'Utility': utility,
    'Gold': gold
})

# Covariance matrix
cov_matrix = returns_df.cov()
print("Covariance Matrix (×10,000 for readability):")
print((cov_matrix * 10000).round(4))

# Correlation matrix
corr_matrix = returns_df.corr()
print("\nCorrelation Matrix:")
print(corr_matrix.round(3))

print("\nInterpretation:")
print(f"• Tech & Utility: ρ = {corr_matrix.loc['Tech', 'Utility']:.2f} (both exposed to market)")
print(f"• Tech & Gold: ρ = {corr_matrix.loc['Tech', 'Gold']:.2f} (diversification benefit!)")

Covariance Matrix (×10,000 for readability):
           Tech  Utility    Gold
Tech     3.7362   1.0619  0.0283
Utility  1.0619   0.6393  0.0158
Gold     0.0283   0.0158  0.8757

Correlation Matrix:
          Tech  Utility   Gold
Tech     1.000    0.687  0.016
Utility  0.687    1.000  0.021
Gold     0.016    0.021  1.000

Interpretation:
• Tech & Utility: ρ = 0.69 (both exposed to market)
• Tech & Gold: ρ = 0.02 (diversification benefit!)
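
The normalization $\rho_{ij} = Cov(i,j) / (\sigma_i \sigma_j)$ can also be applied directly to a covariance matrix. As a sketch, the block below takes the covariance values printed above (still scaled by 10,000, which cancels in the ratio) and rebuilds the correlation matrix by dividing by the outer product of the standard deviations.

```python
import numpy as np

# Covariance matrix as printed above (x10,000); order: Tech, Utility, Gold
cov_scaled = np.array([
    [3.7362, 1.0619, 0.0283],
    [1.0619, 0.6393, 0.0158],
    [0.0283, 0.0158, 0.8757],
])

# Standard deviations are the square roots of the diagonal
stds = np.sqrt(np.diag(cov_scaled))

# Divide each Cov(i, j) by sigma_i * sigma_j
corr = cov_scaled / np.outer(stds, stds)

print(np.round(corr, 3))
```

Up to rounding of the printed inputs, the result should reproduce the correlation matrix from `returns_df.corr()`.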


### Portfolio Variance Using Covariance Matrix

For a portfolio with weights $w = [w_1, w_2, ..., w_n]$:

$$\sigma_p^2 = w^T \Sigma w = \sum_{i}\sum_{j} w_i w_j \sigma_{ij}$$

This is the fundamental formula for portfolio risk!

In [8]:
# Calculate portfolio risk using covariance matrix
weights = np.array([0.5, 0.3, 0.2])  # 50% Tech, 30% Utility, 20% Gold

# Portfolio variance: w^T * Σ * w
port_variance = weights @ cov_matrix.values @ weights
port_std = np.sqrt(port_variance)

# Compare to weighted average of individual volatilities (if perfectly correlated)
individual_stds = np.sqrt(np.diag(cov_matrix.values))
max_possible_std = weights @ individual_stds

print(f"Portfolio weights: Tech={weights[0]:.0%}, Utility={weights[1]:.0%}, Gold={weights[2]:.0%}")
print(f"\nIndividual asset volatilities (daily):")
for name, std in zip(returns_df.columns, individual_stds):
    print(f"  {name}: {std:.4%}")

print(f"\nPortfolio volatility: {port_std:.4%}")
print(f"If perfectly correlated: {max_possible_std:.4%}")
print(f"\nDiversification benefit: {(max_possible_std - port_std)/max_possible_std:.1%} risk reduction!")

Portfolio weights: Tech=50%, Utility=30%, Gold=20%

Individual asset volatilities (daily):
  Tech: 1.9329%
  Utility: 0.7995%
  Gold: 0.9358%

Portfolio volatility: 1.1631%
If perfectly correlated: 1.3935%

Diversification benefit: 16.5% risk reduction!
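
To make the link between the matrix form and the double sum explicit, the sketch below computes the portfolio variance both ways. It uses the covariance values printed earlier (rescaled back from ×10,000) and the same 50/30/20 weights, so small rounding differences from the full-precision result are expected.

```python
import numpy as np

# Printed covariance matrix, rescaled to raw units; order: Tech, Utility, Gold
cov = np.array([
    [3.7362, 1.0619, 0.0283],
    [1.0619, 0.6393, 0.0158],
    [0.0283, 0.0158, 0.8757],
]) * 1e-4
w = np.array([0.5, 0.3, 0.2])

# Matrix form: w^T Sigma w
var_matrix = w @ cov @ w

# Explicit double sum over all pairs (i, j)
n = len(w)
var_double_sum = sum(w[i] * w[j] * cov[i, j] for i in range(n) for j in range(n))

port_std = np.sqrt(var_matrix)
print(f"Matrix form: {var_matrix:.8f}, double sum: {var_double_sum:.8f}")
print(f"Portfolio volatility: {port_std:.4%}")
```

The double sum contains each off-diagonal covariance twice, once as $(i, j)$ and once as $(j, i)$, which is why correlations between assets matter so much for portfolio risk.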


---

## Summary: Week 2 Key Formulas

| Concept | Formula |
|---------|--------|
| Normal PDF | $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ |
| Standardization | $Z = \frac{X - \mu}{\sigma}$ |
| t-statistic | $t = \frac{\bar{r}}{s / \sqrt{n}}$ |
| OLS Beta | $\beta = \frac{Cov(X,Y)}{Var(X)}$ |
| R-squared | $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ |
| Portfolio Variance | $\sigma_p^2 = w^T \Sigma w$ |

---

*Next Week: Time Series Analysis*