# Week 3: Time Series Analysis for Finance

---

## 🎯 What You'll Learn This Week

Time series analysis is the bread and butter of quant finance. Stock prices, interest rates, volatility - they're ALL time series.

**By the end of this week, you'll understand:**
- How to identify trends and patterns in price data
- Why "stationarity" is crucial (and what it means)
- How past prices relate to future prices (autocorrelation)
- ARIMA models - the classic forecasting tool
- Pairs trading foundations (cointegration)

**Why This Matters:**
- **Every trading strategy** deals with time series data
- **Stationarity** determines which models you can use
- **ARIMA** still comes up in quant interviews!
- **Cointegration** is the foundation of statistical arbitrage

---

## Table of Contents
1. Time Series Components
2. Stationarity
3. Autocorrelation
4. ARIMA Models
5. Cointegration

---

In [1]:
# Standard imports and data loading
import numpy as np
import pandas as pd
import yfinance as yf
from datetime import datetime, timedelta

# Standard 5 equities for analysis
tickers = ['AAPL', 'MSFT', 'GOOGL', 'JPM', 'GS']

# Fetch 5 years of data
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

print("📥 Downloading market data...")
data = yf.download(tickers, start=start_date, end=end_date, progress=False, auto_adjust=True)
prices = data['Close'].dropna()
returns = prices.pct_change().dropna()
print(f"✅ Loaded {len(prices)} days of data for {len(tickers)} tickers")
print(f"📅 Date range: {prices.index[0].strftime('%Y-%m-%d')} to {prices.index[-1].strftime('%Y-%m-%d')}")
print(prices.tail())

📥 Downloading market data...
✅ Loaded 1255 days of data for 5 tickers
📅 Date range: 2021-01-25 to 2026-01-22
Ticker            AAPL       GOOGL          GS         JPM        MSFT
Date                                                                  
2026-01-15  258.209991  332.779999  975.859985  309.260010  456.660004
2026-01-16  255.529999  330.000000  962.000000  312.470001  459.859985
2026-01-20  246.699997  322.000000  943.369995  302.739990  454.519989
2026-01-21  247.649994  328.380005  953.010010  302.040009  444.109985
2026-01-22  249.789993  331.410004  965.700012  306.440002  449.820099


## 1. Time Series Components

### 🤔 What Makes Up a Time Series?

Think of a stock price chart. You might see:
- **Upward slope** over 10 years → That's the TREND
- **Higher in January** every year → That's SEASONALITY
- **Random ups and downs** → That's NOISE

A common additive model breaks a time series into these pieces, plus a slower cyclical term:

$$Y_t = T_t + S_t + C_t + \epsilon_t$$

**In Plain English:**
| Symbol | Component | What It Is | Example |
|--------|-----------|-----------|---------|
| $T_t$ | Trend | Long-term direction | S&P 500 going up over decades |
| $S_t$ | Seasonality | Predictable patterns | Retail stocks up in December |
| $C_t$ | Cyclical | Economic cycles | Bull/bear markets |
| $\epsilon_t$ | Residual | Random noise | Day-to-day fluctuations |

### Why Decompose?
- Identify underlying patterns
- Separate signal from noise
- Build better forecasting models

In [2]:
import numpy as np
import pandas as pd
from scipy import stats

# Create a time series with known components
np.random.seed(42)
n_days = 500
t = np.arange(n_days)

# Components
trend = 0.1 * t                                    # Upward trend
seasonality = 10 * np.sin(2 * np.pi * t / 252)     # Yearly cycle
noise = np.random.normal(0, 5, n_days)             # Random noise

# Combined series (like stock price)
price = 100 + trend + seasonality + noise

# Create DataFrame
# Business-day index, so the 252-day cycle lines up with one trading year
dates = pd.date_range('2020-01-01', periods=n_days, freq='B')
df = pd.DataFrame({'Price': price}, index=dates)

print("Simulated Price Series with:")
print(f"• Trend: +0.1 per day (upward drift)")
print(f"• Seasonality: 252-day cycle (annual)")
print(f"• Noise: Normal(0, 5)")
print(f"\nFirst 10 prices:")
print(df.head(10)['Price'].values.round(2))

Simulated Price Series with:
• Trend: +0.1 per day (upward drift)
• Seasonality: 252-day cycle (annual)
• Noise: Normal(0, 5)

First 10 prices:
[102.48  99.66 103.94 108.66 100.22 100.57 109.99 106.27 100.43 105.84]
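Because this series was built from known pieces, one way to see decomposition at work is to try to recover them. The sketch below is not part of the original notebook: it assumes the trend is linear and the seasonal period (252) is known in advance, and fits both by ordinary least squares with NumPy. Real price data rarely satisfies either assumption so cleanly.

```python
import numpy as np

# Rebuild the simulated series with the same seed and parameters as the cell above
np.random.seed(42)
n_days = 500
t = np.arange(n_days)
price = 100 + 0.1 * t + 10 * np.sin(2 * np.pi * t / 252) + np.random.normal(0, 5, n_days)

# Design matrix: intercept, linear trend, and a sine/cosine pair at the assumed period
angle = 2 * np.pi * t / 252
X = np.column_stack([np.ones(n_days), t, np.sin(angle), np.cos(angle)])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
level, slope, sin_coef, cos_coef = coef

# Split the fitted values into the components of Y_t = T_t + S_t + eps_t
trend_hat = level + slope * t
seasonal_hat = sin_coef * np.sin(angle) + cos_coef * np.cos(angle)
residual = price - trend_hat - seasonal_hat
amplitude = np.hypot(sin_coef, cos_coef)

print(f"Estimated trend slope:        {slope:.3f} (true 0.1)")
print(f"Estimated seasonal amplitude: {amplitude:.2f} (true 10)")
print(f"Residual std:                 {residual.std():.2f} (true 5)")
```

The estimates should land close to the values used to build the series, with the leftover residual playing the role of $\epsilon_t$.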


---

## 2. Stationarity

### What is Stationarity?

A time series is **stationary** if its statistical properties don't change over time:

**Strict Stationarity**: The joint distribution of $(Y_t, Y_{t+1}, ..., Y_{t+k})$ is the same for all $t$

**Weak Stationarity** (more practical):
1. Constant mean: $E[Y_t] = \mu$ for all $t$
2. Constant variance: $Var(Y_t) = \sigma^2$ for all $t$  
3. Covariance depends only on lag: $Cov(Y_t, Y_{t+k}) = f(k)$, not $t$

### Why Does It Matter?
- Most statistical models assume stationarity
- Non-stationary data can lead to **spurious correlations**
- Forecasting non-stationary series is unreliable
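The spurious-correlation point can be checked with a small simulation. This is an illustrative sketch, not part of the original notebook: the seed, sample size, number of trials, and the 0.5 cutoff are all arbitrary choices. It draws pairs of random walks from completely independent shocks and compares how often the levels look strongly correlated versus how often the differences do.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_trials = 500, 200

level_corrs, diff_corrs = [], []
for _ in range(n_trials):
    # Two random walks driven by independent shocks: no true relationship
    x = np.cumsum(rng.normal(size=n_obs))
    y = np.cumsum(rng.normal(size=n_obs))
    level_corrs.append(np.corrcoef(x, y)[0, 1])                    # non-stationary levels
    diff_corrs.append(np.corrcoef(np.diff(x), np.diff(y))[0, 1])   # stationary differences

frac_levels = np.mean(np.abs(level_corrs) > 0.5)
frac_diffs = np.mean(np.abs(diff_corrs) > 0.5)

print(f"Share of pairs with |corr| > 0.5 (levels):      {frac_levels:.0%}")
print(f"Share of pairs with |corr| > 0.5 (differences): {frac_diffs:.0%}")
```

A sizeable share of unrelated random walks appear strongly correlated in levels, while the differenced series almost never do.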

### Stock Prices vs Returns
- **Prices**: Non-stationary (they trend, and their variance grows with the horizon)
- **Returns**: Usually close to stationary (stable mean near zero, although real returns show volatility clustering)

In [3]:
# Demonstrate stationarity: prices vs returns
np.random.seed(42)

# Random walk (non-stationary)
# Note: `prices` and `returns` are rebound here, replacing the downloaded data from In [1]
steps = np.random.normal(0, 1, 1000)
prices = 100 * np.exp(np.cumsum(steps * 0.01))  # Geometric random walk

# Returns (stationary)
returns = np.diff(np.log(prices))

# Compare statistics for first half vs second half
mid = len(prices) // 2

print("Testing Stationarity: Prices vs Returns")
print("="*50)
print("\nPRICES (Non-Stationary):")
print(f"  First half - Mean: {prices[:mid].mean():.2f}, Std: {prices[:mid].std():.2f}")
print(f"  Second half - Mean: {prices[mid:].mean():.2f}, Std: {prices[mid:].std():.2f}")
print("  → Mean and variance change over time!")

print("\nRETURNS (Stationary):")
print(f"  First half - Mean: {returns[:mid-1].mean():.6f}, Std: {returns[:mid-1].std():.4f}")
print(f"  Second half - Mean: {returns[mid-1:].mean():.6f}, Std: {returns[mid-1:].std():.4f}")
print("  → Mean and variance are stable!")

Testing Stationarity: Prices vs Returns

PRICES (Non-Stationary):
  First half - Mean: 98.88, Std: 6.96
  Second half - Mean: 101.36, Std: 12.27
  → Mean and variance change over time!

RETURNS (Stationary):
  First half - Mean: 0.000059, Std: 0.0098
  Second half - Mean: 0.000318, Std: 0.0098
  → Mean and variance are stable!


### Augmented Dickey-Fuller Test

A formal test for a unit root (non-stationarity). It is based on the regression:

$$\Delta Y_t = \alpha + \beta t + \gamma Y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta Y_{t-i} + \epsilon_t$$

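- **H₀**: $\gamma = 0$ (unit root exists, non-stationary)
- **H₁**: $\gamma < 0$ (stationary)

If p-value < 0.05: Reject H₀ → Series is stationary. A larger p-value means the test could not rule out a unit root, not that one is proven. (The `adfuller` default used below, `regression='c'`, keeps the constant $\alpha$ but drops the $\beta t$ trend term.)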

In [4]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series, name):
    """Perform Augmented Dickey-Fuller test"""
    result = adfuller(series.dropna(), autolag='AIC')
    print(f"{name}:")
    print(f"  ADF Statistic: {result[0]:.4f}")
    print(f"  p-value: {result[1]:.4f}")
    if result[1] < 0.05:
        print("  ✓ Stationary (reject H₀)")
    else:
        print("  ✗ Non-stationary (cannot reject H₀)")
    print()

# Test both series
print("Augmented Dickey-Fuller Test Results:")
print("="*50)
adf_test(pd.Series(prices), "Stock Prices")
adf_test(pd.Series(returns), "Log Returns")

Augmented Dickey-Fuller Test Results:
Stock Prices:
  ADF Statistic: -0.8371
  p-value: 0.8080
  ✗ Non-stationary (cannot reject H₀)

Log Returns:
  ADF Statistic: -31.7893
  p-value: 0.0000
  ✓ Stationary (reject H₀)



### Making Series Stationary

**Differencing**: Remove trend by taking differences
$$Y'_t = Y_t - Y_{t-1}$$

For log prices, first difference = log return:
$$r_t = \log(P_t) - \log(P_{t-1}) = \log\left(\frac{P_t}{P_{t-1}}\right)$$
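As a quick numerical check of the identity above, the sketch below uses a handful of made-up prices (`toy_prices` is purely illustrative) and confirms that differencing log prices gives the same numbers as taking the log of the price ratio. It also shows a handy consequence: log returns add up across time.

```python
import numpy as np

# Hypothetical prices, chosen only for illustration
toy_prices = np.array([100.0, 102.0, 101.0, 105.0])

diff_form = np.diff(np.log(toy_prices))                  # log(P_t) - log(P_{t-1})
ratio_form = np.log(toy_prices[1:] / toy_prices[:-1])    # log(P_t / P_{t-1})

# Summing the log returns recovers the log return over the whole period
total_log_return = diff_form.sum()
whole_period = np.log(toy_prices[-1] / toy_prices[0])

print("Log returns:", diff_form.round(5))
print("Forms agree:", np.allclose(diff_form, ratio_form))
print("Sum equals whole-period log return:", np.isclose(total_log_return, whole_period))
```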

---

## 3. Autocorrelation

### Definition

Autocorrelation measures how correlated a series is with its own past values.

**Autocorrelation at lag k**:
$$\rho_k = \frac{Cov(Y_t, Y_{t-k})}{Var(Y_t)} = \frac{\sum_{t=k+1}^{T}(Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{T}(Y_t - \bar{Y})^2}$$

### Properties
- $\rho_0 = 1$ (series perfectly correlated with itself)
- $-1 \leq \rho_k \leq 1$
- For white noise: $\rho_k \approx 0$ for $k > 0$
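The sample formula above translates almost line for line into NumPy. The helper below (`sample_acf` is a name made up for this sketch) is a minimal illustration rather than a replacement for a library routine. It is applied to simulated white noise, where every lag beyond zero should sit near zero.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation for lags 0..max_lag, following the formula above."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    denom = np.sum(dev ** 2)
    # Numerator pairs each deviation with the deviation k steps earlier
    return np.array([np.sum(dev[k:] * dev[:len(dev) - k]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
white_noise = rng.normal(size=1000)
rho = sample_acf(white_noise, max_lag=5)
band = 1.96 / np.sqrt(len(white_noise))  # approximate 95% band under white noise

for k, r in enumerate(rho):
    print(f"Lag {k}: rho = {r:+.3f}")
print(f"Approximate 95% band: +/-{band:.3f}")
```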

### Partial Autocorrelation (PACF)

Measures correlation at lag $k$ **after removing** the effect of intermediate lags.

Useful for identifying the order of AR models.
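One way to make "removing the effect of intermediate lags" concrete: the PACF at lag $k$ can be estimated as the last coefficient when $Y_t$ is regressed on its first $k$ lags. The sketch below takes that regression route with plain least squares. `pacf_ols` is a hypothetical helper, and library implementations may use other estimators that give slightly different numbers.

```python
import numpy as np

def pacf_ols(y, k):
    """PACF at lag k: coefficient on Y_{t-k} when Y_t is regressed on lags 1..k."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    target = y[k:]
    # Column j-1 holds the series lagged by j periods, aligned with `target`
    lags = np.column_stack([y[k - j:len(y) - j] for j in range(1, k + 1)])
    coef, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return coef[-1]

# Simulated AR(1) with phi = 0.7 (seed and length are arbitrary)
rng = np.random.default_rng(1)
n, phi = 2000, 0.7
series = np.zeros(n)
for t in range(1, n):
    series[t] = phi * series[t - 1] + rng.normal()

pacf_1 = pacf_ols(series, 1)
pacf_2 = pacf_ols(series, 2)
pacf_3 = pacf_ols(series, 3)
print(f"PACF lag 1: {pacf_1:+.3f}")
print(f"PACF lag 2: {pacf_2:+.3f}")
print(f"PACF lag 3: {pacf_3:+.3f}")
```

For an AR(1) process, only the lag-1 value should be clearly away from zero, which is the cut-off pattern used to pick the AR order.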

In [5]:
from statsmodels.tsa.stattools import acf, pacf

# Generate AR(1) process for illustration
# Y_t = 0.7 * Y_{t-1} + noise
np.random.seed(42)
n = 500
phi = 0.7  # AR coefficient

ar1_series = np.zeros(n)
ar1_series[0] = np.random.normal()
for t in range(1, n):
    ar1_series[t] = phi * ar1_series[t-1] + np.random.normal()

# Calculate ACF
acf_values = acf(ar1_series, nlags=10)
pacf_values = pacf(ar1_series, nlags=10)

print("AR(1) Process with φ = 0.7")
print("="*50)
print("\nAutocorrelation Function (ACF):")
print("Theoretical: ρ_k = φ^k")
for k in range(6):
    theoretical = phi**k
    print(f"  Lag {k}: Actual = {acf_values[k]:.3f}, Theoretical = {theoretical:.3f}")

print("\nPartial Autocorrelation (PACF):")
print("For AR(1): Only lag 1 should be significant")
for k in range(4):
    print(f"  Lag {k}: {pacf_values[k]:.3f}")

AR(1) Process with φ = 0.7

Autocorrelation Function (ACF):
Theoretical: ρ_k = φ^k
  Lag 0: Actual = 1.000, Theoretical = 1.000
  Lag 1: Actual = 0.683, Theoretical = 0.700
  Lag 2: Actual = 0.461, Theoretical = 0.490
  Lag 3: Actual = 0.306, Theoretical = 0.343
  Lag 4: Actual = 0.180, Theoretical = 0.240
  Lag 5: Actual = 0.138, Theoretical = 0.168

Partial Autocorrelation (PACF):
For AR(1): Only lag 1 should be significant
  Lag 0: 1.000
  Lag 1: 0.685
  Lag 2: -0.012
  Lag 3: -0.009


### Autocorrelation in Returns

**Efficient Market Hypothesis** implies:
- Returns should have zero autocorrelation
- Past returns cannot predict future returns

**Reality**: Small positive autocorrelation at short lags (momentum), possible negative at longer lags (mean reversion)

In [6]:
# Check autocorrelation in stock returns
np.random.seed(42)
# Simulate returns with slight autocorrelation (momentum)
n = 1000
momentum_returns = np.zeros(n)
for t in range(1, n):
    momentum_returns[t] = 0.05 * momentum_returns[t-1] + np.random.normal(0, 0.01)

acf_returns = acf(momentum_returns, nlags=10)

print("Autocorrelation of Returns")
print("="*50)
print("\nIf markets are efficient, ACF should be ~0")
print("Significance threshold: ±", round(1.96/np.sqrt(n), 3))
print("\n Lag | ACF    | Significant?")
print("-" * 30)
threshold = 1.96 / np.sqrt(n)
for k in range(1, 6):
    sig = "Yes" if abs(acf_returns[k]) > threshold else "No"
    print(f"  {k}  | {acf_returns[k]:+.4f} | {sig}")

Autocorrelation of Returns

If markets are efficient, ACF should be ~0
Significance threshold: ± 0.062

 Lag | ACF    | Significant?
------------------------------
  1  | +0.0428 | No
  2  | +0.0022 | No
  3  | +0.0126 | No
  4  | -0.0520 | No
  5  | +0.0245 | No


---

## 4. ARIMA Models

### Components

**ARIMA(p, d, q)** combines:
- **AR(p)**: Autoregressive (past values)
- **I(d)**: Integrated (differencing order)
- **MA(q)**: Moving Average (past errors)

### AR(p) - Autoregressive

Current value depends on past values:

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + ... + \phi_p Y_{t-p} + \epsilon_t$$

- AR(1): $Y_t = c + \phi Y_{t-1} + \epsilon_t$
- Stationary if $|\phi| < 1$
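When $|\phi| < 1$, a stationary AR(1) settles around a long-run mean of $c / (1 - \phi)$ with variance $\sigma^2 / (1 - \phi^2)$. These two formulas are standard results not stated in the notebook. The simulation below is a rough check, with parameter values chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
c, phi, sigma, n = 1.0, 0.7, 1.0, 20000

y = np.empty(n)
y[0] = c / (1 - phi)  # start at the long-run mean to avoid a warm-up period
for t in range(1, n):
    y[t] = c + phi * y[t - 1] + rng.normal(0, sigma)

theory_mean = c / (1 - phi)
theory_var = sigma ** 2 / (1 - phi ** 2)

print(f"Mean:     simulated {y.mean():.3f} vs theory {theory_mean:.3f}")
print(f"Variance: simulated {y.var():.3f} vs theory {theory_var:.3f}")
```

As $\phi$ approaches 1, both expressions blow up, which is another way to see why the unit-root case ($\phi = 1$) is non-stationary.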

### MA(q) - Moving Average

Current value depends on past errors:

$$Y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + ... + \theta_q \epsilon_{t-q}$$

- Always stationary (for any finite $q$)
- A shock affects the series for only $q$ periods after it occurs
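The limited memory of an MA process shows up directly in its autocorrelations. For an MA(1), the lag-1 autocorrelation is $\theta / (1 + \theta^2)$ and all higher lags are zero. That is a standard result, not one derived in this notebook. The sketch below simulates an MA(1) with an arbitrary $\theta = 0.6$ and checks both claims; `acf_at` is a small helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n = 0.6, 20000

eps = rng.normal(size=n + 1)
y = eps[1:] + theta * eps[:-1]  # MA(1) with mu = 0

def acf_at(series, k):
    """Sample autocorrelation of `series` at lag k."""
    dev = series - series.mean()
    return np.sum(dev[k:] * dev[:len(dev) - k]) / np.sum(dev ** 2)

rho1_theory = theta / (1 + theta ** 2)
rho1, rho2, rho3 = acf_at(y, 1), acf_at(y, 2), acf_at(y, 3)

print(f"Lag 1: simulated {rho1:+.3f} vs theory {rho1_theory:+.3f}")
print(f"Lag 2: simulated {rho2:+.3f} vs theory +0.000")
print(f"Lag 3: simulated {rho3:+.3f} vs theory +0.000")
```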

### ARMA(p,q)

Combines both:

$$Y_t = c + \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$$

### ARIMA(p,d,q)

For non-stationary data, difference $d$ times first, then apply ARMA.
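The "difference first, then fit" recipe can be sketched by hand for the simplest case, ARIMA(1,1,0). This example is an assumption-laden illustration: it simulates its own data, and it estimates the AR coefficient by least squares on the differenced series rather than by the maximum-likelihood routine a library such as statsmodels would use.

```python
import numpy as np

rng = np.random.default_rng(3)
n, phi = 2000, 0.5

# Build an ARIMA(1,1,0) series: its first difference follows an AR(1)
increments = np.zeros(n)
for t in range(1, n):
    increments[t] = phi * increments[t - 1] + rng.normal()
level = 100 + np.cumsum(increments)  # integrating once makes the level non-stationary

# Step 1 (the "I" in ARIMA): difference d = 1 times
dy = np.diff(level)

# Step 2: fit AR(1) to the differenced series by least squares (no intercept)
phi_hat = np.dot(dy[1:], dy[:-1]) / np.dot(dy[:-1], dy[:-1])

print(f"True phi: {phi}, estimated from differenced data: {phi_hat:.3f}")
```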

In [7]:
from statsmodels.tsa.arima.model import ARIMA

# Generate data from known AR(2) process
np.random.seed(42)
n = 500
true_ar = [0.5, 0.3]  # AR coefficients

ar2_data = np.zeros(n)
for t in range(2, n):
    ar2_data[t] = true_ar[0]*ar2_data[t-1] + true_ar[1]*ar2_data[t-2] + np.random.normal(0, 1)

# Fit ARIMA model
model = ARIMA(ar2_data, order=(2, 0, 0))
fitted = model.fit()

print("ARIMA Model Estimation")
print("="*50)
print(f"True AR coefficients: φ₁ = {true_ar[0]}, φ₂ = {true_ar[1]}")
print(f"\nEstimated:")
print(f"  φ₁ = {fitted.params[1]:.4f}")
print(f"  φ₂ = {fitted.params[2]:.4f}")
print(f"\nAIC: {fitted.aic:.2f}")
print(f"BIC: {fitted.bic:.2f}")

ARIMA Model Estimation
True AR coefficients: φ₁ = 0.5, φ₂ = 0.3

Estimated:
  φ₁ = 0.4941
  φ₂ = 0.2808

AIC: 1404.49
BIC: 1421.35


### Model Selection

**ACF/PACF patterns**:

| Model | ACF | PACF |
|-------|-----|------|
| AR(p) | Decays exponentially | Cuts off after lag p |
| MA(q) | Cuts off after lag q | Decays exponentially |
| ARMA(p,q) | Decays | Decays |

**Information Criteria**:
- **AIC** (Akaike): $-2\ln(L) + 2k$
- **BIC** (Bayesian): $-2\ln(L) + k\ln(n)$

Lower is better. BIC penalizes complexity more.
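To see the two criteria side by side, the sketch below fits AR(1) through AR(4) to simulated AR(2) data and computes AIC and BIC from the formulas above. It is a simplified illustration: the models are fit by least squares with no intercept, the likelihood is the Gaussian one conditional on the first few observations, and $k$ counts the AR coefficients plus the noise variance. Library values will differ somewhat because they use the exact likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

max_p = 4
target = y[max_p:]       # same estimation sample for every order, so criteria are comparable
n_eff = len(target)

results = {}
for p in range(1, max_p + 1):
    # Columns are lags 1..p, aligned with `target`
    X = np.column_stack([y[max_p - j:n - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    sigma2 = np.mean(resid ** 2)
    log_lik = -0.5 * n_eff * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian log-likelihood at the fitted values
    k = p + 1                                                   # AR coefficients plus noise variance
    results[p] = {"aic": -2 * log_lik + 2 * k,
                  "bic": -2 * log_lik + k * np.log(n_eff)}

for p, crit in results.items():
    print(f"AR({p}): AIC = {crit['aic']:.1f}, BIC = {crit['bic']:.1f}")
```

Moving from AR(1) to AR(2) should lower both criteria sharply, while the extra lags in AR(3) and AR(4) buy little fit and are penalized, more heavily by BIC.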

---

## 5. Cointegration

### The Problem

Two non-stationary series might have a **spurious correlation**.
But sometimes, they move together in a meaningful way!

### Definition

Two I(1) series $X_t$ and $Y_t$ are **cointegrated** if there exists $\beta$ such that:

$$Z_t = Y_t - \beta X_t \sim I(0)$$

The **spread** $Z_t$ is stationary, even though $X_t$ and $Y_t$ individually are not.

### Finance Example

Consider two stocks in the same sector:
- Each stock price: Non-stationary (random walk)
- Price ratio or spread: May be stationary (mean-reverting)

This is the foundation of **pairs trading**!

In [8]:
# Demonstrate cointegration with pairs trading example
np.random.seed(42)
n = 500

# Common factor (non-stationary)
common_factor = np.cumsum(np.random.normal(0, 1, n))

# Two cointegrated "stocks"
stock_A = 50 + common_factor + np.random.normal(0, 0.5, n)  # Price A
stock_B = 30 + 0.6 * common_factor + np.random.normal(0, 0.5, n)  # Price B (with different sensitivity)

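# Spread (should be stationary)
# Hedge ratio from regressing A on B. np.cov uses ddof=1 and np.var uses ddof=0,
# so this is n/(n-1) times the exact OLS slope (negligible at n = 500).
beta = np.cov(stock_A, stock_B)[0,1] / np.var(stock_B)
spread = stock_A - beta * stock_B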

print("Cointegration Example: Two 'Related' Stocks")
print("="*50)
print(f"Hedge ratio (β): {beta:.4f}")
print(f"\nSpread = Stock_A - {beta:.2f} × Stock_B")

# Test stationarity of spread
adf_result = adfuller(spread)
print(f"\nADF test on spread:")
print(f"  Statistic: {adf_result[0]:.4f}")
print(f"  p-value: {adf_result[1]:.4f}")

if adf_result[1] < 0.05:
    print("\n✓ Spread is STATIONARY - stocks are cointegrated!")
    print("  → Suitable for pairs trading")
else:
    print("\n✗ Spread is non-stationary - not cointegrated")

Cointegration Example: Two 'Related' Stocks
Hedge ratio (β): 1.6451

Spread = Stock_A - 1.65 × Stock_B

ADF test on spread:
  Statistic: -21.7104
  p-value: 0.0000

✓ Spread is STATIONARY - stocks are cointegrated!
  → Suitable for pairs trading


### Pairs Trading Strategy

**Setup**:
1. Find cointegrated pair
2. Calculate spread: $Z_t = A_t - \beta B_t$
3. Normalize: $Z_{norm} = \frac{Z_t - \mu_Z}{\sigma_Z}$

**Trading Rules**:
- If $Z_{norm} > 2$: Spread too high → Short A, Long β units of B
- If $Z_{norm} < -2$: Spread too low → Long A, Short β units of B
- Exit when $|Z_{norm}| < 0.5$ (spread reverts to mean)

In [9]:
# Pairs trading signals
spread_mean = spread.mean()
spread_std = spread.std()
z_score = (spread - spread_mean) / spread_std

# Generate signals
signals = np.zeros(len(z_score))
signals[z_score > 2] = -1   # Short spread (spread too high)
signals[z_score < -2] = 1   # Long spread (spread too low)

# Count signals
short_signals = np.sum(signals == -1)
long_signals = np.sum(signals == 1)

print("Pairs Trading Signals")
print("="*50)
print(f"Spread mean: {spread_mean:.2f}")
print(f"Spread std: {spread_std:.2f}")
print(f"\nTrading signals generated:")
print(f"  Long spread (z < -2): {long_signals} signals")
print(f"  Short spread (z > 2): {short_signals} signals")
print(f"\nZ-score range: [{z_score.min():.2f}, {z_score.max():.2f}]")

Pairs Trading Signals
Spread mean: 0.56
Spread std: 0.93

Trading signals generated:
  Long spread (z < -2): 10 signals
  Short spread (z > 2): 10 signals

Z-score range: [-3.17, 3.58]
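The signals above only mark entry points. The exit rule from the trading rules ($|Z_{norm}| < 0.5$) needs a little state, because whether to hold depends on the position already open. The function below is one minimal way to sketch that. `positions_from_zscore` and its parameter names are made up for this example, and it ignores transaction costs, stop-losses, and the look-ahead involved in normalizing with full-sample mean and standard deviation.

```python
import numpy as np

def positions_from_zscore(z, entry=2.0, exit_level=0.5):
    """Return -1 (short spread), +1 (long spread), or 0 (flat) for each time step."""
    positions = np.zeros(len(z), dtype=int)
    current = 0
    for i, z_i in enumerate(z):
        if current == 0:
            if z_i > entry:
                current = -1      # spread too high: short A, long beta units of B
            elif z_i < -entry:
                current = 1       # spread too low: long A, short beta units of B
        elif abs(z_i) < exit_level:
            current = 0           # spread has reverted toward its mean: close out
        positions[i] = current
    return positions

# Hand-made z-scores to trace the logic
z_example = np.array([0.0, 2.5, 1.0, 0.4, -2.1, -1.0, 0.0])
example_positions = positions_from_zscore(z_example)
print(example_positions.tolist())  # [0, -1, -1, 0, 1, 1, 0]
```

In the example, the short opened at 2.5 is held through 1.0 and closed at 0.4, and the long opened at -2.1 is held until the z-score returns to 0.0.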


### Engle-Granger Cointegration Test

**Procedure**:
1. Regress $Y_t$ on $X_t$ to get $\hat{\beta}$
2. Calculate residuals: $\hat{Z}_t = Y_t - \hat{\beta} X_t$
3. Test residuals for stationarity (ADF test, judged against Engle-Granger critical values because $\hat{\beta}$ is estimated rather than known)

If residuals are stationary → Series are cointegrated

---

## Summary: Week 3 Key Formulas

| Concept | Formula |
|---------|--------|
| Autocorrelation | $\rho_k = \frac{Cov(Y_t, Y_{t-k})}{Var(Y_t)}$ |
| AR(1) Process | $Y_t = c + \phi Y_{t-1} + \epsilon_t$ |
| MA(1) Process | $Y_t = \mu + \epsilon_t + \theta \epsilon_{t-1}$ |
| Cointegration | $Z_t = Y_t - \beta X_t \sim I(0)$ |
| Z-score | $Z = \frac{X - \mu}{\sigma}$ |
| AIC | $-2\ln(L) + 2k$ |

---

*Next Week: Machine Learning Basics*

## 🔴 PROS & CONS: THEORY

### ✅ PROS (Advantages)

| Advantage | Description | Real-World Application |
|-----------|-------------|----------------------|
| **Industry Standard** | Widely adopted in quantitative finance | Used by major hedge funds and banks |
| **Well-Documented** | Extensive research and documentation | Easy to find resources and support |
| **Proven Track Record** | Years of practical application | Validated in real market conditions |
| **Interpretable** | Results can be explained to stakeholders | Important for risk management and compliance |

### ❌ CONS (Limitations)

| Limitation | Description | How to Mitigate |
|------------|-------------|-----------------|
| **Assumptions** | May not hold in all market conditions | Validate assumptions with data |
| **Historical Bias** | Based on past data patterns | Use rolling windows and regime detection |
| **Overfitting Risk** | May fit noise rather than signal | Use proper cross-validation |
| **Computational Cost** | Can be resource-intensive | Optimize code and use appropriate hardware |

### 🎯 Real-World Usage

**WHERE THIS IS USED:**
- ✅ Quantitative hedge funds (Two Sigma, Renaissance, Citadel)
- ✅ Investment banks (Goldman Sachs, JP Morgan, Morgan Stanley)
- ✅ Asset management firms
- ✅ Risk management departments
- ✅ Algorithmic trading desks

**NOT JUST THEORY - THIS IS PRODUCTION CODE:**
The techniques in this notebook (stationarity testing, ARIMA modeling, and cointegration-based pairs trading) are used daily by professionals managing billions of dollars.