# Multi-Factor Information Coefficient (IC) Analysis

28/12/2025

This notebook conducts a cross-sectional IC analysis of three fundamental factors: book-to-market ratio (B/M), size (market capitalization), and return on equity (ROE). The objective is to quantify each factor's directional predictive power for 21-day forward returns through Spearman rank correlation across roughly 1,000 common trading days (2021-2025).

**Methodology: Joint factor universe construction**

Factors are aligned to a common set of dates via index intersection, so every factor is evaluated on the same trading days. This approach prioritizes portfolio-level consistency over maximizing each factor's individual coverage, and keeps the IC and T-stat values comparable across factors when they are later used to set factor weights.

### 30 Tickers filter

IC estimation demands sufficient cross-sectional breadth for Spearman rank correlation stability:

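- n < 30: Few names per day make the daily IC very noisy. Under the null of no relationship, the standard error of a Spearman correlation is 1/√(n−1), i.e. about 0.33 at n = 10.
- n ≥ 30: The standard error falls to about 0.19 or lower (about 0.10 at n = 100). The remaining noise is reduced by averaging daily ICs over time in the T-statistic.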

This threshold is a practical rule of thumb. It is in the spirit of Grinold's fundamental law (IR ≈ IC × √breadth), in which a small IC only becomes useful when it is applied across enough independent bets.
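
For reference, the null standard error of a Spearman correlation is 1/√(n−1) when there are no ties. A minimal sketch tabulating it for a few universe sizes (the helper name `spearman_null_se` is hypothetical, not part of this notebook's code):

```python
import numpy as np

def spearman_null_se(n: int) -> float:
    """Standard error of Spearman's rho under the null hypothesis (no ties)."""
    return 1.0 / np.sqrt(n - 1)

# Compare a few cross-section sizes, including the 30-ticker threshold
for n in (10, 30, 100):
    print(n, round(spearman_null_se(n), 3))
```

The daily IC is noisy even at n = 30, which is why the analysis relies on the mean IC over many days rather than on any single day's value.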

### Explicit Rank Transformation (.rank(pct=True))

While scipy.stats.spearmanr() computes ranks internally, the factors are also ranked explicitly with .rank(pct=True) before the IC computation. This maps every cross-section to the same percentile scale, which matters for the downstream pipeline:

- Scale invariance: Heterogeneous raw distributions (Size: 1M-3T; ROE: -500% to +1000%) become combinable for portfolio weights.
- Regime stability: Cross-sectional percentiles remain Uniform regardless of temporal distribution shifts.
- Research-to-production consistency: Identical transformation preserves signal integrity from backtest to live trading.
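
A quick check, on synthetic data, that explicit pre-ranking leaves the IC itself unchanged: Spearman correlation depends only on ranks, so raw values and `.rank(pct=True)` output should give the same number. The random series below are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy cross-section: a heavy-tailed raw factor and noisy forward returns
raw_factor = pd.Series(rng.lognormal(mean=0.0, sigma=2.0, size=50))
fwd_ret = pd.Series(0.1 * np.log(raw_factor) + rng.normal(size=50))

# IC on raw values vs. IC on explicit percentile ranks
ic_raw = spearmanr(raw_factor, fwd_ret)[0]
ic_ranked = spearmanr(raw_factor.rank(pct=True), fwd_ret.rank(pct=True))[0]

print(np.isclose(ic_raw, ic_ranked))
```

The benefit of the explicit transformation is therefore not a different IC, but a factor that is already on a common scale when factors are later combined.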

### Double Forward-Fill (Fundamental Data Alignment)

Fundamental data has sparse temporal coverage (quarterly or annual releases). A two-stage forward-fill aligns it to the daily price index:

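```python
# Stage 1: upsample the sparse fundamental dates to a daily calendar,
# carrying each value forward between fundamental dates
book_daily = book_df.resample('D').ffill()

# Stage 2: align to the price index and extend the last-known value to all later price dates
book_aligned = book_daily.reindex(data.index).ffill()
```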

After resampling, the series ends at the last fundamental date. Reindexing to the full price index (data.index) introduces NaNs for all later dates, so a second forward-fill is needed to propagate the latest fundamental value.

This preserves information state at each trading timestamp: price data is daily, fundamentals are as-available with forward-propagation matching practitioner convention.
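
The effect of each stage can be seen on a toy example. This is a sketch with a hypothetical ticker `AAA` and made-up values, not the notebook's data:

```python
import pandas as pd

# Toy fundamentals for a hypothetical ticker on two period-end dates
fund = pd.DataFrame({'AAA': [10.0, 11.0]},
                    index=pd.to_datetime(['2024-06-30', '2024-09-30']))
# Toy business-day price index that runs past the last fundamental date
price_index = pd.bdate_range('2024-09-26', '2024-10-04')

daily = fund.resample('D').ffill()       # Stage 1: daily grid ending 2024-09-30
reindexed = daily.reindex(price_index)   # dates after 2024-09-30 are NaN
aligned = reindexed.ffill()              # Stage 2: carry 11.0 forward

print(reindexed['AAA'].isna().sum(), aligned['AAA'].isna().sum())
```

One caveat of this pattern: if the last fundamental date is not itself in the price index (for example a period end that falls on a weekend), the reindex drops that row and the second fill carries the previous value forward instead.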

### Look-ahead bias in fundamental reports

Yahoo Finance balance_sheet and income_stmt data are indexed by period-end dates (e.g., 2024-12-31 for Q4), but results are typically announced 30-60 days later. Calling .resample('D').ffill() directly therefore makes each figure available from the accounting date, before it was actually published. This introduces look-ahead bias: using information that was not yet available to market participants at time *t*.

Solution: Forward-fill fundamentals from the earnings announcement date rather than the period-end date. If announcement dates are not available, a conservative lag (30-60 days) is a feasible alternative. Neither solution is implemented here.
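
For illustration only, the conservative-lag alternative could look like the sketch below. The helper `apply_reporting_lag` and the 60-day default are assumptions, and the function would be applied to the fundamentals before any resampling or forward-filling:

```python
import pandas as pd

def apply_reporting_lag(fund: pd.DataFrame, lag_days: int = 60) -> pd.DataFrame:
    """Shift period-end dates forward so values only become visible after the lag."""
    lagged = fund.sort_index().copy()
    lagged.index = lagged.index + pd.Timedelta(days=lag_days)
    return lagged

# Toy example: a value dated 2024-12-31 only becomes usable 60 days later
fund = pd.DataFrame({'AAA': [10.0]}, index=pd.to_datetime(['2024-12-31']))
print(apply_reporting_lag(fund).index[0].date())
```

A fixed lag is cruder than true announcement dates: it delays fast reporters unnecessarily and may still be too short for late filers.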

### Outlier contamination in factor analysis

Fundamental factors exhibit extreme tail distributions that systematically distort cross-sectional Spearman rank correlations, compromising IC reliability and portfolio construction.

**Mechanisms of Outlier Contamination:**
- Financial leverage: Banks with near-zero equity → ROE = ±1000% (dominates 1% percentile)
- Corporate events: IPOs with minimal equity → ROE spikes >500%
- Scale effects: BRK-B market cap 3T vs small-caps 1B → Monopolizes size rank
- Distress: Bankrupt firms → Negative extremes contaminate value factors

**Consequences for IC Analysis:**

- Rank compression: 2-3 tickers occupy extremes → Rest of universe loses differentiation.
- IC volatility: Outlier-induced instability elevates IC_std 30-50% → Reduced T-stat.
- False premia: IC measures ticker-specific noise, not systematic risk premia.
- Portfolio risk: Weights concentrated in outliers → Extreme drawdowns.

It is in fact the case for the datasets used in this factor analysis:

```text
Top 5 problematic tickers per factor:

ROE:
      low_outliers  high_outliers  total_outliers
HD             251            502             753
MO             707              0             707
BA               0            205             205
KLAC             0            147             147
SPGI             0            124             124

Size:
       low_outliers  high_outliers  total_outliers
BRK-B           978              0             978
AAPL              0            523             523
MSFT              0            198             198
NVDA              0            133             133
GOOG              0             96              96

Book to market:
       low_outliers  high_outliers  total_outliers
BRK-B             0           1000            1000
BA              733              0             733
LOW             267              0             267
AAPL              0              0               0
ACN               0              0               0

```

Outlier treatment is out of scope for this phase; I will address it in future projects.
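
As a pointer for that later work, one common treatment is cross-sectional winsorization: clipping each date's raw factor values to that date's quantile bounds before ranking. A minimal sketch, assuming a dates × tickers DataFrame of raw values; the helper name and the 1%/99% defaults are illustrative choices, not something this notebook uses:

```python
import pandas as pd

def winsorize_cross_section(factor: pd.DataFrame,
                            lower: float = 0.01,
                            upper: float = 0.99) -> pd.DataFrame:
    """Clip each date's values to that date's [lower, upper] quantiles."""
    lo = factor.quantile(lower, axis=1)   # one lower bound per date
    hi = factor.quantile(upper, axis=1)   # one upper bound per date
    return factor.clip(lower=lo, upper=hi, axis=0)

# Toy cross-section with one extreme ROE-like value
raw = pd.DataFrame([[0.05, 0.10, 0.15, 0.20, 10.0]], columns=list('ABCDE'))
clipped = winsorize_cross_section(raw, lower=0.0, upper=0.75)
print(clipped)
```

Clipped values become ties once the factor is ranked, so the choice of bounds interacts with the rank transformation and would need to be evaluated alongside it.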

In [1]:
import sys
sys.path.append(r"C:\Users\Sergio\Documents\GitHUb\Quant-trading-journey\src")
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import quant_utils as qu
from scipy.stats import spearmanr

In [2]:
tickers = [
    'AAPL', 'MSFT', 'NVDA', 'AMZN', 'META', 'GOOGL', 'GOOG', 'TSLA', 'BRK-B', 'AVGO',
    'UNH', 'JPM', 'XOM', 'V', 'PG', 'MA', 'JNJ', 'HD', 'CVX', 'ABBV',
    'PFE', 'KO', 'MRK', 'BAC', 'CRM', 'WMT', 'NFLX', 'AMD', 'COST', 'TMO',
    'ABT', 'ACN', 'DIS', 'TXN', 'VZ', 'QCOM', 'PM', 'ADBE', 'INTC', 'WFC',
    'RTX', 'NKE', 'UNP', 'SPGI', 'HON', 'COP', 'CAT', 'LOW', 'GS', 'MS',
    'C', 'BMY', 'AMGN', 'GILD', 'SBUX', 'T', 'USB', 'AXP', 'MMM', 'LIN',
    'ELV', 'DHR', 'SCHW', 'MDT', 'UPS', 'PGR', 'VRTX', 'ZTS', 'REGN', 'NEE',
    'TJX', 'CB', 'SYK', 'BLK', 'CI', 'BSX', 'MU', 'BDX', 'WM', 'GE',
    'DE', 'LMT', 'BA', 'KLAC', 'ADP', 'ADI', 'LRCX', 'PANW', 'SNPS', 'CDNS',
    'MCD', 'NOW', 'ORCL', 'PLD', 'AMT', 'FISV', 'MDLZ', 'MO', 'TGT', 'FCX'
]

data = qu.data.download_data(tickers)

Saved → C:\Users\Sergio\Documents\GitHUb\Quant-trading-journey\src\data\tickers_100t_max_1d.pkl


In [3]:
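# Getting target for IC calculations: percentile rank of 21-day forward returns per date
forward_horizon = 21  # trading days
forward_returns = (data.pct_change(periods=forward_horizon, fill_method=None)
                   .shift(-forward_horizon)   # row t holds the return from t to t+21
                   .rank(pct=True, axis=1)    # rank across tickers on each date
                   .dropna())                 # drops any date with a missing ticker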
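
To make the `pct_change(...).shift(-h)` alignment concrete, here is a toy example with a hypothetical ticker and a 2-day horizon (values are made up):

```python
import pandas as pd

prices = pd.DataFrame({'AAA': [100.0, 110.0, 121.0, 133.1]},
                      index=pd.bdate_range('2024-01-01', periods=4))
h = 2  # toy horizon in trading days

# pct_change gives the return from t-h to t; shifting by -h moves it back
# so that row t holds the return from t to t+h
fwd = prices.pct_change(periods=h, fill_method=None).shift(-h)
print(fwd)
```

The last `h` rows have no forward return yet and come out as NaN, which is why the most recent dates drop out of the IC sample.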

### Book to market ratio

The book-to-market ratio (B/M) is a financial indicator that compares a company’s book value (its accounting value) to its market value (what investors think it’s worth on the stock market).

$$
\text{Book to market ratio} = \frac{\text{Book value per share}}{\text{Market price per share}}
$$

**Interpretation**
- High B/M ratio: The book value is high relative to the market value. This may suggest the stock is undervalued or that investors expect low future growth.
- Low B/M ratio: The book value is low compared to the market value, often meaning the market has high expectations for the company’s growth or profitability.

**Practical use**

The B/M ratio is widely used in valuation and asset pricing models, especially in the Fama–French three-factor model, where firms with high B/M ratios (“value stocks”) tend to earn higher average returns over time than firms with low B/M ratios (“growth stocks”).

In [4]:
# Fetching book value per share of the tickers
book_values = {}
missing_data = []
for ticker in tickers:
    try:
        # Fetch the balance sheet once; sort dates ascending so ffill carries older values forward
        balance_sheet = yf.Ticker(ticker).balance_sheet
        equity = balance_sheet.loc['Stockholders Equity'].sort_index().ffill()
        shares = balance_sheet.loc['Ordinary Shares Number'].sort_index().ffill()
        book_values[ticker] = equity / shares  # both rows share the same date index
    except Exception:
        missing_data.append(ticker)

if missing_data:
    print('No data was found for', missing_data)

In [5]:
# Cleaning dataset (yf data for shares is not daily)
book_df = pd.DataFrame(book_values)
book_daily = book_df.resample('D').ffill()
book_aligned = book_daily.reindex(data.index).ffill()
book_clean = book_aligned.dropna(how='all')

# Calculating book to market ratio
book_price = book_clean / data.loc[book_clean.index]
bm_ratio = book_price.rank(axis=1, pct=True)

### Size (market capitalization)
Size, measured as market capitalization, represents a company's total market value—the product of its share price and outstanding shares.
$$
\text{Size} = \text{Share price} \times \text{Shares outstanding}
$$

**Interpretation**

- High Size (Large-cap): Companies with substantial market capitalization (10B+), typically mature firms with stable cash flows, lower idiosyncratic risk, and diversified operations.
- Low Size (Small-cap): Companies with limited market capitalization (<2B), often younger firms with higher growth potential but elevated business and liquidity risk.

**Practical use**

Size constitutes the second factor in the Fama-French three-factor model (SMB - Small Minus Big). Empirically, small-cap stocks generate higher average returns than large-caps, compensating for elevated risk. Low-size ranked stocks typically outperform high-size counterparts.

In [6]:
# Fetching shares of the tickers
shares_hist = {}
missing_data = []
for ticker in tickers:
    try:
        shares_hist[ticker] = yf.Ticker(ticker).balance_sheet.loc['Ordinary Shares Number']
    except Exception as e:
        missing_data.append(ticker)

if missing_data:
    print('No data was found for', missing_data)

In [7]:
# Cleaning dataset (yf data for shares is not daily)
shares = pd.DataFrame(shares_hist)
shares_daily = shares.resample('D').ffill()                # Making the index daily and forward filling
shares_aligned = shares_daily.reindex(data.index).ffill()  # Aligning and forward filling (more recent dates after reindex)
shares_clean = shares_aligned.dropna(how='all')            # Drop every row where all values are NaN

# Calculating size of the tickers
size = data.loc[shares_clean.index] * shares_clean
size = size.rank(axis=1, pct=True)

### Return on equity (ROE)
Return on equity (ROE) measures a company's profitability relative to shareholders' equity, indicating operational efficiency in generating profits from invested capital.

$$
ROE = \frac{\text{Net income}}{\text{Shareholders' equity}}
$$
 
**Interpretation**

- High ROE: Superior profitability per unit of equity—efficient capital allocation, high margins, or asset turnover. Often signals quality but potential overvaluation.
- Low/Negative ROE: Poor profitability, operational distress, or capital destruction. May indicate undervaluation or structural problems.

**Practical use**

Traditionally viewed as a quality signal, ROE can also act as a "quality trap": high-ROE firms may already be priced for growth and go on to earn lower returns, while low-ROE firms can mean-revert and outperform in long-short implementations. The IC analysis in this notebook tests which direction holds in this sample.

In [8]:
# Fetching data to calculate ROE
net_income_dict = {}
equity_dict = {}
missing_data = []
for ticker in tickers:
    try:
        ticker_obj = yf.Ticker(ticker)
        net_income_dict[ticker] = ticker_obj.income_stmt.loc['Net Income']
        equity_dict[ticker] = ticker_obj.balance_sheet.loc['Stockholders Equity']
    except Exception as e:
        missing_data.append(ticker)

if missing_data:
    print('No data was found for', missing_data)

In [9]:
# Calculating annual ROE of the tickers
net_income_df = pd.DataFrame(net_income_dict)
equity_df = pd.DataFrame(equity_dict)
roe = net_income_df / equity_df

# Cleaning dataset (in this case yf data for ROE is annual)
roe_daily = roe.resample('D').ffill()
roe_aligned = roe_daily.reindex(forward_returns.index).ffill()
roe_clean = roe_aligned.dropna(how='all').rank(axis=1, pct=True)

In [10]:
# Preparing dataset for IC calculations
common_dates = (roe_clean.index 
                .intersection(size.index) 
                .intersection(bm_ratio.index)
                .intersection(forward_returns.index))

clean_datasets = {}
for name, factor in [('ROE', roe_clean.loc[common_dates]), 
                     ('Size', size.loc[common_dates]), 
                     ('Book to market', bm_ratio.loc[common_dates])]:
    
    # Setting minimum threshold of available factor data for IC calculations
    valid_tickers = factor.notna().sum(axis=1)
    
    # At least 30 tickers for trustworthy analysis
    valid_days = valid_tickers[valid_tickers >= 30].index
    clean_factor = factor.loc[valid_days]
    clean_returns = forward_returns.loc[valid_days]
    
    clean_datasets[name] = {
        'Factor': clean_factor,
        'Returns': clean_returns,
        'Days': len(valid_days),
    }
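
The breadth filter used in the cell above can be illustrated on a toy panel. This sketch uses hypothetical tickers and a threshold of 3 standing in for the notebook's 30:

```python
import numpy as np
import pandas as pd

# Toy factor panel: 3 dates x 4 hypothetical tickers, NaN = missing factor value
factor = pd.DataFrame(
    [[0.1, 0.5, np.nan, 0.9],
     [np.nan, np.nan, np.nan, 0.4],
     [0.2, 0.3, 0.7, 0.8]],
    index=pd.bdate_range('2024-01-01', periods=3),
    columns=['AAA', 'BBB', 'CCC', 'DDD'])

min_names = 3  # stands in for the notebook's threshold of 30
counts = factor.notna().sum(axis=1)               # available tickers per date
valid_days = counts[counts >= min_names].index    # dates with enough breadth
print(list(counts), len(valid_days))
```

Because the filter is applied per factor, two factors evaluated on the same common dates can still end up with different numbers of valid days.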

In [11]:
# IC Calculation
results = {}
for name, dataset in clean_datasets.items():
    factor = dataset['Factor']
    returns = dataset['Returns']
    
    ic_values = []
    for date in factor.index:
        factor_day = factor.loc[date].dropna()
        returns_day = returns.loc[date].dropna()
        
        common_tickers = factor_day.index.intersection(returns_day.index)
        
        if len(common_tickers) >= 5:  # Redundant filter to handle odd exceptions
            ic = spearmanr(factor_day.loc[common_tickers], 
                                 returns_day.loc[common_tickers])[0]
            ic_values.append({'date': date, 'ic': ic})
    
    ic_series = pd.Series([x['ic'] for x in ic_values], 
                         index=[x['date'] for x in ic_values])
    
    # Metrics
    valid_ic = ic_series.dropna()
    results[name] = {
        'IC mean': valid_ic.mean(),
        'IC std': valid_ic.std(),
        'T stat': (valid_ic.mean() / valid_ic.std() * np.sqrt(len(valid_ic))),
        'Valid days': len(valid_ic),
        'Coverage': f"{len(valid_ic)/dataset['Days']*100:.1f}%"
    }

results = pd.DataFrame(results).T.round(4)
print("\n" + "="*70)
print("Multi-factor IC analysis")
print("="*70)
print(results.sort_values('IC mean', ascending=False))


Multi-factor IC analysis
                 IC mean    IC std    T stat Valid days Coverage
Book to market  0.018637  0.184448   3.19519       1000   100.0%
Size           -0.011581  0.142186  -2.54719        978   100.0%
ROE            -0.013634  0.118398 -3.601281        978   100.0%
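
As a sanity check, the T-statistics can be reproduced from the printed IC mean, IC std and valid-day counts, and converted to two-sided p-values. This is a sketch using a normal approximation; it treats daily ICs as independent, which is optimistic here because consecutive 21-day forward-return windows overlap.

```python
import numpy as np
from scipy.stats import norm

# (IC mean, IC std, valid days) copied from the results table
reported = {
    'Book to market': (0.018637, 0.184448, 1000),
    'Size': (-0.011581, 0.142186, 978),
    'ROE': (-0.013634, 0.118398, 978),
}

summary = {}
for name, (ic_mean, ic_std, n_days) in reported.items():
    t_stat = ic_mean / ic_std * np.sqrt(n_days)
    p_two_sided = 2 * norm.sf(abs(t_stat))  # normal approximation to the t distribution
    summary[name] = (t_stat, p_two_sided)
    print(f"{name}: t = {t_stat:.2f}, two-sided p = {p_two_sided:.4f}")
```

With autocorrelated ICs the effective number of independent observations is smaller than the number of valid days, so these p-values are best read as a lower bound.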


## Interpretation

**Descriptive statistics overview**

- Universe: 978 to 1,000 valid trading days per factor (≈4 years of coverage, 2021-2025) after cross-sectional filtering (≥30 tickers/day).
- Coverage: 100% → Every day that passed the filter produced a valid IC, i.e. factor values and forward returns are aligned with no gaps.

**Factor performance hierarchy**

ROE (IC = -0.0136, T = -3.60)

- Direction: Negative relationship, consistent with the "quality trap" interpretation.
- Economic interpretation: High-ROE firms tend to earn lower subsequent returns in this sample, consistent with overvaluation/mean-reversion dynamics.
- Statistical power: T-statistic = -3.60 → p < 0.001 (rejects H₀: IC=0 at the 99.9% confidence level).
- Stability: Lowest IC_std = 0.118 → Most stable IC over time.


Size (IC = -0.0116, T = -2.54)

- Direction: Negative IC is consistent with the small-cap premium (Fama-French SMB factor).
- Economic interpretation: Lower market-cap firms earn higher subsequent returns in this sample, reflecting size-based risk compensation.
- Statistical power: T-statistic = -2.54 → two-sided p ≈ 0.011 (significant at the 5% level, not at 1%).
- Stability: IC_std = 0.142 → Moderate consistency, regime-sensitive.


Book-to-Market (IC = 0.0186, T = 3.20)

- Direction: Positive relationship - Value premium.
- Economic interpretation: High B/M (value) stocks outperform growth, consistent with classic Fama-French and post-2022 value recovery.
- Statistical power: T-statistic = 3.20 → two-sided p ≈ 0.0014 (significant at the 1% level).
- Stability: Highest IC_std = 0.184 → Most volatile IC of the three, despite the positive mean.