# 2. Data Selection

> “The quality of a composite indicator is directly linked to the quality of the variables selected.” (OECD Handbook on Composite Indicators)

In this notebook, I select the data used to calculate the CSIAI and justify each choice where needed.

## 2.1. Statistical Quality Principles

An indicator must satisfy **all seven** of the following principles or be rejected:

1. **Relevance** — captures something the theoretical framework says matters.  
2. **Accuracy** — sourced from audited statements and raw market data.  
3. **Timeliness** — updates at least quarterly; daily is preferred here, as discussed with **Dr. John Loane**, because daily is also the frequency of the CSIAI.  
4. **Accessibility** — free via `yfinance`.  
5. **Interpretability** — unit and direction are intuitive.  
6. **Comparability** — works across all sectors.  
7. **Coherence** — definitions do not conflict with other metrics.

If an indicator fails any of these principles, I will either exclude it or replace it with a proxy, with proper justification.
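
The screening rule above can be sketched as a simple checklist. This is only an illustration, not the notebook's actual implementation: the `QualityCheck` class, its field names, and the two example indicators with their verdicts are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityCheck:
    """Hypothetical checklist: one boolean per statistical quality principle."""
    relevance: bool
    accuracy: bool
    timeliness: bool
    accessibility: bool
    interpretability: bool
    comparability: bool
    coherence: bool

    def failed(self) -> list[str]:
        """Names of the principles this indicator does not satisfy."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def accepted(self) -> bool:
        """An indicator is kept only if all seven principles hold."""
        return not self.failed()


# Illustrative entries only; the verdicts are made up for the example.
candidates = {
    "beta": QualityCheck(True, True, True, True, True, True, True),
    "analyst_rating": QualityCheck(True, True, False, False, True, True, True),
}

for name, check in candidates.items():
    if check.accepted():
        print(f"{name}: keep")
    else:
        print(f"{name}: reject or find a proxy (fails: {', '.join(check.failed())})")
```

Recording the failed principles by name, rather than a single pass/fail flag, makes it easier to write the justification when a proxy is used instead.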

## 2.2. Defining the Dataset

### 2.2.1. Why Russell 3000?

* It covers about 98 % of U.S. market capitalisation and is a widely used benchmark for U.S. equities.
* It is a broad index that includes large, mid, and small-cap stocks.
* The list is public and can be found on [Wikipedia](https://en.wikipedia.org/wiki/Russell_3000_Index).
* The index is reconstituted annually, which means that it is updated regularly to reflect changes in the market.

### 2.2.2. Why require 24 months of price history (start 2023-01-01)?

The risk metrics (beta, Sharpe ratio, maximum drawdown) need at least one full market cycle, and Section 3 of the Handbook recommends an “adequate observational base”. Two years provides:
* ≈ 500 trading days: enough observations to estimate volatility reliably.
* IPO screening: Initial Public Offerings (IPOs) younger than 6 months are excluded to avoid data sparsity.
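
To make the sample-size argument concrete, here is a sketch of the three risk metrics computed on two years of synthetic daily returns. The return series are randomly generated, and the function names, the 252-day annualisation factor, and the zero risk-free rate are assumptions for illustration, not the definitions used for the CSIAI.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed number of trading days per year


def sharpe_ratio(returns: pd.Series, risk_free_daily: float = 0.0) -> float:
    """Annualised Sharpe ratio from daily returns (risk-free rate assumed 0)."""
    excess = returns - risk_free_daily
    return float(excess.mean() / excess.std(ddof=1) * np.sqrt(TRADING_DAYS))


def max_drawdown(returns: pd.Series) -> float:
    """Largest peak-to-trough loss of the cumulative return path (<= 0)."""
    wealth = (1.0 + returns).cumprod()
    drawdown = wealth / wealth.cummax() - 1.0
    return float(drawdown.min())


def beta(stock: pd.Series, market: pd.Series) -> float:
    """Slope of stock returns on market returns: cov(s, m) / var(m)."""
    return float(stock.cov(market) / market.var())


# Synthetic data: two years of daily returns (2 x 252 = 504 observations).
rng = np.random.default_rng(seed=0)
n_days = 2 * TRADING_DAYS
market = pd.Series(rng.normal(0.0004, 0.01, n_days))
stock = 1.2 * market + pd.Series(rng.normal(0.0, 0.005, n_days))

print(f"observations: {n_days}")
print(f"beta:         {beta(stock, market):.2f}")
print(f"Sharpe:       {sharpe_ratio(stock):.2f}")
print(f"max drawdown: {max_drawdown(stock):.2%}")
```

With roughly 500 observations the estimated beta should land close to the value of 1.2 used to generate the data; rerunning with a much shorter sample shows how quickly the estimate becomes noisy.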

### 2.2.3. Why filter on **average 30-day volume ≥ 50 000 shares**?

* Ensures the **Liquidity & Trading** sub-index is not dominated by thinly traded stocks.
* 50,000 shares/day is a common liquidity threshold in the finance literature (see, for example, **Gao & Ongena (2021)**), and this threshold keeps ≈ 80 % of the Russell 3000 index.
* This threshold also reduces the estimation error in bid-ask spread calculations; the spread reflects how well the market is functioning and how much it costs to trade a stock.
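
The filter itself is a one-line rule: average the most recent 30 daily volumes and compare against the threshold. The sketch below applies it to synthetic data; the tickers `AAA`, `BBB`, `CCC`, their volumes, and the helper name `liquid_tickers` are made up for illustration.

```python
import numpy as np
import pandas as pd

VOL_THRESHOLD = 50_000  # shares per day, as in the text
WINDOW = 30             # trading days


def liquid_tickers(volume: pd.DataFrame, threshold: int = VOL_THRESHOLD,
                   window: int = WINDOW) -> list[str]:
    """Tickers whose mean volume over the last `window` rows meets the threshold.

    `volume` has one row per trading day and one column per ticker.
    """
    avg = volume.tail(window).mean()  # missing days are skipped by default
    return avg[avg >= threshold].index.tolist()


# Synthetic volumes for three made-up tickers over 40 business days.
days = pd.bdate_range("2025-01-02", periods=40)
volume = pd.DataFrame(
    {
        "AAA": np.full(40, 1_200_000),  # heavily traded
        "BBB": np.full(40, 50_000),     # exactly at the threshold, kept
        "CCC": np.full(40, 8_000),      # thinly traded, dropped
    },
    index=days,
)

print(liquid_tickers(volume))  # ['AAA', 'BBB']
```

Because the comparison is `>=`, a stock trading exactly 50,000 shares per day on average stays in the universe.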

### 2.2.4. Reliable ticker source

The iShares Russell 3000 ETF (ticker **IWV**) publishes its full holdings every night as a CSV file.

Advantages:
* **Authoritative** — the fund tracks the Russell 3000, so its holdings closely mirror the index constituents.  
* **Timely** — file refreshes after each U.S. trading day.  
* **Stable URL** — the `.ajax` endpoint is version-agnostic, so it is not expected to change.
* **Free** — no login or API key required.
* **Clean “Holdings” sheet** — tickers are listed in a single column.

The file opens with **seven** metadata rows, so row 8 is the header. Rather than hard-coding this offset, the code finds the header by searching for the **“Ticker”** column name, then selects that column and drops empties, dashes, and placeholders.

* **Local CSV fallback**  
If the network call fails, or if you simply prefer to work offline, place a manually downloaded copy of the file at `data/input/IWV_holdings.csv`. The loader detects it automatically and skips the web request.
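
As a small self-contained illustration of this dynamic parsing, the sketch below runs the same idea on a made-up sample. The `SAMPLE` text only mimics the general layout (metadata, header, data, blank line, footnote) and is not the real IWV file; `parse_tickers` is a hypothetical helper that assumes `Ticker` is the first column of the header row.

```python
import io

import pandas as pd

# Made-up sample: metadata rows, a header row starting with "Ticker",
# data rows, then a blank line followed by a footnote.
SAMPLE = """Fund name,Example Fund
Holdings as of,"Jan 01, 2025"

Ticker,Name,Weight (%)
AAA,Alpha Corp,1.50
BRK.B,Example Holdings Class B,1.20
-,CASH PLACEHOLDER,0.10

Footnote text that is not part of the table.
"""


def parse_tickers(csv_text: str) -> list[str]:
    lines = csv_text.splitlines()
    # Locate the header row by its first column name instead of a fixed offset.
    start = next(i for i, line in enumerate(lines) if line.startswith("Ticker"))
    # The table ends at the first blank line after the header.
    end = next((i for i in range(start, len(lines)) if not lines[i].strip()),
               len(lines))
    df = pd.read_csv(io.StringIO("\n".join(lines[start:end])))
    tickers = df["Ticker"].dropna().astype(str).str.strip()
    tickers = tickers[~tickers.isin(["", "-"])]  # drop blanks and dash placeholders
    # Yahoo Finance writes share classes with a hyphen (BRK.B -> BRK-B).
    return tickers.str.replace(".", "-", regex=False).unique().tolist()


print(parse_tickers(SAMPLE))  # ['AAA', 'BRK-B']
```

Searching for the header makes the parser tolerant of a change in the number of metadata rows, which a fixed `skiprows=7` would not be.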

In [3]:
import pandas as pd, requests, io, os, datetime as dt, yfinance as yf
from tqdm import tqdm
import warnings, pathlib
import csv

warnings.filterwarnings("ignore")
# Define the base data directory
DATA_DIR = pathlib.Path("..") / "data"

# Ensure the data directory exists
DATA_DIR.mkdir(parents=True, exist_ok=True)

INPUT_DIR = DATA_DIR / "input"
PROCESSED_DIR = DATA_DIR / "processed"

# Ensure folders exist
INPUT_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

LOCAL_FILE = INPUT_DIR / "IWV_holdings.csv"

REMOTE_URL = ("https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?fileType=csv&fileName=IWV_holdings&dataType=fund")

def load_iwv_csv() -> bytes:
    """Return the CSV bytes: use the local file if present, else download and cache it."""
    if LOCAL_FILE.exists() and LOCAL_FILE.stat().st_size > 0:
        print("Using local IWV CSV:", LOCAL_FILE)
        return LOCAL_FILE.read_bytes()
    print("Downloading IWV holdings CSV …")
    r = requests.get(REMOTE_URL, timeout=30)
    r.raise_for_status()
    LOCAL_FILE.write_bytes(r.content)
    print("Saved snapshot to:", LOCAL_FILE)
    return r.content

def extract_tickers(csv_bytes: bytes) -> list[str]:
    """
    Parses the CSV content to extract ticker symbols.
    Handles metadata at the beginning and footnotes at the end.
    """
    # Decode the bytes to a string
    csv_text = csv_bytes.decode('utf-8', errors='ignore')
    lines = csv_text.splitlines()

    # Identify the header row
    header_line_index = None
    for i, line in enumerate(lines):
        if 'Ticker' in line:
            header_line_index = i
            break

    if header_line_index is None:
        raise ValueError("Ticker header not found in the CSV file.")

    # Read the data starting from the header
    data_lines = lines[header_line_index:]

    # Stop reading when an empty line is encountered
    for j, line in enumerate(data_lines):
        if not line.strip():
            data_lines = data_lines[:j]
            break

    # Create a DataFrame from the data lines
    data_str = '\n'.join(data_lines)
    df = pd.read_csv(io.StringIO(data_str))

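    # Clean and extract ticker symbols: drop missing values before casting to
    # str (otherwise NaN becomes the literal string "nan"), remove blanks and
    # dash placeholders, and convert dots to hyphens for Yahoo Finance.
    tickers = (
        df['Ticker']
        .dropna()
        .astype(str)
        .str.strip()
        .replace({'': pd.NA, '-': pd.NA})
        .dropna()
        .str.replace('.', '-', regex=False)
        .unique()
        .tolist()
    )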

    return tickers

# to run the loader
csv_bytes = load_iwv_csv()
tickers = extract_tickers(csv_bytes)
print(f"Fetched {len(tickers)} tickers from IWV holdings")

# Parameters for the yfinance download and the liquidity filter
START, END = "2023-01-01", dt.date.today().isoformat()
VOL_THRESHOLD = 50_000  # shares/day

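# Liquidity filter: download the last 40 calendar days (about 27 or 28 trading
# sessions), so the tail(30) below averages over every session available
prices = yf.download(
    " ".join(tickers),
    start=dt.date.fromisoformat(END) - dt.timedelta(days=40),
    end=END,
    group_by="ticker",
    threads=True,
    progress=False,
)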

liquid = []
for t in tickers:
    try:
        if prices[t]["Volume"].tail(30).mean() >= VOL_THRESHOLD:
            liquid.append(t)
    except KeyError:
        pass

print(f"Universe size after liquidity filter: {len(liquid)}")
universe = pd.DataFrame({"ticker": liquid})
universe.to_parquet(PROCESSED_DIR / "universe.parquet", index=False)

Using local IWV CSV: ../data/input/IWV_holdings.csv
Fetched 2656 tickers from IWV holdings



21 Failed downloads:
['P5N994', 'UHALB', 'ESM5', 'LGFA', 'BFA', 'GTXI', 'LENB', 'METCV', 'CWENA', 'GEFB', 'BFB', 'MSFUT', 'LGFB', 'MOGA', 'HEIA', 'RTYM5', 'BRKB', 'ADRO', 'XTSLA']: YFTzMissingError('possibly delisted; no timezone found')
['INH', 'CLSKW']: YFPricesMissingError('possibly delisted; no price data found  (1d 2025-03-27 -> 2025-05-06)')


Universe size after liquidity filter: 2563
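
Several of the failed symbols above look like share-class tickers that the holdings file writes without a separator (for example `BRKB`), whereas Yahoo Finance uses a hyphen (`BRK-B`). Because such symbols contain no dot, the dot-to-hyphen replacement in `extract_tickers` leaves them unchanged. One possible remedy, sketched below, is a small manual override table applied before the download. The `YAHOO_OVERRIDES` dictionary and the `to_yahoo_symbols` helper are hypothetical, and the three mappings shown are assumptions that should be verified before use.

```python
# Hypothetical manual overrides: holdings-file ticker -> Yahoo Finance symbol.
# These three entries are assumptions for illustration; verify each one.
YAHOO_OVERRIDES = {
    "BRKB": "BRK-B",
    "BFA": "BF-A",
    "BFB": "BF-B",
}


def to_yahoo_symbols(tickers: list[str], overrides: dict[str, str]) -> list[str]:
    """Apply overrides, keep other tickers unchanged, and drop duplicates in order."""
    mapped = [overrides.get(t, t) for t in tickers]
    return list(dict.fromkeys(mapped))  # dict keys preserve first-seen order


print(to_yahoo_symbols(["AAPL", "BRKB", "BFA", "BFB", "BRK-B"], YAHOO_OVERRIDES))
# ['AAPL', 'BRK-B', 'BF-A', 'BF-B']
```

An explicit table is easier to audit than a pattern-based rule, since a rule such as "insert a hyphen before a trailing A or B" would also rewrite ordinary tickers that happen to end in those letters.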
