# 📊 Market Data Download Pipeline

## Quantitative Finance Data Infrastructure

This notebook establishes the **foundational data infrastructure** for our quantitative finance learning system. We download real market data from Yahoo Finance to use throughout the 24-week program.

### Data Sources
- **Equities**: S&P 500 sample (45 major stocks across sectors)
- **ETFs**: Major index and sector ETFs (10 instruments)
- **FX**: Major currency pairs (4 pairs)
- **Fixed Income**: Treasury yields (3M, 5Y, 10Y, 30Y)

### Key Principles
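1. **Point-in-Time Data**: Analyses should use only the data that was available at each point in time; note that Yahoo Finance back-adjusts past prices for splits and dividends, so adjusted series only approximate this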
2. **Survivorship Bias Awareness**: Current constituents may not reflect historical composition
3. **Data Quality**: All data undergoes quality checks before use
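
The notebook does not show its quality checks, so the following is only a minimal sketch of what such a check might look like. The helper name `basic_quality_report` and the synthetic tickers `AAA` and `BBB` are hypothetical and are not part of the pipeline below.

```python
import numpy as np
import pandas as pd


def basic_quality_report(prices: pd.DataFrame) -> dict:
    """Summarize common problems in a date-indexed price table.

    Hypothetical helper for illustration; not part of the notebook's pipeline.
    """
    return {
        "rows": len(prices),
        "missing_values": int(prices.isna().sum().sum()),
        "duplicate_dates": int(prices.index.duplicated().sum()),
        "non_positive_prices": int((prices <= 0).sum().sum()),
        "dates_sorted": bool(prices.index.is_monotonic_increasing),
    }


# Synthetic example with one gap, one duplicated date, and one negative price
idx = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04"])
sample = pd.DataFrame(
    {"AAA": [100.0, 101.0, 101.0, np.nan], "BBB": [50.0, -1.0, 51.0, 52.0]},
    index=idx,
)
report = basic_quality_report(sample)
print(report)
```

A report like this could be run on each downloaded frame before it is saved, with any non-zero count flagged for manual review.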

---
**Author**: ML Quant Finance Mastery  
**Last Updated**: 2026-01-20  
**Data Source**: Yahoo Finance via yfinance

## 1. Setup and Configuration

In [1]:
# Import required libraries
import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from pathlib import Path
import warnings
import time

warnings.filterwarnings('ignore')

# Configuration
DATA_DIR = Path("./raw_data")
START_DATE = "2019-01-01"
END_DATE = "2026-01-20"  # Exclusive in yfinance: data runs through the previous day

# Create directory structure
for subdir in ["equities", "etfs", "fx", "fixed_income"]:
    (DATA_DIR / subdir).mkdir(parents=True, exist_ok=True)

print("✅ Libraries imported successfully")
print(f"📁 Data directory: {DATA_DIR.absolute()}")
print(f"📅 Date range: {START_DATE} to {END_DATE}")

✅ Libraries imported successfully
📁 Data directory: /Users/anto/Learning_Trading_ML/ML-Quant-Finance-Mastery/02_Daily_Coding/datasets/raw_data
📅 Date range: 2019-01-01 to 2026-01-20


## 2. Define Universe of Instruments

### Equity Selection Criteria
We select stocks that represent:
- **Large-cap diversification** across all 11 GICS sectors
- **High liquidity** for realistic backtesting
- **Representative of institutional portfolios**

> ⚠️ **Survivorship Bias Warning**: These are *current* constituents. Historical backtests should account for index changes.

In [2]:
# S&P 500 Sample - 45 stocks across all sectors
EQUITIES = {
    # Technology (8)
    "AAPL": "Apple Inc",
    "MSFT": "Microsoft Corp",
    "GOOGL": "Alphabet Inc",
    "NVDA": "NVIDIA Corp",
    "META": "Meta Platforms",
    "AVGO": "Broadcom Inc",
    "ORCL": "Oracle Corp",
    "CRM": "Salesforce Inc",
    
    # Financials (6)
    "JPM": "JPMorgan Chase",
    "BAC": "Bank of America",
    "GS": "Goldman Sachs",
    "MS": "Morgan Stanley",
    "BLK": "BlackRock Inc",
    "C": "Citigroup Inc",
    
    # Healthcare (5)
    "JNJ": "Johnson & Johnson",
    "UNH": "UnitedHealth Group",
    "PFE": "Pfizer Inc",
    "ABBV": "AbbVie Inc",
    "MRK": "Merck & Co",
    
    # Consumer Discretionary (4)
    "AMZN": "Amazon.com",
    "TSLA": "Tesla Inc",
    "HD": "Home Depot",
    "NKE": "Nike Inc",
    
    # Consumer Staples (4)
    "PG": "Procter & Gamble",
    "KO": "Coca-Cola Co",
    "PEP": "PepsiCo Inc",
    "WMT": "Walmart Inc",
    
    # Industrials (4)
    "UPS": "United Parcel Service",
    "CAT": "Caterpillar Inc",
    "BA": "Boeing Co",
    "HON": "Honeywell Intl",
    
    # Energy (4)
    "XOM": "Exxon Mobil",
    "CVX": "Chevron Corp",
    "COP": "ConocoPhillips",
    "SLB": "Schlumberger NV",
    
    # Communication Services (3)
    "NFLX": "Netflix Inc",
    "DIS": "Walt Disney Co",
    "CMCSA": "Comcast Corp",
    
    # Materials (3)
    "LIN": "Linde PLC",
    "APD": "Air Products",
    "FCX": "Freeport-McMoRan",
    
    # Utilities (2)
    "NEE": "NextEra Energy",
    "DUK": "Duke Energy",
    
    # Real Estate (2)
    "PLD": "Prologis Inc",
    "AMT": "American Tower",
}

# ETFs - Index and Sector
ETFS = {
    "SPY": "S&P 500 ETF",
    "QQQ": "Nasdaq 100 ETF",
    "IWM": "Russell 2000 ETF",
    "DIA": "Dow Jones ETF",
    "XLF": "Financial Select Sector",
    "XLK": "Technology Select Sector",
    "XLE": "Energy Select Sector",
    "XLV": "Healthcare Select Sector",
    "GLD": "Gold ETF",
    "TLT": "20+ Year Treasury ETF",
}

# FX Pairs (using Yahoo Finance tickers)
FX_PAIRS = {
    "EURUSD=X": "EUR/USD",
    "GBPUSD=X": "GBP/USD",
    "USDJPY=X": "USD/JPY",
    "AUDUSD=X": "AUD/USD",
}

# Treasury Yields
TREASURIES = {
    "^TNX": "10-Year Treasury Yield",
    "^IRX": "3-Month Treasury Yield",
    "^FVX": "5-Year Treasury Yield",
    "^TYX": "30-Year Treasury Yield",
}

print(f"📈 Equities: {len(EQUITIES)} stocks")
print(f"📊 ETFs: {len(ETFS)} funds")
print(f"💱 FX Pairs: {len(FX_PAIRS)} pairs")
print(f"📉 Treasuries: {len(TREASURIES)} yields")
print(f"═" * 40)
print(f"📦 Total instruments: {len(EQUITIES) + len(ETFS) + len(FX_PAIRS) + len(TREASURIES)}")

📈 Equities: 45 stocks
📊 ETFs: 10 funds
💱 FX Pairs: 4 pairs
📉 Treasuries: 4 yields
════════════════════════════════════════
📦 Total instruments: 63


## 3. Data Download Functions

### Download Strategy
- **Batch processing** with rate limiting to avoid API blocks
- **Error handling** with retry logic for failed downloads
- **Progress tracking** for transparency

In [3]:
def download_instrument(ticker: str, name: str, start: str, end: str, 
                         max_retries: int = 3) -> pd.DataFrame | None:
    """
    Download OHLCV data for a single instrument with retry logic.
    
    Parameters:
    -----------
    ticker : str
        Yahoo Finance ticker symbol
    name : str
        Human-readable name for logging
    start, end : str
        Date range in YYYY-MM-DD format
    max_retries : int
        Number of retry attempts on failure
        
    Returns:
    --------
    pd.DataFrame or None
        OHLCV data with Date index, or None if download failed
    """
    for attempt in range(max_retries):
        try:
            data = yf.download(ticker, start=start, end=end, progress=False)
            if len(data) > 0:
                # Flatten MultiIndex columns if present
                if isinstance(data.columns, pd.MultiIndex):
                    data.columns = data.columns.get_level_values(0)
                return data
            # An empty frame also counts as a failed attempt
            error = "empty result"
        except Exception as e:
            error = str(e)[:50]
        if attempt < max_retries - 1:
            time.sleep(1)  # Wait before retry
        else:
            print(f"   ❌ Failed: {ticker} ({name}) - {error}")
    return None


def download_batch(instruments: dict, category: str, output_dir: Path) -> dict:
    """
    Download a batch of instruments and save to CSV.
    
    Parameters:
    -----------
    instruments : dict
        Dictionary of {ticker: name} pairs
    category : str
        Category name for logging (e.g., "Equities")
    output_dir : Path
        Directory to save CSV files
        
    Returns:
    --------
    dict
        Summary statistics of the download
    """
    print(f"\n{'='*50}")
    print(f"📥 Downloading {category}: {len(instruments)} instruments")
    print(f"{'='*50}")
    
    success_count = 0
    failed = []
    all_data = {}
    
    for i, (ticker, name) in enumerate(instruments.items(), 1):
        print(f"   [{i:2d}/{len(instruments)}] {ticker:10s} - {name[:25]:25s}", end=" ")
        
        data = download_instrument(ticker, name, START_DATE, END_DATE)
        
        if data is not None and len(data) > 0:
            # Save individual file
            clean_ticker = ticker.replace("=X", "").replace("^", "")
            filepath = output_dir / f"{clean_ticker}.csv"
            data.to_csv(filepath)
            all_data[ticker] = data
            success_count += 1
            print(f"✅ {len(data)} rows")
        else:
            failed.append(ticker)
            print("❌ Failed")
        
        time.sleep(0.2)  # Rate limiting
    
    print(f"\n📊 {category} Summary: {success_count}/{len(instruments)} successful")
    if failed:
        print(f"   Failed: {', '.join(failed)}")
    
    return {
        "category": category,
        "total": len(instruments),
        "success": success_count,
        "failed": failed,
        "data": all_data
    }

print("✅ Download functions defined")

✅ Download functions defined


## 4. Execute Downloads

### 4.1 Download Equities
Download 45 S&P 500 component stocks.

In [4]:
# Download Equities
equities_result = download_batch(EQUITIES, "Equities", DATA_DIR / "equities")


📥 Downloading Equities: 45 instruments
   [ 1/45] AAPL       - Apple Inc                 ✅ 1771 rows
   [ 2/45] MSFT       - Microsoft Corp            ✅ 1771 rows
   [ 3/45] GOOGL      - Alphabet Inc              ✅ 1771 rows
   [ 4/45] NVDA       - NVIDIA Corp               ✅ 1771 rows
   [ 5/45] META       - Meta Platforms            ✅ 1771 rows
   [ 6/45] AVGO       - Broadcom Inc              ✅ 1771 rows
   [ 7/45] ORCL       - Oracle Corp               ✅ 1771 rows
   [ 8/45] CRM        - Salesforce Inc            ✅ 1771 rows
   [ 9/45] JPM        - JPMorgan Chase            ✅ 1771 rows
   [10/45] BAC        - Bank of America           ✅ 1771 rows
   [11/45] GS         - Goldman Sachs             ✅ 1771 rows
   [12/45] MS         - Morgan Stanley            ✅ 1771 rows
   [13/45] BLK        - BlackRock Inc             ✅ 1771 rows
   [14/45] C          - Citigroup Inc             ✅ 1771 rows
   [15/45] JNJ        - Johnson & Johnson         ✅ 1771 ro

### 4.2 Download ETFs
Download major index and sector ETFs.

In [5]:
# Download ETFs
etfs_result = download_batch(ETFS, "ETFs", DATA_DIR / "etfs")


📥 Downloading ETFs: 10 instruments
   [ 1/10] SPY        - S&P 500 ETF               ✅ 1771 rows
   [ 2/10] QQQ        - Nasdaq 100 ETF            ✅ 1771 rows
   [ 3/10] IWM        - Russell 2000 ETF          ✅ 1771 rows
   [ 4/10] DIA        - Dow Jones ETF             ✅ 1771 rows
   [ 5/10] XLF        - Financial Select Sector   ✅ 1771 rows
   [ 6/10] XLK        - Technology Select Sector  ✅ 1771 rows
   [ 7/10] XLE        - Energy Select Sector      ✅ 1771 rows
   [ 8/10] XLV        - Healthcare Select Sector  ✅ 1771 rows
   [ 9/10] GLD        - Gold ETF                  ✅ 1771 rows
   [10/10] TLT        - 20+ Year Treasury ETF     ✅ 1771 rows

📊 ETFs Summary: 10/10 successful


### 4.3 Download FX Pairs

In [6]:
# Download FX Pairs
fx_result = download_batch(FX_PAIRS, "FX Pairs", DATA_DIR / "fx")


📥 Downloading FX Pairs: 4 instruments
   [ 1/4] EURUSD=X   - EUR/USD                   ✅ 1834 rows
   [ 2/4] GBPUSD=X   - GBP/USD                   ✅ 1834 rows
   [ 3/4] USDJPY=X   - USD/JPY                   ✅ 1834 rows
   [ 4/4] AUDUSD=X   - AUD/USD                   ✅ 1834 rows

📊 FX Pairs Summary: 4/4 successful


### 4.4 Download Treasury Yields

In [7]:
# Download Treasury Yields
treasuries_result = download_batch(TREASURIES, "Treasuries", DATA_DIR / "fixed_income")


📥 Downloading Treasuries: 4 instruments
   [ 1/4] ^TNX       - 10-Year Treasury Yield    ✅ 1771 rows
   [ 2/4] ^IRX       - 3-Month Treasury Yield    ✅ 1771 rows
   [ 3/4] ^FVX       - 5-Year Treasury Yield     ✅ 1771 rows
   [ 4/4] ^TYX       - 30-Year Treasury Yield    ✅ 1771 rows

📊 Treasuries Summary: 4/4 successful


## 5. Create Combined Dataset

Create a consolidated panel dataset for easy cross-sectional analysis.
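
One detail worth illustrating: building a DataFrame from a dict of Series aligns them on the union of their dates, which is consistent with the combined panel having more rows (1,836) than any single equity file (1,771). The sketch below uses synthetic series named `EQ` and `FX` (hypothetical, not from the downloaded data) to show the resulting gaps and two possible ways of handling them.

```python
import pandas as pd

# Synthetic series: "EQ" skips a market holiday, "FX" trades on it
eq = pd.Series(
    [100.0, 101.0, 102.0],
    index=pd.to_datetime(["2024-01-12", "2024-01-16", "2024-01-17"]),
)
fx = pd.Series(
    [1.10, 1.11, 1.12, 1.13],
    index=pd.to_datetime(["2024-01-12", "2024-01-15", "2024-01-16", "2024-01-17"]),
)

# A dict of Series is aligned on the union of the indexes; gaps become NaN
panel = pd.DataFrame({"EQ": eq, "FX": fx}).sort_index()
print(panel)

# Option 1: keep only the dates on which the equity traded
equity_days = panel.loc[panel["EQ"].notna()]

# Option 2: carry the last known value forward across the gap
filled = panel.ffill()
print(filled)
```

Which option fits depends on the analysis: forward-filling keeps every date but produces zero returns on the filled days, while filtering to a common calendar drops observations for the instruments that did trade.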

In [8]:
# Create combined price panel (Adjusted Close only)
all_adj_close = {}

# Combine all downloaded data
for result in [equities_result, etfs_result, fx_result, treasuries_result]:
    for ticker, data in result["data"].items():
        clean_ticker = ticker.replace("=X", "").replace("^", "")
        if "Adj Close" in data.columns:
            all_adj_close[clean_ticker] = data["Adj Close"]
        elif "Close" in data.columns:
            all_adj_close[clean_ticker] = data["Close"]

# Create DataFrame
combined_prices = pd.DataFrame(all_adj_close)
combined_prices.index = pd.to_datetime(combined_prices.index)
combined_prices = combined_prices.sort_index()

# Save combined dataset
combined_prices.to_csv(DATA_DIR / "combined_adjusted_close.csv")

print(f"✅ Combined dataset created")
print(f"   Shape: {combined_prices.shape}")
print(f"   Date range: {combined_prices.index[0].strftime('%Y-%m-%d')} to {combined_prices.index[-1].strftime('%Y-%m-%d')}")
print(f"   Instruments: {combined_prices.columns.tolist()[:10]}...")

✅ Combined dataset created
   Shape: (1836, 63)
   Date range: 2019-01-01 to 2026-01-19
   Instruments: ['AAPL', 'MSFT', 'GOOGL', 'NVDA', 'META', 'AVGO', 'ORCL', 'CRM', 'JPM', 'BAC']...


## 6. Download Summary

Final summary of all downloaded data.

In [9]:
# Final Summary
print("=" * 60)
print("📊 DATA DOWNLOAD COMPLETE")
print("=" * 60)

total_success = sum([r["success"] for r in [equities_result, etfs_result, fx_result, treasuries_result]])
total_attempted = sum([r["total"] for r in [equities_result, etfs_result, fx_result, treasuries_result]])

print(f"\n✅ Successfully downloaded: {total_success}/{total_attempted} instruments")
print(f"\n📁 Data saved to: {DATA_DIR.absolute()}")
print(f"\n📂 Directory structure:")
print(f"   {DATA_DIR}/")
print(f"   ├── equities/       ({equities_result['success']} files)")
print(f"   ├── etfs/           ({etfs_result['success']} files)")
print(f"   ├── fx/             ({fx_result['success']} files)")
print(f"   ├── fixed_income/   ({treasuries_result['success']} files)")
print(f"   └── combined_adjusted_close.csv")

print(f"\n📈 Combined dataset: {combined_prices.shape[0]:,} rows × {combined_prices.shape[1]} instruments")
print(f"   Date range: {combined_prices.index.min().strftime('%Y-%m-%d')} to {combined_prices.index.max().strftime('%Y-%m-%d')}")

# Quick data preview
print(f"\n📋 Sample data (last 5 days):")
combined_prices.tail().round(2)

📊 DATA DOWNLOAD COMPLETE

✅ Successfully downloaded: 63/63 instruments

📁 Data saved to: /Users/anto/Learning_Trading_ML/ML-Quant-Finance-Mastery/02_Daily_Coding/datasets/raw_data

📂 Directory structure:
   raw_data/
   ├── equities/       (45 files)
   ├── etfs/           (10 files)
   ├── fx/             (4 files)
   ├── fixed_income/   (4 files)
   └── combined_adjusted_close.csv

📈 Combined dataset: 1,836 rows × 63 instruments
   Date range: 2019-01-01 to 2026-01-19

📋 Sample data (last 5 days):


| Date | AAPL | MSFT | GOOGL | NVDA | META | AVGO | ORCL | CRM | JPM | BAC | ... | GLD | TLT | EURUSD | GBPUSD | USDJPY | AUDUSD | TNX | IRX | FVX | TYX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2026-01-13 | 261.05 | 470.67 | 335.97 | 185.81 | 631.09 | 354.61 | 202.29 | 241.06 | 310.9 | 54.54 | ... | 421.63 | 87.82 | 1.17 | 1.35 | 157.99 | 0.67 | 4.17 | 3.56 | 3.74 | 4.83 |
| 2026-01-14 | 259.96 | 459.38 | 335.84 | 183.14 | 615.52 | 339.89 | 193.61 | 239.57 | 307.87 | 52.48 | ... | 425.94 | 88.33 | 1.16 | 1.34 | 159.18 | 0.67 | 4.14 | 3.56 | 3.72 | 4.8 |
| 2026-01-15 | 258.21 | 456.66 | 332.78 | 187.05 | 620.8 | 343.02 | 189.85 | 233.53 | 309.26 | 52.59 | ... | 423.33 | 88.31 | 1.16 | 1.34 | 158.4 | 0.67 | 4.16 | 3.57 | 3.76 | 4.79 |
| 2026-01-16 | 255.53 | 459.86 | 330.0 | 186.23 | 620.25 | 351.71 | 191.09 | 227.11 | 312.47 | 52.97 | ... | 421.29 | 87.8 | 1.16 | 1.34 | 158.6 | 0.67 | 4.23 | 3.56 | 3.83 | 4.84 |
| 2026-01-19 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 1.16 | 1.34 | 157.54 | 0.67 | NaN | NaN | NaN | NaN |
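
For later notebooks, the saved panel can be read back with the dates restored as the index. The sketch below is a minimal round trip on a two-column stand-in for `combined_prices`, written to a temporary directory rather than the real `raw_data` folder. The two rows reuse values from the preview above, including the missing AAPL price on 2026-01-19.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for combined_prices; the real panel has 63 columns
panel = pd.DataFrame(
    {"AAPL": [255.53, float("nan")], "EURUSD": [1.16, 1.16]},
    index=pd.to_datetime(["2026-01-16", "2026-01-19"]),
)
panel.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "combined_adjusted_close.csv"
    panel.to_csv(path)
    # index_col=0 restores the first column as the index;
    # parse_dates=True converts it back to a DatetimeIndex
    reloaded = pd.read_csv(path, index_col=0, parse_dates=True)

print(reloaded.index.dtype)
print(reloaded.isna().sum().to_dict())
```

Missing prices survive the round trip as NaN, so any gap handling still has to be applied after loading.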
