# Stage 04: Data Acquisition and Ingestion
**Project:** Turtle Trading Strategy Research  
**Author:** Panwei Hu  
**Date:** 2025-08-17

## Objectives
- Acquire multi-asset time series data for Turtle Trading backtests
- Focus on a liquid, diversified universe: Equity ETFs, Bond ETFs, Commodity ETFs, Currency ETFs
- Implement robust API ingestion with fallbacks
- Validate data quality and save to the project data directory
- Support both Alpha Vantage API and yfinance as data sources
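The fallback objective can be illustrated with a small sketch. It assumes each data source is wrapped in a function that takes a ticker and returns a list of `(date, adj_close)` rows; `fetch_with_fallback`, `fake_alpha`, and `fake_yfinance` are hypothetical names introduced for illustration and are not part of this notebook or of either library.

```python
from typing import Callable, Iterable, Optional

# A fetcher takes a ticker and returns a list of (date, adj_close) rows.
Rows = list[tuple[str, float]]
Fetcher = Callable[[str], Rows]


def fetch_with_fallback(
    symbol: str, sources: Iterable[tuple[str, Fetcher]]
) -> tuple[Optional[str], Rows]:
    """Try each (name, fetcher) pair in order; return the first non-empty result."""
    for name, fetch in sources:
        try:
            rows = fetch(symbol)
        except Exception as exc:  # network errors, rate limits, parse failures
            print(f"{name} failed for {symbol}: {exc}")
            continue
        if rows:
            return name, rows
        print(f"{name} returned no rows for {symbol}")
    return None, []


# Hypothetical stand-ins for the real Alpha Vantage and yfinance calls.
def fake_alpha(symbol: str) -> Rows:
    raise RuntimeError("rate limit reached")


def fake_yfinance(symbol: str) -> Rows:
    return [("2025-08-15", 100.0), ("2025-08-18", 101.5)]


source, rows = fetch_with_fallback(
    "SPY", [("alpha", fake_alpha), ("yfinance", fake_yfinance)]
)
print(source, len(rows))
```

Ordering the sources as a list keeps the preference explicit, and a symbol is only reported as missing once every source has been tried.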

## Data Universe for Turtle Trading
The Turtle Trading system requires a diversified set of liquid instruments across multiple asset classes to achieve proper risk diversification and capture trends across different markets.

In [1]:
import os, pathlib, datetime as dt
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import sys
sys.path.append('../src')

# Set up project paths
PROJECT_ROOT = pathlib.Path('..').resolve()
DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Load environment variables from project root
load_dotenv(PROJECT_ROOT / '.env')

print("🐢 Turtle Trading Data Acquisition")
print(f"Project Root: {PROJECT_ROOT}")
print(f"Data Directory: {RAW_DIR}")
print(f"Alpha Vantage API Key: {'✅ Present' if os.getenv('ALPHAVANTAGE_API_KEY') else '❌ Missing'}")


🐢 Turtle Trading Data Acquisition
Project Root: /Users/panweihu/Desktop/Desktop_m1/NYU_mfe/bootcamp/camp4/bootcamp_bill_panwei_hu/turtle_project
Data Directory: /Users/panweihu/Desktop/Desktop_m1/NYU_mfe/bootcamp/camp4/bootcamp_bill_panwei_hu/turtle_project/data/raw
Alpha Vantage API Key: ✅ Present


In [2]:
# Utility functions for data acquisition
def ts():
    """Generate timestamp for unique filenames"""
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

def save_csv(df: pd.DataFrame, prefix: str, **meta):
    """Save DataFrame with metadata embedded in filename"""
    mid = '_'.join([f"{k}-{v}" for k,v in meta.items()])
    filename = f"{prefix}_{mid}_{ts()}.csv"
    path = RAW_DIR / filename
    df.to_csv(path, index=False)
    print(f"💾 Saved: {filename}")
    return path

def validate(df: pd.DataFrame, required_cols: list):
    """Validate DataFrame structure and content"""
    return {
        'shape': df.shape,
        'missing_cols': [c for c in required_cols if c not in df.columns],
        'total_nulls': df.isnull().sum().sum(),
        'dtypes': df.dtypes.to_dict()
    }

print("✅ Utility functions loaded")


✅ Utility functions loaded
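The `validate` helper checks structure and total nulls. For price series it may also be useful to look at each symbol separately. The following is a minimal sketch, assuming a long-format frame with `date`, `adj_close`, and `symbol` columns; `quality_report` and the sample data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-symbol checks on a long-format frame with date, adj_close, symbol columns."""
    records = []
    for symbol, group in df.groupby('symbol'):
        records.append({
            'symbol': symbol,
            'rows': len(group),
            # Repeated dates usually mean a bad merge or a double download
            'duplicate_dates': int(group['date'].duplicated().sum()),
            # Breakout and rolling-window logic expects chronological order
            'dates_sorted': bool(group['date'].is_monotonic_increasing),
            # Zero or negative prices break return calculations
            'non_positive_prices': int((group['adj_close'] <= 0).sum()),
            'null_prices': int(group['adj_close'].isna().sum()),
        })
    return pd.DataFrame(records).set_index('symbol')


# Made-up sample: AAA has a repeated date, BBB has a zero price
sample = pd.DataFrame({
    'date': pd.to_datetime(
        ['2025-01-02', '2025-01-03', '2025-01-03', '2025-01-02', '2025-01-03']
    ),
    'adj_close': [10.0, 10.5, 10.5, 0.0, 20.0],
    'symbol': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
})
report = quality_report(sample)
print(report)
```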


## Multi-Asset Universe for Turtle Trading

The original Turtle traders focused on futures across multiple asset classes. For this research, we'll use liquid ETFs as proxies to capture the same diversification and trend-following opportunities across:

- **Equity Markets**: Broad market exposure across developed and emerging markets
- **Fixed Income**: Government and corporate bonds across the duration spectrum
- **Commodities**: Precious metals, energy, and agricultural exposure
- **Currencies**: Euro and yen exposure against the US dollar, plus a broad US dollar index fund

This universe provides the diversification needed for robust trend-following while maintaining high liquidity for realistic backtesting.

In [3]:
# Turtle Trading Asset Universe - Diversified ETF Portfolio
TURTLE_UNIVERSE = {
    'equity_us': ['SPY', 'QQQ', 'IWM'],  # Large, Tech, Small Cap
    'equity_intl': ['EFA', 'EEM', 'VEA'],  # Developed (EAFE), Emerging, Developed ex-US
    'fixed_income': ['TLT', 'IEF', 'LQD', 'HYG'],  # Long Treasury, Intermediate, IG Corp, HY
    'commodities': ['GLD', 'SLV', 'USO', 'UNG', 'DBA'],  # Gold, Silver, Oil, Gas, Agriculture
    'currencies': ['FXE', 'FXY', 'UUP']  # Euro, Yen, USD Bull
}

# Flatten to single list for data acquisition
ALL_SYMBOLS = []
for category, symbols in TURTLE_UNIVERSE.items():
    ALL_SYMBOLS.extend(symbols)

print(f"🎯 Turtle Trading Universe: {len(ALL_SYMBOLS)} instruments")
for category, symbols in TURTLE_UNIVERSE.items():
    print(f"   {category}: {symbols}")
    
print(f"\n📊 Total symbols to acquire: {ALL_SYMBOLS}")


🎯 Turtle Trading Universe: 18 instruments
   equity_us: ['SPY', 'QQQ', 'IWM']
   equity_intl: ['EFA', 'EEM', 'VEA']
   fixed_income: ['TLT', 'IEF', 'LQD', 'HYG']
   commodities: ['GLD', 'SLV', 'USO', 'UNG', 'DBA']
   currencies: ['FXE', 'FXY', 'UUP']

📊 Total symbols to acquire: ['SPY', 'QQQ', 'IWM', 'EFA', 'EEM', 'VEA', 'TLT', 'IEF', 'LQD', 'HYG', 'GLD', 'SLV', 'USO', 'UNG', 'DBA', 'FXE', 'FXY', 'UUP']
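Tagging each record with its asset class needs a lookup from symbol to category. One way to do that, shown here as a sketch, is to invert the universe dictionary once. `SYMBOL_TO_CATEGORY` is a hypothetical name that does not appear elsewhere in this notebook.

```python
TURTLE_UNIVERSE = {
    'equity_us': ['SPY', 'QQQ', 'IWM'],
    'equity_intl': ['EFA', 'EEM', 'VEA'],
    'fixed_income': ['TLT', 'IEF', 'LQD', 'HYG'],
    'commodities': ['GLD', 'SLV', 'USO', 'UNG', 'DBA'],
    'currencies': ['FXE', 'FXY', 'UUP'],
}

# Invert category -> symbols into symbol -> category
SYMBOL_TO_CATEGORY = {
    symbol: category
    for category, symbols in TURTLE_UNIVERSE.items()
    for symbol in symbols
}

# A symbol listed under two categories would silently collapse to one entry,
# so compare the inverted size against the total number of listed symbols.
total_listed = sum(len(symbols) for symbols in TURTLE_UNIVERSE.values())
if len(SYMBOL_TO_CATEGORY) != total_listed:
    raise ValueError("A symbol appears in more than one category")

print(SYMBOL_TO_CATEGORY['GLD'])
print(len(SYMBOL_TO_CATEGORY))
```

With this mapping, a pandas column of symbols could be tagged in one step with `df['symbol'].map(SYMBOL_TO_CATEGORY)`.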


In [6]:

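# Robust Multi-Source Data Acquisition
# Alpha Vantage is available whenever a key is present in the environment...
USE_ALPHA = bool(os.getenv('ALPHAVANTAGE_API_KEY'))
# ...but this run forces yfinance. Comment out the next line to use Alpha Vantage.
USE_ALPHA = False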
all_data = []

print(f"🚀 Starting data acquisition for {len(ALL_SYMBOLS)} symbols...")
print(f"Using {'Alpha Vantage API' if USE_ALPHA else 'yfinance (Yahoo Finance)'}")

for symbol in ALL_SYMBOLS:
    print(f"Fetching {symbol}...", end=' ')
    try:
        if USE_ALPHA:
            # Alpha Vantage API logic (fixed version from homework)
            API_KEY = os.getenv('ALPHAVANTAGE_API_KEY')
            
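            # NOTE: outputsize=compact returns only the latest 100 data points;
            # outputsize=full would be needed to match the 2-year yfinance history.
            url = f"https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol={symbol}&outputsize=compact&apikey={API_KEY}"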
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            data = response.json()
            
            # Handle API responses
            if 'Error Message' in data:
                print(f"❌ API Error: {data['Error Message']}")
                continue
            if 'Note' in data:
                print(f"⚠️  Rate Limit: {data['Note']}")
                continue
            if 'Information' in data:
                print(f"ℹ️  Info: {data['Information']}")
                continue
                
            # Find time series data
            time_series_key = None
            for key in data.keys():
                if 'Time Series' in key:
                    time_series_key = key
                    break
                    
            if not time_series_key:
                print("❌ No time series data found")
                continue
                
            # Process data
            df = pd.DataFrame(data[time_series_key]).T.reset_index()
            
            # Handle different column names
            if '5. adjusted close' in df.columns:
                df = df.rename(columns={'index': 'date', '5. adjusted close': 'adj_close'})
            elif '4. close' in df.columns:
                df = df.rename(columns={'index': 'date', '4. close': 'adj_close'})
            else:
                print(f"❌ Unexpected columns: {list(df.columns)}")
                continue
                
            df['date'] = pd.to_datetime(df['date'])
            df['adj_close'] = pd.to_numeric(df['adj_close'])
            df = df[['date', 'adj_close']].copy()
            
        else:
            # yfinance logic (fixed version from homework)
            import yfinance as yf
            df = yf.download(symbol, period='2y', interval='1d', progress=False, auto_adjust=True)
            
            if df.empty:
                print("❌ No data")
                continue
                
            df = df.reset_index()
            
            # With auto_adjust=True, 'Close' column contains adjusted prices
            if 'Close' in df.columns and 'Date' in df.columns:
                df = df[['Date', 'Close']].copy()
                df.columns = ['date', 'adj_close']
            else:
                print(f"❌ Unexpected columns: {list(df.columns)}")
                continue
        
        if df.empty:
            print("❌ Empty data")
            continue
            
        df['symbol'] = symbol
        
        # Add asset category for analysis
        category = None
        for cat, syms in TURTLE_UNIVERSE.items():
            if symbol in syms:
                category = cat
                break
        df['asset_category'] = category
        
        all_data.append(df)
        print(f"✅ {len(df)} records")
        
    except Exception as e:
        print(f"❌ Error: {e}")
        continue

# Process and save results
if all_data:
    df_turtle_data = pd.concat(all_data, ignore_index=True)
    print(f"\n🎉 SUCCESS! Retrieved {len(df_turtle_data):,} total records for {df_turtle_data['symbol'].nunique()} symbols")
    
    # Validation
    v_turtle = validate(df_turtle_data, ['date','adj_close','symbol', 'asset_category'])
    print(f"📊 Validation: {v_turtle}")
    
    # Save data
    df_sorted = df_turtle_data.sort_values(['symbol', 'date']).reset_index(drop=True)
    saved_path = save_csv(df_sorted, 'turtle_universe', 
                         source='alpha' if USE_ALPHA else 'yfinance', 
                         assets='multi', count=len(ALL_SYMBOLS))
    
    # Show summary by asset category
    print("\n📈 Data Summary by Asset Category:")
    summary = df_sorted.groupby(['asset_category', 'symbol']).agg({
        'date': ['min', 'max'], 
        'adj_close': ['count', 'mean']
    }).round(2)
    print(summary)
    
else:
    print("❌ No data retrieved! Check your internet connection or API keys.")
    df_turtle_data = pd.DataFrame(columns=['date', 'adj_close', 'symbol', 'asset_category'])


🚀 Starting data acquisition for 18 symbols...
Using yfinance (Yahoo Finance)
Fetching SPY... ✅ 502 records
Fetching QQQ... ✅ 502 records
Fetching IWM... ✅ 502 records
Fetching EFA... ✅ 502 records
Fetching EEM... ✅ 502 records
Fetching VEA... ✅ 502 records
Fetching TLT... ✅ 502 records
Fetching IEF... ✅ 502 records
Fetching LQD... ✅ 502 records
Fetching HYG... ✅ 502 records
Fetching GLD... ✅ 502 records
Fetching SLV... ✅ 502 records
Fetching USO... ✅ 502 records
Fetching UNG... ✅ 502 records
Fetching DBA... ✅ 502 records
Fetching FXE... ✅ 502 records
Fetching FXY... ✅ 502 records
Fetching UUP... ✅ 502 records

🎉 SUCCESS! Retrieved 9,036 total records for 18 symbols
📊 Validation: {'shape': (9036, 4), 'missing_cols': [], 'total_nulls': np.int64(0), 'dtypes': {'date': dtype('<M8[ns]'), 'adj_close': dtype('float64'), 'symbol': dtype('O'), 'asset_category': dtype('O')}}
💾 Saved: turtle_universe_source-yfinance_assets-multi_count-18_20250820-102058.csv

📈 Data Summary by Asset Category:
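Because `save_csv` embeds the metadata and a timestamp in the filename, a later stage could find the newest snapshot without opening any files. The sketch below assumes the naming pattern shown in the saved filename above, and that metadata keys and values contain no underscores or extra hyphens. `parse_snapshot_name` and `latest_snapshot` are hypothetical helpers, not part of this notebook.

```python
import pathlib
import tempfile
from datetime import datetime
from typing import Optional


def parse_snapshot_name(path: pathlib.Path) -> dict:
    """Split '<prefix>_<key>-<value>_..._<YYYYmmdd-HHMMSS>.csv' into its parts."""
    tokens = path.stem.split('_')
    saved_at = datetime.strptime(tokens[-1], '%Y%m%d-%H%M%S')
    # Tokens containing a hyphen are key-value metadata; the rest form the prefix
    meta = dict(tok.split('-', 1) for tok in tokens[:-1] if '-' in tok)
    prefix = '_'.join(tok for tok in tokens[:-1] if '-' not in tok)
    return {'prefix': prefix, 'saved_at': saved_at, **meta}


def latest_snapshot(raw_dir: pathlib.Path, prefix: str) -> Optional[pathlib.Path]:
    """Return the most recently stamped CSV for a prefix, or None if there is none."""
    candidates = list(raw_dir.glob(f'{prefix}_*.csv'))
    if not candidates:
        return None
    return max(candidates, key=lambda p: parse_snapshot_name(p)['saved_at'])


# Demonstration against empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    raw = pathlib.Path(tmp)
    for stamp in ('20250819-090000', '20250820-102058'):
        name = f'turtle_universe_source-yfinance_assets-multi_count-18_{stamp}.csv'
        (raw / name).touch()
    newest = latest_snapshot(raw, 'turtle_universe')
    info = parse_snapshot_name(newest)

print(newest.name)
print(info['source'], info['count'])
```

Note that parsed metadata values come back as strings, so `count` would need an explicit `int(...)` conversion before numeric use.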
    