# ETL Pipeline for Multi-Asset Portfolio Analysis

This project develops a simple ETL pipeline to transform raw financial market data into actionable investment insights. Using Yahoo Finance as the primary data source, the system extracts price and volume data for a diversified portfolio spanning multiple asset classes including equities, bonds, commodities, and volatility instruments.

## YFinance Setup

The yfinance library is not officially affiliated with Yahoo Finance. It relies on web scraping techniques and may need to be tweaked or updated to keep working. Here we define some utility functions to avoid detection. The methods are very similar to those discussed in class for BeautifulSoup and Selenium.

In [19]:
# General basic imports for the analysis
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

In [2]:
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure yfinance with custom session and headers
def setup_yfinance_session():
    """Set up a robust session for yfinance with headers and retry logic."""
    session = requests.Session()
    
    # Custom headers to avoid detection
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }
    session.headers.update(headers)
    
    # Retry strategy
    retry_strategy = Retry(
        total=3,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=1
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    
    return session

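# Apply the session globally to yfinance
# NOTE: `_get_data_session` is not a documented yfinance hook, so this assignment may have
# no effect; the documented route is the `session=` argument of `yf.Ticker` and `yf.download`.
custom_session = setup_yfinance_session()
yf._get_data_session = lambda: custom_session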

print("Custom yfinance session configured!")

Custom yfinance session configured!


Let's test this solution by looking up a stock. The [Lookup API](https://yfinance-python.org/reference/api/yfinance.Lookup.html#yfinance.Lookup) queries Yahoo Finance for tickers:

In [3]:
lookup_res = yf.Lookup("AAPL")
lookup_res.get_stock().head()

symbol,exchange,industryLink,industryName,quoteType,rank,regularMarketChange,regularMarketPercentChange,regularMarketPrice,shortName
AAPL,NMS,https://finance.yahoo.com/sector/technology,Technology,equity,33036.0,7.360001,3.203901,237.080002,Apple Inc.
AAPL.NE,NEO,https://finance.yahoo.com/sector/technology,Technology,equity,20011.0,1.130001,3.414932,34.220001,APPLE CDR (CAD HEDGED)
AAPLUSTRAD.BO,BSE,https://finance.yahoo.com/sector/industrials,Industrials,equity,20002.0,0.0,0.0,0.84,AA Plus Tradelink Limited
AAPL34.SA,SAO,https://finance.yahoo.com/sector/technology,Technology,equity,20002.0,1.290005,2.030864,64.800003,APPLE DRN
AAPL.BA,BUE,https://finance.yahoo.com/sector/technology,Technology,equity,20002.0,600.0,3.821656,16300.0,APPLE INC CEDEAR(REPR 1/20 SHR)


## Financial Instruments Perimeter

In this section we define a diversified set of financial instruments to capture performance across asset classes and market segments. These instruments were selected using my own expertise and in consultation with Claude Sonnet 4. The aim is simply to have a small but representative set, with equity indices for growth exposure, sector ETFs for tactical allocation, fixed income securities for stability, alternative assets for diversification, and volatility instruments for risk management. This approach should enable risk-return analysis and correlation studies across different market environments.

In [88]:
# Portfolio tickers for yfinance
tickers_list = [
    # Equity Indices
    "SPY",  # SPDR S&P 500 ETF Trust (US Large Cap)
    "QQQ",  # Invesco QQQ Trust (NASDAQ-100/Technology Heavy)
    "IWM",  # iShares Russell 2000 ETF (US Small Cap)
    "EFA",  # iShares MSCI EAFE ETF (International Developed Markets)
    "EEM",  # iShares MSCI Emerging Markets ETF (Emerging Markets)
    "FXI",  # iShares China Large-Cap ETF (Hong Kong-listed Chinese large caps)
    # Sector ETFs
    "XLF",  # Financial Select Sector SPDR Fund
    "XLK",  # Technology Select Sector SPDR Fund
    "XLE",  # Energy Select Sector SPDR Fund
    "XLV",  # Health Care Select Sector SPDR Fund
    "XLI",  # Industrial Select Sector SPDR Fund
    # Fixed Income
    "TLT",  # iShares 20+ Year Treasury Bond ETF (Long Duration)
    "SHY",  # iShares 1-3 Year Treasury Bond ETF (Short Duration)
    # Alternative Assets
    "GLD",  # SPDR Gold Trust (Precious Metals)
    "SLV",  # iShares Silver Trust (Industrial Precious Metals)
    "DBC",  # Invesco DB Commodity Index Tracking Fund (Broad Commodities)
    # Risk & Currency
    "VIX",  # Intended as the CBOE Volatility Index (note: its Yahoo Finance symbol is "^VIX")
    "UUP",  # Invesco DB US Dollar Index Bullish Fund (US Dollar Strength)
    "EURUSD=X",  # Euro/US Dollar
    "JPYUSD=X",  # Japanese Yen/US Dollar
    "^XDE", # Euro Currency Index
]
tickers = [yf.Ticker(ticker_str) for ticker_str in tickers_list]

### Financial Instruments Table

We can create a **relational table** describing our instruments by using the `ticker.get_info` method. The symbol (ticker) is our primary key.

In [89]:
# Create a DataFrame with ticker information
ticker_infos = {}

print("Fetching ticker information...")
for symbol, ticker_obj in zip(tickers_list, tickers):
    try:
        ticker_infos[symbol] = ticker_obj.get_info()
        # Small delay to avoid rate limiting
        time.sleep(0.1)
    except Exception as e:
        print(f"Error fetching {symbol}: {e}")
        # Continue with next ticker even if one fails

# Convert to DataFrame
portfolio_info_df = pd.DataFrame.from_dict(ticker_infos, orient='index')

print(f"Successfully fetched information for {len(ticker_infos)} tickers")

Fetching ticker information...
Successfully fetched information for 21 tickers


Writing and displaying this table:

In [90]:
portfolio_info_df["symbol"] = portfolio_info_df.index
portfolio_info_df.reset_index(drop=True, inplace=True)

# Write portfolio to file
portfolio_info_df.to_csv("../data/portfolio_info.csv")

# Display basic information about our portfolio
key_columns = [
    "symbol",
    "shortName",
    "longName",
    "exchange",
    "quoteType",
    "currency",
    "marketCap",
]
available_columns = [col for col in key_columns if col in portfolio_info_df.columns]

portfolio_info_df[available_columns].head()

,symbol,shortName,longName,exchange,quoteType,currency,marketCap
0,SPY,SPDR S&P 500,SPDR S&P 500 ETF,PCX,ETF,USD,590813000000.0
1,QQQ,"Invesco QQQ Trust, Series 1",Invesco QQQ Trust,NGM,ETF,USD,224094500000.0
2,IWM,iShares Russell 2000 ETF,iShares Russell 2000 ETF,PCX,ETF,USD,65672950000.0
3,EFA,iShares MSCI EAFE ETF,iShares MSCI EAFE ETF,PCX,ETF,USD,85131650000.0
4,EEM,iShares MSCI Emerging Index Fun,iShares MSCI Emerging Markets ETF,PCX,ETF,USD,37627520000.0
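
Since the symbol is meant to act as a primary key, it can be worth checking that it is present, non-null, and unique before writing the table. The following is a minimal sketch on a toy table; `check_primary_key` and `toy_info` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd


def check_primary_key(df: pd.DataFrame, key: str) -> None:
    """Raise ValueError if `key` is absent, contains nulls, or has duplicates."""
    if key not in df.columns:
        raise ValueError(f"column {key!r} not found")
    if df[key].isna().any():
        raise ValueError(f"column {key!r} contains null values")
    # keep=False marks every member of a duplicated group, not only the repeats
    dupes = df.loc[df[key].duplicated(keep=False), key].unique().tolist()
    if dupes:
        raise ValueError(f"duplicate keys in {key!r}: {dupes}")


# Toy stand-in for portfolio_info_df (values are illustrative only)
toy_info = pd.DataFrame(
    {"symbol": ["SPY", "QQQ", "IWM"], "quoteType": ["ETF", "ETF", "ETF"]}
)
check_primary_key(toy_info, "symbol")  # passes silently when the key is valid
```

In the notebook the same check could be applied to `portfolio_info_df` right before the `to_csv` call.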


## Downloading Historical Financial Data

We can now download financial data. We initially select a very wide time window to **extract and store all possible raw data**. Potential issues will be identified and handled at a later stage.

In [91]:
# Download historical data for all portfolio tickers (1990-today)
start_date = "1990-01-01"
end_date = datetime.now().strftime("%Y-%m-%d")

print(f"Date range: {start_date} to {end_date}")

Date range: 1990-01-01 to 2025-09-03


In [92]:
print(f"Downloading historical data from {start_date} to {end_date}...")

try:
    # Use space-separated string of tickers for bulk download
    tickers_string = " ".join(tickers_list)
    
    # Download with multi-level columns
    portfolio_data = yf.download(
        tickers_string,
        start=start_date,
        end=end_date,
        auto_adjust=True,  # Adjust OHLC prices for splits and dividends
        prepost=False,     # Only regular trading hours
        threads=True       # Use threading for faster downloads
    )
    
    print(f"Successfully downloaded data for {len(tickers_list)} tickers")
    print(f"Date range: {portfolio_data.index.min()} to {portfolio_data.index.max()}")
    
except Exception as e:
    print(f"Error downloading bulk data: {str(e)}")

Downloading historical data from 1990-01-01 to 2025-09-03...


[*********************100%***********************]  21 of 21 completed

Successfully downloaded data for 21 tickers
Date range: 1993-01-29 00:00:00 to 2025-09-03 00:00:00





In [94]:
# Display basic information about the downloaded data
print("Portfolio data overview:")
print(f"Shape: {portfolio_data.shape}")
print(f"Financial data (first level): {portfolio_data.columns.get_level_values(0).unique().tolist()}")
print(f"Tickers (second level): {portfolio_data.columns.get_level_values(1).unique().tolist()}")

portfolio_data.tail()

Portfolio data overview:
Shape: (8472, 105)
Financial data (first level): ['Close', 'High', 'Low', 'Open', 'Volume']
Tickers (second level): ['DBC', 'EEM', 'EFA', 'EURUSD=X', 'FXI', 'GLD', 'IWM', 'JPYUSD=X', 'QQQ', 'SHY', 'SLV', 'SPY', 'TLT', 'UUP', 'VIX', 'XLE', 'XLF', 'XLI', 'XLK', 'XLV', '^XDE']


Price,Close,Close,Close,Close,Close,Close,Close,Close,Close,Close,...,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume
Ticker,DBC,EEM,EFA,EURUSD=X,FXI,GLD,IWM,JPYUSD=X,QQQ,SHY,...,SPY,TLT,UUP,VIX,XLE,XLF,XLI,XLK,XLV,^XDE
Date
2025-08-28,22.18,50.099998,92.019997,1.164795,38.560001,315.029999,236.220001,0.006788,577.080017,82.652222,...,61519500.0,32205900.0,819400.0,,12043100.0,28837300.0,7166700.0,6190300.0,9165700.0,0.0
2025-08-29,22.209999,49.860001,91.480003,1.168156,38.91,318.070007,235.169998,0.006812,570.400024,82.722008,...,74522200.0,41686400.0,661500.0,,11872500.0,36062400.0,8806100.0,8740600.0,9655700.0,0.0
2025-09-01,,,,1.16918,,,,0.006798,,,...,,,,,,,,,,
2025-09-02,22.57,49.82,90.580002,1.171591,39.259998,325.589996,233.899994,0.006799,565.619995,82.68,...,81848800.0,48736200.0,930500.0,,12981700.0,38725500.0,9798600.0,9111100.0,13323800.0,0.0
2025-09-03,,,,1.163603,,,,0.006732,,,...,,,,,,,,,,


### Historical Data Table

As with the financial instruments table, we can store the historical series in a simple relational table.
Note that the data frame has two-level (MultiIndex) columns: the price field (`Close`, `High`, `Low`, `Open`, `Volume`) on the first level and the ticker on the second.

A time-series or column-oriented store would be better suited to historical series (a wide-column database such as Cassandra is one option), but in the interest of conciseness we are going to use another CSV file:

In [95]:
portfolio_data.to_csv("../data/portfolio_data.csv")
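
Because the columns have two levels, the CSV file contains two header rows, and reading it back requires telling pandas about them. The sketch below round-trips a toy frame through an in-memory buffer; the tickers `AAA` and `BBB` and all values are made up, and the buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Toy frame with the same layout as portfolio_data: (price field, ticker) columns
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAA", "BBB"]], names=["Price", "Ticker"]
)
index = pd.DatetimeIndex(["2024-01-02", "2024-01-03"], name="Date")
original = pd.DataFrame(
    [[1.5, 2.5, 10.0, 20.0], [1.6, 2.6, 11.0, 21.0]], index=index, columns=columns
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] rebuilds the two column levels; index_col=0 restores the date index
restored = pd.read_csv(buffer, header=[0, 1], index_col=0, parse_dates=True)
close_restored = restored["Close"]  # selecting a first-level field works as before
```

The same `header=[0, 1], index_col=0` arguments should apply when reloading `../data/portfolio_data.csv` in a later notebook.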

## Data Exploration

In this section we are going to perform a preliminary data exploration:

0. **Data Selection**
- Discuss what financial data falls in the perimeter of our analysis

1. **Data Quality Assessment**
- **Missing Data Analysis**: Identify patterns of missing values across tickers and time periods
- **Data Completeness**: Evaluate coverage for each instrument (some ETFs may have shorter histories)
- **Outlier Detection**: Spot anomalous price movements, volume spikes, or data errors
- **Temporal Consistency**: Verify trading day alignment and handle market holidays

2. **Data Profiling**
- **Statistical Summaries**: Generate descriptive statistics for OHLCV data across all tickers
- **Data Types & Formats**: Validate numeric precision and date formatting
- **Cross-Asset Validation**: Compare data ranges and patterns across asset classes
- **Volume Analysis**: Assess liquidity patterns and trading activity

3. **Preliminary Financial Analysis**
- **Price Evolution**: Visualize historical performance across the 1993-2025 period
- **Volatility Patterns**: Identify periods of high market stress (2008, 2020, etc.)
- **Correlation Structure**: Examine relationships between different asset classes
- **Market Regime Analysis**: Detect structural breaks and regime changes

4. **ETL Pipeline Readiness**
- **Data Standardization Needs**: Identify required transformations and normalization
- **Performance Optimization**: Assess data loading and processing efficiency
- **Star Schema Design**: Plan dimensional modeling for the data warehouse
- **Business Logic Validation**: Ensure data integrity for downstream analytics

This exploration will inform our transformation logic and help design robust data quality checks for the production ETL pipeline.

### 0. Data Selection

Using my knowledge of the domain and looking at the available data columns (Open, High, Low, Close, Volume), I choose to simplify the following analysis by focusing on **[Closing Prices](https://www.investopedia.com/terms/c/closingprice.asp)**:

- they are the industry standard when evaluating performance;
- in historical series, most providers show adjusted closing prices that already account for dividends and other corporate actions (here the download used `auto_adjust=True`, so `Close` is already adjusted);
- they were the standard reference long before electronic trading was introduced, and therefore longer historical series are available for closing prices than, for example, for High and Low prices.

In [96]:
close_df = portfolio_data['Close'].copy()

# Reset column names to ensure they're simple strings
close_df.columns.name = None

display(close_df.sample(5))
close_df.info()

Date,DBC,EEM,EFA,EURUSD=X,FXI,GLD,IWM,JPYUSD=X,QQQ,SHY,...,SPY,TLT,UUP,VIX,XLE,XLF,XLI,XLK,XLV,^XDE
2005-01-11,,14.149918,28.837301,1.311699,11.004767,42.209999,46.210575,0.009669,32.717533,55.477074,...,80.627678,45.955395,,,19.770422,16.093983,20.331671,15.580914,21.067389,
2000-06-07,,,,,,,37.138123,0.009471,79.324593,,...,93.676834,,,,16.070974,12.865366,18.746672,40.532639,20.236576,
2004-02-10,,12.529618,25.800739,1.268504,,,44.37685,0.009471,31.492369,55.299229,...,76.883476,42.978859,,,15.819621,15.487915,17.978966,16.024618,22.081377,
1997-07-14,,,,,,,,0.008798,,,...,56.533661,,,,,,,,,
2024-05-17,22.508348,42.357365,77.8134,1.086779,28.404928,223.660004,205.154053,0.006436,448.511719,77.257469,...,521.255188,86.402618,27.326769,,91.195694,41.695553,123.119324,210.007233,143.301788,108.696999


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8472 entries, 1993-01-29 to 2025-09-03
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   DBC       4924 non-null   float64
 1   EEM       5633 non-null   float64
 2   EFA       6039 non-null   float64
 3   EURUSD=X  5646 non-null   float64
 4   FXI       5258 non-null   float64
 5   GLD       5229 non-null   float64
 6   IWM       6354 non-null   float64
 7   JPYUSD=X  7480 non-null   float64
 8   QQQ       6662 non-null   float64
 9   SHY       5811 non-null   float64
 10  SLV       4867 non-null   float64
 11  SPY       8204 non-null   float64
 12  TLT       5811 non-null   float64
 13  UUP       4657 non-null   float64
 14  VIX       787 non-null    float64
 15  XLE       6714 non-null   float64
 16  XLF       6714 non-null   float64
 17  XLI       6714 non-null   float64
 18  XLK       6714 non-null   float64
 19  XLV       6714 non-null   float64
 20  ^XDE      4689 non-null   float64

In [97]:
close_df.to_csv("../data/portfolio_data_close.csv")

### 1. Data Quality Assessment

In [98]:
# Missing Data Analysis

# Missing data by ticker (for Close prices)
missing_by_ticker = close_df.isnull().sum().sort_values(ascending=False)

print("Missing data points by ticker (Close prices):")
for ticker, missing_count in missing_by_ticker.items():
    if missing_count > 0:
        total_possible = len(close_df)
        pct_missing = (missing_count / total_possible) * 100
        print(f"  {ticker}: {missing_count:,} missing ({pct_missing:.1f}%)")


# Check for missing values across all tickers
missing_data = close_df.isnull().sum().sum()
print(f"\nTotal missing data points: {missing_data:,}")

total_data_points = close_df.size
print(f"Total data points: {total_data_points:,}")

missing_percentage = (missing_data / total_data_points) * 100
print(f"Overall missing data percentage: {missing_percentage:.2f}%\n")

Missing data points by ticker (Close prices):
  VIX: 7,685 missing (90.7%)
  UUP: 3,815 missing (45.0%)
  ^XDE: 3,783 missing (44.7%)
  SLV: 3,605 missing (42.6%)
  DBC: 3,548 missing (41.9%)
  GLD: 3,243 missing (38.3%)
  FXI: 3,214 missing (37.9%)
  EEM: 2,839 missing (33.5%)
  EURUSD=X: 2,826 missing (33.4%)
  TLT: 2,661 missing (31.4%)
  SHY: 2,661 missing (31.4%)
  EFA: 2,433 missing (28.7%)
  IWM: 2,118 missing (25.0%)
  QQQ: 1,810 missing (21.4%)
  XLI: 1,758 missing (20.8%)
  XLF: 1,758 missing (20.8%)
  XLE: 1,758 missing (20.8%)
  XLK: 1,758 missing (20.8%)
  XLV: 1,758 missing (20.8%)
  JPYUSD=X: 992 missing (11.7%)
  SPY: 268 missing (3.2%)

Total missing data points: 56,291
Total data points: 177,912
Overall missing data percentage: 31.64%
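
The counts above mix two different situations: dates before an instrument's first observation, and gaps inside its life. A small helper can separate them. This is a sketch on toy data; `split_missing`, `toy_close`, and the values are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def split_missing(series: pd.Series) -> dict:
    """Split NaN counts into: before first observation, inside the life, after last."""
    first = series.first_valid_index()
    last = series.last_valid_index()
    if first is None:  # the series has no valid observation at all
        return {"before_first": len(series), "inside": 0, "after_last": 0}
    return {
        "before_first": int((series.index < first).sum()),
        "inside": int(series.loc[first:last].isna().sum()),
        "after_last": int((series.index > last).sum()),
    }


# Toy series: two dates before inception, one internal gap, one trailing gap
toy_index = pd.date_range("2024-01-01", periods=6, freq="D")
toy_close = pd.Series([np.nan, np.nan, 10.0, np.nan, 10.5, np.nan], index=toy_index)

breakdown = split_missing(toy_close)
print(breakdown)
```

Applied column by column to `close_df`, this would show how much of each ticker's missing percentage is explained by its inception date alone.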



In [99]:
# Data completeness timeline
print("Data availability timeline:")
ranges = []
for ticker in tickers_list:
    ticker_data = close_df[ticker].dropna()
    if len(ticker_data) > 0:
        first_date = ticker_data.index.min()
        last_date = ticker_data.index.max()
        total_days = len(ticker_data)
        ranges.append((ticker, first_date, last_date, total_days))
    else:
        print(f"  {ticker}: No valid data found")

ranges.sort(key=lambda x: x[1])  # Sort by first_date
for ticker, first_date, last_date, total_days in ranges:
    print(f"  {ticker}: {first_date.strftime('%Y-%m-%d')} to {last_date.strftime('%Y-%m-%d')} ({total_days:,} days)")

Data availability timeline:
  SPY: 1993-01-29 to 2025-09-02 (8,204 days)
  JPYUSD=X: 1996-10-30 to 2025-09-03 (7,480 days)
  XLF: 1998-12-22 to 2025-09-02 (6,714 days)
  XLK: 1998-12-22 to 2025-09-02 (6,714 days)
  XLE: 1998-12-22 to 2025-09-02 (6,714 days)
  XLV: 1998-12-22 to 2025-09-02 (6,714 days)
  XLI: 1998-12-22 to 2025-09-02 (6,714 days)
  QQQ: 1999-03-10 to 2025-09-02 (6,662 days)
  IWM: 2000-05-26 to 2025-09-02 (6,354 days)
  EFA: 2001-08-27 to 2025-09-02 (6,039 days)
  TLT: 2002-07-30 to 2025-09-02 (5,811 days)
  SHY: 2002-07-30 to 2025-09-02 (5,811 days)
  EEM: 2003-04-14 to 2025-09-02 (5,633 days)
  EURUSD=X: 2003-12-01 to 2025-09-03 (5,646 days)
  FXI: 2004-10-08 to 2025-09-02 (5,258 days)
  GLD: 2004-11-18 to 2025-09-02 (5,229 days)
  DBC: 2006-02-06 to 2025-09-02 (4,924 days)
  SLV: 2006-04-28 to 2025-09-02 (4,867 days)
  ^XDE: 2007-01-11 to 2025-09-02 (4,689 days)
  UUP: 2007-03-01 to 2025-09-02 (4,657 days)
  VIX: 2014-12-04 to 2018-01-31 (787 days)


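We immediately observe that the `VIX` series ends on 2018-01-31, so it cannot support an analysis running to the present and should be excluded. (On Yahoo Finance the CBOE Volatility Index is listed under the symbol `^VIX`, so the bare `VIX` symbol used above appears to point to a different instrument that is no longer updated.) We can then restrict the time series to their common period: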

In [101]:
# Filter out VIX from analysis as noted
tickers_cleaned = [ticker for ticker in tickers_list if ticker != "VIX"]
print(f"Analyzing {len(tickers_cleaned)} tickers: {tickers_cleaned}")

# Extract close prices for non-VIX tickers
close_df_cleaned = close_df[tickers_cleaned].copy()

Analyzing 20 tickers: ['SPY', 'QQQ', 'IWM', 'EFA', 'EEM', 'FXI', 'XLF', 'XLK', 'XLE', 'XLV', 'XLI', 'TLT', 'SHY', 'GLD', 'SLV', 'DBC', 'UUP', 'EURUSD=X', 'JPYUSD=X', '^XDE']


Let us check time series alignment:

In [118]:
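# Define common period (intersection of all data ranges):
# it starts on the latest first-observation date among the retained tickers
common_start = close_df_cleaned.apply(lambda s: s.first_valid_index()).max()
common_end = datetime.strptime(end_date, "%Y-%m-%d")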

print("COMMON DATA PERIOD ANALYSIS")
print(f"Common period: {common_start.strftime('%Y-%m-%d')} -> {common_end.strftime('%Y-%m-%d')}")
print(f"Duration: {(common_end - common_start).days} days")

# Filter to common period
common_period_data = close_df_cleaned.loc[common_start:common_end]

# Check for missing values in common period
print("TRADE DAY ALIGNMENT CHECK")
total_days = len(common_period_data)
missing_counts = common_period_data.isnull().sum()
total_missing = missing_counts.sum()

print(f"Total trading days in common period: {total_days}")

# Check if perfectly aligned (no missing values)

if total_missing == 0:
    print("PERFECT ALIGNMENT: All time series are perfectly aligned!")
else:
    print(f"MISALIGNMENT DETECTED: {total_missing:,} total missing values")
    print("Missing values by ticker:")
    for ticker, missing_count in missing_counts.items():
        if missing_count > 0:
            pct_missing = (missing_count / total_days) * 100
            print(f"  {ticker}: {missing_count} missing ({pct_missing:.2f}%)")

COMMON DATA PERIOD ANALYSIS
Common period: 2007-03-01 -> 2025-09-03
Duration: 6761 days
TRADE DAY ALIGNMENT CHECK
Total trading days in common period: 4828
MISALIGNMENT DETECTED: 3,135 total missing values
Missing values by ticker:
  SPY: 171 missing (3.54%)
  QQQ: 171 missing (3.54%)
  IWM: 171 missing (3.54%)
  EFA: 171 missing (3.54%)
  EEM: 171 missing (3.54%)
  FXI: 171 missing (3.54%)
  XLF: 171 missing (3.54%)
  XLK: 171 missing (3.54%)
  XLE: 171 missing (3.54%)
  XLV: 171 missing (3.54%)
  XLI: 171 missing (3.54%)
  TLT: 171 missing (3.54%)
  SHY: 171 missing (3.54%)
  GLD: 171 missing (3.54%)
  SLV: 171 missing (3.54%)
  DBC: 171 missing (3.54%)
  UUP: 171 missing (3.54%)
  EURUSD=X: 28 missing (0.58%)
  JPYUSD=X: 28 missing (0.58%)
  ^XDE: 172 missing (3.56%)


In [119]:
# Summary statistics
complete_data_mask = common_period_data.notnull().all(axis=1)
complete_days = complete_data_mask.sum()
alignment_pct = (complete_days / total_days) * 100

print("ALIGNMENT SUMMARY")
print(f"Days with complete data: {complete_days}")
print(f"Days with missing data: {total_days - complete_days}")
print(f"Alignment percentage: {alignment_pct:.2f}%")

if alignment_pct == 100:
    print("CONCLUSION: Time series are perfectly aligned")
elif alignment_pct >= 95:
    print("CONCLUSION: Time series are well-aligned with minimal gaps")
else:
    print("CONCLUSION: Significant alignment issues detected")

ALIGNMENT SUMMARY
Days with complete data: 4627
Days with missing data: 201
Alignment percentage: 95.84%
CONCLUSION: Time series are well-aligned with minimal gaps


#### Data Quality Results:

**Missing Data Analysis:**
- **Total missing data**: 56,291 of 177,912 closing-price data points (31.64%) are missing over the full window, mostly because instruments were launched at different dates
- **VIX availability**: The `VIX` series covers only 2014-12-04 to 2018-01-31 (787 observations), requiring exclusion from ongoing analysis
- **Currency exposure**: Within the common period, EURUSD=X and JPYUSD=X have 28 missing rows each, against 171 for each US-listed ETF; the FX pairs are also quoted on days when US exchanges are closed
- **Overall missing data percentage**: Ranges from 3.2% (SPY) to 45.0% (UUP) once VIX is set aside, depending on inception dates and trading calendars

**Data Completeness Timeline:**
- **Portfolio span**: 21 instruments; data was requested from 1990-01-01, and observations run from 1993-01-29 to 2025-09-03
- **Varying inception dates**: Among the retained instruments, first observations range from 1993-01-29 (SPY) to 2007-03-01 (UUP), creating natural data gaps
- **International instruments**: The FX pairs follow a different trading calendar from the US-listed ETFs
- **Longest series**: SPY provides the most complete coverage, with 8,204 observations starting in 1993

**Trading Day Alignment Analysis:**
- **Common period identified**: 2007-03-01 (first UUP observation) to 2025-09-03, for a total of 4,828 rows
- **Alignment assessment**: 4,627 rows (95.84%) have a value for every instrument; 201 rows have at least one gap
- **International market exposure**: Currency pairs follow global forex trading patterns (24/5)
- **Conclusion**: The series are well aligned over the common period; the residual gaps are consistent with calendar differences between US-listed ETFs and FX pairs

**Key Findings:**
- The analysis must consider **actively traded instruments**. The `VIX` series is not available after January 2018, and will therefore be excluded;
- **Currency instruments** provide valuable international diversification but follow different trading schedules;
- We can focus on a **robust common period** (from 2007-03-01) in which all retained instruments are traded simultaneously;
- **Data quality varies** by asset class, with US-listed ETFs sharing the same calendar and currency instruments showing natural calendar differences.
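
One way to check the calendar explanation is to list the dates on which every ETF is missing while at least one FX pair has a quote. The sketch below uses a toy frame; the prices are invented, and the split into `etf_cols` and `fx_cols` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy close prices around a US market holiday (all values are illustrative)
dates = pd.to_datetime(["2024-07-03", "2024-07-04", "2024-07-05"])
toy = pd.DataFrame(
    {
        "SPY": [100.0, np.nan, 101.0],
        "TLT": [90.0, np.nan, 91.0],
        "EURUSD=X": [1.080, 1.081, 1.083],
    },
    index=dates,
)

etf_cols = ["SPY", "TLT"]  # assumed grouping: US-listed ETFs
fx_cols = ["EURUSD=X"]     # assumed grouping: FX pairs

# Rows where no ETF traded but at least one FX pair was quoted
fx_only = toy[etf_cols].isna().all(axis=1) & toy[fx_cols].notna().any(axis=1)
fx_only_dates = toy.index[fx_only]
print(list(fx_only_dates.strftime("%Y-%m-%d")))
```

Running the same mask on `common_period_data` would show whether the 171 missing ETF rows fall on US market holidays.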

**Important note:** this discussion is greatly simplified. Consider, for example, that:

- when a time series is not available, there are industry-standard ways to work around the problem depending on context. For example, a highly correlated proxy instrument with a longer (and suitably rescaled) time series could be selected;
- there are other quality indicators available. For example, transaction volumes should be considered to determine each data point's quality: a very low transaction volume might indicate unreliable prices.
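
As an illustration of the proxy idea, a short series can be extended backwards with a longer proxy rescaled so that the two coincide on the first date where the short series exists. This is only a sketch of one simple splicing rule, not an industry-mandated method; `backfill_with_proxy` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def backfill_with_proxy(target: pd.Series, proxy: pd.Series) -> pd.Series:
    """Fill dates before `target`'s first observation with a rescaled `proxy`."""
    first = target.first_valid_index()
    if first is None or pd.isna(proxy.get(first)):
        raise ValueError("proxy must have a value on the target's first valid date")
    # Scale the proxy so that both series are equal on the splice date
    scale = target.loc[first] / proxy.loc[first]
    rescaled = (proxy * scale).reindex(target.index)
    out = target.copy()
    before = out.index < first
    out.loc[before] = rescaled.loc[before]
    return out


# Toy data: the target starts two days later than the proxy
idx = pd.date_range("2024-01-01", periods=5, freq="D")
proxy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=idx)
target = pd.Series([np.nan, np.nan, 24.0, 26.0, 28.0], index=idx)

extended = backfill_with_proxy(target, proxy)
print(extended.tolist())
```

Rescaling by a single ratio preserves the proxy's returns before the splice date, which is usually what matters for risk and correlation estimates; how well this works depends on how closely the proxy tracks the target.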

#### Data Quality Remediations

Let us apply the adjustments discussed in the "Data Quality Results" section: