# Data Exploration: U.S. Sector ETFs

This notebook demonstrates how to download and explore historical price data for U.S. sector ETFs using the `download_etf_prices()` function.

## Sector ETFs Overview

We'll work with the following SPDR Sector ETFs:
- **XLF**: Financial Select Sector
- **XLK**: Technology Select Sector
- **XLE**: Energy Select Sector
- **XLV**: Health Care Select Sector
- **XLY**: Consumer Discretionary Select Sector
- **XLP**: Consumer Staples Select Sector
- **XLI**: Industrial Select Sector
- **XLB**: Materials Select Sector
- **XLU**: Utilities Select Sector

In [None]:
# Import required libraries
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from src.data.fetch import download_etf_prices

# Set display options
pd.set_option('display.max_columns', None)
%matplotlib inline

## 1. Download ETF Price Data

Let's download historical prices for five of the sector ETFs listed above. Later notebooks test these series for cointegration.

In [None]:
# Define ETF tickers to analyze
tickers = ['XLF', 'XLK', 'XLE', 'XLV', 'XLY']

# Download three calendar years of data (2021 through 2023)
prices = download_etf_prices(
    tickers=tickers,
    start_date='2021-01-01',
    end_date='2023-12-31'
)

print(f"Downloaded {len(prices)} trading days of data")
print(f"Date range: {prices.index.min()} to {prices.index.max()}")
prices.head()
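
The cells that follow assume `download_etf_prices()` returns a date-indexed DataFrame with one price column per ticker. The function itself is not shown in this notebook, so that shape is an assumption. A minimal sketch of a sanity check under that assumption, where the helper name `check_price_frame` is hypothetical and the data are synthetic rather than real prices:

```python
import pandas as pd


def check_price_frame(prices, tickers):
    """Return a list of problems found in a price DataFrame (empty if none).

    Hypothetical helper: assumes a date-indexed frame with one column per ticker.
    """
    problems = []
    if not isinstance(prices.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    elif not prices.index.is_monotonic_increasing:
        problems.append("index is not sorted by date")
    missing = [t for t in tickers if t not in prices.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    present = [t for t in tickers if t in prices.columns]
    if present and (prices[present] <= 0).any().any():
        problems.append("non-positive prices found")
    return problems


# Synthetic stand-in for downloaded data (not real market prices)
demo_index = pd.bdate_range("2021-01-04", periods=5)
demo_prices = pd.DataFrame(
    {
        "XLF": [30.0, 30.5, 30.2, 30.8, 31.0],
        "XLK": [128.0, 129.5, 127.9, 130.1, 131.4],
    },
    index=demo_index,
)

print(check_price_frame(demo_prices, ["XLF", "XLK"]))  # both columns present
print(check_price_frame(demo_prices, ["XLF", "XLE"]))  # XLE column is absent
```

Running a check like this right after the download surfaces a missing ticker or an unsorted index before it affects the plots and return calculations.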

## 2. Basic Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
prices.describe()

In [None]:
# Check for missing values
print("Missing values per ticker:")
print(prices.isnull().sum())

## 3. Visualize Price Series

In [None]:
# Plot normalized prices (base = 100)
normalized_prices = (prices / prices.iloc[0]) * 100

fig, ax = plt.subplots(figsize=(12, 6))
for ticker in tickers:
    ax.plot(normalized_prices.index, normalized_prices[ticker], label=ticker)

ax.set_xlabel('Date')
ax.set_ylabel('Normalized Price (Base = 100)')
ax.set_title('Normalized Sector ETF Prices')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 4. Calculate Returns

In [None]:
# Calculate daily log returns
returns = np.log(prices / prices.shift(1)).dropna()

print("Returns Statistics:")
print(returns.describe())
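
One convenient property of log returns is that they add across time: the sum of the daily log returns equals the log of the total price ratio over the period. A small check on made-up prices (the numbers are illustrative, not market data):

```python
import numpy as np
import pandas as pd

# Illustrative prices only (not real market data)
demo = pd.Series([100.0, 102.0, 99.0, 105.0])

# Same formula as the cell above: log of the ratio of consecutive prices
log_returns = np.log(demo / demo.shift(1)).dropna()

# Summing daily log returns recovers the log of the total price ratio
total_from_sum = log_returns.sum()
total_direct = np.log(demo.iloc[-1] / demo.iloc[0])
print(np.isclose(total_from_sum, total_direct))

# Converting the summed log return back to a simple return for the period
simple_total = np.exp(total_from_sum) - 1
print(round(simple_total, 4))
```

Simple returns do not share this property, which is why multi-day simple returns have to be compounded rather than summed.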

In [None]:
# Plot returns distribution
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for idx, ticker in enumerate(tickers):
    axes[idx].hist(returns[ticker], bins=50, alpha=0.7, edgecolor='black')
    axes[idx].set_title(f'{ticker} Returns Distribution')
    axes[idx].set_xlabel('Log Returns')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

# Remove any unused subplots (works for any ticker count up to the grid size)
for unused_ax in axes[len(tickers):]:
    fig.delaxes(unused_ax)
plt.tight_layout()
plt.show()

## 5. Correlation Analysis

In [None]:
# Calculate correlation matrix for returns
correlation_matrix = returns.corr()

print("Returns Correlation Matrix:")
print(correlation_matrix.round(3))

In [None]:
# Visualize correlation matrix
# Label the axes from the matrix itself so ticks match the data
# even if the column order differs from the `tickers` list
labels = list(correlation_matrix.columns)

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(correlation_matrix, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(np.arange(len(labels)))
ax.set_yticks(np.arange(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)

# Add correlation values to cells
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f'{correlation_matrix.iloc[i, j]:.2f}',
                ha="center", va="center", color="black")

ax.set_title('ETF Returns Correlation Matrix')
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

## 6. Save Processed Data

Save the downloaded price data for use in subsequent analyses.

In [None]:
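# Save to CSV (create the output directory first if it does not exist)
import os
os.makedirs('../data', exist_ok=True)

prices.to_csv('../data/etf_prices.csv')
returns.to_csv('../data/etf_returns.csv')

print("Data saved to ../data/ directory")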
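
When a later notebook reads these files back, `pd.read_csv` does not restore the date index on its own; passing `index_col=0` and `parse_dates=True` does. A minimal round-trip sketch, using a temporary directory and synthetic values so that it does not touch `../data/`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Synthetic stand-in for the saved prices (not real market data)
demo_prices = pd.DataFrame(
    {"XLF": [30.0, 30.5, 30.2], "XLK": [128.0, 129.5, 127.9]},
    index=pd.DatetimeIndex(["2021-01-04", "2021-01-05", "2021-01-06"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "etf_prices.csv"
    demo_prices.to_csv(csv_path)

    # index_col=0 uses the first CSV column as the index;
    # parse_dates=True converts that index back to datetimes
    reloaded = pd.read_csv(csv_path, index_col=0, parse_dates=True)

print(type(reloaded.index).__name__)
print(list(reloaded.columns))
```

Without those two arguments the dates come back as an ordinary string column, and date-based slicing or resampling in later steps would fail.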

## Next Steps

In the following notebooks, we will:
1. Test for cointegration between ETF pairs
2. Estimate Ornstein-Uhlenbeck (OU) parameters for mean-reversion modeling
3. Develop and backtest a pairs trading strategy
4. Evaluate performance with realistic transaction costs