# Exploratory Data Analysis (EDA)

In this notebook, we perform an Exploratory Data Analysis (EDA) on the cleaned market and economic data. The primary goal is to uncover historical trends, relationships, and volatility patterns among key Exchange-Traded Funds (ETFs) and market indicators. This analysis will help us understand the behavior of these assets and identify potential features for our modeling phase.

First, we import the necessary libraries for data analysis and visualization.

In [1]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

We load the pre-cleaned market data from the CSV file into a pandas DataFrame. The first column is used as the index, and dates are parsed.

In [2]:
# Load the dataset
df = pd.read_csv("../data/cleaned_market_data.csv", index_col=0, parse_dates=True)

# Display the first few rows of the dataset
df.head()

| Unnamed: 0 | GLD | SPY | TLT | UUP | ^MOVE | ^VIX | T10Y2Y | BAMLC0A0CMEY | USRECP |
|---|---|---|---|---|---|---|---|---|---|
| 2007-03-02 | 63.709999 | 138.669998 | 90.199997 | 24.959999 | 76.699997 | 18.610001 | -0.04 | 5.46 | 0.0 |
| 2007-03-09 | 64.25 | 140.779999 | 89.400002 | 25.16 | 63.599998 | 14.09 | -0.07 | 5.55 | 0.0 |
| 2007-03-16 | 64.620003 | 138.529999 | 89.849998 | 24.870001 | 67.0 | 16.790001 | -0.03 | 5.52 | 0.0 |
| 2007-03-23 | 65.150002 | 143.389999 | 88.790001 | 24.93 | 68.699997 | 12.95 | 0.02 | 5.58 | 0.0 |
| 2007-03-30 | 65.739998 | 142.0 | 88.279999 | 24.790001 | 67.900002 | 14.64 | 0.07 | 5.6 | 0.0 |


The DataFrame contains weekly time-series data from March 2007 onwards. The columns represent:
- **ETFs:**
  - `GLD`: SPDR Gold Shares, tracking the price of gold.
  - `SPY`: SPDR S&P 500 ETF Trust, tracking the S&P 500 stock market index.
  - `TLT`: iShares 20+ Year Treasury Bond ETF, tracking long-term U.S. Treasury bonds.
  - `UUP`: Invesco DB US Dollar Index Bullish Fund, tracking the value of the U.S. dollar against a basket of foreign currencies.
- **Market Indicators:**
  - `^VIX`: CBOE Volatility Index, measuring expected stock market volatility.
  - `^MOVE`: ICE BofA MOVE Index, measuring expected bond market volatility.
  - `T10Y2Y`: The spread between 10-Year and 2-Year Treasury yields, a key indicator of economic expectations.
  - `BAMLC0A0CMEY`: ICE BofA US Corporate Index Effective Yield, a measure of investment-grade corporate bond yields.
- **Recession Indicator:**
  - `USRECP`: A binary indicator from NBER, where `1` signifies a recession period.
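
Before plotting, it can help to confirm that the data really is weekly and to count missing values. The following is a minimal sketch on a small synthetic frame; `sample` is a hypothetical stand-in for `df`, and the missing value in it is deliberately planted for illustration.

```python
import pandas as pd

# Synthetic stand-in for the notebook's `df` (values are illustrative only).
idx = pd.date_range("2007-03-02", periods=5, freq="W-FRI")
sample = pd.DataFrame(
    {
        "SPY": [138.67, 140.78, None, 143.39, 142.00],  # one planted gap
        "USRECP": [0.0, 0.0, 0.0, 0.0, 0.0],
    },
    index=idx,
)

# Spacing between consecutive observations; weekly data should be 7 days apart.
gaps = sample.index.to_series().diff().dropna()
is_weekly = (gaps == pd.Timedelta(days=7)).all()

# Number of missing values in each column.
missing = sample.isna().sum()

# The recession flag should only ever take the values 0 and 1.
flag_ok = sample["USRECP"].isin([0, 1]).all()

print(is_weekly, flag_ok)
print(missing)
```

Running the same three checks on `df` would show whether any weeks are skipped or any column needs filling before correlations are computed.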

Here we visualize the price history of the four ETFs:
- Gold (GLD)
- S&P 500 (SPY)
- 20+ Year Treasury Bonds (TLT)
- US Dollar Index (UUP)

UUP trades at a much lower price level than the other three, so it is plotted on a secondary y-axis.

In [3]:
# Create a subplot for the ETFs
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(
        x=df.index,
        y=df["SPY"],
        name="SPY",
        line=dict(color="royalblue", width=2, dash="solid"),
    )
)

fig.add_trace(
    go.Scatter(
        x=df.index,
        y=df["GLD"],
        name="GLD",
        line=dict(color="crimson", width=1.5, dash="solid"),
    )
)

fig.add_trace(
    go.Scatter(x=df.index, y=df["TLT"], name="TLT", line=dict(width=1.5, dash="solid")),
)

fig.add_trace(
    go.Scatter(x=df.index, y=df["UUP"], name="UUP", line=dict(width=1.5, dash="dash")),
    secondary_y=True,
)

if df["USRECP"].sum() > 0:
    # Find the start of each recession (where USRECP goes from 0 to 1)
    recession_starts = df.index[(df["USRECP"] == 1) & (df["USRECP"].shift(1) == 0)]

    # Find the end of each recession (where USRECP goes from 1 to 0).
    # This picks the first non-recession observation, so each shaded band
    # extends one period past the last recession observation.
    recession_ends = df.index[(df["USRECP"] == 0) & (df["USRECP"].shift(1) == 1)]

    # Handle edge cases: if the data starts or ends during a recession
    if df["USRECP"].iloc[0] == 1:
        recession_starts = recession_starts.insert(0, df.index[0])
    if df["USRECP"].iloc[-1] == 1:
        recession_ends = recession_ends.append(pd.Index([df.index[-1]]))

    # Add a shaded rectangle for each recession period found
    for start, end in zip(recession_starts, recession_ends):
        fig.add_vrect(
            x0=start,
            x1=end,
            fillcolor="grey",
            opacity=0.2,
            annotation_text="<b>NBER based Recession</b>",
            annotation_position="top left",
            layer="below",
            line_width=0,
        )

# COVID-19 specific recession period
covid_start = pd.Timestamp("2020-02-01")
covid_end = pd.Timestamp("2020-04-30")
fig.add_vrect(
    x0=covid_start,
    x1=covid_end,
    fillcolor="grey",
    opacity=0.2,
    layer="below",
    line_width=0,
    annotation_text="<b>COVID-19 Outbreak</b>",
    annotation_position="top left",
)

fig.update_layout(
    title="<b>ETF Prices Over Time</b>",
    template="plotly_white",
    legend=dict(orientation="h", yanchor="bottom", y=1.1, xanchor="right", x=1),
    yaxis=dict(
        title="<b>SPY, GLD, TLT Prices</b>",
        tickfont=dict(color="crimson"),
        range=[0, df[["SPY", "GLD", "TLT"]].max().max() * 1.1],
        showgrid=False,
    ),
    yaxis2=dict(
        title="<b>UUP Prices</b>",
        tickfont=dict(color="royalblue"),
        range=[0, df["UUP"].max() * 1.1],
        overlaying="y",
        side="right",
    ),
    xaxis=dict(
        showgrid=False,
        tickformat="%Y-%m",
        ticks="outside",
    ),
    xaxis_title="Date",
    autosize=False,
    width=1050,
    height=600,
)
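
The recession-shading logic in this cell is repeated in the indicator chart later in the notebook, so it may be worth factoring into a helper. The sketch below is one possible way to do that; `recession_spans` is a hypothetical name, and the toy flag series is synthetic. It follows the same convention as the cell above: a span ends at the first observation where the flag is back to 0.

```python
import pandas as pd


def recession_spans(flag: pd.Series) -> list:
    """Return (start, end) index pairs for each consecutive run of 1s in `flag`."""
    if flag.empty:
        return []
    in_recession = flag.fillna(0).astype(int)
    # +1 marks a 0 -> 1 step, -1 marks a 1 -> 0 step. The first value has no
    # predecessor, so it is treated as a start if the data opens in a recession.
    step = in_recession.diff().fillna(in_recession.iloc[0])
    starts = list(flag.index[step == 1])
    ends = list(flag.index[step == -1])
    # If the data ends mid-recession, close the last run at the final timestamp.
    if len(ends) < len(starts):
        ends.append(flag.index[-1])
    return list(zip(starts, ends))


# Toy example: opens in a recession, has one in the middle, and ends in one.
idx = pd.date_range("2020-01-03", periods=8, freq="W-FRI")
flag = pd.Series([1, 0, 0, 1, 1, 0, 1, 1], index=idx)
spans = recession_spans(flag)
for start, end in spans:
    print(start.date(), "to", end.date())
```

Each `(start, end)` pair could then be passed to `fig.add_vrect(x0=start, x1=end, ...)` exactly as in the loop above.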

The line chart illustrates the long-term performance of the four major ETFs.

- **SPY (S&P 500)** shows a strong upward trend over the period, punctuated by major drawdowns during the 2008 Global Financial Crisis and the 2020 COVID-19 pandemic.
- **GLD (Gold)** exhibits periods of significant growth, especially during and after the 2008 crisis, acting as a safe-haven asset. It also saw a strong rally leading into the COVID-19 pandemic.
- **TLT (Treasury Bonds)** has a more stable, yet still positive, long-term trend. Its price often rises during periods of equity market stress (e.g., 2008, 2020), confirming its role as a defensive asset.
- **UUP (US Dollar)**, plotted on the secondary y-axis, shows cyclical behavior, with notable spikes in strength during times of global uncertainty like 2008, 2015, and 2022.
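
The drawdowns and relative performance described above can also be quantified rather than read off the chart. The following is a minimal sketch, not part of the original notebook: `drawdown` and `rebase` are hypothetical helper names, and the toy price series is invented for illustration.

```python
import pandas as pd


def drawdown(prices: pd.Series) -> pd.Series:
    """Fractional decline from the running peak (0 at a new high, negative below it)."""
    running_peak = prices.cummax()
    return prices / running_peak - 1.0


def rebase(prices: pd.DataFrame, base: float = 100.0) -> pd.DataFrame:
    """Scale every column so that its first observation equals `base`."""
    return prices / prices.iloc[0] * base


# Toy series: rises to 120, falls to 60 (half the peak), then recovers.
toy = pd.Series([100.0, 120.0, 90.0, 60.0, 150.0])
dd = drawdown(toy)
print(dd.min())  # deepest drawdown of the toy series

toy_frame = pd.DataFrame({"A": [50.0, 55.0, 60.0], "B": [200.0, 190.0, 210.0]})
print(rebase(toy_frame))
```

Applied to the notebook's data, `drawdown(df["SPY"])` would give the depth of the 2008 and 2020 declines, and `rebase(df[["SPY", "GLD", "TLT", "UUP"]])` would put all four ETFs on a common starting value so that a secondary axis is no longer needed.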

Next, we plot four market indicators:
- CBOE Volatility Index (VIX)
- ICE BofA MOVE Index (MOVE)
- 10-Year Treasury Constant Maturity Minus 2-Year Treasury Constant Maturity (T10Y2Y)
- ICE BofA US Corporate Index Effective Yield (BAMLC0A0CMEY)

VIX and MOVE capture volatility in stocks and bonds, while T10Y2Y and BAMLC0A0CMEY track the yield curve and corporate credit conditions.

In [6]:
indicators = ["^VIX", "^MOVE", "T10Y2Y", "BAMLC0A0CMEY"]

fig = make_subplots(specs=[[{"secondary_y": True}]])

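rate_series = ["T10Y2Y", "BAMLC0A0CMEY"]

for name in indicators:
    # Rate-based series go on the secondary axis with a thicker dashed line.
    # Deciding per series keeps the styling independent of the list order.
    on_secondary = name in rate_series
    fig.add_trace(
        go.Scatter(
            x=df.index,
            y=df[name],
            name=name,
            line=dict(
                width=2 if on_secondary else 1.5,
                dash="dash" if on_secondary else "solid",
            ),
        ),
        secondary_y=on_secondary,
    )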

if df["USRECP"].sum() > 0:
    # Find the start of each recession (where USRECP goes from 0 to 1)
    recession_starts = df.index[(df["USRECP"] == 1) & (df["USRECP"].shift(1) == 0)]

    # Find the end of each recession (where USRECP goes from 1 to 0).
    # This picks the first non-recession observation, so each shaded band
    # extends one period past the last recession observation.
    recession_ends = df.index[(df["USRECP"] == 0) & (df["USRECP"].shift(1) == 1)]

    # Handle edge cases: if the data starts or ends during a recession
    if df["USRECP"].iloc[0] == 1:
        recession_starts = recession_starts.insert(0, df.index[0])
    if df["USRECP"].iloc[-1] == 1:
        recession_ends = recession_ends.append(pd.Index([df.index[-1]]))

    # Add a shaded rectangle for each recession period found
    for start, end in zip(recession_starts, recession_ends):
        fig.add_vrect(
            x0=start,
            x1=end,
            fillcolor="grey",
            opacity=0.2,
            annotation_text="<b>NBER based Recession</b>",
            annotation_position="top left",
            layer="below",
            line_width=0,
        )

# COVID-19 specific recession period
covid_start = pd.Timestamp("2020-02-01")
covid_end = pd.Timestamp("2020-04-30")
fig.add_vrect(
    x0=covid_start,
    x1=covid_end,
    fillcolor="grey",
    opacity=0.2,
    layer="below",
    line_width=0,
    annotation_text="<b>COVID-19 Outbreak</b>",
    annotation_position="top left",
)

fig.update_layout(
    title="<b>VIX, MOVE, T10Y2Y, BAMLC0A0CMEY</b><br><sup>Indices Over Time</sup>",
    template="plotly_white",
    legend=dict(orientation="h", yanchor="bottom", y=1.1, xanchor="right", x=1),
    yaxis=dict(
        title="<b>VIX, MOVE Indices</b>",
        tickfont=dict(color="royalblue"),
        range=[0, df[["^VIX", "^MOVE"]].max().max() * 1.1],
        showgrid=False,
    ),
    yaxis2=dict(
        title="<b>T10Y2Y, BAMLC0A0CMEY (%)</b>",
        tickfont=dict(color="crimson"),
    ),
    xaxis=dict(
        showgrid=False,
        tickformat="%Y-%m",
        ticks="outside",
    ),
    xaxis_title="Date",
    autosize=False,
    width=1050,
    height=600,
)

This chart visualizes key market health and volatility indicators, with official NBER recession periods shaded in gray.

- **VIX (Equity Volatility)** and **MOVE (Bond Volatility)** spike during periods of market turmoil, most notably the 2008 financial crisis and the COVID-19 outbreak. The MOVE index in particular has remained elevated in the post-2022 period, signaling persistent uncertainty in the bond market.
- **T10Y2Y (Yield Curve)** provides a forward-looking economic signal. The curve inverted (dropped below zero) ahead of both the 2008 and 2020 recessions, and again in 2022. Inversion is a classic recession predictor.
- **BAMLC0A0CMEY (Corporate Bond Yield)** tends to spike when credit risk is perceived to be high, which aligns with the recessionary periods. The sharp increase in yields during the 2008 crisis is particularly evident.
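
Yield-curve inversions can be located programmatically instead of by eye. The sketch below uses an invented spread series (not the notebook's data) to show one way of flagging inverted weeks and the first week of each inversion episode; the variable names are illustrative.

```python
import pandas as pd

# Invented weekly spread values, for illustration only.
idx = pd.date_range("2007-03-02", periods=6, freq="W-FRI")
spread = pd.Series([0.30, -0.05, -0.10, 0.20, -0.02, 0.15], index=idx, name="T10Y2Y")

# A week counts as inverted when the 10Y-2Y spread is below zero.
inverted = spread < 0

# Share of the sample spent inverted.
share_inverted = inverted.mean()

# First week of each episode: inverted now, but not in the previous week.
episode_starts = spread.index[inverted & ~inverted.shift(1, fill_value=False)]

print(share_inverted)
print(list(episode_starts.date))
```

On the real data, the same mask applied to `df["T10Y2Y"]` would list the dates on which each inversion began, which can then be compared with the shaded recession bands.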

To understand the relationships between the main asset classes, we compute and display a correlation matrix.

In [7]:
etfs = ["GLD", "SPY", "TLT", "UUP"]

# Calculate the correlation matrix
corr = df[etfs].corr()

# Create heatmap
fig = go.Figure(
    data=go.Heatmap(
        z=corr,
        x=etfs,
        y=etfs,
        text=[[f"{val:.2f}" for val in row] for row in corr.values],
        texttemplate="%{text}",
        textfont={"size": 12},
        hoverongaps=False,
        colorscale="magma",
        zmid=0,
    )
)

fig.update_layout(
    title="<b>Correlation Heatmap of ETFs</b>",
    template="plotly_white",
    width=700,
    height=600,
    xaxis={"side": "bottom"},
)

fig.show()

The correlation heatmap shows how the weekly price levels of these ETFs have moved in relation to one another over the full sample.

- **SPY vs. GLD (0.80):** There is a surprisingly strong positive correlation between equities and gold. While gold is often a safe-haven asset, this long-term view suggests that macroeconomic factors (like monetary policy and inflation expectations) can sometimes drive both asset classes in the same direction.
- **SPY vs. UUP (0.76):** A strong US stock market is positively correlated with a strong US dollar. This can be attributed to international capital flows into US assets during periods of economic strength.
- **TLT (Bonds):** The Treasury bond ETF shows very low correlation with both SPY (0.12) and GLD (0.05), and a slightly negative correlation with UUP (-0.02). This underscores its effectiveness as a portfolio diversifier, as its performance is largely independent of equities and the dollar.
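
The coefficients above come from `df[etfs].corr()`, which operates on price levels. Levels of trending series can correlate strongly simply because both drift over time, so a common cross-check is to repeat the calculation on periodic returns. The following is a minimal sketch on two synthetic, independent random walks (not market data); all names and parameters in it are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Two independent random walks with the same upward drift (synthetic).
a = 100 * np.exp(np.cumsum(0.002 + 0.02 * rng.standard_normal(n)))
b = 50 * np.exp(np.cumsum(0.002 + 0.02 * rng.standard_normal(n)))
prices = pd.DataFrame({"A": a, "B": b})

# Correlation of price levels: influenced by the shared drift.
level_corr = prices.corr().loc["A", "B"]

# Correlation of period-over-period returns: reflects co-movement only.
return_corr = prices.pct_change().dropna().corr().loc["A", "B"]

print(f"levels:  {level_corr:.2f}")
print(f"returns: {return_corr:.2f}")
```

In the notebook this would amount to `df[etfs].pct_change().corr()`. The resulting values are not reported here because they have not been computed, and they may differ noticeably from the level-based figures.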

Similarly, we compute the correlation matrix for the risk indicators.

In [8]:
# Calculate the correlation matrix
corr = df[indicators].corr()

# Create heatmap
fig = go.Figure(
    data=go.Heatmap(
        z=corr,
        x=indicators,
        y=indicators,
        text=[[f"{val:.2f}" for val in row] for row in corr.values],
        texttemplate="%{text}",
        textfont={"size": 12},
        hoverongaps=False,
        colorscale="magma",
        zmid=0,
    )
)

fig.update_layout(
    title="<b>Correlation Heatmap of Indicators</b>",
    template="plotly_white",
    width=700,
    height=600,
    xaxis={"side": "bottom"},
)

fig.show()

This heatmap shows the relationships between the market risk indicators.

- **VIX & MOVE (0.61):** As expected, equity-market volatility (VIX) and bond-market volatility (MOVE) are moderately to strongly correlated. This indicates that periods of market stress are often systemic rather than confined to a single asset class.
- **MOVE & BAMLC0A0CMEY (0.79):** The strong positive correlation between bond market volatility and corporate bond yields is logical. Higher yields (implying higher risk) are associated with greater uncertainty and volatility in the credit markets.
- **T10Y2Y (Yield Curve):** The yield curve spread has a very low correlation with VIX (0.16) and MOVE (0.04), and a negative correlation with corporate yields (-0.11). This suggests that the yield curve captures different information, reflecting the longer-term economic outlook, than the more immediate fear gauges of VIX and MOVE.
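
A single full-sample coefficient hides how these relationships change over time. One way to examine that is a rolling correlation; the sketch below uses synthetic series that share a common component, and the 52-week window is an arbitrary assumption rather than a choice made in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 260
idx = pd.date_range("2007-03-02", periods=n, freq="W-FRI")

# Synthetic stand-ins for the two volatility indices, driven by a shared shock.
common = rng.standard_normal(n)
toy = pd.DataFrame(
    {
        "^VIX": 20 + 5 * common + rng.standard_normal(n),
        "^MOVE": 80 + 10 * common + 5 * rng.standard_normal(n),
    },
    index=idx,
)

# Rolling one-year (52-week) correlation; the first 51 values are NaN
# because the window is not yet full.
window = 52
rolling_corr = toy["^VIX"].rolling(window).corr(toy["^MOVE"])

print(rolling_corr.dropna().describe())
```

Replacing `toy` with `df` would show whether the VIX and MOVE relationship tightens during the shaded recession periods.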

### EDA Summary and Next Steps

This exploratory analysis has provided several key insights:

1.  **Asset Behavior:** We've confirmed the typical behavior of equities (SPY), bonds (TLT), and the US dollar (UUP) during different economic cycles. The strong positive correlation between SPY and GLD is a notable long-term trend that warrants further investigation.
2.  **Volatility Clustering:** Volatility in both equity (VIX) and bond (MOVE) markets tends to spike concurrently, especially during recessions, highlighting synchronized risk-off events.
3.  **Divergent Indicators:** The T10Y2Y yield curve is a strong leading indicator for *economic recessions* but shows a very low correlation with immediate market volatility indicators like VIX and MOVE. This suggests it captures a different, longer-term economic risk profile rather than acute market stress.

**Next Steps:**

*   Given the low correlation of the T10Y2Y yield curve with other risk indicators, we will **not include it** in the construction of our systemic risk index, which is intended to measure immediate market stress.