# Task 1: Exploratory Data Analysis (EDA)

This notebook covers the first task of the challenge: loading, cleaning, exploring, and analyzing the historical financial data for Tesla (TSLA), Vanguard Total Bond Market ETF (BND), and SPDR S&P 500 ETF Trust (SPY).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller

# Set plot style
sns.set_theme(style='whitegrid')
plt.rcParams['figure.figsize'] = (15, 7)


## 1. Load and Inspect Data

First, we load the raw data for each of the three assets that we fetched using `yfinance`. We'll parse the 'Date' column as the index for our time series analysis.

In [None]:
# Load the datasets
try:
    tsla_df = pd.read_csv('../data/raw/TSLA.csv', index_col='Date', parse_dates=True)
    bnd_df = pd.read_csv('../data/raw/BND.csv', index_col='Date', parse_dates=True)
    spy_df = pd.read_csv('../data/raw/SPY.csv', index_col='Date', parse_dates=True)
    print("Data loaded successfully.")
except FileNotFoundError as e:
    print(f"Error: {e}. Make sure you have run the src/data_fetch.py script first.")
    raise  # Stop here: the cells below need all three DataFrames

# Display the first few rows of each dataframe
print("TSLA Data:")
display(tsla_df.head())
print("\nBND Data:")
display(bnd_df.head())
print("\nSPY Data:")
display(spy_df.head())
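
Before moving on, it can help to confirm that each frame really has a sorted, duplicate-free `DatetimeIndex`. The following is a minimal sketch, not part of the original pipeline: `check_price_frame` is a hypothetical helper, and the small frame at the bottom is made-up data used only to demonstrate the call.

```python
import pandas as pd

def check_price_frame(df: pd.DataFrame, name: str) -> dict:
    """Return basic index diagnostics for a price DataFrame (hypothetical helper)."""
    return {
        "name": name,
        "rows": len(df),
        "is_datetime_index": isinstance(df.index, pd.DatetimeIndex),
        "is_sorted": bool(df.index.is_monotonic_increasing),
        "duplicate_dates": int(df.index.duplicated().sum()),
    }

# Made-up example data with one duplicated date
demo = pd.DataFrame(
    {"Adj Close": [10.0, 10.5, 10.4]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-03"]),
)
demo_report = check_price_frame(demo, "DEMO")
print(demo_report)
```

With the notebook's data, the same call would be `check_price_frame(tsla_df, 'TSLA')`, and likewise for BND and SPY.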

## 2. Data Cleaning and Understanding

We'll check for missing values and look at the basic descriptive statistics for each asset. The `Adj Close` price is used as it accounts for dividends and stock splits, providing a more accurate representation of returns.

In [None]:
# Check for missing values
print("Missing values in TSLA:\n", tsla_df.isnull().sum())
print("\nMissing values in BND:\n", bnd_df.isnull().sum())
print("\nMissing values in SPY:\n", spy_df.isnull().sum())

# Display basic statistics for each asset
for name, df in [('TSLA', tsla_df), ('BND', bnd_df), ('SPY', spy_df)]:
    print(f"\n--- {name} Descriptive Statistics ---")
    display(df.describe())

There are no missing values in our datasets. `yfinance` provides clean data for this period. Now, let's create a unified dataframe with only the 'Adj Close' prices for easier comparison and portfolio analysis.


In [None]:
# Combine the 'Adj Close' of all assets into one DataFrame
portfolio_df = pd.DataFrame({
    'TSLA': tsla_df['Adj Close'],
    'BND': bnd_df['Adj Close'],
    'SPY': spy_df['Adj Close']
})

print("Combined Adjusted Close Prices:")
display(portfolio_df.head())
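
Combining columns from separate files aligns them on the union of their dates, so a date missing from one file shows up as `NaN` in the combined frame even when each file is complete on its own. Below is a minimal sketch of how one might check for and fill such gaps, using made-up prices. Forward-filling is one common choice and an assumption here, not something the original notebook does.

```python
import pandas as pd

# Made-up prices: asset B has no row for 2020-01-03
a = pd.Series([1.0, 2.0, 3.0], index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]))
b = pd.Series([5.0, 6.0], index=pd.to_datetime(["2020-01-02", "2020-01-06"]))

combined = pd.DataFrame({"A": a, "B": b})   # aligned on the union of dates
gaps = combined.isna().sum()                # NaN count per column
filled = combined.ffill().dropna()          # carry the last known price forward

print(gaps)
print(filled)
```

With the notebook's data, `portfolio_df.isna().sum()` gives the same per-column count.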


## 3. Visualize Closing Prices

Visualizing the adjusted closing prices over time helps us identify long-term trends for each asset.
* **TSLA**: High-growth, high-risk stock.
* **BND**: Stable bond ETF.
* **SPY**: Diversified S&P 500 ETF.


In [None]:
portfolio_df.plot(title='Adjusted Closing Prices (2015-2025)')
plt.ylabel('Price (USD)')
plt.show()

**Observation:** The plot clearly shows the different characteristics of the assets. TSLA exhibits massive growth and volatility, especially from 2020 onwards. SPY shows steady growth, representative of the broader US market. BND remains relatively stable, as expected from a bond ETF.
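
Because the three assets trade at very different price levels, raw prices on a shared axis can hide relative performance. One common remedy is to rebase each series to 100 at the first date. A small sketch with made-up prices follows; `rebase` is a hypothetical helper, not part of the notebook, and it assumes the first row contains no missing values.

```python
import pandas as pd

def rebase(prices: pd.DataFrame, base: float = 100.0) -> pd.DataFrame:
    """Scale every column so that its first row equals `base`."""
    return prices / prices.iloc[0] * base

# Made-up prices for two hypothetical assets
demo_prices = pd.DataFrame(
    {"X": [20.0, 30.0, 40.0], "Y": [200.0, 210.0, 220.0]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)
rebased = rebase(demo_prices)
print(rebased)
```

With the notebook's data, one could plot `rebase(portfolio_df)` in place of `portfolio_df` to compare cumulative growth directly.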

## 4. Daily Percentage Change (Returns)

To analyze volatility and for use in portfolio optimization, we calculate the daily percentage change in price (daily returns).

In [None]:
# Calculate daily returns
daily_returns = portfolio_df.pct_change().dropna()

print("Daily Returns:")
display(daily_returns.head())

# Plot daily returns
daily_returns.plot(title='Daily Returns (Volatility)', alpha=0.7)
plt.ylabel('Daily Return (fraction)')  # pct_change() returns fractions, e.g. 0.05 = 5%
plt.show()

**Observation:** The daily returns plot highlights the volatility. TSLA's returns fluctuate much more wildly than SPY and BND. BND is the least volatile, with returns clustering tightly around zero. This visualizes the risk profile of each asset.
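
To put numbers on the visual impression, one can annualize the standard deviation of daily returns and look at the correlation between assets. This is a minimal sketch on made-up price paths, assuming 252 trading days per year (the same convention the Sharpe ratio calculation uses later); the asset names `RISKY` and `CALM` are placeholders.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed number of trading days per year

# Made-up price paths for two hypothetical assets with different volatility
rng = np.random.default_rng(0)
steps = rng.normal(loc=0.0, scale=[0.03, 0.005], size=(500, 2))
demo_prices = pd.DataFrame(100 * np.exp(steps.cumsum(axis=0)), columns=["RISKY", "CALM"])

demo_returns = demo_prices.pct_change().dropna()
annual_vol = demo_returns.std() * np.sqrt(TRADING_DAYS)   # annualized volatility per asset
corr = demo_returns.corr()                                # pairwise return correlation

print(annual_vol)
print(corr)
```

Applied to the notebook's `daily_returns`, the same two lines give a per-asset volatility figure and a 3x3 correlation matrix.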


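## 5. Rolling Statistics

We compute a 30-day rolling mean and standard deviation of TSLA's adjusted close to see how the trend and its short-term variability evolve.

In [None]: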
# 30-day rolling statistics for TSLA
rolling_mean_30d = portfolio_df['TSLA'].rolling(window=30).mean()
rolling_std_30d = portfolio_df['TSLA'].rolling(window=30).std()

# Plotting
plt.figure(figsize=(15, 7))
plt.plot(portfolio_df['TSLA'], label='TSLA Adj Close')
plt.plot(rolling_mean_30d, label='30-Day Rolling Mean', color='orange')
plt.plot(rolling_std_30d, label='30-Day Rolling Std Dev (Volatility)', color='red', linestyle='--')
plt.title('TSLA Price and 30-Day Rolling Statistics')
plt.legend()
plt.show()


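**Observation:** The rolling mean provides a smoothed version of the price trend. The rolling standard deviation, computed here on the price level, measures short-term price dispersion in dollars and serves as a rough proxy for volatility; because it scales with the price itself, it is best read alongside the daily returns plotted earlier. Periods of sharp price movement (such as 2020-2021) correspond to pronounced spikes in the rolling standard deviation, indicating higher risk.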
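
Since a rolling standard deviation of prices grows with the price level, volatility is more commonly measured on returns. The sketch below computes a 30-day rolling standard deviation of daily returns and annualizes it. The data are made up, `rolling_annualized_vol` is a hypothetical helper, and the 252-day annualization factor is an assumption.

```python
import numpy as np
import pandas as pd

def rolling_annualized_vol(prices: pd.Series, window: int = 30, periods_per_year: int = 252) -> pd.Series:
    """Rolling standard deviation of daily returns, scaled to annual terms."""
    returns = prices.pct_change()
    return returns.rolling(window=window).std() * np.sqrt(periods_per_year)

# Made-up price path on business days
rng = np.random.default_rng(1)
demo_price = pd.Series(
    100 * np.exp(rng.normal(0, 0.02, 200).cumsum()),
    index=pd.bdate_range("2020-01-01", periods=200),
)
demo_vol = rolling_annualized_vol(demo_price)
print(demo_vol.dropna().head())
```

With the notebook's data, `rolling_annualized_vol(portfolio_df['TSLA']).plot()` would show volatility on a scale that is comparable across the whole sample.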


## 6. Stationarity Test (Augmented Dickey-Fuller)

Time series models like ARIMA assume that the series being modeled is stationary once it has been differenced (i.e., its statistical properties like mean and variance are constant over time). We use the Augmented Dickey-Fuller (ADF) test to check for stationarity.

* **Null Hypothesis (H0):** The series is non-stationary (it has a unit root).
* **Alternative Hypothesis (H1):** The series is stationary.

We reject the null hypothesis if the p-value is less than a significance level (e.g., 0.05).


In [None]:
def perform_adf_test(series, name):
    """Performs and prints the results of an ADF test."""
    result = adfuller(series.dropna())
    print(f'--- ADF Test Results for {name} ---')
    print(f'ADF Statistic: {result[0]:.4f}')
    print(f'p-value: {result[1]:.4f}')
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value:.4f}')

    if result[1] <= 0.05:
        print("Conclusion: Reject the null hypothesis. The data is stationary.")
    else:
        print("Conclusion: Fail to reject the null hypothesis. The data is non-stationary.")

# Test on closing prices
perform_adf_test(portfolio_df['TSLA'], 'TSLA Closing Price')
print("\n")
# Test on daily returns
perform_adf_test(daily_returns['TSLA'], 'TSLA Daily Returns')


**Implications:**
* The **closing price** is **non-stationary**: the test fails to reject the unit-root null hypothesis (p-value > 0.05). This is typical for stock prices, whose level drifts over time instead of reverting to a fixed mean.
* The **daily returns** are **stationary** (p-value < 0.05). The returns fluctuate around a stable mean, which is a desirable property for many financial models.
* Because the price series is non-stationary, we will need to use differencing to make it stationary for the ARIMA model. This corresponds to the 'I' (Integrated) part of ARIMA, represented by the 'd' parameter.
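
First-order differencing (`d = 1`) replaces each value with its change from the previous one. The sketch below uses a made-up random walk to show that differencing recovers the underlying steps. It also shows that differencing log prices gives log returns, which stay close to the simple returns from `pct_change()` when daily moves are small.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
steps = rng.normal(0, 1, 300)                  # made-up i.i.d. shocks
walk = pd.Series(100 + steps.cumsum())         # random walk: non-stationary level

diffed = walk.diff().dropna()                  # first difference (d = 1)

# Log-differencing a (made-up) price series yields log returns
price = pd.Series(100 * np.exp(rng.normal(0, 0.01, 300).cumsum()))
log_returns = np.log(price).diff().dropna()
simple_returns = price.pct_change().dropna()
max_gap = (log_returns - simple_returns).abs().max()

print(diffed.head())
print(max_gap)
```

The differenced series could then be passed to `perform_adf_test` to check whether one round of differencing is enough.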

## 7. Foundational Risk Metrics

Let's calculate two foundational risk metrics: Value at Risk (VaR) and the Sharpe Ratio.

* **Value at Risk (VaR):** Estimates the potential loss in value of an asset or portfolio over a defined period at a given confidence level. We'll calculate the 1-day 95% historical VaR for TSLA: on 95% of the days in our sample, the daily loss did not exceed this amount.
* **Sharpe Ratio:** Measures the risk-adjusted return. It describes how much excess return you receive for the extra volatility you endure for holding a riskier asset. A higher Sharpe Ratio is better.
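
In symbols, a sketch of the two definitions as used in this notebook, where $r_t$ is the daily return, $Q_p$ the empirical $p$-quantile, $\alpha = 0.95$ the confidence level, $\mu_d$ and $\sigma_d$ the mean and standard deviation of daily returns, $r_f$ the annual risk-free rate, and 252 an assumed number of trading days per year:

$$
\mathrm{VaR}_{\alpha} = -\,Q_{1-\alpha}(r_t), \qquad
\mathrm{Sharpe} = \frac{252\,\mu_d - r_f}{\sqrt{252}\,\sigma_d}
$$

The code that follows reports the quantile $Q_{0.05}(r_t)$ itself, which is a negative number, and sets $r_f = 0$.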


In [None]:
# Value at Risk (VaR) at 95% confidence
var_95 = daily_returns['TSLA'].quantile(0.05)
print(f"TSLA 1-day 95% VaR: {var_95:.2%}")
print(f"Historically, TSLA's daily loss exceeded {-var_95:.2%} on only about 5% of trading days.\n")

# Sharpe Ratio
# Assuming a risk-free rate of 0 for simplicity. Short-term rates were not negligible
# throughout this period, so these ratios are somewhat overstated.
sharpe_ratios = (daily_returns.mean() * 252) / (daily_returns.std() * np.sqrt(252))
print("--- Annualized Sharpe Ratios ---")
print(sharpe_ratios)
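
The calculation above sets the risk-free rate to zero and reports only the VaR quantile. Below is a minimal sketch of how a non-zero rate and an expected-shortfall (CVaR) figure could be added. The returns are made up, `annual_rf` is a hypothetical input chosen for illustration, and `historical_var_cvar` and `sharpe_ratio` are helper names introduced here.

```python
import numpy as np
import pandas as pd

def historical_var_cvar(returns: pd.Series, level: float = 0.95):
    """Historical VaR and CVaR, both reported as positive loss fractions."""
    cutoff = returns.quantile(1 - level)          # e.g. the 5th percentile
    var = -cutoff
    cvar = -returns[returns <= cutoff].mean()     # average loss at or beyond the cutoff
    return var, cvar

def sharpe_ratio(returns: pd.Series, annual_rf: float = 0.0, periods: int = 252) -> float:
    """Annualized Sharpe ratio using excess returns over a constant risk-free rate."""
    excess = returns - annual_rf / periods
    return float(excess.mean() * periods / (excess.std() * np.sqrt(periods)))

rng = np.random.default_rng(7)
demo_returns = pd.Series(rng.normal(0.001, 0.02, 1000))   # made-up daily returns

demo_var, demo_cvar = historical_var_cvar(demo_returns)
sharpe_rf0 = sharpe_ratio(demo_returns)
sharpe_rf2 = sharpe_ratio(demo_returns, annual_rf=0.02)   # hypothetical 2% annual rate

print(f"VaR 95%: {demo_var:.2%}, CVaR 95%: {demo_cvar:.2%}")
print(f"Sharpe (rf=0): {sharpe_rf0:.2f}, Sharpe (rf=2%): {sharpe_rf2:.2f}")
```

CVaR is always at least as large as VaR at the same level, since it averages the losses in the tail beyond the VaR cutoff.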


In [None]:
# Find outliers for TSLA
mean_return = daily_returns['TSLA'].mean()
std_return = daily_returns['TSLA'].std()
cut_off = std_return * 3

outliers = daily_returns[(daily_returns['TSLA'] > mean_return + cut_off) | (daily_returns['TSLA'] < mean_return - cut_off)]

print(f"Identified {len(outliers)} outlier days for TSLA (returns more than 3 standard deviations from the mean).")
print("\nTop 5 largest positive return days:")
display(outliers['TSLA'].nlargest(5))
print("\nTop 5 largest negative return days:")
display(outliers['TSLA'].nsmallest(5))
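
The same three-sigma rule can be wrapped in a function and applied to every column at once. This is a sketch with made-up returns; `flag_outliers` is a hypothetical helper. Because extreme returns inflate the standard deviation used for the cutoff, this rule tends to be conservative.

```python
import numpy as np
import pandas as pd

def flag_outliers(returns: pd.DataFrame, n_std: float = 3.0) -> pd.DataFrame:
    """Boolean mask: True where a return lies more than n_std standard deviations from its column mean."""
    z_scores = (returns - returns.mean()) / returns.std()
    return z_scores.abs() > n_std

# Made-up returns with one injected extreme move in column A
rng = np.random.default_rng(3)
demo_returns = pd.DataFrame(rng.normal(0, 0.01, size=(500, 2)), columns=["A", "B"])
demo_returns.loc[100, "A"] = 0.25

mask = flag_outliers(demo_returns)
print(mask.sum())                       # outlier count per column
print(demo_returns["A"][mask["A"]])     # the flagged returns for A
```

With the notebook's data, `flag_outliers(daily_returns).sum()` would give the outlier count for TSLA, BND, and SPY side by side.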


### EDA Summary

This initial analysis has given us several key insights:
1.  **Asset Profiles:** We've confirmed the expected profiles: TSLA is high-growth and high-risk, SPY represents steady market growth, and BND provides stability.
2.  **Volatility:** TSLA's volatility is significantly higher than the other assets, presenting both opportunities and risks.
3.  **Stationarity:** TSLA's price series is non-stationary, but its daily returns are stationary. This is a crucial finding for our modeling approach in the next task.
4.  **Risk:** We've quantified the historical risk and risk-adjusted returns using VaR and the Sharpe Ratio, providing a baseline for performance evaluation.
