# Phase 1 — Data Acquisition & Cleaning (S&P 500)

**Project:** Dynamic Financial Risk Modeling  
**Dataset:** S&P 500 Index (stooq.pl)  
**Time Span:** 2000–2025  

---

## Objective
The goal of this phase is to construct a clean, reproducible financial time-series dataset
suitable for downstream volatility, tail-risk, and regime analysis.

---

## Data Source

The S&P 500 index data is obtained from **stooq.pl**, an open-access financial data provider.
The raw dataset contains daily open, high, low, and close prices plus volume,
with records extending back to the late 18th century.

Only observations from January 2000 onward are retained, so that the sample reflects
contemporary financial market structures.

In [None]:
# Imports
import pandas as pd
import numpy as np

In [None]:
# Download the daily history of the S&P 500 index (ticker ^spx) as CSV
url = "https://stooq.pl/q/d/l/?s=^spx&i=d"
sp500_raw = pd.read_csv(url)

## Raw Data Archival

To ensure reproducibility, the original raw dataset is saved locally before any transformations
are applied.

In [None]:
sp500_raw.to_csv(
    "../data/raw/sp500_raw.csv",
    index=False
)
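
One way to make the archival step verifiable is to record a checksum of the saved file, so later runs can confirm the raw data has not changed. The sketch below is illustrative only: the helper `sha256_of_file` and the temporary demo file are assumptions, not part of the original notebook.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file standing in for the archived CSV
# (hypothetical content; the real file would be ../data/raw/sp500_raw.csv)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "sp500_raw_demo.csv"
    demo_path.write_bytes(b"Data,Zamkniecie\n2000-01-03,1455.22\n")
    checksum = sha256_of_file(demo_path)

print(checksum)
```

Storing the digest next to the raw file gives a cheap way to detect a changed upstream download.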

## Variable Standardization

The original dataset is provided with Polish column names.
To improve clarity and align with standard financial econometrics
conventions, the columns are renamed to their English equivalents
in a copy of the raw dataset.

The original raw file is preserved without modification; the
English-labeled copy is saved separately and used for all downstream
analysis and visualization.

In [None]:
sp500 = sp500_raw.copy()

sp500 = sp500.rename(columns={
    "Data": "Date",
    "Otwarcie": "Open",
    "Najwyzszy": "High",
    "Najnizszy": "Low",
    "Zamkniecie": "Close",
    "Wolumen": "Volume"
})

sp500.to_csv(
    "../data/raw/sp500_raw_en.csv",
    index=False
)
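
Because `DataFrame.rename` silently ignores keys that are not present, a change in the provider's column names would go unnoticed at this step. A minimal sketch of a guard follows; the helper `missing_columns` and the one-row demo frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Columns that downstream steps rely on after the rename
EXPECTED_COLUMNS = ["Date", "Open", "High", "Low", "Close", "Volume"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# One-row stand-in for the raw download (Polish headers, as in the source data)
demo = pd.DataFrame({
    "Data": ["2000-01-03"],
    "Otwarcie": [1469.25],
    "Najwyzszy": [1478.0],
    "Najnizszy": [1438.36],
    "Zamkniecie": [1455.22],
    "Wolumen": [517666667.0],
})

renamed = demo.rename(columns={
    "Data": "Date",
    "Otwarcie": "Open",
    "Najwyzszy": "High",
    "Najnizszy": "Low",
    "Zamkniecie": "Close",
    "Wolumen": "Volume",
})

missing = missing_columns(renamed)
if missing:
    raise ValueError(f"Columns missing after rename: {missing}")
```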

## Date Processing and Sample Selection

Dates are converted to a proper datetime format.
The analysis is restricted to observations from January 2000 onward
to focus on modern financial market dynamics.

In [13]:
sp500["Date"] = pd.to_datetime(sp500["Date"], errors="raise")

In [14]:
sp500 = sp500[sp500["Date"] >= "2000-01-01"].reset_index(drop=True)

In [15]:
sp500 = sp500.sort_values("Date").reset_index(drop=True)

## Sanity Checks

The cleaned dataset is verified by inspecting the first and last observations
to ensure correct ordering and time coverage.

In [16]:
sp500.head()

|   | Date | Open | High | Low | Close | Volume |
|---|------|------|------|-----|-------|--------|
| 0 | 2000-01-03 | 1469.25 | 1478.0 | 1438.36 | 1455.22 | 517666667.0 |
| 1 | 2000-01-04 | 1455.22 | 1455.22 | 1397.43 | 1399.42 | 560555556.0 |
| 2 | 2000-01-05 | 1399.42 | 1413.27 | 1377.68 | 1402.11 | 603055556.0 |
| 3 | 2000-01-06 | 1402.11 | 1411.9 | 1392.1 | 1403.45 | 606833333.0 |
| 4 | 2000-01-07 | 1403.45 | 1441.47 | 1400.73 | 1441.47 | 680666667.0 |


In [17]:
sp500.tail()

|   | Date | Open | High | Low | Close | Volume |
|---|------|------|------|-----|-------|--------|
| 6526 | 2025-12-12 | 6886.85 | 6899.85 | 6801.79 | 6827.41 | 3223223000.0 |
| 6527 | 2025-12-15 | 6860.19 | 6861.59 | 6801.49 | 6816.51 | 3346423000.0 |
| 6528 | 2025-12-16 | 6800.12 | 6819.27 | 6759.74 | 6800.26 | 3271931000.0 |
| 6529 | 2025-12-17 | 6802.88 | 6812.26 | 6720.43 | 6721.43 | 3506740000.0 |
| 6530 | 2025-12-18 | 6778.06 | 6816.13 | 6758.5 | 6774.76 | 3362185000.0 |
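
Visual inspection of the head and tail can be complemented by programmatic checks. The following is a minimal sketch, not part of the original notebook: the function `check_price_frame` and its choice of checks are assumptions, demonstrated here on the first three rows shown above.

```python
import pandas as pd


def check_price_frame(df):
    """Return a dict of basic integrity checks for a daily OHLC frame."""
    return {
        "dates_sorted": bool(df["Date"].is_monotonic_increasing),
        "dates_unique": bool(df["Date"].is_unique),
        "no_missing_close": bool(df["Close"].notna().all()),
        "close_positive": bool((df["Close"] > 0).all()),
        "high_ge_low": bool((df["High"] >= df["Low"]).all()),
    }


# First three observations from the cleaned dataset
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05"]),
    "High": [1478.0, 1455.22, 1413.27],
    "Low": [1438.36, 1397.43, 1377.68],
    "Close": [1455.22, 1399.42, 1402.11],
})

report = check_price_frame(demo)
print(report)
```

Applied to the full frame, any `False` entry would point to a specific problem (unsorted or duplicate dates, missing or non-positive closes, or inconsistent high/low values) before returns are computed.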


## Feature Engineering

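Log prices and daily log returns are computed from closing prices. The log return
on a given day is the first difference of the log price, i.e. the log of the ratio of
that day's close to the previous close. Log returns are additive over time and are
commonly used in volatility and risk modeling.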

In [18]:
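# Log price and its first difference (the daily log return)
sp500["log_price"] = np.log(sp500["Close"])
sp500["log_return"] = sp500["log_price"].diff()

# Drop rows with an undefined return (the first observation); a bare dropna()
# would also discard rows with a missing value in any other column
returns = sp500.dropna(subset=["log_return"]).reset_index(drop=True)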
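
The additivity property can be checked numerically: the sum of daily log returns over a window equals the log of the ratio of the last to the first price. The short sketch below uses the first five closing prices of the sample; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First five closing prices of the post-2000 sample
close = pd.Series([1455.22, 1399.42, 1402.11, 1403.45, 1441.47])

# Daily log returns; the first value is NaN because there is no prior price
log_return = np.log(close).diff()

# Sum of daily log returns (pandas skips the leading NaN by default)
cumulative = log_return.sum()

# Log return over the whole window, computed directly from the endpoints
direct = np.log(close.iloc[-1] / close.iloc[0])

print(np.isclose(cumulative, direct))
```

Simple (arithmetic) returns do not share this property, which is one reason log returns are convenient for multi-period analysis.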

## Processed Dataset Export

The final cleaned dataset containing log returns is saved for use in
subsequent modeling phases, including volatility modeling and extreme value
theory (EVT) analysis.

In [None]:
returns[["Date", "log_return"]].to_csv(
    "../data/processed/sp500_log_returns.csv",
    index=False
)

## Phase 1 Conclusion — Data Acquisition and Preparation

This phase established a clean, reproducible foundation for all subsequent
statistical analyses by constructing a well-documented S&P 500 log-return
dataset.

Key outcomes include:
- Preservation of the original raw dataset to ensure reproducibility
- Translation of the Polish column names into standard English labels
- Parsing of dates, restriction to the sample from January 2000 onward, and
  chronological ordering
- Construction of daily log returns from closing prices
- Export of a processed dataset suitable for modeling and diagnostics

By separating raw, standardized, and processed data layers, this phase
enforces a transparent and modular data pipeline consistent with
research-grade statistical workflows.

---

## Bridge to Phase 2 — From Data Integrity to Empirical Properties

With a reliable dataset in place, the next step is to examine the empirical
characteristics of financial returns.

Phase 2 builds directly on this foundation by exploring distributional
properties, volatility behavior, and deviations from classical assumptions.
These exploratory diagnostics guide the selection of appropriate statistical
models in later phases, ensuring that methodological choices are driven by
data characteristics rather than convenience.

---

**Status:** Phase 1 completed — data cleaned, standardized, and structured for exploratory analysis (Phase 2).