In [1]:
# ---
# title: Data Engineering Foundations
# tags: [DataEngineering, MedallionArch, Parquet, Scraping]
# difficulty: Beginner
# ---

# Data Engineering Foundations: The Medallion Architecture

Welcome to the **Market-Mind Library**. In this first module, we explore how to build a robust financial data lake.

## Concepts Covered
1.  **Medallion Architecture**: Why we split data into Bronze (Raw), Silver (Clean), and Gold (Business-Level).
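2.  **Parquet vs CSV**: Why Parquet suits time-series data better than CSV: columnar storage lets us read only the columns we need, and each column's type is stored in the file, so dates and numbers survive a round trip.
3.  **ETL Pipelines**: Extracting from APIs (yfinance) and web pages (Selenium).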

### 1. The Architecture
We organize our data directory like this:
- `data/bronze`: Immutable raw dumps. If code breaks, we delete Silver/Gold and re-run from here.
- `data/silver`: Cleaned data. Dates aligned, missing values filled, types cast.
- `data/gold`: Aggregated data ready for ML/Analytics.

Let's inspect our current setup.

In [2]:
import pandas as pd
from pathlib import Path

base_path = Path("../data")
for layer in ['bronze', 'silver', 'gold']:
    path = base_path / layer
    # Count regular files only, so subdirectories do not inflate the total
    count = sum(1 for p in path.glob("*") if p.is_file())
    print(f"{layer.upper()} Layer: {count} files")

BRONZE Layer: 3 files
SILVER Layer: 5 files
GOLD Layer: 3 files
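
The re-run rule described above (delete Silver/Gold, keep Bronze) can be sketched in a few lines. This is a minimal illustration, not the project's actual tooling: `reset_derived_layers` is a hypothetical helper, the file name `sample.parquet` is a placeholder, and the demo runs against a temporary directory rather than `../data`.

```python
import shutil
import tempfile
from pathlib import Path

ALL_LAYERS = ("bronze", "silver", "gold")
DERIVED_LAYERS = ("silver", "gold")  # everything that can be rebuilt from bronze


def reset_derived_layers(base: Path) -> None:
    """Hypothetical helper: wipe and recreate silver/gold, never touch bronze."""
    for layer in DERIVED_LAYERS:
        layer_dir = base / layer
        shutil.rmtree(layer_dir, ignore_errors=True)
        layer_dir.mkdir(parents=True, exist_ok=True)


# Demo in a throwaway directory so no real data is affected.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for layer in ALL_LAYERS:
        (base / layer).mkdir()
        (base / layer / "sample.parquet").touch()  # empty placeholder file

    reset_derived_layers(base)
    counts = {layer: len(list((base / layer).glob("*"))) for layer in ALL_LAYERS}

print(counts)  # {'bronze': 1, 'silver': 0, 'gold': 0}
```

Keeping Bronze immutable is what makes this safe: the derived layers hold nothing that cannot be recomputed.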


### 2. Efficiency: Parquet vs CSV
Let's run the `MarketDataFetcher` manually on a small sample. It saves the raw result to the Bronze layer as Parquet.

In [3]:
import sys
import os
sys.path.append(os.path.abspath("../src"))

from ingestion.market_data import MarketDataFetcher

# Fetch a small sample
fetcher = MarketDataFetcher()
df = fetcher.fetch_history(["AAPL"], period="5d")
print(df.head())

Fetching data for 1 tickers: ['AAPL'] over 5d...


Saved raw data to market_mind/data/bronze/market_data_5d_20251214_180800.parquet
Ticker            AAPL                                              
Price             Open        High         Low       Close    Volume
Date                                                                
2025-12-08  278.130005  279.670013  276.149994  277.890015  38211800
2025-12-09  278.160004  280.029999  276.920013  277.179993  32193300
2025-12-10  277.750000  279.750000  276.440002  278.779999  33038300
2025-12-11  279.100006  279.589996  273.809998  278.029999  33248000
2025-12-12  277.795013  279.220001  276.820007  278.279999  38360082


### 3. Web Scraping with Selenium
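Scraping is fragile: page layouts change, elements load late, and requests time out. We wrap scraper calls in `try/except` blocks and fall back to alternatives (like our Mock Data generator) so that a failed scrape does not stop the pipeline.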
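
The fallback pattern can be sketched without Selenium at all. Everything below is illustrative: `scrape_with_fallback`, `generate_mock_headlines`, and `broken_scraper` are hypothetical names, and the project's real Mock Data generator may look quite different. In a real scraper you would likely catch Selenium's own exceptions (such as `WebDriverException`) rather than a bare `Exception`.

```python
import logging
import random
from typing import Callable

logger = logging.getLogger("scraper_sketch")

Row = dict[str, object]


def generate_mock_headlines(n: int = 3, seed: int = 0) -> list[Row]:
    """Hypothetical stand-in for a mock data generator (seeded, so repeatable)."""
    rng = random.Random(seed)
    return [
        {"headline": f"Mock headline {i}", "score": round(rng.uniform(-1, 1), 2), "is_mock": True}
        for i in range(n)
    ]


def scrape_with_fallback(
    scrape: Callable[[], list[Row]],
    fallback: Callable[[], list[Row]],
) -> list[Row]:
    """Try the real scraper; on any failure or empty result, use the fallback."""
    try:
        rows = scrape()
        if not rows:
            raise ValueError("scraper returned no rows")
        return rows
    except Exception as exc:  # broad on purpose for this sketch
        logger.warning("Scrape failed (%s); falling back to mock data", exc)
        return fallback()


def broken_scraper() -> list[Row]:
    """Simulates a page that never loads."""
    raise TimeoutError("page did not load")


rows = scrape_with_fallback(broken_scraper, generate_mock_headlines)
print(rows)
```

Tagging fallback rows (here with `is_mock`) keeps mock data from being mistaken for real observations further down the pipeline.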