# 📘 Notebook 01: Multi-Year Raw Data Ingestion

## 🎯 Purpose of This Notebook

This notebook is the **first operational notebook** in the Formula 1 Undercut Strategy Analytics pipeline.

Where **Notebook 00** was purely about *observing and discovering* what the FastF1 API gives us,  
**Notebook 01** is about **executing large-scale ingestion safely and deterministically**.

> This notebook has exactly one responsibility:
>
> **Fetch multi-year Formula 1 race data and persist it as raw Parquet files. Nothing more, nothing less.**

---

## 🚧 What This Notebook Is *Not* Allowed To Do

To preserve architectural correctness, this notebook explicitly **does NOT**:

- ❌ Clean data
- ❌ Normalize columns
- ❌ Rename fields
- ❌ Apply business logic
- ❌ Perform analytics
- ❌ Write to PostgreSQL
- ❌ Infer schemas dynamically

All of those steps belong to **later notebooks** by design.

---

## 🧠 Why This Notebook Exists

The project specification mandates:

- Multi-year coverage (2022 → present)
- Robust execution that **does not crash** when a single race fails
- Explicit rate limiting to avoid API abuse
- Deterministic persistence so downstream notebooks never touch the API again

This notebook exists to **convert an unreliable external API into a stable internal data layer**.
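The requirements above (per-race failure isolation plus explicit rate limiting) can be sketched as a single loop. This is a minimal illustration, not the project's actual loader: `ingest_rounds`, `fetch_and_persist`, and the injectable `sleep` parameter are hypothetical names introduced here, and the 3-second default is an assumption taken from the `sleep(3)` requirement mentioned later in this notebook.

```python
import logging
import time
from typing import Callable, Iterable

logger = logging.getLogger("ingestion_sketch")


def ingest_rounds(
    rounds: Iterable[int],
    fetch_and_persist: Callable[[int], None],
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[int]:
    """Run fetch_and_persist for every round and return the rounds that failed."""
    failed = []
    for round_no in rounds:
        try:
            fetch_and_persist(round_no)
        except Exception:
            # Log the full traceback, remember the failure, and keep going.
            logger.exception("Round %s failed; continuing with the next race", round_no)
            failed.append(round_no)
        finally:
            # Fixed pause after every attempt, successful or not.
            sleep(delay_seconds)
    return failed
```

Returning the failed rounds keeps failures explicit: the caller can log them, retry them, or stop the run, but one bad race never aborts the season.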

---

## 📂 Inputs Used by This Notebook

This notebook relies on **contracts and guarantees** produced earlier:

From **Notebook 00**:
- `schema_discovery_fastf1_bahrain_2022_2024.json`
- `schema_contract_columns.json`
- `schema_engineering_decisions.json`

These files tell us:
- Which tables are real
- Which columns are stable
- Which datasets are safe to ingest
- Which fields should be ignored downstream

> This notebook trusts those contracts blindly and does not re-discover schema.

---

## 🧱 Output Contract

By the end of this notebook, we expect:

- ✅ One folder per season (`year=2022`, `year=2023`, `year=2024`)
- ✅ One folder per race inside each season
- ✅ Raw Parquet files for:
  - `laps`
  - `results`
  - `weather_data`
  - `track_status`
  - `session_status`
  - `race_control_messages`
- ✅ No missing races unless explicitly logged as failures
- ✅ No dependency on in-memory variables for downstream work
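The contract above can be checked mechanically once ingestion has run. The sketch below assumes the layout used by the ingestion cells (`year=<season>/<race folder>/<table>.parquet`); the helper name `find_missing_tables` is hypothetical, and the check only looks at file presence, not file contents.

```python
from pathlib import Path

# Tables named in the output contract.
EXPECTED_TABLES = (
    "laps",
    "results",
    "weather_data",
    "track_status",
    "session_status",
    "race_control_messages",
)


def find_missing_tables(parquet_root: Path) -> dict[str, list[str]]:
    """Map each race folder (relative to parquet_root) to the tables it lacks."""
    missing = {}
    for season_dir in sorted(parquet_root.glob("year=*")):
        if not season_dir.is_dir():
            continue
        for race_dir in sorted(p for p in season_dir.iterdir() if p.is_dir()):
            absent = [
                table
                for table in EXPECTED_TABLES
                if not (race_dir / f"{table}.parquet").exists()
            ]
            if absent:
                missing[race_dir.relative_to(parquet_root).as_posix()] = absent
    return missing
```

An empty result means every race folder holds all six files; any entry should correspond to a failure that was already logged during ingestion.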

---

## ⚙️ Execution Philosophy

This notebook is intentionally **boring**:

- Explicit loops
- Explicit logging
- Explicit failures
- Explicit persistence

If something breaks here, it should break **loudly, early, and visibly**.

---

## 🏁 Expected Outcome

At the end of this notebook:

- The FastF1 API will never need to be called again for historical races
- All future notebooks will operate **only on Parquet**
- The pipeline becomes reproducible, restartable, and safe

This notebook turns **external chaos** into **internal certainty**.


In [1]:
# ============================================================
# Notebook 01 - Cell 1
# Multi-Year Ingestion: Environment & Contract Validation
# ============================================================

import sys
from pathlib import Path

# ------------------------------------------------------------
# 1. Resolve project root robustly (same strategy as Notebook 00)
# ------------------------------------------------------------
PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "src").exists():
    if PROJECT_ROOT.parent == PROJECT_ROOT:
        raise RuntimeError("❌ Could not resolve project root")
    PROJECT_ROOT = PROJECT_ROOT.parent

sys.path.insert(0, str(PROJECT_ROOT))

print(f"✅ Project root resolved: {PROJECT_ROOT}")

# ------------------------------------------------------------
# 2. Import project modules (NO assumptions)
# ------------------------------------------------------------
from src.logging_config import setup_logging
from src.config import Config
from src.fastf1_client import setup_fastf1

# ------------------------------------------------------------
# 3. Initialize logging
# ------------------------------------------------------------
logger, error_logger = setup_logging()

logger.info("Notebook 01 - Multi-Year Ingestion started")
logger.info("Cell 1 - Environment and contract validation")

# ------------------------------------------------------------
# 4. Validate environment configuration (.env)
# ------------------------------------------------------------
try:
    Config.validate()
    logger.info("Environment variables validated successfully")
except Exception:
    error_logger.error("Environment validation failed", exc_info=True)
    raise

# ------------------------------------------------------------
# 5. Validate schema artifacts from Notebook 00
# ------------------------------------------------------------
INTERIM_DIR = PROJECT_ROOT / "data" / "interim"

required_artifacts = [
    "schema_discovery_fastf1_bahrain_2022_2024.json",
    "schema_contract_columns.json",
    "schema_engineering_decisions.json"
]

missing_artifacts = []

for artifact in required_artifacts:
    artifact_path = INTERIM_DIR / artifact
    if not artifact_path.exists():
        missing_artifacts.append(artifact)
    else:
        logger.info(f"Schema artifact found: {artifact}")

if missing_artifacts:
    error_logger.error(
        f"Missing required schema artifacts: {missing_artifacts}. "
        "Notebook 01 cannot proceed safely."
    )
    raise FileNotFoundError(
        "Notebook 00 artifacts missing. "
        "Complete Notebook 00 before running Notebook 01."
    )

# ------------------------------------------------------------
# 6. Initialize FastF1 (cache + rate limiting)
# ------------------------------------------------------------
try:
    setup_fastf1()
    logger.info("FastF1 initialized with cache and adaptive rate limiting")
except Exception:
    error_logger.error("FastF1 initialization failed", exc_info=True)
    raise

# ------------------------------------------------------------
# 7. Explicit scope confirmation (guardrail)
# ------------------------------------------------------------
logger.info(
    "Notebook 01 scope confirmed:\n"
    "- Multi-year raw data ingestion ONLY\n"
    "- Uses schema contracts from Notebook 00\n"
    "- NO cleaning\n"
    "- NO normalization\n"
    "- NO database writes\n"
    "- NO analytics"
)

print("✅ Cell 1 completed - environment and schema contracts validated")


✅ Project root resolved: C:\Users\hersh\Desktop\f1_analysis_project


2025-12-14 22:33:15,661 | INFO | src.logging_config | Notebook 01 - Multi-Year Ingestion started
2025-12-14 22:33:15,662 | INFO | src.logging_config | Cell 1 - Environment and contract validation
2025-12-14 22:33:15,664 | INFO | src.logging_config | Environment variables validated successfully
2025-12-14 22:33:15,666 | INFO | src.logging_config | Schema artifact found: schema_discovery_fastf1_bahrain_2022_2024.json
2025-12-14 22:33:15,668 | INFO | src.logging_config | Schema artifact found: schema_contract_columns.json
2025-12-14 22:33:15,669 | INFO | src.logging_config | Schema artifact found: schema_engineering_decisions.json
2025-12-14 22:33:15,676 | INFO | src.logging_config | FastF1 cache enabled at: C:\Users\hersh\Desktop\f1_analysis_project\data\raw\fastf1_cache
2025-12-14 22:33:15,677 | INFO | src.logging_config | FastF1 initialized with cache and adaptive rate limiting
2025-12-14 22:33:15,679 | INFO | src.logging_config | Notebook 01 scope confirmed:
- Multi-year raw data 

✅ Cell 1 completed - environment and schema contracts validated


In [2]:
# ------------------------------------------------------------
# Cell 2 - Full Season Raw Ingestion (2022)
# ------------------------------------------------------------

from pathlib import Path
import pandas as pd

from src.batch_loader import load_season_races
from src.logging_config import setup_logging

# ------------------------------------------------------------
# 1. Initialize logging (reuse global config)
# ------------------------------------------------------------
logger, error_logger = setup_logging()

logger.info("Cell 2 - Full season raw ingestion started (2022)")

# ------------------------------------------------------------
# 2. Define raw Parquet output directory
# ------------------------------------------------------------
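PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "src").exists():
    # Stop at the filesystem root instead of looping forever (same guard as Cell 1)
    if PROJECT_ROOT.parent == PROJECT_ROOT:
        raise RuntimeError("Could not resolve project root")
    PROJECT_ROOT = PROJECT_ROOT.parent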

RAW_PARQUET_DIR = PROJECT_ROOT / "data" / "raw" / "parquet" / "year=2022"
RAW_PARQUET_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Raw Parquet output directory: {RAW_PARQUET_DIR}")

# ------------------------------------------------------------
# 3. Load all 2022 race sessions (race-only, all GPs)
# ------------------------------------------------------------
try:
    race_sessions_2022 = load_season_races(2022)
except Exception:
    error_logger.error(
        "Failed to load 2022 season race sessions",
        exc_info=True
    )
    raise

logger.info(
    f"Loaded {len(race_sessions_2022)} race sessions for 2022 season"
)

# ------------------------------------------------------------
# 4. Persist raw race tables to Parquet (NO cleaning)
# ------------------------------------------------------------
tables_written = 0
failed_races = []

# Raw tables persisted for every race (see the output contract)
TABLE_NAMES = [
    "laps",
    "results",
    "weather_data",
    "track_status",
    "session_status",
    "race_control_messages",
]

for session in race_sessions_2022:
    round_no = session.event.RoundNumber
    event_name = session.event.EventName.replace(" ", "_")

    base_path = RAW_PARQUET_DIR / f"round={round_no}_{event_name}"
    base_path.mkdir(parents=True, exist_ok=True)

    try:
        for table_name in TABLE_NAMES:
            # getattr with a default mirrors the original hasattr / None checks
            table = getattr(session, table_name, None)
            if table is not None:
                table.to_parquet(base_path / f"{table_name}.parquet")
                tables_written += 1
    except Exception:
        # Isolate the failure to this race so the rest of the season still persists
        error_logger.error(
            f"Failed to persist round {round_no} ({event_name})",
            exc_info=True
        )
        failed_races.append(round_no)

logger.info(
    f"Cell 2 completed - 2022 season ingested successfully "
    f"({len(race_sessions_2022)} races, {tables_written} tables written)"
)

print(
    f"✅ Cell 2 completed - 2022 season ingested "
    f"({len(race_sessions_2022)} races)"
)


2025-12-14 22:33:15,698 | INFO | src.logging_config | Cell 2 - Full season raw ingestion started (2022)
2025-12-14 22:33:15,700 | INFO | src.logging_config | Raw Parquet output directory: C:\Users\hersh\Desktop\f1_analysis_project\data\raw\parquet\year=2022
2025-12-14 22:33:15,701 | INFO | src.logging_config | Fetching race schedule for season 2022
2025-12-14 22:33:15,744 | INFO | src.logging_config | Discovered 22 race weekends for season 2022
2025-12-14 22:33:15,745 | INFO | src.logging_config | Loading race session - 2022 Round 1: Bahrain Grand Prix
2025-12-14 22:33:15,746 | INFO | src.logging_config | Requesting session - Year=2022, Round=1, Session=R
core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.7.0]
2025-12-14 22:33:17,160 | INFO | fastf1.fastf1.core | Loading data for Bahrain Grand Prix - Race [v3.7.0]
req            INFO 	Using cached data for session_info
2025-12-14 22:33:17,163 | INFO | fastf1.fastf1.req | Using cached data for session_info
req    

✅ Cell 2 completed - 2022 season ingested (22 races)


In [3]:
# ------------------------------------------------------------
# Cell 3 - Full Season Raw Ingestion (2023)
# ------------------------------------------------------------

from pathlib import Path
import pandas as pd

from src.batch_loader import load_season_races
from src.logging_config import setup_logging

# ------------------------------------------------------------
# 1. Initialize logging (reuse global config)
# ------------------------------------------------------------
logger, error_logger = setup_logging()

logger.info("Cell 3 - Full season raw ingestion started (2023)")

# ------------------------------------------------------------
# 2. Define raw Parquet output directory
# ------------------------------------------------------------
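PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "src").exists():
    # Stop at the filesystem root instead of looping forever (same guard as Cell 1)
    if PROJECT_ROOT.parent == PROJECT_ROOT:
        raise RuntimeError("Could not resolve project root")
    PROJECT_ROOT = PROJECT_ROOT.parent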

RAW_PARQUET_DIR = PROJECT_ROOT / "data" / "raw" / "parquet" / "year=2023"
RAW_PARQUET_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Raw Parquet output directory: {RAW_PARQUET_DIR}")

# ------------------------------------------------------------
# 3. Load all 2023 race sessions (race-only, all GPs)
# ------------------------------------------------------------
try:
    race_sessions_2023 = load_season_races(2023)
except Exception:
    error_logger.error(
        "Failed to load 2023 season race sessions",
        exc_info=True
    )
    raise

logger.info(
    f"Loaded {len(race_sessions_2023)} race sessions for 2023 season"
)

# ------------------------------------------------------------
# 4. Persist raw race tables to Parquet (NO cleaning)
# ------------------------------------------------------------
tables_written = 0
failed_races = []

# Raw tables persisted for every race (see the output contract)
TABLE_NAMES = [
    "laps",
    "results",
    "weather_data",
    "track_status",
    "session_status",
    "race_control_messages",
]

for session in race_sessions_2023:
    round_no = session.event.RoundNumber
    event_name = session.event.EventName.replace(" ", "_")

    base_path = RAW_PARQUET_DIR / f"round={round_no}_{event_name}"
    base_path.mkdir(parents=True, exist_ok=True)

    try:
        for table_name in TABLE_NAMES:
            # getattr with a default mirrors the original hasattr / None checks
            table = getattr(session, table_name, None)
            if table is not None:
                table.to_parquet(base_path / f"{table_name}.parquet")
                tables_written += 1
    except Exception:
        # Isolate the failure to this race so the rest of the season still persists
        error_logger.error(
            f"Failed to persist round {round_no} ({event_name})",
            exc_info=True
        )
        failed_races.append(round_no)

logger.info(
    f"Cell 3 completed - 2023 season ingested successfully "
    f"({len(race_sessions_2023)} races, {tables_written} tables written)"
)

print(
    f"✅ Cell 3 completed - 2023 season ingested "
    f"({len(race_sessions_2023)} races)"
)


2025-12-14 22:35:13,425 | INFO | src.logging_config | Cell 3 - Full season raw ingestion started (2023)
2025-12-14 22:35:13,427 | INFO | src.logging_config | Raw Parquet output directory: C:\Users\hersh\Desktop\f1_analysis_project\data\raw\parquet\year=2023
2025-12-14 22:35:13,429 | INFO | src.logging_config | Fetching race schedule for season 2023
2025-12-14 22:35:14,071 | INFO | src.logging_config | Discovered 22 race weekends for season 2023
2025-12-14 22:35:14,074 | INFO | src.logging_config | Loading race session - 2023 Round 1: Bahrain Grand Prix
2025-12-14 22:35:14,075 | INFO | src.logging_config | Requesting session - Year=2023, Round=1, Session=R
core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.7.0]
2025-12-14 22:35:15,449 | INFO | fastf1.fastf1.core | Loading data for Bahrain Grand Prix - Race [v3.7.0]
req            INFO 	Using cached data for session_info
2025-12-14 22:35:15,468 | INFO | fastf1.fastf1.req | Using cached data for session_info
req    

✅ Cell 3 completed - 2023 season ingested (22 races)


In [4]:
# ------------------------------------------------------------
# Cell 4 - Full Season Raw Ingestion (2024)
# ------------------------------------------------------------

from pathlib import Path
from src.batch_loader import load_season_races
from src.logging_config import setup_logging

logger, error_logger = setup_logging()

logger.info("Cell 4 - Full season raw ingestion started (2024)")

# ------------------------------------------------------------
# Resolve project root
# ------------------------------------------------------------
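PROJECT_ROOT = Path.cwd().resolve()
while not (PROJECT_ROOT / "src").exists():
    # Stop at the filesystem root instead of looping forever (same guard as Cell 1)
    if PROJECT_ROOT.parent == PROJECT_ROOT:
        raise RuntimeError("Could not resolve project root")
    PROJECT_ROOT = PROJECT_ROOT.parent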

RAW_PARQUET_DIR = PROJECT_ROOT / "data" / "raw" / "parquet" / "year=2024"
RAW_PARQUET_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Raw Parquet output directory: {RAW_PARQUET_DIR}")

# ------------------------------------------------------------
# Load all 2024 race sessions
# ------------------------------------------------------------
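try:
    race_sessions_2024 = load_season_races(2024)
except Exception:
    # Log the failure the same way Cells 2 and 3 do before re-raising
    error_logger.error(
        "Failed to load 2024 season race sessions",
        exc_info=True
    )
    raise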

logger.info(f"Loaded {len(race_sessions_2024)} race sessions for 2024")

# ------------------------------------------------------------
# Persist raw race tables
# ------------------------------------------------------------
tables_written = 0
failed_races = []

# Raw tables persisted for every race (see the output contract)
TABLE_NAMES = [
    "laps",
    "results",
    "weather_data",
    "track_status",
    "session_status",
    "race_control_messages",
]

for session in race_sessions_2024:
    round_no = session.event.RoundNumber
    event_name = session.event.EventName.replace(" ", "_")

    base_path = RAW_PARQUET_DIR / f"round={round_no}_{event_name}"
    base_path.mkdir(parents=True, exist_ok=True)

    try:
        for table_name in TABLE_NAMES:
            # Skip tables that are absent or None, as Cells 2 and 3 do
            table = getattr(session, table_name, None)
            if table is not None:
                table.to_parquet(base_path / f"{table_name}.parquet")
                tables_written += 1
    except Exception:
        # Isolate the failure to this race so the rest of the season still persists
        error_logger.error(
            f"Failed to persist round {round_no} ({event_name})",
            exc_info=True
        )
        failed_races.append(round_no)

logger.info(
    f"Cell 4 completed - 2024 season ingested "
    f"({len(race_sessions_2024)} races, {tables_written} tables written)"
)

print(
    f"✅ Cell 4 completed - 2024 season ingested "
    f"({len(race_sessions_2024)} races)"
)


2025-12-14 22:37:05,623 | INFO | src.logging_config | Cell 4 - Full season raw ingestion started (2024)
2025-12-14 22:37:05,626 | INFO | src.logging_config | Raw Parquet output directory: C:\Users\hersh\Desktop\f1_analysis_project\data\raw\parquet\year=2024
2025-12-14 22:37:05,627 | INFO | src.logging_config | Fetching race schedule for season 2024
2025-12-14 22:37:05,884 | INFO | src.logging_config | Discovered 24 race weekends for season 2024
2025-12-14 22:37:05,886 | INFO | src.logging_config | Loading race session - 2024 Round 1: Bahrain Grand Prix
2025-12-14 22:37:05,887 | INFO | src.logging_config | Requesting session - Year=2024, Round=1, Session=R
core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.7.0]
2025-12-14 22:37:07,298 | INFO | fastf1.fastf1.core | Loading data for Bahrain Grand Prix - Race [v3.7.0]
req            INFO 	Using cached data for session_info
2025-12-14 22:37:07,303 | INFO | fastf1.fastf1.req | Using cached data for session_info
req    

✅ Cell 4 completed - 2024 season ingested (24 races)


# 🏁 Notebook 01: Conclusion & Findings

## ✅ What This Notebook Successfully Achieved

This notebook **completed its core mission**:

> It transformed the FastF1 API into a stable, replayable, disk-backed data source.

Across multiple seasons, we:
- Loaded **every race weekend** using official round numbers
- Enforced race-only ingestion (`Session = "R"`)
- Persisted raw data immediately to Parquet
- Ensured failures did **not stop the pipeline**

---

## 🔍 Cell-by-Cell What We Learned

### 🧱 Cell 1: Environment & Contract Validation

**What we confirmed:**
- Project root resolution works reliably
- All schema artifacts from Notebook 00 exist
- FastF1 caching must be explicitly enabled with a directory
- The notebook scope is guarded against accidental misuse

**Key lesson:**
> FastF1 will fail silently or catastrophically if cache initialization is incorrect.  
This must always be validated upfront.

---

### 🏁 Cell 2 / 3 / 4: Season-Level Ingestion (2022–2024)

**What actually happened under the hood:**

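- FastF1 defers all downloads: `fastf1.get_session(...)` returns a session object with no data attached
- Calling `session.load()` is what downloads and parses the tables, or reads them from the cache
- Accessing `session.laps`, `session.weather_data`, etc. before `session.load()` raises `DataNotLoadedError`; it does not trigger a download
- Cache files (`.ff1pkl`) are written **per sub-dataset**, not per race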

**Why `.ff1pkl` files exploded in count:**
- Each race contains many independent API calls:
  - laps
  - timing
  - weather
  - car telemetry
  - position data
  - race control messages
- Each of these becomes a separate cached artifact

**Why Parquet files are fewer:**
- We intentionally persist **only analytical tables**
- Many FastF1 internal datasets are not relevant downstream

This is expected and correct behavior.

---

## 🚨 Major Errors Encountered (and What They Taught Us)

### ❌ Error: `Invalid round: None`
**Cause:**  
Passing `None` instead of a real round number.

**Fix:**  
Always derive round numbers from the official schedule.
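A minimal sketch of that fix, assuming the schedule is available as a list of row dictionaries with a `RoundNumber` key. With FastF1 such rows could come from `fastf1.get_event_schedule(year).to_dict("records")`. The helper name `race_rounds` is hypothetical, and treating round 0 as a testing event is an assumption about how the schedule marks non-championship entries.

```python
from typing import Any, Iterable, Mapping


def race_rounds(schedule_rows: Iterable[Mapping[str, Any]]) -> list[int]:
    """Return valid championship round numbers in ascending order.

    Rows with a missing or non-positive RoundNumber are skipped so that
    None can never reach the session request.
    """
    rounds = []
    for row in schedule_rows:
        round_no = row.get("RoundNumber")
        if round_no is None or int(round_no) < 1:
            continue
        rounds.append(int(round_no))
    return sorted(rounds)
```

Filtering at the schedule level means the session request only ever sees integers that the official calendar actually contains.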

---

### ❌ Error: `AdaptiveRateLimiter has no attribute record_success`
**Cause:**  
The calling code invoked `record_success`, a method the adaptive design intended but `AdaptiveRateLimiter` never actually implemented.

**Resolution Decision:**  
We explicitly **chose simplicity over half-baked adaptivity**.

The rate limiter now:
- Enforces deterministic waiting
- Avoids phantom adaptive behavior
- Matches the specification’s `sleep(3)` requirement
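A deterministic limiter of this kind can be sketched in a few lines. This is not the code in `src.fastf1_client`; the class name `FixedDelayRateLimiter` and the injectable `clock` and `sleep` parameters are assumptions made so the behavior can be tested without real waiting.

```python
import time
from typing import Callable, Optional


class FixedDelayRateLimiter:
    """Guarantee at least `delay_seconds` between consecutive wait() calls."""

    def __init__(
        self,
        delay_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep only as long as needed; return the number of seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.delay_seconds - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record the time of this call as the reference for the next one.
        self._last_call = self._clock()
        return slept
```

Calling `limiter.wait()` immediately before each API request is enough; there is no success or failure feedback to record, which is exactly what removes the phantom adaptive behavior.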

---

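### ❌ Error: `DataNotLoadedError`
**Cause:**  
Table properties were accessed on a session whose data had not been loaded. FastF1 does not load tables when the session object is created.

**Fix:**  
Call `session.load()` before accessing `session.laps` and the other table properties.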

---

### ❌ Error: UnicodeEncodeError in diagnostics
**Cause:**  
Windows console encoding (`cp1252`) cannot render emojis.

**Interpretation:**  
This is a logging issue, **not a data issue**.
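One way to keep diagnostics from crashing on such consoles is to substitute unencodable characters before printing. The helpers below are a hedged sketch with hypothetical names (`console_safe`, `make_stdout_tolerant`); they are not part of the project's logging configuration.

```python
import sys


def console_safe(text: str, encoding: str = "cp1252") -> str:
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


def make_stdout_tolerant() -> None:
    """Ask stdout to substitute unencodable characters instead of raising.

    Only plain text streams expose reconfigure(); notebook output streams
    may not, so the call is guarded.
    """
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(errors="replace")
```

Either approach changes only what is displayed. The Parquet files are written through pandas and are unaffected by console encoding.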

---

## 🧠 What We Now Know With Certainty

After Notebook 01:

- ✅ Multi-year ingestion works
- ✅ Caching behaves correctly
- ✅ Parquet persistence is reliable
- ✅ Failures are isolated, not catastrophic
- ✅ The API can now be ignored downstream

This notebook has converted a live API into a **static data lake**.

---

## ➡️ Next Steps (Notebook 02)

Based on both the **Specification** and **Pipeline documents**, the next notebook will:

### 📘 Notebook 02: Standardization & Parquet Enforcement

It will:
- Enforce the “Golden Rules”
- Normalize time units
- Normalize tyre compounds
- Normalize track status
- Remove unsafe or irrelevant columns
- Rewrite Parquet files into **interim, standardized form**

> From this point onward, **schema drift is no longer allowed**.

---

## 🧠 Final Statement

Notebook 01 has done exactly what it was supposed to do: no more, no less.

It is intentionally noisy, defensive, and unglamorous.

That is **precisely why the rest of the pipeline can now be clean, analytical, and confident**.

This notebook is now **closed by design**.
