# Notebook 1: SpaceX Launch Data Acquisition and Preprocessing

## Objective
This notebook retrieves and prepares **SpaceX launch metadata** using publicly
available APIs. The goal is to construct a clean, reproducible dataset of launch
times and mission details that can be merged with meteorological data in later
notebooks.

The launch timestamps produced here are the key on which all later notebooks align their data.

---

## Data source
Launch metadata are obtained from the **SpaceX public API (v4)**, which provides
information on:
- Launch date and time (UTC)
- Launch site
- Rocket family
- Mission outcome and descriptive details

Only launches from **Kennedy Space Center and Cape Canaveral** are retained to
ensure geographic consistency with the meteorological analysis.
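
The pad filter depends on knowing which launchpad identifiers belong to the Florida sites. A minimal sketch of resolving those identifiers by pad name rather than hard-coding them, assuming the launchpad listing returns records with `id` and `name` fields. The sample payload, its placeholder IDs, and the helper `resolve_pad_ids` are hypothetical and only illustrate the idea:

```python
from __future__ import annotations

# Hypothetical sample shaped like a launchpad listing; the IDs are placeholders,
# not real API identifiers.
SAMPLE_LAUNCHPADS = [
    {"id": "pad-001", "name": "KSC LC 39A"},
    {"id": "pad-002", "name": "CCSFS SLC 40"},
    {"id": "pad-003", "name": "VAFB SLC 4E"},
]


def resolve_pad_ids(launchpads: list[dict], name_fragments: set[str]) -> dict[str, str]:
    """Return {pad_id: pad_name} for pads whose name contains any fragment."""
    return {
        pad["id"]: pad["name"]
        for pad in launchpads
        if any(fragment in pad["name"] for fragment in name_fragments)
    }


# Keep only the two Florida pads; the Vandenberg pad is excluded.
florida_pads = resolve_pad_ids(SAMPLE_LAUNCHPADS, {"LC 39A", "SLC 40"})
print(florida_pads)
```

Printing the resolved names next to the IDs makes it easy to confirm by eye that no pad outside Florida slipped into the filter.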

---

## Data processing steps
The following preprocessing steps are applied:

1. Normalize nested JSON responses into a tabular format
2. Select and rename fields relevant to weather and operational analysis
3. Convert launch times to timezone-aware UTC timestamps
4. Filter launches to KSC / Cape Canaveral pads
5. Attach rocket family names via a secondary API query
6. Sort and clean the resulting dataset for downstream use

These steps ensure a standardized and machine-readable launch table.
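
The six steps above can be sketched end to end on a small hand-written sample. The records, rocket names, and pad IDs below are hypothetical stand-ins for the API payload; only the pandas calls mirror what the notebook does:

```python
import pandas as pd

# Hypothetical records shaped like the launch payload (illustrative values only).
sample_launches = [
    {"id": "l2", "name": "Mission B", "date_utc": "2020-05-30T19:22:00.000Z",
     "rocket": "r1", "launchpad": "pad-001", "success": True, "details": None,
     "links": {"webcast": "https://example.com/b"}},
    {"id": "l1", "name": "Mission A", "date_utc": "2018-02-06T20:45:00.000Z",
     "rocket": "r2", "launchpad": "pad-001", "success": True, "details": "Demo",
     "links": {"webcast": None}},
    {"id": "l3", "name": "Mission C", "date_utc": "2019-06-12T14:17:00.000Z",
     "rocket": "r1", "launchpad": "pad-003", "success": True, "details": None,
     "links": {"webcast": None}},
]
sample_rockets = {"r1": "Falcon 9", "r2": "Falcon Heavy"}  # hypothetical ID -> name map
florida_pad_ids = {"pad-001", "pad-002"}                   # placeholder pad IDs

# 1. Flatten nested JSON; nested keys become dotted columns such as "links.webcast".
table = pd.json_normalize(sample_launches)

# 2. Select and rename the fields of interest.
wanted = ["id", "name", "date_utc", "rocket", "launchpad", "success", "details"]
table = table[wanted].rename(columns={"rocket": "rocket_id"})

# 3. Parse launch times as timezone-aware UTC timestamps.
table["date_utc"] = pd.to_datetime(table["date_utc"], utc=True)

# 4. Keep only launches from the selected pads.
table = table[table["launchpad"].isin(florida_pad_ids)].copy()

# 5. Attach rocket names through the ID -> name lookup.
table["rocket_name"] = table["rocket_id"].map(sample_rockets)

# 6. Sort chronologically and reset the index.
table = table.sort_values("date_utc").reset_index(drop=True)

print(table[["id", "date_utc", "rocket_name"]])
```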

---

## Launch and scrub indicators
At this stage, only minimal indicators are defined:
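- `launched_flag` is 1.0 when the API records an outcome for the mission (a non-null `success` value, whether the flight succeeded or failed) and 0.0 otherwise
- `weather_scrub_flag` is initialized to `False` for every row as a placeholder and refined in later notebooks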

Detailed weather-scrub identification is intentionally deferred to Notebook 3,
where text-based classification is applied.
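
A small illustration of how the two indicators behave, assuming `success` is left null for launches with no recorded outcome. The three rows are made up for the example:

```python
import pandas as pd

# Made-up rows covering the three possible states of the `success` field.
outcomes = pd.DataFrame(
    {
        "name": ["Flown, succeeded", "Flown, failed", "No outcome recorded"],
        "success": [True, False, None],
    }
)

# A recorded outcome (True or False) counts as "launched"; null does not.
outcomes["launched_flag"] = outcomes["success"].notna().astype(float)

# Placeholder only: every row starts as "not a weather scrub".
outcomes["weather_scrub_flag"] = False

print(outcomes)
```

Note that a failed flight still gets `launched_flag = 1.0`, because the flag tracks whether an outcome exists, not whether the mission succeeded.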

---

## Output
The notebook writes `data/spacex_launches_ksc_2010_2024.csv`, which contains:
- One row per SpaceX launch at KSC/Cape Canaveral
- Launch time (UTC), launch year, and mission metadata
- Rocket family name
- The `launched_flag` and `weather_scrub_flag` indicators

This dataset is used as the input for ERA5 weather extraction in Notebook 2.
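
Because CSV stores timestamps as text, a downstream notebook has to re-parse `date_utc` after loading. A minimal round-trip sketch using an in-memory buffer instead of the real file; the single row is illustrative:

```python
import io

import pandas as pd

# One illustrative row with a timezone-aware UTC timestamp.
launch_table = pd.DataFrame(
    {
        "id": ["l1"],
        "date_utc": pd.to_datetime(["2018-02-06T20:45:00Z"], utc=True),
        "rocket_name": ["Falcon Heavy"],
    }
)

# Write to an in-memory buffer (stands in for the CSV file on disk).
buffer = io.StringIO()
launch_table.to_csv(buffer, index=False)
buffer.seek(0)

# On reload the column is plain text, so restore UTC-aware timestamps explicitly.
reloaded = pd.read_csv(buffer)
reloaded["date_utc"] = pd.to_datetime(reloaded["date_utc"], utc=True)

print(reloaded.dtypes)
```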

---

## Limitations
- Public API metadata may omit internal scheduling changes
- Scrub reasons are not consistently labeled
- Mission descriptions are written for public communication, not technical analysis

Despite these limitations, the SpaceX API provides a reliable foundation for
launch-time alignment and reproducible analysis.


In [None]:
from __future__ import annotations

import requests
import pandas as pd
from pathlib import Path

In [None]:
SPACEX_API = "https://api.spacexdata.com/v4/launches"

resp = requests.get(SPACEX_API, timeout=60)
resp.raise_for_status()

launches_raw = resp.json()
print(f"Retrieved {len(launches_raw)} launches from SpaceX API")
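
Since the stated goal is a reproducible dataset, one option is to cache the raw response on disk so that reruns do not depend on the live API. The sketch below is an assumption about how that could look, not part of the original workflow; `load_or_fetch` and the stand-in `fake_fetch` are hypothetical names:

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], list[dict]]) -> list[dict]:
    """Return the cached payload if present; otherwise fetch it and cache it."""
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    payload = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(payload), encoding="utf-8")
    return payload


# Stand-in fetcher that records how often it is called (no network access).
calls: list[int] = []


def fake_fetch() -> list[dict]:
    calls.append(1)
    return [{"id": "l1", "name": "Mission A"}]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "raw" / "launches.json"
    first = load_or_fetch(cache_file, fake_fetch)   # fetches and writes the cache
    second = load_or_fetch(cache_file, fake_fetch)  # served from the cache

print(f"fetch calls: {len(calls)}")
```

In this notebook the fetcher could wrap the existing request, for example a function that calls `requests.get(SPACEX_API, timeout=60)`, checks the status, and returns `.json()`.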

In [None]:
df = pd.json_normalize(launches_raw)

print("Raw columns:")
print(sorted(df.columns))


In [None]:
keep_cols = {
    "id": "id",
    "name": "name",
    "date_utc": "date_utc",
    "rocket": "rocket_id",
    "launchpad": "launchpad",
    "success": "success",
    "details": "details",
}

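# Keep only the selected fields and apply the new column names
df = df[list(keep_cols)].rename(columns=keep_cols)

# Parse ISO 8601 strings into timezone-aware UTC timestamps
df["date_utc"] = pd.to_datetime(df["date_utc"], utc=True)

# Calendar year (UTC) of each launch, used for summaries below
df["year"] = df["date_utc"].dt.year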


In [None]:
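# Launchpad IDs as listed by the /v4/launchpads endpoint.
# The previous set paired LC-39A with the SLC-40 ID and included the
# Vandenberg SLC-4E ID, which would admit California launches.
KSC_LAUNCHPADS = {
    "5e9e4502f509094188566f88",  # KSC LC-39A
    "5e9e4501f509094ba4566f84",  # CCSFS SLC-40
}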

df = df[df["launchpad"].isin(KSC_LAUNCHPADS)].copy()

print(f"Launches at KSC/Cape Canaveral: {len(df)}")


In [None]:
rocket_resp = requests.get("https://api.spacexdata.com/v4/rockets", timeout=60)
rocket_resp.raise_for_status()
rockets = rocket_resp.json()

rocket_map = {r["id"]: r["name"] for r in rockets}

df["rocket_name"] = df["rocket_id"].map(rocket_map)
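
`Series.map` silently yields `NaN` for any rocket ID missing from the lookup, so a quick check for unmapped IDs can catch gaps early. A sketch on made-up data; the IDs and the `rocket_lookup` name are hypothetical:

```python
import pandas as pd

# Made-up lookup and launches; "r9" is deliberately absent from the lookup.
rocket_lookup = {"r1": "Falcon 9"}
frame = pd.DataFrame({"id": ["l1", "l2"], "rocket_id": ["r1", "r9"]})

frame["rocket_name"] = frame["rocket_id"].map(rocket_lookup)

# Collect the rocket IDs that did not resolve to a name.
unmapped = frame.loc[frame["rocket_name"].isna(), "rocket_id"].unique().tolist()
print(f"Unmapped rocket IDs: {unmapped}")
```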


In [None]:
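# launched_flag: 1.0 when the API records an outcome (`success` is not null),
# taken to mean the launch took place; 0.0 otherwise
df["launched_flag"] = df["success"].notna().astype(float)

# weather_scrub_flag: placeholder set to False for every row
# (actual labeling happens in a later notebook via text classification)
df["weather_scrub_flag"] = False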


In [None]:
df = df.sort_values("date_utc").reset_index(drop=True)

df = df[
    [
        "id",
        "name",
        "date_utc",
        "year",
        "launchpad",
        "rocket_name",
        "launched_flag",
        "weather_scrub_flag",
        "details",
    ]
]

df.head()


In [None]:
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

out_path = data_dir / "spacex_launches_ksc_2010_2024.csv"
df.to_csv(out_path, index=False)

out_path


In [None]:
print(df["rocket_name"].value_counts())
print(df["year"].value_counts().sort_index())