# SpaceX Launch Data Enrichment (Web Scraping)

## Objective
Enrich the SpaceX launch dataset with additional mission information
obtained via web scraping from Wikipedia.

This step complements the API data by adding payload mass, orbit type,
and detailed mission outcomes that are not consistently available
in structured API responses.


## Why Web Scraping?

Not all relevant launch attributes are available in a structured format
through the SpaceX API. Wikipedia maintains human-curated tables of
Falcon 9 and Falcon Heavy launches that include:

- Payload mass
- Orbit type
- Mission outcome details

Web scraping allows us to systematically extract this information
and merge it with API-based data sources.


In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path


In [2]:
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches"

# Send an explicit User-Agent instead of the default python-requests one
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

# timeout=30 bounds how long we wait on the server; raise on any 4xx/5xx status
response = requests.get(WIKI_URL, headers=headers, timeout=30)
response.raise_for_status()

html = response.text

print("Wikipedia page downloaded successfully")


Wikipedia page downloaded successfully


In [3]:
soup = BeautifulSoup(html, "lxml")

tables = soup.find_all("table", class_="wikitable")

len(tables)


4

## Extraction Strategy

Wikipedia organizes Falcon 9 launches into multiple tables,
each corresponding to a time period.

The tables share a similar structure, allowing us to:
1. Iterate through each table
2. Extract rows and columns
3. Combine all tables into a single dataset


In [4]:
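from io import StringIO

dfs = []

for table in tables:
    # Wrap the HTML in StringIO: passing a literal HTML string to
    # pd.read_html is deprecated and raises a FutureWarning
    df = pd.read_html(StringIO(str(table)))[0]
    dfs.append(df)

len(dfs)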


4

In [5]:
wiki_df = pd.concat(dfs, ignore_index=True)
wiki_df.shape


(763, 10)

In [6]:
wiki_df.head()


| | Flight No. | Date and time (UTC) | Version, booster[j] | Launch site | Payload[k] | Payload mass | Orbit | Customer | Launch outcome | Booster landing |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 286 | January 3, 2024 03:44[23] | F9 B5 B1082‑1 | Vandenberg, SLC‑4E | Starlink: Group 7-9 (22 satellites) | ~16,800 kg (37,000 lb) | LEO | SpaceX | Success | Success (OCISLY) |
| 1 | 286 | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... | Launch of 22 Starlink v2 mini satellites, incl... |
| 2 | 287 | January 3, 2024 23:04[24] | F9 B5 B1076‑10 | Cape Canaveral, SLC‑40 | Ovzon-3 | 1,800 kg (4,000 lb) | GTO | Ovzon | Success | Success (LZ‑1) |
| 3 | 287 | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... | Broadband internet provider satellite.[25] Fir... |
| 4 | 288 | January 7, 2024 22:35[28] | F9 B5 B1067‑16 | Cape Canaveral, SLC‑40 | Starlink: Group 6-35 (23 satellites) | ~17,100 kg (37,700 lb) | LEO | SpaceX | Success | Success (ASOG) |


In [7]:
wiki_df.columns


Index(['Flight No.', 'Date and time (UTC)', 'Version, booster[j]',
       'Launch site', 'Payload[k]', 'Payload mass', 'Orbit', 'Customer',
       'Launch outcome', 'Booster landing'],
      dtype='object')

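The scraped dataset has several issues:
- Footnote markers embedded in column names and values (e.g., `[j]`, `[23]`)
- A description row after each launch row, with the same text repeated across every column
- Payload mass stored as text with units and approximations (e.g., `~16,800 kg (37,000 lb)`)
- Possible multi-level headers or inconsistent column names in some tables

These issues are expected and will be resolved
during the data wrangling phase.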
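
As a rough preview of that wrangling, the sketch below shows one way to handle three of the problems visible in the `head()` output: description rows that repeat one text across columns, bracketed footnote markers, and payload mass stored as text. The sample frame and the helper names (`drop_description_rows`, `strip_footnotes`, `parse_mass_kg`) are hypothetical and only illustrate the idea; the actual cleaning logic may differ.

```python
import re

import pandas as pd

# Hypothetical sample modeled on the scraped layout: each launch row is
# followed by a description row whose text repeats in every other column.
sample = pd.DataFrame(
    {
        "Flight No.": ["286", "286", "287", "287"],
        "Date and time (UTC)": [
            "January 3, 2024 03:44[23]",
            "Launch of 22 Starlink v2 mini satellites",
            "January 3, 2024 23:04[24]",
            "Broadband internet provider satellite.[25]",
        ],
        "Payload mass": [
            "~16,800 kg (37,000 lb)",
            "Launch of 22 Starlink v2 mini satellites",
            "1,800 kg (4,000 lb)",
            "Broadband internet provider satellite.[25]",
        ],
        "Orbit": [
            "LEO",
            "Launch of 22 Starlink v2 mini satellites",
            "GTO",
            "Broadband internet provider satellite.[25]",
        ],
    }
)


def drop_description_rows(df, key="Flight No."):
    """Drop rows whose non-key columns all hold the same value."""
    other = df.drop(columns=[key])
    is_description = other.nunique(axis=1) == 1
    return df.loc[~is_description].reset_index(drop=True)


def strip_footnotes(text):
    """Remove bracketed footnote markers such as [23] or [j]."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()


def parse_mass_kg(text):
    """Return the kilogram figure as a float, or None if no kg value is found."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*kg", text)
    return float(match.group(1).replace(",", "")) if match else None


launches = drop_description_rows(sample)
launches["Date and time (UTC)"] = launches["Date and time (UTC)"].map(strip_footnotes)
launches["payload_mass_kg"] = launches["Payload mass"].map(parse_mass_kg)
print(launches[["Flight No.", "Date and time (UTC)", "payload_mass_kg"]])
```

The description-row test assumes a genuine launch row never holds identical values in every column, which seems safe for this table but is worth verifying on the full dataset.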


In [8]:
# Flatten multi-level columns if present
if isinstance(wiki_df.columns, pd.MultiIndex):
    wiki_df.columns = wiki_df.columns.get_level_values(0)

wiki_df.columns = (
    wiki_df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

wiki_df.columns


Index(['flight_no.', 'date_and_time_(utc)', 'version,_booster[j]',
       'launch_site', 'payload[k]', 'payload_mass', 'orbit', 'customer',
       'launch_outcome', 'booster_landing'],
      dtype='object')
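
The normalized names above still carry footnote markers and punctuation (`payload[k]`, `flight_no.`). If stricter identifiers are wanted, a small helper along these lines could be used instead; `normalize_column` is a hypothetical name, and the sketch is not part of the original pipeline.

```python
import re


def normalize_column(name):
    """Convert a scraped header into a snake_case identifier."""
    # Drop bracketed footnote markers such as [j] or [k]
    name = re.sub(r"\[[^\]]*\]", "", name)
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-z]+", "_", name.lower())
    return name.strip("_")


scraped = ["Flight No.", "Date and time (UTC)", "Version, booster[j]", "Payload[k]"]
cleaned = [normalize_column(c) for c in scraped]
print(cleaned)
# ['flight_no', 'date_and_time_utc', 'version_booster', 'payload']
```

Applying it would look like `wiki_df.columns = [normalize_column(c) for c in wiki_df.columns]`. Note that this changes the column names written to the raw CSV, so any downstream notebook that expects the current names would need the same update.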

## Persisting Scraped Data

The scraped dataset is saved separately from API data to
preserve raw sources and enable reproducible data pipelines.


In [9]:
output_dir = Path("../data/raw")
output_dir.mkdir(parents=True, exist_ok=True)

output_path = output_dir / "spacex_wikipedia_launches_raw.csv"
wiki_df.to_csv(output_path, index=False)

print(f"Scraped data saved to: {output_path.resolve()}")


Scraped data saved to: /Users/razs/Desktop/RAZS/spacex-falcon9-landing-prediction/data/raw/spacex_wikipedia_launches_raw.csv
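
A quick round-trip check can confirm that the saved file reloads with the same shape and columns. The sketch below uses a temporary directory and a small stand-in frame (both hypothetical) rather than the real `wiki_df` and output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for wiki_df
frame = pd.DataFrame({"flight_no.": ["286", "287"], "orbit": ["LEO", "GTO"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "launches_raw.csv"
    frame.to_csv(path, index=False)
    # dtype=str keeps every value as text, matching the raw scrape
    reloaded = pd.read_csv(path, dtype=str)

same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
print(same_shape, same_columns)
```

Reading with `dtype=str` avoids pandas silently converting flight numbers or other text fields to numeric types before the wrangling step decides how to parse them.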


## Next Steps

In the next notebook, API-based launch data and
Wikipedia-scraped data will be cleaned, standardized,
and merged into a unified dataset suitable for
exploratory analysis and machine learning.
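
As a minimal sketch of what that merge might look like, the example below joins two small hypothetical frames on flight number. The column names (`flight_number`, `flight_no`, `landing_success`, `payload_mass_kg`) and the choice of join key are assumptions; flight numbering in the API and on Wikipedia may not line up one to one, so the real notebook may need a different key such as launch date.

```python
import pandas as pd

# Hypothetical API-side frame; column names are assumptions
api_df = pd.DataFrame(
    {"flight_number": [286, 287, 288], "landing_success": [True, True, True]}
)

# Hypothetical cleaned Wikipedia-side frame
wiki_clean = pd.DataFrame(
    {
        "flight_no": [286, 287],
        "orbit": ["LEO", "GTO"],
        "payload_mass_kg": [16800.0, 1800.0],
    }
)

merged = api_df.merge(
    wiki_clean,
    how="left",               # keep every API launch, matched or not
    left_on="flight_number",
    right_on="flight_no",
    validate="one_to_one",    # raise if either side has duplicate keys
    indicator=True,           # add a _merge column recording the match status
)

# Launches present in the API data but missing from the Wikipedia data
unmatched = merged.loc[merged["_merge"] == "left_only", "flight_number"].tolist()
print(unmatched)
```

The `validate` and `indicator` options make mismatches visible early: duplicate keys raise an error, and unmatched launches can be listed before they turn into silent missing values.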
