### SpaceX Falcon 9 Landing Prediction

### Web Scraping Falcon 9 and Falcon Heavy Launch Records from Wikipedia

In this lab, we scrape historical Falcon 9 launch records from the Wikipedia page titled "List of Falcon 9 and Falcon Heavy launches":

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches

## Objectives 


###  Web Scraping Falcon 9 Launch Records with BeautifulSoup

• Extract the Falcon 9 launch records HTML table from Wikipedia  
• Parse the HTML table using BeautifulSoup  
• Convert the parsed table into a Pandas DataFrame



In [3]:
import sys
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd 
import unicodedata

In [5]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
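The `oldid` parameter in `static_url` pins the request to one specific revision of the article, so the table layout that the later cells rely on does not shift as the live page is edited. As a small illustration (standard library only, not part of the original lab), the query string can be inspected like this:

```python
from urllib.parse import urlparse, parse_qs

static_url = (
    "https://en.wikipedia.org/w/index.php"
    "?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
)

# Split the URL and decode its query string into a dict of lists
query = parse_qs(urlparse(static_url).query)

page_title = query["title"][0]
revision_id = query["oldid"][0]

print(page_title)
print(revision_id)
```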

In [9]:
# Perform an HTTP GET request for the pinned page revision
response = requests.get(static_url)
# Check whether the request was successful
if response.status_code == 200:
    print("Page fetched successfully!")
    html_content = response.text
else:
    print(f"Failed to retrieve page, status code: {response.status_code}")

Page fetched successfully!


In [10]:
# Create a BeautifulSoup object from the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

#  Preview the page title to verify parsing
print(soup.title.text)

List of Falcon 9 and Falcon Heavy launches - Wikipedia


### Extract all column/variable names from the HTML table header

In [11]:
# Step 1: Find all <table> elements in the page
html_tables = soup.find_all("table")

# Step 2: Check how many tables were found 
print(f"Total tables found: {len(html_tables)}")

# Step 3: Print the third table (index 2) to inspect its content
# In this pinned revision it is the first Falcon 9 launch record table
target_table = html_tables[2]
print(target_table.prettify()[:3000]) 

Total tables found: 25
<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
 <tbody>
  <tr>
   <th scope="col">
    Flight No.
   </th>
   <th scope="col">
    Date and
    <br/>
    time (
    <a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">
     UTC
    </a>
    )
   </th>
   <th scope="col">
    <a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">
     Version,
     <br/>
     Booster
    </a>
    <sup class="reference" id="cite_ref-booster_11-0">
     <a href="#cite_note-booster-11">
      <span class="cite-bracket">
       [
      </span>
      b
      <span class="cite-bracket">
       ]
      </span>
     </a>
    </sup>
   </th>
   <th scope="col">
    Launch site
   </th>
   <th scope="col">
    Payload
    <sup class="reference" id="cite_ref-Dragon_12-0">
     <a href="#cite_note-Dragon-12">
      <span class="cite-bracket">
       [
      </span>
      c
      <span class="cit
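Picking a table by position (`html_tables[2]`) depends on the page layout. A sketch of an alternative, assuming the launch tables keep the `wikitable plainrowheaders collapsible` classes seen in the output above, is to filter by CSS class. The HTML below is a made-up miniature page, not the real article:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one unrelated table and two launch tables
sample_html = """
<table class="infobox"><tr><th>Not a launch table</th></tr></table>
<table class="wikitable plainrowheaders collapsible">
  <tr><th scope="col">Flight No.</th><th scope="col">Launch site</th></tr>
</table>
<table class="wikitable plainrowheaders collapsible">
  <tr><th scope="col">Flight No.</th><th scope="col">Launch site</th></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

all_tables = sample_soup.find_all("table")
# class_ matches any element whose class list contains "plainrowheaders"
launch_tables = sample_soup.find_all("table", class_="plainrowheaders")

print(len(all_tables), len(launch_tables))
first_header = launch_tables[0].find("th").get_text(strip=True)
print(first_header)
```

Filtering by class keeps the selection stable if unrelated tables are added or removed ahead of the launch tables.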

In [14]:
# Define or use the provided helper function
def extract_column_from_header(th):
    """Extracts clean column name from a <th> element."""
    if th.text:
        return th.text.strip()
    return None

# Step 1: Initialize the list to store column names
column_names = []

# Step 2: Find all <th> elements in the table
th_elements = target_table.find_all("th")

# Step 3: Extract column names
for th in th_elements:
    name = extract_column_from_header(th)
    if name is not None and len(name) > 0:
        column_names.append(name)

# Step 4: Check the extracted column names
print(column_names)

['Flight No.', 'Date andtime (UTC)', 'Version,Booster [b]', 'Launch site', 'Payload[c]', 'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding', 'Flight No.', 'Date andtime (UTC)', 'Version,Booster [b]', 'Launch site', 'Payload[c]', 'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding', 'Flight No.', 'Date andtime (UTC)', 'Version,Booster [b]', 'Launch site', 'Payload[c]', 'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding', '1', '2', '3', '4', '5', '6', '7']
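Names such as `'Date andtime (UTC)'` and `'Payload[c]'` appear because `th.text` joins the text on either side of a `<br/>` with nothing in between and includes the footnote markers held in `<sup>` tags. One possible way to avoid this at extraction time is sketched below; `clean_header_text` is a hypothetical helper, and the sample HTML is a trimmed imitation of the header row printed earlier:

```python
from bs4 import BeautifulSoup

def clean_header_text(th):
    """Return header text with <br> turned into spaces and footnotes dropped."""
    # Footnote markers such as [b] live in <sup> tags; remove them entirely
    for sup in th.find_all("sup"):
        sup.decompose()
    # A <br> carries no text, so swap it for a space before reading the text
    for br in th.find_all("br"):
        br.replace_with(" ")
    # Collapse any leftover runs of whitespace
    return " ".join(th.get_text().split())

sample_html = """
<table><tr>
  <th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time">UTC</a>)</th>
  <th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters">Version,<br/>Booster</a>
    <sup class="reference"><a href="#cite_note-booster-11">[b]</a></sup></th>
</tr></table>
"""

headers = [
    clean_header_text(th)
    for th in BeautifulSoup(sample_html, "html.parser").find_all("th")
]
print(headers)
```

Note that the helper modifies the parsed tree in place, so it suits a throwaway copy of the header row rather than a table that will be parsed again later.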


In [15]:
# Raw column names as extracted above (duplicates and stray digits included)
raw_column_names = ['Flight No.', 'Date andtime (UTC)', 'Version,Booster [b]', 'Launch site', 'Payload[c]', 
                    'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding',
                    'Flight No.', 'Date andtime (UTC)', 'Version,Booster [b]', 'Launch site', 'Payload[c]', 
                    'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding',
                    'Flight No.', 'Date andtime (UTC)', 'Version,Booster [b]', 'Launch site', 'Payload[c]', 
                    'Payload mass', 'Orbit', 'Customer', 'Launchoutcome', 'Boosterlanding',
                    '1', '2', '3', '4', '5', '6', '7']

# Step 1: Remove duplicates and numeric values
cleaned_columns = []
for col in raw_column_names:
    col = col.strip()
    if not col.isdigit() and col not in cleaned_columns:
        cleaned_columns.append(col)

# Step 2: Fix typos and rename for clarity
column_renames = {
    'Date andtime (UTC)': 'Date and time (UTC)',
    'Version,Booster [b]': 'Version Booster',
    'Payload[c]': 'Payload',
    'Launchoutcome': 'Launch outcome',
    'Boosterlanding': 'Booster landing'
}

# Apply renaming
final_column_names = [column_renames.get(col, col) for col in cleaned_columns]

# Step 3: Preview result
print(final_column_names)


['Flight No.', 'Date and time (UTC)', 'Version Booster', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Booster landing']
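The explicit loop above removes duplicates while keeping the first-seen order. A shorter sketch of the same idea uses `dict.fromkeys`, which relies on dictionaries preserving insertion order (Python 3.7 and later). The `raw` list here is a shortened, made-up sample:

```python
raw = ["Flight No.", "Orbit", "Flight No.", "Customer", "Orbit", "1", "2"]

# dict keys are unique and keep insertion order (Python 3.7+)
deduplicated = list(dict.fromkeys(name.strip() for name in raw))

# Drop the stray numeric entries, as the loop above does
cleaned = [name for name in deduplicated if not name.isdigit()]
print(cleaned)
```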


### Create a DataFrame by parsing the launch HTML tables

In this step, we prepare a structured dictionary to store the cleaned launch data extracted from the Falcon 9 launch tables.



In [19]:
# Create a dictionary keyed by the cleaned column names
launch_dict = dict.fromkeys(final_column_names)

# Drop the combined column; date and time are stored separately below
del launch_dict['Date and time (UTC)']

# Initialize each key with its own empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Added some new columns
launch_dict['Version Booster'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []
launch_dict['Time'] = []
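The cell above gives every key its own empty list one assignment at a time. That matters: passing `[]` as the default to `dict.fromkeys` would make all keys share a single list. The sketch below, with a shortened hypothetical key list, shows the pitfall next to a dict comprehension that avoids it:

```python
keys = ["Flight No.", "Payload", "Orbit"]

# Pitfall: every key points at the SAME list object
shared = dict.fromkeys(keys, [])
shared["Flight No."].append("1")

# A dict comprehension builds a fresh list per key
separate = {key: [] for key in keys}
separate["Flight No."].append("1")

print(shared["Payload"])
print(separate["Payload"])
```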

In [27]:



from bs4 import BeautifulSoup
import pandas as pd
import unicodedata

# --- Helper Functions ---
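def date_time(table_cells):
    # Keep only non-empty text fragments: the date first, then the time
    strings = [dt.strip() for dt in table_cells.strings if dt.strip()]
    # Pad with empty strings so a cell with fewer than two fragments still yields two values
    return (strings + ["", ""])[:2]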

def booster_version(table_cells):
    parts = [booster.strip() for i, booster in enumerate(table_cells.strings) if i % 2 == 0]
    return ''.join(parts[:-1]) if len(parts) > 1 else ''.join(parts)

def landing_status(table_cells):
    strings = [s.strip() for s in table_cells.strings if s.strip()]
    return strings[0] if strings else 'N/A'

def get_mass(table_cells):
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    return mass[:mass.find("kg")+2].strip() if "kg" in mass else "0"



# --- Initialize dictionary ---
launch_dict = {
    'Flight No.': [], 'Date': [], 'Time': [], 'Version Booster': [],
    'Launch Site': [], 'Payload': [], 'Payload mass': [], 'Orbit': [],
    'Customer': [], 'Launch outcome': [], 'Booster landing': []
}

extracted_row = 0

# --- Extract data ---
for table in soup.find_all('table', "wikitable plainrowheaders collapsible"):
    for rows in table.find_all("tr"):
        flag = False
        if rows.th and rows.th.string:
            flight_number = rows.th.string.strip()
            flag = flight_number.isdigit()

        row = rows.find_all('td')
        if flag:
            extracted_row += 1

            launch_dict['Flight No.'].append(flight_number)
            date, time = date_time(row[0])
            launch_dict['Date'].append(date.strip(','))
            launch_dict['Time'].append(time)

            bv = booster_version(row[1]) or (row[1].a.string if row[1].a else "")
            launch_dict['Version Booster'].append(bv)

            launch_dict['Launch Site'].append(row[2].a.string if row[2].a else row[2].text.strip())
            launch_dict['Payload'].append(row[3].a.string if row[3].a else row[3].text.strip())
            launch_dict['Payload mass'].append(get_mass(row[4]))
            launch_dict['Orbit'].append(row[5].a.string if row[5].a else row[5].text.strip())
            launch_dict['Customer'].append(row[6].a.string if row[6].a else row[6].text.strip())
            launch_dict['Launch outcome'].append(list(row[7].strings)[0].strip())
            launch_dict['Booster landing'].append(landing_status(row[8]))

# --- Save to CSV ---
df = pd.DataFrame(launch_dict)
df.to_csv("spacex_web_scraped.csv", index=False)
print(f" Extracted and saved {extracted_row} launch records to spacex_web_scraped.csv")


 Extracted and saved 121 launch records to spacex_web_scraped.csv
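`get_mass` stores the payload mass as text (for example `"4,700 kg"`) and falls back to `"0"` when no `kg` value is present. If a numeric column is needed later, one possible conversion is sketched below. `mass_to_kg` is a hypothetical helper, not part of the lab, and it returns `None` rather than `0` for a missing mass so that unknown values are not mistaken for real ones:

```python
import re
import unicodedata

def mass_to_kg(text):
    """Return the payload mass in kilograms as a float, or None if absent."""
    # NFKD turns non-breaking spaces (U+00A0) into plain spaces
    normalized = unicodedata.normalize("NFKD", text)
    # Digits with optional thousands separators, followed by "kg"
    match = re.search(r"([\d,]+(?:\.\d+)?)\s*kg", normalized)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(mass_to_kg("4,700\u00a0kg"))
print(mass_to_kg("525 kg"))
print(mass_to_kg("Classified"))
```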


In [28]:
df= pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items() })
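Wrapping each list in `pd.Series`, as the line above does, lets pandas align the columns by index and pad shorter ones with `NaN`, whereas plain lists of different lengths raise a `ValueError`. A minimal sketch with made-up columns of unequal length:

```python
import pandas as pd

# Hypothetical columns of unequal length
columns = {"Flight No.": ["1", "2", "3"], "Orbit": ["LEO", "LEO"]}

# pd.DataFrame(columns) would raise ValueError because the lengths differ
padded = pd.DataFrame({key: pd.Series(value) for key, value in columns.items()})

print(padded)
missing = int(padded["Orbit"].isna().sum())
print(missing)
```

Padding hides length mismatches, so it is worth checking for unexpected `NaN` values before relying on the resulting table.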

In [29]:
df.to_csv('spacex_web_scraped.csv', index=False)

In [30]:
df.head()

|   | Flight No. | Date | Time | Version Booster | Launch Site | Payload | Payload mass | Orbit | Customer | Launch outcome | Booster landing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 June 2010 | 18:45 | F9 v1.07B0003.18 | CCAFS | Dragon Spacecraft Qualification Unit | 0 | LEO | SpaceX | Success | Failure |
| 1 | 2 | 8 December 2010 | 15:43 | F9 v1.07B0004.18 | CCAFS | Dragon | 0 | LEO | NASA | Success | Failure |
| 2 | 3 | 22 May 2012 | 07:44 | F9 v1.07B0005.18 | CCAFS | Dragon | 525 kg | LEO | NASA | Success | No attempt |
| 3 | 4 | 8 October 2012 | 00:35 | F9 v1.07B0006.18 | CCAFS | SpaceX CRS-1 | 4,700 kg | LEO | NASA | Success | No attempt |
| 4 | 5 | 1 March 2013 | 15:10 | F9 v1.07B0007.18 | CCAFS | SpaceX CRS-2 | 4,877 kg | LEO | NASA | Success | No attempt |
