<h1>SpaceX Falcon 9 First Stage Landing Prediction</h1>

---


# 1.2 Web Scraping Falcon 9 and Falcon Heavy Launch Records


Falcon 9 historical launch records are scraped from the Wikipedia page "List of Falcon 9 and Falcon Heavy launches":
https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


### Objectives

- Extract the Falcon 9 launch records HTML table
- Parse the table and convert it into a pandas DataFrame

---

In [1]:
import sys
import pandas as pd

import requests
from bs4 import BeautifulSoup
import re

import unicodedata

### Auxiliary Functions

In [2]:
def date_time(table_cells):
  """
  This function returns the date and time from the HTML table cell
  """
  return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
  """
  This function returns the booster version from the HTML  table cell
  """
  out=''.join([booster_version for  i, booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
  return out

def landing_status(table_cells):
  """
  This function returns the landing status from the HTML table cell
  """
  out=[i for i in table_cells.strings][0]
  return out

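def get_mass(table_cells):
  """
  This function returns the payload mass (e.g. '4,700 kg') from the HTML table cell,
  or 0 when the cell is empty or contains no 'kg' value
  """
  # Normalize non-breaking spaces and other compatibility characters
  mass = unicodedata.normalize("NFKD", table_cells.text).strip()

  kg_position = mass.find("kg")
  if kg_position != -1:
    # Keep everything up to and including the 'kg' unit
    new_mass = mass[0:kg_position + 2]
  else:
    new_mass = 0

  return new_mass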

def extract_column_from_header(row):
  """
  This function returns the column name from the HTML table header cell
  """
  # Remove tags that are not needed
  if row.br : row.br.extract()
  if row.a : row.a.extract()
  if row.sup : row.sup.extract()

  # row.contents now holds only the child nodes left after the extractions above
  column_name = ' '.join(row.contents)

  # Filter out the digit-only vertical headers
  if not( column_name.strip().isdigit() ):
    column_name = column_name.strip()
    return column_name
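
The helpers above all rely on iterating over the text nodes of a cell. The sketch below illustrates that mechanism for a date cell. The markup in `sample_html` is a hypothetical simplification of a "Date and time" cell, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one "Date and time" cell; the real page may differ.
sample_html = "<table><tr><td>4 June 2010,<br/>18:45</td></tr></table>"
cell = BeautifulSoup(sample_html, "html.parser").td

# .strings yields each text node in document order and skips the tags themselves.
parts = [text.strip() for text in cell.strings]

# The first text node is the date (with a trailing comma), the second is the time.
date = parts[0].strip(",")
time_of_day = parts[1]
print(date, "|", time_of_day)
```

Because the split depends on the order of text nodes, a cell with extra leading markup would shift the positions, which is why the notebook keeps only the first two entries.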

### 1. Request the Falcon9 Launch page from its URL


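First, send an HTTP GET request for the Falcon 9 launch page and keep the HTTP response. The URL pins a specific revision of the page (`oldid`), so the scraped content stays the same even if the article is edited later.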


In [3]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
response = requests.get(static_url)

# Extract the source HTML as text (str) and set the parser to HTML
soup = BeautifulSoup(response.text, 'html')

soup.title

<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>
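
A plain `requests.get` call waits indefinitely by default and does not raise on HTTP error codes, and some servers, Wikipedia included, may reject or throttle requests without a descriptive User-Agent. The following is a minimal sketch of a more defensive request. The function name `fetch_html`, the header value, and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

# Placeholder User-Agent; replace it with a string that identifies your own script.
HEADERS = {"User-Agent": "falcon9-scraper-example/0.1 (contact: you@example.com)"}

def fetch_html(url, timeout=10):
    """Return the page body as text, raising on network or HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Turn 4xx/5xx status codes into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text

# Usage (requires network access):
# html = fetch_html(static_url)
# soup = BeautifulSoup(html, 'html')
```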

### 2. Extract all column/variable names from the HTML table header


Collect all relevant column names from the HTML table header.


In [None]:
html_tables = soup.find_all('table')

# The third table is our target table; contains the actual launch records.
first_launch_table = html_tables[2]
print(first_launch_table)
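
Picking a table by position (`html_tables[2]`) works for the pinned revision but breaks if tables are added or reordered. As a hedged alternative, tables can be selected by their CSS classes. The sketch below uses made-up markup to contrast an exact class-string match with a CSS selector that ignores class order.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one unrelated table and two launch-style tables.
html = """
<table class="infobox"><tr><th>Ignored</th></tr></table>
<table class="wikitable plainrowheaders collapsible"><tr><th>Flight No.</th></tr></table>
<table class="collapsible wikitable plainrowheaders"><tr><th>Flight No.</th></tr></table>
"""
doc = BeautifulSoup(html, "html.parser")

# A multi-word class_ string matches only that exact attribute value.
exact = doc.find_all("table", class_="wikitable plainrowheaders collapsible")

# A CSS selector matches tables carrying all three classes, in any order.
any_order = doc.select("table.wikitable.plainrowheaders.collapsible")

print(len(exact), len(any_order))
```

The selector form is more tolerant of markup changes, while the exact string is stricter; which one is preferable depends on how stable the page is.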

Column names are embedded in the `<th>` elements. Applying `extract_column_from_header()` to each element extracts the column names one by one.


In [5]:
column_names = []
for th in first_launch_table.find_all('th'):
  header = extract_column_from_header(th)
  if (header is not None) and (header.strip() != ''): column_names.append( header )

print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
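
The odd-looking name `'Date and time ( )'` follows from how the header helper works: the link inside the parentheses is removed and the remaining text nodes are joined with a space. The sketch below reproduces that effect on a hypothetical header cell, whose markup is an assumption about what the real header roughly looks like.

```python
from bs4 import BeautifulSoup

# Hypothetical header cell; the real page markup may contain more tags.
sample = '<table><tr><th>Date and time (<a href="/wiki/UTC">UTC</a>)</th></tr></table>'
th = BeautifulSoup(sample, "html.parser").th

# extract() removes the first <a> tag from the tree.
th.a.extract()

# Only the two text nodes around the removed link remain, so join() works on them.
name = " ".join(th.contents).strip()
print(repr(name))
```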


### 3. Create a DataFrame by parsing the launch HTML tables


In [6]:
launch_dict = dict.fromkeys(column_names)
# Remove an irrelevant column
del launch_dict['Date and time ( )']

# Initialize each launch_dict entry with its own empty list
launch_dict['Flight No.'] = []; launch_dict['Launch site'] = []; launch_dict['Payload'] = []
launch_dict['Payload mass'] = []; launch_dict['Orbit'] = []; launch_dict['Customer'] = []; launch_dict['Launch outcome'] = []
# Extra columns
launch_dict['Version Booster'] = []; launch_dict['Booster landing'] = []; launch_dict['Date'] = []; launch_dict['Time'] = []
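
Each key is given its own empty list for a reason: the shortcut `dict.fromkeys(names, [])` would make every key point at the same list object. A small sketch of the difference, using a shortened, illustrative column list:

```python
# Shortened column list, for illustration only.
columns = ["Flight No.", "Launch site", "Payload"]

# Pitfall: every key shares one single list object.
shared = dict.fromkeys(columns, [])
shared["Flight No."].append("1")

# Safe: the comprehension builds a separate list per key.
independent = {name: [] for name in columns}
independent["Flight No."].append("1")

print(shared["Payload"])       # also contains "1", because the list is shared
print(independent["Payload"])  # still empty
```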

HTML tables in Wikipedia pages often contain unexpected annotations and other noise, such as reference links (`B0004.1[8]`), missing values (`N/A [e]`), and inconsistent formatting.
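
As an illustration of how such noise can be stripped, the following sketch removes bracketed footnote markers and normalizes whitespace. The helper name `clean_cell` and the regular expression are assumptions for this example; the notebook itself handles these cases through its auxiliary functions.

```python
import re
import unicodedata

def clean_cell(text):
    """Remove bracketed footnote markers such as [8] or [e] and tidy whitespace."""
    # NFKD turns non-breaking spaces into regular spaces.
    text = unicodedata.normalize("NFKD", text)
    # Drop short bracketed markers made of letters or digits.
    text = re.sub(r"\[\s*\w+\s*\]", "", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

for raw in ["B0004.1[8]", "N/A [e]", "4,700\xa0kg"]:
    print(repr(clean_cell(raw)))
```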


In [7]:
# Extract each table
for table_number, table in enumerate(soup.find_all('table','wikitable plainrowheaders collapsible')):
  # Iterate through each table row
  for rows in table.find_all("tr"):
    # Check whether the row's vertical heading is a number, i.e. a launch number.
    # This depends on the design of the tables in the page, where the launch numbers are headings.
    flag = False
    if rows.th:
      if rows.th.string:
        flight_number = rows.th.string.strip()
        flag = flight_number.isdigit()

    # Get all row cells; if the heading is a launch number, save the cells in the dictionary.
    # Most cells contain hyperlinks, so we access the <a> tag first (cell.a).
    # In every other case, we use the auxiliary functions, which adjust to each particular cell structure.
    row = rows.find_all('td')

    if flag:

      # Flight Number value
      launch_dict['Flight No.'].append(flight_number)

      # Date value
      datatimelist = date_time(row[0])
      date = datatimelist[0].strip(',')
      launch_dict['Date'].append(date)

      # Time value
      time = datatimelist[1]
      launch_dict['Time'].append(time)

      # Booster version
      bv = booster_version(row[1])
      if not bv: bv = row[1].a.string
      launch_dict['Version Booster'].append(bv)

      # Launch Site
      launch_site = row[2].a.string
      launch_dict['Launch site'].append(launch_site)

      # Payload
      payload = row[3].a.string
      launch_dict['Payload'].append(payload)

      # Payload Mass
      payload_mass = get_mass(row[4])
      launch_dict['Payload mass'].append(payload_mass)

      # Orbit
      orbit = row[5].a.string
      launch_dict['Orbit'].append(orbit)

      # Customer
      customer = row[6].a
      if customer is not None: launch_dict['Customer'].append(customer.string)
      else: launch_dict['Customer'].append(None)


      # Launch outcome
      launch_outcome = list(row[7].strings)[0]
      launch_dict['Launch outcome'].append(launch_outcome)

      # Booster landing
      booster_landing = landing_status(row[8])
      launch_dict['Booster landing'].append(booster_landing)
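
Expressions such as `row[2].a.string` raise an `AttributeError` when a cell has no link. One possible safeguard is a small fallback helper, sketched below. The name `link_text` and the sample cells are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup

def link_text(cell):
    """Return the first link's text, else the cell's own text, else None."""
    if cell.a is not None and cell.a.string:
        return cell.a.string.strip()
    # No usable link: fall back to the plain text of the cell.
    return cell.get_text(strip=True) or None

# Hypothetical cells: one with a link, one with plain text, one empty.
sample = ('<table><tr><td><a href="#">CCAFS</a>, SLC-40</td>'
          '<td>SpaceX</td><td></td></tr></table>')
cells = BeautifulSoup(sample, "html.parser").find_all("td")
values = [link_text(cell) for cell in cells]
print(values)
```

Note that for a linked cell the helper keeps only the link text, which mirrors what the loop above does for the launch site, payload, and orbit columns.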


In [8]:
# Check that every column has the same length; otherwise pd.DataFrame raises
# 'ValueError: All arrays must be of the same length'
for key in launch_dict.keys():
  print(f'{key}: {len(launch_dict[key])}')

Flight No.: 121
Launch site: 121
Payload: 121
Payload mass: 121
Orbit: 121
Customer: 121
Launch outcome: 121
Version Booster: 121
Booster landing: 121
Date: 121
Time: 121
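
The same check can be turned into a guard that fails early with a readable message instead of relying on visual inspection. This is a minimal sketch; `check_equal_lengths` and the miniature dictionary are illustrative names and data, not part of the notebook.

```python
def check_equal_lengths(columns):
    """Return the length of each column, raising ValueError if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Hypothetical miniature of launch_dict, used only for illustration.
sample = {"Flight No.": ["1", "2"], "Date": ["4 June 2010", "8 December 2010"]}
lengths = check_equal_lengths(sample)
print(lengths)
```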


In [9]:
df = pd.DataFrame(launch_dict).reset_index( drop = True )
df = df.replace(['Success\n', 'No attempt\n'], ['Success', 'No attempt'])

print(df.shape)
df.head()

(121, 11)


|   | Flight No. | Launch site | Payload | Payload mass | Orbit | Customer | Launch outcome | Version Booster | Booster landing | Date | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CCAFS | Dragon Spacecraft Qualification Unit | 0 | LEO | SpaceX | Success | F9 v1.0B0003.1 | Failure | 4 June 2010 | 18:45 |
| 1 | 2 | CCAFS | Dragon | 0 | LEO | NASA | Success | F9 v1.0B0004.1 | Failure | 8 December 2010 | 15:43 |
| 2 | 3 | CCAFS | Dragon | 525 kg | LEO | NASA | Success | F9 v1.0B0005.1 | No attempt | 22 May 2012 | 07:44 |
| 3 | 4 | CCAFS | SpaceX CRS-1 | 4,700 kg | LEO | NASA | Success | F9 v1.0B0006.1 | No attempt | 8 October 2012 | 00:35 |
| 4 | 5 | CCAFS | SpaceX CRS-2 | 4,877 kg | LEO | NASA | Success | F9 v1.0B0007.1 | No attempt | 1 March 2013 | 15:10 |
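
The scraped values are still strings: masses carry a unit and a thousands separator, and dates are free text. A possible follow-up, sketched here on the five rows shown above, is to convert them to numeric and datetime types. The variable names and the regular expression are assumptions, and the notebook itself exports the raw strings.

```python
import pandas as pd

# The first five rows of the two columns, copied from the output above.
sample = pd.DataFrame({
    "Payload mass": [0, 0, "525 kg", "4,700 kg", "4,877 kg"],
    "Date": ["4 June 2010", "8 December 2010", "22 May 2012",
             "8 October 2012", "1 March 2013"],
})

# Remove thousands separators, keep the leading number, and convert to float.
mass_kg = (sample["Payload mass"].astype(str)
           .str.replace(",", "", regex=False)
           .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
           .astype(float))

# Parse day, full month name, and year into datetime values.
dates = pd.to_datetime(sample["Date"], format="%d %B %Y")

print(mass_kg.tolist())
print(dates.dt.year.tolist())
```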


Export the DataFrame to a <b>CSV</b> file.


In [10]:
df.to_csv('df_scraped.csv', index=False)
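
Passing `index=False` matters when the file is read back: without it, pandas writes the row index as an extra, unnamed first column. The sketch below demonstrates the difference with a two-row stand-in DataFrame written to a temporary directory. The sample data and file names are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the scraped DataFrame.
sample = pd.DataFrame({"Flight No.": ["1", "2"], "Payload mass": [0, "525 kg"]})

with tempfile.TemporaryDirectory() as tmp:
    clean_path = os.path.join(tmp, "without_index.csv")
    indexed_path = os.path.join(tmp, "with_index.csv")

    sample.to_csv(clean_path, index=False)  # columns only
    sample.to_csv(indexed_path)             # also writes the row index

    restored = pd.read_csv(clean_path)
    restored_with_index = pd.read_csv(indexed_path)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

Reading the file back also re-infers column types, so the flight numbers stored as strings come back as integers.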