<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Space X  Falcon 9 First Stage Landing Prediction**


## Web scraping Falcon 9 and Falcon Heavy Launches Records from Wikipedia


Estimated time needed: **40** minutes


In this lab, you will be performing web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/Falcon9_rocket_family.svg)


Falcon 9 first stage will land successfully


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


More specifically, the launch records are stored in a HTML table shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


  ## Objectives
Web scrap Falcon 9 launch records with `BeautifulSoup`: 
- Extract a Falcon 9 launch records HTML table from Wikipedia
- Parse the table and convert it into a Pandas data frame


First let's import required packages for this lab


In [1]:
# This command installs the BeautifulSoup4 library, which is used for web scraping tasks.
!pip3 install beautifulsoup4

# This command installs the Requests library, which is used for making HTTP requests.
!pip3 install requests




In [2]:
import sys
import requests # allows Python code to make HTTP requests to web servers and retrieve data from the web.
from bs4 import BeautifulSoup # A library used for web scraping tasks.
import re # Provides support for regular expressions in Python. Regular expressions are used for pattern matching and text manipulation.
import unicodedata # Provides access to Unicode character properties. It's useful for working with Unicode strings.
import pandas as pd # for data manipulation and analysis in Python. It provides data structures like DataFrame for working with structured data.


and we will provide some helper functions for you to process web scraped HTML table


In [3]:
def date_time(table_cells):
    """
    This function returns the data and time from the HTML table cell
    Input: the element of a table data cell extracts extra row
    """
    # Extracts text data from the table cell, strips leading and trailing spaces, and takes the first two elements
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    This function returns the booster version from the HTML table cell 
    Input: the element of a table data cell extracts extra row
    """
    # Joins strings from the table cell's strings generator based on index conditions
    # It filters even-indexed strings and joins them into a single string
    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    This function returns the landing status from the HTML table cell 
    Input: the element of a table data cell extracts extra row
    """
    # Extracts the first string from the table cell's strings generator
    out=[i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    # Normalizes Unicode text and strips leading and trailing spaces
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        # Finds the index of "kg" in the text and extracts the substring including "kg"
        mass.find("kg")
        new_mass=mass[0:mass.find("kg")+2]
    else:
        # If mass is empty, set new_mass to 0
        new_mass=0
    return new_mass


def extract_column_from_header(row):
    """
    This function returns the landing status from the HTML table cell 
    Input: the element of a table data cell extracts extra row
    """
    # Removes <br>, <a>, and <sup> tags from the row
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    # Joins the contents of the row into a single string, stripping leading and trailing spaces
    colunm_name = ' '.join(row.contents)
    
    # Filter the digit and empty names
    if not(colunm_name.strip().isdigit()):
        colunm_name = colunm_name.strip()
        return colunm_name    


To keep the lab tasks consistent, We will scrape the data from a snapshot of the  `List of Falcon 9 and Falcon Heavy launches` Wikipage updated on `9th June 2021`


In [4]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# The static_url variable stores the URL of a Wikipedia page containing a list of Falcon 9 and Falcon Heavy launches.
# This URL points to a specific version of the page, identified by the oldid parameter.
# The page provides information about various launches conducted using Falcon 9 and Falcon Heavy rockets.


Next, we request the HTML page from the above URL and get a `response` object


### TASK 1: Request the Falcon9 Launch Wiki page from its URL


First, let's perform an HTTP GET method to request the Falcon9 Launch HTML page, as an HTTP response.


In [6]:
# use requests.get() method with the provided static_url
# assign the response to a object
response = requests.get(static_url)

Create a `BeautifulSoup` object from the HTML `response`


In [8]:
# Use BeautifulSoup() to create a BeautifulSoup object from a response text content

# BeautifulSoup is a Python library for parsing HTML and XML documents.
# We pass two arguments to the BeautifulSoup constructor:
# 1. response.text: The text content of the response object obtained from the GET request.
#    This contains the HTML content of the webpage we fetched.

# 2. 'html.parser': The parser to be used for parsing the HTML content.
#    In this case, we specify 'html.parser', which is a built-in HTML parser provided by BeautifulSoup.
#    This parser is capable of handling most HTML documents and is included by default.
#    Alternatively, you can use other parsers like 'lxml' or 'html5lib' depending on your needs.
# The BeautifulSoup object created, named 'soup', represents the parse tree of the HTML document.
# We can then use this object to navigate and extract data from the HTML structure.
soup = BeautifulSoup(response.text, 'html.parser')


Printing the page title to verify if the `BeautifulSoup` object was created properly 


In [10]:
# Use soup.title attribute

# Access the title attribute of the BeautifulSoup object (soup) using dot notation.
# The title attribute represents the title of the HTML document, enclosed within the <title>...</title> tags.
# When we access soup.title, we are essentially selecting the <title> tag from the parsed HTML document.
# The .text property is then used to retrieve the text content enclosed within the <title> tag.
# This text content represents the title of the webpage.
# Finally, the print() function is used to output the title text to the console.
print(soup.title.text)


List of Falcon 9 and Falcon Heavy launches - Wikipedia


### TASK 2: Extract all column/variable names from the HTML table header


Next, we collect all relevant column names from the HTML table header


We find all tables on the wiki page first. 


In [14]:
# Use the find_all function in the BeautifulSoup object, with element type `table`
# Use the find_all function provided by BeautifulSoup to find all occurrences of a HTML element type within the parsed HTML document.
# In this case, we're searching for all <table> elements in the HTML document.
# The find_all function takes the HTML element type ("table") as its argumen and returns a list containing all the <table> elements found in the document.
# Each element in the list is a BeautifulSoup Tag object representing an individual <table> element.


html_tables = soup.find_all("table")


Starting from the third table is our target table contains the actual launch records.


In [15]:
# Let's print the third table and check its content
first_launch_table = html_tables[2]
print(first_launch_table)

<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
<tr>
<th rowspan="2" scope="row" style="text-align:center;">1
</th>
<td>

You should able to see the columns names embedded in the table header elements `<th>` as follows:


```
<tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
```


Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract column name one by one


In [21]:
column_names = []

# Apply find_all() function with `th` element on first_launch_table
# We should use the 'th' tag directly to find all table header elements.
# he result of find_all() is assigned to a variable.
th_elements = first_launch_table.find_all('th')

# Iterate over each th element and apply the provided extract_column_from_header() function to get a column name
for name in th_elements:
    # Call the extract_column_from_header function and store the result
    column_name = extract_column_from_header(name)
    
    # Check if the extracted column name is not None and has a non-zero length
    if column_name is not None and len(column_name) > 0:
        # Append the non-empty column name to the column_names list
        column_names.append(column_name)


Check the extracted column names


In [22]:
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']


## TASK 3: Create a data frame by parsing the launch HTML tables


In [27]:
#Parsing refers to the process of analyzing a string of symbols (such as code or text) according to the rules of a formal grammar.
#In the context of HTML documents, parsing involves analyzing the structure and content of the HTML code to extract relevant information, such as text, tags, attributes, and their relationships.

#When we parse an HTML document, we are essentially breaking it down into its constituent parts and interpreting their meanings.


We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe


In [28]:
# The keys of the dictionary correspond to column names .
# The values for each key are initially set to None.
launch_dict= dict.fromkeys(column_names)

# Remove the 'Date and time ( )' column from the launch_dict dictionary
# This column is identified as irrelevant and therefore removed from the dictionary.
del launch_dict['Date and time ( )']

# Initialize the launch_dict dictionary with each value as an empty list for specific columns
# These columns are essential for storing launch data and are initialized as empty lists.
# Each key represents a column name, and the corresponding value is initialized as an empty list.
# This prepares the dictionary to store launch data in a structured format.
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []


# Adding some new columns to the launch_dict dictionary 
# The keys for these new columns are added to the launch_dict dictionary with empty lists as values.
launch_dict['Version Booster'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []
launch_dict['Time'] = []


Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


Usually, HTML tables in Wiki pages are likely to contain unexpected annotations and other types of noises, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.


In [39]:
extracted_row = 0  # Initialize a variable to count the number of rows extracted
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    # Iterate through each table found in the HTML document
    
    for rows in table.find_all("tr"):
        # Iterate through each row in the current table
        if rows.th:
            # Check if the row contains a table header
            if rows.th.string:
                # Check if the table header contains a string
                flight_number = rows.th.string.strip()  # Extract flight number
                flag = flight_number.isdigit()  # Check if the flight number consists only of digits
        else:
            flag = False  # If no table header, set flag to False

        row = rows.find_all('td')  # Find all table data elements in the row
        if flag:
            # If the row contains a flight number
            extracted_row += 1  # Increment the count of extracted rows
            launch_dict["Flight No."].append(flight_number)# Append the flight_number to the list stored in launch_dict under the key "Flight No."

            # Extract date and time information from the first cell of the row
            datatimelist = date_time(row[0])
            
            
            # Date value
            # TODO: Append the date into launch_dict with key `Date`
            date = datatimelist[0].strip(',')
            launch_dict["Date"].append(date)
            print(date)
            
            # Time value
            # TODO: Append the time into launch_dict with key `Time`
            time = datatimelist[1]
            launch_dict["Time"].append(time)
            print(time)
              
            # Booster version
            # TODO: Append the bv into launch_dict with key `Version Booster`
            bv=booster_version(row[1])
            launch_dict["Version Booster"].append(bv)
            if not(bv):
                bv=row[1].a.string
            print(bv)
            
            # Launch Site
            # TODO: Append the bv into launch_dict with key `Launch Site`
            launch_site = row[2].a.string
            launch_dict["Launch site"].append(launch_site)
            print(launch_site)
            
            # Payload
            # TODO: Append the payload into launch_dict with key `Payload`
            payload = row[3].a.string
            launch_dict["Payload"].append(payload)
            print(payload)
            
            # Payload Mass
            # TODO: Append the payload_mass into launch_dict with key `Payload mass`
            payload_mass = get_mass(row[4])
            launch_dict["Payload mass"].append(payload_mass)
            print(payload)
            
            # Orbit
            # TODO: Append the orbit into launch_dict with key `Orbit`
            orbit = row[5].a.string
            launch_dict["Orbit"].append(orbit)
            print(orbit)
            
            # Customer
            # TODO: Append the customer into launch_dict with key `Customer`
            customer = row[5].a.string
            launch_dict["Customer"].append(customer)
            print(customer)
            
            # Launch outcome
            # TODO: Append the launch_outcome into launch_dict with key `Launch outcome`
            launch_outcome = list(row[7].strings)[0]
            launch_dict["Launch outcome"].append(launch_outcome)
            print(launch_outcome)
            
            # Booster landing
            # TODO: Append the launch_outcome into launch_dict with key `Booster landing`
            booster_landing = landing_status(row[8])
            launch_dict["Booster landing"].append(booster_landing)
            print(booster_landing)
            

4 June 2010
18:45
F9 v1.0B0003.1
CCAFS
Dragon Spacecraft Qualification Unit
Dragon Spacecraft Qualification Unit
LEO
LEO
Success

Failure
8 December 2010
15:43
F9 v1.0B0004.1
CCAFS
Dragon
Dragon
LEO
LEO
Success
Failure
22 May 2012
07:44
F9 v1.0B0005.1
CCAFS
Dragon
Dragon
LEO
LEO
Success
No attempt

8 October 2012
00:35
F9 v1.0B0006.1
CCAFS
SpaceX CRS-1
SpaceX CRS-1
LEO
LEO
Success

No attempt
1 March 2013
15:10
F9 v1.0B0007.1
CCAFS
SpaceX CRS-2
SpaceX CRS-2
LEO
LEO
Success

No attempt

29 September 2013
16:00
F9 v1.1B1003
VAFB
CASSIOPE
CASSIOPE
Polar orbit
Polar orbit
Success
Uncontrolled
3 December 2013
22:41
F9 v1.1
CCAFS
SES-8
SES-8
GTO
GTO
Success
No attempt
6 January 2014
22:06
F9 v1.1
CCAFS
Thaicom 6
Thaicom 6
GTO
GTO
Success
No attempt
18 April 2014
19:25
F9 v1.1
Cape Canaveral
SpaceX CRS-3
SpaceX CRS-3
LEO
LEO
Success

Controlled
14 July 2014
15:15
F9 v1.1
Cape Canaveral
Orbcomm-OG2
Orbcomm-OG2
LEO
LEO
Success
Controlled
5 August 2014
08:00
F9 v1.1
Cape Canaveral
AsiaSat 8
Asia

After you have fill in the parsed launch record values into `launch_dict`, you can create a dataframe from it.


In [38]:
df= pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items() })
df.head()

Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX,Success\n,F9 v1.0B0003.1,Failure,4 June 2010,18:45
1,1,CCAFS,Dragon,0,LEO,NASA,Success,F9 v1.0B0003.1,Failure,4 June 2010,18:45
2,1,CCAFS,Dragon,525 kg,LEO,NASA,Success,F9 v1.0B0003.1,No attempt\n,4 June 2010,18:45
3,1,CCAFS,SpaceX CRS-1,"4,700 kg",LEO,NASA,Success\n,F9 v1.0B0003.1,No attempt,4 June 2010,18:45
4,2,CCAFS,SpaceX CRS-2,"4,877 kg",LEO,NASA,Success\n,F9 v1.0B0004.1,No attempt\n,8 December 2010,15:43


We can now export it to a <b>CSV</b> for the next section

In [None]:
df.to_csv('Da

<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>


## Authors


<a href="https://www.linkedin.com/in/yan-luo-96288783/">Yan Luo</a>


<a href="https://www.linkedin.com/in/nayefaboutayoun/">Nayef Abou Tayoun</a>


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates           |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |


Copyright © 2021 IBM Corporation. All rights reserved.
