In [1]:
{
    "nbformat": 4.0,
    "nbformat_minor": 5
}

{'nbformat': 4.0, 'nbformat_minor': 5}

In [8]:
pip install beautifulsoup4




# **Space X  Falcon 9 First Stage Landing Prediction**


## Web scraping Falcon 9 and Falcon Heavy Launches Records from Wikipedia


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\_1\_L2/images/Falcon9\_rocket_family.svg)


Falcon 9 first stage will land successfully


Several examples of an unsuccessful landing are shown here:


More specifically, the launch records are stored in a HTML table shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\_1\_L2/images/falcon9-launches-wiki.png)


## Objectives

Web scrap Falcon 9 launch records with `BeautifulSoup`:

*   Extract a Falcon 9 launch records HTML table from Wikipedia
*   Parse the table and convert it into a Pandas data frame


First let's import required packages for this lab


In [10]:
#import os
import requests 
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

Below are some helper functions to help process the web scraped HTML table.


In [11]:
def date_time(table_cells):
    """
    This function returns the data and time from the HTML  table cell
    Input: the  element of a table data cell extracts extra row
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    This function returns the booster version from the HTML  table cell 
    Input: the  element of a table data cell extracts extra row
    """
    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    out=[i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        mass.find("kg")
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass


def extract_column_from_header(row):
    """
    This function returns the landing status from the HTML table cell 
    Input: the  element of a table data cell extracts extra row
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    colunm_name = ' '.join(row.contents)
    
    # Filter the digit and empty names
    if not(colunm_name.strip().isdigit()):
        colunm_name = colunm_name.strip()
        return colunm_name    


To keep the lab tasks consistent, we scrape the data from a snapshot of the  `List of Falcon 9 and Falcon Heavy launches` Wikipage updated on
`9th June 2021`


In [12]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

Next, request the HTML page from the above URL and get a `response` object


 Request the Falcon9 Launch Wiki page from its `URL`


First, let's `perform an HTTP GET method` to `request` the Falcon9 Launch HTML page, as an `HTTP response.`


In [14]:
# use requests.get() method with the provided static_url
# to get the HTML content of the page we call response
response = requests.get(static_url)

Create a `BeautifulSoup` object from the HTML `response`


In [15]:
# Use BeautifulSoup() to create a BeautifulSoup object from response text content
soup = BeautifulSoup(response.text, 'html.parser')

Print the page title to verify if the `BeautifulSoup` object was created properly


In [16]:
# Use soup.title attribute
print(soup.title)


<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>


### Extract all column/variable names from the HTML table header


Next, we should collect all relevant column names from the HTML table header


To test, let's try to find all tables on the wiki page first. 


In [19]:
# Use the find_all function in the BeautifulSoup object, with element type `table`
# Assign the result to a list called `html_tables`
html_tables = soup.find_all('table')


Starting from the third table is our target table contains the actual launch records.


In [21]:
# Let's print the third table and check its content
first_launch_table = html_tables[2]
print(first_launch_table)

<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
<tr>
<th rowspan="2" scope="row" style="text-align:center;">1
</th>
<td>

You should able to see the columns names embedded in the table header elements `<th>` as follows: 
Flight No.- 
Date and Time-
Falcon 9 First Stage Boosters-
Lauch Site-
Payload, Payload Mass-
Orbit-
Customer-
Launch Outcome-
Falcon 9 first stage landing test-

Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract column name one by one


In [22]:
column_names = []

# Apply find_all() method with `th` element on first_launch_table
# Iterate each th element and apply the provided extract_column_from_header() to get a column name
# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names
for th in first_launch_table.find_all('th'):
    name = extract_column_from_header(th)
    if name is not None and len(name) > 0:
        column_names.append(name)


Check the extracted column names


In [23]:
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']


##  Create a data frame by parsing the launch HTML tables


We will create an empty dictionary with keys from the extracted column names.

In [25]:
launch_dict= dict.fromkeys(column_names)

# Remove an irrelvant column
del launch_dict['Date and time ( )']

# Let's initial the launch_dict with each value to be an empty list
launch_dict['Flight No.'] = []
launch_dict['Date']=[]
launch_dict['Time']=[]
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Added some new columns
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]


print(launch_dict)

{'Flight No.': [], 'Launch site': [], 'Payload': [], 'Payload mass': [], 'Orbit': [], 'Customer': [], 'Launch outcome': [], 'Date': [], 'Time': [], 'Version Booster': [], 'Booster landing': []}


Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


We create helper functions to simplify the parsing process


In [58]:
extracted_row = 0
#Extract each table and enumerate over each table use the soup object to find all tables with class "wikitable-plainrowheaders collapsible"
for table_number,table in enumerate(soup.find_all('table',class_="wikitable plainrowheaders collapsible")):

   # get table row 
    for rows in table.find_all('tr'):
        if rows.th:
            if rows.th.string:
                flight_number=rows.th.string.strip()
                flag=flight_number.isdigit()
        else:
            flag=False
        #get table element 
        row=rows.find_all('td')

        if flag:
            extracted_row += 1
            # Flight Number value
            # TODO: Append the flight_number into launch_dict with key `Flight No.`
            launch_dict['Flight No.'].append(flight_number)
            print(f"Flight No.: {flight_number}")

            datatimelist=date_time(row[0])

            
            # Date value
            # TODO: Append the date into launch_dict with key `Date
            date = datatimelist[0].strip(',')
            launch_dict['Date'].append(date)
            print(f"Date: {date}")
            
            # Time value
            # TODO: Append the time into launch_dict with key `Time`
            time = datatimelist[1].strip(',')
            launch_dict['Time'].append(time)
            print(f"Time: {time}")
              
            # Booster version
            # TODO: Append the bv into launch_dict with key `Version Booster` using booster_version function, strip trailing and leading spaces
            bv=booster_version(row[1])
            launch_dict['Version Booster'].append(bv)
            if not(bv):
                bv=row[1]
            print(f"Version Booster: {bv}")
            
            # Launch Site
            # TODO: Append the bv into launch_dict with key `Launch Site`
            launch_site = booster_version(row[2])
            launch_dict['Launch site'].append(launch_site.strip(','))
            if not (launch_site):
                launch_site = row[2]
            print(f"Launch site: {launch_site}")
            
            # Payload
            # TODO: Append the payload into launch_dict with key `Payload`
            payload = row[3].a.string
            launch_dict['Payload'].append(payload)
            print(f"Payload: {payload}")
            
            # Payload Mass
            # TODO: Append the payload_mass into launch_dict with key `Payload mass`
            payload_mass = get_mass(row[4])
            launch_dict['Payload mass'].append(payload_mass)
            print(f"Payload mass: {payload_mass}")
            
            # Orbit
            # TODO: Append the orbit into launch_dict with key `Orbit`
            orbit = row[5].a.string
            launch_dict['Orbit'].append(orbit)
            print(f"Orbit: {orbit}")
            
            # Customer
            # TODO: Append the customer into launch_dict with key `Customer`
            # Hint: Use the `a` tag to get the customer name
            customer = row[6]
            if customer.a:
                customer = customer.a.string
            else:
                customer = customer.string
                
            launch_dict['Customer'].append(customer)
            print(f"Customer: {customer}")
            
            # Launch outcome
            # TODO: Append the launch_outcome into launch_dict with key `Launch outcome`
            launch_outcome = list(row[7].strings)[0]
            launch_dict['Launch outcome'].append(launch_outcome)
            print(f"Launch outcome: {launch_outcome}")
            
            # Booster landing
            # TODO: Append the launch_outcome into launch_dict with key `Booster landing`
            booster_landing = landing_status(row[8])
            launch_dict['Booster landing'].append(launch_outcome)
            print(f"Booster landing: {booster_landing}")
            
             # Debugging
            print(f"Extracted row: {extracted_row}")
            print(f"Length of Flight No. array: {len(launch_dict['Flight No.'])}")
            print(f"Length of Date array: {len(launch_dict['Date'])}")
            print(f"Length of Time array: {len(launch_dict['Time'])}")
            print(f"Length of Version Booster array: {len(launch_dict['Version Booster'])}")
            print(f"Length of Launch site array: {len(launch_dict['Launch site'])}")
            print(f"Length of Payload array: {len(launch_dict['Payload'])}")
            print(f"Length of Payload mass array: {len(launch_dict['Payload mass'])}")
            print(f"Length of Orbit array: {len(launch_dict['Orbit'])}")
            print(f"Length of Customer array: {len(launch_dict['Customer'])}")
            print(f"Length of Launch outcome array: {len(launch_dict['Launch outcome'])}")
            print(f"Length of Booster landing array: {len(launch_dict['Booster landing'])}")


Flight No.: 1
Date: 4 June 2010
Time: 18:45
Version Booster: F9 v1.0B0003.1
Launch site: CCAFS
Payload: Dragon Spacecraft Qualification Unit
Payload mass: 0
Orbit: LEO
Customer: SpaceX
Launch outcome: Success

Booster landing: Failure
Extracted row: 1
Length of Flight No. array: 1534
Length of Date array: 1531
Length of Time array: 1531
Length of Version Booster array: 1531
Length of Launch site array: 1531
Length of Payload array: 1531
Length of Payload mass array: 1531
Length of Orbit array: 1531
Length of Customer array: 1526
Length of Launch outcome array: 1526
Length of Booster landing array: 1526
Flight No.: 2
Date: 8 December 2010
Time: 15:43
Version Booster: F9 v1.0B0004.1
Launch site: CCAFS
Payload: Dragon
Payload mass: 0
Orbit: LEO
Customer: NASA
Launch outcome: Success
Booster landing: Failure
Extracted row: 2
Length of Flight No. array: 1535
Length of Date array: 1532
Length of Time array: 1532
Length of Version Booster array: 1532
Length of Launch site array: 1532
Length o

After you have fill in the parsed launch record values into `launch_dict`, you can create a dataframe from it.


In [69]:
df = pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items()})
df.head()

#in the launch outcome column, there are some values that have \n in them. We need to remove them
df['Launch outcome'] = df['Launch outcome'].str.replace('\n', '')
df['Booster landing'] = df['Booster landing'].str.replace('\n', '')
df.head()

#remove the extra customer column
#df.drop('customer', axis=1, inplace=True)


Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Date,Time,Version Booster,Booster landing
0,1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,"[[SpaceX], \n]",Success,4 June 2010,18:45,F9 v1.0B0003.1,Success
1,1,CCAFS,Dragon,0,LEO,"[[.mw-parser-output .plainlist ol,.mw-parser-o...",Success,8 December 2010,15:43,F9 v1.0B0004.1,Success
2,1,CCAFS,Dragon,525 kg,LEO,"[[NASA], (, [COTS], )\n]",Success,22 May 2012,07:44,F9 v1.0B0005.1,Success
3,1,CCAFS,SpaceX CRS-1,"4,700 kg",LEO,"[[NASA], (, [CRS], )\n]",Success,8 October 2012,00:35,F9 v1.0B0006.1,Success
4,2,CCAFS,SpaceX CRS-2,"4,877 kg",LEO,"[[NASA], (, [CRS], )\n]",Success,1 March 2013,15:10,F9 v1.0B0007.1,Success


In [70]:
df.to_csv('spacex_web_scraped_1.csv', index=False)

<code>df.to_csv('spacex_web_scraped1.csv', index=False)</code>
