<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **SpaceX Falcon 9 First Stage Landing Prediction**


## Web scraping Falcon 9 and Falcon Heavy Launch Records from Wikipedia


Estimated time needed: **40** minutes


In this lab, you will perform web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`:

[https://en.wikipedia.org/wiki/List_of_Falcon\_9\_and_Falcon_Heavy_launches](https://en.wikipedia.org/wiki/List_of_Falcon\_9\_and_Falcon_Heavy_launches?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2022-01-01)


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\_1\_L2/images/Falcon9\_rocket_family.svg)


A successful Falcon 9 first stage landing looks like this:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing\_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


More specifically, the launch records are stored in an HTML table, shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\_1\_L2/images/falcon9-launches-wiki.png)


## Objectives

Web scrape Falcon 9 launch records with `BeautifulSoup`:

*   Extract a Falcon 9 launch records HTML table from Wikipedia
*   Parse the table and convert it into a Pandas data frame


First, let's import the required packages for this lab:


In [1]:
!pip3 install beautifulsoup4
!pip3 install requests





In [2]:
import sys

import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

We also provide some helper functions for you to process the web-scraped HTML table:


In [3]:
def date_time(table_cells):
    """
    Return the date and time from an HTML table cell.
    Input: a table data cell (td) element.
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    Return the booster version from an HTML table cell.
    Input: a table data cell (td) element.
    """
    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    Return the landing status from an HTML table cell.
    Input: a table data cell (td) element.
    """
    out=[i for i in table_cells.strings][0]
    return out


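def get_mass(table_cells):
    """
    Return the payload mass text up to and including "kg", or 0 if the cell is empty.
    """
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        # Note: if "kg" is absent, find() returns -1 and only the first character is kept
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass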


def extract_column_from_header(row):
    """
    Return the column name from an HTML table header cell.
    Input: a table header cell (th) element.
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    column_name = ' '.join(row.contents)
    
    # Filter out purely numeric names (row numbers); these return None
    if not(column_name.strip().isdigit()):
        column_name = column_name.strip()
        return column_name    
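
To see what `get_mass` does with typical cell text, the following sketch applies the same normalize-and-slice logic to plain strings. The sample strings and the names `mass_text` and `mass_text_or_none` are made up for illustration; the second function is one possible stricter variant, not part of the lab.

```python
import unicodedata

def mass_text(cell_text):
    """Same slicing logic as get_mass, applied to a plain string."""
    mass = unicodedata.normalize("NFKD", cell_text).strip()
    if mass:
        return mass[0:mass.find("kg") + 2]
    return 0

def mass_text_or_none(cell_text):
    """Hypothetical stricter variant: return None when no "kg" is present."""
    mass = unicodedata.normalize("NFKD", cell_text).strip()
    end = mass.find("kg")
    return mass[:end + 2] if end != -1 else None

# "\xa0" is a non-breaking space; NFKD normalization turns it into a regular space.
with_unit = mass_text("4,700\xa0kg (10,400\xa0lb)")
no_unit = mass_text("Classified")        # find() returns -1, so only "C" is kept
strict = mass_text_or_none("Classified")

print(repr(with_unit), repr(no_unit), repr(strict))
```

The second case is worth keeping in mind when cleaning the payload mass column later: cells without a `kg` unit do not come back empty, they come back as a one-character string.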


To keep the lab tasks consistent, you will be asked to scrape the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipedia page updated on `9th June 2021`.


In [4]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

Next, request the HTML page from the above URL and get a `response` object


### TASK 1: Request the Falcon9 Launch Wiki page from its URL


First, let's send an HTTP GET request for the Falcon9 Launch HTML page and store the result as an HTTP response.


In [5]:
# use requests.get() method with the provided static_url
# assign the response to an object
response = requests.get(static_url)
# response.content
response

<Response [200]>
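
A `<Response [200]>` means the request succeeded. In a script it is usually safer to fail loudly on an error status and to bound the wait time. The sketch below is one way to do that; the `fetch_html` name, the timeout value, and the `User-Agent` string are assumptions for illustration, not part of the lab.

```python
import requests

# Hypothetical identifier for this example; many sites ask clients to send a descriptive User-Agent.
HEADERS = {"User-Agent": "falcon9-lab-example/0.1 (educational use)"}

def fetch_html(url, timeout=10.0):
    """Return the page body as text, raising on network errors or error statuses."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `fetch_html(static_url)` would return the HTML text that can be passed to `BeautifulSoup`.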

Create a `BeautifulSoup` object from the HTML `response`


In [6]:
# Use BeautifulSoup() to create a BeautifulSoup object from a response text content
ma_soup = BeautifulSoup(response.content, 'html.parser')
# ma_soup.text

Print the page title to verify if the `BeautifulSoup` object was created properly


In [7]:
# Use soup.title attribute
ma_soup.title

<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>

### TASK 2: Extract all column/variable names from the HTML table header


Next, we want to collect all relevant column names from the HTML table header


Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab


In [8]:
# Use the find_all function in the BeautifulSoup object, with element type `table`
# Assign the result to a list called `html_tables`
html_tables = ma_soup.find_all('table')
# html_tables
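
`find_all('table')` returns every table on the page, including ones that hold no launch records. As a hedged alternative, a CSS selector can narrow the result to tables that carry specific classes. The miniature HTML below is made up for illustration; only the class names are taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up miniature page: one unrelated table and two launch tables.
html = """
<table class="infobox"><tr><td>Summary</td></tr></table>
<table class="wikitable plainrowheaders collapsible"><tr><th>Flight No.</th></tr></table>
<table class="wikitable plainrowheaders collapsible"><tr><th>Flight No.</th></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
all_tables = soup.find_all("table")
# CSS selector: tables carrying all three classes, in any order.
launch_tables = soup.select("table.wikitable.plainrowheaders.collapsible")

print(len(all_tables), len(launch_tables))
```

Selecting by class is less sensitive to layout changes than indexing by position, although it still depends on the page keeping those class names.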

The third table is the first of our target tables, which contain the actual launch records.


In [9]:
# Let's print the third table and check its content
first_launch_table = html_tables[2]
print(first_launch_table)

<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
<tr>
<th rowspan="2" scope="row" style="text-align:center;">1
</th>
<td>

You should be able to see the column names embedded in the table header elements `<th>` as follows:


```
<tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
```


Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract the column names one by one.


In [10]:
column_names = []

# Apply find_all() function with `th` element on first_launch_table
# Iterate each th element and apply the provided extract_column_from_header() to get a column name
# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names
for element in first_launch_table.find_all('th'): 
    name = extract_column_from_header(element)
    if name and len(name) > 0:
        column_names.append(name)

Check the extracted column names


In [11]:
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
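
The list above has no `Version, Booster` or `Booster landing` entry, because the helper removes the `<a>` element that holds the text of those two headers. The sketch below shows one possible alternative based on `get_text()`; the `header_text` name is hypothetical and the HTML is a shortened copy of the header row shown earlier.

```python
from bs4 import BeautifulSoup

header_html = """<tr>
<th scope="col">Flight No.</th>
<th scope="col">Date and<br/>time (<a href="#">UTC</a>)</th>
<th scope="col"><a href="#">Version,<br/>Booster</a> <sup class="reference"><a href="#">[b]</a></sup></th>
<th scope="col"><a href="#">Booster<br/>landing</a></th>
</tr>"""

def header_text(th):
    """Drop footnote markers, then join the remaining text pieces with single spaces."""
    for sup in th.find_all("sup"):
        sup.decompose()
    return th.get_text(" ", strip=True)

soup = BeautifulSoup(header_html, "html.parser")
names = [header_text(th) for th in soup.find_all("th")]
print(names)
```

Because link text is kept instead of removed, all four headers survive, at the cost of slightly different names (for example `Date and time ( UTC )`).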


### TASK 3: Create a data frame by parsing the launch HTML tables


We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe


In [12]:
launch_dict= dict.fromkeys(column_names)

# Remove an irrelevant column
del launch_dict['Date and time ( )']

# Initialize launch_dict so that each value is an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Added some new columns
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]

launch_dict

{'Flight No.': [],
 'Launch site': [],
 'Payload': [],
 'Payload mass': [],
 'Orbit': [],
 'Customer': [],
 'Launch outcome': [],
 'Version Booster': [],
 'Booster landing': [],
 'Date': [],
 'Time': []}
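
The same structure can be built more compactly with a dictionary comprehension. The sketch below also shows a pitfall to avoid: passing a list as the default to `dict.fromkeys` makes every key share one list object. The `columns` list simply repeats the keys shown above.

```python
columns = ['Flight No.', 'Launch site', 'Payload', 'Payload mass', 'Orbit',
           'Customer', 'Launch outcome', 'Version Booster', 'Booster landing',
           'Date', 'Time']

# One independent empty list per column.
launch_dict = {name: [] for name in columns}

# Pitfall: every key points at the SAME list object.
shared = dict.fromkeys(columns, [])
shared['Flight No.'].append('1')

print(launch_dict['Time'], shared['Time'])
```

Appending to one key of `shared` changes all of them, which would silently corrupt the table.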

Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


HTML tables in Wikipedia pages often contain unexpected annotations and other kinds of noise, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.
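
As a small illustration of handling such noise, the sketch below strips bracketed reference markers with a regular expression. The `strip_references` name and the pattern are assumptions for this example; real cells may need further rules.

```python
import re

# A bracketed marker such as [8] or [e], plus any whitespace directly before it.
REF_MARKER = re.compile(r"\s*\[[^\]]*\]")

def strip_references(text):
    """Remove bracketed reference markers from a cell's text."""
    return REF_MARKER.sub("", text).strip()

cleaned = [strip_references(s) for s in ["B0004.1[8]", "N/A [e]", "CCAFS"]]
print(cleaned)
```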


To simplify the parsing process, we have provided an incomplete code snippet below to help you to fill up the `launch_dict`. Please complete the following code snippet with TODOs or you can choose to write your own logic to parse all launch tables:


In [13]:
# My own parsing logic: not perfect, but much closer to the table contents than the provided template.
# The output still needs cleaning. Open question: is it better to clean the data while parsing it
# from the source, or later when loading the CSV into a data frame?
for table_number, table in enumerate(ma_soup.find_all('table',"wikitable plainrowheaders collapsible")):
    print(f"table number: {table_number}")
    for rows in table.find_all('tr'):
        if rows.th:  # skip rows that have no <th> header cell
            # Note: .string is None when the <th> has more than one child node (e.g. text plus a nested tag), so such rows are skipped
            if rows.th.string: 
                flight_number = rows.th.string.strip()
                if flight_number.isdigit():
                    row = rows.find_all('td')

                    # Flight Number
                    launch_dict['Flight No.'].append(flight_number)
                    
                    datetimepair = date_time(row[0])

                    # Date
                    launch_dict['Date'].append(datetimepair[0].strip(','))
                    date = datetimepair[0].strip(',')

                    # Time
                    time = datetimepair[1]
                    launch_dict['Time'].append(time)

                    # Booster 
                    bv = booster_version(row[1])
                    if not bv: 
                        bv = row[1].a.string
                    launch_dict['Version Booster'].append(bv)

                    # Launch Site
                    ls = row[2].a.string
                    launch_dict['Launch site'].append(ls)

                    # Payload
                    pl = row[3].a.string
                    launch_dict['Payload'].append(pl)

                    # Payload Mass
                    plm = get_mass(row[4])
                    launch_dict['Payload mass'].append(plm)

                    # Orbit
                    orb = row[5].a.string
                    launch_dict['Orbit'].append(orb)

                    # Customer
                    # Multiple customers are separated by newlines in the cell; join them with '; '
                    cust = row[6].text.strip().replace('\n', '; ')
                    launch_dict['Customer'].append(cust)

                    # Launch Outcome
                    lo = list(row[7].strings)[0].strip()
                    launch_dict['Launch outcome'].append(lo)

                    # Booster Landing
                    bl = landing_status(row[8]).strip()
                    launch_dict['Booster landing'].append(bl)

                    print(
                        '\t############ Flight Number\n'
                        f"{flight_number}\n"
                        '\t############ Date\n'
                        f"\t{date}\n"  
                        '\t############ Time\n'
                        f"\t{time}\n" 
                        '\t############ Booster Version\n'
                        f"\t{bv}\n" 
                        '\t############ Launch Site\n'
                        f"\t{ls}\n"
                        '\t############ Payload\n'
                        f"\t{pl}\n"
                        '\t############ Payload Mass\n'
                        f"\t{plm}\n"
                        '\t############ Orbit\n'
                        f"\t{orb}\n"
                        '\t############ Customer\n'
                        f"\t{cust}\n"
                        '\t############ Launch Outcome\n'
                        f"\t{lo}\n"
                        '\t############ Booster Landing\n'
                        f"\t{bl}\n"
                    )

table number: 0
	############ Flight Number
1
	############ Date
	4 June 2010
	############ Time
	18:45
	############ Booster Version
	F9 v1.0B0003.1
	############ Launch Site
	CCAFS
	############ Payload
	Dragon Spacecraft Qualification Unit
	############ Payload Mass
	0
	############ Orbit
	LEO
	############ Customer
	SpaceX
	############ Launch Outcome
	Success
	############ Booster Landing
	Failure

	############ Flight Number
2
	############ Date
	8 December 2010
	############ Time
	15:43
	############ Booster Version
	F9 v1.0B0004.1
	############ Launch Site
	CCAFS
	############ Payload
	Dragon
	############ Payload Mass
	0
	############ Orbit
	LEO
	############ Customer
	NASA (COTS); NRO
	############ Launch Outcome
	Success
	############ Booster Landing
	Failure

	############ Flight Number
3
	############ Date
	22 May 2012
	############ Time
	07:44
	############ Booster Version
	F9 v1.0B0005.1
	############ Launch Site
	CCAFS
	############ Payload
	Dragon
	############ Payload
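
The loop above reads several fields with `row[n].a.string`, which raises an `AttributeError` when a cell has no `<a>` link. A more defensive accessor could look like the sketch below; the `cell_text` name and the sample row are made up for illustration.

```python
from bs4 import BeautifulSoup

def cell_text(cell):
    """Return the first link's text if the cell has one, else the cell's own text."""
    if cell.a is not None and cell.a.string:
        return cell.a.string.strip()
    return cell.get_text(" ", strip=True)

row_html = """<table><tr>
<td><a href="#">CCAFS</a>, <a href="#">SLC-40</a></td>
<td>Unlinked payload</td>
</tr></table>"""

cells = BeautifulSoup(row_html, "html.parser").find_all("td")
texts = [cell_text(c) for c in cells]
print(texts)
```

Like the original code, this keeps only the first link's text when a cell contains several links.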

In [14]:
for key in launch_dict:
    print(f"{key} {len(launch_dict[key])}")

Flight No. 121
Launch site 121
Payload 121
Payload mass 121
Orbit 121
Customer 121
Launch outcome 121
Version Booster 121
Booster landing 121
Date 121
Time 121
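
Equal counts matter because `pd.DataFrame` raises a `ValueError` when the lists in a dictionary have different lengths. The check can also be automated, as in this minimal sketch (the function names are hypothetical):

```python
def lengths_by_column(data):
    """Map each column name to the number of values collected for it."""
    return {key: len(values) for key, values in data.items()}

def is_rectangular(data):
    """True when every column holds the same number of values."""
    return len(set(lengths_by_column(data).values())) <= 1

good = {"Date": ["4 June 2010", "8 December 2010"], "Time": ["18:45", "15:43"]}
bad = {"Date": ["4 June 2010", "8 December 2010"], "Time": ["18:45"]}
print(is_rectangular(good), is_rectangular(bad))
```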


After you have filled in the parsed launch record values into `launch_dict`, you can create a dataframe from it.


In [15]:
df=pd.DataFrame(launch_dict)
print(df.to_string())

    Flight No.     Launch site                                   Payload    Payload mass        Orbit                                 Customer Launch outcome Version Booster Booster landing               Date      Time
0            1           CCAFS      Dragon Spacecraft Qualification Unit               0          LEO                                   SpaceX        Success  F9 v1.0B0003.1         Failure        4 June 2010     18:45
1            2           CCAFS                                    Dragon               0          LEO                         NASA (COTS); NRO        Success  F9 v1.0B0004.1         Failure    8 December 2010     15:43
2            3           CCAFS                                    Dragon          525 kg          LEO                              NASA (COTS)        Success  F9 v1.0B0005.1      No attempt        22 May 2012     07:44
3            4           CCAFS                              SpaceX CRS-1        4,700 kg          LEO                       

In [16]:
# Binary-encode Launch outcome (a single 0/1 label, not a one-hot encoding)
# landing_class = 1 if the launch outcome contains 'Success', else 0
landing_class = []
for outcome in df['Launch outcome']:
    if 'Success' in outcome:
        landing_class.append(1)
    else:
        landing_class.append(0)
df['Launch outcome'] = landing_class
df.head()

Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX,1,F9 v1.0B0003.1,Failure,4 June 2010,18:45
1,2,CCAFS,Dragon,0,LEO,NASA (COTS); NRO,1,F9 v1.0B0004.1,Failure,8 December 2010,15:43
2,3,CCAFS,Dragon,525 kg,LEO,NASA (COTS),1,F9 v1.0B0005.1,No attempt,22 May 2012,07:44
3,4,CCAFS,SpaceX CRS-1,"4,700 kg",LEO,NASA (CRS),1,F9 v1.0B0006.1,No attempt,8 October 2012,00:35
4,5,CCAFS,SpaceX CRS-2,"4,877 kg",LEO,NASA (CRS),1,F9 v1.0B0007.1,No attempt,1 March 2013,15:10
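
The same 0/1 encoding can be written without an explicit loop. This is a sketch on a made-up three-row frame; the `Class` column name is an assumption chosen so that the original text column is kept.

```python
import pandas as pd

outcomes = pd.DataFrame({"Launch outcome": ["Success", "Failure", "Success"]})
# str.contains gives a boolean Series; astype(int) maps True/False to 1/0.
outcomes["Class"] = outcomes["Launch outcome"].str.contains("Success").astype(int)
print(outcomes["Class"].tolist())
```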


In [17]:
# Remove the "kg" unit from Payload Mass and rename the column to include the unit
new_values = []
for i in range(len(df)):
    value = df.iloc[i, 3]  # column 3 is 'Payload mass'
    if isinstance(value, str): 
        new_values.append(value.split(' ')[0])
    else:
        new_values.append(value)
df['Payload mass'] = new_values
df.rename(columns={'Payload mass': 'Payload Mass kg'}, inplace=True)

In [18]:
df['Payload Mass kg'] = df['Payload Mass kg'].str.replace(',', '')
df['Payload Mass kg'] = df['Payload Mass kg'].str.replace('~', '')  
# The .str accessor returns NaN for non-string entries, so the integer 0
# placeholders written by get_mass() become NaN here; restore them.
df.iat[0, 3] = 0
df.iat[1, 3] = 0

# Row 88: the source gives a range of 5000-6000 kg; use the midpoint.
df.iat[88, 3] = 5500

# Rows 32, 46 and 102 have classified payload masses; fill them with the mean.
# The column still holds strings, so df['Payload Mass kg'].mean() raises a TypeError.
# Convert to numbers first; non-numeric entries become NaN and are skipped by mean().
mean = pd.to_numeric(df['Payload Mass kg'], errors='coerce').mean()
df.iat[32, 3] = mean  # Classified 
df.iat[46, 3] = mean  # Classified 
df.iat[102, 3] = mean  # Classified 

print(df.to_string())

    Flight No.     Launch site                                   Payload Payload Mass kg        Orbit                                 Customer  Launch outcome Version Booster Booster landing               Date      Time
0            1           CCAFS      Dragon Spacecraft Qualification Unit               0          LEO                                   SpaceX               1  F9 v1.0B0003.1         Failure        4 June 2010     18:45
1            2           CCAFS                                    Dragon               0          LEO                         NASA (COTS); NRO               1  F9 v1.0B0004.1         Failure    8 December 2010     15:43
2            3           CCAFS                                    Dragon             525          LEO                              NASA (COTS)               1  F9 v1.0B0005.1      No attempt        22 May 2012     07:44
3            4           CCAFS                              SpaceX CRS-1            4700          LEO                   
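
The cleaning steps in the cells above can also be expressed with vectorized pandas operations. The sketch below runs on a made-up Series that mimics the kinds of values in the payload mass column; it is one possible approach, not the lab's reference solution, and a range such as `5000-6000` would still need manual handling because only the first number is extracted.

```python
import pandas as pd

# Made-up sample: integer placeholder, plain value, thousands separator, "~" prefix, non-numeric text.
raw = pd.Series([0, "525 kg", "4,700 kg", "~13,620 kg", "C"])

# Pull out the first number (digits with optional commas and decimals); no match gives NaN.
number = raw.astype(str).str.extract(r"(\d[\d,]*(?:\.\d+)?)", expand=False)
mass = pd.to_numeric(number.str.replace(",", "", regex=False), errors="coerce")

# mean() skips NaN, so the missing value is filled with the mean of the known ones.
filled = mass.fillna(mass.mean())
print(filled.tolist())
```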

In [19]:
# Convert dates such as "4 June 2010" to ISO format (YYYY-MM-DD)
df['Date'] = pd.to_datetime(df['Date'], format="%d %B %Y")  # day number month name year number 
df['Date'] = df['Date'].dt.strftime("%Y-%m-%d")  # year number-month number-day number
print(df['Date'])

# change times? 

0      2010-06-04
1      2010-12-08
2      2012-05-22
3      2012-10-08
4      2013-03-01
          ...    
116    2021-05-09
117    2021-05-15
118    2021-05-26
119    2021-06-03
120    2021-06-06
Name: Date, Length: 121, dtype: object
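
The `Time` column can be handled in a similar way. Since the table header labels the times as UTC, one option is to combine date and time into a single timezone-aware column. This is a sketch on two made-up rows in the original `Date` format; the `Launch datetime` column name is an assumption.

```python
import pandas as pd

launches = pd.DataFrame({
    "Date": ["4 June 2010", "8 December 2010"],
    "Time": ["18:45", "15:43"],
})

# Join the two text columns, then parse them in one pass as UTC timestamps.
launches["Launch datetime"] = pd.to_datetime(
    launches["Date"] + " " + launches["Time"],
    format="%d %B %Y %H:%M",
    utc=True,
)
print(launches["Launch datetime"].iloc[0])
```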


We can now export it to a <b>CSV</b> for the next section. However, to keep the answers consistent, and in case you have difficulty finishing this lab, the following labs will use a provided dataset so that each lab is independent.


<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>


In [20]:
df.to_csv('spacex_web_scraped.csv', index=False)
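
To illustrate what `index=False` does, the sketch below writes a small made-up frame to an in-memory buffer instead of a file and reads it back. It also shows that column types are inferred again on reading, so a text column of digits comes back as integers.

```python
import io
import pandas as pd

sample = pd.DataFrame({"Flight No.": ["1", "2"], "Payload Mass kg": [0, 525]})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # no extra index column is written
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns), restored["Flight No."].tolist())
```

Without `index=False`, the row index would be written as an unnamed first column and would appear as `Unnamed: 0` when the file is read back.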

In [21]:
df

Unnamed: 0,Flight No.,Launch site,Payload,Payload Mass kg,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX,1,F9 v1.0B0003.1,Failure,2010-06-04,18:45
1,2,CCAFS,Dragon,0,LEO,NASA (COTS); NRO,1,F9 v1.0B0004.1,Failure,2010-12-08,15:43
2,3,CCAFS,Dragon,525,LEO,NASA (COTS),1,F9 v1.0B0005.1,No attempt,2012-05-22,07:44
3,4,CCAFS,SpaceX CRS-1,4700,LEO,NASA (CRS),1,F9 v1.0B0006.1,No attempt,2012-10-08,00:35
4,5,CCAFS,SpaceX CRS-2,4877,LEO,NASA (CRS),1,F9 v1.0B0007.1,No attempt,2013-03-01,15:10
...,...,...,...,...,...,...,...,...,...,...,...
116,117,CCSFS,Starlink,15600,LEO,SpaceX,1,F9 B5B1051.10,Success,2021-05-09,06:42
117,118,KSC,Starlink,14000,LEO,SpaceX Capella Space and Tyvak,1,F9 B5B1058.8,Success,2021-05-15,22:56
118,119,CCSFS,Starlink,15600,LEO,SpaceX,1,F9 B5B1063.2,Success,2021-05-26,18:59
119,120,KSC,SpaceX CRS-22,3328,LEO,NASA (CRS),1,F9 B5B1067.1,Success,2021-06-03,17:29


## Authors


<a href="https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2022-01-01">Yan Luo</a>


<a href="https://www.linkedin.com/in/nayefaboutayoun/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2022-01-01">Nayef Abou Tayoun</a>


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By | Change Description          |
| ----------------- | ------- | ---------- | --------------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates               |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |


Copyright © 2021 IBM Corporation. All rights reserved.
