<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **SpaceX Falcon 9 First Stage Landing Prediction**


## Web Scraping Falcon 9 and Falcon Heavy Launch Records from Wikipedia


Estimated time needed: **40** minutes


In this lab, you will use web scraping to collect historical Falcon 9 launch records from the Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`:

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/Falcon9_rocket_family.svg)


A successful Falcon 9 first-stage landing is shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


More specifically, the launch records are stored in an HTML table, shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


## Objectives
Web scrape Falcon 9 launch records with `BeautifulSoup`:
- Extract a Falcon 9 launch records HTML table from Wikipedia
- Parse the table and convert it into a Pandas data frame


First, let's import the required packages for this lab.


In [20]:
#import sys
# Importing the requests module to perform HTTP requests
import requests

# Importing BeautifulSoup from bs4 for parsing HTML and XML documents
from bs4 import BeautifulSoup

# Importing the re module for regular expression operations
import re

# Importing the unicodedata module to handle and normalize Unicode text
import unicodedata

# Importing pandas for data manipulation and analysis
import pandas as pd

We also provide some helper functions for processing the web-scraped HTML table.


In [21]:
import unicodedata  # Imports the unicodedata module to normalize Unicode text and remove accented characters or special symbols.

def date_time(table_cells):
	"""
	Extracts the date and time from an HTML table cell.

	Parameters:
		table_cells: BeautifulSoup object representing a <td> cell in a table.

	Returns:
		A list with the first two text elements obtained from the cell, which typically
		contain the date and time (e.g., ['2020-03-06', '17:50 UTC']).
	"""
	# Iterates through all text strings found within the cell,
	# strips leading and trailing whitespace from each string,
	# and returns only the first two elements.
	return [data_time.strip() for data_time in list(table_cells.strings)][0:2]


def booster_version(table_cells):
	"""
	Extracts and forms a string representing the booster version from an HTML table cell.

	Parameters:
		table_cells: BeautifulSoup object representing a <td> cell in a table.

	Returns:
		A string describing the booster version, for example: 'Falcon 9 Block 5 B1051.4'.
	"""
	# Iterates through each text string in the cell along with its index using enumerate.
	# Selects only those strings at even positions (i % 2 == 0),
	# which discards alternate elements that are not needed.
	# Then removes the last element from the resulting list (with [0:-1]) and joins all strings into one.
	out = ''.join([booster_version for i, booster_version in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
	return out


def landing_status(table_cells):
	"""
	Extracts the landing status from an HTML table cell.

	Parameters:
		table_cells: BeautifulSoup object representing a <td> cell in a table.

	Returns:
		The first text string in the cell, which generally indicates the landing status 
		(e.g., 'Success' or 'Failure').
	"""
	# Creates a list from all strings in the cell and returns the first element.
	out = [i for i in table_cells.strings][0]
	return out


def get_mass(table_cells):
	"""
	Extracts and cleans the payload mass from an HTML table cell.

	Parameters:
		table_cells: BeautifulSoup object representing a <td> cell in a table.

	Returns:
		A string representing the payload mass (e.g., '8300 kg').
		If no mass information is found, returns 0.
	"""
	# Normalizes the text to convert Unicode characters to their simplest form (e.g., accents)
	# and strips whitespace from the beginning and end.
	mass = unicodedata.normalize("NFKD", table_cells.text).strip()
	
	if mass:
		# Finds the index where the substring "kg" is located and extracts the part of the text that contains it.
		new_mass = mass[0:mass.find("kg") + 2]
	else:
		# If there is no text in the cell, assigns 0 as the default value.
		new_mass = 0
	
	return new_mass


def extract_column_from_header(row):
	"""
	Extracts and cleans the column name from an HTML header cell (<th>).

	Parameters:
		row: BeautifulSoup object representing a <th> header cell.

	Returns:
		A string with the cleaned column name. 
		If the content is just a digit, returns None.
	"""
	# A header cell may contain internal tags (e.g., <br>, <a>, <sup>)
	# used for formatting but not part of the actual column name.
	
	if row.br:
		# If there is a <br> tag, remove it to avoid unnecessary line breaks.
		row.br.extract()
	if row.a:
		# If there is an <a> tag (link), remove it to avoid including texts or URLs that are not part of the name.
		row.a.extract()
	if row.sup:
		# If there is a <sup> tag (used for exponents, footnotes, or additional info),
		# remove it with extract(), which completely removes the tag and its content from the DOM tree.
		# This ensures any superfluous information, not affecting the column name, is discarded.
		row.sup.extract()
	
	# Joins the remaining content of the cell into a single string, with ' ' as the separator
	# placed between the fragments in row.contents.
	# row.contents = ["Title", "of", "the", "Column"]
	# Result = "Title of the Column"
	column_name = ' '.join(row.contents)
	
	# If the resulting name is not numeric, returns the cleaned text (stripping leading and trailing spaces).
	# If it is numeric, the function falls through and implicitly returns None.
	if not column_name.strip().isdigit():  # isdigit checks if all characters in the string are digits.
		return column_name.strip()
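
`get_mass()` returns the mass as text (for example `'4,700 kg'`) or `0`, and when the cell text contains no `kg`, `find()` returns `-1`, so the slice keeps only the first character. If a numeric value is needed later, a small parser can help. The following is a minimal sketch, not part of the provided helpers; `parse_mass_kg` is a hypothetical name, and it assumes masses are written as digits with optional thousands separators followed by `kg`.

```python
import re
import unicodedata


def parse_mass_kg(text):
    """Return the payload mass in kilograms as a float, or None if absent.

    Hypothetical helper, not part of the lab's provided functions.
    """
    # NFKD turns non-breaking spaces (U+00A0) into plain spaces.
    cleaned = unicodedata.normalize("NFKD", text).strip()
    # Accept digits with optional thousands separators, followed by "kg".
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*kg", cleaned)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


print(parse_mass_kg("4,700\u00a0kg"))      # 4700.0
print(parse_mass_kg("525 kg (1,157 lb)"))  # 525.0
print(parse_mass_kg("Classified"))         # None
```

Returning `None` rather than `0` for a missing mass keeps "unknown" distinct from "zero", which matters if the column is averaged later.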


To keep the lab tasks consistent, you will scrape the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipedia page, last updated on `9th June 2021`.


In [22]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
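
The `oldid` query parameter in `static_url` is what pins the request to one saved revision of the page instead of the live article. As a quick illustration using only the standard library, the URL can be split into its parts:

```python
from urllib.parse import parse_qs, urlsplit

static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Split the URL and decode its query string into a dict of lists.
query = parse_qs(urlsplit(static_url).query)

print(query["title"][0])  # List_of_Falcon_9_and_Falcon_Heavy_launches
print(query["oldid"][0])  # 1027686922
```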

Next, request the HTML page from the above URL and get a `response` object


### TASK 1: Request the Falcon 9 Launch Wiki page from its URL


First, let's send an HTTP GET request for the Falcon 9 launch HTML page and receive the HTTP response.


In [23]:
# Use the requests.get() method with the provided static_url

request_data = requests.get(static_url)
request_data.status_code



200
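
The call above uses the default settings of `requests`. A slightly more defensive variant is sketched below, assuming you want a timeout, an explicit error on a failed status, and an identifying `User-Agent` header. The `fetch_html` name and the header value are illustrative and not part of the lab.

```python
import requests

static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Hypothetical identifying header; replace with your own project details.
HEADERS = {"User-Agent": "falcon9-lab-scraper/0.1 (educational use)"}


def fetch_html(url, timeout=10.0):
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx status."""
    # The timeout (in seconds) stops the call from waiting indefinitely.
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Turn an error status into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(static_url)` would then return the same text as `request_data.text`, or raise if the request failed.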

In [24]:
# Assign the response text to a variable
htm_content = request_data.text


Create a `BeautifulSoup` object from the HTML `response`


In [25]:
# Use BeautifulSoup() to create a BeautifulSoup object from the response text content
soup = BeautifulSoup(htm_content, 'html.parser')
# print(soup.prettify())


Print the page title to verify that the `BeautifulSoup` object was created properly.


In [26]:
# Use soup.title attribute

print(soup.title)






<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>


### TASK 2: Extract all column/variable names from the HTML table header


Next, we want to collect all relevant column names from the HTML table header


Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab


In [None]:
# Use the find_all function in the BeautifulSoup object, with element type `table`
# Assign the result to a list called `html_tables`
html_tables=soup.find_all('table')

# If you want to print the content of each table, uncomment the following lines:
# for tables in html_tables:
#     print(tables.prettify())

# Verify the result by printing the number of tables found
# print(f"Number of tables found: {len(html_tables)}") 


The third table is our first target table: it is where the actual launch records begin.


In [41]:
# Let's take the third table and check its content
first_launch_table = html_tables[2]
# Uncomment to print: print(first_launch_table)

You should be able to see the column names embedded in the table header elements `<th>` as follows:


```
<tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
```


Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract the column names one by one.


In [42]:
column_names = []

# Apply find_all() function with `th` element on first_launch_table
extraction = first_launch_table.find_all('th')

# Iterate each th element and apply the provided extract_column_from_header() to get a column name
for column in extraction:
    name = extract_column_from_header(column)
    # Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names
    if name is not None and len(name) > 0:
        column_names.append(name)

print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']


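Check the extracted column names. The `Version, Booster` and `Booster landing` headers are missing because their text sits inside `<a>` tags, which `extract_column_from_header()` removes; for the same reason `UTC` is missing from `Date and time ( )`. These columns are added by hand in Task 3.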


In [43]:
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
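
If you would rather keep the link text in the column names, one possible alternative to the provided helper is sketched below. It assumes that only `<sup>` footnote markers should be discarded; `header_text` is a hypothetical name, and the HTML is a trimmed copy of three header cells from the table shown earlier.

```python
from bs4 import BeautifulSoup

# A trimmed copy of three header cells from the launch table.
html = """
<table><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Payload<sup class="reference"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests">Booster<br/>landing</a>
</th></tr></table>
"""


def header_text(th):
    """Hypothetical alternative to extract_column_from_header()."""
    # Drop footnote markers, but keep link text.
    for sup in th.find_all("sup"):
        sup.decompose()
    # Join the remaining text fragments with single spaces.
    return th.get_text(" ", strip=True)


demo_soup = BeautifulSoup(html, "html.parser")
print([header_text(th) for th in demo_soup.find_all("th")])
# ['Flight No.', 'Payload', 'Booster landing']
```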


### TASK 3: Create a data frame by parsing the launch HTML tables


We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe


In [44]:
# Build a dictionary from an iterable (column_names) with the fromkeys() method.
# Each element of 'column_names' becomes a key, with None as its default value.
launch_dict = dict.fromkeys(column_names)

# Remove a column that is not needed (separate 'Date' and 'Time' keys are added below).
# The key 'Date and time ( )' is deleted from the dictionary.
del launch_dict['Date and time ( )']

# del:
# A Python keyword used to remove references to objects.
# Applied to a dictionary entry, it removes the entry for the given key.
# launch_dict['Date and time ( )']:
# Refers to the entry of launch_dict whose key is 'Date and time ( )'.
# Combined effect:
# The statement del launch_dict['Date and time ( )'] removes the key 'Date and time ( )'
# and its value from launch_dict.

In [45]:
launch_dict

{'Flight No.': None,
 'Launch site': None,
 'Payload': None,
 'Payload mass': None,
 'Orbit': None,
 'Customer': None,
 'Launch outcome': None}

In [46]:
# Initialize the 'launch_dict' dictionary by assigning an empty list to each key.
# This allows storing multiple values for each column, similar to how data is managed in a table.

# Initialize the list for the key 'Flight No.' to store flight numbers or identifiers.
launch_dict['Flight No.'] = []

# Initialize the list for the key 'Launch site' to store information about the launch site.
launch_dict['Launch site'] = []

# Initialize the list for the key 'Payload' to store information about the payload.
launch_dict['Payload'] = []

# Initialize the list for the key 'Payload mass' to store payload mass values.
launch_dict['Payload mass'] = []

# Initialize the list for the key 'Orbit' to store information about the orbit.
launch_dict['Orbit'] = []

# Initialize the list for the key 'Customer' to store customer or user information.
launch_dict['Customer'] = []

# Initialize the list for the key 'Launch outcome' to store the launch outcomes.
launch_dict['Launch outcome'] = []

# Add new columns to the dictionary with the same structure (empty lists) to store additional data.
# Initialize the list for the key 'Version Booster' to store information about the booster version.
launch_dict['Version Booster'] = []

# Initialize the list for the key 'Booster landing' to store information about the booster landing.
launch_dict['Booster landing'] = []

# Initialize the list for the key 'Date' to store the launch date.
launch_dict['Date'] = []

# Initialize the list for the key 'Time' to store the launch time.
launch_dict['Time'] = []
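
The per-key assignments above can be written more compactly. One caveat worth knowing is that `dict.fromkeys(names, [])` is not an equivalent shortcut, because it stores the same list object under every key. A small sketch with a shortened, illustrative column list:

```python
columns = ["Flight No.", "Launch site", "Payload"]

# Pitfall: fromkeys() stores the *same* list object under every key.
shared = dict.fromkeys(columns, [])
shared["Flight No."].append("1")
print(shared["Payload"])  # ['1']  (every key sees the append)

# A dict comprehension evaluates [] once per key, giving independent lists.
independent = {name: [] for name in columns}
independent["Flight No."].append("1")
print(independent["Payload"])  # []
```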


Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


HTML tables in Wikipedia pages often contain unexpected annotations and other kinds of noise, such as reference links (`B0004.1[8]`), missing values (`N/A [e]`), and inconsistent formatting.


To simplify the parsing process, we have provided an incomplete code snippet below to help you fill in `launch_dict`. Complete the TODOs in the snippet, or write your own logic to parse all launch tables:


In [47]:
# -*- coding: utf-8 -*- # Encoding declaration (redundant in Python 3, where UTF-8 is the default source encoding)

# --- Initialization ---

# Counter for the number of launch data rows processed
extracted_row = 0

# --- Table processing ---

# Iterate over every table that has the specific class.
# Each table is assumed to hold the launches of one period (e.g. a year).
# 'enumerate' provides an index (table_number) for each table.
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):

    # --- Row processing ---

    # Iterate over every row (<tr> tag) in the current table.
    for rows in table.find_all("tr"):

        # --- Identifying data rows ---

        # Flag marking whether the current row holds valid launch data.
        is_launch_data_row = False  # Renamed from 'flag' for clarity

        # Check whether the row contains a row header cell (<th>).
        # In the typical layout of these Wikipedia tables, launch data rows
        # usually carry the flight number inside a <th>.
        if rows.th:
            # Make sure the <th> cell is not empty and contains text.
            if rows.th.string:
                # Extract the <th> text (a potential flight number) and strip surrounding whitespace.
                flight_number_text = rows.th.string.strip()

                # Check whether the extracted text is purely numeric.
                # This distinguishes launch data rows from other rows (table headers, notes).
                if flight_number_text.isdigit():
                    flight_number = flight_number_text  # Keep the flight number if it is valid
                    is_launch_data_row = True  # Mark this row as a valid data row

        # Note: if there is no <th>, or the <th> is empty, or its content is not numeric,
        # 'is_launch_data_row' stays False and the row is skipped by the next block.

        # --- Data extraction (valid rows only) ---

        # Extract the information only if 'is_launch_data_row' is True (the row holds launch data).
        if is_launch_data_row:
            # Extract all data cells (<td>) from the current row.
            # These cells contain the launch details.
            row_data_cells = rows.find_all('td')  # Renamed from 'row' to avoid confusion with 'rows' in the enclosing loop

            # Increment the counter of successfully processed launch rows.
            extracted_row += 1

            # --- Filling the 'launch_dict' dictionary ---
            # Each specific value is extracted from the cells (row_data_cells)
            # and appended to the matching list in 'launch_dict'.

            # 1. Flight number
            # TODO: Append the (already validated) 'flight_number' to the dictionary.
            launch_dict['Flight No.'].append(flight_number)

            # 2. Date and time
            # Cell 0 (row_data_cells[0]) is assumed to hold the date and time.
            # 'date_time' is a helper function defined earlier in this lab.
            datatimelist = date_time(row_data_cells[0])  # ['DD Month YYYY,', 'HH:MM']

            # 2a. Date
            # Take the date and remove the trailing comma, if any.
            date = datatimelist[0].strip(',')
            # TODO: Append 'date' to the dictionary.
            launch_dict['Date'].append(date)

            # 2b. Time
            # Take the time.
            time = datatimelist[1]
            # TODO: Append 'time' to the dictionary.
            launch_dict['Time'].append(time)

            # 3. Booster version
            # Cell 1 (row_data_cells[1]) is assumed to hold the booster version.
            # 'booster_version' is a helper function that extracts and joins this value.
            bv = booster_version(row_data_cells[1])
            # Fallback: if 'booster_version' returns nothing (e.g. unexpected format),
            # try the text of the first link (<a>) in the cell.
            if not bv and row_data_cells[1].a:
                bv = row_data_cells[1].a.string.strip()  # strip() added for cleanup
            # TODO: Append 'bv' (booster version) to the dictionary.
            launch_dict['Version Booster'].append(bv)

            # 4. Launch site
            # Cell 2 (row_data_cells[2]) is assumed to hold the site, usually inside an <a> link.
            launch_site = None  # Default in case nothing is found
            if row_data_cells[2].a:
                launch_site = row_data_cells[2].a.string.strip()  # strip() added
            # TODO: Append 'launch_site' to the dictionary.
            launch_dict['Launch site'].append(launch_site)

            # 5. Payload
            # Cell 3 (row_data_cells[3]) is assumed to hold the payload name/type, usually inside an <a> link.
            payload = None  # Default
            if row_data_cells[3].a:
                payload = row_data_cells[3].a.string.strip()  # strip() added
            # TODO: Append 'payload' to the dictionary.
            launch_dict['Payload'].append(payload)

            # 6. Payload mass
            # Cell 4 (row_data_cells[4]) is assumed to hold the mass.
            # 'get_mass' is a helper function that returns the mass text (e.g. '525 kg') or 0.
            payload_mass = get_mass(row_data_cells[4])
            # TODO: Append 'payload_mass' to the dictionary.
            launch_dict['Payload mass'].append(payload_mass)

            # 7. Orbit
            # Cell 5 (row_data_cells[5]) is assumed to hold the orbit type, usually inside an <a> link.
            orbit = None  # Default
            if row_data_cells[5].a:
                orbit = row_data_cells[5].a.string.strip()  # strip() added
            # TODO: Append 'orbit' to the dictionary.
            launch_dict['Orbit'].append(orbit)

            # 8. Customer
            # Cell 6 (row_data_cells[6]) is assumed to hold the customer, usually inside an <a> link.
            customer = None  # Default
            # Check that an <a> link exists and contains text before extracting it.
            if row_data_cells[6].a and row_data_cells[6].a.string:
                customer = row_data_cells[6].a.string.strip()
            # TODO: Append 'customer' to the dictionary (None if not found).
            launch_dict['Customer'].append(customer)

            # 9. Launch outcome
            # Cell 7 (row_data_cells[7]) is assumed to hold the outcome.
            # The first string of the cell is the main text; later strings
            # (e.g. in <sup> superscripts) are ignored.
            launch_outcome = None  # Default
            # '.strings' is a generator and is always truthy, so build a list before testing it.
            outcome_strings = list(row_data_cells[7].strings)
            if outcome_strings:  # Make sure the cell has text
                launch_outcome = outcome_strings[0].strip()  # strip() added
            # TODO: Append 'launch_outcome' to the dictionary.
            launch_dict['Launch outcome'].append(launch_outcome)

            # 10. Booster landing outcome
            # Cell 8 (row_data_cells[8]) is assumed to hold the landing status.
            # 'landing_status' is a helper function that returns the first text string of the cell.
            booster_landing = landing_status(row_data_cells[8])
            # TODO: Append 'booster_landing' to the dictionary.
            launch_dict['Booster landing'].append(booster_landing)

# --- End of processing ---
# When the loops finish, 'launch_dict' holds lists with the data
# of every launch extracted from the processed tables.
# 'extracted_row' holds the number of launch rows that were processed.
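
One detail of the loop above is easy to miss: passing a multi-class string as the second argument of `find_all()` compares it against the whole `class` attribute, so the class order matters. The sketch below uses a small made-up HTML fragment (not the Wikipedia page) to contrast this with a CSS selector, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up fragment: three tables with different class attributes.
html = """
<table class="wikitable plainrowheaders collapsible"><tr><td>A</td></tr></table>
<table class="wikitable collapsible plainrowheaders"><tr><td>B</td></tr></table>
<table class="wikitable"><tr><td>C</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# A multi-class string is compared against the whole class attribute, in order.
exact = demo_soup.find_all("table", "wikitable plainrowheaders collapsible")
print(len(exact))  # 1

# A CSS selector matches tables carrying all three classes, in any order.
any_order = demo_soup.select("table.wikitable.plainrowheaders.collapsible")
print(len(any_order))  # 2
```

Which form is preferable depends on the page; the exact-string form is stricter, while the selector is more tolerant of markup changes.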

After you have filled in the parsed launch record values in `launch_dict`, you can create a dataframe from it.


In [48]:
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})
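
Wrapping each list in `pd.Series` lets pandas build the frame even when the lists differ in length, padding the shorter columns with `NaN`. That is convenient, but it can also hide a parsing problem where one column was skipped for some rows. A quick check in plain Python is sketched below; `column_lengths` is a hypothetical helper and `toy_dict` is a made-up stand-in for `launch_dict`.

```python
def column_lengths(table_dict):
    """Return {column name: number of values}. Hypothetical helper."""
    return {name: len(values) for name, values in table_dict.items()}


# Toy stand-in for launch_dict with one column missing a value.
toy_dict = {
    "Flight No.": ["1", "2", "3"],
    "Launch site": ["CCAFS", "CCAFS", "CCAFS"],
    "Customer": ["SpaceX", "NASA"],
}

lengths = column_lengths(toy_dict)
print(lengths)  # {'Flight No.': 3, 'Launch site': 3, 'Customer': 2}

# More than one distinct length means some rows were only partly recorded.
mismatched = len(set(lengths.values())) > 1
print(mismatched)  # True
```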

In [49]:
df.head()

|   | Flight No. | Launch site | Payload | Payload mass | Orbit | Customer | Launch outcome | Version Booster | Booster landing | Date | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | CCAFS | Dragon Spacecraft Qualification Unit | 0 | LEO | SpaceX | Success | F9 v1.07B0003.18 | Failure | 4 June 2010 | 18:45 |
| 1 | 2 | CCAFS | Dragon | 0 | LEO | NASA | Success | F9 v1.07B0004.18 | Failure | 8 December 2010 | 15:43 |
| 2 | 3 | CCAFS | Dragon | 525 kg | LEO | NASA | Success | F9 v1.07B0005.18 | No attempt\n | 22 May 2012 | 07:44 |
| 3 | 4 | CCAFS | SpaceX CRS-1 | 4,700 kg | LEO | NASA | Success | F9 v1.07B0006.18 | No attempt | 8 October 2012 | 00:35 |
| 4 | 5 | CCAFS | SpaceX CRS-2 | 4,877 kg | LEO | NASA | Success | F9 v1.07B0007.18 | No attempt\n | 1 March 2013 | 15:10 |
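
The `Version Booster` values in this output, such as `F9 v1.07B0003.18`, look like the version `F9 v1.0` and the booster `B0003.1` with footnote numbers fused in. One plausible cause, which is an assumption and not something the lab confirms, is that each footnote marker reaches `booster_version()` as three strings (`[`, `7`, `]`) rather than one, so the even-index selection keeps the digits. The sketch below reproduces the effect on plain lists of strings and shows one hypothetical alternative, `join_without_markers`.

```python
import re


def join_even_strings(strings):
    """Same selection as booster_version(), applied to a plain list of strings."""
    return "".join([s for i, s in enumerate(strings) if i % 2 == 0][0:-1])


def join_without_markers(strings):
    """Hypothetical alternative: join everything, then drop [n] markers."""
    text = re.sub(r"\[\w+\]", " ", "".join(strings))
    # Collapse runs of whitespace left behind by the removed markers.
    return " ".join(text.split())


# Assumed cell contents when each footnote marker is a single string.
whole_markers = ["F9 v1.0", "[7]", "B0003.1", "[8]", "\n"]
# Assumed cell contents when each marker is split into "[", "7", "]".
split_markers = ["F9 v1.0", "[", "7", "]", "B0003.1", "[", "8", "]", "\n"]

print(join_even_strings(whole_markers))     # F9 v1.0B0003.1
print(join_even_strings(split_markers))     # F9 v1.07B0003.18
print(join_without_markers(split_markers))  # F9 v1.0 B0003.1
```

Joining first and filtering afterwards does not depend on how many strings a marker is split into, which is why it gives the same result for both inputs.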


We can now export it to a <b>CSV</b> file for the next section. However, to keep the answers consistent, and in case you have difficulty finishing this lab, the following labs use a provided dataset so that each lab is independent.


<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>


In [50]:
# df.to_csv('spacex_web_scraped.csv', index=False)  # Uncomment to save the file
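
The `Date` and `Time` columns are stored as text. If a real timestamp is useful before exporting, the two strings can be combined with the standard library. This is a minimal sketch; `parse_launch_datetime` is a hypothetical helper, and it assumes English month names and the `D Month YYYY` / `HH:MM` formats seen in the first rows of the dataframe.

```python
from datetime import datetime


def parse_launch_datetime(date_text, time_text):
    """Combine 'D Month YYYY' and 'HH:MM' strings into a datetime.

    Hypothetical helper; month names are parsed with the current locale,
    which is assumed to be English.
    """
    return datetime.strptime(f"{date_text} {time_text}", "%d %B %Y %H:%M")


launch = parse_launch_datetime("4 June 2010", "18:45")
print(launch.isoformat())  # 2010-06-04T18:45:00
```

The result is a naive `datetime`; the table header labels the times as UTC, so a timezone could be attached if later steps need it.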

## Authors


<a href="https://www.linkedin.com/in/yan-luo-96288783/">Yan Luo</a>


<a href="https://www.linkedin.com/in/nayefaboutayoun/">Nayef Abou Tayoun</a>


<!--
## Change Log
-->


<!--
| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates           |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |
-->


Copyright © 2021 IBM Corporation. All rights reserved.
