# Stage 01 — Data Collection (SpaceX API)

SpaceX advertises Falcon 9 rocket launches on its website for approximately **62 million USD** compared with more than **165 million USD** from other providers. Much of this cost advantage comes from the **reusability of the first stage booster**.  

Therefore, if we can predict whether the **first stage will successfully land**, we can approximate the cost efficiency of a given launch. This capability is valuable when assessing how competitors might bid against SpaceX for contracts.  

### Objective  
Collect Falcon 9 launch and payload metadata from the **public SpaceX REST API** and build a reproducible snapshot for downstream analysis.  

### Inputs  
- SpaceX REST API endpoints: **launches, payloads, rockets, landpads, cores**

### Outputs  
- Raw JSON snapshots (excluded from version control for reproducibility)  
- Sample CSV extract: `data/sample/launches_sample.csv`  

### Next  
Proceed to **Stage 02 — Web Scraping (Wikipedia enrichment)** to add complementary details.


----


## Import Libraries and Define Auxiliary Functions


In [None]:
# Requests allows us to make HTTP requests which we will use to get data from an API
import requests
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
# Datetime is a library that allows us to represent dates
import datetime

# Setting this option will print all columns of a dataframe
pd.set_option('display.max_columns', None)
# Setting this option will print the full contents of each cell without truncation
pd.set_option('display.max_colwidth', None)

## Feature Extraction from API Response

We enrich the launch records by resolving foreign keys (IDs) via additional SpaceX API calls and extracting descriptive attributes for **Rocket**, **Payload**, **Launchpad**, and **Cores**.

## Booster Data

Helper function to retrieve the **booster version name** associated with each launch.  

This function loops through the `rocket` column of the dataset, queries the **SpaceX Rockets API**, and appends the booster name to the `BoosterVersion` list.



In [None]:
def getBoosterVersion(data):
    for x in data['rocket']:
        if x:
            # Resolve the rocket ID to its descriptive name
            response = requests.get("https://api.spacexdata.com/v4/rockets/"+str(x)).json()
            BoosterVersion.append(response['name'])
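
Many launches share the same rocket ID, so a loop like the one above repeats identical requests. The following is a minimal sketch of caching lookups per distinct ID. It assumes a hypothetical `fetch_name` callable standing in for the `requests.get(...).json()['name']` call; `resolve_names` and `fake_fetch` are illustrative names, not part of the notebook.

```python
from typing import Callable, Dict, Iterable, List


def resolve_names(ids: Iterable[str], fetch_name: Callable[[str], str]) -> List[str]:
    """Resolve each ID to a name, calling fetch_name at most once per distinct ID."""
    cache: Dict[str, str] = {}
    names: List[str] = []
    for item_id in ids:
        if not item_id:
            continue  # mirror the notebook's `if x:` guard
        if item_id not in cache:
            cache[item_id] = fetch_name(item_id)
        names.append(cache[item_id])
    return names


# Hypothetical offline stand-in for the HTTP call; the IDs are made up.
calls: List[str] = []


def fake_fetch(rocket_id: str) -> str:
    calls.append(rocket_id)
    return {"r1": "Falcon 1", "r9": "Falcon 9"}[rocket_id]


names = resolve_names(["r1", "r9", "r9", "r9"], fake_fetch)
print(names)       # ['Falcon 1', 'Falcon 9', 'Falcon 9', 'Falcon 9']
print(len(calls))  # 2
```

With the real API, `fetch_name` could wrap the request shown in `getBoosterVersion`; the cache only changes how often it is called, not what is returned.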

## Launchpad Data

Helper function to retrieve the **launch site name** and its **geographic coordinates** (longitude and latitude).  

This function loops through the `launchpad` column of the dataset, queries the **SpaceX Launchpads API**, and appends the extracted details to the respective lists.


In [None]:
def getLaunchSite(data):
    for x in data['launchpad']:
        if x:
            # Resolve the launchpad ID to its name and coordinates
            response = requests.get("https://api.spacexdata.com/v4/launchpads/"+str(x)).json()
            Longitude.append(response['longitude'])
            Latitude.append(response['latitude'])
            LaunchSite.append(response['name'])

## Payload Data

Helper function to retrieve the **payload mass (kg)** and the **target orbit**.  

This function loops through the `payloads` column of the dataset, queries the **SpaceX Payloads API**, and appends the extracted details to the corresponding lists.


In [None]:
def getPayloadData(data):
    for load in data['payloads']:
        if load:
            # Resolve the payload ID to its mass and target orbit
            response = requests.get("https://api.spacexdata.com/v4/payloads/"+load).json()
            PayloadMass.append(response['mass_kg'])
            Orbit.append(response['orbit'])

## Core Data

Helper function to retrieve information on booster **reusability and landing outcomes**, including:  

- Landing outcome and landing type  
- Number of flights for the core  
- Grid fins and landing legs usage  
- Whether the core is reused  
- Landing pad designation  
- Core block version (hardware generation)  
- Number of times the core has been reused  
- Core serial identifier  


In [None]:
def getCoreData(data):
    for core in data['cores']:
        if core['core'] is not None:
            # Core-level attributes come from the Cores endpoint
            response = requests.get("https://api.spacexdata.com/v4/cores/"+core['core']).json()
            Block.append(response['block'])
            ReusedCount.append(response['reuse_count'])
            Serial.append(response['serial'])
        else:
            # No core ID: append placeholders so the lists stay aligned
            Block.append(None)
            ReusedCount.append(None)
            Serial.append(None)
        # Launch-specific attributes are already present in the launch record
        Outcome.append(str(core['landing_success'])+' '+str(core['landing_type']))
        Flights.append(core['flight'])
        GridFins.append(core['gridfins'])
        Reused.append(core['reused'])
        Legs.append(core['legs'])
        LandingPad.append(core['landpad'])
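
The `Outcome` value is built by string concatenation, so a missing field ends up as the literal text `None` rather than a null value. The sketch below reproduces that concatenation on hand-written records; `outcome_label` is a hypothetical helper and the sample values are illustrative only.

```python
def outcome_label(core: dict) -> str:
    """Same concatenation as in getCoreData."""
    return str(core['landing_success']) + ' ' + str(core['landing_type'])


# Hand-written sample records, not real API output.
samples = [
    {'landing_success': True, 'landing_type': 'ASDS'},
    {'landing_success': False, 'landing_type': 'Ocean'},
    {'landing_success': None, 'landing_type': None},
]
labels = [outcome_label(c) for c in samples]
print(labels)  # ['True ASDS', 'False Ocean', 'None None']
```

Downstream stages that derive a success label from `Outcome` therefore need to treat strings such as `'None None'` as a distinct category.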

## API Request and Data Retrieval

In this section, we use the **SpaceX REST API** to collect information about past Falcon 9 launches.  

Instead of printing the full JSON response (which is large and difficult to read), we preview only the first 200 bytes to verify that the request returned data.  


In [None]:
spacex_url="https://api.spacexdata.com/v4/launches/past"

In [None]:
response = requests.get(spacex_url)

In [None]:
print(response.content[:200])

The preview confirms that the response contains structured metadata about Falcon 9 launches.  

---

### Using a Static JSON Snapshot

To ensure **reproducibility** and avoid issues with changing live API data, we use a provided static JSON snapshot instead of the live response. This guarantees consistent results across different runs of the notebook.  


In [None]:
static_json_url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/API_call_spacex_api.json'

In [None]:
response=requests.get(static_json_url)

In [None]:
response.status_code

A `200` response code confirms that the request was successful.  
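
Checking the status code by eye works in a notebook, but the check can also be made explicit in code. The following is a sketch only: `fetch_json` is a hypothetical helper that accepts any callable with the same interface as `requests.get`, and `FakeResponse` and `fake_get` are offline stand-ins so the example runs without network access.

```python
from typing import Any, Callable


def fetch_json(url: str, get: Callable[..., Any], timeout: float = 10.0) -> Any:
    """Fetch url with the supplied get callable and return the decoded JSON body."""
    response = get(url, timeout=timeout)
    response.raise_for_status()  # with requests, this raises on 4xx/5xx responses
    return response.json()


class FakeResponse:
    """Offline stand-in exposing only the two methods used above."""

    def __init__(self, payload: Any, status_code: int = 200) -> None:
        self._payload = payload
        self.status_code = status_code

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

    def json(self) -> Any:
        return self._payload


def fake_get(url: str, timeout: float) -> FakeResponse:
    # Made-up payload for illustration, not real API output.
    return FakeResponse([{"flight_number": 1}])


launches = fetch_json("https://api.spacexdata.com/v4/launches/past", fake_get)
print(launches)  # [{'flight_number': 1}]
```

In the notebook itself, passing `requests.get` as the `get` argument would issue the real request with a timeout.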

---

### Convert JSON to DataFrame

Next, we parse the JSON response and convert it into a **Pandas DataFrame** using `json_normalize`. This tabular format is easier to explore and will serve as the foundation for further feature extraction and analysis.  

In [None]:
# Use the json_normalize method to convert the JSON result into a dataframe
data = pd.json_normalize(response.json())

In [None]:
# Get the head of the dataframe
data.head(5)
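
`json_normalize` flattens nested dictionaries into dotted column names but leaves list values intact, which is why columns such as `cores` and `payloads` still hold lists at this point. A small sketch on hand-written records (the field values are illustrative, not real API output):

```python
import pandas as pd

# Two made-up launch records with a nested dict and a list field.
records = [
    {"flight_number": 1, "links": {"patch": {"small": "a.png"}}, "cores": [{"core": "c1"}]},
    {"flight_number": 2, "links": {"patch": {"small": "b.png"}}, "cores": [{"core": "c2"}]},
]
flat = pd.json_normalize(records)

print(sorted(flat.columns))         # nested dict keys become 'links.patch.small'
print(type(flat.loc[0, "cores"]))   # the list is kept as a single cell value
```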

## Enriching Launch Data with API Lookups

The initial dataset contains many **foreign keys (IDs)** instead of descriptive values. To make the data analysis-ready, we use the SpaceX API again to resolve these IDs into meaningful attributes.

In [None]:
# Keep only the features we need, plus flight_number and date_utc.
data = data[['rocket', 'payloads', 'launchpad', 'cores', 'flight_number', 'date_utc']]

# Remove rows with multiple cores (Falcon rockets with two extra boosters) and rows with multiple payloads on a single rocket.
# The final .copy() avoids a SettingWithCopyWarning on the assignments below.
data = data[data['cores'].map(len)==1]
data = data[data['payloads'].map(len)==1].copy()

# Since payloads and cores are now lists of size 1, extract the single value from each list and replace the feature.
data['cores'] = data['cores'].map(lambda x : x[0])
data['payloads'] = data['payloads'].map(lambda x : x[0])

# Convert date_utc to a datetime and keep only the date, dropping the time
data['date'] = pd.to_datetime(data['date_utc']).dt.date

# Restrict the dataset to launches on or before 2020-11-13
data = data[data['date'] <= datetime.date(2020, 11, 13)]
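
To see what the length filters and the `x[0]` extraction do, here is a sketch on a toy DataFrame with the same column names. The rows are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'flight_number': [1, 2, 3],
    # Row 2 has three cores, row 3 has two payloads.
    'cores': [[{'core': 'a'}], [{'core': 'b'}, {'core': 'c'}, {'core': 'd'}], [{'core': 'e'}]],
    'payloads': [['p1'], ['p2'], ['p3', 'p4']],
})

# Keep only rows with exactly one core and exactly one payload.
demo = demo[demo['cores'].map(len) == 1]
demo = demo[demo['payloads'].map(len) == 1].copy()

# Unwrap the single-element lists.
demo['cores'] = demo['cores'].map(lambda x: x[0])
demo['payloads'] = demo['payloads'].map(lambda x: x[0])

print(demo['flight_number'].tolist())  # [1]
print(demo.iloc[0]['payloads'])        # p1
```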

Specifically, we enrich the dataset using the following columns:  

- **Rocket** → Retrieve the **booster version name**.  
- **Payload** → Capture the **payload mass (kg)** and the **target orbit**.  
- **Launchpad** → Record the **launch site name** along with its **longitude and latitude**.  
- **Cores** → Collect detailed information on booster reusability and landing performance, including:  
  - Landing outcome and landing type  
  - Number of flights for the same core  
  - Use of grid fins and landing legs  
  - Whether the core was reused  
  - Landing pad designation  
  - Core block version (hardware generation)  
  - Number of times the core has been reused  
  - Core serial identifier  

All extracted values are stored in lists, which will later be assembled into a **Pandas DataFrame** for efficient exploration and analysis.

In [None]:
# Global lists populated by the helper functions
BoosterVersion = []
PayloadMass = []
Orbit = []
LaunchSite = []
Outcome = []
Flights = []
GridFins = []
Reused = []
Legs = []
LandingPad = []
Block = []
ReusedCount = []
Serial = []
Longitude = []
Latitude = []

### Example: Booster Version

Before applying the helper functions, each list is initialized as empty.  

For instance, the `BoosterVersion` list starts out empty:


In [None]:
BoosterVersion

Now, after applying the **`getBoosterVersion`** function, the list is populated with booster names:


In [None]:
# Call getBoosterVersion
getBoosterVersion(data)

In [None]:
BoosterVersion[0:5]

### Apply Remaining Enrichment Functions

We now call the other helper functions to populate the remaining variables with data from the API:


In [None]:
# Call getLaunchSite
getLaunchSite(data)

In [None]:
# Call getPayloadData
getPayloadData(data)

In [None]:
# Call getCoreData
getCoreData(data)

### Constructing the Falcon 9 Dataset

Finally, we combine the extracted lists into a dictionary, which will serve as the basis for constructing a clean and structured **Pandas DataFrame**.

In [None]:
launch_dict = {
    'FlightNumber': list(data['flight_number']),
    'Date': list(data['date']),
    'BoosterVersion': BoosterVersion,
    'PayloadMass': PayloadMass,
    'Orbit': Orbit,
    'LaunchSite': LaunchSite,
    'Outcome': Outcome,
    'Flights': Flights,
    'GridFins': GridFins,
    'Reused': Reused,
    'Legs': Legs,
    'LandingPad': LandingPad,
    'Block': Block,
    'ReusedCount': ReusedCount,
    'Serial': Serial,
    'Longitude': Longitude,
    'Latitude': Latitude,
}
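
`pd.DataFrame` raises an error when the lists in a dictionary have different lengths, and the helper functions skip rows whose ID is empty, so a quick length check before building the DataFrame can save debugging time. The sketch below uses a hypothetical helper, `mismatched_columns`, on a toy dictionary with made-up values.

```python
def mismatched_columns(columns: dict, expected: int) -> dict:
    """Return {name: length} for every column whose length differs from expected."""
    return {name: len(values) for name, values in columns.items() if len(values) != expected}


# Toy stand-in for launch_dict; the values are illustrative only.
toy_dict = {
    'FlightNumber': [1, 2, 3],
    'BoosterVersion': ['Falcon 1', 'Falcon 9', 'Falcon 9'],
    'PayloadMass': [20.0, 525.0],  # one lookup was skipped
}
bad = mismatched_columns(toy_dict, expected=len(toy_dict['FlightNumber']))
print(bad)  # {'PayloadMass': 2}
```

An empty result means every column has the expected number of rows.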


We create a Pandas data frame from the dictionary launch_dict.

In [None]:
# Create a dataframe from launch_dict
df = pd.DataFrame(launch_dict)

In [None]:
# Show the head of the dataframe
df.head()

## Filter the Final Dataset to only Falcon 9 launches

We remove the Falcon 1 launches and keep only the Falcon 9 launches. We filter the <code>df</code> dataframe on the <code>BoosterVersion</code> column and save the result to a new dataframe called <code>data_falcon9</code>.

In [None]:
# Keep only the rows whose BoosterVersion is not 'Falcon 1'
data_falcon9 = df[df['BoosterVersion'] != 'Falcon 1'].reset_index(drop=True)

Since some rows were dropped during cleaning, we reset the **FlightNumber** column to ensure sequential numbering.

In [None]:
data_falcon9.loc[:,'FlightNumber'] = list(range(1, data_falcon9.shape[0]+1))
data_falcon9
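
The filter and renumbering steps can be checked on a toy DataFrame. This is a sketch with made-up rows; only the column names match the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'FlightNumber': [1, 4, 6, 8],
    'BoosterVersion': ['Falcon 1', 'Falcon 9', 'Falcon 1', 'Falcon 9'],
})

# Drop Falcon 1 rows and rebuild the index from 0.
toy_f9 = toy[toy['BoosterVersion'] != 'Falcon 1'].reset_index(drop=True)

# Renumber flights sequentially, starting at 1.
toy_f9['FlightNumber'] = list(range(1, toy_f9.shape[0] + 1))

print(toy_f9['FlightNumber'].tolist())  # [1, 2]
print(toy_f9.index.tolist())            # [0, 1]
```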

## Data Wrangling

Inspecting the dataset reveals that some columns contain missing values.


In [None]:
data_falcon9.isnull().sum()

The **LandingPad** column will retain `None` values to indicate cases where a landing pad was not used.  

For numerical fields like **PayloadMass**, however, we impute missing values to maintain consistency.


### Handling Missing Values

We calculate the mean of the **PayloadMass** column and use it to replace all `NaN` entries. This ensures no missing values remain in that column.


In [None]:
# Calculate the mean value of the PayloadMass column
payload_mean = data_falcon9['PayloadMass'].mean()

# Replace the NaN values with the mean; fillna returns a new Series, which we assign back
data_falcon9['PayloadMass'] = data_falcon9['PayloadMass'].fillna(payload_mean)
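
Mean imputation can be verified on a toy Series. The sketch below uses made-up masses and `fillna` as one way to perform the replacement; note that `mean()` skips `NaN` values by default, so the missing entry does not distort the average.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the dataset.
toy = pd.Series([1000.0, np.nan, 3000.0], name='PayloadMass')

toy_mean = toy.mean()          # NaN is skipped, so the mean is 2000.0
filled = toy.fillna(toy_mean)  # returns a new Series with the gap filled

print(filled.tolist())         # [1000.0, 2000.0, 3000.0]
print(filled.isnull().sum())   # 0
```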


At this stage, the dataset is clean:  
- All missing payload masses have been imputed.  
- `LandingPad` retains `None` for launches without pad landings.  

---

## Export Clean Dataset

We now export the cleaned dataset to CSV for downstream analysis.  

In [None]:
data_falcon9.to_csv('dataset_part_1.csv', index=False)
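
One detail worth knowing when the CSV is read back in a later stage: `None` values are written as empty fields and come back as `NaN`. The sketch below shows this round trip through an in-memory buffer; the pad name `pad-a` is a made-up placeholder.

```python
import io

import pandas as pd

toy = pd.DataFrame({'FlightNumber': [1, 2], 'LandingPad': [None, 'pad-a']})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored['LandingPad'].isna().tolist())  # [True, False]
```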

## Authors


<a href="https://www.linkedin.com/in/joseph-s-50398b136/">Joseph Santarcangelo</a> has a PhD in Electrical Engineering, his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD. 


<!--## Change Log
-->


<!--

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2020-09-20|1.1|Joseph|get result each time you run|
|2020-09-20|1.1|Azim |Created Part 1 Lab using SpaceX API|
|2020-09-20|1.0|Joseph |Modified Multiple Areas|
-->


Copyright ©IBM Corporation. All rights reserved.
