## Data Wrangling

### Exploratory Data Analysis and Label Creation

In this notebook, we will perform Exploratory Data Analysis (EDA) to identify patterns in the launch data and prepare the labels needed for supervised machine learning models.

The dataset includes multiple outcome cases where the booster either successfully landed or failed. Landing outcomes describe if and where the booster touched down after launch:

True Ocean: Booster successfully landed in a designated ocean region.

False Ocean: Booster attempted but failed to land in the ocean region.

True RTLS: Booster successfully landed on a ground pad using Return To Launch Site (RTLS).

False RTLS: Booster failed to land on the ground pad during RTLS.

True ASDS: Booster successfully landed on an Autonomous Spaceport Drone Ship (ASDS).

False ASDS: Booster failed to land on the drone ship.

### Label Encoding for Supervised Learning

For the purpose of training classification models, we will convert these detailed outcome descriptions into binary training labels:

1 — Successful landing (good outcome)

0 — Unsuccessful landing (bad outcome)

This binary label will be our target variable to train models that predict whether a booster will land successfully based on the mission parameters.
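
The mapping from outcome string to binary label can be sketched as a small function. This is a minimal illustration, not the notebook's own implementation: the helper name `outcome_to_label` is hypothetical, and it assumes an outcome counts as successful only when it begins with `True`.

```python
def outcome_to_label(outcome: str) -> int:
    """Return 1 for a successful landing, 0 otherwise.

    Assumption: an outcome is successful only when it begins with "True";
    every other string maps to 0.
    """
    return 1 if outcome.startswith("True") else 0


# Illustrative outcome strings in the "<result> <landing type>" form used by the dataset
examples = ["True ASDS", "False Ocean", "True RTLS", "False RTLS"]
labels = [outcome_to_label(o) for o in examples]
print(dict(zip(examples, labels)))
```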

## Objectives
Perform exploratory data analysis and determine the training labels:

- Exploratory Data Analysis
- Determine Training Labels 


In [36]:
import pandas as pd
import numpy as np

In [37]:
np.__version__

'1.26.4'

In [38]:
pd.__version__

'2.2.2'

In [39]:
df = pd.read_csv('phase_1/dataset_part_1.csv')
df.head()

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2010-06-04,Falcon 9,6123.547647,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857
1,2,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857
2,3,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857


In [40]:
df.isnull().sum()/len(df)*100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass        0.000000
Orbit              0.000000
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        28.888889
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64
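
`LandingPad` is the only column with missing values. One way to investigate is to look at which outcomes occur in the rows where it is missing. The sketch below uses a small hypothetical sample (the pad identifiers `pad_a` and `pad_b` are made up) rather than the real dataset, so it only shows the shape of the check, not an actual finding.

```python
import pandas as pd

# Hypothetical sample: LandingPad is missing for some rows
sample = pd.DataFrame({
    "Outcome": ["None None", "True ASDS", "False Ocean", "True RTLS", "None None"],
    "LandingPad": [None, "pad_a", None, "pad_b", None],
})

# Count the outcomes of the rows where no landing pad was recorded
missing_pad = sample.loc[sample["LandingPad"].isna(), "Outcome"].value_counts()
print(missing_pad)
```

Running the same expression on `df` would show whether the missing pads are concentrated in particular outcomes.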

In [41]:
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object
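
The `Date` column is loaded as `object` (plain strings). If date arithmetic or grouping by year is needed later, it can be parsed with `pd.to_datetime`. The following is a sketch on a hypothetical three-row sample, assuming the `YYYY-MM-DD` format seen in the first rows holds for the whole column.

```python
import pandas as pd

# Hypothetical sample mirroring the Date column's string format (YYYY-MM-DD)
sample = pd.DataFrame({"Date": ["2010-06-04", "2012-05-22", "2013-03-01"]})

# Parse the strings into datetime64 values; an explicit format avoids ambiguity
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# With a datetime dtype, components such as the year are available via .dt
sample["Year"] = sample["Date"].dt.year
print(sample.dtypes)
print(sample)
```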

In [42]:
import folium

In [43]:
# Extract coordinates
coordinates = df[['Longitude', 'Latitude']]
# Center the map on the mean launch position
map_center = [coordinates['Latitude'].mean(), coordinates['Longitude'].mean()]
launch_map = folium.Map(location=map_center, zoom_start=5)

# Add one marker per distinct launch site (many launches share the same coordinates)
for _, row in coordinates.drop_duplicates().iterrows():
    folium.Marker([row['Latitude'], row['Longitude']]).add_to(launch_map)

# Display the map in the notebook
launch_map

### TASK 1: Calculating the number of launches on each site

The data contains several SpaceX launch facilities: <a href='https://en.wikipedia.org/wiki/List_of_Cape_Canaveral_and_Merritt_Island_launch_sites'>Cape Canaveral Space Force Station</a> Space Launch Complex 40 <b>(CCSFS SLC 40)</b>, Vandenberg Air Force Base Space Launch Complex 4E <b>(VAFB SLC 4E)</b>, and Kennedy Space Center Launch Complex 39A <b>(KSC LC 39A)</b>. The location of each launch is stored in the column <code>LaunchSite</code>.


In [44]:
# Number of launches per launch site
launch_counts = df['LaunchSite'].value_counts()
print(launch_counts)


LaunchSite
CCSFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: count, dtype: int64
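
The raw counts can also be read as shares of all launches. The sketch below rebuilds the counts from the output above as a `Series` (so it runs on its own) and converts them to percentages. On the real column, `df['LaunchSite'].value_counts(normalize=True)` gives the same fractions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
launch_counts = pd.Series(
    {"CCSFS SLC 40": 55, "KSC LC 39A": 22, "VAFB SLC 4E": 13},
    name="count",
)

# Convert each count to a percentage of all launches
launch_share = (launch_counts / launch_counts.sum() * 100).round(1)
print(launch_share)
```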




Each launch targets a specific orbit. The orbit types that appear in the dataset are:

* <b>LEO</b>: Low Earth orbit (LEO) is an Earth-centred orbit with an altitude of 2,000 km (1,200 mi) or less (approximately one-third of the radius of Earth), or with at least 11.25 periods per day (an orbital period of 128 minutes or less) and an eccentricity less than 0.25. Most of the manmade objects in outer space are in LEO <a href='https://en.wikipedia.org/wiki/Low_Earth_orbit'>[1]</a>.

* <b>VLEO</b>: Very Low Earth Orbits (VLEO) can be defined as orbits with a mean altitude below 450 km. Operating in these orbits can benefit Earth observation spacecraft, because the spacecraft operates closer to what it observes <a href='https://www.researchgate.net/publication/271499606_Very_Low_Earth_Orbit_mission_concepts_for_Earth_Observation_Benefits_and_challenges'>[2]</a>.


* <b>GTO</b>: A geosynchronous transfer orbit (GTO) is an elliptical orbit used to deliver a payload toward geosynchronous orbit. A geosynchronous orbit is a high Earth orbit that allows satellites to match Earth's rotation. Located at 22,236 miles (35,786 kilometers) above Earth's equator, this position is a valuable spot for monitoring weather, communications and surveillance. "Because the satellite orbits at the same speed that the Earth is turning, the satellite seems to stay in place over a single longitude, though it may drift north to south," NASA wrote on its Earth Observatory website <a href="https://www.space.com/29222-geosynchronous-orbit.html">[3]</a>.


* <b>SSO (or SO)</b>: A Sun-synchronous orbit, also called a heliosynchronous orbit, is a nearly polar orbit around a planet in which the satellite passes over any given point of the planet's surface at the same local mean solar time <a href="https://en.wikipedia.org/wiki/Sun-synchronous_orbit">[4]</a>.
    
    
    
* <b>ES-L1</b>: At the Lagrange points, the gravitational forces of the two large bodies cancel out in such a way that a small object placed in orbit there is in equilibrium relative to the center of mass of the large bodies. L1 is one such point between the Sun and Earth <a href="https://en.wikipedia.org/wiki/Lagrange_point#L1_point">[5]</a>.
    
    
* <b>HEO</b>: A highly elliptical orbit is an elliptic orbit with high eccentricity, usually referring to one around Earth <a href="https://en.wikipedia.org/wiki/Highly_elliptical_orbit">[6]</a>.


* <b>ISS</b>: A modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project between five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada) <a href="https://en.wikipedia.org/wiki/International_Space_Station">[7]</a>.


* <b>MEO</b>: Medium Earth orbits are geocentric orbits ranging in altitude from 2,000 km (1,200 mi) to just below geosynchronous orbit at 35,786 kilometers (22,236 mi). Also known as an intermediate circular orbit. These are most commonly at 20,200 kilometers (12,600 mi) or 20,650 kilometers (12,830 mi), with an orbital period of 12 hours <a href="https://en.wikipedia.org/wiki/List_of_orbits">[8]</a>.


* <b>HEO</b> (high Earth orbit): The same abbreviation is also used for geocentric orbits above the altitude of geosynchronous orbit (35,786 km or 22,236 mi) <a href="https://en.wikipedia.org/wiki/List_of_orbits">[9]</a>.


* <b>GEO</b>: A geostationary orbit is a circular geosynchronous orbit 35,786 kilometres (22,236 miles) above Earth's equator, following the direction of Earth's rotation <a href="https://en.wikipedia.org/wiki/Geostationary_orbit">[10]</a>.


* <b>PO</b>: A polar orbit is one in which a satellite passes above or nearly above both poles of the body being orbited (usually a planet such as Earth) <a href="https://en.wikipedia.org/wiki/Polar_orbit">[11]</a>.

Some of these orbits are shown in the following plot:


<img src="Orbits.png" alt="Orbit info" />


### Calculating the number and occurrence of each orbit


In [45]:
# Count the number of launches for each orbit type
orbit_counts = df['Orbit'].value_counts()
print(orbit_counts)


Orbit
GTO      27
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: count, dtype: int64


### Calculating the number and occurrence of mission outcomes per orbit


In [46]:
# Group by Orbit and Outcome, then count the number of launches
outcome_by_orbit = df.groupby(['Orbit', 'Outcome']).size().reset_index(name='Counts')

# Display the result
print(outcome_by_orbit)


    Orbit      Outcome  Counts
0   ES-L1   True Ocean       1
1     GEO    True ASDS       1
2     GTO   False ASDS       1
3     GTO    None ASDS       1
4     GTO    None None      11
5     GTO    True ASDS      13
6     GTO   True Ocean       1
7     HEO    True ASDS       1
8     ISS   False ASDS       2
9     ISS  False Ocean       1
10    ISS   False RTLS       1
11    ISS    None ASDS       1
12    ISS    None None       3
13    ISS    True ASDS       5
14    ISS   True Ocean       1
15    ISS    True RTLS       7
16    LEO    None None       2
17    LEO   True Ocean       1
18    LEO    True RTLS       4
19    MEO    None None       1
20    MEO    True ASDS       2
21     PO   False ASDS       1
22     PO  False Ocean       1
23     PO    None None       1
24     PO    True ASDS       5
25     PO   True Ocean       1
26     SO    None None       1
27    SSO    True ASDS       2
28    SSO    True RTLS       3
29   VLEO   False ASDS       2
30   VLEO    True ASDS      12
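
The long-format result above has one row per orbit and outcome pair. The same counts can be laid out as a grid with `pd.crosstab`, which is often easier to scan. The sketch below uses a small hypothetical sample instead of the full dataset.

```python
import pandas as pd

# Hypothetical sample rows with the same Orbit and Outcome columns
sample = pd.DataFrame({
    "Orbit": ["GTO", "GTO", "ISS", "ISS", "ISS", "LEO"],
    "Outcome": ["True ASDS", "None None", "True RTLS", "True RTLS", "False ASDS", "True RTLS"],
})

# One row per orbit, one column per outcome; absent combinations are filled with 0
outcome_table = pd.crosstab(sample["Orbit"], sample["Outcome"])
print(outcome_table)
```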


<code>True Ocean</code> means the booster landed successfully in a specific region of the ocean, while <code>False Ocean</code> means the ocean landing attempt failed. <code>True RTLS</code> means the booster landed successfully on a ground pad, and <code>False RTLS</code> means the ground pad landing failed. <code>True ASDS</code> means the booster landed successfully on a drone ship, and <code>False ASDS</code> means the drone ship landing failed. <code>None ASDS</code> and <code>None None</code> both represent a failure to land.
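
Since each outcome string combines a result and a landing type, the two parts can be separated for analysis. The following is a sketch on a hypothetical sample. The column names `Landed`, `LandingType` and `Success` are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical sample of Outcome strings in "<result> <landing type>" form
outcome = pd.Series(
    ["True ASDS", "False Ocean", "None None", "None ASDS", "True RTLS"],
    name="Outcome",
)

# Split on the single space into two columns; the column names are illustrative
parts = outcome.str.split(" ", expand=True)
parts.columns = ["Landed", "LandingType"]

# Only the literal string "True" marks a successful landing
parts["Success"] = parts["Landed"].eq("True")
print(parts)
```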


In [47]:
# Get unique outcomes
outcomes = df['Outcome'].unique()

# Keep the outcomes that record an explicit failed landing attempt (prefix 'False')
unsuccessful_outcomes = {outcome for outcome in outcomes if outcome.startswith('False')}

# Display the set of unsuccessful outcomes
print(unsuccessful_outcomes)


{'False Ocean', 'False ASDS', 'False RTLS'}


### Creating a landing outcome label from Outcome column

In [48]:
# Define bad outcomes: every outcome that does not start with 'True'
# (this covers the 'False ...' outcomes as well as 'None None' and 'None ASDS')
bad_outcomes = {outcome for outcome in df['Outcome'].unique() if not outcome.startswith('True')}

# Create class labels based on bad outcomes
landing_class = [0 if outcome in bad_outcomes else 1 for outcome in df['Outcome']]

# Assign to the DataFrame
df['Class'] = landing_class
df[['Outcome', 'Class']].head(5)


Unnamed: 0,Outcome,Class
0,None None,0
1,None None,0
2,None None,0
3,False Ocean,0
4,None None,0


In [49]:
df[['Outcome', 'Class']].tail(10)

Unnamed: 0,Outcome,Class
80,True ASDS,1
81,True ASDS,1
82,True ASDS,1
83,True ASDS,1
84,True RTLS,1
85,True ASDS,1
86,True ASDS,1
87,True ASDS,1
88,True ASDS,1
89,True ASDS,1


In [50]:
df[10:20]

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
10,11,2014-09-21,Falcon 9,2216.0,ISS,CCSFS SLC 40,False Ocean,1,False,False,False,,1.0,0,B1010,-80.577366,28.561857,0
11,12,2015-01-10,Falcon 9,2395.0,ISS,CCSFS SLC 40,False ASDS,1,True,False,True,5e9e3032383ecb761634e7cb,1.0,0,B1012,-80.577366,28.561857,0
12,13,2015-02-11,Falcon 9,570.0,ES-L1,CCSFS SLC 40,True Ocean,1,True,False,True,,1.0,0,B1013,-80.577366,28.561857,1
13,14,2015-04-14,Falcon 9,1898.0,ISS,CCSFS SLC 40,False ASDS,1,True,False,True,5e9e3032383ecb761634e7cb,1.0,0,B1015,-80.577366,28.561857,0
14,15,2015-04-27,Falcon 9,4707.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1016,-80.577366,28.561857,0
15,16,2015-06-28,Falcon 9,2477.0,ISS,CCSFS SLC 40,None ASDS,1,True,False,True,5e9e3032383ecb6bb234e7ca,1.0,0,B1018,-80.577366,28.561857,0
16,17,2015-12-22,Falcon 9,2034.0,LEO,CCSFS SLC 40,True RTLS,1,True,False,True,5e9e3032383ecb267a34e7c7,1.0,0,B1019,-80.577366,28.561857,1
17,18,2016-01-17,Falcon 9,553.0,PO,VAFB SLC 4E,False ASDS,1,True,False,True,5e9e3033383ecbb9e534e7cc,1.0,0,B1017,-120.610829,34.632093,0
18,19,2016-03-04,Falcon 9,5271.0,GTO,CCSFS SLC 40,False ASDS,1,True,False,True,5e9e3032383ecb6bb234e7ca,1.0,0,B1020,-80.577366,28.561857,0
19,20,2016-04-08,Falcon 9,3136.0,ISS,CCSFS SLC 40,True ASDS,1,True,False,True,5e9e3032383ecb6bb234e7ca,2.0,1,B1021,-80.577366,28.561857,1


In [51]:
df["Class"].mean().round(2)

0.67

In [53]:
df.to_csv("dataset_part_2.csv", index=False)
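
Because `Class` is a 0/1 column, its mean is the overall landing success rate, and the same idea gives a success rate per group. The sketch below shows this on a hypothetical sample grouped by `Orbit`. The values are illustrative and are not results from the dataset.

```python
import pandas as pd

# Hypothetical sample with the Orbit column and the binary Class label
sample = pd.DataFrame({
    "Orbit": ["GTO", "GTO", "ISS", "ISS", "ISS", "LEO"],
    "Class": [1, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column within each group is that group's success rate
rate_by_orbit = sample.groupby("Orbit")["Class"].mean().round(2)
print(rate_by_orbit)
```

Groups with very few launches (several orbits in the dataset have a single launch) give rates of exactly 0 or 1, so they should be read with care.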