## Data Wrangling 

### SpaceX Data from REST API
In the data set, there are several cases where the booster did not land successfully. Sometimes a landing was attempted but failed due to an accident. For example, True Ocean means the booster successfully landed in a specific region of the ocean, while False Ocean means it failed to land in a specific region of the ocean. True RTLS means the booster successfully landed on a ground pad, while False RTLS means it failed to land on a ground pad. True ASDS means the booster successfully landed on a drone ship, while False ASDS means it failed to land on a drone ship.

Convert these outcomes into training labels: 1 means the booster landed successfully, and 0 means it did not.

### Import libraries

In [90]:
import pandas as pd
import numpy as np
import datetime

In [14]:
df = pd.read_csv('dataset_part_1.csv')
df.head()

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2010-06-04,Falcon 9,6123.547647,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857
1,2,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857
2,3,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857


In [15]:
#check data types
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object

In [16]:
# Count the number of launches at each site
df['LaunchSite'].value_counts()

LaunchSite
CCSFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: count, dtype: int64

In [17]:
# Count the number of occurrences of each orbit
df['Orbit'].value_counts()

Orbit
GTO      27
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: count, dtype: int64

In [18]:
# Count the occurrences of each landing outcome and assign them to a variable
landing_outcomes = df['Outcome'].value_counts()
landing_outcomes

Outcome
True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: count, dtype: int64

In this output, Ocean refers to a specific region of the ocean, RTLS to a ground pad, and ASDS to a drone ship. A True prefix means the booster landed successfully on that target, and a False prefix means the landing failed. None ASDS and None None also represent a failure to land.

Using Outcome, create a list whose elements are 0 if the corresponding row in Outcome is unsuccessful and 1 otherwise. Then assign it to the variable landing_class.

In [20]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS


In [21]:
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}

In [22]:
# Build landing_class: 0 for a bad outcome, 1 otherwise
Outcome = df['Outcome']
landing_class = [0 if outcome in bad_outcomes else 1 for outcome in Outcome]
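
The set of bad outcomes above is picked by position, which depends on the order returned by value_counts(). As a minimal sketch of an alternative, assuming every successful label starts with the word True (as it does in the eight labels listed earlier), the label can be derived from the text itself. The helper name label_outcome is hypothetical and not part of the original notebook.

```python
# The eight outcome labels shown in the value_counts() output
outcomes = ["True ASDS", "None None", "True RTLS", "False ASDS",
            "True Ocean", "False Ocean", "None ASDS", "False RTLS"]

def label_outcome(outcome):
    """Return 1 when the first word of the label is 'True', else 0.

    Hypothetical helper: assumes labels have the form '<result> <target>'.
    """
    return 1 if outcome.split()[0] == "True" else 0

# Map every label to its class and collect the ones labelled 0
labels = {outcome: label_outcome(outcome) for outcome in outcomes}
bad = {outcome for outcome, label in labels.items() if label == 0}
print(sorted(bad))
```

On the DataFrame itself, the same rule could likely be written as df['Outcome'].str.startswith('True').astype(int), which avoids hard-coded positions.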

In [23]:
df['Class']=landing_class
df.head(10)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,Falcon 9,6123.547647,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857,0
2,3,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857,0
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093,0
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857,0
5,6,2014-01-06,Falcon 9,3325.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1005,-80.577366,28.561857,0
6,7,2014-04-18,Falcon 9,2296.0,ISS,CCSFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1006,-80.577366,28.561857,1
7,8,2014-07-14,Falcon 9,1316.0,LEO,CCSFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1007,-80.577366,28.561857,1
8,9,2014-08-05,Falcon 9,4535.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1008,-80.577366,28.561857,0
9,10,2014-09-07,Falcon 9,4428.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1011,-80.577366,28.561857,0


In [24]:
#Calculate success rate
df['Class'].mean()

0.6666666666666666
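
Because Class only holds 0 and 1, its mean is the fraction of successful landings. As a quick cross-check, this sketch recomputes the rate from the outcome counts printed earlier; the variable names are illustrative only.

```python
# Outcome counts copied from the value_counts() output above
outcome_counts = {
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
}

# Successful landings are the labels that start with 'True'
successes = sum(count for outcome, count in outcome_counts.items()
                if outcome.startswith("True"))
total = sum(outcome_counts.values())

success_rate = successes / total
print(successes, total, success_rate)
```

The total of 90 also agrees with the launch-site counts (55 + 22 + 13).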

In [43]:
df.to_csv("dataset_part_2.csv", index=False)

## Data wrangling on web-scraped data set

In [184]:
scrap_df = pd.read_csv('spacex_web_scraped.csv')
scrap_df.tail()

Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
116,117,CCSFS,Starlink,"15,600 kg",LEO,"<td><a href=""/wiki/SpaceX"" title=""SpaceX"">Spac...",Success\n,F9 B5B1051.10,Success,9 May 2021,06:42
117,118,KSC,Starlink,"~14,000 kg",LEO,"<td><a href=""/wiki/SpaceX"" title=""SpaceX"">Spac...",Success\n,F9 B5B1058.8,Success,15 May 2021,22:56
118,119,CCSFS,Starlink,"15,600 kg",LEO,"<td><a href=""/wiki/SpaceX"" title=""SpaceX"">Spac...",Success\n,F9 B5B1063.2,Success,26 May 2021,18:59
119,120,KSC,SpaceX CRS-22,"3,328 kg",LEO,"<td><a href=""/wiki/NASA"" title=""NASA"">NASA</a>...",Success\n,F9 B5B1067.1,Success,3 June 2021,17:29
120,121,CCSFS,SXM-8,"7,000 kg",GTO,"<td><a href=""/wiki/Sirius_XM"" title=""Sirius XM...",Success\n,F9 B5,Success,6 June 2021,04:26


In [186]:
# Check for missing values in each column
scrap_df.isnull().sum()

Flight No.         0
Launch site        0
Payload            0
Payload mass       0
Orbit              0
Customer           0
Launch outcome     0
Version Booster    0
Booster landing    0
Date               0
Time               0
dtype: int64

In [188]:
# Convert the date column to datetime
month_dict = {
    "January": "01",
    "February": "02",
    "March": "03",
    "April": "04",
    "May": "05",
    "June": "06",
    "July": "07",
    "August": "08",
    "September": "09",
    "October": "10",
    "November": "11",
    "December": "12"
}
def parse_date(date_str):
    parts = date_str.split()
    day = parts[0]
    month = month_dict[parts[1]]
    year = parts[2]
    return f"{day} {month} {year}"

# Apply the function to the 'Date' column and convert it to datetime
scrap_df['Date'] = pd.to_datetime(scrap_df['Date'].apply(parse_date), format="%d %m %Y")
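
The month lookup above can also be expressed with the %B format directive, which matches full month names. This is a minimal standalone sketch, assuming the process runs with the default C/English locale (month names are locale-dependent); parse_launch_date is a hypothetical helper name.

```python
from datetime import datetime

def parse_launch_date(date_str):
    """Parse dates such as '9 May 2021' (day, full month name, year).

    Assumes an English locale, since %B matches locale-specific month names.
    """
    return datetime.strptime(date_str, "%d %B %Y")

# Sample values taken from the scraped table shown earlier
for raw in ["9 May 2021", "3 June 2021"]:
    print(raw, "->", parse_launch_date(raw).date().isoformat())
```

pd.to_datetime accepts the same format string, so under the same locale assumption pd.to_datetime(scrap_df['Date'], format="%d %B %Y") should work without the dictionary. The dictionary approach in the cell above has the advantage of not depending on the locale.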

In [190]:
# Clean payload mass: remove the 'kg' unit, thousands separators, and '~'
scrap_df['Payload mass'] = scrap_df['Payload mass'].str.replace('kg','').str.replace(',','').str.replace('~','')
scrap_df['Payload mass'] = scrap_df['Payload mass'].str.strip()
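
After this cleaning the column still holds strings. As a sketch of how a single value could be turned into a number, the hypothetical helper below extracts the first numeric token and returns None when there is none. Treating non-numeric entries as missing is an assumption, not something the original notebook does.

```python
import re

def parse_payload_mass(text):
    """Return the payload mass in kg as a float, or None if no number is found.

    Hypothetical helper: handles forms such as '15,600 kg' and '~14,000 kg'.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return float(match.group().replace(",", ""))

for raw in ["15,600 kg", "~14,000 kg", "0", "unknown"]:
    print(repr(raw), "->", parse_payload_mass(raw))
```

On the cleaned column, pd.to_numeric(scrap_df['Payload mass'], errors='coerce') would be the usual pandas route to the same result, with unparseable entries becoming NaN.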

In [230]:
scrap_df.tail()

Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
116,117,CCSFS,Starlink,15600,LEO,"<td><a href=""/wiki/SpaceX"" title=""SpaceX"">Spac...",Success\n,F9 B5B1051.10,Success,2021-05-09,06:42
117,118,KSC,Starlink,14000,LEO,"<td><a href=""/wiki/SpaceX"" title=""SpaceX"">Spac...",Success\n,F9 B5B1058.8,Success,2021-05-15,22:56
118,119,CCSFS,Starlink,15600,LEO,"<td><a href=""/wiki/SpaceX"" title=""SpaceX"">Spac...",Success\n,F9 B5B1063.2,Success,2021-05-26,18:59
119,120,KSC,SpaceX CRS-22,3328,LEO,"<td><a href=""/wiki/NASA"" title=""NASA"">NASA</a>...",Success\n,F9 B5B1067.1,Success,2021-06-03,17:29
120,121,CCSFS,SXM-8,7000,GTO,"<td><a href=""/wiki/Sirius_XM"" title=""Sirius XM...",Success\n,F9 B5,Success,2021-06-06,04:26


In [234]:
# Extract Customer from the customer column
from bs4 import BeautifulSoup
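def extract_title(html_content):
    """Return the title attribute of the first <a> tag, or None if it is missing."""
    try:
        soup = BeautifulSoup(html_content, 'html5lib')
        return soup.a['title']
    except (AttributeError, TypeError, KeyError):
        # No <a> tag (TypeError) or no title attribute (KeyError)
        return None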

# Apply the function to the column to extract the customer name
scrap_df['Customer'] = scrap_df['Customer'].apply(extract_title)
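
For comparison, here is a sketch of the same extraction using only the standard library's html.parser, which avoids the html5lib dependency. The class and function names are hypothetical, and the sample cell is modelled on the NASA row shown above.

```python
from html.parser import HTMLParser

class FirstLinkTitle(HTMLParser):
    """Record the title attribute of the first <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.seen_link = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first link is inspected, mirroring soup.a in the cell above
        if tag == "a" and not self.seen_link:
            self.seen_link = True
            self.title = dict(attrs).get("title")

def extract_first_title(html_content):
    """Return the first link's title, or None for missing links or non-string input."""
    if not isinstance(html_content, str):
        return None
    parser = FirstLinkTitle()
    parser.feed(html_content)
    return parser.title

print(extract_first_title('<td><a href="/wiki/NASA" title="NASA">NASA</a></td>'))
```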

In [238]:
scrap_df.head()

Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX,Success\n,F9 v1.0B0003.1,Failure,2010-06-04,18:45
1,2,CCAFS,Dragon,0,LEO,NASA,Success,F9 v1.0B0004.1,Failure,2010-12-08,15:43
2,3,CCAFS,Dragon,525,LEO,NASA,Success,F9 v1.0B0005.1,No attempt\n,2012-05-22,07:44
3,4,CCAFS,SpaceX CRS-1,4700,LEO,NASA,Success\n,F9 v1.0B0006.1,No attempt,2012-10-08,00:35
4,5,CCAFS,SpaceX CRS-2,4877,LEO,NASA,Success\n,F9 v1.0B0007.1,No attempt\n,2013-03-01,15:10


In [None]:
# Clean the 'Launch outcome' and 'Booster landing' columns:
# remove newlines and surrounding whitespace, assigning the results back
scrap_df['Launch outcome'] = scrap_df['Launch outcome'].str.replace('\n','').str.strip()
scrap_df['Booster landing'] = scrap_df['Booster landing'].str.replace('\n','').str.strip()
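
To illustrate why this cleanup matters, the sketch below counts a few sample values of the kind seen in the table above, before and after stripping. The list is illustrative, not the full column.

```python
from collections import Counter

# Sample values of the kind seen in 'Launch outcome' and 'Booster landing'
raw = ["Success\n", "Success", "No attempt\n", "No attempt", "Failure"]

# Before cleaning, a trailing newline makes the same label count as two categories
raw_counts = Counter(raw)

# str.strip() removes leading and trailing whitespace, including '\n'
cleaned = [value.strip() for value in raw]
clean_counts = Counter(cleaned)

print(len(raw_counts), "distinct labels before cleaning")
print(len(clean_counts), "distinct labels after cleaning")
```

Note that pandas string methods such as .str.strip() return a new Series, so the result only takes effect once it is assigned back to the column.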

Select the columns needed for our exploratory analysis and drop the remaining columns.

In [264]:
scrap_data = scrap_df[['Date','Time', 'Version Booster', 'Launch site','Payload', 'Payload mass', 'Orbit','Customer']]

In [266]:
scrap_data.head()

Unnamed: 0,Date,Time,Version Booster,Launch site,Payload,Payload mass,Orbit,Customer
0,2010-06-04,18:45,F9 v1.0B0003.1,CCAFS,Dragon Spacecraft Qualification Unit,0,LEO,SpaceX
1,2010-12-08,15:43,F9 v1.0B0004.1,CCAFS,Dragon,0,LEO,NASA
2,2012-05-22,07:44,F9 v1.0B0005.1,CCAFS,Dragon,525,LEO,NASA
3,2012-10-08,00:35,F9 v1.0B0006.1,CCAFS,SpaceX CRS-1,4700,LEO,NASA
4,2013-03-01,15:10,F9 v1.0B0007.1,CCAFS,SpaceX CRS-2,4877,LEO,NASA


In [271]:
scrap_data.to_csv('scraped_data.csv', index=False)