# **SpaceX Falcon 9 First Stage Landing Prediction**

In this project, we will predict whether the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars; other providers charge upward of 165 million dollars each. Much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if we can determine whether the first stage will land, we can estimate the cost of a launch. This information can be used if an alternate company wants to bid against SpaceX for a rocket launch.

## Collecting the data

### Import Libraries and Defining Auxiliary Functions

In [None]:
# Requests allows us to make HTTP requests which we will use to get data from an API
import requests
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
# Datetime is a library that allows us to represent dates
import datetime

# Setting this option will print all columns of a dataframe
pd.set_option('display.max_columns', None)
# Setting this option will print all of the data in a feature
pd.set_option('display.max_colwidth', None)

### Defining a series of helper functions

From the rocket column we would like to learn the booster name.

In [None]:
# Takes the dataset and uses the rocket column to call the API and append the data to the list
def getBoosterVersion(data):
    for x in data['rocket']:
        response = requests.get("https://api.spacexdata.com/v4/rockets/"+str(x)).json()
        BoosterVersion.append(response['name'])

From the launchpad we would like to know the name of the launch site being used, the longitude, and the latitude.

In [None]:
# Takes the dataset and uses the launchpad column to call the API and append the data to the list
def getLaunchSite(data):
    for x in data['launchpad']:
        response = requests.get("https://api.spacexdata.com/v4/launchpads/"+str(x)).json()
        Longitude.append(response['longitude'])
        Latitude.append(response['latitude'])
        LaunchSite.append(response['name'])

From the payload we would like to learn the mass of the payload and the orbit that it is going to.

In [None]:
# Takes the dataset and uses the payloads column to call the API and append the data to the lists
def getPayloadData(data):
    for load in data['payloads']:
        response = requests.get("https://api.spacexdata.com/v4/payloads/"+load).json()
        PayloadMass.append(response['mass_kg'])
        Orbit.append(response['orbit'])

From cores we would like to learn the outcome of the landing, the type of the landing, the number of flights with that core, whether gridfins were used, whether the core is reused, whether legs were used, the landing pad used, the block of the core (a number used to separate versions of cores), the number of times this specific core has been reused, and the serial of the core.

In [None]:
# Takes the dataset and uses the cores column to call the API and append the data to the lists
def getCoreData(data):
    for core in data['cores']:
        if core['core'] is not None:
            response = requests.get("https://api.spacexdata.com/v4/cores/"+core['core']).json()
            Block.append(response['block'])
            ReusedCount.append(response['reuse_count'])
            Serial.append(response['serial'])
        else:
            Block.append(None)
            ReusedCount.append(None)
            Serial.append(None)
        Outcome.append(str(core['landing_success'])+' '+str(core['landing_type']))
        Flights.append(core['flight'])
        GridFins.append(core['gridfins'])
        Reused.append(core['reused'])
        Legs.append(core['legs'])
        LandingPad.append(core['landpad'])
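
The helper functions above issue one HTTP request per row, even though many launches share the same rocket, launchpad, or core ID. The following is a minimal sketch of one way to avoid repeated requests, assuming the response for a given ID does not change during the session. The names `make_cached_lookup` and `fetch_from_api` are hypothetical and not part of the original notebook.

```python
from functools import lru_cache


def make_cached_lookup(fetch):
    """Wrap fetch(endpoint, item_id) so each distinct pair is fetched only once."""
    @lru_cache(maxsize=None)
    def lookup(endpoint, item_id):
        return fetch(endpoint, item_id)
    return lookup


def fetch_from_api(endpoint, item_id):
    """Hypothetical fetcher for the SpaceX v4 API, with a timeout and a status check."""
    import requests  # imported here so the caching logic can be tried offline
    url = f"https://api.spacexdata.com/v4/{endpoint}/{item_id}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.json()


# Offline demonstration with a stand-in fetcher that records its calls
calls = []

def fake_fetch(endpoint, item_id):
    calls.append((endpoint, item_id))
    return {"name": f"{endpoint}-{item_id}"}

lookup = make_cached_lookup(fake_fetch)
names = [lookup("rockets", rocket_id)["name"] for rocket_id in ["a", "a", "b", "a"]]
print(names, len(calls))
```

In the notebook, `lookup = make_cached_lookup(fetch_from_api)` could stand in for the direct `requests.get(...)` calls inside the helpers.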

Requesting Rocket launch data from SpaceX API

In [None]:
spacex_url="https://api.spacexdata.com/v4/launches/past"

In [None]:
response = requests.get(spacex_url)

### Requesting and parsing the SpaceX launch data using the GET request

In [None]:
static_json_url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/API_call_spacex_api.json'

In [None]:
response.status_code

In [None]:
# Use the json_normalize method to convert the JSON result into a dataframe
data=pd.json_normalize(response.json())

In [None]:
# Displaying first 5 rows
data.head()

We notice that a lot of the data are IDs. For example, the rocket column has no information about the rocket, just an identification number.

We will now use the API again to get information about the launches using the IDs given for each launch. Specifically, we will be using the columns rocket, payloads, launchpad, and cores.

In [None]:
# Let's take a subset of our dataframe, keeping only the features we want plus flight_number and date_utc.
data = data[['rocket', 'payloads', 'launchpad', 'cores', 'flight_number', 'date_utc']]

# We will remove rows with multiple cores, because those are Falcon rockets with 2 extra rocket boosters, and rows that have multiple payloads in a single rocket.
data = data[data['cores'].map(len)==1]
data = data[data['payloads'].map(len)==1]

# Since payloads and cores are lists of size 1 we will also extract the single value in the list and replace the feature.
data['cores'] = data['cores'].map(lambda x : x[0])
data['payloads'] = data['payloads'].map(lambda x : x[0])

# We also want to convert date_utc to a datetime datatype and then extract the date, leaving out the time
data['date'] = pd.to_datetime(data['date_utc']).dt.date

# Using the date we will restrict the dates of the launches
data = data[data['date'] <= datetime.date(2020, 11, 13)]
data.head()

*   From the <code>rocket</code> we would like to learn the booster name

*   From the <code>payload</code> we would like to learn the mass of the payload and the orbit that it is going to

*   From the <code>launchpad</code> we would like to know the name of the launch site being used, the longitude, and the latitude.

*   From <code>cores</code> we would like to learn the outcome of the landing, the type of the landing, the number of flights with that core, whether gridfins were used, whether the core is reused, whether legs were used, the landing pad used, the block of the core (a number used to separate versions of cores), the number of times this specific core has been reused, and the serial of the core.

The data from these requests will be stored in lists and will be used to create a new dataframe.

In [None]:
#Global variables 
BoosterVersion = []
PayloadMass = []
Orbit = []
LaunchSite = []
Outcome = []
Flights = []
GridFins = []
Reused = []
Legs = []
LandingPad = []
Block = []
ReusedCount = []
Serial = []
Longitude = []
Latitude = []

The defined functions will apply the outputs globally to the above variables.

In [None]:
BoosterVersion

Applying the getBoosterVersion function to get the booster version

In [None]:
# Call getBoosterVersion
getBoosterVersion(data)

In [None]:
BoosterVersion[0:5]

Applying the rest of the functions here:

In [None]:
# Call getLaunchSite
getLaunchSite(data)

In [None]:
# Call getPayloadData
getPayloadData(data)

In [None]:
# Call getCoreData
getCoreData(data)

Constructing our dataset using the data we have obtained. We combine the columns into a dictionary.

In [None]:
launch_dict = {'FlightNumber': list(data['flight_number']),
'Date': list(data['date']),
'BoosterVersion':BoosterVersion,
'PayloadMass':PayloadMass,
'Orbit':Orbit,
'LaunchSite':LaunchSite,
'Outcome':Outcome,
'Flights':Flights,
'GridFins':GridFins,
'Reused':Reused,
'Legs':Legs,
'LandingPad':LandingPad,
'Block':Block,
'ReusedCount':ReusedCount,
'Serial':Serial,
'Longitude': Longitude,
'Latitude': Latitude}
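
`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary have different lengths, which can happen if one of the helper cells was run twice and appended to its global lists again. A small sanity check along these lines can catch that early. This is only a sketch; `check_equal_lengths` and the toy dictionaries are hypothetical names and data introduced for illustration.

```python
def check_equal_lengths(columns):
    """Return the common length of all columns, or raise ValueError listing the lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Toy data standing in for launch_dict
toy_dict = {"FlightNumber": [1, 2, 3], "Orbit": ["LEO", "GTO", "ISS"]}
print(check_equal_lengths(toy_dict))

# Simulates a helper that was called twice and appended its results again
broken_dict = {"FlightNumber": [1, 2, 3], "Orbit": ["LEO", "GTO", "ISS"] * 2}
try:
    check_equal_lengths(broken_dict)
except ValueError as err:
    print(err)
```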

Creating a Pandas data frame from the dictionary launch_dict.

In [None]:
# Create a dataframe from launch_dict
launch_data=pd.DataFrame(launch_dict)

In [None]:
# Displaying first 5 rows
launch_data.head()

### Filtering dataframe to only include Falcon 9 launches

In [None]:
data_falcon9=launch_data[launch_data['BoosterVersion']!='Falcon 1'].copy()

In [None]:
data_falcon9.loc[:,'FlightNumber'] = list(range(1, data_falcon9.shape[0]+1))
data_falcon9

In [None]:
data_falcon9=pd.read_csv('../input/spacex-falcon9-launch-data/SpaceX_Falcon9.csv')

## Data Wrangling

In [None]:
data_falcon9.isnull().sum()

### Dealing with Missing Values

In [None]:
# Replacing the np.nan values with the mean value of the PayloadMass column
x=data_falcon9['PayloadMass'].mean()
data_falcon9['PayloadMass']=data_falcon9['PayloadMass'].fillna(x)
data_falcon9.isnull().sum()
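
To make the imputation step concrete, here is a small worked example on made-up payload masses. It is a sketch on toy data rather than on the launch dataset, and the names `toy` and `mean_mass` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy payload masses in kg; the values are made up for illustration
toy = pd.DataFrame({"PayloadMass": [500.0, np.nan, 1500.0, np.nan, 4000.0]})

# mean() skips NaN by default, so this is the mean of the three known values
mean_mass = toy["PayloadMass"].mean()

# Assigning the result back avoids relying on in-place modification of a column
toy["PayloadMass"] = toy["PayloadMass"].fillna(mean_mass)

print(mean_mass)
print(toy["PayloadMass"].isnull().sum())
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when many values are missing.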

In [None]:
x

In [None]:
data_falcon9.head()

## Exploring and Preparing Data

We will explore the data to find patterns and determine what the label for training supervised models should be.

In the data set, there are several different cases where the booster did not land successfully. Sometimes a landing was attempted but failed due to an accident. For example, True Ocean means the booster successfully landed in a specific region of the ocean, while False Ocean means it failed to land in a specific region of the ocean. True RTLS means the booster successfully landed on a ground pad, while False RTLS means it failed to land on a ground pad. True ASDS means the booster successfully landed on a drone ship, while False ASDS means it failed to land on a drone ship.

In this lab we will mainly convert those outcomes into training labels, where 1 means the booster landed successfully and 0 means it did not.

In [None]:
df=data_falcon9
df.head(10)

In [None]:
# Percentage of missing values in each column
df.isnull().sum()/len(df)*100

In [None]:
df.dtypes

### Calculating the number of launches on each site

In [None]:
df['LaunchSite'].value_counts()

### Calculating the number and occurrence of each orbit

Each launch targets a dedicated orbit. Here are some common orbit types:

*   <b>LEO</b>: Low Earth orbit (LEO) is an Earth-centred orbit with an altitude of 2,000 km (1,200 mi) or less (approximately one-third of the radius of Earth), or with at least 11.25 periods per day (an orbital period of 128 minutes or less) and an eccentricity less than 0.25. Most of the manmade objects in outer space are in LEO <a href='https://en.wikipedia.org/wiki/Low_Earth_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01'>\[1]</a>.

*   <b>VLEO</b>: Very Low Earth Orbits (VLEO) can be defined as the orbits with a mean altitude below 450 km. Operating in these orbits can provide a number of benefits to Earth observation spacecraft as the spacecraft operates closer to the observation<a href='https://www.researchgate.net/publication/271499606_Very_Low_Earth_Orbit_mission_concepts_for_Earth_Observation_Benefits_and_challenges?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01'>\[2]</a>.

*   <b>GTO</b>: A geosynchronous transfer orbit (GTO) is used to reach geosynchronous orbit. A geosynchronous orbit is a high Earth orbit that allows satellites to match Earth's rotation. Located at 22,236 miles (35,786 kilometers) above Earth's equator, this position is a valuable spot for monitoring weather, communications and surveillance. "Because the satellite orbits at the same speed that the Earth is turning, the satellite seems to stay in place over a single longitude, though it may drift north to south," NASA wrote on its Earth Observatory website <a  href="https://www.space.com/29222-geosynchronous-orbit.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01" >\[3] </a>.

*   <b>SSO (or SO)</b>: A Sun-synchronous orbit, also called a heliosynchronous orbit, is a nearly polar orbit around a planet in which the satellite passes over any given point of the planet's surface at the same local mean solar time <a href="https://en.wikipedia.org/wiki/Sun-synchronous_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01">\[4] </a>.

*   <b>ES-L1 </b>:At the Lagrange points the gravitational forces of the two large bodies cancel out in such a way that a small object placed in orbit there is in equilibrium relative to the center of mass of the large bodies. L1 is one such point between the sun and the earth <a href="https://en.wikipedia.org/wiki/Lagrange_point?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01#L1_point">\[5]</a> .

*   <b>HEO</b>: A highly elliptical orbit is an elliptic orbit with high eccentricity, usually referring to one around Earth <a href="https://en.wikipedia.org/wiki/Highly_elliptical_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01">\[6]</a>.

*   <b> ISS </b> A modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project between five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada)<a href="https://en.wikipedia.org/wiki/International_Space_Station?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01"> \[7] </a>

*   <b> MEO </b> Geocentric orbits ranging in altitude from 2,000 km (1,200 mi) to just below geosynchronous orbit at 35,786 kilometers (22,236 mi). Also known as an intermediate circular orbit. These are most commonly at 20,200 kilometers (12,600 mi), or 20,650 kilometers (12,830 mi), with an orbital period of 12 hours <a href="https://en.wikipedia.org/wiki/List_of_orbits?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01"> \[8] </a>

*   <b> HEO </b> High Earth orbit: geocentric orbits above the altitude of geosynchronous orbit (35,786 km or 22,236 mi). Note that this abbreviation is also used above for highly elliptical orbits <a href="https://en.wikipedia.org/wiki/List_of_orbits?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01"> \[9] </a>

*   <b> GEO </b> It is a circular geosynchronous orbit 35,786 kilometres (22,236 miles) above Earth's equator and following the direction of Earth's rotation <a href="https://en.wikipedia.org/wiki/Geostationary_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01"> \[10] </a>

*   <b> PO </b> A polar orbit is one in which a satellite passes above or nearly above both poles of the body being orbited (usually a planet such as the Earth) <a href="https://en.wikipedia.org/wiki/Polar_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01"> \[11] </a>

In [None]:
df['Orbit'].value_counts()

### Calculating the number and occurrence of mission outcome per orbit type

In [None]:
landing_outcomes = df['Outcome'].value_counts()
landing_outcomes

In [None]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

Creating a set of outcomes where the first stage did not land successfully

In [None]:
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes
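
Selecting outcomes by position depends on the ordering returned by `value_counts()`, which could change if the data changes. As an alternative, the set can be derived from the outcome strings themselves. The sketch below assumes that every outcome not starting with `True` (that is, `False ...` and `None ...`) counts as an unsuccessful landing; the toy list and the name `bad_outcomes_by_name` are hypothetical.

```python
# Outcome strings follow the "<landing_success> <landing_type>" pattern built in getCoreData
toy_outcomes = [
    "True ASDS", "None None", "True RTLS", "False ASDS",
    "True Ocean", "False Ocean", "None ASDS", "False RTLS",
]

# Assumption: anything that does not start with "True" is an unsuccessful landing
bad_outcomes_by_name = {outcome for outcome in toy_outcomes if not outcome.startswith("True")}

print(sorted(bad_outcomes_by_name))
```

In the notebook, the same comprehension could be applied to `landing_outcomes.keys()` instead of the toy list.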

### Creating a landing outcome label from Outcome column

For each outcome in bad_outcomes, the landing class is 0; otherwise it is 1.

In [None]:
landing_class=[]
for outcome in df['Outcome']:
    if outcome in bad_outcomes:
        landing_class.append(0)
    else:
        landing_class.append(1)

This variable is the classification label that represents the outcome of each launch. If the value is zero, the first stage did not land successfully; one means the first stage landed successfully.

In [None]:
df['Class']=landing_class
df[['Class']].head(8)

In [None]:
df.head(5)

In [None]:
df["Class"].mean()
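
Because `Class` only takes the values 0 and 1, its mean is the fraction of launches whose first stage landed successfully. A tiny worked example on made-up labels (the list `toy_labels` is hypothetical) shows the arithmetic:

```python
from statistics import mean

toy_labels = [0, 0, 1, 1, 1, 0, 1, 1]  # made-up labels: 1 = landed, 0 = did not land

# Mean of a 0/1 column equals (number of ones) / (number of rows)
success_rate = mean(toy_labels)
print(success_rate, sum(toy_labels) / len(toy_labels))
```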

In [None]:
# Matplotlib is a plotting library for Python, and pyplot gives us a MATLAB-like plotting framework. We will use this in our plotter function to plot data.
import matplotlib.pyplot as plt
# Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics
import seaborn as sns

## Exploratory Data Analysis

### Visualizing relationship between payload and Flight Number

In [None]:
sns.catplot(y="PayloadMass", x="FlightNumber", hue="Class", data=df, aspect = 5)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Payload Mass (kg)",fontsize=20)
plt.show()

### Visualizing the relationship between Flight Number and Launch Site

In [None]:
# Plotting a scatter point chart with x axis to be Flight Number and y axis to be the launch site, and hue to be the class value
sns.catplot(y='LaunchSite', x='FlightNumber', hue='Class', data=df, aspect=5)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("LaunchSite",fontsize=20)
plt.show()

The landing success rate increased as the flight number increased.

### Visualizing the relationship between Payload and Launch Site

In [None]:
# Plotting a scatter point chart with x axis to be Pay Load Mass (kg) and y axis to be the launch site, and hue to be the class value
sns.catplot(y='LaunchSite', x='PayloadMass', hue='Class', data=df, aspect=5)
plt.xlabel("Payload Mass (kg)",fontsize=20)
plt.ylabel("LaunchSite",fontsize=20)
plt.show()

- Most of the launches were carried out at CCSFS SLC-40, followed by KSC LC-39A; the fewest launches were carried out at VAFB SLC-4E.

### Visualizing the relationship between success rate of each orbit type

In [None]:
df_success=df.groupby('Orbit')['Class'].mean()*100
df_success.plot(kind='bar', figsize=(10,6))
plt.xlabel('Orbit') # add x-label to the plot
plt.ylabel('Success Rate (%)') # add y-label to the plot
plt.title('Relationship between Success Rate and Orbit') # add title to the plot

plt.show()

Orbits ES-L1, GEO, HEO and SSO have the highest success rate, at 100%.

### Visualizing the relationship between FlightNumber and Orbit type

In [None]:
# Plotting a scatter point chart with x axis to be FlightNumber and y axis to be the Orbit, and hue to be the class value
sns.catplot(y='Orbit', x='FlightNumber', hue='Class', data=df, aspect=5)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

### Visualizing the relationship between Payload and Orbit type

In [None]:
# Plotting a scatter point chart with x axis to be Payload and y axis to be the Orbit, and hue to be the class value
sns.catplot(y='Orbit', x='PayloadMass', hue='Class', data=df, aspect=5)
plt.xlabel("Payload Mass (kg)",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

We observe that heavy payloads have a negative influence on landing success for GTO orbits and a positive influence for Polar, LEO and ISS orbits.

### Visualizing launch success yearly trend

In [None]:
df['Date']=df['Date'].astype(str)

In [None]:
# A function to Extract years from the date 
year=[]
def Extract_year(df):
    for i in df["Date"]:
        year.append(i.split("-")[0])
    return year
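
Note that `Extract_year` appends to the module-level list `year`, so running the cell a second time leaves the list longer than the dataframe. A sketch of a variant that builds a fresh list on every call is shown below; `extract_years` and `toy_dates` are hypothetical names, and the dates are assumed to be ISO-formatted strings (`YYYY-MM-DD`).

```python
def extract_years(dates):
    """Return the year part of each ISO-formatted date string (YYYY-MM-DD)."""
    return [date_string.split("-")[0] for date_string in dates]


toy_dates = ["2010-06-04", "2013-09-29", "2020-11-05"]
print(extract_years(toy_dates))
print(extract_years(toy_dates))  # calling again gives the same list, nothing accumulates
```

In the notebook this could be used as `df['Year'] = extract_years(df['Date'])`.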

In [None]:
# Plotting a line chart with x axis to be the extracted year and y axis to be the success rate
Extract_year(df)
df['Year']=year
fig,ax=plt.subplots()
df_success1=df.groupby('Year')['Class'].mean()*100
df_success1.plot(kind='line', figsize=(10,6))
plt.xlabel('Year') # add x-label to the plot
plt.ylabel('Success Rate (%)') # add y-label to the plot
plt.title('Relationship between Success Rate and Year') # add title to the plot

plt.show()

We can observe that the success rate kept increasing from 2013 until 2020.

In [None]:
df_success1.head()

## Feature Engineering

Based on the preliminary insights, we select the features that will be used in the success prediction.

In [None]:
features = df[['FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights', 'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block', 'ReusedCount', 'Serial']]
features.head()

### Creating dummy variables for categorical columns

In [None]:
from sklearn import preprocessing

In [None]:
#Using get_dummies() function on the categorical columns
features_one_hot=pd.get_dummies(features, prefix=['Orbits','LaunchSite','LandingPad','Serial'])
features_one_hot.head()
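
To illustrate what `get_dummies` does, the following toy example encodes a single categorical column. Each category becomes its own indicator column, while numeric columns pass through unchanged. The dataframe `toy_features` is made up for illustration, and passing `columns=` explicitly is just one way to make the encoded columns obvious.

```python
import pandas as pd

toy_features = pd.DataFrame({
    "PayloadMass": [500.0, 3000.0, 3000.0],
    "Orbit": ["LEO", "GTO", "LEO"],
})

# columns= names the columns to encode; prefix= sets the prefix of the new columns
toy_one_hot = pd.get_dummies(toy_features, columns=["Orbit"], prefix=["Orbit"])

print(list(toy_one_hot.columns))
print(toy_one_hot.astype("float64"))
```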

### Casting all numeric columns to float64

In [None]:
features_one_hot=features_one_hot.astype('float64')
features_one_hot

## Launch Sites Locations Analysis with Folium

The launch success rate may depend on many factors such as payload mass, orbit type, and so on. It may also depend on the location and proximities of a launch site, i.e., the initial position of rocket trajectories. Finding an optimal location for building a launch site certainly involves many factors and hopefully we could discover some of the factors by analyzing the existing launch site locations.

In [None]:
!pip3 install wget

In [None]:
import folium
import wget

In [None]:
# Import folium MarkerCluster plugin
from folium.plugins import MarkerCluster
# Import folium MousePosition plugin
from folium.plugins import MousePosition
# Import folium DivIcon plugin
from folium.features import DivIcon

### Marking all launch sites on a map

In [None]:
spacex_csv_file = wget.download('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/spacex_launch_geo.csv')
spacex_df=pd.read_csv(spacex_csv_file)

In [None]:
# Select relevant sub-columns: `Launch Site`, `Lat(Latitude)`, `Long(Longitude)`, `class`
spacex_df = spacex_df[['Launch Site', 'Lat', 'Long', 'class']]
launch_sites_df = spacex_df.groupby(['Launch Site'], as_index=False).first()
launch_sites_df = launch_sites_df[['Launch Site', 'Lat', 'Long']]
launch_sites_df

In [None]:
spacex_df['Launch Site'].value_counts()

Creating a folium Map object with its initial center location at NASA Johnson Space Center in Houston, Texas.

In [None]:
nasa_coordinate = [29.559684888503615, -95.0830971930759]
site_map = folium.Map(location=nasa_coordinate, zoom_start=10)

Using folium.Circle to add a highlighted circle area with a text label at the location of NASA Johnson Space Center

In [None]:
# Create an orange circle at NASA Johnson Space Center's coordinate with a popup label showing its name
circle = folium.Circle(nasa_coordinate, radius=1000, color='#d35400', fill=True).add_child(folium.Popup('NASA Johnson Space Center'))
# Create a marker at NASA Johnson Space Center's coordinate with a text icon showing its name
marker = folium.map.Marker(
    nasa_coordinate,
    # Create an icon as a text label
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % 'NASA JSC',
        )
    )
site_map.add_child(circle)
site_map.add_child(marker)

Adding a circle for each launch site listed in the data frame launch_sites_df

In [None]:
# Initialize the map
site_map = folium.Map(location=nasa_coordinate, zoom_start=4.4)
# For each launch site, add a Circle object based on its coordinate (Lat, Long) values. In addition, add the launch site name as a popup label
list1=[[28.562302,-80.577356],[28.563197,-80.576820],[28.573255,-80.646895],[34.632834,-120.610745]]
list2=['CCAFS LC-40','CCAFS SLC-40','KSC LC-39A','VAFB SLC-4E']

for site_coordinate, site_name in zip(list1, list2):
    # Create a circle at the launch site's coordinate with a popup label showing its name
    circle = folium.Circle(site_coordinate, radius=100, color='#d35400', fill=True).add_child(folium.Popup(site_name))
    # Create a marker at the same coordinate with a text icon showing the site name
    marker = folium.map.Marker(
        site_coordinate,
        # Create an icon as a text label
        icon=DivIcon(
            icon_size=(20,20),
            icon_anchor=(0,0),
            html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % site_name,
            )
        )
    site_map.add_child(circle)
    site_map.add_child(marker)
site_map

### Marking the success/failed launches for each site on the map

In [None]:
spacex_df.tail(10)

In [None]:
marker_cluster = MarkerCluster()

Creating a new column in the spacex_df dataframe called marker_color to store the marker colors based on the class value

In [None]:
# Function to assign color to launch outcome
def assign_marker_color(launch_outcome):
    if launch_outcome == 1:
        return 'green'
    else:
        return 'red'
    
spacex_df['marker_color'] = spacex_df['class'].apply(assign_marker_color)
spacex_df.tail(10)
launch_sites_df = spacex_df[['Launch Site', 'Lat', 'Long','marker_color']]
launch_sites_df

For each launch result in spacex_df data frame, adding a folium.Marker to marker_cluster

In [None]:
# Add marker_cluster to current site_map
marker_cluster=MarkerCluster()
site_map.add_child(marker_cluster)
# For each row in the spacex_df data frame,
# create a Marker object with its coordinate
# and customize the Marker's icon property to indicate whether this launch succeeded or failed,
# e.g., icon=folium.Icon(color='white', icon_color=row['marker_color'])
for index, record in spacex_df.iterrows():
    launchsite=record['Launch Site']
    # Create a Marker and add it to the marker cluster
    marker = folium.Marker([record['Lat'], record['Long']], 
                  icon=folium.Icon(color='white', icon_color=record['marker_color'],html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % launchsite,))
    marker_cluster.add_child(marker)
site_map

### Calculating the distances between a launch site to its proximities

Adding a MousePosition control to the map to get the coordinates of the point under the mouse pointer.

In [None]:
# Add Mouse Position to get the coordinate (Lat, Long) for a mouse over on the map
formatter = "function(num) {return L.Util.formatNum(num, 5);};"
mouse_position = MousePosition(
    position='topright',
    separator=' Long: ',
    empty_string='NaN',
    lng_first=False,
    num_digits=20,
    prefix='Lat:',
    lat_formatter=formatter,
    lng_formatter=formatter,
)

site_map.add_child(mouse_position)
site_map

We can calculate the distance between two points on the map based on their Lat and Long values using the following method:

In [None]:
from math import sin, cos, sqrt, atan2, radians

def calculate_distance(lat1, lon1, lat2, lon2):
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    return distance
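
A quick way to sanity-check the haversine calculation is to compare it against a case with a known answer: two points one degree of latitude apart on the same meridian should be separated by R·π/180, which is about 111.23 km for R = 6373 km. The sketch below restates the formula compactly so that it is self-contained; `haversine_km` is a hypothetical name that mirrors `calculate_distance` above.

```python
from math import sin, cos, sqrt, atan2, radians, pi


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return radius_km * 2 * atan2(sqrt(a), sqrt(1 - a))


one_degree = haversine_km(28.0, -80.0, 29.0, -80.0)
expected = 6373.0 * pi / 180  # arc length of one degree on a sphere of this radius
print(round(one_degree, 2), round(expected, 2))
```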

In [None]:
# Find the coordinate of the closest railway point
# distance_railway = calculate_distance(launch_site_lat, launch_site_lon, railway_lat, railway_lon)
distance = calculate_distance(28.57468,-80.65229,28.573255 ,-80.646895)
distance

In [None]:
# Creating and adding a folium.Marker on selected closest railway point on the map
# Displaying the distance between railway point and launch site using the icon property 
coordinate = [28.57468,-80.65229]
distance_marker = folium.Marker(
    coordinate,
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(distance),
        )
    )
site_map.add_child(distance_marker)
site_map

Drawing a PolyLine between a launch site and the selected railway point

In [None]:
# Creating a `folium.PolyLine` object using the railway point coordinate and launch site coordinate
coordinates=[[28.57468,-80.65229],[28.573255 ,-80.646895]]
lines=folium.PolyLine(locations=coordinates, weight=1)
site_map.add_child(lines)

Similarly, drawing a line between a launch site and its closest city, coastline, highway, etc.

In [None]:
# Creating a marker with distance to a closest city, coastline, highway, etc.
# Drawing a line between the marker to the launch site
coordinates=[[28.57468,-80.65229],[28.57322 ,-80.60703],[28.5248,-80.6446],[28.53386,-81.38535]]
coordinate=[28.573255 ,-80.646895]
for x in coordinates:
    lines=folium.PolyLine(locations=[x,coordinate], weight=1)
    site_map.add_child(lines)

    distance_marker = folium.Marker(
        x,
        icon=DivIcon(
            icon_size=(20,20),
            icon_anchor=(0,0),
            html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(calculate_distance(x[0],x[1],coordinate[0] ,coordinate[1])),
        )
    )
    site_map.add_child(distance_marker)
site_map

In [None]:
coordinates=[[28.57367, -80.58472],[28.5248,-80.64],[28.563197, -80.56772],[28.56,-81.38535]]
coordinate=[28.562302,-80.577356]
for x in coordinates:
    lines=folium.PolyLine(locations=[x,coordinate], weight=1)
    site_map.add_child(lines)

    distance_marker = folium.Marker(
        x,
        icon=DivIcon(
            icon_size=(20,20),
            icon_anchor=(0,0),
            html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(calculate_distance(x[0],x[1],coordinate[0] ,coordinate[1])),
        )
    )
    site_map.add_child(distance_marker)
site_map

In [None]:
coordinates=[[34.63141, -120.62568],[34.66992, -120.45753],[34.6336, -120.62606],[34.63658, -120.4542]]
coordinate=[34.632834, -120.610746]
for x in coordinates:
    lines=folium.PolyLine(locations=[x,coordinate], weight=1)
    site_map.add_child(lines)

    distance_marker = folium.Marker(
        x,
        icon=DivIcon(
            icon_size=(20,20),
            icon_anchor=(0,0),
            html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(calculate_distance(x[0],x[1],coordinate[0] ,coordinate[1])),
        )
    )
    site_map.add_child(distance_marker)
site_map

**Observations**
- Launch sites are close to the coast.
- Launch sites are also close to major highways and railways, which helps logistics.
- Launch sites are far from densely populated areas such as cities.

## Machine Learning Prediction

*   Standardize the data
*   Split into training data and test data
*   Find the best hyperparameters for SVM, Decision Tree, KNN and Logistic Regression
*   Find which classification model performs best on the test data


In [None]:
from sklearn import preprocessing
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
# Allows us to test parameters of classification algorithms and find the best one
from sklearn.model_selection import GridSearchCV
# Logistic Regression classification algorithm
from sklearn.linear_model import LogisticRegression
# Support Vector Machine classification algorithm
from sklearn.svm import SVC
# Decision Tree classification algorithm
from sklearn.tree import DecisionTreeClassifier
# K Nearest Neighbors classification algorithm
from sklearn.neighbors import KNeighborsClassifier

Define a function to plot the confusion matrix

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y, y_predict):
    """Plot the confusion matrix of true labels y against predictions y_predict."""
    cm = confusion_matrix(y, y_predict)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax)  # annot=True writes the count in each cell
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    # Class 0 = did not land, class 1 = landed (same labels on both axes)
    ax.xaxis.set_ticklabels(['did not land', 'landed'])
    ax.yaxis.set_ticklabels(['did not land', 'landed'])

In [None]:
df.head()

Predictor Variables

In [None]:
X=features_one_hot
X.head(100)

Create a NumPy array of the target variable from the `Class` column of `df`

In [None]:
Y=df['Class'].to_numpy()
Y

### Feature Scaling

In [None]:
transform = preprocessing.StandardScaler()

In [None]:
X=transform.fit_transform(X)
X
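
`StandardScaler` rescales each column to zero mean and unit variance (using the population standard deviation). The cell above fits the scaler on all of `X` before the split, so test-set rows contribute to the mean and variance. A common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below shows that idea for a single column in plain Python. The helper name `standardize` and the payload values are hypothetical.

```python
import statistics

def standardize(train_values, other_values):
    """Scale both lists using the mean and population std of train_values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    scale = std if std > 0 else 1.0  # leave constant columns unscaled
    train_scaled = [(v - mean) / scale for v in train_values]
    other_scaled = [(v - mean) / scale for v in other_values]
    return train_scaled, other_scaled

# Hypothetical payload masses in kg, split into train and test rows
train_payload = [500.0, 2500.0, 4000.0, 6000.0]
test_payload = [3250.0]

train_scaled, test_scaled = standardize(train_payload, test_payload)
print(train_scaled)
print(test_scaled)
```

With scikit-learn the same effect is usually obtained by calling `fit` on the training split and `transform` on both splits.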

### Splitting data into train and test sets

In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(X, Y, test_size=0.2, random_state=2)

In [None]:
Y_test.shape

### Classification Algorithms

### Hyperparameter Tuning using Cross-Validation with GridSearchCV

### Logistic Regression

In [None]:
# The lbfgs solver supports only the l2 (ridge) penalty, so l1 (lasso) is not included in the grid
parameters = {'C': [0.01, 0.1, 1],
              'penalty': ['l2'],
              'solver': ['lbfgs']}
lr = LogisticRegression()

In [None]:
logreg_cv=GridSearchCV(lr,parameters, cv=10)
logreg_cv.fit(X_train,Y_train)
print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("cross-validation accuracy:", logreg_cv.best_score_)

Accuracy of **Logistic Regression** on test data

In [None]:
print('Accuracy on test data is: {:.3f}'.format(logreg_cv.score(X_test, Y_test)))

Confusion Matrix for **Logistic Regression**

In [None]:
yhat=logreg_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

Logistic regression classifies successful and unsuccessful landings well; the main problem is false positives, that is, failed landings predicted as successful.
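
In the confusion matrix, rows are true labels and columns are predicted labels, so a false positive is a landing that failed (class 0) but was predicted to succeed (class 1). The sketch below counts the four cells by hand. The label lists are hypothetical and are not the notebook's results.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels (rows: true, columns: predicted)."""
    counts = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts

# Hypothetical labels: 0 = did not land, 1 = landed
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
print("false positives:", fp)
print("false negatives:", fn)
```

This layout matches the one returned by `sklearn.metrics.confusion_matrix` for labels sorted as 0, 1.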

### Support Vector Machine Classifier

In [None]:
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'),
              'C': np.logspace(-3, 3, 5),
              'gamma':np.logspace(-3, 3, 5)}
svm = SVC()

In [None]:
svm_cv=GridSearchCV(svm, parameters, cv=10)
svm_cv.fit(X_train,Y_train)

In [None]:
print("tuned hyperparameters (best parameters):", svm_cv.best_params_)
print("cross-validation accuracy:", svm_cv.best_score_)

Accuracy of **SVM** on the test data

In [None]:
print('Accuracy on test data is: {:.3f}'.format(svm_cv.score(X_test, Y_test)))

Confusion Matrix for **SVM**

In [None]:
yhat=svm_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

### Decision Tree Classifier

In [None]:
parameters = {'criterion': ['gini', 'entropy'],
     'splitter': ['best', 'random'],
     'max_depth': [2*n for n in range(1,10)],
     'max_features': ['sqrt'],  # 'auto' (an alias for 'sqrt' here) was removed in scikit-learn 1.3
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10]}

predtree = DecisionTreeClassifier()

In [None]:
tree_cv=GridSearchCV(predtree, parameters, cv=10, scoring='accuracy')
tree_cv.fit(X_train,Y_train)

In [None]:
print("tuned hyperparameters (best parameters):", tree_cv.best_params_)
print("cross-validation accuracy:", tree_cv.best_score_)

Accuracy of **Decision Tree Classifier** on test data

In [None]:
print('Accuracy on test data is: {:.3f}'.format(tree_cv.score(X_test, Y_test)))

Confusion Matrix for **Decision Tree Classifier**

In [None]:
yhat = tree_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

### K Nearest Neighbours Classification

In [None]:
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1,2]}

KNN = KNeighborsClassifier()

In [None]:
knn_cv=GridSearchCV(KNN, parameters, cv=10)
knn_cv.fit(X_train,Y_train)
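
`GridSearchCV` is exhaustive: it fits every combination in the grid once per cross-validation fold. The sketch below uses only the standard library to count how many fits the KNN search implies. The grid is copied from the cell above, and `cv_folds` mirrors `cv=10`.

```python
from itertools import product

# Same grid as the KNN search above
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1, 2]}
cv_folds = 10

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, one more fit on the whole training set follows for the best candidate.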

In [None]:
print("tuned hyperparameters (best parameters):", knn_cv.best_params_)
print("cross-validation accuracy:", knn_cv.best_score_)

Accuracy of **KNN Algorithm** on test data

In [None]:
print('Accuracy on test data is: {:.3f}'.format(knn_cv.score(X_test, Y_test)))

Confusion Matrix for **KNN Classifier**

In [None]:
yhat = knn_cv.predict(X_test)
plot_confusion_matrix(Y_test,yhat)

### Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, average_precision_score, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score)

In [None]:
clf = RandomForestClassifier(criterion='gini', max_depth= 18, n_estimators=200, max_features='sqrt', min_samples_leaf= 1, min_samples_split= 2, random_state=200)
clf.fit(X_train,Y_train)

In [None]:
Ypred_train=clf.predict(X_train)

In [None]:
Rftrainscore=clf.score(X_train,Y_train)

In [None]:
print("Accuracy of Random Forest Classifier on train data:", Rftrainscore)

In [None]:
Rftestscore=clf.score(X_test,Y_test)

In [None]:
print('Accuracy of Random Forest Classifier on test data:',Rftestscore)

In [None]:
Ypred=clf.predict(X_test)

In [None]:
plot_confusion_matrix(Y_test,Ypred)

### Extreme Gradient Boosting Classification

In [None]:
%pip install xgboost

In [None]:
from xgboost import XGBClassifier

In [None]:
clf1 = XGBClassifier(max_depth = 10,random_state = 10,n_estimators=100, eval_metric = 'auc', min_child_weight = 3,
                    colsample_bytree = 0.75, subsample= 0.9)
clf1.fit(X_train,Y_train)

In [None]:
XGBtrainscore=clf1.score(X_train,Y_train)

In [None]:
print('Accuracy of XGBClassifier on train data:',XGBtrainscore)

In [None]:
XGBtestscore=clf1.score(X_test,Y_test)

In [None]:
print('Accuracy of XGBClassifier on test data:',XGBtestscore)

In [None]:
Ypred=clf1.predict(X_test)

In [None]:
plot_confusion_matrix(Y_test,Ypred)

### Finding the best model

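Accuracy comparison of the different algorithms on training data. For the four grid-searched models the value is the mean cross-validated accuracy (`best_score_`), while for Random Forest and XGBoost it is the accuracy on the full training set, so the two groups are not directly comparable.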

In [None]:
algorithms = {'KNN':knn_cv.best_score_,'Tree':tree_cv.best_score_,'LogisticRegression':logreg_cv.best_score_,'SVM':svm_cv.best_score_,'RandomForest':Rftrainscore,'XGBClassifier':XGBtrainscore}
bestalgorithm = max(algorithms, key=algorithms.get)
print('Best Algorithm is',bestalgorithm,'with a score of',algorithms[bestalgorithm])

In [None]:
score_df = pd.DataFrame.from_dict(algorithms, orient='index', columns=['Train Data Accuracy'])
score_df.sort_values(['Train Data Accuracy'], inplace=True)
score_df.head(6)

In [None]:
score_df = score_df.reset_index()
score_df.rename(columns = {'index': 'Algorithm'}, inplace = True)
score_df.head(6)

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
fig = px.bar(score_df, x='Algorithm', y='Train Data Accuracy', hover_data=['Algorithm', 'Train Data Accuracy'], color='Algorithm')
fig.update_layout(title='Algorithm vs. Train Data Accuracy', xaxis_title='Algorithm', yaxis_title='Train Data Accuracy' )
fig.show()

Accuracy comparison of different algorithms on test data

In [None]:
algorithms2 = {'KNN':knn_cv.score(X_test, Y_test),'Tree':tree_cv.score(X_test, Y_test),'LogisticRegression':logreg_cv.score(X_test, Y_test),'SVM':svm_cv.score(X_test, Y_test),'RandomForest':Rftestscore,'XGBClassifier':XGBtestscore}
bestalgorithm2 = max(algorithms2, key=algorithms2.get)
print('Best Algorithm is',bestalgorithm2,'with a score of',algorithms2[bestalgorithm2])
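
On a small test set several models can reach exactly the same accuracy, and `max(..., key=...)` then reports only the first one it encounters. The sketch below lists every algorithm that shares the top score. The helper name `best_algorithms` and the scores are hypothetical and are not the notebook's results.

```python
def best_algorithms(scores, tolerance=1e-9):
    """Return the names of all algorithms whose score ties with the maximum."""
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if top - score <= tolerance)

# Hypothetical test accuracies with a three-way tie
example_scores = {'KNN': 0.83, 'Tree': 0.78, 'LogisticRegression': 0.83, 'SVM': 0.83}

print(max(example_scores, key=example_scores.get))  # first key with the top score
print(best_algorithms(example_scores))              # all tied keys
```

Applied to `algorithms2`, the same helper would show whether the reported best algorithm is a unique winner or one of several tied models.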

In [None]:
score_df1 = pd.DataFrame.from_dict(algorithms2, orient='index', columns=['Test Data Accuracy'])
score_df1.sort_values(['Test Data Accuracy'], inplace=True)
score_df1 = score_df1.reset_index()
score_df1.rename(columns = {'index': 'Algorithm'}, inplace = True)
score_df1.head(6)

In [None]:
import plotly.express as px
import plotly.graph_objects as go
fig = px.bar(score_df1, x='Algorithm', y='Test Data Accuracy', hover_data=['Algorithm', 'Test Data Accuracy'], color='Algorithm')
fig.update_layout(title='Algorithm vs. Test Data Accuracy', xaxis_title='Algorithm', yaxis_title='Test Data Accuracy' )
fig.show()