# **SpaceX Falcon 9 First Stage Landing Prediction**


# Lab 1: Collecting the data


> In this capstone, we will predict if the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars; other providers cost upward of 165 million dollars each. Much of the savings comes from the fact that SpaceX can reuse the first stage. Therefore **if we can determine if the first stage will land, we can determine the cost of a launch**. This information can be used if an alternate company wants to bid against SpaceX for a rocket launch.

In [None]:
![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/lab_v2/images/landing_1.gif)


> Several examples of an unsuccessful landing are shown here:


In [None]:
![Crash Image](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/lab_v2/images/crash.gif)


## Objectives

We will make a **GET request** to the SpaceX API. We will also do some basic data wrangling and formatting.

- Request to the SpaceX API
- Clean the requested data

- We will be working with SpaceX launch data that is gathered from an API, specifically the SpaceX REST API. This API will give us data about launches, including information about the rocket used, payload delivered, launch specifications, landing specifications, and landing outcome.
- Our goal is to use this data to predict whether SpaceX will attempt to land a rocket or not.
- The SpaceX REST API endpoints, or URLs, start with **api.spacexdata.com/v4/** .
- There are different endpoints, for example /capsules and /cores. We will be working with the endpoint **api.spacexdata.com/v4/launches/past**.
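
As a rough sketch of how these endpoints compose, the helper below joins the base URL with a resource path. The name `endpoint_url` is a hypothetical helper introduced only for illustration, and `https://` is assumed as the scheme, matching the request URLs used later in this lab.

```python
# Base URL of the SpaceX REST API (v4), as described above.
BASE_URL = "https://api.spacexdata.com/v4"

def endpoint_url(*parts):
    """Join the base URL with path segments (hypothetical helper)."""
    cleaned = [str(p).strip("/") for p in parts if str(p).strip("/")]
    return "/".join([BASE_URL] + cleaned)

print(endpoint_url("launches", "past"))
print(endpoint_url("rockets", "some-id"))  # "some-id" is a placeholder, not a real ID
```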

## Import Libraries and Define Auxiliary Functions


In [1]:
# Requests allows us to make HTTP requests which we will use
# to get data from an API
import requests

# Datetime is a library that allows us to represent dates
import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Setting this option will print all columns of a dataframe
pd.set_option('display.max_columns', None)
# Setting this option will print all of the data in a feature
pd.set_option('display.max_colwidth', None)

### Defining a series of helper functions

* Below we will define a series of helper functions that will help us use the API to extract information using identification numbers in the launch data.

From the <code>rocket</code> column we would like to learn the booster name.


In [None]:
# Takes the dataset and uses the rocket column to call the API and append the data to the list
def getBoosterVersion(data):
    for x in data['rocket']:
        if x:
            response = requests.get("https://api.spacexdata.com/v4/rockets/"+str(x)).json()
            BoosterVersion.append(response['name'])

From the <code>launchpad</code> we would like to know the name of the launch site being used, the longitude, and the latitude.


In [None]:
# Takes the dataset and uses the launchpad column to call the API and append the data to the list
def getLaunchSite(data):
    for x in data['launchpad']:
        if x:
            response = requests.get("https://api.spacexdata.com/v4/launchpads/"+str(x)).json()
            Longitude.append(response['longitude'])
            Latitude.append(response['latitude'])
            LaunchSite.append(response['name'])

From the <code>payload</code> we would like to learn the mass of the payload and the orbit that it is going to.

In [None]:
# Takes the dataset and uses the payloads column to call the API and append the data to the lists
def getPayloadData(data):
    for load in data['payloads']:
        if load:
            response = requests.get("https://api.spacexdata.com/v4/payloads/"+load).json()
            PayloadMass.append(response['mass_kg'])
            Orbit.append(response['orbit'])

From <code>cores</code> we would like to learn the outcome of the landing, the type of the landing, the number of flights with that core, whether gridfins were used, whether the core is reused, whether legs were used, the landing pad used, the block of the core (a number used to separate versions of cores), the number of times this specific core has been reused, and the serial of the core.


In [None]:
# Takes the dataset and uses the cores column to call the API and append the data to the lists
def getCoreData(data):
    for core in data['cores']:
        if core['core'] is not None:
            response = requests.get("https://api.spacexdata.com/v4/cores/"+core['core']).json()
            Block.append(response['block'])
            ReusedCount.append(response['reuse_count'])
            Serial.append(response['serial'])
        else:
            Block.append(None)
            ReusedCount.append(None)
            Serial.append(None)
        Outcome.append(str(core['landing_success'])+' '+str(core['landing_type']))
        Flights.append(core['flight'])
        GridFins.append(core['gridfins'])
        Reused.append(core['reused'])
        Legs.append(core['legs'])
        LandingPad.append(core['landpad'])
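
Because many launches share the same rocket, launchpad, or core ID, helpers like the ones above may request the same resource repeatedly. A minimal sketch of memoizing such lookups follows. `make_cached_lookup` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for `requests.get(...).json()` so that the example runs offline.

```python
from functools import lru_cache

def make_cached_lookup(fetch):
    """Wrap a fetch(resource, item_id) callable so each pair is fetched once."""
    @lru_cache(maxsize=None)
    def lookup(resource, item_id):
        return fetch(resource, item_id)
    return lookup

# Stand-in for an HTTP call; it records how often it is invoked.
calls = []
def fake_fetch(resource, item_id):
    calls.append((resource, item_id))
    return {"name": f"{resource}:{item_id}"}

lookup = make_cached_lookup(fake_fetch)
names = [lookup("rockets", rid)["name"] for rid in ["a", "a", "b", "a"]]
print(names)
print(len(calls))  # two distinct IDs, so two fetches
```

In the real notebook the `fetch` argument would be a small function that calls the API; that wiring is left out here on purpose.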

> Now let's start requesting rocket launch data from SpaceX API with the following URL:

In [None]:
spacex_url="https://api.spacexdata.com/v4/launches/past"

In [None]:
response = requests.get(spacex_url)

In [None]:
# Check the content of the response
print(response.content)

We can see that the <code>response</code> contains a large amount of information about SpaceX launches. Next, let's extract the parts that are relevant for this project.

**Request and parse the SpaceX launch data using the GET request**

To make the requested JSON results more consistent, we will use the following static response object for this project:

In [None]:
static_json_url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/API_call_spacex_api.json'
response = requests.get(static_json_url)

We should see that the request was successful, with a 200 status response code.

In [None]:
response.status_code

Now we decode the response content as JSON using <code>.json()</code> and turn it into a Pandas dataframe using <code>.json_normalize()</code>.

In [None]:
# Use the json_normalize method to convert the JSON result into a dataframe

# 1. Decode the JSON response content
data = response.json()

# 2. Create a Pandas DataFrame using json_normalize
df = pd.json_normalize(data)
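
To see what the flattening step does, here is a minimal standard-library sketch of the idea: nested dictionaries become dot-separated column names, while lists are kept as single values. `flatten` is a hypothetical helper and the record is made-up sample data. This illustrates the concept and is not pandas' actual implementation.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are kept as they are
    return flat

# Made-up record shaped loosely like one launch entry
sample = {
    "flight_number": 1,
    "rocket": "rocket-id",
    "links": {"patch": {"small": "small.png"}, "webcast": None},
    "payloads": ["payload-id"],
}
print(flatten(sample))
```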

In [None]:
df.head(2)

In [None]:
df.shape

- We notice that much of the data consists of IDs. For example, the rocket column has no information about the rocket, just an identification number.

- We will now use the API again to get information about the launches using the IDs given for each launch. Specifically, we will use the columns rocket, payloads, launchpad, and cores.

In [None]:
# Let's take a subset of our dataframe, keeping only the features we want plus flight_number and date_utc.
df = df[['rocket', 'payloads', 'launchpad', 'cores', 'flight_number', 'date_utc']]

# We will remove rows with multiple cores, because those are Falcon rockets with 2 extra rocket boosters, and rows that have multiple payloads in a single rocket.
df = df[df['cores'].map(len)==1]
df = df[df['payloads'].map(len)==1]

# Since payloads and cores are lists of size 1, we will also extract the single value in the list and replace the feature.
df['cores'] = df['cores'].map(lambda x : x[0])
df['payloads'] = df['payloads'].map(lambda x : x[0])

# We also want to convert date_utc to a datetime datatype and then extract the date, dropping the time
df['date'] = pd.to_datetime(df['date_utc']).dt.date

# Using the date, we will restrict the dates of the launches
df = df[df['date'] <= datetime.date(2020, 11, 13)]
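
The same filtering logic can be sketched on plain Python records, which makes each step easy to inspect. The records and the helper name `keep_launch` are hypothetical, and the timestamp format is an assumption about how `date_utc` strings look.

```python
import datetime

CUTOFF = datetime.date(2020, 11, 13)  # same cutoff date as above

def keep_launch(launch):
    """Keep single-core, single-payload launches on or before the cutoff."""
    if len(launch["cores"]) != 1 or len(launch["payloads"]) != 1:
        return False
    # Assumed format, e.g. "2020-11-13T00:00:00.000Z"
    stamp = datetime.datetime.strptime(launch["date_utc"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return stamp.date() <= CUTOFF

launches = [
    {"cores": [{"core": "c1"}], "payloads": ["p1"], "date_utc": "2010-06-04T18:45:00.000Z"},
    {"cores": [{"core": "c2"}, {"core": "c3"}], "payloads": ["p2"], "date_utc": "2018-02-06T20:45:00.000Z"},
    {"cores": [{"core": "c4"}], "payloads": ["p3"], "date_utc": "2021-01-08T02:15:00.000Z"},
]
kept = [launch for launch in launches if keep_launch(launch)]
print(len(kept))
```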

In [None]:
df.head(3)

* From the <code>rocket</code> we would like to learn the booster name

* From the <code>payload</code> we would like to learn the mass of the payload and the orbit that it is going to

* From the <code>launchpad</code> we would like to know the name of the launch site being used, the longitude, and the latitude.

* From <code>cores</code> we would like to learn the outcome of the landing, the type of the landing, the number of flights with that core, whether gridfins were used, whether the core is reused, whether legs were used, the landing pad used, the block of the core (a number used to separate versions of cores), the number of times this specific core has been reused, and the serial of the core.

**The data from these requests will be stored in lists and will be used to create a new dataframe.**

In [None]:
#Global variables

BoosterVersion = []
PayloadMass = []
Orbit = []
LaunchSite = []
Outcome = []
Flights = []
GridFins = []
Reused = []
Legs = []
LandingPad = []
Block = []
ReusedCount = []
Serial = []
Longitude = []
Latitude = []

**These functions append their outputs to the global variables above.** Let's take a look at the <code>BoosterVersion</code> variable. Before we apply <code>getBoosterVersion</code>, the list is empty:

In [None]:
BoosterVersion

**Applying the <code>getBoosterVersion</code> function to get the booster version**

In [None]:
# Call getBoosterVersion
getBoosterVersion(df)

In [None]:
# the list has now been updated
BoosterVersion[0:5]

**We can apply the rest of the  functions here:**

In [None]:
# Call getLaunchSite
getLaunchSite(df)

In [None]:
# Call getPayloadData
getPayloadData(df)

In [None]:
# Call getCoreData
getCoreData(df)

**Finally, let's construct our dataset using the data we have obtained. We'll combine the columns into a dictionary *launch_dict*.**

In [None]:
launch_dict = {'FlightNumber': list(df['flight_number']),
'Date': list(df['date']),
'BoosterVersion':BoosterVersion,
'PayloadMass':PayloadMass,
'Orbit':Orbit,
'LaunchSite':LaunchSite,
'Outcome':Outcome,
'Flights':Flights,
'GridFins':GridFins,
'Reused':Reused,
'Legs':Legs,
'LandingPad':LandingPad,
'Block':Block,
'ReusedCount':ReusedCount,
'Serial':Serial,
'Longitude': Longitude,
'Latitude': Latitude}
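
`pd.DataFrame` raises an error when the lists in such a dictionary have different lengths, which could happen here if one of the helper functions skipped a row. A quick check like the sketch below can catch this early. `column_lengths`, `mismatched_columns`, and the toy dictionary are hypothetical.

```python
def column_lengths(columns):
    """Return {column name: length} for a dict of lists."""
    return {name: len(values) for name, values in columns.items()}

def mismatched_columns(columns):
    """Return the columns whose length differs from the most common length."""
    lengths = column_lengths(columns)
    counts = {}
    for n in lengths.values():
        counts[n] = counts.get(n, 0) + 1
    expected = max(counts, key=counts.get)
    return {name: n for name, n in lengths.items() if n != expected}

# Toy data: PayloadMass is one entry short
toy = {"FlightNumber": [1, 2, 3], "Orbit": ["LEO", "GTO", "ISS"], "PayloadMass": [500.0, 3000.0]}
print(mismatched_columns(toy))
```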

**Creating a Pandas dataframe from the dictionary *launch_dict*.**

In [None]:
# Create a DataFrame from launch_dict
df = pd.DataFrame(launch_dict)

In [None]:
# Show a random sample of 3 rows from the dataframe
df.sample(3)

### Filtering the dataframe to only include *`Falcon 9`* launches

Finally, we will remove the Falcon 1 launches, keeping only the Falcon 9 launches. Filter the dataframe using the <code>BoosterVersion</code> column to keep only the Falcon 9 launches. Save the filtered data to a new dataframe called <code>df_falcon9</code>.

In [None]:
# Take an explicit copy so that later assignments do not write into a slice of df
df_falcon9 = df[df['BoosterVersion'] != 'Falcon 1'].copy()

- Now that we have removed some rows, we should reset the FlightNumber column.

In [None]:
df_falcon9.loc[:,'FlightNumber'] = list(range(1, df_falcon9.shape[0]+1))
df_falcon9

In [None]:
df_copy=df_falcon9.copy(deep=True)

## Data Wrangling

We can see below that some of the rows are missing values in our dataset.

In [None]:
df_copy.isnull().sum()

Before we can continue we must deal with these missing values. The <code>LandingPad</code> column will retain None values to represent when landing pads were not used.

**Dealing with Missing Values**

Calculate the mean of <code>PayloadMass</code> using <code>.mean()</code>. Then use <code>.fillna()</code> to replace the `np.nan` values in the column with that mean.

In [None]:
# Calculating the mean value of PayloadMass column
mean_payload_mass = df_copy['PayloadMass'].mean()

# Replace the np.nan values with the mean (assign back instead of using inplace on a column)
df_copy['PayloadMass'] = df_copy['PayloadMass'].fillna(mean_payload_mass)
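
The imputation step boils down to averaging the known values and substituting that average wherever a value is missing. A minimal standard-library sketch with made-up masses follows; the function name `fill_with_mean` is hypothetical.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to average; leave the gaps in place
    avg = mean(known)
    return [avg if v is None else v for v in values]

masses = [500.0, None, 3500.0, None, 2000.0]  # illustrative values only
print(fill_with_mean(masses))
```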

In [None]:
df_copy.isnull().sum()

- We can see that the number of missing values in <code>PayloadMass</code> has changed to zero.
- Now we should have no missing values in our dataset except in <code>LandingPad</code>.
- We can now export it to a <b>CSV</b> for the next section, but to make the answers consistent, the next lab will provide data in a pre-selected date range.

In [None]:
df_copy.to_csv('dataset_part_1.csv', index=False)

## Lab 2: Web scraping Falcon 9 and Falcon Heavy launch records from Wikipedia

In [None]:
!pip3 install beautifulsoup4
!pip3 install requests

In [None]:
import sys

import requests
from bs4 import BeautifulSoup
import re
import unicodedata

In [None]:
def date_time(table_cells):
    """
    Return the date and time from an HTML table cell.
    Input: a table data cell element
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    Return the booster version from an HTML table cell.
    Input: a table data cell element
    """
    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    Return the landing status from an HTML table cell.
    Input: a table data cell element
    """
    out=[i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    """
    Return the payload mass text up to and including 'kg', or 0 if the cell is empty.
    Input: a table data cell element
    """
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass


def extract_column_from_header(row):
    """
    Return the column name from an HTML table header cell.
    Input: a table header cell element
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()

    column_name = ' '.join(row.contents)

    # Filter the digit and empty names
    if not(column_name.strip().isdigit()):
        column_name = column_name.strip()
        return column_name
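
`get_mass` returns the mass as text ending in `kg`. If a numeric value is needed later, one possible follow-up step is sketched below. `mass_to_float` is a hypothetical helper and the sample strings are illustrative rather than taken from the page.

```python
import re

def mass_to_float(mass_text):
    """Extract the number in front of 'kg' and return it as a float, or None."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*kg", str(mass_text))
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

for text in ["4,700 kg", "525 kg (1,157 lb)", "Classified", 0]:
    print(repr(text), "->", mass_to_float(text))
```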



To keep the lab tasks consistent, you will be asked to scrape the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipedia page updated on `9th June 2021`.

In [None]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

In [None]:
response = requests.get(static_url)

In [None]:
if response.status_code == 200:
    # A status code of 200 means "OK" and indicates that the request was successful.
    # Create a BeautifulSoup object from the response content
    soup = BeautifulSoup(response.content, 'html.parser')
    # Now we can work with the 'soup' object to parse and manipulate the HTML content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

In [None]:
# Use soup.title attribute
print("Page Title:", soup.title.string)

In [None]:
html_tables = soup.find_all('table')  
first_launch_table = html_tables[2]

column_names = []
for th in first_launch_table.find_all("th"):
    name = extract_column_from_header(th)
    if name is not None and len(name) > 0:
        column_names.append(name)

launch_dict = dict.fromkeys(column_names) 
del launch_dict['Date and time ( )']

launch_dict['Flight No.'] = []
launch_dict['Launch site'] = [] 
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = [] 
launch_dict['Launch outcome'] = []
launch_dict['Version Booster'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []
launch_dict['Time'] = []

print(column_names)
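
One thing to keep in mind when initialising a dictionary of lists like this: `dict.fromkeys(keys, [])` would make every key share a single list object, so each key needs its own empty list. A small sketch of the difference, using made-up keys:

```python
keys = ["Flight No.", "Payload"]

# Pitfall: every key points at the same list object.
shared = dict.fromkeys(keys, [])
shared["Flight No."].append("1")
print(shared)

# Safe: give each key its own list.
separate = {key: [] for key in keys}
separate["Flight No."].append("1")
print(separate)
```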

In [None]:
extracted_row = 0
# Extract each launch table
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    # Get each table row
    for rows in table.find_all("tr"):
        # A launch row starts with a heading cell that holds the flight number
        flag = False
        if rows.th and rows.th.string:
            flight_number = rows.th.string.strip()
            flag = flight_number.isdigit()
        # Get the data cells of the row
        row = rows.find_all('td')
        # If the heading is a number, save the cells in the dictionary
        if flag:
            extracted_row += 1
            # Flight number
            launch_dict['Flight No.'].append(flight_number)

            datatimelist = date_time(row[0])

            # Date
            date = datatimelist[0].strip(',')
            launch_dict['Date'].append(date)

            # Time
            time = datatimelist[1]
            launch_dict['Time'].append(time)

            # Booster version (fall back to the link text if the helper returns nothing)
            bv = booster_version(row[1])
            if not(bv):
                bv = row[1].a.string
            launch_dict['Version Booster'].append(bv)

            # Launch site
            launch_site = row[2].a.string
            launch_dict['Launch site'].append(launch_site)

            # Payload
            payload = row[3].a.string
            launch_dict['Payload'].append(payload)

            # Payload mass
            payload_mass = get_mass(row[4])
            launch_dict['Payload mass'].append(payload_mass)

            # Orbit
            orbit = row[5].a.string
            launch_dict['Orbit'].append(orbit)

            # Customer
            customer = row[6].a.string
            launch_dict['Customer'].append(customer)

            # Launch outcome
            launch_outcome = list(row[7].strings)[0]
            launch_dict['Launch outcome'].append(launch_outcome)

            # Booster landing
            booster_landing = landing_status(row[8])
            launch_dict['Booster landing'].append(booster_landing)

In [None]:
df = pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items() })

In [None]:
df.to_csv('spacex_web_scraped.csv', index=False)

In [None]:
data=pd.read_csv("spacex_web_scraped.csv")
data.head()

## Exploring and Preparing Data

- We will explore the data to find patterns and determine the label for training supervised models.

- In the data set, there are several different cases where the booster did not land successfully. Sometimes a landing was attempted but failed due to an accident. For example, <code>True Ocean</code> means the booster successfully landed in a specific region of the ocean, while <code>False Ocean</code> means it failed to land in a specific region of the ocean.
- <code>True RTLS</code> means the booster successfully landed on a ground pad, while <code>False RTLS</code> means it failed to land on a ground pad. <code>True ASDS</code> means the booster successfully landed on a drone ship, while <code>False ASDS</code> means it failed to land on a drone ship.

- We will convert those outcomes into training labels, where **`1`** means the booster **successfully landed** and **`0`** means it was **unsuccessful**.

Load the SpaceX dataset from the last section.

In [None]:
cleaned_df=df_copy.copy(deep=True)
cleaned_df.sample(3)

> Identify and calculate the percentage of the missing values in each attribute

In [None]:
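# Divide by the total number of rows; count() would exclude the missing values themselves
cleaned_df.isnull().sum()/len(cleaned_df)*100

- Only the **LandingPad** column still has missing values. Note that the denominator must be the total row count: dividing by `count()`, which skips nulls, reports about **40%** and overstates the share.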
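
The choice of denominator matters when reporting missing-value percentages. The sketch below contrasts dividing by the total number of rows with dividing by the number of non-missing rows, using a made-up column. `missing_share` is a hypothetical helper.

```python
def missing_share(values):
    """Return (percent of all rows missing, missing as percent of non-missing rows)."""
    missing = sum(v is None for v in values)
    total = len(values)
    present = total - missing
    of_total = 100 * missing / total
    of_present = 100 * missing / present if present else float("inf")
    return of_total, of_present

pads = ["pad-a", None, "pad-b", None, "pad-a"]  # illustrative values only
print(missing_share(pads))
```

The first number is the usual "percentage of missing values"; the second is always at least as large as the first.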

In [None]:
cleaned_df.dtypes

In [None]:
cleaned_df.columns

- **Categorical** columns are: Date, BoosterVersion, Orbit, LaunchSite, Outcome, GridFins, Reused, Legs, LandingPad, Serial

- **Numerical** columns are: FlightNumber, PayloadMass, Flights, Block, ReusedCount, Longitude, Latitude

There are 10 categorical and 7 numerical columns.

In [None]:
# Converting Bool columns datatype into Integers datatype
cleaned_df[['GridFins','Reused','Legs']] = cleaned_df[['GridFins','Reused','Legs']].astype(int)

**Dealing with LandingPad Null values**

In [None]:
cleaned_df['LandingPad'].unique()

In [None]:
# Calculate the mode of the 'LandingPad' column
landing_pad_mode = cleaned_df['LandingPad'].mode().values[0]

# Replace NaN values with the mode
cleaned_df['LandingPad'] = cleaned_df['LandingPad'].fillna(landing_pad_mode)

In [None]:
cleaned_df['LandingPad'].value_counts()

### TASK 1: Calculate the number of launches on each site

The data contains several SpaceX launch facilities: <a href='https://en.wikipedia.org/wiki/List_of_Cape_Canaveral_and_Merritt_Island_launch_sites'>Cape Canaveral Space Launch Complex 40</a>, Vandenberg Air Force Base Space Launch Complex 4E (<b>VAFB SLC 4E</b>), and Kennedy Space Center Launch Complex 39A (<b>KSC LC 39A</b>). The location of each launch is stored in the column <code>LaunchSite</code>.

**Calculating the number of launches for each site.**

In [None]:
# Apply value_counts() on column LaunchSite
cleaned_df['LaunchSite'].value_counts()

**Calculating the number and occurrence of each orbit**

Each launch aims at a dedicated orbit. Here are some common orbit types:



* <b>LEO</b>: Low Earth orbit (LEO) is an Earth-centred orbit with an altitude of 2,000 km (1,200 mi) or less (approximately one-third of the radius of Earth), or with at least 11.25 periods per day (an orbital period of 128 minutes or less) and an eccentricity less than 0.25. Most of the manmade objects in outer space are in LEO <a href='https://en.wikipedia.org/wiki/Low_Earth_orbit'>[1]</a>.

* <b>VLEO</b>: Very Low Earth Orbits (VLEO) can be defined as the orbits with a mean altitude below 450 km. Operating in these orbits can provide a number of benefits to Earth observation spacecraft as the spacecraft operates closer to the observation<a href='https://www.researchgate.net/publication/271499606_Very_Low_Earth_Orbit_mission_concepts_for_Earth_Observation_Benefits_and_challenges'>[2]</a>.


* <b>GTO</b>: A geosynchronous transfer orbit (GTO) is the elliptical orbit used to reach geosynchronous orbit, a high Earth orbit that allows satellites to match Earth's rotation. Located at 22,236 miles (35,786 kilometers) above Earth's equator, geosynchronous orbit is a valuable spot for monitoring weather, communications and surveillance. "Because the satellite orbits at the same speed that the Earth is turning, the satellite seems to stay in place over a single longitude, though it may drift north to south," NASA wrote on its Earth Observatory website <a  href="https://www.space.com/29222-geosynchronous-orbit.html" >[3] </a>.


* <b>SSO (or SO)</b>: A Sun-synchronous orbit, also called a heliosynchronous orbit, is a nearly polar orbit around a planet in which the satellite passes over any given point of the planet's surface at the same local mean solar time <a href="https://en.wikipedia.org/wiki/Sun-synchronous_orbit">[4]</a>.
    
    
    
* <b>ES-L1</b>: At the Lagrange points, the gravitational forces of the two large bodies cancel out in such a way that a small object placed in orbit there is in equilibrium relative to the center of mass of the large bodies. L1 is one such point between the Sun and the Earth <a href="https://en.wikipedia.org/wiki/Lagrange_point#L1_point">[5]</a>.


* <b>HEO</b>: A highly elliptical orbit is an elliptic orbit with high eccentricity, usually referring to one around Earth <a href="https://en.wikipedia.org/wiki/Highly_elliptical_orbit">[6]</a>.


* <b>ISS</b>: A modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project between five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada) <a href="https://en.wikipedia.org/wiki/International_Space_Station">[7]</a>.


* <b>MEO</b>: Geocentric orbits ranging in altitude from 2,000 km (1,200 mi) to just below geosynchronous orbit at 35,786 kilometers (22,236 mi). Also known as an intermediate circular orbit. These are most commonly at 20,200 kilometers (12,600 mi) or 20,650 kilometers (12,830 mi), with an orbital period of 12 hours <a href="https://en.wikipedia.org/wiki/List_of_orbits">[8]</a>.


* <b>HEO</b> (high Earth orbit): Geocentric orbits above the altitude of geosynchronous orbit (35,786 km or 22,236 mi) <a href="https://en.wikipedia.org/wiki/List_of_orbits">[9]</a>.


* <b>GEO</b>: A circular geosynchronous orbit 35,786 kilometres (22,236 miles) above Earth's equator, following the direction of Earth's rotation <a href="https://en.wikipedia.org/wiki/Geostationary_orbit">[10]</a>.


* <b>PO</b>: A polar orbit is one in which a satellite passes above or nearly above both poles of the body being orbited (usually a planet such as the Earth) <a href="https://en.wikipedia.org/wiki/Polar_orbit">[11]</a>.

The number of launches for each orbit in the dataset is counted below:

In [None]:
# Apply value_counts on Orbit column
cleaned_df['Orbit'].value_counts()

**Calculating the number and occurrence of each landing outcome**

In [None]:
landing_outcomes = cleaned_df['Outcome'].value_counts()
landing_outcomes

- <code>True Ocean</code> means the booster successfully landed in a specific region of the ocean, while <code>False Ocean</code> means it failed to land in a specific region of the ocean.
- <code>True RTLS</code> means the booster successfully landed on a ground pad, while <code>False RTLS</code> means it failed to land on a ground pad.
- <code>True ASDS</code> means the booster successfully landed on a drone ship, while <code>False ASDS</code> means it failed to land on a drone ship.
- <code>None ASDS</code> and <code>None None</code> represent a failure to land.

In [None]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

**Creating a set of outcomes where the first stage did not land successfully:**

In [None]:
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes

**Creating a landing outcome label from Outcome column**

Using the <code>Outcome</code> column, create a list whose element is 0 if the corresponding outcome is in <code>bad_outcomes</code> and 1 otherwise.
Then assign it to the variable <code>landing_class</code>:

In [None]:
# landing_class = 0 if bad_outcome
# landing_class = 1 otherwise

landing_class=[]
for outcome in cleaned_df['Outcome']:
    if outcome in bad_outcomes:
        landing_class.append(0)
    else:
        landing_class.append(1)

In [None]:
# Assign the landing_class list to a new column named Class
cleaned_df['Class'] = landing_class

This variable is the classification label for the outcome of each launch. A value of zero means the first stage did not land successfully; one means the first stage landed successfully.
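
Selecting bad outcomes by position depends on the order in which `value_counts()` happens to list them. An alternative sketch, assuming every successful outcome string starts with `True` as described above, derives the label from the string itself. `outcome_to_class` is a hypothetical helper and the outcome list is a toy sample.

```python
def outcome_to_class(outcome):
    """Return 1 for a successful landing ('True ...'), otherwise 0."""
    return 1 if str(outcome).startswith("True") else 0

outcomes = ["True ASDS", "None None", "False Ocean", "True RTLS", "None ASDS"]
labels = [outcome_to_class(o) for o in outcomes]
print(labels)
print(sum(labels) / len(labels))  # success rate of this toy sample
```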

In [None]:
cleaned_df[['Class']].head(5)

In [None]:
cleaned_df.head(3)

In [None]:
# We can use the following line of code to determine  the success rate:
cleaned_df["Class"].mean()

We can now export it to a CSV for the next section, but to make the answers consistent, the next lab will provide data in a pre-selected date range.

### Final step of Wrangling: Store data

In [None]:
# Store the file (reset_index returns a new dataframe, so assign it back)
cleaned_df = cleaned_df.reset_index(drop=True)
cleaned_df.to_csv("dataset_part_2.csv", index=False)

## SQL Notebook for Peer Assignment

In [None]:
%load_ext sql

In [None]:
import csv, sqlite3

con = sqlite3.connect("my_data1.db")
cur = con.cursor()

In [None]:
!pip install -q pandas==1.1.5

In [None]:
%sql sqlite:///my_data1.db

In [None]:
cleaned_df.to_sql("SPACEXTBL", con, if_exists='replace', index=False,method="multi")

In [None]:
%sql create table SPACEXTABLE as select * from SPACEXTBL where Date is not null

**The key points:**

- Start the cell with `%%sql` to tell Jupyter this is a SQL cell
- Then put your SQL query on the next lines
- The output of the query will be displayed automatically below the cell when you run it
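
The same kind of query can also be run without the `%%sql` magic, through Python's built-in `sqlite3` module. The sketch below uses an in-memory database and a made-up two-column table (`DEMO` is a hypothetical table name), so it does not touch `my_data1.db`.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
con_demo.execute("CREATE TABLE DEMO (Launch_Site TEXT, Payload_Mass REAL)")
con_demo.executemany(
    "INSERT INTO DEMO VALUES (?, ?)",
    [("Site A", 500.0), ("Site B", 3000.0), ("Site A", 2500.0)],
)

# Equivalent of a %%sql cell: execute the query and fetch the rows.
count = con_demo.execute("SELECT COUNT(*) FROM DEMO").fetchone()[0]
sites = [row[0] for row in con_demo.execute(
    "SELECT DISTINCT Launch_Site FROM DEMO ORDER BY Launch_Site")]
print(count, sites)
con_demo.close()
```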

In [None]:
%%sql
SELECT COUNT(*) FROM SPACEXTABLE;

**1.Display the names of the unique launch sites  in the space mission**

In [None]:
%%sql
SELECT DISTINCT "Launch_Site" 
FROM SPACEXTABLE;

**2.Display 5 records where launch sites begin with the string 'CCA'**

In [None]:
%%sql 
SELECT *
FROM SPACEXTABLE
WHERE "Launch_Site" LIKE 'CCA%'
LIMIT 5;

**3.Display the total payload mass carried by boosters launched by NASA (CRS)**

In [None]:
%%sql 
SELECT SUM("PAYLOAD_MASS__KG_") AS "Total Payload Mass"
FROM SPACEXTABLE
WHERE "Customer" LIKE '%NASA (CRS)%';

**4. Display the average payload mass carried by booster version F9 v1.1**

In [None]:
%%sql
SELECT AVG("PAYLOAD_MASS__KG_") AS "Average Payload Mass"
FROM SPACEXTABLE  
WHERE "Booster_Version" LIKE 'F9 v1.1';

**5. List the date when the first successful landing outcome on a ground pad was achieved.**

In [None]:
%%sql
SELECT MIN("Date") AS "First Ground Pad Success"
FROM SPACEXTABLE
WHERE "Landing_Outcome" LIKE '%Success (ground pad)%';

**6. List the names of the boosters which have success in drone ship and have payload mass greater than 4000 but less than 6000**

In [None]:
%%sql
SELECT DISTINCT "Booster_Version"
FROM SPACEXTABLE
WHERE "Landing_Outcome" LIKE '%Success (drone ship)%'
AND "PAYLOAD_MASS__KG_" > 4000 AND "PAYLOAD_MASS__KG_" < 6000;

**7. List the total number of successful and failure mission outcomes**

In [None]:
%%sql
SELECT 
 SUM(CASE WHEN "Landing_Outcome" LIKE '%Success%' THEN 1 ELSE 0 END) AS Successes,
 SUM(CASE WHEN "Landing_Outcome" LIKE '%Failure%' THEN 1 ELSE 0 END) AS Failures
FROM SPACEXTABLE;

**8. List the names of the booster versions which have carried the maximum payload mass. Use a subquery**

In [None]:
%%sql
SELECT DISTINCT "Booster_Version"
FROM SPACEXTABLE
WHERE "PAYLOAD_MASS__KG_" = (SELECT MAX("PAYLOAD_MASS__KG_") FROM SPACEXTABLE);

**9. List the month, failed drone-ship landing outcomes, booster versions, and launch sites for the launches in 2015.**

`Note:` SQLite does not support month names, so use `substr(Date, 6, 2)` to get the month and `substr(Date, 0, 5) = '2015'` to filter by year.
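
The `substr` approach from the note can be checked outside the notebook. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `launches` table and a few illustrative rows (both are assumptions, not the notebook's data) to show how the month and year are pulled out of an ISO-formatted date string.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE launches ("Date" TEXT, "Landing_Outcome" TEXT)')
con.executemany(
    "INSERT INTO launches VALUES (?, ?)",
    [
        ("2015-01-10", "Failure (drone ship)"),
        ("2015-04-14", "Failure (drone ship)"),
        ("2015-12-22", "Success (ground pad)"),
        ("2016-04-08", "Success (drone ship)"),
    ],
)

# substr("Date", 6, 2) picks the two month digits;
# substr("Date", 0, 5) yields the four year digits.
query = """
SELECT substr("Date", 6, 2) AS month, "Landing_Outcome"
FROM launches
WHERE substr("Date", 0, 5) = '2015'
  AND "Landing_Outcome" LIKE 'Failure (drone ship)%'
ORDER BY month
"""
rows = con.execute(query).fetchall()
print(rows)
con.close()
```

Only the two 2015 drone-ship failures survive the filter, each tagged with its two-digit month.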

In [None]:
%%sql
SELECT substr("Date", 6, 2) AS Month,
       "Landing_Outcome",
       "Booster_Version",
       "Launch_Site"
FROM SPACEXTABLE
WHERE substr("Date", 0, 5) = '2015'
  AND "Landing_Outcome" LIKE '%Failure (drone ship)%';

**10. Rank the count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between the dates 2010-06-04 and 2017-03-20, in descending order.**

In [None]:
%%sql
SELECT * FROM
  (SELECT "Landing_Outcome", COUNT(*) AS landing_outcome_count
   FROM SPACEXTABLE
   WHERE DATE BETWEEN '2010-06-04' AND '2017-03-20'
   GROUP BY "Landing_Outcome")
ORDER BY landing_outcome_count DESC;

## Day 3 : Exploratory Analysis using Pandas and Matplotlib

In [None]:
eda_df=cleaned_df.copy(deep=True)

**Visualizing relationship between payload and Flight Number**

In [None]:
sns.catplot(y="PayloadMass", x="FlightNumber", hue="Class", data=eda_df, aspect=5, s=90)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Payload Mass (kg)",fontsize=20)
plt.show()

We see that different launch sites have different success rates. <code>CCAFS LC-40</code> has a success rate of 60%, while <code>KSC LC-39A</code> and <code>VAFB SLC 4E</code> have a success rate of 77%.

Next, let's drill down to each site and visualize its detailed launch records.

**Visualizing the relationship between Flight Number and Launch Site**

In [None]:
# Group the data by launch site and count the number of flights for each site
launch_site_counts = eda_df['LaunchSite'].value_counts().reset_index()
launch_site_counts.columns = ['LaunchSite', 'FlightNumber']

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='LaunchSite', y='FlightNumber', data=launch_site_counts, palette="viridis")
plt.title('Flight Count by Launch Site', fontsize=16)
plt.xlabel('Launch Site', fontsize=14)
plt.ylabel('Flight Count', fontsize=14)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

plt.show()

Use the function <code>catplot</code> to plot <code>FlightNumber</code> vs <code>LaunchSite</code>: set the parameter <code>x</code> to <code>FlightNumber</code>, set <code>y</code> to <code>LaunchSite</code>, and set the parameter <code>hue</code> to <code>'Class'</code>.

In [None]:
# Plotting a scatter point chart with x axis to be Flight Number and y axis to be the launch site, and hue to be the class value

# Use the catplot function to create the scatter plot
sns.catplot(x="FlightNumber", y="LaunchSite", hue="Class", data=eda_df, aspect=2, kind="strip",s=50)

# Set the title and adjust the axis labels
plt.title('Flight Number vs Launch Site')
plt.xlabel('Flight Number')
plt.ylabel('Launch Site')

# Show the plot
plt.show()


- Landing success increased as the flight number increased.

Now try to explain the patterns you found in the Flight Number vs. Launch Site scatter point plots.

In [None]:
### TASK 2: Visualize the relationship between Payload and Launch Site

plt.figure(figsize=(10, 6))
sns.barplot(data=eda_df, x='LaunchSite', y='PayloadMass', estimator=np.mean, ci=None, palette="viridis")
plt.title('Average Payload Mass by Launch Site', fontsize=16)
plt.xlabel('Launch Site', fontsize=14)
plt.ylabel('Average Payload Mass (kg)', fontsize=14)

plt.show()

We also want to observe if there is any relationship between launch sites and their payload mass.


In [None]:
# Plot a scatter point chart with x axis to be Pay Load Mass (kg) and y axis to be the launch site, and hue to be the class value

# Create a scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PayloadMass', y='LaunchSite', hue='Class', data=eda_df, palette='viridis', s=60)
plt.title('Payload Mass vs. Launch Site', fontsize=16)
plt.xlabel('Payload Mass (kg)', fontsize=14)
plt.ylabel('Launch Site', fontsize=14)
plt.legend(title='Class', title_fontsize='12', loc='upper right')

plt.show()

- In the Payload vs. Launch Site scatter chart, the VAFB SLC 4E launch site has no launches with a heavy payload mass (greater than 10000 kg).
- Most of the launches were carried out at CCAFS SLC-40, followed by KSC LC-39A; the fewest launches were carried out at VAFB SLC 4E.

In [None]:
### TASK 3: Visualize the success rate of each orbit type
df_success=eda_df.groupby('Orbit')['Class'].mean()*100
df_success.plot(kind='bar', figsize=(10,6))
plt.xlabel('Orbit')
plt.ylabel('Success Rate')
plt.title('Relationship between Success Rate and Orbit')

Next, we want to visually check whether there is any relationship between success rate and orbit type.

Let's create a `bar chart` of the success rate for each orbit.

In [None]:
# HINT use groupby method on Orbit column and get the mean of Class column

# Group the data by 'Orbit' and calculate the mean of 'Class'
orbit_success_rate = eda_df.groupby('Orbit')['Class'].mean().reset_index()

# Create a bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(orbit_success_rate['Orbit'], orbit_success_rate['Class'], color='royalblue')
plt.title('Success Rate by Orbit', fontsize=16)
plt.xlabel('Orbit', fontsize=14)
plt.ylabel('Success Rate', fontsize=14)
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability

# Annotate each bar with its success rate (the mean of Class), not a launch count
for i, bar in enumerate(bars):
    height = bar.get_height()
    rate = orbit_success_rate.loc[i, 'Class']
    plt.annotate(f'{rate:.2f}', xy=(bar.get_x() + bar.get_width() / 2, height), xytext=(0, 3),
                 textcoords='offset points', ha='center', va='bottom')
plt.show()

- Orbits ES-L1, GEO, HEO and SSO have the highest success rate (100%).
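
To make explicit what the grouped mean computes, here is a small plain-Python sketch. The `records` list is made up for illustration and stands in for the `Orbit` and `Class` columns. It also tracks the number of launches per orbit, since a 100% rate can rest on very few launches.

```python
from collections import defaultdict

# Made-up (orbit, class) pairs standing in for the Orbit and Class columns.
records = [("LEO", 1), ("LEO", 0), ("GTO", 1), ("GTO", 0), ("GTO", 0), ("SSO", 1)]

totals = defaultdict(int)  # launches per orbit
landed = defaultdict(int)  # successful landings per orbit
for orbit, cls in records:
    totals[orbit] += 1
    landed[orbit] += cls

# The mean of a 0/1 column within a group is that group's success rate.
success_rate = {orbit: landed[orbit] / totals[orbit] for orbit in totals}
print(success_rate, dict(totals))
```

In this toy data the `SSO` rate of 1.0 comes from a single launch, so it is worth reading the rates alongside the counts.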

For each orbit, we want to see if there is any relationship between FlightNumber and Orbit type.

In [None]:
# Task 4: Plotting a scatter point chart with x axis to be FlightNumber and y axis to be the Orbit, and hue to be the class value

sns.catplot(y='Orbit', x='FlightNumber', hue='Class', data=eda_df, aspect=5, s=80)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

- In the LEO orbit, success appears to be related to the number of flights.
- On the other hand, there seems to be **no** relationship between flight number and success in the GTO orbit.

**Visualize the relationship between Payload and Orbit type**

Similarly, we can plot the Payload vs. Orbit scatter point charts to reveal the relationship between Payload and Orbit type


In [None]:
# Plot a scatter point chart with x axis to be Payload and y axis to be the Orbit, and hue to be the class value
sns.catplot(y='Orbit', x='PayloadMass', hue='Class', data=eda_df, aspect=5,s=70)
plt.xlabel("Payload Mass (kg)",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

In [None]:
# Create a scatter plot with PayloadMass on the x-axis, Orbit on the y-axis, and Class as hue

sns.set_style("darkgrid")
title_font = {'size': 16, 'color': 'darkblue'}
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PayloadMass', y='Orbit', hue='Class', data=eda_df)
plt.xlabel("Payload Mass (kg)", fontdict=title_font)
plt.ylabel("Orbit", fontdict=title_font)

plt.show()

- With heavy payloads, the successful landing rate is higher for Polar, LEO and ISS orbits.
- For GTO, however, we cannot distinguish this well, as both successful and unsuccessful landings are present.
- Overall, heavy payloads appear to have a negative influence on GTO orbits and a positive influence on Polar, LEO and ISS orbits.

#### TASK  6: Visualizing the launch success yearly trend

We can plot a line chart with x axis to be <code>Year</code> and y axis to be average success rate, to get the average launch success trend.

In [None]:
eda_df['Date']=eda_df['Date'].astype(str)

In [None]:
# A function to extract years from the date
def Extract_year(eda_df):
    # Build the list locally so re-running the cell does not append duplicates
    year = []
    for i in eda_df["Date"]:
        year.append(i.split("-")[0])
    return year


In [None]:
# Plotting a line chart with x axis to be the extracted year and y axis to be the success rate

eda_df['Year'] = Extract_year(eda_df)

fig,ax=plt.subplots()
df_success1=eda_df.groupby('Year')['Class'].mean()*100
df_success1.plot(kind='line', figsize=(10,6))
plt.xlabel('Year')
plt.ylabel('Success Rate')
plt.title('Relationship between Success Rate and Year')

plt.show()

- We can observe that the success rate kept increasing from 2013 to 2020.

In [None]:
df_success1.head()

In [None]:
fe_df=eda_df.copy(deep=True)

## Features Engineering

Having gained some preliminary insight into how each important variable affects the success rate, we now select the features that will be used for success prediction in a later module.

In [None]:
features = fe_df[['FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights',
               'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block',
               'ReusedCount', 'Serial']]
features.head()

**Creating dummy variables to categorical columns**

Use the function <code>get_dummies</code> on the <code>features</code> dataframe to one-hot encode the columns <code>Orbit</code>, <code>LaunchSite</code>, <code>LandingPad</code>, and <code>Serial</code>. Assign the result to the variable <code>train_encoded</code> and display it using the method <code>head</code>. The resulting dataframe must include all features, including the encoded ones.
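
To illustrate what one-hot encoding produces, here is a minimal plain-Python sketch on made-up orbit values (the list `orbits` is an assumption for illustration). Each distinct category becomes its own 0/1 column named `<prefix>_<value>`, which mirrors the naming `get_dummies` uses by default.

```python
# Made-up values standing in for the Orbit column.
orbits = ["LEO", "GTO", "LEO", "ISS"]

# One output column per distinct category, in sorted order.
categories = sorted(set(orbits))
one_hot = [
    {f"Orbit_{cat}": int(value == cat) for cat in categories}
    for value in orbits
]
for row in one_hot:
    print(row)
```

Exactly one column is set to 1 in every row, so no ordering between categories is implied.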


In [None]:
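# Use get_dummies() function on the categorical columns
# Naming the columns explicitly keeps the prefix list aligned with what is encoded
train_encoded = pd.get_dummies(features,
                    columns=['Orbit','LaunchSite','LandingPad','Serial'],
                    prefix=['Orbits','LaunchSite','LandingPad','Serial'])
train_encoded.head(3)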

In [None]:
train_encoded.shape

**Casting all numeric columns to `float64`**

Now that our <code>train_encoded</code> dataframe only contains numbers, cast the entire dataframe to type <code>float64</code>.

In [None]:
# Using astype function (it returns a new DataFrame, so assign the result)
train_encoded = train_encoded.astype('float64')

### Launch Sites Locations Analysis with Folium

## Objectives
We'll proceed with the following tasks:
- **Task 1:** Mark all launch sites on a map
- **Task 2:** Mark the successful/failed launches for each site on the map
- **Task 3:** Calculate the distances between a launch site and its proximities

In [None]:
!pip3 install wget

In [None]:
!pip3 install folium

In [None]:
import folium
import wget

In [None]:
# Import folium MarkerCluster plugin
from folium.plugins import MarkerCluster

# Import folium MousePosition plugin
from folium.plugins import MousePosition

# Import folium DivIcon plugin
from folium.features import DivIcon

**Marking all launch sites on a map**

First, let's try to add each site's location on a map using site's latitude and longitude coordinates

The following dataset with the name `spacex_launch_geo.csv` is an augmented dataset with latitude and longitude added for each site.

In [None]:
spacex_csv_file = wget.download('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/spacex_launch_geo.csv')
spacex_df=pd.read_csv(spacex_csv_file)

**Take a look at the coordinates of each site.**

In [None]:
# Select relevant sub-columns: `Launch Site`, `Lat(Latitude)`, `Long(Longitude)`, `class`
spacex_df = spacex_df[['Launch Site', 'Lat', 'Long', 'class']]
launch_sites_df = spacex_df.groupby(['Launch Site'], as_index=False).first()
launch_sites_df = launch_sites_df[['Launch Site', 'Lat', 'Long']]
launch_sites_df

- The coordinates above are just plain numbers that do not give any intuitive insight into where the launch sites are. Let's visualize those locations by pinning them on a map.

In [None]:
spacex_df['Launch Site'].value_counts()

**Creating a folium `Map` object, with an initial center location to be NASA Johnson Space Center at Houston, Texas.**

In [None]:
nasa_coordinate = [29.559684888503615, -95.0830971930759]
site_map = folium.Map(location=nasa_coordinate, zoom_start=10)

**Using `folium.Circle` to add a highlighted circle area with a text label on NASA Johnson Space Center.**

In [None]:
# Create an orange circle at NASA Johnson Space Center's coordinate with a popup label showing its name
circle = folium.Circle(nasa_coordinate, radius=1000, color='#d35400',
                       fill=True).add_child(folium.Popup('NASA Johnson Space Center'))

# Create a marker at the same coordinate whose icon is a text label showing its name
marker = folium.map.Marker(
    nasa_coordinate,
    # Create an icon as a text label
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % 'NASA JSC',
        )
    )
site_map.add_child(circle)
site_map.add_child(marker)

**Adding a circle for each launch site in data frame launch_sites**

In [None]:
# Initialize the map
site_map = folium.Map(location=nasa_coordinate, zoom_start=4.4)

# For each launch site, add a Circle object based on its coordinate (Lat, Long) values. In addition, add Launch site name as a popup label
list1=[[28.562302,-80.577356],[28.563197,-80.576820],[28.573255,-80.646895],[34.632834,-120.610745]]
list2=['CCAFS LC-40','CCAFS SLC-40','KSC LC-39A','VAFB SLC-4E']

**Circle1**

In [None]:
circle1 = folium.Circle(list1[0], radius=100, color='#d35400', fill=True).add_child(folium.Popup(list2[0]))
# Create a marker at the launch site's coordinate whose icon is a text label showing the site name
marker1 = folium.map.Marker(
    list1[0],
    # Create an icon as a text label
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % list2[0],
              )
    )
site_map.add_child(circle1)
site_map.add_child(marker1)

**Circle2**

In [None]:
circle2 = folium.Circle(list1[1], radius=100, color='#d35400', fill=True).add_child(folium.Popup(list2[1]))
# Create a marker at the launch site's coordinate whose icon is a text label showing the site name
marker2 = folium.map.Marker(
    list1[1],
    # Create an icon as a text label
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % list2[1],
        )
     )
site_map.add_child(circle2)
site_map.add_child(marker2)

**Circle3**

In [None]:
circle3 = folium.Circle(list1[2], radius=100, color='#d35400', fill=True).add_child(folium.Popup(list2[2]))
# Create a marker at the launch site's coordinate whose icon is a text label showing the site name
marker3 = folium.map.Marker(
    list1[2],
    # Create an icon as a text label
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % list2[2],
        )
    )
site_map.add_child(circle3)
site_map.add_child(marker3)

**Circle4**

In [None]:
circle4 = folium.Circle(list1[3], radius=100, color='#d35400', fill=True).add_child(folium.Popup(list2[3]))
# Create a marker at the launch site's coordinate whose icon is a text label showing the site name
marker4 = folium.map.Marker(
    list1[3],
    # Create an icon as a text label
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % list2[3],
        )
    )
site_map.add_child(circle4)
site_map.add_child(marker4)

After exploring the map by zooming in and out of the marked areas, we find the following:

**1)** Not all launch sites are in proximity to the Equator line.

**2)** All launch sites are in very close proximity to the coast.
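
The first finding can be checked numerically. The sketch below estimates each site's distance to the Equator along its meridian, using the site coordinates listed above and an approximate Earth radius of 6373 km; the constant name `EARTH_RADIUS_KM` and the dictionary layout are assumptions made for this illustration.

```python
from math import radians

EARTH_RADIUS_KM = 6373.0  # approximate radius of the Earth in km

# Site coordinates as listed above (latitude, longitude).
sites = {
    "CCAFS LC-40": (28.562302, -80.577356),
    "CCAFS SLC-40": (28.563197, -80.576820),
    "KSC LC-39A": (28.573255, -80.646895),
    "VAFB SLC-4E": (34.632834, -120.610745),
}

# Along a meridian, distance to the Equator is radius * |latitude in radians|.
equator_distance_km = {
    name: EARTH_RADIUS_KM * abs(radians(lat)) for name, (lat, _lon) in sites.items()
}
for name, km in equator_distance_km.items():
    print(f"{name}: {km:.0f} km from the Equator")
```

Every site comes out at more than 3000 km from the Equator, with the Florida sites closer than the California one.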

**Marking the success/failed launches for each site on the map**

- Let's try to enhance the map by adding the launch outcomes for each site, and see which sites have high success rates.
- Recall that the data frame `spacex_df` has detailed launch records, and the `class` column indicates whether each launch was successful or not.

In [None]:
spacex_df.tail(3)

**Next,** let's create markers for all launch records.
If a launch was **successful `(class=1)`**, we use a **green marker**; if a launch **failed `(class=0)`**, we use a **red marker**.

In [None]:
marker_cluster = MarkerCluster()

**Creating a new column in the `spacex_df` dataframe called `marker_color` to store the marker colors based on the `class` value**

In [None]:
# Function to assign color to launch outcome
def assign_marker_color(launch_outcome):
    if launch_outcome == 1:
        return 'green'
    else:
        return 'red'
    
spacex_df['marker_color'] = spacex_df['class'].apply(assign_marker_color)
spacex_df.tail(10)
launch_sites_df = spacex_df[['Launch Site', 'Lat', 'Long','marker_color']]
launch_sites_df

**For each launch result in `spacex_df` data frame, adding a `folium.Marker` to `marker_cluster`**

In [None]:
# Add marker_cluster to current site_map
marker_cluster=folium.plugins.MarkerCluster()
site_map.add_child(marker_cluster)
# for each row in spacex_df data frame
# create a Marker object with its coordinate
# and customize the Marker's icon property to indicate whether this launch succeeded or failed,
# e.g., icon=folium.Icon(color='white', icon_color=row['marker_color'])

for index, record in spacex_df.iterrows():
    launchsite=record['Launch Site']
    # Create a Marker for this launch and add it to the marker cluster
    marker = folium.Marker([record['Lat'], record['Long']], 
                  icon=folium.Icon(color='white', icon_color=record['marker_color'],html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % launchsite,))
    marker_cluster.add_child(marker)
site_map

**Calculating the distances between a launch site and its proximities**

Adding a MousePosition on the map to get coordinate for a mouse over a point on the map.

In [None]:
# Add Mouse Position to get the coordinate (Lat, Long) for a mouse over on the map
formatter = "function(num) {return L.Util.formatNum(num, 5);};"
mouse_position = MousePosition(
    position='topright',
    separator=' Long: ',
    empty_string='NaN',
    lng_first=False,
    num_digits=20,
    prefix='Lat:',
    lat_formatter=formatter,
    lng_formatter=formatter,
)

site_map.add_child(mouse_position)
site_map

We can calculate the distance between two points on the map based on their Lat and Long values using the following method:

In [None]:
from math import sin, cos, sqrt, atan2, radians

def calculate_distance(lat1, lon1, lat2, lon2):
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    return distance

In [None]:
import math

# Alternative implementation using the spherical law of cosines.
# Note: this redefines calculate_distance from the previous cell.
def calculate_distance(lat1, lon1, lat2, lon2):
  # approximate radius of earth in km
  R = 6373.0

  # Convert latitude and longitude to 
  # spherical coordinates in radians.
  degrees_to_radians = math.pi/180.0
        
  # phi = 90 - latitude
  phi1 = (90.0 - lat1)*degrees_to_radians
  phi2 = (90.0 - lat2)*degrees_to_radians
        
  # theta = longitude
  theta1 = lon1*degrees_to_radians
  theta2 = lon2*degrees_to_radians
        
  # Compute spherical distance from spherical coordinates.
        
  # For two locations in spherical coordinates 
  # (1, theta, phi) and (1, theta', phi')
  # cosine( arc length ) = 
  #    sin phi sin phi' cos(theta-theta') + cos phi cos phi'
  # distance = rho * arc length
    
  cos_arc = (math.sin(phi1)*math.sin(phi2)*math.cos(theta1 - theta2) + 
             math.cos(phi1)*math.cos(phi2))
  # Clamp to [-1, 1] so rounding error cannot push acos out of its domain
  arc = math.acos(max(-1.0, min(1.0, cos_arc)))
 
  # Multiply the arc (in radians) by the radius of the earth
  # so the result is a distance in km rather than an angle.
  return R * arc

In [None]:
distance = calculate_distance(28.57468,-80.65229,28.573255 ,-80.646895)
distance
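
As a sanity check on the two formulas above, the following sketch implements both the haversine formula and the spherical law of cosines and compares them on the same pair of points. The function names `haversine_km` and `law_of_cosines_km` are hypothetical, chosen for this illustration; both assume an Earth radius of 6373 km.

```python
from math import acos, atan2, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6373.0  # approximate radius of the Earth in km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))

def law_of_cosines_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    cos_arc = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon2 - lon1)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return EARTH_RADIUS_KM * acos(max(-1.0, min(1.0, cos_arc)))

d_hav = haversine_km(28.57468, -80.65229, 28.573255, -80.646895)
d_cos = law_of_cosines_km(28.57468, -80.65229, 28.573255, -80.646895)
print(f"haversine: {d_hav:.3f} km, law of cosines: {d_cos:.3f} km")
```

Both give a distance of roughly half a kilometre for these two points. The haversine form is generally the better-conditioned choice for points that are very close together, which is the typical case here.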

In [None]:
# Creating and adding a folium.Marker on selected closest railway point on the map
# Displaying the distance between railway point and launch site using the icon property 
coordinate = [28.57468,-80.65229]
distance_marker = folium.Marker(
    coordinate,
    icon=DivIcon(
        icon_size=(20,20),
        icon_anchor=(0,0),
        html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(distance),
        )
    )
site_map.add_child(distance_marker)
site_map

**Drawing a PolyLine from a launch site to the selected railway point**

In [None]:
# Creating a `folium.PolyLine` object using the railway point coordinate and launch site coordinate
coordinates=[[28.57468,-80.65229],[28.573255 ,-80.646895]]
lines=folium.PolyLine(locations=coordinates, weight=1)
site_map.add_child(lines)

**Similarly, drawing a line from a launch site to its closest city, coastline, highway, etc.**

In [None]:
# Creating a marker with distance to a closest city, coastline, highway, etc.
# Drawing a line between the marker to the launch site
coordinates=[[28.57468,-80.65229],[28.57322 ,-80.60703],[28.5248,-80.6446],[28.53386,-81.38535]]
coordinate=[28.573255 ,-80.646895]
for x in coordinates:
    lines=folium.PolyLine(locations=[x,coordinate], weight=1)
    site_map.add_child(lines)

    distance_marker = folium.Marker(
        x,
        icon=DivIcon(
            icon_size=(20,20),
            icon_anchor=(0,0),
            html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(calculate_distance(x[0],x[1],coordinate[0] ,coordinate[1])),
        )
    )
    site_map.add_child(distance_marker)
site_map

In [None]:
coordinates=[[28.57367, -80.58472],[28.5248,-80.64],[28.563197, -80.56772],[28.56,-81.38535]]
coordinate=[28.562302,-80.577356]
for x in coordinates:
    lines=folium.PolyLine(locations=[x,coordinate], weight=1)
    site_map.add_child(lines)

    distance_marker = folium.Marker(
        x,
        icon=DivIcon(
            icon_size=(20,20),
            icon_anchor=(0,0),
            html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(calculate_distance(x[0],x[1],coordinate[0] ,coordinate[1])),
        )
    )
    site_map.add_child(distance_marker)
site_map

In [None]:
coordinates=[[34.63141, -120.62568],[34.66992, -120.45753],[34.6336, -120.62606],[34.63658, -120.4542]]
coordinate=[34.632834, -120.610746]
for x in coordinates:
    lines=folium.PolyLine(locations=[x,coordinate], weight=1)
    site_map.add_child(lines)

    distance_marker = folium.Marker(
        x,
        icon=DivIcon(
            icon_size=(20,20),
            icon_anchor=(0,0),
            html='<div style="font-size: 12; color:#d35400;"><b>%s</b></div>' % "{:10.2f} KM".format(calculate_distance(x[0],x[1],coordinate[0] ,coordinate[1])),
        )
    )
    site_map.add_child(distance_marker)
site_map

**Observations**
- Launch Sites are in close proximity to coast.
- Launch sites are also close to major highways and railways for logistics purposes.
- Launch sites are far from dense human habitats like cities.

In [None]:
combined_df = pd.concat([train_encoded, cleaned_df['Class']], axis=1)
combined_df.shape

In [None]:
train_df=combined_df.copy(deep=True)

## Machine Learning Prediction

- Standardize the data
- Split into training data and test data
- Find best Hyperparameters for SVM, Decision Tree, KNN and Logistic Regression.
- Find the method that performs best on the test data among all classification models.

In [None]:
from sklearn import preprocessing

# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split

# Allows us to test parameters of classification algorithms and find the best one
from sklearn.model_selection import GridSearchCV

# Logistic Regression classification algorithm
from sklearn.linear_model import LogisticRegression

# Support Vector Machine classification algorithm
from sklearn.svm import SVC

# Decision Tree classification algorithm
from sklearn.tree import DecisionTreeClassifier

# K Nearest Neighbors classification algorithm
from sklearn.neighbors import KNeighborsClassifier

# Random Forest Classifier algorithm
from sklearn.ensemble import RandomForestClassifier

# Extreme Gradient Boosting Classification algorithm
from xgboost import XGBClassifier

# Metrics
from sklearn.metrics import (accuracy_score, f1_score, average_precision_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

**Defining function to plot confusion matrix**

In [None]:
def plot_confusion_matrix(y,y_predict):
    "this function plots the confusion matrix"
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax= plt.subplot()
    sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix'); 
    ax.xaxis.set_ticklabels(['did not land', 'landed'])
    ax.yaxis.set_ticklabels(['did not land', 'landed'])

**Predictor Variables**

In [None]:
X=train_encoded
X.head(5)

**Creating a NumPy array of Target Variable from the column Class in df**

In [None]:
y=combined_df['Class'].to_numpy()
y

### Feature Scaling

In [None]:
transform = preprocessing.StandardScaler()

In [None]:
X=transform.fit_transform(X)
X
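
For intuition, standardization replaces each value with its z-score: subtract the column mean and divide by the column's population standard deviation. The sketch below does this by hand for one made-up column (the list `payload_mass` is an assumption for illustration).

```python
from statistics import fmean, pstdev

# Made-up payload masses (kg) standing in for one feature column.
payload_mass = [500.0, 3000.0, 5500.0, 9600.0]

mean = fmean(payload_mass)
std = pstdev(payload_mass)  # population standard deviation (divides by n)

# z-score: subtract the column mean, divide by the column standard deviation.
scaled = [(value - mean) / std for value in payload_mass]
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, which puts features measured in very different units on a comparable scale.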

**Splitting data into train and test sets**

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2, random_state=2)

In [None]:
y_test.shape

In [None]:
X_test.shape

## Classification Algorithms

### 1.Logistic Regression

In [None]:
# The lbfgs solver supports only the l2 penalty, so l1 is left out of the grid
parameters ={'C':[0.01,0.1,1],
             'penalty':['l2'],
             'solver':['lbfgs']}

In [None]:
parameters ={"C":[0.01,0.1,1],'penalty':['l2'], 'solver':['lbfgs']}
# l1: Lasso, l2: Ridge
lr=LogisticRegression()

In [None]:
logreg_cv=GridSearchCV(lr,parameters, cv=10)
logreg_cv.fit(X_train,y_train)
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy :",logreg_cv.best_score_)

**Accuracy of Logistic Regression on test data**

In [None]:
print('Accuracy on test data is: {:.3f}'.format(logreg_cv.score(X_test, y_test)))

**Confusion Matrix for Logistic Regression**

In [None]:
y_hat=logreg_cv.predict(X_test)
plot_confusion_matrix(y_test,y_hat)

Logistic regression distinguishes successful from unsuccessful landings well; the main problem is false positives.
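
To make the term "false positive" concrete, here is a small sketch that builds a 2x2 confusion matrix by hand from made-up labels (`y_true` and `y_pred` are assumptions for illustration, not the model's actual output). The layout matches the heatmap above: rows are true labels, columns are predicted labels.

```python
# Made-up true and predicted labels (1 = landed, 0 = did not land).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0]

# Rows index the true label, columns the predicted label: [[TN, FP], [FN, TP]].
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_pred):
    matrix[actual][predicted] += 1

(tn, fp), (fn, tp) = matrix
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
print(matrix, accuracy, precision)
```

A false positive here is a launch predicted to land that actually did not, so false positives lower precision even when every real landing is caught.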

### 2. Support Vector Machine Classifier

In [None]:
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'),
              'C': np.logspace(-3, 3, 5),
              'gamma':np.logspace(-3, 3, 5)}
svm = SVC()

In [None]:
svm_cv=GridSearchCV(svm, parameters, cv=10)
svm_cv.fit(X_train,y_train)

In [None]:
print("tuned hyperparameters :(best parameters) ",svm_cv.best_params_)
print("accuracy :",svm_cv.best_score_)

**Accuracy of SVM on the test data**

In [None]:
print('Accuracy on test data is: {:.3f}'.format(svm_cv.score(X_test, y_test)))

**Confusion Matrix for SVM**

In [None]:
y_hat=svm_cv.predict(X_test)
plot_confusion_matrix(y_test,y_hat)

### 3. Decision Tree Classifier

In [None]:
parameters = {'criterion': ['gini', 'entropy'],
     'splitter': ['best', 'random'],
     'max_depth': [2*n for n in range(1,10)],
     'max_features': ['sqrt', 'log2'],  # 'auto' was removed in recent scikit-learn releases
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10]}

predtree = DecisionTreeClassifier()

In [None]:
tree_cv=GridSearchCV(predtree, parameters, cv=10, scoring='accuracy')
tree_cv.fit(X_train,y_train)

In [None]:
print("tuned hyperparameters :(best parameters) ",tree_cv.best_params_)
print("accuracy :",tree_cv.best_score_)

**Accuracy of Decision Tree Classifier on test data**

In [None]:
print('Accuracy on test data is: {:.3f}'.format(tree_cv.score(X_test, y_test)))

**Confusion Matrix for Decision Tree Classifier**

In [None]:
y_hat = tree_cv.predict(X_test)
plot_confusion_matrix(y_test,y_hat)

**Distribution Plot**

In [None]:
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax = sns.kdeplot(y_test, color="r", label="Actual Value")
sns.kdeplot(y_hat, color="b", label="Fitted Values", ax=ax)
ax.legend()

The more the two curves overlap, the more accurate the model is.

In [None]:
from sklearn.metrics import accuracy_score, mean_absolute_error, r2_score
print("Accuracy Score: ", accuracy_score(y_test, y_hat))
print("Mean Absolute Error: ", mean_absolute_error(y_test, y_hat))
print("R2 Score: ", r2_score(y_test, y_hat))

### 4. K Nearest Neighbours Classification

In [None]:
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1,2]}

KNN = KNeighborsClassifier()

In [None]:
knn_cv=GridSearchCV(KNN, parameters, cv=10)
knn_cv.fit(X_train,y_train)

In [None]:
print("tuned hyperparameters :(best parameters) ",knn_cv.best_params_)
print("accuracy :",knn_cv.best_score_)

**Accuracy of KNN Algorithm on test data**

In [None]:
print('Accuracy on test data is: {:.3f}'.format(knn_cv.score(X_test, y_test)))

**Confusion Matrix for KNN Classifier**

In [None]:
y_hat = knn_cv.predict(X_test)
plot_confusion_matrix(y_test,y_hat)

### 5. Random Forest Classification

In [None]:
clf_rf = RandomForestClassifier(criterion='gini', max_depth= 18, n_estimators=200, max_features='sqrt', min_samples_leaf= 1, min_samples_split= 2, random_state=200)
clf_rf.fit(X_train,y_train)

In [None]:
Ypred_train=clf_rf.predict(X_train)

In [None]:
Rftrainscore=clf_rf.score(X_train,y_train)

In [None]:
print("Accuracy of Random Forest Classifier on train data:", Rftrainscore)

In [None]:
Rftestscore=clf_rf.score(X_test,y_test)

In [None]:
print('Accuracy of Random Forest Classifier on test data:',Rftestscore)

In [None]:
Ypred=clf_rf.predict(X_test)

In [None]:
plot_confusion_matrix(y_test,Ypred)

### 6. Extreme Gradient Boosting Classification

In [None]:
clf_xgb = XGBClassifier(max_depth = 10,random_state = 10,n_estimators=100, eval_metric = 'auc', min_child_weight = 3,
                    colsample_bytree = 0.75, subsample= 0.9)
clf_xgb.fit(X_train,y_train)

In [None]:
XGBtrainscore=clf_xgb.score(X_train,y_train)

In [None]:
print('Accuracy of XGBClassifier on train data:',XGBtrainscore)

In [None]:
XGBtestscore=clf_xgb.score(X_test,y_test)

In [None]:
print('Accuracy of XGBClassifier on test data:',XGBtestscore)

In [None]:
Ypred_xgb=clf_xgb.predict(X_test)

In [None]:
# Use the XGBoost predictions here, not the Random Forest ones stored in Ypred
plot_confusion_matrix(y_test,Ypred_xgb)

### Finding the best model
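**Accuracy comparison of different algorithms on training data**

The scores in this comparison are not all measured the same way: the grid-searched models report `best_score_`, while Random Forest and XGBoost report accuracy on the full training set, which tends to be higher.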

In [None]:
algorithms = {'KNN':knn_cv.best_score_,'Decision Tree':tree_cv.best_score_,
              'LogisticRegression':logreg_cv.best_score_,'SVM':svm_cv.best_score_,
              'RandomForest':Rftrainscore,'XGBClassifier':XGBtrainscore}
bestalgorithm = max(algorithms, key=algorithms.get)
print('Best Algorithm is',bestalgorithm,'with a score of',algorithms[bestalgorithm])
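
The dictionary above mixes `best_score_` values from `GridSearchCV` with plain training-set accuracy. `best_score_` is the mean cross-validated accuracy of the best parameter setting, and it can sit well below training accuracy. The sketch below shows the difference, assuming a synthetic dataset from `make_classification` and a deliberately overfit 1-nearest-neighbour model; none of the values come from the SpaceX data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy binary data (flip_y randomly reassigns about 20% of the labels)
X, y = make_classification(n_samples=200, n_features=8, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A 1-nearest-neighbour model memorises the training set
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1]}, cv=5)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_           # mean accuracy over the 5 held-out folds
train_accuracy = search.score(X_tr, y_tr)  # accuracy on data the model has already seen

print("Cross-validated accuracy:", cv_accuracy)
print("Training accuracy:", train_accuracy)
```

For a like-for-like ranking, one option is to compute the same kind of score for every model, for example by passing the Random Forest and XGBoost estimators through `cross_val_score` with the same `cv` value.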

In [None]:
score_df = pd.DataFrame.from_dict(algorithms, orient='index', columns=['Train Data Accuracy'])
score_df.sort_values(['Train Data Accuracy'], inplace=True)
score_df.head(6)

In [None]:
score_df = score_df.reset_index()
score_df.rename(columns = {'index': 'Algorithm'}, inplace = True)
score_df.head(6)

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
fig = px.bar(score_df, x='Algorithm', y='Train Data Accuracy', hover_data=['Algorithm', 'Train Data Accuracy'], color='Algorithm')
fig.update_layout(title='Algorithm vs. Train Data Accuracy', xaxis_title='Algorithm', yaxis_title='Train Data Accuracy' )
fig.show()

**Accuracy comparison of different algorithms on test data**

In [None]:
algorithms2 = {'KNN':knn_cv.score(X_test, y_test),
               'Decision Tree':tree_cv.score(X_test, y_test),
               'LogisticRegression':logreg_cv.score(X_test, y_test),
               'SVM':svm_cv.score(X_test, y_test),
               'RandomForest':Rftestscore,'XGBClassifier':XGBtestscore}
bestalgorithm2 = max(algorithms2, key=algorithms2.get)
print('Best Algorithm is',bestalgorithm2,'with a score of',algorithms2[bestalgorithm2])

In [None]:
score_df1 = pd.DataFrame.from_dict(algorithms2, orient='index', columns=['Test Data Accuracy'])
score_df1.sort_values(['Test Data Accuracy'], inplace=True)
score_df1 = score_df1.reset_index()
score_df1.rename(columns = {'index': 'Algorithm'}, inplace = True)
score_df1.head(6)

In [None]:
fig = px.bar(score_df1, x='Algorithm', y='Test Data Accuracy', hover_data=['Algorithm', 'Test Data Accuracy'], color='Algorithm')
fig.update_layout(title='Algorithm vs. Test Data Accuracy', xaxis_title='Algorithm', yaxis_title='Test Data Accuracy' )
fig.show()
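
Placing the two sets of scores side by side makes the gap between training and test performance easy to read. This is a sketch under stated assumptions: `compare_scores` is a hypothetical helper, and the example values are placeholders rather than results from this notebook. In the notebook it could be called as `compare_scores(algorithms, algorithms2)`.

```python
import pandas as pd

def compare_scores(train_scores, test_scores):
    """Hypothetical helper: align two {model: score} dicts and report the gap."""
    df = pd.DataFrame({
        "Train Data Accuracy": pd.Series(train_scores),
        "Test Data Accuracy": pd.Series(test_scores),
    })
    # Models whose names do not match in both dicts end up with NaN and are dropped
    df = df.dropna()
    # A large positive gap suggests the model fits the training data better than unseen data
    df["Gap"] = df["Train Data Accuracy"] - df["Test Data Accuracy"]
    return df.sort_values("Test Data Accuracy", ascending=False)

# Placeholder values for illustration only
example_train = {"Model A": 0.95, "Model B": 0.85}
example_test = {"Model A": 0.80, "Model B": 0.84}

table = compare_scores(example_train, example_test)
print(table)
```

Because rows are matched by name, the dictionary keys need to be identical in both inputs for a model to appear in the table.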