# SpaceX Missions EDA
## Do fork and star the repository, and do contribute!
Made with ❤ by _Saud Hashmi_

In [None]:
# Importing all the important packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# !pip install seaborn
import seaborn as sns

# Importing and basic schema exploration

In [None]:
missions = pd.read_csv('https://raw.githubusercontent.com/BetterCallSaud/astro-data-science/main/SpaceX%20Missions/spacex.csv')

In [None]:
missions.shape

In [None]:
missions.head()

In [None]:
missions.info()

# Data cleaning?

### Taking care of NaN

Every column should hold 41 values, but the `missions.info()` output shows that only 7 data fields are free of NULL values.
Many of these NaNs mean that the field simply does not apply to that mission.
For example, payload mass is NaN where a mission may not have carried a payload.

Let's see how many null values are present in the whole dataset.

In [None]:
# Per-column NaN counts, then summed across all columns
missions.isna().sum().sum()

In [None]:
# How many NaNs in payload mass data field
missions['Payload Mass (kg)'].isna().sum()
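
The per-column breakdown behind that total can be inspected in one step. The snippet below is a minimal sketch on a hypothetical toy frame (the values, and the `toy` and `complete_columns` names, are made up for illustration and are not taken from the dataset); it ranks columns by NaN count and lists the fully populated ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `missions`; all values are made up.
toy = pd.DataFrame({
    'Payload Mass (kg)': [500.0, np.nan, 3170.0, np.nan],
    'Failure Reason': [np.nan, 'Engine Failure', np.nan, np.nan],
    'Flight Number': ['F1-1', 'F1-2', 'F9-1', 'F9-2'],
})

# NaN count per column, largest first
nan_per_column = toy.isna().sum().sort_values(ascending=False)

# Columns without a single NaN
complete_columns = nan_per_column[nan_per_column == 0].index.tolist()

print(nan_per_column)
print(complete_columns)
```

Running the same two expressions on `missions` should show which fields account for most of the missing values.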

To make things simple, let's convert all **Payload Mass (kg)** NaNs into zeroes. A zero value makes our evaluation of the payload mass distribution easier.

In [None]:
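# Assign the result back; calling fillna(inplace=True) on a selected column
# is chained assignment and may not update `missions` in recent pandas versions
missions['Payload Mass (kg)'] = missions['Payload Mass (kg)'].fillna(0)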
missions['Payload Mass (kg)'].isna().sum()

To verify that exactly 8 NaNs were replaced (the count reported for this column before filling), let's write a function that compares every entry with 0 and returns the number of zeroes. If it comes out to 8, we are all set to go!

In [None]:
def anyZeroes(df, col):
    zero_count = 0
    for entry in df[col].values:
        # Compare directly with 0 (0.0 == 0 is True); int() would also truncate values like 0.4 to 0
        if entry == 0:
            zero_count += 1
    return zero_count

In [None]:
anyZeroes(missions, 'Payload Mass (kg)')
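
The same check can be done without an explicit loop. This is a minimal sketch on a hypothetical toy column (the values and the `toy`, `filled`, `nan_count`, and `zero_count` names are illustrative): it counts NaNs before filling and zeroes after filling, and the two numbers agree as long as the column held no genuine zeroes beforehand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for the payload mass field; values are made up.
toy = pd.DataFrame({'Payload Mass (kg)': [500.0, np.nan, 0.4, np.nan, 3170.0]})

# NaNs before filling
nan_count = int(toy['Payload Mass (kg)'].isna().sum())

# Fill, then count exact zeroes with a vectorized comparison
filled = toy['Payload Mass (kg)'].fillna(0)
zero_count = int((filled == 0).sum())

print(nan_count, zero_count)
```

Note that a small non-zero mass such as 0.4 is not counted as a zero here.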

Let's also take care of `Launch Date` by converting it to a proper datetime type.

In [None]:
missions['Launch Date'] = pd.to_datetime(missions['Launch Date'])
missions['Launch Date'].head()

In [None]:
missions['Year'] = pd.DatetimeIndex(missions['Launch Date']).year
missions.describe()
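
Once a column has a datetime dtype, the `.dt` accessor offers an equivalent way to pull out the year. A minimal sketch on a hypothetical toy frame (the dates are illustrative and the frame is not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame; the dates are illustrative only.
toy = pd.DataFrame({'Launch Date': ['2006-03-24', '2010-06-04', '2017-01-14']})

# Parse the strings, then read the year through the .dt accessor
toy['Launch Date'] = pd.to_datetime(toy['Launch Date'])
toy['Year'] = toy['Launch Date'].dt.year

print(toy)
```

The accessor also exposes `.dt.month` and `.dt.day` if a finer breakdown is needed later.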

**Awesome!** We are good to proceed to our EDA.

# Let the EDA commence!

Let's print out the data fields we have, so we know what we can ask questions about.

In [None]:
missions.columns

We also want to see the unique values of every object (text) column, just to get a feel for the data.

In [None]:
objects = [field for field in missions.columns if missions[field].dtype == 'O']

for obj in objects:
    print(f'{obj}: {missions[obj].unique()}\n')
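
When the full list of unique values is too long to read, the number of distinct values per column gives a quicker overview. The sketch below uses a hypothetical toy frame (made-up values, illustrative `toy`, `text_cols`, and `unique_counts` names) and picks the non-numeric columns with `select_dtypes`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `missions`; values are made up.
toy = pd.DataFrame({
    'Customer Name': ['NASA', 'Orbcomm', 'NASA', np.nan],
    'Payload Orbit': ['Low Earth Orbit', 'Low Earth Orbit', 'Polar Orbit', np.nan],
    'Payload Mass (kg)': [500.0, 172.0, 553.0, np.nan],
})

# Every column that is not numeric
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Number of distinct non-NaN values in each of those columns
unique_counts = toy[text_cols].nunique()

print(unique_counts)
```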

EDA is all about asking the right questions and turning them into answers that, once we are done, tell a story of their own. Feel free to contribute to this notebook if any questions come to mind. For starters, below are some questions I wanted answers to:

### Questions:

1. How many missions did SpaceX do for NASA?
2. What is the success/failure pattern of SpaceX missions?
3. Distribution of orbit types
4. Any correlation between payload type and payload mass
5. What's the reason for the most failures?
6. When did SpaceX do their first landing?
7. Payload mass distribution over the years (2006-2017)

## 1. How many missions did SpaceX do for NASA?

In [None]:
nasa_missions = missions['Customer Name'].values
nasa_count = 0
for mission in nasa_missions:
    if 'NASA' in str(mission):
        nasa_count += 1
        
print("Total NASA missions (including partners): " + str(nasa_count))
# Format the percentage directly; 100 * round(...) can print float noise such as 48.800000000000004
print(f"Proportion of NASA missions: {100 * nasa_count / len(nasa_missions):.1f}%")
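
The loop above can also be written as a vectorized string match. This is a minimal sketch on a hypothetical list of customers (made-up entries; `customers`, `is_nasa`, and `nasa_share` are illustrative names), where `na=False` treats missing customer names as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical customer names; entries are made up for illustration.
customers = pd.Series(['NASA', 'DARPA', np.nan, 'NASA / NOAA', 'Orbcomm'])

# True wherever the name contains 'NASA'; NaN counts as no match
is_nasa = customers.str.contains('NASA', na=False)

nasa_count = int(is_nasa.sum())
nasa_share = float(is_nasa.mean())

print(f'NASA missions: {nasa_count} ({nasa_share:.1%})')
```

Taking the mean of a boolean mask gives the proportion directly, so no separate division is needed.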

## 2. What is the success/failure pattern of SpaceX missions?

In [None]:
# Encode landing outcomes numerically: 0 = no landing data (NaN), 1 = Failure, 2 = Success
landing_outcome = missions['Landing Outcome'].fillna(0)
landing_outcome = landing_outcome.replace(['Failure', 'Success'], [1, 2])

# Mission sequence number (1-based) for the x-axis
x = np.arange(1, len(landing_outcome) + 1)
print(x.shape, landing_outcome.shape)

sns.set_style('dark')
sns.scatterplot(x=x, y=landing_outcome, hue=landing_outcome, palette='husl')
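
A plain tally complements the scatterplot. The sketch below runs on hypothetical outcomes (made-up values); labelling NaN as `'No attempt'` is an assumption about what a missing value means, not something the dataset states.

```python
import numpy as np
import pandas as pd

# Hypothetical landing outcomes; values are made up for illustration.
outcomes = pd.Series([np.nan, 'Failure', 'Success', np.nan, 'Success'])

# Assumption: NaN means no landing was attempted
counts = outcomes.fillna('No attempt').value_counts()

print(counts)
```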

## 3. Distribution of Orbit Types

In [None]:
orbit_types = missions['Payload Orbit']
orbit_types = orbit_types.astype('category')
orbit_types.dtype

In [None]:
orbit_types.value_counts()

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x=orbit_types)
plt.show()

In [None]:
# SQL equivalent: SELECT "Payload Name", "Payload Type" FROM missions WHERE "Payload Orbit" IN ('Polar Orbit', 'Sun/Earth Orbit')
unique_orbits = ['Polar Orbit', 'Sun/Earth Orbit']
missions[missions['Payload Orbit'].isin(unique_orbits)][['Payload Name', 'Payload Type']]

## 4. Any correlation between payload type and payload mass

In [None]:
payload_type = missions['Payload Type']
payload_mass = missions['Payload Mass (kg)']

In [None]:
payload_type.value_counts()

In [None]:
def clean_payload_type(col):
    new_col = []
    for typ in col.values:
        if typ == 'Communication/Research Satellite':
            new_col.append('Communication Satellite')
        elif typ == 'Research Satellites':
            new_col.append('Research Satellite')
        else:
            new_col.append(typ)
    return new_col

In [None]:
cleaned_payload_types = pd.Series(clean_payload_type(payload_type))
cleaned_payload_types.value_counts()
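
The same relabelling can be expressed as a dictionary passed to `Series.replace`. This is a minimal sketch on a hypothetical series (made-up rows; `payload_types_toy` and `merged` are illustrative names) using the two mappings from `clean_payload_type`.

```python
import numpy as np
import pandas as pd

# Hypothetical payload types; rows are made up for illustration.
payload_types_toy = pd.Series([
    'Communication/Research Satellite',
    'Research Satellites',
    'Weather Satellite',
    np.nan,
])

# Old label -> new label; anything not listed is left unchanged
merged = payload_types_toy.replace({
    'Communication/Research Satellite': 'Communication Satellite',
    'Research Satellites': 'Research Satellite',
})

print(merged.value_counts())
```

Keeping the mapping in one dictionary makes it easy to add further label fixes later.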

In [None]:
cleaned_payload_types.fillna(0, inplace=True)
cleaned_payload_types.replace(['Research Satellite', 'Communication Satellite',
       'Human Remains', 'Weather Satellite', 'Space Station Supplies'], [1,2,3,4,5], inplace=True)

In [None]:
# Mission sequence number (1-based) for the x-axis
x = np.arange(1, len(cleaned_payload_types) + 1)

plt.figure(figsize=(16, 9))
sns.set_style('dark')
sns.scatterplot(x=x, y=payload_mass, hue=cleaned_payload_types, palette='Greens', s=60)
plt.legend(labels=['No payload','Research Satellite', 'Communication Satellite',
       'Human Remains', 'Weather Satellite', 'Space Station Supplies'])
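
Since payload type is categorical, a per-type summary of payload mass is one way to put numbers next to the plot. A minimal sketch on a hypothetical toy frame (the masses are made up; `toy` and `summary` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame; masses are made up for illustration.
toy = pd.DataFrame({
    'Payload Type': ['Communication Satellite', 'Communication Satellite',
                     'Research Satellite', 'Space Station Supplies'],
    'Payload Mass (kg)': [4000.0, 5000.0, 500.0, 2000.0],
})

# Count, mean, and median payload mass for each payload type
summary = toy.groupby('Payload Type')['Payload Mass (kg)'].agg(['count', 'mean', 'median'])

print(summary)
```

Clear gaps between the group means or medians would suggest that payload type and payload mass are related.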

## 5. What's the reason for the most failures?

In [None]:
failure_reasons = missions['Failure Reason']
failure_reasons

In [None]:
plt.figure(figsize=(20,5))
sns.countplot(x=failure_reasons)

It looks like the most common reason for mission failure, among the failures with a recorded reason (i.e. not NaN), is **Collision During Launch**.
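
The most frequent reason can also be read off numerically rather than from the bar heights. The sketch below uses hypothetical failure reasons (made-up entries, including an invented `'Engine Failure'` label); `value_counts` skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical failure reasons; entries are made up for illustration.
reasons = pd.Series([
    np.nan, 'Engine Failure', 'Collision During Launch',
    'Collision During Launch', np.nan,
])

# Frequency of each recorded reason (NaN is excluded by default)
reason_counts = reasons.value_counts()

# Label with the highest count
most_common = reason_counts.idxmax()

print(reason_counts)
print(most_common)
```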

## 6. When did SpaceX do their first landing?

In [None]:
landing_outcomes = missions['Landing Outcome']

In [None]:
"""
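get_first_success()
@params: col <pd.Series>
returns: index <int> of the first 'Success', or None if there is none
"""
def get_first_success(col):
    # Iterate over the Series passed in, not the global `landing_outcomes`
    for i, v in enumerate(col.values):
        if v == 'Success':
            return i
    return None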

In [None]:
idx = get_first_success(landing_outcomes)
print(f'First occurrence of success found at index: {idx}')

Let's check that the value at index 16 is `Success`

In [None]:
# Use the index found above instead of hard-coding it
print(landing_outcomes.iloc[idx])

Now we will look up the `Launch Date` in the `missions` data frame at index 16

In [None]:
first_landing = missions['Launch Date'].iloc[idx]
first_landing

Let's also look at the full record for that mission

In [None]:
print(missions.iloc[idx])
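
The positional search above relies on the rows being in chronological order. If that is not guaranteed, filtering on the outcome and taking the earliest date is a more defensive option. A minimal sketch on a hypothetical, deliberately shuffled toy frame (dates and outcomes are illustrative; `toy`, `successes`, and `first_success` are made-up names):

```python
import pandas as pd

# Hypothetical toy frame, deliberately out of chronological order.
toy = pd.DataFrame({
    'Launch Date': pd.to_datetime(['2016-04-08', '2015-01-10', '2015-12-22', '2015-04-14']),
    'Landing Outcome': ['Success', 'Failure', 'Success', 'Failure'],
})

# Keep only the successful landings
successes = toy[toy['Landing Outcome'] == 'Success']

# Row with the earliest launch date among them (None if there were no successes)
first_success = None
if not successes.empty:
    first_success = successes.loc[successes['Launch Date'].idxmin()]

print(first_success)
```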

## 7. Payload mass distribution over the years (2006-2017)

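Let's start by making sure the dataframe has a `Year` column derived from `Launch Date` (this repeats the step from the cleaning section, so this part can run on its own)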

In [None]:
missions['Year'] = pd.DatetimeIndex(missions['Launch Date']).year

A scatterplot would make sense to study the distribution!

In [None]:
plt.figure(figsize=(15, 8))
sns.scatterplot(data=missions, x='Year', y='Payload Mass (kg)', hue='Year', palette='Blues')
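
A yearly aggregate gives a numeric view of the same distribution. The sketch below runs on a hypothetical toy frame (made-up years and masses; `toy` and `per_year` are illustrative names) and reports the number of missions, total mass, and mean mass per year.

```python
import pandas as pd

# Hypothetical toy frame; years and masses are made up for illustration.
toy = pd.DataFrame({
    'Year': [2006, 2010, 2010, 2017],
    'Payload Mass (kg)': [20.0, 0.0, 500.0, 9600.0],
})

# Missions, total payload mass, and mean payload mass for each year
per_year = toy.groupby('Year')['Payload Mass (kg)'].agg(['count', 'sum', 'mean'])

print(per_year)
```

Keep in mind that missions whose payload mass was filled with 0 pull the yearly mean down.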