# Stage 04 — EDA: Data Wrangling

**Objective**

Prepare the Falcon 9 dataset for downstream analysis and modeling by:

* Cleaning and validating fields
* Standardizing labels so that they consistently represent landing success



----


Install the libraries below.


In [None]:
!pip install pandas
!pip install numpy

## Import Libraries & Helper Functions
Load the core packages for data manipulation and quick quality checks.

---

In [None]:
# pandas: data manipulation and analysis.
import pandas as pd
# NumPy: support for large, multi-dimensional arrays and matrices, plus high-level mathematical functions that operate on them.
import numpy as np

### Data Analysis 


Load the SpaceX dataset produced in Stage 01.

In [None]:
df = pd.read_csv("dataset_part_1.csv")
df.head(10)

## Data Overview & Quality Checks
- Validate date formats and categorical values (e.g., Launch Site, Orbit, Booster Version).
- Check for missing values, duplicates, and apparent outliers.
  
---

In [None]:
# Percentage of missing values in each column
df.isnull().sum()/len(df)*100
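
The other checks listed above (duplicates and date formats) can be sketched in the same style. This is a minimal illustration on a toy frame, not the notebook's own procedure: the column names `Date`, `LaunchSite`, and `PayloadMass`, the sample values, and the `%Y-%m-%d` date format are all assumptions.

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names and values are assumed.
sample = pd.DataFrame({
    "Date": ["2010-06-04", "2012-05-22", "not a date", "2012-05-22"],
    "LaunchSite": ["CCAFS SLC 40", "CCAFS SLC 40", "VAFB SLC 4E", "CCAFS SLC 40"],
    "PayloadMass": [None, 525.0, 500.0, 525.0],
})

# Percentage of missing values per column (mean of a boolean mask).
missing_pct = sample.isnull().mean() * 100

# Number of fully duplicated rows; the first occurrence is not counted.
n_duplicates = int(sample.duplicated().sum())

# Dates that do not match the assumed format become NaT instead of raising.
parsed = pd.to_datetime(sample["Date"], format="%Y-%m-%d", errors="coerce")
n_bad_dates = int(parsed.isna().sum())

print(missing_pct)
print("duplicates:", n_duplicates, "unparseable dates:", n_bad_dates)
```

Using `errors="coerce"` keeps the check non-destructive: bad values are counted rather than stopping the notebook.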

Identify which columns are numerical and which are categorical:


In [None]:
df.dtypes
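
If explicit lists of column names are more convenient than reading `df.dtypes` by eye, `select_dtypes` can produce them. The sketch below uses a toy frame with assumed column names; note that boolean columns are not treated as numeric by `include="number"`.

```python
import pandas as pd

# Toy frame; the column names are illustrative assumptions.
sample = pd.DataFrame({
    "FlightNumber": [1, 2],
    "PayloadMass": [525.0, 677.0],
    "Orbit": ["LEO", "ISS"],
    "Reused": [False, True],
})

# Integer and float columns.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (strings, booleans, and so on).
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```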

## Number of Launches by Site
**Goal:** Quantify how many launches occurred at each site.  
**Approach:** Count the values of `LaunchSite` with `value_counts()`, which returns the counts sorted in descending order.  
**What to note:** Sites with the largest share of launches and any sites with sparse data (potentially lower statistical power).

---

In [None]:
# Apply value_counts() on column LaunchSite
df.value_counts('LaunchSite')
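
To see each site's share of launches next to its raw count, `value_counts(normalize=True)` can be combined with the plain counts. A minimal sketch on made-up data; the site labels and the `summary` table layout are illustrative assumptions.

```python
import pandas as pd

# Toy data; the site labels are illustrative only.
sample = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "CCAFS SLC 40", "CCAFS SLC 40", "VAFB SLC 4E"],
})

# Raw counts, sorted in descending order by default.
counts = sample["LaunchSite"].value_counts()

# Same counts expressed as a percentage of all launches.
share = sample["LaunchSite"].value_counts(normalize=True) * 100

# Both Series share the same index, so they align when combined.
summary = pd.DataFrame({"launches": counts, "share_pct": share.round(1)})
print(summary)
```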

## Orbit Distribution  

**Goal:** Identify which orbit types occur most frequently in the dataset.  
**Approach:** Group records by `Orbit` and count their frequency.  
**Note:** Some orbits appear only a few times (long tail). These may be consolidated later to improve modeling robustness.  

---

In [None]:
# Apply value_counts on Orbit column
df.value_counts('Orbit')
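
One possible way to consolidate the long tail mentioned in the note is to map orbits below a frequency threshold to a single label. This is only a sketch: the threshold `min_count`, the label `"Other"`, and the sample values are hypothetical choices, not part of this stage.

```python
import pandas as pd

# Toy Orbit column; the values are illustrative.
orbits = pd.Series(["GTO", "GTO", "GTO", "ISS", "ISS", "SO", "GEO"], name="Orbit")

min_count = 2  # hypothetical threshold below which an orbit counts as rare

counts = orbits.value_counts()
rare = counts[counts < min_count].index

# Keep frequent orbits as they are; replace rare ones with a shared label.
consolidated = orbits.where(~orbits.isin(rare), "Other")
print(consolidated.value_counts())
```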

## Create a Landing Outcome Label  

**Goal:** Define a clear binary label (`Class`) that indicates whether the Falcon 9 first stage landed successfully.  
**Approach:** Inspect the unique values in the `Outcome` column, then group all failure and non-attempt outcomes into a set (`bad_outcomes`).  

---

In [None]:
# Count how often each landing outcome occurs in the Outcome column
landing_outcomes = df['Outcome'].value_counts()
landing_outcomes

In [None]:
# Print each outcome with its position so it can be referenced by index
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

In [None]:
# Select the failure and non-attempt outcomes by their positions in the list printed above
bad_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])
bad_outcomes
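
Selecting by position depends on the order that `value_counts()` happens to return, which can change if the data changes. A name-based selection is less fragile. The sketch below assumes that each `Outcome` value starts with `True` when the landing succeeded (for example `True ASDS`) and with `False` or `None` otherwise; the labels shown are assumed for illustration and should be checked against the printed list before relying on this.

```python
import pandas as pd

# Assumed outcome labels of the form "<landed> <landing location>".
outcomes = pd.Series(
    ["True ASDS", "None None", "True RTLS", "False ASDS",
     "True Ocean", "False Ocean", "None ASDS", "False RTLS"],
    name="Outcome",
)

# Any outcome that does not start with "True" is a failure or a non-attempt.
bad_outcomes = {o for o in outcomes.unique() if not o.startswith("True")}
print(sorted(bad_outcomes))
```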

## Creating the Landing Outcome Class  

To model landing success, we introduce a binary **`Class`** variable:  
- **0 = Failure** (the first stage did not land successfully)  
- **1 = Success** (the first stage landed successfully)  

In [None]:
# landing_class = 0 if bad_outcome
# landing_class = 1 otherwise
landing_class = [0 if outcome in bad_outcomes else 1 for outcome in df['Outcome']]

In [None]:
df['Class']=landing_class
df[['Class']].head(8)
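
The same label can also be built without a Python-level loop by using `isin`. This is an equivalent sketch on toy data, not a replacement for the cells above; the outcome strings and the contents of `bad_outcomes` here are assumed for illustration.

```python
import pandas as pd

# Toy data; outcome labels and the bad_outcomes set are assumed.
sample = pd.DataFrame({"Outcome": ["True ASDS", "None None", "False Ocean", "True RTLS"]})
bad_outcomes = {"None None", "False Ocean"}

# isin marks the bad outcomes; negating and casting to int gives 0 = failure, 1 = success.
sample["Class"] = (~sample["Outcome"].isin(bad_outcomes)).astype(int)
print(sample)
```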

In [None]:
df.head(5)

In [None]:
# The mean of a 0/1 column is the fraction of ones, i.e. the overall landing success rate
df["Class"].mean()
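
Because `Class` holds only 0 and 1, its mean equals the number of successes divided by the number of rows. A small worked example with made-up values:

```python
import pandas as pd

# Made-up labels: three successes out of four launches.
classes = pd.Series([1, 0, 1, 1], name="Class")

# Mean of a 0/1 column equals successes divided by total rows.
success_rate = classes.mean()
manual_rate = classes.sum() / len(classes)

print(success_rate, manual_rate)
```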

## Export
Export the labeled dataset to CSV for reuse in Stage 04B and beyond.


In [None]:
df.to_csv("dataset_part_2.csv", index=False)
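
A quick round-trip check can confirm that an exported file reloads with the same contents. The sketch below writes a toy frame to a temporary directory so it does not touch the real output; the file name `dataset_part_2_check.csv` and the columns are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the labeled dataset.
sample = pd.DataFrame({"FlightNumber": [1, 2], "Class": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name used only for this check.
    path = os.path.join(tmp_dir, "dataset_part_2_check.csv")
    sample.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

# True when the reloaded frame has the same shape, values, and dtypes.
round_trip_ok = reloaded.equals(sample)
print(round_trip_ok)
```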

## Authors


<a href="https://www.linkedin.com/in/joseph-s-50398b136/">Joseph Santarcangelo</a> has a PhD in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.


<a href="https://www.linkedin.com/in/nayefaboutayoun/">Nayef Abou Tayoun</a> is a Data Scientist at IBM and is pursuing a Master of Management in Artificial Intelligence degree at Queen's University.


<!--
## Change Log
-->


<!--
| Date (YYYY-MM-DD) | Version | Changed By    | Change Description      |
| ----------------- | ------- | ------------- | ----------------------- |
| 2021-08-31        | 1.1     | Lakshmi Holla | Changed Markdown        |
| 2020-09-20        | 1.0     | Joseph        | Modified Multiple Areas |
| 2020-11-04        | 1.1     | Nayef         | Updating the input data |
| 2021-05-26        | 1.1     | Joseph        | Updating the input data |
-->


Copyright © 2021 IBM Corporation. All rights reserved.
