# Data Wrangling with Python

Data wrangling, also known as data munging, is the process of cleaning, transforming, and organizing raw data into a format that is more appropriate for analysis. This step is crucial in the data analysis pipeline because real-world data often contains inconsistencies, missing values, and other issues that need to be addressed before meaningful insights can be extracted.

In this notebook, we'll wrangle a dataset of SpaceX Falcon 9 launches. The goal is to prepare the data for later analysis, such as predicting from past launches whether a landing will succeed. We'll use the Python libraries `pandas` and `numpy` for this task.

---

# Step 1: Importing Necessary Libraries

We'll begin by importing the required libraries. `pandas` is used for data manipulation, while `numpy` provides support for large, multi-dimensional arrays and matrices, as well as a collection of mathematical functions to operate on these arrays.


In [1]:
import pandas as pd
import numpy as np


---

# Step 2: Loading the Dataset

Next, we'll load the dataset using `pandas`. The dataset contains information about SpaceX Falcon 9 launches, including the launch site, orbit, outcome, and other relevant details.


In [3]:
# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"C:\personal\spacex-ibm\dataset_part_1.csv")
df.head(10)


,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2006-03-24,Falcon 1,20.0,LEO,Kwajalein Atoll,None None,1,False,False,False,,,0,Merlin1A,167.743129,9.047721
1,2,2007-03-21,Falcon 1,5919.165341,LEO,Kwajalein Atoll,None None,1,False,False,False,,,0,Merlin2A,167.743129,9.047721
2,4,2008-09-28,Falcon 1,165.0,LEO,Kwajalein Atoll,None None,1,False,False,False,,,0,Merlin2C,167.743129,9.047721
3,5,2009-07-13,Falcon 1,200.0,LEO,Kwajalein Atoll,None None,1,False,False,False,,,0,Merlin3C,167.743129,9.047721
4,6,2010-06-04,Falcon 9,5919.165341,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857
5,8,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857
6,10,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857
7,11,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093
8,12,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857
9,13,2014-01-06,Falcon 9,3325.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1005,-80.577366,28.561857
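
A note on the file path: in an ordinary Python string literal a backslash begins an escape sequence, so Windows paths are safer written as raw strings or with forward slashes. The sketch below reuses the path from the cell above purely as an illustration and uses `PureWindowsPath` only to compare the two spellings; it does not touch the file system.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept literally, no escape processing.
raw_path = r"C:\personal\spacex-ibm\dataset_part_1.csv"

# Forward slashes avoid the escape problem altogether.
slash_path = "C:/personal/spacex-ibm/dataset_part_1.csv"

# Both spellings describe the same Windows path.
same = PureWindowsPath(raw_path) == PureWindowsPath(slash_path)
file_name = PureWindowsPath(raw_path).name
print(same, file_name)
```

Either spelling can be passed to `pd.read_csv`.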


---

# Step 3: Checking for Missing Values

Data often contains missing values, which can cause problems during analysis. We'll check the percentage of missing values in each column to understand the extent of this issue.


In [4]:
df.isnull().sum()/len(df)*100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass        0.000000
Orbit              0.000000
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        31.914894
Block              4.255319
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64
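
The same percentages can be obtained with `isnull().mean() * 100`, since the mean of a boolean column is the fraction of `True` values. The following is a minimal sketch on a toy frame; the values and the `toy` name are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two of three columns.
toy = pd.DataFrame({
    "LandingPad": ["pad_a", None, None, "pad_b"],
    "Block": [1.0, np.nan, 2.0, 3.0],
    "Flights": [1, 1, 2, 1],
})

# Fraction of missing entries per column, expressed as a percentage.
pct_missing = toy.isnull().mean() * 100

# Keep only the names of columns that actually have missing values.
cols_with_gaps = pct_missing[pct_missing > 0].index.tolist()
print(pct_missing)
print(cols_with_gaps)
```

Listing only the affected columns makes it easier to decide, column by column, whether to fill, drop, or keep the gaps.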

---

# Step 4: Inspecting Data Types

It's important to verify the data types of each column to ensure that the data is in the correct format for analysis. For example, numerical data should be stored as integers or floats, while categorical data should be stored as objects or categories.


In [5]:
df.dtypes


FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object
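
In the output above, `Date` is stored as `object` (plain strings) and `Block` as `float64`, the latter most likely because it contains missing values. If date arithmetic or integer block numbers are needed later, one possible conversion is sketched below on a toy frame. This step is an assumption about what later analysis might want, not something the notebook itself performs.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the two columns of interest.
toy = pd.DataFrame({
    "Date": ["2006-03-24", "2010-06-04"],
    "Block": [np.nan, 1.0],
})

# Parse ISO-formatted date strings into datetime64 values.
toy["Date"] = pd.to_datetime(toy["Date"])

# Nullable integer dtype keeps missing entries as <NA> instead of forcing floats.
toy["Block"] = toy["Block"].astype("Int64")

print(toy.dtypes)
```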

---

# Step 5: Analyzing Specific Columns

We'll take a closer look at the distribution of values in specific columns such as `LaunchSite` and `Orbit`. This can provide insights into patterns or irregularities in the data.


In [6]:
df['LaunchSite'].value_counts()

LaunchSite
CCSFS SLC 40       55
KSC LC 39A         22
VAFB SLC 4E        13
Kwajalein Atoll     4
Name: count, dtype: int64

In [7]:
df['Orbit'].value_counts()

Orbit
GTO      27
ISS      21
VLEO     14
LEO      11
PO        9
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: count, dtype: int64
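
Raw counts are often easier to compare as shares. As a small sketch, `value_counts(normalize=True)` returns proportions instead of counts; the series below is rebuilt from the `LaunchSite` counts shown above rather than read from the file.

```python
import pandas as pd

# Rebuild a LaunchSite column from the counts reported above (94 rows in total).
sites = pd.Series(
    ["CCSFS SLC 40"] * 55 + ["KSC LC 39A"] * 22
    + ["VAFB SLC 4E"] * 13 + ["Kwajalein Atoll"] * 4,
    name="LaunchSite",
)

# normalize=True divides each count by the total number of rows.
shares = sites.value_counts(normalize=True)
print((shares * 100).round(1))
```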

---

# Step 6: Understanding Landing Outcomes

The `Outcome` column records the landing result of each launch as two tokens, such as `True ASDS` or `None None`. We'll count how often each outcome occurs to see how successful the landings have been.


In [8]:
landing_outcomes=df['Outcome'].value_counts()

In [12]:

for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS


In [13]:
landing_outcomes

Outcome
True ASDS      41
None None      23
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: count, dtype: int64
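
Each outcome string combines two pieces of information. One way to inspect them separately is to split on the space, as sketched below. The column names `landed` and `landing_type` are hypothetical, and reading the second token as the landing location (for example, ASDS as a drone ship and RTLS as a return to the launch site) is an interpretation that the notebook itself does not state.

```python
import pandas as pd

# A few of the outcome strings listed above.
outcomes = pd.Series(
    ["True ASDS", "None None", "False Ocean", "True RTLS"], name="Outcome"
)

# Split each string once on the space into two separate columns.
parts = outcomes.str.split(" ", n=1, expand=True)
parts.columns = ["landed", "landing_type"]  # hypothetical column names
print(parts)
```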

---

# Step 7: Mapping Outcomes to Classes

We'll map the landing outcomes to a binary class: `1` for outcomes that start with `True` (successful landings) and `0` for every other outcome, including `None None`. This binary class will be useful for predictive modeling later on.


In [14]:
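# Positions 1, 3, 5, 6 and 7 in landing_outcomes are the outcomes that do not start with "True"
bad_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])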
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}

In [16]:
df['landing_class'] = df['Outcome'].apply(lambda x: 0 if x in bad_outcomes else 1)

# Check the results
df[['Outcome', 'landing_class']].head(100)

Unnamed: 0,Outcome,landing_class
0,None None,0
1,None None,0
2,None None,0
3,None None,0
4,None None,0
...,...,...
89,True ASDS,1
90,True ASDS,1
91,True ASDS,1
92,True ASDS,1
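
Selecting the bad outcomes by position depends on the order returned by `value_counts()`, which can change if the data changes. A possible alternative, sketched here, derives the label from the string itself: every outcome that starts with `True` becomes `1`. The check at the end confirms that this rule agrees with the set-based mapping for the eight outcomes listed earlier.

```python
import pandas as pd

# The eight distinct outcomes listed earlier in the notebook.
outcomes = pd.Series([
    "True ASDS", "None None", "True RTLS", "False ASDS",
    "True Ocean", "False Ocean", "None ASDS", "False RTLS",
], name="Outcome")

# Set-based mapping, as in the cell above.
bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}
by_set = outcomes.apply(lambda x: 0 if x in bad_outcomes else 1)

# Rule-based mapping: 1 if the outcome starts with "True", else 0.
by_rule = outcomes.str.startswith("True").astype(int)

agree = bool((by_set == by_rule).all())
print(agree)
```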


---

# Step 8: Creating and Checking the `Class` Column

We'll also create a `Class` column, computed from `Outcome` with the same rule as `landing_class`, so both columns hold identical values. We'll then print it to confirm that it matches our expectations.


In [19]:
df['Class'] = df['Outcome'].apply(lambda x: 0 if x in bad_outcomes else 1)

# Display the Class column (head(100) returns all 94 rows)
print(df[['Class']].head(100))

    Class
0       0
1       0
2       0
3       0
4       0
..    ...
89      1
90      1
91      1
92      1
93      1

[94 rows x 1 columns]
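
Rather than comparing the two columns by eye, a quick programmatic check can confirm that they agree and contain only 0 and 1. The sketch below uses a small toy frame with a handful of outcomes; the frame itself is illustrative, not the real dataset.

```python
import pandas as pd

bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# Toy frame with a few outcomes from the dataset.
toy = pd.DataFrame({"Outcome": ["None None", "True ASDS", "False RTLS", "True Ocean"]})
toy["landing_class"] = toy["Outcome"].apply(lambda x: 0 if x in bad_outcomes else 1)
toy["Class"] = toy["Outcome"].apply(lambda x: 0 if x in bad_outcomes else 1)

# The two columns should be identical, and Class should hold only 0 and 1.
columns_match = bool((toy["landing_class"] == toy["Class"]).all())
only_binary = set(toy["Class"].unique()) <= {0, 1}
print(columns_match, only_binary)
```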


---

# Step 10: Analyzing the Success Rate

Finally, we'll calculate the mean of the `Class` column. Because `Class` is either 0 or 1, the mean is the fraction of launches whose landing was successful.


In [20]:
df["Class"].mean()

0.6382978723404256
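
This figure can be cross-checked against the outcome counts from Step 6: the three `True` outcomes add up to 41 + 14 + 5 = 60 of the 94 launches. A short sketch of that arithmetic, using only the counts printed earlier:

```python
import pandas as pd

# Outcome counts as printed in Step 6.
counts = pd.Series({
    "True ASDS": 41, "None None": 23, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Successful landings are the outcomes that start with "True".
successes = counts[counts.index.str.startswith("True")].sum()
total = counts.sum()
rate = successes / total
print(successes, total, rate)
```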

---

# Step 11: Saving the Cleaned Dataset

After all the data wrangling steps are completed, we'll save the cleaned dataset to a new CSV file, `dataset_part_2.csv`, for further analysis.


In [21]:
df.to_csv("dataset_part_2.csv", index=False)
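
The `index=False` argument matters when the file is read back: without it, pandas writes the row index as an extra unnamed column, which reappears as `Unnamed: 0` on reload. The sketch below illustrates the difference with an in-memory buffer and a toy frame, so no file is written.

```python
import io
import pandas as pd

toy = pd.DataFrame({"FlightNumber": [1, 2], "Class": [0, 1]})

# Write without the index, then read back.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

# Write with the default index=True, then read back.
buf_idx = io.StringIO()
toy.to_csv(buf_idx)
buf_idx.seek(0)
back_idx = pd.read_csv(buf_idx)

print(list(back.columns))
print(list(back_idx.columns))
```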