<a href="https://colab.research.google.com/github/GabrielleRab/SRMPmachine/blob/main/Exoplanets_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cleaning the exoplanet data** 

Welcome to Google Colab! This is an interactive coding environment where you can run code to perform calculations and much more! Today we will be using the programming language Python to clean our exoplanet dataset.

But first, let's learn a little bit about colabs...

All the text and code in a colab is written in cells. (Even these words you are reading are inside of a text cell!) In order for any of the code in a colab to execute, you need to **run** its cell. 

Let's practice. Run the code cell below by pressing the "play" button on the left:

In [None]:
print("hello world!")

Congratulations! You just ran some Python code!

You'll notice a little [1] on the left side of the cell now. That means it's the first code cell you've run. If you run it again, the number will change to [2] because that is the *second* time you've run a cell. In general, you should only run each code cell a single time unless instructed to do otherwise.

## Importing libraries

Now let's run some code we will need to work with the exoplanets dataset. In Python, you can access a lot of pre-written commands by importing *libraries*, or bundles of commands. Let's import the libraries we will need. 

Go ahead and run the next code cell:

In [None]:
# import the necessary python libraries
import pandas as pd
import seaborn as sns

## Loading the data

Next we will load the dataset. This is the same data we've been working with in Google Sheets, but this time we will turn it into a *data frame*, which is a fancy way of describing a spreadsheet that Python can read.

Run this code to import and preview the data frame:

In [None]:
# load in the data and create a data frame called "df"
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/exoplanets.csv")

# return the first 5 rows of the data frame
df.head()

You'll notice that these are the same columns of data we worked with yesterday! We can check the "length" of the data frame to see how many rows there are:

In [None]:
# return the number of rows in the data frame
len(df)
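
Besides `len`, a data frame can report its dimensions and column names directly. The sketch below uses a tiny made-up data frame called `toy` (the values are illustrative and not taken from the exoplanet dataset) so that it runs on its own. Only the column names are borrowed from this notebook.

```python
import pandas as pd

# a tiny stand-in data frame with made-up values (not real exoplanet data)
toy = pd.DataFrame({
    "pl_bmassj": [0.5, 1.2, 3.4],
    "pl_orbper": [3.1, 365.0, 12.7],
})

# shape is a (rows, columns) pair; its first entry matches len()
n_rows, n_cols = toy.shape
print("rows:", n_rows, "columns:", n_cols)

# list the column names
print(list(toy.columns))
```

The same calls should work on `df` once it has been loaded.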

## Describing the data

Now let's calculate some summary statistics. We'll look at the same statistics we calculated using Google Sheets yesterday:

In [None]:
# calculate the mean, median, and quartiles for relevant variables
print("Planetary mass")
print("mean:", df["pl_bmassj"].mean())
print("median:", df["pl_bmassj"].median())
print("1st quartile:", df["pl_bmassj"].quantile(0.25))
print("3rd quartile:", df["pl_bmassj"].quantile(0.75))

print("")

print("Orbital period")
print("mean:", df["pl_orbper"].mean())
print("median:", df["pl_orbper"].median())
print("1st quartile:", df["pl_orbper"].quantile(0.25))
print("3rd quartile:", df["pl_orbper"].quantile(0.75))

print("")

print("Planetary radius")
print("mean:", df["pl_radj"].mean())
print("median:", df["pl_radj"].median())
print("1st quartile:", df["pl_radj"].quantile(0.25))
print("3rd quartile:", df["pl_radj"].quantile(0.75))

print("")

print("Star distance")
print("mean:", df["sy_dist"].mean())
print("median:", df["sy_dist"].median())
print("1st quartile:", df["sy_dist"].quantile(0.25))
print("3rd quartile:", df["sy_dist"].quantile(0.75))
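
The statistics above can also be gathered in one call with the pandas method `describe()`. This is a minimal sketch on a made-up data frame called `toy` (illustrative values only, not real exoplanet data), shown as a compact alternative rather than a replacement for the cell above.

```python
import pandas as pd

# made-up values for illustration only (not real exoplanet data)
toy = pd.DataFrame({
    "pl_bmassj": [0.5, 1.0, 2.0, 4.0, 20.0],
    "pl_radj": [0.2, 0.9, 1.0, 1.1, 1.3],
})

# describe() returns count, mean, std, min, quartiles, and max per column
summary = toy.describe()
print(summary)

# the "50%" row is the median; "25%" and "75%" are the 1st and 3rd quartiles
print("median mass:", summary.loc["50%", "pl_bmassj"])
```

Note that `describe()` skips missing values, just as `mean()` and `median()` do by default.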

We can create histograms as well:

In [None]:
# create the planetary mass histogram
print("Planetary mass")
df["pl_bmassj"].hist()

In [None]:
# create the orbital period histogram
print("Orbital period")
df["pl_orbper"].hist()

In [None]:
# create the planetary radius histogram
print("Planetary radius")
df["pl_radj"].hist()

In [None]:
# create the star distance histogram
print("Star distance")
df["sy_dist"].hist()

We can also create box plots to visualize these variables and identify outliers:

In [None]:
# create a boxplot for planetary mass
sns.boxplot(x=df['pl_bmassj'])
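
A box plot usually marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below reproduces that rule by hand on a made-up series of masses (illustrative values, not real exoplanet data). It assumes the common 1.5 × IQR convention, which is a statistical rule of thumb and not a physical limit.

```python
import pandas as pd

# made-up masses in Jupiter masses, for illustration only
mass = pd.Series([0.3, 0.8, 1.0, 1.5, 2.0, 2.5, 3.0, 40.0])

q1 = mass.quantile(0.25)
q3 = mass.quantile(0.75)
iqr = q3 - q1  # interquartile range

# points outside these fences are the ones a box plot typically draws as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = mass[(mass < lower_fence) | (mass > upper_fence)]
print(flagged.tolist())
```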

## Removing outliers

You may notice that there are a number of exoplanets hundreds of times the mass of Jupiter and at least one supposed exoplanet with a mass of over 700 Jupiters. In fact, once an object gets beyond about 13 times the mass of Jupiter, it is generally considered too massive to be a planet!

This means that some objects that aren't planets, such as brown dwarfs or small stars, have accidentally been included in this dataset. Let's remove all rows with a planetary mass of greater than 13 Jupiters. We will create a new cleaned data frame so we don't lose any of our original data:

In [None]:
# remove all rows with a planetary mass of greater than 13 Jupiters
# and store this cleaned data in a new dataframe
df_clean = df.drop(df[df.pl_bmassj > 13].index)

Let's see how many rows we have left:

In [None]:
# calculate the number of rows in the cleaned dataset
len(df_clean)

Subtracting that from the number of rows in our original data frame will tell us how many outliers we removed:

In [None]:
# calculate the number of removed rows
print(len(df)-len(df_clean))
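
It helps to know what this kind of filter does with edge cases. The sketch below uses a made-up data frame called `toy` (the `label` column and all values are hypothetical) to show that a row with a missing mass is kept, because a missing value is never "greater than 13", and that a mass of exactly 13 is kept as well. It also shows an equivalent way to write the filter with a boolean mask.

```python
import pandas as pd

# made-up values for illustration only; None becomes a missing value (NaN)
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],  # hypothetical row labels
    "pl_bmassj": [0.5, 25.0, None, 13.0],
})

# drop-based filter: remove rows whose mass is greater than 13
dropped = toy.drop(toy[toy["pl_bmassj"] > 13].index)

# the same filter written as a boolean mask
masked = toy[~(toy["pl_bmassj"] > 13)]

print(dropped.equals(masked))   # both approaches keep the same rows
print(len(toy) - len(dropped))  # number of removed rows
```

If rows with a missing mass should be removed too, `dropna(subset=["pl_bmassj"])` is one way to do that as a separate step.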

We can create another histogram and box plot for this variable to see how our distribution has changed:

In [None]:
# create a histogram for the cleaned planetary mass data
print("Planetary mass (cleaned)")
df_clean["pl_bmassj"].hist()

In [None]:
# create a boxplot for the cleaned planetary mass data
sns.boxplot(x=df_clean['pl_bmassj'])