# Predicting an exoplanet's distance from its host star

### Group Name: SPACE!
Ben Rycroft (s3947135)

Rita Lam Cordeiro (s3471881)

## Table of Contents

- Introduction
    - Dataset Source
    - Dataset Details
    - Dataset Features
    - Dataset Variables
- Goals and Objectives
- Data Cleaning and Preprocessing
- Summary and Conclusion
- References

## Introduction

### Dataset Source

**website:** https://exoplanetarchive.ipac.caltech.edu  

The NASA Exoplanet Archive compiles data on all known exoplanets and their host stars, including exoplanet parameters, stellar parameters, and discovery/characterization data.

This archive includes three data sets: "List of All known planets and hosts", "List of all Kepler Objects of Interest (KOIs)", and "List of all Kepler Threshold-Crossing Events (TCEs)". We have chosen to use only "List of All known planets and hosts" for this project.


### Dataset Details

Because the data contains a large number of NaN values, we decided to split it into subsets, each of which focuses on correlating a different set of values with a planet's distance from its host star, which is our response variable. There are 11 subsets in our dataset; each includes the value of the observed attribute and its upper and lower uncertainties.

### Dataset Variables

The following is a summary of each subset and its columns. Unit names are in square brackets.

**rad (Radius):**  
Data Type: numeric   
[Earth Radius]: the planet's radius expressed as a multiple of Earth's radius.  

pl_rade:        Planet Radius  
pl_radeerr1:    Planet Radius Upper Uncertainty   
pl_radeerr2:    Planet Radius Lower Uncertainty  
pl_radelim:     Planet Radius Limit Flag  

**mass:**  
Data Type: numeric  
[Earth Mass]: the planet's mass expressed as a multiple of Earth's mass.

pl_masse:       Planet Mass  
pl_masseerr1:   Planet Mass Upper Uncertainty  
pl_masseerr2:   Planet Mass Lower Uncertainty  
pl_masselim:    Planet Mass Limit Flag  

**dens (Density):**  
Data Type: numeric  
[g/cm^3]: grammes per centimeter cubed.  

pl_dens:        Planet Density  
pl_denserr1:    Planet Density Upper Uncertainty  
pl_denserr2:    Planet Density Lower Uncertainty  
pl_denslim:     Planet Density Limit Flag  

**orbeccen (Orbit Eccentricity):**    
Data Type: numeric  
[eccentricity]: a dimensionless measure of how far an orbit deviates from a circle. A value of 0 is a perfect circle, and values closer to 1 are more elongated.  

pl_orbeccen:    Eccentricity  
pl_orbeccenerr1: Eccentricity Upper Uncertainty  
pl_orbeccenerr2: Eccentricity Lower Uncertainty  
pl_orbeccenlim: Eccentricity Limit Flag  
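
To make the scale concrete, the sketch below computes the eccentricity of an ellipse from its semi-major axis `a` and semi-minor axis `b` using the standard relation e = sqrt(1 - b²/a²). The function name `eccentricity` is a hypothetical helper for illustration and is not part of the dataset or the archive.

```python
import math

def eccentricity(semi_major, semi_minor):
    """Eccentricity of an ellipse from its semi-major and semi-minor axes."""
    if semi_major <= 0 or not 0 <= semi_minor <= semi_major:
        raise ValueError("expected 0 <= semi_minor <= semi_major and semi_major > 0")
    return math.sqrt(1.0 - (semi_minor / semi_major) ** 2)

circular = eccentricity(1.0, 1.0)   # equal axes: a perfect circle
elongated = eccentricity(5.0, 3.0)  # a noticeably stretched ellipse
print(circular, elongated)
```

The circular case gives 0, and the stretched ellipse gives a value of about 0.8.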

**insol (Insolation):**  
Data Type: numeric  
[Earth Flux]: the stellar energy received per unit area at the planet, expressed as a multiple of the flux that Earth receives from the Sun.

pl_insol:       Insolation Flux   
pl_insolerr1:   Insolation Flux Upper Uncertainty   
pl_insolerr2:   Insolation Flux Lower Uncertainty   
pl_insollim:    Insolation Flux Limit Flag  
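
Insolation is tied directly to our response variable, because the flux a planet receives falls off with the square of its distance from the star. The sketch below is a simplified illustration, not the archive's method: it assumes the star radiates as a black body (L/L_sun = R² (T/T_sun)⁴), ignores orbital eccentricity, takes `pl_orbsmax` to be in au, and uses 5772 K as the Sun's effective temperature. Both function names are hypothetical.

```python
T_SUN_K = 5772.0  # assumed nominal solar effective temperature in kelvin

def stellar_luminosity(st_rad, st_teff):
    """Luminosity in solar units from radius (solar radii) and temperature (K)."""
    return st_rad ** 2 * (st_teff / T_SUN_K) ** 4

def insolation_earth_flux(st_rad, st_teff, pl_orbsmax):
    """Approximate insolation in Earth-flux units at a distance given in au."""
    return stellar_luminosity(st_rad, st_teff) / pl_orbsmax ** 2

def orbit_distance_au(st_rad, st_teff, pl_insol):
    """Invert the relation: distance in au from insolation in Earth-flux units."""
    return (stellar_luminosity(st_rad, st_teff) / pl_insol) ** 0.5

# A Sun-like star with a planet at 2 au receives a quarter of Earth's flux.
flux_at_2au = insolation_earth_flux(1.0, 5772.0, 2.0)
distance_back = orbit_distance_au(1.0, 5772.0, flux_at_2au)
print(flux_at_2au, distance_back)
```

Under these assumptions, insolation combined with the stellar radius and temperature would almost determine the orbit distance, which is worth keeping in mind when interpreting any correlation found for this subset.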

**eqt (Equilibrium Temperature):**  
Data Type: numeric  
[K]: Kelvin, a measure of temperature. 

pl_eqt:         Equilibrium Temperature  
pl_eqterr1:     Equilibrium Temperature Upper Uncertainty  
pl_eqterr2:     Equilibrium Temperature Lower Uncertainty  
pl_eqtlim:      Equilibrium Temperature Limit Flag  
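
Equilibrium temperature also depends on distance. A common textbook approximation is T_eq = T_eff · sqrt(R / 2a) · (1 − A)^(1/4), where A is the planet's albedo. The sketch below applies it under stated assumptions: full heat redistribution, `pl_orbsmax` in au, and roughly 215.032 solar radii per au. Values in the archive come from published sources that may use different assumptions, so this is an illustration only, and `equilibrium_temperature` is a hypothetical helper.

```python
import math

R_SUN_PER_AU = 215.032  # approximate number of solar radii in one au

def equilibrium_temperature(st_teff, st_rad, pl_orbsmax, albedo=0.0):
    """Approximate equilibrium temperature in kelvin.

    st_teff in kelvin, st_rad in solar radii, pl_orbsmax in au.
    """
    distance_solar_radii = pl_orbsmax * R_SUN_PER_AU
    geometric = math.sqrt(st_rad / (2.0 * distance_solar_radii))
    return st_teff * geometric * (1.0 - albedo) ** 0.25

# Earth-like case: Sun-like star, 1 au, with and without an albedo of 0.3.
t_no_albedo = equilibrium_temperature(5772.0, 1.0, 1.0)
t_albedo = equilibrium_temperature(5772.0, 1.0, 1.0, albedo=0.3)
print(round(t_no_albedo, 1), round(t_albedo, 1))
```

For these inputs the result is roughly 278 K without albedo and roughly 255 K with an albedo of 0.3.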

**teff (Stellar Effective Temperature):**   
Data Type: numeric  
[K]: Kelvin, a measure of temperature.

st_teff:        Stellar Effective Temperature   
st_tefferr1:    Stellar Effective Temperature Upper Unc.   
st_tefferr2:    Stellar Effective Temperature Lower Unc.   
st_tefflim:     Stellar Effective Temperature Limit Flag  

**radst (Radius of Star):**      
Data Type: numeric   
[Solar Radius]: the star's radius expressed as a multiple of the Sun's radius.  

st_rad:         Stellar Radius    
st_raderr1:     Stellar Radius Upper Uncertainty   
st_raderr2:     Stellar Radius Lower Uncertainty   
st_radlim:      Stellar Radius Limit Flag  

**massst (Mass of Star):**     
Data Type: numeric  
[Solar Mass]: the star's mass expressed as a multiple of the Sun's mass.  

st_mass:        Stellar Mass   
st_masserr1:    Stellar Mass Upper Uncertainty   
st_masserr2:    Stellar Mass Lower Uncertainty   
st_masslim:     Stellar Mass Limit Flag  
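
Stellar mass links the two orbit columns that appear in every subset, `pl_orbper` and `pl_orbsmax`, through Kepler's third law. In units of au, years and solar masses it reduces to a³ ≈ M · P² when the planet's mass is negligible compared with the star's. The sketch below assumes `pl_orbper` is in days and `pl_orbsmax` in au. The function name is hypothetical.

```python
DAYS_PER_YEAR = 365.25  # Julian year, used to convert the period to years

def semi_major_axis_au(pl_orbper_days, st_mass):
    """Semi-major axis in au from Kepler's third law, neglecting planet mass."""
    period_years = pl_orbper_days / DAYS_PER_YEAR
    return (st_mass * period_years ** 2) ** (1.0 / 3.0)

earth_like = semi_major_axis_au(365.25, 1.0)      # one-year orbit, Sun-like star
eight_years = semi_major_axis_au(8 * 365.25, 1.0) # eight-year orbit, Sun-like star
print(earth_like, eight_years)
```

A one-year orbit around a one-solar-mass star comes out at about 1 au, and an eight-year orbit at about 4 au.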

**met (Metallicity):**   
Data Type: numeric  
[dex]: decimal exponent, a logarithmic unit. Metallicity measures the abundance of elements heavier than helium in a star relative to the Sun.

st_met:         Stellar Metallicity   
st_meterr1:     Stellar Metallicity Upper Uncertainty   
st_meterr2:     Stellar Metallicity Lower Uncertainty  
st_metlim:      Stellar Metallicity Limit Flag  
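
Because dex is a base-10 logarithmic unit, a metallicity value converts to a linear abundance ratio relative to the Sun as 10 raised to that value. A minimal sketch, with `abundance_ratio` as a hypothetical helper name:

```python
def abundance_ratio(st_met):
    """Linear abundance ratio relative to the Sun for a metallicity in dex."""
    return 10.0 ** st_met

solar = abundance_ratio(0.0)       # same metal abundance as the Sun
metal_poor = abundance_ratio(-1.0) # one tenth of the solar abundance
print(solar, metal_poor)
```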

**agest (Stellar Age):**  
Data Type: numeric  
[Gyr]: Giga year (one billion years), the age of the star.

st_age:         Stellar Age   
st_ageerr1:     Stellar Age Upper Uncertainty   
st_ageerr2:     Stellar Age Lower Uncertainty  
st_agelim:      Stellar Age Limit Flag  

In [10]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

pd.set_option('display.max_columns', None) 

###
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
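# The "seaborn" style was renamed "seaborn-v0_8" in matplotlib 3.6;
# use "seaborn" instead on older matplotlib versions.
plt.style.use("seaborn-v0_8")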
###

In [11]:
# Load the dataset into a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/BenRyc/SPACE-/main/orbitDistNoLim.csv")

## Data Cleaning and Preprocessing
Preparing the data for modelling

### Step 1
The site the data was collected from required us to specify which columns of the larger dataset we wanted to download. We selected only the relevant columns that we wanted to investigate, and we excluded fields such as the names of planets and stars.

### Step 2 
Here are the columns of the dataset

In [12]:
df.columns

Index(['sy_snum', 'sy_pnum', 'discoverymethod', 'pl_orbper', 'pl_orbpererr1',
       'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2',
       'pl_rade', 'pl_radeerr1', 'pl_radeerr2', 'pl_masse', 'pl_masseerr1',
       'pl_masseerr2', 'pl_dens', 'pl_denserr1', 'pl_denserr2', 'pl_orbeccen',
       'pl_orbeccenerr1', 'pl_orbeccenerr2', 'pl_insol', 'pl_insolerr1',
       'pl_insolerr2', 'pl_eqt', 'pl_eqterr1', 'pl_eqterr2', 'pl_eqtlim',
       'st_spectype', 'st_teff', 'st_tefferr1', 'st_tefferr2', 'st_rad',
       'st_raderr1', 'st_raderr2', 'st_mass', 'st_masserr1', 'st_masserr2',
       'st_met', 'st_meterr1', 'st_meterr2', 'st_metlim', 'st_logg',
       'st_loggerr1', 'st_loggerr2', 'st_age', 'st_ageerr1', 'st_ageerr2',
       'st_agelim', 'st_dens', 'st_denserr1', 'st_denserr2', 'sy_dist',
       'sy_disterr1', 'sy_disterr2'],
      dtype='object')

We intend to compare many different variables against the orbit, and not every row has values for every column. We therefore decided to split the dataset into smaller subsets, each comparing only one variable against the orbit radius. This way we can remove the rows with null values from each subset without removing those rows from the other subsets.

In [13]:
rad = df.dropna(subset=['pl_rade', 'pl_radeerr1', 'pl_radeerr2'])
mass = df.dropna(subset=['pl_masse', 'pl_masseerr1', 'pl_masseerr2'])
dens = df.dropna(subset=['pl_dens', 'pl_denserr1', 'pl_denserr2'])
orbeccen = df.dropna(subset=['pl_orbeccen', 'pl_orbeccenerr1', 'pl_orbeccenerr2'])
insol = df.dropna(subset=['pl_insol', 'pl_insolerr1', 'pl_insolerr2'])
eqt = df.dropna(subset=['pl_eqt', 'pl_eqterr1', 'pl_eqterr2'])
teff = df.dropna(subset=['st_teff', 'st_tefferr1', 'st_tefferr2'])
radst = df.dropna(subset=['st_rad', 'st_raderr1', 'st_raderr2'])
massst = df.dropna(subset=['st_mass', 'st_masserr1', 'st_masserr2'])
met = df.dropna(subset=['st_met', 'st_meterr1', 'st_meterr2'])
agest = df.dropna(subset=['st_age', 'st_ageerr1', 'st_ageerr2'])
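
The benefit of dropping nulls per subset rather than across the whole table can be seen on a small example. The toy frame below is made up for illustration and is not drawn from the archive.

```python
import numpy as np
import pandas as pd

# Made-up values: each row is missing a different measurement.
toy = pd.DataFrame({
    "pl_orbsmax": [0.05, 1.2, 0.3, 2.5],
    "pl_rade":    [1.1, np.nan, 2.4, 11.0],
    "pl_masse":   [np.nan, 300.0, 5.0, np.nan],
})

rows_all = len(toy.dropna())                      # complete in every column
rows_rad = len(toy.dropna(subset=["pl_rade"]))    # usable for the radius subset
rows_mass = len(toy.dropna(subset=["pl_masse"]))  # usable for the mass subset
print(rows_all, rows_rad, rows_mass)
```

Dropping every row with any null keeps one row, while the per-variable subsets keep three and two rows respectively.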

### Step 3 
The next step was to keep only the relevant columns in each data subset.

In [14]:
rad = rad[['pl_rade', 'pl_radeerr1', 'pl_radeerr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
mass = mass[['pl_masse', 'pl_masseerr1', 'pl_masseerr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
dens = dens[['pl_dens', 'pl_denserr1', 'pl_denserr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
orbeccen = orbeccen[['pl_orbeccen', 'pl_orbeccenerr1', 'pl_orbeccenerr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
insol = insol[['pl_insol', 'pl_insolerr1', 'pl_insolerr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
eqt = eqt[['pl_eqt', 'pl_eqterr1', 'pl_eqterr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
teff = teff[['st_teff', 'st_tefferr1', 'st_tefferr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
radst = radst[['st_rad', 'st_raderr1', 'st_raderr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
massst = massst[['st_mass', 'st_masserr1', 'st_masserr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
met = met[['st_met', 'st_meterr1', 'st_meterr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
agest = agest[['st_age', 'st_ageerr1', 'st_ageerr2', 'pl_orbper', 'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1', 'pl_orbsmaxerr2']]
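
Steps 2 and 3 repeat the same pattern for every variable, so they could also be written as one helper. The sketch below is an alternative formulation, not the code used in this notebook. `make_subset` and `ORBIT_COLS` are hypothetical names, and the demo frame is made up.

```python
import numpy as np
import pandas as pd

# Orbit columns kept in every subset.
ORBIT_COLS = ["pl_orbper", "pl_orbpererr1", "pl_orbpererr2",
              "pl_orbsmax", "pl_orbsmaxerr1", "pl_orbsmaxerr2"]

def make_subset(frame, base):
    """Drop rows missing the variable or its uncertainties, then keep
    only that variable's columns plus the orbit columns."""
    cols = [base, base + "err1", base + "err2"]
    return frame.dropna(subset=cols)[cols + ORBIT_COLS]

# Made-up demo frame with one extra column that should be discarded.
demo = pd.DataFrame({col: [1.0, 2.0, 3.0] for col in ORBIT_COLS})
demo["pl_rade"] = [1.0, np.nan, 3.0]
demo["pl_radeerr1"] = [0.1, 0.1, 0.1]
demo["pl_radeerr2"] = [-0.1, -0.1, np.nan]
demo["sy_pnum"] = [1, 2, 3]

rad_demo = make_subset(demo, "pl_rade")
print(rad_demo.shape)
```

With the real data, a call such as `make_subset(df, "pl_rade")` would be expected to produce the same frame as the `rad` subset built above.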
