<a href="https://colab.research.google.com/github/RafDingo/Maths40/blob/main/Maths40.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exoplanet Discoverability using Linear Regression

## Contents
* [Introduction](#Introduction)
    * [Dataset Source](#Dataset_Source)
    * [Dataset Details](#Dataset_Details)
    * [Dataset Variables](#Dataset_Variables)
    * [Response Variables](#Response_Variables)
    * [Goals and Objectives](#Goals_and_Objectives)
* [Data Cleaning and Preprocessing](#Data_Cleaning_and_Preprocessing)
    * [Removing Unnecessary Columns](#Removing_Unnecessary_Columns)
    * [Finding Outliers](#Finding_Outliers)
    * [Finding Outliers](#Finding_Outliers)
* [Summary & Conclusions](#Summary_and_Conclusions)
* [Literature Review](#Literature_Review)
* [References](#References)

## Introduction <a name="Introduction"></a>
Humanity have discovered thousands of planets outside of our solar system, using a variety of methods. What most of these methods have in common, is the detection of an anomoly in a signal from avisible star. This could be an exo-planet passing in front of the star from our perspective, lowering its brightness, or a wobble imparted by the mass of a planet, as both the planet's and star's gravity pulls at eachother.
We suspect that the methods used to discover planets, would impart a bias on the way which we discover these bodies, and hope to exoplore this link. 
### Dataset Source <a name="Dataset_Source"></a>
This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program (1)

The data is sourced from multiple missions which had/have the goal of discoving exo-planets. TESS, Kepler, K2, KELT and UKIRT.
### Dataset Details <a name="Dataset_Details"></a>
The dataset is a subset of information provided by NASA about all discovered Exo-planets. An exoplanet or extra-solar planet any planet sized body outside of our solar system. These can either orbit another star, or be star-less. (2)

NASA's archive provides many features, which we limited to 28.
After futher filtering features which are not applicable to our study, we have 18 features, and 4521 entries.
### Dataset Variables <a name="Dataset_Variables"></a>
The features in our dataset are described in the table below.


In [None]:
from tabulate import tabulate

from tabulate import tabulate

table = [['Name','Data Type','Units','Description'],
         ['sy_snum', 'Numeric', 'NA', 'Number of stars in the system'],
         ['sy_pnum', 'Numeric', 'NA', 'Number of planets in the system'],
         ['cb_flag', 'Nominal categorical', 'NA', 'Circumbinary flag: whether the planet orbits 2 stars'],
         ['pl_orbper', 'Numeric', 'Earth days', 'Orbital period (Time it takes planet to complete an orbit'],
         ['pl_orbsmax', 'Numeric', 'au', 'Orbit semi-Major Axis. au is the distance from Earth to sun.'],
         ['pl_rade', 'Numeric', 'Earth radius', 'Planet radius, where 1.0 is Earth\'s radius'],
         ['pl_bmasse', 'Numeric', 'Earth Mass', 'Planetary Mass, where 1.0 is Earth\'s mass'],
         ['pl_orbeccen', 'Numeric', 'NA', 'Planet\s orbital eccentricity'],
         ['pl_eqt', 'Numeric', 'Kelvin', 'Equilibrium Temperature: (The planetary equilibrium temperature is a the theoretical temperature that a planet would be a black body being heated only by its parent star)'],
         ['st_teff', 'Numeric', 'Kelvin', 'Stellar Effective Temperature'],
         ['st_rad', 'Numeric', 'Solar Radius', 'Stellar Radius, where 1.0 is 1 of our Sun\'s radius'],
         ['st_mass', 'Numeric', 'Solar Mass', 'Stellar Mass, where 1.0 is 1* our Sun\'s mass'],
         ['st_lum', 'Numeric', 'log(Solar luminosity)', 'Stellar Luminosity'],
         ['st_age', 'Numeric', 'gyr (Gigayear)', 'Stellar Age'],
         ['glat', 'Numeric', 'degrees', 'Galactic Latitude'],
         ['glon', 'Numeric', 'degrees', 'Galactic Longitude'],
         ['sy_dist', 'Numeric', 'parsec', 'Distance'],
         ['sy_plx', 'Numeric', 'mas (miliarcseconds)', 'Parallax: Distance the star moves in relation to other objects in the night sky'],
        ]

print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))

╒═════════════╤═════════════════════╤═══════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Name        │ Data Type           │ Units                 │ Description                                                                                                                                                            │
╞═════════════╪═════════════════════╪═══════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ sy_snum     │ Numeric             │ NA                    │ Number of stars in the system                                                                                                                                          │
├─────────────┼─────────────────────┼───────────────────────┼───────────────

### Response Variables <a name="Response_Variables"></a>
Our reponse variable is sy_pnum. This is the number of exo-planets detected in the star system of the exo-planet.

As our study focuses on planet discoverability, this is the most logical response variable.

### Goals and Objectives <a name="Goals_and_Objectives"></a>
Our goal is to visuilise and explore the correlation between features of a discovered exo-planet, it's star, and the amount of planets discovered in its system.
These correlations should provide us with insights regarding how these features affect the likelihood of an exo-planet being discovered.

We will explore many relationships in our data, including:

A planet's:
* Mass
* Radius
* Orbital Period
* Orbbital distance (Semi-major Axis)
* Orbital eccentricity 
* Surface temperature

The star's:
* Temperature
* Radius
* Mass
* Luminosity
* Age
* Position (Relative to the galaxy)
* Parallax
* Distance from the Solar system.

We will also derive the ratio between the planet and star's radii, temperatures and masses, and explore these relationships.

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

pd.set_option('display.max_columns', None) 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

# For mac users
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)): 
    ssl._create_default_https_context = ssl._create_unverified_context

In [23]:
url = 'https://raw.githubusercontent.com/RafDingo/Maths40/main/data/exoPlanets.csv'
df_main = pd.read_csv(url, error_bad_lines=False, index_col=False)

In [24]:
print('Shape:', df_main.shape)
df_main.sample(10, random_state=999)

Shape: (4521, 28)


Unnamed: 0,loc_rowid,sy_snum,sy_pnum,sy_mnum,cb_flag,pl_orbper,pl_orbsmax,pl_rade,pl_bmasse,pl_dens,pl_orbeccen,pl_insol,pl_eqt,ttv_flag,st_spectype,st_teff,st_rad,st_mass,st_metratio,st_lum,st_logg,st_age,st_rotp,glat,glon,sy_pmdec,sy_dist,sy_plx
3009,3010,1,3,0,0,6.988055,0.069,1.15,7.3,26.4,0.0,101.84,914.0,0,,5631.0,0.8,0.93,[M/H],-0.313,4.59,4.37,,13.51572,80.80071,-7.50588,616.637,1.5932
1852,1853,1,1,0,0,10.458434,0.0845,1.33,2.33,5.44,0.0,24.21,609.0,0,,4730.0,0.72,0.76,[Fe/H],-0.727,4.62,3.98,,18.96771,79.57519,-7.08245,342.509,2.89083
2512,2513,1,1,0,0,33.4969,0.2008,1.26,2.12,5.83,0.0,37.3,630.0,0,,4681.53,0.77,0.75,[M/H],-0.588,4.54,,,15.04776,74.44132,-19.4631,247.771,4.00705
1469,1470,1,4,0,0,5.577212,0.04,1.08,1.28,5.58,0.11,8.5,,0,,3360.47,0.33,0.27,,-1.837,4.9,,,-49.94032,51.46106,74.3391,66.4321,15.0242
3589,3590,1,1,0,0,2.237493,0.0325,1.33,2.33,5.44,0.0,336.67,1160.0,0,,5099.0,0.76,0.8,[Fe/H],-0.501,4.58,4.9,,12.92363,73.89376,0.090604,591.776,1.66115
3347,3348,1,1,0,0,18.684049,0.1396,2.63,7.41,2.24,0.0,176.23,832.0,0,,5573.0,1.15,0.95,[Fe/H],0.525,4.29,10.47,,11.33166,80.6889,1.32626,992.852,0.978491
4251,4252,1,7,0,0,2.421937,0.0158,1.097,1.308,5.463993,0.00654,2.21,342.0,1,,2566.0,0.12,0.09,[Fe/H],-3.257,5.24,0.5,1.4,-56.64891,69.71519,-492.0,12.429889,80.451243
2436,2437,1,1,0,0,9.878482,0.0932,2.58,7.18,2.3,0.0,187.85,1003.0,0,,6150.0,1.18,1.12,[Fe/H],0.145,4.34,3.16,,19.93538,78.45346,7.76766,1031.31,0.940878
2654,2655,1,3,0,0,3.619337,0.046,1.48,2.79,4.73,0.0,460.66,1156.0,0,,5502.0,1.06,1.02,[Fe/H],0.009,4.39,6.3,,11.2887,79.45283,20.872,656.38,1.49472
2570,2571,1,2,0,0,2.061897,0.032,1.68,3.46,4.01,0.0,1069.23,1463.0,0,,6021.0,1.14,0.98,[Fe/H],0.034,4.35,5.8,,19.64002,73.51949,-14.7743,968.544,1.00392


## Data Cleaning and Preprocessing <a name="Data_Cleaning_and_Preprocessing"></a>

### 📃 **Instructions**
In general, the following steps will be necessary for data preparation **in this specific order**:
1. Outliers and unusual values (such as a negative age) are taken care of: they are either imputed, dropped, or set to missing values.
2. Missing values are imputed or the rows containing them are dropped.
3. Any categorical descriptive feature is encoded to be numeric as follows: 
   - one-hot-encoding for nominals, 
   - one-hot-encoding or integer-encoding for ordinals.
4. All descriptive features (which are all numeric at this point) are scaled.
5. In case of a classification problem, the target feature is label-encoded (in case of a binary problem, the positive class is encoded as 1).
6. If the dataset has too many observations, only a small random subset of entire dataset is selected to be used during model tuning and model comparison. 
7. Before fitting any `Scikit-Learn` models, any `Pandas` series or data frame is converted to a `NumPy` array using the `values` method in Pandas.

---




In [25]:
# Copy orignal data to a a working dataframe
df = df_main.copy()

In [26]:
# View all columns to find the ones required for regression
df.columns

Index(['loc_rowid', 'sy_snum', 'sy_pnum', 'sy_mnum', 'cb_flag', 'pl_orbper',
       'pl_orbsmax', 'pl_rade', 'pl_bmasse', 'pl_dens', 'pl_orbeccen',
       'pl_insol', 'pl_eqt', 'ttv_flag', 'st_spectype', 'st_teff', 'st_rad',
       'st_mass', 'st_metratio', 'st_lum', 'st_logg', 'st_age', 'st_rotp',
       'glat', 'glon', 'sy_pmdec', 'sy_dist', 'sy_plx'],
      dtype='object')

### ⛔ **Removing Unnecessary Columns** <a name="Removing_Unnecessary_Columns"></a>

The following columns has been identified to be either useless towards our analysis of the target feature or is not suitable for machine learning.

Variables to drop:
  - loc_rowid (ID column)
  - pl_dens (Data already covered by mass and size)
  - pl_insol (Similar feature to size)
  - ttv_flag (Data not suitable for our analysis)
  - st_spectype (Categorical version of solar luminosity)
  - st_metratio (Bad values)
  - st_logg (Similar feature to solar mass and size)
  - st_rotp (Data unrelated to our target feature)
  - sy_pmdec (Data unrelated to our target feature)

In [27]:
# Drop id and irrelevant columns
del df['loc_rowid']
del df['pl_dens']
del df['pl_insol']
del df['ttv_flag']
del df['st_spectype']
del df['st_metratio']
del df['st_logg']
del df['st_rotp']
del df['sy_pmdec']
del df['sy_mnum']
# data check
df.columns

Index(['sy_snum', 'sy_pnum', 'cb_flag', 'pl_orbper', 'pl_orbsmax', 'pl_rade',
       'pl_bmasse', 'pl_orbeccen', 'pl_eqt', 'st_teff', 'st_rad', 'st_mass',
       'st_lum', 'st_age', 'glat', 'glon', 'sy_dist', 'sy_plx'],
      dtype='object')

#### Rename columns

In [28]:
# Change to readable column names.
df.rename({
      'sy_snum': 'num_star',
      'sy_pnum': 'num_planet',
      'cb_flag': '2_stars',
      'pl_orbper': 'orbital_period',
      'pl_orbsmax': 'semi-major_axis',
      'pl_rade': 'planet_radius',
      'pl_bmasse': 'planet_mass',
      'pl_orbeccen': 'planet_eccen',
      'pl_eqt': 'planet_temp',
      'st_teff': 'star_temp',
      'st_rad': 'star_radius',
      'st_mass': 'star_mass',
      'st_lum': 'star_bright',
      'st_age': 'star_age',
      'glat': 'latitude_gal',
      'glon': 'longitude_gal',
      'sy_dist': 'distance',
      'sy_plx': 'parallax'
      }, 
      axis=1, inplace=True
)
# data check
df.head()

Unnamed: 0,num_star,num_planet,2_stars,orbital_period,semi-major_axis,planet_radius,planet_mass,planet_eccen,planet_temp,star_temp,star_radius,star_mass,star_bright,star_age,latitude_gal,longitude_gal,distance,parallax
0,2,1,0,326.03,1.29,12.1,6165.6,0.231,,4742.0,19.0,2.7,2.243,,78.28058,264.13775,93.1846,10.7104
1,1,1,0,516.21997,1.53,12.3,4684.8142,0.08,,4213.0,29.79,2.78,2.43,1.56,41.04437,108.719,125.321,7.95388
2,1,1,0,185.84,0.83,12.9,1525.5,0.0,,4813.0,11.0,2.2,1.763,4.5,-21.05141,106.41269,75.4392,13.2289
3,1,2,0,1773.40002,2.93,12.9,1481.0878,0.37,,5338.0,0.93,0.9,-0.153,3.9,46.94447,69.16849,17.9323,55.7363
4,3,1,0,798.5,1.66,13.5,565.7374,0.68,,5750.0,1.13,1.08,0.097,7.4,13.20446,83.33558,21.1397,47.2754


In [29]:
# Overview into data types and uniqueness
print('Unique rows =', df.shape[0], '| Unique columns =', df.shape[1])
print('-----')
print('Data types: ', df.dtypes)
print('-----')
print('Unique values per column: ', df.nunique())

Unique rows = 4521 | Unique columns = 18
-----
Data types:  num_star             int64
num_planet           int64
2_stars              int64
orbital_period     float64
semi-major_axis    float64
planet_radius      float64
planet_mass        float64
planet_eccen       float64
planet_temp        float64
star_temp          float64
star_radius        float64
star_mass          float64
star_bright        float64
star_age           float64
latitude_gal       float64
longitude_gal      float64
distance           float64
parallax           float64
dtype: object
-----
Unique values per column:  num_star              4
num_planet            8
2_stars               2
orbital_period     4364
semi-major_axis    2485
planet_radius      1235
planet_mass        1955
planet_eccen        437
planet_temp        1412
star_temp          1895
star_radius         397
star_mass           227
star_bright        1739
star_age            548
latitude_gal       3349
longitude_gal      3350
distance           3317

### 🔎 **Finding Outliers** <a name="Finding_Outliers"></a>



Using the standard 1.5* outlier check, systems with either planets > 4, or star > 2 will be considered to be outliers. 

We know that these are reasonable data points therefore remvoing them will not be helpful for our study. Because of this we have decided to use a 3.0* outlier check which is common for astronomical data. (3)

---



In [None]:
# Find outliers of every column and store them into dictionary
# dict = {}
excluded_columns = [
                    'num_star',
                    'num_planet',
                    '2_stars',
]
for column_name in df.columns: 
    # conditional to exclude certain columns from outlier check
    if column_name in excluded_columns:
        continue
    else:
        column = df[column_name]
        q1 = column.quantile(0.25)
        q3 = column.quantile(0.75)
        iqr = column.quantile(0.75) - column.quantile(0.25)

        lower = q1 - 3 * iqr
        upper = q3 + 3 * iqr
        num_column_outliers = df[(column > upper) | (column < lower)].shape[0]

        # Add to list as dict
        # dict[column] = [lower, upper, num_column_outliers]

        # set rows that exceeds outlier parameters to none
        df[(column > upper) | (column < lower)].fillna(0)

df

Unnamed: 0,num_star,num_planet,2_stars,orbital_period,semi-major_axis,planet_radius,planet_mass,planet_eccen,planet_temp,star_temp,star_radius,star_mass,star_bright,star_age,latitude_gal,longitude_gal,distance,parallax
0,2,1,0,326.030000,1.290000,12.1,6165.60000,0.2310,,4742.00,19.00,2.70,2.243,,78.28058,264.13775,93.1846,10.71040
1,1,1,0,516.219970,1.530000,12.3,4684.81420,0.0800,,4213.00,29.79,2.78,2.430,1.56,41.04437,108.71900,125.3210,7.95388
2,1,1,0,185.840000,0.830000,12.9,1525.50000,0.0000,,4813.00,11.00,2.20,1.763,4.50,-21.05141,106.41269,75.4392,13.22890
3,1,2,0,1773.400020,2.930000,12.9,1481.08780,0.3700,,5338.00,0.93,0.90,-0.153,3.90,46.94447,69.16849,17.9323,55.73630
4,3,1,0,798.500000,1.660000,13.5,565.73740,0.6800,,5750.00,1.13,1.08,0.097,7.40,13.20446,83.33558,21.1397,47.27540
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,1,1,0,305.500000,1.170000,12.1,6547.00000,0.0310,,4388.00,26.80,2.30,2.519,1.22,17.19879,187.21879,112.5370,8.87786
4517,2,3,0,4.617033,0.059222,14.0,218.53100,0.0215,,6156.77,1.56,1.30,0.525,5.00,-20.66791,132.00045,13.4054,74.57110
4518,2,3,0,241.258000,0.827774,12.3,4443.24113,0.2596,,6156.77,1.56,1.30,0.525,5.00,-20.66791,132.00045,13.4054,74.57110
4519,2,3,0,1276.460000,2.513290,12.5,3257.74117,0.2987,,6156.77,1.56,1.30,0.525,5.00,-20.66791,132.00045,13.4054,74.57110


In [None]:
# print the dictionary
# for key, value in dict.items():
#     print(f'''{key.upper()} | lower: {value[0]}, upper: {value[1]}, 
#     num of outliers: {value[2]}
#     ''')

### 💻 **Processing Rows** <a name="Processing_Rows"></a>

#### *Dropping all NaN values*

In [None]:
# Overview of null values
df.isna().sum()

num_star              0
num_planet            0
2_stars               0
orbital_period      151
semi-major_axis     184
planet_radius        14
planet_mass          22
planet_eccen        540
planet_temp        1146
star_temp           122
star_radius         141
star_mass             4
star_bright         136
star_age            814
latitude_gal          0
longitude_gal         0
distance              6
parallax            192
dtype: int64

In [22]:
df = df.dropna()

nan_values = False
for column in df.isna().sum():
    if not column == 0:
        nan_values = True
        break

print('NaN Values?', nan_values)
print()
print('Shape:', df.shape)

NaN Values? False

Shape: (2895, 18)


## Summary & Conclusions <a name="Summary_and_Conclusions"></a>
 of the first phase of your project: a comprehensive summary of Phase 1 and any insights you gained in this phase as they relate to your goals and objectives.

## Literature Review <a name="Literature_Review"></a>


##

# References <a name="References"></a>
1. NASA Exoplanet Archive. (n.d.). Retrieved September 27, 2021, from https://exoplanetarchive.ipac.caltech.edu/ "This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program."
2. Greicius, T. (2018, April 12). What in the World is an 'Exoplanet?' Retrieved from https://www.nasa.gov/feature/jpl/what-in-the-world-is-an-exoplanet
3. Zhang, Y., Luo, A., & Zhao, Y. (2004). Outlier detection in astronomical data. Optimizing Scientific Return for Astronomy through Information Technologies. doi:10.1117/12.550998
4. Feature Selection and Ranking in Machine Learning
http://www.featureranking.com/

