<a href="https://colab.research.google.com/github/RafDingo/Maths40/blob/main/Maths40.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exoplanet Discoverability using Linear Regression
## Introduction
Humanity have discovered thousands of planets outside of our solar system, using a variety of methods. What most of these methods have in common, is the detection of a planet affecting a signal from a visible star. This could be the planet passing in front of the star from our perspective, lowering its brightness, or a wobble imparted by the mass of a planet, as both the planet's and star's gravity pulls at eachother.
We suspect that the methods used to discover planets, would impart a bias on the way which we discover these bodies, and hope to exoplore this link. 
### Dataset Source
The dataset was produced by the NASA Exoplanet Archive (1)
### Dataset Details
The dataset is a subset of information provided by NASA about all discovered Exo-planets. An exoplanet or extra-solar planet any planet sized body outside of our solar system. These can either orbit another star, or be star-less. (2)

NASA's archive provides many features, which we limited to 28, and about 5000 datapoints.
### Dataset Features
The features in our dataset are described in the table below.


In [53]:
from tabulate import tabulate

from tabulate import tabulate

table = [['Name','Data Type','Units','Description'],
         ['sy_snum', 'Numeric', 'NA', 'Number of stars in the system'],
         ['sy_pnum', 'Numeric', 'NA', 'Number of planets in the system'],
         ['sy_mnum', 'Numeric', 'NA', 'Number of Moons the system  has'],
         ['cb_flag', 'Nominal categorical', 'NA', 'Circumbinary flag: whether the planet orbits 2 stars'],
         ['pl_orbper', 'Numeric', 'Earth days', 'Orbital period (Time it takes planet to complete an orbit'],
         ['pl_orbsmax', 'Numeric', 'au', 'Orbit semi-Major Axis. au is the distance from Earth to sun.'],
         ['pl_rade', 'Numeric', 'Earth radius', 'Planet radius, where 1.0 is Earth\'s radius'],
         ['pl_bmasse', 'Numeric', 'Earth Mass', 'Planetary Mass, where 1.0 is Earth\'s mass'],
         ['pl_dens', 'Numeric', 'g/cm**3', 'Planet density'],
         ['pl_orbeccen', 'Numeric', 'NA', 'Planet\s orbital eccentricity'],
         ['pl_insol', 'Numeric', 'Earth flux', 'Insolation Flux: the amount of downward solar radiation energy incident on a planet\'s surface'],
         ['pl_eqt', 'Numeric', 'Kelvin', 'Equilibrium Temperature: (The planetary equilibrium temperature is a theoretical temperature that a planet would be a black body being heated only by its parent star)'],
         ['ttv_flag', 'Nominal categorical', 'NA', 'Data show Transit Timing Variations (How the planet is found)'],
         ['st_spectype', 'Ordinal categorical', 'NA', 'Star Spectral Type'],
         ['st_teff', 'Numeric', 'Kelvin', 'Stellar Effective Temperature'],
         ['st_rad', 'Numeric', 'Solar Radius', 'Stellar Radius, where 1.0 is 1 of our Sun\'s radius'],
         ['st_mass', 'Numeric', 'Solar Mass', 'Stellar Mass, where 1.0 is 1* our Sun\'s mass'],
         ['st_metratio', 'Categorical', 'NA', 'Stellar Metallicity Ratio (Elemental composition)'],
         ['st_lum', 'Numeric', 'log(Solar luminosity)', 'Stellar Luminosity'],
         ['st_logg', 'Numeric', 'log10(cm/s**2)', 'Stellar Surface Gravity'],
         ['st_age', 'Numeric', 'gyr (Gigayear)', 'Stellar Age'],
         ['st_rotp', 'Numeric', 'days', 'Stellar Rotational Period'],
         ['glat', 'Numeric', 'degrees', 'Galactic Latitude'],
         ['glon', 'Numeric', 'degrees', 'Galactic Longitude'],
         ['sy_pmdec', 'Numeric', 'mas/yr', 'Proper motion: distance moved in our night sky, measured in milisecond of Arc per year.'],
         ['sy_dist', 'Numeric', 'parsec', 'Distance'],
         ['sy_plx', 'Numeric', 'mas (miliarcseconds)', 'Parallax: Distance the star moves in relation to other objects in the night sky'],
        ]

print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))

╒═════════════╤═════════════════════╤═══════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Name        │ Data Type           │ Units                 │ Description                                                                                                                                                            │
╞═════════════╪═════════════════════╪═══════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ sy_snum     │ Numeric             │ NA                    │ Number of stars in the system                                                                                                                                          │
├─────────────┼─────────────────────┼───────────────────────┼───────────────

### Target Feature
For this project, the target feature in this dataset will be the house price in Australian dollars. That is, the price of Melbourne houses will be predicted based on the explanatory / descriptive variables.

Target Feature:
sy_pnum

Variables required:
sy_snum, [num_star]
sy_pnum, [num_planet]
sy_mnum, (maybe) [moon_num]
cb_flag, [2_stars]
pl_orbper, [orbital_period]
pl_orbsmax, (maybe) [semi-major_axis]
pl_rade, [planet_radius]
pl_bmasse, [planet_mass]
pl_orbeccen, [planet_eccentricity]
pl_eqt, [planet_temp]
st_teff, [star_temp]
st_rad, [star_radius]
st_mass, [star_mass]
st_lum, [star_luminosity]
st_age, [star_age]
glat, [latitude_gal]
glon, [longitude_gal]
sy_dist, [distance]
sy_plx [parallax]

## Goals and Objectives

Melbourne has a very active housing market as demand for housing throughout the city is usually well above the supply. For this reason, a model that can accurately predict a house's selling price in Melbourne would have many real-world uses. For instance, real estate agents can provide better service to their customers using this predictive model. Likewise, banks lending out money to home buyers can better estimate the financial aspects of this home loan. Perhaps more importantly as potential home buyers, we as individuals can better figure out if we are being ripped off or we are getting a good deal provided that our predictive model is a reliable one.

Thus, the main objective of this project is two-fold: (1) predict the price of a house sold in Melbourne based on publicly available features of the house, and (2) which features seem to be the best predictors of the house sale price. A secondary objective is to perform some exploratory data analysis by basic descriptive statistics & data visualisation plots to gain some insight into the patterns and relationships existing in the data subsequent to some data cleaning & preprocessing, which is the subject of this Phase 1 report.

At this point, we make the important assumption that rows in our dataset are not correlated. That is, we assume that house price observations are independent of one another in this dataset. Of course, this is not a very realistic assumption, however, this assumption allows us to circumvent time series aspects of the underlying dynamics of house prices and also to resort to rather classical predictive models such as multiple linear regression.
## Data Cleaning and Preprocessing
In this section, we describe the data cleaning and preprocessing steps undertaken for this project.
### Data Retrieval¶
- We read in the dataset from our GitHub repository and load the modules we will use throughout this report
- We display 10 randomly sampled rows from this dataset.


In [83]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

pd.set_option('display.max_columns', None) 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

# For mac users
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)): 
    ssl._create_default_https_context = ssl._create_unverified_context

In [114]:
url = 'https://raw.githubusercontent.com/RafDingo/Maths40/main/data/exoPlanets.csv'
df_main = pd.read_csv(url, error_bad_lines=False)

In [115]:
print('Shape:', df_main.shape)
df_main.sample(10, random_state=999)

Shape: (4521, 28)


Unnamed: 0,loc_rowid,sy_snum,sy_pnum,sy_mnum,cb_flag,pl_orbper,pl_orbsmax,pl_rade,pl_bmasse,pl_dens,pl_orbeccen,pl_insol,pl_eqt,ttv_flag,st_spectype,st_teff,st_rad,st_mass,st_metratio,st_lum,st_logg,st_age,st_rotp,glat,glon,sy_pmdec,sy_dist,sy_plx
3009,3010,1,3,0,0,6.988055,0.069,1.15,7.3,26.4,0.0,101.84,914.0,0,,5631.0,0.8,0.93,[M/H],-0.313,4.59,4.37,,13.51572,80.80071,-7.50588,616.637,1.5932
1852,1853,1,1,0,0,10.458434,0.0845,1.33,2.33,5.44,0.0,24.21,609.0,0,,4730.0,0.72,0.76,[Fe/H],-0.727,4.62,3.98,,18.96771,79.57519,-7.08245,342.509,2.89083
2512,2513,1,1,0,0,33.4969,0.2008,1.26,2.12,5.83,0.0,37.3,630.0,0,,4681.53,0.77,0.75,[M/H],-0.588,4.54,,,15.04776,74.44132,-19.4631,247.771,4.00705
1469,1470,1,4,0,0,5.577212,0.04,1.08,1.28,5.58,0.11,8.5,,0,,3360.47,0.33,0.27,,-1.837,4.9,,,-49.94032,51.46106,74.3391,66.4321,15.0242
3589,3590,1,1,0,0,2.237493,0.0325,1.33,2.33,5.44,0.0,336.67,1160.0,0,,5099.0,0.76,0.8,[Fe/H],-0.501,4.58,4.9,,12.92363,73.89376,0.090604,591.776,1.66115
3347,3348,1,1,0,0,18.684049,0.1396,2.63,7.41,2.24,0.0,176.23,832.0,0,,5573.0,1.15,0.95,[Fe/H],0.525,4.29,10.47,,11.33166,80.6889,1.32626,992.852,0.978491
4251,4252,1,7,0,0,2.421937,0.0158,1.097,1.308,5.463993,0.00654,2.21,342.0,1,,2566.0,0.12,0.09,[Fe/H],-3.257,5.24,0.5,1.4,-56.64891,69.71519,-492.0,12.429889,80.451243
2436,2437,1,1,0,0,9.878482,0.0932,2.58,7.18,2.3,0.0,187.85,1003.0,0,,6150.0,1.18,1.12,[Fe/H],0.145,4.34,3.16,,19.93538,78.45346,7.76766,1031.31,0.940878
2654,2655,1,3,0,0,3.619337,0.046,1.48,2.79,4.73,0.0,460.66,1156.0,0,,5502.0,1.06,1.02,[Fe/H],0.009,4.39,6.3,,11.2887,79.45283,20.872,656.38,1.49472
2570,2571,1,2,0,0,2.061897,0.032,1.68,3.46,4.01,0.0,1069.23,1463.0,0,,6021.0,1.14,0.98,[Fe/H],0.034,4.35,5.8,,19.64002,73.51949,-14.7743,968.544,1.00392


## Preparing data

### 📃 **Instructions**
In general, the following steps will be necessary for data preparation **in this specific order**:
1. Outliers and unusual values (such as a negative age) are taken care of: they are either imputed, dropped, or set to missing values.
2. Missing values are imputed or the rows containing them are dropped.
3. Any categorical descriptive feature is encoded to be numeric as follows: 
   - one-hot-encoding for nominals, 
   - one-hot-encoding or integer-encoding for ordinals.
4. All descriptive features (which are all numeric at this point) are scaled.
5. In case of a classification problem, the target feature is label-encoded (in case of a binary problem, the positive class is encoded as 1).
6. If the dataset has too many observations, only a small random subset of entire dataset is selected to be used during model tuning and model comparison. 
7. Before fitting any `Scikit-Learn` models, any `Pandas` series or data frame is converted to a `NumPy` array using the `values` method in Pandas.

---




In [124]:
# Copy orignal data to a a working dataframe
df = df_main.copy()

In [94]:
# View all columns to find the ones required for regression
df.columns

Index(['loc_rowid', 'sy_snum', 'sy_pnum', 'sy_mnum', 'cb_flag', 'pl_orbper',
       'pl_orbsmax', 'pl_rade', 'pl_bmasse', 'pl_dens', 'pl_orbeccen',
       'pl_insol', 'pl_eqt', 'ttv_flag', 'st_spectype', 'st_teff', 'st_rad',
       'st_mass', 'st_metratio', 'st_lum', 'st_logg', 'st_age', 'st_rotp',
       'glat', 'glon', 'sy_pmdec', 'sy_dist', 'sy_plx'],
      dtype='object')

### ⛔ **Removing unnecessary columns**

The following columns has been identified to be either useless towards our analysis of the target feature or is not suitable for machine learning.

Variables to drop:
  - loc_rowid (ID column)
  - pl_dens (Data already covered by mass and size)
  - pl_insol (Similar feature to size)
  - ttv_flag (Data not suitable for our analysis)
  - st_spectype (Categorical version of solar luminosity)
  - st_metratio (Bad values)
  - st_logg (Similar feature to solar mass and size)
  - st_rotp (Data unrelated to our target feature)
  - sy_pmdec (Data unrelated to our target feature)

In [125]:
# Drop id and irrelevant columns
del df['loc_rowid']
del df['pl_dens']
del df['pl_insol']
del df['ttv_flag']
del df['st_spectype']
del df['st_metratio']
del df['st_logg']
del df['st_rotp']
del df['sy_pmdec']
del df['sy_mnum']
# data check
df.columns

Index(['sy_snum', 'sy_pnum', 'cb_flag', 'pl_orbper', 'pl_orbsmax', 'pl_rade',
       'pl_bmasse', 'pl_orbeccen', 'pl_eqt', 'st_teff', 'st_rad', 'st_mass',
       'st_lum', 'st_age', 'glat', 'glon', 'sy_dist', 'sy_plx'],
      dtype='object')

In [126]:
# Change to readable column names.
df.rename({
      'sy_snum': 'num_star',
      'sy_pnum': 'num_planet',
      'cb_flag': '2_stars',
      'pl_orbper': 'orbital_period',
      'pl_orbsmax': 'semi-major_axis',
      'pl_rade': 'planet_radius',
      'pl_bmasse': 'planet_mass',
      'pl_orbeccen': 'planet_eccen',
      'pl_eqt': 'planet_temp',
      'st_teff': 'star_temp',
      'st_rad': 'star_radius',
      'st_mass': 'star_mass',
      'st_lum': 'star_bright',
      'st_age': 'star_age',
      'glat': 'latitude_gal',
      'glon': 'longitude_gal',
      'sy_dist': 'distance',
      'sy_plx': 'parallax'
      }, 
      axis=1, inplace=True
)
# data check
df.head()

Unnamed: 0,num_star,num_planet,2_stars,orbital_period,semi-major_axis,planet_radius,planet_mass,planet_eccen,planet_temp,star_temp,star_radius,star_mass,star_bright,star_age,latitude_gal,longitude_gal,distance,parallax
0,2,1,0,326.03,1.29,12.1,6165.6,0.231,,4742.0,19.0,2.7,2.243,,78.28058,264.13775,93.1846,10.7104
1,1,1,0,516.21997,1.53,12.3,4684.8142,0.08,,4213.0,29.79,2.78,2.43,1.56,41.04437,108.719,125.321,7.95388
2,1,1,0,185.84,0.83,12.9,1525.5,0.0,,4813.0,11.0,2.2,1.763,4.5,-21.05141,106.41269,75.4392,13.2289
3,1,2,0,1773.40002,2.93,12.9,1481.0878,0.37,,5338.0,0.93,0.9,-0.153,3.9,46.94447,69.16849,17.9323,55.7363
4,3,1,0,798.5,1.66,13.5,565.7374,0.68,,5750.0,1.13,1.08,0.097,7.4,13.20446,83.33558,21.1397,47.2754


In [None]:
# Overview into data types and uniqueness
print('Unique rows =', df.shape[0], '| Unique columns =', df.shape[1])
print('-----')
print('Data types: ', df.dtypes)
print('-----')
print('Unique values per column: ', df.nunique())

### 🔎 **Finding Outliers**:


Using the standard 1.5* outlier check, systems with either planets > 4, or star > 2 will be considered to be outliers. 

We know that these are reasonable data points therefore remvoing them will not be helpful for our study. Because of this we have decided to use a 3.0* outlier check which is common for astronomical data. (3)

---



In [127]:
# Find outliers of every column and store them into dictionary
# dict = {}
excluded_columns = [
                    'num_star',
                    'num_planet',
                    '2_stars',
]
for column_name in df.columns: 
    # conditional to exclude certain columns from outlier check
    if column_name in excluded_columns:
        continue
    else:
        column = df[column_name]
        q1 = column.quantile(0.25)
        q3 = column.quantile(0.75)
        iqr = column.quantile(0.75) - column.quantile(0.25)

        lower = q1 - 3 * iqr
        upper = q3 + 3 * iqr
        num_column_outliers = df[(column > upper) | (column < lower)].shape[0]

        # Add to list as dict
        # dict[column] = [lower, upper, num_column_outliers]

        # drop rows that exceeds outlier parameters
        df = df[(column < upper) | (column > lower)]

df

Unnamed: 0,num_star,num_planet,2_stars,orbital_period,semi-major_axis,planet_radius,planet_mass,planet_eccen,planet_temp,star_temp,star_radius,star_mass,star_bright,star_age,latitude_gal,longitude_gal,distance,parallax
24,3,1,0,11688.000000,12.00000,13.400,635.66000,0.4500,700.0,7295.00,1.49,1.65,0.752,0.020,-30.65764,198.61297,29.7575,33.57700
26,2,5,0,14.651600,0.11340,13.900,263.97850,0.0000,700.0,5172.00,0.94,0.91,-0.197,5.500,37.69663,196.79526,12.5855,79.42740
29,2,5,0,0.736547,0.01544,1.875,7.99000,0.0500,1958.0,5172.00,0.94,0.91,-0.197,10.200,37.69663,196.79526,12.5855,79.42740
43,1,2,0,8.463000,0.06450,4.070,17.00000,0.0000,593.0,3700.00,0.75,0.50,-1.046,0.022,-36.80401,12.65304,9.7221,102.82900
44,1,2,0,18.859019,0.11010,3.240,13.60000,0.0000,454.0,3700.00,0.75,0.50,-1.065,0.022,-36.80401,12.65304,9.7221,102.82900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4479,1,1,0,2.864142,0.04421,15.390,225.34147,0.0380,1743.0,6250.00,1.48,1.41,0.431,1.180,27.47379,117.58141,234.1490,4.24195
4487,1,2,0,7665.000000,9.10000,18.495,3496.13000,0.0800,1612.0,8038.68,1.54,1.76,0.953,0.012,-30.61167,258.36341,19.7442,50.62310
4507,1,2,0,6.267900,0.06839,2.042,4.82000,0.0000,1170.0,6037.00,1.10,1.09,0.160,2.980,-29.77608,292.50808,18.2702,54.70520
4509,1,2,0,39.845800,0.21960,13.800,332.10000,0.0373,614.0,5627.00,1.36,0.89,0.232,10.000,48.92412,53.48367,17.4671,57.22160


In [None]:
# print the dictionary
# for key, value in dict.items():
#     print(f'''{key.upper()} | lower: {value[0]}, upper: {value[1]}, 
#     num of outliers: {value[2]}
#     ''')

### Dropping columns

#### Dropping of NaN values

In [47]:
# Overview of null values
df.isna().sum()

num_star              0
num_planet            0
moon_num              0
2_stars               0
orbital_period      151
semi-major_axis     184
planet_radius        14
planet_mass          22
planet_eccen        540
planet_temp        1146
star_temp           122
star_radius         141
star_mass             4
star_bright         136
star_age            814
latitude_gal          0
longitude_gal         0
distance              6
parallax            192
dtype: int64

In [51]:
df = df.dropna()
df.isna().sum()

(2895, 19)

# Referances
1. http://exoplanetarchive.ipac.caltech.edu/
2. https://www.nasa.gov/feature/jpl/what-in-the-world-is-an-exoplanet
3. http://www.china-vo.org/sites/default/files/docs/spie04_zyx_2a.pdf


