<a href="https://colab.research.google.com/github/RafDingo/Maths40/blob/main/Maths40.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Title
## Introduction
### Dataset Source
The dataset was produced by the NASA Exoplanet Archive (http://exoplanetarchive.ipac.caltech.edu).
### Dataset Details
The dataset describes exoplanets and the systems that host them. It contains system-level counts (number of stars, planets, and moons), planetary properties (orbital period, semi-major axis, radius, mass, density, eccentricity, insolation flux, and equilibrium temperature), host-star properties (spectral type, effective temperature, radius, mass, metallicity ratio, luminosity, surface gravity, age, and rotation period), and position information (galactic latitude and longitude, proper motion, distance, and parallax). These features seem to be sufficient for an attempt at predictive modeling of the number of planets in a system as a regression problem.

This dataset has a total of 28 columns (27 features plus the row identifier `loc_rowid`) and 4521 observations.
### Dataset Features
The features in our dataset are described in the table below.


In [32]:
from tabulate import tabulate

table = [['Name','Data Type','Units','Description'],
         ['sy_snum', 'Numeric', 'NA', 'Number of stars in the system'],
         ['sy_pnum', 'Numeric', 'NA', 'Number of planets in the system'],
         ['sy_mnum', 'Numeric', 'NA', 'Number of moons in the system'],
         ['cb_flag', 'Nominal categorical', 'NA', 'Circumbinary flag: whether the planet orbits 2 stars'],
         ['pl_orbper', 'Numeric', 'Earth days', 'Orbital period (time it takes the planet to complete an orbit)'],
         ['pl_orbsmax', 'Numeric', 'au', 'Orbit semi-major axis, where 1 au is the mean distance from Earth to the Sun'],
         ['pl_rade', 'Numeric', 'Earth radius', 'Planet radius, where 1.0 is Earth\'s radius'],
         ['pl_bmasse', 'Numeric', 'Earth Mass', 'Planetary Mass, where 1.0 is Earth\'s mass'],
         ['pl_dens', 'Numeric', 'g/cm**3', 'Planet density'],
         ['pl_orbeccen', 'Numeric', 'NA', 'Planet\'s orbital eccentricity'],
         ['pl_insol', 'Numeric', 'Earth flux', 'Insolation Flux: the amount of downward solar radiation energy incident on a planet\'s surface'],
         ['pl_eqt', 'Numeric', 'Kelvin', 'Equilibrium Temperature: the theoretical temperature a planet would have if it were a black body heated only by its parent star'],
         ['ttv_flag', 'Nominal categorical', 'NA', 'Transit Timing Variations flag: whether the data show transit timing variations'],
         ['st_spectype', 'Ordinal categorical', 'NA', 'Star Spectral Type'],
         ['st_teff', 'Numeric', 'Kelvin', 'Stellar Effective Temperature'],
         ['st_rad', 'Numeric', 'Solar Radius', 'Stellar Radius, where 1.0 is our Sun\'s radius'],
         ['st_mass', 'Numeric', 'Solar Mass', 'Stellar Mass, where 1.0 is our Sun\'s mass'],
         ['st_metratio', 'Categorical', 'NA', 'Stellar Metallicity Ratio (Elemental composition)'],
         ['st_lum', 'Numeric', 'log10(Solar luminosity)', 'Stellar Luminosity'],
         ['st_logg', 'Numeric', 'log10(cm/s**2)', 'Stellar Surface Gravity'],
         ['st_age', 'Numeric', 'Gyr (gigayears)', 'Stellar Age'],
         ['st_rotp', 'Numeric', 'days', 'Stellar Rotational Period'],
         ['glat', 'Numeric', 'degrees', 'Galactic Latitude'],
         ['glon', 'Numeric', 'degrees', 'Galactic Longitude'],
         ['sy_pmdec', 'Numeric', 'mas/yr', 'Proper motion in declination: angular movement across our night sky, measured in milliarcseconds per year'],
         ['sy_dist', 'Numeric', 'parsec', 'Distance'],
         ['sy_plx', 'Numeric', 'mas (milliarcseconds)', 'Parallax: apparent angular shift of the star against distant background objects as Earth orbits the Sun'],
        ]

print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))

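The `sy_plx` and `sy_dist` columns described in the table are closely related, which is worth keeping in mind if both are used as explanatory variables. The sketch below assumes the standard relation between parallax and distance, d [parsec] = 1000 / p [mas]. The helper name `parallax_mas_to_distance_pc` is hypothetical and is not part of the original notebook; the two value pairs are copied from rows of this dataset.

```python
def parallax_mas_to_distance_pc(parallax_mas: float) -> float:
    """Convert a parallax in milliarcseconds to a distance in parsecs.

    Assumes the simple inverse relation d = 1000 / p, which is only
    meaningful for positive parallaxes.
    """
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive")
    return 1000.0 / parallax_mas


# (sy_plx in mas, sy_dist in parsec) pairs copied from rows of the dataset
examples = [(80.451243, 12.429889), (1.5932, 616.637)]

for plx, catalogued in examples:
    estimated = parallax_mas_to_distance_pc(plx)
    print(f"parallax={plx} mas -> estimated {estimated:.3f} pc, "
          f"catalogued {catalogued} pc")
```

For the nearby system the simple inverse reproduces the catalogued distance closely, while for the more distant one the two differ by a few percent, so the catalogued distance is not always the plain inverse of the parallax.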

### Target Feature
For this project, the target feature in this dataset will be `sy_pnum`, the number of planets in the system. That is, the number of planets in a system will be predicted based on the explanatory / descriptive variables.

Target feature:
- sy_pnum [num_planet]

Variables required (original column name, followed by the readable name used after renaming):
- sy_snum [num_star]
- sy_pnum [num_planet]
- sy_mnum (maybe) [moon_num]
- cb_flag [2_stars]
- pl_orbper [orbital_period]
- pl_orbsmax (maybe) [semi-major_axis]
- pl_rade [planet_radius]
- pl_bmasse [planet_mass]
- pl_orbeccen [planet_eccen]
- pl_eqt [planet_temp]
- st_teff [star_temp]
- st_rad [star_radius]
- st_mass [star_mass]
- st_lum [star_bright]
- st_age [star_age]
- glat [latitude_gal]
- glon [longitude_gal]
- sy_dist [distance]
- sy_plx [parallax]

## Goals and Objectives

The NASA Exoplanet Archive records, for each planet, properties of the planet and of its host star together with the number of planets known in the system. A model that can predict the number of planets in a system from these properties could help indicate which stellar and planetary characteristics are associated with multi-planet systems, provided that the predictive model is a reliable one.

Thus, the main objective of this project is two-fold: (1) predict the number of planets in a system based on the publicly available features of its star and planets, and (2) identify which features seem to be the best predictors of the number of planets. A secondary objective is to perform some exploratory data analysis using basic descriptive statistics and data visualisation plots, subsequent to some data cleaning and preprocessing, to gain insight into the patterns and relationships in the data. This is the subject of this Phase 1 report.

At this point, we make the important assumption that rows in our dataset are not correlated. That is, we assume that the observations are independent of one another in this dataset. Of course, this is not a very realistic assumption, since rows that belong to the same planetary system share the same host-star and system-level values. However, this assumption allows us to resort to rather classical predictive models such as multiple linear regression.
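To make the reference to multiple linear regression concrete, the following is a minimal sketch of an ordinary least-squares fit with NumPy. It uses synthetic data only: the two explanatory variables, the coefficients, and the noise level are all made up for illustration and are not taken from the exoplanet dataset or from any model fitted in this report.

```python
import numpy as np

# Synthetic data: 200 independent observations of two explanatory variables.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))

# Hypothetical "true" model used only to generate the response.
true_intercept = 2.0
true_coef = np.array([1.5, -0.5])
y = true_intercept + X @ true_coef + rng.normal(scale=0.1, size=n)

# Add a column of ones so the first fitted coefficient is the intercept.
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares: minimise ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated intercept:", beta[0])
print("estimated coefficients:", beta[1:])
```

The least-squares estimates are only well behaved under assumptions such as independent observations, which is why the independence assumption is stated explicitly.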
## Data Cleaning and Preprocessing
In this section, we describe the data cleaning and preprocessing steps undertaken for this project.
### Data Retrieval
- We read in the dataset from our GitHub repository and load the modules we will use throughout this report
- We display 10 randomly sampled rows from this dataset.


In [33]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

pd.set_option('display.max_columns', None) 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

# For mac users
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)): 
    ssl._create_default_https_context = ssl._create_unverified_context

In [34]:
url = 'https://raw.githubusercontent.com/RafDingo/Maths40/main/exoPlanets.csv'
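# error_bad_lines was removed in pandas 2.0; on_bad_lines='warn' (pandas >= 1.3)
# skips malformed lines and reports them
df_main = pd.read_csv(url, on_bad_lines='warn')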

In [35]:
print('Shape:', df_main.shape)
df_main.sample(10, random_state=999)

Shape: (4521, 28)


Unnamed: 0,loc_rowid,sy_snum,sy_pnum,sy_mnum,cb_flag,pl_orbper,pl_orbsmax,pl_rade,pl_bmasse,pl_dens,pl_orbeccen,pl_insol,pl_eqt,ttv_flag,st_spectype,st_teff,st_rad,st_mass,st_metratio,st_lum,st_logg,st_age,st_rotp,glat,glon,sy_pmdec,sy_dist,sy_plx
3009,3010,1,3,0,0,6.988055,0.069,1.15,7.3,26.4,0.0,101.84,914.0,0,,5631.0,0.8,0.93,[M/H],-0.313,4.59,4.37,,13.51572,80.80071,-7.50588,616.637,1.5932
1852,1853,1,1,0,0,10.458434,0.0845,1.33,2.33,5.44,0.0,24.21,609.0,0,,4730.0,0.72,0.76,[Fe/H],-0.727,4.62,3.98,,18.96771,79.57519,-7.08245,342.509,2.89083
2512,2513,1,1,0,0,33.4969,0.2008,1.26,2.12,5.83,0.0,37.3,630.0,0,,4681.53,0.77,0.75,[M/H],-0.588,4.54,,,15.04776,74.44132,-19.4631,247.771,4.00705
1469,1470,1,4,0,0,5.577212,0.04,1.08,1.28,5.58,0.11,8.5,,0,,3360.47,0.33,0.27,,-1.837,4.9,,,-49.94032,51.46106,74.3391,66.4321,15.0242
3589,3590,1,1,0,0,2.237493,0.0325,1.33,2.33,5.44,0.0,336.67,1160.0,0,,5099.0,0.76,0.8,[Fe/H],-0.501,4.58,4.9,,12.92363,73.89376,0.090604,591.776,1.66115
3347,3348,1,1,0,0,18.684049,0.1396,2.63,7.41,2.24,0.0,176.23,832.0,0,,5573.0,1.15,0.95,[Fe/H],0.525,4.29,10.47,,11.33166,80.6889,1.32626,992.852,0.978491
4251,4252,1,7,0,0,2.421937,0.0158,1.097,1.308,5.463993,0.00654,2.21,342.0,1,,2566.0,0.12,0.09,[Fe/H],-3.257,5.24,0.5,1.4,-56.64891,69.71519,-492.0,12.429889,80.451243
2436,2437,1,1,0,0,9.878482,0.0932,2.58,7.18,2.3,0.0,187.85,1003.0,0,,6150.0,1.18,1.12,[Fe/H],0.145,4.34,3.16,,19.93538,78.45346,7.76766,1031.31,0.940878
2654,2655,1,3,0,0,3.619337,0.046,1.48,2.79,4.73,0.0,460.66,1156.0,0,,5502.0,1.06,1.02,[Fe/H],0.009,4.39,6.3,,11.2887,79.45283,20.872,656.38,1.49472
2570,2571,1,2,0,0,2.061897,0.032,1.68,3.46,4.01,0.0,1069.23,1463.0,0,,6021.0,1.14,0.98,[Fe/H],0.034,4.35,5.8,,19.64002,73.51949,-14.7743,968.544,1.00392
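The sampled rows already show missing values (for example in `st_rotp` and `st_age`), so a per-column count of missing entries is a natural first check before cleaning. The sketch below is self-contained and uses a small hypothetical frame named `sample_rows`, built from four of the sampled rows and four of their columns; on the full data the same calls would be applied to `df_main`.

```python
import numpy as np
import pandas as pd

# Four of the sampled rows (a subset of columns), copied from the output above.
sample_rows = pd.DataFrame({
    'pl_eqt':  [914.0, 609.0, 630.0, np.nan],
    'st_age':  [4.37, 3.98, np.nan, np.nan],
    'st_rotp': [np.nan, np.nan, np.nan, np.nan],
    'sy_dist': [616.637, 342.509, 247.771, 66.4321],
})

# Count and percentage of missing values per column, most incomplete first.
summary = pd.DataFrame({
    'n_missing': sample_rows.isna().sum(),
    'pct_missing': (sample_rows.isna().mean() * 100).round(1),
}).sort_values('n_missing', ascending=False)

print(summary)
```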


## Preparing data

In [36]:
# Copy original data to a working dataframe
df = df_main.copy()

In [40]:
# View all columns to find the ones required for regression
df.columns

Index(['loc_rowid', 'sy_snum', 'sy_pnum', 'sy_mnum', 'cb_flag', 'pl_orbper',
       'pl_orbsmax', 'pl_rade', 'pl_bmasse', 'pl_dens', 'pl_orbeccen',
       'pl_insol', 'pl_eqt', 'ttv_flag', 'st_spectype', 'st_teff', 'st_rad',
       'st_mass', 'st_metratio', 'st_lum', 'st_logg', 'st_age', 'st_rotp',
       'glat', 'glon', 'sy_pmdec', 'sy_dist', 'sy_plx'],
      dtype='object')

### Removing unnecessary columns

The following columns have been identified as either irrelevant to our analysis of the target feature or unsuitable for machine learning.

Variables to drop:
  - loc_rowid
  - pl_dens
  - pl_insol
  - ttv_flag
  - st_spectype
  - st_metratio
  - st_logg
  - st_rotp
  - sy_pmdec

In [41]:
# Drop the row-id column and the other columns identified above
cols_to_drop = ['loc_rowid', 'pl_dens', 'pl_insol', 'ttv_flag', 'st_spectype',
                'st_metratio', 'st_logg', 'st_rotp', 'sy_pmdec']
df = df.drop(columns=cols_to_drop)
df.columns

Index(['sy_snum', 'sy_pnum', 'sy_mnum', 'cb_flag', 'pl_orbper', 'pl_orbsmax',
       'pl_rade', 'pl_bmasse', 'pl_orbeccen', 'pl_eqt', 'st_teff', 'st_rad',
       'st_mass', 'st_lum', 'st_age', 'glat', 'glon', 'sy_dist', 'sy_plx'],
      dtype='object')

In [43]:
# Change to readable column names.
df.rename({
      'sy_snum': 'num_star',
      'sy_pnum': 'num_planet',
      'sy_mnum': 'moon_num',
      'cb_flag': '2_stars',
      'pl_orbper': 'orbital_period',
      'pl_orbsmax': 'semi-major_axis',
      'pl_rade': 'planet_radius',
      'pl_bmasse': 'planet_mass',
      'pl_orbeccen': 'planet_eccen',
      'pl_eqt': 'planet_temp',
      'st_teff': 'star_temp',
      'st_rad': 'star_radius',
      'st_mass': 'star_mass',
      'st_lum': 'star_bright',
      'st_age': 'star_age',
      'glat': 'latitude_gal',
      'glon': 'longitude_gal',
      'sy_dist': 'distance',
      'sy_plx': 'parallax'
      }, 
      axis=1, inplace=True
)
df.head()

Unnamed: 0,num_star,num_planet,moon_num,2_stars,orbital_period,semi-major_axis,planet_radius,planet_mass,planet_eccen,planet_temp,star_temp,star_radius,star_mass,star_bright,star_age,latitude_gal,longitude_gal,distance,parallax
0,2,1,0,0,326.03,1.29,12.1,6165.6,0.231,,4742.0,19.0,2.7,2.243,,78.28058,264.13775,93.1846,10.7104
1,1,1,0,0,516.21997,1.53,12.3,4684.8142,0.08,,4213.0,29.79,2.78,2.43,1.56,41.04437,108.719,125.321,7.95388
2,1,1,0,0,185.84,0.83,12.9,1525.5,0.0,,4813.0,11.0,2.2,1.763,4.5,-21.05141,106.41269,75.4392,13.2289
3,1,2,0,0,1773.40002,2.93,12.9,1481.0878,0.37,,5338.0,0.93,0.9,-0.153,3.9,46.94447,69.16849,17.9323,55.7363
4,3,1,0,0,798.5,1.66,13.5,565.7374,0.68,,5750.0,1.13,1.08,0.097,7.4,13.20446,83.33558,21.1397,47.2754
