# Exoplanet Habitability Analysis - A Machine Learning EDA

##### A quick note: I've loved outer space since I was a child, fascinated by planets, asteroids, stars, galaxies, and other celestial objects. From reading books about space to watching theories on TV and in movies, it is something I never really grew out of. Now, as the final year of my BS program comes to a close, I'm excited to combine my passions for research and outer space in a solo endeavor.

### TABLE OF CONTENTS:
- TBA

### Flow: This EDA analyzes the exoplanets in the NASA Exoplanet Archive as of Sunday, October 5th, 2025. It classifies exoplanets in two ways:
- based on ideal features for habitability
- via comparison to Earth
### The techniques that I will use are binary classification and clustering.

In [None]:
# imports 
# Standard library
import time

# Core libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import ConfusionMatrixDisplay  # roc_curve / roc_auc_score are imported with the metrics below

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Scikit-learn - model selection
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV
)

# Scikit-learn - preprocessing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
    LabelEncoder,
    PolynomialFeatures
)
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Scikit-learn - models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier,
    BaggingClassifier,
    AdaBoostClassifier
)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.svm import SVC

# Scikit-learn - feature selection and calibration
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.calibration import CalibratedClassifierCV

# Scikit-learn - evaluation metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    f1_score,
    recall_score,
    precision_score,
    roc_auc_score,
    roc_curve,
    auc,
)

# Imbalanced-learn
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# Scipy
from scipy.stats import loguniform


# load csv into a pandas DataFrame
exoplanet_df = pd.read_csv("data/exoplanet_data.csv", comment="#")



In [None]:
# Data preprocessing and cleaning
# PART 1: Prepping for classification


# The archive lists ~6,000 confirmed exoplanets, but many appear in several rows
# because each published parameter set gets its own row.
# Keep only rows with default_flag == 1 (the archive's default parameter set)
# so there is one row per planet. .copy() avoids a SettingWithCopyWarning
# when new columns are added to the filtered frame later.
exoplanet_df = exoplanet_df[exoplanet_df["default_flag"] == 1].copy()
print(exoplanet_df.info())
print(exoplanet_df.describe())



<class 'pandas.core.frame.DataFrame'>
Index: 6022 entries, 0 to 38951
Data columns (total 92 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pl_name          6022 non-null   object 
 1   hostname         6022 non-null   object 
 2   default_flag     6022 non-null   int64  
 3   sy_snum          6022 non-null   int64  
 4   sy_pnum          6022 non-null   int64  
 5   discoverymethod  6022 non-null   object 
 6   disc_year        6022 non-null   int64  
 7   disc_facility    6022 non-null   object 
 8   soltype          6022 non-null   object 
 9   pl_controv_flag  6022 non-null   int64  
 10  pl_refname       6022 non-null   object 
 11  pl_orbper        5703 non-null   float64
 12  pl_orbpererr1    5214 non-null   float64
 13  pl_orbpererr2    5214 non-null   float64
 14  pl_orbperlim     5703 non-null   float64
 15  pl_orbsmax       3754 non-null   float64
 16  pl_orbsmaxerr1   2887 non-null   float64
 17  pl_orbsmaxerr2   2
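
A quick sanity check on the `default_flag` filter, sketched on a made-up three-row table rather than the real archive file: after keeping only `default_flag == 1`, each `pl_name` should appear exactly once. The planet names and radii below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the archive table: two parameter sets for "Planet A", one for "Planet B".
toy = pd.DataFrame({
    "pl_name": ["Planet A", "Planet A", "Planet B"],
    "default_flag": [0, 1, 1],
    "pl_rade": [1.2, 1.1, 2.5],
})

# Keep only the default parameter set, mirroring the filter in the cell above.
filtered = toy[toy["default_flag"] == 1].copy()

# Each planet should now appear exactly once.
one_row_per_planet = filtered["pl_name"].is_unique
print(len(filtered), one_row_per_planet)
```

On the real data, the equivalent check would be `exoplanet_df["pl_name"].is_unique`.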

In [6]:
# let's see what the first 10 rows look like just to get a feel for the data
print(exoplanet_df.head(10))

                    pl_name               hostname  default_flag  sy_snum  \
0                  11 Com b                 11 Com             1        2   
5                  11 UMi b                 11 UMi             1        1   
7                  14 And b                 14 And             1        1   
10                 14 Her b                 14 Her             1        1   
21               16 Cyg B b               16 Cyg B             1        3   
23                 17 Sco b                 17 Sco             1        1   
25                 18 Del b                 18 Del             1        2   
27  1RXS J160929.1-210524 b  1RXS J160929.1-210524             1        1   
31                 24 Boo b                 24 Boo             1        1   
33                 24 Sex b                 24 Sex             1        1   

    sy_pnum  discoverymethod  disc_year  \
0         1  Radial Velocity       2007   
5         1  Radial Velocity       2009   
7         1  Radial Vel

### First, we need a target. There is no "habitable" column, so we'll create one by defining parameter ranges similar to those of a habitable planet.

In [None]:
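# habitable planet = roughly Earth-sized (likely rocky), temperate, single star system
# rough definition, deliberately broad so we have something to go off of

# pl_rade: planet radius in Earth radii
radius_cond = (exoplanet_df["pl_rade"] >= 0.5) & (exoplanet_df["pl_rade"] <= 1.5)
# st_teff: effective temperature of the host star in K (not the planet's temperature)
temp_cond = (exoplanet_df["st_teff"] >= 2600) & (exoplanet_df["st_teff"] <= 10000)
# sy_snum: number of stars in the system
starNum_cond = exoplanet_df["sy_snum"] == 1
# pl_insol: insolation flux relative to Earth's
flux_cond = (exoplanet_df["pl_insol"] >= 0.3) & (exoplanet_df["pl_insol"] <= 1.8)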

# comparisons against NaN evaluate to False, so rows with missing values end up labeled False
exoplanet_df["potentially_habitable"] = (radius_cond & temp_cond & starNum_cond & flux_cond).fillna(False)
# the column looks like this
#print(exoplanet_df["potentially_habitable"] == True)

# see how many candidates we have just from the initial criteria
print(exoplanet_df["potentially_habitable"].value_counts())

# the column is already boolean, so it can be used directly as a mask
names = exoplanet_df.loc[exoplanet_df["potentially_habitable"], "pl_name"]
for name in names:
    print(name)


potentially_habitable
False    6012
True       10
Name: count, dtype: int64
Gliese 12 b
K2-3 d
K2-72 e
Kepler-1649 c
Kepler-186 f
Kepler-438 b
Kepler-442 b
LP 890-9 c
TOI-700 d
TOI-700 e
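
One caveat about the mask above: in pandas, comparisons against `NaN` evaluate to `False`, so a planet missing any of the four measurements is labeled `False` rather than "unknown". Here is a small sketch with invented rows that separates "measured and rejected" from "could not be evaluated". The `not_evaluable` column is a hypothetical helper, not something in the archive.

```python
import numpy as np
import pandas as pd

# Invented rows: one planet passes, one fails on flux, one has no flux measurement.
toy = pd.DataFrame({
    "pl_name": ["passes", "too much flux", "unknown flux"],
    "pl_rade": [1.1, 1.2, 1.0],
    "pl_insol": [0.9, 5.0, np.nan],
})

# between() is inclusive on both ends by default; NaN yields False.
radius_ok = toy["pl_rade"].between(0.5, 1.5)
flux_ok = toy["pl_insol"].between(0.3, 1.8)

toy["potentially_habitable"] = radius_ok & flux_ok
# Flag rows where at least one input to the rule is missing.
toy["not_evaluable"] = toy[["pl_rade", "pl_insol"]].isna().any(axis=1)

print(toy[["pl_name", "potentially_habitable", "not_evaluable"]])
```

The second and third rows are both labeled `False`, but only the second was actually ruled out by a measurement.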


### Now we have our "habitable planets". As expected, there are far more False labels than True ones, which means the dataset is heavily imbalanced.
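
To make the imbalance concrete, here is a minimal sketch using synthetic labels with the same class counts as the output above (10 True, 6012 False). It shows why plain accuracy is misleading here and how a stratified split keeps some positives in both the training and test sets. The placeholder feature `X`, the 80/20 split, and `random_state=0` are my own assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels matching the counts above: 10 True, 6012 False.
y = np.zeros(6022, dtype=bool)
y[:10] = True
X = np.arange(6022).reshape(-1, 1)  # placeholder feature, not real data

# Accuracy of a "model" that always predicts False.
baseline_accuracy = 1 - y.mean()

# Stratifying on y keeps the class ratio roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(f"always-False accuracy: {baseline_accuracy:.4f}")
print("positives in train / test:", y_train.sum(), y_test.sum())
```

Because predicting False for everything already scores above 99% accuracy, metrics such as recall, precision, and F1 are more informative for this target.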

#### I think we can find more. The main issue is the column "pl_insol". This is the insolation flux, an **extremely** important factor in planet habitability.
#### Unfortunately, this column has a large number of missing values.

In [31]:
print(exoplanet_df["pl_insol"].isnull().sum())

5143


#### 5143 out of 6022 entries (about 85%) have a missing pl_insol value. Since that is one of my main criteria for habitability, the gap needs to be addressed.
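
One possible way to narrow this gap, sketched here as an assumption rather than the method this notebook settles on: insolation relative to Earth can be estimated from the stellar luminosity and the orbital distance, using $L/L_\odot = (R_\star/R_\odot)^2 (T_\mathrm{eff}/T_\odot)^4$ and $S/S_\oplus = (L/L_\odot) / a^2$ with $a$ in au. The function `estimate_insolation` and the column `pl_insol_filled` are hypothetical names, the toy rows are invented, and the estimate ignores orbital eccentricity.

```python
import numpy as np
import pandas as pd

T_SUN = 5772.0  # nominal solar effective temperature in K


def estimate_insolation(st_rad, st_teff, pl_orbsmax):
    """Insolation in Earth units from stellar radius (solar radii),
    stellar effective temperature (K), and semi-major axis (au)."""
    luminosity = st_rad**2 * (st_teff / T_SUN) ** 4  # L / L_sun
    return luminosity / pl_orbsmax**2


# Invented rows: a Sun-Earth analogue and a planet with no semi-major axis.
toy = pd.DataFrame({
    "st_rad": [1.0, 0.5],
    "st_teff": [5772.0, 3500.0],
    "pl_orbsmax": [1.0, np.nan],
    "pl_insol": [np.nan, np.nan],
})

estimate = estimate_insolation(toy["st_rad"], toy["st_teff"], toy["pl_orbsmax"])
# Only fill gaps; any published pl_insol value would be left untouched.
toy["pl_insol_filled"] = toy["pl_insol"].fillna(estimate)
print(toy)
```

This can only help where `st_rad`, `st_teff`, and `pl_orbsmax` are all present. In this dataset `pl_orbsmax` is non-null for 3754 of 6022 rows, so some gaps would remain.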


### DATA PREPROCESSING
- dropping irrelevant/redundant columns
- handling missing values
- standardizing formatting
- scaling features for model training
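
As a rough preview of the missing-value and scaling steps listed above, here is a minimal sketch using the `SimpleImputer`, `StandardScaler`, and `Pipeline` classes already imported at the top. Median imputation is an assumption on my part, picked only for illustration, and the values in the toy frame are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented values for two numeric columns with gaps.
toy = pd.DataFrame({
    "pl_rade": [1.0, 2.0, np.nan, 11.0],
    "st_teff": [5800.0, np.nan, 3500.0, 6100.0],
})

numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scale", StandardScaler()),                   # rescale to zero mean, unit variance
])

prepared = numeric_prep.fit_transform(toy)
print(prepared.round(3))
```

In a real train/test workflow the pipeline should be fit on the training split only and then applied to the test split, so that test-set statistics do not leak into the imputation or scaling.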

In [36]:
# STEP 1: handpick the most influential columns for determining habitability
# and make a new dataframe

new_cols = ["pl_name", "sy_snum", "sy_pnum", "pl_rade", "pl_orbeccen", 
            "pl_insol", "pl_eqt", "st_teff", 
            "st_rad", "st_logg", "potentially_habitable"]
# .copy() gives an independent frame, so later edits don't trigger SettingWithCopyWarning
df = exoplanet_df[new_cols].copy()

# now let's get the info of our new df
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 6022 entries, 0 to 38951
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   pl_name                6022 non-null   object 
 1   sy_snum                6022 non-null   int64  
 2   sy_pnum                6022 non-null   int64  
 3   pl_rade                4493 non-null   float64
 4   pl_orbeccen            2546 non-null   float64
 5   pl_insol               879 non-null    float64
 6   pl_eqt                 1620 non-null   float64
 7   st_teff                5332 non-null   float64
 8   st_rad                 5251 non-null   float64
 9   st_logg                4989 non-null   float64
 10  potentially_habitable  6022 non-null   bool   
dtypes: bool(1), float64(7), int64(2), object(1)
memory usage: 523.4+ KB
None
           sy_snum      sy_pnum      pl_rade  pl_orbeccen      pl_insol  \
count  6022.000000  6022.000000  4493.000000  2546.000000    879.000