# Exoplanet Habitability Analysis - A Machine Learning EDA

##### A quick note: I've loved outer space since I was a child, fascinated by planets, asteroids, stars, galaxies, and other celestial objects. From reading books about space to watching theories play out on TV and in movies, it is something I never really grew out of. Now, as the final year of my BS program comes to a close, I'm excited to combine my passions for research and outer space in a solo endeavor.

### TABLE OF CONTENTS:
- TBA

### Flow: This EDA will analyze the exoplanets within the NASA Exoplanet Archive as of Sunday, October 5th, 2025. This analysis will classify exoplanets in 2 ways:
- by ideal features for habitability
- by comparison to Earth
### The techniques that I will use are binary classification and clustering.

In [None]:
# imports 
import pandas as pd
import numpy as np




In [5]:
# Data preprocessing and cleaning
# PART 1: Prepping for classification
exoplanet_df = pd.read_csv("data/exoplanet_data.csv", comment = "#")

# there are roughly 6000 confirmed exoplanets, but the table has many more rows because a planet gets one row per published parameter set
# keep only the rows where default_flag == 1 so we have 1 row/planet, using the archive's default (most accepted) parameters
exoplanet_df = exoplanet_df[exoplanet_df["default_flag"] == 1]
print(exoplanet_df.info())
print(exoplanet_df.describe())



<class 'pandas.core.frame.DataFrame'>
Index: 6022 entries, 0 to 38951
Data columns (total 92 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pl_name          6022 non-null   object 
 1   hostname         6022 non-null   object 
 2   default_flag     6022 non-null   int64  
 3   sy_snum          6022 non-null   int64  
 4   sy_pnum          6022 non-null   int64  
 5   discoverymethod  6022 non-null   object 
 6   disc_year        6022 non-null   int64  
 7   disc_facility    6022 non-null   object 
 8   soltype          6022 non-null   object 
 9   pl_controv_flag  6022 non-null   int64  
 10  pl_refname       6022 non-null   object 
 11  pl_orbper        5703 non-null   float64
 12  pl_orbpererr1    5214 non-null   float64
 13  pl_orbpererr2    5214 non-null   float64
 14  pl_orbperlim     5703 non-null   float64
 15  pl_orbsmax       3754 non-null   float64
 16  pl_orbsmaxerr1   2887 non-null   float64
 17  pl_orbsmaxerr2   2
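
A quick way to confirm that the `default_flag` filter really leaves one row per planet is to check that `pl_name` is unique afterwards. The sketch below uses a small made-up table (the planet names and periods are placeholders, not archive values) rather than the real CSV:

```python
import pandas as pd

# Toy stand-in for the archive table: two planets, three parameter sets.
# Names and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "pl_name": ["Toy-1 b", "Toy-1 b", "Toy-2 b"],
        "default_flag": [0, 1, 1],
        "pl_orbper": [10.2, 10.1, 365.0],
    }
)

# Same filter as in the notebook: keep only the default parameter set.
defaults = toy[toy["default_flag"] == 1]

# After filtering, every planet name should appear exactly once.
one_row_per_planet = defaults["pl_name"].is_unique
print(one_row_per_planet)
print(len(defaults), "rows for", defaults["pl_name"].nunique(), "planets")
```

Running the same `is_unique` check on `exoplanet_df["pl_name"]` would flag any planet that still has more than one row.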

In [6]:
# let's see what the first 10 rows look like just to get a feel for the data
print(exoplanet_df.head(10))

                    pl_name               hostname  default_flag  sy_snum  \
0                  11 Com b                 11 Com             1        2   
5                  11 UMi b                 11 UMi             1        1   
7                  14 And b                 14 And             1        1   
10                 14 Her b                 14 Her             1        1   
21               16 Cyg B b               16 Cyg B             1        3   
23                 17 Sco b                 17 Sco             1        1   
25                 18 Del b                 18 Del             1        2   
27  1RXS J160929.1-210524 b  1RXS J160929.1-210524             1        1   
31                 24 Boo b                 24 Boo             1        1   
33                 24 Sex b                 24 Sex             1        1   

    sy_pnum  discoverymethod  disc_year  \
0         1  Radial Velocity       2007   
5         1  Radial Velocity       2009   
7         1  Radial Vel

### DATA PREPROCESSING

In [9]:
# # automation for dropping columns with too little data
# missing_percentages = exoplanet_df.isnull().sum() / len(exoplanet_df) * 100
# threshold = 65
# columns_to_drop = ["pl_name", "hostname", "default_flag", "disc_year", "disc_facility", "discoverymethod"]
# print(missing_percentages.sort_values(ascending=False).head(15))
# # insolation flux is extremely important for planet habitability, so we keep it despite its high missingness.
# # however, it is not a factor in our initial habitable column, in order to give the models as many planets
# # to work with as possible.
# for col in exoplanet_df:
#     if missing_percentages[col] > threshold:
#         if col == "pl_insol":
#             continue # keep pl_insol
#         columns_to_drop.append(col)
#     if exoplanet_df[col].dtype == "object" and exoplanet_df[col].str.contains(r"https?://", na=False).any():
#         columns_to_drop.append(col)


# print("\nColumns dropped:", columns_to_drop)
# # drop() returns a new DataFrame, so the result has to be assigned back
# exoplanet_df = exoplanet_df.drop(columns = columns_to_drop)
# print(exoplanet_df.info())
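
The threshold-based drop above can be sketched as a small helper. This is only an illustration on toy data: `drop_sparse_columns`, its `keep` parameter, and the `pl_sparse` column are hypothetical names, not part of the notebook or the archive.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=65.0, keep=()):
    """Return (new_df, dropped) where columns missing more than
    `threshold` percent of their values are removed, except those in `keep`."""
    missing_pct = df.isnull().mean() * 100
    to_drop = [
        col for col in df.columns
        if missing_pct[col] > threshold and col not in keep
    ]
    # drop() returns a new DataFrame; the input is left untouched.
    return df.drop(columns=to_drop), to_drop


toy = pd.DataFrame(
    {
        "pl_rade": [1.0, 1.2, np.nan, 0.9],           # 25% missing, kept
        "pl_insol": [np.nan, np.nan, np.nan, 1.1],    # 75% missing, protected
        "pl_sparse": [np.nan, np.nan, np.nan, 2.0],   # 75% missing, dropped
    }
)

cleaned, dropped = drop_sparse_columns(toy, threshold=65.0, keep=("pl_insol",))
print(dropped)
print(list(cleaned.columns))
```

Returning the list of dropped columns alongside the new frame makes it easy to log what was removed, and passing `pl_insol` through `keep` mirrors the exception made for insolation flux.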

### First, we need a target. There is no "habitable" column, so we'll create one from a set of conditions that approximate a habitable planet.

In [10]:
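# # habitable planet = rocky (Earth-like radius), temperate (moderate insolation flux), single-star system
# # note: temp_cond filters on the host star's effective temperature (st_teff), not the planet's own temperature
# # a rough, deliberately broad definition so we have something to go off of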

# radius_cond = (exoplanet_df["pl_rade"] >= 0.5) & (exoplanet_df["pl_rade"] <= 1.5)
# temp_cond = (exoplanet_df["st_teff"] >= 2600) & (exoplanet_df["st_teff"] <= 10000)
# starNum_cond = (exoplanet_df["sy_snum"] == 1) 
# flux_cond = (exoplanet_df["pl_insol"] >= 0.3) & (exoplanet_df["pl_insol"] <= 1.8)

# exoplanet_df["potentially_habitable"] = (radius_cond & temp_cond & starNum_cond & flux_cond).fillna(False)
# # the column looks like this
# #print(exoplanet_df["potentially_habitable"] == True)

# # see how many candidates we get from these initial criteria
# print(exoplanet_df["potentially_habitable"].value_counts())
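
To see how these conditions combine, and what happens when a value is missing, here is a minimal sketch on invented rows. It uses `Series.between` (inclusive on both ends by default) as an equivalent way to write the range checks; the numbers are placeholders, not archive data.

```python
import numpy as np
import pandas as pd

# Four made-up planets, each failing (or passing) for a different reason.
toy = pd.DataFrame(
    {
        "pl_rade": [1.0, 1.1, 11.0, 0.9],            # planet radius, Earth radii
        "st_teff": [5700.0, 3300.0, 6000.0, 5200.0],  # host star temperature, K
        "sy_snum": [1, 1, 1, 2],                      # stars in the system
        "pl_insol": [1.0, np.nan, 1.2, 0.8],          # insolation flux, Earth flux
    }
)

radius_cond = toy["pl_rade"].between(0.5, 1.5)
temp_cond = toy["st_teff"].between(2600, 10000)
star_num_cond = toy["sy_snum"] == 1
flux_cond = toy["pl_insol"].between(0.3, 1.8)

# Comparisons against NaN evaluate to False, so a planet with a missing
# pl_insol is labeled False rather than "unknown".
toy["potentially_habitable"] = radius_cond & temp_cond & star_num_cond & flux_cond
print(toy["potentially_habitable"].tolist())
```

Only the first row passes every check. The second row fails solely because its flux is missing, which is worth keeping in mind when a column such as `pl_insol` has many gaps.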


### Now we have our "habitable planets". As expected, there are far more False values than True, which means the dataset is imbalanced.
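
One way to put a number on the imbalance is to compute the positive share and the accuracy a model would get by always predicting the majority class. The sketch below uses a made-up label column (1 True out of 20), not the real counts from the notebook:

```python
import pandas as pd

# Invented labels for illustration: 1 positive, 19 negatives.
labels = pd.Series([True] + [False] * 19, name="potentially_habitable")

n_pos = int(labels.sum())
n_neg = int((~labels).sum())

# Fraction of planets labeled True.
positive_share = n_pos / len(labels)

# Negatives per positive.
imbalance_ratio = n_neg / n_pos

# Accuracy of a model that always predicts the majority class.
majority_baseline_accuracy = max(positive_share, 1 - positive_share)

print(n_pos, n_neg)
print(imbalance_ratio)
print(majority_baseline_accuracy)
```

With a split like this, a classifier can score high accuracy while never identifying a single habitable candidate, so accuracy alone is a weak yardstick for this target.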