In [14]:
import pandas as pd
import numpy as np

# Load the cleaned dataset
data_path = "../data/cleaned_5250.csv"
df = pd.read_csv(data_path)

In [15]:
df.head()

Unnamed: 0,name,distance,stellar_magnitude,planet_type,discovery_year,mass_multiplier,mass_wrt,radius_multiplier,radius_wrt,orbital_radius,orbital_period,eccentricity,detection_method
0,11 Comae Berenices b,304.0,4.72307,Gas Giant,2007,19.4,Jupiter,1.08,Jupiter,1.29,0.892539,0.23,Radial Velocity
1,11 Ursae Minoris b,409.0,5.013,Gas Giant,2009,14.74,Jupiter,1.09,Jupiter,1.53,1.4,0.08,Radial Velocity
2,14 Andromedae b,246.0,5.23133,Gas Giant,2008,4.8,Jupiter,1.15,Jupiter,0.83,0.508693,0.0,Radial Velocity
3,14 Herculis b,58.0,6.61935,Gas Giant,2002,8.13881,Jupiter,1.12,Jupiter,2.773069,4.8,0.37,Radial Velocity
4,16 Cygni B b,69.0,6.215,Gas Giant,1996,1.78,Jupiter,1.2,Jupiter,1.66,2.2,0.68,Radial Velocity


The habitable zone (or "Goldilocks zone") is the region around a star where conditions might be just right for liquid water to exist on a planet's surface. The exact boundaries of the habitable zone vary with the type, age, and activity of the star. Our dataset has no columns for stellar type or luminosity, so we'll make some simplifying assumptions and determine the habitable zone from orbital_radius alone.

Approach:

Simplified Assumptions:
We'll assume all stars in the dataset are similar to our Sun (a G-type star).
For a Sun-like star, the inner boundary of the habitable zone is roughly 0.95 AU (Astronomical Units) and the outer boundary is around 1.67 AU. An Astronomical Unit (AU) is the average distance from the Earth to the Sun, approximately 93 million miles or 150 million kilometers.
Feature Creation:
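Based on the above boundaries, we'll label each planet by where its orbital_radius falls: Inside (closer to the star than the inner boundary), On the Edge (between the two boundaries, i.e. within the assumed habitable zone), or Outside (at or beyond the outer boundary).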
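The Sun-like assumption matters because the boundaries move with the star's brightness. As a rough sketch only: if we assume the flux a planet receives falls off with the square of its distance, the distance that receives the same flux as a given orbit around the Sun scales with the square root of the star's luminosity. The function name `scaled_habitable_zone` and its `luminosity_solar` input are hypothetical, and since the dataset has no luminosity column this is not applied in the notebook. It also ignores the dependence of the boundaries on stellar temperature.

```python
import math

# Assumed boundaries for the Sun, in AU (the same values used in this notebook)
SUN_INNER_AU = 0.95
SUN_OUTER_AU = 1.67


def scaled_habitable_zone(luminosity_solar):
    """Return (inner, outer) boundaries in AU for a star of the given
    luminosity, expressed in units of the Sun's luminosity.

    Assumption: flux ~ L / d^2, so equal-flux distances scale as sqrt(L).
    """
    if luminosity_solar <= 0:
        raise ValueError("luminosity_solar must be positive")
    scale = math.sqrt(luminosity_solar)
    return SUN_INNER_AU * scale, SUN_OUTER_AU * scale


# A star as bright as the Sun keeps the notebook's boundaries;
# a star four times as luminous pushes both boundaries out by a factor of 2.
print(scaled_habitable_zone(1.0))
print(scaled_habitable_zone(4.0))
```

Under this scaling, a dim star's zone would sit well inside 0.95 AU, so a fixed 0.95 to 1.67 AU window is only meaningful for stars close to solar luminosity.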

In [16]:
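# Define the boundaries (in AU) of the habitable zone for a Sun-like star
inner_boundary = 0.95
outer_boundary = 1.67

# Categorize planets by orbital_radius relative to the habitable zone boundaries.
# With right=False the bins are left-closed: [0, 0.95), [0.95, 1.67), [1.67, inf).
# Label meanings:
#   'Inside'      -> closer to the star than the inner boundary
#   'On the Edge' -> between the two boundaries (within the assumed zone)
#   'Outside'     -> at or beyond the outer boundary
# Rows with a missing orbital_radius are assigned NaN.
df['habitable_zone'] = pd.cut(df['orbital_radius'],
                              bins=[0, inner_boundary, outer_boundary, np.inf],
                              labels=['Inside', 'On the Edge', 'Outside'],
                              right=False)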

# Display the distribution of the new feature as proportions
# (value_counts drops NaN by default, so rows without a category are excluded)
habitable_zone_distribution = df['habitable_zone'].value_counts(normalize=True)
habitable_zone_distribution

habitable_zone
Inside         0.850635
Outside        0.105221
On the Edge    0.044144
Name: proportion, dtype: float64

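Inside: About 85.06% of the exoplanets with a recorded orbital_radius orbit closer than 0.95 AU, which places them interior to the inner boundary of the assumed habitable zone, not within it.
Outside: Approximately 10.52% orbit at or beyond 1.67 AU, past the outer boundary.
On the Edge: Roughly 4.41% orbit between 0.95 AU and 1.67 AU. Given the bins used above, this is the range that actually lies within the assumed habitable zone.
These percentages provide a simplified view based on our assumptions. In reality, the habitable zone can vary greatly depending on the specific characteristics of the star. Under the Sun-like assumption, the large majority of exoplanets in this dataset orbit closer to their star than the habitable zone, and only a small fraction fall within it.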
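To check what each label means in practice, here is a minimal sketch that applies the same pd.cut call to a handful of toy radii. The values in `toy_radii` are illustrative and not taken from the dataset; the variable names are ours.

```python
import numpy as np
import pandas as pd

inner_boundary = 0.95
outer_boundary = 1.67

# Toy orbital radii in AU (illustrative only), including both boundary
# values and one missing entry
toy_radii = pd.Series([0.05, 0.95, 1.0, 1.67, 5.2, np.nan])

# Same bins, labels, and right=False setting as the notebook cell above
toy_labels = pd.cut(toy_radii,
                    bins=[0, inner_boundary, outer_boundary, np.inf],
                    labels=['Inside', 'On the Edge', 'Outside'],
                    right=False)

# Show each radius next to the label it receives
print(pd.DataFrame({'orbital_radius': toy_radii,
                    'habitable_zone': toy_labels}))

# Proportions are computed over the five non-missing rows only
print(toy_labels.value_counts(normalize=True))
```

Because the bins are left-closed, a radius of exactly 0.95 AU lands in 'On the Edge' and exactly 1.67 AU lands in 'Outside', while the missing radius receives no label and drops out of the proportions.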