<a href="https://colab.research.google.com/github/Dworlock11/Exoplanet-Machine-Learning-Analysis/blob/main/Exoplanet_Habitability_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Statements

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [None]:
# Mount Google Drive first so the catalog path below resolves (Colab only)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_excel("/content/drive/MyDrive/College and Work/Exoplanet Catalog.xlsx")
# df = pd.read_excel("exoplanet_catalog.xlsx")
pd.set_option('display.max_columns', None)
df

# Preprocessing

The column "P_HABITABLE" uses the values 1 and 2 to separate planets that are potentially habitable under conservative estimates from those that qualify only under liberal estimates. To allow binary classification, both are treated simply as "potentially habitable".

In [None]:
df["P_HABITABLE"] = df["P_HABITABLE"].mask(df["P_HABITABLE"] == 2, 1)
df["P_HABITABLE"].value_counts()
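As a minimal sketch of this relabelling on toy data (the values below are made up, not taken from the catalog), `mask` replaces every 2 with 1 and leaves all other entries, including missing ones, untouched:

```python
import numpy as np
import pandas as pd

# Toy labels (hypothetical): 0 = not habitable, 1 and 2 = the two habitable tiers
toy = pd.Series([0, 1, 2, 0, 2, np.nan])

# mask() replaces values where the condition is True; NaN == 2 is False, so NaN is kept
binary = toy.mask(toy == 2, 1)
print(binary.tolist())
```

Missing targets survive this step, which is why they have to be removed separately before training.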

Many columns in the dataset contain a large share of null entries, so they are removed rather than imputed. Any column in which at least a quarter of the entries are null is dropped.

In [None]:
# Count the null entries in each column
col_null_count = df.isna().sum()
# Keep only the columns where fewer than a quarter of the rows are null
cols_to_keep = col_null_count[col_null_count < len(df)/4].index.to_list()
df = df[cols_to_keep]
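The threshold can be illustrated on a small, hypothetical frame (the column names and values are invented for this sketch). With four rows the cutoff is one null entry, and because the comparison is strict, a column with exactly a quarter of its values missing is dropped as well:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0, 4.0],                   # 0 of 4 missing
    "one_missing": [1.0, np.nan, 3.0, 4.0],             # 1 of 4 missing (exactly a quarter)
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],    # 3 of 4 missing
})

null_counts = toy.isna().sum()
# Same rule as above: keep columns with fewer nulls than a quarter of the rows
kept = null_counts[null_counts < len(toy) / 4].index.to_list()
print(kept)
```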

Further features are dropped by hand because they are unhelpful for model training, duplicate one another, or are nearly identical in value.

In [None]:
df = df.drop(["P_STATUS", "P_RADIUS", "P_YEAR", "P_UPDATED", "S_NAME", "S_RADIUS", "S_ALT_NAMES", "P_HABZONE_OPT", "P_HABZONE_CON", "S_CONSTELLATION_ABR", "P_PERIOD_ERROR_MIN", "P_PERIOD_ERROR_MAX", "S_DISTANCE_ERROR_MIN", "S_DISTANCE_ERROR_MAX", "P_FLUX_MIN", "P_FLUX_MAX", "P_TEMP_EQUIL_MIN", "P_TEMP_EQUIL_MAX"], axis=1)
df.shape

Categorical features with far too many unique values are removed to simplify the model after encoding.

In [None]:
num_features = df.select_dtypes(include=np.number)
cat_features = df.select_dtypes(exclude=np.number)

# Print the number of unique values in each categorical column
for col in cat_features.columns:
    print(col, "-", cat_features[col].nunique())

df = df.drop(["S_RA_T", "S_DEC_T", "S_CONSTELLATION", "S_CONSTELLATION_ENG"], axis=1)

The skew of each numerical feature is checked to choose an appropriate imputation strategy. Because the features are heavily skewed, the median is used rather than the mean.

In [None]:
df.skew(axis=0, numeric_only=True, skipna=True)
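To see why skew favors the median, consider a small made-up feature with one extreme value (the numbers are illustrative, not catalog data). The mean is dragged toward the outlier, while the median stays with the bulk of the values, so it is the safer fill value:

```python
import pandas as pd

# Hypothetical right-skewed feature: most values are small, one is very large
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 500.0])

print("skew:", values.skew())
print("mean:", values.mean())      # pulled toward the outlier
print("median:", values.median())  # stays with the bulk of the data
```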

# Logistic Regression

The data is separated into the features and the target.

In [None]:
X = df.drop(["P_NAME", "P_HABITABLE"], axis=1)
y = df["P_HABITABLE"]

All rows where the target value is null are removed.

In [None]:
# Keep only the rows whose target is present
target_present = y.notna()
X = X[target_present]
y = y[target_present]
print(y.isna().sum())

The data is split into the training and testing data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
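The effect of `stratify` can be checked on a toy imbalanced target (the 90/10 class split below is an assumption for illustration, not the catalog's actual ratio). Both partitions end up with the same share of positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Fraction of positives in each partition
print(y_tr.mean(), y_te.mean())
```

Without stratification, a rare class can be over- or under-represented in the test set purely by chance, which makes the test score less reliable.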

Before the pipeline can be built, the appropriate number of components to keep after PCA must be found. An initial pipeline is created to impute null entries, as PCA doesn't accept null values. Transformers for numerical and categorical data must be created separately. Additionally, since different encoders will be used depending on the model, two different preprocessors will be built. The first will use one-hot encoding and the second ordinal encoding.

In [None]:
# Separate numerical and categorical features
num_features = X_train.select_dtypes(include=np.number)
cat_features = X_train.select_dtypes(exclude=np.number)
num_col_names = num_features.columns
cat_col_names = cat_features.columns

# Build transformers
num_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

ohe_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine transformers
combine_transformers = ColumnTransformer([
        ("num_transformer", num_transformer, num_col_names),
        ("cat_transformer", ohe_transformer, cat_col_names)
    ]
)

# Build pipeline, fit to data, and plot cumulative explained variance
component_finder = Pipeline([
    ("combine_transformers", combine_transformers),
    ("pca", PCA())])

component_finder.fit(X_train)
pca = component_finder.named_steps["pca"]
plt.plot(np.arange(1, pca.n_components_+1), np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")

Because the explained-variance curve shows no clear cutoff point, the number of components is chosen as the smallest number whose cumulative explained variance exceeds 0.95.

In [None]:
cat_preprocessor = Pipeline([
    ("combine_transformers", combine_transformers),
    ("pca", PCA(n_components=0.95))
])
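As a sketch of what a fractional `n_components` does, the example below uses synthetic data (random columns with assumed scales, unrelated to the catalog) and compares the count read off the cumulative curve by hand with the count PCA selects when given 0.95:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): 5 columns with very different variances
X_toy = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# Manual route: fit all components, then find where the curve first exceeds 0.95
full = PCA().fit(X_toy)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_manual = int(np.argmax(cumulative > 0.95)) + 1

# Automatic route: a float in (0, 1) asks PCA to pick the count itself
auto = PCA(n_components=0.95).fit(X_toy)
print(k_manual, auto.n_components_)
```

Passing the fraction directly keeps the choice inside the pipeline, so it is recomputed on each training fold during cross-validation.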

The pipeline is created and hyperparameter tuning is implemented.

In [None]:
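pipe = Pipeline([
    ("cat_preprocessor", cat_preprocessor),
    ("log_reg", LogisticRegression())
])

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Settings shared by both solvers
common_params = {
    'log_reg__C': np.logspace(-4, 4, 20),
    'log_reg__max_iter': [200, 500, 1000],
    'log_reg__class_weight': ['balanced', None]
}

# Solver and penalty are paired in separate dicts because lbfgs supports only
# the l2 penalty, while liblinear supports both l1 and l2
param_distributions = [
    {**common_params, 'log_reg__solver': ['liblinear'], 'log_reg__penalty': ['l1', 'l2']},
    {**common_params, 'log_reg__solver': ['lbfgs'], 'log_reg__penalty': ['l2']}
]

search = RandomizedSearchCV(pipe, param_distributions=param_distributions, n_iter=50, cv=kf, random_state=42)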

The model is trained and tested and then scored with the F1 score.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
print("\n", f1_score(y_test, y_pred))
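For reference, the F1 score is the harmonic mean of precision and recall. The sketch below uses made-up labels (not model output from this notebook) to compute it by hand and compare it with `f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = potentially habitable
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual, f1_score(y_true, y_hat))
```

In this toy case 7 of 10 predictions are correct, yet half of the positives are missed. F1 reflects that, which is why it is more informative than accuracy when the habitable class is rare.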

# Decision Tree

Now the second preprocessor is built with OrdinalEncoder.

In [None]:
# # Build categorical transformer
# oe_transformer = Pipeline([
#     ("imputer", SimpleImputer(strategy="most_frequent")),
#     ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=11))
# ])

# # Combine transformers
# combine_transformers = ColumnTransformer([
#         ("num_transformer", num_transformer, num_col_names),
#         ("oe_transformer", oe_transformer, cat_col_names)
#     ]
# )

# # Build pipeline, fit to data, and plot cumulative explained variance
# component_finder = Pipeline([
#     ("combine_transformers", combine_transformers),
#     ("pca", PCA())])

# component_finder.fit(X_train)
# pca = component_finder.named_steps["pca"]
# plt.plot(np.arange(1, pca.n_components_+1), np.cumsum(pca.explained_variance_ratio_))
# plt.xlabel("Number of Components")
# plt.ylabel("Cumulative Explained Variance")
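A small sketch of how `OrdinalEncoder` treats a category it never saw during fitting. The star-type labels are hypothetical, and `unknown_value=-1` is an assumption chosen for this example rather than the notebook's setting of 11. scikit-learn requires `unknown_value` to differ from every code assigned to a known category, so a positive value such as 11 only works while each column has few enough categories.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([["G"], ["K"], ["M"], ["K"]])  # hypothetical star-type labels
test = np.array([["K"], ["F"]])                 # "F" never appears in training

# Known categories get codes 0, 1, 2 in sorted order; unseen ones get -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded = encoder.transform(test).ravel()
print(encoded)
```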

Once again, because there is no clear point to cut off the number of components based on the explained variance, the number will simply be chosen based on when the cumulative variance is greater than 0.95.

In [None]:
# tree_preprocessor = Pipeline([
#     ("combine_transformers", combine_transformers),
#     ("pca", PCA(n_components=0.95))
# ])