<a href="https://colab.research.google.com/github/Dworlock11/Exoplanet-Machine-Learning-Analysis/blob/main/Exoplanet_Habitability_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Statements

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, KFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import recall_score, make_scorer, classification_report, f1_score, mean_absolute_error, r2_score, root_mean_squared_error, mean_squared_error
from sklearn.exceptions import ConvergenceWarning
from sklearn.inspection import permutation_importance
from warnings import simplefilter
from scipy.stats import randint, uniform

df = pd.read_excel("Exoplanet Catalog.xlsx")
pd.set_option('display.max_columns', None)
df

# Preprocessing

Many columns in the dataset are mostly null, so they are removed rather than imputed. Any column whose number of null values is at least a quarter of the number of rows is dropped.

In [None]:
# Count null entries per column and keep columns with fewer than 25% nulls
col_null_count = df.isna().sum()
cols_to_keep = col_null_count[col_null_count < len(df)/4].index.to_list()
df = df[cols_to_keep]
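
The same filter can be expressed with null fractions instead of counts. The following is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration, not taken from the catalog):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is 50% null, "c" is 12.5% null, "a" has no nulls
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Fraction of null entries per column
null_fraction = toy.isna().mean()

# Keep only columns whose null fraction is strictly below 25%
kept_cols = null_fraction[null_fraction < 0.25].index.to_list()
print(kept_cols)
```

Because the comparison is strict, a column that is exactly 25% null would also be dropped.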

Additional feature selection is conducted, as many of the features are unhelpful for model training, are copies of one another, or are close in value.

In [None]:
df = df.drop(["P_NAME", "P_STATUS", "P_RADIUS", "P_YEAR", "P_UPDATED", "S_NAME", "S_RADIUS", "S_ALT_NAMES", "P_HABZONE_OPT", "P_HABZONE_CON", "S_CONSTELLATION_ABR", "P_PERIOD_ERROR_MIN", "P_PERIOD_ERROR_MAX", "S_DISTANCE_ERROR_MIN", "S_DISTANCE_ERROR_MAX", "P_FLUX_MIN", "P_FLUX_MAX", "P_TEMP_EQUIL_MIN", "P_TEMP_EQUIL_MAX"], axis=1)
df.shape

Categorical features with far too many unique values are removed to simplify the model after encoding.

In [None]:
num_features = df.select_dtypes(include=np.number)
cat_features = df.select_dtypes(exclude=np.number)

for col in cat_features.columns:
  print(col, "-", len(cat_features[col].value_counts()))

df = df.drop(["S_RA_T", "S_DEC_T", "S_CONSTELLATION", "S_CONSTELLATION_ENG"], axis=1)

The data is checked for the skew of each feature to determine the appropriate imputing method. Since the data is heavily skewed, the median will be chosen.

In [None]:
df.skew(axis=0, numeric_only=True, skipna=True).sort_values(ascending=False)
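
To illustrate why the median is the safer fill value for skewed features, here is a small sketch on a hypothetical right-skewed sample (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is very large
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 500.0])

print("skew:  ", values.skew())    # positive for a right-skewed sample
print("mean:  ", values.mean())    # pulled upward by the single large value
print("median:", values.median())  # stays near the bulk of the data
```

Imputing with the mean here would fill missing entries with a value far from almost every observed one, while the median stays representative.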

The distribution of the classification target is observed.

In [None]:
df["P_TYPE"].value_counts()

With only a single Miniterran planet in the data, the class cannot appear in both the training and test sets, so a stratified split is not possible. According to the official classification practice for exoplanets, Miniterrans have a radius between 0.03 and 0.4 times that of Earth, and Subterrans have a radius between 0.4 and 0.8 times that of Earth. If the Miniterran in the data has a radius close to the Subterran range, it is reasonable to relabel it as one.

In [None]:
miniterran = df[df["P_TYPE"] == "Miniterran"]
miniterran["P_RADIUS_EST"]

Indeed, the radius is around 0.33 times that of Earth, which isn't too far from the 0.4 minimum for a Subterran. Therefore, the planet is masked as one.

In [None]:
df["P_TYPE"] = df["P_TYPE"].mask(df["P_TYPE"] == "Miniterran", "Subterran")
df["P_TYPE"].value_counts()
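
The problem with a singleton class can be reproduced on hypothetical labels. This sketch (toy data, not the catalog) shows a stratified split failing when one class has a single member and succeeding once that class is merged with `mask`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: "Miniterran" appears exactly once
labels = pd.Series(["Terran"] * 6 + ["Subterran"] * 5 + ["Miniterran"])
features = pd.DataFrame({"x": range(len(labels))})

try:
    train_test_split(features, labels, test_size=0.25,
                     stratify=labels, random_state=9)
    split_ok = True
except ValueError as err:
    # scikit-learn refuses to stratify on a class with a single member
    split_ok = False
    print("Stratified split failed:", err)

# Fold the singleton class into the neighbouring class, then split again
merged = labels.mask(labels == "Miniterran", "Subterran")
X_tr, X_te, y_tr, y_te = train_test_split(
    features, merged, test_size=0.25, stratify=merged, random_state=9
)
print(y_te.value_counts())
```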

Now, the distribution of the target for the regression models is analyzed.

In [None]:
df["P_MASS_EST"].describe()

In [None]:
plt.plot(df.index, df["P_MASS_EST"].sort_values(ascending=False))
plt.show()

It's not clear what exactly it means for a planet to have a mass of 0.0. It might be a mistake. Such entries will be removed to be safe.

In [None]:
df = df[df["P_MASS_EST"] != 0.0]

Additionally, the smallest and largest planets are orders of magnitude apart in mass. Therefore, it makes sense to transform the mass into log space.

The regression models will be evaluated with the root mean squared error (RMSE) and the mean absolute error (MAE). MAE is the more robust of the two against outliers, of which the data has many.
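
As a worked illustration of that robustness, the sketch below compares MAE with RMSE on a hypothetical set of residuals containing one large outlier (the values are invented):

```python
import numpy as np

# Hypothetical residuals: nine small errors and one large outlier
errors = np.array([0.1, -0.2, 0.1, 0.3, -0.1, 0.2, -0.3, 0.1, -0.2, 10.0])

mae = np.mean(np.abs(errors))          # every error counts linearly
rmse = np.sqrt(np.mean(errors ** 2))   # squaring lets the outlier dominate

print("MAE: ", mae)
print("RMSE:", rmse)
```

Here the MAE is 1.16 while the RMSE is roughly 3.17, so the single outlier inflates the squared-error metric far more.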

# Logistic Regression

The data is separated into the features and the target.

In [None]:
X = df.drop(["P_TYPE"], axis=1)
y = df["P_TYPE"]

All rows where the target value is null are removed.

In [None]:
y_na = y[y.isna()]
data = X.join(y)
data = data.drop(y_na.index)
X = data.drop("P_TYPE", axis=1)
y = data["P_TYPE"]
print(y.isna().sum())

The data is split into the training and testing data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=9)

Transformers for numerical and categorical data are created.

In [None]:
# Separate numerical and categorical features
num_features = X_train.select_dtypes(include=np.number)
cat_features = X_train.select_dtypes(exclude=np.number)
num_col_names = num_features.columns
cat_col_names = cat_features.columns

# Build transformers
num_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

ohe_cat_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", feature_name_combiner="concat"))
])

# Combine transformers
log_preprocessor = ColumnTransformer([
    ("num_transformer", num_transformer, num_col_names),
    ("ohe_cat_transformer", ohe_cat_transformer, cat_col_names)
])

The pipeline is created and hyperparameter tuning is implemented.

In [None]:
log_pipe = Pipeline([
    ("log_preprocessor", log_preprocessor),
    ("log_reg", LogisticRegression(
        solver="lbfgs",
        penalty="l2",
        max_iter=300
    ))
])

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=9)

param_dist = {
    "log_reg__C": np.logspace(-3, 3, 15),
}

search = RandomizedSearchCV(log_pipe, param_distributions=param_dist, n_iter=10, cv=kf, random_state=9, n_jobs=-1)

The model is trained, tested, and scored with a classification report.

In [None]:
simplefilter("ignore", category=ConvergenceWarning)

search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

The model performs very well across all metrics.

Permutation importance is used to measure the importance of the individual features. It will be used across all models so that the results are comparable. Because the importance is computed on the fitted estimator rather than on the whole pipeline, the test set must first be transformed with all preprocessing steps so that its columns match those the estimator was trained on.
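
The idea behind permutation importance can be sketched by hand: shuffle one column at a time and record how much the score drops. The example below uses synthetic data from `make_classification` (not the exoplanet catalog) and is only an illustration of the technique, not a replacement for `permutation_importance`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False, columns 0 and 1 are informative
# and the remaining three columns are pure noise
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, shuffle=False, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=9)
model = LogisticRegression(max_iter=300).fit(X_tr, y_tr)

rng = np.random.default_rng(9)
baseline = model.score(X_te, y_te)  # accuracy on the untouched test set

importances = []
for col in range(X_te.shape[1]):
    drops = []
    for _ in range(10):  # plays the role of n_repeats
        X_perm = X_te.copy()
        # Shuffling one column breaks its link to the target
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - model.score(X_perm, y_te))
    importances.append(float(np.mean(drops)))

print(importances)
```

A large mean drop means the model relied on that column; values near zero mean the column contributed little.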

In [None]:
# Extract components
preprocessor = best_model.named_steps["log_preprocessor"]
log_reg = best_model.named_steps["log_reg"]

# Remove name of transformer from each feature
raw_feature_names = preprocessor.get_feature_names_out()

clean_feature_names = [
    name.split("__", 1)[1] if "__" in name else name
    for name in raw_feature_names
]

# Transform X_test into expanded feature space
X_test_transformed = preprocessor.transform(X_test)

# Run permutation importance on the classifier only
importances = permutation_importance(log_reg, X_test_transformed, y_test, n_repeats=10, random_state=9, n_jobs=-1)

# Display results
highest_importances = pd.Series(importances.importances_mean, index=clean_feature_names).sort_values(ascending=False).head(10)
plt.bar(highest_importances.index, highest_importances)
plt.xticks(rotation=90)
plt.title("Logistic Regression Feature Importance")
plt.xlabel("Feature")
plt.ylabel("Drop in Performance")
plt.show()

The most important feature for predicting the type of the planet is P_RADIUS_EST. This makes sense, as a planet's classification is based on its radius relative to Earth's.

# Polynomial Logistic Regression

Now polynomial features will be added to see if there will be a significant difference.

Transformers are created once again.

In [None]:
# Build transformers
poly_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler())
])

# Combine transformers
poly_log_preprocessor = ColumnTransformer([
    ("poly_transformer", poly_transformer, num_col_names),
    ("ohe_cat_transformer", ohe_cat_transformer, cat_col_names)
])

The pipeline is created and hyperparameter tuning is implemented.

In [None]:
poly_log_pipe = Pipeline([
    ("poly_log_preprocessor", poly_log_preprocessor),
    ("log_reg", LogisticRegression(
        solver="lbfgs",
        penalty="l2",
        max_iter=300
    ))
])

param_dist = {
    # "log_reg__C": np.logspace(-3, 3, 15),
    "log_reg__C" : [51.794746792312125]
}

search = RandomizedSearchCV(poly_log_pipe, param_distributions=param_dist, n_iter=10, cv=kf, random_state=9, n_jobs=-1)

The model is trained, tested, and scored with a classification report.

In [None]:
# simplefilter("ignore", category=ConvergenceWarning)

# search.fit(X_train, y_train)
# best_model = search.best_estimator_
# for param, value in search.best_params_.items():
#   print(param,":", value)

# y_pred = best_model.predict(X_test)

In [None]:
# print(classification_report(y_test, y_pred))

The model performs about the same as without polynomial features. However, the time necessary to fit it is significantly longer. Therefore, there seems to be little reason to use polynomial logistic regression.

Feature importance is ignored, as most of the features are simply engineered polynomial features.

# Decision Tree

Now, a decision tree model will be trained following the same process.

A new categorical transformer is created using ordinal encoding, which is suitable for tree-based models.

In [None]:
tree_cat_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
])

# Combine transformers
tree_preprocessor = ColumnTransformer([
    ("num_transformer", num_transformer, num_col_names),
    ("tree_cat_transformer", tree_cat_transformer, cat_col_names)
])
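
The `unknown_value=-1` setting determines how categories that were not seen during fitting are encoded. A minimal sketch with hypothetical category values:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Hypothetical training categories; they are encoded in sorted order
train = np.array([["Transit"], ["Radial Velocity"], ["Imaging"]], dtype=object)
encoder.fit(train)

# "Microlensing" was not seen during fit, so it maps to -1
test = np.array([["Transit"], ["Microlensing"]], dtype=object)
encoded = encoder.transform(test)
print(encoded)
```

Without this setting, the default behaviour of the encoder is to raise an error on unseen categories.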

The pipeline is created and hyperparameter tuning is implemented, testing ranges of values for the major hyperparameters.

In [None]:
tree_pipe = Pipeline([
    ("tree_preprocessor", tree_preprocessor),
    ("dec_tree", DecisionTreeClassifier())
])

param_dist = {
    "dec_tree__max_depth": [None, 2, 5, 10, 20],
    "dec_tree__min_samples_split": [2, 5, 10, 20, 50],
    "dec_tree__min_samples_leaf": [1, 2, 5, 10, 20],
    "dec_tree__max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(tree_pipe, param_distributions=param_dist, n_iter=10, cv=kf, random_state=9, n_jobs=-1)
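
With `n_iter=10`, the randomized search evaluates only a small sample of the full grid. The sketch below counts the combinations in the grid above and draws the same number of candidates with `ParameterSampler`; it illustrates the sampling only and does not fit any model:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    "dec_tree__max_depth": [None, 2, 5, 10, 20],
    "dec_tree__min_samples_split": [2, 5, 10, 20, 50],
    "dec_tree__min_samples_leaf": [1, 2, 5, 10, 20],
    "dec_tree__max_features": ["sqrt", "log2", None],
}

# Size of the exhaustive grid: 5 * 5 * 5 * 3 combinations
n_total = len(ParameterGrid(param_dist))

# Ten candidates drawn at random from that grid
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=9))

print(n_total, len(sampled))
for candidate in sampled[:3]:
    print(candidate)
```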

The model is trained, tested, and scored with a classification report.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

The metrics are notably better than those from the logistic regression model. One plausible reason is that planet types are defined by radius thresholds, which the axis-aligned splits of a decision tree can capture directly.

Permutation importance is once again used to measure feature importance.

In [None]:
# Extract components
preprocessor = best_model.named_steps["tree_preprocessor"]
dec_tree = best_model.named_steps["dec_tree"]

# Remove name of transformer from each feature
raw_feature_names = preprocessor.get_feature_names_out()

clean_feature_names = [
    name.split("__", 1)[1] if "__" in name else name
    for name in raw_feature_names
]

# Transform X_test into expanded feature space
X_test_transformed = preprocessor.transform(X_test)

# Run permutation importance on the classifier only
importances = permutation_importance(dec_tree, X_test_transformed, y_test, n_repeats=10, random_state=9, n_jobs=-1)

# Display results
highest_importances = pd.Series(importances.importances_mean, index=clean_feature_names).sort_values(ascending=False).head(10)
plt.bar(highest_importances.index, highest_importances)
plt.xticks(rotation=90)
plt.title("Decision Tree Feature Importance")
plt.xlabel("Feature")
plt.ylabel("Drop in Performance")
plt.show()

In comparison to the logistic regression model, P_MASS_EST has more significance for prediction. The planet's radius is still the most important predictor, however.

# Random Forest

Now, a random forest model will be trained. The pipeline is created and hyperparameter tuning is implemented, testing ranges of values for the major hyperparameters.

In [None]:
# forest_pipe = Pipeline([
#     ("tree_preprocessor", tree_preprocessor),
#     ("rand_for", RandomForestClassifier())
# ])

# param_dist = {
#     "rand_for__n_estimators": [200, 400, 600, 800],
#     "rand_for__max_depth": [None, 5, 10, 20, 40],
#     "rand_for__min_samples_split": [2, 5, 10, 20],
#     "rand_for__min_samples_leaf": [1, 2, 5, 10],
#     "rand_for__max_features": ["sqrt", "log2", None],
#     "rand_for__bootstrap": [True, False],
# }

# search = RandomizedSearchCV(forest_pipe, param_distributions=param_dist, n_iter=10, cv=kf, random_state=9, n_jobs=-1)

The model is trained, tested, and scored with a classification report.

In [None]:
# search.fit(X_train, y_train)
# best_model = search.best_estimator_
# for param, value in search.best_params_.items():
#   print(param,":", value)

# y_pred = best_model.predict(X_test)

In [None]:
# print(classification_report(y_test, y_pred))

The metrics are around the same as those for the decision tree model. However, it takes much longer to fit, making random forests apparently unnecessary for planet classification.

Permutation importance is once again used to measure feature importance.

In [None]:
# # Extract components
# preprocessor = best_model.named_steps["tree_preprocessor"]
# rand_for = best_model.named_steps["rand_for"]

# # Transform X_test into expanded feature space
# X_test_transformed = preprocessor.transform(X_test)

# # Run permutation importance on the classifier only
# importances = permutation_importance(rand_for, X_test_transformed, y_test, n_repeats=10, random_state=9, n_jobs=-1)

# # Display results
# highest_importances = pd.Series(importances.importances_mean, index=clean_feature_names).sort_values(ascending=False).head(10)
# plt.bar(highest_importances.index, highest_importances)
# plt.xticks(rotation=90)
# plt.title("Random Forest Feature Importance")
# plt.xlabel("Feature")
# plt.ylabel("Drop in Performance")
# plt.show()

As before, P_RADIUS_EST and P_MASS_EST are the two most important features. As with the decision tree model, the other features have little importance compared to those two.

# Ridge Regression

Now, the mass of planets will be predicted using various models, starting with Ridge regression. Ridge is chosen over standard linear regression to enable regularization.

As mentioned earlier, P_MASS_EST is transformed to be in log space.

In [None]:
log_df = df.copy()
log_df["Log_Mass"] = np.log10(log_df["P_MASS_EST"])
log_df = log_df.drop("P_MASS_EST", axis=1)
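
For reference, the sketch below shows the transform and its inverse on hypothetical mass values, along with how an error in log space translates back to the original scale:

```python
import numpy as np

# Hypothetical masses spanning several orders of magnitude
masses = np.array([0.05, 1.0, 300.0, 4000.0])

log_mass = np.log10(masses)   # compresses the range to roughly -1.3 .. 3.6
recovered = 10 ** log_mass    # inverse transform back to the original scale

# An absolute error of 0.5 in log10 space is a multiplicative factor
factor = 10 ** 0.5
print(log_mass)
print(recovered)
print(factor)
```

An error measured on Log_Mass is therefore a relative error on the mass itself: being off by 0.5 means being off by a factor of about 3.16.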

The data is split into the features and the target.

In [None]:
X = log_df.drop("Log_Mass", axis=1)
y = log_df["Log_Mass"]

All rows where the target value is null are removed.

In [None]:
y_na = y[y.isna()]
data = X.join(y)
data = data.drop(y_na.index)
X = data.drop("Log_Mass", axis=1)
y = data["Log_Mass"]
print(y.isna().sum())

The data is split into the training and testing data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

Transformers for numerical and categorical data are created.

In [None]:
# Separate numerical and categorical features
num_features = X_train.select_dtypes(include=np.number)
cat_features = X_train.select_dtypes(exclude=np.number)
num_col_names = num_features.columns
cat_col_names = cat_features.columns

# Combine transformers
ridge_preprocessor = ColumnTransformer([
    ("num_transformer", num_transformer, num_col_names),
    ("ohe_cat_transformer", ohe_cat_transformer, cat_col_names)
])

The pipeline is created and hyperparameter tuning is implemented.

In [None]:
ridge_pipe = Pipeline([
    ("ridge_preprocessor", ridge_preprocessor),
    ("ridge", Ridge())
])

kf = KFold(n_splits=5, shuffle=True, random_state=9)

param_dist = {
    "ridge__alpha": np.logspace(-4, 4)
}

search = RandomizedSearchCV(ridge_pipe, param_distributions=param_dist, scoring="neg_root_mean_squared_error", n_iter=50, cv=kf,
                            random_state=9, n_jobs=-1)
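
Note that scikit-learn scorers follow a "higher is better" convention, so error metrics are negated. The sketch below runs a small search on synthetic data from `make_regression` (not the catalog) to show how the cross-validated RMSE is recovered from `best_score_`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=9)

toy_search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": np.logspace(-4, 4, 9)},
    scoring="neg_root_mean_squared_error",
    n_iter=5,
    cv=KFold(n_splits=5, shuffle=True, random_state=9),
    random_state=9,
)
toy_search.fit(X, y)

# best_score_ is the negated RMSE, so flip the sign to read it as an error
cv_rmse = -toy_search.best_score_
print(toy_search.best_params_, cv_rmse)
```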

The model is trained and used to predict on the test set.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

In [None]:
y_pred_se = pd.Series(y_pred)
y_pred_se.describe()

In [None]:
y_test_se = pd.Series(y_test)
y_test_se.describe()

The RMSE is used to evaluate the model.

In [None]:
print("RMSE:", root_mean_squared_error(y_test, y_pred))

The model performs reasonably well by RMSE. The search is now repeated with MAE as the scoring metric.

In [None]:
search = RandomizedSearchCV(ridge_pipe, param_distributions=param_dist, scoring="neg_mean_absolute_error", n_iter=50, cv=kf,
                            random_state=9, n_jobs=-1)

The model is trained and used to predict on the test set.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

In [None]:
y_pred_se = pd.Series(y_pred)
y_pred_se.describe()

The MAE is used to evaluate the model.

In [None]:
print("MAE:", mean_absolute_error(y_test, y_pred))

# Polynomial Ridge Regression

The output from Ridge will be compared to its output with polynomial features.

The pipeline is created and hyperparameter tuning is implemented.

In [None]:
poly_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler())
])

poly_ridge_preprocessor = ColumnTransformer([
    ("poly_transformer", poly_transformer, num_col_names),
    ("ohe_cat_transformer", ohe_cat_transformer, cat_col_names)
])

In [None]:
poly_ridge_pipe = Pipeline([
    ("poly_ridge_preprocessor", poly_ridge_preprocessor),
    ("ridge", Ridge())
])

param_dist = {
    "poly_ridge_preprocessor__poly_transformer__poly__degree" : [2, 3],
    "ridge__alpha" : np.logspace(-4, 4)
}

search = RandomizedSearchCV(poly_ridge_pipe, param_distributions=param_dist, scoring="neg_root_mean_squared_error", n_iter=10, cv=kf,
                            random_state=9, n_jobs=-1)

The model is trained and used to predict on the test set.

In [None]:
# search.fit(X_train, y_train)
# best_model = search.best_estimator_
# for param, value in search.best_params_.items():
#   print(param,":", value)

# y_pred = best_model.predict(X_test)

In [None]:
# y_pred_se = pd.Series(y_pred)
# y_pred_se.describe()

In [None]:
# y_pred_se.sort_values(ascending=False).head(10)

The RMSE is used to evaluate the model.

In [None]:
# print("RMSE:", root_mean_squared_error(y_test, y_pred))

The model performs reasonably well by RMSE. The search is now repeated with MAE as the scoring metric.

In [None]:
# search = RandomizedSearchCV(poly_ridge_pipe, param_distributions=param_dist, scoring="neg_mean_absolute_error", n_iter=10, cv=kf,
#                             random_state=9, n_jobs=-1)

In [None]:
# search.fit(X_train, y_train)
# best_model = search.best_estimator_
# for param, value in search.best_params_.items():
#   print(param,":", value)

# y_pred = best_model.predict(X_test)

In [None]:
# y_pred_se = pd.Series(y_pred)
# y_pred_se.describe()

In [None]:
# y_pred_se.sort_values(ascending=False).head(10)

In [None]:
# print("MAE:", mean_absolute_error(y_test, y_pred))

# Decision Tree Regressor

A decision tree regressor is trained next. The data is split into the features and the target. This time the untransformed P_MASS_EST is used as the target.

In [None]:
X = df.drop("P_MASS_EST", axis=1)
y = df["P_MASS_EST"]

All rows where the target value is null are removed.

In [None]:
y_na = y[y.isna()]
data = X.join(y)
data = data.drop(y_na.index)
X = data.drop("P_MASS_EST", axis=1)
y = data["P_MASS_EST"]
print(y.isna().sum())

The data is split into the training and testing data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

In [None]:
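# Rebuild the tree preprocessor for the regression feature set: the earlier
# tree_preprocessor was built from the classification columns, which include
# P_MASS_EST (now the target) and exclude P_TYPE
num_col_names = X_train.select_dtypes(include=np.number).columns
cat_col_names = X_train.select_dtypes(exclude=np.number).columns

tree_preprocessor = ColumnTransformer([
    ("num_transformer", num_transformer, num_col_names),
    ("tree_cat_transformer", tree_cat_transformer, cat_col_names)
])

tree_pipe = Pipeline([
    ("tree_preprocessor", tree_preprocessor),
    ("dec_tree", DecisionTreeRegressor())
])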

param_dist = {
    "dec_tree__max_depth": [None, 3, 5, 10, 20],
    "dec_tree__min_samples_split": [2, 5, 10, 20, 50],
    "dec_tree__min_samples_leaf": [1, 2, 5, 10, 20, 50],
    "dec_tree__max_features": [None, "sqrt", "log2"]
}

search = RandomizedSearchCV(tree_pipe, param_distributions=param_dist, scoring="neg_root_mean_squared_error", n_iter=50, cv=kf,
                            random_state=9, n_jobs=-1)

The model is trained and used to predict on the test set.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

The RMSE is used to evaluate the model.

In [None]:
print("RMSE:", root_mean_squared_error(y_test, y_pred))

In [None]:
# Extract components
preprocessor = best_model.named_steps["tree_preprocessor"]
dec_tree = best_model.named_steps["dec_tree"]

# Remove name of transformer from each feature
raw_feature_names = preprocessor.get_feature_names_out()

clean_feature_names = [
    name.split("__", 1)[1] if "__" in name else name
    for name in raw_feature_names
]

# Transform X_test into expanded feature space
X_test_transformed = preprocessor.transform(X_test)

# Run permutation importance on the regressor only
importances = permutation_importance(dec_tree, X_test_transformed, y_test, scoring="neg_root_mean_squared_error", n_repeats=10,
                                     random_state=9, n_jobs=-1)

# Display results
highest_importances = pd.Series(importances.importances_mean, index=clean_feature_names).sort_values(ascending=False).head(10)
plt.bar(highest_importances.index, highest_importances)
plt.xticks(rotation=90)
plt.title("Decision Tree Feature Importance")
plt.xlabel("Feature")
plt.ylabel("Drop in Performance")
plt.show()

The model performs reasonably well by RMSE. The search is now repeated with MAE as the scoring metric.

In [None]:
search = RandomizedSearchCV(tree_pipe, param_distributions=param_dist, scoring="neg_mean_absolute_error", n_iter=50, cv=kf,
                            random_state=9, n_jobs=-1)

The model is trained and used to predict on the test set.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

The MAE is used to evaluate the model.

In [None]:
print("MAE:", mean_absolute_error(y_test, y_pred))

In [None]:
# Extract components
dec_tree = best_model.named_steps["dec_tree"]

# Run permutation importance on the regressor only
importances = permutation_importance(dec_tree, X_test_transformed, y_test, scoring="neg_mean_absolute_error", n_repeats=10,
                                     random_state=9, n_jobs=-1)

# Display results
highest_importances = pd.Series(importances.importances_mean, index=clean_feature_names).sort_values(ascending=False).head(10)
plt.bar(highest_importances.index, highest_importances)
plt.xticks(rotation=90)
plt.title("Decision Tree Feature Importance")
plt.xlabel("Feature")
plt.ylabel("Drop in Performance")
plt.show()

# Random Forest Regressor

The pipeline is created and hyperparameter tuning is implemented.

In [None]:
tree_pipe = Pipeline([
    ("tree_preprocessor", tree_preprocessor),
    ("rand_for", RandomForestRegressor())
])

param_dist = {
    "rand_for__n_estimators": [200, 300, 500, 800],
    "rand_for__max_depth": [None, 5, 10, 20, 40],
    "rand_for__min_samples_split": [2, 5, 10, 20],
    "rand_for__min_samples_leaf": [1, 2, 5, 10, 20],
    "rand_for__max_features": ["sqrt", "log2", None],
    "rand_for__bootstrap": [True, False]
}

search = RandomizedSearchCV(tree_pipe, param_distributions=param_dist, scoring="neg_root_mean_squared_error", n_iter=50, cv=kf,
                            random_state=9, n_jobs=-1)

The model is trained and used to predict on the test set.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

The RMSE is used to evaluate the model.

In [None]:
print("RMSE:", root_mean_squared_error(y_test, y_pred))

In [None]:
# Extract components
preprocessor = best_model.named_steps["tree_preprocessor"]
rand_for = best_model.named_steps["rand_for"]

# Transform X_test into expanded feature space
X_test_transformed = preprocessor.transform(X_test)

# Run permutation importance on the regressor only
importances = permutation_importance(rand_for, X_test_transformed, y_test, scoring="neg_root_mean_squared_error", n_repeats=10,
                                     random_state=9, n_jobs=-1)

# Display results
highest_importances = pd.Series(importances.importances_mean, index=clean_feature_names).sort_values(ascending=False).head(10)
plt.bar(highest_importances.index, highest_importances)
plt.xticks(rotation=90)
plt.title("Random Forest Feature Importance")
plt.xlabel("Feature")
plt.ylabel("Drop in Performance")
plt.show()

The model performs reasonably well by RMSE. The search is now repeated with MAE as the scoring metric.

In [None]:
search = RandomizedSearchCV(tree_pipe, param_distributions=param_dist, scoring="neg_mean_absolute_error", n_iter=50, cv=kf,
                            random_state=9, n_jobs=-1)

The model is trained and used to predict on the test set.

In [None]:
search.fit(X_train, y_train)
best_model = search.best_estimator_
for param, value in search.best_params_.items():
  print(param,":", value)

y_pred = best_model.predict(X_test)

The MAE is used to evaluate the model.

In [None]:
print("MAE:", mean_absolute_error(y_test, y_pred))

In [None]:
# Extract components
rand_for = best_model.named_steps["rand_for"]

# Run permutation importance on the regressor only
importances = permutation_importance(rand_for, X_test_transformed, y_test, scoring="neg_mean_absolute_error", n_repeats=10,
                                     random_state=9, n_jobs=-1)

# Display results
highest_importances = pd.Series(importances.importances_mean, index=clean_feature_names).sort_values(ascending=False).head(10)
plt.bar(highest_importances.index, highest_importances)
plt.xticks(rotation=90)
plt.title("Random Forest Feature Importance")
plt.xlabel("Feature")
plt.ylabel("Drop in Performance")
plt.show()