## Classifying Exoplanets: Exploring NASA's Kepler Space Observatory Dataset

## Introduction

One of the most fascinating subjects in astronomical research is the search for exoplanets, planets that orbit stars beyond our solar system. The Kepler Space Observatory, a NASA space telescope built to find exoplanets, observed thousands of candidate planets, with a focus on those that are roughly Earth-sized and within habitable zones. From 2009 to 2018, Kepler revolutionized our understanding of extrasolar systems. Its observations were cross-checked and each object of interest was labeled as a confirmed planet, a candidate, or a false positive.

Our primary question is: *Can we accurately classify celestial bodies as exoplanets based on their observed characteristics using the Kepler exoplanet dataset?*

Our project analyzes the NASA Kepler exoplanet dataset. This dataset contains details about the observed objects, including planetary radius, transit measurements, stellar properties, and other essential attributes. By analyzing this dataset, we hope to develop a predictive classification model that distinguishes exoplanets from objects that are not exoplanets.

In [None]:
install.packages("devtools")
library(devtools)
install_github("ggobi/ggally")

In [None]:
install.packages("recipes")
install.packages("kknn")

In [None]:
library(tidyverse)
library(tidymodels)
library(GGally)
library(repr)
library(recipes)
library(kknn)
options(repr.matrix.max.rows = 6)

## Loading Original Data

In [None]:
## Reading the original Kaggle data (cumulative.csv) from the copy hosted on GitHub
exoplanet <- read_csv("https://raw.githubusercontent.com/QuwackJ/dsci-100-group-37/main/Data/cumulative.csv")
cat("\n↓ Table 1. First 6 rows of the Exoplanet dataset ↓\n")
head(exoplanet)

## Wrangling Original Data

In [None]:
## Counting NA values in original data
na_in_exoplanet <- exoplanet |>
                   summarize_all(~ sum(is.na(.)))
cat("\n↓ Table 2. NA values in the Exoplanet dataset ↓\n")
na_in_exoplanet

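## Relabeling FALSE POSITIVE as NOT EXOPLANET, keeping only high-confidence labels
## (NOT EXOPLANET with koi_score <= 0.3, CONFIRMED with koi_score >= 0.8; CANDIDATE rows are dropped),
## and selecting the label, koi_score, and our predictors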
exoplanet_selected <- exoplanet |>
                        mutate(koi_disposition = as_factor(koi_disposition)) |>
                        mutate(koi_disposition = fct_recode(koi_disposition, "NOT EXOPLANET" = "FALSE POSITIVE")) |>
                        filter((koi_disposition == "NOT EXOPLANET" & koi_score <= 0.3) | (koi_disposition == "CONFIRMED" & koi_score >= 0.8)) |>
                        select(koi_disposition, koi_score, koi_period, koi_depth, koi_duration, koi_impact)
cat("\n↓ Table 3. First 6 rows of the selected Exoplanet dataset ↓\n")
head(exoplanet_selected)

## Counting the number of rows in the selected data
row_count_exoplanet_selected <- count(exoplanet_selected)
cat("\n↓ Table 4. Counts of rows in the selected Exoplanet dataset ↓\n")
row_count_exoplanet_selected

## Counting NA values in selected data
na_in_exoplanet_selected <- exoplanet_selected |>
                            summarize_all(~ sum(is.na(.)))
cat("\n↓ Table 5. NA values in the selected Exoplanet dataset ↓\n")
na_in_exoplanet_selected

## Removing NA values in exoplanet_selected
exoplanet_selected <- exoplanet_selected |>
                      drop_na()  
cat("\n↓ Table 6. First 6 rows of the selected Exoplanet dataset without NA values ↓\n")
head(exoplanet_selected)

## Splitting into training and testing data
## Setting the seed first so that the random split is reproducible
set.seed(1234)
exoplanet_split <- initial_split(exoplanet_selected, prop = 0.75, strata = koi_disposition)
training_data <- training(exoplanet_split)
testing_data <- testing(exoplanet_split)

## Classification

In [None]:
set.seed(1234) 

options(repr.plot.height = 6, repr.plot.width = 6)

recipe <- recipe(koi_disposition ~ ., data = training_data) |>
          step_rm(koi_score) |>
          step_scale(all_predictors()) |>
          step_center(all_predictors())  

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
           set_engine("kknn") |>
           set_mode("classification")

vfold <- vfold_cv(training_data, v = 5, strata = koi_disposition)

k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))

knn_fit <- workflow() |>
          add_recipe(recipe) |>
          add_model(knn_tune) |>
          tune_grid(resamples = vfold, grid = k_vals) |>
          collect_metrics()
      
accuracies <- knn_fit |> 
            filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(x = "Neighbors", y = "Accuracy Estimate") + 
                  ggtitle("Figure 1. Estimated accuracy versus number of neighbors for the training data") +
                  theme(text = element_text(size = 12))

cross_val_plot

In [None]:
exo_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 6) |>
          set_engine("kknn") |>
          set_mode("classification")

exo_fit <- workflow() |>
          add_recipe(recipe) |>
          add_model(exo_spec) |>
          fit(data = training_data)

final_prediction <- predict(exo_fit, testing_data) |>
                    bind_cols(testing_data)

final_prediction_accuracy <- final_prediction |>
                            metrics(truth = koi_disposition, estimate = .pred_class)
cat("\n↓ Table 7. Accuracies of predicting the testing data ↓\n")

final_prediction_accuracy

**Classification analysis**

After observing the relationships between our chosen predictors, we addressed our predictive question: can celestial bodies be accurately classified as exoplanets using the predictors selected from the Kepler exoplanet dataset?

We used the K-Nearest Neighbors (KNN) algorithm for this classification analysis, with the dataset split into 75% training data and 25% testing data. Holding out the testing data is essential for assessing the performance and generalization of the KNN model: it allows an unbiased evaluation on unseen data and makes overfitting detectable, which is critical for developing reliable machine-learning models.

Next, we created a model recipe and a specification for the training data and used 5-fold cross-validation to tune the hyperparameter K. The recipe removes koi_score and standardizes the remaining predictors. We tuned the KNN classifier for K from 1 to 15 and collected the estimated accuracy for each value. Plotting estimated accuracy against the number of neighbors showed the best K to be 5 or 6 (Fig. 1): the accuracy estimate increases as K goes from 1 to 6 and then decreases slightly.

Since the dataset is relatively large, as shown in Table 8, choosing a larger K has advantages, such as lower variance, which reduces overfitting and sensitivity to noise. However, it also has drawbacks, including potential loss of local detail and increased bias. The optimal K depends on the specific characteristics of the dataset and on the balance between bias and variance that suits the problem.
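
As a minimal sketch of the selection step described above, the snippet below picks K from a table of cross-validation accuracies. The data frame `cv_results` and its values are hypothetical stand-ins for the notebook's `accuracies` table, and breaking ties toward the larger K is an assumption made for illustration, not a rule taken from our analysis.

```r
# Hypothetical cross-validation summary: one row per candidate K.
# These accuracies are illustrative only and are NOT results from the analysis.
cv_results <- data.frame(
  neighbors = 1:8,
  mean = c(0.80, 0.82, 0.83, 0.84, 0.85, 0.85, 0.84, 0.84)
)

# Find the highest mean accuracy across the candidate values of K
best_accuracy <- max(cv_results$mean)

# Several K values can tie (here 5 and 6), so collect all of them;
# a small tolerance guards against floating-point noise in real metrics
tied_k <- cv_results$neighbors[abs(cv_results$mean - best_accuracy) < 1e-9]

# Assumption: among tied values, prefer the larger K (lower variance)
best_k <- max(tied_k)

print(tied_k)
print(best_k)
```

With the notebook's own objects, the same idea would apply to the `neighbors` and `mean` columns of `accuracies`.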

Based on the accuracy plot, we chose K = 6 for the final KNN classifier, fit it to the training data, and evaluated it on the testing data. We used the metrics function to assess the model's performance: the final accuracy was 0.8538360 and the Kappa (kap) statistic was 0.6885618 (Table 7). Accuracy is the proportion of correctly classified observations in the testing set, so an accuracy of 85.38% suggests that our classification model performs well. In the context of our predictive question, it indicates that the selected characteristics in the Kepler exoplanet dataset are informative enough to distinguish exoplanets from objects that are not exoplanets. The Kappa statistic measures the agreement between the predicted and actual classifications while accounting for agreement that would occur by chance. A Kappa value of 0.6885618 suggests substantial agreement beyond what random chance would produce, so the model's predictive power is not simply an artifact of chance agreement.
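
To make the two metrics concrete, here is a small worked example in base R that computes accuracy and Cohen's Kappa from a confusion matrix. The counts are hypothetical and chosen for easy arithmetic; they are not the confusion matrix of our model, and the variable names are introduced only for this sketch.

```r
# Hypothetical confusion matrix (rows = truth, columns = prediction).
# The counts are made up for illustration and are NOT our model's results.
confusion <- matrix(
  c(40, 10,
     5, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    truth = c("CONFIRMED", "NOT EXOPLANET"),
    prediction = c("CONFIRMED", "NOT EXOPLANET")
  )
)

n <- sum(confusion)

# Observed agreement: proportion of observations on the diagonal (accuracy)
accuracy <- sum(diag(confusion)) / n

# Expected chance agreement: for each class, the product of the truth and
# prediction marginal proportions, summed over classes
expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled so that 1 is perfect
kappa_stat <- (accuracy - expected) / (1 - expected)

print(accuracy)
print(kappa_stat)
```

In this toy case, chance alone would produce 50% agreement, the accuracy is 0.85, and Kappa is 0.7, meaning the classifier closes 70% of the gap between chance agreement and perfect agreement.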

## Predictor Correlation Plot

In [None]:
options(repr.plot.width = 11, repr.plot.height = 11)
## Columns 3 to 6 are the four predictors: koi_period, koi_depth, koi_duration, koi_impact
predictor_plot <- ggpairs(exoplanet_selected, columns = 3:6,
                          ggplot2::aes(colour = koi_disposition), 
                          upper = list(continuous = wrap("cor", size = 4)), 
                          lower = list(continuous = wrap("points", alpha = 0.6)))
predictor_plot + theme(axis.text = element_text(size = 12), 
                       strip.text.x = element_text(size = 15), 
                       strip.text.y = element_text(size = 15)) +
                       ggtitle("Figure 2. Correlation between all the predictors in the selected Exoplanet dataset")

In [None]:
## Summary tables

## Counting the number of CONFIRMED and NOT EXOPLANET observations
disposition_count <- exoplanet_selected |>
                     group_by(koi_disposition) |>
                     summarize(count = n())

cat("\n↓ Table 8. Disposition Count ↓\n")

disposition_count

## Mean and SD of the predictor values for CONFIRMED exoplanets
exoplanet_filter_confirmed_mean <- exoplanet_selected |>
                                filter(koi_disposition == "CONFIRMED") |>
                                select(-koi_disposition, -koi_score) |>
                                map_df(mean, na.rm = TRUE)


exoplanet_filter_confirmed_sd <- exoplanet_selected |>
                                filter(koi_disposition == "CONFIRMED") |>
                                select(-koi_disposition, -koi_score) |>
                                map_df(sd, na.rm = TRUE)



cat("\n↓ Table 9. Means of all predictors for CONFIRMED exoplanets ↓\n")

exoplanet_filter_confirmed_mean

cat("\n↓ Table 10. Standard deviation of all predictors for CONFIRMED exoplanets ↓\n")

exoplanet_filter_confirmed_sd

## Mean and SD of the predictor values for NOT EXOPLANET observations
exoplanet_filter_not_exoplanet_mean <- exoplanet_selected |>
                                filter(koi_disposition == "NOT EXOPLANET") |>
                                select(-koi_disposition, -koi_score) |>
                                map_df(mean, na.rm = TRUE)



exoplanet_filter_not_exoplanet_sd <- exoplanet_selected |>
                                filter(koi_disposition == "NOT EXOPLANET") |>
                                select(-koi_disposition, -koi_score) |>
                                map_df(sd, na.rm = TRUE)

cat("\n↓ Table 11. Means of all predictors for NOT EXOPLANETS ↓\n")

exoplanet_filter_not_exoplanet_mean

cat("\n↓ Table 12. Standard deviation of all predictors for NOT EXOPLANETS ↓\n")

exoplanet_filter_not_exoplanet_sd

## References

Dataset: https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results

Column Explanation: https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html
