## Classifying Exoplanets: Exploring NASA's Kepler Space Observatory Dataset

## Introduction

Finding exoplanets, planets that orbit stars beyond our solar system, is one of the most fascinating pursuits in astronomical research. The Kepler Space Observatory, a NASA space telescope built to find exoplanets, observed thousands of potential planets, with particular attention to those that are roughly Earth-sized and lie within habitable zones. From 2009 to 2018, Kepler revolutionized our understanding of extrasolar systems. Its observations of potential exoplanets were cross-checked and labeled as confirmed planets, candidates, or false positives.

Our primary question is: *Can we accurately classify celestial bodies as exoplanets based on their observed characteristics using the Kepler exoplanet dataset?*

To answer it, we analyze the NASA Kepler exoplanet dataset, which describes each observed object through attributes such as its radius, transit characteristics, and stellar luminosity. From this dataset, we aim to build a predictive classification model that distinguishes exoplanets from other extrasolar objects.

In [4]:
install.packages("GGally")
install.packages("recipes")
install.packages("kknn")

In [5]:
library(tidyverse)
library(tidymodels)
library(GGally)
library(repr)
library(recipes)
library(kknn)
options(repr.matrix.max.rows = 6)

## Loading Original Data

In [6]:
## Reading the original Kaggle data (hosted in the project's GitHub repository)
exoplanet <- read_csv("https://raw.githubusercontent.com/QuwackJ/dsci-100-group-37/main/Data/cumulative.csv")

head(exoplanet)

Rows: 9564 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): kepoi_name, kepler_name, koi_disposition, koi_pdisposition, koi_tc...
dbl (43): rowid, kepid, koi_score, koi_fpflag_nt, koi_fpflag_ss, koi_fpflag_...
lgl  (2): koi_teq_err1, koi_teq_err2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| rowid | kepid | kepoi_name | kepler_name | koi_disposition | koi_pdisposition | koi_score | koi_fpflag_nt | koi_fpflag_ss | koi_fpflag_co | ⋯ | koi_steff_err2 | koi_slogg | koi_slogg_err1 | koi_slogg_err2 | koi_srad | koi_srad_err1 | koi_srad_err2 | ra | dec | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 10797460 | K00752.01 | Kepler-227 b | CONFIRMED | CANDIDATE | 1.0 | 0 | 0 | 0 | ⋯ | -81 | 4.467 | 0.064 | -0.096 | 0.927 | 0.105 | -0.061 | 291.9342 | 48.14165 | 15.347 |
| 2 | 10797460 | K00752.02 | Kepler-227 c | CONFIRMED | CANDIDATE | 0.969 | 0 | 0 | 0 | ⋯ | -81 | 4.467 | 0.064 | -0.096 | 0.927 | 0.105 | -0.061 | 291.9342 | 48.14165 | 15.347 |
| 3 | 10811496 | K00753.01 |  | FALSE POSITIVE | FALSE POSITIVE | 0.0 | 0 | 1 | 0 | ⋯ | -176 | 4.544 | 0.044 | -0.176 | 0.868 | 0.233 | -0.078 | 297.0048 | 48.13413 | 15.436 |
| 4 | 10848459 | K00754.01 |  | FALSE POSITIVE | FALSE POSITIVE | 0.0 | 0 | 1 | 0 | ⋯ | -174 | 4.564 | 0.053 | -0.168 | 0.791 | 0.201 | -0.067 | 285.5346 | 48.28521 | 15.597 |
| 5 | 10854555 | K00755.01 | Kepler-664 b | CONFIRMED | CANDIDATE | 1.0 | 0 | 0 | 0 | ⋯ | -211 | 4.438 | 0.07 | -0.21 | 1.046 | 0.334 | -0.133 | 288.7549 | 48.2262 | 15.509 |
| 6 | 10872983 | K00756.01 | Kepler-228 d | CONFIRMED | CANDIDATE | 1.0 | 0 | 0 | 0 | ⋯ | -232 | 4.486 | 0.054 | -0.229 | 0.972 | 0.315 | -0.105 | 296.2861 | 48.22467 | 15.714 |


## Wrangling Original Data

In [10]:
## Counting NA values in original data
na_in_exoplanet <- exoplanet |>
                   summarize_all(~ sum(is.na(.)))

na_in_exoplanet

## Selecting for our predictors
exoplanet_selected <- exoplanet |>
                        mutate(koi_disposition = as_factor(koi_disposition)) |>
                        mutate(koi_disposition = fct_recode(koi_disposition, "NOT EXOPLANET" = "FALSE POSITIVE")) |>
                        filter((koi_disposition == "NOT EXOPLANET" & koi_score <= 0.3) | (koi_disposition == "CONFIRMED" & koi_score >= 0.8)) |>
                        select(koi_disposition, koi_score, koi_period, koi_depth, koi_duration, koi_impact)

head(exoplanet_selected)

disposition_count <- exoplanet_selected |>
                     group_by(koi_disposition) |>
                     summarize(count = n())

disposition_count

## Counting NA values in selected data
na_in_exoplanet_selected <- exoplanet_selected |>
                            summarize_all(~ sum(is.na(.)))

na_in_exoplanet_selected

## Counting the number of rows in the selected data
row_count_exoplanet_selected <- count(exoplanet_selected)

row_count_exoplanet_selected


## Splitting into training and testing data
## (the seed is set first so that the random split is reproducible)
set.seed(1234)
exoplanet_split <- initial_split(exoplanet_selected, prop = 0.75, strata = koi_disposition)
training_data <- training(exoplanet_split)
testing_data <- testing(exoplanet_split)

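| rowid | kepid | kepoi_name | kepler_name | koi_disposition | koi_pdisposition | koi_score | koi_fpflag_nt | koi_fpflag_ss | koi_fpflag_co | ⋯ | koi_steff_err2 | koi_slogg | koi_slogg_err1 | koi_slogg_err2 | koi_srad | koi_srad_err1 | koi_srad_err2 | ra | dec | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 0 | 0 | 0 | 7270 | 0 | 0 | 1510 | 0 | 0 | 0 | ⋯ | 483 | 363 | 468 | 468 | 363 | 468 | 468 | 0 | 0 | 1 |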


| koi_disposition | koi_score | koi_period | koi_depth | koi_duration | koi_impact |
|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| CONFIRMED | 1.0 | 9.488036 | 615.8 | 2.9575 | 0.146 |
| CONFIRMED | 0.969 | 54.418383 | 874.8 | 4.507 | 0.586 |
| NOT EXOPLANET | 0.0 | 19.89914 | 10829.0 | 1.7822 | 0.969 |
| NOT EXOPLANET | 0.0 | 1.736952 | 8079.2 | 2.40641 | 1.276 |
| CONFIRMED | 1.0 | 2.525592 | 603.3 | 1.6545 | 0.701 |
| CONFIRMED | 1.0 | 11.094321 | 1517.5 | 4.5945 | 0.538 |


| koi_disposition | count |
|---|---|
| `<fct>` | `<int>` |
| CONFIRMED | 2183 |
| NOT EXOPLANET | 3916 |


| koi_disposition | koi_score | koi_period | koi_depth | koi_duration | koi_impact |
|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 0 | 0 | 0 | 54 | 0 | 54 |


| n |
|---|
| `<int>` |
| 6099 |
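
The preview of the selected data shows predictors on very different scales: `koi_depth` ranges from hundreds to thousands, while `koi_impact` stays close to 1. Because K-nearest neighbors classifies by distance, the larger-valued predictors could dominate unless the variables are standardized. The following is a minimal sketch of one way to do that with `recipes`, not a step taken in the analysis itself. It uses only the first four previewed rows and two of the predictors, and the names `toy`, `scaled_recipe`, and `toy_scaled` are illustrative.

```r
library(tidymodels)

# First four rows of the preview, restricted to two predictors for brevity
toy <- tibble(
  koi_disposition = factor(c("CONFIRMED", "CONFIRMED", "NOT EXOPLANET", "NOT EXOPLANET")),
  koi_depth = c(615.8, 874.8, 10829.0, 8079.2),
  koi_impact = c(0.146, 0.586, 0.969, 1.276)
)

# Recipe that rescales every predictor to mean 0 and standard deviation 1
scaled_recipe <- recipe(koi_disposition ~ ., data = toy) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the scaling parameters from the data;
# bake(new_data = NULL) returns the processed version of that same data
toy_scaled <- scaled_recipe |>
  prep() |>
  bake(new_data = NULL)

toy_scaled
```

When such a recipe is passed to a workflow, the same two steps would be estimated on the training data only and then applied unchanged to any data being predicted.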


In [11]:
set.seed(1234)

options(repr.plot.height = 5, repr.plot.width = 6)

## kknn drops rows with missing values when predicting, so the number of
## predictions no longer matches the fold size and every fold fails;
## rows with NA in koi_depth or koi_impact are therefore removed first
training_data <- drop_na(training_data)
testing_data <- drop_na(testing_data)

recipe <- recipe(koi_disposition ~ ., data = training_data)

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

vfold <- vfold_cv(training_data, v = 5, strata = koi_disposition)

k_vals <- tibble(neighbors = seq(from = 1, to = 5, by = 1))

## Cross-validated accuracy for each candidate number of neighbors
fit <- workflow() |>
       add_recipe(recipe) |>
       add_model(knn_tune) |>
       tune_grid(resamples = vfold, grid = k_vals) |>
       collect_metrics() |>
       filter(.metric == "accuracy")

cross_val_plot <- ggplot(data = fit, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(x = "Neighbors", y = "Accuracy Estimate") +
                  scale_x_continuous(breaks = seq(0, 14, by = 1)) +
                  scale_y_continuous(limits = c(0.4, 1.0)) +
                  theme(text = element_text(size = 12))

cross_val_plot


In [12]:
## Picking the number of neighbors with the highest cross-validated accuracy
best_k <- fit |>
          slice_max(mean, n = 1, with_ties = FALSE) |>
          pull(neighbors)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
            set_engine("kknn") |>
            set_mode("classification")

knn_fit <- workflow() |>
           add_recipe(recipe) |>
           add_model(knn_spec) |>
           fit(data = training_data)

final_prediction <- predict(knn_fit, testing_data)
final_prediction
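
Once predictions for the test set are available, they can be compared with the true labels to judge how well the classifier performs. The sketch below shows one way to do this with `yardstick` (loaded by `tidymodels`). The labels and predictions in `toy_results` are hypothetical placeholders, not results from the Kepler data, and the names `toy_results`, `toy_accuracy`, and `toy_conf_mat` are illustrative.

```r
library(tidymodels)

# Hypothetical true labels and predicted classes (placeholders, not real results)
class_levels <- c("CONFIRMED", "NOT EXOPLANET")
toy_results <- tibble(
  koi_disposition = factor(
    c("CONFIRMED", "CONFIRMED", "NOT EXOPLANET", "NOT EXOPLANET", "NOT EXOPLANET"),
    levels = class_levels
  ),
  .pred_class = factor(
    c("CONFIRMED", "NOT EXOPLANET", "NOT EXOPLANET", "NOT EXOPLANET", "CONFIRMED"),
    levels = class_levels
  )
)

# Proportion of observations whose predicted class matches the true class
toy_accuracy <- toy_results |>
  accuracy(truth = koi_disposition, estimate = .pred_class)

# Counts for each combination of predicted and true class
toy_conf_mat <- toy_results |>
  conf_mat(truth = koi_disposition, estimate = .pred_class)

toy_accuracy
toy_conf_mat
```

With the notebook's own objects, a comparable table could presumably be built by binding the predictions to the test set, for example `predict(knn_fit, testing_data) |> bind_cols(testing_data)`, before calling the same two functions.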


## References

Dataset: https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results

Column Explanation: https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html
