# Classifying Exoplanets: Exploring NASA's Kepler Space Observatory Dataset

## Introduction

One of the most fascinating problems in astronomy is the search for exoplanets: planets that orbit stars beyond our solar system. The Kepler Space Observatory, a NASA space telescope built to find exoplanets, observed thousands of possible planets, with particular interest in those that are roughly Earth-sized and lie within a habitable zone. Operating from 2009 to 2018, Kepler revolutionized our understanding of extrasolar systems. Its observations were cross-checked, and each object was labeled as a confirmed planet, a candidate, or a false positive.

Our primary question is: *Can we accurately classify celestial bodies as exoplanets based on their observed characteristics using the Kepler exoplanet dataset?*

Our project analyzes the NASA Kepler exoplanet dataset, which describes each observed object through attributes such as planetary radius, transit measurements, and stellar properties. In this section we keep the label (`koi_disposition`) and four transit-related predictors: orbital period in days (`koi_period`), transit depth in parts per million (`koi_depth`), transit duration in hours (`koi_duration`), and impact parameter (`koi_impact`). With these, we hope to develop a predictive model that distinguishes exoplanets from other observed objects.

In [2]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

In [3]:
## Reading the data
exoplanet <- read_csv("https://raw.githubusercontent.com/QuwackJ/dsci-100-group-37/main/Data/cumulative.csv")

## Selecting the label (koi_disposition) and our four predictors
exoplanet_selected <- exoplanet |>
    select(koi_disposition, koi_period, koi_depth, koi_duration, koi_impact)

## Previewing the first six rows
head(exoplanet_selected)

## Splitting into training (75%) and testing (25%) data, stratified by label
## set.seed() makes the random split reproducible; the seed value is an arbitrary choice
set.seed(2024)
exoplanet_split <- initial_split(exoplanet_selected, prop = 0.75, strata = koi_disposition)
training_data <- training(exoplanet_split)
testing_data <- testing(exoplanet_split)

Rows: 9564 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): kepoi_name, kepler_name, koi_disposition, koi_pdisposition, koi_tc...
dbl (43): rowid, kepid, koi_score, koi_fpflag_nt, koi_fpflag_ss, koi_fpflag_...
lgl  (2): koi_teq_err1, koi_teq_err2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| koi_disposition | koi_period | koi_depth | koi_duration | koi_impact |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| CONFIRMED | 9.488036 | 615.8 | 2.9575 | 0.146 |
| CONFIRMED | 54.418383 | 874.8 | 4.507 | 0.586 |
| FALSE POSITIVE | 19.89914 | 10829.0 | 1.7822 | 0.969 |
| FALSE POSITIVE | 1.736952 | 8079.2 | 2.40641 | 1.276 |
| CONFIRMED | 2.525592 | 603.3 | 1.6545 | 0.701 |
| CONFIRMED | 11.094321 | 1517.5 | 4.5945 | 0.538 |
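
The preview above shows `koi_disposition` stored as character (`<chr>`), while tidymodels classification models expect the outcome to be a factor. The following is a minimal sketch of how the label could be converted and how class counts and missing values might be checked before modeling. It re-enters only the six rows printed above so that it runs without a network connection. The object names (`preview`, `preview_fct`, `class_counts`, `missing_counts`) are illustrative and not part of the original analysis, and the full dataset would presumably also contain a candidate label, as the introduction describes.

```r
library(dplyr)
library(tibble)

# The six rows printed above, re-entered by hand so the sketch runs offline.
preview <- tribble(
  ~koi_disposition, ~koi_period, ~koi_depth, ~koi_duration, ~koi_impact,
  "CONFIRMED",       9.488036,     615.8,     2.9575,        0.146,
  "CONFIRMED",      54.418383,     874.8,     4.507,         0.586,
  "FALSE POSITIVE", 19.89914,    10829.0,     1.7822,        0.969,
  "FALSE POSITIVE",  1.736952,    8079.2,     2.40641,       1.276,
  "CONFIRMED",       2.525592,     603.3,     1.6545,        0.701,
  "CONFIRMED",      11.094321,    1517.5,     4.5945,        0.538
)

# Convert the character label to a factor, as classification models expect.
preview_fct <- preview |>
  mutate(koi_disposition = as.factor(koi_disposition))

# Count observations per label to gauge class balance.
class_counts <- preview_fct |>
  count(koi_disposition)

# Count missing values in every column.
missing_counts <- preview_fct |>
  summarize(across(everything(), ~ sum(is.na(.x))))

print(class_counts)
print(missing_counts)
```

On these six rows the counts are four `CONFIRMED` and two `FALSE POSITIVE`, with no missing values. The same three steps could be applied to `exoplanet_selected`, where the results will differ.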

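The split in the code cell passes `strata = koi_disposition` so that the training and testing sets keep roughly the same mix of labels as the full data. The sketch below illustrates that behavior on a small made-up data frame; `toy` and its values are hypothetical and generated only for this example, and the seed is arbitrary.

```r
library(rsample)

set.seed(2024)  # arbitrary seed, chosen only to make the sketch reproducible

# Hypothetical toy data: 200 rows with three labels in unequal proportions.
toy <- data.frame(
  koi_disposition = factor(rep(
    c("CONFIRMED", "CANDIDATE", "FALSE POSITIVE"),
    times = c(50, 50, 100)
  )),
  koi_period = runif(200, min = 0.5, max = 100)
)

# Stratified 75/25 split, mirroring the call used on the real data.
toy_split <- initial_split(toy, prop = 0.75, strata = koi_disposition)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Compare label proportions in the full data and in the training set.
prop_all <- prop.table(table(toy$koi_disposition))
prop_train <- prop.table(table(toy_train$koi_disposition))

print(round(prop_all, 2))
print(round(prop_train, 2))
```

The two printed tables should be close to each other, which is the point of stratifying: an unstratified split could by chance leave one label under-represented in the testing set.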

## References

Dataset: https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results

Column Explanation: https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html
