# Classification of Stellar Objects

### Introduction

In astronomy, classifying stars, galaxies, and quasars is fundamental to understanding our own galaxy. Galaxies are large systems of stars (Greshko, 2021), whereas quasars are active galactic nuclei powered by supermassive black holes at the centers of massive galaxies (Bañados et al., 2016). The three can be hard to tell apart through observation alone because they all radiate at different wavelengths; astronomers therefore use other spectral characteristics to identify objects.

This project aims to answer the following predictive question: Is it possible to use Sloan Digital Sky Survey (SDSS) measurements to predict whether a future stellar body of an unknown type is a star, quasar or galaxy?

The dataset used in our project contains 100,000 observations of space, each of which is classified as a star, galaxy, or quasar based on its spectral characteristics. Every observation was taken by the SDSS and given a unique object identifier.

Dataset of interest: 2017 Stellar Classification (SDSS17) https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17

The original dataset contains 18 variables, 7 of which we are interested in. The variables of interest are:

- **u**: the ultraviolet filter in the photometric system
- **g**: the green filter in the photometric system
- **r**: the red filter in the photometric system
- **i**: the near-infrared filter in the photometric system
- **z**: the infrared filter in the photometric system
- **class**: the object class (galaxy, star, or quasar object)
- **redshift**: the redshift value based on the increase in wavelength

The class variable is the categorical variable to be predicted. Photometric measurements record the intensity of light at different wavelengths and are used by astronomers to study the structure and composition of celestial objects (Grier & Rivkin, 2019), whereas redshift is used in distance, velocity, and other calculations. Hence, we chose these variables as potential predictors.


### Methods

The predictors we will use to classify a new SDSS observation are the five photometric measurements (originally labeled u, g, r, i, and z) and the redshift. Previous literature and classification models have shown that many of these variables correlate with object type and can help classify an observation as a galaxy, quasar, or star (Finlay-Freundlich, 1954; Wierzbiński et al., 2021; Simet et al., 2021).

In our analysis, we will use the K-nearest neighbors classification algorithm. As a preprocessing step, we will standardize (i.e. scale and center) the predictors in our training data. Because K-nearest neighbors relies on distances between observations, this keeps predictors on larger scales from unevenly influencing the model. We will also visualize our data to see how a new observation fits with the current dataset and to gauge the accuracy of our prediction.

We start by loading the packages necessary for our analysis.

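```r
# Load the packages used throughout the analysis
library(tidyverse)
library(repr)
library(tidymodels)

# Limit how many rows are printed when a data frame is displayed
options(repr.matrix.max.rows = 6)
```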

We use the `read_csv` function to read our dataset of interest and assign it to an object called `star_data`. We then check whether any columns are missing information by counting the `NA` cells in each column.

```r
# Read the SDSS17 data from the project repository
star_data <- read_csv("https://raw.githubusercontent.com/Margokap/DSCI100-group-03/main/star_classification.csv")

# Count the missing values (NA) in each column
map_df(star_data, ~sum(is.na(.)))
```

```
Rows: 100000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): class
dbl (17): obj_ID, alpha, delta, u, g, r, i, z, run_ID, rerun_ID, cam_col, fi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```


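| obj_ID | alpha | delta | u | g | r | i | z | run_ID | rerun_ID | cam_col | field_ID | spec_obj_ID | class | redshift | plate | MJD | fiber_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |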


No cells were missing information. Since we are only interested in the photometric variables, the redshift, and the class, we kept only these columns and assigned the result to an object called `star_data_tidy`.

```r
# Keep only the photometric filters, the redshift, and the class label
star_data_tidy <- select(star_data, u, g, r, i, z, redshift, class)

star_data_tidy
```

| u | g | r | i | z | redshift | class |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 23.87882 | 22.27530 | 20.39501 | 19.16573 | 18.79371 | 0.6347936 | GALAXY |
| 24.77759 | 22.83188 | 22.58444 | 21.16812 | 21.61427 | 0.7791360 | GALAXY |
| 25.26307 | 22.66389 | 20.60976 | 19.34857 | 18.94827 | 0.6441945 | GALAXY |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 21.16916 | 19.26997 | 18.20428 | 17.69034 | 17.35221 | 0.1433656 | GALAXY |
| 25.35039 | 21.63757 | 19.91386 | 19.07254 | 18.62482 | 0.4550396 | GALAXY |
| 22.62171 | 21.79745 | 20.60115 | 20.00959 | 19.28075 | 0.5429442 | GALAXY |


We then changed the column names to be more descriptive, which concluded the tidying of the dataset.

```r
# Replace the short column names with descriptive ones (same column order)
names(star_data_tidy) <- c("UV_filter",
                            "Green_filter",
                            "Red_filter",
                            "Near_Infrared_filter",
                            "Infrared_filter",
                            "Redshift",
                            "Stellar_object")
star_data_tidy
```

| UV_filter | Green_filter | Red_filter | Near_Infrared_filter | Infrared_filter | Redshift | Stellar_object |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 23.87882 | 22.27530 | 20.39501 | 19.16573 | 18.79371 | 0.6347936 | GALAXY |
| 24.77759 | 22.83188 | 22.58444 | 21.16812 | 21.61427 | 0.7791360 | GALAXY |
| 25.26307 | 22.66389 | 20.60976 | 19.34857 | 18.94827 | 0.6441945 | GALAXY |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 21.16916 | 19.26997 | 18.20428 | 17.69034 | 17.35221 | 0.1433656 | GALAXY |
| 25.35039 | 21.63757 | 19.91386 | 19.07254 | 18.62482 | 0.4550396 | GALAXY |
| 22.62171 | 21.79745 | 20.60115 | 20.00959 | 19.28075 | 0.5429442 | GALAXY |
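
The K-nearest neighbors plan described at the start of the Methods could be sketched with tidymodels roughly as follows. This is a minimal sketch under stated assumptions, not the project's final model. It runs on a small synthetic table (`toy_data`) rather than the SDSS data, and it uses only two of the six predictors. The class labels other than GALAXY, the seed, the 75/25 split, and `neighbors = 5` are illustrative choices. It also assumes the kknn package is installed.

```r
library(tidymodels)

set.seed(2023)  # arbitrary seed, used only to make the sketch reproducible

# Synthetic stand-in for star_data_tidy: NOT real SDSS values.
# The "STAR" and "QSO" labels are assumptions; only GALAXY appears in the preview.
n <- 60
toy_data <- tibble(
  Redshift       = c(rnorm(n, 0.5, 0.1), rnorm(n, 0, 0.001), rnorm(n, 1.7, 0.3)),
  UV_filter      = c(rnorm(n, 23, 1), rnorm(n, 21, 1), rnorm(n, 22, 1)),
  Stellar_object = factor(rep(c("GALAXY", "STAR", "QSO"), each = n))
)

# Hold out a test set, stratified by class (the 75/25 proportion is an assumption)
star_split <- initial_split(toy_data, prop = 0.75, strata = Stellar_object)
star_train <- training(star_split)
star_test  <- testing(star_split)

# Standardize (scale and center) every predictor using the training data only
star_recipe <- recipe(Stellar_object ~ ., data = star_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# K-nearest neighbors classifier; neighbors = 5 is a placeholder, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Bundle preprocessing and model, then fit on the training set
knn_fit <- workflow() %>%
  add_recipe(star_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = star_train)

# Predict the held-out observations and compute the accuracy
star_predictions <- predict(knn_fit, star_test) %>%
  bind_cols(star_test)

star_accuracy <- star_predictions %>%
  accuracy(truth = Stellar_object, estimate = .pred_class)

star_accuracy
```

With the real data, `Stellar_object` is read in as a character column, so it would likely need to be converted to a factor (for example with `as_factor()`) before fitting. The number of neighbors would normally be chosen by cross-validation rather than fixed in advance.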


### Results

### Discussion

### Impact of our work

### References