# Predicting the Nature of Space Objects
Samuel Robbins

### Abstract
The goal of this project is to use classification models to predict the nature of celestial bodies in the sky. Using k-Nearest Neighbors, Logistic Regression, Decision Trees, Random Forest, and Naive Bayes, I leveraged location data and spectral characteristics to build classification models with high predictive performance on this multi-class problem.

### Design
The Sloan Digital Sky Survey data used in this project is available on [Kaggle](https://www.kaggle.com/muhakabartay/sloan-digital-sky-survey-dr16), and more detailed spectral data can be found at the SDSS [data repository](https://www.sdss.org/dr16). Researchers have used data from the Sloan Digital Sky Survey to create the largest, most detailed 3D map of the universe so far, filling a gap of 11 billion years in its expansion history. Correctly classifying celestial objects therefore matters for refining our collective understanding of the universe.

### Data
The dataset contains 100,000 observations of space objects with 17 features for each observation. Nine of the features are metadata for image calibration and identification, and eight are object features that can be used for classification. Feature highlights include right ascension and declination (which mark each object's location in the sky), redshift, and spectral band response. The target for each observation is its class: Galaxy, Star, or Quasar.
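The split between metadata and object features described above can be sketched as follows. The column names are an assumption based on the usual SDSS export layout and may not match the actual file, and the three rows are made-up placeholder values, not real observations.

```python
import pandas as pd

# Assumed column names (not confirmed by this write-up).
METADATA_COLS = ["objid", "run", "rerun", "camcol", "field",
                 "specobjid", "plate", "mjd", "fiberid"]
FEATURE_COLS = ["ra", "dec", "u", "g", "r", "i", "z", "redshift"]
TARGET_COL = "class"

# Placeholder frame standing in for the real dataset.
data = {col: [1, 2, 3] for col in METADATA_COLS}
data.update({col: [0.1, 0.2, 0.3] for col in FEATURE_COLS})
data[TARGET_COL] = ["GALAXY", "STAR", "QSO"]
df = pd.DataFrame(data)

# Keep only the object features for modeling; metadata is dropped.
X = df[FEATURE_COLS]
y = df[TARGET_COL]

print(X.shape, sorted(y.unique()))
```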

### Algorithms
#### Feature Selection/Engineering
1. Principal Component Analysis was performed on the spectral band data (ugriz) with 2 and 3 components to see if grouping the bands improved model performance. It did not meaningfully improve performance, so the components were not used for final classification.
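A minimal sketch of the PCA step, assuming the five band magnitudes are standardized before projection (the write-up does not say whether scaling was applied). The `bands` array is synthetic stand-in data, not the survey data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the u, g, r, i, z magnitudes: a shared brightness
# term plus per-band noise, so the five columns are strongly correlated.
base = rng.normal(18.0, 1.5, size=(500, 1))
bands = base + rng.normal(0.0, 0.3, size=(500, 5))

# Standardizing first is an assumption, not a confirmed project step.
scaled = StandardScaler().fit_transform(bands)

for n_components in (2, 3):
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(scaled)
    explained = pca.explained_variance_ratio_.sum()
    print(n_components, components.shape, round(float(explained), 3))
```

The resulting components would replace the five band columns in the feature matrix before refitting each model and comparing validation accuracy against the unreduced features.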

#### Models
Data were split 60/20/20 into training, validation, and test sets. Models were trained on the training set and compared on their validation-set performance to choose a final model. The test set was run only once after final model selection. Accuracy was chosen as the primary classification metric.
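The workflow above could look roughly like this. The sketch uses synthetic three-class data from `make_classification`, stratified splitting, default hyperparameters, and `GaussianNB` as the Naive Bayes variant; all of these are assumptions for illustration and not necessarily the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 features, 3 classes.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Hold out 20% for test, then take 25% of the remainder (20% overall)
# for validation, leaving 60% for training.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)

models = {
    "kNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit on the training set only; compare on the validation set.
val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(val_scores, key=val_scores.get)
print(val_scores)
print("Selected:", best_name)
```

The test split is deliberately untouched in this sketch; it would be scored once, after the final model has been chosen.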

Final 5-fold CV scores for Decision Tree:
- Overall Accuracy: 0.99035
- Galaxy:
    - Precision: 0.987
    - Recall: 0.994
    - F1: 0.991
- Star:
    - Precision: 0.997
    - Recall: 0.999
    - F1: 0.998
- QSO:
    - Precision: 0.981
    - Recall: 0.941
    - F1: 0.961
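One way to produce scores of this shape is to collect out-of-fold predictions from 5-fold cross-validation and compute per-class metrics on them. This is a sketch on synthetic data with an untuned tree, so it will not reproduce the numbers above. The mapping of integer labels to class names is hypothetical, and pooling out-of-fold predictions can differ slightly from averaging per-fold scores.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 features, 3 classes.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
class_names = {0: "Galaxy", 1: "Star", 2: "QSO"}  # hypothetical mapping

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

# Each sample is predicted by the model trained on the other four folds.
y_pred = cross_val_predict(tree, X, y, cv=cv)

accuracy = accuracy_score(y, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(
    y, y_pred, labels=[0, 1, 2])

print(f"Overall Accuracy: {accuracy:.5f}")
for label, name in class_names.items():
    print(f"{name}: precision={precision[label]:.3f} "
          f"recall={recall[label]:.3f} f1={f1[label]:.3f}")
```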

### Tools
- NumPy and Pandas for data manipulation and cleaning
- Matplotlib, Seaborn, and Plotly for data visualization
- Scikit-learn for classification modeling

### Communication

In addition to the slides and visuals presented, all code and documentation will be available on my personal GitHub and, eventually, my personal website.

![best_decision_tree.png](attachment:best_decision_tree.png)