# Question 1: How can a traditional, non-neural-network approach be used to classify celestial objects (stars, galaxies, and quasars)?

## Introduction

## Overview

In this notebook, I will explain how a decision tree can be used to classify different objects within a dataset. The dataset is the SDSS DR14 dataset, obtained from Kaggle: https://www.kaggle.com/datasets/lucidlenn/sloan-digital-sky-survey/data.

In this dataset, there are 18 columns that describe the stars, galaxies, and quasars observed by the SDSS telescope. However, we don't necessarily need all 18 of them. The columns "objid", "run", "rerun", "camcol", "field", "specobjid", "plate", "mjd", and "fiberid" are all values related to the telescope and thus are not relevant. The columns "ra", "dec", "u", "g", "r", "i", "z", and "redshift" are all values related to the objects themselves. The "class" column records which group each object falls into; it provides the labels for training and is used in testing to see how accurate the model is.

## How does a Decision Tree work?

A decision tree is a flow chart that is often used for classification or regression tasks. It works by splitting data into subsets based on the answer to a question. For example, in this dataset you could split the data based on the question "is the brightness above a certain value?" This would split the data into two subsets: one with brightnesses greater than the chosen value, and one with brightnesses less than or equal to it.
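
A single split of this kind can be sketched with pandas. This is only an illustration: the `brightness` column, the toy values, and the threshold of 50 are assumptions made up for the example and are not taken from the SDSS dataset.

```python
import pandas as pd

# Hypothetical toy data: "brightness" is an illustrative column name,
# not a column in the SDSS dataset.
toy = pd.DataFrame({
    "brightness": [12, 75, 48, 90, 33, 61],
    "label": ["STAR", "GALAXY", "STAR", "GALAXY", "QSO", "GALAXY"],
})

threshold = 50  # assumed value, chosen only for illustration

# One node of a decision tree: ask a yes/no question and split on the answer
above = toy[toy["brightness"] > threshold]
below = toy[toy["brightness"] <= threshold]

print(len(above), "rows above the threshold")
print(len(below), "rows at or below the threshold")
```

In a real decision tree the question and the threshold are not chosen by hand; the training algorithm picks them so that each subset is as pure as possible with respect to the class labels.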

This is repeated with different questions and eventually forms a tree structure with each splitting point forming a "node". It would end up looking something like this:

            Is brightness > 50?
                /         \
              Yes         No
             /             \
        Is size > 30?   Is size > 10?
          /     \         /     \
       Galaxy  Star    Quasar   Star
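
Read as code, the example tree above is just a set of nested conditions. The sketch below assumes that the left branch under each second-level question is the "yes" answer, which the diagram does not state explicitly; the function name `classify` is hypothetical.

```python
def classify(brightness, size):
    """Walk the example tree from the root node down to a leaf."""
    if brightness > 50:
        # "Yes" branch of the root node
        if size > 30:
            return "Galaxy"
        return "Star"
    # "No" branch of the root node
    if size > 10:
        return "Quasar"
    return "Star"


print(classify(brightness=60, size=40))
print(classify(brightness=40, size=5))
```

Each call answers at most two questions, one per level of the tree, before reaching a label.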

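This method is useful for classification problems because classifying an object only requires following one path from the top of the tree to a leaf, so it can be very quick to find an answer. How quick depends on the dataset and the depth of your tree.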
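
The effect of depth can be seen in a small sketch. It uses synthetic data from `make_classification` rather than the SDSS dataset, and the sample count, feature count, and `max_depth=3` are assumed values chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the SDSS dataset): 5 features, 3 classes
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

# A tree limited to 3 levels of questions
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# A tree allowed to grow until its leaves are pure
deep = DecisionTreeClassifier(random_state=42).fit(X, y)

# A prediction needs at most get_depth() comparisons per object
print("Shallow tree depth:", shallow.get_depth(), "leaves:", shallow.get_n_leaves())
print("Deep tree depth:", deep.get_depth(), "leaves:", deep.get_n_leaves())
```

A shallower tree asks fewer questions per object but may separate the classes less well, so the depth is a trade-off rather than something to minimise.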

## Imports

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

## Initialising the dataset

In [49]:
# Load the dataset from a CSV file into a pandas DataFrame
path = '/Users/ryanu/Documents/Uni/ACT/SDSS-DR14-Classification/SDSS Data.csv'
data = pd.read_csv(path)

This dataset contains 18 columns, but most of them are not needed. I have chosen to use only the five filter bands (u, g, r, i, and z), as these are the physical parameters observed by the telescope; the rest are values assigned after the observations. I return to the redshift column later on, as including it drastically changes the accuracy of the model.

In [47]:
parameters = data[["u", "g", "r", "i", "z"]]
classification_examples = data["class"]

## Preparing the data for training and testing

The data needs to be split into two parts. The training set will be used to train the model on how to classify the objects. The testing data is then used to determine how good the model is at classifying objects.

In [48]:
# Split the data into training and testing sets
training_param, testing_param, train_classification, testing_classification = train_test_split(parameters, classification_examples, test_size=0.2, random_state=42)

# Train the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(training_param, train_classification)

# Make predictions on the test set
classification_predict = clf.predict(testing_param)

# Evaluate the model
print("Accuracy:", accuracy_score(testing_classification, classification_predict))
print("Classification Report:\n", classification_report(testing_classification, classification_predict))


# Count, per class, the objects in the test set and how many of them were classified correctly
actual_counts = testing_classification.value_counts().sort_index()
correct_counts = testing_classification[testing_classification == classification_predict].value_counts().sort_index()

Accuracy: 0.901
Classification Report:
               precision    recall  f1-score   support

      GALAXY       0.91      0.92      0.91       996
         QSO       0.87      0.85      0.86       190
        STAR       0.89      0.89      0.89       814

    accuracy                           0.90      2000
   macro avg       0.89      0.89      0.89      2000
weighted avg       0.90      0.90      0.90      2000
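
The notebook imports `confusion_matrix`, but the cells shown here do not call it. The sketch below shows one way it could be used, and how the per-class recall in the report relates to the correct and actual counts computed above. The labels in `y_true` and `y_pred` are made-up toy values, not results from the SDSS model.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["GALAXY", "QSO", "STAR"]

# Made-up toy labels for illustration only (not output from the model above)
y_true = pd.Series(["GALAXY", "GALAXY", "QSO", "STAR", "STAR", "STAR"])
y_pred = pd.Series(["GALAXY", "STAR", "QSO", "STAR", "STAR", "GALAXY"])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_table = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_table)

# Same counting approach as in the cell above
actual_counts = y_true.value_counts().sort_index()
correct_counts = y_true[y_true == y_pred].value_counts().sort_index()

# Recall per class = correctly classified objects / actual objects of that class
recall = (correct_counts / actual_counts).fillna(0)
print(recall)
```

The diagonal of the confusion matrix holds the same numbers as `correct_counts`, while the off-diagonal entries show which classes are being confused with each other.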

