# Question 1: How can a traditional, non-neural-network approach be used to classify celestial objects (stars, galaxies, and quasars)?

## Imports

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

## Initialising the dataset

I am using the SDSS DR14 dataset obtained from Kaggle: https://www.kaggle.com/datasets/lucidlenn/sloan-digital-sky-survey/data.

In [19]:
# Import the dataset from a CSV file into a pandas DataFrame
path = '/Users/ryanu/Documents/Uni/ACT/SDSS-DR14-Classification/SDSS Data.csv'
data = pd.read_csv(path)

This dataset contains 18 columns, but most of them are not needed. I have chosen to use only the sky coordinates (ra and dec), the five filter bands (u, g, r, i, and z), and redshift.

In [20]:
parameters = data[["ra", "dec", "u", "g", "r", "i", "z", "redshift"]]
classification_examples = data["class"]

## Preparing the data for training and testing

The data needs to be split into two parts. The training set (80% of the rows) is used to fit the model so that it learns how to classify the objects. The testing set (the remaining 20%), which the model never sees during training, is then used to measure how accurately it classifies objects.
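Because the three classes are not equally common, it may be worth checking that a split keeps their proportions similar in both parts. The following is a minimal sketch on a small synthetic table, not the SDSS data: the `labels` and `features` values are made up for illustration, and the `stratify` argument is an optional addition that the notebook itself does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SDSS table (hypothetical values, not real data):
# 100 rows with an imbalanced class column.
labels = pd.Series(["GALAXY"] * 50 + ["STAR"] * 40 + ["QSO"] * 10, name="class")
features = pd.DataFrame({"redshift": range(100)})

# stratify=labels asks train_test_split to keep the class proportions
# the same in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Share of each class in the testing set
test_share = y_test.value_counts(normalize=True).sort_index()
print(test_share)
```

With stratification, the rarest class (here the hypothetical QSO rows) is represented in the testing set in the same proportion as in the full table, which makes the per-class scores less dependent on the luck of the split.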

In [22]:
# Split the data into training and testing sets
training_param, testing_param, train_classification, testing_classification = train_test_split(
    parameters, classification_examples, test_size=0.2, random_state=42
)

# Train the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(training_param, train_classification)

# Make predictions on the test set
classification_predict = clf.predict(testing_param)

# Evaluate the model
print("Accuracy:", accuracy_score(testing_classification, classification_predict))
print("Classification Report:\n", classification_report(testing_classification, classification_predict))


# Count, for each class, the objects in the test set and those classified correctly
# (these counts can be used to plot the class distribution)
# Actual counts in the test set
actual_counts = testing_classification.value_counts().sort_index()
# Counts of test objects whose predicted class matches the actual class
correct_counts = testing_classification[testing_classification == classification_predict].value_counts().sort_index()

Accuracy: 0.986
Classification Report:
               precision    recall  f1-score   support

      GALAXY       0.98      0.99      0.99       996
         QSO       0.95      0.93      0.94       190
        STAR       1.00      1.00      1.00       814

    accuracy                           0.99      2000
   macro avg       0.98      0.97      0.97      2000
weighted avg       0.99      0.99      0.99      2000
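The report shows that QSO has the lowest precision and recall, but not which classes the misclassified objects were assigned to. A confusion matrix can show this. The sketch below uses `confusion_matrix` (already imported above) on a handful of made-up labels; `y_true`, `y_pred`, and `cm_table` are hypothetical names, and the values are for illustration only, not results from this model.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ["GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "STAR", "STAR", "STAR"]
y_pred = ["GALAXY", "GALAXY", "QSO", "QSO", "GALAXY", "STAR", "STAR", "STAR"]

# Fix the class order so rows and columns are easy to read.
class_names = ["GALAXY", "QSO", "STAR"]
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# Rows are the true classes, columns are the predicted classes.
cm_table = pd.DataFrame(cm, index=class_names, columns=class_names)
print(cm_table)
```

In the notebook, the same call could be made with `testing_classification` and `classification_predict` in place of the toy lists. Off-diagonal entries then show which pairs of classes the decision tree confuses.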

