# Q1: Decision Tree Classifier for Galaxy Zoo Dataset

## Introduction

Machine learning can be used to classify celestial objects based on their properties. In this tutorial, we explore how to use a **traditional machine learning approach** to classify objects from the Galaxy Zoo dataset into three categories: **Galaxy**, **Star**, and **Quasar**. We will use a **Decision Tree Classifier**, a simple yet effective method for classification problems.

This guide is designed for **beginners in Machine Learning** who are familiar with Python but new to ML concepts. We will walk through the steps necessary to replicate our analysis.

## Why Use a Decision Tree Classifier?

Decision Trees are a popular choice for traditional machine learning tasks due to several advantages:

- **Interpretability**: The structure of a decision tree is easy to understand and interpret, even for non-technical stakeholders.

- **Handles Mixed Data Types**: Decision Trees can handle both numerical and categorical features without requiring extensive preprocessing.

- **Non-Linear Relationships**: They can capture non-linear relationships between features and the target variable.

### Limitations of Decision Trees

Despite their advantages, Decision Trees have some limitations:

- **Tendency to Overfit**: They can become overly complex and memorize the training data, resulting in poor generalization to unseen data.

- **High Variance**: Small changes in the training data can lead to entirely different trees being generated.

- **Less Robust**: They may perform worse compared to ensemble methods like Random Forest or Gradient Boosting.
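
The overfitting tendency can be seen in a small experiment. The following is a minimal sketch on synthetic data generated with scikit-learn's `make_classification`, not on the Galaxy Zoo dataset; the sample sizes, noise level, and variable names (`X_demo`, `scores`, and so on) are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Galaxy Zoo dataset): three classes,
# with roughly 10% of labels assigned at random to mimic noisy data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scores = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth setting
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, "
          f"test={scores[depth][1]:.3f}")
```

The unrestricted tree fits the training set perfectly, noise included, while its test accuracy is noticeably lower. A large gap between the two numbers is the usual sign of overfitting.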

## Getting Started with Jupyter Notebook

- A **Jupyter Notebook** is an interactive environment where you can write and run code in small chunks called "cells."
- Types of cells:
  1. **Code cells**: For writing Python code.
  2. **Markdown cells**: For headings, explanations, and instructions (like this one).
- To run a cell:
  1. Click on the cell.
  2. Press `Shift + Enter`.

---

To complete this tutorial, you will need:
1. Python installed on your computer, along with the `pandas` and `scikit-learn` packages.
2. The Galaxy Zoo dataset saved as `galaxy_zoo.csv` in the same folder as this notebook.

Let’s get started!

## Step 1: Load and Explore the Dataset

### What is the Galaxy Zoo Dataset?

The **Galaxy Zoo dataset** contains astronomical object classifications based on images taken from telescopes. Key features include:

- **ra (Right Ascension) & dec (Declination)**: Coordinates of objects in the sky.

- **u, g, r, i, z filters**: Measurements of light intensity at different wavelengths.

- **redshift**: A measure of how much the object’s light has been stretched due to the expansion of the universe.

- **class**: The category of the object (Galaxy, Star, Quasar).

### How to Load the Data

To start, load the dataset using the `pandas` library. The first few rows of the dataset can be displayed to understand its structure.

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv("galaxy_zoo.csv")
print(data.head())

To ensure data quality, check for **missing values** and understand the distribution of object classes.

In [None]:
# Check for missing values
print(data.isnull().sum())

# Distribution of target classes
print(data['class'].value_counts())

### Visualising Target Class Distribution

We can plot the distribution of object classes to understand the dataset’s balance. Here is an example:

![Target Class Distribution](class.png)
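
A plot like this could be produced with `matplotlib`, which is an extra dependency assumed here. The sketch below uses a hypothetical helper, `plot_class_distribution`, and a tiny hand-made DataFrame as a stand-in so that it runs without the real file.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_class_distribution(df, column="class"):
    """Draw a bar chart of how many rows fall into each class."""
    counts = df[column].value_counts()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_xlabel("Object class")
    ax.set_ylabel("Count")
    ax.set_title("Target Class Distribution")
    fig.tight_layout()
    return fig, counts


# Hand-made stand-in rows; in the notebook, pass the loaded `data` instead
toy = pd.DataFrame(
    {"class": ["Galaxy", "Galaxy", "Star", "Quasar", "Galaxy", "Star"]}
)
fig, counts = plot_class_distribution(toy)
print(counts)
```

In the notebook, calling `plot_class_distribution(data)` in a code cell should display the figure directly beneath the cell.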

## Step 2: Data Preprocessing

### Why is Preprocessing Important?

Raw data often contains **irrelevant features, missing values, and inconsistencies**. Preprocessing ensures that our model learns from meaningful data.

### Key Preprocessing Steps

1. **Drop Unnecessary Columns** – Some columns do not contribute to classification and should be removed.

2. **Encode Categorical Data** – Convert the `class` column into numerical values for machine learning.

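3. **Normalise the Redshift Feature** – Standardising rescales the values to zero mean and unit variance. Decision Trees split on thresholds, so this step does not change their predictions, but it keeps the data ready for scale-sensitive models.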

In [None]:
# Drop unnecessary columns
data = data.drop(columns=['objid', 'specobjid', 'fiberid', 'plate', 'mjd'])

# Encode 'class' column
data['class_encoded'] = data['class'].astype('category').cat.codes

# Normalise redshift
data['redshift_normalised'] = (data['redshift'] - data['redshift'].mean()) / data['redshift'].std()

print(data.head())
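
After encoding, it helps to know which integer stands for which class. With `cat.codes`, pandas numbers the categories in sorted order. The sketch below shows this on a hand-made stand-in DataFrame; `toy` and `code_to_label` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Stand-in rows using the three class names from this tutorial
toy = pd.DataFrame({"class": ["Star", "Galaxy", "Quasar", "Galaxy"]})

# Convert once so the codes and the category list come from the same object
as_category = toy["class"].astype("category")
toy["class_encoded"] = as_category.cat.codes

# Categories are sorted alphabetically, and the codes index into that list
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)
print(toy)
```

Keeping a mapping like `code_to_label` makes the confusion matrix and classification report easier to read later on.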

### Visualising Redshift Distribution

We can visualise the redshift feature to understand its range and distribution. Here is an example:

![Redshift Distribution](redshift.png)
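
A histogram is one simple way to draw such a plot. The sketch below assumes `matplotlib` is installed and uses randomly generated values as a stand-in for the real `redshift` column; the distribution chosen here is arbitrary and says nothing about the actual data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in values (NOT real redshift measurements)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"redshift": rng.exponential(scale=0.15, size=500)})

fig, ax = plt.subplots(figsize=(6, 4))
# hist returns the count per bin, the bin edges, and the drawn bars
n, bin_edges, _ = ax.hist(toy["redshift"], bins=30)
ax.set_xlabel("redshift")
ax.set_ylabel("Number of objects")
ax.set_title("Redshift Distribution")
fig.tight_layout()
```

In the notebook, replace `toy["redshift"]` with `data["redshift"]` to plot the real feature.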

## Step 3: Model Selection and Training

### Why Use a Decision Tree Classifier?

Decision Trees are **easy to interpret** and **work well with structured tabular data**. They split data into smaller subgroups using feature thresholds, creating a tree-like model of decisions.

### Steps to Train the Model
1. **Split the data**: Separate the dataset into training and testing subsets.

2. **Train the model**: Fit a Decision Tree Classifier to the training data.

3. **Tune hyperparameters**: Adjust tree depth to balance accuracy and overfitting.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

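# Separate the features from the target. The raw 'redshift' column is dropped
# because 'redshift_normalised' already carries the same information.
X = data.drop(columns=['class', 'class_encoded', 'redshift'])
y = data['class_encoded']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Classifier (random_state fixed for reproducible results)
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)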
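
The threshold-based splits described at the start of this step can be printed as plain text with scikit-learn's `export_text`. The sketch below is self-contained, so it trains a small tree on the Iris dataset bundled with scikit-learn as a stand-in; `demo_tree` and `rules` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data bundled with scikit-learn (NOT the Galaxy Zoo dataset)
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_tree.fit(iris.data, iris.target)

# Each line of the output is a threshold test on one feature or a leaf
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

For the model trained above, the equivalent call would be `export_text(model, feature_names=list(X.columns))`.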

## Step 4: Model Evaluation

### How Do We Measure Performance?

To assess model performance, use:

1. **Accuracy Score**: The fraction of test predictions that match the true class.

2. **Confusion Matrix**: Shows the breakdown of correct vs. incorrect classifications.

3. **Classification Report**: Provides precision, recall, and F1-score for each class.


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

### Visualising the Confusion Matrix

A heatmap of the confusion matrix provides a clear visual representation.

![Confusion Matrix](confusion.png)

### Understanding Metrics

- **Precision**: The proportion of true positive predictions among all positive predictions made by the model.

- **Recall**: The proportion of true positives identified out of all actual positives.

- **F1-Score**: The harmonic mean of precision and recall, balancing both metrics.

These metrics are particularly important in imbalanced datasets where accuracy alone might be misleading.
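
To make these definitions concrete, the sketch below computes the three metrics by hand from a confusion matrix and checks them against scikit-learn's `precision_recall_fscore_support`. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up true and predicted class codes, for illustration only
y_true_demo = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true, columns: predicted
true_positives = np.diag(cm)

precision = true_positives / cm.sum(axis=0)  # divide by predicted count per class
recall = true_positives / cm.sum(axis=1)     # divide by actual count per class
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn's per-class values, for comparison with the manual ones
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo
)
print("precision:", precision, "recall:", recall, "f1:", f1)
```

The manual values agree with the library's, which confirms that precision is read down the columns of the confusion matrix and recall along its rows.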

## Step 5: Hyperparameter Tuning

### Testing Different Tree Depths

Experiment with different values of `max_depth` to observe its effect on model accuracy.

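A deeper tree increases accuracy on the training data but risks overfitting, which makes it less generalisable to unseen data. Note that the loop below scores every depth on the test set, so treat its output as exploratory: picking the final depth from these numbers would make the reported test accuracy optimistic.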

In [None]:
# Test varying tree depths
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Tree Depth: {depth}, Accuracy: {accuracy:.4f}")
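
One way to choose `max_depth` without consulting the test set is cross-validation on the training data. The sketch below uses scikit-learn's `GridSearchCV` on synthetic stand-in data; the data, the 5-fold setting, and the variable names are assumptions for illustration, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Galaxy Zoo dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Try depths 1 to 10, scoring each with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # only the training data is used to pick the depth

best_depth = search.best_params_["max_depth"]
test_accuracy = search.score(X_te, y_te)  # test set is used once, at the end
print(f"Best depth: {best_depth}, test accuracy: {test_accuracy:.4f}")
```

In the notebook, the same search could be run on `X_train` and `y_train`, keeping `X_test` and `y_test` for a single final evaluation.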

## Step 6: Conclusion and Next Steps

### Key Takeaways

- **Traditional ML models like Decision Trees** can effectively classify astronomical objects.

- **Feature preprocessing** (dropping irrelevant columns and encoding the target) gives the model clean inputs to learn from. Normalisation matters mainly for scale-sensitive models, since Decision Trees split on thresholds.

- **Model evaluation metrics** help determine areas for improvement.

### Summary

This Jupyter Notebook guides you through a **step-by-step approach** to classifying celestial objects using a **traditional ML model**. By following these steps, a beginner can replicate the process and build a foundation in machine learning for astronomy.

For more advanced techniques, check out **neural network approaches** (covered in Q2). 🚀
