# Q2: Neural Network Classifier for Galaxy Zoo Dataset

## Introduction

In this tutorial, we classify objects in the Galaxy Zoo dataset into three categories: Galaxy, Star, and Quasar. Distinguishing between celestial objects is a core step in astrophysics research: it helps astronomers study galaxy evolution, map the large-scale structure of the universe, and identify rare phenomena such as quasars. Machine learning makes it practical to process large catalogues efficiently and accurately. The tutorial guides beginners through building a Neural Network classifier, explains the preprocessing it needs, and compares its performance with a traditional Decision Tree classifier.

## Why Use a Neural Network Classifier?

The Galaxy Zoo dataset includes features such as spectral band magnitudes (`u`, `g`, `r`, `i`, `z`) and redshift, whose relationship to the object class is non-linear. Neural Networks handle this kind of complexity well, which makes them a good fit for this classification task.

### Advantages of Neural Networks:

- **Ability to Model Complex Patterns**: Captures non-linear relationships and interactions between features.

- **Adaptability**: Can handle large and complex datasets.

- **Improved Accuracy**: Performs well in multi-class classification tasks like Galaxy Zoo.

For instance, Neural Networks can differentiate between stars and quasars by learning the subtle variations in spectral bands and redshift, which might be challenging for simpler models like Decision Trees.

### Limitations of Neural Networks:

- **Computationally Expensive**: Requires more time and resources compared to simpler models like Decision Trees.

- **Needs Larger Datasets**: Performs poorly with small or sparse data.

## Step 1: Preparing the Dataset

Before building a Neural Network, we need to prepare the dataset to ensure it works well with the model. Proper preparation improves the model's accuracy and reliability.

- **Normalise the Features**: Features like `ra`, `dec`, `u`, `g`, `r`, `i`, `z`, and `redshift` are measured on very different scales. For instance, `ra` is an angle that can span 0 to 360 degrees, while `redshift` values are usually far smaller. To keep features with large numeric ranges from dominating the learning process, we normalise them using `StandardScaler`, which rescales each feature to a mean of 0 and a standard deviation of 1.

- **One-Hot Encode the Target Labels**: The `class` column in the dataset contains categorical labels: `Galaxy`, `Star`, and `Quasar`. Neural Networks require numerical targets, and one-hot encoding transforms each class into a unique vector. The code below assigns integer codes in alphabetical order of the labels, so the vectors are:

   - Galaxy → [1, 0, 0]
   - Quasar → [0, 1, 0]
   - Star → [0, 0, 1]

This format pairs with the Softmax activation in the output layer, which outputs one probability per class, so each position in the label vector lines up with one output neuron.

- **Splitting the Dataset**: To evaluate the Neural Network's performance, we split the dataset into two subsets:

    - **Training Set**: 80% of the data, used to train the model.
    - **Testing Set**: 20% of the data, used to assess how well the model generalises to unseen data.

  We use the `train_test_split` function with `random_state=42` so that the split is reproducible, which is critical when experimenting or sharing results.
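The two preprocessing ideas above, standardisation with training-set statistics and one-hot encoding, can be sketched without any libraries. This is a minimal illustration rather than the tutorial's actual pipeline: the magnitudes are made-up values, and the helper names (`fit_standardiser`, `standardise`, `one_hot`) are hypothetical. The alphabetical class ordering mirrors how pandas assigns category codes to string labels.

```python
from statistics import mean, pstdev


def fit_standardiser(column):
    """Return the mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)


def standardise(column, mu, sigma):
    """Apply z-score scaling using statistics learned from the training data."""
    return [(value - mu) / sigma for value in column]


def one_hot(labels):
    """One-hot encode labels, ordering the classes alphabetically."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    vectors = [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return classes, vectors


# Hypothetical r-band magnitudes, not taken from the dataset.
train_r = [16.0, 17.0, 18.0, 19.0]
test_r = [17.5]

mu, sigma = fit_standardiser(train_r)          # fit on training data only
train_scaled = standardise(train_r, mu, sigma)
test_scaled = standardise(test_r, mu, sigma)   # reuse the training statistics

classes, encoded = one_hot(["Galaxy", "Star", "Quasar", "Galaxy"])
print(classes)   # class order used for the one-hot positions
print(encoded)
```

Fitting the statistics on the training column and reusing them for the test column keeps information about the test set out of training.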

In [None]:
# Imports used in this cell (assuming TensorFlow's bundled Keras)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

# `data` is assumed to be the DataFrame loaded earlier in the notebook.

# Step 1: Encode the target labels
# The 'class' column contains categorical labels ('Galaxy', 'Star', 'Quasar').
# cat.codes assigns integer codes (0, 1, 2) in sorted label order.
data['class_encoded'] = data['class'].astype('category').cat.codes

# Step 2: Select features and target
# 'redshift' is used as-is: StandardScaler (Step 4) standardises every feature,
# so a separate manual normalisation is redundant. Computing it on the full
# dataset would also leak test-set statistics into training.
X = data[['ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'redshift']]
y = data['class_encoded']

# Step 3: Split into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Standardise the features
# Standardisation scales all features to have zero mean and unit variance.
# This ensures all features contribute equally to the model's learning process.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # Fit the scaler on training data and transform it
X_test = scaler.transform(X_test) # Use the same scaler to transform test data

# Step 5: One-hot encode the target labels
# Convert the numeric target labels (0, 1, 2) into one-hot vectors:
# 0 → [1, 0, 0], 1 → [0, 1, 0], 2 → [0, 0, 1]
# This format is required by the categorical cross-entropy loss used below.
y_train = to_categorical(y_train, num_classes=3)
y_test = to_categorical(y_test, num_classes=3)

### Visualising the Class Distribution

Understanding the balance of the dataset is crucial for building effective machine learning models. The bar chart below shows the number of samples for each class (`Galaxy`, `Star`, `Quasar`):

- Galaxy: 5000 samples

- Star: 4000 samples

- Quasar: 1000 samples 

The dataset is imbalanced: the majority class (`Galaxy`) has five times as many samples as the minority class (`Quasar`). Imbalanced datasets can lead to biased predictions. For instance, if one class dominates, the model may favour it and neglect the minority classes. Visualising the distribution makes the imbalance visible so that it can be addressed during preprocessing or training.

![Class Distribution](class2.png)
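One common way to compensate for an imbalance like this is to weight each class inversely to its frequency. The sketch below is only an illustration of that idea and is not used elsewhere in this tutorial. It applies the heuristic `total / (n_classes * count)` to the class counts listed above; the function name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Weight each class inversely to its frequency.

    Uses total / (n_classes * count), so a perfectly balanced
    dataset would give every class a weight of 1.0.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


# Class counts taken from the distribution listed above.
counts = {"Galaxy": 5000, "Star": 4000, "Quasar": 1000}
weights = balanced_class_weights(counts)

for name, weight in weights.items():
    print(f"{name}: {weight:.2f}")
```

With these counts the rare `Quasar` class receives the largest weight, so each quasar misclassification would count more heavily in a weighted loss.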

## Step 2: Building the Neural Network

### Define the Architecture

- **Input Layer**: This layer receives the eight prepared features (`ra`, `dec`, `u`, `g`, `r`, `i`, `z`, `redshift`), one input per feature. In the Keras code below it is not a separate layer object: it is declared through the `input_shape` argument of the first `Dense` layer.

- **Hidden Layers**: Two hidden layers with 64 and 32 neurons respectively. Both use the ReLU (Rectified Linear Unit) activation function, which introduces non-linearity into the model and lets it learn intricate patterns in the data. ReLU is also cheap to compute.

- **Output Layer**: This layer determines the object class (`Galaxy`, `Star`, or `Quasar`). It comprises three neurons (one for each class) and uses the Softmax activation function. Softmax converts the raw model outputs into a probability distribution whose values sum to 1 across all classes, which makes the predictions interpretable.

- This architecture balances model complexity and computational efficiency, making it suitable for this dataset's size and multi-class nature.
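To make the two activation functions concrete, here is a small library-free sketch of ReLU and Softmax. The input values are arbitrary examples, and subtracting the maximum inside `softmax` is a standard numerical-stability step rather than something the tutorial specifies.

```python
import math


def relu(values):
    """ReLU keeps positive values and replaces negative ones with zero."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]


hidden = relu([-1.5, 0.0, 2.0])    # negative activation is clipped to 0
probs = softmax([2.0, 1.0, 0.1])   # arbitrary raw outputs for three classes

print(hidden)
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

The largest raw score always receives the largest probability, so taking the index of the maximum probability gives the predicted class.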

### Compile the Model

- **Optimiser**: The Adam optimiser adapts the step size for each weight during training, which usually gives fast, stable convergence without much manual tuning. The code below starts from a learning rate of 0.001.

- **Loss Function**: Categorical cross-entropy is the standard loss for multi-class classification with one-hot labels. It compares the predicted probability distribution with the true label and grows as the probability assigned to the correct class shrinks.

- **Metrics**: Accuracy is tracked as the primary evaluation metric during training, giving a direct measure of the share of correct predictions.

- The choice of activation functions, optimiser, and loss function directly affects the model’s ability to learn.
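As a rough illustration of the loss, the sketch below computes categorical cross-entropy for a single sample by hand. The probability vectors are invented for the example, and the small `eps` clip is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def categorical_cross_entropy(one_hot_target, predicted_probs, eps=1e-7):
    """Cross-entropy for one sample: minus the log-probability of the true class."""
    return -sum(
        t * math.log(max(p, eps))
        for t, p in zip(one_hot_target, predicted_probs)
    )


target = [1, 0, 0]  # true class is the first one

confident = categorical_cross_entropy(target, [0.9, 0.05, 0.05])
unsure = categorical_cross_entropy(target, [0.4, 0.3, 0.3])

print(f"confident prediction: {confident:.4f}")
print(f"unsure prediction:    {unsure:.4f}")
```

The loss is small when the model puts most of its probability on the correct class and grows as that probability drops, which is the signal the optimiser uses to update the weights.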

### Train the Model

- **Training Process**: The Neural Network is trained on the dataset over a fixed number of epochs (e.g., 20). Each epoch represents a single pass through the entire training set, enabling the model to adjust its weights iteratively.
   
- **Batch Size**: Training is performed in batches of data (e.g., 32 samples per batch). The weights are updated after each batch, which keeps memory use low and gives the model many updates per epoch.

- **Validation Data**: After each epoch, the model is evaluated on data it was not trained on, which shows whether it generalises or is starting to overfit the training dataset. The code below passes the test set as validation data. That is fine for monitoring, but if those curves are used to tune the model, the test score is no longer a fully independent estimate.
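The relationship between dataset size, batch size, and epochs can be worked out with a few lines of arithmetic. This sketch assumes the 10,000 samples from the class distribution above and the 80/20 split; the helper names are hypothetical.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the training set."""
    return math.ceil(n_samples / batch_size)


def iterate_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering the data in batches."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


n_total = 5000 + 4000 + 1000      # class counts listed earlier
n_train = n_total - n_total // 5  # 80% of the data used for training

steps = updates_per_epoch(n_train, batch_size=32)
total_updates = steps * 20        # 20 epochs

print(f"{n_train} training samples -> {steps} updates per epoch")
print(f"{total_updates} updates over 20 epochs")
print(list(iterate_batches(70, 32)))  # the last batch may be smaller
```

A smaller batch size gives more updates per epoch, at the cost of noisier gradient estimates for each update.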

In [None]:
# Imports used in this cell (assuming TensorFlow's bundled Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Step 1: Define the model
# This model includes:
# - First hidden layer with 64 neurons and ReLU activation
#   (input_shape declares the number of input features).
# - Second hidden layer with 32 neurons and ReLU activation.
# - Output layer with 3 neurons (for Galaxy, Star, and Quasar classes) and Softmax activation.
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax')
])

# Step 2: Compile the model
# - Adam optimiser adapts the per-weight step size during training.
# - Categorical cross-entropy measures the model's loss for multi-class classification.
# - Accuracy is used as the evaluation metric.
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Step 3: Train the model
# - Trains the model on the training dataset for 20 epochs.
# - The test set is passed as validation data, so it is evaluated after each epoch.
# - A batch size of 32 is used to process the data in smaller subsets.
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=32, verbose=1)
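As a quick sanity check on the size of this network, the number of trainable parameters can be counted by hand. The sketch assumes the eight input features selected earlier; each dense layer has one weight per input-unit pair plus one bias per unit. In Keras, `model.summary()` reports the same kind of count.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 8 input features -> 64 -> 32 -> 3 output classes
layer_sizes = [8, 64, 32, 3]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # parameters in each Dense layer
print(total_params)   # total trainable parameters
```

At a few thousand parameters the model is small, which is consistent with the aim of balancing complexity and computational cost.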

## Step 3: Evaluating the Neural Network

### How Do We Measure Performance?

- **Evaluate Test Accuracy**: Test accuracy is the percentage of correct predictions on the unseen test dataset. A higher accuracy indicates that the model has generalised the patterns learned from the training data to new data.

- **Analyse the Confusion Matrix**: The confusion matrix offers a granular breakdown of the model's predictions, highlighting the number of correct classifications and misclassifications. For example, it reveals when the model mistakenly predicts a `Galaxy` as a `Star`, providing insights into areas where the model may need improvement.

- **Examine Metrics**:
    - **Precision**: The proportion of positive predictions that are correct, indicating the reliability of the model's positive predictions.
    - **Recall**: The proportion of actual members of a class that the model correctly identifies, showing how completely it captures that class.
    - **F1-Score**: The harmonic mean of precision and recall, offering a balanced measure that is particularly useful for imbalanced datasets, where one class has significantly more samples than others.

- **Compare Training and Validation Performance**: Assess whether the model achieves similar accuracy on both training and validation datasets. Significant disparities suggest overfitting, where the model memorises the training data instead of learning generalisable patterns, leading to poor performance on new data.
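The metrics above can all be derived from a confusion matrix. The sketch below does this by hand for a small, invented matrix (rows are true classes, columns are predicted classes); the numbers are purely illustrative and are not results from this tutorial.

```python
def per_class_metrics(matrix, labels):
    """Compute precision, recall and F1 for each class.

    Rows of `matrix` are true classes, columns are predicted classes.
    """
    results = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column total
        actually_k = sum(matrix[k])                     # row total
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = {"precision": precision, "recall": recall, "f1": f1}
    return results


def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


labels = ["Galaxy", "Quasar", "Star"]
# Hypothetical confusion matrix for illustration only.
matrix = [
    [90, 4, 6],
    [3, 15, 2],
    [5, 1, 74],
]

metrics = per_class_metrics(matrix, labels)
overall = accuracy(matrix)

for label in labels:
    m = metrics[label]
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f} f1={m['f1']:.2f}")
print(f"accuracy={overall:.3f}")
```

In this toy matrix the overall accuracy is noticeably higher than the precision and recall of the smallest class, which is why per-class metrics matter on imbalanced data.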


In [None]:
# Imports used in this cell
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Step 1: Evaluate the model
# This step calculates the loss and accuracy on the test dataset.
# Accuracy is the fraction of correct predictions.
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {test_accuracy:.2f}")

# Step 2: Generate predictions
# The model predicts probabilities for each class.
# Use np.argmax to convert probabilities into class predictions.
y_pred = model.predict(X_test) # Predicted probabilities
y_pred_classes = np.argmax(y_pred, axis=1) # Predicted class labels
y_test_classes = np.argmax(y_test, axis=1) # True class labels (undo the one-hot encoding)

# Step 3: Generate the Confusion Matrix
# The confusion matrix visualises the performance of the classifier.
# Each row represents the true class, and each column represents the predicted class.
conf_matrix = confusion_matrix(y_test_classes, y_pred_classes)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix (Neural Network)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Step 4: Generate the Classification Report
# This report summarises precision, recall, and F1-score for each class.
# It provides a detailed breakdown of the model's performance on individual classes.
print("Classification Report:")
print(classification_report(y_test_classes, y_pred_classes))

## Step 4: Comparing the Neural Network with the Decision Tree

### **Metrics to Compare**

To evaluate both models comprehensively, the following metrics are used:

1. **Accuracy**:
   - The percentage of correct predictions out of all predictions on the test dataset.
   - Gives an overall measure of performance, but can hide weak results on minority classes when the dataset is imbalanced.

2. **Precision**:
   - The percentage of predicted positives that are correct, i.e. how reliable the model's predictions for a class are.
   - Particularly critical when identifying rare objects, such as quasars in astronomical datasets, where minimising false positives is essential.

3. **Recall**:
   - The percentage of actual positives that are correctly predicted.
   - Crucial when every instance of a class should be found, such as ensuring all quasars are detected.

4. **F1-Score**:
   - The harmonic mean of precision and recall, useful when classes are imbalanced, as they are in this dataset.

5. **Confusion Matrix**:
   - Provides a detailed breakdown of correct and incorrect predictions for each class.
   - Complements the numerical metrics by showing which classes the models confuse with each other.

6. **Overfitting**:
   - Check whether the model performs significantly better on the training dataset than on the test dataset.
   - A large gap indicates the model has memorised the training data instead of learning general patterns, making it unreliable for real-world applications.
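The overfitting check in the last item can be sketched as a comparison of the final training and validation accuracy. The dictionary below imitates the structure of the `history.history` object returned by Keras `fit` when accuracy is tracked, but its values are invented, and the 0.03 threshold is an arbitrary assumption for illustration.

```python
def generalisation_gap(history, metric="accuracy"):
    """Difference between the final training and validation values of a metric."""
    train_value = history[metric][-1]
    val_value = history[f"val_{metric}"][-1]
    return train_value - val_value


# Hypothetical per-epoch values, not results from this tutorial.
hypothetical_history = {
    "accuracy": [0.90, 0.95, 0.99],
    "val_accuracy": [0.89, 0.93, 0.94],
}

gap = generalisation_gap(hypothetical_history)
looks_overfitted = gap > 0.03  # arbitrary threshold chosen for this example

print(f"train - validation accuracy gap: {gap:.2f}")
print("possible overfitting" if looks_overfitted else "no large gap")
```

What counts as a worrying gap depends on the dataset and task, so the threshold should be treated as a starting point rather than a rule.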

---

### **Results Summary**

#### Table 1: Performance Metrics Comparison Between the Decision Tree and Neural Network Models.

| Metric                 | Decision Tree | Neural Network |
|------------------------|---------------|----------------|
| **Accuracy**           | 0.95          | 0.98           |
| **Precision (Galaxy)** | 0.93          | 0.97           |
| **Recall (Galaxy)**    | 0.92          | 0.98           |
| **Precision (Star)**   | 0.91          | 0.96           |
| **Recall (Star)**      | 0.89          | 0.95           |
| **Precision (Quasar)** | 0.94          | 0.99           |
| **Recall (Quasar)**    | 0.93          | 0.99           |


#### Figure 1: Confusion Matrix for the Decision Tree Model

![Confusion Matrix Q1](confusion.png)

#### Figure 2: Confusion Matrix for the Neural Network Model.

![Confusion Matrix Q2](confusion2.png)

### **Observations**

- **Accuracy**: With an accuracy of 0.98, the Neural Network outperformed the Decision Tree (0.95) on the test set, suggesting that it generalises better from the training data.

- **Precision and Recall**: The Neural Network improved precision and recall for every object type, by 0.04 to 0.06 (Table 1). This suggests it models the non-linear relationships between features in the dataset more effectively.

- **Confusion Matrices**: The Neural Network confusion matrix shows fewer misclassifications, highlighting its enhanced performance in distinguishing between classes.
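The size of the improvement for each metric can be read directly off Table 1. The short sketch below just subtracts the two columns; the values are copied from the table, and rounding to two decimals avoids floating-point noise.

```python
# (Decision Tree, Neural Network) values copied from Table 1.
table_1 = {
    "Accuracy": (0.95, 0.98),
    "Precision (Galaxy)": (0.93, 0.97),
    "Recall (Galaxy)": (0.92, 0.98),
    "Precision (Star)": (0.91, 0.96),
    "Recall (Star)": (0.89, 0.95),
    "Precision (Quasar)": (0.94, 0.99),
    "Recall (Quasar)": (0.93, 0.99),
}

gains = {
    metric: round(neural_net - tree, 2)
    for metric, (tree, neural_net) in table_1.items()
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.2f}")
```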

### **Why Does the Neural Network Perform Better?**

- **Captures Complex Patterns**: Neural Networks can model smooth non-linear relationships and interactions between features, which a single Decision Tree can only approximate through a series of axis-aligned splits.

- **Improved Preprocessing**: Feature normalisation and one-hot encoding standardise the input data and allow the Neural Network to learn effectively without being skewed by features with large numeric ranges.

- **Flexible Architecture**: Neural Networks are highly flexible, making them capable of adapting to a wide variety of complex datasets like Galaxy Zoo. The hidden layers in the Neural Network enable it to learn intricate patterns in the data.

The Decision Tree relies on hierarchical splits, which may oversimplify complex patterns, leading to reduced performance.

#### **Advantages of Each Model**

- **Decision Tree**:
  - Simple to interpret and implement.
  - Requires less preprocessing and computational power.
  - Performs well on smaller datasets or datasets with clear patterns.

- **Neural Network**:
  - More accurate on complex datasets like Galaxy Zoo.
  - Handles non-linear feature interactions effectively, such as the relationship between redshift and the spectral bands (`u`, `g`, `r`, `i`, `z`).
  - Performs better with larger datasets.


## Step 5: Conclusion

### **Key Findings**

1. **Neural Network Performance**:
   - The Neural Network achieved a higher accuracy (0.98) than the Decision Tree (0.95), indicating that it generalises better on this dataset.
   - The Neural Network outperformed the Decision Tree for all object types (`Galaxy`, `Star`, and `Quasar`), with precision and recall gains of 0.04 to 0.06 per class.

2. **Decision Tree Performance**:
   - Simple, fast, and easy to interpret.
   - Achieved reasonable accuracy (0.95), but scored lower than the Neural Network on every class. According to Table 1, its weakest results were on the `Star` class (precision 0.91, recall 0.89).

3. **Strengths of Neural Networks**:
   - Neural Networks excel at identifying intricate patterns in data, such as the non-linear relationship between spectral bands and redshift.
   - The Galaxy Zoo dataset, with its diverse features (`ra`, `dec`, the spectral bands, and `redshift`), is well-suited for the advanced learning capabilities of Neural Networks.

4. **Strengths of Decision Trees**:
   - Decision Trees are faster to train and provide straightforward, interpretable results, which are particularly beneficial for exploratory data analysis.
   - Unlike Neural Networks, Decision Trees can handle raw data without extensive preprocessing, making them suitable for smaller or less complex datasets.

---

### **When to Use Each Model**

1. **Use Decision Trees**:
   - Best suited for smaller datasets, rapid prototyping, or tasks where explainability is critical, such as identifying outliers or a preliminary analysis of a simple dataset.
   - Recommended when computational efficiency and training speed are necessary.
   - For instance, they can be used to classify a subset of stars with fewer features.

2. **Use Neural Networks**:
   - Ideal for larger, more complex, or high-dimensional datasets, such as Galaxy Zoo and other large-scale astronomical surveys, where feature interactions are intricate.
   - Preferred when achieving the highest accuracy is essential, even if training requires more computational resources.


### Conclusion

In this tutorial, the Neural Network outperformed the Decision Tree at classifying objects in the Galaxy Zoo dataset, achieving higher accuracy and handling non-linear relationships better. However, Decision Trees remain a valuable alternative for simpler datasets or when interpretability is essential.

#### **Final Recommendation**

For the Galaxy Zoo dataset, the **Neural Network Classifier** is the superior option, offering higher accuracy and greater capability in managing the dataset’s complexity. Nonetheless, the **Decision Tree Classifier** is a practical alternative for simpler datasets or when interpretability and computational efficiency are priorities.
