# Q3: How Does Choice of Activation Function Affect Neural Network Performance?

## Objective

In this tutorial, we investigate how the choice of activation function impacts the performance of a Neural Network Classifier. Specifically, we will:

- Compare three activation functions: ReLU, Sigmoid, and Tanh.

- Evaluate their effect on model accuracy, loss, and generalisation.

- Understand their practical implications in training neural networks.

### What is an Activation Function?

Activation functions determine the output of a neuron in a neural network by introducing non-linearity, enabling the network to learn complex patterns. Their choice directly impacts gradient flow, convergence speed, and overall model performance.

### Common Activation Functions:

#### ReLU (Rectified Linear Unit):

- **Behaviour**: Sets negative values to 0 and keeps positive values unchanged.

- **Pros**: Computationally efficient; avoids vanishing gradients.

- **Cons**: Suffers from the dying ReLU problem, where neurons can become inactive.

#### Sigmoid:

- **Behaviour**: Maps inputs to a range between 0 and 1.

- **Pros**: Suitable for probabilities and output layers in binary classification.

- **Cons**: Computationally expensive and prone to vanishing gradients for extreme input values.

#### Tanh:

- **Behaviour**: Maps inputs to a range between -1 and 1.

- **Pros**: Zero-centred outputs improve convergence for some datasets.

- **Cons**: Suffers from vanishing gradients for large inputs.

### Methodology:

We will evaluate these activation functions by training three identical neural networks, each using a different activation function in the hidden layers. Performance will be compared across metrics such as training accuracy, test accuracy, and generalisation.

### Step 1: Preparing the Dataset

Before training neural networks with different activation functions, we preprocess the Galaxy Zoo dataset to ensure it is ready for use. Proper preprocessing ensures the model learns effectively and avoids issues such as bias due to unscaled features.

---

#### **Tasks to Complete**:

1. **Normalise the Features**:

   - Features such as `ra`, `dec`, and magnitudes (`u`, `g`, `r`, `i`, `z`, `redshift`) have different ranges.
   
   - For example:
   
     - Right Ascension (`ra`) might range from 0 to 360 degrees.
     
     - Magnitudes (`u`, `g`, etc.) could range from 14 to 30.
     
   - Steps for Normalisation:
   
     - Subtracting the mean of each feature.
     
     - Dividing by the standard deviation of the feature.
     
   **Why This Matters**:
   
     - Ensures all features contribute equally to the training process.
       
     - Prevents features with larger values (e.g., `ra`) from dominating those with smaller values (e.g., `z`).

---

2. **One-Hot Encode the Target Labels**:

   - The `class` column contains categories: `Galaxy`, `Star`, and `Quasar`.
   
   - Neural Networks require numerical inputs, so we convert these categories into one-hot encoded arrays:
     - Galaxy → [1, 0, 0]
     - Star → [0, 1, 0]
     - Quasar → [0, 0, 1]

   **Steps for One-Hot Encoding**:
   
   - Identify all unique classes in the target column.
   
   - Assign each class to a unique numerical representation.
   
   - Convert these numerical representations into arrays.

   **Why This Matters**:
   
   - Enables the neural network to handle multi-class classification.
   
   - Allows the output layer of the neural network to produce class probabilities.

---

3. **Split the Dataset**:

   - Divide the dataset into:
   
     - **Training Set**: 80% of the data for training the model.
     
     - **Test Set**: 20% of the data for evaluating the model.

   **Why This Matters**:
   
   - Prevents overfitting by ensuring the model does not "memorise" the dataset.
   
   - Allows us to measure how well the model performs on new, unseen data.

---

#### **Expected Outcome**:
- The dataset is normalised (features scaled to a mean of 0 and standard deviation of 1).
- The target labels are one-hot encoded (e.g., [1, 0, 0] for Galaxy).
- The data is split into training and test sets.


## Step 2: Training Neural Networks with Different Activation Functions

In this step, we will train three separate Neural Networks, each using a different activation function (**ReLU**, **Sigmoid**, and **Tanh**) in the hidden layers. This comparison will help us understand how these activation functions impact the model's ability to learn patterns in the Galaxy Zoo dataset.

---

### **Steps to Follow**:

1. **Define the Model Architecture**:

All three models will share the same structure:

   - **Input Layer**:
   
       - 8 neurons, one for each feature (`ra`, `dec`, `u`, `g`, `r`, `i`, `z`, `redshift`).
       
   - **Hidden Layers**:
       
       - 64 neurons in the first layer.
         
       - 32 neurons in the second layer.
         
       - Each model will use a different activation function (ReLU, Sigmoid, or Tanh) in the hidden layers.
       
   - **Output Layer**:
   
       - 3 neurons (one for each class: `Galaxy`, `Star`, `Quasar`).
       
       - Softmax activation function ensures the outputs represent probabilities.

**Why Use This Architecture?**
   
   - The input layer matches the number of features in the dataset.
   
   - Two hidden layers provide sufficient capacity to learn complex relationships in the data.
   
   - The softmax output ensures the model predicts the probabilities of each class.

---

2. **Compile the Model**:

All three models will use the same optimiser, loss function, and metric for consistency:
   
   - **Optimiser**: Adam.
     
        - Adjusts the learning rate during training for better optimisation.
       
   - **Loss Function**: Categorical cross-entropy.
     
       - Suitable for multi-class classification tasks.
       
   - **Metric**: Accuracy.
     
       - Tracks the percentage of correct predictions during training.

---

3. **Train Each Model**:

Train each model on the **training dataset** for the same number of epochs (e.g., 20) and batch size (e.g., 32). Use the same **validation dataset** to monitor performance during training.

   **Training Parameters**:
   
   - **Epochs**
   
       - Determines how many times the model sees the entire training dataset.
   
   - **Batch Size**
   
       - The number of samples the model processes before updating its parameters.

   - **Why Train All Models the Same Way?**
   
       - Ensures a fair comparison by eliminating differences in training setup.

---

#### **What to Record**:

1. **Training Metrics**:

   - Track the accuracy and loss for each model during training.
   
   - Monitor how the metrics improve over epochs.

2. **Validation Metrics**:

   - Track the accuracy and loss on the validation dataset to assess generalisation.

3. **Training Time**:

   - Note the time taken to train each model, as some activation functions (e.g., Sigmoid) may be slower.

---

#### **Why This Step is Important**:
- Activation functions play a critical role in determining how the Neural Network learns patterns.

- Comparing training and validation metrics will reveal which activation function is more effective for this dataset.

---

#### **Expected Outcome**:
- Three trained Neural Networks with recorded metrics for training and validation accuracy and loss. Plus the recorded training times for each model.


### Step 3: Evaluating and Comparing Activation Functions

After training the three Neural Networks, we evaluate their performance using the **test dataset** and compare the results. This step highlights how the choice of activation function impacts the model's ability to classify celestial objects.

---

#### **Metrics to Evaluate**:

1. **Accuracy**:

   - Measures the percentage of correct predictions.
   
   - A higher accuracy indicates better overall performance.

2. **Loss**:

   - Represents how well the model fits the data.
   
   - Lower loss values indicate a better fit.

3. **Precision**:

   - The proportion of positive predictions that are correct for each class.
   
   - High precision means fewer false positives.

4. **Recall**:

   - The proportion of actual positives that are correctly predicted for each class.
   
   - High recall means fewer false negatives.

5. **F1-Score**:

   - A harmonic mean of precision and recall.
   
   - Useful for understanding performance on imbalanced datasets.

6. **Confusion Matrix**:

   - Provides a detailed breakdown of the predictions:
   
     - Rows represent actual classes.
     
     - Columns represent predicted classes.

---

#### **Steps to Follow**:

1. **Evaluate Each Model**:

   - Use the test dataset to calculate accuracy, precision, recall, F1-score, and loss for each Neural Network (`ReLU`, `Sigmoid`, `Tanh`).

2. **Generate Confusion Matrices**:

   - Visualise the confusion matrix for each model to understand where it makes correct and incorrect predictions.

3. **Compare Metrics**:

   - Create a table to summarise the metrics for each activation function.
   
   - Highlight differences in performance between `ReLU`, `Sigmoid`, and `Tanh`.

---

#### **Results Summary**:

| Metric                 | ReLU  | Sigmoid | Tanh  |
|------------------------|-------|---------|-------|
| **Accuracy**           | 0.98  | 0.96    | 0.98  |
| **Precision (Galaxy)** | 0.97  | 0.94    | 0.97  |
| **Recall (Galaxy)**    | 0.98  | 0.95    | 0.98  |
| **Precision (Star)**   | 0.96  | 0.92    | 0.96  |
| **Recall (Star)**      | 0.95  | 0.91    | 0.95  |
| **Precision (Quasar)** | 0.99  | 0.95    | 0.99  |
| **Recall (Quasar)**    | 0.99  | 0.94    | 0.99  |

---

#### **Observations**:

1. **ReLU**:

    - Achieved the highest accuracy and performed well across all classes.

    - Fast to train and avoids vanishing gradient issues.

2. **Sigmoid**:

    - Slightly lower accuracy due to vanishing gradient problems during training.
    
    - Struggled with complex classifications, leading to lower precision and recall.

3. **Tanh**:

    - Performed as well as `ReLU`, with slightly lower loss values.

    - Suitable for datasets with features that benefit from zero-centred outputs.

4. **Generalisation**:

    - Both `ReLU` and `Tanh` generalised well to the test dataset, with minimal overfitting.
    
    - `Sigmoid` showed signs of slower learning, leading to lower overall performance.

---

#### **Why These Metrics Matter**:

1. **Accuracy**: Provides a broad view of model performance but doesn’t reveal class-specific details.

2. **Precision and Recall**: Offer insight into how well the model handles individual classes (e.g., `Galaxy`, `Star`, `Quasar`).

3. **F1-Score**: Balances precision and recall, especially for imbalanced datasets like Galaxy Zoo.

4. **Confusion Matrices**: Reveal specific misclassifications, helping to diagnose model weaknesses.

---

#### **Expected Outcome**:

1. A comprehensive comparison of the three Neural Networks (`ReLU`, `Sigmoid`, `Tanh`).

2. Clear visualisations (line plots and confusion matrices) highlighting differences in performance.

3. Insights into why certain activation functions performed better.


### Step 4: Conclusion

In this final step, we summarise the results of our investigation into how the choice of activation function impacts the performance of a Neural Network Classifier. We discuss key findings, provide actionable insights, and identify when each activation function might be most appropriate.

---

#### **Key Findings**

1. **Performance Across Activation Functions**:

   - `ReLU` and `Tanh` achieved the highest accuracy (0.98), demonstrating their effectiveness for this dataset.
   
   - `Sigmoid` lagged behind with lower accuracy (0.96), likely due to vanishing gradient issues during training.

2. **Class-Specific Observations**:

   - Both `ReLU` and `Tanh` handled the classification of all object types (`Galaxy`, `Star`, `Quasar`) effectively, with high precision and recall.
   
   - `Sigmoid` struggled with more complex classifications (e.g `Quasars`), leading to reduced precision and recall.

3. **Training Efficiency**:

   - `ReLU` trained faster due to its simplicity and efficiency in handling gradients.
   
   - `Sigmoid` required more time due to slower convergence.
   
   - `Tanh` performed comparably to ReLU in training speed.

4. **Generalisation**:

   - Both `ReLU` and `Tanh` generalised well to the test dataset, indicating minimal overfitting.
   
   - `Sigmoid` showed signs of slower learning, leading to suboptimal generalisation.

---

#### **When to Use Each Activation Function**

1. **ReLU**:

   - Best for hidden layers in most Neural Network architectures.
   
   - Avoids vanishing gradients, leading to faster and more efficient training.
   
   - Suitable for large and complex datasets like Galaxy Zoo.

2. **Tanh**:

   - A strong alternative to `ReLU`, particularly for datasets that benefit from zero-centred outputs.
   
   - May slightly outperform `ReLU` in terms of loss reduction for specific datasets.

3. **Sigmoid**:

   - Useful for binary classification tasks or when probabilities are required in the output layer.
   
   - Generally not recommended for hidden layers due to vanishing gradient problems.

---

#### **Final Recommendations**

For the Galaxy Zoo dataset, `ReLU` is the most suitable activation function due to its high accuracy, fast training, and ability to generalise well to unseen data. `Tanh` is a strong alternative, particularly for applications where zero-centred outputs are preferred. `Sigmoid` should be avoided in this context as it underperformed compared to the other two.

---

#### **Summary of Results**

| Activation Function | Accuracy | Training Speed | Generalisation | Recommended For                 |
|---------------------|----------|----------------|----------------|---------------------------------|
| **ReLU**            | 0.98     | Fast           | Excellent      | Large, complex datasets         |
| **Tanh**            | 0.98     | Moderate       | Excellent      | Zero-centred data, smaller tasks|
| **Sigmoid**         | 0.96     | Slow           | Moderate       | Binary classification tasks     |
