# Question 2: 

## Introduction

## Imports

In [35]:
import torch
import numpy as np
import pandas as pd
from torch import nn, optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

## Initialising the Dataset

Like in the Q1 notebook, we need to load the data in a useable form (i.e. a pandas dataframe).

In [3]:
# change the path to where the csv file is stored on your pc
path = '/Users/ryanu/Documents/Uni/ACT/SDSS-DR14-Classification/SDSS Data.csv'
data = pd.read_csv(path)
#data

I am going to start off using the same features as in Q1.

In [4]:
features = data[['u', 'g', 'r', 'i', 'z']]
labels = data['class']
#features

## Data Preprocessing

As with the decision tree in Q1, we need to split the data into training and testing sets. This time we are also going to split out a validation set. The training set is used to train the model and the testing set is used to measure the model's final performance, while the validation set is used to tune hyperparameters (e.g., learning rate, architecture) and monitor overfitting during training.

In [5]:
# Split the data into training, validation, and testing sets
    # train_test_split() shuffles the data and splits it into two subsets
    # test_size=0.2 specifies that 20% of the data passed in should be held out
    # random_state=42 is the random seed used for shuffling, so the split is reproducible
    # First, the full data is split into a combined training/validation set and a testing set in an 80:20 ratio
    # The combined training/validation set is then split into training and validation sets in an 80:20 ratio
    # The final data is split into training, validation, and testing sets in a 64:16:20 ratio
features_train_val, features_test, label_train_val, label_test = train_test_split(features, labels, test_size=0.2, random_state=42)
features_train, features_val, label_train, label_val = train_test_split(features_train_val, label_train_val, test_size=0.2, random_state=42)
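
The 64:16:20 ratio follows from applying the two 80:20 splits one after the other. The arithmetic can be checked with a short plain-Python sketch. The row count of 10,000 is a hypothetical figure for illustration, and the sketch assumes the held-out size is rounded up, so exact counts can differ by one for other dataset sizes.

```python
def split_sizes(n_rows, held_out_percent=20):
    """Return (kept, held_out) row counts for a single split."""
    # Ceiling division, so a fractional held-out size is rounded up (assumption)
    held_out = -(-n_rows * held_out_percent // 100)
    return n_rows - held_out, held_out


n_rows = 10_000  # hypothetical dataset size

# First split: training + validation vs. testing
n_train_val, n_test = split_sizes(n_rows)
# Second split: training vs. validation
n_train, n_val = split_sizes(n_train_val)

for name, count in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {count} rows ({count / n_rows:.0%})")
```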

To train a neural network, we need to get the data into the right format. The first step is to normalise the data. Neural networks often perform better with normalised data because they are sensitive to the scale of the input features. Standardisation ensures that features with larger ranges don't dominate, and usually helps the model converge faster.

In [6]:
# Initialise the StandardScaler
    # It's important to create a single StandardScaler and use it for all the data sets, so that the same scaling is applied to each of them
    # StandardScaler rescales each feature so that it has a mean of 0 and a standard deviation of 1 (on the data it was fitted to)
scaler = StandardScaler()

# Fit the StandardScaler to the training data
    # fit_transform() learns each feature's mean and standard deviation from the training data, then scales the training data
    # transform() scales the validation and testing data using the mean and standard deviation learned from the training data
    # Fitting on the training data only prevents information from the validation and testing sets leaking into training
features_train_normalised = scaler.fit_transform(features_train)
features_val_normalised = scaler.transform(features_val)
features_test_normalised = scaler.transform(features_test) 
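
As a rough illustration of what fitting on the training data and then transforming the other sets means, here is a plain-Python sketch for a single feature. The values are made up, and the sketch uses the population standard deviation; it is not the scikit-learn implementation.

```python
from statistics import fmean, pstdev

train_values = [18.0, 19.0, 20.0, 21.0, 22.0]  # hypothetical training values
val_values = [20.0, 24.0]                       # hypothetical validation values

# "Fit": learn the mean and standard deviation from the training data only
train_mean = fmean(train_values)
train_std = pstdev(train_values)


def standardise(values):
    """"Transform": rescale values using the statistics learned from training."""
    return [(value - train_mean) / train_std for value in values]


train_scaled = standardise(train_values)
val_scaled = standardise(val_values)

print("train mean/std after scaling:", fmean(train_scaled), pstdev(train_scaled))
print("validation values after scaling:", val_scaled)
```

The scaled training values have mean 0 and standard deviation 1, while the validation values generally do not, because they are measured against the training statistics.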

We then need to convert the label names (Star, Galaxy, QSO) into numbers as neural networks expect numerical inputs and outputs.

In [7]:
# Encode the labels using the LabelEncoder() function
    # Again, it's important to initialise the LabelEncoder() function, then use it for all the data sets to ensure that the same encoding is applied to all the data sets
    # The LabelEncoder() function encodes the labels, in alphabetical order, as integers starting from 0 (e.g. Galaxy is 0, QSO is 1, Star is 2)
    # This is necessary because the labels need to be integers for the model to be able to use them
label_encoder = LabelEncoder()

# Fit the LabelEncoder() function to the training labels
    # The fit_transform() function fits the LabelEncoder() function to the training labels and then encodes the training labels
    # The transform() function encodes the validation and testing labels using the same encoding as the training labels
    # This ensures that the validation and testing labels are encoded in the same way as the training labels
label_train_encoded = label_encoder.fit_transform(label_train)
label_val_encoded = label_encoder.transform(label_val)
label_test_encoded = label_encoder.transform(label_test)
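
The encoding itself is just a sorted lookup table. A minimal plain-Python sketch of the same idea follows; the upper-case label spellings are an assumption for illustration and may not match the CSV exactly.

```python
train_labels = ["STAR", "GALAXY", "QSO", "GALAXY", "STAR"]  # hypothetical labels

# "Fit": the sorted unique labels define the mapping
classes = sorted(set(train_labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# "Transform": look each label up in the mapping
encoded = [class_to_index[label] for label in train_labels]

# The mapping can be reversed to recover the original names
decoded = [classes[index] for index in encoded]

print(class_to_index)
print(encoded)
```

Because the mapping is built from the training labels, a label that appears only in the validation or testing set would have no entry in it.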

We then want to store the datasets as PyTorch tensors, which are similar to NumPy arrays but have some features that make them more suitable for machine learning tasks.

- Multi-Dimensional Arrays:
    - Tensors can have any number of dimensions, making them versatile for representing various types of data, such as scalars (0D), vectors (1D), matrices (2D), and higher-dimensional arrays
- GPU Acceleration:
    - PyTorch tensors can be moved to and operated on using GPUs, which significantly speeds up computations, especially for large-scale machine learning models
- Automatic Differentiation:
    - PyTorch tensors support automatic differentiation, which is essential for training neural networks. This feature is provided by PyTorch's autograd module, which automatically computes gradients for tensor operations
- Interoperability with NumPy:
    - PyTorch tensors can be easily converted to and from NumPy arrays, allowing seamless integration with existing NumPy-based code

In [8]:
# Convert features and labels into PyTorch tensors
    # torch.tensor() creates a tensor from a NumPy array
    # dtype=torch.float32 and dtype=torch.long specify the data type of the tensor
features_train_tensor = torch.tensor(features_train_normalised, dtype=torch.float32)
features_val_tensor = torch.tensor(features_val_normalised, dtype=torch.float32)
features_test_tensor = torch.tensor(features_test_normalised, dtype=torch.float32)
label_train_tensor = torch.tensor(label_train_encoded, dtype=torch.long)
label_val_tensor = torch.tensor(label_val_encoded, dtype=torch.long)
label_test_tensor = torch.tensor(label_test_encoded, dtype=torch.long)
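
The tensor properties listed above can be seen in a few lines. This is a small sketch with made-up values, separate from the SDSS data; it falls back to the CPU when no GPU is available.

```python
import numpy as np
import torch

# Interoperability with NumPy: array -> tensor -> array
array = np.array([[1.0, 2.0], [3.0, 4.0]])
tensor = torch.tensor(array, dtype=torch.float32)  # copies the data
back_to_numpy = tensor.numpy()                     # shares memory with the CPU tensor

# Automatic differentiation: d/dw sum(w^2) = 2w
weights = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (weights ** 2).sum()
loss.backward()
print(weights.grad)

# GPU acceleration: move the tensor to a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor_on_device = tensor.to(device)
print(tensor_on_device.device)
```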

## Defining the Neural Network

Now that we have the data ready, we can define the actual neural network (NN). The SDSSClassifier class defines a Feedforward Neural Network (FNN), which is the simplest type of NN. It's called feedforward because the data flows in one direction: from the input layer, through any hidden layers, to the output layer. There are no loops or cycles in the network.

The first thing the SDSSClassifier class does is inherit from PyTorch's nn.Module class, which is the base class for all neural network models in PyTorch. This inheritance provides the necessary structure and methods to define and train a neural network.

It then sets up the layers required by the NN.

- The Input Layer:
    - The input layer is the first layer of the NN and has the same number of neurons as the number of input features
    - It isn't explicitly defined in the SDSSClassifier class because it's just the input data
- First Fully Connected Layer:
    - The first fully connected layer is the first hidden layer of the NN
    - It is defined by the nn.Linear() class, which creates a fully connected layer
    - The number of input features is specified by the input_size parameter
    - The number of neurons in the hidden layer is specified by the hidden_size parameter
- Rectified Linear Unit (ReLU) Activation Function:
    - The ReLU activation function is applied to the output of each hidden layer
    - It is defined by the nn.ReLU() class and is used to introduce non-linearity into the NN
- Second Fully Connected Layer:
    - The second fully connected layer is the second hidden layer of the NN
    - It is defined by the nn.Linear() class and has hidden_size inputs and hidden_size neurons
- Third Fully Connected Layer:
    - The third fully connected layer is the output layer of the NN
    - It is defined by the nn.Linear() class
    - The number of neurons in the output layer is specified by the num_classes parameter, so it produces one score per class
- Softmax Function:
    - Softmax converts the scores from the output layer into class probabilities that sum to 1
    - The nn.CrossEntropyLoss() loss function used later applies log softmax internally, so it expects the raw scores rather than probabilities

Next it defines the forward() method, which specifies how the data flows through the NN. The forward() method takes the input data as a tensor and passes it to the first fully connected layer. The linear output is then passed through the ReLU activation function to help the NN learn complex relationships in the data. The result is passed through the second fully connected layer and ReLU again, and then into the third fully connected layer, which produces one score per class. Applying softmax to these scores gives the class probabilities.

In [34]:
class SDSSClassifier(nn.Module):
    '''
    This class defines the neural network model for the classification task. The neural network model consists of three fully connected
        layers with ReLU activation functions after the first two. The output layer returns raw class scores (logits), because
        nn.CrossEntropyLoss() applies log softmax internally. The neural network model is defined in the __init__() function and the
        forward pass is defined in the forward() function.
    '''
    def __init__(self, input_size, hidden_size, num_classes):
        '''
        This function initialises the SDSSClassifier class

        :param input_size: The number of input features, e.g. 5 for [u, g, r, i, z]
        :param hidden_size: The number of neurons in each hidden layer
        :param num_classes: The number of output classes, e.g. 3 for [Galaxy, QSO, and Star]
        '''
        # The super() function is used to call the __init__() function of the parent class (nn.Module)
        super(SDSSClassifier, self).__init__()

        # Define the layers of the neural network
            # nn.Linear() defines a fully connected layer
                # The first argument is the number of input neurons
                # The second argument is the number of output neurons
        self.fc1 = nn.Linear(input_size, hidden_size) # First fully connected layer
        self.relu = nn.ReLU() # ReLU activation function
        self.fc2 = nn.Linear(hidden_size, hidden_size) # Second fully connected layer
        self.fc3 = nn.Linear(hidden_size, num_classes) # Third fully connected layer (output layer)

    def forward(self, input_features):
        '''
        This function defines the forward pass of the neural network model and is called when the neural network model is run. The forward
            pass is the process of inputting the input features into the neural network and obtaining an output.

        :param input_features: The input features
        :return: The output of the neural network (one raw score, or logit, per class; apply torch.softmax(..., dim=1) to get class probabilities)
        '''
        input_features = self.fc1(input_features) # Pass the input features through the first fully connected layer
        input_features = self.relu(input_features) # Pass the output of the first fully connected layer through the ReLU activation function
        input_features = self.fc2(input_features) # Pass the output of the ReLU activation function through the second fully connected layer
        input_features = self.relu(input_features) # Pass the output of the second fully connected layer through the ReLU activation function
        # Pass the output of the ReLU activation function through the third fully connected layer and return the raw scores
            # No softmax is applied here because nn.CrossEntropyLoss() applies log softmax itself
        return self.fc3(input_features)
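
To make the data flow concrete, here is a plain-Python sketch of what a fully connected layer followed by ReLU computes. The tiny layer sizes and all weight and bias values are made up for illustration; in the real model they are learned during training and nn.Linear() does the arithmetic.

```python
def linear(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


def relu(values):
    """ReLU keeps positive values and replaces negative values with zero."""
    return [max(0.0, value) for value in values]


inputs = [2.0, 1.0]                                    # 2 hypothetical input features

hidden_weights = [[1.0, 0.5], [-1.0, 1.0]]             # 2 inputs -> 2 hidden neurons
hidden_biases = [0.5, 0.0]
output_weights = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]  # 2 hidden -> 3 class scores
output_biases = [0.0, 0.0, 1.0]

hidden = relu(linear(inputs, hidden_weights, hidden_biases))
scores = linear(hidden, output_weights, output_biases)

print("hidden activations:", hidden)
print("class scores:", scores)
```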

Now that the NN has been defined, we can initialise it. The input, hidden, and output sizes are defined and then passed to the SDSSClassifier class to create the model.

In [23]:
# Define the sizes of the layers and create the neural network model
# Input size is the number of input features (e.g., 5 for u, g, r, i, z) we will use
input_size = features_train_tensor.shape[1]

# Hidden size is the number of neurons in each hidden layer
hidden_size = 64  # You can change this value to see how it affects the performance of the model

# Output size is the number of classes (e.g., 3 for star, galaxy, quasar)
    # The number of classes is the number of unique labels in the training data
    # The np.unique() function returns the unique elements in an array, in this case the unique labels in the training data
output_size = len(np.unique(label_train_tensor))

# Create an instance of the SDSSClassifier class
model = SDSSClassifier(input_size, hidden_size, output_size)
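
Changing hidden_size changes how many weights and biases the model has to learn. Here is a rough count, assuming three fully connected layers sized as in the class above (input_size to hidden_size, hidden_size to hidden_size, hidden_size to the number of classes) and the example values of 5 features and 3 classes. The helper names are hypothetical.

```python
def linear_params(n_inputs, n_outputs):
    """A fully connected layer has one weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def count_params(input_size, hidden_size, num_classes):
    """Total learnable parameters for three fully connected layers."""
    return (
        linear_params(input_size, hidden_size)
        + linear_params(hidden_size, hidden_size)
        + linear_params(hidden_size, num_classes)
    )


for size in (16, 64, 256):
    print(f"hidden_size={size}: {count_params(5, size, 3)} parameters")
```

The hidden-to-hidden layer grows with the square of hidden_size, so it quickly dominates the total.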

Next we define the loss function and optimiser.

- Loss function:
    - Used to calculate the error between the predicted output of the neural network and the actual labels
    - The choice of loss function depends on the task; cross-entropy is the standard choice for multi-class classification

- Optimiser:
    - Used to update the weights of the neural network based on the gradients of the loss
    - There are many different optimisers available in PyTorch, such as Adam, SGD, and RMSprop. We use Adam in this example because it is a popular default for training NNs
    - The Adam optimiser is adaptive: it scales the step size for each parameter using running averages of recent gradients, which often helps the model converge faster
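
To show what "adaptive" means here, below is a simplified single-parameter sketch of the Adam update rule in plain Python. It uses Adam's commonly quoted default constants and ignores extras such as weight decay, so treat it as an illustration rather than PyTorch's implementation.

```python
import math


def adam_step(weight, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the step number, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    weight = weight - lr * m_hat / (math.sqrt(v_hat) + eps)
    return weight, m, v


step_sizes = {}
for grad in (0.01, 100.0):
    new_weight, _, _ = adam_step(weight=1.0, grad=grad, m=0.0, v=0.0, t=1)
    step_sizes[grad] = 1.0 - new_weight
    print(f"gradient {grad}: step size {step_sizes[grad]:.6f}")
```

Both gradients produce a first step of roughly lr, even though they differ by a factor of 10,000. The update is divided by the recent gradient magnitude, which is what makes the effective step size adapt to each parameter.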

The loss function and optimiser are defined outside the neural network model class because they are not part of the neural network architecture but are instead used to train the neural network. This also allows them to be easily changed or modified without affecting the neural network architecture.

- Learning rate:
    - The learning rate controls how much the weights of the neural network are updated at each training step
    - A higher learning rate means larger weight updates, and a lower learning rate means smaller ones
    - If the learning rate is too high, the model may settle too quickly on a poor solution or diverge, as it overshoots the minimum of the loss function
    - If the learning rate is too low, training can get stuck in a local minimum and/or take a very long time to complete
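
The effect of the learning rate is easy to see on a toy problem. The sketch below minimises f(w) = w² with plain gradient descent (not Adam), using three illustrative learning rates. The function and values are chosen for demonstration only and have nothing to do with the SDSS model.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; the gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w


results = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}
for lr, w in results.items():
    print(f"lr={lr}: w after 50 steps = {w:.4g}")
```

With lr=0.1 the weight ends up very close to the minimum at 0. With lr=0.001 it has barely moved after 50 steps, and with lr=1.1 every step overshoots and the weight grows without bound. This toy function has a single minimum, so it does not show the local-minimum problem.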

In [24]:
# Define the loss function and optimiser

# Define the loss function
    # nn.CrossEntropyLoss() is the loss function used for classification tasks with multiple classes
    # The CrossEntropyLoss() function combines the log softmax function and the negative log likelihood loss function
    # It therefore expects the raw scores (logits) from the model as its input, along with the integer class labels
criterion = nn.CrossEntropyLoss()

# Define the optimiser
    # optim.Adam() is the optimiser used to update the weights of the neural network
    # The Adam optimiser is an extension of the stochastic gradient descent optimiser
    # The Adam optimiser adapts the learning rate for each parameter during training
    # The learning rate is specified by the lr argument
    # model.parameters() returns the parameters that need to be updated by the optimiser, which in this case are the weights and biases of the neural network
optimizer = optim.Adam(model.parameters(), lr=0.001)
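
Cross-entropy loss for one sample amounts to a log softmax followed by a negative log likelihood. The plain-Python sketch below, with hypothetical scores, shows that calculation and what happens if probabilities are passed in instead of raw scores. It is an illustration of the maths, not PyTorch's implementation.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]  # shift for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(scores, target):
    """Log softmax followed by negative log likelihood, for one sample."""
    return -math.log(softmax(scores)[target])


scores = [3.0, 1.5, -2.0]  # hypothetical raw scores for [Galaxy, QSO, Star]
target = 0                 # true class: Galaxy

loss_from_scores = cross_entropy(scores, target)
loss_from_probabilities = cross_entropy(softmax(scores), target)
floor_from_probabilities = cross_entropy([1.0, 0.0, 0.0], target)

print(f"loss from raw scores:    {loss_from_scores:.4f}")
print(f"loss from probabilities: {loss_from_probabilities:.4f}")
print(f"lowest loss reachable from probabilities: {floor_from_probabilities:.4f}")
```

Passing probabilities applies softmax twice, which inflates the loss. Even a perfectly confident, correct prediction cannot push it below about 0.55 with three classes, which is why the loss should be given the raw scores.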