# Question 2: 

## Introduction

## Imports

In [43]:
import torch
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

## Initialising the Dataset

As in the Q1 notebook, we need to load the data into a usable form (i.e. a pandas dataframe).

In [36]:
# change the path to where the csv file is stored on your pc
path = '/Users/ryanu/Documents/Uni/ACT/SDSS-DR14-Classification/SDSS Data.csv'
data = pd.read_csv(path)
#data

I am going to start off using the same features as in Q1.

In [35]:
features = data[['u', 'g', 'r', 'i', 'z']]
labels = data['class']
#features

## Data Preprocessing

Like in Q1 with the decision tree, we need to split the data between training and testing. However, now we're also going to split out a validation set. The training set is used to train the model and the testing set is used to measure the model's performance, while the validation set is used to tune hyperparameters (e.g. learning rate, architecture) and to monitor overfitting during training.

In [57]:
# Split the data into training, validation, and testing sets
    # train_test_split() shuffles the data and splits it into two subsets
    # test_size=0.2 specifies that 20% of the data passed in should go to the second subset
    # random_state=42 is the seed used to shuffle the data, so the split is reproducible
    # The first call splits the data into a combined training/validation set and a testing set in an 80:20 ratio
    # The second call splits the combined set into training and validation sets in an 80:20 ratio
    # The final data is split into training, validation, and testing sets in a 64:16:20 ratio
features_train_val, features_test, label_train_val, label_test = train_test_split(features, labels, test_size=0.2, random_state=42)
features_train, features_val, label_train, label_val = train_test_split(features_train_val, label_train_val, test_size=0.2, random_state=42)
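As a quick sanity check on the 64:16:20 ratio, the two-stage split can be reproduced on a small synthetic array. This is only a sketch: `X` and `y` are hypothetical stand-ins with 1000 rows, not the SDSS data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 5 feature columns, 3 classes
X = np.arange(5000).reshape(1000, 5)
y = np.arange(1000) % 3

# First split: 80% training + validation, 20% testing
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 80% of the remaining rows for training, 20% for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=42
)

# Row counts for training, validation, and testing
print(len(X_train), len(X_val), len(X_test))  # 640 160 200
```

Running the same `len()` check on `features_train`, `features_val`, and `features_test` should show the same proportions for the real data.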

In order to train a neural network we need to get the data into the right format to work with. The first step is to normalise all the data. Neural networks often perform better with normalised data because they are sensitive to the scale of the input features. Standardisation ensures that features with larger ranges don’t dominate, and it usually helps the model converge faster.

In [61]:
# Initialise the StandardScaler
    # It's important to create a single StandardScaler object and reuse it for all the data sets, so that the same scaling is applied to each of them
    # StandardScaler scales each feature so that it has a mean of 0 and a standard deviation of 1 on the data it was fitted to
scaler = StandardScaler()

# Fit the StandardScaler to the training data
    # fit_transform() learns the mean and standard deviation of each feature from the training data, then scales the training data
    # transform() scales the validation and testing data using the statistics learned from the training data
    # Fitting only on the training data keeps information from the validation and testing sets out of the preprocessing
features_train_normalised = scaler.fit_transform(features_train)
features_val_normalised = scaler.transform(features_val)
features_test_normalised = scaler.transform(features_test)
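To see what fitting on the training set only means in practice, here is a minimal sketch on randomly generated numbers. The arrays `train` and `val` are hypothetical stand-ins for the magnitude columns, not the SDSS data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in features: 5 columns, like u, g, r, i, z
train = rng.normal(loc=18.0, scale=2.0, size=(800, 5))
val = rng.normal(loc=18.0, scale=2.0, size=(200, 5))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train only
val_scaled = scaler.transform(val)          # reuses the training mean_ and scale_

# The training columns end up with mean 0 and standard deviation 1
print(train_scaled.mean(axis=0).round(3))
print(train_scaled.std(axis=0).round(3))

# The validation columns are close to, but not exactly, mean 0
print(val_scaled.mean(axis=0).round(3))

# transform() is (x - mean_) / scale_ using the stored training statistics
manual = (val - scaler.mean_) / scaler.scale_
print(np.allclose(manual, val_scaled))
```

The validation and testing sets will generally not have a mean of exactly 0 after scaling. That is expected, because they are scaled with the training statistics rather than their own.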

We then need to convert the label names (Star, Galaxy, QSO) into numbers as neural networks expect numerical inputs and outputs.

In [62]:
# Encode the labels using LabelEncoder
    # Again, it's important to create a single LabelEncoder object and reuse it for all the data sets, so that the same encoding is applied to each of them
    # LabelEncoder sorts the class names and encodes them as integers starting from 0 (e.g. Galaxy is 0, QSO is 1, Star is 2)
    # This is necessary because the labels need to be integers for the model to be able to use them
label_encoder = LabelEncoder()

# Fit the LabelEncoder to the training labels
    # fit_transform() learns the set of classes from the training labels, then encodes the training labels
    # transform() encodes the validation and testing labels using the mapping learned from the training labels
    # This ensures that the validation and testing labels are encoded in the same way as the training labels
label_train_encoded = label_encoder.fit_transform(label_train)
label_val_encoded = label_encoder.transform(label_val)
label_test_encoded = label_encoder.transform(label_test)
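The mapping between class names and integers can be inspected directly, which is useful later when turning predictions back into names. A minimal sketch follows; the label strings are illustrative, and the exact spelling in the `class` column may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real values come from data['class']
example_labels = ["Star", "Galaxy", "QSO", "Galaxy", "Star"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(example_labels)

# classes_ holds the sorted class names; the position of each name is its integer code
print(encoder.classes_)

# The encoded labels, in the original order
print(encoded)

# inverse_transform() maps integer codes back to the class names
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

With the real data, `label_encoder.classes_` gives the same lookup for the fitted encoder.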

We then want to store the datasets as PyTorch tensors, which are similar to NumPy arrays but have some features that make them more suitable for machine learning tasks.

- Multi-Dimensional Arrays:
    - Tensors can have any number of dimensions, making them versatile for representing various types of data, such as scalars (0D), vectors (1D), matrices (2D), and higher-dimensional arrays
- GPU Acceleration:
    - PyTorch tensors can be moved to and operated on using GPUs, which significantly speeds up computations, especially for large-scale machine learning models
- Automatic Differentiation:
    - PyTorch tensors support automatic differentiation, which is essential for training neural networks. This feature is provided by PyTorch's autograd module, which automatically computes gradients for tensor operations
- Interoperability with NumPy:
    - PyTorch tensors can be easily converted to and from NumPy arrays, allowing seamless integration with existing NumPy-based code
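The features listed above can be sketched in a few lines. The values are arbitrary, and the device selection falls back to the CPU when no GPU is available.

```python
import numpy as np
import torch

# Interoperability with NumPy: convert an array to a tensor and back
array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
tensor = torch.from_numpy(array)  # shares memory with the NumPy array
back = tensor.numpy()

# Multi-dimensional arrays: this tensor is 2D (a matrix)
print(tensor.ndim, tuple(tensor.shape))

# GPU acceleration: move the tensor to a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor_on_device = tensor.to(device)
print(tensor_on_device.device)

# Automatic differentiation: autograd tracks operations on w
w = torch.tensor([2.0, -1.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()  # computes d(loss)/dw = 2 * w
print(w.grad)    # tensor([ 4., -2.])
```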

In [63]:
# Convert features and labels into PyTorch tensors
    # torch.tensor() creates a new tensor by copying the data from a NumPy array
    # dtype=torch.float32 stores the features as 32-bit floats, and dtype=torch.long stores the labels as 64-bit integers
features_train_tensor = torch.tensor(features_train_normalised, dtype=torch.float32)
features_val_tensor = torch.tensor(features_val_normalised, dtype=torch.float32)
features_test_tensor = torch.tensor(features_test_normalised, dtype=torch.float32)
label_train_tensor = torch.tensor(label_train_encoded, dtype=torch.long)
label_val_tensor = torch.tensor(label_val_encoded, dtype=torch.long)
label_test_tensor = torch.tensor(label_test_encoded, dtype=torch.long)
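The choice of dtypes matters once the tensors are fed into a model. The sketch below assumes a classification setup with `nn.Linear` and `nn.CrossEntropyLoss`, which this notebook has not defined yet. The layer and the random inputs are hypothetical and only illustrate why the features are `float32` and the labels are `long`.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in data: 4 samples, 5 features, 3 classes
features_np = np.random.default_rng(0).normal(size=(4, 5))  # NumPy defaults to float64
labels_np = np.array([0, 2, 1, 0])

features_t = torch.tensor(features_np, dtype=torch.float32)
labels_t = torch.tensor(labels_np, dtype=torch.long)

# nn.Linear parameters are float32 by default, so the inputs are converted to match
layer = nn.Linear(5, 3)  # hypothetical layer: 5 inputs, 3 class scores
logits = layer(features_t)

# With class-index targets, CrossEntropyLoss expects integer labels of dtype long
loss = nn.CrossEntropyLoss()(logits, labels_t)

print(features_t.dtype, labels_t.dtype)
print(tuple(logits.shape), loss.item())
```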