# 🔥 Introduction to coding Neural Networks with PyTorch 




We need a lot more imported libraries in this notebook, as we are going to be 
- reading in data from a file 
- and also implementing some higher-level machine learning techniques 

In [None]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

### 🤞 Download the data
Let's first try to download the planetary data that we need from where it is stored on Google Drive. Notice that the file is in an unusual format (`.hdf5`), so you probably can't open it just by clicking on it. 

In [None]:
! gdown "https://drive.google.com/uc?id=1X1dKqy5UR2jEp0j4RYcHBWjtYEK3o6-v" -O data.hdf5

# 🌌 What is an HDF5 File?

HDF5 (Hierarchical Data Format v5) is a powerful file format designed to store and organize large amounts of data efficiently. For astrophysics, this is especially useful because it can handle **complex, multi-dimensional datasets** (like simulations or observations 🌠) while maintaining high performance in reading and writing data.

### Key Features:
- **Hierarchical Structure**: Think of it as a file-system-within-a-file 🗂️. You can organize data into groups and datasets for easy access.
- **Efficient Storage**: Supports compression and chunking, ideal for managing massive datasets from telescopes or simulations 🛰️.
- **Metadata Support**: Attach descriptive information to your data for better context and usability 🏷️.

HDF5 files are widely used in astrophysics for handling datasets like cosmic microwave background maps, galaxy catalogs, particle simulations, or in this case, exoplanet observations. 🚀✨

If we open the file in Python using the xarray package, we can see that it contains 22 separate folders, each containing data. The file also gives us some information about this data at the top, most importantly the dimensions. We are specifically interested in the `wavelength` dimension, as this is where the raw spectra are stored. The `sample` dimension tells us how many example spectra we have, which should be around 70,000 🪐
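To get a feel for what "dimensions" mean before opening the real file, here is a minimal sketch that builds a tiny made-up dataset with the same two dimension names. The sizes and values are invented for illustration and are not taken from `data.hdf5`.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 3 "planets", each with a 5-bin spectrum.
n_samples, n_wavelengths = 3, 5
toy = xr.Dataset(
    data_vars={
        # one spectrum per sample, one value per wavelength bin
        "spectrum": (("sample", "wavelength"), np.random.rand(n_samples, n_wavelengths)),
        # one label per sample (values invented for this example)
        "log_H2O": (("sample",), np.array([-3.0, -5.5, -7.2])),
    },
    coords={
        "sample": np.arange(n_samples),
        "wavelength": np.linspace(0.6, 8.0, n_wavelengths),
    },
)

# The dimension sizes tell us how many samples and wavelength bins there are.
print(dict(toy.sizes))

# .sel picks out one planet by its sample index, leaving a 1D spectrum.
first_spectrum = toy["spectrum"].sel(sample=0).values
print(first_spectrum.shape)
```

The real dataset works the same way, just with many more samples, wavelength bins and variables.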


In [None]:
# open hdf5
path = "data.hdf5"

ds = xr.open_dataset(path)
ds

### 👀 Let's visualise some of this data
The below code will extract the first (`i=0`) planet from the data and plot the spectra.

In [None]:
i = 0

# plt.plot(ds['wavelength'],ds['contributions'].sel(sample=i,species=['H2O']).values.T, label='H2O')

plt.plot(ds['wavelength'],ds['spectrum'].sel(sample=i).values, "k-", label='observed spectrum')

plt.errorbar(ds['wavelength'],ds['spectrum'].sel(sample=i).values,xerr=ds['bin_width']/2, yerr=ds['noise'].sel(sample=i).values, fmt='none', color='black', )


plt.xlabel('wavelength')
plt.ylabel('transit depth')

axs = plt.gca()
axs.set_xscale('log')
axs.set_xticks([0.6, 1,1.5, 2, 2.5,3,4, 5, 6,8,])
axs.get_xaxis().set_major_formatter(plt.ScalarFormatter())

plt.title(f'sample {i}')
plt.legend()


# print the species present in the sample
s_ = 'log_H2O'
print("#"*70)
print(f" Abundance of {s_} in planetary atmosphere for planet {i} is {ds[s_].sel(sample=i).values:.4f}")
print("#"*70)

## 🚀 Convert HDF5 Files to NumPy Arrays

While HDF5 files are excellent for **storing and organizing large datasets**, working with them directly in Python can be less efficient for certain tasks, especially in **machine learning** workflows with frameworks like PyTorch. Here's why converting HDF5 data to **NumPy arrays** is a great idea:

### ⚡ Speed and Performance
- NumPy arrays are highly optimized for numerical computations and are faster to manipulate in Python 🏎️.
- Machine learning libraries like PyTorch, which we will be using, natively support NumPy arrays, allowing for seamless conversion to tensors and faster data loading 🔄.

### 📚 Simplicity
- NumPy arrays are easier to slice, index, and manipulate compared to hierarchical structures in HDF5 files 🧩.
- Simplifies data preprocessing and transformation pipelines for machine learning.
- We can also get rid of a lot of the unnecessary data in the HDF5 file that we are not interested in this time around.
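As a rough illustration of the NumPy-to-PyTorch conversion mentioned above, the sketch below uses a small random array as a stand-in for the real spectra (the array and its size are assumptions for the example). It shows the two common conversion routes and the data type each one produces.

```python
import numpy as np
import torch

# Hypothetical stand-in for the real spectra: 4 samples x 52 wavelength bins.
fake_spectra = np.random.rand(4, 52)      # NumPy defaults to float64

# torch.from_numpy shares memory with the array and keeps its dtype (float64).
shared = torch.from_numpy(fake_spectra)

# torch.Tensor(...) (the route used later in this notebook) makes a float32
# copy, which matches the default dtype of nn.Linear weights.
copied = torch.Tensor(fake_spectra)

print(shared.dtype, copied.dtype)

# Going back to NumPy is just as easy.
back = copied.numpy()
print(back.shape)
```

Either route works; the float32 copy is convenient here because the layers we define later use float32 weights by default.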


In [None]:
label_names = ['log_H2O']
labels = ds[label_names]
labels_np = labels.to_array().values.T

labels_np.shape

In [None]:
spectra_np = ds['spectrum'].values
spectra_np.shape

# 🧠 Creating the Neural Network Model 

This is where we will define the architecture of the model that we are going to train.

### 🚀 Model Structure

`self.fc1 = nn.Linear(a, b)` creates a layer with `a` input features and `b` output features.

So, `self.fc1 = nn.Linear(52, 512)` creates a layer with 52 input features and 512 output features.

The number of input features for the first layer _must match the size of the data_, and similarly the size of the output of the last layer _must match the number of things you are trying to predict_, which in this case is only 1.
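To make the input and output sizes concrete, here is a small sketch that passes a made-up batch through a single `nn.Linear(52, 512)` layer. The batch of 8 random spectra is an assumption for illustration only.

```python
import torch
import torch.nn as nn

layer = nn.Linear(52, 512)        # 52 input features, 512 output features

# A hypothetical batch of 8 spectra, each with 52 wavelength bins.
batch = torch.randn(8, 52)
out = layer(batch)
print(out.shape)                  # torch.Size([8, 512])

# The layer stores a (512, 52) weight matrix plus 512 biases.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)                   # 52 * 512 + 512 = 27136
```

If the last dimension of the batch were anything other than 52, the call to `layer(batch)` would raise a shape error, which is why the first layer has to match the data.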

### 🧘‍♀️ Flexibility 
You can easily add more layers or modify the architecture to improve the model performance for specific use cases.


In [None]:
class model_A(nn.Module):
    def __init__(self):
        super(model_A, self).__init__()

        self.fc1 = nn.Linear(52, 512)
        self.fc2 = nn.Linear(512, 128)
        self.fc3 = nn.Linear(128, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

## ✍️ Student Task: Try different models
Have a go at training models A, B and C, and see which one you think has the best performance. If you are feeling up to it, create a model of your own and try training that as well. See if you can beat the loss of the models I have given you here. 

In [None]:
class model_B(nn.Module):
    def __init__(self):
        super(model_B, self).__init__()

        self.fc1 = nn.Linear(52, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 32)
        self.fc4 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

In [None]:
class model_C(nn.Module):
    def __init__(self):
        super(model_C, self).__init__()

        self.fc1 = nn.Linear(52, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.fc3 = nn.Linear(1024, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

In [None]:
class model_D(nn.Module):
    def __init__(self):
        super(model_D, self).__init__()

        self.fc1 = nn.Linear( # Your code here...


    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu( # Your code here...
            
        return x

### 🔀 Change the model used here!

Modify the line below to change which model is trained.

In [None]:
model = model_A() # Modify this to select the model of your choice! 

This is where we define the loss function, which measures how well your model predicts the amount of water in the planet's atmosphere. 
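The loss used in this notebook is the mean squared error (`nn.MSELoss`). As a quick sanity check of what that computes, the sketch below compares it against a by-hand calculation on three invented predictions (the numbers are made up for the example).

```python
import torch
import torch.nn as nn

# Hypothetical predictions and true log-abundances for three planets.
predicted = torch.tensor([[-3.0], [-5.0], [-6.5]])
true = torch.tensor([[-3.5], [-5.0], [-7.5]])

mse_fn = nn.MSELoss()
loss = mse_fn(predicted, true)

# By hand: the errors are 0.5, 0.0 and 1.0, so the squared errors are
# 0.25, 0.0 and 1.0, and their mean is 1.25 / 3.
by_hand = ((predicted - true) ** 2).mean()

print(loss.item(), by_hand.item())
```

A lower value means the predictions sit closer to the true values, and a perfect model would give a loss of exactly 0.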

In [None]:
# create a loss function
criterion = nn.MSELoss()

## 📈 Understanding The Optimiser and _Learning Rate_

The **learning rate** (often denoted as `lr`) is a critical hyperparameter in machine learning algorithms. It determines how large a change is made to the model weights at each update step as the model moves closer to the solution.

### How It Works
- During training, the model updates its parameters (weights) to reduce the error by following the loss function.
- The learning rate controls how much the parameters are adjusted in each update 🛠️.

### Choosing the Right Learning Rate
- **Too High**: The model may overshoot the optimal solution, causing the loss to oscillate or even diverge 🚀❌.
- **Too Low**: The training process will be very slow and might get stuck in a local minimum. This is a point where taking more small steps no longer improves the model, and it would need to try something radically different to find a better solution.
- **Just Right**: A properly tuned learning rate helps the model converge efficiently onto the solution 🎯.

The learning rate is a key factor in determining the speed and success of your training process. A good choice can make the difference between a well-trained model and one that fails to converge! 🏆✨
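The three cases above can be seen on a toy problem. The sketch below minimises `f(w) = w**2` with plain gradient descent, which is a simplification: it is not the Adam optimiser used in this notebook, and this function has no local minima, so it only illustrates step size. The function name `gradient_descent` and the three learning rates are chosen for the example.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the final w."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # step against the gradient, scaled by lr
    return w

# The best possible answer is w = 0.
for lr in (1.1, 0.001, 0.3):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4g}")
```

With `lr=1.1` each step overshoots and `w` grows in size, with `lr=0.001` it has barely moved from its starting value after 20 steps, and with `lr=0.3` it ends up very close to 0.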


Here we assign the optimiser, which is where the learning rate is set.

### ✍️ Student Task: Learning rate exploration
Modify the learning rate to see the effect that this has on model training. 

In [None]:
lr = 0.0005

optimizer = optim.Adam(model.parameters(), lr=lr)

# 🗂️ Understanding Batch Size in Machine Learning

**Batch size** is an important hyperparameter in machine learning that defines the number of **training examples** processed together before the model updates its parameters.

### How It Works
- During training, data is divided into smaller subsets called **batches**.
- Each batch is used to calculate the **loss** and perform a single **gradient update**.

### Types of Batch Sizes
1. **Small Batch (Mini-batch)**: 
   - Size: Usually between 32 and 512.
   - **Pros**: Faster updates, better generalization 🏃.
   - **Cons**: More noise in gradients 🚧.
2. **Large Batch**: 
   - Size: Can be thousands of examples.
   - **Pros**: More stable gradients, efficient training on larger machines 💻.
   - **Cons**: Higher memory usage, and a potential for overfitting, where the model becomes optimised for only a certain subset of the true data and so is likely to perform badly on examples not present in the training dataset 📈.
3. **Full Batch**: 
   - All data is processed at once.
   - Rarely used due to memory and speed limitations 🐢.

### Choosing the Right Batch Size
- **Small Batch Sizes**: Often better for noisy datasets and limited hardware resources 🎯.
- **Larger Batch Sizes**: Useful for smooth convergence when sufficient computational power is available 💪.

### Impact on Training
- Smaller batches add **stochasticity** (noise and randomness) to training. This helps models generalize better: each batch is a small subset of the data that is unlikely to be representative of the full distribution, so the model has to learn to cope with data outside the subset it has just been trained on.
- Larger batches make training more deterministic, which can lead to faster convergence but may reduce generalization.

The choice of batch size is a balance between **training speed, memory constraints, and model performance**. It’s often best to experiment with different sizes to find the optimal value for your task! 🏆✨
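One practical consequence of batch size is how many weight updates happen per epoch. The sketch below counts them for a made-up dataset of 1,000 random spectra (the dataset and the three batch sizes are assumptions for illustration).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 1,000 fake spectra with 52 bins and one label each.
n_samples = 1000
dataset = TensorDataset(torch.randn(n_samples, 52), torch.randn(n_samples, 1))

updates_per_epoch = {}
for bs in (32, 250, n_samples):
    loader = DataLoader(dataset, batch_size=bs, shuffle=True)
    # len(loader) is the number of batches, i.e. gradient updates, per epoch.
    updates_per_epoch[bs] = len(loader)

print(updates_per_epoch)   # {32: 32, 250: 4, 1000: 1}
```

The last case is full-batch training: the whole dataset is processed at once, so there is only a single update per epoch.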


Here we assign the data loader, which is where the batch size is set.

### ✍️ Student Task: Batch size exploration
Modify the batch size to see the effect that this has on model training. 

In [None]:
batch_size = 1000

train_dataset = TensorDataset(torch.Tensor(spectra_np), torch.Tensor(labels_np))
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

## 📈 Training Begins!

Here we will start to train the model over a small number of epochs. Notice how the loss decreases (hopefully!) with each epoch. This means the model is learning.

In [None]:
batch_losses = []
epoch_losses = []


# train the model
model.train()
for epoch in range(20):

    for i, data in enumerate(train_loader):

        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        batch_losses.append(loss.item())


    epoch_losses.append(np.mean(batch_losses))
    batch_losses = []

    print(f'epoch {epoch}, loss {epoch_losses[-1]}')

## 👀 Visualising the training process

Let's visualise this like we were doing in class, by creating some functions which plot the data. 

In [None]:
def plot_loss(losses, epoch):

    losses = np.array(losses)

    plt.figure()
    plt.plot(losses, "k-")

    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.title(f'training loss\nepoch: {epoch}')

    plt.yscale('log')

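    # keep the lower limit positive, since a log axis cannot show values <= 0
    lower = max(losses.mean() - 3 * losses.std(), losses.min() * 0.9)
    plt.ylim(lower, 4)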

    plt.savefig('training_loss.png')
    plt.close()

In [None]:
def plot_predictions(predictions, labels, epoch):
    # make sure plot is square
    plt.figure(figsize=(5, 5))
    plt.plot(labels, predictions, "k.", label='predictions')
    plt.plot(labels, labels, "r--", label='Ground Truth')
    plt.xlabel('true log H2O value')
    plt.ylabel('predicted log H2O value')
    plt.title(f'predictions\nepoch: {epoch}')

    plt.xlim(labels.min(), labels.max())
    plt.ylim(labels.min(), labels.max())

    plt.savefig('predictions.png')
    plt.close()

## 🚴‍♀️ Training for real

Let's run a lot more epochs of training, with this new visualisation.

_Hint: If the loss looks like it is still decreasing after the 200 epochs I have used here, then increase this number, with the caveat of course that more epochs of training will take longer to run._

The training loop should save 2 image files, which are updated as the training goes on. You can see them in the left-hand bar under the folder icon 📁. One will show the batch predictions against the truth, and the other will show the loss history 🕰️.

In [None]:
batch_losses = []
epoch_losses = []


# train the model
model.train()
for epoch in range(200):

    for i, data in enumerate(train_loader):

        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        batch_losses.append(loss.item())


    epoch_losses.append(np.mean(batch_losses))
    batch_losses = []


    if epoch % 5 == 0:
        plot_loss(epoch_losses, epoch)
        plot_predictions(outputs.detach().numpy(), labels.detach().numpy(), epoch)

    if epoch % 20 == 0:
        print(f'epoch {epoch}, loss {epoch_losses[-1]}')

# 🚨 If you are working through this **stop here** 
## We will cover this next session in class 

In [None]:
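# split the data into training and test sets

# noise_np is not defined earlier in the notebook, so extract it from the dataset here
noise_np = ds['noise'].values

tts = [0.8, 0.2]
seed = 42
train_spectra, test_spectra, train_labels, test_labels, train_noise, test_noise = train_test_split(spectra_np, labels_np, noise_np, test_size=tts[1], random_state=seed)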

# print shapes in table
print(f"""
train_spectra: {train_spectra.shape}
train_labels: {train_labels.shape}
train_noise: {train_noise.shape}

test_spectra: {test_spectra.shape}
test_labels: {test_labels.shape}
test_noise: {test_noise.shape}
""")

In [None]:
train_dataset = TensorDataset(torch.Tensor(train_spectra), torch.Tensor(train_labels))
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

In [None]:
def plot_loss(losses, test_losses, epoch):

    losses = np.array(losses)
    test_losses = np.array(test_losses)

    plt.figure()
    plt.plot(losses, "k-", label='train loss')
    plt.plot(test_losses, "g-", label='test loss')

    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.title(f'training loss\nepoch: {epoch}')

    plt.yscale('log')
    plt.legend()

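    # keep the lower limit positive, since a log axis cannot show values <= 0
    lower = max(test_losses.mean() - 3 * test_losses.std(), test_losses.min() * 0.9)
    plt.ylim(lower, 4)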

    plt.savefig('training_loss.png')
    plt.close()

In [None]:
batch_losses = []
epoch_train_losses = []
epoch_test_losses = []

# train the model
model.train()
for epoch in range(1000):
    for i, data in enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        batch_losses.append(loss.item())

    with torch.no_grad():
        test_outputs = model(torch.Tensor(test_spectra))
        test_loss = criterion(test_outputs, torch.Tensor(test_labels))

    epoch_test_losses.append(test_loss.item())
    epoch_train_losses.append(np.mean(batch_losses))
    batch_losses = []

    if epoch % 5 == 0:
        plot_loss(epoch_train_losses,epoch_test_losses, epoch)
        plot_predictions(outputs.detach().numpy(), labels.detach().numpy(), epoch)

    if epoch % 20 == 0:
        # report this loop's training loss (epoch_losses belongs to the earlier training cell)
        print(f'epoch {epoch}, train loss {epoch_train_losses[-1]}, test loss {test_loss.item()}')
