# Homework: PyTorch Multi-Layer Perceptron

This is a set of homework questions based on the `SimpleDNN.ipynb` tutorial. They ask you to run some experiments and draw some insights about neural networks. We will use the same datasets as before, so make sure you still have them! It may also help to refer back to that notebook if you're ever confused about the physics task at hand.

As you work, feel free to insert cells into this notebook as needed. Please keep the overall structure (order of the problems, etc.) consistent, but don't feel overly bound by the scaffolding laid out!

To get started, we'll import the same tools we've used before.

In [None]:
import os
from itertools import cycle
import numpy as np
import vector
import torch
from matplotlib import pyplot as plt

To start, you should load in the datasets (and build `DataLoader`s) and re-create the model architecture you built in the tutorial. While I have no doubt you'll reference that code, try to make sure you're actually typing out at least a significant portion of it here. I've found that copy/pasting is often the enemy of knowledge retention!

In [None]:
# Your code here - load datasets and create dataloaders

In [None]:
# Your code here - define model class as subclass of torch.nn.Module

### Problem 1

Now that you have a model class defined and some datasets, you should implement training! Here, we're going to do things a little differently than in the tutorial: you'll be wrapping all of the training process up neatly into a function! This will allow you to run multiple trainings easily within the notebook, which you'll be doing in later problems. Your function should accept as parameters:
- an instance of your model class (the model to train)
- an optimizer (instance of a class in the [`torch.optim`](https://docs.pytorch.org/docs/stable/optim.html) module)
- a loss function (can be an instance of any subclass of `torch.nn.Module`, though it will most likely be one of the losses that PyTorch implements)
- a number of epochs to train
- anything else you feel you'll need (or just want to include)

It should then return the trained model.

A couple of pointers for the design of this function:
- Try to make it as self-contained as possible. This will likely involve adding additional parameters beyond the four mentioned above (what else did we need to know in order to set up our training?), but will pay off in the long run, as you can easily run trainings with a variety of configurations.
- This function will likely spend more than a minute running when called (it takes time to train)! During that time, it might be nice to get some feedback that things are running as expected. Feel free to add `print` statements (or use other tools) to monitor your training progress!

A simple signature has been included for you. When you finish, train a model! It might be a good idea to save each trained model with `torch.jit.script(my_model).save(file_name)`. You can then load all your models and compare them very easily, with `my_model = torch.jit.load(file_name)`.
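
If the save/load round trip is unfamiliar, here is a minimal sketch of it. The `TinyMLP` class and the file name are hypothetical stand-ins (your model from the tutorial will look different), and the check at the end just confirms that the loaded copy gives the same outputs as the original.

```python
import os
import tempfile

import torch


class TinyMLP(torch.nn.Module):
    """Hypothetical stand-in for the model class from the tutorial."""

    def __init__(self, n_in=4, n_hidden=8, n_out=2):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(n_in, n_hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(n_hidden, n_out),
        )

    def forward(self, x):
        return self.layers(x)


model = TinyMLP()
model.eval()

# Save a scripted copy of the model to a temporary file (placeholder name)
file_name = os.path.join(tempfile.mkdtemp(), "tiny_mlp.pt")
torch.jit.script(model).save(file_name)

# Load it back; the Python class definition is not needed for this step
loaded = torch.jit.load(file_name)
loaded.eval()

# The original and the loaded model should agree on the same inputs
x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.allclose(model(x), loaded(x))
print("Outputs match:", same)
```

Giving each training run its own file name (for example, one that encodes the model size) should make the comparisons in later problems easier to organize.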

In [None]:
def train_my_model(model, optimizer, loss_fn, epochs=5):
    
    # Your code here - train the model!
    
    return model

In [None]:
# Your code here - call your function and train!

### Problem 2: Model Size

Now that we can create and train a variety of models easily, let's investigate a major aspect (and sometimes pain point) in machine learning: how does the performance of our model vary as its size changes? Train at least 4 different models (for at least 5 epochs each) and examine their performance on the validation (or testing) dataset. Which one performed the best? Write down your observations in markdown cells as you go, and try to compare performance directly (perhaps even in a plot?) among the different trainings you ran.
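
One way to put a number on "size" is to count trainable parameters. The sketch below assumes nothing about your tutorial model: `make_mlp` is a hypothetical helper that builds a plain fully connected network, and the feature and output counts are placeholders.

```python
import torch


def count_parameters(model):
    """Number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def make_mlp(n_in, n_hidden, n_layers, n_out):
    """Hypothetical helper: an MLP with `n_layers` hidden layers of equal width."""
    layers = []
    width_in = n_in
    for _ in range(n_layers):
        layers += [torch.nn.Linear(width_in, n_hidden), torch.nn.ReLU()]
        width_in = n_hidden
    layers.append(torch.nn.Linear(width_in, n_out))
    return torch.nn.Sequential(*layers)


# Placeholder sizes: 10 input features, 3 outputs
for n_hidden, n_layers in [(16, 1), (64, 2), (256, 3)]:
    mlp = make_mlp(10, n_hidden, n_layers, 3)
    print(f"{n_layers} hidden layer(s) of width {n_hidden}: "
          f"{count_parameters(mlp)} parameters")
```

Plotting validation loss against parameter count is one possible way to compare your trainings directly.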

In [None]:
# Training 1

In [None]:
# Evaluation 1

In [None]:
# Training 2

In [None]:
# Evaluation 2

In [None]:
# Training 3

In [None]:
# Evaluation 3

In [None]:
# Training 4

In [None]:
# Evaluation 4

In [None]:
# Comparison and summary

### Problem 3: Our Input Data

To understand our problem a little better, let's look a bit more closely at what we're giving the model.

#### Part 3a: Taking a Look

Plot histograms of several of the observables that serve as inputs to our model. You should at least plot the $p_x$, $p_z$, and $E$ of the _first jet_ and _lepton_, and the $p_x$ and $p_y$ of the missing transverse energy. What do you see? Do all the variables lie in about the same range, or are their domains different? What about their distributions? Summarize your findings.

In [None]:
# Your code here - make some plots!

In [None]:
# Summary - what did you find?

#### Part 3b: Standardization

Hopefully you saw that the distributions of the various quantities we pass to the model can be quite different from each other! Neural networks can sometimes [train more easily](https://doi.org/10.1016/0305-0483(96)00010-2) (and thus reach better performance) when all of the features have roughly the same range of values. So, we should try _standardizing_ our input features, so that they all lie on a similar range. The most straightforward way to do this is by taking each feature, subtracting its mean, and dividing by its standard deviation:

$$ x_{i,j} \leftarrow \frac{x_{i,j} - \langle x_i \rangle}{\sigma_i} $$

where $i$ indexes over the different features (so $\langle x_i \rangle$ is the mean of the $i$th feature) and $j$ indexes over the samples in our dataset. This transformation will map any normal (Gaussian) distribution to the _standard normal distribution_ (Gaussian with $\mu=0$ and $\sigma=1$). On other distributions, it will preserve the shape, but shift it to be centered around zero and squeeze the values into a (usually) smaller range. Is this desirable for us? Try it and see!
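
As a small illustration of the formula above, here is a sketch in pure PyTorch. The data are synthetic (three made-up features on very different scales), and taking the statistics from the training split only is my assumption about a sensible workflow, not something the tutorial prescribes.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the inputs: 3 features on very different scales
scales = torch.tensor([1.0, 50.0, 1000.0])
offsets = torch.tensor([0.0, 20.0, -300.0])
x_train = torch.randn(1000, 3) * scales + offsets
x_val = torch.randn(200, 3) * scales + offsets

# One mean and one standard deviation per feature (column),
# computed from the training set only
mean = x_train.mean(dim=0, keepdim=True)
std = x_train.std(dim=0, keepdim=True)

# Apply the same shift and scale to both splits
x_train_std = (x_train - mean) / std
x_val_std = (x_val - mean) / std

print("train means:", x_train_std.mean(dim=0))
print("train stds: ", x_train_std.std(dim=0))
```

Note that `torch.Tensor.std` applies Bessel's correction by default, so its result differs very slightly from a population standard deviation; for datasets of this size the difference is negligible.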

The Scikit-Learn library provides a variety of methods for data standardization in its [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) module. The method I described above is implemented in the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) class. You can use these if you wish, but be aware that you'll need to convert the PyTorch tensors from the datasets to NumPy arrays (you can use [`torch.Tensor.numpy()`](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.numpy.html) for this), pass them through the scalers, then convert them back to tensors (using [`torch.as_tensor()`](https://docs.pytorch.org/docs/stable/generated/torch.as_tensor.html#torch.as_tensor) or similar).

Regardless of how you accomplish it, try to standardize the input features and train a model with them. If you expanded the parameter list in your training function to include the dataloaders, you'll be able to just create new dataloaders with your transformed inputs and pass them in!
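
If you go the Scikit-Learn route, the round trip might look roughly like the sketch below. It assumes your inputs and targets are available as plain tensors (here replaced by synthetic ones with made-up shapes), and the batch size is a placeholder.

```python
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for the input and target tensors
x_train = torch.randn(512, 6) * 40.0 + 10.0
y_train = torch.randn(512, 2)

# Tensor -> NumPy -> scaler -> tensor
scaler = StandardScaler()
x_scaled_np = scaler.fit_transform(x_train.numpy())
x_scaled = torch.as_tensor(x_scaled_np, dtype=torch.float32)

# New dataloader built from the transformed inputs and the original targets
scaled_loader = DataLoader(
    TensorDataset(x_scaled, y_train), batch_size=64, shuffle=True
)

x_batch, y_batch = next(iter(scaled_loader))
print(x_batch.shape, y_batch.shape)
```

Keep the fitted `scaler` around: the validation and test inputs should go through `scaler.transform()` with the same statistics rather than being fitted separately.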

In [None]:
# from sklearn import preprocessing # Optional

In [None]:
# Your code here - standardization

In [None]:
# Your code here - training

In [None]:
# Your code here - evaluation and conclusions

#### Part 3c: Standardizing Outputs

Just like input features, _output features_ can also be standardized. This is simple in training, but complicates evaluation a bit - in order to interpret the predictions your model makes, you'll need to _invert the transformation_. Here's where the Scikit-Learn classes can come in handy - they implement an `inverse_transform()` method that can do this inversion easily. Try training with standardized outputs, both with and without standardizing inputs. Which combination performs the best? Why might this be the case?

Hint: if you're having trouble getting your tensor(s) of model predictions into NumPy arrays to invert the scaling, you may need to use something like `output_tensor.detach().to("cpu").numpy()`. This detaches your tensor from PyTorch's computational graph, moves the data to CPU memory if it's on the GPU, and _only then_ converts it to NumPy. A slightly less safe alternative is `output_tensor.numpy(force=True)`, which does these things and a couple of others to ensure NumPy can work with the data in the tensor.

Note: if you already standardized your output features along with your inputs before, that's totally fine, no need to change that! Just also train once with only the inputs standardized and once with only the outputs standardized, and comment on the results.
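
To make the inversion step from the hint concrete, here is a minimal sketch. The targets are synthetic, and the `torch.nn.Linear` layer is only a placeholder for a trained model whose outputs live in the standardized space.

```python
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler

torch.manual_seed(0)

# Synthetic stand-in for the training targets, in physical units
y_train = torch.randn(256, 2) * 100.0 + 50.0

# Fit the scaler on the training targets and standardize them
y_scaler = StandardScaler()
y_train_scaled = torch.as_tensor(
    y_scaler.fit_transform(y_train.numpy()), dtype=torch.float32
)

# Placeholder model; its predictions are in the standardized space
model = torch.nn.Linear(5, 2)
pred_scaled = model(torch.randn(8, 5))  # still attached to the autograd graph

# Detach, move to CPU, convert to NumPy, then undo the scaling
pred_np = pred_scaled.detach().to("cpu").numpy()
pred_physical = y_scaler.inverse_transform(pred_np)

# Sanity check: inverting the scaled targets recovers the originals
recovered = y_scaler.inverse_transform(y_train_scaled.numpy())
round_trip_ok = np.allclose(recovered, y_train.numpy(), rtol=1e-4, atol=1e-2)
print("Round trip recovers targets:", round_trip_ok)
```

Evaluating in physical units (after `inverse_transform`) keeps the comparison fair between trainings with and without output standardization, since losses computed in the standardized space are on a different scale.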

In [None]:
# Your code here - standardize output features and make new dataloaders

In [None]:
# Your code here - train

In [None]:
# Evaluation and conclusions

### Problem 4: Other Training Settings

As a final problem, experiment with other _hyperparameters_ (settings for our training process). Run several trainings, varying at least two of these things:
- Change the _batch size_ to be larger or smaller (create new dataloaders using the same datasets). How big do batches need to be?
- Change the learning rate (when you create the optimizer). Try several values and try to find the best one! Does this optimal value depend on the batch size? Why might this be?
- Use a different loss function than you have so far. Try [MSE](https://docs.pytorch.org/docs/stable/generated/torch.nn.MSELoss.html), [L1 Loss (MAE)](https://docs.pytorch.org/docs/stable/generated/torch.nn.L1Loss), and [Huber loss](https://docs.pytorch.org/docs/2.8/generated/torch.nn.HuberLoss.html), all of which are common in regression problems like this.

Compare the results of your different trainings. Were any of the optimal settings surprising to you? Why (or why not)?
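
To get a feel for how the three losses listed above differ, it can help to evaluate them on a tiny hand-made example. The numbers below are made up, with one deliberate outlier, and `delta=1.0` is simply PyTorch's default for `HuberLoss`.

```python
import torch

# Toy predictions and targets; the last entry is a deliberate outlier
pred = torch.tensor([0.0, 0.5, 1.0, 10.0])
target = torch.tensor([0.0, 0.0, 0.0, 0.0])

losses = {
    "MSE": torch.nn.MSELoss(),
    "L1 (MAE)": torch.nn.L1Loss(),
    "Huber (delta=1)": torch.nn.HuberLoss(delta=1.0),
}

# Each loss averages its per-element values over the four entries
values = {name: loss_fn(pred, target).item() for name, loss_fn in losses.items()}
for name, value in values.items():
    print(f"{name}: {value:.5f}")
```

This gives 25.3125 for MSE, 2.875 for L1, and 2.53125 for Huber. The single outlier dominates the MSE, while L1 and Huber grow only linearly with it, which is worth keeping in mind when you interpret your trainings.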

In [None]:
# Your code here - many trainings!

You're done, yay!

A couple of final notes on that last problem:
1. I just mentioned a few of the standard loss functions implemented in PyTorch, but you can implement _custom_ loss functions as well! They're subclasses of `torch.nn.Module`, just like the models themselves. This, along with model architecture, is one of the biggest "knobs" that researchers play with to improve model training (check out [multi-objective optimization](https://indico.cern.ch/event/1299889/contributions/5465272/attachments/2773733/4833542/MDMM%20lecture.pdf) for some cool stuff in this arena).
2. This last problem was likely your first stumbling steps towards the practice of _hyperparameter optimization_. This is a big deal in machine learning, and doing it systematically is often a large effort (and large computational cost) for ML-based research projects. One package that exists to ease the effort is [Optuna](https://optuna.readthedocs.io/en/stable/), a very flexible framework for general optimization problems. I've used it a lot in the past, and it greatly simplifies the process of finding good hyperparameters amidst a huge search space.
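
As a small illustration of the first note, a custom loss can be as short as the sketch below. `RelativeErrorLoss` is a hypothetical example I made up for this purpose, not a loss from the tutorial or from PyTorch.

```python
import torch


class RelativeErrorLoss(torch.nn.Module):
    """Hypothetical custom loss: mean absolute error relative to the target size."""

    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps  # guards against division by zero

    def forward(self, pred, target):
        return torch.mean(torch.abs(pred - target) / (torch.abs(target) + self.eps))


loss_fn = RelativeErrorLoss()
pred = torch.tensor([1.1, 1.8, 3.3], requires_grad=True)
target = torch.tensor([1.0, 2.0, 3.0])

loss = loss_fn(pred, target)
loss.backward()  # gradients flow through the custom loss like any built-in one

print("loss:", loss.item())
print("gradient:", pred.grad)
```

Because it is built from differentiable tensor operations, an instance of a class like this can be passed to a training function anywhere a built-in loss would go.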