# PyTorch Workflow

This notebook closely follows the material available at learnpytorch.io [[1]](https://www.learnpytorch.io/), with occasional refactoring and extensions for consistency of style and to make connections with other parts of the package. It also includes more examples and spends less time revisiting lower-level concepts once they have been discussed.

### A Generic Workflow

Most machine learning projects follow the steps below.

- Take relevant historical data and map it to tensors.
- Build or select a model/algorithm that is suitable for the problem to be solved.
    - Decide on a loss function and an optimizer.
    - Build a training loop.
- Fit the model to the data and make a prediction.
- Evaluate the model and improve iteratively.
- Preserve the state of the training.

In [None]:
import matplotlib.pyplot as plt  # for inline visualization
import torch
from torch import nn  # contains the building blocks for neural networks

### Example: Linear Regression

To demonstrate the above workflow, we will generate a predictor $X$ and a target variable $Y$ with a known linear (affine, to be more precise) relationship

$$
Y = a X + b
$$

then use a portion of this data to train our model to estimate these parameters and, consequently, use the trained model to predict the rest of the data.

**Note.** In machine learning, data can be just about anything. One of the first challenges to overcome is to find a proper representation of the input in terms of tensors.

In [None]:
true_weight, true_bias = 0.7, 0.3
predictor = torch.arange(0, 1, 0.02)
print(predictor)

Since the relationship is fixed, we readily generate the target variable from the predictor and the true weight and bias parameters.

In [None]:
target = true_weight * predictor + true_bias
print(target)

When it comes to partitioning the data, the following three-way split should be considered: training set, validation set, and testing set. Each has its own purpose and rule of thumb for size.

- The training set is what the model learns from. It typically makes up 60-80% of the total data.
- The validation set is used to fine-tune the model. It is not always used and can make up 10-20% of the total data.
- The testing set is the portion on which the model is evaluated. It consists of the rest of the data.

Here, we ignore validation and use a generic Pareto split: 80% of the data for training and 20% for testing.

In [None]:
train_cutoff = int(0.8 * predictor.size(dim=0))
train_predictor, train_target = predictor[:train_cutoff], target[:train_cutoff]
test_predictor, test_target = predictor[train_cutoff:], target[train_cutoff:]
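
For completeness, the three-way split described above could be sketched as follows. This is only an illustration: the 60/20/20 proportions and the `validation_*` names are assumptions made for the example and are not used elsewhere in this notebook.

```python
import torch

# Hypothetical data mirroring the notebook's predictor and target.
predictor = torch.arange(0, 1, 0.02)
target = 0.7 * predictor + 0.3

# Assumed proportions: 60% training, 20% validation, 20% testing.
sample_count = predictor.size(dim=0)
train_cutoff = int(0.6 * sample_count)
validation_cutoff = int(0.8 * sample_count)

train_predictor, train_target = predictor[:train_cutoff], target[:train_cutoff]
validation_predictor = predictor[train_cutoff:validation_cutoff]
validation_target = target[train_cutoff:validation_cutoff]
test_predictor, test_target = predictor[validation_cutoff:], target[validation_cutoff:]

# The three pieces together cover every sample exactly once.
split_sizes = (len(train_predictor), len(validation_predictor), len(test_predictor))
print(split_sizes)
```

Slicing by position keeps the example simple. For data that is not already in a meaningless order, shuffling before the split would usually be preferable.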

We can visualize our data using a scatter plot, a fairly standard pick in the context of linear regression.

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(train_predictor, train_target, s=2, c="red", label="Training data")
plt.scatter(test_predictor, test_target, s=2, c="blue", label="Testing data")
plt.legend(prop={"size": 12})
plt.show()

In order to build a machine learning model, we will rely on the `Module` template provided by `torch.nn`.
Thanks to inheritance, we do not need to care about the low-level implementation.
The main actors are the weight and bias parameters, which are represented as single-element tensors.
Importantly, we will need their gradients with respect to the yet-to-be-defined loss function; thus, we require gradients at instantiation.
The type used for these objects is the `Parameter` class which, when assigned as an attribute of a `Module`, gets added to the parameters of that `Module`.
Other than that, a `Parameter` can be thought of as any other tensor.

The bare minimum to produce a solution to the machine learning problem is to be able to predict using the model.
The `forward` function defines how the prediction should be made with the parameters of the model.
In this case, it is the linear regression formula.

In [None]:
class LinearRegressionModel(nn.Module):
    
    def __init__(self):
        super().__init__()
        self._weight = nn.Parameter(
            torch.randn(1, dtype=torch.float),  # randomized initial state
            requires_grad=True,
        )
        self._bias = nn.Parameter(
            torch.randn(1, dtype=torch.float),  # randomized initial state
            requires_grad=True,
        )
    
    @property
    def weight(self):
        return self._weight
    
    @property
    def bias(self):
        return self._bias
    
    def forward(self, predictor):
        return self.weight * predictor + self.bias

Parameters and the state of the model can readily be accessed with inherited methods from `Module`.

In [None]:
sample_model = LinearRegressionModel()
for parameter in sample_model.parameters():
    print(parameter)
print()
print(sample_model.state_dict())
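
The role of `Parameter` can be made visible with a small experiment. The sketch below uses a hypothetical `RegistrationDemo` class, introduced only for this illustration, holding one attribute wrapped in `nn.Parameter` and one plain tensor. Only the wrapped one should appear among the parameters of the module.

```python
import torch
from torch import nn


class RegistrationDemo(nn.Module):  # hypothetical class, for illustration only

    def __init__(self):
        super().__init__()
        # Wrapped in nn.Parameter: registered with the module on assignment.
        self.registered = nn.Parameter(torch.zeros(1))
        # Plain tensor attribute: stored on the object but not registered.
        self.unregistered = torch.zeros(1, requires_grad=True)


demo = RegistrationDemo()
parameter_names = [name for name, _ in demo.named_parameters()]
print(parameter_names)
print(list(demo.state_dict().keys()))
```

An optimizer built from `demo.parameters()` would therefore never update the plain tensor, and it would not be saved with the state dictionary either.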

Assume we would like to make a prediction using the initial parameters. PyTorch is equipped with a context manager called `inference_mode`, which ensures that code run within its context does not track gradients and disables view tracking to improve performance. In this particular example it brings no real benefit; it does, however, once the number of parameters and evaluation steps grows.

In [None]:
with torch.inference_mode():
    result = sample_model(test_predictor)
print(result)
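
The effect of `inference_mode` can be observed on the output tensors themselves. The following sketch uses `nn.Linear` as a stand-in model (an assumption made to keep the example self-contained) and compares the same forward pass inside and outside the context manager.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=1, out_features=1)  # stand-in model, not the notebook's class
inputs = torch.arange(0, 1, 0.25).unsqueeze(dim=1)  # shape (4, 1)

tracked_output = layer(inputs)  # regular forward pass, autograd graph is recorded
with torch.inference_mode():
    untracked_output = layer(inputs)  # same computation, nothing is recorded

print(tracked_output.requires_grad)
print(untracked_output.requires_grad)
print(untracked_output.is_inference())
```

The values of the two outputs agree; only the bookkeeping attached to them differs.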

Visualizing once again, we see that the initial pick of parameters simply resulted in a random guess.

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(train_predictor, train_target, s=2, c="red", label="Training data")
plt.scatter(test_predictor, test_target, s=2, c="blue", label="Testing data")
plt.scatter(test_predictor, result, s=2, c="green", label="Model-predicted data")
plt.legend(prop={"size": 12})
plt.show()

The model parameters can be improved iteratively by expressing the goodness of the prediction in terms of some loss function.
We then pick an appropriate optimizer, which determines how the parameters should be updated in order to decrease the loss.
An example of a loss function would be mean absolute error while an example of an optimizer would be stochastic gradient descent.
For a brief overview of each, see the Wikipedia articles [[2]](https://en.wikipedia.org/wiki/Mean_absolute_error) and [[3]](https://en.wikipedia.org/wiki/Gradient_descent).

In [None]:
loss_function = nn.L1Loss()  # Mean absolute error
optimizer = torch.optim.SGD(
    params=sample_model.parameters(),
    lr=0.01
)  # Stochastic gradient descent. `lr` is a so-called hyperparameter and subject to change
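
To make the two ingredients less opaque, here is a minimal sketch that reproduces them by hand on toy values. Mean absolute error is the mean of $|\hat{y}_i - y_i|$, and a plain SGD step (without momentum or weight decay) replaces each parameter $\theta$ by $\theta - \mathrm{lr} \cdot \nabla_\theta L$. The numbers, the learning rate of `0.1` and the toy loss below are assumptions chosen purely for illustration.

```python
import torch
from torch import nn

# Hypothetical toy values, chosen only for this illustration.
prediction = torch.tensor([0.5, 1.0, 2.0])
actual = torch.tensor([1.0, 1.0, 1.0])

# Mean absolute error computed by hand and through nn.L1Loss.
manual_mae = (prediction - actual).abs().mean()
builtin_mae = nn.L1Loss()(prediction, actual)
print(manual_mae, builtin_mae)

# One plain SGD step: parameter <- parameter - lr * gradient.
weight = nn.Parameter(torch.tensor([2.0]))
optimizer = torch.optim.SGD(params=[weight], lr=0.1)
loss = (weight * 3.0).sum()  # toy loss with d(loss)/d(weight) = 3
loss.backward()
expected_weight = weight.detach().clone() - 0.1 * weight.grad
optimizer.step()
print(weight.detach(), expected_weight)
```

Both pairs of printed values should coincide, which suggests that nothing beyond these formulas happens in this basic configuration.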

The training loop in which we leverage the loss function and the optimization step is structured as follows:
- Forward pass performs forward calculations of the model across the training data.
- Loss calculation evaluates the predictions against the true target.
- Backpropagation computes the gradient of the loss function with respect to every model parameter to be updated, using the chain rule. Note that the gradients must be reset to zero before each backward pass, because PyTorch accumulates them by default; otherwise training steps would interfere with each other beyond the parameter updates.
- Update the parameters.

In [None]:
torch.manual_seed(42)

epoch_count = 100
train_losses, test_losses = [], []

for epoch in range(epoch_count):
    sample_model.train()  # sets the model in training state
    prediction = sample_model(train_predictor)
    loss = loss_function(prediction, train_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    sample_model.eval()  # sets the model in evaluation state
    
    with torch.inference_mode():
        test_prediction = sample_model(test_predictor)
        test_loss = loss_function(test_prediction, test_target)
    
    train_losses.append(loss.item())  # store the loss as a plain Python float
    test_losses.append(test_loss.item())
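
The need for `optimizer.zero_grad()` in the loop can be illustrated in isolation. The sketch below works on a single hypothetical tensor rather than on the model and runs the same backward pass repeatedly, with and without resetting the gradient in between.

```python
import torch

weight = torch.tensor([1.0], requires_grad=True)

# First backward pass: d(2 * weight)/d(weight) = 2.
(2.0 * weight).sum().backward()
first_gradient = weight.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top.
(2.0 * weight).sum().backward()
accumulated_gradient = weight.grad.clone()

# Resetting the gradient first restores the single-pass value.
weight.grad.zero_()
(2.0 * weight).sum().backward()
reset_gradient = weight.grad.clone()

print(first_gradient, accumulated_gradient, reset_gradient)
```

Without the reset, the optimizer would step along the sum of the current and all earlier gradients instead of the current one alone.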

In [None]:
plt.figure(figsize=(12, 8))
epochs = range(1, epoch_count + 1)
plt.plot(epochs, train_losses, label="Training MAE loss")
plt.plot(epochs, test_losses, label="Testing MAE loss")
plt.legend(prop={"size": 14})  # font settings must be passed through `prop`
plt.show()

Notice that the learned parameters moved "fairly" close to the true values of 0.7 for the weight and 0.3 for the bias.

In [None]:
print(sample_model.state_dict())

Let us assume that we are satisfied with the model's performance. We can now make predictions, remembering to set the model to evaluation mode and to use inference mode for performance.

In [None]:
new_predictor = torch.arange(1.02, 1.20, 0.02)
sample_model.eval()
with torch.inference_mode():
    new_prediction = sample_model(new_predictor)
print(new_prediction)

Finally, we may preserve and reuse the model. The standard approach is to pickle the model's state dictionary and write it to a file, then load it when needed.

In [None]:
from pathlib import Path

model_path = Path("models")
model_path.mkdir(parents=True, exist_ok=True)
torch.save(
    obj=sample_model.state_dict(),
    f=model_path / "linear_regressor.pth"
)

We can verify that the state was stored correctly by reloading it into a new instance of the model and repeating the prediction.

In [None]:
trained_model = LinearRegressionModel()
trained_model.load_state_dict(torch.load(f=model_path / "linear_regressor.pth"))
trained_model.eval()
with torch.inference_mode():
    new_prediction = trained_model(new_predictor)
print(new_prediction)
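
Beyond comparing printed predictions by eye, the round trip can also be checked programmatically. The sketch below is self-contained and therefore makes a few assumptions: it uses `nn.Linear` as a stand-in for the trained model, a temporary directory instead of `models`, and a hypothetical file name.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

original = nn.Linear(in_features=1, out_features=1)  # stand-in for the trained model

with tempfile.TemporaryDirectory() as directory:
    checkpoint = Path(directory) / "checkpoint.pth"  # hypothetical file name
    torch.save(obj=original.state_dict(), f=checkpoint)
    restored = nn.Linear(in_features=1, out_features=1)  # fresh random parameters
    restored.load_state_dict(torch.load(f=checkpoint))

# Every tensor in the restored state should equal its saved counterpart.
states_match = all(
    torch.equal(value, restored.state_dict()[name])
    for name, value in original.state_dict().items()
)
print(states_match)
```

The same comparison should carry over to `sample_model` and `trained_model` from the cells above.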

### References

[1] Learn PyTorch for Deep Learning: Zero to Mastery book, accessed online on 2023.04.02 at https://www.learnpytorch.io/

[2] Mean absolute error, Wikipedia article, accessed online on 2023.04.02 at https://en.wikipedia.org/wiki/Mean_absolute_error

[3] Gradient descent, Wikipedia article, accessed online on 2023.04.02 at https://en.wikipedia.org/wiki/Gradient_descent