This notebook is adapted from the official PyTorch tutorial on [Optimization](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html).

# Optimizing Model Parameters

Now that we have prepared some data and built a model, it's time to train our model by optimizing its parameters on our data. 

Training a model is an iterative process; in each iteration the model makes a guess about the output, calculates
the error in its guess (*loss*), collects the derivatives of the error with respect to its parameters (as we saw in
the last tutorial), and **optimizes** these parameters using gradient descent. 
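
To make these four steps concrete, here is a minimal sketch of the iteration on a toy one-parameter problem. The data, the parameter name `w`, and the learning rate `lr` are illustrative assumptions and are not part of the model trained later in this notebook.

```python
import torch

# Toy problem (hypothetical data): fit y = w * x with a single parameter w.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # assumed learning rate, chosen only for this toy example

losses = []
for _ in range(20):
    pred = w * x                      # 1. the model makes a guess
    loss = ((pred - y) ** 2).mean()   # 2. measure the error of the guess
    loss.backward()                   # 3. collect d(loss)/dw in w.grad
    with torch.no_grad():
        w -= lr * w.grad              # 4. gradient descent update
    w.grad.zero_()                    # clear the gradient for the next iteration
    losses.append(loss.item())
```

With these values, `w` should move from 0 toward 2, the slope that generated the toy data, and the recorded losses should shrink along the way.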


## Prerequisite Code
We first load data and define our model as in previous tutorials. 

In [2]:
import torch

torch.set_printoptions(sci_mode=False, linewidth=300)

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
from sklearn.preprocessing import StandardScaler

# Ensure each feature is roughly on the same scale
std_scaler = StandardScaler().fit(X_train)
X_train = std_scaler.transform(X_train)
X_test = std_scaler.transform(X_test)

In [6]:
from torch.utils.data import DataLoader, TensorDataset

X_train = torch.as_tensor(X_train, dtype=torch.float)
y_train = torch.as_tensor(y_train, dtype=torch.float)
X_test = torch.as_tensor(X_test, dtype=torch.float)
y_test = torch.as_tensor(y_test, dtype=torch.float)

train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)

In [8]:
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()

        self.linear_relu_stack = nn.Sequential(
            nn.Linear(in_features=8, out_features=16),
            nn.ReLU(),
            nn.Linear(in_features=16, out_features=16),
            nn.ReLU(),
            nn.Linear(in_features=16, out_features=1),
        )

    def forward(self, x):
        return self.linear_relu_stack(x)

In [9]:
model = NeuralNetwork()
model

NeuralNetwork(
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=8, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=16, bias=True)
    (3): ReLU()
    (4): Linear(in_features=16, out_features=1, bias=True)
  )
)

## Optimization Loop

We can then train and optimize our model with an optimization loop. Each
iteration of the optimization loop, that is, one full pass over the training dataset, is called an **epoch**.

Each epoch consists of two main parts:
 - **The Train Loop** - iterate over the training dataset and try to converge to optimal parameters.
 - **The Validation/Test Loop** - iterate over the test dataset to check if model performance is improving.

Let's familiarize ourselves with some of the concepts used in the training loop.


### Loss Function

When presented with some training data, our untrained network is unlikely to give the correct
answer. A **loss function** measures the degree of dissimilarity between the obtained result and the target value,
and it is the loss function that we want to minimize during training. To calculate the loss, we make a
prediction using the inputs of a given data sample and compare it against the true label value.

Common loss functions include [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) (Mean Squared Error) for regression tasks and
[nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) for classification tasks. A more complete list of loss functions can be found [here](https://pytorch.org/docs/stable/nn.html#loss-functions).


In [10]:
loss_fn = nn.MSELoss()
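
As a quick sanity check of what `nn.MSELoss` computes, the sketch below compares it against the mean of squared differences written out by hand. The tensors `pred` and `target` are made-up values for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical predictions and targets, not taken from the housing data
pred = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])

loss_fn = nn.MSELoss()                    # default reduction is "mean"
loss = loss_fn(pred, target)
manual = ((pred - target) ** 2).mean()    # same quantity, written out by hand
```

Here the squared differences are 0.25, 0.25, and 0, so both `loss` and `manual` come out to 1/6.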

### Optimizer

Optimization is the process of adjusting model parameters to reduce model error in each training step. **Optimization algorithms** define how this process is performed. In this tutorial, we use the vanilla Stochastic Gradient Descent (SGD).

Optimization-related tools are provided under the [torch.optim](https://pytorch.org/docs/stable/optim.html) package, including a variety of [optimizers](https://pytorch.org/docs/stable/optim.html#algorithms). Here we use the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD) optimizer. 

Many factors can influence your choice of optimizer, such as the nature of the problem you are trying to solve or the model architecture you designed. The [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam) optimizer is usually the go-to choice for initial exploration, but it may not be the best choice for the problem at hand. In fact, some papers such as [this one](https://proceedings.neurips.cc/paper/2017/file/81b3833e2504647f9d794f7d7b9bf341-Paper.pdf) have suggested that models trained with Adam (and similar adaptive methods) do not generalize as well as those trained with SGD.



We initialize an optimizer by registering the model's parameters that need to be trained and passing any required (hyper)parameters such as the learning rate.

In [12]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)
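
To see what a single step of vanilla SGD does, the following sketch checks that `step()` matches the update `p - lr * grad` on a toy parameter. The tensor `p`, the quadratic loss, and the learning rate of 0.1 are assumptions made for this example; momentum and weight decay are left at their default of zero.

```python
import torch
import torch.optim as optim

# A hypothetical stand-alone parameter instead of model.parameters()
p = torch.tensor([1.0, -2.0], requires_grad=True)
opt = optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()                    # gradient of this loss is 2 * p
loss.backward()

expected = p.detach() - 0.1 * p.grad     # manual update, computed before the step
opt.step()                               # in-place update performed by the optimizer
```

After the step, `p` holds `[0.8, -1.6]`, which is exactly the manually computed `expected`.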

Inside the training loop, optimization happens in three steps:
 * Call ``optimizer.zero_grad()`` to reset the gradients of model parameters. Remember that PyTorch **accumulates gradients** by default, so we need to explicitly zero the gradients at each iteration.
 * Backpropagate the prediction loss with a call to ``loss.backward()``. PyTorch deposits the gradient of the loss w.r.t. each parameter in that parameter's ``.grad`` attribute.
 * Once we have our gradients, we call ``optimizer.step()`` to adjust the parameters using the gradients collected in the backward pass.
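
The accumulation behavior mentioned in the first step is easy to observe directly. This is a minimal sketch on a single scalar `w` (a hypothetical stand-in for a model parameter), calling `backward()` twice without zeroing and then once after zeroing.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one
(3 * w).backward()
accumulated = w.grad.item()

# Zeroing the stored gradient first gives the plain gradient again
w.grad.zero_()
(3 * w).backward()
after_reset = w.grad.item()
```

The values are 3, then 6, then 3 again. In a training loop, ``optimizer.zero_grad()`` plays the role of the manual ``w.grad.zero_()`` for every registered parameter.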




## Full Implementation
We define ``train_loop``, which loops over our optimization code, and ``test_loop``, which
evaluates the model's performance against our test data.



In [13]:
def train_loop(dataloader, model, loss_fn, optimizer, device):
    
    total = len(dataloader.dataset) # total number of examples
    num_data_seen = 0 # number of examples seen so far

    model.train() # Set your model to "train" mode

    for batch_idx, (X, y) in enumerate(dataloader):
        # Send inputs and labels to device
        X, y = X.to(device), y.to(device)

        # Compute prediction and loss
        pred = model(X).reshape(-1) # Reshape the pred of shape (N, 1) to (N,)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad() # You must zero grad before .backward!
        loss.backward()
        optimizer.step() # Take an optimization step

        num_data_seen += X.size(0)
        # Print out stats every 20 batches
        if batch_idx % 20 == 0:
            loss = loss.item() # .item() converts a one-element tensor to a plain Python float
            print(f"loss: {loss:>7f}  [{num_data_seen:>5d}/{total:>5d}]")
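
The `reshape(-1)` in `train_loop` is easy to overlook but matters. The sketch below uses made-up tensors to show what can happen without it: in the PyTorch versions we are aware of, `nn.MSELoss` emits a warning when the shapes differ and then broadcasts `(N, 1)` against `(N,)` to `(N, N)`, so the loss compares every prediction with every label. Treat this broadcasting behavior as an assumption about your installed version.

```python
import warnings

import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
pred = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like a model with out_features=1
y = torch.tensor([1.0, 2.0, 3.0])            # shape (3,), like the labels in this notebook

# Shapes match after the reshape: every prediction equals its label, so the loss is 0
correct = loss_fn(pred.reshape(-1), y)

# Without the reshape, (3, 1) vs (3,) is broadcast to (3, 3)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")          # silence the size-mismatch warning for this demo
    broadcast = loss_fn(pred, y)
```

Even though the predictions are perfect, `broadcast` is 12/9 rather than 0, because the mean is taken over all nine prediction-label pairs.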

In [15]:
def test_loop(dataloader, model, loss_fn, device): # no more optimizer
    total = len(dataloader.dataset)
    test_loss = 0

    model.eval() # Set your model to "eval" mode

    with torch.no_grad():
        for X, y in dataloader:
            # Send inputs and labels to device
            X, y = X.to(device), y.to(device)

            pred = model(X).reshape(-1) # Reshape the pred of shape (N, 1) to (N,)
            loss = loss_fn(pred, y)

            # no more optimization steps

            test_loss += loss.item() * X.size(0) # because "loss" was averaged over the batch

    test_loss /= total
    print(f"Avg test loss: {test_loss:>8f} \n")
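
`test_loop` multiplies each batch loss by `X.size(0)` because the last batch is usually smaller than the others, so a plain average of batch means would weight its examples too heavily. The following sketch illustrates this with made-up per-example errors split into batches of 4, 4, and 2.

```python
import torch

# Hypothetical per-example squared errors; the last two examples are much worse
errors = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0])
batches = errors.split(4)   # batch sizes 4, 4, 2

# Weight each batch mean by its size, as test_loop does, then divide by the total
weighted = sum(b.mean().item() * b.size(0) for b in batches) / errors.size(0)

# Plain average of the batch means, ignoring batch sizes
naive = sum(b.mean().item() for b in batches) / len(batches)

exact = errors.mean().item()   # true mean over all examples
```

The weighted version reproduces the exact mean of 1.8, while the naive average of batch means gives about 2.33.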

We initialize the loss function and optimizer, and pass them to ``train_loop`` and ``test_loop``.
Feel free to increase the number of epochs to track the model's improving performance.



In [19]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = NeuralNetwork().to(device)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)

num_epochs = 10
for t in range(num_epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer, device)
    test_loop(test_dataloader, model, loss_fn, device)
print("Done!")

Epoch 1
-------------------------------
loss: 7.232432  [   64/16512]
loss: 5.759336  [ 1344/16512]
loss: 4.239867  [ 2624/16512]
loss: 5.791527  [ 3904/16512]
loss: 3.029928  [ 5184/16512]
loss: 3.515863  [ 6464/16512]
loss: 3.220753  [ 7744/16512]
loss: 2.155186  [ 9024/16512]
loss: 3.036534  [10304/16512]
loss: 3.284680  [11584/16512]
loss: 2.408881  [12864/16512]
loss: 2.077053  [14144/16512]
loss: 1.759101  [15424/16512]
Avg test loss: 1.585000 

Epoch 2
-------------------------------
loss: 1.970925  [   64/16512]
loss: 1.734903  [ 1344/16512]
loss: 1.432903  [ 2624/16512]
loss: 0.856717  [ 3904/16512]
loss: 1.405313  [ 5184/16512]
loss: 1.282586  [ 6464/16512]
loss: 0.992226  [ 7744/16512]
loss: 0.755988  [ 9024/16512]
loss: 1.100949  [10304/16512]
loss: 1.080812  [11584/16512]
loss: 1.005567  [12864/16512]
loss: 0.979074  [14144/16512]
loss: 0.758651  [15424/16512]
Avg test loss: 1.004695 

Epoch 3
-------------------------------
loss: 0.904746  [   64/16512]
loss: 0.736265  [ 