# Introduction to PyTorch
PyTorch is an open source deep learning library developed primarily by Facebook's AI Research lab. The two main features of PyTorch are:

1. Tensor computing (like NumPy) with GPU support. This means we can treat a PyTorch tensor and a NumPy array as mostly identical.
1. Deep neural networks built on a tape-based automatic differentiation system. This means the library has its own way of computing the gradients of complex functions, which we don't have to worry about.

This primer will introduce you to some common API calls and functionalities of PyTorch. There are many excellent tutorials online, including one from the [official PyTorch page](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html), if you would like to know more about the subject.

## 1. PyTorch basics
### 1.1. Creating a tensor
The primary data structure used in PyTorch is a *tensor*, which stores data in a multi-dimensional matrix format, similar to a NumPy array (note that in mathematics, *tensor* and *matrix* are two distinct concepts, but here we can treat them as identical). Here are some of the data types supported by PyTorch tensors:

| Tensor Type  |  Data Type   |
|--------------|--------------|
| FloatTensor  | 32-bit float |
| DoubleTensor | 64-bit float |
| HalfTensor   | 16-bit float |
| IntTensor    | 32-bit int   |
| LongTensor   | 64-bit int   |
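
To make the data types concrete, the short sketch below prints the `dtype` that PyTorch assigns by default and the number of bytes each element occupies. The variable names are illustrative only.

```python
import torch

# Python ints become 64-bit integers and Python floats become 32-bit floats by default
ints = torch.tensor([1, 2, 3])
floats = torch.tensor([1.0, 2.0, 3.0])
print(ints.dtype, floats.dtype)

# request a specific type with the dtype argument, or convert an existing tensor
halves = torch.tensor([1.0, 2.0], dtype=torch.float16)   # 16-bit float
doubles = floats.to(torch.float64)                        # 64-bit float
int32s = ints.to(torch.int32)                             # 32-bit int
print(halves.dtype, doubles.dtype, int32s.dtype)

# element_size() reports the number of bytes used per element
print(halves.element_size(), int32s.element_size(), ints.element_size())
```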

In [1]:
import torch
import torchvision
import numpy as np

You can pass any array-like object such as a Python list or a NumPy array to `torch.tensor()` or `torch.as_tensor()` to create a tensor object. The difference between these is that `torch.tensor` always copies the data, while `torch.as_tensor` tries to avoid copying the data, especially when the input is already a NumPy array, i.e., changing the output of `torch.as_tensor` may change the original data source as well.

In [2]:
W = torch.tensor([[1, 2], [3, 4]])
a = np.array([[1, 2], [3, 4]])
V = torch.as_tensor(a)
print(type(W), W, W.dtype)
print(type(a), a, a.dtype)
print(type(V), V, V.dtype)

<class 'torch.Tensor'> tensor([[1, 2],
        [3, 4]]) torch.int64
<class 'numpy.ndarray'> [[1 2]
 [3 4]] int64
<class 'torch.Tensor'> tensor([[1, 2],
        [3, 4]]) torch.int64
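
The copy-versus-share behavior can be observed directly. The following is a minimal sketch (the names `copied` and `shared` are illustrative) that modifies the NumPy array after both tensors have been created:

```python
import numpy as np
import torch

a = np.array([[1, 2], [3, 4]])
copied = torch.tensor(a)      # always copies the data
shared = torch.as_tensor(a)   # reuses the NumPy buffer when it can

a[0, 0] = 100                 # modify the NumPy array in place
print(copied[0, 0].item())    # still 1: the copy is unaffected
print(shared[0, 0].item())    # now 100: the tensor sees the change
```

Sharing is only possible when no conversion is needed. If the input is a Python list, or a different `dtype` is requested, `torch.as_tensor` has to copy as well.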


One important parameter to keep in mind is the `device` parameter, which specifies the type of device (CPU or GPU) that should hold the tensor object (or other objects such as a machine learning model). This allows you to control which operations are performed on CPU and which are on GPU.

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# create a tensor on a specified device
v = torch.tensor([[1, 2], [3, 4]], device = device)

# move the tensor to CPU, also change datatype to double
v.to("cpu", torch.double)

cpu


tensor([[1., 2.],
        [3., 4.]], dtype=torch.float64)

Note that the output of `print(device)` will be `cpu` by default, but if you have a GPU set up, the output will be `cuda` instead. The easiest way to set up a GPU is to upload this notebook to Colab, then select Runtime --> Change runtime type on the top toolbar, and set "Hardware accelerator" to GPU (yes, Colab provides free GPU access!).
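
One detail about `.to` that is easy to miss: it returns a converted tensor rather than modifying the original in place, so the result has to be assigned to a name if we want to keep it. A small sketch, with illustrative variable names:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
v = torch.tensor([[1, 2], [3, 4]], device=device)

# .to() returns a converted tensor and leaves v unchanged
v_double = v.to("cpu", torch.double)
print(v.dtype, v.device)
print(v_double.dtype, v_double.device)

# tensors must be on the same device before they are combined
w = torch.ones(2, 2, dtype=torch.double, device=v_double.device)
print(v_double + w)
```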

Conversion between a NumPy array and a PyTorch tensor can be done conveniently with `.numpy()` and `torch.from_numpy()`:

In [4]:
torch_to_numpy = W.numpy()
numpy_to_torch = torch.from_numpy(a)
print(type(torch_to_numpy), torch_to_numpy)
print(type(numpy_to_torch), numpy_to_torch)

<class 'numpy.ndarray'> [[1 2]
 [3 4]]
<class 'torch.Tensor'> tensor([[1, 2],
        [3, 4]])


Note that `torch.from_numpy` functions similarly to `torch.as_tensor`, in that the input NumPy array and the output tensor share the same memory, and modifying one will change the other as well.
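
The same sharing applies in the other direction: for a CPU tensor, `.numpy()` returns an array backed by the tensor's memory. The sketch below (names are illustrative) shows this, along with calling `.clone()` first as one way to obtain an independent copy:

```python
import numpy as np
import torch

t = torch.tensor([1.0, 2.0, 3.0])
shared = t.numpy()                 # shares memory with t
independent = t.clone().numpy()    # clone() first to get an independent copy

t[0] = -1.0                        # in-place change to the tensor
print(shared)                      # first element is now -1
print(independent)                 # first element is still 1
```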

### 1.2. Operating on tensors
PyTorch tensors also support many matrix-based operations. For large arrays (e.g., $40000^2$ array elements), one comparison found that NumPy provides faster array initialization but slower element access and array operations than PyTorch (see [here](https://medium.com/python-pandemonium/how-pytorch-will-replace-numpy-2df48427f56d) for the comparison).

In [5]:
a = np.array([[1., 2.], [3., 4.]])
b = np.array([[5., 6.], [7., 8.]])

V = torch.from_numpy(a)
M = torch.from_numpy(b)

# addition
print(a+b, '\n')
print(V+M)

# subtraction
print(b-a, '\n')
print(M-V)

# multiplication
print(a*b, '\n')
print(V*M)

# division
print(a/b, '\n')
print(V/M)

# matrix addition
print(np.add(a,b), '\n')
print(torch.add(V,M))

# matrix subtraction
print(np.subtract(a,b), '\n')
print(torch.sub(V,M))

# matrix multiplication
print(np.dot(a,b), '\n')
print(torch.mm(V,M))

# element-wise division (function form)
print(np.divide(a,b), '\n')
print(torch.div(V,M))

# matrix transpose
print(a, '\n')
print(np.transpose(a), '\n')
print(torch.t(V))

[[ 6.  8.]
 [10. 12.]] 

tensor([[ 6.,  8.],
        [10., 12.]], dtype=torch.float64)
[[4. 4.]
 [4. 4.]] 

tensor([[4., 4.],
        [4., 4.]], dtype=torch.float64)
[[ 5. 12.]
 [21. 32.]] 

tensor([[ 5., 12.],
        [21., 32.]], dtype=torch.float64)
[[0.2        0.33333333]
 [0.42857143 0.5       ]] 

tensor([[0.2000, 0.3333],
        [0.4286, 0.5000]], dtype=torch.float64)
[[ 6.  8.]
 [10. 12.]] 

tensor([[ 6.,  8.],
        [10., 12.]], dtype=torch.float64)
[[-4. -4.]
 [-4. -4.]] 

tensor([[-4., -4.],
        [-4., -4.]], dtype=torch.float64)
[[19. 22.]
 [43. 50.]] 

tensor([[19., 22.],
        [43., 50.]], dtype=torch.float64)
[[0.2        0.33333333]
 [0.42857143 0.5       ]] 

tensor([[0.2000, 0.3333],
        [0.4286, 0.5000]], dtype=torch.float64)
[[1. 2.]
 [3. 4.]] 

[[1. 3.]
 [2. 4.]] 

tensor([[1., 3.],
        [2., 4.]], dtype=torch.float64)


For a comprehensive mapping between NumPy and PyTorch matrix operations, refer to [this tutorial](https://github.com/wkentaro/pytorch-for-numpy-users).
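
A common source of confusion in the operations above is that `*` multiplies element by element, while the matrix product needs `torch.mm`, `torch.matmul`, or the `@` operator. A minimal sketch contrasting the two, using the same values as the example above:

```python
import torch

V = torch.tensor([[1., 2.], [3., 4.]])
M = torch.tensor([[5., 6.], [7., 8.]])

elementwise = V * M           # same as torch.mul(V, M)
matrix_product = V @ M        # same as torch.mm(V, M) for 2D tensors

print(elementwise)
print(matrix_product)
```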

## 2. Machine learning workflow in PyTorch
Just like sklearn's flow of (1) constructor, (2) `.fit`, and (3) `.transform` or `.predict`, PyTorch has its own workflow for a machine learning task:

1. Define the model. The module [torch.nn](https://pytorch.org/docs/stable/nn.html) provides many classes for the different types of layer in a neural network; however, each layer can also act as an independent model (e.g., a `Linear` layer can replicate a linear regression model).

1. Define the loss function and optimizer. The available loss functions, such as `MSELoss` and `CrossEntropyLoss`, are listed in the `torch.nn` module. For the optimizer, while we have only seen gradient descent so far, there are many other strategies supported by the [torch.optim](https://pytorch.org/docs/stable/optim.html) module. A more recent improvement to stochastic gradient descent is the [Adam (Adaptive Moment Optimization)](https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/) algorithm, which we will use in Project 5.

1. Train the model for `n_epochs` epochs. In each iteration, the training code would look roughly as follows:

```python
def train(model, X, Y):
    # forward pass, compute loss value
    Y_hat = model(X)
    loss = criterion(Y_hat, Y)

    # reset gradient to prepare for backward pass
    optimizer.zero_grad()

    # backward pass, then update the parameters
    loss.backward()
    optimizer.step()
```

We should also note the difference between an epoch and an iteration. Because neural networks such as CNNs are typically trained with stochastic gradient descent (or its variations), you will frequently see the following code structure:

```python
for epoch in range(n_epochs):
    for X, Y in minibatches:
        train(model, X, Y)
```

In this case, each call to `train(model, X, Y)` is one iteration, as it goes through one minibatch, and an epoch is counted as one pass through all the minibatches.
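
To make the counting concrete, here is a small sketch that only tallies iterations. It assumes a toy dataset of 10 samples and a batch size of 4, and it builds the minibatches by slicing a random permutation by hand; all of these choices are illustrative.

```python
import torch

X = torch.arange(20, dtype=torch.float32).reshape(10, 2)   # 10 samples, 2 features
Y = torch.arange(10, dtype=torch.float32).reshape(-1, 1)
batch_size, n_epochs = 4, 2

n_iterations = 0
for epoch in range(n_epochs):
    # reshuffle the sample order at the start of every epoch
    perm = torch.randperm(X.shape[0])
    minibatches = [
        (X[perm[i:i + batch_size]], Y[perm[i:i + batch_size]])
        for i in range(0, X.shape[0], batch_size)
    ]
    for X_batch, Y_batch in minibatches:
        n_iterations += 1          # a call to train(model, X_batch, Y_batch) would go here

iterations_per_epoch = len(minibatches)
print(iterations_per_epoch, n_iterations)
```

With 10 samples and a batch size of 4, each epoch consists of 3 iterations (batches of 4, 4, and 2 samples), so 2 epochs amount to 6 iterations.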

Let's go through some examples of the above process.

### 2.1. Linear regression example

In [6]:
import torch.nn as nn
import torch.optim

# fix random seed for consistent results
torch.manual_seed(0)

# input data
X = torch.tensor([[0.9, 1.9], [1.95, 1.8], [1.85, 0.45], [1.3, 1.55], [1.9, 1.25]])
Y = torch.tensor([-9.5, -6.8,  2.3, -5.9, -2.6]).reshape(-1, 1)

# model specification
alpha, n_epochs = 1e-1, 100
model = nn.Linear(X.shape[1], 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = alpha)

# training epochs
for epoch in range(n_epochs):
    Y_hat = model(X)
    loss = criterion(Y_hat, Y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{n_epochs}], Loss: {round(loss.item(), 4)}')

print()
# print model parameters after training
for name, param in model.named_parameters():
    print(name, param)
    
# print predictions after training
model(X)

Epoch [10/100], Loss: 4.828
Epoch [20/100], Loss: 1.3768
Epoch [30/100], Loss: 0.5044
Epoch [40/100], Loss: 0.2775
Epoch [50/100], Loss: 0.2126
Epoch [60/100], Loss: 0.1889
Epoch [70/100], Loss: 0.176
Epoch [80/100], Loss: 0.1662
Epoch [90/100], Loss: 0.1578
Epoch [100/100], Loss: 0.1501

weight Parameter containing:
tensor([[ 2.7708, -6.3034]], requires_grad=True)
bias Parameter containing:
tensor([-0.1549], requires_grad=True)


tensor([[-9.6376],
        [-6.0979],
        [ 2.1346],
        [-6.3231],
        [-2.7696]], grad_fn=<AddmmBackward>)

Note that `MSELoss` expects the prediction and the target to have the same shape. Because `nn.Linear` returns a 2D tensor of shape `(n, 1)`, even though each output is only a scalar value, we call `.reshape` on `Y` to convert it to shape `(n, 1)` as well. This may not be the case for other loss functions, e.g., `CrossEntropyLoss` expects a 1D target vector of class indices, so make sure to check the documentation carefully.
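
The shapes involved are easier to see on a tiny example. The values below are made up for illustration; only the shapes matter.

```python
import torch
import torch.nn as nn

# MSELoss: prediction and target should have the same shape
Y_hat = torch.tensor([[1.0], [2.0], [3.0]])        # shape (3, 1), like the output of nn.Linear
Y = torch.tensor([1.5, 2.0, 2.0]).reshape(-1, 1)   # shape (3, 1) after the reshape
mse = nn.MSELoss()(Y_hat, Y)
print(Y_hat.shape, Y.shape, mse.item())

# CrossEntropyLoss: scores of shape (n, n_classes), targets of shape (n,) holding class indices
scores = torch.tensor([[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]])   # shape (2, 3)
labels = torch.tensor([0, 2])                                # shape (2,), dtype int64
ce = nn.CrossEntropyLoss()(scores, labels)
print(scores.shape, labels.shape, ce.item())
```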

### 2.3. Convolutional neural network

We first define a convolutional network model with the following structure:

1. Input layer
1. Convolutional layer with $K = 5, S = 1, P = 2$
1. ReLU layer
1. Maxpool layer with $F = 2, S = 2$
1. Convolutional layer with $K = 5, S = 1, P = 2$
1. ReLU layer
1. Maxpool layer with $F = 2, S = 2$
1. Fully connected layer
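
The spatial size after each of these layers follows the usual formula $\lfloor (W - K + 2P) / S \rfloor + 1$. The sketch below applies it layer by layer, assuming a $28 \times 28$ input (the size of the MNIST images used later in this section) and no padding in the pooling layers. The helper `conv_output_size` is a hypothetical name, and its parameter `k` plays the role of both $K$ and $F$.

```python
def conv_output_size(w, k, s, p):
    """Spatial size after a convolution or pooling layer: floor((w - k + 2p) / s) + 1."""
    return (w - k + 2 * p) // s + 1

size = 28                                        # assumed input width and height
size = conv_output_size(size, k=5, s=1, p=2)     # first convolution  -> 28
size = conv_output_size(size, k=2, s=2, p=0)     # first max pooling  -> 14
size = conv_output_size(size, k=5, s=1, p=2)     # second convolution -> 14
size = conv_output_size(size, k=2, s=2, p=0)     # second max pooling -> 7
print(size)
```

With $K = 5$, $S = 1$, $P = 2$ a convolution preserves the spatial size, so only the two pooling layers shrink the image, each by a factor of 2.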

In [7]:
class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        # for 28x28 inputs, the output of self.layers has shape (32, 7, 7) per image
        # so when flattened, it becomes a vector of length 32 * 7 * 7
        self.fc = nn.Linear(7*7*32, n_classes)

    def forward(self, x):
        out = self.layers(x)
        out = out.reshape(out.size(0), -1)
        return self.fc(out)

As we can see, to create a CNN model, we create a class `MyCNN` that inherits from `nn.Module`, where we only need to implement two functions:

1. Constructor `__init__`. This calls the parent's constructor and defines instance variables that represent the network layers. 

1. Feedforward procedure `forward`. This defines how an input tensor should be processed through each layer in the network. One trick to make this function concise is to put the layers in `__init__` into an `nn.Sequential` object, as we showed above, which allows us to get the output from all of the member layers with just one line of code in `forward`, namely `out = self.layers(x)`.

You may notice that we did not put the final `nn.Linear` layer into `self.layers`. The reason is that the output from the second `nn.MaxPool2d` needs to be flattened to a vector, before it is input to `nn.Linear`. We perform this flattening step inside `forward`, in the line `out = out.reshape(out.size(0), -1)`.

Alternatively, PyTorch also provides an `nn.Flatten` layer, so we could move the flattening step to `__init__`:

In [8]:
class MyCNN2(nn.Module):
    def __init__(self):
        super(MyCNN2, self).__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.Linear(7*7*32, n_classes)
        )

    def forward(self, x):
        return self.layers(x)

We see that all of the data processing steps can now be included in `self.layers`, so the `forward` function only needs to return the output of `self.layers`. In general, the takeaway message is: if an operation is supported by `torch.nn`, put it in an `nn.Sequential` object in `__init__`; otherwise, perform it in `forward`. This setting allows for great flexibility in how we want to define our network.
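
As a sanity check on the shapes, the sketch below passes a dummy batch through the convolutional part of the network and confirms that the manual `reshape` and `nn.Flatten` give the same result. The batch of zeros is only a placeholder for real images.

```python
import torch
import torch.nn as nn

layers = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.zeros(4, 1, 28, 28)      # dummy batch: 4 single-channel 28x28 images
out = layers(x)
print(out.shape)                   # (batch, channels, height, width)

flat_reshape = out.reshape(out.size(0), -1)   # flattening as done in MyCNN.forward
flat_layer = nn.Flatten()(out)                # flattening as done in MyCNN2
print(flat_reshape.shape, torch.equal(flat_reshape, flat_layer))
```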

Now let's run our network. For this example we will use the MNIST (Modified National Institute of Standards and Technology) dataset. Here each data point is a $28 \times 28$ matrix of integer pixel intensities that represents a grayscale image of a handwritten digit.

![mnist](https://www.researchgate.net/profile/Steven_Young11/publication/306056875/figure/fig1/AS:393921575309346@1470929630835/Example-images-from-the-MNIST-dataset.png)

Our task is to predict a digit based on its image; in other words, this is a 10-class classification problem. MNIST is one of the standard datasets included in the `torchvision` package, so we can load it directly. We also use the `DataLoader` utility from PyTorch to get the minibatches used for training and testing:

In [9]:
import torchvision.datasets as datasets
import torchvision.transforms as transforms
mnist_trainset = datasets.MNIST(root="./", train=True, download=True, transform=transforms.ToTensor())
mnist_testset = datasets.MNIST(root="./", train=False, download=True, transform=transforms.ToTensor())
trainloader = torch.utils.data.DataLoader(mnist_trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(mnist_testset, batch_size=1000,shuffle=False)
n_classes = 10
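
To see what a `DataLoader` yields without downloading anything, the sketch below wraps a synthetic stand-in for MNIST (130 blank images with random labels, both made up for illustration) and records the shape of every minibatch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic stand-in for MNIST: 130 blank 1x28x28 "images" with random labels
fake_images = torch.zeros(130, 1, 28, 28)
fake_labels = torch.randint(0, 10, (130,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=64, shuffle=True)

batch_shapes = [(images.shape, labels.shape) for images, labels in loader]
for image_shape, label_shape in batch_shapes:
    print(image_shape, label_shape)
```

Each batch of images has shape `(batch_size, 1, 28, 28)`, which is the layout `nn.Conv2d` expects, and the last batch is smaller when the dataset size is not a multiple of the batch size.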

Now we create an instance of `MyCNN` and train the model. 

In [13]:
model = MyCNN()
alpha, n_epochs = 1e-2, 3
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = alpha)

# Train the model
for epoch in range(n_epochs):
    for images, labels in trainloader:
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Epoch [{epoch+1}/{n_epochs}], Loss: {round(loss.item(), 4)}')

Epoch [1/3], Loss: 0.0192
Epoch [2/3], Loss: 0.027
Epoch [3/3], Loss: 0.0096


We now evaluate our trained model on the test data. We wrap the evaluation loop in the `torch.no_grad()` context manager to disable gradient computation. This is useful when we are only doing inference and do not update the model.

In [14]:
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in testloader:
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print(f'Test Accuracy of the model on the 10000 test images: {100 * correct / total}')

Test Accuracy of the model on the 10000 test images: 97.41


We see an accuracy of around 97% on the test data even after only 3 training epochs!
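
Two pieces of the evaluation loop are worth isolating: the effect of `torch.no_grad()` and the use of `torch.max` to pick the predicted class. The sketch below uses a small `nn.Linear` layer as a hypothetical stand-in for the trained classifier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)            # hypothetical stand-in for a trained classifier
inputs = torch.randn(5, 4)

tracked = model(inputs)            # part of the autograd graph
with torch.no_grad():
    untracked = model(inputs)      # same values, but no graph is recorded
print(tracked.requires_grad, untracked.requires_grad)

# torch.max over dim 1 returns (largest score, its index); the index is the predicted class
_, predicted = torch.max(untracked, 1)
print(torch.equal(predicted, untracked.argmax(dim=1)))
```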

### 2.4. Saving and loading a model
To save a model, we first retrieve its state and parameter values with `model.state_dict()`, then call `torch.save` and specify the file to save to. To load a saved model, we first create an instance of the same model class and then "fill in" the parameter values with `load_state_dict`.

In [12]:
torch.save(model.state_dict(), "saved_model_params.pth")
loaded_model = MyCNN()
loaded_model.load_state_dict(torch.load("saved_model_params.pth"))

<All keys matched successfully>

PyTorch also allows for model saving *during* the training phase, so that if any interruption happens while the model is being trained, we can load the latest checkpoint instead of having to re-train from the beginning. See the [official documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for more details.
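
A checkpoint is usually a dictionary that bundles the model state with whatever else is needed to resume, such as the optimizer state and the current epoch. The following is a minimal sketch under stated assumptions: the dictionary keys, the file name, and the small `nn.Linear` model are illustrative choices rather than a fixed format.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(2, 1)            # hypothetical small model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epoch = 4                          # pretend epochs 0..4 have finished

# bundle everything needed to resume training into one dictionary
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, checkpoint_path)

# later: rebuild the objects, then restore their state
resumed_model = nn.Linear(2, 1)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1)
checkpoint = torch.load(checkpoint_path)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1   # continue with the next epoch
print(start_epoch)
```

Saving the optimizer state matters for optimizers such as Adam that keep running statistics for each parameter; without it, training resumes with those statistics reset.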