# PyTorch Tutorial
*by Hong Xu*

This tutorial aims to teach the skills necessary to write and understand *PyTorch* code for deep learning purposes. We will touch on the basics of *PyTorch*, the essential tools, and some additional resources. You do not need a GPU to run this tutorial, although you will want one to train any non-trivial network, so the GPU commands are included as commented-out options to try. Most of the code here comes from [jcjohnson's pytorch examples](https://github.com/jcjohnson/pytorch-examples).

First you will need to install *PyTorch* and *torchvision*. If you don't have these, you can either
* Install Anaconda, and run `conda install pytorch torchvision cudatoolkit=10.1 -c pytorch` or
* Run
    * `pip3 install torch torchvision` (linux/mac)
    * `pip3 install torch===1.4.0 torchvision===0.5.0 -f https://download.pytorch.org/whl/torch_stable.html` (windows)

Now we check whether they were installed correctly:

In [1]:
import torch
import torchvision
import numpy as np

## PyTorch Basics

In this section, we will introduce four levels of abstraction: tensors, variables, layers, and modules.

### Tensors

*PyTorch* is a library primarily focused on deep learning, where the most convenient way to represent data and parameters is through tensors. **Tensors are multi-dimensional arrays**.

<img src="./figures/tensor.png" width="350"  />
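As a small illustration of what "multi-dimensional array" means in practice, the sketch below builds a few tensors and inspects them. The values and variable names are made up for this example.

```python
import numpy as np
import torch

# A 2-D tensor (a matrix) with 2 rows and 3 columns
a = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print(a.shape)   # torch.Size([2, 3])
print(a.dtype)   # torch.float32

# Element-wise operations work much like they do in numpy
b = a * 2 + 1          # multiply every entry by 2, then add 1
c = a.clamp(min=2.5)   # entries below 2.5 are raised to 2.5

# Conversion to and from numpy (for CPU tensors the memory is shared)
n = a.numpy()
a_again = torch.from_numpy(n)

# A 3-D tensor, for example a batch of 4 matrices of size 2x3
batch = torch.zeros(4, 2, 3)
print(batch.dim())     # 3
```

The same `clamp(min=...)` call is what the training code in this section uses to implement the ReLU activation.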

In *PyTorch*, tensors serve much the same purpose as *numpy* arrays, but they can easily run on a GPU if specified. They have properties like `.shape` and support element-wise operations. The following code (from [jcjohnson](https://github.com/jcjohnson/pytorch-examples) pytorch examples) trains a 2-layer network on some random input using only *PyTorch* tensors.

Recall how a neural network is trained from the following image, and try to match each step of the process to the code below:

![structure](./figures/optimizer_explain.png)

In [2]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU <----------------------------------------

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
  # of shape (); we can get its value as a Python number with loss.item().
  loss = (y_pred - y).pow(2).sum()
  if(t%50 == 0):
      print(t, loss.item())

  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

0 32993284.0
50 10132.3388671875
100 209.24058532714844
150 7.940583229064941
200 0.3835572302341461
250 0.0209675095975399
300 0.001452811062335968
350 0.00022822186292614788
400 7.119649671949446e-05
450 3.338568058097735e-05


Running on the GPU is as simple as uncommenting the `device = torch.device('cuda')` line above. Do not overlook the power of GPU operations on tensors. For instance, the following 3000x3000 matrix multiplication takes 350 ms on the CPU, and only 0.1 ms on the GPU.

```python
# Task : compute matrix multiplication C = AB
d = 3000

# using numpy: takes 350 ms
A = np.random.rand(d,d).astype(np.float32)
B = np.random.rand(d,d).astype(np.float32)
C = A.dot(B)

# using torch with GPU
A = torch.rand(d, d).cuda()
B = torch.rand(d, d).cuda()
C = torch.mm(A, B)
```
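If you want to measure such timings yourself, keep in mind that CUDA operations are launched asynchronously, so the clock should only be stopped after `torch.cuda.synchronize()`. The following is a minimal sketch, not the benchmark used for the numbers above: `time_matmul` is a hypothetical helper, and the matrix size is reduced to 500 so that it runs quickly (set `d=3000` to mirror the example).

```python
import time
import torch

def time_matmul(device, d=500, repeats=3):
    """Return the average time in seconds of a d x d matrix product on `device`."""
    A = torch.rand(d, d, device=device)
    B = torch.rand(d, d, device=device)
    torch.mm(A, B)  # warm-up run, not timed
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for the warm-up kernel to finish
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mm(A, B)
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

cpu_time = time_matmul(torch.device('cpu'))
print('CPU: {:.3f} ms'.format(cpu_time * 1e3))

# The GPU measurement only runs when a CUDA device is present
if torch.cuda.is_available():
    gpu_time = time_matmul(torch.device('cuda'))
    print('GPU: {:.3f} ms'.format(gpu_time * 1e3))
```

The actual numbers depend heavily on your hardware, so expect them to differ from the figures quoted above.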

Technically, you could write any network code in *numpy*, but it would be painfully slow and excruciating.

### Variable

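**A variable represents a node in a computational graph; it stores data and gradients**. This structure reflects the fact that gradient descent, the most popular optimization framework, needs both the value and the gradient of every parameter. Note that since PyTorch 0.4, `Variable` has been merged into `Tensor`: a tensor created with `requires_grad=True` behaves the same way, and the `Variable` wrapper is kept only for backward compatibility.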

* `x.data` is a Tensor
* `x.grad` is a Variable of gradients (same shape as x.data)
* `x.grad.data` is the Tensor of gradients

![tensor](./figures/variable.png)

**Autograd**: We can use the `torch.autograd` package to automatically calculate the gradients. This will be particularly useful when building intricate networks. In the code below, you will see how the variables imported from autograd work with `loss.backward()` to automatically compute the gradients of any variable on a computational graph.
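Before the full training loop, it may help to see autograd on a function whose gradient is easy to check by hand. The sketch below uses made-up values and differentiates y = x1² + x2² + x3², whose gradient is 2x. It also shows that gradients accumulate across calls to `backward()`, which is why training loops zero them on every iteration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()   # y = x1^2 + x2^2 + x3^2
y.backward()         # fills x.grad with dy/dx = 2x
first_grad = x.grad.clone()
print(first_grad)    # tensor([2., 4., 6.])

# A second backward pass ADDS to the stored gradient instead of replacing it
y2 = (x ** 2).sum()
y2.backward()
accumulated_grad = x.grad.clone()
print(accumulated_grad)  # tensor([ 4.,  8., 12.])

# Reset the gradient in place before the next iteration
x.grad.zero_()
```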


In [3]:
import torch
from torch.autograd import Variable

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = Variable(torch.randn(N, D_in, device=device), requires_grad=False)
y = Variable(torch.randn(N, D_out, device=device), requires_grad=False)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = Variable(torch.randn(D_in, H, device=device), requires_grad=True)
w2 = Variable(torch.randn(H, D_out, device=device), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors. Since w1 and
  # w2 have requires_grad=True, operations involving these Tensors will cause
  # PyTorch to build a computational graph, allowing automatic computation of
  # gradients. Since we are no longer implementing the backward pass by hand we
  # don't need to keep references to intermediate values.
  y_pred = x.mm(w1).clamp(min=0).mm(w2)
  
  loss = (y_pred - y).pow(2).sum()
  if(t%50 == 0):
      print(t, loss.item())

  # Use autograd to compute the backward pass. This call will compute the
  # gradient of loss with respect to all Tensors with requires_grad=True.
  # After this call w1.grad and w2.grad will be Tensors holding the gradient
  # of the loss with respect to w1 and w2 respectively.
  loss.backward()

  # Update weights using gradient descent. For this step we just want to mutate
  # the values of w1 and w2 in-place; we don't want to build up a computational
  # graph for the update steps, so we use the torch.no_grad() context manager
  # to prevent PyTorch from building a computational graph for the updates
  with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

0 23048620.0
50 12457.9248046875
100 386.017822265625
150 19.587377548217773
200 1.2226712703704834
250 0.08571489155292511
300 0.00668316800147295
350 0.0007893430883996189
400 0.0001850952103268355
450 6.941909668967128e-05


### Layers

**Layers are abstractions of frequently used computational graphs**, particularly in deep learning applications. Using layers from `torch.nn` allows us to quickly and efficiently build deep learning architectures that can automatically compute gradients and optimize. Layers are often the smallest units we have to deal with extensively for any network architecture. `torch.nn` also includes a plethora of loss functions, which can be seen as layers with a scalar output.

Some examples of `torch.nn` layers:
* ConvXd (X = 1, 2, 3) 
* ConvTransposeXd (X = 1, 2, 3)
* MaxPoolXd (X = 1, 2, 3)
* Dropout
* Linear
* Normalization
* MSELoss
* L1Loss
* BCELoss

We can now write the previous code in a condensed way using only `torch.nn` layers:

In [4]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='mean'.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
  # Forward pass: compute predicted y by passing x to the model. Module objects
  # override the __call__ operator so you can call them like functions. When
  # doing so you pass a Tensor of input data to the Module and it produces
  # a Tensor of output data.
  y_pred = model(x)

  loss = loss_fn(y_pred, y)
  if(t%50 == 0):
      print(t, loss.item())
  
  # Zero the gradients before running the backward pass.
  model.zero_grad()

  # Backward pass: compute gradient of the loss with respect to all the learnable
  # parameters of the model. Internally, the parameters of each Module are stored
  # in Tensors with requires_grad=True, so this call will compute gradients for
  # all learnable parameters in the model.
  loss.backward()

  # Update the weights using gradient descent. Each parameter is a Tensor, so
  # we can access its data and gradients like we did before.
  with torch.no_grad():
    for param in model.parameters():
      param.data -= learning_rate * param.grad

0 678.9769897460938
50 27.842239379882812
100 1.5715466737747192
150 0.17806026339530945
200 0.03175480291247368
250 0.0070302411913871765
300 0.0017063487321138382
350 0.00042926709284074605
400 0.00010955774632748216
450 2.814173603837844e-05


### Network Modules

**Network modules are user-level abstractions of network substructures**. A module can be an entire network, or a network can be composed of many modules. For instance, you can have a module for a ResnetBlock, a commonly used structure in many image processing networks. Modules allow you to build potentially massive networks in a manageable way.

Modules are wrapped in a class in PyTorch. A module class has two important methods: the `__init__` method where all the layers or other modules are initialized and the `forward` method in charge of defining the computational graph.

Simple example (from Chongruo Wu’s tutorial):

![net_example](./figures/net_example.png)

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # Convolutional layer with a 5x5 kernel, 1 input channel and 10 output channels
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.mp = nn.MaxPool2d(2)
        self.fc = nn.Linear(320, 10)  # 320 = 20 channels * 4 * 4 for 28x28 inputs

    def forward(self, x):
        in_size = x.size(0)
        x = F.relu(self.mp(self.conv1(x)))
        x = F.relu(self.mp(self.conv2(x)))
        x = x.view(in_size, -1)  # Flattens the tensor
        x = self.fc(x)
        return F.log_softmax(x, dim=1)  # log-probabilities over the 10 classes
```
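Modules can also contain other modules. As a rough sketch of the ResnetBlock idea mentioned above, the code below defines a simplified residual block and reuses it twice inside a small network. Both `ResidualBlock` and `SmallResNet` are hypothetical names and architectures chosen for illustration, not a reference ResNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified residual block: two 3x3 convolutions plus a skip connection."""

    def __init__(self, channels):
        super().__init__()
        # padding=1 keeps the spatial size, so the input can be added to the output
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)  # skip connection

class SmallResNet(nn.Module):
    """A network assembled from smaller modules."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.block1 = ResidualBlock(8)
        self.block2 = ResidualBlock(8)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        x = F.relu(self.stem(x))
        x = self.block2(self.block1(x))
        x = x.mean(dim=(2, 3))  # global average pooling over height and width
        return self.fc(x)

net = SmallResNet()
scores = net(torch.randn(4, 1, 28, 28))
print(scores.shape)  # torch.Size([4, 10])
```

Because the blocks are assigned as attributes, `net.parameters()` automatically includes the parameters of every nested module.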

Now we can tidy up our code from before by writing a TwoLayerNet module class.

In [5]:
import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data and we must return
    a Tensor of output data. We can use Modules defined in the constructor as
    well as arbitrary (differentiable) operations on Tensors.
    """
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function. There is no optimizer yet, so we also set the
# learning rate used by the manual gradient descent update below.
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  if(t%50 == 0):
      print(t, loss.item())

  # Zero the gradients before running the backward pass.
  model.zero_grad()

  # Backward pass: compute gradient of the loss with respect to all the learnable
  # parameters of the model. Internally, the parameters of each Module are stored
  # in Tensors with requires_grad=True, so this call will compute gradients for
  # all learnable parameters in the model.
  loss.backward()

  # Update the weights using gradient descent. Each parameter is a Tensor, so
  # we can access its data and gradients like we did before.
  with torch.no_grad():
    for param in model.parameters():
      param.data -= learning_rate * param.grad

0 671.4662475585938
50 31.281875610351562
100 1.793733835220337
150 0.15654367208480835
200 0.017252231016755104
250 0.0022066549863666296
300 0.0003083283663727343
350 4.570548844640143e-05
400 7.063119028316578e-06
450 1.1287644383628503e-06


## PyTorch: Optimizer

In this section, we will discuss the optimizer, which allows us to switch between different learning algorithms with minimal modifications. In particular, we have been using gradient descent exclusively up to this point, but we might want to use more sophisticated optimizers like AdaGrad, RMSProp, or Adam, among others.

The `torch.optim` package allows us to do just that. Simply create an optimizer

```python
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```

Notice that this is an Adam optimizer with learning rate `learning_rate`. And the gradient descent step can simply be replaced with

```python
optimizer.step()
```
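To see that `optimizer.step()` really performs the update we wrote by hand earlier, the sketch below compares one step of `torch.optim.SGD` with the manual rule `w - lr * grad`. The model, data, and learning rate are made up for this check.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(5, 3)
y = torch.randn(5, 1)

loss = (model(x) - y).pow(2).sum()
optimizer.zero_grad()
loss.backward()

# What plain gradient descent would produce for the weight matrix
expected_weight = model.weight.detach() - 0.1 * model.weight.grad

optimizer.step()  # in-place update of all parameters passed to the optimizer
print(torch.allclose(model.weight.detach(), expected_weight))  # True
```

Other optimizers such as Adam keep extra internal state, so their update differs from this rule, but the calling pattern of `zero_grad()`, `backward()`, and `step()` stays the same.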

The code below runs our usual network with an SGD optimizer; switching to Adam only requires changing the line that constructs the optimizer.

In [6]:
import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  if(t%50 == 0):
      print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

0 679.9312133789062
50 34.326072692871094
100 2.420177459716797
150 0.2701015770435333
200 0.037167467176914215
250 0.005999159999191761
300 0.0011976438108831644
350 0.00027674061129800975
400 6.978487363085151e-05
450 1.852341665653512e-05


## PyTorch: Scheduler

The scheduler allows us to manipulate the learning rate parameter of the optimizer between epochs. This is important because a decreasing learning rate, for instance, can perform better than a constant one.

The `torch.optim.lr_scheduler` module provides the following scheduler options:
* `StepLR`: LR is multiplied by gamma every step_size epochs.
* `MultiStepLR`: LR is multiplied by gamma each time the epoch count reaches one of the milestones.
* `ExponentialLR`
* `CosineAnnealingLR`
* `ReduceLROnPlateau`

We initialize a scheduler as such

```python
scheduler = StepLR(optimizer, step_size=step_size, gamma=gamma)
```

This scheduler multiplies the learning rate by a factor of `gamma` every `step_size` epochs. Then after each epoch, we step the scheduler once using

```python
scheduler.step()
```

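Notice that `scheduler.step()` only updates the learning rate; it does not replace `optimizer.step()`, which must still be called for every batch. Since PyTorch 1.1.0, `optimizer.step()` should be called before `scheduler.step()`.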
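To see the effect of `StepLR` directly, the following sketch records the learning rate over six epochs with `step_size=2` and `gamma=0.5`. The dummy parameter and the specific values are made up for illustration; each epoch calls `optimizer.step()` and then `scheduler.step()`.

```python
import torch
from torch.optim.lr_scheduler import StepLR

# A dummy parameter so that the optimizer has something to manage
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1.0)
scheduler = StepLR(optimizer, step_size=2, gamma=0.5)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]['lr'])  # learning rate used in this epoch
    optimizer.step()   # no gradients here, so the parameter is left unchanged
    scheduler.step()   # advance the schedule by one epoch

print(lrs)  # [1.0, 1.0, 0.5, 0.5, 0.25, 0.25]
```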

Now we can add a scheduler to our code above. Notice that in this toy example, it does not make much of a difference.

In [8]:
import torch
from torch.optim.lr_scheduler import StepLR

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# This scheduler multiplies the learning rate by a factor of 0.999999 every epoch.
scheduler = StepLR(optimizer, step_size=1, gamma=0.999999)
for t in range(500):
  y_pred = model(x)

  loss = loss_fn(y_pred, y)
  if(t%50 == 0):
      print(t, loss.item())

  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  scheduler.step()

0 660.5859985351562
50 37.57862854003906
100 3.069124698638916
150 0.3878892660140991
200 0.06261113286018372
250 0.01202652882784605
300 0.002636996563524008
350 0.000635945878457278
400 0.00016495128511451185
450 4.518511559581384e-05


## Data Loader

A PyTorch data loader is a tool that allows us to feed data into our network. It is useful because it allows for

* Map-style and iterable-style datasets
* Customizing data loading order
* **Automatic batching**
* Single- and multi-process data loading
* Automatic memory pinning

Not only can you load popular datasets quickly and cleanly through a data loader from libraries and easily use data loaders written for other datasets, but you can also build your own custom data loaders. 

A custom dataset is built by subclassing `torch.utils.data.Dataset` and implementing three methods:

* `__init__(self)`: Downloads, reads, loads, or creates the data.
* `__getitem__(self, index)`: Returns the item at the given index.
* `__len__(self)`: Returns the number of items in the dataset.

A sample dataset class looks like this; it holds random `x` and `y` data:

```python
import torch
from torch.utils.data import Dataset

class toy_loader(Dataset):

    def __init__(self):
        N, D_in, H, D_out = 64, 1000, 100, 10
        self.x = torch.randn(N, D_in)
        self.y = torch.randn(N, D_out)
        self.len = self.x.shape[0]  # number of samples
    
    def __getitem__(self, index):
        return self.x[index], self.y[index]
    
    def __len__(self):
        return self.len
```

Now we can create a `DataLoader` object as follows:

```python
from torch.utils.data import DataLoader

dataset = toy_loader()
train_loader = DataLoader(dataset=dataset,
                         batch_size=32,
                         shuffle=True)
```
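Iterating over a `DataLoader` yields one batch at a time. The sketch below uses the built-in `TensorDataset` in place of a custom class (an assumption made to keep the example short) with the same sizes as above, and prints the shape of each batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.randn(64, 1000)
y = torch.randn(64, 10)

# TensorDataset is a ready-made Dataset that indexes tensors along dimension 0
dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(len(dataset), len(loader))  # 64 2 (64 samples, 2 batches per epoch)

batch_shapes = []
for x_batch, y_batch in loader:
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))
print(batch_shapes)  # [((32, 1000), (32, 10)), ((32, 1000), (32, 10))]
```

With `shuffle=True`, the samples are assigned to batches in a new random order on every pass over the loader.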

Now we finally add batch training to our code through this toy data loader.

In [9]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable
from torch.optim.lr_scheduler import StepLR

#This toy loader initializes our random dataset as before.
class toy_loader(Dataset):
    
    def __init__(self):
        N, D_in, H, D_out = 64, 1000, 100, 10
        self.x = torch.randn(N, D_in)
        self.y = torch.randn(N, D_out)
        self.len = N
    
    def __getitem__(self, index):
        return self.x[index], self.y[index]
    
    def __len__(self):
        return self.len

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

# The dataset is created and passed to the DataLoader,
# which provides batches of size 32
dataset = toy_loader()
train_loader = DataLoader(dataset=dataset,
                         batch_size=32,
                         shuffle=True)

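# Network dimensions; these must match the data created in toy_loader
N, D_in, H, D_out = 64, 1000, 100, 10
model = TwoLayerNet(D_in, H, D_out)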

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
scheduler = StepLR(optimizer, step_size=1, gamma=0.999999)
for t in range(500):
    
    # We add an extra loop that handles each batch
    for i, data in enumerate(train_loader, 0):
        
        # Get the inputs
        x, y = data
        
        # Wrap them in Variable
        x, y = Variable(x), Variable(y)
        
        y_pred = model(x)

        loss = loss_fn(y_pred, y)
        if(t%50 == 0):
          print(t, loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    scheduler.step()

0 363.23663330078125
0 346.4319763183594
50 23.969371795654297
50 17.1717472076416
100 2.5396196842193604
100 0.9869752526283264
150 0.3369961380958557
150 0.1542620062828064
200 0.050437189638614655
200 0.028866257518529892
250 0.009905891492962837
250 0.004183848388493061
300 0.0019150826847180724
300 0.0007571085006929934
350 0.0004007300012744963
350 0.0001295009715249762
400 5.0271471991436556e-05
400 5.818992940476164e-05
450 1.1823357453977223e-05
450 1.1101073141617235e-05


## PyTorch: MNIST step-by-step example

This section walks through a step-by-step example of running a basic classifier on the well-known MNIST dataset. This will also let you see how a clean PyTorch project can be structured.

MNIST is a dataset of handwritten digits 0-9, each labeled with the digit it represents. It is a very well-known image dataset, widely used in academia as a simple proof-of-concept task. The goal here is to classify the handwritten digits correctly.

The network that we are going to create is similar to the one shown in [this tutorial](https://towardsdatascience.com/mnist-handwritten-digits-classification-using-a-convolutional-neural-network-cnn-af5fafbc35e9). The main difference is that our network adds dropout: once after the max-pooling layer and once after the first fully connected layer, as the code below shows. The code in this section comes from [here](https://github.com/pytorch/examples/blob/master/mnist/main.py).

![MNIST](figures/MNIST.png)
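The network in this section declares its first fully connected layer as `nn.Linear(9216, 128)`. The sketch below traces a dummy 28x28 image through layers with the same settings as the convolutional part of that network to show where 9216 comes from. The layers here are freshly created and untrained; only the shapes matter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.zeros(1, 1, 28, 28)        # one MNIST-sized grayscale image

x = nn.Conv2d(1, 32, 3, 1)(x)        # 3x3 kernel, no padding: 28 -> 26
after_conv1 = tuple(x.shape)         # (1, 32, 26, 26)

x = nn.Conv2d(32, 64, 3, 1)(x)       # 26 -> 24
after_conv2 = tuple(x.shape)         # (1, 64, 24, 24)

x = F.max_pool2d(x, 2)               # 24 -> 12
after_pool = tuple(x.shape)          # (1, 64, 12, 12)

flat = torch.flatten(x, 1)           # 64 * 12 * 12 = 9216 features per image
print(after_conv1, after_conv2, after_pool, tuple(flat.shape))
```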

The code starts with the imports and defines the network structure right after:

In [None]:
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

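# Stand-in for the parsed command-line arguments so that the script runs inside
# a notebook; you can ignore this class.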
class ignore_arguments:
    batch_size = 64
    test_batch_size = 1000 
    epochs = 14
    lr = 1.0
    gamma = 0.7
    no_cuda = False
    seed = 1
    log_interval = 10
    save_model = False

def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')

    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    
    args = ignore_arguments()
    # args = parser.parse_args() #Uncomment this for argument parsing
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    
    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data\MNIST\raw\train-images-idx3-ubyte.gz


100.1%

Extracting ../data\MNIST\raw\train-images-idx3-ubyte.gz to ../data\MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data\MNIST\raw\train-labels-idx1-ubyte.gz


113.5%

Extracting ../data\MNIST\raw\train-labels-idx1-ubyte.gz to ../data\MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data\MNIST\raw\t10k-images-idx3-ubyte.gz


100.4%

Extracting ../data\MNIST\raw\t10k-images-idx3-ubyte.gz to ../data\MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data\MNIST\raw\t10k-labels-idx1-ubyte.gz


180.4%

Extracting ../data\MNIST\raw\t10k-labels-idx1-ubyte.gz to ../data\MNIST\raw
Processing...
Done!

Test set: Average loss: 0.0525, Accuracy: 9819/10000 (98%)




Test set: Average loss: 0.0399, Accuracy: 9864/10000 (99%)


Test set: Average loss: 0.0335, Accuracy: 9885/10000 (99%)




Test set: Average loss: 0.0330, Accuracy: 9890/10000 (99%)


Test set: Average loss: 0.0309, Accuracy: 9892/10000 (99%)
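A closing note on the loss used in this example: the network ends with `F.log_softmax`, and `train` pairs it with `F.nll_loss`. The sketch below, using made-up scores and labels, checks that this combination gives the same value as `F.cross_entropy` applied to the raw scores, and shows how `argmax` turns the log-probabilities into predicted classes as in `test`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # made-up raw scores: 4 samples, 10 classes
target = torch.tensor([3, 0, 9, 1])    # made-up class labels

# Two-step version, as in the MNIST code
log_probs = F.log_softmax(logits, dim=1)
loss_a = F.nll_loss(log_probs, target)

# One-step version on the raw scores
loss_b = F.cross_entropy(logits, target)
print(torch.allclose(loss_a, loss_b))  # True

# The predicted class is the index of the largest log-probability
pred = log_probs.argmax(dim=1)
accuracy = pred.eq(target).sum().item() / len(target)
print(pred, accuracy)
```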

