# FloydHub Introduction to Deep Learning: PyTorch

<img style="align: center;height: 300px; width: 300px;" src="https://github.com/sominwadhwa/sominwadhwa.github.io/blob/master/assets/intro_to_pytorch_series/PyTorch.jpg?raw=true">

## Introduction

[PyTorch](http://pytorch.org/) is one of the many [Deep Learning frameworks](https://www.kdnuggets.com/2017/02/python-deep-learning-frameworks-overview.html) that let us build powerful Deep Learning models by harnessing GPU compute. PyTorch is used extensively for rapid prototyping in research and small-scale projects. The objective of this article is to give you hands-on experience with PyTorch & some of the basic mathematical lingo associated with Deep Learning. We also introduce the classic problem of [Handwritten Digit Recognition](http://yann.lecun.com/exdb/mnist/).

**Table of Contents**:

- [PyTorch Introduction](#pytorch-introduction)
- [Tensors](#tensors)
- [Variables & Autograd](#variables-and-autograd)
- [Optimization](#optimization)
- [Summary](#summary)

### PyTorch Introduction

PyTorch is a Python-based scientific computing package targeted at two audiences:

- Those who want a Deep Learning research platform that provides maximum flexibility and speed through Dynamic Compute graphs & imperative control flow.
- Those who want a replacement for NumPy that can harness GPU compute.

Here's a list of modules we'll need in order to run this tutorial:

1. [torch.autograd](http://pytorch.org/docs/master/autograd.html) provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions.
2. [torch.nn](http://pytorch.org/docs/master/nn.html) provides an easy and modular way to build and train simple or complex neural networks.
3. [NumPy](http://www.numpy.org/) is the fundamental package for scientific computing with Python.

#### Running a Code Cell

To run a code cell with a keyboard shortcut, press **`shift + enter`**.

In [129]:
# Import the packages we need to run the tutorial
import torch
import numpy as np
from torch.autograd import Variable

# Is CUDA available on this instance?
cuda = torch.cuda.is_available()

# Seed for reproducibility
# This makes sure you get the same results as shown in this notebook
torch.manual_seed(1)
if cuda:
    torch.cuda.manual_seed(1)

### Tensors

In any deep learning pipeline, the one thing we inevitably encounter is numerical data: an image stored as a `[height x width]` matrix, a piece of text stored as a vector, or some operation taking place between the two. PyTorch provides objects known as Tensors that store all this data under one roof.

*Formally*, a [PyTorch Tensor](http://pytorch.org/docs/master/tensors.html) is conceptually identical to a NumPy `ndarray`, and PyTorch provides many functions for operating on these Tensors. Like standard `ndarrays`, PyTorch Tensors do not know anything about deep learning, computational graphs or gradients; they are a generic tool for scientific computing. We can use n-dimensional Tensors as our data requires: for instance, a 2D tensor can store a greyscale image, and a 1D tensor (a vector) can store a piece of text.

The following snippets demonstrate Tensors & a few of their operations:

In [4]:
# Load Tensor on GPU if CUDA available
dtype = torch.cuda.FloatTensor if cuda else torch.FloatTensor

# Construct a 5x3 matrix, uninitialized:
print("torch.Tensor(5, 3):")
x = torch.Tensor(5, 3).type(dtype)
print(x)

torch.Tensor(5, 3):

1.00000e-16 *
  8.8183  0.0000  8.8183
  0.0000  0.0000  0.0000
  0.0000  0.0000  0.0000
  0.0000  0.0000  0.0000
  0.0000  0.0000  0.0000
[torch.FloatTensor of size 5x3]



In [5]:
# Construct a randomly initialized matrix
print("torch.rand(5, 3):")
x = torch.rand(5, 3).type(dtype)
print(x)

torch.rand(5, 3):

 0.1863  0.3879  0.3456
 0.6697  0.3968  0.9355
 0.5388  0.8463  0.4192
 0.3133  0.6852  0.5245
 0.2045  0.4435  0.8781
[torch.FloatTensor of size 5x3]



In [6]:
# Get its size
print("Last Tensor Size:")
print(x.size())

Last Tensor Size:
torch.Size([5, 3])


In [7]:
# There are multiple syntaxes for operations. Let’s see addition as an example
# Addition: syntax 1
y = torch.rand(5, 3)
print("Syntax 1: x + y =")
print(x + y)

Syntax 1: x + y =

 0.4158  0.4153  0.8800
 1.3402  1.3107  1.3528
 0.9960  1.4050  0.8499
 0.4537  1.6243  0.7226
 0.9828  1.2442  1.5941
[torch.FloatTensor of size 5x3]



In [8]:
# Addition: syntax 2
print("Syntax 2: torch.add(x, y) =")
print(torch.add(x, y))

Syntax 2: torch.add(x, y) =

 0.4158  0.4153  0.8800
 1.3402  1.3107  1.3528
 0.9960  1.4050  0.8499
 0.4537  1.6243  0.7226
 0.9828  1.2442  1.5941
[torch.FloatTensor of size 5x3]



In [9]:
# Addition: giving an output tensor
result = torch.Tensor(5, 3)
torch.add(x, y, out=result)
print("Syntax 3: torch.add(x, y, out=result) =")
print(result)

Syntax 3: torch.add(x, y, out=result) =

 0.4158  0.4153  0.8800
 1.3402  1.3107  1.3528
 0.9960  1.4050  0.8499
 0.4537  1.6243  0.7226
 0.9828  1.2442  1.5941
[torch.FloatTensor of size 5x3]



In [10]:
# Addition: in-place
# adds x to y
print("In-place Addition: y.add_(x) =")
y.add_(x)
print(y)

In-place Addition: y.add_(x) =

 0.4158  0.4153  0.8800
 1.3402  1.3107  1.3528
 0.9960  1.4050  0.8499
 0.4537  1.6243  0.7226
 0.9828  1.2442  1.5941
[torch.FloatTensor of size 5x3]



In [11]:
# You can use standard numpy-like indexing with all bells and whistles!
print("Indexing x[:, 1] - second column (index starts from zero) of every row:")
print(x[:, 1])

Indexing x[:, 1] - second column (index starts from zero) of every row:

 0.3879
 0.3968
 0.8463
 0.6852
 0.4435
[torch.FloatTensor of size 5]



Unlike NumPy `ndarrays`, PyTorch Tensors can utilize GPUs to accelerate their numeric computations & PyTorch makes it ridiculously easy to switch from GPU to CPU & vice versa.

*Note: PyTorch can serve as a replacement for NumPy in many numerical workloads, and Tensors & ndarrays can be converted into each other with very little effort.*

In [12]:
# Generate a sample matrix of 3 rows and 4 columns 
# from a Normal Distribution with Mean 0 and Var 1 
numpy_tensor = np.random.randn(3, 4)
print ("Numpy tensor: ", numpy_tensor, "\n")

Numpy tensor:  [[-1.11230897 -0.04769943 -0.8689276   0.11765421]
 [-2.26887639  0.37423547  0.64090709 -1.03655458]
 [-0.3089939   0.95988481 -0.67968412  0.30984516]] 



In [13]:
# Convert the NumPy array to a PyTorch tensor (torch.Tensor copies the data as float32)
pytorch_tensor = torch.Tensor(numpy_tensor)
print ("Numpy to PyTorch tensor: ", pytorch_tensor, "\n")
# Or another way (torch.from_numpy keeps the float64 dtype and shares memory with the array)
pytorch_tensor = torch.from_numpy(numpy_tensor)
# Convert the PyTorch tensor back to its NumPy representation
print ("PyTorch to Numpy tensor: ", pytorch_tensor.numpy(), "\n")

Numpy to PyTorch tensor:  
-1.1123 -0.0477 -0.8689  0.1177
-2.2689  0.3742  0.6409 -1.0366
-0.3090  0.9599 -0.6797  0.3098
[torch.FloatTensor of size 3x4]
 

PyTorch to Numpy tensor:  [[-1.11230897 -0.04769943 -0.8689276   0.11765421]
 [-2.26887639  0.37423547  0.64090709 -1.03655458]
 [-0.3089939   0.95988481 -0.67968412  0.30984516]] 
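
The conversion above has one detail that is easy to miss, sketched here under the assumption of a recent PyTorch release: `torch.from_numpy` does not copy the data, so the tensor and the array share memory. The array values are made up for illustration.

```python
import numpy as np
import torch

# A small float64 array (illustrative values)
array = np.array([[1.0, 2.0], [3.0, 4.0]])

# from_numpy keeps the dtype (float64) and shares memory with the array
shared = torch.from_numpy(array)

# Changing the array in place is visible through the tensor
array[0, 0] = 10.0
print(shared[0, 0].item())

# .numpy() goes the other way, again without copying
back = shared.numpy()
print(np.shares_memory(array, back))
```

If an independent copy is needed, cloning the tensor (`shared.clone()`) or copying the array (`array.copy()`) before converting avoids the shared buffer.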



*Note: you can run the GPU-to-CPU example only on a machine with a CUDA-capable GPU, such as a FloydHub GPU instance.*

In [14]:
# If cuda is available, run GPU-to-CPU and vice versa example
if cuda:
    # To create a tensor on the GPU, use a CUDA tensor type
    dtype = torch.cuda.FloatTensor
    gpu_tensor = torch.randn(5, 10).type(dtype)
    # Or just call `cuda()` method
    gpu_tensor = pytorch_tensor.cuda()
    print ("PyTorch cuda gpu_tensor ", gpu_tensor, "\n")
    # Call back to the CPU
    cpu_tensor = gpu_tensor.cpu()
    print ("PyTorch cuda tensor to cpu_tensor, gpu_tensor.cpu() ", cpu_tensor, "\n")
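
The same GPU/CPU switch can also be written so that the code runs unchanged on machines without a GPU. This is a minimal sketch, assuming PyTorch 0.4 or later where `torch.device` and `Tensor.to` are available; the tensor values are illustrative.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor on the CPU, then move it to the chosen device
cpu_tensor = torch.ones(2, 3)
moved = cpu_tensor.to(device)

# The multiplication runs on whichever device holds the tensor
doubled = moved * 2

# Bring the result back to the CPU, for example to convert it to NumPy
result = doubled.cpu()
print(result.device, result.sum().item())
```

Keeping the device in a single variable means the rest of the code never has to branch on `cuda`.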

In [15]:
# Define pytorch tensors
x = torch.randn(10, 20)
y = torch.ones(20, 5)
# `@` means matrix multiplication since Python 3.5 (PEP 465)
res = x @ y # Same as torch.matmul(x, y)

# Get the shape
res.shape  # torch.Size([10, 5])

torch.Size([10, 5])

#### Tensor Exercise

Create a square matrix `[[1, 2], [3, 4]]`, multiply it by the *column* vector `[5, 6]` and add the column vector `[7, 8]` to the result. The result should be a column vector with the values `[24, 47]`.

`result = matrix * column1 + column2` (here `*` stands for the matrix product, not element-wise multiplication)

*Note: There is more than one way to achieve the same result.*
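
As a quick check of the expected values, the arithmetic can be worked out by hand, reading `*` in the formula above as a matrix-vector product:

$$
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\begin{bmatrix} 5 \\ 6 \end{bmatrix}
+ \begin{bmatrix} 7 \\ 8 \end{bmatrix}
= \begin{bmatrix} 1 \cdot 5 + 2 \cdot 6 \\ 3 \cdot 5 + 4 \cdot 6 \end{bmatrix}
+ \begin{bmatrix} 7 \\ 8 \end{bmatrix}
= \begin{bmatrix} 17 \\ 39 \end{bmatrix}
+ \begin{bmatrix} 7 \\ 8 \end{bmatrix}
= \begin{bmatrix} 24 \\ 47 \end{bmatrix}
$$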

In [21]:
# Create a 2-D Tensor with these values [[1, 2], [3, 4]]
matrix = # CODE HERE

# Create a column Vector with these values [5, 6]
column1 = # CODE HERE

# Create a column Vector with these values [7, 8]
column2 = # CODE HERE

# Matrix Mult
mult = # CODE HERE

# Addition
res = # CODE HERE

# Show the result
print (res)


 24
 47
[torch.FloatTensor of size 2]



### Variables and AutoGrad

<p align="center">
  <img src="http://pytorch.org/tutorials/_images/Variable.png"/>
</p>

>Credits: PyTorch Variable Docs

Variables are **wrappers** over Tensors that record the operations applied to them, so that the result can be differentiated automatically. To see how, take the example in the following snippet, where we apply a string of operations to a Variable `x` to compute `y`, `z` & `out`.

In [16]:
### Variable and Autograd example on simple operations

# Create a Variable
x = Variable(torch.ones(2, 2), requires_grad=True)
print("x", x)

# Make an op
y = x + 2
print("x + 2 = y,", y, "\n")

# y was created as a result of an operation, so it has a grad_fn.
print("y was created as a result of an operation, so we have ", y.grad_fn, "\n")

# More op
z = y * y * 3
out = z.mean()

print("y * y * 3 = z", z, "\n", "mean(z), ", out)

# Let’s backprop now
out.backward()
print("After backprop, x", x.grad)

x Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

x + 2 = y, Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]
 

y was created as a result of an operation, so we have  <torch.autograd.function.AddConstantBackward object at 0x7f4a140a2e58> 

y * y * 3 = z Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]
 
 mean(z),  Variable containing:
 27
[torch.FloatTensor of size 1]

After backprop, x Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]
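
The value `4.5` can be checked by hand. Writing $x_i$ for the four entries of `x`, the snippet computes

$$
\text{out} = \frac{1}{4} \sum_{i=1}^{4} z_i, \qquad z_i = 3\,(x_i + 2)^2
$$

so by the chain rule

$$
\frac{\partial\, \text{out}}{\partial x_i} = \frac{1}{4} \cdot 6\,(x_i + 2) = \frac{3}{2}\,(x_i + 2), \qquad \left.\frac{\partial\, \text{out}}{\partial x_i}\right|_{x_i = 1} = \frac{3}{2} \cdot 3 = 4.5
$$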



In [30]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 10, 10, 10, 1

# Create random input and output data
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

### A typical forward computation
# Matrix Mult
h = x.mm(w1)
# Custom ReLU
h_relu = h.clamp(min=0)
# Matrix Mult
y_pred = h_relu.mm(w2)

# Compute the loss with Mean Squared Error
loss = ((y_pred - y).pow(2).sum())/N

And now, we wish to compute the derivative of the loss with respect to the weights `w1` & `w2`. Working the derivatives out by hand with the chain rule, we would do it like this:

In [31]:
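# Manual Backprop to compute gradients of the loss with respect to w1 and w2
# The loss divides the summed squared error by N, so the gradient carries the same 1/N factor
grad_y_pred = 2.0 * (y_pred - y) / N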
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h_relu = grad_y_pred.mm(w2.t())
grad_h = grad_h_relu.clone()
grad_h[h < 0] = 0
grad_w1 = x.t().mm(grad_h)

>The process shown above is part of Backpropagation in a simple Neural Network with a single hidden layer. Don't worry if you don't understand much of it yet; we'll cover it in the next article.

Now imagine there were tens of different types of mathematical operations before computing `loss` in the first snippet (because there will be in what's about to come!). How could you possibly code the gradient computation for something like that by hand? Thankfully, `torch.autograd` exists. It works on the principle of Automatic Differentiation, which is based on the **chain rule**. To perform the gradient computation in the above example using `autograd`, all we have to do is:

In [37]:
# Create random Tensors to hold input and outputs, and wrap them in Variables.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

# Create random Tensors for weights, and wrap them in Variables.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

# Forward pass: x, w1 & w2 must be Variables
y_pred = x.mm(w1).clamp(min=0).mm(w2)

# Compute the loss with Mean Squared Error
loss = ((y_pred - y).pow(2).sum())/N

# Auto differentiation
loss.backward()

# Get Grads with <variable_name>.grad.data
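
As a sanity check, the hand-written chain rule and `autograd` can be compared on the same network. The sketch below assumes PyTorch 0.4 or later, where a plain tensor created with `requires_grad=True` plays the role of a Variable; the shapes follow the cells above, and the manual gradient includes the `1/N` factor that comes from dividing the loss by `N`.

```python
import torch

torch.manual_seed(1)

# Same shapes as in the cells above
N, D_in, H, D_out = 10, 10, 10, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# Forward pass and loss, tracked by autograd
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum() / N
loss.backward()

# Manual chain rule, computed without building a graph
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y) / N
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h = grad_y_pred.mm(w2.t())
    grad_h[h < 0] = 0  # ReLU passes no gradient where its input was negative
    grad_w1 = x.t().mm(grad_h)

# Both comparisons should print True
print(torch.allclose(grad_w1, w1.grad, rtol=1e-4, atol=1e-4))
print(torch.allclose(grad_w2, w2.grad, rtol=1e-4, atol=1e-4))
```

The comparison uses a small tolerance because the two computations may round floating point values slightly differently.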

Every Variable instance has two important attributes: `.data`, which holds the underlying tensor, and `.grad`, which holds the gradient accumulated for that tensor. Here are some more snippets on using Autograd & Variables:

In [38]:
# Create a Variable
x = Variable(torch.ones(2, 2), requires_grad=True)
# Do some operation
y = x + 2
z = y * y * 3
out = z.mean()
# Let’s compute the gradient now
out.backward()
print(x.grad)

Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]



*Note*: When we wrap our Tensors with Variables, the arithmetic operations remain the same, but Variables also remember their history of computation. Thus, `z` is not only a regular `2 x 2` Tensor but also an expression involving `y` & `x`. This is what helps us define a **computational graph**. Nodes in this graph are Tensors & edges are the Functions applied to them. **Backpropagating** through this graph then allows you to easily compute gradients.
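
One consequence of storing gradients in `.grad` is that repeated backward passes add to the stored value instead of replacing it. The following minimal sketch illustrates this, assuming PyTorch 0.4 or later where tensors carry `requires_grad` directly; the tensor `x` here is a fresh illustrative example.

```python
import torch

x = torch.ones(2, requires_grad=True)

# First backward pass: d(sum(2 * x))/dx = 2 for every element
(x * 2).sum().backward()
first = x.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top
(x * 2).sum().backward()
second = x.grad.clone()

# Reset the stored gradient in place before the next pass
x.grad.zero_()
print(first, second, x.grad)
```

This is why training loops zero the gradients before every backward pass.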

#### Variables and AutoGrad Exercise

In the previous snippet of code we used auto differentiation to compute the gradient of the loss with respect to the weights. Now we will use the computed gradients to update the parameters, simulating an Optimization workflow with [SGD](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).

*Note: You should see the loss quickly converge to zero. If that is not the case, you have to fix your code.*

In [254]:
# Seed for reproducibility
# This makes sure you get the same results as shown in this snippet
torch.manual_seed(1)

# Create random Tensors to hold input and outputs, and wrap them in Variables.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

# Create random Tensors for weights, and wrap them in Variables.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

prev_loss = 10000000
max_error = 3
for t in range(500):
    # Forward pass: x, w1 & w2 must be Variables
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute the loss with Mean Squared Error
    loss = ((y_pred - y).pow(2).sum())/N
    
    print (loss.data[0])
    if loss.data[0] > prev_loss:
        max_error -= 1
        if max_error <= 0:
            print ("SGD Error - Please fix your code!")
            break
    else:
        # Update max error allowed on loss fluctuation
        max_error = 3
    
    # Update loss to check for correct implementation
    prev_loss = loss.data[0]
    
    
    # Zero the grads from the second iteration onwards
    # (before the first backward pass they are still None)
    if t > 0:
        # Manually zero the gradients before running the backward pass
        # CODE HERE - w1 grad must be zero-ed
        # CODE HERE - w2 grad must be zero-ed
       
    # Print Loss per time step
    print (loss.data[0])

    # Auto differentiation
    loss.backward()

    # Update weights using gradient descent according to the formula below
    # (SGD) w = w - learning_rate * w_grad_wrt_loss
    learning_rate = 1e-4
    # CODE HERE - w1 SGD step update -> w1 = w1 - learning_rate * w1_grad_wrt_loss
    # CODE HERE - w2 SGD step update -> w2 = w2 - learning_rate * w2_grad_wrt_loss

471615.84375
471615.84375
1265843.5
1265843.5
744764.25
744764.25
1196102.0
1196102.0
852086.5625
852086.5625
117779.265625
117779.265625
9248.3505859375
9248.3505859375
7311.44189453125
7311.44189453125
6044.49267578125
6044.49267578125
5061.51123046875
5061.51123046875
4283.59814453125
4283.59814453125
3659.133056640625
3659.133056640625
3151.992919921875
3151.992919921875
2734.068603515625
2734.068603515625
2386.427734375
2386.427734375
2094.23193359375
2094.23193359375
1846.6749267578125
1846.6749267578125
1635.353759765625
1635.353759765625
1453.8814697265625
1453.8814697265625
1297.270263671875
1297.270263671875
1161.3912353515625
1161.3912353515625
1042.9434814453125
1042.9434814453125
939.2108764648438
939.2108764648438
848.0303955078125
848.0303955078125
767.5533447265625
767.5533447265625
696.3588256835938
696.3588256835938
633.1547241210938
633.1547241210938
576.8418579101562
576.8418579101562
526.5444946289062
526.5444946289062
481.5548400878906
481.5548400878906
441.189575

0.16093353927135468
0.16093353927135468
0.1568954586982727
0.1568954586982727
0.1529628038406372
0.1529628038406372
0.14912675321102142
0.14912675321102142
0.14539843797683716
0.14539843797683716
0.14176201820373535
0.14176201820373535
0.13821689784526825
0.13821689784526825
0.1347638964653015
0.1347638964653015
0.1313978135585785
0.1313978135585785
0.12811818718910217
0.12811818718910217
0.12492497265338898
0.12492497265338898
0.12180949002504349
0.12180949002504349
0.11877629160881042
0.11877629160881042
0.115820974111557
0.115820974111557
0.11293844878673553
0.11293844878673553
0.11013401299715042
0.11013401299715042
0.10739827901124954
0.10739827901124954
0.10473249852657318
0.10473249852657318
0.10213278234004974
0.10213278234004974
0.09959892928600311
0.09959892928600311
0.09712858498096466
0.09712858498096466
0.09472343325614929
0.09472343325614929
0.09237639605998993
0.09237639605998993
0.09009081870317459
0.09009081870317459
0.08786336332559586
0.08786336332559586
0.0856895595

0.00039412069600075483
0.00039412069600075483
0.0003848111373372376
0.0003848111373372376
0.00037574523594230413
0.00037574523594230413
0.00036696248571388423
0.00036696248571388423
0.00035830913111567497
0.00035830913111567497
0.00034998220507986844
0.00034998220507986844
0.00034164093085564673
0.00034164093085564673
0.0003337223897688091
0.0003337223897688091
0.00032589404145255685
0.00032589404145255685
0.0003182206128258258
0.0003182206128258258
0.000310739764245227
0.000310739764245227
0.00030338275246322155
0.00030338275246322155
0.00029633622034452856
0.00029633622034452856
0.00028938104514963925
0.00028938104514963925
0.00028261810075491667
0.00028261810075491667
0.0002759683411568403
0.0002759683411568403
0.0002695330767892301
0.0002695330767892301


## Optimization

<p align="center">
    <img src="https://alykhantejani.github.io/images/gradient_descent_line_graph.gif"/>
</p>

So until now, we've seen Tensors that hold the data, Variables that wrap around Tensors to record the operations applied to them & finally `autograd` to compute gradients. So why do these Variables need to retain a history of computation?

The reason we wish to retain a computational graph of these variables is so we can differentiate & update them to optimize mathematical equations. This may not make much sense now, but hang on for a while. We'll get there. Say we have two Variables `y_` & `y`. `y_` is what our model predicts & `y` is what it **should** predict (remember supervised learning?).

But how do we teach a machine that it's not doing a very good job of predicting `y` & needs to do better? You see, the basis of learning, be it for biological beings like us or for artificial machines, has always been 'repetition' of a particular task, i.e. a **learning algorithm**. To achieve this, we optimize!

In [214]:
# Seed for reproducibility
# This makes sure you get the same results as shown in this snippet
torch.manual_seed(1)

# Data and Label Variable
x = Variable(torch.FloatTensor([5.0]), requires_grad=False)
y = Variable(torch.FloatTensor([3.0]), requires_grad=False)

# weight and bias
w = Variable(torch.Tensor([1]), requires_grad=True)
b = Variable(torch.randn([1]), requires_grad=True)

# Only w is handed to the optimizer, so b keeps its initial random value
optimizer = torch.optim.SGD([w], lr=0.001)
for i in range(100):
    y_ = (x * w) + b
    error = (y_ - y).abs() # Absolute difference between prediction & target
    optimizer.zero_grad() # Zero the gradients before running the backward pass.
    error.backward()      # Computes derivatives automatically
    optimizer.step()      # Decreases the error: updates w so that y_ moves closer to y
    
    
print ("Target y ", y)  # Evaluates to 3.0

print ("Learned function, y_ ", 
       y_,  # Evaluates to ~ 3.0 -- optimization successful!
       "\nOptimization successful!")

Target y  Variable containing:
 3
[torch.FloatTensor of size 1]

Learned function, y_  Variable containing:
 3.0032
[torch.FloatTensor of size 1]
 
Optimization successful!


The above snippet creates a Stochastic Gradient Descent optimizer, passing it a list of parameters to optimize & a [learning rate](https://medium.com/@balamuralim.1993/importance-of-learning-rate-in-machine-learning-920a323fcbfb). We try to minimize the difference between `y_` & `y`, slowly. After 100 steps they are almost equal (`3.0032` versus `3` in the run above).
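
To see what the optimizer does on each step, the same update rule can be written out in plain Python. This is a sketch under stated assumptions: `b` is fixed to `0.0` here (the cell above draws it at random), and the gradient of the absolute error is derived by hand instead of being computed by `autograd`.

```python
# Fixed values for illustration; b = 0.0 is an assumption
x, y = 5.0, 3.0
w, b = 1.0, 0.0
learning_rate = 0.001

for step in range(100):
    y_hat = x * w + b
    error = y_hat - y
    # The derivative of |x * w + b - y| with respect to w is sign(error) * x
    sign = (error > 0) - (error < 0)
    grad_w = sign * x
    # The plain SGD update: w <- w - learning_rate * grad_w
    w = w - learning_rate * grad_w

y_hat = x * w + b
print(w, y_hat)
```

Each step moves the prediction by `learning_rate * x * x = 0.025`, so once the prediction reaches the target it stays within that distance of it.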

We'll also use more advanced optimizers like Adagrad & Adam when we get to Neural Nets. They adapt the step size for each parameter, which costs a little more computation per step, but they are less likely to **overshoot** & are therefore used a lot. The `torch.optim` module contains a number of these optimizers.

<p align="center">
    <img src="https://2.bp.blogspot.com/-eW63YjSyuwY/V1QP3b9ZSmI/AAAAAAAAFeY/VcLfkmRvGaQbRjKhetlKjIl59kgkGV6hQCKgB/s1600/opt1.gif"/>
</p>

### Quick Trivia: Why are we doing this in PyTorch? Why not TensorFlow?

You may skip this section & will still do fine, but it's interesting to know how exactly TensorFlow & PyTorch differ and how PyTorch is gaining so much popularity.

With PyTorch & TensorFlow being the two most comprehensive & popular frameworks, it didn't take much time to narrow our options down to these two. Even though TensorFlow is more popular, we chose to go ahead with PyTorch for two primary reasons.

**1. Graph Creation**: Creating & running graphs is where the two frameworks differ the most. Graphs in PyTorch are created dynamically, i.e. at runtime, whereas TensorFlow compiles the graph first and then executes it repeatedly. As a simple example, consider this:

```python
for _ in range(T):
    h = torch.matmul(W, h) + b
```

Since the above operation takes place in a standard Python loop, `T` can change every time this code is executed. TensorFlow, on the other hand, relies on its [control flow operations](https://www.tensorflow.org/api_guides/python/control_flow_ops#Control_Flow_Operations), which makes it rather tedious to build a graph dynamically. Furthermore, the dynamic approach makes debugging much easier. You'll see some more virtues of dynamic compute graphs in the upcoming articles.

In TensorFlow, we define the [computational graph](https://www.tensorflow.org/programmers_guide/graphs) once and then execute the same graph over and over again, like a loop. In PyTorch, each forward pass defines a new computational graph.
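
The following sketch shows a graph being rebuilt on every forward pass, assuming PyTorch 0.4 or later. The function name `repeated_scale` and the scalar values are made up for illustration; the point is that the number of loop iterations can change between calls and `autograd` still returns the matching gradient.

```python
import torch

def repeated_scale(w, h, steps):
    # An ordinary Python loop; the number of iterations can differ on every call
    for _ in range(steps):
        h = w * h
    return h

w = torch.tensor(2.0, requires_grad=True)
h0 = torch.tensor(1.0)

# Three iterations build the graph for w**3 * h0, whose derivative is 3 * w**2
out3 = repeated_scale(w, h0, steps=3)
out3.backward()
grad3 = w.grad.item()

# Five iterations build a different graph, w**5 * h0, with derivative 5 * w**4
w.grad.zero_()
out5 = repeated_scale(w, h0, steps=5)
out5.backward()
grad5 = w.grad.item()

print(out3.item(), grad3, out5.item(), grad5)
```

No graph definition had to change between the two calls; the second forward pass simply recorded two more multiplications.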

<p align="center">
  <img src="https://www.tensorflow.org/images/tensors_flowing.gif"/>
</p>

>Credit: [TF Graph docs](https://www.tensorflow.org/programmers_guide/graphs)

>Static graphs are nice because you can optimize the graph up front; a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over. However, for some models we may wish to perform different computations for each data point; for example a recurrent network might be unrolled for different numbers of time steps for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.

**2. Data Loaders**: With PyTorch's well-designed APIs for samplers & data loaders, parallelizing data-flow operations is incredibly simple. TensorFlow provides some data loading tools of its own (readers, queues, etc.) but PyTorch is clearly miles ahead.
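
A minimal sketch of the data loading API, using `TensorDataset` and `DataLoader` from `torch.utils.data`. The features and labels are made-up values; with a real dataset, raising `num_workers` lets worker processes prepare batches in parallel.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten made-up samples with three features each, plus integer labels
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)

dataset = TensorDataset(features, labels)
# num_workers=0 loads in the main process; a larger value uses worker processes
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

# Iterating over the loader yields (features, labels) batches
batch_sizes = [len(batch_labels) for _, batch_labels in loader]
print(batch_sizes)
```

Ten samples with a batch size of four give two full batches and a final batch with the remaining two samples.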

*So why is TensorFlow so popular then?* While we feel that PyTorch is the better candidate for learning about DL, it should also be noted that there are certain fronts where TensorFlow does extremely well, primarily **Deployment**, **Device Management** & **Serialization**.

#### Optimization Exercise

Fix the Optimization workflow for the next code snippet in the following way:
- Set the learning rate variable to 1e-4,
- Define the Adam Optimizer, passing it 2 arguments: the model parameters and the learning rate,
- Inside the loop, after computing the loss: zero the grads, compute the gradients and update the weights accordingly.

In [180]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Load variables on GPU
if cuda:
    x, y = x.cuda(), y.cuda()

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        )
    
loss_fn = torch.nn.MSELoss(size_average=False)

# Load model and loss_fn on GPU
if cuda:
    model.cuda()
    loss_fn.cuda()

#### DEFINE OPTIMIZER ####
# Define learning rate
learning_rate = # CODE HERE
# Init Adam with parameters(model params and learning rate)
optimizer = # CODE HERE

for t in range(500):
  # Forward pass: compute predicted y by passing x to the model.
  y_pred = model(x)

  # Compute and print loss.
  loss = loss_fn(y_pred, y)
  print(t, loss.data[0])
  
  #### BACKPROP ####
  # Zero the grads
  # CODE HERE

  # Backward pass
  # CODE HERE

  # Update the weight with Adam
  # CODE HERE

0 632.82177734375
1 616.0342407226562
2 599.7182006835938
3 583.8980102539062
4 568.5601196289062
5 553.7428588867188
6 539.48974609375
7 525.6441650390625
8 512.2056274414062
9 499.0415954589844
10 486.1917419433594
11 473.7044982910156
12 461.57232666015625
13 449.8550109863281
14 438.465576171875
15 427.3471984863281
16 416.5296936035156
17 406.01910400390625
18 395.8365783691406
19 385.93768310546875
20 376.36749267578125
21 367.14117431640625
22 358.1838073730469
23 349.5149230957031
24 341.06573486328125
25 332.82940673828125
26 324.78094482421875
27 316.93963623046875
28 309.2872619628906
29 301.8126220703125
30 294.5040588378906
31 287.3655090332031
32 280.39263916015625
33 273.5672912597656
34 266.89208984375
35 260.3569030761719
36 253.98287963867188
37 247.80311584472656
38 241.78712463378906
39 235.8996124267578
40 230.1467742919922
41 224.52374267578125
42 219.02557373046875
43 213.64776611328125
44 208.37257385253906
45 203.21932983398438
46 198.16754150390625
47 193.2254

369 9.696146662463434e-06
370 9.032905836647842e-06
371 8.413609066337813e-06
372 7.83595532993786e-06
373 7.2980446930159815e-06
374 6.793623470002785e-06
375 6.32607134321006e-06
376 5.888831310585374e-06
377 5.480385425471468e-06
378 5.09972733198083e-06
379 4.7456114771193825e-06
380 4.414420800458174e-06
381 4.106736923858989e-06
382 3.820125130005181e-06
383 3.5527482395991683e-06
384 3.3038259061868303e-06
385 3.071978881052928e-06
386 2.8547617603180697e-06
387 2.6542174964561127e-06
388 2.4662101623107446e-06
389 2.2923966298549203e-06
390 2.129453378074686e-06
391 1.9782974050031044e-06
392 1.8377932065050118e-06
393 1.706218313302088e-06
394 1.5849684587010415e-06
395 1.4718700640514726e-06
396 1.3664537164004287e-06
397 1.2686374475379125e-06
398 1.1771192021114985e-06
399 1.0928218898698105e-06
400 1.0136735681953724e-06
401 9.404033107784926e-07
402 8.723727091819455e-07
403 8.088707659226202e-07
404 7.506167207793624e-07
405 6.958796348044416e-07
406 6.450922569456452e-0

## Next Up: Handwritten Digit Classification

So that's all for now. In the next article in this series, we introduce a classic problem in Computer Vision: Handwritten Digit Recognition. Until now we've seen how to use Tensors (n-dimensional arrays) in PyTorch & compute their gradients with Autograd. Handwritten digit recognition is an example of a **classification** problem: given an image of a digit, we want to classify it as one of 0, 1, 2, 3...9. Each digit to be classified is known as a class. We will try to build a classifier with only what you've learned until now & then finally introduce you to Artificial Neural Networks.

<p align="center">
  <img src="https://github.com/sominwadhwa/sominwadhwa.github.io/blob/master/assets/intro_to_pytorch_series/mnist_logreg.jpeg?raw=true"/>
</p>

Task: we'll be given a greyscale image (28 x 28) of a handwritten digit. We'll process this image to get a 28 x 28 matrix of real-valued numbers, called the **features** of this image. Our objective is to **map a relationship between these features & the probability of a particular outcome**. Before moving on to the next article, if you are not familiar with this kind of task, or want a quick intro to Logistic Regression, give [this article](https://medium.com/data-science-group-iitr/logistic-regression-simplified-9b4efe801389) a quick 5 minute read & you're good to go.
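
To make the idea of mapping features to probabilities concrete, here is a rough sketch, assuming a recent PyTorch release. The random "images", the untrained `torch.nn.Linear` layer and the variable names are illustrative assumptions, not the model built in the next article.

```python
import torch

torch.manual_seed(1)

# A made-up batch of two 28 x 28 "images" with random pixel values
images = torch.rand(2, 28, 28)

# Flatten every image into a vector of 784 features
features = images.view(2, -1)

# One linear layer maps the 784 features to 10 scores, one per digit class
linear = torch.nn.Linear(28 * 28, 10)
scores = linear(features)

# Softmax turns the scores into probabilities that sum to 1 for each image
probabilities = torch.softmax(scores, dim=1)
print(probabilities.shape, probabilities.sum(dim=1))
```

Since the layer is untrained, the probabilities carry no meaning yet; training adjusts the weights so that the correct digit receives the highest probability.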

### Dataset

For this task we will use the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. We've already uploaded the entire [dataset on FloydHub](https://www.floydhub.com/redeipirati/datasets/mnist) & you can access the same via the `input` path.

To learn how datasets are managed on FloydHub, you can check out the [dataset documentation](https://docs.floydhub.com/guides/create_and_upload_dataset/) or this quick [tutorial](https://blog.floydhub.com/getting-started-with-deep-learning-on-floydhub/).

## Summary

PyTorch provides an amazing framework with an awesome community that can support us in our DL journey. We introduced PyTorch & in the next article you'll see some more traditional use cases of PyTorch: we'll implement a full-scale `Classification` exercise in PyTorch using Logistic Regression, look for improvements through a single layer Neural Network, and create some more 'strange' networks to give you a good idea of how Dynamic Compute graphs make PyTorch so powerful.

*Note:* You should know that PyTorch's [documentation](http://pytorch.org/docs/master/) and [tutorials](http://pytorch.org/tutorials/) are stored separately, and sometimes they may be out of sync due to the rapid speed of development and version changes. So feel free to investigate the [source code](https://github.com/pytorch/pytorch) if you feel like it. [PyTorch Forums](https://discuss.pytorch.org/) are another great place to get your doubts cleared up. If you do, however, have any doubts or queries regarding our examples or in general, do let us know & we'll be happy to help.

We hope you enjoyed this Introduction to PyTorch. If you'd like to share your feedback (cheers, bug fixes, typos and/or improvements), please leave us a comment on our super active [forum](https://forum.floydhub.com/) or tweet us [@FloydHub_](https://twitter.com/FloydHub_).

## Resources

**Big thanks** to:
 - [Illarion Khlestov](https://medium.com/@illarionkhlestov) for the code snippets & images.
 - [PyTorch](http://pytorch.org/tutorials/) for the docs, code snippets, images and the amazing framework.
 - [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) for the pytorch examples and snippets of code.

Link References:
 - PyTorch [docs](http://pytorch.org/docs/master/) and [tutorials](http://pytorch.org/tutorials/)
 - [jcjohnson pytorch examples](https://github.com/jcjohnson/pytorch-examples)
 - [PyTorch tutorial distilled by Illarion Khlestov](https://medium.com/towards-data-science/pytorch-tutorial-distilled-95ce8781a89c)
