# LEARNING PYTORCH WITH EXAMPLES
This tutorial introduces the fundamental concepts of [PyTorch](https://github.com/pytorch/pytorch) through self-contained examples.

At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
- Automatic differentiation for building and training neural networks

We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, and will be trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.
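
Written out, the running example can be summarized as follows. This is a sketch inferred from the code later in this tutorial rather than a statement from the original text: the symbols $W_1$ and $W_2$ correspond to the arrays `w1` and `w2`, there are no bias terms, and the quantity minimized is the squared Euclidean distance summed over the whole batch.

$$
h = x W_1, \qquad \hat{y} = \max(h, 0)\, W_2, \qquad L = \sum_{i,j} \left(\hat{y}_{ij} - y_{ij}\right)^2
$$

Here $x$ has shape $N \times D_{in}$, $W_1$ has shape $D_{in} \times H$, $W_2$ has shape $H \times D_{out}$, and $y$ has shape $N \times D_{out}$.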

## Tensors
### Warm-up: numpy
Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [12]:
a = np.random.randn(3, 4)
a

array([[ 0.50307088, -0.15185053, -0.34081556, -0.30341894],
       [ 0.54715363,  0.33065336,  0.31636158, -0.69556258],
       [-0.86570228,  0.19769972,  1.10967712,  0.48293151]])

In [1]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

(0, 28039873.966130495)
(1, 20472276.079896238)
(2, 16512349.60131437)
(3, 13586239.235681716)
(4, 10938049.854629451)
(5, 8470256.300047835)
(6, 6319554.249940729)
(7, 4593580.153910894)
(8, 3307110.435857486)
(9, 2392634.040389738)
(10, 1761199.4217992234)
(11, 1326971.1552512283)
(12, 1026653.9182247353)
(13, 815183.5272028478)
(14, 662441.0549544694)
(15, 548962.3566950667)
(16, 462358.38123568974)
(17, 394424.4327615809)
(18, 339878.5554095361)
(19, 295217.94464816886)
(20, 258042.41865338816)
(21, 226735.26874670837)
(22, 200091.2773070959)
(23, 177241.97131199512)
(24, 157605.74365220417)
(25, 140540.88703333965)
(26, 125669.53303626839)
(27, 112638.42396724306)
(28, 101162.08790136671)
(29, 91036.8585282597)
(30, 82071.03383399942)
(31, 74113.72695723925)
(32, 67034.92822275579)
(33, 60725.87531058881)
(34, 55091.57766087088)
(35, 50047.17976619336)
(36, 45525.93840649478)
(37, 41477.18382123007)
(38, 37837.17774789438)
(39, 34555.98910978215)
(40, 31593.059126752247)
(41, 2891
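
A hand-written backward pass like the one above is easy to get subtly wrong, and one common way to gain confidence in it is to compare it against finite differences. The sketch below is not part of the original tutorial. It reuses the same forward and backward formulas on a much smaller problem; the sizes, the fixed seed, the step `eps`, and the helper names `loss_fn`, `manual_grads`, and `numeric_grad` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) so the check is repeatable

# Tiny sizes (assumption) keep the finite-difference loop fast.
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Forward pass and loss, using the same formulas as the training loop."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def manual_grads(w1, w2):
    """Hand-derived gradients of the loss with respect to w1 and w2."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0  # ReLU passes no gradient where its input was negative
    return x.T.dot(grad_h), grad_w2


def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, perturbing one weight entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = manual_grads(w1, w2)
num_w1 = numeric_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numeric_grad(lambda: loss_fn(w1, w2), w2)

max_err = max(np.abs(grad_w1 - num_w1).max(), np.abs(grad_w2 - num_w2).max())
print("largest absolute difference:", max_err)
```

If the manual gradients are right, the printed difference should be tiny compared with the size of the gradients themselves. A large value usually points to a transposed matrix or a missing ReLU mask.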

## PyTorch: Tensor
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the `Tensor`. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic tool for scientific computing.
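
To make the parallel with numpy concrete, the following minimal sketch computes the same matrix product and ReLU once with numpy arrays and once with Tensors. It is an illustration rather than part of the original tutorial; the array names and shapes are arbitrary, and it assumes both `numpy` and `torch` are installed.

```python
import numpy as np
import torch

a_np = np.random.randn(3, 4)
b_np = np.random.randn(4, 2)

# torch.from_numpy shares memory with the numpy array and keeps its dtype
# (float64 here, numpy's default for randn).
a_t = torch.from_numpy(a_np)
b_t = torch.from_numpy(b_np)

# Matrix product followed by ReLU, written once in each library.
out_np = np.maximum(a_np.dot(b_np), 0)
out_t = a_t.mm(b_t).clamp(min=0)

# Tensor.numpy() converts a CPU tensor back to an array for comparison.
print(np.allclose(out_np, out_t.numpy()))
```

The method names differ slightly (`dot` and `np.maximum` versus `mm` and `clamp`), but the shape of the computation is the same.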

Also unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to specify the correct device when creating it.
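
Placing a Tensor on a GPU can be sketched as follows. This is an illustrative snippet, not code from the original tutorial: the fallback to the CPU through `torch.cuda.is_available()` is an added convenience so that the example also runs on machines without a GPU, and the variable names are arbitrary.

```python
import torch

# Use the first GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.randn(3, 4, device=device)  # created directly on the chosen device
b = torch.ones(3, 4).to(device)       # created on the CPU, then moved
c = a + b                             # computed on the device that holds a and b

print(c.device, c.dtype, c.shape)
```

Operands of an operation generally need to live on the same device, which is why a single `device` variable is threaded through every Tensor that takes part in the computation.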

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network:

In [None]:
import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)