<a href="https://colab.research.google.com/github/AvishekRoy16/DeepLearning/blob/master/6-Pytorch/Pytorch-Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Outline
* PyTorch
* What are tensors
* Initialising, slicing, reshaping tensors
* Numpy and PyTorch interfacing
* GPU support for PyTorch + Enabling GPUs on Google Colab
* Speed comparisons, Numpy -- PyTorch -- PyTorch on GPU
* Autodiff concepts and application
* Writing a basic learning loop using autograd
* Exercises

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt

A tensor is a data structure that generalises vectors and matrices (think lists and 2D lists/dataframes) to any number of dimensions.  
In PyTorch, tensors can also record how they were computed from other tensors, so they carry relations between them as well as data.
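To make the idea of "order" concrete, here is a minimal sketch (the variable names are illustrative and not part of the notebook) that builds tensors with 0, 1, 2 and 3 dimensions and prints their `ndim` and `shape`.

```python
import torch

scalar = torch.tensor(3.0)          # 0 dimensions: a single number
vector = torch.tensor([1.0, 2.0])   # 1 dimension
matrix = torch.ones(3, 2)           # 2 dimensions
cube = torch.zeros(4, 3, 2)         # 3 dimensions

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # ndim is the number of dimensions, shape lists the size of each one
    print(name, t.ndim, tuple(t.shape))
```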

## Initialise Tensors

In [2]:
# Create a tensor of the specified dimensions filled with ones
x = torch.ones(3,2)
print(x)

# Create a tensor of the specified dimensions filled with zeros
x = torch.zeros(3, 2)
print(x)

# Create a tensor of the specified dimensions filled with random numbers drawn uniformly from [0, 1)
x = torch.rand(3, 2)
print(x)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])
tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])
tensor([[0.2029, 0.5901],
        [0.5739, 0.3683],
        [0.0375, 0.4813]])


In [3]:
# Allocates memory for the specified dimensions but does not initialise the values in it
x = torch.empty(3, 2)
print(x)

# zeros_like creates a tensor of zeros with the same shape as another tensor
y = torch.zeros_like(x)
print(y)

tensor([[1.5691e-07, 3.0887e-41],
        [1.4487e-07, 3.0887e-41],
        [8.9683e-44, 0.0000e+00]])
tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])


In [4]:
# linspace(start, end, steps) creates evenly spaced values; start and end are both included
x = torch.linspace(0, 1, steps=5)
print(x)

tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])


In [5]:
# Manually defining the tensors
x = torch.tensor([[1, 2], 
                 [3, 4], 
                 [5, 6]])
print(x)

tensor([[1, 2],
        [3, 4],
        [5, 6]])
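A small aside on the cell above: `torch.tensor` infers the element type from the values it is given. The sketch below (names such as `ints` and `as_float` are illustrative, not from the notebook) shows integer input giving `torch.int64`, float input giving the default float type, and an explicit `dtype` overriding the inference.

```python
import torch

ints = torch.tensor([[1, 2], [3, 4], [5, 6]])
floats = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(ints.dtype)    # integer input is stored as torch.int64
print(floats.dtype)  # float input uses the default dtype (torch.float32 unless changed)

# An explicit dtype overrides the inference
as_float = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
print(as_float.dtype)
```

This matters later in the notebook, because only floating point tensors can be created with `requires_grad=True`.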


## Slicing tensors

In [6]:
# size() returns the dimensions of the tensor as a torch.Size, which behaves like a tuple
print(x.size())

# Indexing is x[rows, columns]
# Take all rows and print the column at index 1
print(x[:, 1])
# Take row 0 and print all the columns in it
print(x[0, :])

# The slicing rules for Python lists apply to tensors as well

torch.Size([3, 2])
tensor([2, 4, 6])
tensor([1, 2])


In [7]:
# We are accessing a single element of x: index [1, 1] is the second row and
# second column. The result is still a tensor (a 0-dimensional one)
y = x[1, 1]
print(y)
# item() converts a one-element tensor into a plain Python number
print(y.item())

tensor(4)
4


## Reshaping tensors

Dimensions play a very important role in machine learning: we have to keep track of what we are multiplying with what in matrix multiplications and in other operations that require the dimensions of the tensors to match.

In [8]:
# To look at the same tensor with different dimensions we can use view(rows, columns)
print(x)
y = x.view(2, 3)
print(y)

tensor([[1, 2],
        [3, 4],
        [5, 6]])
tensor([[1, 2, 3],
        [4, 5, 6]])


In [9]:
# We can reshape when we know only one of the dimensions: put -1 in the
# dimension we do not know and PyTorch will infer the appropriate size for it
y = x.view(6,-1) 
print(y)

tensor([[1],
        [2],
        [3],
        [4],
        [5],
        [6]])
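One detail worth knowing about `view` is that it does not copy data: the result shares storage with the original tensor. The following is a minimal sketch of that behaviour, together with `reshape`, which falls back to copying when a view is not possible (the transposed tensor `t` is only an illustrative case).

```python
import torch

x = torch.tensor([[1, 2], [3, 4], [5, 6]])
y = x.view(2, 3)

# Writing through the view changes the original tensor as well
y[0, 0] = 100
print(x[0, 0].item())

# A transposed tensor is not contiguous in memory; view would raise an
# error on it, while reshape copies the data when it has to
t = x.t()
print(t.is_contiguous())
flat = t.reshape(-1)
print(flat)
```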


## Simple Tensor Operations

In [10]:
# Simple elementwise operations on tensors
x = torch.ones([3, 2])
y = torch.ones([3, 2])
z = x + y
print(z)
z = x - y
print(z)
z = x * y
print(z)

tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])
tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])


In [11]:
# z is the result of adding x to y. Here y remains the same and is not updated
z = y.add(x)
print(z)
print(y)

tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])


In [12]:
# Addition in place (the trailing underscore marks an in-place operation)
# We are taking y, adding x to it and updating y in the process
z = y.add_(x)
print(z)
print(y)

tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])
tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])
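The difference between `add` and `add_` can also be checked directly by comparing memory addresses. This is a small sketch using `data_ptr()`, which returns the address of a tensor's first element; the names `out_of_place` and `in_place` are illustrative.

```python
import torch

x = torch.ones(3, 2)
y = torch.ones(3, 2)

out_of_place = y.add(x)   # allocates a new tensor, y is untouched
in_place = y.add_(x)      # writes the result into y's own memory

# Same address means the two names refer to the same underlying data
print(out_of_place.data_ptr() == y.data_ptr())
print(in_place.data_ptr() == y.data_ptr())
```

In-place operations save memory, but they overwrite values, so they should be used with care on tensors that autograd still needs.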


## Numpy <> PyTorch

In [13]:
# Interfacing Numpy and PyTorch

# Convert a tensor into a numpy array
x_np = x.numpy()
print(type(x), type(x_np))
print(x_np)

<class 'torch.Tensor'> <class 'numpy.ndarray'>
[[1. 1.]
 [1. 1.]
 [1. 1.]]


In [14]:
# Converting a numpy array into a tensor
a = np.random.randn(5)
print(a)
a_pt = torch.from_numpy(a)
print(type(a), type(a_pt))
print(a_pt)
# This is less a copy and more a bridge between the two: the tensor shares memory
# with the array, so changes made to the numpy array are reflected in the tensor

[ 0.62234009  0.86093278 -1.85977867 -0.77231315 -0.83015987]
<class 'numpy.ndarray'> <class 'torch.Tensor'>
tensor([ 0.6223,  0.8609, -1.8598, -0.7723, -0.8302], dtype=torch.float64)


In [15]:
np.add(a, 1, out=a)
print(a)
print(a_pt) 

[ 1.62234009  1.86093278 -0.85977867  0.22768685  0.16984013]
tensor([ 1.6223,  1.8609, -0.8598,  0.2277,  0.1698], dtype=torch.float64)
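The bridge behaviour shown above is specific to `torch.from_numpy`. As a short sketch for comparison, `torch.tensor` makes an independent copy of the array, so later changes to the numpy array do not reach it (the names `shared` and `copied` are illustrative).

```python
import numpy as np
import torch

a = np.zeros(3)
shared = torch.from_numpy(a)   # shares memory with a
copied = torch.tensor(a)       # independent copy of a

np.add(a, 1, out=a)            # in-place update of the numpy array
print(shared)  # follows the change made to a
print(copied)  # keeps the original zeros
```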


In [16]:
%%time
# Checking the time taken to loop and multiply random matrices using numpy arrays
for i in range(100):
  a = np.random.randn(100,100) # (100,100) is the matrix size
  b = np.random.randn(100,100)
  c = np.matmul(a, b)

CPU times: user 369 ms, sys: 6.8 ms, total: 375 ms
Wall time: 112 ms


In [17]:
%%time
# Checking the time taken to loop and multiply random matrices using tensors
for i in range(100):
  a = torch.randn([100, 100])
  b = torch.randn([100, 100])
  c = torch.matmul(a, b)

# Note we are still not using the GPU.

CPU times: user 58.4 ms, sys: 0 ns, total: 58.4 ms
Wall time: 26.6 ms


In [18]:
%%time
for i in range(10):
  a = np.random.randn(10000,10000)
  b = np.random.randn(10000,10000)
  c = a + b

CPU times: user 55.7 s, sys: 14.1 s, total: 1min 9s
Wall time: 1min 11s


In [19]:
%%time
for i in range(10):
  a = torch.randn([10000, 10000])
  b = torch.randn([10000, 10000])
  c = a + b

CPU times: user 15.8 s, sys: 4.51 s, total: 20.3 s
Wall time: 16.7 s


## CUDA support

In [20]:
print(torch.cuda.device_count())

1


In [21]:
print(torch.cuda.device(0))
print(torch.cuda.get_device_name(0))

<torch.cuda.device object at 0x7f1b5e605280>
GeForce MX150


In [22]:
cuda0 = torch.device('cuda:0')
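The cells in this section assume that a CUDA device is present. A common pattern, sketched here with an illustrative `device` variable, is to fall back to the CPU when no GPU is visible so that the same code runs in both settings.

```python
import torch

# Pick the GPU when one is visible, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.ones(3, 2, device=device)
b = torch.ones(3, 2, device=device)
c = a + b
print(c.device)

# .cpu() brings the result back to host memory, e.g. before converting to numpy
print(c.cpu().numpy())
```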

In [23]:
a = torch.ones(3, 2, device=cuda0)
b = torch.ones(3, 2, device=cuda0)
c = a + b
print(c)

tensor([[2., 2.],
        [2., 2.],
        [2., 2.]], device='cuda:0')


In [24]:
print(a)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]], device='cuda:0')


In [25]:
%%time
# Time comparison between numpy, cpu and gpu performance for matrix addition
for i in range(10):
  a = np.random.randn(10000,10000)
  b = np.random.randn(10000,10000)
  np.add(b, a)

CPU times: user 1min 5s, sys: 17.6 s, total: 1min 23s
Wall time: 1min 24s


In [26]:
%%time
for i in range(10):
  a_cpu = torch.randn([10000, 10000])
  b_cpu = torch.randn([10000, 10000])
  b_cpu.add_(a_cpu)

CPU times: user 24.1 s, sys: 3.97 s, total: 28.1 s
Wall time: 23.4 s


In [27]:
%%time
for i in range(10):
  a = torch.randn([10000, 10000], device=cuda0)
  b = torch.randn([10000, 10000], device=cuda0)
  b.add_(a)

CPU times: user 9.67 ms, sys: 59.4 ms, total: 69.1 ms
Wall time: 80.9 ms


In [28]:
%%time
# Time comparison between numpy, cpu and gpu performance for matrix multiplication
for i in range(10):
  a = np.random.randn(10000,10000)
  b = np.random.randn(10000,10000)
  np.matmul(b, a)

CPU times: user 13min 7s, sys: 19.1 s, total: 13min 26s
Wall time: 4min 4s


In [29]:
%%time
for i in range(10):
  a_cpu = torch.randn([10000, 10000])
  b_cpu = torch.randn([10000, 10000])
  torch.matmul(a_cpu, b_cpu)

CPU times: user 6min 41s, sys: 4.5 s, total: 6min 45s
Wall time: 1min 53s


In [30]:
%%time
for i in range(10):
  a = torch.randn([10000, 10000], device=cuda0)
  b = torch.randn([10000, 10000], device=cuda0)
  torch.matmul(a, b)

CPU times: user 543 ms, sys: 232 ms, total: 775 ms
Wall time: 1.24 s
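One caveat when reading the GPU timings above: CUDA operations are launched asynchronously, so a timer can stop before the GPU has actually finished its work. The sketch below shows one way to get a fairer measurement by calling `torch.cuda.synchronize()` around the timed region. The helper `timed_matmul` and the smaller matrix size are assumptions made for illustration, not part of the original notebook.

```python
import time
import torch

def timed_matmul(device, n=1000, repeats=5):
    """Return wall-clock seconds for `repeats` n x n matmuls on `device`."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()   # finish pending work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        c = torch.matmul(a, b)
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for the GPU before stopping the clock
    return time.perf_counter() - start

print("cpu:", timed_matmul(torch.device("cpu")))
if torch.cuda.is_available():
    print("gpu:", timed_matmul(torch.device("cuda:0")))
```

Creating the random matrices outside the timed loop also separates the cost of random number generation from the cost of the multiplication itself.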


## Autodiff

This feature lets us calculate the gradient automatically

In [31]:
# requires_grad=True is specified when the tensor is used in autodiff,
# so that we can later differentiate with respect to x
x = torch.ones([3,2], requires_grad = True)
print(x)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]], requires_grad=True)


At the data-structure level, a tensor stores a multidimensional array; at a structural level, it also records how it relates to other tensors.  
This combination of storing high-dimensional arrays and tracking the relations between them is what makes tensors what they are.

In [32]:
# when we do this, y automatically records that it is a function of x, which itself requires gradients
y = x + 5
print(y)

tensor([[6., 6.],
        [6., 6.],
        [6., 6.]], grad_fn=<AddBackward0>)


In [33]:
# so we are stacking another function on top of y
z = y*y + 1
print(z)

tensor([[37., 37.],
        [37., 37.],
        [37., 37.]], grad_fn=<AddBackward0>)


In [34]:
# torch.sum simply adds all the numbers in the tensor
t = torch.sum(z)
print(t)
# We can think of everything so far as a forward pass
# PyTorch does the book-keeping: grad_fn tells us that the last operation used to produce this tensor was a sum

tensor(222., grad_fn=<SumBackward0>)


In [35]:
# At this point we are ready to do a backward pass
t.backward()
# nothing is shown in the output; PyTorch is computing the gradients internally

In [36]:
# x.grad is the derivative of t wrt x
# We called backward starting from t, so t is the function that we differentiate,
# and we differentiate it against x, on which we read .grad
print(x.grad)

tensor([[12., 12.],
        [12., 12.],
        [12., 12.]])


Logic for why the derivative of t wrt x was 12:  
$t = \sum_i z_i, \quad z_i = y_i^2 + 1, \quad y_i = x_i + 5$

$\frac{\partial t}{\partial x_i} = \frac{\partial z_i}{\partial x_i} = \frac{\partial z_i}{\partial y_i} \frac{\partial y_i}{\partial x_i} = 2y_i \times 1$


At $x_i = 1$ we have $y_i = 6$, so $\frac{\partial t}{\partial x_i} = 2 \times 6 = 12$

So here we can see that we are getting the partial derivative of t wrt every element of x.  
We can now write any cascading set of functions on a given set of inputs:  
we can call trig functions, tanh, log, even things like standard deviation, mean and so on,  
and then compute the derivative wrt the inputs.

The derivative is computed numerically at the point where we initialised our values. In this particular example, x is initialised to ones.
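The derivation above can be checked against autograd directly. This is a minimal sketch that repeats the forward pass, runs the backward pass, and compares `x.grad` with the hand-derived expression $2(x_i + 5)$; the name `expected` is illustrative.

```python
import torch

x = torch.ones([3, 2], requires_grad=True)
y = x + 5
z = y * y + 1
t = torch.sum(z)
t.backward()

# Analytic gradient from the chain rule: dt/dx_i = 2 * y_i = 2 * (x_i + 5)
expected = 2 * (x.detach() + 5)
print(x.grad)
print(torch.allclose(x.grad, expected))
```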

In [37]:
# Another example: same x and y, but z is replaced by r, which is the sigmoid of y
x = torch.ones([3, 2], requires_grad=True)
y = x + 5
r = 1/(1 + torch.exp(-y))
print(r)
s = torch.sum(r)
s.backward()
print(x.grad)

tensor([[0.9975, 0.9975],
        [0.9975, 0.9975],
        [0.9975, 0.9975]], grad_fn=<MulBackward0>)
tensor([[0.0025, 0.0025],
        [0.0025, 0.0025],
        [0.0025, 0.0025]])


Earlier we were writing the forward pass and backward pass ourselves, applying our knowledge of the derivatives of sigmoid, tanh etc.; now we are letting PyTorch do it for us automatically. So it's quite a powerful thing in that sense!
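To see that autograd reproduces the derivative we would otherwise write by hand, the sketch below compares `x.grad` with the known sigmoid derivative $\sigma'(y) = \sigma(y)(1 - \sigma(y))$. It uses `torch.sigmoid`, which computes the same function as the explicit `1/(1 + torch.exp(-y))` in the cell above; the name `manual` is illustrative.

```python
import torch

x = torch.ones([3, 2], requires_grad=True)
y = x + 5
r = torch.sigmoid(y)          # same function as 1/(1 + exp(-y))
r.sum().backward()

# Hand-written derivative of the sigmoid, evaluated at the same point
manual = (r * (1 - r)).detach()
print(x.grad)
print(torch.allclose(x.grad, manual, atol=1e-6))
```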

In [38]:
# We can do the above differentiation in this manner too: instead of summing r into s to get one value,
# we define 'a' as ones with the same shape as x and pass it to backward.
# So basically we are avoiding calling the sum function
x = torch.ones([3, 2], requires_grad=True)
y = x + 5
r = 1/(1 + torch.exp(-y))
a = torch.ones([3, 2])
r.backward(a)
print(x.grad)

tensor([[0.0025, 0.0025],
        [0.0025, 0.0025],
        [0.0025, 0.0025]])


r.backward(a) computes the derivative of r wrt x and multiplies it pointwise with 'a', the argument we pass to r.backward.  
So we are computing $\frac{\partial{r}}{\partial{x}}$ and multiplying pointwise with 'a'.  

This feature exists so that we can cascade the chain rule through multiple functions.  

$\frac{\partial{s}}{\partial{x}} = \frac{\partial{s}}{\partial{r}} \cdot \frac{\partial{r}}{\partial{x}}$

For the above code $a$ represents $\frac{\partial{s}}{\partial{r}}$ and then $x.grad$ gives directly $\frac{\partial{s}}{\partial{x}}$



If we want to calculate $\frac{\partial{s}}{\partial{x}}$, then by the chain rule it is given by $\frac{\partial{s}}{\partial{r}} \cdot \frac{\partial{r}}{\partial{x}}$.  
So to move from s to x, we first move from s to r and then from r to x.  
r to x is given by r.backward, but if we have already computed s to r and stored it in a, then pointwise multiplying a with $\frac{\partial{r}}{\partial{x}}$ directly gives us $\frac{\partial{s}}{\partial{x}}$.  
In this case we do not have an explicit s to consider. When s is a summation of r, $\frac{\partial{s}}{\partial{r}}$ is all ones, so we use torch.ones for a.
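The role of the argument to `backward` is easier to see when it is not all ones. In the sketch below, a hypothetical downstream function $s = \sum_i w_i r_i$ is assumed, so that $\frac{\partial s}{\partial r}$ equals the tensor `weights`; passing `weights` to `r.backward` should then give the same gradient as building `s` explicitly and calling `s.backward()`.

```python
import torch

x = torch.ones([3, 2], requires_grad=True)
r = torch.sigmoid(x + 5)

# Assumed downstream function: s = sum(weights * r), so ds/dr = weights
weights = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])
r.backward(weights)
grad_via_argument = x.grad.clone()

# Same gradient obtained by building s explicitly
x2 = torch.ones([3, 2], requires_grad=True)
s = torch.sum(weights * torch.sigmoid(x2 + 5))
s.backward()

print(grad_via_argument)
print(torch.allclose(grad_via_argument, x2.grad, atol=1e-6))
```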

## Autodiff example that looks like what we have been doing

#### Forward Pass

In [39]:
# Think of this as a training dataset with 20 items;
# for each of them we compute y = 3*x - 2, which we can think of as the ground truth model (the real output)
# Input data
x = torch.randn([20, 1], requires_grad=True)
# Ground truth model
y = 3*x - 2

In [40]:
# We don't know the correct values of w and b, so we initialise both of them to 1;
# the real values are 3 and -2, from y = 3x - 2
w = torch.tensor([1.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

# This is the predicted value
y_hat = w*x + b

# We use the usual squared error, but summed rather than averaged (so this is a sum of squared errors, not MSE)
loss = torch.sum((y_hat - y)**2)

In [41]:
print(loss)
# The loss is a positive number, which shows that the model does not match the data,
# i.e. the estimated values of w and b are not the same as in the ground truth model

tensor(290.7623, grad_fn=<SumBackward0>)


#### Backpropagation

In [42]:
loss.backward()

In [43]:
# Derivative of loss wrt w and b respectively
print(w.grad, b.grad)

tensor([-107.0340]) tensor([122.4855])


Observations: We know that the model we are using is a simple linear model. The correct value of w is 3 and we initialised it to 1, so w should increase. This is consistent with the derivative of the loss wrt w being negative:
it means that if we move towards increasing w, the loss will decrease.  

On the other hand, for b we chose 1 as the initial value and the actual value is -2, so we would like to decrease it. Accordingly, the derivative of the loss wrt b is positive: as we increase b, the loss increases.

So if the gradient is a positive number, the parameter should decrease from its initial value, and vice versa.
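The two gradients can also be written out by hand for this loss. With $L = \sum_i (w x_i + b - y_i)^2$, the derivatives are $\frac{\partial L}{\partial w} = 2 \sum_i (w x_i + b - y_i) x_i$ and $\frac{\partial L}{\partial b} = 2 \sum_i (w x_i + b - y_i)$. The sketch below compares these with autograd; the seed and the names `residual`, `dw` and `db` are illustrative choices.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the sketch repeatable
x = torch.randn([20, 1])
y = 3 * x - 2

w = torch.tensor([1.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
loss = torch.sum((w * x + b - y) ** 2)
loss.backward()

# Gradients written out by hand from the sum of squared errors
residual = (w * x + b - y).detach()
dw = 2 * torch.sum(residual * x)
db = 2 * torch.sum(residual)

print(w.grad.item(), dw.item())
print(b.grad.item(), db.item())
```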

## Do it in a loop

In [44]:
# Taking Learning rate as a constant
learning_rate = 0.01

# Initialising w and b to 1 and enabling the autodiff functionality for them
w = torch.tensor([1.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
# Printing the initial values of w and b
print(w.item(), b.item())

# FORWARD PROPAGATION - this builds the computation graph (where different variables interact and compute new variables)
# We are going through the learning loop, so we can think of 10 here as the number of epochs
for i in range(10):
  # Input values
  x = torch.randn([20, 1])
  # Ground truth model
  y = 3*x - 2
  
  # Predicted Value
  y_hat = w*x + b
  # Loss Function
  loss = torch.sum((y_hat - y)**2)
  
  
  # BACKWARD PROPAGATION
  loss.backward()
  
  # Standard Gradient Descent Algo (Update Rule)
  # The backward pass is done, and the parameter update itself is not part of the model,
  # so we wrap it in torch.no_grad() to tell PyTorch to stop its book-keeping and not
  # build relations b/w the tensors here;
  # w and b are just being treated as plain variable updates
  with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad
    
    # We also need to reset the gradients to zero so that they are fresh for the next iteration (backward() accumulates into .grad)
    w.grad.zero_()
    b.grad.zero_() 

  print(w.item(), b.item())
  # as we know, the right values of w and b are 3 and -2, and the output shows that we are getting close to them

1.0 1.0
1.8591370582580566 -0.16979622840881348
2.4375343322753906 -0.7341733574867249
2.5145201683044434 -1.2211039066314697
2.670606851577759 -1.5503445863723755
2.733276844024658 -1.711594581604004
2.800959587097168 -1.8045190572738647
2.809910774230957 -1.8447062969207764
2.880446434020996 -1.884988784790039
2.911482334136963 -1.9189097881317139
2.927065849304199 -1.9481284618377686
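For comparison, the manual update and gradient reset in the loop above can be delegated to PyTorch's built-in optimiser. This is a sketch of the same loop using `torch.optim.SGD`, which for plain SGD without momentum applies the same `p -= lr * p.grad` rule; it is an alternative formulation, not the notebook's own code, and the seed is an arbitrary choice.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for repeatability
w = torch.tensor([1.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.01)

for i in range(10):
    x = torch.randn([20, 1])
    y = 3 * x - 2                       # ground truth model

    loss = torch.sum((w * x + b - y) ** 2)

    optimizer.zero_grad()   # clear gradients left from the previous iteration
    loss.backward()         # fill w.grad and b.grad
    optimizer.step()        # apply w -= lr * w.grad and b -= lr * b.grad

print(w.item(), b.item())
```

The optimiser handles the `torch.no_grad()` context internally, which keeps the training loop shorter once there are many parameters.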


## Do it for a large problem

In [45]:
%%time
# Using pyTorch but not GPU
learning_rate = 0.001
N = 10000000
epochs = 200

# Here w is a vector of N values, so we get the same benefit of vectorisation that we saw earlier with numpy
# w is initialised with random values b/w 0 and 1
w = torch.rand([N], requires_grad=True)
b = torch.ones([1], requires_grad=True)

# print(torch.mean(w).item(), b.item())

for i in range(epochs):
  
  # x contains random numbers but there are N features
  x = torch.randn([N])
  y = torch.dot(3*torch.ones([N]), x) - 2
  
  # The model has a dot product b/w w and x
  y_hat = torch.dot(w, x) + b
  loss = torch.sum((y_hat - y)**2)
  
  loss.backward()
  
  with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad
    
    w.grad.zero_()
    b.grad.zero_()

  # print(torch.mean(w).item(), b.item())
  
  
  # The correct value for every entry of w is 3 and for b it is -2, from the equation y = torch.dot(3*torch.ones([N]), x) - 2
  # We can see from the output that there are oscillations but the model did come close to the original value of w, which is 3,
  # and b needs more epochs to be learned by this model

CPU times: user 1min 22s, sys: 22.8 s, total: 1min 45s
Wall time: 38 s


In [46]:
%%time
# Using pyTorch & GPU
learning_rate = 0.001
N = 10000000
epochs = 200

# Every tensor that we create should be on the GPU

# instantiating the variables on the GPU
w = torch.rand([N], requires_grad=True, device=cuda0)
b = torch.ones([1], requires_grad=True, device=cuda0)

# print(torch.mean(w).item(), b.item())

for i in range(epochs):
  
  x = torch.randn([N], device=cuda0)
  # here we can see that the ones tensor also has to be created on the cuda device
  y = torch.dot(3*torch.ones([N], device=cuda0), x) - 2
  
  y_hat = torch.dot(w, x) + b
  loss = torch.sum((y_hat - y)**2)
  
  loss.backward()
  
  with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad
    
    w.grad.zero_()
    b.grad.zero_()

  #print(torch.mean(w).item(), b.item())

CPU times: user 2.02 s, sys: 908 ms, total: 2.93 s
Wall time: 2.95 s
