
## Deep Learning Building Blocks: Affine maps, non-linearities and objectives

Source: [Deep Learning with PyTorch tutorial](https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html)

Deep learning consists of composing linearities with non-linearities in clever ways. The introduction of non-linearities allows for powerful models. In this section, we will play with these core components, make up an objective function, and see how the model is trained.

One of the core workhorses of deep learning is the affine map, which is a function $f(x)$ where

$$f(x) = Ax + b$$

for a matrix $A$ and vectors $x, b$. The parameters to be learned here are $A$ and $b$. Often, $b$ is referred to as the bias term.

PyTorch and most other deep learning frameworks do things a little differently than traditional linear algebra: they map the rows of the input instead of the columns. That is, the $i$’th row of the output below is the mapping of the $i$’th row of the input under $A$, plus the bias term. Look at the example below.



In [2]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


<torch._C.Generator at 0x7fec9afd1630>

In [6]:
lin = nn.Linear(5, 3)  # maps from R^5 to R^3, parameters A, b

# lin holds a 3x5 weight matrix A and a length-3 bias b; both are updated by gradient descent

print(lin.weight.shape)

# data is 2x5.  A maps from 5 to 3... can we map "data" under A? Yes, row by row.

data = torch.randn(2, 5)

# op: (2x5) @ (5x3, the transpose of the 3x5 weight) + bias -> 2x3

print(lin(data))

print(lin(data).shape)

torch.Size([3, 5])
tensor([[-0.0120,  0.3745, -0.3695],
        [ 0.0722,  0.7715, -0.4374]], grad_fn=<AddmmBackward>)
torch.Size([2, 3])
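
To make the row-wise convention concrete, here is a minimal sketch that recomputes the output of `nn.Linear` by hand. The variable names `manual` and `first_row` are only illustrative; the check assumes nothing beyond the `weight` and `bias` attributes of the layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lin = nn.Linear(5, 3)        # weight A has shape (3, 5), bias b has shape (3,)
data = torch.randn(2, 5)     # two input rows, each a vector in R^5

# For the whole batch, mapping every row x to A x + b is the same as
# data @ A^T + b, with b broadcast across the rows.
manual = data @ lin.weight.t() + lin.bias

# The same thing for a single row, written exactly as f(x) = A x + b.
first_row = lin.weight @ data[0] + lin.bias

print(manual.shape)  # torch.Size([2, 3])
print(torch.allclose(manual, lin(data), atol=1e-6))
print(torch.allclose(first_row, lin(data)[0], atol=1e-6))
```

Both comparisons should agree up to floating-point rounding, which is why `torch.allclose` is used rather than exact equality.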


In [7]:
# In PyTorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = torch.randn(2, 2)
print(data)
print(F.relu(data))

tensor([[ 0.2912, -0.8317],
        [-0.5525,  0.6355]])
tensor([[0.2912, 0.0000],
        [0.0000, 0.6355]])
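
The claim that non-linearities carry no trainable weights can be checked directly. The sketch below counts parameters for `nn.ReLU` (the module form of `F.relu`) and, for contrast, a small `nn.Linear`; it also confirms that ReLU is just an element-wise max with zero. The names `n_relu_params` and `n_lin_params` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

relu = nn.ReLU()         # module form of F.relu
lin = nn.Linear(2, 2)    # affine map, for comparison

# Count the trainable scalars each component owns.
n_relu_params = sum(p.numel() for p in relu.parameters())
n_lin_params = sum(p.numel() for p in lin.parameters())
print(n_relu_params)  # 0
print(n_lin_params)   # 6 (a 2x2 weight matrix plus 2 biases)

# ReLU clamps every negative entry to zero and leaves the rest untouched.
data = torch.tensor([[0.2912, -0.8317], [-0.5525, 0.6355]])
print(torch.equal(F.relu(data), torch.clamp(data, min=0)))  # True
```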


The function Softmax(x) is also just a non-linearity, but it is special in that it usually is the last operation done in a network. This is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever, there are no constraints). Then the i’th component of Softmax(x) is

$$\frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.

You could also think of it as just applying an element-wise exponentiation operator to the input to make everything non-negative and then dividing by the normalization constant.

In [10]:
# Softmax is also in torch.nn.functional
data = torch.randn(5)
print(data)

print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # there's also log_softmax

tensor([-2.5667, -1.4303,  0.5009,  0.5438, -0.4057])
tensor([0.0176, 0.0549, 0.3789, 0.3955, 0.1530])
tensor(1.)
tensor([-4.0381, -2.9017, -0.9705, -0.9276, -1.8771])
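
The "exponentiate, then normalize" reading of softmax described above can be written out by hand. This is a minimal sketch that reuses the printed input values from the cell above; `exps` and `manual_softmax` are illustrative names, and the hand-rolled version is for understanding only.

```python
import torch
import torch.nn.functional as F

data = torch.tensor([-2.5667, -1.4303, 0.5009, 0.5438, -0.4057])

# Step 1: element-wise exponentiation makes every entry positive.
exps = torch.exp(data)
# Step 2: divide by the normalization constant so the entries sum to 1.
manual_softmax = exps / exps.sum()

print(torch.allclose(manual_softmax, F.softmax(data, dim=0), atol=1e-6))
# log_softmax matches the log of softmax up to floating-point rounding.
print(torch.allclose(torch.log(manual_softmax), F.log_softmax(data, dim=0), atol=1e-6))
```

In practice `F.softmax` and `F.log_softmax` are preferable to the manual version, since exponentiating large inputs directly can overflow.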


### Creating Network Components in PyTorch

