
## Deep Learning Building Blocks: Affine maps, non-linearities and objectives

Source: [Deep Learning with PyTorch (PyTorch tutorials)](https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html)

Deep learning consists of composing linearities with non-linearities in clever ways. The introduction of non-linearities allows for powerful models. In this section, we will play with these core components, make up an objective function, and see how the model is trained.

One of the core workhorses of deep learning is the affine map, which is a function $f(x)$ where

$$f(x) = Ax + b$$

for a matrix $A$ and vectors $x, b$. The parameters to be learned here are $A$ and $b$. Often, $b$ is referred to as the bias term.

PyTorch and most other deep learning frameworks do things a little differently than traditional linear algebra: they map the rows of the input instead of the columns. That is, the i’th row of the output below is the mapping of the i’th row of the input under A, plus the bias term. Look at the example below.



In [2]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


<torch._C.Generator at 0x7fec9afd1630>

In [6]:
lin = nn.Linear(5, 3)  # maps from R^5 to R^3, parameters A, b

# lin holds a weight matrix A and a bias vector b; both are updated by gradient descent.
# The weight is stored with shape (out_features, in_features) = (3, 5).
print(lin.weight.shape)

# data is 2x5.  A maps from 5 to 3... can we map "data" under A?
data = torch.randn(2, 5)

# Yes: each row is mapped separately as data @ A.T + b, i.e. (2x5)(5x3) -> 2x3
print(lin(data))
print(lin(data).shape)

torch.Size([3, 5])
tensor([[-0.0120,  0.3745, -0.3695],
        [ 0.0722,  0.7715, -0.4374]], grad_fn=<AddmmBackward>)
torch.Size([2, 3])
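
To make the row-wise convention concrete, here is a minimal sketch that checks `nn.Linear` against the explicit matrix product. The names `manual` and `row0` are illustrative and not part of the tutorial; the sketch relies only on the weight being stored with shape `(out_features, in_features)`, as the printed `torch.Size([3, 5])` shows.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lin = nn.Linear(5, 3)
data = torch.randn(2, 5)

# The weight A has shape (3, 5), so each input row x is mapped to x @ A.T + b.
manual = data @ lin.weight.T + lin.bias
matches_manual = torch.allclose(lin(data), manual, atol=1e-6)

# Row i of the output depends only on row i of the input.
row0 = lin(data[0])
matches_row = torch.allclose(lin(data)[0], row0, atol=1e-6)

print(matches_manual, matches_row)  # True True
```

Transposing the stored weight is what lets a whole batch of row vectors be mapped with a single matrix product.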


In [7]:
# In PyTorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = torch.randn(2, 2)
print(data)
print(F.relu(data))

tensor([[ 0.2912, -0.8317],
        [-0.5525,  0.6355]])
tensor([[0.2912, 0.0000],
        [0.0000, 0.6355]])
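
As a small check of the two claims in the comments above, the sketch below reproduces ReLU by hand and counts the parameters of its module form. The variable names `by_hand`, `same`, and `num_params` are illustrative, and the input values are copied from the printed output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

data = torch.tensor([[0.2912, -0.8317],
                     [-0.5525, 0.6355]])

# ReLU keeps positive entries and replaces negative ones with zero.
by_hand = torch.clamp(data, min=0)
same = torch.equal(F.relu(data), by_hand)
print(same)  # True

# The module form of ReLU holds no trainable parameters.
num_params = len(list(nn.ReLU().parameters()))
print(num_params)  # 0
```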


The function Softmax(x) is also just a non-linearity, but it is special in that it usually is the last operation done in a network. This is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever, there are no constraints). Then the i’th component of Softmax(x) is

$$\frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.

You could also think of it as just applying an element-wise exponentiation operator to the input to make everything non-negative and then dividing by the normalization constant.

In [10]:
# Softmax is also in torch.nn.functional
data = torch.randn(5)
print(data)

print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # there's also log_softmax

tensor([-2.5667, -1.4303,  0.5009,  0.5438, -0.4057])
tensor([0.0176, 0.0549, 0.3789, 0.3955, 0.1530])
tensor(1.)
tensor([-4.0381, -2.9017, -0.9705, -0.9276, -1.8771])
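
The description of softmax as "exponentiate, then normalize" can be checked directly. This is a minimal sketch using the input values printed above; the names `softmax_by_hand` and `log_softmax_by_hand` are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.5667, -1.4303, 0.5009, 0.5438, -0.4057])

# Exponentiate element-wise, then divide by the normalization constant.
exps = torch.exp(x)
softmax_by_hand = exps / exps.sum()

# log_softmax is the element-wise log of softmax.
log_softmax_by_hand = torch.log(softmax_by_hand)

matches_softmax = torch.allclose(softmax_by_hand, F.softmax(x, dim=0))
matches_log_softmax = torch.allclose(log_softmax_by_hand, F.log_softmax(x, dim=0))
print(matches_softmax, matches_log_softmax)  # True True
```

In practice `F.log_softmax` is preferable to taking the log of `F.softmax`, since it avoids computing very small probabilities explicitly.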


### Creating Network Components in PyTorch

Before we move on to our focus on NLP, let's do an annotated example of building a network in PyTorch using only affine maps and non-linearities. We will also see how to compute a loss function, using PyTorch’s built-in negative log likelihood, and update parameters using the gradients computed by backpropagation.

All network components should inherit from nn.Module and override the forward() method. That is about it, as far as the boilerplate is concerned. Inheriting from nn.Module provides functionality to your component. For example, it keeps track of the component's trainable parameters, and it lets you move the component between CPU and GPU with the .to(device) method, where device can be a CPU device torch.device("cpu") or a CUDA device torch.device("cuda:0").
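
As a rough illustration of that boilerplate, the sketch below defines a throwaway module and shows the parameter tracking and device placement mentioned above. `TinyNet` and its layer sizes are hypothetical and chosen only for this example; the device falls back to the CPU when CUDA is not available.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical component, for illustration only
    def __init__(self):
        super().__init__()
        # Assigning a sub-module as an attribute registers its parameters.
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.linear(x))


# Use the GPU if one is present, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = TinyNet().to(device)

# The module knows its trainable parameters by name.
names = [name for name, _ in net.named_parameters()]
print(names)  # ['linear.weight', 'linear.bias']

# Inputs must live on the same device as the module.
out = net(torch.randn(3, 4, device=device))
print(out.shape)  # torch.Size([3, 2])
```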

Let’s write an annotated example of a network that takes in a sparse bag-of-words representation and outputs a probability distribution over two labels: “English” and “Spanish”. This model is just logistic regression.

### Example: Logistic Regression Bag-of-Words classifier


Our model will map a sparse BoW representation to log probabilities over labels. We assign each word in the vocab an index. For example, say our entire vocab is two words “hello” and “world”, with indices 0 and 1 respectively. The BoW vector for the sentence “hello hello hello hello” is

$$[4, 0]$$

For “hello world world hello”, it is

$$[2, 2]$$

etc. In general, it is

$$[\text{Count}(\text{hello}), \text{Count}(\text{world})]$$

Denote this BoW vector as $x$. The output of our network is:

$$\log \text{Softmax}(Ax + b)$$

That is, we pass the input through an affine map and then do log softmax.
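
The counting rule for the two-word vocabulary can be sketched in plain Python. The helper `bow_counts` is hypothetical and only mirrors the description above; the notebook's own `make_bow_vector` builds the same counts as a tensor.

```python
def bow_counts(sentence, vocab):
    """Return a list where position vocab[word] holds the count of that word."""
    counts = [0] * len(vocab)
    for word in sentence.split():
        counts[vocab[word]] += 1
    return counts


vocab = {"hello": 0, "world": 1}

first = bow_counts("hello hello hello hello", vocab)
second = bow_counts("hello world world hello", vocab)
print(first)   # [4, 0]
print(second)  # [2, 2]
```

Word order is discarded: any sentence with the same word counts produces the same vector.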

In [12]:
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

print (data[0])

(['me', 'gusta', 'comer', 'en', 'la', 'cafeteria'], 'SPANISH')


In [17]:
# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector


word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}


In [35]:
VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

# print (VOCAB_SIZE)

class BoWClassifier(nn.Module):  # inheriting from nn.Module!
  
  def __init__(self, num_labels, vocab_size):
    # calls the init function of nn.Module.  Don't get confused by the syntax,
    # just always do it in an nn.Module
    super(BoWClassifier, self).__init__()

    # Define the parameters that you will need.  In this case, we need A and b,
    # the parameters of the affine mapping.
    # Torch defines nn.Linear(), which provides the affine map.
    # Make sure you understand why the input dimension is vocab_size
    # and the output is num_labels!

    self.linear = nn.Linear(vocab_size, num_labels)

    # self.linear.weight has shape (num_labels, vocab_size)
    # the input is a matrix of shape (1, vocab_size)
    # the output is a tensor of shape (1, num_labels)
    
    # NOTE! The non-linearity log softmax does not have parameters! So we don't need
    # to worry about that here
    
  def forward(self, bow_vec):
    # Pass the input through the linear layer,
    # then pass that through log_softmax.
    # Many non-linearities and other functions are in torch.nn.functional
    return F.log_softmax(self.linear(bow_vec), dim=1)
    

    
def make_bow_vector(sentence, word_to_ix):
  vec = torch.zeros(len(word_to_ix))
  for word in sentence:
    vec[word_to_ix[word]] += 1
  
  # reshape to [1, vocab_size]: the model expects a batch of row vectors
  return vec.view(1, -1)

def make_target(label, label_to_idx):
  return torch.LongTensor([label_to_idx[label]])


print (make_bow_vector(data[0][0], word_to_ix).shape)



torch.Size([1, 26])


In [36]:

model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
  # the first param is the matrix A, with shape (num_labels, vocab_size)
  # the second is the bias b, with shape (num_labels,)
  # for an input x of shape (1, vocab_size), the layer computes x @ A.T + b
  print(param.shape)

# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
  sample = data[0]
  bow_vector = make_bow_vector(sample[0], word_to_ix)
  log_probs = model(bow_vector)
  print(log_probs)


torch.Size([2, 26])
torch.Size([2])
tensor([[-0.6784, -0.7081]])


So F.softmax returns a probability distribution, and F.log_softmax returns the element-wise log of that distribution.

Which of the above values corresponds to the log probability of ENGLISH, and which to SPANISH? We never defined it, but we need to if we want to train the thing.



In [0]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
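
With the label indices fixed, a row of log probabilities can be turned back into a label. This is a minimal sketch: `ix_to_label` is a hypothetical reverse lookup that the notebook does not define, and the log probabilities are copied from the untrained model's output shown earlier.

```python
import torch

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
# Hypothetical reverse mapping from index back to label name.
ix_to_label = {ix: label for label, ix in label_to_ix.items()}

log_probs = torch.tensor([[-0.6784, -0.7081]])

# Exponentiating log probabilities recovers the probabilities themselves.
probs = log_probs.exp()

# The predicted label is the index with the largest log probability.
pred_ix = log_probs.argmax(dim=1).item()
prediction = ix_to_label[pred_ix]
print(prediction)  # SPANISH
```

Before training the two probabilities are close to one half each, so this prediction carries little information.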


So let's train! To do this, we pass instances through to get log probabilities, compute a loss function, compute the gradient of the loss function, and then update the parameters with a gradient step. Loss functions are provided by Torch in the nn package. nn.NLLLoss() is the negative log likelihood loss we want. Torch also defines optimization functions in torch.optim. Here, we will just use SGD.

**Note that the input to NLLLoss is a vector of log probabilities, and a target label. It doesn’t compute the log probabilities for us. This is why the last layer of our network is log softmax. The loss function nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log softmax for you.**
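
The relationship between the two loss functions can be checked on made-up scores. In this sketch the `logits` and `targets` are random placeholders, not values from the model above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder raw scores for 4 instances and 2 labels, with placeholder targets.
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 0, 1])

# NLLLoss expects log probabilities, so log_softmax is applied first.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# CrossEntropyLoss takes the raw scores and applies log softmax internally.
ce = nn.CrossEntropyLoss()(logits, targets)

losses_match = torch.allclose(nll, ce)
print(losses_match)  # True
```

Feeding log softmax outputs into `nn.CrossEntropyLoss()` would apply log softmax twice, so each loss should be paired with the matching kind of model output.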

In [42]:
# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)



tensor([[-0.5154, -0.9095]])
tensor([[-0.8217, -0.5792]])


In [44]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# This cell is a dry run: one pass over the training data that only computes
# the loss, with no backward pass and no parameter update.
# The training loop further down passes over the data 100 times.  That is much
# more than you would use on a real data set, but real data sets have far more
# than a handful of instances.  Usually, somewhere between 5 and 30 epochs is
# reasonable.

for instance, label in data:
  # Step 1. No gradients are computed in this cell, so there is nothing to
  # clear; the training loop calls model.zero_grad() at this point.

  # Step 2. Make our BOW vector and also we must wrap the target in a
  # Tensor as an integer. For example, if the target is SPANISH, then
  # we wrap the integer 0. The loss function then knows that the 0th
  # element of the log probabilities is the log probability
  # corresponding to SPANISH
  bow_vec = make_bow_vector(instance, word_to_ix)
  target = make_target(label, label_to_ix)

  # Step 3. Run our forward pass.
  log_probs = model(bow_vec)

  # Step 4. Compute the loss.  (Training would also call loss.backward()
  # and optimizer.step() here.)
  print(log_probs, target)
  loss = loss_function(log_probs, target)
  print(loss)





tensor([[-0.1002, -2.3500]], grad_fn=<LogSoftmaxBackward>) tensor([0])
tensor(0.1002, grad_fn=<NllLossBackward>)
tensor([[-1.3585, -0.2971]], grad_fn=<LogSoftmaxBackward>) tensor([1])
tensor(0.2971, grad_fn=<NllLossBackward>)
tensor([[-0.3231, -1.2869]], grad_fn=<LogSoftmaxBackward>) tensor([0])
tensor(0.3231, grad_fn=<NllLossBackward>)
tensor([[-1.1336, -0.3884]], grad_fn=<LogSoftmaxBackward>) tensor([1])
tensor(0.3884, grad_fn=<NllLossBackward>)


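What can we decipher from the above output?

Note that the model outputs log probabilities, and log(1) = 0, so a perfectly confident, correct prediction would have log probability 0 at the target index.

NLLLoss picks the log probability at the target index and negates it. When the target is 0 and the output is [-0.1002, -2.3500], the ideal output would have been [0, x], and the loss is 0 - (-0.1002) = 0.1002.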
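
The same arithmetic can be reproduced in a few lines. This sketch reuses the first pair of values printed above; the name `by_hand` is illustrative.

```python
import torch
import torch.nn as nn

log_probs = torch.tensor([[-0.1002, -2.3500]])
target = torch.tensor([0])

# Pick the log probability at the target index and negate it.
by_hand = -log_probs[0, target.item()]

# nn.NLLLoss does the same (and averages over the batch, here of size 1).
loss = nn.NLLLoss()(log_probs, target)

print(round(by_hand.item(), 4), round(loss.item(), 4))  # 0.1002 0.1002
```

A loss of 0.1002 corresponds to a probability of roughly 0.90 for the correct label, since exp(-0.1002) is about 0.90.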



In [0]:
for epoch in range(100):
  for instance, label in data:
    # Step 1. Remember that PyTorch accumulates gradients.
    # We need to clear them out before each instance
    model.zero_grad()

    # Step 2. Make our BOW vector and also we must wrap the target in a
    # Tensor as an integer. For example, if the target is SPANISH, then
    # we wrap the integer 0. The loss function then knows that the 0th
    # element of the log probabilities is the log probability
    # corresponding to SPANISH
    bow_vec = make_bow_vector(instance, word_to_ix)
    target = make_target(label, label_to_ix)

    # Step 3. Run our forward pass.
    log_probs = model(bow_vec)

    # Step 4. Compute the loss, gradients, and update the parameters by
    # calling optimizer.step()
    loss = loss_function(log_probs, target)
    loss.backward()
    optimizer.step()
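
Two details of the loop are easy to check in isolation: why gradients are cleared at every step, and what `optimizer.step()` does for plain SGD. The sketch below uses a small stand-in layer and random input rather than the BoW model; it assumes `optim.SGD` with its default settings (no momentum, no weight decay).

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

lin = nn.Linear(3, 2)  # stand-in for the model
optimizer = optim.SGD(lin.parameters(), lr=0.1)
loss_function = nn.CrossEntropyLoss()
x = torch.randn(1, 3)
target = torch.tensor([1])

# (a) Without zero_grad(), a second backward() adds to the stored gradient.
loss_function(lin(x), target).backward()
first_grad = lin.weight.grad.clone()
loss_function(lin(x), target).backward()
accumulated = torch.allclose(lin.weight.grad, 2 * first_grad)
print(accumulated)  # True

# (b) For plain SGD, step() moves each parameter by -lr * gradient.
lin.zero_grad()
loss_function(lin(x), target).backward()
expected = lin.weight.detach() - 0.1 * lin.weight.grad
optimizer.step()
matches = torch.allclose(lin.weight.detach(), expected)
print(matches)  # True
```

Part (a) is the reason the loop calls `model.zero_grad()` before each instance: otherwise every update would use the sum of all gradients seen so far.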

In [47]:
with torch.no_grad():
  for instance, label in test_data:
    print (instance, label)
    bow_vec = make_bow_vector(instance, word_to_ix)
    log_probs = model(bow_vec)
    print(log_probs)

['Yo', 'creo', 'que', 'si'] SPANISH
tensor([[-0.1424, -2.0193]])
['it', 'is', 'lost', 'on', 'me'] ENGLISH
tensor([[-2.6018, -0.0770]])


In [48]:
# Weights for the word "creo": the entry for Spanish (index 0) goes up, the entry for English (index 1) goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([ 0.2901, -0.3370], grad_fn=<SelectBackward>)
