# Deep Learning for Natural Language Processing with PyTorch

In [3]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x19c179ac138>

# 1. Introduction to Tensor Operations

In [4]:
V_data = [1., 2., 3.]
V = torch.Tensor(V_data)

print(V)


 1
 2
 3
[torch.FloatTensor of size 3]



### Reshaping (view())

In [5]:
x = torch.randn(2,3,4)
print(x)

print(x.view(2, 12))
print(x.view(2, -1))


(0 ,.,.) = 
 -2.9718  1.7070 -0.4305 -2.2820
  0.5237  0.0004 -1.2039  3.5283
  0.4434  0.5848  0.8407  0.5510

(1 ,.,.) = 
  0.3863  0.9124 -0.8410  1.2282
 -1.8661  1.4146 -1.8781 -0.4674
 -0.7576  0.4215 -0.4827 -1.1198
[torch.FloatTensor of size 2x3x4]



Columns 0 to 9 
-2.9718  1.7070 -0.4305 -2.2820  0.5237  0.0004 -1.2039  3.5283  0.4434  0.5848
 0.3863  0.9124 -0.8410  1.2282 -1.8661  1.4146 -1.8781 -0.4674 -0.7576  0.4215

Columns 10 to 11 
 0.8407  0.5510
-0.4827 -1.1198
[torch.FloatTensor of size 2x12]



Columns 0 to 9 
-2.9718  1.7070 -0.4305 -2.2820  0.5237  0.0004 -1.2039  3.5283  0.4434  0.5848
 0.3863  0.9124 -0.8410  1.2282 -1.8661  1.4146 -1.8781 -0.4674 -0.7576  0.4215

Columns 10 to 11 
 0.8407  0.5510
-0.4827 -1.1198
[torch.FloatTensor of size 2x12]
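As a small sketch of what the cell above relies on: passing -1 to view() asks PyTorch to infer that dimension from the total number of elements, and the reshaped tensor shares its data with the original. The example assumes a recent PyTorch release; the names `x` and `flat` are only illustrative.

```python
import torch

# 24 elements laid out as a 2x3x4 tensor (values 0..23 so the layout is easy to follow)
x = torch.arange(24.0).view(2, 3, 4)

# -1 tells view() to infer the missing dimension: 24 / 2 = 12
flat = x.view(2, -1)
print(flat.shape)

# view() does not copy: writing through the reshaped tensor changes the original too
flat[0, 0] = 100.0
print(x[0, 0, 0])
```

Because no data is copied, view() is cheap, but it also means in-place edits made through one tensor show up in the other.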



# 2. Computation Graph and Automatic differentiation
The concept of a computation graph is essential to efficient deep learning programming, because it means you do not have to write the backpropagation gradients yourself. A computation graph is simply a specification of how your data is combined to give you the output. Since the graph fully specifies which parameters were involved in which operations, it contains enough information to compute derivatives. This probably sounds vague, so let's see what is going on using the fundamental class of PyTorch: autograd.Variable.

First, think from a programmer's perspective. What is stored in the torch.Tensor objects we were creating above? Obviously the data and the shape, and maybe a few other things. But when we add two tensors together, we get an output tensor. All this output tensor knows is its data and shape. It has no idea that it was the sum of two other tensors (it could have been read in from a file, it could be the result of some other operation, etc.).

The Variable class keeps track of how it was created. Let's see it in action.

In [6]:
# Variables wrap tensor objects
x = autograd.Variable( torch.Tensor([1., 2., 3]), requires_grad=True )
# You can access the data with the .data attribute
print(x.data)

# You can also do all the same operations you did with tensors with Variables.
y = autograd.Variable( torch.Tensor([4., 5., 6]), requires_grad=True )
z = x + y
print(z.data)

# BUT z knows something extra.
print(z.grad_fn)


 1
 2
 3
[torch.FloatTensor of size 3]


 5
 7
 9
[torch.FloatTensor of size 3]

<torch.autograd.function.AddBackward object at 0x0000019C1804AAF0>


So Variables know what created them. z knows that it wasn't read in from a file, it wasn't the result of a multiplication or exponential or whatever. And if you keep following z.grad_fn, you will find yourself at x and y.
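One way to follow z.grad_fn back to its inputs is sketched below. It assumes a recent PyTorch release, where plain tensors created with `requires_grad=True` play the role of Variables, and it relies on the `next_functions` attribute of graph nodes and the `variable` attribute of the leaf (AccumulateGrad) nodes. The name `parents` is only illustrative.

```python
import torch

# In recent PyTorch versions, tensors track their own history (no Variable wrapper needed)
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)
z = x + y

# Each graph node lists the nodes that produced its inputs
parents = []
for node, _ in z.grad_fn.next_functions:
    # For leaf tensors the node is an AccumulateGrad object that points at the tensor
    parents.append(node.variable)

print(z.grad_fn)
print([p is x or p is y for p in parents])
```

Walking the graph by hand like this is rarely needed in practice; it only illustrates that the history really is stored on the result.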

But how does that help us compute a gradient?

In [8]:
# Let's sum up all the entries in z
s = z.sum()
print(s)
print(s.grad_fn)

Variable containing:
 21
[torch.FloatTensor of size 1]

<torch.autograd.function.SumBackward object at 0x0000019C1804ABE8>


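So now, what is the derivative of this sum with respect to the first component of x? In math, we want

<center>$\frac{\partial s}{\partial x_0}$</center>

Well, s knows that it was created as a sum of the tensor z. z knows that it was the sum x + y. So

<center>$s = \overbrace{x_0 + y_0}^{z_0} + \overbrace{x_1 + y_1}^{z_1} + \overbrace{x_2 + y_2}^{z_2}$</center>

And so s contains enough information to determine that the derivative we want is 1!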

Let's have PyTorch compute the gradient and see that we were right. (Note: if you run this block multiple times, the gradient will increment. That is because PyTorch accumulates the gradient into the .grad property, since for many models this is very convenient.)

In [10]:
print(x.grad)

None


In [11]:
s.backward()
print(x.grad)

Variable containing:
 1
 1
 1
[torch.FloatTensor of size 3]
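The accumulation behaviour mentioned above can be sketched as follows. The example assumes a recent PyTorch release and rebuilds the sum before each backward pass; the names `first` and `second` are only illustrative.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)

# First backward pass: x.grad goes from None to the gradient of the sum
(x + y).sum().backward()
first = x.grad.clone()

# Second backward pass: the new gradient is added to the one already stored
(x + y).sum().backward()
second = x.grad.clone()

print(first)   # gradient after one pass
print(second)  # gradient after two passes, accumulated

# Reset the stored gradient in place before the next independent computation
x.grad.zero_()
print(x.grad)
```

This is why training loops usually clear the gradients before each backward pass.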



# 3. Deep Learning Building Blocks: Affine maps, non-linearities and objectives

## Affine Maps

In [12]:
lin = nn.Linear(5, 3) # maps from R^5 to R^3, params A, b
data = autograd.Variable(torch.randn(2, 5))
print(lin(data))

Variable containing:
-0.1750  0.1918  0.6385
 0.1311 -0.1329  1.3910
[torch.FloatTensor of size 2x3]
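To make the affine map concrete, the sketch below checks that nn.Linear computes $xA^T + b$ for each row $x$ of the input, using the layer's `weight` and `bias` parameters. It assumes a recent PyTorch release; the name `manual` is only illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lin = nn.Linear(5, 3)        # weight has shape (3, 5), bias has shape (3,)
data = torch.randn(2, 5)     # two input rows, each a vector in R^5

# Apply the affine map by hand: multiply by the transposed weight, then add the bias
manual = data @ lin.weight.t() + lin.bias

print(lin.weight.shape, lin.bias.shape)
print(torch.allclose(lin(data), manual))
```

Note that the layer acts on rows of the input, so a batch of inputs is mapped in a single call.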



## Non-Linearities
First, note the following fact, which will explain why we need non-linearities in the first place. Suppose we have two affine maps $f(x)=Ax+b$ and $g(x)=Cx+d$. What is $f(g(x))$?

<center>$f(g(x))=A(Cx+d)+b=ACx+(Ad+b)$</center>

$AC$ is a matrix and $Ad+b$ is a vector, so we see that composing affine maps gives you an affine map.

From this, you can see that if you built your neural network as a long chain of affine compositions, it would add no more power to your model than a single affine map.

If we introduce non-linearities in between the affine layers, this is no longer the case, and we can build much more powerful models.
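The collapse of stacked affine maps can be checked numerically. The sketch below composes two nn.Linear layers and compares the result with a single affine map built from $AC$ and $Ad+b$. It assumes a recent PyTorch release; the layer sizes and the names `combined_weight` and `combined_bias` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

g = nn.Linear(4, 5)   # g(x) = Cx + d
f = nn.Linear(5, 3)   # f(x) = Ax + b
x = torch.randn(8, 4)

with torch.no_grad():
    # Single affine map equivalent to f(g(x)): weight AC, bias Ad + b
    combined_weight = f.weight @ g.weight          # shape (3, 4)
    combined_bias = f.weight @ g.bias + f.bias     # shape (3,)

    stacked = f(g(x))
    single = x @ combined_weight.t() + combined_bias

print(torch.allclose(stacked, single, atol=1e-5))
```

Placing a non-linearity such as F.relu between the two layers breaks this equivalence, which is what gives deeper networks their extra expressive power.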

There are a few core non-linearities. $\tanh(x), \sigma(x), \text{ReLU}(x)$ are the most common. You are probably wondering: "why these functions? I can think of plenty of other non-linearities." The reason for this is that they have gradients that are easy to compute, and computing gradients is essential for learning. For example
<center>$\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))$</center>

A quick note: although you may have learned some neural networks in your intro to AI class where $\sigma(x)$ was the default non-linearity, typically people shy away from it in practice. This is because the gradient vanishes very quickly as the absolute value of the argument grows. Small gradients mean it is hard to learn. Most people default to tanh or ReLU.
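The sigmoid derivative formula and the vanishing gradient can both be checked with autograd. The sketch below assumes a recent PyTorch release; the input values and the names `analytic` and `r` are arbitrary choices for illustration.

```python
import torch

# Sigmoid: compare the autograd gradient with the closed form sigma(x) * (1 - sigma(x))
x = torch.tensor([0.0, 2.0, 10.0], requires_grad=True)
s = torch.sigmoid(x)
s.sum().backward()
analytic = (s * (1 - s)).detach()

print(x.grad)     # gradient computed by autograd
print(analytic)   # gradient from the closed-form expression

# ReLU: for positive inputs the gradient stays at 1 however large the input is
r = torch.tensor([0.5, 2.0, 10.0], requires_grad=True)
torch.relu(r).sum().backward()
print(r.grad)
```

The sigmoid gradient peaks at 0.25 for an input of 0 and is already below 0.0001 at an input of 10, while the ReLU gradient does not shrink for positive inputs.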

In [15]:
# In PyTorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = autograd.Variable( torch.randn(2, 2) )
print(data)
print(F.relu(data))

Variable containing:
-1.3128  0.7099
 0.9944 -0.2694
[torch.FloatTensor of size 2x2]

Variable containing:
 0.0000  0.7099
 0.9944  0.0000
[torch.FloatTensor of size 2x2]



## Objective Functions
The objective function is the function that your network is being trained to minimize (in which case it is often called a loss function or cost function). This proceeds by first choosing a training instance, running it through your neural network, and then computing the loss of the output. The parameters of the model are then updated using the derivative of the loss function with respect to those parameters. Intuitively, if your model is completely confident in its answer, and its answer is wrong, your loss will be high. If it is very confident in its answer, and its answer is correct, the loss will be low.

The idea behind minimizing the loss function on your training examples is that your network will hopefully generalize well and have small loss on unseen examples in your dev set, test set, or in production. An example loss function is the negative log likelihood loss, which is a very common objective for multi-class classification. For supervised multi-class classification, this means training the network to minimize the negative log probability of the correct output (or equivalently, maximize the log probability of the correct output).
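As a rough illustration of the negative log likelihood loss, the sketch below scores two hand-written sets of class scores against the same correct class. It assumes a recent PyTorch release and uses F.log_softmax followed by F.nll_loss; the scores and the names `confident_right` and `confident_wrong` are made up for the example.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])   # the correct class is class 0

# Unnormalized scores for three classes (one row per instance)
confident_right = torch.tensor([[5.0, 0.0, 0.0]])   # strongly favors class 0
confident_wrong = torch.tensor([[0.0, 5.0, 0.0]])   # strongly favors class 1

# nll_loss expects log probabilities, so apply log_softmax first
loss_right = F.nll_loss(F.log_softmax(confident_right, dim=1), target)
loss_wrong = F.nll_loss(F.log_softmax(confident_wrong, dim=1), target)

print(loss_right.item())   # small: confident and correct
print(loss_wrong.item())   # large: confident and wrong
```

The loss is just the negative log probability assigned to the correct class, so it approaches 0 as that probability approaches 1.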

# 4. Optimization and Training
So we can compute a loss function for an instance. What do we do with that? We saw earlier that autograd.Variables know how to compute gradients with respect to the things that were used to compute them. Well, since our loss is an autograd.Variable, we can compute gradients with respect to all of the parameters used to compute it! Then we can perform standard gradient updates. Let $\theta$ be our parameters, $L(\theta)$ the loss function, and $\eta$ a positive learning rate. Then:

<center>$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta)$</center>
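A single step of this update can be written by hand. The sketch below assumes a recent PyTorch release and uses a toy loss, the sum of squared parameters, chosen only so the gradient ($2\theta$) is easy to verify; `theta` and `eta` mirror the symbols in the formula.

```python
import torch

theta = torch.tensor([3.0, -2.0], requires_grad=True)   # parameters
eta = 0.1                                               # learning rate

# Toy loss L(theta) = sum(theta^2), whose gradient is 2 * theta
loss = (theta ** 2).sum()
loss.backward()
print(theta.grad)   # gradient at the starting point: 2 * theta

# Apply the update outside of autograd so the step itself is not recorded in the graph
with torch.no_grad():
    theta -= eta * theta.grad

# Clear the accumulated gradient before the next step
theta.grad.zero_()
print(theta)        # parameters after one step
```

Starting from (3, -2) with gradient (6, -4), one step with $\eta = 0.1$ moves the parameters to (2.4, -1.6), closer to the minimum at the origin.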

There is a huge collection of algorithms, and active research, attempting to do something more than just this vanilla gradient update. Many attempt to vary the learning rate based on what is happening at train time. You don't need to worry about what specifically these algorithms are doing unless you are really interested. Torch provides many in the torch.optim package, and they are all completely transparent: using the simplest gradient update looks the same in your code as using the more complicated algorithms. Trying different update algorithms and different parameters for the update algorithms (like different initial learning rates) is important in optimizing your network's performance. Often, just replacing vanilla SGD with an optimizer like Adam or RMSProp will boost performance noticeably.
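The sketch below shows how swapping optimizers might look in practice. It assumes a recent PyTorch release and fits a one-parameter linear model to a toy line $y = 2x + 1$; the helper `train`, its `make_optimizer` argument, the data, and the learning rates are all hypothetical choices for illustration, not a recommended setup.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def train(make_optimizer, steps=200):
    """Fit a toy linear model and return the first and last loss values."""
    torch.manual_seed(1)
    model = nn.Linear(1, 1)
    xs = torch.linspace(-1, 1, 20).unsqueeze(1)   # 20 inputs, shape (20, 1)
    ys = 2 * xs + 1                               # targets from the line y = 2x + 1
    loss_fn = nn.MSELoss()
    optimizer = make_optimizer(model.parameters())

    first = None
    for _ in range(steps):
        optimizer.zero_grad()            # clear gradients accumulated by the last step
        loss = loss_fn(model(xs), ys)
        loss.backward()                  # compute gradients of the loss
        optimizer.step()                 # let the optimizer update the parameters
        if first is None:
            first = loss.item()
    return first, loss.item()


# Only the optimizer constructor changes; the training loop is identical
sgd_first, sgd_last = train(lambda params: optim.SGD(params, lr=0.1))
adam_first, adam_last = train(lambda params: optim.Adam(params, lr=0.1))

print(sgd_first, sgd_last)
print(adam_first, adam_last)
```

Which optimizer ends up with the lower loss depends on the problem and the hyperparameters, so this toy comparison should not be read as a benchmark.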