# Using PyTorch for Natural Language Processing

## Basic Info

### 1. What is PyTorch?

PyTorch is a free, open-source deep learning framework based on the Torch library. It is used for applications such as natural language processing (NLP) and computer vision (CV), and it is primarily developed by Facebook's artificial-intelligence research group.

PyTorch provides two high-level features:

1. Tensor computing (like NumPy) with strong acceleration via graphics processing units (GPUs)
2. Deep neural networks built on a tape-based autodiff system
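
The two features above can be illustrated with a minimal sketch. The tensor values and variable names here are made up for illustration and are not part of any later example:

```python
import torch

# Tensor computing: create tensors and combine them, much as with NumPy arrays.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.ones(2, 2)
c = a @ b + 1.0  # matrix product followed by a broadcast addition

# Autodiff: operations on tensors with requires_grad=True are recorded,
# and backward() walks that record to compute gradients.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # y = x0^2 + x1^2
y.backward()        # fills x.grad with dy/dx = 2 * x

print(c)
print(x.grad)
```

Here `x.grad` holds the gradient of `y` with respect to `x`, computed without writing any derivative by hand.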

### 2. Pros of PyTorch over TensorFlow

The main advantage of PyTorch over TensorFlow is that it uses dynamic computation graphs, while TensorFlow in its graph mode (like Theano, or Keras running on top of either) uses static graphs. In TensorFlow, the computation graph must be defined in full before it is run. In PyTorch, the graph is built on the fly as operations are applied to tensors.

A static-graph TensorFlow program cannot be paused at an arbitrary point for debugging, and ordinary Python control flow cannot be used inside the graph. In PyTorch, standard control flow (if-else, loops) can be used.
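
As a rough illustration of dynamic graphs, the sketch below uses an ordinary Python `while` loop whose iteration count depends on the data. The function `repeated_halving` and its `threshold` parameter are hypothetical names invented for this example:

```python
import torch

def repeated_halving(x, threshold=1.0):
    """Halve x until its norm is no longer above threshold, counting the steps."""
    steps = 0
    # Ordinary Python loop: the number of iterations depends on the data,
    # so the recorded graph can differ from one input to the next.
    while x.norm() > threshold:
        x = x / 2
        steps += 1
    return x, steps

x = torch.tensor([3.0, 4.0], requires_grad=True)  # norm is 5.0
out, steps = repeated_halving(x)
out.sum().backward()  # gradients flow through however many steps actually ran

print(steps, x.grad)
```

Because the graph is rebuilt on every call, a debugger breakpoint or a `print` inside the loop behaves as it would in any other Python code.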

### 3. Installation

The best way to install PyTorch is to follow the instructions on its official website: https://pytorch.org/. The brief steps below are a summary. To install it directly with the pip package manager for Python 3.6 and CUDA 9.0, run the following command(s) on the command line:

#### For Ubuntu Linux

```bash
pip3 install torch torchvision
```

#### For Windows

```bash
pip3 install https://download.pytorch.org/whl/cu90/torch-1.1.0-cp36-cp36m-win_amd64.whl
```

```bash
pip3 install https://download.pytorch.org/whl/cu90/torchvision-0.3.0-cp36-cp36m-win_amd64.whl
```

### 4. Verification

To ensure that PyTorch was installed correctly, we can run a short snippet that randomly initializes a tensor:

```python
import torch

x = torch.rand(5, 3)  # 5x3 tensor of uniform random values in [0, 1)
print(x)
```

Additionally, to check if the GPU driver and CUDA are enabled and accessible by PyTorch, run the following code:
``` python
import torch
print(torch.cuda.is_available())
```

This should print `True` when a CUDA-capable GPU and a working driver are present.
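
On a machine without a GPU the check prints `False`, and code that assumes CUDA will fail. A common pattern, sketched here with an illustrative variable name `device`, is to pick the device once and fall back to the CPU:

```python
import torch

# Fall back to the CPU when CUDA is not available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(5, 3, device=device)  # allocate directly on the chosen device
print(x.device)
```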

## First Example: Logistic Regression Bag-of-Words classifier

This program reads an input text and outputs the log probabilities of it being Spanish or English. We first prepare the training and test data:

```python
# Code sample is taken from https://pytorch.org/tutorials/

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocabulary to a unique integer index,
# which is its position in the bag-of-words vector

word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

print(word_to_ix)
```

```
{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}
```


Then, we define the classifier class, which inherits from torch.nn.Module and uses a torch.nn.Linear layer:

```python
# Defining the BoW classifier class
class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need. In this case, we need A and b,
        # the parameters of the affine mapping, where A is the weight matrix and
        # b is the bias vector.
        # Torch defines nn.Linear(), which provides the affine map.
        # This is equal to y = Ax + b
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)


# Defining some helper functions
def make_bow_vector(sentence, word_to_ix):
    # Count how often each vocabulary word occurs in the sentence
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)  # shape (1, vocab_size): a batch of one


def make_target(label, label_to_ix):
    # Wrap the integer label index in a tensor, as the loss function expects
    return torch.LongTensor([label_to_ix[label]])
```
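
To make the bag-of-words representation concrete, here is a small sketch on a hypothetical three-word vocabulary (`toy_word_to_ix` is invented for this illustration). It repeats `make_bow_vector` so that it can be run on its own:

```python
import torch

# Hypothetical three-word vocabulary, used only for this illustration.
toy_word_to_ix = {"it": 0, "is": 1, "lost": 2}

def make_bow_vector(sentence, word_to_ix):
    # Same logic as the helper defined above.
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)

bow = make_bow_vector("it is is lost".split(), toy_word_to_ix)
print(bow)        # word counts, indexed by vocabulary position
print(bow.shape)  # one row (a batch of size 1), one column per word
```

A word that is missing from the dictionary would raise a `KeyError`, which is why the vocabulary in this example is built from both the training and the test sentences.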

Now we create an instance of the classifier with a specific vocabulary size and two labels (English and Spanish). The model knows its parameters: the first output below is A, the second is b. Both are registered by the nn.Linear layer and exposed through model.parameters(), a method that BoWClassifier inherits from nn.Module.

```python
VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

for param in model.parameters():
    print(param, '\n')

print('--------------\n')

# To run the model, pass in a BoW vector
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
E_label_before = round(next(model.parameters())[:, word_to_ix["creo"]][1].item(), 4)
S_label_before = round(next(model.parameters())[:, word_to_ix["creo"]][0].item(), 4)
print('Before training:', next(model.parameters())[:, word_to_ix["creo"]])
```

```
Parameter containing:
tensor([[ 0.1011, -0.0866, -0.0380,  0.0921, -0.1846,  0.1176, -0.0403,  0.0998,
          0.0273, -0.0240,  0.0544,  0.0097,  0.0716, -0.0764, -0.0143, -0.0177,
          0.0284, -0.0008,  0.1714,  0.0610, -0.0730, -0.1184, -0.0329, -0.0846,
         -0.0628,  0.0094],
        [ 0.1169,  0.1066, -0.1917,  0.1216,  0.0548,  0.1860,  0.1294, -0.1787,
         -0.1865, -0.0946,  0.1722, -0.0327,  0.0839, -0.0911,  0.1924, -0.0830,
          0.1471,  0.0023, -0.1033,  0.1008, -0.1041,  0.0577, -0.0566, -0.0215,
         -0.1885, -0.0935]], requires_grad=True) 

Parameter containing:
tensor([ 0.1064, -0.0477], requires_grad=True) 

--------------

tensor([[-0.6250, -0.7662]])
tensor([[-0.5870, -0.8119]])
Before training: tensor([0.0544, 0.1722], grad_fn=<SelectBackward>)
```
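
The printed shapes follow from how nn.Linear stores its weights: A has one row per label and one column per vocabulary word. The sketch below checks the affine map by hand on hypothetical sizes (4 words, 2 labels) and a made-up input vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)
linear = nn.Linear(4, 2)            # hypothetical sizes: 4 vocabulary words, 2 labels
A, b = linear.weight, linear.bias   # A has shape (2, 4), b has shape (2,)

x = torch.tensor([[1.0, 0.0, 2.0, 0.0]])  # a made-up BoW row vector

with torch.no_grad():
    by_layer = linear(x)
    by_hand = x @ A.t() + b  # A is stored as (out_features, in_features)
    log_probs = F.log_softmax(by_layer, dim=1)

print(torch.allclose(by_layer, by_hand))
print(log_probs.exp().sum(dim=1))  # exponentiated log probabilities sum to 1
```

Since the input is a row vector, the layer effectively computes `x @ A.T + b`, which is the row-vector form of y = Ax + b.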


So let's train! To do this, we pass instances through the model to get log probabilities and compute the loss. Then we compute the gradient of the loss by backpropagation and update the parameters with a gradient step. Loss functions are provided in the torch.nn module. Optimization algorithms are available in torch.optim. Here, we will just use stochastic gradient descent (SGD) with a learning rate $\eta=0.1$.
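
For plain SGD without momentum, one update is $\theta \leftarrow \theta - \eta \nabla_\theta L$. The sketch below checks this against `optim.SGD` on a toy parameter vector and a toy quadratic loss, both invented for illustration:

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)  # toy parameter vector
optimizer = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # toy loss whose gradient is 2 * w
optimizer.zero_grad()
loss.backward()

expected = w.detach() - 0.1 * w.grad  # theta - eta * gradient, computed by hand
optimizer.step()                      # the optimizer applies the same update in place

print(w.detach(), expected)
```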

Note that the input to NLLLoss is a vector of log probabilities, and a target label. It doesn’t compute the log probabilities for us. This is why the last layer of our network is log softmax. The loss function nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log softmax for you.
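
The relationship between the two loss functions can be checked directly. In this sketch the logits are made-up raw scores, and index 0 is assumed to stand for SPANISH as in `label_to_ix`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[1.5, -0.3]])  # made-up raw scores for (SPANISH, ENGLISH)
target = torch.tensor([0])            # index of the true label

# NLLLoss expects log probabilities, so log_softmax is applied first.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
# CrossEntropyLoss takes the raw scores and applies log_softmax internally.
ce = nn.CrossEntropyLoss()(logits, target)

print(nll.item(), ce.item())
```

Both calls give the same value, so a model that returns raw scores would pair with CrossEntropyLoss, while a model ending in log softmax (as here) pairs with NLLLoss.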

```python
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have far more
# than the four instances used here. Usually, somewhere between 5 and 30
# epochs is reasonable.

for epoch in range(100):
    for instance, label in data:

        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BoW vector, and wrap the target in a
        # tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run the forward pass and compute the loss.
        log_probs = model(bow_vec)
        loss = loss_function(log_probs, target)

        # Step 4. Compute the gradients, and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()
```

Let's test our model using the test data:

```python
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Index corresponding to Spanish goes up, English goes down!
E_label_after = round(next(model.parameters())[:, word_to_ix["creo"]][1].item(), 4)
S_label_after = round(next(model.parameters())[:, word_to_ix["creo"]][0].item(), 4)
print('After training:', next(model.parameters())[:, word_to_ix["creo"]])
```

```
tensor([[-0.1210, -2.1721]])
tensor([[-2.7767, -0.0643]])
After training: tensor([ 0.5004, -0.2738], grad_fn=<SelectBackward>)
```


As seen from the output, the weight that links the word "creo" to the "Spanish" label has increased from 0.0544 to 0.5004, whereas its weight for the "English" label has decreased from 0.1722 to -0.2738.
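
The model returns log probabilities, so reading off a prediction takes one more step. The sketch below copies the two post-training output rows from above into a tensor. The mapping `ix_to_label` is a helper introduced here as the inverse of `label_to_ix`:

```python
import torch

# Log probabilities copied from the post-training output above.
log_probs = torch.tensor([[-0.1210, -2.1721],
                          [-2.7767, -0.0643]])
ix_to_label = {0: "SPANISH", 1: "ENGLISH"}  # inverse of label_to_ix

probs = log_probs.exp()              # each row now sums to roughly 1
predicted = log_probs.argmax(dim=1)  # index of the most likely label per row

for row, ix in zip(probs, predicted):
    print(ix_to_label[ix.item()], round(row[ix].item(), 4))
```

Under this reading, the first test sentence is classified as Spanish and the second as English, matching their true labels.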