## Deep Learning NLP - Getting Started



### Load the Data
Instead of rebuilding the datasets from scratch, we restore the ones saved in the previous notebook. I have already stored them, and they are available for download.

Note that this tutorial was written with PyTorch version `1.0.1`. If you run the code with a newer version of PyTorch, you may run into compatibility issues.

In [None]:
## takes roughly 3 minutes
!pip install torch==1.0.1

Collecting torch==1.0.1
  Downloading https://files.pythonhosted.org/packages/f7/92/1ae072a56665e36e81046d5fb8a2f39c7728c25c21df1777486c49b179ae/torch-1.0.1-cp36-cp36m-manylinux1_x86_64.whl (560.0MB)
     |████████████████████████████████| 560.1MB 29kB/s 
ERROR: torchvision 0.5.0 has requirement torch==1.4.0, but you'll have torch 1.0.1 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
Successfully installed torch-1.0.1


In [None]:
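import pickle
import time

import numpy as np  # used by MyData below to compute sequence lengths
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# helper functions
def convert_to_pickle(item, directory):
    """Serialize `item` to the file at path `directory`."""
    with open(directory, "wb") as f:
        pickle.dump(item, f)


def load_from_pickle(directory):
    """Load a pickled object from the file at path `directory`."""
    with open(directory, "rb") as f:
        return pickle.load(f)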

print(torch.__version__)

1.0.1


Let's first download our datasets.

In [None]:
!wget https://www.dropbox.com/s/qcyl34jvdc9siw6/test_dataset
!wget https://www.dropbox.com/s/ldk80nwz5va1wvz/train_dataset
!wget https://www.dropbox.com/s/t4cah3zc9bz6jnv/val_dataset

--2020-04-14 14:29:53--  https://www.dropbox.com/s/qcyl34jvdc9siw6/test_dataset
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.1, 2620:100:6021:1::a27d:4101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/qcyl34jvdc9siw6/test_dataset [following]
--2020-04-14 14:29:53--  https://www.dropbox.com/s/raw/qcyl34jvdc9siw6/test_dataset
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc6ea348adcd9027152bd80db9b3.dl.dropboxusercontent.com/cd/0/inline/A10HSAoFyIRoi8lRBThJ8BcZRKcY6DevzS6mGONqKZQDoystpjC9uTIsgJsh_QyiGH20UQprjdGONnHNcKx0OiKX-1BkJElQp-XWDnUnHVw8HDwqk67VTtZrhtCHuVsseMw/file# [following]
--2020-04-14 14:29:53--  https://uc6ea348adcd9027152bd80db9b3.dl.dropboxusercontent.com/cd/0/inline/A10HSAoFyIRoi8lRBThJ8BcZRKcY6DevzS6mGONqKZQDoystpjC9uTIsgJsh_QyiGH20UQprjdGONnHNcKx0OiKX-1BkJElQp-XWDnUnHVw8HDw

## Load the Data

The data loading was done in the previous notebook; we simply carry the code over here.

In [None]:
## You need to declare the class again to properly load the data
class MyData(Dataset):
    def __init__(self, X, y):
        self.data = X
        self.target = y
        self.length = [ np.sum(1 - np.equal(x, 0)) for x in X]
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        x_len = self.length[index]
        return x, y, x_len
    
    def __len__(self):
        return len(self.data)

## store the datasets in these variables
train_dataset = load_from_pickle("train_dataset")
test_dataset = load_from_pickle("test_dataset")
val_dataset = load_from_pickle("val_dataset")
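
The `length` attribute of `MyData` counts the non-padding tokens in each sequence, assuming index `0` is reserved for padding (which is what the `np.equal(x, 0)` test suggests). A minimal pure-Python sketch of the same computation, on made-up data and with a hypothetical helper name:

```python
# Hypothetical padded batch: 0 is assumed to be the padding index.
padded = [
    [12, 7, 33, 0, 0],
    [5, 0, 0, 0, 0],
    [9, 4, 21, 8, 2],
]

def sequence_lengths(batch, pad_index=0):
    """Count the tokens in each row that are not padding."""
    return [sum(1 for token in row if token != pad_index) for row in batch]

lengths = sequence_lengths(padded)
print(lengths)  # [3, 1, 5]
```

These lengths are what each dataset item returns as `x_len`.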

In [None]:
train_dataset.batch_size

64

### Implementing Model

After the data has been preprocessed, transformed, and prepared, it is time to build the model, the so-called computation graph, that will be used to train our classifier. We are going to use a gated recurrent unit (GRU) network, a gated variant of the basic RNN that is generally better at retaining information across long sequences. The figure below shows a high-level overview of the model.

The model aims to learn representations and project them onto the classification task via a fully connected layer followed by a softmax operation, which produces values that sum to 1 and can be interpreted as probabilities. Essentially, the outputs of the RNN are mapped to a probability distribution over the predicted output classes. In the code, the model itself returns raw scores (logits); the softmax is applied inside `nn.CrossEntropyLoss` during training.

![alt txt](https://github.com/omarsar/nlp_pytorch_tensorflow_notebooks/blob/master/img/gru-model.png?raw=true)
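
To make the softmax step concrete, here is a small standalone sketch in plain Python. The scores are made up, and the `softmax` helper is written for illustration only; it is not the function the model calls.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    largest = max(scores)
    # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for six output classes
logits = [2.0, 0.5, -1.0, 0.0, 1.0, -0.5]
probs = softmax(logits)

# The predicted class is the one with the highest probability
predicted_class = probs.index(max(probs))
```

Because softmax preserves the ordering of the scores, the predicted class can be read directly from the largest raw score as well.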

## The model

In [None]:
class EmoGRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_units, batch_sz, output_size):
        super(EmoGRU, self).__init__()
        self.batch_sz = batch_sz
        self.hidden_units = hidden_units
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.output_size = output_size
        
        ## layers
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.dropout = nn.Dropout(p=0.5) # avoid overfitting
        self.gru = nn.GRU(self.embedding_dim, self.hidden_units)
        self.fc = nn.Linear(self.hidden_units, self.output_size)
    
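    def initialize_hidden_state(self, device):
        ## shape: num_layers X batch_size X hidden_units; batch_sz is fixed,
        ## so every batch fed to the model must contain exactly batch_sz examples
        return torch.zeros((1, self.batch_sz, self.hidden_units)).to(device)
    
    def forward(self, x, lens, device):
        ## x: max_len X batch_size (token indices); lens is accepted but not used here
        x = self.embedding(x) # max_len X batch_size X embedding_dim
        self.hidden = self.initialize_hidden_state(device)
        output, self.hidden = self.gru(x, self.hidden) # max_len X batch_size X hidden_units
        out = output[-1, :, :] # output at the last time step: batch_size X hidden_units
        out = self.dropout(out)
        out = self.fc(out) # raw class scores (logits): batch_size X output_size
        return out, self.hidden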
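
The tensor shapes flowing through the GRU are easy to get wrong, so the following sketch checks them in isolation. The sizes are made up and much smaller than those used in this notebook; it also shows that, for a single-layer unidirectional GRU, the output at the last time step is the same as the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, made-up sizes so the example runs quickly
seq_len, batch_size, embedding_dim, hidden_units = 5, 3, 4, 8

gru = nn.GRU(embedding_dim, hidden_units)            # expects (seq_len, batch, features)
x = torch.randn(seq_len, batch_size, embedding_dim)  # stands in for embedded tokens
h0 = torch.zeros(1, batch_size, hidden_units)        # (num_layers, batch, hidden)

output, h_n = gru(x, h0)

print(output.shape)  # torch.Size([5, 3, 8])
print(h_n.shape)     # torch.Size([1, 3, 8])

# For a single-layer, unidirectional GRU the last output equals the final hidden state
same = torch.allclose(output[-1], h_n[0])
```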

### Model sanity testing

In [None]:
# parameters
TRAIN_BUFFER_SIZE = 40000 # len(input_tensor_train)
VAL_BUFFER_SIZE = 5000 # len(input_tensor_val)
TEST_BUFFER_SIZE = 5000 # len(input_tensor_test)
BATCH_SIZE = 64
TRAIN_N_BATCH = TRAIN_BUFFER_SIZE // BATCH_SIZE
VAL_N_BATCH = VAL_BUFFER_SIZE // BATCH_SIZE
TEST_N_BATCH = TEST_BUFFER_SIZE // BATCH_SIZE

embedding_dim = 256
units = 1024
vocab_inp_size = 27291 # len(inputs.word2idx)
target_size = 6 # num_emotions

In [None]:
## sort the sequences within a batch by length, longest first; this is the ordering PyTorch's sequence-packing utilities expect
## read more here: https://towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972e
def sort_batch(X, y, lengths):
    "sort the batch by length"
    
    lengths, indx = lengths.sort(dim=0, descending=True)
    X = X[indx]
    y = y[indx]
    return X.transpose(0,1), y, lengths # transpose (batch x seq) => (seq x batch)
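
Sorting by length matters mainly when padded batches are packed before they reach the RNN, so that the recurrent layer skips the padding. The model in this notebook does not pack its input; the sketch below, with made-up sizes, shows how `pack_padded_sequence` from `torch.nn.utils.rnn` could be combined with a length-sorted batch.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Made-up batch of 3 embedded sequences, padded to length 4: (seq_len, batch, features)
x = torch.randn(4, 3, 2)
lengths = torch.tensor([4, 2, 1])  # already sorted, longest first

gru = nn.GRU(input_size=2, hidden_size=5)

packed = pack_padded_sequence(x, lengths)   # lengths are expected on the CPU
packed_out, h_n = gru(packed)               # hidden state defaults to zeros
output, out_lengths = pad_packed_sequence(packed_out)

# h_n holds the state at each sequence's true last step, not at the padded end
last_valid = torch.stack([output[n - 1, i] for i, n in enumerate(lengths.tolist())])
matches = torch.allclose(last_valid, h_n[0])
```

With packing, the final hidden state reflects the last real token of each sequence, whereas indexing `output[-1]` on an unpacked batch reads the state after the padding has been processed.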

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = EmoGRU(vocab_inp_size, embedding_dim, units, BATCH_SIZE, target_size)
model.to(device)

## obtain one batch from the data iterator
it = iter(train_dataset)
x, y, x_len = next(it)

## sort the batch by length first; needed for pack_padded_sequence (not used by this model yet)
xs, ys, lens = sort_batch(x, y, x_len)

print("Input size: ", xs.size())

output, _ = model(xs.to(device), lens, device)
print(output.size())

Input size:  torch.Size([69, 64])
torch.Size([64, 6])


### Setup Training
Now that we have tested the model, it is time to train it. We will define our optimization algorithm, learning rate, and other settings needed to train the model.

In [None]:
## Enabling cuda
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
model = EmoGRU(vocab_inp_size, embedding_dim, units, BATCH_SIZE, target_size)
model.to(device)

## loss criterion and optimizer for training
criterion = nn.CrossEntropyLoss() # the same as log_softmax + NLLLoss
optimizer = torch.optim.Adam(model.parameters())

def loss_function(y, prediction):
    """ CrossEntropyLoss expects outputs and class indices as target """
    ## convert from one-hot encoding to class indices
    target = torch.max(y, 1)[1]
    loss = criterion(prediction, target) 
    return loss   
    
def accuracy(target, logit):
    ''' Obtain accuracy (in percent) for one batch '''
    target = torch.max(target, 1)[1] # convert from one-hot encoding to class indices
    corrects = (torch.max(logit, 1)[1] == target).sum()
    # cast to float before dividing so the result is not truncated to an integer
    accuracy = 100.0 * corrects.float() / len(logit)
    return accuracy
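
The comment on `criterion` states that `nn.CrossEntropyLoss` is the same as `log_softmax` followed by `NLLLoss`. The sketch below checks that equivalence on made-up logits and one-hot targets, and shows the one-hot to class-index conversion used in `loss_function`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 6)              # made-up scores: 4 examples, 6 classes
one_hot = torch.eye(6)[[0, 3, 5, 3]]    # made-up one-hot targets
target = torch.max(one_hot, 1)[1]       # one-hot -> class indices

# CrossEntropyLoss takes raw logits and applies log_softmax internally
ce = nn.CrossEntropyLoss()(logits, target)

# The same value computed in two explicit steps
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

equivalent = torch.allclose(ce, nll)
```

This is also why the model should return raw scores: applying softmax in `forward` and then passing the result to `nn.CrossEntropyLoss` would normalize twice.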

### Training Model

Now we finally train the model.

In [None]:
## takes ~3 minutes

EPOCHS = 3

for epoch in range(EPOCHS):
    start = time.time()
    
    total_loss = 0
    train_accuracy, val_accuracy = 0, 0
    
    ### Training
    model.train() # enable dropout
    for (batch, (inp, targ, lens)) in enumerate(train_dataset):
        # the model re-initializes its hidden state on every forward pass
        predictions, _ = model(inp.permute(1, 0).to(device), lens, device)
              
        loss = loss_function(targ.to(device), predictions)
        # scaled for reporting only; targ.shape[1] is the number of classes
        batch_loss = loss.item() / int(targ.shape[1])
        total_loss += batch_loss
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        batch_accuracy = accuracy(targ.to(device), predictions)
        train_accuracy += batch_accuracy.item()
        
        if batch % 100 == 0:
            # note: despite the "Val." label, this is the training loss of the current batch
            print('Epoch {} Batch {} Val. Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss))
            
    ### Validating
    model.eval() # disable dropout
    with torch.no_grad(): # no gradients are needed for evaluation
        for (batch, (inp, targ, lens)) in enumerate(val_dataset):        
            predictions, _ = model(inp.permute(1, 0).to(device), lens, device)        
            batch_accuracy = accuracy(targ.to(device), predictions)
            val_accuracy += batch_accuracy.item()
    
    print('Epoch {} Loss {:.4f} -- Train Acc. {:.4f} -- Val Acc. {:.4f}'.format(epoch + 1, 
                                                             total_loss / TRAIN_N_BATCH, 
                                                             train_accuracy / TRAIN_N_BATCH,
                                                             val_accuracy / VAL_N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Val. Loss 0.2982
Epoch 1 Batch 100 Val. Loss 0.2278
Epoch 1 Batch 200 Val. Loss 0.2175
Epoch 1 Batch 300 Val. Loss 0.1060
Epoch 1 Batch 400 Val. Loss 0.0351
Epoch 1 Batch 500 Val. Loss 0.0328
Epoch 1 Batch 600 Val. Loss 0.0330
Epoch 1 Loss 0.1436 -- Train Acc. 66.0000 -- Val Acc. 90.0000
Time taken for 1 epoch 30.53244161605835 sec

Epoch 2 Batch 0 Val. Loss 0.0348
Epoch 2 Batch 100 Val. Loss 0.0174
Epoch 2 Batch 200 Val. Loss 0.0275
Epoch 2 Batch 300 Val. Loss 0.0281
Epoch 2 Batch 400 Val. Loss 0.0449
Epoch 2 Batch 500 Val. Loss 0.0149
Epoch 2 Batch 600 Val. Loss 0.0271
Epoch 2 Loss 0.0272 -- Train Acc. 92.0000 -- Val Acc. 91.0000
Time taken for 1 epoch 30.768707752227783 sec

Epoch 3 Batch 0 Val. Loss 0.0091
Epoch 3 Batch 100 Val. Loss 0.0153
Epoch 3 Batch 200 Val. Loss 0.0308
Epoch 3 Batch 300 Val. Loss 0.0387
Epoch 3 Batch 400 Val. Loss 0.0369
Epoch 3 Batch 500 Val. Loss 0.0239
Epoch 3 Batch 600 Val. Loss 0.0358
Epoch 3 Loss 0.0202 -- Train Acc. 93.0000 -- Val Acc. 

### Stopping the Model

How do we know when to stop training the model? One option is a technique called `early stopping`, which is not covered here but is widely used in deep learning: training is halted once performance on the validation set stops improving.
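
Early stopping typically monitors a validation metric and halts training when it stops improving. The following is a minimal sketch of one common variant, assuming a `patience` parameter that counts consecutive epochs without improvement. The function name and the loss values are made up for illustration and are not part of this notebook's training loop.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: improvement stalls after epoch 2
losses = [0.90, 0.60, 0.50, 0.55, 0.52, 0.51]
stop_at = early_stopping_epoch(losses, patience=2)
print(stop_at)  # 4
```

In a real training loop, the same bookkeeping would run after each validation pass, usually together with saving the best model seen so far.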

### Store the Model
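
The cell in this section saves the entire module object with `torch.save(model, ...)`, which ties the saved file to the exact class definition used at save time. The PyTorch serialization notes listed in the references recommend saving only the model's `state_dict` instead. Here is a minimal sketch of that alternative, using a tiny stand-in model and a temporary file (both are assumptions for illustration); the same pattern would apply to `EmoGRU`.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in model, used here only to keep the example small
model = nn.Linear(4, 2)

path = os.path.join(tempfile.mkdtemp(), "model_state.pt")  # hypothetical file name
torch.save(model.state_dict(), path)   # save only the parameters

restored = nn.Linear(4, 2)             # re-create the architecture first
restored.load_state_dict(torch.load(path))
restored.eval()                        # switch to inference mode

# Both models should now produce the same outputs
x = torch.randn(3, 4)
same = torch.allclose(model(x), restored(x))
```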


In [None]:
torch.save(model, "/gdrive/My Drive/pycon2019/emogru")


---

### Exercise - Implementing your deep learning model
Implement a model similar to the one above. Try using an LSTM instead of a GRU. Consult the PyTorch documentation and look for quick ways to improve the model, like adding a `Dropout` [layer](https://pytorch.org/docs/stable/_modules/torch/nn/modules/dropout.html). Also, add additional layers (i.e., make it deeper) to increase the model's capacity.

---



In [None]:
### YOUR CODE HERE

### YOUR CODE HERE

### References
- [Emotion Recognition with PyTorch](https://github.com/omarsar/emotion_recognition_pytorch/blob/master/Deep_Learning_Emotion_Recognition_PyTorch.ipynb)

- [Serialization Semantics by PyTorch](https://pytorch.org/docs/master/notes/serialization.html#recommend-saving-models)

- [Word embeddings in PyTorch](https://github.com/omarsar/phd_2017/blob/master/pytorch_word_embeddings.ipynb)
