## Torchtext tutorial


Made by [Andrea Sottana](http://github.com/AndreaSottana) and adapted from the following [tutorial](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/)

This notebook shows how to take your own custom dataset in its raw form (for example a .csv file) and use the `torchtext` library to do some cleaning and preprocessing (such as tokenization), build a vocabulary, and ultimately create an `Iterator` or `BucketIterator` object that you can feed into a neural network built with the `torch` library.

In [56]:
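# Field, TabularDataset, Iterator and BucketIterator are part of the older torchtext API;
# in torchtext 0.9 to 0.11 they live under torchtext.legacy.data instead.
from torchtext.data import Field, TabularDataset, Iterator, BucketIterator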
from torchtext import datasets
import torch
import torch.nn.functional as F
import pandas as pd
import nltk
import os
from tqdm import tqdm
import logging

logging.basicConfig(level=logging.INFO)  # show INFO-level log messages (we'll use this to print the loss during training)

A data `Field` specifies how you want a certain field to be processed, for example whether to tokenize it or lower-case the text. In our case we have one text field and one label field for the sentiment analysis. The label is an integer, where 0 is negative, 1 is neutral and 2 is positive.

In [5]:
TEXT = Field(sequential=True, tokenize='spacy', lower=True)
LABEL = Field(sequential=False, use_vocab=False)
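
As a rough illustration of what these two fields do to a single raw row, here is a plain-Python sketch. It is not torchtext code: `preprocess_text` and `preprocess_label` are hypothetical helpers, and whitespace splitting stands in for the spaCy tokenizer, which handles punctuation and contractions more carefully.

```python
def preprocess_text(raw, tokenize=str.split, lower=True):
    """Mimic a sequential text field: tokenize first, then optionally lower-case."""
    tokens = tokenize(raw)
    return [tok.lower() for tok in tokens] if lower else tokens


def preprocess_label(raw):
    """Mimic a non-sequential field without a vocabulary: the value is already a number."""
    return int(raw)


# One row of the dataset, as it would be read from the csv file (all strings).
example_text = preprocess_text("A warm , funny , engaging film .")
example_label = preprocess_label("2")

print(example_text)
print(example_label)
```

The text becomes a list of lower-cased tokens, while the label is only cast to an integer and never goes through a vocabulary.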

In [6]:
# remember to source the .envrc file in the terminal before launching this notebook to 
# ensure it can use the environment variables correctly.

folder = os.path.join(os.getenv('DATA_DIR'), 'movie_review_dataset')
train_dataset = pd.read_csv(os.path.join(folder, 'train_dataset.csv'))
valid_dataset = pd.read_csv(os.path.join(folder, 'valid_dataset.csv'))
test_dataset = pd.read_csv(os.path.join(folder, 'test_dataset.csv'))
valid_dataset

Unnamed: 0,text,labels,numerical_labels
0,It 's a lovely film with lovely performances b...,positive,2
1,"No one goes unindicted here , which is probabl...",neutral,1
2,And if you 're not nearly moved to tears by a ...,positive,2
3,"A warm , funny , engaging film .",positive,2
4,Uses sharp humor and insight into human nature...,positive,2
...,...,...,...
1096,it seems to me the film is about the art of ri...,negative,0
1097,It 's just disappointingly superficial -- a mo...,negative,0
1098,The title not only describes its main characte...,negative,0
1099,Sometimes it feels as if it might have been ma...,neutral,1


This tells the field what data to work on. Note that the fields we pass in must be in the same order as the columns in our csv file. For the columns we don't use, we pass in a tuple where the field element is None.

In [7]:
fields = [("text", TEXT), ("labels", None), ("numerical_labels", LABEL)]
fields

[('text', <torchtext.data.field.Field at 0x10e958668>),
 ('labels', None),
 ('numerical_labels', <torchtext.data.field.Field at 0x138793390>)]
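
To make the positional pairing concrete, here is a minimal sketch in plain Python. The `row_to_example` helper is hypothetical, and simple callables (`str.split`, `int`) stand in for real `Field` objects. Each column value is matched with the field at the same position, and columns whose field is `None` are dropped.

```python
def row_to_example(row, fields):
    """Pair each column with its (name, field) entry by position; skip None fields."""
    if len(row) != len(fields):
        raise ValueError("expected exactly one (name, field) pair per column")
    example = {}
    for value, (name, field) in zip(row, fields):
        if field is not None:  # None means: ignore this column
            example[name] = field(value)
    return example


# Stand-ins for the real fields: a tokenizer for the text, int() for the label.
sketch_fields = [("text", str.split), ("labels", None), ("numerical_labels", int)]
row = ["A warm , funny , engaging film .", "positive", "2"]

example = row_to_example(row, sketch_fields)
print(example)
```

If the order of `sketch_fields` did not match the column order, the tokenizer would be applied to the wrong column, which is why the order matters.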

The `TabularDataset` is one of the built-in Datasets in torchtext that handle common data formats; this one is good for .csv files. Other datasets, such as `LanguageModelingDataset` or `TranslationDataset`, are available.

In [8]:
train, valid, test = TabularDataset.splits(
    path = folder,
    train = 'train_dataset.csv',
    validation = 'valid_dataset.csv',
    test = 'test_dataset.csv',
    format = 'csv',
    skip_header = True,
    fields = fields
)

In [9]:
print(train[0], '\n')

# This is an Example object. The Example object bundles the attributes of a single data 
# point together. We also see that the text has already been tokenized for us, but has 
# not yet been converted to integers. 

print(train[0].__dict__, '\n')
print(train[0].__dict__.keys(), '\n')
print(train[0].text)

<torchtext.data.example.Example object at 0x139efcb00> 

{'text': ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '`', '`', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', '-', 'claud', 'van', 'damme', 'or', 'steven', 'segal', '.'], 'numerical_labels': '2'} 

dict_keys(['text', 'numerical_labels']) 

['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '`', '`', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', '-', 'claud', 'van', 'damme', 'or', 'steven', 'segal', '.']


The next step is constructing the mapping from words to ids. The vocabulary is normally built on the training set only.

In [10]:
TEXT.build_vocab(train)

Torchtext has its own class called Vocab for handling the vocabulary. The Vocab class holds a mapping from word to id in its `stoi` attribute and a reverse mapping in its `itos` attribute. Words that are not included in the vocabulary will be converted into `<unk>`.

In [11]:
print(TEXT.vocab.itos[12])
print(TEXT.vocab.stoi['it'])

it
12
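
The idea behind `stoi` and `itos` can be sketched in a few lines of plain Python. This is not the torchtext implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the choice of special tokens and the frequency-based ordering are assumptions made for the sketch.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Return (stoi, itos): special tokens first, then words by decreasing frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to ids, falling back to the unknown token for unseen words."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


corpus = [["a", "warm", "film"], ["a", "funny", "film"], ["a", "film"]]
stoi, itos = build_vocab(corpus)

# "boring" never appears in the training corpus, so it maps to the <unk> id.
ids = numericalize(["a", "boring", "film"], stoi)
print(itos)
print(ids)
```

Because the vocabulary is built from the training set only, any word that appears solely in the validation or test set ends up with the `<unk>` id.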


Now we need to create an Iterator to pass the data to our model.  

In torchvision and PyTorch, the processing and batching of data is handled by DataLoaders. In torchtext, the objects that do the same job are called Iterators. The basic functionality is the same, but Iterators, as we will see, have some convenient functionality that is unique to NLP.  

We also need to tell the BucketIterator which attribute to bucket the data on. In our case, we want to bucket based on the length of the `text` field, so we pass that in as the `sort_key` keyword argument. Batching sentences of similar length together keeps the amount of padding low.

In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(64, 64, 64),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True
)
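
Why bucketing by length helps can be seen in a simplified plain-Python sketch. This is only an approximation of the idea: the real `BucketIterator` also shuffles the data, and the helper names `make_batches` and `padding_count` are hypothetical.

```python
def make_batches(examples, batch_size, sort_by_length, pad_token="<pad>"):
    """Split examples into batches and pad each batch to its longest sentence."""
    ordered = sorted(examples, key=len) if sort_by_length else list(examples)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(tokens) for tokens in chunk)
        batches.append([tokens + [pad_token] * (width - len(tokens)) for tokens in chunk])
    return batches


def padding_count(batches, pad_token="<pad>"):
    """Count how many padding tokens were added across all batches."""
    return sum(tokens.count(pad_token) for batch in batches for tokens in batch)


# Six dummy sentences with lengths 2, 9, 3, 10, 2 and 8.
sentences = [["w"] * n for n in (2, 9, 3, 10, 2, 8)]

bucketed = make_batches(sentences, batch_size=3, sort_by_length=True)
unsorted = make_batches(sentences, batch_size=3, sort_by_length=False)

print(padding_count(bucketed), padding_count(unsorted))
```

Grouping sentences of similar length means each batch is padded only up to a nearby length, so far fewer padding tokens pass through the network.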

An alternative would be to use a standard `Iterator` for the test set: we do not need to shuffle or sort the test data, since we only output predictions for it once training has finished.  
The code would be:

In [12]:
train_iter, valid_iter = BucketIterator.splits(
    (train, valid),
    batch_sizes=(64, 64),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True
)

test_iter = Iterator(
    test, batch_size=64, device=device, sort=False, sort_within_batch=False
)

Now we need to convert the batch to a tuple in the form (x, y) where x is the independent variable (the input to the model) and y is the dependent variable (the supervision data).

In [14]:
class BatchWrapper:
    def __init__(self, iterator, feature, label: list):
        self.iterator = iterator
        self.feature = feature
        self.label = label

    def __iter__(self):
        for batch in self.iterator:
            x = getattr(batch, self.feature)  # we assume only one input in this wrapper
            if self.label is not None:  # we will concatenate y into a single tensor
                y = torch.cat(
                    [getattr(batch, lab).unsqueeze(1) for lab in self.label], dim=1
                    # this is just in case we have multiple labels
                ).float()
            else:
                y = torch.zeros(1)
            yield (x, y)

    def __len__(self):
        # Number of batches per epoch; this lets len() and tqdm's progress bar work.
        return len(self.iterator)

In [15]:
next(iter(train_iter))


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 38x64]
	[.numerical_labels]:[torch.LongTensor of size 64]

The example above works if we have multiple labels. If we only have one label, we can simplify the code as follows.

In [16]:
# works only for single label

class BatchWrapperSingleLabel:
    def __init__(self, iterator, feature, label: str):
        self.iterator = iterator
        self.feature = feature
        self.label = label

    def __iter__(self):
        for batch in self.iterator:
            x = getattr(batch, self.feature)  # we assume only one input in this wrapper
            if self.label is not None: 
                y = getattr(batch, self.label).unsqueeze(1).float()
            else:
                y = torch.zeros(1)
            yield (x, y)

    def __len__(self):
        # Number of batches per epoch; this lets len() and tqdm's progress bar work.
        return len(self.iterator)

Below is some code to play with instances of the `BatchWrapper` class to see what it's like.

In [17]:
train_dataloader = BatchWrapper(train_iter, "text", ["numerical_labels"])
valid_dataloader = BatchWrapper(valid_iter, "text", ["numerical_labels"])
test_dataloader = BatchWrapper(test_iter, "text", None)  # no labels given for testing

it = next(train_dataloader.__iter__())
it[0].shape, it[1].shape

(torch.Size([45, 64]), torch.Size([64, 1]))

In [18]:
list(train_iter)[0].numerical_labels.shape

torch.Size([64])

In [19]:
vars(list(test_iter)[0])

{'batch_size': 64,
 'dataset': <torchtext.data.dataset.TabularDataset at 0x139efc978>,
 'fields': dict_keys(['text', 'labels', 'numerical_labels']),
 'input_fields': ['text', 'numerical_labels'],
 'target_fields': [],
 'text': tensor([[ 437,   49,  666,   12,   72,    5,   16,  143,   12,    0,   47,   67,
            74, 4502,    5, 4171,    5,   76, 3640, 1831, 4011, 1376,   10, 1214,
             5,    5,   22,   46,   12, 1020,   62,   21,   49,   54,    5,   46,
            12, 2367, 2450,    0,   12,  250,   12,   10,   46, 3437,    0, 1034,
          1034, 1034,  140, 2960,   64,   24,   64,  571,   61,   61,   64,    0,
          1912, 3930,    0, 2361],
         [  92,   72,  221,   11,  230,   88,   46,  694,   11,   21,   46,    0,
           226,    6, 1230,    6, 1058,   11,    4,    6,    4,    6,   26, 1965,
            88, 2858, 1391,  970,   11,   10,    0,   31,    5,   11,  152,  946,
            11,   10,   12,    8,   11,   28,   10,   21, 1873,    4,  269,  437,
 

In [20]:
iterator = train_dataloader.__iter__()
print(len(list(iterator)))
print(type(list(iterator)))
print(iterator)

134
<class 'list'>
<generator object BatchWrapper.__iter__ at 0x1274adba0>


In [21]:
# We need a fresh generator, as the one above was exhausted by list(iterator).
# (Re-creating the wrappers is optional; calling __iter__() again is what matters.)

train_dataloader = BatchWrapper(train_iter, "text", ["numerical_labels"])
valid_dataloader = BatchWrapper(valid_iter, "text", ["numerical_labels"])
test_dataloader = BatchWrapper(test_iter, "text", None)  # no labels given for testing
iterator = train_dataloader.__iter__()
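
The generator returned by `__iter__` can be consumed only once, which is why a second `list(iterator)` call on the same generator still has type `list` but contains nothing. Here is a minimal sketch of that behaviour, with a hypothetical `Wrapper` class standing in for `BatchWrapper` and a plain list standing in for the torchtext iterator.

```python
class Wrapper:
    """Tiny stand-in for BatchWrapper: __iter__ is a generator function."""

    def __init__(self, batches):
        self.batches = batches

    def __iter__(self):
        for batch in self.batches:
            yield batch


loader = Wrapper(["batch0", "batch1", "batch2"])

gen = iter(loader)
first_pass = list(gen)           # consumes the generator
second_pass = list(gen)          # the same generator is now exhausted
fresh_pass = list(iter(loader))  # a new generator starts again from the beginning

print(first_pass, second_pass, fresh_pass)
```

In this sketch, asking the wrapper for a new generator is enough to iterate again, provided the underlying data source can itself be iterated more than once.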

In [22]:
next_ = next(iterator)
print(len(next_))
print(type(next_))
print(next_)

2
<class 'tuple'>
(tensor([[  35,    3, 2433,  ...,   14,    5,   67],
        [ 179, 3274,  879,  ..., 2205,  292,    7],
        [  36,  664, 2807,  ...,    4,   75,   21],
        ...,
        [  23,  265,   11,  ..., 7959,  438, 7130],
        [ 349,   71,  980,  ...,  555, 1201, 7663],
        [   2,    2,    2,  ...,    2,    2,    2]]), tensor([[2.],
        [1.],
        [2.],
        [2.],
        [1.],
        [2.],
        [2.],
        [2.],
        [2.],
        [2.],
        [2.],
        [2.],
        [0.],
        [0.],
        [2.],
        [0.],
        [2.],
        [2.],
        [0.],
        [0.],
        [0.],
        [2.],
        [0.],
        [1.],
        [0.],
        [2.],
        [1.],
        [1.],
        [2.],
        [2.],
        [0.],
        [2.],
        [0.],
        [0.],
        [2.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [2.],
        [0.],
        [0.],
        [2.],
        [2.],
        [1.],
        [2

In [23]:
print(next_[0].shape)

torch.Size([28, 64])


In [24]:
print(next(iterator)[0].shape)  # length of sentence, batch size
print(next(iterator)[0].shape)
print(next(iterator)[0].shape)

torch.Size([13, 64])
torch.Size([35, 64])
torch.Size([5, 64])


Now we are creating the instances of the `BatchWrapperSingleLabel` class, which we'll then use to feed into our neural network.

In [25]:
train_dataloader = BatchWrapperSingleLabel(train_iter, "text", "numerical_labels")
valid_dataloader = BatchWrapperSingleLabel(valid_iter, "text", "numerical_labels")
test_dataloader = BatchWrapperSingleLabel(test_iter, "text", None)  # no labels given for testing

Here is, finally, our neural network.

In [26]:
import torch.nn as nn
import torch.optim as optim

class SimpleLSTMBaseline(nn.Module):
    def __init__(self, vocab_length, hidden_dim, emb_dim=300, num_lstm_layers=1, num_linear_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_length, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=num_lstm_layers)
        self.linear_layers = nn.ModuleList()
        for _ in range(num_linear_layers):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.predictor = nn.Linear(hidden_dim, 3)  # 3 possible outcomes: negative, neutral, positive

    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
            
        preds = self.predictor(feature)  # outside for loop
        return preds


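model = SimpleLSTMBaseline(vocab_length=len(TEXT.vocab), hidden_dim=500, emb_dim=100, num_linear_layers=3)
model = model.to(device)  # keep the model on the same device as the batches produced by the iterators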

In [27]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.CrossEntropyLoss()
epochs = 2

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()  # turn on training mode
    for x, y in tqdm(train_dataloader):  # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y)  # CrossEntropyLoss expects (predictions, targets), in that order
        loss.backward()
        opt.step()

        # x has shape (sentence length, batch size), so x.size(1) is the batch size
        running_loss += loss.item() * x.size(1)

    epoch_loss = running_loss / len(train)

    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval()  # turn on evaluation mode
    with torch.no_grad():  # no gradients are needed for validation
        for x, y in valid_dataloader:
            preds = model(x)
            loss = loss_func(preds, y)
            val_loss += loss.item() * x.size(1)

    val_loss /= len(valid)
    logger.info('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

  0%|          | 0/134 [00:00<?, ?it/s]


RuntimeError: 1D target tensor expected, multi-target not supported

In [32]:
list(train_dataloader)[0][1].shape

torch.Size([64, 1])

In [34]:
model(list(train_dataloader)[0][0]).shape

torch.Size([64, 3])
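
These two shapes point at the cause of the error. The model returns one row of 3 scores per example (`[64, 3]`), while the targets arrive as a float tensor of shape `[64, 1]`. Cross-entropy over class indices expects a single integer class index per example, that is, a target of shape `[64]`. The plain-Python sketch below uses hypothetical helper names to show the computation and the target layout it needs. In the training loop, one possible fix (an assumption, not verified against this notebook) would be to pass the predictions first and convert the targets with `y.squeeze(1).long()`.

```python
import math


def cross_entropy(logits, targets):
    """Mean of -log softmax(scores)[target] over the batch.

    logits:  list of per-example score lists, shape (batch, classes)
    targets: flat list of integer class indices, shape (batch,)
    """
    if len(logits) != len(targets):
        raise ValueError("need exactly one target per example")
    total = 0.0
    for scores, target in zip(logits, targets):
        shift = max(scores)  # subtract the max for numerical stability
        log_norm = shift + math.log(sum(math.exp(s - shift) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)


def flatten_targets(column):
    """Turn a (batch, 1) column of floats into a flat list of integer class ids."""
    return [int(row[0]) for row in column]


# Two examples with uniform scores over the 3 classes (negative, neutral, positive).
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = flatten_targets([[2.0], [1.0]])

loss = cross_entropy(logits, targets)
print(targets, loss)
```

With uniform scores the loss equals log(3), the value expected from a model that has not yet learned anything about the three classes.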