## Torchtext tutorial


Made by [Andrea Sottana](http://github.com/AndreaSottana) and adapted from the following [tutorial](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/)

This notebook shows how you can use your own custom dataset starting from the raw form (for example a .csv file) with the `torch` library, to easily do some cleaning and preprocessing (such as tokenization), build a vocabulary, and ultimately create an `Iterator` or `BucketIterator` object that you can feed it into a neural network built using the `torch` library.

In [1]:
from torchtext.data import Field, TabularDataset, Iterator, BucketIterator
from torchtext import datasets
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import pandas as pd
import nltk
import os
from tqdm import tqdm
import logging


logging.basicConfig(level=20)  # shows info level loggings (we'll use this to print the loss during training)

## Dataset 1: multi-class classification, single label

A data `Field` specifies how you want a certain field to be processed, for example whether you want to tokenize it, lower case the text etc. In our case we have one text field, and one label field for the sentiment analysis, which we represent with an integer, where 0 is negative, 1 is neutral and 2 is positive.

In [2]:
TEXT = Field(sequential=True, tokenize='spacy', lower=True)
LABEL = Field(sequential=False, use_vocab=False)

In [3]:
# remember to source the .envrc file in the terminal before launching this notebook to 
# ensure it can use the environment variables correctly.

folder = os.path.join(os.getenv('DATA_DIR'), 'movie_review_dataset')
train_dataset = pd.read_csv(os.path.join(folder, 'train_dataset.csv'))
valid_dataset = pd.read_csv(os.path.join(folder, 'valid_dataset.csv'))
test_dataset = pd.read_csv(os.path.join(folder, 'test_dataset.csv'))
train_dataset

Unnamed: 0,text,labels,numerical_labels
0,The Rock is destined to be the 21st Century 's...,positive,2
1,The gorgeously elaborate continuation of `` Th...,positive,2
2,Singer\/composer Bryan Adams contributes a sle...,positive,2
3,You 'd think by now America would have had eno...,neutral,1
4,Yet the act is still charming here .,positive,2
...,...,...,...
8539,A real snooze .,negative,0
8540,No surprises .,negative,0
8541,We 've seen the hippie-turned-yuppie plot befo...,positive,2
8542,Her fans walked out muttering words like `` ho...,negative,0


This tells the field what data to work on. Note that the fields we pass in must be in the same order as the columns in our csv file. For the columns we don't use, we pass in a tuple where the field element is None.

In [4]:
fields = [("text", TEXT), ("labels", None), ("numerical_labels", LABEL)]
fields

[('text', <torchtext.data.field.Field at 0x13614ad68>),
 ('labels', None),
 ('numerical_labels', <torchtext.data.field.Field at 0x137ce70b8>)]

The `TabularDataset` is one of the built-in Datasets in torchtext that handle common data formats; this one is good for .csv files. Other datasets, such as `LanguageModelingDataset` or `TranslationDataset`, are available.

In [5]:
train, valid, test = TabularDataset.splits(
    path = folder,
    train = 'train_dataset.csv',
    validation = 'valid_dataset.csv',
    test = 'test_dataset.csv',
    format = 'csv',
    skip_header = True,
    fields = fields
)

In [6]:
print(train[0], '\n')

# This is an Example object. The Example object bundles the attributes of a single data 
# point together. We also see that the text has already been tokenized for us, but has 
# not yet been converted to integers. 

print(train[0].__dict__, '\n')
print(train[0].__dict__.keys(), '\n')
print(train[0].text)

<torchtext.data.example.Example object at 0x139ffd5c0> 

{'text': ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '`', '`', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', '-', 'claud', 'van', 'damme', 'or', 'steven', 'segal', '.'], 'numerical_labels': '2'} 

dict_keys(['text', 'numerical_labels']) 

['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '`', '`', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', '-', 'claud', 'van', 'damme', 'or', 'steven', 'segal', '.']


The next step is constructing the mapping from words to ids. The vocabulary is normally built on the training set only

In [7]:
TEXT.build_vocab(train)

Torchtext has its own class called Vocab for handling the vocabulary. The Vocab class holds a mapping from word to id in its `stoi` attribute and a reverse mapping in its `itos` attribute. Words that are not included in the vocabulary will be converted into `<unk>`.

In [8]:
print(TEXT.vocab.itos[12])
print(TEXT.vocab.stoi['it'])

it
12


Now we need to create an Iterator to pass the data to our model.  

In torchvision and PyTorch, the processing and batching of data is handled by DataLoaders. For some reason, torchtext has renamed the objects that do the exact same thing to Iterators. The basic functionality is the same, but Iterators, as we will see, have some convenient functionality that is unique to NLP.  

We also need to tell the BucketIterator what attribute you want to bucket the data on. In our case, we want to bucket based on the lengths of the comment_text field, so we pass that in as a keyword argument. 

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(64, 64, 64),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True
)

An alternative would be to use a standard iterator for the test set, given we do not need to shuffle the data since we'll be outputting the predictions at the end of training.  
The code would be:

In [10]:
train_iter, valid_iter = BucketIterator.splits(
    (train, valid),
    batch_sizes=(64, 64),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True
)

test_iter = Iterator(
    test, batch_size=64, device=device, sort=False, sort_within_batch=False
)

Let's see what a `BucketIterator` looks like.

In [11]:
print(type(train_iter))
next(iter(train_iter))

<class 'torchtext.data.iterator.BucketIterator'>



[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 26x64]
	[.numerical_labels]:[torch.LongTensor of size 64]

In [12]:
train_dataloader = list(train_iter)
valid_dataloader = list(valid_iter)
test_dataloader = list(test_iter)

In [13]:
it = train_dataloader[0]
it.text.shape, it.numerical_labels.shape

(torch.Size([8, 64]), torch.Size([64]))

In [14]:
vars(list(test_iter)[0])

{'batch_size': 64,
 'dataset': <torchtext.data.dataset.TabularDataset at 0x139ffd710>,
 'fields': dict_keys(['text', 'labels', 'numerical_labels']),
 'input_fields': ['text', 'numerical_labels'],
 'target_fields': [],
 'text': tensor([[ 106,    5,    3,  ...,   16,  138,  436],
         [1790, 4857,  147,  ...,   54, 4456,    5],
         [  50,    4,   10,  ...,   34, 4359, 2691],
         ...,
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1]]),
 'numerical_labels': tensor([0, 0, 0, 0, 1, 1, 0, 0, 2, 2, 2, 0, 2, 1, 0, 0, 0, 2, 2, 0, 0, 1, 2, 0,
         2, 2, 0, 0, 0, 1, 0, 0, 2, 0, 1, 1, 0, 2, 1, 2, 0, 0, 1, 0, 0, 0, 2, 0,
         1, 1, 1, 0, 0, 0, 0, 2, 2, 0, 0, 0, 2, 1, 0, 0])}

In [15]:
iterator = train_dataloader
print(len(list(iterator)))
print(type(list(iterator)))
print(iterator)

134
<class 'list'>
[
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 8x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 22x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 12x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 18x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 13x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 23x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 16x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size

In [16]:
train_dataloader = list(train_iter)
next_ = train_dataloader[0]
print(len(next_))
print(type(next_))
print(next_.text.shape, next_.numerical_labels.shape)
print(next_.text, next_.numerical_labels)

64
<class 'torchtext.data.batch.Batch'>
torch.Size([10, 64]) torch.Size([64])
tensor([[   12,    54,  3448,    12,  3975,     3,     5,   883,    16,    55,
            72,    32,     5,   763,   542,  3114,     3,     3,    12,    22,
             3,  7671,    12,     5,    56,    55,     6,  6724,    23,    24,
            15,    54,    16,     5,    12,  2096,  2339,  7945,    76,  1278,
             3,   701,    54,    12,   683,    22,    46,    29,     5,     8,
         12173,    14,    12,   151,     6,     5,   390,    30,    50,  3259,
            36, 12647,    31,    25],
        [ 9821,    11,  4078,    11,    11,   740,  2316,     4,    12,    11,
           105,   409,   471,    11,     6,   700,   368,    17,    11,  4950,
           121,  7341,    11,  2791,  9825,    11,  2311,  3334,   738,    62,
          1213,    98,    12, 10711,   330,     4,    74,  5101,    35,  7395,
            20,     4,    98,    11,    41,  4947,    33,  2287,  7971,  8985,
           236,

Here is, finally, our neural network.

In [17]:
class LSTMultiClass(nn.Module):
    def __init__(self, vocab_length, hidden_dim, emb_dim=300, num_lstm_layers=1, num_linear_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_length, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=num_lstm_layers)
        self.linear_layers = nn.ModuleList()
        self.linear_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_linear_layers)]
        )
        self.sigm = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.predictor = nn.Linear(hidden_dim, 3)  # 3 possible outcomes: negative, neutral, positive

    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))  # hidden state, cell state (we don't need cell state now)
        feature = hdn[-1, :, :]  # taking the last output from the LSTM (we're dealing with many-to-one problem)
        for layer in self.linear_layers:
            feature = layer(feature)
            feature = self.sigm(feature)
         
        preds = self.predictor(feature)
        preds = self.tanh(preds)
        return preds


model = LSTMultiClass(vocab_length=len(TEXT.vocab), hidden_dim=500, emb_dim=100, num_linear_layers=3)

In [18]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.CrossEntropyLoss()
epochs = 3

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() # turn on training mode
    for batch in tqdm(train_dataloader):
        x = batch.text  # independent variable (the input to the model)
        y = batch.numerical_labels.unsqueeze(1).float()  # dependent variable (the supervision data)
        opt.zero_grad()
        predictions = model(x)
        loss = loss_func(predictions, y[:,0].long()) 
        loss.backward()
        opt.step()

        running_loss += loss.data * x.size(0)
    epoch_loss = running_loss / len(train)

    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    with torch.no_grad():
        for batch in valid_dataloader:
            x = batch.text
            y = batch.numerical_labels.unsqueeze(1).float()
            predictions = model(x)
            loss = loss_func(predictions, y[:,0].long())  
            val_loss += loss.data * x.size(0)

    val_loss /= len(valid)
    logger.info('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

100%|██████████| 134/134 [00:43<00:00,  3.08it/s]
INFO:__main__:Epoch: 1, Training Loss: 0.4888, Validation Loss: 0.5559
100%|██████████| 134/134 [00:48<00:00,  2.77it/s]
INFO:__main__:Epoch: 2, Training Loss: 0.4892, Validation Loss: 0.5559
100%|██████████| 134/134 [00:42<00:00,  3.13it/s]
INFO:__main__:Epoch: 3, Training Loss: 0.4892, Validation Loss: 0.5559


## Dataset 2: binary classification, multiple labels

In [20]:
folder = os.path.join(os.getenv('DATA_DIR'), 'toxic_comments_dataset')
train_dataset = pd.read_csv(os.path.join(folder, 'train_dataset.csv'))
valid_dataset = pd.read_csv(os.path.join(folder, 'valid_dataset.csv'))
test_dataset = pd.read_csv(os.path.join(folder, 'test_dataset.csv'))
train_dataset

Unnamed: 0,comment_text,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,Explanation\nWhy the edits made under my usern...,0000997932d777bf,0,0,0,0,0,0
1,D'aww! He matches this background colour I'm s...,000103f0d9cfb60f,0,0,0,0,0,0
2,"Hey man, I'm really not trying to edit war. It...",000113f07ec002fd,0,0,0,0,0,0
3,"""\nMore\nI can't make any real suggestions on ...",0001b41b1c6bb37e,0,0,0,0,0,0
4,"You, sir, are my hero. Any chance you remember...",0001d958c54c6e35,0,0,0,0,0,0
5,"""\n\nCongratulations from me as well, use the ...",00025465d4725e87,0,0,0,0,0,0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,0002bcb3da6cb337,1,1,1,0,1,0
7,Your vandalism to the Matt Shirvington article...,00031b1e95af7921,0,0,0,0,0,0
8,Sorry if the word 'nonsense' was offensive to ...,00037261f536c51d,0,0,0,0,0,0
9,alignment on this subject and which are contra...,00040093b2687caa,0,0,0,0,0,0


In [21]:
TEXT = Field(sequential=True, tokenize='spacy', lower=True)
LABEL = Field(sequential=False, use_vocab=False, dtype=torch.long)

In [22]:
fields = [
    ("comment_text", TEXT),
    ("id", None),  
    ("toxic", LABEL), 
    ("severe_toxic", LABEL),
    ("obscene", LABEL),
    ("threat", LABEL),
    ("insult", LABEL),
    ("identity_hate", LABEL)
]

fields

[('comment_text', <torchtext.data.field.Field at 0x13f90b668>),
 ('id', None),
 ('toxic', <torchtext.data.field.Field at 0x13f91dd30>),
 ('severe_toxic', <torchtext.data.field.Field at 0x13f91dd30>),
 ('obscene', <torchtext.data.field.Field at 0x13f91dd30>),
 ('threat', <torchtext.data.field.Field at 0x13f91dd30>),
 ('insult', <torchtext.data.field.Field at 0x13f91dd30>),
 ('identity_hate', <torchtext.data.field.Field at 0x13f91dd30>)]

In [23]:
train, valid, test = TabularDataset.splits(
    path = folder,
    train = 'train_dataset.csv',
    validation = 'valid_dataset.csv',
    test = 'test_dataset.csv',
    format = 'csv',
    skip_header = True,
    fields = fields
)

In [24]:
TEXT.build_vocab(train, min_freq=2)

In [25]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(7, 7, 7),
    device=device,
    sort_key=lambda x: len(x.comment_text),
    sort_within_batch=True
)

In [26]:
vars(list(train_iter)[0])

{'batch_size': 4,
 'dataset': <torchtext.data.dataset.TabularDataset at 0x13f6522b0>,
 'fields': dict_keys(['comment_text', 'id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']),
 'input_fields': ['comment_text',
  'toxic',
  'severe_toxic',
  'obscene',
  'threat',
  'insult',
  'identity_hate'],
 'target_fields': [],
 'comment_text': tensor([[  5,   5,  88,   5],
         [ 34,  15,   7,  15],
         [ 36, 209,   0,  93],
         ...,
         [108,   1,   1,   1],
         [ 39,   1,   1,   1],
         [  5,   1,   1,   1]]),
 'toxic': tensor([0, 0, 0, 0]),
 'severe_toxic': tensor([0, 0, 0, 0]),
 'obscene': tensor([0, 0, 0, 0]),
 'threat': tensor([0, 0, 0, 0]),
 'insult': tensor([0, 0, 0, 0]),
 'identity_hate': tensor([0, 0, 0, 0])}

In [27]:
train_dataloader = list(train_iter)
valid_dataloader = list(valid_iter)
#test_dataloader = list(test_iter)

In [28]:
class LSTMBinary(nn.Module):
    def __init__(self, vocab_length, hidden_dim, emb_dim=300, num_lstm_layers=1, num_linear_layers=1):
        super().__init__()
        self.num_lstm_layers = num_lstm_layers
        self.num_linear_layers = num_linear_layers
        self.embedding = nn.Embedding(vocab_length, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers = num_lstm_layers)
        self.linear_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_linear_layers)]
        )
        self.output = nn.Linear(hidden_dim, 6)  # 6 different binary labels
        self.sigm = nn.Sigmoid()
            
    def forward(self, seq):
        preds = self.embedding(seq)
        preds, _ = self.lstm(preds)
        preds = preds[-1, :, :]
        for layer in self.linear_layers:
            preds = self.sigm(layer(preds))
        preds = self.output(preds)
        
        return preds
    
model = LSTMBinary(vocab_length=len(TEXT.vocab), hidden_dim=500, emb_dim=100, num_linear_layers=3)

In [29]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()
epochs = 3
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    
    model.train()
    
    for batch in tqdm(train_dataloader):
        x = batch.comment_text
        y = torch.cat([getattr(batch, label).unsqueeze(1) for label in labels], dim=1).float()
        opt.zero_grad()
        predictions = model(x)
        loss = loss_func(predictions, y)
        loss.backward()
        opt.step()     
        running_loss += loss.data * x.size(0)
    epoch_loss = running_loss / len(train)
    
    valid_loss = 0.0
    model.eval()
    with torch.no_grad():
        for batch in valid_dataloader:
            x = batch.comment_text
            y = torch.cat([getattr(batch, label).unsqueeze(1) for label in labels], dim=1).float()
            predictions = model(x)
            loss = loss_func(predictions, y)
            valid_loss += loss.data * x.size(0)

        valid_loss /= len(valid)
    logger.info('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, valid_loss))

100%|██████████| 4/4 [00:05<00:00,  1.46s/it]
INFO:__main__:Epoch: 1, Training Loss: 1.1390, Validation Loss: 5.9476
100%|██████████| 4/4 [00:05<00:00,  1.28s/it]
INFO:__main__:Epoch: 2, Training Loss: 1.4952, Validation Loss: 5.5875
100%|██████████| 4/4 [00:05<00:00,  1.30s/it]
INFO:__main__:Epoch: 3, Training Loss: 1.9336, Validation Loss: 4.8728


In [38]:
l = list(train_iter)
l

[
 [torchtext.data.batch.Batch of size 7]
 	[.comment_text]:[torch.LongTensor of size 133x7]
 	[.toxic]:[torch.LongTensor of size 7]
 	[.severe_toxic]:[torch.LongTensor of size 7]
 	[.obscene]:[torch.LongTensor of size 7]
 	[.threat]:[torch.LongTensor of size 7]
 	[.insult]:[torch.LongTensor of size 7]
 	[.identity_hate]:[torch.LongTensor of size 7], 
 [torchtext.data.batch.Batch of size 7]
 	[.comment_text]:[torch.LongTensor of size 48x7]
 	[.toxic]:[torch.LongTensor of size 7]
 	[.severe_toxic]:[torch.LongTensor of size 7]
 	[.obscene]:[torch.LongTensor of size 7]
 	[.threat]:[torch.LongTensor of size 7]
 	[.insult]:[torch.LongTensor of size 7]
 	[.identity_hate]:[torch.LongTensor of size 7], 
 [torchtext.data.batch.Batch of size 4]
 	[.comment_text]:[torch.LongTensor of size 587x4]
 	[.toxic]:[torch.LongTensor of size 4]
 	[.severe_toxic]:[torch.LongTensor of size 4]
 	[.obscene]:[torch.LongTensor of size 4]
 	[.threat]:[torch.LongTensor of size 4]
 	[.insult]:[torch.LongTensor of s

In [39]:
l[2].comment_text.shape

torch.Size([587, 4])

In [40]:
TEXT.vocab.itos[1]

'<pad>'

In [41]:
for i in range(len(l[0].comment_text[1])):
    print("\n NEW SENTENCE \n")
    print(' '.join([TEXT.vocab.itos[x] for x in l[0].comment_text[:,i]]))


 NEW SENTENCE 

" 
 more 
 i <unk> n't make any <unk> <unk> on improvement - i <unk> if the <unk> <unk> should be later on , or a <unk> of " " <unk> of <unk> " "   <unk> think the references may need <unk> so that they are <unk> in the <unk> <unk> format <unk> <unk> format <unk> . i can do that later on , if no - one <unk> does first - if you have any <unk> for formatting <unk> on references or want to do it yourself please <unk> me know . 

 there <unk> to be a <unk> on articles for review so i <unk> there may be a <unk> <unk> a <unk> <unk> up . it 's listed in the relevant <unk> <unk> wikipedia : <unk>   "

 NEW SENTENCE 

" 

  <unk> are not always <unk> ! 

 under <unk> it is <unk> that " " a <unk> always has <unk> <unk> <unk> . " " this <unk> is simply not <unk> ! <unk> to <unk> <unk> , " " the rather <unk> <unk> <unk> are by <unk> the most <unk> <unk> . " " <unk> someone really need to take a look at his <unk> and <unk> <unk> <unk> of it <unk> i <unk> see a <unk> <unk> of <unk> 