## Torchtext tutorial


Made by [Andrea Sottana](http://github.com/AndreaSottana) and adapted from the following [tutorial](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/)

This notebook shows how you can use your own custom dataset starting from the raw form (for example a .csv file) with the `torch` library, to easily do some cleaning and preprocessing (such as tokenization), build a vocabulary, and ultimately create an `Iterator` or `BucketIterator` object that you can feed it into a neural network built using the `torch` library.

In [1]:
from torchtext.data import Field, TabularDataset, Iterator, BucketIterator
from torchtext import datasets
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import pandas as pd
import nltk
import os
from tqdm import tqdm
import logging


logging.basicConfig(level=20)  # shows info level loggings (we'll use this to print the loss during training)

## Dataset 1: multi-class classification, single label

A data `Field` specifies how you want a certain field to be processed, for example whether you want to tokenize it, lower case the text etc. In our case we have one text field, and one label field for the sentiment analysis, which we represent with an integer, where 0 is negative, 1 is neutral and 2 is positive.

In [2]:
TEXT = Field(sequential=True, tokenize='spacy', lower=True)
LABEL = Field(sequential=False, use_vocab=False)

In [22]:
# remember to source the .envrc file in the terminal before launching this notebook to 
# ensure it can use the environment variables correctly.

folder = os.path.join(os.getenv('DATA_DIR'), 'movie_review_dataset')
train_dataset = pd.read_csv(os.path.join(folder, 'train_dataset.csv'))
valid_dataset = pd.read_csv(os.path.join(folder, 'valid_dataset.csv'))
test_dataset = pd.read_csv(os.path.join(folder, 'test_dataset.csv'))
train_dataset

Unnamed: 0,text,labels,numerical_labels
0,The Rock is destined to be the 21st Century 's...,positive,2
1,The gorgeously elaborate continuation of `` Th...,positive,2
2,Singer\/composer Bryan Adams contributes a sle...,positive,2
3,You 'd think by now America would have had eno...,neutral,1
4,Yet the act is still charming here .,positive,2
...,...,...,...
8539,A real snooze .,negative,0
8540,No surprises .,negative,0
8541,We 've seen the hippie-turned-yuppie plot befo...,positive,2
8542,Her fans walked out muttering words like `` ho...,negative,0


This tells the field what data to work on. Note that the fields we pass in must be in the same order as the columns in our csv file. For the columns we don't use, we pass in a tuple where the field element is None.

In [23]:
fields = [("text", TEXT), ("labels", None), ("numerical_labels", LABEL)]
fields

[('text', <torchtext.data.field.Field at 0x136a4be10>),
 ('labels', None),
 ('numerical_labels', <torchtext.data.field.Field at 0x1383d40b8>)]

The `TabularDataset` is one of the built-in Datasets in torchtext that handle common data formats; this one is good for .csv files. Other datasets, such as `LanguageModelingDataset` or `TranslationDataset`, are available.

In [24]:
train, valid, test = TabularDataset.splits(
    path = folder,
    train = 'train_dataset.csv',
    validation = 'valid_dataset.csv',
    test = 'test_dataset.csv',
    format = 'csv',
    skip_header = True,
    fields = fields
)

In [25]:
print(train[0], '\n')

# This is an Example object. The Example object bundles the attributes of a single data 
# point together. We also see that the text has already been tokenized for us, but has 
# not yet been converted to integers. 

print(train[0].__dict__, '\n')
print(train[0].__dict__.keys(), '\n')
print(train[0].text)

<torchtext.data.example.Example object at 0x1401d1320> 

{'text': ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '`', '`', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', '-', 'claud', 'van', 'damme', 'or', 'steven', 'segal', '.'], 'numerical_labels': '2'} 

dict_keys(['text', 'numerical_labels']) 

['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '`', '`', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', '-', 'claud', 'van', 'damme', 'or', 'steven', 'segal', '.']


The next step is constructing the mapping from words to ids. The vocabulary is normally built on the training set only

In [26]:
TEXT.build_vocab(train)

Torchtext has its own class called Vocab for handling the vocabulary. The Vocab class holds a mapping from word to id in its `stoi` attribute and a reverse mapping in its `itos` attribute. Words that are not included in the vocabulary will be converted into `<unk>`.

In [27]:
print(TEXT.vocab.itos[12])
print(TEXT.vocab.stoi['it'])

it
12


Now we need to create an Iterator to pass the data to our model.  

In torchvision and PyTorch, the processing and batching of data is handled by DataLoaders. For some reason, torchtext has renamed the objects that do the exact same thing to Iterators. The basic functionality is the same, but Iterators, as we will see, have some convenient functionality that is unique to NLP.  

We also need to tell the BucketIterator what attribute you want to bucket the data on. In our case, we want to bucket based on the lengths of the comment_text field, so we pass that in as a keyword argument. 

In [28]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(64, 64, 64),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True
)

An alternative would be to use a standard iterator for the test set, given we do not need to shuffle the data since we'll be outputting the predictions at the end of training.  
The code would be:

In [29]:
train_iter, valid_iter = BucketIterator.splits(
    (train, valid),
    batch_sizes=(64, 64),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True
)

test_iter = Iterator(
    test, batch_size=64, device=device, sort=False, sort_within_batch=False
)

Let's see what a `BucketIterator` looks like.

In [30]:
print(type(train_iter))
next(iter(train_iter))

<class 'torchtext.data.iterator.BucketIterator'>



[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 31x64]
	[.numerical_labels]:[torch.LongTensor of size 64]

In [31]:
train_dataloader = list(train_iter)
valid_dataloader = list(valid_iter)
test_dataloader = list(test_iter)

In [32]:
it = train_dataloader[0]
it.text.shape, it.numerical_labels.shape

(torch.Size([23, 64]), torch.Size([64]))

In [33]:
vars(list(test_iter)[0])

{'batch_size': 64,
 'dataset': <torchtext.data.dataset.TabularDataset at 0x1401d11d0>,
 'fields': dict_keys(['text', 'labels', 'numerical_labels']),
 'input_fields': ['text', 'numerical_labels'],
 'target_fields': [],
 'text': tensor([[  80,    3,   15,  ...,   22,   13,    3],
         [3645, 2184,   38,  ..., 1083,    0,  120],
         [   0,   10,   23,  ...,  839,    0,  976],
         ...,
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1]]),
 'numerical_labels': tensor([0, 0, 0, 0, 0, 2, 1, 0, 1, 2, 1, 1, 2, 2, 0, 0, 1, 2, 1, 0, 2, 0, 2, 0,
         0, 1, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 2, 2, 0,
         0, 2, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2, 0])}

In [34]:
iterator = train_dataloader
print(len(list(iterator)))
print(type(list(iterator)))
print(iterator)

134
<class 'list'>
[
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 23x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 31x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 30x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 38x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 9x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 20x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 21x64]
	[.numerical_labels]:[torch.LongTensor of size 64], 
[torchtext.data.batch.Batch of size

In [35]:
train_dataloader = list(train_iter)
next_ = train_dataloader[0]
print(len(next_))
print(type(next_))
print(next_.text.shape, next_.numerical_labels.shape)
print(next_.text, next_.numerical_labels)

64
<class 'torchtext.data.batch.Batch'>
torch.Size([20, 64]) torch.Size([64])
tensor([[   12,    21,   235,  ...,     3,    12, 10634],
        [   11,  2835,    22,  ...,    20,  2899,    14],
        [    5,  5580,   776,  ...,   325,  5804,    62],
        ...,
        [  292,   804,  7276,  ...,   587,    14,    40],
        [   75,   387,   545,  ...,  1318,    17,    24],
        [    2,     2,     2,  ...,     2,     2,     2]]) tensor([2, 1, 0, 2, 0, 2, 2, 1, 2, 1, 1, 2, 0, 0, 1, 0, 0, 2, 2, 1, 0, 0, 0, 1,
        2, 0, 2, 2, 0, 2, 0, 1, 0, 0, 2, 0, 0, 0, 2, 2, 1, 0, 2, 2, 0, 0, 2, 2,
        2, 0, 2, 0, 1, 2, 0, 0, 2, 2, 2, 2, 0, 1, 2, 0])


Here is, finally, our neural network.

In [42]:
class LSTMultiClass(nn.Module):
    def __init__(self, vocab_length, hidden_dim, emb_dim=300, num_lstm_layers=1, num_linear_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_length, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_lstm_layers)
        self.linear_layers = nn.ModuleList()
        self.linear_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_linear_layers)]
        )
        self.tanh = nn.Tanh()
        self.predictor = nn.Linear(hidden_dim, 3)  # 3 possible outcomes: negative, neutral, positive

    def forward(self, seq):
        emb = self.embedding(seq)
        all_hidden, (h_n, c_n) = self.lstm(emb)
        feature = h_n.squeeze() # taking the last output from the LSTM (we're dealing with many-to-one problem)
        for layer in self.linear_layers:
            feature = layer(feature)
            feature = self.tanh(feature)
         
        preds = self.predictor(feature)
        return preds


model = LSTMultiClass(vocab_length=len(TEXT.vocab), hidden_dim=500, emb_dim=100, num_linear_layers=3)

In [43]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.CrossEntropyLoss()
epochs = 3

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() # turn on training mode
    for batch in tqdm(train_dataloader):
        x = batch.text  # independent variable (the input to the model)
        y = batch.numerical_labels.long()  # dependent variable (the supervision data)
        opt.zero_grad()
        predictions = model(x)
        loss = loss_func(predictions, y)
        loss.backward()
        opt.step()

        running_loss += loss.data * x.size(0)
    epoch_loss = running_loss / len(train)

    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    with torch.no_grad():
        for batch in valid_dataloader:
            x = batch.text
            y = batch.numerical_labels.long()
            predictions = model(x)
            loss = loss_func(predictions, y)
            val_loss += loss.data * x.size(0)

    val_loss /= len(valid)
    logger.info('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

100%|██████████| 134/134 [00:43<00:00,  3.11it/s]
INFO:__main__:Epoch: 1, Training Loss: 0.4095, Validation Loss: 0.4699
100%|██████████| 134/134 [00:42<00:00,  3.12it/s]
INFO:__main__:Epoch: 2, Training Loss: 0.3681, Validation Loss: 0.4746
 20%|██        | 27/134 [00:07<00:29,  3.62it/s]


KeyboardInterrupt: 

## Dataset 2: binary classification, multiple labels

In [61]:
folder = os.path.join(os.getenv('DATA_DIR'), 'toxic_comments_dataset')
train_dataset = pd.read_csv(os.path.join(folder, 'train_dataset.csv'))
valid_dataset = pd.read_csv(os.path.join(folder, 'valid_dataset.csv'))
test_dataset = pd.read_csv(os.path.join(folder, 'test_dataset.csv'))
train_dataset

Unnamed: 0,comment_text,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,Explanation\nWhy the edits made under my usern...,0000997932d777bf,0,0,0,0,0,0
1,D'aww! He matches this background colour I'm s...,000103f0d9cfb60f,0,0,0,0,0,0
2,"Hey man, I'm really not trying to edit war. It...",000113f07ec002fd,0,0,0,0,0,0
3,"""\nMore\nI can't make any real suggestions on ...",0001b41b1c6bb37e,0,0,0,0,0,0
4,"You, sir, are my hero. Any chance you remember...",0001d958c54c6e35,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
4995,"""\n\nHello Marcruhwedell, and Welcome to Wikip...",0d39d2eaa40a6241,0,0,0,0,0,0
4996,"...that's why I did ....cheers, (talk · cont...",0d3b57f3b2db3f03,0,0,0,0,0,0
4997,"No, it's not a delayed reaction\n\nI just happ...",0d3bb1f12d90a6a2,1,0,0,0,1,0
4998,"""\n\nA slight difference with you\nI have to d...",0d3be13abf1bec52,0,0,0,0,0,0


In [62]:
TEXT = Field(sequential=True, tokenize='spacy', lower=True)
LABEL = Field(sequential=False, use_vocab=False, dtype=torch.long)

In [63]:
fields = [
    ("comment_text", TEXT),
    ("id", None),  
    ("toxic", LABEL), 
    ("severe_toxic", LABEL),
    ("obscene", LABEL),
    ("threat", LABEL),
    ("insult", LABEL),
    ("identity_hate", LABEL)
]

fields

[('comment_text', <torchtext.data.field.Field at 0x155856940>),
 ('id', None),
 ('toxic', <torchtext.data.field.Field at 0x140c964a8>),
 ('severe_toxic', <torchtext.data.field.Field at 0x140c964a8>),
 ('obscene', <torchtext.data.field.Field at 0x140c964a8>),
 ('threat', <torchtext.data.field.Field at 0x140c964a8>),
 ('insult', <torchtext.data.field.Field at 0x140c964a8>),
 ('identity_hate', <torchtext.data.field.Field at 0x140c964a8>)]

In [64]:
train, valid, test = TabularDataset.splits(
    path = folder,
    train = 'train_dataset.csv',
    validation = 'valid_dataset.csv',
    test = 'test_dataset.csv',
    format = 'csv',
    skip_header = True,
    fields = fields
)

In [65]:
TEXT.build_vocab(train, min_freq=2)

In [68]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(64, 64, 64),
    device=device,
    sort_key=lambda x: len(x.comment_text),
    sort_within_batch=True
)

In [69]:
vars(list(train_iter)[0])

{'batch_size': 64,
 'dataset': <torchtext.data.dataset.TabularDataset at 0x13faa97f0>,
 'fields': dict_keys(['comment_text', 'id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']),
 'input_fields': ['comment_text',
  'toxic',
  'severe_toxic',
  'obscene',
  'threat',
  'insult',
  'identity_hate'],
 'target_fields': [],
 'comment_text': tensor([[ 2051,  9599,     5,  ...,  4508,  1514,  7263],
         [10474,     0,    25,  ...,     4,   506, 11587],
         [    0,    25,   118,  ...,     7,    25,    95],
         ...,
         [   19,    29,    16,  ...,  3461,  1630,     0],
         [ 2861,     0,    56,  ...,     2,    23,  6299],
         [    2,     2,     5,  ...,     1,     1,     1]]),
 'toxic': tensor([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]),
 'severe_toxic': tensor([0, 0, 0, 0,

In [70]:
train_dataloader = list(train_iter)
valid_dataloader = list(valid_iter)
test_dataloader = list(test_iter)

In [75]:
class LSTMBinary(nn.Module):
    def __init__(self, vocab_length, hidden_dim, emb_dim=300, num_lstm_layers=1, num_linear_layers=1):
        super().__init__()
        self.num_lstm_layers = num_lstm_layers
        self.num_linear_layers = num_linear_layers
        self.embedding = nn.Embedding(vocab_length, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers = num_lstm_layers)
        self.linear_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_linear_layers)]
        )
        self.output = nn.Linear(hidden_dim, 6)  # 6 different binary labels
        self.tanh = nn.Tanh()
            
    def forward(self, seq):
        emb = self.embedding(seq)
        all_hidden, (h_n, c_n) = self.lstm(emb)
        preds = h_n.squeeze()
        for layer in self.linear_layers:
            preds = self.tanh(layer(preds))
        preds = self.output(preds)
        
        return preds
    
model = LSTMBinary(vocab_length=len(TEXT.vocab), hidden_dim=500, emb_dim=100, num_linear_layers=3)

In [76]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()
epochs = 3
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    
    model.train()
    
    for batch in tqdm(train_dataloader):
        x = batch.comment_text
        y = torch.cat([getattr(batch, label).unsqueeze(1) for label in labels], dim=1).float()
        opt.zero_grad()
        predictions = model(x)
        loss = loss_func(predictions, y)
        loss.backward()
        opt.step()     
        running_loss += loss.data * x.size(0)
    epoch_loss = running_loss / len(train)
    
    valid_loss = 0.0
    model.eval()
    with torch.no_grad():
        for batch in valid_dataloader:
            x = batch.comment_text
            y = torch.cat([getattr(batch, label).unsqueeze(1) for label in labels], dim=1).float()
            predictions = model(x)
            loss = loss_func(predictions, y)
            valid_loss += loss.data * x.size(0)

        valid_loss /= len(valid)
    logger.info('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, valid_loss))

 42%|████▏     | 33/79 [03:25<04:45,  6.21s/it]


KeyboardInterrupt: 

In [77]:
l = list(train_iter)
l

[
 [torchtext.data.batch.Batch of size 64]
 	[.comment_text]:[torch.LongTensor of size 105x64]
 	[.toxic]:[torch.LongTensor of size 64]
 	[.severe_toxic]:[torch.LongTensor of size 64]
 	[.obscene]:[torch.LongTensor of size 64]
 	[.threat]:[torch.LongTensor of size 64]
 	[.insult]:[torch.LongTensor of size 64]
 	[.identity_hate]:[torch.LongTensor of size 64], 
 [torchtext.data.batch.Batch of size 64]
 	[.comment_text]:[torch.LongTensor of size 60x64]
 	[.toxic]:[torch.LongTensor of size 64]
 	[.severe_toxic]:[torch.LongTensor of size 64]
 	[.obscene]:[torch.LongTensor of size 64]
 	[.threat]:[torch.LongTensor of size 64]
 	[.insult]:[torch.LongTensor of size 64]
 	[.identity_hate]:[torch.LongTensor of size 64], 
 [torchtext.data.batch.Batch of size 64]
 	[.comment_text]:[torch.LongTensor of size 239x64]
 	[.toxic]:[torch.LongTensor of size 64]
 	[.severe_toxic]:[torch.LongTensor of size 64]
 	[.obscene]:[torch.LongTensor of size 64]
 	[.threat]:[torch.LongTensor of size 64]
 	[.insult]:

In [78]:
l[2].comment_text.shape

torch.Size([239, 64])

In [79]:
TEXT.vocab.itos[1]

'<pad>'

In [80]:
for i in range(len(l[0].comment_text[1])):
    print("\n NEW SENTENCE \n")
    print(' '.join([TEXT.vocab.itos[x] for x in l[0].comment_text[:,i]]))


 NEW SENTENCE 

april 2006 ( utc ) 

  wikipedia : colours 

 much thanks for fixing / improving the table code .  
 no reason , maybe add the info where you found the values , is that in <unk> ?   
 all the styles are <unk> . highly - repetitively . if you know your way around mediawiki , i 've been asking everywhere for a css guru to clean up all the <unk> on the main page wikicode . ( like we do nt all have enough to do already , eh ? ;) so many projects , so few <unk> ... <unk> , 11

 NEW SENTENCE 

screwjob 

    hey i noticed your comments on the montreal screwjob discussion page . i decided since nobody except someone with no account objected to what you said . i would atleast change the page a little to make it fair . i just wanted to tell you because i thought you would like to know . i only changed a few words at the top of the first paragraph and added something to the second one . if your not bothered that s fine but i thought since you were fighting for and nothing was ev