**NB!** The current (30/3/2018) release of PyTorch Torchtext has some bugs preventing execution of this notebook. The updated version of Torchtext used in this notebook can be retrieved via:
```
pip install --upgrade git+https://github.com/pytorch/text
```

In [41]:
import pandas as pd
import pathlib
import torch

In [2]:
path  = pathlib.Path('../../data/')
comp  = pathlib.Path('competitions/jigsaw-toxic-comment-classification-challenge/')
TRAIN_DATA_FILE = pathlib.Path('train.csv')
TEST_DATA_FILE  = pathlib.Path('test.csv')

### ~~Fill Missing Values~~

(not needed in updated torchtext)

In [3]:
# train = pd.read_csv(path/comp/TRAIN_DATA_FILE)
# test  = pd.read_csv(path/comp/TEST_DATA_FILE)

# train["comment_text"] = train["comment_text"].fillna("_na_").values
# test["comment_text"]  = test["comment_text"].fillna("_na_").values

# # update paths to copies on disk
# TRAIN_DATA_FILE = pathlib.Path('nafill_' + str(TRAIN_DATA_FILE))
# TEST_DATA_FILE  = pathlib.Path('nafill_' + str(TEST_DATA_FILE))

# train.to_csv(path/comp/TRAIN_DATA_FILE)
# test.to_csv(path/comp/TEST_DATA_FILE)

### Declare Fields

In [3]:
from torchtext.data import Field

tokenize = lambda x: x.split()

TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)

### Construct Dataset

In [4]:
from torchtext.data import TabularDataset
    
tv_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                 ("comment_text", TEXT), ("toxic", LABEL),
                 ("severe_toxic", LABEL), ("threat", LABEL),
                 ("obscene", LABEL), ("insult", LABEL),
                 ("identity_hate", LABEL)]
trn, vld = TabularDataset.splits(
               path="../../data/competitions/jigsaw-toxic-comment-classification-challenge/", # the root directory where the data lies
               train='train.csv', validation='train.csv',
               format='csv',
               skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
               fields=tv_datafields)
 
tst_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                  ("comment_text", TEXT)]
tst = TabularDataset(
           path="../../data/competitions/jigsaw-toxic-comment-classification-challenge/test.csv", # the file path
           format='csv',
           skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
           fields=tst_datafields)

In [5]:
trn[0]

<torchtext.data.example.Example at 0x7fec3494e518>

The `Example` object bundles the attributes of a single datapoint together. At this point the text has been tokenized.

In [6]:
trn[0].__dict__.keys()

dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

In [7]:
trn[1].__dict__.keys()

dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

In [8]:
trn[0].comment_text[:3]

['explanation', 'why', 'the']

Torchtext handles mapping words to integers, but it has to be told the full range of words it should handle. In our case, we probably want to build the vocabulary on the training set only, so we run the following code:

In [9]:
TEXT.build_vocab(trn)

In [12]:
TEXT.vocab.freqs.most_common(10)

[('the', 490031),
 ('to', 294069),
 ('of', 222834),
 ('and', 218120),
 ('a', 211778),
 ('i', 196695),
 ('you', 187782),
 ('is', 170753),
 ('that', 146478),
 ('in', 140540)]

This makes Torchtext go through all the elements in the training set, check the contents corresponding to the `TEXT` field, and register the words in its vocabulary. Torchtext has its own class called `Vocab` for handling the vocabulary. The `Vocab` class holds a mapping frm word to id in its `stoi` attribute and a reverse mapping in its `itos` attribute. In addition to this, it cn automatically build an embedding matrix for you using various pretrained embeddings like word2vec. The `Vocab` class can also take options like `max_size` and `min_freq` that dictate how many words are in the vocabulary or how many times a word has to appear to be registered in the vocabulary. Words that aren't included in the vocabulary will be converted into `<unk>`, the "unknown" token.

Now that we have our data formatted and read into memeory, we turn to the next step: creating an Iterator to pass hte data to our model:

### Constructing the Iterator

'Iterator' is the Torchtext DataLoader with some extra NLP-specific functionalty.

In [10]:
from torchtext.data import Iterator, BucketIterator

device = 0

train_iter, val_iter = BucketIterator.splits(
    (trn, vld), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(64,64),
    device=device, # if you want to use GPU; else: -1
    sort_key=lambda x: len(x.comment_text), # BucketIterator needs to be told what function it should use to group the data
    sort_within_batch=False,
    repeat=False # False bc we want to wrap this Iterator layer
)

test_iter = Iterator(tst, batch_size=64, device=device, sort=False, 
                     sort_within_batch=False, repeat=False)

`BucketIterator` automatically shuffles & buckets input sequences into seqs of similar length. Allows for efficient padding. You *must* tell `BucketIterator` what attribute you want to bucket the data on.

Here we want to bucket based on lengths of `comment_text` field. For test data, we don't want to shuffle since we'll be outputting predictions at end fo training -- that's why we use a standard `Iterator`.

Output of `BucketIterator`:

In [13]:
batch = next(train_iter.__iter__()); batch


[torchtext.data.batch.Batch of size 64]
	[.comment_text]:[torch.cuda.LongTensor of size 65x64 (GPU 0)]
	[.toxic]:[torch.cuda.LongTensor of size 64 (GPU 0)]
	[.severe_toxic]:[torch.cuda.LongTensor of size 64 (GPU 0)]
	[.threat]:[torch.cuda.LongTensor of size 64 (GPU 0)]
	[.obscene]:[torch.cuda.LongTensor of size 64 (GPU 0)]
	[.insult]:[torch.cuda.LongTensor of size 64 (GPU 0)]
	[.identity_hate]:[torch.cuda.LongTensor of size 64 (GPU 0)]

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name:

In [14]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'train', 'fields', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

### Wrapping the Iterator

The `Iterator` returns a custom datatype: `torch.data.Batch` `Batch` has similar API to `Example`: w/ a batch of data from each field as attributes. We can use a simple wrapper to ease use. $\longrightarrow$ convert `Batch` to tuple: (x, y):

In [59]:
class BatchWrapper:
    def __init__(self, dl, x, y):
        self.dl, self.x, self.y = dl, x, y
    
    def __iter__(self):
        x = getattr(batch, self.x) # assuming only 1 input in this wrapper
        
        if self.y is not None: # we'll concat y into a single tensor
            y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y], dim=1).float()
        else:
            y = torch.zeros((1))
            
        yield (x, y)
            
    def __len__(self):
        return len(self.dl)    

In [60]:
train_dl = BatchWrapper(train_iter, "comment_text", list(trn.fields.keys())[2:])

valid_dl = BatchWrapper(train_iter, "comment_text", list(trn.fields.keys())[2:])

test_dl  = BatchWrapper(test_iter, "comment_text", None)

In [61]:
print(trn.fields.keys())
print(list(trn.fields.keys())[2:])

dict_keys(['id', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
['toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate']


All we're doing here is converting the `Batch` object into a tuple of inputs and outputs:

In [62]:
next(train_dl.__iter__())

(Variable containing:
  1.1000e+02  1.7900e+02  1.8000e+01  ...   1.8000e+01  1.8000e+01  1.8000e+01
  5.6600e+02  7.4700e+02  1.5048e+05  ...   1.9000e+01  4.1830e+03  7.0000e+00
  1.1000e+02  7.9000e+01  7.5000e+01  ...   4.5290e+03  1.6000e+01  2.0000e+01
                 ...                   ⋱                   ...                
  1.1121e+05  8.2436e+04  1.8000e+01  ...   1.3300e+02  3.4210e+03  4.0000e+00
  1.0000e+00  1.0000e+00  1.0000e+00  ...   4.0600e+02  2.2240e+05  1.7248e+05
  1.0000e+00  1.0000e+00  1.0000e+00  ...   1.8000e+01  1.8000e+01  1.8000e+01
 [torch.cuda.LongTensor of size 65x64 (GPU 0)], Variable containing:
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     0     0     0     0     0     0
     1     0     1     

Now we're ready to start training a model!

### Training a Text Classifier -- LSTM

We'll use a simple LSTM as a baseline example

In [63]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

In [66]:
class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300, 
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__() # don't forget to call this
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for i in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)
    
    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds

In [67]:
emb_sz = 300
nh     = 500
nl     = 3
model  = SimpleBiLSTMBaseline(nh, emb_dim=emb_sz)

In [68]:
model.cuda()

SimpleBiLSTMBaseline(
  (embedding): Embedding(470342, 300)
  (encoder): LSTM(300, 500, dropout=0.1)
  (linear_layers): ModuleList(
  )
  (predictor): Linear(in_features=500, out_features=6, bias=True)
)

### Training Loop

We can iterate using our wrapped `Iterator`, and the data will automatically be passed to us after being moved to the GPU and numericalized appropriately.

In [69]:
import tqdm

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

epochs = 2

In [71]:
%%time
for epoch in range(1, epochs+1):
    running_loss = 0.0
    running_corrects = 0
    model.train() # turn on training mode
    for x, y in tqdm.tqdm(train_dl): # thanks to our wrapper we can intuitively iterate over our data
        opt.zero_grad()
        preds = model(x)
        loss = loss_fn(preds, y)
        loss.backward()
        opt.step()
        
        running_loss += loss.data[0] * x.size(0)
        
    epoch_loss = running_loss / len(trn)
    
    # calculate the vlaidation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_fn(preds, y)
        val_loss += loss.data[0] * x.size(0)
        
    val_loss /= len(vld)
    print(f'Epoch: {epoch}, Training Loss: {epoch_loss:.4f}, Validation Loss: {val_loss:.4f}')

  0%|          | 1/2494 [00:00<06:45,  6.15it/s]
  0%|          | 1/2494 [00:00<06:06,  6.81it/s]

Epoch: 1, Training Loss: 0.0001, Validation Loss: 0.0000
Epoch: 2, Training Loss: 0.0000, Validation Loss: 0.0000
CPU times: user 232 ms, sys: 124 ms, total: 356 ms
Wall time: 352 ms





In [75]:
class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__() # don't forget to call this!
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)
    
    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds


In [76]:
em_sz = 100
nh = 500
nl = 3
model = SimpleBiLSTMBaseline(nh, emb_dim=em_sz); model.cuda()

SimpleBiLSTMBaseline(
  (embedding): Embedding(470342, 100)
  (encoder): LSTM(100, 500, dropout=0.1)
  (linear_layers): ModuleList(
  )
  (predictor): Linear(in_features=500, out_features=6, bias=True)
)

In [77]:
%%time


opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()

epochs = 2

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() # turn on training mode
    for x, y in tqdm.tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        
        running_loss += loss.data[0] * x.size(0)
        
    epoch_loss = running_loss / len(trn)
    
    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(preds, y)
        val_loss += loss.data[0] * x.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

  0%|          | 1/2494 [00:00<03:37, 11.46it/s]
  0%|          | 1/2494 [00:00<03:33, 11.68it/s]


Epoch: 1, Training Loss: 0.0003, Validation Loss: 0.0001
Epoch: 2, Training Loss: 0.0001, Validation Loss: 0.0001
CPU times: user 156 ms, sys: 60 ms, total: 216 ms
Wall time: 212 ms
