# Deep Learning for NLP

In the earlier section, we used classical machine learning techniques to build our text classifiers. In this chapter, we will replace those with deep learning techniques: Recurrent Neural Networks. 

In particular, we will use a relatively simple BiDirectional LSTM model. If this is new to you, keep reading - if not, please feel free to skip ahead! 

I begin by touching upon the overhyped terms as 'deep' in Deep Learning and 'neural' in Deep Neural Networks. I do a quick detour to why I use PyTorch and compare it to Tensorflow and Keras - the other popular Deep Learning frameworks.   

I will build the simplest possible architecture for demonstration here. I assume a general familiarity with RNNs and don’t introduce the same again.  

In this section, we answer the following questions: 
- What is Deep Learning? How does it differ from what we have seen? 
- What are the key ideas in any deep learning model? 
- Why PyTorch?  
- How to tokenize text and setting up dataloaders with `torchtext`?
- What recurrent networks are, and how to use them for text classification? 

# What is Deep Learning? 

Deep learning is a subset of machine learning: a new take on learning from data that puts an emphasis on learning successive layers of increasingly meaningful representations. But what does 'deep' in Deep Learning mean? 

> The deep in deep learning isn’t a reference to any kind of deeper understanding achieved by the approach; rather, it stands for this idea of successive layers of representations. - F. Chollet, Lead Developer of Keras

The _depth_ of the model is indicative of how many layers of such representations did we use. F Chollet suggested _layered representations learning_, _hierarchical representations learning_ as better names for these. Another name could have been _differentiable programming_. This term coined by Yann LeCun, reasons from that the common thing between our 'deep learning methods' are not more layers. Instead, that all these models learn via some form of differential calculus—most often stochastic gradient descent.

## Differences with 'Modern' Machine Learning Methods
The modern machine learning methods which we saw shot to mainstream mostly in the 1990s or after that. The binding factor among them was that they all use one layer of representations. For instance, the Decision Trees just create one set of rules and apply them. Even if you add ensemble approaches, the 'ensembling' is often shallow and only combines several ML models directly.

Here is a better worded interpretation of these differences: 

> Modern deep learning often involves tens or even hundreds of successive layers of representations—and they’re all learned automatically from exposure to training data. Meanwhile, other approaches to machine learning tend to focus on learning only one or two layers of representations of the data; hence, they’re sometimes called shallow learning. - F Chollet

# Understanding Deep Learning

In a loosely worded manner, machine learning is about mapping inputs (such as image, or "movie review") to targets (such as the label cat or “positive”). The model does this by looking at (or training from) several pairs of input and targets. 

Deep Neural Networks do this input-to-target mapping using a long sequence of simple data transformations (layers). This sequence length is referred to as depth of the network. The entire sequence from input-to-target is referred to as _model_ which learns about the data. These data transformations are learned by repeated obvservation of examples.  let’s look at how this learning happens, concretely.

As a necessary caveat here, the _neural_ in Neural Networks has nothing to do with human brain except serving as a bad metaphor. There are several articles written by lazy journalists who did not bother asking any actual Deep Learning engineer or researcher about this term. You can safely ignore _all of them_ 

## Puzzle Pieces

We are looking at a particular sub-class of challenges where we want to learn an input-to-target mapping. This subclass is generally referred to as supervised machine learning. The word _supervised_ denoting that we have target(s) for each input. Unsupervised machine learning includes challenges like trying to cluster text, where we do not have a target. 

In order to do any supervised machine learning, we need the following in-place: 

 1. Input Data : Anything ranging from past stock performance to your vacation pictures 
 1. Target - Examples of the expected output:
 1. A way to measure whether the algorithm is doing a good job — This is necessary in order to determine the distance between the algorithm’s current output and its expected output. 
 
The above components are universal to any supervised approach, machine learning or deep learning. Deep Learning in particular has it's own starcast of puzzle pieces: 

1. Model Itself  
1. Loss Function
1. Optimizer

Since these actors are new to the scene, let's take a minute in understanding what they do: 

### Model

Each model is comprised of several layers. Each layer is a data transformation. The transformation is captured using a bunch of numbers - called layer weights. This is not complete truth though, most layers often have an operation mathematical associated with it e.g. convolution or affine transform. A more precise perspective would be to say that a layer is **parameterized** by it's weights. Hence, we use the terms _layer parameters_ and _layer weights_ interchangeably. 

The state of all the layer weights together makes the model state captured in model weights. A model can have anywhere between a few thousand to few million parameters. 

Let's try to understand the notion of model **learning** in this context:

Learning means finding values for the weights of all layers in a network, such that the network will correctly map example inputs to their associated targets. Note that this value set is for _all layers_ at one go. This nuance is important because changing weights of one layer can change the behaviour and predictions made by the entire model. 

### Loss Function
One of the pieces to setup a Machine learning task is to assess how a model is doing. The simplest answer would be to measure a notional accuracy of the model. Accuracy has few flaw thoughs:
- Accuracy is a proxy metric tied to validation data and not training data
- Accuracy measures how correct we are. During training, we want to measure how far are model predictions from target. 

These differences mean we need a different function to meet our criteria above. This is fulfilled by the _loss function_ in context of Deep Learning. This is sometimes referred to as an _objective function_ as well.

> The loss function takes the predictions of the network and the true target (what you wanted the network to output) and computes a distance score, capturing how well the network has done on this specific example. 
> - From Deep Learning in Python by F Chollet
 
This distance measurement is called loss score or simply loss.

### Optimizer

This loss is automatically used as a feedback signal to adjust the way the algorithm works. This adjustment step is what we call learning.

This automatic ajustment in model weights is peculiar to deep learning. Each adjustment or _update_ of weights is made in a direction that will lower the loss score for the current training pair (input, target). 

This adjustment is the job of the optimizer, which implements what’s called the Backpropagation algorithm: the central algorithm in deep learning. 

Optimizers and loss functions are common to all deep learning methods - even the cases where we don't have a (input, target) pair. All optimizers are based on differential calculus such as Stochastic Gradient Descent (SGD), Adam and so on. Hence, the term differentiable programming is a more precise name for Deep Learning in my mind. 

## Putting it Together: Training Loop

We now have a shared vocabulary. You have a notional understanding of what terms like layers, model weights, loss function, optimizer mean. But how do they work together? How do we train them on arbitrary data? We can train them to give us the ability to recognise cat picture to fraud reviews on Amazon. 

Here is the rough outline of steps that happen inside a training loop: 

- Initialize: 
    - The network/model weights are assigned random values, usually in (-1, 1) or (0, 1) 
    - Model is very far from the target. This is because it is simply executing a series of random transformations. 
    - Loss is very high 
- With every example the network processes:
    - Weights are adjusted a little in the correct direction
    - Loss score decreases
    
This is the training loop, which is repeated several times. Each pass over the entire training set is often referred to an _epoch_. Each training set suited for deep learning should typically have thousands of examples. The models are often trained for tens of epochs, or alternatively millions of iterations.

In a training setup (model, optimizer, loop), the above loop updates weight values that minimize the loss function. A  trained network is the one with least possible loss score on the entire training and valid data. 

It’s a simple mechanism that, repeated often times, just works like magic. 

# Kaggle: Text Categorization Challenge

In this particular section, we are going to visit the familiar task of text classification. We are going to use a different datset though. We are going to be solving the [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

# Getting the Data

Note that you will need to do accept the terms and conditions of the competition and data usage in order to get this dataset.

**Direct Download**: You can get the train and test data from the [data tab on challenge website](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data).  

**Kaggle API**: You can use the official Kaggle API [(github link)](https://github.com/Kaggle/kaggle-api) to download the data

In case of direct download and Kaggle API both, you have to split your train data into smaller train and validation splits for this notebook. 

You can create train and valid splits of train data using `sklearn.model_selection.train_test_split` utility. Alternatively, you can download directly from the accompanying ...

**Github Repo**: I am uploading the exact splits that I am using to a repository associated with this codebase. 

## Exploring the Data

In [5]:
# !conda install -y pandas
# !conda install -y numpy

In [6]:
import pandas as pd
import numpy as np

In [7]:
train_df = pd.read_csv("data/train.csv")

In [8]:
train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\r\nWhy the edits made under my use...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\r\nMore\r\nI can't make any real suggestions...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [9]:
val_df = pd.read_csv("data/valid.csv")

In [10]:
val_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,000eefc67a2c930f,Radial symmetry \r\n\r\nSeveral now extinct li...,0,0,0,0,0,0
1,000f35deef84dc4a,There's no need to apologize. A Wikipedia arti...,0,0,0,0,0,0
2,000ffab30195c5e1,"Yes, because the mother of the child in the ca...",0,0,0,0,0,0
3,0010307a3a50a353,"""\r\nOk. But it will take a bit of work but I ...",0,0,0,0,0,0
4,0010833a96e1f886,"""== A barnstar for you! ==\r\n\r\n The Real L...",0,0,0,0,0,0


## Multiple Target Dataset!

The interesting thing about this dataset is that each comment can have multiples labels. For instance, a comment can be insult and be toxic. Or be obscene and have identity_hate elements in it. 

Hence, we are leveling up here by trying to predict not one label (e.g. positive or negative) but multiple labels at one go. For each label, we'd predict a value between 0 and 1 to indicate how likely it is to belong to that category. 

This is not a probability value in the Bayesian meaning of the word, but represents the same intent. 

Tip, I'd recommend trying out the models which we have seen earlier with this dataset. 

And re-implementing this code for our favourite IMDb dataset. 

In [1]:
test_df = pd.read_csv("data/test.csv")

NameError: name 'pd' is not defined

In [12]:
test_df.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \r\n\r\n The title is fine as i...
2,00013b17ad220c46,""" \r\n\r\n == Sources == \r\n\r\n * Zawe Ashto..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


# Why PyTorch? 

PyTorch is a deep learning framework by Facebook, similar to Tensorflow by Google. 

Being backed by Google, thousands of dollars have been spent in Tensorflow's marketing, development and documentation. It also got to a stable 1.0 release almost a year ago, while PyTorch has only recently gotten to 0.4.1. This means, that it's usually easier to find a Tensorflow solutiton to your problem and you can copy paste code off Internet. 

On the other hand, PyTorch is programmer friendly. It is semantically similar to numpy+deep learning operations as one. This means I can use the Python debugging tools that I am already familiar with it. 

Pythonic: Tensorflow worked like a C program in the sense that the code was all written in one session, compiled and then executed. Thereby destroying it's Python flavour altogether. This has been solved by Tensorflow's Eager Execution feature release, which will soon be stable enough to use for most prototyping work. 

Trainig Loop Visualization: Till a while ago, Tensorflow had a good visualization tool called Tensorboard for understanding how your training and validation performance (and other characterstics) which was absent in PyTorch. For a long while now, tensorboardX makes Tensorboard easy to use with PyTorch.

In summary, I use PyTorch because it is easier to debug, more Pythonic and more programmer friendly. 

# PyTorch and torchtext

You can install the latest version of Pytorch ([website](https://pytorch.org/)) via conda or pip for your target machine. I am running this code on a Windows laptop with a GPU. 

I installed `torch` using `conda install pytorch cuda92 -c pytorch`. 

In [4]:
# !conda install -y pytorch cuda92 -c pytorch

For installing `torchtext`, I recommend using pip directly from their Github repository with the latest fixes instead of PyPi which is not frequently updated. Uncomment the install line when running this for the first time. 

In [5]:
# !pip install --upgrade git+https://github.com/pytorch/text

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext

If you are running this code on a machine with GPU, leave the `use_gpu` flag set to True, else set it to False. 

If you set `use_gpu=True` on a machine, we can check whether the GPU is accessible to PyTorch or not using the `torch.cuda.is_available()` utility. 

In [7]:
use_gpu = True
if use_gpu:
    assert torch.cuda.is_available(), 'You either do not have a GPU or is not accessible to PyTorch'

Let's see how many GPU devices are available to PyTorch on this machine

In [8]:
torch.cuda.device_count()

1

## Data Loaders with torchtext

Writing good data loaders is the most tedious part in most deep learning applications. This step often combines the preprocessing, text cleaning, and vectorization tasks which we have seen earlier. 

Additionally, it wraps our static data objects into iterators or generators. This is incredibly helpful in processing data sizes much larger than GPU memory - which is quite often the case. This is done by splitting the data such that you can make `batches` of **batchsize** samples such that it fits your GPU memory.

Batchsizes are often powers of 2, such as 32, 64, 512 and so on. This convention exists because it helps with vector operations on the instruction set level. Anecdotally, using a batchsize different from power of 2 has not helped or hurt my processing speed.

### Conventions and Style
The code, iterators and wrappers used below are from [Practical Torchtext](https://github.com/keitakurita/practical-torchtext/). It is a torchtext tutorial by Keita Kurita - one of the top 5 contributors to torchtext. 

The naming conventions and style are loosely inspired from the above work and fastai - a deep learning framework based on PyTorch itself.

**Let's begin**  by setting up the required variable placeholders in place: 

In [19]:
from torchtext.data import Field 

The Field class determines how the data is preprocessed and converted into a numeric format. The Field class is a fundamental torchtext data structure and worth looking into. Field class models common text processing and sets them up for numericalisation (or vectorisation). 

In [15]:
LABEL = Field(sequential=False, use_vocab=False)

All fields, by default, expect a sequence of words to come in, and they expect to build a mapping from the words to integers later on. This mapping is called the vocab, and is effectively one hote encoding of the the tokens.  

We saw that each label in our case is already an integer marked as 0 or 1. So, we will not one-hot this, we tell the Field class that is already one-hot encoded and non-sequential by setting, `use_vocab=False` and `sequential=False` respectively.

In [16]:
tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)

Here as few things happening in very few lines of code, let's unpack it just a little bit: 

- `lower=True`: all input is converted to lowercase
- `sequential=True`: if False, no tokenization is applied
- tokenizer: we defined a custom tokenize function which simply splits the string on space. You should replace this with the spaCy tokenizer (set tokenize="spacy") and see if that changes the loss curve or final model performance. 

**More about `Field`: **

In addition to the keyword arguments mentioned above, the Field class also allows the user to specify special tokens (the unk_token for out-of-vocabulary _unknown_ words, the pad_token for padding, the eos_token for the end of a sentence, and an optional init_token for the start of the sentence). 

The preprocessing and postprocessing parameters accept any `torchtext.data.Pipeline`s. Preprocessing is applied after tokenizing but before numericalizing. Postprocessing is applied after numericalizing, but before converting them to a Tensor.

The docstrings for the Field class are relatively well written, so if you need some advanced preprocessing you should probe them for more information. 

In [16]:
from torchtext.data import TabularDataset

In [17]:
%%time
tv_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                 ("comment_text", TEXT), ("toxic", LABEL),
                 ("severe_toxic", LABEL), ("threat", LABEL),
                 ("obscene", LABEL), ("insult", LABEL),
                 ("identity_hate", LABEL)]

trn, vld = TabularDataset.splits(
        path="data", # the root directory where the data lies
        train='train.csv', validation="valid.csv",
        format='csv',
        skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
        fields=tv_datafields)

Wall time: 2.99 ms


In [18]:
%%time
tst_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                 ("comment_text", TEXT)
]

tst = TabularDataset(
        path="data/test.csv", # the file path
        format='csv',
        skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
        fields=tst_datafields)

Wall time: 998 µs


In [19]:
TEXT.build_vocab(trn)

In [20]:
TEXT.vocab.freqs.most_common(10)

[('the', 78),
 ('to', 41),
 ('you', 33),
 ('of', 30),
 ('and', 26),
 ('a', 26),
 ('is', 24),
 ('that', 22),
 ('i', 20),
 ('if', 19)]

In [21]:
trn[0]

<torchtext.data.example.Example at 0x18aa3f827f0>

In [22]:
trn[0].__dict__.keys()

dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

In [23]:
trn[4].comment_text

['you,',
 'sir,',
 'are',
 'my',
 'hero.',
 'any',
 'chance',
 'you',
 'remember',
 'what',
 'page',
 "that's",
 'on?']

## Iterators!

In [24]:
from torchtext.data import Iterator, BucketIterator

In [25]:
train_iter, val_iter = BucketIterator.splits(
        (trn, vld), # we pass in the datasets we want the iterator to draw data from
        batch_sizes=(32, 32),
        device=0, # if you want to use the CPU, specify -1 or GPU ID here
        sort_key=lambda x: len(x.comment_text), # the BucketIterator needs to be told what function it should use to group the data.
        sort_within_batch=False,
        repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.


In [26]:
train_iter

<torchtext.data.iterator.BucketIterator at 0x18aa3f76748>

In [27]:
batch = next(train_iter.__iter__()); batch


[torchtext.data.batch.Batch of size 25]
	[.comment_text]:[torch.LongTensor of size 494x25]
	[.toxic]:[torch.LongTensor of size 25]
	[.severe_toxic]:[torch.LongTensor of size 25]
	[.threat]:[torch.LongTensor of size 25]
	[.obscene]:[torch.LongTensor of size 25]
	[.insult]:[torch.LongTensor of size 25]
	[.identity_hate]:[torch.LongTensor of size 25]

In [28]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

In [29]:
test_iter = Iterator(tst, batch_size=64, device=0, sort=False, sort_within_batch=False, repeat=False)

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.


In [30]:
next(test_iter.__iter__())


[torchtext.data.batch.Batch of size 33]
	[.comment_text]:[torch.LongTensor of size 158x33]

In [49]:
class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))
            if use_gpu:
                yield (x.cuda(), y.cuda())
            else:
                yield (x, y)
    
    def __len__(self):
        return len(self.dl)

In [50]:
train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)

In [51]:
next(train_dl.__iter__())

(tensor([[ 280,   15,  315,  ...,   63,  660,   66],
         [  18,  360,   12,  ...,    4,   11,   82],
         [  14,   45,    6,  ...,  664,    2,    2],
         ...,
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1]], device='cuda:0'),
 tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 1.,  1.,  0.,  1.,  1.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,

## Training a Text Classifier

In [52]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

In [53]:
class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=2):
        super().__init__() # don't forget to call this!
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)
    
    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds

# Bi-LSTM Classifiers
- What is a Bi-LSTM? 
- What is a RNN? 
- Implement LSTM-only classification example
- Implement Bi-LSTM classification example

In [54]:
em_sz = 100
nh = 500
nl = 3
model = SimpleBiLSTMBaseline(nh, emb_dim=em_sz)
print(model)

  "num_layers={}".format(dropout, num_layers))


SimpleBiLSTMBaseline(
  (embedding): Embedding(784, 100)
  (encoder): LSTM(100, 500, dropout=0.1)
  (linear_layers): ModuleList(
    (0): Linear(in_features=500, out_features=500, bias=True)
  )
  (predictor): Linear(in_features=500, out_features=6, bias=True)
)


In [55]:
if use_gpu:
    model = model.cuda()
from tqdm import tqdm

In [56]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss().cuda()

In [57]:
epochs = 4

In [59]:
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() # turn on training mode
    for x, y in tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        
        running_loss += loss.item() * x.size(0)
        
    epoch_loss = running_loss / len(trn)
    
    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(preds, y)
        val_loss += loss.item() * x.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.02it/s]


Epoch: 1, Training Loss: 3.4272, Validation Loss: 3.1624


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.12it/s]


Epoch: 2, Training Loss: 5.3027, Validation Loss: 3.9013


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.67it/s]


Epoch: 3, Training Loss: 4.9982, Validation Loss: 2.2753


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.90it/s]


Epoch: 4, Training Loss: 3.2272, Validation Loss: 5.4684


In [61]:
test_preds = []
for x, y in tqdm(test_dl):
    preds = model(x)
    # if you're data is on the GPU, you need to move the data back to the cpu
    # preds = preds.data.cpu().numpy()
    preds = preds.data.cpu().numpy()
    # the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
    preds = 1 / (1 + np.exp(-preds))
    test_preds.append(preds)
test_preds = np.hstack(test_preds)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 66.85it/s]


In [62]:
test_df = pd.read_csv("data/test.csv")
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    test_df[col] = test_preds[:, i]

In [63]:
test_df.head(3)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,0.518735,0.674489,6e-06,0.707423,0.381914,3e-06
1,0000247867823ef7,== From RfC == \r\n\r\n The title is fine as i...,0.518735,0.674489,6e-06,0.707423,0.381914,3e-06
2,00013b17ad220c46,""" \r\n\r\n == Sources == \r\n\r\n * Zawe Ashto...",0.518735,0.674489,6e-06,0.707423,0.381914,3e-06


# Bi-LSTM for Linguistic Tasks