# Deep Learning for NLP

# Understanding Deep Learning

# Kaggle: Text Categorization Challenge

# Getting the Data

**Direct Download**: You can get the train and test data from the [data tab on challenge website](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data).  

## Exploring the Data

In [1]:
!conda install -y pandas
!conda install -y numpy

Solving environment: ...working... done

## Package Plan ##

  environment location: D:\Miniconda3\envs\nlp

  added / updated specs: 
    - pandas


The following packages will be UPDATED:

    pandas: 0.23.3-py36h830ac7b_0 --> 0.23.4-py36h830ac7b_0

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: D:\Miniconda3\envs\nlp

  added / updated specs: 
    - numpy


The following packages will be UPDATED:

    numpy:      1.15.1-py36ha559c80_0 --> 1.15.4-py36ha559c80_0
    numpy-base: 1.15.1-py36h8128ebf_0 --> 1.15.4-py36h8128ebf_0

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done


In [2]:
import pandas as pd
import numpy as np

In [3]:
train_df = pd.read_csv("data/train.csv")

FileNotFoundError: File b'data/train.csv' does not exist

In [None]:
train_df.head()

In [None]:
val_df = pd.read_csv("data/valid.csv")

In [None]:
val_df.head()

## Multiple Target Dataset!

In [None]:
test_df = pd.read_csv("data/test.csv")

In [None]:
test_df.head()

# Why PyTorch? 

# PyTorch and torchtext

In [None]:
!conda install -y pytorch cuda92 -c pytorch

In [None]:
!pip install --upgrade git+https://github.com/pytorch/text

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext

In [12]:
use_gpu = True
if use_gpu:
    assert torch.cuda.is_available(), 'You either do not have a GPU or is not accessible to PyTorch'

Let's see how many GPU devices are available to PyTorch on this machine

In [13]:
torch.cuda.device_count()

1

## Data Loaders with torchtext

### Conventions and Style

In [14]:
from torchtext.data import Field 

In [15]:
LABEL = Field(sequential=False, use_vocab=False)

In [16]:
tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)

In [17]:
from torchtext.data import TabularDataset

In [18]:
tv_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                 ("comment_text", TEXT), ("toxic", LABEL),
                 ("severe_toxic", LABEL), ("threat", LABEL),
                 ("obscene", LABEL), ("insult", LABEL),
                 ("identity_hate", LABEL)]

In [19]:
trn, vld = TabularDataset.splits(
        path="data", # the root directory where the data lies
        train='train.csv', validation="valid.csv",
        format='csv',
        skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
        fields=tv_datafields)

In [20]:
tst_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                 ("comment_text", TEXT)
                 ]

In [21]:
tst = TabularDataset(
        path="data/test.csv", # the file path
        format='csv',
        skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
        fields=tst_datafields)

## Exploring the Dataset Objects

In [22]:
trn, vld, tst

(<torchtext.data.dataset.TabularDataset at 0x176c45f30f0>,
 <torchtext.data.dataset.TabularDataset at 0x176c45f3908>,
 <torchtext.data.dataset.TabularDataset at 0x176c45f32b0>)

In [23]:
trn[0], vld[0], tst[0]

(<torchtext.data.example.Example at 0x176c45f3978>,
 <torchtext.data.example.Example at 0x176c45f3940>,
 <torchtext.data.example.Example at 0x176c45f37f0>)

In [24]:
trn[0].__dict__.keys()

dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

In [25]:
trn[0].__dict__['comment_text'][:5]

['explanation', 'why', 'the', 'edits', 'made']

In [26]:
TEXT.build_vocab(trn)

In [27]:
TEXT.vocab

<torchtext.vocab.Vocab at 0x176c45f3400>

In [28]:
type(TEXT.vocab.freqs)

collections.Counter

In [29]:
TEXT.vocab.freqs.most_common(5)

[('the', 78), ('to', 41), ('you', 33), ('of', 30), ('and', 26)]

In [30]:
type(TEXT.vocab.itos), type(TEXT.vocab.stoi), len(TEXT.vocab.itos), len(TEXT.vocab.stoi.keys()), 

(list, collections.defaultdict, 784, 784)

In [31]:
TEXT.vocab.stoi['and'], TEXT.vocab.itos[7]

(7, 'and')

## Iterators!

In [32]:
from torchtext.data import Iterator, BucketIterator

## BucketIterator

In [33]:
train_iter, val_iter = BucketIterator.splits(
        (trn, vld), # we pass in the datasets we want the iterator to draw data from
        batch_sizes=(32, 32),
        sort_key=lambda x: len(x.comment_text), # the BucketIterator needs to be told what function it should use to group the data.
        sort_within_batch=False,
        repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

In [34]:
train_iter

<torchtext.data.iterator.BucketIterator at 0x176c45f30b8>

In [35]:
batch = next(train_iter.__iter__())
batch


[torchtext.data.batch.Batch of size 25]
	[.comment_text]:[torch.LongTensor of size 494x25]
	[.toxic]:[torch.LongTensor of size 25]
	[.severe_toxic]:[torch.LongTensor of size 25]
	[.threat]:[torch.LongTensor of size 25]
	[.obscene]:[torch.LongTensor of size 25]
	[.insult]:[torch.LongTensor of size 25]
	[.identity_hate]:[torch.LongTensor of size 25]

In [36]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

In [37]:
batch.__dict__['dataset'], trn, batch.__dict__['dataset']==trn

(<torchtext.data.dataset.TabularDataset at 0x176c45f30f0>,
 <torchtext.data.dataset.TabularDataset at 0x176c45f30f0>,
 True)

In [38]:
test_iter = Iterator(tst, batch_size=64, sort=False, sort_within_batch=False, repeat=False)

In [39]:
next(test_iter.__iter__())


[torchtext.data.batch.Batch of size 33]
	[.comment_text]:[torch.LongTensor of size 158x33]

## BatchWrapper

In [40]:
class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))
            if use_gpu:
                yield (x.cuda(), y.cuda())
            else:
                yield (x, y)
    
    def __len__(self):
        return len(self.dl)

In [41]:
train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)

In [42]:
next(train_dl.__iter__())

(tensor([[ 453,   63,   15,  ...,  454,  660,  778],
         [ 523,    4,  601,  ...,   78,   11,  650],
         [  30,  664,  242,  ...,    8,    2,   22],
         ...,
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1]], device='cuda:0'),
 tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 1.,  1.,  0.,  1.,  1.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,

## Training a Text Classifier

In [43]:
class SimpleLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=2):
        super().__init__() # don't forget to call this!
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=num_linear, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)
    
    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds

### Initializing the Model

In [44]:
em_sz = 300
nh = 500
model = SimpleLSTMBaseline(nh, emb_dim=em_sz)
print(model)

SimpleLSTMBaseline(
  (embedding): Embedding(784, 300)
  (encoder): LSTM(300, 500, num_layers=2, dropout=0.1)
  (linear_layers): ModuleList(
    (0): Linear(in_features=500, out_features=500, bias=True)
  )
  (predictor): Linear(in_features=500, out_features=6, bias=True)
)


In [45]:
def model_size(model: torch.nn)->int:
    """
    Calculates the number of trainable parameters in any model
    
    Returns:
        params (int): the total count of all model weights
    """
    model_parameters = filter(lambda p: p.requires_grad, model.parameters())
#     model_parameters = model.parameters()
    params = sum([np.prod(p.size()) for p in model_parameters])
    return params

print(f'{model_size(model)/10**6} million parameters')

4.096706 million parameters


In [46]:
if use_gpu:
    model = model.cuda()

**Putting together the pieces again**:

In [47]:
from torch import optim
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss().cuda()
epochs = 3

## Training Loop

In [48]:
from tqdm import tqdm
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() # turn on training mode
    for x, y in tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()
        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        
        running_loss += loss.item() * x.size(0)
        
    epoch_loss = running_loss / len(trn)
    
    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(preds, y)
        val_loss += loss.item() * x.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.34it/s]


Epoch: 1, Training Loss: 13.5037, Validation Loss: 4.6498


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.58it/s]


Epoch: 2, Training Loss: 7.8243, Validation Loss: 24.5401


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.35it/s]


Epoch: 3, Training Loss: 57.4577, Validation Loss: 4.0107


## Prediction Mode

In [49]:
test_preds = []
for x, y in tqdm(test_dl):
    preds = model(x)
    # if you're data is on the GPU, you need to move the data back to the cpu
    # preds = preds.data.cpu().numpy()
    preds = preds.data.cpu().numpy()
    # the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
    preds = 1 / (1 + np.exp(-preds))
    test_preds.append(preds)
test_preds = np.hstack(test_preds)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.64it/s]


### Convert predictions to a pandas dataframe

This helps us convert the predictions to a more interpretable format. Let's insert the predictions in the correct column and then we can preview few rows of the dataframe: 

In [50]:
test_df = pd.read_csv("data/test.csv")
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    test_df[col] = test_preds[:, i]

In [51]:
test_df.head(3)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,0.629146,0.116721,0.438606,0.156848,0.139696,0.388736
1,0000247867823ef7,== From RfC == \r\n\r\n The title is fine as i...,0.629146,0.116721,0.438606,0.156848,0.139696,0.388736
2,00013b17ad220c46,""" \r\n\r\n == Sources == \r\n\r\n * Zawe Ashto...",0.629146,0.116721,0.438606,0.156848,0.139696,0.388736
