<a href="https://colab.research.google.com/github/AnDDoanf/learn_NLP/blob/master/notebooks/RNN_for_Sequence_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN for Sequence Classification

What is included in this notebook:

- Implementation of an RNN model for a text classification task
- Using pre-trained word embeddings to initialize weights for embedding layers

## Download the data

In [None]:
%%capture
!rm -f titles-en-train.labeled
!rm -f titles-en-test.labeled

!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-train.labeled
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/titles-en-test.labeled

## Load data

We load each file into a list of `(text, label)` pairs, one pair per sentence.

In [None]:
def load_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            # Split on the first tab only, so a tab inside the text cannot break unpacking
            lb, text = line.split('\t', 1)
            data.append((text, int(lb)))
            
    return data

In [None]:
train_data = load_data('./titles-en-train.labeled')
test_data = load_data('./titles-en-test.labeled')

train_docs, train_labels = zip(*train_data)
test_docs, test_labels = zip(*test_data)
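
As a quick illustration of the file format that `load_data` expects (a label, a tab, then the title), here is a minimal sketch of the per-line parsing step. The helper name `parse_line` and the sample lines are made up for this example and are not part of the dataset.

```python
def parse_line(line):
    """Split one 'label<TAB>text' line into a (text, label) pair."""
    # maxsplit=1 keeps any further tabs inside the text field
    label, text = line.strip().split('\t', 1)
    return text, int(label)

# Hypothetical sample lines in the same 'label<TAB>text' layout
samples = ["1\ta samurai and poet\n", "-1\ta river in Kyoto\n"]
pairs = [parse_line(s) for s in samples]

# Same unzip idiom as used for train_data above
docs, labels = zip(*pairs)
print(docs)
print(labels)
```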

## Steps in building an RNN model for text classification

- Create Vocabulary, Vectorizer, Dataset
- Implement model class
- Training loop
- Evaluation on the test data

## Vocabulary, Vectorizer, Dataset

For each sentence, we need to transform tokens in the sentence into integer indexes that correspond to indexes of words in a vocabulary.

So we need to:
- Create a vocabulary from the training data
- Vectorize the data into integer indexes
- Wrap the data in Dataset objects

### Vocabulary class

In [None]:
from collections import defaultdict

class Vocabulary:
    def __init__(self, token_to_idx=None):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
        """
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx

        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}
        
        self.pad_index = 0
        self.unk_index = 1

    def lookup_token(self, token):
        """Retrieve the index associated with the token 
          or the UNK index if token isn't present.
        
        Args:
            token (str): the token to look up 
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary) 
              for the UNK functionality 
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
    
    def lookup_index(self, index):
        """Return the token associated with the index
        
        Args: 
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]
    
    def add_token(self, token):
        """Update mapping dicts based on the token.

        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    @classmethod
    def build_vocab(cls, sentences):
        """Build vocabulary from a list of sentences

        Arguments:
        ----------
            sentences (list): list of sentences, each sentence is a string
        
        Return:
        ----------
            vocab (Vocabulary): a Vocabulary object
        """
        token_to_idx = {"<PAD>": 0, "<UNK>": 1}
        vocab = cls(token_to_idx)

        for s in sentences:
            for word in s.split():
                vocab.add_token(word)
        return vocab

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)

Let's try to create a Vocabulary from the training data

In [None]:
vocab = Vocabulary.build_vocab(train_docs)
print(vocab)

<Vocabulary(size=27192)>


### Data Vectorizer function

In [None]:
import torch
import numpy as np

def vectorize(vocab, title):
    """Convert a title into a tensor of vocabulary indices.

    Args:
        vocab (Vocabulary): the vocabulary used to look up tokens
        title (str): a whitespace-tokenized title
    Returns:
        indices (torch.Tensor): a 1-D tensor of token indices
    """
    indices = [vocab.lookup_token(token) for token in title.split()]
    
    return torch.tensor(indices)

In [None]:
print(train_docs[0])

FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period .


In [None]:
print(vectorize(vocab, train_docs[0]))

tensor([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,  9, 16, 17, 18,
        19, 20, 21,  7, 20, 22, 23, 24])
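
Tokens that were never seen while building the vocabulary fall back to the `<UNK>` index (1), which matters when vectorizing the test data. The sketch below shows that behaviour with a plain dictionary standing in for the `Vocabulary` class. The toy mapping and the helper `toy_vectorize` are assumptions made for illustration only.

```python
# Toy mapping that mirrors the convention used above: 0 = <PAD>, 1 = <UNK>
toy_token_to_idx = {"<PAD>": 0, "<UNK>": 1, "a": 2, "samurai": 3, "poet": 4}
UNK_INDEX = 1

def toy_vectorize(sentence):
    """Map each whitespace-separated token to its index, or UNK_INDEX if unseen."""
    return [toy_token_to_idx.get(tok, UNK_INDEX) for tok in sentence.split()]

# "and" is not in the toy mapping, so it is mapped to UNK_INDEX
print(toy_vectorize("a samurai and a poet"))  # [2, 3, 1, 2, 4]
```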


### Vectorize training data/test data

In [None]:
train_data = [vectorize(vocab, t) for t in train_docs]
test_data = [vectorize(vocab, t) for t in test_docs]

In [None]:
print(train_data[0])

tensor([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,  9, 16, 17, 18,
        19, 20, 21,  7, 20, 22, 23, 24])


### Label Mapping

In [None]:
label2idx = {
    -1: 0, 1: 1
}
train_y = [label2idx[lb] for lb in train_labels]
test_y = [label2idx[lb] for lb in test_labels]

### Dataset class

To feed data into a DataLoader, we need to implement a custom Dataset class that inherits from the [Dataset class](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).

A custom Dataset must implement two methods: `__len__` and `__getitem__`.

In [None]:
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, index):
        x = self.sequences[index]
        y = self.labels[index]

        return x, y

Create train_dataset and test_dataset

In [None]:
train_dataset = TextDataset(train_data, train_y)
test_dataset = TextDataset(test_data, test_y)

In [None]:
print( train_dataset[0] )

(tensor([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,  9, 16, 17, 18,
        19, 20, 21,  7, 20, 22, 23, 24]), 1)


### Create DataLoader

We need to define a function that processes each batch generated by the DataLoader: it pads the sequences to a common length and records their original lengths.

In [None]:
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
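    """Process a batch produced by DataLoader.

    Arguments:
    -----
        batch (list): a list of (sequence, label) pairs from the Dataset

    Returns:
    -----
        x_pad (torch.Tensor): sequences padded to the longest one in the batch
        x_lens (torch.Tensor): original length of each sequence
        y (torch.Tensor): labels as float32, as required by BCELoss
    """
    (x, y) = zip(*batch)
    x_lens = torch.tensor([len(seq) for seq in x])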
    y = torch.tensor(y, dtype=torch.float32)
    
    x_pad = pad_sequence(x, batch_first=True, padding_value=0)

    return x_pad, x_lens, y
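
To make the padding step concrete, here is a small sketch of what `pad_sequence` produces for a toy batch. The index values are arbitrary and chosen only for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy index sequences of different lengths (values are arbitrary)
seqs = [torch.tensor([2, 3, 4]), torch.tensor([5]), torch.tensor([6, 7])]
lens = torch.tensor([len(s) for s in seqs])

# Pad with 0, the index reserved for <PAD>, so every row has the same length
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

print(padded.shape)     # (batch size, length of the longest sequence)
print(padded.tolist())  # [[2, 3, 4], [5, 0, 0], [6, 7, 0]]
print(lens.tolist())    # [3, 1, 2]
```

The original lengths are kept alongside the padded tensor because the model needs them later to ignore the padded positions.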

## RNN Model

Our RNN model for text classification includes the following layers:

- Embedding layer ([nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html))
- RNN layer, here an LSTM ([nn.LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html))
- Linear layer followed by a sigmoid ([nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html))

In [None]:
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class TextClassifier(nn.Module):

    def __init__(self, vocab_size, embedding_size, rnn_hidden_size, num_classes, 
                 batch_first=True, padding_idx=0):
        
        super(TextClassifier, self).__init__()

        self.emb = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size, 
                                padding_idx=padding_idx)
        self.rnn = nn.LSTM(input_size=embedding_size, hidden_size=rnn_hidden_size,
                          batch_first=batch_first)
        self.fc = nn.Linear(in_features=rnn_hidden_size, out_features=num_classes)


    def forward(self, x_in, x_lens):
        x_embed = self.emb(x_in)
        x_packed = pack_padded_sequence(x_embed, x_lens, batch_first=True, enforce_sorted=False)
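        # hidden has shape (num_layers, batch, hidden_size); keep the last layer only
        _, (hidden, _) = self.rnn(x_packed)
        
        # The sigmoid turns raw scores into probabilities, which BCELoss expects
        probs = torch.sigmoid(self.fc(hidden[-1]))
        return probs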
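
The reason for packing the padded batch before the LSTM is that the final hidden state should come from the last real token of each sequence, not from padding. The following sketch checks this on random toy inputs. The sizes and tensors are assumptions chosen for illustration, not values from the notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# Two "embedded" sequences; the second has only 2 real steps, the rest is padding
x = torch.randn(2, 5, 4)
x[1, 2:] = 0.0
lens = torch.tensor([5, 2])

with torch.no_grad():
    packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)     # shape: (num_layers, batch, hidden_size)
    _, (h_alone, _) = lstm(x[1:2, :2])  # the short sequence run on its own
    _, (h_padded, _) = lstm(x)          # no packing: padding steps are processed too

print(h_packed.shape)
# With packing, the short sequence ends in the same state as when run alone
print(torch.allclose(h_packed[0, 1], h_alone[0, 0], atol=1e-5))
# Without packing, the state keeps changing over the padded steps
print(torch.allclose(h_padded[0, 1], h_alone[0, 0], atol=1e-5))
```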

## Create an RNN model

In [None]:
vocab_size = len(vocab)   # 27192
embedding_size = 200
rnn_hidden_size = 256
num_classes = 1
batch_first = True

model = TextClassifier(vocab_size=vocab_size, 
                       embedding_size=embedding_size, 
                       rnn_hidden_size=rnn_hidden_size,
                       num_classes=num_classes, 
                       batch_first=batch_first)

In [None]:
print(model)

TextClassifier(
  (emb): Embedding(27192, 200, padding_idx=0)
  (rnn): LSTM(200, 256, batch_first=True)
  (fc): Linear(in_features=256, out_features=1, bias=True)
)


## Training Loop

In [None]:
from tqdm.notebook import trange, tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  

learning_rate = 1e-3
batch_size = 16
epochs = 50

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
model.to(device)

def train():
    train_dataloader = DataLoader(
        train_dataset,
        collate_fn=collate_batch,
        batch_size=batch_size,
    )
    model.train()
    train_iterator = trange(int(epochs), desc="Epoch")

    for _ in train_iterator:
        for x_in, x_lens, y in train_dataloader:
            x_in = x_in.to(device)
            y = y.to(device)

            optimizer.zero_grad()
            # reshape(-1) gives shape (batch,) even when the last batch holds one example
            pred = model(x_in, x_lens).reshape(-1)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()

train()


Epoch:   0%|          | 0/50 [00:00<?, ?it/s]
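
The loop above pairs a sigmoid output with `BCELoss`. As a side note, PyTorch also provides `nn.BCEWithLogitsLoss`, which takes raw scores and applies the sigmoid internally. Its documentation describes it as more numerically stable. The sketch below checks on toy values (not taken from the notebook) that both give the same loss.

```python
import torch
import torch.nn as nn

# Toy raw scores (before the sigmoid) and binary targets; the values are arbitrary
scores = torch.tensor([-2.0, 0.0, 3.0])
targets = torch.tensor([0.0, 1.0, 1.0])

# Setup used in this notebook: sigmoid first, then BCELoss
loss_a = nn.BCELoss()(torch.sigmoid(scores), targets)

# Alternative: keep the raw scores and let the loss apply the sigmoid
loss_b = nn.BCEWithLogitsLoss()(scores, targets)

print(loss_a.item(), loss_b.item())
print(torch.allclose(loss_a, loss_b, atol=1e-6))
```

Switching to the alternative would require the model to return raw scores, and the evaluation threshold would then be 0 on the scores rather than 0.5 on the probabilities.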

## Evaluation

In [None]:
from sklearn import metrics

def evaluate():
    model.eval()
    test_dataloader = DataLoader(
        test_dataset,
        collate_fn=collate_batch,
        shuffle=False,
        batch_size=batch_size,
    )

    preds = []
    true_labels = []
    with torch.no_grad():
        for x_in, x_lens, y in tqdm(test_dataloader, desc="Evaluating"):
            x_in = x_in.to(device)
            y = y.to(device)

            # The model returns probabilities (sigmoid output), so threshold at 0.5
            probs = model(x_in, x_lens).reshape(-1)
            _preds = (probs > 0.5).type(torch.long)
            preds += _preds.detach().cpu().numpy().tolist()
            true_labels += y.detach().cpu().numpy().tolist()

    print(metrics.classification_report(true_labels, preds))

evaluate()

Evaluating:   0%|          | 0/177 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         0.0       0.91      0.97      0.94      1477
         1.0       0.96      0.90      0.93      1346

    accuracy                           0.93      2823
   macro avg       0.94      0.93      0.93      2823
weighted avg       0.93      0.93      0.93      2823
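
As a sanity check on the report above, the f1-score column is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded values printed in the table. The helper name `f1_from_pr` is made up for this example.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the report above
report = {"0.0": (0.91, 0.97), "1.0": (0.96, 0.90)}

for label, (p, r) in report.items():
    print(label, round(f1_from_pr(p, r), 2))
# 0.0 0.94
# 1.0 0.93
```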



## Further Improvements

- Use pre-trained word embeddings to initialize the word embedding matrix
- Use a bidirectional RNN
- Add one more linear layer in the network
- Try different ways to initialize weights
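
The first two ideas could be sketched roughly as follows. This is only one possible variant, not the notebook's model: the class name `BiLSTMClassifier` is hypothetical, and the embedding matrix is a random stand-in for real pre-trained vectors, which would have to be loaded and aligned with the vocabulary indices separately.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class BiLSTMClassifier(nn.Module):
    """Hypothetical variant: pre-trained embeddings plus a bidirectional LSTM."""

    def __init__(self, pretrained_weights, rnn_hidden_size, padding_idx=0):
        super().__init__()
        # freeze=False lets the embeddings be fine-tuned during training
        self.emb = nn.Embedding.from_pretrained(
            pretrained_weights, freeze=False, padding_idx=padding_idx)
        self.rnn = nn.LSTM(input_size=pretrained_weights.size(1),
                           hidden_size=rnn_hidden_size,
                           batch_first=True, bidirectional=True)
        # Forward and backward final states are concatenated before the linear layer
        self.fc = nn.Linear(2 * rnn_hidden_size, 1)

    def forward(self, x_in, x_lens):
        packed = pack_padded_sequence(self.emb(x_in), x_lens,
                                      batch_first=True, enforce_sorted=False)
        _, (hidden, _) = self.rnn(packed)  # (2, batch, hidden_size)
        both = torch.cat([hidden[0], hidden[1]], dim=1)
        return self.fc(both).squeeze(-1)   # raw scores, one per sequence

# Random stand-in for a real embedding matrix: 10 words, 8 dimensions
weights = torch.randn(10, 8)
model_sketch = BiLSTMClassifier(weights, rnn_hidden_size=6)
scores = model_sketch(torch.tensor([[2, 3, 4], [5, 0, 0]]), torch.tensor([3, 1]))
print(scores.shape)  # torch.Size([2])
```

This sketch returns raw scores, so it would need either a sigmoid before `BCELoss` or a loss that accepts raw scores.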

## References

- [Pad pack sequences for Pytorch batch processing with DataLoader](https://suzyahyah.github.io/pytorch/2019/07/01/DataLoader-Pad-Pack-Sequence.html)