In [9]:
import collections

import spacy

import torch
from torch import nn
from torch.utils.data import DataLoader

from torchtext.datasets import IMDB
from torchtext.vocab import Vocab

This notebook is based on a [tutorial](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb) from [Ben Trevett](https://github.com/bentrevett) and uses the IMDB Movie Reviews dataset, a set of movie reviews that are each labelled as either positive or negative. The goal is to correctly infer the sentiment of a review from its text. The main change from the source tutorial is that we mirror the approach taken in the [AG News Classification notebook](https://github.com/DavidEdwards1/pytorch-text/blob/main/notebooks/AG%20News%20Classification.ipynb) when building the text processing pipeline and the collator.

## Text Processing

Our text processing pipeline is similar to the one used in the [AG News Classification notebook](https://github.com/DavidEdwards1/pytorch-text/blob/main/notebooks/AG%20News%20Classification.ipynb): we first tokenise the text and then create a vocabulary that encodes each token as an integer. The main "upgrade" here is that we use [SpaCy](https://spacy.io/) to tokenise the text (specifically the tokeniser from their [`en_core_web_md`](https://spacy.io/models/en#en_core_web_md) model). A question worth investigating: does this actually make a difference compared with the `torchtext.data.utils.get_tokenizer('basic_english')` tokeniser?
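
One way to start answering that question is to run two tokenisers over the same texts and compare the vocabularies they produce. The sketch below is only an illustration: `compare_tokenizers`, `whitespace_tokenize` and `regex_tokenize` are hypothetical names, and the two toy tokenisers are stand-ins so that the example runs without SpaCy or torchtext. Any two callables that map a string to a list of token strings could be passed in instead.

```python
import re


def whitespace_tokenize(text):
    """Stand-in tokeniser: split on whitespace only."""
    return text.split()


def regex_tokenize(text):
    """Stand-in tokeniser: lower-case, then separate words from punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def compare_tokenizers(tok_a, tok_b, texts):
    """Summarise how two tokenisers differ over a collection of texts."""
    vocab_a, vocab_b = set(), set()
    n_differ = 0
    for text in texts:
        tokens_a, tokens_b = tok_a(text), tok_b(text)
        vocab_a.update(tokens_a)
        vocab_b.update(tokens_b)
        n_differ += tokens_a != tokens_b  # True counts as 1
    return {
        "vocab_a": len(vocab_a),
        "vocab_b": len(vocab_b),
        "shared": len(vocab_a & vocab_b),
        "texts_differing": n_differ,
    }


texts = ["A great film!", "not great", "A bad film."]
summary = compare_tokenizers(whitespace_tokenize, regex_tokenize, texts)
print(summary)
# {'vocab_a': 6, 'vocab_b': 7, 'shared': 3, 'texts_differing': 2}
```

Vocabulary size and overlap only show that the tokenisers behave differently. Whether that matters for sentiment classification would still need to be checked by training the same model on both.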

In [2]:
nlp = spacy.load("en_core_web_md")
tokenizer = nlp.tokenizer

train_iter = IMDB(split='train')

counter = collections.Counter()
for (label, text) in train_iter:
    counter.update((t.text for t in tokenizer(text)))
    
vocab = Vocab(counter, min_freq=1)

In [3]:
[vocab[token.text] for token in tokenizer("Here is an example")]

[1026, 9, 43, 491]
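
To make the token-to-integer step concrete, here is a minimal pure-Python sketch of a frequency-ordered vocabulary with an unknown-token fallback. It is a simplification for illustration, not torchtext's implementation: `build_vocab`, `encode`, the special tokens and the alphabetical tie-breaking are all assumptions chosen for this example.

```python
import collections


def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Map each token to an integer, most frequent tokens first."""
    counter = collections.Counter(
        token for tokens in token_lists for token in tokens
    )
    index_to_token = list(specials)
    # Sort by descending frequency, breaking ties alphabetically (an assumption).
    for token, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            index_to_token.append(token)
    return {token: index for index, token in enumerate(index_to_token)}


def encode(tokens, token_to_index):
    """Encode tokens as integers; unseen tokens map to the <unk> index."""
    unk = token_to_index["<unk>"]
    return [token_to_index.get(token, unk) for token in tokens]


corpus = [["a", "good", "film"], ["a", "bad", "film"]]
token_to_index = build_vocab(corpus)
print(token_to_index)
# {'<unk>': 0, '<pad>': 1, 'a': 2, 'film': 3, 'bad': 4, 'good': 5}
print(encode(["a", "great", "film"], token_to_index))
# [2, 0, 3]
```

Frequent tokens receive small indices, and a word that never appeared in the training data ("great" here) falls back to the `<unk>` index instead of raising an error.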

In [6]:
use_cuda = torch.cuda.is_available()
device = torch.device("cpu") #torch.device("cuda" if use_cuda else "cpu")

class TextPipeline:
    def __init__(self, vocab, tokenizer):
        self.vocab = vocab
        self.tokenizer = tokenizer
        
    def __call__(self, text):
        return [self.vocab[token.text] for token in self.tokenizer(text)]

class LabelPipeline:
    def __call__(self, label):
        return 0 if label == "neg" else 1
    
class Collator:
    def __init__(self, text_pipeline, label_pipeline):
        self.text_pipeline = text_pipeline
        self.label_pipeline = label_pipeline
        
    def __call__(self, batch):
        """
        Prepare batch of data to be used as input to torch model.

        Returns
        -------
          labels: a torch.tensor of integer encoded labels. Has shape (batch_size)
          texts: a torch.tensor of integer encoded text sequences. Encoded using text_pipeline.
              Each example is concatenated together into a flat 1D tensor. The start of each
              example is recorded in offsets. Has shape (n_tokens_in_batch)
          offsets: a torch.tensor of the index of the start of each example.
              Has shape (batch_size)
        """
        labels, texts, offsets = [], [], [0]

        for (label, text) in batch:
            labels.append(
                self.label_pipeline(label)
            )
            processed_text = torch.tensor(
                self.text_pipeline(text),
                dtype=torch.int64
            )
            texts.append(processed_text)
            offsets.append(processed_text.size(0)) # length of processed text

        labels = torch.tensor(labels, dtype=torch.int64)
        offsets = torch.tensor(offsets[:-1]).cumsum(dim=0) # starting index of each example
        texts = torch.cat(texts)  # already a list of tensors, so concatenate instead of calling torch.tensor

        return labels.to(device), texts.to(device), offsets.to(device)
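
The flat-tensor-plus-offsets layout returned by the collator can be easier to follow on a tiny example. The sketch below reproduces the same bookkeeping with plain Python lists so it can be checked by hand. `flatten_with_offsets` and `unflatten` are hypothetical helper names, and the sketch assumes a non-empty batch.

```python
from itertools import accumulate


def flatten_with_offsets(sequences):
    """Concatenate encoded sequences and record where each one starts."""
    flat = [token for seq in sequences for token in seq]
    lengths = [len(seq) for seq in sequences]
    # Start indices are the cumulative sum of [0] plus every length but the last,
    # which is what `torch.tensor(offsets[:-1]).cumsum(dim=0)` computes.
    offsets = list(accumulate([0] + lengths[:-1]))
    return flat, offsets


def unflatten(flat, offsets):
    """Recover the original sequences from the flat list and its offsets."""
    ends = offsets[1:] + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]


encoded = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
flat, offsets = flatten_with_offsets(encoded)
print(flat)     # [5, 6, 7, 8, 9, 10, 11, 12, 13]
print(offsets)  # [0, 3, 5]
print(unflatten(flat, offsets) == encoded)  # True
```

Because each example ends where the next one begins, the offsets alone are enough to split the flat sequence back into per-review sequences, which a model that processes tokens in order needs.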

In [10]:
train_iter = IMDB(split='train')

collator = Collator(
    TextPipeline(vocab, tokenizer),
    LabelPipeline()
)

dataloader = DataLoader(
    train_iter,
    batch_size=8,
    shuffle=False,
    collate_fn=collator
)

## Predictive Model

We use a fairly simple RNN model. The tokenised and numericalised text is passed into an embedding layer and then into a recurrent layer. Each embedding vector in a sequence is fed through the recurrent layer in turn, and the hidden state is updated at every step. The final hidden state should therefore encode something useful about the whole text. We take this final hidden state and pass it into a linear layer to get a prediction.
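
As a rough illustration of that description, the sketch below shows one possible way to wire the three layers together. It is an assumption-laden example rather than the notebook's final model: the class name `SketchRNN`, the layer sizes, and the choice to accept the `(texts, offsets)` pair produced by the collator are all made up for this sketch. Since a recurrent layer needs each review as its own sequence, the flat tensor is split back into examples and packed with `torch.nn.utils.rnn.pack_sequence`. The sketch assumes every example contains at least one token.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_sequence


class SketchRNN(nn.Module):
    """Hypothetical RNN classifier; all sizes are illustrative assumptions."""

    def __init__(self, vocab_size, embedding_dim=32, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, texts, offsets):
        # Recover per-example lengths from the start offsets.
        ends = torch.cat([offsets[1:], offsets.new_tensor([texts.size(0)])])
        lengths = (ends - offsets).tolist()
        sequences = torch.split(texts, lengths)
        # Embed each sequence, then pack so the RNN only sees real tokens.
        packed = pack_sequence(
            [self.embedding(seq) for seq in sequences], enforce_sorted=False
        )
        _, hidden = self.rnn(packed)  # hidden: (num_layers, batch_size, hidden_dim)
        return self.fc(hidden[-1])    # logits: (batch_size, n_classes)


model = SketchRNN(vocab_size=100)
texts = torch.tensor([1, 2, 3, 4, 5, 6, 7])  # three examples, flattened
offsets = torch.tensor([0, 3, 5])            # lengths 3, 2 and 2
logits = model(texts, offsets)
print(logits.shape)  # torch.Size([3, 2])
```

The output has one row of class scores per review, which could be passed to `nn.CrossEntropyLoss` together with the integer labels from the collator.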

In [None]:
class RNN(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, text):
        pass