# Datasets and Data Loaders

In this notebook we introduce the data that we will use to train the language models built later on. We also give an overview of the tools that make it easier to work with data when training deep learning models.

## Imports

PyTorch's data utilities are located in `torch.utils.data` and we will also import some of our own tools from `modelling.data` (refer to the source code if you're interested in how they work).

In [1]:
from torch.utils.data import DataLoader

from modelling.data import (
    FilmReviewSequences,
    GPTTokenizer,
    IMDBTokenizer,
    get_data,
    make_sequence_datasets,
    pad_seq2seq_data,
)

## Raw Data

The data that we will use is the set of movie reviews from IMDB, hosted at: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [2]:
data = get_data()
data.head(10)

| Unnamed: 0 | sentiment | review |
|---|---|---|
| 0 | 0 | Forget what I said about Emeril. Rachael Ray i... |
| 1 | 0 | Former private eye-turned-security guard ditch... |
| 2 | 0 | Mann photographs the Alberta Rocky Mountains i... |
| 3 | 0 | Simply put: the movie is boring. Cliché upon c... |
| 4 | 1 | Now being a fan of sci fi, the trailer for thi... |
| 5 | 1 | In 'Hoot' Logan Lerman plays Roy Eberhardt, th... |
| 6 | 0 | This is the worst film I have ever seen.I was ... |
| 7 | 1 | I think that Toy Soldiers is an excellent movi... |
| 8 | 0 | I think Micheal Ironsides acting career must b... |
| 9 | 0 | This was a disgrace to the game FarCry i had m... |


These include sentiment scores for each review, but we will not make use of them for now.

## Tokenization

We will need to split sentences into words and map those words to numbers that our models can work with, i.e., we need to perform tokenization. We have developed a bespoke tokenizer class for this dataset: `IMDBTokenizer`.

In [3]:
reviews = data["review"].tolist()
review = reviews[0]

tokenizer = IMDBTokenizer(reviews)
tokenized_review = tokenizer(review)
tokenised_review_decoded = tokenizer.tokens2text(tokenized_review[:10])

print(f"ORIGINAL TEXT: {review[:47]} ...")
print(f"TOKENS FROM TEXT: {', '.join(str(t) for t in tokenized_review[:10])} ...")
print(f"TEXT FROM TOKENS: {tokenised_review_decoded} ...")

ORIGINAL TEXT: Forget what I said about Emeril. Rachael Ray is ...
TOKENS FROM TEXT: 822, 49, 11, 299, 43, 38969, 3, 10411, 1391, 8 ...
TEXT FROM TOKENS: forget what i said about emeril. rachael ray is ...
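
To make the word-to-number mapping concrete, here is a minimal sketch of a word-level tokenizer. `ToyWordTokenizer`, its `<unk>` token and its splitting rule are assumptions made for illustration; the actual `IMDBTokenizer` in `modelling.data` may normalize text and assign ids differently.

```python
import re
from collections import Counter


class ToyWordTokenizer:
    """Hypothetical word-level tokenizer; not the IMDBTokenizer implementation."""

    def __init__(self, texts, unknown_token="<unk>"):
        counts = Counter(word for text in texts for word in self._split(text))
        # Reserve id 0 for unknown words, then assign ids by descending frequency.
        self.id_to_word = [unknown_token] + [w for w, _ in counts.most_common()]
        self.word_to_id = {w: i for i, w in enumerate(self.id_to_word)}

    @staticmethod
    def _split(text):
        # Lower-case, then keep runs of word characters and single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text.lower())

    def __call__(self, text):
        # Words never seen when building the vocabulary map to the unknown id (0).
        return [self.word_to_id.get(w, 0) for w in self._split(text)]

    def tokens2text(self, tokens):
        return " ".join(self.id_to_word[t] for t in tokens)


toy_tokenizer = ToyWordTokenizer(["The film was good.", "The film was not good."])
toy_tokens = toy_tokenizer("The film was great.")
toy_text = toy_tokenizer.tokens2text(toy_tokens)
```

In this sketch the unseen word "great" maps to the unknown id, which is one common way for a word-level vocabulary to handle words missing from the training data.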


We have also provided an implementation of the tokenizer used in the GPT models, which is based on [Byte Pair Encoding](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt). This approach to tokenization can be thought of as a middle ground between character-level and word-level encoding.

In [4]:
gpt_tokenizer = GPTTokenizer(reviews)
tokenized_review = gpt_tokenizer(review)
tokenised_review_decoded = gpt_tokenizer.tokens2text(tokenized_review[:10])

print(f"ORIGINAL TEXT: {review[:47]} ...")
print(f"TOKENS FROM TEXT: {', '.join(str(t) for t in tokenized_review[:10])} ...")
print(f"TEXT FROM TOKENS: {tokenised_review_decoded} ...")

ORIGINAL TEXT: Forget what I said about Emeril. Rachael Ray is ...
TOKENS FROM TEXT: 19574, 440, 209, 1242, 391, 23177, 223, 14, 21282, 3444 ...
TEXT FROM TOKENS: Forget what I said about Emeril. Rachael Ray ...
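
The idea behind Byte Pair Encoding can be illustrated with a small sketch of its training loop: start from characters and repeatedly merge the most frequent adjacent pair of symbols into a new symbol. The toy corpus, the function names and the choice of three merges are assumptions made for illustration; this is not how `GPTTokenizer` is implemented, and real GPT tokenizers operate on bytes with a much larger number of merges.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return pair_counts.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word starts as a tuple of characters, mapped to its frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3, "yes": 1, "lot": 1}
words = {tuple(word): freq for word, freq in toy_corpus.items()}

merges = []
for _ in range(3):  # three merges, chosen arbitrarily for this sketch
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
```

After these merges, frequent fragments such as "est" and "lo" have become single symbols while rarer words are still spelled out from smaller pieces, which is the sense in which the scheme sits between character-level and word-level encoding.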


## PyTorch Datasets

PyTorch provides a simple framework that makes it easier to assemble batches of data when training models with algorithms like [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).

The first part of this framework is the `Dataset` class interface, which enables downstream objects to interact with data in a consistent way.

We have implemented a custom `Dataset` class called `FilmReviewSequences` that enables indexed access to token sequences that can be used for training generative language models.

In [5]:
tokenized_reviews = [tokenizer(review) for review in reviews]
dataset = FilmReviewSequences(tokenized_reviews)
x, y = dataset[0]

print(f"x[:5]: {x[:5]}")
print(f"y[:5]: {y[:5]}")

x[:5]: tensor([822,  49,  11, 299,  43])
y[:5]: tensor([   49,    11,   299,    43, 38969])


Note that `y` here is the same sequence held in `x`, but shifted forwards by one position, as the task is to predict the next token(s) in the sequence, given an initial sequence of tokens (or 'prompt').
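
As a rough sketch of how such a class can be put together, a map-style `Dataset` only needs `__len__` and `__getitem__`. The class below is a hypothetical stand-in written for illustration; `ToyNextTokenDataset` and its filtering of very short sequences are assumptions, and the actual `FilmReviewSequences` implementation may differ.

```python
import torch
from torch.utils.data import Dataset


class ToyNextTokenDataset(Dataset):
    """Hypothetical stand-in for FilmReviewSequences; the real class may differ."""

    def __init__(self, tokenized_texts):
        # Keep only sequences long enough to form at least one (input, target) pair.
        self._sequences = [seq for seq in tokenized_texts if len(seq) > 1]

    def __len__(self):
        return len(self._sequences)

    def __getitem__(self, idx):
        seq = torch.tensor(self._sequences[idx])
        # Inputs are all tokens but the last; targets are all tokens but the first.
        return seq[:-1], seq[1:]


toy_dataset = ToyNextTokenDataset([[822, 49, 11, 299, 43, 38969], [7], [5, 6, 7]])
toy_x, toy_y = toy_dataset[0]
```

Here `toy_y[i]` is the token that follows `toy_x[i]` in the original sequence, which is the pairing a next-token prediction loss needs.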

For convenience, we also provide `make_sequence_datasets`, which returns datasets for training, validation and testing.

In [6]:
datasets = make_sequence_datasets()

print(f"size of training data = {len(datasets.train_data)} reviews")
print(f"size of validation data = {len(datasets.val_data)} reviews")
print(f"size of test data = {len(datasets.test_data)} reviews")

print(f"\nvocabulary size = {datasets.tokenizer.vocab_size} tokens")

size of training data = 42709 reviews
size of validation data = 2209 reviews
size of test data = 4959 reviews

vocabulary size = 74421 tokens


## PyTorch DataLoaders

The second component of PyTorch's data handling framework is the `DataLoader` class. This class takes a `Dataset` and yields batches of data to be used in each iteration of a model's training step.

In [7]:
data_loader = DataLoader(dataset, batch_size=10, collate_fn=pad_seq2seq_data)

Because each movie review has a different length and all sequences in a single batch must have the same shape, we pad sequences to a common length using our `pad_seq2seq_data` function, which the data loader calls automatically when assembling a batch.
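
A padding collate function of this kind can be sketched with `torch.nn.utils.rnn.pad_sequence`. The function name `toy_pad_collate`, the padding id of 0 and the choice to pad to the longest sequence in each batch are assumptions made for illustration; `pad_seq2seq_data` may use a different padding value or a fixed maximum length.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_TOKEN = 0  # assumed padding id; the value used by pad_seq2seq_data may differ


def toy_pad_collate(batch):
    """Pad each x and y in a batch of (x, y) pairs to the longest sequence in the batch."""
    xs, ys = zip(*batch)
    x_padded = pad_sequence(list(xs), batch_first=True, padding_value=PAD_TOKEN)
    y_padded = pad_sequence(list(ys), batch_first=True, padding_value=PAD_TOKEN)
    return x_padded, y_padded


# A plain list of (x, y) tensor pairs works as a map-style dataset.
toy_pairs = [
    (torch.tensor([5, 6, 7]), torch.tensor([6, 7, 8])),
    (torch.tensor([9]), torch.tensor([10])),
]
toy_loader = DataLoader(toy_pairs, batch_size=2, collate_fn=toy_pad_collate)
toy_x_batch, toy_y_batch = next(iter(toy_loader))
```

The shorter pair is right-padded with `PAD_TOKEN` so that both tensors in the batch have shape `(batch_size, longest_length)`.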

In [8]:
# Fetch only the first batch rather than materializing every batch in memory.
x_batch, y_batch = next(iter(data_loader))

We can verify that the first batch has the expected shape: 10 sequences, all padded to the same length.

In [9]:
x_batch.shape

torch.Size([10, 40])

In [10]:
y_batch.shape

torch.Size([10, 40])