In [1]:
import numpy as np
import torch

**Dataset**

Run the cell below to open the ebook of [_Pride and Prejudice_](https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) and read it into the variable `raw_text`.

**Note**: Due to hardware constraints, we'll use only the text of **Chapter 1**, which we've sliced out of `raw_text` and saved to the variable `raw_text_ch1`.

In [2]:
with open('datasets/book.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

# slice out Chapter 1 by character offsets
raw_text_ch1 = raw_text[1985:6468]
print(raw_text_ch1[:117])

It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.


Let's tokenize and preprocess the text of Chapter 1 into the following variables:

- `tokenized_text` : contains the tokenized text as character-based tokens
- `unique_character_tokens` : contains a list of unique tokens sorted in Unicode code-point order (whitespace and punctuation first, then uppercase letters, then lowercase letters)
- `c2ix` : the vocabulary as a dictionary mapping the tokens to their token IDs
- `ix2c` : the inverse vocabulary mapping the token IDs back to their tokens


In [3]:

tokenized_text = list(raw_text_ch1)
unique_character_tokens = sorted(set(tokenized_text))
c2ix = {c: ix for ix, c in enumerate(unique_character_tokens)}
ix2c = {ix: c for c, ix in c2ix.items()}

# Get the vocabulary size
vocab_size = len(c2ix)

# show output - print the vocabulary size
print("Vocabulary size:", vocab_size)
ix2c

Vocabulary size: 51


{0: '\n',
 1: ' ',
 2: '!',
 3: '"',
 4: ',',
 5: '.',
 6: ';',
 7: '?',
 8: 'A',
 9: 'B',
 10: 'D',
 11: 'E',
 12: 'H',
 13: 'I',
 14: 'J',
 15: 'L',
 16: 'M',
 17: 'N',
 18: 'O',
 19: 'P',
 20: 'S',
 21: 'T',
 22: 'W',
 23: 'Y',
 24: '_',
 25: 'a',
 26: 'b',
 27: 'c',
 28: 'd',
 29: 'e',
 30: 'f',
 31: 'g',
 32: 'h',
 33: 'i',
 34: 'j',
 35: 'k',
 36: 'l',
 37: 'm',
 38: 'n',
 39: 'o',
 40: 'p',
 41: 'q',
 42: 'r',
 43: 's',
 44: 't',
 45: 'u',
 46: 'v',
 47: 'w',
 48: 'x',
 49: 'y',
 50: 'z'}
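
As a quick sanity check on the two mappings, the sketch below encodes a short string to token IDs and decodes it back. The string `sample_text` and the `sample_*` names are illustrative assumptions, not part of the notebook; the same pattern should apply to `c2ix` and `ix2c`.

```python
# Toy text standing in for raw_text_ch1 (assumption: any string behaves the same way)
sample_text = "a good fortune"

# Build the vocabulary exactly as in the cell above
sample_vocab = sorted(set(sample_text))
sample_c2ix = {c: ix for ix, c in enumerate(sample_vocab)}
sample_ix2c = {ix: c for c, ix in sample_c2ix.items()}

# Encode characters to IDs, then decode the IDs back to characters
encoded = [sample_c2ix[c] for c in sample_text]
decoded = "".join(sample_ix2c[ix] for ix in encoded)

print(encoded)
print(decoded)
```

If the decoded string matches the input, the two dictionaries are consistent inverses of each other.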

In [7]:
tokenized_id_text = [c2ix[c] for c in tokenized_text]
tokenized_id_text

[13,
 44,
 1,
 33,
 43,
 1,
 25,
 1,
 44,
 42,
 45,
 44,
 32,
 1,
 45,
 38,
 33,
 46,
 29,
 42,
 43,
 25,
 36,
 36,
 49,
 1,
 25,
 27,
 35,
 38,
 39,
 47,
 36,
 29,
 28,
 31,
 29,
 28,
 4,
 1,
 44,
 32,
 25,
 44,
 1,
 25,
 1,
 43,
 33,
 38,
 31,
 36,
 29,
 1,
 37,
 25,
 38,
 1,
 33,
 38,
 1,
 40,
 39,
 43,
 43,
 29,
 43,
 43,
 33,
 39,
 38,
 0,
 39,
 30,
 1,
 25,
 1,
 31,
 39,
 39,
 28,
 1,
 30,
 39,
 42,
 44,
 45,
 38,
 29,
 4,
 1,
 37,
 45,
 43,
 44,
 1,
 26,
 29,
 1,
 33,
 38,
 1,
 47,
 25,
 38,
 44,
 1,
 39,
 30,
 1,
 25,
 1,
 47,
 33,
 30,
 29,
 5,
 0,
 0,
 12,
 39,
 47,
 29,
 46,
 29,
 42,
 1,
 36,
 33,
 44,
 44,
 36,
 29,
 1,
 35,
 38,
 39,
 47,
 38,
 1,
 44,
 32,
 29,
 1,
 30,
 29,
 29,
 36,
 33,
 38,
 31,
 43,
 1,
 39,
 42,
 1,
 46,
 33,
 29,
 47,
 43,
 1,
 39,
 30,
 1,
 43,
 45,
 27,
 32,
 1,
 25,
 1,
 37,
 25,
 38,
 1,
 37,
 25,
 49,
 1,
 26,
 29,
 1,
 39,
 38,
 1,
 32,
 33,
 43,
 0,
 30,
 33,
 42,
 43,
 44,
 1,
 29,
 38,
 44,
 29,
 42,
 33,
 38,
 31,
 1,
 25,
 1,
 38,
 29,


Let's now use the PyTorch utility class `Dataset` to turn the tokenized text into feature and label sequences.

Finish building the `TextDataset` class, which subclasses `Dataset`, such that:

1. In the `__init__` method, initialize the class attributes:
   - `self.tokenized_text` assigned with the input variable `tokenized_text`
   - `self.seq_length` assigned with the input variable `seq_length`

2. In the `__getitem__` method, return the feature sequence and its labels as PyTorch tensors. The labels are the feature sequence shifted one token to the right, so the label at each position is the next token in the text.
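
To make the one-token shift concrete, here is a minimal sketch on a made-up list of token IDs. The names `toy_ids`, `toy_seq_length`, and `idx` are hypothetical and chosen only for illustration.

```python
import torch

# Hypothetical token IDs and window length, for illustration only
toy_ids = [10, 11, 12, 13, 14, 15]
toy_seq_length = 3
idx = 1  # start position of the window

# Features: seq_length tokens starting at idx
features = torch.tensor(toy_ids[idx:idx + toy_seq_length])
# Labels: the same window moved one token to the right
labels = torch.tensor(toy_ids[idx + 1:idx + toy_seq_length + 1])

print(features)  # tensor([11, 12, 13])
print(labels)    # tensor([12, 13, 14])
```

Each label is the token that follows the corresponding feature token, which is what a next-character model is trained to predict.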


In [10]:
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, tokenized_text, seq_length):
        self.tokenized_text = tokenized_text
        self.seq_length = seq_length
        
    def __len__(self):
        return len(self.tokenized_text) - self.seq_length
        
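    def __getitem__(self, idx):
        # features: seq_length tokens starting at idx
        features = torch.tensor(self.tokenized_text[idx:idx + self.seq_length])
        # labels: the same window shifted one token to the right
        labels = torch.tensor(self.tokenized_text[idx + 1:idx + self.seq_length + 1])
        return features, labels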


Now that the dataset class is built, let's access the features and labels to be loaded as batches.

1. Create and access the sequences for the features and labels.
   - Specify a sequence length of `24` and save the value to the variable `seq_length`
   - Use the created `TextDataset` class to create the features and labels from the tokenized text using the sequence length. Save the sequences to the variable `dataset`. 

2. Create the iterable that allows us to load the sequences as batches.
   - Specify a batch size of `48` and save the value to the variable `batch_size`
   - Create a `DataLoader` by passing in the dataset and the batch size and setting `shuffle=True`. Save the iterable to the variable `dataloader`.


In [13]:
torch.manual_seed(1)  # set random seed so the shuffling is reproducible

seq_length = 24
dataset = TextDataset(tokenized_id_text, seq_length)

batch_size = 48
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

In [14]:
dataloader

<torch.utils.data.dataloader.DataLoader at 0x7fb448079bd0>
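
Printing the `DataLoader` shows only the object itself; to inspect what it yields, one option is to pull a single batch and check its shape. The sketch below is self-contained and uses stand-in names (`ToyTextDataset`, `toy_ids`, `toy_loader`) on made-up consecutive IDs, under the assumption that the class mirrors `TextDataset` above.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyTextDataset(Dataset):
    """Stand-in that mirrors the TextDataset class defined earlier."""

    def __init__(self, tokenized_text, seq_length):
        self.tokenized_text = tokenized_text
        self.seq_length = seq_length

    def __len__(self):
        return len(self.tokenized_text) - self.seq_length

    def __getitem__(self, idx):
        features = torch.tensor(self.tokenized_text[idx:idx + self.seq_length])
        labels = torch.tensor(self.tokenized_text[idx + 1:idx + self.seq_length + 1])
        return features, labels


# Made-up token IDs 0..99, used only so the example runs without the book file
toy_ids = list(range(100))
toy_dataset = ToyTextDataset(toy_ids, seq_length=24)
toy_loader = DataLoader(toy_dataset, batch_size=48, shuffle=True)

# Pull one batch: the default collation stacks the samples along a new first dimension
batch_features, batch_labels = next(iter(toy_loader))
print(batch_features.shape)  # torch.Size([48, 24])
print(batch_labels.shape)    # torch.Size([48, 24])
```

The first dimension is the batch size and the second is the sequence length. With the notebook's `dataloader`, the same `next(iter(...))` call should give tensors of the same shape, though the final batch of an epoch may be smaller.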