
---

In this exercise, your goal is to implement a `Dataset` class in PyTorch, which is used to handle and process the data for training a transformer model. You'll also use a `DataLoader` to load batches of data from this dataset. Follow the steps below to complete the implementation based on the provided starter code.

The starter code includes the tokenizer that we've implemented in previous videos. You will need to create an instance of this tokenizer using the training dataset.

To complete the exercise, go through the code and implement the missing parts marked with a `TODO` comment.

**1. Define the `TokenIdsDataset` Class:**

First, implement the missing methods in the `TokenIdsDataset` class.

- **`__init__` Method:**
    - Initialize the class with `data` and `block_size`.
    - Save `data` and `block_size` as instance variables.

- **`__len__` Method:**
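    - Compute the number of items in the dataset. Each item needs `block_size` tokens plus one extra token for the shifted target, so not every position in `data` can start an item. The length should be `len(data) - block_size`, which is less than `len(data)`.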

- **`__getitem__` Method:**
    - Validate the input position to ensure it is within a valid range.
    - Retrieve the input item: the tokens from position `pos` up to, but not including, `pos + block_size`.
    - Retrieve the target item: the same window shifted one position to the right, from `pos + 1` up to, but not including, `pos + 1 + block_size`.
    - Return both the input item and the target item.
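
The windowing described above can be sketched on toy data. This is only an illustration, not the exercise solution: it uses a plain Python list as a stand-in for the tensor of token IDs (1-D tensors slice the same way), and the values of `data` and `block_size` are made up for the example.

```python
# Toy stand-in for the encoded training data (assumption: made-up values).
data = list(range(10))
block_size = 4

# Each item needs block_size inputs plus one extra token for the shifted
# target, so the last valid start position is len(data) - block_size - 1.
num_items = len(data) - block_size

pos = 0
x = data[pos:pos + block_size]          # input window
y = data[pos + 1:pos + 1 + block_size]  # same window shifted right by one

print(num_items)  # 6
print(x)          # [0, 1, 2, 3]
print(y)          # [1, 2, 3, 4]
```

Note that `y[i]` is always the token that follows `x[i]`, which is what a next-token prediction model is trained to output.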

**2. Tokenize the Text:**

Create a tokenizer and encode data from the training dataset. Then, create an instance of the `TokenIdsDataset` for the training data.

**3. Retrieve the First Item from the Dataset**

Get the first item from the dataset and decode it using the tokenizer. If everything is implemented correctly, then with `block_size=64` you should get the first 64 characters of the training dataset.
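
One way to check this step is to compare the decoded item with a slice of the original text. The sketch below assumes a minimal character-level mapping built inline (the names `char_to_id` and `id_to_char` are hypothetical and stand in for the notebook's tokenizer) and uses a short sample string with a smaller block size.

```python
# Short sample standing in for the training text (assumption for this sketch).
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Hypothetical inline character mappings, standing in for the tokenizer.
vocab = sorted(set(text))
char_to_id = {char: token_id for token_id, char in enumerate(vocab)}
id_to_char = {token_id: char for token_id, char in enumerate(vocab)}

token_ids = [char_to_id[char] for char in text]

block_size = 16
first_item = token_ids[0:block_size]  # input part of the item at position 0
decoded = ''.join(id_to_char[token_id] for token_id in first_item)

# The decoded first item should match the first block_size characters.
matches = decoded == text[:block_size]
print(matches)
```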

**4. Use a DataLoader:**

Now, try using the `DataLoader` with the training dataset we've created. The `DataLoader` here is created with a `RandomSampler` to randomize the items we get from the training dataset.

First, get a single training batch from the `DataLoader`.

Then decode the input and target tokens using the tokenizer. The input and target should come from the same part of the training dataset, with the target shifted by one character.
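
Besides decoding, the shift can be checked directly on the tensors. The following is a minimal sketch under stated assumptions: `ToyWindowDataset` is a hypothetical stand-in for `TokenIdsDataset`, and the data is a simple `torch.arange` sequence rather than real token IDs.

```python
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler


class ToyWindowDataset(Dataset):
    """Hypothetical stand-in for TokenIdsDataset, used only for this check."""

    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, pos):
        x = self.data[pos:pos + self.block_size]
        y = self.data[pos + 1:pos + 1 + self.block_size]
        return x, y


toy_data = torch.arange(100)  # made-up data for illustration
toy_dataset = ToyWindowDataset(toy_data, block_size=8)

sampler = RandomSampler(toy_dataset, replacement=True)
loader = DataLoader(toy_dataset, batch_size=2, sampler=sampler)

x, y = next(iter(loader))

# Dropping the first input token and the last target token should leave
# identical tensors if the target is the input shifted by one position.
shift_ok = torch.equal(x[:, 1:], y[:, :-1])
print(x.shape, shift_ok)
```

The same comparison, `torch.equal(x[:, 1:], y[:, :-1])`, should also hold for batches drawn from the real training dataset.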

---


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from get_device import get_device

# Use CUDA if available
device = get_device()

In [2]:
from pathlib import Path

text = Path('../../data/tiny-shakespeare.txt').read_text()

In [3]:
print(text[0:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:

class CharTokenizer:
  def __init__(self, vocabulary):
    self.token_id_for_char = {char: token_id for token_id, char in enumerate(vocabulary)}
    self.char_for_token_id = {token_id: char for token_id, char in enumerate(vocabulary)}

  @staticmethod
  def train_from_text(text):
    vocabulary = set(text)
    # Sort so that token ids are assigned deterministically
    return CharTokenizer(sorted(vocabulary))

  def encode(self, text):
    token_ids = []
    for char in text:
      token_ids.append(self.token_id_for_char[char])
    return torch.tensor(token_ids, dtype=torch.long)

  def decode(self, token_ids):
    chars = []
    for token_id in token_ids.tolist():
      chars.append(self.char_for_token_id[token_id])
    return ''.join(chars)


  def vocabulary_size(self):
    return len(self.token_id_for_char)

In [5]:
tokenizer = CharTokenizer.train_from_text(text)

In [6]:
print(tokenizer.encode("Hello world"))
print(tokenizer.decode(tokenizer.encode("Hello world")))

tensor([20, 43, 50, 50, 53,  1, 61, 53, 56, 50, 42])
Hello world


In [7]:
print(f"Vocabulary size: {tokenizer.vocabulary_size()}")

Vocabulary size: 65


In [8]:
from torch.utils.data import Dataset

class TokenIdsDataset(Dataset):
  def __init__(self, data, block_size):
    self.data = data
    self.block_size = block_size

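  def __len__(self):
    # Each item needs block_size tokens plus one extra for the shifted target
    return len(self.data) - self.block_size

  def __getitem__(self, pos):
    # Reject negative and out-of-range positions
    assert 0 <= pos < len(self)

    # Input window and the same window shifted right by one token
    x = self.data[pos:pos + self.block_size]
    y = self.data[pos + 1:pos + 1 + self.block_size]
    return x, y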

In [9]:
tokenized_text = tokenizer.encode(text)
dataset = TokenIdsDataset(tokenized_text, block_size=64)

In [10]:
x, y = dataset[0]

In [11]:
x

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50])

In [12]:
tokenizer.decode(x)

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAl'

In [13]:
from torch.utils.data import DataLoader, RandomSampler

sampler = RandomSampler(dataset, replacement=True)
dataloader = DataLoader(dataset, batch_size=2, sampler=sampler)

In [14]:
x, y = next(iter(dataloader))

In [15]:
x.shape

torch.Size([2, 64])

In [16]:
x

tensor([[ 1, 53, 60, 43, 56,  1, 51, 39, 52, 63,  1, 49, 52, 39, 60, 47, 57, 46,
          1, 54, 56, 53, 44, 43, 57, 57, 47, 53, 52, 57,  6,  1, 46, 43,  0, 57,
         43, 58, 58, 50, 43, 42,  1, 53, 52, 50, 63,  1, 47, 52,  1, 56, 53, 45,
         59, 43, 10,  1, 57, 53, 51, 43,  1, 41],
        [46, 43, 39, 56, 58,  1, 47, 57,  1, 46, 43, 56, 43, 12,  0, 32, 59, 56,
         52,  1, 40, 39, 41, 49,  6,  1, 42, 59, 50, 50,  1, 43, 39, 56, 58, 46,
          6,  1, 39, 52, 42,  1, 44, 47, 52, 42,  1, 58, 46, 63,  1, 41, 43, 52,
         58, 56, 43,  1, 53, 59, 58,  8,  0,  0]])

In [17]:
tokenizer.decode(x[0])

' over many knavish professions, he\nsettled only in rogue: some c'

In [18]:
tokenizer.decode(y[0])

'over many knavish professions, he\nsettled only in rogue: some ca'