<a href="https://colab.research.google.com/github/Apoak/Deep-Learning-Projects/blob/main/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Lab 8.1 Tokenization

This week we will work up to creating an RNN text generator. In today's lab you will explore different methods of text tokenization. Here's an overview of what you will try to do.

Imagine that our entire dataset consists of the following text:

    hello world hello a b c

We would first build a vocabulary of the words in the dataset:

    0: hello
    1: world
    2: a
    3: b
    4: c

Thus the dataset can be mapped to token indices:

    0 1 0 2 3 4

Now suppose that we have defined the maximum sequence length (`seq_len`) to be 3.  We will use each possible sequence as the input to our RNN, and the next token as the target.  Here are the possible input sequences and targets:

    0 1 0 -> 2
    1 0 2 -> 3
    0 2 3 -> 4
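
The toy example above can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not the `Dataset` class the exercises ask for; the names `toy_text`, `toy_vocab`, `toy_tokens` and `pairs` are made up for illustration.

```python
# Toy corpus from the overview above.
toy_text = "hello world hello a b c"
seq_len = 3

# Give each new word the next free index, in order of first appearance.
toy_vocab = {}
for word in toy_text.split():
    if word not in toy_vocab:
        toy_vocab[word] = len(toy_vocab)

# Encode the corpus as a list of token indices.
toy_tokens = [toy_vocab[word] for word in toy_text.split()]

# Slide a window of seq_len tokens; the token right after it is the target.
pairs = [(toy_tokens[i:i + seq_len], toy_tokens[i + seq_len])
         for i in range(len(toy_tokens) - seq_len)]

print(toy_vocab)
print(toy_tokens)
for sequence, target in pairs:
    print(sequence, "->", target)
```

Note that a text of `n` tokens yields `n - seq_len` input/target pairs, because the last `seq_len` start positions have no complete window with a target after it.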

You will build a subclass of `Dataset` to find all possible sequences for a given dataset, either at the word or character level.

The following code will download the text of Shakespeare's sonnets and read it in as one long string.

In [None]:
from torch.utils.data import Dataset

In [None]:
!wget --no-clobber "https://www.dropbox.com/scl/fi/7r68l64ijemidyb9lf80q/sonnets.txt?rlkey=udb47coatr2zbrk31hsfbr22y&dl=1" -O sonnets.txt
# Read the whole file into one string; the with-block closes the file afterwards
with open("sonnets.txt") as f:
  text = f.read()


In [None]:
text = text.lower()

In [None]:
print(text[:1000])

### Exercises

1. Prepare a vocabulary of the unique words in the dataset.  (For simplicity's sake you can leave the punctuation in.)

In [None]:
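# Split on whitespace; punctuation stays attached to the words
split_text = text.split()

# Map each unique word to an integer index, in order of first appearance
vocab = {}
for word in split_text:
  if word not in vocab:
    vocab[word] = len(vocab)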

2. Now you will make a Dataset subclass that can return sequences of tokens, encoded as integers.

In [None]:
class WordDataset(Dataset):
  def __init__(self,text,seq_len=100):
    # drop a leading byte-order mark, if any, then split on whitespace
    self.split_text = text.lstrip("\ufeff").split()
    self.seq_len = seq_len
    # vocabulary: the unique words, in order of first appearance (as in exercise 1)
    self.vocab = list(dict.fromkeys(self.split_text))
    # lookup tables: word -> index and index -> word
    self.word_indices = {word: i for i, word in enumerate(self.vocab)}
    self.indices_word = {i: word for word, i in self.word_indices.items()}
    # the whole text as a sequence of word indices
    self.token_sequence = [self.word_indices[word] for word in self.split_text]

  def __len__(self):
    # number of possible sub-sequences: every start position that still has a label after its window
    return len(self.token_sequence) - self.seq_len

  def __getitem__(self,i):
    # called for ds[i]: returns the seq_len token indices starting at i,
    # and the index of token i+seq_len as the label
    sequence = self.token_sequence[i:i+self.seq_len]
    label = self.token_sequence[i+self.seq_len]
    return sequence, label

  def decode(self,tokens):
    # convert a sequence of token indices back into a space-separated string
    return " ".join(self.indices_word[token] for token in tokens)

3. Verify that your class can successfully encode and decode sequences.

In [None]:
ds = WordDataset(text)

# ds[29] calls ds.__getitem__(29); len(ds) calls ds.__len__()
sequence, label = ds[29]

print("Vocabulary: ", ds.vocab)
print("Word-to-index map: ",  ds.word_indices)
print("Index-to-word map: ",  ds.indices_word)
print("Possible subsequences: ", len(ds))
print("Encoded tokens: ", sequence)
print("Label: ", label)
print("Decoded tokens: ", ds.decode(sequence))
print("Decoded label: ", ds.decode([label]))
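
On what `__getitem__` does: Python calls it whenever an object is indexed with square brackets, and calls `__len__` for `len(...)`. A map-style `Dataset` is simply an object that supports those two operations, which is what `torch.utils.data.DataLoader` relies on to draw samples. The sketch below shows the protocol without PyTorch; the `Windows` class is a hypothetical stand-in, not part of the lab.

```python
class Windows:
    """Hypothetical stand-in for a Dataset: sliding windows over a token list."""

    def __init__(self, tokens, seq_len):
        self.tokens = tokens
        self.seq_len = seq_len

    def __len__(self):
        # Called by len(obj): one window per valid start position.
        return len(self.tokens) - self.seq_len

    def __getitem__(self, i):
        # Called by obj[i]: the window starting at i, plus the token after it.
        return self.tokens[i:i + self.seq_len], self.tokens[i + self.seq_len]


windows = Windows([0, 1, 0, 2, 3, 4], seq_len=3)
print(len(windows))   # same as windows.__len__()
print(windows[1])     # same as windows.__getitem__(1)
```

Calling the dunder methods directly works, but `len(ds)` and `ds[i]` are the usual spelling.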



4. Do the exercise again, but this time at the character level.

In [None]:
class CharacterDataset(Dataset):
  def __init__(self,text,seq_len=100):
    # drop a leading byte-order mark, if any, and remove newlines
    # (note: removing newlines can fuse the last word of one line with the first word of the next)
    self.chunk_text = text.lstrip("\ufeff").replace("\n","")
    self.seq_len = seq_len
    # vocabulary: the unique characters, in order of first appearance
    self.vocab = list(dict.fromkeys(self.chunk_text))
    # lookup tables: character -> index and index -> character
    self.char_tokens = {char: i for i, char in enumerate(self.vocab)}
    self.tokens_char = {i: char for char, i in self.char_tokens.items()}
    # the whole text as a sequence of character indices
    self.token_sequence = [self.char_tokens[char] for char in self.chunk_text]

  def __len__(self):
    # number of possible sub-sequences: every start position that still has a label after its window
    return len(self.token_sequence) - self.seq_len

  def __getitem__(self,i):
    # called for ds[i]: returns the seq_len token indices starting at i,
    # and the index of token i+seq_len as the label
    sequence = self.token_sequence[i:i+self.seq_len]
    label = self.token_sequence[i+self.seq_len]
    return sequence, label

  def decode(self,tokens):
    # convert a sequence of token indices back into a string
    return "".join(self.tokens_char[token] for token in tokens)

In [None]:
ds = CharacterDataset(text)

sequence, label = ds[100]

# print only the start of the token sequence; the full list is very long
print("First 20 tokens: ", ds.token_sequence[:20])
print("Vocabulary: ", ds.vocab)
print("Character-to-index map: ",  ds.char_tokens)
print("Index-to-character map: ",  ds.tokens_char)
print("Possible subsequences: ", len(ds))
print("Encoded tokens: ", sequence)
print("Label: ", label)
print("Decoded tokens: ", ds.decode(sequence))

5. Compare the number of sequences for each tokenization method.

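Word tokenization yields 17,570 sequences, while character tokenization yields 95,202. The difference follows from the token counts: the number of sequences is the number of tokens minus `seq_len`, and the same text contains far more characters than words (roughly 5.4 characters per word token here, counting spaces and punctuation). The windows also cover very different amounts of text: with `seq_len` set to 100, the word-level model sees 100 words of context at a time, while the character-level model sees only 100 characters, about a fifth as much text.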
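
The same effect shows up on the toy text from the overview. This is a small sketch with made-up names (`sample`, `toy_seq_len`, `counts`), not a measurement on the sonnets:

```python
sample = "hello world hello a b c"   # toy text, not the sonnets
toy_seq_len = 3

word_tokens = sample.split()   # one token per whitespace-separated word
char_tokens = list(sample)     # one token per character, spaces included

# Number of sequences = number of tokens - sequence length.
counts = {}
for name, tokens in [("word", word_tokens), ("char", char_tokens)]:
    counts[name] = len(tokens) - toy_seq_len
    print(name, len(tokens), "tokens ->", counts[name], "sequences")
```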

6. Optional: implement the byte pair encoding algorithm to make a Dataset class that uses word parts.
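
A possible starting point for the optional exercise is sketched below: a toy version of byte pair encoding that repeatedly fuses the most frequent adjacent pair of symbols. The function names (`merge_pair`, `learn_bpe`, `encode_word`), the `</w>` end-of-word marker, and the tiny training list are all assumptions chosen for illustration; a real implementation would also need to handle ties, punctuation, and speed.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each occurrence of the adjacent pair in symbols with one fused symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each distinct word becomes a tuple of single characters plus an
    # end-of-word marker, mapped to how often the word occurs.
    corpus = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode_word(word, merges):
    """Split one word into word parts by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe("low low low lower lowest".split(), num_merges=3)
print(merges)
print(encode_word("lowly", merges))
```

To turn this into a `Dataset`, one option is to encode every word of the text with `encode_word`, build a vocabulary over the resulting word parts, and then reuse the windowing logic from the word-level class.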