NLP Training 2: Tokenizers
--- 

In [1]:
import os
os.chdir('..')
print(f'Setting working dir to: {os.getcwd()}')

Setting working dir to: /Users/ingomarquart/Documents/GitHub/itern-nlp-training-cases


# Tokenization

In this section, we will tokenize a text using different methods.

## Exercise 1: One-Hot Encodings

Tokenize the sentence below by splitting it into words, then transform it into a one-hot encoding representation.

You can use either of two tools:
1. Use the `OneHotEncoder` from `sklearn` to create the representation.

2. You can use `one_hot` from `torch.nn.functional`


In [2]:
your_text = 'Do you think, large language models are slightly conscious?'

# Add your solution here:
# ...

#### Hints



1. First, you need to split the string into individual elements. You can use `nltk.tokenize.word_tokenize`.

2. Second, to use sklearn, we need to define the number of unique tokens.

3. Use the sklearn `OneHotEncoder`, and remember that we do not want a sparse representation in this case!

#### Solution

In [6]:
import string
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from nltk.tokenize import word_tokenize

tokens = word_tokenize(your_text)

# Alternative 1: Use String Functions and regex
# import re
# tokens = your_text.split(' ')
# tokens = [re.split('(\W)', token)[0:2] if any(p in token for p in string.punctuation) else token for token in tokens]
# tokens = flatten(tokens)

# Alternative 2: RegEx Ninja skills
# tokens = re.findall(r"[\w']+|[.,!?;]", your_text)

print(f'Token representation \n {tokens}')

ids = np.arange(0, len(tokens))
print(f'ID representation \n {ids}')

# Note: scikit-learn >= 1.2 renames this argument to `sparse_output`
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(ids.reshape(-1, 1))
print(f'Sklearn: One-hot representation \n {onehot_encoded}')


import torch
import torch.nn.functional as F

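# F.one_hot expects an integer (int64) tensor of class indices
ids_torch = torch.tensor(ids, dtype=torch.long)
one_hot_encodings_torch = F.one_hot(ids_torch, num_classes=len(set(ids)))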
print(f'PyTorch: One-hot representation \n {one_hot_encodings_torch}')

Token representation 
 ['Do', 'you', 'think', ',', 'large', 'language', 'models', 'are', 'slightly', 'conscious', '?']
ID representation 
 [ 0  1  2  3  4  5  6  7  8  9 10]
Sklearn: One-hot representation 
 [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
PyTorch: One-hot representation 
 tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0,

What are the issues with the above approach?
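
One way to start exploring this question is to repeat the exercise on a sentence that contains a repeated word. The sketch below uses only the standard library. The example sentence and the helper `one_hot` are made up for illustration and are not part of the solution above. It contrasts position-based IDs (as produced by `np.arange(0, len(tokens))`) with vocabulary-based IDs.

```python
# Illustrative sketch: position-based IDs vs. vocabulary-based IDs.
tokens = "the cat saw the dog".split()

# Position-based IDs: one ID per position, so both occurrences
# of "the" receive different IDs.
position_ids = list(range(len(tokens)))

# Vocabulary-based IDs: one ID per unique token.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
vocab_ids = [vocab[tok] for tok in tokens]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


one_hot_vectors = [one_hot(i, len(vocab)) for i in vocab_ids]

print(position_ids)
print(vocab_ids)
# A word that was never seen has no ID at all in this scheme.
print(vocab.get("bird"))
```

Note how the vector length grows with the vocabulary, and how an unseen word such as "bird" cannot be represented.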

## Exercise 2: PyTorch Transformer Tokenizers

To solve these issues, we will now use a tokenizer from the Hugging Face Transformers library.

Production-ready tokenizers differ from the above in a number of ways: They are subword tokenizers, are adapted to a particular model, and are trained on a corpus to have a larger vocabulary.
Tokenizers also add additional information required by the model. We will have a more detailed look at this in session 2!
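
To make the idea of a subword tokenizer more concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece (the scheme used by BERT). The vocabulary `toy_vocab` and the function `subword_tokenize` are invented for this illustration. A real tokenizer learns its vocabulary from a corpus, and GPT-2 uses byte-level BPE, which works differently.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords.

    Pieces that continue a word carry a '##' prefix (WordPiece convention).
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word is mapped to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "stat", "##works", "work"}

print(subword_tokenize("tokenizers", toy_vocab))
print(subword_tokenize("statworks", toy_vocab))
print(subword_tokenize("xyz", toy_vocab))
```

Because rare words are assembled from frequent pieces, the vocabulary can stay at a fixed size while still covering words that never appeared as a whole during training.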

In this task, use the Hugging Face Transformers library to load the tokenizer that belongs to a pretrained model. The cell below selects GPT-2.

In [9]:
your_text = 'Do you think, large language models training statworks are slightly conscious?'
model = "gpt2"

# Add your solution here:
# ...

#### Hints:



1. You can use `AutoTokenizer` to get the correct tokenizer for your model (identified by string)

2. `AutoTokenizer` has the `from_pretrained` function to load a tokenizer model from the Hugging Face Hub

3. The tokenizer has different functions to tokenize, encode and convert text. Try to understand the differences

#### Solution

The "manual" way

In [10]:
from transformers import AutoTokenizer
# Use transformers
transformer_tokenizer = AutoTokenizer.from_pretrained(model)

# We can use the tokenizer's tokenize function for the first step
tokens = transformer_tokenizer.tokenize(your_text)
print(tokens)
# Now we map each token to its vocabulary ID with convert_tokens_to_ids
token_ids = transformer_tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
# To check that this did what we wanted, we map the IDs back to tokens
print(transformer_tokenizer.convert_ids_to_tokens(token_ids))


['Do', 'Ġyou', 'Ġthink', ',', 'Ġlarge', 'Ġlanguage', 'Ġmodels', 'Ġtraining', 'Ġstat', 'works', 'Ġare', 'Ġslightly', 'Ġconscious', '?']
[5211, 345, 892, 11, 1588, 3303, 4981, 3047, 1185, 5225, 389, 4622, 6921, 30]
['Do', 'Ġyou', 'Ġthink', ',', 'Ġlarge', 'Ġlanguage', 'Ġmodels', 'Ġtraining', 'Ġstat', 'works', 'Ġare', 'Ġslightly', 'Ġconscious', '?']


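The tokenizer can do more: `encode` tokenizes the text, converts it to IDs, and adds any special tokens the model requires, all in one call. GPT-2 adds none by default, so the output below matches the manual result.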

In [11]:
# Use the encode function in one go
token_ids = transformer_tokenizer.encode(your_text)
print(token_ids)
print(transformer_tokenizer.convert_ids_to_tokens(token_ids))


[5211, 345, 892, 11, 1588, 3303, 4981, 3047, 1185, 5225, 389, 4622, 6921, 30]
['Do', 'Ġyou', 'Ġthink', ',', 'Ġlarge', 'Ġlanguage', 'Ġmodels', 'Ġtraining', 'Ġstat', 'works', 'Ġare', 'Ġslightly', 'Ġconscious', '?']


Finally, note that tokenizers should normally be called directly, that is, via their `__call__` method.

In addition to encoding the sentence, this also returns other information required by the model, such as the attention mask (see session 2!)

In [12]:

# Regular way: call the tokenizer object directly
encoded_sentence = transformer_tokenizer(your_text)
print(encoded_sentence)

{'input_ids': [5211, 345, 892, 11, 1588, 3303, 4981, 3047, 1185, 5225, 389, 4622, 6921, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Since we use PyTorch, we can also have the tokenizer return PyTorch tensors:

In [13]:
encoded_sentence = transformer_tokenizer(your_text, return_tensors='pt')
print(encoded_sentence)

{'input_ids': tensor([[5211,  345,  892,   11, 1588, 3303, 4981, 3047, 1185, 5225,  389, 4622,
         6921,   30]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


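The outputs above come from the GPT-2 tokenizer, since `model = "gpt2"` was set in the exercise cell. The `Ġ` prefix marks a token that is preceded by a space.

Try `model = "bert-base-cased"` and see what the differences are!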

## Exercise 3: Batch Encoding with Transformers

We typically want to encode a whole bunch of sentences at once,
either because we plan to apply tokenization as a map over a dataset, or because we want to prepare a whole batch for the model.

In [14]:
import pandas as pd
import numpy as np
from datasets import load_dataset

dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
dataset = dataset.filter(lambda x: x['text'] != '')
# Use `not`, not `~`: `~` on a Python bool gives -1 or -2, which are both truthy
dataset = dataset.filter(lambda x: not x['text'].startswith('='))
idx = np.random.randint(0, len(dataset))
random_sentence = dataset['text'][idx]
print(random_sentence)

Reusing dataset wikitext (/Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Loading cached processed dataset at /Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-d6ba27cc6c67cf9a.arrow
Loading cached processed dataset at /Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-5edef9c02646c5f2.arrow


 In 1969 , General António Augusto dos Santos was relieved of command , with General Kaúlza de Arriaga taking over officially in March 1970 . Kaúlza de Arriaga favoured a more direct method of fighting the insurgents , and the established policy of using African counter @-@ insurgency forces was rejected in favour of the deployment of regular Portuguese forces accompanied by a small number of African fighters . Indigenous personnel were still recruited for special operations , such as the Special Groups of Parachutists in 1973 , though their role less significant under the new commander . His tactics were partially influenced by a meeting with United States General William Westmoreland . 



In [28]:
idx_range = slice(1602,1605)
text=dataset['text'][idx_range]
model = "bert-base-cased"

Your exercise is to tokenize these sentences and collect their token_ids in a single batched PyTorch tensor.

#### Hints

1. Use the same tokenizer as before

2. Think about the output shape of each sentence, and what dimension your PyTorch tensor will have

3. Check out the parameters of the `__call__` method of your tokenizer

#### Solution

In [16]:
text

[' Adam Stansfield ( 10 September 1978 – 10 August 2010 ) was an English professional footballer who played as a striker . He competed professionally for Yeovil Town , Hereford United and Exeter City , and won promotion from the Football Conference to The Football League with all three teams . \n',
 " Having played for three counties as a child , Stansfield began his career in non @-@ league with Cullompton Rangers and Elmore , and had unsuccessful trials at league teams . At the age of 23 , he signed his first professional contract with Yeovil Town , after impressing their manager Gary Johnson in a match against them . In his first season , he helped them win the FA Trophy , scoring in the 2002 final . The following season , Yeovil won the Conference and promotion into The Football League , although Stansfield was ruled out with a broken leg in the first game . In 2004 , he transferred to Hereford United , where he won promotion to The Football League via the 2006 play @-@ offs , and 

The text pieces have very different lengths. If we simply call the tokenizer with `return_tensors='pt'`, we will get a helpful error telling us that this won't work: a PyTorch tensor needs the same length for every row.

Luckily, our Tokenizer also has the option to pad and truncate.
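
What padding and truncation do to a ragged batch can be sketched in plain Python. The function `pad_batch` and the ID lists below are made up for illustration. The pad ID 0 is an assumption that happens to match BERT's padding token as seen in the output further down. This is a simplified picture, not the tokenizer's actual implementation.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Right-pad ID lists to a common length and build the attention mask.

    If max_length is given, longer sequences are cut off first. A real
    tokenizer would also preserve the closing special token when
    truncating, which this sketch does not do.
    """
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs of three sentences with different lengths.
ragged = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]

padded = pad_batch(ragged)
print(padded["input_ids"])
print(padded["attention_mask"])

truncated = pad_batch(ragged, max_length=4)
print(truncated["input_ids"])
```

After padding, every row has the same length, so the nested list can be turned into a single rectangular tensor.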

In [29]:

from transformers import AutoTokenizer
transformer_tokenizer = AutoTokenizer.from_pretrained(model)
encoded_batch = transformer_tokenizer(text, return_tensors='pt', padding="longest")

In [31]:
print(encoded_batch["input_ids"].shape)
encoded_batch["input_ids"][:,:]

torch.Size([3, 114])


tensor([[  101,  1556,  1103,  1207, 17452,   117,  1126,  5677,  1108,  2234,
          1154,  2689,  1555,   119,  1109,  1353,  7309,  1104,  6133,  4384,
          1276,  1142,  1849,  1106,  1129,  1154,  2879,  1895,   117,  1105,
          1152,  1310,  1106,  2080,  1147,  1826, 10380,   117,  1107, 12765,
          4045,   119,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0],
        [  101,  1130,  1103,  6148,  8386,   117,  1103, 17452,  3421,  3137,
          3290

## Exercise 4: Tokenizing and Batching for a fixed length

The prior code pads a batch of inputs to the length of its longest example.
This leads to dynamic batch shapes: the sequence dimension changes from batch to batch (more on this below).

Sadly, dynamic shapes are not supported equally well in every situation. For instance, they work poorly on TPUs, where every new shape triggers a recompilation.

Or, a case more relevant to us, we might not have enough memory to deal with the very longest of sentences. Since sentence length follows a power law, these are very few. But they love to crash your pipeline at the end of an epoch!


Your task in this exercise is to create, for the same text as above, a PyTorch batch of tokenized sentences, padded and truncated to length 35.

As an additional challenge, use the tokenizer for GPT-2!

In [19]:
import pandas as pd
import numpy as np
from datasets import load_dataset

dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
dataset = dataset.filter(lambda x: x['text'] != '')
# Use `not`, not `~`: `~` on a Python bool gives -1 or -2, which are both truthy
dataset = dataset.filter(lambda x: not x['text'].startswith('='))
idx_range = slice(1602,1605)
text=dataset['text'][idx_range]
model = "gpt2"

Reusing dataset wikitext (/Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Loading cached processed dataset at /Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-d6ba27cc6c67cf9a.arrow
Loading cached processed dataset at /Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-5edef9c02646c5f2.arrow


#### Solution

In [20]:

from transformers import AutoTokenizer
transformer_tokenizer = AutoTokenizer.from_pretrained(model)


The GPT-2 tokenizer has no padding token by default. This is because it is an auto-regressive model (more in session 2). Here, we need to set the padding token ourselves.

In [21]:
# This is appropriate for GPT2
transformer_tokenizer.pad_token = transformer_tokenizer.eos_token

We set the `max_length` parameter to 35 and combine the padding strategy "max_length" with the truncation strategy "longest_first". Every sequence is then padded or truncated to exactly 35 tokens.

In [22]:

encoded_batch = transformer_tokenizer(text, return_tensors='pt', padding="max_length", max_length=35, truncation="longest_first")

In [23]:
print(encoded_batch["input_ids"].shape)
encoded_batch["input_ids"]

torch.Size([3, 35])


tensor([[ 7244,   520,   504,  3245,   357,   838,  2693, 15524,   784,   838,
          2932,  3050,  1267,   373,   281,  3594,  4708, 44185,   508,  2826,
           355,   257, 19099,   764,   679, 32440, 28049,   329, 11609,   709,
           346,  8329,   837,  3423,  3841],
        [11136,  2826,   329,  1115, 14683,   355,   257,  1200,   837,   520,
           504,  3245,  2540,   465,  3451,   287,  1729,  2488,    12,    31,
          4652,   351, 31289,   296, 10972, 13804,   290,  2574,  3549,   837,
           290,   550, 23993,  9867,   379],
        [  520,   504,  3245,   373, 14641,   351,   951,   382,   310,   282,
          4890,   287,  3035,  3050,   764,   679,  4504,   284,  3047,   706,
          8185,   290, 34696,   837,   475,  3724,   319,   838,  2932,   326,
           614,   764,   317,  8489,   287]])

## Appendix: Tokenizing in the PyTorch Pipeline and in the Transformers (Huggingface) pipeline

### Case 1: Tokenizing the whole dataset using the Hugging Face Transformers Ecosystem

#### Step 1: Tokenize the dataset

In the standard case, we will apply the tokenizer to our dataset.

When using a Hugging Face `datasets` dataset, we can simply use the `map` function:

In [24]:
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
transformer_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# map runs eagerly on a regular Dataset and caches the result on disk
new_dataset=dataset.map(lambda x: transformer_tokenizer(x["text"]))
new_dataset=new_dataset.remove_columns(["text"])

Reusing dataset wikitext (/Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Loading cached processed dataset at /Users/ingomarquart/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-0eba9b53d2ecc055.arrow


#### Step 2: Use a Hugging Face DataCollator for your Use Case

If you are going to use the Hugging Face `Trainer` interface, you can use a DataCollator to automate batching and other collation operations.
These include masking for standard masked-language modeling, or alignment operations for language modeling.

In this case, it's best to provide a DataCollator to the `Trainer` class.
The trainer will then pad each batch dynamically, and it can also group training examples of similar length to minimize padding and maximize throughput.
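
Why grouping examples of similar length reduces padding can be shown with a small calculation. The function `count_padding` and the token counts are invented for this sketch. The trainer's own grouping logic is more involved (it keeps some randomness, for example), so treat this only as an illustration of the idea.

```python
def count_padding(lengths, batch_size):
    """Count pad positions when consecutive examples are batched
    and each batch is padded to its longest example."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Made-up token counts for six training examples.
lengths = [5, 40, 7, 38, 6, 42]

unsorted_padding = count_padding(lengths, batch_size=3)
grouped_padding = count_padding(sorted(lengths), batch_size=3)

print(unsorted_padding)  # 108 pad tokens when short and long examples are mixed
print(grouped_padding)   # 9 pad tokens when similar lengths share a batch
```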

In [25]:
from transformers import DataCollatorForLanguageModeling
import torch
datacollator = DataCollatorForLanguageModeling(tokenizer=transformer_tokenizer, mlm=True, mlm_probability=0.15)

To see how this works in the trainer, we have to use a trick. The trainer, our code, or the PyTorch Dataset will encode examples via the tokenizer. The collator receives these as an iterable and creates batches from them.

In [26]:
List_of_encodings_to_batch=[new_dataset[15],new_dataset[16],new_dataset[17]]
type(List_of_encodings_to_batch)

list

In this case, the collator output includes our sequences with randomly masked tokens and a new "labels" tensor that holds the true values of the masked tokens.

**We have not used padding during tokenization, because the datacollator performs it and, if used in the trainer class, will apply further optimizations**


The ID of the `[MASK]` token for BERT is 103. In the labels, -100 marks positions that are ignored by the loss.

In [32]:
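# Call the collator once: masking is random, so two separate calls
# would return input_ids and labels that do not belong together
batch = datacollator(List_of_encodings_to_batch)
print(batch.input_ids[:,:15])
print(batch.labels[:,:15])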

tensor([[  101,  1109,  1342,   103,  1282,  1219,  1103,  2307,   103,   103,
          1414,   119,   144,  5727,  1811],
        [  101,  1249,   103, 10208,  2008,  3184,  1202,   103,  9933,   103,
          1103,   103,   174, 20492,  4199],
        [  101,   103,  1193,  1496,   103,  1292,  1958,   117,  1105,  6146,
          1496,  1106,  1103,  1558,  6053]])
tensor([[ -100,  1109,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100],
        [ -100,  -100,  1103, 10208,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  3105,   174,  -100,  4199],
        [ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  1105,  6146,
          -100,  -100,  -100,  -100,  -100]])
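
How `input_ids` and `labels` relate can be sketched without the library. The function `mask_tokens` below is a simplified stand-in written for this illustration, not the collator's implementation. It always writes the mask ID, whereas the real collator by default also replaces some selected tokens with a random token or leaves them unchanged. The IDs 101, 102, 0 and 103 are the special-token IDs of `bert-base-cased`.

```python
import random


def mask_tokens(input_ids, mask_id=103, special_ids=(101, 102, 0), prob=0.15, seed=0):
    """Simplified masking: selected tokens become mask_id and their
    original ID moves to the labels. All other label positions are -100,
    the value that PyTorch's cross-entropy loss ignores by default."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        # Special tokens ([CLS], [SEP], [PAD]) are never masked.
        if tok not in special_ids and rng.random() < prob:
            masked.append(mask_id)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels


ids = [101, 1109, 1342, 1108, 1282, 1219, 1103, 2307, 102]
# A high probability is used here only so that the short example shows some masks.
masked, labels = mask_tokens(ids, prob=0.5)
print(masked)
print(labels)
```

The model is then trained to predict the label at every position where the label is not -100.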


### Case 2: Tokenizing for a PyTorch Dataset

If we use a PyTorch Dataset, we have two options:

1. Tokenize the entire dataset during initialization.

    This is great if our dataset fits in memory, or if we have access to fast storage such that we can save the prepared dataset (or cache it otherwise). 

    We can then either: 
    
    * Use a custom collate_fn to do the batching dynamically

    * Pad and truncate all examples to a fixed length and use standard PyTorch collation

2. Tokenize on the fly.

    This might be required if we cannot load the data beforehand (e.g. streaming). 
    
    Another case is when each training example combines several examples in a complicated manner, for example if the masking probabilities depend on the properties of the batch, or when a whole set of examples is augmented from a single example.

    In this case we'd return or yield these batches, including padding and tokenization, right from the dataset.
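
The second option can be sketched as a plain Python generator. Everything here is hypothetical: `stream_batches`, `toy_tokenize` and the three input lines only illustrate the control flow. In a real project, this logic would typically live in the `__iter__` method of a `torch.utils.data.IterableDataset`, with a real tokenizer and tensors instead of lists.

```python
def stream_batches(lines, tokenize, batch_size, max_length, pad_id=0):
    """Yield fixed-length batches of token IDs, tokenizing each line
    only at the moment it is consumed."""
    batch = []
    for line in lines:
        ids = tokenize(line)[:max_length]                       # truncate
        batch.append(ids + [pad_id] * (max_length - len(ids)))  # pad
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final batch, which may hold fewer examples.
        yield batch


# Hypothetical whitespace tokenizer that assigns IDs on first sight.
# ID 0 stays reserved for padding.
toy_vocab = {}


def toy_tokenize(text):
    return [toy_vocab.setdefault(word, len(toy_vocab) + 1) for word in text.split()]


lines = iter(["a b c", "a d", "e f g h i"])  # stands in for a data stream
batches = list(stream_batches(lines, toy_tokenize, batch_size=2, max_length=4))
print(batches)
```

Because the generator pulls one line at a time, the full dataset never has to be held in memory.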
