# The 🤗 tokenizers library
## [Introduction](https://huggingface.co/course/chapter6/1?fw=pt)

In [Chapter 3](https://huggingface.co/course/chapter3), we looked at how to fine-tune a model on a given task. When we do that, we use the same tokenizer that the model was pretrained with — but what do we do when we want to train a model from scratch? In these cases, using a tokenizer that was pretrained on a corpus from another domain or language is typically suboptimal. For example, a tokenizer that's trained on an English corpus will perform poorly on a corpus of Japanese texts because the use of spaces and punctuation is very different in the two languages.

In this chapter, you will learn how to train a brand new tokenizer on a corpus of texts, so it can then be used to pretrain a language model. This will all be done with the help of the [🤗 Tokenizers](https://github.com/huggingface/tokenizers) library, which provides the "fast" tokenizers in the [🤗 Transformers](https://github.com/huggingface/transformers) library. We'll take a close look at the features that this library provides, and explore how the fast tokenizers differ from the "slow" versions.

Topics we will cover include:
- How to train a new tokenizer similar to the one used by a given checkpoint on a new corpus of texts.
- The special features of fast tokenizers.
- The differences between the three main subword tokenization algorithms used in NLP today.
- How to build a tokenizer from scratch with the 🤗 Tokenizers library and train it on some data.

The techniques introduced in this chapter will prepare you for the section in [Chapter 7](https://huggingface.co/course/chapter7/6) where we look at creating a language model for Python source code. Let's start by looking at what it means to "train" a tokenizer in the first place.

## [Training a new tokenizer from an old one](https://huggingface.co/course/chapter6/2?fw=pt)

If a language model is not available in the language you are interested in, or if your corpus is very different from the one your language model was trained on, you will most likely want to retrain the model from scratch using a tokenizer adapted to your data. That will require training a new tokenizer on your dataset. But what exactly does that mean? When we first looked at tokenizers in [Chapter 2](https://huggingface.co/course/chapter2), we saw that most Transformer models use a *subword tokenization algorithm*. To identify which subwords are of interest and occur most frequently in the corpus at hand, the tokenizer needs to take a hard look at all the texts in the corpus — a process we call *training*. The exact rules that govern this training depend on the type of tokenizer used, and we'll go over the three main algorithms later in this chapter.
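
As a rough illustration of what "looking at all the texts in the corpus" can involve, here is a toy sketch in plain Python. It is not the procedure used by any 🤗 tokenizer: `count_pairs`, `merge_pair`, and `toy_train` are hypothetical helpers that repeatedly merge the most frequent adjacent pair of symbols, loosely in the spirit of the subword algorithms covered later in this chapter. Because the sketch only counts and compares, running it twice on the same corpus gives the same merges.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs over all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def toy_train(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of texts (toy illustration only)."""
    words = Counter(tuple(word) for text in corpus for word in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Ties are broken on the pair itself, so the result does not depend on ordering
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


corpus = ["def add(a, b):", "def sub(a, b):", "return a + b"]
first_run = toy_train(corpus, num_merges=5)
second_run = toy_train(corpus, num_merges=5)
print(first_run)
print(first_run == second_run)
```

No random seed appears anywhere in the sketch, which is the practical difference from model training: the learned vocabulary is a pure function of the corpus and the algorithm.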

In [1]:
from IPython.display import HTML
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/DJimQynXZsQ" allowfullscreen></iframe>')



> <font color="darkred">⚠️ Training a tokenizer is not the same as training a model! Model training uses stochastic gradient descent to make the loss a little bit smaller for each batch. It's randomized by nature (meaning you have to set some seeds to get the same results when doing the same training twice). Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. It's deterministic, meaning you always get the same results when training with the same algorithm on the same corpus.</font>

### Assembling a corpus
There's a very simple API in 🤗 Transformers that you can use to train a new tokenizer with the same characteristics as an existing one: `AutoTokenizer.train_new_from_iterator()`. To see this in action, let's say we want to train GPT-2 from scratch, but in a language other than English. Our first task will be to gather lots of data in that language in a training corpus. To provide examples everyone will be able to understand, we won't use a language like Russian or Chinese here, but rather a specialized English language: Python code.

The [🤗 Datasets](https://github.com/huggingface/datasets) library can help us assemble a corpus of Python source code. We'll use the usual `load_dataset()` function to download and cache the [CodeSearchNet](https://huggingface.co/datasets/code_search_net) dataset. This dataset was created for the [CodeSearchNet challenge](https://wandb.ai/github/CodeSearchNet/benchmark) and contains millions of functions from open source libraries on GitHub in several programming languages. Here, we will load the Python part of this dataset:

In [2]:
from datasets import load_dataset
# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")

Reusing dataset code_search_net (/Users/matthias/.cache/huggingface/datasets/code_search_net/python/1.0.0/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27)



We can have a look at the training split to see which columns we have access to:

In [3]:
raw_datasets["train"]

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

We can see the dataset separates docstrings from code and suggests a tokenization of both. Here, we'll just use the `whole_func_string` column to train our tokenizer. We can look at an example of one of these functions by indexing into the `train` split:

In [4]:
print(raw_datasets["train"][123456]["whole_func_string"])

def updater():
    """Update the current installation.

    git clones the latest version and merges it with the current directory.
    """
    print('%s Checking for updates' % run)
    # Changes must be separated by ;
    changes = '''major bug fixes;removed ninja mode;dropped python < 3.2 support;fixed unicode output;proxy support;more intels'''
    latest_commit = requester('https://raw.githubusercontent.com/s0md3v/Photon/master/core/updater.py', host='raw.githubusercontent.com')
    # Just a hack to see if a new version is available
    if changes not in latest_commit:
        changelog = re.search(r"changes = '''(.*?)'''", latest_commit)
        # Splitting the changes to form a list
        changelog = changelog.group(1).split(';')
        print('%s A new version of Photon is available.' % good)
        print('%s Changes:' % info)
        for change in changelog: # print changes
            print('%s>%s %s' % (green, end, change))

        current_path = os.getcwd().split('/') #

The output above differs from the one in the course (shown below) because the dataset, specifically the `whole_func_string` column, has been updated.

Originally, the `whole_func_string` read as follows:

```python
def handle_simple_responses(
      self, timeout_ms=None, info_cb=DEFAULT_MESSAGE_CALLBACK):
    """Accepts normal responses from the device.

    Args:
      timeout_ms: Timeout in milliseconds to wait for each response.
      info_cb: Optional callback for text sent from the bootloader.

    Returns:
      OKAY packet's message.
    """
    return self._accept_responses('OKAY', info_cb, timeout_ms=timeout_ms)
```

The first thing we need to do is transform the dataset into an *iterator* of lists of texts (for instance, a list of lists of texts). Using lists of texts will enable our tokenizer to go faster (training on batches of texts instead of processing individual texts one by one), and it should be an iterator if we want to avoid having everything in memory at once. If your corpus is huge, you will want to take advantage of the fact that 🤗 Datasets does not load everything into RAM but stores the elements of the dataset on disk.

Doing the following would create a list of lists of 1,000 texts each, but would load everything in memory:

In [5]:
# Don't uncomment the following line(s) unless your dataset is small!
# training_corpus = [
#     raw_datasets["train"][i: i + 1000]["whole_func_string"] for i in range(0, len(raw_datasets["train"]), 1000)
# ]

Using a Python generator, we can avoid Python loading anything into memory until it's actually necessary. To create such a generator, you just need to replace the brackets with parentheses:

In [6]:
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)
training_corpus

<generator object <genexpr> at 0x7f7bf00cbb30>

This line of code doesn't fetch any elements of the dataset; it just creates an object you can use in a Python `for` loop. The texts will only be loaded when you need them (that is, when you're at the step of the `for` loop that requires them), and only 1,000 texts at a time will be loaded. This way you won't exhaust all your memory even if you are processing a huge dataset.

The problem with a generator object is that it can only be used once. So, instead of this giving us the list of the first 10 digits twice:

In [7]:
gen = (i for i in range(10))
print(list(gen))
print(list(gen))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]


we get them once and then an empty list.

That's why we define a function that returns a generator instead:

In [8]:
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )
training_corpus = get_training_corpus()
training_corpus

<generator object get_training_corpus.<locals>.<genexpr> at 0x7f7c00e3c5f0>

You can also define your generator inside a `for` loop by using the `yield` statement:

In [9]:
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]
get_training_corpus

<function __main__.get_training_corpus()>

which will produce the exact same generator as before, but allows you to use more complex logic than you can in a list comprehension.
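
The pattern can be tried without downloading anything. The sketch below uses a small list of strings as a stand-in for the dataset column (an assumption: slicing a 🤗 Datasets split returns a dictionary of columns, whereas here we slice a plain list), and the helper name `get_batches` is made up for this illustration. It shows that each call to the function returns a fresh generator, so the corpus can be iterated more than once.

```python
def get_batches(texts, batch_size):
    """Yield successive lists of `batch_size` texts; the last batch may be shorter."""
    for start_idx in range(0, len(texts), batch_size):
        yield texts[start_idx : start_idx + batch_size]


# A stand-in for the real dataset column
texts = [f"def f{i}(): pass" for i in range(7)]

first_pass = list(get_batches(texts, batch_size=3))
# Calling the function again creates a new generator, so this pass is not empty
second_pass = list(get_batches(texts, batch_size=3))

print([len(batch) for batch in first_pass])
print(first_pass == second_pass)
```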

### Training a new tokenizer
Now that we have our corpus in the form of an iterator of batches of texts, we are ready to train a new tokenizer. To do this, we first need to load the tokenizer we want to pair with our model (here, GPT-2):

In [10]:
from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
old_tokenizer

PreTrainedTokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'})

Even though we are going to train a new tokenizer, it's a good idea to do this to avoid starting entirely from scratch. This way, we won't have to specify anything about the tokenization algorithm or the special tokens we want to use; our new tokenizer will be exactly the same as GPT-2, and the only thing that will change is the vocabulary, which will be determined by the training on our corpus.

First let's have a look at how this tokenizer would treat an example function:

In [11]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''
tokens = old_tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

This tokenizer has a few special symbols, like `Ġ` and `Ċ`, which denote spaces and newlines, respectively. As we can see, this is not too efficient: the tokenizer returns individual tokens for each space, when it could group together indentation levels (since having sets of four or eight spaces is going to be very common in code). It also split the function name a bit weirdly, not being used to seeing words with the `_` character.
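
Where do `Ġ` and `Ċ` come from? Byte-level tokenizers such as GPT-2's work on bytes and display each byte as a printable character. The sketch below rebuilds that byte-to-character table as it is commonly described for GPT-2's reference implementation; treat the exact ranges as an assumption and compare the result with the tokenizer output above. Bytes that are already printable keep their own character, and every other byte is shifted to a code point starting at 256.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character."""
    # Bytes that already correspond to printable characters keep their code point
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes (space, newline, control characters, ...) are
            # moved to unused code points from 256 upwards
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


table = bytes_to_unicode()
visible = "".join(table[b] for b in "a b\n".encode("utf-8"))
print(visible)
```

Under this table the space (byte 32) becomes `Ġ` and the newline (byte 10) becomes `Ċ`, which matches the symbols in the tokens listed above.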

Let's train a new tokenizer and see if it solves those issues. For this, we'll use the method `train_new_from_iterator()`:

In [12]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
tokenizer

PreTrainedTokenizerFast(name_or_path='gpt2', vocab_size=52000, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'})

This command might take a bit of time if your corpus is very large, but for this dataset of 1.6 GB of texts it's blazing fast (1 minute 16 seconds on an AMD Ryzen 9 3900X CPU with 12 cores).

Note that `AutoTokenizer.train_new_from_iterator()` only works if the tokenizer you are using is a "fast" tokenizer. As you'll see in the next section, the 🤗 Transformers library contains two types of tokenizers: some are written purely in Python and others (the fast ones) are backed by the 🤗 Tokenizers library, which is written in the [Rust](https://www.rust-lang.org/) programming language. Python is the language most often used for data science and deep learning applications, but when anything needs to be parallelized to be fast, it has to be written in another language. For instance, the matrix multiplications that are at the core of the model computation are written in CUDA, an optimized C library for GPUs.

Training a brand new tokenizer in pure Python would be excruciatingly slow, which is why we developed the 🤗 Tokenizers library. Note that just as you didn't have to learn the CUDA language to be able to execute your model on a batch of inputs on a GPU, you won't need to learn Rust to use a fast tokenizer. The 🤗 Tokenizers library provides Python bindings for many methods that internally call some piece of code in Rust; for example, to parallelize the training of your new tokenizer or, as we saw in [Chapter 3](https://huggingface.co/course/chapter3), the tokenization of a batch of inputs.

Most of the Transformer models have a fast tokenizer available (there are some exceptions that you can check [here](https://huggingface.co/transformers/#supported-frameworks)), and the `AutoTokenizer` API always selects the fast tokenizer for you if it's available. In the next section we'll take a look at some of the other special features fast tokenizers have, which will be really useful for tasks like token classification and question answering. Before diving into that, however, let's try our brand new tokenizer on the previous example:

In [13]:
tokens = tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'numbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'ĊĠĠĠ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`."""',
 'ĊĠĠĠ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']

Here we again see the special symbols `Ġ` and `Ċ` that denote spaces and newlines, but we can also see that our tokenizer learned some tokens that are highly specific to a corpus of Python functions: for example, there is a `ĊĠĠĠ` token that represents an indentation, and a `Ġ"""` token that represents the three quotes that start a docstring. The tokenizer also correctly split the function name on `_`. This is quite a compact representation; comparatively, using the plain English tokenizer on the same example will give us a longer sentence:

In [14]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36
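
To put the two lengths in perspective, here is a small arithmetic sketch. The counts are the ones printed above; the 1,024-token window comes from the `model_max_len` shown for GPT-2 earlier, and extrapolating the ratio from this single example to a whole corpus is only an assumption.

```python
new_length = 27  # tokens produced by the newly trained tokenizer
old_length = 36  # tokens produced by the original GPT-2 tokenizer

# Relative saving on this example
reduction = 1 - new_length / old_length
print(f"{reduction:.0%} fewer tokens")

# If the same ratio held in general, a 1,024-token window filled with new tokens
# would cover the text that needs this many old tokens
context_size = 1024
equivalent_old_tokens = context_size * old_length / new_length
print(round(equivalent_old_tokens))
```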


Let's look at another example:

In [15]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """
tokenizer.tokenize(example)

['class',
 'ĠLinear',
 'Layer',
 '():',
 'ĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'init',
 '__(',
 'self',
 ',',
 'Ġinput',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'weight',
 'Ġ=',
 'Ġtorch',
 '.',
 'randn',
 '(',
 'input',
 '_',
 'size',
 ',',
 'Ġoutput',
 '_',
 'size',
 ')',
 'ĊĠĠĠĠĠĠĠ',
 'Ġself',
 '.',
 'bias',
 'Ġ=',
 'Ġtorch',
 '.',
 'zeros',
 '(',
 'output',
 '_',
 'size',
 ')',
 'ĊĊĠĠĠ',
 'Ġdef',
 'Ġ__',
 'call',
 '__(',
 'self',
 ',',
 'Ġx',
 '):',
 'ĊĠĠĠĠĠĠĠ',
 'Ġreturn',
 'Ġx',
 'Ġ@',
 'Ġself',
 '.',
 'weights',
 'Ġ+',
 'Ġself',
 '.',
 'bias',
 'ĊĠĠĠĠ']

In addition to the token corresponding to an indentation, here we can also see a token for a double indentation: `ĊĠĠĠĠĠĠĠ`. The special Python words like `class`, `init`, `call`, `self`, and `return` are each tokenized as one token, and we can see that as well as splitting on `_` and `.` the tokenizer correctly splits even camel-cased names: `LinearLayer` is tokenized as `["ĠLinear", "Layer"]`.

### Saving the tokenizer

To make sure we can use it later, we need to save our new tokenizer. Like for models, this is done with the `save_pretrained()` method:

In [16]:
tokenizer.save_pretrained("code-search-net-tokenizer")

('code-search-net-tokenizer/tokenizer_config.json',
 'code-search-net-tokenizer/special_tokens_map.json',
 'code-search-net-tokenizer/vocab.json',
 'code-search-net-tokenizer/merges.txt',
 'code-search-net-tokenizer/added_tokens.json',
 'code-search-net-tokenizer/tokenizer.json')

This will create a new folder named `code-search-net-tokenizer`, which will contain all the files the tokenizer needs to be reloaded. If you want to share this tokenizer with your colleagues and friends, you can upload it to the Hub by logging into your account. If you're working in a notebook, there's a convenience function to help you with this:

In [17]:
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /Users/matthias/.huggingface/token


This will display a widget where you can enter your Hugging Face login credentials. If you aren’t working in a notebook, just type the following line in your terminal:
```bash
huggingface-cli login
```
Once you've logged in, you can push your tokenizer by executing the following command:

In [18]:
# use_temp_dir=True
# https://discuss.huggingface.co/t/chapter-4-questions/6801/5
tokenizer.push_to_hub("code-search-net-tokenizer", use_temp_dir=True)

Cloning https://huggingface.co/mdroth/code-search-net-tokenizer into local empty directory.


This will create a new repository in your namespace with the name `code-search-net-tokenizer`, containing the tokenizer file. You can then load the tokenizer from anywhere with the `from_pretrained()` method:

In [19]:
# Replace "huggingface-course" below with your actual namespace to use your own tokenizer
#tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
tokenizer = AutoTokenizer.from_pretrained("mdroth/code-search-net-tokenizer")

You're now all set for training a language model from scratch and fine-tuning it on your task at hand! We'll get to that in Chapter 7, but first, in the rest of this chapter we'll take a closer look at fast tokenizers and explore in detail what actually happens when we call the method `train_new_from_iterator()`.

## [Fast tokenizers' special powers](https://huggingface.co/course/chapter6/3?fw=pt)

In this section we will take a closer look at the capabilities of the tokenizers in 🤗 Transformers. Up to now we have only used them to tokenize inputs or decode IDs back into text, but tokenizers — especially those backed by the 🤗 Tokenizers library — can do a lot more. To illustrate these additional features, we will explore how to reproduce the results of the `token-classification` (that we called `ner`) and `question-answering` pipelines that we first encountered in [Chapter 1](https://huggingface.co/course/chapter1).

In [20]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/g8quOxoqhHQ" allowfullscreen></iframe>')



In the following discussion, we will often make the distinction between "slow" and "fast" tokenizers. Slow tokenizers are those written in Python inside the 🤗 Transformers library, while the fast versions are the ones provided by 🤗 Tokenizers, which are written in Rust. If you remember the table from [Chapter 5](https://huggingface.co/course/chapter5/3) that reported how long it took a fast and a slow tokenizer to tokenize the Drug Review Dataset, you should have an idea of why we call them fast and slow:

| |Fast tokenizer|Slow tokenizer|
|:---|---|---|
|batched=True|10.8s|4min41s|
|batched=False|59.2s|5min3s|

> <font color="darkred">⚠️ When tokenizing a single sentence, you won't always see a difference in speed between the slow and fast versions of the same tokenizer. In fact, the fast version might actually be slower! It's only when tokenizing lots of texts in parallel at the same time that you will be able to clearly see the difference.</font>

### Batch Encoding

In [21]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/3umI3tm27Vw" allowfullscreen></iframe>')

The output of a tokenizer isn't a simple Python dictionary; what we get is actually a special `BatchEncoding` object. It's a subclass of a dictionary (which is why we were able to index into that result without any problem before), but with additional methods that are mostly used by fast tokenizers.

Besides their parallelization capabilities, the key functionality of fast tokenizers is that they always keep track of the original span of texts the final tokens come from — a feature we call *offset mapping*. This in turn unlocks features like mapping each word to the tokens it generated or mapping each character of the original text to the token it's inside, and vice versa.

Let's take a look at an example:

In [22]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


As mentioned previously, we get a `BatchEncoding` object in the tokenizer's output (see above).

Since the `AutoTokenizer` class picks a fast tokenizer by default, we can use the additional methods this `BatchEncoding` object provides. We have two ways to check if our tokenizer is a fast or a slow one. We can either check the attribute `is_fast` of the `tokenizer`:

In [23]:
tokenizer.is_fast

True

or check the same attribute of our `encoding`:

In [24]:
encoding.is_fast

True

Let's see what a fast tokenizer enables us to do. First, we can access the tokens without having to convert the IDs back to tokens:

In [25]:
encoding.tokens()

['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']

In this case the token at index 5 is `##yl`, which is part of the word "Sylvain" in the original sentence. We can also use the `word_ids()` method to get the index of the word each token comes from:

In [26]:
encoding.word_ids()

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

We can see that the tokenizer's special tokens `[CLS]` and `[SEP]` are mapped to `None`, and then each token is mapped to the word it originates from. This is especially useful to determine if a token is at the start of a word or if two tokens are in the same word. We could rely on the `##` prefix for that, but it only works for BERT-like tokenizers; this method works for any type of tokenizer as long as it's a fast one. In the next chapter, we'll see how we can use this capability to apply the labels we have for each word properly to the tokens in tasks like named entity recognition (NER) and part-of-speech (POS) tagging. We can also use it to mask all the tokens coming from the same word in masked language modeling (a technique called *whole word masking*).
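
As a preview of how word IDs can be used, here is a minimal sketch in plain Python. It reuses the `word_ids()` output shown above; the word-level labels are invented for the illustration, the helper names are hypothetical, and the `-100` value for special tokens is an assumption borrowed from the common convention of ignoring that index in the loss.

```python
def align_labels_with_tokens(word_ids, word_labels, ignore_index=-100):
    """Give every token the label of its word; special tokens get `ignore_index`."""
    return [
        ignore_index if word_id is None else word_labels[word_id]
        for word_id in word_ids
    ]


def is_word_start(word_ids):
    """Return True for each token that begins a new word."""
    starts, previous = [], None
    for word_id in word_ids:
        starts.append(word_id is not None and word_id != previous)
        previous = word_id
    return starts


# Output of encoding.word_ids() for the example sentence
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]
# Hypothetical labels, one per word: My name is Sylvain and I work at Hugging Face in Brooklyn .
word_labels = ["O", "O", "O", "PER", "O", "O", "O", "O", "ORG", "ORG", "O", "LOC", "O"]

token_labels = align_labels_with_tokens(word_ids, word_labels)
print(token_labels)
print(is_word_start(word_ids))
```

The four tokens of "Sylvain" all receive the label of word 3, and only the first of them is flagged as a word start, without relying on the `##` prefix.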

> <font color="darkgreen">The notion of what a word is is complicated. For instance, does "I'll" (a contraction of "I will") count as one or two words? It actually depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.</font><br>
✏️ Try it out! <font color="darkgreen">Create a tokenizer from the `bert-base-cased` and `roberta-base` checkpoints and tokenize "81s" with them. What do you observe? What are the word IDs?</font>

In [27]:
# Trying it out
example_81s = "81s"
## bert-base-cased (bbc)
tokenizer_bbc = AutoTokenizer.from_pretrained("bert-base-cased")
encoding_bbc = tokenizer_bbc(example_81s)
# tokens() is a method, so it has to be called to get the list of tokens
print(f"bbc tokens:\n{encoding_bbc.tokens()}\n\nbbc word IDs:\n{encoding_bbc.word_ids()}")
start_bbc1, end_bbc1 = encoding_bbc.token_to_chars(1)
start_bbc2, end_bbc2 = encoding_bbc.token_to_chars(2)
bbc_str = f"bert-base-cased treats '{example_81s}' as ONE word with TWO tokens:"
print(f"{bbc_str}\n{example_81s[start_bbc1:end_bbc1]}\n{example_81s[start_bbc2:end_bbc2]}")
## roberta-base (rb)
tokenizer_rb = AutoTokenizer.from_pretrained("roberta-base")
encoding_rb = tokenizer_rb(example_81s)
print(f"\n\nrb tokens:\n{encoding_rb.tokens()}\n\nrb word IDs:\n{encoding_rb.word_ids()}")
start_rb1, end_rb1 = encoding_rb.token_to_chars(1)
start_rb2, end_rb2 = encoding_rb.token_to_chars(2)
rb_str = f"roberta-base treats '{example_81s}' as TWO words with ONE token each:"
print(f"{rb_str}\n{example_81s[start_rb1:end_rb1]}\n{example_81s[start_rb2:end_rb2]}")

bbc tokens:
['[CLS]', '81', '##s', '[SEP]']

bbc word IDs:
[None, 0, 0, None]
bert-base-cased treats '81s' as ONE word with TWO tokens:
81
s


rb tokens:
['<s>', '81', 's', '</s>']

rb word IDs:
[None, 0, 1, None]
roberta-base treats '81s' as TWO words with ONE token each:
81
s


Similarly, there is a `sequence_ids()` method that we can use to map a token to the sentence it came from (though in this case, the `token_type_ids` returned by the tokenizer can give us the same information).

Lastly, we can map any word or token to characters in the original text, and vice versa, via the `word_to_chars()` or `token_to_chars()` and `char_to_word()` or `char_to_token()` methods. For instance, the `word_ids()` method told us that `##yl` is part of the word at index 3, but which word is it in the sentence? We can find out like this:

In [28]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Sylvain'
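
To see why character spans are enough to support these lookups, here is a toy sketch in plain Python. It does not use the 🤗 Tokenizers library: `tokenize_with_offsets` is a made-up, regex-based splitter that works at the word level only, and `char_to_token` here is a hypothetical helper that merely imitates the idea behind the method of the same name.

```python
import re


def tokenize_with_offsets(text):
    """Toy splitter: words and punctuation marks, each with its (start, end) span."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_to_token(offsets, char_index):
    """Return the index of the token covering `char_index`, or None (e.g. for a space)."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
pieces = tokenize_with_offsets(example)
offsets = [span for _, span in pieces]

# From a token index to the text it covers
start, end = offsets[3]
print(example[start:end])

# From a character position back to a token index
print(char_to_token(offsets, example.index("Sylvain") + 2))
print(char_to_token(offsets, 2))
```

Once every token carries its `(start, end)` span, both directions of the mapping reduce to simple lookups in that list.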

As we mentioned previously, this is all powered by the fact the fast tokenizer keeps track of the span of text each token comes from in a list of *offsets*. To illustrate their use, next we'll show you how to replicate the results of the `token-classification` pipeline manually.

> ✏️ Try it out! <font color="darkgreen">Create your own example text and see if you can understand which tokens are associated with which word ID, and also how to extract the character spans for a single word. For bonus points, try using two sentences as input and see if the sentence IDs make sense to you.</font>

In [29]:
# Print each token with its word ID, then the character span of every word
trytext_1 = "HuggingFace transformers rule NLP!"
trytext_2 = "And stable baselines rule RL."
trytext_list = [trytext_1, trytext_2]
for trytext in trytext_list:
    print(trytext)
    try_encoding_bbc = tokenizer_bbc(trytext)
    # Drop the special tokens [CLS] and [SEP], whose word ID is None
    normal_word_ids = try_encoding_bbc.word_ids()[1:-1]
    print(normal_word_ids)
    current_word_id = -1
    word_start = None  # character index at which the current word starts
    # loop and output
    print("word IDs   tokens\n")
    for i, word_id_i in enumerate(normal_word_ids):
        # Token i of the trimmed list is token i + 1 of the full encoding
        start, end = try_encoding_bbc.token_to_chars(i + 1)
        if word_id_i > current_word_id:
            if word_start is not None:
                # The previous word runs up to the start of this token
                print(f"word: {trytext[word_start:start]}\n")
            current_word_id = word_id_i
            word_start = start
        print(f"{word_id_i}\t   {trytext[start:end]}")
    # Handle the final word, which ends where the last token ends
    print(f"word: {trytext[word_start:end]}\n=> {trytext}\n\n")

HuggingFace transformers rule NLP!
[0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
word IDs   tokens

0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace 

1	   transform
1	   ers
word: transformers 

2	   rule
word: rule 

3	   NL
3	   P
word: NLP

4	   !
word: !
=> HuggingFace transformers rule NLP!


And stable baselines rule RL.
[0, 1, 2, 2, 3, 4, 4, 5]
word IDs   tokens

0	   And
word: And 

1	   stable
word: stable 

2	   base
2	   lines
word: baselines 

3	   rule
word: rule 

4	   R
4	   L
word: RL

5	   .
word: .
=> And stable baselines rule RL.




In [30]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens\n")
print(len(normal_token_ids))
for i in range(len(normal_token_ids)):
    #print(i, normal_word_ids)
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            prev_word_end = start_end_i[0]
            print(f"word: {trytext_1[start_i:prev_word_end]}\n")
        word_spans.append(word_span)
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
#
prev_word_end = start_end_i[1]
print(start_i, prev_word_end)
print(f"word: {trytext_1[start_i:prev_word_end]}\n")
start_i, prev_word_end

word IDs   tokens

10
0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace 

1	   transform
1	   ers
word: transformers 

2	   rule
word: rule 

3	   NL
3	   P
word: NLP

4	   !
33 34
word: !



(33, 34)

In [31]:
trytext_1 = "HuggingFace transformers rule NLP!"
trytext_2 = "And stable baselines rule RL."
trytext_list = [trytext_1, trytext_2]
for i, trytext in enumerate(trytext_list):
    print(i)
    print(trytext)

0
HuggingFace transformers rule NLP!
1
And stable baselines rule RL.


In [32]:
# 1. DONE: add loop over tokenized inputs
# 2. handle final word
# 3. confirm loop works
trytext_1 = "HuggingFace transformers rule NLP!"
trytext_2 = "And stable baselines rule RL."
trytext_list = [trytext_1]
trytext_list = [trytext_2]
trytext_list = [trytext_1, trytext_2]
for trytext in trytext_list:
    try_encoding_bbc = tokenizer_bbc(trytext)
    normal_token_ids = try_encoding_bbc["input_ids"][1:-1]
    #print(normal_token_ids)
    normal_word_ids = try_encoding_bbc.word_ids()[1:-1]
    #print(normal_word_ids)
    # word IDs and spans
    current_word_id = -1
    word_spans = []
    word_span = []
    # loop and output
    print("word IDs   tokens\n")
    #print(len(normal_token_ids))
    for i in range(len(normal_token_ids)):
        #print(i, normal_word_ids)
        word_id_i = normal_word_ids[i]
        start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
        if word_id_i>current_word_id:
            end_i = start_end_i[1]
            word_span.append(start_end_i[1])
            if word_spans!=[]:
                prev_word_end = start_end_i[0]
                print(f"word: {trytext_1[start_i:prev_word_end]}\n")
            word_spans.append(word_span)
            word_span = [start_end_i[0]]
            current_word_id = word_id_i
            start_i = start_end_i[0]
        word_span.append(start_end_i[1])
        print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
        prev_start_end_i = start_end_i
    #
    prev_word_end = start_end_i[1]
    print(f"word: {trytext_1[start_i:prev_word_end]}\n")
    #print(start_i, prev_word_end)

word IDs   tokens

0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace 

1	   transform
1	   ers
word: transformers 

2	   rule
word: rule 

3	   NL
3	   P
word: NLP

4	   !
word: !

word IDs   tokens

0	   Hu
word: Hu

1	   gging
word: gging

2	   F
2	   ace
word: Face 

3	   transform
word: transform

4	   ers
4	   rule
word: ers rule 

5	   NL
word: NL



In [33]:
# 1. DONE: add loop over tokenized inputs
# 2. handle final word
# 3. confirm loop works
trytext_1 = "HuggingFace transformers rule NLP!"
trytext_2 = "And stable baselines rule RL."
trytext_list = [trytext_1]
#trytext_list = [trytext_2]
#trytext_list = [trytext_1, trytext_2]
for trytext in trytext_list:
    try_encoding_bbc = tokenizer_bbc(trytext)
    normal_token_ids = try_encoding_bbc["input_ids"]
    #print(normal_token_ids)
    normal_word_ids = try_encoding_bbc.word_ids()[1:-1]
    print(normal_word_ids)
    # word IDs and spans
    current_word_id = -1
    word_spans = []
    word_span = []
    # loop and output
    print("word IDs   tokens\n")
    for i in range(len(normal_token_ids)):
        print(i, normal_word_ids)
        word_id_i = normal_word_ids[i]
        start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
        if word_id_i>current_word_id:
            end_i = start_end_i[1]
            word_span.append(start_end_i[1])
            if word_spans!=[]:
                prev_word_end = start_end_i[0]
                print(f"word: {trytext_1[start_i:prev_word_end]}\n")
            word_spans.append(word_span)
            word_span = [start_end_i[0]]
            current_word_id = word_id_i
            start_i = start_end_i[0]
        word_span.append(start_end_i[1])
        print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
        prev_start_end_i = start_end_i
    #
    prev_word_end = start_end_i[1]
    print(len(trytext))
    print(f"word: {trytext_1[start_i:len(trytext)]}\n")
    print(start_i, prev_word_end)

[0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
word IDs   tokens

0 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
0	   Hu
1 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
0	   gging
2 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
0	   F
3 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
0	   ace
4 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
word: HuggingFace 

1	   transform
5 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
1	   ers
6 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
word: transformers 

2	   rule
7 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
word: rule 

3	   NL
8 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
3	   P
9 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
word: NLP

4	   !
10 [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]


IndexError: list index out of range
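
The `IndexError` above most likely comes from a length mismatch: the loop runs over the unsliced `input_ids` (12 entries, including `[CLS]` and `[SEP]`) while indexing into `word_ids()[1:-1]` (10 entries). The sketch below works on plain lists copied from the values printed elsewhere in this notebook, so it needs no tokenizer. Instead of slicing both lists by hand, it skips every position whose word ID is `None`. The helper name `iter_normal_tokens` is hypothetical.

```python
# Values copied from this notebook's printed output.
# Special tokens ([CLS], [SEP]) have the word ID None.
input_ids = [101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102]
word_ids = [None, 0, 0, 0, 0, 1, 1, 2, 3, 3, 4, None]


def iter_normal_tokens(input_ids, word_ids):
    """Yield (token_index, token_id, word_id) for every non-special token."""
    if len(input_ids) != len(word_ids):
        raise ValueError("input_ids and word_ids must have the same length")
    for token_index, (token_id, word_id) in enumerate(zip(input_ids, word_ids)):
        if word_id is None:  # special token, skip it
            continue
        yield token_index, token_id, word_id


# The two lengths that the failing loop mixed up.
print(len(input_ids), len(word_ids[1:-1]))

normal_tokens = list(iter_normal_tokens(input_ids, word_ids))
for token_index, token_id, word_id in normal_tokens:
    print(token_index, token_id, word_id)
```

Because `token_index` refers to the full encoding, it can be passed to `token_to_chars` directly, without the `i+1` correction used in the cells above.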

In [34]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens\n")
for i in range(len(normal_token_ids)):
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            prev_word_end = start_end_i[0]
            print(f"word: {trytext_1[start_i:prev_word_end]}\n")
        word_spans.append(word_span)
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
#
prev_word_end = start_end_i[0]
print(f"word: {trytext_1[start_i:prev_word_end]}\n")
start_i, prev_word_end

word IDs   tokens

0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace 

1	   transform
1	   ers
word: transformers 

2	   rule
word: rule 

3	   NL
3	   P
word: NLP

4	   !
word: 



(33, 33)

In [35]:
#print(try_encoding_bbc_1) # after tokenization, get number of lists and loop over lists
normal_token_ids = try_encoding_bbc_1["input_ids"]
print(normal_token_ids)
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("\nword IDs   tokens\n")
for i in range(len(normal_token_ids)):
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            prev_word_end = start_end_i[0]
            print(f"word: {trytext_1[start_i:prev_word_end]}\n")
        word_spans.append(word_span)
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
# 
prev_word_end = start_end_i[0]
print(f"word: {trytext_1[start_i:prev_word_end]}\n")
start_i, prev_word_end

[101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102]

word IDs   tokens

0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace 

1	   transform
1	   ers
word: transformers 

2	   rule
word: rule 

3	   NL
3	   P
word: NLP

4	   !


IndexError: list index out of range

In [36]:
trytext_1 = "HuggingFace transformers rule NLP!"
trytext_2 = "And stable baselines rule RL."
trytext_list = [trytext_1]
trytext_list = [trytext_2]
#trytext_list = [trytext_1, trytext_2]
try_encoding_bbc_1 = tokenizer_bbc(trytext)
print(try_encoding_bbc_1)
normal_token_ids = try_encoding_bbc_1["input_ids"]#[1:-1]
print(normal_token_ids)
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("\nword IDs   tokens\n")
for i in range(len(normal_token_ids)):
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            prev_word_end = start_end_i[0]
            print(f"word: {trytext_1[start_i:prev_word_end]}\n")
        word_spans.append(word_span)
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
# 
prev_word_end = start_end_i[0]
print(f"word: {trytext_1[start_i:prev_word_end]}\n")
start_i, prev_word_end

{'input_ids': [101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102]

word IDs   tokens

0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace 

1	   transform
1	   ers
word: transformers 

2	   rule
word: rule 

3	   NL
3	   P
word: NLP

4	   !


IndexError: list index out of range

In [37]:
trytext_1 = "HuggingFace transformers rule NLP!"
trytext_2 = "And stable baselines rule RL."
trytext = [trytext_1]
trytext = [trytext_2]
trytext = [trytext_1, trytext_2]
try_encoding_bbc_1 = tokenizer_bbc(trytext)
print(try_encoding_bbc_1)
normal_token_ids = try_encoding_bbc_1["input_ids"]#[1:-1]
print(normal_token_ids)
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("\nword IDs   tokens\n")
for i in range(len(normal_token_ids)):
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            prev_word_end = start_end_i[0]
            print(f"word: {trytext_1[start_i:prev_word_end]}\n")
        word_spans.append(word_span)
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
# 
prev_word_end = start_end_i[0]
print(f"word: {trytext_1[start_i:prev_word_end]}\n")
start_i, prev_word_end

{'input_ids': [[101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102], [101, 1262, 6111, 2259, 10443, 3013, 155, 2162, 119, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
[[101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102], [101, 1262, 6111, 2259, 10443, 3013, 155, 2162, 119, 102]]

word IDs   tokens

0	   Hu
0	   gging
word: Hu



(0, 2)
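
When the tokenizer receives a list of texts, `input_ids` is a list of lists, so `len(normal_token_ids)` counts sequences (2 here) rather than tokens. That would explain why the loop above stops after "Hu" and "gging". The sketch below uses the IDs printed above as plain lists and processes each sequence separately. It assumes exactly one special token at each end of every sequence, as in the decoded `[CLS] ... [SEP]` output of this notebook; the helper name `strip_special_tokens` is hypothetical.

```python
# Batched input IDs copied from the printed output above.
batch_input_ids = [
    [101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102],
    [101, 1262, 6111, 2259, 10443, 3013, 155, 2162, 119, 102],
]

# The outer length is the number of texts, not the number of tokens.
print(len(batch_input_ids))

# Slicing the outer list drops whole sequences instead of special tokens.
print(batch_input_ids[1:-1])


def strip_special_tokens(batch_input_ids):
    """Drop the first and last ID of each sequence (assumed [CLS] and [SEP])."""
    return [ids[1:-1] for ids in batch_input_ids]


normal_ids_per_text = strip_special_tokens(batch_input_ids)
for batch_index, ids in enumerate(normal_ids_per_text):
    print(batch_index, len(ids), ids)
```

With a real `BatchEncoding`, the per-sequence word IDs and character spans should be available through `word_ids(batch_index=i)` and `token_to_chars(i, token_index)`, so the inner loop can stay the same for each `batch_index`.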

In [38]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens\n")
for i in range(len(normal_token_ids)):
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            prev_word_end = start_end_i[0]
            print(f"word: {trytext_1[start_i:prev_word_end]}\n")
        word_spans.append(word_span)
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
#
prev_word_end = start_end_i[0]
print(f"word: {trytext_1[start_i:prev_word_end]}\n")
start_i, prev_word_end

word IDs   tokens

0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace 

1	   transform
1	   ers
word: transformers 

2	   rule
word: rule 

3	   NL
3	   P
word: NLP

4	   !
word: 



(33, 33)

In [40]:
trytext_1 = "HuggingFace transformers rule NLP!"
trytext_2 = "And stable baselines rule RL."
trytext = [trytext_1]
trytext = [trytext_2]
#trytext = [trytext_1, trytext_2]
try_encoding_bbc_1 = tokenizer_bbc(trytext)
print(try_encoding_bbc_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens\n")
for i in range(len(normal_token_ids)):
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            prev_word_end = start_end_i[0]
            print(f"word: {trytext_1[start_i:prev_word_end]}\n")
        word_spans.append(word_span)
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
# 
prev_word_end = start_end_i[0]
print(f"word: {trytext_1[start_i:prev_word_end]}\n")
start_i, prev_word_end

{'input_ids': [[101, 1262, 6111, 2259, 10443, 3013, 155, 2162, 119, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
word IDs   tokens

word: 



(33, 33)

In [42]:
start_i, prev_word_end

(33, 33)

In [43]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens")
for i in range(len(normal_token_ids)): # loop over tokens
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    #
    if word_id_i>current_word_id:      # new word
        end_i = start_end_i[1]
        #print(start_end_i)
        word_span.append(start_end_i[1])
        # new
        if word_spans!=[]:
            #print(f"prev_end_i: {prev_start_end_i[1]}")
            prev_word_end = start_end_i[0]-1
            #print(f"end position of previous word is {prev_word_end}")
            print(f"word: {trytext_1[start_i:prev_word_end]}")
        #
        word_spans.append(word_span)
        #print(f"\nword starts at {start_end_i[0]}")
        #print(f"prev_end_i: {prev_start_end_i[1]}")
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
    prev_start_end_i = start_end_i
print("end of string")

word IDs   tokens
0	   Hu
0	   gging
0	   F
0	   ace
word: HuggingFace
1	   transform
1	   ers
word: transformers
2	   rule
word: rule
3	   NL
3	   P
word: NL
4	   !
end of string


In [44]:
word_span

[33, 34]

In [45]:
trytext_1[30:33]

'NLP'

In [46]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens")
for i in range(len(normal_token_ids)): # loop over tokens
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    # start of new word => ...
    # ... 1. finish word_span and append it ...
    # ... 2. start new word_span ...
    # ... 3. update current_word_id
    if word_id_i>current_word_id:      # new word
        end_i = start_end_i[1]
        print(start_end_i)
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            print(f"word ends at {start_end_i[0]}")
            word_spans.append(word_span)
        print(f"word starts at {start_end_i[0]}")
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
print("end of string")
#########
# for token_id in token_ids:
#     get word_id
#     get word_start from start_token_start
#     get word_end from end_token_end # put this line where the next word_id is handled

word IDs   tokens
CharSpan(start=0, end=2)
word starts at 0
0	   Hu
0	   gging
0	   F
0	   ace
CharSpan(start=12, end=21)
word starts at 12
1	   transform
1	   ers
CharSpan(start=25, end=29)
word starts at 25
2	   rule
CharSpan(start=30, end=32)
word starts at 30
3	   NL
3	   P
CharSpan(start=33, end=34)
word starts at 33
4	   !
end of string


In [48]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
for i in range(len(normal_token_ids)): # loop over tokens
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    print(f"token start: {start_end_i[0]}\ttoken end: {start_end_i[1]}\tword id: {word_id_i}")
    if word_id_i > current_word_id:
        # get end position of previous word (if there is a previous word)
        if word_spans!=[]:
            prev_word_end = start_end_i[0]-1
            print(f"previous word exists: its end position is {prev_word_end}")
            print(f"previous word: {trytext_1[word_start:prev_word_end]}")
        word_spans.append(word_span)
        # get start position of current word
        print("\nnew word")
        word_start = start_end_i[0]
        print(f"word start: {word_start}")
        current_word_id = word_id_i
# last word
last_word_end = start_end_i[0]
print(f"last word exists: its end position is {last_word_end}")
print(f"last word: {trytext_1[word_start-1:last_word_end+2]}")




#########
#  H u g g i n g F a c  e     t  r  a  n  s  f  o  r  m  e  r  s     r  u  l  e     N  L  P  !"
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 "
# 0                      11 12                                  24 25          29 30    32 33 "

token start: 0	token end: 2	word id: 0

new word
word start: 0
token start: 2	token end: 7	word id: 0
token start: 7	token end: 8	word id: 0
token start: 8	token end: 11	word id: 0
token start: 12	token end: 21	word id: 1
previous word exists: its end position is 11
previous word: HuggingFace

new word
word start: 12
token start: 21	token end: 24	word id: 1
token start: 25	token end: 29	word id: 2
previous word exists: its end position is 24
previous word: transformers

new word
word start: 25
token start: 30	token end: 32	word id: 3
previous word exists: its end position is 29
previous word: rule

new word
word start: 30
token start: 32	token end: 33	word id: 3
token start: 33	token end: 34	word id: 4
previous word exists: its end position is 32
previous word: NL

new word
word start: 33
last word exists: its end position is 33
last word: P!
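
The trace above shows two slips. Using `start_end_i[0]-1` as the end of the previous word cuts "NLP" down to "NL", because "!" follows it without a space. The last word also needs to be closed after the loop, using the end offset of its final token. The sketch below tracks the end offset of the most recent token and flushes the open word once the loop finishes. The word IDs and offsets are copied from the printed trace, so no tokenizer is required. The helper name `group_tokens_into_words` is hypothetical.

```python
text = "HuggingFace transformers rule NLP!"

# One entry per non-special token, copied from the trace above.
word_ids = [0, 0, 0, 0, 1, 1, 2, 3, 3, 4]
offsets = [(0, 2), (2, 7), (7, 8), (8, 11), (12, 21), (21, 24),
           (25, 29), (30, 32), (32, 33), (33, 34)]


def group_tokens_into_words(text, word_ids, offsets):
    """Return one (word_id, start, end, word_text) tuple per word.

    `offsets` holds (start, end) character spans with an exclusive end,
    so text[start:end] needs no -1 correction.
    """
    words = []
    current_word_id = None
    word_start = word_end = None
    for word_id, (start, end) in zip(word_ids, offsets):
        if word_id != current_word_id:
            # A new word begins: close the previous one, if there is one.
            if current_word_id is not None:
                words.append((current_word_id, word_start, word_end,
                              text[word_start:word_end]))
            current_word_id = word_id
            word_start = start
        # The word always ends where its most recent token ends.
        word_end = end
    # Close the final word, which no later token would otherwise trigger.
    if current_word_id is not None:
        words.append((current_word_id, word_start, word_end,
                      text[word_start:word_end]))
    return words


words = group_tokens_into_words(text, word_ids, offsets)
for word_id, start, end, word_text in words:
    print(f"{word_id}\t{start}:{end}\t{word_text}")
```

With a real fast tokenizer, the offsets could instead be collected from `token_to_chars`, or requested with `return_offsets_mapping=True`. The encoding's `word_to_chars` method may also return each word's span directly.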


In [49]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
for i in range(len(normal_token_ids)): # loop over tokens
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    print(f"token start: {start_end_i[0]}\ttoken end: {start_end_i[1]}\tword id: {word_id_i}")
    if word_id_i > current_word_id:
        # get end position of previous word (if there is a previous word)
        if word_spans!=[]:
            prev_word_end = start_end_i[0]-1
            print(f"previous word exists: its end position is {prev_word_end}")
            print(f"previous word: {trytext_1[word_start:prev_word_end]}")
        word_spans.append(word_span)
        # get start position of current word
        print("\nnew word")
        word_start = start_end_i[0]
        print(f"word start: {word_start}")
        current_word_id = word_id_i
#########
# for token_id in token_ids:
#     get word_id
#     get word_start from start_token_start
#     get word_end from end_token_end # put this line where the next word_id is handled

token start: 0	token end: 2	word id: 0

new word
word start: 0
token start: 2	token end: 7	word id: 0
token start: 7	token end: 8	word id: 0
token start: 8	token end: 11	word id: 0
token start: 12	token end: 21	word id: 1
previous word exists: its end position is 11
previous word: HuggingFace

new word
word start: 12
token start: 21	token end: 24	word id: 1
token start: 25	token end: 29	word id: 2
previous word exists: its end position is 24
previous word: transformers

new word
word start: 25
token start: 30	token end: 32	word id: 3
previous word exists: its end position is 29
previous word: rule

new word
word start: 30
token start: 32	token end: 33	word id: 3
token start: 33	token end: 34	word id: 4
previous word exists: its end position is 32
previous word: NL

new word
word start: 33


In [50]:
word_spans==[]

False

In [51]:
"abcdefghij"[:3]

'abc'

In [52]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens")
for i in range(len(normal_token_ids)): # loop over tokens
    word_id_i = normal_word_ids[i]
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    # start of new word => ...
    # ... 1. finish word_span and append it ...
    # ... 2. start new word_span ...
    # ... 3. update current_word_id
    if word_id_i>current_word_id:      # new word
        end_i = start_end_i[1]
        print(start_end_i)
        word_span.append(start_end_i[1])
        if word_spans!=[]:
            print(f"word ends at {start_end_i[0]}")
            word_spans.append(word_span)
        print(f"word starts at {start_end_i[0]}")
        word_span = [start_end_i[0]]
        current_word_id = word_id_i
        start_i = start_end_i[0]
    word_span.append(start_end_i[1])
    
    print(f"{word_id_i}\t   {trytext_1[start_end_i[0]:start_end_i[1]]}")
print("word ends")
#########
# for token_id in token_ids:
#     get word_id
#     get word_start from start_token_start
#     get word_end from end_token_end # put this line where the next word_id is handled

word IDs   tokens
CharSpan(start=0, end=2)
word starts at 0
0	   Hu
0	   gging
0	   F
0	   ace
CharSpan(start=12, end=21)
word starts at 12
1	   transform
1	   ers
CharSpan(start=25, end=29)
word starts at 25
2	   rule
CharSpan(start=30, end=32)
word starts at 30
3	   NL
3	   P
CharSpan(start=33, end=34)
word starts at 33
4	   !
word ends


In [53]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
normal_token_ids = try_encoding_bbc_1["input_ids"][1:-1]
normal_word_ids = try_encoding_bbc_1.word_ids()[1:-1]
# word IDs and spans
current_word_id = -1
word_spans = []
word_span = []
# loop and output
print("word IDs   tokens")
for i in range(len(normal_token_ids)):
    word_id_i = normal_word_ids[i]
    print(word_id_i)
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    if word_id_i>current_word_id:
        end_i = start_end_i[1]

word IDs   tokens
0
0
0
0
1
1
2
3
3
4


In [54]:
try_encoding_bbc_1.word_ids()[1:-1]

[0, 0, 0, 0, 1, 1, 2, 3, 3, 4]

In [55]:
word_spans!=[]

False

In [56]:
trytext_1 = "HuggingFace transformers rule NLP!"
try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
try_encoding_bbc_1["input_ids"]

[101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102]

In [57]:
tokenizer_bbc.decode(try_encoding_bbc_1["input_ids"])

'[CLS] HuggingFace transformers rule NLP! [SEP]'

In [58]:
# Trying it out
trytext_2 = "And stable baselines rule RL."
## 
trytext_1 = "HuggingFace transformers rule NLP!"
## get word ids, associated tokens, and character spans (=> [start:end]) for each word => print these

try_encoding_bbc_1 = tokenizer_bbc(trytext_1)
trytext_1_word_ids = try_encoding_bbc_1.word_ids()
print(f"word ids:\n{trytext_1_word_ids}") # word ids

word ids:
[None, 0, 0, 0, 0, 1, 1, 2, 3, 3, 4, None]


In [59]:
n_normal_tokens = len(trytext_1_word_ids) - 2
n_normal_tokens

10

In [60]:
trytext_1[3:6]

'gin'

In [61]:
for i in range(n_normal_tokens):
    start_end_i = try_encoding_bbc_1.token_to_chars(i+1)
    print(trytext_1[start_end_i[0]:start_end_i[1]])

Hu
gging
F
ace
transform
ers
rule
NL
P
!


In [62]:
start_rb1, end_rb1 = encoding_rb.token_to_chars(1)
example_81s[start_rb1:end_rb1]

'81'

In [63]:
try_encoding_bbc_1.token_to_chars(10)

CharSpan(start=33, end=34)

In [64]:
# try_encoding_bbc_1.decode(try_1_word_ids[3])

# try_encoding_bbc_1.token_to_chars(2) # works!

for chars in try_encoding_bbc_1.token_to_chars(2): # [0:2]
    print(chars)
# input ids
input_ids = try_encoding_bbc_1["input_ids"]
print(input_ids)
print()
print(try_encoding_bbc_1.token_to_chars)
for input_id in input_ids:
    print(input_id)

2
7
[101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102]

<bound method BatchEncoding.token_to_chars of {'input_ids': [101, 20164, 10932, 2271, 7954, 11303, 1468, 3013, 21239, 2101, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}>
101
20164
10932
2271
7954
11303
1468
3013
21239
2101
106
102


### Inside the `token-classification` pipeline