[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/huggingface_tokenizer_tricks.ipynb)

# Useful Tips & Tricks when using the HuggingFace tokenizer

The tokenizer in the HuggingFace transformers library can do a lot of useful things, but it is not always obvious how to use them. Here are some examples of what it can do.

## Install dependencies

If needed, install the dependencies with the command below:

```
pip install transformers
```

## Basic tokenization

Here is a tokenizer with default settings (using PubMedBERT):

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")



By default, it gives us `input_ids`, `token_type_ids` and `attention_mask`. We normally focus on the `input_ids`, which are the vocabulary IDs of the tokens that the text was split into.

In [None]:
tokenized = tokenizer("The quick brown fox jumped over the lazy dog")
tokenized

## Quickly looking up a token ID

The `.vocab` attribute is the dictionary of the tokenizer's vocabulary. You can use it to look up the token ID of individual tokens, which is sometimes useful.

In [None]:
tokenizer.vocab['quick']
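Indexing `.vocab` directly raises a `KeyError` for a token that is not in the vocabulary. The sketch below shows a lookup that falls back to the unknown token instead. It uses a small hand-written dictionary as a stand-in for `tokenizer.vocab`: the tokens and IDs in `toy_vocab` are invented for illustration and are not PubMedBERT's real IDs, and `lookup_token_id` is a hypothetical helper. The tokenizer's own `convert_tokens_to_ids` method behaves in a similar way, returning the ID of the unknown token for anything it does not know.

```python
# Stand-in for tokenizer.vocab: the tokens and IDs here are made up.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "quick": 4, "fox": 5}


def lookup_token_id(vocab, token, unk_token="[UNK]"):
    """Return the ID for `token`, or the unknown-token ID if it is missing."""
    return vocab.get(token, vocab[unk_token])


# Invert the mapping to go from an ID back to its token.
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}

known_id = lookup_token_id(toy_vocab, "quick")
unknown_id = lookup_token_id(toy_vocab, "zebrafish")
print(known_id, unknown_id, id_to_token[known_id])
```

With the real tokenizer, you would pass `tokenizer.vocab` and `tokenizer.unk_token` in place of the toy values.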

## Tokenizing text that has already been split into words

You may have input text that has already been split into words/tokens. This could be from source files or from using another library like [spaCy](https://spacy.io). A lot of datasets come in the [CoNLL file format](https://simpletransformers.ai/docs/ner-data-formats/#text-file-in-conll-format), which has the text split into words. The HuggingFace tokenizer can deal with text like this using the `is_split_into_words` argument.

In [None]:
pretokenized = ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

tokenized = tokenizer(pretokenized, is_split_into_words=True)
tokenized
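Pre-split text often comes with one label per word (as in CoNLL files), but the tokenizer may break a word into several subword tokens, so the labels need to be stretched to match. With a fast tokenizer, `tokenized.word_ids()` gives the index of the source word for every token (and `None` for special tokens). Below is a minimal sketch of one common alignment scheme, which labels only the first subword of each word. The `word_ids` list is written by hand to mimic that output, and the helper name `align_labels`, the label values and the `-100` ignore value are assumptions for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mark everything else as ignored."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] do not belong to any word.
            aligned.append(ignore_index)
        elif word_id != previous:
            # First subword of a new word: use that word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are ignored.
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hand-written example: four words, where the third is split into two subwords.
word_ids = [None, 0, 1, 2, 2, 3, None]
word_labels = [0, 0, 1, 0]

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

In real use, you would replace the hand-written list with `tokenized.word_ids()`.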

## Getting the tokens back from the IDs

It can be helpful to convert the `input_ids` given by the tokenizer back into tokens to see what the tokenizer has done. The `convert_ids_to_tokens` method can do that.

Note how some of the words have been broken into smaller subwords, with the `##` prefix showing that a token connects to the previous one. We can also see the special tokens `[CLS]` and `[SEP]` that are added to the start and end of sequences by default.

In [None]:
tokenizer.convert_ids_to_tokens(tokenized['input_ids'])

## Truncating text

Transformer models have a maximum input length, measured in tokens. For some models, this is 512 tokens. There are various ways to trim or truncate the input data to make sure it doesn't exceed that limit. The tokenizer has `truncation` and `max_length` arguments to do that.

In [None]:
tokenized = tokenizer("The quick brown fox jumped over the lazy dog",
                      truncation=True,
                      max_length=512)
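Truncation simply discards whatever falls beyond `max_length`. If the discarded text matters, one alternative is to split a long sequence into overlapping windows and process each one separately. The sketch below shows the idea on a plain list of token IDs. The helper name `sliding_windows` and its parameters are hypothetical, and it ignores special tokens like `[CLS]` and `[SEP]` for simplicity. The tokenizer has `return_overflowing_tokens` and `stride` arguments that offer a built-in take on this, which is probably the better choice in practice.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of up to max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must be at least 0 and smaller than max_length")

    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows


# Ten made-up token IDs, windows of 4 tokens with 1 token of overlap.
windows = sliding_windows(list(range(10)), max_length=4, stride=1)
print(windows)
```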

## Getting the token offsets in the original text

You may often want to be able to go from the tokens back to the original text. For example, you may want to know which tokens relate to a particular span of text, e.g. a set of words that represent a named entity. The tokenizer has the `return_offsets_mapping` argument that adds an extra field `offset_mapping` to the tokenized output. It is a list of (start, end) character offsets into the original text, one pair per token. This only works with a fast tokenizer, which `AutoTokenizer` gives you by default when one is available.

If you have annotations for the locations of named entities in some text, you can use the `offset_mapping` data to match them to the tokens. The [intervaltree library](https://pypi.org/project/intervaltree/) can help keep the code clean if there are a lot of annotations to account for.

In [None]:
input_text = "The quick brown fox jumped over the lazy dog"
tokenized = tokenizer(input_text,
                      return_offsets_mapping=True)

tokenized['offset_mapping']

Here's a demonstration of iterating through the `offset_mapping` and pulling out the matching piece of the source text for each token.

In [None]:
for start, end in tokenized['offset_mapping']:
    print(f"{start}\t{end}\t{input_text[start:end]}")
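Going the other way, from an annotated character span to the tokens that cover it, is a matter of checking which offsets overlap the span. Here is a minimal sketch using plain loops rather than intervaltree. The `offset_mapping` list is written by hand, assuming every word becomes a single token and that the special tokens get an empty `(0, 0)` span, and `tokens_in_span` is a hypothetical helper name.

```python
def tokens_in_span(offset_mapping, span_start, span_end):
    """Return the indices of tokens whose character offsets overlap [span_start, span_end)."""
    indices = []
    for index, (start, end) in enumerate(offset_mapping):
        if start == end:
            # Empty spans belong to special tokens, so skip them.
            continue
        if start < span_end and end > span_start:
            indices.append(index)
    return indices


input_text = "The quick brown fox jumped over the lazy dog"

# Hand-written stand-in for tokenized['offset_mapping'] (one token per word assumed).
offset_mapping = [(0, 0), (0, 3), (4, 9), (10, 15), (16, 19), (20, 26),
                  (27, 31), (32, 35), (36, 40), (41, 44), (0, 0)]

# Pretend "brown fox" is an annotated entity.
entity = "brown fox"
entity_start = input_text.index(entity)
entity_end = entity_start + len(entity)

entity_tokens = tokens_in_span(offset_mapping, entity_start, entity_end)
print(entity_tokens)
```

The returned indices point into `input_ids`, so they can be used to pick out the matching token IDs or, later, the matching model outputs.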

## Getting PyTorch tensors

The `input_ids` are probably going to be passed into a deep learning network, so they will need to be converted into a PyTorch tensor at some point. You can get the tokenizer to do it with `return_tensors='pt'`, where `'pt'` refers to PyTorch. You can ask for TensorFlow tensors as well. You may want to use this argument along with the truncation ones.

In [None]:
tokenized = tokenizer("The quick brown fox jumped over the lazy dog",
                      return_tensors='pt')
tokenized
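When you tokenize several texts at once and ask for tensors, all the sequences need to be the same length, which is what the tokenizer's `padding=True` argument takes care of. The sketch below shows roughly what padding does, in plain Python. The helper `pad_batch` is hypothetical, the token IDs are made up, and the pad ID of 0 is an assumption (check `tokenizer.pad_token_id` for the real value).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths.
batch = pad_batch([[2, 15, 27, 3], [2, 15, 3]])
print(batch)
```

This is also why the tokenizer returns an `attention_mask`: it tells the model which positions are padding.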

## Adding your own "special" tokens

Tokenizers have a bunch of special tokens, such as the `[CLS]` and `[SEP]` tokens added at the start and end of sequences and the `[MASK]` token used in masked language modelling pretraining. You can see all the ones that a tokenizer uses with `.all_special_tokens`.

In [None]:
tokenizer.all_special_tokens

You may want to use your own special tokens for something, like tagging around particular entities. However, if you use them without any setup, the tokenizer has never seen them before and will likely split each one into several tokens, as below.

In [None]:
tokenized = tokenizer("The quick brown [E1]fox[/E1] jumped over the lazy [E2]dog[/E2]")

tokenizer.convert_ids_to_tokens(tokenized['input_ids'])

Instead, you can add your tokens to the tokenizer with `.add_tokens` as below.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

tokenizer.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]"])

Let's check if one of the special tokens is now in the tokenizer's vocab:

In [None]:
tokenizer.vocab['[E1]']

Then when you run the same code again, the special tokens have been encoded properly and not split into multiple tokens.

In [None]:
tokenized = tokenizer("The quick brown [E1]fox[/E1] jumped over the lazy [E2]dog[/E2]")

tokenizer.convert_ids_to_tokens(tokenized['input_ids'])
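Writing the marker tokens into the text by hand gets tedious if you have many annotated examples. Below is a minimal sketch of a helper that wraps character spans in numbered markers like the ones used above. The name `mark_entities` is hypothetical, and it assumes the spans are given as (start, end) character offsets that do not overlap.

```python
def mark_entities(text, spans):
    """Wrap each (start, end) span in [E1]...[/E1], [E2]...[/E2], and so on."""
    numbered = list(enumerate(spans, start=1))
    marked = text
    # Work from the rightmost span backwards so earlier offsets stay valid.
    for number, (start, end) in sorted(numbered, key=lambda item: item[1][0], reverse=True):
        marked = (marked[:start]
                  + f"[E{number}]" + marked[start:end] + f"[/E{number}]"
                  + marked[end:])
    return marked


input_text = "The quick brown fox jumped over the lazy dog"

# Character spans for "fox" and "dog".
marked_text = mark_entities(input_text, [(16, 19), (41, 44)])
print(marked_text)
```

The result can be passed straight to the tokenizer once the marker tokens have been added to it.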

To use these new tokens in a model, you'll need to tell the model to adjust to the new number of tokens using `resize_token_embeddings` as below.

However, the embeddings for these new tokens are freshly initialized rather than learned, so the model will have no idea of their meaning. End-to-end training on whatever task you are focussed on may be required to get the model to use them effectively.

In [None]:
from transformers import AutoModel
model = AutoModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

model.resize_token_embeddings(len(tokenizer))