# **Preprocessing**

Reference:  
https://huggingface.co/docs/transformers/preprocessing  
https://huggingface.co/docs/transformers/tokenizer_summary

Before you can train a model on a dataset, the data needs to be preprocessed into the model's expected input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you’ll learn that for:
- Text, use a `Tokenizer` to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
- Speech and audio, use a `Feature extractor` to extract sequential features from audio waveforms and convert them into tensors.
- Image inputs, use an `ImageProcessor` to convert images into tensors.
- Multimodal inputs, use a `Processor` to combine a tokenizer and a feature extractor or image processor.

**Note: `AutoProcessor` always works and automatically chooses the correct class for the model you’re using, whether you’re using a tokenizer, image processor, feature extractor or processor.**

<img width="800" height="500" src="data/images/hugging_face_transformers_pipeline.jpeg">

## **AutoTokenizer**

A tokenizer takes text as input and outputs numbers the associated model can make sense of.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

input_text = "Let's try to tokenize!"
print("Input Text:", input_text)
print()

## The first step of the above pipeline is to split the text into tokens
tokens = tokenizer.tokenize(input_text)
print("Tokens:", tokens)
print()

## Convert each token to its numerical ID in the vocabulary
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens Id:", input_ids)
print()

## Lastly, the tokenizer adds special tokens the model expects
final_inputs = tokenizer.prepare_for_model(input_ids)
print("Tokens Id with special tokens:", final_inputs["input_ids"])
print()

## Decode method allows us to check how the final output of the 
## tokenizer translates back to text
print("Decoded Text Output:", tokenizer.decode(final_inputs["input_ids"]))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Input Text: Let's try to tokenize!

Tokens: ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']

Tokens Id: [2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]

Tokens Id with special tokens: [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]

Decoded Text Output: [CLS] let's try to tokenize! [SEP]
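
The split of `tokenize` into `token` and `##ize` above comes from sub-word tokenization. As a rough illustration only, the sketch below applies greedy longest-match splitting in the WordPiece style to a single lowercase word. The vocabulary `TOY_VOCAB` and the helper `wordpiece` are hypothetical names made up for this example; this is not the library's implementation, and the real BERT vocabulary is far larger.

```python
# Hypothetical toy vocabulary, far smaller than BERT's real one.
TOY_VOCAB = {"let", "'", "s", "try", "to", "token", "##ize", "!", "[UNK]"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("tokenize", TOY_VOCAB))  # ['token', '##ize']
print(wordpiece("try", TOY_VOCAB))       # ['try']
print(wordpiece("xyz", TOY_VOCAB))       # ['[UNK]']
```

Because the longest candidate is tried first, `token` wins over the shorter `to` even though both are in the toy vocabulary.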


In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")

input_text = "Let's try to tokenize!"
print("Input Text:", input_text)
print()

## The first step of the above pipeline is to split the text into tokens
tokens = tokenizer.tokenize(input_text)
print("Tokens:", tokens)
print()

## Convert each token to its numerical ID in the vocabulary
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens Id:", input_ids)
print()

## Lastly, the tokenizer adds special tokens the model expects
final_inputs = tokenizer.prepare_for_model(input_ids)
print("Tokens Id with special tokens:", final_inputs["input_ids"])
print()

## Decode method allows us to check how the final output of the 
## tokenizer translates back to text
print("Decoded Text Output:", tokenizer.decode(final_inputs["input_ids"]))

You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Input Text: Let's try to tokenize!

Tokens: ['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize', '!']

Tokens Id: [408, 22, 18, 1131, 20, 20, 2853, 2952, 187]

Tokens Id with special tokens: [2, 408, 22, 18, 1131, 20, 20, 2853, 2952, 187, 3]

Decoded Text Output: [CLS] let's try to tokenize![SEP]


In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

input_text = "Let's try to tokenize!"
print("Input Text:", input_text)
print()

## The first step of the above pipeline is to split the text into tokens
tokens = tokenizer.tokenize(input_text)
print("Tokens:", tokens)
print()

## Convert each token to its numerical ID in the vocabulary
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens Id:", input_ids)
print()

## Lastly, the tokenizer adds special tokens the model expects
final_inputs = tokenizer.prepare_for_model(input_ids)
print("Tokens Id with special tokens:", final_inputs["input_ids"])
print()

## Decode method allows us to check how the final output of the 
## tokenizer translates back to text
print("Decoded Text Output:", tokenizer.decode(final_inputs["input_ids"]))

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Input Text: Let's try to tokenize!

Tokens: ['Let', "'s", 'Ġtry', 'Ġto', 'Ġtoken', 'ize', '!']

Tokens Id: [7939, 18, 860, 7, 19233, 2072, 328]

Tokens Id with special tokens: [0, 7939, 18, 860, 7, 19233, 2072, 328, 2]

Decoded Text Output: <s>Let's try to tokenize!</s>
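
The three runs above differ in which special tokens get wrapped around the sequence. The sketch below reproduces that last step in plain Python for a single sequence, using only the special-token IDs visible in the outputs above. The table `SPECIAL_IDS` and the helper `add_special_tokens` are hypothetical names for this illustration; in practice the tokenizer handles this itself, and sentence pairs follow model-specific layouts that this sketch does not cover.

```python
# Start/end special-token IDs as they appear in the three outputs above.
# The dictionary keys are just labels chosen for this sketch.
SPECIAL_IDS = {
    "bert-base-uncased": (101, 102),  # [CLS], [SEP]
    "albert-base-v1": (2, 3),         # [CLS], [SEP]
    "roberta-base": (0, 2),           # <s>, </s>
}


def add_special_tokens(ids, checkpoint):
    """Wrap a single sequence of token IDs in the model's start and end tokens."""
    start_id, end_id = SPECIAL_IDS[checkpoint]
    return [start_id] + list(ids) + [end_id]


bert_ids = [2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]
roberta_ids = [7939, 18, 860, 7, 19233, 2072, 328]

print(add_special_tokens(bert_ids, "bert-base-uncased"))
print(add_special_tokens(roberta_ids, "roberta-base"))
```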


In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = "Let's try to tokenize!"

input_ids = tokenizer(inputs)

print(input_ids)

{'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [6]:
print(input_ids["input_ids"])

[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]


In [7]:
print(tokenizer.decode(input_ids["input_ids"]))

[CLS] let's try to tokenize! [SEP]


## **Padding and Truncation**

Batched inputs are often different lengths, so they can’t be converted to fixed-size tensors. Padding and truncation are strategies for dealing with this problem, to create rectangular tensors from batches of varying lengths. Padding adds a **special padding token** to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.

In most cases, padding your batch to the length of the longest sequence and truncating to the maximum length a model can accept works pretty well. However, the API supports more strategies if you need them. The three arguments you need to know are: `padding`, `truncation` and `max_length`.
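
To make the idea concrete, here is a minimal pure-Python sketch of how truncation and padding together produce a rectangular batch. The helper `pad_and_truncate` and its parameters are hypothetical and only loosely mirror the tokenizer arguments; the sketch pads on the right with ID `0` and ignores special tokens, which a real tokenizer keeps in place when it truncates.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None):
    """Return a rectangular copy of `batch` (a list of lists of token IDs)."""
    if max_length is not None:
        # Truncation: cut every sequence down to at most max_length IDs.
        batch = [seq[:max_length] for seq in batch]
    # Padding: extend shorter sequences to the longest remaining length.
    target = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


batch = [[5, 6, 7], [5, 6, 7, 8, 9, 10], [5]]

print(pad_and_truncate(batch))
# [[5, 6, 7, 0, 0, 0], [5, 6, 7, 8, 9, 10], [5, 0, 0, 0, 0, 0]]

print(pad_and_truncate(batch, max_length=4))
# [[5, 6, 7, 0], [5, 6, 7, 8], [5, 0, 0, 0]]
```

Every row in the result has the same length, which is what allows the batch to be converted into a single tensor.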

### **Pad**

Sentences aren’t always the same length which can be an issue because tensors, the model inputs, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to shorter sentences.

Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences, padding=True)

encoded_input["input_ids"]

# The first and third sentences are now padded with 0’s because they are shorter.

[[101, 2021, 2054, 2055, 2117, 6350, 1029, 102, 0, 0, 0, 0, 0, 0],
 [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102],
 [101, 2054, 2055, 5408, 14625, 1029, 102, 0, 0, 0, 0, 0, 0, 0]]

In [9]:
for seq in encoded_input["input_ids"]:
    print(tokenizer.decode(seq))

[CLS] but what about second breakfast? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] don't think he knows about second breakfast, pip. [SEP]
[CLS] what about elevensies? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
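
Padded positions carry no information, so the model needs to know which positions to ignore. The tokenizer output includes an `attention_mask` for this purpose (1 for real tokens, 0 for padding). As a sketch of what that mask contains, the snippet below rebuilds it by hand from the padded IDs shown above. The helper name `attention_mask` is chosen for this illustration, and it assumes the pad ID is `0`, as in the output above.

```python
PAD_ID = 0  # ID of [PAD] in the bert-base-uncased output above

# Padded input IDs copied from the output above.
padded = [
    [101, 2021, 2054, 2055, 2117, 6350, 1029, 102, 0, 0, 0, 0, 0, 0],
    [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102],
    [101, 2054, 2055, 5408, 14625, 1029, 102, 0, 0, 0, 0, 0, 0, 0],
]


def attention_mask(batch, pad_id=PAD_ID):
    """Mark real tokens with 1 and padding positions with 0."""
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in batch]


masks = attention_mask(padded)
for mask in masks:
    print(mask)
```

In real code there is no need to compute this yourself: reading `encoded_input["attention_mask"]` gives the mask that the tokenizer already produced.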


## **Fine-Tuning**

https://huggingface.co/docs/transformers/training#train-a-tensorflow-model-with-keras