<a href="https://colab.research.google.com/github/ShinAsakawa/ShinAsakawa.github.io/blob/master/2022notebooks/2022_0112preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- date: 2022_0112
- source url: https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/preprocessing.ipynb
- filename: 2022_0112preprocessing.ipynb


In [None]:
import platform
isColab = platform.system() == 'Linux'

if isColab:
    # Transformers installation
    !pip install transformers datasets > /dev/null 2>&1
    # To install from source instead of the last release, comment the command above and uncomment the following one.
    #!pip install git+https://github.com/huggingface/transformers.git

In [None]:
if isColab:
    # Install MeCab, fugashi, and ipadic
    !apt install aptitude swig > /dev/null 2>&1
    !aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y > /dev/null 2>&1
    !pip install mecab-python3 > /dev/null 2>&1
    !git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null 2>&1
    !echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a > /dev/null 2>&1
    
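    import subprocess
    # Ask mecab-config for the dictionary directory and build the NEologd path.
    cmd='echo `mecab-config --dicdir`\"/mecab-ipadic-neologd\"'
    # .strip() drops the trailing newline that echo appends to its output.
    path_neologd = (subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                     shell=True).communicate()[0]).decode('utf-8').strip()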

    !pip install 'fugashi[unidic]' > /dev/null 2>&1
    !python -m unidic download > /dev/null 2>&1
    !pip install ipadic > /dev/null 2>&1    

# Preprocessing data

In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers.
The main tool for this is what we call a [tokenizer](https://huggingface.co/docs/transformers/master/en/main_classes/tokenizer).
You can build one using the tokenizer class associated with the model you would like to use, or directly with the [AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer) class.

As we saw in the [quick tour](https://huggingface.co/docs/transformers/master/en/quicktour), the tokenizer first splits a given text into words (or parts of words, punctuation symbols, etc.), usually called _tokens_.
It then converts those _tokens_ into numbers so that a tensor can be built from them and fed to the model.
It also adds any additional inputs the model might expect to work properly.
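
To make these three steps concrete, here is a minimal sketch in plain Python. It is not how 🤗 Transformers implements tokenization: the vocabulary `TOY_VOCAB`, the function `toy_encode`, and the regex-based splitting are all assumptions made up for illustration.

```python
import re

# Hypothetical toy vocabulary; real tokenizers ship vocabularies
# with tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, "world": 5, "!": 6}


def toy_encode(text, vocab=TOY_VOCAB):
    """Split text into tokens, map them to ids, and add the extra inputs."""
    # 1. Split into lowercase words and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # 2. Add the special tokens a BERT-style model expects.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 3. Convert tokens to ids, falling back to [UNK] for unknown tokens.
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # 4. Extra input: every real token should be attended to.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("Hello brave world!")
print(encoded)
```

The word "brave" is not in the toy vocabulary, so it is mapped to the `[UNK]` id, which is the same thing that happens to out-of-vocabulary characters with a real tokenizer.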

<Tip>

If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer:
it splits the text you give it into tokens the same way as was done for the pretraining corpus, and it uses the same token-to-index correspondence (usually called a _vocab_) as during pretraining.

</Tip>

To automatically download the vocab used during pretraining or fine-tuning of a given model, you can use the [AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) method:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


model_name_ja = "cl-tohoku/bert-base-japanese"  # Japanese BERT from the Inui Lab, Tohoku University
tknz = AutoTokenizer.from_pretrained(model_name_ja)


## 1. Base use

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Yffk5aydLzg?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

A [PreTrainedTokenizer](https://huggingface.co/docs/transformers/master/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) has many methods, but the only one you need to remember for preprocessing is its `__call__`: you just need to feed your sentence to your tokenizer object.

In [None]:
encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

encoded_input_ja1 = tknz("こんにちは，竈門炭治郎です")
encoded_input_ja2 = tknz("これはニューラルネットワークモデルです")

print(encoded_input_ja1)
print(encoded_input_ja2)

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [2, 10350, 25746, 28450, 228, 1, 1605, 3236, 1311, 29082, 2992, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [2, 171, 9, 621, 2151, 14610, 6570, 2992, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


This returns a dictionary mapping strings to lists of ints.
The [input_ids](https://huggingface.co/docs/transformers/master/en/glossary#input-ids) are the indices corresponding to each token in our sentence.
We will see below what the [attention_mask](https://huggingface.co/docs/transformers/master/en/glossary#attention-mask) is used for, and [the next section](#preprocessing-pairs-of-sentences) explains the goal of [token_type_ids](https://huggingface.co/docs/transformers/master/en/glossary#token-type-ids).

The tokenizer can decode a list of token ids back into a proper sentence:

In [None]:
print(tokenizer.decode(encoded_input["input_ids"]))

print(tknz.decode(encoded_input_ja1["input_ids"]))
print(tknz.decode(encoded_input_ja2["input_ids"]))


[CLS] Hello, I'm a single sentence! [SEP]
[CLS] こんにちは, [UNK] 門 炭 治郎 です [SEP]
[CLS] これ は ニューラルネットワークモデル です [SEP]


As you can see, the tokenizer automatically added some special tokens that the model expects.
Not all models need special tokens; for instance, if we had used _gpt2-medium_ instead of _bert-base-cased_ to create our tokenizer, we would have seen the same sentence as the original one here.
You can disable this behavior by passing `add_special_tokens=False` (which is only advised if you have added those special tokens yourself).

If you have several sentences you want to process, you can do this efficiently by sending them as a list to the tokenizer:

In [None]:
batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

batch_sentences_ja = ["日本国民は、正当に選挙された国会における代表者を通じて行動し、われらとわれらの子孫のために、諸国民との協和による成果と、わが国全土にわたつて自由のもたらす恵沢を確保し、政府の行為によつて再び戦争の惨禍が起ることのないやうにすることを決意し、ここに主権が国民に存することを宣言し、この憲法を確定する。"]
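# NOTE: this call uses the English `tokenizer`, not `tknz`, so most Japanese
# characters map to [UNK] (id 100) in the output below.
encoded_inputs_ja = tokenizer(batch_sentences_ja)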
print(encoded_inputs_ja)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102], [101, 1262, 1330, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
{'input_ids': [[101, 1033, 1039, 1004, 100, 916, 885, 1045, 100, 914, 100, 100, 903, 28822, 28801, 1004, 100, 914, 28790, 28794, 28821, 100, 100, 100, 100, 100, 100, 100, 100, 904, 885, 100, 1014, 100, 915, 28801, 28814, 28807, 885, 100, 1004, 100, 912, 28808, 100, 1002, 914, 28818, 28821, 100, 100, 912, 885, 100, 1004, 100, 1006, 100, 100, 100, 915, 28815, 28801, 28819, 28798, 100, 100, 100, 100, 100, 904, 885, 100, 100, 915, 100, 100, 914, 28818, 28803, 28804, 100, 100, 100, 100, 915, 100, 100, 100, 100, 100, 100, 100, 904, 885, 902, 28795, 28807, 100, 100, 100, 1004, 100, 914, 100, 100, 100, 100, 904, 885, 902, 28808, 100, 100, 100, 100, 100, 905, 28821, 886, 1

We get back a dictionary once again, this time with values being lists of lists of ints.

If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will probably want:

- To pad each sentence to the maximum length in your batch.
- To truncate each sentence to the maximum length the model can accept (if applicable).
- To return tensors.

You can do all of this by using the following options when feeding your list of sentences to the tokenizer:

In [None]:
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)

batch_ja = tknz(batch_sentences_ja, padding=True, truncation=True, return_tensors="pt")
print(batch_ja)

{'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
        [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
        [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
{'input_ids': tensor([[    2,    91,  1598,     9,     6,  9492,     7,   958,    26,    20,
            10,  4158,   587,   542,   104,  3016,  1891,    15,     6, 15296,
            60,    13, 15296,    60,     5,  5162,     5,    82,     7,     6,
           959,  1598,    13,     5, 28195,   250,  5638,    13,     6, 12344,
         28518, 11278,     7,   630, 14618,    16,  1287,     5,  8538,  3331,
         29331,    11,  2974,    15,     6,   886,     5,  2382,    56,   181,
            16,  1438,   941,     5,  9717

In [None]:
# TensorFlow version, commented out
# batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
# print(batch)

It returns a dictionary with string keys and tensor values.
We can now see what the [attention_mask](https://huggingface.co/docs/transformers/master/en/glossary#attention-mask) is all about:
it points out which tokens the model should pay attention to and which ones it should not (because they represent padding in this case).
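
The relationship between padding and the attention mask can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper `pad_to_longest` is hypothetical, and it assumes right-side padding with pad id 0, which is what the _bert-base-cased_ output above shows. The ids are copied from the unpadded batch output printed earlier.

```python
def pad_to_longest(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        # Real tokens keep their ids; padding positions get pad_id.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids of the three English sentences, as printed without padding.
batch_ids = [
    [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
    [101, 1262, 1330, 5650, 102],
    [101, 1262, 1103, 1304, 1304, 1314, 1141, 102],
]
padded = pad_to_longest(batch_ids)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

The rows produced this way line up with the `input_ids` and `attention_mask` tensors printed by the tokenizer with `padding=True`.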

Note that if your model does not have a maximum length associated with it, the command above will emit a warning.
You can safely ignore it.
You can also pass `verbose=False` to stop the tokenizer from emitting this kind of warning.

<a id='sentence-pairs'></a>



## 2. Preprocessing pairs of sentences

In [None]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Sometimes you need to feed a pair of sentences to your model.
For instance, you may want to classify whether two sentences in a pair are similar, or use a question-answering model, which takes a context and a question.
For BERT models, the input is then represented like this: `[CLS] Sequence A [SEP] Sequence B [SEP]`

You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments (not as a list, since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
This once again returns a dict mapping strings to lists of ints:

In [None]:
encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)

encoded_input_ja = tknz("こんにちは", "ぼくドラえもんです")
print(encoded_input_ja)

{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [2, 10350, 25746, 28450, 3, 12253, 9574, 2992, 3], 'token_type_ids': [0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


This shows us what the [token_type_ids](https://huggingface.co/docs/transformers/master/en/glossary#token-type-ids) are for:
they indicate to the model which part of the inputs corresponds to the first sentence and which part corresponds to the second sentence.
Note that _token_type_ids_ are not required or handled by all models.
By default, a tokenizer only returns the inputs that its associated model expects.
You can force the return (or the non-return) of these optional outputs with arguments such as `return_token_type_ids` or `return_attention_mask`.
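
As a rough illustration of how the pair layout and the `token_type_ids` fit together, the sketch below rebuilds the English output shown above by hand. The helper `build_pair` is hypothetical and only mirrors the BERT layout `[CLS] A [SEP] B [SEP]`; the ids 101 and 102 are the `[CLS]` and `[SEP]` ids visible in the _bert-base-cased_ outputs.

```python
CLS_ID, SEP_ID = 101, 102  # special token ids seen in the bert-base-cased outputs


def build_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] and mark which segment each token is in."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# Content ids of "How old are you?" and "I'm 6 years old", taken from the
# printed output with the special tokens removed.
pair = build_pair([1731, 1385, 1132, 1128, 136], [146, 112, 182, 127, 1201, 1385])
print(pair)
```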

If we decode the token ids we obtained, we will see that the special tokens have been properly added.

In [None]:
print(tokenizer.decode(encoded_input["input_ids"]))
print(tknz.decode(encoded_input_ja["input_ids"]))

[CLS] How old are you? [SEP] I'm 6 years old [SEP]
[CLS] こんにちは [SEP] ぼく ドラえもん です [SEP]


If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer:
the list of first sentences and the list of second sentences:

In [None]:
batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
batch_of_second_sentences = [
    "I'm a sentence that goes with the first sentence",
    "And I should be encoded with the second sentence",
    "And I go with the very last one",
]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)


batch_sentences_ja = ["春はあけぼの。",
                      "やうやう白くなりゆく山ぎは、すこしあかりて、紫だちたる 雲のほそくたなびきたる",
                      "夏は夜。月のころはさらなり。やみもなほ、蛍の多く飛びちがひたる。"]
batch_of_second_sentences_ja = [
    "最初の文章です。",
    "そして二番目の文章です。",
    "さらに，これが三番目の文章です。",
]
encoded_inputs_ja = tknz(batch_sentences_ja, batch_of_second_sentences_ja)
print(encoded_inputs_ja)

{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102], [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
{'input_ids': [[2, 1409, 9, 20123, 29245, 28444, 8, 3, 918, 5, 7204, 2992, 8, 3], [2, 49, 28489, 28528, 28489, 21845, 297, 7676, 309, 3185, 9, 6, 340, 21797, 22, 20431, 16, 6, 5007, 75, 728, 6424, 3436, 5, 232, 17985, 10, 28462, 28670, 28512, 6424, 3, 893, 287, 2944, 5, 7204, 2992, 8, 3], [2, 1428, 9, 1563,

As we can see, it returns a dictionary where each value is a list of lists of ints.
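
If your data is stored as a list of `(first, second)` tuples rather than as two parallel lists, one way to get the two lists the tokenizer expects is to transpose it with `zip`. This is a small sketch in plain Python; the variable names `pairs`, `first_sentences` and `second_sentences` are just illustrative.

```python
# Sentence pairs stored as tuples (the same English sentences as above).
pairs = [
    ("Hello I'm a single sentence", "I'm a sentence that goes with the first sentence"),
    ("And another sentence", "And I should be encoded with the second sentence"),
    ("And the very very last one", "And I go with the very last one"),
]

# zip(*pairs) transposes the list of tuples into two columns.
first_sentences, second_sentences = (list(column) for column in zip(*pairs))

print(first_sentences)
print(second_sentences)
```

The two resulting lists can then be passed as the first and second arguments of the tokenizer call, as in the cell above.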

To double-check what is fed to the model, we can decode each list in _input_ids_ one by one:

In [None]:
for ids in encoded_inputs["input_ids"]:
    print(tokenizer.decode(ids))

for ids in encoded_inputs_ja["input_ids"]:
    print(tknz.decode(ids))    

[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
[CLS] 春 は あけぼの 。 [SEP] 最初 の 文章 です 。 [SEP]
[CLS] やうやう 白く なり ゆく 山 ぎ は 、 すこし あ かり て 、 紫 だ ち たる 雲 の ほそく たなびき たる [SEP] そして 二 番目 の 文章 です 。 [SEP]
[CLS] 夏 は 夜 。 月 の ころ は さら なり 。 やみ も なほ 、 蛍 の 多く 飛 びちがひたる 。 [SEP] さらに, これ が 三 番目 の 文章 です 。 [SEP]


Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum length the model can accept, and return tensors directly with the following:

In [None]:
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")

batch_ja = tknz(batch_sentences_ja, batch_of_second_sentences_ja, padding=True, truncation=True, return_tensors="pt")

In [None]:
# TensorFlow version, omitted
#batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")

## 3. Everything you always wanted to know about padding and truncation

We have seen the commands that will work for most cases (pad your batch to the length of the longest sentence and truncate to the maximum length the model can accept).
However, the API supports more strategies if you need them.
The three arguments you need to know for this are `padding`, `truncation` and `max_length`.

- `padding` controls the padding.
It can be a boolean or a string which should be:

  - `True` or `'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide a single sequence).
  - `'max_length'` to pad to a length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`).
  If you only provide a single sequence, padding will still be applied to it.
  - `False` or `'do_not_pad'` to not pad the sequences.
  As we have seen before, this is the default behavior.

- `truncation` controls the truncation. 
It can be a boolean or a string which should be:

  - `True` or `'longest_first'` truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). 
  This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached.
  - `'only_second'` truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). 
  This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided. 
  - `'only_first'` truncate to a maximum length specified by the `max_length` argument or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). 
  This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
  - `False` or `'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the default behavior.

- `max_length` to control the length of the padding/truncation. 
It can be an integer or `None`, in which case it will default to the maximum length the model can accept. 
If the model has no specific maximum input length, truncation/padding to `max_length` is deactivated.
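
To see how the truncation strategies differ on a pair, here is a toy model in plain Python. It is a sketch under simplifying assumptions, not the library's code: `truncate_pair` is a hypothetical helper, `max_tokens` counts content tokens only (special tokens are ignored), and ties in `'longest_first'` are broken by trimming the second sequence, which is an assumption of this sketch.

```python
def truncate_pair(ids_a, ids_b, max_tokens, strategy="longest_first"):
    """Toy truncation of two id lists to at most max_tokens in total."""
    ids_a, ids_b = list(ids_a), list(ids_b)  # work on copies
    excess = len(ids_a) + len(ids_b) - max_tokens
    if excess <= 0 or strategy == "do_not_truncate":
        return ids_a, ids_b
    if strategy == "only_first":
        if excess > len(ids_a):
            raise ValueError("first sequence is too short to absorb the truncation")
        return ids_a[:len(ids_a) - excess], ids_b
    if strategy == "only_second":
        if excess > len(ids_b):
            raise ValueError("second sequence is too short to absorb the truncation")
        return ids_a, ids_b[:len(ids_b) - excess]
    if strategy == "longest_first":
        # Remove one token at a time from whichever sequence is longer.
        for _ in range(excess):
            if len(ids_a) > len(ids_b):
                ids_a.pop()
            else:
                ids_b.pop()  # assumed tie-break: trim the second sequence
        return ids_a, ids_b
    raise ValueError(f"unknown strategy: {strategy}")


first = [1, 2, 3, 4, 5, 6]
second = [11, 12, 13, 14]
for name in ("longest_first", "only_first", "only_second"):
    print(name, truncate_pair(first, second, max_tokens=6, strategy=name))
```

With a budget of 6 tokens, `'longest_first'` balances the two sequences, while the other two strategies take the whole cut from a single side.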

Here is a table summarizing the recommended way to set up padding and truncation. If you use pairs of input sequences in any of the following examples, you can replace `truncation=True` by a `STRATEGY` selected in
`['only_first', 'only_second', 'longest_first']`, i.e. `truncation='only_second'` or `truncation='longest_first'`, to control how both sequences in the pair are truncated as detailed before.

| Truncation                           | Padding                           | Instruction                                                                                 |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation                        | no padding                        | `tokenizer(batch_sentences)`                                                           |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True)` or                                          |
|                                      |                                   | `tokenizer(batch_sentences, padding='longest')`                                        |
|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')`                                     |
|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', max_length=42)`                      |
| truncation to max model input length | no padding                        | `tokenizer(batch_sentences, truncation=True)` or                                       |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY)`                                      |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True)` or                         |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)`                        |
|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or                 |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)`                |
|                                      | padding to specific length        | Not possible                                                                                |
| truncation to specific length        | no padding                        | `tokenizer(batch_sentences, truncation=True, max_length=42)` or                        |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)`                       |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or          |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)`         |
|                                      | padding to max model input length | Not possible                                                                                |
|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or  |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |

## 4. Pre-tokenized inputs

The tokenizer also accepts pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
predictions in [named entity recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) or
[part-of-speech tagging (POS tagging)](https://en.wikipedia.org/wiki/Part-of-speech_tagging).

<Tip warning={true}>

Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
like BPE).

</Tip>

If you want to use pre-tokenized inputs, just set `is_split_into_words=True` when passing your inputs to the
tokenizer. For instance, we have:

In [None]:
encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
print(encoded_input)

{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
`add_special_tokens=False`.
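
The reason pre-tokenized inputs help with NER or POS tagging is that labels are given per word, while the model sees subword tokens, so each word-level label has to be spread over the tokens of that word. The sketch below illustrates the idea in plain Python. Everything in it is an assumption for illustration: `toy_subword_split` is a stand-in for a real subword algorithm, and the POS-style labels are made up.

```python
import re


def toy_subword_split(word):
    """Hypothetical stand-in for a subword algorithm: split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", word)


def align_labels(words, word_labels):
    """Expand word-level labels to one label per subword token."""
    tokens, word_ids, token_labels = [], [], []
    for idx, (word, label) in enumerate(zip(words, word_labels)):
        for piece in toy_subword_split(word):
            tokens.append(piece)
            word_ids.append(idx)        # index of the word this token came from
            token_labels.append(label)  # every piece inherits the word's label
    return tokens, word_ids, token_labels


words = ["Hello", "I'm", "a", "single", "sentence"]
labels = ["INTJ", "PRON", "DET", "ADJ", "NOUN"]  # illustrative labels only
tokens, word_ids, token_labels = align_labels(words, labels)
print(tokens)
print(word_ids)
print(token_labels)
```

Here "I'm" becomes three tokens that all point back to word index 1, which matches the seven content tokens (plus two special tokens) in the encoded output above.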

This works exactly as before for a batch of sentences or a batch of pairs of sentences. You can encode a batch of sentences
like this:

In [None]:
batch_sentences = [
    ["Hello", "I'm", "a", "single", "sentence"],
    ["And", "another", "sentence"],
    ["And", "the", "very", "very", "last", "one"],
]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)

or a batch of sentence pairs like this:

In [None]:
batch_of_second_sentences = [
    ["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
    ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
    ["And", "I", "go", "with", "the", "very", "last", "one"],
]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)

And you can add padding, truncation as well as directly return tensors like before:

In [None]:
batch = tokenizer(
    batch_sentences,
    batch_of_second_sentences,
    is_split_into_words=True,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

In [None]:
batch = tokenizer(
    batch_sentences,
    batch_of_second_sentences,
    is_split_into_words=True,
    padding=True,
    truncation=True,
    return_tensors="tf",
)