## Step 1: Fetching the tokenizer.

In [30]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [31]:
example_texts = ["I am a short sentence.", "I am a medium sentence. Not too long, not too short.", "I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on and on and on."]

## Step 2: Let's see what the tokenized output looks like without padding.

In [32]:
tokenized_data = []
for text in example_texts:
    tokenized = tokenizer(text)
    tokenized_data.append(tokenized)

for text,row in zip(example_texts,tokenized_data):
    print("Following text:")
    print(text)
    print("Following is the tokenized output:")
    print(row['input_ids'])
    print(f"Length of the tokenized output: {len(row['input_ids'])}")
    print("\n")

Following text:
I am a short sentence.
Following is the tokenized output:
[101, 1045, 2572, 1037, 2460, 6251, 1012, 102]
Length of the tokenized output: 8


Following text:
I am a medium sentence. Not too long, not too short.
Following is the tokenized output:
[101, 1045, 2572, 1037, 5396, 6251, 1012, 2025, 2205, 2146, 1010, 2025, 2205, 2460, 1012, 102]
Length of the tokenized output: 16


Following text:
I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on and on and on.
Following is the tokenized output:
[101, 1045, 2572, 2019, 17003, 2135, 2146, 11101, 2098, 1010, 12034, 9232, 1010, 7566, 1011, 1037, 1011, 2843, 1010, 21707, 6251, 2008, 2064, 1005, 1056, 2644, 2770, 2006, 1998, 2006, 1998, 2006, 1012, 102]
Length of the tokenized output: 34




# **Conclusion from the output above**

**As we can see, each sentence has a different length. To work in batches, the sequences are processed in parallel, which requires them all to have the same length. So let's add padding and see the output.**
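What padding to a fixed length does can be sketched in plain Python. The helper `pad_to_max_length` below is a hypothetical illustration, not the tokenizer's implementation. It assumes the padding id is 0, which matches the zeros in the padded outputs in this notebook.

```python
# Plain-Python sketch of fixed-length padding (not the tokenizer's own code).
PAD_ID = 0  # assumption: padding id 0, as in the padded outputs in this notebook


def pad_to_max_length(input_ids, max_length, pad_id=PAD_ID):
    """Return a copy of input_ids right-padded with pad_id up to max_length.

    A sequence that is already longer than max_length is returned unchanged,
    because multiplying a list by a negative number yields an empty list.
    """
    return input_ids + [pad_id] * (max_length - len(input_ids))


# Token ids copied from the unpadded outputs above.
short = [101, 1045, 2572, 1037, 2460, 6251, 1012, 102]
medium = [101, 1045, 2572, 1037, 5396, 6251, 1012, 2025, 2205, 2146,
          1010, 2025, 2205, 2460, 1012, 102]

batch = [pad_to_max_length(ids, max_length=50) for ids in (short, medium)]
print([len(row) for row in batch])  # [50, 50]
```

Once every row has the same length, the batch forms a rectangular array that can be processed in one pass.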

In [33]:
tokenized_data = []
for text in example_texts:
    tokenized = tokenizer(text, padding='max_length', max_length=50)
    tokenized_data.append(tokenized)

for text,row in zip(example_texts,tokenized_data):
    print("Following text:")
    print(text)
    print("Following is the tokenized output:")
    print(row['input_ids'])
    print(f"Length of the tokenized output: {len(row['input_ids'])}")
    print("\n")

Following text:
I am a short sentence.
Following is the tokenized output:
[101, 1045, 2572, 1037, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Length of the tokenized output: 50


Following text:
I am a medium sentence. Not too long, not too short.
Following is the tokenized output:
[101, 1045, 2572, 1037, 5396, 6251, 1012, 2025, 2205, 2146, 1010, 2025, 2205, 2460, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Length of the tokenized output: 50


Following text:
I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on and on and on.
Following is the tokenized output:
[101, 1045, 2572, 2019, 17003, 2135, 2146, 11101, 2098, 1010, 12034, 9232, 1010, 7566, 1011, 1037, 1011, 2843, 1010, 21707, 6251, 2008, 2064, 1005, 1056, 2644, 2770, 2006, 1998, 2006, 1998, 2006, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Length of the tokenized output: 50

# Conclusion

**As you can see, every sequence is padded out to the specified length (50 tokens in this case). Even the last sentence, the longest one in `example_texts`, still ends with a lot of zeros as padding. This is inefficient, because even the longest sequence now contains numerous tokens that aren't teaching the model anything.**
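To put a rough number on that waste, the sketch below compares padding every sequence to 50 tokens with padding only to the longest sequence in the batch (34 tokens here). The helper `padding_fraction` is a hypothetical name introduced for this illustration; the lengths are the unpadded lengths printed earlier.

```python
def padding_fraction(lengths, padded_length):
    """Fraction of slots in the padded batch that hold padding, not real tokens."""
    total_slots = padded_length * len(lengths)
    real_tokens = sum(lengths)
    return (total_slots - real_tokens) / total_slots


lengths = [8, 16, 34]  # unpadded lengths of the three example sentences

fixed = padding_fraction(lengths, padded_length=50)
longest = padding_fraction(lengths, padded_length=max(lengths))

print(f"padded to 50:      {fixed:.1%} padding")    # 61.3% padding
print(f"padded to longest: {longest:.1%} padding")  # 43.1% padding
```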

**Instead, we should use `dynamic padding`, which pads each sequence only up to the length of the longest sequence in the batch.**
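The idea can be sketched in plain Python before handing it to the tokenizer. The function `pad_to_longest` is a hypothetical helper, not the library's code, and the padding id of 0 is an assumption taken from the outputs in this notebook. It also builds a mask marking real tokens with 1 and padding with 0, which mirrors the idea behind the `attention_mask` field that the tokenizer returns alongside `input_ids`.

```python
def pad_to_longest(batch, pad_id=0):
    """Right-pad every sequence in batch to the length of its longest member.

    Returns the padded ids plus a mask with 1 for real tokens and 0 for padding.
    """
    target = max(len(ids) for ids in batch)
    padded, mask = [], []
    for ids in batch:
        n_pad = target - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        mask.append([1] * len(ids) + [0] * n_pad)
    return padded, mask


# Toy ids, chosen only for illustration.
batch = [[101, 7, 102], [101, 7, 8, 9, 102]]
padded, mask = pad_to_longest(batch)
print(padded)  # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The longest sequence receives no padding at all, so the batch is only as wide as it needs to be.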

In [34]:
dynamically_padded = tokenizer(example_texts, padding='longest')

for i in range(3):
    print("Following text:")
    print(example_texts[i])
    print("Following is the tokenized output:")
    print(dynamically_padded['input_ids'][i])
    print(f"Length of the tokenized output: {len(dynamically_padded['input_ids'][i])}")
    print("\n")

Following text:
I am a short sentence.
Following is the tokenized output:
[101, 1045, 2572, 1037, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Length of the tokenized output: 34


Following text:
I am a medium sentence. Not too long, not too short.
Following is the tokenized output:
[101, 1045, 2572, 1037, 5396, 6251, 1012, 2025, 2205, 2146, 1010, 2025, 2205, 2460, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Length of the tokenized output: 34


Following text:
I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on and on and on.
Following is the tokenized output:
[101, 1045, 2572, 2019, 17003, 2135, 2146, 11101, 2098, 1010, 12034, 9232, 1010, 7566, 1011, 1037, 1011, 2843, 1010, 21707, 6251, 2008, 2064, 1005, 1056, 2644, 2770, 2006, 1998, 2006, 1998, 2006, 1012, 102]
Length of the tokenized output: 34




# Conclusion

See how the long-winded sentence contains no padding? That's because it is the longest sequence in the batch, so every other example is padded only up to its length.

We've covered how to lengthen examples that are too short. What about the opposite problem?

*Truncation* is the technique used to cut off sequences that we decide are too long. With `truncation=True`, the default is to truncate to the longest length permitted by the model. You can also set `max_length` to truncate to a shorter length, if that is ever a requirement.
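As a rough illustration of what truncation involves, here is a plain-Python sketch. The helper `truncate_ids` is hypothetical and is not the tokenizer's implementation. It assumes each sequence starts and ends with a special token (ids 101 and 102 in the outputs in this notebook, which I take to be `[CLS]` and `[SEP]`) and keeps both, cutting only the tokens in between.

```python
def truncate_ids(input_ids, max_length):
    """Cut input_ids down to max_length, keeping the first and last tokens.

    Assumes the sequence starts and ends with a special token and that
    max_length is at least 2, so both special tokens always fit.
    """
    if len(input_ids) <= max_length:
        return list(input_ids)
    first_id, last_id = input_ids[0], input_ids[-1]
    # Leave room for the two special tokens when slicing the body.
    body = input_ids[1:-1][: max_length - 2]
    return [first_id] + body + [last_id]


short = [101, 1045, 2572, 1037, 2460, 6251, 1012, 102]
print(truncate_ids(short, max_length=5))         # [101, 1045, 2572, 1037, 102]
print(len(truncate_ids(short, max_length=512)))  # 8 (already short enough)
```

Keeping the closing special token means a truncated sequence has the same overall shape as an untruncated one.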


In the version of `example_texts` below, we repeat a long sentence 50 times, for a total length of 1402 tokens.

First, assign to `tokenized_no_truncation` the tokenized version of `example_texts` with truncation turned off (`truncation=False`).

Then, for `tokenized_default_truncation`, simply pass `truncation=True` when passing `example_texts` to the tokenizer. This automatically truncates to the longest length permitted by the model (512 for DistilBERT).

Finally, in `tokenized_max_length`, pass `truncation=True` and `max_length=5` to the tokenizer. Then execute the cell and compare the lengths of each output.


In [35]:
example_texts = ["I am a short sentence.", "I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on." * 50]

# No truncation: sequences keep their full length, even beyond the model limit.
tokenized_no_truncation = tokenizer(example_texts, truncation=False)

print("#### Tokenized with no truncation ####\n")
for i in range(2):
    print("Following text:")
    print(example_texts[i])
    print("Following is the tokenized output:")
    print(tokenized_no_truncation['input_ids'][i])
    print(f"Length of the tokenized output: {len(tokenized_no_truncation['input_ids'][i])}")
    print("\n")

# Default truncation: cut to the longest length permitted by the model.
tokenized_default_truncation = tokenizer(example_texts, truncation=True)

print("#### Tokenization with truncation ####")
for i in range(2):
    print("Following text:")
    print(example_texts[i])
    print("Following is the tokenized output:")
    print(tokenized_default_truncation['input_ids'][i])
    print(f"Length of the tokenized output: {len(tokenized_default_truncation['input_ids'][i])}")
    print("\n")

# Truncation to an explicit maximum length of 5 tokens.
tokenized_max_length = tokenizer(example_texts, truncation=True, max_length=5)

print("#### Tokenization with truncation and max length ####")
for i in range(2):
    print("Following text:")
    print(example_texts[i])
    print("Following is the tokenized output:")
    print(tokenized_max_length['input_ids'][i])
    print(f"Length of the tokenized output: {len(tokenized_max_length['input_ids'][i])}")
    print("\n")

Token indices sequence length is longer than the specified maximum sequence length for this model (1402 > 512). Running this sequence through the model will result in indexing errors


#### Tokenized with no truncation ####

Following text:
I am a short sentence.
Following is the tokenized output:
[101, 1045, 2572, 1037, 2460, 6251, 1012, 102]
Length of the tokenized output: 8


Following text:
I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on.I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on.I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on.I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on.I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on.I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on.I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on.I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop 