
## Tokenizers Summary

- The Transformers API handles tokenization through a single high-level call: the tokenizer object itself.
- Calling the tokenizer on a sentence converts the text directly into model-ready inputs.
- It can handle multiple sequences at once.
- It can pad sequences to a common length.
- It can truncate sequences that exceed a maximum length.

In [5]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = ["Damn we need a break and rest on this, but gotta keep grinding.", "Yanto just said his legendary words! 'ICIKIWIR!' "]

model_inputs = tokenizer(sequence)
print(model_inputs)

{'input_ids': [[101, 4365, 2057, 2342, 1037, 3338, 1998, 2717, 2006, 2023, 1010, 2021, 10657, 2562, 16153, 1012, 102], [101, 13619, 3406, 2074, 2056, 2010, 8987, 2616, 999, 1005, 24582, 17471, 9148, 2099, 999, 1005, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


- The `model_inputs` variable contains all necessary components for model processing.  
- For **DistilBERT**, this includes **input IDs** and an **attention mask**.  
- For models that expect additional inputs, the tokenizer returns those as well.
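
To make the relationship between the input IDs and the attention mask concrete, here is a minimal pure-Python sketch of what padding to the longest sequence does. `pad_to_longest` is a hypothetical helper, not part of Transformers, and the pad ID of 0 is an assumption read off the padded outputs in this notebook. The real tokenizer does more than this; the sketch only mirrors the shape of its output.

```python
def pad_to_longest(batch, pad_id=0):
    """Right-pad every ID list to the length of the longest one.

    pad_id=0 is an assumption taken from the padded outputs in this notebook.
    """
    target = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two ID lists of different lengths (values copied from outputs in this notebook)
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 102],
    [101, 2061, 2031, 1045, 999, 102],
]
padded = pad_to_longest(batch)
print(padded["input_ids"][1])       # [101, 2061, 2031, 1045, 999, 102, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The shorter sequence gains two pad IDs, and its attention mask carries zeros in exactly those positions.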

In [8]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequence, padding="longest")
print(model_inputs)

# Will pad the sequence up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequence, padding="max_length")
print(model_inputs)

# Will pad the sequences up to the specified max length
# (longer sequences are left as-is unless truncation=True is also passed)
model_inputs = tokenizer(sequence, padding="max_length", max_length=8)
print(model_inputs)

{'input_ids': [[101, 4365, 2057, 2342, 1037, 3338, 1998, 2717, 2006, 2023, 1010, 2021, 10657, 2562, 16153, 1012, 102], [101, 13619, 3406, 2074, 2056, 2010, 8987, 2616, 999, 1005, 24582, 17471, 9148, 2099, 999, 1005, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
{'input_ids': [[101, 4365, 2057, 2342, 1037, 3338, 1998, 2717, 2006, 2023, 1010, 2021, 10657, 2562, 16153, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [9]:
sequences = ["I've been waiting for a this course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
print(model_inputs)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
print(model_inputs)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
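
The second output above shows that truncation keeps the first and last IDs (101 and 102) and trims the content in between. A rough sketch of that behavior in plain Python follows; `truncate_ids` is a hypothetical helper, and the IDs 101 and 102 are assumed to be specific to this checkpoint. The real tokenizer supports several truncation strategies that this sketch does not cover.

```python
CLS_ID, SEP_ID = 101, 102  # read off the printed outputs; assumed checkpoint-specific


def truncate_ids(content_ids, max_length):
    """Trim content so that CLS + content + SEP fits within max_length."""
    budget = max(max_length - 2, 0)  # reserve two slots for the special tokens
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


# Content IDs of the first sentence, without the special tokens
content = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012]
truncated = truncate_ids(content, max_length=8)
print(truncated)  # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
```

With `max_length=8`, only six content tokens survive, which matches the first row of the truncated output.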


- The tokenizer can convert text into **framework-specific tensors**.  
- These tensors can be directly passed to the model.  
- Supported values for the `return_tensors` argument:  
  - `pt` → Returns **PyTorch** tensors.  
  - `tf` → Returns **TensorFlow** tensors.  
  - `np` → Returns **NumPy** arrays.
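
Tensors and arrays must be rectangular, which is presumably why the calls in the next cell pass `padding=True` alongside `return_tensors`. The sketch below illustrates that shape constraint in plain Python. `batch_shape` is a hypothetical helper, not a Transformers function.

```python
def batch_shape(batch):
    """Return (rows, cols) if every row has the same length, else raise."""
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


# Padded and unpadded versions of the same two sequences
padded = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0],
]
unpadded = [padded[0], [101, 2061, 2031, 1045, 999, 102]]

print(batch_shape(padded))  # (2, 14)
try:
    batch_shape(unpadded)
except ValueError as err:
    print(err)  # ragged batch, row lengths: [6, 14]
```

The padded batch has a well-defined shape of 2 by 14, while the unpadded one cannot be turned into a single tensor.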

In [10]:
sequences = ["I've been waiting for this course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(model_inputs)

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
print(model_inputs)

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(model_inputs)

{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166,
         1012,  102],
        [ 101, 2061, 2031, 1045,  999,  102,    0,    0,    0,    0,    0,    0,
            0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
{'input_ids': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878,
        2166, 1012,  102],
       [ 101, 2061, 2031, 1045,  999,  102,    0,    0,    0,    0,    0,
           0,    0,    0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 14), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}
{'input_ids': array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878,
        2166, 1012,  102],
       [ 101, 2061, 2031, 1045,  999,  102,    0,    0,    0,    0,    0,
   

### Special Tokens

In [11]:
sequence = "I've been waiting for this course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166, 1012]


In [12]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for this course my whole life. [SEP]
i've been waiting for this course my whole life.
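
Comparing the two ID lists shows what calling the tokenizer adds on top of `tokenize` plus `convert_tokens_to_ids`: one ID at the start and one at the end, which decode to `[CLS]` and `[SEP]`. Here is a small sketch of that comparison using the values printed above. The `SPECIAL_TOKENS` mapping and `strip_special` are illustrative names, and the ID values are assumed to hold only for this checkpoint.

```python
# ID-to-name mapping read off the decoded output; assumed checkpoint-specific
SPECIAL_TOKENS = {101: "[CLS]", 102: "[SEP]"}

with_special = [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166, 1012, 102]
without_special = [1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2026, 2878, 2166, 1012]


def strip_special(ids):
    """Drop any ID that is listed as a special token."""
    return [i for i in ids if i not in SPECIAL_TOKENS]


# The two lists differ only by the first and last ID
print(with_special[1:-1] == without_special)        # True
print(strip_special(with_special) == without_special)  # True
print([SPECIAL_TOKENS[i] for i in with_special if i in SPECIAL_TOKENS])  # ['[CLS]', '[SEP]']
```

The model was pretrained with these markers, so the tokenizer inserts them to keep inference inputs consistent with training.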
