<a href="https://colab.research.google.com/github/HanSong19/Hugging-Face/blob/main/2.2%20Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
!pip install datasets evaluate transformers[sentencepiece]



## 1. Tokenization
Tokenization translates raw text into numerical data that a model can use. The goal is to find the most meaningful representation: the smallest representation that still makes the most sense to the model.

It has three basic steps: tokenize the sequence (split it into words, punctuation, etc.) -> convert to IDs (assign a unique numeric value to each token) -> decode (map the IDs back into a text sequence).

### 1.1 Tokenize by word
Use Python's .split() to tokenize the sentence "Jim Henson was a puppeteer" into words.


In [12]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
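
The three steps described above can be sketched end to end with a toy word-level tokenizer. The vocabulary and the `[UNK]` fallback below are made up for illustration; they are assumptions, not what any pretrained tokenizer uses.

```python
# Toy word-level tokenizer: tokenize -> convert to IDs -> decode.
# The vocabulary is built on the fly and is purely illustrative.
sentence = "Jim Henson was a puppeteer"

# Step 1: tokenize by whitespace
tokens = sentence.split()

# Step 2: give each distinct token a unique integer ID (0 is reserved for unknown words)
vocab = {"[UNK]": 0}
for token in tokens:
    vocab.setdefault(token, len(vocab))
ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]

# Step 3: decode by mapping IDs back to tokens and joining them
id_to_token = {i: t for t, i in vocab.items()}
decoded = " ".join(id_to_token[i] for i in ids)

print(tokens)
print(ids)
print(decoded)
```

A word-level vocabulary needs one entry per word, and any word outside it collapses to `[UNK]`. That limitation is one reason subword tokenizers are used instead.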


## Encoding
### 1. Tokenization
Load the tokenizer with the BertTokenizer class or the AutoTokenizer class, using the "bert-base-cased" identifier.

Then call the .tokenize() method to split the text into subword tokens.

In [20]:
from transformers import BertTokenizer, AutoTokenizer

sequence =  "Jim Henson was a puppeteer"
tokenizer_Bert = BertTokenizer.from_pretrained("bert-base-cased" )
tokenizer_AutoTokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

output_Bert = tokenizer_Bert.tokenize(sequence)
output_Auto = tokenizer_AutoTokenizer.tokenize(sequence)

print(output_Bert)
print(output_Auto)

['Jim', 'He', '##nson', 'was', 'a', 'puppet', '##eer']
['Jim', 'He', '##nson', 'was', 'a', 'puppet', '##eer']
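
BERT tokenizers use WordPiece, which splits a word into the longest pieces found in the vocabulary and marks continuation pieces with `##`. The following is a minimal sketch of that greedy longest-match idea. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also handles punctuation and text normalization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary, chosen so the result matches the output above
toy_vocab = {"Jim", "He", "##nson", "was", "a", "puppet", "##eer"}

tokens = []
for word in "Jim Henson was a puppeteer".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
```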


In [21]:
# save the tokenizer files (vocabulary and configuration) to a local directory


tokenizer_Bert.save_pretrained("directory_on_my_computer")
tokenizer_AutoTokenizer.save_pretrained("directory_on_my_computer")



('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

### 2. Generating input IDs
Map each token to its ID in the vocabulary using .convert_tokens_to_ids().


In [22]:
ids = tokenizer_Bert.convert_tokens_to_ids(output_Bert)
print(ids)

[3104, 1124, 15703, 1108, 170, 16797, 8284]


## Decoding

Decode the token IDs back into text using .decode().

In [19]:
decoded_text = tokenizer_Bert.decode([3104, 1124, 15703, 1108, 170, 16797, 8284])
print(decoded_text)

Jim Henson was a puppeteer
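
Decoding does more than map IDs back to tokens: pieces that start with `##` are glued onto the previous token, so the result reads as normal words. A rough sketch of that merging step follows. The helper name `merge_wordpieces` is hypothetical, and the library's own decoder also handles special tokens and spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Rebuild words by gluing '##' continuation pieces onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the "##" marker and append to the last word
        else:
            words.append(token)
    return " ".join(words)

tokens = ["Jim", "He", "##nson", "was", "a", "puppet", "##eer"]
merged = merge_wordpieces(tokens)
print(merged)
```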


## Handling Multiple Sequences



In [37]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)



Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)



In [40]:
# stack the same sequence twice to form a batch of two
batched_ids = [ids, ids]
inputbatch_ids = torch.tensor(batched_ids)
print(inputbatch_ids)
output_batch = model(inputbatch_ids)
print(output_batch.logits)

tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
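
The logits printed above are raw scores, not probabilities. As a small sketch in plain Python (no torch required), a softmax turns one row of logits into probabilities that sum to 1. The `softmax` helper is written here for illustration; which label each index stands for can be read from `model.config.id2label`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.7276, 2.8789]  # values copied from the output above
probs = softmax(logits)
print(probs)
```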


## Padding

Padding makes the input tensor rectangular. When several sequences are batched together, they must all have the same length, so the shorter sequences are filled with a padding token.


In [45]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batch_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]


print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batch_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


The batched call print(model(torch.tensor(batch_ids)).logits) returns
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)

instead of
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)

because the model also takes the padding token into account: every token is contextualized by all the other tokens in its sequence, padding included. Hence we need to pass the model an attention mask that tells it which tokens to attend to and which to ignore.
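
The effect can be reproduced with a much simpler stand-in for the model. The sketch below just averages made-up per-token scores, so the numbers are hypothetical and a real transformer uses attention rather than a plain mean, but it shows how a padding token shifts the result and how a mask removes that influence.

```python
PAD_ID = 0  # hypothetical padding ID for this toy example

# Made-up score for each token ID; the padding token has a score too
token_score = {200: 1.0, PAD_ID: -2.0}

def mean_score(ids, mask=None):
    """Average token scores, skipping positions where the mask is 0."""
    if mask is None:
        mask = [1] * len(ids)
    kept = [token_score[i] for i, m in zip(ids, mask) if m == 1]
    return sum(kept) / len(kept)

unpadded = mean_score([200, 200])                       # the sequence on its own
padded = mean_score([200, 200, PAD_ID])                 # padding counted by mistake
masked = mean_score([200, 200, PAD_ID], mask=[1, 1, 0]) # padding ignored

print(unpadded, padded, masked)
```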

## Attention Masks
The attention mask has the same shape as the input IDs tensor. 0 means the corresponding token should be ignored; 1 means the corresponding token should be attended to.


In [50]:
batch_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batch_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)


tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
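
Writing padded IDs and masks by hand does not scale, so here is a minimal sketch of a helper that builds both from sequences of different lengths. The name `pad_batch` and the value `pad_id=0` are assumptions for this example; in practice the padding ID would come from tokenizer.pad_token_id.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # fill on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# pad_id=0 is an assumption for this sketch
input_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(input_ids)
print(attention_mask)
```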


In [70]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# without padding, each sequence keeps its own length
model_inputs = tokenizer(sequences)
print(model_inputs)

# pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
# pad the sequences up to the maximum length the model allows
model_inputs = tokenizer(sequences, padding="max_length")
# pad the sequences up to a specific length (8)
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


In [None]:
# truncate sequences that are longer than the maximum length the model allows
model_inputs = tokenizer(sequences, truncation=True)
# truncate sequences that are longer than 8 tokens
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
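
To make the effect of max_length=8 concrete, here is a sketch of truncation on an already encoded sequence. It is an illustration of the idea, not the library's implementation: the helper `truncate_encoded` is hypothetical, and it assumes each sequence starts with 101 and ends with 102 as in the output above, and that max_length is at least 2.

```python
def truncate_encoded(ids, max_length):
    """Shorten an encoded sequence to max_length, keeping its first and last tokens."""
    if len(ids) <= max_length:
        return list(ids)  # already short enough
    first, content, last = ids[0], ids[1:-1], ids[-1]
    # Leave room for the two boundary tokens and cut the content from the right
    return [first] + content[: max_length - 2] + [last]

# Both sequences are copied from the tokenizer output above
long_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
            2607, 2026, 2878, 2166, 1012, 102]
short_ids = [101, 2061, 2031, 1045, 999, 102]

truncated_long = truncate_encoded(long_ids, 8)
truncated_short = truncate_encoded(short_ids, 8)
print(truncated_long)
print(truncated_short)
```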

In [None]:
# return_tensors selects the type of tensor returned
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")  # PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")  # TensorFlow tensors (requires TensorFlow)
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")  # NumPy arrays

In [72]:
# depending on the model, the tokenizer adds special tokens to mark the beginning and end of the sequence

sequence = "I've been waiting for a HuggingFace course my whole life."
model_input = tokenizer(sequence)

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

print("this is model_input[input_id]:",model_input["input_ids"])
print("this is ids:", ids)

this is model_input[input_id]: [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
this is ids: [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [75]:
print(tokenizer.decode(model_input["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.
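
The only difference between the two ID lists above is the pair of special tokens at the ends. The sketch below wraps the plain IDs by hand to show this. The helper `add_special_tokens` is hypothetical, and the IDs 101 and 102 are taken from the printed output for this checkpoint; other models may use different special tokens or none at all.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] as they appear in the output above

def add_special_tokens(ids):
    """Wrap a list of token IDs with the start and end markers."""
    return [CLS_ID] + list(ids) + [SEP_ID]

# Plain IDs copied from the "this is ids" output above
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
       2607, 2026, 2878, 2166, 1012]
with_special = add_special_tokens(ids)
print(with_special)
```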
