<a href="https://colab.research.google.com/github/Matonice/30-Days-of-Transformer/blob/main/Fast_tokenizer_special_power.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install transformers
!pip install sentencepiece
!pip install datasets

In [2]:
from transformers import AutoTokenizer

In [3]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
example = "I'm Abdulmatin and I work as a freelance data scientist at Upwork"
encoding = tokenizer(example)
print(type(encoding))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


In [5]:
# Check whether the tokenizer, and the encoding it produced, are backed by a fast tokenizer
print(tokenizer.is_fast)
print(encoding.is_fast)

True
True


**What fast tokenizers enable us to do**

In [7]:
# Access the tokens directly, without converting the input IDs back to tokens
print(encoding.tokens())

['[CLS]', 'i', "'", 'm', 'abdul', '##mat', '##in', 'and', 'i', 'work', 'as', 'a', 'freelance', 'data', 'scientist', 'at', 'up', '##work', '[SEP]']


In [9]:
# Get the index of the word each token comes from (None marks special tokens)
print(encoding.word_ids())

[None, 0, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, None]
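
The word indices make it possible to put sub-word pieces back together. The following is a minimal sketch, not part of the original notebook: it copies the two lists printed above and uses a hypothetical helper, `group_tokens_by_word`, to merge WordPiece tokens that share a word index. It assumes continuation pieces carry the `##` prefix, as they do in the output above.

```python
# Values copied from the tokens() and word_ids() outputs above.
tokens = ['[CLS]', 'i', "'", 'm', 'abdul', '##mat', '##in', 'and', 'i', 'work',
          'as', 'a', 'freelance', 'data', 'scientist', 'at', 'up', '##work', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, None]


def group_tokens_by_word(tokens, word_ids):
    """Hypothetical helper: merge tokens that share the same word index."""
    words = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        # Strip the WordPiece continuation prefix before concatenating.
        piece = token[2:] if token.startswith("##") else token
        words[word_id] = words.get(word_id, "") + piece
    return [words[i] for i in sorted(words)]


words = group_tokens_by_word(tokens, word_ids)
print(words)
```

Here the three pieces with word index 3 are joined back into a single word, and the two pieces with index 13 likewise.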


**Inside the token classification pipeline**

In [11]:
# Get the raw, token-level result of the pipeline

from transformers import pipeline

token_classifier = pipeline("token-classification")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [12]:
print(token_classifier("I'm Abdulmatin and I work as a freelance data scientist at Upwork"))

[{'entity': 'I-PER', 'score': 0.99946135, 'index': 4, 'word': 'Abdul', 'start': 4, 'end': 9}, {'entity': 'I-PER', 'score': 0.9987431, 'index': 5, 'word': '##mat', 'start': 9, 'end': 12}, {'entity': 'I-PER', 'score': 0.997752, 'index': 6, 'word': '##in', 'start': 12, 'end': 14}, {'entity': 'I-ORG', 'score': 0.9989403, 'index': 16, 'word': 'Up', 'start': 59, 'end': 61}, {'entity': 'I-ORG', 'score': 0.9819564, 'index': 17, 'word': '##work', 'start': 61, 'end': 65}]


In [13]:
# Ask the pipeline to group together tokens that belong to the same entity
token_classifier = pipeline("token-classification", aggregation_strategy="simple")
print(token_classifier("I'm Abdulmatin and I work as a freelance data scientist at Upwork"))

[{'entity_group': 'PER', 'score': 0.99865216, 'word': 'Abdulmatin', 'start': 4, 'end': 14}, {'entity_group': 'ORG', 'score': 0.99044836, 'word': 'Upwork', 'start': 59, 'end': 65}]
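
The `start` and `end` values are character offsets into the original string, so each grouped entity can be recovered by slicing. A small check, sketched here with the entities copied from the output above (scores omitted for brevity):

```python
example = "I'm Abdulmatin and I work as a freelance data scientist at Upwork"

# Grouped entities copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Abdulmatin", "start": 4, "end": 14},
    {"entity_group": "ORG", "word": "Upwork", "start": 59, "end": 65},
]

# Slice the original text with each entity's character span.
spans = [example[e["start"]:e["end"]] for e in entities]

for entity, span in zip(entities, spans):
    print(entity["entity_group"], repr(span), span == entity["word"])
```

Because the spans index the untouched input, the slices keep the original casing and spacing regardless of how the tokenizer split the words.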


**Obtaining the result manually without using the pipeline function**

In [27]:
# Tokenize the example and pass the encoded inputs to the model

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "I'm Abdulmatin and I work as a freelance data scientist at Upwork"
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

In [28]:
print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 19])
torch.Size([1, 19, 9])


In [29]:
import torch

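# Softmax over the label dimension gives per-token probabilities; [0] drops the batch dimension
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
# Index of the highest-scoring label for each token
predictions = outputs.logits.argmax(dim=-1)[0].tolist()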
print(predictions)

[0, 0, 0, 0, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 6, 0]
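
To make the softmax and argmax steps concrete, here is a standard-library sketch for a single token. The logit values are made up for illustration and are not taken from the model; only the number of labels (9) matches the logits shape printed above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one token over 9 labels (illustrative values only).
token_logits = [0.1, -1.2, -0.8, 0.3, 6.5, -0.5, 0.2, -1.0, -0.7]

probs = softmax(token_logits)
# argmax: the index of the largest probability (same index as the largest logit)
pred = max(range(len(probs)), key=probs.__getitem__)

print(pred, probs[pred])
```

Softmax preserves the ordering of the logits, which is why the notebook can take the argmax on the raw logits and still look up the matching probability afterwards.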


In [21]:
model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}

In [22]:
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)

[{'entity': 'I-PER', 'score': 0.9994614720344543, 'word': 'Abdul'}, {'entity': 'I-PER', 'score': 0.9987432360649109, 'word': '##mat'}, {'entity': 'I-PER', 'score': 0.9977518916130066, 'word': '##in'}, {'entity': 'I-ORG', 'score': 0.9989402890205383, 'word': 'Up'}, {'entity': 'I-ORG', 'score': 0.9819564819335938, 'word': '##work'}]


In [30]:
import numpy as np

results = []
# Re-tokenize with return_offsets_mapping=True to get each token's (start, end) character span
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I- prefix
    label = label[2:]
    start, end = offsets[idx]

    # The first token may be tagged B- or I-; every following token must be I-<label>
    all_scores = [probabilities[idx][pred]]
    idx += 1
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of the scores of all tokens in the grouped entity
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        }
    )
    # idx already points at the first token after the entity, so it is not advanced again

print(results)

[{'entity_group': 'PER', 'score': 0.9986521999041239, 'word': 'Abdulmatin', 'start': 4, 'end': 14}, {'entity_group': 'ORG', 'score': 0.990448385477066, 'word': 'Upwork', 'start': 59, 'end': 65}]
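
As a quick sanity check on the grouped scores, the arithmetic can be reproduced from the token-level scores printed earlier. This sketch only averages those copied values; it does not call the model.

```python
# Token-level scores copied from the ungrouped manual results above.
per_scores = [0.9994614720344543, 0.9987432360649109, 0.9977518916130066]  # Abdul, ##mat, ##in
org_scores = [0.9989402890205383, 0.9819564819335938]  # Up, ##work

# The grouped score is the arithmetic mean of the token scores.
per_mean = sum(per_scores) / len(per_scores)
org_mean = sum(org_scores) / len(org_scores)

print(per_mean, org_mean)
```

Up to floating-point rounding, both means agree with the `score` fields of the grouped entities, for the manual loop and for the pipeline run with `aggregation_strategy="simple"`.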
