
[**Hugging Face**](https://huggingface.co/docs) is an ecosystem of libraries that lets developers and researchers share and use open-source machine learning resources. It is particularly popular in NLP.

We will take an _overview_ of the main Hugging Face components: tokenizers, models, datasets, and pipelines.

-----------------------

Task: understand all of the code and answer wherever it says **QUESTION**

## Environment setup

In [1]:
!pip install -qU transformers datasets watermark

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2024.12.0 which is incompatible.
t

In [2]:
%load_ext watermark

In [3]:
%watermark -vmp transformers,datasets,torch,numpy,pandas

Python implementation: CPython
Python version       : 3.11.12
IPython version      : 7.34.0

transformers: 4.51.3
datasets    : 3.5.0
torch       : 2.6.0+cu124
numpy       : 2.0.2
pandas      : 2.2.2

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.123+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



## Tokenizers

Pretrained models are developed together with **tokenizers**: they take raw strings and return dictionaries with the **model inputs**.

Each **token** is represented by an integer ID that corresponds to an **entry in the model's vocabulary** (a whole word or a subword piece).

With `AutoTokenizer` we can load **pretrained tokenizers**. To see how to train a BPE tokenizer from scratch, see: https://huggingface.co/learn/nlp-course/en/chapter6/8?fw=pt

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")


In [5]:
print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)


In [6]:
input_str = "These pretzels are making me thirsty!"
tokenized_input = tokenizer(input_str)

print("> Tokenizer input:")
print(input_str)
print("-"*70)
print("> Tokenizer output:")
print(tokenized_input)
print("-"*70)
print("> Tokenizer output (input IDs):")
print(tokenized_input["input_ids"])

> Tokenizer input:
These pretzels are making me thirsty!
----------------------------------------------------------------------
> Tokenizer output:
{'input_ids': [101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
----------------------------------------------------------------------
> Tokenizer output (input IDs):
[101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102]


Let's see what happens under the hood, step by step.

In [7]:
def tokenize_step_by_step(input_str):
    input_tokens = tokenizer.tokenize(input_str)
    input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
    cls = [tokenizer.cls_token_id]
    sep = [tokenizer.sep_token_id]
    input_ids_special_tokens = cls + input_ids + sep
    decoded_str = tokenizer.decode(input_ids_special_tokens)
    print("input:                  ", input_str)
    print("tokenize:               ", input_tokens)
    print("convert_tokens_to_ids:  ", input_ids)
    print("add special tokens:     ", input_ids_special_tokens)
    print("-"*70)
    print("decode (IDs to strings):", decoded_str)

tokenize_step_by_step(input_str)

input:                   These pretzels are making me thirsty!
tokenize:                ['These', 'pre', '##tz', '##els', 'are', 'making', 'me', 'thirst', '##y', '!']
convert_tokens_to_ids:   [1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106]
add special tokens:      [101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102]
----------------------------------------------------------------------
decode (IDs to strings): [CLS] These pretzels are making me thirsty! [SEP]
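
The `##` prefixes in the output above mark pieces that continue a word. One common way to produce such a split is a greedy longest-match-first lookup against the vocabulary, as in WordPiece. Below is a minimal sketch, assuming a tiny made-up vocabulary (`toy_vocab` is hypothetical, not the real `distilbert-base-cased` vocabulary) and skipping normalization and punctuation handling:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest subword pieces found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink until a match is found
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"pre", "##tz", "##els", "thirst", "##y", "are"}

print(wordpiece_tokenize("pretzels", toy_vocab))  # ['pre', '##tz', '##els']
print(wordpiece_tokenize("thirsty", toy_vocab))   # ['thirst', '##y']
print(wordpiece_tokenize("soup", toy_vocab))      # ['[UNK]']
```

The smaller the overlap between a word and the vocabulary, the more (and shorter) pieces it is split into.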


**QUESTION** What happens if we pass a sentence in Spanish or Arabic? Why?

In [8]:
x = "Quiero aprender a usar modelos de lenguaje"

tokenize_step_by_step(x)

input:                   Quiero aprender a usar modelos de lenguaje
tokenize:                ['Q', '##ui', '##ero', 'a', '##p', '##ren', '##der', 'a', 'us', '##ar', 'model', '##os', 'de', 'le', '##ng', '##ua', '##je']
convert_tokens_to_ids:   [154, 6592, 10771, 170, 1643, 5123, 2692, 170, 1366, 1813, 2235, 2155, 1260, 5837, 2118, 6718, 5561]
add special tokens:      [101, 154, 6592, 10771, 170, 1643, 5123, 2692, 170, 1366, 1813, 2235, 2155, 1260, 5837, 2118, 6718, 5561, 102]
----------------------------------------------------------------------
decode (IDs to strings): [CLS] Quiero aprender a usar modelos de lenguaje [SEP]


In [9]:
x = "أريد أن أتعلم استخدام نماذج اللغة"

tokenize_step_by_step(x)

input:                   أريد أن أتعلم استخدام نماذج اللغة
tokenize:                ['أ', '##ر', '##ي', '##د', 'أ', '##ن', 'أ', '##ت', '##ع', '##ل', '##م', 'ا', '##س', '##ت', '##خ', '##د', '##ا', '##م', 'ن', '##م', '##ا', '##ذ', '##ج', 'ا', '##ل', '##ل', '##غ', '##ة']
convert_tokens_to_ids:   [562, 19775, 16070, 18191, 562, 17754, 562, 28477, 28490, 28495, 26259, 565, 28484, 28477, 28481, 18191, 28475, 26259, 590, 26259, 28475, 28482, 28479, 565, 28495, 28495, 28491, 23525]
add special tokens:      [101, 562, 19775, 16070, 18191, 562, 17754, 562, 28477, 28490, 28495, 26259, 565, 28484, 28477, 28481, 18191, 28475, 26259, 590, 26259, 28475, 28482, 28479, 565, 28495, 28495, 28491, 23525, 102]
----------------------------------------------------------------------
decode (IDs to strings): [CLS] أريد أن أتعلم استخدام نماذج اللغة [SEP]


When we train models or run inference, we will want to:

* work with **_batches_**, passing many sequences at once as input
* work with PyTorch **tensors**, not lists

**QUESTION** What is the point of using batches?

In [10]:
input_strings = [
    "These pretzels are making me thirsty!",
    "I am speechless! I am without speech.",
    "No more soup for you!",
    "I'm a wealthy industrialist and philanthropist and a bicyclist."
]

In [11]:
model_inputs = tokenizer(
    input_strings, return_tensors="pt", padding='longest', truncation=True,
    max_length=tokenizer.model_max_length)

This means:

- Tokenize all the sentences at once
- Return the result in PyTorch format (`"pt"`) as a _rectangular_ tensor
- Truncate the longest sentences so they do not exceed the maximum length the model accepts
- Add _padding_ up to the longest sequence in the batch so that all inputs have the same length
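
The truncation, padding and attention-mask logic in the list above can be sketched in plain Python. This is only an illustration, not the library's implementation: `pad_batch` is a hypothetical helper, and unlike the real tokenizer it truncates by plain slicing without preserving the final `[SEP]` token. The first sequence reuses the IDs printed earlier for the pretzels sentence; the second is a short illustrative sequence.

```python
def pad_batch(sequences, pad_id=0, max_length=512):
    """Truncate and right-pad lists of token IDs, and build the attention mask."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1636, 3073, 5745, 5999, 1132, 1543, 1143, 26190, 1183, 106, 102],
    [101, 1636, 102],
])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids)
    print(mask)
```

Every row ends up with the same length, so the batch can be stored as a rectangular tensor, and the mask tells the model which positions to ignore.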


In [12]:
print(f"Max model length: {tokenizer.model_max_length}")

Max model length: 512


In [13]:
print(f"Pad token: {tokenizer.pad_token}")
print(f"Pad token ID: {tokenizer.pad_token_id}")

Pad token: [PAD]
Pad token ID: 0


In [14]:
model_inputs = tokenizer(
    input_strings, return_tensors="pt", padding='longest', truncation=True,
    max_length=tokenizer.model_max_length)

print("Batch encode:")
print([f"{k}: {v.shape}" for k, v in model_inputs.items()])
print(model_inputs["input_ids"])
print(model_inputs["attention_mask"])
print("-"*70)
print("Batch decode:")
print(*tokenizer.batch_decode(model_inputs.input_ids, skip_special_tokens=False), sep="\n")

Batch encode:
['input_ids: torch.Size([4, 17])', 'attention_mask: torch.Size([4, 17])']
tensor([[  101,  1636,  3073,  5745,  5999,  1132,  1543,  1143, 26190,  1183,
           106,   102,     0,     0,     0,     0,     0],
        [  101,   146,  1821,  4055,  2008,   106,   146,  1821,  1443,  4055,
           119,   102,     0,     0,     0,     0,     0],
        [  101,  1302,  1167, 13128,  1111,  1128,   106,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0],
        [  101,   146,   112,   182,   170,  6822, 24916,  1105, 16581,  1105,
           170, 16516,  3457,  1665,  7276,   119,   102]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
----------------------------------------------------------------------
Batch decode:
[CLS] These pretzels are maki

## Models

Models usually have a **body** and a **head**.

* The "body" holds the **pretrained weights**, which return a representation of the input sequence.
* The "head" holds the additional weights that depend on the **specific task** we are solving.

With the `AutoModel...` classes we can load a pretrained model and add a task-specific head to it.

```
AutoModel # (hidden states only, no head)
AutoModelForCausalLM
AutoModelForMaskedLM
AutoModelForSequenceClassification
AutoModelForTokenClassification
# etc
```

Let's load a "distilled" BERT to do **binary sequence classification**.

In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-cased', num_labels=2, id2label={0: "A", 1: "B"}, label2id={"A": 0, "B": 1})


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This means we are adding a **classification layer** with **two outputs** on top of the pretrained model.

The warning tells us that the weights of this head (`pre_classifier` and `classifier`) are newly initialized and have not been trained yet. In other words, we need to fine-tune the model on a task-specific dataset before using it makes sense.
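
To make the idea of a head concrete, here is a toy sketch in plain Python: a randomly initialized linear layer that maps a hidden vector to two logits. Everything in it is an assumption for illustration (`hidden_size = 8`, the `linear` helper, the random values); the actual DistilBERT head also includes a `pre_classifier` layer and works on 768-dimensional hidden states.

```python
import random

random.seed(0)
hidden_size, num_labels = 8, 2  # toy sizes, chosen only for readability


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


# Stand-in for the representation that the pretrained body would return
hidden_state = [random.gauss(0, 1) for _ in range(hidden_size)]

# Untrained head: small random weights and zero bias
weight = [[random.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear(hidden_state, weight, bias)
print(logits)  # one logit per label; the values are arbitrary until the head is trained
```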

**_For illustration only_**, let's run **inference** on an example sentence.

In [16]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [17]:
param_names = [n for n, p in model.named_parameters()]

print(f"# of 'layers': {len(param_names)}")
print(param_names[:3])
print(param_names[-3:])

# of 'layers': 104
['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight']
['pre_classifier.bias', 'classifier.weight', 'classifier.bias']


In [18]:
import torch

input_str = "These pretzels are making me thirsty!"

model_inputs = tokenizer(input_str, return_tensors="pt")
model.eval() # eval mode: disables stochastic components such as dropout
with torch.inference_mode(): # inference mode: disables gradient computation
    model_outputs = model(**model_inputs)

print("Inputs:")
print(model_inputs)
print("-"*70)
print("Outputs:")
print(model_outputs)
print(f"Logits: {model_outputs.logits}")
print(f"Probabilities: {torch.softmax(model_outputs.logits, dim=1)}")
pred = torch.argmax(model_outputs.logits, dim=-1).item()
print(f"Prediction: {model.config.id2label[pred]}")

Inputs:
{'input_ids': tensor([[  101,  1636,  3073,  5745,  5999,  1132,  1543,  1143, 26190,  1183,
           106,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
----------------------------------------------------------------------
Outputs:
SequenceClassifierOutput(loss=None, logits=tensor([[-0.0718,  0.0601]]), hidden_states=None, attentions=None)
Logits: tensor([[-0.0718,  0.0601]])
Probabilities: tensor([[0.4671, 0.5329]])
Prediction: B


If we had labels, we could use PyTorch to train the model, i.e. update its weights to reduce the loss.

In [40]:
input_str = "These pretzels are making me thirsty!"

model_inputs = tokenizer(input_str, return_tensors="pt")
model.train()
model_outputs = model(**model_inputs)
label = torch.tensor([1])
loss = torch.nn.functional.cross_entropy(model_outputs.logits, label)
print(f"Loss: {loss.item():.4f}")
loss.backward() # computes the gradients
# optimizer.step() # if we wanted to update the weights with an optimizer

Loss: 0.6149


If the labels are included in the input, the model computes the loss automatically:

In [20]:
model_inputs['labels'] = torch.tensor([1]) # example label
model.eval()
with torch.inference_mode():
    model_outputs = model(**model_inputs)

print(f"Logits: {model_outputs.logits}")
print(f"Probabilities: {torch.softmax(model_outputs.logits, dim=1)}")
print(f"Loss: {model_outputs.loss:.4f}")

Logits: tensor([[-0.0718,  0.0601]])
Probabilities: tensor([[0.4671, 0.5329]])
Loss: 0.6294
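
The probabilities and the loss printed above can be reproduced by hand from the logits. The sketch below uses only the standard library; `softmax` and `cross_entropy` are small helpers written for this example, and the logits are the rounded values shown in the output.

```python
import math


def softmax(logits):
    """Exponentiate (shifted by the max for numerical stability) and normalize."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


logits = [-0.0718, 0.0601]  # rounded logits taken from the cell output
probs = softmax(logits)
loss = cross_entropy(logits, label=1)

print([round(p, 4) for p in probs])  # [0.4671, 0.5329]
print(round(loss, 4))                # 0.6294
```

For a single example, the cross-entropy is simply `-log` of the probability given to the true label, here `-log(0.5329)`.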


**QUESTION** Why can the loss differ between the two previous cells?

## Datasets

HF also hosts open-source [datasets](https://huggingface.co/datasets) that we can use to train and evaluate our models.

There are [many features](https://huggingface.co/docs/datasets/process) for reading and modifying the structure and content of a dataset (e.g. streaming, splitting data, reordering rows, renaming columns, removing columns, transforming examples, concatenating datasets, etc.).

Let's load a dataset of [movie reviews](https://huggingface.co/datasets/rotten_tomatoes).

In [21]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")


In [22]:
# rows and columns
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [23]:
dataset["train"].features.items()

dict_items([('text', Value(dtype='string', id=None)), ('label', ClassLabel(names=['neg', 'pos'], id=None))])

In [24]:
dataset["train"][3]

{'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
 'label': 1}

In [25]:
# It is generally useful to build mappings between IDs and labels for the target variable
label_names = dataset["train"].features["label"].names
label2id = {name: dataset["train"].features["label"].str2int(name) for name in label_names}
id2label = {id: label for label, id in label2id.items()}

id_example = dataset["train"][3]["label"]
print(f"Label ID: {id_example}")
print(f"Label: {id2label[id_example]}")

Label ID: 1
Label: pos


As an example, we will unescape any HTML entities that may appear in the reviews and truncate the text.

In [26]:
import html

def truncate(examples, max_length=50):
    """Receives a dictionary with the column names as keys.
    Since we apply it in batches, each value of the dict is a list with the
    values of that column.
    """
    return {
        # unescape first so that truncation cannot cut an HTML entity in half
        'text': [html.unescape(text)[:max_length] for text in examples['text']],
        # 'label': ... # if we wanted to modify the label
    }

In [27]:
# example:
truncate(dataset["train"][:4])

{'text': ["the rock is destined to be the 21st century's new ",
  'the gorgeously elaborate continuation of " the lor',
  'effective but too-tepid biopic',
  'if you sometimes like to go to the movies to have ']}

In [28]:
dataset = dataset.map(lambda x: truncate(x, max_length=50), batched=True)
# the default batch_size is 1000

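
With `batched=True`, the mapped function receives a dictionary of columns (lists) rather than one example at a time, and the columns it returns replace the originals. The sketch below imitates that contract in plain Python to show the data flow. `map_batched` and `unescape_and_truncate` are hypothetical helpers and the two rows are made up, so this is not how the `datasets` library is implemented.

```python
import html


def unescape_and_truncate(examples, max_length=10):
    """Batch function: takes a dict of columns and returns the columns to overwrite."""
    return {"text": [html.unescape(t)[:max_length] for t in examples["text"]]}


def map_batched(columns, fn, batch_size=1000):
    """Apply fn to column-wise slices and merge the returned columns back."""
    num_rows = len(next(iter(columns.values())))
    result = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        # Slice every column to build one batch
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        # Columns returned by fn overwrite the originals; the rest pass through
        updated = {**batch, **fn(batch)}
        for name, values in updated.items():
            result.setdefault(name, []).extend(values)
    return result


toy_columns = {"text": ["fish &amp; chips is a classic", "short"], "label": [1, 0]}
mapped = map_batched(toy_columns, unescape_and_truncate, batch_size=1)
print(mapped)  # {'text': ['fish & chi', 'short'], 'label': [1, 0]}
```

Note that the `label` column passes through untouched because the batch function does not return it.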

In [29]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [30]:
dataset['train'][3]

{'text': 'if you sometimes like to go to the movies to have ', 'label': 1}

## Pipelines

There are standard NLP tasks for which **pretrained and fine-tuned models** already exist. HF makes them available through the [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) interface.

For example, for **sentiment classification**:

In [31]:
from transformers import pipeline

sentiment_analysis = pipeline(
    "sentiment-analysis", model="siebert/sentiment-roberta-large-english"
)


Device set to use cpu


In [32]:
sentiment_analysis("Change is inevitable")

[{'label': 'POSITIVE', 'score': 0.9696395993232727}]

In [33]:
sentiment_analysis("Change is inevitable", top_k=None) # returns the scores for all classes

[{'label': 'POSITIVE', 'score': 0.9696395993232727},
 {'label': 'NEGATIVE', 'score': 0.03036043420433998}]
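
Conceptually, scores like the ones above come from turning the model's logits into probabilities and ranking the labels. Here is a rough sketch of that post-processing step, assuming a softmax over the logits; `rank_labels` is a hypothetical helper and the logits are made-up values, not the ones the model produced:

```python
import math


def rank_labels(logits, id2label, top_k=1):
    """Softmax the logits and return the top_k labels by score (all of them if top_k is None)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    scored = [{"label": id2label[i], "score": e / total} for i, e in enumerate(exps)]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored if top_k is None else scored[:top_k]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
made_up_logits = [-1.7, 1.8]  # illustrative values only

print(rank_labels(made_up_logits, id2label))              # only the best label
print(rank_labels(made_up_logits, id2label, top_k=None))  # every label, best first
```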

In [34]:
input_strings = [
    "These pretzels are making me thirsty!",
    "I am speechless! I am without speech.",
    "No more soup for you!",
    "I'm a wealthy industrialist and philanthropist and a bicyclist."
]
outputs = sentiment_analysis(input_strings)

for i, output in enumerate(outputs):
    print(f"Input: {input_strings[i]}")
    print(f"Sentiment: {output['label']}, score: {output['score']:.4f}")
    print("-"*70)

Input: These pretzels are making me thirsty!
Sentiment: NEGATIVE, score: 0.9983
----------------------------------------------------------------------
Input: I am speechless! I am without speech.
Sentiment: NEGATIVE, score: 0.9994
----------------------------------------------------------------------
Input: No more soup for you!
Sentiment: NEGATIVE, score: 0.9970
----------------------------------------------------------------------
Input: I'm a wealthy industrialist and philanthropist and a bicyclist.
Sentiment: POSITIVE, score: 0.9964
----------------------------------------------------------------------


Or for [NER](https://huggingface.co/dslim/bert-base-NER) (named entity recognition):

In [35]:
# model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
# tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
ner = pipeline("ner", model="dslim/bert-base-NER", tokenizer="dslim/bert-base-NER")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


In [36]:
print(ner.model.config.id2label)

{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


In [37]:
ner_string = "Did George Washington go to Washington? Will the real Slim Shady please stand up?"

In [38]:
outputs = ner(ner_string)
for entity in outputs:
    print(entity)

{'entity': 'B-PER', 'score': np.float32(0.99944705), 'index': 2, 'word': 'George', 'start': 4, 'end': 10}
{'entity': 'I-PER', 'score': np.float32(0.99628437), 'index': 3, 'word': 'Washington', 'start': 11, 'end': 21}
{'entity': 'B-LOC', 'score': np.float32(0.99970907), 'index': 6, 'word': 'Washington', 'start': 28, 'end': 38}
{'entity': 'B-PER', 'score': np.float32(0.9991824), 'index': 11, 'word': 'Slim', 'start': 54, 'end': 58}
{'entity': 'I-PER', 'score': np.float32(0.9974842), 'index': 12, 'word': 'S', 'start': 59, 'end': 60}
{'entity': 'I-PER', 'score': np.float32(0.9952407), 'index': 13, 'word': '##hady', 'start': 60, 'end': 64}
