# Using transformer models from Hugging Face

Transformers have been the state-of-the-art architecture for language models since their introduction in 2017. Transformer models can be _fine-tuned_ for many different tasks, some of which we have already worked on ourselves, e.g. language modeling as a next-word prediction task, sentiment analysis (a classification task) and Named Entity Recognition (NER).

In [1]:
!pip install transformers torch

Defaulting to user installation because normal site-packages is not writeable


# Part 1: Hugging Face pipelines

We are going to use the `pipeline()` abstraction from Hugging Face. This allows us to load a fine-tuned model and use it for the task in question. You can read more [here](https://huggingface.co/docs/transformers/v4.27.2/en/task_summary#natural-language-processing).

In [2]:
# This will store models in the specified directory, allowing for inspection later on
# Leave out unless you need the models to be somewhere specific
import os
os.environ["HF_HOME"] = "/work/tf_cache"

In [3]:
import transformers

from transformers import pipeline

In [4]:
# This will create more verbose logging from the transformers library which can give a peek into what is going on under the hood
# transformers.logging.set_verbosity_info()

## Text generation 

NOTE: This is quite slow compared to the other tasks. Why?

In [None]:
generator = pipeline(task="text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


In [None]:
prompt = "Today I want to talk about BERTology."

In [None]:
generated = generator(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
print(generated[0]["generated_text"])

Today I want to talk about BERTology. I was very interested how it would look with the "TECH" logo, since we didn't want to have too much to go on. I was also interested how this would look while maintaining the


## Sentiment Analysis (Text classification)

In [6]:
classifier = pipeline(task="sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [7]:
predictions = classifier("Hugging Face is the best thing since sliced bread!")

In [8]:
print(predictions)

[{'label': 'POSITIVE', 'score': 0.9990912675857544}]


## Named Entity Recognition (Token classification)

In [10]:
classifier = pipeline(task="ner")
sentence = "Hugging Face is a French company based in New York City."
predictions = classifier(sentence)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [11]:
print(sentence)
print(*predictions, sep='\n')

Hugging Face is a French company based in New York City.
{'entity': 'I-ORG', 'score': np.float32(0.9967675), 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': np.float32(0.92930275), 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': np.float32(0.9763208), 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-MISC', 'score': np.float32(0.9982874), 'index': 6, 'word': 'French', 'start': 18, 'end': 24}
{'entity': 'I-LOC', 'score': np.float32(0.99896204), 'index': 10, 'word': 'New', 'start': 42, 'end': 45}
{'entity': 'I-LOC', 'score': np.float32(0.9986792), 'index': 11, 'word': 'York', 'start': 46, 'end': 50}
{'entity': 'I-LOC', 'score': np.float32(0.9992418), 'index': 12, 'word': 'City', 'start': 51, 'end': 55}


We'll want to merge the tokens that make up entities.

In [12]:
classifier = pipeline(task="ner", aggregation_strategy="simple")
sentence = "Hugging Face is a French company based in New York City."
predictions = classifier(sentence)
print(sentence)
print(*predictions, sep='\n')


Hugging Face is a French company based in New York City.
{'entity_group': 'ORG', 'score': np.float32(0.9674637), 'word': 'Hugging Face', 'start': 0, 'end': 12}
{'entity_group': 'MISC', 'score': np.float32(0.9982874), 'word': 'French', 'start': 18, 'end': 24}
{'entity_group': 'LOC', 'score': np.float32(0.99896103), 'word': 'New York City', 'start': 42, 'end': 55}
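
To make the merging step concrete, here is a minimal sketch in plain Python that groups the token-level predictions from the earlier cell by hand. It is a simplification and not the library's actual implementation: `merge_entities` is a hypothetical helper, the scores are rounded copies of the output above, and the rule used (merge tokens with the same entity type and consecutive token indices, then average their scores) is an assumption that happens to reproduce the grouped output for this sentence.

```python
# Token-level predictions copied from the earlier output (scores rounded).
sentence = "Hugging Face is a French company based in New York City."
tokens = [
    {"entity": "I-ORG", "score": 0.9968, "index": 1, "start": 0, "end": 2},
    {"entity": "I-ORG", "score": 0.9293, "index": 2, "start": 2, "end": 7},
    {"entity": "I-ORG", "score": 0.9763, "index": 3, "start": 8, "end": 12},
    {"entity": "I-MISC", "score": 0.9983, "index": 6, "start": 18, "end": 24},
    {"entity": "I-LOC", "score": 0.9990, "index": 10, "start": 42, "end": 45},
    {"entity": "I-LOC", "score": 0.9987, "index": 11, "start": 46, "end": 50},
    {"entity": "I-LOC", "score": 0.9992, "index": 12, "start": 51, "end": 55},
]


def merge_entities(tokens, text):
    """Group neighbouring tokens that share an entity type (hypothetical helper)."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        if last and last["label"] == label and tok["index"] == last["last_index"] + 1:
            # Same entity type and directly adjacent token: extend the current group.
            last["scores"].append(tok["score"])
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            # Otherwise start a new group.
            groups.append({"label": label, "scores": [tok["score"]],
                           "start": tok["start"], "end": tok["end"],
                           "last_index": tok["index"]})
    return [
        {"entity_group": g["label"],
         "score": sum(g["scores"]) / len(g["scores"]),  # mean of the token scores
         "word": text[g["start"]:g["end"]],             # slice the original sentence
         "start": g["start"], "end": g["end"]}
        for g in groups
    ]


merged = merge_entities(tokens, sentence)
for entity in merged:
    print(entity)
```

Slicing the original sentence with the `start` and `end` offsets avoids having to glue `##` word pieces back together.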


## Using a different model
So far, we've only been using the default models and parameters for these tasks. But if you check out the Hugging Face model hub, you'll see that there are many fine-tuned models (in some cases hundreds) that can be slotted into these pipelines. Check out the options [here](https://huggingface.co/models).

In [9]:
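# Load a specific emotion classification model instead of the task default.
# Note: `return_all_scores` is deprecated in recent transformers releases in favor of
# `top_k=None`, which may return the scores in a slightly different shape.
classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base",
                      return_all_scores=True)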

config.json:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/329M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/294 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/329M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Device set to use cpu


In [10]:
classifier("I love this!")

[[{'label': 'anger', 'score': 0.004419787786900997},
  {'label': 'disgust', 'score': 0.0016119900392368436},
  {'label': 'fear', 'score': 0.00041385277290828526},
  {'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'neutral', 'score': 0.005764589179307222},
  {'label': 'sadness', 'score': 0.002092392183840275},
  {'label': 'surprise', 'score': 0.008528691716492176}]]
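
When all scores are returned, you usually still want the most likely label. A minimal sketch of that step, using a rounded copy of the output above as plain Python data (so it runs without downloading the model):

```python
# Output copied from the cell above (scores rounded for readability).
result = [[
    {"label": "anger", "score": 0.0044},
    {"label": "disgust", "score": 0.0016},
    {"label": "fear", "score": 0.0004},
    {"label": "joy", "score": 0.9772},
    {"label": "neutral", "score": 0.0058},
    {"label": "sadness", "score": 0.0021},
    {"label": "surprise", "score": 0.0085},
]]

scores = result[0]  # one list of label scores per input text

# The single most likely emotion.
top = max(scores, key=lambda item: item["score"])

# All emotions, most likely first.
ranked = sorted(scores, key=lambda item: item["score"], reverse=True)

print(top["label"], top["score"])
print([item["label"] for item in ranked[:3]])
```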

### Questions
While `pipeline()` is easy to use, it is not very clear what happens inside the abstraction.
1. Have a look at [the pipeline documentation](https://huggingface.co/docs/transformers/v4.27.2/en/task_summary#natural-language-processing). What different tasks are available, and how may they be useful?
2. Have a look at [the model page](https://huggingface.co/models) under Natural Language Processing. Are there any interesting models available?

# Part 2: Breaking the pipeline apart

The `pipeline()` abstraction hides away multiple steps: tokenization, inference and post-processing.

Let's break down what happens behind the scenes, focusing on sentiment analysis as our example.



## Tokenization: Preparing the Input
Before a model can process text, the text needs to be tokenized, and those tokens need to be converted into numerical representations that the transformer can work with.

Each pre-trained model comes with its own specific tokenizer. We load it using `AutoTokenizer.from_pretrained()`:

In [14]:
from transformers import AutoTokenizer

# Specify the model name (a sentiment analysis model)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Tokenizer loaded: {tokenizer}")

Tokenizer loaded: DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)


In [15]:
# "misinformation" included to illustrate subword tokenization
text = "This is a great movie! But it spreads misinformation."
tokens = tokenizer(text)

print(f"Input text: {text}")
print(f"Tokenization output: {tokens}")

decoded_tokens = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print(f"Tokenized: {decoded_tokens}")


Input text: This is a great movie! But it spreads misinformation.
Tokenization output: {'input_ids': [101, 2023, 2003, 1037, 2307, 3185, 999, 2021, 2009, 20861, 28616, 2378, 14192, 3370, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Tokenized: ['[CLS]', 'this', 'is', 'a', 'great', 'movie', '!', 'but', 'it', 'spreads', 'mis', '##in', '##form', '##ation', '.', '[SEP]']
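
The split of "misinformation" into `mis`, `##in`, `##form`, `##ation` comes from WordPiece tokenization, which repeatedly takes the longest piece of the word that exists in the vocabulary. The following is a toy sketch of that greedy longest-match-first idea, not the tokenizer's actual code: `wordpiece` is a hypothetical helper and `toy_vocab` is a made-up miniature vocabulary. The real tokenizer also lowercases the text, splits off punctuation and adds `[CLS]` and `[SEP]`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (toy sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"spreads", "mis", "##in", "##form", "##ation", "[UNK]"}

print(wordpiece("misinformation", toy_vocab))
print(wordpiece("spreads", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Frequent words such as "spreads" are kept whole, while rarer words are built from smaller pieces, which keeps the vocabulary at a fixed size.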


We can inspect the vocabulary of the tokenizer.

In [16]:
vocab = tokenizer.get_vocab()
print(f"Vocabulary size: {len(vocab)}")
print(f"Example mapping: {list(vocab.items())[:5]}")

Vocabulary size: 30522
Example mapping: [('barbados', 16893), ('as', 2004), ('grandparents', 14472), ('##weight', 11179), ('cornice', 27848)]
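
The vocabulary is a plain token-to-id dictionary, so the first few items appear in no particular order. A small sketch of how such a mapping can be sorted by id and inverted to look up tokens from ids. `sample_vocab` is a hand-picked sample taken from the outputs above, not the full vocabulary:

```python
# A tiny sample of the token -> id mapping, taken from the outputs above.
sample_vocab = {
    "barbados": 16893,
    "as": 2004,
    "grandparents": 14472,
    "##weight": 11179,
    "cornice": 27848,
    "[CLS]": 101,
    "[SEP]": 102,
    "great": 2307,
}

# Sort the entries by id instead of dictionary order.
by_id = sorted(sample_vocab.items(), key=lambda pair: pair[1])

# Invert the mapping so that ids can be turned back into tokens.
id_to_token = {token_id: token for token, token_id in sample_vocab.items()}

print(by_id[:3])
print([id_to_token[i] for i in [101, 2307, 102]])
```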


## Inference: Feeding input to the model



In [17]:
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(model_name)
print(f"Model loaded: {model}")

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Model loaded: DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, i

Have a look at the printed model. What can you make out from the specifications?

The tokenized input is passed to the model as PyTorch tensors:

In [18]:
input_ids = torch.tensor([tokens['input_ids']])
attention_mask = torch.tensor([tokens['attention_mask']])

with torch.no_grad():  # avoids unnecessary gradient calculations
    outputs = model(input_ids, attention_mask=attention_mask)

logits = outputs.logits  # output layer numbers before softmax, i.e. not probabilities yet
print(f"Raw logits: {logits}")

Raw logits: tensor([[ 2.8285, -2.3769]])
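
With a single sentence the attention mask is all ones, so its purpose is not obvious. It matters once several sentences of different lengths are batched: shorter ones are padded, and the mask tells the model which positions are padding. A minimal sketch in plain Python, assuming right-side padding with the `[PAD]` id 0 shown in the tokenizer printout above; `pad_batch` is a hypothetical helper and the second, shorter sequence is made up from ids that appear in the tokenization output.

```python
PAD_ID = 0  # id of [PAD] for this tokenizer, as shown in the tokenizer printout


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad id sequences to equal length and build attention masks (sketch)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The sentence tokenized above, plus a shorter one: [CLS] great movie ! [SEP]
long_ids = [101, 2023, 2003, 1037, 2307, 3185, 999, 2021, 2009,
            20861, 28616, 2378, 14192, 3370, 1012, 102]
short_ids = [101, 2307, 3185, 999, 102]

batch = pad_batch([long_ids, short_ids])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

In practice the tokenizer can do this for you, e.g. by calling it on a list of texts with `padding=True` and `return_tensors="pt"`.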


## Post-processing

We apply the softmax function to the logits to convert them into probabilities for each sentiment class. The class with the highest probability is the predicted sentiment.

(This should sound familiar from the classification classes).

In [19]:
probabilities = torch.softmax(logits, dim=-1)
predicted_class_id = torch.argmax(probabilities).item()
predicted_label = model.config.id2label[predicted_class_id]

print(f"Probabilities: {probabilities}")
print(f"Predicted sentiment: {predicted_label}")

Probabilities: tensor([[0.9945, 0.0055]])
Predicted sentiment: NEGATIVE
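
To see what the softmax does, the same calculation can be written out by hand with only the standard library. This is a sketch for illustration using the logits printed above. The label for class 0 follows from the output above; the label for class 1 is an assumption.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.8285, -2.3769]                 # raw logits printed above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # class 1 label assumed

probs = softmax(logits)
predicted_id = max(range(len(probs)), key=lambda i: probs[i])  # argmax

print([round(p, 4) for p in probs])
print(id2label[predicted_id])
```

Because softmax preserves the ordering of the logits, the predicted class could also be read directly from the largest logit. The probabilities are only needed when you want a confidence score.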


### Questions
1. When you download a model from Hugging Face, progress bars appear for multiple files. What are the individual files? Have a look at them; the cell below prints the path where the files are stored.
2. Go to the top of the notebook, uncomment `transformers.logging.set_verbosity_info()` and re-run the notebook. What do you see? Does the logging information make sense given what you know about transformers by now?

In [None]:
# The model cache path is stored in an environment variable
import os
os.environ["HF_HOME"]

'/work/tf_cache'