# Use Transformer models directly
Each model page on the Hugging Face Hub provides example code showing how to use the model.
There are usually two ways to use Hugging Face models:
- using the model and its corresponding components directly
- using a pipeline

# Tokenization

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


In [5]:
input_text = "What is unhappiness?"
tokens = tokenizer.tokenize(input_text)  # returns a list of token strings
tokens

['what', 'is', 'un', '##ha', '##pp', '##iness', '?']

In [4]:
print(f"tokens = {tokens}")

tokens = ['what', 'is', 'un', '##ha', '##pp', '##iness', '?']


`AutoTokenizer` is enough for simple tokenization tasks. Alternatively, you can load the tokenizer class that belongs to a specific model, such as `BertTokenizer` for BERT.
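The `##` prefix in the output above marks a piece that continues the previous one. To illustrate how a word such as `unhappiness` can end up split this way, here is a minimal sketch of greedy longest-match-first subword splitting. The tiny vocabulary `TOY_VOCAB` and the helpers `split_word` and `toy_tokenize` are assumptions made for this example; the real tokenizer uses a far larger vocabulary and handles many more cases.

```python
import re

# Hypothetical toy vocabulary: just enough entries for this one sentence.
TOY_VOCAB = {"what", "is", "un", "##ha", "##pp", "##iness", "?"}


def split_word(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, separate words from punctuation, then split each word."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [piece for word in words for piece in split_word(word, vocab)]


print(toy_tokenize("What is unhappiness?"))
```

With this toy vocabulary the sketch reproduces the pieces shown above; with a different vocabulary the same word would be split differently.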

## Token Embedding
Some Hugging Face models also expose their token embedding and position embedding layers.

In [6]:
from transformers import BertTokenizer, BertModel

In [8]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")


In [9]:
input_text = '''
After a long day at work, Sarah decided to relax by taking her 
dog for a walk in the park. As they strolled along the 
tree-lined paths, Sarah's dog, Max, eagerly sniffed around, 
chasing after squirrels and birds. Sarah smiled as she watched 
Max enjoy himself, feeling grateful for the companionship and 
joy that her furry friend brought into her life.'''

In [11]:
import torch

In [22]:
tokens = tokenizer(input_text, return_tensors="pt")
print(tokens)

{'input_ids': tensor([[  101,  2044,  1037,  2146,  2154,  2012,  2147,  1010,  4532,  2787,
          2000,  9483,  2011,  2635,  2014,  3899,  2005,  1037,  3328,  1999,
          1996,  2380,  1012,  2004,  2027, 20354,  2247,  1996,  3392,  1011,
          7732, 10425,  1010,  4532,  1005,  1055,  3899,  1010,  4098,  1010,
         17858, 18013,  2105,  1010, 11777,  2044, 29384,  1998,  5055,  1012,
          4532,  3281,  2004,  2016,  3427,  4098,  5959,  2370,  1010,  3110,
          8794,  2005,  1996, 11946,  5605,  1998,  6569,  2008,  2014, 28662,
          2767,  2716,  2046,  2014,  2166,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

You may notice that this output differs from the earlier one. The difference does not come from the tokenizer class: earlier we called `tokenizer.tokenize()`, which returns a list of token strings, whereas calling the tokenizer directly (`tokenizer(...)`) returns a dictionary-like object with the keys `input_ids`, `token_type_ids`, and `attention_mask`. These fields are the inputs that `BertModel` expects.
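The role of the three fields is easiest to see when sequences of different lengths are batched together. The following is a minimal sketch in plain Python lists, not the library's implementation: the helper `pad_batch` is hypothetical, a padding id of 0 is assumed, and the token ids are copied from the printed output above. In practice the tokenizer builds these fields itself, for example when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to equal length and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        batch["token_type_ids"].append([0] * max_len)  # single-segment input
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)  # 0 = padding
    return batch


# Ids taken from the output above; 101 opens and 102 closes each sequence.
first = [101, 2044, 1037, 2146, 2154, 102]
second = [101, 4532, 3281, 102]

batch = pad_batch([first, second])
for key, rows in batch.items():
    print(key, rows)
```

The attention mask tells the model which positions hold real tokens, so the padded positions of the shorter sequence can be ignored.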

In [15]:
with torch.no_grad():
    outputs = model(**tokens)

In [16]:
print(outputs)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0052, -0.1343, -0.6812,  ...,  0.1802,  0.7480,  0.2806],
         [-0.3172, -0.3149,  0.1389,  ..., -0.0371,  0.2483,  0.1884],
         [-0.1847, -0.4520,  0.0046,  ...,  0.4691,  0.0633,  0.1069],
         ...,
         [ 0.3646, -0.0579, -0.0137,  ..., -0.1730,  0.1503,  0.5632],
         [ 0.4776, -0.0128, -0.3064,  ...,  0.3408, -0.4373, -0.5021],
         [-0.1040, -0.1149, -0.2204,  ...,  0.2434,  0.2328, -0.5389]]]), pooler_output=tensor([[-0.4566, -0.5551, -0.9874,  0.3900,  0.9564, -0.1897,  0.2703,  0.2485,
         -0.9233, -0.9998, -0.8332,  0.9942,  0.9664,  0.2101,  0.8222,  0.2288,
          0.1856, -0.1588,  0.0763,  0.9313,  0.5949,  1.0000, -0.4481,  0.2314,
          0.3737,  0.9994, -0.8431,  0.8627,  0.8781,  0.6570,  0.3893,  0.1902,
         -0.9928,  0.1486, -0.9738, -0.9796,  0.4188, -0.4229,  0.1664,  0.1094,
         -0.7968,  0.2180,  1.0000, -0.3576,  0.8162, -0.0428, -1.0000,  0.

`BertModel` has no task-specific head because it is the base model, so its output contains only `last_hidden_state` and `pooler_output`. Later sections use models that add a task-specific head on top.
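To make the idea of a task-specific layer (a "head") concrete, here is a minimal sketch of a classification head: one linear layer that maps a pooled vector to one score per label. The four-dimensional vector, the weights, and the helper `linear_head` are toy assumptions for illustration; a real head works on the model's full hidden size and uses learned weights.

```python
def linear_head(pooled, weights, bias):
    """One linear layer: logits[j] = sum_i pooled[i] * weights[j][i] + bias[j]."""
    return [
        sum(x * w for x, w in zip(pooled, row)) + b
        for row, b in zip(weights, bias)
    ]


pooled_output = [0.5, -1.0, 0.25, 2.0]  # toy stand-in for the pooled vector
weights = [
    [0.1, 0.2, -0.3, 0.0],   # row for label 0
    [-0.2, 0.1, 0.4, 0.5],   # row for label 1
]
bias = [0.0, 0.1]

logits = linear_head(pooled_output, weights, bias)
print(logits)
```

The base model stops before this step, which is why its output has no `logits` field.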

## Position Encoding

In [18]:
embeddings = model.embeddings



In [19]:
embeddings

BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [23]:
embeddings.position_embeddings

Embedding(512, 768)

In [24]:
position_embeddings = embeddings.position_embeddings.weight

In [33]:
input_texts_01 = 'The cat sat on the sofa'

tokens_01 = tokenizer(input_texts_01, return_tensors="pt")

In [27]:
position_ids = torch.arange(tokens_01['input_ids'].size(1), dtype=torch.long).unsqueeze(0)
position_ids

tensor([[0, 1, 2, 3, 4, 5, 6, 7]])

In [28]:
tokens_position_embeddings = position_embeddings[position_ids]
tokens_position_embeddings

tensor([[[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
           6.8312e-04,  1.5441e-02],
         [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
           2.9753e-02, -5.3247e-03],
         [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
           1.8741e-02, -7.3140e-03],
         ...,
         [-3.0871e-03, -1.8956e-02, -1.8930e-02,  ...,  7.4045e-03,
           2.0183e-02,  3.4077e-03],
         [ 6.4257e-03, -1.7664e-02, -2.2067e-02,  ...,  6.7531e-04,
           1.1108e-02,  3.7521e-03],
         [ 6.2613e-04, -1.6089e-02, -7.6365e-03,  ...,  5.3390e-03,
           1.5909e-02,  1.8119e-03]]], grad_fn=<IndexBackward0>)

In [29]:
tokens_position_embeddings.shape

torch.Size([1, 8, 768])
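The lookup above is plain table indexing: each position id selects one row of the position table. The `BertEmbeddings` printout also lists a word table, a token-type table, and a `LayerNorm`; in BERT the three looked-up vectors are added and then normalised. The sketch below shows that combination with toy sizes. All tables and helper names are assumptions for illustration, dropout is left out, and the learned scale and shift of the real `LayerNorm` are omitted.

```python
import math

HIDDEN = 4  # toy size; the printed module uses 768

# Hypothetical toy tables: one row per id, like Embedding(num_rows, HIDDEN).
word_table = [[0.1 * (i + j) for j in range(HIDDEN)] for i in range(10)]
position_table = [[0.01 * i * (j + 1) for j in range(HIDDEN)] for i in range(8)]
token_type_table = [[0.0] * HIDDEN, [1.0] * HIDDEN]


def layer_norm(vector, eps=1e-12):
    """Normalise one vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def embed(input_ids, token_type_ids):
    """Look up the three tables by index, add them, and normalise each position."""
    rows = []
    for position, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        summed = [
            w + p + t
            for w, p, t in zip(
                word_table[tok], position_table[position], token_type_table[seg]
            )
        ]
        rows.append(layer_norm(summed))
    return rows


embedded = embed(input_ids=[2, 5, 7], token_type_ids=[0, 0, 0])
print(len(embedded), len(embedded[0]))  # sequence length, hidden size
```

Because the position row depends only on the index, the same token receives a different input vector at different positions.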

In [37]:
for idx, (token_id, pos_emb) in enumerate(zip(tokens_01['input_ids'][0], tokens_position_embeddings[0])):
    token = tokenizer.decode([token_id])
    print(f"{token}: {pos_emb}")
    if idx == 2:  # show only the first three tokens
        break

[CLS]: tensor([ 1.7505e-02, -2.5631e-02, -3.6642e-02, -2.5286e-02,  7.9709e-03,
        -2.0358e-02, -3.7631e-03, -4.6880e-03,  6.2253e-03, -3.8342e-02,
         1.3103e-02, -3.7083e-03, -2.1014e-02,  1.1626e-02, -3.9546e-02,
         1.0155e-02,  1.8081e-03, -3.9818e-03,  1.6112e-02, -1.9327e-02,
        -3.1684e-02, -2.5482e-02,  3.2621e-04,  2.0337e-02, -1.6705e-02,
        -2.1000e-02, -7.8122e-03,  1.5647e-02, -6.3413e-03,  5.5291e-03,
        -1.5590e-02,  4.1118e-03, -1.8160e-02,  2.1867e-03,  7.1782e-03,
        -1.5383e-02, -3.2506e-03,  1.5954e-02,  1.8287e-02,  3.6061e-02,
        -3.9159e-03,  4.1934e-03, -9.5806e-03, -5.0352e-03, -4.4547e-03,
         2.0729e-03, -3.2415e-01,  3.4504e-03,  4.6929e-02, -2.1057e-02,
         5.6190e-02,  2.3602e-02, -2.3394e-02,  2.1003e-01,  3.3293e-02,
        -7.0262e-03, -9.4291e-03, -3.4105e-04,  2.4235e-02, -2.2936e-02,
         1.3023e-02,  6.9495e-03, -1.2559e-01, -8.3786e-03,  6.9158e-04,
        -9.6908e-03,  1.1022e-02, -1.6233e-0

***
# Using the model & corresponding components directly

Each model page on the Hugging Face Hub provides example code for using the model directly or through a pipeline.

Downloaded models are cached under `~/.cache/huggingface/hub/`.

In [1]:
!ls ~/.cache/huggingface/hub/

models--bert-base-uncased  models--huaen--question_detection  version.txt


## Use Models directly


The first step is to prepare the tokenizer. We use `AutoTokenizer` for this demo.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")


Next, we load the model with `AutoModelForSequenceClassification`, which adds a classification head for the classification task.

In [3]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")


In [4]:
!ls ~/.cache/huggingface/hub/

models--bert-base-uncased
models--distilbert--distilbert-base-uncased-finetuned-sst-2-english
models--huaen--question_detection
version.txt
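Judging from the two listings above, each cached model gets a folder named after its repository id, with `models--` in front and `/` replaced by `--`. The sketch below encodes that observation. The helpers `cache_dir_name` and `cache_path` are hypothetical, and the naming rule is inferred only from the listings shown here, so it may not hold for every library version.

```python
from pathlib import Path


def cache_dir_name(repo_id):
    """Turn 'org/name' into the 'models--org--name' folder name seen above."""
    return "models--" + repo_id.replace("/", "--")


def cache_path(repo_id, cache_root="~/.cache/huggingface/hub"):
    """Full path of the folder where the model is expected to be cached."""
    return Path(cache_root).expanduser() / cache_dir_name(repo_id)


print(cache_dir_name("distilbert/distilbert-base-uncased-finetuned-sst-2-english"))
print(cache_path("bert-base-uncased"))
```

Checking whether `cache_path(...)` exists is a quick way to see if a model has already been downloaded.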


After loading the tokenizer and the model, we tokenize the input text.

In [5]:
text = "I loved the movie, it was fantastic!"

tokens = tokenizer(text, return_tensors='pt')
print(tokens)

{'input_ids': tensor([[  101,  1045,  3866,  1996,  3185,  1010,  2009,  2001, 10392,   999,
           102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [6]:
outputs = model(**tokens)
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3428,  4.6955]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


The model returns raw scores (`logits`), one per class. To decide whether the text is positive or negative, we take the index of the larger value in `outputs.logits`.

In [7]:
import torch

# argmax over the class dimension; index 1 is treated as the positive class
predict_label = torch.argmax(outputs.logits, dim=-1).item()
print(f"Predicted sentiment: {'Positive' if predict_label == 1 else 'Negative'}")

Predicted sentiment: Positive
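The logits are unnormalised scores. Applying a softmax turns them into probabilities, which is roughly how a confidence score can be derived from them. The sketch below does this in plain Python with the logits copied from the output above; the helper `softmax` and the `labels` list are written for this example, and the label order (index 1 is positive) follows the assumption made in the cell above.

```python
import math


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.3428, 4.6955]  # copied from the printed output above
probs = softmax(logits)

labels = ["Negative", "Positive"]  # index 1 = positive, as in the cell above
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{labels[best]}: {probs[best]:.4f}")
```

The gap between the two logits is large, so almost all of the probability mass lands on one class.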


## Use Pipelines
Hugging Face provides the convenient, user-friendly `pipeline()` function, which hides the lower-level details (tokenization, model inference, and post-processing) from the developer. All you need to do is specify the task and the model.

In [8]:
from transformers import pipeline

model = pipeline(task='text-classification', model='distilbert/distilbert-base-uncased-finetuned-sst-2-english')

Device set to use cuda:0


In [9]:
review1 = '''From the warm welcome to the exquisite dishes and impeccable
 service, dining at Gourmet Haven is an unforgettable experience that
 leaves you eager to return.'''
 
review2 = '''Despite high expectations, our experience at Savor Bistro 
 fell short; the food was bland, service was slow, and the overall
 atmosphere lacked charm, leaving us disappointed and unlikely to
 revisit.'''
 

In [10]:
print(model([review1, review2]))

[{'label': 'POSITIVE', 'score': 0.9998437166213989}, {'label': 'NEGATIVE', 'score': 0.9997773766517639}]


### Use cases of pipelines
- Text Classifier
- Text Generation
- Text Summarization
- Text Translation
- Question Answering

### Text Classifier
Apart from classifying reviews as positive or negative, the text classification task can also be applied to other kinds of classification, e.g. detecting questions or detecting the language of a text.

#### Question Detection

In [11]:
ques_detector = pipeline(task='text-classification', model="huaen/question_detection")

Device set to use cpu


In [12]:

response = ques_detector("Have you ever pondered the mysteries that lie beneath the surface of everyday life?")
response

[{'label': 'question', 'score': 0.9975988268852234}]

In [13]:

response = ques_detector('"Life is a journey that must be traveled, no matter how bad the roads and accommodations." - Oliver Goldsmith"')
response

[{'label': 'non_question', 'score': 0.9996671676635742}]

#### Language Detection

In [14]:
lang_detector = pipeline(task='text-classification', model="papluca/xlm-roberta-base-language-detection")

Device set to use cpu


In [15]:
response = lang_detector("日本の桜は美しいです。")
response

[{'label': 'ja', 'score': 0.9913387298583984}]

#### Spam Classification

In [16]:
spam_classifier = pipeline(task='text-classification', model="Delphia/twitter-spam-classifier")

Device set to use cpu


In [17]:
text = """
Congratulations! You've been selected as the winner of our 
exclusive prize draw. Claim your reward now by clicking on 
the link below!
"""

response = spam_classifier(text)
response

[{'label': 1, 'score': 0.744691789150238}]

In [18]:
text = """
Hi Jimmy, I hope you're doing well. I just wanted to remind 
you about our meeting tomorrow at 10 AM in conference room A. 
Please let me know if you have any questions or need any 
further information. Looking forward to seeing you there!
"""

response = spam_classifier(text)
response

[{'label': 0, 'score': 0.7776529788970947}]
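This model returns numeric labels rather than names. Based only on the two examples above, label 1 appears to mean spam and label 0 not spam; that mapping is an assumption, not something stated by the model output. The sketch below attaches readable names and flags predictions whose score falls under a threshold. The helper `describe` and the `min_score` value are hypothetical choices for illustration.

```python
# Assumed meaning of the numeric labels, inferred from the two examples above.
LABEL_NAMES = {0: "not spam", 1: "spam"}


def describe(prediction, min_score=0.7):
    """Map a numeric label to a name and flag low-confidence predictions."""
    name = LABEL_NAMES.get(prediction["label"], "unknown")
    status = "confident" if prediction["score"] >= min_score else "uncertain"
    return f"{name} ({prediction['score']:.2f}, {status})"


# Predictions copied from the outputs above.
print(describe({"label": 1, "score": 0.744691789150238}))
print(describe({"label": 0, "score": 0.7776529788970947}))
```

Both scores here are well below the values seen for the sentiment and question models, so a threshold of this kind can help decide when a prediction deserves a second look.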

### Text Generation

In [19]:
generator = pipeline(task='text-generation', model='openai-community/gpt2')

Device set to use cpu


In [20]:
begin = "In this course, we will teach you how to"

responses = generator(begin, max_length=50, num_return_sequences=3)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [21]:
len(responses)

3

In [22]:
print(responses[0])

{'generated_text': 'In this course, we will teach you how to create custom HTML5 video embeddable HTML5 files. It will be useful in order to understand basic HTML, HTML6, CSS, and Media Composer coding concepts and how to develop a complex'}


In [23]:
print(responses[1])

{'generated_text': "In this course, we will teach you how to write a complete set of rules and code, using some examples and examples from real life. In practice, you should look at a set of basic concepts that you'll be able to write about in this"}


In [24]:
print(responses[2])

{'generated_text': 'In this course, we will teach you how to make a powerful web site with one simple technique.\n\nWe will also learn how to create an index, and a simple dashboard.\n\nOur goal is to help you quickly get your ideas.'}
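The three continuations above differ from one another, which is what you would expect when the next token is sampled from the model's probabilities instead of always taking the most likely one. The toy sketch below contrasts the two strategies. `TOY_MODEL` is a hypothetical table of next-token probabilities and `generate` is a helper written for this example; neither reflects how the pipeline is implemented internally.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<start>": {"write": 0.5, "build": 0.3, "create": 0.2},
    "write": {"code": 0.6, "tests": 0.4},
    "build": {"apps": 0.7, "tools": 0.3},
    "create": {"sites": 0.5, "videos": 0.5},
}


def generate(steps=2, do_sample=False, rng=None):
    """Produce up to `steps` tokens, greedily or by sampling."""
    rng = rng or random.Random()
    token, output = "<start>", []
    for _ in range(steps):
        choices = TOY_MODEL.get(token)
        if choices is None:  # no known continuation for this token
            break
        if do_sample:
            token = rng.choices(list(choices), weights=list(choices.values()))[0]
        else:
            token = max(choices, key=choices.get)  # most likely next token
        output.append(token)
    return output


print("greedy:", generate())
rng = random.Random(0)
for _ in range(3):
    print("sampled:", generate(do_sample=True, rng=rng))
```

Greedy decoding always returns the same sequence, so asking for several different sequences only makes sense with sampling or a similar strategy.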


### Text Summarization

In [25]:
with open('text_summarisation_usage.txt', 'r', encoding='utf-8') as f:
    content = f.read()

In [26]:
summarizer = pipeline(task='summarization', model='facebook/bart-large-cnn')

Device set to use cpu


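**Extractive Summarization**: selects and extracts important sentences or phrases directly from the original text. Note that `facebook/bart-large-cnn` is a sequence-to-sequence model that generates its summary token by token, so it is not strictly extractive, even though its output often stays close to the source wording. The call below uses deterministic decoding (`do_sample=False`).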

In [27]:
response = summarizer(content, min_length=100, max_length=250, do_sample=False)
print(response)

[{'summary_text': 'A quantum computer is a computer that exploits quantum mechanical phenomena. Classical physics cannot explain the operation of these quantum devices. A large-scale quantum computer could break widely used encryption. The current state of the art is still largely experimental and impractical. The basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics. The design of quantum algorithms involves creating procedures that allow a quantum computer to perform calculations efficiently. The study of the computational complexity of problems with respect to quantum computers is known asquantum complexity theory.'}]


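**Abstractive Summarization**: generates a summary by paraphrasing the original text in a more concise form. Setting `do_sample=True` does not change the type of summarization; it makes decoding sample tokens randomly instead of deterministically, so the result can differ from run to run.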

In [28]:
response = summarizer(content, min_length=100, max_length=250, do_sample=True)
print(response)

[{'summary_text': 'A quantum computer is a computer that exploits quantum mechanical phenomena. Classical physics cannot explain the operation of these quantum devices. A large-scale quantum computer could break widely used encryption. The basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics. The design of quantum algorithms involves creating procedures that allow a quantum computer to perform calculations efficiently. Any computational problem that can be solved by a classical computer can also be solved  by a quantum computers, at least in principle.'}]
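For contrast with the generated summaries above, the sketch below shows what a purely extractive approach looks like: it scores each sentence by how frequent its words are in the whole text and keeps the top sentences unchanged, in their original order. This is a deliberately crude illustration. The helper `extractive_summary`, the word-length filter, and the sample text are assumptions for this example and are unrelated to the model used above.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)  # crude stop-word filter

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


# Hypothetical sample text written for this example.
sample = (
    "Quantum computers use qubits. Qubits can hold superpositions of states. "
    "The weather was nice yesterday. Building large quantum computers remains difficult."
)
print(extractive_summary(sample))
```

Every sentence in the result appears verbatim in the input, which is the defining property of extractive summarization.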


### Text Translation

In [29]:
translator = pipeline(task="translation_en_to_fr", model="google-t5/t5-base")

Device set to use cpu


In [30]:
en_text="Wikipedia is hosted by the Wikimedia Foundation, a non-profit organization that also hosts a range of other projects."

In [31]:
response = translator(en_text)
print(response)

[{'translation_text': "Wikipedia est hébergée par la Wikimedia Foundation, un organisme sans but lucratif qui héberge également une série d'autres projets."}]
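The task name `translation_en_to_fr` encodes the source and target languages. T5 models are trained with a short text prefix that names the task, and the pipeline is understood to add that prefix for you. The sketch below illustrates the idea by parsing the task name and building such a prompt. The helper `build_t5_prompt` and the small `LANGUAGE_NAMES` table are assumptions for illustration, not the library's code.

```python
import re

# Illustrative subset of language codes; extend as needed.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German"}


def build_t5_prompt(task, text):
    """Turn 'translation_xx_to_yy' plus a text into a prefixed prompt."""
    match = re.fullmatch(r"translation_([a-z]+)_to_([a-z]+)", task)
    if match is None:
        raise ValueError(f"not a translation task name: {task!r}")
    source, target = (LANGUAGE_NAMES[code] for code in match.groups())
    return f"translate {source} to {target}: {text}"


print(build_t5_prompt("translation_en_to_fr", "Wikipedia is hosted by the Wikimedia Foundation."))
```

When calling the model directly instead of through the pipeline, a prompt of this shape is what you would pass to the tokenizer.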


### Question Answering

In [32]:
qa_model = pipeline(task='question-answering', model='deepset/roberta-base-squad2')

Device set to use cpu


In [33]:
with open('question_answering.txt', 'r', encoding='utf-8') as f:
    context = f.read()

In [34]:
question = {
    'question': 'What is the meaning of Singapura?',
    'context': context
}

response = qa_model(question)
print(response)



{'score': 0.13809449970722198, 'start': 185, 'end': 194, 'answer': 'lion city'}
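In the result above, `start` and `end` are character offsets into `context`, so slicing the context with them gives back the answer (here `end - start` is 9, the length of `'lion city'`). The sketch below shows that relationship on a made-up context string; the helper `find_answer` and the sample sentence are assumptions for illustration, since the contents of `question_answering.txt` are not shown.

```python
def find_answer(context, answer):
    """Locate an answer span and report it in the same shape as the output above."""
    start = context.find(answer)
    if start == -1:
        return None  # the answer text does not occur in the context
    end = start + len(answer)
    return {"start": start, "end": end, "answer": context[start:end]}


# Hypothetical context written for this example.
context = "The name Singapura is usually translated as lion city."
result = find_answer(context, "lion city")
print(result)
print(context[result["start"]:result["end"]])
```

Because the answer is a slice of the context, this kind of model can only return text that literally appears in the passage it is given.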
