# **Transformer and Transformer-Based Models (Part 2)**

In this Python notebook, we will play with the transformer-based models provided by the **transformers** library for several natural language processing (NLP) tasks.

In [26]:
import torch
from torch.nn.functional import cosine_similarity

***

## **1. Play with Transformer-Based Models**

### 1.1 ~ Installation

First, we install the *transformers* package with the following command:
```
pip install transformers
```

After it is done, we can load some pretrained BERT models and tokenizers like this (ignore warnings):

In [27]:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

### 1.2 ~ Tokenizing Inputs

Next, we will tokenize the following example text:

In [28]:
text = """The hotness of the sun and the coldness of the outer space are inexhaustible thermodynamic
resources for human beings. From a thermodynamic point of view, any energy conversion systems
that receive energy from the sun and/or dissipate energy to the universe are heat engines with
photons as the "working fluid" and can be analyzed using the concept of entropy. While entropy
analysis provides a particularly convenient way to understand the efficiency limits, it is typically
taught in the context of thermodynamic cycles among quasi-equilibrium states and its
generalization to solar energy conversion systems running in a continuous and non-equilibrium
fashion is not straightforward. In this educational article, we present a few examples to illustrate
how the concept of photon entropy, combined with the radiative transfer equation, can be used to
analyze the local entropy generation processes and the efficiency limits of different solar energy
conversion systems. We provide explicit calculations for the local and total entropy generation
rates for simple emitters and absorbers, as well as photovoltaic cells, which can be readily
reproduced by students. We further discuss the connection between the entropy generation and the
device efficiency, particularly the exact spectral matching condition that is shared by infinitejunction photovoltaic cells and reversible thermoelectric materials to approach their theoretical
efficiency limit."""

encoded_input = tokenizer.encode_plus(text, return_tensors='pt')

print(len(text.split()))
print(encoded_input['input_ids'].shape)

211
torch.Size([1, 275])


Why does `encoded_input` have more elements than the number of words in `text`?

1. Subword Tokenization: BERT uses a subword tokenization algorithm (WordPiece), so a word that is not in the vocabulary is broken down into smaller pieces that are (for example, "hotness" becomes `hot` and `##ness`). This lets the model handle rare or unfamiliar words with a fixed-size vocabulary.
2. Special Tokens: BERT requires certain special tokens, including:
    a. [CLS]: Added at the beginning of every input sequence. Its output representation is commonly used as a summary of the entire sequence.
    b. [SEP]: Marks the end of a sentence or the separation between two sentences.
    c. [PAD]: Pads sequences in a batch to a uniform length. It is not added here, because we encode a single sequence.
3. Word vs. Token Count: `len(text.split())` counts the words separated by whitespace in the input text, whereas the BERT tokenizer counts tokens, which include subwords, punctuation marks, and special tokens. The number of tokens (275) is therefore higher than the number of whitespace-separated words (211).
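To make the first two points concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece. The tiny vocabulary `toy_vocab` and the helpers `wordpiece` and `toy_encode` are hypothetical and exist only for illustration; the real tokenizer uses a much larger vocabulary and also handles punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_encode(text, vocab):
    """Lowercase, split on whitespace, apply wordpiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary (assumption, not BERT's real vocabulary).
toy_vocab = {"the", "hot", "cold", "##ness", "of", "sun"}
toy_text = "The hotness of the sun"
toy_tokens = toy_encode(toy_text, toy_vocab)

print(len(toy_text.split()), "words ->", len(toy_tokens), "tokens")
print(toy_tokens)
```

Even in this small case the token count exceeds the word count, because one word is split in two and two special tokens are added.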

### 1.3 ~ Output Word Vectors from BERT

In [30]:
output = model(**encoded_input)

last_hidden_state = output[0]

print(last_hidden_state.shape)

torch.Size([1, 275, 768])


With the following code, we can find the corresponding token of each integer id in `input_ids`.

In [31]:
input_ids_pt = encoded_input['input_ids']
input_ids_list = input_ids_pt.tolist()[0]
input_tokens = tokenizer.convert_ids_to_tokens(input_ids_list)

print(input_ids_list[:10])
print(input_tokens[:10])

[101, 1996, 2980, 2791, 1997, 1996, 3103, 1998, 1996, 3147]
['[CLS]', 'the', 'hot', '##ness', 'of', 'the', 'sun', 'and', 'the', 'cold']


Can we find the output vector**s** in `last_hidden_state` that correspond to the input word "entropy"?
Do they have the same values?

We can use an `if` statement to check whether the current token is the word "entropy" and, if so, append its output vector to `vectors`.

In [32]:
vectors = []
for i, token in enumerate(input_tokens):
    if token == "entropy":
        vector = last_hidden_state[0, i]
        vectors.append(vector)
print('Number of "entropy":', len(vectors))

matches = [torch.allclose(vectors[i], vectors[i+1]) for i in range(len(vectors)-1)]
print(f'Do they have the same value? {matches}')

Number of "entropy": 6
Do they have the same value? [False, False, False, False, False]


### 1.4 ~ Sentence Vectors from BERT

We can obtain the output vectors for a batch of sentences.

First, we need to break the text into a list of sentences, using the simple end-of-sentence string '.' as a separator.

In [33]:
sentences = text.replace('\n', ' ').split('.')
sentences = [s.strip() + '.' for s in sentences if len(s.strip())>0]

print(f'Resulting in {len(sentences)} sentences:')
print(sentences)

Resulting in 6 sentences:
['The hotness of the sun and the coldness of the outer space are inexhaustible thermodynamic resources for human beings.', 'From a thermodynamic point of view, any energy conversion systems that receive energy from the sun and/or dissipate energy to the universe are heat engines with photons as the "working fluid" and can be analyzed using the concept of entropy.', 'While entropy analysis provides a particularly convenient way to understand the efficiency limits, it is typically taught in the context of thermodynamic cycles among quasi-equilibrium states and its generalization to solar energy conversion systems running in a continuous and non-equilibrium fashion is not straightforward.', 'In this educational article, we present a few examples to illustrate how the concept of photon entropy, combined with the radiative transfer equation, can be used to analyze the local entropy generation processes and the efficiency limits of different solar energy conversion 
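Splitting on every '.' works for this abstract, but it would also cut inside decimals such as "0.93". As a hedged alternative, the sketch below splits only where sentence-ending punctuation is followed by whitespace. The helper `split_sentences` and the string `sample` are hypothetical, and this simple rule would still split after abbreviations such as "e.g.".

```python
import re


def split_sentences(raw_text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    flat = " ".join(raw_text.split())  # collapse newlines and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]


# Hypothetical sample text containing a decimal number.
sample = "The efficiency is 0.93 at best.\nIs that the limit? It depends."
for sentence in split_sentences(sample):
    print(sentence)
```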

Now, let's run the tokenizer on this batch of sentences:

In [34]:
encoded_sentences = tokenizer.batch_encode_plus(sentences, padding=True, return_tensors='pt')

print(encoded_sentences['input_ids'].shape)
print(encoded_sentences['input_ids'][0,:])

torch.Size([6, 57])
tensor([  101,  1996,  2980,  2791,  1997,  1996,  3103,  1998,  1996,  3147,
         2791,  1997,  1996,  6058,  2686,  2024,  1999, 10288, 13821,  3775,
         3468,  1996, 10867,  7716, 18279,  7712,  4219,  2005,  2529,  9552,
         1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])


We can see that shorter sentences are padded with the special id `0`, which is the id of the `[PAD]` token, so that every row in the batch has the same length (57).
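The padding step can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the ids are taken from the token printout earlier in this notebook (101 = `[CLS]`, 102 = `[SEP]`, 1996 = "the", 3103 = "sun", and so on). The sketch also builds an attention mask, which marks real tokens with 1 and padding with 0 so the model can ignore the padded positions.

```python
PAD_ID = 0  # id of [PAD] in the printout above


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [101, 1996, 3103, 102],                          # "the sun"
    [101, 1996, 2980, 2791, 1997, 1996, 3103, 102],  # "the hotness of the sun"
]
padded_ids, mask = pad_batch(batch)

for row, m in zip(padded_ids, mask):
    print(row, m)
```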

Next, we can obtain the output tensors for all input sentences, also in a batch.

In [35]:
outputs = model(**encoded_sentences)

print(outputs[0].shape)

torch.Size([6, 57, 768])


Note that the first dimension of `outputs['last_hidden_state']` is batch size.

So the output tensor for the 1st sentence is `outputs['last_hidden_state'][0]`, and so on.

In [36]:
print(outputs[0][0].shape)

torch.Size([57, 768])


For each output tensor, the first 768-dim vector (at position 0) always corresponds to the special input token `[CLS]`. 

We can use this vector to represent the meaning of the whole sentence.

In [37]:
CLS_vec = outputs[0][0][0]
print(CLS_vec.shape)

torch.Size([768])


Now, we compute the cosine similarity between each pair of the 6 sentences and find the pair with the closest meaning.

We can use the `cosine_similarity()` function imported at the beginning, which takes two tensors as input and returns the similarity score as a tensor.

***Note***: when calling `cosine_similarity()`, remember to specify `dim=0`, because each `[CLS]` vector is one-dimensional (shape `[768]`) and the function's default is `dim=1`. Also append `.item()` to retrieve a Python number from the returned tensor.
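For reference, the quantity being computed is the dot product of the two vectors divided by the product of their lengths. Below is a minimal pure-Python sketch of that formula. The helper `cosine` is hypothetical and skips the small epsilon that the PyTorch function uses for numerical stability, so it assumes neither vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))    # same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))    # orthogonal
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The score depends only on the angle between the vectors, not on their lengths, which is why the first pair scores as high as two identical vectors would.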

In [38]:
for i in range(5):
    for j in range(i+1, 6):
        vec_i = outputs["last_hidden_state"][i][0]
        vec_j = outputs["last_hidden_state"][j][0]
        sim = cosine_similarity(vec_i, vec_j, dim=0).item()
        print(f'{i} <-> {j}: {sim}')

0 <-> 1: 0.8591641187667847
0 <-> 2: 0.777198314666748
0 <-> 3: 0.7985227108001709
0 <-> 4: 0.7754687666893005
0 <-> 5: 0.8052165508270264
1 <-> 2: 0.876341700553894
1 <-> 3: 0.832162082195282
1 <-> 4: 0.8238449096679688
1 <-> 5: 0.8492753505706787
2 <-> 3: 0.8241375088691711
2 <-> 4: 0.8598625659942627
2 <-> 5: 0.8579832911491394
3 <-> 4: 0.9018083214759827
3 <-> 5: 0.9291440844535828
4 <-> 5: 0.9185266494750977
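Rather than scanning the printout by eye, the most similar pair can be picked programmatically. The sketch below hard-codes the scores printed above (rounded to four decimals) in a hypothetical dictionary `scores`; in the notebook, one could instead fill such a dictionary inside the loop.

```python
# Scores copied from the printout above, rounded to four decimals.
scores = {
    (0, 1): 0.8592, (0, 2): 0.7772, (0, 3): 0.7985, (0, 4): 0.7755,
    (0, 5): 0.8052, (1, 2): 0.8763, (1, 3): 0.8322, (1, 4): 0.8238,
    (1, 5): 0.8493, (2, 3): 0.8241, (2, 4): 0.8599, (2, 5): 0.8580,
    (3, 4): 0.9018, (3, 5): 0.9291, (4, 5): 0.9185,
}

# max() over the keys, ranked by their score, returns the best pair.
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```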


The highest score is between sentences 3 and 5 (about 0.929). We can print out these two sentences to see if the similarity score makes sense.

In [39]:
print(sentences[3])
print(sentences[5])

In this educational article, we present a few examples to illustrate how the concept of photon entropy, combined with the radiative transfer equation, can be used to analyze the local entropy generation processes and the efficiency limits of different solar energy conversion systems.
We further discuss the connection between the entropy generation and the device efficiency, particularly the exact spectral matching condition that is shared by infinitejunction photovoltaic cells and reversible thermoelectric materials to approach their theoretical efficiency limit.


### 1.5 ~ Play with Summarization

Lastly, let's play with the summarization pipeline provided by transformers. Be patient while the model is downloading.

We can try the following code with different input text or arguments.

In [40]:
from transformers import pipeline

summarizer = pipeline("summarization")

print(summarizer(text, max_length=150, min_length=30))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The hotness of the sun and the coldness of outer space are inexhaustible thermodynamic resources for human beings . From a thermodynamic point of view, any energy conversion systems that receive energy from the sun or dissipate energy to the universe are heat engines with photons as the "working fluid"'}]
