# Transformers and language tasks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AIForVet/aiml/blob/main/08-transformers_and_language_tasks.ipynb)

In this notebook, you can try out how transformers are used in natural language processing tasks such as sentiment analysis and summarization. The examples use English text because the libraries and pre-trained models are primarily adapted to this language.

At the very beginning, we will load the standard libraries that are necessary for further work.

In [68]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

For the tasks we plan to cover, we will need the `transformers` library, which integrates various types of transformers and tools that make working with them more convenient. To use this library in the Google Colab environment, install it with the command `%pip install transformers` (or `!pip install transformers`) and then load it with the command `import transformers`. We will primarily use the `pipeline` function of this library, but we will also discuss some of its other capabilities.

In [69]:
%pip install transformers

Note: you may need to restart the kernel to use updated packages.





In [70]:
import transformers

## Sentiment Analysis Task

Sentiment analysis is the task of recognizing emotions or attitudes expressed in a text. The recognition is much more basic than what humans do, but it plays an important role in understanding user-generated content such as comments or reviews. The most common variant distinguishes positive from negative content, where positive content expresses praise and approval, and negative content expresses criticism and complaints. From the perspective of machine learning, this variant is a binary classification task: once the textual inputs are converted into suitable numerical representations, any classification algorithm can be applied.

The following code block will allow us to create the `analize_sentiment` function, which combines the steps of creating text representations and then running a pre-trained sentiment analysis classifier. To create it, we will use the `pipeline` function and specifically indicate with the `task` argument that we want to perform sentiment analysis.

In [71]:
%pip install tf-keras

analize_sentiment = transformers.pipeline(task='sentiment-analysis')



No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.




All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
Device set to use 0


We can now pass the function any text for which we want a sentiment prediction. The output contains the class name (`label`), which is either `POSITIVE` or `NEGATIVE`, and a `score` value between 0 and 1 that indicates how confident the classification model is in its decision.

Here are a few examples.

In [72]:
analize_sentiment("We are very excited to learn more on sentiment analysis!")

[{'label': 'POSITIVE', 'score': 0.9994511008262634}]

In [73]:
analize_sentiment("We didn't like the food. It was too salty.")

[{'label': 'NEGATIVE', 'score': 0.9992080330848694}]

In [74]:
analize_sentiment("The movie was super interesting, but the end was quite boring.")

[{'label': 'NEGATIVE', 'score': 0.9983773231506348}]

While the emotion of excitement or dislike was quite clearly expressed in the first two sentences we tested, in the third sentence we have an interesting mix. Would you agree with the rating given by the classifier?

You can continue to test this functionality by checking how adjectives like *amazing*, *wonderful*, *boring*, *annoying* and their combinations affect the classifier's decisions. You can also check how the classifier behaves when negation is present in the sentence, for example, when you say something is *not great*.
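
To make such experiments easier to compare, the checks suggested above can be collected into a small helper. This is only a sketch: the name `compare_sentiment` and the test sentences are hypothetical, and the helper assumes the classifier accepts a list of strings and returns one dictionary with `label` and `score` keys per input, as the `pipeline` object created earlier does.

```python
def compare_sentiment(classifier, sentences):
    """Run a classifier over several sentences and collect (sentence, label, score) rows."""
    sentences = list(sentences)
    results = classifier(sentences)  # one {'label': ..., 'score': ...} dict per sentence
    return [
        (sentence, result["label"], round(result["score"], 4))
        for sentence, result in zip(sentences, results)
    ]


# Hypothetical test sentences mixing adjectives and negation
test_sentences = [
    "The lecture was amazing.",
    "The lecture was boring.",
    "The lecture was not great.",
    "The lecture was amazing, but the exercises were annoying.",
]

# Usage in the notebook, once `analize_sentiment` has been created:
# for row in compare_sentiment(analize_sentiment, test_sentences):
#     print(row)
```

Passing the classifier as an argument keeps the helper independent of a particular model, so the same sentences can later be run through a different sentiment model.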

## Summarization Task

Using a similar principle as before, we can also try the summarization task. For a given text, the summarization task involves generating summaries, shorter textual forms that contain important and relevant information from the original text. Since expectations from summaries can vary from user to user (the perception of what is important and relevant depends on many factors), it is possible to create multiple different summaries for a given text.

The following code block will allow us to create the `generate_summary` function, which combines the steps of creating representations of the given text and then running a pre-trained model for generating summaries. We will again use the `pipeline` function and indicate with the `task` argument that we want to perform summarization.

In [75]:
generate_summary = transformers.pipeline(task='summarization')

No model was supplied, defaulted to google-t5/t5-small and revision df1b051 (https://huggingface.co/google-t5/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Device set to use 0


To test the loaded functionality, we will use a paragraph from an article about Nikola Tesla taken from Wikipedia.

In [76]:
input_text = """
In 1884, Edison manager Charles Batchelor, who had been overseeing the Paris installation,
was brought back to the United States to manage the Edison Machine Works,
a manufacturing division situated in New York City, and asked that Tesla be brought to the United States as well.
In June 1884, Tesla emigrated and began working almost immediately at the Machine Works on Manhattan's Lower East Side,
an overcrowded shop with a workforce of several hundred machinists, laborers, managing staff, and 20 "field engineers"
struggling with the task of building the large electric utility in that city. As in Paris, Tesla was working on
troubleshooting installations and improving generators. Historian W. Bernard Carlson notes Tesla may have met
company founder Thomas Edison only a couple of times. One of those times was noted in Tesla's autobiography where,
after staying up all night repairing the damaged dynamos on the ocean liner SS Oregon, he ran into Batchelor and Edison,
who made a quip about their "Parisian" being out all night. After Tesla told them he had been up all night fixing the Oregon,
Edison commented to Batchelor that "this is a damned good man". One of the projects given to Tesla was to develop an
arc lamp-based street lighting system. Arc lighting was the most popular type of street lighting but it required
high voltages and was incompatible with the Edison low-voltage incandescent system, causing the company to lose
contracts in some cities. Tesla's designs were never put into production, possibly because of technical improvements
in incandescent street lighting or because of an installation deal that Edison made with an arc lighting company.
"""

When calling the function, we will generate exactly one summary whose length we will limit to a maximum of 150 "words" (more precisely, 150 tokens, since the model counts length in tokens rather than words). We can read the content of the summary using the `summary_text` key of the first result.

In [77]:
summary = generate_summary(input_text, max_length=150)

In [78]:
summary[0]['summary_text']

'in 1884, Edison manager Charles Batchelor was brought back to the united states to manage the Edison machine Works . Tesla was working on troubleshooting installations and improving generators . the company may have met Thomas Edison only a couple of times .'

Try some more inputs and summary length settings. We will talk more about the models used later, so you can also use another model.
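
One way to try several length settings at once is sketched below. The helper name `summaries_by_length` and the chosen lengths are hypothetical. The sketch assumes the summarizer accepts the `max_length`, `min_length`, and `do_sample` keyword arguments and returns a list of dictionaries with a `summary_text` key, as the summarization pipeline used above does. Lengths are counted in tokens, not words.

```python
def summaries_by_length(summarizer, text, max_lengths, min_length=10):
    """Generate one summary per maximum length and return them keyed by that length."""
    summaries = {}
    for max_length in max_lengths:
        output = summarizer(
            text,
            max_length=max_length,  # upper bound on summary length, in tokens
            min_length=min_length,  # lower bound, must stay below max_length
            do_sample=False,        # deterministic decoding, so runs are comparable
        )
        summaries[max_length] = output[0]["summary_text"]
    return summaries


# Usage in the notebook, once `generate_summary` and `input_text` exist:
# for length, text in summaries_by_length(generate_summary, input_text, [30, 60, 150]).items():
#     print(length, "->", text)
```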

## Text Representation

Both previous pipelines prepared appropriate textual representations for us and called the corresponding models. If we go back and look at the messages we received when calling the `pipeline` function, we will notice that the `distilbert/distilbert-base-uncased-finetuned-sst-2-english` model was used for the sentiment analysis task, while the `google-t5/t5-small` model was used for the summarization task. We could have explicitly specified these names when calling the `pipeline` function through the `model` argument. Other models for these tasks are also available through the library.
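
As a sketch of what specifying the model explicitly could look like, the snippet below passes the model names printed in the log messages above through the `model` argument. The function name `build_pipelines` is hypothetical, and the models are only downloaded when the function is actually called.

```python
# Model names copied from the messages printed when the pipelines were created
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SUMMARIZATION_MODEL = "google-t5/t5-small"


def build_pipelines():
    """Create both pipelines with explicitly named models (downloads them on first use)."""
    import transformers  # imported here so that defining the function stays cheap

    sentiment = transformers.pipeline(task="sentiment-analysis", model=SENTIMENT_MODEL)
    summarizer = transformers.pipeline(task="summarization", model=SUMMARIZATION_MODEL)
    return sentiment, summarizer


# Usage in the notebook:
# analize_sentiment, generate_summary = build_pipelines()
```

Naming the model also removes the warning that using a pipeline without a model name is not recommended in production.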

By convention, each model is paired with its own tokenizer. The tokenizer is a tool that converts the input into a form that the model understands. To demonstrate how this works and learn more about the text preparation process, we will directly load the `distilbert-base-uncased-finetuned-sst-2-english` model. To load the model, we will use the `from_pretrained` method of the `AutoModelForSequenceClassification` class, while to load the tokenizer, we will use the same method of the `AutoTokenizer` class.

In [79]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

In [82]:
%pip install transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification



In [85]:
%pip install torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)





In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

The transformer we selected uses subword tokenization. Subword tokens are extracted by processing a large amount of text and are chosen so that their combinations can reconstruct most of the text. This is not done manually; instead, a dedicated algorithm builds the set of subwords. All these extracted subwords (also called tokens) make up the vocabulary. We will now see how many tokens are in the vocabulary of this transformer, and then we will print some of them. When printing, you will see that some tokens start with the characters `##`. This prefix indicates that the token is not a whole word but a subword that continues the preceding piece of a word. This allows us to represent words in the text that were not seen when the subwords were selected and generated.
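
To illustrate how words can be split into such pieces, here is a minimal sketch of a greedy longest-match-first split of the kind used by WordPiece-style tokenizers. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical and chosen only for illustration; the real tokenizer also handles lowercasing, punctuation, and other details.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption, not the vocabulary of the real model)
toy_vocab = {"token", "##ization", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```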

To determine the number of tokens in the vocabulary, we will use the `tokenizer.vocab_size` property.

In [None]:
print('Number of tokens in the vocabulary: ', tokenizer.vocab_size)

Number of tokens in the vocabulary:  30522


To extract some tokens from the vocabulary, we will use the `tokenizer.vocab` property.

In [None]:
number_of_tokens_to_display = 50
# tokenizer.vocab maps token -> ID; invert it to get ID -> token
vocabulary = {token_id: token for token, token_id in tokenizer.vocab.items()}

# Print the first `number_of_tokens_to_display` entries of the vocabulary
for index, (token_id, token) in enumerate(vocabulary.items()):
  if index == number_of_tokens_to_display:
    break
  print("ID: {id} \t token: {token}".format(id=token_id, token=token))

ID: 17337 	 token: stationary
ID: 5771 	 token: flesh
ID: 26409 	 token: 44th
ID: 28999 	 token: securely
ID: 888 	 token: [unused883]
ID: 21723 	 token: ##ount
ID: 478 	 token: [unused473]
ID: 19096 	 token: 1760
ID: 8529 	 token: um
ID: 23782 	 token: ##ading
ID: 17910 	 token: foil
ID: 16030 	 token: thorough
ID: 12612 	 token: inventory
ID: 11994 	 token: knox
ID: 10353 	 token: sleeve
ID: 23044 	 token: slaughtered
ID: 27212 	 token: earrings
ID: 3533 	 token: joe
ID: 27961 	 token: impetus
ID: 7014 	 token: beth
ID: 30311 	 token: ##南
ID: 5920 	 token: suicide
ID: 23844 	 token: mortals
ID: 8202 	 token: ##mon
ID: 28311 	 token: ##57
ID: 2885 	 token: europe
ID: 28021 	 token: karel
ID: 9450 	 token: instructor
ID: 23397 	 token: emptiness
ID: 13688 	 token: hernandez
ID: 11159 	 token: slapped
ID: 30150 	 token: ##♠
ID: 23098 	 token: wavy
ID: 28051 	 token: humanoid
ID: 725 	 token: [unused720]
ID: 21899 	 token: huron
ID: 22881 	 token: ##trick
ID: 618 	 token: [unused613]
ID:

Each token in the vocabulary has a unique identifier. We used these identifiers to fetch a certain number of tokens from the vocabulary. This further means that each input must first be split into pieces of words, and then each piece is assigned the corresponding identifier. In this way, we arrive at the numerical representation of the text that the model needs.

The tokenizer itself helps us in preparing representations when working with the `transformers` library.

In [None]:
input_txt = "We are excited to learn about transformers and natural language processing."

In [None]:
tokenizer(input_txt)

{'input_ids': [101, 2057, 2024, 7568, 2000, 4553, 2055, 19081, 1998, 3019, 2653, 6364, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

As we can see, the output contains the identifiers of the individual tokens in the `input_ids` array, and the `attention_mask` array, which tells the attention mechanism which positions hold real tokens (value 1) and which hold padding (value 0). A special start token with identifier 101 (the so-called CLS token) is always added at the beginning of the token sequence, and an end token with identifier 102 (the so-called SEP token) is added at the end.
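
The structure of this output can be reproduced with a small sketch. The toy vocabulary below is an assumption: it was built by aligning the words of the example sentence with the identifiers printed above, and it covers only this sentence. The function name `encode` is hypothetical, and real tokenization would also split unknown words into subwords instead of mapping them directly to `[UNK]`.

```python
import re

# Toy vocabulary inferred from the printed `input_ids` (covers only the example sentence)
toy_vocab = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "excited": 7568, "to": 2000, "learn": 4553,
    "about": 2055, "transformers": 19081, "and": 1998, "natural": 3019,
    "language": 2653, "processing": 6364, ".": 1012,
}


def encode(text, vocab):
    """Lowercase, split into words and punctuation, map to IDs, add [CLS] and [SEP]."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    input_ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    # Every position holds a real token, so the attention mask is all ones
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = encode("We are excited to learn about transformers and natural language processing.", toy_vocab)
print(encoded["input_ids"])
```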

To see how the network processes this input, we will prepare a batch (a "package") containing it. When preparing the batch, we will also pass some additional arguments: `padding=True` pads shorter inputs with zeros to the length of the longest input in the batch, while `truncation=True` together with `max_length=512` truncates any input longer than 512 tokens. The argument `return_tensors="pt"` requests the result in the form of PyTorch tensors.
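
The effect of padding and truncation on a batch can be illustrated with plain Python lists. This is a simplified sketch, not the tokenizer's actual implementation: the function name `pad_and_truncate` is hypothetical, padding uses the identifier 0, sequences are padded to the longest one in the batch, and truncation here simply keeps the final SEP token.

```python
def pad_and_truncate(batch, max_length=512, pad_id=0):
    """Truncate sequences to max_length, then pad them to the longest sequence in the batch."""
    truncated = []
    for seq in batch:
        if len(seq) > max_length:
            # Keep the first max_length - 1 IDs and the closing [SEP] ID
            seq = seq[:max_length - 1] + [seq[-1]]
        truncated.append(seq)

    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already encoded inputs of different lengths (IDs taken from the example above)
batch = [[101, 2057, 2024, 7568, 102], [101, 2057, 102]]
print(pad_and_truncate(batch))
print(pad_and_truncate(batch, max_length=4))
```

With a single input, as in the cell that follows, no padding is added because that input is already the longest one in the batch.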

In [None]:
input_package = tokenizer([input_txt], padding=True, truncation=True, max_length=512, return_tensors="pt")


The prepared input can be further passed to the model.

In [None]:
result = model(**input_package)

First, we will print the result itself, and then apply the softmax function, which converts the raw model outputs (logits) into the score values we need.

In [None]:
result

SequenceClassifierOutput(loss=None, logits=tensor([[-4.0110,  4.2654]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
from torch import nn
nn.functional.softmax(result.logits, dim=-1)
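
The same calculation can be written out by hand to show what softmax does. The sketch below uses only the Python standard library and the logit values copied from the printed result above. The function name `softmax` and the label order (index 0 for negative, index 1 for positive) are stated here as assumptions of the sketch.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [-4.0110, 4.2654]         # copied from the printed result above
labels = ["NEGATIVE", "POSITIVE"]  # assumed order: index 0 negative, index 1 positive

probabilities = softmax(logits)
predicted_label = labels[probabilities.index(max(probabilities))]
print(probabilities, predicted_label)
```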

The first calculated value corresponds to the negative sentiment score, and the second to the positive sentiment score. Since the latter is higher, the model's conclusion would be that the content is positive. We can verify this by calling the `analize_sentiment` function.

In [86]:
analize_sentiment(input_txt)

The `transformers` library also supports many other models and tasks, such as speech-to-text conversion, question answering, entity recognition in text, and many others, which you might find interesting to explore. You can find more on the official library page https://huggingface.co/docs/transformers/index. The library is developed and maintained by the Hugging Face community 🤗.