# Lab | Transformers

---

### Section structure

1. The open-source ecosystem: increasing accessibility to machine learning (ML) software and hardware
2. Some simple code demonstrations
3. Q&A

## 1. Ease-of-use: Using Transformers in 3 lines of code


**Overview of different tasks that can be automated with ML**
* Key ingredients: (1) a model trained on a specific task; (2) input data (e.g. texts or images); (3) output produced by the model.
* Transformers are currently the most popular deep learning architecture, and most of the tasks below are solved with them. Other architectures may emerge in the medium term.



**Install the Transformers library & dependencies**

In [None]:
!pip install transformers  # The Transformers library from Hugging Face
!pip install sentencepiece
!pip install wikipedia
!pip install accelerate
!pip install tf-keras
!pip install torch

# NOTE: you might need to restart your Jupyter kernel after installing the libraries


**The Hugging Face Pipeline**
* Makes automation of many NLP tasks possible in 3 lines of code
* Detailed documentation is available [here](https://huggingface.co/transformers/main_classes/pipelines.html)

In [9]:
from transformers import pipeline
import pandas as pd
import numpy as np
from pprint import pprint
import requests

Note: You might need additional libraries or updates to run the cells below. If so, follow the error messages and `pip install` accordingly. ChatGPT can help if you give it the error message.

### 2.1 Many models tailored to specific tasks


#### 2.1.1 Text classification

Let's select a popular text classification model from the [HF model hub](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads).

Here we chose "cardiffnlp/twitter-roberta-base-irony".

We will classify text as ironic or non-ironic.

In [None]:
# Alternative model to try: SamLowe/roberta-base-go_emotions
pipeline_classification = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-irony")




Now that we have the model, we can pass it a string and get a classification back.

Feel free to experiment with different sentences by changing the contents of the variable `text`.

In [None]:
# Another sentence to try: "This smells weird, I'm not sure if I should eat this ... Yikes, it tasted like old socks!"
text = "Well that workshop was totally worth my time..."
output = pipeline_classification(text, top_k=10)
print(output)

[{'label': 'irony', 'score': 0.9424387812614441}, {'label': 'non_irony', 'score': 0.057561274617910385}]


Let's make the output a little cleaner

In [None]:
# make output a bit cleaner
df_output = pd.DataFrame(output)
print(df_output)

       label     score
0      irony  0.942439
1  non_irony  0.057561
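
The pipeline returns a plain list of `{'label', 'score'}` dictionaries, so picking the winning class needs no extra library. The sketch below reuses the output printed above; the helper name `top_prediction` and the `min_score` cut-off are illustrative assumptions, not part of the Transformers API.

```python
# Pipeline output copied from the cell above (a list of label/score dicts).
output = [
    {"label": "irony", "score": 0.9424387812614441},
    {"label": "non_irony", "score": 0.057561274617910385},
]


def top_prediction(predictions, min_score=0.5):
    """Return (label, score) of the best prediction, or None if it scores below min_score.

    Hypothetical helper: min_score is an arbitrary cut-off chosen for illustration.
    """
    best = max(predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        return None
    return best["label"], best["score"]


label, score = top_prediction(output)
print(f"{label} ({score:.1%})")  # irony (94.2%)
```

Returning `None` below the cut-off makes it easy to route low-confidence texts to a human instead of trusting a weak prediction.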


As you can see, with a few lines of code and an existing model, we can classify text as ironic or non-ironic. Now you have one more tool in your machine learning toolbox.

Remember: "when you only have a hammer, everything looks like a nail". If we want to build a house (that is, do machine learning the right way), we need the right tool for the right job.

#### 2.1.2 Machine Translation

* Open-source machine translation (MT) models let you translate between many different languages without Google Translate.
* The [University of Helsinki](https://huggingface.co/Helsinki-NLP) uploaded models for more than 1000 language pairs to the Hugging Face hub.
* [Facebook AI](https://huggingface.co/models?search=facebook+m2m) open-sourced several multilingual models.
* The [EasyNMT library](https://github.com/UKPLab/EasyNMT) provides an easy wrapper for all these models.
* Most machine translation models translate between two languages in one direction (e.g. German to English, but not English to German); some can translate in multiple directions.


In [None]:
# translation pipeline docs: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TranslationPipeline
pipeline_translate = pipeline("translation", model="facebook/m2m100_418M")

M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.

The model can translate directly between the 9,900 directions of 100 languages.
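
The figure of 9,900 is simply the number of ordered (source, target) pairs among 100 languages, excluding a language paired with itself. A quick check (the function name is illustrative only):

```python
def translation_directions(n_languages: int) -> int:
    """Count ordered (source, target) language pairs where source != target."""
    return n_languages * (n_languages - 1)


print(translation_directions(100))  # 9900
print(translation_directions(2))    # 2 (e.g. de->en and en->de)
```

For comparison, covering the same directions with one-directional models would require one model per direction.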

Here we specify translation from German (`de`) to English (`en`).

In [None]:
text = "Ich bin ein Fisch"
pipeline_translate(text, src_lang="de", tgt_lang="en")

[{'translation_text': 'I am a fish'}]

Let's do the same with the opening of a German Wikipedia article (the first 318 characters of its summary).

In [None]:
# download any text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("de")

text = wikipedia.summary("Donald Trump").replace('\n', ' ')[:318]
print(f"Original text:\n{text}\n")

# translate the text from wikipedia
text_translated = pipeline_translate(text, src_lang="de", tgt_lang="en")
print(f"Translated text:\n{text_translated[0]['translation_text']}")


Original text:
Donald John Trump [ˈdɑn.əld dʒɑn tɹɐmp] (* 14. Juni 1946 in New York City) ist ein US-amerikanischer Unternehmer, Entertainer und Politiker der Republikanischen Partei. Von 2017 bis 2021 war er der 45. Präsident der Vereinigten Staaten. Er gilt als einer der umstrittensten Politiker der US-Geschichte und ist der erst

Translated text:
Donald John Trump [ˈdɑn.əld dʒɑn tɔmp] (* 14 June 1946 in New York City) is an American entrepreneur, entertainer and politician of the Republican Party. From 2017 to 2021 he was the 45th President of the United States.
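
Slicing at a fixed character count cuts the input mid-sentence, and the translation above silently drops the unfinished part. One alternative is to keep only whole sentences that fit into the budget. The sketch below is a minimal version using a naive regular expression; the function name and sample text are made up for illustration, and the splitter would be fooled by abbreviations or German ordinals such as "14. Juni", so a proper sentence tokenizer may be preferable for real text.

```python
import re


def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Keep as many whole sentences as fit into max_chars (naive splitter)."""
    # A sentence is assumed to end with '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    length = 0
    for sentence in sentences:
        # Account for the joining space between sentences.
        extra = len(sentence) + (1 if kept else 0)
        if length + extra > max_chars:
            break
        kept.append(sentence)
        length += extra
    return " ".join(kept)


# Hypothetical sample text; the last sentence is incomplete on purpose.
sample = "Ich bin ein Fisch. Ich schwimme im Meer. Das Wasser ist kalt und"
print(truncate_at_sentence(sample, 45))
```

The result could then be passed to `pipeline_translate` in place of the `[:318]` slice. If even the first sentence exceeds the budget, the function returns an empty string.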


#### 2.1.3 Text Summarization

In [None]:
# docs for summarisation pipeline: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
# Alternative model to try: google/pegasus-cnn_dailymail
pipeline_summarize = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


In [None]:
# download any long text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("en")

text_long = wikipedia.summary("Donald Trump").replace('\n', ' ')
print(f"Original text:\n{text_long}\n")

# summarize the text from wikipedia
text_summarized = pipeline_summarize(text_long, min_length=5, max_length=30)
print(f"Summarized text:\n{text_summarized[0]['summary_text']}")

Original text:
Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021. Trump received a Bachelor of Science in economics from the University of Pennsylvania in 1968. His father named him president of his real estate business in 1971. Trump renamed it the Trump Organization and reoriented the company toward building and renovating skyscrapers, hotels, casinos, and golf courses. After a series of business failures in the late 1990s, he launched successful side ventures, mostly licensing the Trump name. From 2004 to 2015, he co-produced and hosted the reality television series The Apprentice. He and his businesses have been plaintiffs or defendants in more than 4,000 legal actions, including six business bankruptcies. Trump won the 2016 presidential election as the Republican Party nominee against Democratic Party candidate Hillary Clinton while losing the popular vote. A 

#### 2.1.4 Named Entity Recognition

Named entity recognition (NER) identifies specific entities in text and classifies them into predefined categories, such as names of people, organizations, locations, dates, and more.

For example, in the sentence "Apple Inc. was founded by Steve Jobs in California," NER would recognize "Apple Inc." as an organization, "Steve Jobs" as a person, and "California" as a location.

In [None]:
pipeline_ner = pipeline("token-classification", model="dslim/bert-base-NER-uncased", aggregation_strategy="simple")


In [None]:
import wikipedia
wikipedia.set_lang("en")

text_long = wikipedia.summary("Donald Trump").replace('\n', ' ')

output = pipeline_ner(text_long)

pd.DataFrame(output)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.989843 | donald john trump | 0 | 17 |
| 1 | MISC | 0.991848 | american | 45 | 53 |
| 2 | LOC | 0.992675 | united states | 141 | 154 |
| 3 | PER | 0.989312 | trump | 174 | 179 |
| 4 | ORG | 0.697325 | university of pennsylvania | 233 | 259 |
| 5 | PER | 0.969404 | trump | 337 | 342 |
| 6 | ORG | 0.71214 | trump organization | 358 | 376 |
| 7 | PER | 0.913634 | trump | 597 | 602 |
| 8 | MISC | 0.953361 | the apprentice | 684 | 698 |
| 9 | PER | 0.992383 | trump | 828 | 833 |
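
Because the model is uncased, the `word` column is lowercased. The `start` and `end` columns are character offsets into the input string, so the original spelling can be recovered by slicing. A small sketch using the first sentence of the summary and the first three rows from the output above (the `surface` key is a name introduced here for illustration):

```python
# First sentence of the English summary printed earlier.
text = (
    "Donald John Trump (born June 14, 1946) is an American politician, "
    "media personality, and businessman who served as the 45th president "
    "of the United States from 2017 to 2021."
)

# First three rows of the NER output shown above.
entities = [
    {"entity_group": "PER", "word": "donald john trump", "start": 0, "end": 17},
    {"entity_group": "MISC", "word": "american", "start": 45, "end": 53},
    {"entity_group": "LOC", "word": "united states", "start": 141, "end": 154},
]

# Slice the original text with the character offsets to restore the casing.
for entity in entities:
    entity["surface"] = text[entity["start"]:entity["end"]]
    print(entity["entity_group"], entity["surface"])
```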


### 2.2. Universal models

The models mentioned above are designed to excel at a single specific task on a particular dataset. The key advantage of these models is their high performance and accuracy on that specific task and dataset.

However, in real-world applications, the problems you'll face often require solving slightly different tasks, possibly with varied category definitions or applied to different types of texts.

Universal models can help address this challenge. Although they also focus on one task, the task is general or universal enough that many other tasks can be reformulated into it. Two examples of universal tasks are:

- Natural Language Inference (NLI): A task that can effectively solve a wide range of classification tasks by determining whether a given premise supports, contradicts, or is neutral with respect to a hypothesis.

- Token Generation: An even more universal task that can be applied to solve virtually any text-related task, including translation, summarization, and text completion.

These universal tasks enable the models to be versatile and adaptable to various problems beyond the specific ones they were initially trained on.

#### 2.2.1 Zero-shot classification


Zero-shot classification is a technique where a model can categorize data into classes it has never seen before.

Instead of relying on labeled examples for each class, the model understands the relationship between the input and the class descriptions, allowing it to make accurate predictions without needing specific training on those classes.
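
To make the link to NLI concrete: each candidate class can be turned into a hypothesis sentence, and the input text serves as the premise. An NLI model then judges, per pair, whether the premise entails the hypothesis. The sketch below only builds those pairs; the function name is hypothetical and the template wording `"This example is {}."` is an assumption chosen for illustration.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template="This example is {}."):
    """Turn a classification problem into (premise, hypothesis) pairs for an NLI model.

    The template wording is an illustrative assumption; any sentence with one
    '{}' placeholder for the class name would work the same way.
    """
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs(
    "I have not received my reimbursement yet.",
    ["payment issues", "travel advice", "bug report"],
)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")
```

The class whose hypothesis is most strongly entailed by the text becomes the prediction, which is why no training examples for the new classes are needed.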

In [None]:
pipeline_zeroshot_classification = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")


Here we give the model a list of candidate classes ('payment issues', 'travel advice', 'bug report') and let it score our string against each one.

In [None]:
text = "Customer: I have not received my reimbursement yet. What the hell is going on?"
classes = ['payment issues', 'travel advice', 'bug report']  # other classes to try: "account opening", "customer complaint"

# Alternative example to try:
# text = "I do not think the government is trustworthy anymore. We need to mobilize and resist!"
# classes = ["civil disobedience", "praise of the government", "travel advice"]  # or "collective action"

output = pipeline_zeroshot_classification(text, classes, multi_label=True)

pd.DataFrame(data=[output["labels"], output["scores"]], index=["class", "probability"]).T


| | class | probability |
|---|---|---|
| 0 | payment issues | 0.991133 |
| 1 | bug report | 0.076115 |
| 2 | travel advice | 0.018696 |
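
Note that the three probabilities above do not sum to 1. With `multi_label=True`, each class is scored on its own, so several classes (or none) can score high. The sketch below illustrates the difference between the two modes as we understand them: independent entailment-vs-contradiction scoring per class, versus one softmax across the classes. The logits are invented for illustration and the function names are hypothetical; this is not the library's actual code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Invented (contradiction, entailment) logits per candidate class.
logits = {
    "payment issues": (-2.0, 3.0),
    "bug report": (1.5, -1.0),
    "travel advice": (2.5, -1.5),
}


def multi_label_scores(logits):
    """Score each class independently: entailment vs. contradiction."""
    return {label: softmax([c, e])[1] for label, (c, e) in logits.items()}


def single_label_scores(logits):
    """Score classes against each other: softmax over entailment logits."""
    labels = list(logits)
    probs = softmax([logits[label][1] for label in labels])
    return dict(zip(labels, probs))


print("multi-label: ", multi_label_scores(logits))   # values need not sum to 1
print("single-label:", single_label_scores(logits))  # values sum to 1
```

Use the independent mode when a text may belong to several classes at once, and the competing mode when exactly one class should win.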


## Exercise

Now it is your turn. Go to the Hugging Face model hub: https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads

(In the menu on the left of the website, you can select the type of NLP task you want a model to perform.)

- Find an NLP model that we have not used previously.

- Get some data from wikipedia or elsewhere.

- Perform inference with the model and print the result!

- Comment your code along the way: describe what your model does and what your end goal is, from input to output.

Have fun!

# Task 1. UAV Log (Text Classification)

## DistilBERT Model for UAV Log Data


- Perform inference with the model

In [10]:
#Fetching data from Wikipedia (UAV article)
#url = "https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle"
#response = requests.get(url)
#text = response.text

In [2]:
# Load the text classification pipeline with the chosen model
# This model is fine-tuned for binary sentiment analysis (POSITIVE/NEGATIVE)

pipeline_classification = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")



The model will classify the text as either POSITIVE or NEGATIVE.


In [15]:
# UAV text
#text = "UAV is operating normally, all systems green."  # positive UAV log
text = "Critical error: Motor failure, initiating emergency landing."  # Example of a negative UAV log

# Perform inference
output = pipeline_classification(text, top_k=10)

print(output)

[{'label': 'NEGATIVE', 'score': 0.9995260238647461}, {'label': 'POSITIVE', 'score': 0.0004739683645311743}]


## Classification Results



In [16]:
# make output a bit cleaner
df_output = pd.DataFrame(output)
print(df_output)

      label     score
0  NEGATIVE  0.999526
1  POSITIVE  0.000474


# Task 2. Machine Translation (French to Arabic)

In [17]:
# 1: Load the translation pipeline
pipeline_translate = pipeline("translation", model="facebook/m2m100_418M")



In [22]:
!pip install wikipedia



In [23]:
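# Fetch the raw HTML of the English article with requests.
# Note: `text` is overwritten in the next cell, so this HTML is not used for the translation.
url = "https://en.wikipedia.org/wiki/Swiss_Alps"
response = requests.get(url)
text = response.text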

In [26]:
# 2: Set Wikipedia language to French
import wikipedia
wikipedia.set_lang("fr")

# 3: Fetch the summary of "Alpes suisses" (Swiss Alps) in French
text = wikipedia.summary("Alpes suisses").replace('\n', ' ')[:100]  # keep the first 100 characters for simplicity
print(f"Original text (French):\n{text}\n")

# 4: Translate the text from French to Arabic
text_translated = pipeline_translate(text, src_lang="fr", tgt_lang="ar")

print(f"Translated text (Arabic):\n{text_translated[0]['translation_text']}")

Original text (French):
Les Alpes suisses sont la partie située en Suisse de la chaîne des Alpes. Elles comprennent la haute

Translated text (Arabic):
الألب السويسرية هي الجزء الذي يقع في سويسرا من سلسلة الألب.


# Task 3. Named Entity Recognition (NER) for Airbus

## Chosen model: dslim/bert-base-NER-uncased



In [27]:
# Load the NER pipeline
pipeline_ner = pipeline("token-classification", model="dslim/bert-base-NER-uncased", aggregation_strategy="simple")



In [41]:
# Set Wikipedia language to English
wikipedia.set_lang("en")

# Fetch a summary of "AIRBUS"
text_long = wikipedia.summary("Airbus").replace('\n', ' ')
print(f"Text about Airbus:\n{text_long}\n")

Text about Airbus:
Airbus SE ( AIR-buss; French: [ɛʁbys] ; German: [ˈɛːɐ̯bʊs] ; Spanish: [ˈejɾβus]) is a European aerospace corporation. The company's primary business is the design and manufacturing of commercial aircraft but it also has separate defence and space and helicopter divisions. Airbus has long been the world's leading helicopter manufacturer and, in 2019, also emerged as the world's biggest manufacturer of airliners. The company was incorporated as the European Aeronautic Defence and Space Company (EADS) in the year 2000 through the merger of the French Aérospatiale-Matra, the German DASA and Spanish CASA. The new entity subsequently acquired full ownership of its subsidiary, Airbus Industrie GIE, a joint venture of European aerospace companies originally incorporated in 1970 to develop and produce a wide-body aircraft to compete with American-built airliners. EADS rebranded itself as Airbus SE in 2015. Reflecting its multinational origin, the company operates major office

In [47]:
# Perform NER on the text
output = pipeline_ner(text_long)

# Convert the output to a pandas DataFrame for better visualization
import pandas as pd
df_output = pd.DataFrame(output)

# Display the DataFrame
df_output

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.994295 | airbus se | 0 | 9 |
| 1 | ORG | 0.945243 | air - buss | 12 | 20 |
| 2 | MISC | 0.994153 | french | 22 | 28 |
| 3 | ORG | 0.626516 | ##s | 35 | 36 |
| 4 | MISC | 0.993665 | german | 40 | 46 |
| 5 | MISC | 0.994094 | spanish | 61 | 68 |
| 6 | MISC | 0.952975 | european | 86 | 94 |
| 7 | ORG | 0.991823 | airbus | 273 | 279 |
| 8 | ORG | 0.987255 | european aeronautic defence and space company | 451 | 496 |
| 9 | ORG | 0.984943 | eads | 498 | 502 |


In [49]:
df_output = df_output[df_output["score"] > 0.95]  # Keep only high-confidence predictions

In [50]:
df_output = df_output.sort_values(by="score", ascending=False).head(5)
print(df_output[["entity_group", "word", "score"]])

   entity_group      word     score
25          LOC    canada  0.998658
21          LOC    france  0.998649
22          LOC   germany  0.998594
23          LOC     spain  0.998563
26          LOC  malaysia  0.998532
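
The same filter-and-rank step can be written without pandas, which also makes it easy to add one more rule: dropping entries such as `##s`, where the `##` prefix marks a leftover WordPiece sub-word fragment rather than a real entity. The sketch below runs on the first ten rows of the NER output shown earlier (not the full output, so the ranking differs from the table above); the function name and defaults are illustrative assumptions.

```python
# (entity_group, score, word) for the first ten rows of the Airbus NER output.
rows = [
    ("ORG", 0.994295, "airbus se"),
    ("ORG", 0.945243, "air - buss"),
    ("MISC", 0.994153, "french"),
    ("ORG", 0.626516, "##s"),
    ("MISC", 0.993665, "german"),
    ("MISC", 0.994094, "spanish"),
    ("MISC", 0.952975, "european"),
    ("ORG", 0.991823, "airbus"),
    ("ORG", 0.987255, "european aeronautic defence and space company"),
    ("ORG", 0.984943, "eads"),
]


def filter_entities(rows, min_score=0.95, top_n=5):
    """Keep confident, well-formed entities and return the top_n by score."""
    kept = [
        row for row in rows
        if row[1] > min_score and not row[2].startswith("##")
    ]
    return sorted(kept, key=lambda row: row[1], reverse=True)[:top_n]


for group, score, word in filter_entities(rows):
    print(f"{group:<5} {word:<10} {score:.6f}")
```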


# Task 4. Zero-Shot Classification

In [52]:
pipeline_zeroshot_classification = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")



In [57]:
# Define the input text (aviation-related)
text = "My flight was delayed by 3 hours due to bad weather, and the airline did not provide any updates. This is insane!"

# Define the candidate classes
classes = ["flight delay", "safety concern", "food order"]

# Perform zero-shot classification
output = pipeline_zeroshot_classification(text, classes, multi_label=True)

# Format the result as a DataFrame
df_output = pd.DataFrame(data=[output["labels"], output["scores"]], index=["class", "probability"]).T
print(df_output)

            class probability
0    flight delay    0.999004
1  safety concern    0.927149
2      food order    0.024952
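
Since `multi_label=True` scores every class independently, more than one class can come out high, as "safety concern" does here at 0.93. Turning the scores into a decision therefore needs a threshold. A minimal sketch using the numbers printed above (the helper name and the threshold values are arbitrary choices for illustration):

```python
# Labels and scores copied from the output above.
output = {
    "labels": ["flight delay", "safety concern", "food order"],
    "scores": [0.999004, 0.927149, 0.024952],
}


def accepted_labels(output, threshold=0.5):
    """Return every label whose score reaches the threshold."""
    return [
        label
        for label, score in zip(output["labels"], output["scores"])
        if score >= threshold
    ]


print(accepted_labels(output))                  # ['flight delay', 'safety concern']
print(accepted_labels(output, threshold=0.95))  # ['flight delay']
```

Whether "safety concern" should count as a hit for this text is a judgment call, so the threshold is best tuned on a handful of hand-labelled examples.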


Wow, look at this technique! The top class, "flight delay", is spot on.