<a href="https://colab.research.google.com/github/DrNOFX97/lab-transformers/blob/main/lab_transformers_te.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab | Transformers

---

### Section structure

1. The open-source ecosystem: increasing accessibility to machine learning (ML) software and hardware
2. Some simple code demonstrations
3. Q&A

## 1. Ease-of-use: Using Transformers in 3 lines of code


**Overview of different tasks that can be automated with ML**
* Key ingredients: (1) a model trained on a specific task; (2) input data (e.g. texts or images); (3) output produced by the model.
* Transformers are currently the most widely used deep learning architecture for these tasks, and most of the tasks below are solved with Transformer models. Other architectures may emerge in the medium term.



**Install the Transformers library & dependencies**

In [None]:
#!pip install transformers~=4.31.0  # The Transformers library from Hugging Face
#!pip install sentencepiece==0.1.96  # optional tokeniser, required for some models, e.g. machine translation
#!pip install wikipedia==1.4.0  # to download any text from wikipedia
# running large models with accelerate https://huggingface.co/blog/accelerate-large-models
# NOTE: we need to restart the runtime after installing accelerate
#!pip install accelerate~=0.21.0

In [None]:
# automatically choose GPU or CPU for inference, depending on your hardware
import torch
# legacy integer convention for pipelines: -1 == CPU ; 0 == first GPU
#device_id = torch.cuda.current_device() if torch.cuda.is_available() else -1

# Check available device
# To run a pipeline on this device, pass device=device to pipeline(...)
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    device_info = "GPU acceleration in place powered by NVIDIA (CUDA)"
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    device_info = "GPU acceleration in place powered by Apple's Metal Performance Shaders (MPS)"
else:
    device = torch.device("cpu")
    device_info = "Using CPU... Best of luck..."

print(device_info)

GPU acceleration in place powered by NVIDIA (CUDA)


**The Hugging Face Pipeline**
* Makes automation of many NLP tasks possible in 3 lines of code
* Detailed documentation is available [here](https://huggingface.co/transformers/main_classes/pipelines.html)

In [None]:
from transformers import pipeline
import pandas as pd
import numpy as np
from pprint import pprint

### 2.1 Many models tailored to specific tasks


#### 2.1.1 Text classification

Let's search for a few popular text classification models in the [HF model hub](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads).

In [None]:
import warnings
# Suppress the FutureWarning about resume_download
warnings.filterwarnings("ignore", message="`resume_download` is deprecated and will be removed in version 1.0.0.")

In [None]:
#!pip install xformers

In [None]:
pipeline_classification = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-irony")

In [None]:
text = "Well that workshop was totally worth my time..."  # alternative: "This smells weird, I'm not sure if I should eat this ... Yikes, it tasted like old socks!"
output = pipeline_classification(text, top_k=10)
print(output)

[{'label': 'irony', 'score': 0.9424386620521545}, {'label': 'non_irony', 'score': 0.057561296969652176}]


In [None]:
# make output a bit cleaner
df_output = pd.DataFrame(output)
print(df_output)

       label     score
0      irony  0.942439
1  non_irony  0.057561
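
The scores above can also drive a simple decision rule. The following is a minimal sketch that reuses the output printed above; the `top_label` helper, its `min_score` threshold and the `"uncertain"` fallback are illustrative assumptions, not part of the Transformers API.

```python
# Output copied from the classification cell above.
output = [
    {"label": "irony", "score": 0.9424386620521545},
    {"label": "non_irony", "score": 0.057561296969652176},
]


def top_label(predictions, min_score=0.8, fallback="uncertain"):
    """Return the highest-scoring label, or `fallback` if the score is too low.

    `min_score` and `fallback` are hypothetical choices for illustration.
    """
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else fallback


print(top_label(output))                  # irony
print(top_label(output, min_score=0.95))  # uncertain
```

A stricter threshold trades coverage for precision: fewer texts get a label, but the labels that remain are more reliable.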


#### 2.1.2 Machine Translation

* Open-source machine translation (MT) models enable you to translate between many different languages without relying on Google Translate.
* The [University of Helsinki](https://huggingface.co/Helsinki-NLP) uploaded models for more than 1000 language pairs to the Hugging Face hub.
* [Facebook AI](https://huggingface.co/models?search=facebook+m2m) open-sourced several multilingual models.
* The [EasyNMT library](https://github.com/UKPLab/EasyNMT) provides an easy wrapper for all these models.
* Most machine translation models translate between two languages in one direction (e.g. German to English, but not English to German); some can translate in multiple directions.
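
Because most of these models cover a single direction, each direction needs its own model identifier. The sketch below assumes the common `Helsinki-NLP/opus-mt-{src}-{tgt}` naming pattern on the hub; `opus_mt_model_id` is a hypothetical helper, and not every language pair exists, so check the hub before loading a model.

```python
def opus_mt_model_id(src_lang, tgt_lang):
    """Build a Helsinki-NLP model identifier for one translation direction.

    Assumes the `Helsinki-NLP/opus-mt-{src}-{tgt}` naming pattern;
    the helper only formats a string and does not check that the model exists.
    """
    return f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"


# German -> English and English -> German are two separate models.
print(opus_mt_model_id("de", "en"))  # Helsinki-NLP/opus-mt-de-en
print(opus_mt_model_id("en", "de"))  # Helsinki-NLP/opus-mt-en-de
```

The resulting string can be passed as the `model` argument of `pipeline("translation", ...)`, in the same way as the multilingual model used below.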


In [None]:
# translation pipeline docs: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TranslationPipeline
pipeline_translate = pipeline("translation", model="facebook/m2m100_418M")

In [None]:
text = "Ich bin ein Fisch"
pipeline_translate(text, src_lang="de", tgt_lang="en")

[{'translation_text': 'I am a fish'}]

In [None]:
#!pip install wikipedia

In [None]:
# download any text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("pt")

text = wikipedia.summary("Sporting Clube Farense").replace('\n', ' ')[:318]
print(f"Original text:\n{text}\n")

# translate the text from wikipedia
text_translated = pipeline_translate(text, src_lang="pt", tgt_lang="en")
print(f"Translated text:\n{text_translated[0]['translation_text']}")


Original text:
Sporting Clube Farense é um clube de futebol português, da cidade de Faro. É o clube mais antigo e com maior historial do Algarve. Utiliza como equipamento, camisola preta ou (e) branca, calção preto ou branco e meias brancas ou pretas. O Sporting Clube Farense possui concomitantemente o décimo quarto (14.º) melhor r

Translated text:
Sporting Club Farense is a Portuguese football club, from the city of Faro. It is the oldest club and with the largest history of Algarve. It uses as equipment, black or (e) white shirt, black or white calcium and white or black mids. The Sporting Club Farense has simultaneously the twentieth quarter (14.) best r
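
The `[:318]` slice cuts the text in the middle of a word, which is why the translation ends in "best r". A possible alternative is to cut at a sentence boundary. The sketch below is one way to do that under a naive assumption about where sentences end; `truncate_at_sentence` is a hypothetical helper, not part of any library used here.

```python
import re


def truncate_at_sentence(text, max_chars):
    """Return the longest prefix of whole sentences that fits in `max_chars`.

    Sentences are split naively after '.', '!' or '?' followed by whitespace.
    Returns an empty string if even the first sentence is too long.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    length = 0
    for sentence in sentences:
        extra = len(sentence) + (1 if kept else 0)  # +1 for the joining space
        if length + extra > max_chars:
            break
        kept.append(sentence)
        length += extra
    return " ".join(kept)


# First three sentences of the Wikipedia summary shown above.
text = (
    "Sporting Clube Farense é um clube de futebol português, da cidade de Faro. "
    "É o clube mais antigo e com maior historial do Algarve. "
    "Utiliza como equipamento, camisola preta ou (e) branca, "
    "calção preto ou branco e meias brancas ou pretas."
)

# Only the first two sentences fit into 150 characters.
print(truncate_at_sentence(text, 150))
```

The regular expression only splits where whitespace follows the punctuation, so "(14.º)" in the original summary would not be treated as a sentence end.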


#### 2.1.3 Text Summarization

In [None]:
# docs for summarisation pipeline: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
pipeline_summarize = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")  # alternative: google/pegasus-cnn_dailymail

In [None]:
# download any long text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("pt")

text_long = wikipedia.summary("Sporting Clube Farense").replace('\n', ' ')
print(f"Original text:\n{text_long}\n")

# summarize the text from wikipedia
text_summarized = pipeline_summarize(text_long, min_length=5, max_length=30)
print(f"Summarized text:\n{text_summarized[0]['summary_text']}")

Original text:
Sporting Clube Farense é um clube de futebol português, da cidade de Faro. É o clube mais antigo e com maior historial do Algarve. Utiliza como equipamento, camisola preta ou (e) branca, calção preto ou branco e meias brancas ou pretas. O Sporting Clube Farense possui concomitantemente o décimo quarto (14.º) melhor registo na Primeira Liga Portuguesa e na Taça de Portugal. Destacam-se a presença na Final da Taça de Portugal na época 1989/1990, e ainda o 5.º lugar obtido na época 1994/1995, que valeu ao clube a participação na Taça UEFA no ano seguinte.

Summarized text:
 Sporting Clube Farense is a clube de futebol português, da cidade de Faro .
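
The summary above is largely a copy of the Portuguese input. That is plausible, since `sshleifer/distilbart-cnn-12-6` was fine-tuned on English news (CNN/DailyMail). One possible workaround is to translate first and summarize second. The sketch below chains two plain callables so that it runs without downloading a model; `summarize_via_english` and the stand-in functions are hypothetical names for illustration.

```python
def summarize_via_english(text, translate, summarize):
    """Translate `text` to English first, then summarize the English version.

    `translate` and `summarize` are plain callables (str -> str),
    so any model or pipeline can be plugged in.
    """
    english = translate(text)
    return summarize(english)


# Stand-in callables so the sketch runs without a model.
def fake_translate(text):
    return text.upper()


def fake_summarize(text):
    return text.split(".")[0] + "."


print(summarize_via_english("first sentence. second sentence.", fake_translate, fake_summarize))
# FIRST SENTENCE.

# With the pipelines from this notebook, the callables could look like this:
# translate = lambda t: pipeline_translate(t, src_lang="pt", tgt_lang="en")[0]["translation_text"]
# summarize = lambda t: pipeline_summarize(t, min_length=5, max_length=30)[0]["summary_text"]
```

Translation models have an input length limit, so very long texts may need to be split into chunks before this step.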


#### 2.1.4 Named Entity Recognition

In [None]:
pipeline_ner = pipeline("token-classification", model="dslim/bert-base-NER-uncased", aggregation_strategy="simple")

Some weights of the model checkpoint at dslim/bert-base-NER-uncased were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
import wikipedia
wikipedia.set_lang("pt")

text_long = wikipedia.summary("Sporting Clube Farense").replace('\n', ' ')

output = pipeline_ner(text_long)

pd.DataFrame(output)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.993572 | sporting clube farense | 0 | 22 |
| 1 | ORG | 0.917649 | um clube de futebol portugues | 25 | 54 |
| 2 | ORG | 0.864415 | da cidade de faro | 56 | 73 |
| 3 | ORG | 0.571216 | clube mai | 79 | 88 |
| 4 | ORG | 0.595262 | antigo | 90 | 96 |
| 5 | MISC | 0.478366 | maior | 103 | 108 |
| 6 | ORG | 0.857461 | historial do al | 109 | 124 |
| 7 | LOC | 0.840841 | ##garve | 124 | 129 |
| 8 | ORG | 0.974541 | ut | 131 | 133 |
| 9 | ORG | 0.880519 | ##ili | 133 | 136 |
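
The `##garve` and `##ili` rows are WordPiece continuation fragments that were left as separate rows. The weak results overall are plausible because `dslim/bert-base-NER-uncased` is an English NER model applied here to Portuguese text. The sketch below shows one way to glue such fragments back onto the preceding span, using four rows from the output above; `merge_wordpieces` and its rule for label and score are illustrative assumptions, not what the pipeline does internally.

```python
# Rows 6 to 9 of the NER output above.
entities = [
    {"entity_group": "ORG", "score": 0.857461, "word": "historial do al", "start": 109, "end": 124},
    {"entity_group": "LOC", "score": 0.840841, "word": "##garve", "start": 124, "end": 129},
    {"entity_group": "ORG", "score": 0.974541, "word": "ut", "start": 131, "end": 133},
    {"entity_group": "ORG", "score": 0.880519, "word": "##ili", "start": 133, "end": 136},
]


def merge_wordpieces(spans):
    """Glue '##' continuation pieces onto the directly preceding span.

    The merged span keeps the first span's label and the lower of the two
    scores; both are hypothetical choices for illustration.
    """
    merged = []
    for span in spans:
        is_continuation = span["word"].startswith("##")
        if merged and is_continuation and span["start"] == merged[-1]["end"]:
            previous = merged[-1]
            previous["word"] += span["word"][2:]  # drop the '##' marker
            previous["end"] = span["end"]
            previous["score"] = min(previous["score"], span["score"])
        else:
            merged.append(dict(span))  # copy so the input list stays untouched
    return merged


for span in merge_wordpieces(entities):
    print(span["entity_group"], span["word"], span["start"], span["end"])
# ORG historial do algarve 109 129
# ORG utili 131 136
```

The token-classification pipeline also accepts other `aggregation_strategy` values (`"first"`, `"average"`, `"max"`) that resolve labels at word level, which may make this kind of post-processing unnecessary.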


### 2.2 Universal models

The models above are always tailored to **one specific task from one dataset**. Their main advantage is that they are very good at this specific task and perform well on this specific dataset. The problems you encounter in the real world, however, will often require a slightly different task, with different category definitions or different types of text.

Universal models can partly address this issue. They are also trained on only one task, but this task is so general that many other tasks can be reformulated as this universal task. Two examples of universal tasks are:
- Natural Language Inference (NLI): a task that, in principle, any classification task can be reformulated as.
- Token generation: an even more universal task that, in principle, any text-related task can be reformulated as.
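
To make the NLI reformulation concrete: each candidate class is turned into a hypothesis sentence, and an NLI model then judges whether the input text (the premise) entails it. The sketch below only builds these premise and hypothesis pairs. `build_nli_pairs` is a hypothetical helper, and the template is assumed to match the default of the Transformers zero-shot pipeline.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template="This example is {}."):
    """Turn one classification problem into one NLI problem per candidate label.

    Returns a list of (premise, hypothesis) tuples. The default template is
    an assumption about the zero-shot pipeline's default; any sentence with
    one '{}' placeholder works.
    """
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


text = "Customer: I have not received my reimbursement yet."
labels = ["payment issues", "travel advice", "bug report"]

for premise, hypothesis in build_nli_pairs(text, labels):
    print(f"premise: {premise!r} -> hypothesis: {hypothesis!r}")
```

The label whose hypothesis receives the highest entailment score is the predicted class, so changing the classes requires no retraining, only new hypotheses.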

#### Zero-shot classification

In [None]:
pipeline_zeroshot_classification = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

In [None]:
text = "Customer: I have not received my reimbursement yet. What the hell is going on?"
classes = ['payment issues', 'travel advice', 'bug report']  # "account opening", "customer complaint"

#text = "I do not think the government is trustworthy anymore. We need to mobilize and resist!"
#classes = ["civil disobedience", "praise of the government", "travel advice"]  # "collective action"

output = pipeline_zeroshot_classification(text, classes, multi_label=True)

pd.DataFrame(data=[output["labels"], output["scores"]], index=["class", "probability"]).T


| | class | probability |
|---|---|---|
| 0 | payment issues | 0.991132 |
| 1 | bug report | 0.076115 |
| 2 | travel advice | 0.018696 |
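
With `multi_label=True`, each class is scored independently, so the probabilities above do not need to sum to 1. A text can then receive zero, one or several classes, depending on a cut-off. The sketch below applies such a cut-off to the scores above; `select_labels` and the threshold of 0.5 are illustrative assumptions.

```python
# Labels and scores copied from the zero-shot output above.
labels = ["payment issues", "bug report", "travel advice"]
scores = [0.991132, 0.076115, 0.018696]


def select_labels(labels, scores, threshold=0.5):
    """Keep every label whose independent score reaches the threshold.

    The threshold is a hypothetical choice; tune it on labelled examples.
    """
    return [label for label, score in zip(labels, scores) if score >= threshold]


print(f"{sum(scores):.3f}")           # 1.086 (independent scores, not a distribution)
print(select_labels(labels, scores))  # ['payment issues']
```

With `multi_label=False`, the pipeline instead normalizes the scores across classes so that they sum to 1, and exactly one class wins.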


## Exercise + Q&A


**1. Exercise:** (5 min)

Browse through the Hugging Face Hub and **identify a model or dataset that could be useful for you**. Then open this Google Doc and copy-paste the model identifier and a short explanation of why this model is interesting for you. Google Doc: https://docs.google.com/document/d/1KZ6DnZDUg_sxqpS8hhF0MDohZ0IRUZaV83Ixu93n-X8/edit?usp=sharing




**2. Reading, thinking & asking:** (5 min)

a) Go through the notebook and ask any questions you might have. You can also run the notebook yourself.

b) Write the answers to the following questions on a piece of paper / digital notebook in your own words:

* How does open source help increase accessibility to machine learning? Where does it not help?

* In your own words, write down the main difference between standard models and universal models.

* **Post any questions in the chat/Slack!**
