# Transformers a la HuggingFace 

Transformers can be used in several areas like:
- Natural Language Processing (NLP)
- Computer Vision (CV)
- Automatic Speech Recognition (ASR)

In this notebook, however, we focus on NLP and showcase a number of possible tasks, such as:
- Text Generation,
- Translation,
- Summarisation,
- and more.

You can easily perform all of these yourself using the *transformers* package and the open models made available by HuggingFace.
  

-----

This notebook covers the following.

HuggingFace Components:
- **Tokenizer**: Maps text (a string) to tokens and their associated ids (integers), which a model can process. ([Tokenizer summary](https://huggingface.co/docs/transformers/tokenizer_summary))
- **Model**: A pretrained transformer network that turns token ids into predictions
- **Pipeline**: Bundles a tokenizer and a model so that they can be used with a single call

Terminology:
- **Prompt**: Text input to a generative model
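
To make the tokenizer idea concrete before loading a real one, here is a deliberately tiny sketch in plain Python. It is not how the GPT-2 tokenizer works (that one uses byte-pair encoding on sub-word pieces); the vocabulary and the helper names `toy_encode` and `toy_decode` are made up for illustration.

```python
# Toy whitespace tokenizer: NOT the real GPT-2 tokenizer, which uses byte-pair encoding.
# The vocabulary below is invented purely for illustration.
toy_vocab = {"<unk>": 0, "three": 1, "rings": 2, "for": 3, "the": 4, "dark": 5, "lord": 6}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_encode(text):
    """Split on whitespace and map each token to its id (unknown words map to <unk>)."""
    return [toy_vocab.get(token, toy_vocab["<unk>"]) for token in text.lower().split()]


def toy_decode(token_ids):
    """Map ids back to tokens and join them into a string."""
    return " ".join(id_to_token[token_id] for token_id in token_ids)


ids = toy_encode("Three Rings for the Dark Lord")
print(ids)              # [1, 2, 3, 4, 5, 6]
print(toy_decode(ids))  # three rings for the dark lord
```

A model only ever sees the list of ids, which is why the tokenizer and the model always have to match.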


In [None]:
# Some necessary installations for your Colab instance
! pip install transformers
! pip install sentencepiece

## Text Generation - Distilled GPT2

We start by loading a (pretrained) model and its tokenizer from the HuggingFace model zoo. In this case we choose a distilled version of GPT2.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True)
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

To make things easy, we wrap the tokenizer and the model in a ready-made pipeline.

In [None]:
# Putting the tokenizer & the model in a pipeline
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

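# Define a pretty printing function :)
def print_res(result, prompt):
    # `result` holds one list of generated sequences per prompt
    for i, (outputs, text) in enumerate(zip(result, prompt), start=1):
        print(f"--- Prompt {i} ---: ")
        print(">> Prompt: ", text, "...")
        for j, output in enumerate(outputs, start=1):
            # Strip the prompt so that only the newly generated text is shown
            print(f">> Output ({j}): ", output["generated_text"][len(text):])
        print("------------------\n")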

In [None]:
settings = {
        # General
        "pad_token_id": 50256,  # GPT-2 has no pad token, so its end-of-text id is used for padding
        "max_length": 50,  # total length in tokens, prompt included
        "no_repeat_ngram_size": 2,
        "repetition_penalty": 1.0,  # 1.0 means no penalty
        "num_return_sequences": 2,

        # # Beam search
        # "num_beams": 5,
        # "num_return_sequences": 2,
        # "early_stopping": True,

        # # Sampling
        # "temperature": 1,
        "do_sample": True,
        "top_k": 0,  # 0 disables top-k filtering
        "top_p": 0.92,
        }


prompt = [
    "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
    "Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne",
    "I'm a Transformer and welcome to my TED-talk",
    ]

result = generator(
    prompt,
    **settings
)

print_res(result, prompt)

### Sandbox - Try to play around with different prompts, settings and models!

#### Models
Text generation models compatible with the pipeline can be found [here](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
- For example [lunde/gpt2-snapsvisor](https://huggingface.co/lunde/gpt2-snapsvisor), a *fine-tuned Swedish version* of *GPT2* that writes ***snapsvisor***...


In [None]:
# Load models and create pipeline
# This might take a few minutes for a new model! A tip is to load a model once, and then play around with prompts and settings in the next cell

# Can be changed to any model name found on HuggingFace via this link: https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads
model_name = 'lunde/gpt2-snapsvisor'

tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = AutoModelForCausalLM.from_pretrained(model_name)

generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)


#### Settings
Text generation and its settings are [explained by HuggingFace](https://huggingface.co/blog/how-to-generate), with example settings for:
- Greedy/Beam search 
- Sampling strategies
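
The cells in this notebook sample with `top_p` (often called nucleus sampling). The idea can be sketched in a few lines of plain Python: keep only the most likely tokens until their probabilities add up to `top_p`, renormalise, and sample from what is left. The helper `top_p_filter` and the probabilities below are made up for illustration; the library works on tensors of logits rather than dictionaries.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalise so that the kept probabilities sum to 1 again
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Made-up next-token distribution, for illustration only
next_token_probs = {"stone": 0.5, "gold": 0.3, "ice": 0.15, "cheese": 0.05}

nucleus = top_p_filter(next_token_probs, top_p=0.92)
print(nucleus)  # "cheese" is dropped; the remaining probabilities are renormalised

# Sample one token from the filtered distribution
token = random.choices(list(nucleus), weights=list(nucleus.values()))[0]
print(token)
```

A lower `top_p` keeps fewer candidates and gives safer, more predictable text; a value close to 1 keeps almost everything.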

Full [list of settings](https://huggingface.co/docs/transformers/main_classes/text_generation) for the generate function. Some examples:
- ***min_length*** & ***max_length***: (int) minimum/maximum length of the sequence in tokens, with the prompt included in the count
- ***no_repeat_ngram_size***: (int) constrains the repetitiveness of the generation (3 -> no sequence of three tokens can be repeated)
- ***repetition_penalty***: (float) penalty factor for repetition (1.0 means no penalty)
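
To get a feeling for what the two repetition settings do, here is a rough sketch in plain Python. It operates on word lists and a dictionary of scores instead of tensors, and the function names `banned_next_tokens` and `apply_repetition_penalty` are our own, so treat it as an illustration of the idea rather than the library code.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram that the next token would complete
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_repetition_penalty(logits, generated, penalty):
    """Make already generated tokens less likely; a penalty of 1.0 changes nothing."""
    penalised = dict(logits)
    for token in set(generated):
        if token in penalised:
            score = penalised[token]
            # Positive scores shrink and negative scores grow more negative
            penalised[token] = score / penalty if score > 0 else score * penalty
    return penalised


generated = ["one", "ring", "to", "rule", "one", "ring"]
print(banned_next_tokens(generated, ngram_size=3))  # {'to'}

# Made-up scores for three candidate tokens
print(apply_repetition_penalty({"ring": 2.0, "to": -1.0, "them": 0.5}, generated, penalty=2.0))
```

With `no_repeat_ngram_size` a repeated n-gram is forbidden outright, while `repetition_penalty` only nudges the scores, so the two can be combined.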

Rerun the cell with different settings and prompts 

In [None]:
# Change prompt and settings, and generate output

# A dictionary with settings for the generation function
settings = {
        # General
        "pad_token_id": 50256,
        "max_length": 50,  # total length in tokens, prompt included
        # "no_repeat_ngram_size": 2,
        "repetition_penalty": 1.0,  # 1.0 means no penalty

        # # Beam search
        # "num_beams": 5,
        # "num_return_sequences": 2,
        # "early_stopping": True,

        # # Sampling
        # "temperature": 1,
        "do_sample": True,
        "top_k": 0,  # 0 disables top-k filtering
        "top_p": 0.92,
        }

# List containing all strings you would like to send to the model as prompts 
prompt = [
    "Tre ringar för älvkungarnas makt högt i det blå, sju för dvärgarnas furstar i salarna av sten",
    "nio för de dödliga som köttets väg ska gå, en för Mörkrets herre i ondskans dunkla sken",
    ]

# Storing the result 
result = generator(
    prompt,
    **settings
)

print_res(result, prompt)

### What's going on inside the pipeline?
For the curious: here we can see how the pipeline works with tensors, encoding and decoding to go from input to output.
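
Before running the real thing, the three stages (encode, generate, decode) can be mimicked with a toy example in plain Python. Everything here is made up for illustration: the word-level vocabulary, the lookup table that plays the role of the model, and the `toy_` helper names. A real model predicts a probability for every token in its vocabulary instead of looking up a single answer.

```python
# Toy stand-ins, invented for illustration: a word-level vocabulary and a lookup-table "model"
toy_vocab = ["<eos>", "one", "ring", "to", "rule", "them", "all"]
toy_token_to_id = {token: i for i, token in enumerate(toy_vocab)}
# For each token id, the id this toy "model" predicts next
toy_next_id = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 0}


def toy_encode(text):
    """Text -> list of token ids."""
    return [toy_token_to_id[token] for token in text.split()]


def toy_generate(input_ids, max_length):
    """Greedy loop: append the predicted next id until <eos> or max_length is reached."""
    output_ids = list(input_ids)
    while len(output_ids) < max_length:
        next_id = toy_next_id[output_ids[-1]]
        if next_id == toy_token_to_id["<eos>"]:
            break
        output_ids.append(next_id)
    return output_ids


def toy_decode(ids):
    """List of token ids -> text."""
    return " ".join(toy_vocab[i] for i in ids)


toy_input_ids = toy_encode("one ring")
toy_output_ids = toy_generate(toy_input_ids, max_length=6)
print(toy_decode(toy_output_ids))                       # one ring to rule them all
print(toy_decode(toy_output_ids[len(toy_input_ids):]))  # to rule them all
```

Note that the output ids start with the prompt ids, so the newly generated part is everything after the first `len(toy_input_ids)` ids.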

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True)
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

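# Input prompt
text_input = "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone"

# Encoding the prompt into a tensor of token ids with shape (1, number_of_tokens)
input_ids = tokenizer.encode(text_input, return_tensors="pt")
prompt_length = input_ids.shape[1]

# Generating
output_ids = model.generate(input_ids, do_sample=True, top_k=0, max_length=50, top_p=0.92, pad_token_id=50256)

# Decoding only the newly generated tokens, i.e. everything after the prompt tokens.
# Slicing by token count is more robust than slicing the decoded string by character count,
# since the decoded text is not guaranteed to match the prompt character for character.
text_output = tokenizer.batch_decode(output_ids[:, prompt_length:])

# Checking the result
print(">> Prompt: ", text_input, "...")
print(">> Output: ", text_output[0])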


In [None]:
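# Show how the prompt was split into tokens, together with the id of each token
for token, token_id in zip(tokenizer.tokenize(text_input), input_ids.tolist()[0]):
    # 'Ġ' is how the GPT-2 tokenizer marks a preceding space
    print(f"{token_id}:\t {token.replace('Ġ', '_')}")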


## Text classification (exemplified through sentiment analysis)
In text classification we want to categorize our text. A common example is _sentiment analysis_, i.e. classifying whether a text is positive or negative. It is popular partly because a lot of labeled data is available online in the form of reviews, so we can quite easily get a lot of training data where we know, for example, that a 1-star review probably has a negative tone.


Some other examples of where text classification can be useful are: detecting the language of a text, classifying spam, finding urgency and importance in customer messages or detecting toxic messages.

[Here are other classification models on HuggingFace](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads). Most are sentiment analysis models, usually trained on either Twitter data or movie reviews from IMDb. There are also other types of classification models here if you dig a bit, for example toxicity models. You can ask Victor if you want to know more about toxicity models specifically :)
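
Under the hood, a classification model outputs one raw score (logit) per class, and the pipeline turns those into a label and a score between 0 and 1. A minimal sketch of that last step in plain Python could look like this; the logits are made up, and `to_prediction` is a hypothetical helper, not a library function.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def to_prediction(logits, labels):
    """Pick the most probable label and report it together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits for one review; a real model computes these from the text
print(to_prediction([-2.0, 3.0], labels=["NEGATIVE", "POSITIVE"]))
```

This is why the classifier always reports a confident-looking score: the probabilities are forced to sum to 1 over the classes the model knows, even for text that fits none of them.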

In [None]:
classifier = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
review1 = "This movie is disgustingly good !"
review2 = "Director tried too much."

print(f'>>{review1}<< {classifier(review1)}\n>>{review2}<< {classifier(review2)}')

## Zero-Shot-Classification 

Using a large pretrained NLP model to classify text into classes it has never seen during training, i.e. ***zero-shot*** classification.

Models [available](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads) 
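
How can a model classify into labels it was never trained on? Models like `facebook/bart-large-mnli` are trained on natural language inference (NLI): given a text and a hypothesis, does the text entail the hypothesis? The trick, as we understand it, is to turn every candidate label into a hypothesis sentence and compare the entailment scores. The sketch below shows that bookkeeping in plain Python; the template string is an assumption, and the entailment logits are made up since a real NLI model would have to produce them.

```python
import math


def build_hypotheses(candidate_labels, template="This example is {}."):
    """One hypothesis sentence per candidate label (the template is an assumption)."""
    return [template.format(label) for label in candidate_labels]


def rank_labels(candidate_labels, entailment_logits):
    """Softmax the per-label entailment logits and sort the labels, best first."""
    highest = max(entailment_logits)
    exps = [math.exp(value - highest) for value in entailment_logits]
    total = sum(exps)
    scored = [(label, value / total) for label, value in zip(candidate_labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["urgent", "not urgent", "phone"]
print(build_hypotheses(labels))

# Made-up entailment logits, one per hypothesis; a real NLI model would compute these
ranking = rank_labels(labels, entailment_logits=[2.5, -1.0, 0.3])
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Since the labels are just text, you can invent any label set you like, which is what makes the experiments in the next cells possible.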

In [None]:
from transformers import pipeline

oracle = pipeline(model="facebook/bart-large-mnli")

In [None]:
print(oracle(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
))

oracle(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["english", "german"],
)

In [None]:
print(oracle(
    "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
    candidate_labels=["upper-class", "middle-class", "lower-class"],
))
oracle(
    "Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne",
    candidate_labels=["happy", "sad", "epic"],
)

## Text-2-Text

An example of a text-to-text transformer trained to generate a question that fits a given answer and context.

Other models [available](https://huggingface.co/models?pipeline_tag=text2text-generation&sort=downloads)

In [None]:
from transformers import pipeline

# Generator Pipeline
generator = pipeline(model="mrm8488/t5-base-finetuned-question-generation-ap")

In [None]:
# Context and a sought answer
context = "Manuel has created RuPERTa-base with the support of HF-Transformers and Google"
answer = "Manuel"

result = generator(f"answer: {answer} context: {context}")[0]["generated_text"]

print("--- Input ---")
print("Context:", context)
print("Answer:", answer)
print("--- Output ---")
print(result)

Change the context and the answer to whatever you like and see what question the model comes up with!

In [None]:
context = "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne"

answer = "stone halls"

result = generator(f"answer: {answer} context: {context}")[0]["generated_text"]

print("--- Input ---")
print("Context:", context)
print("Answer:", answer)
print("--- Output ---")
print(result)

## Translation

Translation models available on HuggingFace that are compatible with the [pipeline](https://huggingface.co/models?pipeline_tag=translation&sort=downloads)


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import pipeline


tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# The T5 model supports the following languages: en, de, fr & ro 
# Change the language by switching xx & yy in translation_xx_to_yy  
en_de_translator = pipeline("translation_en_to_de", model=model, tokenizer=tokenizer)


In [None]:
en_de_translator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone")

## Conversational AI
Using a large pretrained NLP model to have a conversation. These models are usually trained to be __engaging__, whatever that means. 
The script below starts a conversation with an initial input and then runs for 5 turns, where you can reply to the model.

You can try out different prompts, models and replies. Maybe compare to a purely generative model, that you ask to simulate a conversation.

Models [available](https://huggingface.co/models?pipeline_tag=conversational&sort=downloads) 

In [None]:
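# Note: the `Conversation` class and the "conversational" pipeline task were deprecated and
# later removed from transformers, so this section may require an older release of the library.
from transformers import Conversation

chatbot = pipeline(task="conversational", model="facebook/blenderbot-400M-distill")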

In [None]:
initial_input = "Going to the movies tonight - any suggestions?"
conversation = Conversation(initial_input)
reply = chatbot(conversation)
print(reply)

# run a conversation for 5 turns
for step in range(5):
    new_input = input(">>User:")
    conversation.add_user_input(new_input)
    reply = chatbot(conversation)
    print(reply)

## Named Entity Recognition
Models trained to extract __entities__, i.e. people, places, organizations and so on. They are usually tagged with:

| Tag | Meaning |
|---|---|
| ORG | organization |
| LOC | location |
| PER | person |
| MISC | miscellaneous |

The tags also get prefixes, either I- or B-. I- is the most common; B- is used to distinguish between entities when several entities of the same type directly follow each other. A plain O marks tokens that are not part of any entity. This is how to interpret the full tags:

| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| B-MISC | Beginning of a miscellaneous entity right after another miscellaneous entity |
| I-MISC | Miscellaneous entity |
| B-PER | Beginning of a person's name right after another person's name |
| I-PER | Person's name |
| B-ORG | Beginning of an organization right after another organization |
| I-ORG | Organization |
| B-LOC | Beginning of a location right after another location |
| I-LOC | Location |
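
The model tags one token at a time, so a name that spans several tokens has to be stitched together afterwards using the prefixes in the table above. Here is a rough sketch of that grouping in plain Python. `group_entities` is our own hypothetical helper, and the tokens and tags are hand-written rather than real model output.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens of the same entity type into (text, type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # Only an I- tag of the same type continues the entity we are building
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            current_tokens.append(token)
            current_type = entity_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hand-written example tags, for illustration only
tokens = ["Filip", "Victor", "are", "in", "Halmstad"]
tags = ["I-PER", "B-PER", "O", "O", "I-LOC"]
print(group_entities(tokens, tags))
# [('Filip', 'PER'), ('Victor', 'PER'), ('Halmstad', 'LOC')]
```

Without the B- prefix, "Filip Victor" would have been merged into a single person. In practice you do not need to write this yourself: the pipeline accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into entities for you.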


The most common model on HuggingFace is in French for some reason, probably because it has a nice name - CamemBERT.
The default model is an English version of BERT, finetuned for NER. You can find other models [here](https://huggingface.co/models?pipeline_tag=token-classification&sort=downloads). This is an example of a _token classification task_.

In [None]:
ner_pipe = pipeline("ner")
# Calling the pipeline without a model argument makes it load a default model. In this case, this is equivalent to:
# ner_model = "dbmdz/bert-large-cased-finetuned-conll03-english"
# ner_pipe = pipeline(task="token-classification", model=ner_model)


In [None]:

sequence = """NordAxon is a company based in Malmö, currently Filip and Victor are in Halmstad with HighFive Halmstad"""

for entity in ner_pipe(sequence):
    print(entity)

## Question answering
A question answering model takes two input parameters: a _context_ and a _question_. It tries to answer the _question_ using information in the _context_.
You could for example pipe in an article as a context, and ask questions through the model. These models are mainly trained on answering simple fact-based questions.

[Other models available here](https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads)
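
An extractive question answering model does not write an answer; it points at a span in the context. For every token it produces a "start" score and an "end" score, and the answer is the span with the best combined score. The sketch below shows that selection in plain Python. The scores are made up, `best_span` is a hypothetical helper, and the real pipeline does more work (for instance mapping tokens back to character positions, which is where `start` and `end` in its result come from).

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (score, start, end) for the span with the highest start + end score."""
    best = None
    for start, start_score in enumerate(start_logits):
        # The end token may not come before the start token, and the span length is capped
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


# Made-up scores for a six-token context, for illustration only
context_tokens = ["the", "SQuAD", "dataset", "is", "an", "example"]
start_logits = [0.1, 4.0, 0.3, 0.0, 0.0, 0.2]
end_logits = [0.0, 1.0, 3.5, 0.1, 0.0, 0.3]

score, start, end = best_span(start_logits, end_logits)
print(" ".join(context_tokens[start:end + 1]))  # SQuAD dataset
```

Because the answer is always copied from the context, these models cannot answer questions whose answer is not literally written there.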


In [None]:
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

In [None]:
context = """
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")


## Summarization
Models trained to summarize longer texts. These kinds of models are becoming more and more common, especially in AI newsletters for some reason :thinking_face:

Here are [other available models](https://huggingface.co/models?pipeline_tag=summarization&sort=downloads). Many of them are trained on a dataset of news articles from [CNN and the Daily Mail](https://huggingface.co/datasets/cnn_dailymail). This dataset can be downloaded and interacted with through another HuggingFace package - [datasets](https://huggingface.co/docs/datasets/index) - if you want to play around with that as well.

In [None]:
summarizer = pipeline("summarization", "t5-base")

In [None]:
# the article below is a snapshot of the text in this wikipedia article, as collected on 2023-01-25: https://en.wikipedia.org/wiki/Artificial_Intelligence_Act

article = """The Artificial Intelligence Act (AI Act) is a regulation[1] proposed on 21 April 2021 by the European Commission which aims to introduce a common regulatory and legal framework for artificial intelligence.[2] Its scope encompasses all sectors (except for military), and to all types of artificial intelligence. As a piece of product regulation, the proposal does not confer rights on individuals, but regulates the providers of artificial intelligence systems, and entities making use of them in a professional capacity.

The proposed regulation classifies artificial intelligence applications by risk, and regulates them accordingly. Low-risk applications are not regulated at all, with Member States largely precluded via maximum harmonisation from regulating them further and existing national laws relating to the regulation of design or use of such systems disapplied.[3] A voluntary code of conduct scheme for such low risk systems is envisaged, although not present from the outset. Medium and high-risk systems would require compulsory conformity assessment, undertaken as self-assessment by the provider, before being put on the market. Some especially critical applications which already require conformity assessment to be supervised under existing EU law, for example for medical devices, would the provider's self-assessment under AI Act requirements to be considered by the notified body conducting the assessment under that regulation, such as the Medical Devices Regulation.

The proposal also would place prohibitions on certain types of applications, namely remote biometric recognition, applications that subliminally manipulate persons, applications that exploit vulnerabilities of certain groups in a harmful way, and social credit scoring. For the first three, an authorisation regime context of law enforcement is proposed, but social scoring would be banned completely.[4]

The act also proposes the introduction of a European Artificial Intelligence Board which will encourage national cooperation and ensure that the regulation is respected.[5]

Like the European Union's General Data Protection Regulation (GDPR), the AI Act could become a global standard.[6] It is already having impact beyond Europe; in September 2021, Brazil’s Congress passed a bill that creates a legal framework for artificial intelligence.[7]

The European Council adopted its general approach on the AI Act on 6 December 2022.[8] Germany supports the Council's position but still sees some need for further improvement as formulated in an accompanying statement by the member state.[9]

The EU AI Act is a proposal by the European Commission to regulate Artificial Intelligence (AI) in the EU. The goal is to create a framework to manage and mitigate risks of AI systems and build trust in them. The proposal includes a classification system for AI systems based on risk level and prioritizes the fundamental rights of individuals. The proposal has undergone changes, such as amendments from the Parliament and the French and Czech presidencies, with the aim to balance between protecting fundamental rights and promoting AI. """

summarizer(article)

## Automatic Speech Recognition (ASR) / Speech to text (STT) with Wav2Vec2


In [None]:
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import IPython.display as ipd

print("Audio Backend found:", torchaudio.get_audio_backend())
assert torchaudio.get_audio_backend() is not None, "No torchaudio audio backend found"

Loading a Swedish version of Wav2Vec2 ([VoxRex](https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish)) trained by KBLab (Kungliga biblioteket), together with the Common Voice dataset.

In [None]:
processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")

test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")
sample_rate = 16000
resampler = torchaudio.transforms.Resample(48_000, sample_rate)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

Using the model to transcribe some audio files and print the predicted transcription as well as the actual transcription (reference)

In [None]:
# The test_dataset contains 41 examples, change the first/last values below to get different examples
first = 0
last = 8

# Inference/Prediction
inputs = processor(test_dataset["speech"][first:last], sampling_rate=sample_rate, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
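
The `torch.argmax` call above picks the most likely token id for every short audio frame, which means the raw ids contain lots of repeats and "blank" tokens. For CTC models such as `Wav2Vec2ForCTC`, turning these ids into text roughly means: merge repeated ids, then drop the blanks. Here is a minimal sketch of that idea in plain Python. The tiny vocabulary and `blank_id=0` are made up for illustration, and the real processor also takes care of details such as word delimiters.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then remove the blank token."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[token_id] for token_id in collapsed if token_id != blank_id)


# Made-up vocabulary and per-frame predictions, for illustration only
id_to_char = {0: "<blank>", 1: "h", 2: "e", 3: "j"}
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3]

print(ctc_greedy_decode(frame_ids, id_to_char))  # hej
```

The blank token is what makes double letters possible: `[1, 1]` collapses to a single "h", while `[1, 0, 1]` decodes to "hh".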


Use the cell below to listen to the audio samples and verify that the transcription is correct! Change the variable *index* below to listen to other predicted samples.

In [None]:
index = 0
print(f"Example {index} of {len(inputs.input_values)} predicted (index [0,..,{len(inputs.input_values)-1}] available).")
print("Predicted transcription:", processor.batch_decode(predicted_ids)[index])
print("Reference transcription:", test_dataset["sentence"][first:last][index])

sample = inputs.input_values[index]
ipd.Audio(sample, rate=sample_rate)


Print the transcriptions of all predicted examples

In [None]:
print("Prediction (model):\n", processor.batch_decode(predicted_ids))
print("Reference:\n", test_dataset["sentence"][first:last])
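
Comparing predictions and references by eye works for a handful of examples. If you want a number instead, a common choice for speech recognition is the word error rate (WER): the number of word insertions, deletions and substitutions needed to turn the prediction into the reference, divided by the number of words in the reference. Below is a small plain-Python sketch; `word_error_rate` is our own helper and the example sentences are made up, so in a real project you would probably reach for an existing evaluation library instead.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref_words, pred_words = reference.split(), prediction.split()
    # previous[j] = edits needed to turn the first j predicted words into the reference words seen so far
    previous = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous[j - 1] + (ref_word != pred_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    # max(..., 1) avoids a division by zero for an empty reference
    return previous[-1] / max(len(ref_words), 1)


# Made-up example sentences
print(word_error_rate("jag heter filip", "jag heter filip"))   # 0.0
print(word_error_rate("jag heter filip", "jag heter philip"))  # one substitution out of three words
```

A WER of 0 means a perfect transcription. Note that casing and punctuation count as differences here, so both texts are usually normalised before comparing.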