<h1> Introduction to Hugging Face </h1>

Hugging Face is the world's largest provider of open-source machine learning software. It offers implementations of a wide variety of models, along with utilities such as a data processing library. Many of these are PyTorch models, like the one you built in the notebook from the kick-off weekend. That means you can use them exactly as you would normal PyTorch models, and train them exactly as in the kick-off notebook.

We will now look at solving a variety of tasks with Hugging Face, including text summarization and image generation. Specifically, we will use the libraries Transformers and Diffusers.

Anyone can train a model and upload it to Hugging Face, and companies such as Meta and Microsoft have done so. You can browse the available models at https://huggingface.co/models. After this tutorial you will be able to use these models, as well as fine-tune them on a specific task that you want to solve.

In [29]:
from urllib.request import urlretrieve
from transformers import pipeline
from datasets import Dataset

<h2> Using a pretrained model to summarize text </h2>

We are now going to use a large pretrained transformer to generate text summaries. We are going to use a model from Facebook called bart-large-cnn. Here is a link to the model documentation: https://huggingface.co/facebook/bart-large-cnn.

In [30]:
model = "facebook/bart-large-cnn"

# Create a summarizer
summarizer = ...

# Find some long English text online and put it here
text = ...

# Generate the summary
summary = ...

# Print the summary
summary

Ellipsis
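
As a point of comparison for the cell above, here is a minimal sketch of one way to fill in the steps. It assumes the installed Transformers version provides the `summarization` pipeline task. The helper names `build_summarizer` and `summarize` and the length limits are our own illustrative choices, not part of the library.

```python
def build_summarizer(model_name="facebook/bart-large-cnn"):
    """Create a summarization pipeline for the given model name."""
    # Imported here so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline("summarization", model=model_name)


def summarize(summarizer, text, max_length=130, min_length=30):
    """Return the summary string for a single piece of text.

    max_length and min_length are illustrative defaults (assumptions).
    """
    # The pipeline returns a list with one dict per input text.
    result = summarizer(
        text, max_length=max_length, min_length=min_length, do_sample=False
    )
    return result[0]["summary_text"]
```

Usage: `summarizer = build_summarizer(model)` followed by `summarize(summarizer, text)`. Building the pipeline downloads the model weights the first time, so expect it to take a while.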

It really is that simple to solve many machine learning tasks when we have a large open-source library like Transformers. Companies such as Meta and Google have spent considerable resources building and training these models, and we can conveniently use them. However, the models are usually trained on a general objective, so when we are dealing with a specific task, we should fine-tune them.

Let's see how well the model performs on Norwegian text. Find some Norwegian text online and paste it below:

In [31]:
# Paste some Norwegian text
text = ...

# Generate the summary
summary = ...

# Print the summary
summary

Ellipsis

<h2> More tasks with Diffusers and Transformers </h2>

Now we will solve some new tasks with the Hugging Face libraries, to showcase their versatility and power. We have already tried the Transformers library, so now we will look at the Diffusers library as well. Diffusers is all about generating images, audio, and similar data with diffusion models. More info about Diffusers: https://huggingface.co/docs/diffusers/index.

<h3> Here are some tasks (all can be solved with Hugging Face): </h3>

- Generate an image of a skateboarding turtle (keywords: text to image, stable diffusion).
- Find a news article online, cut away the last half, and then regenerate the last half with a transformer. Compare how similar the real and generated parts are (keywords: next word prediction). 
- Download some image online, remove 30% of it, and regenerate the 30% (keywords: diffusion inpainting).

These tasks may sound a little crazy, but Hugging Face does the heavy lifting. You don't need to train any models, since all three tasks can be solved with pretrained ones. Also, Google is your friend.

In [32]:
# Code the turtle generator here
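
If you get stuck, here is a rough sketch of how the turtle task could look with Diffusers. It assumes `AutoPipelineForText2Image` is available in your Diffusers version. The function name `generate_image` is hypothetical, and which checkpoint to pass as `model_id` is deliberately left open, so pick a text-to-image model from the Hub.

```python
def generate_image(prompt, model_id, **generation_kwargs):
    """Generate one image from a text prompt with a pretrained pipeline.

    model_id is the Hub name of a text-to-image checkpoint of your choice.
    Extra keyword arguments are passed straight to the pipeline call.
    """
    # Imported here so that defining the helper does not load the library.
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(model_id)
    # The pipeline runs on the CPU by default, which is slow.
    # With a GPU available, it can be moved with pipe.to("cuda").
    result = pipe(prompt=prompt, **generation_kwargs)
    # The result holds a list of PIL images; we return the first one.
    return result.images[0]
```

A call would look like `generate_image("a turtle riding a skateboard", model_id=...)`, and the returned image can be saved with its `save` method. Some checkpoints expect specific generation settings, so check the model card of the one you choose.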

In [33]:
# Code the news completion here
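
One possible way to structure the news completion task is sketched below. The three helper names are hypothetical. The word-level `difflib` ratio is only one simple stand-in for "how similar" (an assumption on our part), and `generator` is expected to behave like a Transformers `text-generation` pipeline.

```python
from difflib import SequenceMatcher


def split_in_half(text):
    """Split a text into its first and second half, counted in words."""
    words = text.split()
    middle = len(words) // 2
    return " ".join(words[:middle]), " ".join(words[middle:])


def complete_article(first_half, generator, max_new_tokens=200):
    """Continue the first half of an article with a text-generation pipeline."""
    # return_full_text=False keeps only the newly generated continuation.
    output = generator(
        first_half, max_new_tokens=max_new_tokens, return_full_text=False
    )
    return output[0]["generated_text"]


def similarity(real, generated):
    """Word-level similarity between 0.0 (disjoint) and 1.0 (identical)."""
    return SequenceMatcher(None, real.split(), generated.split()).ratio()
```

With `generator = pipeline("text-generation", model="gpt2")` (the model choice is an assumption, and any causal language model on the Hub should work), you could run `first, second = split_in_half(article)`, then `similarity(second, complete_article(first, generator))`. Very long articles may exceed the model's context window and need to be shortened first.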

In [34]:
# Code the image inpainting here
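
For the inpainting task, the main extra ingredient is a mask that marks the 30% to regenerate. The sketch below assumes Pillow is installed and that `AutoPipelineForInpainting` exists in your Diffusers version. All function names are hypothetical, removing the right-hand side of the image is an arbitrary choice, and the checkpoint (`model_id`) is left for you to pick.

```python
def right_side_box(width, height, fraction=0.3):
    """Return (left, top, right, bottom) for the right-hand part of an image."""
    left = width - int(width * fraction)
    return (left, 0, width, height)


def make_mask(image, fraction=0.3):
    """Build a mask image: white marks the region to regenerate, black is kept."""
    from PIL import Image, ImageDraw

    mask = Image.new("L", image.size, 0)  # start fully black
    box = right_side_box(image.size[0], image.size[1], fraction)
    ImageDraw.Draw(mask).rectangle(box, fill=255)  # paint the region white
    return mask


def inpaint(image, mask, prompt, model_id):
    """Regenerate the masked region with a pretrained inpainting pipeline."""
    from diffusers import AutoPipelineForInpainting

    pipe = AutoPipelineForInpainting.from_pretrained(model_id)
    result = pipe(prompt=prompt, image=image, mask_image=mask)
    return result.images[0]
```

After loading a picture with `PIL.Image.open`, you would call `mask = make_mask(image)` and then `inpaint(image, mask, prompt, model_id)`. The prompt describes what should appear in the removed region.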

<h2> Finetuning </h2>



You will notice that most of the models uploaded to Hugging Face have been trained on a general task that may not correspond exactly to what you want to do. To leverage what these models have learnt during large-scale pretraining, we can fine-tune them. This means that we take the pretrained model (its architecture and weights) and continue training it on our selected task.

Let's return to the summarization model. How did it perform on Norwegian text? Maybe it wasn't that bad. Still, it can be improved, so let's fine-tune it on a large dataset of Norwegian text.

Let's start by fetching the data:

In [35]:
train_url = "https://huggingface.co/datasets/NbAiLab/norwegian-xsum/resolve/main/nob/train/data-00000-of-00001.arrow?download=true"
valid_url = "https://huggingface.co/datasets/NbAiLab/norwegian-xsum/resolve/main/nob/validation/data-00000-of-00001.arrow?download=true"
test_url = "https://huggingface.co/datasets/NbAiLab/norwegian-xsum/resolve/main/nob/test/data-00000-of-00001.arrow?download=true"
train_path, valid_path, test_path = "train.arrow", "valid.arrow", "test.arrow"

for p, url in [(train_path, train_url), (valid_path, valid_url), (test_path, test_url)]:
    urlretrieve(url, p)

train_data = Dataset.from_file(train_path)
validation_data = Dataset.from_file(valid_path)
test_data = Dataset.from_file(test_path)

Let's look at the first example:

In [36]:
train_data[0].keys()

dict_keys(['document', 'summary', 'id'])

In [37]:
train_data[0]["id"]

'35232142'

In [38]:
train_data[0]["document"]

'Det er fortsatt pågående reparasjonsarbeid i Hawick og mange veier i Peeblesshire er fortsatt sterkt påvirket av stillestående vann. Tog på vestkysten har blitt forstyrret på grunn av skader ved Lamington Viaduct. Mange bedrifter og husholdere ble rammet av flom i Newton Stewart etter at elven Cree overvannet i byen. Jeanette Tate, som eier Cinnamon Cafe, som ble sterkt rammet, sa at hun ikke kunne klandre multi-agenturresponsen etter at flommen rammet. Hun sa imidlertid at mer forebyggende arbeid kunne blitt utført for å sikre at støttemuren ikke sviktet. "Det er vanskelig, men jeg tror det er så mye publisitet for Dumfries og Nith - og jeg setter helt pris på det - men det er nesten som om vi er forsømmet eller glemt", sa hun. "Det er kanskje ikke sant, men det er kanskje mitt perspektiv de siste dagene. "Hvorfor var du ikke klar til å hjelpe oss litt mer når advarselen og alarmen var ute?" I mellomtiden er det fortsatt en overflodsalarm på tvers av grensen på grunn av den konstante

In [39]:
train_data[0]["summary"]

'Rengjøringsoperasjoner fortsetter over de skotske grensene og Dumfries og Galloway etter oversvømmelser forårsaket av Storm Frank.'

<h3> Now we leave the tuning part to you. Some things you will need: </h3>

1. A tokenizer that converts the text into tokens. Tokens roughly correspond to words, but a word is sometimes split into multiple tokens, and each token is represented by an integer. We recommend you print the output of the tokenizer to inspect how it works.
2. A data preprocessing function
3. A function to compute metrics and evaluate the summarizer (you can look into the ROUGE score)
4. A training function (you can use a Hugging Face API, or train it in the PyTorch way with a function very similar to what you built in the last notebook)
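
To make items 1 and 2 in the list above more concrete, here is a minimal sketch of a preprocessing function. It assumes a Hugging Face tokenizer that accepts the `text_target` argument. The name `make_preprocess_fn` and the two length limits are our own assumptions, while the column names `document` and `summary` come from the dataset shown earlier.

```python
def make_preprocess_fn(tokenizer, max_input_length=512, max_target_length=64):
    """Return a function that tokenizes a batch of documents and summaries.

    The length limits are illustrative values, not tuned for this dataset.
    """

    def preprocess(examples):
        # Tokenize the articles that the model reads as input.
        model_inputs = tokenizer(
            examples["document"], max_length=max_input_length, truncation=True
        )
        # Tokenize the reference summaries that the model should produce.
        labels = tokenizer(
            text_target=examples["summary"],
            max_length=max_target_length,
            truncation=True,
        )
        # The training loop looks for the target token ids under "labels".
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    return preprocess
```

With `tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")`, the function can be applied to a whole split using `train_data.map(make_preprocess_fn(tokenizer), batched=True)`. Printing `tokenizer("Hei på deg")` first is a quick way to see the integer token ids.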

This exercise is meant to make you familiar with working independently with Hugging Face libraries. That usually means Googling and looking through Hugging Face documentation. Here is a good resource: https://huggingface.co/learn/nlp-course/chapter7/5?fw=pt. 
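
To give a feel for what the ROUGE score in item 3 measures, here is a simplified, pure-Python sketch of ROUGE-1 (unigram overlap). It lowercases and splits on whitespace, with no stemming, which is an assumption on our part. For real evaluation you would likely use an established ROUGE implementation instead. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Compute simplified ROUGE-1 precision, recall and F1 for one pair."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count words shared by both texts, respecting how often each occurs.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `rouge1("the cat sat", "the cat sat on the mat")` gives a precision of 1.0 and a recall of 0.5, because every predicted word appears in the reference but only half of the reference words are covered.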

In [40]:
# Code it here
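
One possible shape for the training step is sketched below with the Hugging Face `Seq2SeqTrainer` API. It is not the only option, and training the PyTorch way works too. The function name `finetune_summarizer`, the output directory and all hyperparameter values are placeholders we made up for illustration. The datasets passed in are assumed to be already tokenized into `input_ids` and `labels`.

```python
def finetune_summarizer(
    train_dataset,
    eval_dataset,
    model_name="facebook/bart-large-cnn",
    output_dir="bart-norwegian-xsum",  # hypothetical folder name
):
    """Continue training a pretrained summarization model on tokenized data."""
    # Imported here so that defining the helper does not load the library.
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # The collator pads inputs and labels to equal length within each batch.
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    # Placeholder hyperparameters (assumptions), to be tuned for your hardware.
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=2e-5,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        predict_with_generate=True,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=collator,
    )
    trainer.train()
    return trainer
```

Fine-tuning a model of this size realistically needs a GPU. Starting with a small subset of the training data, for example `train_data.select(range(1000))` before tokenizing, keeps the first experiments short.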