A 🤗 tour of transformer applications
[GitHub Link](https://github.com/huggingface/workshops/tree/main/nlp-zurich)

# Practical Part 1

## Pipeline

The high-level `pipeline` API makes it easy to experiment with different models across a wide range of tasks. It takes care of all preprocessing and returns cleaned-up predictions. Pipelines are primarily used for inference, where we apply fine-tuned models to new examples.

<img src="https://github.com/huggingface/workshops/blob/main/nlp-zurich/images/pipeline.png?raw=1" alt="Alt text that describes the graphic" title="Title text" width=800>

## Setup

Install the dependencies 

In [None]:
# 🤗 Datasets is the Hugging Face library for downloading and processing datasets
# Link: https://huggingface.co/docs/datasets/index
!pip install datasets

In [None]:
# 🤗 Transformers is the library used to download and run models from the Hugging Face Hub
# Link: https://huggingface.co/docs/transformers/index
!pip install transformers

A textwrapper to format long texts nicely

In [None]:
import textwrap
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)

## Classification

We start by setting up an example text that we would like to analyze with a transformer model. This looks like your standard customer feedback from a transformer:

In [None]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure
from your online store in Germany. Unfortunately, when I opened the package, 
I discovered to my horror that I had been sent an action figure of Megatron 
instead! As a lifelong enemy of the Decepticons, I hope you can understand my 
dilemma. To resolve the issue, I demand an exchange of Megatron for the 
Optimus Prime figure I ordered. Enclosed are copies of my records concerning 
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

print(wrapper.fill(text))

Dear Amazon, last week I ordered an Optimus Prime action figure from your online
store in Germany. Unfortunately, when I opened the package, I discovered to my
horror that I had been sent an action figure of Megatron instead! As a lifelong
enemy of the Decepticons, I hope you can understand my dilemma. To resolve the
issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered.
Enclosed are copies of my records concerning this purchase. I expect to hear
from you soon. Sincerely, Bumblebee.


One of the most common tasks in NLP and especially when dealing with customer texts is _sentiment analysis_. We would like to know if a customer is satisfied with a service or product and potentially aggregate the feedback across all customers for reporting.

For text classification the model gets all the inputs and makes a single prediction as shown in the following example:

<img src="https://github.com/huggingface/workshops/blob/main/nlp-zurich/images/clf_arch.png?raw=1" alt="Alt text that describes the graphic" title="Title text" width=600>

We can achieve this by setting up a `pipeline` object which wraps a transformer model. When initializing it we need to specify the task. Sentiment analysis is a form of text classification in which a single label is assigned to each input text.

### Version 1

In [None]:
from transformers import pipeline

sentiment_pipeline = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


We can see a warning message: we did not specify in the pipeline which model we would like to use. In that case it loads a default model. The `distilbert-base-uncased-finetuned-sst-2-english` model is a small BERT variant trained on [SST-2](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary) which is a sentiment analysis dataset.

You'll notice that the first time you run the pipeline, the model is downloaded from the 🤗 Hub. On later runs the cached copy is used instead.

Now we are ready to run our example through the pipeline and look at some predictions:

In [None]:
sentiment_pipeline([text, "I like the bright and sunny day"])

[{'label': 'NEGATIVE', 'score': 0.9015460014343262},
 {'label': 'POSITIVE', 'score': 0.9998828172683716}]

The model predicts negative sentiment for our complaint with high confidence, which makes sense. You can see that the pipeline returns a list of dicts with the predictions. Because we passed several texts at once, the list contains one dict per text, in the same order as the inputs.
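To make the "aggregate the feedback" idea concrete, here is a minimal sketch of how the list of dicts could be paired with the input texts and counted per label. The helper name `summarize` and the `min_score` threshold are illustrative assumptions, not part of 🤗 Transformers, and the predictions are copied from the output above so that no model is needed to run it.

```python
from collections import Counter

# Predictions copied from the pipeline output above (no model required here).
predictions = [
    {"label": "NEGATIVE", "score": 0.9015460014343262},
    {"label": "POSITIVE", "score": 0.9998828172683716},
]
# Placeholder for the long complaint letter, followed by the second input text.
texts = ["(the complaint letter)", "I like the bright and sunny day"]


def summarize(texts, predictions, min_score=0.5):
    """Pair each text with its prediction and count labels above a threshold."""
    rows = [
        {"text": t, "label": p["label"], "score": p["score"]}
        for t, p in zip(texts, predictions)
    ]
    counts = Counter(r["label"] for r in rows if r["score"] >= min_score)
    return rows, counts


rows, counts = summarize(texts, predictions)
print(counts)
```

This relies on the predictions coming back in the same order as the inputs, which is how the output above is laid out.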

### Version 2

In [None]:
from transformers import pipeline

finetuned_checkpoint = "lewtun/xlm-roberta-base-finetuned-marc-en"
classifier = pipeline("text-classification", model=finetuned_checkpoint)

In [None]:
classifier([text, "I like the bright and sunny day"])

[{'label': 'terrible', 'score': 0.3314688801765442},
 {'label': 'good', 'score': 0.5771137475967407}]

The labels are now more fine-grained than those of the previous model: instead of `POSITIVE` or `NEGATIVE`, this checkpoint predicts rating-style labels such as `terrible` or `good`.
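The scores here are lower than in Version 1, which is expected when the probability mass is spread over five labels instead of two. The sketch below illustrates this under the assumption that the pipeline turns the model's raw outputs (logits) into scores with a softmax, which is the usual default for multi-class heads. The logits are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Label names as used by the fine-tuned checkpoint; logits are hypothetical.
label_names = ["terrible", "poor", "ok", "good", "great"]
logits = [2.0, 1.5, 0.5, -0.5, -1.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": label_names[best], "score": probs[best]}
print(prediction)
```

Even though `terrible` clearly has the largest logit, its score stays well below 1 because the neighbouring label `poor` also receives a sizeable share.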

# Practical Part 2

If loading a model inside a web application is not feasible, Hugging Face provides `InferenceApi` as an alternative: the model runs on Hugging Face's servers and the application queries it remotely.

Gradio is a package provided by Hugging Face for building web applications on top of such models.

In [None]:
!pip install huggingface_hub

In [None]:
!pip install gradio



In [None]:
label2emoji = {"terrible": "💩", "poor": "😾", "ok": "🐱", "good": "😺", "great": "😻"}
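The notebook does not show where `label2emoji` is used. One plausible use, sketched here as an assumption, is to decorate a predicted label before showing it in the web interface. The helper `format_prediction` is hypothetical. It returns two strings so that it would fit an interface with a "Label" and a "Score" text output.

```python
label2emoji = {"terrible": "💩", "poor": "😾", "ok": "🐱", "good": "😺", "great": "😻"}


def format_prediction(prediction, label2emoji=label2emoji):
    """Turn a {'label', 'score'} dict into (display label, rounded score string)."""
    label = prediction["label"]
    emoji = label2emoji.get(label, "")  # labels without an entry get no emoji
    return f"{label} {emoji}".strip(), f"{prediction['score']:.3f}"


# Prediction copied from the Version 2 output earlier in the notebook.
print(format_prediction({"label": "good", "score": 0.5771137475967407}))
```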

In [None]:
from huggingface_hub import InferenceApi
import gradio as gr

In [None]:
gradio_ui = gr.Interface.load(
    name="lewtun/xlm-roberta-base-finetuned-marc-en",
    src="huggingface",
    fn=inference_predict,
    title="Review analysis",
    description="Enter some text and check if the model detects its star rating.",
    inputs=[
        gr.inputs.Textbox(lines=5, label="Paste some text here"),
    ],
    outputs=[
        gr.outputs.Textbox(label="Label"),
        gr.outputs.Textbox(label="Score"),
    ],
    examples=[
        ["I love these running shoes"], ["J'adore ces chaussures de course"], ["Ich liebe diese Laufschuhe"]
    ],
)

gradio_ui.launch()

  "Usage of gradio.inputs is deprecated, and will not be supported in the future, please import your component from gradio.components",
  "Usage of gradio.outputs is deprecated, and will not be supported in the future, please import your components from gradio.components",


Fetching model from: https://huggingface.co/lewtun/xlm-roberta-base-finetuned-marc-en
Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Running on public URL: https://22637.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces: https://huggingface.co/spaces


(<gradio.routes.App at 0x7f3967b924d0>,
 'http://127.0.0.1:7860/',
 'https://22637.gradio.app')

# Practical Part 3

## Download the dataset

We will be using the [Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi) (or MARC for short). This is a large-scale collection of Amazon product reviews in several languages: English, Japanese, German, French, Spanish, and Chinese.

We can download the dataset from the Hugging Face Hub with the 🤗 Datasets library.

In [None]:
from datasets import get_dataset_config_names

dataset_name = "amazon_reviews_multi"
langs = get_dataset_config_names(dataset_name)
langs

We can see the language code associated with each language, as well as an `all_languages` subset which presumably concatenates all the languages together. Let's begin by downloading the **English** subset with the `load_dataset()` function from 🤗 Datasets:

In [None]:
from datasets import load_dataset

marc_en = load_dataset(path=dataset_name, name="en")
marc_en

🤗 Datasets `load_dataset()` caches the files at `~/.cache/huggingface/datasets/`, so the dataset does not need to be downloaded again the next time you run the notebook. We can see that `marc_en` is a `DatasetDict` object which is similar to a Python dictionary, with each key corresponding to a different split.

We can access one of these splits just like an ordinary dictionary:

In [None]:
train_ds = marc_en["train"]
train_ds

Dataset({
    features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
    num_rows: 200000
})

This returns a `Dataset` object which behaves like a Python container, so we can query its length:

In [None]:
len(train_ds)

200000

In [None]:
train_ds[0]

{'review_id': 'en_0964290',
 'product_id': 'product_en_0740675',
 'reviewer_id': 'reviewer_en_0342986',
 'stars': 1,
 'review_body': "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.",
 'review_title': "I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review ...",
 'language': 'en',
 'product_category': 'furniture'}

This certainly looks like an Amazon product review (in this case the `review_body`) and we can see the number of stars associated with the review, as well as some metadata like the language and product category. We can also see that a single row is represented as a dictionary, where the keys are the same as the column names:

In [None]:
train_ds.column_names

['review_id',
 'product_id',
 'reviewer_id',
 'stars',
 'review_body',
 'review_title',
 'language',
 'product_category']

We can also access several rows with a slice:

In [None]:
train_ds[:3]

{'review_id': ['en_0964290', 'en_0690095', 'en_0311558'],
 'product_id': ['product_en_0740675',
  'product_en_0440378',
  'product_en_0399702'],
 'reviewer_id': ['reviewer_en_0342986',
  'reviewer_en_0133349',
  'reviewer_en_0152034'],
 'stars': [1, 1, 1],
 'review_body': ["Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.",
  'the cabinet dot were all detached from backing... got me',
  "I received my first order of this product and

and note that now we get a list of values for each column. This is because 🤗 Datasets is based on Apache Arrow, which defines a typed columnar format that is very memory efficient. We can see the types that are used to represent our dataset by accessing the `features` attribute:

In [None]:
train_ds.features

{'review_id': Value(dtype='string', id=None),
 'product_id': Value(dtype='string', id=None),
 'reviewer_id': Value(dtype='string', id=None),
 'stars': Value(dtype='int32', id=None),
 'review_body': Value(dtype='string', id=None),
 'review_title': Value(dtype='string', id=None),
 'language': Value(dtype='string', id=None),
 'product_category': Value(dtype='string', id=None)}
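Because a slice comes back column by column, it is sometimes handy to turn it into one dictionary per row again. The following is a minimal pure-Python sketch of that conversion. `columns_to_rows` is a hypothetical helper, not a 🤗 Datasets function, and the batch is a trimmed copy of the slice shown earlier.

```python
def columns_to_rows(batch):
    """Convert a dict of equal-length lists into a list of per-row dicts."""
    names = list(batch)
    columns = (batch[name] for name in names)
    return [dict(zip(names, values)) for values in zip(*columns)]


# Two of the columns from the three-row slice shown earlier.
batch = {
    "review_id": ["en_0964290", "en_0690095", "en_0311558"],
    "stars": [1, 1, 1],
}

rows = columns_to_rows(batch)
print(rows[0])
```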

Now that we've had a quick look at the objects in 🤗 Datasets, let's explore the data with Pandas!

## From Datasets to DataFrames and back

In [None]:
from IPython.display import display, HTML

marc_en.set_format("pandas")
df = marc_en["train"][:]
# Create a random sample
sample = df.sample(n=5, random_state=42)
display(HTML(sample.to_html()))

| | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category |
|---|---|---|---|---|---|---|---|---|
| 119737 | en_0522546 | product_en_0681589 | reviewer_en_0687817 | 3 | Not strong enough to run a small 120v vacuum cleaner, to clean car. | Not strong enough to run a small 120v vacuum cleaner ... | en | lawn_and_garden |
| 72272 | en_0612910 | product_en_0295449 | reviewer_en_0312138 | 2 | The leg openings are a little small, but other than that the suit fits nicely, and is high quality material. Edit: I have been wearing this for less than two months and it is 100% worn out. It has worn so thin in multiple spots that it’s no longer appropriate for wearing in public, I have to throw it away. This is unacceptable. | Crap | en | apparel |
| 158154 | en_0983065 | product_en_0295095 | reviewer_en_0927618 | 4 | Really cute mug. I would have given 5 stars if it were a bit bigger. | Four Stars | en | kitchen |
| 65426 | en_0206761 | product_en_0563487 | reviewer_en_0936741 | 2 | Well it’s looks and feels okay but it most certainly does not have 4 pockets that’s a lie it has 3 so that’s pretty messed up to say it has 4 when it’s only 3 the fabric is super stiff hopefully after washing it will be better | Lies!! | en | industrial_supplies |
| 30074 | en_0510474 | product_en_0704805 | reviewer_en_0417600 | 1 | Very, very thin, you can bend them with you fingers with no problem! Print is small.. More of a decoration. Would give 1/2 star! | Thin and bendable :( | en | pet_products |


We can see that the column headers are the same as we saw in the Arrow format and from the reviews we can see that negative reviews are associated with a lower star rating. Since we're now dealing with a `pandas.DataFrame` we can easily query our dataset. For example, let's see what the distribution of reviews per product category looks like: 

In [None]:
df["product_category"].value_counts()

home                        17679
apparel                     15951
wireless                    15717
other                       13418
beauty                      12091
drugstore                   11730
kitchen                     10382
toy                          8745
sports                       8277
automotive                   7506
lawn_and_garden              7327
home_improvement             7136
pet_products                 7082
digital_ebook_purchase       6749
pc                           6401
electronics                  6186
office_product               5521
shoes                        5197
grocery                      4730
book                         3756
baby_product                 3150
furniture                    2984
jewelry                      2747
camera                       2139
industrial_supplies          1994
digital_video_download       1364
luggage                      1328
musical_instruments          1102
video_games                   775
watch         

Okay, the `home`, `apparel`, and `wireless` categories seem to be the most popular. How about the distribution of star ratings?

In [None]:
df["stars"].value_counts()

1    40000
2    40000
3    40000
4    40000
5    40000
Name: stars, dtype: int64

In this case we can see that the dataset is balanced across each star rating, which will make it somewhat easier to evaluate our models on. Imbalanced datasets are much more common in the real-world and in these cases some additional tricks like up- or down-sampling are usually needed.
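For readers who have not met down-sampling before, here is a minimal sketch of the idea on made-up data: every class is randomly reduced to the size of the rarest one. The helper `downsample` and the toy reviews are assumptions for illustration. Our dataset is already balanced, so this step is not needed in this notebook.

```python
import random
from collections import Counter, defaultdict


def downsample(examples, key, seed=42):
    """Randomly drop examples so every class is as large as the rarest one."""
    by_class = defaultdict(list)
    for example in examples:
        by_class[example[key]].append(example)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy, made-up data: far more 5-star than 1-star reviews.
toy = [{"stars": 5}] * 90 + [{"stars": 1}] * 10
balanced = downsample(toy, key="stars")
print(Counter(example["stars"] for example in balanced))
```

Down-sampling throws data away, so up-sampling (repeating examples from the rare classes) is the usual alternative when the rare classes are very small.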

Now that we've got a rough idea about the kind of data we're dealing with, let's reset the output format from `pandas` back to `arrow`:

In [None]:
marc_en.reset_format()

Although we could go ahead and fine-tune a Transformer model on the whole set of 200,000 English reviews, this will take several hours on a single GPU. So instead, we'll focus on fine-tuning a model for a single product category! In 🤗 Datasets, we can filter data very quickly by using the `Dataset.filter()` method. This method expects a function that returns Boolean values, in our case `True` if the `product_category` matches the chosen category and `False` otherwise. Here's one way to implement this, and we'll pick the `home` category as the domain to train on:

In [None]:
product_category = "home"

def filter_for_product(example, product_category=product_category):
    return example["product_category"] == product_category

Now when we pass `filter_for_product()` to `Dataset.filter()` we get a filtered dataset:

In [None]:
product_dataset = marc_en.filter(filter_for_product)
product_dataset

  0%|          | 0/200 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 17679
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 390
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 440
    })
})

We have 17,679 reviews in the train split, which agrees with the number we saw in the distribution of categories earlier. Let's do a quick sanity check by taking a look at a few samples. Here 🤗 Datasets provides `Dataset.shuffle()` and `Dataset.select()` functions that we can chain to get a random sample:

In [None]:
product_dataset["train"].shuffle(seed=42).select(range(3))[:]

{'review_id': ['en_0371222', 'en_0284443', 'en_0039575'],
 'product_id': ['product_en_0413998',
  'product_en_0716910',
  'product_en_0459841'],
 'reviewer_id': ['reviewer_en_0440379',
  'reviewer_en_0721859',
  'reviewer_en_0743935'],
 'stars': [1, 5, 2],
 'review_body': ['Poor customer service received broken just given the runaround poor value find another company',
  "Extremely pleased. My desktop fan just didn't put out enough to keep me cool in an extremely hot office. I'm very happy this with fan. More bang for the buck, doesn't take up much space, and more importantly, keeps me cool!",
  "Really disappointed... The elephant print looks nothing like the photo... It's vertically distorted and gets cut off at the sides."],
 'review_title': ['Poor customer service purchase from a different company',
  'Relief At Last!',
  'Really disappointed..'],
 'language': ['en', 'en', 'en'],
 'product_category': ['home', 'home', 'home']}
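To see what the predicate handed to `Dataset.filter()` does in isolation, the sketch below applies `filter_for_product()` to a few plain dictionaries. The review texts are made up. Note that the default value of `product_category` is fixed when the function is defined, so changing the global variable later does not change the filter, while passing the argument explicitly does.

```python
product_category = "home"


def filter_for_product(example, product_category=product_category):
    return example["product_category"] == product_category


# Toy rows with made-up review text; only the category matters here.
examples = [
    {"review_body": "Great lamp", "product_category": "home"},
    {"review_body": "Too tight", "product_category": "apparel"},
    {"review_body": "Sturdy shelf", "product_category": "home"},
]

# Keep only the rows for which the predicate returns True.
kept = [ex for ex in examples if filter_for_product(ex)]
print(len(kept))

# The default can be overridden to select a different category.
apparel = [ex for ex in examples if filter_for_product(ex, product_category="apparel")]
print(apparel[0]["review_body"])
```

With 🤗 Datasets, the same override should be possible by passing `fn_kwargs={"product_category": "apparel"}` to `filter()`.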

Okay, now that we have our corpus of home reviews, let's do one last bit of data preparation: creating label mappings from star ratings to human readable strings.

## Mapping the labels

During training, 🤗 Transformers expects the labels to be consecutive integers, running from 0 to N-1 for N classes. But we've seen that our star ratings range from 1 to 5, so let's fix that. While we're at it, we'll create a mapping between the label IDs and names, which will be handy later on when we want to run inference with our model. First we'll define the label mapping from ID to name:

In [None]:
label_names = ["terrible", "poor", "ok", "good", "great"]
id2label = {idx:label for idx, label in enumerate(label_names)}
id2label

{0: 'terrible', 1: 'poor', 2: 'ok', 3: 'good', 4: 'great'}

We can then apply this mapping to our whole dataset by using the `Dataset.map()` method. Similar to the `Dataset.filter()` method, this one expects a function which receives examples as input, but returns a Python dictionary as output. The keys of the dictionary correspond to the columns, while the values correspond to the column entries. The following function creates two new columns:

* A `labels` column which is the star rating shifted down by one
* A `label_name` column which provides a nice string for each rating

In [None]:
def map_labels(example):
    # Shift labels to start from 0
    label_id = example["stars"] - 1
    return {"labels": label_id, "label_name": id2label[label_id]}

To apply this mapping, we simply feed it to `Dataset.map` as follows:

In [None]:
product_dataset = product_dataset.map(map_labels)
# Peek at the first example
product_dataset["train"][0]

  0%|          | 0/17679 [00:00<?, ?ex/s]

  0%|          | 0/390 [00:00<?, ?ex/s]

  0%|          | 0/440 [00:00<?, ?ex/s]

{'review_id': 'en_0311558',
 'product_id': 'product_en_0399702',
 'reviewer_id': 'reviewer_en_0152034',
 'stars': 1,
 'review_body': "I received my first order of this product and it was broke so I ordered it again. The second one was broke in more places than the first. I can't blame the shipping process as it's shrink wrapped and boxed.",
 'review_title': 'The product is junk.',
 'language': 'en',
 'product_category': 'home',
 'labels': 0,
 'label_name': 'terrible'}

Great, it works! We'll also need the reverse label mapping later, so let's define it here:

In [None]:
label2id = {v:k for k,v in id2label.items()}
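As a quick sanity check, the two mappings should be exact inverses of each other. The sketch below repeats the definitions so it runs on its own and converts every star rating to a label name and back. The helper names `stars_to_label` and `label_to_stars` are illustrative and not used elsewhere in the notebook.

```python
label_names = ["terrible", "poor", "ok", "good", "great"]
id2label = {idx: label for idx, label in enumerate(label_names)}
label2id = {label: idx for idx, label in id2label.items()}


def stars_to_label(stars):
    """Map a 1-5 star rating to its (label ID, label name) pair."""
    label_id = stars - 1  # shift so that labels start from 0
    return label_id, id2label[label_id]


def label_to_stars(label_name):
    """Invert the mapping: label name back to a 1-5 star rating."""
    return label2id[label_name] + 1


# Every rating should survive the round trip unchanged.
for stars in range(1, 6):
    _, name = stars_to_label(stars)
    assert label_to_stars(name) == stars
```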

## Using a fine-tuned model

[Pipelines](https://huggingface.co/docs/transformers/v4.21.3/en/main_classes/pipelines#transformers.TextClassificationPipeline)

In [None]:
from transformers import pipeline

finetuned_checkpoint = "lewtun/xlm-roberta-base-finetuned-marc-en"
classifier = pipeline("text-classification", model=finetuned_checkpoint)

Downloading config.json:   0%|          | 0.00/976 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.04G [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/398 [00:00<?, ?B/s]

Downloading sentencepiece.bpe.model:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Now that we have our classification pipeline, we'll use `Dataset.map()` to apply it to each example in the validation set. For that we'll need a small function to create a new column of predictions:

In [None]:
product_dataset["validation"][0]

{'review_id': 'en_0423367',
 'product_id': 'product_en_0698375',
 'reviewer_id': 'reviewer_en_0985888',
 'stars': 1,
 'review_body': 'The mattress topper is not comfortable for a couple, especially if they cuddle at night. The mattress concaved around my wife and I to the point that if either of us was close to the other it felt like we would fall into their hole. It was a lot like sleeping on a sagging air mattress. Maybe this topper is comfortable for those that sleep alone or maybe mine was defective - I have no idea why it is so highly rated on Amazon.',
 'review_title': 'Not comfortable for a husband and wife',
 'language': 'en',
 'product_category': 'home',
 'labels': 0,
 'label_name': 'terrible'}

In [None]:
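def compute_preds(examples):
    # The pipeline returns a list with one dict per input text
    preds = classifier(examples["review_body"])
    # Convert the predicted label name back to its integer ID
    label_pred = label2id[preds[0]["label"]]
    return {"prediction": label_pred}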

In [None]:
preds = product_dataset["validation"].map(compute_preds)

  0%|          | 0/390 [00:00<?, ?ex/s]
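`compute_preds()` handles one review per call. A batched variant, sketched below, receives a dict of lists and returns a list of predictions, which is the shape `Dataset.map()` works with when `batched=True` is set. The function `compute_preds_batched` and the stand-in `fake_classifier` are assumptions for illustration. The stand-in only mimics the output format of the real pipeline so that the sketch runs without downloading a model.

```python
label2id = {"terrible": 0, "poor": 1, "ok": 2, "good": 3, "great": 4}


def compute_preds_batched(examples, classifier, label2id=label2id):
    """Classify a batch of reviews; `examples` is a dict of lists."""
    preds = classifier(examples["review_body"])
    return {"prediction": [label2id[p["label"]] for p in preds]}


def fake_classifier(texts):
    """Stand-in for the real pipeline: same output shape, trivial rule."""
    return [
        {"label": "great" if "love" in t.lower() else "terrible", "score": 1.0}
        for t in texts
    ]


batch = {"review_body": ["I love it", "Arrived broken"]}
print(compute_preds_batched(batch, fake_classifier))
```

With the real pipeline, a call along the lines of `product_dataset["validation"].map(compute_preds_batched, batched=True, fn_kwargs={"classifier": classifier})` should apply it to the whole split.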

Now that we've got some predictions, it's time to evaluate them! In the [MARC paper](https://arxiv.org/pdf/2010.02573.pdf), the authors point out that one should use the mean absolute error (MAE) for star ratings because:

> star ratings for each review are ordinal, and a 2-star prediction for a 5-star review should be penalized more heavily than a 4-star prediction for a 5-star review.

We'll take the same approach here and we can get the metric easily from Scikit-learn as follows:

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
mean_absolute_error(preds["labels"], preds["prediction"])

0.6153846153846154
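To illustrate why MAE suits ordinal labels better than accuracy, the sketch below compares two sets of made-up predictions that are wrong equally often but by different distances. The functions `mae` and `accuracy` are plain-Python stand-ins written for this example, not the Scikit-learn implementations.

```python
def mae(y_true, y_pred):
    """Mean absolute difference between true and predicted label IDs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up label IDs (0-4) for four 5-star reviews.
y_true = [4, 4, 4, 4]
near_miss = [3, 3, 4, 4]  # two predictions off by one star
far_miss = [1, 1, 4, 4]   # two predictions off by three stars

print(accuracy(y_true, near_miss), accuracy(y_true, far_miss))
print(mae(y_true, near_miss), mae(y_true, far_miss))
```

Accuracy cannot tell the two sets of predictions apart, while MAE is three times larger for the predictions that miss by three stars.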

For reference, the MARC paper quotes MAE results from mBERT in the range of 0.5-0.7. Let's see if we can get close to that with XLM-RoBERTa!