 <h1><center><font size="6">🤗 Transformers models | The "pipeline" function</font></center></h1>

<center>
    <img src="https://www.kdnuggets.com/wp-content/uploads/shastri_simple_nlp_pipelines_huggingface_transformers_1.png" width="600">
</center>


This notebook is the first in a series of notebooks containing **notes and code snippets** from the [**NLP Course on Hugging Face**](https://huggingface.co/learn/nlp-course/).

- **Full Notes: https://github.com/ANYANTUDRE/NLP-Course-Hugging-Face** 

- **Notebook: https://www.kaggle.com/code/waalbannyantudre/transformers-models-the-pipeline-function**


# <a id='0'>Content</a>

- <a href='#1'>🔗 Working with pipelines</a>  
- <a href='#2'>🎯 Zero-shot classification 🔫</a>  
- <a href='#3'>📝 Text generation</a>  
- <a href='#4'>🗂 Using any model from the Hub in a pipeline</a>
- <a href='#5'>❎ Mask filling</a>
- <a href='#6'>🧩 Named entity recognition (NER)</a>
- <a href='#7'>⁉️ Question answering</a>
- <a href='#8'>📜 Summarization</a>
- <a href='#9'>🈵 Translation</a>

In this notebook, we will look at what **Transformer models** can do and use our first tool from the **🤗 Transformers** library: the `pipeline()` function.


Transformers are everywhere! Let’s look at a few examples of how they can be used to solve some interesting NLP problems.

In [1]:
import transformers
from transformers import pipeline

import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

print(f"Transformers Version: {transformers.__version__}")

Transformers Version: 4.38.2


# <a id="1">🔗 Working with pipelines</a>

The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its preprocessing and post-processing steps, so we can pass in raw text and get an intelligible answer.  
By default, the `sentiment-analysis` pipeline used below selects a particular **pretrained model** that has been fine-tuned for sentiment analysis in English.

In [2]:
classifier = pipeline("sentiment-analysis")
classifier(["Nice, I've been waiting for short HuggingFace course my whole life!", "I hate this so much"])

[{'label': 'POSITIVE', 'score': 0.9984038472175598},
 {'label': 'NEGATIVE', 'score': 0.9995144605636597}]

### Steps
There are **three main steps** involved when you pass some text to a `pipeline()`:
- **Text preprocessing,**
- **Model prediction,**
- **Output post-processing**.
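
The three steps above can be sketched in plain Python. This is a toy illustration only, not how the library implements them: the vocabulary, the label list, the scoring rule in `toy_model`, and every function name are invented for this example. The part that mirrors a real classifier most closely is the last step, where raw scores (logits) are turned into probabilities with a softmax.

```python
import math

# Hypothetical toy vocabulary and label names, invented for this illustration.
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into a list of integer token ids."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Step 2: stand-in for a real model; returns one raw score (logit) per label."""
    negative = sum(2.0 for t in token_ids if t == VOCAB["hate"])
    positive = sum(2.0 for t in token_ids if t == VOCAB["love"])
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits into probabilities and keep the best label."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline does behind the scenes."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I hate this"))
print(toy_pipeline("I love this"))
```

The returned dictionaries have the same `label` and `score` keys as the real output shown earlier, which is the only resemblance intended here.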

### Other pipelines 🥤
Some of the currently available pipelines are:
- **feature-extraction**
- **fill-mask**
- **ner (named entity recognition)**
- **question-answering**
- **sentiment-analysis**
- **summarization**
- **text-generation**
- **translation**
- **zero-shot-classification**

# <a id="2">🎯 Zero-shot classification 🔫</a>

The `zero-shot-classification` pipeline is very powerful for tasks where we need to **classify texts that haven’t been labelled.** It returns probability scores for any list of labels you want!    

It's called **zero-shot** because you don’t need to fine-tune the model on your data to use it. 

In [3]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a short course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a short course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.7481650114059448, 0.17828474938869476, 0.07355023920536041]}
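
Note that the labels come back sorted by descending score rather than in the order they were passed, and that in this run the scores sum to roughly 1. A minimal sketch of how the result could be consumed, using the dictionary copied from the output above (the helper names `label_scores` and `top_label` are made up for this example):

```python
# Result dictionary copied from the zero-shot output above.
result = {
    "sequence": "This is a short course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.7481650114059448, 0.17828474938869476, 0.07355023920536041],
}


def label_scores(result):
    """Pair each candidate label with its score."""
    return dict(zip(result["labels"], result["scores"]))


def top_label(result):
    """Return the label with the highest score."""
    scores = label_scores(result)
    return max(scores, key=scores.get)


print(top_label(result))
print(round(sum(result["scores"]), 6))
```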

# <a id="3">📝 Text generation</a>

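The main idea here is that you provide a prompt and the model auto-completes it by generating the remaining text. Generation involves randomness, so it is normal if you do not get the same result as the one shown below.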


In [4]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

[{'generated_text': 'In this course, we will teach you how to handle multiple versions of the project.\n\nWe will also include a tutorial for using a Dockerfile as an executable.\n\nFinally, we will cover basic configuration of a Dockerfile:\n\n'}]
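
As the output shows, `generated_text` contains the original prompt followed by the generated continuation. A small sketch of how the two could be separated, assuming the text always starts with the prompt (the helper name `continuation` is made up for this example; the data is copied from the output above):

```python
prompt = "In this course, we will teach you how to"

# Output copied from the text-generation cell above.
outputs = [
    {
        "generated_text": (
            "In this course, we will teach you how to handle multiple versions "
            "of the project.\n\nWe will also include a tutorial for using a "
            "Dockerfile as an executable.\n\nFinally, we will cover basic "
            "configuration of a Dockerfile:\n\n"
        )
    }
]


def continuation(prompt, generated_text):
    """Return only the newly generated part of the text."""
    if not generated_text.startswith(prompt):
        raise ValueError("generated_text does not start with the prompt")
    return generated_text[len(prompt):]


print(continuation(prompt, outputs[0]["generated_text"]))
```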

# <a id="4">🗂 Using any model from the Hub in a pipeline</a>

The previous examples used the **default model** for the task at hand, but you can also choose a particular model from the [Hub](https://huggingface.co/models) to use in a pipeline for a specific task, such as text generation.  

Let’s try the [distilgpt2](https://huggingface.co/distilbert/distilgpt2) model! The `num_return_sequences` argument controls how many different sequences are generated, and `max_length` caps the total length of each output text.

In [5]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

[{'generated_text': "In this course, we will teach you how to solve and work on solving your job, which includes making sure you'll have skills that will help you"},
 {'generated_text': 'In this course, we will teach you how to read in English language and English speaking.\n\n\nIntroduction'}]

# <a id="5">❎ Mask filling</a>

The idea of this task is to **fill in the blanks** in a given text. The `top_k` argument controls how many candidate fillings are returned:

In [6]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.1961979866027832,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052741825580597,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]
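
The `<mask>` placeholder used above is the mask token of the default checkpoint; other mask-filling models may expect a different one, such as `[MASK]` for BERT-style models. A minimal sketch for building the prompt independently of the model follows. The `{mask}` template convention and the helper name are assumptions made for this example; in a live session the token could be read from `unmasker.tokenizer.mask_token` instead of being typed by hand.

```python
def build_masked_prompt(template, mask_token):
    """Replace the single '{mask}' placeholder with the model's mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


template = "This course will teach you all about {mask} models."

print(build_masked_prompt(template, "<mask>"))  # mask token used in the cell above
print(build_masked_prompt(template, "[MASK]"))  # BERT-style mask token
```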

# <a id="6">🧩 Named entity recognition (NER)</a>

NER is a task where the model has to find which parts of the input text correspond to **entities such as persons, locations, or organizations.**  


In [7]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

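Here the model correctly identified that Sylvain is a **person (PER)**, Hugging Face an **organization (ORG)**, and Brooklyn a **location (LOC)**. Because the pipeline was created with entity grouping enabled, the tokens that belong to the same entity are merged, so "Hugging" and "Face" come back as the single organization "Hugging Face".  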
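
The result is a plain list of dictionaries, so it is easy to reorganize. A short sketch that collects the detected words per entity type, using the output copied from the cell above (the helper name `group_by_type` is made up for this example):

```python
from collections import defaultdict

# Entities copied from the NER output above.
entities = [
    {"entity_group": "PER", "score": 0.9981694, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9796019, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.9932106, "word": "Brooklyn", "start": 49, "end": 57},
]


def group_by_type(entities):
    """Map each entity type (PER, ORG, LOC, ...) to the list of words found."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


print(group_by_type(entities))
```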


# <a id="7">⁉️ Question answering</a>

The **question-answering** pipeline answers questions using information from a given context.


In [8]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.694976270198822, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
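
This pipeline extracts the answer from the provided context rather than generating it. The `start` and `end` values are character offsets into the context, which can be checked with a simple slice (data copied from the cell above):

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the question-answering output above.
result = {"score": 0.694976270198822, "start": 33, "end": 45, "answer": "Hugging Face"}

# Slicing the context with the returned offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```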

# <a id="8">📜 Summarization</a>

**Summarization** is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. 


In [9]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number 
    of graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering 
    graduates and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of graduates in traditional engineering disciplines has declined . China and India graduate six and eight times as many traditional engineers as does the United States . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues .'}]
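
The summary returned by this model starts with a space and has a space before each period. If cleaner text is needed, a small post-processing step can help. This is a sketch only: the helper name `tidy_summary` is made up, and the sample string is just the first two sentences of the output above.

```python
import re


def tidy_summary(text):
    """Strip outer whitespace and remove the whitespace before periods."""
    return re.sub(r"\s+\.", ".", text.strip())


# First two sentences of the summary shown above.
summary = (
    " America has changed dramatically during recent years . "
    "The number of graduates in traditional engineering disciplines has declined ."
)

print(tidy_summary(summary))
```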

# <a id="9">🈵 Translation</a>

For translation, you can use a default model if you provide a language pair in the task name (such as "`translation_en_to_fr`"), but the easiest way is to pick the model you want to use on the Model Hub. The example here translates from French to English.

In [10]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]
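
Both naming patterns mentioned in this section can be built from a pair of language codes. The helpers below are a sketch with made-up names, and they only format strings: not every language pair has a default model or a matching checkpoint on the Hub, so the resulting identifiers should be checked before use.

```python
def translation_task(src, tgt):
    """Build a task name such as 'translation_en_to_fr'."""
    return f"translation_{src}_to_{tgt}"


def opus_mt_model_id(src, tgt):
    """Build a Hub id that follows the 'Helsinki-NLP/opus-mt-<src>-<tgt>' pattern."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


print(translation_task("en", "fr"))
print(opus_mt_model_id("fr", "en"))
```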

## References:

- [1] **GitHub: NLP-Course-Hugging-Face**: https://github.com/ANYANTUDRE/NLP-Course-Hugging-Face

- [2] **NLP Course on Hugging Face:** https://huggingface.co/learn/nlp-course/chapter1/3?fw=tf

- [3] **Cover Image** credits: [KDnuggets](https://www.kdnuggets.com/2023/02/simple-nlp-pipelines-huggingface-transformers.html)