# Effortless NLP using HuggingFace's Transformers Ecosystem

![Image](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/pipeline_0.jpg)
> Image By [Karosu](https://unsplash.com/@karosu)
### Complex NLP tasks can be performed with two lines of code using Pipeline.
### A Pipeline? What does it do?

#### ------------------------------------------------ 
#### *Articles So Far In This Series*
#### -> [[NLP Tutorial] Finish Tasks in Two Lines of Code](https://www.kaggle.com/rajkumarl/nlp-tutorial-finish-tasks-in-two-lines-of-code)
#### -> [[NLP Tutorial] Unwrapping Transformers Pipeline](https://www.kaggle.com/rajkumarl/nlp-unwrapping-transformers-pipeline)
#### -> [[NLP Tutorial] Exploring Tokenizers](https://www.kaggle.com/rajkumarl/nlp-tutorial-exploring-tokenizers)
#### -> [[NLP Tutorial] Fine-Tuning in TensorFlow](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-tensorflow) 
#### -> [[NLP Tutorial] Fine-Tuning in Pytorch](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-pytorch) 
#### -> [[NLP Tutorial] Fine-Tuning with Trainer API](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-with-trainer-api) 
#### ------------------------------------------------ 

The **Pipeline** module in the **transformers** library offers a quick launch of NLP tasks without any manual preprocessing or model building. It helps one finish a complicated NLP task with just two lines of code.

Import the necessary libraries and modules. 

In [1]:
# for array operations
import numpy as np 
# for pretty text printing
from pprint import pprint


# pipeline module from the HF's transformers library
from transformers import pipeline


### Sentiment Analysis

Sentiment analysis is a text classification task; in its common binary form, the text is assigned one of two labels, positive or negative. The pipeline downloads a suitable pre-trained transformer model, preprocesses the input text, and outputs the sentiment label along with its probability score.

In [2]:
# download and cache a suitable pre-trained model
classifier = pipeline('sentiment-analysis')

Passing a single text to the classifier:

In [3]:
classifier('One of the best products I have ever used')

[{'label': 'POSITIVE', 'score': 0.9998539090156555}]

Passing several texts at once:

In [4]:
classifier(['They got thrilled with the movie', 
            'My car starts to underperform', 
            'I love to take icecreams along with milkshakes',
            'Nobody enters my restaurant after that horror incident'])

[{'label': 'POSITIVE', 'score': 0.999812662601471},
 {'label': 'NEGATIVE', 'score': 0.9996762275695801},
 {'label': 'POSITIVE', 'score': 0.998968243598938},
 {'label': 'NEGATIVE', 'score': 0.9858075380325317}]
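The label and score shown above are the last step of the pipeline: the model emits one raw score (logit) per class, and a softmax turns those into probabilities. The sketch below illustrates only that post-processing step, under stated assumptions: the logit values and the `[NEGATIVE, POSITIVE]` ordering are invented for illustration and are not taken from the model.

```python
import math

# Hypothetical raw logits for one sentence, assumed order [NEGATIVE, POSITIVE].
# These numbers are made up for illustration; they do not come from the model.
logits = [-4.2, 4.6]
labels = ['NEGATIVE', 'POSITIVE']

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

# Same shape as one element of the classifier output above.
prediction = {'label': labels[best], 'score': probs[best]}
print(prediction)
```

A score close to 1 therefore means the gap between the two logits is large, not that the prediction is guaranteed to be right.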

### Feature Extraction

This pipeline extracts the hidden-state features from the base transformer, which can then be used in downstream applications. Other pre-trained models may also be used by passing them as the `model` argument when creating the pipeline.

In [5]:
extractor = pipeline('feature-extraction')

In [6]:
x = extractor('This is a nice example to follow')
x = np.array(x)
x.shape

(1, 9, 768)
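The shape above reads as (batch size, number of tokens, hidden size). The nine tokens are presumably the seven words plus two special tokens added by the tokenizer, though that is an assumption about the default model. A common way to use such features downstream is to average over the token axis to get one vector per sentence. The sketch below uses a random stand-in array of the same shape rather than real model output, and mean pooling is an illustrative choice, not something the pipeline does by itself.

```python
import numpy as np

# Stand-in for np.array(extractor(...)): random values with the same
# (batch, tokens, hidden_size) shape as the output above.
rng = np.random.default_rng(0)
features = rng.normal(size=(1, 9, 768))

# Mean pooling: average over the token axis (axis=1), then take the
# single sentence in the batch to get one fixed-size vector.
sentence_vector = features.mean(axis=1)[0]
print(sentence_vector.shape)
```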

### Zero-Shot Classification

Zero-shot classification is one of the hottest ideas in ML: a given text is classified into one of the supplied candidate labels without any fine-tuning.

In [7]:
classifier = pipeline('zero-shot-classification')

In [8]:
classifier("I love courses offered by NPTEL",
          candidate_labels = ['sports', 'education', 'business'])

{'sequence': 'I love courses offered by NPTEL',
 'labels': ['education', 'business', 'sports'],
 'scores': [0.7576639652252197, 0.15684211254119873, 0.08549392968416214]}

In [9]:
classifier("Our Stock prices find a sudden fall",
          candidate_labels = ['politics', 'music', 'business', 'finance'])

{'sequence': 'Our Stock prices find a sudden fall',
 'labels': ['business', 'finance', 'politics', 'music'],
 'scores': [0.5720211863517761,
  0.3921557664871216,
  0.017959607765078545,
  0.017863402143120766]}
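In both results above, the labels come back sorted by score, and the scores add up to roughly 1, which suggests they are normalised across the candidate labels. A small sketch of reading such a result follows; the dictionary is copied from the output above, and the `margin` measure is only an illustrative way to judge how decisive the prediction is.

```python
# Result copied from the zero-shot output above.
result = {
    'sequence': 'Our Stock prices find a sudden fall',
    'labels': ['business', 'finance', 'politics', 'music'],
    'scores': [0.5720211863517761, 0.3921557664871216,
               0.017959607765078545, 0.017863402143120766],
}

# Labels arrive sorted by score, so the first one is the prediction.
top_label = result['labels'][0]

# Map each label to its score for easy lookup.
label_scores = dict(zip(result['labels'], result['scores']))

# The scores behave like a probability distribution over the candidates.
total = sum(result['scores'])

# Gap between the best and second-best label (illustrative measure).
margin = result['scores'][0] - result['scores'][1]

print(top_label, round(total, 4), round(margin, 4))
```

Here 'business' wins, but 'finance' is a close runner-up, so the margin is worth checking before trusting a single label.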

### Text Generation

Given a prompt, the text generation pipeline can generate a text sequence on its own.

In [10]:
generator = pipeline('text-generation')

In [11]:
text = generator('When you get stuck with a Python code')
pprint(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "When you get stuck with a Python code, you can't do it "
                    'anymore.\n'
                    '\n'
                    "In order to do this, you've got to think about the state "
                    'of your programs to identify the correct way you might '
                    'use the data. A python can look'}]
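Note that `generated_text` in the output above begins with the prompt itself. When only the newly generated part is needed, the prompt can be sliced off. A minimal sketch, using just the first sentence of the output above as sample data:

```python
prompt = 'When you get stuck with a Python code'

# First sentence of the generated text shown above (shortened for brevity).
outputs = [{'generated_text':
            "When you get stuck with a Python code, you can't do it anymore."}]

generated = outputs[0]['generated_text']

# The returned text is prompt + continuation, so drop the prompt prefix.
if generated.startswith(prompt):
    continuation = generated[len(prompt):]
else:
    continuation = generated

print(repr(continuation))
```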


Generation can be controlled by passing the number of return sequences and the maximum sequence length as arguments to the call. The pre-trained model itself is selected when the pipeline is created, for example `pipeline('text-generation', model='distilgpt2')`, rather than at call time.

In [12]:
texts = generator('I love to do trekking in mountains and', 
                  num_return_sequences=5, 
                  max_length=50)
pprint(texts)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I love to do trekking in mountains and I want to do that, '
                    "and now it's happening again. I just hope so and I hope "
                    'to continue to do it with all these wonderful friends of '
                    'mine."\n'
                    '\n'
                    'The trip made her a bit'},
 {'generated_text': 'I love to do trekking in mountains and snowboarding. I '
                    'always go to a retreat and watch the sunrise through my '
                    "binoculars. I've really enjoyed the scenery and I'm "
                    'always amazed how cool the terrain can be. I also '
                    'appreciate the'},
 {'generated_text': 'I love to do trekking in mountains and snow-covered '
                    'mountain passes. To explore it at a little less. Then '
                    'take another step and look outwards for a place to hike. '
                    "And once you've done that, try your luck getting up and"},


### Mask Filling

Mask filling is an interesting NLP task in which the pre-trained language model replaces each mask token with the most probable tokens.

In [13]:
unmask = pipeline('fill-mask')

In [14]:
unmask('I really love icecream along with <mask> in rain', top_k = 5)

[{'sequence': 'I really love icecream along with soaking in rain',
  'score': 0.2503714859485626,
  'token': 30441,
  'token_str': ' soaking'},
 {'sequence': 'I really love icecream along with swimming in rain',
  'score': 0.21480678021907806,
  'token': 7358,
  'token_str': ' swimming'},
 {'sequence': 'I really love icecream along with dancing in rain',
  'score': 0.029250476509332657,
  'token': 7950,
  'token_str': ' dancing'},
 {'sequence': 'I really love icecream along with bathing in rain',
  'score': 0.02806648053228855,
  'token': 30260,
  'token_str': ' bathing'},
 {'sequence': 'I really love icecream along with baking in rain',
  'score': 0.01972861774265766,
  'token': 14814,
  'token_str': ' baking'}]
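Two details of the output above are worth handling in downstream code: each `token_str` carries a leading space, and the scores of the top five candidates cover only part of the probability mass. A short sketch using the scores and tokens copied from the output above:

```python
# Scores and tokens copied from the fill-mask output above.
candidates = [
    {'score': 0.2503714859485626, 'token_str': ' soaking'},
    {'score': 0.21480678021907806, 'token_str': ' swimming'},
    {'score': 0.029250476509332657, 'token_str': ' dancing'},
    {'score': 0.02806648053228855, 'token_str': ' bathing'},
    {'score': 0.01972861774265766, 'token_str': ' baking'},
]

# Strip the leading space that the tokenizer attaches to each token.
words = [c['token_str'].strip() for c in candidates]

# Probability mass covered by the top-5 candidates.
covered = sum(c['score'] for c in candidates)

print(words)
print(f'top-5 probability mass: {covered:.3f}')
```

Roughly half of the probability mass lies outside the top five, so the model is far from certain about this mask.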

In [15]:
# the model is fixed when the pipeline is created, so only the text is passed here
unmask('Sir Isaac Newton focused more in <mask> mechanics during his reseach')

[{'sequence': 'Sir Isaac Newton focused more in quantum mechanics during his reseach',
  'score': 0.901558518409729,
  'token': 17997,
  'token_str': ' quantum'},
 {'sequence': 'Sir Isaac Newton focused more in particle mechanics during his reseach',
  'score': 0.017226863652467728,
  'token': 33100,
  'token_str': ' particle'},
 {'sequence': 'Sir Isaac Newton focused more in mathematical mechanics during his reseach',
  'score': 0.011460122652351856,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'Sir Isaac Newton focused more in celestial mechanics during his reseach',
  'score': 0.009622407145798206,
  'token': 38402,
  'token_str': ' celestial'},
 {'sequence': 'Sir Isaac Newton focused more in classical mechanics during his reseach',
  'score': 0.006943091284483671,
  'token': 15855,
  'token_str': ' classical'}]

In [16]:
unmask('Who am I? I think I suffer from <mask> loss')

[{'sequence': 'Who am I? I think I suffer from weight loss',
  'score': 0.36119523644447327,
  'token': 2408,
  'token_str': ' weight'},
 {'sequence': 'Who am I? I think I suffer from memory loss',
  'score': 0.33623331785202026,
  'token': 3783,
  'token_str': ' memory'},
 {'sequence': 'Who am I? I think I suffer from hearing loss',
  'score': 0.024745365604758263,
  'token': 1576,
  'token_str': ' hearing'},
 {'sequence': 'Who am I? I think I suffer from cognitive loss',
  'score': 0.015265022404491901,
  'token': 14526,
  'token_str': ' cognitive'},
 {'sequence': 'Who am I? I think I suffer from sight loss',
  'score': 0.01471016276627779,
  'token': 6112,
  'token_str': ' sight'}]

### Named Entity Recognition (NER)

NER is the task of identifying words that name a person [**PER**], an organization [**ORG**] or a location [**LOC**]. Passing the argument `grouped_entities=True` ensures that a name made up of more than one word is grouped together during post-processing. (Recent releases of transformers deprecate this flag in favour of `aggregation_strategy`.)

In [17]:
ner = pipeline('ner', grouped_entities = True)

In [18]:
ner('I will meet Roshan and Rudhra when I reach the Indian Institute of Technology at Mumbai')

[{'entity_group': 'PER',
  'score': 0.9417180021603903,
  'word': 'Roshan',
  'start': 12,
  'end': 18},
 {'entity_group': 'PER',
  'score': 0.8873194058736166,
  'word': 'Rudhra',
  'start': 23,
  'end': 29},
 {'entity_group': 'ORG',
  'score': 0.981681302189827,
  'word': 'Indian Institute of Technology',
  'start': 47,
  'end': 77},
 {'entity_group': 'LOC',
  'score': 0.9978346824645996,
  'word': 'Mumbai',
  'start': 81,
  'end': 87}]
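The `start` and `end` values in the output above are character offsets into the input sentence, so each entity can be recovered by slicing. The sketch below groups the entities by type using the offsets copied from the output; scores are omitted for brevity.

```python
from collections import defaultdict

text = ('I will meet Roshan and Rudhra when I reach '
        'the Indian Institute of Technology at Mumbai')

# Entities copied from the NER output above (scores omitted).
entities = [
    {'entity_group': 'PER', 'word': 'Roshan', 'start': 12, 'end': 18},
    {'entity_group': 'PER', 'word': 'Rudhra', 'start': 23, 'end': 29},
    {'entity_group': 'ORG', 'word': 'Indian Institute of Technology',
     'start': 47, 'end': 77},
    {'entity_group': 'LOC', 'word': 'Mumbai', 'start': 81, 'end': 87},
]

# Collect the text spans under their entity type.
collected = defaultdict(list)
for ent in entities:
    span = text[ent['start']:ent['end']]  # slice by character offsets
    collected[ent['entity_group']].append(span)

grouped = dict(collected)
print(grouped)
```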

### Question Answering

Given a question and a relevant context, the pipeline extracts the answer to the question from the context.

In [19]:
qa = pipeline('question-answering')

In [20]:
qa(question = 'Where did I stay last night?',
  context = 'I visited Darjeeling yesterday, \
   spent my night at a hotel in Delhi \
   and have returned back to Agra')

{'score': 0.44198018312454224, 'start': 64, 'end': 69, 'answer': 'Delhi'}
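The answer above is a span of the context: `start` and `end` are character offsets, and `score` indicates how confident the model is. The sketch below checks the offsets against the context and applies a confidence cut-off. The threshold of 0.5 is a hypothetical value chosen for illustration, not a recommendation from the library.

```python
# Same context as in the cell above; the line continuations there leave
# three extra spaces before 'spent' and 'and', reproduced here explicitly.
context = ('I visited Darjeeling yesterday, '
           '   spent my night at a hotel in Delhi '
           '   and have returned back to Agra')

# Result copied from the question-answering output above.
result = {'score': 0.44198018312454224, 'start': 64, 'end': 69,
          'answer': 'Delhi'}

# The answer is the slice of the context between the two offsets.
span = context[result['start']:result['end']]

# Hypothetical confidence cut-off for accepting an answer automatically.
THRESHOLD = 0.5
is_confident = result['score'] >= THRESHOLD

print(span, is_confident)
```

With a score of about 0.44 the answer is correct here but falls below this cut-off, which is a reminder that the score should be read together with the answer.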

### Text Summarization

In [21]:
summarizer = pipeline('summarization')

In [22]:
summarizer('Once, a group of frogs was roaming around the forest in search of water. \
Suddenly, two frogs in the group accidentally fell into a deep pit.\
The other frogs worried about their friends in the pit.\
Seeing how deep the pit was, they told the two frogs that there was no way \
they could escape the deep pit and that there was no point in trying.\
They continued to constantly discourage them as the two frogs tried to jump out of the pit. \
But keep falling back.\
Soon, one of the two frogs started to believe the other frogs — that they’ll never be \
able to escape the pit and eventually died after giving up.\
The other frog keeps trying and eventually jumps so high that he escapes the pit. \
The other frogs were shocked at this and wondered how he did it.\
The difference was that the second frog was deaf and couldn’t hear \
the discouragement of the group. He simply thought they were cheering him on!')

[{'summary_text': ' A group of frogs was roaming around the forest in search of water when two frogs accidentally fell into a deep pit . The other frogs worried about their friends in the pit and told the two frogs that there was no way they could escape the pit . They continued to constantly discourage them as they tried to jump out of the pit but one of the frogs was deaf and thought they were cheering him on!'}]

That's really great! The summary is concise and to the point.

In [23]:
summarizer("Researchers at MIT have created a robotic system that can do just that. \
The system, RFusion, is a robotic arm with a camera and radio frequency (RF) antenna \
attached to its gripper. It fuses signals from the antenna with visual input from \
the camera to locate and retrieve an item, even if the item is buried under a pile \
and completely out of view.\
The RFusion prototype the researchers developed relies on RFID tags, \
which are cheap, battery-less tags that can be stuck to an item and \
reflect signals sent by an antenna. Because RF signals can travel through most surfaces \
(like the mound of dirty laundry that may be obscuring the keys), \
RFusion is able to locate a tagged item within a pile.\
Using machine learning, the robotic arm automatically zeroes-in on the object's \
exact location, moves the items on top of it, grasps the object, \
and verifies that it picked up the right thing. The camera, antenna, robotic arm, \
and AI are fully integrated, so RFusion can work in any environment without \
requiring a special set up.\
While finding lost keys is helpful, RFusion could have many broader applications \
in the future, like sorting through piles to fulfill orders in a warehouse, identifying \
and installing components in an auto manufacturing plant, or helping an elderly \
individual perform daily tasks in the home, though the current prototype isn't \
quite fast enough yet for these uses.\
\"This idea of being able to find items in a chaotic world is an open problem \
that we've been working on for a few years. Having robots that are able to search \
for things under a pile is a growing need in industry today. \
Right now, you can think of this as a Roomba on steroids, but in the near term, \
this could have a lot of applications in manufacturing and warehouse environments,\" \
said senior author Fadel Adib, associate professor in the Department of \
Electrical Engineering and Computer Science and director of the Signal Kinetics group \
in the MIT Media Lab.\
Co-authors include research assistant Tara Boroushaki, the lead author; \
electrical engineering and computer science graduate student Isaac Perper; \
research associate Mergen Nachin; and Alberto Rodriguez, the Class of 1957 \
Associate Professor in the Department of Mechanical Engineering. \
The research will be presented at the Association for Computing Machinery Conference \
on Embedded Networked Senor Systems next month.")

[{'summary_text': " RFusion is a robotic arm with a camera and radio frequency (RF) antenna attached to its gripper . It fuses signals from the antenna with visual input from the camera to locate and retrieve an item . Using machine learning, the robotic arm automatically zeroes-in on the object's exact location, moves the items on top of it, grasps the object and verifies that it picked up the right thing ."}]

A pretty short summary!
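Both summaries above start with a space and put a space before each full stop. If the text is to be shown to readers, a small clean-up step can help. The helper below is a hypothetical post-processing function, not part of the pipeline; the sample text is the opening of the second summary above.

```python
import re

# Opening of the second summary shown above.
summary = (" RFusion is a robotic arm with a camera and radio frequency (RF) "
           "antenna attached to its gripper . It fuses signals from the antenna "
           "with visual input from the camera to locate and retrieve an item .")

def tidy(text):
    """Strip outer whitespace and remove spaces before punctuation."""
    return re.sub(r'\s+([.,!?])', r'\1', text.strip())

clean = tidy(summary)
print(clean)
```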

### Translation

The HuggingFace ecosystem hosts pre-trained models with which we can translate a sentence between hundreds of languages in no time!

In [24]:
# translate from English to German
helsinki_en_de = pipeline('translation', model='Helsinki-NLP/opus-mt-en-de')

In [25]:
result = helsinki_en_de('Transformers have started to lead the Natural Language Processing world recently')
print(result)
de_text = result[0]['translation_text']

[{'translation_text': 'Transformer haben vor kurzem begonnen, die Welt der natürlichen Sprachverarbeitung zu führen'}]


In [26]:
# translate the German text back to English
helsinki_de_en = pipeline('translation', model='Helsinki-NLP/opus-mt-de-en')

In [27]:
helsinki_de_en(de_text)

[{'translation_text': 'Transformers have recently begun to lead the world of natural language processing'}]

In [28]:
# use a different model to translate back from German to English
facebook_de_en = pipeline('translation', model='facebook/wmt19-de-en')

In [29]:
facebook_de_en(de_text)

[{'translation_text': 'Transformers have recently begun to lead the world of natural language processing'}]

The English translations from both models are identical, and they read better than my original English input! I may use this back-and-forth approach to improve my English writing!
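The round trip can also be compared with the original sentence in code. The sketch below checks that the two back-translations match and measures word overlap with the original using Jaccard similarity. This is a deliberately crude measure chosen for illustration, since it ignores word order. The sentences are copied from the cells above.

```python
original = ('Transformers have started to lead the '
            'Natural Language Processing world recently')
helsinki_back = ('Transformers have recently begun to lead '
                 'the world of natural language processing')
facebook_back = ('Transformers have recently begun to lead '
                 'the world of natural language processing')

def word_set(sentence):
    """Lower-case the sentence and return its set of words."""
    return set(sentence.lower().split())

# Do both models produce the same back-translation?
same_output = helsinki_back == facebook_back

# Jaccard similarity: shared words divided by all distinct words.
a, b = word_set(original), word_set(helsinki_back)
jaccard = len(a & b) / len(a | b)

print(same_output, round(jaccard, 3))
```

Most words survive the round trip; the differences are 'started' becoming 'begun' and the added 'of'.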

### Limitations of pipeline

Since NLP models are trained on large corpora, their outputs reflect their training data. Hence the outputs may be biased with respect to gender, race, and so on.

In [30]:
# a sentence about a man
results = unmask('This man works there as a <mask>')
print([result['token_str'] for result in results])

[' bartender', ' courier', ' waiter', ' mechanic', ' contractor']


In [31]:
# an equivalent sentence about a woman
results = unmask('This woman works there as a <mask>')
print([result['token_str'] for result in results])

[' waitress', ' bartender', ' nurse', ' maid', ' prostitute']


The results for the two genders differ drastically. Only 'bartender' appears in both lists, though at different ranks; 'waiter' and 'waitress' are gendered counterparts, and the remaining suggestions do not overlap at all.
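The overlap between the two lists can be made explicit with set operations. The sketch below uses the two printed lists exactly as they appear above.

```python
# Token lists copied from the two outputs above.
man = [' bartender', ' courier', ' waiter', ' mechanic', ' contractor']
woman = [' waitress', ' bartender', ' nurse', ' maid', ' prostitute']

# Strip the leading space attached to each token.
man_set = {w.strip() for w in man}
woman_set = {w.strip() for w in woman}

shared = man_set & woman_set               # suggested for both prompts
only_man = sorted(man_set - woman_set)     # suggested only for 'man'
only_woman = sorted(woman_set - man_set)   # suggested only for 'woman'

print(shared)
print(only_man)
print(only_woman)
```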

This article may seem short, but it has invoked close to a dozen state-of-the-art models together with their checkpoints, tokenizers, and pre-processing and post-processing techniques, performing each of the above tasks in just two lines of code!

Key reference: [HuggingFace's NLP Course](https://huggingface.co/course)

### Thank you for your valuable time!