# Effortless NLP using HuggingFace's Transformers Ecosystem

![Image](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/pipeline_1.jpg)

> Image by [Bekky Bekks](https://unsplash.com/@bekkybekks)
### What is inside a transformer pipeline?

#### ------------------------------------------------ 
#### *Articles So Far In This Series*
#### -> [[NLP Tutorial] Finish Tasks in Two Lines of Code](https://www.kaggle.com/rajkumarl/nlp-tutorial-finish-tasks-in-two-lines-of-code)
#### -> [[NLP Tutorial] Unwrapping Transformers Pipeline](https://www.kaggle.com/rajkumarl/nlp-unwrapping-transformers-pipeline)
#### -> [[NLP Tutorial] Exploring Tokenizers](https://www.kaggle.com/rajkumarl/nlp-tutorial-exploring-tokenizers)
#### -> [[NLP Tutorial] Fine-Tuning in TensorFlow](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-tensorflow) 
#### -> [[NLP Tutorial] Fine-Tuning in PyTorch](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-pytorch) 
#### -> [[NLP Tutorial] Fine-Tuning with Trainer API](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-with-trainer-api) 
#### ------------------------------------------------ 

The **pipeline** function in the **transformers** library launches NLP tasks quickly, with no preprocessing or model building on our side.

Looking at what happens inside a pipeline helps us explore the underlying libraries and customize them for our own problems.

For simplicity, let's work with the sentiment analysis task.

In [1]:
# Make necessary imports

# for array operations
import numpy as np 
# TensorFlow framework
import tensorflow as tf
# PyTorch framework
import torch
# for pretty printing
from pprint import pprint



# Sentiment Analysis Using Pipeline

Sentiment analysis is treated here as a binary text classification task: each text is assigned one of two labels, positive or negative. The pipeline downloads a suitable pre-trained transformer model, preprocesses the input text, and outputs the sentiment label together with its probability score.

In [2]:
# pipeline function from Hugging Face's transformers library
from transformers import pipeline

# download and cache a suitable pre-trained model
classifier = pipeline('sentiment-analysis')

Downloading:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

The pipeline accepts several text inputs at once:

In [3]:
texts = ['One of the best products I have ever used',
            'They got thrilled with the movie', 
            'My car starts to underperform', 
            'I love to take icecreams along with milkshakes',
            'Nobody enters my restaurant after that horror incident']
classifier(texts)

[{'label': 'POSITIVE', 'score': 0.9998539090156555},
 {'label': 'POSITIVE', 'score': 0.999812662601471},
 {'label': 'NEGATIVE', 'score': 0.9996762275695801},
 {'label': 'POSITIVE', 'score': 0.998968243598938},
 {'label': 'NEGATIVE', 'score': 0.9858075380325317}]

**Things to consider:**

> When we call `pipeline`, it downloads four items. What are they?

> We only passed in raw texts, yet it returns sentiments the way a human would expect. How?

Pipeline is a high-level wrapper that downloads a compatible tokenizer and a pre-trained model with its classification head, then applies the post-processing needed to deliver the expected output from the raw inputs alone. It also does not require its users to know a framework such as PyTorch or TensorFlow.

In this article, however, we will reproduce the pipeline's results ourselves, using the separate modules and APIs. We will explore the approach in both TensorFlow and PyTorch.

# Sentiment Analysis without Pipeline

When we call the sentiment analysis pipeline, it fetches a particular checkpoint of a pre-trained DistilBERT model. Both the tokenizer and the model are built from that checkpoint. Let's bring that checkpoint to light!

In [4]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'

### In TensorFlow

In [5]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# build a tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# convert raw text into numeric tensor inputs
inputs = tokenizer(texts, truncation=True, padding=True, return_tensors='tf')
pprint(inputs) 

{'attention_mask': <tf.Tensor: shape=(5, 14), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(5, 14), dtype=int32, numpy=
array([[  101,  2028,  1997,  1996,  2190,  3688,  1045,  2031,  2412,
         2109,   102,     0,     0,     0],
       [  101,  2027,  2288, 16082,  2007,  1996,  3185,   102,     0,
            0,     0,     0,     0,     0],
       [  101,  2026,  2482,  4627,  2000,  2104,  4842, 14192,   102,
            0,     0,     0,     0,     0],
       [  101,  1045,  2293,  2000,  2202,  3256, 16748, 13596,  2247,
         2007,  6501,  7377,  9681,   102],
       [  101,  6343,  8039,  2026,  4825,  2044,  2008,  5469,  5043,
          102,     0,     0,     0,     0]], dtype=int32)>}
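
The zeros in the output come from padding. With `padding=True`, shorter sequences are padded to the length of the longest one in the batch, and the attention mask marks real tokens with 1 and padding with 0. The sketch below imitates that behaviour in plain NumPy for two of the token-id sequences shown above. `pad_batch` is a hypothetical helper written for illustration, not the tokenizer's actual implementation, and it assumes a pad id of 0, as seen in this checkpoint's output.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq      # real tokens go first
        attention_mask[row, :len(seq)] = 1   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# unpadded ids copied from the tokenizer output above; 101 and 102 are the
# special [CLS] and [SEP] tokens that open and close every sequence
unpadded = [
    [101, 2027, 2288, 16082, 2007, 1996, 3185, 102],        # 'They got thrilled with the movie'
    [101, 2026, 2482, 4627, 2000, 2104, 4842, 14192, 102],  # 'My car starts to underperform'
]

input_ids, attention_mask = pad_batch(unpadded)
print(input_ids)
print(attention_mask)
```

Only these two sequences are in the toy batch, so they are padded to length 9 rather than 14. The padded length always follows the longest sequence in the batch.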



In [6]:
# build model 
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
# calculate outputs
outputs = model(inputs)
outputs

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[-4.2369432,  4.594136 ],
       [-4.137792 ,  4.4446945],
       [ 4.448906 , -3.586285 ],
       [-3.3940864,  3.4813333],
       [ 2.320928 , -1.9198208]], dtype=float32)>, hidden_states=None, attentions=None)

The outputs are logits, i.e. unnormalized scores, one per label. Softmax maps each row of logits to a probability distribution over the two labels.
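
To make the softmax step concrete, here is a minimal NumPy sketch that applies it by hand to the logits printed above. The `softmax` helper is written for illustration (it is not part of transformers) and uses the usual max-subtraction trick for numerical stability.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax: exp(z_i) / sum_j exp(z_j) along `axis`."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

# logits copied from the model output above (column 0 and column 1)
logits = np.array([[-4.2369432,  4.594136 ],
                   [-4.137792 ,  4.4446945],
                   [ 4.448906 , -3.586285 ],
                   [-3.3940864,  3.4813333],
                   [ 2.320928 , -1.9198208]], dtype=np.float32)

probs = softmax(logits)
print(probs)
print(probs.sum(axis=-1))  # every row sums to 1, up to float rounding
```

The larger logit in each row ends up with nearly all of the probability mass, which is why the scores are so close to 1.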

In [7]:
scores = tf.math.softmax(outputs.logits).numpy()
scores

array([[1.4609906e-04, 9.9985385e-01],
       [1.8732331e-04, 9.9981266e-01],
       [9.9967623e-01, 3.2375794e-04],
       [1.0317984e-03, 9.9896824e-01],
       [9.8580754e-01, 1.4192480e-02]], dtype=float32)

We get probability scores. How do we interpret them as a Negative or Positive label? The model's configuration stores the mapping from column index to label:

In [8]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [9]:
# obtain position of max score
ids = np.argmax(scores, axis=-1)
# obtain the max score itself
values = np.max(scores, axis=-1)
# look up the label for each input
labels = [model.config.id2label[i] for i in ids]
# print results in the same format as the pipeline
for label, value in zip(labels, values):
    print({'label': label, 'score': value})

{'label': 'POSITIVE', 'score': 0.99985385}
{'label': 'POSITIVE', 'score': 0.99981266}
{'label': 'NEGATIVE', 'score': 0.9996762}
{'label': 'POSITIVE', 'score': 0.99896824}
{'label': 'NEGATIVE', 'score': 0.98580754}


Yeah! We got the same labels and scores as the pipeline.
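
The post-processing steps above do not depend on the framework, so they can be gathered into one small helper. This is a sketch under assumptions: `to_pipeline_format` is a hypothetical name, and the `id2label` dictionary is typed in by hand to match `model.config.id2label`. The two score rows are copied from the softmax output above.

```python
import numpy as np

def to_pipeline_format(scores, id2label):
    """Turn an (n_samples, n_labels) score array into pipeline-style dicts."""
    ids = np.argmax(scores, axis=-1)   # position of the best score per row
    values = np.max(scores, axis=-1)   # the best score itself
    return [{'label': id2label[int(i)], 'score': float(v)}
            for i, v in zip(ids, values)]

id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}  # same mapping as model.config.id2label
scores = np.array([[1.4609906e-04, 9.9985385e-01],
                   [9.9967623e-01, 3.2375794e-04]], dtype=np.float32)

results = to_pipeline_format(scores, id2label)
print(results)
```

Converting to plain `int` and `float` keeps the result free of NumPy types, which is convenient if the output is later serialized.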

### In PyTorch

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# build a tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# convert raw text into tensor inputs
inputs = tokenizer(texts, truncation=True, padding=True, return_tensors='pt')
pprint(inputs) 

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]),
 'input_ids': tensor([[  101,  2028,  1997,  1996,  2190,  3688,  1045,  2031,  2412,  2109,
           102,     0,     0,     0],
        [  101,  2027,  2288, 16082,  2007,  1996,  3185,   102,     0,     0,
             0,     0,     0,     0],
        [  101,  2026,  2482,  4627,  2000,  2104,  4842, 14192,   102,     0,
             0,     0,     0,     0],
        [  101,  1045,  2293,  2000,  2202,  3256, 16748, 13596,  2247,  2007,
          6501,  7377,  9681,   102],
        [  101,  6343,  8039,  2026,  4825,  2044,  2008,  5469,  5043,   102,
             0,     0,     0,     0]])}


In [11]:
# build model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# make predictions
outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-4.2369,  4.5941],
        [-4.1378,  4.4447],
        [ 4.4489, -3.5863],
        [-3.3941,  3.4813],
        [ 2.3209, -1.9198]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [12]:
# convert logits into prob scores using softmax
scores = torch.nn.functional.softmax(outputs.logits, dim=-1).detach().numpy()
scores

array([[1.4609919e-04, 9.9985385e-01],
       [1.8732368e-04, 9.9981266e-01],
       [9.9967623e-01, 3.2375855e-04],
       [1.0317975e-03, 9.9896824e-01],
       [9.8580754e-01, 1.4192474e-02]], dtype=float32)
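
These PyTorch scores differ from the TensorFlow ones only in the last printed digits of the small probabilities. A quick way to check that kind of agreement is `np.allclose`. The sketch below copies both arrays from the outputs above; the names `tf_scores` and `pt_scores` and the tolerance of `1e-6` are choices made for this illustration.

```python
import numpy as np

# scores from the TensorFlow run
tf_scores = np.array([[1.4609906e-04, 9.9985385e-01],
                      [1.8732331e-04, 9.9981266e-01],
                      [9.9967623e-01, 3.2375794e-04],
                      [1.0317984e-03, 9.9896824e-01],
                      [9.8580754e-01, 1.4192480e-02]], dtype=np.float32)

# scores from the PyTorch run
pt_scores = np.array([[1.4609919e-04, 9.9985385e-01],
                      [1.8732368e-04, 9.9981266e-01],
                      [9.9967623e-01, 3.2375855e-04],
                      [1.0317975e-03, 9.9896824e-01],
                      [9.8580754e-01, 1.4192474e-02]], dtype=np.float32)

# agreement within a small absolute tolerance
print(np.allclose(tf_scores, pt_scores, atol=1e-6))
# largest element-wise difference between the two frameworks
print(np.abs(tf_scores - pt_scores).max())
```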

In [13]:
# calculate the max score position
ids = np.argmax(scores, axis=-1)
# calculate the max score itself
values = np.max(scores, axis=-1)
# look up the sentiment label for each input
labels = [model.config.id2label[i] for i in ids]
# print the results in the same format as the pipeline
for label, value in zip(labels, values):
    print({'label': label, 'score': value})

{'label': 'POSITIVE', 'score': 0.99985385}
{'label': 'POSITIVE', 'score': 0.99981266}
{'label': 'NEGATIVE', 'score': 0.9996762}
{'label': 'POSITIVE', 'score': 0.99896824}
{'label': 'NEGATIVE', 'score': 0.98580754}


That's awesome! We reproduced the pipeline's results in PyTorch as well.
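
As a recap, the PyTorch steps can be collected into a single function that behaves roughly like the pipeline. This is only a sketch, not how `pipeline` is implemented internally: `classify` is a hypothetical name, and the function reloads the tokenizer and model on every call, which a real application would avoid. It also wraps the forward pass in `torch.no_grad()`, since gradients are not needed for inference.

```python
def classify(texts, checkpoint='distilbert-base-uncased-finetuned-sst-2-english'):
    """Tokenize, run the model and post-process, mirroring the steps above."""
    # imported inside the function so that defining it stays cheap
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # 1. preprocessing: raw text -> padded token-id tensors
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    inputs = tokenizer(texts, truncation=True, padding=True, return_tensors='pt')

    # 2. model: token ids -> logits (no gradient tracking for inference)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits

    # 3. post-processing: logits -> probabilities -> labels
    probs = torch.softmax(logits, dim=-1)
    values, ids = probs.max(dim=-1)
    return [{'label': model.config.id2label[i], 'score': v}
            for i, v in zip(ids.tolist(), values.tolist())]

# Usage (downloads the checkpoint on first use):
# classify(['One of the best products I have ever used'])
```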

Key reference: [HuggingFace's NLP Course](https://huggingface.co/course)

### Thank you for your valuable time!