The Transformer architecture comprises an encoder and a decoder that work together to map an input sequence to an output sequence.

**Pipeline API**

The pipeline API is a high-level API that performs all the required steps, from preprocessing the inputs to postprocessing the model outputs. It supports a variety of NLP tasks, including sentiment analysis, feature extraction, question answering, and summarization.

In [28]:
!pip install transformers[sentencepiece]

Collecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece])
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


**Sentiment Analysis**

In [29]:
from transformers import pipeline


In [30]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Because no model was specified, the pipeline API chose a default pre-trained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when the classifier object is created, so rerunning the above command loads the cached model instead of downloading it again.
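The warning printed when the pipeline was created recommends naming the model and revision explicitly. A minimal sketch of that is shown here, assuming the model name and revision reported in the warning; `PINNED_SENTIMENT` and `build_pinned_classifier` are hypothetical names introduced only for this illustration.

```python
# Model name and revision copied from the warning printed by the pipeline.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported inside the function so the settings above can be inspected
    # without loading the library or downloading anything.
    from transformers import pipeline

    return pipeline(**PINNED_SENTIMENT)
```

Calling `build_pinned_classifier()` should return a classifier equivalent to the default one above, but without the warning, since the model no longer depends on the library's default choice.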

We can verify which model the present classifier object is using in the following way:

In [31]:
classifier.model.name_or_path

'distilbert-base-uncased-finetuned-sst-2-english'

Here, the classifier uses DistilBERT, a smaller, distilled version of BERT, fine-tuned on the SST-2 sentiment dataset.

In [32]:
classifier("I am  really excited about about today !! ")

[{'label': 'POSITIVE', 'score': 0.9997952580451965}]

We observe that the classifier has classified the input sentence as a POSITIVE sentence with almost 100% confidence.

In [33]:
classifier(["I am very excited for this new movie !!",
            "I not very unhappy",
            "I hate this weather !!",
            "I really hate that movie.."])

[{'label': 'POSITIVE', 'score': 0.9998107552528381},
 {'label': 'POSITIVE', 'score': 0.8951895236968994},
 {'label': 'NEGATIVE', 'score': 0.9992252588272095},
 {'label': 'NEGATIVE', 'score': 0.9994947910308838}]

With a few more examples, we can see that the classifier assigns the expected sentiment to each sentence with high confidence. The second sentence, which contains a double negative, receives the lowest score (about 0.90).
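Since each prediction carries a score, one possible way to use it is to flag low-confidence predictions for manual review. The sketch below reuses the sentences and predictions from the cell above; the threshold value and the helper name `needs_review` are assumptions made only for illustration.

```python
# Sentences and predictions copied from the cell above.
sentences = [
    "I am very excited for this new movie !!",
    "I not very unhappy",
    "I hate this weather !!",
    "I really hate that movie..",
]
predictions = [
    {"label": "POSITIVE", "score": 0.9998107552528381},
    {"label": "POSITIVE", "score": 0.8951895236968994},
    {"label": "NEGATIVE", "score": 0.9992252588272095},
    {"label": "NEGATIVE", "score": 0.9994947910308838},
]

REVIEW_THRESHOLD = 0.95  # hypothetical cut-off, chosen only for illustration


def needs_review(sentences, predictions, threshold):
    """Return (sentence, label, score) for predictions scored below threshold."""
    return [
        (text, pred["label"], pred["score"])
        for text, pred in zip(sentences, predictions)
        if pred["score"] < threshold
    ]


flagged = needs_review(sentences, predictions, REVIEW_THRESHOLD)
for text, label, score in flagged:
    print(f"{text!r}: {label} ({score:.3f})")
```

With this threshold, only the ambiguous sentence "I not very unhappy" is flagged.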

**Zero-Shot Classification**

In zero-shot classification, the model assigns labels that it was not explicitly trained on. We define the candidate labels ourselves, as per our needs, when calling the pipeline.

In [34]:
from transformers import pipeline
classifier = pipeline('zero-shot-classification')

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [36]:
classifier("This is really a very good course about how to cook mutton",
           candidate_labels = ["Education", "Eat", "Cooking"])

{'sequence': 'This is really a very good course about how to cook mutton',
 'labels': ['Cooking', 'Eat', 'Education'],
 'scores': [0.7549446225166321, 0.226561039686203, 0.018494367599487305]}

The model is about 75% confident that the sentence is about cooking.

In [39]:
classifier(
    ["This is a course on advanced deep learning",
     "This AI course can generate more than 10 million in revenue",],
    candidate_labels=["education", "ai", "business"],
)

[{'sequence': 'This is a course on advanced deep learning',
  'labels': ['ai', 'education', 'business'],
  'scores': [0.721426248550415, 0.21009023487567902, 0.06848350167274475]},
 {'sequence': 'This AI course can generate more than 10 million in revenue',
  'labels': ['ai', 'business', 'education'],
  'scores': [0.6334286332130432, 0.34118345379829407, 0.0253879576921463]}]

The model is about 72% confident that the first sentence is about AI and about 63% confident that the second sentence is about AI.
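The zero-shot output stores labels and scores as two parallel lists, sorted from the highest score to the lowest. The sketch below, which reuses the output of the first example, pairs them up and checks that the scores add up to roughly 1, as they do in the pipeline's default single-label mode.

```python
# Output copied from the first zero-shot example above.
result = {
    "sequence": "This is really a very good course about how to cook mutton",
    "labels": ["Cooking", "Eat", "Education"],
    "scores": [0.7549446225166321, 0.226561039686203, 0.018494367599487305],
}

# Labels and scores are parallel lists, so zip pairs each label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

top_label = result["labels"][0]  # labels are sorted by descending score
total = sum(result["scores"])    # close to 1.0 in the default single-label mode

print(top_label, round(label_scores[top_label], 3))
print(round(total, 6))
```

If several labels can apply at once, the pipeline also accepts a `multi_label=True` argument, in which case each label is scored independently and the scores no longer need to sum to 1.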

**Text Generation**

Text generation starts from an initial prompt, and the model auto-completes the remaining text. Because generation involves some randomness, your results may not exactly match the ones shown here.

In [40]:
from transformers import pipeline

generator = pipeline('text-generation')

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [41]:
results = generator("I am really happy because ",
          num_return_sequences=2,
          max_length=30)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [42]:
for i in results:
    print(i['generated_text'])
    print('\n')

I am really happy because  I have all of these great things that my mom loves as she helps me to do things. If I can do


I am really happy because irl has been playing around with it.I just really like the color palette. The black and green feel of the blue
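The two completions above differ because the next token is sampled at random from the model's predicted distribution. The toy sketch below illustrates the idea with made-up probabilities (they are not taken from GPT-2) and shows that fixing the random seed makes the draws repeatable.

```python
import random

# Hypothetical next-token probabilities after some prompt; not taken from GPT-2.
next_token_probs = {"today": 0.4, "I": 0.3, "of": 0.2, "the": 0.1}


def sample_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Two generators created with the same seed produce the same sequence of draws.
rng_a = random.Random(0)
rng_b = random.Random(0)
draws_a = [sample_token(next_token_probs, rng_a) for _ in range(5)]
draws_b = [sample_token(next_token_probs, rng_b) for _ in range(5)]

print(draws_a)
print(draws_a == draws_b)
```

For the real pipeline, the transformers library provides a `set_seed` helper that can be called before generation to make results more repeatable.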




**Question Answering**

The question answering pipeline answers a question by extracting the answer from the context passage that we provide.

In [43]:
from transformers import pipeline

question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [44]:
question_answerer(
    question = "What is the capital of India ?",
    context = """India, officially the Republic of India, is a country in South Asia.
    It is the second-most populous country, the seventh-largest country by land area, and the most populous democracy in the world.
    The capital is New Delhi"""
)

{'score': 0.9928799867630005, 'start': 221, 'end': 230, 'answer': 'New Delhi'}
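The `start` and `end` values in the output are character offsets into the context, so slicing the context with them recovers the answer. The sketch below shows the same idea on a shortened, hypothetical context, where the offsets are found with `str.find` instead of a model.

```python
# A shortened, hypothetical context; its offsets differ from the output above.
context = "India is a country in South Asia. The capital is New Delhi"
answer = "New Delhi"

# str.find gives the character offset at which the answer begins.
start = context.find(answer)
end = start + len(answer)

# Slicing the context with [start:end] recovers the answer text.
print(start, end, context[start:end])
```

Note that in the pipeline output above, `end - start` is 9, which is the length of 'New Delhi'.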

**Summarization**

The summarization pipeline generates a summary of the given input text while retaining its most important points.

In [45]:
from transformers import pipeline

summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


We can pass the input text together with a max_length or min_length argument to control the length of the generated summary.

In [47]:
summarizer("""
America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
""", max_length=40)

Your min_length=56 must be inferior than your max_length=40.


[{'summary_text': ' America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers . Rapidly developing economies such as China and India continue to encourage and advance the teaching'}]

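The warning above appears because the max_length of 40 that we passed is smaller than the model's default min_length of 56. We can also pass only the input text and omit the max_length and min_length arguments, in which case the model's default values are used.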

In [53]:
summarizer("""
America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.
    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
""")

[{'summary_text': ' America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers . Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues .'}]

**Using any model with the pipeline**

Let us create the text generation pipeline again, this time explicitly specifying the GPT-2 model.

In [52]:
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

We can verify the model in the following way:

In [50]:
generator.model.name_or_path

'gpt2'

From the above examples, it is clear that pre-trained models can perform a wide range of NLP tasks with very little code.

**Tokenizers**

Like other neural networks, transformers cannot process raw input text directly. The text must first be preprocessed, and this process is called tokenization, where the text input is converted into numbers. Preprocessing and tokenization must be done in exactly the same way as they were during the training of the model. As we are using pre-trained models, the tokenizer that corresponds to the model should be used. This can be achieved with the help of the AutoTokenizer class.

To do this, the tokenizer takes the following steps:

(i) The tokenizer splits the input text into words, subwords, or individual characters, which are known as tokens.

(ii) Each token is mapped to a unique integer.

(iii) Any additional inputs that the model requires, such as special tokens and the attention mask, are added.
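The three steps above can be sketched with a toy tokenizer. This is only an illustration under simplifying assumptions: the vocabulary `TOY_VOCAB` and the functions `toy_tokenize` and `toy_encode` are made up for this example, and the splitting is word-level, whereas real tokenizers typically work with subwords and a much larger vocabulary.

```python
import re

# Toy vocabulary for illustration only; a real tokenizer loads its own vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "weather": 7, "!": 8}


def toy_tokenize(text):
    """Step (i): lowercase the text and split it into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Steps (ii) and (iii): map tokens to ids and add special tokens."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]


print(toy_tokenize("I hate this weather !!"))
print(toy_encode("I hate this weather !!"))
```

The encoded list starts with the `[CLS]` id and ends with the `[SEP]` id, mirroring how special tokens wrap each sequence.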

In [1]:
!pip install -q transformers[sentencepiece]

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

Here, the AutoTokenizer class has been imported from the transformers library, and the name of the model checkpoint has been stored. The tokenizer itself is created from this checkpoint name in the next cell.

In [3]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint)

Now, we have initialized the tokenizer. Executing the code downloaded the tokenizer of the model named distilbert-base-uncased-finetuned-sst-2-english and cached it for further use.

In [4]:
raw_inputs = [
    "I have been waiting for this HAAI course for last 3 months.",
    "I am very excited about learning new AI models !!",
    "I hate this hot weather which makes me feel irritated  !"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)


{'input_ids': tensor([[  101,  1045,  2031,  2042,  3403,  2005,  2023,  5292,  4886,  2607,
          2005,  2197,  1017,  2706,  1012,   102],
        [  101,  1045,  2572,  2200,  7568,  2055,  4083,  2047,  9932,  4275,
           999,   999,   102,     0,     0,     0],
        [  101,  1045,  5223,  2023,  2980,  4633,  2029,  3084,  2033,  2514,
         15560,   999,   102,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}


Transformer models accept only tensors as input, which is why the raw texts are first converted into tensors. Here, return_tensors="pt" requests PyTorch tensors.

In the output, input_ids contains the token ids of each sentence, with shorter sentences padded with 0 to a common length. The attention_mask contains 1s and 0s that tell the attention layers of the transformer which tokens to attend to (1) and which to ignore (0, the padding positions).
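To make the relation between padding and the attention mask concrete, the sketch below pads the three id sequences from the output above by hand. The helper `pad_batch` is a hypothetical name; the real tokenizer does this internally when `padding=True` is passed.

```python
PAD_ID = 0  # the padding id visible in the input_ids printed above


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Unpadded token ids of the three sentences, copied from the output above.
batch = [
    [101, 1045, 2031, 2042, 3403, 2005, 2023, 5292, 4886, 2607,
     2005, 2197, 1017, 2706, 1012, 102],
    [101, 1045, 2572, 2200, 7568, 2055, 4083, 2047, 9932, 4275, 999, 999, 102],
    [101, 1045, 5223, 2023, 2980, 4633, 2029, 3084, 2033, 2514, 15560, 999, 102],
]

input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(len(ids), mask)
```

The first sentence has 16 tokens and needs no padding, while the other two have 13 tokens each and receive three padding positions, which is exactly where the zeros appear in the attention mask.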

**Model**

In [5]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

The pre-trained model is downloaded in the same way as the tokenizer: we call the from_pretrained method, this time on the AutoModel class, with the same checkpoint name.

In [6]:
model = AutoModel.from_pretrained(checkpoint)

In [7]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([3, 16, 768])


We obtain the outputs by passing the tokenized inputs to the model. From the shape of the last hidden state, we observe a batch size (the number of sequences processed at a time) of 3, a sequence length (the number of tokens in each padded sequence) of 16, and a hidden size (the dimension of the vector that the model produces for each token) of 768.

**Sequence Classification**

In [75]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [76]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint)

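Next, we load the model with a sequence classification head so that we can get predictions. Unlike AutoModel, which returns only hidden states, AutoModelForSequenceClassification also returns one logit per class.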

In [77]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [78]:
raw_inputs = [
    "I have been waiting for this HAAI course for last 3 months.",
    "I am very excited about learning new AI models !!",
    "I hate this hot weather which makes me feel irritated  !"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)


{'input_ids': tensor([[  101,  1045,  2031,  2042,  3403,  2005,  2023,  5292,  4886,  2607,
          2005,  2197,  1017,  2706,  1012,   102],
        [  101,  1045,  2572,  2200,  7568,  2055,  4083,  2047,  9932,  4275,
           999,   999,   102,     0,     0,     0],
        [  101,  1045,  5223,  2023,  2980,  4633,  2029,  3084,  2033,  2514,
         15560,   999,   102,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}


Our tokenized inputs are ready. Let us pass them to the model to get the predictions.

In [79]:
outputs = model(**inputs)
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.6973, -2.2845],
        [-4.0478,  4.3824],
        [ 3.9657, -3.2164]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [80]:
print(outputs.logits.shape)

torch.Size([3, 2])


The logits form a 3 x 2 matrix: one row for each of the 3 input sequences and one column for each of the 2 classes.

The logits are raw scores, not probabilities. We now pass them through the softmax activation to get the probability of each class for each input sentence.

In [81]:
# Convert the logits into class probabilities (torch was imported above)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(probabilities)

tensor([[9.9319e-01, 6.8147e-03],
        [2.1812e-04, 9.9978e-01],
        [9.9924e-01, 7.5948e-04]], grad_fn=<SoftmaxBackward0>)


Rounded, the probabilities are [0.993, 0.007] for the first input, [0.0002, 0.9998] for the second input, and [0.9992, 0.0008] for the third input.

Index 0 corresponds to the NEGATIVE class and index 1 to the POSITIVE class, as the label mapping at the end of this section confirms. Here we observe that our model is about 99% confident that the first input sample belongs to the NEGATIVE class, more than 99% confident that the second input sample belongs to the POSITIVE class, and more than 99% confident that the third input sample belongs to the NEGATIVE class. We observe that the model's output is quite accurate.
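To see what the softmax step does, the sketch below recomputes the probabilities by hand from the logits printed earlier, using only the standard library. It is an illustration of the formula, not a replacement for the PyTorch call.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps the exponentials numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above.
logits = [[2.6973, -2.2845], [-4.0478, 4.3824], [3.9657, -3.2164]]
manual_probabilities = [softmax(row) for row in logits]

for row in manual_probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the values agree with the tensor printed by PyTorch up to rounding.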

We can check the labels of the model in the following way:

In [82]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
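With this mapping, each row of probabilities can be turned into a readable label by taking the index of the largest value. The sketch below does this for the probabilities printed earlier; `predict_labels` is a hypothetical helper name used only for illustration.

```python
# Mapping and probabilities copied from the outputs above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
class_probabilities = [
    [9.9319e-01, 6.8147e-03],
    [2.1812e-04, 9.9978e-01],
    [9.9924e-01, 7.5948e-04],
]


def predict_labels(probabilities, id2label):
    """Return (label, probability) for the most likely class of each row."""
    results = []
    for row in probabilities:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        results.append((id2label[best], row[best]))
    return results


predicted = predict_labels(class_probabilities, id2label)
for label, prob in predicted:
    print(label, prob)
```

This is essentially the postprocessing that the sentiment analysis pipeline performed for us earlier when it returned a label and a score for each sentence.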