## Installations

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Imports

In [2]:
from transformers import pipeline

## Sentiment Analysis

Sentiment analysis is a natural language processing technique that identifies the polarity of a given text. Common practical uses include analysing tweets, product reviews, and support tickets. It allows large amounts of data to be processed quickly, even in real time. As shown in the diagram below, we can use sentiment analysis to predict a label and a score for an Amazon product review.

The evaluation metric for this type of analysis is the *accuracy score*. The default model behind the "sentiment-analysis" pipeline is "distilbert-base-uncased-finetuned-sst-2-english". Its base model, DistilBERT, was pre-trained on a large corpus of raw English text in a self-supervised fashion, with no humans labelling the text in any way. The checkpoint was then fine-tuned on SST-2, a sentiment dataset with human-assigned labels.
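As a rough illustration of the accuracy metric, the sketch below compares predicted labels with reference labels. The predictions mimic the list-of-dicts shape that the pipeline returns; the scores, the reference labels, and the `accuracy` helper are made up for this example.

```python
def accuracy(predictions, references):
    """Fraction of predictions whose label matches the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(pred["label"] == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


# Hypothetical pipeline-style outputs and hand-assigned reference labels.
predictions = [
    {"label": "NEGATIVE", "score": 0.99},
    {"label": "POSITIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.97},
    {"label": "NEGATIVE", "score": 0.95},
]
references = ["NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

acc = accuracy(predictions, references)
print(f"accuracy = {acc:.2f}")  # 3 of 4 labels match
```

Note that accuracy only looks at the predicted label, not at the confidence score attached to it.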

We can start by loading a text classification pipeline with pipeline() and the "sentiment-analysis" task identifier. Here we pass the default model, "distilbert-base-uncased-finetuned-sst-2-english", explicitly. Note that you can skip specifying the model, but you will receive a warning message.

In [3]:
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")



In [7]:
print(classifier("this product is shit"))
print(classifier("I'm very excited for today's match"))
print(classifier("Its okay"))

[{'label': 'NEGATIVE', 'score': 0.9997866749763489}]
[{'label': 'POSITIVE', 'score': 0.9996927976608276}]
[{'label': 'POSITIVE', 'score': 0.999775230884552}]


## Topic Classification

The topic classification task assigns sequences to class names that you specify. It uses the "zero-shot-classification" pipeline to do so. Zero-Shot Learning (ZSL) is when a classifier is trained on one set of labels and then evaluated on a different set of labels that it has never seen before.

First, we load a pipeline with "zero-shot-classification". Then we pass a sequence that we want to classify together with a list of candidate labels, and see how the model scores each label for the input.

In [8]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


In [9]:
classifier("This spine-chilling story had me on the edge of my seat, with every creaking sound in the old, dark house sending shivers down my spine.",
    candidate_labels=["art", "natural science", "data analysis", "horror"])


{'sequence': 'This spine-chilling story had me on the edge of my seat, with every creaking sound in the old, dark house sending shivers down my spine.',
 'labels': ['horror', 'art', 'natural science', 'data analysis'],
 'scores': [0.9906460642814636,
  0.004592929966747761,
  0.002970872912555933,
  0.0017901540268212557]}
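
The result keeps labels and scores in two parallel lists, sorted with the best label first. The sketch below, which copies the values from the output above, pairs them up and picks the top label. With the pipeline's default single-label behaviour the scores are normalized across the candidate labels, so they should add up to roughly 1.

```python
# Values copied from the pipeline output shown above.
result = {
    "labels": ["horror", "art", "natural science", "data analysis"],
    "scores": [
        0.9906460642814636,
        0.004592929966747761,
        0.002970872912555933,
        0.0017901540268212557,
    ],
}

# Pair each candidate label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# The label with the highest score is the predicted topic.
best_label = max(label_scores, key=label_scores.get)
total = sum(result["scores"])

print(best_label, round(label_scores[best_label], 4))
print("sum of scores:", round(total, 4))
```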

In [11]:
classifier("what should i use, tableau or powerBI for data insights",
    candidate_labels=["art", "natural science", "data analysis", "horror"])

{'sequence': 'what should i use, tableau or powerBI for data insights',
 'labels': ['data analysis', 'art', 'natural science', 'horror'],
 'scores': [0.921114444732666,
  0.05981069803237915,
  0.015579456463456154,
  0.003495367243885994]}

## Text Generator

Text generation, also known as causal language modeling, is the task of predicting the next word in a sentence given some previous input. This task is very similar to the autocomplete (predictive text) function on our phones. Classification metrics cannot be used for this task, as there is no single correct answer. Instead, the distribution over text predicted by the model is evaluated with cross-entropy loss and perplexity. The default model behind the text generation pipeline is Generative Pre-trained Transformer 2 (GPT-2). It can receive an input like "This course will teach you" and proceed to complete the sentence based on those first words, as shown in the diagram below.

Besides completion models, there are also text-to-text models. These models are trained to learn the mapping between a pair of texts (e.g. translation from one language to another). The most popular variants are T5, T0 and BART. Text-to-text models are trained on multiple tasks, so they can accomplish a wide range of them, including summarization, translation, and text classification.
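
The cross-entropy loss and perplexity used to evaluate text generation can be sketched in a few lines. This is a minimal illustration, not how any particular library computes them: the probability lists are invented, and each number stands for the probability a model assigned to the correct next token.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-probability assigned to the correct next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the average cross-entropy."""
    return math.exp(cross_entropy(token_probs))


# Hypothetical per-token probabilities for two imaginary models.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.25, 0.25, 0.25, 0.25]

print("confident model:", perplexity(confident))
print("uncertain model:", perplexity(uncertain))
```

A lower perplexity means the model was less "surprised" by the text. A model that always spreads its probability evenly over four choices has a perplexity of about 4.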

In [12]:
generator = pipeline("text-generation", model="gpt2")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Mourya is very happy with your help. Although it really helps for her. [22:15:21] <Xenocider> https://twitter.com/Xenocider/_status/6241675274765'}]

In [14]:
generator("Mourya is very happy")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Mourya is very happy.\n\nWe have a lot of new stuff coming up. We have the new camera and other stuff too! We have to finish everything at least in two weeks.'}]

In [15]:
generator("Where there is will, there is")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Where there is will, there is no way to escape it. We did the things of a fatherly nature, and in order so to continue our duties there we must first have the utmost respect for our fellows in this kingdom. And as we are'}]

Alternatively, we can use the "distilgpt2" model and pass parameters such as the maximum length and the number of sequences to return. Distilled GPT-2 (DistilGPT2) is an English-language model pre-trained with the supervision of the smallest version of GPT-2. Like GPT-2, DistilGPT2 can be used to generate text.

In [16]:
generator_distill = pipeline("text-generation", model="distilgpt2")



In [18]:
generator_distill(
    "where there is will, there is",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'where there is will, there is not one person, but several people. There are very few things that can be said about them.\n"If'},
 {'generated_text': "where there is will, there is going to be a problem, but we don't want to see, the problem of how we use these kinds of"}]

Sometimes it is useful to use masked language modeling, which can also produce text. Masked language modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want a statistical understanding of the language on which the model was trained. Masked language models do not require labelled data: they are trained by masking a few words in each sentence and having the model guess the masked words. The diagram below shows a simple representation of this concept.

For example, masked language modeling is used to train large models for domain-specific problems. If you have to work on a domain-specific task, such as retrieving information from medical research papers, you can train a masked language model using those papers.
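
To show why no labels are needed, here is a simplified sketch of how a masked training example can be built from raw text. It masks one whole word chosen at random, whereas real models mask a fraction of subword tokens. The `mask_one_word` helper and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # mask string used by RoBERTa-style tokenizers


def mask_one_word(sentence, rng):
    """Replace one randomly chosen word with the mask token.

    Returns the masked sentence and the hidden word, which serves
    as the training target.
    """
    words = sentence.split()
    index = rng.randrange(len(words))
    target = words[index]
    words[index] = MASK_TOKEN
    return " ".join(words), target


rng = random.Random(0)  # fixed seed so the example is repeatable
sentence = "This course will teach you all about language models"
masked, target = mask_one_word(sentence, rng)

print(masked)
print("target:", target)
```

The target comes from the text itself, so any collection of raw documents, such as medical research papers, can be turned into training data this way.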

In [19]:
unmasker = pipeline("fill-mask", model="distilroberta-base")


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [20]:
unmasker("This course will teach you all about <mask> models.", top_k=4)

[{'score': 0.19619806110858917,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052723944187164,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'},
 {'score': 0.03301795944571495,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'This course will teach you all about predictive models.'},
 {'score': 0.031941577792167664,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models.'}]

In [22]:
unmasker("where there is <mask>, there is a way", top_k=4)

[{'score': 0.022514084354043007,
  'token': 35566,
  'token_str': ' ambiguity',
  'sequence': 'where there is ambiguity, there is a way'},
 {'score': 0.022232506424188614,
  'token': 1034,
  'token_str': ' hope',
  'sequence': 'where there is hope, there is a way'},
 {'score': 0.02035563439130783,
  'token': 4854,
  'token_str': ' danger',
  'sequence': 'where there is danger, there is a way'},
 {'score': 0.0193361546844244,
  'token': 2400,
  'token_str': ' pain',
  'sequence': 'where there is pain, there is a way'}]

## Named Entity Recognition (NER)

NER, sometimes also referred to as entity chunking, extraction, or identification, is the task of identifying and categorizing key information (entities) in text. The model labels each entity with a category such as person ('PER'), organization ('ORG'), or location ('LOC'), and returns a confidence score and the character offsets of the entity in the text.

The default model behind the "ner" pipeline is "dbmdz/bert-large-cased-finetuned-conll03-english", a cased BERT-large model fine-tuned on the English CoNLL-2003 dataset. It is the model loaded explicitly below.

In [23]:
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





In [24]:
ner("My name is Mourya and i am  doing generative AI. It is pretty cool. OpenAI and HuggingFace are doing great")

[{'entity_group': 'PER',
  'score': 0.99849683,
  'word': 'Mourya',
  'start': 11,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.94939035,
  'word': 'OpenAI',
  'start': 68,
  'end': 74},
 {'entity_group': 'ORG',
  'score': 0.9662623,
  'word': 'HuggingFace',
  'start': 79,
  'end': 90}]
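
The `start` and `end` values are character offsets into the input string, so each entity can be recovered by slicing the original text. The sketch below copies the input and the entities from the output above (scores omitted).

```python
# Input text, copied exactly (including the double space after "am").
text = "My name is Mourya and i am  doing generative AI. It is pretty cool. OpenAI and HuggingFace are doing great"

# Entities copied from the output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Mourya", "start": 11, "end": 17},
    {"entity_group": "ORG", "word": "OpenAI", "start": 68, "end": 74},
    {"entity_group": "ORG", "word": "HuggingFace", "start": 79, "end": 90},
]

# Slice the text with each entity's offsets.
spans = [text[entity["start"]:entity["end"]] for entity in entities]

for entity, span in zip(entities, spans):
    print(f'{entity["entity_group"]}: {span!r}')
```

Slicing by offsets is handy when you need to highlight or replace entities in the original text, because `word` may not always match the source text character for character.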

## Question Answering

Another widely used application of Hugging Face transformers is question answering: the task of extracting an answer from a document. QA models take a context, which is the document you are searching for information, and a question, and return an answer. The answer is extracted from the context, not generated. The task is evaluated on two metrics: exact match and F1 score.
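
The two metrics can be sketched as follows. This is a simplified version that only lowercases and splits on whitespace; standard evaluation scripts usually apply more normalization (for example, stripping punctuation). The helper names and the example strings are assumptions made for illustration.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1 if the normalized answers are identical, otherwise 0."""
    return int(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the predicted and the reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Count tokens that appear in both answers, respecting multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("GenAI", "genai")
partial_f1 = f1_score("referred as GenAI", "GenAI")
print("exact match:", em)
print("F1 for a partially correct answer:", partial_f1)
```

Exact match is all or nothing, while F1 gives partial credit when the predicted span overlaps the reference answer but includes extra or missing words.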

In [25]:
qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")


In [26]:
question = "what is the other word for Generative AI"
context = "My name is Mourya and i am  doing generative AI. It is pretty cool. It is also referred as GenAI"
qa_model(question=question, context=context)

{'score': 0.9339084625244141, 'start': 91, 'end': 96, 'answer': 'GenAI'}

## Text Summarization

Text summarization is the task of creating a shorter version of a document while preserving its most relevant and important information. The summarizer model takes the whole document as input and outputs the summarized version. The evaluation metric used for this task is called ROUGE. It scores a summary by the token sequences (n-grams and subsequences) that the produced summary shares with a reference summary.
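
As a rough idea of how ROUGE works, the sketch below computes ROUGE-1, the variant based on single-word overlap. It is a simplified illustration with whitespace tokenization; the `rouge_1` helper and both example sentences are assumptions, not the output of an official ROUGE implementation.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Shared words, counted at most as often as they occur in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical produced summary and reference summary.
candidate = "spirituality is a belief in something beyond the self"
reference = "spirituality is the broad concept of a belief in something beyond the self"

scores = rouge_1(candidate, reference)
print(scores)
```

Here every word of the candidate appears in the reference, so precision is perfect, while recall is lower because the candidate leaves out some of the reference's words.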

In [27]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


In [28]:
summarizer(
    """
Spirituality is the broad concept of a belief in something beyond the self. It strives to answer questions about the meaning of life, how people are connected to each other, truths about the universe, and other mysteries of human existence.

Spirituality offers a worldview that suggests there is more to life than just what people experience on a sensory and physical level. Instead, it suggests that there is something greater that connects all beings to each other and to the universe itself.

It may involve religious traditions centering on the belief in a higher power. It can also involve a holistic belief in an individual connection to others and the world as a whole.

"""
)

Your max_length is set to 142, but your input_length is only 136. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)


[{'summary_text': ' Spirituality is the broad concept of a belief in something beyond the self . It strives to answer questions about the meaning of life, how people are connected to each other, truths about the universe . It may involve religious traditions centering on the belief in a higher power . It can also involve a holistic belief in the world as a whole .'}]

## Translation

Translation models take an input in some source language and output the translation in a target language. The evaluation metric used for this task is called BLEU (bilingual evaluation understudy), an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. Scores range from 0 to 1, with 1 being a perfect score.
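
The idea behind BLEU can be sketched with a simplified sentence-level version. It assumes a single reference, n-grams up to length 2 (standard BLEU typically goes up to 4), whitespace tokenization, and no smoothing. The `simple_bleu` helper and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: clipped n-gram precision times a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "bonjour je suis ravi de vous rencontrer"
perfect = simple_bleu(reference, reference)
too_short = simple_bleu("bonjour je suis ravi", reference)
print("identical translation:", perfect)
print("truncated translation:", too_short)
```

The truncated translation contains only correct words, yet it scores below 1 because the brevity penalty lowers the score of outputs that are shorter than the reference.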

There are two types of models: monolingual models, trained on data for one specific language pair, and multilingual models, trained on data covering multiple languages.

In [29]:
en_fr_translator = pipeline("translation_en_to_fr", model="t5-small")


In [30]:
en_fr_translator("Hi, I am Mourya, I'm excited to meet you")

[{'translation_text': 'Bonjour, je suis Mourya, je suis ravi de vous rencontrer.'}]