
# TRANSFORMER MODELS

In [None]:
!pip install transformers[sentencepiece]



In [None]:
import transformers

## transformers

### pipeline

In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline("sentiment-analysis")
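# No model is named, so the pipeline selects a default model for the task.
# The model is downloaded once and cached on disk
# (by default under ~/.cache/huggingface), not kept in RAM between sessions,
# so later runs reuse the cached copy instead of downloading it again.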

classifier(["My son finished last in the race.",
            "today are election results"

            ])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9946876764297485},
 {'label': 'POSITIVE', 'score': 0.998380184173584}]
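
The log above recommends naming the model and revision explicitly. A minimal sketch of that, assuming the checkpoint name and revision hash printed in the log are the ones you want to pin. `PINNED` and `build_classifier` are illustrative names, not part of the library:

```python
# Checkpoint name and revision copied from the log output above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported inside the function so defining the helper has no side effects.
    from transformers import pipeline

    return pipeline(**PINNED)
```

Calling `build_classifier()` downloads the pinned checkpoint on first use and reuses the cached copy afterwards.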

There are three main steps involved when you pass some text to a pipeline:

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.
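
The three steps can be illustrated with a toy classifier. This is only a sketch: the vocabulary, the per-token logits, and the function names below are made-up stand-ins, not the library's internals. A real pipeline loads its tokenizer, weights, and label names from the model checkpoint.

```python
import math

# Toy stand-ins (assumptions for illustration only).
VOCAB = {"[UNK]": 0, "good": 1, "bad": 2}
# Per-token contribution to the (NEGATIVE, POSITIVE) logits.
TOKEN_LOGITS = {0: (0.0, 0.0), 1: (-1.0, 1.0), 2: (1.0, -1.0)}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model understands."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def forward(token_ids):
    """Step 2: run the toy 'model', which returns one raw logit per label."""
    negative = sum(TOKEN_LOGITS[i][0] for i in token_ids)
    positive = sum(TOKEN_LOGITS[i][1] for i in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return {"label": LABELS[best], "score": probs[best]}


result = postprocess(forward(preprocess("a good race")))
print(result["label"])  # prints POSITIVE
```

The returned dictionary has the same `label` and `score` keys as the sentiment-analysis output shown earlier.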

#### Some common pipelines

In [None]:
# 1 - Zero-shot classification

Zero-shot classification handles texts that have not been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise.

The zero-shot-classification pipeline lets you specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model.

This pipeline is called zero-shot because you don't need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want.

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This college is a scam.",
    candidate_labels=['politics', 'education', 'business'],
)


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'This college is a scam.',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.723120391368866, 0.20106850564479828, 0.07581117004156113]}
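
A short sketch of how this result can be read, using the values copied from the output above. The variable names are illustrative:

```python
# Result copied from the output above.
result = {
    "sequence": "This college is a scam.",
    "labels": ["education", "business", "politics"],
    "scores": [0.723120391368866, 0.20106850564479828, 0.07581117004156113],
}

# `labels` and `scores` are parallel lists; here they are ordered from the
# most likely label to the least likely one.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# In this output the scores sum to 1 (up to rounding).
total = sum(result["scores"])

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")
```

Note that the order of `labels` in the result follows the scores, not the order in which `candidate_labels` was passed.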

In [None]:
# 2 - Text Generation

The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.

In [None]:
generator = pipeline("text-generation")
generator("this college is a scam")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "this college is a scam. It's also not a great place for kids.'\n\nHe's now asking the campus police to contact him and help them set up a hotline.\n\n'I hope we get him to talk to them and he"}]

In [None]:
# Load a specific model from the Hub instead of the task default

generator = pipeline("text-generation", model="distilgpt2")
generator("this college is a scam",
          max_length=30,
          num_return_sequences=2)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "this college is a scam); and if that․ proves so, it ain't a scam, and it ain't an easy deal.\n\n"},
 {'generated_text': 'this college is a scam so you are not going to be able to pay off college tuition until you have graduated.\n\nThe second thing I will'}]

In [None]:
# 3 - Mask Filling

The idea of this task is to fill in the blanks in a given text:

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This is an introductory course to <mask> learning.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.07702131569385529,
  'token': 3563,
  'token_str': ' machine',
  'sequence': 'This is an introductory course to machine learning.'},
 {'score': 0.0361526682972908,
  'token': 37700,
  'token_str': ' reinforcement',
  'sequence': 'This is an introductory course to reinforcement learning.'}]
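
The mask token is model-specific: the default checkpoint used here expects `<mask>`, while BERT-style checkpoints expect `[MASK]`. A small sketch that keeps the prompt independent of the model. `build_prompt` is a hypothetical helper, and in real use the token would typically be read from `unmasker.tokenizer.mask_token` rather than hard-coded:

```python
def build_prompt(template, mask_token):
    """Insert the model's mask token into a template containing '{mask}'."""
    return template.format(mask=mask_token)


template = "This is an introductory course to {mask} learning."

# RoBERTa-style checkpoints (such as the default above) use "<mask>".
roberta_prompt = build_prompt(template, "<mask>")
# BERT-style checkpoints use "[MASK]".
bert_prompt = build_prompt(template, "[MASK]")

print(roberta_prompt)
print(bert_prompt)
```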

In [None]:
# 4 - Named Entity Recognition

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let’s look at an example:

In [None]:
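# aggregation_strategy="simple" regroups the tokens that belong to the same entity;
# it replaces the deprecated grouped_entities=True argument.
ner = pipeline("ner", aggregation_strategy="simple")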
ner("My name is Karthik and I am an ordinary person in India, working for hugging face.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.9975899,
  'word': 'Karthik',
  'start': 11,
  'end': 18},
 {'entity_group': 'LOC',
  'score': 0.9995468,
  'word': 'India',
  'start': 50,
  'end': 55}]
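
The `start` and `end` fields are character offsets into the input string. A brief sketch that recovers each entity from the original text, using the values copied from the output above (scores omitted):

```python
text = "My name is Karthik and I am an ordinary person in India, working for hugging face."

# Entities copied from the output above.
entities = [
    {"entity_group": "PER", "word": "Karthik", "start": 11, "end": 18},
    {"entity_group": "LOC", "word": "India", "start": 50, "end": 55},
]

# Slice the input with each entity's offsets to get the matching text.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]
print(spans)  # prints [('PER', 'Karthik'), ('LOC', 'India')]
```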

In [None]:
# 5 - Question Answering
question_answer = pipeline("question-answering")
question_answer(
    question="what is the captital of China",
    context="The 2008 olympics was held in Beijing, the captial of china."
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.5523170828819275, 'start': 54, 'end': 59, 'answer': 'china'}

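Note that this pipeline works by extracting information from the provided context; it does not generate the answer. Here the extracted span is 'china' (characters 54 to 59 of the context) rather than the expected 'Beijing', and the score of about 0.55 is relatively low, so extracted answers are worth checking. The question and context are reproduced as they were run, including their misspellings.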
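
The extractive behaviour can be checked directly against the result: the answer is the slice of the context between the two character offsets. A minimal sketch using the values copied from the output above:

```python
context = "The 2008 olympics was held in Beijing, the captial of china."

# Result copied from the output above.
result = {"score": 0.5523170828819275, "start": 54, "end": 59, "answer": "china"}

# The answer is copied out of the context, not generated.
extracted = context[result["start"]:result["end"]]
print(extracted)  # prints china
```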

In [None]:
# 6 - Summarization

In [None]:
summarizer = pipeline("summarization")
summarizer("""
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
""",
    max_length=100,
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [None]:
# 7 - Translation

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
translator("what is your name?")

Device set to use cpu


[{'translation_text': 'आपका Windows Live कूटशब्द क्या है?'}]

The pipelines shown so far are mostly for demonstrative purposes. They were programmed for specific tasks and cannot perform variations of them. In the next chapter, you’ll learn what’s inside a pipeline() function and how to customize its behavior.

