The pipeline API is the quickest way to start using pretrained models.

The following code shows basic uses of the pipeline function for:


1. Sentiment analysis
2. Zero-shot classification (custom labels)
3. Text generation
4. Fill-mask (filling in missing text)
5. Named entity recognition (labelling words)
6. Question answering
7. Summarization
8. Translation

The CodeCarbon library is also used here to estimate the carbon emissions of running one of these models (the translation pipeline).

In [1]:
from transformers import pipeline

In [2]:
classifier = pipeline('sentiment-analysis')

classifier([
    'are you supposed to be here ?',
    'what the hell are you doing',
    'great job'
])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9382468461990356},
 {'label': 'NEGATIVE', 'score': 0.996593177318573},
 {'label': 'POSITIVE', 'score': 0.9998588562011719}]
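
The warning above recommends naming the model and revision explicitly. The following is a minimal sketch of one way to do that, assuming the checkpoint name and short revision hash reported in the warning are the ones wanted. `PINNED_DEFAULTS` and `build_pinned_pipeline` are hypothetical names introduced for illustration; they are not part of transformers.

```python
# Checkpoint and revision copied from the warning printed by the sentiment-analysis cell.
# Entries for other tasks can be added the same way from their own warnings.
PINNED_DEFAULTS = {
    "sentiment-analysis": (
        "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "af0f99b",
    ),
}


def build_pinned_pipeline(task):
    """Hypothetical helper: build a pipeline with an explicit model and revision."""
    # Imported lazily so the mapping above can be inspected without loading transformers.
    from transformers import pipeline

    model_name, revision = PINNED_DEFAULTS[task]
    return pipeline(task, model=model_name, revision=revision)
```

Calling `build_pinned_pipeline('sentiment-analysis')` should then load the same checkpoint as the cell above without triggering the warning, provided that revision is still available on the Hub.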

In [3]:
classifier = pipeline('zero-shot-classification')

classifier(
    "We are hiring some programmers from italy",
    candidate_labels=['education', 'programing']
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'We are hiring some programmers from italy',
 'labels': ['programing', 'education'],
 'scores': [0.9929640293121338, 0.007035994436591864]}
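
The zero-shot result keeps labels and scores in two parallel lists. As a small sketch of how they might be paired up, the dictionary below is copied from the output above; the variable names are chosen here for illustration only.

```python
# Result copied from the zero-shot cell output above.
result = {
    "sequence": "We are hiring some programmers from italy",
    "labels": ["programing", "education"],
    "scores": [0.9929640293121338, 0.007035994436591864],
}

# Pair each candidate label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score.
best_label = max(label_scores, key=label_scores.get)
print(best_label, label_scores[best_label])
```

In this output the two scores sum to roughly 1, which is consistent with the labels being treated as mutually exclusive choices.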

In [4]:
generator = pipeline('text-generation')
generator("The roadmap for learning data science would be")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The roadmap for learning data science would be:\n\nDo not use tools, and do not use tools that are no more valuable to you than those of the user.\n\nLearn, learn, learn. Not because you're better or that knowledge"}]

In [5]:
generators = pipeline('text-generation', model='distilgpt2')

generators(
    "Ronaldo scores a goal against",
    max_length=50,
    num_return_sequences=2
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Ronaldo scores a goal against Everton The Gunners celebrate their win over Everton on Saturday in London © Getty Images\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'},
 {'generated_text': 'Ronaldo scores a goal against Sunderland, 6-1\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSteven Caulker has started all five Sunderland games in a row, as he has scored more goals in his'}]
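
The generated sequences above contain long runs of newline characters. One possible way to tidy such text before displaying it is sketched below; `tidy_generated_text` is a hypothetical helper, and the sample string is a shortened excerpt of the second sequence with fewer newlines than the real output.

```python
import re


def tidy_generated_text(text):
    """Collapse every run of whitespace (newlines included) into a single space."""
    return re.sub(r"\s+", " ", text).strip()


# Shortened excerpt of the second generated sequence above.
sample = "Ronaldo scores a goal against Sunderland, 6-1\n\n\n\nSteven Caulker has started"
print(tidy_generated_text(sample))
```

This only changes how the text is displayed; it does nothing about the model producing the newlines in the first place.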

In [6]:
unmasker = pipeline('fill-mask')

unmasker(
    "The reason for <mask> being extinct is the meteor shower.",
    top_k=2
)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.7734246850013733,
  'token': 29171,
  'token_str': ' dinosaurs',
  'sequence': 'The reason for dinosaurs being extinct is the meteor shower.'},
 {'score': 0.02204074151813984,
  'token': 43363,
  'token_str': ' Bigfoot',
  'sequence': 'The reason for Bigfoot being extinct is the meteor shower.'}]

In [7]:
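# aggregation_strategy='simple' merges word pieces into whole entities (it replaces the deprecated grouped_entities=True).
ner = pipeline('ner', aggregation_strategy='simple')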
ner("My name is Rayan and I live in Pakistan where I study in Fast University")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9986502,
  'word': 'Rayan',
  'start': 11,
  'end': 16},
 {'entity_group': 'LOC',
  'score': 0.99981433,
  'word': 'Pakistan',
  'start': 31,
  'end': 39},
 {'entity_group': 'ORG',
  'score': 0.9917166,
  'word': 'Fast University',
  'start': 57,
  'end': 72}]
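
Each entity in the output carries `start` and `end` character offsets into the input sentence. The sketch below, using the entities copied from the output above, groups the matched spans by entity type; `by_group` is a name introduced here for illustration.

```python
text = "My name is Rayan and I live in Pakistan where I study in Fast University"

# Entities copied from the NER cell output above.
entities = [
    {"entity_group": "PER", "score": 0.9986502, "word": "Rayan", "start": 11, "end": 16},
    {"entity_group": "LOC", "score": 0.99981433, "word": "Pakistan", "start": 31, "end": 39},
    {"entity_group": "ORG", "score": 0.9917166, "word": "Fast University", "start": 57, "end": 72},
]

# Slice the original text with the offsets and collect the spans per entity type.
by_group = {}
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    by_group.setdefault(entity["entity_group"], []).append(span)

print(by_group)
```

Slicing by offset rather than reading `word` keeps the original casing and spacing of the input, which can matter when the tokenizer normalizes text.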

In [8]:
question_answer = pipeline('question-answering')

question_answer(
    question='what is data engineering?',
    context="Data engineering is part of the big data ecosystem and is closely linked to data science. Data engineers work in the background and do not get the same level of attention as data scientists, but they are critical to the process of data science. The roles and responsibilities of a data engineer vary depending on an organization's level of data maturity and staffing levels; however, there are some tasks, such as the extracting, loading, and transforming of data, that are foundational to the role of a data engineer"
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.7034920454025269,
 'start': 20,
 'end': 50,
 'answer': 'part of the big data ecosystem'}
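
The question-answering pipeline is extractive: the answer is a span of the context, located by the `start` and `end` offsets. A small sketch of how those offsets could be used to mark the answer in place follows; `highlight` is a hypothetical helper, and the context is shortened to its first sentence, which leaves the offsets unchanged.

```python
# First sentence of the context used above (the answer lies inside it).
context = "Data engineering is part of the big data ecosystem and is closely linked to data science."

# Result copied from the question-answering cell output above.
answer = {
    "score": 0.7034920454025269,
    "start": 20,
    "end": 50,
    "answer": "part of the big data ecosystem",
}


def highlight(context, start, end):
    """Wrap the span context[start:end] in square brackets."""
    return context[:start] + "[" + context[start:end] + "]" + context[end:]


print(highlight(context, answer["start"], answer["end"]))
```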

In [9]:
summarizer = pipeline('summarization')
summarizer(
    """
Data engineering is part of the big data ecosystem and is closely linked to data science. Data engineers work in the background and do not get the same level of attention as data scientists, but they are critical to the process of data science. The roles and responsibilities of a data engineer vary depending on an organization's level of data maturity and staffing levels; however, there are some tasks, such as the extracting, loading, and transforming of data, that are foundational to the role of a data engineer
""",
    max_length=50,
    min_length=20
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Data engineering is part of the big data ecosystem and is closely linked to data science . Data engineers work in the background and do not get the same level of attention as data scientists . The roles and responsibilities of a data engineer vary depending on'}]

In [12]:
from codecarbon import EmissionsTracker

# CodeCarbon writes its measurements to a CSV file (emissions.csv by default) in the current directory.

tracker = EmissionsTracker()
tracker.start()

translator = pipeline('translation', model='Helsinki-NLP/opus-mt-en-fr')
translator(
    "Hi whats up"
)

# stop() ends tracking and returns the estimated emissions in kg of CO2-equivalent.
tracker.stop()

[codecarbon INFO @ 11:06:57] [setup] RAM Tracking...
[codecarbon INFO @ 11:06:57] [setup] GPU Tracking...
[codecarbon INFO @ 11:06:57] No GPU found.
[codecarbon INFO @ 11:06:57] [setup] CPU Tracking...
[codecarbon INFO @ 11:06:58] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 11:06:58] >>> Tracker's metadata:
[codecarbon INFO @ 11:06:58]   Platform system: Linux-6.1.58+-x86_64-with-glibc2.35
[codecarbon INFO @ 11:06:58]   Python version: 3.10.12
[codecarbon INFO @ 11:06:58]   CodeCarbon version: 2.3.4
[codecarbon INFO @ 11:06:58]   Available RAM : 12.675 GB
[codecarbon INFO @ 11:06:58]   CPU count: 2
[codecarbon INFO @ 11:06:58]   CPU model: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 11:06:58]   GPU count: None
[codecarbon INFO @ 11:06:58]   GPU model: None


[codecarbon INFO @ 11:07:12] Energy consumed for RAM : 0.000019 kWh. RAM Power : 4.753046035766602 W
[codecarbon INFO @ 11:07:12] Energy consumed for all CPUs : 0.000169 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 11:07:12] 0.000188 kWh of electricity used since the beginning.


2.6038722072126172e-05
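
The figures in the log can be related with a little arithmetic. The sketch below assumes the final number is the value returned by `tracker.stop()`, in kg of CO2-equivalent, and copies the energy figures from the last log lines; the variable names are chosen here for illustration.

```python
# Energy figures copied from the CodeCarbon log lines above (kWh).
ram_energy_kwh = 0.000019
cpu_energy_kwh = 0.000169

# The log's total is the sum of the RAM and CPU energy (no GPU was found).
total_energy_kwh = ram_energy_kwh + cpu_energy_kwh

# Value shown as the cell result, assumed to be kg CO2eq.
emissions_kg = 2.6038722072126172e-05

# Convert to milligrams for readability (1 kg = 1,000,000 mg).
emissions_mg = emissions_kg * 1_000_000

print(f"Energy: {total_energy_kwh:.6f} kWh")
print(f"Emissions: {emissions_mg:.2f} mg CO2eq")
```

On that assumption, this single translation run, model download included, comes to about 26 mg of CO2-equivalent.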