# Let's Try Hugging Face Transformers NLP Pipelines!


In [None]:
!pip install transformers



In [4]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "on the highway there are several vehicles without drivers",
    candidate_labels=["education", "transport", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'on the highway there are several vehicles without drivers',
 'labels': ['transport', 'business', 'education'],
 'scores': [0.9423522353172302, 0.0367816686630249, 0.02086612582206726]}
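
The scores in the output above are ordered from most to least likely and sum to roughly 1. A minimal sketch of how such a ranking can be produced, assuming the model yields one raw entailment score per candidate label and that those scores are normalised with a softmax. The score values and the `rank_labels` helper are hypothetical illustrations, not the library's internals:

```python
import math


def rank_labels(entail_scores):
    """Softmax-normalise per-label scores and sort labels by probability (hypothetical helper)."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(entail_scores.values())
    exps = {label: math.exp(score - peak) for label, score in entail_scores.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [value / total for _, value in ranked],
    }


# Made-up raw scores for illustration only; a real model would produce these.
toy_scores = {"education": -1.2, "transport": 2.6, "business": -0.6}
result = rank_labels(toy_scores)
print(result["labels"])
print(result["scores"])
```

Because the normalisation runs across the candidate labels, adding or removing a label changes every score, not just its own.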

In [6]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("food on the table")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "food on the table from other people\n\nThe biggest fear of any new policy, however, will be that these individuals will become less able to spend the money they've made.\n\nFor a short time, policymakers have been keenly aware of"}]

In [7]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "food on the table",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'food on the table, while those who want food in a closed, public arena do not get the same benefits as any other venue in Pittsburgh.\n'},
 {'generated_text': 'food on the table. They’ve all spent at least one day trying to save some kids‡, the next morning I saw “'}]
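
The two continuations above differ, presumably because generation samples each next token from a probability distribution instead of always taking the most likely one. A rough sketch of that idea, assuming a toy vocabulary with made-up scores. The helpers `next_token_probs` and `sample_next` and the `temperature` handling are hypothetical illustrations, not the pipeline's actual code:

```python
import math
import random


def next_token_probs(scores, temperature=1.0):
    """Turn raw token scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores.values()]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(scores, exps)}


def sample_next(scores, temperature=1.0, rng=random):
    """Draw one token at random according to its probability."""
    probs = next_token_probs(scores, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Made-up scores for tokens that might follow "food on the table".
toy_scores = {",": 2.0, ".": 1.5, " from": 1.0, " and": 0.5}

rng = random.Random(0)  # fixed seed so the draws are repeatable
draws = [sample_next(toy_scores, rng=rng) for _ in range(5)]
print(draws)
print(next_token_probs(toy_scores, temperature=0.5))
```

Lowering the temperature concentrates probability on the top-scoring token, which narrows the range of outputs without making them fully predictable.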

In [9]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("I go to school by <mask>.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.1136590838432312,
  'token': 2353,
  'token_str': ' bus',
  'sequence': 'I go to school by bus.'},
 {'score': 0.10690395534038544,
  'token': 2185,
  'token_str': ' myself',
  'sequence': 'I go to school by myself.'}]
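
`top_k=2` keeps only the two highest-scoring candidates for the masked position. A small sketch of that selection step, assuming the model has already assigned a probability to each candidate token. The values for " bus" and " myself" are rounded from the output above, while the other candidates and the `top_k_tokens` helper are invented for illustration:

```python
import heapq


def top_k_tokens(token_probs, k):
    """Return the k most probable (token, probability) pairs, highest first."""
    return heapq.nlargest(k, token_probs.items(), key=lambda item: item[1])


# " bus" and " myself" are rounded from the output above; the rest are made up.
candidate_probs = {" bus": 0.114, " myself": 0.107, " train": 0.05, " car": 0.04}
for token, prob in top_k_tokens(candidate_probs, k=2):
    print(f"I go to school by{token}. ({prob:.3f})")
```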

In [13]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("Barack Obama was born in Hawaii and was the President of the United States.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.99925804,
  'word': 'Barack Obama',
  'start': 0,
  'end': 12},
 {'entity_group': 'LOC',
  'score': 0.99948347,
  'word': 'Hawaii',
  'start': 25,
  'end': 31},
 {'entity_group': 'LOC',
  'score': 0.9988576,
  'word': 'United States',
  'start': 61,
  'end': 74}]
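
The `entity_group` entries above come from merging per-token tags into whole entities. A simplified sketch of that merging, assuming each token carries a `B-`/`I-` style tag and `O` marks tokens outside any entity. The token list and the `group_entities` helper are hypothetical, and the library's own aggregation also deals with sub-word pieces, scores, and character offsets:

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (entity_type, text) pairs."""
    groups = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # Continue the open entity only on an I- tag of the same type.
        if prefix == "I" and entity_type == current_type:
            current_words.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_type is not None:
            groups.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            current_type, current_words = entity_type, [token]
        else:
            current_type, current_words = None, []
    if current_type is not None:
        groups.append((current_type, " ".join(current_words)))
    return groups


# Hand-written tags for illustration; a real model predicts them per token.
tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(group_entities(tokens, tags))
```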

In [16]:
from transformers import pipeline

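qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
# Indonesian context: "Albert Einstein was a famous physicist who developed the theory of relativity."
context = "Albert Einstein adalah seorang fisikawan terkenal yang mengembangkan teori relativitas."
# Indonesian question: "Who developed the theory of relativity?"
question = "Siapa pengembang teori relativitas?"

result = qa_pipeline(question=question, context=context)
print(f"Jawaban: {result['answer']}")  # "Jawaban" means "Answer"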

Jawaban: Albert Einstein
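
The answer is a span copied out of the context, not newly written text. A rough sketch of how an extractive model might pick that span, assuming it assigns every context token a start score and an end score and that the best span maximises their sum. The scores and the `best_span` helper are invented for illustration:

```python
def best_span(tokens, start_scores, end_scores, max_len=5):
    """Pick the span (start <= end, at most max_len tokens) with the highest start+end score."""
    best, best_score = (0, 0), float("-inf")
    for i, start in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(tokens))):
            score = start + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    first, last = best
    return " ".join(tokens[first:last + 1])


context_tokens = "Albert Einstein adalah seorang fisikawan terkenal".split()
# Made-up scores: high start score on "Albert", high end score on "Einstein".
start_scores = [4.0, 0.5, -1.0, -1.0, 0.2, -0.5]
end_scores = [0.3, 3.5, -1.0, -1.0, 0.1, 0.4]
print(best_span(context_tokens, start_scores, end_scores))
```

This also shows why the pipeline cannot answer when the context lacks the information: every candidate answer is a slice of the context.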


In [18]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I can't believe how amazing this new movie is!")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998351335525513}]

In [19]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    The global climate crisis is one of the most pressing issues facing the world today.
    Human activities, especially the burning of fossil fuels, have significantly
    contributed to the rise in global temperatures. As a result, we are experiencing
    more frequent and intense heatwaves, storms, and flooding. The melting of polar
    ice caps and glaciers is causing sea levels to rise, threatening coastal cities
    and ecosystems. Addressing climate change requires urgent action from governments,
    businesses, and individuals to reduce greenhouse gas emissions, transition to renewable
    energy sources, and implement sustainable practices to protect the planet for future generations.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The global climate crisis is one of the most pressing issues facing the world today . Human activities, especially the burning of fossil fuels, have contributed to the rise in global temperatures . We are experiencing more frequent and intense heatwaves, storms, and flooding . Addressing climate change requires urgent action from governments, businesses, and individuals .'}]

In [20]:
from transformers import pipeline

translator = pipeline("translation_id_to_en", model="Helsinki-NLP/opus-mt-id-en")

text_to_translate = "Saya sedang belajar untuk ujian matematika besok."
result = translator(text_to_translate)

print(result[0]['translation_text'])

I'm studying for the math exam tomorrow.


# Analysis

- Zero-shot classification is an NLP (Natural Language Processing) technique in which a model classifies text into given categories without having been trained specifically on those categories. The model relates the text to each candidate label based on context, even if that label never appeared during training. This pipeline defaults to the "facebook/bart-large-mnli" model, which can be changed by passing the `model` parameter. The candidate labels are returned in order of match probability, from highest to lowest. In the example above, the label "transport" has the highest score, which shows that the model considers the text most relevant to the "transport" category.

- Text generation is an NLP technique that produces text automatically. The model takes an initial prompt and continues it based on the context of that input. It is very easy to use for producing creative content. However, control over the generated text is limited, so the direction of the output is hard to predict even when parameters such as `max_length` and `num_return_sequences` are set.

- Fill-mask fills a masked token (`<mask>`) in a sentence with the words predicted as most likely given the context. The example above adds the parameter `top_k=2`, so the output shows the two top-ranked predictions. The pipeline can be used to probe how well a model grasps context or to suggest words, in various languages if the model supports them. Its limitation is that it predicts only a single token at each `<mask>` position. It suits autocorrect-style features that complete a sentence with a missing word.

- NER (Named Entity Recognition) identifies and groups particular entities (such as names, locations, and organizations) in text. The model reads the text and predicts a label for each token based on its context. Labels include categories such as B-PER (beginning of a person name) and I-PER (inside a person name). Accuracy depends on the pretrained model and its training dataset. NER can be applied to automatic text analysis to find the names of people, organizations, or locations in documents.

- Question-answering is a pipeline that answers a question based on a given context. The model can answer only if the context contains the relevant information. It returns a short answer taken from the text, not an in-depth explanation.

- Sentiment-analysis is a pipeline that determines whether a text carries positive or negative sentiment. Some models also offer a neutral class, but the default model used above returns only POSITIVE or NEGATIVE. It can process large volumes of text automatically and helps gauge people's feelings or opinions about a product, service, or issue. Its weakness is that it sometimes struggles with more complex or ambiguous sentiment.

- Summarization is a pipeline that condenses a long text into a shorter form while preserving the key information and the core of the original. It helps readers grasp the gist of a long text without reading all of it. The model may struggle to summarize highly technical or very long texts accurately.

- Translation is a pipeline that translates sentences from one language into another. The language pair is set in the task name: "translation_id_to_en" translates Indonesian into English, whereas "translation_en_to_id" would translate English into Indonesian, given a model that supports that direction.
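
The task string follows the pattern `translation_<source>_to_<target>`. A tiny sketch of building and reading such names, using two hypothetical helpers (`translation_task` and `parse_translation_task`) that are not part of the library. A model trained for the chosen direction still has to be supplied, as with `Helsinki-NLP/opus-mt-id-en` in the translation cell:

```python
def translation_task(source, target):
    """Build a task name such as 'translation_id_to_en' from two language codes."""
    return f"translation_{source}_to_{target}"


def parse_translation_task(task):
    """Split a task name back into its (source, target) language codes."""
    parts = task.split("_")
    if len(parts) != 4 or parts[0] != "translation" or parts[2] != "to":
        raise ValueError(f"not a translation task name: {task!r}")
    return parts[1], parts[3]


print(translation_task("en", "id"))
print(parse_translation_task("translation_id_to_en"))
```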