# **Introduction**

The most basic object in the Transformers library is the `pipeline()` function. It connects a model with the preprocessing and postprocessing steps it needs, so we can pass in any text directly and get an intelligible answer.

In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
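
To make the postprocessing step more concrete, here is a minimal sketch of how raw model scores (logits) could be turned into the `label` and `score` fields seen above. The logit values and the `postprocess` helper are hypothetical, and the `id2label` mapping is an assumption about the default checkpoint; the real pipeline handles all of this internally.

```python
import math

def postprocess(logits, id2label):
    """Turn raw logits into a {'label', 'score'} dict via softmax (illustrative helper)."""
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

# Assumed label mapping and made-up logits, for illustration only.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = postprocess([-1.5, 1.7], id2label)
print(result)
```

The score is a probability over the available labels, which is why both examples above report values between 0 and 1.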

# **Zero-shot classification**

The *zero-shot-classification* pipeline is well suited to classifying text when no model has been trained on your labels: it lets us specify the candidate labels at call time, so we don't have to rely on the labels of the pretrained model.

In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445994257926941, 0.11197380721569061, 0.04342673346400261]}
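
The result is a plain dictionary with parallel `labels` and `scores` lists. As a small sketch, the output shown above (copied here as a literal) can be paired up and reduced to the single best label without calling the model again:

```python
# Output copied from the cell above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445994257926941, 0.11197380721569061, 0.04342673346400261],
}

# Pair each label with its score.
ranked = list(zip(result["labels"], result["scores"]))

# Select the highest-scoring pair explicitly rather than relying on ordering.
top_label, top_score = max(ranked, key=lambda pair: pair[1])
print(top_label, round(top_score, 3))

# In this output the scores behave like a distribution over the candidates.
total = sum(result["scores"])
```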

# **Text generation**

Here we provide a prompt, and the model auto-completes it by generating the remaining text.

In [None]:
from transformers import pipeline

In [None]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to implement data in the real world to create your own data models. In particular, we will help you design better, more powerful, more readable data objects. You will be taught the same concepts used by'}]

# **Using any model from the Hub in a pipeline**

In [None]:
from transformers import pipeline

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use the N-word to understand the basics of communication, to learn how to write code, and'},
 {'generated_text': 'In this course, we will teach you how to create your own custom tools to create a user interface for your desktop. If you want to explore how'}]
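
Each `generated_text` in the output above includes the original prompt. If only the newly generated continuation is needed, one option is to strip the prompt afterwards. The `continuation` helper below is a hypothetical name used for illustration, applied to one of the outputs shown above:

```python
prompt = "In this course, we will teach you how to"

# One of the generated sequences from the cell above, copied as a literal.
outputs = [
    {
        "generated_text": "In this course, we will teach you how to create your own "
        "custom tools to create a user interface for your desktop. "
        "If you want to explore how"
    },
]

def continuation(prompt, generated_text):
    """Return only the text produced after the prompt (illustrative helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # If the prompt is not a prefix, return the text unchanged.
    return generated_text

continuations = [continuation(prompt, o["generated_text"]) for o in outputs]
print(continuations[0])
```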

# **Mask filling**

The idea of this task is to fill in the blanks in a given text.

In [None]:
from transformers import pipeline

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will tell <mask> to you", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.5047396421432495,
  'token': 1085,
  'token_str': ' nothing',
  'sequence': 'This course will tell nothing to you'},
 {'score': 0.09234447777271271,
  'token': 1652,
  'token_str': ' stories',
  'sequence': 'This course will tell stories to you'}]
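
Note that the mask token depends on the checkpoint: the default model here expects `<mask>`, while the `bert-base-uncased` example later in this notebook uses `[MASK]`. A minimal sketch of keeping prompts model-agnostic is shown below. The `MASK_TOKENS` table and `fill_template` helper are hypothetical; in practice the token can usually be read from the pipeline's tokenizer (`unmasker.tokenizer.mask_token`) instead of being hard-coded.

```python
# Mask tokens as they appear in this notebook's examples.
MASK_TOKENS = {
    "distilbert/distilroberta-base": "<mask>",
    "bert-base-uncased": "[MASK]",
}

def fill_template(template, model_name):
    """Insert the mask token for the given model into a '{mask}' placeholder."""
    return template.format(mask=MASK_TOKENS[model_name])

roberta_prompt = fill_template(
    "This course will tell {mask} to you", "distilbert/distilroberta-base"
)
bert_prompt = fill_template("This man works as a {mask}.", "bert-base-uncased")
print(roberta_prompt)
print(bert_prompt)
```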

# **Named entity recognition**

Named entity recognition (NER) is a task in which the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

In [None]:
from transformers import pipeline

In [None]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

Here the model correctly identified that Sylvain is a person (PER), Hugging Face is an organization (ORG), and Brooklyn is a location (LOC).
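
The `start` and `end` fields are character offsets into the input string. As a small sketch using the output above as literal data, the entities can be sliced out of the original text and grouped by type:

```python
from collections import defaultdict

text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above.
entities = [
    {"entity_group": "PER", "score": 0.9981694, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9796019, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.9932106, "word": "Brooklyn", "start": 49, "end": 57},
]

# Group the surface text of each entity by its type, using the offsets
# rather than the 'word' field so the original casing and spacing are kept.
by_type = defaultdict(list)
for ent in entities:
    span = text[ent["start"]:ent["end"]]
    by_type[ent["entity_group"]].append(span)

print(dict(by_type))
```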

# **Question answering**

The question-answering pipeline answers questions using information from a given context. It extracts the answer from the context rather than generating it.

In [None]:
from transformers import pipeline

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
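
Because the answer is a span of the context, `start` and `end` index directly into the context string. The sketch below uses the output above as literal data and adds a confidence cut-off; the `accept_answer` helper and its `min_score` default are hypothetical choices, not part of the library.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Output copied from the cell above.
result = {"score": 0.6949766278266907, "start": 33, "end": 45, "answer": "Hugging Face"}

def accept_answer(result, context, min_score=0.5):
    """Return the answer span if the score clears the threshold, else None."""
    span = context[result["start"]:result["end"]]
    return span if result["score"] >= min_score else None

print(accept_answer(result, context))
print(accept_answer(result, context, min_score=0.9))
```

A threshold like this is one simple way to avoid surfacing low-confidence answers; the right value depends on the application.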

# **Summarization**

Summarization is the task of reducing a text to a shorter one while keeping all (or most) of the important aspects referenced in the text.

In [None]:
from transformers import pipeline

In [None]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]
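
The summary above comes back with a leading space and a space before each sentence-final period. A minimal clean-up sketch is shown below, applied to the first two sentences of that output; the `tidy` helper is a hypothetical name, and the regular expression is only one possible approach.

```python
import re

# First two sentences of the summary above, copied as a literal.
summary_text = (
    " The number of engineering graduates in the United States has declined "
    "in recent years . China and India graduate six and eight times as many "
    "traditional engineers as the U.S. does ."
)

def tidy(text):
    """Remove whitespace before punctuation and trim the ends (illustrative helper)."""
    return re.sub(r"\s+([.,!?])", r"\1", text).strip()

clean = tidy(summary_text)
print(clean)
```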

# **Translation**

For translation, a default model is available for some language pairs if you provide the pair in the task name. Here we instead pick a model from the Hub explicitly and try translating from Indonesian to English.

In [None]:
from transformers import pipeline

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
translator("Jalanan Braga sedang macet")

Device set to use cpu


[{'translation_text': 'Braga Street is jammed.'}]

# **Bias and limitations**

Models such as BERT often reflect the biases inherent in their training data. For example, when asked to predict the missing word in sentences such as "This man works as a [MASK]" and "This woman works as a [MASK]," the model tends to associate certain professions with a particular gender. Men are more likely to be associated with professions such as lawyer or mechanic, while women are more likely to be associated with professions such as nurse or maid. This shows a gender bias in the model's predictions.

These models also have several limitations. Training data, even data that appears neutral such as Wikipedia and BookCorpus, still reflects the social stereotypes present in society. Even after fine-tuning on new data, the bias inherent in the model cannot be entirely removed. Another risk to be aware of is that the model can produce potentially offensive output, such as sexist, racist, or homophobic content.

It is therefore important to use these models with care. Thorough evaluation and filtering of the model's output are needed to ensure that its use meets ethical standards.

In [None]:
from transformers import pipeline

In [None]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']


In [None]:
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
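
One simple way to inspect the two prediction lists is to compare them directly. The sketch below uses the outputs above as literal data; it is only a rough illustration, not a rigorous bias measurement.

```python
# Top predictions copied from the two cells above.
man = ["carpenter", "lawyer", "farmer", "businessman", "doctor"]
woman = ["nurse", "maid", "teacher", "waitress", "prostitute"]

# Professions predicted for both prompts, and those unique to each prompt.
shared = set(man) & set(woman)
only_man = [w for w in man if w not in shared]
only_woman = [w for w in woman if w not in shared]

print("shared:", sorted(shared))
print("only for 'man':", only_man)
print("only for 'woman':", only_woman)
```

In this run the two lists have no profession in common, which illustrates how strongly the predictions depend on the gendered word in the prompt.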
