# HuggingFace Pipeline

In [None]:
# !pip install -q transformers datasets
# !pip install -q sentencepiece
# !pip install -q kobert-transformers

# NLP Tasks

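The simplest way to use a pretrained model for a given task is the `pipeline` function.

The 🤗 Transformers library supports the following major tasks, among others:

- **Translation**: translates text into another language.
- **Text Classification (sentiment analysis)**: classifies text, for example as positive or negative.
- **Text Generation**: given a prompt, the model generates the text that follows it.
- **Named Entity Recognition (NER)**: identifies which entity (e.g. person, location) each word in the input sentence refers to.
- **Question Answering**: given a context and a question, the model extracts the answer from the context.
- **Fill-Mask**: given text in which a word has been replaced by a mask token, the model predicts the words most likely to fill the gap. The mask token depends on the model, e.g. `[MASK]` for BERT and `<mask>` for RoBERTa.
- **Summarization**: produces a summary of a long text.
- **Feature Extraction**: returns a tensor representation of the text that can be used as features.
- **Zero-Shot Classification**: classifies text into candidate labels supplied at inference time, without training on labeled examples for those labels.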
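Each task in the list is selected by passing a task string as the first argument to `pipeline`. The lookup table below is a small sketch in plain Python of how those names map to task strings. `TASK_IDS` and `task_id` are illustrative names introduced here, not part of the library, and some tasks also accept aliases (this notebook uses `'sentiment-analysis'` and `'ner'`).

```python
# Illustrative lookup table (an assumption of this sketch, not a transformers API):
# task name as listed above -> task string accepted by pipeline().
TASK_IDS = {
    "Translation": "translation",
    "Text Classification": "text-classification",  # alias: "sentiment-analysis"
    "Text Generation": "text-generation",
    "NER": "token-classification",                 # alias: "ner"
    "Question Answering": "question-answering",
    "Fill-Mask": "fill-mask",
    "Summarization": "summarization",
    "Feature Extraction": "feature-extraction",
    "Zero-Shot Classification": "zero-shot-classification",
}


def task_id(name):
    """Look up the pipeline task string for a task name, ignoring case."""
    lookup = {key.lower(): value for key, value in TASK_IDS.items()}
    if name.lower() not in lookup:
        raise ValueError(f"unknown task: {name!r}")
    return lookup[name.lower()]


print(task_id("fill-mask"))
```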

In [1]:
from transformers import pipeline

### Machine Translation   
https://huggingface.co/Helsinki-NLP/opus-mt-ko-en

In [2]:
translator = pipeline('translation', model='Helsinki-NLP/opus-mt-ko-en')
translator

Device set to use cuda:0


<transformers.pipelines.text2text_generation.TranslationPipeline at 0x7f3c77493ed0>

In [3]:
translator('오늘은 화요일이니까 맛있는걸 먹어야겠어요!')

[{'translation_text': "It's Tuesday, so I'm gonna have to eat something good!"}]

In [4]:
translator([
    '오늘은 LLM의 개요를 배워볼 거에요',
    '트랜스포머를 다 공부했으니 여러분은 큰 산을 넘었답니다~',
    '아주 멋있어! 아주 성장했어~!!!'
])

[{'translation_text': "Today, we're going to learn an outline from LLM."},
 {'translation_text': "Now that you've studied the Transpomer, you've crossed a large mountain."},
 {'translation_text': "You've grown up!"}]
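The pipeline returns one dictionary per input sentence, in input order. A minimal sketch of pulling out just the translated strings, using the output of the cell above copied in as data so it runs without downloading a model:

```python
# Output copied verbatim from the batch translation cell above.
results = [
    {'translation_text': "Today, we're going to learn an outline from LLM."},
    {'translation_text': "Now that you've studied the Transpomer, you've crossed a large mountain."},
    {'translation_text': "You've grown up!"},
]

# Keep only the translated text of each result.
translations = [r['translation_text'] for r in results]

for i, text in enumerate(translations, start=1):
    print(f"{i}. {text}")
```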

### Sentiment Analysis

In [None]:
sentiment_clf = pipeline('sentiment-analysis')  # if no model is specified, a default model is used

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


In [6]:
sentiment_clf("I'm very hungry")

[{'label': 'NEGATIVE', 'score': 0.9992774128913879}]

In [7]:
sentiment_clf([
    "My exam was failed",
    "Spring weather is so weird",
    "Rabbit teacher is very nice"
])

[{'label': 'NEGATIVE', 'score': 0.9993190765380859},
 {'label': 'NEGATIVE', 'score': 0.9932923316955566},
 {'label': 'POSITIVE', 'score': 0.9996789693832397}]

### Korean Sentiment Analysis   
https://huggingface.co/sangrimlee/bert-base-multilingual-cased-nsmc

In [9]:
ko_sentiment_clf = pipeline("sentiment-analysis", model="sangrimlee/bert-base-multilingual-cased-nsmc")

Device set to use cuda:0


In [15]:
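ko_sentiment_clf([
    "어제 배운게 어려워서, 맥주를 한 잔 했어.",   # "What I learned yesterday was hard, so I had a beer."
    "공부는 어렵지만 맥주는 시원하더라",         # "Studying is hard, but the beer was refreshing."
    "그래도 나는 오늘도 공부할거야",            # "Still, I'm going to study again today."
    "맥주는 맛없지 않아!!!"                   # "Beer doesn't taste bad!!!" (double negative)
])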

[{'label': 'negative', 'score': 0.9273295998573303},
 {'label': 'negative', 'score': 0.7013086080551147},
 {'label': 'positive', 'score': 0.5110437870025635},
 {'label': 'negative', 'score': 0.6181637048721313}]
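Several of the scores above are close to 0.5, so the predicted label is little better than a coin flip for those sentences. One way to handle this, sketched below, is to flag predictions under a confidence cutoff. The threshold of 0.7 and the helper name `flag_uncertain` are assumptions made for illustration, not values recommended by the model card.

```python
# Output copied verbatim from the Korean sentiment cell above.
predictions = [
    {'label': 'negative', 'score': 0.9273295998573303},
    {'label': 'negative', 'score': 0.7013086080551147},
    {'label': 'positive', 'score': 0.5110437870025635},
    {'label': 'negative', 'score': 0.6181637048721313},
]

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cutoff, tune per application


def flag_uncertain(preds, threshold):
    """Return a copy of each prediction with a boolean 'confident' field added."""
    return [{**p, 'confident': p['score'] >= threshold} for p in preds]


flagged = flag_uncertain(predictions, CONFIDENCE_THRESHOLD)
for p in flagged:
    status = 'ok' if p['confident'] else 'needs review'
    print(f"{p['label']:<8} {p['score']:.3f}  {status}")
```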

### Zero Shot Classification
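- A "shot" is an example.
- Zero-shot classification predicts a label without any examples, that is, without training on the target labels.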

In [16]:
zero_shot_clf = pipeline('zero-shot-classification')

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


In [17]:
zero_shot_clf(
    "This is a course about the transformers library",
    candidate_labels=["education", "politics", "business"]
)

{'sequence': 'This is a course about the transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.9192402958869934, 0.060778748244047165, 0.01998094469308853]}

In [18]:
zero_shot_clf(
    "It has a poison and it's very dangerous",
    candidate_labels=["rabbit", "snake", "squirrel"]
)

{'sequence': "It has a poison and it's very dangerous",
 'labels': ['snake', 'squirrel', 'rabbit'],
 'scores': [0.6727704405784607, 0.17300079762935638, 0.15422874689102173]}
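In the result, `labels` is sorted by descending score and `scores` is aligned with it, so the first label is the prediction. With the default single-label setting the scores sum to approximately 1. A short sketch of reading the result, using the output above as data:

```python
# Output copied verbatim from the zero-shot cell above.
result = {
    'sequence': "It has a poison and it's very dangerous",
    'labels': ['snake', 'squirrel', 'rabbit'],
    'scores': [0.6727704405784607, 0.17300079762935638, 0.15422874689102173],
}

# Pair each label with its score for easier lookup.
label_scores = dict(zip(result['labels'], result['scores']))

# The first label is the top prediction because the lists are sorted by score.
top_label = result['labels'][0]

print(top_label, round(label_scores[top_label], 3))
print('sum of scores:', round(sum(result['scores']), 6))
```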

### Korean Zero-Shot   
https://huggingface.co/joeddav/xlm-roberta-large-xnli

In [19]:
ko_zero_shot_clf = pipeline('zero-shot-classification', model='joeddav/xlm-roberta-large-xnli')

Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [20]:
sentence = '2025년 어떤 운동을 하시겠습니까?!'                # "What exercise will you do in 2025?!"
candidate_labels=['정치', '경제', '건강']                   # politics, economy, health
hypothesis_template = '이 텍스트는 {}에 관한 내용입니다.'    # "This text is about {}."

ko_zero_shot_clf(
    sentence,
    candidate_labels=candidate_labels,
    hypothesis_template=hypothesis_template
)

{'sequence': '2025년 어떤 운동을 하시겠습니까?!',
 'labels': ['건강', '정치', '경제'],
 'scores': [0.9117885231971741, 0.05933362618088722, 0.02887781709432602]}
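The zero-shot pipeline is built on a natural language inference (NLI) model: each candidate label is substituted into `hypothesis_template` to form a hypothesis, and the model scores how strongly the input sentence entails each hypothesis. The sketch below shows only the template expansion step in plain Python; the scoring itself is done by the model and is not reproduced here.

```python
# Values copied from the cell above.
sentence = '2025년 어떤 운동을 하시겠습니까?!'
candidate_labels = ['정치', '경제', '건강']
hypothesis_template = '이 텍스트는 {}에 관한 내용입니다.'

# One hypothesis per label: the {} placeholder is filled with the label text.
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

# Each (premise, hypothesis) pair is what the NLI model would score.
pairs = [(sentence, hypothesis) for hypothesis in hypotheses]

for premise, hypothesis in pairs:
    print(premise, '=>', hypothesis)
```

Writing the template in the same language as the input, as the cell above does, keeps the premise and hypothesis consistent for a multilingual NLI model.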

### Text Generation

In [22]:
text_generator = pipeline('text-generation', model='distilgpt2')

Device set to use cuda:0


In [23]:
text_generator(
    'In this course, we will teach you how to',
    max_length=30,
    num_return_sequences=2
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to change how you view and to learn a lot about using it on your own day-to-day'},
 {'generated_text': 'In this course, we will teach you how to think of the ways in which humans are capable of thinking that they are capable of thinking like all humans'}]
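Each `generated_text` above begins with the original prompt, followed by the model's continuation. If only the newly generated part is needed, the prompt can be stripped afterwards. A minimal sketch, using the output above as data (`continuation` is a helper name introduced here):

```python
prompt = 'In this course, we will teach you how to'

# Output copied verbatim from the text generation cell above.
outputs = [
    {'generated_text': 'In this course, we will teach you how to change how you view and to learn a lot about using it on your own day-to-day'},
    {'generated_text': 'In this course, we will teach you how to think of the ways in which humans are capable of thinking that they are capable of thinking like all humans'},
]


def continuation(full_text, prompt):
    """Return the text after the prompt, or the full text if the prompt is absent."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    return full_text


continuations = [continuation(o['generated_text'], prompt) for o in outputs]
for text in continuations:
    print('...', text)
```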

### Korean Text Generation   
https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5

In [24]:
ko_text_generator = pipeline('text-generation', model='skt/ko-gpt-trinity-1.2B-v0.5')

Device set to use cuda:0


In [25]:
ko_text_generator("오늘 점심에는")  # prompt: "For lunch today"

[{'generated_text': '오늘 점심에는 뭐 먹지\n 답변:한국인은 밥심이죠. 든든하게 챙겨드세요.'}]

In [28]:
ko_text_generator("토끼는 귀가 2개다")  # prompt: "A rabbit has two ears"

[{'generated_text': '토끼는 귀가 2개다 (... ). 그리고 이 두 개의 귀가 각각 다른 동물들의 귀를 가지고 있다. 이 두 개의 귀가'}]

### Fill Mask

In [26]:
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [None]:
unmasker(
    "This course will teach you all about <mask> model.",   # the model predicts words that could fill <mask>
    top_k=3                                                 # number of top-scoring candidates to return
)

[{'score': 0.08884059637784958,
  'token': 265,
  'token_str': ' business',
  'sequence': 'This course will teach you all about business model.'},
 {'score': 0.07198566198348999,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical model.'},
 {'score': 0.04102081060409546,
  'token': 5,
  'token_str': ' the',
  'sequence': 'This course will teach you all about the model.'}]
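Each candidate's `sequence` is the input with `<mask>` replaced by `token_str`. For this RoBERTa-style model, `token_str` carries a leading space that marks a word boundary. The sketch below rebuilds the sequences from the output above and adds up the top-3 scores, which shows that the model spreads most of its probability over other words.

```python
template = "This course will teach you all about <mask> model."

# Fields copied from the fill-mask output above.
candidates = [
    {'score': 0.08884059637784958, 'token_str': ' business'},
    {'score': 0.07198566198348999, 'token_str': ' mathematical'},
    {'score': 0.04102081060409546, 'token_str': ' the'},
]

# Strip the leading space from token_str before substituting it for the mask.
filled = [template.replace('<mask>', c['token_str'].strip()) for c in candidates]

# Probability mass covered by the three candidates shown.
top3_mass = sum(c['score'] for c in candidates)

for sentence in filled:
    print(sentence)
print('top-3 probability mass:', round(top3_mass, 3))
```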

### NER

In [None]:
ner = pipeline(
    'ner',
    model='dslim/bert-base-NER',
    aggregation_strategy="simple"   # group token-level predictions into entity-level spans, e.g. "Bar", "##ack", "Obama" → "Barack Obama" → PER
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [30]:
ner("My name is Rabbit and I work at Playdata in Seoul.")

[{'entity_group': 'PER',
  'score': 0.9943137,
  'word': 'Rabbit',
  'start': 11,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.9896114,
  'word': 'Playdata',
  'start': 32,
  'end': 40},
 {'entity_group': 'LOC',
  'score': 0.99942976,
  'word': 'Seoul',
  'start': 44,
  'end': 49}]
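The `start` and `end` fields are character offsets into the input string, with `end` exclusive, so each entity can be recovered by slicing the original text. A short sketch using the output above as data:

```python
text = "My name is Rabbit and I work at Playdata in Seoul."

# Fields copied from the NER output above.
entities = [
    {'entity_group': 'PER', 'word': 'Rabbit', 'start': 11, 'end': 17},
    {'entity_group': 'ORG', 'word': 'Playdata', 'start': 32, 'end': 40},
    {'entity_group': 'LOC', 'word': 'Seoul', 'start': 44, 'end': 49},
]

# Slice the original text with each entity's character offsets.
spans = [text[e['start']:e['end']] for e in entities]

for entity, span in zip(entities, spans):
    print(f"{entity['entity_group']}: {span}")
```

Slicing by offset preserves the original casing and spacing, which can differ from `word` when the tokenizer normalizes text.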

### Q&A

In [31]:
qna = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


In [32]:
qna(
    question="Where do I work?",
    context="My Name is Rabbit and I work at Playdata in Seoul."
)

{'score': 0.8229941129684448, 'start': 32, 'end': 40, 'answer': 'Playdata'}

### Korean Q&A

In [2]:
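# Note: klue/roberta-base is a pretrained base checkpoint without a fine-tuned QA head.
# The head is newly initialized (see the warning in the output), so the answers and scores are not reliable.
ko_qna = pipeline('question-answering', model='klue/roberta-base')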

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


In [3]:
ko_qna(
    question="아르바이트는 어디서 했나요?",  # "Where did he work part-time?"
    context='''어렸을 때부터 꿈이 배우인 건 아니었으며, 사범대 진학을 목표로 재수해서 공부하던 중, 너무 졸린 나머지 서서 공부하려다가 선 채로 두 시간이나 존 걸 깨닫자 자신은 공부와 안 맞는 것으로 생각해 연기로 진로를 바꾸기로 결심했다고 했다. 이후 연극영화과에 진학하였다.
대학 생활도 평범했다고 한다. 이런저런 단편 영화에 출연하였고, 병역은 수원시 영통구청 가정복지과에서 공익근무요원으로 마쳤으며, 그 즈음 5년간 치아교정을 했다고 한다. 그의 데뷔작인 2014년 클래지콰이의 '내게 돌아와' 뮤직비디오부터 2015년 3월 12일에 개봉한 영화 《소셜포비아》까지는 교정기를 한 그의 모습을 볼 수 있다. 대학 시절부터 2015년 상반기까지는 안 해본 알바가 없을 정도라는데 피자 배달, 막노동, 마트 상하차, 케이터링, 고깃집 서빙, 쌀국수 가게 서빙, 편의점 아르바이트, 아이스크림 전문점, 돌잔치 및 결혼식 사회, 아이돌 콘서트 공식 굿즈 판매 등 그 종류도 매우 다양하다.[7] 서울특별시 강남구 세명초등학교와 서울특별시 종로구 서울청운초등학교 연극부에서 방과후 교사로 연극과 뮤지컬을 지도하기도 했다.
    '''
)

{'score': 3.902253956766799e-05,
 'start': 512,
 'end': 536,
 'answer': '서울특별시 종로구 서울청운초등학교 연극부에서'}
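The score of about 0.00004 here, compared with 0.82 for the English example earlier, indicates that this answer should not be trusted. One simple safeguard, sketched below, is to reject answers whose score falls under a minimum value. The cutoff of 0.1 and the helper name `accept_answer` are assumptions made for illustration.

```python
# Results copied from the two question-answering outputs in this notebook.
en_result = {'score': 0.8229941129684448, 'start': 32, 'end': 40, 'answer': 'Playdata'}
ko_result = {'score': 3.902253956766799e-05, 'start': 512, 'end': 536,
             'answer': '서울특별시 종로구 서울청운초등학교 연극부에서'}

MIN_SCORE = 0.1  # hypothetical cutoff, tune per application


def accept_answer(result, min_score=MIN_SCORE):
    """Return the answer text if its score reaches min_score, otherwise None."""
    if result['score'] >= min_score:
        return result['answer']
    return None


print(accept_answer(en_result))
print(accept_answer(ko_result))
```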

In [4]:
help(ko_qna)

Help on QuestionAnsweringPipeline in module transformers.pipelines.question_answering object:

class QuestionAnsweringPipeline(transformers.pipelines.base.ChunkPipeline)
 |  QuestionAnsweringPipeline(model: Union[ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')], tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, modelcard: Optional[transformers.modelcard.ModelCard] = None, framework: Optional[str] = None, task: str = '', **kwargs)
 |  
 |  Question Answering pipeline using any `ModelForQuestionAnswering`. See the [question answering
 |  examples](../task_summary#question-answering) for more information.
 |  
 |  Example:
 |  
 |  ```python
 |  >>> from transformers import pipeline
 |  
 |  >>> oracle = pipeline(model="deepset/roberta-base-squad2")
 |  >>> oracle(question="Where do I live?", context="My name is Wolfgang and I live in Berlin")
 |  {'score': 0.9191, 'start': 34, 'end': 40, 'answer': 'Berlin'}
 |  ```
 |  
 |  Learn more about the basics of usin