# Hugging Face Pipeline

In [1]:
!pip install -q transformers datasets
!pip install -q sentencepiece
!pip install -q kobert-transformers
!pip install -q python-dotenv

In [2]:
!pip install -q tf-keras

In [3]:
from dotenv import load_dotenv
import os

load_dotenv()
HF_TOKEN = os.getenv('HF_TOKEN')

### NLP Tasks

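The simplest way to use a pretrained model for a given task is the `pipeline` function.

- **Translation**: translates text into another language.
- **Sentiment Analysis (Text Classification)**: classifies whether a text is positive or negative.
- **Text Generation**: given a prompt, the model generates a continuation.
- **Named Entity Recognition (NER)**: identifies which kind of entity (e.g. person, location) each word in the input sentence refers to.
- **Question Answering**: given a context and a question, the model extracts a suitable answer from the context.
- **Fill-Mask**: given text in which a word is replaced by the model's mask token (e.g. `[MASK]` for BERT, `<mask>` for RoBERTa), predicts words to fill the blank.
- **Summarization**: produces a summary of a long text.
- **Feature Extraction**: returns a tensor representation of the text for use as features.
- **Zero-Shot Classification**: classifies text into labels supplied at inference time, without labeled training examples for those labels.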
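Each task in the list above is selected by passing a task identifier string to `pipeline`. The sketch below collects the identifiers used in this notebook, plus `summarization` and `feature-extraction`, which are listed above but not run here. The names `TASKS` and `build` are illustrative and not part of the library, and the identifiers are assumed to be those accepted by the transformers version used in this notebook. The import is deferred so the mapping can be inspected without loading any model.

```python
# Illustrative mapping from the task names above to `pipeline` task identifiers.
TASKS = {
    "Translation": "translation",
    "Sentiment analysis": "sentiment-analysis",   # alias of "text-classification"
    "Text generation": "text-generation",
    "NER": "ner",                                  # alias of "token-classification"
    "Question answering": "question-answering",
    "Fill-mask": "fill-mask",
    "Summarization": "summarization",
    "Feature extraction": "feature-extraction",
    "Zero-shot classification": "zero-shot-classification",
}


def build(task_name, model=None):
    """Create a pipeline for one of the names in TASKS (hypothetical helper)."""
    from transformers import pipeline  # deferred: only needed when a model is built
    # model=None lets the library pick its default checkpoint, where one exists.
    return pipeline(TASKS[task_name], model=model)
```

For example, `build("NER", model="dslim/bert-base-NER")` would create the same pipeline as the NER cell later in this notebook.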

In [4]:
from transformers import pipeline

##### Machine Translation

https://huggingface.co/Helsinki-NLP/opus-mt-ko-en

In [5]:
translation = pipeline('translation', model='Helsinki-NLP/opus-mt-ko-en')
translation




Device set to use cpu


<transformers.pipelines.text2text_generation.TranslationPipeline at 0x124b4cad550>

In [6]:
translation("저 지금 집에 가고 싶어요")

[{'translation_text': 'I want to go home now.'}]

In [7]:
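translation([
    '눈이 내려서 눈이 시려요.',    # "It is snowing, so my eyes sting." (눈 = snow / eye)
    '19기 화이팅',                 # "Go, cohort 19!"
    '말을 보니 말이 안나와요.',    # "Seeing the horse, I am speechless." (말 = horse / speech)
    'nlp공부는 재미있다.'          # "Studying NLP is fun."
])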

[{'translation_text': "It's snowy. It's snowy."},
 {'translation_text': '19th stage.'},
 {'translation_text': "It doesn't make any sense to me."},
 {'translation_text': 'Learning nlp is fun.'}]
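When a list is passed, the pipeline returns one dictionary per input, in the same order. A minimal sketch for lining inputs up with their translations, using the output above as sample data (the helper name `pair_translations` is illustrative):

```python
def pair_translations(sources, results):
    """Zip each source sentence with its `translation_text` (illustrative helper)."""
    if len(sources) != len(results):
        raise ValueError("expected one result per source sentence")
    return [(src, res["translation_text"]) for src, res in zip(sources, results)]


# Sample data copied from the cell above.
sources = ['눈이 내려서 눈이 시려요.', '19기 화이팅',
           '말을 보니 말이 안나와요.', 'nlp공부는 재미있다.']
results = [{'translation_text': "It's snowy. It's snowy."},
           {'translation_text': '19th stage.'},
           {'translation_text': "It doesn't make any sense to me."},
           {'translation_text': 'Learning nlp is fun.'}]

for src, out in pair_translations(sources, results):
    print(f"{src} -> {out}")
```

Reading the pairs side by side makes it easier to see that the homonyms 눈 (snow / eye) and 말 (horse / speech) are not resolved correctly by this model.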

##### Sentiment Analysis

In [8]:
sentiment_clf = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [9]:
sentiment_clf("I'm very happy")

[{'label': 'POSITIVE', 'score': 0.9998770952224731}]

- Korean

https://huggingface.co/sangrimlee/bert-base-multilingual-cased-nsmc

In [10]:
ko_sentiment_clf = pipeline('sentiment-analysis', 
                            model='sangrimlee/bert-base-multilingual-cased-nsmc')

Device set to use cpu


In [11]:
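ko_sentiment_clf([
    '이 프로젝트는 잘 진행되고 있어!',          # "This project is going well!"
    '열심히 한 만큼 결과도 좋게 나올 거야!',    # "The results will be as good as the effort put in!"
    '나한테 관심도 없는 것 같아 ㅠㅠ',          # "I don't think they even care about me (crying)"
    '컴퓨터가 자동으로 업데이트되었어.'         # "The computer updated automatically." (neutral)
])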

[{'label': 'positive', 'score': 0.9876468181610107},
 {'label': 'positive', 'score': 0.970969557762146},
 {'label': 'negative', 'score': 0.9228838086128235},
 {'label': 'positive', 'score': 0.7536783814430237}]
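The output contains only the labels `positive` and `negative`, so a neutral sentence such as the last one is still forced into one of them, here with a noticeably lower score. One possible way to surface such cases is a score threshold. The helper name `label_with_threshold` and the cutoff `0.9` are assumptions for illustration, not values taken from the model card.

```python
def label_with_threshold(results, threshold=0.9):
    """Replace the label with 'uncertain' when the score is below `threshold`."""
    return [r["label"] if r["score"] >= threshold else "uncertain" for r in results]


# Scores copied from the output above.
results = [{'label': 'positive', 'score': 0.9876468181610107},
           {'label': 'positive', 'score': 0.970969557762146},
           {'label': 'negative', 'score': 0.9228838086128235},
           {'label': 'positive', 'score': 0.7536783814430237}]

print(label_with_threshold(results))
# ['positive', 'positive', 'negative', 'uncertain']
```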

##### Zero Shot Classification

- A "shot" is an example.
- Zero-shot classification means classifying at inference time without any labeled examples.

In [12]:
zero_shot_clf = pipeline('zero-shot-classification')

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu


In [13]:
zero_shot_clf(
    "This is a course about the transformers library.",
    candidate_labels=["education", "politics", "business"]
)

{'sequence': 'This is a course about the transformers library.',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.9053581953048706, 0.07259626686573029, 0.022045595571398735]}

##### Korean Zero-Shot Classification

https://huggingface.co/joeddav/xlm-roberta-large-xnli

In [14]:
ko_zero_shot_clf = pipeline('zero-shot-classification', model='joeddav/xlm-roberta-large-xnli')

Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [15]:
sequence_to_classify = "2025년은 어떤 운동을 하면 좋을까?"  # "What exercise should I take up in 2025?"
candidate_labels = ["정치", "경제", "건강"]  # politics, economy, health
hypothesis_template = '이 텍스트는 {}에 관한 내용입니다.'  # "This text is about {}."

ko_zero_shot_clf(sequence_to_classify, candidate_labels=candidate_labels, hypothesis_template=hypothesis_template)

{'sequence': '2025년은 어떤 운동을 하면 좋을까?',
 'labels': ['건강', '정치', '경제'],
 'scores': [0.9276527762413025, 0.05568736046552658, 0.01665988750755787]}
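Under the hood, the zero-shot pipeline relies on a natural language inference (NLI) model: each candidate label is inserted into `hypothesis_template`, the model scores whether the input entails that hypothesis, and in the single-label setting the entailment scores are normalized across labels with a softmax. The sketch below reproduces only that bookkeeping in plain Python. The logit values are made up for illustration, and the function names are not part of the library.

```python
import math


def build_hypotheses(candidate_labels, hypothesis_template):
    """Fill the template once per label, as done before calling the NLI model."""
    return [hypothesis_template.format(label) for label in candidate_labels]


def rank_labels(candidate_labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by descending score."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = sorted(zip(candidate_labels, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return {"labels": [label for label, _ in scored],
            "scores": [score for _, score in scored]}


labels = ["정치", "경제", "건강"]
print(build_hypotheses(labels, '이 텍스트는 {}에 관한 내용입니다.'))

# Hypothetical entailment logits, one per label (not real model output).
print(rank_labels(labels, [-1.0, -2.2, 1.8]))
```

If `hypothesis_template` is omitted, the pipeline falls back to the English default `"This example is {}."`, which is why a Korean template is supplied for Korean input.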

##### Text Generation

In [16]:
text_generator = pipeline('text-generation', model='distilgpt2')

Device set to use cpu


In [17]:
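# Note: per the warning below, the pipeline's default max_new_tokens (256) overrides
# max_length here; pass max_new_tokens=30 to cap the generated text instead.
text_generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2
)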

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'In this course, we will teach you how to use CSS, how to build a website, and how to use the JavaScript library. If you want to learn more, read our course in CSS and how to use Javascript, learn more at CSS.com. It is free.\n\n\n\nWhat to use in your first project?\nIf you are ready to learn more, you can visit our website if you are familiar with the basics of Node.js. You can learn more about these basics and explore what to use in your project.'},
 {'generated_text': 'In this course, we will teach you how to write a simple Python application.'}]
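The warning above explains why the first sample is far longer than 30 tokens: `max_length` bounds prompt plus generated tokens, `max_new_tokens` bounds only the generated tokens, and when both are set `max_new_tokens` takes precedence. The sketch below models that rule in plain Python. `new_token_budget` is a hypothetical helper, not a library function, and the prompt length of 10 is an assumed token count.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Return how many tokens may be generated under the rule in the warning."""
    if max_new_tokens is not None:
        # max_new_tokens takes precedence and ignores the prompt length.
        return max_new_tokens
    if max_length is not None:
        # max_length is a total budget, so the prompt eats into it.
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")


# Assumed prompt length of 10 tokens, for illustration only.
print(new_token_budget(10, max_length=30))                      # 20
print(new_token_budget(10, max_length=30, max_new_tokens=256))  # 256
```

Generation can still stop earlier than the budget if the model emits its end-of-sequence token, which is consistent with the short second sample.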

- Korean

https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5

In [18]:
ko_text_generator = pipeline('text-generation', model='skt/ko-gpt-trinity-1.2B-v0.5')

Device set to use cpu


In [19]:
ko_text_generator('오늘 점심에는')  # prompt: "For lunch today, ..."

[{'generated_text': '오늘 점심에는 뭐 먹었어\n 답변:그럼요. 혹시 식사 전이시면 저에게 메뉴 추천해줘"라고 해보세요. 맛있는 메뉴 추천드릴게요."'}]

##### Fill Mask

In [20]:
unmasker = pipeline('fill-mask')

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [21]:
unmasker('This course will teach you all about <mask> model.', top_k=3)

[{'score': 0.09073042869567871,
  'token': 265,
  'token_str': ' business',
  'sequence': 'This course will teach you all about business model.'},
 {'score': 0.072625532746315,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical model.'},
 {'score': 0.049304164946079254,
  'token': 5,
  'token_str': ' the',
  'sequence': 'This course will teach you all about the model.'}]
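The mask placeholder is model-specific: the default `distilroberta-base` checkpoint used above expects `<mask>`, while BERT-style checkpoints expect `[MASK]`. A small sketch that builds the prompt from whichever token the model uses follows. In a live session the token can be read from `unmasker.tokenizer.mask_token`; the helper name `mask_prompt` and the `{}` slot convention are illustrative choices.

```python
def mask_prompt(template, mask_token):
    """Insert the model's mask token into a template with exactly one `{}` slot."""
    if template.count("{}") != 1:
        raise ValueError("template must contain exactly one {} slot")
    return template.format(mask_token)


template = "This course will teach you all about {} model."
print(mask_prompt(template, "<mask>"))   # RoBERTa-style token, as used above
print(mask_prompt(template, "[MASK]"))   # BERT-style token
```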

##### NER

In [22]:
ner = pipeline('ner', model='dslim/bert-base-NER')
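# Optional: pass aggregation_strategy='simple' to merge word pieces into whole entities
# (the older grouped_entities=True flag is deprecated).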

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [23]:
ner('My name is Rabbit and I work at Playdata in Seoul')

[{'entity': 'B-PER',
  'score': np.float32(0.9928189),
  'index': 4,
  'word': 'Rabbit',
  'start': 11,
  'end': 17},
 {'entity': 'B-ORG',
  'score': np.float32(0.99448276),
  'index': 9,
  'word': 'Play',
  'start': 32,
  'end': 36},
 {'entity': 'I-ORG',
  'score': np.float32(0.9833211),
  'index': 10,
  'word': '##data',
  'start': 36,
  'end': 40},
 {'entity': 'B-LOC',
  'score': np.float32(0.99951816),
  'index': 12,
  'word': 'Seoul',
  'start': 44,
  'end': 49}]
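The raw output is token-level: `Playdata` is split into the word pieces `Play` and `##data`, tagged `B-ORG` and `I-ORG`. The pipeline can merge these itself through its `aggregation_strategy` argument. The sketch below does a simplified version by hand, using the `start`/`end` character offsets from the output above, to show what the merge amounts to. `merge_entities` is an illustrative helper, and scores are omitted for brevity.

```python
def merge_entities(tokens, text):
    """Group B-/I- tagged tokens into (entity_type, surface_text) spans."""
    spans = []  # each item: [entity_type, start, end]
    for tok in tokens:
        prefix, _, ent_type = tok["entity"].partition("-")
        continues = prefix == "I" and bool(spans) and spans[-1][0] == ent_type
        if continues:
            spans[-1][2] = tok["end"]  # extend the current span
        else:
            spans.append([ent_type, tok["start"], tok["end"]])
    return [(ent_type, text[start:end]) for ent_type, start, end in spans]


text = 'My name is Rabbit and I work at Playdata in Seoul'
tokens = [  # entity tags and offsets copied from the output above
    {'entity': 'B-PER', 'start': 11, 'end': 17},
    {'entity': 'B-ORG', 'start': 32, 'end': 36},
    {'entity': 'I-ORG', 'start': 36, 'end': 40},
    {'entity': 'B-LOC', 'start': 44, 'end': 49},
]
print(merge_entities(tokens, text))
# [('PER', 'Rabbit'), ('ORG', 'Playdata'), ('LOC', 'Seoul')]
```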

##### Q&A

In [24]:
qna = pipeline('question-answering')

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


In [25]:
qna(question="Where do I work?",
    context="My name is Rabbit and I work at Playdata in Seoul")

{'score': 0.9349358779199974, 'start': 32, 'end': 40, 'answer': 'Playdata'}
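`start` and `end` in the result are character offsets into `context`, so the answer can be recovered by slicing. A quick check using the values above:

```python
context = "My name is Rabbit and I work at Playdata in Seoul"
result = {'score': 0.9349358779199974, 'start': 32, 'end': 40, 'answer': 'Playdata'}

# Slicing the context with the returned offsets yields the answer string.
span = context[result['start']:result['end']]
print(span)  # Playdata
assert span == result['answer']
```

This is what makes the task extractive: the model selects a span of the context rather than generating new text.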

- Korean

In [26]:
ko_qna = pipeline('question-answering', model='klue/roberta-base')

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


In [27]:
ko_qna(question="허준 아들의 이름은?",  # "What is the name of Heo Jun's son?"
       context="""허준에게는 외아들 허겸(許謙)이 있었다. 허겸은 문과에 급제하여 부사를 거쳐 이후 파릉군(巴陵君)에 봉작받았다. 이후 19대 숙종 때에는 그의 증손자 허진(許瑱)이 파춘군(巴春君)의 작호를 받았으며, 허진의 아들이자 허준의 고손자인 허육(許堉)은 양흥군(陽興君)의 작호를 받았다. 허육의 아들이자 허준의 5대손인 허선(許銑)은 21대 영조 때에 양원군(陽原君)에 올랐으며, 허선의 아들로 허준의 6대손 허흡(許潝) 역시 영조 때 양은군(陽恩君)에 봉작받았다. 이렇게 누대에 걸쳐 후손들이 조정의 관직을 역임했으며, 선대가 살던 경기도 장단군 우근리(現 경기도 파주시)에 대대로 세거했다. 이후 조선 후기에 허준의 10대손 허도(許堵, 1827~1884)가 황해도 해주시로 이주했으며, 13대 종손 허형욱(許亨旭, 1924~몰년 미상)이 1945년까지 그 곳에서 살았다.
""")

{'score': 6.684073014184833e-05,
 'start': 292,
 'end': 317,
 'answer': '살던 경기도 장단군 우근리(現 경기도 파주시)'}
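The score here is about 6.7e-05 and the extracted span does not answer the question (the son's name, 허겸, appears in the first sentence of the context). This matches the warning printed when the pipeline was created: `klue/roberta-base` has no trained question-answering head, so `qa_outputs` is newly initialized, and a checkpoint fine-tuned on a Korean QA dataset would be needed for meaningful answers. One defensive pattern is to reject low-scoring answers. The helper name `answer_or_none` and the `0.1` cutoff are assumptions for illustration.

```python
def answer_or_none(result, min_score=0.1):
    """Return the answer text only if the model's score clears `min_score`."""
    return result["answer"] if result["score"] >= min_score else None


# Results copied from the two question-answering cells above.
english = {'score': 0.9349358779199974, 'start': 32, 'end': 40, 'answer': 'Playdata'}
korean = {'score': 6.684073014184833e-05, 'start': 292, 'end': 317,
          'answer': '살던 경기도 장단군 우근리(現 경기도 파주시)'}

print(answer_or_none(english))  # Playdata
print(answer_or_none(korean))   # None
```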

In [28]:
help(ko_qna)

Help on QuestionAnsweringPipeline in module transformers.pipelines.question_answering object:

class QuestionAnsweringPipeline(transformers.pipelines.base.ChunkPipeline)
 |  QuestionAnsweringPipeline(
 |      model: Union[ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')],
 |      tokenizer: transformers.tokenization_utils.PreTrainedTokenizer,
 |      modelcard: Optional[transformers.modelcard.ModelCard] = None,
 |      framework: Optional[str] = None,
 |      task: str = '',
 |      **kwargs
 |  )
 |
 |  Question Answering pipeline using any `ModelForQuestionAnswering`. See the [question answering
 |  examples](../task_summary#question-answering) for more information.
 |
 |  Example:
 |
 |  ```python
 |  >>> from transformers import pipeline
 |
 |  >>> oracle = pipeline(model="deepset/roberta-base-squad2")
 |  >>> oracle(question="Where do I live?", context="My name is Wolfgang and I live in Berlin")
 |  {'score': 0.9191, 'start': 34, 'end': 40, 'answer': 'Berlin'}
 |  ``