In [1]:
from IPython.display import display, HTML
display(HTML("""
<style>
div.container{width:99% !important;}
div.cell.code_cell.rendered{width:100%;}
div.input_prompt{padding:0px;}
div.CodeMirror {font-family:Consolas; font-size:12pt;}
div.text_cell_render.rendered_html{font-size:20pt;}
div.text_cell_render ul li, div.text_cell_render ol li p, code{font-size:22pt; line-height:30px;}
div.output {font-size:12pt; font-weight:bold;}
div.input {font-family:Consolas; font-size:12pt;}
div.prompt {min-width:70px;}
div#toc-wrapper{padding-top:120px;}
div.text_cell_render ul li{font-size:12pt;padding:5px;}
table.dataframe{font-size:12px;}
</style>
"""))

In [2]:
import warnings
import os
import logging
# Suppress Python warnings
warnings.filterwarnings('ignore')

# Lower the transformers logging level
logging.getLogger("transformers").setLevel(logging.ERROR)

# Suppress the Hugging Face symlink warning
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # hide TensorFlow C++ log messages

# from transformers import pipeline, logging as hf_logging
# hf_logging.set_verbosity_error()

# <span style="color:red">ch1_Hugging Face</span>
- Using the Inference API: the model runs on a server and the result is returned to you
- Using pipeline(): the model is downloaded and runs locally
    * raw text -> tokenizer -> model -> logits -> post-processing -> prediction scores such as [0.11, 0.55, 0.xx]
```
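Tasks supported by Hugging Face transformers
'sentiment-analysis' : alias of 'text-classification' (used for sentiment analysis)
'text-classification' : general sentence classification such as sentiment analysis, news classification, review classification
'zero-shot-classification' : classifies into one of the given candidate labels without training on those labels
'token-classification' : token-level labeling such as named entity recognition (NER)
'ner' : alias of 'token-classification'
'fill-mask' : fills in a masked blank
'text-generation' : text generation (used with GPT-style models)
'text2text-generation' : input -> output conversion such as translation and summarization
'translation' : translation
'summarization' : text summarization
'question-answering' : answers a question from a given context
'image-to-text' : image captioning
'image-classification' : image classification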
```
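
The flow above ends with logits, but the outputs in this notebook show a `score` between 0 and 1. For a single-label classifier, that step is typically a softmax over the logits. A minimal sketch in plain Python; the logit values are made up for illustration, and `softmax` is a helper defined here, not a transformers API:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes [NEGATIVE, POSITIVE]
logits = [-1.5, 1.7]
labels = ['NEGATIVE', 'POSITIVE']

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the largest probability
prediction = {'label': labels[best], 'score': probs[best]}
print(prediction)
```

The resulting dictionary has the same shape as the entries the classification pipeline returns.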

## 1.Text-based sentiment analysis (positive/negative)
- Models are downloaded to C:/Users/<user name>/.cache/huggingface/hub

In [3]:
from transformers import pipeline

In [4]:
classifier = pipeline(task='sentiment-analysis',
                      model='distilbert/distilbert-base-uncased-finetuned-sst-2-english')
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

In [5]:
classifier = pipeline(task='text-classification',
                      model='distilbert/distilbert-base-uncased-finetuned-sst-2-english')
# To classify several texts at once, pass them as a list
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    'I hate this so much!'
])

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [7]:
classifier(['이 영화 정말 최고였어',   # Korean: "This movie was really the best"
            'This movie is the best movie I have ever watched'])
# The model above was trained on English, so its accuracy on Korean text is much lower

[{'label': 'POSITIVE', 'score': 0.870750904083252},
 {'label': 'POSITIVE', 'score': 0.9998502731323242}]

In [8]:
classifier('이 물건 정말 사고 싶어요')  # Korean: "I really want to buy this item"

[{'label': 'POSITIVE', 'score': 0.8577604293823242}]

In [9]:
# The last two inputs are Korean: "I don't like you", "I'm having a hard time"
classifier(['I like you', 'I hate you', '난 너가 싫어', '힘들어요'])

[{'label': 'POSITIVE', 'score': 0.9998695850372314},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079},
 {'label': 'POSITIVE', 'score': 0.5550515055656433},
 {'label': 'POSITIVE', 'score': 0.8669533729553223}]
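
The English-only model still returns a label for Korean input, so it can help to flag low-score predictions instead of trusting the label. A small sketch using the scores printed above as literal data (rounded); the 0.9 threshold and the `flag_uncertain` helper are arbitrary choices made here for illustration:

```python
def flag_uncertain(predictions, threshold=0.9):
    """Return a copy of each prediction with an 'uncertain' flag added."""
    return [{**p, 'uncertain': p['score'] < threshold} for p in predictions]

# Output copied from the cell above (scores rounded to 4 decimals)
predictions = [
    {'label': 'POSITIVE', 'score': 0.9999},  # 'I like you'
    {'label': 'NEGATIVE', 'score': 0.9991},  # 'I hate you'
    {'label': 'POSITIVE', 'score': 0.5551},  # '난 너가 싫어'
    {'label': 'POSITIVE', 'score': 0.8670},  # '힘들어요'
]

flagged = flag_uncertain(predictions)
for p in flagged:
    print(p)
```

With this threshold, both Korean inputs are marked uncertain while the English ones are not.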

In [10]:
classifier = pipeline(task="sentiment-analysis",
                     model="matthewburke/korean_sentiment")
# Korean inputs: "I like you", "I don't like you", "I'm having a hard time", "I feel great today"
classifier(['나는 너가 좋아', "당신이 싫어요", "힘들어요", "오늘 기분이 최고야"])

Device set to use cpu


[{'label': 'LABEL_1', 'score': 0.9557897448539734},
 {'label': 'LABEL_0', 'score': 0.9092598557472229},
 {'label': 'LABEL_0', 'score': 0.9140233397483826},
 {'label': 'LABEL_1', 'score': 0.9714491367340088}]

In [11]:
classifier = pipeline(task="sentiment-analysis",
                     model="matthewburke/korean_sentiment")
# Korean inputs: "I like you", "I don't like you", "I'm having a hard time", "I feel great today"
texts = ['나는 너가 좋아', "당신이 싫어요", "힘들어요", "오늘 기분이 최고야"]
result = classifier(texts)

Device set to use cpu


In [23]:
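# Reuse the predictions stored in `result` instead of running the model again;
# the loop variable is named `pred` so it does not overwrite `result`
for text, pred in zip(texts, result):
    label = 'positive' if pred['label']=='LABEL_1' else 'negative'
    print(f"{text} => {label} : {pred['score']:.4f}")

나는 너가 좋아 => positive : 0.9558
당신이 싫어요 => negative : 0.9093
힘들어요 => negative : 0.9140
오늘 기분이 최고야 => positive : 0.9714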
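
The `LABEL_0`/`LABEL_1` names are generic ids, so the mapping to readable names can live in one dictionary rather than an inline conditional. A minimal sketch, assuming (as the loop above does) that `LABEL_1` means positive and `LABEL_0` means negative for this model; `LABEL_NAMES` and `rename_labels` are names introduced here:

```python
LABEL_NAMES = {'LABEL_0': 'negative', 'LABEL_1': 'positive'}  # assumed mapping

def rename_labels(predictions, names=LABEL_NAMES):
    """Replace generic label ids with readable names; unknown ids pass through."""
    return [{**p, 'label': names.get(p['label'], p['label'])} for p in predictions]

# Output copied from the Korean sentiment cell above (scores rounded)
raw = [
    {'label': 'LABEL_1', 'score': 0.9558},
    {'label': 'LABEL_0', 'score': 0.9093},
]

readable = rename_labels(raw)
print(readable)
```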


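## 2.Zero-shot classification
- In machine learning and NLP, zero-shot means performing a task without training on that specific task. The default model here (facebook/bart-large-mnli) is an NLI model that scores how well each candidate label fits the text; this is not unsupervised learning, since the NLI model itself was trained on labeled data.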

In [24]:
classifier = pipeline('zero-shot-classification',
#                       model='facebook/bart-large-mnli'
                     )
classifier(
    "I have a problem with my iphone that needs to be resolved asap!",
    candidate_labels=['urgent', 'not urgent', 'phone', 'tablet', 'computer']
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'sequence': 'I have a problem with my iphone that needs to be resolved asap!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.5227580070495605,
  0.45814019441604614,
  0.0142647260800004,
  0.0026850001886487007,
  0.002152054337784648]}

In [25]:
sequence_to_classify = "One day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'One day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9941016435623169, 0.0031261250842362642, 0.0027722232043743134]}
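
The zero-shot result keeps `labels` and `scores` as parallel lists, so pairing them up makes the output easier to work with. A short sketch using the output above as literal data (scores rounded); `top_label` is a helper introduced here:

```python
def top_label(result):
    """Return (label, score) for the highest-scoring candidate label."""
    pairs = zip(result['labels'], result['scores'])
    return max(pairs, key=lambda pair: pair[1])

# Output copied from the cell above (scores rounded to 4 decimals)
result = {
    'sequence': 'One day I will see the world',
    'labels': ['travel', 'dancing', 'cooking'],
    'scores': [0.9941, 0.0031, 0.0028],
}

label, score = top_label(result)
by_label = dict(zip(result['labels'], result['scores']))  # label -> score lookup
print(label, score)
print(by_label)
```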

## 3.Text generation

In [28]:
# from transformers import set_seed
# set_seed(2) # fix the random seed for reproducible output
generation = pipeline('text-generation') # text generation; defaults to GPT-2 here (GPT-3 weights are not published on the Hugging Face Hub)
generation(
    "In this course, we will teach you how to",
    pad_token_id=generation.tokenizer.eos_token_id
) # pad_token_id is set to silence the pad-token warning

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'generated_text': 'In this course, we will teach you how to design and build an app for the iPhone or the iPad. It will be used with the iPad Pro 2 and later. We will also show you how to make a simple and easy to use program for the iPhone or iPad.\n\nThis course is available for students who are not familiar with app development. It is designed to teach you how to make apps for the iPad.\n\nYou will learn the basics of App development and how to build apps using the iPhone or iPad. You will also learn the fundamentals of using the iOS SDK.\n\nWe will cover the most common components of a typical app. We will develop the components and add them to the app. We will do the rest.\n\nWe will cover the most common components of a typical app. We will develop the components and add them to the app. We will do the rest. You will learn how to use the iOS SDK to develop a large number of apps. We will then show you how to add components to your app and use them to your advantage.\n\nYou will

In [31]:
result = generation(
    "In this course, we will teach you how to",
    pad_token_id=generation.tokenizer.eos_token_id
)

print(result[0]['generated_text'])

In this course, we will teach you how to use your iPad to navigate the world of virtual reality. We will explain how to create VR content, what to look for when looking for content and how to navigate your way to a virtual world. We will demonstrate how to create a virtual world when you are traveling with your iPad and how to avoid getting lost. This course will help you understand the rules of VR and how to navigate around virtual worlds.

In this course, we will teach you how to create VR content, what to look for when looking for content and how to navigate your way to a virtual world. We will demonstrate how to create a virtual world when you are traveling with your iPad and how to avoid getting lost. This course will help you understand the rules of VR and how to navigate around virtual worlds.

In this course, we will teach you how to create VR content, what to look for when looking for content and how to navigate your way to a virtual world. We will teach you how to create a vi

In [37]:
generation = pipeline('text-generation', 'skt/kogpt2-base-v2')
result = generation(
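    "이 과정은 다음과 같은 방법을 알려 드립니다.",  # Korean: "This course teaches you the following methods."
    pad_token_id=generation.tokenizer.eos_token_id,
    max_new_tokens = 200, # maximum number of new tokens to generate after the prompt
#     num_return_sequences = 1, # number of sequences to generate
#     do_sample=True, # sample from the distribution instead of greedy decoding
#     top_k = 50, # top-k sampling (only the 50 most likely tokens are candidates)
#     top_p = 0.95, # nucleus sampling (candidates are the most likely tokens whose cumulative probability reaches 95%)
#     temperature = 1.0, # randomness control (lower temperature gives less varied output)
#     no_repeat_ngram_size=2 # prevents any 2-gram from being repeated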
)
print(result[0]['generated_text'])

Device set to use cpu


이 과정은 다음과 같은 방법을 알려 드립니다. 대구 남구청은 21일 오전 10시 시청 대회의실에서 대구시 남구의회(의장 이충훈)와 '사회복지사업 공동 추진 업무 협약'을 체결했다.
이번 협약은 지난해 7월 남구의회와 복지사업 업무협약 체결 이후 공동사업의 효율적인 추진을 위해 마련됐다.
협약 체결에 따라 남구는 사회복지사업 공동 추진에 필요한 행정적 지원을 할 예정이다.
또 남구는 올해 말까지 사회복지사업 공동추진위원회를 구성해 추진과제를 발굴, 사업추진에 적극 지원할 계획이다.
대구사회복지협의회는 남구 지역사회보장협의회와 함께 취약계층 지원에 관한 전문성을 갖춘 복지시설 운영자 및 종사자를 모집해 사회복지사업 공동 추진의 기반을 조성한다.
대구사회복지협의회는 실무위원회를 구성해 사회복지사업 공동추진위원회를 구성해 운영할 예정이다.
이충훈 대구 남구청장은 "이번 협약을 통해 양 기관이 지속적인 협력관계를 구축해 지역 내 사회복지서비스 수요자가 더욱 증가할 것으로 기대한다"고 말했다. 경찰이 박근혜 전 대통령의 출석을 앞두고 특별수사본부를 편성했다.
특
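
The commented-out sampling options in the cell above (`temperature`, `top_k`, `top_p`) are easier to read with a toy example. The sketch below applies the three filters to a made-up next-token distribution in plain Python. It illustrates the idea only and is not the transformers implementation; the token strings, logits, and the `filter_next_token` helper are all invented here:

```python
import math

def filter_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: lower values sharpen the distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]              # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:              # stop once the kept tokens cover top_p
            break
    norm = sum(p for _, p in kept)           # renormalise the surviving tokens
    return {tok: p / norm for tok, p in kept}

toy_logits = {'build': 2.0, 'use': 1.5, 'cook': 0.2, 'fly': -1.0}  # invented values

print(filter_next_token(toy_logits, top_k=2))
print(filter_next_token(toy_logits, top_p=0.9))
print(filter_next_token(toy_logits, temperature=0.5))
```

Sampling then draws the next token from the filtered distribution, which is why tighter `top_k`/`top_p` values or a lower `temperature` give more predictable text.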


## 4.Fill-mask (filling in the blank)

In [39]:
unmasker = pipeline(task='fill-mask',
                    model='distilbert/distilroberta-base') # fill-mask (this model was trained on English)

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [40]:
unmasker("I'm going to hospital and meet a <mask>",
         top_k=5) # top_k defaults to 5

[{'score': 0.19275707006454468,
  'token': 3299,
  'token_str': ' doctor',
  'sequence': "I'm going to hospital and meet a doctor"},
 {'score': 0.06794589757919312,
  'token': 27321,
  'token_str': ' psychiatrist',
  'sequence': "I'm going to hospital and meet a psychiatrist"},
 {'score': 0.06435535103082657,
  'token': 16308,
  'token_str': ' surgeon',
  'sequence': "I'm going to hospital and meet a surgeon"},
 {'score': 0.0591287724673748,
  'token': 9008,
  'token_str': ' nurse',
  'sequence': "I'm going to hospital and meet a nurse"},
 {'score': 0.05705631151795387,
  'token': 1441,
  'token_str': ' friend',
  'sequence': "I'm going to hospital and meet a friend"}]

In [42]:
unmasker("Hello, I'm a <mask> model.")

[{'score': 0.0629730075597763,
  'token': 265,
  'token_str': ' business',
  'sequence': "Hello, I'm a business model."},
 {'score': 0.038101598620414734,
  'token': 18150,
  'token_str': ' freelance',
  'sequence': "Hello, I'm a freelance model."},
 {'score': 0.03764132782816887,
  'token': 774,
  'token_str': ' role',
  'sequence': "Hello, I'm a role model."},
 {'score': 0.037326786667108536,
  'token': 2734,
  'token_str': ' fashion',
  'sequence': "Hello, I'm a fashion model."},
 {'score': 0.026023676618933678,
  'token': 24526,
  'token_str': ' Playboy',
  'sequence': "Hello, I'm a Playboy model."}]

In [46]:
unmasker = pipeline(task='fill-mask',
                    model='google-bert/bert-base-uncased')
unmasker("Hello, I'm a [MASK] model.")

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'score': 0.1441437155008316,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello, i ' m a role model."},
 {'score': 0.14175789058208466,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello, i ' m a fashion model."},
 {'score': 0.062214579433202744,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello, i ' m a new model."},
 {'score': 0.041028350591659546,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello, i ' m a super model."},
 {'score': 0.025911200791597366,
  'token': 2449,
  'token_str': 'business',
  'sequence': "hello, i ' m a business model."}]
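
The two models above expect different mask tokens (`<mask>` for distilroberta-base, `[MASK]` for bert-base-uncased). One way to avoid hard-coding the token is to keep a placeholder in the prompt and substitute the model's own mask token. A minimal sketch; `fill_placeholder` and the `{mask}` placeholder are conventions invented here, and the mask tokens are passed as plain strings so the snippet runs without downloading a model:

```python
def fill_placeholder(template, mask_token):
    """Insert a model-specific mask token into a prompt template."""
    if template.count('{mask}') != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace('{mask}', mask_token)

template = "Hello, I'm a {mask} model."
roberta_prompt = fill_placeholder(template, '<mask>')   # RoBERTa-style token
bert_prompt = fill_placeholder(template, '[MASK]')      # BERT-style token
print(roberta_prompt)
print(bert_prompt)
```

With a loaded pipeline, the token can be read from `unmasker.tokenizer.mask_token` and passed as the second argument.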

### ※ Using the Inference API

In [49]:
%pip install -q python-dotenv

Note: you may need to restart the kernel to use updated packages.




In [51]:
from dotenv import load_dotenv
import os
load_dotenv()
# os.environ['HF_TOKEN']
# Create a Hugging Face token with READ permission and add it to the .env file

True
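
The `True` above indicates that python-dotenv loaded something from `.env`; it does not confirm that the token itself is present. A hedged sketch of a check, assuming the variable is named `HF_TOKEN` as in the commented line above and that requests authenticate with a bearer token; `auth_headers` is a helper introduced here, and the fallback value is a dummy, not a real token:

```python
import os

def auth_headers(env_var='HF_TOKEN'):
    """Build an Authorization header from a token stored in the environment."""
    token = os.environ.get(env_var, '').strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; add it to your .env file")
    return {'Authorization': f'Bearer {token}'}

# Dummy fallback so the example runs anywhere; an existing token is left untouched
os.environ.setdefault('HF_TOKEN', 'hf_dummy_token_for_illustration')

headers = auth_headers()
print(list(headers))   # print only the header name, never the token itself
```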