In [1]:
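# Pre-training is the upstream task.
#
# Google pre-trained BERT and released the weights publicly.
#
# We download those weights and reuse them for a downstream task. This reuse is called transfer learning, and fine-tuning is one way of doing it.
#
# Transfer learning: apply what a pre-trained model has learned to a new task. Fine-tuning: continue training the pre-trained model (often with a new task-specific head) on downstream data so that its weights adapt to that task.
#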


## HuggingFace library - using the pipeline function

In [2]:
#- A developer community for sharing a wide range of transformer-based deep learning models
#url : https://huggingface.co/
#
#- pipeline() function
#
#1) The most basic entry point of the transformers library
#2) Connects a pre-trained model with its tokenizer so that NLP tasks can be run with a single call
#
#- Usage (example)
#1) Create the pipeline : summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)
#2) Run summarization : summarizer(sentence)
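
The usage pattern above can be sketched with an explicitly loaded model and tokenizer. This is a minimal sketch, not the only way to build a pipeline: `build_summarizer` is a hypothetical helper name, and it assumes the checkpoint is a sequence-to-sequence model that `AutoModelForSeq2SeqLM` can load. The model and tokenizer are passed as keyword arguments because the third positional parameter of `pipeline()` is `config`, not `tokenizer`.

```python
def build_summarizer(model_name):
    """Load a tokenizer and a seq2seq model by name and wrap them in a pipeline.

    Hypothetical helper for illustration; calling it downloads the weights.
    """
    # Imported here so that defining the helper does not trigger any download.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Keyword arguments avoid mixing up the positional order of pipeline().
    return pipeline('summarization', model=model, tokenizer=tokenizer)
```

Calling `build_summarizer('sshleifer/distilbart-cnn-12-6')` would give an object that is used the same way as `summarizer(sentence)`.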


In [3]:
## text classification

### Import the required functions
from transformers import pipeline, set_seed





In [4]:
### Call pipeline() to create (download) the classification model
model = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
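
The warning above suggests naming the model and revision explicitly. A minimal sketch of that, assuming the model name and short revision hash printed in the warning are the ones you want to pin; `build_pinned_classifier` is a hypothetical helper name.

```python
# Values copied from the warning message printed by pipeline('text-classification').
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
REVISION = 'af0f99b'


def build_pinned_classifier(model_name=MODEL_NAME, revision=REVISION):
    """Create a text-classification pipeline with an explicit model and revision."""
    # Imported here so that defining the helper does not trigger any download.
    from transformers import pipeline

    return pipeline('text-classification', model=model_name, revision=revision)
```

Pinning both values means a later change to the library's default model should not silently change the results.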


In [5]:
### Create test data
test_data = ['Once you choose hope, anything is possible']

In [6]:
### Run sentiment analysis
result = model(test_data)

print(result)

[{'label': 'POSITIVE', 'score': 0.9992693066596985}]


In [7]:
text = input()
result = model(text)
print(result)

[{'label': 'POSITIVE', 'score': 0.748120903968811}]
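
The two results so far differ a lot in score (about 0.999 versus 0.748), so it can help to flag low-confidence predictions. The sketch below is one possible wrapper, not part of the transformers API: `classify_texts` and the 0.9 threshold are assumptions chosen for illustration. It accepts any callable that returns a list of `{'label', 'score'}` dicts, which is the shape printed above.

```python
def classify_texts(classifier, texts, threshold=0.9):
    """Run a classifier and mark each prediction as confident or not."""
    # Accept a single string as well as a list of strings.
    if isinstance(texts, str):
        texts = [texts]
    rows = []
    for text, pred in zip(texts, classifier(texts)):
        rows.append({
            'text': text,
            'label': pred['label'],
            'score': pred['score'],
            # The threshold is an arbitrary illustrative cut-off.
            'confident': pred['score'] >= threshold,
        })
    return rows
```

With the pipeline created earlier, `classify_texts(model, test_data)` would return one row per input sentence.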


In [8]:
### Create a model that classifies financial news as positive, negative, or neutral

model_name = 'ProsusAI/finbert'
finbert = pipeline('text-classification', model=model_name)



In [9]:
### Create financial news text data
test_data = "Investing Club: Our takes on Amazon and Apple heading into next week's earnings reports"

In [10]:
### Run sentiment analysis
result = finbert(test_data)

print(result)

[{'label': 'neutral', 'score': 0.9251992106437683}]


In [11]:
### Korean sentiment analysis

# Create the model
model_name = 'doya/klue-sentiment-nsmc'
kor_model = pipeline('text-classification', model=model_name)

# Create test data (two Korean movie reviews: the first positive, the second negative)
test_data = ['음악이 주가 된, 최고의 음악영화', '발연기 도저히 못보겠다 진짜 이렇게 연기를 못할거라곤 상상도 못했네']

# Run sentiment analysis
result = kor_model(test_data)

# Check the result
print(result)

[{'label': 'LABEL_1', 'score': 0.993305504322052}, {'label': 'LABEL_0', 'score': 0.9989787340164185}]
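
The Korean model returns generic `LABEL_0` / `LABEL_1` names. The sketch below maps them to readable names. The mapping is an assumption: it follows the NSMC dataset convention (0 = negative, 1 = positive) and agrees with the two reviews above, but it is not read from the model's own configuration. `readable` and `NSMC_LABELS` are hypothetical names.

```python
# Assumed mapping (NSMC convention): 0 = negative, 1 = positive.
NSMC_LABELS = {'LABEL_0': 'negative', 'LABEL_1': 'positive'}


def readable(predictions, mapping=NSMC_LABELS):
    """Return copies of the predictions with generic labels replaced."""
    # Labels missing from the mapping are left unchanged.
    return [{**p, 'label': mapping.get(p['label'], p['label'])} for p in predictions]


# The output printed above, copied by hand.
sample_result = [{'label': 'LABEL_1', 'score': 0.993305504322052},
                 {'label': 'LABEL_0', 'score': 0.9989787340164185}]
print(readable(sample_result))
```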


### summarization

In [12]:
### Call pipeline() to create (download) the model

summarizer = pipeline('summarization')


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [13]:
### Create text data for summarization

text_example = '''The tower is 324 meters (1,063 ft) tall, about the same height
as an 81-storey building, and the tallest structure in Paris. Its base is square,
measuring 125 meters (410 ft) on each side. During its construction, the Eiffel
Tower surpassed the Washington Monument to become the tallest man-made structure
in the world, a title it held for 41 years until the Chrysler Building in New York
City was finished in 1930. It was the first structure to reach a height of 300 meters.
Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is
now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters,
the Eiffel Tower is the second tallest free-standing structure in France
after the Millau Viaduct.'''

text_sentences = '''The chief of global tech titan Samsung Electronics has remained Korea's richest stockholder as the value of shares held by the country's wealthiest 100 swelled nearly 20 percent over the past year, a corporate tracker said Wednesday.
Samsung Electronics Executive Chairman Lee Jae-yong held stocks worth 14.7 trillion won ($11.3 billion) as of Tuesday, up about 26 percent from Dec. 29 a year earlier and the biggest among them, according to CEO Score.
Lee, who virtually controls South Korea's top conglomerate, Samsung Group, also chalked up the largest increase of nearly 3 trillion won over the last year.
Lee was followed by his mother and two younger sisters. His mother, Hong Ra-hee, placed second in the country's stock-rich rankings with 9.2 trillion won, followed by Boo-jin, chairwoman of Hotel Shilla, with around 7 trillion won and Seo-hyun, in charge of the Samsung Welfare Foundation, with slightly over 6 trillion won.
The 100 wealthiest stockholders owned a combined 118.8 trillion won worth of listed shares, up 19.5 percent from a year earlier.
'''


In [22]:
### Run summarization
result = summarizer(text_example)
print(result[0]['summary_text'])



The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building. Its base is square, measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.
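
The call above uses the model's default generation settings. A minimal sketch of bounding the summary length, assuming the summarizer accepts `max_length`, `min_length` and `do_sample` keyword arguments at call time; `summarize` is a hypothetical helper and the default values 60 and 20 are illustrative, not recommendations.

```python
def summarize(summarizer, text, max_length=60, min_length=20):
    """Return only the summary string, with explicit length bounds in tokens."""
    output = summarizer(
        text,
        max_length=max_length,   # upper bound on generated tokens
        min_length=min_length,   # lower bound on generated tokens
        do_sample=False,         # deterministic decoding
    )
    # The pipeline returns a list with one dict per input text.
    return output[0]['summary_text']
```

For example, `summarize(summarizer, text_sentences)` would summarize the second sample text defined earlier.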


In [18]:
### Call pipeline() to create (download) the model (2)
model_name='facebook/bart-large-cnn'
summarizer = pipeline('summarization', model=model_name)

In [23]:
### Create a Korean summarization model
model_name = 'gogamza/kobart-summarization'
kor_summarizer = pipeline('summarization', model=model_name)

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.

In [24]:
### Create Korean text (kept in Korean because the model summarizes Korean)
text_sample = '''신용점수는 개인의 신용 상태를 평가하여 점수로 나타낸 것입니다. 현재 연체 및 과거 채무 상환 이력, 대출 및 보증채무 부담 정도, 신용 거래 기간, 체크카드 및 신용카드 이용 정보, 비금융정보 등을 종합적으로 판단하여 계산합니다. 예를 들어, 과거에 연체 없이 채무 상환을 잘 한 사람은 앞으로도 채무 상환을 잘해 나갈 것이라는 신뢰 정도를 점수로 표기한 것입니다.
신용점수 제도는 신용등급제의 문제를 보완하여 2021년에 처음 도입되었습니다. 기존에는 1~10등급으로 신용등급을 나눠 평가했는데, 이는 실제 점수가 1점밖에 차이나지 않아도 등급이 갈리며 카드 발급, 대출이 거절되며 지속적인 불만이 제기었습니다. 이에 2021년 신용등급제를 폐지하고 1~1,000점까지 1점 단위로 개인의 신용을 평가하는 점수제가 도입되었습니다.
그렇다면 개인의 신용점수는 어떻게 결정되는 걸까요? 신용점수는 금융기관이 아니라, 개인신용평가사에서 결정합니다. 공식적인 개인신용평가사는 2곳으로 나이스평가정보와 코리아크레딧뷰로가 있습니다. 두 신용평가사에서는 자체적으로 신용점수를 산출하는 항목과 비중을 정해 신용점수를 평가하기 때문에 평가사에 따라 신용점수는 상이할 수 있습니다. 세부적인 평가 기준 및 반영 비율이 궁금하다면 각 개인신용평가회사 홈페이지에서 확인할 수 있습니다.
'''

In [25]:
### Run summarization
result = kor_summarizer(text_sample)
print(result[0]['summary_text'])

개인의 신용 상태를 개인의 신용 상태를 개인의 신용 상태를 개인의 신용 상태를 개인의 신용 상태를 평가하여 점
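
The summary above repeats the same phrase until it is cut off, which is a degenerate generation rather than a usable summary. One rough way to detect this automatically is to measure how many word n-grams are duplicates. The function name `repeated_ngram_ratio` and the choice of trigrams are assumptions made for illustration; whitespace splitting is only a crude proxy for real tokenization.

```python
def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = none)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


# The summary printed above, copied by hand.
degenerate = ('개인의 신용 상태를 개인의 신용 상태를 개인의 신용 상태를 '
              '개인의 신용 상태를 개인의 신용 상태를 평가하여 점')
print(repeated_ngram_ratio(degenerate))
```

A high ratio is a signal to retry with different generation settings. Passing generation arguments such as `no_repeat_ngram_size` to the summarizer call is a commonly used mitigation, though whether it helps for this particular model has not been checked here.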


In [None]:
## Model to try later
# 
# ainize/kobart-news

In [None]:
### Exercises with the HuggingFace pipeline() function
'''
1) Text classification (text-classification)
2) Text summarization (summarization)
'''

In [None]:
'''
3) Machine reading comprehension (MRC)
4) Text generation
'''


## Machine reading comprehension (QA)
- A question-answering model

In [26]:
### Import the required functions
from transformers import pipeline, set_seed

In [27]:
### Call pipeline() to create the model

qa = pipeline('question-answering')

# qa2 = pipeline('question-answering', model=model_name)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [28]:
### Create text data for QA

question = "Which name is also used to describe the Amazon rainforest in English?"
context = """The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."""



In [30]:
### Run QA

result = qa(
    question=question,
    context=context
)

print(result)

{'score': 0.5200247168540955, 'start': 201, 'end': 209, 'answer': 'Amazonia'}
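
The `start` and `end` values in the result are character offsets into `context`, so the answer is an extracted span rather than newly generated text. The sketch below shows that relationship on a shortened, hand-built stand-in: `sample_context` and `sample_result` are hypothetical values constructed for illustration, not output from the model.

```python
def answer_span(context, result):
    """Slice the answer out of the context using the returned character offsets."""
    return context[result['start']:result['end']]


# Shortened stand-in for the real context, with offsets computed by hand.
sample_context = 'also known in English as Amazonia or the Amazon Jungle'
answer = 'Amazonia'
start = sample_context.find(answer)
sample_result = {'score': 1.0, 'start': start, 'end': start + len(answer), 'answer': answer}

# The sliced span matches the 'answer' field.
print(answer_span(sample_context, sample_result) == sample_result['answer'])
```

The same check, `answer_span(context, result) == result['answer']`, can be run on the real output to confirm where in the passage the answer was found.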


## Text generation

In [33]:
### Call pipeline() to create the model
generator1 = pipeline('text-generation')

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [48]:
### Generate text (1) - using the gpt2 model

# Set the seed so that sampling is reproducible
set_seed(3)

# Generate text
result = generator1(
    'we had a sexual intercourse. she was so happy',  # input prompt
    max_length=100,          # maximum total length in tokens (prompt + generated text)
    num_return_sequences=4,  # number of sequences to generate
)

# Check the results
print(f'First generated text : \n{result[0]}')
print('*'*80)
print(f'Second generated text : \n{result[1]}')
print('*'*80)
print(f'Third generated text : \n{result[2]}')
print('*'*80)
print(f'Fourth generated text : \n{result[3]}')


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


First generated text : 
{'generated_text': "we had a sexual intercourse. she was so happy when I had a baby\n\nBitchy\n\nSo here we are now in 2010, and that's the first time she's ever been able to go out with her two best friends for a long time. Not that long after they finally reunited for their first time.\n\nJH and my mom were born for good. In that time of year a lot of bad things started happening, and we all knew it. For three minutes"}
********************************************************************************
Second generated text : 
{'generated_text': "we had a sexual intercourse. she was so happy. i saw her smile and he kissed her. her. she never asked herself that question. i didn't want to know. we didn't have to talk for a couple of days. she took a few steps forward and we talked... she asked me one last question... what was the other part of the story... how was he? she took a step back and we started kissing... she began to tell me about how he had been the reason"}
******************
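
Each `generated_text` begins with the input prompt, so the newly generated part has to be separated out if only the continuation is wanted. A minimal sketch of that post-processing step follows; `continuations` is a hypothetical helper, and `sample_outputs` is hand-written data in the same shape as the pipeline output, not real model output.

```python
def continuations(prompt, outputs):
    """Return each generated text with the leading prompt removed."""
    texts = []
    for out in outputs:
        full = out['generated_text']
        # Only strip the prompt when the text really starts with it.
        texts.append(full[len(prompt):] if full.startswith(prompt) else full)
    return texts


# Hand-written stand-in data shaped like the pipeline output.
sample_prompt = 'Once you choose hope,'
sample_outputs = [
    {'generated_text': 'Once you choose hope, anything is possible'},
    {'generated_text': 'Once you choose hope, the rest follows'},
]
print(continuations(sample_prompt, sample_outputs))
```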




