# 0. Language Models

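* A language model assigns probabilities to words, given the sentence or words that surround them.
* In other words, it is a model that finds the most natural word sequence.
* Example: I went to the airport to catch a plane, but I was running late, so I (  ) the plane.

  P(missed) > P(ate)
* Types of language models
  * Models that predict the next word given the preceding words (GPT)
  * Models that predict a blanked-out word in the middle from the words on both sides (BERT)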
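
To make "assigning probabilities to word sequences" concrete, here is a minimal sketch of a count-based trigram model. The toy corpus and the function names are assumptions made for illustration only; GPT and BERT learn these probabilities with neural networks rather than raw counts.

```python
from collections import Counter, defaultdict

def train_trigram(sentences):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split()
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def sequence_probability(counts, sentence):
    """Multiply the conditional probabilities of each word given the two before it."""
    tokens = ["<s>", "<s>"] + sentence.split()
    prob = 1.0
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        context = counts.get((w1, w2), Counter())
        total = sum(context.values())
        prob *= context[w3] / total if total else 0.0
    return prob

# Hypothetical toy corpus, not taken from any real dataset.
toy_corpus = [
    "i missed the plane",
    "she missed the plane",
    "i ate the cake",
    "he ate the bread",
]

counts = train_trigram(toy_corpus)
p_missed = sequence_probability(counts, "i missed the plane")
p_ate = sequence_probability(counts, "i ate the plane")
print(p_missed, p_ate)  # 0.25 0.0
```

The more natural sentence receives the higher probability, which mirrors P(missed) > P(ate) in the example above.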

* Hugging Face: a machine learning platform, built by a natural language processing startup, that provides a wide range of Transformer-based models and training scripts
  * https://huggingface.co/
  * https://transformer.huggingface.co/
  * https://github.com/huggingface/transformers/blob/main/README_ko.md

# 1. Hugging Face (BERT and Others)

* https://huggingface.co/bert-base-uncased

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


In [2]:
import transformers
transformers.__version__

'4.23.1'

In [3]:
from transformers import pipeline

In [4]:
# Sentiment classification
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to introduce pipeline to the transformers repository.')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9996980428695679}]

In [5]:
classifier('I am so sad to hear that.')

[{'label': 'NEGATIVE', 'score': 0.9989989399909973}]
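
The warning printed for the first `classifier` call notes that relying on the default model is not recommended in production. A minimal sketch of pinning the checkpoint explicitly follows. It assumes the model name and revision shown in that warning are the ones you want, and `build_sentiment_classifier` is a hypothetical helper name, not part of the library.

```python
# Values copied from the warning message in the output above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"

def build_sentiment_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so that defining this helper does not require the
    # library to be installed or the model to be downloaded.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `build_sentiment_classifier()` should behave like `pipeline('sentiment-analysis')` did in this run, but without depending on whatever default a future library version selects.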

In [6]:
# Question answering
question_answerer = pipeline('question-answering')
question_answerer({
  'question': 'What is the name of the repository?',
  'context': 'Pipeline has been included in the huggingface/transformers repository'
})

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.30970194935798645,
 'start': 34,
 'end': 58,
 'answer': 'huggingface/transformers'}
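
The `start` and `end` values in this output are character offsets into the context string. The short check below copies the result from the output above and confirms that slicing the context with those offsets reproduces the answer.

```python
context = "Pipeline has been included in the huggingface/transformers repository"

# Result copied from the question-answering output above.
result = {
    "score": 0.30970194935798645,
    "start": 34,
    "end": 58,
    "answer": "huggingface/transformers",
}

# Extractive QA returns a span of the context rather than newly written text.
span = context[result["start"]:result["end"]]
print(span == result["answer"])  # True
```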

In [7]:
# Fill in a masked word with BERT
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Hello I'm a [MASK] model.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.10731087625026703,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello i'm a fashion model."},
 {'score': 0.08774493634700775,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello i'm a role model."},
 {'score': 0.05338375270366669,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello i'm a new model."},
 {'score': 0.046672236174345016,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello i'm a super model."},
 {'score': 0.027095822617411613,
  'token': 2986,
  'token_str': 'fine',
  'sequence': "hello i'm a fine model."}]

In [8]:
unmasker('The Avengers is a really fun [MASK].')

[{'score': 0.31124347448349,
  'token': 2208,
  'token_str': 'game',
  'sequence': 'the avengers is a really fun game.'},
 {'score': 0.12129539996385574,
  'token': 2466,
  'token_str': 'story',
  'sequence': 'the avengers is a really fun story.'},
 {'score': 0.09142926335334778,
  'token': 3185,
  'token_str': 'movie',
  'sequence': 'the avengers is a really fun movie.'},
 {'score': 0.03700070083141327,
  'token': 6172,
  'token_str': 'adventure',
  'sequence': 'the avengers is a really fun adventure.'},
 {'score': 0.033156175166368484,
  'token': 2265,
  'token_str': 'show',
  'sequence': 'the avengers is a really fun show.'}]

In [9]:
# Biased predictions
unmasker("The man worked as a [MASK].")

[{'score': 0.09747567027807236,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'the man worked as a carpenter.'},
 {'score': 0.05238332226872444,
  'token': 15610,
  'token_str': 'waiter',
  'sequence': 'the man worked as a waiter.'},
 {'score': 0.049626946449279785,
  'token': 13362,
  'token_str': 'barber',
  'sequence': 'the man worked as a barber.'},
 {'score': 0.0378861278295517,
  'token': 15893,
  'token_str': 'mechanic',
  'sequence': 'the man worked as a mechanic.'},
 {'score': 0.03768080845475197,
  'token': 18968,
  'token_str': 'salesman',
  'sequence': 'the man worked as a salesman.'}]

In [10]:
unmasker("The woman worked as a [MASK].")

[{'score': 0.21981488168239594,
  'token': 6821,
  'token_str': 'nurse',
  'sequence': 'the woman worked as a nurse.'},
 {'score': 0.15974114835262299,
  'token': 13877,
  'token_str': 'waitress',
  'sequence': 'the woman worked as a waitress.'},
 {'score': 0.1154731884598732,
  'token': 10850,
  'token_str': 'maid',
  'sequence': 'the woman worked as a maid.'},
 {'score': 0.03796885535120964,
  'token': 19215,
  'token_str': 'prostitute',
  'sequence': 'the woman worked as a prostitute.'},
 {'score': 0.030423814430832863,
  'token': 5660,
  'token_str': 'cook',
  'sequence': 'the woman worked as a cook.'}]
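
One rough way to quantify the difference between the two prompts is to compare their top-5 completions directly. The sketch below uses the scores from the two outputs above, rounded to four decimals. The helper names and the choice of Jaccard overlap are assumptions for illustration, not an established bias metric.

```python
# Top-5 completions copied from the outputs above (scores rounded to 4 decimals).
man_top5 = {"carpenter": 0.0975, "waiter": 0.0524, "barber": 0.0496,
            "mechanic": 0.0379, "salesman": 0.0377}
woman_top5 = {"nurse": 0.2198, "waitress": 0.1597, "maid": 0.1155,
              "prostitute": 0.0380, "cook": 0.0304}

def jaccard(a, b):
    """Share of completions that the two prompts have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def top_k_mass(scores):
    """Total probability the model puts on its top-k completions."""
    return sum(scores.values())

overlap = jaccard(man_top5, woman_top5)
mass_man = top_k_mass(man_top5)
mass_woman = top_k_mass(woman_top5)
print(overlap)  # 0.0
```

The two prompts share no top-5 completion, and the model concentrates more probability on its top-5 occupations for "woman" (about 0.56) than for "man" (about 0.28).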

In [11]:
# Text generation with GPT-2
generator = pipeline('text-generation', model='gpt2')
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, and I'm happy to give you the tools you need to create your own web apps. And here I am"},
 {'generated_text': 'Hello, I\'m a language model, and I\'m doing things differently now than I ever was before," said Scott Peterson, 29, of Seattle.'},
 {'generated_text': "Hello, I'm a language model, a programming model, I can't explain everything. But I think you can learn to look at data and see"},
 {'generated_text': 'Hello, I\'m a language model, I think I can solve those problems."\n\nWhile his students were speaking, there was also talk by one'},
 {'generated_text': 'Hello, I\'m a language model, so I\'m not going to be using any of your code. They are all just my own idea."\n'}]
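
The call above returns several different continuations because generation samples one token at a time, each conditioned on what came before. The toy loop below sketches that idea. The word table is hypothetical, whole words stand in for tokens, and candidates are drawn uniformly rather than from model probabilities. As with the `max_length` argument above, the limit counts the prompt as well as the generated words.

```python
import random

# Hypothetical next-word table standing in for a trained model.
NEXT_WORDS = {
    "a": ["language", "toy"],
    "language": ["model"],
    "toy": ["model"],
    "model": ["that", "."],
    "that": ["writes", "talks"],
    "writes": ["."],
    "talks": ["."],
}

def generate(prompt, max_length, num_return_sequences, seed=0):
    """Sample continuations word by word until max_length or a dead end."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_return_sequences):
        tokens = prompt.split()
        while len(tokens) < max_length:
            candidates = NEXT_WORDS.get(tokens[-1])
            if not candidates:
                break  # no known continuation for the last word
            tokens.append(rng.choice(candidates))
        outputs.append(" ".join(tokens))
    return outputs

samples = generate("I am a", max_length=6, num_return_sequences=3)
for text in samples:
    print(text)
```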

In [14]:
# Biased predictions
generator("The White man worked as a", max_length=10, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The White man worked as a detective in Brooklyn,'},
 {'generated_text': 'The White man worked as a driver, chauffe'},
 {'generated_text': 'The White man worked as a clerk in a restaurant'},
 {'generated_text': 'The White man worked as a secretary.\n\n'},
 {'generated_text': 'The White man worked as a nurse for ten years'}]

In [16]:
generator("The Black man worked as a", max_length=10, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The Black man worked as a barber/bar'},
 {'generated_text': 'The Black man worked as a cop for 18 years'},
 {'generated_text': 'The Black man worked as a private security guard in'},
 {'generated_text': 'The Black man worked as a prostitute to support two'},
 {'generated_text': 'The Black man worked as a mechanic and a part'}]

In [17]:
# Natural language inference with RoBERTa (A Robustly Optimized BERT Pretraining Approach)
# RoBERTa: an approach from the Facebook AI team that keeps the BERT architecture and improves performance by tuning the pretraining hyperparameters
classifier = pipeline('zero-shot-classification', model='roberta-large-mnli')
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


{'sequence': 'one day I will see the world',
 'labels': ['travel', 'cooking', 'dancing'],
 'scores': [0.979964017868042, 0.010604999028146267, 0.009431006386876106]}

In [18]:
sequence_to_classify = "My mom is in the kitchen"
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'My mom is in the kitchen',
 'labels': ['cooking', 'travel', 'dancing'],
 'scores': [0.9746202230453491, 0.015938350930809975, 0.009441414847970009]}
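
Zero-shot classification reuses a natural language inference model: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input entails it, and the entailment scores are normalized across labels. The sketch below illustrates that flow. The template string matches the pipeline's documented default as far as I know, while the logits are hypothetical numbers, not values produced by `roberta-large-mnli`.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(values):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, entailment_logits):
    """Sort labels by their normalized entailment score, highest first."""
    probs = softmax(entailment_logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

candidate_labels = ["travel", "cooking", "dancing"]
hypotheses = build_hypotheses(candidate_labels)

# Hypothetical entailment logits for "one day I will see the world".
ranked = rank_labels(candidate_labels, [4.0, -0.5, -0.6])
print(hypotheses[0])  # This example is travel.
print(ranked[0][0])   # travel
```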

In [19]:
# Summarization with BART
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = """ 
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. 
Its base is square, measuring 125 metres (410 ft) on each side. During its construction, 
the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, 
a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. 
It was the first structure to reach a height of 300 metres. 
Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). 
Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct
"""
summarizer(text, max_length=130, min_length=30, do_sample=False)

[{'summary_text': 'The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. During its construction, it surpassed the Washington Monument to become the tallest man-made structure in the world.'}]

In [20]:
# Using a model directly
# https://huggingface.co/models
from transformers import BertTokenizer, TFBertModel

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')    # 110M params

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')    # use 'pt' for PyTorch
output = model(encoded_input)
output

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(1, 12, 768), dtype=float32, numpy=
array([[[ 0.1386268 ,  0.15826862, -0.29666469, ..., -0.27084914,
         -0.28436273,  0.45808393],
        [ 0.53636354, -0.23269668,  0.17541982, ...,  0.5540257 ,
          0.4980719 , -0.00240791],
        [ 0.30023754, -0.3475122 ,  0.12084441, ..., -0.45624804,
          0.32880232,  0.87728155],
        ...,
        [ 0.37985978,  0.12028786,  0.82829404, ..., -0.86237186,
         -0.5956974 ,  0.047116  ],
        [-0.02524197, -0.7176756 , -0.6950478 , ...,  0.07574223,
         -0.66678154, -0.34007478],
        [ 0.7535388 ,  0.23910932,  0.07174372, ...,  0.24671514,
         -0.6458062 , -0.32129812]]], dtype=float32)>, pooler_output=<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-0.93767864, -0.50425893, -0.979893  ,  0.90304404,  0.9329326 ,
        -0.24377495,  0.89257544,  0.228806  , -0.95312095, -0.9999953 ,
        -0.88623035,  0.990
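
The `last_hidden_state` in this output has shape (1, 12, 768): one input sequence, 12 tokens, and a 768-dimensional vector per token. One common way to collapse these into a single sentence vector is mean pooling over the tokens that are not padding. The notebook does not do this, so the sketch below is an assumption about a typical next step, written with small toy vectors in plain Python instead of real model output.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(hidden_states[0])
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy stand-in for one sequence: 3 tokens, 2 dimensions each.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]  # the last position is padding and is ignored

sentence_vector = mean_pool(toy_hidden, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```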

# 2. Hugging Face (KLUE BERT and Others)

* https://huggingface.co/klue/bert-base
* https://huggingface.co/skt/kogpt2-base-v2

In [21]:
# Fill in a masked word with KLUE BERT
unmasker = pipeline('fill-mask', model='klue/bert-base')
unmasker("축구는 정말 재미있는 [MASK]다.")    # "Soccer is a really fun [MASK]."

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.8963626623153687,
  'token': 4559,
  'token_str': '스포츠',
  'sequence': '축구는 정말 재미있는 스포츠 다.'},
 {'score': 0.025958096608519554,
  'token': 568,
  'token_str': '거',
  'sequence': '축구는 정말 재미있는 거 다.'},
 {'score': 0.010034133680164814,
  'token': 3682,
  'token_str': '경기',
  'sequence': '축구는 정말 재미있는 경기 다.'},
 {'score': 0.007924514822661877,
  'token': 4713,
  'token_str': '축구',
  'sequence': '축구는 정말 재미있는 축구 다.'},
 {'score': 0.007844360545277596,
  'token': 5845,
  'token_str': '놀이',
  'sequence': '축구는 정말 재미있는 놀이 다.'}]

In [22]:
unmasker("한국 디지털 미디어 고등학교는 대한민국의 [MASK]다.")    # "Korea Digital Media High School is the [MASK] of South Korea."

[{'score': 0.23672793805599213,
  'token': 4037,
  'token_str': '미래',
  'sequence': '한국 디지털 미디어 고등학교는 대한민국의 미래 다.'},
 {'score': 0.10459045320749283,
  'token': 3741,
  'token_str': '학교',
  'sequence': '한국 디지털 미디어 고등학교는 대한민국의 학교 다.'},
 {'score': 0.07952725887298584,
  'token': 5868,
  'token_str': '고등학교',
  'sequence': '한국 디지털 미디어 고등학교는 대한민국의 고등학교 다.'},
 {'score': 0.06690999120473862,
  'token': 3846,
  'token_str': '역사',
  'sequence': '한국 디지털 미디어 고등학교는 대한민국의 역사 다.'},
 {'score': 0.02507483959197998,
  'token': 6056,
  'token_str': '교과서',
  'sequence': '한국 디지털 미디어 고등학교는 대한민국의 교과서 다.'}]

In [None]:
# KoGPT2 (Korean GPT-2): SKT-AI
# https://github.com/SKT-AI/KoGPT2

In [23]:
# Tokenization
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("skt/kogpt2-base-v2")
tokenizer.tokenize("아버지가방에들어가신다.")    # "Father goes into the room.", written without spaces

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


['▁아버지가', '방에', '들어', '가', '신', '다.']
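
In this output the `▁` prefix appears to mark a token that begins a new word, which is the convention used by SentencePiece-style tokenizers. Tokens without it continue the previous word. A minimal sketch of reversing the split by hand, assuming that convention holds (`detokenize` is a hypothetical helper, and in practice the tokenizer's own decoding method would be used):

```python
# Tokens copied from the output above.
tokens = ['▁아버지가', '방에', '들어', '가', '신', '다.']

def detokenize(pieces, marker="▁"):
    """Join subword pieces and turn word-start markers back into spaces."""
    return "".join(pieces).replace(marker, " ").strip()

restored = detokenize(tokens)
print(restored == "아버지가방에들어가신다.")  # True
```

Although the input had no spaces, the subword vocabulary still splits it into pieces that line up with the words "아버지가" and "방에".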

In [24]:
# Sentence generation
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')
text = '근육을 키우기 위해서는'    # "In order to build muscle,"
input_ids = tokenizer.encode(text, return_tensors='pt')
gen_ids = model.generate(input_ids,
                         max_length=128,
                         repetition_penalty=2.0,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         bos_token_id=tokenizer.bos_token_id,
                         use_cache=True)
generated = tokenizer.decode(gen_ids[0])
print(generated)

근육을 키우기 위해서는 무엇보다 자신의 몸을 잘 관리해야 한다.
특히, 평소에는 운동을 통해 체력을 키워주는 것이 중요하다.
또한 운동 후에는 반드시 스트레칭과 함께 가벼운 걷기 등 유산소운동을 해주는 것도 도움이 된다.
이러닝은 단순히 땀 배출에 그치는 게 아니라 몸의 신진대사를 촉진시켜 몸매를 가꾸는데 도움을 준다.
운동 후 바로 샤워나 목욕 등으로 체온유지를 돕는 것은 물론 피부미용에도 효과적이다.</d> 한국관광공사는 지난해 한국을 찾은 외국인 관광객 수가 전년 대비 12% 증가한 4억2000만명을 기록했다고 1일 밝혔다.
이는 역대 최대 규모다.
지난 2016년부터 3년 연속
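
The `repetition_penalty=2.0` argument in the `generate` call discourages the model from repeating tokens it has already produced. A minimal sketch of the usual rule follows, assuming the behavior described for the library's repetition penalty: scores of already generated tokens are divided by the penalty when positive and multiplied by it when negative. The logits here are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of token ids that already appear in the output."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Both branches push the score down: negative scores become more
        # negative, positive scores shrink toward zero.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

# Hypothetical scores for a 4-token vocabulary; ids 0 and 1 were already generated.
logits = [2.0, -1.0, 0.5, 3.0]
adjusted = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=2.0)
print(adjusted)  # [1.0, -2.0, 0.5, 3.0]
```

A penalty of 1.0 leaves the scores unchanged, and larger values suppress repetition more strongly.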
