transformers library (review): self-supervised learning methods based on the Transformer, beyond BERT  
Learning goal: the training and inference workflow

# Transformers?

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017

![screensh](https://blog.kakaocdn.net/dn/blla7d/btqBPXAzWdA/1yMKSf4SYWRT9t0yDt2lM1/img.jpg)

# Loading models with the Hugging Face library

In [2]:
!pip install transformers[sentencepiece]
# installs transformers together with the sentencepiece library that some tokenizers need
!pip install datasets
# the Hugging Face datasets library


In [2]:
import torch
import torch.nn as nn
import torchtext

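Model families: BERT, GPT, ... are all built on the Transformer. They differ in:
- Architecture: encoder only / decoder only / encoder + decoder
- Size: how many layers, and the dimension of the hidden vectors
- Training method, in particular the pre-training objective (the most important difference)
- Which data the model was fine-tuned on

Hugging Face provided 26,662 models in total when these notes were written.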

## BERT

In [12]:
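# 26,662 models -> one extra line of code is enough to use any of them

# BERT tokenizer (converts text into tokens)
from transformers import BertTokenizer, BertForTokenClassification
# token classification: one classification result per token (here 0 or 1)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Why is the tokenizer pre-trained as well?
# 1. More general tokens
#    - the pre-training corpus is very large, so its vocabulary covers more general tokens.
# 2. It comes as a set with the model
#    - the word embedding is looked up by id: e.g. <pad> -> 0, it -> 32, ...
#    - a vocabulary built from new data might give <pad> -> 0, it -> 25,
#      so "it" would fetch embedding vector 25 instead of vector 32, the one it was trained with
# -> What if we want to add new words (use a larger vocabulary)?
# -> the original vocab has about 30,000 entries (30,522 for bert-base-uncased); new tokens get the ids that follow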

model = BertForTokenClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
# "pt": return PyTorch tensors
# "tf": return TensorFlow tensors
# if omitted, a plain Python list is returned

outputs = model(inputs)[0]  # [0] selects the logits

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u
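
The comments in the cell above explain why a tokenizer and a model come as a set: token ids are row indices into the embedding table. The following is a minimal pure-Python sketch of that idea, not the transformers implementation. The tiny vocabularies, the embedding values, and the helper names `embed` and `add_token` are all made up for illustration.

```python
# Toy illustration: token ids index rows of the embedding table, so the
# tokenizer's vocabulary must be the one the model was pre-trained with.

pretrained_vocab = {"<pad>": 0, "hello": 1, "dog": 2, "it": 3}
# One embedding row per id; the values are invented for the example.
embedding_table = [
    [0.0, 0.0],  # <pad>
    [0.1, 0.9],  # hello
    [0.7, 0.2],  # dog
    [0.5, 0.5],  # it
]

def embed(tokens, vocab, table):
    """Look up the embedding row of each token through its id."""
    return [table[vocab[tok]] for tok in tokens]

# A vocabulary built from new data assigns different ids to the same words.
new_vocab = {"<pad>": 0, "it": 1, "hello": 2, "dog": 3}

correct = embed(["it"], pretrained_vocab, embedding_table)
mismatched = embed(["it"], new_vocab, embedding_table)
print(correct)     # the row that was trained for "it"
print(mismatched)  # the row that was trained for "hello": the wrong vector

def add_token(token, vocab, table, dim=2):
    """Append a new token after the existing ids and give it a fresh row."""
    if token not in vocab:
        vocab[token] = len(vocab)
        table.append([0.0] * dim)  # new row that still has to be trained
    return vocab[token]

new_id = add_token("huggingface", pretrained_vocab, embedding_table)
print(new_id)  # the first id after the original vocabulary
```

In the library itself, `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` plays the role of the hypothetical `add_token` helper here.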

In [13]:
# named `encoded` rather than `input` to avoid shadowing the Python built-in
encoded = tokenizer("Hello, my dog is cute", return_tensors="pt")
print(encoded.keys())
print(encoded['input_ids'])
print(encoded['token_type_ids'])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0]])


In [14]:
outputs.shape # Batch(1), Sentence length(8), Classification(2)

torch.Size([1, 8, 2])

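## ELECTRA
GAN-like pre-training: a generator replaces some tokens and a discriminator (the checkpoint loaded here) detects which tokens were replaced.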

In [15]:
from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained('google/electra-large-discriminator')
model = ElectraModel.from_pretrained('google/electra-large-discriminator')

inputs = tokenizer.encode("The capital of France is [MASK].", return_tensors="pt")

outputs = model(inputs)

Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
outputs

BaseModelOutputWithPastAndCrossAttentions([('last_hidden_state',
                                            tensor([[[-0.1712,  0.0506,  0.6153,  ...,  0.3136,  0.0033, -0.1737],
                                                     [ 0.2136,  0.2114,  0.0852,  ...,  0.0938, -0.4353, -0.0061],
                                                     [-0.0450,  0.1075,  0.2639,  ..., -0.0869, -0.2249,  0.2051],
                                                     ...,
                                                     [-0.1608,  0.1830,  0.5355,  ...,  0.1005, -0.0964, -0.2321],
                                                     [ 0.2983,  0.0679,  0.1625,  ..., -0.0437, -0.1851,  0.4285],
                                                     [-0.1704,  0.1914,  0.5484,  ...,  0.1241, -0.0846, -0.2425]]],
                                                   grad_fn=<NativeLayerNormBackward0>))])

What is last_hidden_state?  
The BertForTokenClassification model above is BERT with a token-classification head on top.  
Loading a bare base model (BertModel, or ElectraModel as here) returns last_hidden_state instead, as in the output above.  

- BertFor**Token**Classification = BertModel + linear layer (the head decides what form the output takes)  
one classification result per token
- BertFor**Sequence**Classification = BertModel + linear layer  
one classification result per sentence
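
The difference between the two heads can be sketched in plain Python. This is a toy illustration with invented numbers, not the library's code: the real models add details such as pooling and dropout, and the helper `linear` is hypothetical.

```python
# Both heads are a linear layer on top of the encoder's hidden states.

def linear(vector, weight, bias):
    """Compute weight @ vector + bias for one hidden vector."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

# Pretend encoder output for one sentence: 3 tokens, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A 2-class linear head: weight has shape (num_labels, hidden_size).
weight = [[1.0, -1.0], [-1.0, 1.0]]
bias = [0.0, 0.0]

# Token classification: apply the head to every token, one result per token.
token_logits = [linear(h, weight, bias) for h in hidden_states]

# Sequence classification: apply the head to a single summary vector
# (here the first token's state, standing in for [CLS]), one result per sentence.
sequence_logits = linear(hidden_states[0], weight, bias)

print(len(token_logits), len(token_logits[0]))  # tokens x labels
print(sequence_logits)                          # one row of label scores
```

The token head yields a `(sentence length, num_labels)` grid, which matches the `[1, 8, 2]` shape seen earlier once the batch dimension is added.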


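last_hidden_state is the output of the last encoder or decoder layer of the transformer, before any linear head is applied.  
logits: when a model returns these, they are the values before softmax.  
likelihood: the values after softmax (probabilities).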
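
The relation between logits and probabilities fits in a few lines. A minimal sketch, with made-up logits for a 2-class head:

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.0]  # invented scores, before softmax
probs = softmax(logits)  # after softmax
predicted = probs.index(max(probs))
print(probs, predicted)
```

Softmax keeps the ordering of the scores, so taking the argmax of the logits gives the same prediction as taking the argmax of the probabilities.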

In [19]:
outputs.last_hidden_state

tensor([[[-0.1712,  0.0506,  0.6153,  ...,  0.3136,  0.0033, -0.1737],
         [ 0.2136,  0.2114,  0.0852,  ...,  0.0938, -0.4353, -0.0061],
         [-0.0450,  0.1075,  0.2639,  ..., -0.0869, -0.2249,  0.2051],
         ...,
         [-0.1608,  0.1830,  0.5355,  ...,  0.1005, -0.0964, -0.2321],
         [ 0.2983,  0.0679,  0.1625,  ..., -0.0437, -0.1851,  0.4285],
         [-0.1704,  0.1914,  0.5484,  ...,  0.1241, -0.0846, -0.2425]]],
       grad_fn=<NativeLayerNormBackward0>)

In [20]:
outputs.last_hidden_state.shape

torch.Size([1, 9, 1024])

# AutoModel 사용해보기

In [21]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoModel
# BertModel, ElectraModel, ... -> AutoModel

# the Auto classes pick the right architecture automatically
# from_pretrained needs the name (or path) of a pre-trained checkpoint
# BERT, GPT, ELECTRA, ALBERT, RoBERTa, ...
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
outputs = model(inputs)[0]


In [24]:
model_name = 'bert-base-uncased'  # any checkpoint name works here; an empty string cannot be resolved and raises an error
model = AutoModel.from_pretrained(model_name)
model

# 사전학습된 모델 사용해보기

In [26]:
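# pre-training
# fine-tuning
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Q&A datasets
# SQuAD v1.1 -> the answer is a span of the passage (extractive)
# SQuAD v2.0 -> same format, plus questions that cannot be answered from the passage
# the T5 checkpoint below was fine-tuned on SQuAD v2 and generates the answer as text (a generative model)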

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-squadv2")



In [27]:

def get_answer(question, context):
  input_text = "question: %s  context: %s" % (question, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'])
  
  return tokenizer.decode(output[0])



# Wikipedia text
context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
question = 'When did Beyonce start becoming popular?'

get_answer(question,context)

'<pad> late 1990s</s>'
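
The decoded answer still carries the special tokens `<pad>` and `</s>`. As a string-level stand-in, the hypothetical helper below strips them after decoding. It is only a sketch: the library offers `skip_special_tokens=True` on `tokenizer.decode` for this purpose, which the summarization cells later in these notes use.

```python
def strip_special_tokens(text, special_tokens=("<pad>", "</s>")):
    """Remove special-token markers from an already decoded string."""
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()

decoded = "<pad> late 1990s</s>"
cleaned = strip_special_tokens(decoded)
print(cleaned)
```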

In [28]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
# AutoModelWithLMHead is deprecated; T5 is an encoder-decoder model, so AutoModelForSeq2SeqLM is the matching class
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")


In [29]:
def get_question(answer, context, max_length=64):
  input_text = "answer: %s  context: %s </s>" % (answer, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'],
               max_length=max_length)

  return tokenizer.decode(output[0])

context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
answer = '1981'

get_question(answer, context)

'<pad> question: What year was Beyonce born?</s>'

# Processing data with the Hugging Face library

In [30]:
from datasets import load_dataset
datasets = load_dataset('imdb')


In [31]:
datasets.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [32]:
datasets['train'][0]

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are f

In [34]:
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()






In [35]:
train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)

In [36]:
batch = next(iter(train_dataloader))
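
A batch produced by the DataLoader groups the texts together and the labels together, which is why the texts are read from `batch[0]`. A rough pure-Python sketch of that grouping, assuming `(text, label)` samples as returned by `CustomDataset`; the real DataLoader additionally stacks the labels into a tensor, and the function name `collate` is illustrative:

```python
def collate(samples):
    """Group a list of (text, label) pairs into (texts, labels)."""
    texts, labels = zip(*samples)
    return list(texts), list(labels)

# Four invented samples, matching batch_size=4.
samples = [
    ("great movie", 1),
    ("boring plot", 0),
    ("loved it", 1),
    ("not for me", 0),
]
texts, labels = collate(samples)
print(texts)
print(labels)
```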

In [37]:
features = tokenizer(list(batch[0]))

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors


In [38]:
features.keys()

dict_keys(['input_ids', 'attention_mask'])
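
The warning above appears because one review is 651 tokens long while the model accepts at most 512. Truncation and padding bring every sequence to the same length, and the attention mask records which positions hold real tokens. A simplified sketch: the helper `pad_or_truncate` and the ids are made up, and real tokenizers also take care of special tokens when truncating.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]              # truncation: cut sequences that are too long
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)       # padding: fill sequences that are too short
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With the library, `padding='max_length'`, `max_length=512` and `truncation=True` request this behaviour, as the training code in the next section does.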

# Training a model with just a few lines of code

In [57]:
from datasets import load_dataset
datasets = load_dataset('imdb')

from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()


train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)


In [58]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # unused below: the model returns the loss itself when labels are passed

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [59]:
def train_epoch(model, dataloader, tokenizer, optimizer):
    model.train()
    train_loss = 0
    for i, (x,y) in enumerate(dataloader):
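        # x: a batch of raw texts
        # y: the gold labels
        features = tokenizer(list(x), padding='max_length', return_tensors='pt', max_length=512, truncation=True)
        
        # padding='max_length': shorter texts are padded up to 512 tokens
        # truncation=True: longer texts (e.g. 600 tokens) are cut down to 512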
        x = features['input_ids'].to(DEVICE)

        attention_mask = features['attention_mask'].to(DEVICE)
        y = y.to(DEVICE)
        loss = model(x, attention_mask=attention_mask, labels=y)['loss']

        # model update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        if i % 50 == 0:
            print('Iter [{}/{}] Loss {:.6f}'.format(i+1, len(dataloader), train_loss / (i+1)))
    
    return train_loss / len(dataloader)

def test_epoch(model, dataloader, tokenizer):
    model.eval()
    preds = []
    labels = []
    with torch.no_grad():
      for x,y in dataloader:
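          # truncation keeps inputs within the 512-token limit; the mask hides padding from attention
          features = tokenizer(list(x), padding='max_length', return_tensors='pt', max_length=512, truncation=True)
          out = model(features['input_ids'].to(DEVICE),
                      attention_mask=features['attention_mask'].to(DEVICE))['logits']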
          pred = out.argmax(-1)
          preds.append(pred.cpu())
          labels.append(y)
    preds = torch.cat(preds)
    labels = torch.cat(labels)
    acc = (preds == labels).float().mean()
    print('ACC : {:.3f}'.format(acc))
    return preds, labels

def predict(model, tokenizer, sentence):
    model.eval()
    x = tokenizer.encode(sentence, return_tensors='pt', max_length=512, truncation=True).to(DEVICE)
    with torch.no_grad():  # inference only, no gradients needed
        out = model(x)['logits']
    pred = out.argmax(-1)
    return pred.cpu()
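
What `test_epoch` computes can be traced by hand on a tiny batch: take the argmax of each row of logits, then compare with the labels. A pure-Python sketch with invented logits; the function names are illustrative and the tensor version above does the same with `argmax(-1)` and `mean()`.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(batch_logits, labels):
    """Predicted class per row, and the fraction of rows predicted correctly."""
    preds = [argmax(row) for row in batch_logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return preds, correct / len(labels)

# Invented logits for 4 reviews and 2 classes (0 = negative, 1 = positive).
batch_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [0.9, 0.3]]
labels = [1, 0, 0, 0]
preds, acc = accuracy(batch_logits, labels)
print(preds, acc)
```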

In [60]:
EPOCHS=1

for i in range(EPOCHS):
    train_epoch(model, train_dataloader, tokenizer, optimizer)
    test_epoch(model, test_dataloader, tokenizer)

Iter [1/6250] Loss 0.719544


KeyboardInterrupt: ignored

# The full pipeline: from data to training

In [3]:
!pip install transformers[sentencepiece]
# the Hugging Face datasets library
!pip install datasets

import torch
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')





In [9]:
# 1. Load the data
from datasets import load_dataset

datasets_name = 'xsum'
datasets = load_dataset(datasets_name)

# 2. Pick a model and load its tokenizer and weights
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 't5-small'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

model = model.to(DEVICE) # move to the GPU when one is available

# define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 3. Custom dataset class
train_data = datasets['train']
valid_data = datasets['validation']

print('train_data[0]: ', train_data[0]) # fields: document, id, summary

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data
        # each item has the fields document, id, summary
        # document: the input text
        # summary: the target output (gold answer)
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['document']
        y = item['summary']
        return x, y


train_dataset = CustomDataset(train_data)
valid_dataset = CustomDataset(valid_data)

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=4, shuffle=False) # validation

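# peek at one batch
batch = next(iter(valid_dataloader))

# tokenize and convert to tensors
def tokenizing(batch):
    document = batch[0]
    summary = batch[1]

    document_features = tokenizer(list(document), return_tensors='pt', padding='max_length', max_length=512, truncation=True)
    summary_features = tokenizer(list(summary), return_tensors='pt', padding='max_length', max_length=512, truncation=True)
    
    # truncation: sequences longer than 512 tokens are cut down to 512
    # padding: sequences shorter than 512 tokens are padded up to 512
    # note: the padded summary ids are later used as labels unchanged, so the loss also counts
    # pad positions; setting those label positions to -100 would make the loss ignore them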

    return document_features, summary_features


# 4. Training loop
for epoch in range(5):
    model.train()
    train_loss = 0
    for idx, batch in enumerate(train_dataloader):
        # tokenizing + tensor + gpu upload
        document_features, summary_features = tokenizing(batch)
        
        loss = model(document_features['input_ids'].to(DEVICE),
                     attention_mask=document_features['attention_mask'].to(DEVICE),
                     labels=summary_features['input_ids'].to(DEVICE))['loss']

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item() # tensor.item()

        if idx % 100 == 0:
            print('iter[{}/{}] train loss [{:.6f}]'.format(idx, len(train_dataloader), train_loss/(idx+1)))
    # print('-----'*10)
    # print('train_loss : {:.5f}'.format(train_loss/len(train_dataloader)))
    

    # validation
    model.eval()
    valid_loss = 0
    with torch.no_grad():  # no gradients and no parameter updates during validation
        for idx, batch in enumerate(valid_dataloader):
            # tokenizing + tensor + gpu upload
            document_features, summary_features = tokenizing(batch)

            loss = model(document_features['input_ids'].to(DEVICE),
                         attention_mask=document_features['attention_mask'].to(DEVICE),
                         labels=summary_features['input_ids'].to(DEVICE))['loss']

            valid_loss += loss.item() # tensor.item()
    print('valid_loss : {:.5f}'.format(valid_loss/len(valid_dataloader)))


# 5. Training progress (see the output below)





iter[0/51012] train loss [23.951654]
iter[100/51012] train loss [3.090957]
iter[200/51012] train loss [1.710274]
iter[300/51012] train loss [1.225301]
iter[400/51012] train loss [0.978071]
iter[500/51012] train loss [0.826488]
iter[600/51012] train loss [0.724566]
iter[700/51012] train loss [0.649202]
iter[800/51012] train loss [0.591820]
iter[900/51012] train loss [0.546956]
iter[1000/51012] train loss [0.510957]
iter[1100/51012] train loss [0.480907]
iter[1200/51012] train loss [0.456592]
iter[1300/51012] train loss [0.435647]
iter[1400/51012] train loss [0.417385]
iter[1500/51012] train loss [0.401195]
iter[1600/51012] train loss [0.387316]
iter[1700/51012] train loss [0.374897]
iter[1800/51012] train loss [0.363712]
iter[1900/51012] train loss [0.353537]
iter[2000/51012] train loss [0.344960]
iter[2100/51012] train loss [0.336945]


KeyboardInterrupt: ignored
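
In the loop above the summaries are padded to 512 tokens and passed as labels unchanged, so pad positions count toward the loss. Since padding is easy to predict, this probably contributes to how quickly the reported loss falls. A common remedy is to set the pad positions in the labels to -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the effect in plain Python; the token ids, the probabilities, and the helper names are invented for illustration.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding(label_ids, pad_id):
    """Replace pad ids in the labels with IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]

def mean_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not ignored."""
    losses = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

pad_id = 0
label_ids = [37, 12, 1, 0, 0, 0, 0, 0]  # 3 real tokens followed by 5 pads
# Invented probabilities the model gives to the correct token at each position;
# the pad positions are predicted almost perfectly.
token_probs = [0.2, 0.3, 0.5, 0.99, 0.99, 0.99, 0.99, 0.99]

loss_with_pad = mean_nll(token_probs, label_ids)
loss_masked = mean_nll(token_probs, mask_padding(label_ids, pad_id))
print(round(loss_with_pad, 3), round(loss_masked, 3))
```

The masked loss is higher because it averages only over the real summary tokens, which is the quantity the training should reduce.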

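# Assignment

- Load a model that has been fine-tuned for text summarization and summarize the texts below.
- You pass if the output reads like normal text.
- The summary does not have to be perfect. Anything that is not complete gibberish passes!
    - Gibberish example: 이 글은 이 이 글은, . , , pad 이 것  (roughly "this text is this this text is, . , , pad this thing", the result of a model that did not train properly)
    - Normal example: 이건 과제에 관한 글이다 ("This is a text about the assignment")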

In [10]:
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained('ainize/kobart-news')
model = BartForConditionalGeneration.from_pretrained('ainize/kobart-news')


In [23]:
text = "과거를 떠올려보자. 방송을 보던 우리의 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV를 봤다. 간혹 가족들끼리 뉴스와 드라마, 예능 프로그램을 둘러싸고 리모컨 쟁탈전이 벌어지기도  했다. 각자 선호하는 프로그램을 ‘본방’으로 보기 위한 싸움이었다. TV가 한 대인지 두 대인지 여부도 그래서 중요했다. 지금은 어떤가. ‘안방극장’이라는 말은 옛말이 됐다. TV가 없는 집도 많다. 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다. 각자의 방에서 각자의 휴대폰으로, 노트북으로, 태블릿으로 콘텐츠 를 즐긴다."

# 1. Tokenize the text with the tokenizer. (See the examples in the Hugging Face library documentation.)
input_ids = tokenizer.encode(text, return_tensors='pt')

# 2. Generate the summary with model.generate.
summary_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,
    max_length=80,
    min_length=56,
    num_beams=4,
)

# 3. Decode with the tokenizer to turn the ids back into readable text.

tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'안안극장이라는 말은 TV가 한 대인지 두 대인지 여부도 그래서 중요하며 미디어의 혜 택을 누릴 수 있는 방법은 늘어나고 각자의 방에서 각자의 휴대폰으로, 노트북으로, 태블릿으로 콘텐츠 를 즐기고 콘텐츠 를 즐긴다는 것이 현재의 새로운 말인  ‘안방극장’이라는 말은 옛말이 됐다.'
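
The `length_penalty` argument affects which beam wins. As far as I understand the beam-search scoring in transformers, a finished hypothesis is ranked by its summed log-probability divided by `length ** length_penalty`; treat that formula as an assumption of this sketch. The two candidates below are invented.

```python
def beam_score(sum_logprob, length, length_penalty):
    """Length-normalised score used to rank finished beams (assumed formula)."""
    return sum_logprob / (length ** length_penalty)

# name -> (summed log-probability, number of tokens); made-up values
candidates = {
    "short": (-6.0, 10),
    "long": (-9.0, 20),
}

winners = {}
for length_penalty in (0.0, 1.0, 2.0):
    scores = {name: beam_score(logprob, n, length_penalty)
              for name, (logprob, n) in candidates.items()}
    winners[length_penalty] = max(scores, key=scores.get)
    print(length_penalty, scores, winners[length_penalty])
```

With no penalty the shorter candidate wins because it accumulates less negative log-probability. Values above zero shift the ranking toward longer outputs, which matches the long summaries produced with `length_penalty=2.0`.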

In [22]:
text = '수학에서 순환소수인 0.999…는 실수 1의 또 다른 십진법 소수 표현이다. 즉 "0.999…"와 "1"은 같은 수이다. 이러한 증명은 실수론의 전개, 배경이 있는 가정, 역사적 맥락, 대상이 되는 청자(듣는 사람) 등에 맞는 수준에 따른 것으로서 여러 단계의 수학적 엄밀함을 적절하게 고려한 다양한 정식화가 있다.'

# Reuse the code above.
input_ids = tokenizer.encode(text, return_tensors='pt')

summary_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,
    max_length=25,  # note: below min_length, so generation is cut off at 25 tokens before an end-of-sequence token is allowed
    min_length=56,
    num_beams=4,
)

tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'수학에서 순환소수인 0.999...는 실수 1의 또 다른 십진법 소수 표현'

In [14]:
text = '암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합물이다. 분자식은 NH3이다. 상온에서는 특유의 자극적인 냄새가 나는 무색의 기체 상태로 존재하고있다. 대기 중에도 소량의 양이 포함되어 있으며, 천연수에 미량 함유되어 있기도 하다. 토양 중에도 세균의 질소 유기물의 분해 과정에서 생겨난 암모니아가 존재할 수 있다. 대표적인 반자성체 중 하나이다.'

# Reuse the code above.
input_ids = tokenizer.encode(text, return_tensors='pt')

summary_ids = model.generate(
    input_ids=input_ids,
    bos_token_id=model.config.bos_token_id,
    eos_token_id=model.config.eos_token_id,
    length_penalty=2.0,
    max_length=142,
    min_length=56,
    num_beams=4,
)

tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'질소와 수소로 이루어진 화합물 암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합물로 특유의 자극적인 냄새가 나는 무색의 기체 상태로 토양 중에도 세균의 질소 유기물의 분해 과정에서 생겨난 암모니아가 존재할 수 있다.'