<a href="https://colab.research.google.com/github/Changho0514/web1/blob/main/220124_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers?

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017

![screensh](https://blog.kakaocdn.net/dn/blla7d/btqBPXAzWdA/1yMKSf4SYWRT9t0yDt2lM1/img.jpg)

# Loading models with the Hugging Face library

In [1]:
!pip install transformers[sentencepiece]
# The [sentencepiece] extra also installs the sentencepiece library, which some of the tokenizers bundled with transformers need.
!pip install datasets
# The Hugging Face datasets library.

Collecting transformers[sentencepiece]
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.96-cp37-

In [2]:
# Import the PyTorch libraries
import torch
import torch.nn as nn
import torchtext

Model architectures: BERT, GPT, ... => all based on the Transformer
- Encoder / Decoder / Encoder + Decoder
- How many layers, and what vector dimension
- The pre-training method (most important)
- Which data the model was fine-tuned on
- 26,663 models to choose from

In [3]:
# Among the tokenizers that turn sentences into tokens, we use the BERT tokenizer here.
# Decide in advance which task the model will be applied to.

# BERT
from transformers import BertTokenizer, BertForTokenClassification

# Token classification: one classification (0 or 1) per token.
# For sentence classification, use BertForSequenceClassification in place of BertForTokenClassification; each class accepts the inputs its task needs.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Why load a pre-trained tokenizer with from_pretrained?
# 1. More general tokens
# - The pre-training corpus is very large, so its vocabulary covers more general tokens.
# 2. It comes as a set with the model.
# - Word embedding (turning a token into a vector) looks up a row by token index:
#     tokenizer A: <pad> -> 0, it -> 32
#     tokenizer B: <pad> -> 0, it -> 25
#   so "it" would fetch the 32nd or the 25th embedding vector depending on the tokenizer.
# -> What if you want to add new words, i.e. more vectors than the existing vocabulary has?
#   - Keep the existing vectors and append the new ones after them.
model = BertForTokenClassification.from_pretrained('bert-base-uncased') # Load pretrained weights; the classification head on top is still newly initialized.

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt") # Return PyTorch tensors.
outputs = model(inputs)[0]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u
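
The comments in the cell above argue that a tokenizer and its model come as a set, and that new words are appended after the existing vectors. The toy sketch below illustrates both points in plain Python. The vocabulary, the 4-dimensional embeddings, and the helper names `embed` and `add_tokens` are all made up for illustration; none of this is the transformers API.

```python
# Toy illustration (not the transformers API): why a tokenizer and a model
# must share one vocabulary, and how new tokens are appended.
import random

random.seed(0)
EMBED_DIM = 4  # assumed tiny dimension for readability

# Hypothetical vocabulary the "model" was trained with.
vocab = {"<pad>": 0, "hello": 1, "it": 2, "dog": 3}
# One embedding row per vocabulary index.
embeddings = [[random.random() for _ in range(EMBED_DIM)] for _ in vocab]

def embed(token, vocab, embeddings):
    """Look up the embedding row for a token via its vocabulary index."""
    return embeddings[vocab[token]]

# A different tokenizer assigns other indices to the same tokens.
other_vocab = {"<pad>": 0, "it": 1, "hello": 2, "dog": 3}
mismatched = embeddings[other_vocab["it"]]   # row that was trained for "hello"
matched = embed("it", vocab, embeddings)     # row that was trained for "it"

def add_tokens(new_tokens, vocab, embeddings):
    """Append new tokens after the existing ones, leaving old rows untouched."""
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            embeddings.append([random.random() for _ in range(EMBED_DIM)])

old_row_for_it = list(embed("it", vocab, embeddings))
add_tokens(["huggingface"], vocab, embeddings)
```

In transformers itself, the corresponding steps are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.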

In [4]:
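# These keys exist when the tokenizer is called directly, tokenizer(...), rather than through tokenizer.encode.
# print(inputs.keys())
# inputs['input_ids'] # Token indices.
# inputs['token_type_ids'] # Distinguishes the first and second sentence: first -> 0, second -> 1, padding -> 0, e.g. 00000000111110000
# inputs['attention_mask'] # Marks the positions that hold real tokens, i.e. tells padding from non-padding.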
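
To make the three fields concrete, here is a minimal hand-built sketch for a sentence pair. The function `build_pair_features` and the token ids are hypothetical; a real tokenizer produces these fields itself.

```python
def build_pair_features(first_ids, second_ids, max_length, pad_id=0):
    """Return input_ids, token_type_ids and attention_mask for a sentence pair."""
    input_ids = first_ids + second_ids
    token_type_ids = [0] * len(first_ids) + [1] * len(second_ids)
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("pair is longer than max_length")
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,   # padding counts as segment 0
        "attention_mask": attention_mask + [0] * n_pad,   # 0 marks padding
    }

# Made-up token ids; special tokens are assumed to be part of each list already.
features = build_pair_features([101, 7592, 102], [2026, 3899, 102], max_length=8)
```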

In [5]:
# ELECTRA
# Trained in a GAN-like setup (it uses a discriminator).
# How do you decide which model to use?


from transformers import ElectraTokenizer, ElectraModel  # ElectraModel returns the last hidden state, i.e. the output after the final add & norm.
# logits: values before softmax
# likelihood (probability): values after softmax
# For token classification -> use a classification model

# BertModel + linear layer
# One classification result per token.

# BertModel + linear layer (the linear layer decides what form the output takes)
# One classification result per sentence.
# BertForSequenceClassification

tokenizer = ElectraTokenizer.from_pretrained('google/electra-large-discriminator') # google/electra-small-discriminator can be used as well.
model = ElectraModel.from_pretrained('google/electra-large-discriminator')

inputs = tokenizer.encode("The capital of France is [MASK].", return_tensors="pt")

outputs = model(inputs)

Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
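
The cell's comments distinguish logits (before softmax) from probabilities (after softmax). A small plain-Python sketch of that relationship follows; the three scores are made up.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]   # made-up scores for three classes
probs = softmax(logits)
# argmax is the same before and after softmax, which is why the later
# cells can call argmax directly on the logits.
predicted = max(range(len(logits)), key=lambda i: logits[i])
```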


In [6]:
outputs

BaseModelOutputWithPastAndCrossAttentions([('last_hidden_state',
                                            tensor([[[-0.1712,  0.0506,  0.6153,  ...,  0.3136,  0.0033, -0.1737],
                                                     [ 0.2136,  0.2114,  0.0852,  ...,  0.0938, -0.4353, -0.0061],
                                                     [-0.0450,  0.1075,  0.2639,  ..., -0.0869, -0.2249,  0.2051],
                                                     ...,
                                                     [-0.1608,  0.1830,  0.5355,  ...,  0.1005, -0.0964, -0.2321],
                                                     [ 0.2983,  0.0679,  0.1625,  ..., -0.0437, -0.1851,  0.4285],
                                                     [-0.1704,  0.1914,  0.5484,  ...,  0.1241, -0.0846, -0.2425]]],
                                                   grad_fn=<NativeLayerNormBackward0>))])

In [7]:
outputs.last_hidden_state

tensor([[[-0.1712,  0.0506,  0.6153,  ...,  0.3136,  0.0033, -0.1737],
         [ 0.2136,  0.2114,  0.0852,  ...,  0.0938, -0.4353, -0.0061],
         [-0.0450,  0.1075,  0.2639,  ..., -0.0869, -0.2249,  0.2051],
         ...,
         [-0.1608,  0.1830,  0.5355,  ...,  0.1005, -0.0964, -0.2321],
         [ 0.2983,  0.0679,  0.1625,  ..., -0.0437, -0.1851,  0.4285],
         [-0.1704,  0.1914,  0.5484,  ...,  0.1241, -0.0846, -0.2425]]],
       grad_fn=<NativeLayerNormBackward0>)

In [8]:
outputs.last_hidden_state.shape

torch.Size([1, 9, 1024])
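
The three dimensions can be read as (batch size, number of tokens, hidden size). The sketch below reconstructs the shape by hand. The token split is an assumption about how the tokenizer handles this sentence, not output taken from the notebook.

```python
# Assumed tokenization of "The capital of France is [MASK]."; the real split
# comes from the tokenizer and may differ.
tokens = ["[CLS]", "the", "capital", "of", "france", "is", "[MASK]", ".", "[SEP]"]
batch_size = 1       # one sentence was encoded
hidden_size = 1024   # last dimension reported in the output above
expected_shape = (batch_size, len(tokens), hidden_size)
```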

# Trying out AutoModel

In [9]:
from transformers import AutoModelForTokenClassification, AutoTokenizer  # Picks the matching architecture (BERT, GPT-2, ...) automatically.
# BertModel, ElectraModel, ... can all be loaded through AutoModel.

# auto: automatic
# Works only with pretrained models.
# BERT, GPT, ELECTRA, ALBERT, RoBERTa, ...

model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
outputs = model(inputs)[0]

Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['classifier.weight', 'h.7.attn.masked_bias', 'h.3.attn.masked_bias', 'h.2.attn.masked_bias', 'h.8.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.0.attn.masked_bias', 'classifier.bias', 'h.1.attn.masked_bias', 'h.9.attn.masked_bias', 'h.6.attn.masked_bias', 'h.11.attn.masked_bias', 'h.10.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# model # The matching class is chosen automatically (GPT2ForTokenClassification for 'gpt2').
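
Conceptually, an Auto class looks at the checkpoint's configuration and dispatches to the matching architecture. The sketch below shows only that idea. The `Toy...` classes, the lookup table, and `auto_model_for_token_classification` are hypothetical stand-ins and not how transformers is implemented.

```python
# Conceptual sketch only; every name below is a hypothetical stand-in.
class ToyBertForTokenClassification:
    pass

class ToyGPT2ForTokenClassification:
    pass

# Hypothetical mapping from a config's model_type to a task-specific class.
TOKEN_CLASSIFICATION_CLASSES = {
    "bert": ToyBertForTokenClassification,
    "gpt2": ToyGPT2ForTokenClassification,
}

def auto_model_for_token_classification(config):
    """Pick a model class from the model_type stored in a checkpoint's config."""
    model_type = config["model_type"]
    if model_type not in TOKEN_CLASSIFICATION_CLASSES:
        raise ValueError(f"unsupported model_type: {model_type}")
    return TOKEN_CLASSIFICATION_CLASSES[model_type]()

toy_model = auto_model_for_token_classification({"model_type": "gpt2"})
```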

# Trying out pre-trained models

In [11]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
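# Simply use a checkpoint that has already been fine-tuned.

# T5: uses the full encoder-decoder architecture.
# Q&A datasets
# SQuAD v1 -> the answer is a span located in the passage (extractive).
# SQuAD v2 -> adds unanswerable questions to v1. This T5 checkpoint answers by generating text, so of classification, regression, and generation models, a generation model is needed.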
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-squadv2")


The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.

In [12]:

def get_answer(question, context):
  input_text = "question: %s  context: %s" % (question, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'])
  
  return tokenizer.decode(output[0])



# Wikipedia text
context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
question = 'When did Beyonce start becoming popular?'

get_answer(question,context)

'<pad> late 1990s</s>'
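
The decoded string still carries the special tokens `<pad>` and `</s>`. Passing `skip_special_tokens=True` to `tokenizer.decode`, as the `summarize` function later in this notebook does, removes them. The plain-Python sketch below only mimics the effect on the string shown above; `strip_special_tokens` is a hypothetical helper.

```python
def strip_special_tokens(decoded, special_tokens=("<pad>", "</s>")):
    """Remove special-token markers from a decoded string (hypothetical helper)."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()

raw = "<pad> late 1990s</s>"   # the output shown above
clean = strip_special_tokens(raw)
```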

In [13]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # AutoModelWithLMHead is deprecated; T5 is a seq2seq model.

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")

In [14]:
def get_question(answer, context, max_length=64):
  input_text = "answer: %s  context: %s </s>" % (answer, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'],
               max_length=max_length)

  return tokenizer.decode(output[0])

context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
answer = 'American' #'1981' 

get_question(answer, context)

'<pad> question: What nationality is Beyonce?</s>'

# Processing data with the Hugging Face library

In [15]:
from datasets import load_dataset
datasets = load_dataset('imdb')

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



In [16]:
datasets.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [18]:
# Unsupervised vs. supervised:
# whether or not the data has labels.
# Self-supervised -> the model builds its own answer key and learns from it (it sets the problem and solves it itself), e.g. by masking tokens and restoring them.
#

# datasets['unsupervised'] = 
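
The masking idea can be sketched in a few lines: hide some tokens and keep the originals as the labels, so no human annotation is needed. This is a simplified illustration. The function `mask_tokens`, the 30% masking rate, and the example sentence are assumptions, and real pre-training recipes are more elaborate.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide some tokens and keep the originals as labels (self-supervised targets)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(token)   # the model must restore this token
        else:
            inputs.append(token)
            labels.append(None)    # position is not scored
    return inputs, labels

tokens = "the capital of france is paris".split()
masked_inputs, labels = mask_tokens(tokens, mask_prob=0.3)
```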

In [19]:
datasets['train'][0] # Labels are 0 or 1, i.e. negative / positive.

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are f

In [20]:
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data  # store the dataset split (e.g. the train split)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()






In [21]:
train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)
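
As a rough mental model of what the DataLoader does with the `(text, label)` items returned by `__getitem__`, the plain-Python sketch below groups items into batches without shuffling. The `batches` function and the toy reviews are made up; the real DataLoader additionally stacks labels into a tensor and can shuffle.

```python
def batches(dataset, batch_size):
    """Yield (texts, labels) batches from a sequence of (text, label) items."""
    for start in range(0, len(dataset), batch_size):
        items = dataset[start:start + batch_size]
        texts = tuple(text for text, _ in items)
        labels = [label for _, label in items]
        yield texts, labels

toy_dataset = [("great film", 1), ("boring", 0), ("loved it", 1),
               ("terrible", 0), ("fine", 1)]
all_batches = list(batches(toy_dataset, batch_size=4))
```

The last batch is smaller when the dataset size is not a multiple of `batch_size`.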

In [22]:
batch = next(iter(train_dataloader))

In [23]:
batch  # Each element is one full review paragraph.

[('I found the documentary entitled Fast, Cheap, and Out of Control to be a fairly interesting documentary. The documentary contained four "mini" documentaries about four interesting men. Each one of these men was extremely involved with his job, showing sheer love and enjoyment for one\'s job.<br /><br />The sad part, I must say, would have to be the subjects in which these individuals worked/studied. They were interesting for about five minutes, afterwards becoming boring and lasting entirely too long.<br /><br />The video was filmed in a very creative way though. I very much enjoyed the film of one thing with a voice dub over another. It played out excellent and also coincided nicely with the music.',
  "It's beyond my comprehension that so much rubbish from Norway has been remastered for DVD release, and still gems like this don't get a shot at recapturing their past glory. I give this a 7, not because it is very good, but because it is one of the few SciFi films made for Norwegian

In [24]:
features = tokenizer(list(batch[0]))

In [25]:
features.keys()

dict_keys(['input_ids', 'attention_mask'])

# Training a model with just a short piece of code

In [26]:
from datasets import load_dataset
datasets = load_dataset('imdb')

from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()


train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



In [27]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# cuda
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(DEVICE)

# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# criterion = nn.CrossEntropyLoss()  # Not needed: the loss is built into the model and is returned when labels are passed.

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [31]:
def train_epoch(model, dataloader, tokenizer, optimizer):
    model.train()
    train_loss = 0
    for i, (x,y) in enumerate(dataloader):
        features = tokenizer(list(x), padding='max_length', return_tensors='pt',max_length=512, truncation=True)
        x = features['input_ids'].to(DEVICE)
        
        # Shorter inputs are padded up to 512 tokens.
        # Longer inputs (e.g. 600 tokens) are cut to 512 because truncation=True.
        attention_mask = features['attention_mask'].to(DEVICE)
        y = y.to(DEVICE)
        loss = model(x, attention_mask=attention_mask, labels=y)['loss']
        
        # model update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        if i % 50 == 0:
            print('Iter [{}/{}] Loss {:.6f}'.format(i+1, len(dataloader), train_loss / (i+1)))
    
    return train_loss / len(dataloader)

def test_epoch(model, dataloader, tokenizer):
    model.eval()
    preds = []
    labels = []
    with torch.no_grad():
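      for x,y in dataloader:
          features = tokenizer(list(x), padding='max_length', return_tensors='pt',max_length=512, truncation=True)
          # Pass the attention mask too, so padded positions are ignored as in training.
          out = model(features['input_ids'].to(DEVICE),
                      attention_mask=features['attention_mask'].to(DEVICE))['logits']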
          pred = out.argmax(-1)
          preds.append(pred.cpu())
          labels.append(y)
    preds = torch.cat(preds)
    labels = torch.cat(labels)
    acc = (preds == labels).float().mean()
    print('ACC : {:.3f}'.format(acc))
    return preds, labels

def predict(model, tokenizer, sentence):
    model.eval()
    x = tokenizer.encode(sentence, return_tensors='pt', truncation=True).to(DEVICE)
    with torch.no_grad():  # inference only, so no gradients are needed
        out = model(x)['logits']
    pred = out.argmax(-1)
    return pred.cpu()
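
The `padding='max_length'`, `max_length=512`, and `truncation=True` arguments used above bring every sequence to the same length. The sketch below shows that logic on plain lists with a small `max_length`. The helper `pad_or_truncate` is hypothetical, and a real tokenizer also takes care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Fix a sequence to max_length; return (input_ids, attention_mask)."""
    ids = token_ids[:max_length]           # truncation: cut what is too long
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)          # padding: fill what is too short
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)
```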

In [32]:
EPOCHS=1

for i in range(EPOCHS):
    train_epoch(model, train_dataloader, tokenizer, optimizer)
    test_epoch(model, test_dataloader, tokenizer)

Iter [1/6250] Loss 0.725982
Iter [51/6250] Loss 0.700102
Iter [101/6250] Loss 0.717954
Iter [151/6250] Loss 0.714655
Iter [201/6250] Loss 0.714259
Iter [251/6250] Loss 0.705433
Iter [301/6250] Loss 0.708348
Iter [351/6250] Loss 0.708692
Iter [401/6250] Loss 0.708556
Iter [451/6250] Loss 0.709711
Iter [501/6250] Loss 0.709173
Iter [551/6250] Loss 0.708973
Iter [601/6250] Loss 0.708499
Iter [651/6250] Loss 0.708169
Iter [701/6250] Loss 0.706836
Iter [751/6250] Loss 0.705920
Iter [801/6250] Loss 0.706052
Iter [851/6250] Loss 0.706963
Iter [901/6250] Loss 0.706776
Iter [951/6250] Loss 0.708356
Iter [1001/6250] Loss 0.707605
Iter [1051/6250] Loss 0.707000
Iter [1101/6250] Loss 0.707282
Iter [1151/6250] Loss 0.707816


KeyboardInterrupt: ignored
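
The loss returned by the classification model is a cross-entropy loss, the same criterion that the commented-out `nn.CrossEntropyLoss()` line refers to. The sketch below computes it by hand for one example. For two classes, a model that gives both classes equal scores gets ln 2 ≈ 0.693, so a running loss that stays around 0.70 to 0.71, as in the log above, may indicate that the model had not yet learned much when the run was interrupted. The logits are made up.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[label]

chance_loss = cross_entropy([0.0, 0.0], label=1)      # equal scores for both classes
confident_loss = cross_entropy([-2.0, 3.0], label=1)  # strongly favors the correct class
```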

In [43]:
# 1. Load the data
from datasets import load_dataset

dataset_name = 'xsum'
datasets = load_dataset(dataset_name)

# 2. Pick a model, load its tokenizer and model, and create the optimizer (choose model and tokenizer + choose optimizer (+ criterion))
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
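DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')  # the same device the loop below uploads tensors to
model = model.to(DEVICE)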

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 3. custom dataset class 
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data  # store the dataset split (e.g. the train split)
        # fields: document, id, summary
        # document: input data
        # summary: output data (the answer key)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['document'] # text
        y = item['summary'] # text
        return x, y

train_dataset = CustomDataset(datasets['train'])
valid_dataset = CustomDataset(datasets['validation'])

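train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True) # groups items so they come out in units of batch_size
valid_dataloader = DataLoader(valid_dataset, batch_size=4, shuffle=False)  # validation
# Why shuffle is False for validation:
# shuffling randomizes the order (data 1, 2, 3, 4, ...), which helps training but does not change
# a loss averaged over the whole validation set, so a fixed order is kept.

# 4. Training code
next(iter(valid_dataloader)) # iter() builds an iterator as a for loop would; next() returns the first batch.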
def tokenizing(batch):
  document = batch[0]
  summary = batch[1]
  document_features = tokenizer(list(document), return_tensors='pt', padding = 'max_length', max_length=512, truncation=True)
  summary_features = tokenizer(list(summary), return_tensors='pt',padding = 'max_length', max_length=512, truncation=True) # truncation cuts sequences longer than 512 down to 512; it is the opposite of padding.

  return document_features, summary_features

for epoch in range(5):
  model.train()
  train_loss = 0
  for batch in train_dataloader:
    # tokenizing + tensor + gpu upload
    document_features, summary_features = tokenizing(batch)
    loss = model(document_features['input_ids'].to(DEVICE),
      attention_mask = document_features['attention_mask'].to(DEVICE),
      labels = summary_features['input_ids'].to(DEVICE))['loss']
    optimizer.zero_grad()
    loss.backward()  # compute gradients
    optimizer.step()

    train_loss += loss.item()  # tensor.item() converts a one-element tensor to a Python scalar
  print('train loss : {:.5f}'.format(train_loss/len(train_dataloader)))
    
  # validation
  model.eval()
  valid_loss = 0
  with torch.no_grad():
    # No autograd graph is kept, so evaluation is faster (these notes estimate roughly 4-5x) and uses less memory.
    for batch in valid_dataloader:
      # tokenizing + tensor + gpu upload
      document_features, summary_features = tokenizing(batch)
      loss = model(document_features['input_ids'].to(DEVICE),
        attention_mask = document_features['attention_mask'].to(DEVICE),
        labels = summary_features['input_ids'].to(DEVICE))['loss']
      # The training-only steps are dropped here:
      # optimizer.zero_grad()
      # loss.backward()  # compute gradients
      # optimizer.step()

      valid_loss += loss.item()  # tensor.item() converts a one-element tensor to a Python scalar
    print('valid loss : {:.5f}'.format(valid_loss/len(valid_dataloader)))
  

# 5. Run training






Using custom data configuration default
Reusing dataset xsum (/root/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)



KeyboardInterrupt: ignored
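
One detail worth considering in the loop above: the summaries are padded to 512 tokens and the padded ids are passed as `labels`, so padding positions also count toward the loss. A common remedy is to replace padding ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the replacement on plain lists. The helper name, the label ids, and the pad id of 0 are assumptions; in practice the pad id would be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # index skipped by PyTorch's cross-entropy loss by default

def mask_label_padding(label_ids, pad_id, ignore_index=IGNORE_INDEX):
    """Replace padding ids in label sequences so they do not count toward the loss."""
    return [[ignore_index if t == pad_id else t for t in row] for row in label_ids]

# Made-up label ids padded with 0 (assumed pad id).
padded_labels = [[71, 1525, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_label_padding(padded_labels, pad_id=0)
```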

In [None]:
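import torch
import torch.nn as nn
class FreezeModel(nn.Module):
  def __init__(self,):
    super().__init__()  # required before registering submodules
    from transformers import AutoModel
    model_name = "t5-small"
    # backbone that produces hidden state vectors
    self.t5 = AutoModel.from_pretrained(model_name)

    # freeze every backbone parameter
    for k, v in self.t5.named_parameters():
      v.requires_grad = False

    self.out_layer = nn.Linear(512, 30000) # 512-dim vector -> 30000 vocabulary entries

  def forward(self, x):
    with torch.no_grad():
      # No gradients for the frozen backbone. Assumption: only the encoder is used,
      # because the full T5 model would also need decoder inputs.
      x = self.t5.encoder(input_ids=x).last_hidden_state
    x = self.out_layer(x)
    return x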

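# Assignment

- Load a model that has been fine-tuned for text summarization and summarize the texts below.
- You pass the assignment once the output reads like normal text.
- The summary does not have to be perfect. Anything that is not complete gibberish passes!
    - Example of gibberish: 이 글은 이 이 글은, . , , pad 이 것 (roughly "this text this this text, . , , pad this thing"; the result of a model that was not trained properly)
    - Example of normal text: 이건 과제에 관한 글이다 ("This is a text about the assignment")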

In [2]:
from transformers import BartModel
from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
tokenizer = get_kobart_tokenizer()
model = BartModel.from_pretrained(get_pytorch_kobart_model())



In [10]:
from transformers import PreTrainedTokenizerFast
from tokenizers import SentencePieceBPETokenizer
from transformers import BartForConditionalGeneration
import torch

In [36]:
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v1')
model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-base-v1')

In [37]:
def summarize(text):
  # 1. Tokenize the text with the tokenizer (see the examples in the Hugging Face library).
  features = tokenizer([text], return_tensors='pt')
  input_ids = features['input_ids']
  # 2. Generate the summary with model.generate.
  summary_ids = model.generate(input_ids=input_ids, attention_mask=features['attention_mask'])
  # 3. Decode with the tokenizer to turn the ids back into readable text.
  return tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

In [38]:
text = "과거를 떠올려보자. 방송을 보던 우리의 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV를 봤다. 간혹 가족들끼리 뉴스와 드라마, 예능 프로그램을 둘러싸고 리모컨 쟁탈전이 벌어지기도  했다. 각자 선호하는 프로그램을 ‘본방’으로 보기 위한 싸움이었다. TV가 한 대인지 두 대인지 여부도 그래서 중요했다. 지금은 어떤가. ‘안방극장’이라는 말은 옛말이 됐다. TV가 없는 집도 많다. 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다. 각자의 방에서 각자의 휴대폰으로, 노트북으로, 태블릿으로 콘텐츠 를 즐긴다."
summarize(text)

'과거 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV였다. 온 가족이'

In [29]:
text = '수학에서 순환소수인 0.999…는 실수 1의 또 다른 십진법 소수 표현이다. 즉 "0.999…"와 "1"은 같은 수이다. 이러한 증명은 실수론의 전개, 배경이 있는 가정, 역사적 맥락, 대상이 되는 청자(듣는 사람) 등에 맞는 수준에 따른 것으로서 여러 단계의 수학적 엄밀함을 적절하게 고려한 다양한 정식화가 있다.'
summarize(text)

'즉 증 실수 실수 실수론의 전개, 배경이 있는 가정, 역사적,, 역사적 맥락,'

In [30]:
text = '암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합물이다. 분자식은 NH3이다. 상온에서는 특유의 자극적인 냄새가 나는 무색의 기체 상태로 존재하고있다. 대기 중에도 소량의 양이 포함되어 있으며, 천연수에 미량 함유되어 있기도 하다. 토양 중에도 세균의 질소 유기물의 분해 과정에서 생겨난 암모니아가 존재할 수 있다. 대표적인 반자성체 중 하나이다.'
summarize(text)

'암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합'
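
Two of the outputs above repeat themselves, which is the kind of gibberish the assignment warns about. The checkpoint name `gogamza/kobart-base-v1` suggests a base model rather than one fine-tuned for summarization, which may be part of the reason. The sketch below is a rough, hypothetical checker that flags a repeated word n-gram in a generated string. It is an aid for eyeballing results, not part of any library. On the generation side, `model.generate` accepts a `no_repeat_ngram_size` argument that blocks such repeats.

```python
def has_repeated_ngram(text, n=3):
    """Return True if any run of n whitespace-separated words occurs twice."""
    words = text.split()
    seen = set()
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

# Outputs copied from the cells above.
first_output = '과거 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV였다. 온 가족이'
third_output = '암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합'
first_is_repetitive = has_repeated_ngram(first_output)
third_is_repetitive = has_repeated_ngram(third_output)
```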