# Transformers?

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017

![screensh](https://blog.kakaocdn.net/dn/blla7d/btqBPXAzWdA/1yMKSf4SYWRT9t0yDt2lM1/img.jpg)

# huggingface library를 이용한 모델 불러오기

In [1]:
!pip install transformers[sentencepiece]
!pip install datasets

Collecting transformers[sentencepiece]
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
[K     |████████████████████████████████| 3.4 MB 7.3 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 38.1 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
[K     |████████████████████████████████| 67 kB 3.6 MB/s 
[?25hCollecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 52.5 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 40.7 MB/s 
[?25hCollecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.9

In [2]:
import torch
import torch.nn as nn
import torchtext

In [3]:
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
outputs = model(inputs)[0]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

In [19]:
from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained('google/electra-large-discriminator')
model = ElectraModel.from_pretrained('google/electra-large-discriminator')

inputs = tokenizer.encode("The capital of France is [MASK].", return_tensors="pt")

outputs = model(inputs)

Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [20]:
outputs.last_hidden_state

tensor([[[-0.1712,  0.0506,  0.6153,  ...,  0.3136,  0.0033, -0.1737],
         [ 0.2136,  0.2114,  0.0852,  ...,  0.0938, -0.4353, -0.0061],
         [-0.0450,  0.1075,  0.2639,  ..., -0.0869, -0.2249,  0.2051],
         ...,
         [-0.1608,  0.1831,  0.5355,  ...,  0.1005, -0.0964, -0.2321],
         [ 0.2983,  0.0679,  0.1625,  ..., -0.0437, -0.1851,  0.4285],
         [-0.1704,  0.1914,  0.5484,  ...,  0.1241, -0.0846, -0.2425]]],
       grad_fn=<NativeLayerNormBackward0>)

In [6]:
outputs.last_hidden_state.shape

torch.Size([1, 9, 1024])

# AutoModel 사용해보기

In [7]:
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')

inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")
outputs = model(inputs)[0]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

# 사전학습된 모델 사용해보기

In [28]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-squadv2")


The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


In [31]:

def get_answer(question, context):
  input_text = "question: %s  context: %s" % (question, context) # 질문 , 지문
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'])
  # 생성 모델의 경우 model.generate 가 이미 구현 되어 있음
  return tokenizer.decode(output[0])   # answer



# wikipeida text
context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
question = 'wtf?'

get_answer(question,context)

'<pad> add(n0,n1)|</s>'

In [None]:
model


In [32]:
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


In [35]:
def get_question(answer, context, max_length=64):
  input_text = "answer: %s  context: %s </s>" % (answer, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'],
               max_length=max_length)

  return tokenizer.decode(output[0])

context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy'
answer = 'giselle'

get_question(answer, context)

"<pad> question: What is Beyonce's full name?</s>"

# huggingfcae 라이브러리를 이용한 데이터 처리하기

In [36]:
from datasets import load_dataset
datasets = load_dataset('imdb')

Downloading:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Downloading:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [37]:
datasets.keys()

dict_keys(['train', 'test', 'unsupervised'])

In [40]:
datasets['unsupervised'][0]

{'label': -1,
 'text': 'This is just a precious little diamond. The play, the script are excellent. I cant compare this movie with anything else, maybe except the movie "Leon" wonderfully played by Jean Reno and Natalie Portman. But... What can I say about this one? This is the best movie Anne Parillaud has ever played in (See please "Frankie Starlight", she\'s speaking English there) to see what I mean. The story of young punk girl Nikita, taken into the depraved world of the secret government forces has been exceptionally over used by Americans. Never mind the "Point of no return" and especially the "La femme Nikita" TV series. They cannot compare the original believe me! Trash these videos. Buy this one, do not rent it, BUY it. BTW beware of the subtitles of the LA company which "translate" the US release. What a disgrace! If you cant understand French, get a dubbed version. But you\'ll regret later :)'}

In [39]:
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data
        # document : 입력 epdlxj
        #  summary : 출력 데이터 ( 정답지 )
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()






In [41]:
train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)

In [42]:
batch = next(iter(train_dataloader))

In [44]:
batch


[('(I\'ll indicate in this review the point where spoilers begin.) My dissatisfaction is split: 30% tone-deafness, 70% lackluster writing.<br /><br />The 30%: I agree with the first commenter\'s synopsis about the lack of diversity in the characters and scope of the stories. I was surprised how, this film, at best, woefully shortchanges the real NYC by presenting a collection of people and relationships so narrow as to come across as if it\'s inhabited only by the cast of Gossip Girl (this is coming from someone who likes Gossip Girl). A few minority characters are written into the stories, but they are included by obligation, while we can see the gears under the film so clearly, striving to "be diverse" but falling ever-so-short.<br /><br />The 70% is why everything falls short. All characters, white plus a few token minorities, are one-dimensional, cardboard cutouts of people concepts. Worse, their interactions with each other are scripted in such a way that for each vignette in the 

In [47]:
features = tokenizer(list(batch[0]))

# 앞문장, 뒷문장 구분 X -> bert 때는 있었는데 직므은 없는 이유 ?
#  - > T5 모델은 문장 두개를 구분하는 NSP를 안함 .

In [48]:
features.keys()

dict_keys(['input_ids', 'attention_mask'])

In [53]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")


# cuda
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(DEVICE)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

# 짧은 코드만으로 학습을 시켜봅시다.

In [54]:
from datasets import load_dataset
datasets = load_dataset('imdb')

from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['text']
        y = item['label']
        return x, torch.tensor(y).long()


train_dataset = CustomDataset(datasets['train'])
test_dataset = CustomDataset(datasets['test'])

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

In [55]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [60]:
def train_epoch(model, dataloader, tokenizer, optimizer):
    model.train()
    train_loss = 0
    for i, (x,y) in enumerate(dataloader):
        x = tokenizer(list(x), padding='max_length', truncation=True, return_tensors='pt',max_length=512)['input_ids'].to(DEVICE)
        y = y.to(DEVICE)
        loss = model(x, labels=y)['loss']
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        if i % 50 == 0:
            print('Iter [{}/{}] Loss {:.6f}'.format(i+1, len(dataloader), train_loss / (i+1)))
    
    return train_loss / len(dataloader)

def test_epoch(model, dataloader, tokenizer):
    model.eval()
    preds = []
    labels = []
    with torch.no_grad():
      for x,y in dataloader:
          x = tokenizer(list(x), padding='max_length', return_tensors='pt',max_length=512)['input_ids'].to(DEVICE)
          out = model(x)['logits']
          pred = out.argmax(-1)
          preds.append(pred.cpu())
          labels.append(y)
    preds = torch.cat(preds)
    labels = torch.cat(labels)
    acc = (preds == labels).float().mean()
    print('ACC : {:.3f}'.format(acc))
    return preds, labels

def predict(model, tokenizer, sentence):
    model.eval()
    x = tokenizer.encode(sentence, return_tensors='pt').to(DEVICE)
    out = model(x)['logits']
    pred = out.argmax(-1)
    return pred.cpu()

In [61]:
EPOCHS=1

for i in range(EPOCHS):
    train_epoch(model, train_dataloader, tokenizer, optimizer)
    test_epoch(model, test_dataloader, tokenizer)

Iter [1/6250] Loss 0.736974
Iter [51/6250] Loss 0.742749
Iter [101/6250] Loss 0.728999
Iter [151/6250] Loss 0.722682
Iter [201/6250] Loss 0.719684
Iter [251/6250] Loss 0.720169
Iter [301/6250] Loss 0.716960
Iter [351/6250] Loss 0.714148
Iter [401/6250] Loss 0.712896
Iter [451/6250] Loss 0.711548
Iter [501/6250] Loss 0.710801
Iter [551/6250] Loss 0.708464
Iter [601/6250] Loss 0.709407
Iter [651/6250] Loss 0.710074


KeyboardInterrupt: ignored

In [63]:
# summary 학습 코드를 짜보자 ㅠㅠ
# 1. data load
from datasets import load_dataset
datasets = load_dataset('xsum')

# 2. model 골라서 tokenizer, model 불러오기 , optimizer 만듥기
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = model.cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 3. custom dataset class
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data
        # document : 입력 epdlxj
        #  summary : 출력 데이터 ( 정답지 )
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        x = item['document']
        y = item['summary']
        return x, y

# 4. 학습 코드
train_dataset = CustomDataset(datasets['train'])
valid_dataset = CustomDataset(datasets['validation'])

train_dataloader = DataLoader(train_dataset, batch_size = 4, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size = 4, shuffle=False)

# 5. 학습 진행


Using custom data configuration default
Reusing dataset xsum (/root/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)


  0%|          | 0/3 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

In [83]:

batch = next(iter(valid_dataloader))

def tokenizing(batch):
    document = batch[0]
    summary = batch[1]
    document_features = tokenizer(list(document), return_tensors='pt',
                                  padding='max_length', max_length=512, truncation=True )
    summary_features = tokenizer(list(summary), return_tensors='pt', 
                                 padding='max_length', max_length=512, truncation=True )
    
    return document_features, summary_features   # gpu

for epoch in range(5):
    model.train()
    train_loss = 0
    for i, batch in enumerate(train_dataloader):
        #tokenizing + tensor + gpu
        document_features, summary_features = tokenizing(batch)
        loss = model(document_features['input_ids'].to(DEVICE),
            attention_mask = document_features['attention_mask'].to(DEVICE),
            labels = summary_features['input_ids'].to(DEVICE))['loss']
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item() # tensor.item()  [1]  - >   1
        
        if i % 100 == 0 :
            print('iter[{}] train loss [{:.6f}]'.format(i, 
                                                        len(train_dataloader),
                                                        train_loss / (i+1)))
    # 1epoch 끝나고 출력
    print('train loss : {:.5f}'.format(train_loss/len(train_dataloader)))

    # validation
    model.eval()
    valid_loss = 0
    with torch.no_grad():
        # gradient 그래프 유지 ㄴㄴ >> 속도 4-5배 빠름
        for batch in train_dataloader:
        #tokenizing + tensor + gpu
            document_features, summary_features = tokenizing(batch)
            model(document_features['input_ids'].to(DEVICE),
                attention_mask = document_features['attention_mask'].to(DEVICE),
                labels = summary_features['input_ids'].to(DEVICE))['loss']
            
        valid_loss += loss.item() # tensor.item()  [1]  - >   1
        
    print('train loss : {:.5f}'.format(valid_loss/len(valid_dataloader)))



iter[0] train loss [51012.000000]


KeyboardInterrupt: ignored

In [68]:
document_features

{'input_ids': [[37, 1215, 18, 19915, 53, 3, 13720, 11958, 24283, 3415, 3991, 3, 8321, 12, 8, 264, 26, 1924, 5716, 2941, 3, 18, 3, 9, 7813, 12, 3033, 540, 21, 7904, 29, 2600, 5, 1363, 264, 26, 1924, 6, 6862, 6, 19, 22801, 4977, 28, 17813, 10740, 262, 89, 15, 6, 8537, 6, 17557, 6, 943, 11, 7872, 6, 6426, 5, 2276, 2741, 53, 44, 8, 3525, 24042, 2283, 6, 66, 662, 11958, 8, 326, 1433, 5, 37, 1567, 3, 21679, 12, 10883, 2319, 84, 3, 18280, 808, 286, 344, 2628, 11, 5832, 3084, 6, 45, 14599, 6, 262, 89, 15, 11, 17557, 6, 13, 15993, 9145, 6, 11, 7872, 6, 45, 493, 226, 1306, 6, 33, 788, 12, 1518, 3689, 16, 1718, 5, 328, 130, 66, 1883, 30, 15794, 5, 1], [3152, 2897, 47, 5241, 12, 14848, 4781, 30, 3, 3840, 227, 5706, 8, 2871, 298, 3, 27759, 383, 8, 1334, 7666, 3314, 28, 27362, 30, 314, 1515, 5, 4551, 7, 994, 897, 12, 43, 8, 3746, 223, 21, 70, 332, 1755, 272, 5064, 467, 581, 17944, 44, 2809, 31, 7, 30, 220, 1660, 5, 37, 6862, 18, 1201, 18, 1490, 65, 5799, 3, 13427, 3154, 16, 662, 166, 18, 4057, 1031,

In [None]:
import torch
import torch.nn as nn

class FreezeModel(nn.Module):
    def __init__(self,):
        from transformers import AutoModel
        model_name = 't5-small'

        #hidden state vector

        self.t5 = AutoModel.from_pretrained(model_name)
        
        for param_name, parameter in self.t5.named_parameters():
            parameter.required_grad = False

        self.out_layer = nn.Linear(512, 30000) # 512 dim vector -> 30000 개 vocab

    def forward(self, x):
        with torch.no_grad():
            x = self.t5(x)
        x = self.

# 과제

- Text summary 에 fine-tuned 되어있는 모델을 불러와 아래의 글들을 요약해봅시다.
- 정상적으로 보이는 글이 완성되면 과제 통과입니다. 
- 완벽하게 요약하지 않아도 됩니다. 완전 이상한 글만 아니면 통과입니다!
    - 이상한 글 예시: 이 글은 이 이 글은, . , , pad 이 것  (학습이 제대로 안 된 결과)
    - 정상적인 글 예시: 이건 과제에 관한 글이다

In [84]:
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration
tokenizer = PreTrainedTokenizerFast.from_pretrained("ainize/kobart-news")
model = BartForConditionalGeneration.from_pretrained("ainize/kobart-news")

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/302 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/666k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

In [None]:


text = "과거를 떠올려보자. 방송을 보던 우리의 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV를 봤다. 간혹 가족들끼리 뉴스와 드라마, 예능 프로그램을 둘러싸고 리모컨 쟁탈전이 벌어지기도  했다. 각자 선호하는 프로그램을 ‘본방’으로 보기 위한 싸움이었다. TV가 한 대인지 두 대인지 여부도 그래서 중요했다. 지금은 어떤가. ‘안방극장’이라는 말은 옛말이 됐다. TV가 없는 집도 많다. 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다. 각자의 방에서 각자의 휴대폰으로, 노트북으로, 태블릿으로 콘텐츠 를 즐긴다."

raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

summary_ids = model.generate(torch.tensor([input_ids]))
tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

In [108]:
import torch
from transformers import PreTrainedTokenizerFast
from transformers import BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-summarization')
model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-summarization')

Downloading:   0%|          | 0.00/4.00 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/111 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/666k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BartTokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

In [110]:
text = "과거를 떠올려보자. 방송을 보던 우리의 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV를 봤다. 간혹 가족들끼리 뉴스와 드라마, 예능 프로그램을 둘러싸고 리모컨 쟁탈전이 벌어지기도  했다. 각자 선호하는 프로그램을 ‘본방’으로 보기 위한 싸움이었다. TV가 한 대인지 두 대인지 여부도 그래서 중요했다. 지금은 어떤가. ‘안방극장’이라는 말은 옛말이 됐다. TV가 없는 집도 많다. 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다. 각자의 방에서 각자의 휴대폰으로, 노트북으로, 태블릿으로 콘텐츠 를 즐긴다."

# 1. tokenizer를 이용해 토크나이즈를 진행합니다. (huggingface library에 있는 예제를 참고해보세요.)

raw_input_ids = tokenizer.encode(text)

input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

summary_ids = model.generate(torch.tensor([input_ids]))

tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'TV가 없는 집도 많아지고 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다.'

In [112]:
text = '수학에서 순환소수인 0.999…는 실수 1의 또 다른 십진법 소수 표현이다. 즉 "0.999…"와 "1"은 같은 수이다. 이러한 증명은 실수론의 전개, 배경이 있는 가정, 역사적 맥락, 대상이 되는 청자(듣는 사람) 등에 맞는 수준에 따른 것으로서 여러 단계의 수학적 엄밀함을 적절하게 고려한 다양한 정식화가 있다.'

raw_input_ids = tokenizer.encode(text)

input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

summary_ids = model.generate(torch.tensor([input_ids]))

tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'수학에서 순환소수인 0.999는 실수 1의 또 다른 십진법'

In [114]:
text = '암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합물이다. 분자식은 NH3이다. 상온에서는 특유의 자극적인 냄새가 나는 무색의 기체 상태로 존재하고있다. 대기 중에도 소량의 양이 포함되어 있으며, 천연수에 미량 함유되어 있기도 하다. 토양 중에도 세균의 질소 유기물의 분해 과정에서 생겨난 암모니아가 존재할 수 있다. 대표적인 반자성체 중 하나이다.'

raw_input_ids = tokenizer.encode(text)

input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

summary_ids = model.generate(torch.tensor([input_ids]))

tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'토양 중에도 질소와 수소로 이루어진 화합물인 암모니아가 존재할'

In [115]:
import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

WHITESPACE_HANDLER = lambda k: re.sub('\s+', ' ', re.sub('\n+', ' ', k.strip()))

article_text = "과거를 떠올려보자. 방송을 보던 우리의 모습을. 독보적인 매체는 TV였다. 온 가족이 둘러앉아 TV를 봤다. 간혹 가족들끼리 뉴스와 드라마, 예능 프로그램을 둘러싸고 리모컨 쟁탈전이 벌어지기도  했다. 각자 선호하는 프로그램을 ‘본방’으로 보기 위한 싸움이었다. TV가 한 대인지 두 대인지 여부도 그래서 중요했다. 지금은 어떤가. ‘안방극장’이라는 말은 옛말이 됐다. TV가 없는 집도 많다. 미디어의 혜 택을 누릴 수 있는 방법은 늘어났다. 각자의 방에서 각자의 휴대폰으로, 노트북으로, 태블릿으로 콘텐츠 를 즐긴다."

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)

Downloading:   0%|          | 0.00/375 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/730 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.11M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.17G [00:00<?, ?B/s]

‘안방극장’이라는 말은 옛말이 됐다.


In [118]:
article_text = '수학에서 순환소수인 0.999…는 실수 1의 또 다른 십진법 소수 표현이다. 즉 "0.999…"와 "1"은 같은 수이다. 이러한 증명은 실수론의 전개, 배경이 있는 가정, 역사적 맥락, 대상이 되는 청자(듣는 사람) 등에 맞는 수준에 따른 것으로서 여러 단계의 수학적 엄밀함을 적절하게 고려한 다양한 정식화가 있다.'

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=40,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)

십진법 수학에서 순환소수인 0.500...과 "0"이 같은 표현이 있다.


In [117]:
article_text ='암모니아(영어: ammonia)는 질소와 수소로 이루어진 화합물이다. 분자식은 NH3이다. 상온에서는 특유의 자극적인 냄새가 나는 무색의 기체 상태로 존재하고있다. 대기 중에도 소량의 양이 포함되어 있으며, 천연수에 미량 함유되어 있기도 하다. 토양 중에도 세균의 질소 유기물의 분해 과정에서 생겨난 암모니아가 존재할 수 있다. 대표적인 반자성체 중 하나이다.'


model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)

암모니아는 인간의 대표적인 반자성체다.
