<a href="https://colab.research.google.com/github/SeongwonTak/TIL_swtak/blob/master/learning_BERT_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.0-py3-none-any.whl (2.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.16-py3-none-any.whl (50 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3

In [4]:
# BERT is available through the transformers package.
import pandas as pd
from transformers import BertTokenizer

## BERT _ Tokenizer

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') #Bert-base tokenizer
result = tokenizer.tokenize('This function is monotonically increasing and analytic')
print(result)
# 'monotonically' is not in the vocabulary, so the tokenizer splits it into smaller subword pieces.

['this', 'function', 'is', 'mono', '##tonic', '##ally', 'increasing', 'and', 'analytic']
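
The split of `monotonically` into `mono`, `##tonic`, `##ally` is what a greedy longest-match-first WordPiece pass would produce. The following is a minimal sketch of that idea, not the library's implementation: the function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for illustration, and the real `BertTokenizer` also lowercases and splits on punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, just large enough to reproduce the split above.
toy_vocab = {"this", "function", "is", "mono", "##tonic", "##ally",
             "increasing", "and", "analytic"}

print(wordpiece_tokenize("monotonically", toy_vocab))
print(wordpiece_tokenize("function", toy_vocab))
```

With this toy vocabulary, a word that has no matching pieces at all (for example `holomorphic`) comes back as the single unknown token.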


In [11]:
# Looking up a word that is not in the vocabulary raises a KeyError.
tokenizer.vocab['holomorphic']

KeyError: ignored

In [12]:
tokenizer.vocab['continuous']

7142
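
Indexing `tokenizer.vocab` directly raises for unknown words, so a lookup with a fallback is often more convenient. The sketch below shows the pattern on a plain dictionary. The helper `token_to_id` is hypothetical, and the toy vocabulary is an assumption: only the id for `continuous` comes from the output above, and the `[UNK]` id of 100 is assumed to match `bert-base-uncased`.

```python
# Hypothetical slice of the vocabulary; only the id for 'continuous' is taken from the output above.
toy_vocab = {"[UNK]": 100, "continuous": 7142}


def token_to_id(token, vocab, unk_token="[UNK]"):
    """Return the id of token, falling back to the unknown-token id instead of raising."""
    return vocab.get(token, vocab[unk_token])


print(token_to_id("continuous", toy_vocab))   # 7142
print(token_to_id("holomorphic", toy_vocab))  # 100
```

On the real tokenizer, `tokenizer.convert_tokens_to_ids(token)` applies the same kind of fallback to the unknown-token id, and `token in tokenizer.vocab` checks membership without raising.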

## BERT _ Simple Example : Naver Movie Review Sentiment Classification
This exercise follows the tutorial at the link below.
https://zzaebok.github.io/deep_learning/nlp/Bert-for-classification/

In [14]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
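# pytorch_transformers is the older name of the transformers package and must be installed separately.
from pytorch_transformers import BertTokenizer, BertForSequenceClassification, BertConfig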
from torch.optim import Adam
import torch.nn.functional as F

In [15]:
# dataset_download
!git clone https://github.com/e9t/nsmc.git

Cloning into 'nsmc'...
remote: Enumerating objects: 14763, done.
remote: Total 14763 (delta 0), reused 0 (delta 0), pack-reused 14763
Receiving objects: 100% (14763/14763), 56.19 MiB | 18.64 MiB/s, done.
Resolving deltas: 100% (1749/1749), done.
Checking out files: 100% (14737/14737), done.


In [16]:
train_df = pd.read_csv('./nsmc/ratings_train.txt', sep='\t')
test_df = pd.read_csv('./nsmc/ratings_test.txt', sep='\t')

In [17]:
train_df.head(5)

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [18]:
train_df.isna().sum()

id          0
document    5
label       0
dtype: int64

In [19]:
test_df.isna().sum()

id          0
document    3
label       0
dtype: int64

Rows with a missing document cannot be classified, so drop them with dropna.

In [21]:
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)

In [22]:
print(len(train_df))
print(len(test_df))

149995
49997


In [23]:
# Use a smaller sample of the data (20%) to speed up training.
train_df = train_df.sample(frac = 0.2, random_state = 42)
test_df = test_df.sample(frac = 0.2, random_state=42)

In [24]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')

100%|██████████| 995526/995526 [00:00<00:00, 7818936.21B/s]
100%|██████████| 625/625 [00:00<00:00, 235571.53B/s]
100%|██████████| 714314041/714314041 [00:18<00:00, 37735488.48B/s]


In [28]:
# Build the Dataset that feeds the DataLoader
class Movie_Dataset(Dataset):
  def __init__(self, df):
    self.df = df

  def __len__(self):
    return len(self.df)

  def __getitem__(self, idx):
    text = self.df.iloc[idx, 1]
    label = self.df.iloc[idx, 2]
    return text, label

In [29]:
train_dataset = Movie_Dataset(train_df)
train_loader = DataLoader(train_dataset, batch_size = 4, shuffle = True, num_workers = 2)
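
To make the shape of each batch concrete, here is a rough pure-Python sketch of what a loader over a map-style dataset does: shuffle the indices, cut them into groups of `batch_size`, and transpose the `(text, label)` pairs. The function `iterate_batches` and the toy data are hypothetical. The real `DataLoader` additionally converts the labels to a tensor and can load samples in worker processes.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=True, seed=0):
    """Yield (texts, labels) batches from a map-style dataset, DataLoader-style."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        texts, labels = zip(*samples)  # transpose the list of (text, label) pairs
        yield texts, labels


# Hypothetical stand-in for Movie_Dataset: any object with __len__ and __getitem__ works.
toy_dataset = [("good movie", 1), ("boring", 0), ("loved it", 1),
               ("waste of time", 0), ("fine", 1)]

batches = list(iterate_batches(toy_dataset, batch_size=2))
for texts, labels in batches:
    print(texts, labels)
```

This is why the training loop receives `text` as a tuple of strings that still has to be tokenized, while `label` arrives already batched.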

In [30]:
optimizer = Adam(model.parameters(), lr = 1e-5)

In [None]:
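model.train()
epochs = 1

itr = 1        # global iteration counter
p_itr = 500    # print statistics every p_itr iterations
total_loss = 0
total_len = 0
total_correct = 0

for epoch in range(epochs):
  for text, label in train_loader:
    optimizer.zero_grad()

    # Encode each review, then pad with 0 (the [PAD] id) to a fixed length of 512
    encoded_list = [tokenizer.encode(t, add_special_tokens=True) for t in text]
    padded_list = [e + [0] * (512-len(e)) for e in encoded_list]

    sample = torch.tensor(padded_list)
    labels = label  # the DataLoader already collates the labels into a tensor
    outputs = model(sample, labels = labels)
    # Index the outputs so this works for both a plain tuple and a model-output object
    loss, logits = outputs[0], outputs[1]

    # argmax over the logits equals argmax over their softmax, so softmax is not needed
    pred = torch.argmax(logits, dim = 1)
    correct = pred.eq(labels)
    total_correct += correct.sum().item()  # accumulate correct predictions across batches
    total_len += len(labels)
    total_loss += loss.item()
    loss.backward()
    optimizer.step()

    if itr % p_itr == 0:
      print('[Epoch {}/{}] Iteration {} -> Train Loss: {:.4f}, Accuracy: {:.3f}'.format(epoch+1, epochs, itr, total_loss/p_itr, total_correct/total_len))
      # Reset the running statistics for the next reporting window
      total_loss = 0
      total_len = 0
      total_correct = 0
    itr+=1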
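
The loop above pads every review to 512 tokens and does not tell the model which positions are padding. One possible alternative, sketched here under stated assumptions, is to pad only to the longest sequence in the batch and build a matching attention mask. The helper `pad_batch` is hypothetical, the token ids are made up, and the truncation is naive (it would cut off the final `[SEP]` of an over-long sequence).

```python
def pad_batch(encoded_list, pad_id=0, max_len=512):
    """Truncate to max_len, pad to the longest sequence in the batch, and build masks."""
    truncated = [ids[:max_len] for ids in encoded_list]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Hypothetical token ids standing in for the output of tokenizer.encode.
input_ids, attention_mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(input_ids)
print(attention_mask)
```

Both lists can be turned into tensors with `torch.tensor(...)`, and the mask can be passed to the model through its `attention_mask` keyword argument so that padded positions are ignored by attention.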

