<a href="https://colab.research.google.com/github/Philocreation/My_Deep_learning/blob/main/Template/korean_word_sequence_classification_with_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Korean Word Sequence Classification with BERT

Copied from https://github.com/NLP-kr/tensorflow-ml-nlp-tf2/blob/master/7.PRETRAIN_METHOD/7.2.1.bert_finetune_NSMC.ipynb

https://colab.research.google.com/github/dhrim/MDC_2021/blob/master/material/deep_learning/korean_word_sequence_classification_with_bert.ipynb#scrollTo=5-pgao0HxEbn

# Installing the Required Libraries

In [1]:
!pip install transformers==3.0.2
!pip install sentencepiece

Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)

In [2]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

from tqdm import tqdm

from transformers import BertTokenizer
from transformers import TFBertModel

import tensorflow as tf 

In [3]:
# Fix the random seeds
tf.random.set_seed(1234)
np.random.seed(1234)

SEQ_LENGTH = 128
BERT_MODEL_NAME = 'bert-base-multilingual-cased'

# Data

## Downloading the Data

In [4]:
!wget https://github.com/dhrim/deep_learning_data/raw/master/movie_ratings.txt

--2022-01-18 06:20:52--  https://github.com/dhrim/deep_learning_data/raw/master/movie_ratings.txt
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dhrim/deep_learning_data/master/movie_ratings.txt [following]
--2022-01-18 06:20:52--  https://raw.githubusercontent.com/dhrim/deep_learning_data/master/movie_ratings.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19515078 (19M) [text/plain]
Saving to: ‘movie_ratings.txt’


2022-01-18 06:20:53 (307 MB/s) - ‘movie_ratings.txt’ saved [19515078/19515078]



In [5]:
df = pd.read_table("movie_ratings.txt")
df.head()

Unnamed: 0,id,document,label
0,8112052,어릴때보고 지금다시봐도 재밌어요ㅋㅋ,1
1,8132799,"디자인을 배우는 학생으로, 외국디자이너와 그들이 일군 전통을 통해 발전해가는 문화산...",1
2,4655635,폴리스스토리 시리즈는 1부터 뉴까지 버릴께 하나도 없음.. 최고.,1
3,9251303,와.. 연기가 진짜 개쩔구나.. 지루할거라고 생각했는데 몰입해서 봤다.. 그래 이런...,1
4,10067386,안개 자욱한 밤하늘에 떠 있는 초승달 같은 영화.,1


## Shuffling the Data

In [6]:
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


## Preparing the Input and Output Values

In [7]:
# np.str and np.int are deprecated aliases (removed in newer NumPy); use the builtins instead
reviews = df.document.values.copy().astype(str)
labels = df.label.values.copy().astype(int)

In [8]:
print(reviews.shape)
print(labels.shape)

(200000,)
(200000,)


In [9]:
reviews = reviews[:10000]
labels = labels[:10000]

## Creating the Tokenizer

In [10]:
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME, do_lower_case=False, model_max_length=SEQ_LENGTH)


In [11]:
encoded_tokens = tokenizer.encode("토크나이징이 잘 될까요?")
print(encoded_tokens)
print(tokenizer.decode(encoded_tokens))

[101, 9873, 20308, 16439, 10739, 119233, 10739, 9654, 9100, 118671, 48549, 136, 102]
[CLS] 토크나이징이 잘 될까요? [SEP]


In [12]:
tokenized  = tokenizer("토크나이징이 잘 될까요?", max_length=20, padding='max_length')
print(tokenized.keys())
print(tokenized['input_ids'])
print(tokenized['attention_mask'])
print(tokenized['token_type_ids'])

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
[101, 9873, 20308, 16439, 10739, 119233, 10739, 9654, 9100, 118671, 48549, 136, 102, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
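The relationship between `input_ids` and `attention_mask` in the output above can be sketched in plain Python. This is only an illustration of right-padding, not the tokenizer's actual implementation; `pad_with_mask` is a hypothetical helper, and the pad id of 0 is taken from the padded output shown above.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Right-pad token_ids to max_length and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Token ids copied from the tokenizer.encode output above (13 ids, [CLS] ... [SEP])
ids = [101, 9873, 20308, 16439, 10739, 119233, 10739, 9654, 9100, 118671, 48549, 136, 102]
input_ids, attention_mask = pad_with_mask(ids, max_length=20)
print(input_ids)
print(attention_mask)
```

Since only a single sentence is passed in, `token_type_ids` stays all zeros, as in the output above.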


## Building x and y

In [13]:
def build_model_input(reviews):
  input_ids = []
  attention_masks = []
  token_type_ids = []

  for review in reviews:
    # truncation=True lets the tokenizer cut long reviews to SEQ_LENGTH while keeping the final [SEP] token
    tokenized = tokenizer(review, max_length=SEQ_LENGTH, padding='max_length', truncation=True)
    # tokenized = {'input_ids': [101, ...], 'token_type_ids': [0, ...], 'attention_mask': [1, ...]}
    input_ids.append(tokenized['input_ids'])
    attention_masks.append(tokenized['attention_mask'])
    token_type_ids.append(tokenized['token_type_ids'])

  return (np.array(input_ids), np.array(attention_masks), np.array(token_type_ids))
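A review that tokenizes to more than `SEQ_LENGTH` ids has to be shortened. Cutting the id list with a plain slice also drops the trailing `[SEP]` id (102 in the tokenizer output above), whereas shortening the sequence before the last position keeps it. The sketch below illustrates the difference with made-up ids; `truncate_keep_sep` is a hypothetical helper written for this illustration and is not part of `transformers`.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] seen in the tokenizer output above


def truncate_keep_sep(token_ids, max_length):
    """Shorten token_ids to max_length while keeping the final token ([SEP])."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    # Keep the first max_length - 1 ids, then re-attach the last id
    return list(token_ids[:max_length - 1]) + [token_ids[-1]]


# A made-up sequence of 12 ids: [CLS], ten placeholder ids, [SEP]
long_ids = [CLS_ID] + list(range(1000, 1010)) + [SEP_ID]

naive = long_ids[:8]                      # plain slice: the [SEP] id is lost
kept = truncate_keep_sep(long_ids, 8)     # [SEP] stays in the last position
print(naive)
print(kept)
```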

In [14]:
x = build_model_input(reviews)
y = labels

In [15]:
print(x[0].shape)

(10000, 128)


## Splitting into Train and Test Sets

In [16]:
def split_bert_data(x, y, test_ratio):
  split_index = int(len(y)*(1-test_ratio))
  train_x = (x[0][:split_index], x[1][:split_index], x[2][:split_index])
  test_x = (x[0][split_index:], x[1][split_index:], x[2][split_index:])
  train_y, test_y = y[:split_index], y[split_index:]

  return (train_x, train_y), (test_x, test_y)

(train_x, train_y), (test_x, test_y) = split_bert_data(x, y, test_ratio=0.2)
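As a quick check of the split arithmetic, the sketch below reproduces only the index computation from `split_bert_data` for the 10,000 reviews kept earlier. `split_sizes` is a hypothetical helper used for illustration. Because the DataFrame was shuffled beforehand, taking the head and tail of the arrays amounts to a random split.

```python
def split_sizes(n_samples, test_ratio):
    """Return (train size, test size) using the same index rule as split_bert_data."""
    split_index = int(n_samples * (1 - test_ratio))
    return split_index, n_samples - split_index


n_train, n_test = split_sizes(10000, 0.2)
print(n_train, n_test)  # 8000 2000
```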

# Training

## Building the Model

In [17]:
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.layers import Dense, Dropout

class TFBertClassifier(tf.keras.Model):
  def __init__(self):
    super(TFBertClassifier, self).__init__()

    self.bert = TFBertModel.from_pretrained(BERT_MODEL_NAME)
    self.dropout = Dropout(self.bert.config.hidden_dropout_prob)
    self.classifier = Dense(2, kernel_initializer=TruncatedNormal(self.bert.config.initializer_range), 
                            name="classifier", activation="softmax")

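  # Default to inference mode; fit() passes training=True itself, so dropout is still active during training
  def call(self, inputs, attention_mask=None, token_type_ids=None, training=False):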

    outputs = self.bert(inputs, attention_mask=attention_mask, token_type_ids=token_type_ids)
    # outputs: sequence_output, pooled_output, (hidden_states), (attentions)
    pooled_output = outputs[1] 
    v = self.dropout(pooled_output, training=training)
    out = self.classifier(v)

    return out

model = TFBertClassifier()



Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.


For reference, the default BERT configuration is as follows.

In [18]:
print(model.bert.config)

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 119547
}
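The sizes in this configuration are enough to estimate the number of parameters in the encoder. The sketch below assumes the standard BERT layout: word, position, and token-type embeddings followed by a LayerNorm; per layer, four attention projections, two feed-forward projections, and two LayerNorms; and a single dense pooler. `bert_param_count` is a hypothetical helper, not a `transformers` function.

```python
def bert_param_count(vocab_size, hidden_size, intermediate_size,
                     num_layers, max_positions, type_vocab_size):
    """Count weights and biases of a BERT encoder under the assumed standard layout."""
    def dense(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias vector

    layer_norm = 2 * hidden_size  # scale and shift vectors

    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm
    attention = 4 * dense(hidden_size, hidden_size) + layer_norm  # query, key, value, output
    feed_forward = (dense(hidden_size, intermediate_size)
                    + dense(intermediate_size, hidden_size)
                    + layer_norm)
    pooler = dense(hidden_size, hidden_size)

    return embeddings + num_layers * (attention + feed_forward) + pooler


# Values copied from the BertConfig printed above
total = bert_param_count(vocab_size=119547, hidden_size=768, intermediate_size=3072,
                         num_layers=12, max_positions=512, type_vocab_size=2)
print(total)  # 177853440
```

This agrees with the `Param #` that `model.summary()` reports for the `TFBertModel` layer later in the notebook. More than half of the total (119547 × 768 = 91,812,096) sits in the multilingual word-embedding table.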



In [19]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

optimizer = Adam(3e-5)
loss = SparseCategoricalCrossentropy()
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
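`SparseCategoricalCrossentropy` takes integer labels (0 or 1 here) rather than one-hot vectors, and since the classifier already applies softmax, the loss is computed from probabilities. A minimal sketch of the per-sample quantity, the negative log of the probability assigned to the true class, is shown below. The probabilities are made up for illustration, and Keras additionally clips probabilities for numerical stability, which this sketch leaves out.

```python
import math


def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability of the true class for one sample."""
    return -math.log(probs[label])


# Made-up softmax outputs for three samples and their integer labels
batch_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
batch_labels = [0, 1, 1]

losses = [sparse_categorical_crossentropy(p, y) for p, y in zip(batch_probs, batch_labels)]
mean_loss = sum(losses) / len(losses)  # the loss reported per batch is the mean over samples
print(losses)
print(mean_loss)
```

The third sample, where the true class only receives probability 0.4, contributes the largest loss.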

## Running Training

In [20]:
history = model.fit(train_x, train_y, epochs=5, batch_size=32, validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
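With `validation_split=0.1`, Keras holds out the last 10% of the arrays passed to `fit` for validation and trains on the rest. The sketch below works out the resulting sizes for the 8,000 training samples produced by the split above; `fit_sizes` is a hypothetical helper that only mirrors the arithmetic.

```python
import math


def fit_sizes(n_samples, validation_split, batch_size):
    """Return (training samples, validation samples, steps per epoch)."""
    n_train = math.floor(n_samples * (1 - validation_split))
    n_val = n_samples - n_train
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch


sizes = fit_sizes(8000, validation_split=0.1, batch_size=32)
print(sizes)  # (7200, 800, 225)
```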


In [21]:
loss, acc = model.evaluate(test_x, test_y, batch_size=32)
print("loss =", loss)
print("acc =", acc)

loss = 0.5931382775306702
acc = 0.809499979019165


## Running Classification

In [22]:
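def do_classify(test_text):
  model_input = build_model_input([test_text])
  y_ = model.predict(model_input)
  # y_[0][1] is the probability of class 1; "긍정" means positive, "부정" means negative
  predicted = "긍정" if y_[0][1]>0.5 else "부정"

  print(test_text, "-->", predicted, ",score :",y_[0][1])

# "A movie that leaves a lasting impression"
do_classify("여운이 많이 남는 영화")
# ... plus "The plot is predictable."
do_classify("여운이 많이 남는 영화. 스토리 전개는 뻔함.")
# ... plus "Good for killing time"
do_classify("여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용")
# ... plus "not recommended."
do_classify("여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용, 비추.")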

여운이 많이 남는 영화 --> 긍정 ,score : 0.9960723
여운이 많이 남는 영화. 스토리 전개는 뻔함. --> 긍정 ,score : 0.9838864
여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용 --> 긍정 ,score : 0.5103531
여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용, 비추. --> 긍정 ,score : 0.7719643
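Because the classifier outputs a two-class softmax, the two probabilities sum to 1, so checking whether the class-1 probability exceeds 0.5 gives the same label as taking the argmax. A minimal sketch with made-up logits (not values from the trained model) follows; all three function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_by_threshold(probs):
    return 1 if probs[1] > 0.5 else 0


def label_by_argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up logits for three inputs
results = []
for logits in [[2.0, -1.0], [-0.3, 0.4], [0.1, 3.0]]:
    probs = softmax(logits)
    results.append((label_by_threshold(probs), label_by_argmax(probs)))
print(results)
```

A score near 0.5, such as the third example above, means the model is close to undecided, so the printed label alone can hide how uncertain the prediction is.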


## An Alternative Form of the Model Code

In [23]:
import tensorflow as tf
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.layers import Input, Dense, Dropout

input_ids = Input(shape=(SEQ_LENGTH,), dtype=tf.int32, name="input_ids")
attention_mask = Input(shape=(SEQ_LENGTH,), dtype=tf.int32, name="attention_mask")
token_type_ids = Input(shape=(SEQ_LENGTH,), dtype=tf.int32, name="token_type_ids")

bert_layer = TFBertModel.from_pretrained(BERT_MODEL_NAME)

outputs = bert_layer(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
pooled_output = outputs[1] 
v = Dropout(0.1)(pooled_output)
output = Dense(2, activation="softmax", kernel_initializer=TruncatedNormal(0.02), name="classifier")(v)

model = tf.keras.Model([input_ids, attention_mask, token_type_ids], output)


optimizer = Adam(3e-5)
loss = SparseCategoricalCrossentropy()
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

model.summary()

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 128)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 128)]        0           []                               
                                                                                                  
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  ((None, 128, 768),   177853440   ['input_ids[0][0]',              
                                 (None, 768))                     'attention_mask[0][0]',     

In [24]:
history = model.fit(train_x, train_y, epochs=5, batch_size=32, validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [25]:
def do_classify(test_text):
  model_input = build_model_input([test_text])
  y_ = model.predict(model_input)
  # y_[0][1] is the probability of class 1; "긍정" means positive, "부정" means negative
  predicted = "긍정" if y_[0][1]>0.5 else "부정"

  print(test_text, "-->", predicted, ",score :",y_[0][1])

# Same four test sentences as in the earlier classification cell
do_classify("여운이 많이 남는 영화")
do_classify("여운이 많이 남는 영화. 스토리 전개는 뻔함.")
do_classify("여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용")
do_classify("여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용, 비추.")

여운이 많이 남는 영화 --> 긍정 ,score : 0.99804
여운이 많이 남는 영화. 스토리 전개는 뻔함. --> 긍정 ,score : 0.99697363
여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용 --> 부정 ,score : 0.21323574
여운이 많이 남는 영화. 스토리 전개는 뻔함. 시간 때우기 용, 비추. --> 부정 ,score : 0.0059361993
