# About this kernel

I've seen many people pool the output of BERT and then add some Dense layers. I've also seen various learning rates used for fine-tuning. In this kernel, I wanted to try some ideas from the original paper that do not appear in many public kernels. Here are some examples:
* *No pooling, directly use the CLS embedding*. The original paper uses the output embedding of the `[CLS]` token when fine-tuning for classification tasks such as sentiment analysis. Since the `[CLS]` token is the first token in our sequence, we simply take the first slice along the 2nd dimension of our tensor of shape `(batch_size, max_len, hidden_dim)`, which results in a tensor of shape `(batch_size, hidden_dim)`.
* *No Dense layer*. Simply add a sigmoid output directly to the last layer of BERT, rather than experimenting with different intermediate layers.
* *Fixed learning rate, batch size, epochs, optimizer*. As specified by the paper, the optimizer used is Adam, with a learning rate between 2e-5 and 5e-5. Furthermore, they train the model for 3 epochs with a batch size of 32. I wanted to see how well it would perform with those default values.
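
To make the first two bullets concrete, here is a minimal NumPy sketch of the `[CLS]` slice followed by a single sigmoid unit. The random tensor stands in for BERT's sequence output, and the sizes, `weights`, and `bias` are illustrative assumptions, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (BERT-Large uses hidden_dim=1024).
batch_size, max_len, hidden_dim = 4, 90, 8

# Stand-in for BERT's sequence output: one vector per token position.
sequence_output = rng.normal(size=(batch_size, max_len, hidden_dim))

# [CLS] is always at position 0, so take index 0 along the 2nd dimension.
cls_embedding = sequence_output[:, 0, :]
print(cls_embedding.shape)  # (4, 8)

# A single sigmoid unit on top of the [CLS] vector (hypothetical weights).
weights = rng.normal(size=(hidden_dim, 1))
bias = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(cls_embedding @ weights + bias)))
print(probs.shape)  # (4, 1)
```

Each row of `probs` is one probability per example, which is what the `Dense(1, activation='sigmoid')` layer in `build_model` produces from the `[CLS]` vector.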

I also wanted to share this kernel as a **concise, reusable, and functional example of how to build a workflow around the TF2 version of BERT**. Indeed, it takes fewer than **50 lines of code to write a string-to-tokens preprocessing function and model builder**.

## References

* Source for `bert_encode` function: https://www.kaggle.com/user123454321/bert-starter-inference
* All pre-trained BERT models from TensorFlow Hub: https://tfhub.dev/s?q=bert

In [0]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [0]:
!pip install sentencepiece



In [0]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

import tokenization

# Helper Functions

In [0]:
def bert_encode(texts, tokenizer, max_len=90):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
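
To see what `bert_encode` produces, the sketch below repeats its padding and masking steps for a single sentence. `ToyTokenizer` and its six-entry vocabulary are hypothetical stand-ins for `tokenization.FullTokenizer`, so the example runs without downloading a vocab file; `encode_one` is likewise an illustrative name.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer; not the real WordPiece tokenizer."""

    def __init__(self):
        # Made-up vocabulary; id 0 is used for padding, as in bert_encode.
        self.vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
                      "hello": 4, "world": 5}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(tok, self.vocab["[UNK]"]) for tok in tokens]


def encode_one(text, tokenizer, max_len=8):
    """Same steps as one loop iteration of bert_encode."""
    pieces = tokenizer.tokenize(text)[:max_len - 2]  # leave room for [CLS]/[SEP]
    input_sequence = ["[CLS]"] + pieces + ["[SEP]"]
    pad_len = max_len - len(input_sequence)

    tokens = tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
    pad_masks = [1] * len(input_sequence) + [0] * pad_len  # 1 = real token
    segment_ids = [0] * max_len  # single-sentence input: all segment 0
    return tokens, pad_masks, segment_ids


tokens, pad_masks, segment_ids = encode_one("Hello world again", ToyTokenizer())
print(tokens)       # [2, 4, 5, 1, 3, 0, 0, 0]
print(pad_masks)    # [1, 1, 1, 1, 1, 0, 0, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The word missing from the toy vocabulary maps to the `[UNK]` id, and every row comes out with exactly `max_len` entries, which is why the three lists can be stacked into arrays.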

In [0]:
def build_model(bert_layer, max_len=90):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
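    # `learning_rate` replaces the deprecated `lr` argument.
    # Note: 1e-5 is below the 2e-5 to 5e-5 range quoted from the paper above.
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])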
    
    return model

# Load and Preprocess

- Load BERT from TensorFlow Hub
- Load CSV files containing training data
- Load tokenizer from the bert layer
- Encode the text into tokens, masks, and segment flags

In [0]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

CPU times: user 27.3 s, sys: 5.2 s, total: 32.5 s
Wall time: 40.6 s


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
!ls '/content/drive/My Drive/boaz_study/miniproj'

'한겨레_1차 전처리.csv'       '미니프로젝트 BERT Model ver.1.ipynb'
'동아일보_1차 전처리.csv'     '중앙일보 미니프로젝트 데이터(전처리 ver_1).csv'
'1차 전처리 데이터 종합.csv'  '경향신문 미니프로젝트 데이터(전처리 ver_1).csv'


In [0]:
data = pd.read_csv("/content/drive/My Drive/boaz_study/miniproj/1차 전처리 데이터 종합.csv")

In [0]:
data.isnull().sum()

기사 제목      0
기사 내용    186
label      0
dtype: int64

In [0]:
data.dropna(axis = 0, inplace = True)

In [0]:
data.reset_index(inplace = True)

In [0]:
data_0 = data.loc[data['label'] == 0]
data_1 = data.loc[data['label'] == 1]

In [0]:
len(data_0)

44618

In [0]:
len(data_1)

59831

In [0]:
data_0.reset_index(inplace = True)
data_1.reset_index(inplace = True)

In [0]:
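# Cap each class at its first rows to balance the two labels.
# .loc slicing includes the end label, so 0:30000 keeps 30,001 rows.
data_0 = data_0.loc[0:30000]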

In [0]:
data_1 = data_1.loc[0:30000]
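
One detail worth checking when capping each class: label-based slicing with `.loc` includes the end label, so `.loc[0:30000]` keeps 30,001 rows on a freshly reset index, while `.iloc[:30000]` would keep exactly 30,000. A small sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame with a default RangeIndex, like data_0 after reset_index.
df = pd.DataFrame({"label": [0] * 10})

by_label = df.loc[0:3]      # label-based: end label 3 is included
by_position = df.iloc[:3]   # position-based: end position 3 is excluded

print(len(by_label))     # 4
print(len(by_position))  # 3
```

Either form balances the classes here, since both are cut the same way; the difference only matters if an exact row count is needed.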

In [0]:
data = pd.concat([data_0, data_1], axis = 0)

In [0]:
data.drop(["level_0", "index"], inplace = True, axis = 1)

In [0]:
data.reset_index(inplace = True)

In [0]:
data.head()

Unnamed: 0,index,기사 제목,기사 내용,label
0,0,하태경 임을 위한 행진곡 은 민주주의 한류 보수가 앞장서서 수출해야,하태경 미래통합당 의원이 18일 임을 위한 행진곡 은 자랑스러운 민주주의 한류로...,0
1,1,단독 여야 과거사법 배상 조항 빼기로 합의 20일 마무리 본회의서 민생...,여야가 20대 국회 마지막 본회의를 오는 20일에 열고 코로나19 대응 관련 법안과...,0
2,2,정총리 5 18의 실체적 진실 역사의 심판대 위에 올려야,정세균 국무총리는 18일 아직 숨겨진 5 18민주화운동의 실체적 진실을 역사의 심...,0
3,3,정세균 총리 민주유공자 유족 가슴 아프게 하는 왜곡 폄훼 없어야,정세균 국무총리가 소설가 한강의 작품 소년이 온다 를 인용하면서 5 18 민주유공...,0
4,4,광주 간 잠룡들,김부겸 보수가 좋아 찍었다고 하는 게 나아 지역감정 비판유승민 보수 5 18 ...,0


In [0]:
data["기사 제목"] = data["기사 제목"].astype("string")

In [0]:
# Longest title in characters (bert_encode's max_len counts tokens, not characters)
print(data["기사 제목"].str.len().max())

86


In [0]:
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train = data.drop("label", axis = 1)
X_target = data.drop(["기사 제목", "기사 내용", "index"], axis = 1)
train, test, train_labels, test_labels = train_test_split(X_train, X_target, test_size = 0.3, random_state = 2000)

In [0]:
train = pd.DataFrame(data = train, columns=X_train.columns)
test = pd.DataFrame(data = test, columns=X_train.columns)
train_labels = pd.DataFrame(data = train_labels, columns=X_target.columns)
test_labels = pd.DataFrame(data = test_labels, columns=X_target.columns)

In [0]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [0]:
train_input = bert_encode(train["기사 제목"].values, tokenizer, max_len=90)
test_input = bert_encode(test["기사 제목"].values, tokenizer, max_len=90)
train_labels = train_labels.label.values

# Model: Build, Train, Predict, Evaluate

In [0]:
model = build_model(bert_layer, max_len=90)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 90)]         0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 90)]         0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 90)]         0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [0]:
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=3,
    callbacks=[checkpoint],
    batch_size=16
)
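
Per the Keras documentation, `validation_split` holds out the last fraction of the samples, taken before any shuffling, so here the final 20% of `train_input` supplies `val_loss` for the checkpoint. The sketch below mimics that slicing on a made-up list; `split_at` is an illustrative name, not a Keras API.

```python
import math

samples = list(range(10))  # made-up stand-in for the training rows
validation_split = 0.2

# Index where the held-out tail begins.
split_at = math.floor(len(samples) * (1.0 - validation_split))

train_part = samples[:split_at]
val_part = samples[split_at:]

print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```

Because `train_test_split` already shuffled the rows, taking the tail as validation data should still give a mix of both labels.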

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [0]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)
test_pred_BERT_int = test_pred.round().astype('int')

In [0]:
test_pred_int = pd.DataFrame(test_pred_BERT_int, columns=X_target.columns)

In [0]:
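# Compare by position: test_labels keeps its shuffled index, test_pred_int is 0..n-1.
accuracy = test_labels.reset_index(drop=True).eq(test_pred_int).mean()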

In [0]:
accuracy*100

label    80.851064
dtype: float64
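
The accuracy above depends on labels and predictions being compared row by row. `test_labels` keeps the shuffled index from `train_test_split`, while `test_pred_int` has a fresh 0..n-1 index, and depending on the pandas version an element-wise comparison may align on index labels instead of positions. The sketch below uses made-up values and compares the underlying arrays, which is positional regardless of version.

```python
import numpy as np
import pandas as pd

# Made-up labels with a shuffled index, as train_test_split would leave them.
labels = pd.DataFrame({"label": [1, 1, 0, 1]}, index=[52830, 36248, 8719, 43285])

# Made-up predictions with a default 0..n-1 index.
preds = pd.DataFrame({"label": [1, 0, 0, 1]})

# Compare the raw NumPy arrays so the index labels play no role.
accuracy = np.mean(labels["label"].to_numpy() == preds["label"].to_numpy())
print(accuracy * 100)  # 75.0
```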

In [0]:
test_labels.head(10)

Unnamed: 0,label
52830,1
36248,1
8719,0
43285,1
52574,1
52541,1
10067,0
57669,1
51311,1
898,0


In [0]:
test_pred_int.head(10)

Unnamed: 0,label
0,1
1,1
2,0
3,1
4,0
5,0
6,0
7,1
8,1
9,0


In [0]:
test.loc[52574][["기사 제목"]]

기사 제목    사전투표율 5시 기준 10% 넘어…역대 최고 기록하나
Name: 52574, dtype: object

In [0]:
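import torch
import time

# Keep-alive loop: touch the GPU every 30 seconds so the Colab session stays busy.
# This runs forever; interrupt the cell to stop it.
a = torch.tensor(1).cuda()
while True:
  a += 1
  time.sleep(30)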

# Look into pretraining
# Look for an existing KoBERT implementation
# Try additional preprocessing