# About this kernel

I've seen many people pool the output of BERT and then add some Dense layers. I've also seen a wide range of learning rates used for fine-tuning. In this kernel, I wanted to try some ideas from the original paper that do not appear in many public kernels. Here are some examples:
* *No pooling, directly use the CLS embedding*. The original paper uses the output embedding of the `[CLS]` token when fine-tuning for classification tasks such as sentiment analysis. Since `[CLS]` is the first token in our sequence, we simply take index 0 along the 2nd dimension of our tensor of shape `(batch_size, max_len, hidden_dim)`, which results in a tensor of shape `(batch_size, hidden_dim)`.
* *No Dense layer*. Simply add a sigmoid output directly on top of the last layer of BERT, rather than experimenting with different intermediate layers.
* *Fixed learning rate, batch size, epochs, optimizer*. As specified by the paper, the optimizer is Adam, with a learning rate between 2e-5 and 5e-5. Furthermore, the authors train the model for 3 epochs with a batch size of 32. I wanted to see how well it would perform with those default values.
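
The first two ideas above can be sketched with plain NumPy, independent of BERT itself. The array below is random stand-in data, and `cls_sigmoid_head`, `weights`, and `bias` are hypothetical names used only for illustration; in the actual model a Keras `Dense(1, activation='sigmoid')` layer plays this role.

```python
import numpy as np

def cls_sigmoid_head(sequence_output, weights, bias):
    """Take the [CLS] vector of each sequence and map it to a probability."""
    # [CLS] is token 0, so take index 0 along the sequence (2nd) axis:
    # (batch_size, max_len, hidden_dim) -> (batch_size, hidden_dim)
    cls_embedding = sequence_output[:, 0, :]
    # One linear projection followed by a sigmoid: (batch_size, 1)
    logits = cls_embedding @ weights + bias
    return cls_embedding, 1.0 / (1.0 + np.exp(-logits))

batch_size, max_len, hidden_dim = 4, 160, 1024
rng = np.random.default_rng(0)
# Random stand-in for the BERT sequence output (assumption, not real embeddings)
sequence_output = rng.normal(size=(batch_size, max_len, hidden_dim))
weights = np.zeros((hidden_dim, 1))  # untrained head, so every probability is 0.5
bias = np.zeros(1)

cls_embedding, probs = cls_sigmoid_head(sequence_output, weights, bias)
print(cls_embedding.shape, probs.shape)
```

The only trainable parameters added on top of BERT in this setup are the `hidden_dim` weights and one bias of the output unit.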

I also wanted to share this kernel as a **concise, reusable, and functional example of how to build a workflow around the TF2 version of BERT**. Indeed, it takes less than **50 lines of code to write a string-to-tokens preprocessing function and model builder**.

## References

* Source for `bert_encode` function: https://www.kaggle.com/user123454321/bert-starter-inference
* All pre-trained BERT models from TensorFlow Hub: https://tfhub.dev/s?q=bert

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls drive/MyDrive

In [3]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [None]:
!pip install tensorflow

In [5]:
# sentencepiece is a tokenizer library imported by tokenization.py
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[?25l[K     |▎                               | 10 kB 34.2 MB/s eta 0:00:01[K     |▌                               | 20 kB 22.5 MB/s eta 0:00:01[K     |▉                               | 30 kB 17.4 MB/s eta 0:00:01[K     |█                               | 40 kB 15.3 MB/s eta 0:00:01[K     |█▍                              | 51 kB 7.7 MB/s eta 0:00:01[K     |█▋                              | 61 kB 7.6 MB/s eta 0:00:01[K     |██                              | 71 kB 8.1 MB/s eta 0:00:01[K     |██▏                             | 81 kB 9.0 MB/s eta 0:00:01[K     |██▍                             | 92 kB 9.4 MB/s eta 0:00:01[K     |██▊                             | 102 kB 7.4 MB/s eta 0:00:01[K     |███                             | 112 kB 7.4 MB/s eta 0:00:01[K     |███▎                            | 122 kB 7.4 MB/s eta 0:00:01[K     |███▌      

In [6]:
!pip install tensorflow_hub



In [7]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

import tokenization

# Helper Functions

In [8]:
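def encoader_funct(docs, tokenizer, max_len=512):
    """Encode raw strings into BERT inputs: token ids, padding masks, segment ids."""
    tokens_list = []
    masks = []
    segments = []

    for doc in docs:
        doc = tokenizer.tokenize(doc)

        # Truncate to leave room for the [CLS] and [SEP] special tokens
        doc = doc[:max_len-2]
        input_sequence = ["[CLS]"] + doc + ["[SEP]"]
        pad_len = max_len - len(input_sequence)

        # Convert tokens to vocabulary ids and pad with 0 up to max_len
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        # Mask is 1 for real tokens and 0 for padding
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # Single-sentence input, so every position belongs to segment 0
        segment_ids = [0] * max_len

        tokens_list.append(tokens)
        masks.append(pad_masks)
        segments.append(segment_ids)

    # Each returned array has shape (len(docs), max_len)
    return np.array(tokens_list), np.array(masks), np.array(segments)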
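
To make the output layout concrete, here is a small pure-Python sketch of the same steps applied to one short sentence. `ToyTokenizer` and its tiny vocabulary are hypothetical stand-ins for `tokenization.FullTokenizer`, and `encode_one` is an illustrative single-document version of the function above, not a replacement for it.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer with a tiny made-up vocabulary."""
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "forest": 3, "fire": 4, "near": 5}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

def encode_one(doc, tokenizer, max_len):
    # Reserve two positions for [CLS] and [SEP], truncating if needed
    tokens = tokenizer.tokenize(doc)[:max_len - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len  # pad with id 0
    mask = [1] * len(sequence) + [0] * pad_len  # 1 = real token, 0 = padding
    segments = [0] * max_len                    # single sentence: all segment 0
    return ids, mask, segments

ids, mask, segments = encode_one("Forest fire near", ToyTokenizer(), max_len=8)
print(ids)
print(mask)
print(segments)
```

All three lists always have length `max_len`, which is what lets the real function stack them into rectangular NumPy arrays.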

In [9]:
def build_model_funct(bert_layer, max_len=512):
    # Three integer inputs expected by the TF Hub BERT layer
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    # The layer returns (pooled_output, sequence_output); only the latter is used
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # Embedding of the [CLS] token: (batch_size, hidden_dim)
    clf_output = sequence_output[:, 0, :]
    # Single sigmoid unit for binary classification
    out = Dense(1, activation='sigmoid')(clf_output)

    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    # `lr` is deprecated in favour of `learning_rate`
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model

# Load and Preprocess

- Load BERT from TensorFlow Hub
- Load the CSV files containing the training and test data
- Load the tokenizer from the BERT layer
- Encode the text into tokens, masks, and segment flags

In [10]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

CPU times: user 24.6 s, sys: 5.71 s, total: 30.4 s
Wall time: 35.1 s


In [11]:
train = pd.read_csv("train_clean.csv")
test = pd.read_csv("test_clean.csv")
submission = pd.read_csv("sample_submission.csv")

In [None]:
train

In [13]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
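
`FullTokenizer` first splits text into words and then breaks each word into WordPiece sub-tokens by greedy longest-match against the vocabulary. The sketch below illustrates only that second step; `wordpiece` and `toy_vocab` are hypothetical names, and the three-entry vocabulary is made up for illustration rather than taken from the real vocab file.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"un", "##aff", "##able"}  # made-up vocabulary for illustration
print(wordpiece("unaffable", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because one word can expand into several pieces, the token count after tokenization is usually larger than the word count, which is worth keeping in mind when choosing `max_len`.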

In [14]:
#test = test.dropna(subset = ['trans']) 
train = train.dropna(subset = ['trans'])

In [15]:
test = test.fillna("")

In [16]:
#tokenizing the text
train_input = encoader_funct(train.trans.values, tokenizer, max_len=160)
test_input = encoader_funct(test.trans.values, tokenizer, max_len=160)
train_labels = train.target.values

In [None]:
len(train_input[1])

7613

# Model: Build, Train, Predict, Submit

In [17]:
model = build_model_funct(bert_layer, max_len=160)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

  "The `lr` argument is deprecated, use `learning_rate` instead.")


In [None]:
# Keep only the epoch with the lowest validation loss, saved to the mounted Drive
checkpoint = ModelCheckpoint('drive/MyDrive/model.h5', monitor='val_loss', save_best_only=True)
# Train the model
train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=5,
    callbacks=[checkpoint],
    batch_size=16
)
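
Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling, so the split sizes and the number of optimizer steps per epoch can be estimated ahead of time. The arithmetic below is a sketch: it uses the 7613 encoded training rows reported earlier and assumes the split index is obtained by flooring; `split_sizes` is a hypothetical helper, not a Keras function.

```python
import math

def split_sizes(n_samples, validation_split, batch_size):
    # Assumption: the training part is the first floor(n * (1 - split)) rows
    n_train = int(math.floor(n_samples * (1.0 - validation_split)))
    n_val = n_samples - n_train
    # The last batch of an epoch may be smaller than batch_size
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch

n_train, n_val, steps = split_sizes(7613, validation_split=0.2, batch_size=16)
print(n_train, n_val, steps)
```

Since the rows are not shuffled before the split, the validation set is only representative if the CSV is not ordered by label or topic.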

In [None]:
!ls drive/MyDrive

In [None]:
model.save_weights("drive/MyDrive/model.h5")

In [None]:
# Per-epoch training metrics (cleaned data, 5 epochs)
metrics = pd.DataFrame(model.history.history)
metrics

epoch,loss,accuracy,val_loss,val_accuracy
0,0.448901,0.802299,0.394063,0.829941
1,0.2705,0.895731,0.455615,0.814183
2,0.134778,0.952874,0.603888,0.809586
3,0.083507,0.969787,0.689941,0.812213
4,0.055357,0.977176,0.785013,0.815496
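
In the table above, `val_loss` is lowest after the first epoch and rises afterwards while the training loss keeps falling, which is the usual sign of overfitting. With `save_best_only=True`, the checkpoint callback keeps only the epoch with the best monitored value. A small sketch that picks that epoch from the numbers above (`best_epoch` is a hypothetical helper):

```python
# Values copied from the metrics table above
val_loss = [0.394063, 0.455615, 0.603888, 0.689941, 0.785013]
val_accuracy = [0.829941, 0.814183, 0.809586, 0.812213, 0.815496]

def best_epoch(values, mode="min"):
    """Return the zero-based index of the best value in a metric history."""
    pick = min if mode == "min" else max
    return pick(range(len(values)), key=values.__getitem__)

print(best_epoch(val_loss, "min"), best_epoch(val_accuracy, "max"))
```

Both criteria point to the first epoch here, so training for more epochs at these settings does not appear to help on the validation split.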


In [19]:
model.load_weights('drive/MyDrive/model.h5')
test_pred = model.predict(test_input)

In [20]:
submission['target'] = test_pred.round().astype(int)
submission.to_csv('submission.csv', index=False)
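
Rounding the sigmoid output is equivalent to thresholding at 0.5, since NumPy rounds an exact 0.5 down to 0. The sketch below uses made-up probabilities, with `probs` standing in for `test_pred`; flattening the `(n, 1)` column to 1-D with `ravel()` is my own addition for clarity and is not part of the cell above.

```python
import numpy as np

# Made-up stand-in for the (n, 1) array returned by model.predict
probs = np.array([[0.91], [0.08], [0.50], [0.51]])

rounded = probs.round().astype(int).ravel()       # what the submission cell computes
thresholded = (probs > 0.5).astype(int).ravel()   # explicit 0.5 threshold
print(rounded, thresholded)
```

Writing the threshold explicitly makes it easy to try a cut-off other than 0.5 if the classes are imbalanced.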

In [None]:
test = test.reset_index(drop=True)

In [None]:
pd.DataFrame(test_pred.round().astype(int))