# Predicting Movie Review Sentiment with BERT
This example differs slightly from the tutorial in [**Google's BERT**](https://github.com/google-research/bert#pre-trained-models) repository, which uses a pre-trained BERT model. Some of the code is borrowed from [this tutorial](https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb).

The purpose of this tutorial is to show you how to use **BERT_with_keras** to pre-train an encoding model and train a classification model. Let's get started!

## Data

Let's use Stanford's Large Movie Review Dataset for BERT pre-training and fine-tuning. The code below, which downloads, extracts and imports the dataset, is borrowed from this [TensorFlow tutorial](https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub). The dataset consists of IMDB movie reviews labeled by positivity from 1 to 10.

In [1]:
import os
import re
import tensorflow as tf
import pandas as pd

In [2]:
# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["sentiment"] = []
    for file_path in os.listdir(directory):
        with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            data["sentence"].append(f.read())
            # Raw string so that \d and \. reach the regex engine unchanged.
            data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "pos"))
    neg_df = load_directory_data(os.path.join(directory, "neg"))
    pos_df["polarity"] = 1
    neg_df["polarity"] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
    dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz", 
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
        extract=True)
  
    train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                         "aclImdb", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                          "aclImdb", "test"))
    return train_df, test_df
 
train, test = download_and_load_datasets()

In [4]:
train.head()

| | sentence | sentiment | polarity |
|---|---|---|---|
| 0 | First I liked that movie. It seemed to me a ni... | 4 | 0 |
| 1 | ...but this has to be the worst A Christmas Ca... | 1 | 0 |
| 2 | The Sopranos stands out as an airtight, dynami... | 9 | 1 |
| 3 | I enjoyed the cinematographic recreation of Ch... | 8 | 1 |
| 4 | 'The Student of Prague' is an early feature-le... | 7 | 1 |
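
The `sentiment` column comes from the review file names, which `load_directory_data` expects to look like `<id>_<rating>.txt`, while `polarity` comes from the `pos`/`neg` directory. The following is a minimal sketch of the file-name parsing step only. The helper name `parse_review_filename` and the example file names are made up for illustration.

```python
import re

# Same "<id>_<rating>.txt" shape that load_directory_data matches.
FILENAME_PATTERN = re.compile(r"(\d+)_(\d+)\.txt")


def parse_review_filename(file_name):
    """Return (review_id, rating) as integers, or None if the name does not match."""
    match = FILENAME_PATTERN.fullmatch(file_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical file names, for illustration only.
print(parse_review_filename("127_9.txt"))
print(parse_review_filename("README"))
```

Unlike the loader above, this sketch converts the rating to an integer and returns `None` for names that do not match instead of raising an error.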


## Pre-training
Let's use the IMDB movie reviews for BERT pre-training. If you want a better **BERT pre-training model**, you need to prepare your own data.

In [9]:
import os
import spacy
import random
from const import bert_data_path,bert_model_path
from preprocess import create_pretraining_data_from_docs
from pretraining import bert_pretraining

Before pre-training the BERT model, we need to transform the movie review data into the format used by the **BERT_with_keras** model.

In [24]:
# use spacy for sentence split
nlp = spacy.load('en')

# use IMDB movie review as pretraining data
texts = train['sentence'].tolist() + test['sentence'].tolist()

# To keep training fast on my MacBook, I take a sample of 50 movie reviews.
# Because of this, the pre-trained model may be very poor,
# so try using all the movie reviews to get a better pre-trained model.
random.shuffle(texts)
texts = texts[0:50]

# sentence split
sentences_texts=[]
for text in texts:
    doc = nlp(text)
    sentences_texts.append([s.text for s in doc.sents])

vocab_path = os.path.join(bert_data_path, 'vocab.txt')

# set dupe_factor=5 to reduce the number of pre-training samples (default: 10)
create_pretraining_data_from_docs(sentences_texts,
                                  vocab_path=vocab_path,
                                  save_path=os.path.join(bert_data_path,'pretraining_data.npz'),
                                  token_method='wordpiece',
                                  language='en',
                                  dupe_factor=5
                                 )

num-0: tokens: [CLS] we also have lots of scenes with the hero fl ##au ##nting all the [MASK] of respects and protocol which the rest of the tibetan [MASK] accord ##s the dalai lama , [MASK] as we [MASK] that the hero has deep and profound [MASK] ##erence [MASK] these people and their spiritual [MASK] . [SEP] that this guy is now [MASK] buddhist , sort of , in his own way , even though we ourselves don ##´ ##t seem to know what his transformation en ##tails or how [MASK] [MASK] [MASK] it to go . and last but not least , we hang a stat ##istic onto the [MASK] [MASK] the film about [MASK] app ##all ##ingly the chinese have treated the tibetan ##s ( which is [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: False
masked_lm_positions: 9 15 26 33 36 44 4
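
The `[MASK]` tokens and `masked_lm_positions` in the output above come from BERT's masked language model objective. The sketch below follows the recipe described in the BERT paper: pick about 15% of the non-special positions, then replace 80% of them with `[MASK]`, 10% with a random token, and leave 10% unchanged. It is not the code of `create_pretraining_data_from_docs`. The function name `mask_tokens`, its parameters and the toy vocabulary are assumptions made for illustration.

```python
import random


def mask_tokens(tokens, vocab, masked_lm_prob=0.15, max_predictions=20, rng=None):
    """Return (masked_tokens, positions, labels) for one token sequence."""
    rng = rng or random.Random()
    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = min(max_predictions,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    positions = sorted(candidates[:num_to_predict])

    masked = list(tokens)
    labels = []
    for pos in positions:
        labels.append(tokens[pos])  # the model must recover the original token
        draw = rng.random()
        if draw < 0.8:
            masked[pos] = "[MASK]"
        elif draw < 0.9:
            masked[pos] = rng.choice(vocab)
        # otherwise the original token is kept as-is
    return masked, positions, labels


# Toy input, for illustration only.
toy_tokens = "[CLS] the movie was great [SEP] i would watch it again [SEP]".split()
toy_vocab = ["film", "bad", "good", "actor"]
print(mask_tokens(toy_tokens, toy_vocab, rng=random.Random(0)))
```

The `max_predictions` cap plays the same role as the `max_predictions_per_seq` argument passed to `bert_pretraining` later in this tutorial.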

Now that we've prepared the pre-training data, let's train the BERT pre-training model.

In [26]:
# pre-train a BERT encoder model
bert_pretraining(train_data_path=os.path.join(bert_data_path,'pretraining_data.npz'),
                 bert_config_file=os.path.join(bert_data_path, 'bert_config.json'),
                 save_path=bert_model_path,
                 batch_size=8,
                 seq_length=128,
                 max_predictions_per_seq=20,
                 val_batch_size=128,
                 multi_gpu=0,
                 num_warmup_steps=10,
                 checkpoints_interval_steps=110,
                 max_num_val=1000,
                 pretraining_model_name='bert_pretraining_movie_reviews.h5',
                 encoder_model_name='bert_movie_reviews_encoder.h5')



Epoch 1/2
step 0000120: cur_lm_acc is 0.25802, cur_is_random_nex_acc is 0.56604

Step 0000120: val_acc improved from -inf to 0.41203, saving model to /Users/zhongan/Documents/competition/BERT_with_keras/models/bert_pretraining_movie_reviews.h5
Epoch 2/2
step 0000230: cur_lm_acc is 0.25142, cur_is_random_nex_acc is 0.56604

Step 0000230: val_acc did not improve from 0.41203, current is 0.40873


## Fine-tuning
Use the pre-trained model above as the starting point for your NLP model. Here it is used to classify movie reviews, i.e. to decide whether a review is positive or negative.

In [10]:
import os
import keras
import numpy as np
from const import bert_data_path, bert_model_path
from modeling import BertConfig
from classifier import SingleSeqDataProcessor, convert_examples_to_features, text_classifier, save_features
from tokenization import FullTokenizer
from optimization import AdamWeightDecayOpt
from checkpoint import StepModelCheckpoint

First, we need to convert the movie review data into the input format that BERT expects.

In [11]:
train_examples = SingleSeqDataProcessor.get_train_examples(train_data=train['sentence'].tolist(),
                                                           labels=train['polarity'].tolist())
dev_examples = SingleSeqDataProcessor.get_dev_examples(dev_data=test['sentence'].tolist(),
                                                       labels=test['polarity'].tolist())

# The WordPiece tokenizer needs a prepared vocabulary.
vocab_path = os.path.join(bert_data_path, 'vocab.txt')

# load the vocabulary into the tokenizer
tokenizer = FullTokenizer(vocab_path, do_lower_case=True)

# convert the train and dev examples to features
train_features = convert_examples_to_features(train_examples,
                                              label_list=[0, 1],
                                              max_seq_length=128,
                                              tokenizer=tokenizer)
dev_features = convert_examples_to_features(dev_examples,
                                            label_list=[0, 1],
                                            max_seq_length=128,
                                            tokenizer=tokenizer)

# convert features to a dictionary of numpy arrays.
train_features_array_dict = save_features(features=train_features)
dev_features_array_dict = save_features(features=dev_features)

# get train and validation data
train_x = [train_features_array_dict['input_ids'], train_features_array_dict['input_mask'], train_features_array_dict['segment_ids']]
train_y = keras.utils.to_categorical(train_features_array_dict['label_ids'], 2)
val_x = [dev_features_array_dict['input_ids'], dev_features_array_dict['input_mask'], dev_features_array_dict['segment_ids']]
val_y = keras.utils.to_categorical(dev_features_array_dict['label_ids'],2)
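
Each example ends up as three equal-length arrays: `input_ids`, `input_mask` and `segment_ids`. The sketch below shows how such arrays are typically built for a single sequence in the Google BERT reference code: wrap the tokens in `[CLS]`/`[SEP]`, truncate to `max_seq_length`, and pad with zeros. It assumes that `convert_examples_to_features` in BERT_with_keras follows the same convention. The function name, the toy vocabulary and the padding id `0` are assumptions, and a plain token list stands in for the WordPiece output.

```python
def build_single_sequence_features(tokens, vocab, max_seq_length):
    """Build input_ids, input_mask and segment_ids for one tokenized sequence."""
    # Reserve two positions for [CLS] and [SEP], truncating if needed.
    tokens = tokens[:max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]

    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens, 0 marks padding
    segment_ids = [0] * len(input_ids)  # a single sequence uses segment 0 only

    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


# Toy vocabulary, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "great": 4, "movie": 5}
ids, mask, segments = build_single_sequence_features(
    ["great", "movie", "!"], toy_vocab, max_seq_length=8)
print(ids)       # [2, 4, 5, 1, 3, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The three lists correspond, in order, to the three entries of `train_x` and `val_x` above.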

Second, we need to set the arguments of the BERT model. We also need to define an optimizer and a checkpointer.

In [5]:
# load bert configuration file
config = BertConfig.from_json_file(os.path.join(bert_data_path, 'bert_config.json'))
epochs = 3
num_gpus = None
# if you run into an OOM (out of memory) problem, reduce the batch size.
batch_size = 16

# calculate the number of training steps from the dataset size, batch size and number of epochs.
num_train_samples = len(train_features_array_dict['input_ids'])
num_train_steps = int(np.ceil(num_train_samples / batch_size)) * epochs
print("number of train steps: {}".format(num_train_steps))

# Use the Adam optimizer with weight decay. It is slightly different from Keras's standard Adam optimizer.
# For more details, see the source code of AdamWeightDecayOpt.
adam = AdamWeightDecayOpt(
        lr=5e-5,
        num_train_steps=num_train_steps,
        num_warmup_steps=100,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6,
        weight_decay_rate=0.01,
        exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]
    )

# This checkpoint evaluates the BERT model's performance at the end of a batch.
checkpoint = StepModelCheckpoint(filepath="%s/%s" % (bert_model_path, 'imdb_classifer_model.h5'),
                                 verbose=1, monitor='val_acc',
                                 save_best_only=True,
                                 xlen=3,
                                 period=100,
                                 start_step=100,
                                 val_batch_size=128)

number of train steps: 4689
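
The step count above follows from 25000 training samples, a batch size of 16 and 3 epochs. The sketch below reproduces that arithmetic and also illustrates the learning-rate schedule used by the Google BERT reference optimizer: linear warmup followed by linear decay to zero. Whether `AdamWeightDecayOpt` uses exactly this schedule is an assumption, so check its source code. Both function names are hypothetical.

```python
import math


def count_train_steps(num_samples, batch_size, epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size) * epochs


def learning_rate_at(step, base_lr, num_train_steps, num_warmup_steps):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr * max(0.0, 1.0 - step / num_train_steps)


total_steps = count_train_steps(num_samples=25000, batch_size=16, epochs=3)
print(total_steps)  # 4689

# Learning rate at a few steps, using the settings from this tutorial.
for step in (0, 50, 100, 2000, total_steps):
    print(step, learning_rate_at(step, 5e-5, total_steps, 100))
```

With `num_warmup_steps=100`, only about the first 2% of the 4689 steps are spent warming up.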


Finally, create a BERT classification model and train it.

In [None]:
# create a model
classifier = text_classifier(bert_config=config,
                             pretrain_model_path=os.path.join(bert_model_path, 'bert_movie_reviews_encoder.h5'),
                             batch_size=batch_size,
                             seq_length=128,
                             optimizer=adam,
                             num_classes=2,
                             multi_gpu= num_gpus
                             )

# When using multiple GPUs, the parallel BERT model can't be used to evaluate/predict.
# You can only use the CPU-built model to evaluate and predict.
if num_gpus is not None:
    checkpoint.single_gpu_model = classifier.model

# train model
history = classifier.fit(x=train_x,
                         y=train_y,
                         epochs=epochs,
                         shuffle=True,
                         callbacks=[checkpoint],
                         validation_data=(val_x,val_y)
                         )

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
 2432/25000 [=>............................] - ETA: 4:42:21 - loss: 0.7182