# BERT - Balanced

*Pre-training of Deep Bidirectional Transformers for Language Understanding.*

BERT uses transfer learning: a model architecture is first trained on a language modeling objective and then fine-tuned for a supervised downstream task.

The BERT model’s architecture is a bidirectional Transformer encoder. The Transformer captures long-distance dependencies better than a recurrent neural network architecture. The bidirectional encoder is the standout feature that differentiates BERT from OpenAI GPT (a left-to-right Transformer) and ELMo (a concatenation of independently trained left-to-right and right-to-left LSTMs).

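BERT-Large is a huge model with 24 Transformer blocks, a hidden size of 1024, and 340M parameters. The notebook below loads the smaller BERT-Base checkpoint (`uncased_L-12_H-768_A-12`: 12 blocks, hidden size 768, 12 attention heads).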

The model is pre-trained for 40 epochs over a 3.3 billion word corpus, made up of BooksCorpus (800 million words) and English Wikipedia (2.5 billion words).

In pre-training, the researchers randomly mask 15 percent of the input tokens and train the model to predict them from the context on both sides, which yields a deep bidirectional representation. They refer to this method as a Masked Language Model (MLM).
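
The masking step can be illustrated with a minimal sketch. It assumes whitespace-tokenized input and a hypothetical helper named `mask_tokens`; it only replaces the selected tokens with `[MASK]`, whereas the actual BERT procedure also leaves some selected tokens unchanged or swaps in random tokens, which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels).

    labels holds the original token at each masked position and None elsewhere,
    so a model would only be scored on the positions it has to reconstruct.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so that short inputs still give a training signal.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = "how long do i have left on my plan".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, the prediction task conditions on left and right context at once.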

Language modeling alone does not directly capture relationships between sentences, which are vital to language tasks such as question answering and natural language inference. The researchers therefore also pre-trained the model on a binarized next sentence prediction task, whose examples can be trivially generated from any monolingual corpus.
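
As a rough illustration of how such examples can be generated, the sketch below builds labelled sentence pairs from an ordered list of sentences. The function name `make_nsp_pairs` is hypothetical, and the 50/50 split between true and random successors follows the BERT paper's description; drawing negatives from the same list is a simplification.

```python
import random


def make_nsp_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to draw a negative example")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: B really follows A in the corpus.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: B is a random sentence that is neither A
            # nor A's true successor.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], False))
    return pairs


corpus = [
    "I opened my bill.",
    "The total was higher than usual.",
    "I called support.",
    "They explained the extra charge.",
]
for a, b, is_next in make_nsp_pairs(corpus, seed=0):
    print(is_next, "|", a, "->", b)
```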

From Tim Dettmers regarding compute for BERT: it uses 256 TPU-days, similar to the OpenAI model. Lots of TPUs parallelize about 25% better than GPUs. The RTX 2080 Ti and V100 should reach ~70% and ~90% of TPU matmul performance respectively if you use 16-bit (important!). Therefore BERT ~= 375 RTX 2080 Ti days or 275 V100 days.

That would burn a large hole in the wallet. At 5.22 USD per TPU per hour for a Cloud TPU v2 in Asia Pacific, 256 TPU-days equates to 32,072 USD or 44,347 AUD for a single training run. Once experiments and optimization are included, the total is likely in the hundreds of thousands.
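
The arithmetic behind that estimate can be checked in a few lines. The exchange rate is not stated in the text, so the sketch derives it from the two quoted totals; treat it as an assumption.

```python
TPU_DAYS = 256
USD_PER_TPU_HOUR = 5.22
# Assumption: exchange rate implied by the USD and AUD totals quoted above (~1.38).
AUD_PER_USD = 44_347 / 32_072

tpu_hours = TPU_DAYS * 24
cost_usd = tpu_hours * USD_PER_TPU_HOUR
cost_aud = cost_usd * AUD_PER_USD

print(f"{tpu_hours} TPU-hours -> {cost_usd:,.0f} USD (~{cost_aud:,.0f} AUD)")
```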

Are we entering an era in which algorithms are available as open source, but the data and compute required to train them are out of reach for most people and organisations, who can nonetheless rent the outcomes?

See [Cloud TPU Pricing](https://cloud.google.com/tpu/docs/pricing). Also see [BERT sets new standards](https://medium.com/syncedreview/best-nlp-model-ever-google-bert-sets-new-standards-in-11-language-tasks-4a2a189bc155).

In [1]:
import csv
import numpy as np
import os
import sys
import tensorflow as tf

In [2]:
sys.path.append('/Users/d777710/src/bert')
sys.path.append('../..')

In [3]:
from text_classification_benchmarks.metrics import perf_summary, print_perf_summary

In [4]:
DATA_DIR = '/Users/d777710/src/DeepLearning/dltemplate/src/text_classification_benchmarks/fastai'
BERT_BASE_DIR = '/Users/d777710/src/DeepLearning/dltemplate/pretrained_models/bert/uncased_L-12_H-768_A-12'
CODI_BERT_MODEL = '/Users/d777710/src/bert/codiout3'
OUTPUT_DIR = '/tmp/output'

In [5]:
import modeling
import run_classifier as clf
import tokenization

In [6]:
tf.logging.set_verbosity(tf.logging.FATAL)

In [7]:
max_api_calls = 200

In [8]:
config_file = os.path.join(BERT_BASE_DIR, 'bert_config.json')
bert_config = modeling.BertConfig.from_json_file(config_file)

In [9]:
processor = clf.CsvProcessor()

In [10]:
label_list = processor.get_labels(DATA_DIR)

In [11]:
vocab_file = os.path.join(BERT_BASE_DIR, 'vocab.txt')
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

In [12]:
# init_checkpoint = os.path.join(BERT_BASE_DIR, 'bert_model.ckpt')
# Use the checkpoint at CODI_BERT_MODEL instead of the base pre-trained weights
init_checkpoint = CODI_BERT_MODEL
model_fn = clf.model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=init_checkpoint,
    learning_rate=5e-5,
    num_train_steps=None,
    num_warmup_steps=None,
    use_tpu=False,
    use_one_hot_embeddings=False
)

In [13]:
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
run_config = tf.contrib.tpu.RunConfig(
    cluster=None,
    model_dir=OUTPUT_DIR,
    save_checkpoints_steps=1000,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=1000,
        num_shards=8,
        per_host_input_for_training=is_per_host
    )
)

In [14]:
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=32,
    eval_batch_size=8,
    predict_batch_size=1
)

In [15]:
max_seq_length = 128

In [16]:
sample_utterance = 'How long do I have left on my plan'

In [17]:
examples = [clf.InputExample(guid=1, text_a=tokenization.convert_to_unicode(sample_utterance))]
features = clf.convert_examples_to_features(examples,
                                            label_list,
                                            max_seq_length=max_seq_length,
                                            tokenizer=tokenizer)

In [18]:
input_fn = clf.input_fn_builder(
    features=features,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False
)

In [19]:
result = estimator.predict(input_fn=input_fn)

In [20]:
val_csv = '../fastai/balanced/val.csv'
test_csv = '../fastai/balanced/test.csv'
classes_txt = '../fastai/classes.txt'

In [21]:
classes = np.genfromtxt(classes_txt, dtype=str)

In [None]:
for label in result:
    print(classes[label])

In [23]:
val_lines = []
with tf.gfile.Open(val_csv, 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        val_lines.append(line)

In [None]:
y_val = []
y_val_pred = []
incorrect = 0
for label, utterance in val_lines[:max_api_calls]:
    y_true = int(label)
    y_val.append(y_true)
    examples = [clf.InputExample(guid=1, text_a=tokenization.convert_to_unicode(utterance))]
    features = clf.convert_examples_to_features(examples,
                                                label_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=tokenizer)
    input_fn = clf.input_fn_builder(
        features=features,
        seq_length=max_seq_length,
        is_training=False,
        drop_remainder=False
    )
    pred = next(estimator.predict(input_fn=input_fn))
    y_val_pred.append(pred)
    if pred != y_true:
        # Log misclassified utterances only
        print('Utterance:', utterance)
        print('Actual   :', classes[y_true])
        print('Pred     :', classes[int(pred)])
        print(':(')
        print('')
        incorrect += 1

In [25]:
print('Dev set performance:')
stats = perf_summary(y_val, y_val_pred)
print_perf_summary(stats, rounded=2)

Dev set performance:
Precision (weighted avg): 0.79
Recall (weighted avg)   : 0.72
F1 Score (weighted avg) : 0.73
Accuracy                : 0.72
ROC AUC (macro avg)     : 0.85


In [26]:
test_lines = []
with tf.gfile.Open(test_csv, 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        test_lines.append(line)

In [None]:
y_test = []
y_test_pred = []
for label, utterance in test_lines[:max_api_calls]:
    y_true = int(label)
    y_test.append(y_true)
    examples = [clf.InputExample(guid=1, text_a=tokenization.convert_to_unicode(utterance))]
    features = clf.convert_examples_to_features(examples,
                                                label_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=tokenizer)
    input_fn = clf.input_fn_builder(
        features=features,
        seq_length=max_seq_length,
        is_training=False,
        drop_remainder=False
    )
    pred = next(estimator.predict(input_fn=input_fn))
    y_test_pred.append(pred)
    print('Utterance:', utterance)
    print('Actual   :', classes[int(label)])
    print('Pred     :', classes[int(pred)])
    print(':)' if int(label) == int(pred) else ':(')
    print('')

In [28]:
print('Test set performance:')
stats = perf_summary(y_test, y_test_pred)
print_perf_summary(stats, rounded=2)

Test set performance:
Precision (weighted avg): 0.76
Recall (weighted avg)   : 0.73
F1 Score (weighted avg) : 0.71
Accuracy                : 0.73
ROC AUC (macro avg)     : 0.87


In [29]:
eval_examples = processor.get_dev_examples(DATA_DIR)
eval_features = clf.convert_examples_to_features(eval_examples, label_list, max_seq_length, tokenizer)

In [30]:
eval_input_fn = clf.input_fn_builder(
    features=eval_features,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False)

eval_result = estimator.evaluate(input_fn=eval_input_fn, steps=None)

In [31]:
for key in sorted(eval_result.keys()):
    print('  {} = {}'.format(key, eval_result[key]))

  eval_accuracy = 0.718799352645874
  eval_loss = 2.0258591175079346
  global_step = 0
  loss = 2.0037953853607178


In [32]:
test_examples = processor.get_test_examples(DATA_DIR)
test_features = clf.convert_examples_to_features(test_examples, label_list, max_seq_length, tokenizer)

In [33]:
test_input_fn = clf.input_fn_builder(
    features=test_features,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False)

test_result = estimator.evaluate(input_fn=test_input_fn, steps=None)

In [34]:
for key in sorted(test_result.keys()):
    print('  {} = {}'.format(key, test_result[key]))

  eval_accuracy = 0.6031294465065002
  eval_loss = 2.745549440383911
  global_step = 0
  loss = 2.7473702430725098


## Conclusions

1. Inference is slower than API services.
2. BERT is best suited to batch predictions.
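
One likely contributor to the slow inference is that the loops above build a new `input_fn` and call `estimator.predict` once per utterance, and each such call sets up the model again. A minimal sketch of the batching idea follows. `predict_in_batches` and `predict_batch` are hypothetical names: in this notebook, `predict_batch` could wrap `clf.convert_examples_to_features`, `clf.input_fn_builder`, and a single `estimator.predict` call over the whole chunk. A stand-in function is used here so the example runs on its own.

```python
def predict_in_batches(utterances, predict_batch, batch_size=32):
    """Call predict_batch once per chunk of utterances rather than once each."""
    predictions = []
    for start in range(0, len(utterances), batch_size):
        chunk = utterances[start:start + batch_size]
        predictions.extend(predict_batch(chunk))
    return predictions


call_sizes = []


def fake_predict_batch(chunk):
    # Stand-in for a real model call: records the chunk size and
    # returns the utterance length as a dummy "label".
    call_sizes.append(len(chunk))
    return [len(utterance) for utterance in chunk]


utterances = ["a", "bb", "ccc", "dddd", "eeeee"]
labels = predict_in_batches(utterances, fake_predict_batch, batch_size=2)
print(labels)      # one prediction per utterance, in input order
print(call_sizes)  # number of utterances sent in each model call
```

With a batch size of 32, the 200 utterances evaluated above would need 7 model calls instead of 200.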