# BERT

Pre-training of Deep Bidirectional Transformers for Language Understanding.

BERT uses transfer learning: a model architecture is first trained on a language modeling objective and then fine-tuned for a supervised downstream task.

The BERT model’s architecture is a bidirectional Transformer encoder. The Transformer is better at capturing long-distance dependencies than a recurrent neural network architecture. The bidirectional encoder, meanwhile, is the standout feature that differentiates BERT from OpenAI GPT (a left-to-right Transformer) and ELMo (a concatenation of independently trained left-to-right and right-to-left LSTMs).

The large BERT model (BERT-Large) is huge, with 24 Transformer blocks, a hidden size of 1024, and 340M parameters.

The model is pre-trained for 40 epochs over a 3.3 billion word corpus, made up of BooksCorpus (800 million words) and English Wikipedia (2.5 billion words).

During pre-training, the researchers randomly mask 15 percent of the input tokens and train the model to predict them, which yields a deep bidirectional representation. They refer to this method as a Masked Language Model (MLM).
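The masking step can be illustrated with a minimal sketch. It is not the BERT implementation: the function name `mask_tokens` and its parameters are hypothetical, and it always substitutes `[MASK]`, whereas the published procedure also swaps some selected tokens for random tokens or leaves them unchanged.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Mask a fraction of tokens; return (masked_tokens, targets).

    targets maps each masked position to the original token the
    model would be trained to predict.
    """
    rng = random.Random(seed)
    # Mask about 15% of positions (at least one when there are any tokens).
    num_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))

    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN  # simplification: always use [MASK]
    return masked, targets


example = "how long do i have left on my plan".split()
masked, targets = mask_tokens(example, seed=0)
print(masked)
print(targets)
```

Because the prediction target sits inside the sequence, the model can use context on both sides of the masked position, which is what makes the representation bidirectional.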

A language model pre-trained only on token prediction does not directly capture relationships between sentences, which are vital to tasks such as question answering and natural language inference. The researchers therefore also pre-trained on a binary next sentence prediction task, whose training examples can be trivially generated from any monolingual corpus.
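To show why such examples are trivial to generate, here is a hedged sketch that builds labelled sentence pairs from an ordered list of sentences. The function name `make_nsp_pairs` is hypothetical, and the 50/50 split between true and random next sentences comes from the BERT paper rather than from the notes above. The real pipeline draws the random sentence from a different document; this sketch draws it from the same list for brevity.

```python
import random


def make_nsp_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, is_next) triples from ordered sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative example")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that actually follows.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except the current or the true next one.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], False))
    return pairs


corpus = [
    "I bought a SIM card this morning.",
    "It has not been activated yet.",
    "Can you please help?",
    "I also want to check my plan.",
]
for a, b, is_next in make_nsp_pairs(corpus, seed=1):
    print(is_next, "|", a, "->", b)
```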

From Tim Dettmers, regarding compute for BERT: it uses 256 TPU-days, similar to the OpenAI model. Lots of TPUs parallelize about 25% better than GPUs. The RTX 2080 Ti and V100 should reach ~70% and ~90% of TPU matmul performance respectively, if you use 16-bit (important!).

BERT ~= 375 RTX 2080 Ti days or 275 V100 days.

That would burn a large hole in the wallet. At 5.22 USD per TPU per hour for a Cloud TPU v2 in Asia Pacific, 256 TPU-days equates to 32,072 USD or 44,347 AUD for a single training run. Once experiments and optimization are included, the total is likely in the hundreds of thousands.
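The cost figure follows from simple arithmetic, sketched here. The constant and function names are illustrative; the inputs are the 256 TPU-days and the 5.22 USD hourly rate quoted above, and the AUD/USD rate is only what the two quoted totals imply, not an official rate.

```python
TPU_DAYS = 256            # compute quoted for BERT pre-training
USD_PER_TPU_HOUR = 5.22   # Cloud TPU v2 hourly rate quoted above
QUOTED_AUD = 44_347       # AUD total quoted above


def training_cost_usd(tpu_days, usd_per_tpu_hour):
    """Cost of a single run: TPU-days converted to hours, times the hourly rate."""
    return tpu_days * 24 * usd_per_tpu_hour


cost = training_cost_usd(TPU_DAYS, USD_PER_TPU_HOUR)
print(f"{cost:,.0f} USD")  # 32,072 USD

# Exchange rate implied by the two quoted totals (an inference, not a quoted rate).
implied_aud_per_usd = QUOTED_AUD / cost
print(f"implied rate: {implied_aud_per_usd:.2f} AUD per USD")
```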

Are we entering an era in which algorithms are available as open source, but the data and compute required to train them are out of reach for most people and organisations, who can only rent the outcomes?

See [Cloud TPU Pricing](https://cloud.google.com/tpu/docs/pricing).

Also see [BERT sets new standards](https://medium.com/syncedreview/best-nlp-model-ever-google-bert-sets-new-standards-in-11-language-tasks-4a2a189bc155).

In [1]:
import csv
import numpy as np
import os
import sys
import tensorflow as tf

In [2]:
sys.path.append('/Users/d777710/src/bert')
sys.path.append('../..')

In [3]:
from text_classification_benchmarks.metrics import perf_summary, print_perf_summary

In [33]:
DATA_DIR = '/Users/d777710/src/DeepLearning/dltemplate/src/text_classification_benchmarks/fastai'
BERT_BASE_DIR = '/Users/d777710/src/DeepLearning/dltemplate/pretrained_models/bert/uncased_L-12_H-768_A-12'
CODI_BERT_MODEL = '/Users/d777710/src/bert/codiout2'
OUTPUT_DIR = '/tmp/output'

In [34]:
import modeling
import run_classifier as clf
import tokenization

In [35]:
tf.logging.set_verbosity(tf.logging.FATAL)

In [36]:
config_file = os.path.join(BERT_BASE_DIR, 'bert_config.json')
bert_config = modeling.BertConfig.from_json_file(config_file)

In [37]:
processor = clf.CsvProcessor()

In [38]:
label_list = processor.get_labels(DATA_DIR)

In [39]:
vocab_file = os.path.join(BERT_BASE_DIR, 'vocab.txt')
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

In [40]:
# init_checkpoint = os.path.join(BERT_BASE_DIR, 'bert_model.ckpt')
init_checkpoint = CODI_BERT_MODEL
model_fn = clf.model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=init_checkpoint,
    learning_rate=5e-5,
    num_train_steps=None,
    num_warmup_steps=None,
    use_tpu=False,
    use_one_hot_embeddings=False
)

In [41]:
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
run_config = tf.contrib.tpu.RunConfig(
    cluster=None,
    model_dir=OUTPUT_DIR,
    save_checkpoints_steps=1000,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=1000,
        num_shards=8,
        per_host_input_for_training=is_per_host
    )
)

In [42]:
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=32,
    eval_batch_size=8,
    predict_batch_size=1
)

In [43]:
max_seq_length = 128

In [44]:
sample_utterance = 'How long do I have left on my plan'

In [45]:
examples = [clf.InputExample(guid=1, text_a=tokenization.convert_to_unicode(sample_utterance))]
features = clf.convert_examples_to_features(examples, 
                                            label_list, 
                                            max_seq_length=max_seq_length, 
                                            tokenizer=tokenizer)

In [46]:
input_fn = clf.input_fn_builder(
    features=features,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False
)

In [47]:
result = estimator.predict(input_fn=input_fn)

In [48]:
for item in result:
    print(item)

119


In [49]:
val_csv = os.path.join(DATA_DIR, 'val.csv')
test_csv = os.path.join(DATA_DIR, 'test.csv')
classes_txt = os.path.join(DATA_DIR, 'classes.txt')

In [50]:
val_lines = []
with tf.gfile.Open(val_csv, 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        val_lines.append(line)

In [51]:
classes = np.genfromtxt(classes_txt, dtype=str)

In [32]:
y_val = []
y_val_pred = []
incorrect = 0
for label, utterance in val_lines:
    y_true = int(label)
    y_val.append(y_true)
    examples = [clf.InputExample(guid=1, text_a=tokenization.convert_to_unicode(utterance))]
    features = clf.convert_examples_to_features(examples,
                                                label_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=tokenizer)
    input_fn = clf.input_fn_builder(
        features=features,
        seq_length=max_seq_length,
        is_training=False,
        drop_remainder=False
    )
    pred = next(estimator.predict(input_fn=input_fn))
    y_val_pred.append(pred)
    if pred != y_true:
        print(classes[y_true], classes[pred], utterance)
        incorrect += 1

Information-E_Contact_Us Service_Management-T_Find_Phone_Number What is the customer number?
Account_Management-E_Name_Change Account_Management-E_Update_Change_Contact_Phone_Number i want to update my contact details with telstra for my name
Billing-E_Dispute Payment-E_Refund_Payment I have been overcharged
Service_Management-E_Price_Plan_Change Sales-E_Port_In so change to post paid, and then my Mx plan is active?
Complaints-T_Troubleshooting_Call Service_Management-T_MessageBank_Remoteaccess My text messages are not going thru
Payment-E_Payment_History Billing-T_Bill_Copy Hi Codi, I wanted to ask whether I can get my Telstra Prepaid Mobile Invoices sent to me for this financial year.
Service_Management-T_Find_Phone_Number Information-T_Definitions What is my 13 digit account code
Flow_Control_Break_Flow-Another_Question TEMP_ChitChat-Talk_to_me I have another concern
Information-T_Definitions Service_Management-T_MessageBank_Message2Txt What is Message2Txt
Account_Management-T_Email

Complaints-T_Troubleshooting_Data Complaints-T_Troubleshooting_Broadband i can't get internet on my mobile broadband device
Complaints-T_Troubleshooting_Data Complaints-T_Troubleshooting_Device why is my ipad not connecting... yesterday i had $*** today it says i have $*** but when i try to connect it says no internet
TEMP_Payment-E_Method_Of_Payment_Inquiry Account_Management-E_Close_Cancel_Account ok... it says: 'delete payment method. do you really wan tot delete the following saved payment method?' do i press 'yes'?
Service_Management-T_Carrier_Billing Service_Management-T_Premium_SMS Subscription services charging my account
Service_Management-E_Price_Plan_Inquiry Information-E_About_Us Just want to know the details of my plant
Payment-E_Missing_Misapplied_Payment Complaints-T_Troubleshooting_Credit_or_Recharge I recharged yesterday and accidentally did it on my son's phone not mine. Can I fix it as my son is now with vodaphone.
Complaints-E_Network Complaints-T_Troubleshooting_Br

Complaints-T_Troubleshooting_Activation Service_Management-E_Activate_Prepaid_Plan I bought a simcard this morning and I am still waiting for activation of the number: ***. Can you please help?
Service_Management-E_Price_Plan_Inquiry Service_Management-T_Price_Plan_Offers I am looking at your plan M30GB
Flow_Control-No Flow_Control-Yes 
Service_Management-T_Contract_Details Sales-T_Recontract_Service Hi I was wondering when my plan runs out so I can get a new phone?
Payment-E_Defer_Payment Billing-E_Dispute I want to delay payment for my account
Complaints-T_Troubleshooting_Profile Service_Management-T_Data_Share_SIM hi i've just set up a new mobile that i wanted linked to my telstra online account. using the 'add account' button isn't working for this number can you help please?
Service_Management-T_Activate_Broadband Off_Topic-T_Out_of_Scope Need to talk to someone with common sense in Australia on setting up a new Telstra supplied modem for cable broadband
Account_Management-T_24x7A

In [25]:
len(y_val_pred)

703

In [28]:
stats = perf_summary(y_val, y_val_pred)
print_perf_summary(stats, rounded=2)

Precision (weighted avg): 0.76
Recall (weighted avg)   : 0.74
F1 Score (weighted avg) : 0.73
Accuracy                : 0.74
ROC AUC (macro avg)     : 0.87


In [52]:
y_val = []       # true class indices
y_val_pred = []  # predicted class indices
incorrect = 0
for label, utterance in val_lines:
    y_true = int(label)
    y_val.append(y_true)
    # Wrap the single utterance as a one-example batch.
    examples = [clf.InputExample(guid=1, text_a=tokenization.convert_to_unicode(utterance))]
    features = clf.convert_examples_to_features(examples,
                                                label_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=tokenizer)
    input_fn = clf.input_fn_builder(
        features=features,
        seq_length=max_seq_length,
        is_training=False,
        drop_remainder=False
    )
    # Each predict() call rebuilds the graph and reloads the checkpoint,
    # so predicting one utterance at a time is slow.
    pred = next(estimator.predict(input_fn=input_fn))
    y_val_pred.append(pred)
    if pred != y_true:
        # Print misclassifications as: true label, predicted label, utterance.
        print(classes[y_true], classes[pred], utterance)
        incorrect += 1

Information-E_Contact_Us Service_Management-T_Find_Phone_Number What is the customer number?
Account_Management-E_Name_Change Account_Management-E_Update_Change_Contact_Phone_Number i want to update my contact details with telstra for my name
Billing-E_Dispute Payment-E_Refund_Payment I have been overcharged
Service_Management-E_Price_Plan_Change Service_Management-E_Activate_Prepaid_Plan so change to post paid, and then my Mx plan is active?
Complaints-T_Troubleshooting_Call Service_Management-T_MessageBank_Remoteaccess My text messages are not going thru
Payment-E_Payment_History Billing-T_Bill_Copy Hi Codi, I wanted to ask whether I can get my Telstra Prepaid Mobile Invoices sent to me for this financial year.
Service_Management-T_Find_Phone_Number Information-T_Definitions What is my 13 digit account code
Flow_Control_Break_Flow-Another_Question TEMP_ChitChat-Talk_to_me I have another concern
Information-T_Definitions Service_Management-T_MessageBank_Message2Txt What is Message2Txt

Service_Management-T_Carrier_Billing Service_Management-T_Premium_SMS Subscription services charging my account
Service_Management-E_Price_Plan_Inquiry Information-E_About_Us Just want to know the details of my plant
Service_Management-T_Price_Plan_Offers Service_Management-E_Add_Service_Features I will sign up right now for 12mo if you can do it for $40
Payment-E_Missing_Misapplied_Payment Complaints-T_Troubleshooting_Credit_or_Recharge I recharged yesterday and accidentally did it on my son's phone not mine. Can I fix it as my son is now with vodaphone.
Complaints-E_Network Complaints-T_Troubleshooting_Broadband We have a line connectivity issue with our wifi: there was an outage noted in our area from 29 November to 12 December but we still have no wifi. The Telstra diagnosis has said it's a line connectivity issue. What do I do to get back on line?
Service_Management-T_MessageBank_Misc Service_Management-T_MessageBank_Greeting my friend was telling me I need to change my voice mail

Payment-E_Recharge TEMP_Information-T_Credit_Information what are the steps to top up my phone credit
Service_Management-E_Coverage_Area_Inquiry Information-T_Internet_Speed I want to know what the data speeds are in my area
Service_Management-T_Benefits_Add TEMP_Service_Management-T_Benefits_Cancel "Can you help with a problem accession AFL live app? Tried live chat earlier and was told to contact Crowd support but not told how"
Service_Management-T_Reconnect_Service Service_Management-E_Remove_Service_Features I need to get my internet turned back on please
Off_Topic-T_Out_of_Scope Service_Management-E_Price_Plan_Change i want to change my package for my pay tv
Service_Management-T_Price_Plan_Offers Sales-E_Port_In Hi Codi, I would like to talk to someone about joining Telstra and what Plans may be suitable for me.
Account_Management-E_Misc Billing-E_Misc Account Enquiry
Service_Management-T_Carrier_Billing Complaints-T_Troubleshooting_24x7App i would like to purchase app from window

In [53]:
stats = perf_summary(y_val, y_val_pred)
print_perf_summary(stats, rounded=2)

Precision (weighted avg): 0.78
Recall (weighted avg)   : 0.76
F1 Score (weighted avg) : 0.75
Accuracy                : 0.76
ROC AUC (macro avg)     : 0.88
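In both runs the weighted-average recall equals the accuracy. That is expected if `perf_summary` uses the conventional support-weighted definitions, which is an assumption here because the helper's internals are not shown. A minimal pure-Python sketch of those definitions, with the hypothetical function name `weighted_precision_recall`:

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted average precision and recall over the true classes."""
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = 0.0
    recall = 0.0
    for label, count in support.items():
        weight = count / n
        # A class that is never predicted contributes zero precision.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        precision += weight * p
        recall += weight * r
    return precision, recall


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
prec, rec = weighted_precision_recall(toy_true, toy_pred)
print(round(prec, 4), round(rec, 4), round(accuracy(toy_true, toy_pred), 4))
```

Weighting each class's recall by its support cancels the per-class denominators, leaving total correct predictions divided by total examples, which is the accuracy.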


In [None]:
test_lines = []
with tf.gfile.Open(test_csv, 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        test_lines.append(line)

In [None]:
for label, utterance in test_lines[:5]:
    examples = [clf.InputExample(guid=1, text_a=tokenization.convert_to_unicode(utterance))]
    features = clf.convert_examples_to_features(examples,
                                                label_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=tokenizer)
    input_fn = clf.input_fn_builder(
        features=features,
        seq_length=max_seq_length,
        is_training=False,
        drop_remainder=False
    )
    pred = next(estimator.predict(input_fn=input_fn))
    print(classes[int(label)], classes[int(pred)], utterance)
    print('')

In [None]:
eval_examples = processor.get_dev_examples(DATA_DIR)
eval_features = clf.convert_examples_to_features(eval_examples, label_list, max_seq_length, tokenizer)

In [None]:
eval_input_fn = clf.input_fn_builder(
    features=eval_features,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False)

eval_result = estimator.evaluate(input_fn=eval_input_fn, steps=None)

In [None]:
for key in sorted(eval_result.keys()):
    print('  {} = {}'.format(key, eval_result[key]))

In [None]:
test_examples = processor.get_test_examples(DATA_DIR)
test_features = clf.convert_examples_to_features(test_examples, label_list, max_seq_length, tokenizer)

In [None]:
test_input_fn = clf.input_fn_builder(
    features=test_features,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False)

test_result = estimator.evaluate(input_fn=test_input_fn, steps=None)

In [None]:
for key in sorted(test_result.keys()):
    print('  {} = {}'.format(key, test_result[key]))

## Conclusions

1. Inference is slower than API services, so this approach is best suited to batch predictions.