<a href="https://colab.research.google.com/github/MHDBST/BERT_examples/blob/master/SciBERT_TPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning SciBERT on BioMed dataset


In this experiment, we will be pre-training a state-of-the-art Natural Language Understanding model, [BERT](https://arxiv.org/abs/1810.04805), on arbitrary text data using Google Cloud infrastructure.

This guide covers all stages of the procedure, including:

1. Setting up the training environment
2. Downloading raw text data
3. Preprocessing text data
4. Learning a new vocabulary
5. Creating sharded pre-training data
6. Setting up GCS storage for data and model
7. Training the model on a cloud TPU

For persistent storage of training data and model, you will require a Google Cloud Storage bucket. 
Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and GCS bucket. New Google Cloud users have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. 

Steps 1-5 of this tutorial can be run without a GCS bucket for demonstration purposes. In that case, however, you will not be able to train the model.

**Note** 
The only parameter you *really have to set* is BUCKET_NAME in steps 6 and 7. Everything else has default values which should work for most use cases.

**Note** 
Pre-training a BERT-Base model on a TPUv2 will take about 54 hours. Google Colab is not designed for executing such long-running jobs and will interrupt the training process every 8 hours or so. For uninterrupted training, consider using a preemptible TPUv2 instance. 

That said, at the time of writing (09.05.2019), with a Colab TPU, pre-training a BERT model from scratch can be achieved at a negligible cost of storing the said model and data in GCS  (~1 USD).

Now, let's get to business.

MIT License

Copyright (c) [2019] [Antyukhov Denis Olegovich]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Step 1: setting up training environment
First and foremost, we get the packages required to train the model. 
The Jupyter environment allows executing bash commands directly from the notebook by using an exclamation mark ‘!’. I will be exploiting this approach to make use of several other bash commands throughout the experiment.

Now, let’s import the packages and authorize ourselves in Google Cloud.

In [1]:
!pip install sentencepiece
# !git clone https://github.com/google-research/bert

import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf
import sentencepiece as spm

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar

sys.path.append("bert")
!pip install bert_tensorflow

from bert import modeling, optimization, tokenization, run_classifier
# from bert.run_pretraining import input_fn_builder, model_fn_builder

auth.authenticate_user()
  
# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s :  %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False




2020-01-17 02:15:44,058 :  Using TPU runtime
2020-01-17 02:15:44,065 :  TPU address is grpc://10.57.116.162:8470
2020-01-17 02:15:44,066 :  
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## Step 2: getting the data

We begin with obtaining a corpus of raw text data. For this experiment, we will be using the [OpenSubtitles](http://www.opensubtitles.org/) dataset, which is available for 65 languages [here](http://opus.nlpl.eu/OpenSubtitles-v2016.php). 

Unlike more common text datasets (like Wikipedia) it does not require any complex pre-processing. It also comes pre-formatted with one sentence per line.

Feel free to use the dataset for your language instead by changing the language code (en) below.

In [0]:
data_pref = 'gs://ie_scibert/bert_model_classifier/'


For demonstration purposes, we will only use a small fraction of the whole corpus for this experiment. 

When training the real model, make sure to uncheck the DEMO_MODE checkbox to use a 100x larger dataset.

Rest assured, 100M lines are perfectly sufficient to train a reasonably good BERT-base model.

In [0]:
# DEMO_MODE = True #@param {type:"boolean"}

# if DEMO_MODE:
#   CORPUS_SIZE = 1000000
# else:
#   CORPUS_SIZE = 100000000 #@param {type: "integer"}
  
# !(head -n $CORPUS_SIZE dataset.txt) > subdataset.txt
# !mv subdataset.txt dataset.txt

## Step 3: preprocessing text

The raw text data we have downloaded contains punctuation, uppercase letters and non-UTF-8 symbols, which we will remove before proceeding. During inference, we will apply the same normalization procedure to new data.

If your use case requires different preprocessing (e.g. if uppercase letters or punctuation are expected during inference), feel free to modify the function below to accommodate your needs.

In [0]:
regex_tokenizer = nltk.RegexpTokenizer(r"\w+")

def normalize_text(text):
   # lowercase text
   text = str(text).lower()
   # remove non-UTF-8 symbols
   text = text.encode("utf-8", "ignore").decode()
   # remove punctuation symbols
   text = " ".join(regex_tokenizer.tokenize(text))
   return text

def count_lines(filename):
   count = 0
   with open(filename) as fi:
     for line in fi:
       count += 1
   return count

Check how that works.

In [0]:
# normalize_text('Thanks to the advance, they have succeeded in getting over their adversaries.')
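
As a rough, self-contained check of the normalization above, the sketch below reproduces the same three steps with the standard `re` module in place of `nltk.RegexpTokenizer`. That substitution is an assumption made so the snippet runs without NLTK, and `normalize_text_sketch` is a hypothetical name, not part of the notebook.

```python
import re

def normalize_text_sketch(text):
    # lowercase, drop anything that cannot be encoded as UTF-8,
    # then keep only runs of word characters (this removes punctuation)
    text = str(text).lower()
    text = text.encode("utf-8", "ignore").decode("utf-8")
    return " ".join(re.findall(r"\w+", text))

sample = "Thanks to the advance, they have succeeded in getting over their adversaries."
print(normalize_text_sketch(sample))
# thanks to the advance they have succeeded in getting over their adversaries
```

Note that hyphenated terms such as `IL-2` are split into separate words by this rule, which may or may not be what you want for biomedical text.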

Apply normalization to the whole dataset.

In [6]:
data_pref = 'gs://ie_scibert/bert_model_classifier/'
RAW_DATA_FPATH = "train.csv" #@param {type: "string"}
PRC_DATA_FPATH = "proc_train.csv" #@param {type: "string"}

# apply normalization to the dataset
# this will take a minute or two
import tensorflow as tf

# read the raw file from GCS
with tf.gfile.GFile(data_pref + str(RAW_DATA_FPATH), mode='r') as fi:
  lines = fi.readlines()
total_lines = len(lines)
bar = Progbar(total_lines)
# write the normalized lines to a local file
with open(PRC_DATA_FPATH, "w", encoding="utf-8") as fo:
  for l in lines:
    fo.write(normalize_text(l) + "\n")
    bar.add(1)



In [0]:
# f = open(PRC_DATA_FPATH,encoding="utf-8")
# f.readlines()[1].split('\t')


## Step 4: building the vocabulary

For the next step, we will learn a new vocabulary that we will use to represent our dataset. 

The BERT paper uses a WordPiece tokenizer, but the code for learning a WordPiece vocabulary is not available as open source. Instead, we will be using the SentencePiece tokenizer in unigram mode. While it is not directly compatible with BERT, with a small hack we can make it work.

SentencePiece requires quite a lot of RAM, so running it on the full dataset in Colab will crash the kernel. To avoid this, we will randomly subsample a fraction of the dataset for building the vocabulary. Another option would be to use a machine with more RAM for this step - that decision is up to you.

Also, SentencePiece adds BOS and EOS control symbols to the vocabulary by default. We disable them explicitly by setting their indices to -1.

The typical values for VOC_SIZE are somewhere in between 32000 and 128000. We reserve NUM_PLACEHOLDERS tokens in case one wants to update the vocabulary and fine-tune the model after the pre-training phase is finished.

In [7]:
MODEL_PREFIX = "tokenizer" #@param {type: "string"}
# ## 32000
VOC_SIZE = 32000 #@param {type:"integer"} 
# ## 128000
SUBSAMPLE_SIZE = 128000 #@param {type:"integer"}
# ## 256
NUM_PLACEHOLDERS = 256 #@param {type:"integer"}

SPM_COMMAND = ('--input={} --model_prefix={} '
               '--vocab_size={} --input_sentence_size={} '
               '--shuffle_input_sentence=true ' 
               '--bos_id=-1 --eos_id=-1').format(
               PRC_DATA_FPATH, MODEL_PREFIX, 
               VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

spm.SentencePieceTrainer.Train(SPM_COMMAND)

True

Now let's see how we can make SentencePiece tokenizer work for the BERT model. 

Below is a sentence tokenized using the WordPiece vocabulary from a pretrained English [BERT-base](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) model from the official [repo](https://github.com/google-research/bert). 

In [0]:
testcase = "Colorless geothermal substations are generating furiously"
# wordpiece.tokenize("Colorless geothermal substations are generating furiously")



```
>>> wordpiece.tokenize("Colorless geothermal substations are generating furiously")

['color',
 '##less',
 'geo',
 '##thermal',
 'sub',
 '##station',
 '##s',
 'are',
 'generating',
 'furiously']
```



As we can see, the WordPiece tokenizer prepends the subwords which occur in the middle of words with '##'. The subwords occurring at the beginning of words are unchanged. If the subword occurs both in the beginning and in the middle of words, both versions (with and without '##') are added to the vocabulary.
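
To make the '##' convention concrete, here is a minimal sketch of greedy longest-match-first subword splitting, which is the general idea behind WordPiece tokenization of a single word. The function name `wordpiece_sketch` and the tiny `toy_vocab` are hypothetical and only serve the illustration; the real tokenizer in the BERT repo handles more cases (for example, a maximum word length).

```python
def wordpiece_sketch(word, vocab, unk_token="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # piece in the middle of a word
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(current)
        start = end
    return pieces

toy_vocab = {"color", "##less", "sub", "##station", "##s"}
print(wordpiece_sketch("colorless", toy_vocab))    # ['color', '##less']
print(wordpiece_sketch("substations", toy_vocab))  # ['sub', '##station', '##s']
print(wordpiece_sketch("xyz", toy_vocab))          # ['[UNK]']
```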

Now let's have a look at the vocabulary that the SentencePiece tokenizer has learned.

In [9]:
!ls

adc.json	       proc_train.csv  shards		tokenizer.vocab
bert_model_classifier  sample_data     tokenizer.model	vocab.txt


SentencePiece has created two files: tokenizer.model and tokenizer.vocab. Let's have a look at the learned vocabulary:

In [10]:
!head -n 100 tokenizer.vocab

<unk>	0
s	-3.34309
▁the	-3.61561
▁and	-3.6847
▁of	-3.7019
▁protein	-3.71013
▁1	-3.90194
▁in	-3.96808
d	-4.09448
1	-4.41868
f	-4.52776
▁to	-4.55149
ed	-4.65651
ing	-4.66733
_	-4.67664
▁xre	-4.68623
▁a	-4.76593
▁that	-4.82168
▁by	-4.84483
▁0	-4.90863
2	-4.90881
▁chemical	-4.93598
▁other	-4.98606
bibr	-5.01241
ly	-5.07178
y	-5.20206
▁	-5.20729
▁with	-5.23067
3	-5.27961
▁2	-5.29763
▁cells	-5.355
▁family	-5.37114
▁cell	-5.43653
▁is	-5.43692
▁p	-5.44109
e	-5.48877
▁process	-5.53333
▁biological	-5.5991
▁expression	-5.61498
▁increase	-5.68976
▁il	-5.71863
▁induce	-5.76153
▁as	-5.76381
▁complex	-5.77693
a	-5.78655
4	-5.78843
▁be	-5.79557
▁was	-5.8471
▁3	-5.91566
▁for	-5.96643
▁inhibit	-5.96884
▁or	-5.97982
▁which	-5.99791
ion	-6.0563
fig	-6.09327
▁not	-6.11934
▁we	-6.14038
▁activation	-6.15365
▁level	-6.19325
▁c	-6.20907
▁it	-6.23385
▁this	-6.24482
▁apoptosis	-6.25483
▁on	-6.26116
▁also	-6.27227
n	-6.27434
▁cd	-6.30242
▁activate	-6.31407
▁receptor	-6.33635
▁an	-6.3393
c	-6.35721
▁at	-6.36519
▁b

In [11]:
def read_sentencepiece_vocab(filepath):
  voc = []
  with open(filepath, encoding='utf-8') as fi:
    for line in fi:
      voc.append(line.split("\t")[0])
  # skip the first <unk> token
  voc = voc[1:]
  return voc

snt_vocab = read_sentencepiece_vocab("{}.vocab".format(MODEL_PREFIX))
print("Learnt vocab size: {}".format(len(snt_vocab)))
print("Sample tokens: {}".format(random.sample(snt_vocab, 10)))

Learnt vocab size: 31743
Sample tokens: ['▁revert', 'yama', '7326', '▁angiogene', '▁uzu', '▁glioblastoma', '▁3638', '62242', '▁protru', '▁nek']


As we may observe, SentencePiece does quite the opposite to WordPiece. From the [documentation](https://github.com/google/sentencepiece/blob/master/README.md):


SentencePiece first escapes the whitespace with a meta-symbol "▁" (U+2581) as follows:

`Hello▁World`.

Then, this text is segmented into small pieces, for example:

`[Hello] [▁Wor] [ld] [.]`

Subwords which occur after whitespace (which are also those that most words begin with) are prepended with '▁', while others are unchanged. This excludes subwords which only occur at the beginning of sentences and nowhere else. These cases should be quite rare, however. 

So, in order to obtain a vocabulary analogous to WordPiece, we need to perform a simple conversion, removing "▁" from the tokens that contain it and adding "##"  to the ones that don't.

In [0]:
def parse_sentencepiece_token(token):
    if token.startswith("▁"):
        return token[1:]
    else:
        return "##" + token

In [0]:
bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))
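
For illustration, the sketch below applies the same conversion rule to a few tokens taken from the vocabulary listing above. `to_wordpiece_style` is a hypothetical stand-alone copy of the rule, so the snippet runs on its own.

```python
def to_wordpiece_style(token):
    # tokens that follow whitespace lose the "▁" marker;
    # all other tokens are treated as word-internal and get "##"
    if token.startswith("▁"):
        return token[1:]
    return "##" + token

sample_tokens = ["▁the", "s", "▁protein", "ed", "ing"]
print([to_wordpiece_style(t) for t in sample_tokens])
# ['the', '##s', 'protein', '##ed', '##ing']
```

One edge case worth knowing about: a bare "▁" token, which does appear in the listing above, maps to an empty string under this rule.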

We also add some special control symbols which are required by the BERT architecture. By convention, we put those at the beginning of the vocabulary.

In [0]:
ctrl_symbols = ["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]"]
bert_vocab = ctrl_symbols + bert_vocab

We also append some placeholder tokens to the vocabulary. Those are useful if one wishes to update the pre-trained model with new, task-specific tokens. 

In that case, the placeholder tokens are replaced with new real ones, the pre-training data is re-generated, and the model is fine-tuned on new data.

In [15]:
bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]
print(len(bert_vocab))

32000
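
As a quick arithmetic check of how the vocabulary budget adds up, the sketch below recomputes the sizes from the parameters used earlier. The variable names other than VOC_SIZE and NUM_PLACEHOLDERS are introduced here for illustration only.

```python
VOC_SIZE = 32000
NUM_PLACEHOLDERS = 256
NUM_CTRL_SYMBOLS = 5  # [PAD], [UNK], [CLS], [SEP], [MASK]

spm_vocab_size = VOC_SIZE - NUM_PLACEHOLDERS   # size requested from SentencePiece
learnt_tokens = spm_vocab_size - 1             # <unk> is skipped when reading the vocab
with_ctrl = learnt_tokens + NUM_CTRL_SYMBOLS   # after prepending the control symbols
num_unused = VOC_SIZE - with_ctrl              # [UNUSED_i] tokens actually appended

print(spm_vocab_size, learnt_tokens, with_ctrl, num_unused)
# 31744 31743 31748 252
```

The 31743 matches the learnt vocab size printed above. Note that the number of placeholder tokens ends up slightly below NUM_PLACEHOLDERS, because the control symbols are counted against the same budget.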


Finally, we write the obtained vocabulary to file.

In [0]:
VOC_FNAME = "vocab.txt" #@param {type:"string"}

with open(VOC_FNAME, "w") as fo:
  for token in bert_vocab:
    fo.write(token+"\n")

Now let's see how the new vocabulary works in practice:

In [0]:
bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)
# bert_tokenizer.tokenize(testcase)

Looking good!

## Step 5: generating pre-training data

With the vocabulary at hand, we are ready to generate pre-training data for the BERT model. Since our dataset might be quite large, we will split it into shards:

In [18]:
!mkdir ./shards
!split -a 4 -l 256000 -d $PRC_DATA_FPATH ./shards/shard_
!ls ./shards/

mkdir: cannot create directory ‘./shards’: File exists
shard_0000  shard_0003	shard_0006  shard_0009	shard_0012
shard_0001  shard_0004	shard_0007  shard_0010
shard_0002  shard_0005	shard_0008  shard_0011
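
If `split` is not available in your environment, the same sharding can be approximated in plain Python. This is a sketch under the assumption that the dataset is a UTF-8 text file with one example per line; `split_into_shards` is a hypothetical helper, and the usage below runs on a small temporary file rather than the real dataset.

```python
import os
import tempfile

def split_into_shards(path, out_dir, lines_per_shard, prefix="shard_"):
    """Write consecutive chunks of `path` to out_dir/shard_0000, shard_0001, ..."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    with open(path, encoding="utf-8") as fi:
        for i, line in enumerate(fi):
            if i % lines_per_shard == 0:
                # start a new shard every `lines_per_shard` lines
                if out is not None:
                    out.close()
                name = "{}{:04d}".format(prefix, len(shard_paths))
                shard_path = os.path.join(out_dir, name)
                out = open(shard_path, "w", encoding="utf-8")
                shard_paths.append(shard_path)
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths

# toy usage: 10 lines split into shards of 4 lines
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.txt")
with open(toy_path, "w", encoding="utf-8") as fo:
    fo.writelines("line {}\n".format(i) for i in range(10))

shards = split_into_shards(toy_path, os.path.join(tmp_dir, "shards"), 4)
print([os.path.basename(p) for p in shards])
# ['shard_0000', 'shard_0001', 'shard_0002']
```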


Before we start generating, we need to set some model-specific parameters.  

In [0]:
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param
MAX_PREDICTIONS = 20 #@param {type:"integer"}
DO_LOWER_CASE = True #@param {type:"boolean"}
PROCESSES = 2 #@param {type:"integer"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
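
To see how MASKED_LM_PROB and MAX_PREDICTIONS interact, the sketch below computes how many tokens would be masked in a sequence of a given length. It follows the capping rule that the official data-generation script appears to use; treat the exact rounding as an assumption, and `num_masked_tokens` as a hypothetical helper.

```python
def num_masked_tokens(num_tokens, masked_lm_prob=0.15, max_predictions=20):
    # mask roughly masked_lm_prob of the tokens, at least one,
    # and never more than max_predictions per sequence
    return min(max_predictions, max(1, int(round(num_tokens * masked_lm_prob))))

for n in (20, 128, 512):
    print(n, num_masked_tokens(n))
# 20 3
# 128 19
# 512 20
```

With MAX_SEQ_LENGTH = 128 the cap of 20 is almost never hit, which is why the two defaults go well together. For longer sequences, MAX_PREDICTIONS would need to grow accordingly.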

Now, for each shard we need to call the *create_pretraining_data.py* script. To that end, we will employ the *xargs* command.

Running this might take quite some time depending on the size of your dataset.

In [0]:
# XARGS_CMD = ("ls ./shards/ | "
#              "xargs -n 1 -P {} -I{} "
#              "python3 bert/create_pretraining_data.py "
#              "--input_file=./shards/{} "
#              "--output_file={}/{}.tfrecord "
#              "--vocab_file={} "
#              "--do_lower_case={} "
#              "--max_predictions_per_seq={} "
#              "--max_seq_length={} "
#              "--masked_lm_prob={} "
#              "--random_seed=34 "
#              "--dupe_factor=5")

# XARGS_CMD = XARGS_CMD.format(PROCESSES, '{}', '{}', PRETRAINING_DIR, '{}', 
#                              VOC_FNAME, DO_LOWER_CASE, 
#                              MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)

In [0]:
# tf.gfile.MkDir(PRETRAINING_DIR)
# !$XARGS_CMD

## Step 6: setting up persistent storage

To preserve our hard-earned assets, we will persist them to Google Cloud Storage. Provided that you have created the GCS bucket, this should be simple.

We will create two directories in GCS, one for the data and one for the model.
In the model directory, we will put the model vocabulary and configuration file.

**Configure your BUCKET_NAME variable here before proceeding, otherwise the model and data will not be saved.**

In [0]:
BUCKET_NAME = "ie_scibert" #@param {type:"string"}
MODEL_DIR = "bert_model_classifier" #@param {type:"string"}
tf.gfile.MkDir(MODEL_DIR)

if not BUCKET_NAME:
  log.warning("WARNING: BUCKET_NAME is not set. "
              "You will not be able to train the model.")

Below is the sample hyperparameter configuration for BERT-base. Change at your own risk.

In [0]:
# use this for BERT-base

bert_base_config = {
  "attention_probs_dropout_prob": 0.1, 
  "directionality": "bidi", 
  "hidden_act": "gelu", 
  "hidden_dropout_prob": 0.1, 
  "hidden_size": 768, 
  "initializer_range": 0.02, 
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12, 
  "num_hidden_layers": 12, 
  "pooler_fc_size": 768, 
  "pooler_num_attention_heads": 12, 
  "pooler_num_fc_layers": 3, 
  "pooler_size_per_head": 128, 
  "pooler_type": "first_token_transform", 
  "type_vocab_size": 2, 
  "vocab_size": VOC_SIZE
}

with open("{}/bert_config.json".format(MODEL_DIR), "w") as fo:
  json.dump(bert_base_config, fo, indent=2)
  
with open("{}/{}".format(MODEL_DIR, VOC_FNAME), "w") as fo:
  for token in bert_vocab:
    fo.write(token+"\n")
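
Before uploading the configuration, it can help to sanity-check a couple of values that must agree with each other. The sketch below is a minimal check on a reduced copy of the config: the hidden size has to be a multiple of the number of attention heads, and the configured vocabulary size should match the number of lines written to the vocabulary file. `toy_vocab_len` is a stand-in for `len(bert_vocab)` and is an assumption of this example.

```python
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "vocab_size": 32000,
}
toy_vocab_len = 32000  # stands in for len(bert_vocab)

# each attention head gets an equal slice of the hidden size
assert config["hidden_size"] % config["num_attention_heads"] == 0
head_size = config["hidden_size"] // config["num_attention_heads"]

# the embedding table needs one row per vocabulary entry
assert config["vocab_size"] == toy_vocab_len

print("size per attention head:", head_size)
# size per attention head: 64
```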

In [24]:
if BUCKET_NAME:
  !gsutil -m cp -r $MODEL_DIR $PRETRAINING_DIR gs://$BUCKET_NAME

CommandException: No URLs matched: pretraining_data
Copying file://bert_model_classifier/bert_config.json [Content-Type=application/json]...
Copying file://bert_model_classifier/vocab.txt [Content-Type=text/plain]...
/ [2/2 files][219.3 KiB/219.3 KiB] 100% Done                                    
Operation completed over 2 objects/219.3 KiB.                                    
CommandException: 1 file/object could not be transferred.


## Step 7: training the model

We are almost ready to begin training our model. If you wish to continue an interrupted training run, you may skip steps 2-6 and proceed from here.

**Make sure that you have set the BUCKET_NAME here as well.**

In [25]:
BUCKET_NAME = "ie_scibert" #@param {type:"string"}
MODEL_DIR = "bert_model_classifier" #@param {type:"string"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
VOC_FNAME = "vocab.txt" #@param {type:"string"}

# Input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 2500 #@param {type:"integer"}
NUM_TPU_CORES = 8

if BUCKET_NAME:
  BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
else:
  BUCKET_PATH = "."

BERT_GCS_DIR = "{}/{}".format(BUCKET_PATH, MODEL_DIR)
DATA_GCS_DIR = "{}/{}".format(BUCKET_PATH, PRETRAINING_DIR)

VOCAB_FILE = os.path.join(BERT_GCS_DIR, VOC_FNAME)
CONFIG_FILE = os.path.join(BERT_GCS_DIR, "bert_config.json")

INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR,'*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))
log.info("Using {} data shards".format(len(input_files)))

2020-01-17 02:18:02,435 :  Using checkpoint: gs://ie_scibert/bert_model_classifier/model.ckpt-267500
2020-01-17 02:18:02,436 :  Using 2 data shards
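
When resuming from a checkpoint like the one logged above, it is handy to know how many steps are left. The sketch below parses the step number from the checkpoint path; `checkpoint_step` is a hypothetical helper, and it assumes the usual `model.ckpt-<step>` naming.

```python
import re

def checkpoint_step(checkpoint_path):
    """Return the global step encoded in a 'model.ckpt-<step>' path, or 0 if none."""
    if not checkpoint_path:
        return 0  # no checkpoint found: training would start from scratch
    match = re.search(r"ckpt-(\d+)$", checkpoint_path)
    return int(match.group(1)) if match else 0

TRAIN_STEPS = 1000000
ckpt = "gs://ie_scibert/bert_model_classifier/model.ckpt-267500"
done = checkpoint_step(ckpt)
print(done, TRAIN_STEPS - done)
# 267500 732500
```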


Prepare the training run configuration, build the estimator and input function, power up the bass cannon.

In [0]:
# model_fn = model_fn_builder(
#       bert_config=bert_config,
#       init_checkpoint=INIT_CHECKPOINT,
#       learning_rate=LEARNING_RATE,
#       num_train_steps=TRAIN_STEPS,
#       num_warmup_steps=10,
#       use_tpu=USE_TPU,
#       use_one_hot_embeddings=True)

# tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

# run_config = tf.contrib.tpu.RunConfig(
#     cluster=tpu_cluster_resolver,
#     model_dir=BERT_GCS_DIR,
#     save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
#     tpu_config=tf.contrib.tpu.TPUConfig(
#         iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
#         num_shards=NUM_TPU_CORES,
#         per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

# estimator = tf.contrib.tpu.TPUEstimator(
#     use_tpu=USE_TPU,
#     model_fn=model_fn,
#     config=run_config,
#     train_batch_size=TRAIN_BATCH_SIZE,
#     eval_batch_size=EVAL_BATCH_SIZE)
  
# train_input_fn = input_fn_builder(
#         input_files=input_files,
#         max_seq_length=MAX_SEQ_LENGTH,
#         max_predictions_per_seq=MAX_PREDICTIONS,
#         is_training=True)

Fire!

In [0]:
# estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)

Training the model with the default parameters for 1 million steps will take ~53 hours. 

In case the kernel is restarted, you may always continue training from the latest checkpoint. 
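
For planning purposes, the estimate above can be turned into a rough throughput and a number of Colab sessions. This is back-of-the-envelope arithmetic only: it assumes constant throughput and sessions of exactly 8 hours, neither of which is guaranteed.

```python
import math

TRAIN_STEPS = 1000000
TOTAL_HOURS = 53               # estimate quoted above
SESSION_HOURS = 8              # approximate Colab session length (assumption)
SAVE_CHECKPOINTS_STEPS = 2500

steps_per_second = TRAIN_STEPS / (TOTAL_HOURS * 3600)
sessions_needed = math.ceil(TOTAL_HOURS / SESSION_HOURS)
minutes_between_checkpoints = SAVE_CHECKPOINTS_STEPS / steps_per_second / 60

print(round(steps_per_second, 2), sessions_needed, round(minutes_between_checkpoints, 2))
# 5.24 7 7.95
```

Under these assumptions a checkpoint is written roughly every 8 minutes, so an interrupted session loses only a few minutes of training.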

This concludes the guide to pre-training BERT from scratch on a cloud TPU. However, the really fun stuff is still to come, so stay tuned.

Keep learning!

In [28]:
!gsutil cp gs://hermes_assets/russian_uncased_L-12_H-768_A-12.zip gs://bert_resourses/

AccessDeniedException: 403 mohaddeseh.bastan@stonybrook.edu does not have storage.objects.list access to hermes_assets.


# Run classifier for the model


In [0]:

# import pandas as pd
# train = pd.read_csv(tf.gfile.GFile(data_pref + str('train.csv')), encoding='latin-1')[:200000]
# print(len(train))
# #pd.read_csv(open('train.csv'),error_bad_lines=False)
# train = train.dropna()
# label_list = [0,1]
# DATA_COLUMN = 'sentence'
# LABEL_COLUMN = 'reg_label'
# train_InputExamples = train.apply(lambda x: run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
#                                                                    text_a = x[DATA_COLUMN], 
#                                                                    text_b = None, 
#                                                                    label = x[LABEL_COLUMN]), axis = 1)




In [0]:
import pickle

# train_features = run_classifier.convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, bert_tokenizer)

# f = tf.gfile.GFile(data_pref + str('train_features.out'), mode='rb')#file_io.FileIO(data_pref + str('train_features.out'),mode='r')
# train_features = pickle.load(f,encoding='utf-8')


In [0]:

# len(train_features)
# import pickle
# pickle.dump(train_features,open('train_features.out','wb') )

In [0]:
import datetime

def model_train(estimator):
  # Relies on train_InputExamples and train_features, which are built in the
  # (currently commented-out) cells above.
  # Sequences are at most MAX_SEQ_LENGTH tokens long.
  print('***** Started training at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(train_InputExamples)))
  print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
  tf.logging.info("  Num steps = %d", TRAIN_STEPS)
  train_input_fn = run_classifier.input_fn_builder(
      features=train_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=True,
      drop_remainder=True)
  estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)
  print('***** Finished training at {} *****'.format(datetime.datetime.now()))


In [32]:

model_fn = run_classifier.model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=INIT_CHECKPOINT,
      learning_rate=LEARNING_RATE,
      num_train_steps=TRAIN_STEPS,
      num_warmup_steps=10,
      use_tpu=USE_TPU,
      use_one_hot_embeddings=True,
      num_labels = 2)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=8)
  
# train_input_fn = run_classifier.input_fn_builder(
#     features = train_features,
#     seq_length = MAX_SEQ_LENGTH,
#     drop_remainder = True,
#     is_training=True)

2020-01-17 02:18:05,580 :  Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f6d00482620>) includes params argument, but params are not passed to Estimator.
2020-01-17 02:18:05,582 :  Using config: {'_model_dir': 'gs://ie_scibert/bert_model_classifier', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 2500, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.57.116.162:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f6d4103f

In [0]:
# model_train(estimator=estimator)

In [0]:
def model_predict(estimator,prediction_examples,input_features,checkpoint_path=None):
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=True)
  if checkpoint_path: 
    predictions = estimator.predict(predict_input_fn,checkpoint_path=checkpoint_path)
  else:
    predictions = estimator.predict(predict_input_fn)
  return [(sentence, prediction['probabilities']) for sentence, prediction in zip(prediction_examples, predictions)]
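
Since `model_predict` returns pairs of an example and its class probabilities, a natural next step is to turn those probabilities into labels and compare them with the gold labels. The sketch below does this on toy data; `summarize_predictions` is a hypothetical helper, and it assumes the probability columns follow the order of `label_list`, i.e. `[0, 1]`.

```python
def summarize_predictions(probabilities, gold_labels):
    """Pick the most probable class per example and compute accuracy."""
    predicted = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    correct = sum(int(p == g) for p, g in zip(predicted, gold_labels))
    return predicted, correct / len(predicted)

# toy data standing in for the probabilities returned by the estimator
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_gold = [0, 1, 1, 1]

labels, accuracy = summarize_predictions(toy_probs, toy_gold)
print(labels, accuracy)
# [0, 1, 0, 1] 0.75
```

Keep in mind that with `drop_remainder=True` the last incomplete batch is dropped, so the number of predictions can be slightly smaller than the number of examples.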



In [35]:
import pandas as pd
dev = pd.read_csv(tf.gfile.GFile(data_pref + str('dev.csv')), encoding='latin-1')
print(len(dev))
dev = dev.dropna()
label_list = [0,1]
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'reg_label'
dev_InputExamples = dev.apply(lambda x: run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

dev_features = run_classifier.convert_examples_to_features(dev_InputExamples, label_list, MAX_SEQ_LENGTH, bert_tokenizer)


347943


2020-01-17 02:18:17,775 :  From /usr/local/lib/python3.6/dist-packages/bert/run_classifier.py:774: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

2020-01-17 02:18:17,776 :  Writing example 0 of 347689
2020-01-17 02:18:17,778 :  *** Example ***
2020-01-17 02:18:17,779 :  guid: None
2020-01-17 02:18:17,779 :  tokens: [CLS] one possibl ##e explanation for these results is that the know ##n amide [UNK] bond [UNK] cleavage activity of ra ##vz targets not only lc ##3 [UNK] pe but also ubiquitin conjugate ##d molecule ##s on scv ##s [UNK] [UNK] [UNK] [UNK] [UNK] in this context [UNK] ubiquitin or lc ##3 locate ##d on not only bacterial vacuole ##s but also cytosol ##ic bacteria mig ##ht be target ##ed by ra ##vz activity [UNK] [SEP]
2020-01-17 02:18:17,780 :  input_ids: 2 410 538 39 3026 53 94 146 37 21 6 335 69 5963 1 2467 1 801 80 8 399 22851 640 59 249 708 32 1 2311 92 68 675 3479 12 391 5 67 20574 5 1 1 1 1 1 11 65 1581 1 675 55 708 32 2110 12 67 59

In [36]:
tf.logging.set_verbosity(tf.logging.ERROR)

predictions = model_predict(estimator=estimator, prediction_examples=dev_InputExamples, input_features=dev_features)

2020-01-17 02:28:13,378 :  Error recorded from infeed: Step was cancelled by an explicit call to `Session::Close()`.
