<a href="https://colab.research.google.com/github/MHDBST/BERT_examples/blob/master/SciBERT_TPU_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning SciBERT on BioMed dataset


In this experiment, we will be pre-training a state-of-the-art Natural Language Understanding model, [BERT](https://arxiv.org/abs/1810.04805), on arbitrary text data using Google Cloud infrastructure.

This guide covers all stages of the procedure, including:

1. Setting up the training environment
2. Downloading raw text data
3. Preprocessing text data
4. Learning a new vocabulary
5. Creating sharded pre-training data
6. Setting up GCS storage for data and model
7. Training the model on a cloud TPU

For persistent storage of training data and model, you will require a Google Cloud Storage bucket. 
Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and GCS bucket. New Google Cloud users have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. 

Steps 1-5 of this tutorial can be run without a GCS bucket for demonstration purposes. In that case, however, you will not be able to train the model.

**Note** 
The only parameter you *really have to set* is BUCKET_NAME in steps 6 and 7. Everything else has default values which should work for most use cases.

**Note** 
Pre-training a BERT-Base model on a TPUv2 will take about 54 hours. Google Colab is not designed for executing such long-running jobs and will interrupt the training process every 8 hours or so. For uninterrupted training, consider using a preemptible TPUv2 instance. 

That said, at the time of writing (09.05.2019), with a Colab TPU, pre-training a BERT model from scratch can be achieved at the negligible cost of storing the model and data in GCS (~1 USD).

Now, let's get to business.

MIT License

Copyright (c) [2019] [Antyukhov Denis Olegovich]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Step 1: setting up training environment
First and foremost, we get the packages required to train the model. 
The Jupyter environment allows executing bash commands directly from the notebook by using an exclamation mark ‘!’. I will be exploiting this approach to make use of several other bash commands throughout the experiment.

Now, let’s import the packages and authorize ourselves in Google Cloud.

In [1]:
!pip install sentencepiece
# !git clone https://github.com/google-research/bert
!git clone https://github.com/allenai/scibert.git

import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf
import sentencepiece as spm

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar

sys.path.append("bert")
!pip install bert_tensorflow

from bert import modeling, optimization, tokenization, run_classifier
# from bert.run_pretraining import input_fn_builder, model_fn_builder

auth.authenticate_user()
  
# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s :  %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/74/f4/2d5214cbf13d06e7cb2c20d84115ca25b53ea76fa1f0ade0e3c9749de214/sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

2020-01-22 23:10:11,557 :  Using TPU runtime
2020-01-22 23:10:11,567 :  TPU address is grpc://10.106.209.170:8470
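
The runtime check in the cell above hinges on a single environment variable. Here is a minimal sketch of that logic, pulled into a function so it can be exercised without a TPU. `resolve_tpu_address` is a hypothetical helper name, not part of Colab or TensorFlow.

```python
import os

def resolve_tpu_address(environ=None):
    """Return (use_tpu, address) based on the COLAB_TPU_ADDR variable."""
    environ = os.environ if environ is None else environ
    if "COLAB_TPU_ADDR" in environ:
        return True, "grpc://" + environ["COLAB_TPU_ADDR"]
    return False, None

# simulate both runtimes by passing a plain dict instead of os.environ
print(resolve_tpu_address({"COLAB_TPU_ADDR": "10.106.209.170:8470"}))
print(resolve_tpu_address({}))
```

Passing the environment as an argument keeps the check testable. The notebook itself reads `os.environ` directly.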


## Step 2: getting the data

We begin with obtaining a corpus of raw text data. For this experiment, we will be using the [OpenSubtitles](http://www.opensubtitles.org/) dataset, which is available for 65 languages [here](http://opus.nlpl.eu/OpenSubtitles-v2016.php). 

Unlike more common text datasets (like Wikipedia) it does not require any complex pre-processing. It also comes pre-formatted with one sentence per line.

Feel free to use the dataset for your language instead by changing the language code (en) below.

In [0]:
data_pref = 'gs://ie_scibert/bert_model_classifier/'


For demonstration purposes, we will only use a small fraction of the whole corpus for this experiment. 

When training the real model, make sure to uncheck the DEMO_MODE checkbox to use a 100x larger dataset.

Rest assured, 100M lines are perfectly sufficient to train a reasonably good BERT-base model.

In [0]:
# DEMO_MODE = True #@param {type:"boolean"}

# if DEMO_MODE:
#   CORPUS_SIZE = 1000000
# else:
#   CORPUS_SIZE = 100000000 #@param {type: "integer"}
  
# !(head -n $CORPUS_SIZE dataset.txt) > subdataset.txt
# !mv subdataset.txt dataset.txt

## Step 3: preprocessing text

The raw text data we have downloaded contains punctuation, uppercase letters and non-UTF-8 symbols, which we will remove before proceeding. During inference, we will apply the same normalization procedure to new data.

If your use case requires different preprocessing (e.g. if uppercase letters or punctuation are expected during inference), feel free to modify the function below to accommodate your needs.

In [0]:
# regex_tokenizer = nltk.RegexpTokenizer(r"\w+")

# def normalize_text(text):
#    # lowercase text
#    text = str(text).lower()
#    # remove non-UTF-8 symbols
#    text = text.encode("utf-8", "ignore").decode()
#    # remove punctuation symbols
#    text = " ".join(regex_tokenizer.tokenize(text))
#    return text

# def count_lines(filename):
#    count = 0
#    with open(filename) as fi:
#      for line in fi:
#        count += 1
#    return count

Check how that works.

In [0]:
# normalize_text('Thanks to the advance, they have succeeded in getting over their adversaries.')
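
For readers who want to try the normalization without NLTK, here is a minimal sketch that swaps `nltk.RegexpTokenizer` for the standard-library `re.findall` with the same `\w+` pattern. It is a stand-in for the commented-out function, not a copy of it.

```python
import re

def normalize_text(text):
    # lowercase text
    text = str(text).lower()
    # drop anything that cannot be encoded as UTF-8
    text = text.encode("utf-8", "ignore").decode("utf-8")
    # keep runs of word characters only, which strips punctuation
    return " ".join(re.findall(r"\w+", text))

normalized = normalize_text(
    "Thanks to the advance, they have succeeded in getting over their adversaries.")
print(normalized)
# thanks to the advance they have succeeded in getting over their adversaries
```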

Apply normalization to the whole dataset.

In [0]:
#  data_pref = 'gs://ie_scibert/bert_model_classifier/'
#  RAW_DATA_FPATH = "train.csv" #@param {type: "string"}
#  PRC_DATA_FPATH = "proc_train.csv" #@param {type: "string"}

# # # apply normalization to the dataset
# # # this will take a minute or two
# import tensorflow as tf
 
#  fi = tf.gfile.GFile(data_pref +str(RAW_DATA_FPATH),mode='r')
#  lines =  fi.readlines()
#  total_lines = len(lines)
#  bar = Progbar(total_lines)
#  # #  with open(fraw,encoding = 'utf-8') as ff:
#  with open(PRC_DATA_FPATH, "w",encoding="utf-8") as fo:
#      for l in lines:
#        fo.write(normalize_text(l)+"\n")
#        bar.add(1)

In [0]:
# f = open(PRC_DATA_FPATH,encoding="utf-8")
# f.readlines()[1].split('\t')


## Step 4: building the vocabulary

For the next step, we will learn a new vocabulary that we will use to represent our dataset. 

The BERT paper uses a WordPiece tokenizer, which is not available as open source. Instead, we will be using the SentencePiece tokenizer in unigram mode. While it is not directly compatible with BERT, with a small hack we can make it work.

SentencePiece requires quite a lot of RAM, so running it on the full dataset in Colab will crash the kernel. To avoid this, we will randomly subsample a fraction of the dataset for building the vocabulary. Another option would be to use a machine with more RAM for this step - that decision is up to you.

Also, SentencePiece adds BOS and EOS control symbols to the vocabulary by default. We disable them explicitly by setting their indices to -1.

The typical values for VOC_SIZE are somewhere in between 32000 and 128000. We reserve NUM_PLACEHOLDERS tokens in case one wants to update the vocabulary and fine-tune the model after the pre-training phase is finished.

In [0]:
MODEL_PREFIX = "tokenizer" #@param {type: "string"}
# ## 32000
VOC_SIZE = 32000 #@param {type:"integer"} 
# ## 128000
SUBSAMPLE_SIZE = 128000 #@param {type:"integer"}
# ## 256
NUM_PLACEHOLDERS = 256 #@param {type:"integer"}

# SPM_COMMAND = ('--input={} --model_prefix={} '
#                '--vocab_size={} --input_sentence_size={} '
#                '--shuffle_input_sentence=true ' 
#                '--bos_id=-1 --eos_id=-1').format(
#                PRC_DATA_FPATH, MODEL_PREFIX, 
#                VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

# spm.SentencePieceTrainer.Train(SPM_COMMAND)

Now let's see how we can make SentencePiece tokenizer work for the BERT model. 

Below is a sentence tokenized using the WordPiece vocabulary from a pretrained English [BERT-base](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) model from the official [repo](https://github.com/google-research/bert). 

In [0]:
testcase = "Colorless geothermal substations are generating furiously"
# wordpiece.tokenize("Colorless geothermal substations are generating furiously")



```
>>> wordpiece.tokenize("Colorless geothermal substations are generating furiously")

['color',
 '##less',
 'geo',
 '##thermal',
 'sub',
 '##station',
 '##s',
 'are',
 'generating',
 'furiously']
```



As we can see, the WordPiece tokenizer prepends the subwords which occur in the middle of words with '##'. The subwords occurring at the beginning of words are unchanged. If the subword occurs both in the beginning and in the middle of words, both versions (with and without '##') are added to the vocabulary.
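
The '##' convention is easiest to see in code. Below is a minimal sketch of greedy longest-match-first subword splitting, the strategy the WordPiece tokenizer in the BERT repository follows. The ten-entry `toy_vocab` is made up for this example and is not the real BERT vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subwords, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"color", "##less", "geo", "##thermal", "sub", "##station",
             "##s", "are", "generating", "furiously"}
sentence = "colorless geothermal substations are generating furiously"
tokens = [p for w in sentence.split() for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)
```

With this toy vocabulary the sketch reproduces the ten tokens listed above. A word with no matching pieces collapses to a single `[UNK]`.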

Now let's have a look at the vocabulary that the SentencePiece tokenizer has learned.

In [0]:
# !ls

SentencePiece has created two files: tokenizer.model and tokenizer.vocab. Let's have a look at the learned vocabulary:

In [0]:
# !head -n 100 tokenizer.vocab

In [0]:
# def read_sentencepiece_vocab(filepath):
#   voc = []
#   with open(filepath, encoding='utf-8') as fi:
#     for line in fi:
#       voc.append(line.split("\t")[0])
#   # skip the first <unk> token
#   voc = voc[1:]
#   return voc

# snt_vocab = read_sentencepiece_vocab("{}.vocab".format(MODEL_PREFIX))
# print("Learnt vocab size: {}".format(len(snt_vocab)))
# print("Sample tokens: {}".format(random.sample(snt_vocab, 10)))

As we may observe, SentencePiece does quite the opposite of WordPiece. From the [documentation](https://github.com/google/sentencepiece/blob/master/README.md):


SentencePiece first escapes the whitespace with a meta-symbol "▁" (U+2581) as follows:

`Hello▁World`.

Then, this text is segmented into small pieces, for example:

`[Hello] [▁Wor] [ld] [.]`

Subwords which occur after whitespace (which are also those that most words begin with) are prepended with '▁', while others are unchanged. This excludes subwords which only occur at the beginning of sentences and nowhere else. These cases should be quite rare, however. 
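
To make the whitespace handling concrete, here is a small sketch of the escaping step and its inverse. The piece boundaries are copied from the documentation example above rather than produced by a trained model, and `escape_whitespace` and `detokenize` are hypothetical helper names.

```python
META = "\u2581"  # the "▁" meta-symbol

def escape_whitespace(text):
    # every space becomes the visible meta-symbol
    return text.replace(" ", META)

def detokenize(pieces):
    # joining the pieces and restoring spaces recovers the original text
    return "".join(pieces).replace(META, " ")

escaped = escape_whitespace("Hello World.")
pieces = ["Hello", META + "Wor", "ld", "."]
print(escaped)
print(detokenize(pieces))
```

Because whitespace is kept as an ordinary symbol, segmentation is lossless: concatenating the pieces always gives back the escaped input.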

So, in order to obtain a vocabulary analogous to WordPiece, we need to perform a simple conversion, removing "▁" from the tokens that contain it and adding "##"  to the ones that don't.

In [0]:
# def parse_sentencepiece_token(token):
#     if token.startswith("▁"):
#         return token[1:]
#     else:
#         return "##" + token

In [0]:
# bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))
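
A self-contained sketch of the conversion on a handful of tokens follows. The sample tokens are invented for illustration and do not come from the learned vocabulary.

```python
def parse_sentencepiece_token(token):
    # word-initial pieces lose the marker, word-internal pieces gain "##"
    if token.startswith("\u2581"):
        return token[1:]
    return "##" + token

sample = ["\u2581geo", "thermal", "\u2581sub", "station", "s"]
converted = [parse_sentencepiece_token(t) for t in sample]
print(converted)
# ['geo', '##thermal', 'sub', '##station', '##s']
```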

We also add some special control symbols which are required by the BERT architecture. By convention, we put those at the beginning of the vocabulary.

In [0]:
# ctrl_symbols = ["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]"]
# bert_vocab = ctrl_symbols + bert_vocab

We also append some placeholder tokens to the vocabulary. Those are useful if one wishes to update the pre-trained model with new, task-specific tokens. 

In that case, the placeholder tokens are replaced with new real ones, the pre-training data is re-generated, and the model is fine-tuned on new data.

In [0]:
# bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]
# print(len(bert_vocab))
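
The bookkeeping behind the padding can be checked by hand. The sketch below assumes SentencePiece returns exactly the requested number of pieces, including `<unk>`. If it learns fewer, the number of placeholders grows accordingly.

```python
VOC_SIZE = 32000
NUM_PLACEHOLDERS = 256
ctrl_symbols = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

requested = VOC_SIZE - NUM_PLACEHOLDERS   # size passed to SentencePiece
learned = requested - 1                   # <unk> is skipped when reading the file
used = len(ctrl_symbols) + learned        # control symbols plus subwords
unused = VOC_SIZE - used                  # slots filled with [UNUSED_i]

print(requested, learned, used, unused)
# 31744 31743 31748 252
```

Under these assumptions the final vocabulary carries 252 placeholder tokens rather than 256. Five control symbols are added and one `<unk>` is dropped, which uses up four of the reserved slots.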

Finally, we write the obtained vocabulary to file.

In [0]:
# VOC_FNAME = "vocab.txt" #@param {type:"string"}

# with open(VOC_FNAME, "w") as fo:
#   for token in bert_vocab:
#     fo.write(token+"\n")

Now let's see how the new vocabulary works in practice:

In [0]:
# bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)
# # bert_tokenizer.tokenize(testcase)

Looking good!

## Step 5: generating pre-training data

With the vocabulary at hand, we are ready to generate pre-training data for the BERT model. Since our dataset might be quite large, we will split it into shards:

In [18]:
!mkdir ./shards
!split -a 4 -l 256000 -d $PRC_DATA_FPATH ./shards/shard_
!ls ./shards/

split: cannot open './shards/shard_' for reading: No such file or directory
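
The error above is consistent with `PRC_DATA_FPATH` being undefined in this run (its assignment is commented out in Step 3), which would make `split` treat `./shards/shard_` as the input file. For reference, here is a pure-Python sketch of the same sharding, with four-digit numeric suffixes like `split -a 4 -d`. `shard_file` is a hypothetical helper, and the tiny demo file is generated on the fly.

```python
import os
import tempfile

def shard_file(path, out_dir, lines_per_shard, prefix="shard_"):
    """Write consecutive chunks of `path` to out_dir/prefix0000, prefix0001, ..."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths, out = [], None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_shard == 0:
                # start a new shard every lines_per_shard lines
                if out is not None:
                    out.close()
                name = "{}{:04d}".format(prefix, len(shard_paths))
                shard_paths.append(os.path.join(out_dir, name))
                out = open(shard_paths[-1], "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths

workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "demo.txt")
with open(demo_path, "w", encoding="utf-8") as fo:
    fo.write("".join("line {}\n".format(i) for i in range(10)))

shards = shard_file(demo_path, os.path.join(workdir, "shards"), lines_per_shard=4)
print([os.path.basename(p) for p in shards])
```

Ten lines with four lines per shard give three shards, the last one holding the two leftover lines.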


Before we start generating, we need to set some model-specific parameters.  

In [0]:
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param
MAX_PREDICTIONS = 20 #@param {type:"integer"}
DO_LOWER_CASE = True #@param {type:"boolean"}
PROCESSES = 2 #@param {type:"integer"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
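
These three values interact. Assuming the rule used by `create_pretraining_data.py` in the BERT repository, the number of masked positions in a sequence is its length times `masked_lm_prob`, rounded, at least 1, and capped at `max_predictions_per_seq`. A sketch of that rule follows; the helper name `num_masked_positions` is made up here.

```python
MAX_SEQ_LENGTH = 128
MASKED_LM_PROB = 0.15
MAX_PREDICTIONS = 20

def num_masked_positions(num_tokens, masked_lm_prob, max_predictions):
    # at least one position, at most the configured cap
    return min(max_predictions, max(1, int(round(num_tokens * masked_lm_prob))))

full = num_masked_positions(MAX_SEQ_LENGTH, MASKED_LM_PROB, MAX_PREDICTIONS)
short = num_masked_positions(5, MASKED_LM_PROB, MAX_PREDICTIONS)
print(full, short)
# 19 1
```

With the defaults, a full 128-token sequence gets 19 masked positions, so the cap of 20 sits just above what the masking probability asks for.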

Now, for each shard we need to call *create_pretraining_data.py* script. To that end, we will employ the  *xargs* command. 

Running this might take quite some time depending on the size of your dataset.

In [0]:
# XARGS_CMD = ("ls ./shards/ | "
#              "xargs -n 1 -P {} -I{} "
#              "python3 bert/create_pretraining_data.py "
#              "--input_file=./shards/{} "
#              "--output_file={}/{}.tfrecord "
#              "--vocab_file={} "
#              "--do_lower_case={} "
#              "--max_predictions_per_seq={} "
#              "--max_seq_length={} "
#              "--masked_lm_prob={} "
#              "--random_seed=34 "
#              "--dupe_factor=5")

# XARGS_CMD = XARGS_CMD.format(PROCESSES, '{}', '{}', PRETRAINING_DIR, '{}', 
#                              VOC_FNAME, DO_LOWER_CASE, 
#                              MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)

In [0]:
# tf.gfile.MkDir(PRETRAINING_DIR)
# !$XARGS_CMD

## Step 6: setting up persistent storage

To preserve our hard-earned assets, we will persist them to Google Cloud Storage. Provided that you have created the GCS bucket, this should be simple.

We will create two directories in GCS, one for the data and one for the model.
In the model directory, we will put the model vocabulary and configuration file.

**Configure your BUCKET_NAME variable here before proceeding, otherwise the model and data will not be saved.**

In [0]:
BUCKET_NAME = "ie_scibert" #@param {type:"string"}
MODEL_DIR = "scibert_model_classifier" #@param {type:"string"}
tf.gfile.MkDir(MODEL_DIR)

if not BUCKET_NAME:
  log.warning("WARNING: BUCKET_NAME is not set. "
              "You will not be able to train the model.")

Below is the sample hyperparameter configuration for BERT-base. Change at your own risk.

In [0]:
# # use this for BERT-base

# bert_base_config = {
#   "attention_probs_dropout_prob": 0.1, 
#   "directionality": "bidi", 
#   "hidden_act": "gelu", 
#   "hidden_dropout_prob": 0.1, 
#   "hidden_size": 768, 
#   "initializer_range": 0.02, 
#   "intermediate_size": 3072, 
#   "max_position_embeddings": 512, 
#   "num_attention_heads": 12, 
#   "num_hidden_layers": 12, 
#   "pooler_fc_size": 768, 
#   "pooler_num_attention_heads": 12, 
#   "pooler_num_fc_layers": 3, 
#   "pooler_size_per_head": 128, 
#   "pooler_type": "first_token_transform", 
#   "type_vocab_size": 2, 
#   "vocab_size": VOC_SIZE
# }

# with open("{}/bert_config.json".format(MODEL_DIR), "w") as fo:
#   json.dump(bert_base_config, fo, indent=2)
  
# with open("{}/{}".format(MODEL_DIR, VOC_FNAME), "w") as fo:
#   for token in bert_vocab:
#     fo.write(token+"\n")
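
Before changing the configuration, it helps to check one constraint the architecture imposes: `hidden_size` must be divisible by `num_attention_heads`, since the hidden vector is split evenly across heads. Here is a minimal sketch with the BERT-base values from the cell above. `check_bert_config` is a hypothetical helper, not part of the BERT code.

```python
def check_bert_config(config):
    """Return the per-head size, or raise if the config is inconsistent."""
    hidden, heads = config["hidden_size"], config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError(
            "hidden_size ({}) is not a multiple of num_attention_heads ({})"
            .format(hidden, heads))
    return hidden // heads

bert_base = {"hidden_size": 768, "num_attention_heads": 12,
             "intermediate_size": 3072, "num_hidden_layers": 12}
head_size = check_bert_config(bert_base)
print(head_size)
# 64
```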

In [0]:
# if BUCKET_NAME:
#   !gsutil -m cp -r $MODEL_DIR $PRETRAINING_DIR gs://$BUCKET_NAME

## Step 7: training the model

We are almost ready to begin training our model. If you wish to continue an interrupted training run, you may skip steps 2-6 and proceed from here.

**Make sure that you have set the BUCKET_NAME here as well.**

In [25]:
BUCKET_NAME = "ie_scibert" #@param {type:"string"}
MODEL_DIR = "sciBERT_Model/scibert_scivocab_uncased" #@param {type:"string"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
VOC_FNAME = "vocab.txt" #@param {type:"string"}

# Input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 200000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 2500 #@param {type:"integer"}
NUM_TPU_CORES = 8

if BUCKET_NAME:
  BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
else:
  BUCKET_PATH = "."

BERT_GCS_DIR = "{}/{}".format(BUCKET_PATH, MODEL_DIR)
DATA_GCS_DIR = "{}/{}".format(BUCKET_PATH, PRETRAINING_DIR)

VOCAB_FILE = os.path.join(BERT_GCS_DIR, VOC_FNAME)
CONFIG_FILE = os.path.join(BERT_GCS_DIR, "bert_config.json")
print(BERT_GCS_DIR)
INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)
# INIT_CHECKPOINT = tf.saved_model.load_v2(BERT_GCS_DIR)

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR,'*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))
log.info("Using {} data shards".format(len(input_files)))
log.info("Using {} vocab file".format(VOCAB_FILE))

gs://ie_scibert/sciBERT_Model/scibert_scivocab_uncased


2020-01-22 23:10:43,564 :  From /usr/local/lib/python3.6/dist-packages/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

2020-01-22 23:10:43,853 :  Using checkpoint: gs://ie_scibert/sciBERT_Model/scibert_scivocab_uncased/model.ckpt-140000
2020-01-22 23:10:43,854 :  Using 2 data shards
2020-01-22 23:10:43,856 :  Using gs://ie_scibert/sciBERT_Model/scibert_scivocab_uncased/vocab.txt vocab file
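
The log above shows a checkpoint at step 140000, while `TRAIN_STEPS` is 200000. Since `max_steps` in `estimator.train` is a total rather than an increment, a run resumed from that checkpoint would have the following amount of work left. This is back-of-the-envelope arithmetic, assuming the Estimator resumes from the global step stored in the checkpoint. `remaining_steps` is a hypothetical helper.

```python
TRAIN_STEPS = 200000
TRAIN_BATCH_SIZE = 128
checkpoint_step = 140000  # taken from the checkpoint name model.ckpt-140000

def remaining_steps(max_steps, current_step):
    # max_steps counts total steps, so training stops once it is reached
    return max(0, max_steps - current_step)

steps_left = remaining_steps(TRAIN_STEPS, checkpoint_step)
examples_left = steps_left * TRAIN_BATCH_SIZE
print(steps_left, examples_left)
# 60000 7680000
```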


Prepare the training run configuration, build the estimator and input function, power up the bass cannon.

In [0]:
# model_fn = model_fn_builder(
#       bert_config=bert_config,
#       init_checkpoint=INIT_CHECKPOINT,
#       learning_rate=LEARNING_RATE,
#       num_train_steps=TRAIN_STEPS,
#       num_warmup_steps=10,
#       use_tpu=USE_TPU,
#       use_one_hot_embeddings=True)

# tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

# run_config = tf.contrib.tpu.RunConfig(
#     cluster=tpu_cluster_resolver,
#     model_dir=BERT_GCS_DIR,
#     save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
#     tpu_config=tf.contrib.tpu.TPUConfig(
#         iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
#         num_shards=NUM_TPU_CORES,
#         per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

# estimator = tf.contrib.tpu.TPUEstimator(
#     use_tpu=USE_TPU,
#     model_fn=model_fn,
#     config=run_config,
#     train_batch_size=TRAIN_BATCH_SIZE,
#     eval_batch_size=EVAL_BATCH_SIZE)
  
# train_input_fn = input_fn_builder(
#         input_files=input_files,
#         max_seq_length=MAX_SEQ_LENGTH,
#         max_predictions_per_seq=MAX_PREDICTIONS,
#         is_training=True)

Fire!

In [0]:
# estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)

Training the model with the default parameters for 1 million steps will take ~53 hours. 

In case the kernel is restarted, you may always continue training from the latest checkpoint. 

This concludes the guide to pre-training BERT from scratch on a cloud TPU. However, the really fun stuff is still to come, so stay tuned.

Keep learning!

# Run classifier for the model


In [62]:

import pandas as pd
f = tf.gfile.GFile(data_pref + str('train.csv'))
all_train = pd.read_csv(f, encoding='latin-1')
all_train = all_train.dropna()
train = all_train[500000:1000000]

dev = pd.read_csv(tf.gfile.GFile(data_pref + str('dev.csv')), encoding='latin-1')
dev = dev.dropna()


test = pd.read_csv(tf.gfile.GFile(data_pref + str('test.csv')), encoding='latin-1')
test = test.dropna()

print('---------- dataset statistics: ---------')
print('training set length : ')
print(len(all_train))

print('dev set length : ')
print(len(dev))

print('test set length : ')
print(len(test))

print('number of unique element types ( controllee types) in train set : ')
print(len(set(all_train['element_type'])))

print('number of unique reg types ( controller types) in train set : ')
print(len(set(all_train['reg_type'])))

print('number of unique element types ( controllee types) in dev set : ')
print(len(set(dev['element_type'])))

print('number of unique reg types ( controller types) in dev set : ')
print(len(set(dev['reg_type'])))

print('number of unique element types ( controllee types) in test set : ')
print(len(set(test['element_type'])))

print('number of unique reg types ( controller types) in test set : ')
print(len(set(test['reg_type'])))


print('number of unique reg types ( controller types) for the whole dataset : ')
a = list(all_train['reg_type']) + list(dev['reg_type']) + list(test['reg_type'])
# count distinct values, not the total number of rows
print(len(set(a)))

print('number of unique element types ( controllee types) for the whole dataset : ')
a = list(all_train['element_type']) + list(dev['element_type']) + list(test['element_type'])
print(len(set(a)))




---------- dataset statistics: ---------
training set length : 
3129209
dev set length : 
347689
test set length : 
386320
number of unique element types ( controllee types) in train set : 
96
number of unique element types ( controller types) in train set : 
328
number of unique element types ( controllee types) in dev set : 
52
number of unique element types ( controller types) in dev set : 
151
number of unique element types ( controllee types) in test set : 
56
number of unique element types ( controller types) in test set : 
151


In [0]:
label_list = [0,1]
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'reg_label'
train_InputExamples = train.apply(lambda x: run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)




In [29]:
import pickle
from bert import tokenization 
DO_LOWER_CASE = True
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)

train_features = run_classifier.convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

# f = tf.gfile.GFile(data_pref + str('train_features.out'), mode='rb')#file_io.FileIO(data_pref + str('train_features.out'),mode='r')
# train_features = pickle.load(f,encoding='utf-8')


2020-01-22 23:11:33,242 :  From /usr/local/lib/python3.6/dist-packages/bert/run_classifier.py:774: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

2020-01-22 23:11:33,243 :  Writing example 0 of 499662
2020-01-22 23:11:33,245 :  *** Example ***
2020-01-22 23:11:33,246 :  guid: None
2020-01-22 23:11:33,246 :  tokens: [CLS] free fab / ( fab ) 2 likely originate from the fragmentation of whole abs driven by the catalytic activity of gr ##p ##94 itself [ xr ##ef _ bib ##r , xr ##ef _ bib ##r ] that is similar to that displayed by serine proteases on the igg molecule [ xr ##ef _ bib ##r , xr ##ef _ bib ##r ] . [SEP]
2020-01-22 23:11:33,247 :  input_ids: 102 2159 4987 1352 145 4987 546 170 1987 18503 263 111 11961 131 2868 1430 6920 214 111 8741 1071 131 544 30121 3609 3987 260 17106 5730 4627 13652 30114 422 17106 5730 4627 13652 30114 1901 198 165 868 147 198 5983 214 13966 19031 191 111 6902 5745 260 17106 5730 4627 13652 30114 422 17106 5730 4627 13
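
The log above shows what `convert_examples_to_features` produces for one example. As a rough sketch of the single-sentence case: truncate to `max_seq_length - 2` tokens, wrap them in `[CLS]` and `[SEP]`, map tokens to ids, then zero-pad the ids and the mask. The six-entry vocabulary is invented for the example, and the tokens are assumed to be already split into subwords.

```python
def to_features(tokens, vocab, max_seq_length):
    """Build input_ids, input_mask and segment_ids for a single sentence."""
    tokens = tokens[:max_seq_length - 2]          # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)             # 1 = real token, 0 = padding
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids = [0] * max_seq_length            # one segment only (text_b is None)
    return input_ids, input_mask, segment_ids

toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "free": 3, "fab": 4, "##s": 5}
ids, mask, segments = to_features(["free", "fab", "##s"], toy_vocab, max_seq_length=8)
print(ids)       # [1, 3, 4, 5, 2, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Sentences longer than the limit are silently cut, so with `MAX_SEQ_LENGTH = 128` only the first 126 subword tokens of a long sentence reach the model.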

In [0]:
#### save features on file
# len(train_features)
# import pickle
# pickle.dump(train_features,open('train_features.out','wb') )

In [0]:
import datetime
def model_train(estimator):
  # Log the run parameters, then train on the precomputed train_features.
  
  print('***** Started training at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(train_InputExamples)))
  print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
  tf.logging.info("  Num steps = %d", TRAIN_STEPS)
  train_input_fn = run_classifier.input_fn_builder(
      features=train_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=True,
      drop_remainder=True)
  estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)
  print('***** Finished training at {} *****'.format(datetime.datetime.now()))


In [32]:

model_fn = run_classifier.model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=INIT_CHECKPOINT,
      learning_rate=LEARNING_RATE,
      num_train_steps=TRAIN_STEPS,
      num_warmup_steps=10,
      use_tpu=USE_TPU,
      use_one_hot_embeddings=True,
      num_labels = 2)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=8)
  
train_input_fn = run_classifier.input_fn_builder(
    features = train_features,
    seq_length = MAX_SEQ_LENGTH,
    drop_remainder = True,
    is_training=True)

2020-01-22 23:17:40,796 :  Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fee9a319d08>) includes params argument, but params are not passed to Estimator.
2020-01-22 23:17:40,797 :  Using config: {'_model_dir': 'gs://ie_scibert/sciBERT_Model/scibert_scivocab_uncased', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 2500, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.106.209.170:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec obj
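
`num_warmup_steps` and `num_train_steps` shape the learning-rate schedule. Assuming the optimizer setup in the BERT repository, the rate ramps up linearly during warmup and otherwise decays linearly to zero at `num_train_steps`. Here is a sketch of that schedule in plain Python, using the values from this notebook. `bert_learning_rate` is a hypothetical name.

```python
LEARNING_RATE = 2e-5
TRAIN_STEPS = 200000
WARMUP_STEPS = 10

def bert_learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    if num_warmup_steps and step < num_warmup_steps:
        # linear warmup from 0 towards init_lr
        return init_lr * step / num_warmup_steps
    # linear decay from init_lr to 0 over the whole run
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)

for step in (0, 5, 10, 100000, 200000):
    print(step, bert_learning_rate(step, LEARNING_RATE, TRAIN_STEPS, WARMUP_STEPS))
```

With only 10 warmup steps the ramp is negligible. Under this schedule, a run resumed at step 140000 would start at roughly 6e-6 rather than the nominal 2e-5.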

In [0]:
# model_train(estimator=estimator)

In [0]:
def model_predict(estimator,prediction_examples,input_features,checkpoint_path=None):
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=True)
  if checkpoint_path: 
    predictions = estimator.predict(predict_input_fn,checkpoint_path=checkpoint_path)
  else:
    predictions = estimator.predict(predict_input_fn)
  return [(sentence, prediction['probabilities']) for sentence, prediction in zip(prediction_examples, predictions)]



In [38]:
import pandas as pd

label_list = [0,1]
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'reg_label'
dev_InputExamples = dev.apply(lambda x: run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[DATA_COLUMN], 
                                                                   text_b = None, 
                                                                   label = x[LABEL_COLUMN]), axis = 1)

dev_features = run_classifier.convert_examples_to_features(dev_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)


347943


In [36]:
tf.logging.set_verbosity(tf.logging.ERROR)

predictions = model_predict(estimator=estimator, prediction_examples=dev_InputExamples, input_features=dev_features)

2020-01-22 23:27:27,983 :  Error recorded from infeed: Step was cancelled by an explicit call to `Session::Close()`.


In [39]:
import numpy as np
from sklearn import metrics
labels_val = []
true_label = list(dev['reg_label'])
for item in predictions:
    labels_val.append(label_list[np.argmax(item[1])])
print(metrics.confusion_matrix(y_pred=labels_val,y_true=true_label))
print(metrics.classification_report(y_pred=labels_val,y_true = true_label))


[[ 96106  14675]
 [ 15021 221887]]
              precision    recall  f1-score   support

           0       0.86      0.87      0.87    110781
           1       0.94      0.94      0.94    236908

    accuracy                           0.91    347689
   macro avg       0.90      0.90      0.90    347689
weighted avg       0.91      0.91      0.91    347689
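
The report can be reproduced from the confusion matrix alone, which is a useful sanity check. Here is a plain-Python sketch using the dev-set matrix printed above. `per_class_metrics` is a helper written for this example, not a scikit-learn function.

```python
# rows are true labels, columns are predicted labels (scikit-learn convention)
confusion = [[96106, 14675],
             [15021, 221887]]

def per_class_metrics(cm, k):
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp   # predicted k, truly other
    fn = sum(cm[k]) - tp                              # truly k, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(confusion[i][i] for i in range(2)) / sum(map(sum, confusion))
for k in range(2):
    print(k, [round(m, 2) for m in per_class_metrics(confusion, k)])
print("accuracy", round(accuracy, 2))
```

The rounded values match the report: 0.86 / 0.87 / 0.87 for class 0, 0.94 across the board for class 1, and 0.91 accuracy.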



In [110]:
predictions = model_predict(estimator=estimator, prediction_examples=train_InputExamples, input_features=train_features)
labels_val = []
true_label = list(train['reg_label'])
for item in predictions:
    labels_val.append(label_list[np.argmax(item[1])])
print(metrics.confusion_matrix(y_pred=labels_val,y_true=true_label))
print(metrics.classification_report(y_pred=labels_val,y_true = true_label))


2020-01-22 20:47:57,304 :  Closing session due to error Step was cancelled by an explicit call to `Session::Close()`.
2020-01-22 20:55:11,726 :  Error recorded from infeed: Step was cancelled by an explicit call to `Session::Close()`.


[[156688   1919]
 [  3272 337778]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.98    158607
           1       0.99      0.99      0.99    341050

    accuracy                           0.99    499657
   macro avg       0.99      0.99      0.99    499657
weighted avg       0.99      0.99      0.99    499657



In [113]:
list(train['sentence'][:5])

["In contrast gabapentin could n't prevent the increase of duration time as it showed a rapid increase from 4.52 s to 8.09 s from 2 week to 4 week .",
 'Adenosine , on the other hand , acts on adenosine receptors , of which A 1 and A 3 adenosine receptors inhibit , while A 2A and A 2B adenosine receptors stimulate adenylyl cyclase .',
 'XREF_BIBR Ketoconazole is a potent cytochrome P450 enzyme inhibitor , and has interactions with the following drugs : benzodiazepines aside from lorazepam ; a variety of calcium channel blockers such as nifedipine and verapamil ; 3-hydroxy-3-methyl-glutaryl- CoA reductase inhibitors , aside from pravastatin and fluvastatin ; phenytoin ; warfarin ; and theophylline .',
 'H3K9me1 and H3K9me2 are catalyzed by the histone methyltransferase G9a , whereas H3K9me3 is established by the histone methyltransferase Suv39h1 , which is often expressed in senescence and postmitotic cells .',
 'In animal models , native GLP-1 stimulates beta-cell proliferation and inh

In [114]:
train[:5]

Unnamed: 0.1,Unnamed: 0,sentence,element_name,element_type,reg_name,reg_type,reg_label
0,680292,In contrast gabapentin could n't prevent the i...,duration,Chemical,gabapentin,Chemical,0
1,3233655,"Adenosine , on the other hand , acts on adenos...",adenylyl cyclase,Protein Family|Protein Complex,adenosine receptors,Other,1
2,866649,XREF_BIBR Ketoconazole is a potent cytochrome ...,calcium,Chemical,verapamil,Chemical,0
3,1252943,H3K9me1 and H3K9me2 are catalyzed by the histo...,H3K9me1,Other,histone,Protein Family|Protein Complex,1
4,700504,"In animal models , native GLP-1 stimulates bet...",proliferation,Biological Process,GLP-1,Protein,1


In [46]:
print(len(all_train))
print(len(dev))

3129209
347689


In [47]:
import pandas as pd
test = pd.read_csv(tf.gfile.GFile(data_pref + str('test.csv')), encoding='latin-1')
test = test.dropna()
print(len(test))

386320


In [0]:
a = list(all_train['sentence'])

In [66]:
len(max(a , key=len))

28393

In [68]:
len(all_train) + len(dev)+ len(test)

3863218
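
The three set sizes printed earlier add up to this total, and the split proportions follow directly. A small sketch of the arithmetic, using only the row counts shown above:

```python
sizes = {"train": 3129209, "dev": 347689, "test": 386320}
total = sum(sizes.values())
fractions = {name: n / total for name, n in sizes.items()}

print(total)
for name, frac in fractions.items():
    print("{}: {:.1%}".format(name, frac))
```

This works out to roughly an 81 / 9 / 10 split. Note that the fine-tuning cell uses only rows 500000 to 1000000 of the training set, a fraction of the full 81%.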

In [75]:
len(set(a))

352