<a href="https://colab.research.google.com/github/AnttonLA/BINP37/blob/master/bioBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a small guide I wanted to write about the setup process and use of the BioBERT software (https://github.com/dmis-lab/biobert). I tried to be thorough, but I'm sure some things are still confusing.

Please keep in mind that I am not too familiar with BioBERT myself. I learned most of what is explained here by reading comments on the BioBERT GitHub page or through experimentation. It is quite likely that some of the things I say here are incorrect!

# BioBERT setup in Colab (or elsewhere) for Named Entity Recognition

The first step is to download all the necessary files to make BioBERT work. That includes the pre-trained weights, the required packages and the datasets used in the fine-tuning process of the model.

We'll first clone BioBERT from GitHub and make sure we have the required packages installed. There is a compatibility issue between TensorFlow (which BioBERT is based on) and the 'cuda' repository. Apparently TensorFlow is only compatible with CUDA versions 9.0 or lower, which means we will have to install an older version. **Beware: the cell will prompt a Yes/No question that needs to be answered for the installation to proceed.** The package will take a few minutes to install.

In [2]:
!git clone https://github.com/dmis-lab/biobert.git

Cloning into 'biobert'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 293 (delta 7), reused 0 (delta 0), pack-reused 278
Receiving objects: 100% (293/293), 489.78 KiB | 1.57 MiB/s, done.
Resolving deltas: 100% (169/169), done.


In [5]:
!pip install -r biobert/requirements.txt



In [4]:
!sudo dpkg -i drive/My\ Drive/project_biobert/cuda-repo-ubuntu1704_9.0.176-1_amd64.deb
!sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
!sudo apt-get update
!sudo apt-get install cuda-9-0 #REMEMBER YOU HAVE TO ANSWER THE PROMPT!

Selecting previously unselected package cuda-repo-ubuntu1704.
(Reading database ... 144542 files and directories currently installed.)
Preparing to unpack .../cuda-repo-ubuntu1704_9.0.176-1_amd64.deb ...
Unpacking cuda-repo-ubuntu1704 (9.0.176-1) ...
Setting up cuda-repo-ubuntu1704 (9.0.176-1) ...

Configuration file '/etc/apt/sources.list.d/cuda.list'
 ==> File on system created by you or by a script.
 ==> File also in package provided by package maintainer.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** cuda.list (Y/I/N/O/D/Z) [default=N] ? y
Installing new version of config file /etc/apt/sources.list.d/cuda.list ...
Executing: /tmp/apt-key-gpghome.c0Ff9XyVzs/gpg.1.sh --fetch-keys http://developer.downlo

Next, we will download the pre-trained weights from https://github.com/naver/biobert-pretrained

In my particular case, I decided to use the BioBERT-Base v1.1 (+ PubMed 1M) weights, which I stored in my Drive. I connected the Drive to the Colab notebook and loaded the files from there.

In [6]:
!tar -xvf drive/My\ Drive/project_biobert/biobert_v1.1_pubmed.tar

biobert_v1.1_pubmed/
biobert_v1.1_pubmed/model.ckpt-1000000.data-00000-of-00001
biobert_v1.1_pubmed/model.ckpt-1000000.meta
biobert_v1.1_pubmed/bert_config.json
biobert_v1.1_pubmed/vocab.txt
biobert_v1.1_pubmed/model.ckpt-1000000.index


The main BioBERT GitHub page offers eight datasets to perform the fine-tuning with. These datasets can be found here: https://github.com/dmis-lab/biobert

Again, I had the files stored in my Drive and loaded them from there.

In [7]:
!unzip drive/My\ Drive/project_biobert/NERdata.zip

Archive:  drive/My Drive/project_biobert/NERdata.zip
   creating: BC2GM/
  inflating: BC2GM/devel.tsv         
  inflating: BC2GM/test.tsv          
  inflating: BC2GM/train.tsv         
  inflating: BC2GM/train_dev.tsv     
   creating: BC4CHEMD/
  inflating: BC4CHEMD/devel.tsv      
  inflating: BC4CHEMD/test.tsv       
  inflating: BC4CHEMD/train.tsv      
  inflating: BC4CHEMD/train_dev.tsv  
   creating: BC5CDR-chem/
  inflating: BC5CDR-chem/devel.tsv   
  inflating: BC5CDR-chem/test.tsv    
  inflating: BC5CDR-chem/train.tsv   
  inflating: BC5CDR-chem/train_dev.tsv  
   creating: BC5CDR-disease/
  inflating: BC5CDR-disease/devel.tsv  
  inflating: BC5CDR-disease/test.tsv  
  inflating: BC5CDR-disease/train.tsv  
  inflating: BC5CDR-disease/train_dev.tsv  
   creating: JNLPBA/
  inflating: JNLPBA/devel.tsv        
  inflating: JNLPBA/test.tsv         
  inflating: JNLPBA/train.tsv        
  inflating: JNLPBA/train_dev.tsv    
   creating: linnaeus/
  inflating: linnaeus/devel.tsv

With this, we have all of the scripts, packages and data we will need loaded on Colab. Before we proceed with the fine-tuning, we will create some environment variables to make our lives easier.

The variables will contain the paths to the pre-trained model, the datasets, and the location where the output files should be stored. We will also create this output folder.

In [0]:
import os
os.environ['BIOBERT_DIR']= './biobert_v1.1_pubmed' #Pre-trained model.
os.environ['NER_DIR'] = './BC5CDR-chem' # Dataset. Could be any dataset you want to use.
os.environ['OUTPUT_DIR'] = './ner_outputs' # Output
!mkdir -p $OUTPUT_DIR  # -p avoids an error if the folder already exists
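
Before launching a long training run, it can be worth checking that the folders these variables point to really contain what the scripts expect. Below is a minimal sketch of such a check. The helper names `missing_files` and `check_setup` are my own (hypothetical), and the file lists are simply copied from the `tar` and `unzip` listings above, so they would need adjusting for other weights or datasets.

```python
import os

# File names taken from the tar/unzip listings earlier in this guide.
MODEL_FILES = [
    "vocab.txt",
    "bert_config.json",
    "model.ckpt-1000000.index",
    "model.ckpt-1000000.meta",
    "model.ckpt-1000000.data-00000-of-00001",
]
DATASET_FILES = ["train.tsv", "devel.tsv", "train_dev.tsv", "test.tsv"]


def missing_files(directory, expected):
    """Return the names in `expected` that are not regular files in `directory`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


def check_setup(biobert_dir, ner_dir):
    """Hypothetical helper: map each directory to the list of files missing from it."""
    return {
        biobert_dir: missing_files(biobert_dir, MODEL_FILES),
        ner_dir: missing_files(ner_dir, DATASET_FILES),
    }


if __name__ == "__main__":
    # Fall back to the paths used in this guide if the variables are not set.
    report = check_setup(
        os.environ.get("BIOBERT_DIR", "./biobert_v1.1_pubmed"),
        os.environ.get("NER_DIR", "./BC5CDR-chem"),
    )
    for directory, missing in report.items():
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(directory, status)
```

An empty list for both directories means the paths are probably set up correctly.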

# Fine-tuning

Everything is ready for the actual fine-tuning. **Make sure to change the Colab notebook to GPU mode**, otherwise it will take too long to execute. You can do this by going to 'Edit > Notebook settings' (or 'Runtime > Change runtime type').

Check how much GPU memory you have access to, as the training will fail if you don't have enough. In my experience, you want to have at least around 15000MB.

In [1]:
#CHECK HOW MUCH GPU COLAB IS GIVING YOU
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn’t guaranteed
gpu = GPUs[0]
def printm():
 process = psutil.Process(os.getpid())
 print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
 print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()


Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-cp36-none-any.whl size=7413 sha256=522afaa4377a4e2e1e81c276de5033905459bc6cee9f2cea274cc092b74fc044
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Gen RAM Free: 12.7 GB  | Proc size: 158.8 MB
GPU RAM Free: 16280MB | Used: 0MB | Util   0% | Total 16280MB


The training will take a while to complete, depending on the number of epochs selected (around an hour for me with 10 epochs).

The GitHub page has more information about the specific flags of the *run_ner.py* script. Otherwise you can use '--help'.
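
Since the call to *run_ner.py* gets long, it can be easier to assemble it in Python and change one value at a time. The following is only a sketch: `build_run_ner_command` is a hypothetical helper of mine, and it uses nothing but the flags that appear in this guide, with the checkpoint name of the v1.1 weights hard-coded as an assumption.

```python
import shlex


def build_run_ner_command(biobert_dir, ner_dir, output_dir, epochs=10.0):
    """Hypothetical helper: build the run_ner.py call used in this guide as a list."""
    flags = {
        "do_train": "true",
        "do_eval": "true",
        "vocab_file": f"{biobert_dir}/vocab.txt",
        "bert_config_file": f"{biobert_dir}/bert_config.json",
        # Assumes the BioBERT v1.1 (+ PubMed 1M) checkpoint name.
        "init_checkpoint": f"{biobert_dir}/model.ckpt-1000000",
        "num_train_epochs": str(epochs),
        "data_dir": ner_dir,
        "output_dir": output_dir,
    }
    return ["python", "biobert/run_ner.py"] + [
        f"--{name}={value}" for name, value in flags.items()
    ]


if __name__ == "__main__":
    cmd = build_run_ner_command("./biobert_v1.1_pubmed", "./BC5CDR-chem", "./ner_outputs")
    # Print the command so it can be inspected or pasted into a notebook cell.
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```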

In [9]:
!python biobert/run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  'command line!' % flag_name)
INFO:tensorflow:Using config: {'_model_dir': './ner_outputs', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe4be541438>, '_task_type': 'wo

Once the model has been fine-tuned to the dataset we have selected, we will get (among other things) two .txt files in our output directory: **token_test.txt** and **label_test.txt**. These are the main output files of the *run_ner.py* script other than the model itself.

They can be used to assess the accuracy of the model against the test set. In order to get this data, we will need to "detokenize" the output. We will use a script in BioBERT called _ner_detokenize.py_ to do so. This will produce a .txt file named **NER_result_conll.txt** in our output directory. This file contains three columns of text, the first being a word, the second being the "true" classification, and the third being the classification made by our model.

The three possible classifications are: **B-MISC** (Beginning of Entity), **I-MISC** (Inside Entity) or **O-MISC** (Outside Entity).

The Perl script *conlleval.pl* then computes the proper statistics of the results.

In [0]:
!python biobert/biocodes/ner_detokenize.py --token_test_path=$OUTPUT_DIR/token_test.txt --label_test_path=$OUTPUT_DIR/label_test.txt --answer_path=$NER_DIR/test.tsv --output_dir=$OUTPUT_DIR

In [11]:
!head $OUTPUT_DIR/NER_result_conll.txt #Take a look at the output file.

Torsade O-MISC O-MISC
de O-MISC O-MISC
pointes O-MISC O-MISC
ventricular O-MISC O-MISC
tachycardia O-MISC O-MISC
during O-MISC O-MISC
low O-MISC O-MISC
dose O-MISC O-MISC
intermittent O-MISC O-MISC
dobutamine B-MISC B-MISC


In [12]:
!perl biobert/biocodes/conlleval.pl < $OUTPUT_DIR/NER_result_conll.txt

processed 124750 tokens with 5385 phrases; found: 5401 phrases; correct: 5043.
accuracy:  99.22%; precision:  93.37%; recall:  93.65%; FB1:  93.51
             MISC: precision:  93.37%; recall:  93.65%; FB1:  93.51  5401
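
If you would rather look at the three-column file from Python, the sketch below reads its rows and computes a plain token-level accuracy. It is only a rough cross-check and not a replacement for *conlleval.pl*, which also reports phrase-level precision and recall. The function names are hypothetical, and I am assuming that any line without exactly three whitespace-separated fields (for example a blank line between sentences) can be skipped.

```python
def read_conll_result(lines):
    """Parse 'word gold predicted' lines into tuples, skipping anything else."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:  # blank separator or malformed line
            continue
        rows.append(tuple(parts))
    return rows


def token_accuracy(rows):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if not rows:
        return 0.0
    correct = sum(1 for _word, gold, pred in rows if gold == pred)
    return correct / len(rows)


if __name__ == "__main__":
    # A few sample rows in the same format as NER_result_conll.txt.
    sample = [
        "low O-MISC O-MISC",
        "dose O-MISC O-MISC",
        "intermittent O-MISC O-MISC",
        "dobutamine B-MISC B-MISC",
    ]
    print(token_accuracy(read_conll_result(sample)))
    # With the real file:
    # with open("./ner_outputs/NER_result_conll.txt") as handle:
    #     print(token_accuracy(read_conll_result(handle)))
```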


# Using our model for prediction

Once we know that our model is working as intended and is capable of classifying the entities we are interested in, we want to put it to work with our own data.

However, while BioBERT is certainly capable of doing this, there are certain limitations to using it for prediction with our own data, mainly related to data formatting. I'll try to explain the problems (and their solutions) as well as possible.

While the GitHub page claims that the *run_ner.py* script can be used in "classifier mode" by setting the flags *--do_train=false* and *--do_predict=true*, this does not work properly. Instead, we will need to use a little workaround.

We will keep *--do_train=true*. We will, however, decrease the number of epochs. Since the number of epochs is now lower than before, the software will skip the training process (presumably because the checkpoint already saved in the output directory has been trained for more steps than requested) and will instead perform the prediction directly.

Unfortunately, we will still need the training datasets for this to work. Therefore, our data will need to be in a file called **test.tsv**, and be stored in the path saved at $NER_DIR along with the files *devel.tsv*, *train_dev.tsv* and *train.tsv*. The easiest way of doing this is taking one of the datasets for the fine-tuning process and swapping the **test.tsv** file for our data.

**Beware of the format of the test.tsv file!** Every line will need to be composed of *word* + *\\t* + *label* + *\\n*. The label is a single character, either an "O", an "I" or a "B". **The last line of the file also needs to be a newline (\\n) character**, otherwise the script won't work.
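
As an illustration of that format, here is a minimal sketch that writes already-tokenized sentences to a file in the *word* + tab + *label* layout. Several things here are my own assumptions rather than anything taken from BioBERT: the helper name `write_test_tsv`, the use of "O" as a placeholder label for every token (we have no true labels for our own data), and a blank line after each sentence, which also leaves the file ending in an empty line as required above.

```python
def write_test_tsv(sentences, path, placeholder_label="O"):
    """Hypothetical helper: write tokenized sentences as 'word<TAB>label' lines.

    `sentences` is a list of token lists. Every token gets the same
    placeholder label, and each sentence is followed by a blank line.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            for token in tokens:
                handle.write(f"{token}\t{placeholder_label}\n")
            handle.write("\n")  # blank line closes the sentence


if __name__ == "__main__":
    # Naive whitespace tokenization, only for the sake of the example.
    text = ["Torsade de pointes during low dose dobutamine infusion ."]
    write_test_tsv([sentence.split() for sentence in text], "test.tsv")
```

Real text will usually need a proper tokenizer that separates punctuation from words; the whitespace split above is just the simplest thing that produces valid lines.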

In [0]:
!python biobert/run_ner.py --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --data_dir=$NER_DIR --do_train=true --num_train_epochs=1.0 --do_predict=true --output_dir=$OUTPUT_DIR

Detokenize the output files again, same as after the training. This time, we can ignore the second column. The third column of the file is the classification made by our model.

In [0]:
!python biobert/biocodes/ner_detokenize.py --token_test_path=$OUTPUT_DIR/token_test.txt --label_test_path=$OUTPUT_DIR/label_test.txt --answer_path=$NER_DIR/test.tsv --output_dir=$OUTPUT_DIR
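
To turn the third column into an actual list of entities, the B/I tags have to be grouped back into phrases. The sketch below shows one way of doing it; `predicted_entities` is a hypothetical name, a blank or malformed line is treated as a sentence boundary, and an I tag that does not follow a B tag is assumed to start a new entity.

```python
def predicted_entities(lines):
    """Group tokens tagged B-*/I-* in the third column into entity strings."""
    entities, current = [], []
    for line in lines:
        parts = line.split()
        # Anything that is not 'word gold predicted' closes the current entity.
        pred = parts[2] if len(parts) == 3 else "O"
        if pred.startswith("B"):
            if current:
                entities.append(" ".join(current))
            current = [parts[0]]
        elif pred.startswith("I"):
            current.append(parts[0])  # a stray I tag starts a new entity
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


if __name__ == "__main__":
    sample = [
        "low O-MISC O-MISC",
        "dose O-MISC O-MISC",
        "dobutamine O-MISC B-MISC",
        "infusion O-MISC O-MISC",
    ]
    print(predicted_entities(sample))
    # With the real file:
    # with open("./ner_outputs/NER_result_conll.txt") as handle:
    #     print(predicted_entities(handle))
```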

# Creating our own 'test.tsv' files

We want to transform the given 100-paper dataset into a 'test.tsv' file to perform NER on.

*WORK IN PROGRESS*

# Extracting output information

We want our output to be in the PubAnnotation format.
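
Until this section is written, here is a rough sketch of the idea, under several assumptions. As far as I understand it, a PubAnnotation JSON document holds the `text` plus a list of `denotations`, each with a character `span` (`begin`, `end`) and an `obj` label. The function name `to_pubannotation` and the default label are hypothetical, and the text is rebuilt by joining tokens with single spaces, so the offsets refer to that rebuilt text and not to the original paper.

```python
import json


def to_pubannotation(tokens, tags, obj="Entity"):
    """Hypothetical helper: build a PubAnnotation-style dict from tokens and B/I/O tags."""
    spans = []
    current = None  # [begin, end] of the entity being built, if any
    offset = 0
    for token, tag in zip(tokens, tags):
        begin, end = offset, offset + len(token)
        if tag.startswith("B"):
            if current is not None:
                spans.append(current)
            current = [begin, end]
        elif tag.startswith("I"):
            if current is not None:
                current[1] = end  # extend the running entity
            else:
                current = [begin, end]  # stray I tag starts a new entity
        else:
            if current is not None:
                spans.append(current)
            current = None
        offset = end + 1  # +1 for the space that joins the tokens
    if current is not None:
        spans.append(current)
    return {
        "text": " ".join(tokens),
        "denotations": [
            {"id": f"T{i}", "span": {"begin": b, "end": e}, "obj": obj}
            for i, (b, e) in enumerate(spans, start=1)
        ],
    }


if __name__ == "__main__":
    doc = to_pubannotation(
        ["low", "dose", "dobutamine", "infusion"],
        ["O-MISC", "O-MISC", "B-MISC", "O-MISC"],
    )
    print(json.dumps(doc, indent=2))
```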

*WORK IN PROGRESS*