Notebook for running multi-task learning for biomedical relation extraction

Mount Google Drive, which is useful for saving models.

In [None]:
from google.colab  import drive
drive.mount("/content/gdrive",force_remount=True)

Mounted at /content/gdrive


In [None]:
#Checking available GPU
!nvidia-smi

Clone repository

In [1]:
!git clone https://github.com/IoSylar/Multi-task-Learning-for-Biomedical-Relation-Extraction.git

Cloning into 'Multi-task-Learning-for-Biomedical-Relation-Extraction'...
remote: Enumerating objects: 590, done.
remote: Counting objects: 100% (590/590), done.
remote: Compressing objects: 100% (423/423), done.
remote: Total 590 (delta 157), reused 590 (delta 157), pack-reused 0
Receiving objects: 100% (590/590), 23.55 MiB | 14.73 MiB/s, done.
Resolving deltas: 100% (157/157), done.
Checking out files: 100% (646/646), done.


Script for downloading and preprocessing all clinical NLP datasets. Note that the i2b2 files cannot be fetched automatically: they must be downloaded from the official site under license. For convenience, the DDI2013, Chemprot, and i2b2 2010 datasets have already been processed offline and are available in the Multi-task-Learning-for-Biomedical-Relation-Extraction/Dataset folder.

In [None]:
!bash download_all_task_data.sh

In [None]:
!pip install bioc
!pip install fire

In [None]:
!bash preprocess_all_classification_datasets.sh

Move to the main directory and install the requirements.

In [None]:
%cd "/content/Multi-task-Learning-for-Biomedical-Relation-Extraction/mt-dnn"

In [None]:
!pip install -r requirements.txt

The latest version of apex is required, so uninstall any existing apex package and then reinstall it from source.

In [None]:
# -y skips the confirmation prompt so the cell does not block waiting for input
!pip3 uninstall -y apex

In [None]:
!git clone https://www.github.com/nvidia/apex
%cd apex
!python3 setup.py install

Download the pre-trained RoBERTa model from the clinical NLP paper using the Mega.nz link below. The model, originally in PyTorch format (.bin), has been converted to model.pt. Link: https://mega.nz/folder/Bc0CXJaa#qhY1Cp4CGaaBaOGU__bFig. It is recommended not to specify an output path and to use the default instead: mt_dnn_models.

In [8]:
import sys, os, urllib.request
import time
import subprocess
import contextlib
from IPython.display import clear_output
#@markdown <br><center><img src='https://mega.nz/favicon.ico?v=3' height="50" alt="MEGA-logo"/></center>
#@markdown <center><h2>Transfer from Mega to GDrive</h2></center><br>
HOME = os.path.expanduser("~")
if not os.path.exists(f"{HOME}/.ipython/ocr.py"):
    hCode = "https://raw.githubusercontent.com/biplobsd/" \
                "OneClickRun/master/res/ocr.py"
    urllib.request.urlretrieve(hCode, f"{HOME}/.ipython/ocr.py")

from ocr import (
    runSh,
    loadingAn,
)
#@title MEGA public link download
URL = "https://mega.nz/folder/Bc0CXJaa#qhY1Cp4CGaaBaOGU__bFig" #@param {type:"string"}
OUTPUT_PATH = "" #@param {type:"string"}
#@markdown #####_*Sometimes this cell does not stop by itself after the transfer completes. In that case, stop the cell manually._
if not OUTPUT_PATH:
  os.makedirs("mt_dnn_models", exist_ok=True)
  OUTPUT_PATH = "mt_dnn_models"
# MEGAcmd installing
if not os.path.exists("/usr/bin/mega-cmd"):
    loadingAn()
    print("Installing MEGA ...")
    runSh('sudo apt-get -y update')
    runSh('sudo apt-get -y install libmms0 libc-ares2 libc6 libcrypto++6 libgcc1 libmediainfo0v5 libpcre3 libpcrecpp0v5 libssl1.1 libstdc++6 libzen0v5 zlib1g apt-transport-https')
    runSh('sudo curl -sL -o /var/cache/apt/archives/MEGAcmd.deb https://mega.nz/linux/MEGAsync/Debian_9.0/amd64/megacmd-Debian_9.0_amd64.deb', output=True)
    runSh('sudo dpkg -i /var/cache/apt/archives/MEGAcmd.deb', output=True)
    print("MEGA is installed.")
    clear_output()

# Unix, Windows and old Macintosh end-of-line
newlines = ['\n', '\r\n', '\r']

def unbuffered(proc, stream='stdout'):
    stream = getattr(proc, stream)
    with contextlib.closing(stream):
        while True:
            out = []
            last = stream.read(1)
            # Don't loop forever
            if last == '' and proc.poll() is not None:
                break
            while last not in newlines:
                # Don't loop forever
                if last == '' and proc.poll() is not None:
                    break
                out.append(last)
                last = stream.read(1)
            out = ''.join(out)
            yield out


def transfer():
    # Run mega-get and echo its progress output line by line
    cmd = ["mega-get", URL, OUTPUT_PATH]
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        # Make all end-of-lines '\n'
        universal_newlines=True,
    )
    for line in unbuffered(proc):
        print(line)


transfer()



TRANSFERRING ||............................................||(0/0 KB:   0.00 %)  
TRANSFERRING ||############################################||(0/0 KB: 100.00 %)  
Download finished: /content/Multi-task-Learning-for-Biomedical-Relation-Extraction/mt-dnn/mt_dnn_models/robertaFB
TRANSFERRING ||########################################||(953/953 MB: 100.00 %)  


Each dataset must be preprocessed with prepro_std.py. Each dataset consists of files named after the task and the split, for example ddi2013_train.tsv, ddi2013_dev.tsv, and ddi2013_test.tsv. You can pass a local model, as in the example, or any model available on Hugging Face. The root directory in which the datasets must be placed is tutorials; for simplicity, they have already been moved there. Tasks are defined in tutorial_task_def.yml: each task must be defined in this file before it can be processed. Always run the commands from the main directory.
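
Before running the preprocessing, it can help to check that every expected file is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_split_files` is hypothetical, and it assumes the `<task>_<split>.tsv` naming described above.

```python
from pathlib import Path


def missing_split_files(root_dir, tasks, splits=("train", "dev", "test")):
    """Return the expected <task>_<split>.tsv paths that are absent from root_dir."""
    root = Path(root_dir)
    expected = [root / f"{task}_{split}.tsv" for task in tasks for split in splits]
    return [str(path) for path in expected if not path.is_file()]


# Task names follow the ones used in this notebook
missing = missing_split_files("tutorials", ["ddi2013", "chemprot", "i2b2"])
if missing:
    print("Missing dataset files:")
    for path in missing:
        print("  " + path)
else:
    print("All dataset files found.")
```

An empty list means that all splits of all tasks are present under the root directory.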

In [9]:
%cd "/content/Multi-task-Learning-for-Biomedical-Relation-Extraction/mt-dnn"

/content/Multi-task-Learning-for-Biomedical-Relation-Extraction/mt-dnn


In [12]:
!python prepro_std.py --model mt_dnn_models/robertaFB  --root_dir tutorials/ --task_def tutorials/tutorial_task_def.yml

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
{'input_ids': [0, 22963, 384, 261, 3332, 265, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
10/09/2021 09:23:48 Task chemprot
10/09/2021 09:23:48 tutorials/mt_dnn_models/robertaFB/chemprot_train.json
10/09/2021 09:23:52 tutorials/mt_dnn_models/robertaFB/chemprot_dev.json
10/09/2021 09:23:55 tutorials/mt_dnn_models/robertaFB/chemprot_test.json
10/09/2021 09:24:00 Task ddi2013
10/09/2021 09:24:00 tutorials/mt_dnn_models/robertaFB/ddi2013_train.json
10/09/2021 09:24:08 tutorials/mt_dnn_models/robertaFB/ddi2013_dev.json
10/09/2021 09:24:10 tutorials/mt_dnn_models/robertaFB/ddi2013_test.json
10/09/2021 09:24:11 Task i2b2
10/09/2021 09:24:11 tutorials/mt_dnn_models/robertaFB/i2b2_train.json
10/09/2021 09:24:16 tutorials/mt_dnn_models/robertaFB/i2b2_dev.json

Run multi-task training on the full datasets. To run a single task, specify only that task and remove the MTL-related parameters. The directory in which the model and results are saved, output_dir, is set by the user.

In [15]:
!python train.py --task_def tutorials/tutorial_task_def.yml --data_dir tutorials/mt_dnn_models/robertaFB   --train_datasets ddi2013,chemprot,i2b2 --test_datasets ddi2013,chemprot,i2b2 --epochs=10 --batch_size=8 --bert_model_type="roberta"  --encoder_type=2  --output_dir="Addestramento" --init_checkpoint="mt_dnn_models/robertaFB" --grad_clipping=1.0 --adam_eps=1e-7  --seed=2010 --mtl_opt=1  #--model_ckpt="SingleI2B2SEED2040/model_2.pt" --resume

10/09/2021 09:25:40 Launching the MT-DNN training
10/09/2021 09:25:40 Loading tutorials/mt_dnn_models/robertaFB/ddi2013_train.json as task 0
Loaded 29333 samples out of 29333
10/09/2021 09:25:41 Loading tutorials/mt_dnn_models/robertaFB/chemprot_train.json as task 1
Loaded 19460 samples out of 19460
10/09/2021 09:25:42 Loading tutorials/mt_dnn_models/robertaFB/i2b2_train.json as task 2
Loaded 21384 samples out of 21384
False
False
Loaded 7244 samples out of 7244
Loaded 5761 samples out of 5761
Loaded 11820 samples out of 11820
Loaded 16943 samples out of 16943
Loaded 872 samples out of 872
Loaded 43000 samples out of 43000
10/09/2021 09:25:45 ####################
10/09/2021 09:25:45 {'log_file': 'mt-dnn-train.log', 'tensorboard': False, 'tensorboard_logdir': 'tensorboard_logdir', 'init_checkpoint': 'mt_dnn_models/robertaFB', 'data_dir': 'tutorials/mt_dnn_models/robertaFB', 'data_sort_on': False, 'name': 'farmer', 'task_def': 'tutorials/tutorial_task_def.yml', 'train_datasets': ['ddi201

Example of few-shot learning, here on the ddi2013 task alone. As always, the data must be tokenized first.

In [None]:
!python train.py --task_def tutorials/tutorial_task_def.yml --data_dir "Directory_SHOT_tokenizzati"   --train_datasets ddi2013 --test_datasets ddi2013  --epochs=30 --batch_size=8 --bert_model_type="roberta"  --encoder_type=2  --output_dir="AddestramentoFewShot" --init_checkpoint="mt_dnn_models/robertaFB" --grad_clipping=1.0 --adam_eps=1e-7 --seed=9   #--model_ckpt="SingleTaskI2B2MEDIUMFiltrato10BERT81MIL2000/model_6.pt" --resume


In this case, adversarial training is applied in the few-shot setting through the adv and adv_opt parameters. The losses can be modified in the loss file in the mt-dnn directory and then selected in the tutorial_task_def.yml file.

In [None]:
!python train.py --task_def tutorials/tutorial_task_def.yml --data_dir "Directory_SHOT_tokenizzati"   --train_datasets ddi2013,chemprot,i2b2 --test_datasets ddi2013,chemprot,i2b2  --epochs=30 --batch_size=8 --bert_model_type="roberta"  --encoder_type=2  --output_dir="AddestramentoFewShotMTL" --init_checkpoint="mt_dnn_models/RobertaVoc/RobertaVoc2/" --grad_clipping=1.0 --adam_eps=1e-7 --seed=9  --adv=1 --adv_opt=1 #--model_ckpt="SingleTaskI2B2MEDIUMFiltrato10BERT81MIL2000/model_6.pt" --resume


Evaluation on the test set can be performed during training: the relevant part of train.py needs to be uncommented. However, given the long training times, it is recommended to run the prediction phase afterwards, using the model that performs best on the validation set.
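
Choosing the checkpoint to predict with can be sketched as follows. This is an illustration only: the helper `best_checkpoint` and the validation scores are hypothetical, and the `model_<epoch>.pt` naming is assumed from the checkpoint paths used in the commands of this notebook.

```python
def best_checkpoint(dev_scores, output_dir):
    """Return the checkpoint path of the epoch with the highest validation score.

    dev_scores maps an epoch number to its validation score.
    """
    best_epoch = max(dev_scores, key=dev_scores.get)
    return f"{output_dir}/model_{best_epoch}.pt"


# Hypothetical validation scores for the last three epochs
scores = {7: 0.71, 8: 0.78, 9: 0.75}
print(best_checkpoint(scores, "Addestramento"))  # Addestramento/model_8.pt
```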

The predict.py script can be used without many modifications when making predictions on a model trained on a single task. When using a multi-task model, you first need to load a single task model and then initialize it with the weights of the multi-task model, using the corresponding multi-task layer as the output layer. This can be set in the initial part of the predict.py script.
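
The idea of reusing one multi-task output layer in a single-task model can be sketched on a plain mapping of parameter names to values, which is how a PyTorch state dict is organised. This is not the code in predict.py: the function name `select_task_head` is hypothetical, and the `scoring_list.<task_id>.` prefix for the task heads is an assumption about how the checkpoint names its parameters.

```python
def select_task_head(mtl_state, task_id, head_prefix="scoring_list"):
    """Build a single-task parameter mapping from a multi-task one.

    Shared parameters are copied unchanged. Among the task heads, only the one
    for task_id is kept, and it is renumbered to index 0.
    """
    single_state = {}
    for key, value in mtl_state.items():
        if key.startswith(head_prefix + "."):
            index, _, tail = key[len(head_prefix) + 1:].partition(".")
            if index == str(task_id):
                single_state[f"{head_prefix}.0.{tail}"] = value
        else:
            single_state[key] = value
    return single_state


# Toy mapping with one shared parameter and two task heads (values are placeholders)
mtl_state = {
    "encoder.weight": "shared",
    "scoring_list.0.weight": "head for task 0",
    "scoring_list.1.weight": "head for task 1",
}
print(select_task_head(mtl_state, task_id=1))
```

The resulting mapping has the shape a single-task model would expect, under the naming assumption stated above.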

In [None]:
#predict
!python predict.py --task_def tutorials/tutorial_task_def.yml --task chemprot --task_id=0 --prep_input="tutorials/robertaFB/chemprot_train.json" --score="ScoreRidotto/ScoreChemprotAssistantT4.txt"  --model_checkpoint="MTL01F2040/model_9.pt" --checkpoint="ChemprotSingleF2000/model_9.pt"  --with_label

Knowledge distillation phase. This code does not work if the input data, already tokenized in .json, do not have a soft label column.

In [None]:
!python train.py --task_def tutorials/tutorial_task_def.yml --data_dir tutorials/robertaFB/   --train_datasets ddi2013,chemprot,i2b2 --test_datasets ddi2013,chemprot,i2b2 --epochs=10 --batch_size=8 --bert_model_type="roberta"  --encoder_type=2  --output_dir="Addestramento" --init_checkpoint="mt_dnn_models/RobertaFB" --grad_clipping=1.0 --adam_eps=1e-7  --seed=2000 --mtl_opt=1 --mkd_opt=1   #--model_ckpt="SingleI2B2SEED2040/model_2.pt" --resume

The json file with the soft label columns is obtained through an offline mechanism. First, prediction is made on the test set using the predict.py script for the task of interest. Then, the prepare_distillation_data.py script is used to obtain the soft labels. The soft labels will then be concatenated with the already tokenized dataset json file and given as input to the network.

In [None]:
#Example of creating soft labels for chemprot. The prepare distillation file needs to be properly set for the type of task required. 
!python prepare_distillation_data.py  --task_def tutorials/tutorial_task_def.yml --task chemprot --add_soft_label --std_input="tutorials/chemprot_train.tsv" --score="ScoreEnsemble/Chemprot012000.txt" --std_output="ScoreEnsemble/Distillati/Chemprot012000.txt"


Subsequently, the training dataset's JSON file is read, the newly determined soft label's txt file is read, and they are concatenated, resulting in the file that is required for distilled training.
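
The concatenation step can be sketched as below. The file formats are assumptions, not taken from the repository: the tokenized dataset is taken to hold one JSON object per line, the soft-label file one row of whitespace-separated scores per example in the same order, and the field name `softlabel` is likewise assumed. The helper `add_soft_labels` is hypothetical.

```python
import json


def add_soft_labels(json_lines, score_lines, key="softlabel"):
    """Attach one soft-label vector to each tokenized example, matched by position."""
    if len(json_lines) != len(score_lines):
        raise ValueError("The number of examples and soft-label rows differ.")
    merged = []
    for raw_example, raw_scores in zip(json_lines, score_lines):
        example = json.loads(raw_example)
        example[key] = [float(score) for score in raw_scores.split()]
        merged.append(json.dumps(example))
    return merged


# Toy input: two tokenized examples and their soft labels
examples = ['{"uid": "0", "label": 1}', '{"uid": "1", "label": 0}']
soft_labels = ["0.1 0.9", "0.8 0.2"]
for line in add_soft_labels(examples, soft_labels):
    print(line)
```

With real data, the two input lists would come from reading the lines of the tokenized JSON file and of the soft-label file, and the merged lines would be written to a new JSON file used for distilled training.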

**Task Analysis**: The similarity between the datasets is computed from sentence embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12')
print("Max Sequence Length:", model.max_seq_length)
model.max_seq_length = 512
print("Max Sequence Length:", model.max_seq_length)

In [None]:
import pandas as pd
DatasetTrainDDI2013=pd.read_csv("Multi-task-Learning-for-Biomedical-Relation-Extraction/Dataset/DDI2013/ddi2013_train.tsv", sep='\t',header=None)
DatasetTrainChemprot=pd.read_csv("Multi-task-Learning-for-Biomedical-Relation-Extraction/Dataset/Chemprot/chemprot_train.tsv", sep='\t',header=None)
DatasetTrainI2B2=pd.read_csv("Multi-task-Learning-for-Biomedical-Relation-Extraction/Dataset/I2B2-2010RE/i2b2_train.tsv", sep='\t',header=None)


In [None]:
# Collect column 0 of the DDI2013 training set into a list
SentencesDDI = DatasetTrainDDI2013[0].tolist()
print(SentencesDDI[0:10])

In [None]:
# Collect column 0 of the Chemprot training set into a list
SentencesChemprot = DatasetTrainChemprot[0].tolist()
print(SentencesChemprot[0:10])

In [None]:
# Collect column 0 of the i2b2 training set into a list
SentencesI2B2 = DatasetTrainI2B2[0].tolist()
print(SentencesI2B2[0:10])

Sentence encoding

In [None]:
sentence_embeddingsDDI = model.encode(SentencesDDI)
sentence_embeddingsCHEM = model.encode(SentencesChemprot)
sentence_embeddingsI2B2 = model.encode(SentencesI2B2)

Similarity between DDI and Chemprot

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# For each DDI sentence, average its cosine similarity to every Chemprot sentence
score_DDICHEM=[]
for i in range(len(SentencesDDI)):
  score_DDICHEM.append(np.mean(cosine_similarity(
      [sentence_embeddingsDDI[i]],
      sentence_embeddingsCHEM
  )))
print(score_DDICHEM) # these per-sentence scores still have to be summed and averaged

In [None]:
import numpy as np
np.sum(score_DDICHEM)/len(SentencesDDI)
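
The dataset-level score above is the mean cosine similarity over all sentence pairs. A sketch of an equivalent vectorized computation follows, assuming no embedding is the zero vector. The helper `mean_pairwise_cosine` is hypothetical, and it relies on the fact that the mean of all pairwise dot products equals the dot product of the two mean vectors.

```python
import numpy as np


def mean_pairwise_cosine(emb_a, emb_b):
    """Mean cosine similarity over all (row of emb_a, row of emb_b) pairs."""
    a = np.asarray(emb_a, dtype=np.float64)
    b = np.asarray(emb_b, dtype=np.float64)
    # Scale every row to unit length so that dot products are cosine similarities
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # The mean over i, j of (a_i . b_j) equals mean_i(a_i) . mean_j(b_j)
    return float(a_unit.mean(axis=0) @ b_unit.mean(axis=0))


# Toy embeddings: the pairwise cosines are 1 and 0, so the mean is 0.5
print(mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]))
```

This avoids building the full pairwise similarity matrix, which matters when both datasets contain tens of thousands of sentences.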