# Siamese BERT Pipeline for Entity Normalization

This notebook runs an experiment with a Siamese BERT model for protein name normalization.
We use the BioCreative dataset as the query set, as explained in the [NSEEN paper](https://www.isi.edu/~ambite/papers/NSEEN__Neural_Semantic_Embedding_for_Entity_Normalization.pdf).

#### Set CUDA visible devices if you work with multiple GPUs

We need to set the CUDA device here, before importing any PyTorch module.
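
Outside a notebook, where the `%env` magic is not available, the same setting can be made from plain Python. The sketch below assumes the device index `7` used in this notebook; once the variable is set, that GPU is the only one the process sees, so it is addressed as `cuda:0`.

```python
import os

# Must run before the first `import torch`; "7" mirrors the index used in this notebook.
os.environ["CUDA_VISIBLE_DEVICES"] = "7"

# import torch  # import PyTorch only after the variable is set

visible = os.environ["CUDA_VISIBLE_DEVICES"]
print(visible)
```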

In [None]:
%env CUDA_VISIBLE_DEVICES=7

In [None]:
import os
os.environ['CUDA_VISIBLE_DEVICES']

In [None]:
ROOT_DIR = '../'

In [None]:
import random
import math
import time
import copy
import os
import sys
from datetime import date

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

import logging
import nltk
import glob

from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel, BertForSequenceClassification, BertModel
from transformers import AdamW

from sentence_transformers import SentenceTransformer, models, SentencesDataset, InputExample, losses, evaluation

# Add the relative utils folder to the path
sys.path.append(ROOT_DIR)

from utils.uniprot_loader import *
from utils.annoy_helper import *
from utils.biocreative_helper import *
from utils.hard_negative import *

logger = logging.getLogger()
logger.setLevel(logging.INFO)

%load_ext autoreload
%autoreload 2

In [None]:
# Set default device to cuda if available
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
device

# Step 1 Finetune BioBERT

### Data Preparation

In this step we use a built-in function to read data from uniprot_data_prep.

You can also bring your own dataset by providing `train_data` and `dev_data` as lists of triplets, each of the form [name1, name2, score]:

- name1: str
- name2: str
- score: float

```
[
    ['IF2(mt)', 'MTIF2', 1],
    ['FRDA', 'Frataxin intermediate form', 1],
    ['GATL3', 'L-JAK', 0],
]
```
 
Alternatively, use `get_data_from_path` to prepare the data for you.

`get_data_from_path` expects the following files in the folder. Each file must contain the columns name1, name2 and label.
- train.tsv
- dev.tsv
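
A minimal sketch of that file layout, using only the standard library. `write_pairs` and `read_pairs` are hypothetical helpers written for this illustration; they are not part of the repository, and `get_data_from_path` may parse the files differently.

```python
import csv
import os
import tempfile

def write_pairs(path, rows):
    """Write (name1, name2, label) rows to a tab-separated file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["name1", "name2", "label"])
        writer.writerows(rows)

def read_pairs(path):
    """Read the file back as [name1, name2, score] triplets (hypothetical reader)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [[r["name1"], r["name2"], float(r["label"])] for r in reader]

# Write both files into a temporary folder, then read them back.
data_dir = tempfile.mkdtemp()
write_pairs(os.path.join(data_dir, "train.tsv"),
            [("IF2(mt)", "MTIF2", 1), ("GATL3", "L-JAK", 0)])
write_pairs(os.path.join(data_dir, "dev.tsv"),
            [("FRDA", "Frataxin intermediate form", 1)])

train_pairs = read_pairs(os.path.join(data_dir, "train.tsv"))
dev_pairs = read_pairs(os.path.join(data_dir, "dev.tsv"))
```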

The cells below create these files by subsampling the full data in `data/temp`.

In [None]:
pd.read_csv(os.path.join(ROOT_DIR, 'data/temp/dev.tsv'), delimiter='\t').sample(frac=0.1).to_csv(os.path.join(ROOT_DIR, 'data/dev.tsv'), sep='\t', index=False)

In [None]:
pd.read_csv(os.path.join(ROOT_DIR, 'data/temp/train.tsv'), delimiter='\t').sample(frac=0.005).to_csv(os.path.join(ROOT_DIR, 'data/train.tsv'), sep='\t', index=False)

In [None]:
train_data, dev_data = get_data_from_path(os.path.join(ROOT_DIR, 'data'))

In [None]:
p1train = set([x[0] for x in train_data])
p2train = set([x[1] for x in train_data])
p1dev = set([x[0] for x in dev_data])
p2dev = set([x[1] for x in dev_data])

In [None]:
print("All Train: {}".format(len(p1train.union(p2train))))
print("All Dev: {}".format(len(p1dev.union(p2dev))))
print("Seen in Train: {}".format(len(p1dev.union(p2dev).intersection(p1train.union(p2train)))))

### Remove Seen Dev Data

To make sure nothing leaks from training into evaluation, we remove from the dev data every pair that contains a name already seen in the training data.
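
The filtering rule can be sketched on toy data as follows. The pairs are made up for illustration and reuse the example names from the data format above.

```python
train_pairs = [["IF2(mt)", "MTIF2", 1], ["GATL3", "L-JAK", 0]]
dev_pairs = [
    ["MTIF2", "IF2(mt)", 1],                    # both names appear in training
    ["FRDA", "Frataxin intermediate form", 1],  # both names are unseen
]

# Every name that occurs on either side of a training pair.
train_names = {name for n1, n2, _ in train_pairs for name in (n1, n2)}

# Keep a dev pair only if neither of its names was seen in training.
dev_pairs_clean = [
    pair for pair in dev_pairs
    if pair[0] not in train_names and pair[1] not in train_names
]
print(dev_pairs_clean)
```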

In [None]:
trainset = p1train.union(p2train)
dev_data_cleaned = [data for data in dev_data if data[0] not in trainset and data[1] not in trainset]

p1dev_clean = set([x[0] for x in dev_data_cleaned])
p2dev_clean = set([x[1] for x in dev_data_cleaned])

In [None]:
print("All Train: {}".format(len(p1train.union(p2train))))
print("All Dev: {}".format(len(p1dev_clean.union(p2dev_clean))))
print("Seen in Train: {}".format(len(p1dev_clean.union(p2dev_clean).intersection(p1train.union(p2train)))))

### Add Training Heuristic

To help the model learn syntactic similarity between two names, we add a function that generates heuristic pairs from the names. One example pairs the all-lowercase form of a name with the form whose first letter is capitalized; we give such a pair a score of 0.9. Each data point must be a list or tuple of [name1, name2, score].

__Example:__

- ["Aspirin", "aspirin", 0.9]
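
A minimal sketch of such a heuristic. `capitalization_heuristic` is a hypothetical stand-in written for this illustration; the repository's `generate_heuristic` may build different or additional pairs.

```python
def capitalization_heuristic(names, score=0.9):
    """Pair each name's first-letter-capitalized form with its all-lowercase form."""
    pairs = []
    for name in sorted(names):
        lower = name.lower()
        capitalized = lower[:1].upper() + lower[1:]
        if capitalized != lower:  # skip names whose first character has no case
            pairs.append([capitalized, lower, score])
    return pairs

heuristic_pairs = capitalization_heuristic({"aspirin", "Frataxin"})
print(heuristic_pairs)
```

The resulting pairs have the same [name1, name2, score] shape as the training data, so they can be appended to it directly.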

In [None]:
train_heuristic_1 = generate_heuristic(p1train.union(p2train))
dev_heuristic_1 = generate_heuristic(p1dev_clean.union(p2dev_clean))

train_data = [[s1,s2, float(score)] for s1,s2,score in train_data]
dev_data_cleaned = [[s1,s2,float(score)] for s1,s2,score in dev_data_cleaned]

In [None]:
train_data_with_extra = train_data + train_heuristic_1
dev_data_cleaned_with_extra = dev_data_cleaned + dev_heuristic_1

### Load Pretrained Model

In this step we load a pretrained model. The default model is `dmislab_biobert_v1.1`; however, you can pass any pretrained model to `get_model`:

```
model = get_model('siamese-biobert-v1-1-ep-5-Dec-15-2020-with-heuris', device=device)
```

In [None]:
model = get_model(device=device)

### Load Training and Dev Data into dataloader

It is important to put the data into a DataLoader so that it is batched and shuffled properly during training. In this example, we create the training dataset from `InputExample` objects provided by the Sentence Transformers library and serve it through a DataLoader, as in plain PyTorch.

We also pass the dev set to an evaluator object so that the model can evaluate its performance during training.

In [None]:
train_dataloader = get_train_dataloader(model, train_data_with_extra)

In [None]:
evaluator = get_dev_dataloader(model, dev_data_cleaned_with_extra, evaltype='cosine')

In [None]:
num_epochs = 1
total_iter = num_epochs * len(train_dataloader)
evaluation_steps = len(train_dataloader)

In this example we use CosineSimilarityLoss. However, the loss function can be changed here. Other useful losses are listed at https://www.sbert.net/docs/package_reference/losses.html
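
To give an intuition for what this loss optimizes: with its default settings, CosineSimilarityLoss is, to our understanding, the mean squared error between the cosine similarity of the two embeddings and the target score. The sketch below reproduces that computation in plain Python on made-up 2-d vectors. It is an illustration only, not the library implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cosine similarity and the target score."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, cosine similarity 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine similarity 0
]
matching_labels = cosine_similarity_loss(toy_pairs, [1.0, 0.0])
swapped_labels = cosine_similarity_loss(toy_pairs, [0.0, 1.0])
print(matching_labels, swapped_labels)
```

The loss is zero when the similarities already equal the scores and grows as they diverge, which is why graded scores such as the 0.9 used for heuristic pairs are meaningful targets.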

In [None]:
train_loss = losses.CosineSimilarityLoss(model=model)

In [None]:
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=evaluation_steps,
         )

In [None]:
print("FINISH TRAINING ... SAVING NOW")

In [None]:
today = date.today()
saved_model = os.path.join(ROOT_DIR, 'trained-model/siamese-biobert-v1-1-ep-{}-{}-with-heuris-2'.format(num_epochs, today.strftime("%b-%d-%Y")))
model.save(saved_model)
print("FINISH SAVING")

# Step 2 Prepare dataset for Semantic Search with Annoy

In the previous step, we finetuned our Siamese BioBERT model so that it learns a good embedding for protein names. A linear scan takes O(N) time per query to find the closest name, which makes it impractical for searching similar names in real time. In this experiment, we use [Annoy](https://github.com/spotify/annoy) to get the approximate nearest neighbors of the query term (query vector).
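
For reference, the exact linear scan that the approximate index replaces can be sketched as below. The corpus names and 2-d vectors are made up for illustration; the real embeddings in this notebook have 768 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def linear_scan(query_vec, corpus, top_k=2):
    """Exact nearest neighbours: score the query against every corpus vector, O(N) per query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in corpus.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

toy_corpus = {
    "name-a": [1.0, 0.0],
    "name-b": [0.9, 0.1],
    "name-c": [0.0, 1.0],
}
nearest = linear_scan([1.0, 0.05], toy_corpus)
print(nearest)
```

Every query touches every vector, so the cost grows linearly with the size of the reference set. An approximate index trades a small amount of recall for much faster lookups.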

In [None]:
from sentence_transformers import SentenceTransformer, util
import os
import csv
import pickle
import time
import torch
from annoy import AnnoyIndex

In this example, we embed all human proteins into an Annoy index file so that lookups are fast.
However, any input data with ["id", "name"] columns should also work.

In [None]:
REF_DATA_PATH =  os.path.join(ROOT_DIR, 'data/reference_data.tsv')

In [None]:
df = pd.read_csv(REF_DATA_PATH, delimiter='\t')
df

In [None]:
logger.setLevel(logging.CRITICAL)
finetune_biobert = get_model(os.path.join(ROOT_DIR, "trained-model/siamese-biobert-v1-1-ep-1-Nov-28-2021-with-heuris-2"), device=device)

In [None]:
annoy_object = AnnoyObjectWrapper(index_path='./mesh-embedding-4096-trees.ann', 
                                  embedding_path='./mesh-768-embedding.pkl', 
                                  reference_dataset_path=REF_DATA_PATH, 
                                  name2id_path='./mesh-name2id-embedding-size-1500000', 
                                  model=finetune_biobert, n_trees=4096, embedding_size=768, max_corpus_size=1500000)

In [None]:
annoy_object.create_embedding_and_index(create_new_embedding=True, create_new_index=True)

# Step 3 Hard Negative Mining

Finetuning with random negatives may not be enough for the model to learn to distinguish actual synonyms from names that only look similar to the query term. Therefore, we introduce hard negative mining in this step to improve the model's performance.
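
The selection rule used by `hard_neg_mining_from_ref` below, as we read it, is: among the retrieved names, every wrong name ranked above the lowest-ranked correct answer becomes a hard negative. A minimal sketch on a made-up ranked list follows; `select_hard_negatives` is a hypothetical helper for illustration and works on plain name lists rather than Annoy hits.

```python
def select_hard_negatives(query, ranked_names, correct_names):
    """Return [query, name] pairs for wrong names ranked above the lowest-ranked correct hit."""
    correct_ranks = [i for i, name in enumerate(ranked_names) if name in correct_names]
    # If no correct name was retrieved, use the last position as the cutoff.
    cutoff = correct_ranks[-1] if correct_ranks else len(ranked_names) - 1
    return [[query, name] for name in ranked_names[:cutoff] if name not in correct_names]

ranked = ["decoy-1", "Frataxin", "decoy-2", "Frataxin intermediate form", "decoy-3"]
correct = {"Frataxin", "Frataxin intermediate form"}
hard_negatives = select_hard_negatives("FRDA", ranked, correct)
print(hard_negatives)
```

Here `decoy-3` is not selected because it already ranks below every correct answer, so the model is not confusing it with a synonym.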

In [None]:
def list_contains_no_answer_in_higher_rank(lowest_correct_answer, hits, possible_answer, corpus_sentences):
    for i in range(lowest_correct_answer, -1, -1):
        hitname = corpus_sentences[hits[i]['corpus_id']]
        if hitname not in possible_answer:
            return True
    return False

In [None]:
def get_id2name(dataset_path):
    all_name = pd.read_csv(dataset_path, delimiter='\t')
    all_name = all_name.dropna() 
    
    id2name = {}
    for idx, row in all_name.iterrows():
        # Strip once so the membership check and the stored key agree
        key = row['id'].strip()
        if key not in id2name:
            id2name[key] = set()
        id2name[key].add(row['name'].strip())
        
    return id2name

In [None]:
def get_name2id(dataset_path):
    all_name = pd.read_csv(dataset_path, delimiter='\t')
    all_name = all_name.dropna() 
    
    name2id = {}
    for idx, row in all_name.iterrows():
        # Strip once so the membership check and the stored key agree
        key = row['name'].strip()
        if key not in name2id:
            name2id[key] = set()
        name2id[key].add(row['id'].strip())
        
    return name2id

In [None]:
import os
def get_hard_neg(hard_neg_path, last_k_files=None):
    df_list = []
    filenames = glob.glob(hard_neg_path + '/*')
    filenames.sort(key=os.path.getmtime)
    if last_k_files:
        filenames = filenames[-last_k_files:]
    print("reading hard_neg from filenames: ", filenames)    
    for file in filenames:
        df_list.append(pd.read_csv(file, delimiter='\t'))
    try:
        return pd.concat(df_list)
    except ValueError:
        # pd.concat raises ValueError when there are no hard-negative files yet
        return pd.DataFrame(columns=['name1', 'name2'])

In [None]:
def hard_neg_mining_from_ref(model, annoy_object_wrapper, ref, id2name, top_k_hits=100):
    corpus_sentences = annoy_object_wrapper.embedding_object.corpus_sentences
    corpus_embeddings = annoy_object_wrapper.embedding_object.corpus_embeddings
    name_to_id = annoy_object_wrapper.embedding_object.name_to_id
    annoy_index = annoy_object_wrapper.annoy_index

    hard_negative_pairs = []
    from tqdm import tqdm
    pbar = tqdm(total=len(ref), position=0, leave=True)
    for i, kv in enumerate(ref):
        query, answers = kv
        query_embedding = model.encode(query)

        found_corpus_ids, scores = annoy_index.get_nns_by_vector(query_embedding, top_k_hits, include_distances=True)
        hits = []
        for corpus_id, distance in zip(found_corpus_ids, scores):
            # Distance -> cosine similarity, assuming the index uses Annoy's angular metric
            hits.append({'corpus_id': corpus_id, 'score': 1-((distance**2) / 2)})
        
        possible_answer = set()
        for ans in answers:
            possible_answer = possible_answer.union(id2name.get(ans, set()))
            
        # Find the rank of the lowest-ranked correct answer; default to the last rank if none is found
        lowest_rank = top_k_hits - 1
        for rank, hit in enumerate(hits[0:top_k_hits]):
            hitname = corpus_sentences[hit['corpus_id']]
            if hitname in possible_answer:
                lowest_rank = rank
        has_wrong_answer_above = list_contains_no_answer_in_higher_rank(lowest_rank, hits[0:top_k_hits], possible_answer, corpus_sentences)
        for i, hit in enumerate(hits[0:top_k_hits]):
            hitname = corpus_sentences[hit['corpus_id']]
            if hitname not in possible_answer:
                if has_wrong_answer_above:
                    if i < lowest_rank:
                        hard_negative_pairs.append([query, hitname])

        pbar.update(1)
    return hard_negative_pairs
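
The score conversion `1 - d**2 / 2` in the function above relies on how Annoy defines its angular distance: the Euclidean distance between the normalized vectors, which equals sqrt(2 * (1 - cos)). The sketch below checks that relation on two made-up vectors in plain Python. It assumes the index was built with the angular metric, which is not visible in this notebook.

```python
import math

def angular_distance(u, v):
    """Euclidean distance between the L2-normalized vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.sqrt(sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v)))

def distance_to_cosine(d):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1 - (d ** 2) / 2

u, v = [3.0, 4.0], [4.0, 3.0]
direct_cosine = (3.0 * 4.0 + 4.0 * 3.0) / (5.0 * 5.0)
recovered_cosine = distance_to_cosine(angular_distance(u, v))
print(direct_cosine, recovered_cosine)
```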

In [None]:
logger.setLevel(logging.CRITICAL)

id2name = get_id2name(REF_DATA_PATH)
name2id = get_name2id(REF_DATA_PATH)
all_dev_name = set() # We add everything into reference and evaluate with another dataset. So, we leave all_dev_name empty
ref_for_hardneg = [(name, id_set) for name, id_set in list(name2id.items()) if name not in all_dev_name]
hard_negative_pairs = hard_neg_mining_from_ref(finetune_biobert, annoy_object, ref_for_hardneg, id2name)
len(hard_negative_pairs)

In [None]:
import glob
with open(os.path.join(ROOT_DIR, 'experiments/hard_neg/hard_neg_{}.tsv'.format(len(glob.glob(os.path.join(ROOT_DIR, 'experiments/hard_neg/*'))) + 1)), 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    writer.writerow(['name1', 'name2'])
    for pair in hard_negative_pairs:
        writer.writerow(pair)

## Finetune siamese model with HardNeg Mining

In this step, we combine everything above and finetune the model with hard negative mining.
First we load `cur_pretrained`, the model for which we want to mine hard negatives.
The following section combines reading the input, finetuning the model, mining hard negative names and evaluating with the BioCreative data.

**BioCreative Data**: a dataset of annotated protein names from biomedical publications. We treat it as a test dataset because we have already used every human protein in UniProt for training.

In [None]:
num_training_epoch = 3

In [None]:
cur_pretrained = os.path.join(ROOT_DIR, 'trained-model/siamese-biobert-v1-1-ep-1-Nov-28-2021-with-heuris-2')
id2name = get_id2name(REF_DATA_PATH)
name2id = get_name2id(REF_DATA_PATH)
top_k_hits = 10
for training_round in range(num_training_epoch):
    logger.setLevel(logging.INFO)
    train_data, dev_data = get_data_from_path(os.path.join(ROOT_DIR, 'data'))
    p1train = set([x[0] for x in train_data])
    p2train = set([x[1] for x in train_data])
    p1dev = set([x[0] for x in dev_data])
    p2dev = set([x[1] for x in dev_data])

    # clean dev data
    trainset = p1train.union(p2train)
    dev_data_cleaned = [data for data in dev_data if data[0] not in trainset and data[1] not in trainset]

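    train_heuristic_1 = generate_heuristic(trainset)
    # Build the dev name set from this round's cleaned dev data rather than from variables left over from an earlier cell
    dev_names_clean = {name for pair in dev_data_cleaned for name in pair[:2]}
    dev_heuristic_1 = generate_heuristic(dev_names_clean)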

    train_data = [[s1,s2, float(score)] for s1,s2,score in train_data]
    dev_data_cleaned = [[s1,s2,float(score)] for s1,s2,score in dev_data_cleaned]

    devname = set()
    for n1, n2, s in dev_data_cleaned:
        devname.add(n1)
        devname.add(n2)

    hard_neg_list = get_hard_neg(os.path.join(ROOT_DIR, 'experiments/hard_neg'))
    hard_neg_train, hard_neg_dev = remove_dup_hard_neg(hard_neg_list, devname)

    train_data_with_extra = train_data + train_heuristic_1 + hard_neg_train
    dev_data_cleaned_with_extra = dev_data_cleaned + dev_heuristic_1 + hard_neg_dev

    model = get_model(cur_pretrained, device=device)

    # Create Train Dataset using InputExample Provided by Sentence Transformer Library
    examples = []
    batch_size = 64
    for _, data in enumerate(train_data_with_extra):
        s1, s2, score = data
        ex = InputExample(texts=[s1,s2],label=score)
        examples.append(ex)

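    train_dataset = SentencesDataset(examples, model)
    # NOTE: only the first 100 examples are used here; pass train_dataset itself to train on all of them
    train_dataloader = DataLoader(train_dataset[:100], shuffle=True, batch_size=batch_size)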

    num_epochs = 1
    total_iter = num_epochs * len(train_dataloader)

    # Train with cosine-similarity loss on the pair scores
    train_loss = losses.CosineSimilarityLoss(model=model)
    
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs)

    today = date.today()
    saved_model = os.path.join(ROOT_DIR, 'trained-model/siamese-biobert-v1-1-ep-{}-{}-with-heuris-hard-neg-{}'.format(training_round, today.strftime("%b-%d-%Y"), len(glob.glob(os.path.join(ROOT_DIR, 'experiments/hard_neg/*')))))
    model.save(saved_model)
    print("FINISH SAVING")

    # Load Saved Model
    logger.setLevel(logging.CRITICAL)
    finetune_biobert = get_model(saved_model, device=device)
    annoy_object = AnnoyObjectWrapper(index_path='./mesh-embedding-4096-trees.ann', 
                                  embedding_path='./mesh-768-embedding.pkl', 
                                  reference_dataset_path=REF_DATA_PATH, 
                                  name2id_path='./mesh-name2id-embedding-size-1500000', 
                                  model=finetune_biobert, n_trees=4096, embedding_size=768, max_corpus_size=1500000)
    annoy_object.create_embedding_and_index(create_new_embedding=True, create_new_index=True)
    print("Getting HardNegMining")
    
    logger.setLevel(logging.CRITICAL)
    ref_for_hardneg = [(name, id_set) for name, id_set in list(name2id.items())]
    hard_negative_pairs = hard_neg_mining_from_ref(finetune_biobert, annoy_object, ref_for_hardneg, id2name, top_k_hits=top_k_hits)
    
    with open(os.path.join(ROOT_DIR, 'experiments/hard_neg/hard_neg_{}.tsv'.format(len(glob.glob(os.path.join(ROOT_DIR, 'experiments/hard_neg/*'))) + 1)), 'w') as csvfile:
        writer = csv.writer(csvfile, delimiter='\t')
        writer.writerow(['name1', 'name2'])
        for pair in hard_negative_pairs:
            writer.writerow(pair)

    cur_pretrained = saved_model
    print("Finish Epoch ", training_round)