<a href="https://colab.research.google.com/github/aivscovid19/covid-19_research_collaboration/blob/master/notebooks/Extract_BioMedBERT_LG_Embedding_bioasq_8b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MRR for BERT and BERT + RESCORE

[Evaluation of Ranking for Medical Papers](https://github.com/aivscovid19/covid-19_research_collaboration/issues/9)

This notebook runs three ranking experiments (as of May 24, 2020) on the BioASQ dataset and reports the mean reciprocal rank (MRR) of each.

- Experiment 1: MRR 0.9124373431039607, using the default Elasticsearch ranking algorithm.

- Experiment 2: MRR 0.5496368744211725, using BioMedBERT LG embeddings @ 770K train step.

- Experiment 3: MRR 0.38342058777179133, using [universal-sentence-encoder-lite](https://tfhub.dev/google/universal-sentence-encoder-lite/2) embeddings.


So far, the default Elasticsearch ranking gives the best MRR of the three.


# Setup Environment

In [0]:
from google.colab.auth import authenticate_user
authenticate_user()

In [2]:
%tensorflow_version 1.x

import tensorflow as tf
tf.__version__

TensorFlow 1.x selected.


'1.15.2'

In [0]:
#@title Load extensions
%load_ext google.colab.data_table

In [0]:
try:
    %pip install --user --upgrade --quiet elasticsearch
    import elasticsearch
except ModuleNotFoundError:
    import os
    os.kill(os.getpid(), 9) 

In [0]:
import csv
import os
import time
import re
import sys
import json

import numpy as np
import pandas as pd

import nltk
#from nltk.corpus import stopwords 
#from nltk.tokenize import word_tokenize

from tqdm import tqdm

from elasticsearch import helpers, Elasticsearch

## Setup Elastic Search Account

Log in to [Kibana](https://d7e0e807713c441295ed9707b13a089f.us-central1.gcp.cloud.es.io:9243/app/kibana#/dev_tools/console)
and create an API key that expires in 30 days:

```
POST /_security/api_key
{
  "name": "username",
  "expiration": "30d", 
  "role_descriptors": { 
    "role-a": {
      "cluster": ["all"],
      "index": [
        {
          "names": ["bioasq-8b-baseline", "ncbi"],
          "privileges": ["read"]
        }
      ]
    }
  }
}
```


In [0]:
# paste the token here
ES_ACCOUNT = {
  "id" : "bR3YinIBcfscfBmK1I1B",
  "name" : "fabrizio",
  "expiration" : 1594059365426,
  "api_key" : "CEmtFQR4Tnq2dapnWS0tkw"
}

In [0]:
ES_ENDPOINT='https://d1f43211bd5c4fc29e56a232832b7b17.us-central1.gcp.cloud.es.io:9243'

In [8]:
es = Elasticsearch([ES_ENDPOINT], api_key=(ES_ACCOUNT['id'], ES_ACCOUNT['api_key']))
es.info()

{'cluster_name': 'd1f43211bd5c4fc29e56a232832b7b17',
 'cluster_uuid': 'gCEe5S4fTBCM_w0uejAP9w',
 'name': 'instance-0000000006',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2020-03-26T06:34:37.794943Z',
  'build_flavor': 'default',
  'build_hash': 'ef48eb35cf30adf4db14086e8aabd07ef6fb113f',
  'build_snapshot': False,
  'build_type': 'tar',
  'lucene_version': '8.4.0',
  'minimum_index_compatibility_version': '6.0.0-beta1',
  'minimum_wire_compatibility_version': '6.8.0',
  'number': '7.6.2'}}

In [0]:
#@title Elastic SEARCH
def SEARCH(text, index, key, limit=31):
    res = es.search(index=index,
                    body={
                        "query": {
                            "match": {
                                # match against the field named by `key`
                                key: {
                                    "query": text,
                                    "operator": "or",
                                    "fuzziness": "0"
                                }
                            }
                        },
                        "min_score": -1,
                    },
                    size=limit)

    return ([(x.get('_source'), x.get('_score')) for x in res['hits']['hits']])
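
The request body that `SEARCH` sends can be inspected without a running cluster. The sketch below rebuilds it as a plain dictionary; `build_match_query` is a hypothetical helper introduced only for illustration, and the field name defaults to `context` as in the calls in this notebook.

```python
import json


def build_match_query(text, field="context"):
    """Request body for a plain `match` query on one field (illustrative helper)."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "operator": "or",   # a document may match on any query term
                    "fuzziness": "0",   # no edit-distance tolerance
                }
            }
        },
        "min_score": -1,
    }


body = build_match_query("what is a virus?")
print(json.dumps(body, indent=2))
```

Passing a different `field` changes only the key under `match`; the rest of the body stays the same.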


In [0]:
q_results = SEARCH("what is a virus?", 'bioasq-8b-baseline', 'context', limit=1)

In [11]:
len(q_results)

1

In [0]:
hit, score = q_results[0]

In [13]:
 hit['context']

'the five key questions of human performance modeling: 1) Why we build models of human performance; 2) What the expectations of a good human performance model are; 3) What the procedures and requirements in building and verifying a human performance model are; 4) How we integrate a human performance model with system design; and 5) What the possible future directions of human performance modeling research are. This paper describes and summarizes the five key questions of human performance modeling: 1) Why we build models of human performance; 2) What the expectations of a good human performance model are; 3) What the procedures and requirements in building and verifying a human performance model are; 4) How we integrate a human performance model with system design; and 5) What the possible future directions of human performance modeling research are.We conducted a systematic review of the literature to address 5 key questions specific to this population: 1) What are the current nationa

# Load BERT-BREATHE

In [0]:
#TODO: substitute these with the latest release?
bucket_name = 'ekaba-assets'
model_dir = 'COPY_biomedbert_base_bert_weights_and_vocab'

In [0]:
%%capture --no-stderr
![ -d bert ] || git clone https://github.com/google-research/bert

In [0]:
sys.path.append('/content/bert')

In [0]:
#@title read_examples
import modeling
import tokenization
from extract_features import InputExample

def read_examples(text_lines=()):
    """Build a list of `InputExample`s from a list of text lines."""
    examples = []
    unique_id = 0
    for line in text_lines:
        line = tokenization.convert_to_unicode(line)
        line = line.strip()
        text_a = None
        text_b = None
        m = re.match(r"^(.*) \|\|\| (.*)$", line)
        if m is None:
            text_a = line
        else:
            text_a = m.group(1)
            text_b = m.group(2)
        examples.append(
            InputExample(unique_id=unique_id, 
                            text_a=text_a, 
                            text_b=text_b))
        unique_id += 1
    return examples

In [0]:
#@title input function builder 

def input_fn_builder(features, seq_length):
  # `features` and `seq_length` are used below, so they must be parameters
  all_unique_ids = []
  all_input_ids = []
  all_input_mask = []
  all_input_type_ids = []

  for feature in features:
    all_unique_ids.append(feature.unique_id)
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_input_type_ids.append(feature.input_type_ids)

  def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]

    num_examples = len(features)

    # This is for demo purposes and does NOT scale to large data sets. We do
    # not use Dataset.from_generator() because that uses tf.py_func which is
    # not TPU compatible. The right way to load data is with TFRecordReader.
    d = tf.data.Dataset.from_tensor_slices({
        "unique_ids":
            tf.constant(all_unique_ids, shape=[num_examples], dtype=tf.int32),
        "input_ids":
            tf.constant(
                all_input_ids, shape=[num_examples, seq_length],
                dtype=tf.int32),
        "input_mask":
            tf.constant(
                all_input_mask,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "input_type_ids":
            tf.constant(
                all_input_type_ids,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
    })

    d = d.batch(batch_size=batch_size, drop_remainder=False)
    return d

  return input_fn


In [0]:
#@title load_model
from extract_features import convert_examples_to_features, model_fn_builder, input_fn_builder

import collections

def load_model(model_path, batch_size=8):

    init_checkpoint = tf.train.latest_checkpoint(model_path)
    vocab_file = os.path.join(model_path, 'vocab.txt')
    bert_config_file = os.path.join(model_path, 'bert_config.json')
    # TODO: assert the files exist

    # we only pick the last layer
    layer_indexes = [-1]
    
    bert_config = modeling.BertConfig.from_json_file(bert_config_file)
   
    do_lower_case = True

    tokenizer = tokenization.FullTokenizer(
        vocab_file=vocab_file, do_lower_case=do_lower_case)
    
    is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
    master = None
    num_tpu_cores = 8
    run_config = tf.contrib.tpu.RunConfig(
        master=master,
        tpu_config=tf.contrib.tpu.TPUConfig(
            num_shards=num_tpu_cores,
            per_host_input_for_training=is_per_host))
   
    use_tpu = False
    use_one_hot_embeddings = False
    # BERT MODEL
    model_fn = model_fn_builder(
        bert_config=bert_config,
        init_checkpoint=init_checkpoint,
        layer_indexes=layer_indexes,
        use_tpu=use_tpu,
        use_one_hot_embeddings=use_one_hot_embeddings)
    
    # If TPU is not available, this will fall back to normal Estimator on CPU
    # or GPU.
    estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=use_tpu,
        model_fn=model_fn,
        config=run_config,
        predict_batch_size=batch_size)
    
    return estimator, tokenizer


def predict_fn(input_values, estimator, tokenizer, max_seq_length=128):
    # process input 
    examples = read_examples(input_values)

    # Build 
    features = convert_examples_to_features(
        examples=examples, seq_length=max_seq_length, tokenizer=tokenizer)
    
    # input_fn = input_fn_builder(
    #     features=features, seq_length=max_seq_length)
    
    all_unique_ids = []
    all_input_ids = []
    all_input_mask = []
    all_input_type_ids = []

    for feature in features:
        all_unique_ids.append(feature.unique_id)
        all_input_ids.append(feature.input_ids)
        all_input_mask.append(feature.input_mask)
        all_input_type_ids.append(feature.input_type_ids)

    def input_fn(params):
        """The actual input function."""
        batch_size = params["batch_size"]
    
        num_examples = len(features)
    
        # This is for demo purposes and does NOT scale to large data sets. We do
        # not use Dataset.from_generator() because that uses tf.py_func which is
        # not TPU compatible. The right way to load data is with TFRecordReader.
        d = tf.data.Dataset.from_tensor_slices({
            "unique_ids":
                tf.constant(all_unique_ids, shape=[num_examples], dtype=tf.int32),
            "input_ids":
                tf.constant(
                    all_input_ids, shape=[num_examples, max_seq_length],
                    dtype=tf.int32),
            "input_mask":
                tf.constant(
                    all_input_mask,
                    shape=[num_examples, max_seq_length],
                    dtype=tf.int32),
            "input_type_ids":
                tf.constant(
                    all_input_type_ids,
                    shape=[num_examples, max_seq_length],
                    dtype=tf.int32),
        })

        d = d.batch(batch_size=batch_size)
        return d

    unique_id_to_feature = {}
    for feature in features:
        unique_id_to_feature[feature.unique_id] = feature

    layer_indexes = [-1]
    result_list = []
    for result in estimator.predict(input_fn, yield_single_examples=True):
        unique_id = int(result["unique_id"])
        feature = unique_id_to_feature[unique_id]
        output_json = collections.OrderedDict()
        output_json["linex_index"] = unique_id
        all_features = []
        for (i, token) in enumerate(feature.tokens):
            all_layers = []
            for (j, layer_index) in enumerate(layer_indexes):
                layer_output = result["layer_output_%d" % j]
                layers = collections.OrderedDict()
                layers["index"] = layer_index
                layers["values"] = [
                    round(float(x), 6) for x in layer_output[i:(i + 1)].flat
                ]
                all_layers.append(layers)
            # separate name so the outer `features` list is not overwritten
            token_features = collections.OrderedDict()
            token_features["token"] = token
            token_features["layers"] = all_layers
            all_features.append(token_features)
        output_json["features"] = all_features
        result_list.append(output_json)
    return result_list

In [20]:
MODEL_LOCATION=f'gs://{bucket_name}/{model_dir}/'
MODEL_LOCATION

'gs://ekaba-assets/COPY_biomedbert_base_bert_weights_and_vocab/'

In [21]:
estimator, tokenizer = load_model(MODEL_LOCATION, batch_size=64)


The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmphc2waln9', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, 

In [0]:
def extract_embeddings(question_context_list):
    # 1. extract context
    context_list = [ q['context'] for q in question_context_list]
    query_list = [ q['question'] for q in question_context_list ] 

    # 2. compute the embeddings
    embedding_list = predict_fn(query_list + context_list, estimator, tokenizer)

    query_embeddings = embedding_list[:len(query_list)] 
    context_embeddings = embedding_list[len(query_list):] 
    assert len(query_embeddings) == len(context_embeddings)

    query_embeddings = get_sent_embed(query_embeddings)
    context_embeddings = get_sent_embed(context_embeddings)
    
    return query_embeddings, context_embeddings

# Compute MRR

The **mean reciprocal rank** (MRR) is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness.

The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 
- 1 for first place, 
- $\frac12$ for second place, 
- $\frac13$ for third place and so on.

The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries:

$$
 \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}
$$

where $\text{rank}_i$ refers to the rank position of the *first* relevant document for the $i$-th query.
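
As a minimal sketch of the formula above, the helper below averages reciprocal ranks in plain Python. The name `mean_reciprocal_rank` and the convention that a query with no relevant result is passed as `None` and contributes 0 are assumptions made for illustration; they are not part of the original notebook.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries.

    `ranks` holds the 1-based rank of the first relevant document for each
    query. `None` marks a query with no relevant document retrieved
    (assumption: such a query contributes 0 to the mean).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = sum(1.0 / r for r in ranks if r is not None)
    return total / len(ranks)


# Three queries whose first relevant hit is at rank 1, 2 and 3:
# MRR = (1 + 1/2 + 1/3) / 3
example = mean_reciprocal_rank([1, 2, 3])
```

Counting a miss as 0 keeps the denominator equal to the number of queries, which matches the $|Q|$ in the formula.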



# Read BIOASQ 8b question json  

In [23]:
!gsutil cp gs://bioasq/DATA/training8b_list.json .

Copying gs://bioasq/DATA/training8b_list.json...
/ [1 files][  2.2 MiB/  2.2 MiB]
Operation completed over 1 objects/2.2 MiB.                                      


In [24]:
!ls .

adc.json  bert	sample_data  training8b_list.json


In [0]:
import json
with open('./training8b_list.json') as fd:
    dataset = json.load(fd)

In [26]:
!head training8b_list.json

{"questions": [{"question": "List signaling molecules (ligands) that interact with the receptor EGFR?", "ideal_answer": ["The 7 known EGFR ligands  are: epidermal growth factor (EGF), betacellulin (BTC), epiregulin (EPR), heparin-binding EGF (HB-EGF), transforming growth factor-\u03b1 [TGF-\u03b1], amphiregulin (AREG) and epigen (EPG)."], "exact_answer": [["epidermal growth factor"], ["betacellulin"], ["epiregulin"], ["heparin-binding epidermal growth factor"], ["transforming growth factor-\u03b1"], ["amphiregulin"], ["epigen"]], "context": "the epidermal growth factor receptor (EGFR) ligands, such as epidermal growth factor (EGF) and amphiregulin (AREG) EGFR ligands epidermal growth factor (EGF), amphiregulin (AREG) and transforming growth factor alpha (TGF\u03b1) EGFR and its ligand EGF Among EGFR ligands, heparin-binding EGF-like growth factor, TGF-\u03b1 and Betacellulin (BTC) are produced in the tumor microenvironment of FDC-S at RNA level. . Plasma amphiregulin (AR), epidermal gr

In [27]:
questions = dataset['questions']
len(dataset['questions'])

644

In [0]:
sample_question = dataset['questions'][0]

In [29]:
sample_question['question']

'List signaling molecules (ligands) that interact with the receptor EGFR?'

In [30]:
sample_question['context']

'the epidermal growth factor receptor (EGFR) ligands, such as epidermal growth factor (EGF) and amphiregulin (AREG) EGFR ligands epidermal growth factor (EGF), amphiregulin (AREG) and transforming growth factor alpha (TGFα) EGFR and its ligand EGF Among EGFR ligands, heparin-binding EGF-like growth factor, TGF-α and Betacellulin (BTC) are produced in the tumor microenvironment of FDC-S at RNA level. . Plasma amphiregulin (AR), epidermal growth factor (EGF), transforming growth factor-α, and heparin binding-EGF were assessed by ELISA in 45 chemorefractory mCRC patientsAmong EGFR ligands, heparin-binding epidermal growth factor (HB-EGF) Of the six known EGFR ligands, transforming growth factor alpha (TGFα) was expressed more highly in triple-negative breast tumors than in tumors of other subtypes.the 7 known EGFR ligands (EGF, betacellulin, epiregulin, heparin-binding EGF, transforming growth factor-α [TGF-α], amphiregulin, and epigen) EGFR ligands based on the two affinity classes: EGF>

# Experiment 1: Run each BioASQ question on Elastic Search

In [31]:
scores = []
all_results = []

it = tqdm(enumerate(questions), total=len(questions))

for qix, q in it:
    qin = q['question']
    ranking = SEARCH(qin, 'bioasq-8b-baseline', 'context', limit=700)
    all_results.append(ranking)
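    for i, (r, _) in enumerate(ranking):
        if r['context'] == q['context']:
            scores.append(1./(i+1.)) 
            break
    # NOTE: a question whose context is not among the hits adds no score,
    # so the average below is taken over the matched questions only.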

100%|██████████| 644/644 [01:44<00:00,  6.15it/s]


In [32]:
len(scores)

643

In [33]:
mrr = np.average(scores)
experiments_results = [{'Experiment': 'Elastic Search', 'MRR': mrr}]
pd.DataFrame(experiments_results)

       Experiment       MRR
0  Elastic Search  0.912437

# Experiment 2: Use BERT Embeddings to compute MRR

In [34]:
#@title Make sure you have a GPU enabled
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
# force the GPU load
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    pass

Num GPUs Available:  1
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0



In [35]:
#@title Get all BioASQ sentence embeddings
with open('./training8b_list.json') as fd:
    dataset = json.load(fd)
    questions = dataset['questions']

# 1. extract context
context_list = [ q['context'] for q in questions]
query_list = [ q['question'] for q in questions] 

# 2. compute the embeddings
embedding_list = predict_fn(query_list + context_list, estimator, tokenizer)

query_embeddings = embedding_list[:len(query_list)] 
context_embeddings = embedding_list[len(query_list):] 


INFO:tensorflow:*** Example ***
INFO:tensorflow:unique_id: 0
INFO:tensorflow:tokens: [CLS] list signaling molecules ( l ##igan ##ds ) that interact with the receptor e ##g ##f ##r ? [SEP]
INFO:tensorflow:input_ids: 101 2190 16085 10799 113 181 10888 3680 114 1115 12254 1114 1103 10814 174 1403 2087 1197 136 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [0]:
assert len(query_embeddings) == len(context_embeddings)

In [0]:
def get_sent_embed(dict_list) :
    sent_embed = []
    for line in dict_list:
        feat_embed = np.array(line['features'][0]['layers'][0]['values'])
        # assert feat_embed.sum() > 0
        sent_embed.append(feat_embed)
    return sent_embed

In [0]:
query_embeddings_x = get_sent_embed(query_embeddings)

In [0]:
context_embeddings_x = get_sent_embed(context_embeddings)

In [0]:
QE = np.array(query_embeddings_x)

In [0]:
CE = np.array(context_embeddings_x)

In [43]:
QE.shape, CE.shape

((644, 1024), (644, 1024))

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

In [0]:
result = cosine_similarity(QE, CE)

In [46]:
result.shape

(644, 644)

In [47]:
cosine_similarity(CE, CE)


array([[1.        , 0.60882387, 0.57962614, ..., 0.67935559, 0.55139151,
        0.60941475],
       [0.60882387, 1.        , 0.60887039, ..., 0.689746  , 0.58207824,
        0.56694001],
       [0.57962614, 0.60887039, 1.        , ..., 0.68568682, 0.61297892,
        0.69365426],
       ...,
       [0.67935559, 0.689746  , 0.68568682, ..., 1.        , 0.70755747,
        0.63964693],
       [0.55139151, 0.58207824, 0.61297892, ..., 0.70755747, 1.        ,
        0.56895656],
       [0.60941475, 0.56694001, 0.69365426, ..., 0.63964693, 0.56895656,
        1.        ]])

In [48]:
cosine_similarity(QE, QE)

array([[1.        , 0.5792667 , 0.57194834, ..., 0.62601613, 0.69520312,
        0.58311678],
       [0.5792667 , 1.        , 0.55983656, ..., 0.41129551, 0.48551593,
        0.45518869],
       [0.57194834, 0.55983656, 1.        , ..., 0.48125675, 0.49455673,
        0.4804643 ],
       ...,
       [0.62601613, 0.41129551, 0.48125675, ..., 1.        , 0.57228723,
        0.46873744],
       [0.69520312, 0.48551593, 0.49455673, ..., 0.57228723, 1.        ,
        0.51643087],
       [0.58311678, 0.45518869, 0.4804643 , ..., 0.46873744, 0.51643087,
        1.        ]])

In [49]:
qix = 100
qe = QE[qix]
questions[qix]['question'], questions[qix]['context'] 

('Which proteins act as factors that promote transcription-coupled repair in bacteria?',
 'Transcription-coupled repair (TCR) is a cellular process by which some forms of DNA damage are repaired more rapidly from transcribed strands of active genes than from nontranscribed strands or the overall genome. In humans, the TCR coupling factor, CSB, plays a critical role in restoring transcription following both UV-induced and oxidative DNA damage. It also contributes indirectly to the global repair of some forms of oxidative DNA damage. The Escherichia coli homolog, Mfd, is similarly required for TCR of UV-induced lesions.Transcription coupled nucleotide excision repair (TC-NER) is involved in correcting UV-induced damage and other road-blocks encountered in the transcribed strand. Mutation frequency decline (Mfd) is a transcription repair coupling factor, involved in repair of template strand during transcription.the transcription-repair coupling factor, Mfd, promotes direct restart of the

In [50]:
CE.shape

(644, 1024)

In [0]:
CES= CE[:100,]

In [52]:
scores = cosine_similarity(QE, CES)
scores.shape

(644, 100)

In [53]:
scores = cosine_similarity([qe], CE)
scores.shape

(1, 644)

In [54]:
ranking = scores[0].argsort()  # ascending order; the scoring loop further down reverses it with [::-1]
ranking.shape

(644,)

In [55]:
qix

100

In [0]:
result_ix = np.where(ranking == qix)

In [0]:
rix = result_ix[0][0]

In [58]:
questions[rix]


{'context': 'o assess clinical outcomes including imaging findings on computed tomography (CT), pulmonary function testing (PFT), and glucocorticoid (GC) use in patients with the antisynthetase syndrome (AS) and interstitial lung disease (ILD) treated with rituximab (RTX). RIX is the preferred off-label biologic drug for poly- and dermatomyositis in Germany.Rituximab (RTX) may be a treatment option for children and young people with JIA, although it is not licensed for this indication rituximab in children and young people with juvenile idiopathic arthritischronic lymphocytic leukemia (CLL) patients, including the Bruton\'s tyrosine kinase (BTK) inhibitor ibrutinib, phosphatidylinositol-3-kinase (PI3K) delta isoform inhibitor idelalisib combined with rituximabrituximab\'s efficacy has been well-documented in adults with refractory PFRituximab in the management of juvenile pemphigus foliaceusBACKGROUND\nRituximab is a chimeric, anti-CD20 monoclonal antibody registered for the treatment 

In [59]:
questions[qix]

{'context': 'Transcription-coupled repair (TCR) is a cellular process by which some forms of DNA damage are repaired more rapidly from transcribed strands of active genes than from nontranscribed strands or the overall genome. In humans, the TCR coupling factor, CSB, plays a critical role in restoring transcription following both UV-induced and oxidative DNA damage. It also contributes indirectly to the global repair of some forms of oxidative DNA damage. The Escherichia coli homolog, Mfd, is similarly required for TCR of UV-induced lesions.Transcription coupled nucleotide excision repair (TC-NER) is involved in correcting UV-induced damage and other road-blocks encountered in the transcribed strand. Mutation frequency decline (Mfd) is a transcription repair coupling factor, involved in repair of template strand during transcription.the transcription-repair coupling factor, Mfd, promotes direct restart of the fork following the collision by facilitating displacement of the RNAP.We repo

In [0]:
scores = []
for qix, q in enumerate(questions):
    qe = QE[qix]

    # rank all contexts by cosine similarity to this question, best first
    score = cosine_similarity([qe], CE)
    ranking = score[0].argsort()[::-1]
    # 0-based position of the question's own context in that ranking
    result_ix = np.where(ranking == qix)
    rix = result_ix[0][0]
    scores.append( 1./(rix + 1) )

    assert questions[ranking[rix]]['context'] == questions[qix]['context']

In [61]:
MRR = np.average(scores)
MRR

0.5496368744211725
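
The per-query loop above can also be expressed over the whole similarity matrix at once. The sketch below is an illustration on toy data, not the notebook's code: `reciprocal_ranks` is a hypothetical helper, and it assumes that row `i` holds the similarities of query `i` to every context and that context `i` is the correct one. It ranks by counting strictly higher scores, so ties resolve in favor of the correct context, which can differ from the `argsort` ordering used above.

```python
import numpy as np


def reciprocal_ranks(similarity):
    """Reciprocal rank of the diagonal entry within each row.

    similarity[i, j] is the score of context j for query i; the correct
    context for query i is assumed to be context i.
    """
    similarity = np.asarray(similarity, dtype=float)
    correct = np.diag(similarity)[:, None]          # score of the right context
    ranks = (similarity > correct).sum(axis=1) + 1  # 1-based rank per query
    return 1.0 / ranks


toy = np.array([
    [0.9, 0.1, 0.3],   # correct context ranked 1st
    [0.8, 0.5, 0.2],   # correct context ranked 2nd
    [0.7, 0.6, 0.1],   # correct context ranked 3rd
])
toy_mrr = reciprocal_ranks(toy).mean()
```

With the notebook's arrays, the corresponding call would take the output of `cosine_similarity(QE, CE)` as its argument.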

In [62]:
experiments_results.append({ 'Experiment': 'BERT on 700k train',  'MRR':  MRR})
pd.DataFrame(experiments_results)

           Experiment       MRR
0      Elastic Search  0.912437
1  BERT on 700k train  0.549637


# Experiment 3: Universal Sentence Encoder Embeddings 
Use a sentence encoder to generate the embeddings.

https://tfhub.dev/google/universal-sentence-encoder-lite/2



In [0]:
%pip install --quiet --user --upgrade sentencepiece
import sentencepiece as spm
import tensorflow_hub as hub

In [0]:
def process_to_IDs_in_sparse_format(sp, sentences):
  # A utility method that processes sentences with the sentence piece processor
  # 'sp' and returns the results in tf.SparseTensor-similar format:
  # (values, indices, dense_shape)
  ids = [sp.EncodeAsIds(x) for x in sentences]
  max_len = max(len(x) for x in ids)
  dense_shape=(len(ids), max_len)
  values=[item for sublist in ids for item in sublist]
  indices=[[row,col] for row in range(len(ids)) for col in range(len(ids[row]))]
  return (values, indices, dense_shape)
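
To make the `(values, indices, dense_shape)` layout concrete, here is a small sketch that applies the same steps to two toy sentences. `FakeSentencePiece` is a hypothetical stand-in that maps each whitespace-separated word to its length; it only mimics the `EncodeAsIds` method and is not the real SentencePiece tokenizer.

```python
class FakeSentencePiece:
    """Hypothetical stand-in exposing only EncodeAsIds (not real SentencePiece)."""

    def EncodeAsIds(self, sentence):
        # Assumption for illustration: the "id" of a word is its length.
        return [len(word) for word in sentence.split()]


def to_sparse_format(sp, sentences):
    # Same steps as process_to_IDs_in_sparse_format above.
    ids = [sp.EncodeAsIds(s) for s in sentences]
    dense_shape = (len(ids), max(len(x) for x in ids))
    values = [item for sublist in ids for item in sublist]
    indices = [[row, col] for row in range(len(ids)) for col in range(len(ids[row]))]
    return values, indices, dense_shape


values, indices, dense_shape = to_sparse_format(
    FakeSentencePiece(), ["what is a virus", "hello world"])
# values      -> [4, 2, 1, 5, 5, 5]
# indices     -> [[0, 0], [0, 1], [0, 2], [0, 3], [1, 0], [1, 1]]
# dense_shape -> (2, 4)
```

Each entry of `indices` is the `[row, column]` position of the matching entry in `values`, so the shorter sentence simply has no entries for its missing columns.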


In [0]:
def compute_embeddings(sentences):  
    with tf.Session() as session:
        input_placeholder = tf.sparse_placeholder(tf.int64, shape=[None, None])

        module = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-lite/2")
        
        spm_path = session.run(module(signature="spm_path"))
        
        sp = spm.SentencePieceProcessor()
        sp.Load(spm_path)
    
        embeddings = module(
            inputs=dict(
                values=input_placeholder.values,
                indices=input_placeholder.indices,
                dense_shape=input_placeholder.dense_shape))
        
        values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, sentences)
        # initialize
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])        
        # compute
        message_embeddings = session.run(
            embeddings,
            feed_dict={input_placeholder.values: values,
                        input_placeholder.indices: indices,
                        input_placeholder.dense_shape: dense_shape})
    return message_embeddings 

In [0]:
context_list = [ q['context'] for q in questions]
query_list = [ q['question'] for q in questions] 

In [67]:
CE = compute_embeddings(context_list)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [68]:
CE.shape

(644, 512)

In [69]:
QE = compute_embeddings(query_list)


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [0]:
scores = []
for qix, q in enumerate(questions):
    qe = QE[qix]

    # rank all contexts by cosine similarity to this question, best first
    score = cosine_similarity([qe], CE)
    ranking = score[0].argsort()[::-1]
    # 0-based position of the question's own context in that ranking
    result_ix = np.where(ranking == qix)
    rix = result_ix[0][0]
    scores.append( 1./(rix + 1) )

    assert questions[ranking[rix]]['context'] == questions[qix]['context']

In [0]:
MRR = np.average(scores)

In [0]:
experiments_results.append( {  'Experiment' : "Universal Sentence Encoder", "MRR": MRR } )

# Overall Results

In [73]:
pd.DataFrame(experiments_results).T

                         0                   1                           2
Experiment  Elastic Search  BERT on 700k train  Universal Sentence Encoder
MRR               0.912437            0.549637                    0.383421
