# 03. SciBERT vectorization

BERT is a model trained on general-purpose language corpora. It works well on domain-agnostic tasks; however, there are plenty of extensions that use BERT-like modelling but are trained on a specific language subset. To name a few:
- **SciBERT** - a BERT model for scientific text
- **BioBERT** - a pre-trained biomedical language representation model
- **ClinicalBERT** - a BERT model for clinical notes
- **VideoBERT** - a Joint Model for Video and Language Representation Learning
- **PatentBERT** - for patent classification
- **DocBERT** - BERT for document classification

These domain-specific models typically work better on the kind of documents they were trained on. We will try SciBERT, since scientific papers are the focus of this project.

In [1]:
from bert_serving.client import BertClient
from IPython.display import display, HTML
from scipy.spatial import distance

In [2]:
import pandas as pd
import numpy as np
import logging

In [3]:
logger = logging.getLogger(__name__)

In [4]:
covid_19_articles_df = pd.read_parquet("./data/covid_19_articles_df.parquet")

In [5]:
covid_19_articles_df.shape

(560, 5)

In [6]:
covid_19_articles_df.sample(n=3)

| paper_id | title | abstract | body_text | back_matter | license |
|---|---|---|---|---|---|
| abe590b95aa51e7309844156acdae4085870ea33 | Analysis of early renal injury in COVID-19 and... | The aim of the study was to analyze the incide... | This study intends to use a number of laborato... | | biorxiv_medrxiv |
| 7ab9f9fcea519ebce527c3ede8091beedbb26ad9 | Structural modeling of 2019-novel coronavirus ... | The 2019 novel coronavirus (2019-nCoV) is curr... | The coronaviruses belong to the Coronaviridae ... | We thank all member of the Whittaker and Danie... | biorxiv_medrxiv |
| 91098f6fe46a21565bf0cb06fe960cdb2c3f5e38 | The role of institutional trust in preventive ... | 1 Background Since December 2019, pneumonia as... | 1 Background Since December 2019, pneumonia as... | We thank the study participants for their prom... | biorxiv_medrxiv |


In [7]:
paragraphs_df = pd.read_parquet("./data/paragraphs_df.parquet")

In [8]:
paragraphs_df = paragraphs_df[
    paragraphs_df.index.get_level_values("paper_id").isin(covid_19_articles_df.index)
]

In [9]:
paragraphs_df.shape

(14847, 2)

In [10]:
paragraphs_df.sample(n=3)

| paper_id | paragraph_number | paragraph_text | paper_license |
|---|---|---|---|
| d4cfb1fc4fc53abbc1d4b0ca60276c6af6632c3c | 73 | The copyright holder for this preprint (which ... | biorxiv_medrxiv |
| 0a32446730827ad8152c6a61e4738e4e0b231412 | 39 | VDR is a nuclear receptor that mediates most b... | biorxiv_medrxiv |
| 6726b273f345838661eaa39e8adb5df4e1815a9a | 0 | The coronavirus SARS-CoV-2 (previously known a... | biorxiv_medrxiv |


## Text vectorization with SciBERT

The process we are going to follow is exactly the same as the one we used for BERT. Let's vectorize all the documents (please note that this time the service is running on a different host) and use the embeddings to look for answers to the given questions.

In [11]:
# SciBERT is running on a different host than BERT
bc = BertClient(ip="scibert_server", port=5555, port_out=5556)

In [12]:
try:
    paragraphs_df = pd.read_parquet("./data/paragraphs_df-scibert-last-layer.parquet")
except OSError as e:
    logger.warning("Could not find vectorized paragraphs file: %s", e)

In [13]:
if "bert_vector" not in paragraphs_df:
    paragraphs_df["bert_vector"] = paragraphs_df["paragraph_text"] \
        .map(lambda text: np.array(bc.encode([text]).flat))

In [14]:
paragraphs_df.to_parquet("./data/paragraphs_df-scibert-last-layer.parquet")
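
The vectorization cell above sends one request per paragraph. `BertClient.encode` accepts a list of strings and returns one row per string, so the same vectors can be requested in chunks. The following is a minimal sketch under that assumption; `encode_in_batches`, `encode_fn` and `batch_size` are names introduced here for illustration, not part of the notebook or of bert-as-service.

```python
def encode_in_batches(texts, encode_fn, batch_size=64):
    """Encode texts chunk by chunk and return one vector per text.

    encode_fn is assumed to behave like BertClient.encode: it takes a
    list of strings and returns a sequence with one vector per string.
    """
    texts = list(texts)
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Iterating over the result yields one vector per input string.
        vectors.extend(encode_fn(batch))
    return vectors


# Possible usage in this notebook (requires the running SciBERT server):
# vectors = encode_in_batches(paragraphs_df["paragraph_text"], bc.encode)
# paragraphs_df["bert_vector"] = pd.Series(vectors, index=paragraphs_df.index)
```

Passing the encoder in as an argument keeps the function testable without a server; a suitable batch size depends on the server configuration and is not something the notebook specifies.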

## SciBERT-based question answering system

We would like to have a question answering system. That approach did not work well with general-purpose BERT, but we can check how SciBERT handles it.

In [15]:
question = "Where does the first patient diagnosed with COVID-19 come from?"

In [16]:
vectorized_question = np.array(bc.encode([question]).flat)

In [17]:
vectorized_question.shape

(768,)

In [18]:
vector_distance = paragraphs_df["bert_vector"] \
    .map(lambda v: distance.cosine(vectorized_question, v))
idx = vector_distance.nsmallest(n=5).index
closest_paragraphs_df = paragraphs_df.loc[idx]
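
The search above calls `distance.cosine` once per paragraph. Cosine distance is one minus the dot product of the two vectors divided by the product of their norms, so the same ranking can be computed for all paragraphs at once with NumPy. This is a sketch rather than the notebook's method; `cosine_distances` and `top_n_closest` are hypothetical helper names, and it assumes no vector has zero norm.

```python
import numpy as np


def cosine_distances(query, matrix):
    """Cosine distance between one query vector and every row of matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Dot product of each row with the query, scaled by both norms.
    # Assumes no zero-norm vectors (they would cause a division by zero).
    similarities = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    return 1.0 - similarities


def top_n_closest(query, matrix, n=5):
    """Row positions of the n smallest cosine distances, closest first."""
    return np.argsort(cosine_distances(query, matrix))[:n]


# Possible usage in this notebook:
# matrix = np.vstack(paragraphs_df["bert_vector"].tolist())
# positions = top_n_closest(vectorized_question, matrix)
# closest_paragraphs_df = paragraphs_df.iloc[positions]
```

Note that `top_n_closest` returns integer positions, hence `iloc` in the usage comment, whereas `nsmallest(...).index` in the cell above returns index labels for `loc`.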

In [19]:
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent call
closest_paragraphs_iter = closest_paragraphs_df["paragraph_text"].items()
for i, ((paper_id, paragraph_order), paragraph_text) in enumerate(closest_paragraphs_iter):
    try:
        # Look up the paper the paragraph belongs to, to show its title
        paper = covid_19_articles_df.loc[paper_id]
        paper_title = paper["title"]
        display(HTML(f"<h4>{i + 1}. {paper_title}</h4><p>{paragraph_text}</p>"))
    except KeyError:
        logger.warning("Could not find the key %s", paper_id)
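
The loop above interpolates the title and paragraph text directly into an HTML string, so characters such as `<` or `&` in a paper would be interpreted as markup. A small sketch of how the snippet could be built with escaping, using only the standard library; `render_result` is a hypothetical helper, not something defined in the notebook.

```python
from html import escape


def render_result(rank, title, text):
    """Build the HTML snippet for one search hit, escaping the paper content."""
    return f"<h4>{rank}. {escape(str(title))}</h4><p>{escape(str(text))}</p>"


# Possible usage inside the loop:
# display(HTML(render_result(i + 1, paper_title, paragraph_text)))
```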

### Searching for similar sentences

SciBERT struggles with questions in the same way BERT did, so we can again try an affirmative sentence instead.

In [20]:
from helper import display_most_similar_paragraphs

In [21]:
sentence = "The first patient diagnosed with COVID-19 comes from"

In [22]:
display_most_similar_paragraphs(sentence, paragraphs_df, covid_19_articles_df, bc)