# 02. BERT vectorization

[BERT](https://github.com/google-research/bert) is a language representation model introduced by Google. It creates a contextual vector representation of a given text. As a result, unlike word2vec or GloVe, it does not assign a single fixed vector to each word, but different ones depending on the neighbouring words (the context). This lets the model encode more complex concepts, such as the different meanings of the same word.

It is pre-trained in an unsupervised manner on publicly available texts, and it is deeply bidirectional. When it was released, BERT achieved state-of-the-art results on many different NLP tasks. A great explanation of the internals can be found here: http://jalammar.github.io/illustrated-bert/.

## BERT in comparison to previous word embeddings

BERT brought a new view on language representation, but other word embeddings had been around for quite a long time. Let's make a quick recap for those of you who are not so familiar with the historical approaches to text encoding.

### Word ambiguity

![Just a second decided on a second place](images/text-vectorization-example.png)

### Bag of words

![Bag of words](images/example-bag-of-words.png)
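
As a minimal sketch of the idea, a bag of words can be built with nothing but the standard library. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration, and the sentence is taken from the caption of the ambiguity figure earlier in this notebook.

```python
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


counts = bag_of_words("Just a second decided on a second place")
# Both meanings of "second" fall into the same bucket.
print(counts["second"])
```

Because the representation only counts tokens, it cannot tell the two senses of "second" apart.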

### TF-IDF

![TFIDF](images/example-tfidf.png)
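
The weighting can be sketched in plain Python. This is one textbook variant (term frequency divided by document length, inverse document frequency as `log(N / df)`) and libraries often add smoothing on top of it. The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency),
    without the smoothing that many libraries apply.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the virus spreads fast", "the vaccine works", "the virus mutates"]
weights = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its weight is zero, while a term that occurs in a single document, such as "vaccine", gets the highest weight.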

### Word2Vec

![Word2Vec](images/example-word2vec.png)

### BERT

![BERT](images/example-bert.png)

## Loading the datasets created in the previous notebook

We've already created the datasets for the target application, so we can use them out of the box. Let's load everything.

In [1]:
from bert_serving.client import BertClient
from IPython.display import display, HTML
from scipy.spatial import distance

In [2]:
import pandas as pd
import numpy as np
import logging

In [3]:
logger = logging.getLogger(__name__)

In [4]:
covid_19_articles_df = pd.read_parquet("./data/covid_19_articles_df.parquet")

In [5]:
covid_19_articles_df.shape

(560, 5)

In [6]:
covid_19_articles_df.sample(n=3)

| paper_id | title | abstract | body_text | back_matter | license |
|---|---|---|---|---|---|
| c8df44a3612e85e267351e936ddeb8fc5867afa1 | The timing of one-shot interventions for epide... | The apparent early success in China's large-sc... | The Influenza pandemic of 1918 was one of the ... | | biorxiv_medrxiv |
| ebed882da10cbf669cbb86802e9fa07a4a33ef91 | Exuberant elevation of IP-10, MCP-3 and IL-1ra... | | and Murray scores ( Figure S2 ). Our previous... | All rights reserved. No reuse allowed without ... | biorxiv_medrxiv |
| 8a1fde8c65e439496ac5810504de23ef77312f28 | Protein structure and sequence re-analysis of ... | As the infection of 2019-nCoV coronavirus is q... | The 2019 novel conronavirus, or 2019-nCoV, rec... | This work is supported in part by the National... | biorxiv_medrxiv |


In [7]:
paragraphs_df = pd.read_parquet("./data/paragraphs_df.parquet")

In [8]:
paragraphs_df = paragraphs_df[
    paragraphs_df.index.get_level_values("paper_id").isin(covid_19_articles_df.index)
]

In [9]:
paragraphs_df.shape

(14847, 2)

In [10]:
paragraphs_df.sample(n=3)

| paper_id | paragraph_number | paragraph_text | paper_license |
|---|---|---|---|
| ead6fda7cdb2bb2469ff48365833bd63d0b7dd1a | 4 | By inferring the effectiveness of intervention... | comm_use_subset |
| dbefc8ad2a3de5d1696b7e604de8bce1da2ea8cd | 1 | With the current severe outbreak of coronaviru... | biorxiv_medrxiv |
| 2858f25f5364b0ef37ceeaa370471ee6b3fac29d | 13 | March 5, 2020 4/28 . CC-BY-NC-ND 4.0 Internati... | biorxiv_medrxiv |


## Text vectorization with BERT

Creating an embedding vector for a given text with BERT is straightforward. We tokenize the text, feed it to the network as input, and take the values of one of the hidden layers. To make things even simpler, the [bert-as-service](https://github.com/hanxiao/bert-as-service) project exposes BERT over TCP (it uses ZeroMQ internally). With it, vectorization takes just two lines of code: one to create a client and one to call `encode`.
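
A hidden layer holds one vector per token, so the values still have to be combined into a single fixed-size vector for the whole text. Averaging over the real (non-padding) tokens is one common way to do that. The sketch below shows only this pooling step on toy numbers; `mean_pool` is a hypothetical helper, not part of bert-as-service, and which pooling your server applies depends on how it was configured.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average token vectors over real tokens, ignoring padding positions.

    token_vectors: array of shape (sequence_length, hidden_size)
    attention_mask: array of shape (sequence_length,), 1 = real token, 0 = padding
    """
    token_vectors = np.asarray(token_vectors, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    n_tokens = mask.sum()
    if n_tokens == 0:
        raise ValueError("attention_mask must mark at least one real token")
    return (token_vectors * mask).sum(axis=0) / n_tokens


# Toy hidden-layer values: three positions, hidden size 2, last row is padding.
hidden_layer = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
text_vector = mean_pool(hidden_layer, [1, 1, 0])
```

The padding row does not influence the result, which is why the mask is applied before averaging.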

In [11]:
bc = BertClient(ip="bert_server", port=5555, port_out=5556)

In [12]:
try:
    paragraphs_df = pd.read_parquet("./data/paragraphs_df-bert.parquet")
except OSError as e:
    logger.warning("Could not find vectorized paragraphs file: %s", e)

In [13]:
if "bert_vector" not in paragraphs_df:
    paragraphs_df["bert_vector"] = paragraphs_df["paragraph_text"] \
        .map(lambda text: np.array(bc.encode([text]).flat))

In [14]:
paragraphs_df.to_parquet("./data/paragraphs_df-bert.parquet")
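
The cell above sends one paragraph per request. Since `bc.encode` takes a list of texts, a possible alternative is to send them in chunks. The sketch below assumes the encoder accepts a list of strings and returns a 2-D array with one row per string; `encode_in_batches` and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the example runs without a BERT server.

```python
import numpy as np


def encode_in_batches(texts, encode_fn, batch_size=64):
    """Encode texts in chunks and return one 1-D vector per text.

    encode_fn is assumed to accept a list of strings and to return
    a 2-D array with one row per string.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # Iterating over a 2-D array yields its rows, one per input text.
        vectors.extend(np.asarray(encode_fn(batch)))
    return vectors


def fake_encode(batch):
    """Stand-in encoder: (text length, number of spaces) for each text."""
    return np.array([[len(text), text.count(" ")] for text in batch], dtype=float)


vectors = encode_in_batches(["one two", "three", "four five six"], fake_encode, batch_size=2)
```

With a running server, something like `encode_in_batches(paragraphs_df["paragraph_text"].tolist(), bc.encode)` should produce the same vectors as the per-paragraph loop, under the assumption stated above.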

## BERT-based question answering system

Ideally, we should be able to ask a question and have the system find the answer automatically. Let's try it out.

In [15]:
question = "Where does the first patient diagnosed with COVID-19 come from?"

In [16]:
vectorized_question = np.array(bc.encode([question]).flat)

In [17]:
vectorized_question.shape

(1024,)

In [18]:
vector_distance = paragraphs_df["bert_vector"] \
    .map(lambda v: distance.cosine(vectorized_question, v))
idx = vector_distance.nsmallest(n=5).index
closest_paragraphs_df = paragraphs_df.loc[idx]
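
The same ranking can be computed for all paragraphs at once with NumPy instead of calling `distance.cosine` row by row. This is a sketch under the assumption that all vectors are non-zero and have the same length; `cosine_distances` is a hypothetical helper and the 2-D vectors are toy data.

```python
import numpy as np


def cosine_distances(query, matrix):
    """Cosine distance between one query vector and every row of a matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Dot product of each row with the query, divided by the product of norms.
    similarities = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    return 1.0 - similarities


vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
distances = cosine_distances([1.0, 0.0], vectors)
closest = np.argsort(distances)[:2]  # positions of the two nearest rows
```

For the notebook's data, the matrix could be built with `np.vstack(paragraphs_df["bert_vector"])`, and the resulting positions used with `paragraphs_df.iloc`.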

In [19]:
# Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
closest_paragraphs_iter = closest_paragraphs_df["paragraph_text"].items()
for i, ((paper_id, paragraph_number), paragraph_text) in enumerate(closest_paragraphs_iter):
    try:
        paper = covid_19_articles_df.loc[paper_id]
    except KeyError:
        logger.warning("Could not find the key %s", paper_id)
        continue
    # Show the rank, the paper title and the matching paragraph
    display(HTML(f"<h4>{i + 1}. {paper['title']}</h4><p>{paragraph_text}</p>"))

### Question answering issue

It seems the results are not as promising as we expected. Let's think for a while about why we didn't get the right answers. BERT is pre-trained on publicly available natural language texts, where it learns to predict a word given its context. That means that if we pass in a question, the closest vectors should belong to similar questions, not to the answers. Instead, let's try an affirmative sentence that imitates the answer we expect to get.

For convenience, we created a helper function that displays the most similar paragraphs.

In [20]:
from helper import display_most_similar_paragraphs

In [21]:
sentence = "The first patient diagnosed with COVID-19 comes from"

In [22]:
display_most_similar_paragraphs(sentence, paragraphs_df, covid_19_articles_df, bc)