In [62]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Search COVID papers with Deep Learning
*Transformers + Elastic Search = ❤️*

Good news everyone! In this article, we are not going to fit a linear regression on the sars-cov-19 data. Rather, we will do something more interesting. Most of the work is based on [this project](https://github.com/gsarti/covid-papers-browser) in which I am working with students from the Universita of Triest (Italy). A live demo is available TODO

## Data

We are going to use this dataset from Kaggle, we only need the `metadata.csv` file that contains information about the paper and the full text of the abstract. Let's take a look!

In [2]:
import pandas as pd
from Project import Project
# Project holds all the path
pr = Project()

df = pd.read_csv(pr.data_dir / 'metadata.csv', nrows=100)

df.head(1)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,zjufx4fo,b2897e1277f56641193a6db73825f707eed3e4c9,PMC,Sequence requirements for RNA strand transfer ...,10.1093/emboj/20.24.7220,PMC125340,11742998,unk,Nidovirus subgenomic mRNAs contain a leader se...,2001-12-17,"Pasternak, Alexander O.; van den Born, Erwin; ...",The EMBO Journal,,,True,True,custom_license,http://europepmc.org/articles/pmc125340?pdf=re...


## Embed

Now we need a way to create an embedding from the data. We defined a class `Embedder` that loads automatically a model from `hugging_faces`. On top of that we add a pooling layer to create one single `768` vector for each input.

**TODO** Explain model choice

In [13]:
from dataclasses import dataclass
from sentence_transformers import models, SentenceTransformer

@dataclass
class Embedder:
    name: str = 'gsarti/scibert-nli'
    max_seq_length: int  = 128
    do_lower_case: bool  = True
    
    def __post_init__(self):
        word_embedding_model = models.BERT(
            'gsarti/biobert-nli',
            max_seq_length=128,
            do_lower_case=True
        )
        pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                pooling_mode_mean_tokens=True,
                pooling_mode_cls_token=False,
                pooling_mode_max_tokens=False
            )
    
        self.model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
        
    def __call__(self, text):
        return self.model.encode(text) 

Let's test it out by embedding one title

In [15]:
embedder = Embedder()

emb = embedder(ds[0].title)

emb[0].shape

(768,)

Et voilà! Let's put everything together in a new class. To give our search engine more context, we will embed the `title` and the `abstract` together/

In [67]:
@dataclass
class CovidPapersEmbedder(Embedder):
    df: pd.DataFrame = None
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        
        row['embed'] = self(f'{row.title} {row.abstract}')[0]
        return row

    
    @classmethod
    def from_path(cls, path, *args, **kwargs):
        df = pd.read_csv(path, nrows=100)
        return cls(df=df, *args, **kwargs)

In [68]:
covid_papers_embedder = CovidPapersEmbedder.from_path(pr.data_dir / 'metadata.csv')

covid_papers_embedder[0].embed

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


array([ 2.3567837e-01, -6.4051300e-02, -1.1932276e-02,  2.6868689e-01,
       -5.2030039e-01, -8.3378005e-01, -1.8351300e-01,  5.7961351e-01,
        8.1316918e-01,  6.7535788e-01, -1.3528778e-01,  8.3216673e-01,
       -8.2107680e-03,  7.5717159e-03, -1.3362406e-01,  7.2499067e-01,
        2.5037944e-01,  2.2913964e-01,  1.2199627e+00,  8.0350995e-01,
       -1.4881083e-01, -9.1403693e-01,  2.6385650e-01, -6.7890716e-01,
       -1.1853302e+00,  2.1824436e-01,  6.5506983e-01,  4.2183042e-01,
        7.9776478e-01,  9.0444809e-01, -1.4536711e+00,  1.7199700e-01,
       -7.5994641e-01,  1.6139181e-01, -9.4416934e-01,  3.5611987e-01,
        6.7621166e-01, -8.3796614e-01, -3.3323213e-01,  2.5969371e-01,
       -9.3440801e-01, -1.0839705e+00, -8.9600086e-01, -7.7506495e-01,
        1.3246822e-05,  5.9640449e-01,  7.4947214e-01,  7.0424813e-01,
        7.5518823e-01,  4.4089183e-01,  5.8574969e-01,  8.1302541e-01,
        9.0680104e-01,  2.5342101e-01,  1.8209414e-01, -5.2997476e-01,
      

## Search

Okay, we now have a way to embed each paper, but how can we search in the data using a query? Assuming we have embedded **all** the papers we could also **embed the query** and compute the cosine similarity between the query and all the embedding and then show the results sorted by the distance. Intuitively, the closer the more similar. 

So, how we can do it? We need a proper way to handle the data and to run the cosine similarity fast enough. Fortunately, Elastic Search comes to the rescue!

### Elastic Search

[Elastic Search](https://www.elastic.co/) is a database with one goal, yes you guessed right: searching. We will first store all the embedding in elastic and then use its API to perform the searching. If you are lazy like me you can [install elastic search with docker](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html)

```
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.2
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.2

```

Perfect. The next step is to store the embeddings and the papers' information on elastic search. It is a very straightforward process. We first need to create an `index` (a new database) and then to create one entry for each paper.

When we create an `index` we need to describe for elastic what we wish to store. In our case:


```
{
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    },
    "mappings": {
        "dynamic": "true",
        "_source": {
            "enabled": "true"
        },
        "properties": {
            "cord_id": {
                "type": "text"
            },
            "url": {
                "type": "text"
            },
            ... all other properties (columns of the datafarme)
            "bibliography": [],
            "embed": {
                "type": "dense_vector",
                "dims": 768
            }
        }
    }
}

```

The last entry we define a property `embed` as a dense vector with `768`. This is indeed our embed.

In [72]:
from dataclasses import dataclass, field
from pathlib import Path
from elasticsearch import Elasticsearch
from tqdm import tqdm
import pandas as pd
import json
from elasticsearch.helpers import bulk

@dataclass
class ElasticSearchProvider:
    entries: list()
    index_file: dict
    client: Elasticsearch = Elasticsearch()
    doc: list = field(default_factory=list)
    index_name: str = 'covid-19'

    def load(self, file_path: Path):
        """Load a file that contains the documents indeces
      
      :param file_path: [description]
      :type file_path: Path
      """
        with open(file_path, 'r') as f:
            self.doc = json.load(f)

    def drop(self):
        """Drop the current index
        """
        self.client.indices.delete(index=self.index_name, ignore=[404])

    def create_index(self):
        """Fill up elastic search
        """
        self.client.indices.create(index=self.index_name, body=self.index_file)

    def create_and_bulk_documents(self, batch_size: int = 128):
        """Iteratively create and bulk documents to elasticsearch using a batch-wise process.
        `self.doc` is used to store the batches, so it will always hold the *last batch*
        :param batch_size: [description], defaults to 128
        :type batch_size: int, optional
        """
        for entry in tqdm(self.entries):
            entry_elastic = {
                **entry,
                **{
                    '_op_type': 'index',
                    '_index': self.index_name
                }
            }

            self.doc.append(entry_elastic)
            should_bulk = len(self.doc) >= batch_size
            if should_bulk:
                bulk(self.client, self.doc)
                self.doc = []

    def __call__(self, batch_size: int = 128):
        """In order
        - [ ] create all the documents
        - [ ] create a new index from the document
        
        """
        self.drop()
        self.create_index()
        self.create_and_bulk_documents(batch_size)

        return self

In [76]:
with open(pr.base_dir / 'es_index.json', 'r') as f:
    index_file = json.load(f)
    es_provider = ElasticSearchProvider(covid_papers_embedder, index_file)

In [None]:
es_provider(batch_size=2)

0it [00:00, ?it/s]