# Zero Shot Topic Classification on CORD-19

## Introduction

In this notebook we'll build a Zero Shot Topic Classifier on the COVID-19 Open Research Dataset (CORD-19, Wang et al., 2020).
Essentially, we aim to build a web application capable of receiving natural language questions, such as "what do we know about vaccines and therapeutics?", and then displaying the most relevant research literature regarding the specific question.
This dataset has received wide attention in the data mining and natural language processing communities, with the goal of developing tools that help health workers stay up to date with the latest and most relevant research on the current pandemic.

Recent advances in NLP, such as OpenAI's GPT-3 (Brown et al., 2020), have shown that large language models can achieve competitive performance on downstream tasks with less task-specific data than smaller models require.
However, GPT-3 is currently difficult to use in real-world applications because of its size of ~175 billion parameters.

Recent experiments at HuggingFace (Davison, 2020) explored the potential of using Sentence-BERT (Reimers and Gurevych, 2020) to separately embed sentences and never-seen-before topic labels.
They then ranked the candidate topics for each sentence by the cosine distance between the two vectors (Veeranna et al., 2016), obtaining promising results.
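
As a rough illustration of this ranking step, the sketch below orders candidate labels by their cosine distance to a sentence vector. The three-dimensional vectors, the label names, and the helper names `cosine_distance` and `rank_labels` are made up for illustration; in the experiments described here the vectors would come from Sentence-BERT.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_labels(sentence_vec, label_vecs):
    """Sort label names from closest to farthest from the sentence vector."""
    return sorted(
        label_vecs,
        key=lambda name: cosine_distance(sentence_vec, label_vecs[name]),
    )


# Hypothetical toy embeddings (not produced by any real model).
sentence_vec = [0.9, 0.1, 0.0]
label_vecs = {
    "vaccines": [1.0, 0.0, 0.0],
    "transmission": [0.0, 1.0, 0.0],
    "economics": [0.0, 0.0, 1.0],
}

ranked = rank_labels(sentence_vec, label_vecs)
print(ranked)
```

The label whose vector points in nearly the same direction as the sentence vector comes first, regardless of vector length.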

In another experiment, they used a pre-trained natural language inference (NLI) sequence-pair classifier as an out-of-the-box zero-shot text classifier, as proposed by Yin et al. (2020).
By using a pre-trained BART model (Lewis et al., 2019) fine-tuned on the Multi-Genre NLI corpus, they obtained an F1 score of 53.7 on the Yahoo Answers dataset.
The dataset has 10 classes, and the current state of the art for supervised models is an accuracy of 77.62.
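
To make the NLI formulation more concrete: the text acts as the premise, each candidate label is turned into a hypothesis, and the entailment probability is read as the label score. The sketch below shows only that bookkeeping and calls no real model. The hypothesis template, the logit ordering (contradiction, neutral, entailment), the choice to ignore the neutral logit before the softmax, and the logit values themselves are all assumptions for illustration.

```python
import math


def build_hypotheses(labels, template="This text is about {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def entailment_probability(logits):
    """Softmax over (contradiction, entailment), ignoring the neutral logit.

    Assumes the logits are ordered (contradiction, neutral, entailment).
    """
    contradiction, _neutral, entailment = logits
    # Subtract the maximum for numerical stability.
    shift = max(contradiction, entailment)
    exp_c = math.exp(contradiction - shift)
    exp_e = math.exp(entailment - shift)
    return exp_e / (exp_c + exp_e)


labels = ["vaccines", "economics"]
hypotheses = build_hypotheses(labels)

# Hypothetical logits an NLI model might return for each (premise, hypothesis) pair.
fake_logits = {"vaccines": (-2.0, 0.5, 3.0), "economics": (2.5, 0.0, -1.5)}

scores = {label: entailment_probability(fake_logits[label]) for label in labels}
best_label = max(scores, key=scores.get)
print(best_label)
```

Because each label is scored independently, the same scheme extends to multi-label settings by thresholding the scores instead of taking the maximum.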

## Proposed method

First, we'll use Sentence-BERT to embed both the papers and the never-seen-before question, then measure the cosine distance between them to assess each paper's relevance to the question.
For the sake of efficiency, we'll iterate over the dataset and precompute the papers' representations from their titles and abstracts.
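
The precomputation step can be sketched independently of the model. In the sketch below, `encode` is a hypothetical stand-in for a sentence encoder (it only returns text lengths), and `batched` and `precompute` are illustrative helper names rather than functions from any library.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def precompute(texts, encode, batch_size=256):
    """Encode `texts` batch by batch and return one representation per text."""
    representations = []
    for batch in batched(texts, batch_size):
        representations.extend(encode(batch))
    return representations


def encode(batch):
    # Hypothetical encoder: a real one would return embedding vectors.
    return [[float(len(text))] for text in batch]


texts = ["Title A. Abstract A", "Title B. Abstract B is longer", "Title C."]
representations = precompute(texts, encode, batch_size=2)
print(len(representations))
```

Passing the encoder in as an argument keeps the batching logic easy to test without downloading a model.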

In [None]:
# We'll use sentence-transformers from UKPLab, which depends on PyTorch
!pip install torch



In [None]:
!pip install --no-cache-dir --force-reinstall "transformers>=2.9.0,<3.0.0"

Collecting transformers<3.0.0,>=2.9.0
  Downloading transformers-2.11.0-py3-none-any.whl (674 kB)
Collecting packaging
  Downloading packaging-20.4-py2.py3-none-any.whl (37 kB)
Collecting regex!=2019.12.17
  Downloading regex-2020.6.8-cp37-cp37m-manylinux2010_x86_64.whl (661 kB)
Collecting numpy
  Downloading numpy-1.19.0-cp37-cp37m-manylinux2010_x86_64.whl (14.6 MB)
Collecting tokenizers==0.7.0
  Downloading tokenizers-0.7.0-cp37-cp37m-manylinux1_x86_64.whl (5.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.43.tar.gz (883 kB)
Collecting requests
  Downloading requests-2.24.0-py2.py3-none-any.

In [None]:
!pip install sentence-transformers



In [None]:
import math
import pandas as pd
from sentence_transformers import SentenceTransformer
from torch.nn import functional as F
from fastprogress import progress_bar

from risotto.artifacts import load_papers_artifact

papers = load_papers_artifact()

model = SentenceTransformer("bert-base-nli-mean-tokens")

batch_size = 256
num_rows = len(papers)
num_batches = math.ceil(num_rows / batch_size)

papers["representation"] = pd.Series([], dtype=object)

for batch_id in progress_bar(range(num_batches)):
    start_idx = batch_id * batch_size
    end_idx = start_idx + batch_size
    slice_df = papers.iloc[start_idx:end_idx]

    # Concatenate title and abstract; fill missing values first so that a
    # missing abstract does not turn the whole string into NaN
    title_abstract = (slice_df.title.fillna("") + ". " + slice_df.abstract.fillna("")).tolist()

    sentence_embeddings = model.encode(title_abstract)

    # Store one representation per paper, keyed by its index label
    for paper_idx, embedding in zip(slice_df.index, sentence_embeddings):
        papers.at[paper_idx, "representation"] = embedding



In [None]:
query = "what do we know about vaccines and therapeutics?"
query_encoded = model.encode([query])
query_encoded

[array([ 9.53735173e-01,  4.36801702e-01,  7.16318130e-01, -2.06302524e-01,
         4.44031447e-01, -8.81401241e-01,  7.04655945e-01, -7.60947168e-01,
         7.14303672e-01,  7.53922239e-02,  3.50104094e-01,  4.42823052e-01,
         8.72542083e-01,  9.91872072e-01, -8.06705475e-01,  4.04296905e-01,
        -5.35847902e-01,  4.63045746e-01,  6.29461825e-01,  4.02464896e-01,
        -5.28995275e-01, -9.88435745e-02, -4.39947844e-01, -3.26235555e-02,
         3.49090904e-01,  7.73622990e-01,  1.14979945e-01, -1.19985998e+00,
         4.55172211e-01,  1.40891567e-01,  3.75481099e-01,  4.20022726e-01,
        -1.00398707e+00, -3.90061140e-01, -1.27515778e-01,  7.72433758e-01,
        -1.96735248e-01,  8.81975651e-01, -2.92830974e-01,  8.53389502e-02,
        -1.62797165e+00, -7.95061111e-01, -1.69390440e-01,  6.28248870e-01,
        -5.43486893e-01, -8.86552513e-01,  6.62972406e-02,  2.87867397e-01,
        -1.00299418e+00,  5.72913349e-01, -5.10117531e-01, -7.17701614e-01,
         2.8

In [None]:
!pip install scipy



In [None]:
import numpy as np

papers_encoded = np.array(papers["representation"].to_list())
papers_encoded

array([[-0.6483415 ,  0.6435061 , -0.358263  , ..., -0.2738189 ,
         0.6773935 ,  0.3480342 ],
       [-0.44235474,  0.36395422, -0.20020767, ..., -0.03206707,
         0.05722679,  0.6758838 ],
       [ 0.06069116,  0.25295445,  0.2979453 , ...,  0.454517  ,
        -0.7178886 ,  0.88681674],
       ...,
       [-0.17319249,  0.7134304 ,  0.18607832, ...,  0.12789835,
        -0.14198978,  0.14640926],
       [-0.21967985,  0.49324775, -0.7882281 , ...,  0.14976002,
        -0.2041433 ,  0.27260515],
       [-0.00324283,  0.07330745,  0.6464749 , ...,  0.780558  ,
         0.3459313 , -0.00968279]], dtype=float32)

In [None]:
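# Import the submodule explicitly; a bare `import scipy` does not always load scipy.spatial
from scipy.spatial.distance import cdist

# Cosine distance between the single query row and every paper row
distances = cdist(query_encoded, papers_encoded, "cosine")[0]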
distances

array([0.67021698, 0.5535392 , 0.5280368 , ..., 0.59998628, 0.68826074,
       0.53678836])
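
For reference, the `"cosine"` metric passed to `cdist` is one minus the cosine similarity. Writing $q$ for the query vector and $p_i$ for the $i$-th paper vector (symbols introduced here only for illustration):

$$
d(q, p_i) = 1 - \frac{q \cdot p_i}{\lVert q \rVert_2 \, \lVert p_i \rVert_2}
$$

Values lie in $[0, 2]$, and a smaller value means the paper points in a direction closer to the query.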

In [None]:
distances_series = pd.Series(distances, index=papers.index, name="distances")
distances_series

cord_uid
ug7v899j    0.670217
02tnwd4m    0.553539
ejv2xln0    0.528037
2b73a28n    0.486982
9785vg6d    0.490476
              ...   
2upc2spn    0.359398
48kealmj    0.534712
7goz1agp    0.599986
twp49jg3    0.688261
wtoj53xy    0.536788
Name: distances, Length: 77304, dtype: float64

In [None]:
papers_with_distances = papers.join(distances_series)
papers_with_distances

cord_uid,pagerank,affiliation,country,sha,source_x,title,doi,pmcid,pubmed_id,license,...,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,representation,distances
ug7v899j,0.000005,,,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,1.14726e+07,no-cc,...,BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,"[-0.6483415, 0.6435061, -0.358263, -0.2691593,...",0.670217
02tnwd4m,0.000006,University of Alabama at Birmingham,USA,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,1.1668e+07,no-cc,...,Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,"[-0.44235474, 0.36395422, -0.20020767, -0.4626...",0.553539
ejv2xln0,0.000026,Washington University School of Medicine,USA,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,1.1668e+07,no-cc,...,Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,"[0.06069116, 0.25295445, 0.2979453, -0.0984704...",0.528037
2b73a28n,0.000008,,,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,1.16869e+07,no-cc,...,Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,"[-0.77728766, 0.14971603, -0.20489854, -0.3885...",0.486982
9785vg6d,0.000006,National Institutes of Health (Laboratory of H...,USA,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,1.16869e+07,no-cc,...,Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,"[-0.0014710592, 0.75503856, -0.29300198, -0.27...",0.490476
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2upc2spn,0.000005,"Emory University, Tongji University, Fudan Uni...","USA, China",0eda9491295d6a851db35c358ac0c4fe956dce1c; da3d...,Medline; PMC,CE-BLAST makes it possible to compute antigeni...,10.1038/s41467-018-04171-2,PMC5932059,2.97206e+07,cc-by,...,Nat Commun,,,,document_parses/pdf_json/0eda9491295d6a851db35...,document_parses/pmc_json/PMC5932059.xml.json,https://doi.org/10.1038/s41467-018-04171-2; ht...,19157016.0,"[-0.36055383, 0.4955496, 0.5226442, -0.0462066...",0.359398
48kealmj,0.000009,University Medical Centre Utrecht,The Netherlands,8c4d11d0eba3961e79a2401f0ce117ccdd336e35,Medline; PMC,Hidden Behind Autophagy: The Unconventional Ro...,10.1111/tra.12091,PMC7169877,2.38376e+07,no-cc,...,Traffic,,,,document_parses/pdf_json/8c4d11d0eba3961e79a24...,document_parses/pmc_json/PMC7169877.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/23837619/;...,21166432.0,"[-0.45784405, 0.2272223, 0.42906934, -0.006873...",0.534712
7goz1agp,0.000005,University Medical Center Utrecht,"the Netherlands, the Netherlands., the Netherl...",0a0147800d4984f382f5ccb44abb6af6fdf65399,Medline; PMC,Conventional Influenza Vaccination Is Not Asso...,10.1093/aje/kwg027,PMC7110252,1.26976e+07,no-cc,...,Am J Epidemiol,,,,document_parses/pdf_json/0a0147800d4984f382f5c...,document_parses/pmc_json/PMC7110252.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/12697573/,25557223.0,"[-0.17319249, 0.7134304, 0.18607832, -0.475285...",0.599986
twp49jg3,0.000005,,,e3de6d6d50592102725cbe2ee5cb0fe02b851aac,Medline; PMC,Inhibitory Effect of Resveratrol against Duck ...,10.1371/journal.pone.0065213,PMC3679110,2.37765e+07,cc-by,...,PLoS One,,,,document_parses/pdf_json/e3de6d6d50592102725cb...,document_parses/pmc_json/PMC3679110.xml.json,https://doi.org/10.1371/journal.pone.0065213; ...,18190905.0,"[-0.21967985, 0.49324775, -0.7882281, 0.301461...",0.688261


In [None]:
papers_with_distances.sort_values(by="distances", ascending=True)

cord_uid,pagerank,affiliation,country,sha,source_x,title,doi,pmcid,pubmed_id,license,...,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,representation,distances
noydd8mw,0.000005,,,75e68861e7c9e6a9630f62bd55a997187031bca2,PMC,Advances in Vaccines,10.1007/10_2019_107,PMC7120466,3.14464e+07,no-cc,...,Current Applications of Pharmaceutical Biotech...,,,,document_parses/pdf_json/75e68861e7c9e6a9630f6...,document_parses/pmc_json/PMC7120466.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,"[0.67727727, 0.2827368, 0.9689374, -0.21286546...",0.176778
swwz4kzd,0.000009,Aston University (Medicines Research Unit),UK,d5e96d010d0c08646164d722ab81f2ab64ab94c9,Elsevier; Medline; PMC,The rational design of vaccines,10.1016/s1359-6446(05)03600-7,PMC7108399,1.62574e+07,no-cc,...,Drug Discov Today,,,,document_parses/pdf_json/d5e96d010d0c08646164d...,document_parses/pmc_json/PMC7108399.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/16257375/;...,45048133.0,"[0.23579514, 0.3860431, 0.673824, -0.12953536,...",0.231984
r42hzazp,0.000005,,,73f5a997787dadb9180f388449025f310ac8b636,PMC,Do Vaccines Trigger Neurological Diseases? Epi...,10.1007/s40263-019-00670-y,PMC7224038,3.15765e+07,no-cc,...,CNS Drugs,,,,document_parses/pdf_json/73f5a997787dadb9180f3...,document_parses/pmc_json/PMC7224038.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,"[0.0702827, 0.55133075, 0.38673177, -0.2263814...",0.240559
y4eyse0y,0.000038,East Carolina University,USA,e9ee637040a204668c5d9950fba63fe4e5841bec,Medline; PMC,SARS vaccines: where are we?,10.1586/erv.09.43,PMC7105754,1.95381e+07,no-cc,...,Expert Rev Vaccines,,,,document_parses/pdf_json/e9ee637040a204668c5d9...,document_parses/pmc_json/PMC7105754.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/19538115/;...,10433997.0,"[0.16316232, 0.54352087, 0.5199337, -0.2975174...",0.255897
v5wxkdbk,0.000005,,,5c03f7971e51a825be6a88d675b1cd05d1c3b527,Medline; PMC,Delivery technologies for human vaccines,10.1093/bmb/62.1.29,PMC7110014,1.21768e+07,no-cc,...,Br Med Bull,,,,document_parses/pdf_json/5c03f7971e51a825be6a8...,document_parses/pmc_json/PMC7110014.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/12176848/,15635233.0,"[0.21022908, 0.37452778, 0.49110535, -0.367973...",0.256064
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
m73nepej,0.000006,"Queensland University of Technology, Griffith ...",Australia,1872ab78cc17a79cc217aceb47401372672340f5,Elsevier; Medline; PMC,The determinants of Chinese visitors to Austra...,10.1016/j.tourman.2017.06.015,PMC7127086,32287751,no-cc,...,Tour Manag,,,,document_parses/pdf_json/1872ab78cc17a79cc217a...,document_parses/pmc_json/PMC7127086.xml.json,https://api.elsevier.com/content/article/pii/S...,157633808.0,"[-0.43199056, 0.42718202, -0.5737817, -0.00269...",0.918100
m5o1qdas,0.000005,Madagascar Research and Conservation Program,Madagascar,de880a53d425340bb4c2d52503d02c32ddfe64f0,PMC,Fruit Characteristics of Species Dispersed by ...,10.1111/j.1744-7429.2001.tb00201.x,PMC7161794,3.23133e+07,no-cc,...,Biotropica,,,,document_parses/pdf_json/de880a53d425340bb4c2d...,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,"[-0.44773257, 0.68033385, -0.5234621, -0.08303...",0.919014
ddz6m6qb,0.000005,,,c7ad552957c6e5421afc818e1b1f56e9b4883405,Medline; PMC,"Bats in urban areas of Brazil: roosts, food re...",10.1007/s11252-016-0632-3,PMC7089172,32214783,no-cc,...,Urban Ecosyst,,,,document_parses/pdf_json/c7ad552957c6e5421afc8...,document_parses/pmc_json/PMC7089172.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/32214783/;...,22669230.0,"[0.0070760944, 1.0855699, -0.43016323, 0.35315...",0.922604
crnsk87x,0.000005,,,030b9ba1c0202c07d05f55ea3f0cc28c431b262b,PMC,The ‘Next People’: And the Zombies Shall Inher...,10.1007/978-981-287-934-9_3,PMC7122991,,no-cc,...,Generation Z,,,,document_parses/pdf_json/030b9ba1c0202c07d05f5...,document_parses/pmc_json/PMC7122991.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,"[-0.35151976, 0.5011075, 0.9157388, -0.1275989...",0.935586
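
When only the closest few papers are needed, a full sort is not required. The sketch below picks the `k` smallest distances with the standard library's `heapq.nsmallest`. The five `cord_uid`/distance pairs are copied from the `distances_series` output shown earlier, while `top_k_closest` is an illustrative helper name.

```python
import heapq


def top_k_closest(distances, k):
    """Return the k (cord_uid, distance) pairs with the smallest distance."""
    return heapq.nsmallest(k, distances.items(), key=lambda item: item[1])


# First five entries of the distances output shown earlier in the notebook.
distances = {
    "ug7v899j": 0.670217,
    "02tnwd4m": 0.553539,
    "ejv2xln0": 0.528037,
    "2b73a28n": 0.486982,
    "9785vg6d": 0.490476,
}

closest = top_k_closest(distances, k=2)
print(closest)
```

On the pandas side, `distances_series.nsmallest(k)` should give the same selection without sorting the whole frame.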


In [None]:
# TODO:
# - Git commit
# - Expand proposed method description
# - Encapsulate the logic in functions
# - Dump the new paper dataframe
# - Try to build a quick Streamlit application

In [None]:
papers.to_hdf("artifacts/papers_sbert.hdf", key="papers_sbert")

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['affiliation', 'country', 'sha', 'source_x', 'title', 'doi', 'pmcid',
       'pubmed_id', 'license', 'abstract', 'publish_time', 'authors',
       'journal', 'who_covidence_id', 'arxiv_id', 'pdf_json_files',
       'pmc_json_files', 'url', 'representation'],
      dtype='object')]



## References

- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
- Davison, J. (2020). Zero-Shot Learning in Modern NLP. https://joeddav.github.io/blog/2020/05/29/ZSL.html
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. http://arxiv.org/abs/1910.13461
- Reimers, N., & Gurevych, I. (2020). Sentence-BERT: Sentence embeddings using siamese BERT-networks. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 3982–3992. https://doi.org/10.18653/v1/d19-1410
- Veeranna, S. P., Nam, J., Mencía, E. L., & Fürnkranz, J. (2016). Using semantic similarity for multi-label zero-shot classification of text documents. ESANN 2016 - 24th European Symposium on Artificial Neural Networks, April, 423–428.
- Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R., Liu, Z., Merrill, W., Mooney, P., Murdick, D., Rishi, D., Sheehan, J., Shen, Z., Stilson, B., Wade, A. D., Wang, K., Wilhelm, C., … Kohlmeier, S. (2020). CORD-19: The Covid-19 Open Research Dataset. https://arxiv.org/abs/2004.10706
- Yin, W., Hay, J., & Roth, D. (2020). Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 3914–3923. https://doi.org/10.18653/v1/d19-1404