# Zero Shot Topic Classification on CORD-19

## Introduction

In this notebook, we'll build a zero-shot topic classifier on the COVID-19 Open Research Dataset (CORD-19; Wang et al., 2020).
The goal is a web application that receives natural language questions, such as "what do we know about vaccines and therapeutics?", and displays the research literature most relevant to each question.
The dataset has received wide attention from the data mining and natural language processing communities, which are developing tools to help health workers stay up to date with the latest and most relevant research on the current pandemic.

Recent advances in NLP, such as OpenAI's GPT-3 (Brown et al., 2020), have shown that large language models can achieve competitive performance on downstream tasks with less task-specific data than smaller models require.
However, GPT-3 is currently difficult to use in real-world applications because of its size (~175 billion parameters).

Recent experiments at HuggingFace (Davison, 2020) explored using Sentence-BERT (Reimers and Gurevych, 2020) to separately embed sentences and never-seen-before topic labels.
The candidate topics for a sentence are then ranked by the cosine distance between the sentence vector and each label vector (Veeranna et al., 2016), with promising results.
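
As a minimal sketch of this ranking step, the snippet below scores one sentence embedding against a few candidate label embeddings using cosine similarity (one minus the cosine distance). The `rank_labels` helper, the label names, and the three-dimensional vectors are made up for illustration; real Sentence-BERT embeddings have hundreds of dimensions.

```python
import numpy as np


def rank_labels(sentence_vec, label_vecs, labels):
    """Return (label, cosine similarity) pairs, most similar first."""
    sentence_vec = np.asarray(sentence_vec, dtype=float)
    label_vecs = np.asarray(label_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    sims = label_vecs @ sentence_vec / (
        np.linalg.norm(label_vecs, axis=1) * np.linalg.norm(sentence_vec)
    )
    order = np.argsort(-sims)
    return [(labels[i], float(sims[i])) for i in order]


# Toy vectors standing in for real embeddings (hypothetical values)
labels = ["vaccines", "economics", "transmission"]
label_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]
sentence_vec = [0.9, 0.1, 0.2]

ranking = rank_labels(sentence_vec, label_vecs, labels)
print(ranking)
```

Because the labels are embedded independently of the sentence, new labels can be added at query time without retraining anything.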

In another experiment, they use a pre-trained natural language inference (NLI) sequence-pair classifier as an out-of-the-box zero-shot text classifier, as proposed by Yin et al. (2020).
Using a pre-trained BART model (Lewis et al., 2019) fine-tuned on the Multi-Genre NLI (MultiNLI) corpus, they obtained an F1 score of 53.7 on the Yahoo Answers dataset.
The dataset has 10 classes, and the current supervised state of the art reaches an accuracy of 77.62.
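
To make the entailment-based approach more concrete, the sketch below turns each candidate label into an NLI hypothesis and converts NLI logits into a per-label probability by dropping the neutral logit and taking a softmax over contradiction and entailment. The hypothesis template, the assumed logit order (contradiction, neutral, entailment), the helper names, and the logit values are all illustrative assumptions; in practice the logits would come from an NLI model scoring each (text, hypothesis) pair.

```python
import numpy as np

TEMPLATE = "This text is about {}."  # hypothetical hypothesis template


def build_hypotheses(labels):
    """Turn each candidate label into an NLI hypothesis."""
    return [TEMPLATE.format(label) for label in labels]


def entailment_probs(logits):
    """Map NLI logits of shape (n_labels, 3) to P(entailment) per label.

    Assumes columns are ordered (contradiction, neutral, entailment).
    The neutral logit is dropped and a softmax is taken over the other two.
    """
    logits = np.asarray(logits, dtype=float)
    pair = logits[:, [0, 2]]
    pair = pair - pair.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(pair)
    return exp[:, 1] / exp.sum(axis=1)


labels = ["vaccines", "economics"]
hypotheses = build_hypotheses(labels)

# Made-up logits standing in for the output of an NLI model
logits = [[-1.0, 0.5, 2.0], [2.0, 0.0, -1.0]]
probs = entailment_probs(logits)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```

Scoring each label independently in this way also allows more than one label to apply to the same text.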

## Proposed method

First, we'll use Sentence-BERT to embed both the papers and the never-seen-before question, then measure the cosine distance between them to assess each paper's relevance to the question.
For efficiency, we'll iterate over the dataset and precompute the papers' representations from their titles and abstracts.
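
Once the representations are precomputed, answering a question reduces to a similarity search over them. The sketch below assumes a `representation` column holding one NumPy vector per paper; the `top_k_papers` helper, the toy DataFrame, and the vector values are hypothetical, and a real question vector would come from the same Sentence-BERT model.

```python
import numpy as np
import pandas as pd


def top_k_papers(papers, question_vec, k=2):
    """Return the k papers whose representation is closest to the question."""
    matrix = np.stack(papers["representation"].to_list()).astype(float)
    question_vec = np.asarray(question_vec, dtype=float)
    # Normalize so that the dot product equals cosine similarity
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    question_vec = question_vec / np.linalg.norm(question_vec)
    scores = matrix @ question_vec
    top = np.argsort(-scores)[:k]
    return papers.iloc[top].assign(score=scores[top])


# Toy data standing in for precomputed paper embeddings (hypothetical values)
papers = pd.DataFrame({
    "title": ["Vaccine trial results", "Hospital logistics", "Antiviral therapeutics"],
    "representation": [
        np.array([0.9, 0.1, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.6, 0.0, 0.8]),
    ],
})
question_vec = np.array([0.5, 0.0, 1.0])

result = top_k_papers(papers, question_vec)
print(result[["title", "score"]])
```

Only the question has to be embedded at query time, which is what makes precomputing the paper vectors worthwhile.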

In [None]:
# We'll load Sentence-BERT from HuggingFace's model hub
!pip install torch transformers

In [None]:
import math

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
from fastprogress import progress_bar

from risotto.artifacts import load_papers_artifact

papers = load_papers_artifact()

tokenizer = AutoTokenizer.from_pretrained("deepset/sentence_bert")
model = AutoModel.from_pretrained("deepset/sentence_bert")
model.eval()  # inference only: disable dropout

batch_size = 6
num_rows = len(papers)
# Ceiling division avoids an empty final batch when num_rows is a multiple of batch_size
num_batches = math.ceil(num_rows / batch_size)

papers["representation"] = pd.Series([], dtype=object)

for batch_id in progress_bar(range(num_batches)):
    # Concatenate title and abstract, treating missing values as empty strings
    start_idx = batch_id * batch_size
    end_idx = start_idx + batch_size
    slice_df = papers.iloc[start_idx:end_idx]
    title_abstract = (slice_df.title.fillna("") + ". " + slice_df.abstract.fillna("")).tolist()

    # Tokenize title-abstract (this call syntax requires transformers >= 3.0).
    # Without truncation, inputs longer than the model's 512 positions are the
    # likely cause of "IndexError: index out of range in self".
    # Related issue: https://github.com/huggingface/transformers/issues/4153
    inputs = tokenizer(
        title_abstract,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
    )
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Build representations: mean of the token embeddings, ignoring padding
    with torch.no_grad():
        output = model(input_ids, attention_mask=attention_mask)
    mask = attention_mask.unsqueeze(-1).float()
    representations = ((output[0] * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

    # Store representations
    for i, (paper_idx, _) in enumerate(slice_df.iterrows()):
        papers.at[paper_idx, "representation"] = representations[i]

## References

- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
- Davison, J. (2020). Zero-Shot Learning in Modern NLP. https://joeddav.github.io/blog/2020/05/29/ZSL.html
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. http://arxiv.org/abs/1910.13461
- Reimers, N., & Gurevych, I. (2020). Sentence-BERT: Sentence embeddings using siamese BERT-networks. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 3982–3992. https://doi.org/10.18653/v1/d19-1410
- Veeranna, S. P., Nam, J., Mencía, E. L., & Fürnkranz, J. (2016). Using semantic similarity for multi-label zero-shot classification of text documents. ESANN 2016 - 24th European Symposium on Artificial Neural Networks, April, 423–428.
- Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R., Liu, Z., Merrill, W., Mooney, P., Murdick, D., Rishi, D., Sheehan, J., Shen, Z., Stilson, B., Wade, A. D., Wang, K., Wilhelm, C., … Kohlmeier, S. (2020). CORD-19: The Covid-19 Open Research Dataset. https://arxiv.org/abs/2004.10706
- Yin, W., Hay, J., & Roth, D. (2020). Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 3914–3923. https://doi.org/10.18653/v1/d19-1404