CAiRE-COVID

A machine learning-based system that uses state-of-the-art natural language processing (NLP) question answering (QA) techniques combined with summarization for mining the available scientific literature

System Online!

Currently the CAiRE-COVID system has already been launched online. Please access the system by http://caire.ust.hk/covid.

Kaggle CORD-19 Task Winner

We are honored to be informed that our submission has won as the best response for the task What has been published about information sharing and inter-sectoral collaboration?

Install

You can install the requirements by:

pip install -r requirements.txt

In addition, you also need to install pytorch.

System Modules Usage

If you are interested in trying out the system modules yourself, you can utilize the system module by the following methods:

Document Retriever

1. Query Paraphrasing
For this part, you can implement your own methods or skip this step if your queries are relatively short and simple or you don't persuit SOTA performance. 2. Search Engine 2.1 install Python dependencies and pre-built index
Following the lucene+answerini information retrieval as described in: https://github.com/castorini/anserini/blob/master/docs/experiments-covid.md, set up JAVA sdk 11 first:

curl -O https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
mv openjdk-11.0.2_linux-x64_bin.tar.gz /usr/lib/jvm/; cd /usr/lib/jvm/; tar -zxvf openjdk-11.0.2_linux-x64_bin.tar.gz
update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk-11.0.2/bin/java 1
update-alternatives --set java /usr/lib/jvm/jdk-11.0.2/bin/java

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/jdk-11.0.2"

2.2 Get the pyserini library, which is anserini wrapped with python:

pip install pyserini==0.8.1.0

We can build the lucene index of the COVID-19 dataset from scratch, or get one of the pre-built indexes. Using the paragraph indexing which indexes each paragraph of an article (already uploaded the index as a dataset to use), can be downloaded from: https://www.dropbox.com/s/xg2b4aapjvmx3ve/lucene-index-cord19-paragraph-2020-04-24.tar.gz.

from pyserini.search import pysearch
COVID_INDEX = '../input/luceneindexcovidparagraph20200410/lucene-index-covid-paragraph-2020-04-24'

The indexing is done based on each paragraph merged with the title and abstract. Given an article with id doc_id, the index will be as follows:

doc_id : title + abstract
doc_id.00001 : title + abstract + 1st paragraph
docid.00002: title + abstract + 2nd paragraph
docid.00003: title + abstract + 3rd paragraph

2.3 Try the example!

python project/retrieval.py

Relevent Snippet Selection

You can use our package by install with pip or use the source code.

pip install caireCovid

Question Answering System

In this system, we build QA modules by a ensemble of two QA models, which are BioBERT model which fine-tuned on SQuAD, and MRQA model which is our submission to MRQA@EMNLP 2019.

The MRQA model and the exported BioBERT model that are utilized in this project can bo downloaded by this link.

If you want to use our MRQA model in your work, please cite the following paper. The bibtex is listed below:

@inproceedings{su2019generalizing,
  title={Generalizing Question Answering System with Pre-trained Language Model Fine-tuning},
  author={Su, Dan and Xu, Yan and Winata, Genta Indra and Xu, Peng and Kim, Hyeondey and Liu, Zihan and Fung, Pascale},
  booktitle={Proceedings of the 2nd Workshop on Machine Reading for Question Answering},
  pages={203--211},
  year={2019}
}

We provide the example script, while you need to change the paths to the QA models in project/qa.py. Note that the final output is already re-ranked based on re-ranking score.

python project/qa.py

Hightlighting

Keyword highlighting is mainly implemented by term matching, of which the code could be found in src/covidQA/highlights.py.

Summarization

You can use our package by install with pip or use the source code.

pip install covidSumm

We provide the example scripts for both abstractive and extractive summarization.

python project/abstractive_summarization.py

python project/extractive_summarization.py

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
img		img
project		project
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

img

img

project

project

src

src

.gitignore

.gitignore

README.md

README.md

requirements.txt

requirements.txt

Repository files navigation

CAiRE-COVID

System Online!

Kaggle CORD-19 Task Winner

Install

System Modules Usage

Document Retriever

Relevent Snippet Selection

Question Answering System

Hightlighting

Summarization

About

Releases

Packages

Languages

yana-xuyan/CAiRE-COVID

Folders and files

Latest commit

History

Repository files navigation

CAiRE-COVID

System Online!

Kaggle CORD-19 Task Winner

Install

System Modules Usage

Document Retriever

Relevent Snippet Selection

Question Answering System

Hightlighting

Summarization

About

Resources

Stars

Watchers

Forks

Languages