Covid Knowledge Extractor

The code is organized as a sequence of steps that filter the CORD-19 annotations from SciBiteAI. Each step relies on the previous script in the process having been completed.


The following files must be placed in the data directory.

cv19_scc.tsv: These are the sentence-level co-occurrence annotations from SciBiteAI. Please download from here. Under sentence-co-occurrence-CORD-19, select the latest version of the dataset.

structure_links.csv: This file contains data on all chemicals in DrugBank. It is available here. To enable downloading, a free DrugBank account must be created.

DBCAT000021: This is the hosted web page that contains a list of all amino acids currently in DrugBank.

uniprot_links.csv: This is a file from DrugBank that describes all DrugBank -> UniProt target ids. Please download from here. Under Target Drug-UniProt Links, choose Drug Group "All".

human_proteins.tsv: This file comes from UniProt and contains a hand-reviewed list of all human proteins. The database may be inspected here. For convenience, and to keep the same query parameters, we suggest running the cURL command below.

curl ',entry%20name,reviewed,protein%20names,genes,organism,length&compress=yes' --compressed > human_proteins.tsv.gz

corona_virus_proteins.tsv: This file comes from UniProt and contains a hand-reviewed list of all coronavirus viral proteins. The database may be inspected here. For convenience, and to keep the same query parameters, we suggest running the cURL command below.

curl ',entry%20name,reviewed,protein%20names,genes,organism,length&compress=yes' --compressed > corona_virus_proteins.tsv.gz

metadata.csv: This file comes from Kaggle and contains metadata on all papers present in CORD-19.
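
Before running Step 1, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch, assuming the files sit directly under data with exactly the names given above; the helper name missing_inputs is hypothetical and not part of this repository. DBCAT000021 is described as a web page, and how it should be saved locally is not specified here, so it is not checked.

```python
from pathlib import Path

# File names taken from the setup list above. DBCAT000021 is left out because
# the README does not say under which file name it is stored.
REQUIRED_FILES = [
    "cv19_scc.tsv",
    "structure_links.csv",
    "uniprot_links.csv",
    "human_proteins.tsv",
    "corona_virus_proteins.tsv",
    "metadata.csv",
]


def missing_inputs(data_dir="data"):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```

Note that the cURL commands above write .tsv.gz files, so the check will report the two UniProt files as missing until they exist under the .tsv names listed.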

Here are the different sections of the code:

Step 1: This step takes in cv19_scc.tsv and finds all papers in which a SARSCOV ontological tag is found. It then removes all papers without this tag. This filtered dataset is placed in data/cv19_scc_filtered_for_COVID.tsv.
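
The filtering idea in Step 1 can be sketched as a two-pass filter. This is an illustration only: the column names document_id and entity_tags, the "|" separator, and the function name filter_for_covid are assumptions, since the actual layout of cv19_scc.tsv is not described here.

```python
import csv
import io


def filter_for_covid(rows, doc_col="document_id", tag_col="entity_tags", tag="SARSCOV"):
    """Keep only rows from papers where at least one row carries the tag."""
    rows = list(rows)
    # First pass: collect the papers in which the tag appears anywhere.
    # Note: this is a plain substring match on the (hypothetical) tag column.
    tagged_docs = {r[doc_col] for r in rows if tag in r[tag_col]}
    # Second pass: keep every row that belongs to one of those papers.
    return [r for r in rows if r[doc_col] in tagged_docs]


# Tiny in-memory TSV standing in for cv19_scc.tsv (made-up rows).
sample = (
    "document_id\tentity_tags\n"
    "paper1\tSARSCOV|DRUG\n"
    "paper1\tGENE|DRUG\n"
    "paper2\tGENE|DRUG\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
kept = filter_for_covid(reader)
print([r["document_id"] for r in kept])  # ['paper1', 'paper1']
```

Both rows of paper1 survive because the tag appears somewhere in that paper, while paper2 is dropped entirely.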

Step 2: This step takes in cv19_scc_filtered_for_COVID.tsv and runs our co-occurrence algorithm against it. Once these co-occurrence counts are found they are placed in data/CORD19_co-occurrence_pairs.csv.
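
The co-occurrence algorithm used in Step 2 is not described in this README. As a rough sketch of what sentence-level co-occurrence counting can look like, the example below counts how many sentences each unordered pair of entities shares. The function name count_pairs and the entity identifiers are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def count_pairs(sentences):
    """Count how many sentences each unordered entity pair shares."""
    counts = Counter()
    for entities in sentences:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Made-up entity identifiers, one list per annotated sentence.
sentences = [
    ["PROT_A", "CHEM_X"],
    ["CHEM_X", "PROT_A", "PROT_B"],
    ["PROT_B"],
]
pair_counts = count_pairs(sentences)
print(pair_counts[("CHEM_X", "PROT_A")])  # 2
```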

Step 3: This step takes in CORD19_co-occurrence_pairs.csv and runs our hypergeometric scoring algorithm against it. The score is added as a column to the CSV, and the result is written to data/CORD19_co-occurrence_pairs_scored.csv.
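
The exact score computed in Step 3 is not specified here. One common hypergeometric formulation is the upper-tail probability of seeing at least k co-occurrences by chance, and the sketch below computes that with the standard library only. The function name hypergeom_pvalue and the choice of the upper tail are assumptions, not a description of the repository's scoring code.

```python
from math import comb


def hypergeom_pvalue(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'.

    In a co-occurrence setting one might take N as the number of sentences,
    K as the sentences mentioning the first entity, n as the sentences
    mentioning the second, and k as the sentences mentioning both.
    """
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Worked example: N=10, K=3, n=3, k=2 gives (21 + 1) / 120.
p = hypergeom_pvalue(2, 10, 3, 3)
print(round(p, 4))  # 0.1833
```

A smaller value means the observed overlap is less likely under random pairing, so such a score is often reported on a negative log scale.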

Step 4: This step takes in CORD19_co-occurrence_pairs_scored.csv and removes any pairing that is not a protein->chemical relationship. The result is saved as data/CORD19_co-occurrence_pairs_scored_filtered.csv.

Step 5: This step requires all of the DrugBank and UniProt files mentioned in setup to be placed in the data directory. It takes in CORD19_co-occurrence_pairs_scored_filtered.csv and matches the UniProt identifiers and ChEMBL identifiers against those in the provided datasets; if a match is not found, the pair is removed. The output is written to data/CORD19_co-occurrence_pairs_scored_filtered_db_uniprot.csv.
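
Steps 4 and 5 are both row filters, so they can be illustrated together. This is a sketch under stated assumptions: the column names type_1, type_2, id_1 and id_2, the type labels "protein" and "chemical", and the function name filter_pairs are all hypothetical, and the identifiers in the example are made up.

```python
def filter_pairs(pairs, known_proteins, known_chemicals):
    """Apply the Step 4 type filter, then the Step 5 identifier filter."""
    kept = []
    for p in pairs:
        # Step 4: drop anything that is not a protein -> chemical pairing.
        if (p["type_1"], p["type_2"]) != ("protein", "chemical"):
            continue
        # Step 5: drop pairs whose identifiers are absent from the reference sets.
        if p["id_1"] not in known_proteins or p["id_2"] not in known_chemicals:
            continue
        kept.append(p)
    return kept


# Made-up rows and reference sets.
pairs = [
    {"type_1": "protein", "id_1": "UNIPROT_1", "type_2": "chemical", "id_2": "CHEMBL_1"},
    {"type_1": "protein", "id_1": "UNIPROT_1", "type_2": "protein", "id_2": "UNIPROT_2"},
    {"type_1": "protein", "id_1": "UNIPROT_9", "type_2": "chemical", "id_2": "CHEMBL_1"},
]
result = filter_pairs(pairs, known_proteins={"UNIPROT_1"}, known_chemicals={"CHEMBL_1"})
print(len(result))  # 1
```

In practice the two reference sets would be built from the UniProt and DrugBank files in the data directory.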

Step 6: This step generates CSVs of all pair information to be downloaded by users of the site. It also calculates the number of known drug targets for each compound; to perform this calculation, the DrugBank information on UniProt targets must be placed in the data directory. This step takes in CORD19_co-occurrence_pairs_scored_filtered_db_uniprot.csv and outputs as many CSVs as there are viable targets into data/csvs.
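
The two tasks in Step 6 can be sketched as follows: counting distinct known targets per compound, and writing one CSV per target. The function names, the target column, and the (drug id, UniProt id) tuple layout are assumptions made for illustration; the real column names in the DrugBank and pair files may differ.

```python
import csv
from collections import defaultdict
from pathlib import Path


def known_target_counts(links):
    """Map each drug id to the number of distinct UniProt targets it has."""
    targets = defaultdict(set)
    for drug_id, uniprot_id in links:
        targets[drug_id].add(uniprot_id)
    return {drug: len(ids) for drug, ids in targets.items()}


def write_target_csvs(pairs, out_dir):
    """Write one CSV per value of the (hypothetical) 'target' column."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    by_target = defaultdict(list)
    for row in pairs:
        by_target[row["target"]].append(row)
    paths = []
    for target, rows in sorted(by_target.items()):
        path = out / f"{target}.csv"
        with path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths
```

Calling write_target_csvs(rows, "data/csvs") would produce one file per target, matching the output location named in Step 6.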

Step 7: This step generates static HTML and JSON files for the website. It takes in the CSVs generated in Step 6 and outputs "index.html", "dtd_table.html", "dtd_table_singleton.html", "target_info.html", and "downloads.html" into the data/html directory. It also outputs as many JSON files as there are viable targets into data/jsons.
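
The JSON half of Step 7 amounts to converting each per-target CSV into a JSON file. A minimal sketch, assuming each JSON file is simply the list of CSV rows as objects; the function name csv_to_json and that JSON layout are assumptions, and the HTML generation is not covered here.

```python
import csv
import json
from pathlib import Path


def csv_to_json(csv_path, json_dir):
    """Convert one per-target CSV into a JSON file with the same stem."""
    csv_path = Path(csv_path)
    json_dir = Path(json_dir)
    json_dir.mkdir(parents=True, exist_ok=True)
    with csv_path.open(newline="") as fh:
        rows = list(csv.DictReader(fh))
    out_path = json_dir / f"{csv_path.stem}.json"
    out_path.write_text(json.dumps(rows, indent=2))
    return out_path
```

Looping this over every file in data/csvs and writing into data/jsons would yield one JSON file per viable target.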

