# Interactive curation of entity occurrence data

This notebook illustrates how the curation application included in the `cord_19` package can be used to interactively curate named entity occurrence data.

In [13]:
import json
import zipfile

import pandas as pd

from cord_19.utils import (generate_curation_table,
                           link_ontology)
from cord_19.apps.curation_app import curation_app

## Preparing entity occurrence data

The input dataset contains occurrences of terms in paragraphs of scientific articles from the CORD-19 dataset. These occurrences were previously extracted with a Named Entity Recognition (NER) model.

In [4]:
data = pd.read_csv("../data/literature_NER_with_types.csv")

In [5]:
data

Unnamed: 0,entity,entity_type,occurrence
0,Hyperglycemia,DISEASE,35198:Title:0
1,Prognosis,DISEASE,35198:Title:0
2,COVID-19,DISEASE,35198:Title:0
3,COVID-19,DISEASE,35198:Title:0
4,coronavirus,ORGANISM,35198:Abstract:1
...,...,...,...
3521,ACE2,PROTEIN,214924:Caption:30
3522,COVID-19,DISEASE,214924:Caption:31
3523,Diabetes,DISEASE,214924:Caption:31
3524,Mellitus,DISEASE,214924:Caption:31
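
The `occurrence` values above appear to follow a `paper:section:paragraph` pattern (for example `35198:Title:0`). Below is a minimal sketch, assuming this pattern holds for every row, of how such identifiers could be split into their parts. `split_occurrence` is a hypothetical helper written for illustration; it is not part of `cord_19`.

```python
import pandas as pd


def split_occurrence(occurrence: str) -> dict:
    """Split 'paper:section:paragraph_index' into its three parts.

    Assumption: the paper id precedes the first colon and the paragraph
    index follows the last colon; everything in between is the section
    title, which may itself contain colons.
    """
    paper, _, rest = occurrence.partition(":")
    section, _, paragraph_index = rest.rpartition(":")
    return {
        "paper": paper,
        "section": section,
        "paragraph_index": int(paragraph_index),
    }


# Toy rows copied from the table above
sample = pd.DataFrame({
    "entity": ["Hyperglycemia", "coronavirus", "ACE2"],
    "occurrence": ["35198:Title:0", "35198:Abstract:1", "214924:Caption:30"],
})
parts = pd.DataFrame([split_occurrence(o) for o in sample["occurrence"]])
print(pd.concat([sample, parts], axis=1))
```

Splitting on the first and last colon, rather than on every colon, keeps section titles that contain `:::` separators intact.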


In [6]:
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....
Done.


In [7]:
curation_input_table

Unnamed: 0,entity,raw_entity_types,paragraph,paper,section,paper_frequency,entity_type
0,4e-bp1,"[PROTEIN, PROTEIN]",[214924:The Interplay Between Covid-19 And Amp...,[214924],[214924:The Interplay Between Covid-19 And Amp...,1,PROTEIN
1,ace,"[PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ...",[179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor...,"[179426, 214924, 184360]",[214924:Angiotensin-Converting Enzyme 2 Expres...,3,PROTEIN
2,ace-2,"[PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ...",[179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor...,[179426],[179426:Role Of Antidiabetic Drugs In Current ...,1,PROTEIN
3,ace2,"[PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ...","[184360:Caption:70, 197804:Discussion:53, 2149...","[179426, 211373, 214924, 211125, 160564, 19780...","[184360:Gliptins , 214924:Caption, 214924:Angi...",7,PROTEIN
4,ace2 receptor,"[PROTEIN, PROTEIN]","[214924:Conclusion:28, 211373:Introduction:5]","[214924, 211373]","[211373:Introduction, 214924:Conclusion]",2,PROTEIN
...,...,...,...,...,...,...,...
263,viral replication,"[PATHWAY, PATHWAY]",[214924:The Immune Response To Sars-Cov-2 ::: ...,[214924],[214924:The Protective Role Of Angiotensin-Con...,1,PATHWAY
264,virus,"[ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN...",[179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor...,"[179426, 214924, 211373, 197804, 184360]","[184360:Gliptins , 214924:The Immune Response ...",5,ORGANISM
265,virus entry,"[PATHWAY, PATHWAY, PATHWAY, PATHWAY, PATHWAY]",[184360:Anti-Dpp4 Vaccine ::: Therapeutic Pote...,"[211373, 184360]","[184360:Abstract, 184360:Anti-Dpp4 Vaccine , 2...",2,PATHWAY
266,viruses,"[ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN...",[214924:The Immune Response To Sars-Cov-2 ::: ...,"[197804, 214924, 160564, 211125]","[197804:Introduction, 160564:Introduction, 214...",4,ORGANISM


In [8]:
factor_counts

{'paper': 20, 'section': 108, 'paragraph': 286}
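
The `factor_counts` dictionary appears to hold the number of distinct papers, sections and paragraphs in the input. The following toy sketch shows how such counts and a per-entity `paper_frequency` could be derived with pandas. It is an illustration under stated assumptions (lowercasing as the only clean-up step, the `paper:section:paragraph` identifier pattern), not the implementation of `generate_curation_table`.

```python
import pandas as pd

# Toy occurrence data (made up for illustration)
occurrences = pd.DataFrame({
    "entity": ["ACE2", "ace2", "COVID-19", "COVID-19"],
    "occurrence": ["1:Abstract:0", "2:Caption:3", "1:Abstract:0", "1:Title:0"],
})

# Derive the three factors from the occurrence identifier
occurrences["paper"] = occurrences["occurrence"].str.split(":", n=1).str[0]
occurrences["section"] = occurrences["occurrence"].str.rsplit(":", n=1).str[0]
occurrences["paragraph"] = occurrences["occurrence"]

# Number of distinct values per factor
toy_factor_counts = {
    factor: int(occurrences[factor].nunique())
    for factor in ["paper", "section", "paragraph"]
}

# Assumed clean-up step: lowercase entities, then aggregate per entity
aggregated = (
    occurrences.assign(entity=occurrences["entity"].str.lower())
    .groupby("entity")
    .agg(
        paper=("paper", lambda s: sorted(set(s))),
        paper_frequency=("paper", "nunique"),
    )
)

print(toy_factor_counts)
print(aggregated)
```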

## Loading NCIT ontology linking data



In [15]:
%%time
print("Loading the ontology linking data...")
print("\tDecompressing the linking file...")
with zipfile.ZipFile("../data/NCIT_ontology_linking.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")
    
print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv("../data/NCIT_ontology_linking.csv")

print("\tLoading ontology type mapping...")
with open("../data/NCIT_type_mapping.json", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
	Decompressing the linking file...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Done.
CPU times: user 5.99 s, sys: 822 ms, total: 6.81 s
Wall time: 6.86 s


In [16]:
ontology_linking

Unnamed: 0,mention,concept,uid,definition,semantic_type,taxonomy
0,endotracheal secretion,endotracheal,http://purl.obolibrary.org/obo/NCIT_C84984,Within the trachea.,Classification,[('http://purl.obolibrary.org/obo/NCIT_C13442'...
1,immunity,immunity,http://purl.obolibrary.org/obo/NCIT_C16710,The protection against infectious disease conf...,Organ or Tissue Function,[('http://purl.obolibrary.org/obo/NCIT_C17937'...
2,mycoplasma pneumoniae infections,mycoplasma pneumoniae,http://purl.obolibrary.org/obo/NCIT_C86599,"A species of anaerobic, Gram-negative, pseudo-...",Bacterium,[('http://purl.obolibrary.org/obo/NCIT_C73540'...
3,infections,infectious disorder,http://purl.obolibrary.org/obo/NCIT_C26726,A disorder resulting from the presence and act...,Disease or Syndrome,[('http://purl.obolibrary.org/obo/NCIT_C93210'...
4,community-acquired,community-acquired pneumonia,http://purl.obolibrary.org/obo/NCIT_C115163,Pneumonia that is not acquired in a hospital o...,Disease or Syndrome,"[('http://purl.obolibrary.org/obo/NCIT_C3333',..."
...,...,...,...,...,...,...
1015167,cyp3a,cytochrome p450,http://purl.obolibrary.org/obo/NCIT_C16484,A family of cytochromes that are involved in e...,"Amino Acid, Peptide, or Protein",[('http://purl.obolibrary.org/obo/NCIT_C16486'...
1015168,nadh,nadide,http://purl.obolibrary.org/obo/NCIT_C87339,A dinucleotide of adenine and nicotinamide. It...,Pharmacologic Substance,"[('http://purl.obolibrary.org/obo/NCIT_C1505',..."
1015169,copd,chronic obstructive pulmonary disease,http://purl.obolibrary.org/obo/NCIT_C3199,A chronic and progressive lung disorder charac...,Disease or Syndrome,[('http://purl.obolibrary.org/obo/NCIT_C98541'...
1015170,cvd,cardiovascular disorder,http://purl.obolibrary.org/obo/NCIT_C2931,A non-neoplastic or neoplastic disorder affect...,Disease or Syndrome,[('http://purl.obolibrary.org/obo/NCIT_C27551'...


In [17]:
type_mapping

{'DISEASE': {'include': ['Disease, Disorder or Finding']},
 'Condition': {'include': ['Risk Factor',
   'Personal Behavior',
   'Industrial Waste',
   'Health',
   'Biospecimen Condition',
   'Event',
   'Care',
   'Population Group Characteristic']},
 'Biomarkers': {'include': ['Biomarker Analysis', 'Biomarker']},
 'ORGANISM': {'include': ['Organism']},
 'DRUG': {'include': ['Substance of Abuse',
   'Dietary Supplement',
   'Drug Class Measurement',
   'Pharmacologic Substance']},
 'CELL_COMPARTMENT': {'include': ['Cell Part']},
 'PATHWAY': {'include': ['Biochemical Pathway', 'Biological Process']},
 'CHEMICAL': {'include': ['Chemical Modifier',
   'Drug or Chemical by Structure',
   'Food or Food Product',
   'Physiology-Regulatory Factor']},
 'PROTEIN': {'include': ['Blood Protein Measurement',
   'Protein or Riboprotein Complex',
   'Protein or Enzyme Type Measurement',
   'Protein',
   'Hemoglobin Measurement',
   'Gene Product',
   'Vitamin Measurement',
   'Cytokine Measurement'
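
The notebook does not show how `link_ontology` uses this mapping. One plausible reading is that each entity type lists the NCIT classes whose presence among a concept's ancestors assigns that type. The sketch below illustrates that reading on a toy subset of the mapping; `resolve_entity_type` and its `ancestor_labels` input are hypothetical and introduced here only for illustration.

```python
from typing import Iterable, Optional

# Toy subset of the mapping shown above
toy_type_mapping = {
    "DISEASE": {"include": ["Disease, Disorder or Finding"]},
    "ORGANISM": {"include": ["Organism"]},
    "PATHWAY": {"include": ["Biochemical Pathway", "Biological Process"]},
}


def resolve_entity_type(ancestor_labels: Iterable[str], mapping: dict) -> Optional[str]:
    """Return the first entity type whose 'include' list shares a label
    with the given ancestor labels, or None if nothing matches."""
    labels = set(ancestor_labels)
    for entity_type, rule in mapping.items():
        if labels.intersection(rule["include"]):
            return entity_type
    return None


print(resolve_entity_type(["Virus", "Organism"], toy_type_mapping))
print(resolve_entity_type(["Qualitative Concept"], toy_type_mapping))
```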

## Running the curation app

In [18]:
default_term_filters = ["glucose"]

In [19]:
curation_app.set_default_terms_to_include(default_term_filters)
curation_app.set_table(curation_input_table.copy())
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

# Set `mode="external"` to open the app in a new browser tab instead
curation_app.run(port=8070, mode="inline")

Merging the occurrence data with the ontology linking...


## Extracting curated data

In [20]:
curated_occurrence_data = curation_app.get_curated_table()

In [21]:
curated_occurrence_data

entity,aggregated_entities,raw_entity_types,paragraph,paper,section,uid,definition,semantic_type,taxonomy,paper_frequency,entity_type,entity_type_label
4e-bp1,[4e-bp1],"[PROTEIN, PROTEIN]",[214924:The Interplay Between Covid-19 And Amp...,[214924],[214924:The Interplay Between Covid-19 And Amp...,,,,,1,PROTEIN,PROTEIN
ace inhibitor,"[acei, acei, acei, acei, acei, acei]","[PROTEIN, DRUG, DRUG, DRUG, DRUG, DRUG, PROTEI...",[184360:Aceis And Arbs ::: Therapeutic Potenti...,"[197804, 184360]",[197804:Management Of Children And Young Peopl...,http://purl.obolibrary.org/obo/NCIT_C247,Any substance that inhibits angiotensin-conver...,Chemical Viewed Functionally,"[('http://purl.obolibrary.org/obo/NCIT_C270', ...",2,DRUG,DRUG
acetaminophen,"[acetaminophen, acetaminophen, paracetamol, pa...","[CHEMICAL, DRUG, CHEMICAL, DRUG, CHEMICAL, DRU...","[197804:Discussion:52, 197804:Management Of Ch...","[179426, 197804]",[197804:Management Of Children And Young Peopl...,http://purl.obolibrary.org/obo/NCIT_C198,A p-aminophenol derivative with analgesic and ...,Pharmacologic Substance,"[('http://purl.obolibrary.org/obo/NCIT_C2356',...",2,DRUG,DRUG
acute lung injury,"[ali, ali, lung injury, lung injury, lung inju...","[DISEASE, DISEASE, DISEASE, DISEASE, DISEASE, ...",[214924:The Protective Role Of Angiotensin-Con...,"[179426, 214924, 197804, 184360]","[184360:Ang-(1-7) Analogues , 184360:Aceis And...",http://purl.obolibrary.org/obo/NCIT_C155766,A finding of acute onset of bilateral pulmonar...,Finding,[('http://purl.obolibrary.org/obo/NCIT_C45233'...,4,DISEASE,DISEASE
acute respiratory distress syndrome,"[acute respiratory distress syndrome, acute re...","[DISEASE, DISEASE, DISEASE, DISEASE, DISEASE, ...",[184360:Pathophysiology Of Covid-19: Pulmonar ...,"[179426, 129074, 214924, 211125, 197804, 184360]","[184360:Ang-(1-7) Analogues , 211125:Introduct...",http://purl.obolibrary.org/obo/NCIT_C3353,Progressive and life-threatening pulmonary dis...,Disease or Syndrome,[('http://purl.obolibrary.org/obo/NCIT_C28193'...,6,DISEASE,DISEASE
...,...,...,...,...,...,...,...,...,...,...,...,...
viral,"[viral, viral, viral, viral]","[ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN...",[179426:Association Of Diabetes With Acute Vir...,"[179426, 211373, 214924, 211125, 184360]",[184360:Mechanisms Of Sars-Cov-2 Entry Into Ho...,http://purl.obolibrary.org/obo/NCIT_C27985,,Qualitative Concept,[('http://purl.obolibrary.org/obo/NCIT_C27993'...,5,ORGANISM,ORGANISM
viral entry,[viral entry],"[PATHWAY, PATHWAY, PATHWAY, PATHWAY, PATHWAY, ...",[179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor...,"[179426, 214924]","[179426:Conclusion, 214924:Angiotensin-Convert...",,,,,2,PATHWAY,PATHWAY
viral infection,"[viral infection, viral infection, viral infec...","[DISEASE, PATHWAY, DISEASE, DISEASE, DISEASE, ...",[184360:Mechanisms Of Sars-Cov-2 Entry Into Ho...,"[211373, 214924, 211125, 184360]",[184360:Mechanisms Of Sars-Cov-2 Entry Into Ho...,http://purl.obolibrary.org/obo/NCIT_C3439,Any disease caused by a virus.,Disease or Syndrome,[('http://purl.obolibrary.org/obo/NCIT_C26726'...,4,DISEASE,DISEASE
virus,"[virus, virus, virus, viruses, viruses, viruse...","[ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN...","[197804:Discussion:44, 184360:Sdpp4 As Soluble...","[179426, 211373, 214924, 211125, 160564, 19780...","[184360:Gliptins , 197804:Introduction, 214924...",http://purl.obolibrary.org/obo/NCIT_C14283,An infectious agent which consists of two part...,Virus,[('http://purl.obolibrary.org/obo/NCIT_C14250'...,7,ORGANISM,ORGANISM


In [22]:
# Convert the occurrence lists to sets to drop duplicate entries
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)

In [23]:
curation_meta_data = {
    "factor_counts": factor_counts,
    "nodes_to_keep": curation_app.get_terms_to_include(),
    "n_most_frequent": curation_app.n_most_frequent or 100
}

In [24]:
curation_meta_data

{'factor_counts': {'paper': 20, 'section': 108, 'paragraph': 286},
 'nodes_to_keep': ['glucose'],
 'n_most_frequent': 500}

In [25]:
# Uncomment and fill in the output paths to save the curated data and its metadata
# curated_occurrence_data.to_csv("data")
# with open("", "w") as f:
#     json.dump(curation_meta_data, f)
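
Because the `paper`, `paragraph` and `section` columns now hold Python sets, they are not directly JSON-serializable. The sketch below shows one possible way to save and restore such data on a toy table by converting sets to sorted lists first. The file names and the temporary directory are hypothetical placeholders, and the toy table only mimics the shape of `curated_occurrence_data`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the curated table and its metadata
toy_curated = pd.DataFrame(
    {"paper": [{"179426", "214924"}], "paper_frequency": [2]},
    index=pd.Index(["viral entry"], name="entity"),
)
toy_meta = {
    "factor_counts": {"paper": 20, "section": 108, "paragraph": 286},
    "nodes_to_keep": ["glucose"],
    "n_most_frequent": 500,
}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output paths
    table_path = Path(tmp) / "curated_occurrence_data.json"
    meta_path = Path(tmp) / "curation_meta_data.json"

    # Sets are not JSON-serializable, so store them as sorted lists
    serializable = toy_curated.copy()
    serializable["paper"] = serializable["paper"].apply(sorted)
    serializable.to_json(table_path, orient="index")

    # json.dump writes to a file object (json.dumps only returns a string)
    with open(meta_path, "w") as f:
        json.dump(toy_meta, f)

    # Read both files back and restore the set
    restored_table = json.loads(table_path.read_text())
    restored_papers = set(restored_table["viral entry"]["paper"])
    with open(meta_path) as f:
        restored_meta = json.load(f)

print(restored_papers)
print(restored_meta)
```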