# Interactive curation of entity occurrence data

In this notebook we will illustrate how the curation application included in the `cord_19` package can be used to perform interactive curation of named entity occurrence data.

The input data for this notebook contains the named entities extracted from a small selection of 20 articles from the [COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) (CORD-19) dataset.

In [80]:
import json
import zipfile

import pandas as pd

from cord_19.utils import (generate_curation_table,
                           link_ontology)
from cord_19.apps.curation_app import curation_app

## Preparing entity occurrence data

The input dataset contains occurrences of different terms in paragraphs of scientific articles from the CORD-19 dataset. These occurrences were extracted beforehand with a Named Entity Recognition (NER) model.

In [81]:
data = pd.read_csv("../data/Glucose_risk_20_papers.csv")

In [102]:
%%time
print("Decompressing the input data file...")
with zipfile.ZipFile("../data/Glucose_risk_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")
data = pd.read_csv("../data/Glucose_risk_3000_papers.csv")
print("Done.")

Decompressing the input data file...
Done.
CPU times: user 2.32 s, sys: 302 ms, total: 2.62 s
Wall time: 2.67 s


In the first preparation step, we group and aggregate the input data by unique entities.

In [106]:
data["entity"] = data["entity"].str.lower()

In [103]:
%%time
print("Preparing curation data...")
curation_input_table, factor_counts = generate_curation_table(data)
print("Done.")

Preparing curation data...
Cleaning up the entities...
Aggregating occurrences of entities....


KeyboardInterrupt: 

The resulting dataframe contains one row per unique named entity, together with the following occurrence data:
- sets of papers, sections, and paragraphs in which the corresponding entity is mentioned (the `paper`, `section`, and `paragraph` columns);
- number of total occurrences (the `raw_frequency` column);
- number of unique papers where it occurs (the `paper_frequency` column);
- entity type assigned by the NER model (the `entity_type` column).
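
The aggregation described above can be approximated with a plain pandas `groupby`. The snippet below is only a sketch on toy data, not the actual implementation of `generate_curation_table`: the toy records, the `most_common` helper, and the choice of the most frequent type as `entity_type` are assumptions made for illustration.

```python
import pandas as pd

# Toy occurrence records; column names are assumed to mirror the input CSV.
toy = pd.DataFrame({
    "entity": ["ACE2", "ace2", "glucose", "ace2"],
    "entity_type": ["PROTEIN", "PROTEIN", "CHEMICAL", "PROTEIN"],
    "paper": ["p1", "p1", "p1", "p2"],
    "section": ["p1:Intro", "p1:Methods", "p1:Intro", "p2:Abstract"],
    "paragraph": ["p1:Intro:1", "p1:Methods:4", "p1:Intro:1", "p2:Abstract:0"],
})

# Normalize the case so that "ACE2" and "ace2" fall into the same group.
toy["entity"] = toy["entity"].str.lower()


def most_common(values):
    """Hypothetical helper: pick the most frequent value in a group."""
    return values.mode().iloc[0]


# One row per unique entity with the occurrence columns listed above.
aggregated = toy.groupby("entity").agg(
    entity_type=("entity_type", most_common),
    paragraph=("paragraph", lambda s: sorted(set(s))),
    paper=("paper", lambda s: sorted(set(s))),
    section=("section", lambda s: sorted(set(s))),
    paper_frequency=("paper", "nunique"),
    raw_frequency=("paper", "size"),
).reset_index()

print(aggregated)
```

Here `raw_frequency` counts every record of an entity, while `paper_frequency` counts only the distinct papers, which is why the two can differ for the same row.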

In [104]:
curation_input_table

| | entity | entity_type | paragraph | paper | section | paper_frequency | raw_frequency |
|---|---|---|---|---|---|---|---|
| 0 | 4e-bp1 | PROTEIN | [214924:The Interplay Between Covid-19 And Amp... | [214924] | [214924:The Interplay Between Covid-19 And Amp... | 1 | 2 |
| 1 | ace | PROTEIN | [179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor... | [214924, 179426, 184360] | [179426:Role Of Ace/Arbs , 214924:The Protecti... | 3 | 9 |
| 2 | ace-2 | PROTEIN | [179426:Role Of Antidiabetic Drugs In Current ... | [179426] | [179426:Role Of Antidiabetic Drugs In Current ... | 1 | 10 |
| 3 | ace2 | PROTEIN | [184360:Caption:71, 184360:Combined Therapeuti... | [184360, 211373, 211125, 214924, 160564, 17942... | [214924:The Immune Response To Sars-Cov-2 , 19... | 7 | 79 |
| 4 | ace2 receptor | PROTEIN | [211373:Introduction:5, 214924:Conclusion:28] | [214924, 211373] | [214924:Conclusion, 211373:Introduction] | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 263 | viral replication | PATHWAY | [214924:The Immune Response To Sars-Cov-2 ::: ... | [214924] | [214924:The Immune Response To Sars-Cov-2 , 21... | 1 | 2 |
| 264 | virus | ORGANISM | [214924:Introduction:3, 184360:Gliptins ::: Th... | [184360, 211373, 214924, 179426, 197804] | [184360:Abstract, 214924:Abstract, 214924:The ... | 5 | 24 |
| 265 | virus entry | PATHWAY | [184360:Anti-Dpp4 Vaccine ::: Therapeutic Pote... | [184360, 211373] | [184360:Abstract, 184360:Combined Therapeutic ... | 2 | 5 |
| 266 | viruses | ORGANISM | [197804:Introduction:2, 211125:Discussion:25, ... | [211125, 160564, 214924, 197804] | [214924:The Immune Response To Sars-Cov-2 , 21... | 4 | 6 |


The second output of the data preparation step contains the counts of distinct occurrence factors, i.e. the number of distinct papers, sections, and paragraphs in the input corpus.

In [85]:
factor_counts

{'paper': 20, 'section': 108, 'paragraph': 286}
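
As an illustration of how such counts can be derived, the sketch below counts distinct values per occurrence factor on a toy table. It assumes the input has `paper`, `section`, and `paragraph` columns; the variable name `factor_counts_sketch` is hypothetical, and this is not necessarily how `generate_curation_table` computes its result.

```python
import pandas as pd

# Toy occurrence records (assumed column names, made-up values).
toy = pd.DataFrame({
    "paper": ["p1", "p1", "p1", "p2"],
    "section": ["p1:Intro", "p1:Methods", "p1:Intro", "p2:Abstract"],
    "paragraph": ["p1:Intro:1", "p1:Methods:4", "p1:Intro:1", "p2:Abstract:0"],
})

# Count the distinct values of each occurrence factor.
factor_counts_sketch = {
    factor: int(toy[factor].nunique())
    for factor in ("paper", "section", "paragraph")
}

print(factor_counts_sketch)
```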

## Loading the NCIT ontology linking data

To group synonymous entities in the previously extracted table (e.g. `ace2`, `ace-2`, `angiotensin-converting enzyme 2`) and to assign additional semantics to them (e.g. a human-readable definition, taxonomy, etc.), we perform further _linking_ of the entities to terms from the [NCIT ontology](https://ncithesaurus.nci.nih.gov/ncitbrowser/).

To perform such ontology linking, we load some additional (pre-computed) data.

In [113]:
%%time
print("Loading the ontology linking data...")
    
print("\tDecompressing the input data file...")
with zipfile.ZipFile("../data/NCIT_ontology_linking_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")

print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv("../data/NCIT_ontology_linking_3000_papers.csv")

print("\tLoading ontology type mapping...")
with open("../data/NCIT_type_mapping.json", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

Loading the ontology linking data...
	Decompressing the input data file...
	Loading the linking dataframe in memory...
	Loading ontology type mapping...
Done.
CPU times: user 1.04 s, sys: 155 ms, total: 1.2 s
Wall time: 1.23 s


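The ontology linking table contains the following columns:
- `mention`: the entity mention as extracted from the text;
- `concept`: the label of the NCIT concept the mention is linked to;
- `uid`: the URI identifying that concept;
- `definition`: the human-readable definition of the concept;
- `semantic_type`: the semantic type of the concept;
- `taxonomy`: the taxonomy of the concept (a list of tuples referencing other NCIT concepts).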
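
To make the idea of linking concrete, here is a minimal sketch of how mentions could be merged with such a table and grouped by linked concept. It runs on toy data with a placeholder `uid`, and it is an assumption about the general approach, not the actual logic of `link_ontology`.

```python
import pandas as pd

# Toy occurrence table (made-up frequencies).
occurrences = pd.DataFrame({
    "entity": ["ace2", "ace-2", "glucose"],
    "raw_frequency": [7, 3, 5],
})

# Toy linking table; the uid is a placeholder, not a real NCIT identifier.
linking = pd.DataFrame({
    "mention": ["ace2", "ace-2"],
    "concept": ["angiotensin-converting enzyme 2"] * 2,
    "uid": ["NCIT_PLACEHOLDER"] * 2,
})

# Attach the linked concept to every entity; unlinked entities get NaN.
merged = occurrences.merge(
    linking, left_on="entity", right_on="mention", how="left")

# Unlinked entities keep their own name as the grouping key.
merged["concept"] = merged["concept"].fillna(merged["entity"])

# Synonymous mentions collapse into one row per concept.
grouped = merged.groupby("concept").agg(
    aggregated_entities=("entity", list),
    raw_frequency=("raw_frequency", "sum"),
).reset_index()

print(grouped)
```

The left join keeps entities that have no match in the linking table, so nothing is dropped from the occurrence data during linking.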

In [115]:
ontology_linking

| | mention | concept | uid | definition | semantic_type | taxonomy |
|---|---|---|---|---|---|---|
| 0 | protein-d | lithostathine-1-alpha | http://purl.obolibrary.org/obo/NCIT_C131324 | Lithostathine-1-alpha (166 aa, ~19 kDa) is enc... | Amino Acid, Peptide, or Protein | [('http://purl.obolibrary.org/obo/NCIT_C18466'... |
| 1 | pulmonary | pulmonary | http://purl.obolibrary.org/obo/NCIT_C13304 | Relating to the lungs as the intended site of ... | Qualitative Concept | [('http://purl.obolibrary.org/obo/NCIT_C13442'... |
| 2 | host | host | http://purl.obolibrary.org/obo/NCIT_C66819 | An organism that nourishes and supports anothe... | Organism | [('http://purl.obolibrary.org/obo/NCIT_C14376'... |
| 3 | sp-d | surfactant protein d measurement | http://purl.obolibrary.org/obo/NCIT_C111322 | The determination of the amount of surfactant ... | Laboratory Procedure | [('http://purl.obolibrary.org/obo/NCIT_C64430'... |
| 4 | innate response | communication response | http://purl.obolibrary.org/obo/NCIT_C82658 | A statement (either spoken or written) that is... | Social Behavior | [('http://purl.obolibrary.org/obo/NCIT_C16452'... |
| ... | ... | ... | ... | ... | ... | ... |
| 160780 | pparα-knock | pericardial knock | http://purl.obolibrary.org/obo/NCIT_C168024 | An auscultated finding describing an early dia... | Finding | [('http://purl.obolibrary.org/obo/NCIT_C167450... |
| 160781 | capacity.[47 | capacity | http://purl.obolibrary.org/obo/NCIT_C25443 | The amount that can be contained. It can refe... | Quantitative Concept | [('http://purl.obolibrary.org/obo/NCIT_C25447'... |
| 160782 | obesity.[48 | obesity, ctcae | http://purl.obolibrary.org/obo/NCIT_C55334 | A disorder characterized by having a high amou... | Finding | [('http://purl.obolibrary.org/obo/NCIT_C143174... |
| 160783 | mice.[49 | mouse | http://purl.obolibrary.org/obo/NCIT_C14238 | Any of numerous species of small rodents belon... | Mammal | [('http://purl.obolibrary.org/obo/NCIT_C14246'... |


## Running the curation app

In [116]:
default_term_filters = ["glucose"]

In [117]:
curation_app.set_default_terms_to_include(default_term_filters)
curation_app.set_table(curation_input_table.copy())
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))

# Try setting `mode="external"` to open the app in a new tab
curation_app.run(port=8070, mode="inline")

Merging the occurrence data with the ontology linking...
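
The linking callback passed to the app is a one-argument function that closes over the linking data. As a side note, the same binding can be written with `functools.partial`. The sketch below uses a hypothetical stand-in `link_stub` with the same argument order as the `link_ontology` call above; it does not call the real function.

```python
from functools import partial


def link_stub(linking, type_mapping, table):
    """Hypothetical stand-in for `link_ontology`; it only echoes its inputs."""
    return {"linking": linking, "type_mapping": type_mapping, "table": table}


# Callback written as a lambda, as in the cell above.
via_lambda = lambda x: link_stub("LINKING", "TYPES", x)

# Equivalent callback that pre-binds the first two positional arguments.
via_partial = partial(link_stub, "LINKING", "TYPES")

print(via_lambda("TABLE") == via_partial("TABLE"))
```

Both forms produce a callable that takes only the table to be linked, which is the shape the app's callback setter receives in this notebook.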


## Extracting curated data

In [17]:
curated_occurrence_data = curation_app.get_curated_table()

In [18]:
curated_occurrence_data

| entity | aggregated_entities | paragraph | paper | section | raw_types | uid | definition | semantic_type | taxonomy | paper_frequency | entity_type | entity_type_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4e-bp1 | [4e-bp1] | [214924:The Interplay Between Covid-19 And Amp... | [214924] | [214924:The Interplay Between Covid-19 And Amp... | [PROTEIN, PROTEIN] | | | | | 1 | PROTEIN | PROTEIN |
| ace | [ace] | [179426:Role Of Dpp4 Enzyme And Dpp4 Inhibitor... | [214924, 179426, 184360] | [179426:Role Of Ace/Arbs , 214924:The Protecti... | [PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ... | | | | | 3 | PROTEIN | PROTEIN |
| ace2 | [ace2] | [184360:Caption:71, 184360:Combined Therapeuti... | [184360, 211373, 211125, 214924, 160564, 17942... | [214924:The Immune Response To Sars-Cov-2 , 19... | [PROTEIN, PROTEIN, PROTEIN, PROTEIN, PROTEIN, ... | | | | | 7 | PROTEIN | PROTEIN |
| ace2 receptor | [ace2 receptor] | [211373:Introduction:5, 214924:Conclusion:28] | [214924, 211373] | [214924:Conclusion, 211373:Introduction] | [PROTEIN, PROTEIN] | | | | | 2 | PROTEIN | PROTEIN |
| ace2 receptors | [ace2 receptors] | [211125:Discussion:25, 214924:Angiotensin-Conv... | [211125, 214924] | [214924:Angiotensin-Converting Enzyme 2 Expres... | [PROTEIN, PROTEIN] | | | | | 2 | PROTEIN | PROTEIN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viral infection | [viral infection] | [211373:Introduction:5, 184360:Mechanisms Of S... | [214924, 184360, 211373] | [184360:Mechanisms Of Sars-Cov-2 Entry Into Ho... | [PATHWAY, DISEASE, PATHWAY, DISEASE, PATHWAY, ... | http://purl.obolibrary.org/obo/NCIT_C3439 | Any disease caused by a virus. | Disease or Syndrome | [('http://purl.obolibrary.org/obo/NCIT_C26726'... | 3 | DISEASE | DISEASE |
| viral infections | [viral infections] | [214924:The Interplay Between Covid-19 And Amp... | [211125, 214924] | [214924:The Interplay Between Covid-19 And Amp... | [DISEASE, DISEASE, DISEASE] | | | | | 2 | DISEASE | DISEASE |
| viral replication | [viral replication] | [214924:The Immune Response To Sars-Cov-2 ::: ... | [214924] | [214924:The Immune Response To Sars-Cov-2 , 21... | [PATHWAY, PATHWAY] | | | | | 1 | PATHWAY | PATHWAY |
| virus | [virus, viruses] | [184360:Abstract:1, 211125:Discussion:25, 2113... | [184360, 211373, 211125, 214924, 160564, 17942... | [214924:Abstract, 184360:Abstract, 214924:The ... | [ORGANISM, ORGANISM, ORGANISM, ORGANISM, ORGAN... | | | | | 7 | ORGANISM | ORGANISM |


In [91]:
# Convert the occurrence lists to sets to drop duplicate entries
curated_occurrence_data["paper"] = curated_occurrence_data["paper"].apply(set)
curated_occurrence_data["paragraph"] = curated_occurrence_data["paragraph"].apply(set)
curated_occurrence_data["section"] = curated_occurrence_data["section"].apply(set)

In [92]:
curation_meta_data = {
    "factor_counts": factor_counts,
    "nodes_to_keep": curation_app.get_terms_to_include(),
    "n_most_frequent": curation_app.n_most_frequent if curation_app.n_most_frequent else 100
}

In [93]:
curation_meta_data

{'factor_counts': {'paper': 20, 'section': 108, 'paragraph': 286},
 'nodes_to_keep': ['glucose'],
 'n_most_frequent': 500}

In [94]:
# curated_occurrence_data.to_csv("data")
# with open("", "w") as f:
#     json.dump(curation_meta_data, f)

In [95]:
# Glucose_risk_3000_paper_data.pkl
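
Because the curated table holds Python sets, which do not survive a CSV round trip as sets, pickling is one way to persist it, while the metadata dictionary can be written as JSON. The sketch below shows such a round trip on a toy table in a temporary directory. The file names are hypothetical, and the metadata values are copied from the output shown above.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy curated table with a set-valued column, indexed by entity.
table = pd.DataFrame(
    {"paper": [{"p1", "p2"}], "paper_frequency": [2]},
    index=pd.Index(["ace2"], name="entity"),
)
meta = {
    "factor_counts": {"paper": 20, "section": 108, "paragraph": 286},
    "nodes_to_keep": ["glucose"],
    "n_most_frequent": 500,
}

with tempfile.TemporaryDirectory() as tmp:
    table_path = Path(tmp) / "curated_table.pkl"  # hypothetical file name
    meta_path = Path(tmp) / "curation_meta.json"  # hypothetical file name

    # Pickle preserves the set objects stored in the table cells.
    table.to_pickle(table_path)
    # json.dump (not json.dumps) writes directly to an open file.
    with open(meta_path, "w") as f:
        json.dump(meta, f)

    restored_table = pd.read_pickle(table_path)
    with open(meta_path) as f:
        restored_meta = json.load(f)

print(restored_table.loc["ace2", "paper"])
print(restored_meta)
```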