# Nexus-hosted co-occurrence network analysis of CORD-19

This notebook illustrates how the datasets produced by the co-occurrence analysis can be pushed to and accessed from a Blue Brain Nexus project. If you are not familiar with the co-occurrence analysis pipeline provided by the `cord19kg` package, we recommend having a look at the `Co-occurrence analysis tutorial` notebook first.

In [1]:
import getpass
import json
import jwt
import yaml
import zipfile

from pathlib import Path

import dash_cytoscape as cyto
import nexussdk as nexus
import pandas as pd

from kgforge.core import KnowledgeGraphForge

from cord19kg.apps.topic_widgets import (TopicWidget, DataSaverWidget)
from cord19kg.utils import (generate_curation_table,
                            link_ontology,
                            generate_cooccurrence_analysis)
from cord19kg.apps.curation_app import curation_app
from cord19kg.apps.visualization_app import visualization_app

## 1. Setting up a project and creating a `kgforge` configuration

If you have already set up a Nexus project and generated a 'forge' configuration, simply get your access token (__1.1__) and go directly to the step __2. Set up a topic__.

### 1.1. Log in to Nexus and get the access token

The [Nexus web application](https://staging.nexus.ocp.bbp.epfl.ch/) can be used to log in and get a token:

1. Click on the login button in the right corner and follow the instructions.
<img src="../figures/nexus_log_in.png" alt="Drawing" style="width: 1000px;"/>


2. Once logged in, click on the `Copy token` button. The token will be copied to the clipboard.
<img src="../figures/nexus_logged_in.png" alt="Drawing" style="width: 1000px;"/>


Run the cell below and paste the token in the input field generated by the cell.

In [2]:
TOKEN = getpass.getpass()


In the cell below, choose a new project name.

In [3]:
# # TODO: Bring back the sandbox endpoint
# endpoint = "https://sandbox.bluebrainnexus.io/v1"
# org ="tutorialnexus"
# project ="cord19kgExampleProject"  # TODO: Choose a project name

endpoint = "https://staging.nexus.ocp.bbp.epfl.ch/v1"
org = "covid19-kg"
project = "data"
description = "cord19kg save/load example project"

### 1.2. Create a Nexus project programmatically

The cell below creates the project in Nexus. To use a different project name, modify the variable `project` in the cell above first.

In [4]:
nexus.config.set_environment(endpoint)
nexus.config.set_token(TOKEN)
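try:
    nexus.projects.create(org_label=org, project_label=project, description=description)
except Exception as e:
    # Creation can fail, e.g. when the project already exists; report it and carry on
    print(f"Project was not created: {e}")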

### 1.3. Generate a `kgforge` configuration for your project

The following cell automatically generates a configuration file for the project set up above.

In [5]:
config = dict()

neuroshapes_path = "../models/neuroshapes"
if not Path(neuroshapes_path).is_dir():
    ! git clone https://github.com/INCF/neuroshapes.git $neuroshapes_path
    ! cp -R $neuroshapes_path/shapes/neurosciencegraph/datashapes/core/dataset $neuroshapes_path/shapes/neurosciencegraph/commons/
    ! cp -R $neuroshapes_path/shapes/neurosciencegraph/datashapes/core/activity $neuroshapes_path/shapes/neurosciencegraph/commons/
    ! cp -R $neuroshapes_path/shapes/neurosciencegraph/datashapes/core/entity $neuroshapes_path/shapes/neurosciencegraph/commons/
    ! cp -R $neuroshapes_path/shapes/neurosciencegraph/datashapes/core/ontology $neuroshapes_path/shapes/neurosciencegraph/commons/
    ! cp -R $neuroshapes_path/shapes/neurosciencegraph/datashapes/core/person $neuroshapes_path/shapes/neurosciencegraph/commons/

config['Model'] = {
    "name": "RdfModel",
    "origin": "directory",
    "source": f"{neuroshapes_path}/shapes/neurosciencegraph/commons/",
    "context": {
        "iri": "../models/neuroshapes_context.json",
    },
}

config["Store"] = {
    "name": "BlueBrainNexus",
    "endpoint": endpoint,
    "searchendpoints":{
        "sparql":{
            "endpoint":"https://bluebrain.github.io/nexus/vocabulary/defaultSparqlIndex"
        }
    },
    "bucket": f"{org}/{project}",
    "versioned_id_template": "{x.id}?rev={x._store_metadata._rev}",
    "file_resource_mapping": "../config/file-to-resource-mapping.hjson"
}

with open("../config/forge-config.yml", "w") as f:
    yaml.dump(config, f)
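Before writing the configuration to disk, a quick sanity check of the dictionary can catch a forgotten key. The sketch below is only an illustration: the helper `missing_config_keys` is hypothetical, and its checklist mirrors the keys set in this notebook rather than any schema enforced by `kgforge`.

```python
def missing_config_keys(config):
    """Return dotted paths of expected keys absent from a forge config dict."""
    # Hypothetical checklist: it only mirrors keys used in this notebook.
    expected = {
        "Model": ["name", "origin", "source", "context"],
        "Store": ["name", "endpoint", "bucket"],
    }
    missing = []
    for section, keys in expected.items():
        if section not in config:
            missing.append(section)
            continue
        missing.extend(
            f"{section}.{key}" for key in keys if key not in config[section]
        )
    return missing


# Made-up configuration with two Model keys left out on purpose.
demo_config = {
    "Model": {"name": "RdfModel", "origin": "directory"},
    "Store": {
        "name": "BlueBrainNexus",
        "endpoint": "https://example.org/v1",
        "bucket": "org/project",
    },
}
print(missing_config_keys(demo_config))
```

An empty list means every key on the checklist is present; it says nothing about whether the values themselves are valid.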

## 2. Set up a topic

Blue Brain Nexus provides a prepared schema for storing CORD-19 datasets included as a part of this repository. The main entity of the underlying knowledge graph is a `Topic`: the user can create multiple topics of interest, annotate them and push/pull datasets related to the co-occurrence analysis.

First, let us create a 'forge' to manage (create, access and deploy) knowledge within the given Blue Brain Nexus Project.

In [6]:
forge_config_file = "../config/forge-config.yml"
forge = KnowledgeGraphForge(forge_config_file, token=TOKEN, debug=True)

Launch the topic selection widget by executing the following cell. Using the widget, create a new topic using the respective widget tab. Go back to the `Select topic` tab, click on `List all your topics` and select the newly created topic in the dropdown menu. This tab also provides means for updating the selected topic meta-data.

In [1]:
widget = TopicWidget(forge, TOKEN)
widget.display()


Having selected a topic using the selection widget, you can retrieve the `topic_resource_id` variable, which holds the unique identifier of the topic in your Nexus project. Make sure a topic is selected in the widget before running the cell below.

In [8]:
visualization_configs = None # By default, no visualization app configs are provided

topic_resource_id = widget.get_topic_resource_id()
topic_resource_id


## 3. Run the curation/analysis pipeline

### 3.1. Load the input data

In [None]:
occurrence_data = pd.read_csv("../data/Glucose_risk_20_papers.csv")

### 3.2. Transform the input data into aggregated occurrences

In [None]:
curation_input_table, factor_counts = generate_curation_table(occurrence_data)
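To give an intuition of what "aggregated occurrences" means, the plain-Python sketch below collapses raw mentions into one record per entity and counts the distinct papers mentioning it. It is not the `cord19kg` implementation: the lower-casing of entity names, the helper `aggregate_occurrences`, and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_occurrences(rows, factor="paper"):
    """Map each lower-cased entity to the set of factor values mentioning it."""
    occurrences = defaultdict(set)
    for row in rows:
        occurrences[row["entity"].lower()].add(row[factor])
    return dict(occurrences)


# Toy raw mentions: one row per mention of an entity in a paper.
rows = [
    {"entity": "Glucose", "paper": 1},
    {"entity": "glucose", "paper": 2},
    {"entity": "COVID-19", "paper": 1},
    {"entity": "glucose", "paper": 2},
]

papers_by_entity = aggregate_occurrences(rows)
# Number of distinct papers per entity (repeated mentions count once).
paper_frequency = {e: len(p) for e, p in papers_by_entity.items()}
# Total number of distinct papers, in the spirit of `factor_counts`.
toy_factor_counts = {"paper": len({row["paper"] for row in rows})}
print(paper_frequency, toy_factor_counts)
```

The same grouping can be repeated with `factor="paragraph"` when the rows carry a paragraph identifier.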

### 3.3. Load the ontology linking data

In [None]:
%%time
print("Loading the ontology linking data...")
    
print("\tDecompressing the input data file...")
with zipfile.ZipFile("../data/NCIT_ontology_linking_3000_papers.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("../data/")

print("\tLoading the linking dataframe in memory...")
ontology_linking = pd.read_csv("../data/NCIT_ontology_linking_3000_papers.csv")

print("\tLoading ontology type mapping...")
with open("../data/NCIT_type_mapping.json", "rb") as f:
    type_mapping = json.load(f)
print("Done.")

### 3.4. Initialize and run the curation app

(If you skip ontology linking in the curation app, later cells may raise errors, e.g. a `KeyError` when retrieving the curated table.)

In [None]:
curation_app.set_table(curation_input_table.copy())
default_entities_to_keep = ["glucose", "covid-19"]
curation_app.set_default_terms_to_include(default_entities_to_keep)
curation_app.set_ontology_linking_callback(lambda x: link_ontology(ontology_linking, type_mapping, x))
curation_app.run(port=8074, mode="inline")

### 3.5. Retrieve curated data table

In [None]:
# If this cell produces KeyError, make sure you clicked Linked NCIT ontology in the app
curated_occurrence_data = curation_app.get_curated_table()
curation_meta_data = {
    "factor_counts": factor_counts,
    "nodes_to_keep": curation_app.get_terms_to_include(),
    "n_most_frequent": curation_app.n_most_frequent if curation_app.n_most_frequent else 100
}

### 3.6. Generate co-mention networks

In [None]:
from pandas.api.types import is_numeric_dtype, is_string_dtype

In [None]:
is_numeric_dtype(curated_occurrence_data["paper_frequency"])

In [None]:
%%time
type_data = curated_occurrence_data[["entity_type"]].rename(columns={"entity_type": "type"})

graphs, trees = generate_cooccurrence_analysis(
    curated_occurrence_data,  factor_counts,
    n_most_frequent=curation_app.n_most_frequent if curation_app.n_most_frequent else 100,
    type_data=type_data, 
    factors=["paper", "paragraph"],
    keep=curation_app.get_terms_to_include(),
    cores=10)
print("Done.")
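Conceptually, a co-mention network links two entities whenever they occur in the same paper (or paragraph). The sketch below counts shared factor values for each pair of entities. It only illustrates the idea with raw counts and is not the algorithm used by `generate_cooccurrence_analysis`, which may compute further edge statistics. The helper `cooccurrence_edges` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entity_to_factors):
    """Count, for each pair of entities, the factor values they share."""
    edges = Counter()
    # Sorting makes the orientation of each pair deterministic.
    for a, b in combinations(sorted(entity_to_factors), 2):
        shared = entity_to_factors[a] & entity_to_factors[b]
        if shared:
            edges[(a, b)] = len(shared)
    return edges


# Toy mapping from entities to the identifiers of papers mentioning them.
papers = {
    "glucose": {1, 2, 3},
    "covid-19": {2, 3},
    "insulin": {3, 4},
}

edges = cooccurrence_edges(papers)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

Pairs with no shared paper get no edge, which keeps the resulting network sparse.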

### 3.7. Initialize and run the visualization app

In [None]:
visualization_app.set_backend("graph_tool")

In [None]:
visualization_app.add_graph(
    "Paper-based graph", graphs["paper"],
    tree=trees["paper"], default_top_n=100)
visualization_app.add_graph(
    "Paragraph-based graph", graphs["paragraph"],
    tree=trees["paragraph"], default_top_n=100)
visualization_app.set_current_graph("Paragraph-based graph")

paper_data = pd.read_csv("../data/Glucose_risk_3000_paper_meta_data.csv")
paper_data = paper_data.set_index("id")
paper_data.head(3)

def list_papers(paper_data, selected_papers, limit=200):
    """Return meta-data records for at most `limit` of the selected papers."""
    selected_paper_data = paper_data.loc[[int(p) for p in selected_papers]].head(limit)
    return selected_paper_data.to_dict("records")

visualization_app.set_list_papers_callback(lambda x: list_papers(paper_data, x))

In [None]:
def top_n(data_dict, n, smallest=False):
    """Return top `n` keys of the input dictionary by their value."""
    df = pd.DataFrame(dict(data_dict).items(), columns=["id", "value"])
    if smallest:
        df = df.nsmallest(n, columns=["value"])
    else:
        df = df.nlargest(n, columns=["value"])
    return list(df["id"])


def get_aggregated_entities(entity, n):
    if "aggregated_entities" in curated_occurrence_data.columns:
        aggregated = curated_occurrence_data.loc[entity]["aggregated_entities"]
    else:
        aggregated = [entity]
    if curation_input_table is not None:
        df = curation_input_table.set_index("entity")
        if entity in curated_occurrence_data.index:
            freqs = df.loc[aggregated]["paper_frequency"].to_dict()
        else:
            return {}
    else:
        # Fall back to counting mentions in the raw occurrence table
        df = occurrence_data.copy()
        df["entity"] = df["entity"].apply(lambda x: x.lower())
        freqs = df[df["entity"].apply(lambda x: x.lower() in aggregated)].groupby("entity").aggregate(
            lambda x: len(x))["entity_type"].to_dict()
    if len(freqs) == 0:
        return {}
    return {e: freqs[e] for e in top_n(freqs, n)}

visualization_app.set_aggregated_entities_callback(
    lambda x: get_aggregated_entities(x, 10))

definitions = ontology_linking[["concept", "definition"]].groupby(
    "concept").aggregate(lambda x: list(x)[0]).to_dict()["definition"]
visualization_app.set_entity_definitons(definitions)

if visualization_configs is not None:
    visualization_app._current_graph = visualization_configs["current_graph"]
    visualization_app._configure_layout(visualization_configs)

The cell below loads additional graph layouts used in the graph visualization app.

In [None]:
cyto.load_extra_layouts()

In [None]:
visualization_app.run(port=8084, mode="external")

## 4. Save datasets under the selected topic

In [None]:
from bluegraph.backends.graph_tool import GTGraphProcessor
from cord19kg.apps.resources import CORD19_PROP_TYPES

In [None]:
proc = GTGraphProcessor.from_graph_object(visualization_app._graphs['Paper-based graph']["object"])

In [None]:
CORD19_PROP_TYPES["nodes"]["paper_frequency"] = "numeric"

In [None]:
proc.get_pgframe(node_prop_types=CORD19_PROP_TYPES["nodes"], edge_prop_types=CORD19_PROP_TYPES["edges"])

In [None]:
visualization_app._graphs.keys()

In [None]:
exported_graphs = visualization_app.export_graphs(
    ["Paper-based graph", "Paragraph-based graph"], 
)
visualization_configs = visualization_app.get_configs()
edit_history = visualization_app.get_edit_history()

Now we can launch the saver widget and, using the provided form, specify the name and the description of the dataset. To save the dataset, press `Register Dataset`. If you want to upload a new version of an already existing dataset, click on `Tag Dataset`.

In [None]:
saver_widget = DataSaverWidget(
    forge, TOKEN, topic_resource_id,
    occurrence_data,
    curated_occurrence_data,
    curation_meta_data,
    exported_graphs,
    visualization_configs,
    edit_history,
    temp_prefix="../data")

saver_widget.display()

## 5. Load datasets

The topic selection widget seen previously also allows you to load the datasets pushed to Nexus. Having selected a topic, press the `Show datasets for selected topic` button. On the right you will see all the datasets already available for this topic in your Nexus project. Select a dataset and click `Reuse selected dataset`.

In [None]:
widget = TopicWidget(forge, TOKEN)
widget.display()

By executing the cell below, you can retrieve the following variables (if available, otherwise `None`):

- `occurrence_data` table containing raw entity occurrence data;
- `curated_occurrence_data` and `curation_meta_data` table containing curated entity occurrence data and some curation meta-data;
- `loaded_graphs` graph objects saved under the current topic resource;
- `visualization_configs` configurations of the last graph visualization session of the selected topic.

If you want to restart the analysis pipeline starting from the raw entity occurrence data (loaded from Nexus), you can proceed directly to section 3.2.

If you want to restart the analysis pipeline starting from the curated table, you can proceed directly to the section `Generate co-mention networks`.

In [None]:
(
    occurrence_data,
    curated_occurrence_data,
    curation_meta_data,
    loaded_graphs,
    visualization_configs,
    topic_resource_id
) = widget.get_all()

If you want to restart the analysis pipeline from the previously generated graphs, you can run the following cells and proceed directly to the section `Initialize and run the visualization app`. Note that the last session of the visualization app will be restored by default.

In [None]:
loaded_graphs

In [None]:
# Rebuild the dictionaries expected by the visualization cells from the loaded graphs
graphs = {
    "paper": loaded_graphs["Paper-based graph"]["graph"],
    "paragraph": loaded_graphs["Paragraph-based graph"]["graph"],
}
trees = {
    "paper": loaded_graphs["Paper-based graph"]["tree"],
    "paragraph": loaded_graphs["Paragraph-based graph"]["tree"],
}
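The restart options described in this section can be summarized as a small decision rule: resume from the latest stage for which data was loaded. The sketch below is a hypothetical helper, not part of `cord19kg`. It takes three of the variables returned by `widget.get_all()` (any of which may be `None`) and returns the title of the section to continue from.

```python
def restart_section(occurrence_data=None, curated_occurrence_data=None,
                    loaded_graphs=None):
    """Return the title of the latest section the loaded data lets you resume from."""
    # `is not None` is used on purpose: the truth value of a DataFrame is ambiguous.
    if loaded_graphs is not None:
        return "Initialize and run the visualization app"
    if curated_occurrence_data is not None:
        return "Generate co-mention networks"
    if occurrence_data is not None:
        return "Transform the input data into aggregated occurrences"
    return "Load the input data"


# Toy call: only a raw table and a curated table were loaded, no graphs.
print(restart_section(occurrence_data=[1], curated_occurrence_data=[1]))
```

When nothing was loaded for the topic, the rule falls back to the first step of the pipeline.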