# Workflow D: Explain

Workflow D is designed to demonstrate the Translator's ability to
explain a biological phenomenon or observation by filling in the missing
pieces of a possible mechanistic causal chain that connects two concepts
(entities) A and B observed to be associated. Questions of this type are
on the rise in the era of big-data medicine. For instance, why are the
metabolites A and B anti-correlated in a metabolomics study of a cohort?
Why do users of drug A, as seen in a big-data analysis of an EHR data
set, have a lower risk of disease B? With the anticipated spread of
multi-omics profiling of patient material in research and diagnostics,
many unexplained associations will emerge between drug use, blood
metabolites and proteins, clinical manifestations, etc. The goal of a
query is to find an explanatory multi-hop path in a knowledge graph that
may help explain the empirical association between the entities A and B.

Currently, Translator queries assume a particular structure of the
explanation, embodied by the query graph. The task of the Translator is
then to "fill in the blanks" defined in the query graph. In the future,
identifying the best query graph will be part of the Translator's task.

NOTE: Because a query graph structure is still required even if the user
does not know the structure of the explanatory mechanism they are looking
for, the queries used to test the Workflow were reverse-engineered from
known or possible answers (in the form of knowledge graphs). These answers
were previously designed based on SME knowledge or literature, or
extracted from the SPOKE KP via the Neighborhood Explorer tool that gives
GUI access to the [SPOKE KG](https://spoke.rbvi.ucsf.edu/) (see the example
in D.2). The graphs were then encoded as TRAPI JSON, but they may need to be
broken down into step-wise queries for a Translator workflow that can
realistically be executed at this point, since there is no operation yet
for "connect the dots without a specified qgraph structure".

The four initial queries for a first round of testing (starting July
2021) seek to answer the following questions:

## Queries

### D.1. Why do Crohn's disease patients have a higher risk to develop Parkinson's disease?
**PURPOSE:** Two independently established gene-disease relationships
(one-hop) are joined by the common gene that is involved in both
diseases, explaining why patients with one disease (Crohn's disease)
are at risk of the other, apparently independent disease (Parkinson's
disease). Can the Translator, starting with the names of these two
diseases, identify the common gene? This question tests a two-hop query
in which the central node is the unknown gene that connects the two
input diseases.

**BIOLOGY:**

![D.1](images/D.1_parkisons-crohns.png)

*blue = query input, red = unknown, to be returned*

**ANTICIPATED RETURN:** This example is based on established
"ground truth". Its simple structure makes this query essentially a
look-up in the form of an 'AND' search. The return should identify the
gene 'LRRK2' as the genetic basis of both diseases, although other genes
(PARK7, MOD2, NO2, ...) have also been associated with both
diseases.
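
The 'AND' look-up described above can be pictured as the intersection of two one-hop gene sets. The following is only a minimal sketch: the association table is a toy stand-in (the `GENE_*` names are placeholders, not curated data), and `shared_genes` is a hypothetical helper rather than part of `query_helpers`.

```python
# Toy gene-disease associations. Only LRRK2 comes from the text above;
# the GENE_* entries are placeholders, not real associations.
disease_to_genes = {
    "Crohn's disease": {"LRRK2", "GENE_A", "GENE_B"},
    "Parkinson's disease": {"LRRK2", "GENE_C", "GENE_D"},
}


def shared_genes(disease_a, disease_b, associations):
    """Return genes linked to both diseases (the 'AND' of two one-hop look-ups)."""
    return associations[disease_a] & associations[disease_b]


common = shared_genes("Crohn's disease", "Parkinson's disease", disease_to_genes)
print(sorted(common))
```

In the Translator the two one-hop look-ups are answered by knowledge providers rather than a local dictionary, but the logic of the central node is the same.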

First we set up our environment with some helpful functions:

In [None]:
from query_helpers import (
    ARS_URL_DEV,
    KEY_FOUND_IN_AGENTS,
    expand_expected_results,
    find_expected_results,
    get_ars_results,
    get_all_results,
    get_predicates_from_agent_responses,
    open_query,
    print_query,
    print_edge_results,
    print_unified_results,
    submit_to_ars,
    unify_results
)

---

Next, we can open and view the query:

In [None]:
query_d1 = open_query('D.1_parkinsons-crohns.json')
print_query(query_d1)

This query might be represented as follows:  

**(Parkinson's Disease) - is related to - (Any Node) - is related to - (Crohn's Disease)**
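
For orientation, a two-hop query graph of this shape could look roughly like the dictionary below. This is a sketch only, assuming TRAPI 1.1-style plural keys (`ids`, `categories`, `predicates`); the node keys, the MONDO CURIEs, and the predicate are illustrative assumptions and may differ from the contents of the actual JSON file.

```python
import json

# Hypothetical two-hop query graph for D.1. CURIEs and predicates are
# assumptions for illustration, not copied from D.1_parkinsons-crohns.json.
example_query = {
    "message": {
        "query_graph": {
            "nodes": {
                "n00": {"ids": ["MONDO:0005180"], "categories": ["biolink:Disease"]},  # Parkinson's disease (assumed CURIE)
                "n01": {"categories": ["biolink:NamedThing"]},  # the unknown node
                "n02": {"ids": ["MONDO:0005011"], "categories": ["biolink:Disease"]},  # Crohn's disease (assumed CURIE)
            },
            "edges": {
                "e00": {"subject": "n00", "object": "n01", "predicates": ["biolink:related_to"]},
                "e01": {"subject": "n01", "object": "n02", "predicates": ["biolink:related_to"]},
            },
        }
    }
}

# Nodes without "ids" are the blanks the Translator is asked to fill in.
nodes = example_query["message"]["query_graph"]["nodes"]
unbound = [key for key, node in nodes.items() if "ids" not in node]
print(json.dumps(example_query, indent=2))
print("unbound nodes:", unbound)
```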

Now we will send it to the ARS:

In [None]:
ars_pk_for_query_d1 = submit_to_ars(query_d1)

In [None]:
# A cached PK, if necessary
ars_pk_for_query_d1 = 'abb8f85a-a0cc-4083-aad9-6695cb4b638a'

---

Once it looks like the queries have finished, we can start pulling them here:

In [None]:
query_d1_ars_results = get_ars_results(ars_pk_for_query_d1)
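
Deciding when the queries "have finished" usually comes down to polling. The sketch below shows one way this might be done; `wait_until_done`, the injected `fetch_status` callable, and the `"Running"`/`"Done"` status strings are all assumptions for illustration and are not taken from `query_helpers` or the ARS documentation.

```python
import time


def wait_until_done(fetch_status, max_tries=10, delay_seconds=0.0):
    """Poll fetch_status() until no agent reports 'Running'; return the last snapshot."""
    statuses = {}
    for _ in range(max_tries):
        statuses = fetch_status()
        if all(status != "Running" for status in statuses.values()):
            break
        time.sleep(delay_seconds)
    return statuses


# Stand-in for a real status call: two agents, one finishing on the second poll.
snapshots = iter([
    {"agent-a": "Done", "agent-b": "Running"},
    {"agent-a": "Done", "agent-b": "Done"},
])
final_statuses = wait_until_done(lambda: next(snapshots))
print(final_statuses)
```

Passing the status function in as an argument keeps the loop testable without network access; a real version would wrap whatever call reports per-agent status.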

---

We know what we expect here, so we'll set up a configuration to inspect results for LRRK2, PARK7, or MOD2.

We will specify the q_node id of the unspecified 'NamedThing' node from our query graph above.

In [None]:
query_d1_node_of_interest = 'n01'
expected_d1_results = {
    query_d1_node_of_interest: [
        'NCBIGene:120892', # LRRK2
        'NCBIGene:11315', # PARK7
        'NCBIGene:110357', # MOD2
    ]
}

Now we can search for aliases of these identifiers that might appear in a result:

In [None]:
expanded_expected_d1_results = expand_expected_results(expected_d1_results)

Let's see what our possible list looks like now:

In [None]:
for identifier in expanded_expected_d1_results[query_d1_node_of_interest]:
    print(identifier)
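
Conceptually, the alias expansion step maps each expected CURIE to its equivalent identifiers. The sketch below is a simplified illustration: the synonym table is hard-coded (a real workflow would presumably query a normalization service), `expand_with_aliases` is a hypothetical name, and the two LRRK2 aliases are given as examples that should be verified before reuse.

```python
# Hypothetical synonym table; the aliases are illustrative examples for LRRK2.
SYNONYMS = {
    "NCBIGene:120892": ["HGNC:18618", "UniProtKB:Q5S007"],
}


def expand_with_aliases(expected, synonyms):
    """Map each qnode id to its CURIEs plus known aliases, keeping order and dropping duplicates."""
    expanded = {}
    for qnode_id, curies in expected.items():
        seen = []
        for curie in curies:
            for candidate in [curie, *synonyms.get(curie, [])]:
                if candidate not in seen:
                    seen.append(candidate)
        expanded[qnode_id] = seen
    return expanded


toy_expanded = expand_with_aliases({"n01": ["NCBIGene:120892"]}, SYNONYMS)
print(toy_expanded)
```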

---

Now that we have a large set of identifiers that might be used in the results, let's inspect all of the results.

For right now, we'll only retrieve the ones that we've said we're interested in above.

We'll iterate through the results that we have and collect them all in one place.

In [None]:
agent_results = find_expected_results(
    query_d1_ars_results,
    expanded_expected_d1_results,
)
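
The search itself amounts to checking each result's node bindings against the expanded identifier list. The following sketch assumes TRAPI 1.1-style results, where `node_bindings` maps a qnode id to a list of `{"id": ...}` entries; `find_hits` is a hypothetical stand-in and not the implementation of `find_expected_results`.

```python
def find_hits(results, expected):
    """Return the results whose node bindings contain any expected CURIE."""
    hits = []
    for result in results:
        for qnode_id, wanted in expected.items():
            bindings = result.get("node_bindings", {}).get(qnode_id, [])
            bound_ids = {binding["id"] for binding in bindings}
            if bound_ids & set(wanted):
                hits.append(result)
                break  # one match is enough to keep this result
    return hits


# Toy results: the second identifier is a placeholder that should not match.
toy_results = [
    {"node_bindings": {"n01": [{"id": "NCBIGene:120892"}]}},
    {"node_bindings": {"n01": [{"id": "EX:unrelated"}]}},
]
toy_hits = find_hits(toy_results, {"n01": ["NCBIGene:120892"]})
print(len(toy_hits), "matching result(s)")
```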

---

Now we've isolated the results that we decided ahead of time were interesting. Let's see which components returned
those results.

In [None]:
unified_results = unify_results(agent_results, query_d1_node_of_interest)
print_unified_results(unified_results)

**Note that some agents return results as sets, thus the inclusion of other genes above**

---

If we want to take a look at *all* the results, we can do so by running the following:

NOTE: it can take a while to resolve all the nodes

In [None]:
all_d1_results = get_all_results(query_d1_ars_results)

In [None]:
all_d1_unified_results = unify_results(all_d1_results, query_d1_node_of_interest)
print_unified_results(all_d1_unified_results)

---

### D.2 Why do SSRIs (a group of anti-depressants) have a cardio-protective effect?

**PURPOSE:** This query, like D.1., seeks one node that in a
two-hop query will connect the two input nodes: a drug (family) and a
disease (group). The challenge is greater than in D.1. because here the
type of the queried node is not defined, since the very nature of the
mechanistic explanation is not known. Additional challenges come from
the fact that the input nodes are hierarchical higher-level terms (family
of drugs, group of diseases), requiring downward expansion for the
query. Notably, for the bottom node, proper modeling/querying of the
ontology of 'heart disease' will be tested. Similarly, for the top node,
the drug family 'SSRI' may have to be expanded to query specific
compounds of this family. A further challenge is that the mechanism
connecting the drug and the beneficial side-effect may involve more than
one node. The query D.3. offers a simplified query graph.
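
The downward expansion mentioned above can be sketched as a walk over a parent-to-children table. The hierarchy below is a toy example with illustrative labels, not an excerpt from a real disease ontology, and `descendants` is a hypothetical helper.

```python
# Toy hierarchy: parent term -> direct children. Labels are illustrative only.
CHILDREN = {
    "heart disease": ["coronary artery disease", "cardiomyopathy"],
    "coronary artery disease": ["myocardial infarction"],
}


def descendants(term, children):
    """Return every term below `term`, depth-first, without duplicates."""
    found = []
    stack = list(reversed(children.get(term, [])))
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.append(current)
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(children.get(current, [])))
    return found


print(descendants("heart disease", CHILDREN))
```

The same walk would apply to the drug side, expanding a family such as 'SSRI' into its member compounds.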

**BIOLOGY**:

![D.2](images/D.2_ssri-heart-disease.png)

*blue = query input, red = unknown, to be returned*

**ANTICIPATED RETURN:** In this case there is no simple ground truth
for the return. An empirical study has suggested that, akin to the action
of anti-platelet-aggregation drugs, interference with platelet
activation may explain a benefit in reducing the risk of ischemic heart
disease, such as myocardial infarction. Thus, at least one term related to
platelet function that is connected to SSRIs (possibly via serotonin) and
to cardiac disease would represent a useful return. But the ground truth
may be complex, and other mechanisms are likely involved and must be
evaluated individually by SMEs. For instance, an interactive SPOKE
search retrieved other genes that may play a role in coronary artery
disease and myocardial infarction and are affected by specific SSRI drugs,
such as 'LIPA', 'PROS1' or 'TGF-beta' pathway proteins (however, the
directionality with respect to increased/decreased risk of heart disease
remains to be determined). The figure below shows an example of the
results obtained when an interactive approach, such as the
Neighborhood Explorer web GUI, is used to extract information from the
SPOKE KG to address the question posed in this D.2. query:

![D.2 expected return](images/D.2_ssri-heart-disease-return.png)

Due to this challenge, the query graph was also generated from a simpler
solution involving only one hop (D.3.).

We've been oriented to how things work above, so we can work a little less verbosely here:

In [None]:
query_d2 = open_query('D.2_ssri-heart-disease.json')
print_query(query_d2)

This query might be represented as follows:  

**(An approved SSRI) - is related to - (Gene) - is related to - (a cardiac disorder)**

In [None]:
ars_pk_for_query_d2 = submit_to_ars(query_d2)

In [None]:
# A cached PK, if necessary
ars_pk_for_query_d2 = 'd680d232-98d8-4b0b-9767-de7d947ca036'

In [None]:
# if finished, start pulling the results here
query_d2_ars_results = get_ars_results(ars_pk_for_query_d2)

In [None]:
# set the query responses of interest
query_d2_node_of_interest = 'n01'
expected_d2_results = {
    query_d2_node_of_interest: [
        'NCBIGene:3988', # LIPA
        'NCBIGene:5627', # PROS1
        'NCBIGene:7043', # TGFB3
    ]
}


In [None]:
# expand the set of possible answers with synonyms
expanded_expected_d2_results = expand_expected_results(expected_d2_results)

In [None]:
# search the returned results for those expected
agent_results_d2 = find_expected_results(
    query_d2_ars_results,
    expanded_expected_d2_results,
)

In [None]:
# unify and view the results
unified_results_d2 = unify_results(agent_results_d2, query_d2_node_of_interest)
print_unified_results(unified_results_d2)

**At this point, the results are limited by the choice of cardiovascular diseases in the query**

Perhaps a re-examination of that choice can find a middle-ground between the vast 'heart disease'
and the specific children found in this query presently.

Also, there is no ground truth here: we are trying to discover knowledge, which makes a pre-determined
list of expected answers difficult to compile.

Again, we can examine all the results if we wish, but note there are *a lot*

In [None]:
all_d2_results = get_all_results(query_d2_ars_results)
all_d2_unified_results = unify_results(all_d2_results, query_d2_node_of_interest)
print_unified_results(all_d2_unified_results)

---

### D.3: The simplified one-hop version of D.2., by searching for specified nodes and unspecified predicates

This is a simplified version of D.2 where we want to see what predicates directly relate SSRIs to cardiovascular disease.

In particular, we'd expect to see some of the clinical KP data here.

In [None]:
query_d3 = open_query('D.3_ssri-heart-disease-one-hop.json')
print_query(query_d3)

This query might be represented as:  

**(An approved SSRI) - ??? - (Some subclasses of heart disease)**

Where we hope to discover which predicates relate these two defined entities.

In [None]:
# Submit to the ARS
ars_pk_for_query_d3 = submit_to_ars(query_d3)

In [None]:
# A cached PK if necessary:
ars_pk_for_query_d3 = '883013f5-4f35-4374-9c70-6fe0e9c3e760'

In [None]:
query_d3_ars_results = get_ars_results(ars_pk_for_query_d3)

---

Instead of looking at the nodes here, we'll call some functions to inspect the edges. There is no ground truth and the set of unique predicates should be small enough to inspect visually instead of parsing the results looking for specific edge types. 

First, we'll define the edge of interest to us, then we'll inspect the responses.

In [None]:
query_d3_edge_of_interest = 'e00'

In [None]:
edge_results = get_predicates_from_agent_responses(query_d3_ars_results)
print_edge_results(edge_results, query_d3_edge_of_interest)
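
To make the edge inspection concrete, the sketch below tallies the predicates bound to one query edge in a TRAPI-style message. It assumes TRAPI 1.1-style `edge_bindings` and a knowledge graph keyed by edge id; the toy message uses placeholder `EX:` identifiers, and `count_predicates` is a hypothetical helper rather than the implementation of `get_predicates_from_agent_responses`.

```python
from collections import Counter


def count_predicates(message, qedge_id):
    """Count knowledge-graph predicates bound to one query edge across all results."""
    kg_edges = message["knowledge_graph"]["edges"]
    counts = Counter()
    for result in message["results"]:
        for binding in result["edge_bindings"].get(qedge_id, []):
            counts[kg_edges[binding["id"]]["predicate"]] += 1
    return counts


# Toy message with placeholder identifiers.
toy_message = {
    "knowledge_graph": {"edges": {
        "k1": {"subject": "EX:drug1", "object": "EX:disease1", "predicate": "biolink:treats"},
        "k2": {"subject": "EX:drug2", "object": "EX:disease1", "predicate": "biolink:treats"},
        "k3": {"subject": "EX:drug1", "object": "EX:disease2", "predicate": "biolink:correlated_with"},
    }},
    "results": [
        {"edge_bindings": {"e00": [{"id": "k1"}]}},
        {"edge_bindings": {"e00": [{"id": "k2"}]}},
        {"edge_bindings": {"e00": [{"id": "k3"}]}},
    ],
}
predicate_counts = count_predicates(toy_message, "e00")
for predicate, n in predicate_counts.most_common():
    print(predicate, n)
```

Sorting by frequency, as `most_common()` does here, is one simple ordering; ranking by a score would need a score field that this toy message does not model.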

We don't have many details shown here, so it's instructive to look at the ARAX interface.

**TODO**: Perhaps we can sort the predicates by their average score or something.

---

### D.4. Why are serum kynurenine and tryptophan in COVID-19 patients anti-correlated?

**PURPOSE:** An anti-correlation in the blood levels of two metabolites
of the same pathway could indicate that one is the substrate (upstream)
and the other is the product (downstream), and that in the condition
(patient cohort) in which the anti-correlation has been observed the
conversion of the substrate to the product is consistently accelerated. Such
a constellation has been observed for the metabolites Tryptophan and
Kynurenine. We assume here that the user does not know that these two
metabolites form a substrate-product pair of a conversion reaction.
Biochemical reactions have been more of a challenge in data modeling
than regulatory interactions; therefore, this query tests the ability
of the Translator to retrieve reactions using the upstream and
downstream metabolites as query terms. Moreover, enzymatic reactions
are typically represented as hypergraphs, because an enzyme acts on an
edge (the conversion), not on a node. Hypergraphs, however, are not
compatible with most biomedical knowledge graphs, including those used
in the Translator. Some KPs model the reaction simply as an edge, others
as an additional node. Can the query be robust to such variation in the
graph structure of KPs that encode the same content?
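
The two encodings can be contrasted with a small sketch. Here each KP is reduced to a set of directed `(subject, object)` pairs; the node name `reaction:Trp->Kyn` and the helper `linking_nodes` are hypothetical and only illustrate how a look-up might tolerate both shapes.

```python
def linking_nodes(edges, source, target):
    """Return (direct, intermediates): whether source->target exists as an edge,
    and which single nodes bridge source to target."""
    direct = (source, target) in edges
    successors = {obj for subj, obj in edges if subj == source}
    predecessors = {subj for subj, obj in edges if obj == target}
    return direct, sorted(successors & predecessors)


# KP style 1: the reaction is a plain edge between the metabolites.
kp_edge_style = {("tryptophan", "kynurenine")}
# KP style 2: the reaction is its own node between the metabolites.
kp_node_style = {
    ("tryptophan", "reaction:Trp->Kyn"),
    ("reaction:Trp->Kyn", "kynurenine"),
}

print(linking_nodes(kp_edge_style, "tryptophan", "kynurenine"))
print(linking_nodes(kp_node_style, "tryptophan", "kynurenine"))
```

A query that accepts either a direct hit or a bridging node returns the same biological content from both KPs.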

**BIOLOGY:** The conversion of Tryptophan to Kynurenine is the reaction
that we look for. It is catalyzed by the enzyme indoleamine
2,3-dioxygenase (IDO), which is upregulated by IFN-ɣ in COVID-19 patients.

![D.4](images/D.4_tryptophan_kyurenine.png)

*blue = query input, red = unknown, to be returned*

**ANTICIPATED RETURN:** The underlying biology is straightforward and the
ground truth is clear and in textbooks: the biochemical conversion of
'tryptophan' to 'kynurenine'. It is expected that either the reaction
(e.g. implemented by some KPs as a node) is retrieved and connected to
the substrate and product (which are the input terms), or the enzyme
that catalyzes said reaction is retrieved. An optional output for future
testing of "overlays" or local query graph expansion would be to also
retrieve 'IFN-ɣ' as an inducer of the 'IDO' gene, an event that links IDO
elevation to 'COVID-19', which causes a rise of serum IFN-ɣ. This would
provide a biologically meaningful answer to the original question
triggered by the observation.

In [None]:
query_d4 = open_query('D.4_tryptophan-kynurenine.json')
print_query(query_d4)

This query might be represented as follows:  
<pre>
    Tryptophan
        | 
   is related to
        |
    (Protein)
        |
   is related to
        |
(MolecularActivity)
        |
   is related to
        |
(MolecularActivity)
        |
   is related to
        |
    Kynurenine
</pre>

In [None]:
ars_pk_for_query_d4 = submit_to_ars(query_d4)

In [None]:
# A cached PK, if necessary
ars_pk_for_query_d4 = 'f87aa74f-921e-449e-af56-11e089a6b9da'

In [None]:
# if finished, start pulling the results here
query_d4_ars_results = get_ars_results(ars_pk_for_query_d4)

---

This one is a bit more challenging due to the nature of pathway and reaction modeling

We will look at both of the Molecular Activity nodes

In [None]:
# set the query responses of interest
query_d4_node_of_interest_n02 = 'n02'
query_d4_node_of_interest_n03 = 'n03'

d4_expected_curies = [
    'REACT:R-HSA-888614',  # IDO
    'KEGG:1.13.11.52'  # IDO
]

expected_d4_results = {
    query_d4_node_of_interest_n02: d4_expected_curies,
    query_d4_node_of_interest_n03: d4_expected_curies
}

In [None]:
# expand the set of possible answers with synonyms
expanded_expected_d4_results = expand_expected_results(expected_d4_results)

In [None]:
# search the returned results for those expected
agent_results_d4 = find_expected_results(
    query_d4_ars_results,
    expanded_expected_d4_results,
)

---

In [None]:
# unify and view the results, first for n02
unified_results_d4_n02 = unify_results(agent_results_d4, query_d4_node_of_interest_n02)
print_unified_results(unified_results_d4_n02)

In [None]:
# unify and view the results, now for n03
unified_results_d4_n03 = unify_results(agent_results_d4, query_d4_node_of_interest_n03)
print_unified_results(unified_results_d4_n03)

---

### D.6 A patient has very high ferritin levels and a biotech contact says that metformin may lower ferritin. Can we determine why?

**PURPOSE:** This use case stems from a real patient case. Routine
blood work of a cancer patient revealed extremely high serum ferritin
levels. This finding was not explained, since iron metabolism parameters
appeared normal. Hyperferritinemia is a condition known to be associated
with systemic inflammation. It is less well known, but well
documented, that ferritin inhibits T-cell function, and the patient was
about to receive a therapeutic T-cell infusion as immunotherapy. Therefore,
clinicians looked into possibilities to lower ferritin levels. A
pharmaceutical company helped by searching its vast internal database
of drug (side) effects on a variety of clinical and laboratory
parameters observed in drug trials. The return suggested that
metformin, a commonly used drug for Type 2 diabetes, can lower ferritin.
We are interested in possible mechanisms.

The challenge here is formidable: we have no indication of the
nature of the mechanism, i.e. we know neither how many hops the
connecting mechanistic path will contain nor the types of nodes in it.
The mechanism may be a multi-step molecular cascade. Multiple pathways
exist in the literature.

Since ferritin is not actually a single protein but a protein complex, this
query also tests the ability of the Translator to resolve protein
complexes, which are often clinical parameters, into the names of
their subunits, which are what protein databases list.
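
Resolving a complex into its subunits can be pictured as a simple table look-up. The sketch below uses the two ferritin subunit genes named in this section; the table and the helper `resolve_to_subunits` are hypothetical illustrations, not a Translator service.

```python
# Illustrative complex -> subunit table; only ferritin is filled in.
COMPLEX_SUBUNITS = {
    "ferritin": ["FTH1", "FTL"],
}


def resolve_to_subunits(term, table=COMPLEX_SUBUNITS):
    """Return the subunits of a known complex, or the term itself if it is not a complex."""
    return table.get(term.lower(), [term])


print(resolve_to_subunits("Ferritin"))
print(resolve_to_subunits("MTOR"))
```

Falling back to the input term keeps single proteins queryable through the same code path.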

**BIOLOGY:**

From literature research conducted in the traditional manner of human
knowledge synthesis, we arrive at the following hypothesis for a
possible explanation for why metformin might decrease ferritin protein
levels:

![D.6](images/D.6_metformin-ferritin.png)

*blue = query input, red = unknown, to be returned*

**ANTICIPATED RETURN:** Given the complex answer, this query will
rely heavily on the reverse-engineered query graph (see introduction),
and may have to be broken down into individual queries for the purposes
of testing the content availability within the Translator. (Hence, this
query will become more of a lookup-type query, as in Workflow A, than
knowledge generation.) The most likely mechanistic steps involve
protein-protein interactions and gene regulation, extracted from the
literature. The key elements here are the 'mTor' pathway, which is known
to be suppressed by 'metformin' via its target, the 'AMPK' protein.
A less well-known connection is the regulation of the two 'ferritin'
subunit genes, 'FTH1' and 'FTL', by the mTor complex. Thus, this query
also involves resolving two protein complexes into their subunits
(or the genes that encode them).

In [None]:
query_d6 = open_query('D.6_metformin-ferritin.json')
print_query(query_d6)

This query might be represented as follows:  
<pre>
    Metformin
        | 
   is related to
        |
    (Protein)
        |
   is related to
        |
    (Protein)
        |
   is related to
        |
Ferritin (subunits)
</pre>

We'll also submit this query to the dev ARS, which supports asynchronous queries.

We need to give this query some extra time, because the fully unspecified Protein-Protein relationship yields many responses.

In [None]:
ars_pk_for_query_d6 = submit_to_ars(query_d6)
async_ars_pk_for_query_d6 = submit_to_ars(query_d6, ars_url=ARS_URL_DEV)

In [None]:
# A cached PK, almost certainly necessary
ars_pk_for_query_d6 = 'e9549e6b-035d-4476-bfc5-9eeede53ad41'
async_ars_pk_for_query_d6 = 'c197d1a6-5328-4e14-99e2-e338c8a4a548'

In [None]:
# if finished, start pulling the results here
query_d6_ars_results = get_ars_results(ars_pk_for_query_d6)
async_query_d6_ars_results = get_ars_results(async_ars_pk_for_query_d6, ars_url=ARS_URL_DEV)

---

Again, we are looking for multiple identifiers across multiple nodes, so we will set them up here for search

In [None]:
# set the query responses of interest
query_d6_node_of_interest_n01 = 'n01'
query_d6_node_of_interest_n02 = 'n02'  # second Protein node in the query graph

expected_d6_results = {
    query_d6_node_of_interest_n01: [
        'UniProtKB:P54646',  # AAPK
        'UniProtKB:Q9Y478',  # AAKB1
        'NCBIGene:5563'  # AMPK gene
    ],
    query_d6_node_of_interest_n02: [
        'UniProtKB:P42345'  # MTOR
    ]
}

In [None]:
# expand the set of possible answers with synonyms
expanded_expected_d6_results = expand_expected_results(expected_d6_results)

In [None]:
# search the returned results for those expected
agent_results_d6 = find_expected_results(
    query_d6_ars_results,
    expanded_expected_d6_results,
)

async_agent_results_d6 = find_expected_results(
    async_query_d6_ars_results,
    expanded_expected_d6_results,
)

---

In [None]:
# unify and view the results, first for n01
unified_results_d6_n01 = unify_results(agent_results_d6, query_d6_node_of_interest_n01)
async_unified_results_d6_n01 = unify_results(async_agent_results_d6, query_d6_node_of_interest_n01, async_=True)

We have to combine the sync and async results into one unified object

In [None]:
for entity_name, data in async_unified_results_d6_n01.items():
    if entity_name in unified_results_d6_n01:
        sync_agents = unified_results_d6_n01[entity_name][KEY_FOUND_IN_AGENTS]
        async_agents = async_unified_results_d6_n01[entity_name][KEY_FOUND_IN_AGENTS]
        unified_results_d6_n01[entity_name][KEY_FOUND_IN_AGENTS] = sync_agents.union(async_agents)
    else:
        unified_results_d6_n01[entity_name] = data


print_unified_results(unified_results_d6_n01)

Currently, no expected results

---

Now, let's do the same as above for qnode 'n02'

In [None]:
# unify and view the results, now for n02
unified_results_d6_n02 = unify_results(agent_results_d6, query_d6_node_of_interest_n02)
async_unified_results_d6_n02 = unify_results(async_agent_results_d6, query_d6_node_of_interest_n02, async_=True)

We have to combine the sync and async results into one unified object

In [None]:
for entity_name, data in async_unified_results_d6_n02.items():
    if entity_name in unified_results_d6_n02:
        sync_agents = unified_results_d6_n02[entity_name][KEY_FOUND_IN_AGENTS]
        async_agents = async_unified_results_d6_n02[entity_name][KEY_FOUND_IN_AGENTS]
        unified_results_d6_n02[entity_name][KEY_FOUND_IN_AGENTS] = sync_agents.union(async_agents)
    else:
        unified_results_d6_n02[entity_name] = data


print_unified_results(unified_results_d6_n02)

Currently, no expected results
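
The sync/async merge loop appears twice above, once per qnode, and could be factored into a single helper. The sketch below is one possible shape, assuming each unified entry stores its agents as a set under `KEY_FOUND_IN_AGENTS`; the constant is redefined here with a placeholder value only to keep the example self-contained, and `merge_unified_results` is a hypothetical name.

```python
KEY_FOUND_IN_AGENTS = "found_in_agents"  # placeholder; the real constant lives in query_helpers


def merge_unified_results(sync_results, async_results):
    """Combine two unified-result dicts, unioning the agent sets of shared entities."""
    merged = {name: dict(data) for name, data in sync_results.items()}
    for name, data in async_results.items():
        if name in merged:
            merged[name][KEY_FOUND_IN_AGENTS] = (
                merged[name][KEY_FOUND_IN_AGENTS] | data[KEY_FOUND_IN_AGENTS]
            )
        else:
            merged[name] = dict(data)
    return merged


# Toy inputs with made-up agent names.
toy_sync = {"MTOR": {KEY_FOUND_IN_AGENTS: {"agent-a"}}}
toy_async = {
    "MTOR": {KEY_FOUND_IN_AGENTS: {"agent-b"}},
    "PRKAA2": {KEY_FOUND_IN_AGENTS: {"agent-b"}},
}
toy_merged = merge_unified_results(toy_sync, toy_async)
print(toy_merged)
```

Unlike the in-place loops above, this version returns a new dictionary and leaves both inputs unchanged, which makes it safe to re-run a cell.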