_____
***

# PheKnowLator: Phenotype Knowledge Translator  
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Current Release:** **[`v2.0.0`](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**

<br>

**Objective:** Knowledge graphs provide meaningful ways to integrate heterogeneous biological data and represent complex biological mechanisms. This work seeks to explore the utility of incorporating existing knowledge of molecular mechanisms from ontologies, publicly available data, and the literature to construct a biomedical knowledge graph that models unbiased molecular mechanisms of human disease.

<img src="https://user-images.githubusercontent.com/8030363/73126021-bd1b7d00-3f6a-11ea-8e6b-d54d902b7051.png" width="1200" height="1200">

**Build:** PheKnowLator is built using two types of data (the table below lists the nodes for build `v2.0.0` for each data type):
1. `Class Data` (i.e. data from ontologies shown in _yellow boxes_ in the figure. Data in _green boxes_ is also ontology data, but specifically for edge data)  
2. `Instance Data` (i.e. sources of linked open data, data from experiments, and/or data from the literature, shown in _purple boxes_ in the figure)


**Class Data** | **Instance Data**  
:--: | :--:  
cells/cell lines | complexes
chemicals/catalysts/cofactors | genes   
diseases | pathways 
gobps/goccs/gomfs | reactions 
phenotypes | rna    
proteins | variants 
tissues/fluids | ---
vaccines | ---

***
***

### Notebook Purpose
**Wiki Page:** **[`Release v2.0.0`](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**

<br>

**Purpose:** This notebook serves as a `main` file for the PheKnowLator project. It walks through the program step by step and generates the knowledge graph shown above.


**How to Select a Build Type:**  
The knowledge graph build algorithm has been designed to run from three different stages of development: `full`, `partial`, and `post-closure`.

Build Type | Description | Use Cases  
:--: | -- | --   
`full` | Runs all build steps in the algorithm | You want to build a KG and will not use a reasoner  
`partial` | Runs all of the build steps in the algorithm through adding the `class-class`, `instance-class`, `class-instance`, and `instance-instance` edges<br><br> If `node_data` is provided, it will not be added to the KG, but instead used to filter the edges such that only those edges with valid node metadata are added to the KG<br><br> Node metadata can always be added to a `partial` built KG by running the build as `post-closure` | You want to build a KG and plan to run a reasoner over it<br><br> You want to build a KG, but do not want to include node metadata, filter OWL semantics, or generate triple lists  
`post-closure` | Adds node metadata (if `node_data='yes'`), determines whether OWL semantics should be filtered, creates and writes triple lists, and writes node metadata | You have run the `partial` build, run a reasoner over the result, and now want to complete the algorithm<br><br> You want to use the algorithm to process metadata and OWL semantics for an externally built KG

<br>
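
The use cases in the table can be read as a simple decision rule. The sketch below is only an illustration of that rule: `choose_build_type` and its two arguments are hypothetical names, not part of the PheKnowLator code base.

```python
def choose_build_type(will_run_reasoner: bool, graph_already_built: bool = False) -> str:
    """Return 'full', 'partial', or 'post-closure' following the build type table.

    Hypothetical helper for illustration only; not a PheKnowLator function.
    """
    if graph_already_built:
        # a partial (or externally built) KG exists, so only metadata,
        # OWL-semantics filtering, and triple lists remain to be handled
        return 'post-closure'
    if will_run_reasoner:
        # stop after the edges are added so a reasoner can be run next
        return 'partial'
    # no reasoner planned: run every build step in one pass
    return 'full'


print(choose_build_type(will_run_reasoner=False))
print(choose_build_type(will_run_reasoner=True))
print(choose_build_type(will_run_reasoner=True, graph_already_built=True))
```

The returned string is the kind of value passed as the `build` argument when the knowledge graph is constructed later in this notebook.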

**Assumptions:**
- Please download the [OWLTools](https://github.com/owlcollab/owltools) library to the `./resources/lib` directory    
- Make sure that the following input documents have been constructed (see the [Dependencies Wiki](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies) for more information):  
  - [`resource_info.txt`](https://github.com/callahantiff/PheKnowLator/blob/master/resources/resource_info.txt)
  - [`ontology_source_list.txt`](https://github.com/callahantiff/PheKnowLator/blob/master/resources/ontology_source_list.txt)
  - [`class_source_list.txt`](https://github.com/callahantiff/PheKnowLator/blob/master/resources/class_source_list.txt)
  - [`instance_source_list.txt`](https://github.com/callahantiff/PheKnowLator/blob/master/resources/instance_source_list.txt)   

- Prepare [relations](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#relations-data) and [node metadata](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#node-metadata) files prior to running the scripts.
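
Before running the rest of the notebook, it may help to confirm that the four input documents listed above are in place. The following is a minimal sketch using only the standard library; `find_missing_inputs` is a hypothetical helper, and it assumes the files live directly under `./resources`, as the repository links above suggest.

```python
from pathlib import Path

# input documents named in the assumptions above
REQUIRED_INPUTS = [
    'resource_info.txt',
    'ontology_source_list.txt',
    'class_source_list.txt',
    'instance_source_list.txt',
]


def find_missing_inputs(resource_dir='./resources'):
    """Return the required input files that are absent from resource_dir."""
    base = Path(resource_dir)
    return [name for name in REQUIRED_INPUTS if not (base / name).is_file()]


missing = find_missing_inputs()
if missing:
    print('Missing input files:', ', '.join(missing))
else:
    print('All input files found.')
```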

<br>

***
### Table of Contents
***
* [Data Sources](#data-source)  
* [Create Edge Lists](#create-edges)  
* [Build Knowledge Graph](#build-kg)  
* [Generate Mechanism Embeddings](#generate-embeddings)  
* [t-SNE Plot](#tsne-plot)  

***

<br>

**_NOTE._** _There is also a command line version of this file (`main.py`). Please see the [README](https://github.com/callahantiff/PheKnowLator/blob/master/README.md) for more information._
***

_____
### Set-Up Environment

In [7]:
# import needed libraries
import glob
import json
import pandas

# from rdflib import Graph

# import scripts
import scripts.python.DataSources
import scripts.python.EdgeDictionary
from scripts.python.KnowledgeGraph import *

***
## Download Data Sources <a class="anchor" id="data-source"></a>
***

**Wiki Page:** **[`Dependencies`](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies)**  

**Purpose:**
This portion of the algorithm downloads the data sources in three steps:
1. [Download Ontology Data](#download-ontology-data)  
2. [Download Class Data](#download-class-data)  
3. [Download Instance Data](#download-instance-data)  

<br>

**Input Files:**
- [`resource_info.txt`](https://www.dropbox.com/s/8pu6bbxpqrui8rq/resource_info.txt?dl=1)  
- [`class_source_list.txt`](https://www.dropbox.com/s/hzwbewkfsydoll2/class_source_list.txt?dl=1)  
- [`instance_source_list.txt`](https://www.dropbox.com/s/qp4cic5n4rn5t05/instance_source_list.txt?dl=1)  
- [`ontology_source_list.txt`](https://www.dropbox.com/s/bmmaavyd499d7px/ontology_source_list.txt?dl=1)  

<br>

**Assumptions:**  
All sources used to construct our knowledge graph need to be preprocessed and ready to download prior to running this code. All mapping, filtering, and label data have been generated. For assistance with creating these datasets, see the [`Data_Preparation.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/Data_Preparation.ipynb) Jupyter Notebook.


***
***
### Download Ontology Data  <a class="anchor" id="download-ontology-data"></a>
Ontologies are the core data structure used when building PheKnowLator. In the figure above, ontology data are shown in yellow boxes.

In [None]:
ont = scripts.python.DataSources.OntData('resources/ontology_source_list.txt')
ont.downloads_data_from_url('imports')
# ont.writes_source_metadata_locally()

<br>

***

### Download Class Data   <a class="anchor" id="download-class-data"></a>
In PheKnowLator, classes are nodes that originate from ontologies (shown in the figure above in yellow boxes). Class data sources are Linked Data sources that are used to create edges in the knowledge graph and thus can connect to other class data sources or to an instance data source.

In [None]:
cls = scripts.python.DataSources.Data('resources/class_source_list.txt')
cls.downloads_data_from_url('')
# cls.writes_source_metadata_locally()

<br>

***
***

### Download Instance Data  <a class="anchor" id="download-instance-data"></a>
In PheKnowLator, instances are nodes that originate from a Linked Data source and not an ontology (shown in the figure above in purple boxes). Unlike class data, instance data sources are only used to connect other instances.

In [None]:
inst = scripts.python.DataSources.Data('resources/instance_source_list.txt')
inst.downloads_data_from_url('')
# inst.writes_source_metadata_locally()

<br>

***
***
***

## Create Edge Lists <a class="anchor" id="create-edges"></a>

**Wiki Page:** **[`Data Sources`](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources)**

**Purpose:** The code below will take the dictionaries of processed data described above and use them to create edge lists for each of the edge types specified in the [`resource_info.txt`](https://github.com/callahantiff/PheKnowLator/blob/master/resources/resource_info.txt). Each edge list will be appended to a nested dictionary (see details below).

<br>

**Assumptions:** All code in the [`Data_Preparation.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/Data_Preparation.ipynb) Jupyter Notebook has been run. That notebook contains the code needed to generate all mapping, filtering, and label data.

<br>

**Output:** [`Master_Edge_List_Dict.json`](https://www.dropbox.com/s/4j0vrwx26dh8hd1/Master_Edge_List_Dict.json?dl=1)

**Master Edge Dictionary**  
Below is an example of what the `Master Edge Dictionary` contains for each processed resource:  
```python
master_edges = {'chemical-disease'  :
                {'source_labels'    : ';MESH_;',
                 'data_type'        : 'class-class',
                 'edge_relation'    : 'RO_0002606',
                 'uri'              : ('http://purl.obolibrary.org/obo/',
                                       'http://purl.obolibrary.org/obo/'),
                 'row_splitter'     : '#',
                 'col_splitter'     : 't',
                 'column_indicies'  : '1;4',
                 'identifier_maps'  : '0:./MESH_CHEBI_MAP.txt;1:disease-dbxref-map',
                 'evidence_criteria': "5;!=;' ",
                 'filter_criteria'  : 'None',
                 'edge_list'        : ['...']}}
```
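
Several values in the example entry pack more than one item into a single delimited string. The sketch below shows one way such strings could be unpacked. It assumes, based only on the shape of the example, that `column_indicies` is a `;`-separated list of integers and that `identifier_maps` is a `;`-separated list of `position:source` pairs; both helper functions are hypothetical and are not taken from the PheKnowLator code.

```python
def parse_column_indices(value):
    """Split a string such as '1;4' into a list of integers."""
    return [int(idx) for idx in value.split(';')]


def parse_identifier_maps(value):
    """Split '0:mapA;1:mapB' into {0: 'mapA', 1: 'mapB'}.

    Treating the string 'None' as "no maps" is an assumption, mirroring
    how 'filter_criteria' is written in the example entry.
    """
    if value == 'None':
        return {}
    maps = {}
    for entry in value.split(';'):
        position, source = entry.split(':', 1)  # split on the first colon only
        maps[int(position)] = source
    return maps


example = {'column_indicies': '1;4',
           'identifier_maps': '0:./MESH_CHEBI_MAP.txt;1:disease-dbxref-map'}

print(parse_column_indices(example['column_indicies']))
print(parse_identifier_maps(example['identifier_maps']))
```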

***

In [None]:
# combine data sources
combined_edges = dict(dict(cls.data_files, **inst.data_files), **ont.data_files)

# initialize edge dictionary class
master_edges = scripts.python.EdgeDictionary.EdgeList(combined_edges, './resources/resource_info.txt')
master_edges.creates_knowledge_graph_edges()
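
The nested `dict(...)` call above merges three dictionaries into one; if a key appears in more than one of them, the value from the later dictionary wins. The small illustration below uses made-up keys and paths as stand-ins, since the real contents of `data_files` are not shown in this notebook.

```python
# hypothetical stand-ins for cls.data_files, inst.data_files, and ont.data_files
class_files = {'chemical-disease': './resources/edge_data/chemical-disease.txt'}
instance_files = {'gene-gene': './resources/edge_data/gene-gene.txt'}
ontology_files = {'disease': './resources/ontologies/doid.owl'}

# the form used in the notebook cell
nested = dict(dict(class_files, **instance_files), **ontology_files)

# an equivalent spelling using dictionary unpacking
unpacked = {**class_files, **instance_files, **ontology_files}

print(nested == unpacked)
print(sorted(nested))
```

Both spellings require string keys only in the nested form; the unpacking form also accepts non-string keys.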

**Preview Edge Data**

In [11]:
# print basic stats on each resource
edge_data = [[key, ', '.join(master_edges.source_info[key]['edge_list'][0]), len(master_edges.source_info[key]['edge_list'])]
             for key in master_edges.source_info.keys()]

# convert dict to pandas df for nice printing
df = pandas.DataFrame(edge_data, columns = ['edge', 'example_edge', 'edge_list_count']) 
df                

index | edge | example_edge | edge_list_count  
:--: | -- | -- | --:  
0 | chemical-complex | CHEBI_24505, R-HSA-1006173 | 5589
1 | chemical-disease | CHEBI_81395, DOID_13677 | 1417127
2 | chemical-gene | CHEBI_28667, 348 | 17398
3 | chemical-gobp | CHEBI_81395, GO_0097190 | 1175928
4 | chemical-gocc | CHEBI_44975, GO_0005623 | 109392
5 | chemical-gomf | CHEBI_44975, GO_0043168 | 100656
6 | chemical-pathway | CHEBI_10033, R-HSA-1430728 | 27022
7 | chemical-phenotype | CHEBI_81395, HP_0002511 | 712747
8 | chemical-protein | CHEBI_81395, PR_000002307 | 97232
9 | chemical-reaction | CHEBI_10033, R-HSA-159790 | 24111


<br>


***
***
***

## Build Knowledge Graph  <a class="anchor" id="build-kg"></a>
**Wiki Pages:**  
- **[`KG-Construction`](https://github.com/callahantiff/PheKnowLator/wiki/KG-Construction)**  
- **[`relations-data`](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#relations-data)**  
- **[`node-metadata`](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#node-metadata)**

<br>

**Assumptions:** If relation and/or node metadata will be used to build the knowledge graph, they must already have been generated and added to the `./resources/relations_data` and `./resources/node_metadata` directories, respectively. Please see the [`Data_Preparation.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/Data_Preparation.ipynb) Jupyter Notebook for details on how to create these data.  

<br>

**Input:** 
- [`Master_Edge_List_Dict.json`](https://www.dropbox.com/s/4j0vrwx26dh8hd1/Master_Edge_List_Dict.json?dl=1)  
- Directory of relations data sources - see [here](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#relations-data) for more information
- Directory of node data sources - see [here](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#node-metadata) for more information

<br>

**Output:** Please see [`Release v2.0.0 Wiki`](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0) for access to all generated output files.   
- `Knowledge Graph`  
- `Class Instance URI-UUID Map`  
- `Triple List - Integer`  
- `Triple List - Identifier`  
- `Node Integer-Identifier Map`  
- `Node Attribute Data`  

<br>

The process to build the knowledge graph is somewhat time consuming and can be broken into the following steps:  

1. Merge Ontologies  

2. Create `Class-Instance`, `Instance-Instance` and `Class-Class` Edges 

3. Add Inverse Relations and Node Data. See the [Dependencies](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies) Wiki page for details on how to construct these resources.  

4. Deductively Close Graph using [ELK](https://protegewiki.stanford.edu/wiki/ELK) Reasoner   
   - We recommend running this through the command line using [OWLTools](https://github.com/owlcollab/owltools). The code snippet below assumes that it has been downloaded to `./resources/lib` and demonstrates how the [ELK](https://www.cs.ox.ac.uk/isg/tools/ELK/) reasoner can be called and run over a knowledge graph saved in an `.owl` file.    

   ```bash
   ./resources/lib/owltools ./resources/knowledge_graphs/[input_graph_filename.owl] --reasoner elk --run-reasoner --assert-implied -o resources/knowledge_graphs/[output_graph_name]_Closed_ELK.owl
   ```

5. Filter OWL Semantics 
 - Removes all edges containing entities that are needed to support OWL semantics but are not biologically meaningful. For example:
    - REMOVE - an edge that supports OWL semantics but is not biologically meaningful:
      - subject: `http://purl.obolibrary.org/obo/CLO_0037294`
      - predicate: `owl:AnnotationProperty`
      - object: `rdf:about="http://purl.obolibrary.org/obo/CLO_0037294"`

    - KEEP - biologically meaningful edges:
      - subject: `http://purl.obolibrary.org/obo/CHEBI_16130`
      - predicate: `http://purl.obolibrary.org/obo/RO_0002606`
      - object: `http://purl.obolibrary.org/obo/HP_0000832`


6. Save Edge List    
 - Two versions of the knowledge graph's edges are saved as lists of triples: (1) one with node labels and (2) one with integer labels (the input requirement for the embedding algorithms).
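
As a rough illustration of the two triple-list forms, the sketch below assigns an integer to each distinct identifier in first-seen order and rewrites the triples with those integers. This is a simplified stand-in, not the project's implementation: `index_triples` is a hypothetical helper, and the actual integer assignment used by PheKnowLator may differ. The identifiers are taken from examples elsewhere in this notebook.

```python
def index_triples(triples):
    """Map each distinct identifier to an integer, in first-seen order.

    Returns the integer-encoded triples and the identifier-to-integer map.
    """
    node_to_int = {}
    integer_triples = []
    for triple in triples:
        encoded = []
        for identifier in triple:
            # assign the next unused integer the first time an identifier appears
            encoded.append(node_to_int.setdefault(identifier, len(node_to_int)))
        integer_triples.append(tuple(encoded))
    return integer_triples, node_to_int


identifier_triples = [
    ('CHEBI_16130', 'RO_0002606', 'HP_0000832'),
    ('CHEBI_81395', 'RO_0002606', 'DOID_13677'),
]

integer_triples, node_map = index_triples(identifier_triples)
print(integer_triples)
print(node_map)
```

Keeping the identifier-to-integer map alongside the integer triples makes it possible to translate embedding results back to node identifiers.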

<br>

**‼ IMPORTANT NOTE:** The file containing the merged ontologies is quite large and can take up to 30 minutes to read in.  This is not a limitation of the code directly, but rather a function of the [`RDFLib Library`](https://github.com/RDFLib). While there are other ways to read in this data, we maintain reliance on this library as it is the most user-friendly for non-RDF users.  

***


In [None]:
kg = scripts.python.KnowledgeGraph.KGBuilder(build='partial',
                                             kg_version='v2.0.0',
                                             write_location='./resources/knowledge_graphs',
                                             edge_data='./resources/Master_Edge_List_Dict.json',
                                             node_data='yes',
                                             relations_data='yes',
                                             remove_owl_semantics='no')

kg.construct_knowledge_graph()
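
Because the cell above runs a `partial` build, the next step would typically be closing the graph with ELK before finishing with a `post-closure` build. The sketch below assembles the OWLTools command from step 4 in Python, using only the flags shown in that command. `build_elk_command` is a hypothetical helper and the file names are placeholders; the call that actually runs the reasoner is left commented out because it requires OWLTools in `./resources/lib`.

```python
import subprocess


def build_elk_command(input_owl, output_owl, owltools='./resources/lib/owltools'):
    """Assemble the OWLTools argument list for running the ELK reasoner."""
    return [owltools, input_owl,
            '--reasoner', 'elk',
            '--run-reasoner', '--assert-implied',
            '-o', output_owl]


# placeholder file names; substitute the graph written by the partial build
command = build_elk_command('./resources/knowledge_graphs/input_graph.owl',
                            './resources/knowledge_graphs/input_graph_Closed_ELK.owl')
print(' '.join(command))

# uncomment to run the reasoner (requires OWLTools in ./resources/lib)
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.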

<br>

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```