
# PheKnowLator  

***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  

**Purpose:** This notebook serves as a `main` file for the PheKnowLator project. It walks through the program step by step and generates the knowledge graph shown below. Several steps must be completed before this notebook can be run successfully:
- Make sure that the [master resource file](https://www.dropbox.com/s/4qu4ev96h5q6bdx/resource_info.txt?dl=0) is complete.  
- Make sure that the files specifying the [ontologies](https://www.dropbox.com/s/bmmaavyd499d7px/ontology_source_list.txt?dl=0), [class](https://www.dropbox.com/s/cpxrj1to55syhzi/class_source_list.txt?dl=0), and [instance](https://www.dropbox.com/s/71b07b1g86roz3d/instance_source_list.txt?dl=0) data sources are complete.  
- Run `.scripts/NCBO_rest_api.py` to obtain mappings between ontology identifiers.

***
### Table of Contents
* [Data Sources](#data-source)  
* [Create Edge Lists](#create-edges)  
* [Build Knowledge Graph](#build-kg)  
* [Generate Mechanism Embeddings](#generate-embeddings)  
* [t-SNE Plot](#tsne-plot)  

***
***

<img src="https://user-images.githubusercontent.com/8030363/63988760-fe326280-ca9a-11e9-836d-ff1284d3fe2c.png" width="900" height="600">


**_NOTE._** _There is also a script version of this file (`./main.py`). Please see the [README](https://github.com/callahantiff/PheKnowLator/blob/master/README.md) for more information._


**Install Dependencies**

In [1]:
# import needed libraries
import json
import pandas as pd

from rdflib import Graph

# import scripts
import scripts.python.DataSources
import scripts.python.EdgeDictionary
from scripts.python.KnowledgeGraph import *


***
***

## Download Data Sources <a class="anchor" id="data-source"></a>

First, we need to download all sources that will be used to construct our knowledge graph. This portion of the script has three steps:  
1. Download Ontology Data  
2. Download Class Data  
3. Download Instance Data


**Note.** When running the cells below for the class and instance data sources, you will be prompted to enter a file name for each source. Please use the following pattern:
> **Pattern:** edge_source_datatype_source_type.txt  
**Example:** gene-pathway_string_instance_evidence.txt
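
As a minimal sketch of the naming pattern above, the snippet below assembles a file name from its parts. It assumes the pattern has four underscore-separated fields (edge, source, data type, source type), which is how the example reads. `build_source_filename` is a hypothetical helper written for illustration and is not part of PheKnowLator.

```python
import re


# Hypothetical helper (not part of PheKnowLator): assembles a file name that
# follows the pattern edge_source_datatype_source_type.txt.
def build_source_filename(edge, source, datatype, source_type):
    parts = [edge, source, datatype, source_type]
    for part in parts:
        # underscores separate the fields, so they cannot appear inside one
        if not part or '_' in part or re.search(r'\s', part):
            raise ValueError('invalid field: {!r}'.format(part))
    return '_'.join(parts) + '.txt'


print(build_source_filename('gene-pathway', 'string', 'instance', 'evidence'))
# prints: gene-pathway_string_instance_evidence.txt
```

Rejecting underscores inside a field keeps the name unambiguous to split later, which is why the edge itself is written with a hyphen (`gene-pathway`).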


***
**Step 1: Download Ontology Data**


In [None]:
ont = scripts.python.DataSources.OntData('resources/ontology_source_list.txt')
ont.parses_resource_file()
ont.downloads_data_from_url('imports')
    

In [None]:
ont.generates_source_metadata()
ont.writes_source_metadata_locally()


***
**Step 2: Download Class Data**


In [None]:
cls = scripts.python.DataSources.Data('resources/class_source_list.txt')
cls.parses_resource_file()
cls.downloads_data_from_url('')
    

In [None]:
cls.generates_source_metadata()
cls.writes_source_metadata_locally()


***
**Step 3: Download Instance Data**


In [None]:
inst = scripts.python.DataSources.Data('resources/instance_source_list.txt')
inst.parses_resource_file()
inst.downloads_data_from_url('')
    

In [None]:
inst.generates_source_metadata()
inst.writes_source_metadata_locally()


***
***

## Create Edge Lists <a class="anchor" id="create-edges"></a>

In order to create the edge lists, you will need to do the following (assuming you don't want to use the [data from the current release](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)):
 - Run the `python/NCBO_rest_api.py` script. Note that this script requires you to create an account with [BioPortal](https://bioportal.bioontology.org/) and place your API key in `resources/bioportal_api_key.txt`. 
   - When run from the command line, you will be asked to enter two ontologies (`source1=MESH`, `source2=CHEBI`).
   - This will generate a text file that contains mappings between identifiers from the two specified ontologies and write the results to `resources/data_maps/source1_source2_map.txt`.  

The code below takes the dictionaries of processed data described above and uses them to create edge lists for each of the edge types specified in [`resource_info.txt`](https://github.com/callahantiff/PheKnowLator/blob/development/resources/resource_info.txt). Each edge list is appended to a nested dictionary (see details below).

**Master Edge Dictionary**
Below is an example of what the `Master Edge Dictionary` contains for each processed resource:  

```python

master_edges = {'chemical-disease'  :
                {'source_labels'    : ';MESH_;',
                 'data_type'        : 'class-class',
                 'edge_relation'    : 'RO_0002606',
                 'uri'              : ('http://purl.obolibrary.org/obo/',
                                       'http://purl.obolibrary.org/obo/'),
                 'row_splitter'     : '#',
                 'col_splitter'     : 't',
                 'column_indicies'  : '1;4',
                 'identifier_maps'  : '0:./MESH_CHEBI_MAP.txt;1:disease-dbxref-map',
                 'evidence_criteria': "5;!=;' ",
                 'filter_criteria'  : 'None',
                 'edge_list'        : []}}
```
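
Several values in the dictionary above pack more than one item into a single delimited string. The sketch below shows one way such strings could be unpacked. Both helpers are hypothetical and written only for illustration, so PheKnowLator's own parsing may differ. Treating the string `'None'` as "no maps" is an assumption borrowed from the `filter_criteria` entry.

```python
# Hypothetical helpers for illustration; PheKnowLator's own parsing may differ.
def parse_column_indices(value):
    """Turn a string such as '1;4' into a list of integer column positions."""
    return [int(idx) for idx in value.split(';')]


def parse_identifier_maps(value):
    """Turn '0:path;1:name' into {0: 'path', 1: 'name'}."""
    if value == 'None':  # assumption: 'None' marks an unused field
        return {}
    maps = {}
    for entry in value.split(';'):
        # split on the first colon only, so the target may contain colons
        position, target = entry.split(':', 1)
        maps[int(position)] = target
    return maps


print(parse_column_indices('1;4'))
# prints: [1, 4]
print(parse_identifier_maps('0:./MESH_CHEBI_MAP.txt;1:disease-dbxref-map'))
# prints: {0: './MESH_CHEBI_MAP.txt', 1: 'disease-dbxref-map'}
```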


In [None]:
# combine data sources
combined_edges = dict(dict(cls.data_files, **inst.data_files), **ont.data_files)


# initialize edge dictionary class
master_edges = scripts.python.EdgeDictionary.EdgeList(combined_edges,
                                                      './resources/resource_info.txt')
master_edges.creates_knowledge_graph_edges()
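
The nested `dict(a, **b)` call used to combine the data sources can be hard to read. The sketch below uses small stand-in dictionaries with made-up keys and file names to show what the merge does. When the same key appears in more than one source, the value from the later dictionary wins.

```python
# Toy dictionaries standing in for cls.data_files, inst.data_files, and
# ont.data_files; the keys and file names are invented for illustration.
class_files = {'a': 'class_a.txt', 'shared': 'class_shared.txt'}
instance_files = {'b': 'instance_b.txt', 'shared': 'instance_shared.txt'}
ontology_files = {'c': 'ontology_c.owl'}

# the form used in the cell above
nested = dict(dict(class_files, **instance_files), **ontology_files)

# an equivalent form using dictionary unpacking
unpacked = {**class_files, **instance_files, **ontology_files}

print(nested == unpacked)
# prints: True
print(nested['shared'])
# prints: instance_shared.txt
```

Note that `dict(a, **b)` only works when every key in `b` is a string, whereas the unpacking form accepts any hashable key.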


In [2]:
# save nested edges locally
# with open('./resources/kg_master_edge_dictionary.json', 'w') as filepath:
#     json.dump(master_edges.source_info, filepath)

# load existing master_edge dictionary
with open('./resources/kg_master_edge_dictionary.json', 'r') as filepath:
    master_edges = json.load(filepath)


In [5]:
# print basic stats on each resource
edge_data = [
    [key,
     ', '.join(master_edges[key]['edge_list'][0]),
     len(master_edges[key]['edge_list'])]
    
    for key in master_edges.keys()]

# convert dict to pandas df for nice printing
df = pd.DataFrame(edge_data, columns = ['edge', 'example_edge', 'edge_list_count']) 
df
                  

| | edge | example_edge | edge_list_count |
| --: | :-- | :-- | --: |
| 0 | chemical-gene | CHEBI_81395, 596 | 400288 |
| 1 | chemical-go | CHEBI_81395, GO_0006309 | 41604 |
| 2 | chemical-pathway | CHEBI_10033, R-HSA-1430728 | 24327 |
| 3 | chemical-disease | CHEBI_81395, DOID_13677 | 2328410 |
| 4 | disease-gobp | DOID_9667, GO_0009257 | 1223624 |
| 5 | disease-gomf | DOID_3021, GO_0033885 | 138427 |
| 6 | disease-gocc | DOID_12849, GO_1990393 | 85166 |
| 7 | disease-phenotype | DOID_0110720, HP_0000006 | 120556 |
| 8 | gene-gene | 381, 6712 | 411868 |
| 9 | gene-gobp | 11163, GO_0005829 | 404713 |



***
***

## Build Knowledge Graph  <a class="anchor" id="build-kg"></a>

Once the edge lists have been created, we can start building our knowledge graph. Since this process is somewhat time consuming, we break it into the following steps:  

1. Merge Ontologies   
2. Create Class-Instance Edges    
3. Create Instance-Instance and Class-Class Edges    
4. Remove Disjointness Axioms   
 - [Disjointness axioms](https://go-protege-tutorial.readthedocs.io/en/latest/Disjointness.html) are created in order to restrict ontology classes such that no individual can be a member of more than one class. We remove these types of axioms from our graph before closing our ontology because these axioms often result in unexpected errors and inconsistencies during reasoning.    
5. Deductively Close Graph using [ELK](https://protegewiki.stanford.edu/wiki/ELK) Reasoner   
6. Save Edge List    
 - Two versions of the knowledge graph's edges are saved as lists of triples: (1) one with node labels and (2) one with integer labels (the input format required by the embedding algorithms).
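
To make the last step concrete, here is a minimal sketch of how labeled triples could be re-encoded with integers. It is not the project's `maps_knowledge_graph_triples_list_labels_to_ints` function. The helper name, the placeholder labels, and the choice to number labels in order of first appearance are all assumptions made for illustration.

```python
# Hypothetical sketch; the project's own label-to-integer mapping may differ.
def map_triples_to_integers(triples):
    """Assign each distinct label an integer in order of first appearance."""
    label_to_int = {}
    integer_triples = []
    for triple in triples:
        encoded = []
        for label in triple:
            if label not in label_to_int:
                label_to_int[label] = len(label_to_int)
            encoded.append(label_to_int[label])
        integer_triples.append(tuple(encoded))
    return integer_triples, label_to_int


# placeholder labels, not real knowledge graph identifiers
labeled = [('node_a', 'rel_1', 'node_b'),
           ('node_b', 'rel_1', 'node_c'),
           ('node_a', 'rel_2', 'node_c')]

int_triples, mapping = map_triples_to_integers(labeled)
print(int_triples)
# prints: [(0, 1, 2), (2, 1, 3), (0, 4, 3)]
print(mapping)
# prints: {'node_a': 0, 'rel_1': 1, 'node_b': 2, 'node_c': 3, 'rel_2': 4}
```

Keeping the mapping alongside the integer triples is what allows embeddings, which are indexed by integer, to be translated back to node labels later.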


**We will build two knowledge graphs:**  
1. [`Biological Knowledge Graph`](#bio-kg): includes all edges shown in the figure above  
2. [`Translational Knowledge Graph`](#transl-kg): includes the same edges as the `Biological Knowledge Graph` plus an extra set of edges for mappings between clinical concepts (EHR clinical terminologies) and biomedical concepts.  

<br>

**‼ IMPORTANT NOTE:** The file containing the merged ontologies is quite large and can take up to 30 minutes to read in. This is not a limitation of the code itself, but rather a consequence of using the [`RDFLib Library`](https://github.com/RDFLib). While there are other ways to read in this data, we continue to rely on this library because it is the most user-friendly option for non-RDF users.  

***


***
### Biological Knowledge Graph<a class="anchor" id="bio-kg"></a>



**Merge Ontologies**


In [2]:
# set-up vars for file manipulation
ont_files = './resources/ontologies/'
merged_onts = ont_files + 'merged_ontologies/'


In [None]:
# create list of ontologies to merge
ontology_list = [
    [ont_files + 'go_with_imports.owl', ont_files + 'hp_with_imports.owl', merged_onts + 'hp_go_merged.owl'],
    [merged_onts + 'hp_go_merged.owl', ont_files + 'chebi_lite.owl', merged_onts + 'hp_go_chebi_merged.owl'],
    [merged_onts + 'hp_go_chebi_merged.owl', ont_files + 'vo_with_imports.owl', merged_onts +
     'PheKnowVec_v2_MergedOntologies_BioKG.owl']
                 ]

# merge ontologies
merges_ontologies(ontology_list)
  

In [None]:
# read in file and count edges (n=2277644 edges)
print(len(Graph().parse(merged_onts + 'hp_go_merged.owl')))


In [None]:
# read in file and count edges (n=3524109 edges)
print(len(Graph().parse(merged_onts + 'hp_go_chebi_merged.owl')))

In [None]:
# read in file and count edges (n=3,606,052 edges)
print(len(Graph().parse(merged_onts + 'PheKnowVec_v2_MergedOntologies_BioKG.owl')))



**Create Edge Lists**


In [3]:
# set file path
ont_kg = './resources/knowledge_graphs/'

# separate edge lists by data type
class_edges = {}
other_edges = {}

for edge in master_edges.keys():
    if master_edges[edge]['data_type'] == 'class-instance' or master_edges[edge]['data_type'] == 'instance-class':
        class_edges[edge] = master_edges[edge]
    else:
        other_edges[edge] = master_edges[edge]


_Create Class-Instance Edges_   


In [None]:
# read in merged knowledge graph
graph = Graph().parse(merged_onts + 'PheKnowVec_v2_MergedOntologies_BioKG.owl')
# len(set([x[0::2] for x in list(graph)]))

# add edges (n=15,555,878 edges in 09:52 minutes (17.31s/it))
class_kg = creates_knowledge_graph_edges(class_edges,
                                         'class',
                                         graph,
                                         ont_kg + 'PheKnowVec_v2_ClassInstancesOnly_BioKG.owl',
                                         kg_class_iri_map={})


_Create Instance-Instance and Class-Class Edges_

In [None]:
# create instance-instance and class-class edges (n= 16,094,427 edges in 00:26 minutes (4.88 s/it))
class_inst_kg = creates_knowledge_graph_edges(other_edges,
                                              'other',
                                              class_kg,
                                              ont_kg + 'PheKnowVec_v2_Full_BioKG.owl')


_Remove Disjointness Axioms_

In [8]:
# identified and removed 333 disjointness axioms
removes_disointness_axioms(class_inst_kg, ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness.owl')



KG ended with 7524543, 16094081 nodes/edges
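
The removal step can be pictured as filtering out every triple whose predicate states a disjointness. The sketch below works on plain tuples rather than an RDFLib graph and only handles `owl:disjointWith`. OWL can also express disjointness through constructs such as `owl:AllDisjointClasses`, which this sketch ignores. The helper name and the `ex:` placeholders are assumptions for illustration and do not describe the project's implementation.

```python
OWL_DISJOINT_WITH = 'http://www.w3.org/2002/07/owl#disjointWith'
RDFS_SUBCLASS_OF = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'


# Hypothetical sketch; handles only owl:disjointWith statements.
def drop_disjointness(triples):
    """Return the triples without disjointWith statements, plus the number removed."""
    kept = [triple for triple in triples if triple[1] != OWL_DISJOINT_WITH]
    return kept, len(triples) - len(kept)


# placeholder classes, not real ontology terms
triples = [('ex:A', RDFS_SUBCLASS_OF, 'ex:B'),
           ('ex:A', OWL_DISJOINT_WITH, 'ex:C'),
           ('ex:C', RDFS_SUBCLASS_OF, 'ex:B')]

kept, removed = drop_disjointness(triples)
print(removed)
# prints: 1
print(len(kept))
# prints: 2
```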



_Deductively Close Graph_

In [None]:
# closes_knowledge_graph(ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness.owl',
#                        'elk',
#                        ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness_Closed.owl')

graph = Graph().parse(ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness_Closed.owl')
edge_count = len(graph)
node_count = len(set([str(node) for edge in list(graph) for node in edge[0::2]]))
print('\nKG ended with {n}, {e} nodes/edges\n'.format(n=node_count, e=edge_count))
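
The node count in the cell above depends on the slice `edge[0::2]`, which keeps the subject and object of each triple and skips the predicate. The sketch below repeats the same counting logic on a few placeholder tuples so the effect is easy to check by hand.

```python
# placeholder triples of the form (subject, predicate, object)
triples = [('a', 'p', 'b'),
           ('b', 'p', 'c'),
           ('c', 'q', 'd')]

edge_count = len(triples)

# edge[0::2] selects positions 0 and 2, so predicates are not counted as nodes
node_count = len({node for edge in triples for node in edge[0::2]})

print('{n} nodes, {e} edges'.format(n=node_count, e=edge_count))
# prints: 4 nodes, 3 edges
```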


_Save Edge List_

In [None]:
# save version with labels
creates_knowledge_graph_triples_list(ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness_Closed.owl',
             ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness_ELK_Closed_Triples_Labels.txt',
             ont_kg + 'PheKnowVec_v2_ClassInstancesOnly_BioKG_ClassInstanceMap.json')

# save version with integers
maps_knowledge_graph_triples_list_labels_to_ints(ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness_ELK_Closed_Triples_Labels.txt',
                                                 ont_kg + 'PheKnowVec_v2_Full_BioKG_NoDisjointness_ELK_Closed_Triples_Integers.txt',
                                                 ont_kg + 'PheKnowVec_v2_Full_BioKG_Triples_Integer_Labels_Map.json')


***
### Translational Knowledge Graph<a class="anchor" id="transl-kg"></a>  


_Merge Ontologies_  

In order to enable the integration of clinical concepts (e.g. diagnosis codes, medications, lab tests) with biological concepts from ontologies, we have created and verified mappings between over 40,000 clinical terminology codes and ontology concepts. The next step is to convert the mappings into new ontology concepts.

The table below provides some examples of how clinical terminology concepts were mapped to ontology concepts:

Domain | Terminology | Clinical Terminology Concept | Ontology | Mapping
:--: |:--: | -- | :--: | --
Condition | SNOMED CT | Acute osteomyelitis of hand | HP | **AND**(Acute, Osteomyelitis, Abnormality of the hand)
Condition | SNOMED CT | Acute osteomyelitis of hand | DOID | Osteomyelitis
 | | |
Drug | RxNorm | Ibandronic Acid | ChEBI | 1,1-bis(phosphonic acid)
Drug |RxNorm | Acellular Pertussis Vaccine, Inactivated | VO | **AND**(acellular vaccine, bordetella pertussis vaccine, inactivated vaccine)
Drug |RxNorm | Agkistrodon Piscivorus Antivenin | ChEBI | antidote
Drug |RxNorm | Abatacept | PRO, ChEBI | **AND**(Inhibitor, T-lymphocyte activation antigen CD80 (human))
 | | |
Measurement | LOINC | 1-Methylhistidine/Creatinine [Ratio] in Urine (Below reference range) | HP | Decreased urinary 1-methylhistidine
Measurement | LOINC | 1-Methylhistidine/Creatinine [Ratio] in Urine (Normal) | HP | **NOT**(Abnormal urinary methylhistidine concentration)
Measurement | LOINC | 1-Methylhistidine/Creatinine [Ratio] in Urine (Above reference range) | HP | Increased urinary 1-methylhistidine

<br>


In [None]:
graph = Graph().parse(merged_onts + 'PheKnowVec_v2_MergedOntologies_BioKG_merged.owl')


_Create Class-Instance Edges_

In [None]:
kg_transl = instance_edges(graph,
                          master_edges.source_info,
                          ont_file_processed + 'PheKnowVec_v2_Full_TranslKG.owl')


_Remove Disjointness Axioms_

In [None]:
remove_disointness(kg_transl,
                   ont_file_processed + 'PheKnowVec_v2_Full_KG_NoDisjointness_TranslKG.owl')


_Deductively Close Graph_

In [None]:
close_graph(ont_file_processed + 'PheKnowVec_v2_Full_KG_NoDisjointness_TranslKG.owl',
            'elk',
            ont_file_processed + 'PheKnowVec_v2_Full_KG_NoDisjointness_Closed_TranslKG.owl')


_Save Edge List_

In [None]:
# save version with labels
kg_edge_list(ont_file_processed + 'PheKnowVec_v2_Full_KG_NoDisjointness_Closed_TranslKG.owl',
             ont_file_processed + 'PheKnowVec_v2_Full_KG_NoDisjointness_ELK_Closed_Triples_Label_TranslKG.txt',
             ont_file_processed + 'PheKnowVec_v2_MergedOntologies_KG_ClassInstanceMap.json')

# save version with integers
node_mapper(ont_file_processed + 'PheKnowVec_v2_Full_KG_NoDisjointness_ELK_Closed_Triples_Label_TranslKG.txt',
            ont_file_processed + 'PheKnowVec_v2_Full_KG_NoDisjointness_ELK_Closed_Triples_Ints_TranslKG.txt',
            ont_file_processed + 'PheKnowVec_v2_MergedOntologies_KG_ClassInstanceMap_Ints_TranslKG.json')


***
***

## Generate Mechanism Embeddings <a class="anchor" id="generate-embeddings"></a>



***
***

## t-SNE Plot <a class="anchor" id="tsne-plot"></a>
To visualize the relationships between the embedded nodes, we first need to reduce the dimensions of the molecular mechanism embeddings. To do this, we use [t-SNE](). Once reduced, we can visualize the results in a scatter plot.


In [None]:
# CONVERT EMBEDDINGS INTO PANDAS DATAFRAME
embedding_file = open('./resources/graphs/out.txt').readlines()[1:]
node_labels = json.loads(open('./resources/graphs/KG_triples_ints_map.json').read())
node_label_dict = {val: key for (key, val) in node_labels.items()}
node_list = ['HP', 'CHEBI', 'VO', 'DOID', 'R-HSA', 'GO', 'geneid']
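
The cell above skips the first line of the embedding file and inverts the label map so that integers point back to labels. The sketch below shows how the two pieces could fit together. It assumes a word2vec-style text layout (a header line, then one line per node holding an integer id followed by the vector values), which is a guess based on the skipped first line. The helper and the sample values are hypothetical and are not the project's `processes_integer_labeled_embeddings` function.

```python
# Hypothetical sketch; assumes each line is '<integer id> <v1> <v2> ...'.
def label_embeddings(lines, int_to_label):
    """Pair each embedding vector with the label of its integer node id."""
    labeled = []
    for line in lines:
        fields = line.split()
        # ids are read from the file as text, so cast before the lookup
        node_id = int(fields[0])
        vector = [float(value) for value in fields[1:]]
        labeled.append((int_to_label[node_id], vector))
    return labeled


# made-up stand-ins for the JSON label map and the embedding file contents
node_labels = {'node_a': 0, 'node_b': 1}
node_label_dict = {val: key for (key, val) in node_labels.items()}
embedding_lines = ['2 3\n', '0 0.5 -1.0 2.0\n', '1 0.25 0.0 1.5\n'][1:]

print(label_embeddings(embedding_lines, node_label_dict))
# prints: [('node_a', [0.5, -1.0, 2.0]), ('node_b', [0.25, 0.0, 1.5])]
```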


In [None]:
# import the libraries needed for dimensionality reduction
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# convert embeddings to df (must run before the reduction, which reads this data frame)
embeddings = processes_integer_labeled_embeddings(embedding_file, node_label_dict, node_list)
embedding_data = pd.DataFrame(embeddings, columns=['node_type', 'node', 'embedding'])
embedding_data.to_pickle('./resources/embeddings/PheKnowLator_embedding_dataframe')

# DIMENSIONALITY REDUCTION
x_reduced = TruncatedSVD(n_components=50, random_state=1).fit_transform(list(embedding_data['embedding']))
x_embedded = TSNE(n_components=2, random_state=1, verbose=True, perplexity=50.0).fit_transform(x_reduced)
np.save('./resources/graphs/ALL_KG_res_tsne', x_embedded)


In [None]:
# import the patch helper used to build the legend
import matplotlib.patches as mpatches

# set up colors and legend labels
colors = {'Diseases': '#009EFA',
          'Chemicals': 'indigo',
          'GO Concepts': '#F79862',
          'Genes': '#4fb783',
          'Pathways': 'orchid',
          'Phenotypes': '#A3B14B'}

names = {key: key for key in colors.keys()}

# create data frame to use for plotting data by node type
df = pd.DataFrame(dict(x=x_embedded[:, 0], y=x_embedded[:, 1], group=list(embedding_data['node_type'])))
groups = df.groupby('group')

# create legend arguments
dis = mpatches.Patch(color='#009EFA', label='Diseases')
drg = mpatches.Patch(color='indigo', label='Chemicals')
go = mpatches.Patch(color='#F79862', label='GO Concepts')
ge = mpatches.Patch(color='#4fb783', label='Genes')
pat = mpatches.Patch(color='orchid', label='Pathways')
phe = mpatches.Patch(color='#A3B14B', label='Phenotypes')

legend_args = [[dis, drg, go, ge, pat, phe], 14, 'lower center', 3]
title = 't-SNE: Biological Knowledge Graph'

plots_embeddings(colors, names, groups, legend_args, 16, 100, title, 20)
