# <p style="text-align: center;">Construction of an RNA-based Knowledge Graph</p>

***
***

**Authors:** [ECavalleri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@unimi.it), [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)

**GitHub Repositories:** [RNA-KG](https://github.com/AnacletoLAB/RNA-KG), [PheKnowLator](https://github.com/callahantiff/PheKnowLator/)  
<!--- **Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)** --->
  
<br>  

**Objective:** Knowledge Graphs (KGs) provide meaningful ways to integrate heterogeneous biological data and represent complex biological mechanisms. 
[PheKnowLator](https://github.com/callahantiff/PheKnowLator) is a system that supports the user in acquiring biomedical entities from different kinds of data sources and representing them as a biomedical knowledge graph that models unbiased molecular mechanisms as prescribed by domain ontologies. This notebook shows how to use PheKnowLator to generate an RNA-based KG.

***
***

## Notebook Purpose
**Wiki Page:** **[`Current PheKnowLator release`](https://github.com/callahantiff/PheKnowLator/wiki/)**

<br>

**Purpose:** This notebook serves as the `main` file for constructing RNA-KG with the PheKnowLator project. The script walks through the program step by step and generates the RNA-based knowledge graph. Please see the [PheKnowLator README](https://github.com/callahantiff/PheKnowLator/blob/master/README.md) for more information.

<br>

**Assumptions:**
1. Make sure that the following input documents have been constructed (see the [PheKnowLator Dependencies Wiki](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies) for more information):  
  - [`resource_info.txt`](https://github.com/AnacletoLAB/RNA-KG/blob/master/resources/resource_info.txt)
  - [`ontology_source_list.txt`](https://github.com/AnacletoLAB/RNA-KG/blob/master/resources/ontology_source_list.txt)
  - [`edge_source_list.txt`](https://github.com/AnacletoLAB/RNA-KG/blob/main/resources/edge_source_list.txt)   

2. Download the [`RELATIONS_LABELS.txt`](https://github.com/AnacletoLAB/RNA-KG/blob/main/resources/relations_data/RELATIONS_LABELS.txt) relations label file prior to running the script. Please see the [PheKnowLator Dependencies Wiki](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#relations-data) for more information on how it is generated.

3. Select a knowledge graph construction method (i.e. `instance-based` or `subclass-based`).  

<br>

***
### Table of Contents
***
The three primary steps involved in building a knowledge graph are `Downloading Data Sources`, `Creating Edge Lists`, and `Building the knowledge graphs`.

* [Data Sources](#data-source)  
* [Create Edge Lists](#create-edges)  
* [Build Knowledge Graph](#build-kg)  

***

***

_____
### Set-Up Environment

In [1]:
# import needed libraries
import glob
import json
import pandas
import ray
import time

# import module
from pkt_kg.downloads import OntData, LinkedData
from pkt_kg.edge_list import CreatesEdgeList
from pkt_kg.knowledge_graph import FullBuild, PartialBuild, PostClosureBuild

***
## Download Data Sources <a class="anchor" id="data-source"></a>

**Wiki Page:** **[`Dependencies`](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies)**  

**Purpose:**
This portion of the algorithm downloads the data sources in two steps:
1. [Download Ontology Data](#download-ontology-data)  
2. [Download Edge Data](#download-edge-data)   

<br>

**Input Files:**
  - [`resource_info.txt`](https://github.com/emanuelecavalleri/testRNA-KG/blob/master/resources/resource_info.txt)
  - [`ontology_source_list.txt`](https://github.com/emanuelecavalleri/testRNA-KG/blob/master/resources/ontology_source_list.txt)
  - [`edge_source_list.txt`](https://github.com/emanuelecavalleri/testRNA-KG/blob/master/resources/edge_source_list.txt)

<br>

**Assumption:** All sources used to construct our knowledge graph need to be preprocessed and ready to download prior to running this code. All mapping, filtering, and label data have been generated prior to this step. For assistance with creating these datasets, see the [`RNA-KG_Preparation.ipynb`](https://github.com/emanuelecavalleri/testRNA-KG/blob/main/notebooks/RNA-KG_Preparation.ipynb), [`inteRNA-KG_Preparation.ipynb`](https://github.com/AnacletoLAB/RNA-KG/blob/main/notebooks/inteRNA-KG_Preparation.ipynb), and [`non-ontological_RNAentities.ipynb`](https://github.com/AnacletoLAB/RNA-KG/blob/main/notebooks/non-ontological_RNAentities.ipynb) Jupyter Notebooks.

***
***
### Ontology Data  <a class="anchor" id="download-ontology-data"></a>
Ontologies are the core data structure used when building a knowledge graph with PheKnowLator.

In [2]:
ont = OntData('resources/ontology_source_list.txt', 'resources/resource_info.txt')

ont.parses_resource_file()

In [3]:
ont.data_files = ont.source_list
ont.generates_source_metadata()


*** Generating Metadata ***



100%|██████████| 11/11 [00:00<00:00, 22473.13it/s]
100%|██████████| 11/11 [00:00<00:00, 29035.46it/s]


In [4]:
ont._writes_source_metadata_locally()

100%|██████████| 11/11 [00:00<00:00, 4117.94it/s]


In [5]:
ont.resource_info

['variant-miRNA2566|;;|entity-entity|RO_0002566|https://www.ncbi.nlm.nih.gov/snp/|https://www.mirbase.org/mature/|t|0;1|None|None|None',
 'variant-premiRNA2566|;;|entity-entity|RO_0002566|https://www.ncbi.nlm.nih.gov/snp/|https://www.mirbase.org/hairpin/|t|0;1|None|None|None',
 'variant-gene2566|;;|entity-entity|RO_0002566|https://www.ncbi.nlm.nih.gov/snp/|http://www.ncbi.nlm.nih.gov/gene/|t|0;1|None|None|None',
 'variant-disease2566|;;|entity-class|RO_0002566|https://www.ncbi.nlm.nih.gov/snp/|http://purl.obolibrary.org/obo/|t|0;1|None|None|None',
 'variant-TF2566_1|;;|entity-class|RO_0002566|https://www.ncbi.nlm.nih.gov/snp/|http://purl.obolibrary.org/obo/|t|0;1|None|None|None',
 'variant-TF2566_2|;;|entity-class|RO_0002566|https://www.ncbi.nlm.nih.gov/snp/|http://purl.obolibrary.org/obo/|t|0;1|None|None|None',
 'variant-TF2566_3|;;|entity-class|RO_0002566|https://www.ncbi.nlm.nih.gov/snp/|http://purl.obolibrary.org/obo/|t|0;1|None|None|None',
 'variant-TF2566_4|;;|entity-class|RO_000

<br>

### Edge Data   <a class="anchor" id="download-edge-data"></a>
In PheKnowLator, classes are nodes that originate from ontologies. Class data sources are Linked Data sources that are used to create edges in the knowledge graph and thus can connect to other class data sources. Sometimes we want to add data that is not already part of an ontology. In that case, the data can be added either as an `instance` of an existing ontology class or as its own `owl:class`, by adding it to the knowledge graph as a `subclass` of an existing `owl:class`.
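
The difference between the two options can be illustrated with plain subject-predicate-object tuples. This is a simplified sketch and not PheKnowLator's actual serialization: the helper `attach_entity`, the example miRNA identifier, and the choice of `SO_0000276` as the target class are assumptions made only for illustration.

```python
# Simplified illustration of the two ways a non-ontology entity can be attached
# to an existing ontology class. All names below are hypothetical.
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
RDFS_SUBCLASS_OF = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'
OWL_CLASS = 'http://www.w3.org/2002/07/owl#Class'
OWL_NAMED_INDIVIDUAL = 'http://www.w3.org/2002/07/owl#NamedIndividual'


def attach_entity(entity, ontology_class, approach):
    """Return the triples that link `entity` to `ontology_class`."""
    if approach == 'instance':
        # the entity becomes an individual (instance) of the ontology class
        return [(entity, RDF_TYPE, OWL_NAMED_INDIVIDUAL),
                (entity, RDF_TYPE, ontology_class)]
    if approach == 'subclass':
        # the entity becomes its own owl:Class, placed under the ontology class
        return [(entity, RDF_TYPE, OWL_CLASS),
                (entity, RDFS_SUBCLASS_OF, ontology_class)]
    raise ValueError("approach must be 'instance' or 'subclass'")


mirna = 'https://www.mirbase.org/mature/MIMAT0000062'
so_class = 'http://purl.obolibrary.org/obo/SO_0000276'  # assumed target class

instance_triples = attach_entity(mirna, so_class, 'instance')
subclass_triples = attach_entity(mirna, so_class, 'subclass')
for triple in instance_triples + subclass_triples:
    print(triple)
```

In both cases the entity ends up connected to the same ontology class; only the predicate, and therefore the OWL semantics of the new node, changes.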

In [6]:
edges = LinkedData('resources/edge_source_list.txt', 'resources/resource_info.txt')

edges.parses_resource_file()

In [7]:
edges.data_files = edges.source_list
edges.generates_source_metadata()


*** Generating Metadata ***



100%|██████████| 322/322 [00:00<00:00, 11391.99it/s]
100%|██████████| 322/322 [00:00<00:00, 105208.84it/s]


In [8]:
edges._writes_source_metadata_locally()

100%|██████████| 322/322 [00:00<00:00, 88370.47it/s]


In [9]:
edges.source_list.keys()

dict_keys(['variant-miRNA2566', 'variant-premiRNA2566', 'variant-gene2566', 'variant-disease2566', 'variant-TF2566_1', 'variant-TF2566_2', 'variant-TF2566_3', 'variant-TF2566_4', 'miRNA-mRNA11002_1', 'miRNA-mRNA11002_2', 'miRNA-mRNA11002_3', 'miRNA-mRNA2434', 'lncRNA-mRNA2434_1', 'lncRNA-mRNA2434_2', 'lncRNA-mRNA2434_3', 'lncRNA-mRNA2434_4', 'premiRNA-miRNA2203', 'premiRNA-premiRNAHOM0', 'premiRNA-AtoI2434', 'miRNA-gene11002', 'miRNA-gene2449', 'miRNA-gene11016', 'miRNA-gene2450', 'miRNA-gene11013', 'premiRNA-mRNA11002', 'miRNA-pseudogene11002', 'miRNA-epiModclass2434', 'premiRNA-epiModclass2434', 'miRNA-epiMod2434', 'premiRNA-epiMod2434', 'premiRNA-disease3302', 'miRNA-disease3302', 'miRNA-lncRNA2434', 'premiRNA-lncRNA2434', 'miRNA-tsRNA2434', 'tsRNA-disease3302', 'tRF-tRNA_tRFdb2202', 'tRF-tRNA_MINTbase2202', 'tRF-cellLine1025', 'tRNA-modification2434', 'tRNA-aminoacid2436', 'snoRNA-gene2434', 'snoRNA-miRNA2434', 'snoRNA-premiRNA2434', 'snoRNA-snoRNA2434', 'snoRNA-snRNA2434', 'snoRNA

***

## Create Edge Lists <a class="anchor" id="create-edges"></a>

**Wiki Page:** **[`Data Sources`](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources)**

<br>

**Purpose:** The code below will take the dictionaries of processed data described above and use them to create edge lists for each of the edge types specified in the [`resource_info.txt`](https://github.com/emanuelecavalleri/testRNA-KG/blob/master/resources/resource_info.txt). Each edge list will be appended to a nested dictionary (see details below).

<br>

**Assumptions:**  
1. All `ontology` and `edge` data sources have been downloaded.   

2. All code in the [`RNA-KG_Preparation.ipynb`](https://github.com/AnacletoLAB/RNA-KG/blob/main/notebooks/RNA-KG_Preparation.ipynb), [`inteRNA-KG_Preparation.ipynb`](https://github.com/AnacletoLAB/RNA-KG/blob/main/notebooks/inteRNA-KG_Preparation.ipynb), and [`non-ontological_RNAentities.ipynb`](https://github.com/AnacletoLAB/RNA-KG/blob/main/notebooks/non-ontological_RNAentities.ipynb) Jupyter Notebooks has been run. These Notebooks contain code needed to generate all mapping, filtering, and label data.

<br>

**Output:** `Master_Edge_List_Dict.json`. Below is an example of what the `Master Edge Dictionary` contains for each processed resource:  
```python
master_edges = {'miRNA-disease'  :
                {'source_labels'    : ';;',
                 'data_type'        : 'entity-class',
                 'edge_relation'    : 'RO_0003302',
                 'uri'              : ('https://www.mirbase.org/mature/',
                                       'http://purl.obolibrary.org/obo/'),
                 'delimiter'        : 't',
                 'column_idx'       : '2;1',
                 'identifier_maps'  : '0:./[...]/MIRNA_MIRBASE_MAP.txt;1:./[...]/DOID_MONDO_MAP.txt',
                 'evidence_criteria': 'None',
                 'filter_criteria'  : 'None',
                 'edge_list'        : ['...']}}
```
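
The pipe-delimited lines printed by `ont.resource_info` earlier in this notebook appear to carry the same fields as this dictionary. The sketch below shows one way such a line could be split into that structure. The helper `parse_resource_line` is hypothetical, and the field order is an assumption inferred from the values shown in this notebook, not taken from the PheKnowLator source.

```python
# Field names follow the Master Edge Dictionary example; their order in the
# pipe-delimited line is an assumption inferred from the values shown above.
FIELDS = ['source_labels', 'data_type', 'edge_relation', 'subject_uri', 'object_uri',
          'delimiter', 'column_idx', 'identifier_maps', 'evidence_criteria',
          'filter_criteria']


def parse_resource_line(line):
    """Split one resource_info.txt line into (edge_type, settings dict)."""
    edge_type, *values = line.strip().split('|')
    if len(values) != len(FIELDS):
        raise ValueError(f'expected {len(FIELDS)} fields, found {len(values)}')
    record = dict(zip(FIELDS, values))
    # store both URI prefixes as a tuple, mirroring the 'uri' entry above
    record['uri'] = (record.pop('subject_uri'), record.pop('object_uri'))
    return edge_type, record


line = ('variant-miRNA2566|;;|entity-entity|RO_0002566|'
        'https://www.ncbi.nlm.nih.gov/snp/|https://www.mirbase.org/mature/|'
        't|0;1|None|None|None')
edge_type, record = parse_resource_line(line)
print(edge_type, record['edge_relation'], record['uri'])
```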

***

In [10]:
import psutil
# set-up environment for parallel processing -- even if running program serially these steps are needed
cpus = psutil.cpu_count(logical=True)
#ray.shutdown()
ray.init()

2024-02-22 13:50:54,014	INFO worker.py:1625 -- Started a local Ray instance.


| | |
|---|---|
| Python version: | 3.10.12 |
| Ray version: | 2.4.0 |


In [11]:
# combine data sources
combined_edges = dict(edges.data_files, **ont.data_files)
resource_info_loc = './resources/resource_info.txt'

# initialize edge dictionary class
master_edges = CreatesEdgeList(data_files=combined_edges, source_file=resource_info_loc)
master_edges.runs_creates_knowledge_graph_edges(source_file=resource_info_loc, data_files=combined_edges, cpus=cpus)

[2m[36m(CreatesEdgeList pid=2238949)[0m Finished Edge: variant-miRNA2566 (variant = 77, miRNA2566 = 91); 101 unique edges
[2m[36m(CreatesEdgeList pid=2238949)[0m Finished Edge: premiRNA-miRNA2203 (premiRNA = 1917, miRNA2203 = 2656); 2879 unique edges
[2m[36m(CreatesEdgeList pid=2238950)[0m Finished Edge: snoRNA-retainedIntron2434 (snoRNA = 25, retainedIntron2434 = 6); 31 unique edges
[2m[36m(CreatesEdgeList pid=2238950)[0m Finished Edge: lncRNA-pw56 (lncRNA = 46, pw56 = 13); 58 unique edges
[2m[36m(CreatesEdgeList pid=2238950)[0m Finished Edge: ASOd-mRNA2430 (ASOd = 7, mRNA2430 = 4); 7 unique edges
[2m[36m(CreatesEdgeList pid=2238950)[0m Finished Edge: ribozyme-gobp56 (ribozyme = 3, gobp56 = 2); 3 unique edges
[2m[36m(CreatesEdgeList pid=2238949)[0m Finished Edge: miRNA-lncRNA2434 (miRNA = 2727, lncRNA2434 = 822); 14855 unique edges
[2m[36m(CreatesEdgeList pid=2238950)[0m Finished Edge: mRNA-anatomy1025 (mRNA = 3, anatomy1025 = 3); 3 unique edges
[2m[36m(Creat

**Preview Master Edge Data**  
Generate a table that includes each `edge-type`, its primary `relation`, example identifiers, and count of unique edges.

In [12]:
# load the edge dictionary written by the previous step
with open('resources/Master_Edge_List_Dict.json', 'r') as f:
    master_edges = json.load(f)
master_edges.keys()

dict_keys(['lncRNA-mRNA2434_4', 'miRNA-disease3302', 'snoRNA-mRNA2434', 'lncRNA-role2260', 'gRNA-gene11007', 'riboswitch-bactStrain2434', 'lncRNA-anatomy1025', 'miRNA-ev1018', 'miRNA-viralProtein2434', 'mRNA-TF2434', 'othersRNA-rRNA2434', 'protein-epiMod2434', 'ribozyme-RBP2434', 'snoRNA-chemical2434', 'snRNA-subCellularLocalization1025', 'unknown-TF2434', 'scaRNA-go2331', 'miRNA-go1025', 'RNAseMRP-go2327', 'tRNA-go2331', 'lncRNA-mRNA2434_3', 'premiRNA-disease3302', 'snoRNA-rRNA2434', 'lncRNA-biologicalContext2291', 'chemical-premiRNA2434', 'riboswitch-protein2529', 'lipid-ev1018', 'lncRNA-viralProtein2434', 'miRNA-viralmRNA2434', 'mRNA-subCellularLocalization1025', 'othersRNA-pseudogene2434', 'premiRNA-viralmiRNA2434', 'ribozyme-protein2434', 'scRNA-viralmiRNA2434', 'snRNA-snRNA2434', 'unknownRNA-protein2434', 'snoRNA-go2331', 'miRNA-go2432', 'ncRNA-goBFO50', 'RNAseP-goBFO50', 'lncRNA-mRNA2434_2', 'premiRNA-epiMod2434', 'snoRNA-snRNA2434', 'lncRNA-biologicalContext2246', 'chemical-miR

In [13]:
# read in relation data (tab-separated: relation identifier, relation label)
with open('./resources/relations_data/RELATIONS_LABELS.txt') as f:
    relation_data = f.readlines()
relation_dict = {x.split('\t')[0]: x.split('\t')[1].strip('\n') for x in relation_data}

# function to return "label (identifier)" for a relation identifier
def get_key(my_dict, val):
    if val in my_dict:
        return my_dict[val] + ' (' + val + ')'

    return "key doesn't exist"

# print basic stats on each resource
edge_data = [[key,
              get_key(relation_dict, master_edges[key]['edge_relation']),
              ', '.join(master_edges[key]['edge_list'][0]),
              len(master_edges[key]['edge_list'])]
             for key in master_edges.keys()]

# convert list of rows to pandas df for nice printing
df = pandas.DataFrame(edge_data, columns=['Edge Type', 'Relation', 'Example Edge', 'Unique Edges'])
df['Edge Type'] = df['Edge Type'].str.replace(r'\d+$', '', regex=True)
df['Edge Type'] = df['Edge Type'].str.replace('HOM', '')
pandas.set_option('display.max_rows', None)
df

| | Edge Type | Relation | Example Edge | Unique Edges |
|---|---|---|---|---|
| 0 | lncRNA-mRNA2434_ | interacts with (RO_0002434) | 284749?lncRNA, 6900?mRNA | 606435 |
| 1 | miRNA-disease | causes or contributes to condition (RO_0003302) | MIMAT0000062, HP_0001402 | 122491 |
| 2 | snoRNA-mRNA | interacts with (RO_0002434) | 100033418?snoRNA, 5300?mRNA | 177 |
| 3 | lncRNA-role | has biological role (RO_0002260) | 100048912?lncRNA, Oncogene | 173 |
| 4 | gRNA-gene | decreases by repression quantity of (RO_0011007) | www.addgene.org/74705, 60 | 72 |
| 5 | riboswitch-bactStrain | interacts with (RO_0002434) | CD3578965, NCBITaxon_1496 | 214 |
| 6 | lncRNA-anatomy | located in (RO_0001025) | 100132354?lncRNA, UBERON_0000167 | 6 |
| 7 | miRNA-ev | obsolete contained in (RO_0001018) | MIMAT0000062, GO_0070062 | 377 |
| 8 | miRNA-viralProtein | interacts with (RO_0002434) | MIMAT0000062, PR_000008466 | 59 |
| 9 | mRNA-TF | interacts with (RO_0002434) | 10001?mRNA, PR_P35637 | 8629 |
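
Row 0 keeps a trailing `2434_` because the `\d+$` pattern removes only the final run of digits, so keys that end in a source index such as `_4` are only partly cleaned. A possible alternative is sketched below. The helper `clean_edge_type` and its pattern are hypothetical and are not part of the notebook or of `pkt_kg`.

```python
import re


def clean_edge_type(key):
    """Strip the trailing relation digits, any '_N' source index, and 'HOM'."""
    # hypothetical variant of the pattern used above: also drops a '_N' suffix
    stripped = re.sub(r'\d+(?:_\d+)?$', '', key)
    return stripped.replace('HOM', '')


for key in ['lncRNA-mRNA2434_4', 'miRNA-disease3302', 'premiRNA-premiRNAHOM0']:
    print(key, '->', clean_edge_type(key))
```

The same expression could be passed to `str.replace(..., regex=True)` on the `Edge Type` column if the cleaner labels are preferred.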


<br><br>

***

## Build Knowledge Graph  <a class="anchor" id="build-kg"></a>
**Wiki Pages:**  
- **[`KG-Construction`](https://github.com/callahantiff/PheKnowLator/wiki/KG-Construction)**  
- **[`relations-data`](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#relations-data)**  
- **[`node-metadata`](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#node-metadata)** 

**Jupyter Notebooks:**  
- [`RNA-KG_Preparation.ipynb`](https://github.com/emanuelecavalleri/testRNA-KG/blob/master/notebooks/RNA-KG_Preparation.ipynb)  
- [`inteRNA-KG_Preparation.ipynb`](https://github.com/AnacletoLAB/RNA-KG/blob/main/notebooks/inteRNA-KG_Preparation.ipynb)
- [`non-ontological_RNAentities.ipynb`](https://github.com/AnacletoLAB/RNA-KG/blob/main/notebooks/non-ontological_RNAentities.ipynb)
- [`Ontology_Cleaning.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/notebooks/Ontology_Cleaning.ipynb)


<br>

**Assumptions:**  
- <u>Construction Approach</u>. If using the `subclass-based` construction approach, please make sure that a `pickled` dictionary mapping each non-ontology data node to an existing ontology class has been created and added to the `./resources/knowledge_graphs` directory (please see [here](https://github.com/callahantiff/PheKnowLator/tree/master/resources/knowledge_graphs#construction-method) for additional information).   
- <u>Relations Data</u>. If inverse relation data is going to be used to build the knowledge graph, please make sure that it has been generated and added to the `./resources/relations_data` directory (please see [here](https://github.com/callahantiff/PheKnowLator/blob/master/resources/relations_data/README.md) for additional information).  
- <u>Decoding OWL Semantics</u>. If decoding OWL semantics, please make sure that a list of `owl:Property` types to keep has been created and added to the `./resources/knowledge_graphs` directory (please see [here](https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0) for additional information). 

<br>

**Input:** 
- `Master_Edge_List_Dict.json`  
- Directory of relations data sources - see [here](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies#relations-data) for more information

<br>

**Output:** Please see [`Release v2.0.0 Wiki`](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0) for access to all generated output files.   
- `Knowledge Graph` (`.owl` and NetworkX MultiDiGraph `.pkl`)  
- `Class Instance URI-UUID Map` (if "instance" construction approach)   
- `Triple List - Integer`  
- `Triple List - Identifier`  
- `Node Integer-Identifier Map`
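
The relationship between the identifier triple list, the integer triple list, and the node map can be sketched as follows. This is only an illustration of the idea: the helper `index_triples` is hypothetical, and the actual file layout and numbering used by PheKnowLator may differ (for example, this sketch numbers relations in the same index space as nodes).

```python
def index_triples(triples):
    """Assign an integer to each identifier, in first-seen order."""
    node_to_int = {}
    integer_triples = []
    for triple in triples:
        # setdefault returns the existing index or stores the next free one
        integer_triples.append(tuple(node_to_int.setdefault(term, len(node_to_int))
                                     for term in triple))
    return integer_triples, node_to_int


# identifier triples built from example edges in the preview table
identifier_triples = [('MIMAT0000062', 'RO_0003302', 'HP_0001402'),
                      ('MIMAT0000062', 'RO_0001018', 'GO_0070062')]
integer_triples, node_map = index_triples(identifier_triples)
print(integer_triples)
print(node_map)
```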

<br>

The process of building the knowledge graph is somewhat time-consuming and can be broken into the following steps:  

1. Merge Ontologies. See [here](https://github.com/callahantiff/PheKnowLator/blob/master/resources/ontologies/README.md) for additional information on how to preprocess the ontologies prior to merging them.    

2. Create Edges. Add edge lists to merged ontologies.  

3. Add Inverse Relations and Node Data. See the [Dependencies](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies) Wiki page for details on how to construct these resources.  

4. Filter OWL Semantics. Filter the knowledge graph with the goal of removing all edges that contain entities that are needed to support owl semantics, but are not biologically meaningful (please see [here](https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0) for additional information). 
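
Step 3 can be pictured with a small example on plain triples. This is a minimal sketch under stated assumptions, not the `pkt_kg` implementation: the helper `add_inverse_edges` is hypothetical, and the inverse mapping below (`located in` paired with `location of`, and `interacts with` treated as symmetric) is assumed for illustration rather than read from the relations data files.

```python
def add_inverse_edges(triples, inverse_of):
    """Return `triples` plus a reversed edge for every relation with an inverse."""
    existing = set(triples)
    result = list(triples)
    for subj, rel, obj in triples:
        inverse = inverse_of.get(rel)
        if inverse is None:
            continue  # relation has no known inverse
        reverse = (obj, inverse, subj)
        if reverse not in existing:  # avoid duplicate edges
            existing.add(reverse)
            result.append(reverse)
    return result


# assumed inverse mapping, for illustration only
inverse_of = {'RO_0001025': 'RO_0001015',  # located in -> location of
              'RO_0002434': 'RO_0002434'}  # interacts with, treated as symmetric

triples = [('100132354?lncRNA', 'RO_0001025', 'UBERON_0000167'),
           ('MIMAT0000062', 'RO_0002434', 'PR_000008466')]
expanded = add_inverse_edges(triples, inverse_of)
for triple in expanded:
    print(triple)
```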

<br>

**‼ IMPORTANT:**  
- The file containing the merged ontologies is quite large and can take up to 10 minutes to read in.  This is not a limitation of the code directly, but rather a function of the [`RDFLib Library`](https://github.com/RDFLib). While there are other ways to read in this data, we maintain reliance on this library as it is the most user-friendly for non-RDF users.

***


In [None]:
# specify input arguments
build = 'full'
construction_approach = 'instance'
add_node_data_to_kg = 'no'
add_inverse_relations_to_kg = 'yes'
decode_owl_semantics = 'yes'
kg_directory_location = './resources/knowledge_graphs'

In [None]:
# construct knowledge graphs
if build == 'partial':
    kg = PartialBuild(construction=construction_approach,
                      node_data=add_node_data_to_kg,
                      inverse_relations=add_inverse_relations_to_kg,
                      decode_owl=decode_owl_semantics,
                      cpus=cpus,
                      write_location=kg_directory_location)
elif build == 'post-closure':
    kg = PostClosureBuild(construction=construction_approach,
                          node_data=add_node_data_to_kg,
                          inverse_relations=add_inverse_relations_to_kg,
                          decode_owl=decode_owl_semantics,
                          cpus=cpus,
                          write_location=kg_directory_location)
else:
    kg = FullBuild(construction=construction_approach,
                   node_data=add_node_data_to_kg,
                   inverse_relations=add_inverse_relations_to_kg,
                   decode_owl=decode_owl_semantics,
                   cpus=cpus,
                   write_location=kg_directory_location)

kg.construct_knowledge_graph()
ray.shutdown()