# <p style="text-align: center;">RNA Knowledge Graph Build Data Preparation</p>
    
***
***

**Authors:** [ECavalleri](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=emanuele.cavalleri@unimi.it) (RNA mapping rules), [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com) (PheKnowLator's KG mapping rules)

**GitHub Repositories:** [RNA-KG](https://github.com/AnacletoLAB/RNA-KG/), [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)
  
<br>  
  
**Purpose:** This notebook serves as a script to create mapping rules (i.e. look-up tables) in order to build edges for RNA-KG.

<br>

**Dependencies:**   
- **Scripts**: This notebook utilizes several helper functions, which are stored in the [`data_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/data_utils.py) and [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) scripts.  
- **Data**: All downloaded and generated data sources are provided through the dedicated Zenodo repository [10.5281/zenodo.10078876](https://zenodo.org/doi/10.5281/zenodo.10078876). <u>This notebook will download everything that is needed for you</u>.  
_____
***

## Set-Up Environment
_____

In [None]:
%%capture
import sys
!{sys.executable} -m pip install -r requirements.txt
sys.path.append('../')

In [None]:
# import needed libraries
import datetime
import glob
import io
import itertools
import os
import pickle
import re
import shutil
import tarfile
import zipfile

import gffpandas.gffpandas as gffpd
import networkx
import numpy
import numpy as np
import pandas
import pandas as pd
import requests

from collections import Counter
from functools import reduce
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS
from reactome2py import content
from tqdm import tqdm
from typing import Dict, Tuple

from pkt_kg.utils import * 
from builds.ontology_cleaning import *

from Bio import SeqIO, Entrez
from Bio.SeqIO.FastaIO import SimpleFastaParser

# show every column when previewing DataFrames
pd.set_option('display.max_columns', None)

#### Define Global Variables

In [None]:
# directory to store resources
resource_data_location = '../resources/'    

# directory to use for unprocessed data
unprocessed_data_location = '../resources/processed_data/unprocessed_data/'

# directory to use for processed data
processed_data_location = '../resources/processed_data/'

# directory to write relations data to
relations_data_location = '../resources/relations_data/'

# directory to write ontology data to
ontology_data_location = '../resources/ontologies/'

# directory to write edges data to
edge_data_location = '../resources/edge_data/'

# processed data url 
processed_url = 'https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/'

# owltools location
owltools_location = '../pkt_kg/libs/owltools'

***
***
## CREATE MAPPING DATASETS  <a class="anchor" id="create-identifier-maps"></a>
***
***

In [None]:
# load data
filepath = processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl'

# defensive way to write pickle.load, allowing for very large files on all platforms
max_bytes = 2**31 - 1
input_size = os.path.getsize(filepath)
bytes_in = bytearray(0)

with open(filepath, 'rb') as f_in:
    for _ in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)

# load pickled data
reformatted_mapped_identifiers = pickle.loads(bytes_in)
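
The chunked read used in the cell above can be tried on a small file. The following is a minimal sketch, not part of the pipeline: the helper name `load_pickle_in_chunks` and the toy dictionary are assumptions made for illustration, and the chunk size is shrunk to 8 bytes only to force several reads.

```python
import os
import pickle
import tempfile


def load_pickle_in_chunks(path, max_bytes=2**31 - 1):
    """Read a pickle file in chunks of at most max_bytes, then deserialize it."""
    buffer = bytearray()
    with open(path, 'rb') as handle:
        for _ in range(0, os.path.getsize(path), max_bytes):
            buffer += handle.read(max_bytes)
    return pickle.loads(bytes(buffer))


# round-trip a toy object (hypothetical identifiers), using a tiny chunk size
toy = {'ENSG_TOY_1': ['ENST_TOY_1', 'ENST_TOY_2']}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy.pkl')
    with open(toy_path, 'wb') as handle:
        pickle.dump(toy, handle, protocol=4)
    restored = load_pickle_in_chunks(toy_path, max_bytes=8)
```

Reading in bounded chunks keeps every single `read` call below the 2 GiB limit that some platforms impose, while the final object is identical to what a plain `pickle.load` would return.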

***
### Ensembl Gene-Entrez Gene <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Ensembl gene identifiers to Entrez gene identifiers when creating `gene`-`gene` edges

**Output:** `ENSEMBL_GENE_ENTREZ_GENE_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                  'ensembl_gene_id', 'entrez_id', 'ensembl_gene_type', 'entrez_gene_type',
                  'gene_type_update', 'gene_type_update')

In [None]:
# load data and preview it
egeg_data = pandas.read_csv(processed_data_location + 'ENSEMBL_GENE_ENTREZ_GENE_MAP.txt',
                            header=None, delimiter='\t', low_memory=False,
                            names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs',
                                   'Ensembl_Gene_Type', 'Entrez_Gene_Type',
                                   'Master_Gene_Type1', 'Master_Gene_Type2'])

egeg_data.head(n=3)
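
To show how such a look-up table might be consumed downstream, here is a minimal sketch that turns a six-column, tab-separated map into a dictionary of sets. The two rows are illustrative stand-ins typed by hand, not output of `genomic_id_mapper`, and the type labels in columns 3 to 6 are assumed values.

```python
import io

import pandas as pd

# illustrative rows only; the real file is written by genomic_id_mapper
toy_map = io.StringIO(
    "ENSG00000141510\t7157\tprotein-coding\tprotein-coding\tprotein-coding\tprotein-coding\n"
    "ENSG00000012048\t672\tprotein-coding\tprotein-coding\tprotein-coding\tprotein-coding\n"
)
toy_df = pd.read_csv(toy_map, header=None, sep='\t', dtype=str,
                     names=['Ensembl_Gene_IDs', 'Entrez_Gene_IDs',
                            'Ensembl_Gene_Type', 'Entrez_Gene_Type',
                            'Master_Gene_Type1', 'Master_Gene_Type2'])

# one Ensembl gene may map to several Entrez genes, so collect the targets in a set
lookup = toy_df.groupby('Ensembl_Gene_IDs')['Entrez_Gene_IDs'].agg(set).to_dict()
```

Reading the identifiers with `dtype=str` avoids Entrez identifiers being parsed as integers, which would otherwise make string comparisons against other sources fail.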

***
### Ensembl Transcript-Protein Ontology <a class="anchor" id="ensembltranscript-proteinontology"></a>

**Purpose:** To map Ensembl transcript identifiers to Protein Ontology identifiers when creating `rna`-`protein` edges

**Output:** `ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt`


In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                  'transcript_stable_id', 'pro_id', 'ensembl_transcript_type', None,
                  'transcript_type_update', None)

In [None]:
# load data and preview it
etpr_data = pandas.read_csv(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols=[0, 1, 2, 4],
                            names=['Ensembl_Transcript_IDs', 'Protein_Ontology_IDs',
                                   'Ensembl_Transcript_Type', 'Master_Transcript_Type'])

etpr_data.head(n=3)

***
### Entrez Gene-Ensembl Transcript <a class="anchor" id="entrezgene-ensembltranscript"></a>

**Purpose:** To map Entrez gene identifiers to Ensembl transcript identifiers when creating `gene`-`rna` edges

**Output:** `ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                  'entrez_id', 'transcript_stable_id', 'entrez_gene_type', 'ensembl_transcript_type',
                  'gene_type_update', 'transcript_type_update')

In [None]:
# load data and preview it
eet_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt',
                           header=None, delimiter='\t', low_memory=False,
                           names=['Entrez_Gene_IDs', 'Ensembl_Transcript_IDs',
                                  'Entrez_Gene_Type', 'Ensembl_Transcript_Type',
                                  'Master_Gene_Type', 'Master_Transcript_Type'])

eet_data.head(n=3)

***
### Entrez Gene-Protein Ontology <a class="anchor" id="entrezgene-proteinontology"></a>

**Purpose:** To map Entrez gene identifiers to Protein Ontology identifiers when creating the following edges:   
- chemical-protein  
- gene-protein

**Output:** `ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                  'entrez_id', 'pro_id', 'entrez_gene_type', None,
                  'gene_type_update', None)

In [None]:
# load data and preview it
egpr_data = pandas.read_csv(processed_data_location + 'ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols = [0, 1, 2, 4],
                            names=['Gene_IDs', 'Protein_Ontology_IDs',
                                   'Entrez_Gene_Type', 'Master_Gene_Type'])

egpr_data.head(n=5)

***
### Gene Symbol-Ensembl Transcript <a class="anchor" id="genesymbol-ensembltranscript"></a>

**Purpose:** To map gene symbols to Ensembl transcript identifiers when creating the following edges: 
- chemical-rna  
- rna-anatomy  
- rna-cell  

**Output:** `GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                  'symbol', 'transcript_stable_id', 'master_gene_type', 'ensembl_transcript_type',
                  'gene_type_update', 'transcript_type_update')

In [None]:
# load data and preview it
set_data = pandas.read_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt',
                           header=None, delimiter='\t', low_memory=False,
                           names=['Gene_Symbols', 'Ensembl_Transcript_IDs',
                                  'Gene_Type', 'Ensembl_Transcript_Type',
                                  'Master_Gene_Type', 'Master_Transcript_Type'])

set_data.head(n=3)

***
### Gene Symbol-Entrez <a class="anchor" id="genesymbol-ensembltranscript"></a>

**Purpose:** To map gene symbols to Entrez identifiers

**Output:** `GENE_SYMBOL_ENTREZ_MAP.txt`

In [None]:
entrez_enst_map = pd.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', sep="\t", header=None)
symbol_ensembl_map = pd.read_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt', sep="\t", header=None)
symbol_entrez_map = pd.merge(symbol_ensembl_map, entrez_enst_map, on=[1])
symbol_entrez_map = symbol_entrez_map[['0_x','0_y']].drop_duplicates()
symbol_entrez_map.head(n=3)

In [None]:
symbol_entrez_map.to_csv(processed_data_location + 'GENE_SYMBOL_ENTREZ_MAP.txt', sep="\t", header=False, index=False)
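
The column names `'0_x'` and `'0_y'` used above come from how pandas handles overlapping integer column labels in a merge. A minimal sketch with toy frames (all identifiers here are placeholders) illustrates this behaviour:

```python
import pandas as pd

# toy stand-ins for the two header-less maps: column 0 = source ID, column 1 = transcript ID
toy_symbol_enst = pd.DataFrame([['SYMBOL_A', 'ENST_TOY_1'], ['SYMBOL_B', 'ENST_TOY_2']])
toy_entrez_enst = pd.DataFrame([['101', 'ENST_TOY_1'], ['102', 'ENST_TOY_2'], ['103', 'ENST_TOY_3']])

# joining on the shared transcript column; the overlapping label 0 is suffixed on both sides
toy_merged = pd.merge(toy_symbol_enst, toy_entrez_enst, on=[1])

# keep only symbol and Entrez columns and give them readable names
toy_pairs = toy_merged[['0_x', '0_y']].drop_duplicates()
toy_pairs.columns = ['Gene_Symbol', 'Entrez_Gene_ID']
```

Because the default merge is an inner join, the transcript without a symbol (`ENST_TOY_3`) is dropped from the result.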

***

### STRING-Protein Ontology <a class="anchor" id="string-proteinontology"></a>

**Purpose:** To map STRING identifiers to Protein Ontology identifiers when creating `protein`-`protein` edges 

**Output:** `STRING_PRO_ONTOLOGY_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                  'protein_stable_id', 'pro_id', None, None, None, None)

In [None]:
# load data and preview it
stpr_data = pandas.read_csv(processed_data_location + 'STRING_PRO_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols=[0, 1],
                            names=['STRING_IDs', 'Protein_Ontology_IDs'])

stpr_data.head(n=5)

***

### Uniprot Accession-Protein Ontology <a class="anchor" id="uniprotaccession-proteinontology"></a>

**Purpose:** To map Uniprot accession identifiers to Protein Ontology identifiers when creating the following edges:  
- protein-gobp  
- protein-gomf  
- protein-gocc  
- protein-cofactor  
- protein-catalyst 
- protein-pathway

**Output:** `UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt`

In [None]:
genomic_id_mapper(reformatted_mapped_identifiers,
                  processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                  'uniprot_id', 'pro_id', None, None, None, None)

In [None]:
# load data and preview it
uapr_data = pandas.read_csv(processed_data_location + 'UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt',
                            header=None, delimiter='\t', low_memory=False, usecols=[0, 1],
                            names=['Uniprot_Accession_IDs', 'Protein_Ontology_IDs'])

uapr_data.head(n=5)

***
### ChEBI-MeSH Identifiers <a class="anchor" id="mesh-chebi"></a>

**Data Source Wiki Page:** [mapping-mesh-to-chebi](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#mapping-mesh-identifiers-to-chebi-identifiers)  

**Purpose:** Map MeSH identifiers to ChEBI identifiers when creating the following edges:  
- chemical-gene  
- chemical-disease

**Dependencies:** Recapitulates the [`LOOM`](https://www.bioontology.org/wiki/BioPortal_Mappings) algorithm implemented by BioPortal when creating mappings between resources. The procedure is relatively straightforward and consists of the following:
- For all MeSH `SCR Chemicals`, obtain the following information:  
  - <u>Identifiers</u>: MeSH identifiers     
  - <u>Labels</u>: string labels using the `RDFS:label` object property  
  - <u>Synonyms</u>: track down all synonyms using the `vocab:concept` and `vocab:preferredConcept` object properties   
- For all ChEBI classes, obtain the following information:  
  - <u>Labels</u>: string labels using the `RDFS:label` object property  
  - <u>Synonyms</u>: track down all synonyms using all `synonym` object properties 
  
*Alternatively:* You can use the [`ncbo_rest_api.py`](https://gist.github.com/callahantiff/a28fb3160782f42f104e9ec41553af0d) script to pull mappings from the BioPortal API, but note that it takes >2 days for it to finish.

**Output:** `MESH_CHEBI_MAP.txt`
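
The label-matching idea described above can be sketched on toy data as follows. This is a simplified illustration rather than the full procedure: the helper `normalize`, the placeholder codes, and the two-row tables are assumptions, and only the normalize-then-inner-join step is shown.

```python
import pandas as pd


def normalize(strings: pd.Series) -> pd.Series:
    """Lowercase and drop every non-word character (white space, punctuation)."""
    return strings.str.lower().str.replace(r'[^\w]', '', regex=True)


# toy label/synonym tables; the codes are placeholders, not real identifiers
toy_mesh = pd.DataFrame({'CODE': ['MESH_TOY1', 'MESH_TOY2'],
                         'STRING': ['Acetyl-Salicylic Acid', 'Unmatched Compound']})
toy_chebi = pd.DataFrame({'CODE': ['CHEBI_TOY1', 'CHEBI_TOY2'],
                          'STRING': ['acetylsalicylic acid', 'another compound']})

toy_mesh['STRING'] = normalize(toy_mesh['STRING'])
toy_chebi['STRING'] = normalize(toy_chebi['STRING'])

# an inner merge keeps only strings that occur in both vocabularies
toy_matches = pd.merge(toy_chebi, toy_mesh, on='STRING', how='inner',
                       suffixes=('_chebi', '_mesh'))
```

After normalization both spellings collapse to `acetylsalicylicacid`, so the pair is reported as a match even though the raw labels differ in case, spacing, and hyphenation.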


***  
**MeSH**  
Downloads the `nt`-formatted version of the current MeSH vocabulary. Preprocessing is then performed to reformat the data so that it can be converted into a Pandas DataFrame in preparation for merging it with `ChEBI` to identify overlapping concepts.

In [None]:
# download data
url = 'ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/2025/mesh2025.nt'
data_downloader(url, unprocessed_data_location)
    
# load data (the context manager closes the file once all lines are read)
with open(unprocessed_data_location + 'mesh2025.nt') as f_in:
    mesh = [x.split('> ') for x in tqdm(f_in.readlines())]

In [None]:
# preprocess data
mesh_dict, results = {}, []
for row in tqdm(mesh):
    dbx, lab, msh_type = None, None, None
    s, p, o = row[0].split('/')[-1], row[1].split('#')[-1], row[2]  
    if s[0] in ['C', 'D'] and ('.' not in s and 'Q' not in s) and len(s) >= 5:
        s = 'MESH_' + s
        if p == 'preferredConcept' or p == 'concept': dbx = 'MESH_' + o.split('/')[-1]
        if 'label' in p.lower(): lab = o.split('"')[1]
        if 'type' in p.lower(): msh_type = o.split('#')[1]
        if s in mesh_dict.keys():
            if dbx is not None: mesh_dict[s]['dbxref'].add(dbx)
            if lab is not None: mesh_dict[s]['label'].add(lab)
            if msh_type is not None: mesh_dict[s]['type'].add(msh_type)
        else:
            mesh_dict[s] = {'dbxref': set() if dbx is None else {dbx},
                            'label': set() if lab is None else {lab},
                            'type': set() if msh_type is None else {msh_type},
                            'synonym': set()}

# fine tune dictionary - obtain labels for each entry's synonym identifiers
for key in tqdm(mesh_dict.keys()):
    for i in mesh_dict[key]['dbxref']:
        if len(mesh_dict[key]['dbxref']) > 0 and i in mesh_dict.keys():
            mesh_dict[key]['synonym'] |= mesh_dict[i]['label']

# expand data and convert to pandas DataFrame
for key, value in tqdm(mesh_dict.items()):
    results += [[key, list(value['label'])[0], 'NAME']]
    if len(value['synonym']) > 0:
        for i in value['synonym']:
            results += [[key, i, 'SYNONYM']]
mesh_filtered = pandas.DataFrame({'CODE': [x[0] for x in results],
                                  'TYPE': [x[2] for x in results],
                                  'STRING': [x[1] for x in results]})

# lowercase all strings and remove white space and punctuation
mesh_filtered['STRING'] = mesh_filtered['STRING'].str.lower()
mesh_filtered['STRING'] = mesh_filtered['STRING'].str.replace(r'[^\w]', '', regex=True)

# preview data
mesh_filtered.head()
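
To make the triple-splitting logic in the preprocessing cell easier to follow, here is a minimal sketch applied to a single N-Triples line. The line itself is made up for illustration (the subject identifier and label are not taken from the MeSH release).

```python
# an illustrative N-Triples line; the subject URI and label are made up for the example
toy_line = ('<http://id.nlm.nih.gov/mesh/2025/C000001> '
            '<http://www.w3.org/2000/01/rdf-schema#label> "toy chemical"@en .\n')

# splitting on '> ' separates subject, predicate, and object
toy_row = toy_line.split('> ')
toy_s = toy_row[0].split('/')[-1]      # last path segment of the subject URI
toy_p = toy_row[1].split('#')[-1]      # fragment of the predicate URI
toy_label = toy_row[2].split('"')[1]   # text between the first pair of double quotes

# same filter as above: descriptor (D) or supplementary record (C) identifiers only
toy_keep = toy_s[0] in ['C', 'D'] and ('.' not in toy_s and 'Q' not in toy_s) and len(toy_s) >= 5
```

This string-based parsing is fast but relies on the regular layout of the file; a general-purpose RDF parser would be the more robust choice for arbitrary N-Triples input.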

***  
**ChEBI**  
Downloads the flat file containing labels and synonyms for all classes in the `ChEBI` ontology. Preprocessing is then performed to reformat the data so that it can be converted into a Pandas DataFrame in preparation for merging it with `MeSH` to identify overlapping concepts.

In [None]:
# download data
url = 'ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names.tsv.gz'
data_downloader(url, unprocessed_data_location)
    
# load data
chebi = pandas.read_csv(unprocessed_data_location + 'names.tsv', header=0, delimiter='\t')

# preprocess data (copy the slice so that later column assignments act on an independent DataFrame)
chebi_filtered = chebi[['COMPOUND_ID', 'TYPE', 'NAME']].drop_duplicates(keep='first').copy()
chebi_filtered.columns = ['CODE', 'TYPE', 'STRING']

# prepend CHEBI_ to the number in each code
chebi_filtered['CODE'] = chebi_filtered['CODE'].apply(lambda x: "{}{}".format('CHEBI_', x))

# lowercase all strings and remove white space and punctuation
chebi_filtered['STRING'] = chebi_filtered['STRING'].str.lower()
chebi_filtered['STRING'] = chebi_filtered['STRING'].str.replace(r'[^\w]', '', regex=True)

# preview data
chebi_filtered.head()

***  
**Merge Identifier Data**  
Performs an inner merge of the `MeSH` and `ChEBI` Pandas DataFrames in order to find concepts that exist in both DataFrames. Results are then written out to a text file.

In [None]:
# merge data
chem_merge = pandas.merge(chebi_filtered[['STRING', 'CODE']], mesh_filtered[['STRING', 'CODE']], on='STRING', how='inner')

# filter results (mesh_id/chebi_id avoid overwriting the mesh and chebi objects loaded earlier)
mesh_edges = set()
for idx, row in chem_merge.drop_duplicates().iterrows():
    mesh_id, chebi_id = row['CODE_y'], row['CODE_x']
    syns = [x for x in mesh_dict[mesh_id]['dbxref'] if 'C' in x or 'D' in x]
    mesh_edges.add((mesh_id, chebi_id))
    for x in syns:
        mesh_edges.add((x, chebi_id))

# write resulting mappings
with open(processed_data_location + 'MESH_CHEBI_MAP.txt', 'w') as out:
    for pair in mesh_edges:
        out.write(pair[0] + '\t' + pair[1] + '\n')

In [None]:
# load data
data = pandas.read_csv(processed_data_location + 'MESH_CHEBI_MAP.txt', header=None, names=['MESH_ID', 'CHEBI_ID'], delimiter='\t')

# preview mapping results
data.head(n=3)

***

### Disease and Phenotype Identifiers <a class="anchor" id="disease-identifiers"></a>

**Data Source Wiki Page:** [DisGeNET](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet)  

**Purpose:** This script downloads the Human Phenotype Ontology (HPO), the MonDO Disease Ontology (MONDO), and [disease_mappings.tsv](https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz) in order to map UMLS identifiers to HPO and MONDO identifiers when creating the following edges:  
- chemical-disease  
- disease-phenotype  
- chemical-phenotype  
- gene-phenotype  
- variant-phenotype  

**Output:**   
- Human Disease Ontology Mappings ➞ `DISEASE_MONDO_MAP.txt`
- Human Phenotype Ontology Mappings ➞ `PHENOTYPE_HPO_MAP.txt`

***
**MONDO Identifiers**  
`MONDO` contains DbXRef mappings to other disease terminology identifiers. To make these usable for look-ups, we store them as a dictionary keyed by the cross-referenced identifier, with the matching `MONDO` identifiers as the values.

In [None]:
mondo_graph = Graph().parse(ontology_data_location + 'mondo_with_imports.owl')
print('There are {} axioms in the ontology (date: {})'.format(len(mondo_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

# get dbxrefs for all MONDO classes
dbxref_res = gets_ontology_class_dbxrefs(mondo_graph)[0]
mondo_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in dbxref_res.items() if 'MONDO' in str(v)}

# pickle dictionary
with open(processed_data_location + 'Mondo_Identifier_Map.pkl', 'wb') as f_out:
    pickle.dump(mondo_dict, f_out, protocol=4)
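
The dictionary comprehension above can be traced on a small example. The sketch below assumes the cross-reference result has the shape "cross-reference string to set of class IRIs"; that shape, and all three toy entries, are assumptions made for illustration.

```python
# assumed shape: cross-reference string -> set of ontology class IRIs (toy values)
toy_dbxref = {
    'UMLS:C0000001': {'http://purl.obolibrary.org/obo/MONDO_0000001'},
    'DOID:0000002': {'http://purl.obolibrary.org/obo/MONDO_0000002'},
    'GARD:0000003': {'http://purl.obolibrary.org/obo/OTHER_0000003'},
}

# lowercase the key, keep the last IRI segment of each value, and use ':' instead of '_'
toy_mondo_dict = {
    str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v}
    for k, v in toy_dbxref.items() if 'MONDO' in str(v)
}
```

Entries whose values contain no `MONDO` IRI, such as the third toy entry, are filtered out by the trailing condition.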

***
**HPO Identifiers**  
`HPO` contains DbXRef mappings to other disease and phenotype terminology identifiers. To make these usable for look-ups, we store them as a dictionary keyed by the cross-referenced identifier, with the matching `HPO` identifiers as the values.

In [None]:
# read data into RDFLib graph object
hp_graph = Graph().parse(ontology_data_location + 'hp_with_imports.owl')
print('There are {} axioms in the ontology (date: {})'.format(len(hp_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

# get dbxrefs for all HPO classes
dbxref_res = gets_ontology_class_dbxrefs(hp_graph)[0]
hp_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in dbxref_res.items() if 'HP' in str(v)}

# pickle dictionary
with open(processed_data_location + 'HPO_Identifier_Map.pkl', 'wb') as f_out:
    pickle.dump(hp_dict, f_out, protocol=4)

***
**DisGeNET Disease Mappings**

In [None]:
# download data
url = 'https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_mappings.tsv.gz'
data_downloader(url, unprocessed_data_location)
    
# load data
disease_data = pandas.read_csv(unprocessed_data_location + 'disease_mappings.tsv', header=0, delimiter='\t')

# reformat data
disease_data['vocabulary'] = disease_data['vocabulary'].str.lower()
disease_data['diseaseId'] = disease_data['diseaseId'].str.lower()
disease_data['vocabulary'] = ['doid' if x == 'do' else 'ordoid' if x == 'ordo' else x for x in disease_data['vocabulary']]

# preview data
disease_data.head(n=3)

_Build Disease Identifier Dictionary_  
In order to improve efficiency when mapping different disease terminology identifiers to the [MonDO Disease Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#mondo-disease-ontology) and [Human Phenotype Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-phenotype-ontology), we create a dictionary of disease identifiers.

In [None]:
# get all CUIs found with HPO and MONDO
disease_data_keep = disease_data.query('vocabulary == "hpo" | vocabulary == "mondo"')

# create mondo and hpo dictionary
hp_mondo_dict = {}
for idx, row in tqdm(disease_data_keep.iterrows(), total=disease_data_keep.shape[0]):
    if row['vocabulary'] == 'mondo': key, value = 'umls:' + row['diseaseId'], 'MONDO:' + row['code']
    else: key, value = 'umls:' + row['diseaseId'], row['code']
    if key in hp_mondo_dict.keys(): hp_mondo_dict[key] |= {value}
    else: hp_mondo_dict[key] = {value}
# add ontology mappings from MONDO and HPO
for key in tqdm(hp_mondo_dict.keys()):
    if key in mondo_dict.keys():
        hp_mondo_dict[key] = set(list(hp_mondo_dict[key]) + list(mondo_dict[key]))
    if key in hp_dict.keys():
        hp_mondo_dict[key] = set(list(hp_mondo_dict[key]) + list(hp_dict[key]))

In [None]:
# get all rows for HPO/MONDO CUIs to obtain mappings to other disease identifiers
disease_data_other = disease_data[disease_data.diseaseId.isin(disease_data_keep['diseaseId'])]

# get all other codes that map to MONDO or HPO by hopping through MONDO/HPO relevant CUIs
disease_dict = {}
for idx, row in tqdm(disease_data_other.iterrows(), total=disease_data_other.shape[0]):
    if row['vocabulary'] == 'mondo' or row['vocabulary'] == 'hpo':
        key, value = 'umls:' + row['diseaseId'].lower(), row['code']
        if key in disease_dict.keys(): disease_dict[key] |= {value}
        else: disease_dict[key] = {value}
    else:
        # skip codes that are themselves MONDO/HPO identifiers
        if 'mondo' not in row['code'] and 'hp' not in row['code']:
            if ':' not in row['code']: key, value = row['vocabulary'] + ':' + row['code'], hp_mondo_dict['umls:' + row['diseaseId']]
            else: key, value = row['code'], hp_mondo_dict['umls:' + row['diseaseId']]
            # store a copy so that later |= updates do not mutate the sets held in hp_mondo_dict
            if key in disease_dict.keys(): disease_dict[key] |= value
            else: disease_dict[key] = set(value)

# add ontology dictionaries
disease_dict = {**disease_dict, **mondo_dict, **hp_dict}
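
The "hop through the CUI" idea can be illustrated with a simplified sketch that covers only MONDO and uses a toy table. The rows, codes, and variable names (`cui_to_mondo`, `code_to_mondo`) are hypothetical; the real data carry many more vocabularies.

```python
import pandas as pd

# toy DisGeNET-style rows: one CUI with a MONDO code, one CUI without
toy = pd.DataFrame({
    'diseaseId': ['c0000001', 'c0000001', 'c0000001', 'c0000002'],
    'vocabulary': ['mondo', 'msh', 'omim', 'msh'],
    'code': ['0000001', 'D000001', '100001', 'D000002'],
})

# step 1: CUI -> MONDO identifiers
cui_to_mondo = {}
for row in toy[toy['vocabulary'] == 'mondo'].itertuples(index=False):
    cui_to_mondo.setdefault('umls:' + row.diseaseId, set()).add('MONDO:' + row.code)

# step 2: every other code that shares a mapped CUI inherits that CUI's MONDO identifiers
code_to_mondo = {}
for row in toy[toy['vocabulary'] != 'mondo'].itertuples(index=False):
    targets = cui_to_mondo.get('umls:' + row.diseaseId)
    if targets:
        # update() copies the members, so the sets in cui_to_mondo are never shared
        code_to_mondo.setdefault(row.vocabulary + ':' + row.code, set()).update(targets)
```

The MeSH code attached to the second CUI is left unmapped because that CUI has no MONDO entry to hop through.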

_Write Mapping Data_

In [None]:
with open(processed_data_location + 'DISEASE_MONDO_MAP.txt', 'w') as outfile1, open(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', 'w') as outfile2:
    for k, v in tqdm(disease_dict.items()):
        if any(x for x in v if x.startswith('MONDO')):
            for idx in [x.replace(':', '_') for x in v if 'MONDO' in x]:
                outfile1.write(k.upper().split(':')[-1] + '\t' + idx + '\n')
        if any(x for x in v if x.startswith('HP')):
            for idx in [x.replace(':', '_') for x in v if 'HP' in x]:
                outfile2.write(k.upper().split(':')[-1]  + '\t' + idx + '\n')

_Preview Processed MONDO Disease Ontology Mappings_

In [None]:
# load data, print row count, and preview it
dis_data = pandas.read_csv(processed_data_location + 'DISEASE_MONDO_MAP.txt', header=None, names=['Disease_IDs', 'MONDO_IDs'], delimiter='\t')

print('There are {} disease-MONDO edges'.format(len(dis_data)))
dis_data.head(n=5)

_Preview Processed Human Phenotype Mappings_

In [None]:
# load data, print row count, and preview it
hp_data = pandas.read_csv(processed_data_location + 'PHENOTYPE_HPO_MAP.txt', header=None, names=['Disease_IDs', 'HP_IDs'], delimiter='\t')

print('There are {} phenotype-HPO edges'.format(len(hp_data)))
hp_data.head(n=5)

***

### Human Protein Atlas/GTEx Tissue/Cells - UBERON + Cell Ontology + Cell Line Ontology <a class="anchor" id="hpa-uberon"></a>

**Data Source Wiki Page:**  
- [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#human-protein-atlas) 
- [genotype-tissue-expression-project](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#the-genotype-tissue-expression-gtex-project)  

<br>

**Purpose:** Downloads two resources: (1) an HPA query of cell, tissue, and blood types with overexpressed protein-coding genes in the human proteome ([`proteinatlas_search.tsv`](https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv)), retrieved via the [API](https://www.proteinatlas.org/about/help/dataaccess), and (2) median gene-level TPM by tissue for all genes that are not protein-coding ([`GTEx_Analysis_v10_RNASeQCv2.4.2_gene_median_tpm.gct`](https://storage.googleapis.com/adult-gtex/bulk-gex/v10/rna-seq/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_median_tpm.gct.gz)). These are used to map cell and tissue type strings to Uber-Anatomy, Cell Ontology, and Cell Line Ontology concepts (see [human-protein-atlas](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#human-protein-atlas) for details on the mapping process). The mappings are then used to create the following edge types:  
- rna-cell line  
- rna-tissue type   
- protein-cell line  
- protein-tissue type  


**Output:**  
- Final HPA-ontology mappings ➞ `HPA_GTEx_TISSUE_CELL_MAP.txt`

***
**Human Protein Atlas**  
To expedite the mapping process, all HPA tissues, cells, cell lines, and fluid types are extracted from the HPA data columns.

In [None]:
# download data
url = 'https://www.proteinatlas.org/api/search_download.php?search=&columns=g,eg,up,pe,rnatsm,rnaclsm,rnacasm,rnabrsm,rnabcsm,rnablsm,scl,t_RNA_adipose_tissue,t_RNA_adrenal_gland,t_RNA_amygdala,t_RNA_appendix,t_RNA_basal_ganglia,t_RNA_bone_marrow,t_RNA_breast,t_RNA_cerebellum,t_RNA_cerebral_cortex,t_RNA_cervix,_uterine,t_RNA_colon,t_RNA_corpus_callosum,t_RNA_ductus_deferens,t_RNA_duodenum,t_RNA_endometrium_1,t_RNA_epididymis,t_RNA_esophagus,t_RNA_fallopian_tube,t_RNA_gallbladder,t_RNA_heart_muscle,t_RNA_hippocampal_formation,t_RNA_hypothalamus,t_RNA_kidney,t_RNA_liver,t_RNA_lung,t_RNA_lymph_node,t_RNA_midbrain,t_RNA_olfactory_region,t_RNA_ovary,t_RNA_pancreas,t_RNA_parathyroid_gland,t_RNA_pituitary_gland,t_RNA_placenta,t_RNA_pons_and_medulla,t_RNA_prostate,t_RNA_rectum,t_RNA_retina,t_RNA_salivary_gland,t_RNA_seminal_vesicle,t_RNA_skeletal_muscle,t_RNA_skin_1,t_RNA_small_intestine,t_RNA_smooth_muscle,t_RNA_spinal_cord,t_RNA_spleen,t_RNA_stomach_1,t_RNA_testis,t_RNA_thalamus,t_RNA_thymus,t_RNA_thyroid_gland,t_RNA_tongue,t_RNA_tonsil,t_RNA_urinary_bladder,t_RNA_vagina,t_RNA_B-cells,t_RNA_dendritic_cells,t_RNA_granulocytes,t_RNA_monocytes,t_RNA_NK-cells,t_RNA_T-cells,t_RNA_total_PBMC,cell_RNA_A-431,cell_RNA_A549,cell_RNA_AF22,cell_RNA_AN3-CA,cell_RNA_ASC_diff,cell_RNA_ASC_TERT1,cell_RNA_BEWO,cell_RNA_BJ,cell_RNA_BJ_hTERT+,cell_RNA_BJ_hTERT+_SV40_Large_T+,cell_RNA_BJ_hTERT+_SV40_Large_T+_RasG12V,cell_RNA_CACO-2,cell_RNA_CAPAN-2,cell_RNA_Daudi,cell_RNA_EFO-21,cell_RNA_fHDF/TERT166,cell_RNA_HaCaT,cell_RNA_HAP1,cell_RNA_HBEC3-KT,cell_RNA_HBF_TERT88,cell_RNA_HDLM-2,cell_RNA_HEK_293,cell_RNA_HEL,cell_RNA_HeLa,cell_RNA_Hep_G2,cell_RNA_HHSteC,cell_RNA_HL-60,cell_RNA_HMC-1,cell_RNA_HSkMC,cell_RNA_hTCEpi,cell_RNA_hTEC/SVTERT24-B,cell_RNA_hTERT-HME1,cell_RNA_HUVEC_TERT2,cell_RNA_K-562,cell_RNA_Karpas-707,cell_RNA_LHCN-M2,cell_RNA_MCF7,cell_RNA_MOLT-4,cell_RNA_NB-4,cell_RNA_NTERA-2,cell_RNA_PC-3,cell_RNA_REH,cell_RNA_RH-30,cell_RNA_RPMI-8226,cell_RNA_RPTEC_TERT1,cell_RNA_RT4,cell_RNA_SCLC-21H,cell_RNA_SH-SY5Y,cell_RNA_SiHa,cell_RNA_SK-BR-3,cell_RNA_SK-MEL-30,cell_RNA_T-47d,cell_RNA_THP-1,cell_RNA_TIME,cell_RNA_U-138_MG,cell_RNA_U-2_OS,cell_RNA_U-2197,cell_RNA_U-251_MG,cell_RNA_U-266/70,cell_RNA_U-266/84,cell_RNA_U-698,cell_RNA_U-87_MG,cell_RNA_U-937,cell_RNA_WM-115,blood_RNA_basophil,blood_RNA_classical_monocyte,blood_RNA_eosinophil,blood_RNA_gdT-cell,blood_RNA_intermediate_monocyte,blood_RNA_MAIT_T-cell,blood_RNA_memory_B-cell,blood_RNA_memory_CD4_T-cell,blood_RNA_memory_CD8_T-cell,blood_RNA_myeloid_DC,blood_RNA_naive_B-cell,blood_RNA_naive_CD4_T-cell,blood_RNA_naive_CD8_T-cell,blood_RNA_neutrophil,blood_RNA_NK-cell,blood_RNA_non-classical_monocyte,blood_RNA_plasmacytoid_DC,blood_RNA_T-reg,blood_RNA_total_PBMC,brain_RNA_amygdala,brain_RNA_basal_ganglia,brain_RNA_cerebellum,brain_RNA_cerebral_cortex,brain_RNA_hippocampal_formation,brain_RNA_hypothalamus,brain_RNA_midbrain,brain_RNA_olfactory_region,brain_RNA_pons_and_medulla,brain_RNA_thalamus&format=tsv'
data_downloader(url, unprocessed_data_location, 'proteinatlas_search.tsv.gz')

# load data
hpa = pandas.read_csv(unprocessed_data_location + 'proteinatlas_search.tsv', header=0, delimiter='\t')
hpa.fillna('None', inplace=True)

In [None]:
# retrieve terms to map and write results
with open(unprocessed_data_location + 'HPA_tissues.txt', 'w') as outfile:
    for x in tqdm(list(hpa.columns)):
        if x.endswith('[NX]'):
            outfile.write(x.split('RNA - ')[-1].split(' [NX]')[:-1][0] + '\n')
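The column filter above can be checked on a few header strings. This is a minimal sketch: the header names are illustrative assumptions about the `RNA - <name> [NX]` pattern the loop expects, and `extract_tissue` is a hypothetical helper that is not part of `pkt_kg`.

```python
from typing import Optional


def extract_tissue(column: str) -> Optional[str]:
    """Return the tissue/cell name from an expression column header, else None."""
    if not column.endswith('[NX]'):
        return None
    # keep the text after 'RNA - ' and drop the trailing ' [NX]' unit tag
    return column.split('RNA - ')[-1].split(' [NX]')[0]


# illustrative headers (assumed format, not copied from a real HPA download)
headers = ['Gene', 'Tissue RNA - adipose tissue [NX]', 'Cell RNA - A-431 [NX]']
tissues = [t for t in map(extract_tissue, headers) if t is not None]
print(tissues)
```

Columns without the `[NX]` suffix (gene identifiers, annotations) are skipped, so only expression columns contribute terms to `HPA_tissues.txt`.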

***
**Genotype-Tissue Expression Project**  
Import the tissues, cells, cell lines, and fluids that we externally mapped from HPA and GTEx data to [UBERON](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#uber-anatomy-ontology), the [Cell Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#cell-ontology), and the [Cell Line Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#cell-line-ontology).

In [None]:
# download data
url='https://storage.googleapis.com/adult-gtex/bulk-gex/v10/rna-seq/GTEx_Analysis_v10_RNASeQCv2.4.2_gene_median_tpm.gct.gz'
data_downloader(url, unprocessed_data_location, 'GTEx_Analysis_v10_RNASeQCv2.4.2_gene_median_tpm.gct.gz')

# load data
gtex = pandas.read_csv(unprocessed_data_location + 'GTEx_Analysis_v10_RNASeQCv2.4.2_gene_median_tpm.gct', header=0, skiprows=2, delimiter='\t')
gtex.fillna('None', inplace=True)  # replace NaN with 'None'
gtex['Name'] = gtex['Name'].replace(r'(\..*)', '', regex=True)  # remove the version suffix that follows '.'

In [None]:
# download data
url='https://zenodo.org/records/10056198/files/zooma_tissue_cell_mapping_04JAN2020.xlsx.zip'
data_downloader(url, unprocessed_data_location)
    
# load ontology mapping data
mapping_data = pandas.read_excel(open(unprocessed_data_location + 'zooma_tissue_cell_mapping_04JAN2020.xlsx', 'rb'),
                                 sheet_name='Concept_Mapping - 04JAN2020', header=0, engine='openpyxl')
mapping_data.fillna('None', inplace=True)  # convert NaN to None

# preview data
mapping_data.head(n=3)

_Write HPA and GTEx Mapping Data_  
The HPA and GTEx mapping data is written locally so that it can be used by the `PheKnowLator` algorithm when creating the knowledge graph edge lists. 

In [None]:
with open(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt', 'w') as out:
    for idx, row in tqdm(mapping_data.iterrows(), total=mapping_data.shape[0]):
        if row['UBERON'] != 'None': out.write(str(row['TERM']).strip() + '\t' + str(row['UBERON']).strip() + '\n')
        if row['CL'] != 'None': out.write(str(row['TERM']).strip() + '\t' + str(row['CL']).strip() + '\n')
        if row['CLO'] != 'None': out.write(str(row['TERM']).strip() + '\t' + str(row['CLO']).strip() + '\n')

In [None]:
# load mapping data
mapping_data = pandas.read_csv(processed_data_location + 'HPA_GTEx_TISSUE_CELL_MAP.txt', header=None, names=['TISSUE_CELL_TERM', 'ONTOLOGY_IDs'], delimiter='\t')

# preview data
mapping_data.head(n=3)
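Because a term can appear on several lines (one per matching ontology), downstream code may prefer a term-to-identifiers dictionary. The following is a sketch using only the standard library; the sample rows and identifiers are made up and only mimic the two-column, tab-separated layout written above.

```python
import csv
import io
from collections import defaultdict

# made-up rows in the layout: term <TAB> ontology identifier
sample = (
    'some tissue\tUBERON_0000000\n'
    'some cell line\tCLO_0000000\n'
    'some cell line\tCL_0000000\n'
)

term_to_ids = defaultdict(set)
for term, ontology_id in csv.reader(io.StringIO(sample), delimiter='\t'):
    # a term mapped to several ontologies accumulates all of its identifiers
    term_to_ids[term].add(ontology_id)

print(dict(term_to_ids))
```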

***

### Mapping Reactome Pathways to the Pathway Ontology <a class="anchor" id="reactome-pw"></a>

**Data Source Wiki Page:** [Pathway Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources/#pathway-ontology)  

**Purpose:** This script downloads the [canonical pathways](http://compath.scai.fraunhofer.de/export_mappings) and [kegg-reactome pathway mappings](https://github.com/ComPath/resources/blob/master/mappings/kegg_reactome.csv) files from the [ComPath Ecosystem](https://github.com/ComPath) in order to create the following identifier mappings:  
- `Reactome Pathway Identifiers`  ➞ `KEGG Pathway Identifiers` ➞ `Pathway Ontology Identifiers` 

**Output:**  
- `REACTOME_PW_GO_MAPPINGS.txt`


***

**Pathway Ontology**   
Use [OWL Tools](https://github.com/owlcollab/owltools/wiki) to download the [Pathway Ontology](http://www.obofoundry.org/ontology/pw.html). Once downloaded, we read the ontology in as an `RDFLib` graph object so that we can query it to obtain all `DbXRefs`.

In [None]:
pw_graph = Graph().parse(ontology_data_location + 'pw_with_imports.owl')
print('There are {} axioms in the ontology (date: {})'.format(len(pw_graph), datetime.datetime.now().strftime('%m/%d/%Y')))

_Reformat Mapping Results_  
Create a dictionary of mapping results where pathway ontology identifiers are values and the keys are `DbXRef` identifiers.


In [None]:
# get dbxref results
dbxref_res = gets_ontology_class_dbxrefs(pw_graph)[0]
dbxref_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in dbxref_res.items() if 'PW_' in str(v)}

# get synonym results
syn_res = gets_ontology_class_synonyms(pw_graph)[0]
synonym_dict = {str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v} for k, v in syn_res.items() if 'PW_' in str(v)}

# combine results into single dictionary
id_mappings = {**dbxref_dict, **synonym_dict}

print('There are {} results (date: {})'.format(len(id_mappings), datetime.datetime.now().strftime('%m/%d/%Y')))
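To make the reformatting step concrete, here is the same dictionary comprehension applied to a toy input. The keys and URIs are placeholders chosen for illustration; the real input is the first element returned by `gets_ontology_class_dbxrefs`.

```python
# toy stand-in for the cross-reference dictionary (placeholder identifiers)
toy_dbxrefs = {
    'KEGG:00010': {'http://purl.obolibrary.org/obo/PW_0000025'},
    'some:xref': {'http://example.org/not_a_pathway'},
}

toy_map = {
    # key: lower-cased cross-reference; value: PW identifiers written with ':'
    str(k).lower().split('/')[-1]: {str(i).split('/')[-1].replace('_', ':') for i in v}
    for k, v in toy_dbxrefs.items() if 'PW_' in str(v)
}
print(toy_map)
```

Entries whose values contain no `PW_` identifier are dropped, which is why the second toy key does not appear in the result.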

***

**Reactome Pathways**  
Download [Reactome Pathways](https://reactome.org/download/current/ReactomePathways.txt), [Reactome's GO Annotations](https://reactome.org/download/current/gene_association.reactome.gz), and [Reactome's mappings to ChEBI](https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt). Each file is filtered to include only human pathways.

_Reactome Pathway Stable Identifiers_

In [None]:
# download data
url = 'https://reactome.org/download/current/ReactomePathways.txt'
data_downloader(url, unprocessed_data_location)

# load data
reactome_pathways = pandas.read_csv(unprocessed_data_location + 'ReactomePathways.txt', header=None, delimiter='\t', low_memory=False)

In [None]:
# remove all non-human pathways and save as list
reactome_pathways = reactome_pathways.loc[reactome_pathways[2].apply(lambda x: x == 'Homo sapiens')] 
reactome_map = {x:set(['PW_0000001']) for x in set(list(reactome_pathways[0]))}     

_Reactome's Mappings to GO Annotations_

In [None]:
# download data
url = 'https://reactome.org/download/current/gene_association.reactome.gz'
data_downloader(url, unprocessed_data_location, 'gene_association.reactome.gz')

# load data
reactome_pathways2 = pandas.read_csv(unprocessed_data_location + 'gene_association.reactome', header=None, delimiter='\t', skiprows=4, low_memory=False)

In [None]:
# remove all non-human pathways and save as list
reactome_pathways2 = reactome_pathways2.loc[reactome_pathways2[12].apply(lambda x: x == 'taxon:9606')] 
reactome_map.update({x.split(':')[-1]:set(['PW_0000001']) for x in set(list(reactome_pathways2[5]))})     

_Reactome's Mappings to ChEBI_

In [None]:
# download data
url = 'https://reactome.org/download/current/ChEBI2Reactome_All_Levels.txt'
data_downloader(url, unprocessed_data_location, 'ChEBI2Reactome_All_Levels.txt' )

# load data
reactome_pathways3 = pandas.read_csv(unprocessed_data_location + 'ChEBI2Reactome_All_Levels.txt', header=None, delimiter='\t', low_memory=False)

In [None]:
# remove all non-human pathways and save as list
reactome_pathways3 = reactome_pathways3.loc[reactome_pathways3[5].apply(lambda x: x == 'Homo sapiens')] 
reactome_map.update({x:set(['PW_0000001']) for x in set(list(reactome_pathways3[1]))})     

***

**ComPath Reactome Pathway Mappings**  
Use [ComPath Mappings](https://github.com/ComPath/resources/tree/master/mappings) to obtain the following mappings:  `Reactome Pathways`  ➞ `KEGG Pathways` ➞ `Pathway Ontology` 

_Canonical Pathways_

In [None]:
# download data
url1 = 'http://compath.scai.fraunhofer.de/export_mappings'
data_downloader(url1, unprocessed_data_location, 'compath_canonical_pathway_mappings.txt')

# load data
compath_cannonical = pandas.read_csv(unprocessed_data_location + 'compath_canonical_pathway_mappings.txt', header=None, delimiter='\t', low_memory=False)
compath_cannonical.fillna('None', inplace=True)

In [None]:
for idx, row in tqdm(compath_cannonical.iterrows(), total=compath_cannonical.shape[0]):
    if row[6] == 'kegg' and 'kegg:' + row[5].strip('path:hsa') in id_mappings.keys() and row[2] == 'reactome':
        for x in id_mappings['kegg:' + row[5].strip('path:hsa')]:
            if row[1] in reactome_map.keys(): reactome_map[row[1]] |= set([x.split('/')[-1]])
            else: reactome_map[row[1]] = set([x.split('/')[-1]])
    if (row[2] == 'kegg' and 'kegg:' + row[1].strip('path:hsa') in id_mappings.keys()) and row[6] == 'reactome':
        for x in id_mappings['kegg:' + row[1].strip('path:hsa')]:
            if row[5] in reactome_map.keys(): reactome_map[row[5]] |= set([x.split('/')[-1]])
            else: reactome_map[row[5]] = set([x.split('/')[-1]])         
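A note on `strip('path:hsa')`: `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. It gives the expected result here only because KEGG identifiers end in digits. The sketch below contrasts it with explicit prefix removal; `kegg_key` is a hypothetical helper and the second identifier is made up to show the difference.

```python
def kegg_key(kegg_id: str) -> str:
    """Turn 'path:hsa00010' into the 'kegg:00010' key format used for look-ups."""
    prefix = 'path:hsa'
    if kegg_id.startswith(prefix):
        kegg_id = kegg_id[len(prefix):]
    return 'kegg:' + kegg_id


# strip() removes any of the characters p, a, t, h, ':' and s from both ends
print('path:hsa00010'.strip('path:hsa'))
print(kegg_key('path:hsa00010'))

# made-up identifier ending in a letter: strip() also eats the trailing 'a'
print('path:hsa00010a'.strip('path:hsa'))
print(kegg_key('path:hsa00010a'))
```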

_KEGG - Reactome Mappings_

In [None]:
# download data
url2 = 'https://raw.githubusercontent.com/ComPath/resources/master/mappings/kegg_reactome.csv'
data_downloader(url2, unprocessed_data_location, 'kegg_reactome.csv')

# load data
kegg_reactome_map = pandas.read_csv(unprocessed_data_location + 'kegg_reactome.csv', header=0, delimiter=',', low_memory=False)

In [None]:
for idx, row in tqdm(kegg_reactome_map.iterrows(), total=kegg_reactome_map.shape[0]):
    if row['Source Resource'] == 'reactome' and 'kegg:' + row['Target ID'].strip('path:hsa') in id_mappings.keys():
        for x in id_mappings['kegg:' + row['Target ID'].strip('path:hsa')]:
            if row['Source ID'] in reactome_map.keys(): reactome_map[row['Source ID']] |= set([x.split('/')[-1]])
            else: reactome_map[row['Source ID']] = set([x.split('/')[-1]])
    if row['Target Resource'] == 'reactome' and 'kegg:' + row['Source ID'].strip('path:hsa') in id_mappings.keys():
        for x in id_mappings['kegg:' + row['Source ID'].strip('path:hsa')]:
            if row['Target ID'] in reactome_map.keys(): reactome_map[row['Target ID']] |= set([x.split('/')[-1]])
            else: reactome_map[row['Target ID']] = set([x.split('/')[-1]])
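The repeated "update the set if the key exists, otherwise create it" pattern in these loops can be written more compactly with `dict.setdefault`. This is an equivalent sketch on made-up identifiers; `add_mapping` is a hypothetical helper, not a `pkt_kg` function.

```python
def add_mapping(mapping: dict, pathway_id: str, ontology_id: str) -> None:
    """Add ontology_id to the set stored under pathway_id, creating it if needed."""
    mapping.setdefault(pathway_id, set()).add(ontology_id)


# made-up identifiers, for illustration only
reactome_toy = {'R-HSA-0000001': {'PW_0000001'}}
add_mapping(reactome_toy, 'R-HSA-0000001', 'PW:0000002')  # existing key: set grows
add_mapping(reactome_toy, 'R-HSA-0000002', 'PW:0000003')  # new key: set is created
print(reactome_toy)
```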

***

**Reactome Pathway GO Annotation Mappings**  
Use Reactome's [API](https://reactome.org/dev/content-service) to obtain the following mappings: `Reactome Pathway Identifiers`  ➞ `Gene Ontology Identifiers`.

In [None]:
# query Reactome in batches of 20 identifiers
key = 'goBiologicalProcess'
for request_ids in tqdm(list(chunks(list(reactome_map.keys()), 20))):
    result = content.query_ids(ids=','.join(request_ids))
    # only a list of records can be iterated below; None and error responses are skipped
    if isinstance(result, list):
        for res in result:
        for res in result:
            if key in res.keys():
                if res['stId'] in reactome_map.keys(): reactome_map[res['stId']] |= {'GO_' + res[key]['accession']}
                else: reactome_map[res['stId']] = {'GO_' + res[key]['accession']}
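The batching used above can be illustrated without calling the API. The sketch below assumes that `chunks` (imported from `pkt_kg.utils`) splits a list into consecutive slices of a fixed size; `chunk_ids` is a hypothetical stand-in written for this example, and the identifiers are made up.

```python
from typing import Iterator, List, Sequence


def chunk_ids(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` identifiers."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# made-up identifiers, batched two at a time
batches = list(chunk_ids(['R-HSA-1', 'R-HSA-2', 'R-HSA-3', 'R-HSA-4', 'R-HSA-5'], 2))
# each batch becomes one comma-separated query string
query_strings = [','.join(batch) for batch in batches]
print(query_strings)
```

Batching keeps each request short; the last batch simply holds whatever identifiers remain.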

*Write Data*

In [None]:
# reformat identifiers: replace ':' with '_' in ontology concept identifiers
temp_dict = dict()
for key, value in tqdm(reactome_map.items()):
    temp_dict[key] = set(x.replace(':', '_') for x in value)

# overwrite original reactome dict with cleaned mappings
reactome_map = temp_dict

# output data
with open(processed_data_location + 'REACTOME_PW_GO_MAPPINGS.txt', 'w') as out:
    for key in tqdm(reactome_map.keys()):
        for x in reactome_map[key]:
            if x.startswith('PW') or x.startswith('GO'): out.write(key + '\t' + x + '\n')

In [None]:
# load data, print row count, and preview it
pw_data = pandas.read_csv(processed_data_location + 'REACTOME_PW_GO_MAPPINGS.txt', header=None, names=['Pathway_IDs', 'Mapping_IDs'], delimiter='\t')

print('There are {edge_count} pathway ontology mappings'.format(edge_count=len(pw_data)))
pw_data.head(n=5)

<br>

***

### Mapping Genomic Identifiers to the Sequence Ontology <a class="anchor" id="genomic-soo"></a>

**Data Source Wiki Page:** [Sequence Ontology](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#sequence-ontology)  

**Purpose:** This script downloads the `genomic_sequence_ontology_mappings.xlsx` file in order to create the following identifier mappings:  
- `Gene BioTypes`  ➞ `Sequence Ontology Identifiers`  
- `RNA BioTypes`  ➞ `Sequence Ontology Identifiers`  
- `Variant Types`  ➞ `Sequence Ontology Identifiers`

**Output:**  
- `SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt`


In [None]:
# download data
url='https://zenodo.org/records/10056198/files/genomic_sequence_ontology_mappings.xlsx.zip'
data_downloader(url, unprocessed_data_location)

# load data
mapping_data = pandas.read_excel(open(unprocessed_data_location + 'genomic_sequence_ontology_mappings.xlsx', 'rb'),
                                 sheet_name='GenomicType_SO_Map_09Mar2020', header=0, engine='openpyxl')

print(mapping_data.head(n=3))

# convert data to dictionary
genomic_type_so_map = {}
for idx, row in tqdm(mapping_data.iterrows(), total=mapping_data.shape[0]):
    if str(row['source_*_type']) != "nan":
        genomic_type_so_map[row['source_*_type'] + '_' + row['Genomic']] = row['SO ID']

In [None]:
genomic_type_so_map['artifact_Gene'] = 'SO_0002172'
genomic_type_so_map['artifact_Transcript'] = 'SO_0002172'
genomic_type_so_map['protein_coding_CDS_not_defined_Transcript'] = 'SO_0002249'
genomic_type_so_map['protein_coding_LoF_Transcript'] = 'SO_0001841'
genomic_type_so_map['variation_Variant'] = 'SO_0001060'

***

**Genes**

In [None]:
# read in genomic mapping data
genomic_mapped_ids = pickle.load(open(processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl', 'rb'))

sequence_map = {}
for identifier in tqdm(genomic_mapped_ids.keys()):    
    if identifier.startswith('entrez_id_') and identifier.replace('entrez_id_', '') != 'None':
        id_clean = identifier.replace('entrez_id_', '')
        
        # get identifier types
        ensembl = [x.replace('ensembl_gene_type_', '') for x in genomic_mapped_ids[identifier] if x.startswith('ensembl_gene_type') and x != 'ensembl_gene_type_unknown']
        hgnc = [x.replace('hgnc_gene_type_', '')  for x in genomic_mapped_ids[identifier] if x.startswith('hgnc_gene_type') and x != 'hgnc_gene_type_unknown']
        entrez = [x.replace('entrez_gene_type_', '')  for x in genomic_mapped_ids[identifier] if x.startswith('entrez_gene_type') and x != 'entrez_gene_type_unknown']
        
        # determine gene type
        if len(ensembl) > 0: gene_type = genomic_type_so_map[ensembl[0].replace('ensembl_gene_type_', '') + '_Gene']
        elif len(hgnc) > 0: gene_type = genomic_type_so_map[hgnc[0].replace('hgnc_gene_type_', '') + '_Gene']
        elif len(entrez) > 0: gene_type = genomic_type_so_map[entrez[0].replace('entrez_gene_type_', '') + '_Gene']
        else: gene_type = 'SO_0000704'  
        
        # update sequence map
        if id_clean in sequence_map.keys(): sequence_map[id_clean] += [gene_type]
        else: sequence_map[id_clean] = [gene_type]
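The gene-type choice above follows a fixed priority: Ensembl first, then HGNC, then Entrez, and finally the generic `SO_0000704` fallback. A compact sketch of that rule follows. `pick_gene_type` is a hypothetical helper and the toy map values are illustrative. Unlike the cell above, it also falls back to the default when a type is missing from the map instead of raising a `KeyError`.

```python
DEFAULT_GENE_SO = 'SO_0000704'  # fallback identifier used in the cell above


def pick_gene_type(ensembl, hgnc, entrez, type_map):
    """Return the SO identifier of the first available source, in priority order."""
    for candidates in (ensembl, hgnc, entrez):
        if candidates:
            # design choice of this sketch: unknown types map to the default
            return type_map.get(candidates[0] + '_Gene', DEFAULT_GENE_SO)
    return DEFAULT_GENE_SO


# toy look-up table (illustrative values)
toy_so_map = {'protein-coding_Gene': 'SO_0001217', 'lncRNA_Gene': 'SO_0002127'}

print(pick_gene_type([], ['lncRNA'], ['protein-coding'], toy_so_map))
print(pick_gene_type([], [], [], toy_so_map))
```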

***

**Transcripts**

In [None]:
# read in processed Ensembl Transcript data 
transcript_data = pandas.read_csv(processed_data_location + 'ensembl_identifier_data_cleaned.txt', header=0, delimiter='\t', low_memory=False)

# convert to dictionary
transcripts = {}
for idx, row in tqdm(transcript_data.iterrows(), total=transcript_data.shape[0]):
    if row['transcript_stable_id'] != 'None':
        if row['transcript_stable_id'].replace('transcript_stable_id_', '') in transcripts.keys():
            transcripts[row['transcript_stable_id'].replace('transcript_stable_id_', '')] += [row['ensembl_transcript_type']]
        else: transcripts[row['transcript_stable_id'].replace('transcript_stable_id_', '')] = [row['ensembl_transcript_type']]
            
# update so map dictionary
for identifier in tqdm(transcripts.keys()):
    if transcripts[identifier][0] == 'protein_coding': trans_type = genomic_type_so_map['protein-coding_Transcript']
    elif transcripts[identifier][0] == 'misc_RNA': trans_type = genomic_type_so_map['miscRNA_Transcript']
    else: trans_type = genomic_type_so_map[str(list(set(transcripts[identifier]))[0]).replace('nan','miscRNA') + '_Transcript']
    sequence_map[identifier] = [trans_type, 'SO_0000673']

***

**Variants**

In [None]:
# read in variant summary data 
url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz'
data_downloader(url, unprocessed_data_location)
    
# load data    
variant_data = pandas.read_csv(unprocessed_data_location + 'variant_summary.txt', header=0, delimiter='\t', low_memory=False)

# convert to dictionary
variants = {}
for idx, row in tqdm(variant_data.iterrows(), total=variant_data.shape[0]):
    if row['Assembly'] == 'GRCh38' and row['RS# (dbSNP)'] != -1:
        if 'rs' + str(row['RS# (dbSNP)']) in variants.keys(): variants['rs' + str(row['RS# (dbSNP)'])] |= set([row['Type']])
        else: variants['rs' + str(row['RS# (dbSNP)'])] = set([row['Type']])

# update so map dictionary
for identifier in tqdm(variants.keys()):
    for typ in variants[identifier]:
        var_type = genomic_type_so_map[typ.lower() + '_Variant']
        if identifier in sequence_map.keys(): sequence_map[identifier] += [var_type]
        else: sequence_map[identifier] = [var_type]
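The filtering and grouping logic of the variant cell can be tried on a handful of rows. This sketch uses plain dictionaries with the same column names as `variant_summary.txt`; the row values are made up.

```python
from collections import defaultdict

# made-up rows using the column names read by the cell above
rows = [
    {'Assembly': 'GRCh38', 'RS# (dbSNP)': 123, 'Type': 'Deletion'},
    {'Assembly': 'GRCh37', 'RS# (dbSNP)': 123, 'Type': 'Deletion'},
    {'Assembly': 'GRCh38', 'RS# (dbSNP)': 123, 'Type': 'Indel'},
    {'Assembly': 'GRCh38', 'RS# (dbSNP)': -1, 'Type': 'Duplication'},
]

variant_types = defaultdict(set)
for row in rows:
    # keep GRCh38 records that carry a dbSNP identifier (-1 marks a missing one)
    if row['Assembly'] == 'GRCh38' and row['RS# (dbSNP)'] != -1:
        variant_types['rs' + str(row['RS# (dbSNP)'])].add(row['Type'])

print(dict(variant_types))
```

Each rsID ends up with the set of distinct variant types reported for it, which is what the second loop then translates into Sequence Ontology identifiers.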

*** 
**Write Data**

In [None]:
# reformat data and write it out
with open(processed_data_location + 'SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt', 'w') as outfile:
    for key in tqdm(sequence_map.keys()):
        for map_type in sequence_map[key]:
            outfile.write(key + '\t' + map_type + '\n')

# load data, print row count, and preview it
so_data = pandas.read_csv(processed_data_location + 'SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt', header=None, delimiter='\t', names=['Identifier', 'Sequence_Ontology_ID'])

print('There are {edge_count} sequence ontology mappings'.format(edge_count=len(so_data)))
so_data.head(n=5)

In [None]:
# Probably not needed, but to be verified. The block below is disabled by wrapping it in a triple-quoted string.
'''
***

**Combine Pathway and Sequence Ontology Mapping Data in Dictionary**  
Combine the pathway and sequence mapping data into a dictionary and output it.

# combine genomic and pathway maps
subclass_mapping = {}  
sequence_map.update(reactome_map)

# iterate over pathway lists and combine them
for key in tqdm(sequence_map.keys()):
    subclass_mapping[key] = sequence_map[key]

# save a copy of the dictionary
pickle.dump(subclass_mapping, open(processed_data_location + 'subclass_construction_map.pkl', 'wb'), protocol=4)
'''

***
### Chemical labels+synonyms from ChEBI - ChEBI mapping


**Purpose:** To map Chemical labels+synonyms from ChEBI to ChEBI identifiers.

**Output:** `DESC_CHEBI_MAP.txt`

In [8]:
# Get the labels of all ontology classes (lower-cased label -> list of class URIs)
from typing import Tuple

def gets_ontology_class_label(graph: Graph) -> Tuple:
    dbx_uris: Dict = dict()
    dbx = [x for x in graph if 'label' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in dbx:
        if str(x[2]).lower() in dbx_uris.keys(): dbx_uris[str(x[2]).lower()].append(str(x[0]))
        else: dbx_uris[str(x[2]).lower()] = [str(x[0])]
    dbx_type = {str(x[2]).lower(): 'DbXref' for x in dbx}

    ex_uris: Dict = dict()
    ex = [x for x in graph if 'exactmatch' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in ex:
        if str(x[2]).lower() in ex_uris.keys(): ex_uris[str(x[2]).lower()].append(str(x[0]))
        else: ex_uris[str(x[2]).lower()] = [str(x[0])]
    ex_type = {str(x[2]).lower(): 'ExactMatch' for x in ex}

    return {**dbx_uris, **ex_uris}, {**dbx_type, **ex_type}

In [9]:
# Get the synonyms of all ontology classes (lower-cased synonym -> list of class URIs)
def gets_ontology_class_synonym(graph: Graph) -> Tuple:
    dbx_uris: Dict = dict()
    dbx = [x for x in graph if 'synonym' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in dbx:
        if str(x[2]).lower() in dbx_uris.keys(): dbx_uris[str(x[2]).lower()].append(str(x[0]))
        else: dbx_uris[str(x[2]).lower()] = [str(x[0])]
    dbx_type = {str(x[2]).lower(): 'DbXref' for x in dbx}

    ex_uris: Dict = dict()
    ex = [x for x in graph if 'exactmatch' in str(x[1]).lower() if isinstance(x[0], URIRef)]
    for x in ex:
        if str(x[2]).lower() in ex_uris.keys(): ex_uris[str(x[2]).lower()].append(str(x[0]))
        else: ex_uris[str(x[2]).lower()] = [str(x[0])]
    ex_type = {str(x[2]).lower(): 'ExactMatch' for x in ex}

    return {**dbx_uris, **ex_uris}, {**dbx_type, **ex_type}

In [10]:
# Get label+synonym look-up table for an ontology
def gets_ontology_lookup(ontology_name, with_import=True):
    # with_import=True: integrated ontologies; with_import=False: ontologies used to standardize edge metadata
    if with_import:
        graph = Graph().parse(ontology_data_location + ontology_name + '_with_imports.owl')
    else:
        graph = Graph().parse(ontology_data_location + ontology_name + '.owl')

    label = gets_ontology_class_label(graph)[0]
    graph_dict = {str(k): {str(i).split('/')[-1] for i in v} for k, v in label.items()}

    with open(unprocessed_data_location + 'DESC_' + ontology_name.upper() + '_MAP.txt', 'w') as outfile:
        for k, v in {**graph_dict}.items():
            outfile.write(str(k) + '\t' + str(v).replace('{','').replace('\'','').replace('}','') + '\n')

    desc_map = pd.read_csv(unprocessed_data_location+'DESC_' + ontology_name.upper() + '_MAP.txt',
                           header=None, delimiter='\t')
    desc_map[1] = desc_map[1].str.split(', ')
    desc_map = desc_map.explode(1)

    syn = gets_ontology_class_synonym(graph)[0]
    graph_dict = {str(k): {str(i).split('/')[-1] for i in v} for k, v in syn.items()}

    with open(unprocessed_data_location + 'SYN_' + ontology_name.upper() + '_MAP.txt', 'w') as outfile:
        for k, v in {**graph_dict}.items():
            outfile.write(str(k) + '\t' + str(v).replace('{','').replace('\'','').replace('}','') + '\n')

    syn_map = pd.read_csv(unprocessed_data_location+'SYN_' + ontology_name.upper() + '_MAP.txt',
                          header=None, delimiter='\t')
    syn_map[1] = syn_map[1].str.split(', ')
    syn_map = syn_map.explode(1)
    desc_map = pd.concat([desc_map, syn_map], ignore_index=True).drop_duplicates()
    desc_map.to_csv(processed_data_location + 'DESC_' + ontology_name.upper() + '_MAP.txt',
                    header=None, sep='\t', index=None)
    return desc_map

In [None]:
desc_chebi_map = gets_ontology_lookup('chebi')
desc_chebi_map.head(n=3)
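The look-up table has one row per (label, identifier) pair, so a label shared by several classes yields several rows. The sketch below is a plain-Python analogue of the split-and-`explode` step inside `gets_ontology_lookup`; `explode_ids` is a hypothetical helper and the toy dictionary is for illustration only.

```python
def explode_ids(label_to_ids):
    """Yield one (label, identifier) pair per identifier, sorted for stable output."""
    for label, ids in label_to_ids.items():
        for identifier in sorted(ids):
            yield label, identifier


# toy label dictionary in the shape built from the ontology graph
toy_labels = {'water': {'CHEBI_15377'}, 'shared label': {'CHEBI_1', 'CHEBI_2'}}
pairs = list(explode_ids(toy_labels))
print(pairs)
```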

In [None]:
# If chunks above have already been run, uncomment and run the following line to speed up construction:
#desc_chebi_map = pd.read_csv(processed_data_location + 'DESC_CHEBI_MAP.txt', header=None, sep='\t')

***
### Genomics label+synonym from SO - SO mapping


**Purpose:** To map genomics terms' label+synonym from SO to SO identifiers.

**Output:** `DESC_SO_MAP.txt`

In [None]:
desc_so_map = gets_ontology_lookup('so')
desc_so_map.head(n=3)

In [None]:
# If chunks above have already been run, uncomment and run the following line to speed up construction:
#desc_so_map = pd.read_csv(processed_data_location + 'DESC_SO_MAP.txt', header=None, sep='\t')

***
### GO terms' label+synonym from GO - GO mapping


**Purpose:** To map GO terms' label+synonym from GO to GO identifiers.

**Output:** `DESC_GO_MAP.txt`

In [None]:
desc_go_map = gets_ontology_lookup('go')
desc_go_map.head(n=3)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_go_map = pd.read_csv(processed_data_location + 'DESC_GO_MAP.txt', header=None, sep='\t')

***
### Pathways labels from Reactome - Reactome mapping


**Purpose:** To map Reactome pathway labels to Reactome identifiers.

**Output:** `DESC_REACTOME_MAP.txt`

In [None]:
data_downloader('https://raw.githubusercontent.com/ComPath/resources/master/mappings/kegg_reactome.csv',
                unprocessed_data_location, 'kegg_reactome.csv')

kegg_reactome_map = pd.read_csv(unprocessed_data_location + 'kegg_reactome.csv', header=0, delimiter=',')[['Source Name','Source ID']]
kegg_reactome_map.columns=[0,1]
kegg_reactome_map[0] = kegg_reactome_map[0].str.lower()
kegg_reactome_map.head(n=3)

In [None]:
data_downloader('https://reactome.org/download/current/ReactomePathways.txt', unprocessed_data_location)

reactome_pathways = pd.read_csv(unprocessed_data_location + 'ReactomePathways.txt', header=None, delimiter='\t')
# remove all non-human pathways
reactome_pathways = reactome_pathways[reactome_pathways[2] == 'Homo sapiens'][[0,1]]
reactome_pathways.columns=[1,0]
reactome_pathways[0] = reactome_pathways[0].str.lower()
reactome_pathways.head(n=3)

In [None]:
desc_reactome_map = pd.concat([kegg_reactome_map, reactome_pathways])
desc_reactome_map.to_csv(processed_data_location + "DESC_REACTOME_MAP.txt", header=False, sep="\t",index=False)
desc_reactome_map.head(n=3)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_reactome_map = pd.read_csv(processed_data_location + 'DESC_REACTOME_MAP.txt', header=None, sep='\t')

***
### Pathway labels from WikiPathways - WikiPathways mapping


**Purpose:** To map pathway labels from WikiPathways to WikiPathways identifiers.

**Output:** `DESC_WIKIPATHWAYS_MAP.txt`

In [None]:
url = 'https://data.wikipathways.org/current/gmt/wikipathways-20250110-gmt-Homo_sapiens.gmt'
data_downloader(url, unprocessed_data_location)

with open(unprocessed_data_location+'wikipathways-20250110-gmt-Homo_sapiens.gmt', 'r') as file:
    data = file.read().rstrip()
    
desc_wpw_map = pd.DataFrame([ ln.rstrip().split('\t') for ln in
    io.StringIO(data).readlines() ]).fillna('')

desc_wpw_map = desc_wpw_map[[0,1]]
desc_wpw_map.columns=[0,1]
desc_wpw_map[0] = desc_wpw_map[0].str.replace(r'%WikiPathways_.*$', '', regex=True).str.lower()
desc_wpw_map[1] = desc_wpw_map[1].str.replace('https://www.wikipathways.org/instance/', '', regex=False)

desc_wpw_map.to_csv(processed_data_location + "DESC_WIKIPATHWAYS_MAP.txt", header=False, sep="\t",index=False)
desc_wpw_map.head(n=3)
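The two clean-up steps applied to the GMT columns can be followed on a single line. This is a sketch on a made-up record that mimics the layout the cell above assumes: a `%`-separated name field, a pathway URL, then gene identifiers.

```python
import re

# made-up GMT line: '<name>%WikiPathways_<date>%<id>%<species>' <TAB> '<url>' <TAB> genes...
line = ('Example pathway%WikiPathways_20250110%WP0000%Homo sapiens'
        '\thttps://www.wikipathways.org/instance/WP0000\t1\t2')

fields = line.rstrip().split('\t')
# column 0: keep only the pathway name, lower-cased
label = re.sub(r'%WikiPathways_.*$', '', fields[0]).lower()
# column 1: keep only the pathway identifier at the end of the URL
pathway_id = fields[1].replace('https://www.wikipathways.org/instance/', '')
print(label, pathway_id)
```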

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_wpw_map = pd.read_csv(processed_data_location + 'DESC_WIKIPATHWAYS_MAP.txt', header=None, sep='\t')

***
### Pathways labels from PW - PW mapping


**Purpose:** To map pathway labels and synonyms from PW to PW identifiers.

**Output:** `DESC_PW_MAP.txt`

In [None]:
desc_pw_map = gets_ontology_lookup('pw')
desc_pw_map.head(n=3)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_pw_map = pd.read_csv(processed_data_location + 'DESC_PW_MAP.txt', header=None, sep='\t')

***
### Disease labels+synonyms from Mondo - Mondo mapping


**Purpose:** To map disease labels+synonyms from Mondo to Mondo identifiers.

**Output:** `DESC_MONDO_MAP.txt`

In [None]:
desc_mondo_map = gets_ontology_lookup('mondo')
desc_mondo_map.head(n=3)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_mondo_map = pd.read_csv(processed_data_location + 'DESC_MONDO_MAP.txt', header=None, sep='\t')

***
### Phenotype labels+synonyms from HPO - HPO mapping


**Purpose:** To map Phenotype labels+synonyms from HPO to HPO identifiers.

**Output:** `DESC_HP_MAP.txt`

In [None]:
desc_hpo_map = gets_ontology_lookup('hp')
desc_hpo_map.head(n=3)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_hpo_map = pd.read_csv(processed_data_location + 'DESC_HP_MAP.txt', header=None, sep='\t')

We merge diseases and phenotypes since they are closely related. Moreover, "x-disease" and "x-phenotype" interactions share the same RO properties.

In [None]:
desc_disPhe_map = pd.concat([desc_mondo_map, desc_hpo_map]).drop_duplicates()
desc_disPhe_map.head(n=3)

***
### Disease Ontology (DO) - MONDO mapping <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map DO identifiers to MONDO identifiers.

**Output:** `DOID_MONDO_MAP.txt`

In [None]:
mondo_graph = Graph().parse(ontology_data_location + 'mondo_with_imports.owl')

mondo_dbxref = gets_ontology_class_dbxrefs(mondo_graph)[0]

# Fix DOIDs (substitute : with _)
mondo_dict = {str(k).replace(':','_').upper(): {str(i).split('/')[-1].replace(':','_') for i in v}
              for k, v in mondo_dbxref.items() if 'doid' in str(k)}
list({**mondo_dict}.items())[:5]

In [None]:
with open(processed_data_location + 'DOID_MONDO_MAP.txt', 'w') as outfile:
    for k, v in mondo_dict.items():
        outfile.write(str(k) + '\t' + str(v).replace('{','').replace('\'','').replace('}','') + '\n')

In [None]:
doid_mondo_map = pd.read_csv(processed_data_location+'DOID_MONDO_MAP.txt', header=None, delimiter='\t')
doid_mondo_map[1] = doid_mondo_map[1].str.split(', ')
doid_mondo_map = doid_mondo_map.explode(1)
doid_mondo_map.head(n=3)

***
### Disease description from DO - DO mapping <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map Disease descriptions from DO to DO identifiers.

**Output:** None, this mapping will be used only internally.

Note: Provided by [miR2Disease](http://watson.compbio.iupui.edu:8080/miR2Disease/).

In [None]:
data_downloader('http://watson.compbio.iupui.edu:8080/miR2Disease/download/diseaseList.txt', unprocessed_data_location)

In [None]:
desc_do_map = pd.read_csv(unprocessed_data_location + 'diseaseList.txt', sep="\t")
desc_do_map.columns = ['desc', 'doid']
desc_do_map['desc'] = desc_do_map['desc'].str.lower()
desc_do_map['doid'] = desc_do_map['doid'].str.upper().str.replace(':', '_')
desc_do_map.head(n=3)

***
### TCGA - MONDO mapping <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To manually map the 32 TCGA cancer types to the MONDO ontology.

**Output:** `TCGA_MONDO_MAP.txt`

In [None]:
cancer_mondo_map = pd.DataFrame(data=[['ACC','MONDO_0004971'],
                                 ['BLCA','MONDO_0004163'],
                                 ['BRCA','MONDO_0006256'],
                                 ['CESC','MONDO_0005131'],
                                 ['CCRCC','MONDO_0005086'],
                                 ['CHOL','MONDO_0019087'],
                                 ['COAD','MONDO_0002271'],
                                 ['DLBC','MONDO_0018905'],
                                 ['ESCA','MONDO_0019086'],
                                 ['GBM','MONDO_0018177'],
                                 ['HNSC','MONDO_0010150'],
                                 ['KICH','MONDO_0017885'],
                                 ['KIRC','MONDO_0005005'],
                                 ['KIRP','MONDO_0017884'],
                                 ['LGG','MONDO_0005499'],
                                 ['LIHC','MONDO_0007256'],
                                 ['LUAD','MONDO_0005061'],
                                 ['LUSC','MONDO_0005097'],
                                 ['MESO','MONDO_0005065'],
                                 ['OV','MONDO_0006046'],
                                 ['PAAD','MONDO_0006047'],
                                 ['PCPG','MONDO_0035540'],
                                 ['PRAD','MONDO_0005082'],
                                 ['READ','MONDO_0002169'],
                                 ['SARC','MONDO_0005089'],
                                 ['SKCM','MONDO_0005012'],
                                 ['STAD','MONDO_0005036'],
                                 ['TGCT','MONDO_0010108'],
                                 ['THCA','MONDO_0015075'],
                                 ['THYM','MONDO_0006456'],
                                 ['UCEC','MONDO_0000553'],
                                 ['UCS','MONDO_0006485'],
                                 ['UVM','MONDO_0006486']
                                 ])

cancer_mondo_map.to_csv(processed_data_location + 'TCGA_MONDO_MAP.txt', header=None, sep='\t', index=None)
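Once written, the table acts as a simple code-to-identifier look-up. The sketch below reuses three rows from the table above; `to_mondo` is a hypothetical helper, and trimming whitespace and upper-casing the code are assumptions about how study codes might arrive.

```python
# three rows copied from the table above
tcga_rows = [['ACC', 'MONDO_0004971'], ['BLCA', 'MONDO_0004163'], ['BRCA', 'MONDO_0006256']]
tcga_to_mondo = dict(tcga_rows)


def to_mondo(code: str):
    """Return the MONDO identifier for a TCGA study code, or None if unmapped."""
    return tcga_to_mondo.get(code.strip().upper())


print(to_mondo(' brca '))
print(to_mondo('XXXX'))
```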

In [None]:
# TODO: check whether this mapping is still needed
term_mapping = {
    'liver carcinoma': 'MONDO_0007256',
    'oesophagus carcinoma': 'MONDO_0019086',
    'breast carcinoma': 'MONDO_0004989',
    'lung carcinoma': 'MONDO_0005138',
    'haematopoietic and lymphoid tissue carcinoma': 'MONDO_0017348',
    'prostate carcinoma': 'MONDO_0005159',
    'large intestine carcinoma': 'MONDO_0024331',
    'skin carcinoma': 'MONDO_0002656',
    'pancreas carcinoma': 'MONDO_0006047',
    'central nervous system carcinoma': 'MONDO_0006130',
    'biliary tract carcinoma': 'MONDO_0003707',
    'endometrium carcinoma': 'MONDO_0005461',
    'ovary carcinoma': 'MONDO_0005140',
    'kidney carcinoma': 'MONDO_0005206',
    'urinary tract carcinoma': 'MONDO_0040679',
    'cervix carcinoma': 'MONDO_0005131',
    'soft tissue carcinoma': 'MONDO_0006424',
    'stomach carcinoma': 'MONDO_0004950',
    'bone carcinoma': 'MONDO_0002415',
    'small intestine carcinoma': 'MONDO_0005522',
    'thyroid carcinoma': 'MONDO_0015075',
    'upper aerodigestive tract carcinoma': 'MONDO_0005398',
    'placenta carcinoma': 'MONDO_0002178',
    'salivary gland carcinoma': 'MONDO_0000521',
    'adrenal gland carcinoma': 'MONDO_0002814',
    'autonomic ganglia carcinoma': 'MONDO_0003996',
    'meninges carcinoma': 'MONDO_0021322',
    'eye carcinoma': 'MONDO_0002466',
    'genital tract carcinoma': 'MONDO_0005140',
    'pleura carcinoma': 'MONDO_0006294',
    'parathyroid carcinoma': 'MONDO_0012004',
    'thymus carcinoma': 'MONDO_0006451',
    'pituitary carcinoma': 'MONDO_0017582',
    'testis carcinoma': 'MONDO_0005447',
    'peritoneum carcinoma': 'MONDO_0002113',
    'uterine adnexa carcinoma': 'MONDO_0001351',
    'gastrointestinal tract carcinoma': 'MONDO_0006181',
    'fallopian tube carcinoma': 'MONDO_0006206',
    'penis carcinoma': 'MONDO_0006360',
    'vulva carcinoma': 'MONDO_0005215',
    'ns': np.nan
}

lncRNA_disease2['desc'] = lncRNA_disease2['desc'].map(term_mapping)

***
### Amino Acid - ChEBI mapping 


**Purpose:** To manually map amino acid abbreviations to ChEBI identifiers (SO could have been used as well).

**Output:** `AminoAcid_ChEBI_MAP.txt`

In [None]:
aa_chebi_map = pd.DataFrame(data=[['Leu','CHEBI_25017'],
                                 ['Phe','CHEBI_28044'],
                                 ['Ala','CHEBI_16449'],
                                 ['Asn','CHEBI_22653'],
                                 ['Glu','CHEBI_18237'],
                                 ['His','CHEBI_27570'],
                                 ['Asp','CHEBI_22660'],
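                                 ['Cys','CHEBI_22660'],  # NOTE: same ID as Asp above (likely a copy-paste error); cysteine is probably CHEBI_15356, verify in ChEBI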
                                 ['Gly','CHEBI_15428'],
                                 ['Ile','CHEBI_24898'],
                                 ['Lys','CHEBI_25094'],
                                 ['Met','CHEBI_16811'],
                                 ['Ser','CHEBI_17822'],
                                 ['Val','CHEBI_27266'],
                                 ['Gln','CHEBI_28300'],
                                 ['Arg','CHEBI_29016'],
                                 ['Pro','CHEBI_26271'],
                                 ['Thr','CHEBI_26986'],
                                 ['iMe','PR_000021937'],
                                 ['Trp','CHEBI_27897'],
                                 ['Tyr','CHEBI_18186']#,
                                 #['Sup','tRNA-Suppressor NOT GROUNDED']
                                 ])

aa_chebi_map.to_csv(processed_data_location + 'AminoAcid_ChEBI_MAP.txt', header=None, sep='\t', index=None)

***
### Gene symbol - PRO mapping <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map gene symbols to PRO identifiers.

**Output:** `GENE_SYMBOL_PRO_ONTOLOGY_MAP.txt`

In [None]:
symbol_ensembl_map = pd.read_csv(processed_data_location + 'GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt', sep="\t", header=None)
symbol_ensembl_map[[0,1]]

In [None]:
ensembl_pro_map = pd.read_csv(processed_data_location + 'ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt', sep="\t", header=None)
ensembl_pro_map[[1,0]]

In [None]:
symbol_to_pro = pd.merge(symbol_ensembl_map[[0,1]], ensembl_pro_map[[1,0]], left_on=[1], right_on=[0])
symbol_to_pro = symbol_to_pro[['0_x', '1_y']].drop_duplicates()
symbol_to_pro.head(n=3)
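
The two-step join above can be easier to follow with named columns. The following is a minimal sketch on invented toy data: the symbols, transcript IDs, and PRO IDs are placeholders, and the column names `symbol`, `enst`, and `pro` are assumptions for illustration, not names used by the processed files.

```python
import pandas as pd

# Toy look-up tables; all identifiers below are hypothetical placeholders
symbol_enst = pd.DataFrame(
    [['GENE_A', 'ENST_1'], ['GENE_A', 'ENST_2'], ['GENE_B', 'ENST_3']],
    columns=['symbol', 'enst'])
enst_pro = pd.DataFrame(
    [['ENST_1', 'PR_X'], ['ENST_2', 'PR_X'], ['ENST_4', 'PR_Y']],
    columns=['enst', 'pro'])

# Inner join on the shared transcript column, then keep one row per (symbol, PRO) pair
toy_symbol_pro = (pd.merge(symbol_enst, enst_pro, on='enst')[['symbol', 'pro']]
                  .drop_duplicates()
                  .reset_index(drop=True))
print(toy_symbol_pro)
```

Because the join is an inner join, symbols whose transcripts have no PRO entry (`GENE_B` here) are silently dropped from the result.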

In [None]:
symbol_to_pro.drop_duplicates().to_csv(processed_data_location+
                                       'GENE_SYMBOL_PRO_ONTOLOGY_MAP.txt', header=None,
                                       sep='\t', index=None)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#symbol_to_pro = pd.read_csv(processed_data_location+'GENE_SYMBOL_PRO_ONTOLOGY_MAP.txt',names=['0_x','1_y'],sep='\t')

***
### Tissue labels+synonyms from Uberon - Uberon mapping


**Purpose:** To map Tissue labels+synonyms from Uberon to Uberon identifiers.

**Output:** `DESC_EXT_MAP.txt`

In [None]:
desc_uberon_map = gets_ontology_lookup('ext')
desc_uberon_map.head(n=3)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_uberon_map = pd.read_csv(processed_data_location + 'DESC_EXT_MAP.txt', header=None, sep='\t')

***
### Cell line labels+synonyms from CLO - CLO mapping


**Purpose:** To map Cell line labels+synonyms from CLO to CLO identifiers.

**Output:** `DESC_CLO_MAP.txt`

In [None]:
desc_clo_map = gets_ontology_lookup('clo')
desc_clo_map.head(n=3)

In [None]:
# If chunk above has already been run, uncomment and run the following line to speed up construction:
#desc_clo_map = pd.read_csv(processed_data_location + 'DESC_CLO_MAP.txt', header=None, sep='\t')

In [None]:
# Also map cell-line labels with the trailing ' cell' / ' cells' removed (e.g., 'HeLa cell' -> 'HeLa')
ends_with_cell = desc_clo_map[0].str.contains(r' cells?$', regex=True, na=False)
desc_clo_map2 = desc_clo_map[ends_with_cell].copy()
# Strip only the trailing suffix; chained plain replaces would turn 'HeLa cells' into 'HeLas'
desc_clo_map2[0] = desc_clo_map2[0].str.replace(r' cells?$', '', regex=True)
desc_clo_map = pd.concat([desc_clo_map, desc_clo_map2]).drop_duplicates()
desc_clo_map[1] = desc_clo_map[1].str.split(', ')
desc_clo_map = desc_clo_map.explode(1)
desc_clo_map.to_csv(processed_data_location + 'DESC_CLO_MAP.txt', header=None, sep='\t', index=None)
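
When trimming the ` cell` / ` cells` suffix from cell-line labels, the order and anchoring of the replacement matter. The sketch below compares two approaches on illustrative labels (the labels are examples chosen for this sketch, not rows taken from CLO):

```python
import pandas as pd

labels = pd.Series(['HeLa cell', 'HEK293 cells', 'stem cell line'])  # illustrative labels only

# Chained literal replaces: ' cell' is removed first, so 'HEK293 cells' keeps a stray 's'
# and ' cell' is also removed from the middle of a label
chained = labels.str.replace(' cell', '', regex=False).str.replace(' cells', '', regex=False)

# Anchored regex: only a trailing ' cell' or ' cells' is removed
anchored = labels.str.replace(r' cells?$', '', regex=True)

print(chained.tolist())
print(anchored.tolist())
```

The anchored pattern leaves labels such as `stem cell line` untouched, which is usually the intended behaviour for a suffix-stripping step.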

***
### Protein labels+synonyms from PRO - PRO mapping


**Purpose:** To map Protein labels+synonyms from PRO to PRO identifiers.

**Output:** `DESC_PR_MAP.txt` and `DESC_PR_ALL_MAP.txt`

Note: The employed PRO ontology is trimmed to contain only human and viral proteins.

In [None]:
desc_pro_map_all = gets_ontology_lookup('pr')
# Remove genes
desc_pro_map_all = desc_pro_map_all[~desc_pro_map_all[1].str.startswith('gene_symbol_report?hgnc_id=')]
desc_pro_map_all.to_csv(processed_data_location + 'DESC_PR_ALL_MAP.txt', header=None, sep='\t', index=False)

In [None]:
desc_pro_map = desc_pro_map_all.copy()
# Prefer human-specific proteins: collect labels that mention 'human' and strip that qualifier
desc_pro_map_human = desc_pro_map.dropna()[desc_pro_map.dropna()[0].str.contains('human', case=False)].copy()
desc_pro_map_human[0] = desc_pro_map_human[0].str.replace("human ", '')
desc_pro_map_human[0] = desc_pro_map_human[0].str.replace("human", '')
desc_pro_map_human[0] = desc_pro_map_human[0].str.replace(r" \(", '', regex=True)
desc_pro_map_human[0] = desc_pro_map_human[0].str.replace(r"\)", '', regex=True)
desc_pro_map_human[0] = desc_pro_map_human[0].str.replace(r",(.*)", '', regex=True)
desc_pro_map_human[1] = desc_pro_map_human[1].str.split(', ')
desc_pro_map_human = desc_pro_map_human.explode(1)
desc_pro_map_human.head(n=3)

In [None]:
desc_pro_map[0] = desc_pro_map[0].str.replace("human ", '')
desc_pro_map[0] = desc_pro_map[0].str.replace("human", '')
desc_pro_map[0] = desc_pro_map[0].str.replace(r" \(", '', regex=True)
desc_pro_map[0] = desc_pro_map[0].str.replace(r"\)", '', regex=True)
desc_pro_map[0] = desc_pro_map[0].str.replace(r",(.*)", '', regex=True)
desc_pro_map[1] = desc_pro_map[1].str.split(', ')
desc_pro_map = desc_pro_map.explode(1)
desc_pro_map = desc_pro_map[~desc_pro_map[0].isin(desc_pro_map_human[0])]
desc_pro_map = pd.concat([desc_pro_map, desc_pro_map_human]).drop_duplicates()
desc_pro_map.to_csv(processed_data_location + 'DESC_PR_MAP.txt', header=None, sep='\t', index=False)
desc_pro_map.head(n=3)

With this modified look-up table, an entity x is linked to the human-specific term "double-stranded RNA-activated factor 1 complex (human)" (PR_000027111) rather than to the generic term "double-stranded RNA-activated factor 1 complex" (PR_000027110).
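
The preference for human-specific terms can be illustrated on a three-row toy table. This is a simplified sketch, not the exact cleaning chain used above: the third row is a hypothetical placeholder, the column names `label` and `pro_id` are assumptions, and only the ` (human)` qualifier is handled.

```python
import pandas as pd

lookup = pd.DataFrame({
    'label': ['double-stranded RNA-activated factor 1 complex',
              'double-stranded RNA-activated factor 1 complex (human)',
              'hypothetical protein'],                      # placeholder row
    'pro_id': ['PR_000027110', 'PR_000027111', 'PR_EXAMPLE'],
})

# Human-specific rows, with the qualifier stripped from the label
is_human = lookup['label'].str.contains('human', case=False)
human = lookup[is_human].copy()
human['label'] = human['label'].str.replace(r'\s*\(human\)', '', regex=True)

# Generic rows are kept only when no human-specific row shares the same label
generic = lookup[~is_human]
generic = generic[~generic['label'].isin(human['label'])]

preferred = pd.concat([generic, human]).drop_duplicates()
label_to_id = dict(zip(preferred['label'], preferred['pro_id']))
print(label_to_id)
```

After this step each label resolves to a single identifier, with the human-specific one winning whenever both exist.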

In [None]:
# If chunk above has already been run, uncomment and run the following lines to speed up construction:
#desc_pro_map_all = pd.read_csv(processed_data_location + 'DESC_PR_ALL_MAP.txt', header=None, sep='\t')
#desc_pro_map = pd.read_csv(processed_data_location + 'DESC_PR_MAP.txt', header=None, sep='\t')

***
### NCI Thesaurus labels+synonyms from NCIT - NCIT mapping


**Purpose:** To map NCI Thesaurus labels+synonyms from NCIT to NCIT identifiers.

**Output:** `DESC_NCIT_MAP.txt`

Note: This is **not** an integrated ontology, but we use NCIT to standardize edge metadata as much as possible.

In [None]:
command = '{} {} --merge-import-closure -o {}'
os.system(command.format(owltools_location, 'http://purl.obolibrary.org/obo/ncit.owl',
                         ontology_data_location + 'ncit.owl'))

In [None]:
desc_ncit_map = gets_ontology_lookup('ncit', with_import=False)
os.remove(ontology_data_location + "ncit.owl")
desc_ncit_map.to_csv(processed_data_location + 'DESC_NCIT_MAP.txt', header=None, sep='\t', index=False)
desc_ncit_map.head(n=3)

In [None]:
# If chunks above have already been run, uncomment and run the following line to speed up construction:
#desc_ncit_map = pd.read_csv(processed_data_location + 'DESC_NCIT_MAP.txt', header=None, sep='\t')

***
### Gene symbol - ENTREZ mapping <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map gene symbols to ENTREZ identifiers.

**Output:** `GENE_SYMBOL_ENTREZ_ID_MAP.txt`

In [None]:
entrez_enst_map = pd.read_csv(processed_data_location + 'ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt', sep="\t", header=None)
entrez_enst_map.head(n=3)

In [None]:
symbol_entrez_map = pd.merge(symbol_ensembl_map, entrez_enst_map, on=[1])
symbol_entrez_map = symbol_entrez_map[['0_x','0_y']].drop_duplicates()
symbol_entrez_map.head(n=3)

In [None]:
symbol_entrez_map.to_csv(processed_data_location+'GENE_SYMBOL_ENTREZ_ID_MAP.txt',header=None, sep='\t', index=None)

In [None]:
# If chunks above have already been run, uncomment and run the following line to speed up construction:
#symbol_entrez_map = pd.read_csv(processed_data_location+'GENE_SYMBOL_ENTREZ_ID_MAP.txt',names=['0_x','0_y'],sep='\t')

***
### Ribozyme - RFAM mapping 

**Purpose:** To map ribozyme names to Rfam identifiers.

**Output:** `ribozyme_RFAM_MAP.txt`

In [None]:
ribozyme_rfam_map = pd.DataFrame(data=[['LC ribozyme','RF00011'],
                                 ['hammerhead ribozyme','RF00008,RF00163'],
                                 ['glmS ribozyme','RF00234'],
                                 ['HDV-F-prausnitzii','RF02682'],
                                 ['HDV ribozyme','RF00094'],
                                 ['HDV_ribozyme','RF00094'],
                                 ['Hairpin','RF00173'],
                                 ['Hammerhead_1','RF00163'],
                                 ['Hammerhead_HH9','RF02275'],
                                 ['Hammerhead_3','RF00008'],
                                 ['Hammerhead_HH10','RF02277'],
                                 ['Hammerhead_II','RF02276'],
                                 ['Pistol','RF02679'],
                                 ['Pistol ribozyme','RF02679'],
                                 ['twister ribozyme','RF03160'],
                                 ['Twister-P5','RF02684'],
                                 ['Twister-P3','RF03154'],
                                 ['RNAse P','RF00009'],
                                 ['VS ribozyme','RF00010']
                                 ])

ribozyme_rfam_map[1] = ribozyme_rfam_map[1].str.split(',')
ribozyme_rfam_map = ribozyme_rfam_map.explode(1)

ribozyme_rfam_map.to_csv(processed_data_location + 'ribozyme_RFAM_MAP.txt', header=None, sep='\t', index=None)
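
The split-and-explode step turns one row with several comma-separated Rfam families into one row per family. A minimal sketch on two rows follows; the extra `str.strip()` is an addition suggested here to guard against stray whitespace in a hand-typed table, not part of the cell above.

```python
import pandas as pd

toy_map = pd.DataFrame([['hammerhead ribozyme', 'RF00008,RF00163'],
                        ['VS ribozyme', ' RF00010']])   # leading space on purpose

toy_map[1] = toy_map[1].str.split(',')    # 'RF00008,RF00163' -> ['RF00008', 'RF00163']
toy_map = toy_map.explode(1)              # one row per identifier
toy_map[1] = toy_map[1].str.strip()       # remove accidental whitespace around IDs
toy_map = toy_map.reset_index(drop=True)
print(toy_map)
```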

***
### RefSeq transcript ID - circBase mapping 

**Purpose:** To map RefSeq transcript identifiers to circBase identifiers.

**Output:** `CIRCBASE_MAP.txt`

In [None]:
!wget http://www.circbase.org/download/hsa_hg19_circRNA.txt -O $unprocessed_data_location/hsa_hg19_circRNA.txt

In [None]:
circbase = pd.read_csv(unprocessed_data_location + 'hsa_hg19_circRNA.txt', sep='\t')
circbase = circbase[['circRNA ID','best transcript']].drop_duplicates()
circbase.head(n=3)

In [None]:
circbase.to_csv(processed_data_location + 'CIRCBASE_MAP.txt', header=None, sep='\t', index=None)

***
### Drug labels+synonyms from DrugBank - DrugBank+ChEBI mapping 

**Purpose:** To map drug labels to DrugBank+ChEBI identifiers.

**Output:** `DESC_DRUGBANK_MAP.txt`

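Note: The DrugBank files (`drugbank vocabulary.csv` and `drug links.csv`) must be downloaded manually from https://go.drugbank.com/releases/latest (registration required) and placed in the `DrugBank` folder created below.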

In [None]:
!mkdir -p $processed_data_location/DrugBank

In [None]:
DrugBank = pd.read_csv(processed_data_location + 'DrugBank/drugbank vocabulary.csv')
links = pd.read_csv(processed_data_location + 'DrugBank/drug links.csv',dtype={'ChEBI ID':str})[['DrugBank ID', 'ChEBI ID']]
links['ChEBI ID'] = 'CHEBI_' + links['ChEBI ID']
DrugBank = pd.merge(DrugBank,links,on='DrugBank ID')
DrugBank['ChEBI ID'] = DrugBank['ChEBI ID'].fillna(DrugBank['DrugBank ID'])
DrugBank.rename(columns={'ChEBI ID':'ID'},inplace=True)

DrugBank['Common name'] = DrugBank['Common name'].str.lower()
DrugBank['Synonyms'] = DrugBank['Synonyms'].str.lower()
DrugBank['Synonyms'] = DrugBank['Synonyms'].str.split(r' \| ')
DrugBank = DrugBank.explode('Synonyms')
DrugBank_syn = DrugBank[['Synonyms','ID']].rename(columns={'Synonyms':'Common name'})
DrugBank = pd.concat([DrugBank[['Common name','ID']], DrugBank_syn]).drop_duplicates()
DrugBank.head(n=3)
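
The logic of the cell above (ChEBI identifier when available, DrugBank identifier otherwise, one row per name or synonym) can be sketched on two hypothetical records. The drug names and identifiers are invented, the helper column `label` is an assumption, and the `dropna` on labels is an extra step added in this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical records; the column names follow the cell above
vocab = pd.DataFrame({'DrugBank ID': ['DB_A', 'DB_B'],
                      'Common name': ['Drug A', 'Drug B'],
                      'Synonyms': ['Alpha | A-1', np.nan]})
links = pd.DataFrame({'DrugBank ID': ['DB_A', 'DB_B'],
                      'ChEBI ID': ['123', np.nan]})

links['ChEBI ID'] = 'CHEBI_' + links['ChEBI ID']          # missing values stay missing
toy = pd.merge(vocab, links, on='DrugBank ID')
toy['ID'] = toy['ChEBI ID'].fillna(toy['DrugBank ID'])    # fall back to the DrugBank ID

# One row per label: common names plus each synonym, lower-cased
names = toy[['Common name', 'ID']].rename(columns={'Common name': 'label'})
syns = toy.assign(label=toy['Synonyms'].str.split(r' \| ')).explode('label')[['label', 'ID']]
toy_lookup = pd.concat([names, syns]).dropna(subset=['label'])
toy_lookup['label'] = toy_lookup['label'].str.lower()
toy_lookup = toy_lookup.drop_duplicates().reset_index(drop=True)
print(toy_lookup)
```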

In [None]:
DrugBank[['Common name', 'ID']].drop_duplicates().to_csv(
    processed_data_location + 'DESC_DRUGBANK_MAP.txt', header=None, sep='\t', index=None)

***
### RNA proprietary identifiers+labels - RNAcentral mappings <a class="anchor" id="ensemblgene-entrezgene"></a>


**Purpose:** To map RNA labels and proprietary identifiers to RNAcentral identifiers.

**Output:** `RNACENTRAL_MAP.txt`

In [None]:
!mkdir -p $processed_data_location/RNAcentral_MAP
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/id_mapping.tsv.gz -O $processed_data_location/RNAcentral_MAP/id_mapping.tsv.gz

In [None]:
# Read the gzipped file directly (pandas infers the compression from the .gz extension)
rnacentral_map = pd.read_csv(processed_data_location + "RNAcentral_MAP/id_mapping.tsv.gz", delimiter='\t',
                             names=['RNAcentral ID', 'DB', 'DB ID', 'Organism', 'RNA category',"Label"])
rnacentral_map.head(n=3)

In [None]:
rnacentral_map_human = rnacentral_map[rnacentral_map['Organism'] == 9606]
rnacentral_map_human.to_csv(processed_data_location + "RNAcentral_MAP/RNACENTRAL_MAP.txt", sep="\t", index=False)
rnacentral_map_human.head(n=3)
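
The full RNAcentral mapping file is large, so filtering for human entries while reading may help when memory is tight. The following is a sketch of that alternative using the `chunksize` option of `pandas.read_csv`; the in-memory toy file and all of its values are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for id_mapping.tsv; every value here is made up
toy_tsv = ("URS_A\tDB1\tid1\t9606\tmiRNA\tlabel1\n"
           "URS_B\tDB1\tid2\t10090\tmiRNA\tlabel2\n"
           "URS_C\tDB2\tid3\t9606\tlncRNA\tlabel3\n")
cols = ['RNAcentral ID', 'DB', 'DB ID', 'Organism', 'RNA category', 'Label']

# Read a few rows at a time and keep only human (taxon 9606) entries
chunks = pd.read_csv(io.StringIO(toy_tsv), sep='\t', names=cols, chunksize=2)
toy_human = pd.concat([chunk[chunk['Organism'] == 9606] for chunk in chunks])
toy_human = toy_human.reset_index(drop=True)
print(toy_human)
```

With the real file, the `io.StringIO` object would be replaced by the file path and a much larger chunk size.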

In [None]:
# If chunks above have already been run, uncomment and run the following line to speed up construction:
#rnacentral_map_human = pd.read_csv(processed_data_location + 'RNAcentral_MAP/RNACENTRAL_MAP.txt', sep='\t')

We also download the single-database mapping files for sources providing relationships that involve RNA molecules.

In [None]:
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/ensembl.tsv -O $processed_data_location/RNAcentral_MAP/ensembl.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/lncbook.tsv -O $processed_data_location/RNAcentral_MAP/lncbook.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/mirbase.tsv -O $processed_data_location/RNAcentral_MAP/mirbase.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/lncipedia.tsv -O $processed_data_location/RNAcentral_MAP/lncipedia.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/gtrnadb.tsv -O $processed_data_location/RNAcentral_MAP/gtrnadb.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/hgnc.tsv -O $processed_data_location/RNAcentral_MAP/hgnc.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/noncode.tsv -O $processed_data_location/RNAcentral_MAP/noncode.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/rfam.tsv -O $processed_data_location/RNAcentral_MAP/rfam.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/refseq.tsv -O $processed_data_location/RNAcentral_MAP/refseq.tsv
!wget https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/database_mappings/pirbase.tsv -O $processed_data_location/RNAcentral_MAP/pirbase.tsv

For piRBase, we keep only gold-standard piRNA sequences.

In [None]:
!wget http://bigdata.ibp.ac.cn/piRBase/download/v3.0/fasta/hsa.gold.fa.gz -O $unprocessed_data_location/hsa.gold.fa.gz
import gzip
with gzip.open(unprocessed_data_location + 'hsa.gold.fa.gz', 'rb') as f_in:
    with open(unprocessed_data_location + 'hsa.gold.fa', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
rnacentral_map_pirbase = pd.read_csv(processed_data_location + "RNAcentral_MAP/pirbase.tsv",sep='\t',
                                     names=['RNAcentral ID', 'DB', 'piRBase ID', 'Organism', 'RNA category',"Label"])
rnacentral_map_pirbase.head(n=3)

In [None]:
from Bio import SeqIO

records = list(SeqIO.parse(unprocessed_data_location + 'hsa.gold.fa', "fasta"))
data = {"Identifier": [record.id for record in records], "Sequence": [str(record.seq) for record in records]}
golden_pirnas = pd.DataFrame(data)
golden_pirnas.head(n=3)

In [None]:
rnacentral_map_pirbase = rnacentral_map_pirbase[rnacentral_map_pirbase['piRBase ID'].isin(golden_pirnas['Identifier'])]
rnacentral_map_pirbase.head(n=3)
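
The filtering step only needs the identifiers from the FASTA headers, not the sequences. As an illustration, here is a dependency-free sketch of the same idea; `fasta_ids` is a hypothetical helper written for this example (the notebook itself uses Biopython), and the identifiers are invented.

```python
import io
import pandas as pd

def fasta_ids(handle):
    """Return the first word of every FASTA header line (hypothetical helper)."""
    return [line[1:].split(None, 1)[0] for line in handle if line.startswith('>')]

# Toy FASTA content and toy mapping table; all identifiers are placeholders
toy_fasta = io.StringIO(">piR-toy-1 note\nACGT\n>piR-toy-3\nTTGA\n")
toy_map = pd.DataFrame({'RNAcentral ID': ['URS_1', 'URS_2', 'URS_3'],
                        'piRBase ID': ['piR-toy-1', 'piR-toy-2', 'piR-toy-3']})

# Keep only mapping rows whose piRBase ID appears in the FASTA file
gold_ids = set(fasta_ids(toy_fasta))
toy_gold = toy_map[toy_map['piRBase ID'].isin(gold_ids)]
print(toy_gold)
```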

In [None]:
all(rnacentral_map_pirbase['Organism']==9606)

In [None]:
rnacentral_map_pirbase[['RNAcentral ID','piRBase ID']].drop_duplicates().to_csv(
    processed_data_location + 'RNAcentral_MAP/pirbase.tsv', header=None, sep='\t', index=None)

***
### tsRFun's tsRNA - RNAcentral mapping 

**Purpose:** To map tsRNA sequences from tsRFun to RNAcentral tRNA identifiers.

**Output:** `tRNA_tsRNA_RNACENTRAL_MAP.txt`

In [None]:
!wget https://rna.sysu.edu.cn/tsRFun/download/newID_20210202.txt  -O $unprocessed_data_location/newID_20210202.txt

In [None]:
tsRNA_map = pd.read_csv(unprocessed_data_location + 'newID_20210202.txt', sep="\t")
tsRNA_map = tsRNA_map[['tRNA','tsRNAid']]
tsRNA_map.head(n=3)

In [None]:
gtrnadb_rnacentral_map_human = pd.read_csv(processed_data_location + "RNAcentral_MAP/gtrnadb.tsv",sep='\t',
                                     names=['RNAcentral ID', 'DB', 'GtRNAdb ID', 'Organism', 'RNA category',"Label"])
gtrnadb_rnacentral_map_human = gtrnadb_rnacentral_map_human[gtrnadb_rnacentral_map_human['Organism']==9606]
tsRNA_RNAcentral_map = pd.merge(tsRNA_map, gtrnadb_rnacentral_map_human, left_on='tRNA', right_on='Label')[[
    'tRNA','tsRNAid','RNAcentral ID']].drop_duplicates()
tsRNA_RNAcentral_map.head(n=3)

In [None]:
tsRNA_RNAcentral_map[['RNAcentral ID','tsRNAid']].drop_duplicates().to_csv(
    processed_data_location + 'tRNA_tsRNA_RNACENTRAL_MAP.txt', header=None, sep='\t', index=None)

***
### GtRNAdb legacy - RNAcentral mapping 

**Purpose:** To map legacy GtRNAdb tRNA identifiers to RNAcentral identifiers.

**Output:** `tRNA_GTRNADBLegacy_RNACENTRAL_MAP.txt`

In [None]:
!wget http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/hg38-tRNAs.fa -O $unprocessed_data_location/hg38-tRNAs.fa --no-check-certificate

In [None]:
from Bio.SeqIO.FastaIO import SimpleFastaParser

identifiers = []
seq = []

# Path to the FASTA file downloaded above
fasta_file_path = unprocessed_data_location + 'hg38-tRNAs.fa'

with open(fasta_file_path) as fasta_file:
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 1)[0])  # First word is ID
        seq.append(sequence)
        
data = {"Identifier": identifiers, "Sequence": seq}
df = pd.DataFrame(data)
df.head(n=3)

In [None]:
all(df['Identifier'].str.startswith('Homo_sapiens_'))

In [None]:
df['Identifier'] = df['Identifier'].str[len('Homo_sapiens_'):]
df.head(n=3)

In [None]:
# Example to show retrieval logic
import ssl
# Workaround: disable HTTPS certificate verification for the requests below
ssl._create_default_https_context = ssl._create_unverified_context
# Fetch the gene page once; its first two tables hold the fields of interest
tables = pd.read_html('http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/genes/tRNA-Ala-AGC-1-1.html')
tRNA = pd.concat([tables[0].T, tables[1].T], axis=1)
tRNA.columns = tRNA.iloc[0]
tRNA = tRNA[1:]
tRNA

In [None]:
# Retrieve the remaining gene pages (assumes the first identifier is the one fetched in the example above)
for identifier in df['Identifier'][1:]:
    # Fetch each page once and reuse both tables
    tables = pd.read_html('http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/genes/' + identifier + '.html')
    temp = pd.concat([tables[0].T, tables[1].T], axis=1)
    temp.columns = temp.iloc[0]
    temp = temp[1:]
    tRNA = pd.concat([tRNA, temp])

tRNA.Locus = tRNA.Locus.str.replace(' View in Genome Browser', '')
tRNA = tRNA.drop(columns=['Organism', 'Known Modifications (Modomics)'])

tRNA = tRNA[['GtRNAdb 2009 Legacy Name and Score','GtRNAdb Gene Symbol','RNAcentral ID']]
tRNA_map = tRNA[~tRNA['GtRNAdb 2009 Legacy Name and Score'].isna()].copy()  # copy to avoid chained-assignment warnings
tRNA_map['GtRNAdb 2009 Legacy Name and Score'] = tRNA_map['GtRNAdb 2009 Legacy Name and Score'].str.replace(r'\s+\(.*\)', '', regex=True)
tRNA_map['RNAcentral ID'] = tRNA_map['RNAcentral ID'].str.split('_').str[0]
tRNA_map.rename(columns={'GtRNAdb 2009 Legacy Name and Score':0,'GtRNAdb Gene Symbol':1}, inplace=True) 
tRNA_map.head(n=3)
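
The two string clean-ups above (dropping the parenthesised score from the legacy name and the taxon suffix from the RNAcentral ID) can be checked on a single made-up row. The values and the column names `legacy` and `rnacentral` below are placeholders chosen for this sketch.

```python
import pandas as pd

# Placeholder row: a legacy name followed by a parenthesised score, and an ID with a taxon suffix
toy = pd.DataFrame({'legacy': ['toy.trna1 (score 12.3)'],
                    'rnacentral': ['URS0000000000_9606']})

# Remove the whitespace and everything in parentheses after the legacy name
toy['legacy'] = toy['legacy'].str.replace(r'\s+\(.*\)', '', regex=True)
# Keep only the part of the RNAcentral ID before the underscore
toy['rnacentral'] = toy['rnacentral'].str.split('_').str[0]
print(toy)
```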

In [None]:
all(tRNA_map['RNAcentral ID'].isin(rnacentral_map_human['RNAcentral ID']))
# All legacy GtRNAdb IDs are mapped into RNAcentral identifiers

In [None]:
tRNA_map[[0,'RNAcentral ID']].drop_duplicates().to_csv(
    processed_data_location + 'tRNA_GTRNADBLegacy_RNACENTRAL_MAP.txt', header=None, index=None,sep='\t')

In [None]:
# If chunks above have already been run, uncomment and run the following line to speed up construction:
#tRNA_map = pd.read_csv(processed_data_location+'tRNA_GTRNADBLegacy_RNACENTRAL_MAP.txt',sep='\t',header=None)

***
### MINTbase - GtRNAdb tRNA mapping 

**Purpose:** To map MINTbase tRNA names to RNAcentral identifiers via their GtRNAdb names.

**Output:** `tRNA_MINTbase_RNACENTRAL_MAP.txt`

Note: `MINTbase-gtRNAdb_mapping.txt` is obtained from [MINTbase](https://cm.jefferson.edu/MINTbase/) (tRNA alignment → Minimum RPM value: All tRFs → Download). Remove the leading comment lines (marked with `#`) from the downloaded HTML file and rename it accordingly.

In [None]:
tRNA_MINTbase_GtRNAdb_map = pd.read_csv(unprocessed_data_location + 'MINTbase-gtRNAdb_mapping.txt',sep='\t')
tRNA_MINTbase_GtRNAdb_map = tRNA_MINTbase_GtRNAdb_map[['MINTbase tRNA name','gtRNAdb name']]
tRNA_MINTbase_GtRNAdb_map = tRNA_MINTbase_GtRNAdb_map[tRNA_MINTbase_GtRNAdb_map['gtRNAdb name'] != '-']
tRNA_MINTbase_GtRNAdb_map.head(n=3)

In [None]:
tRNA_RNAcentral_map = pd.merge(tRNA_MINTbase_GtRNAdb_map, gtrnadb_rnacentral_map_human, left_on='gtRNAdb name', right_on='Label')[[
    'MINTbase tRNA name','gtRNAdb name','RNAcentral ID']].drop_duplicates()
tRNA_RNAcentral_map.head(n=3)

In [None]:
tRNA_RNAcentral_map[['MINTbase tRNA name', 'RNAcentral ID']].drop_duplicates().to_csv(
    processed_data_location + 'tRNA_MINTbase_RNACENTRAL_MAP.txt', header=None, sep='\t', index=None)
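
Since the merge above is an inner join, MINTbase names whose GtRNAdb name has no RNAcentral label are dropped without notice. One way to list them, sketched here on invented rows, is a left join with the `indicator` option of `pandas.merge`; the toy names and identifiers are placeholders.

```python
import pandas as pd

# Toy tables with placeholder names; column names follow the cells above
left = pd.DataFrame({'MINTbase tRNA name': ['toy_tRNA_1', 'toy_tRNA_2'],
                     'gtRNAdb name': ['tRNA-Toy-1', 'tRNA-Toy-2']})
right = pd.DataFrame({'Label': ['tRNA-Toy-1'], 'RNAcentral ID': ['URS_TOY_1']})

# A left join keeps every MINTbase row; '_merge' records whether a match was found
checked = pd.merge(left, right, left_on='gtRNAdb name', right_on='Label',
                   how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'MINTbase tRNA name'].tolist()
print(unmatched)
```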


<br>

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```
```
@misc{cavalleri_e_2024_rna_kg,
  author       = {Cavalleri, E},
  title        = {RNA-KG},
  year         = 2024,
  doi          = {10.5281/zenodo.10078876},
  url          = {https://doi.org/10.5281/zenodo.10078876}
}
```