# Pre-processing the dataset

We build a knowledge graph (KG) using NetworkX and ArangoDB.

This requires downloading multi-modal datasets from [BioSNAP](https://snap.stanford.edu/biodata/) and other data sources. This notebook performs basic data engineering before pushing the data into ArangoDB to create a connected network.

##### Modalities:
- Drug-Drug interaction with side-effects
- Drug-Gene interaction
- Disease-Drug interaction
- Disease-Disease interaction
- Disease-Function interaction
- Function-Function interaction
- Gene-Function interaction
- Gene-Gene interaction
- Gene-Protein interaction
- Genomic Region-Genomic Region interaction
- Protein-Protein interaction
- Protein-Protein-Tissue interaction
- Tissue-Function-Gene interaction

In [None]:
# !pip install networkx

# !pip install python-arango
# !pip install matplotlib

# !pip install biopython


Collecting networkx
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Downloading networkx-3.4.2-py3-none-any.whl (1.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 11.3 MB/s eta 0:00:00
Installing collected packages: networkx
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
adbnx-adapter 5.0.6 requires python-arango>=7.4, which is not installed.
deeppurpose 0.1.5 requires matplotlib, which is not installed.
Successfully installed networkx-3.4.2
Collecting nx-arangodb
  Using cached nx_arangodb-1.3.0-py3-none-any.whl.metadata (9.3 kB)
Collecting networkx<=3.4,>=3.0 (from nx-arangodb)
  Using cached networkx-3.4-py3-none-any.whl.metadata (6.3 kB)
Collecting python-arango~=8.1 (from nx-arangodb)
  Using cached python_arango-8.1.6-py3-none-any.whl.metadata (

In [2]:
# Interactions

!wget -nc https://snap.stanford.edu/biodata/datasets/10001/files/ChCh-Miner_durgbank-chem-chem.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10002/files/ChG-Miner_miner-chem-gene.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10004/files/DCh-Miner_miner-disease-chemical.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10006/files/DD-Miner_miner-disease-disease.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10019/files/DF-Miner_miner-disease-function.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10020/files/DG-Miner_miner-disease-gene.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10026/files/FF-Miner_miner-func-func.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10024/files/GF-Miner_miner-gene-function.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10027/files/GP-Miner_miner-gene-protein.tsv.gz
!wget -nc https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_35_chemreps.txt.gz # Chemical Representations
#!wget -nc https://snap.stanford.edu/biodata/datasets/10028/files/PP-Miner_miner-ppi.tsv.gz
#!wget -nc https://snap.stanford.edu/biodata/datasets/10032/files/GG-NE.tar.gz

# Entities and Feature tables

!wget -nc https://snap.stanford.edu/biodata/datasets/10021/files/D-DoMiner_miner-diseaseDOID.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10003/files/D-MeshMiner_miner-disease.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10025/files/D-OmimMiner_miner-diseaseOMIM.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10022/files/G-SynMiner_miner-geneHUGO.tsv.gz

File ‘ChCh-Miner_durgbank-chem-chem.tsv.gz’ already there; not retrieving.

File ‘ChG-Miner_miner-chem-gene.tsv.gz’ already there; not retrieving.

File ‘DCh-Miner_miner-disease-chemical.tsv.gz’ already there; not retrieving.

File ‘DD-Miner_miner-disease-disease.tsv.gz’ already there; not retrieving.

File ‘DF-Miner_miner-disease-function.tsv.gz’ already there; not retrieving.

File ‘DG-Miner_miner-disease-gene.tsv.gz’ already there; not retrieving.

File ‘FF-Miner_miner-func-func.tsv.gz’ already there; not retrieving.

File ‘GF-Miner_miner-gene-function.tsv.gz’ already there; not retrieving.

File ‘GP-Miner_miner-gene-protein.tsv.gz’ already there; not retrieving.

File ‘chembl_35_chemreps.txt.gz’ already there; not retrieving.

File ‘D-DoMiner_miner-diseaseDOID.tsv.gz’ already there; not retrieving.

File ‘D-MeshMiner_miner-disease.tsv.gz’ already there; not retrieving.

File ‘D-OmimMiner_miner-diseaseOMIM.tsv.gz’ already there; not retrieving.

File ‘G-SynMiner_miner-geneHUGO.tsv.g
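
The `wget -nc` calls above skip any file that is already present. Where `wget` is unavailable, similar behaviour can be approximated in Python. The sketch below is a minimal version under stated assumptions: `fetch_if_missing` and its `downloader` parameter are hypothetical names introduced here and are not part of the original notebook.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def fetch_if_missing(url, dest_dir=".", downloader=urlretrieve):
    """Download `url` into `dest_dir` unless a file of that name already exists.

    Returns (path, downloaded), where `downloaded` is False when the file
    was already there. `downloader` is a hypothetical hook, swappable in tests.
    """
    filename = Path(urlparse(url).path).name
    target = Path(dest_dir) / filename
    if target.exists():
        return target, False
    downloader(url, str(target))
    return target, True


# Example call (performs a network request, so it is left commented out):
# fetch_if_missing(
#     "https://snap.stanford.edu/biodata/datasets/10006/files/DD-Miner_miner-disease-disease.tsv.gz"
# )
```

Unlike `wget -nc`, this sketch does not detect partially downloaded files; it checks only whether the target name exists.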

In [24]:
import networkx as nx
import pandas as pd
import numpy as np
import requests

from transformers import AutoTokenizer, AutoModel
import torch

from arango import ArangoClient
import matplotlib.pyplot as plt
import os
import gzip
from Bio import SeqIO

os.environ["DATABASE_HOST"] = "http://localhost:8529"  # Replace with your ArangoDB host
os.environ["DATABASE_USERNAME"] = "root"               # Replace with your ArangoDB username
os.environ["DATABASE_PASSWORD"] = "openSesame"         # Replace with your ArangoDB password
os.environ["DATABASE_NAME"] = "NeuThera"

### Loading Dataset

#### Entities and Feature Tables

In [3]:
# Disease DOID Synopses

Doid = pd.read_csv(
    "./D-DoMiner_miner-diseaseDOID.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["doid", "name", "definition", "synonym"],
)

Doid

Unnamed: 0,doid,name,definition,synonym
0,DOID:0001816,angiosarcoma,A malignant vascular tumor that results_in rap...,hemangiosarcoma EXACT []
1,DOID:0002116,pterygium,,surfer's eye EXACT []
2,DOID:0014667,disease of metabolism,A disease that involving errors in metabolic p...,metabolic disease EXACT [SNOMEDCT_2005_07_31:...
3,DOID:0050001,Actinomadura madurae infectious disease,,
4,DOID:0050002,Actinomadura pelletieri infectious disease,,
...,...,...,...,...
9242,DOID:9989,metastasis to the orbit,,secondary malignant neoplasm of orbit (disorde...
9243,DOID:999,eosinophilia,,Eosinophilic leukocytosis EXACT [MTHICD9_2006:...
9244,DOID:9993,hypoglycemia,,Hypoglycaemia EXACT [SNOMEDCT_2005_07_31:15469...
9245,DOID:9995,endocrine and metabolic disturbances specific ...,,


In [4]:
# Disease MESH Synopses

Mesh = pd.read_csv(
    "./D-MeshMiner_miner-disease.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "name", "definition", "synonym"],
)

Mesh

Unnamed: 0,mesh,name,definition,synonym
0,MESH:C538288,10p Deletion Syndrome (Partial),,"Chromosome 10, 10p- Partial|Chromosome 10, mon..."
1,MESH:C535484,13q deletion syndrome,,Chromosome 13q deletion|Chromosome 13q deletio...
2,MESH:C579849,15q24 Microdeletion,,15q24 Deletion|15q24 Microdeletion Syndrome|In...
3,MESH:C579850,16p11.2 Deletion Syndrome,,
4,MESH:C567076,"17,20-Lyase Deficiency, Isolated",,"17-Alpha-Hydroxylase-17,20-Lyase Deficiency, C..."
...,...,...,...,...
11327,MESH:C536729,Zunich neuroectodermal syndrome,,CHIME syndrome
11328,MESH:C536730,Zuska's Disease,,Lactation and squamous metaplasia of lactifero...
11329,MESH:C565223,Zygodactyly 1,,ZD1
11330,MESH:D015051,Zygomatic Fractures,Fractures of the zygoma.,"Fractures, Zygomatic|Fracture, Zygomatic|Zygom..."


In [5]:
# OMIM Genetic Disorders

Omim = pd.read_csv(
    "./D-OmimMiner_miner-diseaseOMIM.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["omim", "phenotypes", "gene_name", "gene", "location", "_"],
)

Omim = Omim.iloc[:, :-1]

Omim

Unnamed: 0,omim,phenotypes,gene_name,gene,location
0,OMIM:115665,"Cataract 8, multiple types (2)","Cataract, congenital, Volkmann type","CTRCT8, CCV",1pter-p36.13
1,OMIM:607671,"Dystonia 13, torsion (2)","Dystonia 13, torsion",DYT13,1p36.32-p36.13
2,OMIM:606242,Kondoh syndrome (2),"Kondoh syndrome (mental retardation, microceph...",KONDS,1p36.32-p35.3
3,OMIM:614414,"Deafness, autosomal recessive 96 (2)","Deafness, autosomal recessive 96",DFNB96,1p36.31-p36.13
4,OMIM:609918,Gallbladder disease 2 (2),Gallbladder disease 2,GBD2,1p36.21
...,...,...,...,...,...
1186,OMIM:300519,"Mental retardation, X-linked, syndromic, Marti...","Mental retardation, X-linked, syndromic, Marti...",MRXSMP,Chr.X
1187,OMIM:400042,"Spermatogenic failure, Y-linked, 1 (4)",Chromosome Yq11 interstitial deletion syndrome,"DELYq11, CYDELq11, SPGFY1",Yq11
1188,OMIM:475000,,"Growth control, Y-chromosome influenced","GCY, TSY, STA",Yq12
1189,OMIM:400043,"Deafness, Y-linked 1 (1)","Deafness, Y-linked 1",DFNY1,Chr.Y


In [6]:
# Gene

Gene = pd.read_csv(
    "./G-SynMiner_miner-geneHUGO.tsv.gz",
    compression="gzip",
    sep="\t",
    low_memory=False,  # read in one pass so wide, sparse columns get a consistent dtype
)

Gene = Gene.rename(columns={'# ensembl_gene_id': 'ensg', 'symbol': 'gene', 'name': 'gene_name'})

Gene


Unnamed: 0,ensg,hgnc_id,gene,gene_name,locus_group,locus_type,status,location,location_sortable,alias_symbol,...,horde_id,merops,imgt,iuphar,kznf_gene_catalog,mamit-trnadb,cd,lncrnadb,enzyme_id,intermediate_filament_db
0,ENSG00000121410,HGNC:5,A1BG,alpha-1-B glycoprotein,protein-coding gene,gene with protein product,Approved,19q13.43,19q13.43,,...,,I43.950,,,,,,,,
1,ENSG00000268895,HGNC:37133,A1BG-AS1,A1BG antisense RNA 1,non-coding RNA,"RNA, long non-coding",Approved,19q13.43,19q13.43,FLJ23569,...,,,,,,,,,,
2,ENSG00000148584,HGNC:24086,A1CF,APOBEC1 complementation factor,protein-coding gene,gene with protein product,Approved,10q21.1,10q21.1,ACF|ASP|ACF64|ACF65|APOBEC1CF,...,,,,,,,,,,
3,ENSG00000175899,HGNC:7,A2M,alpha-2-macroglobulin,protein-coding gene,gene with protein product,Approved,12p13.31,12p13.31,FWP007|S863-7|CPAMD5,...,,I39.001,,,,,,,,
4,ENSG00000245105,HGNC:27057,A2M-AS1,A2M antisense RNA 1 (head to head),non-coding RNA,"RNA, long non-coding",Approved,12p13.31,12p13.31,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35649,ENSG00000162378,HGNC:25820,ZYG11B,"zyg-11 family member B, cell cycle regulator",protein-coding gene,gene with protein product,Approved,1p32.3,01p32.3,FLJ13456,...,,,,,,,,,,
35650,ENSG00000159840,HGNC:13200,ZYX,zyxin,protein-coding gene,gene with protein product,Approved,7q32,07q32,,...,,,,,,,,,,
35651,ENSG00000274572,HGNC:51695,ZYXP1,zyxin pseudogene 1,pseudogene,pseudogene,Approved,8q24.23,08q24.23,,...,,,,,,,,,,
35652,ENSG00000074755,HGNC:29027,ZZEF1,zinc finger ZZ-type and EF-hand domain contain...,protein-coding gene,gene with protein product,Approved,17p13.3,17p13.3,KIAA0399|ZZZ4|FLJ10821,...,,,,,,,,,,


In [15]:
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

def get_chemberta_embedding(smiles):
    """Generate a ChemBERTa embedding for a molecule, ensuring input is a string."""
    if not isinstance(smiles, str) or not smiles.strip():
        return None 

    inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).tolist()[0]
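
`get_chemberta_embedding` averages the hidden states of one molecule at a time, so every position it averages is a real token. If several SMILES strings were tokenized together with `padding=True`, the shorter ones would contain padding positions that a plain `mean(dim=1)` would also average in. The NumPy sketch below illustrates mask-aware mean pooling. `masked_mean_pool` is a hypothetical helper; with the real model, `hidden` would correspond to `outputs.last_hidden_state` and `mask` to `inputs["attention_mask"]`.

```python
import numpy as np


def masked_mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring positions where mask == 0.

    hidden: array of shape (batch, tokens, features)
    mask:   array of shape (batch, tokens) holding 1 for real tokens, 0 for padding
    """
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(mask, dtype=float)[..., None]  # (batch, tokens, 1)
    summed = (hidden * mask).sum(axis=1)             # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # avoid division by zero
    return summed / counts


# Toy batch: one sequence with two real tokens and one padding position.
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
pooled = masked_mean_pool(toy_hidden, toy_mask)
```

The padded position is excluded, so `pooled` holds the mean of the first two token vectors only.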

In [32]:
# DrugBank

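# The DrugBank vocabulary file is not fetched by the download cell above and must be present locally.
DrugBank = pd.read_csv(
    "./drugbank_all_drugbank_vocabulary.csv.zip",
    compression="zip",
    sep=",",
    header=0,
    names=["drug", "accession", "drug_name", "cas", "unii", "synonym", "key"],
)

# ChEMBL chemical representations; "key" is the InChIKey shared with DrugBank.
ChemRepresentation = pd.read_csv(
    "./chembl_35_chemreps.txt.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["chembl", "smiles", "inchi", "key"],
    dtype={"smiles": str},  # the SMILES column lives in this table, not in DrugBank
)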

Drug = DrugBank.merge(ChemRepresentation, on="key", how="inner")
Drug = Drug.dropna(subset=["smiles"])
Drug["smiles"] = Drug["smiles"].astype(str).str.strip()
Drug["generated"] = False

# Embed every SMILES string (one model forward pass per molecule, so this is slow on CPU).
Drug["embedding"] = [get_chemberta_embedding(smiles) for smiles in Drug["smiles"]]

Drug


Unnamed: 0,drug,accession,drug_name,cas,unii,synonym,key,chembl,smiles,inchi,generated,embedding
0,DB00014,BTD00113 | BIOD00113,Goserelin,65807-02-5,0F65R8P09N,Goserelin | Goserelina,BLCLNMBMMGCOAS-URPVMXJPSA-N,CHEMBL1201247,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,InChI=1S/C59H84N18O14/c1-31(2)22-40(49(82)68-3...,False,"[0.26175203919410706, 0.019564686343073845, -0..."
1,DB00027,BTD00036 | BIOD00036,Gramicidin D,1405-97-6,5IE62321P4,Bacillus brevis gramicidin D | Gramicidin | Gr...,NDAYQJDHGXTBJL-MWWSRJDJSA-N,CHEMBL557217,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...,InChI=1S/C96H135N19O16/c1-50(2)36-71(105-79(11...,False,"[0.505556583404541, 0.0775429829955101, -0.058..."
2,DB00035,BTD00112 | BTD00061 | BIOD00112 | BIOD00061,Desmopressin,16679-58-6,ENR1LLB0FP,1-(3-mercaptopropionic acid)-8-D-arginine-vaso...,NFLWUMRGJYTJIN-PNIOQBSNSA-N,CHEMBL1429,N=C(N)NCCC[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@@H]...,InChI=1S/C46H64N14O12S2/c47-35(62)15-14-29-40(...,False,"[0.1573418825864792, -0.010056326165795326, -0..."
3,DB00050,BTD00115 | APRD00686 | BIOD00115,Cetrorelix,120287-85-6,OON1HFZ4BA,Cetrorelix | Cetrorelixum,SBNPWPIBESPSIF-MHWMIDJBSA-N,CHEMBL1200490,CC(=O)N[C@H](Cc1ccc2ccccc2c1)C(=O)N[C@H](Cc1cc...,InChI=1S/C70H92ClN17O14/c1-39(2)31-52(61(94)82...,False,"[0.24220487475395203, 0.14190606772899628, -0...."
4,DB00080,BTD00111 | BIOD00111,Daptomycin,103060-53-3,NWQ5N31VKK,Daptomicina | Daptomycin | Daptomycine | Dapto...,DOAKLVKFURWEDJ-QCMAZARJSA-N,CHEMBL4744444,CCCCCCCCCC(=O)N[C@@H](Cc1c[nH]c2ccccc12)C(=O)N...,InChI=1S/C72H101N17O26/c1-5-6-7-8-9-10-11-22-5...,False,"[0.40401849150657654, 0.11877013742923737, -0...."
...,...,...,...,...,...,...,...,...,...,...,...,...
8990,DB19419,,Azadirachtin,11141-17-6,O4U1SAF85H,Azadirachtin a | Azatin | Biosal | Ecozin | Gr...,FTNJWQUOZFUQQJ-NDAWSKJSSA-N,CHEMBL509309,C/C=C(\C)C(=O)O[C@H]1C[C@@H](OC(C)=O)[C@@]2(C(...,InChI=1S/C35H44O16/c1-8-15(2)24(38)49-18-12-19...,False,"[0.6458231210708618, 0.2620610296726227, -0.11..."
8991,DB19436,,L-lysyl-l-arginine,29586-66-1,HX25W9VLX3,"(2s)-2-(((2s)-2,6-bis(azanyl)hexanoyl)amino)-5...",NPBGTPKLVJEOBE-IUCAKERBSA-N,CHEMBL380183,NCCCC[C@H](N)C(=O)N[C@@H](CCCN=C(N)N)C(=O)O,InChI=1S/C12H26N6O3/c13-6-2-1-4-8(14)10(19)18-...,False,"[0.288204163312912, 0.22193492949008942, 0.394..."
8992,DB19437,,Pralurbactam,2163782-59-8,GTP46AY74R,"(1R,2S,5R)-2-[[[2-[[(Aminoiminomethyl)amino]ox...",HOJIPBUGHMYVQD-RQJHMYQMSA-N,CHEMBL5314477,N=C(N)NOCCONC(=O)[C@@H]1CC[C@@H]2CN1C(=O)N2OS(...,InChI=1S/C10H18N6O8S/c11-9(12)14-23-4-3-22-13-...,False,"[0.22961297631263733, -0.17735253274440765, -0..."
8993,DB19445,,Rocbrutinib,2485861-07-0,KD68L3GRW2,"2-Propenamide, N-[5-[[6-[2-(1,3,4,6,7,8-hexahy...",OYJVFTNYBWVQHA-SANMLTNESA-N,CHEMBL5314539,C=CC(=O)Nc1cc(Nc2nc(-c3ccnc(N4CCn5c(cc6c5CC(C)...,InChI=1S/C42H51N9O5/c1-6-37(53)45-32-20-28(7-8...,False,"[0.38286489248275757, -0.31730952858924866, -0..."
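
The inner merge keeps only DrugBank entries whose InChIKey also appears in the ChEMBL table, so it can be useful to check how many rows each side loses. One possible check, sketched here with a hypothetical `merge_report` helper and toy data, uses the `indicator` option of `DataFrame.merge`. Missing keys are dropped first because pandas matches null keys to each other. On the real data the call would be `merge_report(DrugBank, ChemRepresentation, "key")`.

```python
import pandas as pd


def merge_report(left, right, key):
    """Count rows present in both tables, only in `left`, or only in `right`."""
    left = left.dropna(subset=[key])
    right = right.dropna(subset=[key])
    merged = left.merge(right, on=key, how="outer", indicator=True)
    return {str(k): int(v) for k, v in merged["_merge"].value_counts().items()}


# Toy stand-ins for the DrugBank and ChEMBL tables (values are made up).
toy_drugbank = pd.DataFrame({"drug": ["DB1", "DB2", "DB3"], "key": ["K1", "K2", None]})
toy_chembl = pd.DataFrame({"chembl": ["C1", "C2"], "key": ["K2", "K9"]})
report = merge_report(toy_drugbank, toy_chembl, "key")
```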


In [50]:
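# PDB chain to Ensembl mapping; this file is not fetched by the download cell above.
PDB = pd.read_csv(
    "./pdb_chain_ensembl.csv.gz",
    compression="gzip",
    sep=",",
    header=1,  # column names are on the second line of the file
)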

PDB.head(5)

Unnamed: 0,PDB,CHAIN,SP_PRIMARY,GENE_ID,TRANSCRIPT_ID,TRANSLATION_ID,EXON_ID
0,101m,A,P02185,,,,NM_001290722.1-1
1,101m,A,P02185,,,,NM_001290722.1-2
2,101m,A,P02185,,,,NM_001290722.1-3
3,102l,A,P00720,,,,AAD42568-1
4,102m,A,P02185,,,,NM_001290722.1-1


#### Interactions

In [8]:
# Drug-Drug interaction

ChCh = pd.read_csv(
    "./ChCh-Miner_durgbank-chem-chem.tsv.gz",
    compression="gzip",
    sep="\t",
    names=["drug", "drug_target"],
)

ChCh.head(5)

Unnamed: 0,drug,drug_target
0,DB00862,DB00966
1,DB00575,DB00806
2,DB01242,DB08893
3,DB01151,DB08883
4,DB01235,DB01275


In [9]:
# Drug-Gene interaction

ChG = pd.read_csv(
    "./ChG-Miner_miner-chem-gene.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["drug", "uniprot_ids"],
)

ChG = ChG.merge(Gene[['uniprot_ids', 'gene']], on='uniprot_ids', how='left')
ChG.drop(columns=['uniprot_ids'], inplace=True)
ChG.dropna(inplace=True)

ChG.head(5)


Unnamed: 0,drug,gene
0,DB00357,CYP11A1
1,DB02721,ADH1B
2,DB00773,PTGS1
3,DB07138,MAPK14
4,DB08136,CDK2
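
The merge on `uniprot_ids` matches only when the cell in the gene table equals the accession exactly. Assuming that `uniprot_ids`, like the `alias_symbol` column shown earlier, can hold several pipe-separated values, rows with more than one accession would fail to match and then be removed by `dropna`. The sketch below shows one way to split such cells before merging; `explode_ids` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def explode_ids(genes, id_col="uniprot_ids", sep="|"):
    """Return one row per identifier for cells holding `sep`-separated values."""
    out = genes.dropna(subset=[id_col]).copy()
    out[id_col] = out[id_col].astype(str).str.split(sep)
    return out.explode(id_col).reset_index(drop=True)


# Toy gene table: G1 has two accessions, G3 has none.
toy_genes = pd.DataFrame(
    {"gene": ["G1", "G2", "G3"], "uniprot_ids": ["P1|P2", "P3", None]}
)
exploded = explode_ids(toy_genes)
```

On the real data, `Gene[['uniprot_ids', 'gene']]` could be passed through such a helper before the merge; whether this recovers additional rows depends on the actual contents of the column.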


In [10]:
# Disease-Drug interaction

DCh = pd.read_csv(
    "./DCh-Miner_miner-disease-chemical.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "drug"],
)

DCh.head(5)

Unnamed: 0,mesh,drug
0,MESH:D005923,DB00564
1,MESH:D009503,DB01072
2,MESH:D016115,DB01759
3,MESH:D018476,DB00451
4,MESH:C567059,DB00641


In [11]:
# Disease-Disease interaction

DD = pd.read_csv(
    "./DD-Miner_miner-disease-disease.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["doid", "doid_target"],
)

DD.head(5)

Unnamed: 0,doid,doid_target
0,DOID:0001816,DOID:1115
1,DOID:0002116,DOID:10124
2,DOID:0014667,DOID:4
3,DOID:0050004,DOID:10400
4,DOID:0050012,DOID:934


In [12]:
# Disease-Function interaction

DF = pd.read_csv(
    "./DF-Miner_miner-disease-function.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "go"],
)

DF.head(5)

Unnamed: 0,mesh,go
0,MESH:D000037,GO:0009257
1,MESH:C536409,GO:0009257
2,MESH:D009436,GO:0009257
3,MESH:D000860,GO:0009258
4,MESH:D008106,GO:0009258


In [13]:
# Disease-Gene interaction

DG = pd.read_csv(
    "./DG-Miner_miner-disease-gene.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "uniprot_ids"],
)

DG = DG.merge(Gene[['uniprot_ids', 'gene']], on='uniprot_ids', how='left')
DG.drop(columns=['uniprot_ids'], inplace=True)
DG.dropna(inplace=True)

DG.head(5)

Unnamed: 0,mesh,gene
1,MESH:D055370,PSG1
2,MESH:D007410,KHSRP
3,MESH:D014062,LAIR2
4,MESH:D054549,TRIB1
5,MESH:D009771,CEP152


In [14]:
# Function-Function interaction

FF = pd.read_csv(
    "./FF-Miner_miner-func-func.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["go", "go_target"],
)

FF.head(5)

Unnamed: 0,go,go_target
0,GO:0008296,GO:0008408
1,GO:0016811,GO:0033970
2,GO:0045222,GO:0045223
3,GO:0021803,GO:0030031
4,GO:0033574,GO:1901654


In [15]:
# Gene-Function interaction (human only, taxon:9606)


GF = pd.read_csv(
    "./GF-Miner_miner-gene-function.tsv.gz",
    compression="gzip",
    sep="\t",
    skipinitialspace=True
)

GF = GF.rename(
    columns={
        "# GO_ID": "go",
        "Gene": "gene",
        "C8": "go_category",
        "C10": "protein",
        "C12": "organism",
        "C13": "date"
    }
)

GF = GF[GF["organism"] == "taxon:9606"]
GF = GF[["go", "gene"]]

GF.head(5)


Unnamed: 0,go,gene
0,GO:0005509,PDCD6
1,GO:0004672,CDK1
2,GO:0005524,CDK1
3,GO:0005634,CDK1
4,GO:0005737,CDK1
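
The graph defined later in this notebook refers to a `go` vertex collection, yet no table of GO terms is loaded as an entity table. One option, sketched below under that assumption, is to derive the GO nodes from the identifiers that appear in the interaction tables. `go_nodes` is a hypothetical helper; on the real data it could be called as `go_nodes(DF["go"], FF["go"], FF["go_target"], GF["go"])`.

```python
import pandas as pd


def go_nodes(*columns):
    """Collect the distinct, non-null GO identifiers from several columns."""
    ids = pd.concat(columns, ignore_index=True).dropna().drop_duplicates()
    return pd.DataFrame({"go": sorted(ids)})


# Toy columns standing in for DF["go"], FF["go"], and so on.
toy_go = go_nodes(
    pd.Series(["GO:0009258", "GO:0009257"]),
    pd.Series(["GO:0009257", None]),
)
```

The resulting frame has a single `go` column, so it carries identifiers only and no names or definitions.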


In [83]:
# Gene-Protein interaction

GP = pd.read_csv(
    "./GP-Miner_miner-gene-protein.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["ensg", "ensp"],
)

ENSP = GP[['ensp']].copy()

GP = GP.merge(Gene[['ensg', 'gene']], on='ensg', how='left')
GP = GP.merge(PDB, left_on="ensp", right_on="TRANSLATION_ID", how="left")
GP = GP[["gene", "PDB"]].rename(columns={"PDB": "pdb"})

GP.dropna(inplace=True)
GP.drop_duplicates(inplace=True)

GP.head(5)

Unnamed: 0,gene,pdb
0,MT-ND1,5xtc
1,MT-ND1,5xtd
2,MT-ND2,5xtc
3,MT-ND2,5xtd
4,MT-ND2,5xth


In [80]:
Protein = ENSP.merge(PDB, left_on="ensp", right_on="TRANSLATION_ID", how="inner") 

Protein = Protein[["ensp", "PDB"]].rename(columns={"PDB": "pdb"})
Protein.drop_duplicates(subset=["pdb"], inplace=True)

Protein = Protein[["pdb"]]

Protein.head(5)

Unnamed: 0,pdb
0,5xtc
1,5xtd
4,5xth
5,5xti
7,5z62


### Pushing to DB

In [33]:
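# Reuse the connection settings stored in the environment earlier in the notebook.
db = ArangoClient(hosts=os.environ["DATABASE_HOST"]).db(
    os.environ["DATABASE_NAME"],
    username=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
)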

In [None]:
# Adding Nodes

collections = ["drug", "gene", "disease", "protein"]

for col in collections:
    if not db.has_collection(col):
        db.create_collection(col)

def add_nodes(df, label, key_column, batch_size=100000):
    """Insert each row of df into the `label` collection, keyed by key_column."""
    batch = []

    def flush():
        # Empty the batch even when the insert fails, so a bad batch is not retried on every row.
        # insert_many reports per-document failures in its return value rather than raising.
        try:
            db[label].insert_many(batch, overwrite=True)
            print(f"Inserted {len(batch)} nodes into {label}")
        except Exception as e:
            print(f"Error inserting batch: {e}")
        finally:
            batch.clear()

    for _, row in df.iterrows():
        # Rows without a key all share the "NotAvailable" key and overwrite one another.
        node_key = str(row[key_column]) if pd.notna(row[key_column]) else "NotAvailable"
        attributes = row.fillna("NaN").drop(key_column).to_dict()
        batch.append({"_key": node_key, **attributes})

        if len(batch) >= batch_size:
            flush()

    if batch:
        flush()

add_nodes(Drug, "drug", "drug")
add_nodes(Gene, "gene", "gene")
add_nodes(Doid, "disease", "doid")
add_nodes(Mesh, "disease", "mesh")
add_nodes(Omim, "disease", "omim")
add_nodes(Protein, "protein", "pdb")


Inserted 8995 nodes into drug
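
`add_nodes` uses the raw identifier as the document `_key`. ArangoDB restricts which characters a key may contain, so identifiers with spaces or slashes would be rejected. The sketch below assumes ArangoDB's documented key rules (letters, digits, a fixed set of punctuation characters, and at most 254 bytes); `sanitize_key` is a hypothetical helper, and the exact character set should be checked against the ArangoDB version in use.

```python
import re

# Characters assumed to be allowed in an ArangoDB document key.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_\-:.@()+,=;$!*'%]")


def sanitize_key(raw, replacement="_", max_length=254):
    """Replace characters outside the assumed allowed set and cap the length."""
    key = _DISALLOWED.sub(replacement, str(raw))
    return key[:max_length]


doid_key = sanitize_key("DOID:0001816")   # already valid, returned unchanged
spaced_key = sanitize_key("A B/C")        # space and slash are replaced
```

Replacing characters can map two different identifiers to the same key, so a collision check may be worthwhile before inserting.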


In [None]:
edge_collections = ["drug-drug", "drug-gene", "drug-protein", "disease-drug", "disease-disease", "disease-function", "disease-gene", "function-function", "gene-function", "gene-protein"]

for edge_col in edge_collections:
    if not db.has_collection(edge_col):
        db.create_collection(edge_col, edge=True)

def add_edges(df, src_label, dst_label, src_col, dst_col, edge_collection, batch_size=100000):
    """Insert one edge per row of df, from src_label/<src_col> to dst_label/<dst_col>."""
    batch = []

    def flush():
        # Empty the batch even when the insert fails, so a bad batch is not retried on every row.
        try:
            db[edge_collection].insert_many(batch, overwrite=True)
            print(f"Inserted {len(batch)} edges into {edge_collection}")
        except Exception as e:
            print(f"Error inserting batch: {e}")
        finally:
            batch.clear()

    for _, row in df.iterrows():
        src_node = f"{src_label}/{str(row[src_col])}"
        dst_node = f"{dst_label}/{str(row[dst_col])}"

        # Any remaining columns become edge attributes.
        attributes = row.fillna("NaN").drop([src_col, dst_col]).to_dict()
        batch.append({"_from": src_node, "_to": dst_node, **attributes})

        if len(batch) >= batch_size:
            flush()

    if batch:
        flush()

# add_edges(ChCh, "drug", "drug", "drug", "drug_target", "drug-drug")
# add_edges(ChG, "drug", "gene", "drug", "gene", "drug-gene")
# add_edges(DCh, "disease", "drug", "mesh", "drug", "disease-drug")
# add_edges(DD, "disease", "disease", "doid", "doid_target", "disease-disease")
# add_edges(DF, "disease", "go", "mesh", "go", "disease-function")
# add_edges(DG, "disease", "gene", "mesh", "gene", "disease-gene")
# add_edges(FF, "go", "go", "go", "go_target", "function-function")
# add_edges(GF, "gene", "go", "gene", "go", "gene-function")  # gene -> go, matching the graph's edge definition
# add_edges(GP, "gene", "protein", "gene", "pdb", "gene-protein")
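
Row-by-row iteration with `iterrows` is slow for tables with millions of edges. As an alternative, the edge documents can be built from `to_dict("records")` and then sliced into batches. The sketch below is a hedged illustration with hypothetical helpers `edge_documents` and `chunked`; it only prepares the documents and leaves the `insert_many` call to the caller. It does not replace missing values the way `add_edges` does.

```python
import pandas as pd


def edge_documents(df, src_label, dst_label, src_col, dst_col):
    """Build edge documents with `_from` and `_to` plus any extra columns."""
    docs = []
    for record in df.to_dict("records"):
        src = record.pop(src_col)
        dst = record.pop(dst_col)
        docs.append(
            {"_from": f"{src_label}/{src}", "_to": f"{dst_label}/{dst}", **record}
        )
    return docs


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy drug-drug table using the first two rows shown for ChCh above.
toy_edges = pd.DataFrame(
    {"drug": ["DB00862", "DB00575"], "drug_target": ["DB00966", "DB00806"]}
)
docs = edge_documents(toy_edges, "drug", "drug", "drug", "drug_target")
batches = list(chunked(docs, 1))
```

Each element of `batches` could then be passed to `db[edge_collection].insert_many(...)` in the same way as in `add_edges`.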

In [None]:
if not db.has_graph("NeuThera"):
    db.create_graph("NeuThera")

neuthera_graph = db.graph("NeuThera")

edge_definitions = [
    ("drug-drug", "drug", "drug"),
    ("drug-gene", "drug", "gene"),
    ("disease-drug", "disease", "drug"),
    ("disease-disease", "disease", "disease"),
    ("disease-function", "disease", "go"),
    ("disease-gene", "disease", "gene"),
    ("function-function", "go", "go"),
    ("gene-function", "gene", "go"),
    ("gene-protein", "gene", "protein")
]

for edge_col, from_col, to_col in edge_definitions:
    # edge_definitions() returns a list of dicts, so test for the definition by name instead.
    if not neuthera_graph.has_edge_definition(edge_col):
        neuthera_graph.create_edge_definition(
            edge_collection=edge_col,
            from_vertex_collections=[from_col],
            to_vertex_collections=[to_col]
        )

print("NeuThera graph successfully created and linked with node and edge collections!")

NeuThera graph successfully created and linked with node and edge collections!
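
The success message is printed regardless of how many documents were actually stored, so a quick count per collection can confirm what reached the database. The sketch below assumes a python-arango database handle such as `db` from the cell above; `collection_counts` is a hypothetical helper.

```python
def collection_counts(db, names):
    """Return {collection name: document count} for the collections that exist."""
    return {
        name: db.collection(name).count()
        for name in names
        if db.has_collection(name)
    }


# Example call (needs a running ArangoDB instance, so it is left commented out):
# print(collection_counts(db, ["drug", "gene", "disease", "protein", "drug-drug"]))
```

Collections that were never created are omitted from the result rather than reported as zero.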
