# Pre-processing the Dataset

We build a knowledge graph (KG) with NetworkX, persisted in ArangoDB through `nx-arangodb`.

This requires downloading multi-modal datasets from [BioSNAP](https://snap.stanford.edu/biodata/).

##### Modalities:
- Drug-Drug interaction with side-effects
- Drug-Gene interaction
- Disease-Drug interaction
- Disease-Disease interaction
- Disease-Function interaction
- Function-Function interaction
- Gene-Function interaction
- Gene-Gene interaction
- Gene-Protein interaction
- Genomic Region-Genomic Region interaction
- Protein-Protein interaction
- Protein-Protein-Tissue interaction
- Tissue-Function-Gene interaction

In [3]:
# `from arango import ArangoClient` is provided by the python-arango package
!pip install networkx pandas python-arango matplotlib
!pip install nx-arangodb


In [None]:
# Interactions

!wget -nc https://snap.stanford.edu/biodata/datasets/10001/files/ChCh-Miner_durgbank-chem-chem.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10002/files/ChG-Miner_miner-chem-gene.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10004/files/DCh-Miner_miner-disease-chemical.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10006/files/DD-Miner_miner-disease-disease.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10019/files/DF-Miner_miner-disease-function.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10020/files/DG-Miner_miner-disease-gene.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10026/files/FF-Miner_miner-func-func.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10024/files/GF-Miner_miner-gene-function.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10027/files/GP-Miner_miner-gene-protein.tsv.gz
#!wget -nc https://snap.stanford.edu/biodata/datasets/10028/files/PP-Miner_miner-ppi.tsv.gz
#!wget -nc https://snap.stanford.edu/biodata/datasets/10032/files/GG-NE.tar.gz

# Entities and Feature tables

!wget -nc https://snap.stanford.edu/biodata/datasets/10021/files/D-DoMiner_miner-diseaseDOID.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10003/files/D-MeshMiner_miner-disease.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10025/files/D-OmimMiner_miner-diseaseOMIM.tsv.gz
!wget -nc https://snap.stanford.edu/biodata/datasets/10022/files/G-SynMiner_miner-geneHUGO.tsv.gz

--2025-02-24 13:31:42--  https://snap.stanford.edu/biodata/datasets/10032/files/GG-NE.tar.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 20134095145 (19G), 1593698601 (1.5G) remaining [application/x-gzip]
Saving to: ‘GG-NE.tar.gz’

GG-NE.tar.gz        100%[++++++++++++++++++=>]  18.75G   748KB/s    in 35m 6s  

2025-02-24 14:06:49 (739 KB/s) - ‘GG-NE.tar.gz’ saved [20134095145/20134095145]



In [3]:
import networkx as nx
import pandas as pd
import nx_arangodb as nxadb
import requests
from arango import ArangoClient
import matplotlib.pyplot as plt
import os

os.environ["DATABASE_HOST"] = "http://localhost:8529"  # Replace with your ArangoDB host
os.environ["DATABASE_USERNAME"] = "root"               # Replace with your ArangoDB username
os.environ["DATABASE_PASSWORD"] = "openSesame"         # Replace with your ArangoDB password
os.environ["DATABASE_NAME"] = "NeuThera"               

[19:08:33 +0530] [INFO]: NetworkX-cuGraph is unavailable: No module named 'cupy'.
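
The cell above writes the connection settings, including the password, directly into the notebook. As a minimal sketch of an alternative, the settings could be exported in the shell beforehand and only validated here. The helper name `load_db_config` is hypothetical and not part of `nx-arangodb`; it only checks that the four variables used above are present.

```python
import os

REQUIRED_VARS = (
    "DATABASE_HOST",
    "DATABASE_USERNAME",
    "DATABASE_PASSWORD",
    "DATABASE_NAME",
)

def load_db_config(env=None):
    """Return the four connection settings, failing early if any is missing.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_db_config()` before creating the graph surfaces a missing variable as a clear error rather than a connection failure later on.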


### Loading Dataset

#### Interactions

In [5]:
# Drug-Drug interaction

ChCh = pd.read_csv(
    "./ChCh-Miner_durgbank-chem-chem.tsv.gz",
    compression="gzip",
    sep="\t",
    names=["drug", "drug_target"],
)

ChCh.head(5)

Unnamed: 0,drug,drug_target
0,DB00862,DB00966
1,DB00575,DB00806
2,DB01242,DB08893
3,DB01151,DB08883
4,DB01235,DB01275
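
The drug-drug cell passes `names` without `header=0`, so pandas treats the first line of the file as data, whereas the later cells pass `header=0` and replace the first line with the given names. Which setting is right depends on whether a file starts with a header line. The sketch below is one way to check before loading; the helper names `first_line` and `has_comment_header` are hypothetical, and the assumption that header lines start with `#` is taken from the `# GO_ID` and `# ensembl_gene_id` column names seen later in this notebook.

```python
import gzip

def first_line(path):
    """Return the first line of a gzip-compressed text file, without the newline."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.readline().rstrip("\n")

def has_comment_header(path):
    """True if the file starts with a '#'-prefixed header line."""
    return first_line(path).startswith("#")

# In the notebook this could pick the read_csv argument, for example:
# header = 0 if has_comment_header("./DD-Miner_miner-disease-disease.tsv.gz") else None
```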


In [6]:
# Drug-Gene interaction

ChG = pd.read_csv(
    "./ChG-Miner_miner-chem-gene.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["drug", "gene"],
)

ChG.head(5)

Unnamed: 0,drug,gene
0,DB00357,P05108
1,DB02721,P00325
2,DB00773,P23219
3,DB07138,Q16539
4,DB08136,P24941


In [7]:
# Disease-Drug interaction

DCh = pd.read_csv(
    "./DCh-Miner_miner-disease-chemical.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "drug"],
)

DCh.head(5)

Unnamed: 0,mesh,drug
0,MESH:D005923,DB00564
1,MESH:D009503,DB01072
2,MESH:D016115,DB01759
3,MESH:D018476,DB00451
4,MESH:C567059,DB00641


In [8]:
# Disease-Disease interaction

DD = pd.read_csv(
    "./DD-Miner_miner-disease-disease.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["doid", "doid_target"],
)

DD.head(5)

Unnamed: 0,doid,doid_target
0,DOID:0001816,DOID:1115
1,DOID:0002116,DOID:10124
2,DOID:0014667,DOID:4
3,DOID:0050004,DOID:10400
4,DOID:0050012,DOID:934


In [9]:
# Disease-Function interaction

DF = pd.read_csv(
    "./DF-Miner_miner-disease-function.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "go"],
)

DF.head(5)

Unnamed: 0,mesh,go
0,MESH:D000037,GO:0009257
1,MESH:C536409,GO:0009257
2,MESH:D009436,GO:0009257
3,MESH:D000860,GO:0009258
4,MESH:D008106,GO:0009258


In [10]:
# Disease-Gene interaction

DG = pd.read_csv(
    "./DG-Miner_miner-disease-gene.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "gene"],
)

DG.head(5)

Unnamed: 0,mesh,gene
0,MESH:D005756,A0A087WZV0
1,MESH:D055370,P11464
2,MESH:D007410,Q92945
3,MESH:D014062,Q6ISS4
4,MESH:D054549,Q96RU8


In [11]:
# Function-Function interaction

FF = pd.read_csv(
    "./FF-Miner_miner-func-func.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["go", "go_target"],
)

FF.head(5)

Unnamed: 0,go,go_target
0,GO:0008296,GO:0008408
1,GO:0016811,GO:0033970
2,GO:0045222,GO:0045223
3,GO:0021803,GO:0030031
4,GO:0033574,GO:1901654


In [12]:
# Gene-Function interaction (human only, taxon:9606)


GF = pd.read_csv(
    "./GF-Miner_miner-gene-function.tsv.gz",
    compression="gzip",
    sep="\t",
    skipinitialspace=True
)

GF = GF.rename(
    columns={
        "# GO_ID": "go",
        "Gene": "gene",
        "C8": "go_category",
        "C10": "protein",
        "C12": "organism",
        "C13": "date"
    }
)

GF = GF[GF["organism"] == "taxon:9606"]
GF = GF[["go", "gene"]]

GF.head(5)


Unnamed: 0,go,gene
0,GO:0005509,PDCD6
1,GO:0004672,CDK1
2,GO:0005524,CDK1
3,GO:0005634,CDK1
4,GO:0005737,CDK1


In [13]:
# Gene-Protein interaction

GP = pd.read_csv(
    "./GP-Miner_miner-gene-protein.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["ensg", "ensp"],
)

GP.head(5)

Unnamed: 0,ensg,ensp
0,ENSG00000198888,ENSP00000354687
1,ENSG00000198763,ENSP00000355046
2,ENSG00000198804,ENSP00000354499
3,ENSG00000198712,ENSP00000354876
4,ENSG00000228253,ENSP00000355265


In [14]:
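# Protein-Protein interaction (disabled)
# Note: the path below points at GG-NE.tar.gz, not at the PP-Miner file from the download cell.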

# PP = pd.read_csv(
#     "./GG-NE.tar.gz",
#     compression="gzip",
#     sep="\t",
#     nrows=5,
#     # header=0,
#     # names=["ensg", "ensp"],
# )

# PP.head(5)

#### Entities and Feature Tables

In [15]:
# Disease DOID Synopses

Doid = pd.read_csv(
    "./D-DoMiner_miner-diseaseDOID.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["doid", "name", "definition", "synonym"],
)

Doid

Unnamed: 0,doid,name,definition,synonym
0,DOID:0001816,angiosarcoma,A malignant vascular tumor that results_in rap...,hemangiosarcoma EXACT []
1,DOID:0002116,pterygium,,surfer's eye EXACT []
2,DOID:0014667,disease of metabolism,A disease that involving errors in metabolic p...,metabolic disease EXACT [SNOMEDCT_2005_07_31:...
3,DOID:0050001,Actinomadura madurae infectious disease,,
4,DOID:0050002,Actinomadura pelletieri infectious disease,,
...,...,...,...,...
9242,DOID:9989,metastasis to the orbit,,secondary malignant neoplasm of orbit (disorde...
9243,DOID:999,eosinophilia,,Eosinophilic leukocytosis EXACT [MTHICD9_2006:...
9244,DOID:9993,hypoglycemia,,Hypoglycaemia EXACT [SNOMEDCT_2005_07_31:15469...
9245,DOID:9995,endocrine and metabolic disturbances specific ...,,


In [16]:
# Disease MESH Synopses

Mesh = pd.read_csv(
    "./D-MeshMiner_miner-disease.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["mesh", "name", "definition", "synonym"],
)

Mesh

Unnamed: 0,mesh,name,definition,synonym
0,MESH:C538288,10p Deletion Syndrome (Partial),,"Chromosome 10, 10p- Partial|Chromosome 10, mon..."
1,MESH:C535484,13q deletion syndrome,,Chromosome 13q deletion|Chromosome 13q deletio...
2,MESH:C579849,15q24 Microdeletion,,15q24 Deletion|15q24 Microdeletion Syndrome|In...
3,MESH:C579850,16p11.2 Deletion Syndrome,,
4,MESH:C567076,"17,20-Lyase Deficiency, Isolated",,"17-Alpha-Hydroxylase-17,20-Lyase Deficiency, C..."
...,...,...,...,...
11327,MESH:C536729,Zunich neuroectodermal syndrome,,CHIME syndrome
11328,MESH:C536730,Zuska's Disease,,Lactation and squamous metaplasia of lactifero...
11329,MESH:C565223,Zygodactyly 1,,ZD1
11330,MESH:D015051,Zygomatic Fractures,Fractures of the zygoma.,"Fractures, Zygomatic|Fracture, Zygomatic|Zygom..."


In [17]:
# OMIM Genetic Disorders

Omim = pd.read_csv(
    "./D-OmimMiner_miner-diseaseOMIM.tsv.gz",
    compression="gzip",
    sep="\t",
    header=0,
    names=["omim", "phenotypes", "gene_name", "gene", "location", "_"],
)

Omim = Omim.iloc[:, :-1]

Omim

Unnamed: 0,omim,phenotypes,gene_name,gene,location
0,OMIM:115665,"Cataract 8, multiple types (2)","Cataract, congenital, Volkmann type","CTRCT8, CCV",1pter-p36.13
1,OMIM:607671,"Dystonia 13, torsion (2)","Dystonia 13, torsion",DYT13,1p36.32-p36.13
2,OMIM:606242,Kondoh syndrome (2),"Kondoh syndrome (mental retardation, microceph...",KONDS,1p36.32-p35.3
3,OMIM:614414,"Deafness, autosomal recessive 96 (2)","Deafness, autosomal recessive 96",DFNB96,1p36.31-p36.13
4,OMIM:609918,Gallbladder disease 2 (2),Gallbladder disease 2,GBD2,1p36.21
...,...,...,...,...,...
1186,OMIM:300519,"Mental retardation, X-linked, syndromic, Marti...","Mental retardation, X-linked, syndromic, Marti...",MRXSMP,Chr.X
1187,OMIM:400042,"Spermatogenic failure, Y-linked, 1 (4)",Chromosome Yq11 interstitial deletion syndrome,"DELYq11, CYDELq11, SPGFY1",Yq11
1188,OMIM:475000,,"Growth control, Y-chromosome influenced","GCY, TSY, STA",Yq12
1189,OMIM:400043,"Deafness, Y-linked 1 (1)","Deafness, Y-linked 1",DFNY1,Chr.Y


In [18]:
# Gene

Gene = pd.read_csv(
    "./G-SynMiner_miner-geneHUGO.tsv.gz",
    compression="gzip",
    sep="\t",
)

Gene = Gene.rename(columns={'# ensembl_gene_id': 'ensg', 'symbol': 'gene', 'name': 'gene_name'})

Gene


Unnamed: 0,ensg,hgnc_id,gene,gene_name,locus_group,locus_type,status,location,location_sortable,alias_symbol,...,horde_id,merops,imgt,iuphar,kznf_gene_catalog,mamit-trnadb,cd,lncrnadb,enzyme_id,intermediate_filament_db
0,ENSG00000121410,HGNC:5,A1BG,alpha-1-B glycoprotein,protein-coding gene,gene with protein product,Approved,19q13.43,19q13.43,,...,,I43.950,,,,,,,,
1,ENSG00000268895,HGNC:37133,A1BG-AS1,A1BG antisense RNA 1,non-coding RNA,"RNA, long non-coding",Approved,19q13.43,19q13.43,FLJ23569,...,,,,,,,,,,
2,ENSG00000148584,HGNC:24086,A1CF,APOBEC1 complementation factor,protein-coding gene,gene with protein product,Approved,10q21.1,10q21.1,ACF|ASP|ACF64|ACF65|APOBEC1CF,...,,,,,,,,,,
3,ENSG00000175899,HGNC:7,A2M,alpha-2-macroglobulin,protein-coding gene,gene with protein product,Approved,12p13.31,12p13.31,FWP007|S863-7|CPAMD5,...,,I39.001,,,,,,,,
4,ENSG00000245105,HGNC:27057,A2M-AS1,A2M antisense RNA 1 (head to head),non-coding RNA,"RNA, long non-coding",Approved,12p13.31,12p13.31,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35649,ENSG00000162378,HGNC:25820,ZYG11B,"zyg-11 family member B, cell cycle regulator",protein-coding gene,gene with protein product,Approved,1p32.3,01p32.3,FLJ13456,...,,,,,,,,,,
35650,ENSG00000159840,HGNC:13200,ZYX,zyxin,protein-coding gene,gene with protein product,Approved,7q32,07q32,,...,,,,,,,,,,
35651,ENSG00000274572,HGNC:51695,ZYXP1,zyxin pseudogene 1,pseudogene,pseudogene,Approved,8q24.23,08q24.23,,...,,,,,,,,,,
35652,ENSG00000074755,HGNC:29027,ZZEF1,zinc finger ZZ-type and EF-hand domain contain...,protein-coding gene,gene with protein product,Approved,17p13.3,17p13.3,KIAA0399|ZZZ4|FLJ10821,...,,,,,,,,,,


In [None]:
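# DrugBank vocabulary and ChEMBL chemical representations
# Note: neither file is fetched by the download cell above; both must already be present locally.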

Drug = pd.read_csv(
    "./drugbank_all_drugbank_vocabulary.csv.zip",
    compression="zip",
    sep=",",
    header=0,
    names=["drug", "accession", "drug_name", "cas", "unii", "synonym", "key"],
)

ChemRepresentation = pd.read_csv(
    "./chembl_35_chemreps.txt.gz",
    compression="gzip",
    sep="\t",
    # header=0,
    # names=["drug", "accession", "drug_name", "cas", "unii", "synonym", "key"],
)

# Drug.head(0)

ChemRepresentation.head(5)


Unnamed: 0,chembl_id\tcanonical_smiles\tstandard_inchi\tstandard_inchi_key
0,CHEMBL153534\tCc1cc(-c2csc(N=C(N)N)n2)cn1C\tIn...
1,CHEMBL440060\tCC[C@H](C)[C@H](NC(=O)[C@H](CC(C...
2,CHEMBL440245\tCCCC[C@@H]1NC(=O)[C@@H](NC(=O)[C...
3,CHEMBL440249\tCC(C)C[C@@H]1NC(=O)CNC(=O)[C@H](...
4,CHEMBL405398\tBrc1cccc(Nc2ncnc3ccncc23)c1NCCN1...


### NetworkX Graph

In [42]:
G = nxadb.MultiGraph(name="NeuThera")

db = G.db

[16:33:17 +0530] [INFO]: Graph 'NeuThera' exists.
[16:33:17 +0530] [INFO]: Default node type set to 'NeuThera_node'


In [52]:
# Adding Nodes

collections = ["drug", "gene", "disease_doid", "disease_mesh", "disease_omim"]

# Create collections if they don't exist
for col in collections:
    if not db.has_collection(col):
        db.create_collection(col)

def add_nodes(df, label, key_column):
    print(f"adding nodes to {label}")
    G.add_nodes_from(
        (
            f"{label}/{str(row[key_column]) if pd.notna(row[key_column]) else 'NotAvailable'}",  # Convert _key to string, replace NaN
            row.fillna("NaN").drop(key_column).to_dict()  # Replace NaN in attributes
        )
        for _, row in df.iterrows()
    )
    
add_nodes(Drug, "drug", "drug")
add_nodes(Gene, "gene", "gene")
add_nodes(Doid, "disease_doid", "doid")
add_nodes(Mesh, "disease_mesh", "mesh")
add_nodes(Omim, "disease_omim", "omim")


adding nodes to drug
adding nodes to gene
adding nodes to disease_doid
adding nodes to disease_mesh
adding nodes to disease_omim
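
`add_nodes` builds each node ID as `<collection>/<key>` and substitutes `NotAvailable` when the key is missing, which means every row with a missing key maps to the same ID. The sketch below reproduces that ID rule in plain Python so collisions can be counted before writing to the database. `node_id` and `duplicate_ids` are hypothetical helper names, and the missing-value test approximates `pd.notna` for `None` and float NaN only.

```python
import math
from collections import Counter

def node_id(label, key, placeholder="NotAvailable"):
    """Build '<label>/<key>', using the placeholder for None or NaN keys."""
    is_missing = key is None or (isinstance(key, float) and math.isnan(key))
    return f"{label}/{placeholder if is_missing else key}"

def duplicate_ids(label, keys):
    """Return {node_id: count} for every ID produced more than once."""
    counts = Counter(node_id(label, key) for key in keys)
    return {nid: n for nid, n in counts.items() if n > 1}

# In the notebook: duplicate_ids("gene", Gene["gene"]) lists keys that would collide.
```

Any ID reported here corresponds to several rows sharing one node ID, so it is worth inspecting those rows before loading.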


In [None]:
# Adding Edges

edge_collections = ["drug-drug", "drug-gene", "disease-drug", "disease-disease", "disease-function", "disease-gene", "function-function", "gene-function", "gene-protein"]

# Create edge collections if they don't exist
for edge_col in edge_collections:
    if not db.has_collection(edge_col):
        db.create_collection(edge_col, edge=True)

def add_edges(df, src_label, dst_label, src_col, dst_col, edge_collection):
    """Add one edge per row of df, from src_label/<src_col> to dst_label/<dst_col>.

    Note: edge_collection is only used for logging here. It is not passed to
    G.add_edges_from, so nx-arangodb decides which edge collection receives the edges.
    """
    print(f"adding Edges to {edge_collection}")
    G.add_edges_from(
        (
            f"{src_label}/{str(row[src_col]) if pd.notna(row[src_col]) else 'NotAvailable'}",
            f"{dst_label}/{str(row[dst_col]) if pd.notna(row[dst_col]) else 'NotAvailable'}",
            row.fillna("NaN").drop([src_col, dst_col]).to_dict(),  # Replace NaN in attributes
        )
        for _, row in df.iterrows()
    )

# Note: the "go" and "protein" labels used below have no matching node
# collection in the "Adding Nodes" cell above.
add_edges(ChCh, "drug", "drug", "drug", "drug_target", "drug-drug")
add_edges(ChG, "drug", "gene", "drug", "gene", "drug-gene")
add_edges(DCh, "disease_mesh", "drug", "mesh", "drug", "disease-drug")
add_edges(DD, "disease_doid", "disease_doid", "doid", "doid_target", "disease-disease")
add_edges(DF, "disease_mesh", "go", "mesh", "go", "disease-function")
add_edges(DG, "disease_mesh", "gene", "mesh", "gene", "disease-gene")
add_edges(FF, "go", "go", "go", "go_target", "function-function")
add_edges(GF, "go", "gene", "go", "gene", "gene-function")
add_edges(GP, "gene", "protein", "ensg", "ensp", "gene-protein")

adding Edges to drug-drug
adding Edges to drug-gene
adding Edges to disease-drug
adding Edges to disease-disease
adding Edges to disease-function
adding Edges to disease-gene
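
Several edge tables appear to use a different identifier namespace than the node tables. For example, the `gene` column of `ChG` holds accessions such as `P05108`, while the gene nodes are keyed on the `gene` symbol column (`A1BG`, `A2M`, ...), and `GP` keys genes by `ensg`. Edges built from such columns would point at node IDs that were never added. The sketch below measures how many edge endpoints have a matching node key; `key_coverage` is a hypothetical helper name.

```python
def key_coverage(edge_keys, node_keys):
    """Fraction of distinct edge endpoint keys that also occur among the node keys."""
    edge_set = set(edge_keys)
    if not edge_set:
        return 1.0  # nothing to match, so nothing is missing
    return len(edge_set & set(node_keys)) / len(edge_set)

# Values taken from the table previews earlier in this notebook:
print(key_coverage(["P05108", "P00325", "P23219"], ["A1BG", "A1CF", "A2M"]))

# In the notebook: key_coverage(ChG["gene"], Gene["gene"])
```

A coverage well below 1.0 suggests the edge column needs mapping to the node key (or the nodes need a different key column) before the edges are added.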


IOStream.flush timed out


ConnectionAbortedError: Can't connect to host(s) within limit (3)
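
The run above ends with a `ConnectionAbortedError` partway through the edge sets. The cause is not visible in the output. One possible mitigation, assuming the failure is related to submitting very large edge sets in a single call, is to send the edges in smaller chunks. The sketch below shows only the chunking step; `batched` is a hypothetical helper, the chunk size is an arbitrary choice, and whether this avoids the error has not been tested.

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# In the notebook, with `edge_tuples` being the generator built in add_edges:
# for chunk in batched(edge_tuples, 10_000):
#     G.add_edges_from(chunk)
```

Chunking also makes it possible to log progress per batch and to resume from the last successful chunk after a dropped connection.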

In [None]:
# Planned pipeline: NetworkX -> ArangoDB -> generative AI -> 10 novel drug candidates (SMILES) -> DeepPurpose scoring -> best-scoring candidate is selected
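
The final step of the pipeline, scoring each candidate and keeping the best one, can be sketched independently of the model used. In the sketch below `select_best` and `score_fn` are hypothetical names; `score_fn` stands in for whatever scoring DeepPurpose would provide, and no DeepPurpose API is assumed.

```python
def select_best(candidates, score_fn):
    """Score every candidate SMILES string and return (best_score, best_smiles)."""
    scored = [(score_fn(smiles), smiles) for smiles in candidates]
    if not scored:
        raise ValueError("no candidates to score")
    # max() keeps the first candidate among equal scores
    return max(scored, key=lambda pair: pair[0])

# Toy usage with string length as a placeholder score (not a real affinity model):
best_score, best_smiles = select_best(["CCO", "CC(=O)O", "C"], score_fn=len)
```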