# **Preparation of Validation Datasets**

To evaluate our multi-omics ranking methods, we compiled independent validation datasets from established public databases covering genes, miRNAs, methylation markers, and pathways. These datasets serve as external benchmarks for assessing the biological relevance of the features predicted for each disease-specific dataset: **Alzheimer's Disease (AD)**, **Breast Cancer (BRCA)**, and **Progressive Supranuclear Palsy (PSP)**.

**Gene Validation**

* **GeneCards**: A comprehensive gene database providing integrated annotations, used to extract high-confidence disease-associated genes.
* **Comparative Toxicogenomics Database (CTD)**: Used to validate gene–disease associations based on curated and inferred evidence.

**miRNA Validation**

* **HMDD (Human microRNA Disease Database)**: Employed to confirm known miRNA–disease associations based on experimental evidence.

**Methylation Validation**

* **EWAS Atlas**: Used to extract disease-associated epigenetic markers (CpGs or genes) identified in Epigenome-Wide Association Studies (EWAS).

**Pathway Validation**

* **CTD**: Provided curated information on disease-related biological pathways.
* **g:Profiler**: Used for functional enrichment and pathway mapping of predicted features, enabling cross-validation of pathway-level relevance.
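
As a rough illustration of how such benchmark tables can be used, the sketch below scores a ranked list of predicted features by the fraction of its top-k entries found in a validation set. The function name `precision_at_k`, the toy ranking, and the choice of metric are assumptions made for illustration only; the metric actually used for evaluation is not specified here.

```python
import pandas as pd

def precision_at_k(ranked_entities, validation_entities, k):
    """Fraction of the top-k ranked entities that appear in the validation set."""
    top_k = list(ranked_entities)[:k]
    if not top_k:
        return 0.0
    known = set(validation_entities)
    hits = sum(entity in known for entity in top_k)
    return hits / len(top_k)

# Toy validation table in the Disease / Entity / Score layout
# produced by the preparation cells later in this notebook
val = pd.DataFrame({
    'Disease': ['AD', 'AD', 'AD', 'BRCA'],
    'Entity': ['APP', 'PSEN1', 'APOE', 'ATM'],
    'Score': [147.9, 125.9, 99.6, 313.8],
})

# Hypothetical ranked prediction for AD (placeholder names, not real model output)
predicted = ['APP', 'GENE_X', 'APOE', 'GENE_Y']
ad_entities = val.loc[val['Disease'] == 'AD', 'Entity']
p_at_4 = precision_at_k(predicted, ad_entities, k=4)
print(p_at_4)
```

Restricting the validation set to one disease before scoring keeps a gene that is known for one disease from counting as a hit for another.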

### Load Packages

In [1]:
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')

### Load and Prepare Raw Datasets

#### 1. Prepare GeneCards Validation Data

In [2]:
# GeneCards search results for Alzheimer's disease
alzhemer_gene_cards = pd.read_csv('../downloads/val_data/GeneCards-SearchResults-Alzheimers.csv')
alzhemer_gene_cards = alzhemer_gene_cards[['Gene Symbol', 'Relevance score']]
alzhemer_gene_cards = alzhemer_gene_cards.rename(columns={'Gene Symbol': 'Entity', 'Relevance score': 'Score'})
# sort_values returns a new frame, so the result has to be assigned back
alzhemer_gene_cards = alzhemer_gene_cards.sort_values(by='Score', ascending=False)
alzhemer_gene_cards.head()

| | Entity | Score |
|---|---|---|
| 0 | APP | 147.905914 |
| 1 | PSEN1 | 125.910118 |
| 2 | APOE | 99.596542 |
| 3 | PSEN2 | 80.182045 |
| 4 | MAPT | 60.629429 |


In [3]:
# GeneCards search results for breast cancer
bc_gene_cards = pd.read_csv('../downloads/val_data/GeneCards-SearchResults-BreastCancer.csv')
bc_gene_cards = bc_gene_cards[['Gene Symbol', 'Relevance score']]
bc_gene_cards = bc_gene_cards.rename(columns={'Gene Symbol': 'Entity', 'Relevance score': 'Score'})
# sort_values returns a new frame, so the result has to be assigned back
bc_gene_cards = bc_gene_cards.sort_values(by='Score', ascending=False)
bc_gene_cards.head()

| | Entity | Score |
|---|---|---|
| 0 | ATM | 313.776306 |
| 1 | PALB2 | 283.848755 |
| 2 | BRIP1 | 268.905914 |
| 3 | BRCA2 | 261.717804 |
| 4 | CHEK2 | 252.202255 |


In [4]:
# GeneCards search results for progressive supranuclear palsy (PSP)
psp_gene_cards = pd.read_csv('../downloads/val_data/GeneCards-SearchResults-PSP.csv')
psp_gene_cards = psp_gene_cards[['Gene Symbol', 'Relevance score']]
psp_gene_cards = psp_gene_cards.rename(columns={'Gene Symbol': 'Entity', 'Relevance score': 'Score'})
# sort_values returns a new frame, so the result has to be assigned back
psp_gene_cards = psp_gene_cards.sort_values(by='Score', ascending=False)
psp_gene_cards.head()

| | Entity | Score |
|---|---|---|
| 0 | MAPT | 128.200134 |
| 1 | SNCA | 18.441431 |
| 2 | GRN | 17.680164 |
| 3 | STH | 16.352251 |
| 4 | LRRK2 | 15.583508 |


In [5]:
# aggregate these results: tag each table with its disease label, then stack them
val_datasets = [
    df.assign(Disease=label)[['Disease', 'Entity', 'Score']]
    for label, df in [('AD', alzhemer_gene_cards), ('BRCA', bc_gene_cards), ('PSP', psp_gene_cards)]
]
genecards_gene_val_data = pd.concat(val_datasets, axis=0)
genecards_gene_val_data['Entity'] = genecards_gene_val_data['Entity'].str.strip()
genecards_gene_val_data.to_csv('../data/val_data/genecards_val_data.csv', index=False)
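
The three GeneCards cells above repeat the same select, rename, and sort steps, so they could be folded into one helper. The sketch below is one possible way to do that; the name `prepare_genecards` and the toy input frame are hypothetical, and only the column names `Gene Symbol` and `Relevance score` are taken from the cells above.

```python
import pandas as pd

def prepare_genecards(raw, disease_label):
    """Return a Disease / Entity / Score table sorted by descending score."""
    out = (
        raw[['Gene Symbol', 'Relevance score']]
        .rename(columns={'Gene Symbol': 'Entity', 'Relevance score': 'Score'})
        .sort_values('Score', ascending=False)
        .reset_index(drop=True)
    )
    out['Entity'] = out['Entity'].str.strip()
    out.insert(0, 'Disease', disease_label)
    return out

# Toy stand-in for a GeneCards export (values and the 'Extra' column are illustrative)
raw_demo = pd.DataFrame({
    'Gene Symbol': ['MAPT ', 'APP', 'APOE'],
    'Relevance score': [60.6, 147.9, 99.6],
    'Extra': ['x', 'y', 'z'],
})
demo = prepare_genecards(raw_demo, 'AD')
print(demo)
```

With such a helper, each disease would need only a `pd.read_csv` call followed by `prepare_genecards(raw, label)`, and the per-disease tables could be passed straight to `pd.concat`.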

#### 2. Prepare CTD Validation Data

In [7]:
# load pathway val data for Alzheimer's disease (MeSH D000544)
ad_ctd_disease_pathways = pd.read_csv('../downloads/val_data/CTD_D000544_pathways_20250721063406.csv')
# Score = number of '|'-separated genes through which CTD infers the disease-pathway association
ad_ctd_disease_pathways['Score'] = ad_ctd_disease_pathways['Association inferred via'].apply(lambda x: len(x.split('|')))
ad_ctd_disease_pathways['Entity'] = ad_ctd_disease_pathways.Pathway 
ad_ctd_disease_pathways = ad_ctd_disease_pathways[['Entity', 'Score']] 
ad_ctd_disease_pathways.sort_values('Score',  ascending = False, inplace = True) 
ad_ctd_disease_pathways.insert(0, 'Disease', ['AD']*ad_ctd_disease_pathways.shape[0])
ad_ctd_disease_pathways['Entity'] = ad_ctd_disease_pathways['Entity'].str.strip()
ad_ctd_disease_pathways.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 0 | AD | Immune System | 31 |
| 1 | AD | Signal Transduction | 28 |
| 2 | AD | Innate Immune System | 23 |
| 3 | AD | Metabolism | 23 |
| 4 | AD | Metabolism of proteins | 22 |


In [8]:
# load pathway val data for breast neoplasms (MeSH D001941)
brc_ctd_disease_pathways = pd.read_csv('../downloads/val_data/CTD_D001941_pathways_20250904053420.csv') 
brc_ctd_disease_pathways['Score'] = brc_ctd_disease_pathways['Association inferred via'].apply(lambda x: len(x.split('|'))) 
brc_ctd_disease_pathways['Entity'] = brc_ctd_disease_pathways.Pathway 
brc_ctd_disease_pathways = brc_ctd_disease_pathways[['Entity', 'Score']] 
brc_ctd_disease_pathways.sort_values('Score',  ascending = False, inplace = True) 
brc_ctd_disease_pathways.insert(0, 'Disease', ['BRCA']*brc_ctd_disease_pathways.shape[0])
brc_ctd_disease_pathways['Entity'] = brc_ctd_disease_pathways['Entity'].str.strip()
brc_ctd_disease_pathways.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 0 | BRCA | Signal Transduction | 150 |
| 1 | BRCA | Immune System | 109 |
| 2 | BRCA | Gene Expression | 100 |
| 3 | BRCA | Metabolism | 94 |
| 4 | BRCA | Pathways in cancer | 78 |


In [9]:
# load pathway val data for progressive supranuclear palsy (MeSH D013494)
psp_ctd_disease_pathways = pd.read_csv('../downloads/val_data/CTD_D013494_pathways_20250904052209.csv') 
psp_ctd_disease_pathways['Score'] = psp_ctd_disease_pathways['Association inferred via'].apply(lambda x: len(x.split('|'))) 
psp_ctd_disease_pathways['Entity'] = psp_ctd_disease_pathways.Pathway 
psp_ctd_disease_pathways = psp_ctd_disease_pathways[['Entity', 'Score']] 
psp_ctd_disease_pathways.sort_values('Score',  ascending = False, inplace = True) 
psp_ctd_disease_pathways.insert(0, 'Disease', ['PSP']*psp_ctd_disease_pathways.shape[0])
psp_ctd_disease_pathways['Entity'] = psp_ctd_disease_pathways['Entity'].str.strip()
psp_ctd_disease_pathways.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 0 | PSP | Alzheimer's disease | 2 |
| 2 | PSP | Herpes simplex infection | 2 |
| 3 | PSP | mRNA Splicing | 2 |
| 4 | PSP | mRNA Splicing - Major Pathway | 2 |
| 5 | PSP | Processing of Capped Intron-Containing Pre-mRNA | 2 |


In [10]:
# combine and save
ctd_disease_pathways = pd.concat([ad_ctd_disease_pathways, brc_ctd_disease_pathways, psp_ctd_disease_pathways], axis = 0)
ctd_disease_pathways.to_csv('../data/val_data/CTD_val_data_pathways.csv', index = False) 
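
The pathway score used above is the number of `|`-separated entries in `Association inferred via`. A vectorised variant that also tolerates missing values might look like the sketch below; the helper name `count_inference_genes` and the toy rows are assumptions for illustration, and treating a missing value as a score of 0 is a design choice rather than something the CTD export prescribes.

```python
import pandas as pd

def count_inference_genes(series):
    """Count '|'-separated gene symbols per row; missing values count as 0."""
    return series.str.split('|').str.len().fillna(0).astype(int)

# Toy rows mimicking the two CTD columns used above (gene lists are illustrative)
demo = pd.DataFrame({
    'Pathway': ['Immune System', 'Metabolism', 'Unannotated'],
    'Association inferred via': ['APP|APOE|TNF', 'APOE', None],
})
demo['Score'] = count_inference_genes(demo['Association inferred via'])
print(demo[['Pathway', 'Score']])
```

The `apply(lambda x: len(x.split('|')))` form raises an `AttributeError` on a missing value, whereas the `.str` accessor propagates it as `NaN`, which is then filled explicitly.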

In [11]:
# load alzheimer val data (Genes)
ctd_alzheimer_genes = pd.read_csv('../downloads/val_data/CTD_D000544_genes_20250721065005.csv') 
ctd_alzheimer_genes = ctd_alzheimer_genes[['Disease Name', 'Gene Symbol', 'Inference Score']]
ctd_alzheimer_genes.rename(columns = {'Disease Name': 'Disease','Gene Symbol': 'Entity', 'Inference Score':'Score' }, inplace = True)
# label every row in this export as AD
ctd_alzheimer_genes['Disease'] = 'AD'
ctd_alzheimer_genes['Entity'] = ctd_alzheimer_genes['Entity'].str.strip() 
ctd_alzheimer_genes.dropna(inplace = True)
ctd_alzheimer_genes.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 0 | AD | APP | 232.85 |
| 1 | AD | IL1B | 148.68 |
| 2 | AD | CASP3 | 143.6 |
| 3 | AD | TNF | 142.44 |
| 4 | AD | BDNF | 139.66 |


In [12]:
# load breast neoplasms val data (Genes)
ctd_brca_genes = pd.read_csv('../downloads/val_data/CTD_D001941_genes_20250904053237.csv') 
ctd_brca_genes = ctd_brca_genes[['Disease Name', 'Gene Symbol', 'Inference Score']]
ctd_brca_genes.rename(columns = {'Disease Name': 'Disease','Gene Symbol': 'Entity', 'Inference Score':'Score' }, inplace = True)
ctd_brca_genes['Entity'] = ctd_brca_genes['Entity'].str.strip() 
ctd_brca_genes = ctd_brca_genes[ctd_brca_genes.Disease == "Breast Neoplasms"]
ctd_brca_genes.Disease = ["BRCA"]*ctd_brca_genes.shape[0]
ctd_brca_genes.dropna(inplace = True)
ctd_brca_genes.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 0 | BRCA | TP53 | 362.21 |
| 1 | BRCA | CCND1 | 339.15 |
| 2 | BRCA | BCL2 | 327.23 |
| 3 | BRCA | BIRC5 | 308.22 |
| 4 | BRCA | PARP1 | 304.61 |


In [13]:
# load Progressive Supranuclear Palsy val data (Genes)
ctd_psp_genes = pd.read_csv('../downloads/val_data/CTD_D013494_genes_20250904052250.csv') 
ctd_psp_genes = ctd_psp_genes[['Disease Name', 'Gene Symbol', 'Inference Score']]
ctd_psp_genes.rename(columns = {'Disease Name': 'Disease','Gene Symbol': 'Entity', 'Inference Score':'Score' }, inplace = True)
ctd_psp_genes['Entity'] = ctd_psp_genes['Entity'].str.strip() 
ctd_psp_genes = ctd_psp_genes[ctd_psp_genes.Disease == "Supranuclear Palsy, Progressive"]
ctd_psp_genes.dropna(inplace = True)
ctd_psp_genes.Disease = ["PSP"]*ctd_psp_genes.shape[0]
ctd_psp_genes.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 0 | PSP | EIF2AK3 | 3.12 |
| 7 | PSP | GSK3B | 10.22 |
| 8 | PSP | HEXA | 9.49 |
| 9 | PSP | OTC | 9.08 |
| 10 | PSP | DRD1 | 8.8 |


In [14]:
# combine and save
ctd_genes = pd.concat([ctd_alzheimer_genes, ctd_brca_genes, ctd_psp_genes], axis = 0)
ctd_genes.to_csv('../data/val_data/CTD_val_data_genes.csv', index = False)
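
The filter-then-relabel pattern in the gene cells above could also be driven by a single lookup table. The sketch below assumes a hypothetical `DISEASE_LABELS` dictionary built from the disease names that appear in this notebook's filters, and a hypothetical helper `relabel_diseases`; the demo rows are illustrative.

```python
import pandas as pd

# Disease names as used in this notebook's filters, mapped to short labels
DISEASE_LABELS = {
    'Alzheimer Disease': 'AD',
    'Breast Neoplasms': 'BRCA',
    'Supranuclear Palsy, Progressive': 'PSP',
}

def relabel_diseases(df, labels=DISEASE_LABELS):
    """Keep rows whose Disease is a key of `labels` and swap in the short label."""
    kept = df[df['Disease'].isin(list(labels))].copy()
    kept['Disease'] = kept['Disease'].map(labels)
    return kept

# Toy frame in the Disease / Entity / Score layout (last row is a placeholder)
demo = pd.DataFrame({
    'Disease': ['Alzheimer Disease', 'Breast Neoplasms', 'Some Other Disease'],
    'Entity': ['APP', 'TP53', 'GENE_X'],
    'Score': [232.85, 362.21, 1.0],
})
relabelled = relabel_diseases(demo)
print(relabelled)
```

Taking a `.copy()` of the filtered frame before assigning to it keeps the relabelling from touching, or warning about, the original table.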

#### 3. Prepare HMDD Validation Data

In [15]:
hmdd = pd.read_excel('../downloads/val_data/alldata_v4.xlsx') 
hmdd = hmdd[['disease', 'miRNA']] 
hmdd['association_count'] = hmdd.groupby(['disease', 'miRNA'])['miRNA'].transform('count')
hmdd = hmdd.drop_duplicates().sort_values('association_count', ascending = False) 
hmdd.rename(columns = {'disease':'Disease', 'miRNA':'Entity', 'association_count': 'Score'}, inplace = True)
hmdd.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 1333 | Breast Neoplasms | hsa-mir-21 | 121 |
| 1390 | Colorectal Neoplasms | hsa-mir-21 | 77 |
| 7624 | Carcinoma, Hepatocellular | hsa-mir-122 | 69 |
| 2626 | Breast Neoplasms | hsa-mir-155 | 62 |
| 1366 | Carcinoma, Non-Small-Cell Lung | hsa-mir-21 | 50 |


In [16]:
# .copy() makes each subset an independent frame before its Disease label is overwritten
adhmdd = hmdd[hmdd.Disease.isin(['Alzheimer Disease'])].copy()
adhmdd['Disease'] = 'AD'
brhmdd = hmdd[hmdd.Disease.str.contains('Breast')].copy()
brhmdd['Disease'] = 'BRCA'
psphmdd = hmdd[hmdd.Disease.isin(["Supranuclear Palsy, Progressive"])].copy()
psphmdd['Disease'] = 'PSP'

hmdd_filtered = pd.concat([adhmdd, brhmdd, psphmdd], axis = 0)   
hmdd_filtered['Entity'] = hmdd_filtered['Entity'].str.strip()
hmdd_filtered.to_csv('../data/val_data/HMDD_val_data_miRNA.csv', index = False)
hmdd_filtered.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 3881 | AD | hsa-mir-146a | 28 |
| 10958 | AD | hsa-mir-34a | 16 |
| 11341 | AD | hsa-mir-132 | 16 |
| 2614 | AD | hsa-mir-155 | 12 |
| 17541 | AD | hsa-mir-107 | 12 |
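
The HMDD score above is the number of records per disease and miRNA pair. The same count can be obtained in one step with `groupby(...).size()`, as in the sketch below. The toy records are illustrative, and this is an alternative formulation rather than the code used above; note that `groupby` drops rows with a missing key by default.

```python
import pandas as pd

# Toy stand-in for HMDD records: one row per reported association
records = pd.DataFrame({
    'disease': ['Breast Neoplasms', 'Breast Neoplasms', 'Alzheimer Disease', 'Breast Neoplasms'],
    'miRNA': ['hsa-mir-21', 'hsa-mir-21', 'hsa-mir-146a', 'hsa-mir-155'],
})

# One row per (disease, miRNA) pair, with the number of supporting records as Score
counts = (
    records.groupby(['disease', 'miRNA'])
    .size()
    .reset_index(name='Score')
    .rename(columns={'disease': 'Disease', 'miRNA': 'Entity'})
    .sort_values('Score', ascending=False)
    .reset_index(drop=True)
)
print(counts)
```

This yields one row per pair directly, so no separate `drop_duplicates` step is needed.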


#### 4. Prepare EWAS Atlas Validation Data

In [17]:
ewas = pd.read_csv('../downloads/val_data/EWAS_Atlas_associations.tsv', sep='\t', encoding='ISO-8859-1') 
ewas = ewas[['trait', 'probe_ID', 'p_value']] 
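ewas.rename(columns = {'trait': 'Disease', 'probe_ID': 'Entity', 'p_value': 'Score'}, inplace = True)
# Score is a p-value here, so smaller means stronger (the opposite of the other sources); hence the ascending sort
ewas.sort_values('Score', ascending = True, inplace = True)
# drop negative (invalid) p-values, exact duplicate rows, and rows with missing fields
ewas = ewas[ewas.Score >= 0].drop_duplicates().dropna()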

In [18]:
# .copy() makes each subset an independent frame before its Disease label is overwritten
ad_ewas = ewas[ewas.Disease.str.contains('Alzheimer')].copy()
ad_ewas['Disease'] = 'AD'

# EWAS Atlas trait names treated as breast cancer
breast_cancer_terms = ['breast cancer', 'breast cancer overall survival',
       'chemotherapy for breast cancer', 'breast cancerization',
       'non-invasive sporadic breast cancer', 'Breast Cancer',
       'invasive lobular breast cancer overall survival',
       'aerobic training for breast cancer', 'breast tumours']
brc_ewas = ewas[ewas.Disease.isin(breast_cancer_terms)].copy()
brc_ewas['Disease'] = 'BRCA'

psp_ewas = ewas[ewas.Disease.str.contains("progressive supranuclear palsy")].copy()
psp_ewas['Disease'] = 'PSP'

In [19]:
# concat ewas data
ewas_filtered = pd.concat([ad_ewas, brc_ewas, psp_ewas], axis =0)   
ewas_filtered['Entity'] = ewas_filtered['Entity'].str.strip()
ewas_filtered.to_csv('../data/val_data/EWAS_ATLAS_va_data.csv', index = False)
ewas_filtered.head()

| | Disease | Entity | Score |
|---|---|---|---|
| 780686 | AD | cg23971107 | 4.09e-97 |
| 780683 | AD | cg16100476 | 4.2e-97 |
| 780685 | AD | cg17258228 | 4.23e-97 |
| 761125 | AD | cg04399985 | 4.92e-97 |
| 780680 | AD | cg01437515 | 5.17e-97 |
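
Because the EWAS score is a p-value, its direction is reversed relative to the GeneCards, CTD, and HMDD scores. If a downstream comparison expects "higher is stronger", one option is a -log10 transform, sketched below. The helper `neg_log10_p`, the `floor` value, the `Strength` column, and the probe IDs are all hypothetical; the file saved above keeps the raw p-values.

```python
import numpy as np
import pandas as pd

def neg_log10_p(p_values, floor=1e-300):
    """Convert p-values to -log10(p), flooring at `floor` to avoid log(0)."""
    clipped = np.clip(p_values.astype(float), floor, 1.0)
    return -np.log10(clipped)

# Toy table with placeholder probe IDs; the last p-value is 0 to exercise the floor
demo = pd.DataFrame({
    'Entity': ['cg_a', 'cg_b', 'cg_c'],
    'Score': [1e-97, 1e-5, 0.0],
})
demo['Strength'] = neg_log10_p(demo['Score'])
print(demo)
```

The floor keeps a reported p-value of exactly 0 from turning into an infinite score.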
